Welcome to Rediscovering Text as Data!

Week 1: Introduction

Instructors: Christopher Hench & Claudia von Vacano
Course Location: Evans 458
Course Time: Monday 4-6pm
Instructor's Office: Barrows 350
Instructor's Office Hours: Thursday 10-11 AM
Instructors' Email: chench@berkeley.edu, cvacano@berkeley.edu
Course Repository: https://github.com/henchc/Rediscovering-Text-as-Data

Course Description: Humanists have traditionally emphasized the ‘close reading’ of a text, where value is placed on the nuances of specific passages. The increasing amount of digital text being published and archived affords us an opportunity to read text differently: as data, on a scale larger than ever before. This ‘distant reading’ approach (mediated through the computer) complements our ‘close reading’ by providing a broader, previously inaccessible context for interpretation. It also allows us to quantify and model language, such as words in novels or syllables in poetry, to uncover hidden patterns in a single text or body of texts. In this course, we will help you find and explore newly available texts of interest to you and guide your understanding of textual phenomena obtained through computational methods, enriching your reading of an individual text.

As a connector course to Data 8 (Foundations of Data Science), this class will give students experience in the Python programming language. Students must be concurrently enrolled in the main course (Data 8) or have already completed it.

Who are you?



In [ ]:

    
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter


terms = pd.read_csv('data/terms.csv')
counts = Counter(terms['Terms in Attendance'].replace("—", "0")).most_common()
ax = sns.barplot(x=[x[0] for x in counts], y=[x[1] for x in counts], order=[x[0] for x in counts])
ax = ax.set(xlabel='# of Terms', ylabel='Count', title='# of Terms')



In [ ]:

    
majors = pd.read_csv('data/majors.csv')
counts = Counter(majors['Majors']).most_common()
ax = sns.barplot(x=[x[0] for x in counts], y=[x[1] for x in counts], order=[x[0] for x in counts])
ax.set_xticklabels([x[0] for x in counts], rotation=90)
ax = ax.set(xlabel='Majors', ylabel='Count', title='Major')

Let's get to know each other!



In [ ]:

    
import random

def assign_groups(rseed):
    names = list(pd.read_csv('data/names.csv')['Name'])

    random.seed(rseed)
    random.shuffle(names)

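    def make_groups(lst, min_size):
        # as many groups as possible with at least min_size members each
        num_groups = len(lst) // min_size
        # striding deals names out round-robin, so leftovers are spread across groups
        return [lst[i::num_groups] for i in range(num_groups)]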

    for i, g in enumerate(make_groups(names, 2)):
        print(i+1, g)



In [ ]:

    
assign_groups(1)

Introduce yourself to your partner:

What's your name?
What do you study/want to study?
Why are you interested in this course?

Questions to answer individually

What is data?
What data have you worked with?
How might we understand text as data?

Google Form: https://goo.gl/forms/5g9WzjCy6RJB1zTA3

Discuss your answers



In [ ]:

    
assign_groups(2)

How did you respond?



In [ ]:

    
gdoc_key = "1QvPYlufedKe9Myb4wdduxuvjkp2U4oA2FlTUO_4vE5A"
spreadsheet_url = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv'.format(gdoc_key)
table = pd.read_csv(spreadsheet_url)
table



In [ ]:

    
import nltk
from nltk.corpus import stopwords
from string import punctuation
nltk.download('punkt')
nltk.download('stopwords')  # the stop word list is a separate download

def visualize_responses(column, preprocess=False):
    # join every response in the column into a single string
    all_responses = table[column].str.cat(sep=' ')
    if preprocess:
        # lowercase, tokenize, then drop punctuation and English stop words
        toks = nltk.word_tokenize(all_responses.lower())
        ignore = set(punctuation) | set(stopwords.words('english'))
        all_responses = ' '.join([t for t in toks if t not in ignore])
    # plot the ten most frequent words
    freq = Counter(all_responses.split()).most_common(10)
    ax = sns.barplot(x=[x[0] for x in freq], y=[x[1] for x in freq])
    ax.set_xticklabels([x[0] for x in freq], rotation=45)
    ax = ax.set(title=column)

What is data?



In [ ]:

    
visualize_responses('What is data?')

What data have you worked with?



In [ ]:

    
visualize_responses('What data have you worked with?')

How might we understand text as data?



In [ ]:

    
visualize_responses('How might we understand text as data?')

Not super interesting ☹. Don't worry! We'll learn some common techniques to get through the noise. For now, I've done some magic preprocessing for you!



In [ ]:

    
visualize_responses('What is data?', preprocess=True)



In [ ]:

    
visualize_responses('What data have you worked with?', preprocess=True)



In [ ]:

    
visualize_responses('How might we understand text as data?', preprocess=True)

This method of understanding text by just counting words is often derogatorily referred to as the Bag of Words approach. While it is not the most sophisticated model of text, it is still astonishingly powerful.
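
As a minimal sketch of the idea, the snippet below turns two made-up sentences into bags of words using only `collections.Counter`. The sentences and the `bag_of_words` helper are illustrative assumptions, not course data or course code. Because only counts are kept, word order is lost, so the two sentences become indistinguishable.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Two toy sentences (made up for illustration): same words, different order.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a.most_common(1))  # [('the', 2)]
print(a == b)            # True
```

The preprocessing above (removing punctuation and stop words) is what keeps words like "the" from dominating such counts.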

How humanists are using text as data

A great blog post by Ted Underwood here!

Visualizing text
Features and vocabulary
Network analysis (characters and books!)
Modeling form and genre with supervised classification
Unsupervised topic modeling and clustering

Course Topics

Theoretical Lenses

Close Reading
Formalism
Gender Theory
Social Network Theory

Applied Methods

Bag of Words
- Preprocessing
- n-grams
Content Extraction
- Named Entity Recognition
- Network Analysis
Machine Learning
- Classification
- Clustering
- Topic Modeling
- Word Embeddings
- Generating Text

Course Objectives

Think critically about operationalizing
Solidify and expand programming and inference skills acquired in Data 8

Final Project

Final Project: The course is built around the final project, which replaces the final exam. The project is a 4-5 page (double-spaced) paper that makes an argument about one or more texts using evidence from both inferential statistics and close reading. This paper must examine an interpretive problem and may be written on any text or texts you choose, literary or otherwise. While the corpus does not have to be literary in nature, please incorporate into your analysis the critical foundation we build in class.

In preparation for the final paper, students will be required to fulfill several milestones. During Week 8, students will meet with an instructor outside of class to consult on texts, interpretive problems, and statistical methods of interest. In Week 9, students will submit a one-paragraph (~250-word) proposal for their final project covering these three elements. We will meet again during Week 11 to discuss progress and obstacles in the project, as well as any findings. In Week 12, students will submit one page describing their methods and statistical findings, including one visualization.

In keeping with the best practices of the field, students will be required to make available their data set (pending copyright) and code through GitHub. Preliminary code will be posted during Week 10 and final code – capable of reproducing your findings – before our last class. Please send me the link to your materials before this class so I can create an image and we can all run your code together!

During our final class, students will deliver a 3-5 minute elevator pitch describing the challenge being explored and any decisions made or roadblocks faced while applying statistical methods in literature. This will act as a kind of rough draft for the paper, as well as offer an opportunity for feedback from your peers. The final draft of the paper is due on December 11.

Collaboration

We want to encourage collaboration on all projects and assignments for this course. Collaboration is unfortunately a rare occurrence in the humanities, but digital methods have forced humanists to rethink its utility.

To that end, we will also be making extensive use of GitHub. We will have short tutorials as the course progresses, but we encourage you to walk through some tutorials (e.g. here) yourselves.

Syllabus

The syllabus and schedule are in the README file of the GitHub repo.

Personal Survey

Tell us about yourself!

Name
Major/Interest
Background
Concerns about the class
What you'd like to get out of this class

Google Form: https://goo.gl/forms/Ju7ZZUGXi1rXMhK23

Homework for our next meeting

We're going to start off with a piece by Franco Moretti, widely acknowledged as one of the founders of Digital Humanities for literary analysis. At the time, Moretti was trying to justify computational techniques to his colleagues, and he established the very simple concept of "Character Space". How much can we learn about a text by just counting words?
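
To make the question concrete, here is a rough sketch that assumes, for illustration only, that a character's space can be approximated by their share of all words spoken. The `lines` data and the `character_space` helper are hypothetical stand-ins, not taken from the reading or from next week's notebook.

```python
from collections import Counter

# Hypothetical (speaker, speech) pairs; a stand-in for a parsed play script.
lines = [
    ("A", "one two three four"),
    ("B", "one two"),
    ("A", "one two"),
]

def character_space(dialogue):
    """Return each speaker's share of all words spoken."""
    words = Counter()
    for speaker, speech in dialogue:
        words[speaker] += len(speech.split())
    total = sum(words.values())
    return {speaker: n / total for speaker, n in words.items()}

shares = character_space(lines)
print(shares)  # {'A': 0.75, 'B': 0.25}
```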

First read an excerpt of Sophocles' Antigone, then read Moretti's critique.

We'll then walk through the notebook together next meeting.