Computing In Context

Social Sciences Track

Lecture 4--topics, trends, and dimensional scaling

Matthew L. Jones


In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

In [1]:
from IPython.display import Image
Image("http://journalofdigitalhumanities.org/wp-content/uploads/2013/02/blei_lda_illustration.png")


Out[1]:


In [4]:
import textmining_blackboxes as tm

IMPORTANT: tm is our temporary helper, not a standard Python package!

Download it from my GitHub: https://github.com/matthewljones/computingincontext


In [3]:
#see if package imported correctly
tm.icantbelieve("butter")


I can't believe it's not butter

Let's keep using the remarkable narratives available from Documenting the American South (http://docsouth.unc.edu/docsouthdata/)

This assumes you are storing your data in a directory in the same place as your IPython notebook.

Put the slave narrative texts within a data directory in the same place as this notebook.
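
If you want to check that the files are where the notebook expects them, a quick sanity check along these lines will do (the path below simply matches the layout assumed above):


In [ ]:
import os
# sanity check: does the expected data directory exist, and how many files does it hold?
path = 'data/na-slave-narratives/data/texts'
print(os.path.isdir(path), len(os.listdir(path)) if os.path.isdir(path) else 0)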


In [ ]:
title_info["Date"].str.replace("[^0-9]", "") #use regular expressions to clean up

In [ ]:
title_info["Date"]=title_info["Date"].str.replace("\-\?", "5")
title_info["Date"]=title_info["Date"].str.replace("[^0-9]", "") # what assumptions have I made about the data?

In [ ]:
title_info["Date"]=pd.to_datetime(title_info["Date"], coerce=True)

In [ ]:
title_info["Date"]<pd.datetime(1800,1,1)

In [ ]:
title_info[title_info["Date"]<pd.datetime(1800,1,1)]

back to boolean indexing!


In [ ]:
#Let's use a brittle helper for reading in a directory of plain txt files.
our_texts=tm.readtextfiles('data/na-slave-narratives/data/texts')
#again, this is not a standard python package
#returns a simple list of the documents as very long strings

#note: the rest of this notebook will work on any directory of text files

For now, we'll play with the cool scientists and use the powerful and fast scikit-learn package.

Let's look at Victorian novels for a little while


In [61]:
our_texts, names=tm.readtextfiles("data/british-fiction-corpus")

In [62]:
names


Out[62]:
['ABronte_Agnes.txt',
 'ABronte_Tenant.txt',
 'Austen_Emma.txt',
 'Austen_Pride.txt',
 'Austen_Sense.txt',
 'CBronte_Jane.txt',
 'CBronte_Professor.txt',
 'CBronte_Villette.txt',
 'Dickens_Bleak.txt',
 'Dickens_David.txt',
 'Dickens_Hard.txt']

Our zeroth tool: cleaning up the text

I've included a little utility function in tm that takes a list of strings and cleans it up a bit

check out the code on your own time later
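
For a rough idea of the kind of thing such a cleanup function might do--this is a sketch, not the actual code in tm--think along these lines:


In [ ]:
import re

def rough_cleanse(texts):
    """A sketch of a minimal cleanup--NOT the actual tm.data_cleanse."""
    cleaned = []
    for doc in texts:
        doc = doc.replace('\\n', ' ').replace('\\t', ' ')   # drop escaped characters
        doc = re.sub(r'[^a-zA-Z\s]', ' ', doc)              # keep only letters and whitespace
        doc = re.sub(r'\s+', ' ', doc)                      # collapse runs of whitespace
        cleaned.append(doc.strip())
    return cleaned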


In [63]:
our_texts=tm.data_cleanse(our_texts)

#more necessary when the text is messy
#eliminates escaped characters

Back to the vectorizer from scikit-learn


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True) # min_df=0.5: keep only terms that appear in at least half the documents

In [9]:
document_term_matrix=vectorizer.fit_transform(our_texts)

In [10]:
# now let's get our vocabulary--the terms corresponding to the columns of the matrix
vocab=vectorizer.get_feature_names()

In [13]:
len(vocab)


Out[13]:
7102

In [14]:
document_term_matrix.shape


Out[14]:
(11, 7102)

In [15]:
document_term_matrix_dense=document_term_matrix.toarray()

In [16]:
dtmdf=pd.DataFrame(document_term_matrix_dense, columns=vocab)

In [17]:
dtmdf


Out[17]:
18 abandoned abashed abhorred abhorrence abide abilities ability able abode ... youd youll young younger youngest youre youth youthful youve zeal
0 0.000000 0.000000 0.001862 0.000000 0.001700 0.000000 0.003115 0.003400 0.025401 0.011833 ... 0.004673 0.010011 0.081041 0.015724 0.001558 0.017133 0.003629 0.000000 0.011900 0.000000
1 0.000789 0.005939 0.000000 0.002161 0.004322 0.001980 0.000000 0.000000 0.017937 0.004457 ... 0.010559 0.021208 0.047150 0.004612 0.000000 0.027717 0.006662 0.001818 0.015846 0.000720
2 0.000000 0.000000 0.000000 0.000570 0.000000 0.000522 0.001567 0.000000 0.029214 0.000441 ... 0.000000 0.000000 0.077904 0.000811 0.002090 0.000000 0.004463 0.000959 0.000000 0.002281
3 0.000910 0.000000 0.000000 0.000000 0.004986 0.000761 0.004568 0.000000 0.031930 0.005142 ... 0.000000 0.000699 0.076278 0.017739 0.009898 0.000000 0.005322 0.000000 0.000000 0.000000
4 0.000989 0.000827 0.000000 0.001805 0.003611 0.000000 0.007444 0.002708 0.029546 0.003491 ... 0.000000 0.000000 0.066158 0.003854 0.003308 0.000000 0.004496 0.001519 0.000000 0.001805
5 0.000000 0.004706 0.000000 0.003210 0.000000 0.001176 0.000588 0.000000 0.010506 0.003972 ... 0.002353 0.004321 0.037001 0.006395 0.000588 0.002941 0.007766 0.001620 0.005778 0.002568
6 0.003283 0.001373 0.001641 0.005996 0.000000 0.001373 0.000000 0.001499 0.008532 0.006956 ... 0.004120 0.016393 0.061858 0.002133 0.000000 0.016480 0.012798 0.006305 0.011992 0.000000
7 0.000724 0.001212 0.001448 0.000661 0.001323 0.000606 0.000000 0.001323 0.007529 0.002558 ... 0.000606 0.001669 0.053176 0.000941 0.003636 0.000606 0.012706 0.002782 0.000000 0.001323
8 0.000000 0.002917 0.001341 0.000000 0.000245 0.000000 0.000898 0.000980 0.012721 0.000758 ... 0.003366 0.013598 0.074407 0.003137 0.001571 0.021317 0.005750 0.003296 0.004163 0.000000
9 0.000267 0.002905 0.001869 0.000000 0.000488 0.000223 0.002235 0.001219 0.007115 0.001320 ... 0.004916 0.015798 0.049804 0.003124 0.001788 0.012290 0.005380 0.004309 0.004146 0.000244
10 0.000000 0.002337 0.001862 0.000850 0.000000 0.000779 0.000779 0.000850 0.006048 0.000000 ... 0.008567 0.025029 0.078025 0.003024 0.000000 0.018692 0.004839 0.002145 0.002550 0.000850

11 rows × 7102 columns

While this data frame is lovely to look at and useful to think with, it's tough on your computer's memory
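
You don't need the full dense frame to poke around, though. Here is an illustrative way to pull out the ten highest-weighted terms for a single novel, densifying only one row of the sparse matrix:


In [ ]:
import numpy as np
# the ten highest tf-idf terms for one novel, using only one dense row
doc_index = names.index('Austen_Emma.txt')
row = document_term_matrix[doc_index].toarray().ravel()
for i in np.argsort(row)[::-1][:10]:
    print(vocab[i], round(row[i], 4))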


In [11]:
#easy to program, but let's use a robust version from sklearn!
from sklearn.metrics.pairwise import cosine_similarity

In [12]:
similarity=cosine_similarity(document_term_matrix)

#Note here that `cosine_similarity` can take
#an entire matrix as its argument

In [13]:
similarity_df=pd.DataFrame(similarity, index=names, columns=names)
similarity_df


Out[13]:
ABronte_Agnes.txt ABronte_Tenant.txt Austen_Emma.txt Austen_Pride.txt Austen_Sense.txt CBronte_Jane.txt CBronte_Professor.txt CBronte_Villette.txt Dickens_Bleak.txt Dickens_David.txt Dickens_Hard.txt
ABronte_Agnes.txt 1.000000 0.873077 0.766374 0.771317 0.750176 0.829614 0.783084 0.820091 0.756383 0.782174 0.736513
ABronte_Tenant.txt 0.873077 1.000000 0.761187 0.786333 0.777003 0.866513 0.821557 0.853758 0.785609 0.844785 0.810371
Austen_Emma.txt 0.766374 0.761187 1.000000 0.914527 0.801833 0.779011 0.642204 0.667375 0.814210 0.803813 0.766277
Austen_Pride.txt 0.771317 0.786333 0.914527 1.000000 0.828285 0.789536 0.660597 0.662716 0.806270 0.805168 0.767164
Austen_Sense.txt 0.750176 0.777003 0.801833 0.828285 1.000000 0.739302 0.671603 0.713728 0.666095 0.704389 0.698309
CBronte_Jane.txt 0.829614 0.866513 0.779011 0.789536 0.739302 1.000000 0.862033 0.884444 0.794386 0.823570 0.792873
CBronte_Professor.txt 0.783084 0.821557 0.642204 0.660597 0.671603 0.862033 1.000000 0.910828 0.680350 0.728871 0.692953
CBronte_Villette.txt 0.820091 0.853758 0.667375 0.662716 0.713728 0.884444 0.910828 1.000000 0.684109 0.750113 0.712343
Dickens_Bleak.txt 0.756383 0.785609 0.814210 0.806270 0.666095 0.794386 0.680350 0.684109 1.000000 0.905987 0.878677
Dickens_David.txt 0.782174 0.844785 0.803813 0.805168 0.704389 0.823570 0.728871 0.750113 0.905987 1.000000 0.901665
Dickens_Hard.txt 0.736513 0.810371 0.766277 0.767164 0.698309 0.792873 0.692953 0.712343 0.878677 0.901665 1.000000

That is a symmetric matrix relating each text (row) to every other text (column)
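
As a sanity check (purely illustrative), you can recompute any one entry of that matrix by hand with the usual cosine formula: the dot product of the two vectors divided by the product of their lengths.


In [ ]:
import numpy as np
# recompute the similarity between the first two novels by hand
a = document_term_matrix[0].toarray().ravel()
b = document_term_matrix[1].toarray().ravel()
by_hand = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(by_hand, similarity[0, 1])  # the two values should agree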


In [28]:
similarity_df.iloc[1].sort_values(ascending=False)


Out[28]:
ABronte_Tenant.txt       1.000000
ABronte_Agnes.txt        0.873077
CBronte_Jane.txt         0.866513
CBronte_Villette.txt     0.853758
Dickens_David.txt        0.844785
CBronte_Professor.txt    0.821557
Dickens_Hard.txt         0.810371
Austen_Pride.txt         0.786333
Dickens_Bleak.txt        0.785609
Austen_Sense.txt         0.777003
Austen_Emma.txt          0.761187
Name: ABronte_Tenant.txt, dtype: float64

We can do lots of things with a similarity matrix

you've already seen hierarchical clustering

Multidimensional scaling

  • A technique for visualizing distances in high-dimensional spaces in ways we can cognize.
  • Keep the distances but reduce the dimensionality.

In [14]:
#here's the blackbox
from sklearn.manifold import MDS
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
positions= mds.fit_transform(1-similarity) # 1 - similarity turns similarities into distances

In [15]:
positions.shape


Out[15]:
(11, 2)

It's an 11 by 2 matrix

OR

simply an (x,y) coordinate pair for each of our texts


In [16]:
#let's plot it: I've set up a black box
tm.plot_mds(positions,names)



In [17]:
names=[name.replace(".txt", "") for name in names]

In [18]:
tm.plot_mds(positions,names)
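
If you're curious what a helper like plot_mds might be doing under the hood, a sketch along these lines (not the actual tm code) produces a labeled scatterplot with matplotlib:


In [ ]:
# a sketch of a labeled scatterplot--NOT the actual tm.plot_mds
def rough_plot_mds(positions, labels):
    plt.figure(figsize=(10, 8))
    plt.scatter(positions[:, 0], positions[:, 1])
    for (x, y), label in zip(positions, labels):
        plt.annotate(label, (x, y), xytext=(5, 5), textcoords='offset points')
    plt.show()

rough_plot_mds(positions, names)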


What has this got us?

It suggests that even this crude measure of similarity is able to capture something significant.

Note: the axes don't really mean anything

Interesting, but what does it mean?

topic modeling

an unsupervised algorithm for finding the major topics of texts

unlike hierarchical clustering, it assumes each text springs from a mixture of topics

the big thing in much text modeling, from the humanities to Facebook to the NSA

many variations

the fantastic Python package gensim

"corpora" = a collection of documents or texts

gensim likes its documents to be a list of lists of words, not a list of strings

Get the stoplist from the data directory in my GitHub.


In [3]:
our_texts, names=tm.readtextfiles("Data/PCCIPtext")

In [5]:
our_texts=tm.data_cleanse(our_texts)

In [6]:
#improved stoplist--may be too aggressive
with open('data/stoplist-multilingual') as f:
    stop=[word.strip('\n') for word in f.readlines()]
stop=set(stop) # membership tests against a set are much faster

In [7]:
# gensim requires a list of lists of the words in each document
texts = [[word for word in document.lower().split() if word not in stop] for document in our_texts]

In [8]:
from gensim import corpora, models, similarities, matutils
"""gensim includes its own vectorizing tools"""
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

#doc2bow just means `doc`uments to `b`ag `o`f `w`ords
#ok, this has just vectorized our texts; it's another form

Now we are going to call the topic modeling black box

the key parameter is how many distinct topics we want the computer to find

this will take a while


In [10]:
number_topics=40
model = models.LdaModel(corpus, id2word=dictionary, num_topics=number_topics, passes=10) # models.LdaMulticore can speed this up

In [11]:
model.show_topics()


Out[11]:
['0.011*security + 0.010*cyber + 0.009*national + 0.008*6633 + 0.008*sjud4 + 0.007*government + 0.006*critical + 0.006*09 + 0.006*infrastructure + 0.005*plan',
 '0.009*gas + 0.009*oil + 0.009*natural + 0.008*national + 0.007*infrastructure + 0.007*industry + 0.007*technology + 0.007*risk + 0.007*critical + 0.007*government',
 '0.013*nder + 0.007*program + 0.007*members + 0.007*fema + 0.006*data + 0.006*training + 0.006*office + 0.005*units + 0.005*enclosurei + 0.005*administration',
 '0.023*security + 0.018*critical + 0.014*infrastructure + 0.013*national + 0.011*sector + 0.009*government + 0.008*federal + 0.008*private + 0.008*protection + 0.007*agencies',
 '0.015*ati + 0.011*program + 0.011*army + 0.010*chemi + 0.010*cl + 0.009*wi + 0.009*dod + 0.008*gao + 0.008*defense + 0.008*ns',
 '0.022*air + 0.020*force + 0.013*warfare + 0.010*equipment + 0.009*report + 0.009*dod + 0.009*systems + 0.008*gao + 0.008*program + 0.007*test',
 '0.023*pki + 0.015*federal + 0.011*key + 0.009*00 + 0.008*certification + 0.008*government + 0.008*agencies + 0.007*security + 0.006*public + 0.006*certificates',
 '0.020*secretary + 0.017*policy + 0.015*department + 0.012*affairs + 0.010*issues + 0.010*security + 0.010*staff + 0.010*foreign + 0.009*international + 0.009*office',
 '0.016*market + 0.014*markets + 0.014*trading + 0.013*securities + 0.012*stock + 0.011*futures + 0.008*exchange + 0.008*clearing + 0.006*options + 0.006*exchanges',
 '0.023*rail + 0.017*transportation + 0.014*attacks + 0.013*rand + 0.011*freight + 0.011*passenger + 0.005*trains + 0.005*surface + 0.004*testimony + 0.004*targets']

In [42]:
# with formatted=False, each topic comes back as a list of (weight, word) pairs; keep just the words
topics_indexed=[[b for (a,b) in topics] for topics in model.show_topics(number_topics,10,formatted=False)]
topics_indexed=pd.DataFrame(topics_indexed)

In [43]:
topics_indexed


Out[43]:
0 1 2 3 4 5 6 7 8 9
0 time dont eyes thought aunt dear room good long better
1 ermine feeblest hindrances spar influx insinuation shimmer lessaie sophistical hewers
2 elinor marianne dashwood edward sister time mother jennings willoughby colonel
3 elinor marianne time mother thing dashwood jennings edward great willoughby
4 time good sir long house dont dear lady day thought
5 time sir good day dont house hand thought room great
6 sir dear time good eyes head night peggotty aunt looked
7 sir good dont day time great room head thought looked
8 good thought time dear sir day great long house dont
9 time good day thought sir dear hand room house great
10 emma elizabeth good jane time thing weston great dear harriet
11 madame eyes time hand dr thought long knew day looked
12 good time long thought room sir dont day night great
13 sunder robs unimpassioned 3rd huntingdon helen arthur dont hargrave better
14 good time dear sir day face dont lady young great
15 thought sir time good day looked long dont eyes head
16 jane room rochester heart thought john long good felt time
17 time sir good room looked thought long lady day hand
18 bounderby sir hand sparsit time maam dont face good louisa
19 time dont good face great day hand thought sir lady
20 bounderby gradgrind sparsit louisa sir tom dont father good time
21 emma jane elton knightley thing weston day harriet fairfax good
22 good long time hand room thought sir night day dear
23 elinor marianne dashwood jennings edward time mother brandon great thing
24 thought monsieur good english hunsden eye long day frances time
25 micawber aunt peggotty time dear good copperfield traddles dont hand
26 elinor marianne time good dashwood edward sir jennings sister thing
27 time better day love good room thought long helen hand
28 time good day sir dear thought room great lady young
29 time dont good great room thought day sir dear head
30 elinor marianne time mother good dashwood edward jennings sir dear
31 sir dear lady dont time good richard great hand jarndyce
32 thought long time good room face dont dear hand heart
33 prescience subservient ungraciously enquired volunteer nicest toothpick witnesses awaken casino
34 time great good room long sir lady house thought night
35 madame dr bretton lucy graham paul night beck ginevra eyes
36 time good thought long room dont day hand face eyes
37 good long time thought night day great looked room face
38 time dear thought long day good dont better hand room
39 time good dear sir dont great day room thought house

So which topics are most significant for each document?


In [45]:
model[dictionary.doc2bow(texts[1])] # the topic mixture for document 1: (topic, proportion) pairs


Out[45]:
[(36, 0.99823571784570908)]
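
To ask that question of every document at once, here is a sketch that collects each document's topic proportions into a data frame and then takes the most prominent topic per document:


In [ ]:
# build a document-by-topic matrix, then find each document's dominant topic
doc_topics = []
for bow in corpus:
    dist = dict(model[bow])  # sparse list of (topic, proportion) pairs
    doc_topics.append([dist.get(i, 0.0) for i in range(number_topics)])
doc_topic_df = pd.DataFrame(doc_topics, index=names)
doc_topic_df.idxmax(axis=1)  # the dominant topic for each document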
