Computing In Context

Social Sciences Track

Lecture 4--topics, trends, and dimensional scaling

Matthew L. Jones


In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

In [1]:
from IPython.display import Image
Image("http://journalofdigitalhumanities.org/wp-content/uploads/2013/02/blei_lda_illustration.png")


Out[1]:


In [4]:
import textmining_blackboxes as tm

IMPORTANT: tm is our temporary helper, not a standard Python package!

Download it from my GitHub: https://github.com/matthewljones/computingincontext


In [3]:
#see if package imported correctly
tm.icantbelieve("butter")


I can't believe it's not butter

Let's keep using the remarkable narratives available from Documenting the American South (http://docsouth.unc.edu/docsouthdata/)

This assumes you are storing your data in a directory in the same place as your IPython notebook.

Put the slave narrative texts within a data directory in the same place as this notebook.
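
If you want to check that the files are where the notebook expects them, a quick sanity check along these lines will do (the path below simply matches the layout assumed above):


In [ ]:
import os
# sanity check: does the expected data directory exist, and how many files does it hold?
path = 'data/na-slave-narratives/data/texts'
print(os.path.isdir(path), len(os.listdir(path)) if os.path.isdir(path) else 0)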


In [ ]:
title_info["Date"].str.replace("[^0-9]", "") #use regular expressions to clean up

In [ ]:
title_info["Date"]=title_info["Date"].str.replace("\-\?", "5")
title_info["Date"]=title_info["Date"].str.replace("[^0-9]", "") # what assumptions have I made about the data?

In [ ]:
title_info["Date"]=pd.to_datetime(title_info["Date"], coerce=True)

In [ ]:
title_info["Date"]<pd.datetime(1800,1,1)

In [ ]:
title_info[title_info["Date"]<pd.datetime(1800,1,1)]

back to boolean indexing!


In [ ]:
#Let's use a brittle helper for reading in a directory of plain txt files.
our_texts=tm.readtextfiles('data/na-slave-narratives/data/texts')
#again, this is not a standard python package
#returns a simple list of the documents as very long strings

#note: the rest of this notebook will work on any directory of text files

For now, we'll play with the cool scientists and use the powerful and fast scikit-learn package.

Let's look at Victorian novels for a little while


In [61]:
our_texts, names=tm.readtextfiles("data/british-fiction-corpus")

In [62]:
names


Out[62]:
['ABronte_Agnes.txt',
 'ABronte_Tenant.txt',
 'Austen_Emma.txt',
 'Austen_Pride.txt',
 'Austen_Sense.txt',
 'CBronte_Jane.txt',
 'CBronte_Professor.txt',
 'CBronte_Villette.txt',
 'Dickens_Bleak.txt',
 'Dickens_David.txt',
 'Dickens_Hard.txt']

Our zeroth tool: cleaning up the text

I've included a little utility function in tm that takes a list of strings and cleans it up a bit

check out the code on your own time later
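
For a rough idea of the kind of thing such a cleanup function might do--this is a sketch, not the actual code in tm--think along these lines:


In [ ]:
import re

def rough_cleanse(texts):
    """A sketch of a minimal cleanup--NOT the actual tm.data_cleanse."""
    cleaned = []
    for doc in texts:
        doc = doc.replace('\\n', ' ').replace('\\t', ' ')   # drop escaped characters
        doc = re.sub(r'[^a-zA-Z\s]', ' ', doc)              # keep only letters and whitespace
        doc = re.sub(r'\s+', ' ', doc)                      # collapse runs of whitespace
        cleaned.append(doc.strip())
    return cleaned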


In [63]:
our_texts=tm.data_cleanse(our_texts)

#more necessary when the text is messy
#eliminates escaped characters

Back to the vectorizer from scikit-learn


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True) # min_df=0.5: keep only terms that appear in at least half the documents

In [9]:
document_term_matrix=vectorizer.fit_transform(our_texts)

In [10]:
# now let's get our vocabulary--the terms corresponding to the columns of the matrix
vocab=vectorizer.get_feature_names()

In [13]:
len(vocab)


Out[13]:
7102

In [14]:
document_term_matrix.shape


Out[14]:
(11, 7102)

In [15]:
document_term_matrix_dense=document_term_matrix.toarray()

In [16]:
dtmdf=pd.DataFrame(document_term_matrix_dense, columns=vocab)

In [17]:
dtmdf


Out[17]:
18 abandoned abashed abhorred abhorrence abide abilities ability able abode ... youd youll young younger youngest youre youth youthful youve zeal
0 0.000000 0.000000 0.001862 0.000000 0.001700 0.000000 0.003115 0.003400 0.025401 0.011833 ... 0.004673 0.010011 0.081041 0.015724 0.001558 0.017133 0.003629 0.000000 0.011900 0.000000
1 0.000789 0.005939 0.000000 0.002161 0.004322 0.001980 0.000000 0.000000 0.017937 0.004457 ... 0.010559 0.021208 0.047150 0.004612 0.000000 0.027717 0.006662 0.001818 0.015846 0.000720
2 0.000000 0.000000 0.000000 0.000570 0.000000 0.000522 0.001567 0.000000 0.029214 0.000441 ... 0.000000 0.000000 0.077904 0.000811 0.002090 0.000000 0.004463 0.000959 0.000000 0.002281
3 0.000910 0.000000 0.000000 0.000000 0.004986 0.000761 0.004568 0.000000 0.031930 0.005142 ... 0.000000 0.000699 0.076278 0.017739 0.009898 0.000000 0.005322 0.000000 0.000000 0.000000
4 0.000989 0.000827 0.000000 0.001805 0.003611 0.000000 0.007444 0.002708 0.029546 0.003491 ... 0.000000 0.000000 0.066158 0.003854 0.003308 0.000000 0.004496 0.001519 0.000000 0.001805
5 0.000000 0.004706 0.000000 0.003210 0.000000 0.001176 0.000588 0.000000 0.010506 0.003972 ... 0.002353 0.004321 0.037001 0.006395 0.000588 0.002941 0.007766 0.001620 0.005778 0.002568
6 0.003283 0.001373 0.001641 0.005996 0.000000 0.001373 0.000000 0.001499 0.008532 0.006956 ... 0.004120 0.016393 0.061858 0.002133 0.000000 0.016480 0.012798 0.006305 0.011992 0.000000
7 0.000724 0.001212 0.001448 0.000661 0.001323 0.000606 0.000000 0.001323 0.007529 0.002558 ... 0.000606 0.001669 0.053176 0.000941 0.003636 0.000606 0.012706 0.002782 0.000000 0.001323
8 0.000000 0.002917 0.001341 0.000000 0.000245 0.000000 0.000898 0.000980 0.012721 0.000758 ... 0.003366 0.013598 0.074407 0.003137 0.001571 0.021317 0.005750 0.003296 0.004163 0.000000
9 0.000267 0.002905 0.001869 0.000000 0.000488 0.000223 0.002235 0.001219 0.007115 0.001320 ... 0.004916 0.015798 0.049804 0.003124 0.001788 0.012290 0.005380 0.004309 0.004146 0.000244
10 0.000000 0.002337 0.001862 0.000850 0.000000 0.000779 0.000779 0.000850 0.006048 0.000000 ... 0.008567 0.025029 0.078025 0.003024 0.000000 0.018692 0.004839 0.002145 0.002550 0.000850

11 rows × 7102 columns

While this data frame is lovely to look at and useful to think with, it's tough on your computer's memory
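
You don't need the full dense frame to poke around, though. Here is an illustrative way to pull out the ten highest-weighted terms for a single novel, densifying only one row of the sparse matrix:


In [ ]:
import numpy as np
# the ten highest tf-idf terms for one novel, using only one dense row
doc_index = names.index('Austen_Emma.txt')
row = document_term_matrix[doc_index].toarray().ravel()
for i in np.argsort(row)[::-1][:10]:
    print(vocab[i], round(row[i], 4))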


In [11]:
#easy to program, but let's use a robust version from sklearn!
from sklearn.metrics.pairwise import cosine_similarity

In [12]:
similarity=cosine_similarity(document_term_matrix)

#Note here that `cosine_similarity` can take
#an entire matrix as its argument

In [13]:
similarity_df=pd.DataFrame(similarity, index=names, columns=names)
similarity_df


Out[13]:
ABronte_Agnes.txt ABronte_Tenant.txt Austen_Emma.txt Austen_Pride.txt Austen_Sense.txt CBronte_Jane.txt CBronte_Professor.txt CBronte_Villette.txt Dickens_Bleak.txt Dickens_David.txt Dickens_Hard.txt
ABronte_Agnes.txt 1.000000 0.873077 0.766374 0.771317 0.750176 0.829614 0.783084 0.820091 0.756383 0.782174 0.736513
ABronte_Tenant.txt 0.873077 1.000000 0.761187 0.786333 0.777003 0.866513 0.821557 0.853758 0.785609 0.844785 0.810371
Austen_Emma.txt 0.766374 0.761187 1.000000 0.914527 0.801833 0.779011 0.642204 0.667375 0.814210 0.803813 0.766277
Austen_Pride.txt 0.771317 0.786333 0.914527 1.000000 0.828285 0.789536 0.660597 0.662716 0.806270 0.805168 0.767164
Austen_Sense.txt 0.750176 0.777003 0.801833 0.828285 1.000000 0.739302 0.671603 0.713728 0.666095 0.704389 0.698309
CBronte_Jane.txt 0.829614 0.866513 0.779011 0.789536 0.739302 1.000000 0.862033 0.884444 0.794386 0.823570 0.792873
CBronte_Professor.txt 0.783084 0.821557 0.642204 0.660597 0.671603 0.862033 1.000000 0.910828 0.680350 0.728871 0.692953
CBronte_Villette.txt 0.820091 0.853758 0.667375 0.662716 0.713728 0.884444 0.910828 1.000000 0.684109 0.750113 0.712343
Dickens_Bleak.txt 0.756383 0.785609 0.814210 0.806270 0.666095 0.794386 0.680350 0.684109 1.000000 0.905987 0.878677
Dickens_David.txt 0.782174 0.844785 0.803813 0.805168 0.704389 0.823570 0.728871 0.750113 0.905987 1.000000 0.901665
Dickens_Hard.txt 0.736513 0.810371 0.766277 0.767164 0.698309 0.792873 0.692953 0.712343 0.878677 0.901665 1.000000

That is a symmetric matrix relating each text (row) to every other text (column)
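
As a sanity check (purely illustrative), you can recompute any one entry of that matrix by hand with the usual cosine formula: the dot product of the two vectors divided by the product of their lengths.


In [ ]:
import numpy as np
# recompute the similarity between the first two novels by hand
a = document_term_matrix[0].toarray().ravel()
b = document_term_matrix[1].toarray().ravel()
by_hand = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(by_hand, similarity[0, 1])  # the two values should agree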


In [28]:
similarity_df.iloc[1].sort_values(ascending=False)


Out[28]:
ABronte_Tenant.txt       1.000000
ABronte_Agnes.txt        0.873077
CBronte_Jane.txt         0.866513
CBronte_Villette.txt     0.853758
Dickens_David.txt        0.844785
CBronte_Professor.txt    0.821557
Dickens_Hard.txt         0.810371
Austen_Pride.txt         0.786333
Dickens_Bleak.txt        0.785609
Austen_Sense.txt         0.777003
Austen_Emma.txt          0.761187
Name: ABronte_Tenant.txt, dtype: float64

We can do lots of things with a similarity matrix

you've already seen hierarchical clustering

Multidimensional scaling

  • A technique for visualizing distances in high-dimensional spaces in ways we can cognize.
  • Keep the distances but reduce the dimensionality.

In [14]:
#here's the blackbox
from sklearn.manifold import MDS
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
positions= mds.fit_transform(1-similarity) # 1 - similarity turns similarities into distances

In [15]:
positions.shape


Out[15]:
(11, 2)

It's an 11 by 2 matrix

OR

simply an (x,y) coordinate pair for each of our texts


In [16]:
#let's plot it: I've set up a black box
tm.plot_mds(positions,names)



In [17]:
names=[name.replace(".txt", "") for name in names]

In [18]:
tm.plot_mds(positions,names)
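
If you're curious what a helper like plot_mds might be doing under the hood, a sketch along these lines (not the actual tm code) produces a labeled scatterplot with matplotlib:


In [ ]:
# a sketch of a labeled scatterplot--NOT the actual tm.plot_mds
def rough_plot_mds(positions, labels):
    plt.figure(figsize=(10, 8))
    plt.scatter(positions[:, 0], positions[:, 1])
    for (x, y), label in zip(positions, labels):
        plt.annotate(label, (x, y), xytext=(5, 5), textcoords='offset points')
    plt.show()

rough_plot_mds(positions, names)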


What has this got us?

It suggests that even this crude measure of similarity is able to capture something significant.

Note: the axes don't really mean anything

Interesting, but what does it mean?

topic modeling

an unsupervised algorithm for finding the major topics of texts

unlike hierarchical clustering, it assumes each text springs from a mixture of topics

the big thing in much text modeling, from the humanities to Facebook to the NSA

many variations

the fantastic Python package gensim

"corpora" = a collection of documents or texts

gensim likes its documents to be a list of lists of words, not a list of strings

Get the stoplist from the data directory in my GitHub.


In [3]:
our_texts, names=tm.readtextfiles("Data/PCCIPtext")

In [5]:
our_texts=tm.data_cleanse(our_texts)

In [6]:
#improved stoplist--may be too aggressive
with open('data/stoplist-multilingual') as f:
    stop=[word.strip('\n') for word in f.readlines()]
stop=set(stop) # membership tests against a set are much faster

In [7]:
# gensim requires a list of lists of the words in each document
texts = [[word for word in document.lower().split() if word not in stop] for document in our_texts]

In [8]:
from gensim import corpora, models, similarities, matutils
"""gensim includes its own vectorizing tools"""
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

#doc2bow just means `doc`uments to `b`ag `o`f `w`ords
#ok, this has just vectorized our texts; it's another form

Now we are going to call the topic modeling black box

the key parameter is how many distinct topics we want the computer to find

this will take a while


In [10]:
number_topics=40
model = models.LdaModel(corpus, id2word=dictionary, num_topics=number_topics, passes=10) # models.LdaMulticore can speed this up

In [11]:
model.show_topics()


Out[11]:
['0.011*security + 0.010*cyber + 0.009*national + 0.008*6633 + 0.008*sjud4 + 0.007*government + 0.006*critical + 0.006*09 + 0.006*infrastructure + 0.005*plan',
 '0.009*gas + 0.009*oil + 0.009*natural + 0.008*national + 0.007*infrastructure + 0.007*industry + 0.007*technology + 0.007*risk + 0.007*critical + 0.007*government',
 '0.013*nder + 0.007*program + 0.007*members + 0.007*fema + 0.006*data + 0.006*training + 0.006*office + 0.005*units + 0.005*enclosurei + 0.005*administration',
 '0.023*security + 0.018*critical + 0.014*infrastructure + 0.013*national + 0.011*sector + 0.009*government + 0.008*federal + 0.008*private + 0.008*protection + 0.007*agencies',
 '0.015*ati + 0.011*program + 0.011*army + 0.010*chemi + 0.010*cl + 0.009*wi + 0.009*dod + 0.008*gao + 0.008*defense + 0.008*ns',
 '0.022*air + 0.020*force + 0.013*warfare + 0.010*equipment + 0.009*report + 0.009*dod + 0.009*systems + 0.008*gao + 0.008*program + 0.007*test',
 '0.023*pki + 0.015*federal + 0.011*key + 0.009*00 + 0.008*certification + 0.008*government + 0.008*agencies + 0.007*security + 0.006*public + 0.006*certificates',
 '0.020*secretary + 0.017*policy + 0.015*department + 0.012*affairs + 0.010*issues + 0.010*security + 0.010*staff + 0.010*foreign + 0.009*international + 0.009*office',
 '0.016*market + 0.014*markets + 0.014*trading + 0.013*securities + 0.012*stock + 0.011*futures + 0.008*exchange + 0.008*clearing + 0.006*options + 0.006*exchanges',
 '0.023*rail + 0.017*transportation + 0.014*attacks + 0.013*rand + 0.011*freight + 0.011*passenger + 0.005*trains + 0.005*surface + 0.004*testimony + 0.004*targets']

In [42]:
# with formatted=False, each topic comes back as a list of (weight, word) pairs; keep just the words
topics_indexed=[[b for (a,b) in topics] for topics in model.show_topics(number_topics,10,formatted=False)]
topics_indexed=pd.DataFrame(topics_indexed)

In [43]:
topics_indexed


Out[43]:
0 1 2 3 4 5 6 7 8 9
0 time dont eyes thought aunt dear room good long better
1 ermine feeblest hindrances spar influx insinuation shimmer lessaie sophistical hewers
2 elinor marianne dashwood edward sister time mother jennings willoughby colonel
3 elinor marianne time mother thing dashwood jennings edward great willoughby
4 time good sir long house dont dear lady day thought
5 time sir good day dont house hand thought room great
6 sir dear time good eyes head night peggotty aunt looked
7 sir good dont day time great room head thought looked
8 good thought time dear sir day great long house dont
9 time good day thought sir dear hand room house great
10 emma elizabeth good jane time thing weston great dear harriet
11 madame eyes time hand dr thought long knew day looked
12 good time long thought room sir dont day night great
13 sunder robs unimpassioned 3rd huntingdon helen arthur dont hargrave better
14 good time dear sir day face dont lady young great
15 thought sir time good day looked long dont eyes head
16 jane room rochester heart thought john long good felt time
17 time sir good room looked thought long lady day hand
18 bounderby sir hand sparsit time maam dont face good louisa
19 time dont good face great day hand thought sir lady
20 bounderby gradgrind sparsit louisa sir tom dont father good time
21 emma jane elton knightley thing weston day harriet fairfax good
22 good long time hand room thought sir night day dear
23 elinor marianne dashwood jennings edward time mother brandon great thing
24 thought monsieur good english hunsden eye long day frances time
25 micawber aunt peggotty time dear good copperfield traddles dont hand
26 elinor marianne time good dashwood edward sir jennings sister thing
27 time better day love good room thought long helen hand
28 time good day sir dear thought room great lady young
29 time dont good great room thought day sir dear head
30 elinor marianne time mother good dashwood edward jennings sir dear
31 sir dear lady dont time good richard great hand jarndyce
32 thought long time good room face dont dear hand heart
33 prescience subservient ungraciously enquired volunteer nicest toothpick witnesses awaken casino
34 time great good room long sir lady house thought night
35 madame dr bretton lucy graham paul night beck ginevra eyes
36 time good thought long room dont day hand face eyes
37 good long time thought night day great looked room face
38 time dear thought long day good dont better hand room
39 time good dear sir dont great day room thought house

So which topics are most significant for each document?


In [45]:
model[dictionary.doc2bow(texts[1])] # the topic mixture for document 1: (topic, proportion) pairs


Out[45]:
[(36, 0.99823571784570908)]
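
To ask that question of every document at once, here is a sketch that collects each document's topic proportions into a data frame and then takes the most prominent topic per document:


In [ ]:
# build a document-by-topic matrix, then find each document's dominant topic
doc_topics = []
for bow in corpus:
    dist = dict(model[bow])  # sparse list of (topic, proportion) pairs
    doc_topics.append([dist.get(i, 0.0) for i in range(number_topics)])
doc_topic_df = pd.DataFrame(doc_topics, index=names)
doc_topic_df.idxmax(axis=1)  # the dominant topic for each document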
