Computing in Context

lecture two

OK, so we didn't get into the code this week...

History at Scale: what would you do if you could read at scale?

examples of not exclusively numerical data sources

http://dohistory.org/diary/index.html

http://dlxs.richmond.edu/d/ddr/index.html

http://www.oldbaileyonline.org/

http://www.masshist.org/digitaladams/archive/index

https://archivesholdings.worldbank.org/

http://docsouth.unc.edu/index.html

http://about.jstor.org/service/data-for-research

Choose to make tractable--Choose to lose to learn something couldn't otherwise!

The EASIER Way

Awesome tools to automate process

http://voyant-tools.org/

http://papermachines.org/

The HARDER Way

Why, oh why?

flexibility
- overcome limits of affordances
can't depend on lots of tools
understand epistemic trade offs
gateway to vast array of work in data sciences more generally

computers not so smart

computer not so good with non numerical things

Vectoring text, or: the world is too hard to understand easily



In [1]:

    
%matplotlib inline
import pandas as pd



In [2]:

    
def document_vector(wordstring): 
    """put yer documentation here friend"""
    wordlist = wordstring.split()
    set_of_words=set(wordlist)
    distinct_words=list(set_of_words)
    wordfreq = [wordlist.count(w) for w in distinct_words]
    return distinct_words, wordfreq



In [3]:

    
x,y = document_vector("I like to eat green apples but only if I eat them like green grapes like my green friend gripes")



In [4]:

    
x









    Out[4]:





['only',
 'them',
 'green',
 'my',
 'but',
 'eat',
 'I',
 'grapes',
 'to',
 'apples',
 'if',
 'gripes',
 'like',
 'friend']



In [5]:

    
y









    Out[5]:





[1, 1, 3, 1, 1, 2, 2, 1, 1, 1, 1, 1, 3, 1]



In [6]:

    
documents=pd.DataFrame(y, index=x)



In [7]:

    
documents

cool beans--we've vectorized our sentence

what has been lost?



In [8]:

    
documents[0].order()









    Out[8]:





only      1
them      1
my        1
but       1
grapes    1
to        1
apples    1
if        1
gripes    1
friend    1
eat       2
I         2
green     3
like      3
Name: 0, dtype: int64



In [9]:

    
documents[0].describe()









    Out[9]:





count    14.000000
mean      1.428571
std       0.755929
min       1.000000
25%       1.000000
50%       1.000000
75%       1.750000
max       3.000000
Name: 0, dtype: float64

Once we convert texts into vectors, computer doesn't care or know that we're dealing with something formerly known as texts. It's just another kind of vector . . .

Term frequency [TF]

The "term frequency" measures how often a given term occurs in a given document.

We count up how many times each term appears in a document, then divide it by the number of terms in the document.

For a word $t$ that appears $i_w$ times in a document $D$ with a number of words $n_D$, the term frequency is

$tf(t,D)=\frac{i_w}{n_D}$

Can you think of a problem with this as a measure?



In [10]:

    
def document_vector_freq(wordstring): 
    """put yer documentation here friend"""
    wordlist = wordstring.split()
    number_of_words=len(wordlist)
    set_of_words=set(wordlist)
    distinct_words=list(set_of_words)
    wordfreq = [wordlist.count(w) for w in distinct_words]
    wordfreq = [word_freq/number_of_words for word_freq in wordfreq]
    return distinct_words, wordfreq

Inverse document frequency

Often we'll divide the term frequency using some measure of how unusual each word is across all the documents in question.

If we were reading general political news stories for the last few years, "Obama" appears a lot in each document and in lots of documents. "Butte" appears, say, a lot in one document but not at all in the rest.

We want something that will help us to see that "Butte" is really significant for capturing something distinctive about that document, whereas "Obama" wouldn't be.

So we compute the inverse document frequency. [IDF]

You divide the total number of documents ($N$) by one plus the number of documents containing each word t ($n_w$)

$\frac{N}{1+n_w}$

Think about what is this does: for a word that appears in every document will thus be

$\frac{N}{1+n_w}=\frac{N}{1+N}\approx 1$

Where a word appears that in only one document will have a much bigger scaling factor:

$\frac{N}{1+n_w}=\frac{N}{2}$.

Typically, we take the log of this to get:

$idf(t,D)=\log(\frac{N}{1+n_w})$.

So we'll computer what's called tf-idf in the biz by multiplying the frequency and the inverse document frequency:

$tfidf=tf\times idf$

As so often, this is not a neutral choice:

if we pick $tf$ by itself, we want the most frequent words normalized by length in each document.
if we pick $tfidf$, then we are saying we want to work with the most frequent words that are also usual across our particular set of documents.

If we use tfidf on a set of documents about the CIA from 2000-2010, "intelligence" would be in most of them, we'd guess, and so the measure would down-play them in favor of what makes each document more distinctive within the corpus.

What we need to clean up text

tokenization

making .split much better Examples??

stemming:

converting inflected forms into some normalized forms
- e.g. "chefs" --> "chef"
- "goes" --> "go"
- "children" --> "child"

stopwords

they are the words you don't want to be included: "from" "to" "a" "they" "she" "he"

others?

Let's get some data!

Documenting the American South

http://docsouth.unc.edu/docsouthdata/



In [11]:

    
directory="/Users/mljones/Downloads/na-slave-narratives/data/texts/"   #PUT YOUR DIRECTORY HERE!



In [12]:

    
import sys, os



In [13]:

    
##I will provide a set of black boxes for this sort of thing soon; then you will import textmining_blackboxes



In [14]:

    
os.chdir(directory)
files=[file for file in os.listdir(".") if not file.startswith('.')] #defeat hidden files
files=[file for file in files if not os.path.isdir(file)==True] #defeat directories

articles=[]
file_titles=[]
for file in files:
    with open(file, encoding="UTF-8") as plaintext:
        lines=plaintext.readlines()
        #lines=[str(line) for line in lines]
        article=" ".join(lines) #alter lines if want to skip lines
        articles.append(article)
        file_titles.append(file) #keep track of file names



In [15]:

    
articles[2][:500]









    Out[15]:





'\n Wm. W. Brown.[Frontispiece Image]\n [Title Page Image]\n [Title Page Verso Image]\n \n _______Is there not some chosen curse,\n Some hidden thunder in the stores of heaven,\n Red with uncommon wrath, to blast the man\n who gains his fortune from the blood of souls !\n \n Cowper.\n PREFACE\n TO WELLS BROWN, OF OHIO.\n THIRTEEN years ago, I came to your door, a weary fugitive from chains and stripes. I was a stranger, and you took me in. I was hungry, and you fed me. Naked was I, and you clothed me. Even a '



In [16]:

    
import re



In [17]:

    
re.sub('\n', '', articles[2])[:500]









    Out[17]:





' Wm. W. Brown.[Frontispiece Image] [Title Page Image] [Title Page Verso Image]  _______Is there not some chosen curse, Some hidden thunder in the stores of heaven, Red with uncommon wrath, to blast the man who gains his fortune from the blood of souls !  Cowper. PREFACE TO WELLS BROWN, OF OHIO. THIRTEEN years ago, I came to your door, a weary fugitive from chains and stripes. I was a stranger, and you took me in. I was hungry, and you fed me. Naked was I, and you clothed me. Even a name by which'



In [18]:

    
documents=pd.DataFrame(y, index=x)



In [19]:

    
documents

Here's our help!

Python Libraries

Python has an embarrasment of riches when it comes to working with texts. Some libraries are higher level with simpler, well thought out defaults, namely pattern and TextBlob. Most general, of long development, and foundational is the Natural Language Tool Kit--NLTK. The ideas we'll learn to today are key--they have slightly different instantiations in the different tools. Not everything is yet in Python 3.

For now, we'll play with the cool scientists and use the powerful and fast scikit learn package.



In [21]:

    
vectorizer=TfidfVectorizer(min_df=0.95, stop_words='english')  
#.95 is a VERY high threshold--only the most common words--chosen for the form of visualization we're going to do



In [22]:

    
document_term_matrix=vectorizer.fit_transform(articles)



In [23]:

    
document_term_matrix.shape









    Out[23]:





(294, 63)



In [24]:

    
#output is number of documents, then size of remaining vocabulary
rows, terms=document_term_matrix.shape



In [25]:

    
vocab=vectorizer.get_feature_names()



In [26]:

    
len(vocab)









    Out[26]:





63



In [27]:

    
dtm=document_term_matrix.toarray()
dtmdf=pd.DataFrame(dtm, columns=vocab)

put into pandas just so we could explore it a bit more elegantly



In [28]:

    
dtmdf









    Out[28]:






  
    
      
      away
      better
      born
      brought
      called
      came
      children
      come
      country
      day
      ...
      things
      thought
      time
      took
      way
      went
      white
      work
      year
      years
    
  
  
    
      0  
       0.062433
       0.033243
       0.035528
       0.049360
       0.059921
       0.096732
       0.030062
       0.076005
       0.051750
       0.156397
      ...
       0.146825
       0.060125
       0.272108
       0.085984
       0.202831
       0.162316
       0.137111
       0.074040
       0.002733
       0.077396
    
    
      1  
       0.082805
       0.031445
       0.005816
       0.071007
       0.084052
       0.209683
       0.080460
       0.086656
       0.124629
       0.273618
      ...
       0.027514
       0.052348
       0.282063
       0.091498
       0.088040
       0.106514
       0.055444
       0.149795
       0.041684
       0.068160
    
    
      2  
       0.067857
       0.057208
       0.023763
       0.041727
       0.074008
       0.160624
       0.068319
       0.041304
       0.053286
       0.204558
      ...
       0.015055
       0.098022
       0.290007
       0.125579
       0.066716
       0.123493
       0.134121
       0.092395
       0.050496
       0.113127
    
    
      3  
       0.079088
       0.086923
       0.008167
       0.040973
       0.099704
       0.110405
       0.104126
       0.064892
       0.134295
       0.162387
      ...
       0.047600
       0.028584
       0.223020
       0.072265
       0.081746
       0.084883
       0.264273
       0.188474
       0.102084
       0.119628
    
    
      4  
       0.136277
       0.019869
       0.034301
       0.068836
       0.058603
       0.211983
       0.249909
       0.087607
       0.078137
       0.280421
      ...
       0.084441
       0.078403
       0.194210
       0.105991
       0.071779
       0.247379
       0.167172
       0.132754
       0.053902
       0.119631
    
    
      5  
       0.058159
       0.077162
       0.029278
       0.017626
       0.023343
       0.103627
       0.187377
       0.133767
       0.029178
       0.323734
      ...
       0.083097
       0.076122
       0.311318
       0.103627
       0.182982
       0.075351
       0.047003
       0.193889
       0.146388
       0.160109
    
    
      6  
       0.027531
       0.042146
       0.013859
       0.013907
       0.151937
       0.177143
       0.083157
       0.068829
       0.041437
       0.255415
      ...
       0.014049
       0.041578
       0.455511
       0.095385
       0.094740
       0.192068
       0.111252
       0.069533
       0.138594
       0.108275
    
    
      7  
       0.208734
       0.012531
       0.030905
       0.031010
       0.049281
       0.164082
       0.228698
       0.061392
       0.012320
       0.179858
      ...
       0.012531
       0.067991
       0.256924
       0.048617
       0.132793
       0.220265
       0.074424
       0.223273
       0.098896
       0.114685
    
    
      8  
       0.079807
       0.122173
       0.040175
       0.000000
       0.080079
       0.000000
       0.160701
       0.039904
       0.080079
       0.077936
      ...
       0.000000
       0.080351
       0.194180
       0.039500
       0.078466
       0.039769
       0.040312
       0.080624
       0.040175
       0.078466
    
    
      9  
       0.251497
       0.030800
       0.040513
       0.010163
       0.020188
       0.009958
       0.040513
       0.130778
       0.040376
       0.176831
      ...
       0.030800
       0.060770
       0.205605
       0.019916
       0.049454
       0.010026
       0.020326
       0.030488
       0.020257
       0.059345
    
    
      10 
       0.145591
       0.023687
       0.016995
       0.074606
       0.101625
       0.273615
       0.076478
       0.156141
       0.044461
       0.166903
      ...
       0.053835
       0.082851
       0.242320
       0.089812
       0.138995
       0.254448
       0.089527
       0.230212
       0.063731
       0.053938
    
    
      11 
       0.140240
       0.035781
       0.010085
       0.070837
       0.115588
       0.148736
       0.201706
       0.115197
       0.010051
       0.278792
      ...
       0.020446
       0.055469
       0.214480
       0.133862
       0.054168
       0.214639
       0.136614
       0.141674
       0.015128
       0.054168
    
    
      12 
       0.127418
       0.058885
       0.019364
       0.069218
       0.075987
       0.198711
       0.342496
       0.153863
       0.025329
       0.172557
      ...
       0.047844
       0.157330
       0.171974
       0.070203
       0.056729
       0.147352
       0.077718
       0.049788
       0.037517
       0.112275
    
    
      13 
       0.169279
       0.029616
       0.014608
       0.009772
       0.072795
       0.196292
       0.267820
       0.217645
       0.009706
       0.203094
      ...
       0.039488
       0.102259
       0.221235
       0.057451
       0.095105
       0.134965
       0.083062
       0.073290
       0.073042
       0.114126
    
    
      14 
       0.026950
       0.034380
       0.094966
       0.037435
       0.141967
       0.050019
       0.078008
       0.053900
       0.074364
       0.121719
      ...
       0.072198
       0.027133
       0.219665
       0.120047
       0.129172
       0.057074
       0.122514
       0.462831
       0.142449
       0.308025
    
    
      15 
       0.081961
       0.033459
       0.012378
       0.053820
       0.053456
       0.174433
       0.024756
       0.086059
       0.020560
       0.348170
      ...
       0.008365
       0.107275
       0.327050
       0.097358
       0.120875
       0.208294
       0.028980
       0.244259
       0.045386
       0.080584
    
    
      16 
       0.095078
       0.027493
       0.049458
       0.062433
       0.082681
       0.141174
       0.114870
       0.098247
       0.041340
       0.273902
      ...
       0.016172
       0.060626
       0.348545
       0.045489
       0.084132
       0.034744
       0.096050
       0.089647
       0.084557
       0.162032
    
    
      17 
       0.114592
       0.024621
       0.024289
       0.046712
       0.108929
       0.189053
       0.095131
       0.102529
       0.052447
       0.178654
      ...
       0.051293
       0.064770
       0.225007
       0.073631
       0.108712
       0.186332
       0.109671
       0.081238
       0.052626
       0.124525
    
    
      18 
       0.107923
       0.020979
       0.015522
       0.067493
       0.072193
       0.289969
       0.036219
       0.056531
       0.077350
       0.135504
      ...
       0.015735
       0.087961
       0.265089
       0.234010
       0.192006
       0.317551
       0.103835
       0.083068
       0.067264
       0.101056
    
    
      19 
       0.079111
       0.044663
       0.022031
       0.061215
       0.106402
       0.158286
       0.088122
       0.097626
       0.153692
       0.193960
      ...
       0.048099
       0.108458
       0.324357
       0.104969
       0.100950
       0.070455
       0.100325
       0.045911
       0.050840
       0.084400
    
    
      20 
       0.107135
       0.007289
       0.079100
       0.036077
       0.085999
       0.247451
       0.050337
       0.142846
       0.021500
       0.132521
      ...
       0.051024
       0.079100
       0.264146
       0.084840
       0.196623
       0.149481
       0.158738
       0.122661
       0.021573
       0.126401
    
    
      21 
       0.028692
       0.064665
       0.030091
       0.027778
       0.046784
       0.055621
       0.046943
       0.069340
       0.153546
       0.093399
      ...
       0.082967
       0.034906
       0.265285
       0.028402
       0.166910
       0.040510
       0.286236
       0.234303
       0.043332
       0.157507
    
    
      22 
       0.046076
       0.035624
       0.015463
       0.029621
       0.060243
       0.134065
       0.057636
       0.083775
       0.096668
       0.192254
      ...
       0.047024
       0.046390
       0.383206
       0.037317
       0.161988
       0.100190
       0.252484
       0.355453
       0.088562
       0.138651
    
    
      23 
       0.143845
       0.025311
       0.019976
       0.070153
       0.114472
       0.196399
       0.089891
       0.079363
       0.089586
       0.184066
      ...
       0.030373
       0.059927
       0.207580
       0.098199
       0.092659
       0.108754
       0.045098
       0.100218
       0.064921
       0.126797
    
    
      24 
       0.114106
       0.031760
       0.007833
       0.031438
       0.085870
       0.197664
       0.174934
       0.225618
       0.010408
       0.200067
      ...
       0.037053
       0.078329
       0.201914
       0.069311
       0.114737
       0.173164
       0.094314
       0.089074
       0.046997
       0.130036
    
    
      25 
       0.018052
       0.018424
       0.018175
       0.091185
       0.289820
       0.232306
       0.054526
       0.036105
       0.181137
       0.035258
      ...
       0.073694
       0.018175
       0.070277
       0.089349
       0.124243
       0.089956
       0.000000
       0.054711
       0.145402
       0.266234
    
    
      26 
       0.129467
       0.032031
       0.031600
       0.035670
       0.043302
       0.066020
       0.078999
       0.133390
       0.062985
       0.118768
      ...
       0.044043
       0.063199
       0.110730
       0.139807
       0.123433
       0.121209
       0.233839
       0.063414
       0.047399
       0.123433
    
    
      27 
       0.034518
       0.105683
       0.013901
       0.013948
       0.055416
       0.109340
       0.132061
       0.138072
       0.027708
       0.175284
      ...
       0.077501
       0.083407
       0.322506
       0.027335
       0.081451
       0.110084
       0.097639
       0.132510
       0.006951
       0.149326
    
    
      28 
       0.062960
       0.034135
       0.045560
       0.079504
       0.112528
       0.083746
       0.077254
       0.035415
       0.157934
       0.122966
      ...
       0.038151
       0.051503
       0.239356
       0.050637
       0.079311
       0.049021
       0.125219
       0.125219
       0.069331
       0.208917
    
    
      29 
       0.051788
       0.000000
       0.010428
       0.052318
       0.093535
       0.061517
       0.020856
       0.041430
       0.187070
       0.171949
      ...
       0.052853
       0.062568
       0.352815
       0.215308
       0.050917
       0.165160
       0.031391
       0.073245
       0.020856
       0.050917
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      264
       0.063563
       0.064870
       0.030116
       0.075545
       0.138814
       0.159151
       0.127992
       0.093476
       0.093793
       0.160659
      ...
       0.022895
       0.120463
       0.149198
       0.099932
       0.176456
       0.178866
       0.109541
       0.117095
       0.022587
       0.143371
    
    
      265
       0.116409
       0.044551
       0.011720
       0.023520
       0.081763
       0.106588
       0.123061
       0.078576
       0.029201
       0.147782
      ...
       0.071281
       0.125991
       0.334216
       0.074900
       0.111590
       0.133417
       0.026460
       0.102899
       0.046880
       0.094423
    
    
      266
       0.100140
       0.063529
       0.021799
       0.035544
       0.074681
       0.105824
       0.104908
       0.116378
       0.029872
       0.132150
      ...
       0.056624
       0.113083
       0.293697
       0.069656
       0.091804
       0.115984
       0.062885
       0.120303
       0.031336
       0.191591
    
    
      267
       0.101758
       0.058107
       0.021954
       0.031819
       0.074147
       0.110321
       0.098791
       0.121140
       0.026741
       0.128946
      ...
       0.051925
       0.117086
       0.273525
       0.068351
       0.092901
       0.113486
       0.064861
       0.111365
       0.030491
       0.202477
    
    
      268
       0.107442
       0.045465
       0.007915
       0.042357
       0.099919
       0.145265
       0.116089
       0.125786
       0.055218
       0.143309
      ...
       0.042791
       0.129280
       0.209135
       0.108949
       0.097907
       0.117525
       0.158840
       0.050299
       0.029022
       0.103060
    
    
      269
       0.145391
       0.017121
       0.011260
       0.079088
       0.117830
       0.110707
       0.140751
       0.111839
       0.056110
       0.131060
      ...
       0.028535
       0.106971
       0.174155
       0.099637
       0.065976
       0.139326
       0.282458
       0.056492
       0.011260
       0.060478
    
    
      270
       0.026587
       0.013567
       0.040153
       0.094008
       0.093372
       0.118433
       0.147226
       0.039881
       0.040017
       0.220693
      ...
       0.176372
       0.080305
       0.349326
       0.026318
       0.065351
       0.185482
       0.214875
       0.161156
       0.013384
       0.065351
    
    
      271
       0.289544
       0.000000
       0.000000
       0.000000
       0.000000
       0.000000
       0.000000
       0.217158
       0.000000
       0.141377
      ...
       0.036937
       0.145757
       0.140898
       0.143307
       0.035585
       0.180352
       0.036563
       0.036563
       0.072879
       0.177923
    
    
      272
       0.091124
       0.030999
       0.037377
       0.081828
       0.077888
       0.076839
       0.054367
       0.050625
       0.250596
       0.171383
      ...
       0.024111
       0.081551
       0.239781
       0.073498
       0.096229
       0.100906
       0.068190
       0.061371
       0.050969
       0.073001
    
    
      273
       0.064605
       0.032966
       0.032522
       0.065265
       0.097236
       0.127901
       0.032522
       0.129209
       0.000000
       0.189268
      ...
       0.000000
       0.032522
       0.282941
       0.063951
       0.190556
       0.000000
       0.000000
       0.065265
       0.032522
       0.190556
    
    
      274
       0.033574
       0.036280
       0.013919
       0.063844
       0.107008
       0.129026
       0.097430
       0.073072
       0.025761
       0.098359
      ...
       0.076590
       0.093453
       0.413246
       0.136845
       0.060193
       0.145649
       0.055863
       0.101751
       0.015907
       0.062135
    
    
      275
       0.070584
       0.024012
       0.015792
       0.055460
       0.062955
       0.124213
       0.094753
       0.086270
       0.125909
       0.114881
      ...
       0.008004
       0.039480
       0.366377
       0.178557
       0.084820
       0.171955
       0.023769
       0.079229
       0.157922
       0.185061
    
    
      276
       0.074653
       0.027705
       0.034164
       0.054848
       0.108955
       0.147795
       0.129824
       0.074653
       0.108955
       0.132550
      ...
       0.006926
       0.054663
       0.317042
       0.154513
       0.080071
       0.162328
       0.034280
       0.075417
       0.157155
       0.213522
    
    
      277
       0.038028
       0.052507
       0.051800
       0.042936
       0.078559
       0.081929
       0.045043
       0.082767
       0.170585
       0.146360
      ...
       0.052507
       0.040539
       0.215531
       0.059786
       0.114365
       0.053505
       0.207903
       0.223722
       0.074321
       0.202339
    
    
      278
       0.046682
       0.082580
       0.056400
       0.051876
       0.076507
       0.075476
       0.072067
       0.065355
       0.229521
       0.165635
      ...
       0.079404
       0.047000
       0.168103
       0.083178
       0.091795
       0.063583
       0.108467
       0.073884
       0.037600
       0.105565
    
    
      279
       0.139138
       0.025357
       0.030018
       0.050200
       0.119666
       0.078703
       0.060036
       0.074538
       0.064819
       0.281455
      ...
       0.030428
       0.055033
       0.256321
       0.078703
       0.073285
       0.178286
       0.055220
       0.381523
       0.085052
       0.141685
    
    
      280
       0.035987
       0.062153
       0.019509
       0.032160
       0.062496
       0.105497
       0.027870
       0.077510
       0.202766
       0.162197
      ...
       0.055090
       0.037625
       0.304437
       0.054804
       0.125197
       0.088283
       0.316006
       0.320201
       0.160255
       0.164661
    
    
      281
       0.076505
       0.031231
       0.015405
       0.023186
       0.153530
       0.136315
       0.046215
       0.076505
       0.061412
       0.216661
      ...
       0.062462
       0.107836
       0.379735
       0.212046
       0.082741
       0.038123
       0.061830
       0.131388
       0.084728
       0.105306
    
    
      282
       0.071054
       0.041437
       0.030659
       0.058108
       0.149382
       0.113876
       0.105604
       0.067671
       0.190122
       0.231293
      ...
       0.037984
       0.068131
       0.243683
       0.113876
       0.083167
       0.030349
       0.150398
       0.109381
       0.047692
       0.139720
    
    
      283
       0.081500
       0.049906
       0.016411
       0.123501
       0.089955
       0.153284
       0.106672
       0.057050
       0.024533
       0.175096
      ...
       0.033271
       0.131288
       0.245891
       0.048405
       0.032052
       0.113714
       0.123501
       0.164668
       0.041028
       0.072117
    
    
      284
       0.055471
       0.078851
       0.017951
       0.016011
       0.075537
       0.256897
       0.021940
       0.116884
       0.085476
       0.139293
      ...
       0.105135
       0.349051
       0.229442
       0.133351
       0.054538
       0.260619
       0.030020
       0.080054
       0.025929
       0.035060
    
    
      285
       0.086788
       0.030542
       0.021091
       0.024186
       0.102097
       0.216256
       0.063274
       0.143649
       0.090086
       0.181195
      ...
       0.054976
       0.057248
       0.372815
       0.106647
       0.176543
       0.101407
       0.139072
       0.142095
       0.063274
       0.147119
    
    
      286
       0.102295
       0.030517
       0.020598
       0.068364
       0.102643
       0.110607
       0.107745
       0.043279
       0.084483
       0.218235
      ...
       0.045775
       0.032482
       0.242769
       0.126965
       0.066535
       0.101949
       0.088238
       0.183630
       0.083186
       0.290896
    
    
      287
       0.092055
       0.036134
       0.028518
       0.078690
       0.021316
       0.049067
       0.071294
       0.084974
       0.071052
       0.276604
      ...
       0.050588
       0.057035
       0.158509
       0.042057
       0.027849
       0.007057
       0.021461
       0.028614
       0.057035
       0.153167
    
    
      288
       0.042836
       0.120220
       0.005391
       0.059501
       0.118198
       0.058303
       0.032345
       0.069608
       0.102080
       0.094120
      ...
       0.087433
       0.070082
       0.250136
       0.068903
       0.115818
       0.016009
       0.075729
       0.032455
       0.037736
       0.152669
    
    
      289
       0.039550
       0.051896
       0.011377
       0.045662
       0.119054
       0.044743
       0.051196
       0.124300
       0.005669
       0.110350
      ...
       0.074961
       0.045508
       0.197958
       0.033557
       0.094436
       0.078833
       0.034247
       0.102740
       0.068262
       0.038885
    
    
      290
       0.132217
       0.044979
       0.000000
       0.029682
       0.103185
       0.072710
       0.221861
       0.000000
       0.014741
       0.100424
      ...
       0.014993
       0.000000
       0.185870
       0.087253
       0.101107
       0.073205
       0.148410
       0.089046
       0.044372
       0.476646
    
    
      291
       0.007421
       0.000000
       0.018679
       0.014994
       0.078184
       0.040402
       0.048564
       0.055657
       0.003723
       0.105080
      ...
       0.015147
       0.059771
       0.090279
       0.022037
       0.091202
       0.059166
       0.037484
       0.007497
       0.041093
       0.062017
    
    
      292
       0.109306
       0.043256
       0.017967
       0.038310
       0.078341
       0.174445
       0.038181
       0.118229
       0.035813
       0.137241
      ...
       0.025043
       0.074116
       0.284408
       0.066245
       0.122822
       0.280122
       0.085635
       0.092396
       0.047164
       0.067991
    
    
      293
       0.155011
       0.020635
       0.000000
       0.034043
       0.027050
       0.040028
       0.027142
       0.384157
       0.006762
       0.342240
      ...
       0.013756
       0.000000
       0.288608
       0.040028
       0.019879
       0.047018
       0.006809
       0.204256
       0.040713
       0.033132
    
  

294 rows × 63 columns

Now we can throw wide variety of mining algorithms at our data!

Similarity and dissimilarity

We reduced our text to a vector of term-weights. What can we do once we've committed this violence on the text?

We can measure distance and similarity

I know. Crazy talk.

Right now our text is just a series of numbers, indexed to words. We can treat it like any other set of words.

And the key way to distinguish two vectors is by measuring their distance or computing their similiarity (1-distance).

You already know how, though you may have buried it along with memories of high school.

Many distance metrics to choose from

key one in textual analysis:

cosine similarity

If $\mathbf{a}$ and $\mathbf{b}$ are vectors, then

$\mathbf{a}\cdot\mathbf{b}=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta$

$\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum\limits_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum\limits_{i=1}^{n}{(B_i)^2}} }$



In [29]:

    
#easy to program, but let's use a robust version
from sklearn.metrics.pairwise import cosine_similarity



In [30]:

    
#cosine similarity is vectorized: that means is will operate on an entire matrix, not just its individual elements



In [31]:

    
similarity=cosine_similarity(dtmdf)



In [32]:

    
similarity









    Out[32]:





array([[ 1.        ,  0.68264883,  0.76388739, ...,  0.54432244,
         0.82308187,  0.74060483],
       [ 0.68264883,  1.        ,  0.87835427, ...,  0.41133783,
         0.76848862,  0.6303793 ],
       [ 0.76388739,  0.87835427,  1.        , ...,  0.51797058,
         0.86520961,  0.61500274],
       ..., 
       [ 0.54432244,  0.41133783,  0.51797058, ...,  1.        ,
         0.70739313,  0.50234307],
       [ 0.82308187,  0.76848862,  0.86520961, ...,  0.70739313,
         1.        ,  0.70527606],
       [ 0.74060483,  0.6303793 ,  0.61500274, ...,  0.50234307,
         0.70527606,  1.        ]])



In [33]:

    
import matplotlib.pyplot as plt 
#we can make a heatmap with no problems within mathplotlib
#pass plt.pcolor our similiarity matrix
plt.pcolor(similarity, norm=None, cmap='Blues')









    Out[33]:





<matplotlib.collections.PolyCollection at 0x1150b0210>



In [34]:

    
#we have too many documents for that to be very useful; so
plt.pcolor(similarity[100:110, 100:110], norm=None, cmap='Blues')









    Out[34]:





<matplotlib.collections.PolyCollection at 0x114eafa10>

supervised vs. unsupervised learning



In [35]:

    
##first example of unsupervised learning
###hierarchical clustering



In [36]:

    
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram
dtm=document_term_matrix
dtm_trans=dtm.T
dist=1-cosine_similarity(dtm_trans)
linkage_matrix=ward(dist)

#plot dendogram

f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()

is this significant? Are there interesting patterns to seek out?

here's what we're up to:

Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.

-. . . we can view the techniques of EDA as a ritual designed to reveal patters in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.

Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical statistics"

need to elicit patterns and avoid bad magical thinking!



In [ ]:

	0
only	1
them	1
green	3
my	1
but	1
eat	2
I	2
grapes	1
to	1
apples	1
if	1
gripes	1
like	3
friend	1

	away	better	born	brought	called	came	children	come	country	day	...	things	thought	time	took	way	went	white	work	year	years
0	0.062433	0.033243	0.035528	0.049360	0.059921	0.096732	0.030062	0.076005	0.051750	0.156397	...	0.146825	0.060125	0.272108	0.085984	0.202831	0.162316	0.137111	0.074040	0.002733	0.077396
1	0.082805	0.031445	0.005816	0.071007	0.084052	0.209683	0.080460	0.086656	0.124629	0.273618	...	0.027514	0.052348	0.282063	0.091498	0.088040	0.106514	0.055444	0.149795	0.041684	0.068160
2	0.067857	0.057208	0.023763	0.041727	0.074008	0.160624	0.068319	0.041304	0.053286	0.204558	...	0.015055	0.098022	0.290007	0.125579	0.066716	0.123493	0.134121	0.092395	0.050496	0.113127
3	0.079088	0.086923	0.008167	0.040973	0.099704	0.110405	0.104126	0.064892	0.134295	0.162387	...	0.047600	0.028584	0.223020	0.072265	0.081746	0.084883	0.264273	0.188474	0.102084	0.119628
4	0.136277	0.019869	0.034301	0.068836	0.058603	0.211983	0.249909	0.087607	0.078137	0.280421	...	0.084441	0.078403	0.194210	0.105991	0.071779	0.247379	0.167172	0.132754	0.053902	0.119631
5	0.058159	0.077162	0.029278	0.017626	0.023343	0.103627	0.187377	0.133767	0.029178	0.323734	...	0.083097	0.076122	0.311318	0.103627	0.182982	0.075351	0.047003	0.193889	0.146388	0.160109
6	0.027531	0.042146	0.013859	0.013907	0.151937	0.177143	0.083157	0.068829	0.041437	0.255415	...	0.014049	0.041578	0.455511	0.095385	0.094740	0.192068	0.111252	0.069533	0.138594	0.108275
7	0.208734	0.012531	0.030905	0.031010	0.049281	0.164082	0.228698	0.061392	0.012320	0.179858	...	0.012531	0.067991	0.256924	0.048617	0.132793	0.220265	0.074424	0.223273	0.098896	0.114685
8	0.079807	0.122173	0.040175	0.000000	0.080079	0.000000	0.160701	0.039904	0.080079	0.077936	...	0.000000	0.080351	0.194180	0.039500	0.078466	0.039769	0.040312	0.080624	0.040175	0.078466
9	0.251497	0.030800	0.040513	0.010163	0.020188	0.009958	0.040513	0.130778	0.040376	0.176831	...	0.030800	0.060770	0.205605	0.019916	0.049454	0.010026	0.020326	0.030488	0.020257	0.059345
10	0.145591	0.023687	0.016995	0.074606	0.101625	0.273615	0.076478	0.156141	0.044461	0.166903	...	0.053835	0.082851	0.242320	0.089812	0.138995	0.254448	0.089527	0.230212	0.063731	0.053938
11	0.140240	0.035781	0.010085	0.070837	0.115588	0.148736	0.201706	0.115197	0.010051	0.278792	...	0.020446	0.055469	0.214480	0.133862	0.054168	0.214639	0.136614	0.141674	0.015128	0.054168
12	0.127418	0.058885	0.019364	0.069218	0.075987	0.198711	0.342496	0.153863	0.025329	0.172557	...	0.047844	0.157330	0.171974	0.070203	0.056729	0.147352	0.077718	0.049788	0.037517	0.112275
13	0.169279	0.029616	0.014608	0.009772	0.072795	0.196292	0.267820	0.217645	0.009706	0.203094	...	0.039488	0.102259	0.221235	0.057451	0.095105	0.134965	0.083062	0.073290	0.073042	0.114126
14	0.026950	0.034380	0.094966	0.037435	0.141967	0.050019	0.078008	0.053900	0.074364	0.121719	...	0.072198	0.027133	0.219665	0.120047	0.129172	0.057074	0.122514	0.462831	0.142449	0.308025
15	0.081961	0.033459	0.012378	0.053820	0.053456	0.174433	0.024756	0.086059	0.020560	0.348170	...	0.008365	0.107275	0.327050	0.097358	0.120875	0.208294	0.028980	0.244259	0.045386	0.080584
16	0.095078	0.027493	0.049458	0.062433	0.082681	0.141174	0.114870	0.098247	0.041340	0.273902	...	0.016172	0.060626	0.348545	0.045489	0.084132	0.034744	0.096050	0.089647	0.084557	0.162032
17	0.114592	0.024621	0.024289	0.046712	0.108929	0.189053	0.095131	0.102529	0.052447	0.178654	...	0.051293	0.064770	0.225007	0.073631	0.108712	0.186332	0.109671	0.081238	0.052626	0.124525
18	0.107923	0.020979	0.015522	0.067493	0.072193	0.289969	0.036219	0.056531	0.077350	0.135504	...	0.015735	0.087961	0.265089	0.234010	0.192006	0.317551	0.103835	0.083068	0.067264	0.101056
19	0.079111	0.044663	0.022031	0.061215	0.106402	0.158286	0.088122	0.097626	0.153692	0.193960	...	0.048099	0.108458	0.324357	0.104969	0.100950	0.070455	0.100325	0.045911	0.050840	0.084400
20	0.107135	0.007289	0.079100	0.036077	0.085999	0.247451	0.050337	0.142846	0.021500	0.132521	...	0.051024	0.079100	0.264146	0.084840	0.196623	0.149481	0.158738	0.122661	0.021573	0.126401
21	0.028692	0.064665	0.030091	0.027778	0.046784	0.055621	0.046943	0.069340	0.153546	0.093399	...	0.082967	0.034906	0.265285	0.028402	0.166910	0.040510	0.286236	0.234303	0.043332	0.157507
22	0.046076	0.035624	0.015463	0.029621	0.060243	0.134065	0.057636	0.083775	0.096668	0.192254	...	0.047024	0.046390	0.383206	0.037317	0.161988	0.100190	0.252484	0.355453	0.088562	0.138651
23	0.143845	0.025311	0.019976	0.070153	0.114472	0.196399	0.089891	0.079363	0.089586	0.184066	...	0.030373	0.059927	0.207580	0.098199	0.092659	0.108754	0.045098	0.100218	0.064921	0.126797
24	0.114106	0.031760	0.007833	0.031438	0.085870	0.197664	0.174934	0.225618	0.010408	0.200067	...	0.037053	0.078329	0.201914	0.069311	0.114737	0.173164	0.094314	0.089074	0.046997	0.130036
25	0.018052	0.018424	0.018175	0.091185	0.289820	0.232306	0.054526	0.036105	0.181137	0.035258	...	0.073694	0.018175	0.070277	0.089349	0.124243	0.089956	0.000000	0.054711	0.145402	0.266234
26	0.129467	0.032031	0.031600	0.035670	0.043302	0.066020	0.078999	0.133390	0.062985	0.118768	...	0.044043	0.063199	0.110730	0.139807	0.123433	0.121209	0.233839	0.063414	0.047399	0.123433
27	0.034518	0.105683	0.013901	0.013948	0.055416	0.109340	0.132061	0.138072	0.027708	0.175284	...	0.077501	0.083407	0.322506	0.027335	0.081451	0.110084	0.097639	0.132510	0.006951	0.149326
28	0.062960	0.034135	0.045560	0.079504	0.112528	0.083746	0.077254	0.035415	0.157934	0.122966	...	0.038151	0.051503	0.239356	0.050637	0.079311	0.049021	0.125219	0.125219	0.069331	0.208917
29	0.051788	0.000000	0.010428	0.052318	0.093535	0.061517	0.020856	0.041430	0.187070	0.171949	...	0.052853	0.062568	0.352815	0.215308	0.050917	0.165160	0.031391	0.073245	0.020856	0.050917
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
264	0.063563	0.064870	0.030116	0.075545	0.138814	0.159151	0.127992	0.093476	0.093793	0.160659	...	0.022895	0.120463	0.149198	0.099932	0.176456	0.178866	0.109541	0.117095	0.022587	0.143371
265	0.116409	0.044551	0.011720	0.023520	0.081763	0.106588	0.123061	0.078576	0.029201	0.147782	...	0.071281	0.125991	0.334216	0.074900	0.111590	0.133417	0.026460	0.102899	0.046880	0.094423
266	0.100140	0.063529	0.021799	0.035544	0.074681	0.105824	0.104908	0.116378	0.029872	0.132150	...	0.056624	0.113083	0.293697	0.069656	0.091804	0.115984	0.062885	0.120303	0.031336	0.191591
267	0.101758	0.058107	0.021954	0.031819	0.074147	0.110321	0.098791	0.121140	0.026741	0.128946	...	0.051925	0.117086	0.273525	0.068351	0.092901	0.113486	0.064861	0.111365	0.030491	0.202477
268	0.107442	0.045465	0.007915	0.042357	0.099919	0.145265	0.116089	0.125786	0.055218	0.143309	...	0.042791	0.129280	0.209135	0.108949	0.097907	0.117525	0.158840	0.050299	0.029022	0.103060
269	0.145391	0.017121	0.011260	0.079088	0.117830	0.110707	0.140751	0.111839	0.056110	0.131060	...	0.028535	0.106971	0.174155	0.099637	0.065976	0.139326	0.282458	0.056492	0.011260	0.060478
270	0.026587	0.013567	0.040153	0.094008	0.093372	0.118433	0.147226	0.039881	0.040017	0.220693	...	0.176372	0.080305	0.349326	0.026318	0.065351	0.185482	0.214875	0.161156	0.013384	0.065351
271	0.289544	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.217158	0.000000	0.141377	...	0.036937	0.145757	0.140898	0.143307	0.035585	0.180352	0.036563	0.036563	0.072879	0.177923
272	0.091124	0.030999	0.037377	0.081828	0.077888	0.076839	0.054367	0.050625	0.250596	0.171383	...	0.024111	0.081551	0.239781	0.073498	0.096229	0.100906	0.068190	0.061371	0.050969	0.073001
273	0.064605	0.032966	0.032522	0.065265	0.097236	0.127901	0.032522	0.129209	0.000000	0.189268	...	0.000000	0.032522	0.282941	0.063951	0.190556	0.000000	0.000000	0.065265	0.032522	0.190556
274	0.033574	0.036280	0.013919	0.063844	0.107008	0.129026	0.097430	0.073072	0.025761	0.098359	...	0.076590	0.093453	0.413246	0.136845	0.060193	0.145649	0.055863	0.101751	0.015907	0.062135
275	0.070584	0.024012	0.015792	0.055460	0.062955	0.124213	0.094753	0.086270	0.125909	0.114881	...	0.008004	0.039480	0.366377	0.178557	0.084820	0.171955	0.023769	0.079229	0.157922	0.185061
276	0.074653	0.027705	0.034164	0.054848	0.108955	0.147795	0.129824	0.074653	0.108955	0.132550	...	0.006926	0.054663	0.317042	0.154513	0.080071	0.162328	0.034280	0.075417	0.157155	0.213522
277	0.038028	0.052507	0.051800	0.042936	0.078559	0.081929	0.045043	0.082767	0.170585	0.146360	...	0.052507	0.040539	0.215531	0.059786	0.114365	0.053505	0.207903	0.223722	0.074321	0.202339
278	0.046682	0.082580	0.056400	0.051876	0.076507	0.075476	0.072067	0.065355	0.229521	0.165635	...	0.079404	0.047000	0.168103	0.083178	0.091795	0.063583	0.108467	0.073884	0.037600	0.105565
279	0.139138	0.025357	0.030018	0.050200	0.119666	0.078703	0.060036	0.074538	0.064819	0.281455	...	0.030428	0.055033	0.256321	0.078703	0.073285	0.178286	0.055220	0.381523	0.085052	0.141685
280	0.035987	0.062153	0.019509	0.032160	0.062496	0.105497	0.027870	0.077510	0.202766	0.162197	...	0.055090	0.037625	0.304437	0.054804	0.125197	0.088283	0.316006	0.320201	0.160255	0.164661
281	0.076505	0.031231	0.015405	0.023186	0.153530	0.136315	0.046215	0.076505	0.061412	0.216661	...	0.062462	0.107836	0.379735	0.212046	0.082741	0.038123	0.061830	0.131388	0.084728	0.105306
282	0.071054	0.041437	0.030659	0.058108	0.149382	0.113876	0.105604	0.067671	0.190122	0.231293	...	0.037984	0.068131	0.243683	0.113876	0.083167	0.030349	0.150398	0.109381	0.047692	0.139720
283	0.081500	0.049906	0.016411	0.123501	0.089955	0.153284	0.106672	0.057050	0.024533	0.175096	...	0.033271	0.131288	0.245891	0.048405	0.032052	0.113714	0.123501	0.164668	0.041028	0.072117
284	0.055471	0.078851	0.017951	0.016011	0.075537	0.256897	0.021940	0.116884	0.085476	0.139293	...	0.105135	0.349051	0.229442	0.133351	0.054538	0.260619	0.030020	0.080054	0.025929	0.035060
285	0.086788	0.030542	0.021091	0.024186	0.102097	0.216256	0.063274	0.143649	0.090086	0.181195	...	0.054976	0.057248	0.372815	0.106647	0.176543	0.101407	0.139072	0.142095	0.063274	0.147119
286	0.102295	0.030517	0.020598	0.068364	0.102643	0.110607	0.107745	0.043279	0.084483	0.218235	...	0.045775	0.032482	0.242769	0.126965	0.066535	0.101949	0.088238	0.183630	0.083186	0.290896
287	0.092055	0.036134	0.028518	0.078690	0.021316	0.049067	0.071294	0.084974	0.071052	0.276604	...	0.050588	0.057035	0.158509	0.042057	0.027849	0.007057	0.021461	0.028614	0.057035	0.153167
288	0.042836	0.120220	0.005391	0.059501	0.118198	0.058303	0.032345	0.069608	0.102080	0.094120	...	0.087433	0.070082	0.250136	0.068903	0.115818	0.016009	0.075729	0.032455	0.037736	0.152669
289	0.039550	0.051896	0.011377	0.045662	0.119054	0.044743	0.051196	0.124300	0.005669	0.110350	...	0.074961	0.045508	0.197958	0.033557	0.094436	0.078833	0.034247	0.102740	0.068262	0.038885
290	0.132217	0.044979	0.000000	0.029682	0.103185	0.072710	0.221861	0.000000	0.014741	0.100424	...	0.014993	0.000000	0.185870	0.087253	0.101107	0.073205	0.148410	0.089046	0.044372	0.476646
291	0.007421	0.000000	0.018679	0.014994	0.078184	0.040402	0.048564	0.055657	0.003723	0.105080	...	0.015147	0.059771	0.090279	0.022037	0.091202	0.059166	0.037484	0.007497	0.041093	0.062017
292	0.109306	0.043256	0.017967	0.038310	0.078341	0.174445	0.038181	0.118229	0.035813	0.137241	...	0.025043	0.074116	0.284408	0.066245	0.122822	0.280122	0.085635	0.092396	0.047164	0.067991
293	0.155011	0.020635	0.000000	0.034043	0.027050	0.040028	0.027142	0.384157	0.006762	0.342240	...	0.013756	0.000000	0.288608	0.040028	0.019879	0.047018	0.006809	0.204256	0.040713	0.033132