In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
import textmining_blackboxes as tm
tm is our temporary helper, not a standard Python package! Download it from my GitHub: https://github.com/matthewljones/computingincontext
In [3]:
#see if package imported correctly
tm.icantbelieve("butter")
Let's use the remarkable narratives available from Documenting the American South (http://docsouth.unc.edu/docsouthdata/)
This assumes you are storing your data alongside your iPython notebook: put the slave narrative texts within a `data` directory in the same place as this notebook.
In [4]:
title_info=pd.read_csv('data/na-slave-narratives/data/toc.csv')
#this is the "metadata" of these files--we didn't use today
#why does data appear twice?
In [5]:
#Let's use a brittle thing for reading in a directory of pure txt files.
our_texts=tm.readtextfiles('data/na-slave-narratives/data/texts')
#again, this is not a std python package
#returns a simple list of the documents as very long strings
#note: the rest of this notebook will work on any directory of text files
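If you don't have the helper package handy, a minimal sketch of the same idea--read every .txt file in a directory into a list of long strings--might look like this (read_text_files is a hypothetical stand-in, not the helper's actual code):
In [ ]:
import glob
import os

def read_text_files(directory):
    #read every .txt file in a directory into a list of long strings
    texts=[]
    for path in sorted(glob.glob(os.path.join(directory, '*.txt'))):
        with open(path, encoding='utf-8', errors='ignore') as f:
            texts.append(f.read())
    return texts

#our_texts=read_text_files('data/na-slave-narratives/data/texts')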
In [6]:
len(our_texts)
Out[6]:
In [42]:
our_texts[100][:300] # first 300 characters of the 100th text
Out[42]:
Sure, you could do this as a for loop:
for text in our_texts:
    blah.blah.blah(text) #not real code
or
for i in range(len(our_texts)):
    blah.blah.blah(our_texts[i])
But a list comprehension makes this super easy in Python:
In [8]:
lengths=[len(text) for text in our_texts]
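Since matplotlib is already imported, a quick look at the spread of document lengths makes a nice sanity check (a minimal sketch, not in the original notebook):
In [ ]:
plt.hist(lengths, bins=50)
plt.xlabel("characters per narrative")
plt.ylabel("number of narratives")
plt.show()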
Python has an embarrassment of riches when it comes to working with texts. Some libraries are higher level, with simpler, well-thought-out defaults, namely pattern and TextBlob. The most general, longest developed, and foundational is the Natural Language Toolkit--NLTK. The ideas we'll learn today are key--they just have slightly different instantiations in the different tools. Not everything is yet in Python 3, alas!!
These libraries offer, among other things, tokenizers that do a much better job than a bare .split(), and lists of stop words--the words you don't want to be included: "from", "to", "a", "they", "she", "he".
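By way of example, here's a minimal sketch of what NLTK adds over a bare .split() (this assumes you have nltk installed and its 'punkt' and 'stopwords' data downloaded; it isn't part of this notebook's pipeline):
In [ ]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#nltk.download('punkt'); nltk.download('stopwords')   #one-time downloads
sentence="She escaped from the plantation, didn't she?"
print(sentence.split())          #naive split leaves punctuation stuck to words
print(word_tokenize(sentence))   #NLTK separates punctuation and contractions
stops=set(stopwords.words('english'))
print([w for w in word_tokenize(sentence.lower()) if w.isalpha() and w not in stops])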
For now, we'll play with the cool scientists and use the powerful and fast scikit-learn package.
In [ ]:
our_texts=tm.data_cleanse(our_texts)
#more necessary when you have messy text
#eliminates escaped characters
In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [17]:
vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True)
#min_df=0.5 ignores terms that appear in fewer than half of the documents
In [18]:
document_term_matrix=vectorizer.fit_transform(our_texts)
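For reference, the weight that TfidfVectorizer assigns to term $t$ in document $d$ is, roughly speaking (before scikit-learn's smoothing and normalization):
$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \log\frac{N}{\text{df}(t)}$
where $\text{tf}(t,d)$ counts how often $t$ occurs in $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. Words common in one document but rare across the collection get the highest weights.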
In [43]:
# now let's get our vocabulary--the names corresponding to the rows
# "feature" is the general term in machine learning and data mining
# we seek to characterize data by picking out features that will enable discovery
vocab=vectorizer.get_feature_names()
#in newer versions of scikit-learn this method is get_feature_names_out()
In [20]:
len(vocab)
Out[20]:
In [21]:
document_term_matrix.shape
Out[21]:
In [22]:
vocab[1000:1100]
Out[22]:
In [23]:
document_term_matrix_dense=document_term_matrix.toarray()
In [24]:
dtmdf=pd.DataFrame(document_term_matrix_dense, columns=vocab)
In [25]:
dtmdf
Out[25]:
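With the matrix in a DataFrame, you can peek at how heavily a single term weighs in each narrative (a minimal sketch; "mother" is just a guess at a word that survives the min_df cutoff--substitute anything you saw in vocab):
In [ ]:
dtmdf["mother"].sort_values(ascending=False).head(10)
#tf-idf weight of one term in every document, largest first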
We reduced our text to a vector of term-weights.
What can we do once we've committed this real violence on the text?
We can measure distance and similarity
I know. Crazy talk.
Right now our text is just a series of numbers, indexed to words. We can treat it like any collection of vectors more or less.
And the key way to distinguish two vectors is by measuring their distance or computing their similarity (1 − distance).
You already know how, though you may have buried it along with memories of high school.
If $\mathbf{a}$ and $\mathbf{b}$ are vectors, then
$\mathbf{a}\cdot\mathbf{b}=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta$
Or
$\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum\limits_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum\limits_{i=1}^{n}{(B_i)^2}} }$
(h/t wikipedia)
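To see the formula in action, here is a minimal sketch computing the cosine similarity of two toy vectors by hand with NumPy (not part of the original notebook):
In [ ]:
import numpy as np
a=np.array([1.0, 0.0, 2.0])
b=np.array([0.0, 1.0, 1.0])
print(a.dot(b)/(np.linalg.norm(a)*np.linalg.norm(b)))
#about 0.63 for these two vectors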
In [26]:
#easy to program, but let's use a robust version from sklearn!
from sklearn.metrics.pairwise import cosine_similarity
In [40]:
similarity=cosine_similarity(document_term_matrix)
#Note here that `cosine_similarity` can take
#an entire matrix as its argument
In [28]:
#what'd we get?
similarity
Out[28]:
In [29]:
similarity.shape
Out[29]:
In [30]:
similarity[100]
#this gives the similarity of row 100 to each of the other rows
Out[30]:
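A natural follow-up is to pull out the texts most similar to text 100 by sorting that row (a minimal sketch; matching these indices back to title_info only works if its rows line up with the order the files were read in):
In [ ]:
import numpy as np
most_similar=np.argsort(similarity[100])[::-1][1:6]
#indices of the five most similar narratives, skipping text 100 itself
print(most_similar)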
This time we're interested in relations among the words, not the texts.
In other words, we're interested in the similarities between one column and another--one term and another term.
So we'll work with the transposed matrix--the term-document matrix, rather than the document-term matrix.
For a description of hierarchical clustering, look at the example at https://en.wikipedia.org/wiki/Hierarchical_clustering
In [31]:
term_document_matrix=document_term_matrix.T
# .T is the easy transposition method for a
# matrix in python's matrix packages.
In [32]:
# import a bunch of packages we need
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram
In [33]:
#distance is 1-similarity, so:
dist=1-cosine_similarity(term_document_matrix)
# ward is an algorithm for hierarchical clustering
linkage_matrix=ward(dist)
#plot dendrogram
f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()
In [34]:
vectorizer=TfidfVectorizer(min_df=.96, stop_words='english', use_idf=True)
#try a very high min_df: only terms appearing in at least 96% of the documents are kept, so the vocabulary shrinks drastically
In [35]:
#rerun the model
document_term_matrix=vectorizer.fit_transform(our_texts)
vocab=vectorizer.get_feature_names()
In [36]:
#check the length of the vocab
len(vocab)
Out[36]:
In [37]:
#switch again to the term_document_matrix
term_document_matrix=document_term_matrix.T
In [38]:
dist=1-cosine_similarity(term_document_matrix)
linkage_matrix=ward(dist)
#plot dendrogram
f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()
Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.
. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.