Computing In Context

Social Sciences Track

Lecture 3--text mining for real

Matthew L. Jones

like, with code and stuff


In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
import textmining_blackboxes as tm

IMPORTANT: tm is our temporary helper, not a standard python package!!

download it from my github: https://github.com/matthewljones/computingincontext


In [3]:
#see if package imported correctly
tm.icantbelieve("butter")


I can't believe it's not butter

Let's get some text

Let's use the remarkable narratives available from Documenting the American South (http://docsouth.unc.edu/docsouthdata/)

This assumes you are storing your data in a directory in the same place as your IPython notebook.

Put the slave narrative texts within a data directory in the same place as this notebook


In [4]:
title_info=pd.read_csv('data/na-slave-narratives/data/toc.csv')
#this is the "metadata" of these files--we didn't use it today
#why does data appear twice?

In [5]:
#Let's use a brittle thing for reading in a directory of pure txt files.
our_texts=tm.readtextfiles('data/na-slave-narratives/data/texts')
#again, this is not a std python package
#returns a simple list of the documents as very long strings

#note: the rest of this notebook will work on any directory of text files.

In [6]:
len(our_texts)


Out[6]:
294

In [42]:
our_texts[100][:300] # first 300 characters of the 100th text


Out[42]:
'\n [Frontispiece Image]\n [Title Page Image]\n CONTENTS\n PREFACE.\n The idea of writing and giving the Church and community the advantage of my experience and such facts as came under my personal observation during the many years of labor I have spent in the A. M. E. Church, has occupied my attention fo'

list comprehensions!

most python thing evah!

how many words in each text within our_texts? can you make a list?

Sure, you could do this as a for loop

for text in our_texts:
    blah.blah.blah(text) #not real code

or

for i in range(len(our_texts)):

But it's super easy in python


In [8]:
lengths=[len(text) for text in our_texts]
#len(text) counts characters; len(text.split()) would give a rough word count
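Quick aside (mine, not part of the original notebook): the same pattern gives rough word counts, and the matplotlib we imported above lets us eyeball their distribution. The variable names here are just for illustration.

word_counts=[len(text.split()) for text in our_texts] #crude word counts via whitespace splitting

plt.hist(word_counts, bins=30)
plt.xlabel("approximate words per narrative")
plt.ylabel("number of narratives")
plt.show()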

How to process text

Python Libraries

Python has an embarrassment of riches when it comes to working with texts. Some libraries are higher level, with simpler, well-thought-out defaults, namely pattern and TextBlob. The most general, longest-developed, and foundational is the Natural Language Toolkit--NLTK. The ideas we'll learn today are key--they have slightly different instantiations in the different tools. Not everything is yet in Python 3, alas!!

nltk : grandparent of text analysis packages, cross-platform, complex

  • crucial for moving beyond bag of words: tagging & other grammatical analysis

pattern : higher level and easier to use than nltk, but Python 2.7 only. (wah!)

textblob : an even higher-level natural language processing library (Python 3.4, but not yet in conda?)

scikit learn (sklearn): toolkit for scientists, faster, better (use for processing/memory intensive stuff) (Our choice!)

Things we might do to clean up text

tokenization

splitting a text into words and other tokens--like .split, only much better

Examples??
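Here's a toy illustration (mine, not from the lecture) of why plain .split isn't enough: punctuation stays glued to words, while even a crude regular-expression tokenizer does better.

import re

sentence = "Don't stop--we haven't finished, have we?"

print(sentence.split())
#["Don't", 'stop--we', "haven't", 'finished,', 'have', 'we?'] -- punctuation stuck to words

print(re.findall(r"[A-Za-z']+", sentence))
#["Don't", 'stop', 'we', "haven't", 'finished', 'have', 'we'] -- a crude tokenizer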

stemming:

  • converting inflected forms into some normalized forms
    • e.g. "chefs" --> "chef"
    • "goes" --> "go"
    • "children" --> "child"

stopwords

they are the words you don't want included in your analysis: "from" "to" "a" "they" "she" "he"
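Removing them is just filtering. A toy sketch with a handmade stop list (real stop word lists--nltk's or sklearn's--are much longer):

stopwords={"from", "to", "a", "the", "they", "she", "he"}

tokens=["she", "walked", "from", "the", "plantation", "to", "the", "river"]
content_words=[word for word in tokens if word not in stopwords]

print(content_words) #['walked', 'plantation', 'river']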

If you need to do lots of such things, you'll want to use nltk, pattern or TextBlob.

For now, we'll play with the cool scientists and use the powerful and fast scikit learn package.

Our Zero-ith tool: cleaning up the text

I've included a little utility function in tm that takes a list of strings and cleans it up a bit

check out the code on your own time later


In [ ]:
our_texts=tm.data_cleanse(our_texts)

#more necessary when you have messy text
#eliminates escaped characters
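If you're curious before you read the real thing, here's a rough guess at the kind of thing a minimal cleanup helper might do. This is NOT tm.data_cleanse--check the GitHub repo for the actual code.

def cleanse_sketch(docs):
    """A hypothetical stand-in for a text cleanup helper, not the real tm.data_cleanse."""
    cleaned=[]
    for doc in docs:
        doc=doc.replace("\\n", " ").replace("\\t", " ") #literal escaped characters
        doc=" ".join(doc.split()) #collapse runs of whitespace
        cleaned.append(doc)
    return cleaned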

Our first tool: vectorizer from scikit learn


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True)
#min_df=0.5 ignores any term that appears in fewer than half of the documents
#stop_words='english' drops sklearn's built-in English stop word list

In [18]:
document_term_matrix=vectorizer.fit_transform(our_texts)

for the documentation of sklearn's text data functionality, see http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
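What is the vectorizer computing? In textbook form (sklearn's default adds smoothing to the idf term and then normalizes each document row, so its exact numbers differ slightly), the tf-idf weight of term $t$ in document $d$ is

$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \log\frac{N}{\text{df}(t)}$

where $\text{tf}(t,d)$ counts how often $t$ appears in $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. Terms frequent in a given document but rare across the corpus get the largest weights.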

while this works, mini-lecture on crashes

see the Kernel menu above. Therein is the secret to eliminating the dreaded In [*] of a cell that won't finish.


In [43]:
# now let's get our vocabulary--the names corresponding to the columns
# "feature" is the general term in machine learning and data mining 
# we seek to characterize data by picking out features that will enable discovery

vocab=vectorizer.get_feature_names()

In [20]:
len(vocab)


Out[20]:
1658

In [21]:
document_term_matrix.shape


Out[21]:
(294, 1658)

so document_term_matrix is a matrix with 294 rows--the documents--and 1658 columns--the vocabulary or terms or features


In [22]:
vocab[1000:1100]


Out[22]:
['ought',
 'outside',
 'overseer',
 'owned',
 'owner',
 'owners',
 'page',
 'pages',
 'paid',
 'pain',
 'painful',
 'pains',
 'pair',
 'paper',
 'papers',
 'parents',
 'particular',
 'particularly',
 'parties',
 'parting',
 'parts',
 'party',
 'pass',
 'passage',
 'passed',
 'passing',
 'past',
 'path',
 'pay',
 'paying',
 'peace',
 'peculiar',
 'pen',
 'people',
 'perfect',
 'perfectly',
 'perform',
 'performed',
 'period',
 'permission',
 'permit',
 'permitted',
 'person',
 'personal',
 'persons',
 'peter',
 'philadelphia',
 'picture',
 'piece',
 'pieces',
 'pity',
 'place',
 'placed',
 'places',
 'plain',
 'plan',
 'plans',
 'plantation',
 'play',
 'pleasant',
 'pleased',
 'pleasure',
 'plenty',
 'pocket',
 'point',
 'points',
 'poor',
 'portion',
 'position',
 'possess',
 'possessed',
 'possession',
 'possible',
 'possibly',
 'post',
 'pounds',
 'power',
 'powerful',
 'powers',
 'practice',
 'praise',
 'pray',
 'prayed',
 'prayer',
 'prayers',
 'praying',
 'preach',
 'preached',
 'preacher',
 'preaching',
 'precious',
 'preface',
 'prejudice',
 'prepare',
 'prepared',
 'preparing',
 'presence',
 'present',
 'presented',
 'president']

right now stored super efficiently as a sparse matrix

almost all zeros--good for our computers' limited memory

easier for us to see as a dense matrix
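Just how sparse? A quick check (not in the original notebook; it uses the .nnz attribute that scipy sparse matrices carry):

n_rows, n_cols = document_term_matrix.shape
print(document_term_matrix.nnz / float(n_rows * n_cols)) #fraction of entries that are nonzero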


In [23]:
document_term_matrix_dense=document_term_matrix.toarray()

In [24]:
dtmdf=pd.DataFrame(document_term_matrix_dense, columns=vocab)

In [25]:
dtmdf


Out[25]:
[Output: a pandas preview of the dense document-term DataFrame--one row per narrative, one column per vocabulary term, each cell a tf-idf weight. Pandas shows only the first and last handful of rows and columns (from '10', 'ability', 'able', 'abroad', ... through 'york', 'young', 'younger', 'youth').]

294 rows × 1658 columns

While this data frame is lovely to look at and useful to think with, it's tough on your computer's memory

Now we can throw a wide variety of mining algorithms at our data!

Similarity and dissimilarity

We've reduced each text to a vector of term weights.

What can we do once we've committed this real violence on the text?

We can measure distance and similarity

I know. Crazy talk.

Right now our text is just a series of numbers, indexed to words. We can treat it like any collection of vectors more or less.

And the key way to distinguish two vectors is by measuring their distance or computing their similarity (1 - distance).

You already know how, though you may have buried it along with memories of high school.

Many distance metrics to choose from

key one in textual analysis:

cosine similarity

If $\mathbf{a}$ and $\mathbf{b}$ are vectors, then

$\mathbf{a}\cdot\mathbf{b}=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta$

Or

$\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum\limits_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum\limits_{i=1}^{n}{(B_i)^2}} }$

(h/t wikipedia)
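To demystify the formula, here's what it looks like by hand--a quick numpy sketch of ours (in the next cell we'll use sklearn's robust version instead):

import numpy as np

def cosine_sim(a, b):
    #dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))) #0.0 -- orthogonal, nothing shared
print(cosine_sim(np.array([1.0, 2.0]), np.array([2.0, 4.0]))) #1.0 -- same direction, identical profile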


In [26]:
#easy to program, but let's use a robust version from sklearn!
from sklearn.metrics.pairwise import cosine_similarity

In [40]:
similarity=cosine_similarity(document_term_matrix)

#Note here that `cosine_similarity` can take
#an entire matrix as its argument

In [28]:
#what'd we get?

similarity


Out[28]:
array([[ 1.        ,  0.4800956 ,  0.47789776, ...,  0.41147251,
         0.64107903,  0.49087961],
       [ 0.4800956 ,  1.        ,  0.65544451, ...,  0.31196723,
         0.64506243,  0.47103725],
       [ 0.47789776,  0.65544451,  1.        , ...,  0.35760039,
         0.70650975,  0.45047448],
       ..., 
       [ 0.41147251,  0.31196723,  0.35760039, ...,  1.        ,
         0.54765566,  0.29438818],
       [ 0.64107903,  0.64506243,  0.70650975, ...,  0.54765566,
         1.        ,  0.56796543],
       [ 0.49087961,  0.47103725,  0.45047448, ...,  0.29438818,
         0.56796543,  1.        ]])

In [29]:
similarity.shape


Out[29]:
(294, 294)

that is a symmetric matrix relating each of the texts (rows) to every other text (row)
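A quick sanity check on that claim (not in the original notebook; assumes numpy):

import numpy as np

print(np.allclose(similarity, similarity.T)) #True -- the matrix is symmetric
print(similarity.diagonal()[:5]) #each text has similarity 1.0 with itself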


In [30]:
similarity[100]
#this gives the similarity of row 100 to each of the other rows


Out[30]:
array([ 0.51351908,  0.37559544,  0.43518157,  0.55551997,  0.46036488,
        0.5228255 ,  0.56056587,  0.40698762,  0.29528043,  0.22663447,
        0.43146373,  0.404392  ,  0.44087255,  0.39765911,  0.75170812,
        0.4563751 ,  0.42273969,  0.53307405,  0.43241179,  0.52863713,
        0.43245394,  0.50163583,  0.55255707,  0.36087616,  0.52151398,
        0.44422755,  0.44204217,  0.44389512,  0.47398092,  0.31779759,
        0.49221207,  0.46843142,  0.56495427,  0.63240387,  0.33069644,
        0.39599524,  0.62049587,  0.51382474,  0.54573105,  0.46795562,
        0.54572092,  0.27907631,  0.31736056,  0.38786122,  0.45626854,
        0.33554468,  0.39141669,  0.39123794,  0.39268033,  0.39480131,
        0.18210545,  0.42944428,  0.4641688 ,  0.38645964,  0.44868397,
        0.36583276,  0.30356115,  0.4207236 ,  0.47623871,  0.74338913,
        0.70385898,  0.456456  ,  0.45573056,  0.41167837,  0.48595775,
        0.4969808 ,  0.51340512,  0.48723885,  0.39048843,  0.55268064,
        0.44370596,  0.46620405,  0.59864036,  0.5453561 ,  0.19284076,
        0.41974821,  0.46478893,  0.28167759,  0.4344191 ,  0.35451293,
        0.42340573,  0.47227169,  0.43494646,  0.45913277,  0.43349771,
        0.38247543,  0.39892623,  0.27673982,  0.34213901,  0.64612329,
        0.3626567 ,  0.42551363,  0.51860678,  0.56070145,  0.45950563,
        0.37292824,  0.47305083,  0.55650062,  0.51663525,  0.47743723,
        1.        ,  0.74504107,  0.5418364 ,  0.446839  ,  0.42183918,
        0.46757646,  0.40592573,  0.43250842,  0.42077958,  0.45397224,
        0.31944606,  0.54158949,  0.41347637,  0.37601824,  0.310596  ,
        0.4538872 ,  0.73505269,  0.25404726,  0.31252659,  0.36519023,
        0.38497747,  0.38380178,  0.29363724,  0.41179583,  0.54176381,
        0.73378491,  0.37935869,  0.40331248,  0.53966992,  0.37759265,
        0.38172116,  0.4010848 ,  0.36563491,  0.40777916,  0.44576063,
        0.30141366,  0.21701385,  0.46957471,  0.49328697,  0.55739909,
        0.42608044,  0.61105579,  0.41359103,  0.75236073,  0.56035999,
        0.52365363,  0.43469125,  0.4805973 ,  0.55483122,  0.30195174,
        0.46054438,  0.55915787,  0.4253061 ,  0.40968898,  0.34119052,
        0.47001332,  0.47038965,  0.54626945,  0.37483056,  0.57350522,
        0.80329149,  0.7243073 ,  0.41505792,  0.35195044,  0.39476144,
        0.72086426,  0.36628347,  0.32496974,  0.32097834,  0.63363847,
        0.57753911,  0.36209657,  0.38444127,  0.35573989,  0.41862441,
        0.43178962,  0.443835  ,  0.5608355 ,  0.444154  ,  0.48375123,
        0.28522753,  0.37559622,  0.47222523,  0.51180251,  0.41265869,
        0.54065297,  0.57687612,  0.64723645,  0.47474714,  0.38660376,
        0.68458167,  0.39413766,  0.38010527,  0.59400378,  0.57957167,
        0.37098426,  0.43403555,  0.32639742,  0.36205656,  0.51786785,
        0.41269349,  0.38620109,  0.50374064,  0.47638095,  0.33258194,
        0.37493858,  0.443436  ,  0.57791602,  0.65232685,  0.30007518,
        0.3492306 ,  0.32146507,  0.37118884,  0.45313594,  0.28738624,
        0.58066833,  0.37931137,  0.44479053,  0.66170862,  0.39806686,
        0.44628435,  0.413084  ,  0.34530214,  0.32644008,  0.39825262,
        0.35834356,  0.66729768,  0.52277553,  0.31125011,  0.32153004,
        0.41817371,  0.34185216,  0.41832183,  0.3425447 ,  0.45959116,
        0.33285397,  0.45256228,  0.37477928,  0.31155022,  0.30429663,
        0.52773434,  0.74161916,  0.3494954 ,  0.47205325,  0.38833223,
        0.51329957,  0.51852037,  0.38344333,  0.58984198,  0.26422658,
        0.33376944,  0.42060176,  0.27627799,  0.35126916,  0.37932132,
        0.3915702 ,  0.45151486,  0.5374713 ,  0.45565082,  0.37399019,
        0.4328711 ,  0.45998098,  0.40747782,  0.63226527,  0.51432353,
        0.49040904,  0.5749578 ,  0.57524544,  0.43183206,  0.40378449,
        0.37574633,  0.21411994,  0.4464823 ,  0.36864902,  0.32271726,
        0.34755168,  0.38877693,  0.7058656 ,  0.53344296,  0.40329998,
        0.5777419 ,  0.50958067,  0.54379815,  0.36165463,  0.48269786,
        0.46860293,  0.42628022,  0.2602712 ,  0.50098935,  0.41036843,
        0.4474139 ,  0.32264786,  0.52424594,  0.35982701])

HOMEWORK EXERCISE:

for a given document, find the most similar documents and give their titles from the toc.csv metadata we read in at the start!
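A hedged starting point, not a full solution--and it assumes the rows of toc.csv are in the same order as the files tm.readtextfiles returned, which you should verify before trusting any output:

import numpy as np

doc=100 #pick a document
#sort its similarity row from most to least similar; position 0 is the document itself
most_similar=np.argsort(similarity[doc])[::-1][1:6]
print(most_similar) #indices of the five most similar narratives
print(title_info.iloc[most_similar]) #their metadata rows -- only meaningful if the ordering assumption holds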

supervised vs. unsupervised learning

slides from class omitted

first example of unsupervised learning

hierarchical clustering

This time we're interested in relations among the words, not the texts.

In other words, we're interested in the similarities between one column and another--one term and another term

So we'll work with the transposed matrix--the term-document matrix, rather than the document-term matrix.

For a description of hierarchical clustering, look at the example at https://en.wikipedia.org/wiki/Hierarchical_clustering


In [31]:
term_document_matrix=document_term_matrix.T
# .T is the easy transposition method for a
# matrix in python's matrix packages.

In [32]:
# import a bunch of packages we need
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram

In [33]:
#distance is 1-similarity, so:

dist=1-cosine_similarity(term_document_matrix)

# ward is an algorithm for hierarchical clustering

linkage_matrix=ward(dist)

#plot dendrogram

f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()


OMG U...G...L...Y!

WHAT THE? This is nonsense

what's the problem?

we just tried to plot a bunch o' features!

we need only the most significant words!

one way to do this: raise the min_df parameter in the vectorizer (as a float, min_df keeps only terms that appear in at least that fraction of the documents)

vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True)

more an art than a science


In [34]:
vectorizer=TfidfVectorizer(min_df=.96, stop_words='english', use_idf=True)
#try a very high min_df--only terms appearing in at least 96% of the documents survive

In [35]:
#rerun the model
document_term_matrix=vectorizer.fit_transform(our_texts)
vocab=vectorizer.get_feature_names()

In [36]:
#check the length of the vocab
len(vocab)


Out[36]:
52

In [37]:
#switch again to the term_document_matrix
term_document_matrix=document_term_matrix.T

In [38]:
dist=1-cosine_similarity(term_document_matrix)
linkage_matrix=ward(dist)

#plot dendrogram

f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()


is this significant? Are there interesting patterns to seek out?

here's what we're up to:

Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.

. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.

  • Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"

need to elicit patterns and avoid bad magical thinking!

Key assignment: BRAINSTORM texts you wish to mine!