Computing in Context

Social Science Section

lecture two

OK, so we didn't get into the code this week...

History at Scale: what would you do if you could read at scale?

Choose to make tractable--choose to lose something, in order to learn something we couldn't otherwise!

The EASIER Way

Awesome tools to automate process

http://voyant-tools.org/

http://papermachines.org/

The HARDER Way

Why, oh why?

  • flexibility
    • overcome limits of affordances
  • can't always depend on ready-made tools
  • understand epistemic trade-offs
  • gateway to a vast array of work in the data sciences more generally

computers not so smart

computers not so good with non-numerical things

Vectorizing text, or: the world is too hard to understand easily


In [1]:
%matplotlib inline
import pandas as pd

In [2]:
def document_vector(wordstring): 
    """put yer documentation here friend"""
    wordlist = wordstring.split()                 #tokenize on whitespace
    set_of_words=set(wordlist)                    #collapse to the distinct words
    distinct_words=list(set_of_words)
    wordfreq = [wordlist.count(w) for w in distinct_words]  #count how often each distinct word appears
    return distinct_words, wordfreq

In [3]:
x,y = document_vector("I like to eat green apples but only if I eat them like green grapes like my green friend gripes")

In [4]:
x


Out[4]:
['only',
 'them',
 'green',
 'my',
 'but',
 'eat',
 'I',
 'grapes',
 'to',
 'apples',
 'if',
 'gripes',
 'like',
 'friend']

In [5]:
y


Out[5]:
[1, 1, 3, 1, 1, 2, 2, 1, 1, 1, 1, 1, 3, 1]

In [6]:
documents=pd.DataFrame(y, index=x)

In [7]:
documents


Out[7]:
0
only 1
them 1
green 3
my 1
but 1
eat 2
I 2
grapes 1
to 1
apples 1
if 1
gripes 1
like 3
friend 1

cool beans--we've vectorized our sentence

what has been lost?


In [8]:
documents[0].sort_values()   #.order() in older pandas versions


Out[8]:
only      1
them      1
my        1
but       1
grapes    1
to        1
apples    1
if        1
gripes    1
friend    1
eat       2
I         2
green     3
like      3
Name: 0, dtype: int64

In [9]:
documents[0].describe()


Out[9]:
count    14.000000
mean      1.428571
std       0.755929
min       1.000000
25%       1.000000
50%       1.000000
75%       1.750000
max       3.000000
Name: 0, dtype: float64

Once we convert texts into vectors, the computer doesn't care or know that we're dealing with something formerly known as texts. It's just another kind of vector . . .

Term frequency [TF]

The "term frequency" measures how often a given term occurs in a given document.

We count up how many times each term appears in a document, then divide it by the number of terms in the document.

For a term $t$ that appears $i_t$ times in a document $D$ containing $n_D$ words, the term frequency is

$tf(t,D)=\frac{i_t}{n_D}$

Can you think of a problem with this as a measure?


In [10]:
def document_vector_freq(wordstring): 
    """put yer documentation here friend"""
    wordlist = wordstring.split()
    number_of_words=len(wordlist)                 #total number of words in the document
    set_of_words=set(wordlist)
    distinct_words=list(set_of_words)
    wordfreq = [wordlist.count(w) for w in distinct_words]
    wordfreq = [word_freq/number_of_words for word_freq in wordfreq]  #normalize counts into frequencies
    return distinct_words, wordfreq
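
A quick check of the normalized version, reusing the sample sentence from above (assuming the cell defining document_vector_freq has been run):

words, freqs = document_vector_freq("I like to eat green apples but only if I eat them like green grapes like my green friend gripes")
list(zip(words, freqs))   #each frequency is a count divided by 20, the number of words in the sentence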

Inverse document frequency

Often we'll weight the term frequency by some measure of how unusual each word is across all the documents in question.

If we were reading general political news stories for the last few years, "Obama" appears a lot in each document and in lots of documents. "Butte" appears, say, a lot in one document but not at all in the rest.

We want something that will help us to see that "Butte" is really significant for capturing something distinctive about that document, whereas "Obama" wouldn't be.

So we compute the inverse document frequency. [IDF]

You divide the total number of documents ($N$) by one plus the number of documents containing each word $t$ ($n_w$):

$\frac{N}{1+n_w}$

Think about what this does: for a word that appears in every document, the scaling factor will be

$\frac{N}{1+n_w}=\frac{N}{1+N}\approx 1$

whereas a word that appears in only one document will have a much bigger scaling factor:

$\frac{N}{1+n_w}=\frac{N}{2}$.

Typically, we take the log of this to get:

$idf(t,D)=\log(\frac{N}{1+n_w})$.

So we'll compute what's called tf-idf in the biz by multiplying the term frequency and the inverse document frequency:

$tfidf=tf\times idf$

As so often, this is not a neutral choice:

  • if we pick $tf$ by itself, we want the most frequent words in each document, normalized by document length.

  • if we pick $tfidf$, then we are saying we want to work with the most frequent words that are also unusual across our particular set of documents.

If we use tf-idf on a set of documents about the CIA from 2000-2010, "intelligence" would appear in most of them, we'd guess, and so the measure would down-play it in favor of what makes each document more distinctive within the corpus.
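
As a minimal sketch of the formulas above (not the implementation we'll actually use; the helper names and toy corpus are just for illustration), computing tf-idf by hand might look like this:

import math

def tf(term, document):
    words = document.split()
    return words.count(term) / len(words)

def idf(term, documents):
    n_w = sum(1 for d in documents if term in d.split())   #number of documents containing the term
    return math.log(len(documents) / (1 + n_w))

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

toy_corpus = ["the cat sat", "the dog sat", "the dog barked at the cat"]
print(tf_idf("barked", toy_corpus[2], toy_corpus))   #rare across the corpus: positive weight
print(tf_idf("the", toy_corpus[2], toy_corpus))      #appears in every document: down-played (negative here)

Note that scikit-learn's TfidfVectorizer, which we use below, applies a smoothed variant of idf, so its numbers won't match this sketch exactly.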

What we need to clean up text

tokenization

making .split much better. Examples?? (a sketch follows below)

stemming:

  • converting inflected forms into some normalized forms
    • e.g. "chefs" --> "chef"
    • "goes" --> "go"
    • "children" --> "child"

stopwords

they are the words you don't want to be included: "from" "to" "a" "they" "she" "he"

others?
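
One possible sketch of this cleanup, putting the pieces together--the regex tokenizer, the tiny hand-rolled stopword list, and the choice of NLTK's Porter stemmer are all just illustrative assumptions here, not the tools we'll settle on:

import re
from nltk.stem.porter import PorterStemmer   #assumes NLTK is installed

stopword_list = {"from", "to", "a", "the", "they", "she", "he", "i", "but", "if"}
stemmer = PorterStemmer()

def clean_tokens(text):
    tokens = re.findall(r"[a-z]+", text.lower())            #a tokenizer a bit better than .split()
    tokens = [t for t in tokens if t not in stopword_list]  #drop stopwords
    return [stemmer.stem(t) for t in tokens]                #normalize inflected forms (output may not be a dictionary word)

clean_tokens("She goes to eat green apples with her friends")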

Let's get some data!

Documenting the American South


In [11]:
directory="/Users/mljones/Downloads/na-slave-narratives/data/texts/"   #PUT YOUR DIRECTORY HERE!

In [12]:
import sys, os

In [13]:
##I will provide a set of black boxes for this sort of thing soon; then you will import textmining_blackboxes

In [14]:
os.chdir(directory)
files=[file for file in os.listdir(".") if not file.startswith('.')] #defeat hidden files
files=[file for file in files if not os.path.isdir(file)] #defeat directories

articles=[]
file_titles=[]
for file in files:
    with open(file, encoding="UTF-8") as plaintext:
        lines=plaintext.readlines()
        #lines=[str(line) for line in lines]
        article=" ".join(lines) #alter lines if want to skip lines
        articles.append(article)
        file_titles.append(file) #keep track of file names

In [15]:
articles[2][:500]


Out[15]:
'\n Wm. W. Brown.[Frontispiece Image]\n [Title Page Image]\n [Title Page Verso Image]\n \n _______Is there not some chosen curse,\n Some hidden thunder in the stores of heaven,\n Red with uncommon wrath, to blast the man\n who gains his fortune from the blood of souls !\n \n Cowper.\n PREFACE\n TO WELLS BROWN, OF OHIO.\n THIRTEEN years ago, I came to your door, a weary fugitive from chains and stripes. I was a stranger, and you took me in. I was hungry, and you fed me. Naked was I, and you clothed me. Even a '

In [16]:
import re

In [17]:
re.sub('\n', '', articles[2])[:500]


Out[17]:
' Wm. W. Brown.[Frontispiece Image] [Title Page Image] [Title Page Verso Image]  _______Is there not some chosen curse, Some hidden thunder in the stores of heaven, Red with uncommon wrath, to blast the man who gains his fortune from the blood of souls !  Cowper. PREFACE TO WELLS BROWN, OF OHIO. THIRTEEN years ago, I came to your door, a weary fugitive from chains and stripes. I was a stranger, and you took me in. I was hungry, and you fed me. Naked was I, and you clothed me. Even a name by which'


Here's our help!

Python Libraries

Python has an embarrassment of riches when it comes to working with texts. Some libraries are higher level, with simpler, well-thought-out defaults, namely pattern and TextBlob. The most general, longest in development, and foundational is the Natural Language Toolkit--NLTK. The ideas we'll learn today are key--they have slightly different instantiations in the different tools. Not everything is yet in Python 3.

For now, we'll play with the cool scientists and use the powerful and fast scikit-learn package.


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer=TfidfVectorizer(min_df=0.95, stop_words='english')  
#.95 is a VERY high threshold--only the most common words--chosen for the form of visualization we're going to do

In [22]:
document_term_matrix=vectorizer.fit_transform(articles)

In [23]:
document_term_matrix.shape


Out[23]:
(294, 63)

In [24]:
#output is number of documents, then size of remaining vocabulary
rows, terms=document_term_matrix.shape

In [25]:
vocab=vectorizer.get_feature_names()   #in newer scikit-learn this is get_feature_names_out()

In [26]:
len(vocab)


Out[26]:
63

In [27]:
dtm=document_term_matrix.toarray()
dtmdf=pd.DataFrame(dtm, columns=vocab)

put into pandas just so we could explore it a bit more elegantly


In [28]:
dtmdf


Out[28]:
away better born brought called came children come country day ... things thought time took way went white work year years
0 0.062433 0.033243 0.035528 0.049360 0.059921 0.096732 0.030062 0.076005 0.051750 0.156397 ... 0.146825 0.060125 0.272108 0.085984 0.202831 0.162316 0.137111 0.074040 0.002733 0.077396
1 0.082805 0.031445 0.005816 0.071007 0.084052 0.209683 0.080460 0.086656 0.124629 0.273618 ... 0.027514 0.052348 0.282063 0.091498 0.088040 0.106514 0.055444 0.149795 0.041684 0.068160
2 0.067857 0.057208 0.023763 0.041727 0.074008 0.160624 0.068319 0.041304 0.053286 0.204558 ... 0.015055 0.098022 0.290007 0.125579 0.066716 0.123493 0.134121 0.092395 0.050496 0.113127
3 0.079088 0.086923 0.008167 0.040973 0.099704 0.110405 0.104126 0.064892 0.134295 0.162387 ... 0.047600 0.028584 0.223020 0.072265 0.081746 0.084883 0.264273 0.188474 0.102084 0.119628
4 0.136277 0.019869 0.034301 0.068836 0.058603 0.211983 0.249909 0.087607 0.078137 0.280421 ... 0.084441 0.078403 0.194210 0.105991 0.071779 0.247379 0.167172 0.132754 0.053902 0.119631
5 0.058159 0.077162 0.029278 0.017626 0.023343 0.103627 0.187377 0.133767 0.029178 0.323734 ... 0.083097 0.076122 0.311318 0.103627 0.182982 0.075351 0.047003 0.193889 0.146388 0.160109
6 0.027531 0.042146 0.013859 0.013907 0.151937 0.177143 0.083157 0.068829 0.041437 0.255415 ... 0.014049 0.041578 0.455511 0.095385 0.094740 0.192068 0.111252 0.069533 0.138594 0.108275
7 0.208734 0.012531 0.030905 0.031010 0.049281 0.164082 0.228698 0.061392 0.012320 0.179858 ... 0.012531 0.067991 0.256924 0.048617 0.132793 0.220265 0.074424 0.223273 0.098896 0.114685
8 0.079807 0.122173 0.040175 0.000000 0.080079 0.000000 0.160701 0.039904 0.080079 0.077936 ... 0.000000 0.080351 0.194180 0.039500 0.078466 0.039769 0.040312 0.080624 0.040175 0.078466
9 0.251497 0.030800 0.040513 0.010163 0.020188 0.009958 0.040513 0.130778 0.040376 0.176831 ... 0.030800 0.060770 0.205605 0.019916 0.049454 0.010026 0.020326 0.030488 0.020257 0.059345
10 0.145591 0.023687 0.016995 0.074606 0.101625 0.273615 0.076478 0.156141 0.044461 0.166903 ... 0.053835 0.082851 0.242320 0.089812 0.138995 0.254448 0.089527 0.230212 0.063731 0.053938
11 0.140240 0.035781 0.010085 0.070837 0.115588 0.148736 0.201706 0.115197 0.010051 0.278792 ... 0.020446 0.055469 0.214480 0.133862 0.054168 0.214639 0.136614 0.141674 0.015128 0.054168
12 0.127418 0.058885 0.019364 0.069218 0.075987 0.198711 0.342496 0.153863 0.025329 0.172557 ... 0.047844 0.157330 0.171974 0.070203 0.056729 0.147352 0.077718 0.049788 0.037517 0.112275
13 0.169279 0.029616 0.014608 0.009772 0.072795 0.196292 0.267820 0.217645 0.009706 0.203094 ... 0.039488 0.102259 0.221235 0.057451 0.095105 0.134965 0.083062 0.073290 0.073042 0.114126
14 0.026950 0.034380 0.094966 0.037435 0.141967 0.050019 0.078008 0.053900 0.074364 0.121719 ... 0.072198 0.027133 0.219665 0.120047 0.129172 0.057074 0.122514 0.462831 0.142449 0.308025
15 0.081961 0.033459 0.012378 0.053820 0.053456 0.174433 0.024756 0.086059 0.020560 0.348170 ... 0.008365 0.107275 0.327050 0.097358 0.120875 0.208294 0.028980 0.244259 0.045386 0.080584
16 0.095078 0.027493 0.049458 0.062433 0.082681 0.141174 0.114870 0.098247 0.041340 0.273902 ... 0.016172 0.060626 0.348545 0.045489 0.084132 0.034744 0.096050 0.089647 0.084557 0.162032
17 0.114592 0.024621 0.024289 0.046712 0.108929 0.189053 0.095131 0.102529 0.052447 0.178654 ... 0.051293 0.064770 0.225007 0.073631 0.108712 0.186332 0.109671 0.081238 0.052626 0.124525
18 0.107923 0.020979 0.015522 0.067493 0.072193 0.289969 0.036219 0.056531 0.077350 0.135504 ... 0.015735 0.087961 0.265089 0.234010 0.192006 0.317551 0.103835 0.083068 0.067264 0.101056
19 0.079111 0.044663 0.022031 0.061215 0.106402 0.158286 0.088122 0.097626 0.153692 0.193960 ... 0.048099 0.108458 0.324357 0.104969 0.100950 0.070455 0.100325 0.045911 0.050840 0.084400
20 0.107135 0.007289 0.079100 0.036077 0.085999 0.247451 0.050337 0.142846 0.021500 0.132521 ... 0.051024 0.079100 0.264146 0.084840 0.196623 0.149481 0.158738 0.122661 0.021573 0.126401
21 0.028692 0.064665 0.030091 0.027778 0.046784 0.055621 0.046943 0.069340 0.153546 0.093399 ... 0.082967 0.034906 0.265285 0.028402 0.166910 0.040510 0.286236 0.234303 0.043332 0.157507
22 0.046076 0.035624 0.015463 0.029621 0.060243 0.134065 0.057636 0.083775 0.096668 0.192254 ... 0.047024 0.046390 0.383206 0.037317 0.161988 0.100190 0.252484 0.355453 0.088562 0.138651
23 0.143845 0.025311 0.019976 0.070153 0.114472 0.196399 0.089891 0.079363 0.089586 0.184066 ... 0.030373 0.059927 0.207580 0.098199 0.092659 0.108754 0.045098 0.100218 0.064921 0.126797
24 0.114106 0.031760 0.007833 0.031438 0.085870 0.197664 0.174934 0.225618 0.010408 0.200067 ... 0.037053 0.078329 0.201914 0.069311 0.114737 0.173164 0.094314 0.089074 0.046997 0.130036
25 0.018052 0.018424 0.018175 0.091185 0.289820 0.232306 0.054526 0.036105 0.181137 0.035258 ... 0.073694 0.018175 0.070277 0.089349 0.124243 0.089956 0.000000 0.054711 0.145402 0.266234
26 0.129467 0.032031 0.031600 0.035670 0.043302 0.066020 0.078999 0.133390 0.062985 0.118768 ... 0.044043 0.063199 0.110730 0.139807 0.123433 0.121209 0.233839 0.063414 0.047399 0.123433
27 0.034518 0.105683 0.013901 0.013948 0.055416 0.109340 0.132061 0.138072 0.027708 0.175284 ... 0.077501 0.083407 0.322506 0.027335 0.081451 0.110084 0.097639 0.132510 0.006951 0.149326
28 0.062960 0.034135 0.045560 0.079504 0.112528 0.083746 0.077254 0.035415 0.157934 0.122966 ... 0.038151 0.051503 0.239356 0.050637 0.079311 0.049021 0.125219 0.125219 0.069331 0.208917
29 0.051788 0.000000 0.010428 0.052318 0.093535 0.061517 0.020856 0.041430 0.187070 0.171949 ... 0.052853 0.062568 0.352815 0.215308 0.050917 0.165160 0.031391 0.073245 0.020856 0.050917
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
264 0.063563 0.064870 0.030116 0.075545 0.138814 0.159151 0.127992 0.093476 0.093793 0.160659 ... 0.022895 0.120463 0.149198 0.099932 0.176456 0.178866 0.109541 0.117095 0.022587 0.143371
265 0.116409 0.044551 0.011720 0.023520 0.081763 0.106588 0.123061 0.078576 0.029201 0.147782 ... 0.071281 0.125991 0.334216 0.074900 0.111590 0.133417 0.026460 0.102899 0.046880 0.094423
266 0.100140 0.063529 0.021799 0.035544 0.074681 0.105824 0.104908 0.116378 0.029872 0.132150 ... 0.056624 0.113083 0.293697 0.069656 0.091804 0.115984 0.062885 0.120303 0.031336 0.191591
267 0.101758 0.058107 0.021954 0.031819 0.074147 0.110321 0.098791 0.121140 0.026741 0.128946 ... 0.051925 0.117086 0.273525 0.068351 0.092901 0.113486 0.064861 0.111365 0.030491 0.202477
268 0.107442 0.045465 0.007915 0.042357 0.099919 0.145265 0.116089 0.125786 0.055218 0.143309 ... 0.042791 0.129280 0.209135 0.108949 0.097907 0.117525 0.158840 0.050299 0.029022 0.103060
269 0.145391 0.017121 0.011260 0.079088 0.117830 0.110707 0.140751 0.111839 0.056110 0.131060 ... 0.028535 0.106971 0.174155 0.099637 0.065976 0.139326 0.282458 0.056492 0.011260 0.060478
270 0.026587 0.013567 0.040153 0.094008 0.093372 0.118433 0.147226 0.039881 0.040017 0.220693 ... 0.176372 0.080305 0.349326 0.026318 0.065351 0.185482 0.214875 0.161156 0.013384 0.065351
271 0.289544 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.217158 0.000000 0.141377 ... 0.036937 0.145757 0.140898 0.143307 0.035585 0.180352 0.036563 0.036563 0.072879 0.177923
272 0.091124 0.030999 0.037377 0.081828 0.077888 0.076839 0.054367 0.050625 0.250596 0.171383 ... 0.024111 0.081551 0.239781 0.073498 0.096229 0.100906 0.068190 0.061371 0.050969 0.073001
273 0.064605 0.032966 0.032522 0.065265 0.097236 0.127901 0.032522 0.129209 0.000000 0.189268 ... 0.000000 0.032522 0.282941 0.063951 0.190556 0.000000 0.000000 0.065265 0.032522 0.190556
274 0.033574 0.036280 0.013919 0.063844 0.107008 0.129026 0.097430 0.073072 0.025761 0.098359 ... 0.076590 0.093453 0.413246 0.136845 0.060193 0.145649 0.055863 0.101751 0.015907 0.062135
275 0.070584 0.024012 0.015792 0.055460 0.062955 0.124213 0.094753 0.086270 0.125909 0.114881 ... 0.008004 0.039480 0.366377 0.178557 0.084820 0.171955 0.023769 0.079229 0.157922 0.185061
276 0.074653 0.027705 0.034164 0.054848 0.108955 0.147795 0.129824 0.074653 0.108955 0.132550 ... 0.006926 0.054663 0.317042 0.154513 0.080071 0.162328 0.034280 0.075417 0.157155 0.213522
277 0.038028 0.052507 0.051800 0.042936 0.078559 0.081929 0.045043 0.082767 0.170585 0.146360 ... 0.052507 0.040539 0.215531 0.059786 0.114365 0.053505 0.207903 0.223722 0.074321 0.202339
278 0.046682 0.082580 0.056400 0.051876 0.076507 0.075476 0.072067 0.065355 0.229521 0.165635 ... 0.079404 0.047000 0.168103 0.083178 0.091795 0.063583 0.108467 0.073884 0.037600 0.105565
279 0.139138 0.025357 0.030018 0.050200 0.119666 0.078703 0.060036 0.074538 0.064819 0.281455 ... 0.030428 0.055033 0.256321 0.078703 0.073285 0.178286 0.055220 0.381523 0.085052 0.141685
280 0.035987 0.062153 0.019509 0.032160 0.062496 0.105497 0.027870 0.077510 0.202766 0.162197 ... 0.055090 0.037625 0.304437 0.054804 0.125197 0.088283 0.316006 0.320201 0.160255 0.164661
281 0.076505 0.031231 0.015405 0.023186 0.153530 0.136315 0.046215 0.076505 0.061412 0.216661 ... 0.062462 0.107836 0.379735 0.212046 0.082741 0.038123 0.061830 0.131388 0.084728 0.105306
282 0.071054 0.041437 0.030659 0.058108 0.149382 0.113876 0.105604 0.067671 0.190122 0.231293 ... 0.037984 0.068131 0.243683 0.113876 0.083167 0.030349 0.150398 0.109381 0.047692 0.139720
283 0.081500 0.049906 0.016411 0.123501 0.089955 0.153284 0.106672 0.057050 0.024533 0.175096 ... 0.033271 0.131288 0.245891 0.048405 0.032052 0.113714 0.123501 0.164668 0.041028 0.072117
284 0.055471 0.078851 0.017951 0.016011 0.075537 0.256897 0.021940 0.116884 0.085476 0.139293 ... 0.105135 0.349051 0.229442 0.133351 0.054538 0.260619 0.030020 0.080054 0.025929 0.035060
285 0.086788 0.030542 0.021091 0.024186 0.102097 0.216256 0.063274 0.143649 0.090086 0.181195 ... 0.054976 0.057248 0.372815 0.106647 0.176543 0.101407 0.139072 0.142095 0.063274 0.147119
286 0.102295 0.030517 0.020598 0.068364 0.102643 0.110607 0.107745 0.043279 0.084483 0.218235 ... 0.045775 0.032482 0.242769 0.126965 0.066535 0.101949 0.088238 0.183630 0.083186 0.290896
287 0.092055 0.036134 0.028518 0.078690 0.021316 0.049067 0.071294 0.084974 0.071052 0.276604 ... 0.050588 0.057035 0.158509 0.042057 0.027849 0.007057 0.021461 0.028614 0.057035 0.153167
288 0.042836 0.120220 0.005391 0.059501 0.118198 0.058303 0.032345 0.069608 0.102080 0.094120 ... 0.087433 0.070082 0.250136 0.068903 0.115818 0.016009 0.075729 0.032455 0.037736 0.152669
289 0.039550 0.051896 0.011377 0.045662 0.119054 0.044743 0.051196 0.124300 0.005669 0.110350 ... 0.074961 0.045508 0.197958 0.033557 0.094436 0.078833 0.034247 0.102740 0.068262 0.038885
290 0.132217 0.044979 0.000000 0.029682 0.103185 0.072710 0.221861 0.000000 0.014741 0.100424 ... 0.014993 0.000000 0.185870 0.087253 0.101107 0.073205 0.148410 0.089046 0.044372 0.476646
291 0.007421 0.000000 0.018679 0.014994 0.078184 0.040402 0.048564 0.055657 0.003723 0.105080 ... 0.015147 0.059771 0.090279 0.022037 0.091202 0.059166 0.037484 0.007497 0.041093 0.062017
292 0.109306 0.043256 0.017967 0.038310 0.078341 0.174445 0.038181 0.118229 0.035813 0.137241 ... 0.025043 0.074116 0.284408 0.066245 0.122822 0.280122 0.085635 0.092396 0.047164 0.067991
293 0.155011 0.020635 0.000000 0.034043 0.027050 0.040028 0.027142 0.384157 0.006762 0.342240 ... 0.013756 0.000000 0.288608 0.040028 0.019879 0.047018 0.006809 0.204256 0.040713 0.033132

294 rows × 63 columns

Now we can throw a wide variety of mining algorithms at our data!

Similarity and dissimilarity

We reduced our text to a vector of term-weights. What can we do once we've committed this violence on the text?

We can measure distance and similarity

I know. Crazy talk.

Right now our text is just a series of numbers, indexed to words. We can treat it like any other vector.

And the key way to distinguish two vectors is by measuring their distance or computing their similarity (1 - distance).

You already know how, though you may have buried it along with memories of high school.

Many distance metrics to choose from

key one in textual analysis:

cosine similarity

If $\mathbf{a}$ and $\mathbf{b}$ are vectors, then

$\mathbf{a}\cdot\mathbf{b}=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta$

Or

$\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum\limits_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum\limits_{i=1}^{n}{(B_i)^2}} }$
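
The comment in the next cell notes that this is easy to program; a minimal sketch of the formula (the function name and toy vectors are just for illustration) might be:

import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))        #dot product A.B
    norm_a = math.sqrt(sum(x * x for x in a))     #||A||
    norm_b = math.sqrt(sum(y * y for y in b))     #||B||
    return dot / (norm_a * norm_b)

cosine_sim([1, 0, 2], [1, 1, 1])   #about 0.77: the cosine of the angle between two term-weight vectors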


In [29]:
#easy to program, but let's use a robust version
from sklearn.metrics.pairwise import cosine_similarity

In [30]:
#cosine similarity is vectorized: that means it will operate on an entire matrix, not just its individual elements

In [31]:
similarity=cosine_similarity(dtmdf)

In [32]:
similarity


Out[32]:
array([[ 1.        ,  0.68264883,  0.76388739, ...,  0.54432244,
         0.82308187,  0.74060483],
       [ 0.68264883,  1.        ,  0.87835427, ...,  0.41133783,
         0.76848862,  0.6303793 ],
       [ 0.76388739,  0.87835427,  1.        , ...,  0.51797058,
         0.86520961,  0.61500274],
       ..., 
       [ 0.54432244,  0.41133783,  0.51797058, ...,  1.        ,
         0.70739313,  0.50234307],
       [ 0.82308187,  0.76848862,  0.86520961, ...,  0.70739313,
         1.        ,  0.70527606],
       [ 0.74060483,  0.6303793 ,  0.61500274, ...,  0.50234307,
         0.70527606,  1.        ]])

In [33]:
import matplotlib.pyplot as plt 
#we can make a heatmap with no problems within matplotlib
#pass plt.pcolor our similarity matrix
plt.pcolor(similarity, norm=None, cmap='Blues')


Out[33]:
<matplotlib.collections.PolyCollection at 0x1150b0210>

In [34]:
#we have too many documents for that to be very useful; so
plt.pcolor(similarity[100:110, 100:110], norm=None, cmap='Blues')


Out[34]:
<matplotlib.collections.PolyCollection at 0x114eafa10>

supervised vs. unsupervised learning


In [35]:
##first example of unsupervised learning
###hierarchical clustering

In [36]:
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram
dtm=document_term_matrix
dtm_trans=dtm.T
dist=1-cosine_similarity(dtm_trans)
linkage_matrix=ward(dist)

#plot dendrogram

f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()


is this significant? Are there interesting patterns to seek out?

here's what we're up to:

Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.

. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.

  • Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"

need to elicit patterns and avoid bad magical thinking!


In [ ]: