In [1]:
%matplotlib inline
import pandas as pd
In [2]:
def document_vector(wordstring):
    """put yer documentation here friend"""
    wordlist = wordstring.split()
    set_of_words = set(wordlist)
    distinct_words = list(set_of_words)
    wordfreq = [wordlist.count(w) for w in distinct_words]
    return distinct_words, wordfreq
In [3]:
x,y = document_vector("I like to eat green apples but only if I eat them like green grapes like my green friend gripes")
In [4]:
x
Out[4]:
In [5]:
y
Out[5]:
In [6]:
documents=pd.DataFrame(y, index=x)
In [7]:
documents
Out[7]:
In [8]:
documents[0].sort_values() #Series.order() in older pandas; sort_values() is the current name
Out[8]:
In [9]:
documents[0].describe()
Out[9]:
Once we convert texts into vectors, the computer doesn't care or know that we're dealing with something formerly known as text. It's just another kind of vector . . .
The "term frequency" measures how often a given term occurs in a given document.
We count up how many times each term appears in a document, then divide that count by the number of terms in the document.
For a term $t$ that appears $i_t$ times in a document $D$ containing $n_D$ words, the term frequency is
$tf(t,D)=\frac{i_t}{n_D}$
Can you think of a problem with this as a measure?
In [10]:
def document_vector_freq(wordstring):
    """put yer documentation here friend"""
    wordlist = wordstring.split()
    number_of_words = len(wordlist)
    set_of_words = set(wordlist)
    distinct_words = list(set_of_words)
    wordfreq = [wordlist.count(w) for w in distinct_words]
    wordfreq = [word_freq/number_of_words for word_freq in wordfreq]
    return distinct_words, wordfreq
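To see the normalization at work, here's a quick check (my addition, reusing the sentence from above); since each count is divided by the number of words, the frequencies should sum to 1.
In [ ]:
#a quick check of the normalized version (this cell is my addition, not from the original notebook)
words, freqs = document_vector_freq("I like to eat green apples but only if I eat them like green grapes like my green friend gripes")
pd.Series(freqs, index=words).sort_values(ascending=False)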
Often we'll scale the term frequency by some measure of how unusual each word is across all the documents in question.
If we were reading general political news stories from the last few years, "Obama" would appear a lot in each document and in lots of documents. "Butte" might appear, say, a lot in one document but not at all in the rest.
We want something that will help us to see that "Butte" is really significant for capturing something distinctive about that document, whereas "Obama" wouldn't be.
So we compute the inverse document frequency (IDF).
You divide the total number of documents ($N$) by one plus the number of documents containing each word $t$ ($n_w$):
$\frac{N}{1+n_w}$
Think about what this does: for a word that appears in every document, the factor will be
$\frac{N}{1+n_w}=\frac{N}{1+N}\approx 1$
Whereas a word that appears in only one document will have a much bigger scaling factor:
$\frac{N}{1+n_w}=\frac{N}{2}$.
Typically, we take the log of this to get:
$idf(t,D)=\log(\frac{N}{1+n_w})$.
So we'll compute what's called tf-idf in the biz by multiplying the term frequency and the inverse document frequency:
$tfidf=tf\times idf$
As so often, this is not a neutral choice:
If we pick $tf$ by itself, we are saying we want the most frequent words in each document, normalized by document length.
If we pick $tfidf$, we are saying we want to work with the most frequent words that are also unusual across our particular set of documents.
If we use tf-idf on a set of documents about the CIA from 2000-2010, we'd guess "intelligence" would appear in most of them, and so the measure would down-play it in favor of what makes each document more distinctive within the corpus.
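To fix the formulas in code, here's a rough sketch of tf-idf computed by hand (the function and variable names are mine, not part of the notebook, and scikit-learn's version below differs in detail):
In [ ]:
import math

def tf_idf(docs):
    """toy tf-idf for a list of document strings; returns one dict of term -> weight per document"""
    tokenized = [doc.split() for doc in docs]
    N = len(docs)
    #document frequency: how many documents contain each term
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = []
    for tokens in tokenized:
        n_D = len(tokens)
        #tf = count/n_D, idf = log(N/(1+n_w)), weight = tf * idf
        weights.append({term: (tokens.count(term) / n_D) * math.log(N / (1 + df[term]))
                        for term in set(tokens)})
    return weights

tf_idf(["the cat sat on the mat", "the dog sat on the log", "butte montana mining"])
A term that appears in every document gets a weight near zero, which is exactly the down-playing we wanted.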
We'll also want something much better than .split() for breaking texts into words, and a list of stop words: the words you don't want to be included, like "from", "to", "a", "they", "she", "he".
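A minimal sketch of what a stop-word filter does (this tiny list is just for illustration; scikit-learn's 'english' stop-word list, used below, is much fuller):
In [ ]:
#drop stop words from a tokenized sentence (illustrative only; not part of the original notebook)
stop_words = {"from", "to", "a", "they", "she", "he", "i", "but", "if"}
wordlist = "I like to eat green apples but only if I eat them".lower().split()
[w for w in wordlist if w not in stop_words]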
In [11]:
directory="/Users/mljones/Downloads/na-slave-narratives/data/texts/" #PUT YOUR DIRECTORY HERE!
In [12]:
import sys, os
In [13]:
##I will provide a set of black boxes for this sort of thing soon; then you will import textmining_blackboxes
In [14]:
os.chdir(directory)
files=[file for file in os.listdir(".") if not file.startswith('.')] #defeat hidden files
files=[file for file in files if not os.path.isdir(file)] #defeat directories
articles=[]
file_titles=[]
for file in files:
    with open(file, encoding="UTF-8") as plaintext:
        lines=plaintext.readlines()
        #lines=[str(line) for line in lines]
        article=" ".join(lines) #alter lines if you want to skip lines
    articles.append(article)
    file_titles.append(file) #keep track of file names
In [15]:
articles[2][:500]
Out[15]:
In [16]:
import re
In [17]:
re.sub('\n', '', articles[2])[:500]
Out[17]:
Python has an embarrassment of riches when it comes to working with texts. Some libraries are higher level, with simpler, well-thought-out defaults, namely pattern and TextBlob. The most general, longest in development, and foundational is the Natural Language Toolkit (NLTK). The ideas we'll learn today are key; they have slightly different instantiations in the different tools. Not everything is in Python 3 yet.
For now, we'll play with the cool scientists and use the powerful and fast scikit-learn package.
In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer #import needed for the vectorizer
vectorizer=TfidfVectorizer(min_df=0.95, stop_words='english')
#.95 is a VERY high threshold--only the most common words--chosen for the form of visualization we're going to do
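An aside (my note, not the notebook's): min_df=0.95 keeps only terms that appear in at least 95% of the documents, which suits the heatmap below. A more typical analysis drops the rarest and the very most common terms instead, something like:
In [ ]:
#a more usual configuration (not used in the rest of this notebook):
#keep terms that appear in at least 2 documents but in no more than half of them
vectorizer_typical = TfidfVectorizer(min_df=2, max_df=0.5, stop_words='english')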
In [22]:
document_term_matrix=vectorizer.fit_transform(articles)
In [23]:
document_term_matrix.shape
Out[23]:
In [24]:
#output is number of documents, then size of remaining vocabulary
rows, terms=document_term_matrix.shape
In [25]:
vocab=vectorizer.get_feature_names() #in newer scikit-learn, use get_feature_names_out()
In [26]:
len(vocab)
Out[26]:
In [27]:
dtm=document_term_matrix.toarray()
dtmdf=pd.DataFrame(dtm, columns=vocab)
We put it into pandas just so we can explore it a bit more elegantly.
In [28]:
dtmdf
Out[28]:
We reduced our text to a vector of term-weights. What can we do once we've committed this violence on the text?
We can measure distance and similarity
I know. Crazy talk.
Right now our text is just a series of numbers, indexed to words. We can treat it like any other vector of numbers.
And the key way to distinguish two vectors is by measuring their distance or computing their similarity (1 - distance).
You already know how, though you may have buried it along with memories of high school.
If $\mathbf{a}$ and $\mathbf{b}$ are vectors, then
$\mathbf{a}\cdot\mathbf{b}=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta$
Or
$\text{similarity} = \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|} = \frac{ \sum\limits_{i=1}^{n}{a_i b_i} }{ \sqrt{\sum\limits_{i=1}^{n}{a_i^2}} \, \sqrt{\sum\limits_{i=1}^{n}{b_i^2}} }$
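In code, that's just a dot product divided by the product of the norms; a quick sketch with two toy vectors (my addition, not the notebook's data):
In [ ]:
#cosine similarity by hand for two small vectors
import numpy as np
a = np.array([1.0, 0.0, 2.0])
b = np.array([1.0, 1.0, 1.0])
a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))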
In [29]:
#easy to program, but let's use a robust version
from sklearn.metrics.pairwise import cosine_similarity
In [30]:
#cosine similarity is vectorized: that means it will operate on an entire matrix, not just its individual elements
In [31]:
similarity=cosine_similarity(dtmdf)
In [32]:
similarity
Out[32]:
In [33]:
import matplotlib.pyplot as plt
#we can make a heatmap with no problem within matplotlib
#pass plt.pcolor our similarity matrix
plt.pcolor(similarity, norm=None, cmap='Blues')
Out[33]:
In [34]:
#we have too many documents for that to be very useful; so
plt.pcolor(similarity[100:110, 100:110], norm=None, cmap='Blues')
Out[34]:
In [35]:
##first example of unsupervised learning
###hierarchical clustering
In [36]:
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram
dtm=document_term_matrix
dtm_trans=dtm.T #transpose so that we cluster the terms (columns), not the documents
dist=1-cosine_similarity(dtm_trans)
linkage_matrix=ward(dist)
#plot dendrogram
f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()
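The same recipe without the transpose clusters the documents themselves rather than the terms, with the leaves labeled by file name; this is a variant I'm adding, not a cell from the original notebook, and with a few hundred documents the labels will be crowded.
In [ ]:
#cluster the documents instead of the terms: skip the transpose, label leaves with file names
dist_docs = 1 - cosine_similarity(document_term_matrix)
linkage_docs = ward(dist_docs)
f = plt.figure(figsize=(9, 9))
R = dendrogram(linkage_docs, orientation="right", labels=file_titles)
plt.tight_layout()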
Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.
. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.
In [ ]: