In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
import textmining_blackboxes as tm
tm is our temporary helper, not a standard Python package! Download it from my GitHub: https://github.com/matthewljones/computingincontext
In [3]:
#see if package imported correctly
tm.icantbelieve("butter")
Let's use the remarkable narratives available from Documenting the American South (http://docsouth.unc.edu/docsouthdata/)
This assumes you are storing your data alongside your iPython notebook: put the slave narrative texts within a `data` directory in the same place as this notebook.
In [4]:
title_info=pd.read_csv('data/na-slave-narratives/data/toc.csv')
#this is the "metadata" of these files--we didn't use today
#why does data appear twice?
In [5]:
#Let's use a brittle thing for reading in a directory of pure txt files.
our_texts=tm.readtextfiles('data/na-slave-narratives/data/texts')
#again, this is not a std python package
#returns a simple list of the documents as very long strings
#note: the rest of this notebook will work on any directory of text files
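If you don't have the helper package handy, a minimal sketch of the same idea--read every .txt file in a directory into a list of long strings--might look like this (read_text_files is a hypothetical stand-in, not the helper's actual code):
In [ ]:
import glob
import os

def read_text_files(directory):
    #read every .txt file in a directory into a list of long strings
    texts=[]
    for path in sorted(glob.glob(os.path.join(directory, '*.txt'))):
        with open(path, encoding='utf-8', errors='ignore') as f:
            texts.append(f.read())
    return texts

#our_texts=read_text_files('data/na-slave-narratives/data/texts')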
In [6]:
len(our_texts)
Out[6]:
In [42]:
our_texts[100][:300] # first 300 characters of the 100th text
Out[42]:
Sure, you could do this as a for loop:
for text in our_texts:
    blah.blah.blah(text) #not real code
or
for i in range(len(our_texts)):
    blah.blah.blah(our_texts[i])
But a list comprehension makes this super easy in Python:
In [8]:
lengths=[len(text) for text in our_texts]
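Since matplotlib is already imported, a quick look at the spread of document lengths makes a nice sanity check (a minimal sketch, not in the original notebook):
In [ ]:
plt.hist(lengths, bins=50)
plt.xlabel("characters per narrative")
plt.ylabel("number of narratives")
plt.show()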
Python has an embarrassment of riches when it comes to working with texts. Some libraries are higher level, with simpler, well-thought-out defaults, namely pattern and TextBlob. The most general, longest developed, and foundational is the Natural Language Toolkit--NLTK. The ideas we'll learn today are key--they just have slightly different instantiations in the different tools. Not everything is yet in Python 3, alas!!
These libraries offer, among other things, tokenizers that do a much better job than a bare .split(), and lists of stop words--the words you don't want to be included: "from", "to", "a", "they", "she", "he".
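By way of example, here's a minimal sketch of what NLTK adds over a bare .split() (this assumes you have nltk installed and its 'punkt' and 'stopwords' data downloaded; it isn't part of this notebook's pipeline):
In [ ]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#nltk.download('punkt'); nltk.download('stopwords')   #one-time downloads
sentence="She escaped from the plantation, didn't she?"
print(sentence.split())          #naive split leaves punctuation stuck to words
print(word_tokenize(sentence))   #NLTK separates punctuation and contractions
stops=set(stopwords.words('english'))
print([w for w in word_tokenize(sentence.lower()) if w.isalpha() and w not in stops])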
For now, we'll play with the cool scientists and use the powerful and fast scikit-learn package.
In [ ]:
our_texts=tm.data_cleanse(our_texts)
#more necessary when you have messy text
#eliminates escaped characters
In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [17]:
vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True)
#min_df=0.5 ignores terms that appear in fewer than half of the documents
In [18]:
document_term_matrix=vectorizer.fit_transform(our_texts)
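For reference, the weight that TfidfVectorizer assigns to term $t$ in document $d$ is, roughly speaking (before scikit-learn's smoothing and normalization):
$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \log\frac{N}{\text{df}(t)}$
where $\text{tf}(t,d)$ counts how often $t$ occurs in $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. Words common in one document but rare across the collection get the highest weights.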
In [43]:
# now let's get our vocabulary--the names corresponding to the rows
# "feature" is the general term in machine learning and data mining
# we seek to characterize data by picking out features that will enable discovery
vocab=vectorizer.get_feature_names()
#in newer versions of scikit-learn this method is get_feature_names_out()
In [20]:
len(vocab)
Out[20]:
In [21]:
document_term_matrix.shape
Out[21]:
In [22]:
vocab[1000:1100]
Out[22]:
In [23]:
document_term_matrix_dense=document_term_matrix.toarray()
In [24]:
dtmdf=pd.DataFrame(document_term_matrix_dense, columns=vocab)
In [25]:
dtmdf
Out[25]:
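With the matrix in a DataFrame, you can peek at how heavily a single term weighs in each narrative (a minimal sketch; "mother" is just a guess at a word that survives the min_df cutoff--substitute anything you saw in vocab):
In [ ]:
dtmdf["mother"].sort_values(ascending=False).head(10)
#tf-idf weight of one term in every document, largest first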
We reduced our text to a vector of term-weights.
What can we do once we've committed this real violence on the text?
We can measure distance and similarity
I know. Crazy talk.
Right now our text is just a series of numbers, indexed to words. We can treat it like any collection of vectors more or less.
And the key way to distinguish two vectors is by measuring their distance or computing their similarity (1 − distance).
You already know how, though you may have buried it along with memories of high school.
If $\mathbf{a}$ and $\mathbf{b}$ are vectors, then
$\mathbf{a}\cdot\mathbf{b}=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta$
Or
$\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum\limits_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum\limits_{i=1}^{n}{(B_i)^2}} }$
(h/t wikipedia)
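To see the formula in action, here is a minimal sketch computing the cosine similarity of two toy vectors by hand with NumPy (not part of the original notebook):
In [ ]:
import numpy as np
a=np.array([1.0, 0.0, 2.0])
b=np.array([0.0, 1.0, 1.0])
print(a.dot(b)/(np.linalg.norm(a)*np.linalg.norm(b)))
#about 0.63 for these two vectors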
In [26]:
#easy to program, but let's use a robust version from sklearn!
from sklearn.metrics.pairwise import cosine_similarity
In [40]:
similarity=cosine_similarity(document_term_matrix)
#Note here that `cosine_similarity` can take
#an entire matrix as its argument
In [28]:
#what'd we get?
similarity
Out[28]:
In [29]:
similarity.shape
Out[29]:
In [30]:
similarity[100]
#this gives the similarity of row 100 to each of the other rows
Out[30]:
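A natural follow-up is to pull out the texts most similar to text 100 by sorting that row (a minimal sketch; matching these indices back to title_info only works if its rows line up with the order the files were read in):
In [ ]:
import numpy as np
most_similar=np.argsort(similarity[100])[::-1][1:6]
#indices of the five most similar narratives, skipping text 100 itself
print(most_similar)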
This time we're interested in relations among the words, not the texts.
In other words, we're interested in the similarities between one column and another--one term and another term.
So we'll work with the transposed matrix--the term-document matrix, rather than the document-term matrix.
For a description of hierarchical clustering, look at the example at https://en.wikipedia.org/wiki/Hierarchical_clustering
In [31]:
term_document_matrix=document_term_matrix.T
# .T is the easy transposition method for a
# matrix in python's matrix packages.
In [32]:
# import a bunch of packages we need
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, dendrogram
In [33]:
#distance is 1-similarity, so:
dist=1-cosine_similarity(term_document_matrix)
# ward is an algorithm for hierarchical clustering
linkage_matrix=ward(dist)
#plot dendrogram
f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()
In [34]:
vectorizer=TfidfVectorizer(min_df=.96, stop_words='english', use_idf=True)
#try a very high min_df: only terms appearing in at least 96% of the documents are kept, so the vocabulary shrinks drastically
In [35]:
#rerun the model
document_term_matrix=vectorizer.fit_transform(our_texts)
vocab=vectorizer.get_feature_names()
In [36]:
#check the length of the vocab
len(vocab)
Out[36]:
In [37]:
#switch again to the term_document_matrix
term_document_matrix=document_term_matrix.T
In [38]:
dist=1-cosine_similarity(term_document_matrix)
linkage_matrix=ward(dist)
#plot dendrogram
f=plt.figure(figsize=(9,9))
R=dendrogram(linkage_matrix, orientation="right", labels=vocab)
plt.tight_layout()
Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.
. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.