In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
Reading at scale:
Martha Ballard's Diary http://dohistory.org/diary/index.html
http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/
Richmond Dispatch
In [1]:
from IPython.display import Image
Image("http://journalofdigitalhumanities.org/wp-content/uploads/2013/02/blei_lda_illustration.png")
Out[1]:
In [4]:
import textmining_blackboxes as tm
tm
is our temporary helper, not a standard Python package! Download it from my GitHub: https://github.com/matthewljones/computingincontext
In [3]:
#see if package imported correctly
tm.icantbelieve("butter")
Let's keep using the remarkable narratives available from Documenting the American South (http://docsouth.unc.edu/docsouthdata/)
This assumes you are storing your data in a directory in the same place as your IPython notebook.
Put the slave narrative texts within a data directory in the same place as this notebook.
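The cells that follow assume a title_info DataFrame holding the collection's metadata (author, title, date, and so on). Here is a minimal sketch of loading it, assuming the DocSouth download ships a metadata CSV alongside the texts; the file name and path below are my guesses, so check what your download actually contains:
In [ ]:
#hypothetical: load the collection metadata into the title_info DataFrame used below
#the CSV name and path are assumptions--adjust to match your docsouth download
title_info=pd.read_csv("data/na-slave-narratives/data/toc.csv")
title_info.head()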
In [ ]:
title_info["Date"].str.replace("[^0-9]", "") #use regular expressions to clean up
In [ ]:
title_info["Date"]=title_info["Date"].str.replace("\-\?", "5")
title_info["Date"]=title_info["Date"].str.replace("[^0-9]", "") # what assumptions have I made about the data?
In [ ]:
title_info["Date"]=pd.to_datetime(title_info["Date"], coerce=True)
In [ ]:
title_info["Date"]<pd.datetime(1800,1,1)
In [ ]:
title_info[title_info["Date"]<pd.Timestamp(1800,1,1)]
In [ ]:
#Let's use a brittle thing for reading in a directory of pure txt files.
our_texts=tm.readtextfiles('data/na-slave-narratives/data/texts')
#again, this is not a std python package
#returns a simple list of the documents as very long strings
#note: if you want, the following notebook will work on any directory of text files (a rough stand-in for the helper is sketched below)
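If you don't have the helper package, a rough stand-in is easy to write. This is only a sketch of what tm.readtextfiles appears to do based on the comments above (read every .txt file in a directory into one long string, and hand back the file names too); it is not the course helper itself:
In [ ]:
#a rough, hypothetical stand-in for tm.readtextfiles
import os
def read_text_files(directory):
    names=sorted(f for f in os.listdir(directory) if f.endswith(".txt"))
    texts=[]
    for name in names:
        with open(os.path.join(directory, name), encoding="utf-8", errors="ignore") as infile:
            texts.append(infile.read())
    return texts, names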
For now, we'll play with the cool scientists and use the powerful and fast scikit-learn package.
In [61]:
our_texts, names=tm.readtextfiles("data/british-fiction-corpus")
In [62]:
names
Out[62]:
In [63]:
our_texts=tm.data_cleanse(our_texts)
#more necessary when we have messy text
#eliminates escaped characters
In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [8]:
vectorizer=TfidfVectorizer(min_df=0.5, stop_words='english', use_idf=True) #min_df=0.5 drops terms that appear in fewer than half the documents
In [9]:
document_term_matrix=vectorizer.fit_transform(our_texts)
In [10]:
# now let's get our vocabulary--the names corresponding to the columns of the matrix
vocab=vectorizer.get_feature_names_out() #in older scikit-learn this was get_feature_names()
In [13]:
len(vocab)
Out[13]:
In [14]:
document_term_matrix.shape
Out[14]:
In [15]:
document_term_matrix_dense=document_term_matrix.toarray()
In [16]:
dtmdf=pd.DataFrame(document_term_matrix_dense, columns=vocab)
In [17]:
dtmdf
Out[17]:
In [11]:
#easy to program, but let's use a robust version from sklearn!
from sklearn.metrics.pairwise import cosine_similarity
In [12]:
similarity=cosine_similarity(document_term_matrix)
#Note here that `cosine_similarity` can take
#an entire matrix as its argument
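Since the comment above says it's easy to program, here is the by-hand version for two documents, just to show what the black box computes: the dot product of two tf-idf vectors divided by the product of their lengths (numpy is an extra import here, and the choice of the first two documents is arbitrary):
In [ ]:
#cosine similarity by hand for the first two documents, to demystify the sklearn call
import numpy as np
a=document_term_matrix_dense[0]
b=document_term_matrix_dense[1]
np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))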
In [13]:
similarity_df=pd.DataFrame(similarity, index=names, columns=names)
similarity_df
Out[13]:
In [28]:
similarity_df.iloc[1].sort_values(ascending=False) #which texts are most similar to the second text? (.ix and .order are gone from current pandas)
Out[28]:
In [14]:
#here's the blackbox
from sklearn.manifold import MDS
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
positions= mds.fit_transform(1-similarity) #MDS wants dissimilarities, so convert: dissimilarity = 1 - similarity
In [15]:
positions.shape
Out[15]:
It's an 11 by 2 matrix
OR
simply an (x,y) coordinate pair for each of our texts
In [16]:
#let's plot it: I've set up a black box
tm.plot_mds(positions,names)
In [17]:
names=[name.replace(".txt", "") for name in names]
In [18]:
tm.plot_mds(positions,names)
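tm.plot_mds is another black box from the helper package. If you want to see roughly what it is doing (this is my guess at its behavior, not its actual code), a plain matplotlib version scatters the points and labels each with its text's name:
In [ ]:
#a rough, hypothetical equivalent of tm.plot_mds
plt.figure(figsize=(10,8))
plt.scatter(positions[:,0], positions[:,1])
for (x, y), name in zip(positions, names):
    plt.annotate(name, (x, y))
plt.show()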
What has this got us?
It suggests that even this crude measure of similarity is able to capture something significant.
Note: the axes don't really mean anything
Get the stoplist from the data directory in my GitHub.
In [3]:
our_texts, names=tm.readtextfiles("Data/PCCIPtext")
In [5]:
our_texts=tm.data_cleanse(our_texts)
In [6]:
#improved stoplist--may be too complete
stop=[]
with open('data/stoplist-multilingual') as f:
    stop=f.readlines()
stop=[word.strip('\n') for word in stop]
In [7]:
texts = [[word for word in document.lower().split() if word not in stop] for document in our_texts] #gensim requires list of list of words in documents
In [8]:
from gensim import corpora, models, similarities, matutils
"""gensim includes its own vectorizing tools"""
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
#doc2bow just means `doc`uments to `b`ag `o`f `w`ords
#ok, this has just vectorized our texts; it's another form
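To see what doc2bow actually produces, feed it a tiny made-up document: each document comes back as a list of (token_id, count) pairs, and words the dictionary has never seen are silently dropped (the sample words here are arbitrary):
In [ ]:
#a made-up example of doc2bow output: (token_id, count) pairs
dictionary.doc2bow("infrastructure security infrastructure".split())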
In [10]:
number_topics=40
model = models.LdaModel(corpus, id2word=dictionary, num_topics=number_topics, passes=10) #train LDA: 40 topics, 10 passes over the corpus
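gensim also ships a multicore variant, LdaMulticore, which takes essentially the same arguments plus a workers count. A sketch, in case you want to parallelize the passes (the workers value here is arbitrary):
In [ ]:
#the multicore variant, if your gensim installation has it; workers is the number of worker processes
#model = models.LdaMulticore(corpus, id2word=dictionary, num_topics=number_topics, passes=10, workers=3)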
In [11]:
model.show_topics()
Out[11]:
In [42]:
topics_indexed=[[b for (a,b) in topics] for topics in model.show_topics(number_topics,10,formatted=False)]
#note: the unpacking above assumes each topic is a list of (weight, word) pairs, as in older gensim; newer gensim returns (topic_id, topic) pairs instead
topics_indexed=pd.DataFrame(topics_indexed)
In [43]:
topics_indexed
Out[43]:
So which topics are most significant for each document?
In [45]:
model[dictionary.doc2bow(texts[1])]
Out[45]:
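One way to answer that question for every document: ask the model for each document's topic mixture, sort by weight, and print the strongest topic (the print layout here is just for illustration):
In [ ]:
#rank topics per document by weight and show the strongest one
for i, text in enumerate(texts):
    doc_topics=sorted(model[dictionary.doc2bow(text)], key=lambda pair: pair[1], reverse=True)
    strongest_topic, weight=doc_topics[0]
    print(names[i], strongest_topic, round(weight, 3))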