In [15]:
# Import all of the things you need to import!
!pip install scipy
!pip install scikit-learn
!pip install nltk
In [16]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer
from sklearn.cluster import KMeans
import numpy as np
pd.options.display.max_columns = 30
%matplotlib inline
The Congressional Record is more or less a record of what happened in Congress every single day: speeches and all that. A good, large source of text data, maybe?
Let's pretend it's totally secret but it just got leaked to us in a data dump, and we need to check it out. It was leaked from this page here.
In [17]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz
In [18]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz
You can explore the files if you'd like, but we're going to use the ones from convote_v1.1/data_stage_one/development_set/. It's a bunch of text files.
In [19]:
# glob finds files matching a certain filename pattern
import glob
# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]
Out[19]:
In [20]:
len(paths)
Out[20]:
So great, we have 702 of them. Now let's import them.
In [21]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()
Out[21]:
In class we had the texts variable. For the homework you can just use speeches_df['content'] to get the same sort of list of strings.
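For example, to get the same sort of thing as the in-class variable (just a quick sketch):
In [ ]:
# speeches_df['content'] works like the in-class `texts` list of strings
texts = speeches_df['content']
texts[:3]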
Take a look at the contents of the first 5 speeches.
In [22]:
for item in speeches_df['content'].head(5):
    print("++++++++++++++++++++NEW SPEECH+++++++++++++++++++++")
    print(item)
    print(" ")
In [23]:
c_vectorizer = CountVectorizer(stop_words='english')
x = c_vectorizer.fit_transform(speeches_df['content'])
x
Out[23]:
In [24]:
df = pd.DataFrame(x.toarray(), columns=c_vectorizer.get_feature_names())
In [25]:
df
Out[25]:
Okay, it's far too big to even look at. Let's try to get a list of features from a new CountVectorizer that only takes the top 100 words.
In [26]:
#http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
c2_vectorizer = CountVectorizer(stop_words='english', max_features=100)
y = c2_vectorizer.fit_transform(speeches_df['content'])
y
Out[26]:
Now let's push all of that into a dataframe with nicely named columns.
In [27]:
new_df = pd.DataFrame(y.toarray(), columns=c2_vectorizer.get_feature_names())
#new_df
Everyone seems to start their speeches with "mr chairman". How many speeches are there in total, how many don't mention "chairman", and how many mention neither "mr" nor "chairman"?
In [28]:
#http://stackoverflow.com/questions/15943769/how-to-get-row-count-of-pandas-dataframe
total_speeches = len(new_df.index)
print("In total there are", total_speeches, "speeches.")
In [29]:
wo_chairman = new_df[new_df['chairman']==0]['chairman'].count()
print(wo_chairman, "speeches don't mention 'chairman'")
In [30]:
wo_mr_chairman = new_df[(new_df['chairman']==0) & (new_df['mr']==0)]['chairman'].count()
print(wo_mr_chairman, "speeches mention neither 'chairman' nor 'mr'")
What is the index of the speech that is the most thankful, a.k.a. includes the word 'thank' the most times?
In [31]:
#http://stackoverflow.com/questions/18199288/getting-the-integer-index-of-a-pandas-dataframe-row-fulfilling-a-condition
print("The speech with the most 'thank's has the index", np.where(new_df['thank']==new_df['thank'].max()))
If I'm searching for china and trade, what are the top 3 speeches to read according to the CountVectorizer?
In [32]:
china_trade_speeches = (new_df['china'] + new_df['trade']).sort_values(ascending = False).head(3)
china_trade_speeches
Out[32]:
Now what if I'm using a TfidfVectorizer?
In [33]:
porter_stemmer = PorterStemmer()

def stem_tokenizer(str_input):
    # Keep only letters, digits and hyphens, lowercase everything, then stem each word
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

# use_idf=False and norm='l1' turn the scores into plain term frequencies,
# i.e. roughly each word's share of its document
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stem_tokenizer, use_idf=False, norm='l1', max_features=100)
X = tfidf_vectorizer.fit_transform(speeches_df['content'])
t_df = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())
In [34]:
china_trade_speeches_v2 = (t_df['china'] + t_df['trade']).sort_values(ascending = False).head(3)
china_trade_speeches_v2
Out[34]:
What's the content of the speeches? Here's a way to get them:
In [35]:
# index 0 is the first speech, which was the first one imported.
paths[0]
Out[35]:
In [36]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
print("++++++++++NEW SPEECH+++++++++")
!cat {paths[345]}
print("++++++++++NEW SPEECH+++++++++")
!cat {paths[336]}
print("++++++++++NEW SPEECH+++++++++")
!cat {paths[402]}
Now search for something else! Another two terms that might show up. elections and chaos? Whatever you think might be interesting.
In [37]:
new_df.columns
Out[37]:
In [38]:
election_speeches = (new_df['discrimination'] + new_df['rights']).sort_values(ascending = False).head(3)
election_speeches
Out[38]:
Using a simple counting vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
Using a term frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
Using a term frequency inverse document frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
In [39]:
def new_stem_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    # With PorterStemmer applied as above, the terms came out pretty mangled and
    # hard to judge, so the stemming line is commented out for now.
    #words = [porter_stemmer.stem(word) for word in words]
    return words

vectorizer_types = [
    {'name': 'CVectorizer', 'definition': CountVectorizer(stop_words='english', tokenizer=new_stem_tokenizer, max_features=100)},
    {'name': 'TFVectorizer', 'definition': TfidfVectorizer(stop_words='english', tokenizer=new_stem_tokenizer, max_features=100, use_idf=False)},
    {'name': 'TFIDFVectorizer', 'definition': TfidfVectorizer(stop_words='english', tokenizer=new_stem_tokenizer, max_features=100, use_idf=True)}
]
In [40]:
for vectorizer in vectorizer_types:
    X = vectorizer['definition'].fit_transform(speeches_df['content'])
    number_of_clusters = 8
    km = KMeans(n_clusters=number_of_clusters)
    km.fit(X)
    print("++++++++ Top terms per cluster -- using a", vectorizer['name'])
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer['definition'].get_feature_names()
    for i in range(number_of_clusters):
        top_words = [terms[ind] for ind in order_centroids[i, :7]]
        print("Cluster {}: {}".format(i, ' '.join(top_words)))
Which one do you think works the best?
The last two seem to make more sense than the first one, judging from its third cluster; they are more human-readable. But human-readability only takes me that far: based on the top terms per cluster, I can't tell which of the last two would be better.
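One way to go beyond eyeballing the top terms is a clustering metric. The sketch below uses scikit-learn's silhouette_score (higher is better) to compare the three vectorizers; this is an extra check I'm assuming here, not something the homework asked for.
In [ ]:
from sklearn.metrics import silhouette_score

# Re-fit each vectorizer and score its 8-cluster KMeans solution
for vectorizer in vectorizer_types:
    X = vectorizer['definition'].fit_transform(speeches_df['content'])
    labels = KMeans(n_clusters=8).fit_predict(X)
    print(vectorizer['name'], silhouette_score(X, labels))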
I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.
I want you to read them in, vectorize them and cluster them. Use this process to find out the two types of Harry Potter fanfiction. What is your hypothesis?
In [41]:
# -L tells curl to follow GitHub's redirect to the actual file
!curl -LO https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip
In [42]:
import zipfile

# Extract the archive (assuming it unpacks into the hp/ directory the glob below expects)
with zipfile.ZipFile('hp.zip') as hp_zip:
    hp_zip.extractall()
In [43]:
import glob
potter_paths = glob.glob('hp/*')
potter_paths[:5]
Out[43]:
In [44]:
potter = []
for path in potter_paths:
    with open(path) as potter_file:
        potter_text = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': potter_file.read()
        }
    potter.append(potter_text)
potter_df = pd.DataFrame(potter)
potter_df.head()
Out[44]:
In [47]:
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=new_stem_tokenizer, use_idf=True)
X = vectorizer.fit_transform(potter_df['content'])
number_of_clusters = 2
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)
print("Top terms per cluster")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_words = [terms[ind] for ind in order_centroids[i, :7]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))
The first cluster revolves around Harry and his friends Hermione and Ron, as well as his enemy Draco.
The second cluster revolves around Harry's family, as well as his godfather Sirius and his mentor Lupin.
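To sanity-check that hypothesis, one option (not in the original notebook) is to attach the cluster labels to the dataframe and see which stories land in each cluster:
In [ ]:
# km.labels_ holds the cluster assignment for each row of potter_df
potter_df['cluster'] = km.labels_
print(potter_df['cluster'].value_counts())
potter_df[['filename', 'cluster']].head(10)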
In [ ]: