In [2]:
# Import all of the things you need to import!
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer
pd.options.display.max_columns = 30
%matplotlib inline
The Congressional Record is more or less what happened in Congress every single day. Speeches and all that. A good large source of text data, maybe?
Let's pretend it's totally secret but we just got it leaked to us in a data dump, and we need to check it out. It was leaked from this page here.
In [3]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz
In [4]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz
You can explore the files if you'd like, but we're going to use the ones from convote_v1.1/data_stage_one/development_set/. It's a bunch of text files.
In [5]:
# glob finds files matching a certain filename pattern
import glob
# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]
Out[5]:
In [6]:
len(paths)
Out[6]:
So great, we have 702 of them. Now let's import them.
In [7]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
        speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()
Out[7]:
In class we had the texts variable. For the homework you can just do speeches_df['content'] to get the same sort of list of stuff.
Take a look at the contents of the first 5 speeches
In [13]:
first_5 = speeches_df['content'].head(5)
first_5
Out[13]:
In [14]:
countvectorizer = CountVectorizer(stop_words='english')
In [16]:
tokens = countvectorizer.fit_transform(speeches_df['content'])
In [17]:
countvectorizer.get_feature_names()
Out[17]:
Okay, it's far too big to even look at. Let's try to get a list of features from a new CountVectorizer that only takes the top 100 words.
In [19]:
countvectorizer_new = CountVectorizer(max_features=100, stop_words='english')
In [20]:
tokens_new = countvectorizer_new.fit_transform(speeches_df['content'])
In [57]:
tokens_complete = pd.DataFrame(tokens_new.toarray(), columns=countvectorizer_new.get_feature_names())
Now let's push all of that into a dataframe with nicely named columns.
In [23]:
only_100_tokens = pd.DataFrame(tokens_new.toarray(), columns=countvectorizer_new.get_feature_names())
only_100_tokens
Out[23]:
Everyone seems to start their speeches with "mr chairman" - how many speeches are there total, how many don't mention "chairman", and how many mention neither "mr" nor "chairman"?
In [25]:
only_100_tokens.describe()
#There are 702 speeches in this Dataframe
Out[25]:
In [38]:
only_100_tokens['chairman_less'] = only_100_tokens['chairman'] == 0
only_100_tokens['chairman_less'].describe()
Out[38]:
In [39]:
print('There are', 702-452, 'speeches without the word chairman in them')
In [40]:
only_100_tokens['mr_less'] = only_100_tokens['mr'] == 0
only_100_tokens['mr_less'].describe()
Out[40]:
In [41]:
print('There are', 702-623, 'speeches without the word mr in them')
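Another way to answer the whole question, including the part about speeches that mention neither word, is to sum boolean masks directly instead of reading the numbers off describe(). This is just a sketch, and it assumes only_100_tokens still holds the raw word counts from the CountVectorizer.
In [ ]:
# Sum boolean masks to count speeches directly
total_speeches = len(only_100_tokens)
no_chairman = (only_100_tokens['chairman'] == 0).sum()
neither = ((only_100_tokens['mr'] == 0) & (only_100_tokens['chairman'] == 0)).sum()
print(total_speeches, 'speeches total')
print(no_chairman, 'speeches never say chairman')
print(neither, 'speeches say neither mr nor chairman')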
What is the index of the speech that is the most thankful, a.k.a. includes the word 'thank' the most times?
In [47]:
only_100_tokens['thank'].sort_values(ascending=False).head(1)
Out[47]:
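If you just want the index and not the count, idxmax() does the same thing in one step (a small alternative, not what we did in class):
In [ ]:
# index label of the speech that uses 'thank' the most
only_100_tokens['thank'].idxmax()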
If I'm searching for china and trade, what are the top 3 speeches to read according to the CountVectorizer?
In [46]:
only_100_tokens['china trade'] = only_100_tokens['china'] + only_100_tokens['trade']
only_100_tokens['china trade'].sort_values(ascending=False).head(3)
Out[46]:
Now what if I'm using a TfidfVectorizer?
In [ ]:
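Here's a sketch of one way to do it, assuming we rank the same china + trade query by tf-idf weight instead of raw counts. The variable names (tfidfvectorizer, tfidf_df) are mine, and this assumes 'china' and 'trade' survive the max_features=100 cut the same way they did for the CountVectorizer.
In [ ]:
# Build a tf-idf matrix over the same speeches, limited to the top 100 terms
tfidfvectorizer = TfidfVectorizer(max_features=100, stop_words='english')
tfidf_matrix = tfidfvectorizer.fit_transform(speeches_df['content'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidfvectorizer.get_feature_names())
# Rank speeches by the combined tf-idf weight of 'china' and 'trade'
(tfidf_df['china'] + tfidf_df['trade']).sort_values(ascending=False).head(3)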
What's the content of the speeches? Here's a way to get them:
In [48]:
# index 0 is the first speech, which was the first one imported.
paths[0]
Out[48]:
In [49]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
!cat {paths[0]}
Now search for something else! Another two terms that might show up. elections and chaos? Whatever you think might be interesting.
In [67]:
# flag the speeches that mention 'america' at least once
tokens_complete['america'] = tokens_complete['america'] >= 1
tokens_complete['america'].describe()
Out[67]:
In [70]:
tokens_complete[tokens_complete['america'] == True].count().head(1)
# wow! 75 speeches mention america at least once!
Out[70]:
In [71]:
# flag the speeches that mention 'elections' at least once
tokens_complete['elections'] = tokens_complete['elections'] >= 1
tokens_complete['elections'].describe()
# the word elections shows up in nearly all speeches!
Out[71]:
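Note that the two cells above overwrite the raw counts with True/False, so the original numbers are gone from those columns. A safer pattern, sketched here with 'trade', is to compute the mask without assigning it back:
In [ ]:
# Count speeches containing a term without clobbering the count column
print((only_100_tokens['trade'] >= 1).sum(), 'speeches mention trade')
print(only_100_tokens['trade'].max(), 'is the most times trade shows up in one speech')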
Using a simple counting vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
Using a term frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
Using a term frequency inverse document frequency vectorizer, cluster the documents into eight categories, telling me what the top terms are per category.
In [ ]:
## I'm still working on this one!
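In the meantime, here's a rough sketch of how all three could work, assuming KMeans from sklearn and swapping in the three vectorizers: a plain CountVectorizer, a TfidfVectorizer with use_idf=False for term frequency only, and a regular TfidfVectorizer for tf-idf. The helper name top_terms_per_cluster and the max_features=1000 cap are my own choices, not from class.
In [ ]:
# Cluster the speeches into 8 groups and print the top terms per cluster.
# Each KMeans cluster center lives in the same term space as the vectorizer,
# so the biggest values in a center are that cluster's most characteristic terms.
from sklearn.cluster import KMeans

def top_terms_per_cluster(vectorizer, texts, n_clusters=8, n_terms=10):
    matrix = vectorizer.fit_transform(texts)
    km = KMeans(n_clusters=n_clusters)
    km.fit(matrix)
    terms = vectorizer.get_feature_names()
    # argsort sorts term indices from smallest to largest weight; [::-1] flips to descending
    order = km.cluster_centers_.argsort()[:, ::-1]
    for i in range(n_clusters):
        print('Cluster', i, ':', ', '.join(terms[ind] for ind in order[i, :n_terms]))

# simple counting vectorizer
top_terms_per_cluster(CountVectorizer(stop_words='english', max_features=1000), speeches_df['content'])
# term frequency vectorizer (tf only, no idf)
top_terms_per_cluster(TfidfVectorizer(stop_words='english', max_features=1000, use_idf=False), speeches_df['content'])
# term frequency inverse document frequency vectorizer
top_terms_per_cluster(TfidfVectorizer(stop_words='english', max_features=1000), speeches_df['content'])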
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Which one do you think works the best?
In [ ]:
I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.
I want you to read them in, vectorize them and cluster them. Use this process to find out the two types of Harry Potter fanfiction. What is your hypothesis?
In [ ]:
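A sketch of how I'd start: read the files in the same way as the speeches, vectorize, and run KMeans with n_clusters=2 since we're looking for two types. The folder name hp/ is a guess about where the zip extracts to, so adjust the glob pattern to match whatever it actually unpacks into.
In [ ]:
# Read in the fanfiction, vectorize with tf-idf, and cluster into 2 groups
import glob
from sklearn.cluster import KMeans

hp_paths = glob.glob('hp/*.txt')  # assumes the zip extracted into hp/
hp_texts = []
for path in hp_paths:
    with open(path) as f:
        hp_texts.append(f.read())

hp_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
hp_matrix = hp_vectorizer.fit_transform(hp_texts)
hp_km = KMeans(n_clusters=2)
hp_km.fit(hp_matrix)

# Top 10 terms for each of the two clusters
hp_terms = hp_vectorizer.get_feature_names()
for i in range(2):
    top = hp_km.cluster_centers_[i].argsort()[::-1][:10]
    print('Cluster', i, ':', ', '.join(hp_terms[ind] for ind in top))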