Pandas is the Python Data Analysis Library, part of the same scientific-Python ecosystem as scipy, numpy, and IPython. It provides two data structures that are well suited to textual matrices: the Series and the DataFrame. Let's jump right in.
In [36]:
import pandas as pd # the recommended import convention for pandas
import numpy as np
sentence = 'the dog bit the man' #our first sentence from the presentation
token_list = sentence.split()
type_list = list(set(token_list)) #we only want each word type listed once
print(type_list)
Now, let's initialize a Series with these index labels.
In [37]:
ser1 = pd.Series(index = type_list)
print(ser1)
If we want to initialize it with values, we can do this either with the data argument or with a dictionary. First, let's get the counts for each word in the sentence.
In [38]:
count_list = []
count_dict = {}
for word in type_list:
    count_list.append(token_list.count(word))
    count_dict[word] = token_list.count(word)
print(count_list)
print(type_list)
print(count_dict)
Now, we can create our Series objects.
In [39]:
ser2 = pd.Series(data = count_list, index = type_list)
ser3 = pd.Series(count_dict)
print(ser2)
print(ser3)
A Series is essentially a labelled vector, here a frequency term-document vector. In order to construct a term-document matrix, we can create another Series for our second sentence.
Quiz:
Below, write a short function that takes as its input a string of text and outputs a dictionary of word counts and a term-document Series.
In [40]:
sent2 = 'the bat hit the ball'
def td_Series(text):
    # insert your code here to create a count dictionary and a term-document vector for sent2
    from collections import Counter
    d = Counter(text.split())
    s = pd.Series(d)
    return d, s
count_dict2, ser4 = td_Series(sent2)
print(count_dict2)
ser4
Out[40]:
At this point, we have two separate Series representing two different term-document vectors. We can bring them together to create a DataFrame, the primary object type in the Pandas package.
In [41]:
df1 = pd.DataFrame(data = [ser3, ser4], index = ['sent1', 'sent2'])
print(df1)
# if you don't print the DataFrame, the notebook will render a nicely formatted HTML view of the table:
df1
Out[41]:
Notice that we now have an $m \times n$ term-document matrix. We could also create the DataFrame by passing our count_dicts directly. In this DataFrame, let's replace all NaN values with 0.
In [42]:
df2 = pd.DataFrame(data = [count_dict, count_dict2], index = ['sent1', 'sent2'])
print(df2)
df2 = df2.fillna(value = 0)
df2
Out[42]:
Now we can look up values simply by giving (row label, column label) pairs to .loc. Name the row first, then the column.
In [43]:
print(df1.loc['sent1', 'ball'])
print(df1.loc['sent2', 'ball'])
print(df2.loc['sent1', 'ball'])
print(df2.loc['sent2', 'ball'])
# or do it like this, with attribute access (column first, then row):
df1.ball.sent1
df1
Out[43]:
We can also look up values by their integer row and column positions with .iloc. Again, first the row, then the column.
In [44]:
df1.iloc[0, 0]
Out[44]:
In [45]:
df1.iloc[1, 0]
Out[45]:
In [46]:
df2.iloc[0, 0]
Out[46]:
In [47]:
df2.iloc[1, 0]
Out[47]:
In [48]:
df2.iloc[0]
Out[48]:
In [49]:
df2.index
Out[49]:
In [50]:
df2.values # which will return a numpy 2d array
Out[50]:
Below are a few other things you can do with a DataFrame.
In [51]:
df2.min(axis = 0)
Out[51]:
In [52]:
df2.min(axis = 1)
Out[52]:
In [53]:
np.min(df2, axis = 1) # numpy function works but is slightly slower
Out[53]:
In [54]:
df2.max(axis = 1)
Out[54]:
In [55]:
df2.idxmin(axis = 1) # index of the min
Out[55]:
In [56]:
df2.idxmax(axis = 1) # index of the max
Out[56]:
In [57]:
df2.values.max() # max of all of the values
Out[57]:
And we can compute simple statistics.
In [58]:
df2.describe()
Out[58]:
In [59]:
df2.mean(axis = 1)
Out[59]:
In [60]:
df2.loc['sent1'].mean()
Out[60]:
In [61]:
df2.std(axis = 1) # standard deviation
Out[61]:
Now, what can we do with this? We can use, e.g., the correlation metric in Pandas.
In [62]:
df2.iloc[0].corr(df2.iloc[1])
Out[62]:
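The corr method works column-wise, so if we want the full table of pairwise correlations between our documents we can transpose first. A quick sketch of this (not part of the original cells):
In [ ]:
# correlate the document rows by transposing so that sent1 and sent2 become the columns
df2.T.corr()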
Or, if we have the scikit-learn package, there is a lot more we can do.
Note: to install scikit-learn on Linux with Python 3.4, use the following command:
[sudo] pip3 install git+https://github.com/scikit-learn/scikit-learn.git
The tf-idf metric stands for 'term frequency-inverse document frequency'. It weights the importance of each word for each document, based on how often the word occurs in that document and the inverse of how many documents in the corpus contain it.
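In its simplest form, the weight of a term $t$ in a document $d$ is $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$, where $\mathrm{tf}(t, d)$ is the count of $t$ in $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents containing $t$; scikit-learn's TfidfTransformer computes a smoothed, length-normalized variant of this.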
In [63]:
from sklearn.feature_extraction.text import TfidfTransformer
pd.DataFrame(TfidfTransformer().fit_transform(df2).toarray(), index = df2.index, columns = df2.columns) # tf-idf weights for our toy term-document matrix
You can also measure the distance between two documents with the pairwise_distances function in sklearn.
In [ ]:
from sklearn.metrics.pairwise import pairwise_distances
euclid = pairwise_distances(df2) #Euclidean distance between the two documents.
df_euclid = pd.DataFrame(data = euclid, index = df2.index, columns = df2.index)
print(df_euclid)
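The metric argument lets you swap in other distance functions. As a quick sketch (the choice of metric = 'cosine' here is just an illustration, not part of the original cell), the same call with cosine distance looks like this:
In [ ]:
cosine = pairwise_distances(df2, metric = 'cosine') # cosine distance = 1 - cosine similarity
print(pd.DataFrame(data = cosine, index = df2.index, columns = df2.index))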
There are a number of text files in the Data sub-directory of this directory. Write a function that takes as input a text file's path, reads the text from the file, splits it into its individual words, and returns a Series with the word types (i.e., unique words) as the index and the number of times they occur as the values.
In [20]:
def split_txt(filename):
    # write your code here
    from collections import Counter
    words = open(filename).read().split()
    s = pd.Series(Counter(words))
    return s
emma_Series = split_txt('./Data/austen-emma.txt')
print(emma_Series[:20])
Take a look at the first 20 members of the Series. It looks like we have a couple of problems: capitalization and punctuation. Edit your function below to solve these problems.
Hint: use the punctuation constant in the string module to recognize punctuation.
In [34]:
from string import punctuation
from collections import Counter
import re
def split_txt(filename):
    # write your code here
    words = [word.lower() for word in open(filename).read().split()]
    new_words = []
    for word in words:
        new_words.extend(w for w in re.split('[%s]+' % punctuation, word) if w != '')
    s = pd.Series(Counter(new_words))
    return s
emma_Series = split_txt('./Data/austen-emma.txt')
'''
The following code checks whether you have successfully cleaned your corpus.
Please do not change it.
'''
problems = []
for word in emma_Series.index:
    if re.search('[\WA-Z]', word):
        problems.append(word)
print(len(problems))
If the length of the problems list is not 0, then you are not yet finished. Take a look at your results to check what you did wrong and edit your code to correct the problem.
You now have a function that can take a text, clean it, and produce a term-document array (Series). Now, integrate this function into a script that reads and cleans all the texts in the ./Data folder, and combine all of the resulting Series into one large term-document matrix. Transform this matrix into a tf-idf matrix, and then run at least 5 of the metrics under pairwise_distances in sklearn.
In [84]:
pw_list[0].shape
Out[84]:
In [90]:
#Write your code here or in a separate .py file. __Make sure I know where to find your file!__
from os import listdir
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import TfidfTransformer
texts = listdir('./Data')
s_list = []
for f in texts:
    s_list.append(split_txt('/'.join(['./Data', f])))
td_df = pd.DataFrame(s_list, index = texts).fillna(0)
tfidf = TfidfTransformer().fit_transform(td_df)
tf_df = pd.DataFrame(data = tfidf.toarray(), index = td_df.index, columns = td_df.columns)
pw = ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']
pw_list = []
for distance in pw:
    pw_list.append(pd.DataFrame(pairwise_distances(tf_df, metric = distance), index = td_df.index, columns = td_df.index))
In [91]:
pw_list[0]
Out[91]:
Consider your results from each of these different metrics. Is there anything that suggests which of these metrics are better for analyzing this data?
Write your answer in this text box, below this line.
Your answer: