Pandas is the Python Data Analysis Library, part of the same scientific-Python ecosystem as scipy, numpy, and IPython. It provides two data structures that are well suited to textual matrices: the Series and the DataFrame. Let's jump right in.
In [36]:
import pandas as pd # the recommended import convention for pandas
import numpy as np
sentence = 'the dog bit the man' #our first sentence from the presentation
token_list = sentence.split()
type_list = list(set(token_list)) #we only want each word type listed once
print(type_list)
Now, let's initialize a Series with these index labels.
In [37]:
ser1 = pd.Series(index = type_list)
print(ser1)
If we want to initialize it with values, we can do this either with the data argument or with a dictionary. First, let's get the counts for each word in the sentence.
In [38]:
count_list = []
count_dict = {}
for word in type_list:
    count_list.append(token_list.count(word))
    count_dict[word] = token_list.count(word)
print(count_list)
print(type_list)
print(count_dict)
Now, we can create our Series objects.
In [39]:
ser2 = pd.Series(data = count_list, index = type_list)
ser3 = pd.Series(count_dict)
print(ser2)
print(ser3)
A Series is essentially a labelled vector, here a frequency term-document vector. In order to construct a term-document matrix, we can create another Series for our second sentence.
Quiz:
Below, write a short function that takes as its input a string of text and outputs a dictionary of word counts and a term-document Series.
In [40]:
sent2 = 'the bat hit the ball'
def td_Series(text):
    # insert your code here to create a count dictionary and a term-document vector for sent2
    from collections import Counter
    d = Counter(text.split())
    s = pd.Series(d)
    return d, s
count_dict2, ser4 = td_Series(sent2)
print(count_dict2)
ser4
Out[40]:
At this point, we have two separate Series representing two different term-document vectors. We can bring them together to create a DataFrame, the primary object type in the Pandas package.
In [41]:
df1 = pd.DataFrame(data = [ser3, ser4], index = ['sent1', 'sent2'])
print(df1)
# if you don't print the DataFrame, the notebook will render a nicely formatted HTML view of the table:
df1
Out[41]:
Notice that we now have an $m \times n$ term-document matrix. We could also create the DataFrame by passing our count_dicts directly. In this DataFrame, let's replace all NaN values with 0.
In [42]:
df2 = pd.DataFrame(data = [count_dict, count_dict2], index = ['sent1', 'sent2'])
print(df2)
df2 = df2.fillna(value = 0)
df2
Out[42]:
Now we can look up values simply by giving (row label, column label) pairs to .loc. Name the row first, then the column.
In [43]:
print(df1.loc['sent1', 'ball'])
print(df1.loc['sent2', 'ball'])
print(df2.loc['sent1', 'ball'])
print(df2.loc['sent2', 'ball'])
# or do it like this, with attribute access (column first, then row):
df1.ball.sent1
df1
Out[43]:
We can also look up values by their integer row and column positions with .iloc. Again, first the row, then the column.
In [44]:
df1.iloc[0, 0]
Out[44]:
In [45]:
df1.iloc[1, 0]
Out[45]:
In [46]:
df2.iloc[0, 0]
Out[46]:
In [47]:
df2.iloc[1, 0]
Out[47]:
In [48]:
df2.iloc[0]
Out[48]:
In [49]:
df2.index
Out[49]:
In [50]:
df2.values # which will return a numpy 2d array
Out[50]:
Below are a few other things you can do with a DataFrame.
In [51]:
df2.min(axis = 0)
Out[51]:
In [52]:
df2.min(axis = 1)
Out[52]:
In [53]:
np.min(df2, axis = 1) # numpy function works but is slightly slower
Out[53]:
In [54]:
df2.max(axis = 1)
Out[54]:
In [55]:
df2.idxmin(axis = 1) # index of the min
Out[55]:
In [56]:
df2.idxmax(axis = 1) # index of the max
Out[56]:
In [57]:
df2.values.max() # max of all of the values
Out[57]:
And we can compute simple statistics.
In [58]:
df2.describe()
Out[58]:
In [59]:
df2.mean(axis = 1)
Out[59]:
In [60]:
df2.loc['sent1'].mean()
Out[60]:
In [61]:
df2.std(axis = 1) # standard deviation
Out[61]:
Now, what can we do with this? We can use, e.g., the correlation metric in Pandas.
In [62]:
df2.iloc[0].corr(df2.iloc[1])
Out[62]:
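The corr method works column-wise, so if we want the full table of pairwise correlations between our documents we can transpose first. A quick sketch of this (not part of the original cells):
In [ ]:
# correlate the document rows by transposing so that sent1 and sent2 become the columns
df2.T.corr()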
Or, if we have the scikit-learn package, there is a lot more we can do.
Note: to install scikit-learn on Linux with Python 3.4, use the following command:
[sudo] pip3 install git+https://github.com/scikit-learn/scikit-learn.git
The tf-idf metric stands for 'term frequency-inverse document frequency'. It weights the importance of each word for each document, based on how often the word occurs in that document and the inverse of how many documents in the corpus contain it.
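In its simplest form, the weight of a term $t$ in a document $d$ is $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$, where $\mathrm{tf}(t, d)$ is the count of $t$ in $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents containing $t$; scikit-learn's TfidfTransformer computes a smoothed, length-normalized variant of this.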
In [63]:
from sklearn.feature_extraction.text import TfidfTransformer
pd.DataFrame(TfidfTransformer().fit_transform(df2).toarray(), index = df2.index, columns = df2.columns) # tf-idf weights for our toy term-document matrix
You can also measure the distance between two documents with the pairwise_distances function in sklearn.
In [ ]:
from sklearn.metrics.pairwise import pairwise_distances
euclid = pairwise_distances(df2) #Euclidean distance between the two documents.
df_euclid = pd.DataFrame(data = euclid, index = df2.index, columns = df2.index)
print(df_euclid)
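The metric argument lets you swap in other distance functions. As a quick sketch (the choice of metric = 'cosine' here is just an illustration, not part of the original cell), the same call with cosine distance looks like this:
In [ ]:
cosine = pairwise_distances(df2, metric = 'cosine') # cosine distance = 1 - cosine similarity
print(pd.DataFrame(data = cosine, index = df2.index, columns = df2.index))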
There are a number of text files in the Data sub-directory of this directory. Write a function that takes as input a text file's path, reads the text from the file, splits it into its individual words, and returns a Series with the word types (i.e., unique words) as the index and the number of times they occur as the values.
In [20]:
def split_txt(filename):
    # write your code here
    from collections import Counter
    words = open(filename).read().split()
    s = pd.Series(Counter(words))
    return s
emma_Series = split_txt('./Data/austen-emma.txt')
print(emma_Series[:20])
Take a look at the first 20 members of the Series. It looks like we have a couple of problems: capitalization and punctuation. Edit your function below to solve these problems.
Hint: use the punctuation constant in the string module to recognize punctuation.
In [34]:
from string import punctuation
from collections import Counter
import re
def split_txt(filename):
    # write your code here
    words = [word.lower() for word in open(filename).read().split()]
    new_words = []
    for word in words:
        new_words.extend(w for w in re.split('[%s]+' % punctuation, word) if w != '')
    s = pd.Series(Counter(new_words))
    return s
emma_Series = split_txt('./Data/austen-emma.txt')
'''
The following code checks whether you have successfully cleaned your corpus.
Please do not change it.
'''
problems = []
for word in emma_Series.index:
    if re.search('[\WA-Z]', word):
        problems.append(word)
print(len(problems))
If the length of the problems list is not 0, then you are not yet finished. Take a look at your results to check what you did wrong and edit your code to correct the problem.
You now have a function that can take a text, clean it, and produce a term-document array (Series). Now, integrate this function into a script that reads and cleans all the texts in the ./Data folder, and combine all of the resulting Series into one large term-document matrix. Transform this matrix into a tf-idf matrix, and then run at least 5 of the metrics under pairwise_distances in sklearn.
In [84]:
pw_list[0].shape
Out[84]:
In [90]:
#Write your code here or in a separate .py file. __Make sure I know where to find your file!__
from os import listdir
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import TfidfTransformer
texts = listdir('./Data')
s_list = []
for f in texts:
    s_list.append(split_txt('/'.join(['./Data', f])))
td_df = pd.DataFrame(s_list, index = texts).fillna(0)
tfidf = TfidfTransformer().fit_transform(td_df)
tf_df = pd.DataFrame(data = tfidf.toarray(), index = td_df.index, columns = td_df.columns)
pw = ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']
pw_list = []
for distance in pw:
    pw_list.append(pd.DataFrame(pairwise_distances(tf_df, metric = distance), index = td_df.index, columns = td_df.index))
In [91]:
pw_list[0]
Out[91]:
Consider your results from each of these different metrics. Is there anything that suggests which of these metrics are better for analyzing this data?
Write your answer in this text box, below this line.
Your answer: