Pandas is one of the main Python packages for data analysis, along with

Numpy: numerical array manipulations
Scipy: more sophisticated math classes (such as probability distributions, etc.)

Pandas is based on Numpy arrays and provides two main classes:

Series
DataFrame

Roughly, both classes contain a Numpy array has one of their attributes:

a 1D array for Series
a 2D array for DataFrame

Both classes provides additional functionality to plain Numpy arrays in order to facilitate the analysis of inhomogeneous labelled data.

Pandas Series

Definition: A Pandas Series is the programmatic embodiement of a data line, i.e. a labelled sequence of data points:

$$x_{i_1},x_{i_2},\dots, x_{i_n}$$

If the labels $i_1,\dots, i_n$ correspond to

unique ID of individuals from a population $\Omega=\{1,\dots,N\}$, and the data points correspond to the value of some characteristic $X:\Omega\rightarrow \mathbb R$ for these individuals, the data line is called a population sample (of observations): $x_{i_l} = X(i_l).$

instant in times $i_1 < i_2 <\cdots < i_n$ and the data points correspond to the value of the characteristic $X$ of a single fixed individual $\omega\in \Omega$ but at different moment in times, the data line is called a time series: $x_{i_l} = X_{t=i_l}(\omega).$

The Series class is imported from the module pandas as follows:



In [1]:

    
from pandas import Series

The Series constructor takes the following arguments (which are all set to None by default:

Series(data=None, index=None, dtype=None, name=None, copy=False)

where

data is any array like object containing the data points
index is any array-like object containing the labels (or indices)
dtype is the type of the datapoints, such as float64 or int64
copy is a flag telling the class constructor whether to internally copy the data or not

If nothing is passed to the class constructor, except for the data argument, the constructor will try to infer everthing from the data.

For instance, one can pass a dictionary to the Series constructor:



In [2]:

    
data = {'Luc':32, 'Bob':24, 'Lucy':89}

dataLine = Series(data)
dataLine









    Out[2]:





Bob     24
Luc     32
Lucy    89
dtype: int64

Exercise: Explore the method of the class Series for yourself, and figure out the ones that may be the most useful to you (Recall that auto-completion with key tab can help you greatly with this exploration).



In [3]:

    
dataMena = dataLine.mean()
dataStd = dataLine.std()
summary = dataLine.describe()
#dataLine.values
type(summary)
print(summary)









    



count     3.000000
mean     48.333333
std      35.444793
min      24.000000
25%      28.000000
50%      32.000000
75%      60.500000
max      89.000000
dtype: float64

Pandas DataFrame

Definition: A Pandas DataFrame is the programmatic embodiement of a data table, i.e., a 2D array of values with rows and columns explicitely labelled. Usually,

the row labels, say $i_1,\dots, i_n$ correspond to a sample of the unique individual IDs in a population $\Omega = \{1,\dots, N\}$

the column labels correspond to the name $X,Y,Z$ of some characteristics $X,Y,Z:\Omega\rightarrow\mathbb R$ of our population $\Omega$

The actual values in the data tables are the values of these characteristics for the sampled individuals.

One imports the DataFrame as follows



In [4]:

    
from pandas import DataFrame

The DataFrame constructor is almost the same as the one of the Series class

DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

except that

there is a columns argument to hold the column names (in any array-like container)

the data argument takes in any 2D container, such as a 2D Numpy array, a list or a dictionary of series, 1D arrays, lists, or dictionaries:



In [5]:

    
characteristics = ['X','Y','Z']
individuals     = ['i1','i2','i3','i4']
values          = [[19, 12, 1],
                   [23, 45, 0],
                   [45, 14, 1],
                   [23, 17, 1]]

data_table = DataFrame(values, index=individuals, columns=characteristics)
data_table

Since Pandas, as Numpy, uses the row-major convention, data frame constructed by passing a list of characteristics values (meant to be columns), instead of a list of characteristics for all the individuals (meant to be rows) will be constructed the other way round:



In [7]:

    
ages    = Series([10,67,22], index=['Bob', 'Luc', 'Ted'], name='Age')
weights = Series([44,60,80], index=['Bob', 'Luc', 'Ted'], name='Weight')

guys = DataFrame([ages, weights])
guys

Fortunately, we can always use the tranpose method to get what we want:



In [10]:

    
guys.transpose()

Exercise: Explore the DataFrame methods using tab completion, and find out the ones you think will be the most useful to you.



In [ ]:

Learning Pandas on data analysis example

The following data analysis example is drawn from chapter 9 of

Python for Data Analysis, by Wes McKinney, Springer (2012)

which is the reference text for data analysis using Python, and whose author Wes McKinney is the creator of the pandas package.

The data concerning the campaign donation for the 2012 Presidential election are available for scrutiny and analysis here:

2012 Presidential Campaign Finance

The data is in the form of a data table where:

rows = a campaign donors
columns = attributes/characteristics of the donation as describe below

Loading the data



In [11]:

    
from pandas import read_csv



In [12]:

    
data = read_csv('../data/P00000001-ALL.csv')



In [13]:

    
data.ix[0:5,0:5]









    Out[13]:






  
    
      
      cmte_id
      cand_id
      cand_nm
      contbr_nm
      contbr_city
    
  
  
    
      0
       C00410118
       P20002978
       Bachmann, Michelle
          HARVEY, WILLIAM
                   MOBILE
    
    
      1
       C00410118
       P20002978
       Bachmann, Michelle
          HARVEY, WILLIAM
                   MOBILE
    
    
      2
       C00410118
       P20002978
       Bachmann, Michelle
            SMITH, LANIER
                   LANETT
    
    
      3
       C00410118
       P20002978
       Bachmann, Michelle
         BLEVINS, DARONDA
                  PIGGOTT
    
    
      4
       C00410118
       P20002978
       Bachmann, Michelle
       WARDENBURG, HAROLD
       HOT SPRINGS NATION
    
    
      5
       C00410118
       P20002978
       Bachmann, Michelle
           BECKMAN, JAMES
               SPRINGDALE



In [14]:

    
data.shape









    Out[14]:





(1001731, 16)



In [15]:

    
row_number = data.shape[0]



In [16]:

    
data.info









    Out[16]:





<bound method DataFrame.info of <class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id              1001731  non-null values
cand_id              1001731  non-null values
cand_nm              1001731  non-null values
contbr_nm            1001731  non-null values
contbr_city          1001716  non-null values
contbr_st            1001727  non-null values
contbr_zip           1001620  non-null values
contbr_employer      994314  non-null values
contbr_occupation    994433  non-null values
contb_receipt_amt    1001731  non-null values
contb_receipt_dt     1001731  non-null values
receipt_desc         14166  non-null values
memo_cd              92482  non-null values
memo_text            97770  non-null values
form_tp              1001731  non-null values
file_num             1001731  non-null values
dtypes: float64(1), int64(1), object(14)>



In [ ]:

Exploring and preparing the data



In [47]:

    
from random import randint

i = randint(0,row_number)

print data.ix[i,:]









    



cmte_id                   C00495820
cand_id                   P80000748
cand_nm                   Paul, Ron
contbr_nm            O'NEILL, JAMES
contbr_city            MIDDLE POINT
contbr_st                        OH
contbr_zip                458639539
contbr_employer           REQUESTED
contbr_occupation         REQUESTED
contb_receipt_amt               300
contb_receipt_dt          29-AUG-11
receipt_desc                    NaN
memo_cd                         NaN
memo_text                       NaN
form_tp                       SA17A
file_num                     779225
Name: 810293, dtype: object



In [ ]:

Adding party information



In [53]:

    
candidates = data.cand_nm.unique()
candidates









    Out[53]:





array([Bachmann, Michelle, Romney, Mitt, Obama, Barack,
       Roemer, Charles E. 'Buddy' III, Pawlenty, Timothy,
       Johnson, Gary Earl, Paul, Ron, Santorum, Rick, Cain, Herman,
       Gingrich, Newt, McCotter, Thaddeus G, Huntsman, Jon, Perry, Rick], dtype=object)



In [57]:

    
parties = {"Bachmann, Michelle":"Republican",
           "Cain, Herman":"Republican",
           "Gingrich, Newt":"Republican",
           "Huntsman, Jon":"Republican",
           "Johnson, Gary, Earl":"Republican",
           "McCotter, Thaddeus G":"Republican",
           "Obama, Barack":"Democrat",
           "Paul, Ron":"Republican",
           "Pawlenty, Timothy":"Republican",
           "Perry, Rick":"Republican",
           "Roemer, Charles, E. 'Buddy' III":"Republican",
           "Romney, Mitt":"Republican",
           "Santorum, Rick":"Republican",
           "Stein, Jill":"Republican"}

parties[data.cand_nm[800]]









    Out[57]:





'Republican'



In [84]:

    
from random import sample

sample_size = 6
rows        = sample(range(0, row_number), sample_size)
sample = data.ix[rows, ['contb_receipt_amt', 'contbr_nm', 'cand_nm']]

sample['contbr_pt'] = sample.cand_nm.map(parties)

sample.columns = ['Amount contributed', 'Contributor Name', 'Candidate Contributed to','Party Contributed To']

sample









    Out[84]:






  
    
      
      Amount contributed
      Contributor Name
      Candidate Contributed to
      Party Contributed To
    
  
  
    
      794790
        100.00
          FREEMAN, ERIK D.
            Paul, Ron
       Republican
    
    
      830929
        100.00
       KAISER, HENRY W MR.
            Paul, Ron
       Republican
    
    
      926770
        300.00
            GREIG, CELESTE
       Gingrich, Newt
       Republican
    
    
      848723
         20.12
              BOYD, EDWARD
            Paul, Ron
       Republican
    
    
      628019
        380.00
           KILLION, MARTHA
        Obama, Barack
         Democrat
    
    
      69486 
       2500.00
        VIOLA, VINCENT MR.
         Romney, Mitt
       Republican



In [86]:

    
data['contbr_pt'] = data.cand_nm.map(parties)


data[['contbr_nm', 'cand_nm', 'contbr_pt']].head()









    Out[86]:






  
    
      
      contbr_nm
      cand_nm
      contbr_pt
    
  
  
    
      0
          HARVEY, WILLIAM
       Bachmann, Michelle
       Republican
    
    
      1
          HARVEY, WILLIAM
       Bachmann, Michelle
       Republican
    
    
      2
            SMITH, LANIER
       Bachmann, Michelle
       Republican
    
    
      3
         BLEVINS, DARONDA
       Bachmann, Michelle
       Republican
    
    
      4
       WARDENBURG, HAROLD
       Bachmann, Michelle
       Republican



In [87]:

    
data.contbr_pt.value_counts()









    Out[87]:





Democrat      593746
Republican    400831
dtype: int64



In [ ]:

Removing refund entries



In [89]:

    
not_refund_rows = data.contb_receipt_amt > 0
not_refund_rows.value_counts()

not_refund_rows[:10]









    Out[89]:





0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
Name: contb_receipt_amt, dtype: bool



In [90]:

    
donnors = data[not_refund_rows]

Keeping only donnors to the two mains candidates



In [91]:

    
donnors = donnors[donnors.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]

donnors.contbr_pt.value_counts()









    Out[91]:





Democrat      589127
Republican    105155
dtype: int64

Cleaning the occupation column



In [92]:

    
occupations = donnors.contbr_occupation.value_counts()



In [94]:

    
occupations[:20]









    Out[94]:





RETIRED                                   177473
ATTORNEY                                   30133
INFORMATION REQUESTED                      24747
HOMEMAKER                                  19626
PHYSICIAN                                  17206
INFORMATION REQUESTED PER BEST EFFORTS     12545
PROFESSOR                                  11804
TEACHER                                    11512
CONSULTANT                                 10061
NOT EMPLOYED                                9696
LAWYER                                      7438
ENGINEER                                    6107
PRESIDENT                                   4864
MANAGER                                     4745
WRITER                                      4439
SELF-EMPLOYED                               3763
SALES                                       3697
EXECUTIVE                                   3609
OWNER                                       3408
EDUCATOR                                    3360
dtype: int64



In [95]:

    
occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
               'INFORMATION REQUESTED':'NOT PROVIDED',
               'INFORMATION REQUESTED (BEST EFFORTS)':'NOT PROVIDED',
               'C.E.O.': 'CEO'}



In [96]:

    
def f(x):
    return occ_mapping.get(x, x)

donnors.contbr_occupation = donnors.contbr_occupation.map(f)

Cleaning the employer columns



In [97]:

    
donnors.contbr_employer.value_counts()









    Out[97]:





RETIRED                                   173287
SELF-EMPLOYED                              79638
NOT EMPLOYED                               45727
INFORMATION REQUESTED                      25524
INFORMATION REQUESTED PER BEST EFFORTS     13104
HOMEMAKER                                  12081
OBAMA FOR AMERICA                           1741
STUDENT                                     1417
SELF                                        1228
DISABLED                                    1096
IBM                                         1027
REFUSED                                      854
KAISER PERMANENTE                            828
AT&T                                         796
US ARMY                                      680
...
MEYER BROOKS DEMMA & BLOHM               1
BATES MEMORIAL BAPTIST                   1
WEISSMAN, NOWACK, CURRY & WILCO, P.C.    1
NELSEN THOMPSON PEGUE & THORNTON         1
NET ACCESS                               1
HUFFER & HUFFER                          1
PHILANTHROPY                             1
HRM                                      1
SOUTHERN ILLINOIS UNIV - EDWARDSVILLE    1
NWCS                                     1
PEERLESS NETWORK                         1
LEADERSHIP INCENTER                      1
U OF DENVER                              1
DOUGLAS CO. PUBLIC DEFENDER              1
LAW OFFICE OF MARDI HARRISON             1
Length: 93117, dtype: int64



In [98]:

    
emp_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
               'INFORMATION REQUESTED':'NOT PROVIDED',
               'SELF' : 'SELF-EMPLOYED',
               'SELF EMPLOYED':'SELF-EMPLOYED'}



In [99]:

    
def g(x):
    return emp_mapping.get(x, x) 

donnors.contbr_employer = donnors.contbr_employer.map(g)

Visualizing and analysing the data



In [100]:

    
donnors.contbr_pt.value_counts()









    Out[100]:





Democrat      589127
Republican    105155
dtype: int64



In [101]:

    
by_occupation = donnors.pivot_table('contb_receipt_amt', 
                                    rows='contbr_occupation',
                                    cols='contbr_pt', 
                                    aggfunc='sum')



In [102]:

    
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]



In [103]:

    
over_2mm









    Out[103]:






  
    
      contbr_pt
      Democrat
      Republican
    
    
      contbr_occupation
      
      
    
  
  
    
      ATTORNEY
       11141982.97
        5364718.82
    
    
      CEO
        2074974.79
        2324297.03
    
    
      CONSULTANT
        2459912.71
        1424894.01
    
    
      EXECUTIVE
        1355161.05
        2300947.03
    
    
      HOMEMAKER
        4248875.80
        8147446.22
    
    
      INVESTOR
         884133.00
        1537595.12
    
    
      LAWYER
        3160478.87
           7705.20
    
    
      NOT PROVIDED
        4866973.96
       11396894.84
    
    
      PHYSICIAN
        3735124.94
        1368023.96
    
    
      PRESIDENT
        1878509.95
        2491244.89
    
    
      PROFESSOR
        2165071.08
         161362.12
    
    
      RETIRED
       25305116.38
       11508473.59



In [104]:

    
%matplotlib inline



In [105]:

    
over_2mm.plot(kind='barh');



In [ ]:

	cmte_id	cand_id	cand_nm	contbr_nm	contbr_city
0	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE
1	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE
2	C00410118	P20002978	Bachmann, Michelle	SMITH, LANIER	LANETT
3	C00410118	P20002978	Bachmann, Michelle	BLEVINS, DARONDA	PIGGOTT
4	C00410118	P20002978	Bachmann, Michelle	WARDENBURG, HAROLD	HOT SPRINGS NATION
5	C00410118	P20002978	Bachmann, Michelle	BECKMAN, JAMES	SPRINGDALE

	Amount contributed	Contributor Name	Candidate Contributed to	Party Contributed To
794790	100.00	FREEMAN, ERIK D.	Paul, Ron	Republican
830929	100.00	KAISER, HENRY W MR.	Paul, Ron	Republican
926770	300.00	GREIG, CELESTE	Gingrich, Newt	Republican
848723	20.12	BOYD, EDWARD	Paul, Ron	Republican
628019	380.00	KILLION, MARTHA	Obama, Barack	Democrat
69486	2500.00	VIOLA, VINCENT MR.	Romney, Mitt	Republican

contbr_pt	Democrat	Republican
contbr_occupation
ATTORNEY	11141982.97	5364718.82
CEO	2074974.79	2324297.03
CONSULTANT	2459912.71	1424894.01
EXECUTIVE	1355161.05	2300947.03
HOMEMAKER	4248875.80	8147446.22
INVESTOR	884133.00	1537595.12
LAWYER	3160478.87	7705.20
NOT PROVIDED	4866973.96	11396894.84
PHYSICIAN	3735124.94	1368023.96
PRESIDENT	1878509.95	2491244.89
PROFESSOR	2165071.08	161362.12
RETIRED	25305116.38	11508473.59