Pandas is one of the main Python packages for data analysis, along with

  • Numpy: numerical array manipulations
  • Scipy: more sophisticated math classes (such as probability distributions, etc.)

Pandas is based on Numpy arrays and provides two main classes:

Series
DataFrame

Roughly, both classes contain a Numpy array has one of their attributes:

  • a 1D array for Series

  • a 2D array for DataFrame

Both classes provides additional functionality to plain Numpy arrays in order to facilitate the analysis of inhomogeneous labelled data.

Pandas Series

Definition: A Pandas Series is the programmatic embodiement of a data line, i.e. a labelled sequence of data points:

$$x_{i_1},x_{i_2},\dots, x_{i_n}$$

If the labels $i_1,\dots, i_n$ correspond to

  • unique ID of individuals from a population $\Omega=\{1,\dots,N\}$, and the data points correspond to the value of some characteristic $X:\Omega\rightarrow \mathbb R$ for these individuals, the data line is called a population sample (of observations): $x_{i_l} = X(i_l).$
  • instant in times $i_1 < i_2 <\cdots < i_n$ and the data points correspond to the value of the characteristic $X$ of a single fixed individual $\omega\in \Omega$ but at different moment in times, the data line is called a time series: $x_{i_l} = X_{t=i_l}(\omega).$

The Series class is imported from the module pandas as follows:


In [1]:
from pandas import Series

The Series constructor takes the following arguments (which are all set to None by default:

Series(data=None, index=None, dtype=None, name=None, copy=False)

where

  • data is any array like object containing the data points
  • index is any array-like object containing the labels (or indices)
  • dtype is the type of the datapoints, such as float64 or int64
  • copy is a flag telling the class constructor whether to internally copy the data or not

If nothing is passed to the class constructor, except for the data argument, the constructor will try to infer everthing from the data.

For instance, one can pass a dictionary to the Series constructor:


In [2]:
data = {'Luc':32, 'Bob':24, 'Lucy':89}

dataLine = Series(data)
dataLine


Out[2]:
Bob     24
Luc     32
Lucy    89
dtype: int64

Exercise: Explore the method of the class Series for yourself, and figure out the ones that may be the most useful to you (Recall that auto-completion with key tab can help you greatly with this exploration).


In [3]:
dataMena = dataLine.mean()
dataStd = dataLine.std()
summary = dataLine.describe()
#dataLine.values
type(summary)
print(summary)


count     3.000000
mean     48.333333
std      35.444793
min      24.000000
25%      28.000000
50%      32.000000
75%      60.500000
max      89.000000
dtype: float64

Pandas DataFrame

Definition: A Pandas DataFrame is the programmatic embodiement of a data table, i.e., a 2D array of values with rows and columns explicitely labelled. Usually,

  • the row labels, say $i_1,\dots, i_n$ correspond to a sample of the unique individual IDs in a population $\Omega = \{1,\dots, N\}$
  • the column labels correspond to the name $X,Y,Z$ of some characteristics $X,Y,Z:\Omega\rightarrow\mathbb R$ of our population $\Omega$
  • The actual values in the data tables are the values of these characteristics for the sampled individuals.

One imports the DataFrame as follows


In [4]:
from pandas import DataFrame

The DataFrame constructor is almost the same as the one of the Series class

DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

except that

  • there is a columns argument to hold the column names (in any array-like container)
  • the data argument takes in any 2D container, such as a 2D Numpy array, a list or a dictionary of series, 1D arrays, lists, or dictionaries:

In [5]:
characteristics = ['X','Y','Z']
individuals     = ['i1','i2','i3','i4']
values          = [[19, 12, 1],
                   [23, 45, 0],
                   [45, 14, 1],
                   [23, 17, 1]]

data_table = DataFrame(values, index=individuals, columns=characteristics)
data_table


Out[5]:
X Y Z
i1 19 12 1
i2 23 45 0
i3 45 14 1
i4 23 17 1

Since Pandas, as Numpy, uses the row-major convention, data frame constructed by passing a list of characteristics values (meant to be columns), instead of a list of characteristics for all the individuals (meant to be rows) will be constructed the other way round:


In [7]:
ages    = Series([10,67,22], index=['Bob', 'Luc', 'Ted'], name='Age')
weights = Series([44,60,80], index=['Bob', 'Luc', 'Ted'], name='Weight')

guys = DataFrame([ages, weights])
guys


Out[7]:
Bob Luc Ted
Age 10 67 22
Weight 44 60 80

Fortunately, we can always use the tranpose method to get what we want:


In [10]:
guys.transpose()


Out[10]:
Age Weight
Bob 10 44
Luc 67 60
Ted 22 80

Exercise: Explore the DataFrame methods using tab completion, and find out the ones you think will be the most useful to you.


In [ ]:

Learning Pandas on data analysis example

The following data analysis example is drawn from chapter 9 of

which is the reference text for data analysis using Python, and whose author Wes McKinney is the creator of the pandas package.

The data concerning the campaign donation for the 2012 Presidential election are available for scrutiny and analysis here:

The data is in the form of a data table where:

  • rows = a campaign donors
  • columns = attributes/characteristics of the donation as describe below

Loading the data


In [11]:
from pandas import read_csv

In [12]:
data = read_csv('../data/P00000001-ALL.csv')

In [13]:
data.ix[0:5,0:5]


Out[13]:
cmte_id cand_id cand_nm contbr_nm contbr_city
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION
5 C00410118 P20002978 Bachmann, Michelle BECKMAN, JAMES SPRINGDALE

In [14]:
data.shape


Out[14]:
(1001731, 16)

In [15]:
row_number = data.shape[0]

In [16]:
data.info


Out[16]:
<bound method DataFrame.info of <class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id              1001731  non-null values
cand_id              1001731  non-null values
cand_nm              1001731  non-null values
contbr_nm            1001731  non-null values
contbr_city          1001716  non-null values
contbr_st            1001727  non-null values
contbr_zip           1001620  non-null values
contbr_employer      994314  non-null values
contbr_occupation    994433  non-null values
contb_receipt_amt    1001731  non-null values
contb_receipt_dt     1001731  non-null values
receipt_desc         14166  non-null values
memo_cd              92482  non-null values
memo_text            97770  non-null values
form_tp              1001731  non-null values
file_num             1001731  non-null values
dtypes: float64(1), int64(1), object(14)>

In [ ]:

Exploring and preparing the data


In [47]:
from random import randint

i = randint(0,row_number)

print data.ix[i,:]


cmte_id                   C00495820
cand_id                   P80000748
cand_nm                   Paul, Ron
contbr_nm            O'NEILL, JAMES
contbr_city            MIDDLE POINT
contbr_st                        OH
contbr_zip                458639539
contbr_employer           REQUESTED
contbr_occupation         REQUESTED
contb_receipt_amt               300
contb_receipt_dt          29-AUG-11
receipt_desc                    NaN
memo_cd                         NaN
memo_text                       NaN
form_tp                       SA17A
file_num                     779225
Name: 810293, dtype: object

In [ ]:

Adding party information


In [53]:
candidates = data.cand_nm.unique()
candidates


Out[53]:
array([Bachmann, Michelle, Romney, Mitt, Obama, Barack,
       Roemer, Charles E. 'Buddy' III, Pawlenty, Timothy,
       Johnson, Gary Earl, Paul, Ron, Santorum, Rick, Cain, Herman,
       Gingrich, Newt, McCotter, Thaddeus G, Huntsman, Jon, Perry, Rick], dtype=object)

In [57]:
parties = {"Bachmann, Michelle":"Republican",
           "Cain, Herman":"Republican",
           "Gingrich, Newt":"Republican",
           "Huntsman, Jon":"Republican",
           "Johnson, Gary, Earl":"Republican",
           "McCotter, Thaddeus G":"Republican",
           "Obama, Barack":"Democrat",
           "Paul, Ron":"Republican",
           "Pawlenty, Timothy":"Republican",
           "Perry, Rick":"Republican",
           "Roemer, Charles, E. 'Buddy' III":"Republican",
           "Romney, Mitt":"Republican",
           "Santorum, Rick":"Republican",
           "Stein, Jill":"Republican"}

parties[data.cand_nm[800]]


Out[57]:
'Republican'

In [84]:
from random import sample

sample_size = 6
rows        = sample(range(0, row_number), sample_size)
sample = data.ix[rows, ['contb_receipt_amt', 'contbr_nm', 'cand_nm']]

sample['contbr_pt'] = sample.cand_nm.map(parties)

sample.columns = ['Amount contributed', 'Contributor Name', 'Candidate Contributed to','Party Contributed To']

sample


Out[84]:
Amount contributed Contributor Name Candidate Contributed to Party Contributed To
794790 100.00 FREEMAN, ERIK D. Paul, Ron Republican
830929 100.00 KAISER, HENRY W MR. Paul, Ron Republican
926770 300.00 GREIG, CELESTE Gingrich, Newt Republican
848723 20.12 BOYD, EDWARD Paul, Ron Republican
628019 380.00 KILLION, MARTHA Obama, Barack Democrat
69486 2500.00 VIOLA, VINCENT MR. Romney, Mitt Republican

In [86]:
data['contbr_pt'] = data.cand_nm.map(parties)


data[['contbr_nm', 'cand_nm', 'contbr_pt']].head()


Out[86]:
contbr_nm cand_nm contbr_pt
0 HARVEY, WILLIAM Bachmann, Michelle Republican
1 HARVEY, WILLIAM Bachmann, Michelle Republican
2 SMITH, LANIER Bachmann, Michelle Republican
3 BLEVINS, DARONDA Bachmann, Michelle Republican
4 WARDENBURG, HAROLD Bachmann, Michelle Republican

In [87]:
data.contbr_pt.value_counts()


Out[87]:
Democrat      593746
Republican    400831
dtype: int64

In [ ]:

Removing refund entries


In [89]:
not_refund_rows = data.contb_receipt_amt > 0
not_refund_rows.value_counts()

not_refund_rows[:10]


Out[89]:
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
Name: contb_receipt_amt, dtype: bool

In [90]:
donnors = data[not_refund_rows]

Keeping only donnors to the two mains candidates


In [91]:
donnors = donnors[donnors.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]

donnors.contbr_pt.value_counts()


Out[91]:
Democrat      589127
Republican    105155
dtype: int64

Cleaning the occupation column


In [92]:
occupations = donnors.contbr_occupation.value_counts()

In [94]:
occupations[:20]


Out[94]:
RETIRED                                   177473
ATTORNEY                                   30133
INFORMATION REQUESTED                      24747
HOMEMAKER                                  19626
PHYSICIAN                                  17206
INFORMATION REQUESTED PER BEST EFFORTS     12545
PROFESSOR                                  11804
TEACHER                                    11512
CONSULTANT                                 10061
NOT EMPLOYED                                9696
LAWYER                                      7438
ENGINEER                                    6107
PRESIDENT                                   4864
MANAGER                                     4745
WRITER                                      4439
SELF-EMPLOYED                               3763
SALES                                       3697
EXECUTIVE                                   3609
OWNER                                       3408
EDUCATOR                                    3360
dtype: int64

In [95]:
occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
               'INFORMATION REQUESTED':'NOT PROVIDED',
               'INFORMATION REQUESTED (BEST EFFORTS)':'NOT PROVIDED',
               'C.E.O.': 'CEO'}

In [96]:
def f(x):
    return occ_mapping.get(x, x)

donnors.contbr_occupation = donnors.contbr_occupation.map(f)

Cleaning the employer columns


In [97]:
donnors.contbr_employer.value_counts()


Out[97]:
RETIRED                                   173287
SELF-EMPLOYED                              79638
NOT EMPLOYED                               45727
INFORMATION REQUESTED                      25524
INFORMATION REQUESTED PER BEST EFFORTS     13104
HOMEMAKER                                  12081
OBAMA FOR AMERICA                           1741
STUDENT                                     1417
SELF                                        1228
DISABLED                                    1096
IBM                                         1027
REFUSED                                      854
KAISER PERMANENTE                            828
AT&T                                         796
US ARMY                                      680
...
MEYER BROOKS DEMMA & BLOHM               1
BATES MEMORIAL BAPTIST                   1
WEISSMAN, NOWACK, CURRY & WILCO, P.C.    1
NELSEN THOMPSON PEGUE & THORNTON         1
NET ACCESS                               1
HUFFER & HUFFER                          1
PHILANTHROPY                             1
HRM                                      1
SOUTHERN ILLINOIS UNIV - EDWARDSVILLE    1
NWCS                                     1
PEERLESS NETWORK                         1
LEADERSHIP INCENTER                      1
U OF DENVER                              1
DOUGLAS CO. PUBLIC DEFENDER              1
LAW OFFICE OF MARDI HARRISON             1
Length: 93117, dtype: int64

In [98]:
emp_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
               'INFORMATION REQUESTED':'NOT PROVIDED',
               'SELF' : 'SELF-EMPLOYED',
               'SELF EMPLOYED':'SELF-EMPLOYED'}

In [99]:
def g(x):
    return emp_mapping.get(x, x) 

donnors.contbr_employer = donnors.contbr_employer.map(g)

Visualizing and analysing the data


In [100]:
donnors.contbr_pt.value_counts()


Out[100]:
Democrat      589127
Republican    105155
dtype: int64

In [101]:
by_occupation = donnors.pivot_table('contb_receipt_amt', 
                                    rows='contbr_occupation',
                                    cols='contbr_pt', 
                                    aggfunc='sum')

In [102]:
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]

In [103]:
over_2mm


Out[103]:
contbr_pt Democrat Republican
contbr_occupation
ATTORNEY 11141982.97 5364718.82
CEO 2074974.79 2324297.03
CONSULTANT 2459912.71 1424894.01
EXECUTIVE 1355161.05 2300947.03
HOMEMAKER 4248875.80 8147446.22
INVESTOR 884133.00 1537595.12
LAWYER 3160478.87 7705.20
NOT PROVIDED 4866973.96 11396894.84
PHYSICIAN 3735124.94 1368023.96
PRESIDENT 1878509.95 2491244.89
PROFESSOR 2165071.08 161362.12
RETIRED 25305116.38 11508473.59

In [104]:
%matplotlib inline

In [105]:
over_2mm.plot(kind='barh');



In [ ]: