Pandas is one of the main Python packages for data analysis, along with

• Numpy: numerical array manipulations
• Scipy: more sophisticated math classes (such as probability distributions, etc.)

Pandas is based on Numpy arrays and provides two main classes:

Series
DataFrame



Roughly, both classes contain a Numpy array has one of their attributes:

• a 1D array for Series

• a 2D array for DataFrame

Both classes provides additional functionality to plain Numpy arrays in order to facilitate the analysis of inhomogeneous labelled data.

## Pandas Series

Definition: A Pandas Series is the programmatic embodiement of a data line, i.e. a labelled sequence of data points:

$$x_{i_1},x_{i_2},\dots, x_{i_n}$$

If the labels $i_1,\dots, i_n$ correspond to

• unique ID of individuals from a population $\Omega=\{1,\dots,N\}$, and the data points correspond to the value of some characteristic $X:\Omega\rightarrow \mathbb R$ for these individuals, the data line is called a population sample (of observations): $x_{i_l} = X(i_l).$
• instant in times $i_1 < i_2 <\cdots < i_n$ and the data points correspond to the value of the characteristic $X$ of a single fixed individual $\omega\in \Omega$ but at different moment in times, the data line is called a time series: $x_{i_l} = X_{t=i_l}(\omega).$

The Series class is imported from the module pandas as follows:



In [1]:

from pandas import Series



The Series constructor takes the following arguments (which are all set to None by default:

Series(data=None, index=None, dtype=None, name=None, copy=False)



where

• data is any array like object containing the data points
• index is any array-like object containing the labels (or indices)
• dtype is the type of the datapoints, such as float64 or int64
• copy is a flag telling the class constructor whether to internally copy the data or not

If nothing is passed to the class constructor, except for the data argument, the constructor will try to infer everthing from the data.

For instance, one can pass a dictionary to the Series constructor:



In [2]:

data = {'Luc':32, 'Bob':24, 'Lucy':89}

dataLine = Series(data)
dataLine




Out[2]:

Bob     24
Luc     32
Lucy    89
dtype: int64



Exercise: Explore the method of the class Series for yourself, and figure out the ones that may be the most useful to you (Recall that auto-completion with key tab can help you greatly with this exploration).



In [3]:

dataMena = dataLine.mean()
dataStd = dataLine.std()
summary = dataLine.describe()
#dataLine.values
type(summary)
print(summary)




count     3.000000
mean     48.333333
std      35.444793
min      24.000000
25%      28.000000
50%      32.000000
75%      60.500000
max      89.000000
dtype: float64



## Pandas DataFrame

Definition: A Pandas DataFrame is the programmatic embodiement of a data table, i.e., a 2D array of values with rows and columns explicitely labelled. Usually,

• the row labels, say $i_1,\dots, i_n$ correspond to a sample of the unique individual IDs in a population $\Omega = \{1,\dots, N\}$
• the column labels correspond to the name $X,Y,Z$ of some characteristics $X,Y,Z:\Omega\rightarrow\mathbb R$ of our population $\Omega$
• The actual values in the data tables are the values of these characteristics for the sampled individuals.

One imports the DataFrame as follows



In [4]:

from pandas import DataFrame



The DataFrame constructor is almost the same as the one of the Series class

DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

except that

• there is a columns argument to hold the column names (in any array-like container)
• the data argument takes in any 2D container, such as a 2D Numpy array, a list or a dictionary of series, 1D arrays, lists, or dictionaries:


In [5]:

characteristics = ['X','Y','Z']
individuals     = ['i1','i2','i3','i4']
values          = [[19, 12, 1],
[23, 45, 0],
[45, 14, 1],
[23, 17, 1]]

data_table = DataFrame(values, index=individuals, columns=characteristics)
data_table




Out[5]:

X
Y
Z

i1
19
12
1

i2
23
45
0

i3
45
14
1

i4
23
17
1



Since Pandas, as Numpy, uses the row-major convention, data frame constructed by passing a list of characteristics values (meant to be columns), instead of a list of characteristics for all the individuals (meant to be rows) will be constructed the other way round:



In [7]:

ages    = Series([10,67,22], index=['Bob', 'Luc', 'Ted'], name='Age')
weights = Series([44,60,80], index=['Bob', 'Luc', 'Ted'], name='Weight')

guys = DataFrame([ages, weights])
guys




Out[7]:

Bob
Luc
Ted

Age
10
67
22

Weight
44
60
80



Fortunately, we can always use the tranpose method to get what we want:



In [10]:

guys.transpose()




Out[10]:

Age
Weight

Bob
10
44

Luc
67
60

Ted
22
80



Exercise: Explore the DataFrame methods using tab completion, and find out the ones you think will be the most useful to you.



In [ ]:



## Learning Pandas on data analysis example

The following data analysis example is drawn from chapter 9 of

which is the reference text for data analysis using Python, and whose author Wes McKinney is the creator of the pandas package.

The data concerning the campaign donation for the 2012 Presidential election are available for scrutiny and analysis here:

The data is in the form of a data table where:

• rows = a campaign donors
• columns = attributes/characteristics of the donation as describe below



In [11]:




In [12]:




In [13]:

data.ix[0:5,0:5]




Out[13]:

cmte_id
cand_id
cand_nm
contbr_nm
contbr_city

0
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE

1
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE

2
C00410118
P20002978
Bachmann, Michelle
SMITH, LANIER
LANETT

3
C00410118
P20002978
Bachmann, Michelle
BLEVINS, DARONDA
PIGGOTT

4
C00410118
P20002978
Bachmann, Michelle
WARDENBURG, HAROLD
HOT SPRINGS NATION

5
C00410118
P20002978
Bachmann, Michelle
BECKMAN, JAMES
SPRINGDALE




In [14]:

data.shape




Out[14]:

(1001731, 16)




In [15]:

row_number = data.shape[0]




In [16]:

data.info




Out[16]:

<bound method DataFrame.info of <class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id              1001731  non-null values
cand_id              1001731  non-null values
cand_nm              1001731  non-null values
contbr_nm            1001731  non-null values
contbr_city          1001716  non-null values
contbr_st            1001727  non-null values
contbr_zip           1001620  non-null values
contbr_employer      994314  non-null values
contbr_occupation    994433  non-null values
contb_receipt_amt    1001731  non-null values
contb_receipt_dt     1001731  non-null values
receipt_desc         14166  non-null values
memo_cd              92482  non-null values
memo_text            97770  non-null values
form_tp              1001731  non-null values
file_num             1001731  non-null values
dtypes: float64(1), int64(1), object(14)>




In [ ]:



## Exploring and preparing the data



In [47]:

from random import randint

i = randint(0,row_number)

print data.ix[i,:]




cmte_id                   C00495820
cand_id                   P80000748
cand_nm                   Paul, Ron
contbr_nm            O'NEILL, JAMES
contbr_city            MIDDLE POINT
contbr_st                        OH
contbr_zip                458639539
contbr_employer           REQUESTED
contbr_occupation         REQUESTED
contb_receipt_amt               300
contb_receipt_dt          29-AUG-11
receipt_desc                    NaN
memo_cd                         NaN
memo_text                       NaN
form_tp                       SA17A
file_num                     779225
Name: 810293, dtype: object




In [ ]:





In [53]:

candidates = data.cand_nm.unique()
candidates




Out[53]:

array([Bachmann, Michelle, Romney, Mitt, Obama, Barack,
Roemer, Charles E. 'Buddy' III, Pawlenty, Timothy,
Johnson, Gary Earl, Paul, Ron, Santorum, Rick, Cain, Herman,
Gingrich, Newt, McCotter, Thaddeus G, Huntsman, Jon, Perry, Rick], dtype=object)




In [57]:

parties = {"Bachmann, Michelle":"Republican",
"Cain, Herman":"Republican",
"Gingrich, Newt":"Republican",
"Huntsman, Jon":"Republican",
"Johnson, Gary, Earl":"Republican",
"Obama, Barack":"Democrat",
"Paul, Ron":"Republican",
"Pawlenty, Timothy":"Republican",
"Perry, Rick":"Republican",
"Roemer, Charles, E. 'Buddy' III":"Republican",
"Romney, Mitt":"Republican",
"Santorum, Rick":"Republican",
"Stein, Jill":"Republican"}

parties[data.cand_nm[800]]




Out[57]:

'Republican'




In [84]:

from random import sample

sample_size = 6
rows        = sample(range(0, row_number), sample_size)
sample = data.ix[rows, ['contb_receipt_amt', 'contbr_nm', 'cand_nm']]

sample['contbr_pt'] = sample.cand_nm.map(parties)

sample.columns = ['Amount contributed', 'Contributor Name', 'Candidate Contributed to','Party Contributed To']

sample




Out[84]:

Amount contributed
Contributor Name
Candidate Contributed to
Party Contributed To

794790
100.00
FREEMAN, ERIK D.
Paul, Ron
Republican

830929
100.00
KAISER, HENRY W MR.
Paul, Ron
Republican

926770
300.00
GREIG, CELESTE
Gingrich, Newt
Republican

848723
20.12
BOYD, EDWARD
Paul, Ron
Republican

628019
380.00
KILLION, MARTHA
Obama, Barack
Democrat

69486
2500.00
VIOLA, VINCENT MR.
Romney, Mitt
Republican




In [86]:

data['contbr_pt'] = data.cand_nm.map(parties)




Out[86]:

contbr_nm
cand_nm
contbr_pt

0
HARVEY, WILLIAM
Bachmann, Michelle
Republican

1
HARVEY, WILLIAM
Bachmann, Michelle
Republican

2
SMITH, LANIER
Bachmann, Michelle
Republican

3
BLEVINS, DARONDA
Bachmann, Michelle
Republican

4
WARDENBURG, HAROLD
Bachmann, Michelle
Republican




In [87]:

data.contbr_pt.value_counts()




Out[87]:

Democrat      593746
Republican    400831
dtype: int64




In [ ]:



### Removing refund entries



In [89]:

not_refund_rows = data.contb_receipt_amt > 0
not_refund_rows.value_counts()

not_refund_rows[:10]




Out[89]:

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
Name: contb_receipt_amt, dtype: bool




In [90]:

donnors = data[not_refund_rows]



### Keeping only donnors to the two mains candidates



In [91]:

donnors = donnors[donnors.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]

donnors.contbr_pt.value_counts()




Out[91]:

Democrat      589127
Republican    105155
dtype: int64



### Cleaning the occupation column



In [92]:

occupations = donnors.contbr_occupation.value_counts()




In [94]:

occupations[:20]




Out[94]:

RETIRED                                   177473
ATTORNEY                                   30133
INFORMATION REQUESTED                      24747
HOMEMAKER                                  19626
PHYSICIAN                                  17206
INFORMATION REQUESTED PER BEST EFFORTS     12545
PROFESSOR                                  11804
TEACHER                                    11512
CONSULTANT                                 10061
NOT EMPLOYED                                9696
LAWYER                                      7438
ENGINEER                                    6107
PRESIDENT                                   4864
MANAGER                                     4745
WRITER                                      4439
SELF-EMPLOYED                               3763
SALES                                       3697
EXECUTIVE                                   3609
OWNER                                       3408
EDUCATOR                                    3360
dtype: int64




In [95]:

occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
'INFORMATION REQUESTED':'NOT PROVIDED',
'INFORMATION REQUESTED (BEST EFFORTS)':'NOT PROVIDED',
'C.E.O.': 'CEO'}




In [96]:

def f(x):
return occ_mapping.get(x, x)

donnors.contbr_occupation = donnors.contbr_occupation.map(f)



### Cleaning the employer columns



In [97]:

donnors.contbr_employer.value_counts()




Out[97]:

RETIRED                                   173287
SELF-EMPLOYED                              79638
NOT EMPLOYED                               45727
INFORMATION REQUESTED                      25524
INFORMATION REQUESTED PER BEST EFFORTS     13104
HOMEMAKER                                  12081
OBAMA FOR AMERICA                           1741
STUDENT                                     1417
SELF                                        1228
DISABLED                                    1096
IBM                                         1027
REFUSED                                      854
KAISER PERMANENTE                            828
AT&T                                         796
US ARMY                                      680
...
MEYER BROOKS DEMMA & BLOHM               1
BATES MEMORIAL BAPTIST                   1
WEISSMAN, NOWACK, CURRY & WILCO, P.C.    1
NELSEN THOMPSON PEGUE & THORNTON         1
NET ACCESS                               1
HUFFER & HUFFER                          1
PHILANTHROPY                             1
HRM                                      1
SOUTHERN ILLINOIS UNIV - EDWARDSVILLE    1
NWCS                                     1
PEERLESS NETWORK                         1
U OF DENVER                              1
DOUGLAS CO. PUBLIC DEFENDER              1
LAW OFFICE OF MARDI HARRISON             1
Length: 93117, dtype: int64




In [98]:

emp_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
'INFORMATION REQUESTED':'NOT PROVIDED',
'SELF' : 'SELF-EMPLOYED',
'SELF EMPLOYED':'SELF-EMPLOYED'}




In [99]:

def g(x):
return emp_mapping.get(x, x)

donnors.contbr_employer = donnors.contbr_employer.map(g)



## Visualizing and analysing the data



In [100]:

donnors.contbr_pt.value_counts()




Out[100]:

Democrat      589127
Republican    105155
dtype: int64




In [101]:

by_occupation = donnors.pivot_table('contb_receipt_amt',
rows='contbr_occupation',
cols='contbr_pt',
aggfunc='sum')




In [102]:

over_2mm = by_occupation[by_occupation.sum(1) > 2000000]




In [103]:

over_2mm




Out[103]:

contbr_pt
Democrat
Republican

contbr_occupation

ATTORNEY
11141982.97
5364718.82

CEO
2074974.79
2324297.03

CONSULTANT
2459912.71
1424894.01

EXECUTIVE
1355161.05
2300947.03

HOMEMAKER
4248875.80
8147446.22

INVESTOR
884133.00
1537595.12

LAWYER
3160478.87
7705.20

NOT PROVIDED
4866973.96
11396894.84

PHYSICIAN
3735124.94
1368023.96

PRESIDENT
1878509.95
2491244.89

PROFESSOR
2165071.08
161362.12

RETIRED
25305116.38
11508473.59




In [104]:

%matplotlib inline




In [105]:

over_2mm.plot(kind='barh');







In [ ]: