Pandas is one of the main Python packages for data analysis, along with
Pandas is based on Numpy arrays and provides two main classes:
Series
DataFrame
Roughly, both classes contain a Numpy array
as one of their attributes:
a 1D array for Series
a 2D array for DataFrame
Both classes provide additional functionality on top of plain Numpy arrays
in order to facilitate the analysis of inhomogeneous labelled data.
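As a minimal sketch of this relationship (the sample values and labels below are made up for illustration), the underlying array can be retrieved with the to_numpy method of either class:

```python
import numpy as np
from pandas import Series, DataFrame

# Hypothetical sample data, chosen only for illustration
s = Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])
df = DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['c1', 'c2'])

# to_numpy() returns the data as a plain Numpy array, without the labels
s_array = s.to_numpy()
df_array = df.to_numpy()

print(type(s_array), s_array.ndim)    # a 1D array for the Series
print(type(df_array), df_array.ndim)  # a 2D array for the DataFrame
```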
Definition: A Pandas Series
is the programmatic embodiment of a data line, i.e., a labelled sequence of data points:
If the labels $i_1,\dots, i_n$ correspond to
The Series class is imported from the module pandas as follows:
In [1]:
from pandas import Series
The Series constructor takes the following arguments (which, apart from copy, are all set to None
by default):
Series(data=None, index=None, dtype=None, name=None, copy=False)
where
data is any array-like object containing the data points
index is any array-like object containing the labels (or indices)
dtype is the type of the data points, such as float64 or int64
copy is a flag telling the class constructor whether to internally copy the data or not
If nothing is passed to the class constructor, except for the data argument, the constructor will try to infer everything from the data.
For instance, one can pass a dictionary to the Series
constructor:
In [2]:
data = {'Luc':32, 'Bob':24, 'Lucy':89}
dataLine = Series(data)
dataLine
Out[2]:
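Conversely, the arguments can be passed explicitly rather than inferred. The following is a small sketch with made-up values; the variable name ages is only an illustration:

```python
from pandas import Series

# Hypothetical values and labels, chosen only for illustration
ages = Series(data=[32, 24, 89],
              index=['Luc', 'Bob', 'Lucy'],
              dtype='float64',
              name='Age')

print(ages.dtype)    # float64, as requested, although the inputs are integers
print(ages.name)     # Age
print(ages['Bob'])   # 24.0, retrieved by label
```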
Exercise: Explore the methods of the class Series
for yourself, and figure out the ones that may be the most useful to you (recall that auto-completion with the Tab key can help you greatly with this exploration).
In [3]:
dataMean = dataLine.mean()
dataStd = dataLine.std()
# describe() returns the summary statistics as a new Series
summary = dataLine.describe()
#dataLine.values
print(type(summary))
print(summary)
Definition: A Pandas DataFrame
is the programmatic embodiment of a data table, i.e., a 2D array of values with rows and columns explicitly labelled. Usually,
One imports the DataFrame
as follows
In [4]:
from pandas import DataFrame
The DataFrame
constructor is almost the same as the one of the Series
class
DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
except that
the columns argument holds the column names (in any array-like container)
the data argument takes in any 2D container, such as a 2D Numpy array, a list or a dictionary of series, 1D arrays, lists, or dictionaries:
In [5]:
characteristics = ['X','Y','Z']
individuals = ['i1','i2','i3','i4']
values = [[19, 12, 1],
          [23, 45, 0],
          [45, 14, 1],
          [23, 17, 1]]
data_table = DataFrame(values, index=individuals, columns=characteristics)
data_table
Out[5]:
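Since the data argument also accepts a dictionary, the same table can be sketched column by column. The names columns_dict and table_from_dict below are illustrative only:

```python
from pandas import DataFrame

# Same values as in the table above, but organised column by column
columns_dict = {'X': [19, 23, 45, 23],
                'Y': [12, 45, 14, 17],
                'Z': [1, 0, 1, 1]}
table_from_dict = DataFrame(columns_dict, index=['i1', 'i2', 'i3', 'i4'])

print(table_from_dict.shape)            # (4, 3)
print(table_from_dict.loc['i2', 'Y'])   # 45
```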
Since Pandas, like Numpy, uses the row-major convention, each element of the list passed to the constructor is treated as a row. A data frame built from a list of characteristics (meant to be columns), instead of a list of individuals (meant to be rows), will therefore come out the other way round:
In [7]:
ages = Series([10,67,22], index=['Bob', 'Luc', 'Ted'], name='Age')
weights = Series([44,60,80], index=['Bob', 'Luc', 'Ted'], name='Weight')
guys = DataFrame([ages, weights])
guys
Out[7]:
Fortunately, we can always use the transpose method
to get what we want:
In [10]:
guys.transpose()
Out[10]:
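An alternative that avoids the transposition, sketched here under the assumption that the two Series share the same labels, is to pass a dictionary of Series, in which case each Series becomes a column. The name guys_by_column is illustrative:

```python
from pandas import Series, DataFrame

ages = Series([10, 67, 22], index=['Bob', 'Luc', 'Ted'], name='Age')
weights = Series([44, 60, 80], index=['Bob', 'Luc', 'Ted'], name='Weight')

# Each dictionary key becomes a column name, each Series a column
guys_by_column = DataFrame({'Age': ages, 'Weight': weights})

print(list(guys_by_column.columns))      # ['Age', 'Weight']
print(guys_by_column.loc['Luc', 'Age'])  # 67
```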
Exercise: Explore the DataFrame methods using tab completion, and find out the ones you think will be the most useful to you.
In [ ]:
The following data analysis example is drawn from chapter 9 of
which is the reference text for data analysis using Python, and whose author Wes McKinney is the creator of the pandas package.
The data concerning the campaign donations for the 2012 Presidential election are available for scrutiny and analysis here:
The data is in the form of a data table where:
In [11]:
from pandas import read_csv
In [12]:
data = read_csv('../data/P00000001-ALL.csv')
In [13]:
# .ix has been removed from pandas; .iloc selects the first five rows and columns by position
data.iloc[0:5, 0:5]
Out[13]:
In [14]:
data.shape
Out[14]:
In [15]:
row_number = data.shape[0]
In [16]:
data.info()
Out[16]:
In [47]:
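from random import randint
# randint includes both end points, so the last valid position is row_number - 1
i = randint(0, row_number - 1)
# .ix has been removed from pandas; .iloc selects the row by position
print(data.iloc[i, :])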
In [53]:
candidates = data.cand_nm.unique()
candidates
Out[53]:
In [57]:
parties = {"Bachmann, Michelle":"Republican",
"Cain, Herman":"Republican",
"Gingrich, Newt":"Republican",
"Huntsman, Jon":"Republican",
"Johnson, Gary, Earl":"Republican",
"McCotter, Thaddeus G":"Republican",
"Obama, Barack":"Democrat",
"Paul, Ron":"Republican",
"Pawlenty, Timothy":"Republican",
"Perry, Rick":"Republican",
"Roemer, Charles, E. 'Buddy' III":"Republican",
"Romney, Mitt":"Republican",
"Santorum, Rick":"Republican",
"Stein, Jill":"Green"}
parties[data.cand_nm[800]]
Out[57]:
In [84]:
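from random import sample
sample_size = 6
# Draw distinct random row labels (the default index runs from 0 to row_number - 1)
rows = sample(range(row_number), sample_size)
# .loc replaces the removed .ix indexer; the name sampled avoids shadowing random.sample
sampled = data.loc[rows, ['contb_receipt_amt', 'contbr_nm', 'cand_nm']].copy()
sampled['contbr_pt'] = sampled.cand_nm.map(parties)
sampled.columns = ['Amount contributed', 'Contributor Name', 'Candidate Contributed to','Party Contributed To']
sampled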
Out[84]:
In [86]:
data['contbr_pt'] = data.cand_nm.map(parties)
data[['contbr_nm', 'cand_nm', 'contbr_pt']].head()
Out[86]:
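When map is given a dictionary, any value without a matching key becomes NaN, so it is worth checking for missing entries after a mapping like the one above. A minimal sketch on made-up names (names, lookup and mapped are illustrative):

```python
from pandas import Series

# Hypothetical candidate names and party lookup, for illustration only
names = Series(['A', 'B', 'C', 'A'])
lookup = {'A': 'Party 1', 'B': 'Party 2'}   # 'C' is deliberately missing

mapped = names.map(lookup)
print(mapped.isnull().sum())   # 1: the unmatched 'C' became NaN
print(mapped.tolist())
```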
In [87]:
data.contbr_pt.value_counts()
Out[87]:
In [89]:
not_refund_rows = data.contb_receipt_amt > 0
not_refund_rows.value_counts()
not_refund_rows[:10]
Out[89]:
In [90]:
donnors = data[not_refund_rows]
In [91]:
donnors = donnors[donnors.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
donnors.contbr_pt.value_counts()
Out[91]:
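The two filtering steps above rely on boolean masks: a comparison yields one True/False flag per row, and indexing with that mask keeps only the flagged rows. A small sketch on a made-up table (toy, positive and kept are illustrative names):

```python
from pandas import DataFrame

# Hypothetical contributions; negative amounts stand for refunds
toy = DataFrame({'cand': ['A', 'B', 'C', 'A'],
                 'amount': [50.0, -20.0, 10.0, 75.0]})

positive = toy.amount > 0                 # boolean Series, one flag per row
kept = toy[positive]                      # keeps rows 0, 2 and 3
kept = kept[kept.cand.isin(['A', 'B'])]   # keeps rows 0 and 3

print(kept.index.tolist())   # [0, 3]
```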
In [92]:
occupations = donnors.contbr_occupation.value_counts()
In [94]:
occupations[:20]
Out[94]:
In [95]:
occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
'INFORMATION REQUESTED':'NOT PROVIDED',
'INFORMATION REQUESTED (BEST EFFORTS)':'NOT PROVIDED',
'C.E.O.': 'CEO'}
In [96]:
def f(x):
    return occ_mapping.get(x, x)
donnors.contbr_occupation = donnors.contbr_occupation.map(f)
In [97]:
donnors.contbr_employer.value_counts()
Out[97]:
In [98]:
emp_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
'INFORMATION REQUESTED':'NOT PROVIDED',
'SELF' : 'SELF-EMPLOYED',
'SELF EMPLOYED':'SELF-EMPLOYED'}
In [99]:
def g(x):
    return emp_mapping.get(x, x)
donnors.contbr_employer = donnors.contbr_employer.map(g)
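The clean-up pattern used in the last cells maps through a function built on dict.get, so that labels absent from the dictionary are left unchanged instead of becoming NaN. A minimal sketch on made-up labels (raw, cleanup and normalise are illustrative names):

```python
from pandas import Series

# Hypothetical raw labels and clean-up mapping, for illustration only
raw = Series(['C.E.O.', 'ENGINEER', 'SELF', 'ENGINEER'])
cleanup = {'C.E.O.': 'CEO', 'SELF': 'SELF-EMPLOYED'}

def normalise(label):
    # Return the mapped label if there is one, else the label itself
    return cleanup.get(label, label)

cleaned = raw.map(normalise)
print(cleaned.tolist())   # ['CEO', 'ENGINEER', 'SELF-EMPLOYED', 'ENGINEER']
```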
In [100]:
donnors.contbr_pt.value_counts()
Out[100]:
In [101]:
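# Recent pandas versions use index= and columns= in place of the old rows= and cols= keywords
by_occupation = donnors.pivot_table(values='contb_receipt_amt',
                                    index='contbr_occupation',
                                    columns='contbr_pt',
                                    aggfunc='sum')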
In [102]:
# Sum across the party columns of each row, then keep occupations above 2 million
over_2mm = by_occupation[by_occupation.sum(axis=1) > 2000000]
In [103]:
over_2mm
Out[103]:
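The same pivot-and-filter pattern can be sketched on a small made-up table, using the index and columns keywords of pivot_table. The column names, amounts and threshold below are hypothetical:

```python
from pandas import DataFrame

# Hypothetical donations, for illustration only
toy = DataFrame({'occupation': ['LAWYER', 'LAWYER', 'TEACHER', 'TEACHER', 'LAWYER'],
                 'party': ['D', 'R', 'D', 'D', 'R'],
                 'amount': [100.0, 200.0, 50.0, 25.0, 300.0]})

# One row per occupation, one column per party, amounts summed in each cell
totals = toy.pivot_table(values='amount', index='occupation',
                         columns='party', aggfunc='sum')

print(totals.loc['LAWYER', 'R'])    # 500.0
print(totals.loc['TEACHER', 'D'])   # 75.0

# Row-wise totals skip the missing TEACHER/R cell; keep rows above the threshold
big = totals[totals.sum(axis=1) > 100]
print(big.index.tolist())           # ['LAWYER']
```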
In [104]:
%matplotlib inline
In [105]:
over_2mm.plot(kind='barh');