More often when working with data you'll use the pandas library. The fundamental constructs in pandas are the DataFrame and the Series.


In [1]:
from pandas import DataFrame, Series
import pandas as pd
import numpy as np

Series

Think of a Series as a Python dict all of whose values have the same type (and all of whose keys have the same type too).


In [7]:
series = Series({'a' : 1, 'c' : 3, 'e' : 7, 'b' : 4})
series


Out[7]:
a    1
b    4
c    3
e    7
dtype: int64

(Notice how the keys got sorted. Newer pandas versions keep the dict's insertion order instead.)
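
If sorted keys matter, one version-independent option is to sort explicitly. This is a minimal sketch using the same dict as above; `sorted_series` is just an illustrative name.

```python
import pandas as pd

# Same dict as above; the key order of the result depends on the pandas version.
series = pd.Series({'a': 1, 'c': 3, 'e': 7, 'b': 4})

# sort_index() returns a new Series ordered by its keys.
sorted_series = series.sort_index()
print(sorted_series)
```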

The keys are referred to as the index:


In [8]:
series.index


Out[8]:
Index([u'a', u'b', u'c', u'e'], dtype='object')

And you can reference elements by their index:


In [10]:
series['c']


Out[10]:
3

In [13]:
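# reindex gives NaN for labels that are missing from the index;
# in recent pandas versions, series[['c','d','e']] raises a KeyError instead
series.reindex(['c','d','e'])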


Out[13]:
c     3
d   NaN
e     7
dtype: float64

Notice that you get a NaN (Not a Number) for index elements that don't exist, and that the dtype changed to float64 so the result can hold it. You can also index with booleans, as we did with numpy arrays:


In [14]:
series > 3


Out[14]:
a    False
b     True
c    False
e     True
dtype: bool

In [15]:
series[series > 3]


Out[15]:
b    4
e    7
dtype: int64
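
Boolean masks can also be combined. The sketch below assumes the same values as the series above; `mid` and `ends` are illustrative names. Note that each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` and `<`.

```python
import pandas as pd

series = pd.Series({'a': 1, 'b': 4, 'c': 3, 'e': 7})

# & is element-wise "and", | is element-wise "or", ~ is element-wise "not".
mid = series[(series > 1) & (series < 7)]
ends = series[(series == 1) | (series == 7)]

print(mid)
print(ends)
```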

And you can do the same kinds of element-wise arithmetic that we did with numpy arrays:


In [16]:
series + 1


Out[16]:
a    2
b    5
c    4
e    8
dtype: int64

In [19]:
np.exp(series)


Out[19]:
a       2.718282
b      54.598150
c      20.085537
e    1096.633158
dtype: float64

When you combine multiple Series, values are matched up by index label rather than by position:


In [20]:
series2 = Series([10, 20, 30, 40, 50], index=['a','b','c','d','e'])

In [21]:
series2


Out[21]:
a    10
b    20
c    30
d    40
e    50
dtype: int64

In [24]:
series3 = series + series2
series3


Out[24]:
a    11
b    24
c    33
d   NaN
e    57
dtype: float64

Notice that the 'a' value got added to the 'a' value, and so on. The original series had no 'd' value, so we got a NaN for that term. To deal with the NaN, we have a couple of options:


In [32]:
# return a series of booleans with True where the null values are
series3.isnull()


Out[32]:
a    False
b    False
c    False
d     True
e    False
dtype: bool

In [34]:
# then use the negation of that as a boolean index
series3[~series3.isnull()]


Out[34]:
a    11
b    24
c    33
e    57
dtype: float64

In [36]:
# more simply
series3.dropna()


Out[36]:
a    11
b    24
c    33
e    57
dtype: float64
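
As a small aside, `notnull()` is the complement of `isnull()`, so the negation can be avoided. The sketch below rebuilds a Series with the same values as series3 by hand (an assumption made only to keep it self-contained); `kept` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Stand-in for series3 from above, with the same values.
series3 = pd.Series([11, 24, 33, np.nan, 57], index=['a', 'b', 'c', 'd', 'e'])

# notnull() is True exactly where isnull() is False.
kept = series3[series3.notnull()]
print(kept)
```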

You might also want to do something other than drop the NaN values.


In [42]:
# treat a value missing from one series as 0 before adding
series2.add(series, fill_value = 0)


Out[42]:
a    11
b    24
c    33
d    40
e    57
dtype: float64

In [45]:
# treat a missing value as 10 before adding, probably a bad idea!
series.add(series2, fill_value = 10)


Out[45]:
a    11
b    24
c    33
d    50
e    57
dtype: float64
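
It may help to see that `fill_value` is applied before the addition, which is not the same as replacing NaN in the result. The sketch below uses the same two series as above and contrasts it with `fillna`; `filled_before` and `filled_after` are illustrative names.

```python
import pandas as pd

series = pd.Series({'a': 1, 'b': 4, 'c': 3, 'e': 7})
series2 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# fill_value stands in for the missing 'd' in series before adding: 0 + 40
filled_before = series.add(series2, fill_value=0)

# fillna replaces the NaN after adding: NaN becomes 0
filled_after = (series + series2).fillna(0)

print(filled_before['d'], filled_after['d'])  # 40.0 0.0
```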

DataFrame

Now that you understand Series, you can think of a DataFrame as a dict whose keys are column names and whose values are Series that share one index.


In [49]:
df = DataFrame({ 'x' : series, 'y' : series2 })
df


Out[49]:
     x   y
a    1  10
b    4  20
c    3  30
d  NaN  40
e    7  50
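
To see what the constructor did with the two indexes, you can inspect the frame's index, columns and dtypes. This is a minimal sketch, assuming the same two series as above.

```python
import pandas as pd

series = pd.Series({'a': 1, 'b': 4, 'c': 3, 'e': 7})
series2 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

df = pd.DataFrame({'x': series, 'y': series2})

print(list(df.index))    # the row labels are the union of both indexes
print(list(df.columns))  # one column per dict key
print(df.dtypes)         # x is float64 because it has to hold a NaN
```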

You can reference each series by its column name:


In [50]:
df['y']


Out[50]:
a    10
b    20
c    30
d    40
e    50
Name: y, dtype: int64

And you can reference rows using .loc (the z column in this output comes from a cell shown further down, which was run before this one):


In [61]:
df.loc[['b','d']]


Out[61]:
     x   y    z
b    4  20    2
d  NaN  40  NaN

You can easily add new columns:


In [52]:
df['z'] = np.sqrt(df['x'])
df


Out[52]:
     x   y         z
a    1  10  1.000000
b    4  20  2.000000
c    3  30  1.732051
d  NaN  40       NaN
e    7  50  2.645751
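
A new column can also be computed from several existing columns at once. The sketch below rebuilds the same frame and adds a column named `ratio`, a hypothetical name chosen for illustration.

```python
import pandas as pd

series = pd.Series({'a': 1, 'b': 4, 'c': 3, 'e': 7})
series2 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame({'x': series, 'y': series2})

# Element-wise division, aligned row by row; the NaN in x carries over to ratio.
df['ratio'] = df['y'] / df['x']
print(df)
```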

And get subframes:


In [60]:
df.loc[['b','e'], ['x', 'z']]


Out[60]:
   x         z
b  4  2.000000
e  7  2.645751
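
The row selector in .loc does not have to be a list of labels. As a sketch, assuming the same frame as above, a boolean Series works too; `subframe` is an illustrative name.

```python
import pandas as pd

series = pd.Series({'a': 1, 'b': 4, 'c': 3, 'e': 7})
series2 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame({'x': series, 'y': series2})

# Keep the rows where x is greater than 3, and only the x and y columns.
# The comparison is False for the NaN in row d, so that row is dropped.
subframe = df.loc[df['x'] > 3, ['x', 'y']]
print(subframe)
```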

Frequently we will work with data from a file, for which we use pd.read_csv (for CSV files) or pd.read_table (for tab-delimited files).
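
Both functions accept a file path or a file-like object. To keep the sketch below self-contained, it reads from in-memory strings rather than real files; the contents and the column name `name` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory instead of on disk.
csv_text = "name,x,y\na,1,10\nb,4,20\n"
tsv_text = "name\tx\ty\na\t1\t10\nb\t4\t20\n"

# index_col says which column to use as the row labels.
df_csv = pd.read_csv(io.StringIO(csv_text), index_col='name')
df_tsv = pd.read_table(io.StringIO(tsv_text), index_col='name')

print(df_csv)
```

With a real file you would pass its path in place of the StringIO object.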

