More often when working with data, you'll use the pandas library. The fundamental constructs in pandas are the DataFrame and the Series.



In [1]:

    
from pandas import DataFrame, Series
import pandas as pd
import numpy as np

Series

Think of a Series as a Python dict in which all the values have the same type (and all the keys have the same type too).



In [7]:

    
series = Series({'a' : 1, 'c' : 3, 'e' : 7, 'b' : 4})
series









    Out[7]:





a    1
b    4
c    3
e    7
dtype: int64

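(Notice how the keys got sorted. That is the behavior of the older pandas version used here; recent versions keep the dict's insertion order instead.)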
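To make the dict analogy concrete, here is a minimal sketch using a made-up `prices` Series (the names and values are illustrative assumptions, not part of the notebook above). It shows the shared dtype, dict-style lookups, and how to fix the label order explicitly rather than rely on how pandas orders dict keys.

```python
import pandas as pd

# Hypothetical example data, separate from the notebook's `series`.
prices = pd.Series({'apple': 1.5, 'pear': 2.0, 'kiwi': 0.75})

# All values share a single dtype, unlike a plain dict.
value_dtype = prices.dtype

# Dict-style membership tests and lookups work on the index labels.
has_pear = 'pear' in prices
missing = prices.get('mango', 0.0)  # the default is returned for an absent label

# Passing values plus an explicit index pins down the label order.
ordered = pd.Series([1.5, 2.0, 0.75], index=['apple', 'pear', 'kiwi'])

print(value_dtype, has_pear, missing)
print(list(ordered.index))
```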

The keys are referred to as the index:



In [8]:

    
series.index









    Out[8]:





Index([u'a', u'b', u'c', u'e'], dtype='object')

And you can reference elements by their index:



In [10]:

    
series['c']









    Out[10]:





3



In [13]:

    
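# series[['c','d','e']] raises a KeyError in recent pandas because 'd' is missing;
# reindex returns NaN for missing labels instead
series.reindex(['c','d','e'])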









    Out[13]:





c     3
d   NaN
e     7
dtype: float64

Notice that you get a NaN (Not a Number) for index labels that don't exist, which is also why the dtype changed to float64. You can also index with booleans, as we did with numpy arrays:



In [14]:

    
series > 3









    Out[14]:





a    False
b     True
c    False
e     True
dtype: bool



In [15]:

    
series[series > 3]









    Out[15]:





b    4
e    7
dtype: int64
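Boolean masks can also be combined. The following is a small sketch, assuming a Series `s` with the same values as `series` above; it uses `&`, `|` and `~`, each of which works element by element.

```python
import pandas as pd

# Same values as the notebook's `series`, rebuilt so the block runs on its own.
s = pd.Series({'a': 1, 'b': 4, 'c': 3, 'e': 7})

# Parentheses are required: &, | and ~ bind more tightly than comparisons.
between = s[(s > 1) & (s < 7)]      # both conditions hold
extremes = s[(s == 1) | (s == 7)]   # either condition holds
not_big = s[~(s > 3)]               # negated mask

print(between)
print(extremes)
print(not_big)
```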

And you can do the same kinds of arithmetic that we did with numpy arrays as well:



In [16]:

    
series + 1









    Out[16]:





a    2
b    5
c    4
e    8
dtype: int64



In [19]:

    
np.exp(series)









    Out[19]:





a       2.718282
b      54.598150
c      20.085537
e    1096.633158
dtype: float64

When you work with multiple series, the index gets respected:



In [20]:

    
series2 = Series([10, 20, 30, 40, 50], index=['a','b','c','d','e'])



In [21]:

    
series2









    Out[21]:





a    10
b    20
c    30
d    40
e    50
dtype: int64



In [24]:

    
series3 = series + series2
series3









    Out[24]:





a    11
b    24
c    33
d   NaN
e    57
dtype: float64

Notice that the a value got added to the a value, and so on. The original series had no d value, so we got a NaN for that term. To deal with the NaN, we have a couple of options:



In [32]:

    
# return a series of booleans with True where the null values are
series3.isnull()









    Out[32]:





a    False
b    False
c    False
d     True
e    False
dtype: bool



In [34]:

    
# then use the negation of that mask as the index
series3[~series3.isnull()]









    Out[34]:





a    11
b    24
c    33
e    57
dtype: float64



In [36]:

    
# more simply
series3.dropna()









    Out[36]:





a    11
b    24
c    33
e    57
dtype: float64
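As a small aside, the sketch below rebuilds a Series like `series3` (called `s3` here, an illustrative name) to show two related calls: counting the missing values, and `notna()`, which gives the negated mask directly.

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's `series3`, with one missing value at 'd'.
s3 = pd.Series([11, 24, 33, np.nan, 57], index=list('abcde'))

# True counts as 1, so summing the mask counts the missing values.
n_missing = int(s3.isnull().sum())

# notna() is the element-wise opposite of isnull().
kept = s3[s3.notna()]

# Both routes give the same result as dropna().
same = kept.equals(s3.dropna())

print(n_missing, same)
```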

You also might want to do something other than drop the NaN values.



In [42]:

    
# treat a value missing from either series as 0 before adding
series2.add(series, fill_value = 0)









    Out[42]:





a    11
b    24
c    33
d    40
e    57
dtype: float64



In [45]:

    
# treat a missing value as 10 instead, probably a bad idea!
series.add(series2, fill_value = 10)









    Out[45]:





a    11
b    24
c    33
d    50
e    57
dtype: float64
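It may help to see that `fill_value` fills the gap before the addition, which is not the same as filling the NaN afterwards. A minimal sketch, with `s` and `s2` standing in for `series` and `series2`:

```python
import pandas as pd

# Stand-ins for the notebook's `series` and `series2`.
s = pd.Series({'a': 1, 'b': 4, 'c': 3, 'e': 7})
s2 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Missing 'd' in `s` is treated as 0, so the result at 'd' is 0 + 40.
filled_before = s.add(s2, fill_value=0)

# Here the sum at 'd' is NaN first, and only then replaced by 0.
filled_after = (s + s2).fillna(0)

print(filled_before['d'], filled_after['d'])
```

Which of the two is appropriate depends on what a missing entry means in your data.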

DataFrame

Now that you understand Series, you can think of a DataFrame as a dict whose keys are column names and whose values are series.



In [49]:

    
df = DataFrame({ 'x' : series, 'y' : series2 })
df
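The columns are aligned on their index just as Series are in arithmetic. The sketch below rebuilds the same two Series under the illustrative names `s` and `s2` and checks what the resulting frame looks like; `frame` is likewise a stand-in for `df`.

```python
import pandas as pd

# Stand-ins for the notebook's `series` and `series2`.
s = pd.Series({'a': 1, 'b': 4, 'c': 3, 'e': 7})
s2 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

frame = pd.DataFrame({'x': s, 'y': s2})

# The row index is the union of both indexes: five rows, two columns.
shape = frame.shape

# `s` has no 'd', so column x holds NaN there and becomes float64.
x_missing_at_d = pd.isna(frame.loc['d', 'x'])
x_dtype = frame['x'].dtype

print(frame)
print(shape, x_missing_at_d, x_dtype)
```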

You can reference each series by its column name:



In [50]:

    
df['y']









    Out[50]:





a    10
b    20
c    30
d    40
e    50
Name: y, dtype: int64

And you can reference rows by their index labels using .loc:



In [61]:

    
df.loc[['b','d']]
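`.loc` selects by label; its positional counterpart is `.iloc`. A short sketch on a made-up `frame` (not the notebook's `df`) contrasts the two:

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
frame = pd.DataFrame(
    {'x': [1.0, 4.0, 3.0], 'y': [10, 20, 30]},
    index=['a', 'b', 'c'],
)

by_label = frame.loc['b']       # the row labelled 'b', returned as a Series
by_position = frame.iloc[1]     # the second row, counting from 0

# Label slices include both endpoints; positional slices exclude the stop.
label_slice = frame.loc['a':'b']
position_slice = frame.iloc[0:2]

print(by_label)
print(label_slice)
```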

You can easily add new columns:



In [52]:

    
df['z'] = np.sqrt(df['x'])
df
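Assigning to `df[...]` changes the frame in place. If you would rather leave the original untouched, `assign` returns a new frame. A minimal sketch on a made-up `frame` (the column names `total` and `root` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration.
frame = pd.DataFrame({'x': [1, 4, 9], 'y': [10, 20, 30]}, index=['a', 'b', 'c'])

# In-place: a new column computed from two existing ones.
frame['total'] = frame['x'] + frame['y']

# Not in-place: assign returns a copy with the extra column.
with_root = frame.assign(root=np.sqrt(frame['x']))

print(list(frame.columns))
print(list(with_root.columns))
```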

And get subframes:



In [60]:

    
df.loc[['b','e'], ['x', 'z']]
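The row selector in `.loc` can also be a boolean mask, which combines the filtering shown earlier for Series with column selection. A sketch on a made-up `frame`:

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
frame = pd.DataFrame(
    {'x': [1, 4, 3], 'y': [10, 20, 30], 'z': [1.0, 2.0, 1.5]},
    index=['a', 'b', 'c'],
)

# Rows where y exceeds 15, restricted to columns x and z.
sub = frame.loc[frame['y'] > 15, ['x', 'z']]

print(sub)
```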

Frequently we will work with data from a file, which we load with pd.read_csv (for CSV files) or pd.read_table (for tab-delimited files).
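As a rough sketch of what loading looks like, the block below reads from in-memory strings instead of real files so that it runs anywhere; the contents and the `name` column are made up. With an actual file you would pass its path in place of the `StringIO` object.

```python
import io

import pandas as pd

# Made-up file contents; in practice these would live on disk.
csv_text = "name,x,y\na,1,10\nb,4,20\n"
tsv_text = "name\tx\ty\na\t1\t10\nb\t4\t20\n"

# index_col turns the 'name' column into the row index.
from_csv = pd.read_csv(io.StringIO(csv_text), index_col='name')

# Tab-delimited data: read_csv with sep='\t' (pd.read_table defaults to tabs).
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col='name')

print(from_csv)
print(from_csv.equals(from_tsv))
```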


