More often, when working with data, you'll use the pandas library. The fundamental constructs in pandas are the DataFrame and the Series.
In [1]:
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
Think of a Series as a Python dict all of whose values have the same type (and all of whose keys have the same type too):
In [7]:
series = Series({'a' : 1, 'c' : 3, 'e' : 7, 'b' : 4})
series
Out[7]:
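(In older versions of pandas the keys got sorted, giving a, b, c, e. Recent versions keep the order in which the keys appear in the dict.)
The keys are referred to as the index: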
In [8]:
series.index
Out[8]:
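As a rough check of the "same type" idea, the sketch below inspects the dtype pandas chose for the values. The names ints and mixed are illustrative and not part of the notebook. When ints and floats are mixed, pandas upcasts the values to a common float type.

```python
import pandas as pd

ints = pd.Series({'a': 1, 'c': 3, 'e': 7, 'b': 4})
mixed = pd.Series({'a': 1, 'b': 2.5})   # illustrative: one int, one float

# every value in a Series shares a single dtype
ints_kind = ints.dtype.kind      # 'i' means integer
mixed_kind = mixed.dtype.kind    # 'f' means float: the int 1 was upcast

# the index holds the keys and supports membership tests
has_c = 'c' in ints.index
has_d = 'd' in ints.index

print(ints.dtype, mixed.dtype, has_c, has_d)
```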
And you can reference elements by their index:
In [10]:
series['c']
Out[10]:
In [13]:
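series.reindex(['c','d','e'])
Out[13]:
Notice that you get a NaN (Not a Number) for index elements that don't exist. (Older versions of pandas gave the same result for series[['c','d','e']]; recent versions raise a KeyError when a label is missing, so reindex is the safe way to ask for labels that may not be there.) You can also index with booleans, as we did with numpy arrays: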
In [14]:
series > 3
Out[14]:
In [15]:
series[series > 3]
Out[15]:
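Boolean masks can also be combined. The following is a small sketch, with the illustrative names middle and extremes, showing the element-wise operators & and |. Each comparison needs its own parentheses because & and | bind more tightly than > and <.

```python
import pandas as pd

series = pd.Series({'a': 1, 'c': 3, 'e': 7, 'b': 4})

# element-wise "and": values strictly between 1 and 7
middle = series[(series > 1) & (series < 7)]

# element-wise "or": values equal to 1 or to 7
extremes = series[(series == 1) | (series == 7)]

print(middle)
print(extremes)
```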
And you can do the same kinds of arithmetic that we did with numpy arrays:
In [16]:
series + 1
Out[16]:
In [19]:
np.exp(series)
Out[19]:
When you combine multiple series, the index gets respected: values are matched by index label rather than by position.
In [20]:
series2 = Series([10, 20, 30, 40, 50], index=['a','b','c','d','e'])
In [21]:
series2
Out[21]:
In [24]:
series3 = series + series2
series3
Out[24]:
Notice that the a value got added to the a value, and so on. The original series had no d value, so we got a NaN for that term. If we want to get rid of it, we have a couple of options:
In [32]:
# return a series of booleans with True where the null values are
series3.isnull()
Out[32]:
In [34]:
# then use the negation of that as the index
series3[~series3.isnull()]
Out[34]:
In [36]:
# more simply
series3.dropna()
Out[36]:
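The three approaches to removing missing values can be compared side by side. This is a minimal sketch on a small hand-made series s (not the notebook's series3); notnull() is simply the mask from isnull() already negated.

```python
import numpy as np
import pandas as pd

# illustrative series with one missing value at label 'd'
s = pd.Series([11.0, np.nan, 33.0], index=['a', 'd', 'c'])

masked = s[~s.isnull()]    # negate the null mask by hand
kept = s[s.notnull()]      # notnull() gives the negated mask directly
dropped = s.dropna()       # same result in a single call

# summing a boolean mask counts the True entries
n_missing = int(s.isnull().sum())

print(dropped)
print(n_missing)
```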
You might also want to do something other than drop the NaN values:
In [42]:
# treat a value missing from one series as 0 before adding
series2.add(series, fill_value=0)
Out[42]:
In [45]:
# treat the missing value as 10 instead, probably a bad idea!
series.add(series2, fill_value=10)
Out[45]:
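It may help to see that fill_value is applied to the missing operand before the addition, which is different from replacing the NaN in the finished sum. The sketch below contrasts the two using the same series and series2 as above; filled_before and filled_after are illustrative names.

```python
import pandas as pd

series = pd.Series({'a': 1, 'c': 3, 'e': 7, 'b': 4})
series2 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# fill_value stands in for the missing operand, so 'd' becomes 0 + 40
filled_before = series.add(series2, fill_value=0)

# fillna replaces the NaN in the finished sum, so 'd' becomes 0
filled_after = (series + series2).fillna(0)

print(filled_before['d'], filled_after['d'])
```

Which one you want depends on whether a missing entry should count as "nothing to add" or as "no result at all".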
Now that you understand Series, you can think of a DataFrame as a dict whose keys are column names and whose values are series:
In [49]:
df = DataFrame({ 'x' : series, 'y' : series2 })
df
Out[49]:
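As with series arithmetic, building a DataFrame from several series aligns them by index label. The following sketch, which rebuilds the same df, checks that the row index is the union of the two indexes and that x is missing at d. The names rows and x_missing_at_d are illustrative.

```python
import pandas as pd

series = pd.Series({'a': 1, 'c': 3, 'e': 7, 'b': 4})
series2 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame({'x': series, 'y': series2})

# the row index is the union of the two series' indexes
rows = sorted(df.index)

# 'd' exists only in series2, so its x value is missing
x_missing_at_d = bool(pd.isnull(df.loc['d', 'x']))

print(df.shape)            # (number of rows, number of columns)
print(list(df.columns))
```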
You can reference each series by its column name:
In [50]:
df['y']
Out[50]:
And you can reference rows using .loc:
In [61]:
df.loc[['b','d']]
Out[61]:
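Since .loc selects by label, it is worth contrasting it with .iloc, which pandas provides for selecting by integer position. This is a small sketch on a hand-made frame (not the notebook's df). Note that label slices include both endpoints, while positional slices exclude the end, as in ordinary Python.

```python
import pandas as pd

# illustrative frame with three labelled rows
df = pd.DataFrame({'x': [1.0, 4.0, 3.0], 'y': [10, 20, 30]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b']         # the row labelled 'b', returned as a Series
by_position = df.iloc[1]       # the second row, counted from 0

label_slice = df.loc['a':'b']  # rows a and b: the end label is included
pos_slice = df.iloc[0:2]       # rows 0 and 1: the end position is excluded

print(by_label)
print(label_slice)
```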
You can easily add new columns:
In [52]:
df['z'] = np.sqrt(df['x'])
df
Out[52]:
And get subframes:
In [60]:
df.loc[['b','e'], ['x', 'z']]
Out[60]:
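The boolean indexing shown earlier for a Series carries over to a DataFrame: a boolean series built from one column can select rows. A minimal sketch on a hand-made frame, with the illustrative names big_y and sub:

```python
import pandas as pd

# illustrative frame, not the notebook's df
df = pd.DataFrame({'x': [1, 4, 3, 7], 'y': [10, 20, 30, 50]},
                  index=['a', 'b', 'c', 'e'])

# keep only the rows where y is greater than 20
big_y = df[df['y'] > 20]

# .loc accepts a boolean mask for rows together with a list of columns
sub = df.loc[df['y'] > 20, ['x']]

print(big_y)
print(sub)
```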
Frequently we will work with data from a file, for which we use pd.read_csv (for csv files) or pd.read_table (for tab-delimited files).
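Both functions also accept a file-like object, which makes it possible to sketch them without a file on disk. The data below is made up for illustration, and io.StringIO stands in for an open file. The index_col argument asks pandas to use the named column as the row index.

```python
import io
import pandas as pd

# made-up data: the same small table, comma-separated and tab-separated
csv_text = "name,x,y\na,1,10\nb,4,20\n"
tsv_text = "name\tx\ty\na\t1\t10\nb\t4\t20\n"

from_csv = pd.read_csv(io.StringIO(csv_text), index_col='name')
from_tsv = pd.read_table(io.StringIO(tsv_text), index_col='name')

print(from_csv)
print(from_csv.equals(from_tsv))
```

With a real file you would pass its path in place of the StringIO object.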