In [63]:
%load_ext load_style
%load_style talk.css
from IPython.display import Image, HTML
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.
pandas is well suited for many different kinds of data, from heterogeneous tabular data to time series; its key features are summarized on the project website, embedded below.
In [39]:
HTML('<iframe src=http://pandas.pydata.org/index.html width=900 height=350></iframe>')
Out[39]:
The conventional way to import pandas is import pandas as pd
In [40]:
import pandas as pd
Some option settings
In [41]:
pd.set_option("display.width", 80)
# toggle the lines below if you don't want DataFrames displayed as HTML tables
#pd.set_option("notebook_repr_html", False)
pd.set_option("notebook_repr_html", True)
Pandas's data structures and functionality will be familiar to R users. There is a section on the Pandas website where Wes McKinney gives translations of common idioms / operations between R and Pandas.
In [42]:
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#compare-with-r width=900 height=350></iframe>')
Out[42]:
A Series is a single vector of data values (think of a NumPy array of shape (N,)) with an index that labels each element in the vector.
In [43]:
a = pd.Series(np.random.normal(0,1,(10,)))
In [44]:
a
Out[44]:
In [45]:
a.index
Out[45]:
In [46]:
a.values
Out[46]:
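A Series can also be built directly from a Python dict, whose keys then become the index labels (a quick sketch; the labels and values here are arbitrary):
In [ ]:
# dict keys become the index labels, dict values the data
pd.Series({'x': 1.0, 'y': 2.0, 'z': 3.0})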
In [47]:
a = pd.Series(np.random.normal(0,1,(10,)), index=np.arange(1,11))
In [48]:
a
Out[48]:
In [49]:
a = pd.Series(np.random.normal(0,1,5), \
index=['a','b','c','d','e'], name='my series')
In [50]:
a
Out[50]:
Pandas objects expose some powerful, high-level plotting functions (built on top of Matplotlib)
In [51]:
plot = a.plot(kind='bar', rot=0, color='w', hatch='///', title=a.name, width=0.9, grid=False)
In [52]:
f, ax = plt.subplots()
bars = ax.bar(np.arange(len(a)), a.values, color='w', edgecolor='k', align='center', hatch='///')
ax.set_xticks(np.arange(len(a)))
ax.set_xlim(-0.5, len(a)-0.5)
ax.set_xticklabels(a.index)
ax.set_title(a.name)
Out[52]:
Selecting from a Series is easy: use the corresponding index key (like a dict)
In [53]:
a
Out[53]:
In [54]:
a['c']
Out[54]:
Slices are permitted; note that label-based slices include both endpoints, unlike standard Python / NumPy positional slicing
In [55]:
a['a':'c'] ### Note the difference with standard Python / Numpy positional, integer indexing
Out[55]:
In [56]:
a['c':]
Out[56]:
Deleting an element (note that drop returns a new Series; it does not modify a in place)
In [57]:
a.drop('d')
Out[57]:
Adding / modifying an element of a Series in place: use the loc indexer
In [58]:
a
Out[58]:
In [59]:
a.loc['e'] = 1
In [60]:
a
Out[60]:
In [61]:
a.loc['f'] = 5
In [62]:
a
Out[62]:
Mathematical operations involving two Series align on their indices.
Labels found in only one of the operands get the value NaN (not a number), and values are computed for every pairing of repeated labels.
In [24]:
s1 = pd.Series(np.arange(1.0,4.0),index=['a','b','c'])
s2 = pd.Series(np.arange(1.0,4.0),index=['b','c','d'])
In [25]:
s3 = s1 + s2
In [26]:
s3
Out[26]:
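The behaviour with repeated labels mentioned above can be seen in a minimal sketch (the labels here are arbitrary): every pairing of matching labels appears in the result.
In [ ]:
# 'a' appears twice on the left, so the result contains two 'a' entries (11.0 and 12.0)
pd.Series([1., 2.], index=['a', 'a']) + pd.Series([10.], index=['a'])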
NaNs are ignored in aggregation operations such as mean
In [27]:
s3.mean()
Out[27]:
You can drop them from the Series
In [28]:
s4 = s3.dropna()
In [29]:
s4
Out[29]:
Or use the fillna method to replace them with a value
In [30]:
s3.fillna(-999)
Out[30]:
In [31]:
s3.fillna(s3.mean())
Out[31]:
Series can have indexes representing dates / times
In [ ]:
a
In [32]:
a.index = pd.date_range(start='2014-1-1', periods=len(a)) # default frequency is daily ('D')
In [33]:
a.head()
Out[33]:
In [ ]:
a.index
In [ ]:
### but you can convert it to a NumPy array of Python datetime objects if you want
py_datetimes = a.index.to_pydatetime()
And a number of useful methods for manipulating time series are exposed
In [34]:
### resample the daily time series to a 5-minute frequency, forward-filling the values
a.resample('5min', fill_method='ffill') # in recent pandas versions: a.resample('5min').ffill()
Out[34]:
In [ ]:
a
In [35]:
### the shift method makes it easy e.g. to compare series with leads / lags
a.shift(periods=-1)
Out[35]:
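For instance, a lag-1 difference (day-to-day change) is simply the series minus its shifted self (a small sketch):
In [ ]:
a - a.shift(1) # the first element has no predecessor, hence NaN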
In [ ]:
a
In [36]:
### and the truncate method allows easy selection of time slices
a.truncate(after='2014-1-2')
Out[36]:
DataFrames are IMHO one of the most powerful data structures in the Python / data analysis world.
They can be viewed as a collection of named Series. They feature two indexes, one for the rows and one for the columns, and can contain heterogeneous data types (although the type must be consistent within each column). Note that a DataFrame index, either along the rows or the columns (or both!), can contain more than one level; such indexes are called hierarchical indexes and allow the representation of complex data organisations.
If the index along the rows of a DataFrame is of datetime type, all the methods exposed for Series (re-sampling, shifting, truncating, etc.) are also available for the DataFrame.
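As a quick illustration of a hierarchical index (the level names and data below are arbitrary), a MultiIndex can be built from arrays of labels:
In [ ]:
# two-level row index; selecting on the first level returns all rows under that label
arrays = [['one', 'one', 'two', 'two'], ['a', 'b', 'a', 'b']]
midx = pd.MultiIndex.from_arrays(arrays, names=['outer', 'inner'])
hdf = pd.DataFrame(np.random.random((4, 2)), index=midx, columns=['X', 'Y'])
hdf.loc['one']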
In [ ]:
import string # part of the standard library
idx = list(string.ascii_lowercase[:10])
print(idx)
In [ ]:
df = pd.DataFrame(np.arange(100).reshape(10,10),\
columns=idx,index=np.arange(1,11))
In [ ]:
df
In [ ]:
### here I am creating a DataFrame from a dictionary
df = pd.DataFrame({'A' : np.random.random(5),\
'B' : np.random.random(5),\
'C': np.random.random(5)}, index=np.arange(1,6))
print(df)
In [ ]:
df
Selecting a column can be done in different ways, returning either a Series or a DataFrame
In [ ]:
type(df['A'])
In [ ]:
df[['A']]
In [ ]:
type(df['A'])
In [ ]:
type(df[['A']])
To access a particular row instead of a column, you use the ix indexer
In [ ]:
df.ix[3]
In [ ]:
df
Access values by positional indices only with the iloc indexer
In [ ]:
df.iloc[2,0:2]
In [ ]:
type(_)
And you can of course combine row (with ix) and column indexing, using the same convention for slices as we saw for Series
In [ ]:
df.ix[3]['A':'B']
In [ ]:
df.ix[3][['A','C']]
You can also use the loc indexer, giving it both row AND column indexes (the indexing is based on labels, NOT on positions)
In [ ]:
df
In [ ]:
df.loc[[3,5],['A','B']]
Conditional indexing (subsetting) of a DataFrame
In [ ]:
df
In [ ]:
subset = df[df['A'] <= 0.7]
In [ ]:
subset
As of Pandas version 0.14, you can build complex, database-like queries on DataFrames
In [ ]:
df
In [ ]:
df.query('A > B')
In [ ]:
df.query('(A > B) & (B > C)')
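query can also reference local Python variables with the @ prefix (a quick sketch; threshold is an arbitrary name):
In [ ]:
threshold = 0.5
df.query('A > @threshold') # equivalent to df[df['A'] > threshold]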
Adding a column is easy
In [ ]:
df
In [ ]:
df['D'] = np.random.random(5)
In [ ]:
df
The following works because Pandas understands that a single value must be repeated down the whole column
In [ ]:
df['E'] = 5
In [ ]:
df
The following doesn't work because there's no way to tell how to align 4 values with 5 rows (align to the first or the last index?)
In [ ]:
df['F'] = np.random.random(4)
Unless we make a Series out of it, with an index matching (at least partly) the DataFrame's row index
In [ ]:
df['F'] = pd.Series(np.random.random(4), index=np.arange(1,5))
In [ ]:
df
In [ ]:
df.apply(np.sqrt) # or np.sqrt(df)
In [ ]:
df.describe().T # summary statistics, transposed so each row describes a column
In [ ]:
df.head()
In [ ]:
df['E'].replace(5, 0.0, inplace=True)
# you can pass a dictionary to make multiple replacements at once
In [ ]:
df
In [ ]:
df.plot();
In [ ]:
df.drop('E', axis=1).plot(figsize=(8,12), \
subplots=True, \
sharex=True, \
kind='bar', rot=0);
In [ ]:
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
iris = pd.DataFrame(iris.data, columns=iris.feature_names)
In [ ]:
iris.head()
In [ ]:
iris.hist();
In [ ]:
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/visualization.html width=1000 height=350></iframe>')
Pandas has very powerful IO functions, making it easy to load CSV, Excel, and tab-delimited files. DataFrames can also be saved to CSV and Excel files.
Other supported formats include HDF5, JSON, SQL databases, and the clipboard:
In [ ]:
pd.read # hit TAB after 'pd.read' to list the available pd.read_* functions
In [ ]:
SOI = pd.read_csv('../data/NIWA_SOI.csv')
In [ ]:
SOI.head()
In [ ]:
SOI = pd.read_csv('../data/NIWA_SOI.csv', index_col=0)
In [ ]:
SOI.head()
In [ ]:
SOI.ix[1950:2000].head()
In [ ]:
SOIs = SOI.stack() # stack the month columns into a Series indexed by (year, month) tuples
In [ ]:
SOIs.head()
In [ ]:
SOIs.index
In [ ]:
from dateutil import parser
In [ ]:
# build "year-month-1" strings from the (year, month) index tuples and parse them into datetimes
dateindex = [parser.parse("-".join(map(str, [x[0], x[1], 1]))) for x in SOIs.index]
In [ ]:
SOIs.index=dateindex
In [ ]:
SOIs.head()
In [ ]:
SOIs.plot(figsize=(12,5))
In [ ]:
pd.rolling_mean(SOIs, 12).plot() # 12-month rolling mean; in recent pandas versions: SOIs.rolling(12).mean().plot()
In [ ]:
SOIs = pd.DataFrame(SOIs)
In [ ]:
SOIs.to_excel('../data/SOI_time_series.xlsx', header=False)
In [ ]:
#!open ../data/SOI_time_series.xlsx
First example: an Excel file containing one sheet, with a column of Excel dates.
In [38]:
!open ../data/ISO_datetime.xls
In [39]:
data = pd.read_excel('../data/ISO_datetime.xls', sheetname='Sheet1')
In [40]:
data.head()
Out[40]:
In [ ]:
data.date
In [41]:
data = pd.read_excel('../data/ISO_datetime.xls', \
sheetname='Sheet1', index_col=0)
In [43]:
data.index
Out[43]:
In [44]:
!open ../data/year_month_day.xlsx
In [ ]:
ymd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet1')
ymd.head()
In [45]:
ymd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet1', \
parse_dates = {'date': [0, 1, 2]}, \
index_col='date')
ymd.head()
Out[45]:
In [ ]:
myd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet2'); myd.head()
In [46]:
myd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet2',\
parse_dates = {'date': [1, 0, 2]}, \
index_col='date');
myd.head()
Out[46]:
The groupby method is a very powerful method of pandas DataFrames; in a nutshell it allows you to:
- split your data according to the unique values of a variable (or unique combinations of N variables)
- apply some operation to the groups thus defined, either an aggregation or a transformation method
- combine the results into a DataFrame
This process is illustrated below, where the operation is computing the mean of each group's values.
A very nice explanation of the groupby method, with examples, is available in the Pandas documentation at:
http://pandas.pydata.org/pandas-docs/stable/groupby.html
and in a short tutorial on Wes McKinney's blog here
In [47]:
Image(filename='images/split-apply-combine.png', width=800)
Out[47]:
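Before turning to a real dataset, here is a minimal toy illustration of split-apply-combine (the column names are arbitrary):
In [ ]:
# split on 'key', apply mean to each group, combine the results into a new DataFrame
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1., 2., 3., 4.]})
toy.groupby('key').mean()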
In [ ]:
url = "ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii"
In [ ]:
!wget -P ../data ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii
In [48]:
data = pd.read_table('../data/ersst3b.nino.mth.81-10.ascii', sep='\s+') # '\s+' is a regular expression meaning any amount of whitespace
In [ ]:
# if the network configuration allows it, you can read directly off the URL (ftp):
#data = pd.read_table(url, sep='\s+')
In [49]:
data.tail()
Out[49]:
I keep only the raw (monthly) values of NINO3.4
In [50]:
nino = data[['YR','MON','NINO3.4']]
In [51]:
nino.tail()
Out[51]:
Now I want to calculate a climatology over the whole available period.
I first group by the unique values of the MON variable; I should get 12 groups.
In [52]:
groups = nino.groupby('MON')
In [53]:
for month, group in groups:
    print(month)
    print(group.head())
In [54]:
climatology = groups.mean()
Same as
climatology = groups.aggregate(np.mean)
In [55]:
climatology['NINO3.4'].head(12)
Out[55]:
In [56]:
f, ax = plt.subplots()
climatology['NINO3.4'].plot(ax=ax, kind='bar',ylim=[26,28], rot=0, width=0.9)
ax.set_xticklabels(list('JFMAMJJASOND'));
f.savefig('../outputs/climatology.pdf')
Now suppose we don't want to apply a function that aggregates the values in the groups (such as sum or mean), but rather want to apply a transformation to those values...
An example would be calculating the standardized anomalies per month (from each value subtract the mean of the corresponding month, then divide by the standard deviation).
In [57]:
def zscore(x):
    z = (x - x.mean()) / x.std()
    return z
In [58]:
transformed = nino.groupby(nino.MON).apply(zscore)
In [59]:
transformed['NINO3.4'].plot()
Out[59]:
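Note that groupby objects also expose a transform method, designed for exactly this pattern; a sketch equivalent to the apply above (transformed2 is an arbitrary name):
In [ ]:
# group-wise standardization of the NINO3.4 column only
transformed2 = nino.groupby(nino.MON)['NINO3.4'].transform(zscore)
transformed2.head()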
Now we want to calculate just the anomalies with respect to the climatology, with a normal established over 1981 - 2000 (the period used in the query below)
In [60]:
nino.head()
Out[60]:
We can make use of the query method of Pandas DataFrames to select the climatological period
In [61]:
nino.query('YR >= 1981 & YR <= 2000').head()
Out[61]:
In [62]:
def demean(x):
    z = x - x.query('YR >= 1981 & YR <= 2000').mean()
    return z
In [63]:
anoms = nino.groupby([nino.MON]).apply(demean)
In [64]:
f, ax = plt.subplots(figsize=(10,6))
anoms['NINO3.4'].plot()
data['ANOM.3'].plot() # Should be the same
Out[64]:
The file Daily_clim_data_Auckland.csv in the data directory contains daily values for rainfall and minimum and maximum temperatures for Auckland from 1972 to now.
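A possible first step (the exact layout of the file is an assumption here: a leading date-like column that read_csv can parse):
In [ ]:
# hypothetical starter: parse the first column as a datetime index
auck = pd.read_csv('../data/Daily_clim_data_Auckland.csv', index_col=0, parse_dates=True)
auck.head()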
In [ ]: