In [63]:
%load_ext load_style
%load_style talk.css
from IPython.display import Image, HTML
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


The load_style extension is already loaded. To reload it, use:
  %reload_ext load_style

Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data. It is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Key features:

  • Easy handling of missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes
  • Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

In [39]:
HTML('<iframe src=http://pandas.pydata.org/index.html width=900 height=350></iframe>')


Out[39]:

The conventional way to import pandas is import pandas as pd


In [40]:
import pandas as pd

setting some options


In [41]:
pd.set_option("display.width", 80)
# toggle the line below if one doesn't want DataFrames displayed as HTML tables
#pd.set_option("notebook_repr_html", False) 
pd.set_option("notebook_repr_html", True)

Pandas's data structures and functionality will be familiar to R users; there is a section on pandas's website where Wes McKinney gives translations of common idioms / operations between R and pandas


In [42]:
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#compare-with-r width=900 height=350></iframe>')


Out[42]:

Pandas data structures

Series

A Series is a single vector of data values (think of a one-dimensional NumPy array, of shape (N,)) with an index that labels each element in the vector.

Series constructions


In [43]:
a = pd.Series(np.random.normal(0,1,(10,)))

In [44]:
a


Out[44]:
0    0.037009
1    0.593607
2   -1.623254
3    0.294439
4    2.162388
5   -0.570866
6    1.671018
7    1.399286
8   -0.573680
9   -0.160413
dtype: float64

In [45]:
a.index


Out[45]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [46]:
a.values


Out[46]:
array([ 0.03700874,  0.59360691, -1.6232544 ,  0.29443856,  2.16238755,
       -0.57086554,  1.67101806,  1.39928642, -0.57367972, -0.16041259])

You can define your own index


In [47]:
a = pd.Series(np.random.normal(0,1,(10,)), index=np.arange(1,11))

In [48]:
a


Out[48]:
1     0.950278
2    -0.206156
3    -0.136064
4     0.247603
5    -0.228526
6    -0.199513
7    -0.071164
8     1.546748
9    -0.712044
10    0.214061
dtype: float64

In [49]:
a = pd.Series(np.random.normal(0,1,5), \
              index=['a','b','c','d','e'], name='my series')

In [50]:
a


Out[50]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64

Pandas objects expose some powerful, high-level plotting functions (built on top of Matplotlib)


In [51]:
plot = a.plot(kind='bar', rot=0, color='w', hatch='///', title=a.name, width=0.9, grid=False)



In [52]:
f, ax = plt.subplots()
bars = ax.bar(np.arange(len(a)), a.values, color='w', edgecolor='k', align='center', hatch='///')
ax.set_xticks(np.arange(len(a)))
ax.set_xlim(-0.5, len(a)-0.5)
ax.set_xticklabels(a.index)
ax.set_title(a.name)


Out[52]:
<matplotlib.text.Text at 0x10b9342d0>

Series indexing

Selecting from a Series is easy, using the corresponding index key (like a dict)


In [53]:
a


Out[53]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64

In [54]:
a['c']


Out[54]:
-1.607072752509346
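
Like a dict, a Series also supports membership tests and a get method with a default; a minimal sketch (the key 'z' is made up and absent on purpose):


In [ ]:
print('c' in a)        # membership is tested against the index labels
print(a.get('z', 0.0)) # like dict.get: returns the default for a missing label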

slices are permitted


In [55]:
a['a':'c'] ### NB: unlike standard Python / NumPy positional indexing, label-based slices include the endpoint


Out[55]:
a   -0.448799
b   -0.410785
c   -1.607073
Name: my series, dtype: float64

In [56]:
a['c':]


Out[56]:
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
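
For comparison, positional slicing (via iloc) follows the usual Python convention and excludes the endpoint; a minimal sketch:


In [ ]:
a.iloc[0:2] # positions 0 and 1 only ('a' and 'b'): the endpoint is excluded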

deleting an element


In [57]:
a.drop('d') # returns a new Series; a itself is left unchanged


Out[57]:
a   -0.448799
b   -0.410785
c   -1.607073
e    0.024734
Name: my series, dtype: float64

Adding an element to, or modifying an element of, a Series in place: use the loc method


In [58]:
a


Out[58]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64

In [59]:
a.loc['e'] = 1

In [60]:
a


Out[60]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    1.000000
Name: my series, dtype: float64

In [61]:
a.loc['f'] = 5

In [62]:
a


Out[62]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    1.000000
f    5.000000
Name: my series, dtype: float64

Mathematical operations involving two Series are performed by aligning indices.

  1. The union of all indices is created
  2. The mathematical operation is performed on matching indices.

Indices that do not match are given the value NaN (not a number), and values are computed for all unique pairs of repeated indices.


In [24]:
s1 = pd.Series(np.arange(1.0,4.0),index=['a','b','c'])
s2 = pd.Series(np.arange(1.0,4.0),index=['b','c','d'])

In [25]:
s3 = s1 + s2

In [26]:
s3


Out[26]:
a   NaN
b     3
c     5
d   NaN
dtype: float64
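
Two related points are worth a quick sketch: with repeated labels a value is computed for every pairing, and the add method's fill_value argument substitutes a default for missing labels instead of producing NaN (s5 and s6 are made up for illustration):


In [ ]:
s5 = pd.Series([1., 2.], index=['a', 'a'])
s6 = pd.Series([10., 20.], index=['a', 'a'])
print(s5 + s6)                   # 2 x 2 = 4 values for the repeated label 'a'
print(s1.add(s2, fill_value=0.)) # 'a' and 'd' are treated as 0 instead of NaN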

NaNs are skipped in aggregation operations (such as mean or sum)


In [27]:
s3.mean()


Out[27]:
4.0

You can drop them from the Series


In [28]:
s4 = s3.dropna()

In [29]:
s4


Out[29]:
b    3
c    5
dtype: float64

Or use the fillna method to replace them with a value


In [30]:
s3.fillna(-999)


Out[30]:
a   -999
b      3
c      5
d   -999
dtype: float64

In [31]:
s3.fillna(s3.mean())


Out[31]:
a    4
b    3
c    5
d    4
dtype: float64

Series with a date / datetime index (timeseries)

Series can have indexes representing dates / times


In [ ]:
a

In [32]:
a.index = pd.date_range(start='2014-1-1', periods=len(a)) # the default frequency is daily ('D')

In [33]:
a.head()


Out[33]:
2014-01-01    0.142453
2014-01-02   -0.367365
2014-01-03    1.160442
2014-01-04    0.846886
Freq: D, Name: my series, dtype: float64

In [ ]:
a.index

In [ ]:
### but you can convert it to a NumPy array of Python datetime objects if you want
py_datetimes = a.index.to_pydatetime()
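
A DatetimeIndex also supports partial string indexing, where a (partial) date string selects the matching entries; a minimal sketch, assuming the daily index built above:


In [ ]:
a['2014-01-02'] # a['2014-01'] would select the whole month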

A number of useful methods for manipulating time series are exposed


In [34]:
### resample the daily time series to a 5-minute frequency, using the forward-filling method
a.resample('5min',fill_method='ffill')


Out[34]:
2014-01-01 00:00:00    0.142453
2014-01-01 00:05:00    0.142453
2014-01-01 00:10:00    0.142453
2014-01-01 00:15:00    0.142453
2014-01-01 00:20:00    0.142453
2014-01-01 00:25:00    0.142453
2014-01-01 00:30:00    0.142453
2014-01-01 00:35:00    0.142453
2014-01-01 00:40:00    0.142453
2014-01-01 00:45:00    0.142453
2014-01-01 00:50:00    0.142453
2014-01-01 00:55:00    0.142453
2014-01-01 01:00:00    0.142453
2014-01-01 01:05:00    0.142453
2014-01-01 01:10:00    0.142453
...
2014-01-03 22:50:00    1.160442
2014-01-03 22:55:00    1.160442
2014-01-03 23:00:00    1.160442
2014-01-03 23:05:00    1.160442
2014-01-03 23:10:00    1.160442
2014-01-03 23:15:00    1.160442
2014-01-03 23:20:00    1.160442
2014-01-03 23:25:00    1.160442
2014-01-03 23:30:00    1.160442
2014-01-03 23:35:00    1.160442
2014-01-03 23:40:00    1.160442
2014-01-03 23:45:00    1.160442
2014-01-03 23:50:00    1.160442
2014-01-03 23:55:00    1.160442
2014-01-04 00:00:00    0.846886
Freq: 5T, Name: my series, Length: 865

In [ ]:
a

In [35]:
### the ```shift``` method makes it easy e.g. to compare a series with lagged / leading versions of itself
a.shift(periods=-1)


Out[35]:
2014-01-01   -0.367365
2014-01-02    1.160442
2014-01-03    0.846886
2014-01-04         NaN
Freq: D, Name: my series, dtype: float64
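
A common use of shift is computing the change between consecutive time steps; a minimal sketch (equivalent to a.diff()):


In [ ]:
a - a.shift(1) # subtract the previous day's value from each value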

In [ ]:
a

In [36]:
### and the ```truncate``` method allows easy selection of time slices
a.truncate(after='2014-1-2')


Out[36]:
2014-01-01    0.142453
2014-01-02   -0.367365
Freq: D, Name: my series, dtype: float64

DataFrames

DataFrames are IMHO one of the most powerful data structures in the Python / data analysis world.

They can be viewed as a collection of named Series. They feature two indexes, for the rows and the columns respectively, and can contain heterogeneous data types (although the type must be consistent within each column). Note that a DataFrame index, along the rows or the columns (or both!), can contain more than one level; such indexes are called hierarchical indexes and allow the representation of complex data organisation, as sketched below.
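
A minimal sketch of a hierarchical (two-level) row index; all the names here are made up:


In [ ]:
### a two-level (hierarchical) row index: station x variable
mi = pd.MultiIndex.from_tuples([('stationA', 'tmin'), ('stationA', 'tmax'),
                                ('stationB', 'tmin'), ('stationB', 'tmax')],
                               names=['station', 'var'])
hdf = pd.DataFrame(np.random.random((4, 2)), index=mi, columns=['jan', 'feb'])
hdf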

If the index along the rows of a DataFrame is of datetime type, all the methods exposed for the Series (re-sampling, shifting, truncating, etc) are available for the DataFrame.

DataFrame constructions


In [ ]:
import string # part of the standard library
idx = list(string.ascii_lowercase[:10]) # NB: string.lowercase is Python 2 only
print(idx)

In [ ]:
df = pd.DataFrame(np.arange(100).reshape(10,10),\
                  columns=idx,index=np.arange(1,11))

In [ ]:
df

In [ ]:
### here I am creating a DataFrame from a dictionary

df = pd.DataFrame({'A' : np.random.random(5),\
                   'B' : np.random.random(5),\
                   'C': np.random.random(5)}, index=np.arange(1,6))
print(df)

In [ ]:
df

Indexing a DataFrame

There are different ways, returning either a Series or a DataFrame


In [ ]:
type(df['A'])

In [ ]:
df[['A']]

In [ ]:
type(df['A'])

In [ ]:
type(df[['A']])

To access a particular row instead of a column, you use the ix method


In [ ]:
df.ix[3]

In [ ]:
df

accessing values by positional indices only


In [ ]:
df.iloc[2,0:2]

In [ ]:
type(_)

And of course you can combine row (with ix) and column indexing, using the same convention for slices as we saw for the Series


In [ ]:
df.ix[3]['A':'B']

In [ ]:
df.ix[3][['A','C']]

you can also use the loc method, giving it both row AND column indexes (the indexing is based on the label, NOT on the position)


In [ ]:
df

In [ ]:
df.loc[[3,5],['A','B']]

conditional indexing or subsetting of a DataFrame


In [ ]:
df

In [ ]:
subset = df[df['A'] <= 0.7]

In [ ]:
subset

queries

as of pandas version 0.14, you can build complex, database-like queries on DataFrames


In [ ]:
df

In [ ]:
df.query('A > B')

In [ ]:
df.query('(A > B) & (B > C)')
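
query can also reference variables from the surrounding Python scope with the @ prefix; a minimal sketch (threshold is a made-up cut-off):


In [ ]:
threshold = 0.5
df.query('A > @threshold') # same as df[df['A'] > threshold]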

Extending a DataFrame

Adding a column is easy


In [ ]:
df

In [ ]:
df['D'] = np.random.random(5)

In [ ]:
df

The following works because pandas understands that a single value must be repeated along the whole row index


In [ ]:
df['E'] = 5

In [ ]:
df

The following doesn't work because there's no way to tell where to insert the missing value (align to the first or the last index?)


In [ ]:
df['F'] = np.random.random(4)

Unless we make a Series out of it, with an index matching (at least partly) the DataFrame's row index


In [ ]:
df['F'] = pd.Series(np.random.random(4), index=np.arange(1,5)) # row 5 gets NaN

In [ ]:
df

Useful DataFrames methods

applying a numpy function

In [ ]:
df.apply(np.sqrt) # or np.sqrt(df)

summarizing data


In [ ]:
df.describe().T

replacing values

In [ ]:
df.head()

In [ ]:
df['E'].replace(5, 0.0, inplace=True)
# you can pass a dictionary to make multiple replacements at once

In [ ]:
df

DataFrame high-level plotting functions


In [ ]:
df.plot();

In [ ]:
df.drop('E', axis=1).plot(figsize=(8,12), \
                          subplots=True, \
                          sharex=True, \
                          kind='bar', rot=0);

In [ ]:
import pandas as pd
from sklearn.datasets import load_iris
 
iris = load_iris()
 
iris = pd.DataFrame(iris.data, columns=iris.feature_names)

In [ ]:
iris.head()

In [ ]:
iris.hist();
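
For pairwise relationships, pandas also ships a scatter_matrix helper; a minimal sketch (the import path below matches the pandas versions of this era):


In [ ]:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(iris, figsize=(8, 8), diagonal='kde');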

In [ ]:
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/visualization.html width=1000 height=350></iframe>')

Input and Output in pandas

Pandas has very powerful IO methods, allowing you to load CSV, Excel, and tab-delimited files very easily (a self-contained example follows the list below). DataFrames can also be saved as CSV or Excel files.

Other supported file types are:

  • JSON (JavaScript Object Notation)
  • HDF5
  • HTML (to e.g. read tables contained in HTML documents)
  • SQL (Pandas can query directly from SQL databases thanks to SQLAlchemy)
  • ...
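
As a self-contained illustration of the read_* family, here is a minimal sketch parsing CSV text from an in-memory buffer (the data is made up):


In [ ]:
from io import StringIO # on Python 2, you may need StringIO.StringIO instead
csv_text = u"year,value\n2014,0.5\n2015,0.7\n"
pd.read_csv(StringIO(csv_text), index_col='year')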

In [ ]:
pd.read # hit <TAB> here to list all the available pd.read_* functions

In [ ]:
SOI = pd.read_csv('../data/NIWA_SOI.csv')

In [ ]:
SOI.head()

In [ ]:
SOI = pd.read_csv('../data/NIWA_SOI.csv', index_col=0)

In [ ]:
SOI.head()

In [ ]:
SOI.ix[1950:2000].head()

Stacking


In [ ]:
SOIs = SOI.stack()

In [ ]:
SOIs.head()

In [ ]:
SOIs.index

In [ ]:
from dateutil import parser

In [ ]:
dateindex = [parser.parse("-".join(map(str, [x[0], x[1], 1]))) for x in SOIs.index]

In [ ]:
SOIs.index=dateindex

In [ ]:
SOIs.head()
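
An alternative to the dateutil loop above, assuming the stacked index holds (year, month) pairs as it does here, is to build the index directly with pd.to_datetime; a minimal sketch, starting again from the unstacked DataFrame:


In [ ]:
alt = SOI.stack()
alt.index = pd.to_datetime(['%s-%s-1' % (y, m) for (y, m) in alt.index])
alt.head()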

In [ ]:
SOIs.plot(figsize=(12,5))

In [ ]:
pd.rolling_mean(SOIs,12).plot()

Saving to csv, excel, etc.

In [ ]:
SOIs = pd.DataFrame(SOIs)

In [ ]:
SOIs.to_excel('../data/SOI_time_series.xlsx', header=False)

In [ ]:
#!open ../data/SOI_time_series.xlsx

Reading Excel files

first example: an Excel file, containing one sheet, with an Excel dates column


In [38]:
!open ../data/ISO_datetime.xls

In [39]:
data = pd.read_excel('../data/ISO_datetime.xls', sheetname='Sheet1')

In [40]:
data.head()


Out[40]:
date value
0 2014-01-02 0.786299
1 2014-01-03 0.354398
2 2014-01-04 0.747329
3 2014-01-05 0.461564
4 2014-01-06 0.626105

In [ ]:
data.date

In [41]:
data = pd.read_excel('../data/ISO_datetime.xls', \
                     sheetname='Sheet1', index_col=0)

In [43]:
data.index


Out[43]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-02, ..., 2014-01-11]
Length: 10, Freq: None, Timezone: None

correctly parsing dates when the year, month, and day information are contained in different columns


In [44]:
!open ../data/year_month_day.xlsx

In [ ]:
ymd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet1')
ymd.head()

In [45]:
ymd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet1', \
                    parse_dates = {'date': [0, 1, 2]}, \
                    index_col='date')
ymd.head()


Out[45]:
value
date
2014-01-02 0.786299
2014-01-03 0.354398
2014-01-04 0.747329
2014-01-05 0.461564
2014-01-06 0.626105

In [ ]:
myd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet2'); myd.head()

In [46]:
myd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet2',\
                    parse_dates = {'date': [1, 0, 2]}, \
                    index_col='date'); 
myd.head()


Out[46]:
value
date
2014-01-02 0.478467
2014-01-03 0.833801
2014-01-04 0.837646
2014-01-05 0.453646
2014-01-06 0.260355

groupby operations in Pandas

The groupby method is a very powerful method of pandas DataFrames; in a nutshell, it allows you to

  1. split your data according to unique values of a variable (or unique combinations of N variables)

  2. apply some operation to the groups thus defined, either an aggregation or transformation method

  3. combine the results into a DataFrame

This process is illustrated below, where the operation applied is calculating the mean of each group's values

A very nice explanation of the groupby method, with examples, is available from Pandas's documentation at:

http://pandas.pydata.org/pandas-docs/stable/groupby.html

and a short tutorial is available on Wes McKinney's blog


In [47]:
Image(filename='images/split-apply-combine.png', width=800)


Out[47]:
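
To make the three steps concrete before turning to real data, here is a minimal sketch on a made-up DataFrame:


In [ ]:
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1., 2., 3., 4.]})
toy.groupby('key').mean() # split on 'key', apply mean, combine the results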

In [ ]:
url = "ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii"

In [ ]:
!wget -P ../data ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii

In [48]:
data = pd.read_table('../data/ersst3b.nino.mth.81-10.ascii', sep='\s+') # the regular expression '\s+' matches one or more whitespace characters

In [ ]:
# if the network configuration allows it, you can read directly off the URL (ftp):

#data = pd.read_table(url, sep='\s+')

In [49]:
data.tail()


Out[49]:
YR MON NINO1+2 ANOM NINO3 ANOM.1 NINO4 ANOM.2 NINO3.4 ANOM.3
776 2014 9 21.54 0.66 25.33 0.31 29.14 0.42 27.00 0.17
777 2014 10 21.69 0.49 25.72 0.65 29.22 0.50 27.24 0.46
778 2014 11 22.44 0.62 26.05 0.91 29.38 0.72 27.56 0.83
779 2014 12 23.41 0.40 26.09 0.79 29.19 0.65 27.35 0.66
780 2015 1 24.83 0.15 26.28 0.51 28.97 0.65 27.21 0.54

I keep only the raw - monthly - values of NINO3.4


In [50]:
nino = data[['YR','MON','NINO3.4']]

In [51]:
nino.tail()


Out[51]:
YR MON NINO3.4
776 2014 9 27.00
777 2014 10 27.24
778 2014 11 27.56
779 2014 12 27.35
780 2015 1 27.21

Now I want to calculate a climatology (over the whole period available).

I first group by the UNIQUE values of the variable MON; I should get 12 groups


In [52]:
groups = nino.groupby('MON')

In [53]:
for month, group in groups:
    print(month)
    print(group.head())


1
      YR  MON  NINO3.4
0   1950    1    24.83
12  1951    1    25.46
24  1952    1    26.85
36  1953    1    26.85
48  1954    1    27.03
2
      YR  MON  NINO3.4
1   1950    2    25.20
13  1951    2    25.78
25  1952    2    26.79
37  1953    2    27.19
49  1954    2    27.22
3
      YR  MON  NINO3.4
2   1950    3    26.03
14  1951    3    26.72
26  1952    3    27.32
38  1953    3    27.68
50  1954    3    27.21
4
      YR  MON  NINO3.4
3   1950    4    26.36
15  1951    4    27.24
27  1952    4    27.88
39  1953    4    28.19
51  1954    4    26.87
5
      YR  MON  NINO3.4
4   1950    5    26.19
16  1951    5    27.68
28  1952    5    27.99
40  1953    5    28.29
52  1954    5    27.07
6
      YR  MON  NINO3.4
5   1950    6    26.52
17  1951    6    27.46
29  1952    6    27.33
41  1953    6    28.02
53  1954    6    26.93
7
      YR  MON  NINO3.4
6   1950    7    26.42
18  1951    7    27.72
30  1952    7    26.72
42  1953    7    27.52
54  1954    7    26.37
8
      YR  MON  NINO3.4
7   1950    8    25.98
19  1951    8    27.36
31  1952    8    26.46
43  1953    8    27.16
55  1954    8    25.73
9
      YR  MON  NINO3.4
8   1950    9    25.78
20  1951    9    27.51
32  1952    9    26.54
44  1953    9    27.13
56  1954    9    25.38
10
      YR  MON  NINO3.4
9   1950   10    25.96
21  1951   10    27.43
33  1952   10    26.54
45  1953   10    27.02
57  1954   10    25.51
11
      YR  MON  NINO3.4
10  1950   11    25.64
22  1951   11    27.48
34  1952   11    26.36
46  1953   11    26.96
58  1954   11    25.67
12
      YR  MON  NINO3.4
11  1950   12    25.50
23  1951   12    27.12
35  1952   12    26.53
47  1953   12    26.99
59  1954   12    25.37
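
A quick sanity check that we indeed got 12 groups:


In [ ]:
len(groups) # the number of groups, i.e. the number of unique values of MON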

In [54]:
climatology = groups.mean()

Same as

climatology = groups.aggregate(np.mean)

In [55]:
climatology['NINO3.4'].head(12)


Out[55]:
MON
1      26.495606
2      26.677077
3      27.184308
4      27.623692
5      27.739692
6      27.544462
7      27.152000
8      26.765385
9      26.672462
10     26.642000
11     26.608615
12     26.541692
Name: NINO3.4, dtype: float64

In [56]:
f, ax = plt.subplots()

climatology['NINO3.4'].plot(ax=ax, kind='bar',ylim=[26,28], rot=0, width=0.9)

ax.set_xticklabels(list('JFMAMJJASOND'));

f.savefig('../outputs/climatology.pdf')


Now suppose we want to apply a function that doesn't aggregate the values in the groups (as sum or mean do), but rather transforms them ...

An example would be calculating the standardized anomalies per month (from each value subtract the mean of the corresponding month, then divide by the standard deviation)


In [57]:
def zscore(x): 
    z = (x - x.mean()) / x.std()
    return z

In [58]:
transformed = nino.groupby(nino.MON).apply(zscore)

In [59]:
transformed['NINO3.4'].plot()


Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x10877be50>
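
The same result can be obtained with groupby's transform method, which is designed for exactly this case (a function returning an output indexed like its input); a minimal sketch for the NINO3.4 column:


In [ ]:
nino.groupby(nino.MON)['NINO3.4'].transform(zscore).plot()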

Now we want to calculate (just) the anomalies with respect to the climatology, but with a normal established over 1981 - 2000


In [60]:
nino.head()


Out[60]:
YR MON NINO3.4
0 1950 1 24.83
1 1950 2 25.20
2 1950 3 26.03
3 1950 4 26.36
4 1950 5 26.19

We can make use of the query method of pandas DataFrames to select the climatological period


In [61]:
nino.query('YR >= 1981 & YR <= 2000').head()


Out[61]:
YR MON NINO3.4
372 1981 1 26.17
373 1981 2 26.13
374 1981 3 26.75
375 1981 4 27.32
376 1981 5 27.41

In [62]:
def demean(x): 
    z = x - x.query('YR >= 1981 & YR <= 2000').mean()
    return z

In [63]:
anoms = nino.groupby([nino.MON]).apply(demean)

In [64]:
f, ax = plt.subplots(figsize=(10,6))
anoms['NINO3.4'].plot()
data['ANOM.3'].plot() # Should be the same


Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1088c2ad0>

Exercise

The file Daily_clim_data_Auckland.csv in the data directory contains daily values for rainfall, minimum and maximum temperatures for Auckland from 1972 to now.

  1. Read the data into a pandas DataFrame, correctly setting up the index as a datetime object
  2. Calculate monthly means for the temperature data, and monthly totals for rainfall
  3. Calculate and plot a climatology over the period 1981 - 2000 for all parameters
  4. Calculate the anomalies WRT this climatology for all parameters
  5. Plot the time series of minimum and maximum temperature
  6. Dig into the SciPy documentation to find out how to calculate the linear trend over the period for the temperature data
  7. Calculate and plot the trend
  8. Contact the New Zealand Climate Science Coalition

In [ ]: