In [63]:
%load_ext load_style
%load_style talk.css
from IPython.display import Image, HTML
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


The load_style extension is already loaded. To reload it, use:
  %reload_ext load_style

Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data. It is a fundamental high-level building block for doing practical, real world data analysis in Python.

pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Key features:

  • Easy handling of missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes
  • Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

In [39]:
HTML('<iframe src=http://pandas.pydata.org/index.html width=900 height=350></iframe>')


Out[39]:

The conventional way to import pandas is import pandas as pd


In [40]:
import pandas as pd

setting some options


In [41]:
pd.set_option("display.width", 80)
# toggle the line below if one doesn't want DataFrames displayed as HTML tables
#pd.set_option("notebook_repr_html", False) 
pd.set_option("notebook_repr_html", True)

Pandas's data structures and functionality will be familiar to R users; there is a section on pandas's website where Wes McKinney gives translations of common idioms / operations between R and pandas


In [42]:
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#compare-with-r width=900 height=350></iframe>')


Out[42]:

Pandas data structures

Series

A Series is a single vector of data values (think of a one-dimensional NumPy array, of shape (N,)) with an index that labels each element in the vector.

Series constructions


In [43]:
a = pd.Series(np.random.normal(0,1,(10,)))

In [44]:
a


Out[44]:
0    0.037009
1    0.593607
2   -1.623254
3    0.294439
4    2.162388
5   -0.570866
6    1.671018
7    1.399286
8   -0.573680
9   -0.160413
dtype: float64

In [45]:
a.index


Out[45]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [46]:
a.values


Out[46]:
array([ 0.03700874,  0.59360691, -1.6232544 ,  0.29443856,  2.16238755,
       -0.57086554,  1.67101806,  1.39928642, -0.57367972, -0.16041259])

You can define your own index


In [47]:
a = pd.Series(np.random.normal(0,1,(10,)), index=np.arange(1,11))

In [48]:
a


Out[48]:
1     0.950278
2    -0.206156
3    -0.136064
4     0.247603
5    -0.228526
6    -0.199513
7    -0.071164
8     1.546748
9    -0.712044
10    0.214061
dtype: float64

In [49]:
a = pd.Series(np.random.normal(0,1,5), \
              index=['a','b','c','d','e'], name='my series')

In [50]:
a


Out[50]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64

Pandas objects expose some powerful, high-level plotting functions (built on top of Matplotlib)


In [51]:
plot = a.plot(kind='bar', rot=0, color='w', hatch='///', title=a.name, width=0.9, grid=False)



In [52]:
f, ax = plt.subplots()
bars = ax.bar(np.arange(len(a)), a.values, color='w', edgecolor='k', align='center', hatch='///')
ax.set_xticks(np.arange(len(a)))
ax.set_xlim(-0.5, len(a)-0.5)
ax.set_xticklabels(a.index)
ax.set_title(a.name)


Out[52]:
<matplotlib.text.Text at 0x10b9342d0>

Series indexing

Selecting from a Series is easy, using the corresponding index key (like a dict)


In [53]:
a


Out[53]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64

In [54]:
a['c']


Out[54]:
-1.607072752509346
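
Like a dict, a Series also supports membership tests and a get method with a default; a minimal sketch (the key 'z' is made up and absent on purpose):


In [ ]:
print('c' in a)        # membership is tested against the index labels
print(a.get('z', 0.0)) # like dict.get: returns the default for a missing label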

slices are permitted


In [55]:
a['a':'c'] ### NB: unlike standard Python / NumPy positional indexing, label-based slices include the endpoint


Out[55]:
a   -0.448799
b   -0.410785
c   -1.607073
Name: my series, dtype: float64

In [56]:
a['c':]


Out[56]:
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
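
For comparison, positional slicing (via iloc) follows the usual Python convention and excludes the endpoint; a minimal sketch:


In [ ]:
a.iloc[0:2] # positions 0 and 1 only ('a' and 'b'): the endpoint is excluded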

deleting an element


In [57]:
a.drop('d') # returns a new Series; a itself is left unchanged


Out[57]:
a   -0.448799
b   -0.410785
c   -1.607073
e    0.024734
Name: my series, dtype: float64

Adding an element to, or modifying an element of, a Series in place: use the loc method


In [58]:
a


Out[58]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64

In [59]:
a.loc['e'] = 1

In [60]:
a


Out[60]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    1.000000
Name: my series, dtype: float64

In [61]:
a.loc['f'] = 5

In [62]:
a


Out[62]:
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    1.000000
f    5.000000
Name: my series, dtype: float64

Mathematical operations involving two Series are performed by aligning indices.

  1. The union of all indices is created
  2. The mathematical operation is performed on matching indices.

Indices that do not match are given the value NaN (not a number), and values are computed for all unique pairs of repeated indices.


In [24]:
s1 = pd.Series(np.arange(1.0,4.0),index=['a','b','c'])
s2 = pd.Series(np.arange(1.0,4.0),index=['b','c','d'])

In [25]:
s3 = s1 + s2

In [26]:
s3


Out[26]:
a   NaN
b     3
c     5
d   NaN
dtype: float64
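
Two related points are worth a quick sketch: with repeated labels a value is computed for every pairing, and the add method's fill_value argument substitutes a default for missing labels instead of producing NaN (s5 and s6 are made up for illustration):


In [ ]:
s5 = pd.Series([1., 2.], index=['a', 'a'])
s6 = pd.Series([10., 20.], index=['a', 'a'])
print(s5 + s6)                   # 2 x 2 = 4 values for the repeated label 'a'
print(s1.add(s2, fill_value=0.)) # 'a' and 'd' are treated as 0 instead of NaN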

NaNs are skipped in aggregation operations (such as mean or sum)


In [27]:
s3.mean()


Out[27]:
4.0

You can drop them from the Series


In [28]:
s4 = s3.dropna()

In [29]:
s4


Out[29]:
b    3
c    5
dtype: float64

Or use the fillna method to replace them with a value


In [30]:
s3.fillna(-999)


Out[30]:
a   -999
b      3
c      5
d   -999
dtype: float64

In [31]:
s3.fillna(s3.mean())


Out[31]:
a    4
b    3
c    5
d    4
dtype: float64

Series with a date / datetime index (timeseries)

Series can have indexes representing dates / times


In [ ]:
a

In [32]:
a.index = pd.date_range(start='2014-1-1', periods=len(a)) # the default frequency is daily ('D')

In [33]:
a.head()


Out[33]:
2014-01-01    0.142453
2014-01-02   -0.367365
2014-01-03    1.160442
2014-01-04    0.846886
Freq: D, Name: my series, dtype: float64

In [ ]:
a.index

In [ ]:
### but you can convert it to a NumPy array of Python datetime objects if you want
py_datetimes = a.index.to_pydatetime()
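
A DatetimeIndex also supports partial string indexing, where a (partial) date string selects the matching entries; a minimal sketch, assuming the daily index built above:


In [ ]:
a['2014-01-02'] # a['2014-01'] would select the whole month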

A number of useful methods for manipulating time series are exposed


In [34]:
### resample the daily time series to a 5-minute frequency, using the forward-filling method
a.resample('5min',fill_method='ffill')


Out[34]:
2014-01-01 00:00:00    0.142453
2014-01-01 00:05:00    0.142453
2014-01-01 00:10:00    0.142453
2014-01-01 00:15:00    0.142453
2014-01-01 00:20:00    0.142453
2014-01-01 00:25:00    0.142453
2014-01-01 00:30:00    0.142453
2014-01-01 00:35:00    0.142453
2014-01-01 00:40:00    0.142453
2014-01-01 00:45:00    0.142453
2014-01-01 00:50:00    0.142453
2014-01-01 00:55:00    0.142453
2014-01-01 01:00:00    0.142453
2014-01-01 01:05:00    0.142453
2014-01-01 01:10:00    0.142453
...
2014-01-03 22:50:00    1.160442
2014-01-03 22:55:00    1.160442
2014-01-03 23:00:00    1.160442
2014-01-03 23:05:00    1.160442
2014-01-03 23:10:00    1.160442
2014-01-03 23:15:00    1.160442
2014-01-03 23:20:00    1.160442
2014-01-03 23:25:00    1.160442
2014-01-03 23:30:00    1.160442
2014-01-03 23:35:00    1.160442
2014-01-03 23:40:00    1.160442
2014-01-03 23:45:00    1.160442
2014-01-03 23:50:00    1.160442
2014-01-03 23:55:00    1.160442
2014-01-04 00:00:00    0.846886
Freq: 5T, Name: my series, Length: 865

In [ ]:
a

In [35]:
### the ```shift``` method makes it easy e.g. to compare a series with lagged / leading versions of itself
a.shift(periods=-1)


Out[35]:
2014-01-01   -0.367365
2014-01-02    1.160442
2014-01-03    0.846886
2014-01-04         NaN
Freq: D, Name: my series, dtype: float64
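
A common use of shift is computing the change between consecutive time steps; a minimal sketch (equivalent to a.diff()):


In [ ]:
a - a.shift(1) # subtract the previous day's value from each value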

In [ ]:
a

In [36]:
### and the ```truncate``` method allows easy selection of time slices
a.truncate(after='2014-1-2')


Out[36]:
2014-01-01    0.142453
2014-01-02   -0.367365
Freq: D, Name: my series, dtype: float64

DataFrames

DataFrames are IMHO one of the most powerful data structures in the Python / data analysis world.

They can be viewed as a collection of named Series. They feature two indexes, for the rows and the columns respectively, and can contain heterogeneous data types (although the type must be consistent within each column). Note that a DataFrame index, along the rows or the columns (or both!), can contain more than one level; such indexes are called hierarchical indexes and allow the representation of complex data organisation, as sketched below.
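
A minimal sketch of a hierarchical (two-level) row index; all the names here are made up:


In [ ]:
### a two-level (hierarchical) row index: station x variable
mi = pd.MultiIndex.from_tuples([('stationA', 'tmin'), ('stationA', 'tmax'),
                                ('stationB', 'tmin'), ('stationB', 'tmax')],
                               names=['station', 'var'])
hdf = pd.DataFrame(np.random.random((4, 2)), index=mi, columns=['jan', 'feb'])
hdf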

If the index along the rows of a DataFrame is of datetime type, all the methods exposed for the Series (re-sampling, shifting, truncating, etc) are available for the DataFrame.

DataFrame constructions


In [ ]:
import string # part of the standard library
idx = list(string.ascii_lowercase[:10]) # NB: string.lowercase is Python 2 only
print(idx)

In [ ]:
df = pd.DataFrame(np.arange(100).reshape(10,10),\
                  columns=idx,index=np.arange(1,11))

In [ ]:
df

In [ ]:
### here I am creating a DataFrame from a dictionary

df = pd.DataFrame({'A' : np.random.random(5),\
                   'B' : np.random.random(5),\
                   'C': np.random.random(5)}, index=np.arange(1,6))
print(df)

In [ ]:
df

Indexing a DataFrame

There are different ways, returning either a Series or a DataFrame


In [ ]:
type(df['A'])

In [ ]:
df[['A']]

In [ ]:
type(df['A'])

In [ ]:
type(df[['A']])

To access a particular row instead of a column, you use the ix method


In [ ]:
df.ix[3]

In [ ]:
df

accessing values by positional indices only


In [ ]:
df.iloc[2,0:2]

In [ ]:
type(_)

And of course you can combine row (with ix) and column indexing, using the same convention for slices as we saw for the Series


In [ ]:
df.ix[3]['A':'B']

In [ ]:
df.ix[3][['A','C']]

you can also use the loc method, giving it both row AND column indexes (the indexing is based on the label, NOT on the position)


In [ ]:
df

In [ ]:
df.loc[[3,5],['A','B']]

conditional indexing or subsetting of a DataFrame


In [ ]:
df

In [ ]:
subset = df[df['A'] <= 0.7]

In [ ]:
subset

queries

as of pandas version 0.14, you can build complex, database-like queries on DataFrames


In [ ]:
df

In [ ]:
df.query('A > B')

In [ ]:
df.query('(A > B) & (B > C)')
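
query can also reference variables from the surrounding Python scope with the @ prefix; a minimal sketch (threshold is a made-up cut-off):


In [ ]:
threshold = 0.5
df.query('A > @threshold') # same as df[df['A'] > threshold]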

Extending a DataFrame

Adding a column is easy


In [ ]:
df

In [ ]:
df['D'] = np.random.random(5)

In [ ]:
df

The following works because pandas understands that a single value must be repeated along the whole row index


In [ ]:
df['E'] = 5

In [ ]:
df

The following doesn't work because there's no way to tell where to insert the missing value (align to the first or the last index?)


In [ ]:
df['F'] = np.random.random(4)

Unless we make a Series out of it, with an index matching (at least partly) the DataFrame's row index


In [ ]:
df['F'] = pd.Series(np.random.random(4), index=np.arange(1,5)) # row 5 gets NaN

In [ ]:
df

Useful DataFrames methods

applying a numpy function

In [ ]:
df.apply(np.sqrt) # or np.sqrt(df)

summarizing data


In [ ]:
df.describe().T

replacing values

In [ ]:
df.head()

In [ ]:
df['E'].replace(5, 0.0, inplace=True)
# you can pass a dictionary to make multiple replacements at once

In [ ]:
df

DataFrame high-level plotting functions


In [ ]:
df.plot();

In [ ]:
df.drop('E', axis=1).plot(figsize=(8,12), \
                          subplots=True, \
                          sharex=True, \
                          kind='bar', rot=0);

In [ ]:
import pandas as pd
from sklearn.datasets import load_iris
 
iris = load_iris()
 
iris = pd.DataFrame(iris.data, columns=iris.feature_names)

In [ ]:
iris.head()

In [ ]:
iris.hist();
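
For pairwise relationships, pandas also ships a scatter_matrix helper; a minimal sketch (the import path below matches the pandas versions of this era):


In [ ]:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(iris, figsize=(8, 8), diagonal='kde');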

In [ ]:
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/visualization.html width=1000 height=350></iframe>')

Input and Output in pandas

Pandas has very powerful IO methods, allowing you to load CSV, Excel, and tab-delimited files very easily (a self-contained example follows the list below). DataFrames can also be saved as CSV or Excel files.

Other supported file types are:

  • JSON (JavaScript Object Notation)
  • HDF5
  • HTML (to e.g. read tables contained in HTML documents)
  • SQL (Pandas can query directly from SQL databases thanks to SQLAlchemy)
  • ...
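
As a self-contained illustration of the read_* family, here is a minimal sketch parsing CSV text from an in-memory buffer (the data is made up):


In [ ]:
from io import StringIO # on Python 2, you may need StringIO.StringIO instead
csv_text = u"year,value\n2014,0.5\n2015,0.7\n"
pd.read_csv(StringIO(csv_text), index_col='year')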

In [ ]:
pd.read # hit <TAB> here to list all the available pd.read_* functions

In [ ]:
SOI = pd.read_csv('../data/NIWA_SOI.csv')

In [ ]:
SOI.head()

In [ ]:
SOI = pd.read_csv('../data/NIWA_SOI.csv', index_col=0)

In [ ]:
SOI.head()

In [ ]:
SOI.ix[1950:2000].head()

Stacking


In [ ]:
SOIs = SOI.stack()

In [ ]:
SOIs.head()

In [ ]:
SOIs.index

In [ ]:
from dateutil import parser

In [ ]:
dateindex = [parser.parse("-".join(map(str, [x[0], x[1], 1]))) for x in SOIs.index]

In [ ]:
SOIs.index=dateindex

In [ ]:
SOIs.head()
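
An alternative to the dateutil loop above, assuming the stacked index holds (year, month) pairs as it does here, is to build the index directly with pd.to_datetime; a minimal sketch, starting again from the unstacked DataFrame:


In [ ]:
alt = SOI.stack()
alt.index = pd.to_datetime(['%s-%s-1' % (y, m) for (y, m) in alt.index])
alt.head()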

In [ ]:
SOIs.plot(figsize=(12,5))

In [ ]:
pd.rolling_mean(SOIs,12).plot()

Saving to csv, excel, etc.

In [ ]:
SOIs = pd.DataFrame(SOIs)

In [ ]:
SOIs.to_excel('../data/SOI_time_series.xlsx', header=False)

In [ ]:
#!open ../data/SOI_time_series.xlsx

Reading Excel files

first example: an Excel file, containing one sheet, with an Excel dates column


In [38]:
!open ../data/ISO_datetime.xls

In [39]:
data = pd.read_excel('../data/ISO_datetime.xls', sheetname='Sheet1')

In [40]:
data.head()


Out[40]:
date value
0 2014-01-02 0.786299
1 2014-01-03 0.354398
2 2014-01-04 0.747329
3 2014-01-05 0.461564
4 2014-01-06 0.626105

In [ ]:
data.date

In [41]:
data = pd.read_excel('../data/ISO_datetime.xls', \
                     sheetname='Sheet1', index_col=0)

In [43]:
data.index


Out[43]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-02, ..., 2014-01-11]
Length: 10, Freq: None, Timezone: None

correctly parsing dates when the year, month, and day information are contained in different columns


In [44]:
!open ../data/year_month_day.xlsx

In [ ]:
ymd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet1')
ymd.head()

In [45]:
ymd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet1', \
                    parse_dates = {'date': [0, 1, 2]}, \
                    index_col='date')
ymd.head()


Out[45]:
value
date
2014-01-02 0.786299
2014-01-03 0.354398
2014-01-04 0.747329
2014-01-05 0.461564
2014-01-06 0.626105

In [ ]:
myd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet2'); myd.head()

In [46]:
myd = pd.read_excel('../data/year_month_day.xlsx', sheetname='Sheet2',\
                    parse_dates = {'date': [1, 0, 2]}, \
                    index_col='date'); 
myd.head()


Out[46]:
value
date
2014-01-02 0.478467
2014-01-03 0.833801
2014-01-04 0.837646
2014-01-05 0.453646
2014-01-06 0.260355

groupby operations in Pandas

The groupby method is a very powerful method of pandas DataFrames; in a nutshell, it allows you to

  1. split your data according to unique values of a variable (or unique combinations of N variables)

  2. apply some operation to the groups thus defined, either an aggregation or transformation method

  3. combine the results into a DataFrame

This process is illustrated below, where the operation applied is calculating the mean of each group's values

A very nice explanation of the groupby method, with examples, is available from Pandas's documentation at:

http://pandas.pydata.org/pandas-docs/stable/groupby.html

and a short tutorial is available on Wes McKinney's blog


In [47]:
Image(filename='images/split-apply-combine.png', width=800)


Out[47]:
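
To make the three steps concrete before turning to real data, here is a minimal sketch on a made-up DataFrame:


In [ ]:
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1., 2., 3., 4.]})
toy.groupby('key').mean() # split on 'key', apply mean, combine the results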

In [ ]:
url = "ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii"

In [ ]:
!wget -P ../data ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii

In [48]:
data = pd.read_table('../data/ersst3b.nino.mth.81-10.ascii', sep='\s+') # the regular expression '\s+' matches one or more whitespace characters

In [ ]:
# if the network configuration allows it, you can read directly off the URL (ftp):

#data = pd.read_table(url, sep='\s+')

In [49]:
data.tail()


Out[49]:
YR MON NINO1+2 ANOM NINO3 ANOM.1 NINO4 ANOM.2 NINO3.4 ANOM.3
776 2014 9 21.54 0.66 25.33 0.31 29.14 0.42 27.00 0.17
777 2014 10 21.69 0.49 25.72 0.65 29.22 0.50 27.24 0.46
778 2014 11 22.44 0.62 26.05 0.91 29.38 0.72 27.56 0.83
779 2014 12 23.41 0.40 26.09 0.79 29.19 0.65 27.35 0.66
780 2015 1 24.83 0.15 26.28 0.51 28.97 0.65 27.21 0.54

I keep only the raw - monthly - values of NINO3.4


In [50]:
nino = data[['YR','MON','NINO3.4']]

In [51]:
nino.tail()


Out[51]:
YR MON NINO3.4
776 2014 9 27.00
777 2014 10 27.24
778 2014 11 27.56
779 2014 12 27.35
780 2015 1 27.21

Now I want to calculate a climatology (over the whole period available).

I first group by the UNIQUE values of the variable MON; I should get 12 groups


In [52]:
groups = nino.groupby('MON')

In [53]:
for month, group in groups:
    print(month)
    print(group.head())


1
      YR  MON  NINO3.4
0   1950    1    24.83
12  1951    1    25.46
24  1952    1    26.85
36  1953    1    26.85
48  1954    1    27.03
2
      YR  MON  NINO3.4
1   1950    2    25.20
13  1951    2    25.78
25  1952    2    26.79
37  1953    2    27.19
49  1954    2    27.22
3
      YR  MON  NINO3.4
2   1950    3    26.03
14  1951    3    26.72
26  1952    3    27.32
38  1953    3    27.68
50  1954    3    27.21
4
      YR  MON  NINO3.4
3   1950    4    26.36
15  1951    4    27.24
27  1952    4    27.88
39  1953    4    28.19
51  1954    4    26.87
5
      YR  MON  NINO3.4
4   1950    5    26.19
16  1951    5    27.68
28  1952    5    27.99
40  1953    5    28.29
52  1954    5    27.07
6
      YR  MON  NINO3.4
5   1950    6    26.52
17  1951    6    27.46
29  1952    6    27.33
41  1953    6    28.02
53  1954    6    26.93
7
      YR  MON  NINO3.4
6   1950    7    26.42
18  1951    7    27.72
30  1952    7    26.72
42  1953    7    27.52
54  1954    7    26.37
8
      YR  MON  NINO3.4
7   1950    8    25.98
19  1951    8    27.36
31  1952    8    26.46
43  1953    8    27.16
55  1954    8    25.73
9
      YR  MON  NINO3.4
8   1950    9    25.78
20  1951    9    27.51
32  1952    9    26.54
44  1953    9    27.13
56  1954    9    25.38
10
      YR  MON  NINO3.4
9   1950   10    25.96
21  1951   10    27.43
33  1952   10    26.54
45  1953   10    27.02
57  1954   10    25.51
11
      YR  MON  NINO3.4
10  1950   11    25.64
22  1951   11    27.48
34  1952   11    26.36
46  1953   11    26.96
58  1954   11    25.67
12
      YR  MON  NINO3.4
11  1950   12    25.50
23  1951   12    27.12
35  1952   12    26.53
47  1953   12    26.99
59  1954   12    25.37
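
A quick sanity check that we indeed got 12 groups:


In [ ]:
len(groups) # the number of groups, i.e. the number of unique values of MON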

In [54]:
climatology = groups.mean()

Same as

climatology = groups.aggregate(np.mean)

In [55]:
climatology['NINO3.4'].head(12)


Out[55]:
MON
1      26.495606
2      26.677077
3      27.184308
4      27.623692
5      27.739692
6      27.544462
7      27.152000
8      26.765385
9      26.672462
10     26.642000
11     26.608615
12     26.541692
Name: NINO3.4, dtype: float64

In [56]:
f, ax = plt.subplots()

climatology['NINO3.4'].plot(ax=ax, kind='bar',ylim=[26,28], rot=0, width=0.9)

ax.set_xticklabels(list('JFMAMJJASOND'));

f.savefig('../outputs/climatology.pdf')


Now suppose we want to apply a function that doesn't aggregate the values in the groups (as sum or mean do), but rather transforms them ...

An example would be calculating the standardized anomalies per month (from each value subtract the mean of the corresponding month, then divide by the standard deviation)


In [57]:
def zscore(x): 
    z = (x - x.mean()) / x.std()
    return z

In [58]:
transformed = nino.groupby(nino.MON).apply(zscore)

In [59]:
transformed['NINO3.4'].plot()


Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x10877be50>
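
The same result can be obtained with groupby's transform method, which is designed for exactly this case (a function returning an output indexed like its input); a minimal sketch for the NINO3.4 column:


In [ ]:
nino.groupby(nino.MON)['NINO3.4'].transform(zscore).plot()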

Now we want to calculate (just) the anomalies with respect to the climatology, but with a normal established over 1981 - 2000


In [60]:
nino.head()


Out[60]:
YR MON NINO3.4
0 1950 1 24.83
1 1950 2 25.20
2 1950 3 26.03
3 1950 4 26.36
4 1950 5 26.19

We can make use of the query method of pandas DataFrames to select the climatological period


In [61]:
nino.query('YR >= 1981 & YR <= 2000').head()


Out[61]:
YR MON NINO3.4
372 1981 1 26.17
373 1981 2 26.13
374 1981 3 26.75
375 1981 4 27.32
376 1981 5 27.41

In [62]:
def demean(x): 
    z = x - x.query('YR >= 1981 & YR <= 2000').mean()
    return z

In [63]:
anoms = nino.groupby([nino.MON]).apply(demean)

In [64]:
f, ax = plt.subplots(figsize=(10,6))
anoms['NINO3.4'].plot()
data['ANOM.3'].plot() # Should be the same


Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1088c2ad0>

Exercise

The file Daily_clim_data_Auckland.csv in the data directory contains daily values for rainfall, minimum and maximum temperatures for Auckland from 1972 to now.

  1. Read the data into a pandas DataFrame, correctly setting up the index as a datetime object
  2. Calculate monthly means for the temperature data, and monthly totals for rainfall
  3. Calculate and plot a climatology over the period 1981 - 2000 for all parameters
  4. Calculate the anomalies WRT this climatology for all parameters
  5. Plot the time series of minimum and maximum temperature
  6. Dig into the SciPy documentation to find out how to calculate the linear trend over the period for the temperature data
  7. Calculate and plot the trend
  8. Contact the New Zealand Climate Science Coalition

In [ ]: