Licensed under CC BY 4.0 Creative Commons
If you want to follow along, this is a notebook that you can view or run yourself. You will need pandas > 0.15 (an easy way to get it is the Anaconda distribution).
Some imports:
In [1]:
    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
pd.options.display.max_rows = 8
    
In [2]:
    
import airbase
data = airbase.load_data()
    
In [3]:
    
data
    
    Out[3]:
Starting from this dataset of hourly air quality measurements, we can move to answering questions about this data in a few lines of code:
Does the air pollution show a decreasing trend over the years?
In [4]:
    
data['1999':].resample('A').plot(ylim=[0,100])
    
    Out[4]:
    
How many exceedances of the limit values?
In [5]:
    
exceedances = data > 200
exceedances = exceedances.groupby(exceedances.index.year).sum()
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')
    
    Out[5]:
    
What is the difference in diurnal profile between weekdays and the weekend?
In [6]:
    
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['FR04012'].mean().unstack(level=0)
data_weekend.plot()
    
    Out[6]:
    
We will come back to these examples and build them up step by step.
For data-intensive work in Python, the pandas library has become essential.
What is pandas?
- A DataFrame object comparable to R's data.frame, built on top of numpy arrays
- Tools for working with missing data (.dropna(), pd.isnull())
- Merging and joining of datasets (concat, join)
- groupby functionality
- Reshaping of data (stack, pivot)
Its documentation: http://pandas.pydata.org/pandas-docs/stable/
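As a tiny illustration of the missing-data handling mentioned above (a minimal sketch, not part of the original notebook):
In [ ]:
    
s = pd.Series([1.0, np.nan, 3.0])
pd.isnull(s)   # boolean mask marking the missing entry
s.dropna()     # drops the missing entry
    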
The first basic data structure is the Series, a one-dimensional labeled array:
In [7]:
    
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s
    
    Out[7]:
In [8]:
    
s.index
    
    Out[8]:
You can access the underlying numpy array representation with the .values attribute:
In [9]:
    
s.values
    
    Out[9]:
We can access series values via the index, just like for NumPy arrays:
In [10]:
    
s[0]
    
    Out[10]:
Unlike the NumPy array, though, this index can be something other than integers:
In [11]:
    
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2
    
    Out[11]:
In [12]:
    
s2['c']
    
    Out[12]:
In this way, a Series object can be thought of as similar to an ordered dictionary, mapping typed keys to typed values:
In [13]:
    
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 'United Kingdom': 64.9, 'Netherlands': 16.9})
population
    
    Out[13]:
In [14]:
    
population['France']
    
    Out[14]:
but with the power of numpy arrays:
In [15]:
    
population * 1000
    
    Out[15]:
We can index or slice the populations as expected:
In [16]:
    
population['Belgium']
    
    Out[16]:
In [17]:
    
population['Belgium':'Germany']
    
    Out[17]:
Note that, unlike positional slicing, such label-based slicing includes the endpoint ('Germany' in this case).
Many things you can do with numpy arrays can also be applied to Series objects.
Fancy indexing, like indexing with a list or boolean indexing:
In [18]:
    
population[['France', 'Netherlands']]
    
    Out[18]:
In [19]:
    
population[population > 20]
    
    Out[19]:
Element-wise operations:
In [20]:
    
population / 100
    
    Out[20]:
A range of methods:
In [21]:
    
population.mean()
    
    Out[21]:
Pay attention to alignment: operations between series will align on the index:
In [22]:
    
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]
    
In [23]:
    
s1
    
    Out[23]:
In [24]:
    
s2
    
    Out[24]:
In [25]:
    
s1 + s2
    
    Out[25]:
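The 'Belgium' and 'Germany' entries become NaN because they occur in only one of the two series. If you would rather treat a missing entry as zero in such an operation, the add method takes a fill_value argument (an illustrative variation):
In [ ]:
    
s1.add(s2, fill_value=0)
    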
A DataFrame is a tabular, two-dimensional data structure: an ordered collection of columns, where each column can have a different type. One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.
Note that in the IPython notebook, the dataframe will display in a rich HTML view:
In [26]:
    
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
    
    Out[26]:
In [27]:
    
countries.index
    
    Out[27]:
In [28]:
    
countries.columns
    
    Out[28]:
To check the data types of the different columns:
In [29]:
    
countries.dtypes
    
    Out[29]:
An overview of that information can be given with the info() method:
In [30]:
    
countries.info()
    
    
A DataFrame also has a values attribute, but note: when you have heterogeneous data, all values will be upcast to a common type (typically object):
In [31]:
    
countries.values
    
    Out[31]:
If we don't like the default integer index, we can set one of the columns as the index of the dataframe:
In [32]:
    
countries = countries.set_index('country')
countries
    
    Out[32]:
To access a Series representing a column in the data, use typical indexing syntax:
In [33]:
    
countries['area']
    
    Out[33]:
As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.
Let's compute the density of each country:
In [34]:
    
countries['population']*1000000 / countries['area']
    
    Out[34]:
Adding a new column to the dataframe is very simple:
In [35]:
    
countries['density'] = countries['population']*1000000 / countries['area']
countries
    
    Out[35]:
We can use masking to select certain data:
In [36]:
    
countries[countries['density'] > 300]
    
    Out[36]:
And we can do things like sorting the rows in the dataframe:
In [37]:
    
countries.sort_index(by='density', ascending=False)  # in pandas >= 0.17: sort_values('density', ascending=False)
    
    Out[37]:
One useful method to use is the describe method, which computes summary statistics for each column:
In [38]:
    
countries.describe()
    
    Out[38]:
The plot method can be used to quickly visualize the data in different ways:
In [39]:
    
countries.plot()
    
    Out[39]:
    
However, for this dataset, it does not say that much.
In [40]:
    
countries['population'].plot(kind='bar')
    
    Out[40]:
    
In [41]:
    
countries.plot(kind='scatter', x='population', y='area')
    
    Out[41]:
    
The available plotting types: ‘line’ (default), ‘bar’, ‘barh’, ‘hist’, ‘box’ , ‘kde’, ‘area’, ‘pie’, ‘scatter’, ‘hexbin’.
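For instance, a horizontal bar chart of the population (a small illustrative variation on the bar plot above):
In [ ]:
    
countries['population'].plot(kind='barh')
    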
Let's first drop the density column we added above:
In [42]:
    
countries = countries.drop(['density'], axis=1)
    
For a DataFrame, basic indexing selects the columns.
Selecting a single column:
In [43]:
    
countries['area']
    
    Out[43]:
or multiple columns:
In [44]:
    
countries[['area', 'capital']]
    
    Out[44]:
But, slicing accesses the rows:
In [45]:
    
countries['France':'Netherlands']
    
    Out[45]:
For more advanced indexing, you have some extra attributes:
- loc: selection by label
- iloc: selection by position
In [46]:
    
countries.loc['Germany', 'area']
    
    Out[46]:
In [47]:
    
countries.loc['France':'Germany', :]
    
    Out[47]:
In [49]:
    
countries['density'] = countries['population']*1000000 / countries['area']  # re-add the column we dropped above
countries.loc[countries['density'] > 300, ['capital', 'population']]
    
    Out[49]:
Selecting by position with iloc works similarly to indexing numpy arrays:
In [50]:
    
countries.iloc[0:2,1:3]
    
    Out[50]:
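For comparison, roughly the same selection expressed by label instead of position (assuming the column order starts with area, capital, population, as constructed above):
In [ ]:
    
countries.loc[['Belgium', 'France'], ['capital', 'population']]
    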
The different indexing methods can also be used to assign data:
In [ ]:
    
countries.loc['Belgium':'Germany', 'population'] = 10
    
In [ ]:
    
countries
    
There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.
The case study for this tutorial: air quality measurement data of European monitoring stations (AirBase):
In [12]:
    
from IPython.display import HTML
HTML('<iframe src=http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-8#tab-data-by-country width=700 height=350></iframe>')
    
    Out[12]:
pandas supports a wide range of input and output formats (CSV, Excel, SQL, JSON, HDF5, ...). To discover them in the notebook, use tab completion:
In [ ]:
    
pd.read  # press <TAB> after 'pd.read' to list all pd.read_* functions
    
In [ ]:
    
countries.to  # press <TAB> after '.to' to list all .to_* writer methods
    
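As a minimal sketch of such a round trip (the filename is just an example):
In [ ]:
    
countries.to_csv('countries.csv')             # write the dataframe to a CSV file
pd.read_csv('countries.csv', index_col=0)     # read it back, using the first column as the index
    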
I downloaded some of the raw data files of AirBase and included them in the repo:
station code: BETR801, pollutant code: 8 (nitrogen dioxide)
In [26]:
    
!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012
    
    
Just reading the tab-delimited data:
In [43]:
    
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')
    
In [44]:
    
data.head()
    
    Out[44]:
Not really what we want.
Let's try again with some more options of read_csv:
In [45]:
    
# column names: 'date', then for each of the 24 hours a value column ('00'..'23') followed by a 'flag' column
colnames = ['date'] + [item for pair in zip(["{:02d}".format(i) for i in range(24)], ['flag']*24) for item in pair]
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012",
                   sep='\t', header=None, na_values=[-999, -9999], names=colnames)
    
In [46]:
    
data.head()
    
    Out[46]:
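To see what the zip construction produced, we can inspect the first few generated column names (an illustrative check):
In [ ]:
    
colnames[:5]  # ['date', '00', 'flag', '01', 'flag']
    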
So what did we do: we told read_csv that the file has no header line (header=None), supplied our own column names (names=colnames), and marked the values -999 and -9999 as missing (na_values).
For now, we disregard the 'flag' columns:
In [47]:
    
data = data.drop('flag', axis=1)
data
    
    Out[47]:
Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index.
The docs say:
Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.
In [48]:
    
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})
df
    
    Out[48]:
To use stack/unstack, we need the values we want to shift from rows to columns or the other way around as the index:
In [49]:
    
df = df.set_index(['A', 'B'])
df
    
    Out[49]:
In [50]:
    
result = df['C'].unstack()
result
    
    Out[50]:
In [51]:
    
df = result.stack().reset_index(name='C')
df
    
    Out[51]:
pivot is similar to unstack, but lets you specify column names:
In [52]:
    
df.pivot(index='A', columns='B', values='C')
    
    Out[52]:
pivot_table is similar to pivot, but can work with duplicate indices and lets you specify an aggregation function:
In [53]:
    
df = pd.DataFrame({'A':['one', 'one', 'two', 'two', 'one', 'two'], 'B':['a', 'b', 'a', 'b', 'a', 'b'], 'C':range(6)})
df
    
    Out[53]:
In [54]:
    
df.pivot_table(index='A', columns='B', values='C', aggfunc='count') #'mean'
    
    Out[54]:
We can now use stack to create a time series:
In [55]:
    
data = data.set_index('date')
    
In [56]:
    
data_stacked = data.stack()
    
In [57]:
    
data_stacked
    
    Out[57]:
Now, let's combine the two levels of the index:
In [58]:
    
data_stacked = data_stacked.reset_index(name='BETR801')
    
In [59]:
    
# after reset_index, the former index levels are the columns 'date' and 'level_1' (the hour string)
data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'], format="%Y-%m-%d%H")
    
In [60]:
    
data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)
    
In [61]:
    
data_stacked
    
    Out[61]:
For this talk, I put the above code in a separate function, and repeated this for some different monitoring stations:
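A rough sketch of what such a function could look like, collecting the steps above (the function name and signature are hypothetical; the actual airbase module may differ):
In [ ]:
    
def read_airbase_station(filename, station):
    # hypothetical helper combining the read_csv / drop / stack steps above
    colnames = ['date'] + [item for pair in zip(
        ["{:02d}".format(i) for i in range(24)], ['flag'] * 24) for item in pair]
    data = pd.read_csv(filename, sep='\t', header=None,
                       na_values=[-999, -9999], names=colnames)
    data = data.drop('flag', axis=1).set_index('date')
    stacked = data.stack().reset_index(name=station)
    stacked.index = pd.to_datetime(stacked['date'] + stacked['level_1'],
                                   format="%Y-%m-%d%H")
    return stacked.drop(['date', 'level_1'], axis=1)
    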
In [62]:
    
import airbase
no2 = airbase.load_data()
    
Some useful methods:
head and tail
In [63]:
    
no2.head(3)
    
    Out[63]:
In [64]:
    
no2.tail()
    
    Out[64]:
info()
In [65]:
    
no2.info()
    
    
Getting some basic summary statistics about the data with describe:
In [66]:
    
no2.describe()
    
    Out[66]:
Quickly visualizing the data
In [67]:
    
no2.plot(kind='box', ylim=[0,250])
    
    Out[67]:
    
In [68]:
    
no2['BETR801'].plot(kind='hist', bins=50)
    
    Out[68]:
    
In [69]:
    
no2.plot(figsize=(12,6))
    
    Out[69]:
    
This does not say too much...
We can select part of the data (e.g. the last 500 data points):
In [70]:
    
no2[-500:].plot(figsize=(12,6))
    
    Out[70]:
    
Or we can use some more advanced time series features, which is the topic of the next section!
When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:
In [71]:
    
no2.index
    
    Out[71]:
Indexing a time series works with strings:
In [72]:
    
no2["2010-01-01 09:00": "2010-01-01 12:00"]
    
    Out[72]:
A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.
E.g. all data of 2012:
In [73]:
    
no2['2012']
    
    Out[73]:
Or all data of January up to March 2012:
In [74]:
    
no2['2012-01':'2012-03']
    
    Out[74]:
Time and date components can be accessed from the index:
In [75]:
    
no2.index.hour
    
    Out[75]:
In [76]:
    
no2.index.year
    
    Out[76]:
A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).
The time series has a frequency of 1 hour. I want to change this to daily:
In [77]:
    
no2.resample('D').head()
    
    Out[77]:
By default, resample takes the mean as aggregation function, but other methods can also be specified:
In [78]:
    
no2.resample('D', how='max').head()
    
    Out[78]:
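Note that this is the pandas 0.15/0.16 resample API. In pandas >= 0.18, resample returns a deferred object and the aggregation becomes a method call; the equivalents would be (shown for reference, the rest of this notebook keeps the old syntax):
In [ ]:
    
no2.resample('D').mean().head()
no2.resample('D').max().head()
    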
The strings to specify the new time frequency are documented at http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases
These strings can also be combined with numbers, e.g. '10D'.
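For example, 10-daily averages (sticking to the old-style how= API used in this notebook):
In [ ]:
    
no2.resample('10D', how='mean').head()
    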
Further exploring the data:
In [79]:
    
no2.resample('M').plot() # 'A'
    
    Out[79]:
    
In [ ]:
    
# no2['2012'].resample('D').plot()
    
In [80]:
    
no2.loc['2009':, 'FR04037'].resample('M', how=['mean', 'median']).plot()
    
    Out[80]:
    
In [81]:
    
no2_1999 = no2['1999':]
no2_1999.resample('A').plot()
no2_1999.mean(axis=1).resample('A').plot(color='k', linestyle='--', linewidth=4)
    
    Out[81]:
    
By "group by" we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

This is similar to the SQL GROUP BY. An example of this split-apply-combine pattern in pandas syntax:
In [82]:
    
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
    
    Out[82]:
In [83]:
    
df.groupby('key').aggregate('sum')  # np.sum
    
    Out[83]:
In [84]:
    
df.groupby('key').sum()
    
    Out[84]:
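You can also apply several aggregation functions at once; a small illustrative example:
In [ ]:
    
df.groupby('key')['data'].aggregate(['sum', 'mean'])
    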
Question: what does the typical monthly profile look like for the different stations?
First, we add a column to the dataframe that indicates the month (integer value of 1 to 12):
In [85]:
    
no2['month'] = no2.index.month
    
Now, we can calculate the mean of each month over the different years:
In [86]:
    
no2.groupby('month').mean()
    
    Out[86]:
In [87]:
    
no2.groupby('month').mean().plot()
    
    Out[87]:
    
And the typical diurnal profile, by grouping on the hour of the day:
In [88]:
    
no2.groupby(no2.index.hour).mean().plot()
    
    Out[88]:
    
Question: what is the difference in the typical diurnal profile between weekdays and the weekend?
In [89]:
    
no2.index.weekday?
    
In [90]:
    
no2['weekday'] = no2.index.weekday
    
Add a column indicating week/weekend
In [91]:
    
no2['weekend'] = no2['weekday'].isin([5, 6])
    
In [92]:
    
data_weekend = no2.groupby(['weekend', no2.index.hour]).mean()
data_weekend.head()
    
    Out[92]:
In [93]:
    
data_weekend_FR04012 = data_weekend['FR04012'].unstack(level=0)
data_weekend_FR04012.head()
    
    Out[93]:
In [94]:
    
data_weekend_FR04012.plot()
    
    Out[94]:
    
Question: how many exceedances of the hourly limit value of 200 µg/m3 were there per year? (This limit value may not be exceeded more than 18 times a year, hence the dashed line below.)
In [95]:
    
exceedances = no2 > 200
    
In [96]:
    
# group by year and count exceedances (sum of boolean)
exceedances = exceedances.groupby(exceedances.index.year).sum()
    
In [97]:
    
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')
    
    Out[97]:
    
In [98]:
    
# add a weekday and week column
no2['weekday'] = no2.index.weekday
no2['week'] = no2.index.week
no2.head()
    
    Out[98]:
In [99]:
    
# pivot table so that the weekdays are the different columns
data_pivoted = no2['2012'].pivot_table(columns='weekday', index='week', values='FR04037')
data_pivoted.head()
    
    Out[99]:
In [100]:
    
box = data_pivoted.boxplot()
    
    
    
Exercise: Calculate the correlation between the different stations
In [101]:
    
no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].corr()
    
    Out[101]:
In [102]:
    
no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].resample('D').corr()
    
    Out[102]:
In [103]:
    
no2 = no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']]
    
Some recent enhancements of the last year (versions 0.14 to 0.16):
- Categorical data type (Categorical and CategoricalIndex)
- Timedelta and TimedeltaIndex
- SQL reading and writing via sqlalchemy
- .dt accessor for accessing datetime properties from columns

Slides and data: https://github.com/jorisvandenbossche/2015-PyDataParis
Slides presented with 'live reveal': https://github.com/damianavila/RISE