Licensed under CC BY 4.0 Creative Commons
If you want to follow along, this is a notebook that you can view or run yourself.
You will need pandas > 0.15 (the easy solution is using Anaconda). Some imports:
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
pd.options.display.max_rows = 8
In [2]:
import airbase
data = airbase.load_data()
In [3]:
data
Out[3]:
We can go from this raw data to answering questions about it in a few lines of code:
Does the air pollution show a decreasing trend over the years?
In [4]:
data['1999':].resample('A').plot(ylim=[0,100])
Out[4]:
How many exceedances of the limit values?
In [5]:
exceedances = data > 200
exceedances = exceedances.groupby(exceedances.index.year).sum()
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')
Out[5]:
What is the difference in diurnal profile between weekdays and weekend?
In [6]:
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['FR04012'].mean().unstack(level=0)
data_weekend.plot()
Out[6]:
We will come back to these examples and build them up step by step.
For data-intensive work in Python, the pandas library has become essential.
What is pandas?
Pandas can be thought of as R's data.frame in Python. Its documentation: http://pandas.pydata.org/pandas-docs/stable/
Key features:
- easy handling of missing data (.dropna(), pd.isnull()) (see the sketch after this list)
- combining datasets (concat, join)
- powerful groupby functionality
- reshaping data (stack, pivot)
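A minimal sketch of the missing-data handling and of concat, assuming nothing beyond pandas and numpy (the variable names are made up for illustration):
import pandas as pd
import numpy as np

s_demo = pd.Series([1.0, np.nan, 3.0])
pd.isnull(s_demo)    # boolean mask marking the missing value
s_demo.dropna()      # drop the missing entry

s_extra = pd.Series([4.0, 5.0])
pd.concat([s_demo.dropna(), s_extra], ignore_index=True)   # glue two series together into one
Now let's build these structures up from scratch, starting with the one-dimensional Series: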
In [7]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s
Out[7]:
In [8]:
s.index
Out[8]:
You can access the underlying numpy array representation with the .values attribute:
In [9]:
s.values
Out[9]:
We can access series values via the index, just like for NumPy arrays:
In [10]:
s[0]
Out[10]:
Unlike the NumPy array, though, this index can be something other than integers:
In [11]:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2
Out[11]:
In [12]:
s2['c']
Out[12]:
In this way, a Series object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value:
In [13]:
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 'United Kingdom': 64.9, 'Netherlands': 16.9})
population
Out[13]:
In [14]:
population['France']
Out[14]:
but with the power of numpy arrays:
In [15]:
population * 1000
Out[15]:
We can index or slice the populations as expected (note that, unlike positional slicing, label-based slices include both endpoints):
In [16]:
population['Belgium']
Out[16]:
In [17]:
population['Belgium':'Germany']
Out[17]:
Many things you can do with numpy arrays can also be applied to Series objects.
Fancy indexing, like indexing with a list or boolean indexing:
In [18]:
population[['France', 'Netherlands']]
Out[18]:
In [19]:
population[population > 20]
Out[19]:
Element-wise operations:
In [20]:
population / 100
Out[20]:
A range of methods:
In [21]:
population.mean()
Out[21]:
In [22]:
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]
In [23]:
s1
Out[23]:
In [24]:
s2
Out[24]:
In [25]:
s1 + s2
Out[25]:
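Note what happened: the operation aligned both series on their index. 'France' occurs in both and gets summed, while 'Belgium' and 'Germany' each occur in only one of the two series and therefore end up as NaN. If you would rather treat an absent label as zero, the add method accepts a fill_value (a sketch using the s1 and s2 defined above):
s1.add(s2, fill_value=0)   # absent labels count as 0 instead of producing NaN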
A DataFrame is a tabular, two-dimensional data structure: a collection of columns, each of which can have a different type. One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.
Note that in the IPython notebook, the dataframe will display in a rich HTML view:
In [26]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
'population': [11.3, 64.3, 81.3, 16.9, 64.9],
'area': [30510, 671308, 357050, 41526, 244820],
'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
Out[26]:
In [27]:
countries.index
Out[27]:
In [28]:
countries.columns
Out[28]:
To check the data types of the different columns:
In [29]:
countries.dtypes
Out[29]:
An overview of that information can be given with the info() method:
In [30]:
countries.info()
A DataFrame also has a values attribute, but beware: when you have heterogeneous data, all values will be upcast to a common type (typically object):
In [31]:
countries.values
Out[31]:
If we don't like what the index looks like, we can reset it and set one of our columns:
In [32]:
countries = countries.set_index('country')
countries
Out[32]:
To access a Series representing a column in the data, use typical indexing syntax:
In [33]:
countries['area']
Out[33]:
As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.
Let's compute the density of each country:
In [34]:
countries['population']*1000000 / countries['area']
Out[34]:
Adding a new column to the dataframe is very simple:
In [35]:
countries['density'] = countries['population']*1000000 / countries['area']
countries
Out[35]:
We can use masking to select certain data:
In [36]:
countries[countries['density'] > 300]
Out[36]:
And we can do things like sorting the rows on the values of a column:
In [37]:
countries.sort_index(by='density', ascending=False)
Out[37]:
One useful method is the describe method, which computes summary statistics for each column:
In [38]:
countries.describe()
Out[38]:
The plot method can be used to quickly visualize the data in different ways:
In [39]:
countries.plot()
Out[39]:
However, for this dataset, it does not say that much.
In [40]:
countries['population'].plot(kind='bar')
Out[40]:
In [41]:
countries.plot(kind='scatter', x='population', y='area')
Out[41]:
The available plotting types: ‘line’ (default), ‘bar’, ‘barh’, ‘hist’, ‘box’ , ‘kde’, ‘area’, ‘pie’, ‘scatter’, ‘hexbin’.
In [42]:
countries_no_density = countries.drop(['density'], axis=1)  # drop returns a new dataframe; we keep 'density' in countries, since the examples below still use it
For a DataFrame, basic indexing selects the columns.
Selecting a single column:
In [43]:
countries['area']
Out[43]:
or multiple columns:
In [44]:
countries[['area', 'density']]
Out[44]:
But slicing accesses the rows:
In [45]:
countries['France':'Netherlands']
Out[45]:
For more advanced indexing, you have some extra attributes:
- loc: selection by label
- iloc: selection by position
In [46]:
countries.loc['Germany', 'area']
Out[46]:
In [47]:
countries.loc['France':'Germany', :]
Out[47]:
In [49]:
countries.loc[countries['density']>300, ['capital', 'population']]
Out[49]:
Selecting by position with iloc works similarly to indexing numpy arrays:
In [50]:
countries.iloc[0:2,1:3]
Out[50]:
The different indexing methods can also be used to assign data:
In [ ]:
countries.loc['Belgium':'Germany', 'population'] = 10
In [ ]:
countries
There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.
In [12]:
from IPython.display import HTML
HTML('<iframe src=http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-8#tab-data-by-country width=700 height=350></iframe>')
Out[12]:
pandas natively supports a wide range of input and output formats:
In [ ]:
pd.read    # press <TAB> after 'pd.read' to see all available reader functions (read_csv, read_excel, read_sql, ...)
In [ ]:
countries.to    # press <TAB> after '.to' to see all available exporters (to_csv, to_excel, to_dict, ...)
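For example, a round trip through CSV could look like this (a sketch; the file name is hypothetical):
countries.to_csv('countries.csv')                   # write the dataframe to a CSV file
pd.read_csv('countries.csv', index_col='country')   # read it back, restoring 'country' as the index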
I downloaded some of the raw data files of AirBase and included them in the repo:
station code: BETR801, pollutant code: 8 (nitrogen dioxide)
In [26]:
!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012
Just reading the tab-delimited data:
In [43]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')
In [44]:
data.head()
Out[44]:
Not really what we want. Let's use some more options of read_csv:
In [45]:
colnames = ['date'] + [item for pair in zip(["{:02d}".format(i) for i in range(24)], ['flag']*24) for item in pair]
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012",
sep='\t', header=None, na_values=[-999, -9999], names=colnames)
In [46]:
data.head()
Out[46]:
So what did we do: we told read_csv that the file is tab-delimited and has no header row, that the values -999 and -9999 denote missing data, and we supplied our own column names (the date, then the 24 hours of the day, each followed by a 'flag' column).
For now, we disregard the 'flag' columns:
In [47]:
data = data.drop('flag', axis=1)
data
Out[47]:
Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index. This is a job for stack (and its inverse, unstack). The docs say:
Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.
In [48]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})
df
Out[48]:
To use stack/unstack, we need the values we want to shift from rows to columns (or the other way around) as the index:
In [49]:
df = df.set_index(['A', 'B'])
df
Out[49]:
In [50]:
result = df['C'].unstack()
result
Out[50]:
In [51]:
df = result.stack().reset_index(name='C')
df
Out[51]:
pivot is similar to unstack, but lets you specify column names:
In [52]:
df.pivot(index='A', columns='B', values='C')
Out[52]:
pivot_table is similar to pivot, but can work with duplicate indices and lets you specify an aggregation function:
In [53]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two', 'one', 'two'], 'B':['a', 'b', 'a', 'b', 'a', 'b'], 'C':range(6)})
df
Out[53]:
In [54]:
df.pivot_table(index='A', columns='B', values='C', aggfunc='count') #'mean'
Out[54]:
We can now use stack to create a timeseries:
In [55]:
data = data.set_index('date')
In [56]:
data_stacked = data.stack()
In [57]:
data_stacked
Out[57]:
Now, let's combine the two levels of the index:
In [58]:
data_stacked = data_stacked.reset_index(name='BETR801')
In [59]:
data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'], format="%Y-%m-%d%H")
In [60]:
data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)
In [61]:
data_stacked
Out[61]:
For this talk, I put the above code in a separate function, and repeated this for some different monitoring stations:
In [62]:
import airbase
no2 = airbase.load_data()
Some useful methods: head and tail
In [63]:
no2.head(3)
Out[63]:
In [64]:
no2.tail()
Out[64]:
info()
In [65]:
no2.info()
Getting some basic summary statistics about the data with describe:
In [66]:
no2.describe()
Out[66]:
Quickly visualizing the data
In [67]:
no2.plot(kind='box', ylim=[0,250])
Out[67]:
In [68]:
no2['BETR801'].plot(kind='hist', bins=50)
Out[68]:
In [69]:
no2.plot(figsize=(12,6))
Out[69]:
This does not say too much.
We can select part of the data (e.g. the latest 500 data points):
In [70]:
no2[-500:].plot(figsize=(12,6))
Out[70]:
Or we can use some more advanced time series features -> next section!
When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:
In [71]:
no2.index
Out[71]:
Indexing a time series works with strings:
In [72]:
no2["2010-01-01 09:00": "2010-01-01 12:00"]
Out[72]:
A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.
E.g. all data of 2012:
In [73]:
no2['2012']
Out[73]:
Or all data of January up to and including March 2012:
In [74]:
no2['2012-01':'2012-03']
Out[74]:
Time and date components can be accessed from the index:
In [75]:
no2.index.hour
Out[75]:
In [76]:
no2.index.year
Out[76]:
A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).
The time series has a frequency of 1 hour. I want to change this to daily:
In [77]:
no2.resample('D').head()
Out[77]:
By default, resample takes the mean as aggregation function, but other methods can also be specified:
In [78]:
no2.resample('D', how='max').head()
Out[78]:
The strings to specify the new time frequency are documented at http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases
These strings can also be combined with numbers, e.g. '10D'.
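For example, 10-daily averages (a sketch, using the same pre-0.18 resample call style as the rest of this notebook):
no2.resample('10D', how='mean').head()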
Further exploring the data:
In [79]:
no2.resample('M').plot() # 'A'
Out[79]:
In [ ]:
# no2['2012'].resample('D').plot()
In [80]:
no2.loc['2009':, 'FR04037'].resample('M', how=['mean', 'median']).plot()
Out[80]:
In [81]:
no2_1999 = no2['1999':]
no2_1999.resample('A').plot()
no2_1999.mean(axis=1).resample('A').plot(color='k', linestyle='--', linewidth=4)
Out[81]:
By "group by" we are referring to a process involving one or more of the following steps
Similar to SQL GROUP BY
The example of the image in pandas syntax:
In [82]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
Out[82]:
In [83]:
df.groupby('key').aggregate('sum') # np.sum
Out[83]:
In [84]:
df.groupby('key').sum()
Out[84]:
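To make the split-apply-combine idea concrete, this is roughly what groupby does conceptually (a sketch, not how pandas implements it internally):
results = {}
for key, group in df.groupby('key'):     # split into groups
    results[key] = group['data'].sum()   # apply the aggregation to each group
pd.Series(results)                       # combine the results into one structure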
Question: what does the typical monthly profile look like for the different stations?
First, we add a column to the dataframe that indicates the month (an integer value from 1 to 12):
In [85]:
no2['month'] = no2.index.month
Now, we can calculate the mean of each month over the different years:
In [86]:
no2.groupby('month').mean()
Out[86]:
In [87]:
no2.groupby('month').mean().plot()
Out[87]:
Similarly, the typical diurnal profile (grouping by the hour of the day):
In [88]:
no2.groupby(no2.index.hour).mean().plot()
Out[88]:
In [89]:
no2.index.weekday?
In [90]:
no2['weekday'] = no2.index.weekday
Add a column indicating week/weekend (weekday runs from 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend days):
In [91]:
no2['weekend'] = no2['weekday'].isin([5, 6])
In [92]:
data_weekend = no2.groupby(['weekend', no2.index.hour]).mean()
data_weekend.head()
Out[92]:
In [93]:
data_weekend_FR04012 = data_weekend['FR04012'].unstack(level=0)
data_weekend_FR04012.head()
Out[93]:
In [94]:
data_weekend_FR04012.plot()
Out[94]:
Question: how many exceedances of the limit value of 200 µg/m3 per year?
In [95]:
exceedances = no2 > 200
In [96]:
# group by year and count exceedances (sum of boolean)
exceedances = exceedances.groupby(exceedances.index.year).sum()
In [97]:
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')
Out[97]:
In [98]:
# add a weekday and week column
no2['weekday'] = no2.index.weekday
no2['week'] = no2.index.week
no2.head()
Out[98]:
In [99]:
# pivot table so that the weekdays are the different columns
data_pivoted = no2['2012'].pivot_table(columns='weekday', index='week', values='FR04037')
data_pivoted.head()
Out[99]:
In [100]:
box = data_pivoted.boxplot()
Exercise: Calculate the correlation between the different stations
In [101]:
no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].corr()
Out[101]:
In [102]:
no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].resample('D').corr()
Out[102]:
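To see whether the relation between two stations changes over the years, one could compute the correlation per year (a sketch, reusing the station columns of this dataset and the old-style resample call):
daily = no2[['BETR801', 'BETN029']].resample('D', how='mean')
daily.groupby(daily.index.year).apply(lambda d: d['BETR801'].corr(d['BETN029']))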
In [103]:
no2 = no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']]
Some recent enhancements of the last year (versions 0.14 to 0.16):
- categorical data support (Categorical and CategoricalIndex)
- Timedelta and TimedeltaIndex
- improved SQL support based on sqlalchemy
- the .dt accessor for accessing datetime properties from columns
Slides and data: https://github.com/jorisvandenbossche/2015-PyDataParis
Slides presented with 'live reveal': https://github.com/damianavila/RISE