Introduction to Pandas

April 3rd, 2015

Joris Van den Bossche

Source: https://github.com/jorisvandenbossche/2015-PyDataParis

About me: Joris Van den Bossche

  • PhD student at Ghent University and VITO, Belgium
  • bio-science engineer, air quality research
  • pandas core dev


Licensed under CC BY 4.0 Creative Commons

Content of this talk

  • Why do you need pandas?
  • Basic introduction to the data structures
  • Guided tour through some of the pandas features with a case study about air quality

If you want to follow along, this notebook can be viewed and run from the repository linked above.

Some imports:


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn

pd.options.display.max_rows = 8

Let's start with a showcase

Case study: air quality in Europe

AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe

Starting from these hourly data for different stations:


In [2]:
import airbase
data = airbase.load_data()

In [3]:
data


Out[3]:
BETR801 BETN029 FR04037 FR04012
1990-01-01 00:00:00 NaN 16.0 NaN NaN
1990-01-01 01:00:00 NaN 18.0 NaN NaN
1990-01-01 02:00:00 NaN 21.0 NaN NaN
1990-01-01 03:00:00 NaN 26.0 NaN NaN
... ... ... ... ...
2012-12-31 20:00:00 16.5 2.0 16 47
2012-12-31 21:00:00 14.5 2.5 13 43
2012-12-31 22:00:00 16.5 3.5 14 42
2012-12-31 23:00:00 15.0 3.0 13 49

198895 rows × 4 columns

to answering questions about this data in a few lines of code:

Does the air pollution show a decreasing trend over the years?


In [4]:
data['1999':].resample('A').plot(ylim=[0,100])


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0xab4c292c>

How many exceedances of the limit values?


In [5]:
exceedances = data > 200
exceedances = exceedances.groupby(exceedances.index.year).sum()
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')


Out[5]:
<matplotlib.lines.Line2D at 0xab02004c>

What is the difference in diurnal profile between weekdays and weekend?


In [6]:
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['FR04012'].mean().unstack(level=0)
data_weekend.plot()


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0xab3cb5cc>

We will come back to these examples, and build them up step by step.

Why do you need pandas?

When working with tabular or structured data (like an R dataframe, an SQL table, an Excel spreadsheet, ...):

  • Import data
  • Clean up messy data
  • Explore data, gain insight into data
  • Process and prepare your data for analysis
  • Analyse your data (together with scikit-learn, statsmodels, ...)

Pandas: data analysis in Python

For data-intensive work in Python, the pandas library has become essential.

What is pandas?

  • Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
  • Pandas can also be thought of as R's data.frame in Python.
  • Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...

Its documentation: http://pandas.pydata.org/pandas-docs/stable/

Key features

  • Fast, easy and flexible input/output for a lot of different data formats
  • Working with missing data (.dropna(), pd.isnull())
  • Merging and joining (concat, join)
  • Grouping: groupby functionality
  • Reshaping (stack, pivot)
  • Powerful time series manipulation (resampling, timezones, ..)
  • Easy plotting
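
As a tiny illustration of a few of these features (toy data, just a sketch):


In [ ]:
df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1.0, np.nan, 3.0]})
df['value'].isnull()          # detect missing values
df = df.dropna()              # drop the rows that contain missing values
df.groupby('group').sum()     # group and aggregate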

Basic data structures

Pandas does this through two fundamental object types, both built upon NumPy arrays: the Series object, and the DataFrame object.

Series

A Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:


In [7]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s


Out[7]:
0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

Attributes of a Series: index and values

The Series has a built-in concept of an index, which by default is the numbers 0 through N - 1:


In [8]:
s.index


Out[8]:
Int64Index([0, 1, 2, 3], dtype='int64')

You can access the underlying numpy array representation with the .values attribute:


In [9]:
s.values


Out[9]:
array([ 0.1,  0.2,  0.3,  0.4])

We can access series values via the index, just like for NumPy arrays:


In [10]:
s[0]


Out[10]:
0.10000000000000001

Unlike the NumPy array, though, this index can be something other than integers:


In [11]:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2


Out[11]:
a    0
b    1
c    2
d    3
dtype: int32

In [12]:
s2['c']


Out[12]:
2

In this way, a Series object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value:


In [13]:
population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 'United Kingdom': 64.9, 'Netherlands': 16.9})
population


Out[13]:
Belgium           11.3
France            64.3
Germany           81.3
Netherlands       16.9
United Kingdom    64.9
dtype: float64

In [14]:
population['France']


Out[14]:
64.299999999999997

but with the power of numpy arrays:


In [15]:
population * 1000


Out[15]:
Belgium           11300
France            64300
Germany           81300
Netherlands       16900
United Kingdom    64900
dtype: float64

We can index or slice the populations as expected:


In [16]:
population['Belgium']


Out[16]:
11.300000000000001

In [17]:
population['Belgium':'Germany']


Out[17]:
Belgium    11.3
France     64.3
Germany    81.3
dtype: float64

Many things you can do with NumPy arrays can also be applied to Series objects.

Fancy indexing, like indexing with a list or boolean indexing:


In [18]:
population[['France', 'Netherlands']]


Out[18]:
France         64.3
Netherlands    16.9
dtype: float64

In [19]:
population[population > 20]


Out[19]:
France            64.3
Germany           81.3
United Kingdom    64.9
dtype: float64

Element-wise operations:


In [20]:
population / 100


Out[20]:
Belgium           0.113
France            0.643
Germany           0.813
Netherlands       0.169
United Kingdom    0.649
dtype: float64

A range of methods:


In [21]:
population.mean()


Out[21]:
47.739999999999995

Alignment!

But pay attention to alignment: operations between Series will align on the index:


In [22]:
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

In [23]:
s1


Out[23]:
Belgium    11.3
France     64.3
dtype: float64

In [24]:
s2


Out[24]:
France     64.3
Germany    81.3
dtype: float64

In [25]:
s1 + s2


Out[25]:
Belgium      NaN
France     128.6
Germany      NaN
dtype: float64
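
If NaN is not what you want for the non-overlapping labels, the add method accepts a fill value (a small sketch; missing entries are treated as 0):


In [ ]:
s1.add(s2, fill_value=0)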

DataFrames: Multi-dimensional Data

A DataFrame is a tabular data structure (a multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects which share the same index.

One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.

Note that in the IPython notebook, the dataframe will display in a rich HTML view:


In [26]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries


Out[26]:
area capital country population
0 30510 Brussels Belgium 11.3
1 671308 Paris France 64.3
2 357050 Berlin Germany 81.3
3 41526 Amsterdam Netherlands 16.9
4 244820 London United Kingdom 64.9

Attributes of the DataFrame

Besides an index attribute, a DataFrame also has a columns attribute:


In [27]:
countries.index


Out[27]:
Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [28]:
countries.columns


Out[28]:
Index([u'area', u'capital', u'country', u'population'], dtype='object')

To check the data types of the different columns:


In [29]:
countries.dtypes


Out[29]:
area            int64
capital        object
country        object
population    float64
dtype: object

An overview of that information can be given with the info() method:


In [30]:
countries.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
area          5 non-null int64
capital       5 non-null object
country       5 non-null object
population    5 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 160.0+ bytes

A DataFrame also has a values attribute, but beware: with heterogeneous data, all values will be upcast to a common dtype (here, object):


In [31]:
countries.values


Out[31]:
array([[30510L, 'Brussels', 'Belgium', 11.3],
       [671308L, 'Paris', 'France', 64.3],
       [357050L, 'Berlin', 'Germany', 81.3],
       [41526L, 'Amsterdam', 'Netherlands', 16.9],
       [244820L, 'London', 'United Kingdom', 64.9]], dtype=object)

If we don't like the default integer index, we can set one of the columns as the index instead:


In [32]:
countries = countries.set_index('country')
countries


Out[32]:
area capital population
country
Belgium 30510 Brussels 11.3
France 671308 Paris 64.3
Germany 357050 Berlin 81.3
Netherlands 41526 Amsterdam 16.9
United Kingdom 244820 London 64.9

To access a Series representing a column in the data, use typical indexing syntax:


In [33]:
countries['area']


Out[33]:
country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64

As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.

Let's compute the density of each country:


In [34]:
countries['population']*1000000 / countries['area']


Out[34]:
country
Belgium           370.370370
France             95.783158
Germany           227.699202
Netherlands       406.973944
United Kingdom    265.092721
dtype: float64

Adding a new column to the dataframe is very simple:


In [35]:
countries['density'] = countries['population']*1000000 / countries['area']
countries


Out[35]:
area capital population density
country
Belgium 30510 Brussels 11.3 370.370370
France 671308 Paris 64.3 95.783158
Germany 357050 Berlin 81.3 227.699202
Netherlands 41526 Amsterdam 16.9 406.973944
United Kingdom 244820 London 64.9 265.092721

We can use masking to select certain data:


In [36]:
countries[countries['density'] > 300]


Out[36]:
area capital population density
country
Belgium 30510 Brussels 11.3 370.370370
Netherlands 41526 Amsterdam 16.9 406.973944

And we can do things like sorting the rows by one of the columns:


In [37]:
countries.sort_index(by='density', ascending=False)


Out[37]:
area capital population density
country
Netherlands 41526 Amsterdam 16.9 406.973944
Belgium 30510 Brussels 11.3 370.370370
United Kingdom 244820 London 64.9 265.092721
Germany 357050 Berlin 81.3 227.699202
France 671308 Paris 64.3 95.783158

A useful method is describe, which computes summary statistics for each column:


In [38]:
countries.describe()


Out[38]:
area population density
count 5.000000 5.000000 5.000000
mean 269042.800000 47.740000 273.183879
std 264012.827994 31.519645 123.440607
min 30510.000000 11.300000 95.783158
25% 41526.000000 16.900000 227.699202
50% 244820.000000 64.300000 265.092721
75% 357050.000000 64.900000 370.370370
max 671308.000000 81.300000 406.973944

The plot method can be used to quickly visualize the data in different ways:


In [39]:
countries.plot()


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0xab20740c>

However, for this dataset, a line plot does not say that much.


In [40]:
countries['population'].plot(kind='bar')


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0xab258d8c>

In [41]:
countries.plot(kind='scatter', x='population', y='area')


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0xab12eaec>

The available plotting types: ‘line’ (default), ‘bar’, ‘barh’, ‘hist’, ‘box’ , ‘kde’, ‘area’, ‘pie’, ‘scatter’, ‘hexbin’.
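
For instance, one of the other kinds applied to the same data (just an illustrative extra):


In [ ]:
countries['area'].plot(kind='barh')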



Some notes on selecting data

One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex compared to numpy. We now have to distinguish between:

  • selection by label
  • selection by position.

For a DataFrame, basic indexing selects the columns.

Selecting a single column:


In [43]:
countries['area']


Out[43]:
country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64

or multiple columns:


In [44]:
countries[['area', 'density']]


Out[44]:
area density
country
Belgium 30510 370.370370
France 671308 95.783158
Germany 357050 227.699202
Netherlands 41526 406.973944
United Kingdom 244820 265.092721

But, slicing accesses the rows:


In [45]:
countries['France':'Netherlands']


Out[45]:
area capital population density
country
France 671308 Paris 64.3 95.783158
Germany 357050 Berlin 81.3 227.699202
Netherlands 41526 Amsterdam 16.9 406.973944

For more advanced indexing, you have some extra attributes:

  • loc: selection by label
  • iloc: selection by position

In [46]:
countries.loc['Germany', 'area']


Out[46]:
357050

In [47]:
countries.loc['France':'Germany', :]


Out[47]:
area capital population density
country
France 671308 Paris 64.3 95.783158
Germany 357050 Berlin 81.3 227.699202

In [49]:
countries.loc[countries['density']>300, ['capital', 'population']]


Out[49]:
capital population
country
Belgium Brussels 11.3
Netherlands Amsterdam 16.9

Selecting by position with iloc works similarly to indexing NumPy arrays:


In [50]:
countries.iloc[0:2,1:3]


Out[50]:
capital population
country
Belgium Brussels 11.3
France Paris 64.3

The different indexing methods can also be used to assign data:


In [ ]:
countries.loc['Belgium':'Germany', 'population'] = 10

In [ ]:
countries

Finally, we drop the derived density column again:


In [ ]:
countries = countries.drop(['density'], axis=1)

There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.

Case study: air quality data of European monitoring stations (AirBase)

AirBase (The European Air quality dataBase)

AirBase: hourly measurements of all air quality monitoring stations from Europe.


In [12]:
from IPython.display import HTML
HTML('<iframe src=http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-8#tab-data-by-country width=700 height=350></iframe>')



Importing and cleaning the data

Importing and exporting data with pandas

A wide range of input/output formats are natively supported by pandas:

  • CSV, text
  • SQL database
  • Excel
  • HDF5
  • json
  • html
  • pickle
  • ...

In [ ]:
pd.read  # type pd.read and press <TAB> to see the available pd.read_* functions

In [ ]:
countries.to  # type countries.to and press <TAB> to see the available .to_* methods
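
As a minimal sketch of such a round trip (the file name here is just illustrative):


In [ ]:
countries.to_csv("countries.csv")
pd.read_csv("countries.csv", index_col='country')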

Now for our case study

I downloaded some of the raw data files of AirBase and included them in the repo:

station code: BETR801, pollutant code: 8 (nitrogen dioxide)


In [26]:
!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012



Just reading the tab-delimited data:


In [43]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')

In [44]:
data.head()


Out[44]:
1990-01-01 -999.000 0 -999.000.1 0.1 -999.000.2 0.2 -999.000.3 0.3 -999.000.4 ... -999.000.19 0.19 -999.000.20 0.20 -999.000.21 0.21 -999.000.22 0.22 -999.000.23 0.23
0 1990-01-02 -999 0 -999 0 -999 0 -999 0 -999 ... 57 1 58 1 54 1 49 1 48 1
1 1990-01-03 51 1 50 1 47 1 48 1 51 ... 84 1 75 1 -999 0 -999 0 -999 0
2 1990-01-04 -999 0 -999 0 -999 0 -999 0 -999 ... 69 1 65 1 64 1 60 1 59 1
3 1990-01-05 51 1 51 1 48 1 50 1 51 ... -999 0 -999 0 -999 0 -999 0 -999 0
4 1990-01-06 -999 0 -999 0 -999 0 -999 0 -999 ... -999 0 -999 0 -999 0 -999 0 -999 0

5 rows × 49 columns

Not really what we want.

Using some more options of read_csv:


In [45]:
colnames = ['date'] + [item for pair in zip(["{:02d}".format(i) for i in range(24)], ['flag']*24) for item in pair]

data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012",
                   sep='\t', header=None, na_values=[-999, -9999], names=colnames)

In [46]:
data.head()


Out[46]:
date 00 flag 01 flag 02 flag 03 flag 04 ... 19 flag 20 flag 21 flag 22 flag 23 flag
0 1990-01-01 NaN 0 NaN 0 NaN 0 NaN 0 NaN ... NaN 0 NaN 0 NaN 0 NaN 0 NaN 0
1 1990-01-02 NaN 1 NaN 1 NaN 1 NaN 1 NaN ... 57 1 58 1 54 1 49 1 48 1
2 1990-01-03 51 0 50 0 47 0 48 0 51 ... 84 0 75 0 NaN 0 NaN 0 NaN 0
3 1990-01-04 NaN 1 NaN 1 NaN 1 NaN 1 NaN ... 69 1 65 1 64 1 60 1 59 1
4 1990-01-05 51 0 51 0 48 0 50 0 51 ... NaN 0 NaN 0 NaN 0 NaN 0 NaN 0

5 rows × 49 columns

So what did we do:

  • specified that the values -999 and -9999 should be regarded as NaN
  • supplied our own column names (and told read_csv with header=None that the file contains no header row)

For now, we disregard the 'flag' columns:


In [47]:
data = data.drop('flag', axis=1)
data


Out[47]:
date 00 01 02 03 04 05 06 07 08 ... 14 15 16 17 18 19 20 21 22 23
0 1990-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1990-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 55.0 59.0 58 59.0 58.0 57.0 58.0 54.0 49.0 48.0
2 1990-01-03 51.0 50.0 47.0 48.0 51.0 52.0 58.0 57.0 NaN ... 69.0 74.0 NaN NaN 103.0 84.0 75.0 NaN NaN NaN
3 1990-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 71.0 74 70.0 70.0 69.0 65.0 64.0 60.0 59.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8388 2012-12-28 26.5 28.5 35.5 32.0 35.5 50.5 62.5 74.5 76.0 ... 56.5 52.0 55 53.5 49.0 46.5 42.5 38.5 30.5 26.5
8389 2012-12-29 21.5 16.5 13.0 13.0 16.0 23.5 23.5 27.5 46.0 ... 48.0 41.5 36 33.0 25.5 21.0 22.0 20.5 20.0 15.0
8390 2012-12-30 11.5 9.5 7.5 7.5 10.0 11.0 13.5 13.5 17.5 ... NaN 25.0 25 25.5 24.5 25.0 18.5 17.0 15.5 12.5
8391 2012-12-31 9.5 8.5 8.5 8.5 10.5 15.5 18.0 23.0 25.0 ... NaN NaN 28 27.5 26.0 21.0 16.5 14.5 16.5 15.0

8392 rows × 25 columns

Now we want to reshape the data: our goal is to have the different hours as row indices, merged with the date into a datetime index.

Intermezzo: reshaping your data with stack, unstack and pivot

The docs for stack say:

Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.


In [48]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two'], 'B':['a', 'b', 'a', 'b'], 'C':range(4)})
df


Out[48]:
A B C
0 one a 0
1 one b 1
2 two a 2
3 two b 3

To use stack/unstack, we need the values we want to shift from rows to columns or the other way around as the index:


In [49]:
df = df.set_index(['A', 'B'])
df


Out[49]:
C
A B
one a 0
b 1
two a 2
b 3

In [50]:
result = df['C'].unstack()
result


Out[50]:
B a b
A
one 0 1
two 2 3

In [51]:
df = result.stack().reset_index(name='C')
df


Out[51]:
A B C
0 one a 0
1 one b 1
2 two a 2
3 two b 3

pivot is similar to unstack, but lets you specify column names:


In [52]:
df.pivot(index='A', columns='B', values='C')


Out[52]:
B a b
A
one 0 1
two 2 3

pivot_table is similar to pivot, but can work with duplicate indices and lets you specify an aggregation function:


In [53]:
df = pd.DataFrame({'A':['one', 'one', 'two', 'two', 'one', 'two'], 'B':['a', 'b', 'a', 'b', 'a', 'b'], 'C':range(6)})
df


Out[53]:
A B C
0 one a 0
1 one b 1
2 two a 2
3 two b 3
4 one a 4
5 two b 5

In [54]:
df.pivot_table(index='A', columns='B', values='C', aggfunc='count') #'mean'


Out[54]:
B a b
A
one 2 1
two 1 2

Back to our case study

We can now use stack to create a time series:


In [55]:
data = data.set_index('date')

In [56]:
data_stacked = data.stack()

In [57]:
data_stacked


Out[57]:
date          
1990-01-02  09    48.0
            12    48.0
            13    50.0
            14    55.0
                  ... 
2012-12-31  20    16.5
            21    14.5
            22    16.5
            23    15.0
dtype: float64

Now, let's combine the two levels of the index:


In [58]:
data_stacked = data_stacked.reset_index(name='BETR801')

In [59]:
data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'], format="%Y-%m-%d%H")

In [60]:
data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)

In [61]:
data_stacked


Out[61]:
BETR801
1990-01-02 09:00:00 48.0
1990-01-02 12:00:00 48.0
1990-01-02 13:00:00 50.0
1990-01-02 14:00:00 55.0
... ...
2012-12-31 20:00:00 16.5
2012-12-31 21:00:00 14.5
2012-12-31 22:00:00 16.5
2012-12-31 23:00:00 15.0

170794 rows × 1 columns

For this talk, I put the above code in a separate function, and repeated this for some different monitoring stations:


In [62]:
import airbase
no2 = airbase.load_data()

This loads the hourly NO2 data for four monitoring stations:

  • FR04037 (PARIS 13eme): urban background site at Square de Choisy
  • FR04012 (Paris, Place Victor Basch): urban traffic site at Rue d'Alesia
  • BETR801: urban traffic site in Antwerp, Belgium
  • BETN029: rural background site in Houtem, Belgium

See http://www.eea.europa.eu/themes/air/interactive/no2
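
A sketch of what such a function could look like for a single station, wrapping the cleaning steps shown above (the function name is illustrative; the actual airbase module may differ):


In [ ]:
def load_station(filename, station):
    """Read one raw AirBase file into an hourly Series (a sketch)."""
    hours = ["{:02d}".format(i) for i in range(24)]
    colnames = ['date'] + [item for pair in zip(hours, ['flag'] * 24)
                           for item in pair]
    data = pd.read_csv(filename, sep='\t', header=None,
                       na_values=[-999, -9999], names=colnames)
    # drop the flag columns and reshape to one value per hourly timestamp
    data = data.drop('flag', axis=1).set_index('date')
    stacked = data.stack().reset_index(name=station)
    stacked.index = pd.to_datetime(stacked['date'] + stacked['level_1'],
                                   format="%Y-%m-%d%H")
    return stacked.drop(['date', 'level_1'], axis=1)[station]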

Exploring the data

Some useful methods:

head and tail


In [63]:
no2.head(3)


Out[63]:
BETR801 BETN029 FR04037 FR04012
1990-01-01 00:00:00 NaN 16 NaN NaN
1990-01-01 01:00:00 NaN 18 NaN NaN
1990-01-01 02:00:00 NaN 21 NaN NaN

In [64]:
no2.tail()


Out[64]:
BETR801 BETN029 FR04037 FR04012
2012-12-31 19:00:00 21.0 2.5 28 67
2012-12-31 20:00:00 16.5 2.0 16 47
2012-12-31 21:00:00 14.5 2.5 13 43
2012-12-31 22:00:00 16.5 3.5 14 42
2012-12-31 23:00:00 15.0 3.0 13 49

info()


In [65]:
no2.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 198895 entries, 1990-01-01 00:00:00 to 2012-12-31 23:00:00
Data columns (total 4 columns):
BETR801    170794 non-null float64
BETN029    174807 non-null float64
FR04037    120384 non-null float64
FR04012    119448 non-null float64
dtypes: float64(4)
memory usage: 7.6 MB

Getting some basic summary statistics about the data with describe:


In [66]:
no2.describe()


Out[66]:
BETR801 BETN029 FR04037 FR04012
count 170794.000000 174807.000000 120384.000000 119448.000000
mean 47.914561 16.687756 40.040005 87.993261
std 22.230921 13.106549 23.024347 41.317684
min 0.000000 0.000000 0.000000 0.000000
25% 32.000000 7.000000 23.000000 61.000000
50% 46.000000 12.000000 37.000000 88.000000
75% 61.000000 23.000000 54.000000 115.000000
max 339.000000 115.000000 256.000000 358.000000

Quickly visualizing the data


In [67]:
no2.plot(kind='box', ylim=[0,250])


Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0xaa1d544c>

In [68]:
no2['BETR801'].plot(kind='hist', bins=50)


Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0xaabe97ec>

In [69]:
no2.plot(figsize=(12,6))


Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0xa7b3e74c>

This does not say too much...

We can select part of the data (e.g. the last 500 data points):


In [70]:
no2[-500:].plot(figsize=(12,6))


Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0xa7c325ec>

Or we can use some more advanced time series features -> next section!

Working with time series data

When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:


In [71]:
no2.index


Out[71]:
<class 'pandas.tseries.index.DatetimeIndex'>
[1990-01-01 00:00:00, ..., 2012-12-31 23:00:00]
Length: 198895, Freq: None, Timezone: None

Indexing a time series works with strings:


In [72]:
no2["2010-01-01 09:00": "2010-01-01 12:00"]


Out[72]:
BETR801 BETN029 FR04037 FR04012
2010-01-01 09:00:00 17 7 19 41
2010-01-01 10:00:00 18 5 21 48
2010-01-01 11:00:00 17 4 23 63
2010-01-01 12:00:00 18 4 22 57

A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.

E.g. all data of 2012:


In [73]:
no2['2012']


Out[73]:
BETR801 BETN029 FR04037 FR04012
2012-01-01 00:00:00 21.0 1.0 17 56
2012-01-01 01:00:00 18.0 1.0 16 50
2012-01-01 02:00:00 20.0 1.0 14 46
2012-01-01 03:00:00 16.0 1.0 17 47
... ... ... ... ...
2012-12-31 20:00:00 16.5 2.0 16 47
2012-12-31 21:00:00 14.5 2.5 13 43
2012-12-31 22:00:00 16.5 3.5 14 42
2012-12-31 23:00:00 15.0 3.0 13 49

8784 rows × 4 columns

Or all data of January up to March 2012:


In [74]:
data['2012-01':'2012-03']


Out[74]:
00 01 02 03 04 05 06 07 08 09 ... 14 15 16 17 18 19 20 21 22 23
date
2012-01-01 21 18 20.0 16.0 13 17.0 15.0 13.0 15.0 15.0 ... 31.5 33.5 32.5 30 25.0 20.0 14.0 13 15.0 14
2012-01-02 NaN NaN 10.5 12.0 12 39.0 49.5 52.5 45.0 48.0 ... 32.0 38.0 43.0 61 56.0 46.0 39.0 33 24.0 20
2012-01-03 18 NaN NaN 14.0 23 31.5 36.0 40.0 32.5 26.0 ... 24.0 28.0 25.0 28 25.0 26.0 22.0 21 20.0 19
2012-01-04 16 16 NaN 13.0 14 17.0 26.0 33.0 36.0 36.0 ... 41.0 42.0 52.5 48 39.0 32.5 23.0 16 13.0 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2012-02-26 56 53 56.0 53.0 53 NaN NaN 48.0 50.0 52.5 ... 25.0 32.0 37.5 50 50.0 44.0 54.5 54 54.5 67
2012-02-27 59 47 50.0 51.5 59 65.0 NaN NaN 70.0 62.0 ... 56.0 61.0 70.0 68 60.0 56.0 54.0 42 36.0 28
2012-02-28 24 23 20.0 21.0 27 43.0 55.0 NaN NaN 47.0 ... 49.0 55.0 61.0 59 61.0 53.5 52.0 52 50.0 48
2012-02-29 45 39 35.0 32.5 34 47.0 51.5 52.5 NaN NaN ... 50.0 56.0 61.0 67 73.5 73.0 72.5 70 69.0 62

60 rows × 24 columns

Time and date components can be accessed from the index:


In [75]:
no2.index.hour


Out[75]:
array([ 0,  1,  2, ..., 21, 22, 23])

In [76]:
no2.index.year


Out[76]:
array([1990, 1990, 1990, ..., 2012, 2012, 2012])

The power of pandas: resample

A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).

The time series has a frequency of 1 hour. I want to change this to daily:


In [77]:
no2.resample('D').head()


Out[77]:
BETR801 BETN029 FR04037 FR04012
1990-01-01 NaN 21.500000 NaN NaN
1990-01-02 53.923077 35.000000 NaN NaN
1990-01-03 63.000000 29.136364 NaN NaN
1990-01-04 65.250000 42.681818 NaN NaN
1990-01-05 63.846154 40.136364 NaN NaN

By default, resample takes the mean as aggregation function, but other methods can also be specified:


In [78]:
no2.resample('D', how='max').head()


Out[78]:
BETR801 BETN029 FR04037 FR04012
1990-01-01 NaN 41 NaN NaN
1990-01-02 59 59 NaN NaN
1990-01-03 103 47 NaN NaN
1990-01-04 74 58 NaN NaN
1990-01-05 84 67 NaN NaN

The strings to specify the new time frequency are the "offset aliases"; see http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases.
These strings can also be combined with numbers, e.g. '10D'.
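
For example, 10-daily means (a sketch, relying on the default mean aggregation of the pandas version used in this talk):


In [ ]:
no2.resample('10D').head()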

Further exploring the data:


In [79]:
no2.resample('M').plot() # 'A'


Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0xa96348ac>

In [ ]:
# no2['2012'].resample('D').plot()

In [80]:
no2.loc['2009':, 'FR04037'].resample('M', how=['mean', 'median']).plot()


Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0xa7bdc64c>

Question: plot the evolution of the yearly averages, together with the overall mean of all stations


In [81]:
no2_1999 = no2['1999':]
no2_1999.resample('A').plot()
no2_1999.mean(axis=1).resample('A').plot(color='k', linestyle='--', linewidth=4)


Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0xa93b380c>

Analysing the data

Intermezzo - the groupby operation (split-apply-combine)

By "group by" we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

Similar to SQL GROUP BY

The classic split-apply-combine example in pandas syntax:


In [82]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df


Out[82]:
data key
0 0 A
1 5 B
2 10 C
3 5 A
... ... ...
5 15 C
6 10 A
7 15 B
8 20 C

9 rows × 2 columns


In [83]:
df.groupby('key').aggregate('sum')  # np.sum


Out[83]:
data
key
A 15
B 30
C 45

In [84]:
df.groupby('key').sum()


Out[84]:
data
key
A 15
B 30
C 45

Back to the air quality data

Question: What does the typical monthly profile look like for the different stations?

First, we add a column to the dataframe that indicates the month (integer value of 1 to 12):


In [85]:
no2['month'] = no2.index.month

Now, we can calculate the mean of each month over the different years:


In [86]:
no2.groupby('month').mean()


Out[86]:
BETR801 BETN029 FR04037 FR04012
month
1 50.927088 20.304075 47.634409 82.472813
2 54.168021 19.938929 50.564499 83.973207
3 54.598322 19.424205 47.862715 96.272138
4 51.491741 18.183433 40.943117 95.962862
... ... ... ... ...
9 49.220250 14.605979 39.706019 93.000316
10 50.894911 17.660149 44.010934 86.297836
11 50.254468 19.372193 45.564683 87.173878
12 48.644117 21.007089 45.262243 81.817977

12 rows × 4 columns


In [87]:
no2.groupby('month').mean().plot()


Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0xa93495cc>

Question: The typical diurnal profile for the different stations


In [88]:
no2.groupby(no2.index.hour).mean().plot()


Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0xa4a46acc>

Question: What is the difference in the typical diurnal profile between weekdays and weekend days?


In [89]:
no2.index.weekday?

In [90]:
no2['weekday'] = no2.index.weekday

Add a column indicating week/weekend


In [91]:
no2['weekend'] = no2['weekday'].isin([5, 6])

In [92]:
data_weekend = no2.groupby(['weekend', no2.index.hour]).mean()
data_weekend.head()


Out[92]:
BETR801 BETN029 FR04037 FR04012 month weekday
weekend
False 0 40.008066 17.487512 34.439398 52.094663 6.520355 1.998157
1 38.281875 17.162671 31.585121 44.721629 6.518121 1.997315
2 38.601189 16.800076 30.865143 43.518539 6.520511 2.000000
3 42.633946 16.591031 32.963500 51.942135 6.518038 2.002360
4 49.853566 16.791971 40.780162 72.547472 6.514098 2.003883

In [93]:
data_weekend_FR04012 = data_weekend['FR04012'].unstack(level=0)
data_weekend_FR04012.head()


Out[93]:
weekend False True
0 52.094663 69.817219
1 44.721629 60.697248
2 43.518539 54.407904
3 51.942135 53.534933
4 72.547472 57.472830

In [94]:
data_weekend_FR04012.plot()


Out[94]:
<matplotlib.axes._subplots.AxesSubplot at 0xa95081cc>

Question: What is the number of exceedances of hourly values above the European limit of 200 µg/m³?


In [95]:
exceedances = no2 > 200

In [96]:
# group by year and count exceedances (sum of boolean)
exceedances = exceedances.groupby(exceedances.index.year).sum()

In [97]:
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')


Out[97]:
<matplotlib.lines.Line2D at 0xa94c7b0c>

Question: Visualize the typical week profile for the different stations as boxplots.

Tip: the boxplot method of a DataFrame expects the data for the different boxes in different columns.


In [98]:
# add a weekday and week column
no2['weekday'] = no2.index.weekday
no2['week'] = no2.index.week
no2.head()


Out[98]:
BETR801 BETN029 FR04037 FR04012 month weekday weekend week
1990-01-01 00:00:00 NaN 16 NaN NaN 1 0 False 1
1990-01-01 01:00:00 NaN 18 NaN NaN 1 0 False 1
1990-01-01 02:00:00 NaN 21 NaN NaN 1 0 False 1
1990-01-01 03:00:00 NaN 26 NaN NaN 1 0 False 1
1990-01-01 04:00:00 NaN 21 NaN NaN 1 0 False 1

In [99]:
# pivot table so that the weekdays are the different columns
data_pivoted = no2['2012'].pivot_table(columns='weekday', index='week', values='FR04037')
data_pivoted.head()


Out[99]:
weekday 0 1 2 3 4 5 6
week
1 24.625000 23.875000 26.208333 17.500000 40.208333 24.625000 22.375000
2 39.125000 44.125000 57.583333 50.750000 40.791667 34.750000 32.250000
3 45.208333 66.333333 51.958333 28.250000 28.291667 18.416667 18.333333
4 35.333333 49.500000 49.375000 48.916667 63.458333 34.250000 25.250000
5 47.791667 38.791667 54.333333 50.041667 51.458333 46.541667 35.458333

In [100]:
box = data_pivoted.boxplot()


/home/joris/miniconda/lib/python2.7/site-packages/pandas/tools/plotting.py:2633: FutureWarning: 
The default value for 'return_type' will change to 'axes' in a future release.
 To use the future behavior now, set return_type='axes'.
 To keep the previous behavior and silence this warning, set return_type='dict'.
  warnings.warn(msg, FutureWarning)

Exercise: Calculate the correlation between the different stations


In [101]:
no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].corr()


Out[101]:
BETR801 BETN029 FR04037 FR04012
BETR801 1.000000 0.464085 0.561676 0.394446
BETN029 0.464085 1.000000 0.401864 0.186997
FR04037 0.561676 0.401864 1.000000 0.433466
FR04012 0.394446 0.186997 0.433466 1.000000

In [102]:
no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']].resample('D').corr()


Out[102]:
BETR801 BETN029 FR04037 FR04012
BETR801 1.000000 0.581701 0.663855 0.459885
BETN029 0.581701 1.000000 0.527390 0.312484
FR04037 0.663855 0.527390 1.000000 0.453584
FR04012 0.459885 0.312484 0.453584 1.000000

In [103]:
no2 = no2[['BETR801', 'BETN029', 'FR04037', 'FR04012']]

Further reading

What's new in pandas

Some recent enhancements of the last year (versions 0.14 to 0.16):

  • Better integration for categorical data (Categorical and CategoricalIndex)
  • The same for Timedelta and TimedeltaIndex
  • More flexible SQL interface based on sqlalchemy
  • MultiIndexing using slicers
  • .dt accessor for accessing datetime properties from columns
  • Groupby enhancements
  • And a lot of enhancements and bug fixes
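
As a small, illustrative sketch of two of these (the categorical dtype and the .dt accessor):


In [ ]:
pd.Series(['a', 'b', 'a'], dtype='category')  # categorical data

In [ ]:
dates = pd.Series(pd.date_range('2015-01-01', periods=3))
dates.dt.weekday  # datetime properties of a column via .dt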

How can you help?

We need you!

Contributions are very welcome and can be in different domains:

  • reporting issues
  • improving the documentation
  • testing release candidates and providing feedback
  • triaging and fixing bugs
  • implementing new features
  • spreading the word

-> https://github.com/pydata/pandas