Pandas Tutorial

EuroScipy, Cambridge UK, August 27th, 2015

Joris Van den Bossche

Source: https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial

About me: Joris Van den Bossche

  • PhD student at Ghent University and VITO, Belgium
  • bio-science engineer, air quality research
  • pandas core dev

Licensed under CC BY 4.0 Creative Commons

Content of this talk

  • Why do you need pandas?
  • Basic introduction to the data structures
  • Guided tour through some of the pandas features with two case studies: a movie database and air quality data

If you want to follow along, this is a notebook that you can view or run yourself:

Some imports:


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn

pd.options.display.max_rows = 8

Let's start with a showcase

Case study: air quality in Europe

AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe

Starting from these hourly data for different stations:


In [2]:
data = pd.read_csv('data/airbase_data.csv', index_col=0, parse_dates=True)

In [3]:
data


Out[3]:
                     BETR801  BETN029  FR04037  FR04012
1990-01-01 00:00:00      NaN     16.0      NaN      NaN
1990-01-01 01:00:00      NaN     18.0      NaN      NaN
1990-01-01 02:00:00      NaN     21.0      NaN      NaN
1990-01-01 03:00:00      NaN     26.0      NaN      NaN
...                      ...      ...      ...      ...
2012-12-31 20:00:00     16.5      2.0       16       47
2012-12-31 21:00:00     14.5      2.5       13       43
2012-12-31 22:00:00     16.5      3.5       14       42
2012-12-31 23:00:00     15.0      3.0       13       49

198895 rows × 4 columns

to answering questions about this data in a few lines of code:

Does the air pollution show a decreasing trend over the years?


In [4]:
# Yearly mean per station; pandas 0.18 and later need the explicit .mean()
data['1999':].resample('A').mean().plot(ylim=[0,100])


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0xab4c292c>
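
The resampling step can be tried without the AirBase file. A minimal sketch on synthetic data, assuming a made-up column name `station` and a daily rather than yearly frequency so that two days of values are enough:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly measurements (not real AirBase data):
# 48 hourly values for one hypothetical station.
index = pd.date_range('2012-01-01', periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({'station': np.arange(48, dtype=float)}, index=index)

# Downsample to one value per day; the aggregation (here the mean) is explicit.
daily = hourly.resample('D').mean()
print(daily)
```

The same pattern with a yearly frequency gives the annual means plotted above.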

How many exceedances of the limit values?


In [5]:
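# Boolean frame: True where an hourly value is above 200
exceedances = data > 200
# Summing booleans per calendar year counts the exceedances
exceedances = exceedances.groupby(exceedances.index.year).sum()
ax = exceedances.loc[2005:].plot(kind='bar')
# Dashed horizontal reference line at 18
ax.axhline(18, color='k', linestyle='--')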


Out[5]:
<matplotlib.lines.Line2D at 0xab02004c>
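
The counting logic works because `True` sums as 1. A small sketch with invented values for two hypothetical stations `A` and `B` (the numbers are made up for illustration, not measurements):

```python
import numpy as np
import pandas as pd

# Four invented timestamps spread over two years.
index = pd.to_datetime(['2011-03-01 10:00', '2011-07-15 14:00',
                        '2012-01-20 09:00', '2012-06-30 18:00'])
values = pd.DataFrame({'A': [250.0, 180.0, 210.0, 220.0],
                       'B': [90.0, 205.0, np.nan, 150.0]}, index=index)

# Comparison gives a boolean frame; a missing value compares as False.
above = values > 200

# Group the rows by calendar year and count the True values.
per_year = above.groupby(above.index.year).sum()
print(per_year)
```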

What is the difference in diurnal profile between weekdays and the weekend?


In [6]:
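# Monday is 0, so 5 and 6 are Saturday and Sunday
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
# Mean per (weekend flag, hour of day), then one column per flag value
data_weekend = data.groupby(['weekend', data.index.hour])['FR04012'].mean().unstack(level=0)
data_weekend.plot()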


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0xab3cb5cc>
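
To see what the `groupby` plus `unstack` combination produces, here is a sketch on one synthetic week of hourly data. The column name `station` and the constant values (20 on weekdays, 10 in the weekend) are assumptions chosen to make the result easy to check:

```python
import numpy as np
import pandas as pd

# One full week of hourly timestamps; 2012-01-02 is a Monday.
index = pd.date_range('2012-01-02', periods=7 * 24, freq=pd.Timedelta(hours=1))

# Invented values: 20.0 on weekdays, 10.0 on Saturday and Sunday.
df = pd.DataFrame({'station': np.where(index.weekday >= 5, 10.0, 20.0)},
                  index=index)
df['weekday'] = df.index.weekday
df['weekend'] = df['weekday'].isin([5, 6])

# Rows: hour of day (0..23); columns: weekend flag (False, True).
profile = df.groupby(['weekend', df.index.hour])['station'].mean().unstack(level=0)
print(profile.head())
```

Each column of `profile` is one diurnal profile, which is why a single `.plot()` call draws both lines.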

We will come back to these examples and build them up step by step.

Why do you need pandas?

When working with tabular or structured data (like an R data frame, a SQL table, an Excel spreadsheet, ...):

  • Import data
  • Clean up messy data
  • Explore data, gain insight into data
  • Process and prepare your data for analysis
  • Analyse your data (together with scikit-learn, statsmodels, ...)

Pandas: data analysis in Python

For data-intensive work in Python, the pandas library has become essential.

What is pandas?

  • Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
  • Pandas can also be thought of as R's data.frame in Python.
  • Powerful for working with missing data and time series data, for reading and writing your data, and for reshaping, grouping and merging it
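
As a small illustration of "arrays with labels" and mixed data types, here is a sketch with an invented table (the column names and values are assumptions for the example):

```python
import numpy as np
import pandas as pd

# Each column has its own type; rows carry the labels 'a', 'b', 'c'.
df = pd.DataFrame({'station': ['X', 'Y', 'Z'],      # strings
                   'value': [16.0, np.nan, 21.5],   # floats, one missing
                   'valid': [True, False, True]},   # booleans
                  index=['a', 'b', 'c'])

print(df.dtypes)              # one dtype per column
print(df.loc['c', 'value'])   # select by row and column label
print(df['value'].mean())     # the missing value is skipped
```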

Its documentation: http://pandas.pydata.org/pandas-docs/stable/

Key features

  • Fast, easy and flexible input/output for a lot of different data formats
  • Working with missing data (.dropna(), pd.isnull())
  • Merging and joining (concat, join)
  • Grouping: groupby functionality
  • Reshaping (stack, pivot)
  • Powerful time series manipulation (resampling, timezones, ...)
  • Easy plotting
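
Several of these features fit in a few lines. A minimal sketch on two invented tables (names and values are hypothetical), touching missing data, `concat`, `groupby` and `pivot`:

```python
import numpy as np
import pandas as pd

# Two small invented tables, one per year, with one missing value.
a = pd.DataFrame({'station': ['X', 'Y'], 'year': [2011, 2011],
                  'value': [30.0, np.nan]})
b = pd.DataFrame({'station': ['X', 'Y'], 'year': [2012, 2012],
                  'value': [26.0, 40.0]})

both = pd.concat([a, b], ignore_index=True)         # stack the rows
n_missing = pd.isnull(both['value']).sum()          # count missing values
clean = both.dropna()                               # drop rows with NaN
means = clean.groupby('station')['value'].mean()    # mean per station
wide = both.pivot(index='year', columns='station', values='value')  # reshape

print(means)
print(wide)
```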

Further reading

What's new in pandas

Some recent enhancements of the last year (versions 0.14 to 0.16):

  • Better integration for categorical data (Categorical and CategoricalIndex)
  • The same for Timedelta and TimedeltaIndex
  • More flexible SQL interface based on sqlalchemy
  • MultiIndexing using slicers
  • .dt accessor for accessing datetime properties from columns
  • Groupby enhancements
  • And a lot of enhancements and bug fixes
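
A brief sketch of three of these additions (the `.dt` accessor, `Timedelta` and categorical data), using invented values:

```python
import pandas as pd

# A column of datetimes: .dt exposes their properties element-wise.
s = pd.Series(pd.to_datetime(['2015-08-27 09:00', '2015-08-30 14:30']))
hours = s.dt.hour

# Subtracting two timestamps gives a Timedelta.
delta = s[1] - s[0]

# Categorical data: a fixed set of categories instead of free-form strings.
levels = pd.Series(['low', 'high', 'low']).astype('category')

print(hours.tolist())
print(delta)
print(levels.cat.categories)
```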

How can you help?

We need you!

Contributions are very welcome and can be in different domains:

  • reporting issues
  • improving the documentation
  • testing release candidates and providing feedback
  • triaging and fixing bugs
  • implementing new features
  • spreading the word

-> https://github.com/pydata/pandas

JOIN the sprint this Sunday!
