Introduction to Pandas


In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.options.display.max_rows = 8

1. Let's start with a showcase

Case 1: Titanic survival data


In [2]:
df = pd.read_csv("data/titanic.csv")

In [3]:
df.head()


Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

We go from reading in this dataset to answering questions about the data in a few lines of code:

What is the age distribution of the passengers?


In [4]:
df['Age'].hist()


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607cb8e2b0>

How does the survival rate of the passengers differ between sexes?


In [5]:
df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))


Out[5]:
Survived
Sex
female 0.742038
male 0.188908

Or how does it differ between the different classes?


In [6]:
df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607cb2a630>

All the functionality needed for the examples above will be explained throughout this tutorial.

Case 2: air quality measurement time series

AirBase (the European Air quality dataBase): hourly measurements from all air quality monitoring stations in Europe

Starting from these hourly data for different stations:


In [7]:
data = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)

In [8]:
data.head()


Out[8]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 01:00:00 108.0 NaN 65.0 47.0
2000-01-01 02:00:00 104.0 60.0 77.0 42.0
2000-01-01 03:00:00 97.0 58.0 73.0 34.0
2000-01-01 04:00:00 77.0 52.0 57.0 29.0
2000-01-01 05:00:00 79.0 52.0 64.0 28.0

to answering questions about this data in a few lines of code:

Does the air pollution show a decreasing trend over the years?


In [9]:
data['1999':].resample('M').mean().plot(ylim=[0,120])


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607c2a2208>

In [10]:
data['1999':].resample('A').mean().plot(ylim=[0,100])


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f60743b2160>

What is the difference in the diurnal profile between weekdays and the weekend?


In [11]:
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['BASCH'].mean().unstack(level=0)
data_weekend.plot()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607c130550>

We will come back to these examples and build them up step by step.

2. Pandas: data analysis in Python

For data-intensive work in Python, the pandas library has become essential.

What is pandas?

  • Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
  • Pandas can also be thought of as R's data.frame in Python.
  • Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...

Its documentation: http://pandas.pydata.org/pandas-docs/stable/
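
As a minimal sketch of the "NumPy array with labels" idea (the data and labels below are made up for illustration):

import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])   # plain NumPy: axes are purely positional
labeled = pd.DataFrame(arr, index=['row1', 'row2'], columns=['a', 'b'])
labeled.loc['row1', 'b']                   # 2.0 -- selected by row/column label instead of position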

When do you need pandas?

When working with tabular or structured data (like an R dataframe, an SQL table, an Excel spreadsheet, ...):

  • Import data
  • Clean up messy data
  • Explore data, gain insight into data
  • Process and prepare your data for analysis
  • Analyse your data (together with scikit-learn, statsmodels, ...)
ATTENTION!:

Pandas is great for working with heterogeneous and tabular 1D/2D data, but not all types of data fit in such structures!
  • When working with array data (e.g. images, numerical algorithms): just stick with numpy
  • When working with multidimensional labeled data (e.g. climate data): have a look at [xarray](http://xarray.pydata.org/en/stable/)

3. The pandas data structures: DataFrame and Series

A DataFrame is a tabular data structure (a multi-dimensional object to hold labeled data) composed of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects that share the same index.


In [12]:
df


Out[12]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
... ... ... ... ... ... ... ... ... ... ... ... ...
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Attributes of the DataFrame

Besides an index attribute, a DataFrame also has a columns attribute:


In [13]:
df.index


Out[13]:
RangeIndex(start=0, stop=891, step=1)

In [14]:
df.columns


Out[14]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

To check the data types of the different columns:


In [15]:
df.dtypes


Out[15]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
                ...   
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

An overview of that information can be given with the info() method:


In [16]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

A DataFrame also has a values attribute, but beware: when you have heterogeneous data, all values will be upcast (here to object):


In [17]:
df.values


Out[17]:
array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ..., 
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

Apart from importing your data from an external source (text file, Excel, database, ...), one of the most common ways of creating a DataFrame is from a dictionary of arrays or lists.

Note that in the IPython notebook, the dataframe will display in a rich HTML view:


In [18]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
df_countries = pd.DataFrame(data)
df_countries


Out[18]:
area capital country population
0 30510 Brussels Belgium 11.3
1 671308 Paris France 64.3
2 357050 Berlin Germany 81.3
3 41526 Amsterdam Netherlands 16.9
4 244820 London United Kingdom 64.9

One-dimensional data: Series (a column of a DataFrame)

A Series is a basic holder for one-dimensional labeled data.


In [19]:
df['Age']


Out[19]:
0      22.0
1      38.0
2      26.0
3      35.0
       ... 
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, dtype: float64

In [20]:
age = df['Age']

Attributes of a Series: index and values

A Series also has index and values attributes, but no columns


In [21]:
age.index


Out[21]:
RangeIndex(start=0, stop=891, step=1)

You can access the underlying NumPy array representation with the .values attribute:


In [22]:
age.values[:10]


Out[22]:
array([ 22.,  38.,  26.,  35.,  35.,  nan,  54.,   2.,  27.,  14.])

We can access Series values via the index, just like for NumPy arrays:


In [23]:
age[0]


Out[23]:
22.0

Unlike the NumPy array, though, this index can be something other than integers:


In [24]:
df = df.set_index('Name')
df


Out[24]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
... ... ... ... ... ... ... ... ... ... ... ...
Graham, Miss. Margaret Edith 888 1 1 female 19.0 0 0 112053 30.0000 B42 S
Johnston, Miss. Catherine Helen "Carrie" 889 0 3 female NaN 1 2 W./C. 6607 23.4500 NaN S
Behr, Mr. Karl Howell 890 1 1 male 26.0 0 0 111369 30.0000 C148 C
Dooley, Mr. Patrick 891 0 3 male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns


In [25]:
age = df['Age']
age


Out[25]:
Name
Braund, Mr. Owen Harris                                22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38.0
Heikkinen, Miss. Laina                                 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35.0
                                                       ... 
Graham, Miss. Margaret Edith                           19.0
Johnston, Miss. Catherine Helen "Carrie"                NaN
Behr, Mr. Karl Howell                                  26.0
Dooley, Mr. Patrick                                    32.0
Name: Age, dtype: float64

In [26]:
age['Dooley, Mr. Patrick']


Out[26]:
32.0

but with the power of NumPy arrays. Many things you can do with NumPy arrays can also be applied to DataFrames / Series.

E.g. element-wise operations:


In [27]:
age * 1000


Out[27]:
Name
Braund, Mr. Owen Harris                                22000.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38000.0
Heikkinen, Miss. Laina                                 26000.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35000.0
                                                        ...   
Graham, Miss. Margaret Edith                           19000.0
Johnston, Miss. Catherine Helen "Carrie"                   NaN
Behr, Mr. Karl Howell                                  26000.0
Dooley, Mr. Patrick                                    32000.0
Name: Age, dtype: float64

A range of methods:


In [28]:
age.mean()


Out[28]:
29.69911764705882

Fancy indexing, like indexing with a list or boolean indexing:


In [29]:
age[age > 70]


Out[29]:
Name
Goldschmidt, Mr. George B               71.0
Connors, Mr. Patrick                    70.5
Artagaveytia, Mr. Ramon                 71.0
Barkworth, Mr. Algernon Henry Wilson    80.0
Svensson, Mr. Johan                     74.0
Name: Age, dtype: float64
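
Indexing with a list of labels also works; a small sketch reusing names that occur in the index:

age[['Dooley, Mr. Patrick', 'Behr, Mr. Karl Howell']]   # returns a Series with just these two passengers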

But there are also a lot of pandas-specific methods, e.g. value_counts:


In [30]:
df['Embarked'].value_counts()


Out[30]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64
EXERCISE:
  • What is the maximum Fare that was paid? And the median?

In [31]:
df['Fare'].max()


Out[31]:
512.32920000000001

In [32]:
df['Fare'].median()


Out[32]:
14.4542
EXERCISE:
  • Calculate the average survival ratio for all passengers (note: the 'Survived' column indicates whether someone survived (1) or not (0)).

In [33]:
# df['Survived'].sum() / len(df['Survived'])
df['Survived'].mean()


Out[33]:
0.3838383838383838

4. Data import and export

A wide range of input/output formats are natively supported by pandas:

  • CSV, text
  • SQL database
  • Excel
  • HDF5
  • json
  • html
  • pickle
  • sas, stata
  • (parquet)
  • ...

In [34]:
#pd.read

In [35]:
#df.to
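
These are available as pd.read_* functions and df.to_* methods (the commented stubs above are meant for tab completion). As a small sketch of a round trip to disk (the output file name here is made up):

df.to_csv("titanic_copy.csv", index=False)      # write the DataFrame to a new CSV file
df_roundtrip = pd.read_csv("titanic_copy.csv")  # read it back in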

pandas has a very powerful CSV reader:


In [36]:
pd.read_csv?

Luckily, if we have a well-formed CSV file, we don't need many of those arguments:


In [37]:
df = pd.read_csv("data/titanic.csv")

In [38]:
df.head()


Out[38]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
EXERCISE: Read the `data/20000101_20161231-NO2.csv` file into a DataFrame `no2`

Some aspects about the file:
  • Which separator is used in the file?
  • The second row includes unit information and should be skipped (check `skiprows` keyword)
  • For missing values, it uses the `'n/d'` notation (check `na_values` keyword)
  • We want to parse the 'timestamp' column as datetimes (check the `parse_dates` keyword)

In [39]:
no2 = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)

In [40]:
no2


Out[40]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 01:00:00 108.0 NaN 65.0 47.0
2000-01-01 02:00:00 104.0 60.0 77.0 42.0
2000-01-01 03:00:00 97.0 58.0 73.0 34.0
2000-01-01 04:00:00 77.0 52.0 57.0 29.0
... ... ... ... ...
2016-12-31 20:00:00 73.0 51.0 49.0 20.0
2016-12-31 21:00:00 61.0 51.0 48.0 16.0
2016-12-31 22:00:00 57.0 49.0 45.0 14.0
2016-12-31 23:00:00 51.0 47.0 45.0 12.0

149039 rows × 4 columns

5. Exploration

Some useful methods:

head and tail


In [41]:
no2.head(3)


Out[41]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 01:00:00 108.0 NaN 65.0 47.0
2000-01-01 02:00:00 104.0 60.0 77.0 42.0
2000-01-01 03:00:00 97.0 58.0 73.0 34.0

In [42]:
no2.tail()


Out[42]:
BASCH BONAP PA18 VERS
timestamp
2016-12-31 19:00:00 77.0 49.0 52.0 23.0
2016-12-31 20:00:00 73.0 51.0 49.0 20.0
2016-12-31 21:00:00 61.0 51.0 48.0 16.0
2016-12-31 22:00:00 57.0 49.0 45.0 14.0
2016-12-31 23:00:00 51.0 47.0 45.0 12.0

info()


In [43]:
no2.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 149039 entries, 2000-01-01 01:00:00 to 2016-12-31 23:00:00
Data columns (total 4 columns):
BASCH    139949 non-null float64
BONAP    136493 non-null float64
PA18     142259 non-null float64
VERS     143813 non-null float64
dtypes: float64(4)
memory usage: 5.7 MB

Getting some basic summary statistics about the data with describe:


In [44]:
no2.describe()


Out[44]:
BASCH BONAP PA18 VERS
count 139949.000000 136493.000000 142259.000000 143813.000000
mean 89.270098 64.001714 45.104211 27.613227
std 36.772855 27.866767 23.212719 19.604953
min 4.000000 0.000000 2.000000 0.000000
25% 63.000000 44.000000 28.000000 13.000000
50% 87.000000 62.000000 42.000000 22.000000
75% 112.000000 81.000000 59.000000 38.000000
max 358.000000 345.000000 306.000000 197.000000

Quickly visualizing the data


In [45]:
no2.plot(kind='box', ylim=[0,250])


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6074f91358>

In [46]:
no2['BASCH'].plot(kind='hist', bins=50)


Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607186deb8>
EXERCISE:
  • Plot the age distribution of the Titanic passengers

In [47]:
df['Age'].hist()


Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6071dafa58>

The default plot (when not specifying kind) is a line plot of all columns:


In [48]:
no2.plot(figsize=(12,6))


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6071868080>

This does not tell us much ..

We can select part of the data (e.g. the last 500 data points):


In [49]:
no2[-500:].plot(figsize=(12,6))


Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f606ad036d8>

Or we can use some more advanced time series features -> see further in this notebook!

6. Selecting and filtering data

ATTENTION!:

One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex compared to NumPy.

We now have to distinguish between:
  • selection by **label**
  • selection by **position**
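
The distinction matters as soon as the index is not the default 0..n-1 range; a quick sketch with a deliberately shuffled integer index:

s = pd.Series([10, 20, 30], index=[2, 1, 0])
s.loc[0]    # 30 -- the element whose *label* is 0
s.iloc[0]   # 10 -- the element at *position* 0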

In [50]:
df = pd.read_csv("data/titanic.csv")

df[] provides some convenience shortcuts

For a DataFrame, basic indexing selects the columns.

Selecting a single column:


In [51]:
df['Age']


Out[51]:
0      22.0
1      38.0
2      26.0
3      35.0
       ... 
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, dtype: float64

or multiple columns:


In [52]:
df[['Age', 'Fare']]


Out[52]:
Age Fare
0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
... ... ...
887 19.0 30.0000
888 NaN 23.4500
889 26.0 30.0000
890 32.0 7.7500

891 rows × 2 columns

But slicing accesses the rows:


In [53]:
df[10:15]


Out[53]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S

Systematic indexing with loc and iloc

When using [] like above, you can only select from one axis at a time (rows or columns, not both). For more advanced indexing, you have some extra attributes:

  • loc: selection by label
  • iloc: selection by position

In [54]:
df = df.set_index('Name')

In [55]:
df.loc['Bonnell, Miss. Elizabeth', 'Fare']


Out[55]:
26.550000000000001

In [56]:
df.loc['Bonnell, Miss. Elizabeth':'Andersson, Mr. Anders Johan', :]


Out[56]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Bonnell, Miss. Elizabeth 12 1 1 female 58.0 0 0 113783 26.550 C103 S
Saundercock, Mr. William Henry 13 0 3 male 20.0 0 0 A/5. 2151 8.050 NaN S
Andersson, Mr. Anders Johan 14 0 3 male 39.0 1 5 347082 31.275 NaN S

Selecting by position with iloc works similarly to indexing NumPy arrays:


In [57]:
df.iloc[0:2,1:3]


Out[57]:
Survived Pclass
Name
Braund, Mr. Owen Harris 0 3
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 1 1

The different indexing methods can also be used to assign data:


In [59]:
df.loc['Braund, Mr. Owen Harris', 'Survived'] = 100

In [60]:
df


Out[60]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 100 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
... ... ... ... ... ... ... ... ... ... ... ...
Graham, Miss. Margaret Edith 888 1 1 female 19.0 0 0 112053 30.0000 B42 S
Johnston, Miss. Catherine Helen "Carrie" 889 0 3 female NaN 1 2 W./C. 6607 23.4500 NaN S
Behr, Mr. Karl Howell 890 1 1 male 26.0 0 0 111369 30.0000 C148 C
Dooley, Mr. Patrick 891 0 3 male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns

Boolean indexing (filtering)

Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a WHERE clause in SQL), comparable to NumPy.

The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.


In [61]:
df['Fare'] > 50


Out[61]:
Name
Braund, Mr. Owen Harris                                False
Cumings, Mrs. John Bradley (Florence Briggs Thayer)     True
Heikkinen, Miss. Laina                                 False
Futrelle, Mrs. Jacques Heath (Lily May Peel)            True
                                                       ...  
Graham, Miss. Margaret Edith                           False
Johnston, Miss. Catherine Helen "Carrie"               False
Behr, Mr. Karl Howell                                  False
Dooley, Mr. Patrick                                    False
Name: Fare, dtype: bool

In [62]:
df[df['Fare'] > 50]


Out[62]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Fortune, Mr. Charles Alexander 28 0 1 male 19.0 3 2 19950 263.0000 C23 C25 C27 S
... ... ... ... ... ... ... ... ... ... ... ...
Sage, Miss. Dorothy Edith "Dolly" 864 0 3 female NaN 8 2 CA. 2343 69.5500 NaN S
Roebling, Mr. Washington Augustus II 868 0 1 male 31.0 0 0 PC 17590 50.4958 A24 S
Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 872 1 1 female 47.0 1 1 11751 52.5542 D35 S
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 880 1 1 female 56.0 0 1 11767 83.1583 C50 C

160 rows × 11 columns
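
Conditions can be combined with & (and) and | (or); note that each condition needs its own parentheses. A small sketch:

df[(df['Fare'] > 50) & (df['Sex'] == 'female')]   # female passengers who paid more than 50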

EXERCISE:
  • Based on the Titanic data set, select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers.

In [63]:
df = pd.read_csv("data/titanic.csv")

In [64]:
df[df['Sex'] == 'male']


Out[64]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
... ... ... ... ... ... ... ... ... ... ... ... ...
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

577 rows × 12 columns


In [65]:
df.loc[df['Sex'] == 'male', 'Age'].mean()


Out[65]:
30.72664459161148

In [66]:
df.loc[df['Sex'] == 'female', 'Age'].mean()


Out[66]:
27.915708812260537
EXERCISE:
  • How many passengers older than 70 were on the Titanic?

In [67]:
len(df[df['Age'] > 70])


Out[67]:
5

In [68]:
(df['Age'] > 70).sum()


Out[68]:
5

7. The group-by operation

Some 'theory': the groupby operation (split-apply-combine)


In [69]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df


Out[69]:
data key
0 0 A
1 5 B
2 10 C
3 5 A
... ... ...
5 15 C
6 10 A
7 15 B
8 20 C

9 rows × 2 columns

Recap: aggregating functions

When analyzing data, you often calculate summary statistics (aggregations like the mean, max, ...). As we have seen before, we can easily calculate such a statistic for a Series or column using one of the many available methods. For example:


In [70]:
df['data'].sum()


Out[70]:
90

However, in many cases your data has certain groups in it, and in that case, you may want to calculate this statistic for each of the groups.

For example, in the above dataframe df, there is a column 'key' which has three possible values: 'A', 'B' and 'C'. When we want to calculate the sum for each of those groups, we could do the following:


In [71]:
for key in ['A', 'B', 'C']:
    print(key, df[df['key'] == key]['data'].sum())


A 15
B 30
C 45

This becomes very verbose when there are many groups. You could make the above a bit shorter by looping over the unique values automatically, but it is still not very convenient to work with.

What we did above, applying a function on different groups, is a "groupby operation", and pandas provides some convenient functionality for this.

Groupby: applying functions per group

The "group by" concept: we want to apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

Similar to SQL GROUP BY
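
To make the three steps concrete, here is a sketch that performs them by hand on the small df defined above (groupby will do all of this in one step):

# split: one sub-DataFrame per unique key
groups = {key: df[df['key'] == key] for key in df['key'].unique()}
# apply: compute the statistic for each group
sums = {key: subdf['data'].sum() for key, subdf in groups.items()}
# combine: collect the results into one labeled structure
pd.Series(sums)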

Instead of doing the manual filtering as above

df[df['key'] == "A"].sum()
df[df['key'] == "B"].sum()
...

pandas provides the groupby method to do exactly this:


In [72]:
df.groupby('key').sum()


Out[72]:
data
key
A 15
B 30
C 45

In [73]:
df.groupby('key').aggregate(np.sum)  # 'sum'


Out[73]:
data
key
A 15
B 30
C 45

And many more methods are available.


In [74]:
df.groupby('key')['data'].sum()


Out[74]:
key
A    15
B    30
C    45
Name: data, dtype: int64
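
Several aggregations can also be computed in one go with agg (a small sketch):

df.groupby('key')['data'].agg(['sum', 'mean', 'max'])   # one column per aggregation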

Application of the groupby concept on the Titanic data

We go back to the Titanic passenger survival data:


In [75]:
df = pd.read_csv("data/titanic.csv")

In [76]:
df.head()


Out[76]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
EXERCISE:
  • Using groupby(), calculate the average age for each sex.

In [77]:
df.groupby('Sex')['Age'].mean()


Out[77]:
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
EXERCISE:
  • Calculate the average survival ratio for all passengers.

In [78]:
# df['Survived'].sum() / len(df['Survived'])
df['Survived'].mean()


Out[78]:
0.3838383838383838
EXERCISE:
  • Calculate this survival ratio for all passengers younger than 25 (remember: filtering/boolean indexing).

In [79]:
df25 = df[df['Age'] <= 25]
df25['Survived'].sum() / len(df25['Survived'])


Out[79]:
0.4119601328903654
EXERCISE:
  • What is the difference in the survival ratio between the sexes?

In [80]:
# df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))
df.groupby('Sex')[['Survived']].mean()


Out[80]:
Survived
Sex
female 0.742038
male 0.188908
EXERCISE:
  • Or how does it differ between the different classes? Make a bar plot visualizing the survival ratio for the 3 classes.

In [81]:
df.groupby('Pclass')['Survived'].mean().plot(kind='bar')


Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f606bb8e9e8>
EXERCISE:
  • Make a bar plot to visualize the average Fare paid by people depending on their age. The age column is divided into separate classes using the `pd.cut` function, as provided below.

In [82]:
df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10))

In [83]:
df.groupby('AgeClass')['Fare'].mean().plot(kind='bar', rot=0)


Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f606bbae5c0>

8. Working with time series data


In [84]:
no2 = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)

When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:


In [85]:
no2.index


Out[85]:
DatetimeIndex(['2000-01-01 01:00:00', '2000-01-01 02:00:00',
               '2000-01-01 03:00:00', '2000-01-01 04:00:00',
               '2000-01-01 05:00:00', '2000-01-01 06:00:00',
               '2000-01-01 07:00:00', '2000-01-01 08:00:00',
               '2000-01-01 09:00:00', '2000-01-01 10:00:00',
               ...
               '2016-12-31 14:00:00', '2016-12-31 15:00:00',
               '2016-12-31 16:00:00', '2016-12-31 17:00:00',
               '2016-12-31 18:00:00', '2016-12-31 19:00:00',
               '2016-12-31 20:00:00', '2016-12-31 21:00:00',
               '2016-12-31 22:00:00', '2016-12-31 23:00:00'],
              dtype='datetime64[ns]', name='timestamp', length=149039, freq=None)

Indexing a time series works with strings:


In [86]:
no2["2010-01-01 09:00": "2010-01-01 12:00"]


Out[86]:
BASCH BONAP PA18 VERS
timestamp
2010-01-01 09:00:00 31.0 27.0 28.0 14.0
2010-01-01 10:00:00 41.0 31.0 30.0 14.0
2010-01-01 11:00:00 48.0 32.0 33.0 16.0
2010-01-01 12:00:00 63.0 33.0 39.0 19.0

A nice feature is "partial string" indexing, so you don't need to provide the full datetime string.

E.g. all data from January up to and including March 2012:


In [87]:
no2['2012-01':'2012-03']


Out[87]:
BASCH BONAP PA18 VERS
timestamp
2012-01-01 00:00:00 44.0 34.0 32.0 7.0
2012-01-01 01:00:00 56.0 35.0 29.0 9.0
2012-01-01 02:00:00 50.0 36.0 29.0 7.0
2012-01-01 03:00:00 46.0 34.0 22.0 8.0
... ... ... ... ...
2012-03-31 20:00:00 61.0 55.0 29.0 17.0
2012-03-31 21:00:00 46.0 49.0 32.0 14.0
2012-03-31 22:00:00 56.0 41.0 27.0 14.0
2012-03-31 23:00:00 59.0 51.0 29.0 13.0

2184 rows × 4 columns

Time and date components can be accessed from the index:


In [88]:
no2.index.hour


Out[88]:
array([ 1,  2,  3, ..., 21, 22, 23], dtype=int32)

In [89]:
no2.index.year


Out[89]:
array([2000, 2000, 2000, ..., 2016, 2016, 2016], dtype=int32)

The power of pandas: resample

A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).

Remember the air quality data:


In [90]:
no2.plot()


Out[90]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f606bb1efd0>

The time series has a frequency of 1 hour. I want to change this to daily:


In [91]:
no2.head()


Out[91]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 01:00:00 108.0 NaN 65.0 47.0
2000-01-01 02:00:00 104.0 60.0 77.0 42.0
2000-01-01 03:00:00 97.0 58.0 73.0 34.0
2000-01-01 04:00:00 77.0 52.0 57.0 29.0
2000-01-01 05:00:00 79.0 52.0 64.0 28.0

In [92]:
no2.resample('D').mean().head()


Out[92]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 83.173913 53.772727 64.695652 36.521739
2000-01-02 78.708333 59.250000 63.708333 17.166667
2000-01-03 82.333333 73.541667 61.000000 23.083333
2000-01-04 78.500000 73.708333 48.863636 23.791667
2000-01-05 94.291667 90.458333 60.166667 28.214286

Above I take the mean, but as with groupby I can also specify other methods:


In [93]:
no2.resample('D').max().head()


Out[93]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 109.0 62.0 77.0 59.0
2000-01-02 109.0 96.0 78.0 45.0
2000-01-03 120.0 102.0 72.0 37.0
2000-01-04 115.0 107.0 78.0 43.0
2000-01-05 141.0 115.0 82.0 41.0

The strings to specify the new time frequency are listed at http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases
These strings can also be combined with numbers, e.g. '10D'.
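
For example, 10-day averages (a small sketch):

no2.resample('10D').mean().head()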

Further exploring the data:


In [94]:
no2.resample('M').mean().plot() # 'A'


Out[94]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6062d42dd8>

In [95]:
# no2['2012'].resample('D').plot()

In [96]:
no2.loc['2009':, 'VERS'].resample('M').agg(['mean', 'median']).plot()


Out[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063065630>
EXERCISE: The evolution of the yearly averages, together with the overall mean of all stations
  • Use `resample` and `plot` to plot the yearly averages for the different stations.
  • The overall mean of all stations can be calculated by taking the mean of the different columns (`.mean(axis=1)`).

In [97]:
no2.resample('A').mean().plot()
no2.mean(axis=1).resample('A').mean().plot(color='k', linestyle='--', linewidth=4)


Out[97]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063c2bda0>
EXERCISE: what does the *typical monthly profile* look like for the different stations?
  • Add a 'month' column to the dataframe.
  • Group by the month to obtain the typical monthly averages over the different years.

First, we add a column to the dataframe that indicates the month (integer value of 1 to 12):


In [98]:
no2['month'] = no2.index.month

Now, we can calculate the mean of each month over the different years:


In [99]:
no2.groupby('month').mean()


Out[99]:
BASCH BONAP PA18 VERS
month
1 83.907542 65.387329 52.771067 30.995293
2 88.347532 67.387637 53.922040 33.890926
3 94.812926 73.588946 54.171491 35.508674
4 95.841468 71.824767 47.619196 30.184283
... ... ... ... ...
9 95.748555 65.478451 46.283842 29.060800
10 92.084073 65.832718 48.139287 29.232252
11 86.824953 62.795296 49.082940 29.815414
12 85.031296 63.964938 51.229126 31.136172

12 rows × 4 columns


In [100]:
no2.groupby('month').mean().plot()


Out[100]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063abaa20>
EXERCISE: The typical diurnal profile for the different stations
  • Similar to what we did for the month, you can now group by the hour of the day.

In [101]:
no2.groupby(no2.index.hour).mean().plot()


Out[101]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063a63470>
EXERCISE: What is the difference in the typical diurnal profile between week and weekend days for the 'BASCH' station.
  • Add a column 'weekday' defining the different days in the week.
  • Add a column 'weekend' defining whether a day is in the weekend (i.e. days 5 and 6) or not (True/False).
  • You can groupby on multiple items at the same time. In this case you would need to group by both weekend/weekday and hour of the day.

Add a column indicating the weekday:


In [102]:
no2.index.weekday?

In [103]:
no2['weekday'] = no2.index.weekday

Add a column indicating week/weekend


In [104]:
no2['weekend'] = no2['weekday'].isin([5, 6])

Now we can group by the hour of the day and the weekend indicator (or use pivot_table):


In [105]:
data_weekend = no2.groupby(['weekend', no2.index.hour]).mean()
data_weekend.head()


Out[105]:
BASCH BONAP PA18 VERS month weekday
weekend
False 0 62.683270 49.385498 41.966667 25.601584 6.522435 2.0
1 51.150107 41.151063 37.160479 22.988806 6.522435 2.0
2 44.088698 36.148094 33.933945 21.275548 6.522435 2.0
3 43.542551 33.898973 32.919567 20.782081 6.522435 2.0
4 53.439094 35.102370 35.566087 22.241620 6.522435 2.0

In [106]:
data_weekend_BASCH = data_weekend['BASCH'].unstack(level=0)
data_weekend_BASCH.head()


Out[106]:
weekend False True
0 62.683270 77.040828
1 51.150107 68.010059
2 44.088698 59.186060
3 43.542551 53.515366
4 53.439094 53.383797

In [107]:
data_weekend_BASCH.plot()


Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063a73be0>

In [108]:
no2['hour'] = no2.index.hour
no2.pivot_table(columns='weekend', index='hour', values='BASCH')


Out[108]:
weekend False True
hour
0 62.683270 77.040828
1 51.150107 68.010059
2 44.088698 59.186060
3 43.542551 53.515366
... ... ...
20 103.314564 94.654967
21 91.401094 85.394768
22 86.642586 81.529132
23 79.266397 76.563615

24 rows × 2 columns

EXERCISE: How many exceedances of hourly values above the European limit of 200 µg/m3 are there? Count the number of exceedances of hourly values above the European limit of 200 µg/m3 for each year and station from 2005 onwards. Make a bar plot of the counts, and add a horizontal line indicating the maximum number of exceedances allowed per year (which is 18).

Hints:
  • Create a new DataFrame, called `exceedances`, (with boolean values) indicating if the threshold is exceeded or not
  • Remember that the sum of True values can be used to count elements. Do this using groupby for each year.
  • Adding a horizontal line can be done with the matplotlib function `ax.axhline`.

In [109]:
exceedances = no2 > 200

In [110]:
# group by year and count exceedances (sum of boolean)
exceedances = exceedances.groupby(exceedances.index.year).sum()

In [111]:
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')


Out[111]:
<matplotlib.lines.Line2D at 0x7f60638dc048>

9. What I didn't talk about

  • Concatenating data: pd.concat
  • Merging and joining data: pd.merge
  • Reshaping data: pivot_table, melt, stack, unstack
  • Working with missing data: isnull, dropna, interpolate, ...
  • ...
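
As a small teaser for the first two, a sketch with made-up toy frames:

left = pd.DataFrame({'key': ['A', 'B'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['B', 'C'], 'rval': [3, 4]})
pd.concat([left, left])             # stack frames vertically
pd.merge(left, right, on='key')     # database-style join on the 'key' column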

Further reading