Introduction to Pandas


In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.options.display.max_rows = 8

1. Let's start with a showcase

Case 1: Titanic survival data


In [2]:
df = pd.read_csv("data/titanic.csv")

In [3]:
df.head()


Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

We go from reading in this dataset to answering questions about the data in a few lines of code:

What is the age distribution of the passengers?


In [4]:
df['Age'].hist()


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607cb8e2b0>

How does the survival rate of the passengers differ between sexes?


In [5]:
df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))


Out[5]:
Survived
Sex
female 0.742038
male 0.188908

Or how does it differ between the different classes?


In [6]:
df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607cb2a630>

All the functionality needed for the examples above will be explained throughout this tutorial.

Case 2: air quality measurement time series

AirBase (the European Air quality dataBase): hourly measurements from all air quality monitoring stations in Europe

Starting from these hourly data for different stations:


In [7]:
data = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)

In [8]:
data.head()


Out[8]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 01:00:00 108.0 NaN 65.0 47.0
2000-01-01 02:00:00 104.0 60.0 77.0 42.0
2000-01-01 03:00:00 97.0 58.0 73.0 34.0
2000-01-01 04:00:00 77.0 52.0 57.0 29.0
2000-01-01 05:00:00 79.0 52.0 64.0 28.0

to answering questions about this data in a few lines of code:

Does the air pollution show a decreasing trend over the years?


In [9]:
data['1999':].resample('M').mean().plot(ylim=[0,120])


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607c2a2208>

In [10]:
data['1999':].resample('A').mean().plot(ylim=[0,100])


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f60743b2160>

What is the difference in the diurnal profile between weekdays and the weekend?


In [11]:
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['BASCH'].mean().unstack(level=0)
data_weekend.plot()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607c130550>

We will come back to these examples and build them up step by step.

2. Pandas: data analysis in Python

For data-intensive work in Python, the pandas library has become essential.

What is pandas?

  • Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.
  • Pandas can also be thought of as R's data.frame in Python.
  • Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...

Its documentation: http://pandas.pydata.org/pandas-docs/stable/
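
As a minimal sketch of the "NumPy array with labels" idea (the data and labels below are made up for illustration):

import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])   # plain NumPy: axes are purely positional
labeled = pd.DataFrame(arr, index=['row1', 'row2'], columns=['a', 'b'])
labeled.loc['row1', 'b']                   # 2.0 -- selected by row/column label instead of position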

When do you need pandas?

When working with tabular or structured data (like an R dataframe, an SQL table, an Excel spreadsheet, ...):

  • Import data
  • Clean up messy data
  • Explore data, gain insight into data
  • Process and prepare your data for analysis
  • Analyse your data (together with scikit-learn, statsmodels, ...)
ATTENTION!:

Pandas is great for working with heterogeneous and tabular 1D/2D data, but not all types of data fit in such structures!
  • When working with array data (e.g. images, numerical algorithms): just stick with numpy
  • When working with multidimensional labeled data (e.g. climate data): have a look at [xarray](http://xarray.pydata.org/en/stable/)

3. The pandas data structures: DataFrame and Series

A DataFrame is a tabular data structure (a multi-dimensional object to hold labeled data) composed of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects that share the same index.


In [12]:
df


Out[12]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
... ... ... ... ... ... ... ... ... ... ... ... ...
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Attributes of the DataFrame

Besides an index attribute, a DataFrame also has a columns attribute:


In [13]:
df.index


Out[13]:
RangeIndex(start=0, stop=891, step=1)

In [14]:
df.columns


Out[14]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

To check the data types of the different columns:


In [15]:
df.dtypes


Out[15]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
                ...   
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

An overview of that information can be given with the info() method:


In [16]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

A DataFrame also has a values attribute, but beware: when you have heterogeneous data, all values will be upcast (here to object):


In [17]:
df.values


Out[17]:
array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ..., 
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

Apart from importing your data from an external source (text file, Excel, database, ...), one of the most common ways of creating a DataFrame is from a dictionary of arrays or lists.

Note that in the IPython notebook, the dataframe will display in a rich HTML view:


In [18]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
df_countries = pd.DataFrame(data)
df_countries


Out[18]:
area capital country population
0 30510 Brussels Belgium 11.3
1 671308 Paris France 64.3
2 357050 Berlin Germany 81.3
3 41526 Amsterdam Netherlands 16.9
4 244820 London United Kingdom 64.9

One-dimensional data: Series (a column of a DataFrame)

A Series is a basic holder for one-dimensional labeled data.


In [19]:
df['Age']


Out[19]:
0      22.0
1      38.0
2      26.0
3      35.0
       ... 
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, dtype: float64

In [20]:
age = df['Age']

Attributes of a Series: index and values

A Series also has index and values attributes, but no columns


In [21]:
age.index


Out[21]:
RangeIndex(start=0, stop=891, step=1)

You can access the underlying NumPy array representation with the .values attribute:


In [22]:
age.values[:10]


Out[22]:
array([ 22.,  38.,  26.,  35.,  35.,  nan,  54.,   2.,  27.,  14.])

We can access Series values via the index, just like for NumPy arrays:


In [23]:
age[0]


Out[23]:
22.0

Unlike the NumPy array, though, this index can be something other than integers:


In [24]:
df = df.set_index('Name')
df


Out[24]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
... ... ... ... ... ... ... ... ... ... ... ...
Graham, Miss. Margaret Edith 888 1 1 female 19.0 0 0 112053 30.0000 B42 S
Johnston, Miss. Catherine Helen "Carrie" 889 0 3 female NaN 1 2 W./C. 6607 23.4500 NaN S
Behr, Mr. Karl Howell 890 1 1 male 26.0 0 0 111369 30.0000 C148 C
Dooley, Mr. Patrick 891 0 3 male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns


In [25]:
age = df['Age']
age


Out[25]:
Name
Braund, Mr. Owen Harris                                22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38.0
Heikkinen, Miss. Laina                                 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35.0
                                                       ... 
Graham, Miss. Margaret Edith                           19.0
Johnston, Miss. Catherine Helen "Carrie"                NaN
Behr, Mr. Karl Howell                                  26.0
Dooley, Mr. Patrick                                    32.0
Name: Age, dtype: float64

In [26]:
age['Dooley, Mr. Patrick']


Out[26]:
32.0

but with the power of NumPy arrays. Many things you can do with NumPy arrays can also be applied to DataFrames / Series.

E.g. element-wise operations:


In [27]:
age * 1000


Out[27]:
Name
Braund, Mr. Owen Harris                                22000.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38000.0
Heikkinen, Miss. Laina                                 26000.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35000.0
                                                        ...   
Graham, Miss. Margaret Edith                           19000.0
Johnston, Miss. Catherine Helen "Carrie"                   NaN
Behr, Mr. Karl Howell                                  26000.0
Dooley, Mr. Patrick                                    32000.0
Name: Age, dtype: float64

A range of methods:


In [28]:
age.mean()


Out[28]:
29.69911764705882

Fancy indexing, like indexing with a list or boolean indexing:


In [29]:
age[age > 70]


Out[29]:
Name
Goldschmidt, Mr. George B               71.0
Connors, Mr. Patrick                    70.5
Artagaveytia, Mr. Ramon                 71.0
Barkworth, Mr. Algernon Henry Wilson    80.0
Svensson, Mr. Johan                     74.0
Name: Age, dtype: float64
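
Indexing with a list of labels also works; a small sketch reusing names that occur in the index:

age[['Dooley, Mr. Patrick', 'Behr, Mr. Karl Howell']]   # returns a Series with just these two passengers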

But there are also a lot of pandas-specific methods, e.g. value_counts:


In [30]:
df['Embarked'].value_counts()


Out[30]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64
EXERCISE:
  • What is the maximum Fare that was paid? And the median?

In [31]:
df['Fare'].max()


Out[31]:
512.32920000000001

In [32]:
df['Fare'].median()


Out[32]:
14.4542
EXERCISE:
  • Calculate the average survival ratio for all passengers (note: the 'Survived' column indicates whether someone survived (1) or not (0)).

In [33]:
# df['Survived'].sum() / len(df['Survived'])
df['Survived'].mean()


Out[33]:
0.3838383838383838

4. Data import and export

A wide range of input/output formats are natively supported by pandas:

  • CSV, text
  • SQL database
  • Excel
  • HDF5
  • json
  • html
  • pickle
  • sas, stata
  • (parquet)
  • ...

In [34]:
#pd.read

In [35]:
#df.to
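
These are available as pd.read_* functions and df.to_* methods (the commented stubs above are meant for tab completion). As a small sketch of a round trip to disk (the output file name here is made up):

df.to_csv("titanic_copy.csv", index=False)      # write the DataFrame to a new CSV file
df_roundtrip = pd.read_csv("titanic_copy.csv")  # read it back in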

pandas has a very powerful CSV reader:


In [36]:
pd.read_csv?

Luckily, if we have a well-formed CSV file, we don't need many of those arguments:


In [37]:
df = pd.read_csv("data/titanic.csv")

In [38]:
df.head()


Out[38]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
EXERCISE: Read the `data/20000101_20161231-NO2.csv` file into a DataFrame `no2`

Some aspects about the file:
  • Which separator is used in the file?
  • The second row includes unit information and should be skipped (check `skiprows` keyword)
  • For missing values, it uses the `'n/d'` notation (check `na_values` keyword)
  • We want to parse the 'timestamp' column as datetimes (check the `parse_dates` keyword)

In [39]:
no2 = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)

In [40]:
no2


Out[40]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 01:00:00 108.0 NaN 65.0 47.0
2000-01-01 02:00:00 104.0 60.0 77.0 42.0
2000-01-01 03:00:00 97.0 58.0 73.0 34.0
2000-01-01 04:00:00 77.0 52.0 57.0 29.0
... ... ... ... ...
2016-12-31 20:00:00 73.0 51.0 49.0 20.0
2016-12-31 21:00:00 61.0 51.0 48.0 16.0
2016-12-31 22:00:00 57.0 49.0 45.0 14.0
2016-12-31 23:00:00 51.0 47.0 45.0 12.0

149039 rows × 4 columns

5. Exploration

Some useful methods:

head and tail


In [41]:
no2.head(3)


Out[41]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 01:00:00 108.0 NaN 65.0 47.0
2000-01-01 02:00:00 104.0 60.0 77.0 42.0
2000-01-01 03:00:00 97.0 58.0 73.0 34.0

In [42]:
no2.tail()


Out[42]:
BASCH BONAP PA18 VERS
timestamp
2016-12-31 19:00:00 77.0 49.0 52.0 23.0
2016-12-31 20:00:00 73.0 51.0 49.0 20.0
2016-12-31 21:00:00 61.0 51.0 48.0 16.0
2016-12-31 22:00:00 57.0 49.0 45.0 14.0
2016-12-31 23:00:00 51.0 47.0 45.0 12.0

info()


In [43]:
no2.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 149039 entries, 2000-01-01 01:00:00 to 2016-12-31 23:00:00
Data columns (total 4 columns):
BASCH    139949 non-null float64
BONAP    136493 non-null float64
PA18     142259 non-null float64
VERS     143813 non-null float64
dtypes: float64(4)
memory usage: 5.7 MB

Getting some basic summary statistics about the data with describe:


In [44]:
no2.describe()


Out[44]:
BASCH BONAP PA18 VERS
count 139949.000000 136493.000000 142259.000000 143813.000000
mean 89.270098 64.001714 45.104211 27.613227
std 36.772855 27.866767 23.212719 19.604953
min 4.000000 0.000000 2.000000 0.000000
25% 63.000000 44.000000 28.000000 13.000000
50% 87.000000 62.000000 42.000000 22.000000
75% 112.000000 81.000000 59.000000 38.000000
max 358.000000 345.000000 306.000000 197.000000

Quickly visualizing the data


In [45]:
no2.plot(kind='box', ylim=[0,250])


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6074f91358>

In [46]:
no2['BASCH'].plot(kind='hist', bins=50)


Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f607186deb8>
EXERCISE:
  • Plot the age distribution of the Titanic passengers

In [47]:
df['Age'].hist()


Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6071dafa58>

The default plot (when not specifying kind) is a line plot of all columns:


In [48]:
no2.plot(figsize=(12,6))


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6071868080>

This does not tell us much ..

We can select part of the data (e.g. the last 500 data points):


In [49]:
no2[-500:].plot(figsize=(12,6))


Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f606ad036d8>

Or we can use some more advanced time series features -> see further in this notebook!

6. Selecting and filtering data

ATTENTION!:

One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex compared to NumPy.

We now have to distinguish between:
  • selection by **label**
  • selection by **position**
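
The distinction matters as soon as the index is not the default 0..n-1 range; a quick sketch with a deliberately shuffled integer index:

s = pd.Series([10, 20, 30], index=[2, 1, 0])
s.loc[0]    # 30 -- the element whose *label* is 0
s.iloc[0]   # 10 -- the element at *position* 0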

In [50]:
df = pd.read_csv("data/titanic.csv")

df[] provides some convenience shortcuts

For a DataFrame, basic indexing selects the columns.

Selecting a single column:


In [51]:
df['Age']


Out[51]:
0      22.0
1      38.0
2      26.0
3      35.0
       ... 
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, dtype: float64

or multiple columns:


In [52]:
df[['Age', 'Fare']]


Out[52]:
Age Fare
0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
... ... ...
887 19.0 30.0000
888 NaN 23.4500
889 26.0 30.0000
890 32.0 7.7500

891 rows × 2 columns

But slicing accesses the rows:


In [53]:
df[10:15]


Out[53]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S

Systematic indexing with loc and iloc

When using [] like above, you can only select from one axis at a time (rows or columns, not both). For more advanced indexing, you have some extra attributes:

  • loc: selection by label
  • iloc: selection by position

In [54]:
df = df.set_index('Name')

In [55]:
df.loc['Bonnell, Miss. Elizabeth', 'Fare']


Out[55]:
26.550000000000001

In [56]:
df.loc['Bonnell, Miss. Elizabeth':'Andersson, Mr. Anders Johan', :]


Out[56]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Bonnell, Miss. Elizabeth 12 1 1 female 58.0 0 0 113783 26.550 C103 S
Saundercock, Mr. William Henry 13 0 3 male 20.0 0 0 A/5. 2151 8.050 NaN S
Andersson, Mr. Anders Johan 14 0 3 male 39.0 1 5 347082 31.275 NaN S

Selecting by position with iloc works similarly to indexing NumPy arrays:


In [57]:
df.iloc[0:2,1:3]


Out[57]:
Survived Pclass
Name
Braund, Mr. Owen Harris 0 3
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 1 1

The different indexing methods can also be used to assign data:


In [59]:
df.loc['Braund, Mr. Owen Harris', 'Survived'] = 100

In [60]:
df


Out[60]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 100 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
... ... ... ... ... ... ... ... ... ... ... ...
Graham, Miss. Margaret Edith 888 1 1 female 19.0 0 0 112053 30.0000 B42 S
Johnston, Miss. Catherine Helen "Carrie" 889 0 3 female NaN 1 2 W./C. 6607 23.4500 NaN S
Behr, Mr. Karl Howell 890 1 1 male 26.0 0 0 111369 30.0000 C148 C
Dooley, Mr. Patrick 891 0 3 male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns

Boolean indexing (filtering)

Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a WHERE clause in SQL), comparable to NumPy.

The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.


In [61]:
df['Fare'] > 50


Out[61]:
Name
Braund, Mr. Owen Harris                                False
Cumings, Mrs. John Bradley (Florence Briggs Thayer)     True
Heikkinen, Miss. Laina                                 False
Futrelle, Mrs. Jacques Heath (Lily May Peel)            True
                                                       ...  
Graham, Miss. Margaret Edith                           False
Johnston, Miss. Catherine Helen "Carrie"               False
Behr, Mr. Karl Howell                                  False
Dooley, Mr. Patrick                                    False
Name: Fare, dtype: bool

In [62]:
df[df['Fare'] > 50]


Out[62]:
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Fortune, Mr. Charles Alexander 28 0 1 male 19.0 3 2 19950 263.0000 C23 C25 C27 S
... ... ... ... ... ... ... ... ... ... ... ...
Sage, Miss. Dorothy Edith "Dolly" 864 0 3 female NaN 8 2 CA. 2343 69.5500 NaN S
Roebling, Mr. Washington Augustus II 868 0 1 male 31.0 0 0 PC 17590 50.4958 A24 S
Beckwith, Mrs. Richard Leonard (Sallie Monypeny) 872 1 1 female 47.0 1 1 11751 52.5542 D35 S
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 880 1 1 female 56.0 0 1 11767 83.1583 C50 C

160 rows × 11 columns
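
Conditions can be combined with & (and) and | (or); note that each condition needs its own parentheses. A small sketch:

df[(df['Fare'] > 50) & (df['Sex'] == 'female')]   # female passengers who paid more than 50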

EXERCISE:
  • Based on the Titanic data set, select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers.

In [63]:
df = pd.read_csv("data/titanic.csv")

In [64]:
df[df['Sex'] == 'male']


Out[64]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
... ... ... ... ... ... ... ... ... ... ... ... ...
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

577 rows × 12 columns


In [65]:
df.loc[df['Sex'] == 'male', 'Age'].mean()


Out[65]:
30.72664459161148

In [66]:
df.loc[df['Sex'] == 'female', 'Age'].mean()


Out[66]:
27.915708812260537
EXERCISE:
  • How many passengers older than 70 were on the Titanic?

In [67]:
len(df[df['Age'] > 70])


Out[67]:
5

In [68]:
(df['Age'] > 70).sum()


Out[68]:
5

7. The group-by operation

Some 'theory': the groupby operation (split-apply-combine)


In [69]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df


Out[69]:
data key
0 0 A
1 5 B
2 10 C
3 5 A
... ... ...
5 15 C
6 10 A
7 15 B
8 20 C

9 rows × 2 columns

Recap: aggregating functions

When analyzing data, you often calculate summary statistics (aggregations like the mean, max, ...). As we have seen before, we can easily calculate such a statistic for a Series or column using one of the many available methods. For example:


In [70]:
df['data'].sum()


Out[70]:
90

However, in many cases your data has certain groups in it, and in that case, you may want to calculate this statistic for each of the groups.

For example, in the above dataframe df, there is a column 'key' which has three possible values: 'A', 'B' and 'C'. When we want to calculate the sum for each of those groups, we could do the following:


In [71]:
for key in ['A', 'B', 'C']:
    print(key, df[df['key'] == key]['data'].sum())


A 15
B 30
C 45

This becomes very verbose when there are many groups. You could make the above a bit shorter by looping over the unique values automatically, but it is still not very convenient to work with.

What we did above, applying a function on different groups, is a "groupby operation", and pandas provides some convenient functionality for this.

Groupby: applying functions per group

The "group by" concept: we want to apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

Similar to SQL GROUP BY
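
To make the three steps concrete, here is a sketch that performs them by hand on the small df defined above (groupby will do all of this in one step):

# split: one sub-DataFrame per unique key
groups = {key: df[df['key'] == key] for key in df['key'].unique()}
# apply: compute the statistic for each group
sums = {key: subdf['data'].sum() for key, subdf in groups.items()}
# combine: collect the results into one labeled structure
pd.Series(sums)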

Instead of doing the manual filtering as above

df[df['key'] == "A"].sum()
df[df['key'] == "B"].sum()
...

pandas provides the groupby method to do exactly this:


In [72]:
df.groupby('key').sum()


Out[72]:
data
key
A 15
B 30
C 45

In [73]:
df.groupby('key').aggregate(np.sum)  # 'sum'


Out[73]:
data
key
A 15
B 30
C 45

And many more methods are available.


In [74]:
df.groupby('key')['data'].sum()


Out[74]:
key
A    15
B    30
C    45
Name: data, dtype: int64
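
Several aggregations can also be computed in one go with agg (a small sketch):

df.groupby('key')['data'].agg(['sum', 'mean', 'max'])   # one column per aggregation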

Application of the groupby concept on the Titanic data

We go back to the Titanic passenger survival data:


In [75]:
df = pd.read_csv("data/titanic.csv")

In [76]:
df.head()


Out[76]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
EXERCISE:
  • Using groupby(), calculate the average age for each sex.

In [77]:
df.groupby('Sex')['Age'].mean()


Out[77]:
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
EXERCISE:
  • Calculate the average survival ratio for all passengers.

In [78]:
# df['Survived'].sum() / len(df['Survived'])
df['Survived'].mean()


Out[78]:
0.3838383838383838
EXERCISE:
  • Calculate this survival ratio for all passengers younger than 25 (remember: filtering/boolean indexing).

In [79]:
df25 = df[df['Age'] <= 25]
df25['Survived'].sum() / len(df25['Survived'])


Out[79]:
0.4119601328903654
EXERCISE:
  • What is the difference in the survival ratio between the sexes?

In [80]:
# df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))
df.groupby('Sex')[['Survived']].mean()


Out[80]:
Survived
Sex
female 0.742038
male 0.188908
EXERCISE:
  • Or how does it differ between the different classes? Make a bar plot visualizing the survival ratio for the 3 classes.

In [81]:
df.groupby('Pclass')['Survived'].mean().plot(kind='bar')


Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f606bb8e9e8>
EXERCISE:
  • Make a bar plot to visualize the average Fare paid by people depending on their age. The age column is divided into separate classes using the `pd.cut` function, as provided below.

In [82]:
df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10))

In [83]:
df.groupby('AgeClass')['Fare'].mean().plot(kind='bar', rot=0)


Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f606bbae5c0>

8. Working with time series data


In [84]:
no2 = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)

When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:


In [85]:
no2.index


Out[85]:
DatetimeIndex(['2000-01-01 01:00:00', '2000-01-01 02:00:00',
               '2000-01-01 03:00:00', '2000-01-01 04:00:00',
               '2000-01-01 05:00:00', '2000-01-01 06:00:00',
               '2000-01-01 07:00:00', '2000-01-01 08:00:00',
               '2000-01-01 09:00:00', '2000-01-01 10:00:00',
               ...
               '2016-12-31 14:00:00', '2016-12-31 15:00:00',
               '2016-12-31 16:00:00', '2016-12-31 17:00:00',
               '2016-12-31 18:00:00', '2016-12-31 19:00:00',
               '2016-12-31 20:00:00', '2016-12-31 21:00:00',
               '2016-12-31 22:00:00', '2016-12-31 23:00:00'],
              dtype='datetime64[ns]', name='timestamp', length=149039, freq=None)

Indexing a time series works with strings:


In [86]:
no2["2010-01-01 09:00": "2010-01-01 12:00"]


Out[86]:
BASCH BONAP PA18 VERS
timestamp
2010-01-01 09:00:00 31.0 27.0 28.0 14.0
2010-01-01 10:00:00 41.0 31.0 30.0 14.0
2010-01-01 11:00:00 48.0 32.0 33.0 16.0
2010-01-01 12:00:00 63.0 33.0 39.0 19.0

A nice feature is "partial string" indexing, so you don't need to provide the full datetime string.

E.g. all data from January up to and including March 2012:


In [87]:
no2['2012-01':'2012-03']


Out[87]:
BASCH BONAP PA18 VERS
timestamp
2012-01-01 00:00:00 44.0 34.0 32.0 7.0
2012-01-01 01:00:00 56.0 35.0 29.0 9.0
2012-01-01 02:00:00 50.0 36.0 29.0 7.0
2012-01-01 03:00:00 46.0 34.0 22.0 8.0
... ... ... ... ...
2012-03-31 20:00:00 61.0 55.0 29.0 17.0
2012-03-31 21:00:00 46.0 49.0 32.0 14.0
2012-03-31 22:00:00 56.0 41.0 27.0 14.0
2012-03-31 23:00:00 59.0 51.0 29.0 13.0

2184 rows × 4 columns

Time and date components can be accessed from the index:


In [88]:
no2.index.hour


Out[88]:
array([ 1,  2,  3, ..., 21, 22, 23], dtype=int32)

In [89]:
no2.index.year


Out[89]:
array([2000, 2000, 2000, ..., 2016, 2016, 2016], dtype=int32)

The power of pandas: resample

A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).

Remember the air quality data:


In [90]:
no2.plot()


Out[90]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f606bb1efd0>

The time series has a frequency of 1 hour. I want to change this to daily:


In [91]:
no2.head()


Out[91]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 01:00:00 108.0 NaN 65.0 47.0
2000-01-01 02:00:00 104.0 60.0 77.0 42.0
2000-01-01 03:00:00 97.0 58.0 73.0 34.0
2000-01-01 04:00:00 77.0 52.0 57.0 29.0
2000-01-01 05:00:00 79.0 52.0 64.0 28.0

In [92]:
no2.resample('D').mean().head()


Out[92]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 83.173913 53.772727 64.695652 36.521739
2000-01-02 78.708333 59.250000 63.708333 17.166667
2000-01-03 82.333333 73.541667 61.000000 23.083333
2000-01-04 78.500000 73.708333 48.863636 23.791667
2000-01-05 94.291667 90.458333 60.166667 28.214286

Above I take the mean, but as with groupby I can also specify other methods:


In [93]:
no2.resample('D').max().head()


Out[93]:
BASCH BONAP PA18 VERS
timestamp
2000-01-01 109.0 62.0 77.0 59.0
2000-01-02 109.0 96.0 78.0 45.0
2000-01-03 120.0 102.0 72.0 37.0
2000-01-04 115.0 107.0 78.0 43.0
2000-01-05 141.0 115.0 82.0 41.0

The strings to specify the new time frequency are listed at http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases
These strings can also be combined with numbers, e.g. '10D'.
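
For example, 10-day averages (a small sketch):

no2.resample('10D').mean().head()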

Further exploring the data:


In [94]:
no2.resample('M').mean().plot() # 'A'


Out[94]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6062d42dd8>

In [95]:
# no2['2012'].resample('D').plot()

In [96]:
no2.loc['2009':, 'VERS'].resample('M').agg(['mean', 'median']).plot()


Out[96]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063065630>
EXERCISE: The evolution of the yearly averages, together with the overall mean of all stations
  • Use `resample` and `plot` to plot the yearly averages for the different stations.
  • The overall mean of all stations can be calculated by taking the mean of the different columns (`.mean(axis=1)`).

In [97]:
no2.resample('A').mean().plot()
no2.mean(axis=1).resample('A').mean().plot(color='k', linestyle='--', linewidth=4)


Out[97]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063c2bda0>
EXERCISE: what does the *typical monthly profile* look like for the different stations?
  • Add a 'month' column to the dataframe.
  • Group by the month to obtain the typical monthly averages over the different years.

First, we add a column to the dataframe that indicates the month (integer value of 1 to 12):


In [98]:
no2['month'] = no2.index.month

Now, we can calculate the mean of each month over the different years:


In [99]:
no2.groupby('month').mean()


Out[99]:
BASCH BONAP PA18 VERS
month
1 83.907542 65.387329 52.771067 30.995293
2 88.347532 67.387637 53.922040 33.890926
3 94.812926 73.588946 54.171491 35.508674
4 95.841468 71.824767 47.619196 30.184283
... ... ... ... ...
9 95.748555 65.478451 46.283842 29.060800
10 92.084073 65.832718 48.139287 29.232252
11 86.824953 62.795296 49.082940 29.815414
12 85.031296 63.964938 51.229126 31.136172

12 rows × 4 columns


In [100]:
no2.groupby('month').mean().plot()


Out[100]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063abaa20>
EXERCISE: The typical diurnal profile for the different stations
  • Similar to what we did for the month, you can now group by the hour of the day.

In [101]:
no2.groupby(no2.index.hour).mean().plot()


Out[101]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063a63470>
EXERCISE: What is the difference in the typical diurnal profile between week and weekend days for the 'BASCH' station.
  • Add a column 'weekday' defining the different days in the week.
  • Add a column 'weekend' defining whether a day is in the weekend (i.e. days 5 and 6) or not (True/False).
  • You can groupby on multiple items at the same time. In this case you would need to group by both weekend/weekday and hour of the day.

Add a column indicating the weekday:


In [102]:
no2.index.weekday?

In [103]:
no2['weekday'] = no2.index.weekday

Add a column indicating week/weekend


In [104]:
no2['weekend'] = no2['weekday'].isin([5, 6])

Now we can group by the hour of the day and the weekend indicator (or use pivot_table):


In [105]:
data_weekend = no2.groupby(['weekend', no2.index.hour]).mean()
data_weekend.head()


Out[105]:
BASCH BONAP PA18 VERS month weekday
weekend
False 0 62.683270 49.385498 41.966667 25.601584 6.522435 2.0
1 51.150107 41.151063 37.160479 22.988806 6.522435 2.0
2 44.088698 36.148094 33.933945 21.275548 6.522435 2.0
3 43.542551 33.898973 32.919567 20.782081 6.522435 2.0
4 53.439094 35.102370 35.566087 22.241620 6.522435 2.0

In [106]:
data_weekend_BASCH = data_weekend['BASCH'].unstack(level=0)
data_weekend_BASCH.head()


Out[106]:
weekend False True
0 62.683270 77.040828
1 51.150107 68.010059
2 44.088698 59.186060
3 43.542551 53.515366
4 53.439094 53.383797

In [107]:
data_weekend_BASCH.plot()


Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6063a73be0>

In [108]:
no2['hour'] = no2.index.hour
no2.pivot_table(columns='weekend', index='hour', values='BASCH')


Out[108]:
weekend False True
hour
0 62.683270 77.040828
1 51.150107 68.010059
2 44.088698 59.186060
3 43.542551 53.515366
... ... ...
20 103.314564 94.654967
21 91.401094 85.394768
22 86.642586 81.529132
23 79.266397 76.563615

24 rows × 2 columns

EXERCISE: How many exceedances of hourly values above the European limit of 200 µg/m3 are there? Count the number of exceedances of hourly values above the European limit of 200 µg/m3 for each year and station from 2005 onwards. Make a bar plot of the counts, and add a horizontal line indicating the maximum number of exceedances allowed per year (which is 18).

Hints:
  • Create a new DataFrame, called `exceedances`, (with boolean values) indicating if the threshold is exceeded or not
  • Remember that the sum of True values can be used to count elements. Do this using groupby for each year.
  • Adding a horizontal line can be done with the matplotlib function `ax.axhline`.

In [109]:
exceedances = no2 > 200

In [110]:
# group by year and count exceedances (sum of boolean)
exceedances = exceedances.groupby(exceedances.index.year).sum()

In [111]:
ax = exceedances.loc[2005:].plot(kind='bar')
ax.axhline(18, color='k', linestyle='--')


Out[111]:
<matplotlib.lines.Line2D at 0x7f60638dc048>

9. What I didn't talk about

  • Concatenating data: pd.concat
  • Merging and joining data: pd.merge
  • Reshaping data: pivot_table, melt, stack, unstack
  • Working with missing data: isnull, dropna, interpolate, ...
  • ...
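
As a small teaser for the first two, a sketch with made-up toy frames:

left = pd.DataFrame({'key': ['A', 'B'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['B', 'C'], 'rval': [3, 4]})
pd.concat([left, left])             # stack frames vertically
pd.merge(left, right, on='key')     # database-style join on the 'key' column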

Further reading