Introduction to Pandas
In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.max_rows = 8
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.head()
Starting from reading this dataset, we can answer questions about the data in a few lines of code:
What is the age distribution of the passengers?
In [ ]:
df['Age'].hist()
How does the survival rate of the passengers differ between sexes?
In [ ]:
df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))
Or how does it differ between the different classes?
In [ ]:
df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')
All the needed functionality for the above examples will be explained throughout this tutorial.
AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations in Europe
Starting from these hourly data for different stations:
In [ ]:
data = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)
In [ ]:
data.head()
to answering questions about this data in a few lines of code:
Does the air pollution show a decreasing trend over the years?
In [ ]:
data['1999':].resample('M').mean().plot(ylim=[0,120])
In [ ]:
data['1999':].resample('A').mean().plot(ylim=[0,100])
What is the difference in diurnal profile between weekdays and weekend?
In [ ]:
data['weekday'] = data.index.weekday
data['weekend'] = data['weekday'].isin([5, 6])
data_weekend = data.groupby(['weekend', data.index.hour])['BASCH'].mean().unstack(level=0)
data_weekend.plot()
We will come back to these examples, and build them up step by step.
For data-intensive work in Python, the pandas library has become essential.
What is pandas?
R's data.frame in Python.
Its documentation: http://pandas.pydata.org/pandas-docs/stable/
When do you need pandas?
When working with tabular or structured data (like an R data.frame, SQL table, Excel spreadsheet, ...).
DataFrame and Series
A DataFrame is a tabular data structure (a multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects which share the same index.
In [ ]:
df
In [ ]:
df.index
In [ ]:
df.columns
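Each column is itself a Series sharing the DataFrame's index; a minimal sketch to verify this with the titanic df loaded above:
In [ ]:
s = df['Age']                  # a single column is a Series
type(s), s.index is df.index   # the Series shares the DataFrame's index object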
To check the data types of the different columns:
In [ ]:
df.dtypes
An overview of that information can be given with the info() method:
In [ ]:
df.info()
A DataFrame also has a values attribute, but beware: with heterogeneous data, all values will be upcast to a common type:
In [ ]:
df.values
Apart from importing your data from an external source (text file, Excel, database, ...), one of the most common ways of creating a dataframe is from a dictionary of arrays or lists.
Note that in the IPython notebook, the dataframe will display in a rich HTML view:
In [ ]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
'population': [11.3, 64.3, 81.3, 16.9, 64.9],
'area': [30510, 671308, 357050, 41526, 244820],
'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
df_countries = pd.DataFrame(data)
df_countries
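A Series can be constructed in a similar way, e.g. from a dictionary (a small illustrative example, not part of the original data):
In [ ]:
population = pd.Series({'Belgium': 11.3, 'France': 64.3, 'Germany': 81.3})
population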
In [ ]:
df['Age']
In [ ]:
age = df['Age']
In [ ]:
age.index
You can access the underlying numpy array representation with the .values attribute:
In [ ]:
age.values[:10]
We can access series values via the index, just like for NumPy arrays:
In [ ]:
age[0]
Unlike the NumPy array, though, this index can be something other than integers:
In [ ]:
df = df.set_index('Name')
df
In [ ]:
age = df['Age']
age
In [ ]:
age['Dooley, Mr. Patrick']
Series combine label-based access with the power of numpy arrays: many things you can do with numpy arrays can also be applied to DataFrames / Series.
E.g. element-wise operations:
In [ ]:
age * 1000
A range of methods:
In [ ]:
age.mean()
Fancy indexing, like indexing with a list or boolean indexing:
In [ ]:
age[age > 70]
But also a lot of pandas-specific methods are available, e.g.
In [ ]:
df['Embarked'].value_counts()
In [ ]:
# %load snippets/02-pandas_introduction31.py
In [ ]:
# %load snippets/02-pandas_introduction32.py
In [ ]:
# %load snippets/02-pandas_introduction33.py
A wide range of input/output formats are natively supported by pandas:
In [ ]:
#pd.read
In [ ]:
#df.to
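To name a few of those formats (some readers/writers require optional dependencies, e.g. for Excel or SQL):
In [ ]:
# readers: pd.read_csv, pd.read_excel, pd.read_json, pd.read_sql, pd.read_html, ...
# writers: df.to_csv, df.to_excel, df.to_json, df.to_sql, df.to_html, ...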
Very powerful csv reader:
In [ ]:
pd.read_csv?
Luckily, if we have a well-formed csv file, we don't need many of those arguments:
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.head()
In [ ]:
# %load snippets/02-pandas_introduction39.py
In [ ]:
no2
Some useful methods:
head and tail
In [ ]:
no2.head(3)
In [ ]:
no2.tail()
info()
In [ ]:
no2.info()
Getting some basic summary statistics about the data with describe:
In [ ]:
no2.describe()
Quickly visualizing the data
In [ ]:
no2.plot(kind='box', ylim=[0,250])
In [ ]:
no2['BASCH'].plot(kind='hist', bins=50)
In [ ]:
# %load snippets/02-pandas_introduction47.py
The default plot (when not specifying kind) is a line plot of all columns:
In [ ]:
no2.plot(figsize=(12,6))
This does not tell us much...
We can select part of the data (e.g. the last 500 data points):
In [ ]:
no2[-500:].plot(figsize=(12,6))
Or we can use some more advanced time series features -> see further in this notebook!
In [ ]:
df = pd.read_csv("data/titanic.csv")
For a DataFrame, basic indexing selects the columns.
Selecting a single column:
In [ ]:
df['Age']
or multiple columns:
In [ ]:
df[['Age', 'Fare']]
But, slicing accesses the rows:
In [ ]:
df[10:15]
In [ ]:
df = df.set_index('Name')
In [ ]:
df.loc['Bonnell, Miss. Elizabeth', 'Fare']
In [ ]:
df.loc['Bonnell, Miss. Elizabeth':'Andersson, Mr. Anders Johan', :]
Selecting by position with iloc works similarly to indexing numpy arrays:
In [ ]:
df.iloc[0:2,1:3]
The different indexing methods can also be used to assign data:
In [ ]:
df.loc['Braund, Mr. Owen Harris', 'Survived'] = 100
In [ ]:
df
Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL) and is comparable to numpy.
The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
In [ ]:
df['Fare'] > 50
In [ ]:
df[df['Fare'] > 50]
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
# %load snippets/02-pandas_introduction64.py
In [ ]:
# %load snippets/02-pandas_introduction65.py
In [ ]:
# %load snippets/02-pandas_introduction66.py
In [ ]:
# %load snippets/02-pandas_introduction67.py
In [ ]:
# %load snippets/02-pandas_introduction68.py
In [ ]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
When analyzing data, you often calculate summary statistics (aggregations like the mean, max, ...). As we have seen before, we can easily calculate such a statistic for a Series or column using one of the many available methods. For example:
In [ ]:
df['data'].sum()
However, in many cases your data has certain groups in it, and in that case, you may want to calculate this statistic for each of the groups.
For example, in the above dataframe df, there is a column 'key' which has three possible values: 'A', 'B' and 'C'. When we want to calculate the sum for each of those groups, we could do the following:
In [ ]:
for key in ['A', 'B', 'C']:
print(key, df[df['key'] == key]['data'].sum())
This becomes very verbose when there are many groups, and even with the loop over the different values it is not very convenient to work with.
What we did above, applying a function on different groups, is a "groupby operation", and pandas provides some convenient functionality for this.
The "group by" concept: we want to apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets
This operation is also referred to as the "split-apply-combine" operation, involving the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
Similar to SQL GROUP BY
Instead of doing the manual filtering as above
df[df['key'] == "A"]['data'].sum()
df[df['key'] == "B"]['data'].sum()
...
pandas provides the groupby method to do exactly this:
In [ ]:
df.groupby('key').sum()
In [ ]:
df.groupby('key').aggregate(np.sum) # 'sum'
And many more methods are available.
In [ ]:
df.groupby('key')['data'].sum()
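To compute several statistics at once, agg also accepts a list of functions; a quick sketch:
In [ ]:
df.groupby('key')['data'].agg(['sum', 'mean', 'max'])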
We go back to the titanic passengers survival data:
In [ ]:
df = pd.read_csv("data/titanic.csv")
In [ ]:
df.head()
In [ ]:
# %load snippets/02-pandas_introduction77.py
In [ ]:
# %load snippets/02-pandas_introduction78.py
In [ ]:
# %load snippets/02-pandas_introduction79.py
In [ ]:
# %load snippets/02-pandas_introduction80.py
In [ ]:
# %load snippets/02-pandas_introduction81.py
In [ ]:
df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10))
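With this new column we can, for example, look at the survival rate per age class (a sketch of one possibility; the snippet below may differ):
In [ ]:
df.groupby('AgeClass')['Survived'].mean().plot(kind='bar')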
In [ ]:
# %load snippets/02-pandas_introduction83.py
In [ ]:
no2 = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)
When we ensure the DataFrame has a DatetimeIndex, time series-related functionality becomes available:
In [ ]:
no2.index
Indexing a time series works with strings:
In [ ]:
no2["2010-01-01 09:00": "2010-01-01 12:00"]
A nice feature is "partial string" indexing, so you don't need to provide the full datetime string.
E.g. all data of January up to March 2012:
In [ ]:
no2['2012-01':'2012-03']
Time and date components can be accessed from the index:
In [ ]:
no2.index.hour
In [ ]:
no2.index.year
A very powerful method is resample: converting the frequency of the time series (e.g. from hourly to daily data).
Remember the air quality data:
In [ ]:
no2.plot()
The time series has a frequency of 1 hour. I want to change this to daily:
In [ ]:
no2.head()
In [ ]:
no2.resample('D').mean().head()
Above I take the mean, but as with groupby, I can also specify other methods:
In [ ]:
no2.resample('D').max().head()
The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases
These strings can also be combined with numbers, e.g. '10D'.
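For example, resampling to 10-day bins (a quick sketch):
In [ ]:
no2.resample('10D').mean().head()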
Further exploring the data:
In [ ]:
no2.resample('M').mean().plot() # 'A'
In [ ]:
# no2['2012'].resample('D').mean().plot()
In [ ]:
# %load snippets/02-pandas_introduction96.py
In [ ]:
# %load snippets/02-pandas_introduction97.py
First, we add a column to the dataframe that indicates the month (integer value of 1 to 12):
In [ ]:
# %load snippets/02-pandas_introduction98.py
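In case the snippet file is not at hand, a one-line sketch that achieves this:
In [ ]:
no2['month'] = no2.index.month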
Now, we can calculate the mean of each month over the different years:
In [ ]:
# %load snippets/02-pandas_introduction99.py
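Again, a sketch of a possible solution using the new 'month' column:
In [ ]:
no2.groupby('month').mean()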
In [ ]:
# %load snippets/02-pandas_introduction100.py
In [ ]:
# %load snippets/02-pandas_introduction101.py
Add a column indicating the weekday:
In [ ]:
no2.index.weekday?
In [ ]:
# %load snippets/02-pandas_introduction103.py
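A sketch mirroring the example at the top of this notebook:
In [ ]:
no2['weekday'] = no2.index.weekday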
Add a column indicating week/weekend
In [ ]:
# %load snippets/02-pandas_introduction104.py
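Again mirroring the opening example, a sketch:
In [ ]:
no2['weekend'] = no2['weekday'].isin([5, 6])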
Now we can groupby the hour of the day and the weekend (or use pivot_table):
In [ ]:
# %load snippets/02-pandas_introduction105.py
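A sketch of the groupby approach, mirroring the opening example (using the 'BASCH' station column):
In [ ]:
data_weekend = no2.groupby(['weekend', no2.index.hour])['BASCH'].mean().unstack(level=0)
data_weekend.plot()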
In [ ]:
# %load snippets/02-pandas_introduction106.py
In [ ]:
# %load snippets/02-pandas_introduction107.py
In [ ]:
# %load snippets/02-pandas_introduction108.py
In [ ]:
# %load snippets/02-pandas_introduction109.py
In [ ]:
# %load snippets/02-pandas_introduction110.py
In [ ]:
# %load snippets/02-pandas_introduction111.py
Some more functionality we did not cover:
pd.concat
pd.merge
pivot_table, melt, stack, unstack
isnull, dropna, interpolate, ...
Further reading:
Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/
Books
Tutorials (many good online tutorials!)
Tom Augspurger's blog