Introduction to pandas

  • pandas is a Python package providing convenient data structures to work with labelled data.

  • pandas is well suited to observational / statistical data sets and has many similarities with Excel spreadsheets.

  • Key features:

    • easy handling of missing data
    • size mutability: columns can be inserted and deleted from DataFrame
    • automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
    • powerful, flexible group by functionality to perform split-apply-combine operations on data sets
    • easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
    • intelligent label-based slicing, fancy indexing, and subsetting of large data sets
    • intuitive merging and joining data sets
    • flexible reshaping and pivoting of data sets
    • hierarchical labeling of axes (possible to have multiple labels per tick)
    • robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
    • time series-specific functionality

Primary data structures of pandas

  • Series (1-dimensional)
  • DataFrame (2-dimensional)

pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
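
As a quick orientation, here is a minimal, self-contained sketch of the two structures (the values and labels are made up for illustration):

import pandas as pd

# a Series is a 1D labelled array
example_series = pd.Series([0.3, 0.5, 0.4], index=['Jan', 'Feb', 'Mar'])

# a DataFrame is a 2D table of labelled columns; each column is a Series
example_df = pd.DataFrame({'O3': [13.7, 8.6], 'CO': [0.38, 0.54]})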

Loading data

First, we import the pandas module. We use the alias "pd" to keep the code shorter.


In [1]:
import pandas as pd

We also import the os module, which is useful for building paths to files (among many other things), along with numpy and matplotlib, just in case.


In [2]:
import matplotlib.pyplot as plt
import numpy as np
import os
%matplotlib inline

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
fname = '../data/air_quality_hourly_london_marylebone.csv'
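
Since the os module was imported above for building paths, the same file name could also be assembled in a more portable way, e.g.:

# fname = os.path.join('..', 'data', 'air_quality_hourly_london_marylebone.csv')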

Let's try to read the data using the pandas.read_csv() function.


In [5]:
# minimal setup to read the given file
air_quality = pd.read_csv(fname, header=4, skipfooter=4, na_values='No data', engine='python')

Q: What happens if you remove the header? skipfooter? engine?

Data structures: DataFrame and Series

Let's interrogate the DataFrame object!


In [6]:
type(air_quality)


Out[6]:
pandas.core.frame.DataFrame

In [7]:
# Internal nature of the object
print(air_quality.shape)
print()
print(air_quality.dtypes)


(8760, 12)

Start date                                    object
End Date                                      object
Ozone                                        float64
Status                                        object
Nitrogen oxides as nitrogen dioxide          float64
Status.1                                      object
Carbon monoxide                              float64
Status.2                                      object
PM10 particulate matter (Hourly measured)    float64
Status.3                                      object
Cobalt                                       float64
Status.4                                     float64
dtype: object

In [8]:
# View just the tip of data
air_quality.head(5)


Out[8]:
Start date End Date Ozone Status Nitrogen oxides as nitrogen dioxide Status.1 Carbon monoxide Status.2 PM10 particulate matter (Hourly measured) Status.3 Cobalt Status.4
0 2015-01-01 01:00:00 13.70381 V ugm-3 177.01526 V ugm-3 0.375260 V mgm-3 39.3 V ugm-3 (TEOM FDMS) NaN NaN
1 2015-01-01 02:00:00 8.58151 V ugm-3 294.46380 V ugm-3 0.542517 V mgm-3 41.1 V ugm-3 (TEOM FDMS) NaN NaN
2 2015-01-01 03:00:00 9.77893 V ugm-3 209.99537 V ugm-3 0.406306 V mgm-3 35.1 V ugm-3 (TEOM FDMS) NaN NaN
3 2015-01-01 04:00:00 13.96990 V ugm-3 160.89863 V ugm-3 0.281445 V mgm-3 27.6 V ugm-3 (TEOM FDMS) NaN NaN
4 2015-01-01 05:00:00 15.66625 V ugm-3 153.89362 V ugm-3 0.250303 V mgm-3 27.9 V ugm-3 (TEOM FDMS) NaN NaN

Q: What did you notice about the "Status" columns? Compare them to the original text file.


In [9]:
# View the last rows of data
air_quality.tail(n=2)  # Note the optional argument (available for head() too)


Out[9]:
Start date End Date Ozone Status Nitrogen oxides as nitrogen dioxide Status.1 Carbon monoxide Status.2 PM10 particulate matter (Hourly measured) Status.3 Cobalt Status.4
8758 2015-12-31 23:00:00 5.63785 V ugm-3 323.28183 V ugm-3 0.842881 V mgm-3 NaN V ugm-3 (TEOM FDMS) NaN NaN
8759 2015-12-31 24:00:00 4.83957 V ugm-3 290.91803 V ugm-3 0.722095 V mgm-3 NaN V ugm-3 (TEOM FDMS) NaN NaN

Get descriptors for the vertical axis (axis=0):


In [10]:
air_quality.index


Out[10]:
RangeIndex(start=0, stop=8760, step=1)

Get descriptors for the horizontal axis (axis=1):


In [11]:
air_quality.columns


Out[11]:
Index(['Start date', 'End Date', 'Ozone', 'Status',
       'Nitrogen oxides as nitrogen dioxide', 'Status.1', 'Carbon monoxide',
       'Status.2', 'PM10 particulate matter (Hourly measured)', 'Status.3',
       'Cobalt', 'Status.4'],
      dtype='object')

The info() method gives a lot of information at once, including memory usage:


In [12]:
air_quality.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 12 columns):
Start date                                   8760 non-null object
End Date                                     8760 non-null object
Ozone                                        8601 non-null float64
Status                                       8760 non-null object
Nitrogen oxides as nitrogen dioxide          8640 non-null float64
Status.1                                     8760 non-null object
Carbon monoxide                              8539 non-null float64
Status.2                                     8760 non-null object
PM10 particulate matter (Hourly measured)    8502 non-null float64
Status.3                                     8760 non-null object
Cobalt                                       0 non-null float64
Status.4                                     0 non-null float64
dtypes: float64(6), object(6)
memory usage: 821.3+ KB

Series, pandas' 1D data container

A Series can be constructed with the pd.Series constructor (passing an array of values) or from a DataFrame, by extracting one of its columns.
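
For reference, the constructor route looks like this (the values and labels are made up for illustration); in the next cell we take the second route and extract a column from air_quality:

pd.Series(np.array([3.1, 2.4, 0.5]), index=['a', 'b', 'c'])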


In [13]:
carbon_monoxide = air_quality['Carbon monoxide']

Some of its attributes:


In [14]:
print(type(carbon_monoxide))
print(carbon_monoxide.dtype)
print(carbon_monoxide.shape)
print(carbon_monoxide.nbytes)


<class 'pandas.core.series.Series'>
float64
(8760,)
70080

Show me what you got!


In [15]:
# uncomment to see the values
# carbon_monoxide

NumPy as pandas' backend

It is always possible to fall back to a good old NumPy array to pass to scientific libraries that need one: SciPy, scikit-learn, etc.


In [16]:
air_quality['Nitrogen oxides as nitrogen dioxide'].values


Out[16]:
array([ 177.01526,  294.4638 ,  209.99537, ...,  314.07697,  323.28183,
        290.91803])

In [17]:
type(air_quality['Nitrogen oxides as nitrogen dioxide'].values)


Out[17]:
numpy.ndarray
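
Note that in recent pandas versions (0.24 and later) the to_numpy() method is the recommended way to get the underlying array:

# air_quality['Nitrogen oxides as nitrogen dioxide'].to_numpy()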

Cleaning data

Dealing with dates and times


In [18]:
# A custom date parser could handle the '24:00:00' end times, e.g. by adding a
# one-day timedelta whenever the time is '24:00:00' and parsing it as '00:00:00':
# def dateparse(date_str, time_str):
#     ...

In [19]:
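# parse_dates={'Time': [0, 1]} combines the first two columns ('Start date' and
# 'End Date') into a single column called 'Time', parsed as datetimes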
air_quality = pd.read_csv(fname, header=4, skipfooter=4, na_values='No data', engine='python',
                          parse_dates={'Time': [0, 1]})

Renaming columns


In [20]:
air_quality.columns = ['Time', 'O3', 'O3_status', 'NOx', 'NOx_status',
                       'CO', 'CO_status', 'PM10', 'PM10_status', 'Co', 'Co_status']
air_quality.columns


Out[20]:
Index(['Time', 'O3', 'O3_status', 'NOx', 'NOx_status', 'CO', 'CO_status',
       'PM10', 'PM10_status', 'Co', 'Co_status'],
      dtype='object')

Deleting columns

Let us concentrate on the first four chemical species and remove the cobalt data from our DataFrame:


In [21]:
# drop the cobalt value and status columns
air_quality = air_quality.drop(columns=['Co', 'Co_status'])

In [22]:
air_quality.head()


Out[22]:
Time O3 O3_status NOx NOx_status CO CO_status PM10 PM10_status
0 2015-01-01 01:00:00 13.70381 V ugm-3 177.01526 V ugm-3 0.375260 V mgm-3 39.3 V ugm-3 (TEOM FDMS)
1 2015-01-01 02:00:00 8.58151 V ugm-3 294.46380 V ugm-3 0.542517 V mgm-3 41.1 V ugm-3 (TEOM FDMS)
2 2015-01-01 03:00:00 9.77893 V ugm-3 209.99537 V ugm-3 0.406306 V mgm-3 35.1 V ugm-3 (TEOM FDMS)
3 2015-01-01 04:00:00 13.96990 V ugm-3 160.89863 V ugm-3 0.281445 V mgm-3 27.6 V ugm-3 (TEOM FDMS)
4 2015-01-01 05:00:00 15.66625 V ugm-3 153.89362 V ugm-3 0.250303 V mgm-3 27.9 V ugm-3 (TEOM FDMS)

Basic visualisation

Exercise

Try calling the plot() method of the air_quality object:


In [23]:
# air_quality.plot()

What happens if you pass subplots=True as an argument to the plot() method?


In [24]:
# air_quality.plot( ... )

It is easy to create other useful plots from a DataFrame:


In [25]:
fig, (ax0, ax1) = plt.subplots(ncols=2)
air_quality.boxplot(ax=ax0, column=['O3', 'PM10'])
air_quality.O3.plot(ax=ax1, kind="kde")


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6752d109e8>

Q: the plots are too small. How can you make them bigger?

As well as just a simple line plot:


In [26]:
air_quality.O3.plot(grid=True, figsize=(12, 2))


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f674cc77ac8>

Setting missing values

As you may notice, we have negative values of ozone concentration, which does not make sense. So, let us replace those negative values with NaN:

Q: how to list data entries with negative O3?


In [27]:
# your code here

We can mask them out in the same way as with NumPy arrays:


In [28]:
# Replace negative ozone values with NaN
air_quality.loc[air_quality.O3 < 0, 'O3'] = np.nan

Saving data

For each read_* function that loads data, there is a corresponding to_* method attached to Series and DataFrame objects.

Excel spreadsheets

Uncomment the following code cell and run it to save the whole DataFrame to an Excel file.


In [29]:
# with pd.ExcelWriter("test.xls") as our_writer:
#     air_quality.to_excel(our_writer, sheet_name='Blah-blah')

Exercise: writing to CSV text files

Find a method to save a DataFrame to a text file (or whatever format you prefer).


In [30]:
# Your code here

Some statistics


In [31]:
# air_quality.describe()

Correlations and regressions

  • Are there correlations between the time series we loaded?
  • First, let's take a glance at the whole DataFrame using the handy scatter_matrix function.

In [32]:
from pandas.plotting import scatter_matrix  # was pandas.tools.plotting in older pandas versions

In [33]:
# with plt.style.context('ggplot'):
#     scatter_matrix(air_quality, figsize=(7, 7))

Computing correlations

Both Series and DataFrames have a corr() method to compute the correlation coefficient.


In [34]:
air_quality.NOx.corr(air_quality['CO'])


Out[34]:
0.7712297105293433
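
This is the Pearson correlation coefficient, which corr() computes by default; rank-based alternatives can be requested with the method argument, for example:

# air_quality.NOx.corr(air_quality['CO'], method='spearman')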

If series are already grouped into a DataFrame, computing all correlation coefficients is trivial:


In [35]:
air_quality.corr()


Out[35]:
O3 NOx CO PM10
O3 1.000000 -0.616442 -0.562254 -0.448305
NOx -0.616442 1.000000 0.771230 0.529790
CO -0.562254 0.771230 1.000000 0.532399
PM10 -0.448305 0.529790 0.532399 1.000000

If you want to visualise this correlation matrix, uncomment the following code cell.


In [36]:
# fig, ax = plt.subplots()
# p = ax.imshow(air_quality.corr(), interpolation="nearest", cmap='RdBu_r', vmin=-1, vmax=1)
# ax.set_xticks(np.arange(len(air_quality.corr().columns)))
# ax.set_yticks(np.arange(len(air_quality.corr().index)))
# ax.set_xticklabels(air_quality.corr().columns)
# ax.set_yticklabels(air_quality.corr().index)
# fig.colorbar(p)

Creating DataFrames

  • A DataFrame can also be created manually, by grouping several Series together.
  • Now, just for fun, we switch to another dataset:
    • create 2 Series objects from 2 CSV files
    • create a DataFrame by combining the two Series
  • Data are monthly values of
    • Southern Oscillation Index (SOI)
    • Outgoing Longwave Radiation (OLR), which is a proxy for convective precipitation in the western equatorial Pacific
  • Data were downloaded from NOAA's website: https://www.ncdc.noaa.gov/teleconnections/

In [37]:
soi_df = pd.read_csv('../data/soi.csv', skiprows=1, parse_dates=[0], index_col=0, na_values=-999.9,
                     date_parser=lambda x: pd.to_datetime(x, format='%Y%m'))

In [38]:
olr_df = pd.read_csv('../data/olr.csv', skiprows=1, parse_dates=[0], index_col=0, na_values=-999.9,
                     date_parser=lambda x: pd.to_datetime(x, format='%Y%m'))

In [39]:
df = pd.DataFrame({'OLR': olr_df.Value,
                   'SOI': soi_df.Value})

In [40]:
# df.describe()

Ordinary Least Squares (OLS) regressions

The recommended way to build ordinary least squares regressions is with statsmodels.


In [41]:
# import statsmodels.formula.api as sm

In [42]:
# sm_model = sm.ols(formula="SOI ~ OLR", data=df).fit()
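
Once the model has been fitted, the result object exposes the estimated coefficients and a diagnostic summary (a sketch, assuming the two cells above have been uncommented and run):

# sm_model.params      # estimated intercept and OLR coefficient
# sm_model.summary()   # full regression table (R-squared, p-values, ...)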

In [43]:
# df['SOI'].plot()
# df['OLR'].plot()
# ax = sm_model.fittedvalues.plot(label="model prediction")
# ax.legend(loc="lower center", ncol=3)

Exercise: create a scatter plot

  • You can use the df.plot function with the appropriate keywords
  • What happens if you use the "c=" keyword?
  • Pass another parameter: edgecolors='none'

In [44]:
# your code here

Advanced scatter plot

Using the power of matplotlib, we can create a scatter plot with points coloured by the date index. To do this we need to import one additional submodule:


In [45]:
# import matplotlib.dates as mdates

Convert the numpy.datetime64 objects (the indices of our DataFrame) to matplotlib floating-point date numbers. These numbers represent the number of days (the fractional part representing hours, minutes, and seconds) since 0001-01-01 00:00:00 UTC (assuming the Gregorian calendar).


In [46]:
# mdt = mdates.date2num(df.index.to_pydatetime())

Append the new data to the original DataFrame:


In [47]:
# df['mpl_date'] = mdt

Create a scatter plot


In [48]:
# ax = df.plot(kind='scatter', x='OLR', y='SOI', c='mpl_date',
#              colormap='viridis', colorbar=False, edgecolors='none')
# plt.colorbar(ax.collections[0], ticks=mdates.YearLocator(5), 
#              format=mdates.DateFormatter('%Y'))

Exercise: rolling functions

1. Subset data

  • Start by subsetting the SOI DataFrame
  • Use either numerical indices, or, even better, datetime indices

In [49]:
# your code here

2. Plot the subset data

  • You can create figure and axis using matplotlib.pyplot
  • Or just use the plot() method of pandas DataFrame

In [50]:
# your code here

3. Explore what the rolling() method is

  • What does this method return?

In [51]:
# df.rolling?

In [52]:
# your code here

4. Plot the original series and the smoothed series


In [53]:
# your code here