pandas is a Python package providing convenient data structures to work with labelled data.
It is well suited to observational and statistical data sets, and has many similarities with Excel spreadsheets.
Key features include labelled data structures (Series and DataFrame), convenient handling of missing data, easy reading and writing of formats such as CSV and Excel, and rich time-series functionality.
pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment alongside many other third-party libraries.
The measured quantities are: ozone ($O_3$), nitrogen oxides (NOx), carbon monoxide (CO) and PM10 particulate matter.
Data source: https://uk-air.defra.gov.uk/data/
First, we import the pandas module, using the alias "pd" to keep the code shorter.
In [1]:
import pandas as pd
We also import the os module, which is useful for building paths to files (among many other things), along with NumPy and Matplotlib just in case.
In [2]:
import matplotlib.pyplot as plt
import numpy as np
import os
%matplotlib inline
In [3]:
# Suppress warning messages to keep the notebook output tidy
import warnings
warnings.filterwarnings('ignore')
In [4]:
fname = '../data/air_quality_hourly_london_marylebone.csv'
Let's try to read the data using pandas.read_csv() function.
In [5]:
# minimal setup to read the given file
air_quality = pd.read_csv(fname, header=4, skipfooter=4, na_values='No data', engine='python')
Q: What happens if you remove the header? skipfooter? engine?
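To see why these arguments are needed, we can peek at the first few lines of the raw file (a minimal sketch using plain Python file I/O; the exact header layout depends on the file):
# Print the first few raw lines of the CSV to inspect its header block
with open(fname) as f:
    for _ in range(7):
        print(f.readline().rstrip())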
Let's interrogate the DataFrame object!
In [6]:
type(air_quality)
Out[6]:
In [7]:
# Internal nature of the object
print(air_quality.shape)
print()
print(air_quality.dtypes)
In [8]:
# View just the tip of data
air_quality.head(5)
Out[8]:
Q: What did you notice about the "Status" columns? Compare them to the original text file.
In [9]:
# View the last rows of data
air_quality.tail(n=2) # Note the optional argument (available for head() too)
Out[9]:
Get descriptors for the vertical axis (axis=0):
In [10]:
air_quality.index
Out[10]:
Get descriptors for the horizontal axis (axis=1):
In [11]:
air_quality.columns
Out[11]:
A lot of information at once including memory usage:
In [12]:
air_quality.info()
A series can be constructed with the pd.Series constructor (passing an array of values) or from a DataFrame, by extracting one of its columns.
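As a quick illustration of the first approach, a Series can be built directly from a list of values (a toy sketch with made-up numbers):
# A small Series built from scratch (hypothetical example values)
s = pd.Series([2.1, 0.5, 1.7], index=['a', 'b', 'c'], name='example')
s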
In [13]:
carbon_monoxide = air_quality['Carbon monoxide']
Some of its attributes:
In [14]:
print(type(carbon_monoxide))
print(carbon_monoxide.dtype)
print(carbon_monoxide.shape)
print(carbon_monoxide.nbytes)
Show me what you got!
In [15]:
carbon_monoxide
Out[15]:
It is always possible to fall back to a good old NumPy array to pass on to scientific libraries that need one: SciPy, scikit-learn, etc.
In [16]:
air_quality['Nitrogen oxides as nitrogen dioxide'].values
Out[16]:
In [17]:
type(air_quality['Nitrogen oxides as nitrogen dioxide'].values)
Out[17]:
The truth about data science: cleaning your data is 90% of the work. Fitting the model is easy. Interpreting the results is the other 90%.
— Jake VanderPlas (@jakevdp) June 13, 2016
In [18]:
# A custom parser could instead be passed via the date_parser argument, e.g.:
# def dateparse(date_str, time_str):
#     return pd.datetime.strptime(date_str + time_str, '%Y-%m-%d%H:%M:%S')
# Note: any '24:00:00' end times would need special handling, e.g. via
# pd.to_timedelta((time_str == '24:00:00').astype(int), unit='d')
In [19]:
air_quality = pd.read_csv(fname, header=4, skipfooter=4, na_values='No data', engine='python',
                          parse_dates={'Time': [0, 1]})
In [20]:
air_quality.columns = ['Time', 'O3', 'O3_status', 'NOx', 'NOx_status',
                       'CO', 'CO_status', 'PM10', 'PM10_status', 'Co', 'Co_status']
air_quality.columns
Out[20]:
Let us concentrate on the first four chemical species and remove the cobalt data from our DataFrame:
In [21]:
air_quality = air_quality.drop(['Co', 'Co_status'], axis=1)
In [22]:
air_quality.head()
Out[22]:
Try calling the plot() method of the air_quality object:
In [23]:
# air_quality.plot()
What happens if you pass subplots=True as an argument to the plot() method?
In [24]:
# air_quality.plot( ... )
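One way to try it (a sketch; the figure size is an arbitrary choice):
# air_quality.plot(subplots=True, figsize=(10, 8))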
It is easy to create other useful plots using a DataFrame:
In [25]:
fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(10, 4))
air_quality.boxplot(ax=ax0, column=['O3', 'PM10'])
air_quality.O3.plot(ax=ax1, kind="kde")
Out[25]:
As well as just a simple line plot:
In [26]:
air_quality.O3.plot(grid=True, figsize=(12, 2))
Out[26]:
As you may notice, we have negative values of ozone concentration, which does not make sense. So, let us replace those negative values with NaN:
In [27]:
air_quality[air_quality.O3.values < 0]
Out[27]:
We can mask them out in the same way as with NumPy arrays:
In [28]:
# Replace negative ozone values with NaN (using .loc to avoid chained assignment)
air_quality.loc[air_quality.O3 < 0, 'O3'] = np.nan
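As a quick sanity check, we can confirm that no negative ozone values remain:
# Count remaining negative O3 values; should be zero after the replacement
(air_quality.O3 < 0).sum()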
For each read_* function that loads data, there is a corresponding to_* method attached to Series and DataFrame objects.
Uncomment the following code cell and run it to save the whole DataFrame to an Excel file.
In [29]:
# with pd.ExcelWriter("test.xls") as our_writer:
#     air_quality.to_excel(our_writer, sheet_name='Blah-blah')
Find a method to save a DataFrame to a text file (or whatever format you prefer).
In [30]:
# Your code here
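One possible solution sketch, using the to_csv method (the file name 'test.csv' is just an example):
# air_quality.to_csv('test.csv', index=False)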
In [31]:
air_quality.describe()
Out[31]:
We can quickly explore pairwise relationships between the columns of the DataFrame using a fancy scatter_matrix function.
In [32]:
from pandas.plotting import scatter_matrix  # in older pandas versions: pandas.tools.plotting
In [33]:
with plt.style.context('ggplot'):
    scatter_matrix(air_quality, figsize=(7, 7))
Both Series and DataFrames have a corr() method to compute the correlation coefficient.
In [34]:
air_quality.NOx.corr(air_quality['CO'])
Out[34]:
If series are already grouped into a DataFrame, computing all correlation coefficients is trivial:
In [35]:
air_quality.corr()
Out[35]:
If you want to visualise this correlation matrix, uncomment the following code cell.
In [36]:
# fig, ax = plt.subplots()
# p = ax.imshow(air_quality.corr(), interpolation="nearest", cmap='RdBu_r', vmin=-1, vmax=1)
# ax.set_xticks(np.arange(len(air_quality.corr().columns)))
# ax.set_yticks(np.arange(len(air_quality.corr().index)))
# ax.set_xticklabels(air_quality.corr().columns)
# ax.set_yticklabels(air_quality.corr().index)
# fig.colorbar(p)
A DataFrame can also be created manually, by grouping several Series together. Below we read Series objects from two CSV files and then build a DataFrame by combining the two Series.
In [37]:
soi_df = pd.read_csv('../data/soi.csv', skiprows=1, parse_dates=[0], index_col=0, na_values=-999.9,
                     date_parser=lambda x: pd.datetime.strptime(x, '%Y%m'))
In [38]:
olr_df = pd.read_csv('../data/olr.csv', skiprows=1, parse_dates=[0], index_col=0, na_values=-999.9,
                     date_parser=lambda x: pd.datetime.strptime(x, '%Y%m'))
In [39]:
df = pd.DataFrame({'OLR': olr_df.Value,
                   'SOI': soi_df.Value})
In [40]:
df.describe()
Out[40]:
In [41]:
df.plot()
Out[41]:
The recommended way to build ordinary least squares (OLS) regressions is by using statsmodels.
In [42]:
import statsmodels.formula.api as sm
In [43]:
sm_model = sm.ols(formula="SOI ~ OLR", data=df).fit()
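To inspect the fitted model, we can print its summary table (standard statsmodels output):
print(sm_model.summary())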
In [44]:
# df['SOI'].plot()
# df['OLR'].plot()
# ax = sm_model.fittedvalues.plot(label="model prediction")
# ax.legend(loc="lower center", ncol=3)
Exercise: create a scatter plot of SOI against OLR using the df.plot function with the appropriate keywords (hint: try edgecolors='none').
In [45]:
# your code here
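One possible solution sketch (the keyword choices mirror the coloured version shown below):
# df.plot(kind='scatter', x='OLR', y='SOI', edgecolors='none')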
Using the power of matplotlib, we can create a scatter plot with points coloured by the date index. To do this we need to import one additional submodule:
In [46]:
import matplotlib.dates as mdates
Convert the numpy.datetime64 objects (the indices of our DataFrame) to matplotlib floating point numbers. These numbers represent the number of days (the fractional part represents hours, minutes and seconds) since 0001-01-01 00:00:00 UTC, assuming the Gregorian calendar.
In [47]:
mdt = mdates.date2num(df.index.astype(pd.datetime))
Append the new data to the original DataFrame:
In [48]:
df['mpl_date'] = mdt
Create a scatter plot
In [49]:
ax = df.plot(kind='scatter', x='OLR', y='SOI', c='mpl_date',
             colormap='viridis', colorbar=False, edgecolors='none')
plt.colorbar(ax.collections[0], ticks=mdates.YearLocator(5),
             format=mdates.DateFormatter('%Y'))
Out[49]:
1. Subset the data: select the 1992-2015 period from the SOI DataFrame
In [50]:
sub_soi_df = soi_df['1992':'2015']
sub_soi_df.head()
Out[50]:
2. Plot the subset data, using either matplotlib.pyplot or the plot() method of the pandas DataFrame
In [51]:
sub_soi_df.plot(lw=0.5, marker='d', ms=3, linestyle='-', color='k', figsize=(8, 3), grid=True)
Out[51]:
3. Explore what the rolling() method does
In [52]:
# df.rolling?
In [53]:
roll = sub_soi_df.rolling(window=10, center=False)
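Note that (typically) the first window - 1 values of the rolling mean are NaN, until the window is full:
# The first 9 values are NaN because a full 10-value window is required by default
roll.mean().head(12)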
4. Plot the original series and the smoothed series
In [54]:
fig, ax = plt.subplots()
ax.plot(sub_soi_df, label='SOI')
ax.plot(roll.mean(), label='mean')
leg = ax.legend()