We have downloaded datasets from https://datahub.io/core/co2-ppm.
In [1]:
import pandas as pd
In [2]:
mlo = pd.read_csv('../data/co2-mm-mlo.csv', na_values=-99.99, index_col='Date', parse_dates=True)
In [3]:
mlo.head()
Out[3]:
In [4]:
mlo.index
Out[4]:
In [5]:
import matplotlib
%matplotlib inline
In [6]:
mlo['Average'].plot()
Out[6]:
mlo['Average'] is a timeseries: it is a Series object with an index of dtype datetime64 (from NumPy).
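As a quick check of this statement, the sketch below builds a small Series by hand and inspects its index. The name toy and the three values are made up for illustration; they are not read from the Mauna Loa file.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly values, not the real CO2 record
dates = pd.to_datetime(['1958-03-01', '1958-04-01', '1958-05-01'])
toy = pd.Series([315.7, 317.5, 317.5], index=dates)

# The index is a DatetimeIndex ...
print(type(toy.index).__name__)
# ... whose dtype is a NumPy datetime64
print(toy.index.dtype)
is_datetime = np.issubdtype(toy.index.dtype, np.datetime64)
print(is_datetime)
```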
In [7]:
pd.date_range('2017-09-01', periods=5, freq='D')
Out[7]:
In [8]:
n_hours = 24
hour_index = pd.date_range('2017-09-01', periods=n_hours, freq='H')
hour_index
Out[8]:
In [9]:
import numpy as np
In [10]:
pd.Series(np.random.rand(n_hours), index=hour_index).plot()
Out[10]:
We may want to smooth out seasonal fluctuations by computing a rolling (or moving) average.
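To see why a 12-month window removes a yearly cycle, here is a minimal sketch on synthetic data (a made-up linear trend plus a sine wave with a 12-month period; the names toy and smooth are only for this illustration). Because the window spans exactly one cycle, the seasonal part averages to zero and only the trend remains. Note that the default window is trailing, so the smoothed value at a given month is the mean of that month and the 11 before it.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a 12-month seasonal cycle
idx = pd.date_range('2000-01-01', periods=48, freq='MS')
months = np.arange(48)
trend = 300 + 0.1 * months
seasonal = 3 * np.sin(2 * np.pi * months / 12)
toy = pd.Series(trend + seasonal, index=idx)

# Trailing 12-month mean: each window covers one full seasonal cycle
smooth = toy.rolling(12).mean()

# The first 11 entries are NaN because the window is not yet full
print(smooth.isna().sum())
print(smooth.dropna().head())
```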
In [11]:
mlo['Interpolated'].notnull().value_counts()
Out[11]:
In [12]:
s = mlo['Interpolated']
In [13]:
s.plot()
Out[13]:
Let us select only the first two years of the s timeseries. Note that string indexing works.
In [14]:
s[:'1960-03-01'].plot()
Out[14]:
Even partial string indexing works!
In [15]:
s[:'1960-03'].plot()
Out[15]:
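The two kinds of string indexing can be compared on a small synthetic series (toy is a made-up stand-in for s, with one value per month starting in March 1958). A partial string such as '1960-03' stands for the whole month, and a bare year selects every entry in that year.

```python
import pandas as pd

# Synthetic stand-in for s: 36 monthly values starting in March 1958
idx = pd.date_range('1958-03-01', periods=36, freq='MS')
toy = pd.Series(range(36), index=idx)

# Slice with a full date string: the end point is included
upto_day = toy[:'1960-03-01']
# Slice with a partial string: everything up to the end of March 1960
upto_month = toy[:'1960-03']
# A year string selects all entries falling in that year
year_1959 = toy.loc['1959']

print(len(upto_day), len(upto_month), len(year_1959))
```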
In [16]:
s[:'1960-01'].rolling(12).mean()
Out[16]:
In [17]:
s[:'1960-01'].rolling(12).mean().plot()
Out[17]:
In [18]:
s.rolling(12).mean().plot()
Out[18]:
Let us create a DataFrame which stores mlo plus this rolling average in a new column (labelled smooth).
In [19]:
df = mlo.assign(smooth=s.rolling(12).mean())
In [20]:
df[['Trend', 'smooth']].plot()
Out[20]:
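One detail worth noting about assign is that it returns a new DataFrame and leaves the original untouched. A minimal sketch on a made-up frame (toy_df and its value column are hypothetical names, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single column of monthly values
idx = pd.date_range('2000-01-01', periods=24, freq='MS')
toy_df = pd.DataFrame({'value': np.arange(24, dtype=float)}, index=idx)

# assign() returns a new DataFrame; toy_df itself is left unchanged
with_smooth = toy_df.assign(smooth=toy_df['value'].rolling(12).mean())

print(list(toy_df.columns))
print(list(with_smooth.columns))
```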
What should s.rolling(12, win_type='triang').mean() yield? Compare the result with mlo['Trend'].
Using .rolling() with a time-based index is similar to resampling; .rolling() is a time-based window operation, while .resample() is a frequency-based window operation.
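The difference between the two operations is easiest to see side by side. The sketch below uses six made-up daily values and a 3-day window in both cases: rolling returns one value per input row, computed over a window ending at that row, while resample returns one value per non-overlapping bin.

```python
import pandas as pd

# Six made-up daily values
idx = pd.date_range('2017-09-01', periods=6, freq='D')
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# rolling: one output per input row; each window covers the 3 days ending at that row
rolled = toy.rolling('3D').sum()
# resample: one output per 3-day bin; bins do not overlap
binned = toy.resample('3D').sum()

print(rolled)
print(binned)
```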
In [21]:
s.index
Out[21]:
In [22]:
s['1958-03':'1958-06']
Out[22]:
Notice that each value is associated with a point in time (the most common kind of timeseries data), but really it should be associated with a time interval, since the value holds for the entire month. pandas provides a Period object for this purpose, in contrast to the Timestamp object used so far.
In [23]:
pd.Timestamp('1958-03-01')
Out[23]:
In [24]:
pd.Period('1958-03-01', freq='M')
Out[24]:
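A short sketch of how the two objects differ: a Timestamp is a single instant, whereas a monthly Period has a start and an end, and any instant inside the month maps to the same Period. The variable names are only for this illustration.

```python
import pandas as pd

ts = pd.Timestamp('1958-03-01')          # a single instant
per = pd.Period('1958-03-01', freq='M')  # the whole of March 1958

# A Period knows where its interval begins and ends
print(per.start_time)
print(per.end_time)

# A timestamp anywhere inside the month maps to the same period
same = pd.Timestamp('1958-03-17').to_period('M') == per
print(same)
```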
In [25]:
monthly_index = pd.period_range('1958-03-01', periods=706, freq='M')
monthly_index
Out[25]:
In [26]:
s.index = monthly_index
In [27]:
s['1958']
Out[27]:
We can down-sample the timeseries (that is, go to a lower frequency), for instance if we are interested in the minimum value over 3-month bins (for a list of convenient frequency aliases, see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases).
In [28]:
s.head(15)
Out[28]:
In [29]:
re = s.resample('3M').min()
re.head()
Out[29]:
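The binning can be checked by hand on a small synthetic series. This sketch is an assumption-laden variant of the cell above: it uses a DatetimeIndex with the month-start alias 'MS' (so '3MS' bins) rather than the PeriodIndex and '3M' used in the notebook, and the values 0 to 11 are made up so that each bin's minimum is obvious.

```python
import pandas as pd

# Twelve made-up monthly values, labelled at the start of each month
idx = pd.date_range('1958-03-01', periods=12, freq='MS')
toy = pd.Series(range(12), index=idx, dtype=float)

# Group into consecutive 3-month bins and keep the minimum of each bin
down = toy.resample('3MS').min()
print(down)
```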
If we wanted to compute the difference between re values and mlo['Trend'] values, we would have to begin with up-sampling re.
In [30]:
up = re.resample('M').asfreq()
up.head(10)
Out[30]:
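To illustrate what up-sampling with asfreq does to the data, here is a self-contained sketch on synthetic values (toy, down and diff are hypothetical names; the index is a month-start DatetimeIndex rather than the notebook's PeriodIndex). asfreq only inserts the missing labels and fills them with NaN, so a subsequent subtraction is defined only where the coarse series had a value.

```python
import pandas as pd

# Twelve made-up monthly values and their 3-month minima
idx = pd.date_range('1958-03-01', periods=12, freq='MS')
toy = pd.Series(range(12), index=idx, dtype=float)
down = toy.resample('3MS').min()

# Up-sample back to monthly; asfreq() inserts NaN rather than inventing values
up = down.resample('MS').asfreq()

# Arithmetic aligns on the index; NaN propagates to the months without a value
diff = toy - up
print(up)
print(diff.dropna())
```

If values are wanted for every month, a fill method such as ffill() can be used in place of asfreq(); which one is appropriate depends on what the coarse values are meant to represent.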
How long is re? (We mean the number of elements, not the duration in time!) What about up and mlo['Trend']?