Timeseries

We have downloaded the CO2 concentration data (monthly measurements from the Mauna Loa Observatory) from https://datahub.io/core/co2-ppm.


In [1]:
import pandas as pd


In [2]:
mlo = pd.read_csv('../data/co2-mm-mlo.csv', na_values=-99.99, index_col='Date', parse_dates=True)

In [3]:
mlo.head()


Out[3]:
            Decimal Date  Average  Interpolated   Trend  Number of Days
Date
1958-03-01      1958.208   315.71        315.71  314.62              -1
1958-04-01      1958.292   317.45        317.45  315.29              -1
1958-05-01      1958.375   317.50        317.50  314.71              -1
1958-06-01      1958.458      NaN        317.10  314.85              -1
1958-07-01      1958.542   315.86        315.86  314.98              -1

In [4]:
mlo.index


Out[4]:
DatetimeIndex(['1958-03-01', '1958-04-01', '1958-05-01', '1958-06-01',
               '1958-07-01', '1958-08-01', '1958-09-01', '1958-10-01',
               '1958-11-01', '1958-12-01',
               ...
               '2016-03-01', '2016-04-01', '2016-05-01', '2016-06-01',
               '2016-07-01', '2016-08-01', '2016-09-01', '2016-10-01',
               '2016-11-01', '2016-12-01'],
              dtype='datetime64[ns]', name='Date', length=706, freq=None)

In [5]:
import matplotlib
%matplotlib inline

In [6]:
mlo['Average'].plot()


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fab1b039080>

mlo['Average'] is a timeseries: it is a Series object with an index of dtype datetime64 (from NumPy).
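
We can check this directly (a quick sketch, not part of the original notebook outputs):

avg = mlo['Average']
print(type(avg))          # <class 'pandas.core.series.Series'>
print(avg.index.dtype)    # datetime64[ns]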


In [7]:
pd.date_range('2017-09-01', periods=5, freq='D')


Out[7]:
DatetimeIndex(['2017-09-01', '2017-09-02', '2017-09-03', '2017-09-04',
               '2017-09-05'],
              dtype='datetime64[ns]', freq='D')

In [8]:
n_hours = 24
hour_index = pd.date_range('2017-09-01', periods=n_hours, freq='H')
hour_index


Out[8]:
DatetimeIndex(['2017-09-01 00:00:00', '2017-09-01 01:00:00',
               '2017-09-01 02:00:00', '2017-09-01 03:00:00',
               '2017-09-01 04:00:00', '2017-09-01 05:00:00',
               '2017-09-01 06:00:00', '2017-09-01 07:00:00',
               '2017-09-01 08:00:00', '2017-09-01 09:00:00',
               '2017-09-01 10:00:00', '2017-09-01 11:00:00',
               '2017-09-01 12:00:00', '2017-09-01 13:00:00',
               '2017-09-01 14:00:00', '2017-09-01 15:00:00',
               '2017-09-01 16:00:00', '2017-09-01 17:00:00',
               '2017-09-01 18:00:00', '2017-09-01 19:00:00',
               '2017-09-01 20:00:00', '2017-09-01 21:00:00',
               '2017-09-01 22:00:00', '2017-09-01 23:00:00'],
              dtype='datetime64[ns]', freq='H')

In [9]:
import numpy as np

In [10]:
pd.Series(np.random.rand(n_hours), index=hour_index).plot()


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fab1b05a7b8>

Rolling window operations

We may want to smooth out seasonal fluctuations by computing a rolling (or moving) average.


In [11]:
mlo['Interpolated'].notnull().value_counts()


Out[11]:
True    706
Name: Interpolated, dtype: int64

In [12]:
s = mlo['Interpolated']

In [13]:
s.plot()


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fab18f20b38>

Let us select only the first two years of the s timeseries. Note that string indexing works.
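
For comparison, slicing with an explicit Timestamp gives the same result (a minimal check, using only objects already defined above):

s[:'1960-03-01'].equals(s[:pd.Timestamp('1960-03-01')])   # True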


In [14]:
s[:'1960-03-01'].plot()


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fab18e974a8>

Even partial string indexing works!


In [15]:
s[:'1960-03'].plot()


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fab18dff7f0>

In [16]:
s[:'1960-01'].rolling(12).mean()


Out[16]:
Date
1958-03-01           NaN
1958-04-01           NaN
1958-05-01           NaN
1958-06-01           NaN
1958-07-01           NaN
1958-08-01           NaN
1958-09-01           NaN
1958-10-01           NaN
1958-11-01           NaN
1958-12-01           NaN
1959-01-01           NaN
1959-02-01    315.367500
1959-03-01    315.450833
1959-04-01    315.473333
1959-05-01    315.539167
1959-06-01    315.626667
1959-07-01    315.683333
1959-08-01    315.672500
1959-09-01    315.725833
1959-10-01    315.775833
1959-11-01    315.898333
1959-12-01    315.974167
1960-01-01    316.041667
Name: Interpolated, dtype: float64
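
The first non-NaN value (at 1959-02) is simply the arithmetic mean of the first twelve interpolated values; a quick sanity check (illustrative, not from the original notebook):

manual = s.iloc[:12].mean()              # mean of 1958-03 .. 1959-02
rolled = s.rolling(12).mean().iloc[11]   # first non-NaN rolling value
print(manual, rolled)                    # both are approximately 315.3675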

In [17]:
s[:'1960-01'].rolling(12).mean().plot()


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fab18c3cf98>

In [18]:
s.rolling(12).mean().plot()


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fab18f3c048>

Let us create a DataFrame that contains all of mlo plus this rolling average as a new column (labelled smooth).


In [19]:
df = mlo.assign(smooth=s.rolling(12).mean())
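
Note that .assign() returns a new DataFrame and leaves mlo untouched; an equivalent, more verbose sketch would be:

df = mlo.copy()                      # copy so mlo itself is not modified
df['smooth'] = s.rolling(12).mean()  # aligned on the shared DatetimeIndex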

In [20]:
df[['Trend', 'smooth']].plot()


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fab18ad67f0>

Hands-on exercises

  1. We can specify a window type when computing the rolling operation. What do you expect s.rolling(12, win_type='triang').mean() should yield?
  2. Plot it to check your assumption, then plot the difference between this smoothed series and mlo['Trend'] (see the sketch after this list for one way to compare two series).
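
One possible starting point for the comparison in exercise 2 (a sketch; win_type windows assume SciPy is installed):

tri = s.rolling(12, win_type='triang').mean()  # exercise 1
(tri - mlo['Trend']).plot()                    # exercise 2: difference with the trend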

The PeriodIndex object

Using .rolling() on a time-based index may look similar to resampling, but the two differ: .rolling() slides a window over the existing observations and returns one value per original point, while .resample() groups the observations into regular frequency bins and returns one value per bin.
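
To make the distinction concrete (an illustrative comparison; the exact bin count depends on the chosen frequency):

rolled = s.rolling(12).mean()          # same length as s: one value per month
resampled = s.resample('12M').mean()   # one value per 12-month bin
len(s), len(rolled), len(resampled)    # 706, 706, and roughly 60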


In [21]:
s.index


Out[21]:
DatetimeIndex(['1958-03-01', '1958-04-01', '1958-05-01', '1958-06-01',
               '1958-07-01', '1958-08-01', '1958-09-01', '1958-10-01',
               '1958-11-01', '1958-12-01',
               ...
               '2016-03-01', '2016-04-01', '2016-05-01', '2016-06-01',
               '2016-07-01', '2016-08-01', '2016-09-01', '2016-10-01',
               '2016-11-01', '2016-12-01'],
              dtype='datetime64[ns]', name='Date', length=706, freq=None)

In [22]:
s['1958-03':'1958-06']


Out[22]:
Date
1958-03-01    315.71
1958-04-01    317.45
1958-05-01    317.50
1958-06-01    317.10
Name: Interpolated, dtype: float64

Notice that each value is associated with a point in time (the most common kind of timeseries data), but it should really be associated with a time interval: each value holds for an entire month. For this, pandas provides a Period object, as opposed to the Timestamp object we have used so far.


In [23]:
pd.Timestamp('1958-03-01')


Out[23]:
Timestamp('1958-03-01 00:00:00')

In [24]:
pd.Period('1958-03-01', freq='M')


Out[24]:
Period('1958-03', 'M')
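
A Period really does stand for the whole interval; its boundaries can be inspected (a small illustration):

p = pd.Period('1958-03', freq='M')
print(p.start_time)   # first instant of March 1958
print(p.end_time)     # last instant of March 1958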

In [25]:
monthly_index = pd.period_range('1958-03-01', periods=706, freq='M')
monthly_index


Out[25]:
PeriodIndex(['1958-03', '1958-04', '1958-05', '1958-06', '1958-07', '1958-08',
             '1958-09', '1958-10', '1958-11', '1958-12',
             ...
             '2016-03', '2016-04', '2016-05', '2016-06', '2016-07', '2016-08',
             '2016-09', '2016-10', '2016-11', '2016-12'],
            dtype='period[M]', length=706, freq='M')

In [26]:
s.index = monthly_index

In [27]:
s['1958']


Out[27]:
1958-03    315.71
1958-04    317.45
1958-05    317.50
1958-06    317.10
1958-07    315.86
1958-08    314.93
1958-09    313.20
1958-10    312.66
1958-11    313.33
1958-12    314.67
Freq: M, Name: Interpolated, dtype: float64

Resampling

We can down-sample the timeseries (go to a lower frequency), for example if we are interested in the minimum value over 3-month bins (for a list of convenient frequency aliases, see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases).


In [28]:
s.head(15)


Out[28]:
1958-03    315.71
1958-04    317.45
1958-05    317.50
1958-06    317.10
1958-07    315.86
1958-08    314.93
1958-09    313.20
1958-10    312.66
1958-11    313.33
1958-12    314.67
1959-01    315.62
1959-02    316.38
1959-03    316.71
1959-04    317.72
1959-05    318.29
Freq: M, Name: Interpolated, dtype: float64

In [29]:
re = s.resample('3M').min()
re.head()


Out[29]:
1958-03-31    315.71
1958-06-30    317.10
1958-09-30    313.20
1958-12-31    312.66
1959-03-31    315.62
Freq: 3M, Name: Interpolated, dtype: float64
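
Any aggregation can be applied to the 3-month bins, not only the minimum; for instance (a sketch):

s.resample('3M').agg(['min', 'max', 'mean']).head()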

If we wanted to compute the difference between re values and mlo['Trend'] values, we would first have to up-sample re (go back to a higher frequency).


In [30]:
up = re.resample('M').asfreq()
up.head(10)


Out[30]:
1958-03-31    315.71
1958-04-30       NaN
1958-05-31       NaN
1958-06-30    317.10
1958-07-31       NaN
1958-08-31       NaN
1958-09-30    313.20
1958-10-31       NaN
1958-11-30       NaN
1958-12-31    312.66
Freq: M, Name: Interpolated, dtype: float64
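
The NaN gaps introduced by up-sampling can be filled if needed, for example (two possibilities among several):

re.resample('M').ffill().head()                  # repeat the last 3-monthly value
re.resample('M').asfreq().interpolate().head()   # linear interpolation between bins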

Hands-on exercises

  1. How long is re? (We mean the number of elements, not the duration in time!)
  2. How would you go about computing the (relative) difference between up and mlo['Trend']?