Looking at a simple time series of data

In this example, we will look at a sample time series from the Atmospheric Radiation Measurement (ARM) Program's permanent site in the Eastern North Atlantic (ENA). This example also includes one method for reading data from a netCDF file, a topic covered in more depth in a separate session. Let's jump right into it.

First, let us define the path for the file. I prefer to define paths using the os.path module from the Python standard library. This tends to cause less confusion down the road when reading or writing files located in directories other than the current working directory.


In [1]:
# import the `os` package
import os

file_path = os.path.abspath('enametC1.b1.20140531.000000.cdf')
print(file_path)


/Users/pydev/Documents/bitbucket/notebooks/python_tutorials/pandas_intro/enametC1.b1.20140531.000000.cdf

As you can see, using os.path.abspath() returns the full path of the file that I want to use. Next, we'll go through the code for importing the netCDF file into Python.
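
As a small illustration of why this helps, the sketch below shows that os.path.abspath() resolves a relative name against the current working directory, and that os.path.join() can build a path to a file stored elsewhere. The directory name `data_dir` is hypothetical and only used for illustration.

```python
import os

# File name from the example above
fname = 'enametC1.b1.20140531.000000.cdf'

# abspath() resolves a relative name against the current working directory
full_path = os.path.abspath(fname)
print(full_path)

# For a file stored in another directory, build the path explicitly.
# 'data_dir' is a hypothetical directory name.
other_path = os.path.abspath(os.path.join('data_dir', fname))
print(other_path)

# abspath() only manipulates the string; it does not check that the file exists
print(os.path.exists(other_path))
```

Because abspath() never touches the file system, it is worth checking os.path.exists() before handing the path to a reader.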


In [2]:
# import the Dataset class from the netCDF4 package
from netCDF4 import Dataset

# open the file and bind it to the variable 'data'
data = Dataset(file_path)

# get the temperature field, print the dimensions and the units
temp = data.variables['temp_mean']
print('Temperature dimensions: {}, units: {}'.format(temp.dimensions, temp.units))

# get the time field as well
time = data.variables['time']
print('Time units: {}'.format(time.units))


Temperature dimensions: (u'time',), units: degC
Time units: seconds since 2014-05-31 00:00:00 0:00

So far, so good: we've opened our netCDF file and loaded it into our Python namespace. We have our variable temp, which is in degC and has a single dimension, time, and we have the time variable, which is in units of seconds since YYYY-MM-DD HH:MM:SS 0:00. Yeah, great, but what about pandas? We're going to use the Series class from pandas to handle this time series data:


In [3]:
import pandas as pd
from pandas import Series

tseries = Series(temp[:])
tseries.plot()


Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a330350>

We've been able to make this nice plot with very little fuss. Let's take a look at what we actually made when we used the Series class:


In [4]:
tseries.head()


Out[4]:
0    18.950001
1    19.000000
2    19.020000
3    18.959999
4    19.040001
dtype: float32

As you can see, this is simply a series of data with the index on the left and the value on the right. This doesn't really help with the time part of the time series though, since all we have for the index is literally the position of each value. What if we want more information? How about we index the series with the time variable that we took a look at above? What do we get then?


In [5]:
# let's just reset tseries
tseries = None

# now redefine it
tseries = Series(temp[:], index=time[:])
tseries.head()


Out[5]:
0      18.950001
60     19.000000
120    19.020000
180    18.959999
240    19.040001
dtype: float32

Well, that certainly looks better. We now have a clearer picture of what the data really looks like: 1-minute averages of temperature. The index of our Series object is now in units of time (seconds since the start of the file), which is slightly better. However, much of the time savings and functionality of pandas comes when we have the index in a more descriptive format: the Python datetime object.
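
To make the difference between the two kinds of index concrete, here is a minimal sketch using a synthetic stand-in for the first five samples (the values are rounded copies of the output above, not read from the file). It contrasts selection by label with `.loc` against selection by position with `.iloc`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the file contents: five 1-minute samples
seconds = np.arange(0, 300, 60)  # 0, 60, 120, 180, 240
values = np.array([18.95, 19.00, 19.02, 18.96, 19.04], dtype='float32')

s = pd.Series(values, index=seconds)

# .loc selects by index label (seconds since the start of the file)
by_label = s.loc[120]

# .iloc selects by integer position, regardless of the labels
by_position = s.iloc[2]

# Both refer to the same sample here
print(by_label, by_position)
```

Selecting by seconds works, but the labels still say nothing about the calendar date, which is exactly what a datetime index adds.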

One way to build a datetime index is to use the netCDF variable base_time, a scalar value reporting the seconds since 1970-01-01. Passing it to datetime.datetime.utcfromtimestamp() gives a Python datetime object representing the start of the file. We can then add the offsets in the time variable to compute the Python datetime at each point.


In [6]:
import datetime

base_time = datetime.datetime.utcfromtimestamp(data.variables['base_time'][:])
print(base_time)

dtindex = base_time + pd.to_timedelta(time[:], unit='s')
print(dtindex)

tseries = None
tseries = Series(temp[:], index=dtindex)
tseries.head()


2014-05-31 00:00:00
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-05-31 00:00:00, ..., 2014-05-31 23:59:00]
Length: 1440, Freq: None, Timezone: None
Out[6]:
2014-05-31 00:00:00    18.950001
2014-05-31 00:01:00    19.000000
2014-05-31 00:02:00    19.020000
2014-05-31 00:03:00    18.959999
2014-05-31 00:04:00    19.040001
dtype: float32
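
Note that datetime.datetime.utcfromtimestamp() is deprecated as of Python 3.12. A minimal sketch of an alternative, assuming the same layout (an epoch `base_time` in seconds plus per-sample offsets in seconds), lets pandas do the conversion directly. The numbers below are synthetic stand-ins rather than values read from the file.

```python
import datetime

import numpy as np
import pandas as pd

# Synthetic stand-ins: 2014-05-31 00:00:00 UTC as seconds since 1970-01-01,
# and offsets for five 1-minute samples
base_time = 1401494400
offsets = np.arange(0, 300, 60)

# Convert the epoch seconds to a pandas Timestamp, then add the offsets
start = pd.to_datetime(base_time, unit='s')
dtindex = start + pd.to_timedelta(offsets, unit='s')
print(dtindex)

# Standard-library alternative that returns a timezone-aware datetime
aware = datetime.datetime.fromtimestamp(base_time, tz=datetime.timezone.utc)
print(aware)
```

The pandas route yields a timezone-naive index, matching the output above; the standard-library call yields an aware datetime, so the two should not be mixed without converting one of them.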

In [7]:
tseries.plot()


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a423210>

Now we can see that the index of the pandas Series is represented as a date and time, and the plot shows a nicely formatted x-axis with dates and times. Much nicer. But wait, there's more! What if we'd really just prefer to plot hourly averages? Or perhaps we just want to look at the data from noon onwards?


In [8]:
# Resample to 1-hour bins, computing the minimum, mean, and maximum of each bin
tseries_hourly = tseries.resample('1H', how=['min', 'mean', 'max'])
tseries_hourly.plot(style='.-')


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a3e3a90>
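
The `how=` keyword shown above was deprecated and later removed from `resample()` in newer pandas releases. A hedged sketch of the equivalent call on newer versions chains the aggregation onto the resampler instead. The series below is synthetic (a smooth daily cycle), not the ENA data.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute series covering one day (1440 samples)
index = pd.date_range('2014-05-31', periods=1440, freq='min')
values = 20 + 3 * np.sin(np.linspace(0, 2 * np.pi, 1440))
tseries = pd.Series(values, index=index)

# Resample to 1-hour bins and compute several statistics per bin.
# '60min' is used here to sidestep differences in how pandas versions
# spell the hourly frequency alias.
hourly = tseries.resample('60min').agg(['min', 'mean', 'max'])
print(hourly.head())
```

The result is a DataFrame with one row per hour and one column per statistic, so `hourly.plot(style='.-')` draws three lines, as in the cell above.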

In [9]:
# Looking at a subset of the data
tseries['2014-05-31 12:00':].head()


Out[9]:
2014-05-31 12:00:00    22.840000
2014-05-31 12:01:00    23.000000
2014-05-31 12:02:00    23.190001
2014-05-31 12:03:00    23.219999
2014-05-31 12:04:00    23.070000
dtype: float32
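
Label slicing on a datetime index also accepts an end point, and unlike positional slicing, both ends are included. The sketch below uses a synthetic series in place of the ENA data, and shows `between_time()` as one way to select by time of day.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute series covering one day
index = pd.date_range('2014-05-31', periods=1440, freq='min')
tseries = pd.Series(np.arange(1440, dtype='float32'), index=index)

# Both end labels are included in a label-based slice
half_hour = tseries.loc['2014-05-31 12:00':'2014-05-31 12:30']

# Everything from noon to the end of the record
afternoon = tseries.loc['2014-05-31 12:00':]

# Select by time of day only, ignoring the date
lunch = tseries.between_time('12:00', '12:59')

print(len(half_hour), len(afternoon), len(lunch))
```

Using `.loc` makes it explicit that the slice is by label; the shorter `tseries['2014-05-31 12:00':]` form used above does the same thing.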

Putting it all together

This might seem like a lot of code so far just to make a simple little plot. Chances are, however, that you're going to be reading and writing a lot of netCDF files, or CSV files, or whatever. So it might be useful to define a function that you can then use in other applications. This is what one such function (and its subsequent use) might look like:


In [10]:
import os

def cdf_to_series(netcdf_file, varname):
    """Takes the path to a netCDF file and a variable name and returns a pandas Series object.
    
    This function requires the netCDF file to have a `base_time` variable with units of
    'seconds since 1970-01-01' and a `time` variable with units of seconds.
    """
    
    # import the necessary packages and modules needed within the function
    from netCDF4 import Dataset
    import pandas as pd
    import datetime
    
    # define our netCDF dataset and open it with "with ... as ... :"
    with Dataset(netcdf_file, 'r') as D:
    
        # turn it into a series.
        # note: this combines most of the steps from above into a single step
        S = pd.Series(D.variables[varname][:], 
                  index=datetime.datetime.utcfromtimestamp(D.variables['base_time'][:]) + 
                        pd.to_timedelta(D.variables['time'][:], unit='s'))
    
    # return the pandas Series
    return S

# define the location of the netCDF file
netcdf_file = os.path.abspath('enametC1.b1.20140531.000000.cdf')

# we can print out the help documentation for the function we defined earlier
help(cdf_to_series)

# call the function and look at the data
tseries = cdf_to_series(netcdf_file, 'temp_mean')
tseries.head()


Help on function cdf_to_series in module __main__:

cdf_to_series(netcdf_file, varname)
    Takes the path to a netCDF file and a variable name and returns a pandas Series object.
    
    This function requires the netCDF file to have a `base_time` variable with units of
    'seconds since 1970-01-01' and a `time` variable with units of seconds.

Out[10]:
2014-05-31 00:00:00    18.950001
2014-05-31 00:01:00    19.000000
2014-05-31 00:02:00    19.020000
2014-05-31 00:03:00    18.959999
2014-05-31 00:04:00    19.040001
dtype: float32
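
The same pattern carries over to the other formats mentioned above. As a rough sketch, assuming a CSV file with a `time` column of timestamps and one column per variable, a counterpart could look like the following. The layout, the helper name `csv_to_series`, and the `timecol` parameter are all hypothetical and not part of the original notebook; the data is an in-memory stand-in for a real file.

```python
import io

import pandas as pd


def csv_to_series(csv_source, varname, timecol='time'):
    """Read a CSV source and return one column as a datetime-indexed Series.

    Assumes (hypothetically) that `timecol` holds parseable timestamps.
    """
    # parse_dates=True asks pandas to parse the index column as datetimes
    df = pd.read_csv(csv_source, index_col=timecol, parse_dates=True)
    return df[varname]


# In-memory stand-in for a CSV file on disk
csv_text = io.StringIO(
    "time,temp_mean\n"
    "2014-05-31 00:00:00,18.95\n"
    "2014-05-31 00:01:00,19.00\n"
    "2014-05-31 00:02:00,19.02\n"
)

series = csv_to_series(csv_text, 'temp_mean')
print(series)
```

A file path can be passed in place of the StringIO object, since pd.read_csv() accepts either.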

Summary

You can start to get a sense that once you have data in a pandas Series object, it gets a bit easier to do complex operations involving data alignment and averaging. What I've shown in this notebook is only a sneak peek at the power behind pandas for scientific data analysis. In subsequent sections I will discuss the DataFrame object, how to group and manipulate data, how to join different datasets, and how to plot the results.