In this example, we will look at a sample time series from the Atmospheric Radiation Measurement (ARM) Program's permanent site in the Eastern North Atlantic (ENA). This example also includes one method for reading data from a netCDF file, a topic covered in more depth in a different session. Let's jump right into it.
First, let us define the path for the file. I prefer to define paths using the os.path module from the Python standard library. This tends to lead to less confusion down the road when reading or writing files located in directories other than the current working directory.
In [1]:
# import the `os` package
import os
file_path = os.path.abspath('enametC1.b1.20140531.000000.cdf')
print(file_path)
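As a side note, here is a minimal sketch of how os.path behaves for a file that lives in another directory. The directory and file names below are hypothetical and exist only for illustration; nothing is read from disk.

```python
import os

# Hypothetical directory and file names, used only for illustration
data_dir = os.path.join('data', 'ena')
relative_path = os.path.join(data_dir, 'example.cdf')

# abspath() resolves a relative path against the current working directory
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))
print(os.path.isabs(absolute_path))
print(absolute_path)
```

Building the path with os.path.join() rather than by concatenating strings lets Python choose the right separator for the operating system.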
As you can see, os.path.abspath() returns the full path of the file that I want to use. Next, we'll go through the code for importing the netCDF file into Python.
In [2]:
# import the Dataset class from the netCDF4 package
from netCDF4 import Dataset
# import the file as variable 'data'
data = Dataset(file_path)
# get the temperature field, print the dimensions and the units
temp = data.variables['temp_mean']
print('Temperature dimensions: {}, units: {}'.format(temp.dimensions, temp.units))
# get the time field as well
time = data.variables['time']
print('Time units: {}'.format(time.units))
So that's all fine and great: we've opened our netCDF file and pulled two of its variables into our Python namespace. We have the variable temp, which is in degC and has the dimension time, and we have the time variable, which is in units of seconds since YYYY-MM-DD HH:MM:SS 0:00. Yeah, great, but what about pandas? We're going to use the Series class from pandas to handle this time series data:
In [3]:
import pandas as pd
from pandas import Series
tseries = Series(temp[:])
tseries.plot()
Out[3]:
We've been able to make this nice pretty plot with very little fuss. Let's take a look at what we actually made when we used the Series class:
In [4]:
tseries.head()
Out[4]:
As you can see, this is simply a series of data with the index on the left and the value on the right. This doesn't really help with the time part of the time series though, since all we have for the index is literally the position of each value. What if we want more information? How about we index the series with the time variable that we took a look at above? What do we get then?
In [5]:
# let's just reset tseries
tseries = None
# now redefine it
tseries = Series(temp[:], index=time[:])
tseries.head()
Out[5]:
Well, that certainly looks better. We now have a clearer picture of what the data really looks like: 1-minute averages of temperature. The index of our Series object is now in the units of time, which is slightly better. However, much of the time savings and functionality of pandas comes when we have the index in a more descriptive format: the Python datetime object.
One way to do this is to use the netCDF variable base_time, which is a scalar value reporting the seconds since 1970-01-01. Using this, we can call datetime.datetime.utcfromtimestamp() to turn it into a Python datetime object representing the start of the file. Then we can use the time variable to compute the Python datetime at each point.
In [6]:
import datetime
base_time = datetime.datetime.utcfromtimestamp(data.variables['base_time'][:])
print(base_time)
dtindex = base_time + pd.to_timedelta(time[:], unit='s')
print(dtindex)
tseries = None
tseries = Series(temp[:], index=dtindex)
tseries.head()
Out[6]:
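The same conversion can be sketched with the standard library alone, which makes the arithmetic easy to check by hand. The numbers below are assumed example values, not values read from the file. This sketch uses datetime.datetime.fromtimestamp() with an explicit UTC time zone instead of utcfromtimestamp(), which returns a naive datetime and is deprecated as of Python 3.12.

```python
import datetime

# Assumed example values, not read from the netCDF file
base_time_seconds = 1401494400         # 2014-05-31 00:00:00 UTC as epoch seconds
offsets_seconds = [0.0, 60.0, 120.0]   # seconds since the start of the file

# Start of the file as a timezone-aware datetime in UTC
start = datetime.datetime.fromtimestamp(base_time_seconds, tz=datetime.timezone.utc)

# Add each offset to the start time to get one datetime per sample
timestamps = [start + datetime.timedelta(seconds=s) for s in offsets_seconds]

for ts in timestamps:
    print(ts.isoformat())
```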
In [7]:
tseries.plot()
Out[7]:
Now we can see that the index of the pandas series is represented as a date and time, and the plot shows a nicely formatted x-axis with the dates and times. Much nicer. But wait, there's more! What if we'd really just prefer to plot hourly averages? Or perhaps we just want to look at the data from noon onwards?
In [8]:
# Resample the data to hourly minimum, mean, and maximum
# (newer pandas releases no longer accept the `how` keyword, so use .agg())
tseries_hourly = tseries.resample('1h').agg(['min', 'mean', 'max'])
tseries_hourly.plot(style='.-')
Out[8]:
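To see what the resampling step returns without needing the netCDF file, here is a small sketch on synthetic data. The values are made up (a sine curve around 15 degrees) and the name demo is only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute data for one day (made-up values, not ARM data)
index = pd.date_range('2014-05-31', periods=24 * 60, freq='min')
values = 15.0 + 2.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, index.size))
demo = pd.Series(values, index=index)

# Aggregating with a list of functions gives a DataFrame:
# one row per hour, one column per statistic
hourly = demo.resample('1h').agg(['min', 'mean', 'max'])

print(hourly.shape)
print(hourly.head())
```

Because the result has one column per statistic, calling plot() on it draws one line for each of the three columns.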
In [9]:
# Looking at a subset of the data
tseries['2014-05-31 12:00':].head()
Out[9]:
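Slicing by date strings works for closed ranges as well as open-ended ones. The sketch below again uses synthetic data with made-up values, and spells the selection out with .loc, which does the same label-based slicing as the bracket form above.

```python
import pandas as pd

# Synthetic 1-minute series for one day (made-up values, not ARM data)
index = pd.date_range('2014-05-31', periods=24 * 60, freq='min')
demo = pd.Series(range(index.size), index=index)

# Everything from noon onwards
afternoon = demo.loc['2014-05-31 12:00':]

# A closed range: both end points are included when slicing by label
one_hour = demo.loc['2014-05-31 12:00':'2014-05-31 12:59']

print(len(afternoon), len(one_hour))
```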
This might seem like a lot of code so far just to make a simple little plot. Chances are, however, that you're going to be reading and writing a lot of netCDF files, or csv files, or whatever. So it might be useful to define a function that you can then use in other applications. This is what one such function (and its subsequent use) might look like:
In [10]:
import os
def cdf_to_series(netcdf_file, varname):
    """Takes the path to a netCDF file and a variable name and returns a pandas Series.
    This function requires the netCDF file to have a `base_time` variable with units of
    'seconds since 1970-01-01' and a `time` variable with units of seconds.
    """
    # import the necessary packages and modules needed within the function
    from netCDF4 import Dataset
    import pandas as pd
    import datetime
    # define our netCDF dataset and open it with "with ... as ... :"
    with Dataset(netcdf_file, 'r') as D:
        # turn it into a series.
        # note: this combines most of the steps from above into a single step
        S = pd.Series(D.variables[varname][:],
                      index=datetime.datetime.utcfromtimestamp(D.variables['base_time'][:]) +
                      pd.to_timedelta(D.variables['time'][:], unit='s'))
    # return the pandas Series
    return S
# define the location of the netCDF file
netcdf_file = os.path.abspath('enametC1.b1.20140531.000000.cdf')
# we can print out the help documentation for the function we defined earlier
help(cdf_to_series)
# call the function and look at the data
tseries = cdf_to_series(netcdf_file, 'temp_mean')
tseries.head()
Out[10]:
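One possible variation, sketched here under assumptions, is to separate the index construction from the file reading so that it can be tried on plain arrays. The helper name build_series and the sample values are hypothetical and are not part of the notebook's function.

```python
import numpy as np
import pandas as pd

def build_series(values, base_time_seconds, offsets_seconds):
    """Hypothetical helper: build a time-indexed Series from plain arrays.

    Assumes `base_time_seconds` is seconds since 1970-01-01 and
    `offsets_seconds` holds seconds elapsed since that base time.
    """
    start = pd.to_datetime(base_time_seconds, unit='s')
    index = start + pd.to_timedelta(offsets_seconds, unit='s')
    return pd.Series(values, index=index)

# Made-up sample values standing in for data read from a file
values = np.array([12.1, 12.0, 11.9])
offsets = np.array([0.0, 60.0, 120.0])
demo = build_series(values, 1401494400, offsets)

print(demo)
```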
You can start to get a sense that once you have data in a pandas Series object, it gets just a bit easier to do complex operations concerning data alignment and averaging. What I've shown in this notebook is only a sneak peek at the power behind pandas for scientific data analysis. In subsequent sections I will discuss the DataFrame object, how to group and manipulate data, how to join different datasets, and plotting.