DataFrames in pandas

Like Series objects, pandas.DataFrame objects are containers for data. In the previous example we read in a netCDF file and extracted a single variable into a pandas.Series object. While this is nice, we often want more than one variable from a netCDF file, or we want to combine several datasets that cover the same times but hold different variables. This is where the pandas.DataFrame object comes in handy.

Let's take a look at the help file for DataFrames:


In [1]:
import pandas as pd
#help(pd.DataFrame)
# on second thought, it was really long. You can look at it yourself if you want.

tl;dr

That's a lot of information. The basics:

Help on class DataFrame in module pandas.core.frame:

class DataFrame(pandas.core.generic.NDFrame)
 |  Two-dimensional size-mutable, potentially heterogeneous tabular data
 |  structure with labeled axes (rows and columns). Arithmetic operations
 |  align on both row and column labels. Can be thought of as a dict-like
 |  container for Series objects. The primary pandas data structure

Basically, a DataFrame (DF from here on out) is tabular data with both row and column labels. This means we keep the same handy date/time indexing features from the Series on the rows, and add column labels for the different variables or products. With that, let's get right into an example.
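
To make the "dict-like container for Series objects" description concrete, here is a minimal sketch with made-up values (the column names `temp` and `rh` and the numbers are illustrative assumptions, not data from the netCDF file). It builds a DF from two Series whose indexes only partly overlap, which also shows the label alignment mentioned in the help text.

```python
import pandas as pd

# Two Series with overlapping, but not identical, time indexes (made-up data)
t1 = pd.date_range('2014-05-31 00:00', periods=3, freq='min')
t2 = pd.date_range('2014-05-31 00:01', periods=3, freq='min')
temp = pd.Series([18.9, 19.0, 19.1], index=t1)
rh = pd.Series([80.0, 81.0, 82.0], index=t2)

# A dict of Series becomes a DataFrame: the keys are the column labels,
# and the rows are aligned on the union of the two indexes
df = pd.DataFrame({'temp': temp, 'rh': rh})

print(df.shape)         # rows = union of timestamps, columns = number of Series
print(df.isna().sum())  # missing values appear where an index has no match
```

Rows that exist in only one of the Series are kept and filled with NaN in the other column, so nothing is silently dropped.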

Example

I will be building on the function we developed in the last example. Instead of picking out a single variable, I will be looking for a specific set. Now, I know a little bit about the structure of the netCDF file ahead of time. For example, I want to get all the variables that are indexed along the 'time' variable, so I'll be looking through the netCDF variables for those whose dimensions match those of 'time'. I also don't want any of the quality control variables for this demo, so I'll be skipping over variables that start with 'qc_'.


In [2]:
import os
from netCDF4 import Dataset

file_path = os.path.abspath('enametC1.b1.20140531.000000.cdf')
data = Dataset(file_path)

temp_dims = data.variables['temp_mean'].dimensions
time_dims = data.variables['time'].dimensions

print('Temperature Dimensions: {}'.format(temp_dims))
print('Time Dimensions: {}'.format(time_dims))

print('Do the dimensions match?: {}'.format(temp_dims == time_dims))

data.close()


Temperature Dimensions: (u'time',)
Time Dimensions: (u'time',)
Do the dimensions match?: True
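
The comparison works because `dimensions` is a plain tuple of dimension names, so `==` is an ordinary tuple comparison. The selection logic can be sketched without a netCDF file at all. The dictionary below is a hypothetical stand-in for the file's variables (the names and shapes are assumptions for illustration only).

```python
# Hypothetical stand-in for the file: variable name -> dimensions tuple
dims = {
    'base_time': (),
    'time': ('time',),
    'temp_mean': ('time',),
    'qc_temp_mean': ('time',),
    'lat': (),
}

time_dims = dims['time']

# Keep names whose dimensions equal those of 'time' and that are not QC variables
keep = [name for name, d in dims.items()
        if d == time_dims and not name.startswith('qc_')]

print(keep)
```

Note that 'time' itself passes the check, since its dimensions trivially match its own.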

So you can see how to match dimensions. Let's put it all together, and return all the variables with the same dimensions as 'time', excluding the quality control variables.


In [3]:
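def cdf_to_dataframe(netcdf_file, exclude_qc=True):
    """Takes in the path to a netCDF file and returns a pandas DataFrame
    holding every variable that shares the dimensions of 'time'.
    """

    # import packages
    from netCDF4 import Dataset
    import pandas as pd

    with Dataset(netcdf_file, 'r') as D:

        # dimensions a variable must have to be kept
        time_dims = D.variables['time'].dimensions

        # create an empty dictionary for the netCDF variables
        ncvars = {}

        for v in D.variables.keys():
            time_check = D.variables[v].dimensions == time_dims
            if exclude_qc:
                # quality control variables start with 'qc_'
                qc_check = not v.startswith('qc_')
                var_check = qc_check and time_check
            else:
                var_check = time_check

            if var_check:
                ncvars[v] = D.variables[v][:]

        # base_time is in seconds since the Unix epoch (UTC), and 'time'
        # holds each sample's offset from base_time in seconds
        base_time = pd.Timestamp(int(D.variables['base_time'][:]), unit='s')
        index = base_time + pd.to_timedelta(D.variables['time'][:], unit='s')

        # use a new name so the open Dataset handle is not overwritten
        df = pd.DataFrame(ncvars, index=index)

    return df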
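
The index in the function is the file's base time plus each sample's offset in seconds. That step can be tried on its own without a netCDF file. This is a minimal sketch that uses `pd.Timestamp(..., unit='s')` for the epoch conversion; the `base_time` value and the offsets are assumed values chosen for illustration, not read from the file.

```python
import numpy as np
import pandas as pd

# Assumed values for illustration: base_time in seconds since the Unix epoch,
# and offsets in seconds from base_time (one sample per minute)
base_time = 1401494400            # 2014-05-31 00:00:00 UTC
offsets = np.array([0.0, 60.0, 120.0])

# A Timestamp plus a TimedeltaIndex gives a DatetimeIndex
index = pd.Timestamp(base_time, unit='s') + pd.to_timedelta(offsets, unit='s')

df = pd.DataFrame({'temp_mean': [18.95, 19.00, 19.02]}, index=index)
print(df)
```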

In [4]:
DATA = cdf_to_dataframe(file_path)

F = DATA.plot(figsize=(15, 35), subplots=True)
print(DATA.columns)


Index([u'atmos_pressure', u'logger_temp', u'logger_volt', u'org_precip_rate_mean', u'pwd_cumul_rain', u'pwd_err_code', u'pwd_mean_vis_10min', u'pwd_mean_vis_1min', u'pwd_precip_rate_mean_1min', u'pwd_pw_code_15min', u'pwd_pw_code_1hr', u'pwd_pw_code_inst', u'rh_mean', u'rh_std', u'temp_mean', u'temp_std', u'time', u'time_offset', u'vapor_pressure_mean', u'vapor_pressure_std', u'wdir_vec_mean', u'wdir_vec_std', u'wspd_arith_mean', u'wspd_vec_mean'], dtype='object')

Wow. Such plot.

Okay, so that's a lot of data, but it shows you that our function worked, and we have this DF object that contains all these variables. Now on to some of the more interesting features. We can reference any column of the DF either by its label in brackets, like

temperature = DATA['temp_mean']

or

temperature = DATA.temp_mean

One of the important things to get from this type of referencing is that it returns a pandas Series object. This makes sense, as a DataFrame for this use is a collection of Series objects. So by doing temperature = DATA['temp_mean'] we get the same pd.Series object that we had in the last example.


In [5]:
temperature = DATA['temp_mean']
temperature.head()


Out[5]:
2014-05-31 00:00:00    18.950001
2014-05-31 00:01:00    19.000000
2014-05-31 00:02:00    19.020000
2014-05-31 00:03:00    18.959999
2014-05-31 00:04:00    19.040001
Name: temp_mean, dtype: float32
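
As a small sketch of the two access styles on made-up data (the values below are illustrative, not from the file): a single label returns a Series, while a list of labels returns a smaller DataFrame.

```python
import pandas as pd

idx = pd.date_range('2014-05-31', periods=3, freq='min')
df = pd.DataFrame({'temp_mean': [18.95, 19.00, 19.02],
                   'rh_mean': [80.1, 80.3, 80.2]}, index=idx)

single = df['temp_mean']                # one label -> Series
subset = df[['temp_mean', 'rh_mean']]   # list of labels -> DataFrame

print(type(single).__name__)
print(type(subset).__name__)

# Attribute access is a shortcut for the same column, but it only works when
# the name is a valid Python identifier that does not clash with an existing
# DataFrame attribute or method
print(df.temp_mean.equals(df['temp_mean']))
```

Bracket access is the safer default in scripts, since it works for any column label.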

Summary

This section was just to get acquainted with the DataFrame object and how it relates to the pandas Series object. In the next section, I will be talking about more advanced ways to work with the data once it is in a pandas object. It should apply to both Series and DF objects (with some clearly explained exceptions).