In [6]:
%matplotlib inline

Working with Data

I want to expand a little bit on the example that I used in the DataFrames tutorial, and demonstrate some more advanced ways to grab, slice, and plot data. This first cell is just a copy/paste job for reading in the data file from the last lesson.


In [7]:
def cdf_to_dataframe(netcdf_file, exclude_qc=True):
    """Reads a netCDF file and returns its time-dimensioned variables
    as a pandas DataFrame indexed by timestamp
    """
    
    # import packages
    from netCDF4 import Dataset
    import pandas as pd
    import datetime
    
    with Dataset(netcdf_file, 'r') as nc:
        
        # create an empty dictionary for the netCDF variables
        ncvars = {}
        
        for v in nc.variables.keys():
            # keep only variables that share the dimensions of 'time'
            time_check = (nc.variables[v].dimensions 
                          == nc.variables['time'].dimensions)
            if exclude_qc:
                # also drop the quality-control ('qc_') variables
                qc_check = 'qc_' not in v
                var_check = qc_check and time_check
            else:
                var_check = time_check
                
            if var_check:
                ncvars[v] = nc.variables[v][:]
            
        # index = base_time (seconds since the epoch, UTC) + per-sample offsets
        df = pd.DataFrame(ncvars,
                index = (datetime.datetime.utcfromtimestamp(nc.variables['base_time'][:])+
                        pd.to_timedelta(nc.variables['time'][:], unit='s')))
        
    return df

import os
file_path = os.path.abspath('enametC1.b1.20140531.000000.cdf')
DATA = cdf_to_dataframe(file_path)

What can we do from here?

How about we take a simple example. Let's dive further into the temperature data; specifically, let's do the following:

  1. resample to hourly averages
  2. for each hour, plot the min, mean, and max
  3. for each hour, make a boxplot of the values

There are a couple of ways to do this. One is to use the pandas.DataFrame.resample() method we saw earlier to get the data into 1-hour averages, and then calculate the minimum and maximum separately. Instead, this example demonstrates the groupby() functionality combined with agg(), which computes all three statistics for each hourly bin in a single call.
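For comparison, here is a minimal sketch of the resample() route next to the groupby() route. It does not read the netCDF file: `temp` is a synthetic, one-sample-per-minute series invented for illustration as a stand-in for DATA['temp_mean'], and the names `by_grouper` and `by_resample` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for DATA['temp_mean'] (assumption: one sample per minute).
index = pd.date_range('2014-05-31', periods=24 * 60, freq='min')
rng = np.random.default_rng(0)
temp = pd.Series(20 + rng.normal(size=index.size), index=index, name='temp_mean')

stats = ['min', 'mean', 'max']

# Route 1: group on hourly bins, then aggregate.
by_grouper = temp.groupby(pd.Grouper(freq='60min')).agg(stats)

# Route 2: resample to hourly bins, then aggregate.
by_resample = temp.resample('60min').agg(stats)

# Both routes give one row per hour with the same three columns.
print(by_grouper.equals(by_resample))
```

On a time-indexed series the two routes produce the same table, so the choice is mostly a matter of taste. The cell that follows takes the groupby route.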


In [9]:
import pandas as pd
import numpy as np

# pd.Grouper(freq=...) replaces pd.TimeGrouper, which was removed in pandas 1.0
hourly = pd.Grouper(freq='1H')
T = DATA['temp_mean'].groupby(hourly).agg([np.min, np.mean, np.max])
T


Out[9]:
amin mean amax
2014-05-31 00:00:00 18.260000 18.569334 19.040001
2014-05-31 01:00:00 17.980000 18.088501 18.340000
2014-05-31 02:00:00 17.540001 17.761333 18.090000
2014-05-31 03:00:00 17.420000 17.776833 18.120001
2014-05-31 04:00:00 17.240000 17.559999 18.010000
2014-05-31 05:00:00 17.389999 17.770334 18.020000
2014-05-31 06:00:00 17.480000 17.707666 17.950001
2014-05-31 07:00:00 18.000000 19.100834 20.280001
2014-05-31 08:00:00 20.320000 20.991501 21.590000
2014-05-31 09:00:00 20.799999 21.356333 21.930000
2014-05-31 10:00:00 21.190001 22.094500 22.920000
2014-05-31 11:00:00 22.590000 22.922501 23.309999
2014-05-31 12:00:00 22.150000 22.667500 23.290001
2014-05-31 13:00:00 22.740000 23.334333 23.920000
2014-05-31 14:00:00 23.320000 23.882833 24.330000
2014-05-31 15:00:00 23.660000 23.962166 24.459999
2014-05-31 16:00:00 23.139999 23.638000 24.120001
2014-05-31 17:00:00 23.049999 23.725500 24.459999
2014-05-31 18:00:00 23.129999 23.458000 23.740000
2014-05-31 19:00:00 22.139999 22.750999 23.620001
2014-05-31 20:00:00 20.340000 21.390333 22.200001
2014-05-31 21:00:00 19.340000 19.715334 20.420000
2014-05-31 22:00:00 18.780001 19.034500 19.389999
2014-05-31 23:00:00 18.030001 18.249500 18.820000
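With the hourly table in hand, steps 2 and 3 from the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's own plotting code: `temp` is again a synthetic one-sample-per-minute series standing in for DATA['temp_mean'], and the figure layout and axis labels are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for DATA['temp_mean'] (assumption: one sample per minute,
# with a made-up daily cycle plus noise).
index = pd.date_range('2014-05-31', periods=24 * 60, freq='min')
rng = np.random.default_rng(0)
cycle = 20 + 3 * np.sin(2 * np.pi * (np.asarray(index.hour) - 9) / 24)
temp = pd.Series(cycle + rng.normal(scale=0.2, size=index.size),
                 index=index, name='temp_mean')

# Step 1: hourly bins and their min, mean, and max.
hourly = temp.groupby(pd.Grouper(freq='60min'))
T = hourly.agg(['min', 'mean', 'max'])

fig, (ax_line, ax_box) = plt.subplots(2, 1, figsize=(8, 8))

# Step 2: one line each for the hourly min, mean, and max.
T.plot(ax=ax_line)
ax_line.set_ylabel('temp_mean')

# Step 3: one box per hour, built from the raw samples in each hourly bin.
samples = [group.values for _, group in hourly]
labels = [stamp.strftime('%H') for stamp, _ in hourly]
ax_box.boxplot(samples)
ax_box.set_xticklabels(labels)
ax_box.set_xlabel('hour (UTC)')
ax_box.set_ylabel('temp_mean')
```

The boxplot is drawn from the raw samples in each bin rather than from `T`, because a box needs the full distribution (quartiles and outliers), not just the three summary statistics.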
