Like Series objects, pandas.DataFrame objects are containers for data. The previous example saw us reading in a netCDF file and extracting a single variable into a pandas.Series object. While this is nice, we often want more than one variable from a netCDF file, or we want to combine several datasets that cover similar times but hold different variables. This is where the pandas.DataFrame object comes in handy.
Let's take a look at the help file for DataFrames:
In [1]:
import pandas as pd
#help(pd.DataFrame)
# on second thought, it was really long. You can look at it yourself if you want.
That's a lot of information. The basics:
Help on class DataFrame in module pandas.core.frame:
class DataFrame(pandas.core.generic.NDFrame)
| Two-dimensional size-mutable, potentially heterogeneous tabular data
| structure with labeled axes (rows and columns). Arithmetic operations
| align on both row and column labels. Can be thought of as a dict-like
| container for Series objects. The primary pandas data structure
Basically, a DataFrame (DF from here on out) is tabular data with both row and column labels. This means we can keep the same handy date/time indexing features from the Series and extend that in DF space to include column labels for different variables or products. With that, let's get right into an example.
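Before moving to the netCDF file, here is a minimal sketch of the "dict-like container for Series objects" idea. The values and the two Series below are made up for illustration; only the column name 'temp_mean' comes from the file used in this example, and 'rh_mean' is an assumed name.

```python
import pandas as pd

# Two made-up Series on a minute-resolution time index (values are illustrative only)
times = pd.date_range('2014-05-31 00:00', periods=4, freq='min')
temp = pd.Series([15.1, 15.0, 14.9, 14.8], index=times)
rh = pd.Series([80.0, 81.0, 82.0], index=times[1:])  # starts one minute later

# A DataFrame built from a dict of Series: keys become column labels,
# and rows are aligned on the time index (missing entries become NaN)
df = pd.DataFrame({'temp_mean': temp, 'rh_mean': rh})

print(df)
# Label-based row lookup by timestamp
print(df.loc[pd.Timestamp('2014-05-31 00:02')])
```

Because the second Series starts a minute later, its first row in the DF is filled with NaN rather than being shifted, which is the label alignment the help text mentions.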
I will be building on the function we developed in the last example. Instead of picking out a single variable, I will be looking for a specific set. Now, I know ahead of time a little bit about the structure of the netCDF file. For example, I want to get all the variables that are indexed along the 'time' variable, so I'll be looking at each variable's dimensions attribute for those that match the dimensions of 'time'. I also don't want any of the quality control variables for this demo, so I'll be skipping over variables that start with 'qc_'.
In [2]:
import os
from netCDF4 import Dataset
file_path = os.path.abspath('enametC1.b1.20140531.000000.cdf')
data = Dataset(file_path)
temp_dims = data.variables['temp_mean'].dimensions
time_dims = data.variables['time'].dimensions
print('Temperature Dimensions: {}'.format(temp_dims))
print('Time Dimensions: {}'.format(time_dims))
print('Do the dimensions match?: {}'.format(temp_dims == time_dims))
data.close()
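If you don't have the file handy, the same comparison can be tried on a stand-in. This is only a sketch: the dictionary below is a hypothetical mapping of variable names to dimension tuples, and 'qc_temp_mean' and 'lat' are assumed names, not something read from the real file.

```python
# Hypothetical stand-in for the netCDF variables: name -> dimensions tuple.
# The mapping is made up for illustration.
dims = {
    'base_time': (),
    'time': ('time',),
    'temp_mean': ('time',),
    'qc_temp_mean': ('time',),
    'lat': (),
}

time_dims = dims['time']

# Keep variables whose dimensions match those of 'time', skipping 'qc_' variables
selected = [name for name, d in dims.items()
            if d == time_dims and not name.startswith('qc_')]

print(selected)
```

Note that 'time' itself passes the dimension check, so it ends up in the selection alongside the data variables.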
So you can see how to match dimensions. Let's put it all together, and return all the variables with the same dimensions as 'time', excluding the quality control variables.
In [3]:
def cdf_to_dataframe(netcdf_file, exclude_qc=True):
    """Takes in the path to a netCDF file and returns a pandas DataFrame object
    """
    # import packages
    from netCDF4 import Dataset
    import pandas as pd
    import datetime
    with Dataset(netcdf_file, 'r') as D:
        # create an empty dictionary for the netCDF variables
        ncvars = {}
        for v in D.variables.keys():
            # keep only variables indexed along the same dimensions as 'time'
            time_check = (D.variables[v].dimensions
                          == D.variables['time'].dimensions)
            if exclude_qc:
                # skip quality control variables, whose names start with 'qc_'
                qc_check = not v.startswith('qc_')
                var_check = qc_check and time_check
            else:
                var_check = time_check
            if var_check:
                ncvars[v] = D.variables[v][:]
        # index: base_time (seconds since the epoch) plus the 'time' offsets in seconds
        index = (datetime.datetime.utcfromtimestamp(D.variables['base_time'][:])
                 + pd.to_timedelta(D.variables['time'][:], unit='s'))
        df = pd.DataFrame(ncvars, index=index)
    return df
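The index expression in the function above is the densest part, so here is a sketch of it in isolation. The numbers are made up, and I'm using pd.to_datetime in place of datetime.datetime.utcfromtimestamp; for a base time given in seconds since the epoch, both should give the same timezone-naive UTC timestamps.

```python
import pandas as pd

# Made-up values: base_time as seconds since the Unix epoch (UTC),
# and offsets in seconds from base_time, standing in for the 'time' variable
base_time = 1401494400  # 2014-05-31 00:00:00 UTC
offsets = [0.0, 60.0, 120.0]

# A single Timestamp plus a TimedeltaIndex gives a DatetimeIndex
index = pd.to_datetime(base_time, unit='s') + pd.to_timedelta(offsets, unit='s')

print(index)
```

The result is a DatetimeIndex with one entry per offset, which is what gives the DF the same date/time indexing features we had with the Series.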
In [4]:
DATA = cdf_to_dataframe(file_path)
F = DATA.plot(figsize=(15, 35), subplots=True)
print(DATA.columns)
Okay, so that's a lot of data, but it shows that our function worked and that we have a DF object containing all these variables. Now on to some of the more interesting features. We can reference any column of the DF either by calling it like
temperature = DATA['temp_mean']
or
temperature = DATA.temp_mean
One of the important things to take from this type of referencing is that it returns a pandas Series object. This makes sense, since a DataFrame used this way is a collection of Series objects. So by doing temperature = DATA['temp_mean'] we get the same pd.Series object that we had in the last example.
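One caveat with the attribute style is worth a quick sketch. The frame below is made up, and 'count' is a hypothetical column name I picked only because it collides with the DataFrame.count method.

```python
import pandas as pd

# Made-up frame; 'count' is a hypothetical column name that
# collides with the DataFrame.count method
df = pd.DataFrame({'temp_mean': [15.1, 15.0], 'count': [3, 4]})

# Bracket access returns a Series
print(type(df['temp_mean']))

# Both styles give the same Series for an ordinary column name
print(df['temp_mean'].equals(df.temp_mean))

# Attribute access finds the method here, not the column
print(callable(df.count))

# Bracket access still reaches the column
print(df['count'])
```

For column names like 'temp_mean' the two styles are interchangeable, but bracket access is the safer habit when a name might clash with a DataFrame method or isn't a valid Python identifier.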
In [5]:
temperature = DATA['temp_mean']
temperature.head()
Out[5]:
This section was just to get acquainted with the DataFrame object and how it relates to the pandas Series object. In the next section, I will be talking about more advanced ways to work with the data once it is in a pandas object; it should apply to both Series and DF objects (with some clearly explained exceptions).