This workshop will teach some of the basics of Climate and Forecast (CF) metadata for netCDF data files, with hands-on work available in Jupyter Notebooks using Python. Along with an introduction to netCDF and CF, we will introduce the CF data model and discuss some netCDF implementation details to consider when deciding how to write data with CF and netCDF. We will cover gridded data as well as in situ data (stations, soundings, etc.) and touch on storing geometry data in CF.
This assumes a basic understanding of netCDF.
Let's say we're working with some numerical weather forecast model output. Let's walk through the steps necessary to store this data in netCDF, using the Climate and Forecasting metadata conventions to ensure that our data are available to as many tools as possible.
To start, let's assume the following about our data: hourly output covering a 12-hour period, a grid with 3 km spacing in x and y, and a set of standard pressure levels. We'll also go ahead and generate some arrays of data below to get started:
In [ ]:
# Import some useful Python tools
from datetime import datetime, timedelta
import numpy as np
# Twelve hours of hourly output starting at 22Z today
start = datetime.utcnow().replace(hour=22, minute=0, second=0, microsecond=0)
times = np.array([start + timedelta(hours=h) for h in range(13)])
# 3km spacing in x and y
x = np.arange(-150, 153, 3)
y = np.arange(-100, 100, 3)
# Standard pressure levels in hPa
press = np.array([1000, 925, 850, 700, 500, 300, 250])
temps = np.random.randn(times.size, press.size, y.size, x.size)
In [ ]:
from netCDF4 import Dataset
nc = Dataset('forecast_model.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
We're going to start by adding some global attribute metadata. These are recommendations from the standard (not required), but they're easy to add and help users keep the data straight, so let's go ahead and do it.
In [ ]:
nc.Conventions = 'CF-1.7'
nc.title = 'Forecast model run'
nc.institution = 'Unidata'
nc.source = 'WRF-1.5'
nc.history = str(datetime.utcnow()) + ' Python'
nc.references = ''
nc.comment = ''
At this point, this is the CDL representation of this dataset:
netcdf forecast_model {
attributes:
:Conventions = "CF-1.7" ;
:title = "Forecast model run" ;
:institution = "Unidata" ;
:source = "WRF-1.5" ;
:history = "2019-07-16 02:21:52.005718 Python" ;
:references = "" ;
:comment = "" ;
}
Next, before adding variables to define each of the data fields in this file, we need to define the dimensions that exist in this dataset. We set each of x, y, and pressure to the size of the corresponding array. We set forecast_time to be an "unlimited" dimension, which allows the dataset to grow along that dimension if we write additional data to it later.
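As a quick aside, that "unlimited" behavior means a variable can grow simply by writing past its current end. Here is a minimal, standalone sketch of that (a throwaway in-memory file with made-up names and sizes, not part of our forecast dataset):
In [ ]:
# Standalone sketch (not written to forecast_model.nc): writing past the end
# of an UNLIMITED dimension simply grows it. Names and sizes are illustrative.
demo = Dataset('unlimited_demo.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
demo.createDimension('time', None)   # unlimited
demo.createDimension('x', 5)
var = demo.createVariable('data', np.float32, ('time', 'x'))
var[0] = np.zeros(5)                 # 'time' now has size 1
var[3] = np.ones(5)                  # writing at index 3 grows 'time' to 4
print(len(demo.dimensions['time']))  # -> 4
demo.close()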
In [ ]:
nc.createDimension('forecast_time', None)
nc.createDimension('x', x.size)
nc.createDimension('y', y.size)
nc.createDimension('pressure', press.size)
nc
The CDL representation now shows our dimensions:
netcdf forecast_model {
dimensions:
forecast_time = UNLIMITED (currently 13) ;
x = 101 ;
y = 67 ;
pressure = 7 ;
attributes:
:Conventions = "CF-1.7" ;
:title = "Forecast model run" ;
:institution = "Unidata" ;
:source = "WRF-1.5" ;
:history = "2019-07-16 02:21:52.005718 Python" ;
:references = "" ;
:comment = "" ;
}
So far, all we've done is outline basic information about our dataset: broad metadata and the dimensions. Now we create a variable to hold one particular data field, in this case the forecast air temperature. When defining this variable, we specify the datatype of the values being stored and the relevant dimensions, and enable (optional) compression.
In [ ]:
temps_var = nc.createVariable('Temperature', datatype=np.float32,
                              dimensions=('forecast_time', 'pressure', 'y', 'x'),
                              zlib=True)
Now that we have the variable, we tell Python to write our array of data to it.
In [ ]:
temps_var[:] = temps
temps_var
If instead we wanted to write data sporadically, say once per time step, we could do that as well (though the for loop below might actually sit at a higher level in the program):
In [ ]:
next_slice = 0
for temp_slice in temps:
    temps_var[next_slice] = temp_slice
    next_slice += 1
At this point, this is the CDL representation of our dataset:
netcdf forecast_model {
dimensions:
forecast_time = UNLIMITED (currently 13) ;
x = 101 ;
y = 67 ;
pressure = 7 ;
variables:
float Temperature(forecast_time, pressure, y, x) ;
attributes:
:Conventions = "CF-1.7" ;
:title = "Forecast model run" ;
:institution = "Unidata" ;
:source = "WRF-1.5" ;
:history = "2019-07-16 02:21:52.005718 Python" ;
:references = "" ;
:comment = "" ;
}
We can also add attributes to this variable to define metadata. The CF conventions require a units attribute to be set for all variables that represent a dimensional quantity. The value of this attribute needs to be parsable by the UDUNITS library. Here we set it to a value of 'Kelvin'. We also set the standard (optional) attributes of long_name and standard_name. The former contains a longer description of the variable, while the latter comes from a controlled vocabulary in the CF conventions, allowing users of the data to understand, in a standard fashion, what a variable represents. If we had missing values, we could also set the missing_value attribute to an appropriate value.
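If you want to check ahead of time whether a unit string is something UDUNITS can parse, one option is the cf-units package (a Python wrapper around UDUNITS; an extra dependency not otherwise used in this workshop). A minimal sketch:
In [ ]:
# Optional sanity check using the cf-units package (an assumption; install it
# separately). Constructing a Unit parses the string with UDUNITS and raises
# a ValueError if the string cannot be parsed.
from cf_units import Unit
Unit('K')                              # kelvin parses fine
Unit('hours since 2019-07-16 00:00')   # time units parse as well
# Unit('not_a_real_unit')              # a string like this would raise ValueError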
NASA Dataset Interoperability Recommendations:
Section 2.2 - Include Basic CF Attributes
Include where applicable: units, long_name, standard_name, valid_min/valid_max, scale_factor/add_offset, and others (see the sketch below).
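The cell after this aside only sets units, standard_name, long_name, and missing_value. As a rough sketch of the packing-related recommendations (scale_factor/add_offset plus a valid range), something like the following could be used; the file name, variable name, and packing values are hypothetical, and the variable is written to a separate throwaway file so the CDL of forecast_model.nc shown later is unchanged:
In [ ]:
# Hypothetical sketch: a packed (int16) temperature with scale_factor/add_offset
# and a valid range in packed space. netCDF4-python packs automatically on write
# when these attributes are set before the data.
pack_demo = Dataset('packing_demo.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
pack_demo.createDimension('x', x.size)
packed = pack_demo.createVariable('Temperature_packed', np.int16, ('x',), zlib=True)
packed.units = 'Kelvin'
packed.scale_factor = np.float32(0.01)   # unpacked = packed * scale_factor + add_offset
packed.add_offset = np.float32(200.0)
packed.valid_min = np.int16(0)           # valid range given in the packed (int16) type
packed.valid_max = np.int16(20000)       # i.e. 200 K to 400 K once unpacked
packed[:] = 250.0 + np.random.rand(x.size)  # packed on write by netCDF4-python
pack_demo.close()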
In [ ]:
temps_var.units = 'Kelvin'
temps_var.standard_name = 'air_temperature'
temps_var.long_name = 'Forecast air temperature'
temps_var.missing_value = np.float32(-9999.)  # same type as the variable
temps_var
The resulting CDL (truncated to the variables only) looks like:
variables:
float Temperature(forecast_time, pressure, y, x) ;
Temperature:units = "Kelvin" ;
Temperature:standard_name = "air_temperature" ;
Temperature:long_name = "Forecast air temperature" ;
Temperature:missing_value = -9999.0 ;
To properly orient our data in time and space, we need to go beyond dimensions (which only define common sizes and alignment) and include values along these dimensions, called "coordinate variables". Generally, these are defined by creating a one-dimensional variable with the same name as the respective dimension.
To start, we define variables holding our x and y coordinate values. These variables include standard_names that allow associating them with projections (more on this later), as well as an optional axis attribute to make clear which standard direction each coordinate refers to.
In [ ]:
x_var = nc.createVariable('x', np.float32, ('x',))
x_var[:] = x
x_var.units = 'km'
x_var.axis = 'X' # Optional
x_var.standard_name = 'projection_x_coordinate'
x_var.long_name = 'x-coordinate in projected coordinate system'
y_var = nc.createVariable('y', np.float32, ('y',))
y_var[:] = y
y_var.units = 'km'
y_var.axis = 'Y' # Optional
y_var.standard_name = 'projection_y_coordinate'
y_var.long_name = 'y-coordinate in projected coordinate system'
We also define a coordinate variable pressure to reference our data in the vertical dimension. The standard_name of 'air_pressure' is sufficient to identify this coordinate variable as the vertical axis, but let's go ahead and specify the axis as well. We also specify the attribute positive to indicate whether the variable increases when going up or down; in the case of pressure, this is technically optional.
In [ ]:
press_var = nc.createVariable('pressure', np.float32, ('pressure',))
press_var[:] = press
press_var.units = 'hPa'
press_var.axis = 'Z' # Optional
press_var.standard_name = 'air_pressure'
press_var.positive = 'down' # Optional
Time coordinates must contain a units attribute with a string value of a form similar to 'seconds since 2019-01-06 12:00:00.00'. 'seconds', 'minutes', 'hours', and 'days' are the most commonly used units for time; due to the variable length of months and years, those units are not recommended.
Before we can write data, we first need to convert our list of Python datetime instances to numeric values. We can use the cftime library to do this conversion using the unit string defined above.
In [ ]:
from cftime import date2num
time_units = 'hours since {:%Y-%m-%d 00:00}'.format(times[0])
time_vals = date2num(times, time_units)
time_vals
Now we can create the forecast_time variable just as we did before for the other coordinate variables:
In [ ]:
time_var = nc.createVariable('forecast_time', np.float32, ('forecast_time',))
time_var[:] = time_vals
time_var.units = time_units
time_var.axis = 'T' # Optional
time_var.standard_name = 'time' # Optional
time_var.long_name = 'time'
The CDL representation of the variables now contains much more information:
dimensions:
forecast_time = UNLIMITED (currently 13) ;
x = 101 ;
y = 67 ;
pressure = 7 ;
variables:
float x(x) ;
x:units = "km" ;
x:axis = "X" ;
x:standard_name = "projection_x_coordinate" ;
x:long_name = "x-coordinate in projected coordinate system" ;
float y(y) ;
y:units = "km" ;
y:axis = "Y" ;
y:standard_name = "projection_y_coordinate" ;
y:long_name = "y-coordinate in projected coordinate system" ;
float pressure(pressure) ;
pressure:units = "hPa" ;
pressure:axis = "Z" ;
pressure:standard_name = "air_pressure" ;
pressure:positive = "down" ;
float forecast_time(forecast_time) ;
forecast_time:units = "hours since 2019-07-16 00:00" ;
forecast_time:axis = "T" ;
forecast_time:standard_name = "time" ;
forecast_time:long_name = "time" ;
float Temperature(forecast_time, pressure, y, x) ;
Temperature:units = "Kelvin" ;
Temperature:standard_name = "air_temperature" ;
Temperature:long_name = "Forecast air temperature" ;
Temperature:missing_value = -9999.0 ;
Our data are still not CF-compliant because they do not contain latitude and longitude information, which is needed to properly locate the data. To solve this, we need to add variables with latitude and longitude. These are called "auxiliary coordinate variables", not because they are extra, but because they are not simple one-dimensional variables.
Below, we first generate longitude and latitude values from our projected coordinates using the pyproj library.
In [ ]:
from pyproj import Proj
X, Y = np.meshgrid(x, y)
lcc = Proj({'proj': 'lcc', 'lon_0': -105, 'lat_0': 40, 'a': 6371000.,
            'lat_1': 25})
lon, lat = lcc(X * 1000, Y * 1000, inverse=True)
Now we can create the needed variables. Both are dimensioned on y and x and are therefore two-dimensional. The longitude variable is identified as actually containing such information by its required units of 'degrees_east', as well as the optional standard_name of 'longitude'. The case is the same for latitude, except the units are 'degrees_north' and the standard_name is 'latitude'.
In [ ]:
lon_var = nc.createVariable('lon', np.float64, ('y', 'x'))
lon_var[:] = lon
lon_var.units = 'degrees_east'
lon_var.standard_name = 'longitude' # Optional
lon_var.long_name = 'longitude'
lat_var = nc.createVariable('lat', np.float64, ('y', 'x'))
lat_var[:] = lat
lat_var.units = 'degrees_north'
lat_var.standard_name = 'latitude' # Optional
lat_var.long_name = 'latitude'
With the variables created, we identify them as containing coordinates for the Temperature variable by setting its coordinates attribute to a space-separated list of the names of the auxiliary coordinate variables:
In [ ]:
temps_var.coordinates = 'lon lat'
This yields the following CDL:
double lon(y, x) ;
lon:units = "degrees_east" ;
lon:long_name = "longitude" ;
lon:standard_name = "longitude" ;
double lat(y, x) ;
lat:units = "degrees_north" ;
lat:long_name = "latitude" ;
lat:standard_name = "latitude" ;
float Temperature(forecast_time, pressure, y, x) ;
Temperature:units = "Kelvin" ;
Temperature:standard_name = "air_temperature" ;
Temperature:long_name = "Forecast air temperature" ;
Temperature:missing_value = -9999.0 ;
Temperature:coordinates = "lon lat" ;
With our data specified on a Lambert conformal projected grid, it would be good to include this information in our metadata. We can do this using a "grid mapping" variable. This uses a dummy scalar variable as a namespace for holding all of the required information. Relevant variables then reference the dummy variable with their grid_mapping attribute.
Below we create a variable and set it up for a Lambert conformal conic projection on a spherical earth. The grid_mapping_name attribute describes which of the CF-supported grid mappings we are specifying. The names of additional attributes vary between the mappings.
In [ ]:
proj_var = nc.createVariable('lambert_projection', np.int32, ())
proj_var.grid_mapping_name = 'lambert_conformal_conic'
proj_var.standard_parallel = 25.
proj_var.latitude_of_projection_origin = 40.
proj_var.longitude_of_central_meridian = -105.
proj_var.semi_major_axis = 6371000.0
proj_var
Now that we've created the variable, all that's left is to set the grid_mapping attribute on our Temperature variable to the name of our dummy variable:
In [ ]:
temps_var.grid_mapping = 'lambert_projection' # or proj_var.name
This yields the CDL:
variables:
int lambert_projection ;
lambert_projection:grid_mapping_name = "lambert_conformal_conic" ;
lambert_projection:standard_parallel = 25.0 ;
lambert_projection:latitude_of_projection_origin = 40.0 ;
lambert_projection:longitude_of_central_meridian = -105.0 ;
lambert_projection:semi_major_axis = 6371000.0 ;
float Temperature(forecast_time, pressure, y, x) ;
Temperature:units = "Kelvin" ;
Temperature:standard_name = "air_temperature" ;
Temperature:long_name = "Forecast air temperature" ;
Temperature:missing_value = -9999.0 ;
Temperature:coordinates = "lon lat" ;
Temperature:grid_mapping = "lambert_projection" ;
NASA Dataset Interoperability Recommendations:
Section 2.3 - Use CF “bounds” attributes
CF conventions state: “When gridded data does not represent the point values of a field but instead represents some characteristic of the field within cells of finite ‘volume,’ a complete description of the variable should include metadata that describes the domain or extent of each cell, and the characteristic of the field that the cell values represent.”
For example, if a rain gauge is read every 3 hours but only dumped every 6 hours, the file might look like this:
netcdf precip_bucket_bounds {
dimensions:
lat = 12 ;
lon = 19 ;
time = 8 ;
tbv = 2;
variables:
float lat(lat) ;
float lon(lon) ;
float time(time) ;
time:units = "hours since 2019-07-12 00:00:00.00";
time:bounds = "time_bounds" ;
float time_bounds(time, tbv) ;
float precip(time, lat, lon) ;
precip:units = "inches" ;
data:
time = 3, 6, 9, 12, 15, 18, 21, 24;
time_bounds = 0, 3, 0, 6, 6, 9, 6, 12, 12, 15, 12, 18, 18, 21, 18, 24;
}
So the time coordinate and its bounds look like:
|---X
|-------X
        |---X
        |-------X
                |---X
                |-------X
                        |---X
                        |-------X
0   3   6   9   12  15  18  21  24
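For reference, here is a minimal sketch of how the time/time_bounds pair from the CDL above might be written with netCDF4-python (a separate throwaway file; only the time-related pieces are shown, and the values come straight from the CDL):
In [ ]:
# Sketch of writing a coordinate variable and its bounds variable.
bnc = Dataset('precip_bucket_bounds.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
bnc.createDimension('time', 8)
bnc.createDimension('tbv', 2)
b_time = bnc.createVariable('time', np.float32, ('time',))
b_time.units = 'hours since 2019-07-12 00:00:00.00'
b_time.bounds = 'time_bounds'          # points at the bounds variable below
b_time[:] = [3, 6, 9, 12, 15, 18, 21, 24]
b_bounds = bnc.createVariable('time_bounds', np.float32, ('time', 'tbv'))
b_bounds[:] = [[0, 3], [0, 6], [6, 9], [6, 12],
               [12, 15], [12, 18], [18, 21], [18, 24]]
bnc.close()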
So far we've focused on how to store data that are arranged on a grid. What about observational data? The CF conventions cover this under Discrete Sampling Geometries (DSG).
For data that are regularly sampled (say, all at the same heights), this is straightforward. First, let's define some sample profile data, all on the same set of heights up to 1000 m:
In [ ]:
lons = np.array([-97.1, -105, -80])
lats = np.array([35.25, 40, 27])
heights = np.linspace(10, 1000, 10)
temps = np.random.randn(lats.size, heights.size)
stids = ['KBOU', 'KOUN', 'KJUP']
First we create a new file and define some dimensions. Since this is profile data, heights will be one dimension; we use station as our other dimension. We also set the global featureType attribute to 'profile' to indicate that this file holds "an ordered set of data points along a vertical line at a fixed horizontal position and fixed time". We also add a dimension to assist in storing our string station IDs.
In [ ]:
nc.close()
nc = Dataset('obs_data.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
nc.createDimension('station', lats.size)
nc.createDimension('heights', heights.size)
nc.createDimension('str_len', 4)
nc.Conventions = 'CF-1.7'
nc.featureType = 'profile'
nc
Which gives this CDL:
netcdf obs_data {
dimensions:
station = 3 ;
heights = 10 ;
str_len = 4 ;
attributes:
:Conventions = "CF-1.7" ;
:featureType = "profile" ;
}
We can create our coordinates with:
In [ ]:
lon_var = nc.createVariable('lon', np.float64, ('station',))
lon_var.units = 'degrees_east'
lon_var.standard_name = 'longitude'
lat_var = nc.createVariable('lat', np.float64, ('station',))
lat_var.units = 'degrees_north'
lat_var.standard_name = 'latitude'
The standard refers to these as "instance variables" because each one refers to an instance of a feature. From here we can create our heights coordinate variable:
In [ ]:
heights_var = nc.createVariable('heights', np.float32, ('heights',))
heights_var.units = 'meters'
heights_var.standard_name = 'altitude'
heights_var.positive = 'up'
heights_var[:] = heights
Now we can also write our station IDs to a variable. This is a 2D variable, but one of the dimensions is simply there to facilitate treating strings as character arrays. We also assign this variable a cf_role attribute with a value of 'profile_id' to help software identify individual profiles:
In [ ]:
stid_var = nc.createVariable('stid', 'c', ('station', 'str_len'))
stid_var.cf_role = 'profile_id'
stid_var.long_name = 'Station identifier'
stid_var[:] = stids
Now our CDL looks like:
netcdf obs_data {
dimensions:
station = 3 ;
heights = 10 ;
str_len = 4 ;
variables:
double lon(station) ;
lon:units = "degrees_east" ;
lon:standard_name = "longitude" ;
double lat(station) ;
lat:units = "degrees_north" ;
lat:standard_name = "latitude" ;
float heights(heights) ;
heights:units = "meters" ;
heights:standard_name = "altitude";
heights:positive = "up" ;
char stid(station, str_len) ;
stid:cf_role = "profile_id" ;
stid:long_name = "Station identifier" ;
attributes:
:Conventions = "CF-1.7" ;
:featureType = "profile" ;
}
Finally, we create a scalar time variable, since all of the profiles here are valid at the same time, along with the temperature variable itself:
In [ ]:
time_var = nc.createVariable('time', np.float32, ())
time_var.units = 'minutes since 2019-07-16 17:00'
time_var.standard_name = 'time'
time_var[:] = [5.]
temp_var = nc.createVariable('temperature', np.float32, ('station', 'heights'))
temp_var.units = 'celsius'
temp_var.standard_name = 'air_temperature'
temp_var.coordinates = 'lon lat heights time'
Note the use of the coordinates attribute to store the names of the auxiliary coordinate variables, since they are all dimensioned on station and are not proper coordinate variables. This yields the following CDL for the variables:
variables:
double lon(station) ;
lon:units = "degrees_east" ;
lon:standard_name = "longitude" ;
double lat(station) ;
lat:units = "degrees_north" ;
lat:standard_name = "latitude" ;
float heights(heights) ;
heights:units = "meters" ;
heights:standard_name = "altitude";
heights:positive = "up" ;
char stid(station, str_len) ;
stid:cf_role = "profile_id" ;
stid:long_name = "Station identifier" ;
float time ;
time:units = "minutes since 2019-07-16 17:00" ;
time:standard_name = "time" ;
float temperature(station, heights) ;
temperature:units = "celsius" ;
temperature:standard_name = "air_temperature" ;
temperature:coordinates = "lon lat heights time" ;
These standards for storing discrete sampling geometries extend to time series, trajectories, and combinations of them. They can also handle differing amounts of data per feature using ragged arrays, as sketched below. For more information, see the main CF document or the annotated DSG examples.
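As a rough, standalone sketch of the contiguous ragged array idea (the file name, sizes, and per-profile counts below are made up for illustration), a count variable carrying a sample_dimension attribute records how many elements along the sample dimension belong to each profile:
In [ ]:
# Standalone sketch of a contiguous ragged array: profiles with differing
# numbers of levels packed along an 'obs' dimension, with a count variable
# linking rows back to profiles. Names and sizes are hypothetical.
rag = Dataset('ragged_profiles.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
rag.Conventions = 'CF-1.7'
rag.featureType = 'profile'
rag.createDimension('profile', 3)
rag.createDimension('obs', 25)
row_size = rag.createVariable('row_size', np.int32, ('profile',))
row_size.long_name = 'number of observations for this profile'
row_size.sample_dimension = 'obs'   # identifies 'obs' as the sample dimension
row_size[:] = [10, 8, 7]            # hypothetical per-profile counts (sum = 25)
rag_temp = rag.createVariable('temperature', np.float32, ('obs',))
rag_temp.units = 'celsius'
rag_temp.standard_name = 'air_temperature'
rag.close()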