This tutorial shows how to load, save and manipulate CTM datasets.
The pygchem.datasets module provides functions to load and save CTM datasets:
In [1]:
from pygchem import datasets
NOTE: In the PyGChem development version, this module was previously named datafields and was recently renamed to datasets (4-12-2014). If
from pygchem import datasets
doesn't work, try
import pygchem.datafields as datasets
to use this notebook.
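For convenience, the two import variants above can be combined in a single cell that works with either version of PyGChem (a minimal sketch based only on the two imports shown above):
In [ ]:
try:
    from pygchem import datasets
except ImportError:
    # older development versions: the module was named 'datafields'
    import pygchem.datafields as datasets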
PyGChem uses Iris (http://scitools.org.uk/iris/) for handling CTM datasets. Iris provides a data abstraction layer based on the Climate and Forecast (CF) data model (http://cfconventions.org/), which makes it possible to isolate analysis and visualisation code from data format specifics. It natively supports read/write access to the (CF)netCDF format. PyGChem currently adds support for the BPCH format and for the netCDF files created with the GAMAP routines BPCH2NC and BPCH2COARDS (read access only).
With Iris, a data field (data + metadata) relative to a particular phenomenon is represented by an iris.cube.Cube object (more info on Iris data structures here). Iris also supports basic data manipulation operations, such as arithmetic, interpolation, and statistics. This tutorial shows only a few examples; for a more detailed description of these features, see the Iris user guide.
This tutorial uses the datasets created by the 1-year GEOS-Chem benchmark simulations. These are available for download via anonymous FTP from ftp://ftp.as.harvard.edu/gcgrid/geos-chem/.
To run the code cells below, you must specify the root folder of your local copy of these benchmarks:
In [2]:
bmk_root = '/home/bovy/geoschem'
The function pygchem.datasets.load loads the content of a BPCH file or a netCDF file into a list of iris.cube.Cube objects. This function is just an alias of the iris.load function. Note that it doesn't load all the data into memory, but only enough information to select, manipulate and retrieve the data.
See the Iris documentation: loading iris cubes for more details.
cd into the 1-year benchmark (v10-01c Run1)
In [3]:
%cd {bmk_root}/1yr_benchmarks/v10-01/v10-01c/Run1
The line below loads the BPCH file corresponding to the first month of the simulation:
In [4]:
filename = 'bpch/ctm.bpch.v10-01c-geosfp-Run1.20120801'
dataset = datasets.load(filename)
The line below prints the last 20 data fields of the list (name, units, dimensions and coordinates). Note that the field names may be reformatted during loading so that they are consistent with the names set by GAMAP's netCDF writing routines. The units may also be reformatted as required by the udunits2 library (used by Iris).
In [5]:
print dataset[-20:]
The same function can be used for loading the content of the netCDF files. The netCDF files available in this simulation were created by the 'BPCH2COARDS' GAMAP routine. As there are differences between the COARDS Conventions and the CF Conventions, we have to use a callback function 'gamap_bpch2coards' to properly load the content of the netCDF file into Iris cubes:
In [6]:
filename = 'netcdf/v10-01c-geosfp-Run1.20120801.nc'
clb = datasets.load_callbacks['gamap_bpch2coards']
dataset = datasets.load(filename, callback=clb)
In [7]:
print dataset[-20:]
PyGChem also provides a callback for loading netCDF files created by the GAMAP routine 'BPCH2NC':
In [8]:
% cd {bmk_root}/1yr_benchmarks/v9-02/v9-02r/geos5/Run0
In [9]:
filename = 'netcdf/v9-02r-geos5-Run0.20050101.nc'
clb = datasets.load_callbacks['gamap_bpch2nc']
dataset = datasets.load(filename, callback=clb)
In [10]:
print dataset[-20:]
In [11]:
%cd {bmk_root}/1yr_benchmarks/v10-01/v10-01c/Run1
For large datasets, it is possible to constrain the load to match specific metadata or to load only a subset of the data. This is recommended, as loading is faster when constraints are given.
The following example loads only the "IJ_AVG_S__O3" variable:
In [12]:
filename = 'netcdf/v10-01c-geosfp-Run1.20120801.nc'
clb = datasets.load_callbacks['gamap_bpch2coards']
dataset = datasets.load(filename, "IJ_AVG_S__O3",
callback=clb)
In [13]:
print dataset
We can specify multiple variables:
In [14]:
dataset = datasets.load(filename, ["IJ_AVG_S__O3", "IJ_AVG_S__NO"],
callback=clb)
In [15]:
print dataset
It is also possible to define more advanced constraints. For example, to load all "IJ-AVG-$" diagnostics:
In [16]:
import iris
check_ij_avg = lambda cube: cube.name().startswith("IJ_AVG_S")
ij_avg = iris.Constraint(cube_func=check_ij_avg)
dataset = datasets.load(filename, ij_avg,
callback=clb)
In [17]:
print dataset
A more advanced example, combining constraints and extracting data subsets:
In [18]:
def lon_subset(cell):
    """
    Return True or False as to whether the cell
    center in question should be kept.
    """
    return cell > 0. and cell < 20.
lon_cst = iris.Constraint(longitude=lon_subset)
dataset = datasets.load(filename,
"IJ_AVG_S__O3" & lon_cst,
callback=clb)
In [19]:
print dataset
# note the reduced grid-size for the longitude
It is possible to load multiple files at once. The pygchem.datasets.load function will try to merge the fields whenever possible, i.e., when multiple fields describing the same phenomenon overlap in space and/or time, so as to minimise the number of loaded fields.
The following example loads several diagnostics for the entire 1-year simulation:
In [20]:
# note the wildcard character in the filename
# (UNIX expressions are supported)
filename = 'bpch/ctm.bpch.v10-01c-geosfp-Run1.*'
diagnostics = ["BXHGHT_S__BXHEIGHT",
"BXHGHT_S__N(AIR)",
"IJ_AVG_S__NO2"]
dataset = datasets.load(filename, diagnostics)
In [21]:
print dataset
# note the additional time dimension
Merging fields may take a long time. If speed matters, it is still possible to load the fields without any merging:
In [22]:
dataset_nomerge = datasets.load_raw(filename, diagnostics)
print dataset_nomerge
We can extract data subsets and/or specific fields from a loaded list of fields:
In [23]:
dataset_lon_subset = dataset.extract(lon_cst)
print dataset_lon_subset
To select only one field (cube), the extract_strict method can be used:
In [24]:
no2_avg = dataset.extract_strict("IJ_AVG_S__NO2")
print no2_avg
Datasets can be written to a netCDF file (following the CF conventions) using the save function of the datasets module (an alias to the iris.save function).
In [25]:
outfile = 'netcdf/test.nc'
datasets.save(dataset, outfile)
A text representation of (the header information of) the written netCDF file, obtained with the ncdump utility (provided with the netCDF library):
In [26]:
!ncdump -h netcdf/test.nc
Loading the written file using the load function:
In [27]:
print datasets.load('netcdf/test.nc')
It is also possible to write the datasets to the BPCH format, using the low-level function write_bpch in the module pygchem.io.bpch (not yet documented).
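As write_bpch is not yet documented, the cell below is only an illustrative sketch; the arguments shown (the list of cubes to write and an output filename) are assumptions, and the actual signature may differ.
In [ ]:
from pygchem.io import bpch

# hypothetical usage: the exact signature of write_bpch is undocumented,
# so the arguments (cubes to write, output filename) are assumptions
bpch.write_bpch(dataset, 'bpch/test.bpch')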
Print a string representation of the data field (cube):
In [28]:
print no2_avg
Get or set the name(s) of the field with the following function and/or attributes:
In [29]:
no2_avg.name()
Out[29]:
PyGChem treats the GEOS-Chem variable name (category + tracer) as the standard name, although it is not CF-compliant (i.e., not listed in the CF standard name table).
In [30]:
no2_avg.standard_name
Out[30]:
long_name is the full name of the diagnostic
In [31]:
no2_avg.long_name
Out[31]:
var_name is the (netCDF) variable name
In [32]:
no2_avg.var_name
Out[32]:
Other attributes of the data field are stored in the attributes dictionary:
In [33]:
no2_avg.attributes
Out[33]:
Get the field units:
In [34]:
no2_avg.units
Out[34]:
It is easy to change the units of the field (data values are re-computed accordingly):
In [35]:
no2_avg.convert_units('ppmv')
print no2_avg
Retrieving a coordinate by name:
In [36]:
lat_coord = no2_avg.coord('latitude')
Coordinate data and metadata:
In [37]:
lat_coord.points
Out[37]:
In [38]:
lat_coord.bounds
Out[38]:
In [39]:
lat_coord.units
Out[39]:
The field data can be accessed with the data attribute. It returns a Numpy array.
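For example, a quick look at the shape and value range of the NO2 field (accessing data loads the values into memory if they have not been loaded yet):
In [ ]:
# the data values as a NumPy (masked) array
print no2_avg.data.shape
print no2_avg.data.min(), no2_avg.data.max()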
It is very easy to manipulate data fields (Iris cubes).
Indexing a cube returns a new cube.
In [40]:
print no2_avg
In [41]:
# Get the first element of the 1st and last dimensions (time and model level number)
no2_avg_t0_l1 = no2_avg[0, :, :, 0]
print no2_avg_t0_l1
Note that another way to extract a subset is by applying one or more constraints on the cube (see above).
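As a minimal illustration of that, the constraint below selects the first vertical layer directly from the cube (assuming the model_level_number coordinate starts at 1):
In [ ]:
# extract the surface layer by constraining the vertical coordinate
surface_cst = iris.Constraint(model_level_number=1)
print no2_avg.extract(surface_cst)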
The slices method allows iterating over layers or time slices (it returns a Python generator). The slices can be 1-dimensional or n-dimensional.
The example below generates time slices (it returns a 3-d cube for each time slice).
In [42]:
no2_avg_time_slices = no2_avg.slices(['longitude', 'latitude', 'model_level_number'])
for s in no2_avg_time_slices:
print s
The example below calculates the sum over the vertical levels:
In [43]:
import iris.analysis
no2_avg_sum_levels = no2_avg.collapsed('model_level_number', iris.analysis.SUM)
print no2_avg_sum_levels
The example below calculates the total columns of the tracer for all grid cells and all time slices:
In [44]:
# extract the data fields (cubes) needed to compute the tracer columns
box_heights = dataset.extract_strict("BXHGHT_S__BXHEIGHT")
n_air = dataset.extract_strict("BXHGHT_S__N(AIR)")
# convert units back to ppbv for the NO2 tracer
no2_avg.convert_units('ppbv')
# calculate the columns
no2_avg_columns = (box_heights * n_air * no2_avg).collapsed('model_level_number',
iris.analysis.SUM)
# set the name and convert units to count/cm2 (count is used for #molecules)
no2_avg_columns.rename("NO2 columns")
no2_avg_columns.convert_units('count/cm2')
# string repr
print no2_avg_columns
Iris provides some modules for basic dataset plotting. It is built on top of matplotlib, and it uses the cartopy package for map projections.
In [45]:
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import iris.quickplot as qplt
%matplotlib inline
Plot the NO2 total columns for the first time slice:
In [46]:
fig = plt.figure(figsize=(10, 8))
qplt.contourf(no2_avg_columns[0], 15)
plt.gca().coastlines()
Out[46]:
A Hovmoller diagram example
In [47]:
import matplotlib.dates as mdates
no2_hovmoller = no2_avg_columns.collapsed('latitude',
iris.analysis.MEAN)
fig = plt.figure(figsize=(10, 8))
qplt.contourf(no2_hovmoller, 20)
# fine control over time axis ticks and labels
plt.gca().yaxis.set_major_locator(mdates.MonthLocator())
plt.gca().yaxis.set_major_formatter(mdates.DateFormatter('%m-%Y'))
plt.gca().set_ylabel("Time")
Out[47]:
When loading datasets, PyGChem reads the information stored in the diaginfo.dat and tracerinfo.dat files. The pygchem.diagnostics module provides a basic API for reading / writing these files and handling their contents.
In [48]:
from pygchem import diagnostics
To load a pair of diaginfo.dat / tracerinfo.dat files:
In [49]:
dinfo = diagnostics.CTMDiagnosticInfo(diaginfo_file='diaginfo.dat',
tracerinfo_file='tracerinfo.dat')
A CTMDiagnosticInfo object contains all the information stored in those files. The categories and diagnostics attributes contain the records (lines) of diaginfo.dat and tracerinfo.dat, respectively.
In [50]:
dinfo.categories
Out[50]:
These attributes behave like a Python list, with added key reference and database lookup-like capabilities. Each item of the list corresponds to a record.
In [51]:
# get the 1st category (a Record like object)
cat_ij_avg = dinfo.categories[0]
cat_ij_avg
Out[51]:
In [52]:
cat_ij_avg.name
Out[52]:
In [53]:
cat_ij_avg.offset
Out[53]:
In [54]:
# convert the record object to a dict
cat_ij_avg.to_dict()
Out[54]:
It is also possible to filter the data (queries):
In [55]:
# select a category based on its name (key)
dinfo.categories.select_item("NS-FLX-$")
Out[55]:
In [56]:
# select a diagnostic (tracer) based on its number (key)
dinfo.diagnostics.select_item(11)
Out[56]:
In [57]:
# select categories based on other attributes
dinfo.categories.select(offset=3000)
Out[57]:
In [58]:
# advanced selection
dinfo.diagnostics.select(lambda d: d.unit == 'ppbC' and d.number > 10)
Out[58]:
We can add or remove entries:
In [59]:
new_tracer = diagnostics.CTMDiagnostic(9999, 'NEW', full_name='a new tracer')
dinfo.diagnostics.append(new_tracer)
dinfo.diagnostics[-1]
Out[59]:
In [60]:
# select the new tracer added to the list
s = dinfo.diagnostics.select(9999)
# remove the selected entry
s.selection_remove()
dinfo.diagnostics[-1]
Out[60]:
Exporting to diaginfo and tracerinfo files:
In [61]:
dinfo.save_diaginfo('diaginfo_test.dat')
dinfo.save_tracerinfo('tracerinfo_test.dat')
In [ ]: