This notebook provides discussion, examples, and best practices for
working with netCDF datasets in Python using the xarray
package.
Topics include:
This notebook is a companion to the
Exploring netCDF Files
and
Exploring netCDF Datasets from ERDDAP
notebooks.
Those notebooks focus on using the
netcdf4-python
package
to read netCDF datasets from local files and
ERDDAP servers on the Internet,
respectively.
This notebook is about using the xarray
package
to work with netCDF datasets.
xarray
uses the netcdf4-python
package behind the scenes,
so datasets can be read from either local files or from ERDDAP servers.
xarray
is a Python package that applies the concepts and tools for working with labeled data structures
from the pandas
package to the physical sciences.
Whereas pandas
excels at manipulating tablular data,
xarray
brings similar power to working with N-dimensional arrays.
If you are already familiar with working with netCDF datasets via the netCDF4-python
package,
you can think of xarray
as a higher level Python tools for working with those dataset.
Creating netCDF files and working with their attribute metadata is documented elsewhere: http://salishsea-meopar-docs.readthedocs.org/en/latest/code-notes/salishsea-nemo/nemo-forcing/netcdf4.html.
This notebook assumes that you are working in Python 3. If you don't have a Python 3 environment set up, please see our Anaconda Python Distribution docs for instructions on how to set one up.
xarray
and some of the packages that it depends on are not included in the default Anaconda
collection of packages,
so you may need to installed them explicitly:
$ conda install xarray netCDF4 bottleneck
bottleneck
is a package that speeds up NaN-skipping and rolling window aggregations.
If you are using a version of Python earlier than 3.5
(check with the command python --version
),
you should also install cyordereddict
to speed internal operations with xarray data structures.
It is not required for Python ≥3.5 because collections.OrderedDict
has been rewritten
in C,
making it even faster than cyordereddict
.
Let's start with some imports. It's good Python form to keep all of our imports at the top of the file.
In [1]:
import numpy as np
import xarray as xr
Note that we alias numpy
to np
and xarray
to xr
so that we don't have to type as much.
xarray
provides a open_dataset
function that allows us to load
a netCDF dataset into a Python data structure by simply passing in
a file path/name,
or an ERDDAP server URL and dataset ID.
Let's explore the Salish Sea NEMO model bathymetry data:
In [3]:
ds = xr.open_dataset('https://salishsea.eos.ubc.ca/erddap/griddap/ubcSSnBathymetry2V1')
See the Exploring netCDF Datasets from ERDDAP notebook for more information about ERDDAP dataset URLs.
We could have opened the same dataset from a local file with:
ds = xr.open_dataset('../../NEMO-forcing/grid/bathy_meter_SalishSea2.nc')
In [4]:
lds = xr.open_dataset('../../NEMO-forcing/grid/bathy_meter_SalishSea2.nc')
Printing the string respresentation of the ds
data structure that open_dataset()
returns gives us lots of information about the dataset and its metadata:
In [7]:
print(ds)
open_dataset()
returns an xarray.Dataset
object
that is xarray
’s multi-dimensional equivalent of a pandas.DataFrame
.
It is a dict-like container of labeled arrays (DataArray
objects) with aligned dimensions.
It is designed as an in-memory representation of the data model from the netCDF file format.
Dataset objects have four key properties:
dims
: a dictionary mapping from dimension names to the fixed length of each dimension
(e.g., {'x': 6, 'y': 6, 'time': 8}
)data_vars
: a dict-like container of DataArrays corresponding to variablescoords
: another dict-like container of DataArray
s intended to label points used in data_vars
(e.g., arrays of numbers, datetime objects or strings)attrs
: an OrderedDict
to hold arbitrary metadataLet's look at them one at a time:
In [6]:
lds
Out[6]:
In [9]:
ds.dims
In [11]:
ds.data_vars
Out[11]:
In [12]:
ds.coords
Out[12]:
So, we have a dataset that has 2 dimensions called gridX
and gridY
of size 398 and 898, respectively,
3 variables called longitude
, latitude
, and bathymetry
,
and 2 coordinates with the same names as the dimensions,
gridX
and gridY
.
The xarray
docs have a
good explanation and a diagram
about the distinction between coordinates and data variables.
If you are already familiar with working with netCDF datasets via the netCDF4-python
package,
you will note that the dims
and data_vars
attributes provide similar information to that
produced by functions in the
SalishSeaTools.nc_tools module.
xarray
provides a higher level Python interface to datasets.
We'll see how the dimensions and variables are related, and how to work with the data in the variables in a moment, but first, let's look at the dataset attributes:
In [15]:
ds.attrs
Out[15]:
Dataset attributes are metadata. They tell us about the dataset as a whole: how, when, and by whom it was created, how it has been modified, etc. The meanings of the various attributes and the conventions for them that we use in the Salish Sea MEOPAR project are documented elsewhere.
Variables also have attributes :
In [18]:
ds.longitude
Out[18]:
This tells us a whole lot of useful information about the longitude data values in our bathymetry dataset, for instance:
gridY
and gridX
dimensions, in that orderWe can access the attributes of the dataset variables using dotted notation:
In [21]:
ds.bathymetry.units, ds.bathymetry.long_name
Out[21]:
Dataset variables are xarray.DataArray
objects.
In addition to their attributes,
they carry a bunch of other useful properties and methods that you can read about in the
xarray docs.
Perhaps most importantly the data associated with the variables are stored as NumPy arrays. So, we can use NumPy indexing and slicing to access the data values. For instance, to get the latitudes and longitudes of the 4 corners of the domain:
In [31]:
ds.latitude.shape
Out[31]:
In [28]:
print('Latitudes and longitudes of domain corners:')
pt = (0, 0)
print(' 0, 0: ', ds.latitude.values[pt], ds.longitude.values[pt])
pt = (0, ds.latitude.shape[1] - 1)
print(' 0, x-max: ', ds.latitude.values[pt], ds.longitude.values[pt])
pt = (ds.latitude.shape[0] - 1, 0)
print(' y-max, 0: ', ds.latitude.values[pt], ds.longitude.values[pt])
pt = (ds.latitude.shape[0] - 1, ds.longitude.shape[1] - 1)
print(' y-max, x-max:', ds.latitude.values[pt], ds.longitude.values[pt])
You can also access the entire variable data array, or subsets of it using slicing.
In [34]:
ds.latitude.values
Out[34]:
In [36]:
ds.longitude.values[42:45, 128:135]
Out[36]:
In [37]:
ds.longitude.values[:2, :2], ds.latitude.values[-2:, -2:]
Out[37]:
Note that the zero and maximum dimension values may be omitted for slices that extend to the ends of array dimensions.