Exploring netCDF Datasets Using the xarray Package

This notebook provides discussion, examples, and best practices for working with netCDF datasets in Python using the xarray package. Topics include:

  • The xarray package
  • Reading netCDF datasets into Python data structures
  • Exploring netCDF dataset dimensions, variables, and attributes
  • Working with netCDF variable data as NumPy arrays

This notebook is a companion to the Exploring netCDF Files and Exploring netCDF Datasets from ERDDAP notebooks. Those notebooks focus on using the netcdf4-python package to read netCDF datasets from local files and ERDDAP servers on the Internet, respectively.

This notebook is about using the xarray package to work with netCDF datasets. xarray uses the netcdf4-python package behind the scenes, so datasets can be read from either local files or from ERDDAP servers.

xarray is a Python package that applies the concepts and tools for working with labeled data structures from the pandas package to the physical sciences. Whereas pandas excels at manipulating tabular data, xarray brings similar power to working with N-dimensional arrays.

If you are already familiar with working with netCDF datasets via the netCDF4-python package, you can think of xarray as a higher-level Python tool for working with those datasets.

Creating netCDF files and working with their attribute metadata is documented elsewhere: http://salishsea-meopar-docs.readthedocs.org/en/latest/code-notes/salishsea-nemo/nemo-forcing/netcdf4.html.

This notebook assumes that you are working in Python 3. If you don't have a Python 3 environment set up, please see our Anaconda Python Distribution docs for instructions on how to set one up.

xarray and some of the packages that it depends on are not included in the default Anaconda collection of packages, so you may need to install them explicitly:

$ conda install xarray netCDF4 bottleneck

bottleneck is a package that speeds up NaN-skipping and rolling window aggregations.

If you are using a version of Python earlier than 3.5 (check with the command python --version), you should also install cyordereddict to speed internal operations with xarray data structures. It is not required for Python ≥3.5 because collections.OrderedDict has been rewritten in C, making it even faster than cyordereddict.
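
For example, in a Python 3.4 environment that extra install looks like this (whether conda finds the package depends on your configured channels; pip install cyordereddict is an alternative):

$ conda install cyordereddict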

Let's start with some imports. It's good Python form to keep all of our imports at the top of the file.


In [1]:
import numpy as np
import xarray as xr

Note that we alias numpy to np and xarray to xr so that we don't have to type as much.

xarray provides an open_dataset function that allows us to load a netCDF dataset into a Python data structure by simply passing in a file path/name, or an ERDDAP server URL and dataset ID.

Let's explore the Salish Sea NEMO model bathymetry data:


In [3]:
ds = xr.open_dataset('https://salishsea.eos.ubc.ca/erddap/griddap/ubcSSnBathymetry2V1')

See the Exploring netCDF Datasets from ERDDAP notebook for more information about ERDDAP dataset URLs.

We could have opened the same dataset from a local file with:

ds = xr.open_dataset('../../NEMO-forcing/grid/bathy_meter_SalishSea2.nc')

In the next cell we load the local file into a separate data structure, lds, so that we can look at both versions of the dataset:


In [4]:
lds = xr.open_dataset('../../NEMO-forcing/grid/bathy_meter_SalishSea2.nc')

Printing the string representation of the ds data structure that open_dataset() returns gives us lots of information about the dataset and its metadata:


In [7]:
print(ds)


<xarray.Dataset>
Dimensions:     (gridX: 398, gridY: 898)
Coordinates:
  * gridY       (gridY) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * gridX       (gridX) int16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
Data variables:
    longitude   (gridY, gridX) float32 ...
    latitude    (gridY, gridX) float32 ...
    bathymetry  (gridY, gridX) float64 ...
Attributes:
    acknowledgement: MEOPAR, ONC, Compute Canada
    cdm_data_type: Grid
    comment: Bathymetry, Latitudes and Longitudes
    Conventions: CF-1.6, COARDS, ACDD-1.3
    coverage_content_type: modelResult
    creator_email: sallen@eos.ubc.ca
    creator_name: Salish Sea MEOPAR Project Contributors
    creator_url: https://salishsea-meopar-docs.readthedocs.org/
    drawLandMask: over
    history: [2016-02-05 16:35:19] Created dataset.
[2016-03-02 18:08:56] Changed all variables to zlib=True.
[2016-03-02 18:08:56] Added least_significant_digit=1 and fill_value=0 to bathymetry variable.
[2016-03-03 12:28:37] Added valid_range attribute to all variables.
2016-04-14T21:46:25Z (local files)
2016-04-14T21:46:25Z https://salishsea.eos.ubc.ca/erddap/griddap/ubcSSnBathymetry2V1.das
    infoUrl: https://salishsea-meopar-tools.readthedocs.org/en/latest/results_server/index.html#salish-sea-model-results
    institution: UBC EOAS
    institution_fullname: Earth, Ocean & Atmospheric Sciences, University of British Columbia
    keywords: bathymetry, bottom, data, model results, depth, floor, latitude, longitude, nemo, ocean, oceans,
Oceans > Bathymetry/Seafloor Topography > Bathymetry, salishsea, sea, sea_floor_depth, seafloor, topography
    keywords_vocabulary: GCMD Science Keywords
    license: The Salish Sea MEOPAR NEMO model results are copyright 2013-2016
by the Salish Sea MEOPAR Project Contributors and The University of British Columbia.

They are licensed under the Apache License, Version 2.0. http://www.apache.org/licenses/LICENSE-2.0
    project: Salish Sea MEOPAR NEMO Model
    references: https://bitbucket.org/salishsea/nemo-forcing/src/tipgrid/mesh_mask_SalishSea2.nc
    source: https://bitbucket.org/salishsea/tools/src/tip/bathymetry/NEMOBathymetryfromMeshMask.ipynb
    sourceUrl: (local files)
    standard_name_vocabulary: CF Standard Name Table v29
    summary: Salish Sea NEMO Model Grid, Geo-location and Bathymetry, v1

Longitude, latitude, and bathymetry of the Salish Sea NEMO model grid.
The bathymetry values are those calculated by NEMO from the input bathymetry file.
NEMO modifies the input bathymetry to remove isolated holes, and too-small partial steps.
The model grid includes the Juan de Fuca Strait, the Strait of Georgia, Puget Sound,
and Johnstone Strait on the coasts of Washington State and British Columbia.

v1: longitude, latitude and b...
    title: Salish Sea NEMO Model Grid, Geo-location and Bathymetry, v1

open_dataset() returns an xarray.Dataset object that is xarray’s multi-dimensional equivalent of a pandas.DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.
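
Because a Dataset is dict-like, its variables can be accessed by key as well as by attribute; a minimal sketch, assuming the ds dataset opened above:

ds['bathymetry']   # dict-style access to one of the DataArray objects
ds.bathymetry      # equivalent attribute-style access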

Dataset objects have four key properties:

  • dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
  • data_vars: a dict-like container of DataArrays corresponding to variables
  • coords: another dict-like container of DataArrays intended to label points used in data_vars (e.g., arrays of numbers, datetime objects or strings)
  • attrs: an OrderedDict to hold arbitrary metadata

Before looking at those properties one at a time, here is the local file version of the dataset, lds, for comparison; note that it uses the dimension and variable names stored in the netCDF file itself:


In [6]:
lds


Out[6]:
<xarray.Dataset>
Dimensions:     (x: 398, y: 898)
Coordinates:
  * x           (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * y           (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    nav_lon     (y, x) float64 -123.4 -123.4 -123.4 -123.4 -123.4 -123.4 ...
    nav_lat     (y, x) float64 46.86 46.86 46.86 46.87 46.87 46.87 46.87 ...
    Bathymetry  (y, x) float64 nan nan nan nan nan nan nan nan nan nan nan ...
Attributes:
    Conventions: CF-1.6
    title: Salish Sea NEMO Bathymetry
    institution: Dept of Earth, Ocean & Atmospheric Sciences, University of British Columbia
    references: https://bitbucket.org/salishsea/nemo-forcing/src/tip/grid/bathy_meter_SalishSea.nc
    comment: Based on 1_bathymetry_seagrid_WestCoast.nc file from 2-Oct-2013 WCSD_PREP tarball provided by J-P Paquin.
    source: 
https://bitbucket.org/salishsea/tools/src/tip/bathymetry/SalishSeaBathy.ipynb
https://bitbucket.org/salishsea/tools/src/tip/bathymetry/SmoothMouthJdF.ipynb

    history: 
[2013-10-30 13:18] Created netCDF4 zlib=True dataset.
[2013-10-30 15:22] Set depths between 0 and 4m to 4m and those >428m to 428m.
[2013-10-31 17:10] Algorithmic smoothing.
[2013-11-21 19:53] Reverted to pre-smothing dataset (repo rev 3b301b5b9b6d).
[2013-11-21 20:14] Updated dataset and variable attributes to CF-1.6 conventions & project standards.
[2013-11-21 20:47] Removed east end of Jervis Inlet and Toba Inlet region due to deficient source bathymetry data in Cascadia dataset.
[2013-11...

In [9]:
ds.dims


Out[9]:
Frozen(SortedKeysDict({'gridX': 398, 'gridY': 898}))
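
dims behaves like a dictionary, so an individual dimension's length can be looked up by name; for this dataset:

ds.dims['gridX'], ds.dims['gridY']   # (398, 898)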

In [11]:
ds.data_vars


Out[11]:
Data variables:
    longitude   (gridY, gridX) float32 ...
    latitude    (gridY, gridX) float32 ...
    bathymetry  (gridY, gridX) float64 ...

In [12]:
ds.coords


Out[12]:
Coordinates:
  * gridY    (gridY) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * gridX    (gridX) int16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...

So, we have a dataset with 2 dimensions, gridX and gridY, of sizes 398 and 898, respectively; 3 variables, longitude, latitude, and bathymetry; and 2 coordinates that share the names of the dimensions, gridX and gridY.
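
The variable and coordinate names can be pulled out as plain Python lists if you need to loop over them; for this dataset:

list(ds.data_vars)   # ['longitude', 'latitude', 'bathymetry']
list(ds.coords)      # ['gridY', 'gridX']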

The xarray docs have a good explanation and a diagram about the distinction between coordinates and data variables.

If you are already familiar with working with netCDF datasets via the netCDF4-python package, you will note that the dims and data_vars attributes provide similar information to that produced by functions in the SalishSeaTools.nc_tools module. xarray provides a higher-level Python interface to datasets.

We'll see how the dimensions and variables are related, and how to work with the data in the variables in a moment, but first, let's look at the dataset attributes:


In [15]:
ds.attrs


Out[15]:
OrderedDict([('acknowledgement', 'MEOPAR, ONC, Compute Canada'),
             ('cdm_data_type', 'Grid'),
             ('comment', 'Bathymetry, Latitudes and Longitudes'),
             ('Conventions', 'CF-1.6, COARDS, ACDD-1.3'),
             ('coverage_content_type', 'modelResult'),
             ('creator_email', 'sallen@eos.ubc.ca'),
             ('creator_name', 'Salish Sea MEOPAR Project Contributors'),
             ('creator_url', 'https://salishsea-meopar-docs.readthedocs.org/'),
             ('drawLandMask', 'over'),
             ('history',
              '[2016-02-05 16:35:19] Created dataset.\n[2016-03-02 18:08:56] Changed all variables to zlib=True.\n[2016-03-02 18:08:56] Added least_significant_digit=1 and fill_value=0 to bathymetry variable.\n[2016-03-03 12:28:37] Added valid_range attribute to all variables.\n2016-04-14T21:46:25Z (local files)\n2016-04-14T21:46:25Z https://salishsea.eos.ubc.ca/erddap/griddap/ubcSSnBathymetry2V1.das'),
             ('infoUrl',
              'https://salishsea-meopar-tools.readthedocs.org/en/latest/results_server/index.html#salish-sea-model-results'),
             ('institution', 'UBC EOAS'),
             ('institution_fullname',
              'Earth, Ocean & Atmospheric Sciences, University of British Columbia'),
             ('keywords',
              'bathymetry, bottom, data, model results, depth, floor, latitude, longitude, nemo, ocean, oceans,\nOceans > Bathymetry/Seafloor Topography > Bathymetry, salishsea, sea, sea_floor_depth, seafloor, topography'),
             ('keywords_vocabulary', 'GCMD Science Keywords'),
             ('license',
              'The Salish Sea MEOPAR NEMO model results are copyright 2013-2016\nby the Salish Sea MEOPAR Project Contributors and The University of British Columbia.\n\nThey are licensed under the Apache License, Version 2.0. http://www.apache.org/licenses/LICENSE-2.0'),
             ('project', 'Salish Sea MEOPAR NEMO Model'),
             ('references',
              'https://bitbucket.org/salishsea/nemo-forcing/src/tipgrid/mesh_mask_SalishSea2.nc'),
             ('source',
              'https://bitbucket.org/salishsea/tools/src/tip/bathymetry/NEMOBathymetryfromMeshMask.ipynb'),
             ('sourceUrl', '(local files)'),
             ('standard_name_vocabulary', 'CF Standard Name Table v29'),
             ('summary',
              'Salish Sea NEMO Model Grid, Geo-location and Bathymetry, v1\n\nLongitude, latitude, and bathymetry of the Salish Sea NEMO model grid.\nThe bathymetry values are those calculated by NEMO from the input bathymetry file.\nNEMO modifies the input bathymetry to remove isolated holes, and too-small partial steps.\nThe model grid includes the Juan de Fuca Strait, the Strait of Georgia, Puget Sound,\nand Johnstone Strait on the coasts of Washington State and British Columbia.\n\nv1: longitude, latitude and bathymetry variables'),
             ('title',
              'Salish Sea NEMO Model Grid, Geo-location and Bathymetry, v1')])

Dataset attributes are metadata. They tell us about the dataset as a whole: how, when, and by whom it was created, how it has been modified, etc. The meanings of the various attributes and the conventions for them that we use in the Salish Sea MEOPAR project are documented elsewhere.
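
Individual dataset attributes can be looked up in the attrs mapping by name; for example:

ds.attrs['title']   # 'Salish Sea NEMO Model Grid, Geo-location and Bathymetry, v1'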

Variables also have attributes:


In [18]:
ds.longitude


Out[18]:
<xarray.DataArray 'longitude' (gridY: 898, gridX: 398)>
[357404 values with dtype=float32]
Coordinates:
  * gridY    (gridY) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * gridX    (gridX) int16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Attributes:
    _ChunkSize: [898 398]
    colorBarMaximum: 180.0
    colorBarMinimum: -180.0
    long_name: Longitude
    standard_name: longitude
    units: degrees_east
    valid_range: [-126.40029144 -121.31835175]

This tells us a whole lot of useful information about the longitude data values in our bathymetry dataset, for instance:

  • They are 32-bit floating point values
  • They are associated with the gridY and gridX dimensions, in that order
  • The units are degrees measured eastward (from the Greenwich meridian)
  • etc.

We can access the attributes of the dataset variables using dotted notation:


In [21]:
ds.bathymetry.units, ds.bathymetry.long_name


Out[21]:
('m', 'Depth of Bottom')

Dataset variables are xarray.DataArray objects. In addition to their attributes, they carry a bunch of other useful properties and methods that you can read about in the xarray docs.
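
A few of those properties, sketched here for the bathymetry variable (the values follow from the dataset summary shown above):

ds.bathymetry.dims    # ('gridY', 'gridX')
ds.bathymetry.shape   # (898, 398)
ds.bathymetry.dtype   # dtype('float64')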

Perhaps most importantly, the data associated with the variables are stored as NumPy arrays. So, we can use NumPy indexing and slicing to access the data values. For instance, to get the latitudes and longitudes of the 4 corners of the domain:


In [31]:
ds.latitude.shape


Out[31]:
(898, 398)

In [28]:
print('Latitudes and longitudes of domain corners:')
pt = (0, 0)
print('  0, 0:        ', ds.latitude.values[pt], ds.longitude.values[pt])
pt = (0, ds.latitude.shape[1] - 1)
print('  0, x-max:    ', ds.latitude.values[pt], ds.longitude.values[pt])
pt = (ds.latitude.shape[0] - 1, 0)
print('  y-max, 0:    ', ds.latitude.values[pt], ds.longitude.values[pt])
pt = (ds.latitude.shape[0] - 1, ds.longitude.shape[1] - 1)
print('  y-max, x-max:', ds.latitude.values[pt], ds.longitude.values[pt])


Latitudes and longitudes of domain corners:
  0, 0:         46.8597 -123.429
  0, x-max:     47.6009 -121.318
  y-max, 0:     50.3899 -126.4
  y-max, x-max: 51.1048 -124.342

You can also access the entire variable data array, or subsets of it, using slicing.


In [34]:
ds.latitude.values


Out[34]:
array([[ 46.85966492,  46.86154556,  46.86342621, ...,  47.59721375,
         47.59906769,  47.60092163],
       [ 46.86278915,  46.86481476,  46.86677933, ...,  47.60125732,
         47.60311127,  47.60496521],
       [ 46.86606979,  46.86814499,  46.87015915, ...,  47.60529709,
         47.60715485,  47.60900879],
       ..., 
       [ 50.38191605,  50.38397598,  50.38602448, ...,  51.09400177,
         51.09560776,  51.09720612],
       [ 50.38591766,  50.38798523,  50.39004135, ...,  51.09781265,
         51.0994072 ,  51.10100174],
       [ 50.38992691,  50.39200592,  50.39406967, ...,  51.10162354,
         51.10321808,  51.10480118]], dtype=float32)

In [36]:
ds.longitude.values[42:45, 128:135]


Out[36]:
array([[-122.884552  , -122.87927246, -122.87399292, -122.86871338,
        -122.86342621, -122.85814667, -122.85286713],
       [-122.88778687, -122.88250732, -122.87722778, -122.87194824,
        -122.8666687 , -122.86138916, -122.85610962],
       [-122.89102936, -122.88574982, -122.88047028, -122.87519073,
        -122.86991119, -122.86463165, -122.85934448]], dtype=float32)

In [37]:
ds.longitude.values[:2, :2], ds.latitude.values[-2:, -2:]


Out[37]:
(array([[-123.42943573, -123.42411804],
        [-123.43196869, -123.42677307]], dtype=float32),
 array([[ 51.0994072 ,  51.10100174],
        [ 51.10321808,  51.10480118]], dtype=float32))

Note that the zero and maximum dimension values may be omitted for slices that extend to the ends of array dimensions.
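
For example, the slices in the previous cell could be written out in full as:

ds.longitude.values[0:2, 0:2], ds.latitude.values[896:898, 396:398]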