XArray Introduction

Unidata Python Workshop


Overview:

  • Teaching: 25 minutes
  • Exercises: 20 minutes

Questions

  1. What is XArray?
  2. How does XArray fit in with Numpy and Pandas?

Objectives

  1. Create a DataArray.
  2. Open netCDF data using XArray.
  3. Subset the data.

XArray

XArray expands on the capabilities of NumPy arrays, adding named dimensions, coordinates, and attributes that streamline data manipulation. It is similar in that respect to Pandas, but whereas Pandas excels at working with tabular data, XArray is focused on N-dimensional arrays of data (i.e., grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java's Common Data Model (CDM).

DataArray

The DataArray is one of the basic building blocks of XArray. It is a NumPy ndarray-like object that adds two critical pieces of functionality:

  1. Coordinate names and values are stored with the data, making slicing and indexing much more powerful
  2. It has a built-in container for attributes

In [ ]:
# Convention for import to get shortened namespace
import numpy as np
import xarray as xr

In [ ]:
# Create some sample "temperature" data
data = 283 + 5 * np.random.randn(5, 3, 4)
data

Here we create a basic DataArray by passing it just a NumPy array of random data. Note that XArray generates some basic dimension names for us.


In [ ]:
temp = xr.DataArray(data)
temp
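
As a small self-contained check of the generated names, the sketch below uses a deterministic stand-in array (the name `sample` is only for illustration and is not part of the workshop data) and prints the dimension names XArray assigns when none are given.

```python
import numpy as np
import xarray as xr

# Deterministic stand-in for the random "temperature" data (illustrative only)
sample = np.arange(60, dtype=float).reshape(5, 3, 4)
unnamed = xr.DataArray(sample)

# XArray fills in placeholder dimension names, one per axis
print(unnamed.dims)
# The shape matches the NumPy array that was passed in
print(unnamed.shape)
```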

We can also pass in our own dimension names:


In [ ]:
temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])
temp

This is already an improvement over a NumPy array, because we have names for each of the dimensions (or axes in NumPy parlance). Even better, we can take arrays representing the values for the coordinates for each of these dimensions and associate them with the data when we create the DataArray.


In [ ]:
# Use pandas to create an array of datetimes
import pandas as pd
times = pd.date_range('2018-01-01', periods=5)
times

In [ ]:
# Sample lon/lats
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)

When we create the DataArray instance, we pass in the arrays we just created:


In [ ]:
temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
temp
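
The coordinates can also be passed as a dictionary keyed by dimension name, which avoids relying on the order of a list. The sketch below rebuilds the same kind of array both ways with deterministic stand-in values (all `demo_` names are illustrative) and checks that the two results match.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in data and coordinates with the same shapes as in the cells above
demo_data = np.arange(60, dtype=float).reshape(5, 3, 4)
demo_times = pd.date_range('2018-01-01', periods=5)
demo_lats = np.linspace(25, 55, 3)
demo_lons = np.linspace(-120, -60, 4)

# Coordinates as a list, matched to dims by position
by_position = xr.DataArray(demo_data, coords=[demo_times, demo_lats, demo_lons],
                           dims=['time', 'lat', 'lon'])

# Coordinates as a dict, matched to dims by name
by_name = xr.DataArray(demo_data,
                       coords={'time': demo_times, 'lat': demo_lats, 'lon': demo_lons},
                       dims=['time', 'lat', 'lon'])

print(by_name.identical(by_position))
print(by_name['lat'].values)
```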

...and we can also set some attribute metadata:


In [ ]:
temp.attrs['units'] = 'kelvin'
temp.attrs['standard_name'] = 'air_temperature'

temp

Notice what happens if we perform a mathematical operation with the DataArray: the coordinate values persist, but the attributes are lost. This is done because it is very challenging to know if the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.


In [ ]:
# For example, convert Kelvin to Celsius
temp - 273.15
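
If the metadata is still wanted after a conversion like this, one option is to reattach it by hand and correct whatever has changed. The following is a minimal sketch on a tiny stand-in array (the names `kelvin` and `celsius` and the `'degC'` units string are assumptions for illustration), not the only way to handle attributes.

```python
import numpy as np
import xarray as xr

# Small stand-in array carrying the same attributes as the example above
kelvin = xr.DataArray(
    np.array([273.15, 283.15, 293.15]),
    dims=['time'],
    attrs={'units': 'kelvin', 'standard_name': 'air_temperature'},
)

celsius = kelvin - 273.15

# Copy the metadata that is still valid and fix the units explicitly
celsius.attrs = {**kelvin.attrs, 'units': 'degC'}
print(celsius.attrs)
```

Setting the attributes explicitly keeps the decision about what is still correct in the hands of the person doing the arithmetic.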

Selection

We can use the .sel method to select portions of our data based on these coordinate values, rather than using indices (this is similar to the CDM).


In [ ]:
temp.sel(time='2018-01-02')

.sel can also perform nearest-neighbor sampling, with an optional tolerance:


In [ ]:
from datetime import timedelta
temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))
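
The tolerance also determines what happens when no coordinate value is close enough. The sketch below uses a small one-dimensional stand-in array (the name `profile` and its values are made up) to show a lookup that succeeds and one that fails with a KeyError.

```python
import numpy as np
import xarray as xr

# Stand-in data on the same latitude grid as above: 25, 40, 55
demo_lats = np.linspace(25, 55, 3)
profile = xr.DataArray(np.array([290.0, 280.0, 270.0]),
                       coords=[demo_lats], dims=['lat'])

# 41 is within 2 degrees of the grid point at 40, so this succeeds
hit = profile.sel(lat=41, method='nearest', tolerance=2)
print(float(hit['lat']), float(hit))

# 33 is more than 2 degrees from every grid point, so the lookup fails
try:
    profile.sel(lat=33, method='nearest', tolerance=2)
    out_of_tolerance = False
except KeyError:
    out_of_tolerance = True
print(out_of_tolerance)
```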

Exercise

.interp() works similarly to .sel(). Using .interp(), get an interpolated time series "forecast" for Boulder (40°N, 105°W) or your favorite latitude/longitude location. (Documentation for interp).


In [ ]:
# Your code goes here

Solution


In [ ]:
# %load solutions/interp_solution.py

Slicing


In [ ]:
temp.sel(time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45))
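
Label-based slices include both endpoints, unlike positional slices. As a rough illustration, the sketch below builds a deterministic stand-in grid (`grid` is an illustrative name) and compares the label-based selection with the equivalent index-based one using .isel.

```python
import numpy as np
import pandas as pd
import xarray as xr

demo_times = pd.date_range('2018-01-01', periods=5)
demo_lats = np.linspace(25, 55, 3)      # 25, 40, 55
demo_lons = np.linspace(-120, -60, 4)   # -120, -100, -80, -60
grid = xr.DataArray(np.arange(60.0).reshape(5, 3, 4),
                    coords=[demo_times, demo_lats, demo_lons],
                    dims=['time', 'lat', 'lon'])

# Label-based: both ends of each slice are included
by_label = grid.sel(time=slice('2018-01-01', '2018-01-03'),
                    lon=slice(-110, -70), lat=slice(25, 45))

# Index-based equivalent: the stop index is excluded, as in NumPy
by_index = grid.isel(time=slice(0, 3), lat=slice(0, 2), lon=slice(1, 3))

print(by_label.shape)
print(by_label.equals(by_index))
```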

.loc

All of these operations can also be done within square brackets on the .loc attribute of the DataArray. This permits a much more NumPy-like syntax, though you lose the ability to specify the names of the various dimensions. Instead, the slices must be given in the same order as the dimensions (here time, lat, lon).


In [ ]:
# As done above
temp.loc['2018-01-02']

In [ ]:
# Slices follow the dimension order: time, lat, lon
temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]

In [ ]:
# This *doesn't* work however:
#temp.loc[-110:-70, 25:45,'2018-01-01':'2018-01-03']
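
.loc also accepts a dictionary keyed by dimension name, which sidesteps the ordering requirement. The sketch below, on a deterministic stand-in grid (`grid` is an illustrative name), compares the positional form, the dictionary form, and .sel.

```python
import numpy as np
import pandas as pd
import xarray as xr

demo_times = pd.date_range('2018-01-01', periods=5)
demo_lats = np.linspace(25, 55, 3)
demo_lons = np.linspace(-120, -60, 4)
grid = xr.DataArray(np.arange(60.0).reshape(5, 3, 4),
                    coords=[demo_times, demo_lats, demo_lons],
                    dims=['time', 'lat', 'lon'])

# Positional form: slices in dimension order (time, lat, lon)
positional = grid.loc[:, 25:45, -110:-70]

# Dictionary form: dimensions named explicitly, in any order
named = grid.loc[dict(lon=slice(-110, -70), lat=slice(25, 45))]

# The same selection written with .sel
selected = grid.sel(lat=slice(25, 45), lon=slice(-110, -70))

print(positional.identical(named), named.identical(selected))
```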

Opening netCDF data

With its close ties to the netCDF data model, XArray also supports netCDF as a first-class file format. This means it has easy support for opening netCDF datasets, so long as they conform to some of XArray's limitations (such as 1-dimensional coordinates).


In [ ]:
# Open sample North American Regional Reanalysis (NARR) data in netCDF format
ds = xr.open_dataset('../../data/NARR_19930313_0000.nc')
ds

This returns a Dataset object, a container for one or more DataArrays that can optionally share coordinates. We can then pull out individual fields:


In [ ]:
ds.isobaric1

or


In [ ]:
ds['isobaric1']
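
The two access styles are not always interchangeable: attribute access only works for names that are valid Python identifiers. The sketch below builds a small in-memory Dataset (the values and the `toy` name are made up; only the variable names are borrowed from the sample file) to illustrate this without needing the netCDF file.

```python
import numpy as np
import xarray as xr

# Toy Dataset with made-up values; two variables share the same coordinates
toy = xr.Dataset(
    {
        'Temperature_isobaric': (('isobaric1', 'x'), np.full((3, 2), 280.0)),
        'u-component_of_wind_isobaric': (('isobaric1', 'x'), np.zeros((3, 2))),
    },
    coords={'isobaric1': [1000.0, 850.0, 500.0], 'x': [0.0, 32.0]},
)

# Attribute access and key access return the same DataArray
print(toy.Temperature_isobaric.identical(toy['Temperature_isobaric']))

# A name containing a hyphen is not a valid identifier, so use key access
print('u-component_of_wind_isobaric'.isidentifier())
print(toy['u-component_of_wind_isobaric'].dims)
```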

Datasets also support many of the same subsetting operations as DataArray, but apply the operation to every variable that uses the selected dimension:


In [ ]:
ds_1000 = ds.sel(isobaric1=1000.0)
ds_1000

In [ ]:
ds_1000.Temperature_isobaric
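
To see how a Dataset-wide selection treats each variable, the sketch below uses a toy Dataset with made-up values. The `surface_height` variable is hypothetical and exists only to show a field that does not have the pressure dimension.

```python
import numpy as np
import xarray as xr

toy = xr.Dataset(
    {
        # Has the pressure dimension
        'Temperature_isobaric': (('isobaric1', 'x'),
                                 np.array([[290.0, 291.0],
                                           [280.0, 281.0],
                                           [250.0, 251.0]])),
        # Hypothetical variable without the pressure dimension
        'surface_height': (('x',), np.array([1500.0, 1600.0])),
    },
    coords={'isobaric1': [1000.0, 850.0, 500.0], 'x': [0.0, 32.0]},
)

toy_1000 = toy.sel(isobaric1=1000.0)

# The pressure dimension is dropped where it existed...
print(toy_1000['Temperature_isobaric'].dims)
# ...and variables that never had it are passed through unchanged
print(toy_1000['surface_height'].dims)
# The selected level is kept as a scalar coordinate
print(float(toy_1000['isobaric1']))
```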

Aggregation operations

Not only can you use the named dimensions for manual slicing and indexing of data, but you can also use them to control aggregation operations, such as the standard deviation:


In [ ]:
u_winds = ds['u-component_of_wind_isobaric']
u_winds.std(dim=['x', 'y'])
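
Aggregating over named dimensions removes those dimensions and keeps the rest. The sketch below checks this on a small stand-in array (`winds` and its values are made up) and compares the result with the equivalent NumPy calculation over axes.

```python
import numpy as np
import xarray as xr

# Stand-in wind field with dimensions (isobaric1, y, x)
winds = xr.DataArray(np.arange(24.0).reshape(2, 3, 4),
                     dims=['isobaric1', 'y', 'x'],
                     coords={'isobaric1': [1000.0, 850.0]})

# Reduce over the horizontal dimensions by name; one value per level remains
spread = winds.std(dim=['x', 'y'])
print(spread.dims)

# The same reduction in NumPy needs the axis numbers instead of names
numpy_spread = winds.values.std(axis=(1, 2))
print(np.allclose(spread.values, numpy_spread))

# Reducing over a single dimension keeps the others
mean_over_x = winds.mean(dim='x')
print(mean_over_x.dims)
```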

Exercise

Using the sample dataset, calculate the mean temperature profile (temperature as a function of pressure) over Colorado within this dataset. For this exercise, consider the bounds of Colorado to be:

  • x: -182 km to 424 km
  • y: -1450 km to -990 km

(37°N to 41°N and 102°W to 109°W projected to Lambert Conformal projection coordinates)

Solution


In [ ]:
# %load solutions/mean_profile.py

Resources

There is much more in the XArray library. To learn more, visit the XArray Documentation.