In [ ]:
%matplotlib inline
from dask.distributed import Client
import xarray as xr
Starting the Dask Client is optional. It will provide a dashboard which is useful to gain insight on the computation.
The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. This can take some effort to arrange your windows, but seeing them both at the same is very useful when learning.
In [ ]:
client = Client(n_workers=8, threads_per_worker=2, memory_limit='1GB')
client
We will use some of xarray's tutorial data for this example. By specifying the chunk shape, xarray will automatically create Dask arrays for each data variable in the Dataset
. In xarray, Datasets
are dict-like container of labeled arrays, analogous to the pandas.DataFrame
. Note that we're taking advantage of xarray's dimension labels when specifying chunk shapes.
In [ ]:
ds = xr.tutorial.open_dataset('air_temperature',
chunks={'lat': 25, 'lon': 25, 'time': -1})
ds
Quickly inspecting the Dataset
above, we'll note that this Dataset
has three dimensions akin to axes in NumPy (lat
, lon
, and time
), three coordinate variables akin to pandas.Index
objects (also named lat
, lon
, and time
), and one data variable (air
). Xarray also holds Dataset specific metadata in as attributes.
In [ ]:
da = ds['air']
da
Each data variable in xarray is called a DataArray
. These are the fundemental labeled array object in xarray. Much like the Dataset
, DataArrays
also have dimensions and coordinates that support many of its label-based opperations.
In [ ]:
da.data
Accessing the underlying array of data is done via the data
property. Here we can see that we have a Dask array. If this array were to be backed by a NumPy array, this property would point to the actual values in the array.
In [ ]:
da2 = da.groupby('time.month').mean('time')
da3 = da - da2
da3
Call .compute()
or .load()
when you want your result as a xarray.DataArray
with data stored as NumPy arrays.
If you started Client()
above then you may want to watch the status page during computation.
In [ ]:
computed_da = da3.load()
type(computed_da.data)
In [ ]:
computed_da
In [ ]:
da = da.persist()
In [ ]:
da.resample(time='1w').mean('time').std('time')
In [ ]:
da.resample(time='1w').mean('time').std('time').load().plot(figsize=(12, 8))
and rolling window operations:
In [ ]:
da_smooth = da.rolling(time=30).mean().persist()
da_smooth
Since xarray stores each of its coordinate variables in memory, slicing by label is trivial and entirely lazy.
In [ ]:
%time da.sel(time='2013-01-01T18:00:00')
In [ ]:
%time da.sel(time='2013-01-01T18:00:00').load()
Almost all of xarray’s built-in operations work on Dask arrays. If you want to use a function that isn’t wrapped by xarray, one option is to extract Dask arrays from xarray objects (.data) and use Dask directly.
Another option is to use xarray’s apply_ufunc()
function, which can automate embarrassingly parallel “map” type operations where a functions written for processing NumPy arrays should be repeatedly applied to xarray objects containing Dask arrays. It works similarly to dask.array.map_blocks()
and dask.array.atop()
, but without requiring an intermediate layer of abstraction.
Here we show an example using NumPy operations and a fast function from bottleneck
, which we use to calculate Spearman’s rank-correlation coefficient:
In [ ]:
import numpy as np
import xarray as xr
import bottleneck
def covariance_gufunc(x, y):
return ((x - x.mean(axis=-1, keepdims=True))
* (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)
def pearson_correlation_gufunc(x, y):
return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))
def spearman_correlation_gufunc(x, y):
x_ranks = bottleneck.rankdata(x, axis=-1)
y_ranks = bottleneck.rankdata(y, axis=-1)
return pearson_correlation_gufunc(x_ranks, y_ranks)
def spearman_correlation(x, y, dim):
return xr.apply_ufunc(
spearman_correlation_gufunc, x, y,
input_core_dims=[[dim], [dim]],
dask='parallelized',
output_dtypes=[float])
In the examples above, we were working with an some air temperature data. For this example, we'll calculate the spearman correlation using the raw air temperature data with the smoothed version that we also created (da_smooth
). For this, we'll also have to rechunk the data ahead of time.
In [ ]:
corr = spearman_correlation(da.chunk({'time': -1}),
da_smooth.chunk({'time': -1}),
'time')
corr
In [ ]:
corr.plot(figsize=(12, 8))
In [ ]: