In [1]:
name = '2016-10-28-xarray-intro'
title = 'Introduction to xarray'
tags = 'netcdf, io, xarray'
author = 'Denis Sergeev'
In [2]:
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML, Image
html = connect_notebook_to_post(name, title, tags, author)
xarray
Python library is great for analysing multi-dimensional arrays of data with labelled dimensions, which is a common situation in geosciences.
According to the docs, xarray has two core data structures. Both are fundamentally N-dimensional:
DataArray
is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series
.Dataset
is a multi-dimensional, in-memory array database. It serves a similar purpose in xarray to the pandas.DataFrame
.Essentially, xarray adds dimensions names and coordinate indexes to numpy.ndarray
. This significantly simplifies such operations like indexing, subsetting, broadcasting and even plotting.
Let's have a look.
First, we import some necessary modules, including of course xarray
:
In [3]:
import numpy as np
import xarray as xr
In [4]:
import warnings
warnings.filterwarnings('ignore')
In [5]:
import matplotlib.pyplot as plt
%matplotlib inline
The following examples will be done for ERA-Interim reanalysis data of 2-m temperature and mean sea level pressure. As we will see, these are global daily data for 2014-2015.
In [6]:
ds = xr.open_dataset('../data/tsurf_slp.nc')
ds
Out[6]:
Among other things, xarray prints out metadata in a much more readable format than, say, netCDF4.
Since a Dataset
is a dict
-like object, variables can be accessed by keys.
In [7]:
ds['t2m']
Out[7]:
But often it's even more convenient to access them as attributes of a Dataset
.
In [8]:
t2m = ds.t2m
In [9]:
t2m.shape
Out[9]:
In [10]:
t2m.ndim
Out[10]:
etc...
t2m[456, :, 123]
t2m.loc[dict(longitude=2.25)]
t2m[:2, :, 0]
t2m.isel(longitude=0, time=slice(None, 2))
In [11]:
import datetime
In [12]:
time_start = datetime.datetime.strptime('2014-02-03', '%Y-%m-%d')
time_end = datetime.datetime.strptime('2014-02-05', '%Y-%m-%d')
print('{}\n{}'.format(time_start, time_end))
t2m.sel(time=slice(time_start, time_end))
The following line would not work, because neither longitude coordinate array contains a value of 10
, nor latitude contains 20
.
t2m.sel(longitude=10, latitude=20) # Results in KeyError
However, we can use a built-in nearest-neighbour lookup method to find an element closest to the given coordinate values.
t2m.sel(longitude=10, latitude=15, method='nearest')
Let's extract a subset of the original data:
In [13]:
new_data = t2m.sel(longitude=slice(-5, 10), latitude=slice(55, 44))
In [14]:
new_data.shape
Out[14]:
Saving data to a netCDF file cannot get any easier:
new_data.to_dataset('test.nc')
In [15]:
new_data.to_dataframe().head()
Out[15]:
In [16]:
new_data.to_series().head()
Out[16]:
In [17]:
import pandas as pd
import string
df = pd.DataFrame(np.random.randn(365, 5),
index=pd.date_range(start='2014-1-1', periods=365),
columns=list(string.ascii_letters[:5]))
In [18]:
df.head()
Out[18]:
In [19]:
df_ds = xr.Dataset.from_dataframe(df)
df_ds
Out[19]:
xarray data structures allow us to perform resampling really easily:
In [20]:
ten_daily = t2m.resample('W', dim='time', how='mean')
Being pandas
's sibling, xarray supports groupby methods. For example, in the following line of code we do averaging of temperature by seasons (just 1 line of code!).
In [21]:
seas = t2m.groupby('time.season').mean('time')
print(seas.season)
Note how the time dimension was transformed into a 'season' dimension with the appropriate labels.
In [22]:
import cartopy.crs as ccrs
from calendar import month_name
You can use OO approach:
In [23]:
t2m.isel(time=10).plot.contourf()
Out[23]:
Or create a figure and axis first, and then pass that as an argument to a plotting fucntions
In [24]:
fig, ax = plt.subplots(figsize=(10, 3),
subplot_kw=dict(projection=ccrs.PlateCarree()))
t2m.isel(time=10).plot.contourf(ax=ax)
ax.coastlines()
Out[24]:
First, let's wrap the previous example into a small function:
In [25]:
def plot_field(da, ax=None, title=None):
if ax is None:
fig, ax = plt.subplots(figsize=(10, (da.shape[0] / da.shape[1]) * 10),
subplot_kw=dict(projection=ccrs.PlateCarree()))
da.plot.contourf(ax=ax)
ax.coastlines()
if title is not None:
ax.set_title(title)
Next, perform a monthly averaging on the original global data:
In [26]:
monthly_t2m = t2m.groupby('time.month').mean('time')
And finally, plot the result.
In [27]:
fig, axs = plt.subplots(nrows=4, ncols=3, figsize=(14, 10),
subplot_kw=dict(projection=ccrs.PlateCarree()))
fig.suptitle('Monthly averages of 2-m temperature')
axes = axs.flatten()
for month in range(1, 13):
ax = axes[month-1]
plot_field(monthly_t2m.sel(month=month),
ax=ax, title=month_name[month])
It's always good to close the file you are reading data from.
In [28]:
ds.close()
In [29]:
HTML(html)
Out[29]: