xarray
xarray
expands the utility of the time series analysis package pandas
into more than one dimension. It is actively being developed in conjunction with many other packages under the Pangeo umbrella. For example, you can run with Dask to use multiple cores on your laptop when you are working with data read in with xarray
.
NetCDF is a binary storage format for many different kinds of rectangular data. Examples include atmosphere and ocean model output, satellite images, and timeseries data. NetCDF files are intended to be device independent, and the dataset may be queried in a fast, random-access way. More information about NetCDF files can be found here. The CF conventions are used for storing NetCDF data for earth system models, so that programs can be aware of the coordinate axes used by the data cubes.
We will read in netCDF files using xarray
; a variety of other file formats will work too.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import cartopy
import cmocean.cm as cmo
import pandas as pd
import xarray as xr
An example NetCDF file containing monthly means of sea surface temperature over 160 years can be found here. We'll use the xarray
package to read this file, which has already been saved into the data
directory.
One of the useful things about xarray
is that it doesn't deal with the numbers in the file until it has to. This is called "lazy evaluation". It will note the operations you want done, but won't actually perform them until it needs to spit out numbers.
Viewing metadata is instantaneous since no calculations need to be done, even if the file is huge.
An xarray data object is a "dataset" or "data array".
In [2]:
ds = xr.open_dataset('../data/sst.mnmean.v4.nc')
# look at overview of metadata for file
ds
Out[2]:
In [3]:
# metadata for sst variable
ds['sst']
Out[3]:
In [4]:
# look at shape of sst variable
ds.sst.shape, ds.sst.units
Out[4]:
In [5]:
# view any of the metadata
ds.history
Out[5]:
In [ ]:
In [6]:
ds.lat.values
Out[6]:
Analogously to how we selected data from pandas
dataframes using .loc
and .iloc
, we extract data from xarray
datasets using .sel
and .isel
. The commands are much longer when using xarray
because we have multi-dimensional data now.
When files are read in, data arrays are read in as variables and the coordinates that they are in reference to are called "coordinates". For example, in the present dataset, we have the following coordinates:
In [7]:
ds.coords
Out[7]:
We also have the following data variables, which are the main data of the file:
In [8]:
ds.data_vars
Out[8]:
This means that we should subselect from the data variable "sst" with respect to the coordinates. We can select from none up to all of the coordinates that the sst variable is respect to. As we can see in the following cell, the variable "sst" has coordinates "lat", "lon", and "time".
In [9]:
ds.sst.coords
Out[9]:
We'll start with a small example: let's choose a single time to plot. Here is how to choose a specific time:
In [10]:
ds.sst.sel(time='1954-6-1')
Out[10]:
Now let's plot it! Note that we are still using cartopy
to plot our maps and we therefore still need to input the projection information with the "transform" keyword argument.
In [11]:
proj = cartopy.crs.Mollweide(central_longitude=180)
pc = cartopy.crs.PlateCarree()
fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
mappable = ax.contourf(ds.lon, ds.lat, ds.sst.sel(time='1954-6-1'), 10, cmap=cmo.thermal, transform=pc)
You can also select a "nearest" point in time if you aren't sure exactly when your time slices are:
In [12]:
ds['sst'].sel(time='1954-05-23', method='nearest')
Out[12]:
We can either select by coordinate type, such as in the following cell where we choose all times between (and including) the years 1900 and 1950, longtitudes between 260 and 280 degrees, and latitude between 16 and 30 degrees.
In [13]:
ds.sst.sel(time=slice('1900','1950'), lon=slice(-100+360, -80+360), lat=slice(30,16))
Out[13]:
.... or by index, such as in the following cell where we select the first index of data in terms of with time, longitude, and latitude:
In [23]:
ds.sst.isel(time=0, lon=0, lat=0)
Out[23]:
In [28]:
ds.sst.mean('time')
Out[28]:
In [29]:
ds.sst.sum(('lat','lon'))
Out[29]:
The netCDF library can be compiled such that it is 'THREDDS enabled', which means that you can put in a URL instead of a filename. This allows access to large remote datasets, without having to download the entire file. You can find a large list of datasets served via an OpenDAP/THREDDs server here.
Let's look at the ESRL/NOAA 20th Century Reanalysis – Version 2. You can access the data by the following link (this is the link of the .dds
and .das
files without the extension.):
In [15]:
loc = 'http://apdrc.soest.hawaii.edu/dods/public_data/Reanalysis_Data/NOAA_20th_Century/V2c/daily/monolevel/cprat'
ds2 = xr.open_dataset(loc)
ds2
Out[15]:
In [16]:
ds2['cprat'].long_name
Out[16]:
In [17]:
proj = cartopy.crs.Sinusoidal(central_longitude=180)
fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ax.coastlines(linewidth=0.25)
# use the last time available
mappable = ax.contourf(ds2.lon, ds2.lat, ds2.cprat.isel(time=-1), 20, cmap=cmo.tempo, transform=pc)
ax.set_title(pd.Timestamp(ds2.time[-1].values).isoformat()[:10]) # or use .strftime instead of .isoformat
fig.colorbar(mappable).set_label('%s' % ds2['cprat'].long_name)
Pick another variable from this dataset. Inspect and plot the variable in a similar manner to precipitation.
Find another dataset on a THREDDS server at SOEST (or elsewhere), pick a variable, and plot it.
In [ ]:
Note that you can also just plot against the included coordinates with built-in convenience functions (this is analogous to pandas
which was for one dimension). The sst is being plotted against longitude and latitude, which is flattening it out.
In [18]:
ds.sst.sel(time='1954-6-1').plot()#transform=pc) # the plot's projection
Out[18]:
In [19]:
proj = cartopy.crs.Mollweide(central_longitude=180)
fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ds.sst.sel(time='1954-6-1').plot(transform=pc) # the plot's projection
Out[19]:
In [20]:
seasonal_mean = ds.groupby('time.season').mean('time')
seasonal_mean
Out[20]:
In [21]:
fname = 'test.nc'
seasonal_mean.to_netcdf(fname)
In [22]:
xr.open_dataset(fname)
Out[22]:
In [ ]: