xarrayxarray expands the utility of the time series analysis package pandas into more than one dimension. It is actively being developed in conjunction with many other packages under the Pangeo umbrella. For example, you can run with Dask to use multiple cores on your laptop when you are working with data read in with xarray.
NetCDF is a binary storage format for many different kinds of rectangular data. Examples include atmosphere and ocean model output, satellite images, and timeseries data. NetCDF files are intended to be device independent, and the dataset may be queried in a fast, random-access way. More information about NetCDF files can be found here. The CF conventions are used for storing NetCDF data for earth system models, so that programs can be aware of the coordinate axes used by the data cubes.
We will read in netCDF files using xarray; a variety of other file formats will work too.
In [1]:
    
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import cartopy
import cmocean.cm as cmo
import pandas as pd
import xarray as xr
    
An example NetCDF file containing monthly means of sea surface temperature over 160 years can be found here. We'll use the xarray package to read this file, which has already been saved into the data directory.
One of the useful things about xarray is that it doesn't deal with the numbers in the file until it has to. This is called "lazy evaluation". It will note the operations you want done, but won't actually perform them until it needs to spit out numbers.
Viewing metadata is instantaneous since no calculations need to be done, even if the file is huge.
An xarray data object is a "dataset" or "data array".
In [2]:
    
ds = xr.open_dataset('../data/sst.mnmean.v4.nc')
# look at overview of metadata for file
ds
    
    Out[2]:
In [3]:
    
# metadata for sst variable
ds['sst']
    
    Out[3]:
In [4]:
    
# look at shape of sst variable
ds.sst.shape, ds.sst.units
    
    Out[4]:
In [5]:
    
# view any of the metadata
ds.history
    
    Out[5]:
In [ ]:
    
    
In [6]:
    
ds.lat.values
    
    Out[6]:
Analogously to how we selected data from pandas dataframes using .loc and .iloc, we extract data from xarray datasets using .sel and .isel. The commands are much longer when using xarray because we have multi-dimensional data now.
When files are read in, data arrays are read in as variables and the coordinates that they are in reference to are called "coordinates". For example, in the present dataset, we have the following coordinates:
In [7]:
    
ds.coords
    
    Out[7]:
We also have the following data variables, which are the main data of the file:
In [8]:
    
ds.data_vars
    
    Out[8]:
This means that we should subselect from the data variable "sst" with respect to the coordinates. We can select from none up to all of the coordinates that the sst variable is respect to. As we can see in the following cell, the variable "sst" has coordinates "lat", "lon", and "time".
In [9]:
    
ds.sst.coords
    
    Out[9]:
We'll start with a small example: let's choose a single time to plot. Here is how to choose a specific time:
In [10]:
    
ds.sst.sel(time='1954-6-1')
    
    Out[10]:
Now let's plot it! Note that we are still using cartopy to plot our maps and we therefore still need to input the projection information with the "transform" keyword argument.
In [11]:
    
proj = cartopy.crs.Mollweide(central_longitude=180)
pc = cartopy.crs.PlateCarree()
fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
mappable = ax.contourf(ds.lon, ds.lat, ds.sst.sel(time='1954-6-1'), 10, cmap=cmo.thermal, transform=pc)
    
    
You can also select a "nearest" point in time if you aren't sure exactly when your time slices are:
In [12]:
    
ds['sst'].sel(time='1954-05-23', method='nearest')
    
    Out[12]:
We can either select by coordinate type, such as in the following cell where we choose all times between (and including) the years 1900 and 1950, longtitudes between 260 and 280 degrees, and latitude between 16 and 30 degrees.
In [13]:
    
ds.sst.sel(time=slice('1900','1950'), lon=slice(-100+360, -80+360), lat=slice(30,16))
    
    Out[13]:
.... or by index, such as in the following cell where we select the first index of data in terms of with time, longitude, and latitude:
In [23]:
    
ds.sst.isel(time=0, lon=0, lat=0)
    
    Out[23]:
In [28]:
    
ds.sst.mean('time')
    
    Out[28]:
In [29]:
    
ds.sst.sum(('lat','lon'))
    
    Out[29]:
The netCDF library can be compiled such that it is 'THREDDS enabled', which means that you can put in a URL instead of a filename. This allows access to large remote datasets, without having to download the entire file. You can find a large list of datasets served via an OpenDAP/THREDDs server here.
Let's look at the ESRL/NOAA 20th Century Reanalysis – Version 2. You can access the data by the following link (this is the link of the .dds and .das files without the extension.):
In [15]:
    
loc = 'http://apdrc.soest.hawaii.edu/dods/public_data/Reanalysis_Data/NOAA_20th_Century/V2c/daily/monolevel/cprat'
ds2 = xr.open_dataset(loc)
ds2
    
    Out[15]:
In [16]:
    
ds2['cprat'].long_name
    
    Out[16]:
In [17]:
    
proj = cartopy.crs.Sinusoidal(central_longitude=180)
fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ax.coastlines(linewidth=0.25)
# use the last time available
mappable = ax.contourf(ds2.lon, ds2.lat, ds2.cprat.isel(time=-1), 20, cmap=cmo.tempo, transform=pc)
ax.set_title(pd.Timestamp(ds2.time[-1].values).isoformat()[:10])  # or use .strftime instead of .isoformat
fig.colorbar(mappable).set_label('%s' % ds2['cprat'].long_name)
    
    
Pick another variable from this dataset. Inspect and plot the variable in a similar manner to precipitation.
Find another dataset on a THREDDS server at SOEST (or elsewhere), pick a variable, and plot it.
In [ ]:
    
    
Note that you can also just plot against the included coordinates with built-in convenience functions (this is analogous to pandas which was for one dimension). The sst is being plotted against longitude and latitude, which is flattening it out.
In [18]:
    
ds.sst.sel(time='1954-6-1').plot()#transform=pc)  # the plot's projection
    
    Out[18]:
    
In [19]:
    
proj = cartopy.crs.Mollweide(central_longitude=180)
fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ds.sst.sel(time='1954-6-1').plot(transform=pc)  # the plot's projection
    
    Out[19]:
    
In [20]:
    
seasonal_mean = ds.groupby('time.season').mean('time')
seasonal_mean
    
    
    Out[20]:
In [21]:
    
fname = 'test.nc'
seasonal_mean.to_netcdf(fname)
    
In [22]:
    
xr.open_dataset(fname)
    
    Out[22]:
In [ ]: