XArray expands on the capabilities of NumPy arrays, providing streamlined data manipulation. It is similar in that respect to Pandas, but whereas Pandas excels at working with tabular data, XArray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java's Common Data Model (CDM).
The DataArray is one of the basic building blocks of XArray. It provides a NumPy ndarray-like object that adds two critical pieces of functionality: named dimensions with coordinate values attached to them, and the ability to carry attribute metadata.
In [ ]:
    
# Convention for import to get shortened namespace
import numpy as np
import xarray as xr
    
In [ ]:
    
# Create some sample "temperature" data
data = 283 + 5 * np.random.randn(5, 3, 4)
data
    
Here we create a basic DataArray by passing it just a numpy array of random data. Note that XArray generates some basic dimension names for us.
In [ ]:
    
temp = xr.DataArray(data)
temp
    
We can also pass in our own dimension names:
In [ ]:
    
temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])
temp
    
This is already an improvement over a plain NumPy array, because we have names for each of the dimensions (or axes, in NumPy parlance). Even better, we can take arrays representing the values of the coordinates for each of these dimensions and associate them with the data when we create the DataArray.
In [ ]:
    
# Use pandas to create an array of datetimes
import pandas as pd
times = pd.date_range('2018-01-01', periods=5)
times
    
In [ ]:
    
# Sample lon/lats
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)
    
When we create the DataArray instance, we pass in the arrays we just created:
In [ ]:
    
temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
temp
    
...and we can also set some attribute metadata:
In [ ]:
    
temp.attrs['units'] = 'kelvin'
temp.attrs['standard_name'] = 'air_temperature'
temp
    
Notice what happens if we perform a mathematical operation with the DataArray: the coordinate values persist, but the attributes are lost. This is done because it is very challenging to know if the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.
In [ ]:
    
# For example, convert Kelvin to Celsius
temp - 273.15
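    
If we do want attributes preserved through arithmetic, recent versions of xarray provide a keep_attrs option that covers this case:
In [ ]:
    
# Ask xarray to preserve attributes through operations
with xr.set_options(keep_attrs=True):
    temp_c = temp - 273.15
temp_c
    
XArray also supports label-based selection via the .sel() method, which uses coordinate values rather than integer indices. For example, we can select all values for a given date: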
    
In [ ]:
    
temp.sel(time='2018-01-02')
    
.sel has the flexibility to also perform nearest neighbor sampling, taking an optional tolerance:
In [ ]:
    
from datetime import timedelta
temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))
    
.interp() works similarly to .sel(). Using .interp(), get an interpolated time series "forecast" for Boulder (40°N, 105°W) or your favorite latitude/longitude location. (Documentation for interp).
In [ ]:
    
# Your code goes here
    
In [ ]:
    
# %load solutions/interp_solution.py
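    
If you want to check your work, one possible sketch (using the Boulder point suggested above) looks like this:
In [ ]:
    
# Interpolate the temperature data to Boulder (40N, 105W)
temp.interp(lat=40.0, lon=-105.0)
    
Selection also supports slicing on coordinate values, so we can subset in time and space at once: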
    
In [ ]:
    
temp.sel(time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45))
    
In [ ]:
    
# Pandas-style .loc indexing also works; here we select the same date as above
temp.loc['2018-01-02']
    
In [ ]:
    
temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]
    
In [ ]:
    
# This *doesn't* work, however, because .loc expects the dimension order (time, lat, lon):
#temp.loc[-110:-70, 25:45, '2018-01-01':'2018-01-03']
    
In [ ]:
    
# Open sample North American Reanalysis data in netCDF format
ds = xr.open_dataset('../../data/NARR_19930313_0000.nc')
ds
    
This returns a Dataset object, a container holding one or more DataArrays that can optionally share coordinates. We can then pull out individual fields:
In [ ]:
    
ds.isobaric1
    
or
In [ ]:
    
ds['isobaric1']
    
Datasets also support much of the same subsetting operations as DataArray, but perform the operation on all data variables:
In [ ]:
    
ds_1000 = ds.sel(isobaric1=1000.0)
ds_1000
    
In [ ]:
    
ds_1000.Temperature_isobaric
    
Aggregation operations such as mean, max, and std can be performed over named dimensions. For example, we can compute the standard deviation of the u-component of the wind over the horizontal dimensions:
In [ ]:
    
u_winds = ds['u-component_of_wind_isobaric']
u_winds.std(dim=['x', 'y'])
    
Using the sample dataset, calculate the mean temperature profile (temperature as a function of pressure) over Colorado. For this exercise, consider the bounds of Colorado to be:
(37°N to 41°N and 102°W to 109°W, projected into Lambert conformal coordinates)
In [ ]:
    
# %load solutions/mean_profile.py
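    
One hedged sketch of an approach is below. The Lambert conformal parameters here are placeholders, not values read from this file; check the dataset's grid-mapping variable and the units of its x/y coordinates before trusting the numbers:
In [ ]:
    
# Sketch: convert Colorado's lat/lon bounds into the dataset's projected
# x/y coordinates, subset, then average over the horizontal dimensions
from pyproj import Proj

# Placeholder projection parameters -- verify against the file's metadata
lcc = Proj(proj='lcc', lon_0=-107.0, lat_0=50.0, lat_1=50.0, a=6371229.0)

x_lo, y_lo = lcc(-109.0, 37.0)
x_hi, y_hi = lcc(-102.0, 41.0)

profile = (ds.Temperature_isobaric
             .sel(x=slice(min(x_lo, x_hi), max(x_lo, x_hi)),
                  y=slice(min(y_lo, y_hi), max(y_lo, y_hi)))
             .mean(dim=['x', 'y']))
profile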
    
There is much more in the XArray library. To learn more, visit the XArray Documentation.
To better enable reproducible data and research, the Climate and Forecasting (CF) metadata conventions were created to standardize the metadata in atmospheric data files. In the remainder of this notebook, we will introduce the CF data model and discuss some netCDF implementation details to consider when deciding how to write data with CF and netCDF. We will cover gridded data here, with more in-depth examples provided in the full CF notebook. Xarray makes the creation of netCDF files with proper metadata simple and straightforward, so we will use it instead of the netCDF4-python library.
This assumes a basic understanding of netCDF.
Let's say we're working with some numerical weather forecast model output. Let's walk through the steps necessary to store this data in netCDF, using the Climate and Forecasting metadata conventions to ensure that our data are available to as many tools as possible.
To start, let's assume the following about our data: it is temperature from a forecast model on a Lambert conformal projected grid (x and y in km), with hourly output over the next 12 hours and values on standard pressure levels in the vertical.
We'll also go ahead and generate some arrays of data below to get started:
In [ ]:
    
# Import some useful Python tools
from datetime import datetime, timedelta
# Twelve hours of hourly output starting at 22Z today
start = datetime.utcnow().replace(hour=22, minute=0, second=0, microsecond=0)
times = np.array([start + timedelta(hours=h) for h in range(13)])
# 3km spacing in x and y
x = np.arange(-150, 153, 3)
y = np.arange(-100, 100, 3)
# Standard pressure levels in hPa
press = np.array([1000, 925, 850, 700, 500, 300, 250])
temps = np.random.randn(times.size, press.size, y.size, x.size)
    
Time coordinates must contain a units attribute whose string value has a form similar to 'seconds since 2019-01-06 12:00:00.00'. 'seconds', 'minutes', 'hours', and 'days' are the most commonly used units for time. Due to the variable length of months and years, they are not recommended.
Before we can write data, we first need to convert our list of Python datetime instances to numeric values. We can use the cftime library to do this conversion, using a unit string as defined above.
In [ ]:
    
from cftime import date2num
time_units = 'hours since {:%Y-%m-%d 00:00}'.format(times[0])
time_vals = date2num(times, time_units)
time_vals
    
Now we can assemble the Dataset, passing our temperature data along with the coordinate variables:
In [ ]:
    
ds = xr.Dataset({'temperature': (['time', 'z', 'y', 'x'], temps, {'units':'Kelvin'})},
                 coords={'x': (['x'], x, {'units':'km'}),
                         'y': (['y'], y, {'units':'km'}),
                         'pressure': (['z'], press, {'units':'hPa'}),
                         'forecast_time': (['time'], times)
                })
ds
    
Due to how xarray handles time units, we need to store the units in the forecast_time coordinate's encoding; xarray will use them to convert the datetimes to numeric values when writing the file.
In [ ]:
    
ds.forecast_time.encoding['units'] = time_units
    
If we look at our data variable, we can see the units printed out, so they were attached properly!
In [ ]:
    
ds.temperature
    
We're going to start by adding some global attribute metadata. These are recommendations from the standard (not required), but they're easy to add and help users keep the data straight, so let's go ahead and do it.
In [ ]:
    
ds.attrs['Conventions'] = 'CF-1.7'
ds.attrs['title'] = 'Forecast model run'
ds.attrs['institution'] = 'Unidata'
ds.attrs['source'] = 'WRF-1.5'
ds.attrs['history'] = str(datetime.utcnow()) + ' Python'
ds.attrs['references'] = ''
ds.attrs['comment'] = ''
ds
    
We can also add attributes to this variable to define metadata. The CF conventions require a units attribute to be set for all variables that represent a dimensional quantity. The value of this attribute needs to be parsable by the UDUNITS library. Here we have already set it to a value of 'Kelvin'. We also set the standard (optional) attributes of long_name and standard_name. The former contains a longer description of the variable, while the latter comes from a controlled vocabulary in the CF conventions. This allows users of data to understand, in a standard fashion, what a variable represents. If we had missing values, we could also set the missing_value attribute to an appropriate value.
NASA Dataset Interoperability Recommendations:
Section 2.2 - Include Basic CF Attributes
Include where applicable:
units, long_name, standard_name, valid_min / valid_max, scale_factor / add_offset, and others.
In [ ]:
    
ds.temperature.attrs['standard_name'] = 'air_temperature'
ds.temperature.attrs['long_name'] = 'Forecast air temperature'
ds.temperature.attrs['missing_value'] = -9999
ds.temperature
    
To properly orient our data in time and space, we need to go beyond dimensions (which define common sizes and alignment) and include values along these dimensions, which are called "Coordinate Variables". Generally, these are defined by creating a one dimensional variable with the same name as the respective dimension.
To start, we set attributes on the variables holding our x and y coordinate values. These include standard_names, which allow associating the coordinates with projections (more on this later), as well as an optional axis attribute to make clear what standard direction each coordinate refers to.
In [ ]:
    
ds.x.attrs['axis'] = 'X' # Optional
ds.x.attrs['standard_name'] = 'projection_x_coordinate'
ds.x.attrs['long_name'] = 'x-coordinate in projected coordinate system'
ds.y.attrs['axis'] = 'Y' # Optional
ds.y.attrs['standard_name'] = 'projection_y_coordinate'
ds.y.attrs['long_name'] = 'y-coordinate in projected coordinate system'
    
We also define a coordinate variable pressure to reference our data in the vertical dimension. The standard_name of 'air_pressure' is sufficient to identify this coordinate variable as the vertical axis, but let's go ahead and specify the axis as well. We also specify the attribute positive to indicate whether the variable increases when going up or down. In the case of pressure, this is technically optional.
In [ ]:
    
ds.pressure.attrs['axis'] = 'Z'  # Optional
ds.pressure.attrs['standard_name'] = 'air_pressure'
ds.pressure.attrs['positive'] = 'down'  # Optional
    
Finally, we set similar attributes on the forecast_time coordinate:
In [ ]:
    
ds.forecast_time.attrs['axis'] = 'T'  # Optional
ds.forecast_time.attrs['standard_name'] = 'time'  # Optional
ds.forecast_time.attrs['long_name'] = 'time'
    
Our data are still not CF-compliant because they do not contain latitude and longitude information, which is needed to properly locate the data. To solve this, we need to add variables with latitude and longitude. These are called "auxiliary coordinate variables", not because they are extra, but because they are not simple one-dimensional variables.
Below, we first generate longitude and latitude values from our projected coordinates using the pyproj library.
In [ ]:
    
from pyproj import Proj

# 2D arrays giving the projected x/y location of every grid point
X, Y = np.meshgrid(x, y)

# Lambert conformal conic projection on a sphere (radius 6371 km)
lcc = Proj({'proj':'lcc', 'lon_0':-105, 'lat_0':40, 'a':6371000.,
            'lat_1':25})

# Convert km to m, then invert the projection to get lon/lat
lon, lat = lcc(X * 1000, Y * 1000, inverse=True)
    
Now we can create the needed variables. Both are dimensioned on y and x and are two-dimensional. The longitude variable is identified as actually containing such information by its required units of 'degrees_east', as well as the optional 'longitude' standard_name attribute. The case is the same for latitude, except the units are 'degrees_north' and the standard_name is 'latitude'.
In [ ]:
    
ds = ds.assign_coords(lon=(['y', 'x'], lon), lat=(['y', 'x'], lat))
ds
    
In [ ]:
    
ds.lon.attrs['units'] = 'degrees_east'
ds.lon.attrs['standard_name'] = 'longitude'  # Optional
ds.lon.attrs['long_name'] = 'longitude'
ds.lat.attrs['units'] = 'degrees_north'
ds.lat.attrs['standard_name'] = 'latitude'  # Optional
ds.lat.attrs['long_name'] = 'latitude'
    
With the variables created, they are identified as containing coordinates for the temperature variable through a coordinates attribute, whose value is a space-separated list of the auxiliary coordinate variable names. Since we added lon and lat with assign_coords, xarray will write this attribute for us:
In [ ]:
    
ds
    
With our data specified on a Lambert conformal projected grid, it would be good to include this information in our metadata. We can do this using a "grid mapping" variable. This uses a dummy scalar variable as a namespace for holding all of the required information. Relevant variables then reference the dummy variable with their grid_mapping attribute.
Below we create a variable and set it up for a Lambert conformal conic projection on a spherical earth. The grid_mapping_name attribute describes which of the CF-supported grid mappings we are specifying. The names of additional attributes vary between the mappings.
In [ ]:
    
# Dummy scalar variable to hold the projection metadata
ds['lambert_projection'] = int()
ds.lambert_projection.attrs['grid_mapping_name'] = 'lambert_conformal_conic'
ds.lambert_projection.attrs['standard_parallel'] = 25.
ds.lambert_projection.attrs['latitude_of_projection_origin'] = 40.
ds.lambert_projection.attrs['longitude_of_central_meridian'] = -105.
ds.lambert_projection.attrs['semi_major_axis'] = 6371000.0
ds.lambert_projection
    
Now that we created the variable, all that's left is to set the grid_mapping attribute on our Temperature variable to the name of our dummy variable:
In [ ]:
    
ds.temperature.attrs['grid_mapping'] = 'lambert_projection'
ds
    
Xarray has built-in support for a few flavors of netCDF. Here we'll write a netCDF4 file from our Dataset.
In [ ]:
    
ds.to_netcdf('test_netcdf.nc', format='NETCDF4')
    
In [ ]:
    
!ncdump test_netcdf.nc
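    
As a quick check, we can reopen the file we just wrote and confirm that the metadata round-trips (xarray decodes forecast_time back to datetimes using the units we stored in the encoding):
In [ ]:
    
# Reopen the file and verify that the metadata survived the round trip
ds_check = xr.open_dataset('test_netcdf.nc')
ds_check.temperature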
    