In [ ]:
%matplotlib inline

Iris introduction course

7. Advanced Concepts

Learning Outcome: by the end of this section, you will be able to utilise some more advanced parts of Iris's functionality.

Duration: 1 hour

Overview:
7.1 Load Callbacks
7.2 Categorical Coordinates for Grouped Statistics
7.3 Out-of-core Processing and Lazy Data
7.4 Performance Tricks
7.5 Summary of the Section

Setup


In [ ]:
import iris
import iris.quickplot as qplt
import matplotlib.pyplot as plt
import numpy as np

7.1 Load Callbacks

Sometimes information important to a file load is recorded not in the file itself, but somewhere else: typically in the filename or its path. It is then desirable for such extra information to be added to the cube's metadata.

For example, some early GloSea4 model runs recorded the "ensemble member number" (or "realization" in CF terms) in the filename, but not in the PP metadata itself.

As a result, loading the following data yields two cubes, rather than a single, fully merged, cube.


In [ ]:
fname = iris.sample_data_path('GloSea4', 'ensemble_00[34].pp')
for cube in iris.load(fname, 'surface_temperature'):
    print(cube, '\n', '--' * 40)

To resolve this we can define a function that gets called during the load process. This load callback function must take the following as arguments:

  • a cube,
  • a 2D field - either a PP field, a NetCDF variable or a GRIB message depending on the file format being loaded, and
  • a filename.

In our example, some cubes are missing the realization coordinate, so we define a function that parses the filename to identify the ensemble member number and adds this value as a realization coordinate. We pass this function to the load via the callback keyword, and the result is a single, successfully merged cube:


In [ ]:
import os
def realization_callback(cube, field, fname):
    # Extract the ensemble member number from the filename
    # (e.g. 'ensemble_003.pp' -> 3) and attach it as a scalar
    # 'realization' coordinate.
    basename = os.path.basename(fname)
    if not cube.coords('realization') and basename.startswith('ensemble_'):
        cube.add_aux_coord(iris.coords.DimCoord(np.int32(basename[-6:-3]),
                                                'realization'))

print(iris.load_cube(fname, callback=realization_callback))
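
As an aside, a load callback can also be used to filter out unwanted data during loading, by raising iris.exceptions.IgnoreCubeException. The sketch below is purely hypothetical (the function name and filename test are made up, and it is not needed for the example above); it simply shows the pattern:


In [ ]:
import os
import iris.exceptions

def skip_member_four(cube, field, fname):
    # Hypothetical filter: ignore any fields loaded from ensemble member 4.
    if os.path.basename(fname).startswith('ensemble_004'):
        raise iris.exceptions.IgnoreCubeException()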

7.2 Categorical Coordinates for Grouped Statistics

Sometimes we want to be able to categorise data before performing statistical operations on it.

For example, we might want to categorise our data by "day or night time" or "season of the year", so that we can calculate statistics such as a "daylight maximum" or "seasonal mean" etc. Both of these categorisations would be based on the time coordinate.

The iris.coord_categorisation module provides several convenience functions to perform common categorisations on a cube, and a generalised function for making custom ones.
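
For instance, a custom categorisation can be built with the generalised function iris.coord_categorisation.add_categorised_coord, which accepts any function of (coordinate, point value). The "decade" categorisation below is just an illustrative sketch; the final call is commented out because it assumes a cube with a 'time' coordinate, such as the one loaded next:


In [ ]:
import iris.coord_categorisation as coord_cat

def decade(coord, value):
    # Convert the time point to a datetime and truncate its year to a decade.
    year = coord.units.num2date(value).year
    return 10 * (year // 10)

# Assuming `cube` has a 'time' coordinate (e.g. the cube loaded below):
# coord_cat.add_categorised_coord(cube, 'decade', 'time', decade)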

Let's load in a cube that represents the monthly surface_temperature from April 2006 through to October 2010.


In [ ]:
import iris.coord_categorisation as coord_cat

filename = iris.sample_data_path('ostia_monthly.nc')
cube = iris.load_cube(filename, 'surface_temperature')
print(cube)

Let's add a categorisation coordinate to this cube to identify the climatological season (i.e. "djf", "mam", "jja" or "son") of each time point:


In [ ]:
coord_cat.add_season(cube, 'time', name='clim_season')
print(cube)

As you can see in the above print out of the cube, we now have an extra coordinate called clim_season.

Let's print the coordinate out to take a closer look:


In [ ]:
print(cube.coord('clim_season'))

Now that we have a coordinate representing the climatological season, we can use the cube's aggregated_by method to "group by and aggregate" on the season, to produce a new cube that represents the seasonal mean:


In [ ]:
seasonal_mean = cube.aggregated_by('clim_season', iris.analysis.MEAN)
print(seasonal_mean)

We can take this further by extracting the winter season, using our newly created coordinate, and producing a plot of the winter zonal mean:


In [ ]:
winter = seasonal_mean.extract(iris.Constraint(clim_season='djf'))

qplt.plot(winter.collapsed('latitude', iris.analysis.MEAN))
plt.title(r'Winter zonal mean surface temperature at $\pm5^{\circ}$ latitude')
plt.show()

Exercise:

Calculate the yearly maximum surface_temperature.
Take a look at the documentation for iris.coord_categorisation to work out how to add a coordinate that represents the year to the cube, then calculate the maximum.


In [ ]:
#
# edit space for user code ...
#

In [ ]:
# SAMPLE SOLUTION
# %load solutions/iris_exercise_7.2a

7.3 Out-of-core Processing

Out-of-core processing is a technical term that describes being able to process datasets that are too large to fit in memory at once. In Iris, this functionality is referred to as lazy data. It means that you can use Iris to load, process and save datasets that are too large to fit in memory without running out of memory. This is achieved by loading only the dataset's metadata and not the data array, unless this is specifically requested.

To determine whether your cube has lazy data, you can use the has_lazy_data method:


In [ ]:
fname = iris.sample_data_path('air_temp.pp')
cube = iris.load_cube(fname)
print(cube.has_lazy_data())

Iris tries to maintain lazy data as much as possible. Many Iris operations, like the arithmetic and statistics we have seen, can be implemented by a "lazy algorithm", which produces a representation of the calculation without actually performing it (initially).

This has two key benefits:

  1. we can easily form a large overall result but then only extract small selected portions; and
  2. where a result is smaller than the source data (like a statistic), it can be calculated in sections without ever loading all the original data (which may exceed available memory).

We refer to the operation of loading a cube's lazy data as 'realising' the cube's data. A cube's lazy data will only be loaded in a limited number of cases, including:

  • when the user directly requests the cube's data using cube.data,
  • when there is no lazy data processing algorithm available to perform the requested data processing, such as for peak finding, and
  • where actual data values are necessary, such as for cube plotting.

See: Iris User Guide: "Real and Lazy Data"
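
To illustrate these points with the cube loaded above (a small aside, not part of the course exercises): many statistics, such as MEAN, computed from lazy data are themselves lazy, and nothing is calculated until the values are actually requested.


In [ ]:
# Form a lazy statistic: no values are computed at this point.
zonal_mean = cube.collapsed('longitude', iris.analysis.MEAN)
print(zonal_mean.has_lazy_data())

# Requesting the values realises (computes) them ...
print(zonal_mean.data.max())

# ... but only for the result; the source cube is still lazy.
print(cube.has_lazy_data())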


In [ ]:
cube.data
print(cube.has_lazy_data())

Above we have triggered the data to be loaded into memory by accessing cube.data.

Exercise:

Load the sea_water_potential_temperature cube from the file iris.sample_data_path('atlantic_profiles.nc'). Does this cube have lazy data?
Calculate the mean over the depth coordinate. Does the cube still have lazy data?
Create a blockplot (pcolormesh) of the resulting 2D cube. Does the cube still have lazy data?


In [ ]:
#
# edit space for user code ...
#

In [ ]:
# SAMPLE SOLUTION
# %load solutions/iris_exercise_7.3a

7.4 Performance Tricks

This section details a few common tricks to improve the performance of your Iris code:

  • Make use of deferred loading of data.
  • Load once, extract many times.

Make Use of Deferred Loading of Data

Sometimes it makes sense to realise the data in memory before performing operations on it; other times it makes sense to reduce the data whilst it is still lazy, before realising it.

We define a simple function that applies some processing to the cube:


In [ ]:
def zonal_sum(cube):
    """
    A really silly function to calculate the sum of the grid_longitude
    dimension.
    Don't use this in real life, instead consider doing:
    
        cube.collapsed('grid_longitude', iris.analysis.SUM)
    
    """
    total = 0
    # Sum the data one grid_longitude column at a time.
    for i in range(cube.coord('grid_longitude').shape[0]):
        total += cube[..., i].data
    return total

First, let's try loading in our cube, and then applying the zonal_sum function to the cube whilst it still has lazy data.


In [ ]:
%%timeit
fname = iris.sample_data_path('uk_hires.pp')
pt = iris.load_cube(fname, 'air_potential_temperature')
result = zonal_sum(pt)

Now let's try doing the same thing, but this time we tell Iris to load the data into memory prior to applying the function.


In [ ]:
%%timeit
fname = iris.sample_data_path('uk_hires.pp')
pt = iris.load_cube(fname, 'air_potential_temperature')
pt.data  # accessing the data realises it into memory
result = zonal_sum(pt)

As you can see, loading all the data upfront was much faster. This is because each access of cube[..., i].data on a lazy cube triggers a separate read of the file, whereas once the data has been realised the repeated indexing happens in memory.

Load Once, Extract Many Times

Iris loading can be slow, particularly if the file format stores 2D fields of a conceptually higher-dimensional dataset, as is the case with GRIB and PP.

To maximise load speed and avoid unnecessary processing, it is worth constraining to just the fields of interest at load time. Note, however, that there is no caching: loading a file twice takes twice as long, so it is better to load once and extract many times.

Let's compare loading data on a select number of model levels.


In [ ]:
fname = iris.sample_data_path('uk_hires.pp')
model_levels = [1, 4, 7, 16]

First, let's constrain to our chosen model levels at load time:


In [ ]:
%%timeit
for model_level in model_levels:
    pt = iris.load_cube(fname,
                        iris.Constraint('air_potential_temperature',
                                        model_level_number=model_level))

Now, let's first load in the file, then extract each of our model levels.


In [ ]:
%%timeit
cubes = iris.load(fname)
for model_level in model_levels:
    pt = cubes.extract(iris.Constraint('air_potential_temperature',
                                       model_level_number=model_level))

As you can see, by loading the file just once and then extracting each model level from the resulting cubes, we avoid re-reading the file for every level, which is considerably faster.

For files containing many different phenomena, this can be improved further by loading only the phenomenon of interest (and, in this case, just the model levels of interest):


In [ ]:
%%timeit
cubes = iris.load(fname,
                  iris.Constraint('air_potential_temperature',
                                  model_level_number=model_levels))
for model_level in model_levels:
    pt = cubes.extract(iris.Constraint(model_level_number=model_level))

7.5 Section Summary: Advanced Concepts

In this section we learnt:

  • load callbacks can be used to capture additional metadata during loading
  • special facilities are provided for performing categorical statistics
  • Iris uses lazy data and out-of-core processing to handle data that is too large to fit into memory
  • lazy loading can be used to enhance code performance