In [ ]:
%matplotlib inline
Learning Outcome: by the end of this section, you will be able to utilise some more advanced parts of Iris's functionality.
Duration: 1 hour
Overview:
7.1 Load Callbacks
7.2 Categorised Statistics
7.3 Out-of-core Processing and Lazy Data
7.4 Performance Tricks
7.5 Summary of the Section
In [ ]:
import iris
import iris.quickplot as qplt
import matplotlib.pyplot as plt
import numpy as np
Sometimes information important for a file load is recorded not in the file itself, but somewhere else: typically in the filename or its path. It is then desirable for such extra information to be added to the cube's metadata.
For example, some early GloSea4 model runs recorded the "ensemble member number" (or "realization" in CF terms) in the filename, but not in actual PP metadata itself.
As a result, loading the following data yields two cubes, rather than a single, fully merged, cube.
In [ ]:
fname = iris.sample_data_path('GloSea4', 'ensemble_00[34].pp')
for cube in iris.load(fname, 'surface_temperature'):
    print(cube, '\n', '--' * 40)
To resolve this we can define a function that gets called during the load process. This load callback function must take the following as arguments: the cube, the raw field from the file, and the filename.
In our example, some cubes are missing the realization coordinate, so we define a function that parses the filename to identify the ensemble member number and adds this value as a realization coordinate. We pass this function to load via the callback keyword, and the result is a successfully merged cube:
In [ ]:
import os
def realization_callback(cube, field, fname):
    basename = os.path.basename(fname)
    if not cube.coords('realization') and basename.startswith('ensemble_'):
        cube.add_aux_coord(iris.coords.DimCoord(np.int32(basename[-6:-3]),
                                                'realization'))
print(iris.load_cube(fname, callback=realization_callback))
Sometimes we want to be able to categorise data before performing statistical operations on it.
For example, we might want to categorise our data by "day or night time" or "season of the year", so that we can calculate statistics such as a "daylight maximum" or "seasonal mean" etc. Both of these categorisations would be based on the time coordinate.
The iris.coord_categorisation module provides several convenience functions to perform common categorisations on a cube, and a generalised function for making custom ones.
Let's load in a cube that represents the monthly surface_temperature from April 2006 through to October 2010.
In [ ]:
import iris.coord_categorisation as coord_cat
filename = iris.sample_data_path('ostia_monthly.nc')
cube = iris.load_cube(filename, 'surface_temperature')
print(cube)
Let's add a categorisation coordinate to this cube to identify the climatological season (i.e. "djf", "mam", "jja" or "son") of each time point:
In [ ]:
coord_cat.add_season(cube, 'time', name='clim_season')
print(cube)
As you can see in the above printout of the cube, we now have an extra coordinate called clim_season.
Let's print the coordinate out to take a closer look:
In [ ]:
print(cube.coord('clim_season'))
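The built-in categorisations cover the common cases; for anything else the module provides the generalised add_categorised_coord function, which applies a function of your own to each point of an existing coordinate. The sketch below is only an illustration (the 'decade' label is not used elsewhere in this tutorial): it adds a coordinate giving the decade of each time point.
In [ ]:
def decade(coord, value):
    # Convert the raw time value to a datetime, then truncate the year
    # down to the start of its decade.
    year = coord.units.num2date(value).year
    return (year // 10) * 10

coord_cat.add_categorised_coord(cube, 'decade', cube.coord('time'), decade)
print(cube.coord('decade'))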
Now that we have a coordinate representing the climatological season, we can use the cube's aggregated_by
method to "group by and aggregate" on the season, to produce a new cube that represents the seasonal mean:
In [ ]:
seasonal_mean = cube.aggregated_by('clim_season', iris.analysis.MEAN)
print(seasonal_mean)
We can take this further by extracting the winter season, using our newly created coordinate, and producing a plot of the winter zonal mean:
In [ ]:
winter = seasonal_mean.extract(iris.Constraint(clim_season='djf'))
qplt.plot(winter.collapsed('latitude', iris.analysis.MEAN))
plt.title(r'Winter zonal mean surface temperature at $\pm5^{\circ}$ latitude')
plt.show()
Calculate the yearly maximum surface_temperature.
Take a look at the documentation for iris.coord_categorisation to work out how to add a coordinate that represents the year to the cube, then calculate the maximum.
In [ ]:
#
# edit space for user code ...
#
In [ ]:
# SAMPLE SOLUTION
# %load solutions/iris_exercise_7.2a
Out-of-core processing is a technical term for the ability to process datasets that are too large to fit in memory at once. In Iris, this functionality is referred to as lazy data: you can use Iris to load, process and save datasets that are too large to fit in memory, without running out of memory. This is achieved by loading only the dataset's metadata and not the data array, unless the data is specifically requested.
To determine whether your cube has lazy data you can use the has_lazy_data method:
In [ ]:
fname = iris.sample_data_path('air_temp.pp')
cube = iris.load_cube(fname)
print(cube.has_lazy_data())
Iris tries to maintain lazy data as much as possible. Many Iris operations, like the arithmetic and statistics we have seen, can be implemented by a "lazy algorithm", which produces a representation of the calculation without actually performing it (initially).
This has two key benefits: the full data never needs to fit in memory at once, and no time is spent computing results that are never actually used.
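As a quick illustration (a minimal sketch reusing the air_temp.pp cube loaded above, and assuming a recent Iris version in which collapsed computes lazily where it can), a statistic on a lazy cube produces another lazy cube:
In [ ]:
# The mean is not actually computed here; the result is itself lazy.
zonal_mean = cube.collapsed('longitude', iris.analysis.MEAN)
print(zonal_mean.has_lazy_data())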
We refer to the operation of loading a cube's lazy data into memory as 'realising' the cube's data. A cube's lazy data will only be realised in a limited number of cases, including when you access the data array directly via cube.data:
In [ ]:
cube.data
print(cube.has_lazy_data())
Above we have triggered the data to be loaded into memory by accessing cube.data.
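If you need the underlying array without forcing realisation, recent versions of Iris also provide cube.core_data(), which returns the data in whichever form the cube currently holds, and cube.lazy_data(), which always returns a lazy (dask) view without realising the cube. A minimal sketch:
In [ ]:
# core_data() reflects the cube's current state (here a realised NumPy array);
# lazy_data() wraps the data as a dask array without changing the cube.
print(type(cube.core_data()))
print(type(cube.lazy_data()))
print(cube.has_lazy_data())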
Load the sea_water_potential_temperature cube from the file iris.sample_data_path('atlantic_profiles.nc'). Does this cube have lazy data?
Calculate the mean over the depth coordinate. Does the cube still have lazy data?
Create a blockplot (pcolormesh) of the resulting 2D cube. Does the cube still have lazy data?
In [ ]:
#
# edit space for user code ...
#
In [ ]:
# SAMPLE SOLUTION
# %load solutions/iris_exercise_7.3a
In [ ]:
def zonal_sum(cube):
    """
    A really silly function to calculate the sum over the grid_longitude
    dimension.

    Don't use this in real life; instead consider doing:
        cube.collapsed('grid_longitude', iris.analysis.SUM)

    """
    total = 0
    for i, _ in enumerate(cube.coord('grid_longitude')):
        # Realise one longitude column at a time and accumulate the sum.
        total += cube[..., i].data
    return total
First, let's try loading in our cube, and then applying the zonal_sum
function to the cube whilst it still has lazy data.
In [ ]:
%%timeit
fname = iris.sample_data_path('uk_hires.pp')
pt = iris.load_cube(fname, 'air_potential_temperature')
result = zonal_sum(pt)
Now let's try doing the same thing, but this time we tell Iris to load the data into memory prior to applying the function.
In [ ]:
%%timeit
fname = iris.sample_data_path('uk_hires.pp')
pt = iris.load_cube(fname, 'air_potential_temperature')
pt.data
result = zonal_sum(pt)
As you can see, loading all the data upfront was much faster: with lazy data, every cube[..., i].data inside zonal_sum triggers a separate read from the file, whereas the realised cube just slices an array that is already in memory.
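For comparison (a sketch, not one of the original timings), the idiomatic approach suggested in the docstring keeps the whole calculation lazy, so no data needs to be read until the result is actually realised:
In [ ]:
%%timeit
fname = iris.sample_data_path('uk_hires.pp')
pt = iris.load_cube(fname, 'air_potential_temperature')
result = pt.collapsed('grid_longitude', iris.analysis.SUM)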
Iris loading can be slow, particularly if the file format stores 2D fields of a conceptually higher-dimensional dataset, as is the case with GRIB and PP.
To maximise load speed and avoid unnecessary processing, it is worth constraining to the fields of interest at load time. Note, however, that Iris does no caching, so loading the same file twice takes twice as long.
Let's compare loading data on a select number of model levels.
In [ ]:
fname = iris.sample_data_path('uk_hires.pp')
model_levels = [1, 4, 7, 16]
First, let's constrain to our chosen model levels at load time:
In [ ]:
%%timeit
for model_level in model_levels:
    pt = iris.load_cube(fname,
                        iris.Constraint('air_potential_temperature',
                                        model_level_number=model_level))
Now, let's first load in the file, then extract each of our model levels.
In [ ]:
%%timeit
cubes = iris.load(fname)
for model_level in model_levels:
    pt = cubes.extract(iris.Constraint('air_potential_temperature',
                                       model_level_number=model_level))
As you can see, by loading all the data just once and extracting the levels we want from the in-memory cubes, we avoid re-reading and re-parsing the file for every constraint, which is considerably faster.
For files with lots of different phenomena this can be improved further by loading only the phenomenon of interest (and, in this case, just the model levels of interest):
In [ ]:
%%timeit
cubes = iris.load(fname,
                  iris.Constraint('air_potential_temperature',
                                  model_level_number=model_levels))
for model_level in model_levels:
    pt = cubes.extract(iris.Constraint(model_level_number=model_level))
In this section we learnt:
how to use a load callback to add extra metadata, such as an ensemble member number taken from the filename, while cubes are being loaded,
how to add categorisation coordinates to a cube and compute "group by" statistics with aggregated_by,
how Iris supports out-of-core processing through lazy data, and what causes a cube's data to be realised,
and some performance tricks, such as constraining at load time and avoiding loading the same file repeatedly.