In [1]:
name = '2017-03-24-climate-model-output'
title = 'Two ways of preparing climate model output for analysis'
tags = 'numpy, iris'
author = 'Denis Sergeev'
In [2]:
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML, Image
html = connect_notebook_to_post(name, title, tags, author)
Today one of the group members asked for help with reading climate model output and preparing it for data analysis.
This notebook shows a couple of ways of doing that with the help of the numpy and iris Python packages.
Luckily, the model output is quite small and stored in a simple ASCII file. However, it has some properties that can be a hurdle for a programming novice.
We start by downloading the data from a given link.
In [3]:
URL = 'https://raw.githubusercontent.com/ueapy/ueapy.github.io/src/content/data/run1_U_60N_10hPa.dat'
Instead of copy-pasting the contents manually, we are going to use Python's standard library to download the file, making this part of the scientific analysis more reproducible.
In [4]:
from urllib.request import urlretrieve
To organise the data and code folders, we also import the os module.
In [5]:
import os
In [6]:
datadir = os.path.join(os.path.pardir, 'data') # directory is one level up
if not os.path.isdir(datadir):
    # if the directory does not exist, create it
    os.makedirs(datadir)
# File with data
fname = os.path.join(datadir, 'data.dat')
Now that we have a directory to store data, we can save the model output there.
In [7]:
urlretrieve(URL, fname)
Out[7]:
Since the data are purely numeric, we use the numpy module.
In [8]:
import numpy as np
In [9]:
data = np.genfromtxt(fname)
In [10]:
data.shape
Out[10]:
In [11]:
data
Out[11]:
For some reason the data are stored as 6 columns by 1500 rows, 9000 values in total.
We know a priori that the file contains 75 years of data written every third day, and that the climate model uses a 360-day calendar. Hence, we have $360 / 3 = 120$ values per year:
In [12]:
data.shape[0] * data.shape[1] / 75
Out[12]:
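As a quick sanity check (our own addition, not part of the original analysis), the numbers are consistent with each other:

# a 360-day year sampled every 3 days gives 120 values per year,
# and 75 such years give 9000 values, matching the array size
values_per_year = 360 // 3
assert values_per_year == 120
assert values_per_year * 75 == data.shape[0] * data.shape[1]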
Keeping the data in a $1500\times6$ array does not seem useful, so we flatten it to 1-D:
In [13]:
data = data.flatten()
data.shape
Out[13]:
To make the code above reusable, we create the following function to fetch the data.
In [14]:
def get_model_data(url=URL, fname='climate_model_data.dat', force_download=False):
    """
    Download climate model output from the UEA server

    Parameters
    ----------
    url : string (optional)
        web location of the data
    fname : string (optional)
        full path to save the data
    force_download : bool (optional)
        if True, force redownload of the data

    Returns
    -------
    data : numpy.ndarray
        1-D array of data
    """
    if not os.path.exists(fname) or force_download:
        # download the file only if it is missing or a fresh copy is requested
        urlretrieve(url, fname)
    data = np.genfromtxt(fname)
    return data.flatten()
In [15]:
data = get_model_data()
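The file is cached on disk, so repeated calls are cheap; if a fresh copy is ever needed, the download can be forced (a trivial usage example of the keyword defined above):

# fetch the data again even if the file already exists locally
data = get_model_data(force_download=True)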
Now we transform the array into a more useful shape.
In [16]:
NDAYS = 120 # the number of 3-day periods in a 360-day year
NYEARS = 75 # the total number of years
In [17]:
data_yd = data.reshape((NYEARS, NDAYS))
print(data_yd.shape)
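A quick check (our own addition) that the reshape keeps the values in the expected order, i.e. that the first row of the 2-D array holds the first year of data:

# numpy reshapes in row-major (C) order, so row 0 contains the first 120 values
assert np.array_equal(data_yd[0], data[:NDAYS])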
For example, this is the value of the $u$-wind on 30 January of the last year:
In [18]:
data_yd[-1, 10]
Out[18]:
What if we want to extract only winter data? We can't use the first winter, because it is incomplete: it has only January and February. So the first complete winter comprises the December data from year 1 (with 30-day months and 3-daily output, each month takes up 10 values, hence the last 10 values of the year):
In [19]:
data_yd[0, -10:]
Out[19]:
plus the January and February data from year 2 (the first 20 values):
In [20]:
data_yd[1, :20]
Out[20]:
To join them, we can use the numpy.concatenate() function:
In [21]:
np.concatenate([data_yd[0, -10:], data_yd[1, :20]])
Out[21]:
And of course we can apply the same logic to the whole dataset:
In [22]:
data_djf = np.concatenate([data_yd[:-1, -10:], data_yd[1:, :20]], axis=1)
print(data_djf.shape)
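A brief check (our own addition): there are 74 complete winters, because the last December has no following January and February, and each winter contains 30 values:

assert data_djf.shape == (NYEARS - 1, 30)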
How do we find winters in which the wind direction stays the same for at least 20 consecutive values (60 days) and then changes?
Here we simply apply this answer from Stack Overflow to our problem.
In [23]:
for i, yr in enumerate(data_djf):
    condition = yr > 0  # True where the wind is westerly
    # lengths of consecutive runs of True values (the Stack Overflow trick)
    lens_true = np.diff(np.where(np.concatenate(([condition[0]],
                                                  condition[:-1] != condition[1:],
                                                  [True])))[0])[::2]
    # a run of 20-29 values: westerly for at least 60 days,
    # but changing direction before the end of the winter
    if 20 <= lens_true.max() < 30:
        print(i, lens_true.max())
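The one-liner above is rather dense, so here is a small illustration of the run-length trick on a toy boolean array (a sketch of our own, not part of the original post):

cond = np.array([False, True, True, True, False, True])
# True wherever a new run starts, plus a sentinel marking the end of the array
edges = np.concatenate(([cond[0]], cond[:-1] != cond[1:], [True]))
change_points = np.where(edges)[0]    # indices where a new run begins (plus the end)
run_lengths = np.diff(change_points)  # lengths of the alternating runs
true_run_lengths = run_lengths[::2]   # the runs alternate, starting with a True run
print(true_run_lengths)               # [3 1]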
In the above example numpy's capabilities were probably enough. But when you have more dimensions and more complex data, it is almost always better to use labelled arrays and the great functionality offered by libraries such as xarray or iris.
We now show how the iris library can be used with the same dataset. We chose iris mostly because it can handle non-standard calendars, such as the 360-day calendar used here.
To create an appropriate time coordinate, we will use iris's companion package, cf_units.
In [24]:
import cf_units
import iris
In [25]:
DAYS_PER_YEAR = 360
In [26]:
t_unit = cf_units.Unit('days since 0001-01-01 00:00:00',
calendar='360_day')
In [27]:
t_coord = iris.coords.DimCoord(np.arange(0, DAYS_PER_YEAR * NYEARS, 3),
units=t_unit,
standard_name='time')
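As a quick check (our own addition), the unit's num2date method can convert the first few time points back to dates, confirming that the 360-day calendar is respected:

# convert the first four time points (days 0, 3, 6, 9) back to calendar dates
print(t_unit.num2date(np.arange(0, 12, 3)))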
Now we can attach the newly created time coordinate to the data themselves by creating an iris cube:
In [28]:
cube = iris.cube.Cube(data=data,
units='m/s',
dim_coords_and_dims=[(t_coord, 0)])
cube.rename('eastward_wind')
In [29]:
print(cube)
Since we now have a labelled array with appropriate metadata, we can use iris to make the statistical analysis easier and the code more readable.
In [30]:
import iris.coord_categorisation
In [31]:
iris.coord_categorisation.add_season(cube, 'time', name='clim_season')
iris.coord_categorisation.add_season_year(cube, 'time', name='season_year')
In [32]:
print(cube)
In [33]:
cube.coord('clim_season')
Out[33]:
In [34]:
for season, year in zip(cube.coord('clim_season')[:100:10].points,
cube.coord('season_year')[:100:10].points):
print('{} {}'.format(season, year))
In [35]:
annual_seasonal_mean = cube.aggregated_by(['clim_season', 'season_year'],
iris.analysis.MEAN)
In [36]:
print(annual_seasonal_mean)
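As a further step (a sketch of our own, not part of the original analysis), the winter means alone could be pulled out by constraining on the categorical season coordinate:

# keep only the December-January-February means
djf_mean = annual_seasonal_mean.extract(iris.Constraint(clim_season='djf'))
print(djf_mean)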
In [37]:
HTML(html)
Out[37]: