If you use this lesson in a bootcamp or workshop, please let me know (irving.damien (at) gmail.com). I'd love to keep track of where it's being used so I can update and improve the content accordingly.
Our previous lessons have shown us how to write programs that ingest a list of data files, perform some calculations on those data, and then print a final result to the screen. While this was a useful exercise in learning the principles of scripting and parsing the command line, in most cases the output of our programs will not be so simple. Instead, programs typically take data as input, manipulate that data, and then output yet more data. Over the course of a multi-year research project, most researchers will write many different programs that produce many different output datasets.
We want to:
Along the way, we will learn:
Additional Python libraries that need to be installed:
pip install gitpython
In this lesson we are going to process some data collected by Australia's Integrated Marine Observing System (IMOS).
First off, let's load our data:
In [1]:
from netCDF4 import Dataset
acorn_URL = 'http://thredds.aodn.org.au/thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz'
acorn_DATA = Dataset(acorn_URL)
The first thing to notice is the distinctive Data Reference Syntax (DRS) associated with the file. The staff at IMOS have archived the data according to the following directory structure:
http://thredds.aodn.org.au/thredds/dodsC/<project>/<organisation>/<collection>/<facility>/<data-type>/<site-code>/<year>/
From this we can deduce, without even inspecting the contents of the file, that we have data from the IMOS project that is run by the eMarine Information Infrastructure (eMII). It was collected in 2012 at the Turquoise Coast, Western Australia (TURQ) site of the Australian Coastal Ocean Radar Network (ACORN), which is a network of high frequency radars that measure the ocean surface current (see this page on the Research Data Australia website for a nice overview of the dataset).
The data type has a sub-DRS of its own, which tells us that the data represents the 1-hourly average surface current for a single month (October 2012), and that it is archived on a regularly spaced spatial grid and has not been quality controlled. The file is located in the "demos" directory, as it has been generated for the purpose of providing an example for users in the very helpful Australian Ocean Data Network (AODN) user code library.
Just in case the file gets separated from this informative directory structure, much of the information is repeated in the file name itself, along with some more detailed information about the start and end time of the data, and the last time the file was modified.
<project>_<facility>_V_<time-start>_<site-code>_FV00_<data-type>_<time-end>_<modified>.nc.gz
In the first instance this level of detail seems like a bit of overkill, but consider the scope of the IMOS data archive. It is the final resting place for data collected by the entire national array of oceanographic observing equipment in Australia, which monitors the open oceans and coastal marine environment covering physical, chemical and biological variables. Since the data are so well labelled, locating all monthly timescale ACORN data from the Turquoise Coast and Rottnest Shelf sites (which represents hundreds of files) would be as simple as typing the following at the command line:
ls */ACORN/monthly_*/{TURQ,ROT}/*/*.nc
While it's unlikely that your research will ever involve cataloging data from such a large observational network, it's still a very good idea to develop your own personal DRS for the data you do have. This often involves investing some time at the beginning of a project to think carefully about the design of your directory and file name structures, as these can be very hard to change later on (a good example is the DRS used by the Climate Model Intercomparison Project). The combination of bash shell wildcards and a well planned DRS is one of the easiest ways to make your research more efficient and reliable.
We haven't even looked inside our IMOS data file and already we have the beginnings of a detailed data management plan. The first step in any research project should be to develop such a plan, so for this challenge we are going to turn back time. If you could start your current research project all over again, what would your data management plan look like? Things to consider include:
Write down and discuss your plan with your partner.
We can guess from the .nc extension that we are dealing with a Network Common Data Form (netCDF) file. It's also compressed, hence the .gz extension.
Had we actually downloaded the file to our computer and uncompressed it,
our initial impulse might have been to type,
!cat IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc
but such a command would produce an incomprehensible mix of symbols and letters.
The reason is that up until now,
we have been dealing with text files.
These consist of a simple sequence of character data
(represented using ASCII, Unicode, or some other standard)
separated into lines,
meaning that text files are human-readable when opened with a text editor or displayed using cat
.
All other file types (including netCDF) are known collectively as binary files.
They tend to be smaller and faster for the computer to interpret than text files,
but the payoff is that they aren't human-readable unless you have the right interpreter
(e.g. .doc
files aren't readable with your text editor and must instead be opened with Microsoft Word).
To view the contents of a netCDF file we'd need to use a special command line utility called
ncdump
,
which is included with the netCDF C software
(i.e. if you're using any software that knows how to deal with netCDF -
such as the Anaconda scientific Python distribution
used in Software Carpentry workshops -
then you probably have ncdump
installed without even knowing it).
For this example, however, we haven't actually downloaded the netCDF file to our machine.
Instead, IMOS has made the data available via a THREDDS server,
which means we can just pass a URL to the netCDF4.Dataset
function in order to obtain the data.
In [2]:
print(acorn_DATA)
The great thing about netCDF files is that they contain metadata - that is, data about the data. There are global attributes that give information about the file as a whole (shown above - we will come back to these later), while each variable also has its own attributes.
In [3]:
print('The file contains the following variables:')
print(acorn_DATA.variables.keys())
(The 'u' means each variable name is represented by a Unicode string.)
In [4]:
print('These are the attributes of the time axis:')
print(acorn_DATA.variables['TIME'])
print('These are some of the time values:')
print(acorn_DATA.variables['TIME'][0:10])
The raw time values are fairly meaningless, but we can use the time attributes to convert them to a more meaningful format...
In [5]:
from netCDF4 import num2date
units = acorn_DATA.variables['TIME'].units
calendar = acorn_DATA.variables['TIME'].calendar
times = num2date(acorn_DATA.variables['TIME'][:], units=units, calendar=calendar)
print(times[0:10])
Climate and Forecast (CF) metadata convention
When performing simple data analysis tasks on netCDF files, command line tools like the Climate Data Operators (CDO) are often a better alternative to writing your own functions in Python. However, let's put ourselves in the shoes of the developers of CDO for a minute. In order to calculate the time mean of a dataset for a given start and end date (for example), CDO must first identify the units of the time axis. This isn't as easy as you'd think, since the creator of the netCDF file could easily have called the units attribute measure, or scale, or something else completely unpredictable. They could also have defined the units as weeks since 1-01-01 00:00:00 or milliseconds after 1979-12-31. Obviously what is needed is a standard method for defining netCDF attributes, and that's where the Climate and Forecast (CF) metadata convention comes in.

The CF metadata standard was first defined back in the early 2000s and has now been adopted by all the major institutions and projects in the weather/climate sciences. There's a nice blog post on the topic if you'd like more information, but for the most part you just need to be aware that if a tool like CDO isn't working, it might be because your netCDF file isn't CF compliant.
For the sake of example, let's say that our data file contained the zonal (east/west; 'UCUR') and meridional (north/south; 'VCUR') surface current components, but not the total current speed. To calculate it, we first need to assign a variable to the zonal and meridional current data.
In [6]:
uData = acorn_DATA.variables['UCUR'][:,:,:]
vData = acorn_DATA.variables['VCUR'][:,:,:]
Both uData
and vData
are a special type of numpy array
(which we have met previously)
known as a masked array,
whereby some of the points in the time/latitude/longitude grid have missing (or masked) values.
Just as with a normal numpy array,
we can check the shape of our data
(in fact, masked arrays can do everything normal numpy arrays can do and more).
In [7]:
print(type(uData))
print(uData.shape)
In other words, 493 time steps, 55 latitudes and 57 longitudes. We can now go ahead and calculate the current speed.
In [8]:
spData = (uData**2 + vData**2)**0.5
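As an aside, the same calculation can be written with numpy's hypot function, which computes the magnitude in one step and propagates the mask. Here's a sketch with made-up component values rather than the IMOS arrays:

```python
import numpy as np
import numpy.ma as ma

# Made-up current components; the masked entry mimics a missing observation.
uData = ma.masked_invalid([0.3, np.nan])
vData = ma.masked_invalid([0.4, 0.1])

spData = np.hypot(uData, vData)  # equivalent to (uData**2 + vData**2)**0.5
print(spData)  # first value is 0.5; second stays masked
```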
It's a good idea to regularly view your data throughout the code development process, just to ensure nothing crazy has happened along the way. Below is a code excerpt from this example in the AODN user code library, which simply plots one of the 493 timesteps.
In [9]:
%matplotlib inline
from matplotlib.pyplot import figure, pcolor, colorbar, xlabel, ylabel, title, draw, quiver, show, savefig
LAT = acorn_DATA.variables['LATITUDE']
LON = acorn_DATA.variables['LONGITUDE']
TIME = acorn_DATA.variables['TIME']
# Only one time value is being plotted. Modify timeIndex if desired (any value between 0 and len(TIME)-1).
timeIndex = 4
speedData = spData[timeIndex,:,:]
latData = LAT[:]
lonData = LON[:]
# sea water U and V components
uData = acorn_DATA.variables['UCUR'][timeIndex,:,:]
vData = acorn_DATA.variables['VCUR'][timeIndex,:,:]
units = acorn_DATA.variables['UCUR'].units
figure1 = figure(figsize=(12, 9), dpi=80, facecolor='w', edgecolor='k')
pcolor(lonData , latData, speedData)
cbar = colorbar()
cbar.ax.set_ylabel('Current speed in ' + units)
title(acorn_DATA.title + '\n' + num2date(TIME[timeIndex], TIME.units, TIME.calendar).strftime('%d/%m/%Y'))
xlabel(LON.long_name + ' in ' + LON.units)
ylabel(LAT.long_name + ' in ' + LAT.units)
# plot velocity field
Q = quiver(lonData[:], latData[:], uData, vData, units='width')
show()
#savefig('surface_current.svg', bbox_inches='tight')
Plotting options
Quite a few lines of code were required to create our publication quality figure using matplotlib, and there would have been even more had we wanted to use the basemap library to plot coastlines or change the map projection. Recognising this burden, the team at the UK Met Office have developed Iris and Cartopy, which build on matplotlib and basemap to provide a more convenient interface (read: shorter, less complex code) for plotting in the weather, climate and ocean sciences.
The easiest way to install Iris and Cartopy is to use the conda package installer that comes with Anaconda. Simply enter the following at the command line:
conda install -c scitools iris
What if you want to view the contents of a netCDF file quickly, rather than go to the effort of producing something that is publication quality? There are numerous tools out there for doing this, including Panoply and UV-CDAT.
Let's say (hypothetically) that the TURQ site radar has been found to be unreliable for surface current speeds greater than 0.9 m/s.
To correct for this problem,
the IMOS documentation suggests setting all values greater than 0.9 m/s back to 0.9 m/s.
The most obvious solution to this problem would be to loop through every element in spData
and check its value:
for t in range(0, len(TIME[:])):
    for y in range(0, len(LAT[:])):
        for x in range(0, len(LON[:])):
            if spData[t, y, x] > 0.9:
                spData[t, y, x] = 0.9
The problem is that not only is this nested loop kind of ugly, it's also pretty slow.
If our data array was even larger
(e.g. like the huge data arrays that high resolution global climate models produce),
then it would probably be prohibitively slow.
The reason is that high level languages like Python and Matlab are built for usability
(i.e. they make it easy to write concise, readable code), not speed.
To get around this problem,
people have written Python libraries (like numpy
) in low level languages like C and Fortran,
which are built for speed (but definitely not usability).
Fortunately we don't ever need to see the C code under the hood of the numpy
library,
but we should use it to vectorise our array operations whenever possible.
With this in mind:

- use the numpy.ma.where() function to correct all values in spData greater than 0.9 (hint: you'll need to import numpy.ma)
- use the %timeit cell magic to compare the speed of your answer to the nested loop described above

Now that we've developed some code for reading in zonal and meridional surface current data and calculating the speed,
the logical next step is to put that code in a script called calc_current_speed.py
so that we can repeat the process quickly and easily.
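A vectorised version of that correction (the numpy.ma.where approach suggested in the challenge above) might look like the following sketch, which uses made-up values rather than the full IMOS array:

```python
import numpy as np
import numpy.ma as ma

# Stand-in for spData: one masked point, two values above the 0.9 m/s limit.
spData = ma.masked_invalid([[0.2, 1.3],
                            [np.nan, 0.95]])

# Replace every speed above 0.9 m/s with 0.9; masked points stay masked.
spData = ma.where(spData > 0.9, 0.9, spData)
print(spData)
```

Because the comparison and replacement happen inside numpy's compiled C code, this runs in a single pass over the array instead of three nested Python loops.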
The output of that script will be a new netCDF file containing the current speed data,
however there's one more thing to consider before we go ahead and start creating new files.
Looking closely at the global attributes of
IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc
you can see that the entire history of the file,
all the way back to its initial download,
has been recorded in the history
attribute.
In [10]:
print(acorn_DATA.history)
This practice of recording the history of the file ensures the provenance of the data. In other words, a complete record of everything that has been done to the data is stored with the data, which avoids any confusion in the event that the data is ever moved, passed around to different users, or viewed by its creator many months later.
If we want to create our own entry for the history attribute, we'll need to be able to generate a time stamp, a record of the command line entry (and Python executable) used to execute calc_current_speed.py, and the corresponding git hash.

A library called datetime can be used to find out the time and date right now:
In [11]:
import datetime
time_stamp = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
print(time_stamp)
The strftime
function can be used to customise the appearance of a datetime object;
in this case we've made it look just like the other time stamps in our data file.
In the Software Carpentry lesson on command line programs we met sys.argv
,
which contains all the arguments entered by the user at the command line:
In [12]:
import sys
print(sys.argv)
In launching this IPython notebook,
you can see that a number of command line arguments were used.
To join all these list elements up,
we can use the join
function that belongs to Python strings:
In [13]:
args = " ".join(sys.argv)
print(args)
While this list of arguments is very useful,
it doesn't tell us which Python installation was used to execute those arguments.
The sys
library can help us out here too:
In [14]:
exe = sys.executable
print(exe)
In the Software Carpentry lessons on git we learned that each commit is associated with a unique 40-character identifier known as a hash. We can use the git Python library to get the hash associated with the script:
In [18]:
from git import Repo
import os
git_hash = Repo(os.getcwd()+'/../').heads[0].commit
print(git_hash)
We can now put all this information together for our history entry:
In [19]:
entry = """%s: %s %s (Git hash: %s)""" %(time_stamp, exe, args, str(git_hash)[0:7])
print(entry)
So far we've been experimenting in the IPython notebook to familiarise ourselves with netCDF4
and the other Python libraries that might be useful for calculating the surface current speed.
We should now go ahead and write a script,
so we can repeat the process with a single entry at the command line:
In [21]:
!cat code/calc_current_speed.py
Introducing xray
It took four separate functions in calc_current_speed.py to create the output file, because we had to copy the dimensions and most of the global and variable attributes from the original file to the new file. This is such a common problem that a Python library called xray has been developed, which conserves metadata whenever possible. When xray is used to read a netCDF file, the data are stored as an xray.DataArray (as opposed to a numpy.ndarray). These special data arrays carry their dimension information and variable attributes with them, which means you don't have to retrieve them manually. xray also comes with a bunch of convenience functions for doing typical weather/climate/ocean tasks (calculating climatologies, anomalies, etc), which can be a pain using numpy.

Similar to Iris and Cartopy, the easiest way to install xray is with the conda package installer. Simply type the following at the command line:
conda install xray dask netCDF4 bottleneck
Using the help information that the argparse
library provides, we can now go ahead and run this script at the command line:
In [22]:
!python code/calc_current_speed.py -h
In [25]:
!python code/calc_current_speed.py http://thredds.aodn.org.au/thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz UCUR VCUR IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc
We can now inspect the attributes in our new file:
In [26]:
!ncdump -h IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc
There are a lot of global and variable attributes in our output file,
because we copied everything from the input file.
It might be tempting to edit calc_current_speed.py
or use the ncatted
command line utility
(which, like ncdump
, comes with the NCO install)
to start cleaning up (i.e. delete) some of these attributes,
but we should resist that urge.
The id
global attribute, for instance, makes little sense to anyone who doesn't work at IMOS.
While this might seem like a reasonable argument for deleting that attribute,
once an attribute is deleted it's gone forever.
The id
attribute is taking up a negligible amount of memory,
so why not leave it there, just in case?
When in doubt, keep metadata.
Does your data management plan from the first challenge adequately address this issue of data provenance? If not, go ahead and add to your plan now. Things to consider include:
(Unlike .nc files, formats like .csv or .png don't store things like global and variable attributes within them.)

Discuss the additions you've made to your plan with your partner.