The latest ARCCSSive stable version is available from the conda analysis27 environment.
Anyone can load it, both from raijin and from the remote desktop.
In [1]:
! module use /g/data3/hh5/public/modules
! module load conda/analysis27
The database location is saved in the $CMIP5_DB environment variable. This is defined automatically if you have loaded ARCCSSive from conda/analysis27.
In [2]:
! export CMIP5_DB=sqlite:////g/data1/ua6/unofficial-ESG-replica/tmp/tree/cmip5_raijin_latest.db
Import CMIP5 from the module and use the method connect() to open a connection to the database.
In [3]:
from ARCCSSive import CMIP5
db=CMIP5.connect()
Opening a connection creates a session object (in this case db). A session manages all the communication with the database and holds all the objects you have loaded or associated with it during its lifespan. Every query to the database is run through the session.
There are a number of helper functions for common operations:
In [4]:
db.models()
Out[4]:
models() returns all the models recorded in the database;
experiments(), variables() and mips() produce similar lists for their respective fields (see the example below).
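For instance, a minimal illustration (output omitted): each of the helpers listed above is called on the session in the same way as models().
In [ ]:
# each helper returns the full list of distinct values recorded for that field
print(db.experiments())
print(db.variables())
print(db.mips())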
To perform a search you can use the outputs( ) function.
outputs( ) is a 'shortcut' that performs a session.query on the Instances table.
The following example shows all the input arguments you can use; the order doesn't matter and you can omit any of them.
db.outputs( column-name='value', ... )
will return all the rows of the Instances table that match the given constraints.
In [5]:
results=db.outputs(variable='tas',experiment='historical',mip='Amon',model='MIROC-ESM-CHEM',ensemble='r1i1p1')
You can check how many instances your search returned by using the query method count().
In [7]:
results.count()
Out[7]:
In this case we defined every possible constraint for the table and hence we get just one instance.
This should always be the case when you use all five attributes, because every instance is fully defined by them and each instance is unique.
We can loop through the instances returned by the search and access their attributes, as well as the attributes of their children (i.e. the related versions and files).
In [8]:
for o in results:
    print(o.model,o.variable,o.ensemble)
    print()
    print("drstree path is " + str(o.drstree_path()))
    for v in o.versions:
        print()
        print('version', v.version)
        print('dataset-id', v.dataset_id)
        print('is_latest', v.is_latest, 'checked on', v.checked_on)
        print()
        print(v.path)
        for f in v.files:
            print(f.filename, f.tracking_id)
            print(f.md5, f.sha256)
Let's take a closer look at results.
In [9]:
results=db.outputs(variable='tas',experiment='historical',mip='Amon',model='MIROC-ESM-CHEM',ensemble='r1i1p1')
type(results)
Out[9]:
results is a query object, but as we saw before we can loop through it as we would a list.
In this particular case only one instance was returned in results, but we still need to use an index to access it.
In [10]:
type(results[0])
Out[10]:
A useful attribute of an instance is versions; this is a list of all the versions associated with that particular instance.
From a database point of view these are all the rows in the Versions table which are related to that particular instance.
In [11]:
results[0].versions
Out[11]:
We have two versions available for this instance, we can loop through them and retrieve their attributes:
In [12]:
for o in results:
    for v in o.versions:
        print()
        print(v.version)
        print()
        print(v.path)
If we want to get only the latest version, we can use the latest( ) method of the Instance class.
In [13]:
results[0].latest()[0].version
Out[13]:
As you might have noticed, latest( ) returns a list of Version objects rather than only one.
This is because there might be different copies of the same version, downloaded from different servers.
Currently the database lists all of them, so that if you used one copy rather than another in the past you can still find it.
There are plans, though, to keep just one copy per version to simplify collection management and save storage resources.
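As a minimal sketch (reusing the results query from above), you can loop over every copy returned by latest( ) rather than taking just the first element:
In [ ]:
# latest() may return more than one copy of the same version,
# downloaded from different servers; inspect all of them
for v in results[0].latest():
    print(v.version, v.path)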
Other methods available for the Instances table (objects) are:
In [14]:
results[0].filenames()
Out[14]:
In [15]:
results[0].drstree_path()
Out[15]:
In [ ]:
% ls -l /g/data1/ua6/DRSv2/CMIP5/CCSM4/rcp45/day/atmos/r1i1p1/tas/latest
In most cases you can use the drstree_path() method directly to get to the files, but it can be useful to find all the available versions.
For example, you may want to make sure that a new version hasn't been added recently, since DRSv2 is updated only once a week.
Or, if you find that the version linked by DRSv2 is incomplete, there might be another copy of the same version.
We hope eventually to have just one copy of each version, with all of them clearly defined.
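For example, a minimal sketch that lists every version recorded for the instance above, so you can spot a copy that is newer or more complete than the one linked under DRSv2:
In [ ]:
# list every version of this instance recorded in the database,
# with its is_latest flag and the directory holding the files
for v in results[0].versions:
    print(v.version, v.is_latest, v.path)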
We can refine our results by using the SQLAlchemy filter( ) function.
We will use the attributes (or columns) of the database tables as constraints.
So, first we need to import the table definitions from ARCCSSive.
In [16]:
from ARCCSSive.CMIP5.Model import Instance, Version, VersionFile
#print(type(Instance))
We can also import the unique( ) function. This function will give us all the possible values we can use to filter over a particular attribute.
In [17]:
from ARCCSSive.CMIP5.other_functions import unique
Let's do a new query.
In [18]:
results=db.outputs(variable='tas',experiment='rcp45',mip='day')
results.count()
Out[18]:
We would like to filter the results by ensemble, so we will use unique( ) to get all the possible ensemble values.
In [19]:
ensembles=unique(results,'ensemble')
print(ensembles)
unique( results, 'attribute' ) takes two inputs: a query object and the name of an attribute.
unique( ) lists all the distinct values returned by the query for that particular attribute.
Now that we know all the ensembles values, let's choose one to filter our results.
In [20]:
r6i1p1_ens=results.filter(Instance.ensemble == 'r6i1p1')
print( r6i1p1_ens.count() )
unique(r6i1p1_ens,'ensemble')
Out[20]:
We used the == equality operator to select all the r6i1p1 ensembles.
If we wanted all the "r6i1p#" ensembles regardless of their physics (p) value, we could have used the like operator.
In [21]:
r6i1_ens=results.filter(Instance.ensemble.like('r6i1p%'))
print( r6i1_ens.count() )
unique(r6i1_ens,'ensemble')
Out[21]:
If we want to search for two variables at the same time we can leave the variable constraint out of the query inputs,
and then use filter with the in_ operator to select them.
In [22]:
results=db.outputs(ensemble='r1i1p1',experiment='rcp45',mip='day')\
.filter(Instance.variable.in_(['tasmin','tasmax']))
results.count()
Out[22]:
As you can see, filter can directly follow the query, i.e. the outputs( ) function.
In fact, you can refine a query with as many successive filters as you want.
In [23]:
results=db.outputs(ensemble='r1i1p1',experiment='rcp45',mip='day')\
.filter(Instance.variable.in_(['tasmin','tasmax']))\
.filter(Instance.model.like('%ESM%'))
results.count()
Out[23]:
Once we have found the instances and versions we want, we can use their path to locate the files and work with them.
First we load numpy and the netCDF4 module.
In [24]:
import numpy as np
from netCDF4 import MFDataset
All you need to open a file is its location; this is stored in the Versions table of the database as path.
Alternatively, you can use the drstree path, which is returned by the Instance drstree_path( ) method.
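As a small sketch (reusing one of the instances returned by an earlier search, with illustrative variable names), the two options look like this:
In [ ]:
# exact directory of one specific version, as stored in the Versions table
version_path = str(results[0].versions[0].path)
# convenience path built from the instance attributes, under the drstree
drs_path = results[0].drstree_path()
print(version_path)
print(drs_path)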
Let's define a simple function that reads a variable from a file and calculates its maximum value.
We will use MFDataset( ) from the netCDF4 module to open all the netCDF files in the input path as one aggregated file.
In [25]:
def var_max(var,path):
    ''' calculate max value for variable '''
    # MFDataset will open all netcdf files in path as one aggregated file
    print(path+"/*.nc")
    # open the file
    nc=MFDataset(path+"/*.nc",'r')
    # read the variable from file into a numpy array
    data = nc.variables[var][:]
    # close the file
    nc.close()
    # return the maximum
    return np.max(data)
Now we perform a search, loop through the results, and pass the Version path attribute to the var_max( ) function.
In [26]:
results=db.outputs(ensemble='r1i1p1',experiment='rcp45',mip='day').filter(Instance.model.like('MIROC%'))\
.filter(Instance.variable.in_(['tas','pr']))
print(results.count())
for o in results[:2]:
    var = o.variable
    for v in o.versions:
        path=str(v.path)
        varmax=var_max(var,path)
        print()
        print('Maximum value for variable %s, version %s is %d' % (var, v.version, varmax))
NB: if you pass the v.path value directly you get an error, because the database returns a unicode string, so you need to use the str( ) function to convert it to a normal string.
In the previous example we simply looped through the results returned by the search as they were and passed them to a function that opened the files.
But what if we want to do something more complex?
Let's say that we want to pass two variables to a function, and do so for every model/ensemble that has both of them, for a fixed experiment and mip.
Most users would loop over the drstree path in some way, doing something like:
cd /g/data1/ua6/DRSv2/CMIP5
list all models and save in model_list
for model in model_list:
    list all available ensembles and save in ensemble_list
    for ensemble in ensemble_list:
        call_function(var1_path, var2_path)
Using ARCCSSive we can do the same, using the unique( ) function to return the list of all available models/ensembles.
Let's start by defining a simple function that calculates the difference between the means of two variables.
In [27]:
def vars_difference(var1,path1,var2,path2):
    ''' calculate difference between the mean of two variables '''
    # open the files and read both variables
    nc1=MFDataset(path1+"/*.nc",'r')
    data1 = nc1.variables[var1][:]
    nc1.close()
    nc2=MFDataset(path2+"/*.nc",'r')
    data2 = nc2.variables[var2][:]
    nc2.close()
    # return the difference between the two means
    return np.mean(data2) - np.mean(data1)
Now let's do another search and get tasmin and tasmax.
In [28]:
results=db.outputs(ensemble='r1i1p1',experiment='rcp45',mip='Amon').filter(Instance.model.like('MIROC%'))\
.filter(Instance.variable.in_(['tasmin','tasmax']))
results.count()
Out[28]:
Get the list of distinct models and ensembles using unique( ).
In [31]:
models=unique(results,'model')
ensembles=unique(results,'ensemble')
Now we loop over the models and the ensembles; for each model-ensemble combination we call the function if we have an instance for both variables.
In [32]:
for mod in models:
    for ens in ensembles:
        # we filter the results twice, using the model and ensemble values plus one variable at a time
        tasmin_inst=results.filter(Instance.model==mod, Instance.ensemble==ens, Instance.variable=='tasmin').first()
        tasmax_inst=results.filter(Instance.model==mod, Instance.ensemble==ens, Instance.variable=='tasmax').first()
        # we check that both filters returned something and call the function if they did
        if tasmax_inst and tasmin_inst:
            tasmin_path=tasmin_inst.latest()[0].path
            tasmax_path=tasmax_inst.latest()[0].path
            diff=vars_difference('tasmin',str(tasmin_path),'tasmax',str(tasmax_path))
            print('Difference for model %s and ensemble %s is %d' % (mod, ens, diff))
NB: we used first( ) after the filter because we know we should get back either one instance or None. We cannot use one( ) because that would raise an error if it can't find anything.
Also, we should have checked that we are using the same version for both variables, rather than just getting the latest!
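A possible sketch of such a check (assuming tasmin_inst and tasmax_inst are the two instances found inside the loop above, and tasmin_by_version is just an illustrative name): match versions by their version label and only call the function when both variables have the same one.
In [ ]:
# index the tasmin versions by their version label
tasmin_by_version = {v.version: v for v in tasmin_inst.versions}
# pair each tasmax version with the matching tasmin version, if any
for v in tasmax_inst.versions:
    if v.version in tasmin_by_version:
        diff = vars_difference('tasmin', str(tasmin_by_version[v.version].path),
                               'tasmax', str(v.path))
        print('Version %s difference is %f' % (v.version, diff))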
This is just an attempt to replicate the way we use drstree, but as you become more familiar with the module and with SQLAlchemy you can set up more sophisticated searches.