Import CMIP5 from the module and start a session

The latest ARCCSSive stable version is available from the conda analysis27 environment
Anyone can load them both from raijin and the remote desktop.


In [1]:
!  module use /g/data3/hh5/public/modules
!  module load conda/analysis27

The database location is saved in the $CMIP5_DB environment variable. This is defined automatically if you have loaded ARCCSSive from conda/analysis27.


In [2]:
! export CMIP5_DB=sqlite:////g/data1/ua6/unofficial-ESG-replica/tmp/tree/cmip5_raijin_latest.db

Import CMIP5 from the module and use the method connect() to open a connection to the database.


In [3]:
from ARCCSSive import CMIP5
db=CMIP5.connect()

Opening a connection creates a session object (in this case db). A session manages all the comunication with the database and contains all the objects which you’ve loaded or associated with it during its lifespan. Every query to the database is run through the session.
There are a number of helper functions for common operations:


In [4]:
db.models()


Out[4]:
[u'ACCESS1-0',
 u'ACCESS1-3',
 u'BESM-OA2-3',
 u'BNU-ESM',
 u'CCSM4',
 u'CESM1-BGC',
 u'CESM1-CAM5',
 u'CESM1-CAM5-1-FV2',
 u'CESM1-FASTCHEM',
 u'CESM1-WACCM',
 u'CFSv2-2011',
 u'CMCC-CESM',
 u'CMCC-CM',
 u'CMCC-CMS',
 u'CNRM-CM5',
 u'CNRM-CM5-2',
 u'COSMOS-ASO',
 u'CSIRO-Mk3-6-0',
 u'CSIRO-Mk3L-1-2',
 u'CanAM4',
 u'CanCM4',
 u'CanESM2',
 u'EC-EARTH',
 u'EC-EARTH-2-2',
 u'FGOALS-g2',
 u'FGOALS-gl',
 u'FGOALS-s2',
 u'FIO-ESM',
 u'GEOS-5',
 u'GFDL-CM2p1',
 u'GFDL-CM3',
 u'GFDL-ESM2G',
 u'GFDL-ESM2M',
 u'GFDL-HIRAM-C180',
 u'GFDL-HIRAM-C360',
 u'GISS-E2-H',
 u'GISS-E2-H-CC',
 u'GISS-E2-R',
 u'GISS-E2-R-CC',
 u'HadCM3',
 u'HadGEM2-A',
 u'HadGEM2-AO',
 u'HadGEM2-CC',
 u'HadGEM2-ES',
 u'IPSL-CM5A-LR',
 u'IPSL-CM5A-MR',
 u'IPSL-CM5B-LR',
 u'KCM1-2-2',
 u'MIROC-ESM',
 u'MIROC-ESM-CHEM',
 u'MIROC4h',
 u'MIROC5',
 u'MPI-ESM-LR',
 u'MPI-ESM-MR',
 u'MPI-ESM-P',
 u'MRI-AGCM3-2H',
 u'MRI-AGCM3-2S',
 u'MRI-CGCM3',
 u'MRI-ESM1',
 u'NorESM1-M',
 u'NorESM1-ME',
 u'bcc-csm1-1',
 u'bcc-csm1-1-m',
 u'inmcm4']

models() return all the models recorded in the database,
experiments(), variables(), mips() produce similar lists for each respective field

To perform a search you can use the outputs( ) function.
outputs( ) is a 'shortcut' to perform a session.query on the Instances table. The following example shows all the input arguments you can use, the order doesn't matter and you can omit any of them.

db.outputs( column-name='value', ... )

will return all the rows for the Instances table in the database.


In [5]:
results=db.outputs(variable='tas',experiment='historical',mip='Amon',model='MIROC-ESM-CHEM',ensemble='r1i1p1')

You can check how many instances your search returned by using the query method count()


In [7]:
results.count()


Out[7]:
1

In this case we defined every possible constraint for the table and hence we get just one instance.
This should always be the case, if you use all the five attributes, because every instance is fully defined by these and each instance is unique.

We can loop through the instances returned by the search and access their attributes and their children ( i.e. related versions and files) attributes.


In [8]:
for o in results:
    print(o.model,o.variable,o.ensemble)
    print()
    print("drstree path is " + str(o.drstree_path()))
    for v in o.versions:
        print()
        print('version', v.version)
        print('dataset-id', v.dataset_id)
        print('is_latest', v.is_latest, 'checked on', v.checked_on)
        print()
        print(v.path)
        for f in v.files:
            print(f.filename, f.tracking_id)
            print(f.md5, f.sha256)


(u'MIROC-ESM-CHEM', u'tas', u'r1i1p1')
()
drstree path is /g/data1/ua6/DRSv2/CMIP5/MIROC-ESM-CHEM/historical/mon/atmos/r1i1p1/tas/latest
()
('version', u'v20111004')
('dataset-id', u'NA')
('is_latest', False, 'checked on', u'2016-10-08')
()
/g/data/ua6/unofficial-ESG-replica/tmp/tree/dias-esg-nd.tkl.iis.u-tokyo.ac.jp/thredds/fileServer/esg_dataroot/outgoing/output1/MIROC/MIROC-ESM-CHEM/historical/mon/atmos/Amon/r1i1p1/v20111004/tas
(u'tas_Amon_MIROC-ESM-CHEM_historical_r1i1p1_185001-200512.nc', u'27c8356f-39f4-4d66-8915-3a5f48b76318')
(None, None)
()
('version', u'v20120710')
('dataset-id', u'cmip5.output1.MIROC.MIROC-ESM-CHEM.historical.mon.atmos.Amon.r1i1p1.v20120710|esgf-data1.diasjp.net')
('is_latest', True, 'checked on', u'2016-10-08')
()
/g/data/ua6/unofficial-ESG-replica/tmp/tree/dias-esg-nd.tkl.iis.u-tokyo.ac.jp/thredds/fileServer/esg_dataroot/outgoing/output1/MIROC/MIROC-ESM-CHEM/historical/mon/atmos/Amon/r1i1p1/v20120710/tas
(u'tas_Amon_MIROC-ESM-CHEM_historical_r1i1p1_185001-200512.nc', u'27c8356f-39f4-4d66-8915-3a5f48b76318')
(u'3459eff8fd39ff88c8b5e72f11694862', u'38bf378e1f11cabd42cee26124bfef6ac53e774f8ac9217ca856e103b4341acf')

Let's have a better look at results


In [9]:
results=db.outputs(variable='tas',experiment='historical',mip='Amon',model='MIROC-ESM-CHEM',ensemble='r1i1p1')
type(results)


Out[9]:
sqlalchemy.orm.query.Query

results is a query object but as we saw before we can loop through it as we do with a list.
In this particular case we have only one instance returned in results, but we still need to use an index to access it.


In [10]:
type(results[0])


Out[10]:
ARCCSSive.CMIP5.Model.Instance

A useful attribute of an instance is versions, this is a list of all the versions associated to that particular instance.
From a database point of view these are all the rows in the Versions table which are related to that particular instance.


In [11]:
results[0].versions


Out[11]:
[<ARCCSSive.CMIP5.Model.Version at 0x7f6436cca550>,
 <ARCCSSive.CMIP5.Model.Version at 0x7f6436c55890>]

We have two versions available for this instance, we can loop through them and retrieve their attributes:


In [12]:
for o in results:
    for v in o.versions:
        print()
        print(v.version)
        print()
        print(v.path)


()
v20111004
()
/g/data/ua6/unofficial-ESG-replica/tmp/tree/dias-esg-nd.tkl.iis.u-tokyo.ac.jp/thredds/fileServer/esg_dataroot/outgoing/output1/MIROC/MIROC-ESM-CHEM/historical/mon/atmos/Amon/r1i1p1/v20111004/tas
()
v20120710
()
/g/data/ua6/unofficial-ESG-replica/tmp/tree/dias-esg-nd.tkl.iis.u-tokyo.ac.jp/thredds/fileServer/esg_dataroot/outgoing/output1/MIROC/MIROC-ESM-CHEM/historical/mon/atmos/Amon/r1i1p1/v20120710/tas

If we want to get only the latest version, we can use the latest( ) method of the Instance class.


In [13]:
results[0].latest()[0].version


Out[13]:
u'v20120710'

As you might have noticed latest( ) returns a list of Version objects rather than only one.
This is because there might be different copies of the same version, downloaded from different servers.
Currently the database lists all of them so that if you used one rather than the other in the past you can still find it.
There are plans though to keep just one copy per version to facilitate the collection management and save storage resources.

Other methods available for the Instances table (objects) are:

  • filenames( )
  • drstree_path( )

In [14]:
results[0].filenames()


Out[14]:
[u'tas_Amon_MIROC-ESM-CHEM_historical_r1i1p1_185001-200512.nc']

In [15]:
results[0].drstree_path()


Out[15]:
u'/g/data1/ua6/DRSv2/CMIP5/MIROC-ESM-CHEM/historical/mon/atmos/r1i1p1/tas/latest'

In [ ]:
% ls -l /g/data1/ua6/DRSv2/CMIP5/CCSM4/rcp45/day/atmos/r1i1p1/tas/latest

!!Warning!!

In most cases you can use directly the drstee_path() method to get to the files, but it can be useful to find all the available versions.
For example if you want to make sure that a new version hasn't been added recently, DRSv2 it is updated only once a week.
Or if you find that the version linked by the DRSv2 is incomplete, there might be another copy of the same version.
We hope eventually to be able to have just one copy for each version and all of them clearly defined.

Filter search results

We can refine our results by using the SQLalchemy filter( ) function.

We will use the attributes ( or columns ) of the database tables as constraints.
So, first we need to import the tables definitions from ARCCSSive.


In [16]:
from ARCCSSive.CMIP5.Model import Instance, Version, VersionFile
#print(type(Instance))

We can also import the unique( ) function. This function will give us all the possible values we can use to filter over a particular attribute.


In [17]:
from ARCCSSive.CMIP5.other_functions import unique

Let's do a new query


In [18]:
results=db.outputs(variable='tas',experiment='rcp45',mip='day')
results.count()


Out[18]:
105

We would like to filter the results by ensemble, so we will use unique( ) to get all the possible ensemble values.


In [19]:
ensembles=unique(results,'ensemble')
print(ensembles)


[u'r10i1p1', u'r11i1p1', u'r12i1p1', u'r13i1p1', u'r14i1p1', u'r1i1p1', u'r1i2p1', u'r2i1p1', u'r2i2p1', u'r3i1p1', u'r3i2p1', u'r4i1p1', u'r5i1p1', u'r6i1p1', u'r6i1p3', u'r7i1p1', u'r8i1p1', u'r9i1p1']

unique( results, 'attribute' ) takes two inputs:

  • results is a query object on the Instances table, for example what is returned by the db.outputs( ) function
  • 'attribute' is a string defining a particular attribute or column of the Instances table, for example 'model'

unique( ) lists all the distinct values returned by the query for that particular attribute.

Now that we know all the ensembles values, let's choose one to filter our results.


In [20]:
r6i1p1_ens=results.filter(Instance.ensemble == 'r6i1p1')
print( r6i1p1_ens.count() )
unique(r6i1p1_ens,'ensemble')


6
Out[20]:
[u'r6i1p1']

We used the == equals operator to select all the r61i1p1 ensembles.
If we wanted all the "r6i1p#" ensembles regardless of their physics (p) value we could have used the like operator.


In [21]:
r6i1_ens=results.filter(Instance.ensemble.like('r6i1p%'))
print( r6i1_ens.count() )
unique(r6i1_ens,'ensemble')


8
Out[21]:
[u'r6i1p1', u'r6i1p3']

If we want to search two variables at the same time we can leave the variable constraints out of the query inputs,
and then use filter with the in_ operator to select them.


In [22]:
results=db.outputs(ensemble='r1i1p1',experiment='rcp45',mip='day')\
.filter(Instance.variable.in_(['tasmin','tasmax']))
results.count()


Out[22]:
73

As you can see filter can follow directly the query, i.e. the outputs( ) function.
In fact, you can refine a query with how many successive filters as you want.


In [23]:
results=db.outputs(ensemble='r1i1p1',experiment='rcp45',mip='day')\
.filter(Instance.variable.in_(['tasmin','tasmax']))\
.filter(Instance.model.like('%ESM%'))
results.count()


Out[23]:
22

Using the search results to open the files

Once we have found the instances and versions we want to use we can use their path to find the files and work with them.
First we load numpy and the netcdf module.


In [24]:
import numpy as np
from netCDF4 import MFDataset

All you need to open a file is the location, this is stored in the Versions table in the database as path.
Alternatively you can use the drstree path, that is returned by the Instance drstree_path( ) method.

Let's define a simple function that reads a variable from a file and calculate its maximum value.
We will use MFDataset( ) from the netcDF4 module to open all the netcdf files in the input path as one aggregated file.


In [25]:
def var_max(var,path):
    ''' calculate max value for variable '''
    
    # MFDataset will open all netcdf files in path as one aggregated file    
    print(path+"/*.nc")
    # open the file
    nc=MFDataset(path+"/*.nc",'r')
    # read the variable from file into a numpy array
    data = nc.variables[var][:]
    # close the file
    nc.close()
    # return the maximum
    return np.max(data)

Now we perform a search, loop through the results and pass the Version path attribute to the var_max( ) function


In [26]:
results=db.outputs(ensemble='r1i1p1',experiment='rcp45',mip='day').filter(Instance.model.like('MIROC%'))\
.filter(Instance.variable.in_(['tas','pr']))
print(results.count())

for o in results[:2]:
    var = o.variable
    for v in o.versions:
        path=str(v.path)
        varmax=var_max(var,path)
        print()
        print('Maximum value for variable %s, version %s is %d' % (var, v.version, varmax))


8
/g/data/ua6/unofficial-ESG-replica/tmp/tree/dias-esg-nd.tkl.iis.u-tokyo.ac.jp/thredds/fileServer/esg_dataroot/outgoing/output1/MIROC/MIROC-ESM/rcp45/day/atmos/day/r1i1p1/v20120710/tas/*.nc
()
Maximum value for variable tas, version v20120710 is 330
/g/data/ua6/unofficial-ESG-replica/tmp/tree/albedo2.dkrz.de/thredds/fileServer/cmip5/output1/MIROC/MIROC5/rcp45/day/atmos/day/r1i1p1/v20120710/tas/*.nc
()
Maximum value for variable tas, version v20120710 is 325

NB if you pass directly v.path value you get an error because the databse return unicode string, so you need to use the str( ) function to convert to a normal string.

How to integrate ARCCSSive in your python script

In the previous example we simply looped through the results returned by the search as they were and passed them to a function that opened the files.
But what if we want to do something more complex?
Let's say that we want to pass two variables to a function and do it for every model/ensemble that has both of them for a fixed experiment and mip Mostly users would somehow loop over the drstree path, doing something like:

cd /g/data1/ua6/DRSv2/CMIP5
list all models and save in model_list for model in model_list: list all eavailable ensembles and save in ensemble_list for ensemble in ensemble_list: call_function(var1_path, var2_path)

Using ARCCSSIve we can do the same using the unique( ) function to return the list of all available models/ensembles.
Let's start from defining a simple function that calculates the difference bewteen the values of two variables.


In [27]:
def vars_difference(var1,path1,var2,path2):
    ''' calculate difference between the mean of two variables '''
    
    # open the files and read both variables
    nc1=MFDataset(path1+"/*.nc",'r')
    data1 = nc1.variables[var1][:]
    nc1.close()
    nc2=MFDataset(path2+"/*.nc",'r')
    data2 = nc2.variables[var2][:]
    nc2.close()
    
    # return the difference between the two means
    return np.mean(data2) - np.mean(data1)

Now let's do the another search and get tasmin and tasmax


In [28]:
results=db.outputs(ensemble='r1i1p1',experiment='rcp45',mip='Amon').filter(Instance.model.like('MIROC%'))\
.filter(Instance.variable.in_(['tasmin','tasmax']))
results.count()


Out[28]:
8

Get the list of distinct models and ensembles using unique


In [31]:
models=unique(results,'model')
ensembles=unique(results,'ensemble')

Now we loop over the models and the ensembles, for each model-ensemble combination we call the function if we have an instance for both variables.


In [32]:
for mod in models:
    for ens in ensembles:
        # we filter twice the reuslts, using the model and ensemble values plus one the variable at the time
        tasmin_inst=results.filter(Instance.model==mod, Instance.ensemble==ens, Instance.variable=='tasmin').first()
        tasmax_inst=results.filter(Instance.model==mod, Instance.ensemble==ens, Instance.variable=='tasmax').first()
        # we check that both filters returned something and call the function if they did
        if tasmax_inst and tasmin_inst:
            tasmin_path=tasmin_inst.latest()[0].path
            tasmax_path=tasmax_inst.latest()[0].path
            diff=vars_difference('tasmin',str(tasmin_path),'tasmax',str(tasmax_path))
            print('Difference for model %s and ensemble %s is %d' % (mod, ens, diff))


Difference for model MIROC-ESM and ensemble r1i1p1 is 3
Difference for model MIROC-ESM-CHEM and ensemble r1i1p1 is 3
Difference for model MIROC4h and ensemble r1i1p1 is 3
Difference for model MIROC5 and ensemble r1i1p1 is 3

NB we used first( ) after the filter because we know we should be getting back either 1 instance or None. We cannot use one( ) because that would return an error if it can't find anything.
Also we should have checked that we are using the same versions for both variables rather than just getting the latest!

This is just an attempt to replicate the way we use drstree but when you get more familiar with the module and with SQLalchemy you can set up more sophysticated searches.