Extract tracer profiles and compute total columns above a given station from GEOS-Chem outputs : Example

This is a 'Run' version. For more details about what is computed and how it is implemented, see the development notebook.

USAGE: A good practice to ensure reproducibility is:

  1. First, copy this notebook as a new notebook (one notebook per run).
  2. Rename it by changing the short suffix so that it clearly identifies the purpose of this run. Also change the suffix in the title.
  3. Fill the comments section below with a longer description.
  4. Run cells in the Setup section for pre-processing, creating run directories, writing inputs, etc...
  5. Run cells in the Run section to set the command and start the process.
  6. The Monitor section is for checking/controlling the process while it is running. Check the process status as many times as you want.
  7. Run cells in the Output section to clean temp files, inspect the outputs, etc... when the process has finished.
  8. Write some comments about the outcomes of this run in the Comments section below.
  9. Save and close the notebook.

WARNING 1: Don't run all cells at once. Some cells may be executed several times and others are optional. It is also better to run the notebook cell by cell to verify inputs and outputs carefully.

WARNING 2: Don't shut down the kernel associated with this notebook (or 'close and halt' or restart the kernel) until the process has finished. The process runs in the background and will normally not be affected by a kernel shutdown, but it will no longer be possible to get information from the process.

NOTE: The process can take advantage of multiple CPUs by automatically splitting into several parallel jobs (it uses the IPython parallel system). To activate this, set a value > 1 for the ncpu argument (passed as --nengines) of the command line.

Comments

Description

Write here a more detailed description of the purpose of this run, the inputs used, etc...

Outcomes

Write here your comments about the process outcomes (what the outputs look like, etc...).

1. Setup

  • Get some information from this notebook and store it as variables in the kernel (needed for the next step)

In [7]:
%%javascript

var notebook_name = document.getElementById('notebook_name').innerHTML;
var notebook_root_dir = document.body.getAttribute('data-project');
var notebook_rel_path = document.body.getAttribute('data-notebook-path');

var kernel = IPython.notebook.kernel;

kernel.execute("notebook_name = '" + notebook_name + "'");
kernel.execute("notebook_root_dir = '" + notebook_root_dir + "'");
kernel.execute("notebook_rel_path = '" + notebook_rel_path + "'");


  • Set up the working directories.
    • Locate the directory of this notebook
    • Create a new directory for this run, named after this notebook and located in $HOME/IPYRuns (plus the same relative path as from $HOME/IPYNotebooks to this notebook).
    • 'cd' into this run directory. The process will be run from within this directory. Unless the absolute path is specified explicitly, any file (input, output, log file) will be saved into this directory.

In [9]:
import os
 
notebook_rel_path = os.path.dirname(notebook_rel_path)
notebook_rel_path = notebook_rel_path.replace('IPYNotebooks/', '')
this_notebook_dir = os.path.join(os.path.expanduser('~'), 'IPYNotebooks',
                                 notebook_rel_path)
notebook_runs_dir = os.path.join(os.path.expanduser('~'), 'IPYRuns')
this_run_dir = os.path.join(notebook_runs_dir, notebook_rel_path, notebook_name)

if not os.path.exists(this_run_dir):
    os.makedirs(this_run_dir)
os.chdir(this_run_dir)

print("run directory is: " + this_run_dir)


run directory is: /home/bovy/IPYRuns/Extract_GCprof_station_run-example
  • Edit/Create input file: all the content of the cell below will be written to the file specified at the 1st line

In [ ]:
%%writefile input.yaml

# This is an input file for the script 'extract_gcprof_station.py'
# Format of the file is YAML


# ------------------
# IN-OUT FILES
# ------------------

# Input main directory
#   Should contain GEOS-Chem output datafields
in_dir: ../../../IPYNotebooks/nb_geoschem/data/ts_example

# GEOS-Chem output file(s) (netCDF and/or bpch)
#   May be either 
#   (1) the name of a single file present in `in_dir` 
#   (2) an absolute path to a single file
#   (3) a file-matching pattern using the wildcard character.
#   (4) a list of any combination of (1), (2) and (3)
#
#   Mixing CTM outputs and ND49 outputs (time series) may work 
#   (though not tested yet), but datafields must not overlap in time.
#   All datafields contained in the files must use the same horizontal
#   grid (or a subset of this grid)!
in_files: 'ts.joch.200401*'

# Path to save output files where extracted data will be written
#   If '~' is given, output files will be saved in the directory
#   from where the script is run
out_dir: ~      

# Basename of the output files for profiles
#   Should not include the file extension
#   Any wildcard "*" will be replaced by the `station_name` parameter
out_profiles_basename: '*_profiles_200401'   

# Basename of output file for columns
out_columns_basename: '*_columns_200401'     

# Format of output files
#   One of the following: "csv", "hdf5", "xls", "xlsx"
#   In addition, netCDF files will be created (iris cubes).
out_format: xlsx                 


# ------------------
# DATAFIELDS TO LOAD
# ------------------

# List of tracers/diagnostics for which profiles and columns
# will be extracted/computed
tracers: [PAN, CO, ACET, C3H8, CH2O, C2H6, NH3]

# List of diagnostic categories to load
#   Should be "IJ-AVG-$" for tracers
categories: [IJ-AVG-$]               

# Additional fields names to load (format: 'diagnostic_category')
#   Must at least include datafields required for columns calculation,
#   i.e., 'PSURF_PEDGE-$', 'BXHEIGHT_BXHGHT-$',
#   'AIRDEN_TIME-SER' or 'N(AIR)_BXHGHT-$', 'TMPU_DAO-3D-$'
other_fields: [PSURF_PEDGE-$, BXHEIGHT_BXHGHT-$, AIRDEN_TIME-SER, N(AIR)_BXHGHT-$, TMPU_DAO-3D-$]


# ------------------
# STATION PARAMETERS
# ------------------

# Name of the station
station_name: JungfrauJoch      

# Latitude of the station [degrees_north]
station_lat: 46.54806              

# Longitude of the station [degrees_east]
station_lon: 7.98389               

# Elevation a.s.l. at the station [meters]
station_altitude: 3580.            

# Path to the file (CF-netCDF) that contains the altitude values
# of the vertical grid on which data will be regridded.
station_vertical_grid_file: /home/bovy/Grids/NDACC_vertical_Jungfraujoch_39L_2x2.5.nc


# ------------------
# GRID INFO
# ------------------

# Grid model name
#   All GEOS-Chem outputs that will be loaded must use this grid.
#   See :prop:`pygchem.grid.CTMGrid.models`
#   for a list of available models
grid_model_name: GEOS57_47L          

# Grid horizontal resolution (lon, lat) [degrees]
#   All GEOS-Chem outputs must use this resolution
grid_model_resolution: [2.5, 2]          

# Grid indices (min, max) of the 3D region box of interest
#   i: longitude, j: latitude, l: vertical levels
#   Must match the extent that was defined for any ND49
#   diagnostic output specified in `in_files`.
#   Must encompass the position of the station (see above).
#   Used either to define the coordinates of ND49 outputs or to
#   extract a subset from the global CTM datafields.
iminmax: [76, 77]
jminmax: [69, 70]
lminmax: [1, 47]


# ------------------
# TOPOGRAPHY
# ------------------

# Path to the file of global topography needed for resampling
# the tracer profiles on a vertical grid with fixed altitude values.
#   The global topography grid must be compatible with the
#   GEOS-Chem grid used by the output GEOS-Chem files.
global_topography_datafile: /home/bovy/Grids/dem_GEOS57_2x2.5_awm.nc
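
The `iminmax` and `jminmax` values in the input file must enclose the grid box that contains the station. As a rough check, to be run in a separate notebook cell (not as part of input.yaml), the sketch below computes the 1-based indices of that box. It assumes the usual GEOS-Chem 2° x 2.5° global layout (144 x 91 boxes, longitude centres starting at -180°, half-size latitude boxes at the poles). The function name `station_grid_index` is hypothetical; it is not part of pygchem or of the extraction script.

```python
import math


def station_grid_index(lon, lat, dlon=2.5, dlat=2.0):
    """Return 1-based (i, j) indices of the grid box containing (lon, lat).

    Assumption: global grid with box centres at -180 + dlon * (i - 1) and
    -90 + dlat * (j - 1), i.e. half-size boxes at the poles.
    """
    nlon = int(round(360.0 / dlon))
    # The western edge of box i=1 lies half a box west of -180;
    # the modulo wraps longitudes just west of the date line back to i=1.
    i = int(math.floor((lon + 180.0 + dlon / 2.0) / dlon)) % nlon + 1
    # The southern edge of box j=2 lies at -90 + dlat / 2.
    j = int(math.floor((lat + 90.0 + dlat / 2.0) / dlat)) + 1
    return i, j


# Station coordinates and region box copied from the input file above
i, j = station_grid_index(7.98389, 46.54806)
iminmax, jminmax = [76, 77], [69, 70]
inside = iminmax[0] <= i <= iminmax[1] and jminmax[0] <= j <= jminmax[1]
print("station box (i, j):", (i, j), "inside region:", inside)
```

If `inside` is False, either the region box or the assumed grid convention does not match the files being loaded.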

2. Run

  • Set the command line to be executed

In [ ]:
import sys

cmd = "{executable} {script} input.yaml --loglevel={loglevel} --nengines={ncpu}"

cmd = cmd.format(
    # path to executable (same Python interpreter as the one used to run the notebook server)
    executable=sys.executable,
    # path to the script
    script=os.path.join(this_notebook_dir, 'run_scripts', 'extract_gcprof_station.py'),
    # number of CPU to use
    ncpu=4,
    # loglevel ('CRITICAL', 'ERROR', 'WARNING', 'INFO' or 'DEBUG')
    loglevel='INFO',
)

print("Command to execute: " + cmd)
  • Execute the command in a new process in the background (only if no process is already running)

In [ ]:
import subprocess
import os
import sys
import shlex

# prevent running a new process if a process is already running.
try:
    if process.poll() is None:
        raise RuntimeError('A process is already running')
except NameError:
    pass

# split the command into a sequence (stored in a separate variable so that
# re-running this cell does not try to split an already-split command)
cmd_args = shlex.split(cmd)
# if shell is True, pass the command string `cmd` instead of the sequence `cmd_args`
# http://stackoverflow.com/questions/16840427/python-on-linux-subprocess-popen-works-weird-with-shell-true

with open('process.log', 'w') as log:
    process = subprocess.Popen(cmd_args, shell=False, stdout=log, stderr=log)

print("New process started. PID: {}".format(process.pid))

3. Monitor

  • Check the status of the process

In [ ]:
import sys

try:
    if process.poll() is None:
        print("process is running")
        status = 'running'
    
    elif process.poll() == 0:
        print("process has terminated successfully")
        status = 'success'
        
    else:
        sys.stderr.write("process has terminated with errors\n")
        status = 'error'
        
except NameError:
    print("no process is running! "
          "(or connection with the process lost due "
          "to a kernel issue or kernel shutdown/restart)")
    status = None
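
Instead of re-running the status cell by hand, the check can be put in a loop, provided that blocking the kernel for the duration is acceptable. The following is only a sketch: the helper `wait_for_process` and its parameters are hypothetical, and a short-lived stand-in process is used here in place of the `process` object created in the Run section.

```python
import subprocess
import sys
import time


def wait_for_process(proc, poll_interval=5.0, timeout=None):
    """Poll `proc` until it exits; return its exit code, or None on timeout."""
    start = time.time()
    while proc.poll() is None:
        if timeout is not None and time.time() - start > timeout:
            return None  # still running when the timeout expired
        time.sleep(poll_interval)
    return proc.returncode


# Stand-in for the background process started in the Run section
demo = subprocess.Popen([sys.executable, "-c", "pass"])
exit_code = wait_for_process(demo, poll_interval=0.1, timeout=30)
print("exit code:", exit_code)
```

In the notebook itself, the call would be `wait_for_process(process)`; an exit code of 0 corresponds to the 'success' status above.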
  • Display the log file

In [ ]:
if status is not None:
    %cat process.log
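
For long runs the log file can become large, and printing only its last lines may be more convenient than `%cat`. A minimal sketch follows, using only the standard library; the helper name `tail` is hypothetical, and a throwaway file stands in for process.log.

```python
import os
import tempfile
from collections import deque


def tail(path, n=20):
    """Return the last `n` lines of the text file at `path`."""
    with open(path) as f:
        # a bounded deque keeps only the most recent `n` lines in memory
        return list(deque(f, maxlen=n))


# Demonstration on a throwaway file; in the run directory, use tail('process.log')
demo_path = os.path.join(tempfile.mkdtemp(), "process.log")
with open(demo_path, "w") as f:
    f.write("".join("line {}\n".format(k) for k in range(100)))

print("".join(tail(demo_path, n=3)), end="")
```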
  • If using the IPython parallel cluster, display outputs (stdout, stderr) of all engines as they are printed out (debug)

    • open a terminal session in the server
    • activate the virtual environment
    • while the process is running, run the script iopubwatcher.py in the run_scripts directory

      $ python iopubwatcher.py

  • (WARNING) Use the cell below to terminate the process if needed

In [ ]:
import signal

if status == 'running':
    process.send_signal(signal.SIGINT)
  • (WARNING) Use the cell below to stop the IPython cluster if needed

In [ ]:
import os
import getpass
import sys

user = getpass.getuser()
ipy_profile = 'nb_{}'.format(user)
ipcluster_exe = os.path.join(
    os.path.dirname(sys.executable),
    'ipcluster'
)

os.system('{} stop --profile={}'.format(ipcluster_exe, ipy_profile))

4. Clean / Inspect outputs

  • Check if output files were created by listing them (with size, last modified date...)

In [ ]:
!ls -all -h

In [ ]:
import glob
from IPython.display import display, FileLink

for fout in glob.glob('*'):
    if os.path.isdir(fout):
        continue
    fout_link = FileLink(fout)
    display(fout_link)

In [ ]: