Created: Monday 30 January 2017

Quantifying the uncertainity of a global fire limitation model using Bayesian inference

Part 1: Staging data for analysis

^1,*Douglas Kelley, ²Ioannis Bistinas, ^{3, 4}Chantelle Burton, ¹Tobias Marthews, ⁵Rhys Whitley

¹Centre for Ecology and Hydrology, Maclean Building, Crowmarsh Gifford, Wallingford, Oxfordshire, United Kingdom
²Vrije Universiteit Amsterdam, Faculty of Earth and Life Sciences, Amsterdam, Netherlands
³Met Office United Kingdom, Exeter, United Kingdom
⁴Geography, University of Exeter, Exeter, United Kingdom
⁵Natural Perils Pricing, Commercial & Consumer Portfolio & Pricing, Suncorp Group, Sydney, Australia

Summary

This notebook aims to process the separate netCDF4 files for the model drivers (X_{i=1, 2, ... M}) and model target (Y) into a unified tabular data frame, exported as a compressed comma separated value (CSV) file. This file is subsequently used in the Bayesian inference study that forms the second notebook in this experiment. The advantage of the pre-processing the data separately to the analysis allows for it be quickly staged on demand. Of course other file formats may be more advantageous for greater compression (e.g. SQLite3 database file).

You will need to run this notebook to prepare the dataest before you attempt the Bayesian analysis in Part 2.

Python code and calculations below

Load libraries



In [1]:

    
# data munging and analytical libraries 
import re
import os
import numpy as np
import pandas as pd
from netCDF4 import Dataset 

# graphical libraries
import matplotlib.pyplot as plt
%matplotlib inline

# set paths
outPath = "../data/globfire.csv"

Import and clean data

Set the directory path and look for all netcdf files that correspond to the model drivers and target.



In [2]:

    
driver_paths = [os.path.join(dp, f) for (dp, _, fn) in os.walk("../data/raw/") for f in fn if f.endswith('.nc')]
driver_names = [re.search('^[a-zA-Z_]*', os.path.basename(fp)).group(0) for fp in driver_paths]

file_table = pd.DataFrame({'filepath': driver_paths, 'file_name': driver_names})
file_table









    Out[2]:






  
    
      
      file_name
      filepath
    
  
  
    
      0
      alpha
      ../data/raw/alpha2000-2014.nc
    
    
      1
      cropland
      ../data/raw/cropland2000-2014.nc
    
    
      2
      fire
      ../data/raw/fire2000-2014.nc
    
    
      3
      lightning_ignitions
      ../data/raw/lightning_ignitions2000-2014.nc
    
    
      4
      NPP
      ../data/raw/NPP2000-2014.nc
    
    
      5
      pasture
      ../data/raw/pasture2000-2014.nc
    
    
      6
      population_density
      ../data/raw/population_density2000-2014.nc
    
    
      7
      urban_area
      ../data/raw/urban_area2000-2014.nc

Define a function to extract the variable values from each netCDF4 file. Variables are flattened from a 3 dimensional array to 1 dimensional version, pooling all values both spatially and temporily.

Don't know if this is the correct way to do this, but will come back to it once I understand the model (and its optimisation) better.



In [3]:

    
def nc_extract(fpath):
    print("Processing: {0}".format(fpath))
    with Dataset(fpath, 'r') as nc_file:
        gdata = np.array(nc_file.variables['variable'][:,:,:])
        gflat = gdata.flatten()
        if type(gdata) == np.ma.core.MaskedArray:
            return gflat[~gflat.mask].data
        else:
            return gflat.data

Execute the above function on all netCDF4 file paths.



In [4]:

    
values = [nc_extract(dp) for dp in driver_paths]









    



Processing: ../data/raw/alpha2000-2014.nc
Processing: ../data/raw/cropland2000-2014.nc
Processing: ../data/raw/fire2000-2014.nc
Processing: ../data/raw/lightning_ignitions2000-2014.nc
Processing: ../data/raw/NPP2000-2014.nc
Processing: ../data/raw/pasture2000-2014.nc
Processing: ../data/raw/population_density2000-2014.nc
Processing: ../data/raw/urban_area2000-2014.nc

Turn this into a dataframe for the analysis.



In [5]:

    
# turn list into a dataframe
fire_df = pd.DataFrame(np.array(values).T, columns=driver_names)

# replace null flags with pandas null
fire_df.replace(-3.4e38, np.nan, inplace=True)

# drop all null rows (are ocean and not needed in optim)
fire_df.dropna(inplace=True)

Check that we've built it correctly.



In [6]:

    
fire_df.head()









    Out[6]:






  
    
      
      alpha
      cropland
      fire
      lightning_ignitions
      NPP
      pasture
      population_density
      urban_area
    
  
  
    
      14570
      0.469404
      0.0
      0.0
      0.0
      121.678574
      0.0
      0.0
      0.0
    
    
      14571
      0.549735
      0.0
      0.0
      0.0
      112.512817
      0.0
      0.0
      0.0
    
    
      14572
      0.701645
      0.0
      0.0
      0.0
      109.520493
      0.0
      0.0
      0.0
    
    
      14573
      1.243935
      0.0
      0.0
      0.0
      149.522217
      0.0
      0.0
      0.0
    
    
      14574
      1.243933
      0.0
      0.0
      0.0
      51.428570
      0.0
      0.0
      0.0

Export this to disk to be used by the analysis notebook - used gzip compression to save on space. Beware, because of there are approximation 10 million rows of data, this may take some time.



In [7]:

    
savepath = os.path.expanduser(outPath)
fire_df.to_csv(savepath, index=False)

Part 2: click here

	file_name	filepath
0	alpha	../data/raw/alpha2000-2014.nc
1	cropland	../data/raw/cropland2000-2014.nc
2	fire	../data/raw/fire2000-2014.nc
3	lightning_ignitions	../data/raw/lightning_ignitions2000-2014.nc
4	NPP	../data/raw/NPP2000-2014.nc
5	pasture	../data/raw/pasture2000-2014.nc
6	population_density	../data/raw/population_density2000-2014.nc
7	urban_area	../data/raw/urban_area2000-2014.nc

	alpha	NPP
14570	0.469404	121.678574
14571	0.549735	112.512817
14572	0.701645	109.520493
14573	1.243935	149.522217
14574	1.243933	51.428570