Data preprocessing

Here we will download and subset NCEP reanalysis data, and read in files created from the DesInventar database. We then create a map showing the regions where disaster records are available, and how this relates to the resolution of the meteorological data.

Set up

Import needed packages


In [1]:
#--- Libraries
import pandas as pd                                  # data analysis routines
import numpy  as np                                  # numerical array routines
import matplotlib.pyplot as plt                      # plotting routines
import seaborn as sns                                # more plotting routines
import shapefile                                     # routines for using 'shapefiles'
import urllib                                        # routines for interacting with websites
import subprocess                                    # routines for calling external OS commands

from mpl_toolkits.basemap import Basemap             # plotting routines for map making
from matplotlib import gridspec                      # plotting routines for multiple plots
from netCDF4 import Dataset                          # routines for interacting with NetCDF files

from matplotlib import cm                            # more plotting routines
from matplotlib.collections import LineCollection    # more plotting routines

from cdo import *                                    # routines for interacting with NetCDF files
cdo = Cdo()                                          #                    via an external program

In [2]:
# place graphics in the notebook document
%matplotlib inline

Specify region

For this exercise, we use India as the example country.


In [3]:
#--- Identify country for example 
# label country
country = 'India' 
# define bounding box for region (m = minimum, M = maximum; degrees)
mlat = '0' ; Mlat = '40' ; mlon = '65' ; Mlon = '105'

Get data

Disaster records

A spreadsheet of available data was obtained from the DesInventar website and then exported to .csv format. Both versions are available in the data repository. When pulling data from the website there can sometimes be minor formatting issues, which we repair here. We also want to determine what span of years the database covers for our example country (India), so that we can save disk space by paring the reanalysis data down to the smallest possible file.


In [4]:
#--- Pull in data from DesInventar records
# Read file of reported heatwaves (original spreadsheet)
heatwave_data = pd.read_csv('../data/Heatwaves_database.csv')
# repair region name with space before name
heatwave_data.loc[(heatwave_data.Region==' Tamil Nadu'),'Region'] = 'Tamil Nadu'
# list out the dates for example country (India)
india_dates = heatwave_data['Date (YMD)'][heatwave_data['Country'].isin(['India'])]
# find year of earliest entry
min_year = np.min([int(x.split('/')[0]) for x in india_dates])
# find year of latest entry
max_year = np.max([int(x.split('/')[0]) for x in india_dates])

Reanalysis

We need to pull the reanalysis data from NCEP's online database. We pull the full global files at first, so that the data are available if we want to look at other regions of the world. This requires a lot of download time and storage space; the resulting minimally sized files are stored in the repository (the others are deleted or moved to save disk space), so don't run these code blocks unless you need to change something about the data being acquired or its final form (which means, yeah, you'll probably end up having to run the script).


In [ ]:
#---Download NetCDF files
# path to data directory for max/min daily temperatures
path_maxmin = 'ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface_gauss'
# path to data directory for 6hr temperature records
path_hourly = 'ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis/surface_gauss'
# loop through years
for yr in range(1948,2016) :
    # write max 2-meter temperature to new file
    path = path_maxmin+'/tmax.2m.gauss.'+str(yr)+'.nc'
    ofile = open('../data/t2m.max.daily.'+str(yr)+'.nc','wb')   # binary mode for NetCDF
    ofile.write(urllib.urlopen(path).read())
    ofile.close()
    # write min 2-meter temperature to new file
    path = path_maxmin+'/tmin.2m.gauss.'+str(yr)+'.nc'
    ofile = open('../data/t2m.min.daily.'+str(yr)+'.nc','wb')
    ofile.write(urllib.urlopen(path).read())
    ofile.close()
    # write 2-meter temperature at 6-hour intervals to new file
    path = path_hourly+'/air.2m.gauss.'+str(yr)+'.nc'
    ofile = open('../data/t2m.subdaily.'+str(yr)+'.nc','wb')
    ofile.write(urllib.urlopen(path).read())
    ofile.close()

In [ ]:
# merge data into single multiyear files
_ = cdo.mergetime(input='../data/t2m.max.daily.*.nc',output='../data/t2m.max.daily.nc')
_ = cdo.mergetime(input='../data/t2m.min.daily.*.nc',output='../data/t2m.min.daily.nc')
_ = cdo.mergetime(input='../data/t2m.subdaily.*.nc',output='../data/t2m.subdaily.nc')

Once we have the full data set, we can subdivide it to create individual files for different regions, reducing the run time when reading in data for a single region.


In [ ]:
#--- Create data files of region
# select region from min-temperature data
_ = cdo.sellonlatbox(','.join([mlon,Mlon,mlat,Mlat]),
                     input='../data/t2m.min.daily.nc',
                     output='../data/'+country+'.t2m.min.daily.nc')
# select region from max-temperature data
_ = cdo.sellonlatbox(','.join([mlon,Mlon,mlat,Mlat]),
                     input='../data/t2m.max.daily.nc',
                     output='../data/'+country+'.t2m.max.daily.nc')
# select region from hourly-temperature data
_ = cdo.sellonlatbox(','.join([mlon,Mlon,mlat,Mlat]),
                     input='../data/t2m.subdaily.nc',
                     output='../data/'+country+'.t2m.subdaily.nc')
# create a daily mean temperature file
_ = cdo.daymean(input='../data/'+country+'.t2m.subdaily.nc',
                output='../data/'+country+'.t2m.daily.nc')

In [5]:
#--- Trim time range of file to match disaster records
# list years in time range
years_in_record = ','.join([ str(x) for x in range(min_year,max_year+1) ])
# subset regional data
_ = cdo.selyear(years_in_record,
                input='../data/'+country+'.t2m.min.daily.nc',
                output='../data/'+country+'.t2m.min.daily.subset.nc')
_ = cdo.selyear(years_in_record,
                input='../data/'+country+'.t2m.max.daily.nc',
                output='../data/'+country+'.t2m.max.daily.subset.nc')
# _ = cdo.selyear(years_in_record,
#                 input='../data/'+country+'.t2m.subdaily.nc',
#                 output='../data/'+country+'.t2m.subdaily.subset.nc')
_ = cdo.selyear(years_in_record,
                input='../data/'+country+'.t2m.daily.nc',
                output='../data/'+country+'.t2m.daily.subset.nc')
# retain base period file (needed for one of the heat wave definitions)
years = ','.join([ str(x) for x in range(1960,1991)])
_ = cdo.selyear(years,
                input='../data/'+country+'.t2m.max.daily.nc',
                output='../data/'+country+'basefile.nc')

Region masks

The way we arranged the analysis (which, as you can see, is a bit of an ad hoc, duct-tape-style procedure) requires masking out the individual districts, or rather the closest approximation of them possible using the low-resolution, gridded reanalysis data.

The first step is creating a 'blanked' file of the region, where all the values are set to unity.


In [7]:
#--- Create blank file for region
# write grid information to file
ofile = open('../data/ncep_grid.asc','w')
ofile.write('\n'.join(cdo.griddes(input='../data/'+country+'.t2m.daily.nc')))
ofile.close()
# create data file with all values set to 1
_ = cdo.const('1','../data/ncep_grid.asc',
              output='../data/'+country+'.blank.ncepgrid.nc',
              options='-f nc')

The actual mask files are made with a different script, written in NCL. The code here modifies the generic script based on the region we're interested in at the moment.

For some countries, e.g., Chile, the region labels in the shapefiles and the region labels in the heatwave database are not rendered the same (typically this has to do with how accented letters are notated), so some tweaking has to be done; a more general, accent-insensitive matching approach is sketched after the next cell.


In [8]:
#--- Identify regions of interest 
# make list of unique region names for country
regions = list( set(heatwave_data.Region.where(heatwave_data.Country==country)) )
# remove NaNs (from regions that aren't in the selected country)
regions = [x for x in regions if str(x) != 'nan']
regions = [x.title() for x in regions]
if ( country == 'Chile') :
    regions_shapefile = [u'Antofagasta',u'Araucan\xeda',
                         u'Ais\xe9n del General Carlos Ib\xe1\xf1ez del Campo',
                         u'Regi\xf3n Metropolitana de Santiago',
                         u'Magallanes y Ant\xe1rtica Chilena',
                         u"Libertador General Bernardo O'Higgins"]


else :
    regions_shapefile = regions
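
As an aside, much of this hand-tweaking could be avoided with an accent-insensitive comparison. The sketch below is not part of the original workflow, and strip_accents and match_region are hypothetical helpers, but it shows the idea: normalize both labels via unicodedata before comparing, with a substring test to catch the long official names.

#--- Sketch: accent-insensitive matching of region labels (hypothetical helpers)
import unicodedata

def strip_accents(name) :
    # decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', name)
    stripped = u''.join([c for c in decomposed if not unicodedata.combining(c)])
    # return a lowercase, trimmed form for comparison
    return stripped.strip().lower()

def match_region(db_name, shapefile_names) :
    # return the shapefile label whose normalized form matches db_name
    target = strip_accents(db_name)
    for candidate in shapefile_names :
        # the substring test catches long official names,
        #    e.g. 'Magallanes' within 'Magallanes y Antartica Chilena'
        if target == strip_accents(candidate) or target in strip_accents(candidate) :
            return candidate
    # no match found; fall back to manual tweaking
    return None

Even so, genuinely different spellings (e.g., 'Aysen' in a database vs. 'Aisén' in a shapefile) would still need a manual fallback like the Chile list above.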

In [9]:
#--- Create masks
# loop through regions
for i in range(len(regions)) :
    # find the name of the region
    reg = regions[i].title()
    # find the name of the region as defined by the shapefile
    reg_shapefile = regions_shapefile[i]
    # remove leading/trailing whitespace
    reg = reg.strip()
    # report what's happening
    print("Creating masking script for "+reg+", aka "+reg_shapefile)
    # create NCL script from default file with name of region
    with open('maskregions_'+"".join(country.split(" "))+'.ncl', 'r') as input_file, open('crMaskFile.ncl', 'w') as output_file:
        # check lines for dummy line
        for line in input_file :
            if line.strip() == 'region = "STATE/PROVINCE"' :
                # overwrite with region name
                output_file.write('   region = "'+reg_shapefile.encode('utf-8')+'"\n')
            else :
                output_file.write(line)
    # run NCL routine
    print("Running masking script")
    subprocess.call(['/bin/bash','-c','ncl crMaskFile.ncl'])
    # create a file that masks the region
    print("Renaming mask and copying to data folder.")
    subprocess.call(['cp','mask.nc',"../data/"+"_".join(reg.split())+'.mask.nc'])


Creating masking script for Orissa, aka Orissa
Running masking script
Renaming mask and copying to data folder.
Creating masking script for Uttar Pradesh, aka Uttar Pradesh
Running masking script
Renaming mask and copying to data folder.
Creating masking script for Tamil Nadu, aka Tamil Nadu
Running masking script
Renaming mask and copying to data folder.

In [10]:
#--- Create single mask file showing all considered regions
# combine all the individual mask files
_ = cdo.add(input='../data/Orissa.mask.nc ../data/Uttar_Pradesh.mask.nc',
            output='../data/tmp.nc')
_ = cdo.add(input='../data/tmp.nc ../data/Tamil_Nadu.mask.nc',
            output='../data/India.masks.nc')

Drawing a map

We want to create a graphic to show that reports only exist for certain regions, and how the grid spacing of the meteorological fields imperfectly matches the actual region boundaries. We have currently set things so that a grid cell is considered informative about a political region as long as some part of the region boundary is within 50 km of the grid point (cell center). We played around with a few values before settling on this. The distance is pretty conservative, in that it tends towards considering information from outside the region rather than excluding information from within, but it still keeps the evaluation more "fair" by not evaluating against grid cells which contain only a minimal amount of the geographical region. Considering that most political boundaries are linked to geographical features/divides, if only a small fraction of the region extends into another grid cell, we would expect its weather to be more strongly correlated with that shown by cells over the rest of the region than with that of this other area. An example of this can be seen for Uttar Pradesh (India), where a sliver of the region overlaps with a grid cell that is mostly representative of the Himalayas, so that cell is not considered when calculating the warm spell durations.
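
The 50 km test itself is applied inside the NCL masking script, but the idea is simple enough to sketch in Python. The block below is illustrative only, not the script actually used: grid_lats, grid_lons, and boundary_pts are assumed inputs, the boundary is approximated by its vertices, and the point-in-polygon step that would also keep cells deep in the interior of a large region is omitted.

#--- Sketch: flag grid points within 50 km of a region boundary (illustrative only)
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2) :
    # great-circle distance (km) between one point and arrays of points
    R = 6371.0                                       # mean Earth radius, km
    p1 , p2 = np.radians(lat1) , np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlam = np.radians(lon2 - lon1)
    a = np.sin(dphi/2)**2 + np.cos(p1)*np.cos(p2)*np.sin(dlam/2)**2
    return 2*R*np.arcsin(np.sqrt(a))

def near_boundary_mask(grid_lats, grid_lons, boundary_pts, radius_km=50.0) :
    # 1 where a grid point lies within radius_km of any boundary vertex, else 0
    blons = np.array([p[0] for p in boundary_pts])   # shapefile points are (lon,lat)
    blats = np.array([p[1] for p in boundary_pts])
    mask = np.zeros((len(grid_lats),len(grid_lons)))
    for i , glat in enumerate(grid_lats) :
        for j , glon in enumerate(grid_lons) :
            if haversine_km(glat, glon, blats, blons).min() <= radius_km :
                mask[i,j] = 1
    return mask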

Looking at the individual administrative regions requires working with shapefiles. These are obtained from the Database of Global Administrative Areas (GADM).


In [11]:
#--- Map regions of India used in this example
# read which regions are included in disaster database
regions  = list(set(heatwave_data.loc[(heatwave_data.Country=='India'),'Region']))
# Create a map object
chart = Basemap(projection='lcc',resolution='c',
                lat_0=20,lon_0=85,
                llcrnrlat=5,urcrnrlat=35,
                llcrnrlon=70,urcrnrlon=100)
# add geographic features
chart.shadedrelief()
# draw parallels and meridians.
chart.drawparallels(np.arange(-90.,91.,10.),labels=[False,True,True,False])
chart.drawmeridians(np.arange(-180.,181.,10.),labels=[True,False,False,True])
# add country outline 
chart.readshapefile('../data/IND_adm0', 'IND0',drawbounds=True) ;
# add region outlines, for regions in data set
chart.readshapefile('../data/IND_adm1', 'IND1',drawbounds=False) ;
for info, shape in zip(chart.IND1_info, chart.IND1):
    if info['NAME_1'] in regions :
        x, y = zip(*shape) 
        chart.plot(x, y, marker=None,color=sns.xkcd_rgb['dusty orange'])
        
# load file of combined regional masks
ncfile = Dataset('../data/India.masks.nc')
# read mask data
rmask = ncfile.variables['region_mask'][:]
# get coordinates of data
lons = ncfile.variables['lon'][:]
lats = ncfile.variables['lat'][:]
# shift so that lines show grid box boundaries, 
#    rather than grid point locations
lons = lons - (1.875/2)
lats = lats + (1.9047/2)
# if in western hemisphere, need to label as 
#    "all the way round", rather than +/- 
# lons = lons - 360
# set coordinates list as grid of locations
lons, lats = np.meshgrid(lons,lats)
# overlay region masks 
chart.pcolormesh(lons,lats,rmask,shading='flat',latlon=True, alpha=0.2) ;
# save image
plt.savefig('../figures/india.png')
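
One note on the half-cell shift above: the 1.875 and 1.9047 offsets are (approximately) the longitude and latitude spacings of the NCEP T62 Gaussian grid, hardcoded for this data set. If adapting the notebook to another grid, the offsets could instead be derived from the coordinate arrays, as in this sketch (it reuses ncfile and np from the cell above; edge_lons and edge_lats are new names, and the mean latitude spacing is only an approximation for a Gaussian grid):

#--- Sketch: derive half-cell offsets from the coordinates, rather than hardcoding
# reread the unshifted coordinates
lons_raw = ncfile.variables['lon'][:]
lats_raw = ncfile.variables['lat'][:]
# longitudes of a Gaussian grid are evenly spaced;
#    latitudes are only approximately so, hence the mean spacing
dlon = np.abs(np.diff(lons_raw)).mean()
dlat = np.abs(np.diff(lats_raw)).mean()
edge_lons = lons_raw - dlon/2.0
edge_lats = lats_raw + dlat/2.0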