In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import cartopy.crs as ccrs
import mpld3
import seaborn as sn
import numpy as np
import os
from netCDF4 import Dataset
from mpl_toolkits.basemap import Basemap
sn.set_context('notebook')

ICP Waters climate data processing

Important note

This notebook should be run using 32-bit Python. The xarray package has problems with 64-bit Python 2.7 - see here for details. Working around this issue on 64-bit Python simply requires pre-processing the station co-ordinates so that they exactly match those in the netCDF files. For example, this code:

# Round stn lat/lon to nearest 0.5 of a degree
stn_df['grid_lat'] = ((stn_df['lat'] - 0.25)*2).round()/2. + 0.25
stn_df['grid_lon'] = ((stn_df['lon_shift'] - 0.25)*2).round()/2. + 0.25

rounds lat and lon values to the centre co-ordinates of the nearest 0.5 degree grid cell. This means the method='nearest' argument is no longer required in the call to sel_points (because the co-ordinates match exactly) and so the error does not occur. For example:

# Extract elevation data
dsloc = ds.sel_points(lon=stn_df['grid_lon'].values,
                      lat=stn_df['grid_lat'].values)

If it is necessary to use 64-bit Python this workaround is fine, but the method='nearest' argument is very convenient. It therefore seems easier to stick with 32-bit Python where possible (at least until the bug linked above is resolved).
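
As a quick aside, one easy way to check whether the current interpreter is 32-bit or 64-bit is to look at the pointer size reported by the standard struct module:

# Print the bitness of the current Python interpreter
import struct
print struct.calcsize('P') * 8   # prints 32 or 64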

Aims

Heleen would like historic climate data for the sites involved in the updated TOC trends analysis (see e-mail received 07/02/2017 at 13.33 for details).

I have downloaded 0.5 degree resolution global gridded data with a monthly time step from the Climatic Research Unit (CRU; dataset details here). I have also contacted Ian Harris, the main scientist behind these datasets, to ask whether a compatible elevation dataset is also available - see his reply received 16/02/2017 at 16.12. In addition, I have downloaded a separate 0.5 degree resolution elevation dataset from here, which may also be useful. The aims of this notebook are as follows:

  • Extract latitude/longitude co-ordinates and elevations from RESA2 for all the stations involved in the updated trends analysis. The list of sites to use is here:

    C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\Results\trends_sites_oct_2016.xlsx

  • Extract the grid cell elevations for each site from the elevation dataset.

  • Extract annual resolution time series for each site for the following quantities:

    • Annual average temperature and total precipitation

    • Summer (JJA) average temperature and total precipitation

    • Summer (JAS) average temperature and total precipitation

In addition, because much of this processing is similar to the work Leah needs to do for the DOMQUA project, I'm going to investigate the various options for climate data processing in a bit more detail. I suspect many of these methods will be useful later.

1. Site data

The readme sheet of trends_sites_oct_2016.xlsx lists the RESA2 projects agreed with Heleen for inclusion in the updated trends analysis. I have extracted site data (co-ordinates and elevations) for all the sites associated with these projects. There are 609 sites in total, including some in Sweden with no data. I'll ignore this for now and simply extract climate data for the whole lot.

Three of the Norwegian sites in the database have only local (UTM Zone 33) co-ordinates. I've corrected this by manually converting the co-ordinates here. In addition, a number of the sites do not have elevation information. My code will ignore these sites, but if we get elevations for them in the future we can simply re-run this analysis.

The spreadsheet of site data is here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\CRU_Climate_Data\toc_trends_sites_for_climate.xlsx

Update 23/03/2017: Heleen has now supplied the missing site elevations (see e-mail received 15/03/2017 at 08.37). An updated spreadsheet is now here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\CRU_Climate_Data\toc_trends_sites_for_climate_update_elev.xlsx


In [2]:
# Read station data
in_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
          r'\CRU_Climate_Data\toc_trends_sites_for_climate_update_elev.xlsx')
stn_df = pd.read_excel(in_xls, sheetname='DATA')

# Drop sites without elevations
stn_df.dropna(how='any', inplace=True)

print len(stn_df)

stn_df.head()


609
Out[2]:
stn_id stn_code stn_name lat lon elev_m
0 37284 X15:NS01DA0005 LOWER CORNING LAKE 44.0503 -66.0828 30
1 37320 X15:NS01EF0016 LITTLE WILES LAKE 44.4000 -64.6472 122
2 37300 X15:NS01ED0075 PESKOWESK LAKE 44.3167 -65.2828 105
3 37311 X15:NS01ED0091 LUXTON LAKE 44.3617 -65.3428 135
4 37301 X15:NS01ED0076 GRAFTON LAKE 44.3858 -65.1775 100

2. Elevation data

There are lots of options for processing the elevation and climate data. In the past, I've usually written my own code for this and used the Basemap library for plotting, but I have recently discovered xarray, which offers some more sophisticated options, especially for data stored in netCDF files. In this notebook I'd like to test xarray by comparing it to my old code.

2.1. Read gridded elevation data using Basemap

Start off by working with the 0.5 degree netCDF dataset from here.


In [3]:
# Open elevation netCDF  
nc_file = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\CRU_Climate_Data\elevation\elev.0.5-deg.nc')
nc_data = Dataset(nc_file, 'r')

# Read lat and long grids from netCDF
lats = nc_data.variables['lat'][:]
lons = nc_data.variables['lon'][:]

# Read the elevation data
elev = nc_data.variables['data'][:]

# Plot 
fig = plt.figure(figsize=(14, 7))

# Robinson projection
m = Basemap(projection='robin', lon_0=0, resolution='c')

# Create grid of lon-lat coordinates 
xx, yy = np.meshgrid(lons, lats)

# Plot elev grid
im = m.pcolormesh(xx, yy, elev[0,:,:], latlon=True, cmap=plt.cm.jet) 

# Annotate
m.drawcountries()
m.drawcoastlines(linewidth=.5)


Out[3]:
<matplotlib.collections.LineCollection at 0xdfbd278>

Next, I'd like to compare this to the dataset supplied by Ian Harris (which is what the climate data is based on). Note that this dataset is shifted by 180 degrees compared to the one above: its grid starts at the international dateline rather than at Greenwich, so I need to shift the longitudes by 180 degrees before plotting.


In [4]:
# Read the elev data from Ian
dat_file = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
            r'\CRU_Climate_Data\elevation\halfdeg.elv.grid.dat')
elev2 = np.loadtxt(dat_file, dtype=int)

# Flip data N-S, as numpy indexes arrays from the top down
elev2 = elev2[::-1, :] 

# Plot 
fig = plt.figure(figsize=(14, 7))

# Robinson projection
m = Basemap(projection='robin', lon_0=0, resolution='c')

# Create grid of lon-lat coordinates, shifting lons by -180 degrees
xx, yy = np.meshgrid(lons - 180, lats)

# Plot elev grid
# Use the same colour scheme as on map above for comparison
im = m.pcolormesh(xx, yy, elev2, latlon=True, cmap=plt.cm.jet,
                  vmin=elev[0].min(), vmax=elev[0].max()) 

# Annotate
m.drawcountries()
m.drawcoastlines(linewidth=.5)


Out[4]:
<matplotlib.collections.LineCollection at 0xe40d160>

Apart from the sea, these two maps look identical to me. The first map includes bathymetry, whereas the second only covers the land and all offshore areas are assigned the NoData value of -999, which is why all the oceans are coloured as though they were about 1 km deep. Antarctica has obviously been excluded too.
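
As an aside, if the "offshore" cells become a distraction, they could be masked before plotting. A minimal sketch (not used here), re-using the grid and map objects from the cell above:

# Mask the -999 NoData cells so the oceans are left blank
elev2_masked = np.ma.masked_equal(elev2, -999)
im = m.pcolormesh(xx, yy, elev2_masked, latlon=True, cmap=plt.cm.jet,
                  vmin=elev[0].min(), vmax=elev[0].max())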

2.2. Read gridded elevation data using xarray

xarray should make things much simpler, especially because it includes tools for extracting time series for co-ordinates of interest. xarray datasets are essentially in-memory netCDF files.


In [5]:
# Read elev data
nc_file = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\CRU_Climate_Data\elevation\elev.0.5-deg.nc')
ds = xr.open_dataset(nc_file)

ds


Out[5]:
<xarray.Dataset>
Dimensions:  (lat: 360, lon: 720, time: 1)
Coordinates:
  * lat      (lat) float64 89.75 89.25 88.75 88.25 87.75 87.25 86.75 86.25 ...
  * lon      (lon) float64 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
  * time     (time) datetime64[ns] 2001-01-01
Data variables:
    data     (time, lat, lon) float64 -4.29e+03 -4.29e+03 -4.291e+03 ...
Attributes:
    history: 
Elevations calculated from the TBASE 5-minute
latitude-longitude resolution elevation data set.
TBASE is available from the NCAR data archive at 
ftp://ncardata.ucar.edu/datasets/ds759.2/ 
Calculation and netCDF file written by Todd Mitchell in August 2000.
The file is written in COARDS-compliant netCDF:
http://ferret.wrc.noaa.gov/noaa_coop/coop_cdf_profile.html
The file was written at the JISAO at the University of Washington.
http://jisao.washington.edu

xarray automatically reads metadata from the netCDF file, so as long as the file conforms to netCDF standards it becomes very easy to work with. This file has two spatial co-ordinates (lat and lon) and one time co-ordinate (time). The time co-ordinate has just a single entry, which has been set arbitrarily to 01/01/2001 (because time isn't relevant to this static elevation dataset). The file also has one variable called data, which is indexed by the co-ordinates lat, lon and time.
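
For example, the elevation of a single grid cell can be pulled out by its co-ordinate labels (the lat/lon values below are arbitrary grid-cell centres, chosen purely for illustration):

# Elevation of the 0.5 degree cell centred at 60.25 N, 10.75 E
cell_elev = ds['data'].sel(lat=60.25, lon=10.75).values
print cell_elev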

Plotting is pretty straightforward:


In [6]:
# Plot using xarray
plt.figure(figsize=(14, 7))

# Define the desired output projection using Cartopy. Options are listed here:
# http://scitools.org.uk/cartopy/docs/latest/crs/projections.html#cartopy-projection-list
ax = plt.axes(projection=ccrs.Robinson())

# Plot the data
ds.data[0].plot.pcolormesh(ax=ax, 
                           transform=ccrs.PlateCarree(), # Define original data projection
                                                         # PlateCarree is just a Cartesian
                                                         # grid based on lat/lon values,
                                                         # which is what we have in the 
                                                         # original file   
                           x='lon', y='lat', 
                           add_colorbar=True)

# Add coastlines
ax.coastlines()

plt.tight_layout()


This looks good, although for mapping connoisseurs the Interrupted Goode Homolosine is perhaps the projection of choice...


In [7]:
# Plot using xarray
plt.figure(figsize=(14, 7))

# Define the desired output projection using Cartopy
ax = plt.axes(projection=ccrs.InterruptedGoodeHomolosine())

# Plot the data
ds.data[0].plot.pcolormesh(ax=ax, 
                           transform=ccrs.PlateCarree(),  
                           x='lon', y='lat', 
                           add_colorbar=True)

# Add coastlines
ax.coastlines()

plt.tight_layout()


2.3. Pixel elevations for ICPW sites

2.3.1. From the netCDF file using xarray

In the past I've written my own code for extracting time series from netCDF files (see e.g. here), but xarray should simplify this a lot. In principle, it's just a matter of providing lat and lon co-ordinates for the sites of interest, and xarray will return values from the nearest grid cell.

Note: There appears to be a bug with xarray and 64-bit Python on Windows, which prevents the method='nearest' argument from working properly (see here). It is therefore necessary to either (i) use 32-bit Python or (ii) manually alter the station co-ordinates in stn_df so they exactly match the grid cell centres (see the start of this notebook for code).

In addition, the netCDF metadata shows that the longitude values in the elevation file range from 0 to 360, whereas the values in my stations table range from -180 to 180. Values in the stations table that are less than zero therefore need 360 added to them to make the co-ordinates match.


In [8]:
# Shift negative lons to match lons in elev file
lon_shift = stn_df['lon'].values.copy()                 # Copy lon values
lon_shift[lon_shift<0] = lon_shift[lon_shift<0] + 360   # Shift
stn_df['lon_shift'] = lon_shift                         # New column     

# Extract elevation data
dsloc = ds.sel_points(lon=stn_df['lon_shift'].values,
                      lat=stn_df['lat'].values,
                      method='nearest',
                      dim=stn_df['stn_id'])  # Use stn ids to index points in result

# Convert to df
elev_df =  dsloc['data'].to_dataframe()

# Rename cols
elev_df.columns = ['lat2', 'lon2', 'px_elev_m']

# Reset index and tidy
elev_df.reset_index(inplace=True)
del elev_df['time']

elev_df.head()


Out[8]:
stn_id lat2 lon2 px_elev_m
0 37284 44.25 293.75 -1.0
1 37320 44.25 295.25 60.0
2 37300 44.25 294.75 101.0
3 37311 44.25 294.75 101.0
4 37301 44.25 294.75 101.0

In [9]:
# Join to stn_df
df = pd.merge(stn_df, elev_df, how='left', on='stn_id')

# Get cols of interest
df = df[['stn_id', 'stn_code', 'stn_name', 'lat', 'lon', 'lon_shift',
         'lat2', 'lon2', 'elev_m', 'px_elev_m']]

# Rename cols
df.columns = ['stn_id', 'stn_code', 'stn_name', 'lat', 'lon', 'lon_shift',
              'grid_lat', 'grid_lon', 'elev_m', 'px_elev_m']

# Save as CSV
out_csv = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\CRU_Climate_Data\cru_stn_elevs.csv')
df.to_csv(out_csv, index=False, encoding='utf-8')

df.head()


Out[9]:
stn_id stn_code stn_name lat lon lon_shift grid_lat grid_lon elev_m px_elev_m
0 37284 X15:NS01DA0005 LOWER CORNING LAKE 44.0503 -66.0828 293.9172 44.25 293.75 30 -1.0
1 37320 X15:NS01EF0016 LITTLE WILES LAKE 44.4000 -64.6472 295.3528 44.25 295.25 122 60.0
2 37300 X15:NS01ED0075 PESKOWESK LAKE 44.3167 -65.2828 294.7172 44.25 294.75 105 101.0
3 37311 X15:NS01ED0091 LUXTON LAKE 44.3617 -65.3428 294.6572 44.25 294.75 135 101.0
4 37301 X15:NS01ED0076 GRAFTON LAKE 44.3858 -65.1775 294.8225 44.25 294.75 100 101.0

We can now compare the measured station elevations to the gridded values in the 0.5 degree dataset. In the plot below, there is clearly a relationship, but it is pretty noisy. This isn't too surprising, as 0.5 degrees corresponds to roughly 50 km by 50 km at the equator, and there is obviously a lot of topographic variation at this scale. It also seems as though some of the more coastal stations are classified as being in the sea (with negative elevations) at the 0.5 degree scale. These values will need setting back to zero before performing any temperature corrections.


In [10]:
# Compare station and pixel elevations
plt.scatter(x='elev_m', y='px_elev_m', data=df)
plt.xlabel('Station elevation (m)')
plt.ylabel('Pixel elevation (m)')


Out[10]:
<matplotlib.text.Text at 0x1489f860>
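
As noted above, the negative "offshore" pixel elevations will need setting back to zero before any temperature corrections are applied. When the time comes, that should just be a one-liner along these lines (a sketch - not applied in this notebook):

# Clip negative ("offshore") pixel elevations to sea level
df['px_elev_m'] = df['px_elev_m'].clip(lower=0)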

2.3.2. From the CRU elevation dataset

The CRU elevation dataset is supplied in .dat format. The code below manipulates the data and extracts the elevations, but it's not as neat as using xarray.


In [11]:
def geo_idx(dd, dd_array):
    """ Get the array index of the grid cell closest to the specified
        lat/long co-ordinate. Adapted from here:
        http://stackoverflow.com/questions/33789379/
        netcdf-and-python-finding-the-closest-lon-lat-index-given-actual-lon-lat-values
    
    Args:
        dd       Lat or long value in decimal degrees
        dd_array Corresponding 1D array of lat or long values from the netCDF grid
    
    Returns:
        Integer index of the closest value in dd_array
    """
    geo_idx = (np.abs(dd_array - dd)).argmin()
    
    return geo_idx

In [12]:
# Get grid indices for each site
stn_df['lat_idx'] = stn_df['lat'].apply(geo_idx, args=(yy[:,0],))
stn_df['lon_idx'] = stn_df['lon'].apply(geo_idx, args=(xx[0],))

# Extract elevations based on indices
px_elev = []
for idx, row in stn_df.iterrows():
    px_elev.append(elev2[row['lat_idx'], row['lon_idx']])

# Add column
stn_df['px_elev_m'] = px_elev

# Plot
plt.scatter(x='elev_m', y='px_elev_m', data=stn_df)
plt.xlabel('Station elevation (m)')
plt.ylabel('Pixel elevation (m)')


Out[12]:
<matplotlib.text.Text at 0x14810080>

These results are identical to those above, except the CRU dataset does not include the "offshore" cells. Based on this, I'm pretty happy that xarray is performing as expected, so I'll use it below for the climate data processing.

3. Climate data

The climate data is here:

C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015\CRU_Climate_Data

and there is one netCDF file per variable per decade. The first step is therefore to merge the files from each decade into a single xarray dataset. With a bit of luck, this can be done without having to load everything into memory thanks to the out-of-core computation provided by Dask. See here and here for more details.

One extremely powerful feature of xarray is that it can also combine files for different variables. This means that simply placing all the netCDF files for both temperature and precipitation into a single folder results in a single xarray dataset.

3.1. Notes on performance versus memory with xarray and dask

When attempting to open multiple files, the xarray/dask default is to "chunk" each netCDF file into a single dask array. This is what happens when you run e.g.

ds = xr.open_mfdataset(pptn_path)

However, for merging/resampling operations on large files, processing the files as single chunks may result in memory errors. This is especially the case when working with 32-bit Python, because the theoretical maximum memory allocation for a 32-bit process on Windows is 2 GB (and in practice it's substantially less). If your computer has substantially more memory than this (my laptop has 32 GB of RAM), the easiest solution is simply to switch to 64-bit Python, where the process size is - for all practical purposes - unlimited. However, this prevents straightforward use of the sel_points method (because of the bug highlighted at the start of this notebook).

The solution is to use dask's chunking capabilities. When the files are opened, it is possible to "chunk" along any of the spatial or temporal dimensions using e.g.

ds = xr.open_mfdataset(pptn_path,
                       chunks={'lat': 10, 'lon': 10, 'time':10})

This will read each netCDF file in chunks of 10 grid cells by 10 grid cells by 10 time steps (i.e. 5 degrees by 5 degrees on this 0.5 degree grid), processing each one individually. If you're interested in temporal resampling, it's usually easiest to chunk based on space rather than time, e.g.

ds = xr.open_mfdataset(pptn_path,
                       chunks={'lat': 10, 'lon': 10})

because this way temporal aggregation takes place over the whole time dimension at once, but each spatial subset is processed as a separate tile. This dramatically reduces the memory burden, but there is a considerable reduction in performance, because each tile needs to be read from the disk and processed separately. It's therefore a good idea to use the largest chunk size you can get away with without causing memory errors.

Note that the order in which you do things is also important. In particular, if you're interested in selecting and resampling data for points (e.g. monitoring stations), it's usually better to extract the points first and then resample to the desired frequency.

The xarray/netCDF format actually handles point time series very well, because the lat/lon grids can be "sparse", which basically means you can have a bunch of irregular lat/lon co-ordinates with associated time series. For example, suppose you want to extract long-term monthly averages for a set of points using 32-bit Python. The following code attempts to calculate monthly gridded averages, but hits the memory limit for the process and therefore throws a memory error.

# Read pptn data files
ds = xr.open_mfdataset(pptn_path)

# Calculate gridded monthly average pptn over all years
month_ds = ds.groupby('time.month').mean(dim='time')

To work around this, you can either switch to 64-bit Python or use chunking but, as described above, the latter option has negative implications for performance. Alternatively, you can try extracting the point data first, which gives a much smaller "point netCDF" dataset.
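
For example, something along these lines (a sketch using the stn_df station table defined earlier) extracts the station points first, so the monthly averaging then only has to deal with a few hundred time series rather than the full global grid:

# Read pptn data files
ds = xr.open_mfdataset(pptn_path)

# Extract the station points first (gives a small "point netCDF" dataset)
ds_pts = ds.sel_points(lon=stn_df['lon'].values,
                       lat=stn_df['lat'].values,
                       method='nearest',
                       dim=stn_df['stn_id'])

# Calculate monthly average pptn over all years for the points only
month_pts = ds_pts.groupby('time.month').mean(dim='time')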

To do: In theory, dask should be capable of parallelising operations to take advantage of multiple cores. The link here suggests this should happen automatically, but in my initial tests it looks as though only one core is being used on my machine. It would be very useful to figure this out - come back to this in the future.


In [13]:
# Read all the netCDF files and combine into a single dataset
nc_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
             r'\CRU_Climate_Data\netcdf\*.nc')
ds = xr.open_mfdataset(nc_path)

print 'Total size of combined gridded dataset: %.5f GB.' % (ds.nbytes * (2 ** -30))

# Extract climate data for points
ds2 = ds.sel_points(lon=stn_df['lon'].values,
                    lat=stn_df['lat'].values,
                    method='nearest',
                    dim=stn_df['stn_id'])

print 'Total size of points dataset: %.5f GB.' % (ds2.nbytes * (2 ** -30))


Total size of combined gridded dataset: 1.62221 GB.
Total size of points dataset: 0.00383 GB.

3.2. Exploring the CRU climate data using xarray

To get a feel for the CRU data (and for xarray's plotting abilities), the code below generates some global maps of climate averages. Note that because I'm working with grids on 32-bit Python, I'm using chunking to split all the calculations into spatial tiles of 90 by 90 grid cells (i.e. 45 degrees by 45 degrees, giving 32 tiles for the global 0.5 degree grid).


In [14]:
# Read all the netCDF files and combine into a single dataset
ds = xr.open_mfdataset(nc_path,
                       chunks={'lat': 90, 'lon': 90})

ds


Out[14]:
<xarray.Dataset>
Dimensions:  (lat: 360, lon: 720, time: 420)
Coordinates:
  * lon      (lon) float64 -179.8 -179.2 -178.8 -178.2 -177.8 -177.2 -176.8 ...
  * lat      (lat) float64 -89.75 -89.25 -88.75 -88.25 -87.75 -87.25 -86.75 ...
  * time     (time) datetime64[ns] 1981-01-16 1981-02-15 1981-03-16 ...
Data variables:
    pre      (time, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...
    tmp      (time, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...

Note that properties for all the datasets are read and combined, as though we had a single precipitation and temperature dataset running from 1981 to 2015. There are eight netCDF files in total with a combined size of about 1.6 gigabytes, but because function evaluation via Dask is "lazy", nothing actually gets read from the disk (apart from the metadata) until I ask for some results. This should make it possible to work with datasets that are too big to fit into memory. For example, the code below calculates gridded annual precipitation totals for the entire dataset.


In [15]:
# Resample grids to annual resolution
ds2 = ds.resample('A', dim='time', how='sum')

ds2


Out[15]:
<xarray.Dataset>
Dimensions:  (lat: 360, lon: 720, time: 35)
Coordinates:
  * lon      (lon) float64 -179.8 -179.2 -178.8 -178.2 -177.8 -177.2 -176.8 ...
  * lat      (lat) float64 -89.75 -89.25 -88.75 -88.25 -87.75 -87.25 -86.75 ...
  * time     (time) datetime64[ns] 1981-12-31 1982-12-31 1983-12-31 ...
Data variables:
    pre      (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    tmp      (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

In [16]:
# Plot results for 2010
plt.figure(figsize=(14, 7))

# Robinson projection
ax = plt.axes(projection=ccrs.Robinson())

# Plot the data
ds2.sel(time='2010').pre[0].plot.pcolormesh(ax=ax, 
                                            transform=ccrs.PlateCarree(),  
                                            x='lon', y='lat', 
                                            add_colorbar=True)
plt.tight_layout()



In [17]:
# Plot results for 2010, but setting zeros to NaN
plt.figure(figsize=(14, 7))

# Robinson projection
ax = plt.axes(projection=ccrs.Robinson())

# Plot the data
data = ds2.sel(time='2010')
data = data.where(data>0)
data.pre[0].plot.pcolormesh(ax=ax, 
                            transform=ccrs.PlateCarree(),  
                            x='lon', y='lat', 
                            add_colorbar=True)
plt.tight_layout()


xarray generalises many of the most useful pandas operations into n-dimensions, and it also includes some very useful functionality for grid plotting/faceting with maps. This makes it wonderfully easy to calculate gridded data summaries, which will be very useful for DOMQUA.


In [18]:
# Calculate monthly average pptn over all years
month_ds = ds.groupby('time.month').mean(dim='time')

month_ds


C:\Data\Anaconda2\lib\site-packages\dask\array\numpy_compat.py:45: RuntimeWarning: invalid value encountered in divide
  x = np.divide(x1, x2, out)
Out[18]:
<xarray.Dataset>
Dimensions:  (lat: 360, lon: 720, month: 12)
Coordinates:
  * lon      (lon) float64 -179.8 -179.2 -178.8 -178.2 -177.8 -177.2 -176.8 ...
  * lat      (lat) float64 -89.75 -89.25 -88.75 -88.25 -87.75 -87.25 -86.75 ...
  * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
    pre      (month, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...
    tmp      (month, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...

In [19]:
# Plot monthly average pptn over all years
p = month_ds.pre.plot(transform=ccrs.PlateCarree(), 
                      col='month', 
                      col_wrap=2, 
                      cmap='jet',
                      add_colorbar=False, 
                      figsize=(15, 30),                      
                      subplot_kws={'projection':ccrs.Robinson()})

# Add coastlines
for ax in p.axes.flat:
    ax.coastlines()

plt.tight_layout()


3.3. Extract climate data for ICPW

Based on the discussion in section 3.1 above, it should be possible to extract the desired point data without resorting to chunking. However, because this notebook has already generated a lot of plots and has explored various datasets (which I haven't been very careful about closing), I'm already getting close to the 32-bit memory limit. I will therefore use chunking to keep everything in this notebook manageable.

Note from above that longitude values in the climate data files actually range from -180 to +180, which is the same as in my original station data. This means no co-ordinate shifts are required on this occasion.

Finally, note that xarray allows resampling to quarterly frequency starting from an arbitrary month. For example, freq='Q-FEB' corresponds to quarterly with the year ending in February, which is the classic definition of "seasonal" in the northern hemisphere (i.e. where summer=JJA). Choosing freq='Q-MAR' results in summer=JAS, as requested by Heleen. For example:


In [20]:
# Monthly data frame with column of ones
dt = pd.date_range(start='2000-01-01', end='2005-12-31', freq='M')
df = pd.DataFrame({'month_count':np.ones(len(dt))}, index=dt)

# Resample to quarterly
print df.resample('Q-FEB').sum().head(10)
print '\n****************\n'
print df.resample('Q-MAR').sum().head(10)


            month_count
2000-02-29          2.0
2000-05-31          3.0
2000-08-31          3.0
2000-11-30          3.0
2001-02-28          3.0
2001-05-31          3.0
2001-08-31          3.0
2001-11-30          3.0
2002-02-28          3.0
2002-05-31          3.0

****************

            month_count
2000-03-31          3.0
2000-06-30          3.0
2000-09-30          3.0
2000-12-31          3.0
2001-03-31          3.0
2001-06-30          3.0
2001-09-30          3.0
2001-12-31          3.0
2002-03-31          3.0
2002-06-30          3.0

The code below first reads all of the temperature and precipitation netCDF files into a single xarray dataset. For each variable and period of interest, it then resamples the grids to the desired frequency (totals for precipitation, averages for temperature), extracts the point data for the stations and writes the result to a new sheet in an Excel workbook.


In [21]:
# Read all the netCDF files and combine into a single dataset
nc_path = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
             r'\CRU_Climate_Data\netcdf\*.nc')
ds = xr.open_mfdataset(nc_path,
                       chunks={'lat': 90, 'lon': 90})

# Excel file for output
out_xls = (r'C:\Data\James_Work\Staff\Heleen_d_W\ICP_Waters\TOC_Trends_Analysis_2015'
           r'\CRU_Climate_Data\cru_climate_summaries.xlsx')

# Create writer object
writer = pd.ExcelWriter(out_xls)

# Define variables and period of interest
var_list = ['pre', 'tmp']
per_list = ['ann', 'jja', 'jas']

# Dict mapping vars and pers to freqs and stats
attrs_dict = {('pre', 'ann'):['A', 'sum'],
              ('pre', 'jja'):['Q-FEB', 'sum'],
              ('pre', 'jas'):['Q-MAR', 'sum'],
              ('tmp', 'ann'):['A', 'mean'],
              ('tmp', 'jja'):['Q-FEB', 'mean'],
              ('tmp', 'jas'):['Q-MAR', 'mean']}

# Loop over data
for var in var_list:
    for per in per_list:  
        # Resample to desired resolution
        ds2 = ds.resample(freq=attrs_dict[(var, per)][0],
                          dim='time', 
                          how=attrs_dict[(var, per)][1])

        # Extract point data
        ds2 = ds2.sel_points(lon=stn_df['lon'].values,
                             lat=stn_df['lat'].values,
                             method='nearest',
                             dim=stn_df['stn_id'])

        # Convert to df
        df =  ds2[var].to_dataframe()

        # Add year col (and month col for seasonal)
        df['year'] = df.index.get_level_values('time').year        
        if per != 'ann':
            df['month'] = df.index.get_level_values('time').month
            
        # Tidy
        df.reset_index(inplace=True)

        # If seasonal, get just season of interest
        if per == 'jja':
            # Totals for JJA are associated with month=8
            df = df.query('month==8')
        elif per == 'jas':
            # Totals for JAS are associated with month=9
            df = df.query('month==9')

        # Write to output
        df.to_excel(writer,
                    '%s_%s' % (var, per),
                    index=False)

# Save and tidy
writer.save()
ds.close()
ds2.close()