Import water supply data & create supply table

Here we download the raw supply data for a given year from downscaled CMIP5 hydrology projections (link).

These data include monthly estimates of runoff at a 1/8th degree spatial resolution across the US for the period 1950 to 2099. Estimates are provided for 21 different climate projection ensembles applied to the Variable Infiltration Capacity (VIC) Macroscale Hydrologic Model (link); see the PDF document for a complete list. For demonstration purposes, this project uses the National Center for Atmospheric Research CCSM4 RCP 2.6 projection ensemble as the base data for water supply figures.

The steps involved include:

  • Download monthly runoff (total_runoff), in NetCDF format, from a central data repository (link) for a given sample year (e.g. 2000).
  • For the year:

    • Extract the monthly runoff data from the downloaded NetCDF file into a 3-dimensional NumPy array (time, latitude, longitude).

    • Collapse the time dimension (months) into annual sums, resulting in a 2-dimensional array, i.e. a single annual value for each 1/8th degree coordinate pair: rows = latitudes, columns = longitudes.

    • Re-label columns as longitude values and insert a column of latitude values, then melt the table into a listing of latitude, longitude, and value.

    • Spatially join county and state FIPS codes to the data frame, using a county shapefile stored in the data folder.

  • Summarize supply values on FIPS to create a table that can be joined to other county-level data:

        YEAR   FIPS    Runoff
        2000   01001   0

Import libraries


In [ ]:
#Import libraries
import sys, os, glob, time, datetime, urllib
import numpy as np
import pandas as pd
import netCDF4

import geopandas as gpd
from geopandas import GeoDataFrame, read_file
from geopandas.tools import sjoin
from shapely.geometry import Point, mapping, shape

Set script runtime variables

The required inputs include the year to process and the location of the county feature class (used to sample and summarize the 1/8th degree data to the county level). The script will write its output to two CSV files: a tidy table of point-level values and a county-level summary table.


In [ ]:
#Year to run
year = 2000

#Set filename locations
countyFN = '../Data/cb_2016_us_county_5m.shp'

#Output file locations
tidyFN = '../Data/SupplyTableTidy.csv'
outputFN = '../Data/CountySupplyTable.csv'

Pull in the runoff data as a NetCDF4 file from the UCLLNL servers

The code blocks below pull the runoff data from the CMIP5 data ftp server as individual NetCDF4 (nc) files. Each nc file stores monthly values (n=12) across a 1/8th degree geographic grid (462 x 222). Monthly values are summed to create a data frame ('dfParam') where columns represent longitude, rows represent latitude, and the value is the runoff in mm/year. This data frame, in turn, is melted to generate a new data frame (dfTidy) listing lat/long pairs and the parameter value associated with each location.


In [ ]:
#Get URL for the NCAR CCSM4 RCP 2.6 scenario ensemble for runoff (ro)
baseURL = 'ftp://gdo-dcp.ucllnl.org/pub/dcp/archive/cmip5/hydro/BCSD_mon_VIC_nc/ccsm4_rcp26_r1i1p1/'
roURL = baseURL + "conus_c5.ccsm4_rcp26_r1i1p1.monthly.total_runoff.{}.nc".format(year)

In [ ]:
#These lines fix an issue with slow network connections
import socket
socket.setdefaulttimeout(30)

In [ ]:
#Retrieve the data file from the ftp server to a local (netCDF4) file
if not os.path.exists('tmpData.nc'):
    print "->Downloading runoff data for year {}".format(year)
    urllib.urlretrieve(roURL,"tmpData.nc")
else:
    print "->Runoff data for year {} already downloaded; using local copy".format(year)

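If the FTP connection still drops on a very slow network, the urlretrieve call above can be wrapped in a simple retry loop. The sketch below is illustrative only (the retry count and delay are arbitrary choices, not part of the original workflow):

In [ ]:
#Illustrative only: retry the download a few times before giving up
for attempt in range(3):
    try:
        urllib.urlretrieve(roURL, "tmpData.nc")
        break
    except IOError as e:
        print "  ...attempt {} failed ({}); retrying".format(attempt + 1, e)
        time.sleep(10)
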
In [ ]:
#Convert to netCDF object
nc = netCDF4.Dataset("tmpData.nc",mode='r')
type(nc)

A quick overview of NetCDF4 datasets

The few code blocks below are a quick examination of the netCDF file format.
More info: http://www.ceda.ac.uk/static/media/uploads/ncas-reading-2015/10_read_netcdf_python.pdf

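Before drilling into specific pieces of the file, it can help to list everything it contains. The two calls below are standard netCDF4 Dataset methods: ncattrs() returns the global attributes and variables.keys() returns the variable names.

In [ ]:
#List the file's global attributes and variable names
print "Global attributes:", nc.ncattrs()
print "Variables:", nc.variables.keys()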

In [ ]:
#A quick view of the netCDF object from its metadata
print "Description:",nc.description

In [ ]:
#The dimensions of the file
print nc.dimensions.keys()

In [ ]:
#A look at the `time` dimension reveals 12 levels (i.e. one per month)
print nc.dimensions['time']

In [ ]:
#How many levels in the latitude and longitude dimensions?
print nc.dimensions['latitude']
print nc.dimensions['longitude']

In [ ]:
#Examining the shape of the total_runoff variable reveals it has values  
# for the time dimension (n=12), latitude (n=222), and longitude (n=462)
print nc.variables['total_runoff']

Extracting the runoff values from the netCDF file


In [ ]:
#Extract the total_runoff values into a netCDF4 Variable object
param_vals = nc.variables['total_runoff']
type(param_vals)

In [ ]:
# The param_vals object is a netCDF4 Variable object with several useful properties
print "Dimensions: ",param_vals.dimensions
print "Units: ",param_vals.units
print "Shape: ",param_vals.shape

In [ ]:
#If we want to get all the data for one time value, e.g. January, we use slicing.
januaryRunoff = param_vals[0,:,:] #<-- returns a numpy array
januaryRunoff.mean() # <-- computes the mean of all locations for January

In [ ]:
#Or, compute the time series for a specific lat-lng pair
param_vals[:,100,120]

Compute the annual sum for each lat/lng pair

We can sum all of the monthly runoff values using NumPy, specifying the axis we want to sum

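If the axis argument is unfamiliar, the tiny stand-alone example below (made-up numbers, illustrative only) shows how summing along axis 0 collapses the first dimension of a 3-dimensional array, which is exactly what we do to the (time, latitude, longitude) runoff array.

In [ ]:
#Toy example: a (time=2, lat=3, lon=4) array summed along axis 0 yields shape (3, 4)
toy = np.arange(24).reshape(2, 3, 4)
print toy.sum(axis=0).shape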

In [ ]:
# Sum the values in axis '0' (time), which reduces the dimension of our array
annSum = param_vals[:,:,:].sum(axis=0)
annSum.shape

In [ ]:
#Now that it's two dimensions, we can convert it into a dataFrame
dfParam = pd.DataFrame(annSum)
dfParam.shape

In [ ]:
#We can again retrieve a runoff value for a specific lat-lng pair
# This is the annual sum of the monthly values retrieved above...
dfParam.iloc[100,120]

Transpose values into a 'tidy' format

We now have the runoff data in a 2-dimensional dataFrame, but this dataFrame is not 'tidy': the columns represent sequential longitudes and the rows sequential latitudes. What we need to do next is tidy the data; more specifically, we need a table that lists each runoff value along with the actual latitude and longitude at which it was recorded. And ultimately, we want to add another column listing the county in which each coordinate point falls -- so that we can eventually compute the total runoff by county.

The first step is to extract the actual latitude and longitude values corresponding to the rows and columns of the dfParam dataFrame. These values are stored in another set of variables in our original netCDF file, specifically, the latitude and longitude variables.

Below we extract these values into their own respective dataFrames. (And after that, we're done with extracting values from our netCDF file, so we can clean up some things...)


In [ ]:
#Extract the latitude and longitude values into NumPy arrays
arrLats = nc.variables['latitude'][:]
arrLons = nc.variables['longitude'][:]

In [ ]:
#Create latitude and longitude dataFrames from those values
dfLats = pd.DataFrame(nc.variables['latitude'][:])
dfLons = pd.DataFrame(nc.variables['longitude'][:])

In [ ]:
#Clean up, now that we are done with the NetCDF file

#Close the nc object
nc.close()
#Delete the nc file
os.remove("tmpData.nc")
#Clean up the urllib temporary file cache
urllib.urlcleanup()

The pandas melt function below "un-pivots" our dfParam dataFrame from rows of latitude and columns of longitude into a three-column table of lat, lon, value for the current parameter (e.g. runoff).

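If pd.melt is unfamiliar, here is a tiny stand-alone illustration (made-up coordinates and values, for demonstration only) of how a wide table with latitude rows and longitude columns becomes a long listing of lat, lon, value:

In [ ]:
#Toy example: melt a 2 x 2 grid into a long (tidy) table
toyWide = pd.DataFrame({-120.000: [1.0, 2.0], -119.875: [3.0, 4.0]})
toyWide['LAT'] = [35.000, 35.125]
print pd.melt(toyWide, id_vars=['LAT'], var_name='LON', value_name='total_runoff')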

In [ ]:
#Rename the columns to the actual longitude values
dfParam.columns = arrLons

In [ ]:
#Create a column containing the actual latitude values
dfParam['LAT'] = arrLats

In [ ]:
#Take a quick look at the results
dfParam.head(5)

In [ ]:
#Untranspose, or "melt" the data so that all the longitude column names become rows, 
#  keeping the latitude values for each and assigning the values to a column named 'total_runoff'
dfTidy = pd.melt(dfParam,id_vars=['LAT'],var_name='LON',value_name='total_runoff')
dfTidy.head()

Assign the county value of each row

Here we use Shapely and GeoPandas to:

  • Create a point feature from the lat/lng pair in each row
  • Intersect those features with the counties feature class, extracting the FIPS code into a new column in our data.

In [ ]:
#Add a geometry field to the data frame, setting the value as a new Shapely point object
dfTidy['geometry'] = dfTidy.apply(lambda z: Point(z.LON, z.LAT), axis=1)

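The row-wise apply above works, but with roughly 100,000 points it can be slow. A vectorized alternative (an illustrative sketch that produces the same geometry column) builds the points from zipped longitude/latitude pairs:

In [ ]:
#Illustrative alternative: build the point geometries from zipped LON/LAT pairs
dfTidy['geometry'] = [Point(xy) for xy in zip(dfTidy.LON, dfTidy.LAT)]
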
In [ ]:
#Show one geometry point (the first); it appears as a graphic!
dfTidy['geometry'][0]

In [ ]:
#Show the first 10 geometry points; they appear as a list of point objects rather than graphics
dfTidy['geometry'][:10]

With our point objects stored in the pandas data frame, we can now convert it into a geopandas dataframe, which adds spatial analysis functionality to our dataset.


In [ ]:
#Create a *geopandas* dataframe from the dataframe created above
gdfPoints = gpd.GeoDataFrame(dfTidy)

In [ ]:
#Create a second geopandas dataframe, this one from the counties file
gdfPolygons = gpd.GeoDataFrame.from_file(countyFN)

In [ ]:
#Set the coordinate system of the points equal to the polygons
gdfPoints.crs = gdfPolygons.crs

In [ ]:
#Execute the spatial join of polygon attributes to the point objects
dfMerged=sjoin(gdfPoints, gdfPolygons, how='left', op='within')

And now a bit of cleaning up

We'll drop some columns, any rows with no runoff data, and then rename some columns


In [ ]:
#Drop extraneous columns
dfMerged.drop(['geometry','index_right','AFFGEOID','COUNTYFP','COUNTYNS','LSAD'],
              axis=1,
              inplace=True)

In [ ]:
#Drop rows with no data (points falling outside the US)
dfMerged.dropna(inplace=True)

In [ ]:
#Rename FIPS columns & write to a file
dfMerged.rename(columns={'GEOID':'FIPS','STATEFP':'STATEFIPS'},inplace=True)
dfMerged.to_csv(tidyFN,index=False,encoding='utf8')

In [ ]:
#Review the remaining columns
dfMerged.columns

Compute county runoff totals

With the County FIPS column added to our tidy dataset, we can compute the total runoff for each county by grouping on the FIPS column and computing the sum of the total_runoff field.


In [ ]:
#Specify the aggregate functions: keep the first state/area values, sum runoff within each county
aggFunctions = {'STATEFIPS':'first',
                'ALAND':'first',
                'AWATER':'first',
                'total_runoff':'sum'
               }
dfCounty = dfMerged.groupby('FIPS').agg(aggFunctions)
dfCounty.head()

In [ ]:
#Write to a file
dfCounty.to_csv(outputFN)

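The workflow summary at the top of this notebook describes a target table with YEAR, FIPS, and Runoff columns. The sketch below (illustrative; the reshaped frame dfOut is not written out by the original script) shows one way dfCounty could be rearranged to that layout:

In [ ]:
#Illustrative: reshape the county summary to the YEAR / FIPS / Runoff layout described above
dfOut = dfCounty.reset_index()[['FIPS','total_runoff']]
dfOut.rename(columns={'total_runoff':'Runoff'}, inplace=True)
dfOut.insert(0, 'YEAR', year)
dfOut.head()
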
In [ ]:
%matplotlib inline
#Convert the geo-dataframe back to a plain pandas dataframe for plotting
df = pd.DataFrame(dfMerged)

In [ ]:
#Boxplot of total runoff by state
df.boxplot(by='STATEFIPS',
           column='total_runoff',
           figsize=(12,6)
          );

In [ ]:
#Boxplot of runoff by county for North Carolina (state FIPS '37')
dfNC = df[df.STATEFIPS == '37']
dfNC.boxplot(by='FIPS',column='total_runoff',figsize=(12,6));