Here we download the raw supply data for a given year from downscaled CMIP5 hydrology projections (link).
These data include monthly estimates of runoff at a 1/8th degree spatial resolution across the US for the period 1950 to 2099. Estimates are provided for 21 different climate projection ensembles applied to the Variable Infiltration Capacity (VIC) Macroscale Hydrologic Model (link); see the PDF document for a complete list. For demonstration purposes, this project uses the National Center for Atmospheric Research (NCAR) CCSM4 RCP 2.6 projection ensemble as the base data for water supply figures.
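The files on the FTP server appear to follow a consistent naming pattern, so other ensembles could in principle be pulled by swapping the ensemble identifier. Here is a minimal sketch, assuming the other ensemble directories follow the same `<model>_<scenario>_<run>` pattern as the CCSM4 RCP 2.6 directory used below; any identifier other than ccsm4_rcp26_r1i1p1 would need to be checked against the server's actual directory listing.

In [ ]:
#Sketch: build the download URL for a given ensemble identifier and year.
#Only ccsm4_rcp26_r1i1p1 is verified in this notebook; other identifiers
# are assumptions based on the apparent naming convention.
def make_runoff_url(ensemble, year):
    base = 'ftp://gdo-dcp.ucllnl.org/pub/dcp/archive/cmip5/hydro/BCSD_mon_VIC_nc/'
    fname = 'conus_c5.{0}.monthly.total_runoff.{1}.nc'.format(ensemble, year)
    return base + ensemble + '/' + fname

print(make_runoff_url('ccsm4_rcp26_r1i1p1', 2000))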
The steps involved, for a given year, are:

1. Extract the monthly runoff data from the downloaded NetCDF file into a 3-dimensional NumPy array (time, latitude, longitude).
2. Collapse the time dimension (months) into annual sums, resulting in a 2-dimensional array with a single annual value for each 1/8th degree coordinate pair: rows = latitudes, columns = longitudes.
3. Re-label the columns as longitude values and insert a column of latitude values, then melt the table into a tidy listing of latitude, longitude, and value.
4. Spatially join state and county FIPS codes to the data frame, using a county shapefile stored in the data folder.

The final product is a county-level table of annual runoff:
YEAR | FIPS | Runoff |
---|---|---|
2000 | 01001 | 0 |
In [ ]:
#Import libraries
import sys, os, glob, time, datetime, urllib.request
import numpy as np
import pandas as pd
import netCDF4
import geopandas as gpd
from geopandas import GeoDataFrame, read_file
from geopandas.tools import sjoin
from shapely.geometry import Point, mapping, shape
In [ ]:
#Year to run
year = 2000
#Set filename locations
countyFN = '../Data/cb_2016_us_county_5m.shp'
#Output file locations
tidyFN = '../Data/SupplyTableTidy.csv'
outputFN = '../Data/CountySupplyTable.csv'
The code blocks below pull the runoff data from the CMIP5 FTP server as individual NetCDF4 (nc) files. Each nc file stores monthly values (n=12) across a 1/8th degree geographic grid (222 latitudes x 462 longitudes). Monthly values are summed to create a data frame ('dfParam') where columns represent longitude, rows represent latitude, and the value is the runoff in mm/year. This data frame, in turn, is melted to generate a new data frame ('dfTidy') listing lat/long pairs and the parameter value associated with each location.
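To make the melt ("un-pivot") step concrete before running it on the real grid, here is a tiny self-contained sketch on a made-up 2 x 3 grid; the coordinates and values below are invented purely for illustration.

In [ ]:
#Toy melt example: rows are latitudes, columns are longitudes,
# and pd.melt un-pivots the grid into one row per lat/lon pair
import pandas as pd
toy = pd.DataFrame([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]],
                   columns=[-100.000, -99.875, -99.750]) #longitudes
toy['LAT'] = [35.000, 35.125]                            #latitudes
pd.melt(toy, id_vars=['LAT'], var_name='LON', value_name='total_runoff')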
In [ ]:
#Get URLs for the NCAR CCSM4 RCP 2.6 scenario ensemble for runoff (ro)
baseURL = 'ftp://gdo-dcp.ucllnl.org/pub/dcp/archive/cmip5/hydro/BCSD_mon_VIC_nc/ccsm4_rcp26_r1i1p1/'
roURL = baseURL + "conus_c5.ccsm4_rcp26_r1i1p1.monthly.total_runoff.{}.nc".format(year)
In [ ]:
#These lines fix an issue with slow network connections
import socket
socket.setdefaulttimeout(30)
In [ ]:
#Retrieve the data file from the ftp server to a local (netCDF4) file
if not os.path.exists('tmpData.nc'):
    print("->Downloading runoff data for year {}".format(year))
    urllib.request.urlretrieve(roURL, "tmpData.nc")
else:
    print("->Using previously downloaded runoff data for year {}".format(year))
In [ ]:
#Convert to netCDF object
nc = netCDF4.Dataset("tmpData.nc",mode='r')
type(nc)
The next few code blocks are a quick examination of the netCDF file format.
More info: http://www.ceda.ac.uk/static/media/uploads/ncas-reading-2015/10_read_netcdf_python.pdf
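Beyond the targeted checks below, it can be handy to survey everything in the file in one pass. This short sketch just loops over the standard netCDF4 Dataset interface of the `nc` object opened above.

In [ ]:
#List every variable in the file with its dimensions and shape
for name, var in nc.variables.items():
    print(name, var.dimensions, var.shape)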
In [ ]:
#A quick view of the netCDF object from its metadata
print("Description:", nc.description)
In [ ]:
#The dimensions of the file
print(nc.dimensions.keys())
In [ ]:
#A look at the `time` dimension reveals 12 levels (i.e. one per month)
print(nc.dimensions['time'])
In [ ]:
#How many levels in the latitude and longitude dimensions?
print(nc.dimensions['latitude'])
print(nc.dimensions['longitude'])
In [ ]:
#Examining the total_runoff variable reveals it has values
# for the time dimension (n=12), latitude (n=222), and longitude (n=462)
print(nc.variables['total_runoff'])
In [ ]:
#Extract the total_runoff values into a netCDF4 Variable object
param_vals = nc.variables['total_runoff']
type(param_vals)
In [ ]:
# The param_vals object is a netCDF4 array with some properties
print "Dimensions: ",param_vals.dimensions
print "Units: ",param_vals.units
print "Shape: ",param_vals.shape
In [ ]:
#If we want to get all the data for one time value, e.g. January, we use slicing.
januaryRunoff = param_vals[0,:,:] #<-- returns a numpy array
januaryRunoff.mean() # <-- computes the mean of all locations for January
In [ ]:
#Or, compute the time series for a specific lat-lng pair
param_vals[:,100,120]
In [ ]:
# Sum the values in axis '0' (time), which reduces the dimension of our array
annSum = param_vals[:,:,:].sum(axis=0)
annSum.shape
In [ ]:
#Now that it's two dimensions, we can convert it into a dataFrame
dfParam = pd.DataFrame(annSum)
dfParam.shape
In [ ]:
#We can again retrieve a runoff value for a specific lat-lng pair
# This is the annual sum of the monthly values retrieved above...
dfParam.iloc[100,120]
We now have the runoff data in a 2-dimensional dataFrame, but this dataFrame is not 'tidy': the columns represent sequential longitudes and the rows sequential latitudes. What we need to do next is tidy the data; more specifically, we need a table that lists each runoff value alongside the actual latitude and longitude at which it was recorded. Ultimately, we also want to add a column listing the county in which each coordinate point falls, so that we can compute the total runoff by county.
The first step is to extract the actual latitude and longitude values corresponding to the rows and columns of the dfParam dataFrame. These values are stored in another set of variables in our original netCDF file, specifically the latitude and longitude variables.
Below we extract these values into their own respective dataFrames. (After that, we're done with extracting values from our netCDF file, so we can clean up some things...)
In [ ]:
arrLats = nc.variables['latitude'][:]
arrLons = nc.variables['longitude'][:]
In [ ]:
#Create latitude and longitude dataFrames from those variables
dfLats = pd.DataFrame(nc.variables['latitude'][:])
dfLons = pd.DataFrame(nc.variables['longitude'][:])
In [ ]:
#Clean up, now that we are done with the NetCDF file
#Close the nc object
nc.close()
#Delete the nc file
os.remove("tmpData.nc")
#Clear the urllib cache
urllib.request.urlcleanup()
The pandas melt function below "un-pivots" our dfParam dataFrame from rows of latitude and columns of longitude into a three-column table of lat, lon, and value for the current parameter (e.g. runoff).
In [ ]:
#Rename the columns to the actual longitude values
dfParam.columns = arrLons
In [ ]:
#Create a column containing the actual latitude values
dfParam['LAT'] = arrLats
In [ ]:
#Take a quick view at the results
dfParam.head(5)
In [ ]:
#Un-pivot, or "melt", the data so that all the longitude column names become rows,
# keeping the latitude value for each and assigning the values to a column named 'total_runoff'
dfTidy = pd.melt(dfParam,id_vars=['LAT'],var_name='LON',value_name='total_runoff')
dfTidy.head()
In [ ]:
#Add a geometry field to the data frame, setting the value as a new Shapely point object
dfTidy['geometry'] = dfTidy.apply(lambda z: Point(z.LON, z.LAT), axis=1)
In [ ]:
#Show one geometry point (the first); it appears as a graphic!
dfTidy['geometry'][0]
In [ ]:
#Show the first 10 geometry points; a series of points displays as text rather than a graphic
dfTidy['geometry'][:10]
With the point objects stored in our pandas data frame, we can now convert it into a geopandas dataframe, which adds spatial analysis functionality to our dataset.
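Before running the join on the full dataset, here is a minimal sketch of what sjoin does, using one made-up point and one made-up square polygon; the FIPS value and coordinates below are invented for illustration. (Note that newer geopandas versions replace the op= keyword with predicate=.)

In [ ]:
#Minimal sjoin example: a point inside a unit square inherits the
# square's attributes (all values below are made up)
from shapely.geometry import Polygon
pt = gpd.GeoDataFrame({'val': [1.0]}, geometry=[Point(0.5, 0.5)])
sq = gpd.GeoDataFrame({'FIPS': ['01001']},
                      geometry=[Polygon([(0,0), (1,0), (1,1), (0,1)])])
sjoin(pt, sq, how='left', op='within')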
In [ ]:
#Create a *geopandas* dataframe from the dataframe created above
gdfPoints = gpd.GeoDataFrame(dfTidy)
In [ ]:
#Create a second geopandas dataframe, this one from the counties file
gdfPolygons = gpd.GeoDataFrame.from_file(countyFN)
In [ ]:
#Set the coordinate system of the points equal to the polygons
gdfPoints.crs = gdfPolygons.crs
In [ ]:
#Execute the spatial join of polygon attributes to the point objects
dfMerged = sjoin(gdfPoints, gdfPolygons, how='left', op='within')
In [ ]:
#Drop extraneous columns
dfMerged.drop(['geometry','index_right','AFFGEOID','COUNTYFP','COUNTYNS','LSAD'],
              axis=1,
              inplace=True)
In [ ]:
#Drop rows with no data (usually falling outside the US)
dfMerged.dropna(inplace=True)
In [ ]:
#Rename FIPS columns & write to a file
dfMerged.rename(columns={'GEOID':'FIPS','STATEFP':'STATEFIPS'},inplace=True)
dfMerged.to_csv(tidyFN,index=False,encoding='utf8')
In [ ]:
dfMerged.columns
In [ ]:
#Specify the aggregate functions
aggFunctions = {'STATEFIPS':'first',
                'ALAND':'first',
                'AWATER':'first',
                'total_runoff':'sum'
               }
dfCounty = dfMerged.groupby('FIPS').agg(aggFunctions)
dfCounty.head()
In [ ]:
#Write to a file
dfCounty.to_csv(outputFN)
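One caveat when reusing the CSV written above: FIPS codes such as '01001' carry leading zeros, which pandas will silently drop if it infers the column as an integer on read. A small sketch of reading the table back safely, assuming the column names written above:

In [ ]:
#Read the county table back, keeping FIPS columns as strings
# so leading zeros (e.g. '01001') are preserved
dfCheck = pd.read_csv(outputFN, dtype={'FIPS': str, 'STATEFIPS': str})
dfCheck.head()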
In [ ]:
%matplotlib inline
#Convert the merged geodataframe to a plain pandas dataframe for plotting
df = pd.DataFrame(dfMerged)
In [ ]:
df.boxplot(by='STATEFIPS',
           column='total_runoff',
           figsize=(12,6)
          );
In [ ]:
#Subset the records for North Carolina (state FIPS '37') and plot runoff by county
dfNC = df[df.STATEFIPS == '37']
dfNC.boxplot(by='FIPS',column='total_runoff',figsize=(12,6));