There is now a Unidata Developer's Blog entry accompanying this IPython notebook.
The goal is to demonstrate how to quickly access netCDF data based on geospatial coordinates instead of array indices.
Let's look at a netCDF file from the Atlantic Real-Time Ocean Forecast System. If you have cloned the Unidata 2013 TDS-Python Workshop, this data is already available in 'data/rtofs_glo_3dz_f006_6hrly_reg3.nc'. Otherwise, you can download rtofs_glo_3dz_f006_6hrly_reg3.nc.
If a recent version of ncdump is installed, one that can read compressed data from netCDF-4 classic model files, we can invoke it from IPython like this:
filename = 'data/rtofs_glo_3dz_f006_6hrly_reg3.nc'
!ncdump -h $filename
Alternatively, we'll use the netCDF4 Python package to show information about the file in a form that's somewhat less familiar, but that contains the information we need for the subsequent examples. This works for any netCDF file format:
In [1]:
import netCDF4
filename = 'data/rtofs_glo_3dz_f006_6hrly_reg3.nc'
ncfile = netCDF4.Dataset(filename, 'r')
print ncfile # shows global attributes, dimensions, and variables
ncvars = ncfile.variables # a dictionary of variables
# print information about specific variables, including type, shape, and attributes
for varname in ['temperature', 'salinity', 'Latitude', 'Longitude']:
    print ncvars[varname]
Here's a sparse picture (every 25th point on each axis) of the grid on which the Latitude, Longitude, temperature, salinity, and other variables are defined.
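Since the plot itself isn't reproduced here, below is a minimal sketch of how such a picture can be drawn (assuming matplotlib is available; the marker size and labels are our choices, not necessarily how the original figure was made):
import matplotlib.pyplot as plt
import netCDF4
ncfile = netCDF4.Dataset(filename, 'r')
lats = ncfile.variables['Latitude'][::25, ::25]   # every 25th point on each axis
lons = ncfile.variables['Longitude'][::25, ::25]
plt.scatter(lons, lats, s=1)  # scatter flattens the 2D arrays
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
ncfile.close()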
So, keeping things simple, we want to find the indices iy and ix that minimize
(Latitude[iy, ix] - lat0)**2 + (Longitude[iy, ix] - lon0)**2
To access netCDF data, rather than just metadata, we will also need NumPy:
With the Python netCDF4 package, using "[ ... ]" to index a netCDF variable object reads or writes a numpy array from the associated netCDF file.
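For example, here's a minimal sketch (the 10x10 corner is an arbitrary illustration) showing that indexing with slices reads just that subset from the file, not the whole variable:
lat_sub = ncvars['Latitude'][0:10, 0:10]  # reads only a 10x10 subset from disk
print lat_sub.shape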
The code below reads latitude and longitude values into 2D numpy arrays named latvals and lonvals:
In [2]:
import numpy as np
import netCDF4
def naive_slow(latvar,lonvar,lat0,lon0):
    '''
    Find "closest" point in a set of (lat,lon) points to specified point
    latvar - 2D latitude variable from an open netCDF dataset
    lonvar - 2D longitude variable from an open netCDF dataset
    lat0,lon0 - query point
    Returns iy,ix such that
        (lonval[iy,ix] - lon0)**2 + (latval[iy,ix] - lat0)**2
    is minimum. This "closeness" measure works badly near poles and
    longitude boundaries.
    '''
    # Read from file into numpy arrays
    latvals = latvar[:]
    lonvals = lonvar[:]
    ny,nx = latvals.shape
    dist_sq_min = 1.0e30
    for iy in range(ny):
        for ix in range(nx):
            latval = latvals[iy, ix]
            lonval = lonvals[iy, ix]
            dist_sq = (latval - lat0)**2 + (lonval - lon0)**2
            if dist_sq < dist_sq_min:
                iy_min, ix_min, dist_sq_min = iy, ix, dist_sq
    return iy_min,ix_min
When we call the function above, it takes several seconds to run, because it calculates distances one point at a time for each of the 605,200 $(lat, lon)$ points. Note that once indices for the point nearest to (50.0, -140.0) are found, they can be used to access temperature, salinity, and other netCDF variables that use the same dimensions.
In [3]:
ncfile = netCDF4.Dataset(filename, 'r')
latvar = ncfile.variables['Latitude']
lonvar = ncfile.variables['Longitude']
iy,ix = naive_slow(latvar, lonvar, 50.0, -140.0)
print 'Closest lat lon:', latvar[iy,ix], lonvar[iy,ix]
tempvar = ncfile.variables['temperature']
salvar = ncfile.variables['salinity']
print 'temperature:', tempvar[0, 0, iy, ix], tempvar.units
print 'salinity:', salvar[0, 0, iy, ix], salvar.units
ncfile.close()
The above function is slow because it doesn't make good use of NumPy arrays: it's much faster to use whole-array operations to eliminate the loops and element-at-a-time computation. Two NumPy features help eliminate the loops (a toy demonstration follows this list):
argmin(): a method that returns the index of an array's minimum value, as a single index into the flattened array
unravel_index(): a function that converts a flat 1D index back into a multidimensional index
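Here's a toy illustration of the two working together (the 2x2 array is hypothetical, just to show the mechanics):
import numpy as np
a = np.array([[3.0, 1.0],
              [0.5, 2.0]])
flat_index = a.argmin()                      # 2: index of 0.5 in the flattened array
print np.unravel_index(flat_index, a.shape)  # (1, 0): its row and column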
In [4]:
import numpy as np
import netCDF4
def naive_fast(latvar,lonvar,lat0,lon0):
    # Read latitude and longitude from file into numpy arrays
    latvals = latvar[:]
    lonvals = lonvar[:]
    ny,nx = latvals.shape
    dist_sq = (latvals-lat0)**2 + (lonvals-lon0)**2
    minindex_flattened = dist_sq.argmin()  # 1D index of min element
    iy_min,ix_min = np.unravel_index(minindex_flattened, latvals.shape)
    return iy_min,ix_min
ncfile = netCDF4.Dataset(filename, 'r')
latvar = ncfile.variables['Latitude']
lonvar = ncfile.variables['Longitude']
iy,ix = naive_fast(latvar, lonvar, 50.0, -140.0)
print 'Closest lat lon:', latvar[iy,ix], lonvar[iy,ix]
ncfile.close()
Though assuming a flat Earth may work OK for this example, we'd rather not worry about whether longitudes run from 0 to 360 or from -180 to 180, or whether points are close to the poles. The code below fixes both problems by minimizing the square of the "tunnel distance" between (lat,lon) points, the straight-line chord through the Earth connecting them. This version is both fast and correct (for a spherical Earth).
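For reference, a point at latitude $\phi$ and longitude $\lambda$ (in radians) on a unit sphere has 3D Cartesian coordinates $x = \cos\phi\,\cos\lambda$, $y = \cos\phi\,\sin\lambda$, $z = \sin\phi$, and the tunnel distance between two surface points is the ordinary Euclidean distance between their 3D coordinates. A chord subtending a central angle $\theta$ has length $2\sin(\theta/2)$, which increases monotonically for $\theta$ in $[0, \pi]$, so minimizing tunnel distance also minimizes great-circle distance.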
In [5]:
import numpy as np
import netCDF4
from math import pi
from numpy import cos, sin
def tunnel_fast(latvar,lonvar,lat0,lon0):
    '''
    Find closest point in a set of (lat,lon) points to specified point
    latvar - 2D latitude variable from an open netCDF dataset
    lonvar - 2D longitude variable from an open netCDF dataset
    lat0,lon0 - query point
    Returns iy,ix such that the square of the tunnel distance
    between (latval[iy,ix],lonval[iy,ix]) and (lat0,lon0)
    is minimum.
    '''
    rad_factor = pi/180.0 # for trigonometry, need angles in radians
    # Read latitude and longitude from file into numpy arrays
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
    ny,nx = latvals.shape
    lat0_rad = lat0 * rad_factor
    lon0_rad = lon0 * rad_factor
    # Compute numpy arrays for all values, no loops
    clat,clon = cos(latvals),cos(lonvals)
    slat,slon = sin(latvals),sin(lonvals)
    delX = cos(lat0_rad)*cos(lon0_rad) - clat*clon
    delY = cos(lat0_rad)*sin(lon0_rad) - clat*slon
    delZ = sin(lat0_rad) - slat
    dist_sq = delX**2 + delY**2 + delZ**2
    minindex_1d = dist_sq.argmin() # 1D index of minimum element
    iy_min,ix_min = np.unravel_index(minindex_1d, latvals.shape)
    return iy_min,ix_min
ncfile = netCDF4.Dataset(filename, 'r')
latvar = ncfile.variables['Latitude']
lonvar = ncfile.variables['Longitude']
iy,ix = tunnel_fast(latvar, lonvar, 50.0, -140.0)
print 'Closest lat lon:', latvar[iy,ix], lonvar[iy,ix]
ncfile.close()
We can still do better, by using a data structure designed to support efficient nearest-neighbor queries: the KD-tree. It works like a multidimensional binary tree, so finding the point nearest to a query point is much faster than computing all the distances to find the minimum. It takes some setup time to load all the points into the data structure, but that only has to be done once for a given set of points.
For a single point query, it's still more than twice as fast as the naive slow version above, but building the KD-tree for 605,200 points takes more time than the fast numpy search through all the points, so in this case using the KD-tree for a single point query is sort of pointless ...
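If cKDTree is new to you, here's a minimal sketch of its API on made-up data (the 1000 random points and the query location are arbitrary illustrations), separate from our lat/lon problem:
import numpy as np
from scipy.spatial import cKDTree
pts = np.random.rand(1000, 3)          # 1000 arbitrary points in 3D
tree = cKDTree(pts)                    # setup: build the tree once
dist, i = tree.query([0.5, 0.5, 0.5])  # query: nearest neighbor of one point
print 'nearest point:', pts[i], 'at distance', dist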
In [6]:
import numpy as np
import netCDF4
from math import pi
from numpy import cos, sin
from scipy.spatial import cKDTree
def kdtree_fast(latvar,lonvar,lat0,lon0):
    rad_factor = pi/180.0 # for trigonometry, need angles in radians
    # Read latitude and longitude from file into numpy arrays
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
    ny,nx = latvals.shape
    clat,clon = cos(latvals),cos(lonvals)
    slat,slon = sin(latvals),sin(lonvals)
    # Build kd-tree from big arrays of 3D coordinates
    triples = zip(np.ravel(clat*clon), np.ravel(clat*slon), np.ravel(slat))
    kdt = cKDTree(triples)
    lat0_rad = lat0 * rad_factor
    lon0_rad = lon0 * rad_factor
    clat0,clon0 = cos(lat0_rad),cos(lon0_rad)
    slat0,slon0 = sin(lat0_rad),sin(lon0_rad)
    dist_min, minindex_1d = kdt.query([clat0*clon0, clat0*slon0, slat0]) # query returns distance, not its square
    iy_min,ix_min = np.unravel_index(minindex_1d, latvals.shape)
    return iy_min,ix_min
ncfile = netCDF4.Dataset(filename, 'r')
latvar = ncfile.variables['Latitude']
lonvar = ncfile.variables['Longitude']
iy,ix = kdtree_fast(latvar, lonvar, 50.0, -140.0)
print 'Closest lat lon:', latvar[iy,ix], lonvar[iy,ix]
ncfile.close()
If you're curious about the actual times for the versions above, the IPython "%%timeit" cell magic gets accurate timings of all of them. Below, we time just a single query, for the point (50.0, -140.0). The "%%timeit" statement lets us do untimed setup first, on the same line, before running the timed function call in a loop.
In [7]:
%%timeit ncfile = netCDF4.Dataset(filename,'r');latvar = ncfile.variables['Latitude'];lonvar = ncfile.variables['Longitude']
naive_slow(latvar, lonvar, 50.0, -140.0)
In [8]:
%%timeit ncfile = netCDF4.Dataset(filename,'r');latvar = ncfile.variables['Latitude'];lonvar = ncfile.variables['Longitude']
naive_fast(latvar, lonvar, 50.0, -140.0)
In [9]:
%%timeit ncfile = netCDF4.Dataset(filename,'r');latvar = ncfile.variables['Latitude'];lonvar = ncfile.variables['Longitude']
tunnel_fast(latvar, lonvar, 50.0, -140.0)
In [10]:
%%timeit ncfile = netCDF4.Dataset(filename,'r');latvar = ncfile.variables['Latitude'];lonvar = ncfile.variables['Longitude']
kdtree_fast(latvar, lonvar, 50.0, -140.0)
In [11]:
ncfile.close()
The above use of the KD-tree data structure is not the way it's meant to be used. Instead, it should be initialized once with all the k-dimensional data for which nearest-neighbors are desired, then used repeatedly on each query, amortizing the work done to build the data structure over all the following queries. By separately timing the setup and the time required per query, the threshold for number of queries beyond which the KD-tree is faster can be determined.
That's exactly what we'll do now. We split each algorithm into two functions, a setup function and a query function. The times per query go from seconds (the naive version) to milliseconds (the array-oriented numpy version) to microseconds (the turbo-charged KD-tree, once it's built).
Rather than just using functions, we define a class for each algorithm, do the setup in the class constructor, and provide a query method.
In [12]:
# Split naive_slow into initialization and query, so we can time them separately
import numpy as np
import netCDF4
class Naive_slow(object):
    def __init__(self, ncfile, latvarname, lonvarname):
        self.ncfile = ncfile
        self.latvar = self.ncfile.variables[latvarname]
        self.lonvar = self.ncfile.variables[lonvarname]
        # Read latitude and longitude from file into numpy arrays
        self.latvals = self.latvar[:]
        self.lonvals = self.lonvar[:]
        self.shape = self.latvals.shape
    def query(self,lat0,lon0):
        ny,nx = self.shape
        dist_sq_min = 1.0e30
        for iy in range(ny):
            for ix in range(nx):
                latval = self.latvals[iy, ix]
                lonval = self.lonvals[iy, ix]
                dist_sq = (latval - lat0)**2 + (lonval - lon0)**2
                if dist_sq < dist_sq_min:
                    iy_min, ix_min, dist_sq_min = iy, ix, dist_sq
        return iy_min,ix_min
ncfile = netCDF4.Dataset(filename, 'r')
ns = Naive_slow(ncfile,'Latitude','Longitude')
iy,ix = ns.query(50.0, -140.0)
print 'Closest lat lon:', ns.latvar[iy,ix], ns.lonvar[iy,ix]
ncfile.close()
In [13]:
# Split naive_fast into initialization and query, so we can time them separately
import numpy as np
import netCDF4
class Naive_fast(object):
    def __init__(self, ncfile, latvarname, lonvarname):
        self.ncfile = ncfile
        self.latvar = self.ncfile.variables[latvarname]
        self.lonvar = self.ncfile.variables[lonvarname]
        # Read latitude and longitude from file into numpy arrays
        self.latvals = self.latvar[:]
        self.lonvals = self.lonvar[:]
        self.shape = self.latvals.shape
    def query(self,lat0,lon0):
        dist_sq = (self.latvals-lat0)**2 + (self.lonvals-lon0)**2
        minindex_flattened = dist_sq.argmin() # 1D index
        iy_min, ix_min = np.unravel_index(minindex_flattened, self.shape) # 2D indexes
        return iy_min,ix_min
ncfile = netCDF4.Dataset(filename, 'r')
ns = Naive_fast(ncfile,'Latitude','Longitude')
iy,ix = ns.query(50.0, -140.0)
print 'Closest lat lon:', ns.latvar[iy,ix], ns.lonvar[iy,ix]
ncfile.close()
In [14]:
# Split tunnel_fast into initialization and query, so we can time them separately
import numpy as np
import netCDF4
from math import pi
from numpy import cos, sin
class Tunnel_fast(object):
    def __init__(self, ncfile, latvarname, lonvarname):
        self.ncfile = ncfile
        self.latvar = self.ncfile.variables[latvarname]
        self.lonvar = self.ncfile.variables[lonvarname]
        # Read latitude and longitude from file into numpy arrays
        rad_factor = pi/180.0 # for trigonometry, need angles in radians
        self.latvals = self.latvar[:] * rad_factor
        self.lonvals = self.lonvar[:] * rad_factor
        self.shape = self.latvals.shape
        clat,clon,slon = cos(self.latvals),cos(self.lonvals),sin(self.lonvals)
        self.clat_clon = clat*clon
        self.clat_slon = clat*slon
        self.slat = sin(self.latvals)
    def query(self,lat0,lon0):
        # for trigonometry, need angles in radians
        rad_factor = pi/180.0
        lat0_rad = lat0 * rad_factor
        lon0_rad = lon0 * rad_factor
        delX = cos(lat0_rad)*cos(lon0_rad) - self.clat_clon
        delY = cos(lat0_rad)*sin(lon0_rad) - self.clat_slon
        delZ = sin(lat0_rad) - self.slat
        dist_sq = delX**2 + delY**2 + delZ**2
        minindex_1d = dist_sq.argmin() # 1D index
        iy_min, ix_min = np.unravel_index(minindex_1d, self.shape) # 2D indexes
        return iy_min,ix_min
ncfile = netCDF4.Dataset(filename, 'r')
ns = Tunnel_fast(ncfile,'Latitude','Longitude')
iy,ix = ns.query(50.0, -140.0)
print 'Closest lat lon:', ns.latvar[iy,ix], ns.lonvar[iy,ix]
ncfile.close()
In [15]:
# Split kdtree_fast into initialization and query, so we can time them separately
import numpy as np
import netCDF4
from math import pi
from numpy import cos, sin
from scipy.spatial import cKDTree
class Kdtree_fast(object):
    def __init__(self, ncfile, latvarname, lonvarname):
        self.ncfile = ncfile
        self.latvar = self.ncfile.variables[latvarname]
        self.lonvar = self.ncfile.variables[lonvarname]
        # Read latitude and longitude from file into numpy arrays
        rad_factor = pi/180.0 # for trigonometry, need angles in radians
        self.latvals = self.latvar[:] * rad_factor
        self.lonvals = self.lonvar[:] * rad_factor
        self.shape = self.latvals.shape
        clat,clon = cos(self.latvals),cos(self.lonvals)
        slat,slon = sin(self.latvals),sin(self.lonvals)
        clat_clon = clat*clon
        clat_slon = clat*slon
        triples = zip(np.ravel(clat_clon), np.ravel(clat_slon), np.ravel(slat))
        self.kdt = cKDTree(triples)
    def query(self,lat0,lon0):
        rad_factor = pi/180.0
        lat0_rad = lat0 * rad_factor
        lon0_rad = lon0 * rad_factor
        clat0,clon0 = cos(lat0_rad),cos(lon0_rad)
        slat0,slon0 = sin(lat0_rad),sin(lon0_rad)
        dist_min, minindex_1d = self.kdt.query([clat0*clon0,clat0*slon0,slat0]) # distance, not its square
        iy_min, ix_min = np.unravel_index(minindex_1d, self.shape)
        return iy_min,ix_min
ncfile = netCDF4.Dataset(filename, 'r')
ns = Kdtree_fast(ncfile,'Latitude','Longitude')
iy,ix = ns.query(50.0, -140.0)
print 'Closest lat lon:', ns.latvar[iy,ix], ns.lonvar[iy,ix]
ncfile.close()
In [16]:
%%timeit ncfile = netCDF4.Dataset(filename, 'r')
ns = Naive_slow(ncfile,'Latitude','Longitude')
In [17]:
%%timeit ncfile = netCDF4.Dataset(filename, 'r')
ns = Naive_fast(ncfile,'Latitude','Longitude')
In [18]:
%%timeit ncfile = netCDF4.Dataset(filename, 'r')
ns = Tunnel_fast(ncfile,'Latitude','Longitude')
In [19]:
%%timeit ncfile = netCDF4.Dataset(filename, 'r')
ns = Kdtree_fast(ncfile,'Latitude','Longitude')
In [20]:
%%timeit ncfile = netCDF4.Dataset(filename, 'r'); ns = Naive_slow(ncfile,'Latitude','Longitude')
iy,ix = ns.query(50.0, -140.0)
In [21]:
%%timeit ncfile = netCDF4.Dataset(filename, 'r'); ns = Naive_fast(ncfile,'Latitude','Longitude')
iy,ix = ns.query(50.0, -140.0)
In [22]:
%%timeit ncfile = netCDF4.Dataset(filename, 'r'); ns = Tunnel_fast(ncfile,'Latitude','Longitude')
iy,ix = ns.query(50.0, -140.0)
In [23]:
%%timeit ncfile = netCDF4.Dataset(filename, 'r'); ns = Kdtree_fast(ncfile,'Latitude','Longitude')
iy,ix = ns.query(50.0, -140.0)
In the next cell, we copy the results of the %%timeit runs into Python variables by hand. (Is there a way to capture %%timeit output so we don't have to do this manually? In newer IPython releases, passing the -o option to %timeit returns a TimeitResult object that can be assigned to a variable.)
In [26]:
ns0,nf0,tf0,kd0 = 3.76, 3.8, 27.4, 2520    # setup times in msec
nsq,nfq,tfq,kdq = 7790, 2.46, 5.14, 0.0738 # query times in msec
The naive_slow method is always slower than all the other methods. The naive_fast method would only be worth considering if the non-flatness of the Earth is irrelevant, for example in a relatively small region far from the poles that doesn't cross a longitude discontinuity. The total time for initialization followed by N queries is the setup time plus N times the per-query time:
In [27]:
N = 10000
print N, "queries using naive_slow:", round((ns0 + nsq*N)/1000,1), "seconds"
print N, "queries using naive_fast:", round((nf0 + nfq*N)/1000,1), "seconds"
print N, "queries using tunnel_fast:", round((tf0 + tfq*N)/1000,1), "seconds"
print N, "queries using kdtree_fast:", round((kd0 + kdq*N)/1000,1), "seconds"
print
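# Break-even point: setting kd0 + kdq*N equal to x0 + xq*N and solving gives N = (kd0 - x0)/(xq - kdq)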
print "kd_tree_fast outperforms naive_fast above:", int((kd0-nf0)/(nfq-kdq)), "queries"
print "kd_tree_fast outperforms tunnel_fast above:", int((kd0-tf0)/(tfq-kdq)), "queries"
The advantage of using a KD-tree grows with the number of points in the search set: a KD-tree query is O(log N) while the other algorithms are O(N), the same difference as between binary search and linear search.
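To close, here's a sketch of the intended usage pattern, one setup amortized over many queries (it reuses the Kdtree_fast class defined above; the list of query points is just an illustration):
ncfile = netCDF4.Dataset(filename, 'r')
ns = Kdtree_fast(ncfile, 'Latitude', 'Longitude')  # setup once
for lat0, lon0 in [(50.0, -140.0), (40.0, -120.0), (60.0, -160.0)]:
    iy, ix = ns.query(lat0, lon0)                  # each query is then cheap
    print lat0, lon0, '->', ns.latvar[iy, ix], ns.lonvar[iy, ix]
ncfile.close()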