This notebook goes with the Agile Scientific blog post from 8 March 2019.
I'm using a small dataset originally from Geoff Bohling at the Kansas Geological Survey. I can no longer find the data online.
We will look at four ways to interpolate these scattered data onto a regular grid:

- scipy.interpolate.Rbf
- scipy.interpolate.griddata()
- sklearn.gaussian_process
- verde.Spline (thank you to Leo Uieda for contributing this!)
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
We'll load the data from the web; if you are offline but have the repo, the data file is also in ../data/ZoneA.dat.
In [2]:
# Essential line number 1.
df = pd.read_csv('https://www.dropbox.com/s/6dyfc4fl5slhgry/ZoneA.dat?raw=1',
                 sep=' ',
                 header=9,
                 usecols=[0, 1, 2, 3],
                 names=['x', 'y', 'thick', 'por'],
                 dtype="float64"
                )
In [3]:
df.head()
Out[3]:
In [4]:
df.describe()
Out[4]:
In [5]:
sns.distplot(df.por)
Out[5]:
This looks a bit unpleasant, but we're just getting the min and max values of the x and y columns, padded by 1000 m on each side.
In [6]:
# Line 2.
extent = x_min, x_max, y_min, y_max = [df.x.min()-1000, df.x.max()+1000,
                                       df.y.min()-1000, df.y.max()+1000]
Later on, we'll see a nicer way to do this using the Verde library.
Now we can plot the data:
In [7]:
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df.x, df.y, c=df.por)
ax.set_aspect(1)
ax.set_xlim(*extent[:2])
ax.set_ylim(*extent[2:])
ax.set_xlabel('Easting [m]')
ax.set_ylabel('Northing [m]')
ax.set_title('Porosity %')
ax.grid(c='k', alpha=0.2)
plt.show()
In [8]:
# Line 3.
grid_x, grid_y = np.mgrid[x_min:x_max:500, y_min:y_max:500]
# Use a complex step (e.g. 100j) to specify the *number* (not the size) of cells:
# grid_x, grid_y = np.mgrid[x_min:x_max:100j, y_min:y_max:100j]
In [9]:
plt.figure(figsize=(10,6))
plt.scatter(grid_x, grid_y, s=10)
Out[9]:
In [10]:
from scipy.interpolate import Rbf
# Make an n-dimensional interpolator. This is essential line number 4.
rbfi = Rbf(df.x, df.y, df.por)
# Predict on the regular grid. Line 5.
di = rbfi(grid_x, grid_y)
Let's plot the result. First, we'll need the min and max of the combined sparse and gridded data, so we can plot them with the same colourmap ranges:
In [11]:
mi = np.min(np.hstack([di.ravel(), df.por.values]))
ma = np.max(np.hstack([di.ravel(), df.por.values]))
Notice the transpose and the origin='lower', to keep everything matched up with the original dataset.
In [13]:
plt.figure(figsize=(15,15))
c1 = plt.imshow(di.T, origin="lower", extent=extent, vmin=mi, vmax=ma)
c2 = plt.scatter(df.x, df.y, s=60, c=df.por, edgecolor='#ffffff66', vmin=mi, vmax=ma)
plt.colorbar(c1, shrink=0.67)
plt.show()
The circles (the data) are the same colour as the grid (the model), so we can see that the error on this prediction is almost zero. In fact, the default parameters force the model to pass through all the data points (interpolation, as opposed to estimation or approximation).
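We can check this directly by predicting at the input locations; a quick sketch using the rbfi we just made:

# With the default smooth=0, the RBF should reproduce the inputs (almost) exactly.
resid = rbfi(df.x, df.y) - df.por
print(np.abs(resid).max())  # Expect something near zero, up to numerical precision.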
The Rbf() interpolator has a few options. The most important is probably smooth: any value above 0 relaxes the constraint that the surface must pass through every data point. It's also the thing to increase if you end up with a singular matrix error, because the solver can't converge on a solution.
You can also change the basis function (the default is multiquadric); some functions, including the multiquadric, take an epsilon parameter that varies the range of influence of each point.
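For example, here's a sketch using a Gaussian basis function; the epsilon and smooth values are purely illustrative, not recommendations:

# A Gaussian basis with a roughly 2 km range of influence; values are illustrative.
rbfi_gauss = Rbf(df.x, df.y, df.por, function='gaussian', epsilon=2000, smooth=0.1)
di_gauss = rbfi_gauss(grid_x, grid_y)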
In [15]:
rbfi = Rbf(df.x, df.y, df.por, smooth=0.2)
di = rbfi(grid_x, grid_y)
plt.imshow(di.T, origin="lower", extent=extent)
plt.scatter(df.x, df.y, s=2, c='w')
plt.show()
We can also make a histogram and kernel density estimation of the errors, by making predictions at the original input locations:
In [16]:
por_hat = rbfi(df.x, df.y)
sns.distplot(por_hat - df.por)
Out[16]:
With the smoothing set to 0.2, we end up with a smoother surface, but pay for it with larger errors.
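To put a single number on that trade-off, we can compute the root-mean-square error of these residuals (a sketch, reusing por_hat from the cell above):

# RMSE of the smoothed model at the well locations.
rmse = np.sqrt(np.mean((por_hat - df.por)**2))
print(f'RMSE: {rmse:.4f}')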
The Rbf() interpolator, with its many useful parameters, is probably the only one you need to know about. But there is also scipy.interpolate.griddata(); for example, see this SciPy recipe.
The interface is slightly different — we have to pass a single array of the coordinates (the (x, y) locations of the points we know). We also pass the values to interpolate, and the grids.
The function will not accept Pandas Series objects, so we'll use the Series.values attribute to get at the NumPy array representation.
First, let's make the 2D array of coordinates:
In [17]:
points = df[['x', 'y']].values
The gridding step is easy. We'll try three different algorithms:
In [18]:
from scipy.interpolate import griddata
grid_z0 = griddata(points, df.por.values, (grid_x, grid_y), method='nearest')
grid_z1 = griddata(points, df.por.values, (grid_x, grid_y), method='linear')
grid_z2 = griddata(points, df.por.values, (grid_x, grid_y), method='cubic')
Inspect the results.
In [19]:
fig, axs = plt.subplots(ncols=3, figsize=(15, 5))
ax = axs[0]
ax.imshow(grid_z0.T, origin='lower', extent=extent)
ax.scatter(df.x, df.y, s=2, c='w')
ax.set_title('Nearest')
ax = axs[1]
ax.imshow(grid_z1.T, origin='lower', extent=extent)
ax.scatter(df.x, df.y, s=2, c='w')
ax.set_title('Linear')
ax = axs[2]
ax.imshow(grid_z2.T, origin='lower', extent=extent)
ax.scatter(df.x, df.y, s=2, c='w')
ax.set_title('Cubic')
plt.show()
I don't particularly like any of these results.
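Part of the problem is that the linear and cubic methods return NaN outside the convex hull of the data points, which is why the corners of those panels are blank. A quick check (a sketch, using grid_z1 from above):

# Count the grid cells that linear interpolation could not fill.
print(np.isnan(grid_z1).sum(), 'of', grid_z1.size, 'cells are NaN')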
Modeling with a Gaussian process is equivalent to kriging. Conveniently, the popular machine learning library scikit-learn has a Gaussian process modeling tool.
In [20]:
from sklearn.gaussian_process.kernels import RBF
kernel = RBF(length_scale=1000)
The main hyperparameters are the kernel, which we just defined, and alpha, which controls the smoothness. Larger values of alpha imply more noise in the input data and result in smoother grids; the default is very small: 1 × 10⁻¹⁰.
In [23]:
from sklearn.gaussian_process import GaussianProcessRegressor
gp = GaussianProcessRegressor(normalize_y=True,
                              alpha=0.1,  # Larger values imply more noise in the input data.
                              kernel=kernel)
gp.fit(df[['x', 'y']].values, df.por.values)
Out[23]:
To make a prediction, we need to construct the X matrix: (x, y) coordinates in 2 columns:
In [24]:
X_grid = np.stack([grid_x.ravel(), grid_y.ravel()]).T
Now we can make a prediction:
In [25]:
y_grid = gp.predict(X_grid).reshape(grid_x.shape)
And plot the predicted grid with the input data using the same colourmap:
In [26]:
# Compute min and max of all the data:
mi = np.min(np.hstack([y_grid.ravel(), df.por.values]))
ma = np.max(np.hstack([y_grid.ravel(), df.por.values]))
# Plot it all.
plt.figure(figsize=(15,15))
im = plt.imshow(y_grid.T, origin='lower', extent=extent, vmin=mi, vmax=ma)
pts = plt.scatter(df.x, df.y, c=df.por, s=80, edgecolor='#ffffff66', vmin=mi, vmax=ma)
plt.colorbar(im, shrink=0.67)
plt.show()
As before, we can compute the error by making a prediction on the original (x, y) values and comparing to the actual measured porosities at those locations:
In [27]:
por_hat = gp.predict(df[['x', 'y']].values)
sns.distplot(por_hat - df.por)
Out[27]:
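Since a Gaussian process is a probabilistic model, predict can also return the standard deviation of each prediction, which is analogous to the kriging variance. A minimal sketch, using the standard scikit-learn return_std option (the names y_mean, y_std are ours):

# Ask scikit-learn for the predictive uncertainty as well as the mean.
y_mean, y_std = gp.predict(X_grid, return_std=True)
y_std_grid = y_std.reshape(grid_x.shape)  # Could be mapped just like the mean.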
One of the options in SciPy's Rbf interpolator is the 'thin_plate' kernel. The verde.Spline interpolator is based on the same idea, with a few modifications, such as damping regularization to smooth the solution. It's similar to the Rbf and GaussianProcessRegressor approaches, but Verde provides a more convenient API for gridding tasks.
For example, we now have a nicer way to define extent, using vd.pad_region():
In [28]:
import verde as vd
extent = x_min, x_max, y_min, y_max = vd.pad_region(vd.get_region((df.x, df.y)), pad=1000)
In [29]:
spline = vd.Spline(mindist=2000, damping=1e-4)
spline.fit((df.x, df.y), df.por)
Out[29]:
To make a grid, use the .grid method of the spline:
In [30]:
grid = spline.grid(region=extent, spacing=500)
grid
Out[30]:
This returns an xarray.Dataset which can be easily plotted, saved to disk as netCDF, or used for computations. The coordinates for the grid are automatically generated and populated based on the desired region and spacing. The spacing is adjusted to fit the desired region exactly. Optionally, you can set adjust="region" to adjust the size of the region so that the spacing is exact.
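For instance, here's a sketch of saving the grid to netCDF (the filename is hypothetical) and of regenerating it with an exact 500 m spacing by letting the region adjust instead:

# Save the gridded porosity to disk as netCDF.
grid.to_netcdf('porosity_grid.nc')
# Keep the spacing exactly 500 m; stretch the region if necessary.
grid_exact = spline.grid(region=extent, spacing=500, adjust='region')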
And plot the predicted grid with the input data using the same colourmap:
In [31]:
# Compute min and max of all the data:
mi = np.min(np.hstack([grid.scalars.values.ravel(), df.por.values]))
ma = np.max(np.hstack([grid.scalars.values.ravel(), df.por.values]))
# Plot it all.
plt.figure(figsize=(15,15))
im = plt.imshow(grid.scalars, origin='lower', extent=extent, vmin=mi, vmax=ma)
pts = plt.scatter(df.x, df.y, c=df.por, s=80, edgecolor='#ffffff66', vmin=mi, vmax=ma)
plt.colorbar(im, shrink=0.67)
plt.show()
As before, we can compute the error by making a prediction on the original (x, y) values and comparing to the actual measured porosities at those locations:
In [32]:
por_hat = spline.predict((df.x, df.y))
sns.distplot(por_hat - df.por)
Out[32]:
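Verde also plays nicely with scikit-learn-style model selection. For instance, a sketch assuming Verde's cross_val_score helper, with the same (untuned) hyperparameters we used above:

# Cross-validate the spline; returns one R² score per fold.
scores = vd.cross_val_score(vd.Spline(mindist=2000, damping=1e-4), (df.x, df.y), df.por)
print(scores)

Finally, here's the whole RBF workflow from this notebook gathered into a single cell: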
In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import Rbf
# Load the data.
df = pd.read_csv('https://www.dropbox.com/s/6dyfc4fl5slhgry/ZoneA.dat?raw=1',
                 sep=' ',
                 header=9,
                 usecols=[0, 1, 2, 3],
                 names=['x', 'y', 'thick', 'por']
                )
# Build a regular grid with 500-metre cells.
extent = x_min, x_max, y_min, y_max = [df.x.min()-1000, df.x.max()+1000,
                                       df.y.min()-1000, df.y.max()+1000]
grid_x, grid_y = np.mgrid[x_min:x_max:500, y_min:y_max:500]
# Make the interpolator and do the interpolation.
rbfi = Rbf(df.x, df.y, df.por)
di = rbfi(grid_x, grid_y)
# Make the plot.
plt.figure(figsize=(15,15))
plt.imshow(di.T, origin="lower", extent=extent)
cb = plt.scatter(df.x, df.y, s=60, c=df.por, edgecolor='#ffffff66')
plt.colorbar(cb, shrink=0.67)
plt.show()
© 2019 Agile Scientific, licensed CC-BY