This notebook goes with the Agile Scientific blog post from 8 March 2019.
I'm using a small dataset originally from Geoff Bohling at the Kansas Geological Survey. I can no longer find the data online.
We will look at four ways to interpolate these scattered data onto a regular grid:

- scipy.interpolate.Rbf
- scipy.interpolate.griddata()
- sklearn.gaussian_process
- verde.Spline (thank you to Leo Uieda for contributing this!)
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
We'll load the data from the web; if you are offline but have the repo, the data file is also in ../data/ZoneA.dat.
In [2]:
# Essential line number 1.
df = pd.read_csv('https://www.dropbox.com/s/6dyfc4fl5slhgry/ZoneA.dat?raw=1',
                 sep=' ',
                 header=9,
                 usecols=[0, 1, 2, 3],
                 names=['x', 'y', 'thick', 'por'],
                 dtype="float64"
                )
In [3]:
df.head()
Out[3]:
In [4]:
df.describe()
Out[4]:
In [5]:
sns.distplot(df.por)
Out[5]:
This looks a bit unpleasant, but we're just getting the min and max values of the x and y columns, padded by 1000 m on each side.
In [6]:
# Line 2.
extent = x_min, x_max, y_min, y_max = [df.x.min()-1000, df.x.max()+1000,
                                       df.y.min()-1000, df.y.max()+1000]
Later on, we'll see a nicer way to do this using the Verde library.
Now we can plot the data:
In [7]:
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df.x, df.y, c=df.por)
ax.set_aspect(1)
ax.set_xlim(*extent[:2])
ax.set_ylim(*extent[2:])
ax.set_xlabel('Easting [m]')
ax.set_ylabel('Northing [m]')
ax.set_title('Porosity %')
ax.grid(c='k', alpha=0.2)
plt.show()
In [8]:
# Line 3.
grid_x, grid_y = np.mgrid[x_min:x_max:500, y_min:y_max:500]
# Use a complex step (e.g. 100j) to specify the *number* (not the size) of cells:
# grid_x, grid_y = np.mgrid[x_min:x_max:100j, y_min:y_max:100j]
In [9]:
plt.figure(figsize=(10,6))
plt.scatter(grid_x, grid_y, s=10)
Out[9]:
In [10]:
from scipy.interpolate import Rbf
# Make an n-dimensional interpolator. This is essential line number 4.
rbfi = Rbf(df.x, df.y, df.por)
# Predict on the regular grid. Line 5.
di = rbfi(grid_x, grid_y)
Let's plot the result. First, we'll need the min and max of the combined sparse and gridded data, so we can plot them with the same colourmap ranges:
In [11]:
mi = np.min(np.hstack([di.ravel(), df.por.values]))
ma = np.max(np.hstack([di.ravel(), df.por.values]))
Notice the transpose and the origin='lower', to keep everything matched up with the original dataset.
In [13]:
plt.figure(figsize=(15,15))
c1 = plt.imshow(di.T, origin="lower", extent=extent, vmin=mi, vmax=ma)
c2 = plt.scatter(df.x, df.y, s=60, c=df.por, edgecolor='#ffffff66', vmin=mi, vmax=ma)
plt.colorbar(c1, shrink=0.67)
plt.show()
The circles (the data) are the same colour as the grid (the model), so we can see that the error on this prediction is almost zero. In fact, the default parameters force the model to pass through all the data points (interpolation, as opposed to estimation or approximation).
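We can check this directly by predicting at the input locations; a quick sketch using the rbfi we just made:

# With the default smooth=0, the RBF should reproduce the inputs (almost) exactly.
resid = rbfi(df.x, df.y) - df.por
print(np.abs(resid).max())  # Expect something near zero, up to numerical precision.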
The Rbf() interpolator has a few options. The most important is probably smooth: any value above 0 relaxes the constraint that the surface must pass through every data point. It's also the thing to increase if you end up with a singular matrix error, because the solver can't converge on a solution.
You can also change the basis function (the default is multiquadric); some functions, including the multiquadric, take an epsilon parameter that varies the range of influence of each point.
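For example, here's a sketch using a Gaussian basis function; the epsilon and smooth values are purely illustrative, not recommendations:

# A Gaussian basis with a roughly 2 km range of influence; values are illustrative.
rbfi_gauss = Rbf(df.x, df.y, df.por, function='gaussian', epsilon=2000, smooth=0.1)
di_gauss = rbfi_gauss(grid_x, grid_y)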
In [15]:
rbfi = Rbf(df.x, df.y, df.por, smooth=0.2)
di = rbfi(grid_x, grid_y)
plt.imshow(di.T, origin="lower", extent=extent)
plt.scatter(df.x, df.y, s=2, c='w')
plt.show()
We can also make a histogram and kernel density estimation of the errors, by making predictions at the original input locations:
In [16]:
por_hat = rbfi(df.x, df.y)
sns.distplot(por_hat - df.por)
Out[16]:
With the smoothing set to 0.2, we end up with a smoother surface, but pay for it with larger errors.
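To put a single number on that trade-off, we can compute the root-mean-square error of these residuals (a sketch, reusing por_hat from the cell above):

# RMSE of the smoothed model at the well locations.
rmse = np.sqrt(np.mean((por_hat - df.por)**2))
print(f'RMSE: {rmse:.4f}')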
The Rbf() interpolator, with its many useful parameters, is probably the only one you need to know about. But there is also scipy.interpolate.griddata(); for example, see this SciPy recipe.
The interface is slightly different — we have to pass a single array of the coordinates (the (x, y) locations of the points we know). We also pass the values to interpolate, and the grids.
The function will not accept Pandas Series objects, so we'll use the Series.values attribute to get at the NumPy array representation.
First, let's make the 2D array of coordinates:
In [17]:
points = df[['x', 'y']].values
The gridding step is easy. We'll try three different algorithms:
In [18]:
from scipy.interpolate import griddata
grid_z0 = griddata(points, df.por.values, (grid_x, grid_y), method='nearest')
grid_z1 = griddata(points, df.por.values, (grid_x, grid_y), method='linear')
grid_z2 = griddata(points, df.por.values, (grid_x, grid_y), method='cubic')
Inspect the results.
In [19]:
fig, axs = plt.subplots(ncols=3, figsize=(15, 5))
ax = axs[0]
ax.imshow(grid_z0.T, origin='lower', extent=extent)
ax.scatter(df.x, df.y, s=2, c='w')
ax.set_title('Nearest')
ax = axs[1]
ax.imshow(grid_z1.T, origin='lower', extent=extent)
ax.scatter(df.x, df.y, s=2, c='w')
ax.set_title('Linear')
ax = axs[2]
ax.imshow(grid_z2.T, origin='lower', extent=extent)
ax.scatter(df.x, df.y, s=2, c='w')
ax.set_title('Cubic')
plt.show()
I don't particularly like any of these results.
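Part of the problem is that the linear and cubic methods return NaN outside the convex hull of the data points, which is why the corners of those panels are blank. A quick check (a sketch, using grid_z1 from above):

# Count the grid cells that linear interpolation could not fill.
print(np.isnan(grid_z1).sum(), 'of', grid_z1.size, 'cells are NaN')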
Modeling with a Gaussian process is equivalent to kriging. Conveniently, the popular machine learning library scikit-learn has a Gaussian process modeling tool.
In [20]:
from sklearn.gaussian_process.kernels import RBF
kernel = RBF(length_scale=1000)
The main hyperparameters are the kernel, which we just defined, and alpha, which controls the smoothness. Larger values of alpha imply more noise in the input data and result in smoother grids; the default is very small: 1 × 10⁻¹⁰.
In [23]:
from sklearn.gaussian_process import GaussianProcessRegressor
gp = GaussianProcessRegressor(normalize_y=True,
                              alpha=0.1,  # Larger values imply more noise in the input data.
                              kernel=kernel)
gp.fit(df[['x', 'y']].values, df.por.values)
Out[23]:
To make a prediction, we need to construct the X matrix: (x, y) coordinates in 2 columns:
In [24]:
X_grid = np.stack([grid_x.ravel(), grid_y.ravel()]).T
Now we can make a prediction:
In [25]:
y_grid = gp.predict(X_grid).reshape(grid_x.shape)
And plot the predicted grid with the input data using the same colourmap:
In [26]:
# Compute min and max of all the data:
mi = np.min(np.hstack([y_grid.ravel(), df.por.values]))
ma = np.max(np.hstack([y_grid.ravel(), df.por.values]))
# Plot it all.
plt.figure(figsize=(15,15))
im = plt.imshow(y_grid.T, origin='lower', extent=extent, vmin=mi, vmax=ma)
pts = plt.scatter(df.x, df.y, c=df.por, s=80, edgecolor='#ffffff66', vmin=mi, vmax=ma)
plt.colorbar(im, shrink=0.67)
plt.show()
As before, we can compute the error by making a prediction on the original (x, y) values and comparing to the actual measured porosities at those locations:
In [27]:
por_hat = gp.predict(df[['x', 'y']].values)
sns.distplot(por_hat - df.por)
Out[27]:
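Since a Gaussian process is a probabilistic model, predict can also return the standard deviation of each prediction, which is analogous to the kriging variance. A minimal sketch, using the standard scikit-learn return_std option (the names y_mean, y_std are ours):

# Ask scikit-learn for the predictive uncertainty as well as the mean.
y_mean, y_std = gp.predict(X_grid, return_std=True)
y_std_grid = y_std.reshape(grid_x.shape)  # Could be mapped just like the mean.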
One of the options in SciPy's Rbf interpolator is the 'thin_plate' kernel. The verde.Spline interpolator is based on the same idea, with a few modifications, such as damping regularization to smooth the solution. It's similar to the Rbf and GaussianProcessRegressor approaches, but Verde provides a more convenient API for gridding tasks.
For example, we now have a nicer way to define extent, using vd.pad_region():
In [28]:
import verde as vd
extent = x_min, x_max, y_min, y_max = vd.pad_region(vd.get_region((df.x, df.y)), pad=1000)
In [29]:
spline = vd.Spline(mindist=2000, damping=1e-4)
spline.fit((df.x, df.y), df.por)
Out[29]:
To make a grid, use the .grid method of the spline:
In [30]:
grid = spline.grid(region=extent, spacing=500)
grid
Out[30]:
This returns an xarray.Dataset which can be easily plotted, saved to disk as netCDF, or used for computations. The coordinates for the grid are automatically generated and populated based on the desired region and spacing. The spacing is adjusted to fit the desired region exactly. Optionally, you can set adjust="region" to adjust the size of the region so that the spacing is exact.
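For instance, here's a sketch of saving the grid to netCDF (the filename is hypothetical) and of regenerating it with an exact 500 m spacing by letting the region adjust instead:

# Save the gridded porosity to disk as netCDF.
grid.to_netcdf('porosity_grid.nc')
# Keep the spacing exactly 500 m; stretch the region if necessary.
grid_exact = spline.grid(region=extent, spacing=500, adjust='region')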
And plot the predicted grid with the input data using the same colourmap:
In [31]:
# Compute min and max of all the data:
mi = np.min(np.hstack([grid.scalars.values.ravel(), df.por.values]))
ma = np.max(np.hstack([grid.scalars.values.ravel(), df.por.values]))
# Plot it all.
plt.figure(figsize=(15,15))
im = plt.imshow(grid.scalars, origin='lower', extent=extent, vmin=mi, vmax=ma)
pts = plt.scatter(df.x, df.y, c=df.por, s=80, edgecolor='#ffffff66', vmin=mi, vmax=ma)
plt.colorbar(im, shrink=0.67)
plt.show()
As before, we can compute the error by making a prediction on the original (x, y) values and comparing to the actual measured porosities at those locations:
In [32]:
por_hat = spline.predict((df.x, df.y))
sns.distplot(por_hat - df.por)
Out[32]:
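Verde also plays nicely with scikit-learn-style model selection. For instance, a sketch assuming Verde's cross_val_score helper, with the same (untuned) hyperparameters we used above:

# Cross-validate the spline; returns one R² score per fold.
scores = vd.cross_val_score(vd.Spline(mindist=2000, damping=1e-4), (df.x, df.y), df.por)
print(scores)

Finally, here's the whole RBF workflow from this notebook gathered into a single cell: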
In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import Rbf
# Load the data.
df = pd.read_csv('https://www.dropbox.com/s/6dyfc4fl5slhgry/ZoneA.dat?raw=1',
                 sep=' ',
                 header=9,
                 usecols=[0, 1, 2, 3],
                 names=['x', 'y', 'thick', 'por']
                )
# Build a regular grid with 500-metre cells.
extent = x_min, x_max, y_min, y_max = [df.x.min()-1000, df.x.max()+1000,
                                       df.y.min()-1000, df.y.max()+1000]
grid_x, grid_y = np.mgrid[x_min:x_max:500, y_min:y_max:500]
# Make the interpolator and do the interpolation.
rbfi = Rbf(df.x, df.y, df.por)
di = rbfi(grid_x, grid_y)
# Make the plot.
plt.figure(figsize=(15,15))
plt.imshow(di.T, origin="lower", extent=extent)
cb = plt.scatter(df.x, df.y, s=60, c=df.por, edgecolor='#ffffff66')
plt.colorbar(cb, shrink=0.67)
plt.show()
© 2019 Agile Scientific, licensed CC-BY