This tutorial will guide you through the theoretical variogram models available for the Variogram class.
In this tutorial you will learn how the choice of a theoretical model affects the variogram fit and the resulting Kriging interpolation.
In [1]:
from skgstat import Variogram, OrdinaryKriging
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
In [2]:
%env SKG_SUPPRESS = true
There are three prepared data sets in the ./data folder. Each of them is a generated random field with different underlying spatial properties. We will use only the first one, but you can re-run all the examples with any of the other fields.
In [3]:
data1 = pd.read_csv('data/sample_matern_15.csv')
data2 = pd.read_csv('data/sample_matern_40.csv')
data3 = pd.read_csv('data/sample_spherical_noise.csv')
In [4]:
def plot_scatter(data, ax):
    art = ax.scatter(data.x, data.y, 50, c=data.z, cmap='plasma')
    plt.colorbar(art, ax=ax)
In [5]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for data, ax in zip((data1, data2, data3), axes.flatten()):
    plot_scatter(data, ax)
One of the features of scikit-gstat is its object-oriented design. That means we can simply instantiate a Variogram object and keep changing its arguments until it models the spatial dependency in our observations well.
In [6]:
V1 = Variogram(data1[['x', 'y']].values, data1.z.values, normalize=False)
V1.plot(show=False);
The data set includes 200 observations, so we can increase the number of lag classes. Additionally, the histogram shows that lags over 100 units are backed by only a few observations. Thus, we can limit the lag classes to a maximum lag of 100.
In [7]:
V1.maxlag = 100
V1.n_lags = 25
V1.plot(show=False);
That's not too bad. Now we can try different theoretical models. It is always a good idea to judge the fit visually, especially because we want the model to fit the close bins more accurately than the distant ones, as they will ultimately determine the Kriging weights.
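The idea of weighting close bins more strongly can be sketched with a plain scipy curve fit, where a sigma that grows with the lag down-weights distant bins. This is only a minimal illustration using a textbook spherical model and synthetic bin values, independent of scikit-gstat's own fitting machinery:

```python
import numpy as np
from scipy.optimize import curve_fit

def spherical(h, r, c):
    """Textbook spherical model: rises to sill c at range r, constant beyond."""
    s = np.clip(np.asarray(h, dtype=float) / r, 0.0, 1.0)
    return c * (1.5 * s - 0.5 * s**3)

rng = np.random.default_rng(42)
lags = np.linspace(5, 100, 20)
gamma = spherical(lags, 40.0, 1.0) + rng.normal(0, 0.02, lags.size)

# sigma grows with the lag, so close bins dominate the least-squares fit
sigma = np.linspace(1.0, 5.0, lags.size)
(r_fit, c_fit), _ = curve_fit(spherical, lags, gamma, p0=[50.0, 1.0], sigma=sigma)
print(round(r_fit, 1), round(c_fit, 2))
```

The same effect can be achieved in scikit-gstat without leaving the library, via the fitting options of the Variogram class.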
Nevertheless, Variogram has a rmse and an r2 property that can be used as quality measures for the fit. The Variogram.plot function also accepts one or two matplotlib subplot axes to plot the lag classes histogram and the variogram into. The histogram can also be turned off.
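As a rough sketch of what such quality measures capture (the exact definitions used inside scikit-gstat may differ in detail), RMSE and r² between the experimental bin values and the fitted model values can be computed directly:

```python
import numpy as np

def rmse(observed, modeled):
    # root mean square error between experimental and modeled semivariances
    observed, modeled = np.asarray(observed), np.asarray(modeled)
    return np.sqrt(np.mean((observed - modeled) ** 2))

def r_squared(observed, modeled):
    # coefficient of determination: 1 - SS_res / SS_tot
    observed, modeled = np.asarray(observed), np.asarray(modeled)
    ss_res = np.sum((observed - modeled) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# hypothetical experimental and modeled semivariance values for five bins
obs = np.array([0.2, 0.5, 0.8, 0.95, 1.0])
mod = np.array([0.25, 0.45, 0.85, 0.9, 1.0])
print(rmse(obs, mod), r_squared(obs, mod))
```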
In [8]:
fig, _a = plt.subplots(2, 3, figsize=(18, 10), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable', 'cubic')):
    V1.model = model
    V1.plot(axes=axes[i], hist=False, show=False)
    axes[i].set_title('Model: %s; RMSE: %.2f' % (model, V1.rmse))
This is quite important. In terms of RMSE, all six models describe the experimental variogram about equally well. However, the cubic and gaussian models are off the experimental values almost all the time: on short distances they underestimate, and on medium distances (up to the effective range) they overestimate. The exponential model overestimates all the time. The spherical, matern and stable models seem to be pretty good on short distances.
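The different short-distance behavior follows from the model shapes near the origin: the gaussian model starts out flat (roughly quadratic in the lag), while the spherical model rises almost linearly. A small sketch using textbook forms of both models (the parametrization inside scikit-gstat may use a slightly different effective-range scaling):

```python
import numpy as np

def spherical(h, r, c=1.0):
    # rises ~linearly near the origin, reaches the sill c exactly at range r
    s = np.clip(np.asarray(h, dtype=float) / r, 0.0, 1.0)
    return c * (1.5 * s - 0.5 * s**3)

def gaussian(h, r, c=1.0):
    # parabolic near the origin; the 3 in the exponent makes r the effective range
    h = np.asarray(h, dtype=float)
    return c * (1.0 - np.exp(-3.0 * (h / r) ** 2))

# at a tenth of the range, the gaussian model is still almost at zero
print(spherical(10, 100))  # ~0.15
print(gaussian(10, 100))   # ~0.03
```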
But what does this difference look like when it comes to interpolation?
In [9]:
def interpolate(V, ax):
    xx, yy = np.mgrid[0:99:100j, 0:99:100j]
    ok = OrdinaryKriging(V, min_points=5, max_points=15, mode='exact')
    field = ok.transform(xx.flatten(), yy.flatten()).reshape(xx.shape)
    art = ax.matshow(field, origin='lower', cmap='plasma')
    ax.set_title('%s model' % V.model.__name__)
    plt.colorbar(art, ax=ax)
    return field
In [10]:
fields = []
fig, _a = plt.subplots(2, 3, figsize=(18, 12), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable', 'cubic')):
    V1.model = model
    fields.append(interpolate(V1, axes[i]))
In [11]:
pd.DataFrame({'spherical': fields[0].flatten(), 'exponential': fields[1].flatten(), 'gaussian': fields[2].flatten(),
              'matern': fields[3].flatten(), 'stable': fields[4].flatten(), 'cubic': fields[5].flatten()}).describe()
Out[11]:
While most of the results look fairly similar, there are a few differences to notice. Keep in mind that we had quite a lot of observations here. What do the results look like when the number of observations is decreased?
In the next two sections we will run the same code, but on just a quarter and then only 10% of all available observations. First, the quarter (the first 50 samples), starting with the variogram:
In [12]:
subset1 = data1.iloc[:50]
V2 = Variogram(subset1[['x', 'y']].values, subset1.z.values, normalize=False, maxlag=100, n_lags=25)
V2.plot(show=False);
In [13]:
fig, _a = plt.subplots(2, 3, figsize=(18, 10), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable', 'cubic')):
    V2.model = model
    V2.plot(axes=axes[i], hist=False, show=False)
    axes[i].set_title('Model: %s; RMSE: %.2f' % (model, V2.rmse))
In [14]:
fields = []
fig, _a = plt.subplots(2, 3, figsize=(18, 12), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable', 'cubic')):
    V2.model = model
    fields.append(interpolate(V2, axes[i]))
In [15]:
pd.DataFrame({'spherical': fields[0].flatten(), 'exponential': fields[1].flatten(), 'gaussian': fields[2].flatten(),
              'matern': fields[3].flatten(), 'stable': fields[4].flatten(), 'cubic': fields[5].flatten()}).describe()
Out[15]:
Now we repeat the analysis on only 10% of the observations (the last 20 samples). Again, we look at the variogram first:
In [16]:
subset2 = data1.iloc[180:]
V3 = Variogram(subset2[['x', 'y']].values, subset2.z.values, normalize=False, maxlag=100, n_lags=15)
V3.plot(hist=False, show=False);
In [17]:
fig, _a = plt.subplots(2, 3, figsize=(18, 10), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable', 'cubic')):
    V3.model = model
    V3.plot(axes=axes[i], hist=False, show=False)
    axes[i].set_title('Model: %s; RMSE: %.2f' % (model, V3.rmse))
In this example, we based the variogram analysis on only 20 observations, a number that could be considered the lower bound for geostatistics. The RMSE values are decreasing as the experimental variograms become more scattered. However, all six models seem to fit the experimental data fairly well. From the figure above alone, it is hard to tell which one is correct.
In [18]:
fields = []
fig, _a = plt.subplots(2, 3, figsize=(18, 12), sharex=True, sharey=True)
axes = _a.flatten()
for i, model in enumerate(('spherical', 'exponential', 'gaussian', 'matern', 'stable', 'cubic')):
    V3.model = model
    fields.append(interpolate(V3, axes[i]))
In [19]:
pd.DataFrame({'spherical': fields[0].flatten(), 'exponential': fields[1].flatten(), 'gaussian': fields[2].flatten(),
              'matern': fields[3].flatten(), 'stable': fields[4].flatten(), 'cubic': fields[5].flatten()}).describe()
Out[19]:
Here, some interesting things happened: we decreased the number of observations so far that the max_points attribute came into effect. In the previous cases the Kriging interpolator found so many close observations that limiting them to 15 meant estimations were usually derived from observations within a few units. Now, even if enough points are within the range, observations from medium distances are used as well. This is where the different shapes of the models come into effect, as could be seen from the last example.
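The role of min_points and max_points can be sketched as a plain nearest-neighbor selection. This is a simplified illustration of the idea, not scikit-gstat's actual implementation:

```python
import numpy as np

def select_neighbors(target, points, range_, min_points=5, max_points=15):
    """Pick up to max_points nearest observations within the variogram range.

    Returns None when fewer than min_points are in range, i.e. no estimate.
    """
    dists = np.sqrt(((points - target) ** 2).sum(axis=1))
    in_range = np.where(dists <= range_)[0]
    if in_range.size < min_points:
        return None
    # sort the in-range candidates by distance and cap at max_points
    return in_range[np.argsort(dists[in_range])][:max_points]

rng = np.random.default_rng(0)
points = rng.uniform(0, 100, size=(200, 2))
idx = select_neighbors(np.array([50.0, 50.0]), points, range_=40.0)
print(len(idx))  # capped at max_points = 15
```

With a dense sample, the cap means every estimate is built from the closest handful of observations; with a sparse sample, the same cap lets medium-distance observations enter, so the model's shape at medium lags starts to matter.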