Linear Model Fitting using empirical variogram

In classical geoestatistics it is frequent to use a sample-based variogram to fit a parametrised theoretical variogram. This method has proved to present biasses for the estimators of range, sill and shape. There are two reasons for this.

We are supposing that the observations are not random and the variogram is the true population variogram.
The variogram is built on summary statistics (bin size) and the curve-fitness can vary depending on this size.

Having said so, it is still a method used for estimating parameters and when the size of the data is big, being a good alternative to the likelihood methodology, a more robust optimization but computationally more complex.

So first thing first, import the neessary modules. Depending on your machine this may change. For the moment I work with this.



In [1]:

    
# Load Biospytial modules and etc.
%matplotlib inline
import sys
sys.path.append('/apps')
sys.path.append('/apps/external_plugins/spystats/')
import django
django.setup()

## Use the ggplot style
plt.style.use('ggplot')



In [6]:

    
import pandas as pd
from statsmodels.regression import linear_model
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np



In [4]:

    
import tools
import HEC_runs.fit_fia_logbiomass_logspp_GLS as auxiliary



In [7]:

    
## File locations, change accordingly
empirical_data_path = "/RawDataCSV/idiv_share/FIA_Plots_Biomass_11092017.csv"
variogram_path = "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv"

Geographic extent



In [11]:

    
minx = -130
maxx = -60
miny = 24
maxy = 50

Auxiliary functions

* Open file
* Convert it to Pandas with Geometry (GeoPandas)
* Reproject it to Lambert Conic Conformal
* Calculate OLS and append residuals



In [13]:

    
data = auxiliary.initAnalysis(empirical_data_path=empirical_data_path,plotdata_path=plotdata_path, maxx=maxx,minx=minx,maxy=maxy,miny=miny)









    



INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Reprojecting to Alberts equal area
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Removing possible duplicates. 
 This avoids problems of Non Positive semidefinite
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting OLS linear model: logBiomass ~ logSppN 
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Subselecting Region

For testing, subselect the first 50 data points



In [81]:

    
test_data = data.loc[0:49,:]

Now lets instantiate the variogram according to the results obtained in the cluster.

Remember: < Whittle Variogram : sill 0.340246718396, range 41188.0234423, nugget 0.329937603763, alpha1.12143687914 >



In [82]:

    
model = tools.WhittleVariogram(sill=0.340246718396,range_a=41188.0234423,nugget=0.329937603763,alpha=1.12143687914)

Let's see the correlation



In [83]:

    
import numpy as np
X = np.linspace(0,1000000,1000)
plt.plot(X,model.corr_f(X))









    Out[83]:





[<matplotlib.lines.Line2D at 0x7f6d865e6250>]



In [84]:

    
vg = tools.Variogram(test_data,'residuals',model=model)
MM = vg.calculateCovarianceMatrix()

Fit a GLS



In [85]:

    
lmod = linear_model.GLS.from_formula(formula='logBiomass ~ logSppN',data=test_data,sigma=MM)



In [86]:

    
results = lmod.fit()



In [ ]:



In [87]:

    
results.summary()









    Out[87]:





GLS Regression Results

  Dep. Variable:        logBiomass       R-squared:             0.739


  Model:                    GLS          Adj. R-squared:        0.733


  Method:              Least Squares     F-statistic:           135.8


  Date:              Thu, 08 Feb 2018    Prob (F-statistic):  1.35e-15


  Time:                  20:07:18        Log-Likelihood:      -37.860


  No. Observations:           50         AIC:                   79.72


  Df Residuals:               48         BIC:                   83.54


  Df Model:                    1                                     


  Covariance Type:       nonrobust                                   




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept      8.4659      0.159     53.276   0.000      8.146     8.785


  logSppN        0.3405      0.097      3.508   0.001      0.145     0.536




  Omnibus:         0.534    Durbin-Watson:         2.038


  Prob(Omnibus):   0.766    Jarque-Bera (JB):      0.661


  Skew:           -0.205    Prob(JB):              0.719


  Kurtosis:        2.614    Cond. No.               3.82



In [88]:

    
MM









    Out[88]:





array([[  3.40246718e-01,   7.31546364e-03,   8.15024713e-03, ...,
          1.84364060e-06,   1.59910176e-05,   4.09168002e-05],
       [  7.31546364e-03,   3.40246718e-01,   5.76166346e-03, ...,
          2.70041624e-06,   2.45467297e-05,   6.41058708e-05],
       [  8.15024713e-03,   5.76166346e-03,   3.40246718e-01, ...,
          1.24826472e-06,   1.09900133e-05,   2.84811261e-05],
       ..., 
       [  1.84364060e-06,   2.70041624e-06,   1.24826472e-06, ...,
          3.40246718e-01,   1.37447932e-03,   4.82794798e-04],
       [  1.59910176e-05,   2.45467297e-05,   1.09900133e-05, ...,
          1.37447932e-03,   3.40246718e-01,   4.49004133e-03],
       [  4.09168002e-05,   6.41058708e-05,   2.84811261e-05, ...,
          4.82794798e-04,   4.49004133e-03,   3.40246718e-01]])



In [89]:

    
DD = vg.distance_coordinates



In [90]:

    
correl_mat = vg.model.corr_f(DD)



In [91]:

    
correl_mat.tofile('/outputs/correl_matf50.csv',sep=",",format="%s")



In [92]:

    
MM.tofile('/outputs/covar_matf50.csv',sep=",",format="%s")



In [93]:

    
ls /outputs









    



HEC_outputs/                                pvalues_gls.csv*
correl_matf50.csv*                          rsqs.csv*
covar_matf50.csv*                           test_denmark.tif*
elevation1.tif*                             test_denmark.tif.aux.xml*
elevation1.tif.aux.xml*                     test_denmark_tm.tif*
fia_plots_reproject.csv*                    test_denmark_tm.tif.aux.xml*
gls_confidence_int.csv*                     test_graph.gml*
gp_sigma_rho.png*                           test_slope.tif*
logbiomas_logsppn_residuals_variogram.csv*  test_slope.tif.aux.xml*
n_obs.csv*                                  test_tree_figure_1.png*
outputstest_fabacea.tif*                    testooooooo.tif*
outputstest_fabacea.tif.aux.xml*            testooooooo.tif.tif*
params2_gls.csv*                            testooooooo.tif.tif.aux.xml*
params_gls.csv*                             tflow_outs/



In [94]:

    
MM.tofile?



In [95]:

    
lmodco = linear_model.GLS.from_formula(formula='logBiomass ~ logSppN',data=test_data,sigma=correl_mat)



In [96]:

    
res=lmodco.fit()



In [97]:

    
res.summary()









    Out[97]:





GLS Regression Results

  Dep. Variable:        logBiomass       R-squared:             0.755


  Model:                    GLS          Adj. R-squared:        0.750


  Method:              Least Squares     F-statistic:           147.6


  Date:              Thu, 08 Feb 2018    Prob (F-statistic):  2.98e-16


  Time:                  20:07:25        Log-Likelihood:      -51.743


  No. Observations:           50         AIC:                   107.5


  Df Residuals:               48         BIC:                   111.3


  Df Model:                    1                                     


  Covariance Type:       nonrobust                                   




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept      8.4012      0.673     12.488   0.000      7.049     9.754


  logSppN        0.3233      0.103      3.125   0.003      0.115     0.531




  Omnibus:         1.360    Durbin-Watson:         1.760


  Prob(Omnibus):   0.507    Jarque-Bera (JB):      0.600


  Skew:            0.082    Prob(JB):              0.741


  Kurtosis:        3.511    Cond. No.               6.64



In [80]:

    
len(test_data)









    Out[80]:





51



In [ ]:

Dep. Variable:	logBiomass	R-squared:	0.739
Model:	GLS	Adj. R-squared:	0.733
Method:	Least Squares	F-statistic:	135.8
Date:	Thu, 08 Feb 2018	Prob (F-statistic):	1.35e-15
Time:	20:07:18	Log-Likelihood:	-37.860
No. Observations:	50	AIC:	79.72
Df Residuals:	48	BIC:	83.54
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	8.4659	0.159	53.276	0.000	8.146 8.785
logSppN	0.3405	0.097	3.508	0.001	0.145 0.536

Omnibus:	0.534	Durbin-Watson:	2.038
Prob(Omnibus):	0.766	Jarque-Bera (JB):	0.661
Skew:	-0.205	Prob(JB):	0.719
Kurtosis:	2.614	Cond. No.	3.82

Omnibus:	1.360	Durbin-Watson:	1.760
Prob(Omnibus):	0.507	Jarque-Bera (JB):	0.600
Skew:	0.082	Prob(JB):	0.741
Kurtosis:	3.511	Cond. No.	6.64