Spatial Model Fitting with GLS

In this exercise we will fit a linear model with a spatially structured covariance matrix, using Generalised Least Squares (GLS) to obtain better estimators than ordinary least squares.

As always we will need to load the necessary libraries.


In [4]:
ls


Analysis of spatial models using systematic and random samples.ipynb*
Model by chunks.ipynb*
Sandboxes/
Spatial Model Fitting using GLS.ipynb*
data/
demo/
global_variogram.ipynb*

In [6]:
# Load Biospytial modules and etc.
%matplotlib inline
import sys
sys.path.append('/apps/external_plugins/spystats/spystats/')
sys.path.append('..')
import django
django.setup()
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
## Use the ggplot style
plt.style.use('ggplot')
import tools

Use this to automate the process. Be careful: it can overwrite current results.

run ../HEC_runs/fit_fia_logbiomass_logspp_GLS.py /RawDataCSV/idiv_share/plotsClimateData_11092017.csv /apps/external_plugins/spystats/HEC_runs/results/logbiomas_logsppn_res.csv -85 -80 30 35

Importing data

We will use the FIA dataset and, for exemplary purposes, take a subsample of it. One important note: the empirical variogram has been calculated for the entire dataset using the residuals of an OLS model.

We will use some auxiliary functions defined in fit_fia_logbiomass_logspp_GLS. You can inspect any of these functions with IPython's ?? operator.


In [19]:
from HEC_runs.fit_fia_logbiomass_logspp_GLS import prepareDataFrame,loadVariogramFromData,buildSpatialStructure, calculateGLS, initAnalysis, fitGLSRobust

In [9]:
section = initAnalysis("/RawDataCSV/idiv_share/FIA_Plots_Biomass_11092017.csv",
                        "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv",-130,-60,30,40)


INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Reprojecting to Alberts equal area
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Removing possible duplicates. 
 This avoids problems of Non Positive semidefinite
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Fitting OLS linear model: logBiomass ~ logSppN 
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Subselecting Region

In [10]:
#section = initAnalysis("/RawDataCSV/idiv_share/plotsClimateData_11092017.csv",
#                        "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv",
#                       -85,-80,30,35)

# IN HEC
#section = initAnalysis("/home/hpc/28/escamill/csv_data/idiv/FIA_Plots_Biomass_11092017.csv","/home/hpc/28/escamill/spystats/HEC_runs/results/variogram/data_envelope.csv",-85,-80,30,35)

Now we will load the data from the precalculated empirical variogram.


In [15]:
gvg,tt = loadVariogramFromData("/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv",section)


INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Reading the empirical Variogram file
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Instantiating a Variogram object with the values calculated before
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Dropping possible Nans
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Instantiating Model...
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:fitting whittle Model with the empirical variogram
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Model fitted

In [16]:
gvg.plot(refresh=False,with_envelope=True)



In [20]:
resum,gvgn,resultspd,results = fitGLSRobust(section,gvg,num_iterations=50,distance_threshold=1000000)


INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Building Spatial Covariance Matrix
INFO:HEC_runs.fit_fia_logbiomass_logspp_GLS:Calculating Distance Matrix

MemoryErrorTraceback (most recent call last)
<ipython-input-20-97025d262ce0> in <module>()
----> 1 resum,gvgn,resultspd,results = fitGLSRobust(section,gvg,num_iterations=50,distance_threshold=1000000)

/apps/external_plugins/spystats/HEC_runs/fit_fia_logbiomass_logspp_GLS.py in fitGLSRobust(geodataframe, variogram_object, num_iterations, distance_threshold)
    200     for i in range(num_iterations):
    201         logger.info("Building Spatial Covariance Matrix")
--> 202         CovMat = buildSpatialStructure(geodataframe,variogram_object.model)
    203         logger.info("Calculating GLS estimators")
    204         results,resum = calculateGLS(geodataframe,CovMat)

/apps/external_plugins/spystats/HEC_runs/fit_fia_logbiomass_logspp_GLS.py in buildSpatialStructure(geodataframe, theoretical_model)
    130     secvg = tools.Variogram(geodataframe,'logBiomass',model=theoretical_model)
    131     logger.info("Calculating Distance Matrix")
--> 132     CovMat = secvg.calculateCovarianceMatrix()
    133     return CovMat
    134 

/apps/external_plugins/spystats/spystats/tools.pyc in calculateCovarianceMatrix(self)
    392         """
    393         MMdist = self.distance_coordinates
--> 394         Sigma = self.model.calculateCovarianceMatrixWith(MMdist)
    395         return Sigma
    396 

/apps/external_plugins/spystats/spystats/tools.pyc in calculateCovarianceMatrixWith(self, Mdist)
    697         nuggetId = self.nugget * np.identity(Mdist.shape[0])
    698         s2 = self.sill - self.nugget
--> 699         correlMat =  np.array(self.corr_f(Mdist.flatten())).reshape(Mdist.shape)
    700         Sigma = (s2 * correlMat) + nuggetId
    701         return Sigma

/apps/external_plugins/spystats/spystats/tools.pyc in corr_f(self, h)
    663         #corr_cont = lambda hx :  1 - (self.sill * self.f(hx) / (self.sill + self.nugget))
    664         ## Corrections made and suggested by Erick Chacon
--> 665         variogram_evaluated = self.f(h)
    666         corr_cont = (self.sill - variogram_evaluated) / (self.sill - self.nugget)
    667         #corr_cont = lambda hx : (self.sill - self.f(hx)) / (self.sill - self.nugget)

/apps/external_plugins/spystats/spystats/tools.pyc in <lambda>(x)
    791     @property
    792     def f(self):
--> 793         return lambda x : self.model(x,self.sill,self.range_a,self.nugget,self.alpha)
    794 
    795 

/apps/external_plugins/spystats/spystats/tools.pyc in whittleVariogram(h, sill, range_a, nugget, alpha)
    623         Ih = 1.0 if h >= 0 else 0.0
    624     #Ih = 1.0 if h >= 0 else 0.0
--> 625     g_h = ((sill - nugget)*(1 - np.exp(-(h**alpha / range_a**alpha)))) + nugget*Ih
    626     return g_h
    627 

MemoryError: 
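The MemoryError above is expected for a region this large: the covariance matrix is dense, so for n plots it needs n × n float64 entries, and the `Mdist.flatten()` call in the traceback temporarily allocates another copy. A quick back-of-the-envelope check for the ~36,868 plots used in the full fit below:

```python
# Dense covariance matrix memory for n plots, at 8 bytes per float64 entry
n = 36868
bytes_needed = n * n * 8
print(bytes_needed / 1e9, "GB")  # roughly 10.9 GB for a single dense matrix
```

This is why the rest of the notebook restricts the analysis to a smaller window (or a subsample, as in the last cells).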

In [13]:
variogram_data_path = "/apps/external_plugins/spystats/HEC_runs/results/variogram/data_envelope.csv"
thrs_dist = 100000
emp_var_log_log = pd.read_csv(variogram_data_path)

Instantiating the variogram object


In [14]:
gvg = tools.Variogram(section,'logBiomass',using_distance_threshold=thrs_dist)
gvg.envelope = emp_var_log_log
gvg.empirical = emp_var_log_log.variogram
gvg.lags = emp_var_log_log.lags
#emp_var_log_log = emp_var_log_log.dropna()
#vdata = gvg.envelope.dropna()
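For reference, the covariance construction that buildSpatialStructure performs (per the traceback above) converts semivariance gamma(h) into correlation rho(h) = (sill - gamma(h)) / (sill - nugget) and then builds Sigma = (sill - nugget) * rho(Dist) + nugget * I. A small standalone sketch, with a toy exponential variogram and made-up coordinates:

```python
import numpy as np

def covariance_from_variogram(dist, gamma, sill, nugget):
    """Turn a semivariogram function gamma into a covariance matrix over dist."""
    rho = (sill - gamma(dist)) / (sill - nugget)
    return (sill - nugget) * rho + nugget * np.identity(dist.shape[0])

# Three toy locations and their pairwise Euclidean distances
coords = np.array([[0.0, 0.0], [0.0, 3.0], [4.0, 0.0]])
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Toy exponential variogram with sill 0.34 and zero nugget (illustrative values)
gamma = lambda h: 0.34 * (1 - np.exp(-h / 2.0))
Sigma = covariance_from_variogram(dist, gamma, sill=0.34, nugget=0.0)
```

On the diagonal (h = 0, gamma = 0) this gives Sigma_ii = sill, as expected.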

Instantiating theoretical variogram model


In [19]:
matern_model = tools.MaternVariogram(sill=0.34,range_a=100000,nugget=0.33,kappa=4)
whittle_model = tools.WhittleVariogram(sill=0.34,range_a=100000,nugget=0.0,alpha=3)
exp_model = tools.ExponentialVariogram(sill=0.34,range_a=100000,nugget=0.33)
gaussian_model = tools.GaussianVariogram(sill=0.34,range_a=100000,nugget=0.33)
spherical_model = tools.SphericalVariogram(sill=0.34,range_a=100000,nugget=0.33)

In [20]:
gvg.model = whittle_model
#gvg.model = matern_model
#models = map(lambda model : gvg.fitVariogramModel(model),[matern_model,whittle_model,exp_model,gaussian_model,spherical_model])

gvg.fitVariogramModel(whittle_model)


Out[20]:
< Whittle Variogram : sill 0.340288288241, range 40963.3203528, nugget 0.329830410223, alpha1.12279978135 >
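The Whittle model fitted above has a closed form, visible in the traceback from tools.pyc: gamma(h) = (sill - nugget) * (1 - exp(-(h/range)^alpha)) + nugget. A standalone sketch, evaluated with (rounded) parameters from Out[20]:

```python
import numpy as np

def whittle_variogram(h, sill, range_a, nugget, alpha):
    """Whittle semivariogram, following the formula in spystats tools."""
    h = np.asarray(h, dtype=float)
    return (sill - nugget) * (1.0 - np.exp(-(h ** alpha) / (range_a ** alpha))) + nugget

# gamma(0) equals the nugget; gamma(h) approaches the sill for large h
g = whittle_variogram([0.0, 40963.32, 1e7],
                      sill=0.34, range_a=40963.32, nugget=0.33, alpha=1.12)
```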

In [22]:
import numpy as np
xx = np.linspace(0,1000000,1000)

gvg.plot(refresh=False,with_envelope=True)
plt.plot(xx,whittle_model.f(xx),lw=2.0,c='k')
plt.title("Empirical Variogram with fitted Whittle Model")


Out[22]:
<matplotlib.text.Text at 0x7f16002e8610>
For the sake of brevity I did not include the steps for calculating the covariance matrix and feeding it into the GLS routine. These are the results after fitting:

                            GLS Regression Results
==============================================================================
Dep. Variable:             logBiomass   R-squared:                       0.900
Model:                            GLS   Adj. R-squared:                  0.900
Method:                 Least Squares   F-statistic:                 3.304e+05
Date:                Sun, 28 Jan 2018   Prob (F-statistic):               0.00
Time:                        12:48:52   Log-Likelihood:                -34662.
No. Observations:               36868   AIC:                         6.933e+04
Df Residuals:                   36866   BIC:                         6.935e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.4705      0.010    840.884      0.000       8.451       8.490
logSppN        0.3909      0.006     66.280      0.000       0.379       0.402
==============================================================================
Omnibus:                     1933.532   Durbin-Watson:                   1.906
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2819.569
Skew:                          -0.475   Prob(JB):                         0.00
Kurtosis:                       3.966   Cond. No.                         3.75
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [ ]:
def randomSelection(n, p):
    """Draw p rows at random, without replacement, from the (global) new_data frame."""
    idxs = np.random.choice(n, p, replace=False)
    random_sample = new_data.iloc[idxs]
    return random_sample
#################
n = len(new_data)
p = 3000  # number of samples to take (without replacement)

In [ ]:
random_sample = randomSelection(n,100)