Here I process the entire region in chunks.


In [1]:
# Load Biospytial modules and etc.
%matplotlib inline
import sys
sys.path.append('/apps')
import django
django.setup()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
## Use the ggplot style
plt.style.use('ggplot')

In [2]:
from external_plugins.spystats import tools
%run ../testvariogram.py


/opt/conda/envs/biospytial/lib/python2.7/site-packages/IPython/core/pylabtools.py:168: DtypeWarning: Columns (24) have mixed types. Specify dtype option on import or set low_memory=False.
  safe_execfile(fname,*where,**kw)

In [3]:
section.shape


Out[3]:
(1841, 48)

Algorithm for processing chunks

  1. Make a partition given the extent
  2. Produce a tuple (minx, maxx, miny, maxy) for each element of the partition
  3. Calculate the semivariogram for each chunk and save it in a dataframe
  4. Plot everything
  5. Do the same with a Matérn kernel
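Steps 1 and 2 above can be sketched as a small standalone function (`partition_extent` is an illustrative name, not one of the notebook's helpers; it mirrors the `linspace`/`meshgrid` logic used below):

```python
import numpy as np

def partition_extent(minx, maxx, miny, maxy, n=6):
    # Grid anchor points along each axis, plus the constant step between them.
    xp, dx = np.linspace(minx, maxx, n, retstep=True)
    yp, dy = np.linspace(miny, maxy, n, retstep=True)
    # One (minx, maxx, miny, maxy) tuple per anchor point, as in steps 1-2.
    return [(float(x), float(x + dx), float(y), float(y + dy))
            for x in xp for y in yp]

tiles = partition_extent(0.0, 10.0, 0.0, 10.0, n=6)
# n * n = 36 tiles, one per meshgrid point, matching len(tuples) below
```

Note that the cells anchored at the last grid points extend one step beyond the original extent; this matches the construction in the cells below and only adds empty chunks, which are filtered out later.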

In [4]:
minx,maxx,miny,maxy = getExtent(new_data)

In [5]:
maxy


Out[5]:
1556957.5046647713

In [6]:
## Let's build the partition
N = 6
xp,dx = np.linspace(minx,maxx,N,retstep=True)
yp,dy = np.linspace(miny,maxy,N,retstep=True)

In [7]:
dx


Out[7]:
920016.69306643459

In [8]:
xx,yy = np.meshgrid(xp,yp)

In [9]:
coordinates_list = [(xx[i][j], yy[i][j]) for i in range(N) for j in range(N)]

In [10]:
## getExtentFromPoint builds the (minx, maxx, miny, maxy) tuple for the cell anchored at (x, y)
tuples = map(lambda (x,y) : getExtentFromPoint(x,y,step_sizex=dx,step_sizey=dy),coordinates_list)

In [11]:
len(tuples)


Out[11]:
36

In [12]:
chunks = map(lambda (mx,Mx,my,My) : subselectDataFrameByCoordinates(new_data,'newLon','newLat',mx,Mx,my,My),tuples)
chunks_sizes = map(lambda df : df.shape[0],chunks)

In [13]:
chunk_w_size = zip(chunks,chunks_sizes)

In [14]:
## Here we can filter based on a threshold
threshold = 10
nonempty_chunks_w_size = filter(lambda (df,n) : df.shape[0] > threshold ,chunk_w_size)
chunks_non_empty, chunks_sizes = zip(*nonempty_chunks_w_size)

In [15]:
lengths = pd.Series(map(lambda ch : ch.shape[0],chunks_non_empty))

In [16]:
lengths.plot.hist()


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbfbc580490>

In [17]:
cs = chunks_non_empty
variograms = map(lambda chunk : tools.Variogram(chunk,'residuals1',using_distance_threshold=600000),cs)

In [18]:
%time vars = map(lambda v : v.calculateEmpirical(),variograms)
%time vars = map(lambda v : v.calculateEnvelope(num_iterations=1),variograms)


CPU times: user 47.8 s, sys: 5.56 s, total: 53.3 s
Wall time: 53.4 s
CPU times: user 1min 28s, sys: 8.07 s, total: 1min 36s
Wall time: 1min 36s

In [19]:
%time lags = map(lambda v : v.lags,variograms)


CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 23.8 µs

In [20]:
lags = pd.DataFrame(lags).transpose()

In [21]:
lags = lags[[0]]

Take an average of the empirical variograms, together with their envelopes.

We will use the group-by directive on the lags field.
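The average computed in the next cells is a weighted mean: each chunk's empirical semivariance at a given lag is weighted by the number of point pairs behind that estimate. A minimal sketch with synthetic numbers (rows are lags, columns are chunks):

```python
import numpy as np

# Synthetic numbers only; rows are lags, columns are chunks.
semivariances = np.array([[1.0, 3.0],
                          [2.0, 4.0]])
pair_counts = np.array([[10.0, 30.0],
                        [20.0, 20.0]])

# Weight each chunk's semivariance by its number of point pairs,
# then normalise by the total number of pairs in that lag.
weighted_avg = (semivariances * pair_counts).sum(axis=1) / pair_counts.sum(axis=1)
# lag 0: (1*10 + 3*30) / 40 = 2.5 ; lag 1: (2*20 + 4*20) / 40 = 3.0
```

This is the same `(variogram.values * points.values).sum(axis=1) / points.sum(axis=1)` computation done below with DataFrames.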


In [22]:
envslow = pd.concat(map(lambda df : df[['envlow']],vars),axis=1)
envhigh = pd.concat(map(lambda df : df[['envhigh']],vars),axis=1)
variogram = pd.concat(map(lambda df : df[['variogram']],vars),axis=1)
n_points = pd.DataFrame(map(lambda v : v.n_points,variograms))

In [23]:
points = n_points.transpose()

In [24]:
print(variogram.shape)
print(points.shape)


(49, 24)
(49, 24)

In [25]:
ejem1 = pd.DataFrame(variogram.values * points.values)

In [26]:
# Chunks (variograms) columns
# lag rows
vempchunk = ejem1.sum(axis=1) / points.sum(axis=1)
plt.plot(lags,vempchunk,'--',color='blue',lw=2.0)


Out[26]:
[<matplotlib.lines.Line2D at 0x7fbfb996f350>]

In [27]:
## Cut some values
vchunk = pd.concat([lags,vempchunk],axis=1)
vchunk.columns = ['lags','semivariance']
v = vchunk[vchunk['lags'] < 500000]
plt.plot(v.lags,v.semivariance,'--',color='blue',lw=2.0)
#vemp2


Out[27]:
[<matplotlib.lines.Line2D at 0x7fbfb97c1310>]

Let's bring in the whole empirical variogram (calculated on HEC), for comparison purposes.


In [28]:
thrs_dist = 1000000
nt = 30 # num iterations
filename = "../HEC_runs/results/low_q/data_envelope.csv"
envelope_data = pd.read_csv(filename)
gvg = tools.Variogram(new_data,'logBiomass',using_distance_threshold=thrs_dist)
gvg.envelope = envelope_data
gvg.empirical = gvg.envelope.variogram
gvg.lags = gvg.envelope.lags
vdata = gvg.envelope.dropna()
gvg.plot(refresh=False,legend=False,percentage_trunked=20)
plt.plot(v.lags,v.semivariance,'--',color='blue',lw=2.0)


Out[28]:
[<matplotlib.lines.Line2D at 0x7fbfb9840cd0>]

OK, now the thing I was supposed to do from the beginning:

use the log of the species number instead of the raw species number.
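A hypothetical sketch of where such residuals come from, assuming residuals2 is the residual of a linear fit of log-biomass on log species number (the actual column is built in testvariogram.py; the data here are synthetic):

```python
import numpy as np

# Synthetic data standing in for the real columns.
rng = np.random.default_rng(0)
log_spp = rng.uniform(0.0, 4.0, size=100)
log_biomass = 1.5 + 0.8 * log_spp + rng.normal(0.0, 0.1, size=100)

# Ordinary least-squares fit of degree 1; residuals are what is left
# after removing the linear trend in log species number.
slope, intercept = np.polyfit(log_spp, log_biomass, 1)
residuals2 = log_biomass - (intercept + slope * log_spp)
```

Because the fit includes an intercept, the residuals average to zero by construction; the variograms below then probe the spatial structure left in them.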


In [29]:
cs = chunks_non_empty
variograms = map(lambda chunk : tools.Variogram(chunk,'residuals2',using_distance_threshold=600000),cs)
%time vars = map(lambda v : v.calculateEmpirical(),variograms)
%time vars = map(lambda v : v.calculateEnvelope(num_iterations=1),variograms)
%time lags = map(lambda v : v.lags,variograms)
lags = pd.DataFrame(lags).transpose()
envslow = pd.concat(map(lambda df : df[['envlow']],vars),axis=1)
envhigh = pd.concat(map(lambda df : df[['envhigh']],vars),axis=1)
variogram = pd.concat(map(lambda df : df[['variogram']],vars),axis=1)
n_points = pd.DataFrame(map(lambda v : v.n_points,variograms))
points = n_points.transpose()
ejem1 = pd.DataFrame(variogram.values * points.values)
# Chunks (variograms) columns
# lag rows
vempchunk = ejem1.sum(axis=1) / points.sum(axis=1)
plt.plot(lags,vempchunk,'--',color='blue',lw=2.0)
thrs_dist = 1000000
nt = 30 # num iterations
filename = "../HEC_runs/results/low_q/data_envelope.csv"
envelope_data = pd.read_csv(filename)
gvg = tools.Variogram(new_data,'logBiomass',using_distance_threshold=thrs_dist)
gvg.envelope = envelope_data
gvg.empirical = gvg.envelope.variogram
gvg.lags = gvg.envelope.lags
vdata = gvg.envelope.dropna()
gvg.plot(refresh=False,legend=False,percentage_trunked=20)
#plt.plot(v.lags,v.semivariance,'--',color='blue',lw=2.0)


CPU times: user 47.2 s, sys: 4.56 s, total: 51.8 s
Wall time: 51.8 s
CPU times: user 1min 28s, sys: 7.88 s, total: 1min 36s
Wall time: 1min 36s
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 25 µs
Out[29]:
[<matplotlib.lines.Line2D at 0x7fbfb9910a10>]

In [35]:
vempchunk = ejem1.sum(axis=1) / points.sum(axis=1)
#plt.plot(lags,vempchunk,'--',color='blue',lw=2.0)
thrs_dist = 1000000
nt = 30 # num iterations
filename = "../HEC_runs/results/low_q/data_envelope.csv"
envelope_data = pd.read_csv(filename)
gvg = tools.Variogram(new_data,'logBiomass',using_distance_threshold=thrs_dist)
gvg.envelope = envelope_data
gvg.empirical = gvg.envelope.variogram
gvg.lags = gvg.envelope.lags
vdata = gvg.envelope.dropna()
gvg.plot(refresh=False,legend=False,percentage_trunked=20)
plt.plot(v.lags,v.semivariance,'--',color='blue',lw=2.0)


Out[35]:
[<matplotlib.lines.Line2D at 0x7fbfb97b42d0>]

In [37]:
## OK, satisfied. For now I'll move on to the GLS optimization.

In [ ]: