Regression model

This notebook explores and models the data collected from recordings of the natural acoustic environment along the urban-rural gradient near Innsbruck, Austria. The models are Bayesian models implemented with the PyMC3 probabilistic programming library.

References:
https://github.com/fonnesbeck/multilevel_modeling
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.

Import statements


In [14]:
import warnings
warnings.filterwarnings('ignore')

In [15]:
import pandas
import numpy

In [16]:
%matplotlib inline
from matplotlib import pyplot
from matplotlib.patches import Rectangle
import seaborn
import mpld3
from mpld3 import plugins

In [18]:
from pymc3 import glm, Model, NUTS, sample, \
                  forestplot, traceplot, plot_posterior, summary, \
                  Normal, Uniform, Deterministic, StudentT

Variable definitions


In [41]:
data_filepath = "/Users/Jake/OneDrive/Documents/alpine soundscapes/data/dataset.csv"

In [43]:
seaborn_blue = seaborn.color_palette()[0]

Load data


In [42]:
data = pandas.read_csv(data_filepath)
data = data.loc[data.site<=30]

In [44]:
data[['site', 'site_name', 'building_pavement_200m', 'forest_200m']].head()


Out[44]:
   site           site_name  building_pavement_200m  forest_200m
1    15           Hofgarten               34.716504    40.635720
2    26        Bienenstraße               60.392767    20.139944
3     5         Gumppstraße               81.331017    12.070743
4    21         Templstraße               77.336232    17.342644
5    18  Egger-Lienz-Straße               74.866665    12.899635

Sort data by site and then by visit


In [45]:
data_sorted = data.sort_values(by=['site', 'sound']).reset_index(drop=True)

In [46]:
data_sorted[['site', 'site_name', 'biophony', 
             'forest_200m', 'temperature', 'days']].head(n=10)


Out[46]:
   site site_name   biophony  forest_200m  temperature  days
0     1      Lans  83.522323    70.934164          9.9  28.0
1     1      Lans  81.196749    70.934164         10.4  42.0
2     1      Lans  84.231717    70.934164         18.0  76.0
3     1      Lans  89.147379    70.934164         23.9  85.0
4     1      Lans  86.715442    70.934164          7.9  91.0
5     2   Hofwald  76.131350    97.068706          5.9  22.0
6     2   Hofwald  75.306785    97.068706          4.3  36.0
7     2   Hofwald  86.630363    97.068706          2.1  48.0
8     2   Hofwald  89.550034    97.068706         22.2  69.0
9     2   Hofwald  86.640584    97.068706         13.9  90.0

Plot data

Here is a pairwise plot of a subset of the data. The 'days' variable represents the number of days between each recording and the start of field collection (January 27, 2016). The 'forest_200m' variable represents the percentage of forest land cover within a 200-meter radius of each recording location, and the 'temperature' variable represents the temperature in degrees Celsius, averaged from the measurements taken at the University of Innsbruck weather station 30 minutes before and after each recording. The 'biophony' and 'anthrophony' variables are shown in decibels (relative and log scale); they are on different scales as a result of the source separation algorithm (this will be fixed soon).


In [11]:
pairplot = seaborn.pairplot(data[['biophony', 'anthrophony', 
                                  'days', 'forest_200m', 'temperature']])


There is clearly a relationship between biophony and land cover, although a more robust analysis might explain it better than the simple regression shown below.


In [12]:
biophony_jointplot = seaborn.jointplot('forest_200m', 'biophony', data=data, kind='reg', 
                                       xlim=(-10, 110), ylim=(60, 110))


Land cover seems to have a greater effect on anthrophony.


In [13]:
anthrophony_jointplot = seaborn.jointplot('forest_200m', 'anthrophony', data=data, kind='reg', 
                                          xlim=(-10, 110), ylim=(-70, -10))


As the boxplots show, repeated measurements of anthrophony and biophony vary little at some sites and greatly at others.


In [14]:
biophony_boxplot = seaborn.boxplot('site', 'biophony', data=data)
biophony_boxplot.figure.set_figwidth(10)



In [15]:
anthrophony_boxplot = seaborn.boxplot('site', 'anthrophony', data=data)
anthrophony_boxplot.figure.set_figwidth(10)


Complete pooling

We can look at a standard regression analysis to compare it with more robust multilevel models. We start with a "complete pooling" analysis, considering only the 'days' variable (effectively time of year) as a predictor for biophony.

$$ \begin{align} y_i \sim \mathcal{N}(\mu_i, \sigma^2) \\ \mu_i = \alpha + \beta x_i \\ \end{align} $$

We start with the model defined above, where yi is the biophony at recording i, alpha is the intercept, beta is the slope, and xi is the number of days from the start of field collection for recording i.

We can define the model in PyMC3 and run it with 1000 samples.


In [16]:
with Model() as pooled_model:
    glm.glm("biophony ~ days", data)
    pooled_trace = sample(1000, NUTS(), random_seed=1)


Applied log-transform to sd and added transformed sd_log_ to model.
100%|██████████| 1000/1000 [00:02<00:00, 338.35it/s]

In [17]:
p = plot_posterior(pooled_trace, varnames=['Intercept', 'days'])
# TODO: fix font style


Here are the mean parameter values (discarding the first 500 samples as burn-in).


In [18]:
intercept0 = pooled_trace['Intercept'][500:].mean()
slope0 = pooled_trace['days'][500:].mean()

In [19]:
intercept0


Out[19]:
77.992808748573296

In [20]:
slope0


Out[20]:
0.07813826139761354

Here is a plot of the data and the fitted regression line.


In [21]:
fig, ax = pyplot.subplots(figsize=(15, 5))
ax.scatter(data.days, data.biophony, marker='.', color='black', alpha=0.5)
x_days = numpy.linspace(0, 100)
# plot 50 samples from the posterior of the parameters 
for d in numpy.random.random_integers(500, 999, 50):
    p1 = ax.plot(x_days, pooled_trace['Intercept'][d] + pooled_trace['days'][d] * x_days, 
                 color=seaborn_blue, alpha=0.1)
# plot mean parameter values
p1 = ax.plot(x_days, intercept0 + slope0*x_days, 'k--')
l1 = ax.set_xlabel("days since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 100])


No pooling

The complete pooling model above does not allow the regression parameters to vary by site, which is likely unrealistic. For example, it seems probable that the change in biophony over time varies between locations. We can see what happens if we fit a separate regression to each site, with a different intercept (alpha) and slope (beta) for each site j.

$$y_{ij} = \alpha_j + \beta_j x_i$$
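
As a quick check, we could also compute these unpooled fits numerically; here is a minimal sketch using numpy.polyfit (the plots below instead let seaborn.regplot fit each site's line):

# a sketch: fit a separate ordinary least squares regression to each site
unpooled_fits = {}
for site in data.site.unique():
    site_data = data[data.site == site]
    # numpy.polyfit with degree 1 returns the slope first, then the intercept
    slope_j, intercept_j = numpy.polyfit(site_data.days, site_data.biophony, 1)
    unpooled_fits[site] = {'intercept': intercept_j, 'slope': slope_j}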

In [22]:
sample_sites = (2, 9, 11, 15, 17, 18, 26, 30)

In [23]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
    # plot data and regression line
    y = data.biophony[data.site==site]
    x = data.days[data.site==site]
    seaborn.regplot(x, y, ax=ax[i], ci=None, color='grey')
    
    # plot complete pooling regression line and set format
    x_days = numpy.linspace(0, 100)
    ax[i].plot(x_days, (slope0 * x_days) + intercept0, 'r--')
    ax[i].set_title("{0} - {1}".format(site, data_sorted.site_name[data_sorted.site == site].unique()[0]))
    ax[i].set_ylim([60, 110])


Partial pooling (empty model)

We can see from the no pooling example that biophony does seem to change differently at each location. However, as shown below, a multilevel model allows us to apply a single model to the data set while still allowing the regression parameters to vary between sites. For sites with few measurements, the regression parameters are pulled strongly towards the parameters of the entire data set (towards complete pooling), while for sites with many measurements, the parameters are pulled only weakly (towards no pooling).
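
This pull can be made concrete: the multilevel estimate of each site mean is approximately a precision-weighted average of the site sample mean and the overall mean (Gelman & Hill, 2006). Here is a minimal sketch, assuming data-level and site-level standard deviations like the sigma_y and sigma_a defined in the model below:

def shrunken_mean(y_bar_j, n_j, y_bar_all, sigma_y, sigma_a):
    # precision-weighted average: sites with few measurements are pulled
    # strongly towards the overall mean; sites with many are pulled weakly
    w_site = n_j / sigma_y**2
    w_pool = 1 / sigma_a**2
    return (w_site * y_bar_j + w_pool * y_bar_all) / (w_site + w_pool)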

The model shown below contains no predictors (referred to as the empty model), and thus shows only how the mean biophony varies between locations.

$$ \begin{align} y_{ij} = \beta_{j} + r_{ij} \\ \beta_{j} = \gamma + u_{j} \\ \end{align} $$

In the equations above, y represents the biophony for recording i and location j, beta represents the mean biophony of each location j, gamma represents the grand mean of biophony (the mean of the location means), u represents the difference between the mean biophony at location j and the grand mean, and r represents the difference between the biophony of recording i at location j and the mean biophony at location j. The model can also be thought of as defining a separate intercept for each site with no slope.

We can define this model in PyMC3 as shown below.
NOTE: While the model equations are shown in frequentist form, PyMC3 implements the models in a Bayesian context.


In [24]:
sites = numpy.copy(data_sorted.site.values) - 1

In [25]:
with Model() as partial_pooling:
    
    # Priors
    mu_a = Normal('mu_a', mu=0., tau=0.0001)
    sigma_a = Uniform('sigma_a', lower=0, upper=100)
    tau_a = sigma_a**-2
    
    # Random intercepts
    a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
    
    # Model error
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    tau_y = sigma_y**-2
    
    # Expected value
    y_hat = a[sites]
    
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data_sorted.biophony)


Applied interval-transform to sigma_a and added transformed sigma_a_interval_ to model.
Applied interval-transform to sigma_y and added transformed sigma_y_interval_ to model.

In [26]:
with partial_pooling:
    step = NUTS()
    partial_pooling_samples = sample(2000, step, random_seed=1)


100%|██████████| 2000/2000 [00:03<00:00, 622.17it/s]

We can plot both the unpooled means and the partially-pooled means.


In [27]:
sample_trace = partial_pooling_samples['a'][-1000:]

fig, ax = pyplot.subplots(1, 2, figsize=(14,6), sharex=True, sharey=True)
n_samples, n_sites = sample_trace.shape  # avoid shadowing the 'sites' index array
jitter = numpy.random.normal(scale=0.1, size=n_sites)

n_site = data.groupby('site')['biophony'].count()
unpooled_means = data.groupby('site')['biophony'].mean()
unpooled_sd = data.groupby('site')['biophony'].std()
unpooled = pandas.DataFrame({'n':n_site, 'm':unpooled_means, 'sd':unpooled_sd})
unpooled['se'] = unpooled.sd / numpy.sqrt(unpooled.n)

# plot unpooled means
ax[0].scatter(unpooled.n + jitter, unpooled.m, marker='o', color='black')
for j, row in zip(jitter, unpooled.iterrows()):
    name, dat = row
    ax[0].plot([dat.n+j,dat.n+j], [dat.m-dat.se, dat.m+dat.se], 
               linestyle='-', color='black', alpha=0.25)
t1 = ax[0].hlines(sample_trace.mean(), 0.9, 100, linestyles='--')
ax[0].set_xticks([3, 4, 5])
t2 = ax[0].set_title('unpooled means')
t3 = ax[0].set_xlabel('number of measurements (jittered)')
t4 = ax[0].set_ylabel('biophony')

# plot partially-pooled means
n_samples, n_sites = sample_trace.shape
means = sample_trace.mean(axis=0)
sd = sample_trace.std(axis=0)
ax[1].scatter(n_site.values + jitter, means, marker='o', color='purple')
ax[1].set_xlim(2, 6)
ax[1].hlines(sample_trace.mean(), 0.9, 100, linestyles='--')
for j,n,m,s in zip(jitter, n_site.values, means, sd):
    ax[1].plot([n+j]*2, [m-s, m+s], linestyle='-', color='purple', alpha=0.25)
t1 = ax[1].set_title('partially-pooled means')
t2 = ax[1].set_xlabel('number of measurements (jittered)')
t3 = ax[1].set_ylabel('biophony')


Note that there is less pooling for sites with a higher number of measurements.
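
The empty model also provides an estimate of the intraclass correlation, the share of the total variance attributable to differences between sites. Here is a sketch using the trace sampled above:

# posterior samples of the two variance components (last 1000 samples)
sigma_a_samples = partial_pooling_samples['sigma_a'][-1000:]
sigma_y_samples = partial_pooling_samples['sigma_y'][-1000:]
icc_samples = sigma_a_samples**2 / (sigma_a_samples**2 + sigma_y_samples**2)
print("ICC mean: {0:0.2f}".format(icc_samples.mean()))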

Varying intercepts

Going forward with the multilevel model and partial pooling, we can begin to add predictors. This model adds the 'days' variable as a measurement-level predictor, but does not allow its effect (the slope) to vary between locations. The intercepts are still allowed to vary between locations, but they have no group-level predictor yet.

$$ \begin{align} \text{level 1} \\ y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + r_{ij} \\ \text{level 2} \\ \beta_{0j} = \gamma_{00} + u_{0j} \\ \beta_{1j} = \gamma_{10} \\ \end{align} $$

In [28]:
sites = numpy.copy(data_sorted.site.values) - 1

In [29]:
with Model() as varying_intercept:
    
    # Priors
    mu_a = Normal('mu_a', mu=0., tau=0.0001)
    sigma_a = Uniform('sigma_a', lower=0, upper=100)
    tau_a = sigma_a**-2
    
    # Random intercepts
    a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
    # Common slope
    b = Normal('b', mu=0., tau=0.0001)
    
    # Model error
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    tau_y = sigma_y**-2
    
    # Expected value
    y_hat = a[sites] + b * data_sorted.days
    
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data_sorted.biophony)


Applied interval-transform to sigma_a and added transformed sigma_a_interval_ to model.
Applied interval-transform to sigma_y and added transformed sigma_y_interval_ to model.

In [30]:
with varying_intercept:
    step = NUTS()
    varying_intercept_samples = sample(2000, step, random_seed=1)


100%|██████████| 2000/2000 [00:04<00:00, 487.32it/s]

In [31]:
site_labels = ["{1} - {0}".format(s, 
                                  data_sorted.site_name[data_sorted.site == s].unique()[0])\
               for s in range(1, 31)]
pyplot.figure(figsize=(5, 8))
p1 = forestplot(varying_intercept_samples, varnames=['a'], 
                ylabels=site_labels,
                xtitle="biophony (intercept)")



In [93]:
vi_trace = varying_intercept_samples[:1500]
data_sorted_forest = pandas.DataFrame(data_sorted['site_name'].unique(), columns=['site'], index=data_sorted['site'].unique() - 1)  # 0-based index to align with the trace columns
data_sorted_forest['forest_net_200m_centered'] = data_sorted['forest_net_200m_centered'].unique()
data_sorted_forest = data_sorted_forest.merge(pandas.DataFrame(vi_trace['a'].swapaxes(0, 1)), how='left', left_index=True, right_index=True)
fig, ax = pyplot.subplots()
for i, row in data_sorted_forest.iterrows():
    for s in numpy.random.random_integers(0, 1499, 200):
        ax.scatter(row['forest_net_200m_centered'], row[s], color=seaborn_blue, marker='.', alpha=0.1)
ax.set_ylim([60, 110])


Out[93]:
(60, 110)

We can also plot the sample traces and posterior distributions for other variables defined in the model. Here 'sigma_a' represents the standard deviation of the location-level means around the grand biophony mean, and 'b' represents the common slope shared by all locations.


In [32]:
p1 = traceplot(varying_intercept_samples[-1000:], varnames=['sigma_a', 'b'])


We can also print a summary with credible intervals for each model variable. The summary below shows that the estimated slope ('b') is about 0.077, meaning that each additional day increases the expected biophony by about 0.077 dB.


In [33]:
summary(varying_intercept_samples[-1000:], varnames=['b'])


b:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.077            0.017            0.001            [0.045, 0.108]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.044          0.066          0.078          0.090          0.107
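
To put this in context (simple arithmetic): a slope of roughly 0.077 dB per day over the roughly 100-day collection period amounts to an expected increase in biophony of about 7.7 dB between the start and end of field collection.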

Now we can plot a regression line for each location.


In [34]:
fig, ax = pyplot.subplots(figsize=(15, 5))
x_days = numpy.arange(0, 110, 10)
bp = varying_intercept_samples['a'].mean(axis=0)
mp = varying_intercept_samples['b'].mean()
ax.scatter(data.days, data.biophony, marker='.', color='black', alpha=0.5)
for bi in bp:
    ax.plot(x_days, mp * x_days + bi, color='purple', marker=None, alpha=0.1)
l1 = ax.set_xlabel("days since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 100])


Here are separate plots for each location, showing three different regressions.


In [35]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
    
    # Plot site data and unpooled regression
    y = data.biophony[data.site==site]
    x = data.days[data.site==site]
    seaborn.regplot(x, y, ax=ax[i], ci=None, color='grey')
    
    # plot pooled and partially-pooled regressions
    x_days = numpy.linspace(-10, 110)
    # Pooled estimate
    ax[i].plot(x_days, slope0 * x_days + intercept0, 'r--')
    # Partial pooling estimate
    ax[i].plot(x_days, mp * x_days + bp[site - 1], color='purple', marker=None, linestyle='-')
    ax[i].set_title("{0} - {1}".format(site, data_sorted.site_name[data_sorted.site == site].unique()[0]))
    ax[i].set_ylim([60, 110])


Varying slope

We can also see what happens if we fix a common intercept, but allow the slope to vary between locations.

$$ \begin{align} \text{level 1} \\ y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + r_{ij} \\ \text{level 2} \\ \beta_{0j} = \gamma_{00} \\ \beta_{1j} = \gamma_{10} + u_{1j} \\ \end{align} $$

In [36]:
sites = numpy.copy(data_sorted.site.values) - 1

In [79]:
with Model() as varying_slope:
    
    # Priors
#    mu_b = Normal('mu_b', mu=0., tau=0.0001)
    sigma_b = Uniform('sigma_b', lower=0, upper=100)
    tau_b = sigma_b**-2
    
    # Model intercept
#    a = Normal('a', mu=0., tau=0.0001)
    a = Uniform('a', lower=-10, upper=10)
    # Random slopes
#    b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
    b = Uniform('b', lower=-10, upper=10, shape=len(set(sites)))
    
    # Model error
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    tau_y = sigma_y**-2
    
    # Expected value
    y_hat = a + b[sites] * data_sorted.weeks_centered
    
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data_sorted.biophony_centered)


Applied interval-transform to sigma_b and added transformed sigma_b_interval_ to model.
Applied interval-transform to a and added transformed a_interval_ to model.
Applied interval-transform to b and added transformed b_interval_ to model.
Applied interval-transform to sigma_y and added transformed sigma_y_interval_ to model.

In [80]:
with varying_slope:
    step = NUTS()
    varying_slope_samples = sample(2000, step, random_seed=1)


100%|██████████| 2000/2000 [00:04<00:00, 488.34it/s]

Here is a plot of the estimated slope for each location.


In [81]:
site_labels = ["{1} - {0}".format(s, 
                                  data_sorted.site_name[data_sorted.site == s].unique()[0])\
               for s in range(1, 31)]
pyplot.figure(figsize=(5, 8))
p1 = forestplot(varying_slope_samples, varnames=['b'], 
                ylabels=site_labels,
                xtitle="slope")



In [87]:
vs_trace = varying_slope_samples[:1500]
data_sorted_forest = pandas.DataFrame(data_sorted['site_name'].unique(), columns=['site'], index=data_sorted['site'].unique() - 1)  # 0-based index to align with the trace columns
data_sorted_forest['forest_net_200m_centered'] = data_sorted['forest_net_200m_centered'].unique()
data_sorted_forest = data_sorted_forest.merge(pandas.DataFrame(vs_trace['b'].swapaxes(0, 1)), how='left', left_index=True, right_index=True)
fig, ax = pyplot.subplots()
for i, row in data_sorted_forest.iterrows():
    for s in numpy.random.random_integers(0, 1499, 500):
        ax.scatter(row['forest_net_200m_centered'], row[s], color=seaborn_blue, marker='.', alpha=0.1)

In [85]:
traceplot(vs_trace, varnames=['b'])


Out[85]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1736cdf28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1739dba58>]], dtype=object)

This result is interesting: locations with an estimated slope near 0 (e.g., Wilten and Templstraße) are generally located in areas with mostly urban land cover, while locations with higher estimated slopes (e.g., Pumhof and Wittenberg) are generally located in areas with a large amount of forest land cover.
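
One rough way to quantify this pattern is to correlate the estimated slopes with forest land cover directly; here is a sketch, assuming the trace sampled above:

# mean slope estimate per site (last 1000 samples) against mean forest cover
slope_means = varying_slope_samples['b'][-1000:].mean(axis=0)
forest_by_site = data_sorted.groupby('site')['forest_200m'].mean()
r = numpy.corrcoef(forest_by_site.values, slope_means)[0, 1]
print("correlation between forest cover and slope: {0:0.2f}".format(r))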

Here is a plot of the regression lines for all the locations.


In [40]:
fig, ax = pyplot.subplots(figsize=(15, 5))

# all data
p1 = ax.scatter(data.days, data.biophony, marker='.', color='black', alpha=0.5)

# varying slopes
x_days = numpy.arange(0, 110, 100)
b = varying_slope_samples['a'].mean()
m = varying_slope_samples['b'].mean(axis=0)
for mi in m:
    pi = ax.plot(x_days, mi * x_days + b, marker=None, color=seaborn_blue, alpha=0.1)
    
# pooled
p2 = ax.plot(x_days, intercept0 + slope0 * x_days, marker=None, color='black', linestyle='--')

# formatting
l1 = ax.set_xlabel("days since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 100])


Varying slope and intercept

Now we can try varying both slope and intercept.

$$ \begin{align} \text{level 1} \\ y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + r_{ij} \\ \text{level 2} \\ \beta_{0j} = \gamma_{00} + u_{0j} \\ \beta_{1j} = \gamma_{10} + u_{1j} \\ \end{align} $$

In [94]:
sites = numpy.copy(data_sorted.site.values) - 1

In [95]:
with Model() as varying_intercept_slope:
    
    # Priors    
    mu_a = Normal('mu_a', mu=0., tau=0.0001)
    sigma_a = Uniform('sigma_a', lower=0, upper=100)
    tau_a = sigma_a**-2
    
    mu_b = Normal('mu_b', mu=0., tau=0.0001)
    sigma_b = Uniform('sigma_b', lower=0, upper=100)
    tau_b = sigma_b**-2
    
    # Random intercepts
    a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
    # Random slopes
    b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
    
    # Model error
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    tau_y = sigma_y**-2
    
    # Expected value
    y_hat = a[sites] + b[sites] * data_sorted.days
    
    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data_sorted.biophony)
    y_sim = Normal('y_sim', mu=y_hat, tau=tau_y, shape=y_hat.tag.test_value.shape)


Applied interval-transform to sigma_a and added transformed sigma_a_interval_ to model.
Applied interval-transform to sigma_b and added transformed sigma_b_interval_ to model.
Applied interval-transform to sigma_y and added transformed sigma_y_interval_ to model.

In [96]:
with varying_intercept_slope:
    step = NUTS()
    varying_intercept_slope_samples = sample(2000, step, random_seed=1)


100%|██████████| 2000/2000 [02:55<00:00,  6.13it/s]

In [44]:
data_sim = data_sorted[['site', 'biophony', 'days']]
biophony_sim = varying_intercept_slope_samples[:1500]['y_sim']
biophony_sim = biophony_sim.swapaxes(0, 1)
data_sim = data_sim.merge(pandas.DataFrame(biophony_sim), how='left', left_index=True, right_index=True)

In [45]:
#for d in numpy.random.random_integers(0, 1499, 1499):
for d in range(1500):
    pyplot.scatter(data_sim['days'], biophony_sim[:, d], s=20, marker='.', color=seaborn_blue, alpha=0.1)
pyplot.scatter('days', 'biophony', data=data_sim, color='black')


Out[45]:
<matplotlib.collections.PathCollection at 0x13e3af2e8>

In [46]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
    
    
    # Plot pooled and partially-pooled regressions
    x_days = numpy.linspace(-10, 110)
    b = varying_intercept_slope_samples['a'].mean(axis=0)
    m = varying_intercept_slope_samples['b'].mean(axis=0)
    # Pooled estimate
    #ax[i].plot(x_days, slope0 * x_days + intercept0, 'r--', alpha=0.5)
    # Partial pooling estimate
    #ax[i].plot(x_days, mp * x_days + bp[site - 1], color='purple', marker=None, linestyle='-', alpha=0.5)
    # Varying slopes and intercepts
    for d in range(1500):
        ax[i].scatter(data_sim.days[data_sim.site==site], data_sim[d][data_sim.site==site], 
                      color=seaborn_blue, marker='.', alpha=0.1)
        #ax[i].plot(x_days, 
        #           (varying_intercept_slope_samples['b'][d][site - 1] * x_days) + varying_intercept_slope_samples['a'][d][site - 1], 
        #           color=seaborn_blue, marker=None, alpha=0.1)
    
    # Plot site data and unpooled regression
    y = data_sim.biophony[data_sim.site==site]
    x = data_sim.days[data_sim.site==site]
    seaborn.regplot(x, y, ax=ax[i], ci=None, color='black', line_kws={'linestyle': '--', 'linewidth': 0.5})
    
    ax[i].plot(x_days, m[site - 1] * x_days + b[site - 1], color='black', marker=None, linestyle='--')
    ax[i].set_title("{0} - {1}".format(site, data_sorted.site_name[data_sorted.site == site].unique()[0]))
    ax[i].set_ylim([60, 110])
    ax[i].set_xlim([0, 100])



In [47]:
fig, ax = pyplot.subplots(figsize=(15, 5))

# all data
p1 = ax.scatter(data.days, data.biophony, marker='.', color='black', alpha=0.5)

# varying slopes and intercepts
x_days = numpy.arange(0, 110, 100)
b = varying_intercept_slope_samples['a'].mean(axis=0)
m = varying_intercept_slope_samples['b'].mean(axis=0)
for bi,mi in zip(b,m):
    pi = ax.plot(x_days, mi * x_days + bi, marker=None, color=seaborn_blue, alpha=0.1)

# pooled
p2 = ax.plot(x_days, intercept0 + slope0 * x_days, marker=None, color='black', linestyle='--')

# formatting
l1 = ax.set_xlabel("days since project start")
l2 = ax.set_ylabel("biophony")
f1 = ax.set_xlim([0, 100])


Here are plots for individual locations again, showing how this model (blue sample lines) compares with the unpooled regression (dashed black line).


In [48]:
fig, ax = pyplot.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(sample_sites):
    
    
    # Plot pooled and partially-pooled regressions
    x_days = numpy.linspace(-10, 110)
    # Pooled estimate
    #ax[i].plot(x_days, slope0 * x_days + intercept0, 'r--', alpha=0.5)
    # Partial pooling estimate
    #ax[i].plot(x_days, mp * x_days + bp[site - 1], color='purple', marker=None, linestyle='-', alpha=0.5)
    # Varying slopes and intercepts
    for d in numpy.random.random_integers(500, 1999, 100):
        ax[i].plot(x_days, 
                   (varying_intercept_slope_samples['b'][d][site - 1] * x_days) + varying_intercept_slope_samples['a'][d][site - 1], 
                   color=seaborn_blue, marker=None, alpha=0.1)
    
    # Plot site data and unpooled regression
    y = data.biophony[data.site==site]
    x = data.days[data.site==site]
    seaborn.regplot(x, y, ax=ax[i], ci=None, color='black', line_kws={'linestyle': '--', 'linewidth': 0.5})
    
    ax[i].plot(x_days, m[site - 1] * x_days + b[site - 1], color='black', marker=None, linestyle='--')
    ax[i].set_title("{0} - {1}".format(site, data_sorted.site_name[data_sorted.site == site].unique()[0]))
    ax[i].set_xlim([0, 100])
    ax[i].set_ylim([60, 110])



In [112]:
trace = varying_intercept_slope_samples[-1500:]
a_means = trace['a'].mean(axis=0)
b_means = trace['b'].mean(axis=0)

Model 3 - adding group-level predictors

Now we can assign a predictor to the level 2 (group, or location-level) parameters. Land cover is likely to be a significant predictor of both the amount of biophony (intercept) and how it changes over time (slope).

There are many predictor variables in the data set, but we can define an aggregate predictor from those likely to have the largest effect. We define the 'forest_net_200m' variable as the percentage of forest land cover minus the combined percentage of building and paved-surface land cover within a 200-meter radius of each recording location. We can later check the fit of the model with other predictor variables.


In [47]:
data_sorted['forest_net_200m'] = data_sorted.forest_200m - data_sorted.building_pavement_200m

In [48]:
data_sorted[['site', 'forest_200m', 'forest_net_200m']].groupby('site').mean().head()


Out[48]:
      forest_200m  forest_net_200m
site
1       70.934164        64.039087
2       97.068706        94.241497
3       13.668314       -29.951781
4       79.950193        79.453945
5       12.070743       -69.260273

Here is the model, this time defined in Bayesian form.

$$ \begin{align} \text{level 1} \\ y_i \sim \mathcal{N}(\alpha_{j[i]} + \beta_{j[i]} x_{i}, \sigma^2_y), \\ \text{for $i = 1,...,n$} \\ \text{level 2} \\ \alpha_j \sim \mathcal{N}(\gamma_{00} + \gamma_{01} u_{j}, \sigma^2_{\alpha}), \\ \beta_{j} \sim \mathcal{N}(\gamma_{10} + \gamma_{11} u_{j}, \sigma^2_{\beta}), \\ \text{for $j = 1,...,J$} \end{align} $$

where $x_i$ is the measurement-level predictor (days) and $u_j$ is the location-level predictor (net forest land cover).

We transform and center some of the variables to make it easier to interpret the model results.


In [49]:
data_sorted['biophony_centered'] = data_sorted.biophony - data_sorted.biophony.mean()
data_sorted['weeks_centered'] = (data_sorted.days - data_sorted.days.mean()) / 7
data_sorted['forest_net_200m_centered'] = data_sorted.forest_net_200m - data_sorted.forest_net_200m.mean()

In [77]:
sites = numpy.copy(data_sorted.site.values) - 1

In [66]:
with Model() as model_3:
    
    # intercept
    g_00 = Normal('g_00', mu=0, tau=0.001)
    g_01 = Normal('g_01', mu=0, tau=0.001)
    sigma_a = Uniform('sigma_a', lower=0, upper=100)
    tau_a = sigma_a**-2
    mu_a = g_00 + (g_01 * data_sorted.forest_net_200m_centered.unique())
    a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
    
    # slope
    g_10 = Normal('g_10', mu=0, tau=0.001)
    g_11 = Normal('g_11', mu=0, tau=0.001)
    sigma_b = Uniform('sigma_b', lower=0, upper=100)
    tau_b = sigma_b**-2
    mu_b = g_10 + (g_11 * data_sorted.forest_net_200m_centered.unique())
    b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
    
    # model error (data-level)
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    tau_y = sigma_y**-2
    
    # expected values
    y_hat = a[sites] + (b[sites] * data_sorted.weeks_centered)
    
    # likelihood
    y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data_sorted.biophony_centered)
    
    # simulated
    y_sim = Normal('y_sim', mu=y_hat, tau=tau_y, shape=y_hat.tag.test_value.shape)


Applied interval-transform to sigma_a and added transformed sigma_a_interval_ to model.
Applied interval-transform to sigma_b and added transformed sigma_b_interval_ to model.
Applied interval-transform to sigma_y and added transformed sigma_y_interval_ to model.

In [67]:
with model_3:
    step = NUTS()
    model_3_samples = sample(10000, step, random_seed=1)


100%|██████████| 10000/10000 [05:06<00:00, 45.10it/s]

We can compute the estimated intercept and slope for each location based on the mean of the last 5000 samples from the model trace.


In [68]:
trace = model_3_samples[-5000:]
a_means = trace['a'].mean(axis=0)
b_means = trace['b'].mean(axis=0)

Now we can compare how well the estimated level 2 parameters predict the estimated level 1 parameters.


In [69]:
fig, ax = pyplot.subplots(1, 2, sharex=True, figsize=(15, 5))

# plot estimated level 2 parameter (intercept)
intercept_points = ax[0].scatter(data_sorted.forest_net_200m_centered.unique(), a_means, color='black', alpha=0.4)
g00 = trace['g_00'].mean()
g01 = trace['g_01'].mean()
x_forest = numpy.linspace(-110, 80)
p1 = ax[0].plot(x_forest, g00 + (g01 * x_forest), color='black', linestyle='--')
pl = ax[0].set_xlim(-120, 90)

# plot estimated level 1 parameters
a_se = trace['a'].std(axis=0)
for xi, m, se in zip(data_sorted.forest_net_200m_centered.unique(), a_means, a_se):
    pi = ax[0].plot([xi, xi], [m-se, m+se], color='black', alpha=0.1)
t1 = ax[0].set_xlabel('net forest within a 200 meter radius')
t2 = ax[0].set_ylabel('intercept estimate (difference from mean biophony)')
t0 = ax[0].set_title('predicting intercept estimates')

# for interactive plot via mpld3
#intercept_point_labels = ["{0}: {1:0.2f}".format(n, s) for n, s in zip(data_sorted.site_name.unique(), a_means)]
#tooltip = plugins.PointLabelTooltip(intercept_points, intercept_point_labels)
#plugins.connect(fig, tooltip)
#mpld3.display(fig)

# plot estimated level 2 parameter (slope)
slope_points = ax[1].scatter(data_sorted.forest_net_200m_centered.unique(), b_means, color='black', alpha=0.4)
g10 = trace['g_10'].mean()
g11 = trace['g_11'].mean()
x_forest = numpy.linspace(-110, 80)
p2 = ax[1].plot(x_forest, g10 + (g11 * x_forest), color='black', linestyle='--')
pl = ax[1].set_xlim(-120, 90)

# plot estimated level 1 parameters
b_se = trace['b'].std(axis=0)
for xi, m, se in zip(data_sorted.forest_net_200m_centered.unique(), b_means, b_se):
    pi = ax[1].plot([xi, xi], [m-se, m+se], color='black', alpha=0.1)
t3 = ax[1].set_xlabel('net forest within a 200 meter radius')
t4 = ax[1].set_ylabel('slope estimate')
t5 = ax[1].set_title('predicting slope estimates')

# for interactive plot via mpld3
#slope_point_labels = ["{0}: {1:0.2f}".format(n, s) for n, s in zip(data_sorted.site_name.unique(), b_means)]
#tooltip = plugins.PointLabelTooltip(slope_points, slope_point_labels)
#plugins.connect(fig, tooltip)
#mpld3.display(fig)


Above, we can see that the land cover variable ('forest_net_200m') predicts reasonably well how much biophony changes over time at each location (right plot). The intercept estimates do not fit quite as well, although there still seems to be a clear relationship (left plot). The two major outliers may be caused by errors in the source separation algorithm. We can compute an R squared value for each level 2 parameter and for all of level 1 (see below).

First, here are individual plots for all the locations.


In [70]:
fig, ax = pyplot.subplots(6, 5, figsize=(15, 20), sharey=True, sharex=True)
ax = ax.ravel()
for i, site in enumerate(data_sorted.site.unique()):
    
    x_days = numpy.linspace(-6, 6, 2)
    
    # draw parameter samples
    for d in numpy.random.random_integers(0, 4999, 100):
        ax[i].plot(x_days, 
                   (trace['b'][d][site - 1] * x_days) + trace['a'][d][site - 1], 
                   color=seaborn_blue, marker=None, alpha=0.1)
    
    # observed data (points)
    y = data_sorted.biophony_centered[data_sorted.site==site]
    x = data_sorted.weeks_centered[data_sorted.site==site]
    ax[i].scatter(x, y, color='black')
    
    # model estimate (line)
    x_days = numpy.linspace(-6, 6, 2)
    ax[i].plot(x_days, a_means[i] + (b_means[i] * x_days), color='black', linestyle='--')
    ax[i].set_title("{0} - {1}".format(site, data_sorted.site_name[data_sorted.site == site].unique()[0]))
    ax[i].text(-6, 20, "slope: {0:.02f}".format(b_means[i]), va='bottom')
    ax[i].set_xlim([-6, 6])
    ax[i].set_ylim([-25, 25])


Organize simulated data with observed data


In [13]:
#data_sim = data_sorted[['site', 'biophony_centered', 'weeks_centered']]
#biophony_sim = trace[:5000]['y_sim']
#biophony_sim = biophony_sim.swapaxes(0, 1)
#data_sim = data_sim.merge(pandas.DataFrame(biophony_sim), how='left', left_index=True, right_index=True)

Calculate R squared

Following Gelman and Hill (2006), we can compute (effective) R squared values for the multilevel model.
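
At each level, the effective R squared compares the variance of the residuals at that level to the variance of the corresponding data (level 1) or estimated parameters (level 2), as in the code below:

$$R^2 = 1 - \frac{\widehat{\mathrm{Var}}(e)}{\widehat{\mathrm{Var}}(y)}$$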


In [73]:
# level 1
a_means_expanded = numpy.array([a_means[v-1] for v in data_sorted.site])
b_means_expanded = numpy.array([b_means[v-1] for v in data_sorted.site])
y_hat = a_means_expanded + (b_means_expanded * data_sorted.weeks_centered)
e_y = data_sorted.biophony_centered - y_hat
r2_level1 = 1 - (e_y.var() / data_sorted['biophony_centered'].var())

# level 2
a_hat = g00 + (g01 * data_sorted.groupby('site')['forest_net_200m_centered'].mean())
b_hat = g10 + (g11 * data_sorted.groupby('site')['forest_net_200m_centered'].mean())
e_a = a_means - a_hat
e_b = b_means - b_hat
# R squared for intercept estimates
r2_level2a = 1 - (e_a.var() / a_means.var())
# R squared for slope estimates
r2_level2b = 1 - (e_b.var() / b_means.var())

In [74]:
print("R squared values:\n\
      level 1\n\
      ----------\n\
      overall:    {0:0.2f}\n\
      \n\
      level 2\n\
      ----------\n\
      intercepts: {1:0.2f}\n\
      slopes:     {2:0.2f}".format(r2_level1, r2_level2a, r2_level2b))


R squared values:
      level 1
      ----------
      overall:    0.61
      
      level 2
      ----------
      intercepts: 0.42
      slopes:     0.90

Analysis of variance (ANOVA)

Following the multilevel ANOVA approach of Gelman and Hill (2006), we can compare how much variation the model estimates at each level by plotting the standard deviation parameters ('sigma_a', 'sigma_b', and 'sigma_y').


In [75]:
trace.varnames


Out[75]:
['g_00',
 'g_01',
 'sigma_a_interval_',
 'a',
 'g_10',
 'g_11',
 'sigma_b_interval_',
 'b',
 'sigma_y_interval_',
 'y_sim',
 'sigma_a',
 'sigma_b',
 'sigma_y']

In [76]:
p1 = forestplot(trace, varnames=['sigma_a', 'sigma_b', 'sigma_y'])


Model 4 - adding more predictors


In [50]:
data_sorted['temperature_centered'] = data_sorted.temperature - data_sorted.temperature.mean()

In [51]:
data_sorted['wind_speed_centered'] = data_sorted.wind_speed - data_sorted.wind_speed.mean()

In [66]:
sites = numpy.copy(data_sorted.site.values) - 1

In [68]:
with Model() as model_4:
    
    # intercept
    g_00 = Normal('g_00', mu=0, tau=0.001)
    g_01 = Normal('g_01', mu=0, tau=0.001)
    sigma_a = Uniform('sigma_a', lower=0, upper=100)
    tau_a = sigma_a**-2
    mu_a = g_00 + (g_01 * data_sorted.forest_net_200m_centered.unique())
    a = Normal('a', mu=mu_a, tau=tau_a, shape=len(set(sites)))
    
    # slope (land cover)
    g_10 = Normal('g_10', mu=0, tau=0.001)
    g_11 = Normal('g_11', mu=0, tau=0.001)
    sigma_b = Uniform('sigma_b', lower=0, upper=100)
    tau_b = sigma_b**-2
    mu_b = g_10 + (g_11 * data_sorted.forest_net_200m_centered.unique())
    b = Normal('b', mu=mu_b, tau=tau_b, shape=len(set(sites)))
    
    # temperature
    c = Uniform('c', lower=-10, upper=10, shape=len(set(sites)))
    
    # model error (data-level)
    sigma_y = Uniform('sigma_y', lower=0, upper=100)
    tau_y = sigma_y**-2
    
    # expected values
    y_hat = a[sites] + (b[sites] * data_sorted.weeks_centered) + (c[sites] * data_sorted.temperature_centered)
    
    # likelihood
    y_like = Normal('y_like', mu=y_hat, tau=tau_y, observed=data_sorted.biophony_centered)
    
    # simulated
    #y_sim = Normal('y_sim', mu=y_hat, tau=tau_y, shape=y_hat.tag.test_value.shape)


Applied interval-transform to sigma_a and added transformed sigma_a_interval_ to model.
Applied interval-transform to sigma_b and added transformed sigma_b_interval_ to model.
Applied interval-transform to c and added transformed c_interval_ to model.
Applied interval-transform to sigma_y and added transformed sigma_y_interval_ to model.
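
As with the previous models, we could then draw samples from model_4 (a sketch mirroring the earlier sampling cells):

with model_4:
    step = NUTS()
    model_4_samples = sample(10000, step, random_seed=1)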