In the Portuguese West Region, the pear cultivar 'Rocha' is one of the most important crops. Every year, due to climate differences and yield alternation (biennial bearing), production can vary considerably.
This situation exposes producers to unexpected transport and cold-storage costs. Furthermore, our experience tells us that in low-yield years production tends to be overestimated, while in high-yield years it tends to be underestimated.
In this image there are two clusters, the left one with two fruits and the right one with a single fruit. A second image shows a few more examples.
My idea was to "decompose" the variable I want to estimate into a product of five variables that I can sample easily:
$\frac{\text{yield}}{\text{hectare}} = \frac{\text{number of trees}}{\text{hectare}} \times \frac{\text{number of branches}}{\text{tree}} \times \frac{\text{number of clusters}}{\text{branch}} \times \frac{\text{number of fruits}}{\text{cluster}} \times \text{average fruit weight}$
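To make the decomposition concrete, here is a back-of-the-envelope calculation; every number below is made up for illustration, not measured data:

# All values are hypothetical, for illustration only
trees_per_ha = 1250
branches_per_tree = 20
clusters_per_branch = 8
fruits_per_cluster = 2.1
avg_fruit_weight_kg = 0.130  # 130 g per fruit

yield_kg_per_ha = (trees_per_ha * branches_per_tree * clusters_per_branch
                   * fruits_per_cluster * avg_fruit_weight_kg)
print(round(yield_kg_per_ha / 1000, 1), "t/ha")  # 54.6 t/ha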
We also tried to collect some measurements of tree growth, such as height and diameter.
I "invited" some coleagues of mine to collect a (I hope) big enough dataset to resample it after and try to find some pattern. By now, I've started by the number of fruits per cluster.
In [1]:
import pandas as pd
import numpy as np
import random
from scipy.optimize import curve_fit
from scipy.stats import binom
from scipy.special import factorial  # scipy.misc.factorial was removed in newer SciPy
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
df = pd.read_csv('dataset_1.csv') #dataset 1
df2 = pd.read_csv('dataset_2.csv') #dataset 2
dfx = pd.concat([df,df2])
We start by analysing the number of fruits per cluster, using the two datasets separately and then the dataset obtained by concatenating them.
In [3]:
def poisson(k, lamb):
    """Poisson PMF with parameter lamb, evaluated at k."""
    return (lamb**k / factorial(k)) * np.exp(-lamb)

data = df.fruit_number
m = np.min(data)
M = np.max(data)
n = np.size(np.unique(data))
# the bins should be of integer width, because the Poisson is a discrete distribution
entries, bin_edges, patches = plt.hist(data, bins=n, range=[m, M], density=True)
# calculate bin middles
bin_middles = 0.5 * (bin_edges[1:] + bin_edges[:-1])
# fit with curve_fit
parameters, cov_matrix = curve_fit(poisson, bin_middles, entries)
# plot the Poisson PMF with the fitted parameter
x_plot = np.linspace(m, M, 100)
plt.plot(x_plot, poisson(x_plot, *parameters), 'r-', lw=2)
plt.title(parameters[0])
plt.show()
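As a quick sanity check (my addition, not part of the original analysis): for an untruncated Poisson sample, the maximum-likelihood estimate of $\lambda$ is simply the sample mean, so it should land near the fitted parameter.

# the Poisson MLE of lambda is the sample mean; compare with parameters[0]
print(parameters[0], data.mean())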
In [4]:
data.describe()
Out[4]:
Here's a Bayesian approach to the same problem.
The Fruit class inherits Update from Suite, and it provides Likelihood, which is the likelihood of the data (a sequence of integer counts) under a given hypothesis (the parameter of a Poisson distribution).
It assumes that branches with 0 fruits were not counted, so for each hypothetical Poisson PMF, it sets the probability of 0 fruits to 0 and renormalizes.
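In symbols, zeroing out $k=0$ and renormalizing gives the zero-truncated Poisson PMF:

$P(k \mid \lambda, k > 0) = \frac{\lambda^k e^{-\lambda}}{k!\,(1 - e^{-\lambda})}, \quad k = 1, 2, \dots$

(The code below approximates this by truncating the PMF at high=15.)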
In [5]:
import thinkbayes2
import thinkplot
class Fruit(thinkbayes2.Suite):

    def Likelihood(self, data, hypo):
        """Computes the likelihood of `data` under `hypo`.

        data: sequence of integer counts
        hypo: parameter of the Poisson distribution
        """
        if hypo == 0:
            return 0

        # zero-truncated Poisson: branches with 0 fruits were not counted
        pmf = thinkbayes2.MakePoissonPmf(hypo, high=15)
        pmf[0] = 0
        pmf.Normalize()

        like = 1
        for count in data:
            like *= pmf[count]
        return like
In order to process all of the data, we have to break it into chunks and renormalize after each chunk. Otherwise the likelihoods underflow (round down to 0).
In [6]:
def chunks(t, n):
    """Yield successive n-sized chunks from t."""
    for i in range(0, len(t), n):
        yield t[i:i + n]
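An alternative to chunking, sketched here as my own addition (Suite.Update expects plain likelihoods, so this is not a drop-in replacement): work with log-likelihoods, which turns the product into a sum that cannot underflow.

def log_likelihood(data, hypo, high=15):
    """Log-likelihood of the counts under a zero-truncated Poisson(hypo)."""
    pmf = thinkbayes2.MakePoissonPmf(hypo, high=high)
    pmf[0] = 0
    pmf.Normalize()
    # counts above `high` have probability 0, giving -inf, as they should
    return sum(np.log(pmf[count]) for count in data)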
prior is the prior distribution for the parameter. It is initially uniform over a small range (the range that has non-negligible likelihood). After the update, the posterior distribution is approximately normal around the MLE.
In [7]:
prior = np.linspace(1.5, 2.5, 101)
fruit = Fruit(prior)

for chunk in chunks(data, 100):
    fruit.Update(chunk)

thinkplot.Pdf(fruit)
The posterior mean and MAP (maximum a posteriori probability) are both close to 2.1.
In [8]:
fruit.Mean(), fruit.MaximumLikelihood()
Out[8]:
Now, to compare the model to the data, we have to generate a predictive distribution, which is a mixture of Poisson distributions with different parameters, each one weighted by the posterior probability of the parameter.
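In symbols, with posterior $P(\lambda \mid D)$ over the hypotheses, the predictive distribution is

$P(k \mid D) = \sum_{\lambda} P(\lambda \mid D)\,\mathrm{Poisson}(k; \lambda)$

which is exactly what the mixture below computes.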
In [9]:
metapmf = thinkbayes2.Pmf()
for hypo, prob in fruit.Items():
    pmf = thinkbayes2.MakePoissonPmf(hypo, high=15)
    metapmf[pmf] = prob

mix = thinkbayes2.MakeMixture(metapmf)

data_pmf = thinkbayes2.Pmf(data)
thinkplot.Hist(data_pmf)
thinkplot.Pdf(mix, color='red')
The model is a pretty good match for the data, and not wildly different from the estimate by curve fitting.
So that's one of the steps we talked about -- making a Bayesian estimate based on the assumption that the 0-fruit branches were not measured.
I hope to get to some of the other steps soon!
-- Allen
In [10]:
data = df2.fruit_number
m = np.min(data)
M = np.max(data)
n = np.size(np.unique(data))
# the bins should be of integer width, because the Poisson is a discrete distribution
entries, bin_edges, patches = plt.hist(data, bins=n, range=[m, M], density=True)
# calculate bin middles
bin_middles = 0.5 * (bin_edges[1:] + bin_edges[:-1])
# fit with curve_fit
parameters, cov_matrix = curve_fit(poisson, bin_middles, entries)
# plot the Poisson PMF with the fitted parameter
x_plot = np.linspace(m, M, 100)
plt.plot(x_plot, poisson(x_plot, *parameters), 'r-', lw=2)
plt.title(parameters[0])
plt.show()
In [11]:
data = dfx.fruit_number
m = np.min(data)
M = np.max(data)
n = np.size(np.unique(data))
# the bins should be of integer width, because the Poisson is a discrete distribution
entries, bin_edges, patches = plt.hist(data, bins=n, range=[m, M], density=True)
# calculate bin middles
bin_middles = 0.5 * (bin_edges[1:] + bin_edges[:-1])
# fit with curve_fit
parameters, cov_matrix = curve_fit(poisson, bin_middles, entries)
# plot the Poisson PMF with the fitted parameter
x_plot = np.linspace(m, M, 100)
plt.plot(x_plot, poisson(x_plot, *parameters), 'r-', lw=2)
plt.title(parameters[0])
plt.show()
After that, I resampled the combined dataset and calculated the $\lambda$ parameter of the fitted distribution for each resample.
In [12]:
ixto = []
for i in range(1, 1000):
    data = dfx.fruit_number.sample(n=100)
    m = np.min(data)
    M = np.max(data)
    n = np.size(np.unique(data))
    # use np.histogram instead of plt.hist to avoid drawing ~1000 histograms
    entries, bin_edges = np.histogram(data, bins=n, range=[m, M], density=True)
    # calculate bin middles
    bin_middles = 0.5 * (bin_edges[1:] + bin_edges[:-1])
    # fit with curve_fit
    parameters, cov_matrix = curve_fit(poisson, bin_middles, entries)
    ixto.append(parameters[0])
In [13]:
plt.hist(ixto, bins=100)
Out[13]:
I've noticed a bimodal distribution... That's the beginning of a long journey, and I would like your help to find out whether it is possible to reduce our sample sizes.
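As a first step towards that question, here is a sketch (my addition, reusing poisson and dfx from the cells above) that repeats the resampling at several sample sizes and looks at the spread of the fitted $\lambda$; if the spread is already acceptable at a smaller n, smaller samples would be enough. Note that curve_fit can occasionally fail to converge on a very small resample.

# Sketch: spread of the fitted lambda as a function of sample size
for size in [25, 50, 100, 200]:
    estimates = []
    for _ in range(200):
        sample = dfx.fruit_number.sample(n=size)
        n_bins = np.size(np.unique(sample))
        entries, bin_edges = np.histogram(
            sample, bins=n_bins,
            range=[np.min(sample), np.max(sample)], density=True)
        bin_middles = 0.5 * (bin_edges[1:] + bin_edges[:-1])
        params, _ = curve_fit(poisson, bin_middles, entries)
        estimates.append(params[0])
    print(size, np.mean(estimates), np.std(estimates))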