Tweedie Density Estimation

This package follows the following 2 papers to estimate the density of a presumed Tweedie distribution:

Dunn, Peter K. and Smyth, Gordon K. 2001, Tweedie Family Densities: Methods
of Evaluation

Dunn, Peter K. and Smyth, Gordon K. 2005, Series evaluation of Tweedie
exponential dispersion model densities

Below, I'll demonstrate a relatively simple approach to estimating Tweedie density from a given distribution. First, I'm going to generate some data and try to fit a GLM on it.


In [1]:
%matplotlib inline

from __future__ import print_function

import numpy as np
import scipy as sp
from tweedie import tweedie
import seaborn as sns
import statsmodels.api as sm

In [2]:
# Number of parameters for model
p = 20

# Number of simulated observations
n = 100000

np.random.seed(43)
exog = np.random.rand(n, p - 1)
exog = np.hstack((np.ones((n, 1)), exog))
beta = np.concatenate(([500], np.random.randint(-100, 100, p - 1))) / 100
eta = np.dot(exog, beta)
mu = np.exp(eta)

endog = tweedie(mu=mu, p=1.5, phi=20).rvs(n)

And to prove this is a typical Tweedie distribution, we'll plot it. Notice that there's a big point of mass at exactly 0, and then some positive values.


In [3]:
sns.distplot(endog)


Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x9c85518>

In [4]:
sns.distplot(endog[endog > 0])


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0xa0d45f8>

A typical problem would be to try to improve the GLM by choosing an appropriate value for the var_power parameter. That's often difficult, but it can be made easier by using the scipy's minimize_scalar function which will seek the best value of p. To show this works, let's initially set p to be some ludicrious value and see if the minimize_scalar moves us in the right direction.


In [5]:
res = sm.GLM(endog, exog, family=sm.families.Tweedie(link=sm.families.links.log, var_power=1.1)).fit()
print(res.summary())


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:               100000
Model:                            GLM   Df Residuals:                    99980
Model Family:                 Tweedie   Df Model:                           19
Link Function:                    log   Scale:                   113.453789434
Method:                          IRLS   Log-Likelihood:                    nan
Date:                Mon, 24 Apr 2017   Deviance:                   9.5545e+06
Time:                        17:52:57   Pearson chi2:                 1.13e+07
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0634      0.035    145.996      0.000       4.995       5.131
x1             0.3728      0.016     23.421      0.000       0.342       0.404
x2             0.5698      0.016     35.765      0.000       0.539       0.601
x3             0.6701      0.016     41.863      0.000       0.639       0.701
x4            -0.8728      0.016    -54.218      0.000      -0.904      -0.841
x5             0.5345      0.016     33.572      0.000       0.503       0.566
x6            -0.3665      0.016    -23.032      0.000      -0.398      -0.335
x7            -0.3650      0.016    -22.987      0.000      -0.396      -0.334
x8            -0.4150      0.016    -26.156      0.000      -0.446      -0.384
x9            -0.6359      0.016    -39.790      0.000      -0.667      -0.605
x10           -0.8903      0.016    -55.241      0.000      -0.922      -0.859
x11            0.1132      0.016      7.123      0.000       0.082       0.144
x12           -0.2390      0.016    -15.037      0.000      -0.270      -0.208
x13           -0.2066      0.016    -13.009      0.000      -0.238      -0.175
x14            0.6733      0.016     42.138      0.000       0.642       0.705
x15            0.3821      0.016     24.027      0.000       0.351       0.413
x16            0.3662      0.016     23.121      0.000       0.335       0.397
x17           -0.8107      0.016    -50.410      0.000      -0.842      -0.779
x18           -0.2769      0.016    -17.483      0.000      -0.308      -0.246
x19           -0.2439      0.016    -15.403      0.000      -0.275      -0.213
==============================================================================

In [6]:
def loglike_p(p):
    return -tweedie(mu=res.mu, p=p, phi=res.scale).logpdf(res._endog).sum()

opt = sp.optimize.minimize_scalar(loglike_p, bounds=(1.05, 1.95), method='bounded')
print(opt)


     fun: 481747.12621092948
 message: 'Solution found.'
    nfev: 12
  status: 0
 success: True
       x: 1.3454076982488605

Sure enough, we move in the right direction.

Now, let's run the GLM using the updated value of p.


In [7]:
res2 = sm.GLM(endog, exog, family=sm.families.Tweedie(link=sm.families.links.log, var_power=opt.x)).fit()
print(res.summary())

def loglike_p(p):
    return -tweedie(mu=res2.mu, p=p, phi=res2.scale).logpdf(res2._endog).sum()

opt2 = sp.optimize.minimize_scalar(loglike_p, bounds=(1.05, 1.95), method='bounded')
print(opt2)


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:               100000
Model:                            GLM   Df Residuals:                    99980
Model Family:                 Tweedie   Df Model:                           19
Link Function:                    log   Scale:                   113.453789434
Method:                          IRLS   Log-Likelihood:                    nan
Date:                Mon, 24 Apr 2017   Deviance:                   9.5545e+06
Time:                        17:53:08   Pearson chi2:                 1.13e+07
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0634      0.035    145.996      0.000       4.995       5.131
x1             0.3728      0.016     23.421      0.000       0.342       0.404
x2             0.5698      0.016     35.765      0.000       0.539       0.601
x3             0.6701      0.016     41.863      0.000       0.639       0.701
x4            -0.8728      0.016    -54.218      0.000      -0.904      -0.841
x5             0.5345      0.016     33.572      0.000       0.503       0.566
x6            -0.3665      0.016    -23.032      0.000      -0.398      -0.335
x7            -0.3650      0.016    -22.987      0.000      -0.396      -0.334
x8            -0.4150      0.016    -26.156      0.000      -0.446      -0.384
x9            -0.6359      0.016    -39.790      0.000      -0.667      -0.605
x10           -0.8903      0.016    -55.241      0.000      -0.922      -0.859
x11            0.1132      0.016      7.123      0.000       0.082       0.144
x12           -0.2390      0.016    -15.037      0.000      -0.270      -0.208
x13           -0.2066      0.016    -13.009      0.000      -0.238      -0.175
x14            0.6733      0.016     42.138      0.000       0.642       0.705
x15            0.3821      0.016     24.027      0.000       0.351       0.413
x16            0.3662      0.016     23.121      0.000       0.335       0.397
x17           -0.8107      0.016    -50.410      0.000      -0.842      -0.779
x18           -0.2769      0.016    -17.483      0.000      -0.308      -0.246
x19           -0.2439      0.016    -15.403      0.000      -0.275      -0.213
==============================================================================
     fun: 410898.54726265225
 message: 'Solution found.'
    nfev: 11
  status: 0
 success: True
       x: 1.3741068903553528

Still moving in the right direction.


In [8]:
res3 = sm.GLM(endog, exog, family=sm.families.Tweedie(link=sm.families.links.log, var_power=opt2.x)).fit()
print(res.summary())

def loglike_p(p):
    return -tweedie(mu=res3.mu, p=p, phi=res3.scale).logpdf(res3._endog).sum()

opt3 = sp.optimize.minimize_scalar(loglike_p, bounds=(1.05, 1.95), method='bounded')
print(opt3)


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:               100000
Model:                            GLM   Df Residuals:                    99980
Model Family:                 Tweedie   Df Model:                           19
Link Function:                    log   Scale:                   113.453789434
Method:                          IRLS   Log-Likelihood:                    nan
Date:                Mon, 24 Apr 2017   Deviance:                   9.5545e+06
Time:                        17:53:18   Pearson chi2:                 1.13e+07
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0634      0.035    145.996      0.000       4.995       5.131
x1             0.3728      0.016     23.421      0.000       0.342       0.404
x2             0.5698      0.016     35.765      0.000       0.539       0.601
x3             0.6701      0.016     41.863      0.000       0.639       0.701
x4            -0.8728      0.016    -54.218      0.000      -0.904      -0.841
x5             0.5345      0.016     33.572      0.000       0.503       0.566
x6            -0.3665      0.016    -23.032      0.000      -0.398      -0.335
x7            -0.3650      0.016    -22.987      0.000      -0.396      -0.334
x8            -0.4150      0.016    -26.156      0.000      -0.446      -0.384
x9            -0.6359      0.016    -39.790      0.000      -0.667      -0.605
x10           -0.8903      0.016    -55.241      0.000      -0.922      -0.859
x11            0.1132      0.016      7.123      0.000       0.082       0.144
x12           -0.2390      0.016    -15.037      0.000      -0.270      -0.208
x13           -0.2066      0.016    -13.009      0.000      -0.238      -0.175
x14            0.6733      0.016     42.138      0.000       0.642       0.705
x15            0.3821      0.016     24.027      0.000       0.351       0.413
x16            0.3662      0.016     23.121      0.000       0.335       0.397
x17           -0.8107      0.016    -50.410      0.000      -0.842      -0.779
x18           -0.2769      0.016    -17.483      0.000      -0.308      -0.246
x19           -0.2439      0.016    -15.403      0.000      -0.275      -0.213
==============================================================================
     fun: 407679.78651045624
 message: 'Solution found.'
    nfev: 9
  status: 0
 success: True
       x: 1.3933801324974868

Still moving... After enough back-and-forth guesses, you should approach the optimal p. For reasons I don't fully understand, the estimated p won't quite match the initial guess provided earlier. But we are moving in the right direction!