In [1]:
import pandas as pd
import numpy as np
import chainladder as cl
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
cl.__version__
Out[1]:
In [2]:
tri = cl.load_dataset('clrd').groupby('LOB').sum().loc['wkcomp', ['CumPaidLoss', 'EarnedPremNet']]
cl.Chainladder().fit(tri['CumPaidLoss']).ultimate_ == \
cl.MackChainladder().fit(tri['CumPaidLoss']).ultimate_
Out[2]:
Let's create a Mack Model.
In [3]:
mack = cl.MackChainladder().fit(tri['CumPaidLoss'])
MackChainladder has the following additional fitted features that the deterministic Chainladder
does not.
full_std_err_
: The full standard errortotal_process_risk_
: The total process errortotal_parameter_risk_
: The total parameter errormack_std_err_
: The total prediction error by origin periodtotal_mack_std_err_
: The total prediction error across all origin periodsNotice these are all measures of uncertainty, but where do they come from? Let's start by examining the link_ratios
underlying the triangle.
In [4]:
tri_first_lags = tri[tri.development<=24][tri.origin<'1997']['CumPaidLoss']
tri_first_lags
Out[4]:
A simple average link-ratio can be directly computed as follows:
In [5]:
tri_first_lags.link_ratio.to_frame().mean().values[0]
Out[5]:
Verifying that this ties to our Development
object:
In [6]:
cl.Development(average='simple').fit(tri['CumPaidLoss']).ldf_.to_frame().values[0, 0]
Out[6]:
Mack noticed that this estimate for an LDF is really just a linear regression fit. For the case of the simple
average, it is a weighted regression where the weight is set to $\left (\frac{1}{X} \right )^{2}$.
Take a look at the fitted coefficient in the next cell and verify that it ties to the direct calculations above. With the regression framework in hand, we get much more information about our LDF estimate than just the coefficient.
In [7]:
import statsmodels.api as sm
import numpy as np
y = tri_first_lags.to_frame().values[:, 1]
X = tri_first_lags.to_frame().values[:, 0]
model = sm.WLS(y, X, weights=(1/X)**2)
results = model.fit()
results.summary()
Out[7]:
By toggling the weights of our regression, we can handle the most common types of averaging used in picking loss development factors.
In [8]:
print('Does this work for simple?')
print(round(cl.Development(average='simple').fit(tri_first_lags).ldf_.to_frame().values[0, 0], 8) == \
round(sm.WLS(y, X, weights=(1/X)**2).fit().params[0],8))
print('Does this work for volume-weighted average?')
print(round(cl.Development(average='volume').fit(tri_first_lags).ldf_.to_frame().values[0, 0], 8) == \
round(sm.WLS(y, X, weights=(1/X)).fit().params[0],8))
print('Does this work for regression average?')
print(round(cl.Development(average='regression').fit(tri_first_lags).ldf_.to_frame().values[0, 0], 8) == \
round(sm.OLS(y, X).fit().params[0],8))
This regression framework is what the Development
estimator uses to set development patterns. Although we discard the information in deterministic approaches, Development
has two useful statistics for estimating reserve variability, both of which come from the regression framework. The stastics are std_err_
and sigma_
and they are used by the MackChainladder
estimator to determine the prediction error of our reserves.
In [9]:
dev = cl.Development(average='simple').fit(tri['CumPaidLoss'])
In [10]:
dev.std_err_
Out[10]:
In [11]:
dev.sigma_
Out[11]:
Since the regression framework is weighted, we can easily turn on/off any observation we want using the dropping capabilities of the Development
estimator. Dropping link ratios not only affects the ldf_
and cdf_
, but also the std_err_
and sigma
of the regression.
Here we eliminate the 1988 valuation from our triangle, which is identical to eliminating the first observation from our 12-24 regression fit.
In [12]:
print('Does this work for dropping observations?')
print(round(cl.Development(average='volume', drop_valuation='1988') \
.fit(tri['CumPaidLoss']).std_err_.to_frame().values[0, 0], 8) == \
round(sm.WLS(y[1:], X[1:], weights=(1/X[1:])).fit().bse[0],8))
With sigma_
and std_err_
in hand, Mack goes on to develop recursive formulas to estimate parameter_risk_
and process_risk_
.
In [13]:
mack.parameter_risk_
Out[13]:
The Mack model makes a lot of assumptions about independence (i.e. covariance between random processes is 0). This means many of the Variance estimates in the MackChainladder
model follow the form of $Var(A+B) = Var(A)+Var(B)$.
Notice the square of mack_std_err_
is simply the sum of the sqaures of parameter_risk_
and process_risk_
.
In [14]:
print('Parameter risk and process risk are independent?')
print(round(mack.mack_std_err_**2, 4) == round(mack.parameter_risk_**2 + mack.process_risk_**2, 4))
This independence assumption applies to variance of each origin period.
In [15]:
print('Total Parameter and process risk across origin periods is independent?')
print(round(mack.total_process_risk_**2, 4) == round((mack.process_risk_**2).sum('origin'), 4))
Independence is also assumed to apply to the overall standard error of reserves, total_mack_std_err_
.
In [16]:
(mack.total_process_risk_**2 + mack.total_parameter_risk_**2).to_frame().values[0, -1] == \
(mack.total_mack_std_err_**2).values[0,0]
Out[16]:
This over-reliance on independence is one of the weaknesses of the MackChainladder
method. Nevertheless, if the data align with this assumption, then total_mack_std_err_
is a reasonable esimator of reserve variability.
The mack_std_err_
at ultimate is the reserve variability for each origin
period.
In [17]:
mack.mack_std_err_[mack.mack_std_err_.development==mack.mack_std_err_.development.max()]
Out[17]:
These are probably easier to see in the summary_
of the MackChainladder
model.
In [18]:
mack.summary_
Out[18]:
In [19]:
plot_data = mack.summary_.to_frame()
g = plot_data[['Latest', 'IBNR']] \
.plot(kind='bar', stacked=True,
yerr=pd.DataFrame({'latest': plot_data['Mack Std Err']*0,
'IBNR': plot_data['Mack Std Err']}),
ylim=(0, None), title='Mack Chainladder Ultimate')
g.set_xlabel('Accident Year')
g.set_ylabel('Loss');
In [20]:
dist = pd.Series(np.random.normal(mack.ibnr_.sum(),
mack.total_mack_std_err_.values[0, 0], size=10000))
dist.plot(
kind='hist', bins=50,
title="Normally distributed IBNR estimate with a mean of " + '{:,}'.format(round(mack.ibnr_.sum(),0))[:-2]);
The MackChainladder
focused on a regression framework for determining the variability of reserve estimates. An alternative approach is to use statistical bootstrapping or sampling from a triangle with replacement to simulate new triangles.
Bootstrapping imposes less model constraints than the MackChainladder
which allows for greater applicability in different scenarios. Sampling new triangles can be accomplished through the BootstrapODPSample
estimator. This estimator will take a single triangle and simulate new ones from it.
Notice how easy it is to simulate 10,000 new triangles from an existing triangle by accessing the resampled_triangles_
attribute.
In [21]:
samples = cl.BootstrapODPSample(n_sims=10000).fit(tri['CumPaidLoss']).resampled_triangles_
Alternatively, we could use BootstrapODPSample
to transform our triangle into a resampled set.
In [22]:
samples = cl.BootstrapODPSample(n_sims=10000).fit_transform(tri['CumPaidLoss'])
The notion of the ODP Bootstrap is that as our simulations approach infinity, we should expect our mean simulation to converge on the basic Chainladder
estimate of of reserves.
Let's apply the basic chainladder to our original triangle and also to our simulated triangles to see whether this holds true.
In [23]:
difference = round(1 - cl.Chainladder().fit(samples).ibnr_.sum('origin').mean() / \
cl.Chainladder().fit(tri['CumPaidLoss']).ibnr_.sum())
print("Percentage difference in estimate using original triangle and BootstrapODPSample is " +str(difference))
In [24]:
samples
Out[24]:
In [25]:
pipe = cl.Pipeline([
('dev', cl.Development(average='simple')),
('tail', cl.TailConstant(1.05))])
pipe.fit(samples)
Out[25]:
Now instead of a single cdf_
vector, we have 10,000.
In [26]:
pipe.named_steps.dev.cdf_
Out[26]:
This allows us to look at the varibility of any fitted property used in our prior tutorials.
In [27]:
orig_dev = cl.Development(average='simple').fit(tri['CumPaidLoss'])
resampled_ldf = pipe.named_steps.dev.ldf_
print("12-24 LDF of original Triangle: " + str(round(orig_dev.ldf_.values[0,0,0,0],4)))
pd.Series(resampled_ldf.values[:, 0, 0, 0]).plot(
kind='hist', bins=100,
title='Age 12-14 LDF distribution using Bootstrap');
In [28]:
mack_vs_bs = resampled_ldf[resampled_ldf.origin == resampled_ldf.origin.max()].std('index').to_frame().append(
orig_dev.std_err_.to_frame()).T
mack_vs_bs.columns = ['Mack', 'Bootstrap']
mack_vs_bs.plot(kind='bar', title='Mack Regression Framework LDF Std Err vs Bootstrap Simulated LDF Std Err');
While the MackChainladder
produces statistics about the mean and variance of reserve estimates, those have to be fit to a distribution using MLE, MoM, etc to see the range of outcomes of reserves. With BootstrapODPSample
based fits, we can use the empirical distribution directly if we choose to.
In [29]:
ibnr = cl.Chainladder().fit(samples).ibnr_.sum('origin')
ibnr_99 = ibnr.quantile(0.99).to_frame().values[0,0]
print("99%-ile of reserve estimate is " +'{:0,}'.format(round(ibnr_99,0)))
Let's see how the MackChainladder
reserve distribution compares to the BootstrapODPSample
reserve distribution.
In [30]:
ax = ibnr.plot(kind='hist', bins=50, alpha=0.7, color='green').plot()
dist.plot(kind='hist', bins=50, alpha=0.4, color='blue', title='Mack vs Bootstrap Variability');
In [31]:
tri['EarnedPremNet'].latest_diagonal
Out[31]:
Passing an apriori_sigma
to the BornhuetterFerguson
estimator tells it to consider the apriori selection itself as a random variable. Fitting a stochastic BornhuetterFerguson
looks very much like the determinsitic version.
In [32]:
bf = cl.BornhuetterFerguson(apriori=0.65, apriori_sigma=0.10)
bf.fit(samples, sample_weight=tri['EarnedPremNet'].latest_diagonal)
Out[32]:
We can use our knowledge of Triangle
manipulation to grab most things we would want out of our model. In this example, we are using the broadcast_axis
function of the Triangle which repeats a Triangle object along an axis according to another triangle axis.
In [33]:
# Grab completed triangle replacing simulated known data with actual known data
full_triangle = bf.full_triangle_ - bf.X_ + \
tri['CumPaidLoss'].broadcast_axis('index', samples.index)
# Limiting to the current year for plotting
current_year = full_triangle[full_triangle.origin==full_triangle.origin.max()].to_frame().T
As expected, plotting the expected development of our full triangle over time from the Bootstrap BornhuetterFerguson
model fans out to greater uncertainty the farther we get from our valuation date.
In [34]:
# Plot the data
current_year.iloc[:, :200].reset_index(drop=True).plot(
color='green', legend=False, alpha=0.1,
title='Current Accident Year Expected Development Distribution', grid=True);