In [1]:
import pandas as pd
import numpy as np
import chainladder as cl
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
cl.__version__
Out[1]:
In [2]:
tri = cl.load_dataset('clrd').groupby('LOB').sum().loc['wkcomp', ['CumPaidLoss', 'EarnedPremNet']]
cl.Chainladder().fit(tri['CumPaidLoss']).ultimate_ == \
cl.MackChainladder().fit(tri['CumPaidLoss']).ultimate_
Out[2]:
Let's create a Mack Model.
In [3]:
mack = cl.MackChainladder().fit(tri['CumPaidLoss'])
MackChainladder has the following additional fitted features that the deterministic Chainladder does not.
full_std_err_: The full standard errortotal_process_risk_: The total process errortotal_parameter_risk_: The total parameter errormack_std_err_: The total prediction error by origin periodtotal_mack_std_err_: The total prediction error across all origin periodsNotice these are all measures of uncertainty, but where do they come from? Let's start by examining the link_ratios underlying the triangle.
In [4]:
tri_first_lags = tri[tri.development<=24][tri.origin<'1997']['CumPaidLoss']
tri_first_lags
Out[4]:
A simple average link-ratio can be directly computed as follows:
In [5]:
tri_first_lags.link_ratio.to_frame().mean().values[0]
Out[5]:
Verifying that this ties to our Development object:
In [6]:
cl.Development(average='simple').fit(tri['CumPaidLoss']).ldf_.to_frame().values[0, 0]
Out[6]:
Mack noticed that this estimate for an LDF is really just a linear regression fit. For the case of the simple average, it is a weighted regression where the weight is set to $\left (\frac{1}{X} \right )^{2}$.
Take a look at the fitted coefficient in the next cell and verify that it ties to the direct calculations above. With the regression framework in hand, we get much more information about our LDF estimate than just the coefficient.
In [7]:
import statsmodels.api as sm
import numpy as np
y = tri_first_lags.to_frame().values[:, 1]
X = tri_first_lags.to_frame().values[:, 0]
model = sm.WLS(y, X, weights=(1/X)**2)
results = model.fit()
results.summary()
Out[7]:
By toggling the weights of our regression, we can handle the most common types of averaging used in picking loss development factors.
In [8]:
print('Does this work for simple?')
print(round(cl.Development(average='simple').fit(tri_first_lags).ldf_.to_frame().values[0, 0], 8) == \
round(sm.WLS(y, X, weights=(1/X)**2).fit().params[0],8))
print('Does this work for volume-weighted average?')
print(round(cl.Development(average='volume').fit(tri_first_lags).ldf_.to_frame().values[0, 0], 8) == \
round(sm.WLS(y, X, weights=(1/X)).fit().params[0],8))
print('Does this work for regression average?')
print(round(cl.Development(average='regression').fit(tri_first_lags).ldf_.to_frame().values[0, 0], 8) == \
round(sm.OLS(y, X).fit().params[0],8))
This regression framework is what the Development estimator uses to set development patterns. Although we discard the information in deterministic approaches, Development has two useful statistics for estimating reserve variability, both of which come from the regression framework. The stastics are std_err_ and sigma_ and they are used by the MackChainladder estimator to determine the prediction error of our reserves.
In [9]:
dev = cl.Development(average='simple').fit(tri['CumPaidLoss'])
In [10]:
dev.std_err_
Out[10]:
In [11]:
dev.sigma_
Out[11]:
Since the regression framework is weighted, we can easily turn on/off any observation we want using the dropping capabilities of the Development estimator. Dropping link ratios not only affects the ldf_ and cdf_, but also the std_err_ and sigma of the regression.
Here we eliminate the 1988 valuation from our triangle, which is identical to eliminating the first observation from our 12-24 regression fit.
In [12]:
print('Does this work for dropping observations?')
print(round(cl.Development(average='volume', drop_valuation='1988') \
.fit(tri['CumPaidLoss']).std_err_.to_frame().values[0, 0], 8) == \
round(sm.WLS(y[1:], X[1:], weights=(1/X[1:])).fit().bse[0],8))
With sigma_ and std_err_ in hand, Mack goes on to develop recursive formulas to estimate parameter_risk_ and process_risk_.
In [13]:
mack.parameter_risk_
Out[13]:
The Mack model makes a lot of assumptions about independence (i.e. covariance between random processes is 0). This means many of the Variance estimates in the MackChainladder model follow the form of $Var(A+B) = Var(A)+Var(B)$.
Notice the square of mack_std_err_ is simply the sum of the sqaures of parameter_risk_ and process_risk_.
In [14]:
print('Parameter risk and process risk are independent?')
print(round(mack.mack_std_err_**2, 4) == round(mack.parameter_risk_**2 + mack.process_risk_**2, 4))
This independence assumption applies to variance of each origin period.
In [15]:
print('Total Parameter and process risk across origin periods is independent?')
print(round(mack.total_process_risk_**2, 4) == round((mack.process_risk_**2).sum('origin'), 4))
Independence is also assumed to apply to the overall standard error of reserves, total_mack_std_err_.
In [16]:
(mack.total_process_risk_**2 + mack.total_parameter_risk_**2).to_frame().values[0, -1] == \
(mack.total_mack_std_err_**2).values[0,0]
Out[16]:
This over-reliance on independence is one of the weaknesses of the MackChainladder method. Nevertheless, if the data align with this assumption, then total_mack_std_err_ is a reasonable esimator of reserve variability.
The mack_std_err_ at ultimate is the reserve variability for each origin period.
In [17]:
mack.mack_std_err_[mack.mack_std_err_.development==mack.mack_std_err_.development.max()]
Out[17]:
These are probably easier to see in the summary_ of the MackChainladder model.
In [18]:
mack.summary_
Out[18]:
In [19]:
plot_data = mack.summary_.to_frame()
g = plot_data[['Latest', 'IBNR']] \
.plot(kind='bar', stacked=True,
yerr=pd.DataFrame({'latest': plot_data['Mack Std Err']*0,
'IBNR': plot_data['Mack Std Err']}),
ylim=(0, None), title='Mack Chainladder Ultimate')
g.set_xlabel('Accident Year')
g.set_ylabel('Loss');
In [20]:
dist = pd.Series(np.random.normal(mack.ibnr_.sum(),
mack.total_mack_std_err_.values[0, 0], size=10000))
dist.plot(
kind='hist', bins=50,
title="Normally distributed IBNR estimate with a mean of " + '{:,}'.format(round(mack.ibnr_.sum(),0))[:-2]);
The MackChainladder focused on a regression framework for determining the variability of reserve estimates. An alternative approach is to use statistical bootstrapping or sampling from a triangle with replacement to simulate new triangles.
Bootstrapping imposes less model constraints than the MackChainladder which allows for greater applicability in different scenarios. Sampling new triangles can be accomplished through the BootstrapODPSample estimator. This estimator will take a single triangle and simulate new ones from it.
Notice how easy it is to simulate 10,000 new triangles from an existing triangle by accessing the resampled_triangles_ attribute.
In [21]:
samples = cl.BootstrapODPSample(n_sims=10000).fit(tri['CumPaidLoss']).resampled_triangles_
Alternatively, we could use BootstrapODPSample to transform our triangle into a resampled set.
In [22]:
samples = cl.BootstrapODPSample(n_sims=10000).fit_transform(tri['CumPaidLoss'])
The notion of the ODP Bootstrap is that as our simulations approach infinity, we should expect our mean simulation to converge on the basic Chainladder estimate of of reserves.
Let's apply the basic chainladder to our original triangle and also to our simulated triangles to see whether this holds true.
In [23]:
difference = round(1 - cl.Chainladder().fit(samples).ibnr_.sum('origin').mean() / \
cl.Chainladder().fit(tri['CumPaidLoss']).ibnr_.sum())
print("Percentage difference in estimate using original triangle and BootstrapODPSample is " +str(difference))
In [24]:
samples
Out[24]:
In [25]:
pipe = cl.Pipeline([
('dev', cl.Development(average='simple')),
('tail', cl.TailConstant(1.05))])
pipe.fit(samples)
Out[25]:
Now instead of a single cdf_ vector, we have 10,000.
In [26]:
pipe.named_steps.dev.cdf_
Out[26]:
This allows us to look at the varibility of any fitted property used in our prior tutorials.
In [27]:
orig_dev = cl.Development(average='simple').fit(tri['CumPaidLoss'])
resampled_ldf = pipe.named_steps.dev.ldf_
print("12-24 LDF of original Triangle: " + str(round(orig_dev.ldf_.values[0,0,0,0],4)))
pd.Series(resampled_ldf.values[:, 0, 0, 0]).plot(
kind='hist', bins=100,
title='Age 12-14 LDF distribution using Bootstrap');
In [28]:
mack_vs_bs = resampled_ldf[resampled_ldf.origin == resampled_ldf.origin.max()].std('index').to_frame().append(
orig_dev.std_err_.to_frame()).T
mack_vs_bs.columns = ['Mack', 'Bootstrap']
mack_vs_bs.plot(kind='bar', title='Mack Regression Framework LDF Std Err vs Bootstrap Simulated LDF Std Err');
While the MackChainladder produces statistics about the mean and variance of reserve estimates, those have to be fit to a distribution using MLE, MoM, etc to see the range of outcomes of reserves. With BootstrapODPSample based fits, we can use the empirical distribution directly if we choose to.
In [29]:
ibnr = cl.Chainladder().fit(samples).ibnr_.sum('origin')
ibnr_99 = ibnr.quantile(0.99).to_frame().values[0,0]
print("99%-ile of reserve estimate is " +'{:0,}'.format(round(ibnr_99,0)))
Let's see how the MackChainladder reserve distribution compares to the BootstrapODPSample reserve distribution.
In [30]:
ax = ibnr.plot(kind='hist', bins=50, alpha=0.7, color='green').plot()
dist.plot(kind='hist', bins=50, alpha=0.4, color='blue', title='Mack vs Bootstrap Variability');
In [31]:
tri['EarnedPremNet'].latest_diagonal
Out[31]:
Passing an apriori_sigma to the BornhuetterFerguson estimator tells it to consider the apriori selection itself as a random variable. Fitting a stochastic BornhuetterFerguson looks very much like the determinsitic version.
In [32]:
bf = cl.BornhuetterFerguson(apriori=0.65, apriori_sigma=0.10)
bf.fit(samples, sample_weight=tri['EarnedPremNet'].latest_diagonal)
Out[32]:
We can use our knowledge of Triangle manipulation to grab most things we would want out of our model. In this example, we are using the broadcast_axis function of the Triangle which repeats a Triangle object along an axis according to another triangle axis.
In [33]:
# Grab completed triangle replacing simulated known data with actual known data
full_triangle = bf.full_triangle_ - bf.X_ + \
tri['CumPaidLoss'].broadcast_axis('index', samples.index)
# Limiting to the current year for plotting
current_year = full_triangle[full_triangle.origin==full_triangle.origin.max()].to_frame().T
As expected, plotting the expected development of our full triangle over time from the Bootstrap BornhuetterFerguson model fans out to greater uncertainty the farther we get from our valuation date.
In [34]:
# Plot the data
current_year.iloc[:, :200].reset_index(drop=True).plot(
color='green', legend=False, alpha=0.1,
title='Current Accident Year Expected Development Distribution', grid=True);