Import libraries



In [1]:

    
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn import pipeline, preprocessing, compose, linear_model, impute, model_selection

Load data



In [2]:

    
df = pd.read_csv("/data/insurance.csv")
df.head()









    Out[2]:







  
    
      
       age  gender     bmi  children  smoker     region      charges
    0   19  female  27.900         0     yes  southwest  16884.92400
    1   18    male  33.770         1      no  southeast   1725.55230
    2   28    male  33.000         3      no  southeast   4449.46200
    3   33    male  22.705         0      no  northwest  21984.47061
    4   32    male  28.880         0      no  northwest   3866.85520

Create X (the features) and y (the target variable). Apply a log10 transformation to the target to reduce the influence of outliers.



In [3]:

    
target = "charges"
y = np.log10(df[target])
X = df.drop(columns=[target])
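The effect of the transformation, and how to undo it later, can be sketched on the five `charges` values shown in `df.head()` above. This is only an illustration on those five rows, not a statement about the whole dataset.

```python
import numpy as np

# First five `charges` values from df.head() above.
charges = np.array([16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520])

# log10 compresses the range: the raw values differ by more than a factor
# of 10, while the transformed values differ by a little over one unit.
log_charges = np.log10(charges)
print(log_charges.round(3))

# Values on the log scale are mapped back to original units with 10 ** value.
restored = 10 ** log_charges
print(np.allclose(restored, charges))
```

The same inverse, `10 ** prediction`, is what converts a model output back into charges at prediction time.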

Identify the categorical and numeric columns. Categorical columns get an imputer (to replace null values) followed by one-hot encoding. Numeric columns get an imputer (to replace null values), a polynomial transformation, and a standard scaler (z-scoring).



In [4]:

    
cat_columns = ["gender", "smoker", "region"]
num_columns = ["age", "bmi", "children"]
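The lists above are written by hand. As an alternative, a minimal sketch of inferring them from the column dtypes with `select_dtypes` is shown below. The frame `X_demo` and the names `num_cols_inferred` and `cat_cols_inferred` are made up for this illustration, and the two rows are copied from `df.head()`.

```python
import pandas as pd

# Two rows copied from df.head() above, without the target column.
X_demo = pd.DataFrame({
    "age": [19, 18],
    "gender": ["female", "male"],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
})

# Numeric dtypes go to the numeric list; everything else is treated as categorical.
num_cols_inferred = X_demo.select_dtypes(include="number").columns.tolist()
cat_cols_inferred = X_demo.select_dtypes(exclude="number").columns.tolist()

print(num_cols_inferred)
print(cat_cols_inferred)
```

Inference by dtype puts an integer-coded column such as `children` in the numeric list, which matches the choice made here. Any column that is numeric in storage but categorical in meaning would still need to be listed by hand.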

Build the pipelines for the numeric and categorical variables, then search over a hyperparameter grid to tune the model.



In [5]:

    
cat_pipe = pipeline.Pipeline([
    ('imputer', impute.SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', preprocessing.OneHotEncoder(handle_unknown='error', drop="first"))
]) 

num_pipe = pipeline.Pipeline([
    ('imputer', impute.SimpleImputer(strategy='median')),
    ('poly', preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', preprocessing.StandardScaler()),
])

preprocessing_pipe = compose.ColumnTransformer([
    ("cat", cat_pipe, cat_columns),
    ("num", num_pipe, num_columns)
])


estimator_pipe = pipeline.Pipeline([
    ("preprocessing", preprocessing_pipe),
    ("est", linear_model.ElasticNet(random_state=1))
])


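param_grid = {
    # 10 random alpha candidates in [0, 0.02); unseeded, so they differ between runs
    "est__alpha": 0.0 + np.random.random(10) * 0.02,
    # 20 evenly spaced l1_ratio values between 0.0001 and 1
    "est__l1_ratio": np.linspace(0.0001, 1, 20),
}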


gsearch = model_selection.GridSearchCV(estimator_pipe, param_grid, cv = 5, verbose=1, n_jobs=8)

gsearch.fit(X, y)

print(gsearch.best_score_, gsearch.best_params_)









    



Fitting 5 folds for each of 200 candidates, totalling 1000 fits
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    1.6s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    4.3s finished
0.7706238513668783 {'est__alpha': 0.001513687811857405, 'est__l1_ratio': 0.0001}
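The search prints only the best score, which for a regressor is the mean cross-validated R² by default. The sketch below shows two optional refinements on synthetic data: drawing the alpha candidates from a seeded generator so the grid is the same on every run, and listing all candidates from `cv_results_`. The data, the seed, the smaller grid, and the names `X_demo`, `y_demo`, `alphas`, `search`, and `results` are assumptions made for this illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection

# Synthetic regression data, used only to keep the sketch self-contained.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Seeded draw: the same 10 alpha candidates on every run.
alphas = np.sort(np.random.default_rng(1).random(10) * 0.02)

param_grid_demo = {
    "alpha": alphas,
    "l1_ratio": np.linspace(0.0001, 1, 5),
}
search = model_selection.GridSearchCV(
    linear_model.ElasticNet(random_state=1, max_iter=10_000),
    param_grid_demo,
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate, best-ranked candidate first.
results = (
    pd.DataFrame(search.cv_results_)
    [["param_alpha", "param_l1_ratio", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results.head())
```

Looking at the full table rather than only `best_score_` shows how flat or steep the score is around the best candidate.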

Compute the fitted values. Since we did not create a train-test split manually, the predictions cover the entire dataset, including the rows the model was trained on. Plot the residuals.



In [6]:

    
y_pred = gsearch.predict(X)
plt.scatter(y, y_pred - y)
plt.xlabel("Actual")
plt.ylabel("Residual")
plt.title("Residual Plot")









    Out[6]:





Text(0.5, 1.0, 'Residual Plot')
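Residuals computed on the training rows tend to look better than residuals on unseen data. A minimal sketch of holding out a test set is shown below. It uses synthetic data and a plain `ElasticNet` with hand-picked parameters, all of which are assumptions made for this illustration rather than the notebook's fitted pipeline.

```python
import numpy as np
from sklearn import linear_model, model_selection

# Synthetic regression data, used only to keep the sketch self-contained.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([0.5, 1.5, -1.0]) + rng.normal(scale=0.2, size=300)

# Keep 20% of the rows aside; the model never sees them during fitting.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

model = linear_model.ElasticNet(alpha=0.001, l1_ratio=0.5, random_state=1)
model.fit(X_train, y_train)

# Same sign convention as the plot above: predicted minus actual.
residuals_test = model.predict(X_test) - y_test
rmse_test = float(np.sqrt(np.mean(residuals_test ** 2)))
print(f"held-out RMSE: {rmse_test:.3f}, held-out R^2: {model.score(X_test, y_test):.3f}")
```

With the notebook's data, the same pattern would mean splitting `X` and `y` first, fitting the grid search on the training part only, and plotting the residuals for the test part.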

Show a random sample of actual vs. predicted values (both on the log10 scale).



In [7]:

    
pd.DataFrame({"actual": y, "predict": y_pred}).sample(10)

Save the model as a pickle file so that, at prediction time, we can reuse the trained model without training it again from scratch.



In [8]:

    
with open(r"/tmp/model.pickle", "wb") as f:
    pickle.dump(gsearch, f)

Reload the model from disk. In a real use case, you would probably keep the following lines and their dependencies in a separate script file.



In [9]:

    
import pickle
import pandas as pd
with open(r"/tmp/model.pickle", "rb") as f:
    est = pickle.load(f)
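A quick way to check that saving and reloading preserved the model is to compare predictions before and after the round trip. The sketch below does this with a small `LinearRegression` on synthetic data and a temporary directory. The model, the data, and the file location are assumptions made for this illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn import linear_model

# Small synthetic problem, used only to have a fitted model to save.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([2.0, -1.0])

model = linear_model.LinearRegression().fit(X_demo, y_demo)

# Write the fitted model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model.pickle"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from a trusted source, and ideally with the same scikit-learn version that produced them.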

Create a single record with the feature values to get the estimate.



In [10]:

    
record = {"age": 18, "gender": "male", "bmi": 33.0, "smoker": "no", "children": 1, "region": "southeast"}
record









    Out[10]:





{'age': 18,
 'gender': 'male',
 'bmi': 33.0,
 'smoker': 'no',
 'children': 1,
 'region': 'southeast'}

Create a dataframe out of the record.



In [11]:

    
df_input = pd.DataFrame([record])
df_input









    Out[11]:







  
    
      
       age  gender   bmi  smoker  children     region
    0   18    male  33.0      no         1  southeast

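Get the prediction for df_input. The model predicts log10(charges), so raise 10 to the power of the prediction to get the charge in original units.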



In [12]:

    
10 ** est.predict(df_input)









    Out[12]:





array([2999.0394772])
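The last three steps (build a record, turn it into a one-row dataframe, invert the log10 transform) can be wrapped in one helper for use in a prediction script. This is a sketch: `predict_charges` and `ConstantLogModel` are hypothetical names introduced here, and the stand-in model exists only so the example runs without the pickle file.

```python
import numpy as np
import pandas as pd


def predict_charges(model, record):
    """Return the predicted charge for one record, in original units.

    `model` is any object whose predict() returns log10(charges).
    """
    df_one = pd.DataFrame([record])    # one-row frame, columns from the dict keys
    log_pred = model.predict(df_one)   # prediction on the log10 scale
    return float(10 ** log_pred[0])    # invert the log10 transform


class ConstantLogModel:
    """Hypothetical stand-in for the unpickled model; always predicts log10 = 3."""

    def predict(self, X):
        return np.full(len(X), 3.0)


record = {"age": 18, "gender": "male", "bmi": 33.0,
          "smoker": "no", "children": 1, "region": "southeast"}
print(predict_charges(ConstantLogModel(), record))
```

In the notebook's setting, the call would be `predict_charges(est, record)` with the `est` object loaded from the pickle file.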

Output of In [7] (a random sample of 10 rows; both columns are log10 of charges):

            actual   predict
    49    4.587814  4.444098
    45    4.314505  4.018335
    665   4.629006  4.595823
    1097  3.223919  3.459821
    584   3.094407  3.309120
    470   3.397425  3.545149
    309   3.889254  3.949224
    179   3.931371  4.007766
    917   4.544928  4.508985
    991   3.853994  3.902824
