Import libraries


In [1]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn import pipeline, preprocessing, compose, linear_model, impute, model_selection

Load data


In [2]:
df = pd.read_csv("/data/insurance.csv")
df.head()


Out[2]:
age gender bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520

Create X (the features) and y (the target variable). Apply a log transformation to the target to reduce the influence of outliers.


In [3]:
target = "charges"
y = np.log10(df[target])
X = df.drop(columns=[target])
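
As a rough illustration of why the log transformation helps, the sketch below compares skewness before and after `np.log10`. It uses synthetic right-skewed values rather than the insurance data; the lognormal parameters and the seed are arbitrary assumptions. It also checks that `10 ** y` recovers the original scale, which matters later when predictions are converted back.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a cost column (an assumption, not the real data).
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=1000), name="charges")

log_charges = np.log10(charges)

# Skewness near 0 means a roughly symmetric distribution.
skew_raw = charges.skew()
skew_log = log_charges.skew()
print(f"skew before: {skew_raw:.2f}, after log10: {skew_log:.2f}")

# The transform is invertible, so values can be mapped back to the original scale.
recovered = 10 ** log_charges
```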

Identify the categorical and numeric columns. Categorical columns get an imputer (to replace null values) followed by one-hot encoding. Numeric columns get an imputer (to replace null values), a polynomial transformation, and a standard scaler (z-scoring).


In [4]:
cat_columns = ["gender", "smoker", "region"]
num_columns = ["age", "bmi", "children"]
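
The column lists above are written by hand. As an alternative, they could be derived from the dtypes, as in this minimal sketch. The `sample` frame and the `*_auto` names are hypothetical stand-ins, and the approach assumes every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Tiny stand-in frame with the same feature names as the dataset.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "gender": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
})

# Split the columns by dtype; the original column order is preserved.
cat_cols_auto = sample.select_dtypes(exclude="number").columns.tolist()
num_cols_auto = sample.select_dtypes(include="number").columns.tolist()
print(cat_cols_auto, num_cols_auto)
```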

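Build pipelines for the numeric and categorical variables, then search over a hyperparameter grid to tune the model. Note that the alpha candidates are drawn at random without a seed, so they change from run to run.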


In [5]:
cat_pipe = pipeline.Pipeline([
    ('imputer', impute.SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', preprocessing.OneHotEncoder(handle_unknown='error', drop="first"))
]) 

num_pipe = pipeline.Pipeline([
    ('imputer', impute.SimpleImputer(strategy='median')),
    ('poly', preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', preprocessing.StandardScaler()),
])

preprocessing_pipe = compose.ColumnTransformer([
    ("cat", cat_pipe, cat_columns),
    ("num", num_pipe, num_columns)
])


estimator_pipe = pipeline.Pipeline([
    ("preprocessing", preprocessing_pipe),
    ("est", linear_model.ElasticNet(random_state=1))
])


param_grid = {
    "est__alpha": 0.0 + np.random.random(10) * 0.02,
    "est__l1_ratio": np.linspace(0.0001, 1, 20),
}


gsearch = model_selection.GridSearchCV(estimator_pipe, param_grid, cv = 5, verbose=1, n_jobs=8)

gsearch.fit(X, y)

print(gsearch.best_score_, gsearch.best_params_)


Fitting 5 folds for each of 200 candidates, totalling 1000 fits
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    1.6s
0.7706238513668783 {'est__alpha': 0.001513687811857405, 'est__l1_ratio': 0.0001}
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    4.3s finished
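
Because no `scoring` argument is passed, GridSearchCV falls back to the estimator's own score method, which for a regressor such as ElasticNet is R². The printed best score is therefore a mean cross-validated R² on the log10 target. To make the search repeatable, the alpha candidates could come from a seeded generator, as in this sketch; the seed value and the name `param_grid_seeded` are arbitrary assumptions.

```python
import numpy as np

# A seeded generator yields the same alpha candidates on every run.
rng = np.random.default_rng(42)  # seed value is arbitrary

param_grid_seeded = {
    "est__alpha": rng.random(10) * 0.02,           # 10 values in [0, 0.02)
    "est__l1_ratio": np.linspace(0.0001, 1, 20),   # 20 evenly spaced values
}

# The grid search tries every combination of the two lists.
n_candidates = (len(param_grid_seeded["est__alpha"])
                * len(param_grid_seeded["est__l1_ratio"]))
print(n_candidates)
```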

Compute the fitted values. Because we did not hold out a test set, the predictions cover the entire dataset, including the rows the model was trained on. Plot the residuals.


In [6]:
y_pred = gsearch.predict(X)
plt.scatter(y, y_pred - y)
plt.xlabel("Actual")
plt.ylabel("Residual")
plt.title("Residual Plot")


Out[6]:
Text(0.5, 1.0, 'Residual Plot')
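
Residuals computed on the training rows tend to look better than they would on unseen data. A held-out evaluation could look roughly like the sketch below. It runs on synthetic data with a reduced pipeline (no imputation or polynomial features); the data-generating formula, the fixed alpha, and all `*_demo` names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import compose, linear_model, model_selection, pipeline, preprocessing

# Synthetic stand-in data (an assumption; column names mirror the notebook).
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 5, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
y_demo = (3.0 + 0.01 * X_demo["age"] + 0.5 * (X_demo["smoker"] == "yes")
          + rng.normal(0, 0.05, size=n))

# Keep 20% of the rows aside; the model never sees them during fitting.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

prep = compose.ColumnTransformer([
    ("cat", preprocessing.OneHotEncoder(drop="first"), ["smoker"]),
    ("num", preprocessing.StandardScaler(), ["age", "bmi"]),
])
model = pipeline.Pipeline([
    ("preprocessing", prep),
    ("est", linear_model.ElasticNet(alpha=0.001, random_state=1)),
])
model.fit(X_train, y_train)

# R^2 on the held-out rows only.
test_r2 = model.score(X_test, y_test)
print(f"held-out R^2: {test_r2:.3f}")
```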

Show a few actual values next to the predicted values (both on the log10 scale).


In [7]:
pd.DataFrame({"actual": y, "predict": y_pred}).sample(10)


Out[7]:
actual predict
49 4.587814 4.444098
45 4.314505 4.018335
665 4.629006 4.595823
1097 3.223919 3.459821
584 3.094407 3.309120
470 3.397425 3.545149
309 3.889254 3.949224
179 3.931371 4.007766
917 4.544928 4.508985
991 3.853994 3.902824

Save the model as a pickle file so that it can be reused for prediction without retraining from scratch.


In [8]:
with open(r"/tmp/model.pickle", "wb") as f:
    pickle.dump(gsearch, f)

Reload the model from disk. In a real use case, you would probably keep the following lines and their dependencies in a separate script file.


In [9]:
import pickle
import pandas as pd
with open(r"/tmp/model.pickle", "rb") as f:
    est = pickle.load(f)
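
A quick way to confirm that a save and reload cycle preserves the model is to compare predictions before and after. The sketch below does this with a small LinearRegression and a temporary directory; both are stand-ins chosen to keep the example self-contained, not the notebook's model or path. A pickle should generally be loaded with the same scikit-learn version that wrote it, and only from a trusted source.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import linear_model

# Small stand-in model (an assumption; the notebook pickles the fitted grid search).
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([0.0, 1.0, 2.0, 3.0])
fitted = linear_model.LinearRegression().fit(X_small, y_small)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")
    with open(path, "wb") as f:
        pickle.dump(fitted, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(fitted.predict(X_small), restored.predict(X_small))
print(same_predictions)
```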

Create a single record with the feature values to get the estimate.


In [10]:
record = {"age": 18, "gender": "male", "bmi": 33.0, "smoker": "no", "children": 1, "region": "southeast"}
record


Out[10]:
{'age': 18,
 'gender': 'male',
 'bmi': 33.0,
 'smoker': 'no',
 'children': 1,
 'region': 'southeast'}

Create a dataframe out of the record.


In [11]:
df_input = pd.DataFrame([record])  # one dict per row
df_input


Out[11]:
age gender bmi smoker children region
0 18 male 33.0 no 1 southeast

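Get the prediction for df_input. The model predicts log10(charges), so raising 10 to the predicted value returns the estimate in the original units.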


In [12]:
10 ** est.predict(df_input)


Out[12]:
array([2999.0394772])
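
The last three steps (build a record, wrap it in a dataframe, undo the log) could be bundled into one helper in a prediction script. The sketch below is one possible shape: `predict_charges` is a hypothetical name, and `ConstantLogModel` is a stand-in with a fixed output so the example runs without the pickled model.

```python
import numpy as np
import pandas as pd


def predict_charges(model, record):
    """Hypothetical helper: return the estimate on the original (non-log) scale."""
    frame = pd.DataFrame([record])      # a single-row dataframe
    log_pred = model.predict(frame)     # the model outputs log10(charges)
    return float(10 ** log_pred[0])     # invert the log10 transformation


class ConstantLogModel:
    """Stand-in for the unpickled estimator; always predicts log10(charges) = 3."""

    def predict(self, frame):
        return np.full(len(frame), 3.0)


record = {"age": 18, "gender": "male", "bmi": 33.0,
          "smoker": "no", "children": 1, "region": "southeast"}
estimate = predict_charges(ConstantLogModel(), record)
print(estimate)
```

With the real model, the first argument would be the object loaded from the pickle file.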