I am interested in the best way to combine continuous and categorical features in a regression model. Here are two sample situations where this matters.
A ride-share business has two rate systems: a normal price and a surge price, where the surge price has a higher price per mile. A company that uses the ride-share service frequently would like a model that predicts how much a ride will cost as a function of the trip distance and whether the trip is charged at the normal or the surge rate. They have collected fare information from the last few months and created a plot to visualize it.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(0)  # fix the seed so the synthetic data is reproducible
n_points = 100
trip_distance = np.random.rand(n_points) * 27.3
categories = np.array(['Normal', 'Surge'])
# ~20% of trips are surge-priced (code 1); the rest are normal (code 0)
slope_fare_code = 1 * (np.random.rand(n_points) > 0.8)
In [2]:
# Per-mile rates for the two fare types, plus a little noise
slopes = np.array([1.78, 4.77])
slope_fare = (trip_distance * slopes[slope_fare_code] * (1 + 0.02 * np.random.rand(n_points))
              + 3 * np.random.rand(n_points))
slope_data = pd.DataFrame({'trip_distance': trip_distance,
                           'slope_fare_type': categories[slope_fare_code],
                           'slope_fare_code': slope_fare_code,
                           'slope_fare': slope_fare})
slope_data['slope_fare_type'] = slope_data['slope_fare_type'].astype('category')
In [3]:
groups = slope_data.groupby('slope_fare_type')
# Plot each fare type on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['s', 'o']
colors = ['b', 'r']
for name, group in groups:
    code = group['slope_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['slope_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[3]: [scatter plot of trip fare vs. trip distance for the Normal and Surge fare types]
A taxi company has two types of vehicles: a normal cab and a premium (luxury) cab. Both charge the same rate per mile, but the luxury cab adds a flat surcharge to the fare. A company that uses this service would like a model that predicts how much a ride will cost as a function of the trip distance for both the normal cab and the luxury cab.
In [4]:
categories = np.array(['Normal', 'Luxury'])
# ~20% of trips use the luxury cab (code 1); the rest are normal (code 0)
intercept_fare_code = 1 * (np.random.rand(n_points) > 0.8)
# Shared per-mile rate; the fare types differ only in the flat surcharge
intercepts = np.array([1.54, 8.26])
intercept_fare = (trip_distance * (2.54 + 0.02 * np.random.rand(n_points)) +
                  (3 + 0.02 * np.random.rand(n_points)) * intercepts[intercept_fare_code])
intercept_data = pd.DataFrame({'trip_distance': trip_distance,
                               'intercept_fare_type': categories[intercept_fare_code],
                               'intercept_fare_code': intercept_fare_code,
                               'intercept_fare': intercept_fare})
intercept_data['intercept_fare_type'] = intercept_data['intercept_fare_type'].astype('category')
groups = intercept_data.groupby('intercept_fare_type')
# Plot each fare type on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['o', 's']
colors = ['m', 'g']
for name, group in groups:
    code = group['intercept_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['intercept_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[4]: [scatter plot of trip fare vs. trip distance for the Normal and Luxury cab types]
I first use a straight linear model with the trip distance and fare type as the two independent variables (converting the fare type into an integer-coded categorical feature first). I model the ride-share data and then predict the trip fares from the model. If the model works, the predicted points should lie directly on top of (or very close to) the actual data points. However, they do not. This indicates that a linear model using the categorical feature as a simple numeric independent variable is not the right approach.
In [5]:
from sklearn.linear_model import LinearRegression

slope_model_1 = LinearRegression()
slope_features = ['trip_distance', 'slope_fare_code']
slope_model_1.fit(slope_data[slope_features], slope_data['slope_fare'])
slope_data['model1_predictions'] = slope_model_1.predict(slope_data[slope_features])

groups = slope_data.groupby('slope_fare_type')
# Plot each fare type and the model predictions on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['s', 'o']
colors = ['b', 'r']
for name, group in groups:
    code = group['slope_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['slope_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.scatter(slope_data['trip_distance'], slope_data['model1_predictions'],
           s=20, label='Predictions', marker='^', color='k')
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[5]: [ride-share data with the model predictions overlaid; the predictions miss the data points]
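One way to see why it fails (a quick check whose exact numbers depend on the random draw) is to inspect the fitted parameters: the integer-coded model has a single slope shared by both fare types, plus one additive offset for the fare-type code, so it can never recover two different per-mile rates.
In [ ]:
# A sketch: one shared trip_distance coefficient (a compromise between
# the two true per-mile rates) and one level shift for slope_fare_code.
print(slope_model_1.intercept_)
print(dict(zip(slope_features, slope_model_1.coef_)))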
How about the taxi fare data? The linear model looks like it works just fine. The reason is that the two fare types correspond to different y-intercepts in the data, so the model can fit the intercept shift separately from the (shared) slope and reach a point where it works.
In [6]:
intercept_model_1 = LinearRegression()
intercept_features = ['trip_distance', 'intercept_fare_code']
intercept_model_1.fit(intercept_data[intercept_features], intercept_data['intercept_fare'])
intercept_data['model1_predictions'] = intercept_model_1.predict(intercept_data[intercept_features])

groups = intercept_data.groupby('intercept_fare_type')
# Plot each fare type and the model predictions on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['o', 's']
colors = ['m', 'g']
for name, group in groups:
    code = group['intercept_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['intercept_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.scatter(intercept_data['trip_distance'], intercept_data['model1_predictions'],
           s=20, label='Predictions', marker='^', color='k')
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[6]: [taxi data with the model predictions overlaid; the predictions track the data points closely]
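Inspecting the fitted parameters backs this up (again, the exact values vary with the random draw): the coefficient on the fare-type code absorbs the luxury surcharge, which is all this data set requires.
In [ ]:
# A sketch: the intercept_fare_code coefficient captures the flat
# surcharge, so a single shared slope suffices.
print(intercept_model_1.intercept_)
print(dict(zip(intercept_features, intercept_model_1.coef_)))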
There is another approach to try: instead of encoding the fare type as a single categorical feature with arbitrary integer values, I split the fare types into separate feature columns that represent the categories with binary data. This type of feature is called a dummy or one-hot feature.
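For example, on a toy series (hypothetical values, purely for illustration), pd.get_dummies turns each category into its own indicator column:
In [ ]:
# Each category becomes a 0/1 column (boolean in recent pandas versions)
print(pd.get_dummies(pd.Series(['Normal', 'Surge', 'Normal'])))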
In [7]:
# Build the one-hot (dummy) encoding: one binary column per fare type
slope_dummydf = pd.get_dummies(slope_data['slope_fare_type'])
slope_data = slope_data.merge(slope_dummydf, left_index=True, right_index=True)

slope_model_2 = LinearRegression()
slope_features = ['trip_distance', 'Normal', 'Surge']
slope_model_2.fit(slope_data[slope_features], slope_data['slope_fare'])
slope_data['model2_predictions'] = slope_model_2.predict(slope_data[slope_features])

groups = slope_data.groupby('slope_fare_type')
# Plot each fare type and the model predictions on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['s', 'o']
colors = ['b', 'r']
for name, group in groups:
    code = group['slope_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['slope_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.scatter(slope_data['trip_distance'], slope_data['model2_predictions'],
           s=20, label='Predictions', marker='^', color='k')
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[7]: [ride-share data with the one-hot model predictions; still a poor fit]
As the plot shows, one-hot encoding does not rescue the ride-share model: this kind of feature engineering has the same drawback as the integer-coded categorical feature, because the model can still only fit different intercepts, not different slopes. On the other hand, the dummy-variable approach works well for the taxi data, where an intercept shift is exactly what is needed.
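One caveat worth flagging as an aside (the fit below does not depend on it): including both dummy columns alongside LinearRegression's own intercept makes the design matrix collinear. scikit-learn still returns a fit, but a common remedy is to drop one of the columns, for example:
In [ ]:
# drop_first=True drops the first category level; the dropped level
# becomes the baseline absorbed by the model's intercept.
pd.get_dummies(intercept_data['intercept_fare_type'], drop_first=True)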
In [8]:
# Build the one-hot (dummy) encoding for the taxi data
intercept_dummydf = pd.get_dummies(intercept_data['intercept_fare_type'])
intercept_data = intercept_data.merge(intercept_dummydf, left_index=True, right_index=True)

intercept_model_2 = LinearRegression()
intercept_features = ['trip_distance', 'Normal', 'Luxury']
intercept_model_2.fit(intercept_data[intercept_features], intercept_data['intercept_fare'])
intercept_data['model2_predictions'] = intercept_model_2.predict(intercept_data[intercept_features])

groups = intercept_data.groupby('intercept_fare_type')
# Plot each fare type and the model predictions on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['o', 's']
colors = ['m', 'g']
for name, group in groups:
    code = group['intercept_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['intercept_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.scatter(intercept_data['trip_distance'], intercept_data['model2_predictions'],
           s=20, label='Predictions', marker='^', color='k')
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[8]: [taxi data with the one-hot model predictions; a good fit]
What if I knew beforehand that the model might have an interaction between the trip distance and the fare type? Instead of a linear model that treats the two features as independent, we can add an interaction term that lets them act together. In this type of model, the slope is allowed to differ by fare type, and the model works for both cases; the predictions land very close to the target points.
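Concretely (a sketch of the model form, with b0..b3 denoting fitted coefficients), the interaction model is fare ≈ b0 + b1·distance + b2·code + b3·(distance × code): trips with code = 0 get slope b1, while trips with code = 1 get slope b1 + b3 and intercept b0 + b2. PolynomialFeatures with interaction_only=True builds exactly that product column:
In [ ]:
# For the two inputs [distance, code], the transform emits the columns
# [distance, code, distance*code] (with the default degree=2).
from sklearn.preprocessing import PolynomialFeatures
demo = PolynomialFeatures(interaction_only=True, include_bias=False)
print(demo.fit_transform([[10.0, 0.0], [10.0, 1.0]]))
# [[10.  0.  0.]
#  [10.  1. 10.]]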
In [9]:
from sklearn.preprocessing import PolynomialFeatures

# interaction_only=True adds the product column trip_distance * slope_fare_code
slope_features = ['trip_distance', 'slope_fare_code']
slope_poly = PolynomialFeatures(interaction_only=True, include_bias=False)
slope_input = slope_poly.fit_transform(slope_data[slope_features])

slope_model_3 = LinearRegression()
slope_model_3.fit(slope_input, slope_data['slope_fare'])
slope_data['model3_predictions'] = slope_model_3.predict(slope_input)

groups = slope_data.groupby('slope_fare_type')
# Plot each fare type and the model predictions on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['s', 'o']
colors = ['b', 'r']
for name, group in groups:
    code = group['slope_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['slope_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.scatter(slope_data['trip_distance'], slope_data['model3_predictions'],
           s=20, label='Predictions', marker='^', color='k')
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[9]: [ride-share data with the interaction-model predictions; a good fit]
In [10]:
# Same interaction transform for the taxi data
intercept_features = ['trip_distance', 'intercept_fare_code']
intercept_poly = PolynomialFeatures(interaction_only=True, include_bias=False)
intercept_input = intercept_poly.fit_transform(intercept_data[intercept_features])

intercept_model_3 = LinearRegression()
intercept_model_3.fit(intercept_input, intercept_data['intercept_fare'])
intercept_data['model3_predictions'] = intercept_model_3.predict(intercept_input)

groups = intercept_data.groupby('intercept_fare_type')
# Plot each fare type and the model predictions on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['o', 's']
colors = ['m', 'g']
for name, group in groups:
    code = group['intercept_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['intercept_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.scatter(intercept_data['trip_distance'], intercept_data['model3_predictions'],
           s=20, label='Predictions', marker='^', color='k')
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[10]: [taxi data with the interaction-model predictions; a good fit]
But I did not know before starting that the model was linear or that the input features interacted. Another approach is to use a decision tree instead of a linear regression. Gradient-boosted decision trees generalize the single decision tree with a few helpful features (ensemble averaging, a tunable learning rate, etc.). The drawback is that this non-linear model can overfit the data, so there is a trade-off between model performance and the simplicity of the approach. I find that both the ride-share and the taxi models fit reasonably well with a little hyperparameter tuning.
In [11]:
from sklearn.ensemble import GradientBoostingRegressor

slope_gbm = GradientBoostingRegressor(n_estimators=500, min_samples_leaf=5,
                                      learning_rate=0.1)
slope_features = ['trip_distance', 'slope_fare_code']
slope_gbm.fit(slope_data[slope_features], slope_data['slope_fare'])
slope_data['model4_predictions'] = slope_gbm.predict(slope_data[slope_features])

groups = slope_data.groupby('slope_fare_type')
# Plot each fare type and the model predictions on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['s', 'o']
colors = ['b', 'r']
for name, group in groups:
    code = group['slope_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['slope_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.scatter(slope_data['trip_distance'], slope_data['model4_predictions'],
           s=20, label='Predictions', marker='^', color='k')
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[11]: [ride-share data with the gradient-boosting predictions; a good fit]
In [12]:
intercept_gbm = GradientBoostingRegressor(n_estimators=500, min_samples_leaf=5,
                                          learning_rate=0.1)
intercept_features = ['trip_distance', 'intercept_fare_code']
intercept_gbm.fit(intercept_data[intercept_features], intercept_data['intercept_fare'])
intercept_data['model4_predictions'] = intercept_gbm.predict(intercept_data[intercept_features])

groups = intercept_data.groupby('intercept_fare_type')
# Plot each fare type and the model predictions on the same axes
trainfig, ax = plt.subplots(figsize=(10, 5))
ax.margins(0.05)  # optional: adds 5% padding to the autoscaling
markers = ['o', 's']
colors = ['m', 'g']
for name, group in groups:
    code = group['intercept_fare_code'].iloc[0]
    ax.scatter(group['trip_distance'], group['intercept_fare'], s=80,
               label=name, marker=markers[code], color=colors[code])
ax.scatter(intercept_data['trip_distance'], intercept_data['model4_predictions'],
           s=20, label='Predictions', marker='^', color='k')
ax.set_xlabel('Trip distance (mi)')
ax.set_ylabel('Trip Fare ($)')
ax.set_xlim(0, 30)
ax.set_ylim(0, 150)
ax.legend(bbox_to_anchor=(0.3, 0.9), title='Fare Type')
Out[12]: [taxi data with the gradient-boosting predictions; a good fit]
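Because the cells above fit and score on the same points, a quick hold-out check (a sketch; the exact R² values depend on the random draw and the hyperparameters) helps gauge the overfitting risk mentioned earlier:
In [ ]:
from sklearn.model_selection import train_test_split

X = slope_data[['trip_distance', 'slope_fare_code']]
y = slope_data['slope_fare']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
gbm = GradientBoostingRegressor(n_estimators=500, min_samples_leaf=5,
                                learning_rate=0.1)
gbm.fit(X_train, y_train)
print('train R^2:', gbm.score(X_train, y_train))
print('test  R^2:', gbm.score(X_test, y_test))
# A large gap between the two scores signals overfitting.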
Although the linear model is simple and powerful, using it with categorical features is problematic, and it becomes even more so when there are many continuous and categorical features. An alternative approach that works well is a decision-tree-based regression model, which reduces the need to know beforehand how the features interact. The notebooks (both R and Python) used in this example are available on GitHub (link-to-supplemental-material-section-on-ML-course-repo).