Logistic Regression

Agenda

  1. Refresh your memory on how to do linear regression in scikit-learn
  2. Attempt to use linear regression for classification
  3. Show you why logistic regression is a better alternative for classification
  4. Brief overview of probability, odds, e, log, and log-odds
  5. Explain the form of logistic regression
  6. Explain how to interpret logistic regression coefficients
  7. Demonstrate how logistic regression works with categorical features
  8. Compare logistic regression with other models

Part 1: Predicting a Continuous Response


In [1]:
# glass identification dataset
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass.sort_values('al', inplace=True)
glass.head()


Out[1]:
ri na mg al si k ca ba fe glass_type
id
22 1.51966 14.77 3.75 0.29 72.02 0.03 9.00 0.0 0.00 1
185 1.51115 17.38 0.00 0.34 75.41 0.00 6.65 0.0 0.00 6
40 1.52213 14.21 3.82 0.47 71.77 0.11 9.57 0.0 0.00 1
39 1.52213 14.21 3.82 0.47 71.77 0.11 9.57 0.0 0.00 1
51 1.52320 13.72 3.72 0.51 71.75 0.09 10.06 0.0 0.16 1

Question: Pretend that we want to predict ri, and our only feature is al. How could we do it using machine learning?

Answer: We could frame it as a regression problem, and use a linear regression model with al as the only feature and ri as the response.

Question: How would we visualize this model?

Answer: Create a scatter plot with al on the x-axis and ri on the y-axis, and draw the line of best fit.


In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(font_scale=1.5)

In [6]:
sns.lmplot(x='al', y='ri', data=glass, ci=None)


Out[6]:
<seaborn.axisgrid.FacetGrid at 0x16e1f048>

Question: How would we draw this plot without using Seaborn?


In [7]:
# scatter plot using Pandas
glass.plot(kind='scatter', x='al', y='ri')


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x16c95278>

In [131]:
# equivalent scatter plot using Matplotlib
plt.scatter(glass.al, glass.ri)
plt.xlabel('al')
plt.ylabel('ri')


Out[131]:
<matplotlib.text.Text at 0x20930358>

In [9]:
# fit a linear regression model
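# a minimal sketch of this step, assuming scikit-learn's LinearRegression;
# the names feature_cols, X, y, and linreg are reused by the cells below
from sklearn.linear_model import LinearRegression
feature_cols = ['al']
X = glass[feature_cols]        # feature matrix (a single column)
y = glass.ri                   # continuous response
linreg = LinearRegression().fit(X, y)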

In [8]:
# make predictions for all values of X and add back to the original dataframe
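# a sketch: store the fitted values in a new 'ri_pred' column (this column
# appears in the dataframe output further below)
glass['ri_pred'] = linreg.predict(X)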

In [134]:
# plot those predictions connected by a line
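# a sketch: because the rows are sorted by 'al', connecting the predictions
# in order traces out the fitted line
plt.plot(glass.al, glass.ri_pred, color='red')
plt.xlabel('al')
plt.ylabel('ri')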


Out[134]:
<matplotlib.text.Text at 0x210a1470>

In [135]:
# put the plots together
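# a sketch: scatter plot of the data plus the fitted line on the same axes
plt.scatter(glass.al, glass.ri)
plt.plot(glass.al, glass.ri_pred, color='red')
plt.xlabel('al')
plt.ylabel('ri')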


Out[135]:
<matplotlib.text.Text at 0x1f50cf60>

Refresher: interpreting linear regression coefficients

Linear regression equation: $y = \beta_0 + \beta_1x$


In [136]:
# compute prediction for al=2 using the equation
linreg.intercept_ + linreg.coef_ * 2


Out[136]:
array([ 1.51699012])

In [137]:
# compute prediction for al=2 using the predict method
linreg.predict([[2]])


Out[137]:
array([ 1.51699012])

In [138]:
# examine coefficient for al
list(zip(feature_cols, linreg.coef_))


Out[138]:
[('al', -0.0024776063874696226)]

Interpretation: A 1 unit increase in 'al' is associated with a 0.0025 unit decrease in 'ri'.


In [139]:
# increasing al by 1 (so that al=3) decreases ri by 0.0025
1.51699012 - 0.0024776063874696243


Out[139]:
1.5145125136125304

In [140]:
# compute prediction for al=3 using the predict method
linreg.predict([[3]])


Out[140]:
array([ 1.51451251])

Part 2: Predicting a Categorical Response


In [141]:
# examine glass_type
glass.glass_type.value_counts().sort_index()


Out[141]:
1    70
2    76
3    17
5    13
6     9
7    29
dtype: int64

In [142]:
# types 1, 2, 3 are window glass
# types 5, 6, 7 are household glass
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
glass.head()


Out[142]:
ri na mg al si k ca ba fe glass_type ri_pred household
id
22 1.51966 14.77 3.75 0.29 72.02 0.03 9.00 0 0.00 1 1.521227 0
185 1.51115 17.38 0.00 0.34 75.41 0.00 6.65 0 0.00 6 1.521103 1
40 1.52213 14.21 3.82 0.47 71.77 0.11 9.57 0 0.00 1 1.520781 0
39 1.52213 14.21 3.82 0.47 71.77 0.11 9.57 0 0.00 1 1.520781 0
51 1.52320 13.72 3.72 0.51 71.75 0.09 10.06 0 0.16 1 1.520682 0

Let's change our task, so that we're predicting household using al. Let's visualize the relationship to figure out how to do this:


In [143]:
plt.scatter(glass.al, glass.household)
plt.xlabel('al')
plt.ylabel('household')


Out[143]:
<matplotlib.text.Text at 0x21302c18>

Let's draw a regression line, like we did before:


In [144]:
# fit a linear regression model and store the predictions
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
linreg.fit(X, y)
glass['household_pred'] = linreg.predict(X)

In [145]:
# scatter plot that includes the regression line
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred, color='red')
plt.xlabel('al')
plt.ylabel('household')


Out[145]:
<matplotlib.text.Text at 0x216cbf28>

If al=3, what class do we predict for household? 1

If al=1.5, what class do we predict for household? 0

We predict the 0 class for lower values of al, and the 1 class for higher values of al. What's our cutoff value? Around al=2, because that's where the regression line crosses 0.5, the midpoint between predicting class 0 and class 1.

Therefore, we'll say that if household_pred >= 0.5, we predict a class of 1, else we predict a class of 0.


In [146]:
# understanding np.where
import numpy as np
nums = np.array([5, 15, 8])

# np.where returns the first value if the condition is True, and the second value if the condition is False
np.where(nums > 10, 'big', 'small')


Out[146]:
array(['small', 'big', 'small'], 
      dtype='|S5')

In [147]:
# transform household_pred to 1 or 0
glass['household_pred_class'] = np.where(glass.household_pred >= 0.5, 1, 0)
glass.head()


Out[147]:
ri na mg al si k ca ba fe glass_type ri_pred household household_pred household_pred_class
id
22 1.51966 14.77 3.75 0.29 72.02 0.03 9.00 0 0.00 1 1.521227 0 -0.340495 0
185 1.51115 17.38 0.00 0.34 75.41 0.00 6.65 0 0.00 6 1.521103 1 -0.315436 0
40 1.52213 14.21 3.82 0.47 71.77 0.11 9.57 0 0.00 1 1.520781 0 -0.250283 0
39 1.52213 14.21 3.82 0.47 71.77 0.11 9.57 0 0.00 1 1.520781 0 -0.250283 0
51 1.52320 13.72 3.72 0.51 71.75 0.09 10.06 0 0.16 1 1.520682 0 -0.230236 0

In [148]:
# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')


Out[148]:
<matplotlib.text.Text at 0x21a79588>

Part 3: Using Logistic Regression Instead

Logistic regression can do what we just did:


In [149]:
# fit a logistic regression model and store the class predictions
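# a sketch: the large C mirrors the estimator settings printed later in the
# notebook and effectively removes regularization
from sklearn.linear_model import LogisticRegression
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
logreg = LogisticRegression(C=1e9).fit(X, y)
glass['household_pred_class'] = logreg.predict(X)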

In [150]:
# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')


Out[150]:
<matplotlib.text.Text at 0x21aa2198>

What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?


In [151]:
# store the predicted probabilities of class 1
glass['household_pred_prob'] = logreg.predict_proba(X)[:, 1]

In [152]:
# plot the predicted probabilities
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')


Out[152]:
<matplotlib.text.Text at 0x216cb8d0>

In [153]:
# examine some example predictions
print(logreg.predict_proba([[1]]))
print(logreg.predict_proba([[2]]))
print(logreg.predict_proba([[3]]))


[[ 0.97161726  0.02838274]]
[[ 0.34361555  0.65638445]]
[[ 0.00794192  0.99205808]]

The first column indicates the predicted probability of class 0, and the second column indicates the predicted probability of class 1.

Part 4: Probability, odds, e, log, log-odds

$$probability = \frac {one\ outcome} {all\ outcomes}$$

$$odds = \frac {one\ outcome} {all\ other\ outcomes}$$

Examples:

  • Dice roll of 1: probability = 1/6, odds = 1/5
  • Even dice roll: probability = 3/6, odds = 3/3 = 1
  • Dice roll less than 5: probability = 4/6, odds = 4/2 = 2

$$odds = \frac {probability} {1 - probability}$$

$$probability = \frac {odds} {1 + odds}$$
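
For example, plugging the dice roll of 1 from the list above into these formulas recovers the same numbers:

$$odds = \frac{1/6}{1 - 1/6} = \frac{1}{5}, \qquad probability = \frac{1/5}{1 + 1/5} = \frac{1}{6}$$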

In [154]:
# create a table of probability versus odds
table = pd.DataFrame({'probability':[0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability/(1 - table.probability)
table


Out[154]:
probability odds
0 0.10 0.111111
1 0.20 0.250000
2 0.25 0.333333
3 0.50 1.000000
4 0.60 1.500000
5 0.80 4.000000
6 0.90 9.000000

What is e? It is the base rate of growth shared by all continually growing processes:


In [155]:
# exponential function: e^1
np.exp(1)


Out[155]:
2.7182818284590451

What is a (natural) log? It gives you the time needed to reach a certain level of growth:


In [156]:
# time needed to grow 1 unit to 2.718 units
np.log(2.718)


Out[156]:
0.99989631572895199

It is also the inverse of the exponential function:


In [157]:
np.log(np.exp(5))


Out[157]:
5.0

In [158]:
# add log-odds to the table
table['logodds'] = np.log(table.odds)
table


Out[158]:
probability odds logodds
0 0.10 0.111111 -2.197225
1 0.20 0.250000 -1.386294
2 0.25 0.333333 -1.098612
3 0.50 1.000000 0.000000
4 0.60 1.500000 0.405465
5 0.80 4.000000 1.386294
6 0.90 9.000000 2.197225

Part 5: What is Logistic Regression?

Linear regression: continuous response is modeled as a linear combination of the features:

$$y = \beta_0 + \beta_1x$$

Logistic regression: log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:

$$\log \left({p\over 1-p}\right) = \beta_0 + \beta_1x$$

This is called the logit function.

Probability is sometimes written as $\pi$:

$$\log \left({\pi\over 1-\pi}\right) = \beta_0 + \beta_1x$$

The equation can be rearranged into the logistic function:

$$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$
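
To see why, exponentiate both sides of the logit equation and solve for $\pi$:

$$\frac{\pi}{1-\pi} = e^{\beta_0 + \beta_1x}$$

$$\pi = (1-\pi)\,e^{\beta_0 + \beta_1x}$$

$$\pi\,(1 + e^{\beta_0 + \beta_1x}) = e^{\beta_0 + \beta_1x}$$

$$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$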

In other words:

  • Logistic regression outputs the probabilities of a specific class
  • Those probabilities can be converted into class predictions

The logistic function has some nice properties:

  • Takes on an "s" shape
  • Output is bounded by 0 and 1

We have covered how this works for binary classification problems (two response classes). But what about multi-class classification problems (more than two response classes)?

  • The most common solution is "one-vs-all" (also known as "one-vs-rest"): decompose the problem into multiple binary classification problems (see the sketch after this list)
  • Multinomial logistic regression can solve this as a single problem
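
A rough sketch of the one-vs-rest idea, applied to the original six-class glass_type label (the OneVsRestClassifier wrapper and the variable names below are illustrative choices, not part of the cells above):

# illustrative sketch: one binary logistic regression per glass type
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_multi = glass[['al']]
y_multi = glass.glass_type                    # six classes: 1, 2, 3, 5, 6, 7
ovr = OneVsRestClassifier(LogisticRegression(C=1e9))
ovr.fit(X_multi, y_multi)
ovr.predict_proba(X_multi.head(3))            # one probability column per class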

Part 6: Interpreting Logistic Regression Coefficients


In [159]:
# plot the predicted probabilities again
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')


Out[159]:
<matplotlib.text.Text at 0x21ea3c88>

In [160]:
# compute predicted log-odds for al=2 using the equation
logodds = logreg.intercept_ + logreg.coef_[0] * 2
logodds


Out[160]:
array([ 0.64722323])

In [161]:
# convert log-odds to odds
odds = np.exp(logodds)
odds


Out[161]:
array([ 1.91022919])

In [162]:
# convert odds to probability
prob = odds/(1 + odds)
prob


Out[162]:
array([ 0.65638445])

In [163]:
# compute predicted probability for al=2 using the predict_proba method
logreg.predict_proba([[2]])[:, 1]


Out[163]:
array([ 0.65638445])

In [164]:
# examine the coefficient for al
list(zip(feature_cols, logreg.coef_[0]))


Out[164]:
[('al', 4.1804038614510928)]

Interpretation: A 1 unit increase in 'al' is associated with a 4.18 unit increase in the log-odds of 'household'.


In [165]:
# increasing al by 1 (so that al=3) increases the log-odds by 4.18
logodds = 0.64722323 + 4.1804038614510901
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob


Out[165]:
0.99205808391674566

In [166]:
# compute predicted probability for al=3 using the predict_proba method
logreg.predict_proba([[3]])[:, 1]


Out[166]:
array([ 0.99205808])

Bottom line: Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).


In [167]:
# examine the intercept
logreg.intercept_


Out[167]:
array([-7.71358449])

Interpretation: For an 'al' value of 0, the log-odds of 'household' is -7.71.


In [168]:
# convert log-odds to probability
logodds = logreg.intercept_
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob


Out[168]:
array([ 0.00044652])

That makes sense from the plot above, because the probability of household=1 should be very low for such a low 'al' value.

Changing the $\beta_0$ value shifts the curve horizontally, whereas changing the $\beta_1$ value changes the slope of the curve.
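
A quick way to see this is to plot the logistic function for a few hand-picked coefficients (the values below are arbitrary, chosen only to illustrate the shift and the change in steepness):

# plot the logistic function for a few arbitrary coefficient values
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 200)

def logistic(beta_0, beta_1, x):
    return np.exp(beta_0 + beta_1 * x) / (1 + np.exp(beta_0 + beta_1 * x))

plt.plot(x, logistic(0, 1, x), label='beta_0=0, beta_1=1')
plt.plot(x, logistic(3, 1, x), label='beta_0=3, beta_1=1 (shifted)')
plt.plot(x, logistic(0, 3, x), label='beta_0=0, beta_1=3 (steeper)')
plt.xlabel('x')
plt.ylabel('predicted probability')
plt.legend()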

Part 7: Using Logistic Regression with Categorical Features

Logistic regression can still be used with categorical features. Let's see what that looks like:


In [102]:
# create a categorical feature
glass['high_ba'] = np.where(glass.ba > 0.5, 1, 0)

Let's use Seaborn to draw the logistic curve:


In [103]:
# original (continuous) feature
sns.lmplot(x='ba', y='household', data=glass, ci=None, logistic=True)


Out[103]:
<seaborn.axisgrid.FacetGrid at 0x1cbdc5f8>

In [104]:
# categorical feature
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True)


Out[104]:
<seaborn.axisgrid.FacetGrid at 0x1f74d6d8>

In [105]:
# categorical feature, with jitter added
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True, x_jitter=0.05, y_jitter=0.05)


Out[105]:
<seaborn.axisgrid.FacetGrid at 0x1fa870b8>

In [106]:
# fit a logistic regression model
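# a sketch consistent with the fitted estimator shown in the output below
from sklearn.linear_model import LogisticRegression
feature_cols = ['high_ba']
X = glass[feature_cols]
y = glass.household
logreg = LogisticRegression(C=1e9)
logreg.fit(X, y)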


Out[106]:
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)

In [107]:
# examine the coefficient for high_ba
list(zip(feature_cols, logreg.coef_[0]))


Out[107]:
[('high_ba', 4.4273153450187213)]

Interpretation: Having a high 'ba' value is associated with a 4.43 unit increase in the log-odds of 'household' (as compared to a low 'ba' value).
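
Equivalently, exponentiating the coefficient expresses the same comparison as an odds ratio: since $e^{4.43} \approx 84$, a high 'ba' value is associated with odds of 'household' roughly 84 times higher than a low 'ba' value. A one-line check:

# the same coefficient expressed as an odds ratio
np.exp(logreg.coef_[0])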

Part 8: Comparing Logistic Regression with Other Models

Advantages of logistic regression:

  • Highly interpretable (if you remember how)
  • Model training and prediction are fast
  • No tuning is required (excluding regularization)
  • Features don't need scaling
  • Can perform well with a small number of observations
  • Outputs well-calibrated predicted probabilities

Disadvantages of logistic regression:

  • Presumes a linear relationship between the features and the log-odds of the response
  • Performance is (generally) not competitive with the best supervised learning methods
  • Can't automatically learn feature interactions