Logistic Regression

Agenda

1. Refresh your memory on how to do linear regression in scikit-learn
2. Attempt to use linear regression for classification
3. Show you why logistic regression is a better alternative for classification
4. Brief overview of probability, odds, e, log, and log-odds
5. Explain the form of logistic regression
6. Explain how to interpret logistic regression coefficients
7. Demonstrate how logistic regression works with categorical features
8. Compare logistic regression with other models

Part 1: Predicting a Continuous Response



In [1]:

# glass identification dataset
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass.sort_values('al', inplace=True)




Out[1]:

ri
na
mg
al
si
k
ca
ba
fe
glass_type

id

22
1.51966
14.77
3.75
0.29
72.02
0.03
9.00
0.0
0.00
1

185
1.51115
17.38
0.00
0.34
75.41
0.00
6.65
0.0
0.00
6

40
1.52213
14.21
3.82
0.47
71.77
0.11
9.57
0.0
0.00
1

39
1.52213
14.21
3.82
0.47
71.77
0.11
9.57
0.0
0.00
1

51
1.52320
13.72
3.72
0.51
71.75
0.09
10.06
0.0
0.16
1



Question: Pretend that we want to predict ri, and our only feature is al. How could we do it using machine learning?

Answer: We could frame it as a regression problem, and use a linear regression model with al as the only feature and ri as the response.

Question: How would we visualize this model?

Answer: Create a scatter plot with al on the x-axis and ri on the y-axis, and draw the line of best fit.



In [2]:

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(font_scale=1.5)




In [6]:

sns.lmplot(x='al', y='ri', data=glass, ci=None)




Out[6]:

<seaborn.axisgrid.FacetGrid at 0x16e1f048>



Question: How would we draw this plot without using Seaborn?



In [7]:

# scatter plot using Pandas
glass.plot(kind='scatter', x='al', y='ri')




Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x16c95278>




In [131]:

# equivalent scatter plot using Matplotlib
plt.scatter(glass.al, glass.ri)
plt.xlabel('al')
plt.ylabel('ri')




Out[131]:

<matplotlib.text.Text at 0x20930358>




In [9]:

# fit a linear regression model




In [8]:

# make predictions for all values of X and add back to the original dataframe




In [134]:

# plot those predictions connected by a line




Out[134]:

<matplotlib.text.Text at 0x210a1470>




In [135]:

# put the plots together




Out[135]:

<matplotlib.text.Text at 0x1f50cf60>



Refresher: interpreting linear regression coefficients

Linear regression equation: $y = \beta_0 + \beta_1x$



In [136]:

# compute prediction for al=2 using the equation
linreg.intercept_ + linreg.coef_ * 2




Out[136]:

array([ 1.51699012])




In [137]:

# compute prediction for al=2 using the predict method
linreg.predict(2)




Out[137]:

array([ 1.51699012])




In [138]:

# examine coefficient for al
zip(feature_cols, linreg.coef_)




Out[138]:

[('al', -0.0024776063874696226)]



Interpretation: A 1 unit increase in 'al' is associated with a 0.0025 unit decrease in 'ri'.



In [139]:

# increasing al by 1 (so that al=3) decreases ri by 0.0025
1.51699012 - 0.0024776063874696243




Out[139]:

1.5145125136125304




In [140]:

# compute prediction for al=3 using the predict method
linreg.predict(3)




Out[140]:

array([ 1.51451251])



Part 2: Predicting a Categorical Response



In [141]:

# examine glass_type
glass.glass_type.value_counts().sort_index()




Out[141]:

1    70
2    76
3    17
5    13
6     9
7    29
dtype: int64




In [142]:

# types 1, 2, 3 are window glass
# types 5, 6, 7 are household glass
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})




Out[142]:

ri
na
mg
al
si
k
ca
ba
fe
glass_type
ri_pred
household

id

22
1.51966
14.77
3.75
0.29
72.02
0.03
9.00
0
0.00
1
1.521227
0

185
1.51115
17.38
0.00
0.34
75.41
0.00
6.65
0
0.00
6
1.521103
1

40
1.52213
14.21
3.82
0.47
71.77
0.11
9.57
0
0.00
1
1.520781
0

39
1.52213
14.21
3.82
0.47
71.77
0.11
9.57
0
0.00
1
1.520781
0

51
1.52320
13.72
3.72
0.51
71.75
0.09
10.06
0
0.16
1
1.520682
0



Let's change our task, so that we're predicting household using al. Let's visualize the relationship to figure out how to do this:



In [143]:

plt.scatter(glass.al, glass.household)
plt.xlabel('al')
plt.ylabel('household')




Out[143]:

<matplotlib.text.Text at 0x21302c18>



Let's draw a regression line, like we did before:



In [144]:

# fit a linear regression model and store the predictions
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
linreg.fit(X, y)
glass['household_pred'] = linreg.predict(X)




In [145]:

# scatter plot that includes the regression line
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred, color='red')
plt.xlabel('al')
plt.ylabel('household')




Out[145]:

<matplotlib.text.Text at 0x216cbf28>



If al=3, what class do we predict for household? 1

If al=1.5, what class do we predict for household? 0

We predict the 0 class for lower values of al, and the 1 class for higher values of al. What's our cutoff value? Around al=2, because that's where the linear regression line crosses the midpoint between predicting class 0 and class 1.

Therefore, we'll say that if household_pred >= 0.5, we predict a class of 1, else we predict a class of 0.



In [146]:

# understanding np.where
import numpy as np
nums = np.array([5, 15, 8])

# np.where returns the first value if the condition is True, and the second value if the condition is False
np.where(nums > 10, 'big', 'small')




Out[146]:

array(['small', 'big', 'small'],
dtype='|S5')




In [147]:

# transform household_pred to 1 or 0
glass['household_pred_class'] = np.where(glass.household_pred >= 0.5, 1, 0)




Out[147]:

ri
na
mg
al
si
k
ca
ba
fe
glass_type
ri_pred
household
household_pred
household_pred_class

id

22
1.51966
14.77
3.75
0.29
72.02
0.03
9.00
0
0.00
1
1.521227
0
-0.340495
0

185
1.51115
17.38
0.00
0.34
75.41
0.00
6.65
0
0.00
6
1.521103
1
-0.315436
0

40
1.52213
14.21
3.82
0.47
71.77
0.11
9.57
0
0.00
1
1.520781
0
-0.250283
0

39
1.52213
14.21
3.82
0.47
71.77
0.11
9.57
0
0.00
1
1.520781
0
-0.250283
0

51
1.52320
13.72
3.72
0.51
71.75
0.09
10.06
0
0.16
1
1.520682
0
-0.230236
0




In [148]:

# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')




Out[148]:

<matplotlib.text.Text at 0x21a79588>



Part 3: Using Logistic Regression Instead

Logistic regression can do what we just did:



In [149]:

# fit a logistic regression model and store the class predictions




In [150]:

# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')




Out[150]:

<matplotlib.text.Text at 0x21aa2198>



What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?



In [151]:

# store the predicted probabilites of class 1
glass['household_pred_prob'] = logreg.predict_proba(X)[:, 1]




In [152]:

# plot the predicted probabilities
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')




Out[152]:

<matplotlib.text.Text at 0x216cb8d0>




In [153]:

# examine some example predictions
print logreg.predict_proba(1)
print logreg.predict_proba(2)
print logreg.predict_proba(3)




[[ 0.97161726  0.02838274]]
[[ 0.34361555  0.65638445]]
[[ 0.00794192  0.99205808]]



The first column indicates the predicted probability of class 0, and the second column indicates the predicted probability of class 1.

Part 4: Probability, odds, e, log, log-odds

$$probability = \frac {one\ outcome} {all\ outcomes}$$$$odds = \frac {one\ outcome} {all\ other\ outcomes}$$

Examples:

• Dice roll of 1: probability = 1/6, odds = 1/5
• Even dice roll: probability = 3/6, odds = 3/3 = 1
• Dice roll less than 5: probability = 4/6, odds = 4/2 = 2
$$odds = \frac {probability} {1 - probability}$$$$probability = \frac {odds} {1 + odds}$$


In [154]:

# create a table of probability versus odds
table = pd.DataFrame({'probability':[0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability/(1 - table.probability)
table




Out[154]:

probability
odds

0
0.10
0.111111

1
0.20
0.250000

2
0.25
0.333333

3
0.50
1.000000

4
0.60
1.500000

5
0.80
4.000000

6
0.90
9.000000



What is e? It is the base rate of growth shared by all continually growing processes:



In [155]:

# exponential function: e^1
np.exp(1)




Out[155]:

2.7182818284590451



What is a (natural) log? It gives you the time needed to reach a certain level of growth:



In [156]:

# time needed to grow 1 unit to 2.718 units
np.log(2.718)




Out[156]:

0.99989631572895199



It is also the inverse of the exponential function:



In [157]:

np.log(np.exp(5))




Out[157]:

5.0




In [158]:

# add log-odds to the table
table['logodds'] = np.log(table.odds)
table




Out[158]:

probability
odds
logodds

0
0.10
0.111111
-2.197225

1
0.20
0.250000
-1.386294

2
0.25
0.333333
-1.098612

3
0.50
1.000000
0.000000

4
0.60
1.500000
0.405465

5
0.80
4.000000
1.386294

6
0.90
9.000000
2.197225



Part 5: What is Logistic Regression?

Linear regression: continuous response is modeled as a linear combination of the features:

$$y = \beta_0 + \beta_1x$$

Logistic regression: log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:

$$\log \left({p\over 1-p}\right) = \beta_0 + \beta_1x$$

This is called the logit function.

Probability is sometimes written as pi:

$$\log \left({\pi\over 1-\pi}\right) = \beta_0 + \beta_1x$$

The equation can be rearranged into the logistic function:

$$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$

In other words:

• Logistic regression outputs the probabilities of a specific class
• Those probabilities can be converted into class predictions

The logistic function has some nice properties:

• Takes on an "s" shape
• Output is bounded by 0 and 1

We have covered how this works for binary classification problems (two response classes). But what about multi-class classification problems (more than two response classes)?

• Most common solution for classification models is "one-vs-all" (also known as "one-vs-rest"): decompose the problem into multiple binary classification problems
• Multinomial logistic regression can solve this as a single problem

Part 6: Interpreting Logistic Regression Coefficients



In [159]:

# plot the predicted probabilities again
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')




Out[159]:

<matplotlib.text.Text at 0x21ea3c88>




In [160]:

# compute predicted log-odds for al=2 using the equation
logodds = logreg.intercept_ + logreg.coef_[0] * 2
logodds




Out[160]:

array([ 0.64722323])




In [161]:

# convert log-odds to odds
odds = np.exp(logodds)
odds




Out[161]:

array([ 1.91022919])




In [162]:

# convert odds to probability
prob = odds/(1 + odds)
prob




Out[162]:

array([ 0.65638445])




In [163]:

# compute predicted probability for al=2 using the predict_proba method
logreg.predict_proba(2)[:, 1]




Out[163]:

array([ 0.65638445])




In [164]:

# examine the coefficient for al
zip(feature_cols, logreg.coef_[0])




Out[164]:

[('al', 4.1804038614510928)]



Interpretation: A 1 unit increase in 'al' is associated with a 4.18 unit increase in the log-odds of 'household'.



In [165]:

# increasing al by 1 (so that al=3) increases the log-odds by 4.18
logodds = 0.64722323 + 4.1804038614510901
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob




Out[165]:

0.99205808391674566




In [166]:

# compute predicted probability for al=3 using the predict_proba method
logreg.predict_proba(3)[:, 1]




Out[166]:

array([ 0.99205808])



Bottom line: Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).



In [167]:

# examine the intercept
logreg.intercept_




Out[167]:

array([-7.71358449])



Interpretation: For an 'al' value of 0, the log-odds of 'household' is -7.71.



In [168]:

# convert log-odds to probability
logodds = logreg.intercept_
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob




Out[168]:

array([ 0.00044652])



That makes sense from the plot above, because the probability of household=1 should be very low for such a low 'al' value.

Changing the $\beta_0$ value shifts the curve horizontally, whereas changing the $\beta_1$ value changes the slope of the curve.

Part 7: Using Logistic Regression with Categorical Features

Logistic regression can still be used with categorical features. Let's see what that looks like:



In [102]:

# create a categorical feature
glass['high_ba'] = np.where(glass.ba > 0.5, 1, 0)



Let's use Seaborn to draw the logistic curve:



In [103]:

# original (continuous) feature
sns.lmplot(x='ba', y='household', data=glass, ci=None, logistic=True)




Out[103]:

<seaborn.axisgrid.FacetGrid at 0x1cbdc5f8>




In [104]:

# categorical feature
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True)




Out[104]:

<seaborn.axisgrid.FacetGrid at 0x1f74d6d8>




In [105]:

# categorical feature, with jitter added
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True, x_jitter=0.05, y_jitter=0.05)




Out[105]:

<seaborn.axisgrid.FacetGrid at 0x1fa870b8>




In [106]:

# fit a logistic regression model




Out[106]:

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', penalty='l2', random_state=None,
solver='liblinear', tol=0.0001, verbose=0)




In [107]:

# examine the coefficient for high_ba
zip(feature_cols, logreg.coef_[0])




Out[107]:

[('high_ba', 4.4273153450187213)]



Interpretation: Having a high 'ba' value is associated with a 4.43 unit increase in the log-odds of 'household' (as compared to a low 'ba' value).