In [1]:
# glass identification dataset
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass.sort_values('al', inplace=True)
glass.head()
Out[1]:
Question: Pretend that we want to predict ri, and our only feature is al. How could we do it using machine learning?
Answer: We could frame it as a regression problem, and use a linear regression model with al as the only feature and ri as the response.
Question: How would we visualize this model?
Answer: Create a scatter plot with al on the x-axis and ri on the y-axis, and draw the line of best fit.
In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(font_scale=1.5)
In [6]:
sns.lmplot(x='al', y='ri', data=glass, ci=None)
Out[6]:
Question: How would we draw this plot without using Seaborn?
In [7]:
# scatter plot using Pandas
glass.plot(kind='scatter', x='al', y='ri')
Out[7]:
In [131]:
# equivalent scatter plot using Matplotlib
plt.scatter(glass.al, glass.ri)
plt.xlabel('al')
plt.ylabel('ri')
Out[131]:
In [9]:
# fit a linear regression model
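# (the original code for this cell is not shown; a minimal sketch using scikit-learn's LinearRegression)
from sklearn.linear_model import LinearRegression
feature_cols = ['al']
X = glass[feature_cols]
y = glass.ri
linreg = LinearRegression()
linreg.fit(X, y)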
In [8]:
# make predictions for all values of X and add back to the original dataframe
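# (sketch; the stored column name 'ri_pred' is an assumption, not given in the original)
glass['ri_pred'] = linreg.predict(X)
glass.head()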
In [134]:
# plot those predictions connected by a line
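# (sketch, assuming the predictions were stored in glass['ri_pred'] as above)
plt.plot(glass.al, glass.ri_pred, color='red')
plt.xlabel('al')
plt.ylabel('ri_pred')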
Out[134]:
In [135]:
# put the plots together
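# (sketch) overlay the fitted line on the scatter plot
plt.scatter(glass.al, glass.ri)
plt.plot(glass.al, glass.ri_pred, color='red')
plt.xlabel('al')
plt.ylabel('ri')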
Out[135]:
Linear regression equation: $y = \beta_0 + \beta_1x$
In [136]:
# compute prediction for al=2 using the equation
linreg.intercept_ + linreg.coef_ * 2
Out[136]:
In [137]:
# compute prediction for al=2 using the predict method
linreg.predict([[2]])
Out[137]:
In [138]:
# examine coefficient for al
list(zip(feature_cols, linreg.coef_))
Out[138]:
Interpretation: A 1 unit increase in 'al' is associated with a 0.0025 unit decrease in 'ri'.
In [139]:
# increasing al by 1 (so that al=3) decreases ri by 0.0025
1.51699012 - 0.0024776063874696243
Out[139]:
In [140]:
# compute prediction for al=3 using the predict method
linreg.predict([[3]])
Out[140]:
In [141]:
# examine glass_type
glass.glass_type.value_counts().sort_index()
Out[141]:
In [142]:
# types 1, 2, 3 are window glass
# types 5, 6, 7 are household glass
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
glass.head()
Out[142]:
Let's change our task, so that we're predicting household using al. Let's visualize the relationship to figure out how to do this:
In [143]:
plt.scatter(glass.al, glass.household)
plt.xlabel('al')
plt.ylabel('household')
Out[143]:
Let's draw a regression line, like we did before:
In [144]:
# fit a linear regression model and store the predictions
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
linreg.fit(X, y)
glass['household_pred'] = linreg.predict(X)
In [145]:
# scatter plot that includes the regression line
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[145]:
If al=3, what class do we predict for household? 1
If al=1.5, what class do we predict for household? 0
We predict the 0 class for lower values of al, and the 1 class for higher values of al. What's our cutoff value? Around al=2, because that's where the linear regression line crosses the midpoint between predicting class 0 and class 1.
Therefore, we'll say that if household_pred >= 0.5, we predict a class of 1, else we predict a class of 0.
In [146]:
# understanding np.where
import numpy as np
nums = np.array([5, 15, 8])
# np.where returns the first value if the condition is True, and the second value if the condition is False
np.where(nums > 10, 'big', 'small')
Out[146]:
In [147]:
# transform household_pred to 1 or 0
glass['household_pred_class'] = np.where(glass.household_pred >= 0.5, 1, 0)
glass.head()
Out[147]:
In [148]:
# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[148]:
In [149]:
# fit a logistic regression model and store the class predictions
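# (sketch; the original code for this cell is not shown)
from sklearn.linear_model import LogisticRegression
# C=1e9 effectively disables regularization -- an assumption, not from the original
logreg = LogisticRegression(C=1e9)
logreg.fit(X, y)
glass['household_pred_class'] = logreg.predict(X)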
In [150]:
# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[150]:
What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?
In [151]:
# store the predicted probabilities of class 1
glass['household_pred_prob'] = logreg.predict_proba(X)[:, 1]
In [152]:
# plot the predicted probabilities
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[152]:
In [153]:
# examine some example predictions
print(logreg.predict_proba([[1]]))
print(logreg.predict_proba([[2]]))
print(logreg.predict_proba([[3]]))
The first column indicates the predicted probability of class 0, and the second column indicates the predicted probability of class 1.
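Since the two columns are complementary, each row of predict_proba sums to 1. A quick sanity check (a minimal sketch):
# the two probabilities in each row sum to 1
logreg.predict_proba(X).sum(axis=1)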
Odds are defined as the probability of an event occurring divided by the probability of it not occurring:

$$odds = {probability \over 1 - probability}$$

Examples: a probability of 0.5 corresponds to odds of 0.5/0.5 = 1, and a probability of 0.8 corresponds to odds of 0.8/0.2 = 4.
In [154]:
# create a table of probability versus odds
table = pd.DataFrame({'probability':[0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability/(1 - table.probability)
table
Out[154]:
What is e? It is the base rate of growth shared by all continually growing processes:
In [155]:
# exponential function: e^1
np.exp(1)
Out[155]:
What is a (natural) log? It gives you the time needed to reach a certain level of growth:
In [156]:
# time needed to grow 1 unit to 2.718 units
np.log(2.718)
Out[156]:
It is also the inverse of the exponential function:
In [157]:
np.log(np.exp(5))
Out[157]:
In [158]:
# add log-odds to the table
table['logodds'] = np.log(table.odds)
table
Out[158]:
Linear regression: continuous response is modeled as a linear combination of the features:
$$y = \beta_0 + \beta_1x$$

Logistic regression: log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:

$$\log \left({p \over 1-p}\right) = \beta_0 + \beta_1x$$

This is called the logit function.

Probability is sometimes written as $\pi$:

$$\log \left({\pi \over 1-\pi}\right) = \beta_0 + \beta_1x$$

The equation can be rearranged into the logistic function:

$$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$

In other words: the model estimates the probability of the response being 1 as a function of the linear combination $\beta_0 + \beta_1x$, and that probability can then be converted into a class prediction.

The logistic function has some nice properties: it produces an S-shaped curve, and its output is always bounded between 0 and 1, so it can be interpreted as a probability.
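As a quick numerical check of these properties (a minimal sketch; the input values are arbitrary):
# evaluate the logistic function for a range of arbitrary log-odds values
x = np.linspace(-10, 10, 9)
np.exp(x) / (1 + np.exp(x))   # every value lies strictly between 0 and 1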
We have covered how this works for binary classification problems (two response classes). But what about multi-class classification problems (more than two response classes)?
In [159]:
# plot the predicted probabilities again
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[159]:
In [160]:
# compute predicted log-odds for al=2 using the equation
logodds = logreg.intercept_ + logreg.coef_[0] * 2
logodds
Out[160]:
In [161]:
# convert log-odds to odds
odds = np.exp(logodds)
odds
Out[161]:
In [162]:
# convert odds to probability
prob = odds/(1 + odds)
prob
Out[162]:
In [163]:
# compute predicted probability for al=2 using the predict_proba method
logreg.predict_proba([[2]])[:, 1]
Out[163]:
In [164]:
# examine the coefficient for al
list(zip(feature_cols, logreg.coef_[0]))
Out[164]:
Interpretation: A 1 unit increase in 'al' is associated with a 4.18 unit increase in the log-odds of 'household'.
In [165]:
# increasing al by 1 (so that al=3) increases the log-odds by 4.18
logodds = 0.64722323 + 4.1804038614510901
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob
Out[165]:
In [166]:
# compute predicted probability for al=3 using the predict_proba method
logreg.predict_proba([[3]])[:, 1]
Out[166]:
Bottom line: Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
In [167]:
# examine the intercept
logreg.intercept_
Out[167]:
Interpretation: For an 'al' value of 0, the log-odds of 'household' is -7.71.
In [168]:
# convert log-odds to probability
logodds = logreg.intercept_
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob
Out[168]:
That makes sense from the plot above, because the probability of household=1 should be very low for such a low 'al' value.
Changing the $\beta_0$ value shifts the curve horizontally, whereas changing the $\beta_1$ value changes the slope of the curve.
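A small sketch illustrates this behavior, using a few made-up coefficient values rather than the fitted ones:
# plot logistic curves for several hypothetical (beta_0, beta_1) pairs
al_range = np.linspace(0, 4, 100)
for b0, b1 in [(-8, 4), (-4, 4), (-8, 8)]:
    curve = np.exp(b0 + b1 * al_range) / (1 + np.exp(b0 + b1 * al_range))
    plt.plot(al_range, curve, label='beta_0=%s, beta_1=%s' % (b0, b1))
plt.xlabel('al')
plt.ylabel('predicted probability of household')
plt.legend()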
Logistic regression can still be used with categorical features. Let's see what that looks like:
In [102]:
# create a categorical feature
glass['high_ba'] = np.where(glass.ba > 0.5, 1, 0)
Let's use Seaborn to draw the logistic curve:
In [103]:
# original (continuous) feature
sns.lmplot(x='ba', y='household', data=glass, ci=None, logistic=True)
Out[103]:
In [104]:
# categorical feature
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True)
Out[104]:
In [105]:
# categorical feature, with jitter added
sns.lmplot(x='high_ba', y='household', data=glass, ci=None, logistic=True, x_jitter=0.05, y_jitter=0.05)
Out[105]:
In [106]:
# fit a logistic regression model
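# (sketch; the original code for this cell is not shown)
feature_cols = ['high_ba']
X = glass[feature_cols]
y = glass.household
logreg.fit(X, y)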
Out[106]:
In [107]:
# examine the coefficient for high_ba
list(zip(feature_cols, logreg.coef_[0]))
Out[107]:
Interpretation: Having a high 'ba' value is associated with a 4.43 unit increase in the log-odds of 'household' (as compared to a low 'ba' value).
Advantages of logistic regression: it is highly interpretable, model training and prediction are fast, little tuning is required (aside from regularization), and it outputs well-calibrated predicted probabilities.
Disadvantages of logistic regression: it presumes a linear relationship between the features and the log-odds of the response, its performance is generally not competitive with the best supervised learning methods, and it can't automatically learn feature interactions.