Data is taken from Andrew Ng's CS229 course on Machine Learning at Stanford.
This is the same example shown previously, but this time using the StatsModels package.
In [1]:
import pandas as pd
In [4]:
data = pd.read_csv("../datasets/ex2data1.txt", header=None,
names=['Exam1', 'Exam2', 'Admitted'])
In [5]:
data.head()
Out[5]:
Historical data from previous students: each student has two exam scores and the final admission result (1 = yes, 0 = no).
In [6]:
X = data.copy() # our training data
In [7]:
y = X.Admitted.copy() # copy the "y" column values out
X.drop(['Admitted'], axis=1, inplace=True) # then, drop y column
In [8]:
# manually add the intercept
X['intercept'] = 1.0 # so we don't need to use sm.add_constant every time
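Equivalently, statsmodels can build the intercept column for you; a minimal sketch (X_alt is just an illustrative name; note that add_constant labels the column 'const' and prepends it by default):
import statsmodels.api as sm
X_alt = sm.add_constant(X.drop('intercept', axis=1))  # same data, intercept column named 'const'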
In [9]:
import statsmodels.api as sm
Since we're doing a logistic regression, we're going to use the statsmodels Logit model.
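Under the hood, Logit models the admission probability with the logistic (sigmoid) function applied to the linear predictor; a minimal sketch of that function:
import numpy as np

def sigmoid(z):
    # logistic function: maps the linear predictor z = X.theta to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))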
In [10]:
model = sm.Logit(y, X)
In [11]:
result = model.fit()
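fit() estimates the coefficients by maximum likelihood and prints a short convergence message; passing disp=0 silences it if you prefer a quiet run (result_quiet is just an illustrative name):
result_quiet = model.fit(disp=0)  # disp=0 suppresses the convergence printout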
In [12]:
result.summary()
Out[12]:
You get a good overview of the model: the coefficients, their standard errors and significance, the overall quality of the fit, and several other statistical measures.
The result object also lets you isolate and inspect parts of the model output; for example, the coefficients are in the params attribute:
In [13]:
coefficients = result.params
coefficients
Out[13]:
As you see, the model found the same coefficients as in the previous example.
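Besides params, the result object exposes other parts of the summary individually, for example:
result.bse        # standard errors of the coefficients
result.pvalues    # p-values for each coefficient
result.prsquared  # McFadden's pseudo R-squared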
The confidence interval gives you an idea of how robust the coefficients of the model are.
In [14]:
result.conf_int()
Out[14]:
In [15]:
import numpy as np
In [16]:
# odds ratios and 95% CI
confidenceInterval = result.conf_int()
confidenceInterval['OddsRatio'] = coefficients
confidenceInterval.columns = ['2.5%', '97.5%', 'OddsRatio']
np.exp(confidenceInterval)  # exponentiate log-odds to get odds ratios and their CI
Out[16]:
Take the exponential of each of the coefficients to generate the odds ratios.
This tells you how a 1-unit increase or decrease in a variable affects the odds of being admitted.
For example, the odds of being admitted increase by roughly the same factor for a one-point improvement in either exam score.
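To make that concrete, here is how the Exam1 odds ratio reads (the 1.22 figure below is purely illustrative, not taken from the output above):
# if this value were, say, 1.22 (illustrative only), one extra point on Exam1 would
# multiply the odds of admission by about 1.22, holding Exam2 fixed
np.exp(coefficients['Exam1'])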
We can use these beta parameters to plot the decision boundary on the training data. We only need two points to plot a line, so we choose two endpoints: the min and the max of the Exam1 training scores, which we already calculated in the previous example.
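If the endpoints from the previous example are not at hand, they can be recomputed from the training data (a sketch; plot_x_alt is just an illustrative name, and the previous example may have added a small margin, so the exact numbers can differ):
plot_x_alt = np.array([X.Exam1.max(), X.Exam1.min()])  # span the observed Exam1 range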
In [17]:
plot_x = np.array([ 95.60854757, 21.63578984])
The boundary is the set of points where the predicted probability is exactly 0.5, i.e. where intercept + coef1*Exam1 + coef2*Exam2 = 0. Solving for Exam2 gives the line's intercept and slope:
In [18]:
line_intercept = -coefficients['intercept'] / coefficients['Exam2']
line_slope = -coefficients['Exam1'] / coefficients['Exam2']
In [19]:
line_intercept
Out[19]:
In [20]:
line_slope
Out[20]:
Now we have everything to plot the boundary:
In [21]:
import matplotlib.pyplot as plt
%matplotlib inline
In [22]:
colours = ['red' if i==0 else 'green' for i in y] # red = not admitted, green = admitted
In [23]:
# Evaluate the decision boundary line at the two endpoints
plot_y = [line_slope * i + line_intercept for i in plot_x]
plot_y
Out[23]:
In [24]:
fig,ax = plt.subplots()
ax.scatter(X.Exam1, X.Exam2, c=colours)
ax.plot(plot_x, plot_y)
ax.grid(True)
ax.set_xlabel("Exam 1 score")
ax.set_ylabel("Exam 2 score")
fig.suptitle("Student admission vs. past two exams")
Out[24]:
The blue line is our decision boundary: when your exam scores lie below the line, the prediction is that you will not be admitted to University; if they lie above it, you probably will. As you can see, the boundary does not classify the historical training data perfectly. It's a model: not perfect but useful. What we can do is measure its accuracy.
In [25]:
def logPredict(modelParams, X):
    # predicted probabilities from the fitted model
    probabilities = modelParams.predict(X)
    # threshold at 0.5 to turn probabilities into 0/1 class predictions
    return [1 if x >= 0.5 else 0 for x in probabilities]
In [26]:
predictions = logPredict(result, X)
In [27]:
accuracy = np.mean(predictions == y)
print('accuracy = {0}%'.format(accuracy * 100))
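statsmodels can also tabulate predictions against the observed outcomes directly: pred_table() returns the 2x2 confusion matrix at a 0.5 threshold:
result.pred_table()  # rows = actual class (0, 1), columns = predicted class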
Just for fun, let's say that my scores are 40 in the first exam and 78 in the second one:
In [28]:
myExams = pd.DataFrame({'Exam1': [40.], 'Exam2': [78.], 'intercept': [1.]})
myExams
Out[28]:
In [29]:
print('Your probability = {0}%'.format(result.predict(myExams)[0] * 100))
Uh oh, looks like my probability of being admitted to University is only 23% ...
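As a cross-check, the same probability can be computed by hand from the coefficients, using the sigmoid function sketched earlier (the element-wise product relies on myExams and coefficients sharing the same labels):
z = (myExams * coefficients).sum(axis=1)  # linear predictor: intercept + coef1*Exam1 + coef2*Exam2
sigmoid(z)  # should match result.predict(myExams)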