Example of Logistic regression

Predict student admission based on exams result

Data is taken from Andrew Ng's CS229 course on Machine Learning at Stanford.
This is the same example previously shown but uses StatsModel package.



In [1]:

    
import pandas as pd



In [4]:

    
data = pd.read_csv("../datasets/ex2data1.txt", header=None, 
                   names=['Exam1', 'Exam2', 'Admitted'])



In [5]:

    
data.head()

Historical data from previous students: each student has two exams scores associated and the final admission result (1=yes, 0= no).

Split data into X (training data) and y (target variable)



In [6]:

    
X = data.copy() # ou training data



In [7]:

    
y = X.Admitted.copy() # copy “y” column values out
X.drop(['Admitted'], axis=1, inplace=True) # then, drop y column



In [8]:

    
# manually add the intercept
X['intercept'] = 1.0  # so we don't need to use sm.add_constant every time

Create and fit a Logistic model via statsmodels package



In [9]:

    
import statsmodels.formula.api as sm

Since we're doing a logistic regression, we're going to use the statsmodels Logit function.



In [10]:

    
model = sm.Logit(y, X)



In [11]:

    
result = model.fit()









    



Optimization terminated successfully.
         Current function value: 0.203498
         Iterations 9



In [12]:

    
result.summary()









    Out[12]:





Logit Regression Results

  Dep. Variable:      Admitted        No. Observations:       100  


  Model:                Logit         Df Residuals:            97  


  Method:                MLE          Df Model:                 2  


  Date:           Sun, 10 Sep 2017    Pseudo R-squ.:       0.6976  


  Time:               18:22:37        Log-Likelihood:      -20.350 


  converged:            True          LL-Null:             -67.301 


                                    LLR p-value:        4.067e-21




               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Exam1          0.2062      0.048      4.296   0.000      0.112     0.300


  Exam2          0.2015      0.049      4.143   0.000      0.106     0.297


  intercept    -25.1613      5.799     -4.339   0.000    -36.526   -13.796

You get a great overview of the coefficients of the model, how well those coefficients fit, the overall fit quality, and several other statistical measures.
The result object also lets you to isolate and inspect parts of the model output, for example the coefficients are in params field:



In [13]:

    
coefficients = result.params
coefficients









    Out[13]:





Exam1         0.206232
Exam2         0.201472
intercept   -25.161334
dtype: float64

As you see, the model found the same coefficients as in the previous example.

The confidence interval gives you an idea for how robust the coefficients of the model are.



In [14]:

    
result.conf_int()



In [15]:

    
import numpy as np



In [16]:

    
# odds ratios and 95% CI
confidenceInterval = result.conf_int()
confidenceInterval['OddsRatio'] = coefficients
confidenceInterval.columns = ['2.5%', '97.5%', 'OddsRatio']
np.exp(confidenceInterval)









    Out[16]:






  
    
      
      2.5%
      97.5%
      OddsRatio
    
  
  
    
      Exam1
      1.118683e+00
      1.350279
      1.229038e+00
    
    
      Exam2
      1.112009e+00
      1.345512
      1.223201e+00
    
    
      intercept
      1.370361e-16
      0.000001
      1.181875e-11

Take the exponential of each of the coefficients to generate the odds ratios.
This tells you how a 1 unit increase or decrease in a variable affects the odds of being admitted.
For example, we can expect the odds of being admitted to increase roughly the same for both exams better results.

Plot the decision boundary

We can use these beta parameters to plot the decision boundary on the training data. We only need two points to plot a line, so we choose two endpoints: the min and the max among the X training data that we already calculated in the previous example.



In [17]:

    
plot_x = np.array([ 95.60854757,  21.63578984])

The line equation is given by the coefficients:



In [18]:

    
line_intercept = -coefficients['intercept'] / coefficients['Exam2']
line_slope = -coefficients['Exam1'] / coefficients['Exam2']



In [19]:

    
line_intercept









    Out[19]:





124.88774354024942



In [20]:

    
line_slope









    Out[20]:





-1.0236267188108765

Now we have everything to plot the boundary:



In [21]:

    
import matplotlib.pyplot as plt
%matplotlib inline



In [22]:

    
colours = ['red' if i==0 else 'green' for i in y] # red not admitted, green yes.



In [23]:

    
# Create a list of values in the best fit line
plot_y = [line_slope * i + line_intercept for i in plot_x]
plot_y









    Out[23]:





[27.020279700896722, 102.74077097744852]



In [24]:

    
fig,ax = plt.subplots()
ax.scatter(X.Exam1, X.Exam2, c=colours)
ax.plot(plot_x, plot_y)
ax.grid(True)
ax.set_xlabel("Exam 1 score")
ax.set_ylabel("Exam 2 score")
fig.suptitle("Student admission vs. past two exams")









    Out[24]:





<matplotlib.text.Text at 0x111f66ef0>

The blue line is our decision boundary: when your exams score lie below the line then probably (that is the prediction) you will not be admitted to University. If they lie above, probably you will. As you can see, the boundary is not predicting perfectly on the training historical data. It's a model. Not perfect but useful. What we can do is to measure its accuracy.

Accuracy



In [25]:

    
def logPredict(modelParams, X):  
    probabilities = modelParams.predict(X)
    return [1 if x >= 0.5 else 0 for x in probabilities]



In [26]:

    
predictions = logPredict(result, X)



In [27]:

    
accuracy = np.mean(predictions == y)
print ('accuracy = {0}%'.format(accuracy*100)  )









    



accuracy = 89.0%

Just for fun, let's say that my scores are 40 in the first exam and 78 in the second one:



In [28]:

    
myExams = pd.DataFrame({'Exam1': [40.], 'Exam2': [78.], 'intercept': [1.]})
myExams









    Out[28]:






  
    
      
      Exam1
      Exam2
      intercept
    
  
  
    
      0
      40.0
      78.0
      1.0



In [29]:

    
print ('Your probability = {0}%'.format(result.predict(myExams)[0]*100))









    



Your probability = 23.195940705038893%

Uh oh, looks's like my probability to be admitted at University is only 23% ...

	Exam1	Exam2	Admitted
0	34.623660	78.024693	0
1	30.286711	43.894998	0
2	35.847409	72.902198	0
3	60.182599	86.308552	1
4	79.032736	75.344376	1

	0	1
Exam1	0.112152	0.300311
Exam2	0.106168	0.296775
intercept	-36.526287	-13.796380

Dep. Variable:	Admitted	No. Observations:	100
Model:	Logit	Df Residuals:	97
Method:	MLE	Df Model:	2
Date:	Sun, 10 Sep 2017	Pseudo R-squ.:	0.6976
Time:	18:22:37	Log-Likelihood:	-20.350
converged:	True	LL-Null:	-67.301
		LLR p-value:	4.067e-21

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Exam1	0.2062	0.048	4.296	0.000	0.112 0.300
Exam2	0.2015	0.049	4.143	0.000	0.106 0.297
intercept	-25.1613	5.799	-4.339	0.000	-36.526 -13.796

	2.5%	97.5%	OddsRatio
Exam1	1.118683e+00	1.350279	1.229038e+00
Exam2	1.112009e+00	1.345512	1.223201e+00
intercept	1.370361e-16	0.000001	1.181875e-11