A simple implementation of logistic regression in Python to build a model that predicts whether candidates will be admitted into graduate school.
Uses data from UCLA. Credit to the examples from yhat at http://blog.yhat.com/posts/logistic-regression-and-python.html and dataschool at http://nbviewer.jupyter.org/gist/justmarkham/6d5c061ca5aee67c4316471f8c2ae976 for being starting points and good guides (though the examples are not strictly followed).
In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
The data from UCLA (found at http://www.ats.ucla.edu/stat/data/binary.csv and originally used in this example: http://www.ats.ucla.edu/stat/r/dae/logit.htm) contains 4 columns: "admit" (a binary admission outcome), "gre" (GRE score), "gpa" (grade point average), and "rank" (the prestige of the candidate's undergraduate institution, from 1, the most prestigious, to 4, the least).
The columns will be renamed to "Admit," "GRE," "GPA," and "Prestige" as we import the data to make them more human-friendly. Note that "rank" is renamed to "Prestige" to avoid confusion with a method of pandas.
In [6]:
# As we import the data, we rename the "Rank" column to "Prestige" to avoid confusion with the rank method of pandas
df = pd.read_csv("binary.csv", header=0, names=["Admit", "GRE", "GPA", "Prestige"])
df.head()
Out[6]:
In [7]:
# Basic summary of the data
df.describe()
Out[7]:
In [8]:
# Generate a cross-tabulation (frequency table by default) of the factors; here we use prestige
pd.crosstab(df['Admit'], df['Prestige'], rownames=['Admission'])
Out[8]:
Based on the cross-tabulation above, prestige appears to be strongly associated with admission: candidates from rank-1 schools have more admits than rejections, while those from rank-4 schools are largely rejected.
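As a quick check on this reading, the same relationship can be summarized as an admission rate per prestige tier. A minimal sketch, reusing the df loaded above:

# Admission rate within each prestige tier (reuses df from the cell above)
admit_rate_by_prestige = df.groupby('Prestige')['Admit'].mean()
print(admit_rate_by_prestige)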
In [9]:
# Generate histograms
sns.set_color_codes('muted')
df.hist(color='g')
plt.show()
In [10]:
# Dummy code the rank variable
dummy_ranks = pd.get_dummies(df['Prestige'], prefix="Prestige")
dummy_ranks.head()
Out[10]:
Given that prestige is a categorical variable, we perform dummy coding to convert its values into binary indicator columns.
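Only three of the four dummy columns will be carried into the model in the next cell; Prestige_4 is left out so the remaining indicators can be interpreted relative to a rank-4 baseline (and are not redundant with the model's intercept). A minimal sketch of the same idea, dropping the baseline column directly from dummy_ranks:

# Drop rank 4 as the baseline category; equivalent to selecting Prestige_1 through Prestige_3 below
dummy_ranks_no_baseline = dummy_ranks.drop('Prestige_4', axis=1)
dummy_ranks_no_baseline.head()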
In [11]:
columns1 = ['Admit', 'GRE', 'GPA']
data1 = df[columns1]
columns2 = ['Prestige_1','Prestige_2','Prestige_3']
data2 = dummy_ranks[columns2]
data = pd.merge(data1, data2, how="outer", left_index=True, right_index=True)
data
Out[11]:
In [12]:
# Separate independent and dependent variables
X = data.iloc[:, 1:]
y = data['Admit']
# Create a logistic regression model
initial = LogisticRegression(C = 1000, random_state=0)
initial.fit(X,y)
# Check model accuracy
print("Accuracy Score:", initial.score(X,y))
In [13]:
# What percentage of students actually got into grad school
print("Actual probability of admission:", y.mean())
If you simply guessed "no" for every candidate, you would be right around 68.25% of the time, so our model beats that baseline by only about 2.5 percentage points.
It is, however, well ahead of a 50/50 random guess: about 20.75 percentage points better.
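The comparison can be made explicit in code. A minimal sketch, reusing X, y, and the initial model fitted above:

# Baseline: always predict the majority class ("no admit")
null_accuracy = max(y.mean(), 1 - y.mean())
model_accuracy = initial.score(X, y)
print("Null accuracy (always guess 'no'):", null_accuracy)
print("Improvement over guessing 'no':", model_accuracy - null_accuracy)
print("Improvement over a 50/50 guess:", model_accuracy - 0.5)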
In [14]:
# View coefficients
column_names = list(X.columns)
coefficients = np.transpose(initial.coef_)
intercept = initial.intercept_
Coeffs = pd.DataFrame(coefficients, column_names, columns=['Coefficients'])
pd.concat([Coeffs, pd.DataFrame(intercept, ['Intercept'], columns=['Coefficients'])])
Out[14]:
The coefficients above tell us how each predictor contributes to the model.
Every additional point on the GRE raises a candidate's log odds of admission by 0.002; every unit increase in GPA raises them by 0.803.
The prestige coefficients are interpreted relative to the omitted rank-4 category: coming from a rank-1 school raises the log odds of admission by 1.509 compared with a rank-4 school. Differences between other ranks can be read off by subtracting coefficients, e.g., coming from a rank-1 school raises the log odds of admission by around 0.666 (1.508653 - 0.842366) versus a rank-2 school.
Keep in mind that these log-odds effects are specific to our model and the data it was fit on.
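Because these coefficients are on the log-odds scale, exponentiating them turns them into odds ratios, which some readers find easier to interpret. A minimal sketch, reusing the fitted initial model and column_names from above:

# Convert the log-odds coefficients into odds ratios
odds_ratios = pd.DataFrame(np.exp(np.transpose(initial.coef_)),
                           index=column_names, columns=['Odds Ratio'])
print(odds_ratios)
# e.g., exp(0.803) is about 2.23, so each additional GPA point multiplies the odds of admission by roughly 2.2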
In the real world, we will likely need to create a machine learning model that can take any set of predictor variables and spit out a probability of admission, which means we won't have the privilege of creating a logit model based on an entirely known set of data.
We will now create a logistic regression model based on one training set and one test set, with 70% of the data going into the training set and 30% going into the test set, in order to be able to construct a model and test its accuracy on data that was not used to create it.
In [15]:
# Split data into training and test sets, using 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# Fit the logistic regression with lambda = 10^-3
lr = LogisticRegression(C = 1000, random_state=0)
lr.fit(X_train, y_train)
Out[15]:
In [16]:
# View predictions
predicted = lr.predict(X_test)
print(predicted)
In [17]:
# View class probabilities
probabilities = lr.predict_proba(X_test)
print(probabilities)
In [18]:
# Check accuracy
print("Accuracy Score:", metrics.accuracy_score(y_test, predicted))
The accuracy score here is marginally (about 0.08 percentage points) higher than that of the logistic regression fit on all of the data.
In other words, even though this model was trained on a random 70% subset of the data, its accuracy on the held-out 30% slightly exceeds the in-sample accuracy of the model built from the full dataset.
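Since the gap is that small, it is worth checking how sensitive the test accuracy is to the particular random split. A minimal sketch; the seed values here are arbitrary:

# Refit on several different random splits to see how much the test accuracy moves
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    split_acc = LogisticRegression(C=1000, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    print("random_state = {}: test accuracy = {:.4f}".format(seed, split_acc))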
In [19]:
# Print confusion matrix and classification report
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test, predicted))
print("\nClassification Report:\n",metrics.classification_report(y_test,predicted))
The confusion matrix shows that out of 82 non-admits, our model got 77 right, while 5 were false positives. This very good hit rate for 0's is reflected in the high recall of 0.94 for 0's in the classification report. However, the model is not nearly as good at predicting admits, with only 8 out of 38 admissions correctly predicted. Again, this is reflected in the low recall of 0.21 for 1's.
Looking at precision, 72% of predicted 0's are actual 0's, and 62% of predicted 1's are actual 1's.
In total, 85 out of 120 results were correctly predicted by the model.
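These figures can be reproduced directly from the confusion matrix entries. A minimal sketch, reusing y_test and predicted from above:

# Unpack the confusion matrix and recompute recall and precision by hand
tn, fp, fn, tp = metrics.confusion_matrix(y_test, predicted).ravel()
print("Recall for 0's:", tn / (tn + fp))      # 77 / 82
print("Recall for 1's:", tp / (tp + fn))      # 8 / 38
print("Precision for 0's:", tn / (tn + fn))   # 77 / (77 + 30)
print("Precision for 1's:", tp / (tp + fp))   # 8 / (8 + 5)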
In order to better visualize the effectiveness of our model and support our existing analysis, we use an ROC curve. scikit-learn's predict method applies a fixed probability threshold of 0.5 for binary classifiers (the precision/recall balance can be shifted indirectly, for example via the class_weight argument of LogisticRegression), so it is still worthwhile to see the performance tradeoffs available across all thresholds, as well as to gain insight for potential model tuning later.
In [20]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, probabilities[:,1])
results = pd.DataFrame({'False Positive Rate': fpr, 'True Positive Rate': tpr})
plt.plot(fpr,tpr, color='g', label="Model")
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label="Baseline (Random Guessing)")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
print("Area Under the Curve:", metrics.roc_auc_score(y_test, probabilities[:,1]))
As the plot above shows, our logistic regression model is not especially strong (the area under the curve comes out to 0.6784), but, consistent with the earlier results, it still does better than random guessing.
Also note that the default 0.5 probability threshold used by scikit-learn corresponds to a single operating point on this curve: from the confusion matrix above, that point sits at a false positive rate of roughly 0.06 (5/82) and a true positive rate (recall for 1's) of roughly 0.21 (8/38).
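To illustrate how that operating point moves along the curve, we can reclassify the test set with a lower probability cutoff than the default 0.5. The value 0.35 below is arbitrary and only meant as a sketch:

# Re-threshold the predicted probabilities to trade false positives for extra true positives
threshold = 0.35  # arbitrary cutoff, for illustration only
relabeled = (probabilities[:, 1] >= threshold).astype(int)
print("Confusion matrix at threshold {}:".format(threshold))
print(metrics.confusion_matrix(y_test, relabeled))
print("Recall for 1's:", metrics.recall_score(y_test, relabeled))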
In [21]:
fivefold = cross_val_score(lr, X, y, scoring='accuracy', cv=5)
print("Score per fold:", fivefold)
print("Mean score:", fivefold.mean())
print("Standard deviation:", fivefold.std())
Using five-fold cross-validation on our current model yields an accuracy similar to the one derived previously, which suggests that the model is not overly dependent on any particular training split and will likely generalize reasonably well to new data.
In [43]:
from sklearn import preprocessing
In [57]:
# Isolate columns to scale
toscale = X[['GRE','GPA']].astype(float)
scaledX = preprocessing.scale(toscale)
scaleddata = pd.DataFrame(scaledX, columns=['GRE','GPA'])
# Join scaled data with categorical rank columns
scaledX = scaleddata.join(data2)
scaledX.head()
Out[57]:
In [58]:
improve1 = cross_val_score(lr, scaledX, y, scoring='accuracy', cv=5)
print("Score per fold:", improve1)
print("Mean score:", improve1.mean())
print("Standard deviation:", improve1.std())
The mean accuracy of our model does not change, but the standard deviation across folds decreases slightly. This suggests the feature-scaled model should deliver slightly more consistent, if not better, performance.
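One detail worth flagging is that the scaling above was computed from the full dataset before cross-validation. A sketch of how the same idea could be expressed with a scikit-learn Pipeline, so the scaler is refit inside each fold (note that this version scales the dummy columns as well, unlike the cell above, so the numbers may differ slightly):

from sklearn.pipeline import Pipeline

# Fit the scaler inside each cross-validation fold rather than up front
piped = Pipeline([('scale', preprocessing.StandardScaler()),
                  ('logreg', LogisticRegression(C=1000, random_state=0))])
piped_scores = cross_val_score(piped, X, y, scoring='accuracy', cv=5)
print("Mean score:", piped_scores.mean())
print("Standard deviation:", piped_scores.std())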
Based on our confusion matrix, the model occasionally assigned a "1" to actual 0's (five false positives). By modifying the class weights ever so slightly, from the default weight of 1 each to placing a little more weight on 0's, false positives are penalized slightly more heavily, which we hope translates into a small accuracy gain given that a 0 is considerably more likely than a 1 in this data.
In [59]:
lrweighted = LogisticRegression(C = 1000, random_state=0, class_weight={0:0.505,1:0.495})
improve2 = cross_val_score(lrweighted, scaledX, y, scoring='accuracy', cv=5)
print("Score per fold:", improve2)
print("Mean score:", improve2.mean())
print("Standard deviation:", improve2.std())
Our mean score shows a slight improvement. The standard deviation is slightly higher than that of the previous, feature-scaled model, but it is still lower than that of our original model.
In [60]:
# Sweep C over powers of ten; C = 1000 is skipped because it was evaluated in the previous cell
tens = [10**i for i in range(-5,6)]
for i in tens:
    if i == 1000:
        continue
    testlr = LogisticRegression(C = i, random_state=0, class_weight={0:0.505,1:0.495})
    testcrossval = cross_val_score(testlr, scaledX, y, scoring='accuracy', cv=5)
    print('For C = {}:'.format(i))
    print(' Score per fold:', testcrossval)
    print(' Mean score:', testcrossval.mean())
    print(' Standard deviation:', testcrossval.std())
Given that $C = \frac{1}{\lambda}$, it makes sense that beyond a certain value of $C$ (here, around 100) the model no longer improves: the regularization penalty in the logistic regression objective becomes negligible. As such, we do not need to change our current value of $C$.
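For reference, the L2-penalized objective that scikit-learn's LogisticRegression minimizes can be written (up to solver details) as

$$\min_{w,\,b}\ \frac{1}{2}\,w^{T}w \;+\; C\sum_{i=1}^{n}\log\left(1 + e^{-y_i\,(w^{T}x_i + b)}\right), \qquad y_i \in \{-1, +1\},$$

where $C$ scales the data-fit term. Once $C$ is large, the ridge term $\frac{1}{2}w^{T}w$ contributes almost nothing, so increasing $C$ further (equivalently, shrinking $\lambda$) leaves the fitted coefficients essentially unchanged.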
In [61]:
# Create new train and test sets and fit our revised model to it
X_train2, X_test2, y_train2, y_test2 = train_test_split(scaledX, y, test_size = 0.3, random_state = 0)
newlr = LogisticRegression(C = 1000, random_state=0, class_weight={0:0.505,1:0.495})
newlr.fit(X_train2, y_train2)
Out[61]:
In [63]:
# Check for metrics on the new predicted probabilities
newpredictions = newlr.predict(X_test2)
newprobabilities = newlr.predict_proba(X_test2)
print("Accuracy Score:", newlr.score(X_test2, y_test2),"\n")
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test2, newpredictions))
print("\nClassification Report:\n",metrics.classification_report(y_test2, newpredictions))
In line with our expectations from the model tuning, the metrics show an improvement over our original model.
The predictions for non-admits are unchanged, and two more admits are now correctly classified than before, a modest but genuine improvement.
In [64]:
# Plot a new ROC curve for the revised model
fpr2, tpr2, thresholds2 = metrics.roc_curve(y_test2, newprobabilities[:,1])
results2 = pd.DataFrame({'False Positive Rate': fpr2, 'True Positive Rate': tpr2})
plt.plot(fpr,tpr,color='darkgray', label="Original Model")
plt.plot(fpr2,tpr2, color='g', label="Revised Model")
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label="Baseline (Random Guessing)")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Revised Model')
plt.legend()
plt.show()
print("Area Under the Curve:", metrics.roc_auc_score(y_test2, newprobabilities[:,1]))
Our ROC curve differs slightly from the original model's, and it is, generally speaking, an improvement. Our area under the curve is also better than the original model's.