Comparison of Machine Learning Methods vs. Rule-Based Methods

  • Traditionally, educational institutions use rule-based models to generate risk scores, which then inform resource allocation (for example, Hiller et al., 1999)

  • Instead, we'll build a simple model using basic ML techniques and demonstrate why the risk scores it generates are better


In [184]:
## Imports
import pandas as pd
import numpy as np  # used for the train/test split and error calculation below
import seaborn as sns
sns.set(color_codes=True)
import matplotlib.pyplot as plt

Setup

  • First, we need to generate simulated data and read it into a data frame

In [77]:
# Generate the simulated student data (creates stud_df)
%run sim.py
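
The contents of sim.py are not shown here. Below is a minimal sketch of the kind of simulation it presumably runs; the column names come from how stud_df is used later, while the value ranges, the race categories other than 'aa' and 'latino', and the college names are placeholder assumptions.

import numpy as np
import pandas as pd

# Hypothetical sketch only; the real sim.py is not reproduced in this notebook.
rng = np.random.RandomState(0)
n = 4000  # matches the 3,151 / 849 train/test split printed below

stud_df = pd.DataFrame({
    # numeric fields stored as strings, hence the pd.to_numeric calls below
    'gpa': np.round(rng.uniform(2.0, 4.0, n), 2).astype(str),
    'psat': rng.randint(60, 241, n).astype(str),
    'honors': rng.randint(0, 8, n).astype(str),
    # 'aa' and 'latino' are the categories the rule-based model checks for
    'race': rng.choice(['aa', 'latino', 'white', 'asian'], n),
    'college': rng.choice(['community', 'state', 'flagship', 'ivy'], n),
})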

In [131]:
# Ensure gpa, honors, and psat are stored as numeric dtypes
stud_df.gpa = pd.to_numeric(stud_df.gpa)
stud_df.honors = pd.to_numeric(stud_df.honors)
stud_df.psat = pd.to_numeric(stud_df.psat)

Determine whether each student undermatched or was properly matched


In [125]:
avg_gpas = stud_df.groupby('college').gpa.mean()

def isUndermatched(student):
    """A student is undermatched if their GPA is at least 0.5 points above
    the average GPA at the college they attend."""
    return student.gpa >= avg_gpas[student.college] + 0.50

In [133]:
stud_df['undermatch_status'] = stud_df.apply(isUndermatched, axis=1)
#stud_df.groupby('race').undermatch_status.value_counts()

Rule-Based Model

  • A simple rule based on race, PSAT score, and honors coursework

In [155]:
# 80/20 train/test split; .copy() avoids SettingWithCopyWarning when columns are added later
msk = np.random.rand(len(stud_df)) < 0.8
train = stud_df[msk].copy()
test = stud_df[~msk].copy()
print("Training Set Length: ", len(train))
print("Testing Set Length: ", len(test))


Training Set Length:  3151
Testing Set Length:  849
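
Note that the mask above is unseeded, so the exact split (and the lengths printed) will differ between runs. If reproducibility matters, a seed can be set before the mask is drawn; the seed value in this sketch is arbitrary.

np.random.seed(42)  # arbitrary fixed seed for a reproducible split
msk = np.random.rand(len(stud_df)) < 0.8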

The Rules

  • We observe four variables for each student: GPA, PSAT score, honors coursework, and race
  • The rule-based model assigns a risk score from those observed variables
  • Rules based on Hoxby et al., 2013

In [156]:
stud_df.psat.hist()


Out[156]:
<matplotlib.axes._subplots.AxesSubplot at 0x113b36e48>

In [157]:
def rule_based_model(student_r):
    """Returns a rule-based risk score for the student passed in."""
    risk_score = 0
    # Demographic rules
    if student_r.race == 'aa':
        risk_score += 1
    if student_r.race == 'latino':
        risk_score += .5
    # High PSAT but little honors coursework suggests undermatching risk
    if student_r.psat >= 170 and student_r.honors <= 3:
        risk_score += 1
    return risk_score

In [158]:
test['risk_score'] = test.apply(rule_based_model, axis=1)


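
To compare these rule-based scores against the undermatch labels later on, one option is to threshold them into a binary flag. The cutoff of 1.0 below is not part of the original rules; it is an assumption for illustration.

# Hypothetical cutoff: treat a total risk score of 1.0 or more as "high risk"
test['rule_high_risk'] = test['risk_score'] >= 1.0

# Share of students in each flag group who actually undermatched
test.groupby('rule_high_risk').undermatch_status.mean()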

Machine Learning Model

  • Simple Logistic Regression

In [192]:
from sklearn import linear_model
feature_cols = ['psat', 'gpa', 'honors']
X = train[feature_cols]
y = train['undermatch_status']

# instantiate, fit
lm = linear_model.LogisticRegression()
lm.fit(X, y)


Out[192]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

In [194]:
# The fitted coefficients (order matches feature_cols)
print('Coefficients: \n', lm.coef_)
# Mean squared error on the held-out test set
print("Mean squared error: %.2f"
      % np.mean((lm.predict(test[feature_cols]).astype(int)
                 - test['undermatch_status'].astype(int)) ** 2))
# Predictions on the training set
lm.predict(train[feature_cols])


Coefficients: 
 [[-0.02641355  2.9242212  -0.01742304]]
Mean squared error: 0.14
Out[194]:
array([False, False, False, ..., False, False, False], dtype=bool)
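
A more direct check of how predictive the fitted model is would score it on the held-out test set. The sketch below uses standard scikit-learn metrics; feature_cols, lm, and test are all defined above. If the simulated data is truly random, accuracy should sit near the majority-class baseline and ROC AUC near 0.5.

from sklearn import metrics

test_preds = lm.predict(test[feature_cols])
test_probs = lm.predict_proba(test[feature_cols])[:, 1]

print("Test accuracy: %.3f" % metrics.accuracy_score(test['undermatch_status'], test_preds))
print("Test ROC AUC:  %.3f" % metrics.roc_auc_score(test['undermatch_status'], test_probs))
# Baseline: accuracy of always predicting the majority class
print("Majority-class baseline: %.3f" % max(test['undermatch_status'].mean(),
                                            1 - test['undermatch_status'].mean()))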

In [201]:
sns.lmplot(x='psat', y='undermatch_status', data=test, logistic=True)


Out[201]:
<seaborn.axisgrid.FacetGrid at 0x11315d8d0>

tl;dr: none of these features are predictive of undermatching, because the simulated data is random



Comparing Outputs

  • Comparison Matrix (sketched below)


In [ ]:
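
A minimal sketch of such a comparison matrix, cross-tabulating the thresholded rule-based flag (the hypothetical rule_high_risk column introduced earlier), the logistic regression's test-set predictions, and the true undermatch labels:

# Logistic regression predictions on the test set
test['lm_prediction'] = lm.predict(test[feature_cols])

# Rule-based flag vs. ML prediction
print(pd.crosstab(test['rule_high_risk'], test['lm_prediction'],
                  rownames=['rule-based high risk'], colnames=['ML predicted undermatch']))

# Each model's output vs. the true undermatch status
print(pd.crosstab(test['undermatch_status'], test['rule_high_risk']))
print(pd.crosstab(test['undermatch_status'], test['lm_prediction']))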