Comparison of Machine Learning Methods vs. Rule-Based Methods

  • Traditionally, educational institutions use rule-based models to generate risk scores, which then inform resource allocation (for example, Hiller et al., 1999)

  • Instead, we'll build a simple model using basic ML techniques and demonstrate why the risk scores it generates are better


In [184]:
## Imports
import pandas as pd
import numpy as np  # used for the train/test split and error calculation below
import seaborn as sns
sns.set(color_codes=True)
import matplotlib.pyplot as plt

Setup

  • First, we need to generate simulated data and read it into a data frame

In [77]:
# Generate the simulated student data (creates stud_df)
%run sim.py
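
The contents of sim.py are not shown here. Below is a minimal sketch of the kind of simulation it presumably runs; the column names come from how stud_df is used later, while the value ranges, the race categories other than 'aa' and 'latino', and the college names are placeholder assumptions.

import numpy as np
import pandas as pd

# Hypothetical sketch only; the real sim.py is not reproduced in this notebook.
rng = np.random.RandomState(0)
n = 4000  # matches the 3,151 / 849 train/test split printed below

stud_df = pd.DataFrame({
    # numeric fields stored as strings, hence the pd.to_numeric calls below
    'gpa': np.round(rng.uniform(2.0, 4.0, n), 2).astype(str),
    'psat': rng.randint(60, 241, n).astype(str),
    'honors': rng.randint(0, 8, n).astype(str),
    # 'aa' and 'latino' are the categories the rule-based model checks for
    'race': rng.choice(['aa', 'latino', 'white', 'asian'], n),
    'college': rng.choice(['community', 'state', 'flagship', 'ivy'], n),
})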

In [131]:
# Ensure gpa, honors, and psat are stored as numeric dtypes
stud_df.gpa = pd.to_numeric(stud_df.gpa)
stud_df.honors = pd.to_numeric(stud_df.honors)
stud_df.psat = pd.to_numeric(stud_df.psat)

Determine whether each student undermatched or was properly matched


In [125]:
avg_gpas = stud_df.groupby('college').gpa.mean()

def isUndermatched(student):
    """A student is undermatched if their GPA is at least 0.5 points above
    the average GPA at the college they attend."""
    return student.gpa >= avg_gpas[student.college] + 0.50

In [133]:
stud_df['undermatch_status'] = stud_df.apply(isUndermatched, axis=1)
#stud_df.groupby('race').undermatch_status.value_counts()

Rule-Based Model

  • A simple rule based on race, PSAT score, and honors coursework

In [155]:
# 80/20 train/test split; .copy() avoids SettingWithCopyWarning when columns are added later
msk = np.random.rand(len(stud_df)) < 0.8
train = stud_df[msk].copy()
test = stud_df[~msk].copy()
print("Training Set Length: ", len(train))
print("Testing Set Length: ", len(test))


Training Set Length:  3151
Testing Set Length:  849
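
Note that the mask above is unseeded, so the exact split (and the lengths printed) will differ between runs. If reproducibility matters, a seed can be set before the mask is drawn; the seed value in this sketch is arbitrary.

np.random.seed(42)  # arbitrary fixed seed for a reproducible split
msk = np.random.rand(len(stud_df)) < 0.8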

The Rules

  • We observe four variables for each student: GPA, PSAT score, honors coursework, and race
  • The rule-based model assigns a risk score from those observed variables
  • Rules based on Hoxby et al., 2013

In [156]:
stud_df.psat.hist()


Out[156]:
<matplotlib.axes._subplots.AxesSubplot at 0x113b36e48>

In [157]:
def rule_based_model(student_r):
    """Returns a rule-based risk score for the student passed in."""
    risk_score = 0
    # Demographic rules
    if student_r.race == 'aa':
        risk_score += 1
    if student_r.race == 'latino':
        risk_score += .5
    # High PSAT but little honors coursework suggests undermatching risk
    if student_r.psat >= 170 and student_r.honors <= 3:
        risk_score += 1
    return risk_score

In [158]:
test['risk_score'] = test.apply(rule_based_model, axis=1)


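
To compare these rule-based scores against the undermatch labels later on, one option is to threshold them into a binary flag. The cutoff of 1.0 below is not part of the original rules; it is an assumption for illustration.

# Hypothetical cutoff: treat a total risk score of 1.0 or more as "high risk"
test['rule_high_risk'] = test['risk_score'] >= 1.0

# Share of students in each flag group who actually undermatched
test.groupby('rule_high_risk').undermatch_status.mean()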

Machine Learning Model

  • Simple Logistic Regression

In [192]:
from sklearn import linear_model
feature_cols = ['psat', 'gpa', 'honors']
X = train[feature_cols]
y = train['undermatch_status']

# instantiate, fit
lm = linear_model.LogisticRegression()
lm.fit(X, y)


Out[192]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

In [194]:
# The fitted coefficients (order matches feature_cols)
print('Coefficients: \n', lm.coef_)
# Mean squared error on the held-out test set
print("Mean squared error: %.2f"
      % np.mean((lm.predict(test[feature_cols]).astype(int)
                 - test['undermatch_status'].astype(int)) ** 2))
# Predictions on the training set
lm.predict(train[feature_cols])


Coefficients: 
 [[-0.02641355  2.9242212  -0.01742304]]
Mean squared error: 0.14
Out[194]:
array([False, False, False, ..., False, False, False], dtype=bool)
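
A more direct check of how predictive the fitted model is would score it on the held-out test set. The sketch below uses standard scikit-learn metrics; feature_cols, lm, and test are all defined above. If the simulated data is truly random, accuracy should sit near the majority-class baseline and ROC AUC near 0.5.

from sklearn import metrics

test_preds = lm.predict(test[feature_cols])
test_probs = lm.predict_proba(test[feature_cols])[:, 1]

print("Test accuracy: %.3f" % metrics.accuracy_score(test['undermatch_status'], test_preds))
print("Test ROC AUC:  %.3f" % metrics.roc_auc_score(test['undermatch_status'], test_probs))
# Baseline: accuracy of always predicting the majority class
print("Majority-class baseline: %.3f" % max(test['undermatch_status'].mean(),
                                            1 - test['undermatch_status'].mean()))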

In [201]:
sns.lmplot(x='psat', y='undermatch_status', data=test, logistic=True)


Out[201]:
<seaborn.axisgrid.FacetGrid at 0x11315d8d0>

tl;dr: none of these features are predictive of undermatching, because the simulated data is random



Comparing Outputs

  • Comparison Matrix (sketched below)


In [ ]:
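
A minimal sketch of such a comparison matrix, cross-tabulating the thresholded rule-based flag (the hypothetical rule_high_risk column introduced earlier), the logistic regression's test-set predictions, and the true undermatch labels:

# Logistic regression predictions on the test set
test['lm_prediction'] = lm.predict(test[feature_cols])

# Rule-based flag vs. ML prediction
print(pd.crosstab(test['rule_high_risk'], test['lm_prediction'],
                  rownames=['rule-based high risk'], colnames=['ML predicted undermatch']))

# Each model's output vs. the true undermatch status
print(pd.crosstab(test['undermatch_status'], test['rule_high_risk']))
print(pd.crosstab(test['undermatch_status'], test['lm_prediction']))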