In [44]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
In [32]:
df_raw = pd.read_csv("../assets/admissions.csv")
df = df_raw.dropna()
print df_raw.count() # row counts before dropping missing values
print df.count() # row counts after dropping missing values
print df.head()
In [33]:
# frequency table for prestige and whether or not someone was admitted
print df.columns # list the dataframe's columns
pd.crosstab(df['admit'], df['prestige']) # cross-tabulate prestige against admit
Out[33]:
In [34]:
# creating dummy variables for var 'prestige' using pandas get_dummies function
dummy_ranks = pd.get_dummies(df['prestige'],prefix = 'prestige')
print dummy_ranks.head(5)
# still need to append these dummy values to df using .join function; see part 3
Answer: When modeling a categorical variable with n levels, only n-1 dummy variables are required; the omitted level becomes the base case. Since the regression below keeps prestige_2 through prestige_4, prestige 1 (the highest-ranked tier) serves as the base case, so we need three (3) dummy variables.
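A minimal sketch of the n-1 encoding on hypothetical values (the toy Series below is illustrative, not part of the admissions data): get_dummies creates one column per level, and dropping the base-case column leaves the three we actually model.
In [ ]:
# hypothetical mini-example: 4 prestige levels -> 3 modeled dummy columns
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 2.0], name='prestige')
toy_dummies = pd.get_dummies(toy, prefix='prestige')
print toy_dummies.drop('prestige_1.0', axis=1) # prestige_1.0 is the base case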
In [35]:
cols_to_keep = ['admit', 'gre', 'gpa']
handCalc = df[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_1':])
print handCalc.head()
In [73]:
# frequency table cutting prestige_1.0 against whether or not someone was admitted
print handCalc.columns
pd.crosstab(handCalc['prestige_1.0'],handCalc['admit'])
Out[73]:
In [74]:
# odds of admission for prestige_1.0 applicants: admitted / not admitted
odds_prestige_1 = 33.0 / 28.0
print odds_prestige_1
In [75]:
# odds of admission for everyone else, and the resulting odds ratio
odds_others = 93.0 / 243.0
print odds_others
OR = odds_prestige_1 / odds_others # odds ratio, ~3.08
print OR
In [ ]:
Answer: The odds of admission for a prestige_1.0 applicant are 33/28 ≈ 1.18, versus 93/243 ≈ 0.38 for everyone else, an odds ratio of about 3.08. In other words, for every 93 non-prestige_1.0 individuals who get into UCLA, 243 do not get accepted, while prestige_1.0 applicants are admitted more often than not.
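The same numbers can be pulled straight from the crosstab instead of being typed in by hand. A sketch, assuming the layout produced above (rows = prestige_1.0, columns = admit):
In [ ]:
# compute the odds ratio directly from the crosstab
ct = pd.crosstab(handCalc['prestige_1.0'], handCalc['admit'])
odds = ct.iloc[:, 1].astype(float) / ct.iloc[:, 0] # odds of admission per group
print odds.iloc[1] / odds.iloc[0] # odds ratio, ~3.08 for these counts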
In [76]:
pd.crosstab(handCalc['prestige_4.0'],handCalc['admit'])
Out[76]:
In [ ]:
Answer: For every 12 prestige_4.0 individuals admitted to UCLA, 114 individuals from higher-prestige schools are admitted; prestige_4.0 applicants account for fewer than one in ten admissions.
In [105]:
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_2':])
print data.head()
We're going to add a constant term for our Logistic Regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.
In [106]:
# manually add the intercept
data['intercept'] = 1
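Setting the column by hand works fine; statsmodels also ships a helper, sm.add_constant, that does the same thing (it names the column 'const'). An equivalent alternative, shown commented out so the pipeline above is unchanged:
In [ ]:
# equivalent: let statsmodels append the constant column itself
# data = sm.add_constant(data, prepend=False)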
In [107]:
train_cols = ['prestige_2.0', 'prestige_3.0', 'prestige_4.0', 'gre', 'gpa', 'intercept'] # include the manually added intercept
In [103]:
data_predictors = data[train_cols]
print data_predictors.head(2)
logit = sm.Logit(data['admit'], data_predictors)
results = logit.fit()
In [104]:
results.summary()
Out[104]:
In [109]:
# exponentiate the fitted coefficients to turn log-odds into odds ratios
print np.exp(results.params)
In [ ]:
Answer: Each exponentiated coefficient is an odds ratio: the multiplicative effect of that variable on the odds of admission, holding the others fixed. prestige_2.0 has an odds ratio below 1, so attending a rank-2 school lowers the odds of admission relative to the prestige_1.0 base case; an odds value converts to a probability via p = odds / (1 + odds). The effect is statistically significant because the coefficient's 95% confidence interval does not include 0.
Answer: gre and gpa are read the same way; a coefficient can only be called statistically significant if its confidence interval excludes 0 (equivalently, if the interval around its odds ratio excludes 1).
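Significance is easiest to read off by exponentiating the confidence intervals alongside the coefficients: an odds ratio is significant when its interval excludes 1. A sketch using the fitted results object:
In [ ]:
# odds ratios with their 95% confidence intervals
conf = results.conf_int()
conf['OR'] = results.params
conf.columns = ['2.5%', '97.5%', 'OR']
print np.exp(conf)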
As a way of evaluating our classifier, we're going to recreate the dataset with every logical combination of input values. This will allow us to see how the predicted probability of admission increases/decreases across different variables. First we're going to generate the combinations using a helper function called cartesian (defined below).
We're going to use np.linspace to create ranges of values for "gre" and "gpa". This creates a range of linearly spaced values from a specified minimum to maximum value, in our case just the min/max observed values.
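A quick illustration of the behavior on toy endpoints (not from the dataset):
In [ ]:
# np.linspace(start, stop, num) returns num evenly spaced values, endpoints included
print np.linspace(0.0, 1.0, 5) # [ 0. 0.25 0.5 0.75 1. ]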
In [52]:
def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])
    """
    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n / arrays[0].size
    out[:, 0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out
In [53]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
print gres
# array([ 220. , 284.44444444, 348.88888889, 413.33333333,
# 477.77777778, 542.22222222, 606.66666667, 671.11111111,
# 735.55555556, 800. ])
gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
print gpas
# array([ 2.26 , 2.45333333, 2.64666667, 2.84 , 3.03333333,
# 3.22666667, 3.42 , 3.61333333, 3.80666667, 4. ])
# enumerate all possibilities
combos = pd.DataFrame(cartesian([gres, gpas, [1, 2, 3, 4], [1.]]))
In [122]:
# name the columns of the combination grid
combos.columns = ['gre', 'gpa', 'prestige', 'intercept']
# recreate the dummy variables, this time from the enumerated prestige values
dummy_ranks = pd.get_dummies(combos['prestige'], prefix='prestige')
# keep only what we need for making predictions
cols_to_keep = ['gre', 'gpa', 'prestige', 'intercept']
combos = combos[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_2':])
print combos.head(2)
In [123]:
# use the fitted model to predict the probability of admission
# for every combination in the grid
combos['admit_pred'] = results.predict(combos[train_cols])
print combos.head()
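One quick way to read the grid (a sketch, assuming the admit_pred column created above): average the predicted probability within each prestige tier to see the isolated effect of prestige while gre and gpa vary.
In [ ]:
# mean predicted probability of admission for each prestige tier
print combos.groupby('prestige')['admit_pred'].mean()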
Answer:
In [ ]: