Project 3

In this project, you will perform a logistic regression on the admissions data we've been working with in projects 1 and 2.


In [44]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

In [32]:
df_raw = pd.read_csv("../assets/admissions.csv")
df = df_raw.dropna() 
print df_raw.count() # showing 'old' df total count
print df.count() # showing 'new' df total count 
print df.head()


admit       400
gre         398
gpa         398
prestige    399
dtype: int64
admit       397
gre         397
gpa         397
prestige    397
dtype: int64
   admit  gre   gpa  prestige
0      0  380  3.61         3
1      1  660  3.67         3
2      1  800  4.00         1
3      1  640  3.19         4
4      0  520  2.93         4

Part 1. Frequency Tables

1. Let's create a frequency table of our variables


In [33]:
# frequency table for prestige and whether or not someone was admitted
print df.columns # finding columns for dataframe
pd.crosstab(df['admit'], df['prestige']) # cross-tabulating admit against prestige


Index([u'admit', u'gre', u'gpa', u'prestige'], dtype='object')
Out[33]:
prestige 1.0 2.0 3.0 4.0
admit
0 28 95 93 55
1 33 53 28 12

Part 2. Return of dummy variables

2.1 Create class or dummy variables for prestige


In [34]:
# creating dummy variables for var 'prestige' using pandas get_dummies function
dummy_ranks = pd.get_dummies(df['prestige'],prefix = 'prestige')
print dummy_ranks.head(5)
# still need to append these dummy values to df using .join function; see part 3


   prestige_1.0  prestige_2.0  prestige_3.0  prestige_4.0
0             0             0             1             0
1             0             0             1             0
2             1             0             0             0
3             0             0             0             1
4             0             0             0             1

2.2 When modeling our class variables, how many do we need?

Answer: When modeling a categorical variable with n levels, only n-1 dummy variables are needed; the omitted level serves as the baseline that the other levels are compared against. Prestige has four levels, so we will need three (3) class variables. In the regression in Part 4 we drop prestige_1.0, treating prestige 1 as the baseline.
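For illustration, a minimal sketch of keeping only the three non-baseline dummy columns, using the dummy_ranks frame created above (the name model_dummies is just for this example):

    model_dummies = dummy_ranks.ix[:, 'prestige_2.0':]  # drop the baseline prestige_1.0 column
    print(model_dummies.columns.tolist())               # ['prestige_2.0', 'prestige_3.0', 'prestige_4.0']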

Part 3. Hand calculating odds ratios

Develop your intuition about expected outcomes by hand calculating odds ratios.


In [35]:
cols_to_keep = ['admit', 'gre', 'gpa']
handCalc = df[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_1':])
print handCalc.head()


   admit  gre   gpa  prestige_1.0  prestige_2.0  prestige_3.0  prestige_4.0
0      0  380  3.61             0             0             1             0
1      1  660  3.67             0             0             1             0
2      1  800  4.00             1             0             0             0
3      1  640  3.19             0             0             0             1
4      0  520  2.93             0             0             0             1

In [73]:
# crosstab prestige 1 admission 
# frequency table cutting prestige and whether or not someone was admitted
print handCalc.columns
pd.crosstab(handCalc['prestige_1.0'],handCalc['admit'])


Index([u'admit', u'gre', u'gpa', u'prestige_1.0', u'prestige_2.0',
       u'prestige_3.0', u'prestige_4.0'],
      dtype='object')
Out[73]:
admit 0 1
prestige_1.0
0 243 93
1 28 33

3.1 Use the cross tab above to calculate the odds of being admitted to grad school if you attended a #1 ranked college


In [74]:
# odds of admission for prestige_1.0 applicants: admitted / not admitted
odds_prestige_1 = 33.0 / 28.0
print odds_prestige_1


1.17857142857

3.2 Now calculate the odds of admission if you did not attend a #1 ranked college


In [75]:
# odds of admission for applicants not from a prestige_1.0 college
odds_not_prestige_1 = 93.0 / 243.0
print odds_not_prestige_1


0.382716049383

3.3 Calculate the odds ratio


In [ ]:
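# A sketch of the odds-ratio calculation from the crosstab above: the odds for
# prestige_1.0 applicants divided by the odds for everyone else.
odds_ratio = (33.0 / 28.0) / (93.0 / 243.0)
print odds_ratio   # roughly 3.08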

3.4 Write this finding in a sentence:

Answer: Applicants from a #1 ranked college have odds of admission of about 1.18 (33 admitted vs. 28 rejected), while everyone else has odds of about 0.38 (93 admitted vs. 243 rejected). The odds ratio is therefore roughly 3.1: attending a #1 ranked college is associated with about three times the odds of being admitted to grad school.

3.5 Print the cross tab for prestige_4


In [76]:
pd.crosstab(handCalc['prestige_4.0'],handCalc['admit'])


Out[76]:
admit 0 1
prestige_4.0
0 216 114
1 55 12

3.6 Calculate the OR


In [ ]:
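# A sketch of the corresponding calculation for prestige_4.0 from the crosstab above.
odds_prestige_4 = 12.0 / 55.0        # admitted vs. rejected, prestige_4.0 applicants
odds_not_prestige_4 = 114.0 / 216.0  # admitted vs. rejected, everyone else
print odds_prestige_4 / odds_not_prestige_4   # roughly 0.41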

3.7 Write this finding in a sentence

Answer: Applicants from a #4 ranked college have odds of admission of about 0.22 (12 admitted vs. 55 rejected), while everyone else has odds of about 0.53 (114 admitted vs. 216 rejected). The odds ratio is roughly 0.41, so attending a #4 ranked college is associated with substantially lower odds of admission.

Part 4. Analysis


In [105]:
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_2':])
print data.head()


   admit  gre   gpa  prestige_2.0  prestige_3.0  prestige_4.0
0      0  380  3.61             0             1             0
1      1  660  3.67             0             1             0
2      1  800  4.00             0             0             0
3      1  640  3.19             0             0             1
4      0  520  2.93             0             0             1

We're going to add a constant term for our Logistic Regression. The statsmodels function we're going to be using requires that intercepts/constants are specified explicitly.


In [106]:
# manually add the intercept
data['intercept'] = 1
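As a side note, statsmodels can append the constant for you; a minimal sketch of the equivalent call (it names the new column 'const' rather than 'intercept'):

    data = sm.add_constant(data, prepend=False)  # appends a column of 1s named 'const'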

4.1 Set the covariates to a variable called train_cols


In [107]:
train_cols = ['prestige_2.0','prestige_3.0','prestige_4.0','gre', 'gpa']

4.2 Fit the model


In [103]:
data_predictors = data[train_cols]
print data_predictors.head(2)
logit = sm.Logit(data['admit'], data_predictors)
results = logit.fit()


   prestige_2.0  prestige_3.0  prestige_4.0  gre   gpa
0             0             1             0  380  3.61
1             0             1             0  660  3.67
Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5

4.3 Print the summary results


In [104]:
results.summary()


Out[104]:
Logit Regression Results
Dep. Variable: admit No. Observations: 397
Model: Logit Df Residuals: 392
Method: MLE Df Model: 4
Date: Thu, 24 Mar 2016 Pseudo R-squ.: 0.05722
Time: 16:01:08 Log-Likelihood: -233.88
converged: True LL-Null: -248.08
LLR p-value: 1.039e-05
coef std err z P>|z| [95.0% Conf. Int.]
prestige_2.0 -0.9562 0.302 -3.171 0.002 -1.547 -0.365
prestige_3.0 -1.5375 0.332 -4.627 0.000 -2.189 -0.886
prestige_4.0 -1.8699 0.401 -4.658 0.000 -2.657 -1.083
gre 0.0014 0.001 1.308 0.191 -0.001 0.003
gpa -0.1323 0.195 -0.680 0.497 -0.514 0.249

4.4 Calculate the odds ratios of the coefficients and their 95% confidence intervals

hint 1: np.exp(X)

hint 2: conf['OR'] = params

       conf.columns = ['2.5%', '97.5%', 'OR']

In [109]:
print np.exp(-0.9562)
print np.exp(-1.5375)
print np.exp(-1.8699)
print np.exp(0.0014)
print np.exp(-0.1323)


0.384350646933
0.21491772468
0.154139074952
1.00140098046
0.876078132212

In [ ]:
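# A sketch following the hints above: exponentiate the fitted coefficients and their
# confidence bounds together in one table (uses the `results` object from 4.2).
params = results.params
conf = results.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print np.exp(conf)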

4.5 Interpret the OR of Prestige_2

Answer: The odds ratio for prestige_2.0 is exp(-0.9562) ≈ 0.38: holding GRE and GPA constant, attending a #2 ranked college multiplies the odds of admission by about 0.38 (a reduction of roughly 62%) relative to the baseline. The effect is statistically significant, since the 95% CI for the coefficient (-1.547, -0.365) does not include 0 (equivalently, the CI for the odds ratio does not include 1).

4.6 Interpret the OR of GPA

Answer: The odds ratio for GPA is exp(-0.1323) ≈ 0.88, suggesting slightly lower odds of admission per additional GPA point, but we cannot say this is statistically significant because the 95% CI for the coefficient (-0.514, 0.249) includes 0.

Part 5: Predicted probabilities

As a way of evaluating our classifier, we're going to recreate the dataset with every logical combination of input values. This will allow us to see how the predicted probability of admission increases/decreases across the different variables. First we're going to generate the combinations using a helper function called cartesian (defined below).

We're going to use np.linspace to create a range of values for "gre" and "gpa". This creates a range of linearly spaced values between a specified minimum and maximum value -- in our case just the min/max observed values.


In [52]:
def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.
    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.
    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.
    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])
    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n / arrays[0].size
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m,1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m,1:] = out[0:m,1:]
    return out

In [53]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gres = np.linspace(data['gre'].min(), data['gre'].max(), 10)
print gres
# array([ 220.        ,  284.44444444,  348.88888889,  413.33333333,
#         477.77777778,  542.22222222,  606.66666667,  671.11111111,
#         735.55555556,  800.        ])
gpas = np.linspace(data['gpa'].min(), data['gpa'].max(), 10)
print gpas
# array([ 2.26      ,  2.45333333,  2.64666667,  2.84      ,  3.03333333,
#         3.22666667,  3.42      ,  3.61333333,  3.80666667,  4.        ])


# enumerate all possibilities
combos = pd.DataFrame(cartesian([gres, gpas, [1, 2, 3, 4], [1.]]))


[ 220.          284.44444444  348.88888889  413.33333333  477.77777778
  542.22222222  606.66666667  671.11111111  735.55555556  800.        ]
[ 2.26        2.45333333  2.64666667  2.84        3.03333333  3.22666667
  3.42        3.61333333  3.80666667  4.        ]

5.1 Recreate the dummy variables


In [122]:
# name the columns of the enumerated dataset, then recreate the dummy variables
# from the prestige column of combos (not from the original df)
combos.columns = ['gre', 'gpa', 'prestige', 'intercept']
dummy_ranks = pd.get_dummies(combos['prestige'], prefix='prestige')
# keep only what we need for making predictions: join the non-baseline prestige dummies
combos = combos.join(dummy_ranks.ix[:, 'prestige_2':])
combos.head(2)



5.2 Make predictions on the enumerated dataset


In [123]:
# predict the probability of admission for each combination, using the model fit in 4.2
combos['admit_pred'] = results.predict(combos[train_cols])



5.3 Interpret findings for the last 4 observations

Answer:

Bonus

Plot the probability of being admitted into graduate school, stratified by GPA and GRE score.


In [ ]:
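# A sketch of one possible bonus plot, assuming combos['admit_pred'] was computed in 5.2:
# average the predicted probability over prestige at each (gre, gpa) grid point and
# draw one line per GPA level against GRE score.
pivot = combos.pivot_table(values='admit_pred', index='gre', columns='gpa', aggfunc=np.mean)
pivot.plot(title='Predicted probability of admission by GRE, stratified by GPA')
pl.xlabel('GRE score')
pl.ylabel('P(admit)')
pl.show()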