# Logistic Regression with scikit-learn

This is an example of logistic regression in Python with the scikit-learn module, performed for an assignment with my General Assembly Data Science class.

## Dataset

The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs. More information about the study is available in a 1978 paper from the Journal of Political Economy.

## Description of Variables

The dataset contains 6366 observations of 9 variables:

• `rate_marriage`: woman's rating of her marriage (1 = very poor, 5 = very good)
• `age`: woman's age
• `yrs_married`: number of years married
• `children`: number of children
• `religious`: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
• `educ`: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
• `occupation`: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
• `occupation_husb`: husband's occupation (same coding as above)
• `affairs`: time spent in extra-marital affairs

## Problem Statement

I decided to treat this as a classification problem by creating a new binary variable `affair` (did the woman have at least one affair?) and trying to predict the classification for each woman.

Skipper Seabold, one of the primary contributors to Statsmodels, did a similar classification in his Statsmodels demo at a Statistical Programming DC Meetup. However, he used Statsmodels for the classification (whereas I'm using scikit-learn), and he treated the occupation variables as continuous (whereas I'm treating them as categorical).

## Import modules

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
```

## Data Pre-Processing

First, let's load the dataset and add a binary `affair` column.

```python
# load the affairs dataset that ships with Statsmodels
dta = sm.datasets.fair.load_pandas().data

# add "affair" column: 1 represents having affairs, 0 represents not
dta['affair'] = (dta.affairs > 0).astype(int)
```

## Data Exploration

```python
dta.groupby('affair').mean()
```
| affair | rate_marriage | age       | yrs_married | children | religious | educ      | occupation | occupation_husb | affairs  |
|--------|---------------|-----------|-------------|----------|-----------|-----------|------------|-----------------|----------|
| 0      | 4.329701      | 28.390679 | 7.989335    | 1.238813 | 2.504521  | 14.322977 | 3.405286   | 3.833758        | 0.000000 |
| 1      | 3.647345      | 30.537019 | 11.152460   | 1.728933 | 2.261568  | 13.972236 | 3.463712   | 3.884559        | 2.187243 |

We can see that on average, women who have affairs rate their marriages lower, which is to be expected. Let's take another look at the `rate_marriage` variable.

```python
dta.groupby('rate_marriage').mean()
```
| rate_marriage | age       | yrs_married | children | religious | educ      | occupation | occupation_husb | affairs  | affair   |
|---------------|-----------|-------------|----------|-----------|-----------|------------|-----------------|----------|----------|
| 1.0           | 33.823232 | 13.914141   | 2.308081 | 2.343434  | 13.848485 | 3.232323   | 3.838384        | 1.201671 | 0.747475 |
| 2.0           | 30.471264 | 10.727011   | 1.735632 | 2.330460  | 13.864943 | 3.327586   | 3.764368        | 1.615745 | 0.635057 |
| 3.0           | 30.008056 | 10.239174   | 1.638469 | 2.308157  | 14.001007 | 3.402820   | 3.798590        | 1.371281 | 0.550856 |
| 4.0           | 28.856601 | 8.816905    | 1.369536 | 2.400981  | 14.144514 | 3.420161   | 3.835861        | 0.674837 | 0.322926 |
| 5.0           | 28.574702 | 8.311662    | 1.252794 | 2.506334  | 14.399776 | 3.454918   | 3.892697        | 0.348174 | 0.181446 |

An increase in `age`, `yrs_married`, and `children` appears to correlate with a declining marriage rating.
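
To quantify this, here's a quick sketch (using the `dta` frame loaded above) that checks the correlation of each of these variables with `rate_marriage`:

```python
# correlation of age, years married, and children with marriage rating
print(dta[['age', 'yrs_married', 'children']].corrwith(dta.rate_marriage))
```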

## Data Visualization

```python
# show plots in the notebook
%matplotlib inline
```

```python
# histogram of education
dta.educ.hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')
```
```python
# histogram of marriage rating
dta.rate_marriage.hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
```

Let's take a look at the distribution of marriage ratings for those having affairs versus those not having affairs.

```python
# barplot of marriage rating grouped by affair (True or False)
pd.crosstab(dta.rate_marriage, dta.affair.astype(bool)).plot(kind='bar')
plt.title('Marriage Rating Distribution by Affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
```

Let's use a stacked barplot to look at the percentage of women having affairs by number of years of marriage.

```python
# stacked barplot of affair percentage by years married
affair_yrs_married = pd.crosstab(dta.yrs_married, dta.affair.astype(bool))
affair_yrs_married.div(affair_yrs_married.sum(1).astype(float),
                       axis=0).plot(kind='bar', stacked=True)
plt.title('Affair Percentage by Years Married')
plt.xlabel('Years Married')
plt.ylabel('Percentage')
```

## Prepare Data for Logistic Regression

To prepare the data, I want to add an intercept column as well as dummy variables for `occupation` and `occupation_husb`, since I'm treating them as categorical variables. The `dmatrices` function from the patsy module can do that using formula language.

```python
# create dataframes with an intercept column and dummy variables for
# occupation and occupation_husb
y, X = dmatrices('affair ~ rate_marriage + age + yrs_married + children + \
religious + educ + C(occupation) + C(occupation_husb)',
                 dta, return_type="dataframe")
print(X.columns)
```
```
Index(['Intercept', 'C(occupation)[T.2.0]', 'C(occupation)[T.3.0]',
       'C(occupation)[T.4.0]', 'C(occupation)[T.5.0]', 'C(occupation)[T.6.0]',
       'C(occupation_husb)[T.2.0]', 'C(occupation_husb)[T.3.0]',
       'C(occupation_husb)[T.4.0]', 'C(occupation_husb)[T.5.0]',
       'C(occupation_husb)[T.6.0]', 'rate_marriage', 'age', 'yrs_married',
       'children', 'religious', 'educ'],
      dtype='object')
```

The column names for the dummy variables are ugly, so let's rename those.

```python
# fix column names of X
X = X.rename(columns={'C(occupation)[T.2.0]': 'occ_2',
                      'C(occupation)[T.3.0]': 'occ_3',
                      'C(occupation)[T.4.0]': 'occ_4',
                      'C(occupation)[T.5.0]': 'occ_5',
                      'C(occupation)[T.6.0]': 'occ_6',
                      'C(occupation_husb)[T.2.0]': 'occ_husb_2',
                      'C(occupation_husb)[T.3.0]': 'occ_husb_3',
                      'C(occupation_husb)[T.4.0]': 'occ_husb_4',
                      'C(occupation_husb)[T.5.0]': 'occ_husb_5',
                      'C(occupation_husb)[T.6.0]': 'occ_husb_6'})
```

We also need to flatten `y` into a 1-D array, so that scikit-learn will properly understand it as the response variable.

```python
# flatten y into a 1-D array
y = np.ravel(y)
```

## Logistic Regression

Let's go ahead and run logistic regression on the entire data set, and see how accurate it is!

```python
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)
```
```
0.72588752748978946
```

73% accuracy seems good, but what's the null error rate?

```python
# fraction of women who had affairs
y.mean()
```
```
0.32249450204209867
```

Only 32% of the women had affairs, which means that you could obtain 68% accuracy by always predicting "no". So we're doing better than the null error rate, but not by much.
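
As a sanity check, that baseline can be computed directly from `y`; a quick sketch:

```python
# null accuracy: always predict the majority class (no affair)
print(max(y.mean(), 1 - y.mean()))  # ~0.68
```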

Let's examine the coefficients to see what we learn.

```python
# examine the coefficients
pd.DataFrame(list(zip(X.columns, np.transpose(model.coef_))))
```
|    | 0             | 1                  |
|----|---------------|--------------------|
| 0  | Intercept     | [1.48986214923]    |
| 1  | occ_2         | [0.188041521947]   |
| 2  | occ_3         | [0.498919714822]   |
| 3  | occ_4         | [0.250640990053]   |
| 4  | occ_5         | [0.838976932692]   |
| 5  | occ_6         | [0.83400830339]    |
| 6  | occ_husb_2    | [0.190579893537]   |
| 7  | occ_husb_3    | [0.297779843131]   |
| 8  | occ_husb_4    | [0.161353493872]   |
| 9  | occ_husb_5    | [0.187717838451]   |
| 10 | occ_husb_6    | [0.193948470313]   |
| 11 | rate_marriage | [-0.703115859924]  |
| 12 | age           | [-0.0584176922855] |
| 13 | yrs_married   | [0.105676573773]   |
| 14 | children      | [0.0169202689857]  |
| 15 | religious     | [-0.371133761191]  |
| 16 | educ          | [0.00401556692472] |

Increases in marriage rating and religiousness correspond to a decrease in the likelihood of having an affair. For both the wife's occupation and the husband's occupation, the lowest likelihood of having an affair corresponds to the baseline occupation (student), since all of the dummy coefficients are positive.
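
Since logistic regression coefficients are log-odds, exponentiating them yields odds ratios, which can be easier to interpret. A quick sketch using the fitted `model` from above:

```python
# convert log-odds coefficients to odds ratios
print(pd.DataFrame({'feature': X.columns,
                    'odds_ratio': np.exp(model.coef_[0])}))
```

For example, each one-point increase in `rate_marriage` multiplies the odds of an affair by about exp(-0.70) ≈ 0.5, cutting them roughly in half.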

## Model Evaluation Using a Validation Set

So far, we have trained and tested on the same set. Let's instead split the data into a training set and a testing set.

```python
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
```
```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
```

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.

```python
# predict class labels for the test set
predicted = model2.predict(X_test)
print(predicted)
```
```
[ 1.  0.  0. ...,  0.  0.  0.]
```
```python
# generate class probabilities
probs = model2.predict_proba(X_test)
print(probs)
```
```
[[ 0.35148525  0.64851475]
 [ 0.90955539  0.09044461]
 [ 0.72569407  0.27430593]
 ...,
 [ 0.55730636  0.44269364]
 [ 0.81211049  0.18788951]
 [ 0.74732836  0.25267164]]
```

As you can see, the classifier is predicting a 1 (having an affair) any time the probability in the second column is greater than 0.5.
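
We can confirm that equivalence directly; a quick sketch:

```python
# predict() should match thresholding P(affair) at 0.5
print((predicted == (probs[:, 1] > 0.5)).all())  # True
```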

Now let's generate some evaluation metrics.

```python
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))
print(metrics.roc_auc_score(y_test, probs[:, 1]))
```
```
0.729319371728
0.745948078253
```

The test-set accuracy is 73%, matching what we saw when training and predicting on the same data.

We can also see the confusion matrix and a classification report with other metrics.

```python
# confusion matrix and classification report
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))
```
```
[[1169  134]
 [ 383  224]]
             precision    recall  f1-score   support

        0.0       0.75      0.90      0.82      1303
        1.0       0.63      0.37      0.46       607

avg / total       0.71      0.73      0.71      1910
```
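
The class-1 precision and recall can be read straight off the confusion matrix; a quick sketch of the arithmetic:

```python
# unpack the confusion matrix printed above
tn, fp, fn, tp = 1169, 134, 383, 224
print(tp / (tp + fp))  # precision for class 1: ~0.63
print(tp / (tp + fn))  # recall for class 1: ~0.37
```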

## Model Evaluation Using Cross-Validation

Now let's try 10-fold cross-validation, to see whether the accuracy holds up under a more rigorous evaluation.

```python
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy',
                         cv=10)
print(scores)
print(scores.mean())
```
```
[ 0.72100313  0.70219436  0.73824451  0.70597484  0.70597484  0.72955975
  0.7327044   0.70440252  0.75157233  0.75      ]
0.724163068551
```

Looks good. The model still averages about 72% accuracy across the folds.

## Predicting the Probability of an Affair

Just for fun, let's predict the probability of an affair for a hypothetical woman not present in the dataset. She's a 25-year-old teacher who graduated college, has been married for 3 years, has 1 child, rates herself as strongly religious, rates her marriage as fair, and her husband is a farmer.

```python
# feature order matches X.columns: Intercept, occ_2..occ_6,
# occ_husb_2..occ_husb_6, rate_marriage, age, yrs_married, children,
# religious, educ
model.predict_proba(np.array([[1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 3, 25, 3, 1,
                               4, 16]]))
```
```
array([[ 0.77472403,  0.22527597]])
```

The predicted probability of an affair is 23%.
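
Building that feature vector positionally is easy to get wrong; a safer sketch (same inputs, same model) labels each value and reorders the columns to match `X.columns`:

```python
# same prediction, with named columns instead of positional values
woman = pd.DataFrame([{'Intercept': 1, 'occ_2': 0, 'occ_3': 0, 'occ_4': 1,
                       'occ_5': 0, 'occ_6': 0, 'occ_husb_2': 1,
                       'occ_husb_3': 0, 'occ_husb_4': 0, 'occ_husb_5': 0,
                       'occ_husb_6': 0, 'rate_marriage': 3, 'age': 25,
                       'yrs_married': 3, 'children': 1, 'religious': 4,
                       'educ': 16}])[X.columns]
print(model.predict_proba(woman))
```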

## Next Steps

There are many different steps that could be tried in order to improve the model:

• including interaction terms
• removing features
• regularization techniques (for example, tuning the regularization strength `C`, as sketched below)
• using a non-linear model
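
As a minimal sketch of the regularization idea, here is a grid search over scikit-learn's inverse regularization strength `C` (the grid values are just illustrative):

```python
from sklearn.model_selection import GridSearchCV

# illustrative grid over the inverse regularization strength C
grid = GridSearchCV(LogisticRegression(),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                    scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```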