Dataset:

The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs. More information about the study is available in a 1978 paper from the Journal of Political Economy.

Description of Variables

The dataset contains 6366 observations of 9 variables:

rate_marriage: woman's rating of her marriage (1 = very poor, 5 = very good)
age: woman's age
yrs_married: number of years married
children: number of children
religious: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
educ: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
occupation: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
occupation_husb: husband's occupation (same coding as above)
affairs: time spent in extra-marital affairs



In [10]:

    
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.cross_validation import cross_val_score



In [12]:

    
# load dataset
dta = sm.datasets.fair.load_pandas().data

# add "affair" column: 1 represents having affairs, 0 represents not
dta['affair'] = (dta.affairs > 0).astype(int)

dta['affair']









    Out[12]:





0       1
1       1
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
14      1
15      1
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      1
25      1
26      1
27      1
28      1
29      1
       ..
6336    0
6337    0
6338    0
6339    0
6340    0
6341    0
6342    0
6343    0
6344    0
6345    0
6346    0
6347    0
6348    0
6349    0
6350    0
6351    0
6352    0
6353    0
6354    0
6355    0
6356    0
6357    0
6358    0
6359    0
6360    0
6361    0
6362    0
6363    0
6364    0
6365    0
Name: affair, Length: 6366, dtype: int32

Data Exploration



In [13]:

    
dta.groupby('affair').mean()

# on average, women who have affairs rate their marriages lower, which is to be expected









    Out[13]:







  
    
      
      rate_marriage
      age
      yrs_married
      children
      religious
      educ
      occupation
      occupation_husb
      affairs
    
    
      affair
      
      
      
      
      
      
      
      
      
    
  
  
    
      0
      4.329701
      28.390679
      7.989335
      1.238813
      2.504521
      14.322977
      3.405286
      3.833758
      0.000000
    
    
      1
      3.647345
      30.537019
      11.152460
      1.728933
      2.261568
      13.972236
      3.463712
      3.884559
      2.187243



In [14]:

    
dta.groupby('rate_marriage').mean()









    Out[14]:







  
    
      
      age
      yrs_married
      children
      religious
      educ
      occupation
      occupation_husb
      affairs
      affair
    
    
      rate_marriage
      
      
      
      
      
      
      
      
      
    
  
  
    
      1.0
      33.823232
      13.914141
      2.308081
      2.343434
      13.848485
      3.232323
      3.838384
      1.201671
      0.747475
    
    
      2.0
      30.471264
      10.727011
      1.735632
      2.330460
      13.864943
      3.327586
      3.764368
      1.615745
      0.635057
    
    
      3.0
      30.008056
      10.239174
      1.638469
      2.308157
      14.001007
      3.402820
      3.798590
      1.371281
      0.550856
    
    
      4.0
      28.856601
      8.816905
      1.369536
      2.400981
      14.144514
      3.420161
      3.835861
      0.674837
      0.322926
    
    
      5.0
      28.574702
      8.311662
      1.252794
      2.506334
      14.399776
      3.454918
      3.892697
      0.348174
      0.181446

Data Visualization



In [15]:

    
# show plots in the notebook
%matplotlib inline



In [16]:

    
#histograms of education and marriage rating.


dta.educ.hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')









    Out[16]:





<matplotlib.text.Text at 0x25940bb3b00>



In [17]:

    
dta.rate_marriage.hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')









    Out[17]:





<matplotlib.text.Text at 0x25940d9ecf8>



In [18]:

    
# look at the distribution of marriage ratings for those having affairs versus those not having affairs.


# barplot of marriage rating grouped by affair (True or False)
pd.crosstab(dta.rate_marriage, dta.affair.astype(bool)).plot(kind='bar')
plt.title('Marriage Rating Distribution by Affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')









    Out[18]:





<matplotlib.text.Text at 0x25940bed470>



In [ ]:



In [23]:

    
# create dataframes with an intercept column and dummy variables for
# occupation and occupation_husb
y, X = dmatrices('affair ~ rate_marriage + age + yrs_married + children + \
                  religious + educ + C(occupation) + C(occupation_husb)',
                  dta, return_type="dataframe")
print( X.columns)









    



Index(['Intercept', 'C(occupation)[T.2.0]', 'C(occupation)[T.3.0]',
       'C(occupation)[T.4.0]', 'C(occupation)[T.5.0]', 'C(occupation)[T.6.0]',
       'C(occupation_husb)[T.2.0]', 'C(occupation_husb)[T.3.0]',
       'C(occupation_husb)[T.4.0]', 'C(occupation_husb)[T.5.0]',
       'C(occupation_husb)[T.6.0]', 'rate_marriage', 'age', 'yrs_married',
       'children', 'religious', 'educ'],
      dtype='object')



In [24]:

    
# fix column names of X
X = X.rename(columns = {'C(occupation)[T.2.0]':'occ_2',
                        'C(occupation)[T.3.0]':'occ_3',
                        'C(occupation)[T.4.0]':'occ_4',
                        'C(occupation)[T.5.0]':'occ_5',
                        'C(occupation)[T.6.0]':'occ_6',
                        'C(occupation_husb)[T.2.0]':'occ_husb_2',
                        'C(occupation_husb)[T.3.0]':'occ_husb_3',
                        'C(occupation_husb)[T.4.0]':'occ_husb_4',
                        'C(occupation_husb)[T.5.0]':'occ_husb_5',
                        'C(occupation_husb)[T.6.0]':'occ_husb_6'})

need to flatten y into a 1-D array, so that scikit-learn will properly understand it as the response variable.



In [26]:

    
# flatten y into a 1-D array
y = np.ravel(y)
y









    Out[26]:





array([ 1.,  1.,  1., ...,  0.,  0.,  0.])

Logistic Regression



In [27]:

    
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)









    Out[27]:





0.72588752748978946



In [30]:

    
# what percentage had affairs?
y.mean()









    Out[30]:





0.32249450204209867

Model Evaluation Using a Validation Set



In [32]:

    
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)









    Out[32]:





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)



In [34]:

    
# predict class labels for the test set
predicted = model2.predict(X_test)
print (predicted)









    



[ 1.  0.  0. ...,  0.  0.  0.]



In [36]:

    
# generate class probabilities
probs = model2.predict_proba(X_test)
print (probs)









    



[[ 0.35142488  0.64857512]
 [ 0.90952576  0.09047424]
 [ 0.72576603  0.27423397]
 ..., 
 [ 0.55736751  0.44263249]
 [ 0.81213933  0.18786067]
 [ 0.74729595  0.25270405]]

generate some evaluation metrics.



In [37]:

    
print (metrics.accuracy_score(y_test, predicted))
print (metrics.roc_auc_score(y_test, probs[:, 1]))









    



0.729842931937
0.74596198609



In [38]:

    
print (metrics.confusion_matrix(y_test, predicted))
print (metrics.classification_report(y_test, predicted))









    



[[1169  134]
 [ 382  225]]
             precision    recall  f1-score   support

        0.0       0.75      0.90      0.82      1303
        1.0       0.63      0.37      0.47       607

avg / total       0.71      0.73      0.71      1910

Model Evaluation Using Cross-Validation



In [40]:

    
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print (scores)
print (scores.mean())









    



[ 0.72100313  0.70219436  0.73824451  0.70597484  0.70597484  0.72955975
  0.7327044   0.70440252  0.75157233  0.75      ]
0.724163068551



In [41]:

    
model.predict_proba(np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 3, 25, 3, 1, 4,
                              16]))









    



C:\Users\SHAIK NAWAAZ\Anaconda3\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)






    Out[41]:





array([[ 0.7747227,  0.2252773]])

	rate_marriage	age	yrs_married	children	religious	educ	occupation	occupation_husb	affairs
affair
0	4.329701	28.390679	7.989335	1.238813	2.504521	14.322977	3.405286	3.833758	0.000000
1	3.647345	30.537019	11.152460	1.728933	2.261568	13.972236	3.463712	3.884559	2.187243

	age	yrs_married	children	religious	educ	occupation	occupation_husb	affairs	affair
rate_marriage
1.0	33.823232	13.914141	2.308081	2.343434	13.848485	3.232323	3.838384	1.201671	0.747475
2.0	30.471264	10.727011	1.735632	2.330460	13.864943	3.327586	3.764368	1.615745	0.635057
3.0	30.008056	10.239174	1.638469	2.308157	14.001007	3.402820	3.798590	1.371281	0.550856
4.0	28.856601	8.816905	1.369536	2.400981	14.144514	3.420161	3.835861	0.674837	0.322926
5.0	28.574702	8.311662	1.252794	2.506334	14.399776	3.454918	3.892697	0.348174	0.181446