## Dataset:

The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs. More information about the study is available in a 1978 paper from the Journal of Political Economy.

## Description of Variables

The dataset contains 6366 observations of 9 variables:

• rate_marriage: woman's rating of her marriage (1 = very poor, 5 = very good)
• age: woman's age
• yrs_married: number of years married
• children: number of children
• religious: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
• educ: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
• occupation: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
• occupation_husb: husband's occupation (same coding as above)
• affairs: time spent in extra-marital affairs
``````

In [10]:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.cross_validation import cross_val_score

``````
``````

In [12]:

# add "affair" column: 1 represents having affairs, 0 represents not
dta['affair'] = (dta.affairs > 0).astype(int)

dta['affair']

``````
``````

Out[12]:

0       1
1       1
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
14      1
15      1
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      1
25      1
26      1
27      1
28      1
29      1
..
6336    0
6337    0
6338    0
6339    0
6340    0
6341    0
6342    0
6343    0
6344    0
6345    0
6346    0
6347    0
6348    0
6349    0
6350    0
6351    0
6352    0
6353    0
6354    0
6355    0
6356    0
6357    0
6358    0
6359    0
6360    0
6361    0
6362    0
6363    0
6364    0
6365    0
Name: affair, Length: 6366, dtype: int32

``````

### Data Exploration

``````

In [13]:

dta.groupby('affair').mean()

# on average, women who have affairs rate their marriages lower, which is to be expected

``````
``````

Out[13]:

text-align: right;
}

text-align: left;
}

.dataframe tbody tr th {
vertical-align: top;
}

rate_marriage
age
yrs_married
children
religious
educ
occupation
occupation_husb
affairs

affair

0
4.329701
28.390679
7.989335
1.238813
2.504521
14.322977
3.405286
3.833758
0.000000

1
3.647345
30.537019
11.152460
1.728933
2.261568
13.972236
3.463712
3.884559
2.187243

``````
``````

In [14]:

dta.groupby('rate_marriage').mean()

``````
``````

Out[14]:

text-align: right;
}

text-align: left;
}

.dataframe tbody tr th {
vertical-align: top;
}

age
yrs_married
children
religious
educ
occupation
occupation_husb
affairs
affair

rate_marriage

1.0
33.823232
13.914141
2.308081
2.343434
13.848485
3.232323
3.838384
1.201671
0.747475

2.0
30.471264
10.727011
1.735632
2.330460
13.864943
3.327586
3.764368
1.615745
0.635057

3.0
30.008056
10.239174
1.638469
2.308157
14.001007
3.402820
3.798590
1.371281
0.550856

4.0
28.856601
8.816905
1.369536
2.400981
14.144514
3.420161
3.835861
0.674837
0.322926

5.0
28.574702
8.311662
1.252794
2.506334
14.399776
3.454918
3.892697
0.348174
0.181446

``````

### Data Visualization

``````

In [15]:

# show plots in the notebook
%matplotlib inline

``````
``````

In [16]:

#histograms of education and marriage rating.

dta.educ.hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')

``````
``````

Out[16]:

<matplotlib.text.Text at 0x25940bb3b00>

``````
``````

In [17]:

dta.rate_marriage.hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')

``````
``````

Out[17]:

<matplotlib.text.Text at 0x25940d9ecf8>

``````
``````

In [18]:

# look at the distribution of marriage ratings for those having affairs versus those not having affairs.

# barplot of marriage rating grouped by affair (True or False)
pd.crosstab(dta.rate_marriage, dta.affair.astype(bool)).plot(kind='bar')
plt.title('Marriage Rating Distribution by Affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')

``````
``````

Out[18]:

<matplotlib.text.Text at 0x25940bed470>

``````
``````

In [ ]:

``````
``````

In [23]:

# create dataframes with an intercept column and dummy variables for
# occupation and occupation_husb
y, X = dmatrices('affair ~ rate_marriage + age + yrs_married + children + \
religious + educ + C(occupation) + C(occupation_husb)',
dta, return_type="dataframe")
print( X.columns)

``````
``````

Index(['Intercept', 'C(occupation)[T.2.0]', 'C(occupation)[T.3.0]',
'C(occupation)[T.4.0]', 'C(occupation)[T.5.0]', 'C(occupation)[T.6.0]',
'C(occupation_husb)[T.2.0]', 'C(occupation_husb)[T.3.0]',
'C(occupation_husb)[T.4.0]', 'C(occupation_husb)[T.5.0]',
'C(occupation_husb)[T.6.0]', 'rate_marriage', 'age', 'yrs_married',
'children', 'religious', 'educ'],
dtype='object')

``````
``````

In [24]:

# fix column names of X
X = X.rename(columns = {'C(occupation)[T.2.0]':'occ_2',
'C(occupation)[T.3.0]':'occ_3',
'C(occupation)[T.4.0]':'occ_4',
'C(occupation)[T.5.0]':'occ_5',
'C(occupation)[T.6.0]':'occ_6',
'C(occupation_husb)[T.2.0]':'occ_husb_2',
'C(occupation_husb)[T.3.0]':'occ_husb_3',
'C(occupation_husb)[T.4.0]':'occ_husb_4',
'C(occupation_husb)[T.5.0]':'occ_husb_5',
'C(occupation_husb)[T.6.0]':'occ_husb_6'})

``````
##### need to flatten y into a 1-D array, so that scikit-learn will properly understand it as the response variable.
``````

In [26]:

# flatten y into a 1-D array
y = np.ravel(y)
y

``````
``````

Out[26]:

array([ 1.,  1.,  1., ...,  0.,  0.,  0.])

``````

#### Logistic Regression

``````

In [27]:

# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)

``````
``````

Out[27]:

0.72588752748978946

``````
``````

In [30]:

y.mean()

``````
``````

Out[30]:

0.32249450204209867

``````

#### Model Evaluation Using a Validation Set

``````

In [32]:

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)

``````
``````

Out[32]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

``````
``````

In [34]:

# predict class labels for the test set
predicted = model2.predict(X_test)
print (predicted)

``````
``````

[ 1.  0.  0. ...,  0.  0.  0.]

``````
``````

In [36]:

# generate class probabilities
probs = model2.predict_proba(X_test)
print (probs)

``````
``````

[[ 0.35142488  0.64857512]
[ 0.90952576  0.09047424]
[ 0.72576603  0.27423397]
...,
[ 0.55736751  0.44263249]
[ 0.81213933  0.18786067]
[ 0.74729595  0.25270405]]

``````

### generate some evaluation metrics.

``````

In [37]:

print (metrics.accuracy_score(y_test, predicted))
print (metrics.roc_auc_score(y_test, probs[:, 1]))

``````
``````

0.729842931937
0.74596198609

``````
``````

In [38]:

print (metrics.confusion_matrix(y_test, predicted))
print (metrics.classification_report(y_test, predicted))

``````
``````

[[1169  134]
[ 382  225]]
precision    recall  f1-score   support

0.0       0.75      0.90      0.82      1303
1.0       0.63      0.37      0.47       607

avg / total       0.71      0.73      0.71      1910

``````

### Model Evaluation Using Cross-Validation

``````

In [40]:

# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print (scores)
print (scores.mean())

``````
``````

[ 0.72100313  0.70219436  0.73824451  0.70597484  0.70597484  0.72955975
0.7327044   0.70440252  0.75157233  0.75      ]
0.724163068551

``````
``````

In [41]:

model.predict_proba(np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 3, 25, 3, 1, 4,
16]))

``````
``````

C:\Users\SHAIK NAWAAZ\Anaconda3\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)

Out[41]:

array([[ 0.7747227,  0.2252773]])

``````