This is an example of logistic regression in Python with the scikit-learn module, performed for an assignment with my General Assembly Data Science class.

The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs. More information about the study is available in a 1978 paper from the Journal of Political Economy.

The dataset contains 6366 observations of 9 variables:

- `rate_marriage`: woman's rating of her marriage (1 = very poor, 5 = very good)
- `age`: woman's age
- `yrs_married`: number of years married
- `children`: number of children
- `religious`: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
- `educ`: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
- `occupation`: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
- `occupation_husb`: husband's occupation (same coding as above)
- `affairs`: time spent in extra-marital affairs

I decided to treat this as a classification problem by creating a new binary variable `affair` (did the woman have at least one affair?) and trying to predict the classification for each woman.

Skipper Seabold, one of the primary contributors to Statsmodels, did a similar classification in his Statsmodels demo at a Statistical Programming DC Meetup. However, he used Statsmodels for the classification (whereas I'm using scikit-learn), and he treated the occupation variables as continuous (whereas I'm treating them as categorical).

In [1]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
```

In [2]:

```
# load dataset
dta = sm.datasets.fair.load_pandas().data

# add "affair" column: 1 represents having affairs, 0 represents not
dta['affair'] = (dta.affairs > 0).astype(int)
```
In [3]:

```
dta.groupby('affair').mean()
```

Let's take a closer look at the `rate_marriage` variable.

In [4]:

```
dta.groupby('rate_marriage').mean()
```

An increase in `age`, `yrs_married`, and `children` appears to correlate with a declining marriage rating.

In [5]:

```
# show plots in the notebook
%matplotlib inline
```

Let's start with histograms of education and marriage rating.

In [6]:

```
# histogram of education
dta.educ.hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')
```

In [7]:

```
# histogram of marriage rating
dta.rate_marriage.hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
```

In [8]:

```
# barplot of marriage rating grouped by affair (True or False)
pd.crosstab(dta.rate_marriage, dta.affair.astype(bool)).plot(kind='bar')
plt.title('Marriage Rating Distribution by Affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
```

In [9]:

```
# stacked barplot of affair percentage by years married
affair_yrs_married = pd.crosstab(dta.yrs_married, dta.affair.astype(bool))
affair_yrs_married.div(affair_yrs_married.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Affair Percentage by Years Married')
plt.xlabel('Years Married')
plt.ylabel('Percentage')
```

To prepare the data, I want to add an intercept column as well as dummy variables for `occupation` and `occupation_husb`, since I'm treating them as categorical variables. The `dmatrices` function from the patsy module can do that using formula language.

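As a quick illustration of what `C()` does, here is patsy's treatment (dummy) coding on a tiny hypothetical frame — the first category becomes the reference level, and the remaining levels get `T.`-prefixed indicator columns:

```python
import pandas as pd
from patsy import dmatrices

# tiny hypothetical frame with a 3-level categorical predictor
df = pd.DataFrame({'y': [0, 1, 0, 1, 1, 0],
                   'g': [1, 2, 3, 1, 2, 3]})

# C(g) expands g into indicator columns, dropping the reference level (1)
y_demo, X_demo = dmatrices('y ~ C(g)', df, return_type='dataframe')
print(X_demo.columns.tolist())
# ['Intercept', 'C(g)[T.2]', 'C(g)[T.3]']
```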
In [10]:

```
# create dataframes with an intercept column and dummy variables for
# occupation and occupation_husb
y, X = dmatrices('affair ~ rate_marriage + age + yrs_married + children + \
                  religious + educ + C(occupation) + C(occupation_husb)',
                 dta, return_type="dataframe")
print(X.columns)
```

The column names for the dummy variables are ugly, so let's rename those.

In [11]:

```
# fix column names of X
X = X.rename(columns={'C(occupation)[T.2.0]': 'occ_2',
                      'C(occupation)[T.3.0]': 'occ_3',
                      'C(occupation)[T.4.0]': 'occ_4',
                      'C(occupation)[T.5.0]': 'occ_5',
                      'C(occupation)[T.6.0]': 'occ_6',
                      'C(occupation_husb)[T.2.0]': 'occ_husb_2',
                      'C(occupation_husb)[T.3.0]': 'occ_husb_3',
                      'C(occupation_husb)[T.4.0]': 'occ_husb_4',
                      'C(occupation_husb)[T.5.0]': 'occ_husb_5',
                      'C(occupation_husb)[T.6.0]': 'occ_husb_6'})
```

We also need to flatten `y` into a 1-D array, so that scikit-learn will properly understand it as the response variable.

In [12]:

```
# flatten y into a 1-D array
y = np.ravel(y)
```
In [13]:

```
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)
```

73% accuracy seems good, but what's the null error rate?

In [14]:

```
# what percentage had affairs?
y.mean()
```

Only 32% of the women had affairs, which means that you could obtain 68% accuracy by always predicting "no". So we're doing better than the null error rate, but not by much.
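That 68% figure is the null accuracy: the score you would get by always predicting the majority class. A minimal sketch with a hypothetical label vector mirroring the 32% affair rate:

```python
import numpy as np

# hypothetical labels with a 32% positive rate, mirroring the affairs data
y_demo = np.array([1] * 32 + [0] * 68)

# null accuracy: always predict the majority class ("no affair")
null_accuracy = max(y_demo.mean(), 1 - y_demo.mean())
print(null_accuracy)  # 0.68
```

scikit-learn's `DummyClassifier(strategy='most_frequent')` computes the same baseline.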

Let's examine the coefficients to see what we learn.

In [15]:

```
# examine the coefficients, paired with their column names
pd.DataFrame(list(zip(X.columns, np.transpose(model.coef_))))
```

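Logistic regression coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. A sketch on a toy fit (the data here is synthetic, not the affairs model above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data: feature 0 pushes toward class 1, feature 1 pushes away
rng = np.random.RandomState(1)
X_toy = rng.randn(200, 2)
y_toy = (X_toy[:, 0] - X_toy[:, 1] + rng.randn(200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# each exponentiated coefficient is the multiplicative change in the odds
# of the positive class for a one-unit increase in that feature
odds_ratios = np.exp(clf.coef_[0])
print(odds_ratios)  # feature 0 above 1 (raises odds), feature 1 below 1 (lowers odds)
```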
In [16]:

```
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
```

In [17]:

```
# predict class labels for the test set
predicted = model2.predict(X_test)
print(predicted)
```

In [18]:

```
# generate class probabilities
probs = model2.predict_proba(X_test)
print(probs)
```

As you can see, the classifier is predicting a 1 (having an affair) any time the probability in the second column is greater than 0.5.
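That 0.5 cutoff is just the default behavior of `predict`. A minimal sketch on synthetic data (not the affairs data) confirming that `predict` agrees with manually thresholding the second column of `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: one feature, binary labels
rng = np.random.RandomState(0)
X_toy = rng.randn(100, 1)
y_toy = (X_toy[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# predict() labels vs. thresholding the class-1 probability at 0.5
labels = clf.predict(X_toy)
manual = (clf.predict_proba(X_toy)[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, manual))  # the two agree
```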

Now let's generate some evaluation metrics.

In [19]:

```
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))
print(metrics.roc_auc_score(y_test, probs[:, 1]))
```

The accuracy is 73%, matching what we saw when we trained and predicted on the same data.

We can also see the confusion matrix and a classification report with other metrics.

In [20]:

```
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))
```

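The numbers in the classification report all derive from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. A small sketch with made-up labels showing accuracy, precision, and recall computed by hand:

```python
import numpy as np
from sklearn import metrics

# made-up labels for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = metrics.confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # of predicted positives, how many are real
recall = tp / (tp + fn)                      # of real positives, how many were found

print(accuracy, precision, recall)  # 0.8 0.75 0.75
```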
In [21]:

```
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
```

Looks good. It's still performing at 73% accuracy.
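Under the hood, `cross_val_score` on a classifier splits the data into stratified folds, fits on each training split, and scores on the held-out fold. A sketch of the equivalent manual loop on toy data (not the affairs data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_toy, y_toy = make_classification(n_samples=300, random_state=0)

# manual 10-fold loop: fit on each training split, score on the held-out fold
skf = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    fold_model = LogisticRegression().fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))

# cross_val_score with the same splitter produces the same fold scores
auto_scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=skf)
print(np.allclose(manual_scores, auto_scores))
```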

Just for fun, let's predict the probability of an affair for a random woman not present in the dataset. She's a 25-year-old teacher who graduated college, has been married for 3 years, has 1 child, rates herself as strongly religious, rates her marriage as fair, and her husband is a farmer.

In [22]:

```
# column order matches X: Intercept, occ_2..occ_6, occ_husb_2..occ_husb_6,
# rate_marriage, age, yrs_married, children, religious, educ
model.predict_proba(np.array([[1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
                               3, 25, 3, 1, 4, 16]]))
```

The predicted probability of an affair is 23%.
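Building that input array by hand is fragile, since the 17 numbers must line up with `X.columns` exactly. A less error-prone sketch is to construct the observation as a one-row DataFrame keyed by column name (the names below assume the renaming cell above), then reorder it to match the training columns before predicting:

```python
import pandas as pd

# hypothetical single observation, keyed by column name instead of position
new_woman = pd.DataFrame([{
    'Intercept': 1.0,
    'occ_2': 0, 'occ_3': 0, 'occ_4': 1, 'occ_5': 0, 'occ_6': 0,  # teacher
    'occ_husb_2': 1, 'occ_husb_3': 0, 'occ_husb_4': 0,
    'occ_husb_5': 0, 'occ_husb_6': 0,                            # farmer
    'rate_marriage': 3, 'age': 25, 'yrs_married': 3,
    'children': 1, 'religious': 4, 'educ': 16,
}])

# align to the training column order, then predict, e.g.:
# model.predict_proba(new_woman[X.columns])
print(new_woman.shape)  # (1, 17)
```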