This is an example of logistic regression in Python with the scikit-learn module, performed for an assignment with my General Assembly Data Science class.

The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs. More information about the study is available in a 1978 paper from the Journal of Political Economy.

The dataset contains 6366 observations of 9 variables:

- `rate_marriage`: woman's rating of her marriage (1 = very poor, 5 = very good)
- `age`: woman's age
- `yrs_married`: number of years married
- `children`: number of children
- `religious`: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
- `educ`: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
- `occupation`: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
- `occupation_husb`: husband's occupation (same coding as above)
- `affairs`: time spent in extra-marital affairs

I decided to treat this as a classification problem by creating a new binary variable `affair` (did the woman have at least one affair?) and trying to predict the classification for each woman.

Skipper Seabold, one of the primary contributors to Statsmodels, did a similar classification in his Statsmodels demo at a Statistical Programming DC Meetup. However, he used Statsmodels for the classification (whereas I'm using scikit-learn), and he treated the occupation variables as continuous (whereas I'm treating them as categorical).

In [1]:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
```

In [2]:

```
# load dataset
dta = sm.datasets.fair.load_pandas().data

# add "affair" column: 1 represents having affairs, 0 represents not
dta['affair'] = (dta.affairs > 0).astype(int)
```
In [3]:

```
dta.groupby('affair').mean()
```

Let's take a closer look at the `rate_marriage` variable.

In [4]:

```
dta.groupby('rate_marriage').mean()
```

An increase in `age`, `yrs_married`, and `children` appears to correlate with a declining marriage rating.

In [5]:

```
# show plots in the notebook
%matplotlib inline
```

Let's start with histograms of education and marriage rating.

In [6]:

```
# histogram of education
dta.educ.hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')
```

In [7]:

```
# histogram of marriage rating
dta.rate_marriage.hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
```

In [8]:

```
# barplot of marriage rating grouped by affair (True or False)
pd.crosstab(dta.rate_marriage, dta.affair.astype(bool)).plot(kind='bar')
plt.title('Marriage Rating Distribution by Affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
```

In [9]:

```
# stacked barplot of affair percentage by years married
affair_yrs_married = pd.crosstab(dta.yrs_married, dta.affair.astype(bool))
affair_yrs_married.div(affair_yrs_married.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Affair Percentage by Years Married')
plt.xlabel('Years Married')
plt.ylabel('Percentage')
```

To prepare the data, I want to add an intercept column as well as dummy variables for `occupation` and `occupation_husb`, since I'm treating them as categorical variables. The `dmatrices` function from the patsy module can do that using formula language.

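As a quick illustration of what `C()` does, here is patsy's treatment (dummy) coding on a tiny hypothetical frame — the first category becomes the reference level, and the remaining levels get `T.`-prefixed indicator columns:

```python
import pandas as pd
from patsy import dmatrices

# tiny hypothetical frame with a 3-level categorical predictor
df = pd.DataFrame({'y': [0, 1, 0, 1, 1, 0],
                   'g': [1, 2, 3, 1, 2, 3]})

# C(g) expands g into indicator columns, dropping the reference level (1)
y_demo, X_demo = dmatrices('y ~ C(g)', df, return_type='dataframe')
print(X_demo.columns.tolist())
# ['Intercept', 'C(g)[T.2]', 'C(g)[T.3]']
```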
In [10]:

```
# create dataframes with an intercept column and dummy variables for
# occupation and occupation_husb
y, X = dmatrices('affair ~ rate_marriage + age + yrs_married + children + \
                  religious + educ + C(occupation) + C(occupation_husb)',
                 dta, return_type="dataframe")
print(X.columns)
```

The column names for the dummy variables are ugly, so let's rename those.

In [11]:

```
# fix column names of X
X = X.rename(columns={'C(occupation)[T.2.0]': 'occ_2',
                      'C(occupation)[T.3.0]': 'occ_3',
                      'C(occupation)[T.4.0]': 'occ_4',
                      'C(occupation)[T.5.0]': 'occ_5',
                      'C(occupation)[T.6.0]': 'occ_6',
                      'C(occupation_husb)[T.2.0]': 'occ_husb_2',
                      'C(occupation_husb)[T.3.0]': 'occ_husb_3',
                      'C(occupation_husb)[T.4.0]': 'occ_husb_4',
                      'C(occupation_husb)[T.5.0]': 'occ_husb_5',
                      'C(occupation_husb)[T.6.0]': 'occ_husb_6'})
```

We also need to flatten `y` into a 1-D array, so that scikit-learn will properly understand it as the response variable.

In [12]:

```
# flatten y into a 1-D array
y = np.ravel(y)
```
In [13]:

```
# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, y)

# check the accuracy on the training set
model.score(X, y)
```

73% accuracy seems good, but what's the null error rate?

In [14]:

```
# what percentage had affairs?
y.mean()
```

Only 32% of the women had affairs, which means that you could obtain 68% accuracy by always predicting "no". So we're doing better than the null error rate, but not by much.
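That 68% figure is the null accuracy: the score you would get by always predicting the majority class. A minimal sketch with a hypothetical label vector mirroring the 32% affair rate:

```python
import numpy as np

# hypothetical labels with a 32% positive rate, mirroring the affairs data
y_demo = np.array([1] * 32 + [0] * 68)

# null accuracy: always predict the majority class ("no affair")
null_accuracy = max(y_demo.mean(), 1 - y_demo.mean())
print(null_accuracy)  # 0.68
```

scikit-learn's `DummyClassifier(strategy='most_frequent')` computes the same baseline.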

Let's examine the coefficients to see what we learn.

In [15]:

```
# examine the coefficients, paired with their column names
pd.DataFrame(list(zip(X.columns, np.transpose(model.coef_))))
```

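Logistic regression coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. A sketch on a toy fit (the data here is synthetic, not the affairs model above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data: feature 0 pushes toward class 1, feature 1 pushes away
rng = np.random.RandomState(1)
X_toy = rng.randn(200, 2)
y_toy = (X_toy[:, 0] - X_toy[:, 1] + rng.randn(200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# each exponentiated coefficient is the multiplicative change in the odds
# of the positive class for a one-unit increase in that feature
odds_ratios = np.exp(clf.coef_[0])
print(odds_ratios)  # feature 0 above 1 (raises odds), feature 1 below 1 (lowers odds)
```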
In [16]:

```
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
```

In [17]:

```
# predict class labels for the test set
predicted = model2.predict(X_test)
print(predicted)
```

In [18]:

```
# generate class probabilities
probs = model2.predict_proba(X_test)
print(probs)
```

As you can see, the classifier is predicting a 1 (having an affair) any time the probability in the second column is greater than 0.5.
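That 0.5 cutoff is just the default behavior of `predict`. A minimal sketch on synthetic data (not the affairs data) confirming that `predict` agrees with manually thresholding the second column of `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: one feature, binary labels
rng = np.random.RandomState(0)
X_toy = rng.randn(100, 1)
y_toy = (X_toy[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# predict() labels vs. thresholding the class-1 probability at 0.5
labels = clf.predict(X_toy)
manual = (clf.predict_proba(X_toy)[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, manual))  # the two agree
```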

Now let's generate some evaluation metrics.

In [19]:

```
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))
print(metrics.roc_auc_score(y_test, probs[:, 1]))
```

The accuracy is 73%, matching what we saw when we trained and predicted on the same data.

We can also see the confusion matrix and a classification report with other metrics.

In [20]:

```
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))
```

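The numbers in the classification report all derive from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. A small sketch with made-up labels showing accuracy, precision, and recall computed by hand:

```python
import numpy as np
from sklearn import metrics

# made-up labels for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = metrics.confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # of predicted positives, how many are real
recall = tp / (tp + fn)                      # of real positives, how many were found

print(accuracy, precision, recall)  # 0.8 0.75 0.75
```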
In [21]:

```
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
```

Looks good. It's still performing at 73% accuracy.
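Under the hood, `cross_val_score` on a classifier splits the data into stratified folds, fits on each training split, and scores on the held-out fold. A sketch of the equivalent manual loop on toy data (not the affairs data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_toy, y_toy = make_classification(n_samples=300, random_state=0)

# manual 10-fold loop: fit on each training split, score on the held-out fold
skf = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    fold_model = LogisticRegression().fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))

# cross_val_score with the same splitter produces the same fold scores
auto_scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=skf)
print(np.allclose(manual_scores, auto_scores))
```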

Just for fun, let's predict the probability of an affair for a random woman not present in the dataset. She's a 25-year-old teacher who graduated college, has been married for 3 years, has 1 child, rates herself as strongly religious, rates her marriage as fair, and her husband is a farmer.

In [22]:

```
# column order matches X: Intercept, occ_2..occ_6, occ_husb_2..occ_husb_6,
# rate_marriage, age, yrs_married, children, religious, educ
model.predict_proba(np.array([[1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
                               3, 25, 3, 1, 4, 16]]))
```

The predicted probability of an affair is 23%.
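Building that input array by hand is fragile, since the 17 numbers must line up with `X.columns` exactly. A less error-prone sketch is to construct the observation as a one-row DataFrame keyed by column name (the names below assume the renaming cell above), then reorder it to match the training columns before predicting:

```python
import pandas as pd

# hypothetical single observation, keyed by column name instead of position
new_woman = pd.DataFrame([{
    'Intercept': 1.0,
    'occ_2': 0, 'occ_3': 0, 'occ_4': 1, 'occ_5': 0, 'occ_6': 0,  # teacher
    'occ_husb_2': 1, 'occ_husb_3': 0, 'occ_husb_4': 0,
    'occ_husb_5': 0, 'occ_husb_6': 0,                            # farmer
    'rate_marriage': 3, 'age': 25, 'yrs_married': 3,
    'children': 1, 'religious': 4, 'educ': 16,
}])

# align to the training column order, then predict, e.g.:
# model.predict_proba(new_woman[X.columns])
print(new_woman.shape)  # (1, 17)
```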