Logistic Regression binary classification exercise

In this exercise you will work with the affairs dataset, which is based on a 1974 survey in which women were asked whether they had had extramarital affairs.

To evaluate the model properly, we split the dataset into training data (X_train and y_train) and test data (X_test and y_test).

We ask you to:

1) Build a binary classifier, train it on the training data, and compute its classification accuracy on that same training data.

2) Compute the classification accuracy of the model you just trained on the test data.

3) Create a new sample representing a hypothetical surveyed woman (you can set its feature values arbitrarily) and use the model to predict whether she would have an affair.

Consider plotting and printing the data along the way to get a feel for what you are working with.
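As one possible starting point for that exploration, here is a minimal sketch that reports the share of each label and the mean of every feature within each class. The helper name `summarize_by_label` is hypothetical and not part of the exercise; it only assumes that the features are a pandas DataFrame and the labels a pandas Series sharing the same index.

```python
import pandas as pd


def summarize_by_label(X, y):
    """Return (class proportions, per-class feature means).

    Hypothetical helper for illustration; X is a DataFrame of features
    and y is a Series of 0/1 labels aligned on the same index.
    """
    # Share of each label; a strong imbalance makes raw accuracy harder to interpret.
    balance = y.value_counts(normalize=True).sort_index()
    # Mean of every feature within each class.
    means = X.groupby(y).mean()
    return balance, means
```

In the notebook this could be called as `balance, means = summarize_by_label(X_train, y_train)` once the split has been made, and `X_train.hist(figsize=(10, 8))` would draw one histogram per feature.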


In [17]:
%matplotlib inline

import pandas as pd #used for reading/writing data
import numpy as np #numeric library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library
from sklearn.model_selection import train_test_split #creation of train/test sets

#loading and splitting the data into train/test sets
data = pd.read_csv('data/affairs_dataset/fair.csv', sep=',')
y = (data.affairs > 0).astype(int)
X = data.drop('affairs', axis=1)

#split the data into train and test sets, with a 70-30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.describe()


Out[17]:
       rate_marriage          age  yrs_married     children    religious        educ   occupation  occupation_husb
count    4456.000000  4456.000000  4456.000000  4456.000000  4456.000000  4456.00000  4456.000000      4456.000000
mean        4.116248    29.098070     9.017504     1.398900     2.414946    14.23070     3.419659         3.865575
std         0.959071     6.788485     7.226163     1.427461     0.877166     2.19748     0.935788         1.344939
min         1.000000    17.500000     0.500000     0.000000     1.000000     9.00000     1.000000         1.000000
25%         4.000000    22.000000     2.500000     0.000000     2.000000    12.00000     3.000000         3.000000
50%         4.000000    27.000000     6.000000     1.000000     2.000000    14.00000     3.000000         4.000000
75%         5.000000    32.000000    16.500000     2.000000     3.000000    16.00000     4.000000         5.000000
max         5.000000    42.000000    23.000000     5.500000     4.000000    20.00000     6.000000         6.000000

Your code starts here...
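The three tasks could be approached along the following lines. This is a rough sketch under stated assumptions, not a reference solution: it uses scikit-learn's `LogisticRegression`, the helper names `fit_and_score` and `predict_new_sample` are hypothetical, and `max_iter=1000` is an arbitrary choice made because the features are not scaled.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit a logistic regression; return (model, train accuracy, test accuracy)."""
    # max_iter is raised above the default as a precaution (assumption):
    # with unscaled features the solver may need more iterations to converge.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    # For classifiers, score() returns the mean accuracy on the given data.
    return model, model.score(X_train, y_train), model.score(X_test, y_test)


def predict_new_sample(model, sample, columns):
    """Return (predicted label, probability of class 1) for one sample dict."""
    # Build a one-row frame with the same column order used for training.
    row = pd.DataFrame([sample], columns=list(columns))
    label = int(model.predict(row)[0])
    proba = float(model.predict_proba(row)[0, 1])
    return label, proba
```

In the notebook, `model, train_acc, test_acc = fit_and_score(X_train, y_train, X_test, y_test)` would cover tasks 1 and 2. For task 3, pass a dictionary with one made-up value per feature column (`rate_marriage`, `age`, `yrs_married`, `children`, `religious`, `educ`, `occupation`, `occupation_husb`) together with `X_train.columns` to `predict_new_sample`. A label of 1 means the model predicts an affair.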


In [ ]: