Logistic Regression binary classification exercise

In this exercise you will work with the affairs dataset, based on a 1974 survey in which women were asked whether they had had extramarital affairs.

To work with the dataset correctly, we split the data into training data (the X_train and y_train matrices) and test data (X_test and y_test).

We ask you to:

1) Build a binary classifier trained on the training data, and compute its classification accuracy on that data.

2) Compute the classification accuracy of the model you just trained on the test data.

3) Create a new sample modeling a virtual surveyed woman (you can set its parameters randomly) and see whether the model predicts that she would have an affair.

Consider plotting and printing the data along the way to get a feel for what you are looking at.
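A minimal sketch of the kind of inspection suggested above, assuming a pandas DataFrame that uses the dataset's column names. The small `demo` frame holds made-up stand-in rows (not survey data) so that the snippet runs on its own; in the notebook you would use the `data` frame loaded from the CSV instead.

```python
import pandas as pd

# Made-up stand-in rows (NOT real survey data); use `data` in the notebook.
demo = pd.DataFrame({
    "rate_marriage": [5, 4, 3, 2, 5, 1],
    "yrs_married":   [2.5, 6.0, 16.5, 23.0, 0.5, 9.0],
    "affairs":       [0.0, 0.0, 1.2, 3.5, 0.0, 0.7],
})

# Binary target, built the same way as in the exercise setup.
demo["had_affair"] = (demo.affairs > 0).astype(int)

# Class balance: how many samples fall into each class.
class_counts = demo["had_affair"].value_counts().sort_index()
print(class_counts)

# Per-class mean of each feature, a quick hint at which features differ.
group_means = demo.groupby("had_affair")[["rate_marriage", "yrs_married"]].mean()
print(group_means)
```

In the notebook, `group_means.plot(kind='bar')` would draw the same comparison as a bar chart.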



In [17]:

    
%matplotlib inline

import pandas as pd  # used for reading/writing data
import numpy as np  # numeric library
from matplotlib import pyplot as plt  # used for plotting
import sklearn  # machine learning library
from sklearn.model_selection import train_test_split  # creation of train/test sets

#loading and splitting the data into train/test sets
data = pd.read_csv('data/affairs_dataset/fair.csv', sep=',')
y = (data.affairs > 0).astype(int)
X = data.drop('affairs', axis=1)

#split the data into train and test sets, with a 70-30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.describe()









Out[17]:

       rate_marriage          age  yrs_married     children    religious        educ   occupation  occupation_husb
count    4456.000000  4456.000000  4456.000000  4456.000000  4456.000000  4456.00000  4456.000000      4456.000000
mean        4.116248    29.098070     9.017504     1.398900     2.414946    14.23070     3.419659         3.865575
std         0.959071     6.788485     7.226163     1.427461     0.877166     2.19748     0.935788         1.344939
min         1.000000    17.500000     0.500000     0.000000     1.000000     9.00000     1.000000         1.000000
25%         4.000000    22.000000     2.500000     0.000000     2.000000    12.00000     3.000000         3.000000
50%         4.000000    27.000000     6.000000     1.000000     2.000000    14.00000     3.000000         4.000000
75%         5.000000    32.000000    16.500000     2.000000     3.000000    16.00000     4.000000         5.000000
max         5.000000    42.000000    23.000000     5.500000     4.000000    20.00000     6.000000         6.000000

Your code starts here...
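One possible way to approach the three tasks, sketched with scikit-learn's `LogisticRegression`. To keep the snippet runnable on its own it builds a synthetic stand-in dataset: `make_demo_data` is a hypothetical helper, and its values and labelling rule are arbitrary assumptions, not survey data. In the notebook you would skip that part and use the `X_train`, `X_test`, `y_train` and `y_test` created above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def make_demo_data(n=500, seed=0):
    """Hypothetical helper: random stand-in data using some of the dataset's column names."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame({
        "rate_marriage": rng.integers(1, 6, n).astype(float),
        "age": rng.uniform(17.5, 42.0, n),
        "yrs_married": rng.uniform(0.5, 23.0, n),
        "religious": rng.integers(1, 5, n).astype(float),
    })
    # Arbitrary rule (an assumption for the demo) so the label depends on the features.
    score = (-0.8 * (X.rate_marriage - 4)
             + 0.1 * (X.yrs_married - 9)
             - 0.4 * (X.religious - 2.4))
    y = (score + rng.normal(0, 1, n) > 0.5).astype(int)
    return X, y


X, y = make_demo_data()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 1) Fit the classifier on the training data and measure training accuracy.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
print("train accuracy:", train_acc)

# 2) Measure accuracy on the held-out test data.
test_acc = model.score(X_test, y_test)
print("test accuracy:", test_acc)

# 3) Build one new sample with the same columns and ask the model about it.
new_woman = pd.DataFrame([{
    "rate_marriage": 3.0, "age": 27.0, "yrs_married": 6.0, "religious": 2.0,
}])[list(X.columns)]
pred = model.predict(new_woman)[0]            # 0 = no affair, 1 = affair
proba = model.predict_proba(new_woman)[0, 1]  # estimated probability of class 1
print("prediction:", pred, "probability of affair:", proba)
```

Comparing the two accuracies with the share of the majority class in `y_train` gives a sense of whether the model does better than always predicting the most common answer.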



In [ ]:
