In this exercise you will work with the affairs dataset, based on a 1974 survey in which women were asked whether they had had extramarital affairs.
To work with the dataset correctly, we split the data into training data (X_train and y_train) and test data (X_test and y_test).
We ask you to:
1) Build a binary classifier trained on the training data, and compute its classification accuracy on that training data.
2) Test the classification accuracy of the model you just trained on the test data.
3) Create a new sample modeling a virtual surveyed woman (you can set its parameters randomly) and see whether your model predicts that she would have an affair.
Consider plotting and printing the data along the way to get a feel for what you are looking at.
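As a hint for the kind of inspection suggested above, here is a minimal sketch of two helpers: one reports the share of each class in the labels, the other overlays per-class histograms of a single feature. The helper names `class_balance` and `plot_feature_by_class`, and the toy column `feature_a`, are made up for illustration and are not part of the dataset.

```python
import pandas as pd
from matplotlib import pyplot as plt


def class_balance(y):
    """Return the fraction of samples in each class, ordered by label."""
    return y.value_counts(normalize=True).sort_index()


def plot_feature_by_class(X, y, column, bins=10):
    """Overlay histograms of one feature, one histogram per class label."""
    fig, ax = plt.subplots()
    for label in sorted(y.unique()):
        ax.hist(X.loc[y == label, column], bins=bins, alpha=0.5,
                label=f"class {label}")
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    ax.legend()
    return ax


# Tiny made-up data so the sketch runs on its own;
# "feature_a" is a hypothetical column name.
X_toy = pd.DataFrame({"feature_a": [1.0, 2.0, 2.5, 4.0, 5.0, 5.5]})
y_toy = pd.Series([0, 0, 0, 1, 1, 0])

balance = class_balance(y_toy)
print(balance)
```

Once the data is loaded and split, the same helpers could be called as `class_balance(y_train)` and `plot_feature_by_class(X_train, y_train, column)` with any column of `X_train`.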
In [17]:
%matplotlib inline
import pandas as pd #used for reading/writing data
import numpy as np #numerical library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library
from sklearn.model_selection import train_test_split #creation of train/test sets
#loading and splitting the data into train/test sets
data = pd.read_csv('data/affairs_dataset/fair.csv', sep=',')
y = (data.affairs > 0).astype(int)
X = data.drop('affairs', axis=1)
#split the data into train and test sets, with a 70-30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.describe()
Out[17]:
Your code starts here...
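One possible starting point, shown as a sketch rather than the expected solution: logistic regression is assumed here as the binary classifier (any scikit-learn classifier would fit the same pattern), with feature standardization added as a further assumption. The helper name `fit_and_score` is hypothetical, and the sketch runs on synthetic stand-in data so that it executes on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit a classifier; return (model, training accuracy, test accuracy)."""
    # Standardize features, then fit a logistic regression (an assumed choice).
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model, model.score(X_train, y_train), model.score(X_test, y_test)


# Synthetic stand-in data (not the affairs dataset): 200 samples, 3 features,
# with the label determined by a linear rule on the first two features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

model, train_acc, test_acc = fit_and_score(
    X_demo[:150], y_demo[:150], X_demo[150:], y_demo[150:]
)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

# A new sample must have the same number and order of features as the training data.
new_sample = rng.normal(size=(1, 3))
predicted_class = model.predict(new_sample)[0]
probability_of_1 = model.predict_proba(new_sample)[0, 1]
print("predicted class:", predicted_class)
print("estimated probability of class 1:", probability_of_1)
```

In this notebook the same function would be called as `fit_and_score(X_train, y_train, X_test, y_test)`, and the new sample would be a one-row DataFrame with the same columns as `X_train`. A predicted class of 1 corresponds to `affairs > 0` in the labels defined above.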
In [ ]: