This notebook tackles binary classification, where the output takes one of two labels, such as positive or negative (1 or 0). To give some examples of what we mean by binary: we could classify email as either spam or not spam, or tumors as either malignant or benign. In both cases we have a set of data and features, but only two possible outputs. It is possible to have more than two classes, but for now we will focus on binary classification.
To perform this classification we will use the logistic function, which is the core of logistic regression.
We need to install a new module we haven't used before: Statsmodels.
You can install it with 'pip install statsmodels', or use your package manager to install 'statsmodels', depending on your Python installation. In this lecture we will only be using a dataset from it, but it can do quite a bit, including many of the statistical computations that SciKit Learn offers.
In [1]:
# Data Imports
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
# Math
import math
# Plot imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
# Machine Learning Imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # older scikit-learn versions used sklearn.cross_validation
# For evaluating our ML results
from sklearn import metrics
In [2]:
# Dataset Import
import statsmodels.api as sm
First, let's take a look at the logistic function. The logistic function can take any input from negative infinity to positive infinity, and it always has an output between 0 and 1. The logistic function is defined as: $$ \sigma (t)= \frac{1}{1+e^{-t}}$$
A graph of the logistic function looks like this (following code):
In [4]:
# Logistic Function
def logistic(t):
    return 1.0 / (1 + math.exp(-1.0 * t))
# Set t from -10 to 10 ( 501 elements, linearly spaced)
t = np.linspace(-10,10,501)
# Set up y values (using list comprehension)
y = np.array([logistic(ele) for ele in t])
# Plot
plt.plot(t,y)
plt.title(' Logistic Function ')
Out[4]:
When we talked about Linear Regression last week, we described the linear regression model as: $$ y_i = \beta _1 x_{i1} + ... + \beta _p x_{ip}$$
which is basically an expanded linear equation (y = mx + b) over multiple x features. In the above equation we presume a data set of n units, so the data set has the form: $$ [ y_i, x_{i1},...,x_{ip}]^{n}_{i=1}$$
For our logistic function, if we view t as a linear function of a single explanatory variable x, we could express t as: $$ t = \beta _0 + \beta _1 x $$
Here, we've basically just substituted a linear function (form similar to y=mx+b) for t. We could then rewrite our logistic function equation as: $$ F(x)= \frac{1}{1+e^{-(\beta _0 + \beta _1 x)}}$$
Now we can interpret F(x) as the probability that the dependent variable is a "success" case. This is similar in spirit to the binomial distribution, in which we had successes and failures. So the formula for F(x) states that the probability of the dependent variable being a "success" equals the value of the logistic function evaluated at the linear regression expression (the linear equation we substituted for t).
Feeding the linear regression expression into the logistic function lets that expression range from negative to positive infinity, yet after the transformation the output F(x) always lies between 0 and 1.
We can now perform binary classification based on where F(x) lands: if F(x) is below 0.5 we predict one class (0), and if it is 0.5 or above we predict the other (1).
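As a minimal sketch of that decision rule (using made-up β values purely for illustration, not coefficients fit to any data), we can reuse the logistic function defined above:
In [ ]:
# Sketch: binary classification by thresholding the logistic output at 0.5.
# beta0 and beta1 are made-up illustrative values, not fitted coefficients.
beta0, beta1 = -1.0, 0.5

def predict_class(x):
    t = beta0 + beta1 * x                  # linear expression substituted for t
    return 1 if logistic(t) >= 0.5 else 0  # threshold the probability at 0.5

[predict_class(x) for x in (-4, 0, 4)]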
This is a very basic overview of binary classification using logistic regression. If you're interested in a deeper dive into the mathematics, check out these sources:
1.) Andrew Ng's class notes on Logistic Regression.
2.) CMU notes Note: Advanced math notation.
3.) Wikipedia has a very extensive look at logistic regression.
Let's put logistic regression to use for binary classification of this dataset.
The dataset is packaged within Statsmodels. It comes from a 1974 survey of women by Redbook magazine, in which married women were asked whether they had had extramarital affairs. The published work on the data set is:
Fair, Ray. 1978. “A Theory of Extramarital Affairs,” Journal of Political Economy, February, 45-61.
It is important to note that this data comes from a self-reported survey, which can raise many issues around accuracy. This analysis also isn't trying to promote any agenda concerning women or marriage; the data is simply interesting, and its accuracy should be met with a healthy dose of skepticism.
We'll set those concerns aside and focus on the logistic regression aspects of the data.
In this case we will approach this as a classification problem by asking the question:
Given certain variables for each woman, can we classify her as either having participated in an affair or not having participated in an affair?
From the Statsmodels website we have the following information about the data:
Number of observations: 6366
Number of variables: 9
Variable name definitions:
rate_marriage : How the woman rates her marriage, 1 = very poor, 2 = poor, 3 = fair, 4 = good, 5 = very good
age : Age
yrs_married : Number of years married (interval approximations; see the original paper for a detailed explanation)
children : Number of children
religious : How religious, 1 = not, 2 = mildly, 3 = fairly, 4 = strongly
educ : Level of education, 9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree
occupation : 1 = student; 2 = farming, agriculture, semi-skilled or unskilled worker; 3 = white-collar; 4 = teacher, counselor, social worker, nurse, artist, writer, technician, skilled worker; 5 = managerial, administrative, business; 6 = professional with advanced degree
occupation_husb : Husband's occupation, same categories as occupation
affairs : Measure of time spent in extramarital affairs
See the original paper for more details.
Why a Statsmodels data set? Using one here leaves the example datasets included with SciKit Learn, and their accompanying tutorials, for you to work through on your own.
In [6]:
# Standard method of loading a Statsmodels dataset into a pandas DataFrame.
# The dataset is named 'fair' after Ray Fair, the author of the original paper.
df = sm.datasets.fair.load_pandas().data
df.head()
Out[6]:
Great! Let's go ahead and start our classification by creating a new column called 'Had_Affair'. We will set this column equal to 0 if the affairs column is 0 (meaning no time spent in affairs); otherwise the 'Had_Affair' value will be set to 1, indicating that the woman had an affair.
In [47]:
# Create check function: any nonzero time spent in affairs counts as having had an affair
def affair_check(x):
    if x != 0:
        return 1
    else:
        return 0

# Apply to DataFrame
df['Had_Affair'] = df['affairs'].apply(affair_check)
In [48]:
df.head(5)
Out[48]:
Now let's go ahead and group by the newly created 'Had_Affair' column and call the mean aggregate function.
In [49]:
# Groupby Had Affair column
df.groupby('Had_Affair').mean()
Out[49]:
Looking at this brief glance of the data, it seems that the women who had affairs were slightly older, married longer, slightly less religious, and less educated. However, the mean values of both classes are very close for all variables.
Let's go ahead and try to visualize some of this data.
In [53]:
# Factorplot for age with Had Affair hue
# (in newer seaborn versions, factorplot has been renamed catplot)
sns.factorplot("age",
               hue="Had_Affair",
               data=df,
               kind="count",
               palette='coolwarm')
Out[53]:
This suggests a higher probability of an affair as age increases. Let's check the number of years married.
In [55]:
# Factorplot for years married with Had Affair hue
sns.factorplot('yrs_married',
data=df,
hue='Had_Affair',
kind="count",
palette='coolwarm')
Out[55]:
Looks like probability of having an affair increases with the number of years married. Let's check the number of children.
In [56]:
# Factorplot for number of children with Had Affair hue
sns.factorplot('children',data=df,hue='Had_Affair',kind="count", palette='coolwarm')
Out[56]:
The plot suggests fairly strongly that a larger number of children is associated with an increased probability of an affair. Finally, let's check the education level.
In [58]:
# Factorplot for education level with Had Affair hue
sns.factorplot('educ',data=df,hue='Had_Affair',kind="count", palette='coolwarm')
Out[58]:
There is a slight decrease in the probability of an affair as education level increases.
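To put rough numbers behind what the count plots suggest, one optional check (not part of the original walkthrough, just a sketch) is to group by each variable and take the mean of the 0/1 'Had_Affair' column, which is simply the affair rate for each group:
In [ ]:
# Optional check: affair rate at each level of a few variables
# (mean of the 0/1 Had_Affair column within each group)
for col in ['age', 'yrs_married', 'children', 'educ']:
    print(df.groupby(col)['Had_Affair'].mean())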
If we look at the data, we'll notice that two columns are unlike the others: occupation and husband's occupation. These columns are in a format known as categorical variables. Each value marks a discrete category, so 1.0 and 2.0 are separate categories, not points along a spectrum from 1 to 2 (e.g. there is no 1.5 for the occupation column). Pandas has a built-in method for creating dummy variables from a categorical column and turning them into new columns.
In [18]:
# Create new DataFrames for the Categorical Variables
occ_dummies = pd.get_dummies(df['occupation'])
hus_occ_dummies = pd.get_dummies(df['occupation_husb'])
# Let's take a quick look at the results
occ_dummies.head()
Out[18]:
Great! Now let's name the columns something a little more readable.
In [19]:
# Create column names for the new DataFrames
occ_dummies.columns = ['occ1','occ2','occ3','occ4','occ5','occ6']
hus_occ_dummies.columns = ['hocc1','hocc2','hocc3','hocc4','hocc5','hocc6']
Now we will create the X and Y data sets for our logistic regression!
In [21]:
# Set X as new DataFrame without the occupation columns or the Y target
X = df.drop(['occupation','occupation_husb','Had_Affair'],axis=1)
# Concat the dummy DataFrames Together
dummies = pd.concat([occ_dummies,hus_occ_dummies],axis=1)
Now we will concatenate all the DataFrames together.
In [22]:
# Now Concat the X DataFrame with the dummy variables
X = pd.concat([X,dummies],axis=1)
# Preview of Result
X.head()
Out[22]:
Now let's go ahead and set up the Y.
In [23]:
# Set Y as Target class, Had Affair
Y = df.Had_Affair
# Preview
Y.head()
Out[23]:
Now we need to get rid of a few columns. We will drop the occ1 and hocc1 columns to avoid multicollinearity. Multicollinearity arises from the dummy variables we created: within each set the dummy columns are perfectly related, so one of them can be linearly predicted from the others and the model becomes distorted. We take care of this problem by dropping one dummy variable from each set, at the cost of losing one column of information.
The other column we will drop is the affairs column. It is basically a repeat of what will be our Y target; instead of 0 and 1, it holds 0 or some positive number, so we need to drop it for our target to make sense.
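As an aside, here's a sketch of a shortcut: newer pandas versions let get_dummies drop the baseline column for you via the drop_first argument, which would avoid the manual drops in the next cell. We'll keep the explicit drops below so each step stays visible.
In [ ]:
# Alternative sketch (newer pandas versions): drop the baseline dummy automatically
occ_alt = pd.get_dummies(df['occupation'], prefix='occ', drop_first=True)
hus_occ_alt = pd.get_dummies(df['occupation_husb'], prefix='hocc', drop_first=True)
occ_alt.head()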
In [24]:
# Dropping one column of each dummy variable set to avoid multicollinearity
X = X.drop('occ1',axis=1)
X = X.drop('hocc1',axis=1)
# Drop affairs column so Y target makes sense
X = X.drop('affairs',axis=1)
# Preview
X.head()
Out[24]:
In order to use the Y with SciKit Learn, we need it as a 1-D array. This means we need to "flatten" the array. Numpy has a built-in method for this called ravel. Let's use it!
In [25]:
# Flatten array
Y = np.ravel(Y)
# Check result
Y
Out[25]:
Awesome! Now let's go ahead and run the logistic regression. This is a very similar process to the Linear Regression from the previous lecture. We'll create the model, then fit the data to it and check our accuracy score. Then we'll split the data into testing and training sets and see if our results improve.
Let's start by initiating the model!
In [26]:
# Create LogisticRegression model
log_model = LogisticRegression()
# Fit our data
log_model.fit(X,Y)
# Check our accuracy
log_model.score(X,Y)
Out[26]:
Looks like we got about a 73% accuracy rating. Let's go ahead and compare this to the original Y data. We can do this by simply taking the mean of the Y data; since it is in the format 1 or 0, the mean gives us the percentage of women who reported having affairs. This is known as checking the null error rate.
In [27]:
# Check percentage of women that had affairs
Y.mean()
Out[27]:
This means that if our model had simply guessed "no affair" for everyone, it would have been correct 1 - 0.32 = 0.68 (or 68%) of the time. So while we are doing better than the null error rate, we aren't doing that much better.
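To make that comparison explicit, here's a one-line sketch of the null accuracy (the score from always guessing the majority class, here "no affair"):
In [ ]:
# Null accuracy: the score from always predicting the majority class
max(Y.mean(), 1 - Y.mean())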
Let's go ahead and check the coefficients of our model to check what seemed to be the stronger predictors.
In [29]:
# Use zip to pair the column names with the coefficients from the model
coeff_df = DataFrame(list(zip(X.columns, log_model.coef_[0])), columns=['feature', 'coefficient'])
coeff_df
Looking at the coefficients, a positive coefficient corresponds to an increased likelihood of having an affair as that variable's value increases, while a negative coefficient corresponds to a decreased likelihood.
As you might expect, an increased marriage rating corresponded to a decrease in the likelihood of having an affair. Increased religiousness also seems to correspond to a decrease in the likelihood of having an affair.
Since all the dummy-variable coefficients (the wife's and husband's occupations) are positive, the lowest likelihood of having an affair corresponds to the baseline occupation we dropped (1 = student).
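Because logistic regression coefficients live on the log-odds scale, a common follow-up (a sketch, not part of the original walkthrough) is to exponentiate them and read each as an odds ratio: values above 1 increase the odds of an affair, values below 1 decrease them.
In [ ]:
# Sketch: exponentiate the coefficients to read them as odds ratios
DataFrame(list(zip(X.columns, np.exp(log_model.coef_[0]))), columns=['feature', 'odds_ratio'])
Now let's split the data into training and testing sets, fit a new model, and see how well the accuracy holds up on data the model hasn't seen.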
In [30]:
# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
# Make a new log_model
log_model2 = LogisticRegression()
# Now fit the new model
log_model2.fit(X_train, Y_train)
Out[30]:
Now we can use predict to produce classification labels for the test set, then reevaluate our accuracy score!
In [31]:
# Predict the classes of the testing data set
class_predict = log_model2.predict(X_test)
# Compare the predicted classes to the actual test classes
print(metrics.accuracy_score(Y_test, class_predict))
Now we have a 73.35% accuracy score, which is basically the same as our previous accuracy score of 72.58%.
So what could we do to try to further improve our logistic regression model? We could try some regularization techniques or a non-linear model.
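As one concrete, hedged example of the regularization idea: scikit-learn's LogisticRegression exposes an inverse regularization strength parameter C, which could be tuned with a grid search. The import path for GridSearchCV depends on your scikit-learn version, so treat this as a sketch rather than a definitive recipe:
In [ ]:
# Sketch: tune the regularization strength C with a cross-validated grid search
# (GridSearchCV is in sklearn.grid_search in older scikit-learn versions,
#  sklearn.model_selection in newer ones)
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, Y_train)

print(grid.best_params_)
print(grid.score(X_test, Y_test))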
I'll leave the Logistic Regression topic here for you to explore more possibilities on your own. Here are several more resources and tutorials with other data sets to explore:
1.) Here's another great post on how to do logistic regression analysis using Statsmodels from yhat!
2.) The SciKit learn Documentation includes several examples at the bottom of the page.
3.) DataRobot has a great overview of Logistic Regression
4.) Fantastic resource from aimotion.blogspot on Logistic Regression and the mathematics of how it relates to the cost function and gradient!
In [ ]: