Example taken from:
http://nbviewer.jupyter.org/github/justmarkham/DAT8/blob/master/notebooks/12_logistic_regression.ipynb
A more mathematical approach can be found here:
http://rasbt.github.io/mlxtend/user_guide/classifier/LogisticRegression/
Glass information found at the crime scene.
The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left behind can be used as evidence...if it is correctly identified!
Attribute Information:
Mg: Magnesium ...
Dependent variable = glass_type => multiclass problem
We convert it to a dichotomous variable (household).
In [25]:
import pandas as pd
# load the glass dataset straight from the UCI repository
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass.sort_values(by='al', inplace=True)
glass.head()
# types 1-3 are window glass; types 5-7 (containers, tableware, headlamps) are household glass
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
glass.head()
Out[25]:
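As an optional sanity check (our own cell, not part of the original notebook), we can cross-tabulate glass_type against the new household column to confirm the dichotomization:
In [ ]:
# optional check: confirm that types 1-3 map to 0 and types 5-7 map to 1
pd.crosstab(glass.glass_type, glass.household)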
In [26]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [27]:
plt.scatter(glass.al, glass.household)
plt.xlabel('al')
plt.ylabel('household')
Out[27]:
In [28]:
# Let's see if we are dealing with an imbalanced problem...
glass['household'].value_counts()
Out[28]:
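To read the balance as proportions rather than raw counts, one could also pass normalize=True (a standard pandas option; this cell is our optional addition):
In [ ]:
# optional extra: class balance as proportions
glass['household'].value_counts(normalize=True)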
In [29]:
# fit a linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
linreg.fit(X, y)
glass['household_pred'] = linreg.predict(X)
# scatter plot that includes the regression line
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[29]:
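One drawback of a linear regression on a 0/1 target is that its predictions are unbounded: nothing keeps them inside [0, 1]. A quick check makes this visible (our own optional cell):
In [ ]:
# optional check: the fitted line's predictions are not confined to [0, 1]
print(glass.household_pred.min(), glass.household_pred.max())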
In [30]:
# Because eventually we need to decide on a class (0 or 1), we need a cutoff value.
# A cutoff of 0.5 works here since ~half of the data lie below it and ~half above.
CUTOFF = 0.5
test_values = [0.5, 1.0, 3.0]
for i in test_values:
    print("if al =", i, "then household =", linreg.predict([[i]])[0] > CUTOFF)
In [31]:
import numpy as np
# convert the continuous predictions into class labels (0/1) using the cutoff
glass['household_pred_class'] = np.where(glass.household_pred >= CUTOFF, 1, 0)
glass.head()
Out[31]:
In [32]:
# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')
Out[32]:
In [35]:
# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)  # very large C => effectively no regularization
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
logreg.fit(X, y)
# store the predicted class
glass['household_pred_class'] = logreg.predict(X)
# store the predicted probabilities of class 1
glass['household_pred_prob'] = logreg.predict_proba(X)[:, 1]
# plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.plot(glass.al, glass.household_pred_prob, color='blue')
plt.xlabel('al')
plt.ylabel('household')
Out[35]:
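Under the hood, logistic regression models the class-1 probability with the sigmoid of a linear score: p = 1 / (1 + exp(-(intercept + coef * al))). As a sketch (our own verification cell), we can reproduce predict_proba manually and check that it matches:
In [ ]:
# sketch: recompute the predicted probabilities from the fitted coefficients
linear_score = logreg.intercept_[0] + logreg.coef_[0][0] * glass.al
manual_prob = 1 / (1 + np.exp(-linear_score))
# should agree with predict_proba up to floating-point error
print(np.allclose(manual_prob, glass.household_pred_prob))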
In [37]:
glass['household_pred_class_logistic_reg'] = glass.household_pred_class
glass.head()
Out[37]:
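As a final optional step (our addition, not in the original notebook), we can measure how often the logistic model's class predictions match the true labels. Note that this is training accuracy, computed on the same data the model was fit on, so it is optimistic:
In [ ]:
from sklearn import metrics
# training accuracy of the logistic regression class predictions
print(metrics.accuracy_score(y, glass.household_pred_class_logistic_reg))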
In [ ]: