Logistic regression in scikit-learn

  • We'll explore a logistic regression model in scikit-learn.
  • We'll talk about how to debug and evaluate the model.
  • We'll do some simple feature engineering.

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", header=None, names=['age', 'workclass', 'fnlwgt', 
                'education-categorical', 'educ', 
                'marital-status', 'occupation',
                'relationship', 'race', 'sex', 
                'capital-gain', 'capital-loss', 
                'hours', 'native-country', 
                'income'])

In [3]:
income = 1 * (data['income'] == " >50K")

Let's explore the data a bit.


In [1]:
income.value_counts()



Exploring the data

  • Let's get a feel for the features.
  • We see that age has a right-skewed, heavy-tailed distribution.
  • It's certainly not Gaussian! We also don't see much correlation between most of the features, with the exception of age and its square, age2 (see the quick numeric check after the plot below).
  • Hours worked has some interesting behaviour. How would one describe this distribution?

In [3]:
import seaborn as sns
g = sns.pairplot(data)
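
A quick numeric check of the claims above, a minimal sketch that assumes the full data frame loaded in In [2]:

# Positive skewness indicates a right-skewed, heavy-tailed distribution
print(data['age'].skew())
print(data['hours'].skew())

# Little correlation among the raw numeric features themselves
print(data[['age', 'educ', 'hours']].corr())

# age and its square are, unsurprisingly, strongly correlated
print(np.corrcoef(data['age'], data['age'] ** 2)[0, 1])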



In [4]:
logreg = linear_model.LogisticRegression(C=1e5)  # large C = very weak regularisation

# Feature engineering: add a squared age term to capture the non-linear effect of age
age2 = np.square(data['age'])
data = data[['age', 'educ', 'hours']].copy()
data['age2'] = age2
data['income'] = income

X = data[['age', 'age2', 'educ', 'hours']]
Y = data['income']
logreg.fit(X, Y)


Out[4]:
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [5]:
# check the accuracy on the training set
logreg.score(X, Y)


Out[5]:
0.79303461195909219

In [6]:
# Proportion of the positive class (income > 50K)
Y.mean()


Out[6]:
0.24080955744602439

So we have decent predictions, but not great ones. Only about 24% of individuals earn more than 50K, which means you could achieve roughly 76% accuracy simply by always predicting "no". We're doing better than that null error rate, but not by much (see the quick check below). Let's examine the coefficients and see what we learn.
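
First, a quick check of that baseline, a minimal sketch using the X, Y and logreg defined above:

# Null accuracy: always predicting the majority class ("<=50K")
null_accuracy = max(Y.mean(), 1 - Y.mean())
print(null_accuracy)                        # roughly 0.76
print(logreg.score(X, Y) - null_accuracy)   # improvement over that baseline, roughly 0.03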


In [7]:
# Model coefficients (on the log-odds scale), one per feature
g = np.transpose(logreg.coef_)
pd.DataFrame(list(zip(X.columns, g)))


Out[7]:
       0                     1
0    age      [0.162458514116]
1   age2   [-0.00138241828468]
2   educ      [0.283606412852]
3  hours     [0.0290797158473]
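
The coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch (the odds_ratios name is just illustrative):

# Odds ratio per one-unit increase in each feature, other features held fixed
odds_ratios = pd.DataFrame({'feature': X.columns,
                            'coef': logreg.coef_[0],
                            'odds_ratio': np.exp(logreg.coef_[0])})
print(odds_ratios)

For example, the educ coefficient of about 0.28 corresponds to an odds ratio of roughly exp(0.28) ≈ 1.33: each one-unit increase in educ multiplies the odds of earning more than 50K by about 1.33, holding the other features fixed.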

A classical machine learning technique: evaluate the model by splitting the data into a training set and a testing set.


In [8]:
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
model2 = linear_model.LogisticRegression()
model2.fit(X_train, y_train)


Out[8]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [9]:
# predict class labels for the test set
predicted = model2.predict(X_test)
print(predicted)


[0 0 0 ..., 1 0 0]

In [10]:
# generate class probabilities
probs = model2.predict_proba(X_test)
print(probs)


[[ 0.85986473  0.14013527]
 [ 0.75614576  0.24385424]
 [ 0.82441467  0.17558533]
 ..., 
 [ 0.48120856  0.51879144]
 [ 0.79467429  0.20532571]
 [ 0.92966606  0.07033394]]

Model evaluation

  • We can look at the model as a black box.
  • We can evaluate and score it on the held-out test set (see the first sketch below).
  • We could also use hyperparameter tuning, for example a grid search over the regularisation strength C, to improve our results (see the second sketch below).
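
A minimal evaluation sketch on the held-out test set, using the predicted labels and probabilities computed above (the metric choices are illustrative):

from sklearn import metrics

# Held-out accuracy, confusion matrix, ROC AUC and a per-class report
print(metrics.accuracy_score(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.roc_auc_score(y_test, probs[:, 1]))
print(metrics.classification_report(y_test, predicted))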

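And a sketch of hyperparameter tuning via a grid search over the regularisation strength C (the grid values are arbitrary choices for illustration):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(linear_model.LogisticRegression(), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)            # best C found by cross-validation
print(grid.score(X_test, y_test))   # accuracy of the refit model on the test set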