Logistic regression in scikit-learn

  • We'll explore a logistic regression model in scikit-learn.
  • We'll talk about how to debug and evaluate the model.
  • We'll do some feature engineering.

In [6]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
%matplotlib inline

In [7]:
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", header=None, names=['age', 'workclass', 'fnlwgt', 
                'education-categorical', 'educ', 
                'marital-status', 'occupation',
                'relationship', 'race', 'sex', 
                'captial-gain', 'capital-loss', 
                'hours', 'native-country', 
                'income'])

Feature Engineering

  • We'll generate some features here.
  • We want to predict whether income is greater than 50K.
  • This is a typical business-like problem.
  • The target could just as well be "is ad revenue above a certain level?" or "is this person a student?"
  • Logistic regression is not as sexy as deep learning, but it's a powerful model.

In [10]:
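# Drop rows with a missing income label.
data = data[~pd.isnull(data['income'])]
# Preview the US rows only (note the leading space); the result is displayed, not assigned.
data[data['native-country']==" United-States"]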


Out[10]:
age workclass fnlwgt education-categorical educ marital-status occupation relationship race sex captial-gain capital-loss hours native-country income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
5 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K
7 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States >50K
8 31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States >50K
9 42 Private 159449 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 5178 0 40 United-States >50K
10 37 Private 280464 Some-college 10 Married-civ-spouse Exec-managerial Husband Black Male 0 0 80 United-States >50K
12 23 Private 122272 Bachelors 13 Never-married Adm-clerical Own-child White Female 0 0 30 United-States <=50K
13 32 Private 205019 Assoc-acdm 12 Never-married Sales Not-in-family Black Male 0 0 50 United-States <=50K
16 25 Self-emp-not-inc 176756 HS-grad 9 Never-married Farming-fishing Own-child White Male 0 0 35 United-States <=50K
17 32 Private 186824 HS-grad 9 Never-married Machine-op-inspct Unmarried White Male 0 0 40 United-States <=50K
18 38 Private 28887 11th 7 Married-civ-spouse Sales Husband White Male 0 0 50 United-States <=50K
19 43 Self-emp-not-inc 292175 Masters 14 Divorced Exec-managerial Unmarried White Female 0 0 45 United-States >50K
20 40 Private 193524 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 0 0 60 United-States >50K
21 54 Private 302146 HS-grad 9 Separated Other-service Unmarried Black Female 0 0 20 United-States <=50K
22 35 Federal-gov 76845 9th 5 Married-civ-spouse Farming-fishing Husband Black Male 0 0 40 United-States <=50K
23 43 Private 117037 11th 7 Married-civ-spouse Transport-moving Husband White Male 0 2042 40 United-States <=50K
24 59 Private 109015 HS-grad 9 Divorced Tech-support Unmarried White Female 0 0 40 United-States <=50K
25 56 Local-gov 216851 Bachelors 13 Married-civ-spouse Tech-support Husband White Male 0 0 40 United-States >50K
26 19 Private 168294 HS-grad 9 Never-married Craft-repair Own-child White Male 0 0 40 United-States <=50K
28 39 Private 367260 HS-grad 9 Divorced Exec-managerial Not-in-family White Male 0 0 80 United-States <=50K
29 49 Private 193366 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
30 23 Local-gov 190709 Assoc-acdm 12 Never-married Protective-serv Not-in-family White Male 0 0 52 United-States <=50K
31 20 Private 266015 Some-college 10 Never-married Sales Own-child Black Male 0 0 44 United-States <=50K
32 45 Private 386940 Bachelors 13 Divorced Exec-managerial Own-child White Male 0 1408 40 United-States <=50K
33 30 Federal-gov 59951 Some-college 10 Married-civ-spouse Adm-clerical Own-child White Male 0 0 40 United-States <=50K
34 22 State-gov 311512 Some-college 10 Married-civ-spouse Other-service Husband Black Male 0 0 15 United-States <=50K
36 21 Private 197200 Some-college 10 Never-married Machine-op-inspct Own-child White Male 0 0 40 United-States <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32528 31 Private 292592 HS-grad 9 Married-civ-spouse Machine-op-inspct Wife White Female 0 0 40 United-States <=50K
32529 29 Private 125976 HS-grad 9 Separated Sales Unmarried White Female 0 0 35 United-States <=50K
32530 35 ? 320084 Bachelors 13 Married-civ-spouse ? Wife White Female 0 0 55 United-States >50K
32531 30 ? 33811 Bachelors 13 Never-married ? Not-in-family Asian-Pac-Islander Female 0 0 99 United-States <=50K
32532 34 Private 204461 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 0 0 60 United-States >50K
32534 37 Private 179137 Some-college 10 Divorced Adm-clerical Unmarried White Female 0 0 39 United-States <=50K
32535 22 Private 325033 12th 8 Never-married Protective-serv Own-child Black Male 0 0 35 United-States <=50K
32536 34 Private 160216 Bachelors 13 Never-married Exec-managerial Not-in-family White Female 0 0 55 United-States >50K
32537 30 Private 345898 HS-grad 9 Never-married Craft-repair Not-in-family Black Male 0 0 46 United-States <=50K
32538 38 Private 139180 Bachelors 13 Divorced Prof-specialty Unmarried Black Female 15020 0 45 United-States >50K
32539 71 ? 287372 Doctorate 16 Married-civ-spouse ? Husband White Male 0 0 10 United-States >50K
32540 45 State-gov 252208 HS-grad 9 Separated Adm-clerical Own-child White Female 0 0 40 United-States <=50K
32541 41 ? 202822 HS-grad 9 Separated ? Not-in-family Black Female 0 0 32 United-States <=50K
32542 72 ? 129912 HS-grad 9 Married-civ-spouse ? Husband White Male 0 0 25 United-States <=50K
32543 45 Local-gov 119199 Assoc-acdm 12 Divorced Prof-specialty Unmarried White Female 0 0 48 United-States <=50K
32544 31 Private 199655 Masters 14 Divorced Other-service Not-in-family Other Female 0 0 30 United-States <=50K
32545 39 Local-gov 111499 Assoc-acdm 12 Married-civ-spouse Adm-clerical Wife White Female 0 0 20 United-States >50K
32546 37 Private 198216 Assoc-acdm 12 Divorced Tech-support Not-in-family White Female 0 0 40 United-States <=50K
32548 65 Self-emp-not-inc 99359 Prof-school 15 Never-married Prof-specialty Not-in-family White Male 1086 0 60 United-States <=50K
32549 43 State-gov 255835 Some-college 10 Divorced Adm-clerical Other-relative White Female 0 0 40 United-States <=50K
32550 43 Self-emp-not-inc 27242 Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 50 United-States <=50K
32551 32 Private 34066 10th 6 Married-civ-spouse Handlers-cleaners Husband Amer-Indian-Eskimo Male 0 0 40 United-States <=50K
32552 43 Private 84661 Assoc-voc 11 Married-civ-spouse Sales Husband White Male 0 0 45 United-States <=50K
32554 53 Private 321865 Masters 14 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 United-States >50K
32555 22 Private 310152 Some-college 10 Never-married Protective-serv Not-in-family White Male 0 0 40 United-States <=50K
32556 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States <=50K
32557 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States >50K
32558 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States <=50K
32559 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States <=50K
32560 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States >50K

29170 rows × 15 columns

  • A standard trick with features is to add squared terms.
  • I know from other studies that income is roughly quadratic in age.
  • Your earnings don't increase in a linear fashion.
  • This is a naive feature, but a decent one to make.
  • Feature engineering is a complicated topic; I just want to share some good rules of thumb here.
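
To see why a squared term helps, here is a minimal sketch on synthetic data (not the census data). Everything in it is an assumption made for illustration: the peak at age 50, the rescaled variable `z`, and the sample size. It compares training accuracy with and without the squared feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000

# Hypothetical synthetic data: the chance of a positive label peaks at age 50
# and falls off on both sides, so it is not monotonic in age.
age = rng.uniform(20, 80, size=n)
z = (age - 50) / 10                      # centred, rescaled age
p = 1 / (1 + np.exp(-(2 - z**2)))        # true probability of a positive label
y = (rng.uniform(size=n) < p).astype(int)

X_linear = z.reshape(-1, 1)              # age only
X_squared = np.column_stack([z, z**2])   # age plus its square

# Training accuracy is enough for this comparison; both models see the same rows.
acc_linear = LogisticRegression().fit(X_linear, y).score(X_linear, y)
acc_squared = LogisticRegression().fit(X_squared, y).score(X_squared, y)

print(f"accuracy with age only:      {acc_linear:.3f}")
print(f"accuracy with age and age^2: {acc_squared:.3f}")
```

A model with only the linear term can draw a single cut-off in age, so it cannot represent "low, then high, then low again". The squared term gives it a parabola in the log-odds.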

In [11]:
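# Binary target: 1 if income is above 50K, else 0 (the raw label has a leading space).
income = 1 * (data['income'] == " >50K")
# Squared age, so the model can capture a non-linear age effect.
age2 = np.square(data['age'])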

We'll restrict our search space a bit, to only those variables or features we think are useful.

  • We could use an explained-variance criterion or an automated feature selection method to do this.
  • Here we'll assume that we know the useful features from domain knowledge or just experience.
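
For comparison, this is roughly what an automated selection step could look like. It is a sketch under assumptions, not what this notebook does: the data is synthetic, the column names are hypothetical, and `SelectKBest` with the ANOVA F-score (`f_classif`) is just one of several selection methods in scikit-learn.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 500

# Hypothetical stand-in data: two informative columns and one pure-noise column.
informative_a = rng.normal(size=n)
informative_b = rng.normal(size=n)
noise = rng.normal(size=n)
X_demo = np.column_stack([informative_a, informative_b, noise])

# The label depends only on the first two columns (plus some randomness).
y_demo = (informative_a + informative_b + 0.5 * rng.normal(size=n) > 0).astype(int)

# Score every column against the label and keep the two highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)

feature_names = ["informative_a", "informative_b", "noise"]
kept = [name for name, keep in zip(feature_names, selector.get_support()) if keep]

print(dict(zip(feature_names, selector.scores_.round(1))))
print("kept:", kept)
```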

In [12]:
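# Copy the subset so the new columns go into an independent frame
# rather than a view of the original (avoids SettingWithCopyWarning).
data = data[['age', 'educ', 'hours']].copy()
data['age2'] = age2
data['income'] = income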

Let's explore the data a bit.


In [13]:
income.value_counts()


Out[13]:
0    24720
1     7841
Name: income, dtype: int64

Exploring the data

  • Let's get a feel for the variables.
  • Age has a long-tailed distribution; it is certainly not Gaussian!
  • We don't see much correlation between most of the features, with the exception of age and age2.
  • Hours worked has some interesting behaviour. How would one describe this distribution?
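
A pair plot shows correlation visually; a correlation matrix gives the same information as numbers. The sketch below uses made-up data as a stand-in (uniform ages, hours drawn independently of age), so only the pattern carries over: a column and its square are strongly correlated, while independent columns are not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real columns: ages drawn uniformly from 17 to 90,
# hours drawn independently of age.
demo = pd.DataFrame({
    "age": rng.integers(17, 91, size=1000),
    "hours": rng.integers(1, 100, size=1000),
})
demo["age2"] = np.square(demo["age"])

# Pearson correlation between every pair of columns.
corr = demo.corr()
print(corr.round(2))
```

On the real frame the equivalent call would be `data.corr()`.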

In [14]:
import seaborn as sns
g = sns.pairplot(data)



In [31]:
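# C is the inverse regularization strength, so a large C means very weak regularization.
logreg = linear_model.LogisticRegression(C=1e5)

# Rebuild the feature frame (repeats the earlier feature-engineering steps).
age2 = np.square(data['age'])
data = data[['age', 'educ', 'hours']].copy()
data['age2'] = age2
data['income'] = income
X = data[['age', 'age2', 'educ', 'hours']]
Y = data['income']
logreg.fit(X, Y)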


Out[31]:
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [32]:
# check the accuracy on the training set
logreg.score(X, Y)


Out[32]:
0.79303461195909219

In [33]:
Y.mean()


Out[33]:
0.24080955744602439

So we have decent predictions, but not great ones. Only 24% of the sample earns more than 50K, which means that you could obtain 76% accuracy by always predicting "no". Our 79% training accuracy beats that null accuracy, but not by much. Let's examine the coefficients and see what we learn.
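
The baseline comparison can be worked out directly from the numbers printed in the cells above. This is plain arithmetic on the counts from `income.value_counts()` and the score from `logreg.score(X, Y)`; the variable names are ours.

```python
# Class counts from income.value_counts() above.
n_low, n_high = 24720, 7841
n_total = n_low + n_high

# Accuracy of always predicting the majority class ("no", income <= 50K).
null_accuracy = n_low / n_total

# Training accuracy reported by logreg.score(X, Y) above.
model_accuracy = 0.79303461195909219

# How much the model improves on the always-"no" baseline.
improvement = model_accuracy - null_accuracy

print(f"null accuracy:  {null_accuracy:.4f}")
print(f"model accuracy: {model_accuracy:.4f}")
print(f"improvement:    {improvement:.4f}")
```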


In [34]:
g = np.transpose(logreg.coef_)
pd.DataFrame(list(zip(X.columns, g )))


Out[34]:
0 1
0 age [0.162458514116]
1 age2 [-0.00138241828468]
2 educ [0.283606412852]
3 hours [0.0290797158473]
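
One way to read these coefficients, sketched with the values copied from the table above. Exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that feature. The age and age2 terms have to be read together, since they form a parabola in the log-odds. The variable names are ours.

```python
import math

# Coefficients copied from the table above.
coef = {
    "age": 0.162458514116,
    "age2": -0.00138241828468,
    "educ": 0.283606412852,
    "hours": 0.0290797158473,
}

# exp(coefficient) is the multiplicative change in the odds of income > 50K
# per one-unit increase in that feature, holding the others fixed.
odds_ratio_educ = math.exp(coef["educ"])
odds_ratio_hours = math.exp(coef["hours"])

# The age contribution b1*age + b2*age^2 opens downward (b2 < 0),
# so it peaks at age = -b1 / (2 * b2).
peak_age = -coef["age"] / (2 * coef["age2"])

print(f"odds ratio per extra unit of educ:  {odds_ratio_educ:.2f}")
print(f"odds ratio per extra hour worked:   {odds_ratio_hours:.2f}")
print(f"age at which the age effect peaks:  {peak_age:.1f}")
```

The negative sign on age2 is what makes the age effect rise and then fall rather than grow without bound.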

A classical machine learning technique: split the data into a training set and a test set.


In [35]:
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
model2 = linear_model.LogisticRegression()
model2.fit(X_train, y_train)


Out[35]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [36]:
# predict class labels for the test set
predicted = model2.predict(X_test)
print(predicted)


[0 0 0 ..., 1 0 0]

In [37]:
# generate class probabilities
probs = model2.predict_proba(X_test)
print(probs)


[[ 0.85986473  0.14013527]
 [ 0.75614576  0.24385424]
 [ 0.82441467  0.17558533]
 ..., 
 [ 0.48120856  0.51879144]
 [ 0.79467429  0.20532571]
 [ 0.92966606  0.07033394]]
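
The class labels and the class probabilities are two views of the same prediction. For a binary model, the label is 1 when the second column (the probability of class 1) is above 0.5. Here is a small check using only the six rows printed above:

```python
import numpy as np

# The six rows of predict_proba output shown above.
# Columns are P(class 0) and P(class 1).
probs_shown = np.array([
    [0.85986473, 0.14013527],
    [0.75614576, 0.24385424],
    [0.82441467, 0.17558533],
    [0.48120856, 0.51879144],
    [0.79467429, 0.20532571],
    [0.92966606, 0.07033394],
])

# Threshold the class-1 probability at 0.5 to recover the class labels.
labels = (probs_shown[:, 1] > 0.5).astype(int)
print(labels)  # [0 0 0 1 0 0]
```

This matches the first three and last three entries printed by `model2.predict(X_test)`. Using a threshold other than 0.5 is one way to trade off false positives against false negatives.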

Model evaluation.

  • We can treat the model as a black box.
  • We can evaluate it and score it.
  • We can probably also improve our results with hyperparameter tuning, for example a grid search.
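
A minimal sketch of what scoring and a grid search might look like. To keep it self-contained it runs on synthetic data from `make_classification`, which is a stand-in for the notebook's `X` and `Y`. The grid of `C` values, the 5-fold setting, and the `max_iter` value are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical stand-in data with roughly the same class balance (76% / 24%).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.76, 0.24], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Try several regularization strengths with 5-fold cross-validation
# on the training set only.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# Score the best model once on the held-out test set.
test_pred = search.predict(X_te)
test_accuracy = accuracy_score(y_te, test_pred)
cm = confusion_matrix(y_te, test_pred)

print("best C:", search.best_params_["C"])
print(f"test accuracy: {test_accuracy:.3f}")
print(cm)  # rows are true classes, columns are predicted classes
```

With imbalanced classes, the confusion matrix is often more informative than accuracy alone, because it shows how many of the minority class were actually found.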
