Logistic regression in scikit-learn

  • We'll explore a logistic regression model in scikit-learn.
  • We'll talk about how to debug and evaluate the model.
  • We'll do some simple feature engineering.

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", header=None, names=['age', 'workclass', 'fnlwgt', 
                'education-categorical', 'educ', 
                'marital-status', 'occupation',
                'relationship', 'race', 'sex', 
                'capital-gain', 'capital-loss', 
                'hours', 'native-country', 
                'income'])

In [3]:
income = 1 * (data['income'] == " >50K")

Let's explore the data a bit.


In [1]:
income.value_counts()



Exploring the data

  • Let's get a feel for the features.
  • We see that age has a right-skewed, heavy-tailed distribution.
  • It's certainly not Gaussian! We also don't see much correlation between most of the features, with the exception of age and its square, age2 (see the quick numeric check after the plot below).
  • Hours worked has some interesting behaviour. How would one describe this distribution?

In [3]:
import seaborn as sns
g = sns.pairplot(data)
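
A quick numeric check of the claims above, a minimal sketch that assumes the full data frame loaded in In [2]:

# Positive skewness indicates a right-skewed, heavy-tailed distribution
print(data['age'].skew())
print(data['hours'].skew())

# Little correlation among the raw numeric features themselves
print(data[['age', 'educ', 'hours']].corr())

# age and its square are, unsurprisingly, strongly correlated
print(np.corrcoef(data['age'], data['age'] ** 2)[0, 1])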



In [4]:
logreg = linear_model.LogisticRegression(C=1e5)  # large C = very weak regularisation

# Feature engineering: add a squared age term to capture the non-linear effect of age
age2 = np.square(data['age'])
data = data[['age', 'educ', 'hours']].copy()
data['age2'] = age2
data['income'] = income

X = data[['age', 'age2', 'educ', 'hours']]
Y = data['income']
logreg.fit(X, Y)


Out[4]:
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [5]:
# check the accuracy on the training set
logreg.score(X, Y)


Out[5]:
0.79303461195909219

In [6]:
# Proportion of the positive class (income > 50K)
Y.mean()


Out[6]:
0.24080955744602439

So we have decent predictions, but not great ones. Only about 24% of individuals earn more than 50K, which means you could achieve roughly 76% accuracy simply by always predicting "no". We're doing better than that null error rate, but not by much (see the quick check below). Let's examine the coefficients and see what we learn.
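
First, a quick check of that baseline, a minimal sketch using the X, Y and logreg defined above:

# Null accuracy: always predicting the majority class ("<=50K")
null_accuracy = max(Y.mean(), 1 - Y.mean())
print(null_accuracy)                        # roughly 0.76
print(logreg.score(X, Y) - null_accuracy)   # improvement over that baseline, roughly 0.03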


In [7]:
# Model coefficients (on the log-odds scale), one per feature
g = np.transpose(logreg.coef_)
pd.DataFrame(list(zip(X.columns, g)))


Out[7]:
       0                     1
0    age      [0.162458514116]
1   age2   [-0.00138241828468]
2   educ      [0.283606412852]
3  hours     [0.0290797158473]
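
The coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch (the odds_ratios name is just illustrative):

# Odds ratio per one-unit increase in each feature, other features held fixed
odds_ratios = pd.DataFrame({'feature': X.columns,
                            'coef': logreg.coef_[0],
                            'odds_ratio': np.exp(logreg.coef_[0])})
print(odds_ratios)

For example, the educ coefficient of about 0.28 corresponds to an odds ratio of roughly exp(0.28) ≈ 1.33: each one-unit increase in educ multiplies the odds of earning more than 50K by about 1.33, holding the other features fixed.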

A classical machine learning technique: evaluate the model by splitting the data into a training set and a testing set.


In [8]:
# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
model2 = linear_model.LogisticRegression()
model2.fit(X_train, y_train)


Out[8]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [9]:
# predict class labels for the test set
predicted = model2.predict(X_test)
print(predicted)


[0 0 0 ..., 1 0 0]

In [10]:
# generate class probabilities
probs = model2.predict_proba(X_test)
print(probs)


[[ 0.85986473  0.14013527]
 [ 0.75614576  0.24385424]
 [ 0.82441467  0.17558533]
 ..., 
 [ 0.48120856  0.51879144]
 [ 0.79467429  0.20532571]
 [ 0.92966606  0.07033394]]

Model evaluation

  • We can look at the model as a black box.
  • We can evaluate and score it on the held-out test set (see the first sketch below).
  • We could also use hyperparameter tuning, for example a grid search over the regularisation strength C, to improve our results (see the second sketch below).
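
A minimal evaluation sketch on the held-out test set, using the predicted labels and probabilities computed above (the metric choices are illustrative):

from sklearn import metrics

# Held-out accuracy, confusion matrix, ROC AUC and a per-class report
print(metrics.accuracy_score(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
print(metrics.roc_auc_score(y_test, probs[:, 1]))
print(metrics.classification_report(y_test, predicted))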

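And a sketch of hyperparameter tuning via a grid search over the regularisation strength C (the grid values are arbitrary choices for illustration):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(linear_model.LogisticRegression(), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)            # best C found by cross-validation
print(grid.score(X_test, y_test))   # accuracy of the refit model on the test set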