Census Income Dataset Analysis

Link: https://archive.ics.uci.edu/ml/datasets/adult Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.



In [2]:

    
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from xtoy import Toy 

df = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None)
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]
# df = df.sample(frac=0.01, random_state=1)
train_cols = df.columns[0:-1]
label = df.columns[-1]
X = df[train_cols]
y = df[label].apply(lambda x: 0 if x == " <=50K" else 1)
toy = Toy()
toy.fit(X,y)

Model Performance



In [4]:

    
from interpret import show
from interpret.perf import ROC

blackbox_perf = ROC(toy.predict_proba).explain_perf(X,y, name='toy')
show(blackbox_perf)

Which variable is most important?

Using MorrisSensitivity to answer this question https://www.sciencedirect.com/science/article/pii/S0022169412008918



In [5]:

    
from interpret.blackbox import MorrisSensitivity
trans_df = pd.DataFrame(data=toy.featurizer.transform(X).A, columns=toy.feature_names_)
sensitivity = MorrisSensitivity(predict_fn=toy.best_evo.predict_proba, data=trans_df)
sensitivity_global = sensitivity.explain_global(name="Global Sensitivity")
show(sensitivity_global)



In [16]:

    
print('Why Does this person displayed below earn less than 50k dollars?')
display(X.head(1))
print('Let\'s use shap value to explain our prediction')









    



Why Does this person displayed below earn less than 50k dollars?






    







  
    
      
      Age
      WorkClass
      fnlwgt
      Education
      EducationNum
      MaritalStatus
      Occupation
      Relationship
      Race
      Gender
      CapitalGain
      CapitalLoss
      HoursPerWeek
      NativeCountry
    
  
  
    
      0
      39
      State-gov
      77516
      Bachelors
      13
      Never-married
      Adm-clerical
      Not-in-family
      White
      Male
      2174
      0
      40
      United-States
    
  








    



Let's use shap value to explain our prediction



In [6]:

    
from interpret.blackbox import ShapKernel
import numpy as np

background_val = np.median(toy.featurizer.transform(X).A, axis=0).reshape(1, -1)
shap = ShapKernel(predict_fn=toy.best_evo.predict_proba, data=background_val, feature_names=toy.feature_names_)



In [18]:

    
from ipywidgets import IntProgress
shap_local = shap.explain_local(toy.featurizer.transform(X).A[0:1], y[0:1], name='SHAP')
show(shap_local)

The Above Analysis suggest that while Education Number is 13, Education has the word 'bachelors' and his age is 39, his capital gain is just 2174



In [19]:

    
df[df['CapitalGain'] <= 2200]['Income'].value_counts()









    Out[19]:





 <=50K    23929
 >50K      6164
Name: Income, dtype: int64