SP LIME

Regression explainer with boston housing prices dataset



In [1]:

    
from sklearn.datasets import load_boston
import sklearn.ensemble
import sklearn.linear_model
import sklearn.model_selection
import numpy as np
from sklearn.metrics import r2_score
np.random.seed(1)

#load example dataset
boston = load_boston()

#print a description of the variables
print(boston.DESCR)

#train a regressor
rf = sklearn.ensemble.RandomForestRegressor(n_estimators=1000)
train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(boston.data, boston.target, train_size=0.80, test_size=0.20)
rf.fit(train, labels_train);

#train a linear regressor
lr = sklearn.linear_model.LinearRegression()
lr.fit(train,labels_train)

#print the R^2 score of the random forest
print("Random Forest R^2 Score: " +str(round(r2_score(rf.predict(test),labels_test),3)))
print("Linear Regression R^2 Score: " +str(round(r2_score(lr.predict(test),labels_test),3)))









    



.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Random Forest R^2 Score: 0.881
Linear Regression R^2 Score: 0.593



In [2]:

    
# import lime tools
import lime
import lime.lime_tabular

# generate an "explainer" object
categorical_features  = np.argwhere(np.array([len(set(boston.data[:,x])) for x in range(boston.data.shape[1])]) <= 10).flatten()
explainer = lime.lime_tabular.LimeTabularExplainer(train, feature_names=boston.feature_names, class_names=['price'], categorical_features=categorical_features, verbose=False, mode='regression',discretize_continuous=False)



In [3]:

    
#generate an explanation
i = 13
exp = explainer.explain_instance(test[i], rf.predict, num_features=14)



In [4]:

    
%matplotlib inline
fig = exp.as_pyplot_figure();



In [5]:

    
print("Input feature names: ")
print(boston.feature_names)
print('\n')

print("Input feature values: ")
print(test[i])
print('\n')

print("Predicted: ")
print(rf.predict(test)[i])









    



Input feature names: 
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


Input feature values: 
[4.3790e-02 8.0000e+01 3.3700e+00 0.0000e+00 3.9800e-01 5.7870e+00
 3.1100e+01 6.6115e+00 4.0000e+00 3.3700e+02 1.6100e+01 3.9690e+02
 1.0240e+01]


Predicted: 
20.161599999999957

SP-LIME pick step

Maximize the 'coverage' function:

$c(V,W,I) = \sum_{j=1}^{d^{\prime}}{\mathbb{1}_{[\exists i \in V : W_{ij}>0]}I_j}$

$W = \text{Explanation Matrix, } n\times d^{\prime}$

$V = \text{Set of chosen explanations}$

$I = \text{Global feature importance vector, } I_j = \sqrt{\sum_i{|W_{ij}|}}$



In [6]:

    
import lime



In [9]:

    
import warnings
from lime import submodular_pick
sp_obj = submodular_pick.SubmodularPick(explainer, train, rf.predict, sample_size=20, num_features=14, num_exps_desired=5)



In [10]:

    
[exp.as_pyplot_figure() for exp in sp_obj.sp_explanations];



In [11]:

    
import pandas as pd
W=pd.DataFrame([dict(this.as_list()) for this in sp_obj.explanations])



In [12]:

    
W.head()



In [13]:

    
im=W.hist('NOX',bins=20)

Text explainer using the newsgroups



In [14]:

    
# run the text explainer example notebook, up to single explanation
import sklearn
import numpy as np
import sklearn
import sklearn.ensemble
import sklearn.metrics

from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
class_names = ['atheism', 'christian']

vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(newsgroups_train.data)
test_vectors = vectorizer.transform(newsgroups_test.data)

rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
rf.fit(train_vectors, newsgroups_train.target)

pred = rf.predict(test_vectors)
sklearn.metrics.f1_score(newsgroups_test.target, pred, average='binary')

from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)

from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=class_names)

idx = 83
exp = explainer.explain_instance(newsgroups_test.data[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(christian) =', c.predict_proba([newsgroups_test.data[idx]])[0,1])
print('True class: %s' % class_names[newsgroups_test.target[idx]])









    



/mnt/c/Users/marcotcr/work/lime/python3env/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)






    



Document id: 83
Probability(christian) = 0.442
True class: atheism



In [15]:

    
sp_obj = submodular_pick.SubmodularPick(explainer, newsgroups_test.data, c.predict_proba, sample_size=2, num_features=6,num_exps_desired=2)









    



/mnt/c/Users/marcotcr/work/lime/python3env/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
/mnt/c/Users/marcotcr/work/lime/python3env/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)



In [18]:

    
[exp.as_pyplot_figure(label=exp.available_labels()[0]) for exp in sp_obj.sp_explanations];



In [20]:

    
from sklearn.datasets import load_iris
iris=load_iris()
from sklearn.model_selection import train_test_split as tts
Xtrain,Xtest,ytrain,ytest=tts(iris.data,iris.target,test_size=.2)
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(Xtrain,ytrain)
rf.score(Xtest,ytest)









    



/mnt/c/Users/marcotcr/work/lime/python3env/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)






    Out[20]:





0.9333333333333333



In [21]:

    
explainer = lime.lime_tabular.LimeTabularExplainer(Xtrain, 
                                                   feature_names=iris.feature_names,
                                                   class_names=iris.target_names, 
                                                   verbose=False, 
                                                   mode='classification',
                                                   discretize_continuous=False)



In [22]:

    
exp=explainer.explain_instance(Xtrain[i],rf.predict_proba,top_labels=3)
exp.available_labels()









    Out[22]:





[0, 2, 1]



In [25]:

    
sp_obj = submodular_pick.SubmodularPick(data=Xtrain,explainer=explainer,num_exps_desired=5,predict_fn=rf.predict_proba, sample_size=20, num_features=4, top_labels=3)



In [26]:

    
import pandas as pd
df=pd.DataFrame({})
for this_label in range(3):
    dfl=[]
    for i,exp in enumerate(sp_obj.sp_explanations):
        l=exp.as_list(label=this_label)
        l.append(("exp number",i))
        dfl.append(dict(l))
    dftest=pd.DataFrame(dfl)
    df=df.append(pd.DataFrame(dfl,index=[iris.target_names[this_label] for i in range(len(sp_obj.sp_explanations))]))
df









    Out[26]:







  
    
      
      exp number
      petal length (cm)
      petal width (cm)
      sepal length (cm)
      sepal width (cm)
    
  
  
    
      setosa
      0
      -0.410513
      -0.049033
      -0.003635
      0.010532
    
    
      setosa
      1
      -0.416660
      -0.038774
      0.013522
      -0.000206
    
    
      setosa
      2
      -0.249963
      -0.023399
      -0.001491
      0.000962
    
    
      setosa
      3
      -0.423744
      -0.053167
      0.006401
      -0.001301
    
    
      setosa
      4
      -0.255049
      -0.037626
      0.002671
      0.002027
    
    
      versicolor
      0
      0.254472
      -0.042948
      0.029553
      -0.013252
    
    
      versicolor
      1
      0.261545
      -0.054880
      0.024067
      -0.001325
    
    
      versicolor
      2
      -0.026307
      -0.228508
      0.011449
      0.008821
    
    
      versicolor
      3
      0.282608
      -0.034669
      0.028875
      -0.001872
    
    
      versicolor
      4
      -0.009620
      -0.164904
      -0.015407
      -0.008737
    
    
      virginica
      0
      0.156041
      0.091980
      -0.025919
      0.002720
    
    
      virginica
      1
      0.155115
      0.093654
      -0.037589
      0.001530
    
    
      virginica
      2
      0.276270
      0.251906
      -0.009958
      -0.009783
    
    
      virginica
      3
      0.141136
      0.087837
      -0.035277
      0.003173
    
    
      virginica
      4
      0.264669
      0.202530
      0.012736
      0.006710



In [ ]:

	AGE	B	CHAS=0	CHAS=1	CRIM	DIS	INDUS	LSTAT	NOX	PTRATIO	RAD=24	RAD=3	RAD=4	RAD=5	RAD=7	RM	TAX	ZN
0	-0.080724	0.124994	-0.214342	NaN	0.181336	-1.298673	-0.161573	-4.601277	-0.192838	-0.423889	NaN	NaN	NaN	-0.003099	NaN	1.852061	-0.320556	0.012300
1	-0.057718	0.037930	-0.250741	NaN	0.017215	-1.414275	-0.147413	-4.916553	-0.409355	-0.448896	-0.047497	NaN	NaN	NaN	NaN	1.740887	-0.219528	0.011163
2	-0.108377	0.108911	-0.081349	NaN	0.000332	-1.151193	-0.185450	-4.491332	-0.361803	-0.364918	NaN	NaN	-0.025486	NaN	NaN	1.699323	-0.307213	0.016454
3	-0.138040	0.080173	NaN	0.20563	-0.080418	-1.194666	-0.139015	-4.221639	-0.309510	-0.380089	NaN	NaN	-0.084104	NaN	NaN	1.567589	-0.185666	0.064908
4	-0.186729	0.137872	0.094365	NaN	-0.240311	-1.089373	-0.077094	-4.997545	-0.546299	-0.498371	0.055739	NaN	NaN	NaN	NaN	1.506781	-0.216876	-0.084339

	exp number	petal length (cm)	petal width (cm)	sepal length (cm)	sepal width (cm)
setosa	0	-0.410513	-0.049033	-0.003635	0.010532
setosa	1	-0.416660	-0.038774	0.013522	-0.000206
setosa	2	-0.249963	-0.023399	-0.001491	0.000962
setosa	3	-0.423744	-0.053167	0.006401	-0.001301
setosa	4	-0.255049	-0.037626	0.002671	0.002027
versicolor	0	0.254472	-0.042948	0.029553	-0.013252
versicolor	1	0.261545	-0.054880	0.024067	-0.001325
versicolor	2	-0.026307	-0.228508	0.011449	0.008821
versicolor	3	0.282608	-0.034669	0.028875	-0.001872
versicolor	4	-0.009620	-0.164904	-0.015407	-0.008737
virginica	0	0.156041	0.091980	-0.025919	0.002720
virginica	1	0.155115	0.093654	-0.037589	0.001530
virginica	2	0.276270	0.251906	-0.009958	-0.009783
virginica	3	0.141136	0.087837	-0.035277	0.003173
virginica	4	0.264669	0.202530	0.012736	0.006710