Work with all features to identify the relevant ones using statistical methods
Finalize the choice of features based on both statistical results and practical considerations. For example, if gill spacing were identified as the most important feature, I would still hesitate to include it: according to the dataset description, gill spacing can be close (c), crowded (w), or distant (d), but in real life a non-professional would have difficulty telling these apart, while features like colour or odor are easier to spot.
Analysis techniques:
All of my data is categorical, so I cannot use Principal Component Analysis (PCA) for feature selection, since PCA applies to continuous data (and only sometimes to dummy variables).
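Even with purely categorical data there are statistical options short of PCA, for instance a chi-squared test of each one-hot encoded feature against the class. Below is a minimal sketch using scikit-learn's SelectKBest; it assumes the DataFrame df loaded in the cells below and that poisonous mushrooms are coded 'p' in the 'classif' column (as in the standard UCI mushroom data), so treat it as illustrative only.

# Hedged sketch: chi-squared scoring of one-hot encoded categorical features.
# Assumes the DataFrame `df` loaded below, with the class label in 'classif'
# and poisonous mushrooms coded as 'p'.
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

encoded = pd.get_dummies(df.drop('classif', axis=1))  # 0/1 dummies are valid chi2 input
target = (df['classif'] == 'p').astype(int)

selector = SelectKBest(score_func=chi2, k=10).fit(encoded, target)

# The highest-scoring dummy columns point to the most informative original features.
scores = pd.Series(selector.scores_, index=encoded.columns).sort_values(ascending=False)
print(scores.head(10))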
In [1]:
import re, csv, os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sklearn
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
%matplotlib inline
In [2]:
df = pd.read_csv('/Users/dariaulybina/Desktop/georgetown/ml_practice/data/data.csv')
df = df.drop(df.columns[[0]], axis=1)
df.head()
Out[2]:
In [3]:
df.describe()
Out[3]:
In [4]:
cnames = df.columns
print(cnames)
In [5]:
y = df[['classif']]
#X = df[['cap_colour','odor','gill_color','stalk_color_above_ring','stalk_color_below_ring','habitat']]
#print(X.head())
In [6]:
table = pd.crosstab(index=df['classif'], columns="count")
table
Out[6]:
In [7]:
import itertools
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    print(cm)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    return
In [8]:
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns   # list of column names to encode
        self.encoders = {}       # fitted LabelEncoders, keyed by column name

    def fit(self, X, y=None):
        return self  # nothing to learn until transform is called

    def transform(self, X):
        '''
        Transforms the columns of X listed in self.columns using
        LabelEncoder(). If no columns are specified, transforms all
        columns in X. The fitted encoders are kept so that
        inverse_transform can map the integer codes back to the
        original labels.
        '''
        output = X.copy()
        columns = self.columns if self.columns is not None else output.columns
        for col in columns:
            encoder = LabelEncoder()
            output[col] = encoder.fit_transform(output[col])
            self.encoders[col] = encoder
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def inverse_transform(self, X):
        result = X.copy()
        for column, encoder in self.encoders.items():
            result[column] = encoder.inverse_transform(X[column])
        return result
In [9]:
dfx = df.drop(df.columns[[0]], axis=1)
X = MultiColumnLabelEncoder(columns = ['cap_shape', 'cap_surface', 'cap_colour', 'bruises', 'odor',
'gill_attach', 'gill_space', 'gill_size', 'gill_color', 'stalk_shape',
'stalk_surf_above_ring', 'stalk_surf_below_ring',
'stalk_color_above_ring', 'stalk_color_below_ring', 'veil_type',
'veil_color', 'ring_number', 'ring_type', 'spore_print_color',
'population', 'habitat']).fit_transform(dfx)
X
Out[9]:
In [10]:
encoder = LabelEncoder()
y = encoder.fit_transform(df['classif'])
In [11]:
y
Out[11]:
In [12]:
from sklearn.pipeline import Pipeline
encoding_pipeline = Pipeline([
    ('encoding', MultiColumnLabelEncoder())
])
A = encoding_pipeline.fit_transform(X)
A
Out[12]:
In [13]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(A, y, test_size=0.2)
print('X_train Shape:', X_train.shape)
print('X_test Shape:', X_test.shape)
print('y_train Shape:', y_train.shape)
print('y_test Shape:', y_test.shape)
In [14]:
#Random Forest Classifier - results + important features
for i in range(1, 30):
    random_forest = RandomForestClassifier(n_estimators=i, n_jobs=-1)
    random_forest.fit(X_train, y_train)
    pred = random_forest.predict(X_test)
    acc = accuracy_score(y_test, pred)
    print("Accuracy Score {}".format(acc))

#Graph most important features (importances come from the last fitted forest)
importances = random_forest.feature_importances_
features = A.columns[:]
sort_indices = np.argsort(importances)[::-1]
sorted_features = []
for idx in sort_indices:
    sorted_features.append(features[idx])

plt.figure()
plt.bar(range(len(importances)), importances[sort_indices], align='center', color='#4C72B0');
plt.xticks(range(len(importances)), sorted_features, rotation='vertical');
plt.xlim([-1, len(importances)])
plt.title('Feature importance')
plt.grid(False)
Comments on the most important features:
The most important features identified are gill_size, spore_print_color, odor, and bruises, followed by ring_type, gill spacing, the stalk colours, and population. However, I won't simply take the top features identified by the Random Forest classifier; after weighing practical considerations, I will include the features that seem the most straightforward and fastest to spot in a real situation. My decision is based on the analysis done in the previous notebook.
My final list of features: ['odor', 'ring_type', 'population', 'cap_colour']. I will run a few basic models with default hyperparameters to decide on the best model to choose.
In [15]:
data = df[['habitat', 'cap_colour','odor']]
data.head()
Out[15]:
According to the scikit-learn documentation, one-hot encoding is needed when feeding categorical data to linear models and to SVMs with the standard kernels. I will create two encoding pipelines and evaluate a few models with default parameters. If the linear SVC does not work well, I will proceed to try other models, following the scikit-learn cheat sheet for model choice.
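As a quick illustration of what the one-hot step produces, pandas' get_dummies gives an equivalent expansion of the three selected columns. This is a sketch only; the actual encoding below goes through the MultiColumnLabelEncoder/OneHotEncoder pipeline.

# Hedged sketch: one-hot expansion with pandas as a sanity check on the encoding.
# Each categorical level becomes its own 0/1 column, the representation that
# linear models and standard-kernel SVMs expect.
dummies = pd.get_dummies(data, columns=['habitat', 'cap_colour', 'odor'])
print(dummies.shape)              # rows x total number of dummy columns
print(dummies.columns[:5].tolist())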
In [16]:
encoder = LabelEncoder()
y = encoder.fit_transform(df['classif'])
y
Out[16]:
In [17]:
#linear and svm models
from sklearn.preprocessing import OneHotEncoder
encoding_pipeline = Pipeline([
    ('encoding', MultiColumnLabelEncoder()),
    ('one_hot_encoder', OneHotEncoder())
])
B = encoding_pipeline.fit_transform(data)
B
Out[17]:
In [18]:
Xh_train, Xh_test, yh_train, yh_test = train_test_split(B, y, test_size=0.2, random_state=101)
print('X_train Shape:', Xh_train.shape)
print('X_test Shape:', Xh_test.shape)
print('y_train Shape:', yh_train.shape)
print('y_test Shape:', yh_test.shape)
In [19]:
#encoding for trees and forests
encoder = LabelEncoder()
y = encoder.fit_transform(df['classif'])
encoding_pipeline = Pipeline([
    ('encoding', MultiColumnLabelEncoder())
])
C = encoding_pipeline.fit_transform(data)
C.head()
Out[19]:
In [20]:
X_train, X_test, y_train, y_test = train_test_split(C, y, test_size=0.2)
print('X_train Shape:', X_train.shape)
print('X_test Shape:', X_test.shape)
print('y_train Shape:', y_train.shape)
print('y_test Shape:', y_test.shape)
In [21]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
In [22]:
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(Xh_train.shape[0], n_iter=12, test_size=0.2, random_state=0)
In [23]:
from sklearn.svm import SVC
estimator = SVC(kernel='linear')
In [24]:
from sklearn.grid_search import GridSearchCV
import numpy as np
gammas = np.logspace(-6, -1, 10)
classifier = GridSearchCV(estimator=estimator, cv=cv, param_grid=dict(gamma=gammas))
classifier.fit(Xh_train, yh_train)
Out[24]:
In [25]:
print(classifier.best_score_)
In [26]:
from sklearn.learning_curve import learning_curve
title = 'Learning Curves (SVM, linear kernel, $\gamma=%.6f$)' %classifier.best_estimator_.gamma
estimator = SVC(kernel='linear', gamma=classifier.best_estimator_.gamma)
plot_learning_curve(estimator, title, Xh_train, yh_train, cv=cv)
plt.show()
Interpretation of the learning curve: underfitting or overfitting?
First, the training score stays close to its maximum while the validation score could still improve with more training samples, so the curve suggests the model would benefit from additional training data.
Additionally, the curve should indicate whether the estimator suffers more from variance error or from bias error.
According to the documentation, if the two curves are close to each other and both have a low score, the model suffers from underfitting (high bias). Here both the cross-validation and training scores are quite high; they are not close to each other at the start, but they clearly converge toward a higher score by the end. The training curve does not score significantly better than the validation curve and the gap is not large, so the model does not appear to suffer from overfitting (high variance). A gap of more than about 0.1 would be a warning sign, and since the scores here sit near that boundary, overfitting could still become an issue.
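To make the 0.1 rule of thumb concrete, the gap at the largest training size can be read straight out of the learning_curve arrays. A hedged sketch reusing the estimator, cv, Xh_train, and yh_train defined above:

# Hedged sketch: quantify the train/validation gap behind the ~0.1 heuristic.
train_sizes_abs, train_scores, test_scores = learning_curve(
    estimator, Xh_train, yh_train, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))

final_train = train_scores.mean(axis=1)[-1]
final_cv = test_scores.mean(axis=1)[-1]
print("Final training score:         %.3f" % final_train)
print("Final cross-validation score: %.3f" % final_cv)
print("Gap: %.3f (values above ~0.1 would suggest overfitting)" % (final_train - final_cv))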
In [27]:
classifier.score(Xh_test, yh_test)
Out[27]:
In [28]:
estimator.fit(Xh_train, yh_train)
predh = estimator.predict(Xh_test)
print(classification_report(yh_test, predh))
acch = accuracy_score(yh_test, predh)
print("Accuracy Score: {} \n".format(acch))
print(confusion_matrix(yh_test, predh))
print('\n' + '\n')
Comments:
My training set has 6499 instances and 26 features in total (dummy variables standing in for the 3 categorical features); the test set has 1625 instances.
The grid search reports gamma = 0.000001 as the best value, with an accuracy score of 99.2%. Note, however, that SVC's linear kernel does not use gamma at all, so every candidate value scores essentially the same.
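Because the linear kernel ignores gamma, a search over the regularization parameter C would be more informative. A hedged sketch reusing Xh_train, yh_train, and the ShuffleSplit cv from above (in newer scikit-learn versions GridSearchCV lives in sklearn.model_selection rather than sklearn.grid_search):

# Hedged sketch: grid-search C, the hyperparameter the linear kernel actually uses.
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

c_search = GridSearchCV(SVC(kernel='linear'), param_grid={'C': np.logspace(-2, 2, 5)}, cv=cv)
c_search.fit(Xh_train, yh_train)
print(c_search.best_params_, c_search.best_score_)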
In [29]:
from sklearn.cross_validation import train_test_split
models = {'Logistic Regression': LogisticRegression(),
          'Decision Tree': DecisionTreeClassifier(),
          'Random Forest': RandomForestClassifier(n_estimators=30),
          'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
          'Linear SVM': SVC(kernel="linear", C=1)}

for k, v in models.items():
    mod = v
    if k == 'Logistic Regression' or k == 'Linear SVM':
        mod.fit(Xh_train, yh_train)
        predh = mod.predict(Xh_test)
        print('Results for: ' + str(k) + '\n')
        print(classification_report(yh_test, predh))
        acch = accuracy_score(yh_test, predh)
        print("Accuracy Score: {} \n".format(acch))
        print(confusion_matrix(yh_test, predh))
        print('\n' + '\n')
    else:
        mod.fit(X_train, y_train)
        pred = mod.predict(X_test)
        print('Results for: ' + str(k) + '\n')
        print(classification_report(y_test, pred))
        acc = accuracy_score(y_test, pred)
        print("Accuracy Score: {} \n".format(acc))
        print(confusion_matrix(y_test, pred))
        print('\n' + '\n')
Comments:
Most of the models' accuracy scores are very high and close to one another: the models fit well even when only 3 features are used. The best accuracy, 99.6%, was shared by the Random Forest, K-Nearest Neighbors, and Decision Tree classifiers. Logistic Regression performed slightly worse than the Linear SVM, with respective scores of 98.4% and 99.2% (see the summary sketch below).
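To compare the runs at a glance, the per-model accuracies can also be collected into a small sorted table instead of being read out of the individual reports. A hedged sketch reusing the models dictionary and the train/test splits above:

# Hedged sketch: one accuracy score per model, collected into a sorted summary table.
rows = []
for name, model in models.items():
    if name in ('Logistic Regression', 'Linear SVM'):
        model.fit(Xh_train, yh_train)                        # one-hot encoded split
        score = accuracy_score(yh_test, model.predict(Xh_test))
    else:
        model.fit(X_train, y_train)                          # label-encoded split
        score = accuracy_score(y_test, model.predict(X_test))
    rows.append({'model': name, 'accuracy': score})

print(pd.DataFrame(rows).sort_values('accuracy', ascending=False))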
Next, I will manually tune the Decision Tree Classifier and explain the resulting classification report and confusion matrix.
In [30]:
decision_tree = DecisionTreeClassifier(max_depth= 10) # or criterion = "entropy", default = gini
decision_tree.fit(X_train, y_train)
expected = y_test
predicted = decision_tree.predict(X_test)
print("Decision Tree Classifier - results: \n")
acc = accuracy_score(y_test, predicted)
#print(confusion_matrix(y_test, predicted))
print(classification_report(expected, predicted))
print("Accuracy score: {}".format(acc))
print('Training Score: %.2f%%' % (decision_tree.score(X_train, y_train) * 100))
print('Test Score: %.2f%%' % (decision_tree.score(X_test, y_test) * 100))
plot_confusion_matrix(confusion_matrix(y_test, predicted), classes=["edible", "poisonous"], title="Decision Tree confusion matrix")
#help(DecisionTreeClassifier)
Comments:
The classifier made a total of 1,625 predictions.
Out of those 1,625 cases, the classifier predicted "1" (poisonous) 761 times and "0" (edible) 864 times.
In reality, 768 specimens were poisonous and 857 were not.
True positives (predicted poisonous, actually poisonous): 761
True negatives (predicted edible, actually edible): 857
False positives (predicted poisonous, actually edible): 0 (Type I error)
False negatives (predicted edible, actually poisonous): 7 (Type II error)
Accuracy: overall, how often is the classifier correct? In 99.6% of cases.
Recall (true positive rate): when a mushroom is actually poisonous, how often does it predict poisonous? About 99.1% (761 of 768).
Precision: when it predicts poisonous, how often is it correct? 100% of the time (no false positives).
This is not an ideal model in terms of error direction: if errors are inevitable, in this mushroom setting we would rather have false positives than this many false negatives. In other words, it is better to predict that a mushroom is poisonous (when it is actually edible) and not eat it than to predict that it is edible when it is actually poisonous. Minimizing Type II error is the right goal in this situation.
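The same figures can be pulled straight from the confusion matrix instead of being read off by hand. A hedged sketch using the expected and predicted arrays from the cell above; it assumes the LabelEncoder mapped edible to 0 and poisonous to 1, which is its alphabetical default for 'e'/'p':

# Hedged sketch: derive the headline numbers directly from the confusion matrix.
from sklearn.metrics import precision_score, recall_score

tn, fp, fn, tp = confusion_matrix(expected, predicted).ravel()  # binary case
print("TP=%d  TN=%d  FP=%d  FN=%d" % (tp, tn, fp, fn))
print("Accuracy : %.3f" % accuracy_score(expected, predicted))
print("Recall   : %.3f" % recall_score(expected, predicted))     # poisonous actually caught
print("Precision: %.3f" % precision_score(expected, predicted))  # poisonous calls that were right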
To display and save the decision tree, save the output as a .dot file.
After getting the .dot file, open it in any text editor and paste its contents into http://webgraphviz.com/ to see the picture.
Take a screenshot of the output and save it in .png format.
Note: there are other ways of displaying the tree, but they didn't work for me, for example:
system("dot -Tpng U:/tree2.dot -o U:/tree2.png")
graph_from_dot_data(out.getvalue()).write_pdf("U:/tree2.pdf")
In [31]:
from sklearn import tree
dotfile = open('/Users/dariaulybina/Desktop/georgetown/ml_practice/data/tree.dot', 'w')
tree.export_graphviz(decision_tree,out_file=dotfile)
dotfile.close()
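If the graphviz Python package and the Graphviz binaries are installed (an assumption; neither is used elsewhere in this notebook), the .dot file exported above can also be rendered to a PNG without leaving Python:

# Hedged sketch: render the exported .dot file locally with the graphviz package.
import graphviz

with open('/Users/dariaulybina/Desktop/georgetown/ml_practice/data/tree.dot') as f:
    dot_source = f.read()

# Writes tree.png alongside the notebook; requires the dot binary on the PATH.
graphviz.Source(dot_source).render('tree', format='png', cleanup=True)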
In [33]:
#Randomized Parameters Optimization
knn = KNeighborsClassifier(n_neighbors=5)
from sklearn.grid_search import RandomizedSearchCV
params = {'n_neighbors': range(1,5), 'weights': ['uniform','distance']}
research = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4,n_iter=8,random_state=5)
research.fit(X_train, y_train)
print(research.best_score_)
In [35]:
print(research.best_estimator_.n_neighbors)
The K-Nearest Neighbors classifier fit very well, with the optimal n_neighbors parameter equal to 1, as found by the randomized parameter search. Since most of the chosen models perform well even with only 3 features as predictors, I will use the Linear SVC in the next notebook to explore different encodings and data splits, and will finally apply the classifier in the application to predict the classification of a mushroom from user-input parameters.
In [ ]: