In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
#Reading the dataset in a dataframe using Pandas
df = pd.read_csv("../data/master.csv")
#Print first observations
df.head()
Out[3]:
Throughout the Machine Learning part of this project we will be using scikit-learn, an open source machine learning library for the Python programming language.
Recall our arbitrary division of the city into 5 "Areas" spanning 36 Zip Codes: Central, SE, NE, SW, NW
In [4]:
# Focus on the main part of the city
SE_zip = (78744, 78747, 78719, 78741)
Central_zip = (78701, 78702, 78703, 78705, 78721, 78723, 78712, 78751, 78756)
NE_zip = (78752, 78753, 78754)
NW_zip = (78757, 78758, 78727, 78722, 78729, 78717, 78750, 78759, 78726, 78730, 78731, 78732)
SW_zip = (78704, 78745, 78748, 78739, 78749, 78735, 78733, 78746)
ACL_zip = SE_zip + Central_zip + NE_zip + NW_zip + SW_zip
len(ACL_zip)
Out[4]:
We won't need the numerical data here, so let's delete those columns to declutter the Notebook:
In [5]:
del df['Population']
del df['Med_Income']
del df['Home_Ownership']
Let us prepare our DataFrame for binary classification models by assigning to each restaurant an Area, a Letter Grade, and a Pass/Fail status according to its score (as discussed in the README file), as well as encoding dummy variables:
In [6]:
# City areas
# Assign Area to each Restaurant, given its Zip Code
mask_SE = df.Zip_Code.isin(SE_zip)
mask_Central = df.Zip_Code.isin(Central_zip)
mask_NE = df.Zip_Code.isin(NE_zip)
mask_NW = df.Zip_Code.isin(NW_zip)
mask_SW = df.Zip_Code.isin(SW_zip)
df['Area'] = 'Austin'
df.loc[mask_SE, 'Area'] = 'SE Austin'
df.loc[mask_Central, 'Area'] = 'Central Austin'
df.loc[mask_NE, 'Area'] = 'NE Austin'
df.loc[mask_NW, 'Area'] = 'NW Austin'
df.loc[mask_SW, 'Area'] = 'SW Austin'
In [7]:
# Assign Pass/Fail status and Letter Grades to each Restaurant
df['Status'] = 'Fail'
df['Letter_Grade'] = 'F'
mask_pass = df['Score'] >= 70
mask_A = df['Score'] >= 90
mask_B = (df['Score'] >= 80) & (df['Score'] < 90)
mask_C = (df['Score'] >= 70) & (df['Score'] < 80)
df.loc[mask_pass, 'Status'] = 'Pass'
df.loc[mask_A, 'Letter_Grade'] = 'A'
df.loc[mask_B, 'Letter_Grade'] = 'B'
df.loc[mask_C, 'Letter_Grade'] = 'C'
In [8]:
df.head(3)
Out[8]:
In [9]:
# Assign Dummy Variables for Binary Classifiers
# create five dummy variables using get_dummies, then exclude the first dummy column
# Note that the Area_Central variable is redundant
area_dummies = pd.get_dummies(df.Area, prefix='Area').iloc[:, 1:]
# create two dummy variables using get_dummies, then exclude the first dummy column
# Note that the Outcome_Fail variable is redundant
outcome_dummies = pd.get_dummies(df.Status, prefix='Status').iloc[:, 1:]
# create four dummy variables using get_dummies, then exclude the first dummy column
# Note that the Grade_A variable is redundant
letter_dummies = pd.get_dummies(df.Letter_Grade, prefix='Grade').iloc[:, 1:]
Let us check those DataFrames before we merge them into df:
In [10]:
area_dummies.head()
Out[10]:
In [11]:
outcome_dummies.head()
Out[11]:
In [12]:
letter_dummies.head()
Out[12]:
Let us now concatenate those created DataFrames, keeping a copy of the original DataFrame just in case:
In [13]:
# Concatenate
# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
data = pd.concat([df, area_dummies, outcome_dummies, letter_dummies], axis=1)
data.head()
Out[13]:
With 22 columns, the DataFrame has already become hard to visualize, so we need to be careful with our operations from now on.
We will first present an ill-fated approach in order to illustrate how the logistic regression algorithm behaves on our dataset, to discuss the evident class imbalance problem, and to attempt some techniques that mitigate its effects. A naive first approach is to use some of the features to predict whether a restaurant will pass the Health Inspection (Score of 70 or higher) or not.
Here is the issue that makes this problem so hard to deal with:
In [14]:
num_pass = data[data["Status"]=='Pass'].count()
num_fail = data[data["Status"]=='Fail'].count()
In [15]:
print "Passing Inspections = {0}".format(num_pass[0])
print "Failing Inspections = {0}".format(num_fail[0])
Out of 15936 rows in total, 99% are passing restaurants (those that have scored 70 or higher) and only 1% are failing restaurants (69 or lower). This is the class imbalance problem.
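As a quick sanity check, the class proportions can also be read off directly from the DataFrame (a minimal sketch using the same data DataFrame as above):
# Relative frequencies of the two classes (Pass vs. Fail)
print data['Status'].value_counts(normalize=True)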
The two usual approaches for addressing this are: (1) resampling the data, e.g. oversampling the minority class, and (2) having the model weight the classes differently.
I will try both approaches, starting from the latter: scikit-learn's Logistic Regression model has the attribute class_weight, which takes the default value None but can optionally take the value 'balanced'.
From the documentation: "The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified."
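To make the quoted formula concrete, here is a minimal sketch of how those balanced weights could be computed by hand (assuming the Status_Pass dummy constructed above, with Pass encoded as 1 and Fail as 0):
# Reproduce the "balanced" weights: n_samples / (n_classes * np.bincount(y))
y_status = data['Status_Pass'].values.astype(int)
n_samples = float(len(y_status))
n_classes = len(np.unique(y_status))
balanced_weights = n_samples / (n_classes * np.bincount(y_status))
print dict(zip(np.unique(y_status), balanced_weights))  # the rare "Fail" class (0) gets the larger weight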
Let us start by trying to predict whether a restaurant will pass or fail inspection given the part of town (area) it is at:
In [16]:
from sklearn import linear_model
logm = linear_model.LogisticRegression(class_weight='balanced')
feature_cols = ['Area_NE Austin', 'Area_NW Austin', 'Area_SE Austin', 'Area_SW Austin']
X = data[ feature_cols ].values
y = data['Status_Pass'].values
In [17]:
# Check the shapes of the X and y vectors:
print X.shape
print y.shape
In [18]:
logm.fit(X,y)
Out[18]:
In [19]:
print logm.predict(X) # The array of our model's predictions for each area
print logm.classes_ # 0: Fail, 1: Pass
print logm.predict_proba(X) # Probabilities for each entry to be assigned a "Fail" or a "Pass"
In [20]:
logm.score(X, y)
Out[20]:
Our classifier score is barely better than a coin toss. It seems we will have to improve our approach.
In [21]:
# examine the coefficients
pd.DataFrame(zip(data[feature_cols], np.transpose(logm.coef_)))
Out[21]:
The coefficients above aren't necessarily significant, but they do show associations with an increased or decreased likelihood of passing the health inspection. In particular, a restaurant located in Northwest Austin is more likely to pass the test, whereas a restaurant in Southeast Austin is less likely to do so.
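Since those coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are a little easier to read; a minimal sketch using the fitted model above:
# Convert log-odds coefficients to odds ratios for each area dummy
print pd.Series(np.exp(logm.coef_.ravel()), index=feature_cols)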
The real test of a good model is to train it on a training set and then evaluate it on data it has not seen. Here is where the rubber meets the road.
In [22]:
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.8)
In [23]:
logm.fit(X_train,y_train)
Out[23]:
In [24]:
# predict class labels for the test set
predicted = logm.predict(X_test)
print predicted
# generate class probabilities
probs = logm.predict_proba(X_test)
print probs
In [25]:
logm.score(X_test, y_test)
Out[25]:
Holding out a test set (we have not yet performed full k-fold cross-validation) has not really improved the score of our classifier.
Even though we are far from creating a foolproof model, we can definitely discern some interesting trends.
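For completeness, here is a minimal sketch of actual k-fold cross-validation (5 folds) with the same model and features; given the discussion above, the mean accuracy should not be expected to change much:
# 5-fold cross-validated accuracy for the class-weighted logistic regression
cv_scores = cross_validation.cross_val_score(logm, X, y, cv=5)
print "Mean 5-fold CV accuracy: {0:.3f}".format(cv_scores.mean())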
In [26]:
# Evaluation Metrics
from sklearn import metrics
print metrics.accuracy_score(y_test, predicted)
print metrics.roc_auc_score(y_test, probs[:, 1])
The ROC AUC of 0.506 is very low: it tells us that our model predicts a restaurant's inspection outcome (Pass/Fail) only marginally better than a coin toss (which would correspond to an AUC of 0.5).
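To see where that number comes from, we could also plot the ROC curve itself; a minimal sketch using the test-set probabilities computed above:
# ROC curve for the "Pass" class on the held-out test set
fpr, tpr, roc_thresholds = metrics.roc_curve(y_test, probs[:, 1])
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Coin toss')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right');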
In [27]:
# Confusion Matrix and Classification Report:
# Confusion Matrix
cm = metrics.confusion_matrix(y_test, logm.predict(X_test), labels=[1,0]) #labels = [1,0] will change the order: T,F
cm
Out[27]:
In [28]:
print metrics.confusion_matrix(y_test, predicted, labels=[1,0])
print metrics.classification_report(y_test, predicted,labels=[1,0])
In [29]:
# Visualize Confusion Matrix:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
plot_confusion_matrix(cm)
There really isn't any reason for us to delve further into this model: the misclassification rate is substantial and, as we mentioned, it isn't doing significantly better than a coin toss.
Let us now quickly try a different strategy to mitigate the effects of class imbalance: oversampling the minority class (failing restaurants).
In [30]:
# Grab DataFrame rows where a column has certain values
value_list_f = ['Fail']
failures = data[data.Status.isin(value_list_f)]
value_list_p = ['Pass']
passes = data[data.Status.isin(value_list_p)]
In [31]:
# Let us pick two random samples of 100 rows each from the "failures" and "passes" dataframes:
rows_f = np.random.choice(failures.index.values, 100)
rows_p = np.random.choice(passes.index.values, 100)
In [32]:
# Convert to dataframes:
sampled_f = failures.ix[rows_f]
sampled_p = passes.ix[rows_p]
In [33]:
# Concatenate into "training" dataframe:
training_df = pd.concat([sampled_p, sampled_f], axis=0)
print "The training DataFrame comprises {0} rows".format(len(training_df))
In [34]:
# Train our logistic regression model on this data:
logm = linear_model.LogisticRegression() # We don't need balanced class weights anymore
feature_cols = ['Area_NE Austin', 'Area_NW Austin', 'Area_SE Austin', 'Area_SW Austin']
X_tr = training_df[ feature_cols ].values
y_tr = training_df['Status_Pass'].values
logm.fit(X_tr, y_tr)
Out[34]:
In [36]:
print "The model's score on the training data is {0}%".format(logm.score(X_tr, y_tr)*100)
This doesn't sound very promising. Regardless, let us construct the testing DataFrame (comprising all rows that have not already been seen by the model):
In [37]:
# Create a DataFrame with all rows that are in data but NOT in training_df:
df1 = data
df2 = training_df
common_cols = list(set(df1.columns) & set(df2.columns))
testing_df = pd.merge(df1,df2,on=common_cols,how="outer",indicator=True)
testing_df = testing_df.loc[testing_df['_merge'] == 'left_only']
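A simpler alternative that should give essentially the same testing DataFrame here is to keep every row of data whose index label was not sampled into training_df; a minimal sketch, assuming the index labels uniquely identify rows:
# Alternative: exclude sampled index labels directly
testing_df_alt = data[~data.index.isin(training_df.index)]
print len(testing_df_alt)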
In [38]:
X_ts = testing_df[ feature_cols ].values
y_ts = testing_df['Status_Pass'].values
In [39]:
# Model Validation:
# predict class labels for the test set
predicted = logm.predict(X_ts)
# generate class probabilities
probs = logm.predict_proba(X_ts)
# Evaluation Metrics:
print metrics.accuracy_score(y_ts, predicted)
print metrics.roc_auc_score(y_ts, probs[:, 1])
The results are no better than before, and possibly more disappointing. It seems that the part of town a restaurant is located in is a terrible predictor of its broad Health Inspection outcome (Pass/Fail).
From now on I am changing my approach to something more drastic. Instead of aiming to classify health inspection outcomes as "Passes" or "Fails", I will implement a different kind of classification: a resulting letter grade of "A" (which implies a Health Inspection score of 90 or higher) vs. everything else (scores of 89 and below).
Before giving up on Area as an input feature for my models, I will try once more:
In [40]:
num_A = data[data["Letter_Grade"]=='A'].count()
num_other = data[data["Letter_Grade"]!='A'].count()
In [41]:
print "Pristine Restaurants (90 or above) = {0}".format(num_A[0])
print "Restaurants scoring 89 or below = {0}".format(num_other[0])
The distribution between the classes is much nicer now! Let us construct a new binary variable that separates "Pristine" restaurants (the same ones obtaining a grade "A", which was the redundant dummy variable in the previous treatment) from everything else:
In [42]:
# I need the "Grade_A" column:
def pristine(x):
    if x >= 90:
        return 1
    else:
        return 0
In [43]:
data['Pristine'] = data['Score'].apply(pristine)
data.head(2)
Out[43]:
Let us save this DataFrame as a csv file because we are going to use it again in the next Notebook (Text mining):
In [44]:
data.to_csv('../data/data.csv', index=False)
In [45]:
# Off to modeling:
logm = linear_model.LogisticRegression()
feature_cols = ['Area_NE Austin', 'Area_NW Austin', 'Area_SE Austin', 'Area_SW Austin']
X = data[ feature_cols ].values
y = data['Pristine'].values
In [46]:
# Cross-validation:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.75)
In [47]:
# Fitting the model:
logm.fit(X_train, y_train)
Out[47]:
In [48]:
# Model Validation:
# predict class labels for the test set
predicted = logm.predict(X_test)
# generate class probabilities
probs = logm.predict_proba(X_test)
# Evaluation Metrics:
print metrics.accuracy_score(y_test, predicted)
print metrics.roc_auc_score(y_test, probs[:, 1])
Even though our goal has changed (we are seeking to predict whether a restaurant will ace the health inspection) the results of our model are still disappointing. It is becoming more and more evident that Area isn't a good predictor.
Our final attempt will be to "strong-arm" our model by manually changing the logistic regression classification threshold from 0.5 to something much higher, like 0.8.
Hopefully this will eliminate some of the misclassifications. In order to manipulate our model in that way, we will use statsmodels instead of scikit-learn:
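As an aside, the same threshold manipulation could also be done with scikit-learn by thresholding the probabilities from predict_proba rather than calling predict; a minimal sketch with a hypothetical 0.8 threshold:
# Sketch: predict "Pristine" only when the estimated probability exceeds a custom threshold
logm.fit(X, y)
proba_a = logm.predict_proba(X)[:, 1]
custom_threshold = 0.8
print "Restaurants predicted to score an A at threshold {0}: {1}".format(
    custom_threshold, (proba_a > custom_threshold).astype(int).sum())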
In [49]:
import statsmodels.api as sm
logit = sm.Logit(data['Pristine'], data[ feature_cols ])
In [50]:
# fit the model
model_result = logit.fit()
In [51]:
predicted = model_result.predict(data[ feature_cols ])
In [52]:
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
for i in thresholds:
    predicted_choice = (predicted > i).astype(int)
    print "The number of restaurants predicted to pass inspection (grade A) with logistic threshold = {0} is {1}".\
        format(i, sum(predicted_choice))
Raising the logistic regression classifier threshold to anything higher than 60% results in zero restaurants predicted to pass inspection with a grade "A", given the area.
In [53]:
print model_result.summary()
We get a great overview of the coefficients of the model, how well those coefficients fit, the overall fit quality, and several other statistical measures.
The result object also lets us isolate and inspect parts of the model output. The confidence intervals give an idea of how robust the model's coefficients are. Once again, we might not be getting anything really important from this classifier, but we can definitely see the trend of restaurants located in my ad hoc "Northwest Austin" area tending to do better.
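For example, the confidence intervals for the coefficients can be pulled straight out of the result object; a minimal sketch:
# 95% confidence intervals for the fitted coefficients (log-odds scale)
print model_result.conf_int()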
In [54]:
# Odds ratio:
print np.exp(model_result.params)
This tells us how switching one of the (dummy) variables from 0 to 1 affects the odds of scoring an "A": an odds ratio of r multiplies the odds by r. For example, we can expect the odds of scoring an "A" to increase by about 212% if the restaurant is located in Northwest Austin. However, we should be cautious and interpret those odds ratios only relative to each other.
Regardless of the model used, the overall prediction accuracy for the city is close to 60%, which is roughly what we would get by always predicting the majority class; this implies that the area is probably not a useful predictor.
The poor performance of this factor is likely due to the fact that each area is very large and consists of many diverse neighborhoods (zip codes), which means there is too much variation within each area.
The greatest limitation of our DataFrame, and of our attempts to apply supervised learning techniques to predict the inspection score, is the absence of continuous (or categorical) features specific to each restaurant. Let us attempt a different binary classifier, k Nearest Neighbors, on a DataFrame that comprises the averaged scores of each Zip Code. Again, we need to be especially cautious because we only have 35 rows now:
In [55]:
average_scores = data.groupby('Zip_Code').mean()
len(average_scores)
Out[55]:
In [56]:
average_scores.head()
Out[56]:
We will need to create another dummy column for "Pristine" since averaging removed the 1-0 dichotomy. We do get some interesting average scores for each zip code though.
In [57]:
def dummy_pristine(x):
    if x >= 0.50:
        return 1
    else:
        return 0
average_scores['Pristine_Dummy'] = average_scores['Pristine'].apply(dummy_pristine)
average_scores.head()
Out[57]:
In [58]:
from matplotlib.colors import ListedColormap
from sklearn import feature_selection
from sklearn.neighbors import KNeighborsClassifier as KNN
In [59]:
feature_cols = ['Area_NE Austin', 'Area_NW Austin', 'Area_SE Austin', 'Area_SW Austin']
X = average_scores[ feature_cols ].values
y = average_scores['Pristine_Dummy'].values
clf = KNN(3)
clf.fit(X, y)
print clf.score(X, y)
A score of 91% on 35 rows of data isn't exciting; since we scored the model on the very data it was trained on, this is almost certainly overfitting. Even though we only have a few points, we can attempt a train/test split and cross-validation:
In [60]:
# Cross-validation:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.8)
In [61]:
clf.fit(X_train, y_train)
print clf.score(X_test, y_test)
This looks slightly better.
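With so few rows, a single train/test split is quite noisy; a hedged alternative is leave-one-out cross-validation, where each Zip Code is held out once. A minimal sketch using the same old-style cross_validation module used elsewhere in this Notebook:
# Leave-one-out cross-validation for the KNN classifier on the zip-code-level data
loo = cross_validation.LeaveOneOut(len(y))
loo_scores = cross_validation.cross_val_score(KNN(3), X, y, cv=loo)
print "Mean leave-one-out accuracy: {0:.3f}".format(loo_scores.mean())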
In [62]:
# Let's figure out what our most powerful features are.
F = feature_selection.f_classif(X, y)[0]
title = 'Health Inspection Features with F-Values'
plt.figure(figsize=(13, 8))
ax = sns.barplot(x=feature_cols,y=F)
ax.set_title(title)
ax.set(xlabel="Features");
The ANOVA F-values computed above (via f_classif, independently of the k Nearest Neighbors classifier) highlight another interesting point: the importance of "Southeast Austin" in separating "pristine" restaurants from the rest.
In [63]:
g = sns.PairGrid(average_scores, vars=["Latitude", "Longitude"],
hue="Pristine", aspect=1, size=5)
g.map(plt.scatter);
In [64]:
y = average_scores['Pristine']
a = list(y.unique())
a = [round(n, 1) for n in a]
In [65]:
average_scores['Round_Pristine'] = a
In [66]:
g = sns.PairGrid(average_scores, vars=["Latitude", "Longitude"],
hue="Round_Pristine", aspect=1, size=5)
g.map(plt.scatter);
g = g.add_legend()
We have plotted those averages on this longitude/latitude plane that resembles a map. The hue of the data points corresponds to "Pristine", which after averaging is the fraction of restaurants in each Zip Code that scored 90 or above.
In [67]:
r = set(a)
len(r) # The number of unique "Rounded Pristine" scores, we will use that to visualize decision boundaries:
Out[67]:
In [68]:
h = .02 # step size in the mesh
In [69]:
# Create color maps
col_map = sns.color_palette("Paired")
sns.palplot(col_map)
cmap_light = ListedColormap(['#A6CEE3', '#AFDD8A', '#FA9897'])
cmap_bold = ListedColormap(['#2078B4', '#35A12E', '#E31A1C'])
In [70]:
print a
In [71]:
features = ['Latitude', 'Longitude']
X_ = average_scores[features].values
r =list(r)
y_ = [r.index(v) for v in a] # Encode the rounded scores to integers
In [72]:
print y_
In [73]:
clf = KNN(2, weights='uniform')
clf.fit(X_, y_)
Out[73]:
In [74]:
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = X_[:, 0].min() - 1, X_[:, 0].max() + 1
y_min, y_max = X_[:, 1].min() - 1, X_[:, 1].max() + 1
In [75]:
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
In [76]:
# Put the result into a color plot
Z = Z.reshape(xx.shape)
In [78]:
# Plot also the training points
plt.figure(figsize=(13,8))
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
plt.scatter(X_[:, 0], X_[:, 1], c=y_, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("5-Class classification (k = %i, weights = '%s')"
% (2, 'uniform'));
We only got 3 classes in the decision regions when running our KNN algorithm with k = 2 (and even fewer classes if we raise k); this makes sense because we have a very small number of points to classify and they all seem clustered together.
Let us take a step back and remember the prerequisites for a successful k Nearest Neighbors binary classifier: a reasonably large number of observations, numerical features on which a distance metric is meaningful, and classes that are not severely imbalanced.
None of those prerequisites is really satisfied here, so choosing a KNN classifier was a bad idea, apart from the nice scatter plot.
Going back to our "data" DataFrame, let us now attempt to implement a decision tree as a binary classifier
In [79]:
data.head(3)
Out[79]:
In [80]:
from sklearn import tree
feature_cols = ['Area_NE Austin', 'Area_NW Austin', 'Area_SE Austin', 'Area_SW Austin']
X = data[ feature_cols ].values
y = data['Pristine'].values
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
y_pred = clf.predict(X)
print "Number of mislabeled points : %d" % (y != y_pred).sum()
print "Classifier score: %d" % clf.score(X, y)
In [81]:
len(y)
Out[81]:
We have mislabeled 6022 out of the 16113 points, which corresponds to a classifier score of roughly 63%.
Let us attempt a train/test split and cross-validation:
In [82]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=.25)
clf.set_params(min_samples_leaf=4)
clf.set_params(max_depth=3)
clf.fit(X_train, y_train)
print metrics.confusion_matrix(y_train, clf.predict(X_train))
print metrics.classification_report(y_train, clf.predict(X_train))
print "The score of the Decision Tree classifier on the training data is {0}".format(clf.score(X_train,y_train))
print ""
print metrics.confusion_matrix(y_test, clf.predict(X_test))
print metrics.classification_report(y_test, clf.predict(X_test))
print "The score of the Decision Tree classifier on the test data is {0}".format(clf.score(X_test,y_test))
The zeros in the $[0][0]$ element of the confusion matrices are really bad news: they mean that our tree classifier is incapable of correctly classifying any true positives (i.e. restaurants scoring an "A").
We can still try to visualize our tree, which does have some merit due to its 63.2% classification score, but it is becoming more and more evident that this broad area division is a poor feature for our classification purposes.
In [83]:
from sklearn.externals.six import StringIO
with open("pristine.dot", 'w') as f:
f = tree.export_graphviz(clf, out_file=f, feature_names=feature_cols)
In [84]:
!dot -Tpng pristine.dot > pristine.png
In [86]:
# Even a random forest comprising 250 trees yields the exact same classifier score:
from sklearn.ensemble import RandomForestClassifier
## DecisionTreeClassifier
dtree = tree.DecisionTreeClassifier(max_depth=None, min_samples_split=1,
random_state=0)
scores = cross_validation.cross_val_score(dtree, X_test, y_test)
print "DecisionTreeClassifier:", scores.mean()
## RandomForestClassifier
forest = RandomForestClassifier(n_estimators=250,
max_features=4, random_state=0, max_depth=None)
scores = cross_validation.cross_val_score(forest, X_test, y_test)
print "RandomForestClassifier:", scores.mean()
Decision Trees and Random Forests give us some interesting information concerning the relative importance of their input features, and we can visualize those importances:
In [87]:
importances = forest.fit(X, y).feature_importances_
std = np.std([est.feature_importances_ for est in forest.estimators_],
             axis=0)  # "est" avoids shadowing the imported tree module
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("{0}. feature {1} ({2})".format(f + 1, feature_cols[indices[f]], importances[indices[f]]))
# Plot the feature importances of the forest
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), [feature_cols[i] for i in indices], rotation=20)
plt.xlim([-1, X.shape[1]])
plt.show()
According to our Random Forest model, the most important feature is a restaurant's location in Northeast Austin. However, given the terrible performance of the classifier in actually finding any "pristine" restaurants, we can safely ignore this result.
Our final attempt before we give up trying to classify restaurants given the (ad hoc) area of town they are located in is a support vector machine classifier:
In [88]:
from sklearn import svm
In [95]:
svc = svm.SVC(kernel='linear')
X = data[ feature_cols ].values
y = data['Pristine'].values
svc.fit(X, y)
Out[95]:
In [90]:
from matplotlib.colors import ListedColormap
# Create color maps for 2-class classification problem
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
def plot_estimator(estimator, X, y):
    estimator.fit(X, y)
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.axis('tight')
    plt.axis('off')
    plt.tight_layout()
In [92]:
X_ = data[['Area_NW Austin', 'Area_SE Austin']].values # picking two "important" features
y_ = data['Score'].values
plot_estimator(svc, X_, y_)
Visualizing the decision boundaries of our SVC yields a terrible picture, and I assume it is because of the categorical nature of both the features and the output. Let us score it instead:
In [93]:
print X.shape
print y.shape
In [96]:
svc.score(X,y)
Out[96]:
The support vector classifier yielded a slightly better score when run on the entire data set. Let us try a train/test split and cross-validation to make sure we are not overfitting:
In [100]:
# Cross-validation:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.8)
In [101]:
svc.fit(X_train, y_train)
Out[101]:
In [102]:
print svc.score(X_test, y_test)
This is a rather respectable score: 62.7%.
Interestingly enough, changing the kernel function from 'linear' to 'poly' or 'rbf' doesn't seem to really affect the classifier's performance, as the sketch below illustrates. The same applies to changing the class_weight parameter to 'balanced'.
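To back up that observation, here is a minimal sketch that scores an SVC for each kernel on the same train/test split (the kernel names are the ones scikit-learn uses):
# Compare test accuracy across SVC kernels on the same split
for kernel_name in ['linear', 'poly', 'rbf']:
    clf_k = svm.SVC(kernel=kernel_name)
    clf_k.fit(X_train, y_train)
    print "Kernel {0}: test accuracy {1:.3f}".format(kernel_name, clf_k.score(X_test, y_test))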
Remember that even though the Support Vector Classifier has given us the most promising results (decision trees and the random forest failed to classify ANY restaurant as "pristine"), the price we pay is interpretability, since SVMs are largely "black boxes". Our firm conviction is that Area is not a good predictor of a restaurant's health inspection status; our next attempt will therefore be text mining, using a restaurant's name or the name of the street it is located on as a predictor.