DAT-ATX-1 Capstone Project

Nikolaos Vergos, February 2016

nvergos@gmail.com

2c. Supervised Learning: Textual Analysis - Naïve Bayes Classification

We will now shift gears and reformulate our question: we will use textual data (a restaurant's name and its street) as features to predict whether it scored an A at the health inspection. This should yield a more interesting analysis than the weak one we conducted with the categorical area variable.

The outline of the procedure we are going to follow is:

  • Turn a corpus of text documents (restaurant names, street addresses) into feature vectors using a Bag of Words representation,
  • Train a simple text classifier (Multinomial Naive Bayes) on the feature vectors,
  • Wrap the vectorizer and the classifier in a pipeline,
  • Perform cross-validation and model selection on the pipeline.
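
A minimal, self-contained sketch of these four steps on toy data follows; the documents, labels and parameters here are purely illustrative, and the 2016-era sklearn imports match the ones used throughout this notebook.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score  # 2016-era sklearn API

# Toy corpus and labels, purely illustrative
docs = ["Taco House", "Pizza Palace", "Taco Stand", "Burger Palace"]
grades = ["A", "B", "A", "B"]

pipe = Pipeline([
    ('vec', CountVectorizer()),   # step 1: Bag of Words feature vectors
    ('clf', MultinomialNB()),     # step 2: Multinomial Naive Bayes
])                                # step 3: vectorizer + classifier in a pipeline

# Step 4: cross-validate the whole pipeline directly on the raw text
print(cross_val_score(pipe, docs, grades, cv=2).mean())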

0. Import libraries & packages


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(rc={"axes.labelsize": 15});

# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5;
plt.rcParams['axes.grid'] = True;
plt.gray();



1. Import dataset


In [4]:
# Read the dataset into a DataFrame using pandas
df = pd.read_csv("../data/data.csv")

# Print the first few observations
df.head()


Out[4]:
Facility_ID Restaurant_Name Inspection_Date Process_Description Geocode Street City Zip_Code Score Latitude ... Letter_Grade Area_NE Austin Area_NW Austin Area_SE Austin Area_SW Austin Status_Pass Grade_B Grade_C Grade_F Pristine
0 2801996 Mr. Gatti's #118 2015-12-23 Routine Inspection 2121 W PARMER LN, AUSTIN, TX 78758 2121 W PARMER LN AUSTIN 78758 94 30.415649 ... A 0 1 0 0 1 0 0 0 1
1 10385802 Subway 2015-12-23 Routine Inspection 2501 W PARMER LN, AUSTIN, TX 78758 2501 W PARMER LN AUSTIN 78758 98 30.418236 ... A 0 1 0 0 1 0 0 0 1
2 2802274 Baskin Robbins 2015-12-23 Routine Inspection 12407 N MOPAC EXPY, AUSTIN, TX 78758 12407 N MOPAC EXPY AUSTIN 78758 99 30.417462 ... A 0 1 0 0 1 0 0 0 1
3 10964220 JR's Tacos 2015-12-22 Routine Inspection 1921 CEDAR BEND DR, AUSTIN, TX 78758 1921 CEDAR BEND DR AUSTIN 78758 91 30.408322 ... A 0 1 0 0 1 0 0 0 1
4 10778546 Econo Lodge 2015-12-22 Routine Inspection 9100 BURNET RD, AUSTIN, TX 78758 9100 BURNET RD AUSTIN 78758 91 30.374790 ... A 0 1 0 0 1 0 0 0 1

5 rows × 23 columns

String Manipulation: Restaurant Names

Let us start our manipulation of restaurant names:


In [5]:
Names = pd.Series(df['Restaurant_Name'].values)

We will remove all words of three characters or fewer:


In [6]:
import re
# Matches any word of 1-3 characters, together with the non-word characters
# (spaces, apostrophes, '#', ...) immediately preceding it
shortword = re.compile(r'\W*\b\w{1,3}\b')

In [7]:
# Strip the short words from every name
Names = Names.map(lambda name: shortword.sub('', name))

In [8]:
# As an example, "JR's Tacos" is now just " Tacos"

Names[3]


Out[8]:
' Tacos'

In [9]:
# Add a new column into our DataFrame:

df['Names'] = Names

In [10]:
df['Names'].head(10)


Out[10]:
0              . Gatti
1               Subway
2       Baskin Robbins
3                Tacos
4          Econo Lodge
5    Shahi Food Market
6                Speed
7               Jasper
8      Papa John Pizza
9        Subway #43067
Name: Names, dtype: object

In [11]:
df.columns


Out[11]:
Index([u'Facility_ID', u'Restaurant_Name', u'Inspection_Date',
       u'Process_Description', u'Geocode', u'Street', u'City', u'Zip_Code',
       u'Score', u'Latitude', u'Longitude', u'Area', u'Status',
       u'Letter_Grade', u'Area_NE Austin', u'Area_NW Austin',
       u'Area_SE Austin', u'Area_SW Austin', u'Status_Pass', u'Grade_B',
       u'Grade_C', u'Grade_F', u'Pristine', u'Names'],
      dtype='object')

Our first collection of feature vectors will come from the new "Names" column. We are still interested in whether a restaurant falls under the "pristine" category (grade A, score above 90), but below we tackle the more general question of predicting a restaurant's letter grade (A, B, C or F); the binary "pristine" version would work in exactly the same way.

2. Text Classification using a Naive Bayes Classifier

Restaurant Name


In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB

# Turn the text documents into Bag of Words feature vectors.
# Note: min_df=1 keeps every term; min_df=2 would throw out terms
# that appear in only one document.

vectorizer = CountVectorizer(min_df=1, stop_words="english")

X = vectorizer.fit_transform(df['Names'])
y = df['Letter_Grade']

# Hold-out train/test split for evaluation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size = 0.8)

# Fit a classifier on the training set

classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set

print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))


Training score: 70.1%
Testing score: 65.7%

It seems our Multinomial Naive Bayes classifier does significantly better at predicting a restaurant's letter grade from its name than anything we have seen so far with the area-of-town division.
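
To put these percentages in perspective, here is a quick majority-class baseline, a sketch assuming df as loaded above:

# Accuracy of always predicting the most common letter grade
# (about 63% here, per the class support in the reports further below)
baseline = df['Letter_Grade'].value_counts(normalize=True).max()
print("Majority-class baseline: {0:.1f}%".format(baseline * 100))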


In [13]:
# Some information about our Bag of Words feature vector:

In [14]:
len(X_train.data)  # number of stored (nonzero) entries in the sparse matrix


Out[14]:
28768

In [15]:
n_samples, n_features = X_train.shape

In [16]:
n_samples


Out[16]:
12890

In [17]:
n_features


Out[17]:
2704

In [18]:
# The vocabulary of our vectorizer, i.e. the unique words comprising it:

len(vectorizer.vocabulary_)


Out[18]:
2704

In [19]:
vectorizer.get_feature_names()[n_features / 3:n_features / 3 + 10]


Out[19]:
[u'divines',
 u'dizzy',
 u'dobie',
 u'dock',
 u'doddy',
 u'dogs',
 u'dogwood',
 u'dolce',
 u'domain',
 u'domestic']

In [20]:
target_predicted_proba = classifier.predict_proba(X_test)
# predict_proba columns follow classifier.classes_ (sorted labels), so use
# those rather than df['Letter_Grade'].unique() for the column names
percentages = pd.DataFrame(target_predicted_proba, columns=classifier.classes_)

In [21]:
# A table of probabilities, for each of the 3223 restaurants in the test set,
# of being assigned each letter grade:

percentages.head()


Out[21]:
A B C F
0 0.477308 0.351621 0.169233 1.838392e-03
1 0.999599 0.000399 0.000001 4.617823e-07
2 0.443248 0.395690 0.132699 2.836293e-02
3 0.831649 0.092802 0.062244 1.330407e-02
4 0.078374 0.599928 0.313887 7.810455e-03

In [22]:
len(percentages)


Out[22]:
3223

By default, predict assigns each restaurant the grade with the highest predicted probability; for a binary problem this corresponds to a decision threshold of 0.5. Varying that threshold from 0 to 1 would generate a family of binary classifiers addressing all the possible trade-offs between false positive and false negative prediction errors.
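
As a sketch of that idea with the objects fitted above: collapse the four grades to "A versus not A" and sweep a threshold over the predicted probability of an A. The threshold values below are illustrative.

import numpy as np

# P(grade == 'A') for every restaurant in the test set
col_A = list(classifier.classes_).index('A')
proba_A = classifier.predict_proba(X_test)[:, col_A]
true_A = np.asarray(y_test) == 'A'

for t in (0.3, 0.5, 0.7):
    pred_A = proba_A >= t              # predict "A" when P(A) clears t
    fp = np.sum(pred_A & ~true_A)      # false positives at this threshold
    fn = np.sum(~pred_A & true_A)      # false negatives at this threshold
    print("t = %.1f  false positives: %d  false negatives: %d" % (t, fp, fn))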

Let us use a pipeline in order to perform 10-fold cross validation:


In [56]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vec', CountVectorizer(max_df=0.8, ngram_range=(1, 2))),
    ('clf', MultinomialNB(alpha=0.01)),
])
_ = pipeline.fit(df['Names'], df['Letter_Grade'])

In [57]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem

scores = cross_val_score(pipeline, df['Names'],
                         df['Letter_Grade'], cv=10)
scores.mean(), sem(scores)


Out[57]:
(0.52166700734667137, 0.021307751429222626)
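
This mean cross-validated accuracy (about 52%) is below both the earlier hold-out score and the majority-class baseline, suggesting these hyperparameters (bigrams, alpha = 0.01) are not well chosen. The model-selection step promised in the outline could tune them with a grid search along these lines, a sketch using the 2016-era sklearn API with illustrative parameter grids:

from sklearn.grid_search import GridSearchCV  # sklearn 0.17-era module

param_grid = {
    'vec__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    'vec__max_df': [0.8, 1.0],
    'clf__alpha': [0.01, 0.1, 1.0],        # Naive Bayes smoothing strength
}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(df['Names'], df['Letter_Grade'])
print(grid.best_params_, grid.best_score_)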

In [58]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]

feature_names = vec.get_feature_names()
target_names = np.unique(df['Letter_Grade'])  # sorted, matching clf.classes_ and the rows of clf.coef_

feature_weights = clf.coef_

feature_weights.shape


Out[58]:
(4, 5902)

In [59]:
len(feature_names)


Out[59]:
5902

In [60]:
def print_top10(vectorizer, clf, class_labels):
    """Prints the features with the highest coefficient values, per class.

    For MultinomialNB, coef_ mirrors the per-class feature log-probabilities,
    so the "top" features are simply the most frequent terms within each
    class; class_labels must follow the sorted clf.classes_ order.
    """
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        # Indices of the ten largest coefficients for class i
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

In [61]:
print_top10(vectorizer, classifier, target_names)


A: slaughter caves mopac anderson capital congress william parmer lamar blvd
B: mopac riverside springs capital anderson parmer congress william blvd lamar
C: anderson springs stassney parmer riverside oltorf william congress blvd lamar
F: parmer rundberg martin springs cesar congress riverside oltorf blvd lamar
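
Two caveats about this output. The four top-10 lists are nearly the same words in slightly different orders: since MultinomialNB's coefficients are within-class word frequencies, the most common terms dominate every class. More importantly, these are street words rather than restaurant-name tokens, which indicates this cell was last executed against the street-based vectorizer and classifier built further below; re-run directly after In [12], it should list name tokens instead.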

In [62]:
from sklearn.metrics import classification_report

# Note: the pipeline was fitted on the trimmed df['Names'] column; predicting
# on the raw names works because the fitted vocabulary is reused, but
# df['Names'] would be the consistent input here.
predicted = pipeline.predict(df['Restaurant_Name'])

In [63]:
# classification_report lists the classes in sorted order, so pass sorted labels
print(classification_report(df['Letter_Grade'], predicted,
                            target_names=np.unique(df['Letter_Grade'])))


             precision    recall  f1-score   support

          A       0.82      0.80      0.81     10091
          B       0.55      0.49      0.52      4391
          C       0.41      0.50      0.45      1454
          F       0.17      0.54      0.26       177

avg / total       0.70      0.69      0.69     16113


In [64]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(confusion_matrix(df['Letter_Grade'], predicted), 
             index = pd.MultiIndex.from_product([['actual'], target_names]),
             columns = pd.MultiIndex.from_product([['predicted'], target_names]))


Out[64]:
          predicted
              A     B    C    F
actual A   8083  1397  474  137
       B   1462  2168  542  219
       C    286   326  725  117
       F     18    41   23   95

String Manipulation: Street


In [65]:
df.head(3)


Out[65]:
Facility_ID Restaurant_Name Inspection_Date Process_Description Geocode Street City Zip_Code Score Latitude ... Area_NW Austin Area_SE Austin Area_SW Austin Status_Pass Grade_B Grade_C Grade_F Pristine Names Street_Words
0 2801996 . Gatti 2015-12-23 Routine Inspection 2121 W PARMER LN, AUSTIN, TX 78758 2121 W PARMER LN AUSTIN 78758 94 30.415649 ... 1 0 0 1 0 0 0 1 . Gatti PARMER
1 10385802 Subway 2015-12-23 Routine Inspection 2501 W PARMER LN, AUSTIN, TX 78758 2501 W PARMER LN AUSTIN 78758 98 30.418236 ... 1 0 0 1 0 0 0 1 Subway PARMER
2 2802274 Baskin Robbins 2015-12-23 Routine Inspection 12407 N MOPAC EXPY, AUSTIN, TX 78758 12407 N MOPAC EXPY AUSTIN 78758 99 30.417462 ... 1 0 0 1 0 0 0 1 Baskin Robbins MOPAC

3 rows × 25 columns

Let us now follow a similar approach in order to isolate the street name from the address string:
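
The step-by-step splits below assume every address starts with a house number followed by a directional prefix (W, N, ...); where that prefix is absent, the first word of the street name is lost. As a sketch of an alternative that makes the prefix optional (the street_word helper is ours, not used below):

import re

# Strip the house number and an *optional* one- or two-letter directional
# prefix in one pass, keeping the first remaining word as the street name
street_re = re.compile(r'^\d+\s+(?:[NSEW]{1,2}\s+)?(\w+)')

def street_word(address):
    match = street_re.match(address)
    return match.group(1) if match else ''

print(street_word('2121 W PARMER LN, AUSTIN, TX 78758'))    # PARMER
print(street_word('1921 CEDAR BEND DR, AUSTIN, TX 78758'))  # CEDAR (the splits below give BEND)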


In [68]:
streets = df['Geocode'].tolist()

In [69]:
# Drop the leading house number
split_streets = [i.split(' ', 1)[1] for i in streets]

In [70]:
split_streets[0]


Out[70]:
'W PARMER LN, AUSTIN, TX 78758'

In [71]:
# Drop the directional prefix (W, N, ...). Caveat: addresses without such a
# prefix lose the first word of their street name at this step.
split_streets = [i.split(' ', 1)[1] for i in split_streets]

In [72]:
split_streets[0]


Out[72]:
'PARMER LN, AUSTIN, TX 78758'

In [73]:
# Keep only the first remaining word as the street name
split_streets = [i.split(' ', 1)[0] for i in split_streets]

In [74]:
split_streets[0]


Out[74]:
'PARMER'

In [75]:
# Remove street words of three characters or fewer, as with the names
split_streets = [shortword.sub('', s) for s in split_streets]

In [76]:
split_streets[0]


Out[76]:
'PARMER'

In [77]:
# Create a new column with the street:
df['Street_Words'] = split_streets

In [78]:
# Turn the street words into Bag of Words count vectors.
# min_df=1 keeps every term; the commented-out TfidfVectorizer with min_df=2
# is one recipe for reducing overfitting, and its parameters (and the NB
# alpha) can be tuned.

#vectorizer = TfidfVectorizer(min_df=2)
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(df['Street_Words'])
y = df['Letter_Grade']

# Hold-out train/test split for evaluation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size = 0.8)

# Fit a classifier on the training set

classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set

print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))


Training score: 62.8%
Testing score: 62.4%
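
Both numbers sit essentially at the majority-class baseline (about 63% of the restaurants carry grade A), a first hint that the street word alone adds little signal beyond the class balance.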

In [79]:
n_samples, n_features = X_train.shape

In [80]:
vectorizer.get_feature_names()[n_features / 3:n_features / 3 + 10]


Out[80]:
[u'center',
 u'centre',
 u'cesar',
 u'champ',
 u'chase',
 u'cimas',
 u'clay',
 u'club',
 u'colorado',
 u'commerce']

In [81]:
len(vectorizer.vocabulary_)


Out[81]:
145

In [82]:
target_predicted_proba = classifier.predict_proba(X_test)
# Column order follows classifier.classes_. Many rows repeat because the only
# feature is the street word: restaurants on the same street get identical
# predicted probabilities.
pd.DataFrame(target_predicted_proba[:10], columns=classifier.classes_)


Out[82]:
A B C F
0 0.626532 0.270908 0.091777 0.010784
1 0.626532 0.270908 0.091777 0.010784
2 0.626532 0.270908 0.091777 0.010784
3 0.626532 0.270908 0.091777 0.010784
4 0.491628 0.357262 0.123660 0.027449
5 0.606478 0.291367 0.088987 0.013168
6 0.626532 0.270908 0.091777 0.010784
7 0.626532 0.270908 0.091777 0.010784
8 0.626532 0.270908 0.091777 0.010784
9 0.697392 0.239386 0.059898 0.003324

In [85]:
pipeline = Pipeline([
    ('vec', CountVectorizer(max_df=0.8, ngram_range=(1, 2))),
    ('clf', MultinomialNB(alpha=0.01)),
])
_ = pipeline.fit(df['Street_Words'], df['Letter_Grade'])

In [86]:
scores = cross_val_score(pipeline, df['Street_Words'],
                         df['Letter_Grade'], cv=3)
scores.mean(), sem(scores)


Out[86]:
(0.60107151305392026, 0.011496010274910711)

In [87]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]

feature_names = vec.get_feature_names()
target_names = np.unique(df['Letter_Grade'])  # sorted, matching clf.classes_

feature_weights = clf.coef_

feature_weights.shape


Out[87]:
(4, 145)

In [88]:
predicted = pipeline.predict(df['Street_Words'])

In [89]:
print(classification_report(df['Letter_Grade'], predicted,
                            target_names=np.unique(df['Letter_Grade'])))


             precision    recall  f1-score   support

          A       0.63      0.99      0.77     10091
          B       0.50      0.02      0.03      4391
          C       0.38      0.00      0.01      1454
          F       0.00      0.00      0.00       177

avg / total       0.56      0.63      0.49     16113


In [90]:
pd.DataFrame(confusion_matrix(df['Letter_Grade'], predicted), 
             index = pd.MultiIndex.from_product([['actual'], target_names]),
             columns = pd.MultiIndex.from_product([['predicted'], target_names]))


Out[90]:
          predicted
              A   B  C  F
actual A  10035  51  5  0
       B   4320  66  5  0
       C   1433  15  6  0
       F    177   0  0  0

In [91]:
print_top10(vectorizer, classifier, target_names)


A: slaughter anderson caves mopac capital congress william parmer lamar blvd
B: oltorf riverside springs capital anderson parmer congress william blvd lamar
C: capital parmer riverside springs stassney oltorf william congress blvd lamar
F: springs cesar martin william anderson congress riverside blvd oltorf lamar
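
The street-only model is, in fact, close to degenerate: the confusion matrix shows it predicts A for 15965 of the 16113 restaurants, so its overall accuracy of about 63% is what the majority-class baseline alone would deliver. The street name, reduced to a single word, tells us little that the class balance does not.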
