DAT-ATX-1 Capstone Project

Nikolaos Vergos, February 2016

nvergos@gmail.com

2c. Supervised Learning: Textual Analysis - Naïve Bayes Classification

We will now shift gears and reformulate our question: we will use textual data (a restaurant's name and its street) as features to predict whether it scored an A at the health inspection. This should yield a more interesting analysis than the weak one we conducted with the categorical area variable.

The outline of the procedure we are going to follow is:

  • Turn a corpus of text documents (restaurant names, street addresses) into feature vectors using a Bag of Words representation,
  • Train a simple text classifier (Multinomial Naive Bayes) on the feature vectors,
  • Wrap the vectorizer and the classifier in a pipeline,
  • Perform cross-validation and model selection on the pipeline.
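
A minimal, self-contained sketch of these four steps on toy data follows; the documents, labels and parameters here are purely illustrative, and the 2016-era sklearn imports match the ones used throughout this notebook.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score  # 2016-era sklearn API

# Toy corpus and labels, purely illustrative
docs = ["Taco House", "Pizza Palace", "Taco Stand", "Burger Palace"]
grades = ["A", "B", "A", "B"]

pipe = Pipeline([
    ('vec', CountVectorizer()),   # step 1: Bag of Words feature vectors
    ('clf', MultinomialNB()),     # step 2: Multinomial Naive Bayes
])                                # step 3: vectorizer + classifier in a pipeline

# Step 4: cross-validate the whole pipeline directly on the raw text
print(cross_val_score(pipe, docs, grades, cv=2).mean())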

0. Import libraries & packages


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(rc={"axes.labelsize": 15});

# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5;
plt.rcParams['axes.grid'] = True;
plt.gray();



1. Import dataset


In [4]:
# Read the dataset into a DataFrame using pandas
df = pd.read_csv("../data/data.csv")

# Print the first few observations
df.head()


Out[4]:
Facility_ID Restaurant_Name Inspection_Date Process_Description Geocode Street City Zip_Code Score Latitude ... Letter_Grade Area_NE Austin Area_NW Austin Area_SE Austin Area_SW Austin Status_Pass Grade_B Grade_C Grade_F Pristine
0 2801996 Mr. Gatti's #118 2015-12-23 Routine Inspection 2121 W PARMER LN, AUSTIN, TX 78758 2121 W PARMER LN AUSTIN 78758 94 30.415649 ... A 0 1 0 0 1 0 0 0 1
1 10385802 Subway 2015-12-23 Routine Inspection 2501 W PARMER LN, AUSTIN, TX 78758 2501 W PARMER LN AUSTIN 78758 98 30.418236 ... A 0 1 0 0 1 0 0 0 1
2 2802274 Baskin Robbins 2015-12-23 Routine Inspection 12407 N MOPAC EXPY, AUSTIN, TX 78758 12407 N MOPAC EXPY AUSTIN 78758 99 30.417462 ... A 0 1 0 0 1 0 0 0 1
3 10964220 JR's Tacos 2015-12-22 Routine Inspection 1921 CEDAR BEND DR, AUSTIN, TX 78758 1921 CEDAR BEND DR AUSTIN 78758 91 30.408322 ... A 0 1 0 0 1 0 0 0 1
4 10778546 Econo Lodge 2015-12-22 Routine Inspection 9100 BURNET RD, AUSTIN, TX 78758 9100 BURNET RD AUSTIN 78758 91 30.374790 ... A 0 1 0 0 1 0 0 0 1

5 rows × 23 columns

String Manipulation: Restaurant Names

Let us start our manipulation of restaurant names:


In [5]:
Names = pd.Series(df['Restaurant_Name'].values)

We will remove all words of three characters or fewer:


In [6]:
import re
# Matches any word of 1-3 characters, together with the non-word characters
# (spaces, apostrophes, '#', ...) immediately preceding it
shortword = re.compile(r'\W*\b\w{1,3}\b')

In [7]:
# Strip the short words from every name
Names = Names.map(lambda name: shortword.sub('', name))

In [8]:
# As an example, "JR's Tacos" is now just " Tacos"

Names[3]


Out[8]:
' Tacos'

In [9]:
# Add a new column into our DataFrame:

df['Names'] = Names

In [10]:
df['Names'].head(10)


Out[10]:
0              . Gatti
1               Subway
2       Baskin Robbins
3                Tacos
4          Econo Lodge
5    Shahi Food Market
6                Speed
7               Jasper
8      Papa John Pizza
9        Subway #43067
Name: Names, dtype: object

In [11]:
df.columns


Out[11]:
Index([u'Facility_ID', u'Restaurant_Name', u'Inspection_Date',
       u'Process_Description', u'Geocode', u'Street', u'City', u'Zip_Code',
       u'Score', u'Latitude', u'Longitude', u'Area', u'Status',
       u'Letter_Grade', u'Area_NE Austin', u'Area_NW Austin',
       u'Area_SE Austin', u'Area_SW Austin', u'Status_Pass', u'Grade_B',
       u'Grade_C', u'Grade_F', u'Pristine', u'Names'],
      dtype='object')

Our first collection of feature vectors will come from the new "Names" column. We are still interested in whether a restaurant falls under the "pristine" category (grade A, score above 90), but below we tackle the more general question of predicting a restaurant's letter grade (A, B, C or F); the binary "pristine" version would work in exactly the same way.

2. Text Classification using a Naive Bayes Classifier

Restaurant Name


In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB

# Turn the text documents into Bag of Words feature vectors.
# Note: min_df=1 keeps every term; min_df=2 would throw out terms
# that appear in only one document.

vectorizer = CountVectorizer(min_df=1, stop_words="english")

X = vectorizer.fit_transform(df['Names'])
y = df['Letter_Grade']

# Hold-out train/test split for evaluation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size = 0.8)

# Fit a classifier on the training set

classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set

print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))


Training score: 70.1%
Testing score: 65.7%

It seems our Multinomial Naive Bayes classifier does significantly better at predicting a restaurant's letter grade from its name than anything we have seen so far with the area-of-town division.
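
To put these percentages in perspective, here is a quick majority-class baseline, a sketch assuming df as loaded above:

# Accuracy of always predicting the most common letter grade
# (about 63% here, per the class support in the reports further below)
baseline = df['Letter_Grade'].value_counts(normalize=True).max()
print("Majority-class baseline: {0:.1f}%".format(baseline * 100))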


In [13]:
# Some information about our Bag of Words feature vector:

In [14]:
len(X_train.data)  # number of stored (nonzero) entries in the sparse matrix


Out[14]:
28768

In [15]:
n_samples, n_features = X_train.shape

In [16]:
n_samples


Out[16]:
12890

In [17]:
n_features


Out[17]:
2704

In [18]:
# The vocabulary of our vectorizer, i.e. the unique words comprising it:

len(vectorizer.vocabulary_)


Out[18]:
2704

In [19]:
vectorizer.get_feature_names()[n_features / 3:n_features / 3 + 10]


Out[19]:
[u'divines',
 u'dizzy',
 u'dobie',
 u'dock',
 u'doddy',
 u'dogs',
 u'dogwood',
 u'dolce',
 u'domain',
 u'domestic']

In [20]:
target_predicted_proba = classifier.predict_proba(X_test)
# predict_proba columns follow classifier.classes_ (sorted labels), so use
# those rather than df['Letter_Grade'].unique() for the column names
percentages = pd.DataFrame(target_predicted_proba, columns=classifier.classes_)

In [21]:
# A table of probabilities, for each of the 3223 restaurants in the test set,
# of being assigned each letter grade:

percentages.head()


Out[21]:
A B C F
0 0.477308 0.351621 0.169233 1.838392e-03
1 0.999599 0.000399 0.000001 4.617823e-07
2 0.443248 0.395690 0.132699 2.836293e-02
3 0.831649 0.092802 0.062244 1.330407e-02
4 0.078374 0.599928 0.313887 7.810455e-03

In [22]:
len(percentages)


Out[22]:
3223

By default, predict assigns each restaurant the grade with the highest predicted probability; for a binary problem this corresponds to a decision threshold of 0.5. Varying that threshold from 0 to 1 would generate a family of binary classifiers addressing all the possible trade-offs between false positive and false negative prediction errors.
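
As a sketch of that idea with the objects fitted above: collapse the four grades to "A versus not A" and sweep a threshold over the predicted probability of an A. The threshold values below are illustrative.

import numpy as np

# P(grade == 'A') for every restaurant in the test set
col_A = list(classifier.classes_).index('A')
proba_A = classifier.predict_proba(X_test)[:, col_A]
true_A = np.asarray(y_test) == 'A'

for t in (0.3, 0.5, 0.7):
    pred_A = proba_A >= t              # predict "A" when P(A) clears t
    fp = np.sum(pred_A & ~true_A)      # false positives at this threshold
    fn = np.sum(~pred_A & true_A)      # false negatives at this threshold
    print("t = %.1f  false positives: %d  false negatives: %d" % (t, fp, fn))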

Let us use a pipeline in order to perform 10-fold cross validation:


In [56]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vec', CountVectorizer(max_df=0.8, ngram_range=(1, 2))),
    ('clf', MultinomialNB(alpha=0.01)),
])
_ = pipeline.fit(df['Names'], df['Letter_Grade'])

In [57]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem

scores = cross_val_score(pipeline, df['Names'],
                         df['Letter_Grade'], cv=10)
scores.mean(), sem(scores)


Out[57]:
(0.52166700734667137, 0.021307751429222626)
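
This mean cross-validated accuracy (about 52%) is below both the earlier hold-out score and the majority-class baseline, suggesting these hyperparameters (bigrams, alpha = 0.01) are not well chosen. The model-selection step promised in the outline could tune them with a grid search along these lines, a sketch using the 2016-era sklearn API with illustrative parameter grids:

from sklearn.grid_search import GridSearchCV  # sklearn 0.17-era module

param_grid = {
    'vec__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    'vec__max_df': [0.8, 1.0],
    'clf__alpha': [0.01, 0.1, 1.0],        # Naive Bayes smoothing strength
}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(df['Names'], df['Letter_Grade'])
print(grid.best_params_, grid.best_score_)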

In [58]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]

feature_names = vec.get_feature_names()
target_names = np.unique(df['Letter_Grade'])  # sorted, matching clf.classes_ and the rows of clf.coef_

feature_weights = clf.coef_

feature_weights.shape


Out[58]:
(4, 5902)

In [59]:
len(feature_names)


Out[59]:
5902

In [60]:
def print_top10(vectorizer, clf, class_labels):
    """Prints the features with the highest coefficient values, per class.

    For MultinomialNB, coef_ mirrors the per-class feature log-probabilities,
    so the "top" features are simply the most frequent terms within each
    class; class_labels must follow the sorted clf.classes_ order.
    """
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        # Indices of the ten largest coefficients for class i
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

In [61]:
print_top10(vectorizer, classifier, target_names)


A: slaughter caves mopac anderson capital congress william parmer lamar blvd
B: mopac riverside springs capital anderson parmer congress william blvd lamar
C: anderson springs stassney parmer riverside oltorf william congress blvd lamar
F: parmer rundberg martin springs cesar congress riverside oltorf blvd lamar
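
Two caveats about this output. The four top-10 lists are nearly the same words in slightly different orders: since MultinomialNB's coefficients are within-class word frequencies, the most common terms dominate every class. More importantly, these are street words rather than restaurant-name tokens, which indicates this cell was last executed against the street-based vectorizer and classifier built further below; re-run directly after In [12], it should list name tokens instead.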

In [62]:
from sklearn.metrics import classification_report

# Note: the pipeline was fitted on the trimmed df['Names'] column; predicting
# on the raw names works because the fitted vocabulary is reused, but
# df['Names'] would be the consistent input here.
predicted = pipeline.predict(df['Restaurant_Name'])

In [63]:
# classification_report lists the classes in sorted order, so pass sorted labels
print(classification_report(df['Letter_Grade'], predicted,
                            target_names=np.unique(df['Letter_Grade'])))


             precision    recall  f1-score   support

          A       0.82      0.80      0.81     10091
          B       0.55      0.49      0.52      4391
          C       0.41      0.50      0.45      1454
          F       0.17      0.54      0.26       177

avg / total       0.70      0.69      0.69     16113


In [64]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(confusion_matrix(df['Letter_Grade'], predicted), 
             index = pd.MultiIndex.from_product([['actual'], target_names]),
             columns = pd.MultiIndex.from_product([['predicted'], target_names]))


Out[64]:
          predicted
              A     B    C    F
actual A   8083  1397  474  137
       B   1462  2168  542  219
       C    286   326  725  117
       F     18    41   23   95

String Manipulation: Street


In [65]:
df.head(3)


Out[65]:
Facility_ID Restaurant_Name Inspection_Date Process_Description Geocode Street City Zip_Code Score Latitude ... Area_NW Austin Area_SE Austin Area_SW Austin Status_Pass Grade_B Grade_C Grade_F Pristine Names Street_Words
0 2801996 . Gatti 2015-12-23 Routine Inspection 2121 W PARMER LN, AUSTIN, TX 78758 2121 W PARMER LN AUSTIN 78758 94 30.415649 ... 1 0 0 1 0 0 0 1 . Gatti PARMER
1 10385802 Subway 2015-12-23 Routine Inspection 2501 W PARMER LN, AUSTIN, TX 78758 2501 W PARMER LN AUSTIN 78758 98 30.418236 ... 1 0 0 1 0 0 0 1 Subway PARMER
2 2802274 Baskin Robbins 2015-12-23 Routine Inspection 12407 N MOPAC EXPY, AUSTIN, TX 78758 12407 N MOPAC EXPY AUSTIN 78758 99 30.417462 ... 1 0 0 1 0 0 0 1 Baskin Robbins MOPAC

3 rows × 25 columns

Let us now follow a similar approach in order to isolate the street name from the address string:
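
The step-by-step splits below assume every address starts with a house number followed by a directional prefix (W, N, ...); where that prefix is absent, the first word of the street name is lost. As a sketch of an alternative that makes the prefix optional (the street_word helper is ours, not used below):

import re

# Strip the house number and an *optional* one- or two-letter directional
# prefix in one pass, keeping the first remaining word as the street name
street_re = re.compile(r'^\d+\s+(?:[NSEW]{1,2}\s+)?(\w+)')

def street_word(address):
    match = street_re.match(address)
    return match.group(1) if match else ''

print(street_word('2121 W PARMER LN, AUSTIN, TX 78758'))    # PARMER
print(street_word('1921 CEDAR BEND DR, AUSTIN, TX 78758'))  # CEDAR (the splits below give BEND)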


In [68]:
streets = df['Geocode'].tolist()

In [69]:
# Drop the leading house number
split_streets = [i.split(' ', 1)[1] for i in streets]

In [70]:
split_streets[0]


Out[70]:
'W PARMER LN, AUSTIN, TX 78758'

In [71]:
# Drop the directional prefix (W, N, ...). Caveat: addresses without such a
# prefix lose the first word of their street name at this step.
split_streets = [i.split(' ', 1)[1] for i in split_streets]

In [72]:
split_streets[0]


Out[72]:
'PARMER LN, AUSTIN, TX 78758'

In [73]:
# Keep only the first remaining word as the street name
split_streets = [i.split(' ', 1)[0] for i in split_streets]

In [74]:
split_streets[0]


Out[74]:
'PARMER'

In [75]:
# Remove street words of three characters or fewer, as with the names
split_streets = [shortword.sub('', s) for s in split_streets]

In [76]:
split_streets[0]


Out[76]:
'PARMER'

In [77]:
# Create a new column with the street:
df['Street_Words'] = split_streets

In [78]:
# Turn the street words into Bag of Words count vectors.
# min_df=1 keeps every term; the commented-out TfidfVectorizer with min_df=2
# is one recipe for reducing overfitting, and its parameters (and the NB
# alpha) can be tuned.

#vectorizer = TfidfVectorizer(min_df=2)
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(df['Street_Words'])
y = df['Letter_Grade']

# Hold-out train/test split for evaluation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size = 0.8)

# Fit a classifier on the training set

classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))

# Evaluate the classifier on the testing set

print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))


Training score: 62.8%
Testing score: 62.4%
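
Both numbers sit essentially at the majority-class baseline (about 63% of the restaurants carry grade A), a first hint that the street word alone adds little signal beyond the class balance.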

In [79]:
n_samples, n_features = X_train.shape

In [80]:
vectorizer.get_feature_names()[n_features / 3:n_features / 3 + 10]


Out[80]:
[u'center',
 u'centre',
 u'cesar',
 u'champ',
 u'chase',
 u'cimas',
 u'clay',
 u'club',
 u'colorado',
 u'commerce']

In [81]:
len(vectorizer.vocabulary_)


Out[81]:
145

In [82]:
target_predicted_proba = classifier.predict_proba(X_test)
# Column order follows classifier.classes_. Many rows repeat because the only
# feature is the street word: restaurants on the same street get identical
# predicted probabilities.
pd.DataFrame(target_predicted_proba[:10], columns=classifier.classes_)


Out[82]:
A B C F
0 0.626532 0.270908 0.091777 0.010784
1 0.626532 0.270908 0.091777 0.010784
2 0.626532 0.270908 0.091777 0.010784
3 0.626532 0.270908 0.091777 0.010784
4 0.491628 0.357262 0.123660 0.027449
5 0.606478 0.291367 0.088987 0.013168
6 0.626532 0.270908 0.091777 0.010784
7 0.626532 0.270908 0.091777 0.010784
8 0.626532 0.270908 0.091777 0.010784
9 0.697392 0.239386 0.059898 0.003324

In [85]:
pipeline = Pipeline([
    ('vec', CountVectorizer(max_df=0.8, ngram_range=(1, 2))),
    ('clf', MultinomialNB(alpha=0.01)),
])
_ = pipeline.fit(df['Street_Words'], df['Letter_Grade'])

In [86]:
scores = cross_val_score(pipeline, df['Street_Words'],
                         df['Letter_Grade'], cv=3)
scores.mean(), sem(scores)


Out[86]:
(0.60107151305392026, 0.011496010274910711)

In [87]:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]

feature_names = vec.get_feature_names()
target_names = np.unique(df['Letter_Grade'])  # sorted, matching clf.classes_

feature_weights = clf.coef_

feature_weights.shape


Out[87]:
(4, 145)

In [88]:
predicted = pipeline.predict(df['Street_Words'])

In [89]:
print(classification_report(df['Letter_Grade'], predicted,
                            target_names=np.unique(df['Letter_Grade'])))


             precision    recall  f1-score   support

          A       0.63      0.99      0.77     10091
          B       0.50      0.02      0.03      4391
          C       0.38      0.00      0.01      1454
          F       0.00      0.00      0.00       177

avg / total       0.56      0.63      0.49     16113


In [90]:
pd.DataFrame(confusion_matrix(df['Letter_Grade'], predicted), 
             index = pd.MultiIndex.from_product([['actual'], target_names]),
             columns = pd.MultiIndex.from_product([['predicted'], target_names]))


Out[90]:
          predicted
              A   B  C  F
actual A  10035  51  5  0
       B   4320  66  5  0
       C   1433  15  6  0
       F    177   0  0  0

In [91]:
print_top10(vectorizer, classifier, target_names)


A: slaughter anderson caves mopac capital congress william parmer lamar blvd
B: oltorf riverside springs capital anderson parmer congress william blvd lamar
C: capital parmer riverside springs stassney oltorf william congress blvd lamar
F: springs cesar martin william anderson congress riverside blvd oltorf lamar
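
The street-only model is, in fact, close to degenerate: the confusion matrix shows it predicts A for 15965 of the 16113 restaurants, so its overall accuracy of about 63% is what the majority-class baseline alone would deliver. The street name, reduced to a single word, tells us little that the class balance does not.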
