Natural Language Processing Project

Welcome to the NLP Project for this section of the course. In this project you will classify Yelp reviews into 1-star or 5-star categories based on the text content of the reviews. This will be a simpler procedure than the lecture, since we will use pipeline methods to handle the more complex steps.

We will use the Yelp Review Data Set from Kaggle.

Each observation in this dataset is a review of a particular business by a particular user.

The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.

The "cool" column is the number of "cool" votes this review received from other Yelp users.

All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.

The "useful" and "funny" columns are similar to the "cool" column.

Let's get started! Just follow the directions below!

Imports

Import the usual suspects. :)



In [94]:

The Data

Read the yelp.csv file and set it as a dataframe called yelp.
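As a minimal sketch of this step: with the real file in the working directory the call would simply be pd.read_csv("yelp.csv"). To keep the example runnable anywhere, the snippet below reads a made-up two-row stand-in (the rows are hypothetical, only the column names come from the dataset).

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for yelp.csv; with the real file use
# yelp = pd.read_csv("yelp.csv") instead.
sample_csv = io.StringIO(
    "business_id,date,review_id,stars,text,type,user_id,cool,useful,funny\n"
    "b1,2011-01-26,r1,5,Great breakfast spot,review,u1,2,5,0\n"
    "b2,2011-07-27,r2,1,Slow service and cold food,review,u2,0,0,0\n"
)

yelp = pd.read_csv(sample_csv)
print(yelp.shape)
```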



In [95]:

Check the head, info, and describe methods on yelp.
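One way these three calls might look, sketched on a small hypothetical dataframe (the rows are invented; on the real data you would call the same methods on yelp as loaded from the CSV):

```python
import pandas as pd

# Hypothetical stand-in for the yelp dataframe.
yelp = pd.DataFrame({
    "stars": [5, 4, 1],
    "text": ["Great place", "Pretty good", "Never again"],
    "cool": [2, 0, 0],
    "useful": [5, 1, 0],
    "funny": [0, 0, 1],
})

print(yelp.head())         # first rows of the table
yelp.info()                # column dtypes and non-null counts (prints directly)
summary = yelp.describe()  # summary statistics for the numeric columns only
print(summary)
```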



In [96]:

    Out[96]:

   business_id             date        review_id               stars  text                                                type    user_id                 cool  useful  funny
0  9yKzy9PApeiPPOUJEtnvkg  2011-01-26  fWKvX83p0-ka4JS3dc6E5A  5      My wife took me here on my birthday for breakf...  review  rLtl8ZkDX5vH5nAx9C3q5Q  2     5       0
1  ZRJwVLyzEJq1VAihDhYiow  2011-07-27  IjZ33sJrzXqU-0X6U8NwyA  5      I have no idea why some people give bad review...  review  0a2KyEL0d3Yb1V6aivbIuQ  0     0       0
2  6oRAC4uyJCsJl1X0WZpVSA  2012-06-14  IESLBzqUCLdSzSqm0eCSxQ  4      love the gyro plate. Rice is so good and I als...  review  0hT2KtfLiobPvh6cDC8JQg  0     1       0
3  _1QQZuf4zZOyFCvXc0o6Vg  2010-05-27  G-WvGaISbqqaMHlNnByodA  5      Rosie, Dakota, and I LOVE Chaparral Dog Park!!...  review  uZetl9T0NcROGOyFfughhg  1     2       0
4  6ozycU1RpktNG2-1BroVtw  2012-01-05  1uJFq2r5QfJG_6ExMRCaGw  5      General Manager Scott Petello is a good egg!!!...  review  vYmM4KTsC8ZfQBg-j5MWkw  0     0       0



In [97]:









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
business_id    10000 non-null object
date           10000 non-null object
review_id      10000 non-null object
stars          10000 non-null int64
text           10000 non-null object
type           10000 non-null object
user_id        10000 non-null object
cool           10000 non-null int64
useful         10000 non-null int64
funny          10000 non-null int64
dtypes: int64(4), object(6)
memory usage: 781.3+ KB



In [99]:

    Out[99]:

              stars          cool        useful         funny
count  10000.000000  10000.000000  10000.000000  10000.000000
mean       3.777500      0.876800      1.409300      0.701300
std        1.214636      2.067861      2.336647      1.907942
min        1.000000      0.000000      0.000000      0.000000
25%        3.000000      0.000000      0.000000      0.000000
50%        4.000000      0.000000      1.000000      0.000000
75%        5.000000      1.000000      2.000000      1.000000
max        5.000000     77.000000     76.000000     57.000000

Create a new column called "text length" which is the number of characters in the text column.
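A possible sketch of this step on a hypothetical two-row dataframe. The per-star averages shown later in the notebook (roughly 625 to 842) look like character counts, so the sketch assumes len applied to each string; a word count would instead be yelp["text"].str.split().str.len().

```python
import pandas as pd

# Hypothetical stand-in for the yelp dataframe.
yelp = pd.DataFrame({"stars": [5, 1], "text": ["Great food", "Bad"]})

# len() on a string counts characters, spaces included.
yelp["text length"] = yelp["text"].apply(len)
print(yelp)
```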



In [100]:

EDA

Let's explore the data

Imports

Import the data visualization libraries if you haven't done so already.



In [101]:

Use FacetGrid from the seaborn library to create a grid of 5 histograms of text length, one for each star rating. Refer to the seaborn documentation for hints.



In [102]:









    Out[102]:





<seaborn.axisgrid.FacetGrid at 0x121e705f8>

Create a boxplot of text length for each star category.
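The boxplot could be sketched as follows, again on a hypothetical dataframe with invented values:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the yelp dataframe, with invented lengths.
yelp = pd.DataFrame({
    "stars": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "text length": [820, 640, 900, 710, 700, 350, 500, 420, 300, 610],
})

# One box per star rating on the x axis, text length on the y axis.
ax = sns.boxplot(x="stars", y="text length", data=yelp)
```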



In [103]:









    Out[103]:





<matplotlib.axes._subplots.AxesSubplot at 0x121283470>

Create a countplot of the number of occurrences for each type of star rating.
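A short sketch of the countplot on hypothetical ratings. The value_counts call is an extra, added here only to show the numbers the bars represent.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the yelp dataframe.
yelp = pd.DataFrame({"stars": [1, 2, 3, 3, 4, 4, 5, 5, 5]})

# Bar height = number of rows for each star rating.
ax = sns.countplot(x="stars", data=yelp)

# The same counts as a table, for comparison with the plot.
counts = yelp["stars"].value_counts().sort_index()
print(counts)
```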



In [104]:









    Out[104]:





<matplotlib.axes._subplots.AxesSubplot at 0x12578fc88>

Use groupby to get the mean values of the numerical columns. You should be able to create this dataframe with one operation:
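One way to write that operation, sketched on a hypothetical four-row dataframe. The numeric_only=True argument is an assumption about your pandas version: recent releases raise an error when asked to average a text column, so the sketch excludes non-numeric columns explicitly.

```python
import pandas as pd

# Hypothetical stand-in for the yelp dataframe.
yelp = pd.DataFrame({
    "stars": [1, 1, 5, 5],
    "text": ["bad", "awful place", "great", "loved it"],
    "cool": [0, 2, 1, 3],
    "useful": [1, 3, 0, 2],
    "funny": [0, 0, 1, 1],
})
yelp["text length"] = yelp["text"].apply(len)

# Mean of every numeric column within each star rating.
stars = yelp.groupby("stars").mean(numeric_only=True)
print(stars)
```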



In [105]:

    Out[105]:

           cool    useful     funny  text length
stars
1      0.576769  1.604806  1.056075   826.515354
2      0.719525  1.563107  0.875944   842.256742
3      0.788501  1.306639  0.694730   758.498289
4      0.954623  1.395916  0.670448   712.923142
5      0.944261  1.381780  0.608631   624.999101

Use the corr() method on that groupby dataframe to produce this dataframe:
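As a sketch, the correlation step can be reproduced from the group means printed in this notebook's groupby output. The numbers below are copied from that table; the variable name stars is an assumption.

```python
import pandas as pd

# Group means copied from the groupby output in this notebook.
stars = pd.DataFrame(
    {
        "cool": [0.576769, 0.719525, 0.788501, 0.954623, 0.944261],
        "useful": [1.604806, 1.563107, 1.306639, 1.395916, 1.381780],
        "funny": [1.056075, 0.875944, 0.694730, 0.670448, 0.608631],
        "text length": [826.515354, 842.256742, 758.498289,
                        712.923142, 624.999101],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="stars"),
)

# Pairwise Pearson correlation between the columns.
corr = stars.corr()
print(corr.round(6))
```

Because the input has only five rows (one per star rating), these are correlations between group means, not between individual reviews.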



In [106]:

    Out[106]:

                 cool    useful     funny  text length
cool         1.000000 -0.743329 -0.944939    -0.857664
useful      -0.743329  1.000000  0.894506     0.699881
funny       -0.944939  0.894506  1.000000     0.843461
text length -0.857664  0.699881  0.843461     1.000000

Then use seaborn to create a heatmap based off that .corr() dataframe:
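A possible heatmap call, using the correlation values printed in this notebook. The cmap and annot settings are illustrative choices, not something the exercise prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Correlation values copied from the corr() output in this notebook.
labels = ["cool", "useful", "funny", "text length"]
corr = pd.DataFrame(
    [
        [1.000000, -0.743329, -0.944939, -0.857664],
        [-0.743329, 1.000000, 0.894506, 0.699881],
        [-0.944939, 0.894506, 1.000000, 0.843461],
        [-0.857664, 0.699881, 0.843461, 1.000000],
    ],
    index=labels,
    columns=labels,
)

# annot=True writes each value into its cell (assumed styling choice).
ax = sns.heatmap(corr, cmap="coolwarm", annot=True)
```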



In [38]:









    Out[38]:





<matplotlib.axes._subplots.AxesSubplot at 0x120edb828>

NLP Classification Task

Let's move on to the actual task. To make things a little easier, go ahead and only grab reviews that were either 1 star or 5 stars.

Create a dataframe called yelp_class that contains the columns of yelp dataframe but for only the 1 or 5 star reviews.
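A minimal sketch of the filter on a hypothetical dataframe, using a boolean mask that keeps rows whose rating is 1 or 5:

```python
import pandas as pd

# Hypothetical stand-in for the yelp dataframe.
yelp = pd.DataFrame({
    "stars": [5, 4, 1, 3, 5, 2],
    "text": ["great", "good", "awful", "fine", "superb", "meh"],
})

# Keep only the 1-star and 5-star reviews; all columns are retained.
yelp_class = yelp[(yelp["stars"] == 1) | (yelp["stars"] == 5)]
print(yelp_class)
```

An equivalent spelling is yelp[yelp["stars"].isin([1, 5])].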



In [107]:

Create two objects X and y. X will be the 'text' column of yelp_class and y will be the 'stars' column of yelp_class. (Your features and target/labels)



In [117]:

Import CountVectorizer and create a CountVectorizer object.



In [118]:

Use the fit_transform method on the CountVectorizer object and pass in X (the 'text' column). Save this result by overwriting X.
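The last three steps (X and y, the CountVectorizer object, and fit_transform) might be sketched together like this. The three reviews are hypothetical, and the name cv is an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for yelp_class.
yelp_class = pd.DataFrame({
    "stars": [5, 1, 5],
    "text": ["great food great service", "terrible food", "great place"],
})

X = yelp_class["text"]   # features: raw review text
y = yelp_class["stars"]  # target labels

cv = CountVectorizer()
# Overwrite X with a sparse document-term matrix of word counts.
X = cv.fit_transform(X)
print(X.shape)
```

Each row of X is one review and each column is one vocabulary word, so the first review has a count of 2 in the "great" column.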



In [119]:

Train Test Split

Let's split our data into training and testing data.

Use train_test_split to split up the data into X_train, X_test, y_train, y_test. Use test_size=0.3 and random_state=101.
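A sketch of the split with the parameters named above, run on ten hypothetical reviews so that it is self-contained:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Ten hypothetical reviews with alternating 5-star and 1-star labels.
texts = [
    "great food", "awful service", "loved this place", "never again",
    "amazing staff", "cold and bland", "best tacos ever", "rude waiter",
    "wonderful brunch", "dirty tables",
]
y = [5, 1, 5, 1, 5, 1, 5, 1, 5, 1]

X = CountVectorizer().fit_transform(texts)

# 30% of the rows go to the test set; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
print(X_train.shape, X_test.shape)
```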



In [120]:



In [121]:

Training a Model

Time to train a model!

Import MultinomialNB, create an instance of the estimator, and call it nb.



In [122]:

Now fit nb using the training data.



In [123]:









    Out[123]:





MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Predictions and Evaluations

Time to see how our model did!

Use the predict method off of nb to predict labels from X_test.
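The create, fit, and predict steps can be sketched together as below. The ten reviews are hypothetical, so the predictions here say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Ten hypothetical reviews with alternating 5-star and 1-star labels.
texts = [
    "great food", "awful service", "great place", "awful food",
    "great staff", "awful place", "great tacos", "awful waiter",
    "great brunch", "awful tables",
]
y = [5, 1, 5, 1, 5, 1, 5, 1, 5, 1]

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

nb = MultinomialNB()
nb.fit(X_train, y_train)           # learn word counts per class
predictions = nb.predict(X_test)   # one predicted label per test row
print(predictions)
```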



In [124]:

Create a confusion matrix and classification report using these predictions and y_test
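To show how the two reports relate, the sketch below rebuilds label vectors from the four counts in the confusion matrix printed for this step (159, 69, 22, 976) and recomputes the metrics. On real data you would pass y_test and your own predictions instead of these reconstructed lists.

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Reconstructed from the printed matrix: 228 true 1-star, 998 true 5-star.
y_test = [1] * 228 + [5] * 998
predictions = [1] * 159 + [5] * 69 + [1] * 22 + [5] * 976

cm = confusion_matrix(y_test, predictions)  # rows = true, columns = predicted
print(cm)
print(classification_report(y_test, predictions))

# Class 1 by hand: precision = 159 / (159 + 22), recall = 159 / (159 + 69).
precision_1 = precision_score(y_test, predictions, pos_label=1)
recall_1 = recall_score(y_test, predictions, pos_label=1)
print(round(precision_1, 2), round(recall_1, 2))
```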



In [82]:



In [125]:









    



[[159  69]
 [ 22 976]]


             precision    recall  f1-score   support

          1       0.88      0.70      0.78       228
          5       0.93      0.98      0.96       998

avg / total       0.92      0.93      0.92      1226

Great! Let's see what happens if we add TF-IDF to this process using a pipeline.

Using Text Processing

Import TfidfTransformer from sklearn.



In [155]:

Import Pipeline from sklearn.



In [156]:

Now create a pipeline with the following steps: CountVectorizer(), TfidfTransformer(), MultinomialNB()
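A sketch of the pipeline definition. The step names 'bow' and 'classifier' appear in the fitted pipeline's printed output in this notebook; the middle name 'tfidf' is an assumption, since that part of the output is truncated.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("bow", CountVectorizer()),       # raw text -> token counts
    ("tfidf", TfidfTransformer()),    # counts -> TF-IDF weights (name assumed)
    ("classifier", MultinomialNB()),  # TF-IDF vectors -> predicted label
])
print([name for name, _ in pipeline.steps])
```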



In [157]:

Using the Pipeline

Time to use the pipeline! Remember that this pipeline already contains all of your preprocessing steps, so we need to re-split the original data. (Recall that we overwrote X with the CountVectorized version; what we need now is just the raw text.)

Train Test Split

Redo the train test split on the yelp_class object.



In [158]:

Now fit the pipeline to the training data. Remember that you can't use the same training data as last time, because that data has already been vectorized. We need to pass in just the text and labels.
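The re-split and fit might look like the sketch below. The ten reviews are hypothetical, and the step name 'tfidf' is an assumption. The point is that X is the raw text column here, so the pipeline does the vectorizing itself.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for yelp_class.
yelp_class = pd.DataFrame({
    "text": [
        "great food", "awful service", "great place", "awful food",
        "great staff", "awful place", "great tacos", "awful waiter",
        "great brunch", "awful tables",
    ],
    "stars": [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
})

X = yelp_class["text"]   # raw strings, not a count matrix
y = yelp_class["stars"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),  # step name assumed
    ("classifier", MultinomialNB()),
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)
```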



In [159]:









    Out[159]:





Pipeline(steps=[('bow', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

Predictions and Evaluation

Now use the pipeline to predict from the X_test and create a classification report and confusion matrix. You should notice strange results.



In [153]:



In [154]:









    



[[  0 228]
 [  0 998]]
             precision    recall  f1-score   support

          1       0.00      0.00      0.00       228
          5       0.81      1.00      0.90       998

avg / total       0.66      0.81      0.73      1226







    



/Users/marci/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
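The strange result above can be reproduced from the counts alone: if every one of the 1226 test reviews is predicted as 5 stars, the first column of the confusion matrix is empty and precision for class 1 has no predicted samples to divide by. The sketch below assumes a scikit-learn version that supports the zero_division argument, which reports 0.0 for that case without the warning.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 228 true 1-star and 998 true 5-star reviews, all predicted as 5 stars.
y_test = [1] * 228 + [5] * 998
predictions = [5] * 1226

cm = confusion_matrix(y_test, predictions)
print(cm)

# zero_division=0 reports 0.0 for class 1 instead of emitting a warning.
print(classification_report(y_test, predictions, zero_division=0))

# Accuracy equals the share of 5-star reviews: 998 / 1226.
accuracy = accuracy_score(y_test, predictions)
print(round(accuracy, 2))
```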

Looks like TF-IDF actually made things worse! That is it for this project, but there is still a lot more you can play with.

Some other things to try: go back and experiment with the pipeline steps to see whether a custom analyzer like the one from the lecture helps (note: it probably won't). Or recreate the pipeline with just CountVectorizer() and MultinomialNB(). Does changing the ML model at the end to another classifier help at all?

Great Job!