Adapted from NLP Crash Course by Charlie Greenbacker and Introduction to NLP by Dan Jurafsky
NLP requires an understanding of the language and the world.
In [1]:
!pip install textblob
In [2]:
import pandas as pd
import numpy as np
import scipy as sp
from scipy import sparse  # ensure sp.sparse is available
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline
In [3]:
# read yelp.csv into a DataFrame
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/yelp.csv'
yelp = pd.read_csv(url)
# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars
# split the new DataFrame into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [4]:
# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
In [5]:
# rows are documents, columns are terms (aka "tokens" or "features")
X_train_dtm.shape
Out[5]:
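Most entries in a document-term matrix are zero, so scikit-learn stores it as a sparse matrix. A quick check of the density, using the matrix's stored-element count (nnz):
# fraction of non-zero entries in the document-term matrix
X_train_dtm.nnz / (X_train_dtm.shape[0] * X_train_dtm.shape[1])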
In [6]:
# last 50 features
print(vect.get_feature_names_out()[-50:])
In [7]:
# show vectorizer options
vect
Out[7]:
In [8]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
print(X_train_dtm.shape)
vect.get_feature_names_out()[-10:]
Out[8]:
In [9]:
# include 1-grams and 2-grams, and only keep terms that appear in at least 5 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=5)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
Out[9]:
In [10]:
# last 50 features
print(vect.get_feature_names_out()[-50:])
Predicting the star rating:
In [11]:
# use default options for CountVectorizer
vect = CountVectorizer()
# create document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
# use Naive Bayes to predict the star rating
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))
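Accuracy alone hides where the mistakes are; a confusion matrix breaks them out (rows are the actual classes 1 and 5, columns are the predictions):
# confusion matrix for the 1-star/5-star classifier
print(metrics.confusion_matrix(y_test, y_pred_class))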
In [12]:
# calculate null accuracy: the accuracy from always predicting the majority class
y_test_binary = np.where(y_test==5, 1, 0)
print(y_test_binary.mean())
print(1 - y_test_binary.mean())
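Equivalently, pandas can report the class balance directly; the larger proportion is the null accuracy:
# class distribution of y_test: the max is the null accuracy
y_test.value_counts(normalize=True)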
In [13]:
# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
In [14]:
# include 1-grams through 3-grams, require each term to appear at least twice, and cap at 10,000 features
vect = CountVectorizer(ngram_range=(1, 3), min_df=2, max_features=10000)
tokenize_test(vect)
In [15]:
# show vectorizer options
vect
Out[15]:
In [16]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)
vect.get_params()
Out[16]:
In [17]:
# set of stop words
print(vect.get_stop_words())
In [18]:
# remove English stop words and only keep 100 features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)
In [19]:
# all 100 features
print(vect.get_feature_names_out())
In [20]:
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
tokenize_test(vect)
In [21]:
# include 1-grams and 2-grams, and only include terms that appear at least 2 times
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)
TextBlob: "Simplified Text Processing"
In [22]:
# print the first review
print(yelp_best_worst.text[0])
In [23]:
# save it as a TextBlob object
review = TextBlob(yelp_best_worst.text[0])
In [24]:
# list the words
review.words
Out[24]:
In [25]:
# list the sentences
review.sentences
Out[25]:
In [26]:
# some string methods are available
review.lower()
Out[26]:
Stemming:
In [27]:
# initialize stemmer
stemmer = SnowballStemmer('english')
# stem each word
print([stemmer.stem(word) for word in review.words])
Lemmatization:
In [28]:
# assume every word is a noun
print([word.lemmatize() for word in review.words])
In [29]:
# assume every word is a verb
print([word.lemmatize(pos='v') for word in review.words])
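To see how the two differ, compare the stemmer and the lemmatizer side by side on a few words (an illustrative check using the stemmer and Word imported above):
# stemming chops suffixes; lemmatization maps words to dictionary forms
for w in ['running', 'leaves', 'studies']:
    print(w, '->', stemmer.stem(w), '|', Word(w).lemmatize('v'))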
In [30]:
# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    # optionally lowercase first: text = text.lower()
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]
In [31]:
# use split_into_lemmas as the analyzer (WARNING: SLOW!)
vect = CountVectorizer(analyzer=split_into_lemmas, decode_error='replace')
tokenize_test(vect)
In [32]:
# last 50 features
print(vect.get_feature_names_out()[-50:])
In [33]:
# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
In [34]:
# Term Frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())
tf
Out[34]:
In [35]:
# Document Frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, -1), columns=vect.get_feature_names_out())
Out[35]:
In [36]:
# Term Frequency-Inverse Document Frequency (simple version)
tf/df
Out[36]:
In [37]:
# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())
Out[37]:
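The same numbers can be reconstructed by hand. A sketch of TfidfVectorizer's default weighting (smooth_idf=True, norm='l2'), reusing the tf and df objects computed above:
# smoothed idf, as in scikit-learn: ln((1 + n) / (1 + df)) + 1
n_docs = tf.shape[0]
idf = np.log((1 + n_docs) / (1 + df)) + 1
tfidf = tf * idf                                      # broadcasts idf across columns
tfidf.div(np.sqrt((tfidf ** 2).sum(axis=1)), axis=0)  # L2-normalize each row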
More details: TF-IDF is about what matters
Reddit's autotldr uses the SMMRY algorithm, which is based on TF-IDF!
In [38]:
# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = list(vect.get_feature_names_out())
dtm.shape
Out[38]:
In [39]:
def summarize():
    # choose a random review that is at least 300 characters
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = yelp.text[review_id]
        review_length = len(review_text)
    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]
    # print words with the top 5 TF-IDF scores
    print('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print(word)
    # print 5 random words
    print('\n' + 'RANDOM WORDS:')
    random_words = np.random.choice(list(word_scores.keys()), size=5, replace=False)
    for word in random_words:
        print(word)
    # print the review
    print('\n' + review_text)
In [40]:
summarize()
Sentiment analysis:
In [41]:
# print the first review again
print(review)
In [42]:
# polarity ranges from -1 (most negative) to 1 (most positive)
review.sentiment.polarity
Out[42]:
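TextBlob's sentiment also reports subjectivity, from 0 (very objective) to 1 (very subjective); the full named tuple carries both values:
# Sentiment(polarity, subjectivity)
review.sentiment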
In [43]:
# understanding the apply method
yelp['length'] = yelp.text.apply(len)
yelp.head(1)
Out[43]:
In [44]:
# define a function that accepts text and returns the polarity
def detect_sentiment(text):
    return TextBlob(text).sentiment.polarity
In [45]:
# create a new DataFrame column for sentiment (WARNING: SLOW!)
yelp['sentiment'] = yelp.text.apply(detect_sentiment)
In [46]:
# box plot of sentiment grouped by stars
yelp.boxplot(column='sentiment', by='stars')
Out[46]:
In [47]:
# reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()
Out[47]:
In [48]:
# reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()
Out[48]:
In [49]:
# widen the column display
pd.set_option('display.max_colwidth', 500)
In [50]:
# negative sentiment in a 5-star review
yelp[(yelp.stars == 5) & (yelp.sentiment < -0.3)].head(1)
Out[50]:
In [51]:
# positive sentiment in a 1-star review
yelp[(yelp.stars == 1) & (yelp.sentiment > 0.5)].head(1)
Out[51]:
In [52]:
# reset the column display width
pd.reset_option('display.max_colwidth')
In [53]:
# create a DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
# define X and y
feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']
X = yelp_best_worst[feature_cols]
y = yelp_best_worst.stars
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [54]:
# use CountVectorizer with text column only
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train.text)
X_test_dtm = vect.transform(X_test.text)
print(X_train_dtm.shape)
print(X_test_dtm.shape)
In [55]:
# shape of other four feature columns
X_train.drop('text', axis=1).shape
Out[55]:
In [56]:
# cast other feature columns to float and convert to a sparse matrix
extra = sp.sparse.csr_matrix(X_train.drop('text', axis=1).astype(float))
extra.shape
Out[56]:
In [57]:
# combine sparse matrices
X_train_dtm_extra = sp.sparse.hstack((X_train_dtm, extra))
X_train_dtm_extra.shape
Out[57]:
In [58]:
# repeat for testing set
extra = sp.sparse.csr_matrix(X_test.drop('text', axis=1).astype(float))
X_test_dtm_extra = sp.sparse.hstack((X_test_dtm, extra))
X_test_dtm_extra.shape
Out[58]:
In [59]:
# use logistic regression with text column only
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))
In [60]:
# use logistic regression with all features
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm_extra, y_train)
y_pred_class = logreg.predict(X_test_dtm_extra)
print(metrics.accuracy_score(y_test, y_pred_class))
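The manual hstack above can also be written as a pipeline. A sketch using ColumnTransformer (added in scikit-learn 0.20), which vectorizes the text column and passes the numeric columns through unchanged:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# vectorize 'text'; pass sentiment/cool/useful/funny through as-is
ct = ColumnTransformer([('text', CountVectorizer(), 'text')],
                       remainder='passthrough')
pipe = make_pipeline(ct, LogisticRegression(C=1e9, solver='liblinear'))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))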
In [61]:
# spelling correction
TextBlob('15 minuets late').correct()
Out[61]:
In [62]:
# spellcheck
Word('parot').spellcheck()
Out[62]:
In [63]:
# definitions of 'bank' as a verb
Word('bank').define('v')
Out[63]:
In [64]:
# language identification (requires an internet connection; deprecated in newer TextBlob releases)
TextBlob('Hola amigos').detect_language()
Out[64]:
In [65]:
import re
# compile a pattern matching common punctuation characters
p = re.compile(r'[\'!@#$%^&*(),<>.?/:"\|}{};]')
# usage: p.sub('', text).lower().strip() removes punctuation, lowercases, and trims whitespace
In [66]:
text = 'TTThe other one tttthe re, the blithe one.'
# match one to three t's or T's followed by 'he' (note: this also matches inside 'blithe')
reg = re.compile('[tT]{1,3}he')
reg.sub('', text)
Out[66]: