Natural Language Processing (NLP)

Overview

  • corpus - a collection of texts
  • lexicon - the set of words (or token sequences) we put into our index
  • bag-of-words - represent a text by the count of each word, ignoring word order
  • n-gram - a sequence of n consecutive tokens; counting n-grams preserves some word order
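The bag-of-words idea above can be sketched in a few lines with `collections.Counter` (the toy two-document corpus here is invented for illustration):

```python
from collections import Counter

# A toy corpus: each document is one string.
corpus = ["the cat sat on the mat", "the dog sat"]

# Bag-of-words: count how many times each word appears in each document.
bags = [Counter(doc.split()) for doc in corpus]

print(bags[0]["the"])  # 'the' appears twice in the first document
```

Real pipelines would tokenize more carefully (see below) and often map the counts into a fixed-size vector, but the counting step is exactly this.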

Tokenization

Break text into tokens: units such as characters, words, or sentences.

Tokenizing Sentences


In [1]:
from nltk.tokenize import TreebankWordTokenizer

sentence = "How does nltk tokenize this sentence?"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)


Out[1]:
['How', 'does', 'nltk', 'tokenize', 'this', 'sentence', '?']

Tokenizing Social Media


In [5]:
from nltk.tokenize.casual import casual_tokenize

tweet = "OMG @twitterguy that was sooooooooo cool :D :D :D!!!!"
print(casual_tokenize(tweet))


['OMG', '@twitterguy', 'that', 'was', 'sooooooooo', 'cool', ':D', ':D', ':D', '!', '!', '!']

In [4]:
casual_tokenize(tweet, reduce_len=True, strip_handles=True)


Out[4]:
['OMG', 'that', 'was', 'sooo', 'cool', ':D', ':D', ':D', '!', '!', '!']

N-grams


In [16]:
from nltk.util import ngrams

list(ngrams(sentence.split(), 2))


Out[16]:
[('How', 'does'),
 ('does', 'nltk'),
 ('nltk', 'tokenize'),
 ('tokenize', 'this'),
 ('this', 'sentence?')]
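Note that `sentence.split()` leaves the `?` attached to the last word. Tokenizing first (as in the earlier cell) splits the punctuation into its own token, which usually gives cleaner n-grams:

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.util import ngrams

sentence = "How does nltk tokenize this sentence?"
tokens = TreebankWordTokenizer().tokenize(sentence)

# The '?' is now a separate token, so the final bigram is ('sentence', '?').
bigrams = list(ngrams(tokens, 2))
print(bigrams[-1])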

Stop-words


In [20]:
import nltk
nltk.download("stopwords")
stop_words = nltk.corpus.stopwords.words("english")

stop_words[:10]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mcama\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[20]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Sentiment

VADER (Valence Aware Dictionary and sEntiment Reasoner)


In [38]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\mcama\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

In [39]:
negative_sentence = "This is the worst!!! I hate it so much :( :("
sia.polarity_scores(negative_sentence)


Out[39]:
{'neg': 0.715, 'neu': 0.285, 'pos': 0.0, 'compound': -0.9436}

In [40]:
sia.polarity_scores(tweet)


Out[40]:
{'neg': 0.0, 'neu': 0.342, 'pos': 0.658, 'compound': 0.9106}
