Natural Language Processing (NLP)

Overview

  • corpus - a collection of texts
  • lexicon - the set of words (or token sequences) we put into our index
  • bag-of-words - represent a text by the count of each word, ignoring word order
  • n-gram - a sequence of n consecutive tokens; counting n-grams preserves some word order
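The bag-of-words idea above can be sketched in a few lines with `collections.Counter` (the toy two-document corpus here is invented for illustration):

```python
from collections import Counter

# A toy corpus: each document is one string.
corpus = ["the cat sat on the mat", "the dog sat"]

# Bag-of-words: count how many times each word appears in each document.
bags = [Counter(doc.split()) for doc in corpus]

print(bags[0]["the"])  # 'the' appears twice in the first document
```

Real pipelines would tokenize more carefully (see below) and often map the counts into a fixed-size vector, but the counting step is exactly this.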

Tokenization

Break text into tokens: units such as characters, words, or sentences.

Tokenizing Sentences


In [1]:
from nltk.tokenize import TreebankWordTokenizer

sentence = "How does nltk tokenize this sentence?"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)


Out[1]:
['How', 'does', 'nltk', 'tokenize', 'this', 'sentence', '?']

Tokenizing Social Media


In [5]:
from nltk.tokenize.casual import casual_tokenize

tweet = "OMG @twitterguy that was sooooooooo cool :D :D :D!!!!"
print(casual_tokenize(tweet))


['OMG', '@twitterguy', 'that', 'was', 'sooooooooo', 'cool', ':D', ':D', ':D', '!', '!', '!']

In [4]:
casual_tokenize(tweet, reduce_len=True, strip_handles=True)


Out[4]:
['OMG', 'that', 'was', 'sooo', 'cool', ':D', ':D', ':D', '!', '!', '!']

N-grams


In [16]:
from nltk.util import ngrams

list(ngrams(sentence.split(), 2))


Out[16]:
[('How', 'does'),
 ('does', 'nltk'),
 ('nltk', 'tokenize'),
 ('tokenize', 'this'),
 ('this', 'sentence?')]
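Note that `sentence.split()` leaves the `?` attached to the last word. Tokenizing first (as in the earlier cell) splits the punctuation into its own token, which usually gives cleaner n-grams:

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.util import ngrams

sentence = "How does nltk tokenize this sentence?"
tokens = TreebankWordTokenizer().tokenize(sentence)

# The '?' is now a separate token, so the final bigram is ('sentence', '?').
bigrams = list(ngrams(tokens, 2))
print(bigrams[-1])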

Stop-words


In [20]:
import nltk
nltk.download("stopwords")
stop_words = nltk.corpus.stopwords.words("english")

stop_words[:10]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mcama\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[20]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Sentiment

VADER (Valence Aware Dictionary and sEntiment Reasoner)


In [38]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\mcama\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

In [39]:
negative_sentence = "This is the worst!!! I hate it so much :( :("
sia.polarity_scores(negative_sentence)


Out[39]:
{'neg': 0.715, 'neu': 0.285, 'pos': 0.0, 'compound': -0.9436}

In [40]:
sia.polarity_scores(tweet)


Out[40]:
{'neg': 0.0, 'neu': 0.342, 'pos': 0.658, 'compound': 0.9106}
