Turn a string or document into token. There are different theories and rules. We want to perform tokenization in order to:
Usually we use nltk library with method:
In [3]:
scene_one='''SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there! [clop clop clop] \nSOLDIER #1: Halt! Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot. King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ... and this is my trusty servant Patsy. We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot. I must speak with your lord and master.\nSOLDIER #1: What? Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So? We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them? In Mercea? The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all. They could be carried.\nSOLDIER #1: What? A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it! It's a simple question of weight ratios! A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter. Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen. In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow. That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway... [clop clop clop] \nSOLDIER #2: Wait a minute! Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple! They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?
\nSOLDIER #2: Well, why not?\n'''
In [4]:
scene_one
Out[4]:
In [5]:
import nltk
#nltk.download()
Out[5]:
In [6]:
from nltk.tokenize import word_tokenize,sent_tokenize
sentences = sent_tokenize(scene_one)
sentences[:5]
Out[6]:
In [7]:
tokenized_sent = word_tokenize(sentences[3])
tokenized_sent
Out[7]:
In [8]:
#Unique token
unique_tokens = set(word_tokenize(scene_one))
In [9]:
tweets=['This is the best #nlp exercise ive found online! #python',
'#NLP is super fun! <3 #learning',
'Thanks @datacamp :) #nlp #python']
In [11]:
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
pattern1 = r"#\w+"
regexp_tokenize(tweets[0],pattern1)
Out[11]:
In [15]:
pattern2 = r"([#|@]\w+)"
regexp_tokenize(tweets[-1],pattern2)
Out[15]:
In [13]:
# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)