In [1]:
import spacy
import wordcloud
nlp = spacy.load('en_core_web_sm')  # the small English model; the 'en' shortcut is deprecated in recent spaCy versions
Usually the first step of text analysis is tokenization, which is the process of breaking a document into "tokens". You can roughly think of it as extracting each word.
In [2]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token)
As you can see, it's not exactly the same as doc.split(). You'd want $ to be a separate token because it has a particular meaning (USD). Actually, as shown in this example (https://spacy.io/usage/spacy-101#annotations-pos-deps), spaCy figures out a lot of things about these tokens. For instance,
In [3]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)
It figured out that Apple is a proper noun ("PROPN" and "NNP"; see here for the part-of-speech tags). spaCy has a visualizer too.
In [4]:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})
It even recognizes entities and can visualize them.
In [5]:
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot devices, have clear leads in
consumer adoption."""
doc2 = nlp(text)
displacy.render(doc2, style='ent', jupyter=True)
In [6]:
import urllib.request
metamorphosis_book = urllib.request.urlopen('http://www.gutenberg.org/cache/epub/5200/pg5200.txt').read()
In [7]:
metamorphosis_book[:1000]
Out[7]:
Looks like we have successfully loaded the book. If you were doing a serious analysis, you'd probably want to remove the parts at the beginning and the end that are not part of the book itself, but let's ignore them for now. Let's try to feed this directly into spaCy.
In [8]:
doc_metamor = nlp(metamorphosis_book)  # this fails: nlp expects a str, not bytes
In [9]:
type(metamorphosis_book)
Out[9]:
Indeed, the type of metamorphosis_book is bytes. But as we saw above, we could view the book's contents, right? What's going on?
Well, the problem is that a byte sequence is not yet a proper string until we know how to decode it. A string is an abstract object and we need to specify an encoding to write the string into a file. For instance, if I have a string of Korean characters like "안녕", there are several encodings that I can specify to write that into a file, and depending on the encoding that I choose, the byte sequences can be totally different from each other. This is a really important (and confusing) topic, but because it's beyond the scope of the course, I'll just link a nice post about encoding: http://kunststube.net/encoding/
In [10]:
"안녕".encode('utf8')
Out[10]:
In [11]:
# b'\xec\x95\x88\xeb\x85\x95'.decode('euc-kr') <- what happens if you do this?
b'\xec\x95\x88\xeb\x85\x95'.decode('utf8')
Out[11]:
In [12]:
"안녕".encode('euc-kr')
Out[12]:
In [13]:
b'\xbe\xc8\xb3\xe7'.decode('euc-kr')
Out[13]:
You can decode with the "wrong" encoding too.
In [14]:
b'\xbe\xc8\xb3\xe7'.decode('latin-1')
Out[14]:
As you can see, the same string can be encoded into different byte sequences depending on the encoding. It's a really annoying (but fun) topic, and if you need to deal with text data, you must have a good understanding of it.
I know that Project Gutenberg uses utf-8 encoding, so let's decode the byte sequence into a string.
In [15]:
# Implement
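A minimal sketch of one way to implement it (the variable name metamorphosis_book_str is taken from the next cell):

# decode the raw bytes into a str using utf-8, the encoding Project Gutenberg uses
metamorphosis_book_str = metamorphosis_book.decode('utf-8')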
In [16]:
type(metamorphosis_book_str)
Out[16]:
Shall we try again?
In [17]:
doc_metamor = nlp(metamorphosis_book_str)
In [18]:
words = [token.text for token in doc_metamor
         if not token.is_stop and not token.is_punct]
In [19]:
from collections import Counter
Counter(words).most_common(5)
Out[19]:
There are a lot of newline characters and multiple spaces among the top words. A quick and dirty way to remove them is split & join: you split the document using split(), which breaks on any whitespace, and then join the pieces back with a single space. Can you implement it and print the 10 most common words?
In [20]:
# Implement
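One possible sketch: clean the raw string, re-run spaCy on the cleaned text, and recount (clean_str and doc_clean are hypothetical names introduced here):

# split() with no argument breaks on any whitespace run, so joining with a
# single space collapses newlines and repeated spaces
clean_str = ' '.join(metamorphosis_book_str.split())
doc_clean = nlp(clean_str)
words = [token.text for token in doc_clean
         if not token.is_stop and not token.is_punct]
Counter(words).most_common(10)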
Out[20]:
Let's keep the object with word count.
In [21]:
word_cnt = Counter(words)
In [22]:
import matplotlib.pyplot as plt
%matplotlib inline
Can you check out the wordcloud package documentation, create a word cloud from the word count object that we created from the book above, and plot it?
In [23]:
# Implement: create a word cloud object
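A minimal sketch, assuming the wordcloud package's WordCloud class and its generate_from_frequencies() method (the keyword arguments are illustrative choices, not requirements):

# generate_from_frequencies accepts any dict-like mapping of word -> count,
# so we can feed it the Counter directly
wc = wordcloud.WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(word_cnt)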
Out[23]:
In [24]:
# Implement: plot the word cloud object
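One way to plot it with matplotlib (a sketch; wc is the hypothetical object from the previous sketch):

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')  # a WordCloud object renders as an image array
plt.axis('off')
plt.show()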
Out[24]:
Q: Can you create a word cloud for a certain part of speech, such as nouns, verbs, or proper nouns (pick one)?
In [25]:
# Implement
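A sketch for nouns, filtering on spaCy's coarse pos_ attribute (nouns and wc_nouns are hypothetical names):

# keep only common nouns, then build a cloud from their counts
nouns = [token.text for token in doc_metamor
         if token.pos_ == 'NOUN' and not token.is_stop and not token.is_punct]
wc_nouns = wordcloud.WordCloud(width=800, height=400, background_color='white')
wc_nouns.generate_from_frequencies(Counter(nouns))
plt.imshow(wc_nouns, interpolation='bilinear')
plt.axis('off')
plt.show()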
Out[25]: