Library Exploration: spaCy

Parsing


In [1]:
import spacy

In [2]:
nlp = spacy.load('en')

In [3]:
text = u"We are living in Singapore.\nIt's blazing outside today!\n"

In [4]:
doc = nlp(text)

In [5]:
for token in doc:
    print((token.text, token.lemma, token.tag, token.pos))


('We', 757862, 479, 93)
('are', 536, 492, 98)
('living', 943, 490, 98)
('in', 522, 466, 83)
('Singapore', 88812, 475, 94)
('.', 453, 453, 95)
('\n', 518, 485, 101)
('It', 757862, 479, 93)
("'s", 536, 493, 98)
('blazing', 66705, 490, 98)
('outside', 1654, 481, 84)
('today', 1188, 474, 90)
('!', 558, 453, 95)
('\n', 518, 485, 101)

In [6]:
for token in doc:
    print((token.text, token.lemma_, token.tag_, token.pos_)) # lemma means *root form*


('We', '-PRON-', 'PRP', 'PRON')
('are', 'be', 'VBP', 'VERB')
('living', 'live', 'VBG', 'VERB')
('in', 'in', 'IN', 'ADP')
('Singapore', 'singapore', 'NNP', 'PROPN')
('.', '.', '.', 'PUNCT')
('\n', '\n', 'SP', 'SPACE')
('It', '-PRON-', 'PRP', 'PRON')
("'s", 'be', 'VBZ', 'VERB')
('blazing', 'blaze', 'VBG', 'VERB')
('outside', 'outside', 'RB', 'ADV')
('today', 'today', 'NN', 'NOUN')
('!', '!', '.', 'PUNCT')
('\n', '\n', 'SP', 'SPACE')

Corresponded Tag-POStag Table

TagPOSMorphology
-LRB- PUNCT PunctType=brck PunctSide=ini
-PRB- PUNCT PunctType=brck PunctSide=fin
, PUNCT PunctType=comm
: PUNCT
. PUNCT PunctType=peri
'' PUNCT PunctType=quot PunctSide=fin
"" PUNCT PunctType=quot PunctSide=fin
# SYM SymType=numbersign
`` PUNCT PunctType=quot PunctSide=ini
SYM SymType=currency
ADD X
AFX ADJ Hyph=yes
BES VERB
CC CONJ ConjType=coor
CD NUM NumType=card
DT DET
EX ADV AdvType=ex
FW X Foreign=yes
GW X
HVS VERB
HYPH PUNCT PunctType=dash
IN ADP
JJ ADJ Degree=pos
JJR ADJ Degree=comp
JJS ADJ Degree=sup
LS PUNCT NumType=ord
MD VERB VerbType=mod
NFP PUNCT
NIL
NN NOUN Number=sing
NNP PROPN NounType=prop Number=sign
NNPS PROPN NounType=prop Number=plur
NNS NOUN Number=plur
PDT ADJ AdjType=pdt PronType=prn
POS PART Poss=yes
PRP PRON PronType=prs
PRP ADJ PronType=prs Poss=yes
RB ADV Degree=pos
RBR ADV Degree=comp
RBS ADV Degree=sup
RP PART
SP SPACE
SYM SYM
TO PART PartType=inf VerbForm=inf
UH INTJ
VB VERB VerbForm=inf
VBD VERB VerbForm=fin Tense=past
VBG VERB VerbForm=part Tense=pres Aspect=prog
VBN VERB VerbForm=part Tense=past Aspect=perf
VBP VERB VerbForm=fin Tense=pres
VBZ VERB VerbForm=fin Tense=pres Number=sing Person=3
WDT ADJ PronType=int|rel
WP NOUN PronType=int|rel
WP ADJ Poss=yes PronType=int|rel
WRB ADV PronType=int|rel
XX X

Definition of Tags

Number
Tag
Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP Possessive wh-pronoun
36. WRB Wh-adverb

In [7]:
#https://spacy.io/docs/api/token
doc_ps = nlp("Mr.Sakamoto told us the Dragon Fruits was very yummy!") 
#for t in doc:
t = doc_ps[2]
print("token:",t)
print("vocab (The vocab object of the parent Doc):", t.vocab)
print("doc (The parent document.):", t.doc)
print("i (The index of the token within the parent document.):", t.i)
print("ent_type_ (Named entity type.):", t.ent_type_)
print("ent_iob_ (IOB code of named entity tag):", t.ent_iob_)
print("ent_id_ (ID of the entity the token is an instance of):", t.ent_id_)
print("lemma_ (Base form of the word, with no inflectional suffixes.):", t.lemma_)
print("lower_ (Lower-case form of the word.):", t.lower_)
print("shape_ (A transform of the word's string, to show orthographic features.):", t.shape_)
print("prefix_ (Integer ID of a length-N substring from the start of the word):", t.prefix_)
print("suffix_ (Length-N substring from the end of the word):", t.suffix_)
print("like_url (Does the word resemble a URL?):", t.like_url)
print("like_num (Does the word represent a number? ):", t.like_num)
print("like_email (Does the word resemble an email address?):", t.like_email)
print("is_oov (Is the word out-of-vocabulary?):", t.is_oov)
print("is_stop (Is the word part of a stop list?):", t.is_stop)
print("pos_ (Coarse-grained part-of-speech.):", t.pos_)
print("tag_ (Fine-grained part-of-speech.):", t.tag_)
print("dep_ (Syntactic dependency relation.):", t.dep_)
print("lang_ (Language of the parent document's vocabulary.):", t.lang_)
print("prob: (Smoothed log probability estimate of token's type.)", t.prob)
print("idx (The character offset of the token within the parent document.):", t.idx)
print("sentiment (A scalar value indicating the positivity or negativity of the token):", t.sentiment)
print("lex_id (ID of the token's lexical type.):", t.lex_id)
print("text (Verbatim text content.):", t.text)
print("text_with_ws (Text content, with trailing space character if present.):", t.text_with_ws)
print("whitespace_ (Trailing space character if present.):", t.whitespace_)


token: Sakamoto
vocab (The vocab object of the parent Doc): <spacy.vocab.Vocab object at 0x10c677598>
doc (The parent document.): Mr.Sakamoto told us the Dragon Fruits was very yummy!
i (The index of the token within the parent document.): 2
ent_type_ (Named entity type.): PERSON
ent_iob_ (IOB code of named entity tag): B
ent_id_ (ID of the entity the token is an instance of): 
lemma_ (Base form of the word, with no inflectional suffixes.): sakamoto
lower_ (Lower-case form of the word.): sakamoto
shape_ (A transform of the word's string, to show orthographic features.): Xxxxx
prefix_ (Integer ID of a length-N substring from the start of the word): S
suffix_ (Length-N substring from the end of the word): oto
like_url (Does the word resemble a URL?): False
like_num (Does the word represent a number? ): False
like_email (Does the word resemble an email address?): False
is_oov (Is the word out-of-vocabulary?): False
is_stop (Is the word part of a stop list?): False
pos_ (Coarse-grained part-of-speech.): PROPN
tag_ (Fine-grained part-of-speech.): NNP
dep_ (Syntactic dependency relation.): nsubj
lang_ (Language of the parent document's vocabulary.): en
prob: (Smoothed log probability estimate of token's type.) -19.579313278198242
idx (The character offset of the token within the parent document.): 3
sentiment (A scalar value indicating the positivity or negativity of the token): 0.0
lex_id (ID of the token's lexical type.): 518442
text (Verbatim text content.): Sakamoto
text_with_ws (Text content, with trailing space character if present.): Sakamoto 
whitespace_ (Trailing space character if present.):  

Dependency Analysis


In [8]:
doc_dep = nlp(u'I like chicken rice and Laksa.')
for np in doc_dep.noun_chunks:
    print((np.text, np.root.text, np.root.dep_, np.root.head.text))


('I', 'I', 'nsubj', 'like')
('chicken rice', 'rice', 'dobj', 'like')
('Laksa', 'Laksa', 'conj', 'rice')

In [9]:
for t in doc_dep:
    print((t.text, t.dep_, t.tag_))


('I', 'nsubj', 'PRP')
('like', 'ROOT', 'VBP')
('chicken', 'compound', 'NN')
('rice', 'dobj', 'NN')
('and', 'cc', 'CC')
('Laksa', 'conj', 'NNP')
('.', 'punct', '.')

Visualization using displaCy (https://demos.explosion.ai/displacy/)


In [10]:
for token in doc_dep:
    # Orth: Original, Head: head of subtree
    print((token.text, token.dep_, token.n_lefts, token.n_rights, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights]))


('I', 'nsubj', 0, 0, 'like', [], [])
('like', 'ROOT', 1, 2, 'like', ['I'], ['rice', '.'])
('chicken', 'compound', 0, 0, 'rice', [], [])
('rice', 'dobj', 1, 2, 'like', ['chicken'], ['and', 'Laksa'])
('and', 'cc', 0, 0, 'rice', [], [])
('Laksa', 'conj', 0, 0, 'rice', [], [])
('.', 'punct', 0, 0, 'like', [], [])

In [11]:
dependency_pattern = '{left}<---{word}[{w_type}]--->{right}\n--------'

In [12]:
for token in doc_dep:
    print (dependency_pattern.format(word=token.orth_, 
                                  w_type=token.dep_,
                                  left=[t.orth_ for t in token.lefts],
                                  right=[t.orth_ for t in token.rights]))


[]<---I[nsubj]--->[]
--------
['I']<---like[ROOT]--->['rice', '.']
--------
[]<---chicken[compound]--->[]
--------
['chicken']<---rice[dobj]--->['and', 'Laksa']
--------
[]<---and[cc]--->[]
--------
[]<---Laksa[conj]--->[]
--------
[]<---.[punct]--->[]
--------

Head and Child in dependency tree

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head.
https://spacy.io/docs/usage/dependency-parse


In [13]:
for t in doc_dep:
    print((t.text, t.dep_,t.tag_,t.pos_),(t.head.text, t.head.dep_,t.head.tag_,t.head.pos_))


('I', 'nsubj', 'PRP', 'PRON') ('like', 'ROOT', 'VBP', 'VERB')
('like', 'ROOT', 'VBP', 'VERB') ('like', 'ROOT', 'VBP', 'VERB')
('chicken', 'compound', 'NN', 'NOUN') ('rice', 'dobj', 'NN', 'NOUN')
('rice', 'dobj', 'NN', 'NOUN') ('like', 'ROOT', 'VBP', 'VERB')
('and', 'cc', 'CC', 'CCONJ') ('rice', 'dobj', 'NN', 'NOUN')
('Laksa', 'conj', 'NNP', 'PROPN') ('rice', 'dobj', 'NN', 'NOUN')
('.', 'punct', '.', 'PUNCT') ('like', 'ROOT', 'VBP', 'VERB')

Verb extraction


In [14]:
# Load symbols
from spacy.symbols import nsubj, VERB
verbs = set()

In [15]:
for token in doc:
    print ((token, token.dep, token.head, token.head.pos))
    if token.dep == nsubj and token.head.pos == VERB:
        verbs.add(token.head)


(We, 425, living, 98)
(are, 401, living, 98)
(living, 512817, living, 98)
(in, 439, living, 98)
(Singapore, 435, in, 83)
(., 441, living, 98)
(
, 0, ., 95)
(It, 425, blazing, 98)
('s, 401, blazing, 98)
(blazing, 512817, blazing, 98)
(outside, 396, blazing, 98)
(today, 424, blazing, 98)
(!, 441, blazing, 98)
(
, 0, !, 95)

In [16]:
verbs


Out[16]:
{living, blazing}

Extract similar words


In [17]:
from numpy import dot
from numpy.linalg import norm

# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

In [18]:
target_word = 'Singapore'
sing = nlp.vocab[target_word]
sing


Out[18]:
<spacy.lexeme.Lexeme at 0x10c2e4ee8>

In [19]:
# gather all known words except for taget word
all_words = list({w for w in nlp.vocab if w.has_vector and w.orth_.islower() and w.lower_ != target_word.lower()})
len(all_words)


Out[19]:
7681

In [20]:
# sort by similarity
#all_words.sort(key=lambda w: cosine(w.vector, sing.vector))
#all_words.reverse()
#print("Top 10 most similar words to",target_word)
#for word in all_words[:10]:   
#    print(word.orth_)

Vector representation


In [21]:
country1 = nlp.vocab['china']
race1 = nlp.vocab['chinese']
country2 = nlp.vocab['japan']
result = country1.vector - race1.vector + country2.vector

In [22]:
all_words = list({w for w in nlp.vocab if w.has_vector and w.orth_.islower() and w.lower_ != "china" and w.lower_ != "chinese" and w.lower_ != "japan"})

In [23]:
all_words.sort(key=lambda w: cosine(w.vector, result))
all_words[0].orth_


Out[23]:
'japanese'

In [24]:
# Top 3 results
for word in all_words[:3]:   
    print(word.orth_)


japanese
asian
vegetarian

Entity Recognition


In [25]:
example_sent = "NTUC has raised S$25 million to help workers re-skill and upgrade their skills, secretary-general Chan Chun Sing said at the May Day Rally on Monday "
parsed = nlp(example_sent)
for token in parsed:
    print((token.orth_, token.ent_type_ if token.ent_type_ != "" else "(not an entity)"))


('NTUC', 'ORG')
('has', '(not an entity)')
('raised', '(not an entity)')
('S$25', 'CARDINAL')
('million', 'CARDINAL')
('to', '(not an entity)')
('help', '(not an entity)')
('workers', '(not an entity)')
('re', '(not an entity)')
('-', '(not an entity)')
('skill', '(not an entity)')
('and', '(not an entity)')
('upgrade', '(not an entity)')
('their', '(not an entity)')
('skills', '(not an entity)')
(',', '(not an entity)')
('secretary', '(not an entity)')
('-', '(not an entity)')
('general', '(not an entity)')
('Chan', 'PERSON')
('Chun', 'PERSON')
('Sing', 'PERSON')
('said', '(not an entity)')
('at', '(not an entity)')
('the', 'DATE')
('May', 'DATE')
('Day', 'DATE')
('Rally', '(not an entity)')
('on', '(not an entity)')
('Monday', 'DATE')

Visualization using displaCy Named Entity Visualizer (https://demos.explosion.ai/displacy-ent/)

List of entity types

https://spacy.io/docs/usage/entity-recognition

TypeDescription
PERSONPeople, including fictional.
NORPNationalities or religious or political groups.
FACILITYBuildings, airports, highways, bridges, etc.
ORGCompanies, agencies, institutions, etc.
GPECountries, cities, states.
LOCNon-GPE locations, mountain ranges, bodies of water.
PRODUCTObjects, vehicles, foods, etc. (Not services.)
EVENTNamed hurricanes, battles, wars, sports events, etc.
WORK_OF_ARTTitles of books, songs, etc.
LANGUAGEAny named language.

Build own entity recognizer


In [26]:
import random
from spacy.gold import GoldParse
from spacy.language import EntityRecognizer

train_data = [
    ('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
    ('I like Bangkok and Buangkok.', [(7, 14, 'LOC'), (19, 27, 'LOC')])
]

nlp2 = spacy.load('en', entity=False, parser=False)
ner = EntityRecognizer(nlp2.vocab, entity_types=['PERSON', 'LOC'])

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc2 = nlp2.make_doc(raw_text)
        gold = GoldParse(doc2, entities=entity_offsets)

        nlp.tagger(doc2)
        ner.update(doc2, gold)
ner.model.end_training()
nlp.save_to_directory('./sample_ner/')

In [27]:
nlp3 = spacy.load('en', path='./sample_ner/')
example_sent = "Who is Tai Seng Tan?"
doc3 = nlp3(example_sent)
for ent in doc3.ents:
            print(ent.label_, ent.text)


PERSON Tai Seng Tan

In [ ]: