Library Exploration: spaCy

Parsing



In [1]:

    
import spacy



In [2]:

    
nlp = spacy.load('en')



In [3]:

    
text = u"We are living in Singapore.\nIt's blazing outside today!\n"



In [4]:

    
doc = nlp(text)



In [5]:

    
for token in doc:
    print((token.text, token.lemma, token.tag, token.pos))









    



('We', 757862, 479, 93)
('are', 536, 492, 98)
('living', 943, 490, 98)
('in', 522, 466, 83)
('Singapore', 88812, 475, 94)
('.', 453, 453, 95)
('\n', 518, 485, 101)
('It', 757862, 479, 93)
("'s", 536, 493, 98)
('blazing', 66705, 490, 98)
('outside', 1654, 481, 84)
('today', 1188, 474, 90)
('!', 558, 453, 95)
('\n', 518, 485, 101)



In [6]:

    
for token in doc:
    print((token.text, token.lemma_, token.tag_, token.pos_)) # lemma means *root form*









    



('We', '-PRON-', 'PRP', 'PRON')
('are', 'be', 'VBP', 'VERB')
('living', 'live', 'VBG', 'VERB')
('in', 'in', 'IN', 'ADP')
('Singapore', 'singapore', 'NNP', 'PROPN')
('.', '.', '.', 'PUNCT')
('\n', '\n', 'SP', 'SPACE')
('It', '-PRON-', 'PRP', 'PRON')
("'s", 'be', 'VBZ', 'VERB')
('blazing', 'blaze', 'VBG', 'VERB')
('outside', 'outside', 'RB', 'ADV')
('today', 'today', 'NN', 'NOUN')
('!', '!', '.', 'PUNCT')
('\n', '\n', 'SP', 'SPACE')

Corresponded Tag-POStag Table

Tag	POS	Morphology
`-LRB-`	`PUNCT`	`PunctType=brck` `PunctSide=ini`
`-PRB-`	`PUNCT`	`PunctType=brck` `PunctSide=fin`
`,`	`PUNCT`	`PunctType=comm`
`:`	`PUNCT`
`.`	`PUNCT`	`PunctType=peri`
`''`	`PUNCT`	`PunctType=quot` `PunctSide=fin`
`""`	`PUNCT`	`PunctType=quot` `PunctSide=fin`
`#`	`SYM`	`SymType=numbersign`
``	`PUNCT`	`PunctType=quot` `PunctSide=ini`
	`SYM`	`SymType=currency`
`ADD`	`X`
`AFX`	`ADJ`	`Hyph=yes`
`BES`	`VERB`
`CC`	`CONJ`	`ConjType=coor`
`CD`	`NUM`	`NumType=card`
`DT`	`DET`
`EX`	`ADV`	`AdvType=ex`
`FW`	`X`	`Foreign=yes`
`GW`	`X`
`HVS`	`VERB`
`HYPH`	`PUNCT`	`PunctType=dash`
`IN`	`ADP`
`JJ`	`ADJ`	`Degree=pos`
`JJR`	`ADJ`	`Degree=comp`
`JJS`	`ADJ`	`Degree=sup`
`LS`	`PUNCT`	`NumType=ord`
`MD`	`VERB`	`VerbType=mod`
`NFP`	`PUNCT`
`NIL`
`NN`	`NOUN`	`Number=sing`
`NNP`	`PROPN`	`NounType=prop` `Number=sign`
`NNPS`	`PROPN`	`NounType=prop` `Number=plur`
`NNS`	`NOUN`	`Number=plur`
`PDT`	`ADJ`	`AdjType=pdt` `PronType=prn`
`POS`	`PART`	`Poss=yes`
`PRP`	`PRON`	`PronType=prs`
`PRP`	`ADJ`	`PronType=prs` `Poss=yes`
`RB`	`ADV`	`Degree=pos`
`RBR`	`ADV`	`Degree=comp`
`RBS`	`ADV`	`Degree=sup`
`RP`	`PART`
`SP`	`SPACE`
`SYM`	`SYM`
`TO`	`PART`	`PartType=inf` `VerbForm=inf`
`UH`	`INTJ`
`VB`	`VERB`	`VerbForm=inf`
`VBD`	`VERB`	`VerbForm=fin` `Tense=past`
`VBG`	`VERB`	`VerbForm=part` `Tense=pres` `Aspect=prog`
`VBN`	`VERB`	`VerbForm=part` `Tense=past` `Aspect=perf`
`VBP`	`VERB`	`VerbForm=fin` `Tense=pres`
`VBZ`	`VERB`	`VerbForm=fin` `Tense=pres` `Number=sing` `Person=3`
`WDT`	`ADJ`	`PronType=int\|rel`
`WP`	`NOUN`	`PronType=int\|rel`
`WP`	`ADJ`	`Poss=yes` `PronType=int\|rel`
`WRB`	`ADV`	`PronType=int\|rel`
`XX`	`X`

Definition of Tags

Number	Tag	Description
1.	CC	Coordinating conjunction
2.	CD	Cardinal number
3.	DT	Determiner
4.	EX	Existential there
5.	FW	Foreign word
6.	IN	Preposition or subordinating conjunction
7.	JJ	Adjective
8.	JJR	Adjective, comparative
9.	JJS	Adjective, superlative
10.	LS	List item marker
11.	MD	Modal
12.	NN	Noun, singular or mass
13.	NNS	Noun, plural
14.	NNP	Proper noun, singular
15.	NNPS	Proper noun, plural
16.	PDT	Predeterminer
17.	POS	Possessive ending
18.	PRP	Personal pronoun
19.	PRP	Possessive pronoun
20.	RB	Adverb
21.	RBR	Adverb, comparative
22.	RBS	Adverb, superlative
23.	RP	Particle
24.	SYM	Symbol
25.	TO	to
26.	UH	Interjection
27.	VB	Verb, base form
28.	VBD	Verb, past tense
29.	VBG	Verb, gerund or present participle
30.	VBN	Verb, past participle
31.	VBP	Verb, non-3rd person singular present
32.	VBZ	Verb, 3rd person singular present
33.	WDT	Wh-determiner
34.	WP	Wh-pronoun
35.	WP	Possessive wh-pronoun
36.	WRB	Wh-adverb



In [7]:

    
#https://spacy.io/docs/api/token
doc_ps = nlp("Mr.Sakamoto told us the Dragon Fruits was very yummy!") 
#for t in doc:
t = doc_ps[2]
print("token:",t)
print("vocab (The vocab object of the parent Doc):", t.vocab)
print("doc (The parent document.):", t.doc)
print("i (The index of the token within the parent document.):", t.i)
print("ent_type_ (Named entity type.):", t.ent_type_)
print("ent_iob_ (IOB code of named entity tag):", t.ent_iob_)
print("ent_id_ (ID of the entity the token is an instance of):", t.ent_id_)
print("lemma_ (Base form of the word, with no inflectional suffixes.):", t.lemma_)
print("lower_ (Lower-case form of the word.):", t.lower_)
print("shape_ (A transform of the word's string, to show orthographic features.):", t.shape_)
print("prefix_ (Integer ID of a length-N substring from the start of the word):", t.prefix_)
print("suffix_ (Length-N substring from the end of the word):", t.suffix_)
print("like_url (Does the word resemble a URL?):", t.like_url)
print("like_num (Does the word represent a number? ):", t.like_num)
print("like_email (Does the word resemble an email address?):", t.like_email)
print("is_oov (Is the word out-of-vocabulary?):", t.is_oov)
print("is_stop (Is the word part of a stop list?):", t.is_stop)
print("pos_ (Coarse-grained part-of-speech.):", t.pos_)
print("tag_ (Fine-grained part-of-speech.):", t.tag_)
print("dep_ (Syntactic dependency relation.):", t.dep_)
print("lang_ (Language of the parent document's vocabulary.):", t.lang_)
print("prob: (Smoothed log probability estimate of token's type.)", t.prob)
print("idx (The character offset of the token within the parent document.):", t.idx)
print("sentiment (A scalar value indicating the positivity or negativity of the token):", t.sentiment)
print("lex_id (ID of the token's lexical type.):", t.lex_id)
print("text (Verbatim text content.):", t.text)
print("text_with_ws (Text content, with trailing space character if present.):", t.text_with_ws)
print("whitespace_ (Trailing space character if present.):", t.whitespace_)









    



token: Sakamoto
vocab (The vocab object of the parent Doc): <spacy.vocab.Vocab object at 0x10c677598>
doc (The parent document.): Mr.Sakamoto told us the Dragon Fruits was very yummy!
i (The index of the token within the parent document.): 2
ent_type_ (Named entity type.): PERSON
ent_iob_ (IOB code of named entity tag): B
ent_id_ (ID of the entity the token is an instance of): 
lemma_ (Base form of the word, with no inflectional suffixes.): sakamoto
lower_ (Lower-case form of the word.): sakamoto
shape_ (A transform of the word's string, to show orthographic features.): Xxxxx
prefix_ (Integer ID of a length-N substring from the start of the word): S
suffix_ (Length-N substring from the end of the word): oto
like_url (Does the word resemble a URL?): False
like_num (Does the word represent a number? ): False
like_email (Does the word resemble an email address?): False
is_oov (Is the word out-of-vocabulary?): False
is_stop (Is the word part of a stop list?): False
pos_ (Coarse-grained part-of-speech.): PROPN
tag_ (Fine-grained part-of-speech.): NNP
dep_ (Syntactic dependency relation.): nsubj
lang_ (Language of the parent document's vocabulary.): en
prob: (Smoothed log probability estimate of token's type.) -19.579313278198242
idx (The character offset of the token within the parent document.): 3
sentiment (A scalar value indicating the positivity or negativity of the token): 0.0
lex_id (ID of the token's lexical type.): 518442
text (Verbatim text content.): Sakamoto
text_with_ws (Text content, with trailing space character if present.): Sakamoto 
whitespace_ (Trailing space character if present.):

Dependency Analysis



In [8]:

    
doc_dep = nlp(u'I like chicken rice and Laksa.')
for np in doc_dep.noun_chunks:
    print((np.text, np.root.text, np.root.dep_, np.root.head.text))









    



('I', 'I', 'nsubj', 'like')
('chicken rice', 'rice', 'dobj', 'like')
('Laksa', 'Laksa', 'conj', 'rice')



In [9]:

    
for t in doc_dep:
    print((t.text, t.dep_, t.tag_))









    



('I', 'nsubj', 'PRP')
('like', 'ROOT', 'VBP')
('chicken', 'compound', 'NN')
('rice', 'dobj', 'NN')
('and', 'cc', 'CC')
('Laksa', 'conj', 'NNP')
('.', 'punct', '.')

Visualization using displaCy (https://demos.explosion.ai/displacy/)



In [10]:

    
for token in doc_dep:
    # Orth: Original, Head: head of subtree
    print((token.text, token.dep_, token.n_lefts, token.n_rights, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights]))









    



('I', 'nsubj', 0, 0, 'like', [], [])
('like', 'ROOT', 1, 2, 'like', ['I'], ['rice', '.'])
('chicken', 'compound', 0, 0, 'rice', [], [])
('rice', 'dobj', 1, 2, 'like', ['chicken'], ['and', 'Laksa'])
('and', 'cc', 0, 0, 'rice', [], [])
('Laksa', 'conj', 0, 0, 'rice', [], [])
('.', 'punct', 0, 0, 'like', [], [])



In [11]:

    
dependency_pattern = '{left}<---{word}[{w_type}]--->{right}\n--------'



In [12]:

    
for token in doc_dep:
    print (dependency_pattern.format(word=token.orth_, 
                                  w_type=token.dep_,
                                  left=[t.orth_ for t in token.lefts],
                                  right=[t.orth_ for t in token.rights]))









    



[]<---I[nsubj]--->[]
--------
['I']<---like[ROOT]--->['rice', '.']
--------
[]<---chicken[compound]--->[]
--------
['chicken']<---rice[dobj]--->['and', 'Laksa']
--------
[]<---and[cc]--->[]
--------
[]<---Laksa[conj]--->[]
--------
[]<---.[punct]--->[]
--------

Head and Child in dependency tree

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head.
https://spacy.io/docs/usage/dependency-parse



In [13]:

    
for t in doc_dep:
    print((t.text, t.dep_,t.tag_,t.pos_),(t.head.text, t.head.dep_,t.head.tag_,t.head.pos_))









    



('I', 'nsubj', 'PRP', 'PRON') ('like', 'ROOT', 'VBP', 'VERB')
('like', 'ROOT', 'VBP', 'VERB') ('like', 'ROOT', 'VBP', 'VERB')
('chicken', 'compound', 'NN', 'NOUN') ('rice', 'dobj', 'NN', 'NOUN')
('rice', 'dobj', 'NN', 'NOUN') ('like', 'ROOT', 'VBP', 'VERB')
('and', 'cc', 'CC', 'CCONJ') ('rice', 'dobj', 'NN', 'NOUN')
('Laksa', 'conj', 'NNP', 'PROPN') ('rice', 'dobj', 'NN', 'NOUN')
('.', 'punct', '.', 'PUNCT') ('like', 'ROOT', 'VBP', 'VERB')

Verb extraction



In [14]:

    
# Load symbols
from spacy.symbols import nsubj, VERB
verbs = set()



In [15]:

    
for token in doc:
    print ((token, token.dep, token.head, token.head.pos))
    if token.dep == nsubj and token.head.pos == VERB:
        verbs.add(token.head)









    



(We, 425, living, 98)
(are, 401, living, 98)
(living, 512817, living, 98)
(in, 439, living, 98)
(Singapore, 435, in, 83)
(., 441, living, 98)
(
, 0, ., 95)
(It, 425, blazing, 98)
('s, 401, blazing, 98)
(blazing, 512817, blazing, 98)
(outside, 396, blazing, 98)
(today, 424, blazing, 98)
(!, 441, blazing, 98)
(
, 0, !, 95)



In [16]:

    
verbs









    Out[16]:





{living, blazing}

Extract similar words



In [17]:

    
from numpy import dot
from numpy.linalg import norm

# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))



In [18]:

    
target_word = 'Singapore'
sing = nlp.vocab[target_word]
sing









    Out[18]:





<spacy.lexeme.Lexeme at 0x10c2e4ee8>



In [19]:

    
# gather all known words except for taget word
all_words = list({w for w in nlp.vocab if w.has_vector and w.orth_.islower() and w.lower_ != target_word.lower()})
len(all_words)









    Out[19]:





7681



In [20]:

    
# sort by similarity
#all_words.sort(key=lambda w: cosine(w.vector, sing.vector))
#all_words.reverse()
#print("Top 10 most similar words to",target_word)
#for word in all_words[:10]:   
#    print(word.orth_)

Vector representation



In [21]:

    
country1 = nlp.vocab['china']
race1 = nlp.vocab['chinese']
country2 = nlp.vocab['japan']
result = country1.vector - race1.vector + country2.vector



In [22]:

    
all_words = list({w for w in nlp.vocab if w.has_vector and w.orth_.islower() and w.lower_ != "china" and w.lower_ != "chinese" and w.lower_ != "japan"})



In [23]:

    
all_words.sort(key=lambda w: cosine(w.vector, result))
all_words[0].orth_









    Out[23]:





'japanese'



In [24]:

    
# Top 3 results
for word in all_words[:3]:   
    print(word.orth_)









    



japanese
asian
vegetarian

Entity Recognition



In [25]:

    
example_sent = "NTUC has raised S$25 million to help workers re-skill and upgrade their skills, secretary-general Chan Chun Sing said at the May Day Rally on Monday "
parsed = nlp(example_sent)
for token in parsed:
    print((token.orth_, token.ent_type_ if token.ent_type_ != "" else "(not an entity)"))









    



('NTUC', 'ORG')
('has', '(not an entity)')
('raised', '(not an entity)')
('S$25', 'CARDINAL')
('million', 'CARDINAL')
('to', '(not an entity)')
('help', '(not an entity)')
('workers', '(not an entity)')
('re', '(not an entity)')
('-', '(not an entity)')
('skill', '(not an entity)')
('and', '(not an entity)')
('upgrade', '(not an entity)')
('their', '(not an entity)')
('skills', '(not an entity)')
(',', '(not an entity)')
('secretary', '(not an entity)')
('-', '(not an entity)')
('general', '(not an entity)')
('Chan', 'PERSON')
('Chun', 'PERSON')
('Sing', 'PERSON')
('said', '(not an entity)')
('at', '(not an entity)')
('the', 'DATE')
('May', 'DATE')
('Day', 'DATE')
('Rally', '(not an entity)')
('on', '(not an entity)')
('Monday', 'DATE')

Visualization using displaCy Named Entity Visualizer (https://demos.explosion.ai/displacy-ent/)

List of entity types

https://spacy.io/docs/usage/entity-recognition

Type	Description
`PERSON`	People, including fictional.
`NORP`	Nationalities or religious or political groups.
`FACILITY`	Buildings, airports, highways, bridges, etc.
`ORG`	Companies, agencies, institutions, etc.
`GPE`	Countries, cities, states.
`LOC`	Non-GPE locations, mountain ranges, bodies of water.
`PRODUCT`	Objects, vehicles, foods, etc. (Not services.)
`EVENT`	Named hurricanes, battles, wars, sports events, etc.
`WORK_OF_ART`	Titles of books, songs, etc.
`LANGUAGE`	Any named language.

Build own entity recognizer



In [26]:

    
import random
from spacy.gold import GoldParse
from spacy.language import EntityRecognizer

train_data = [
    ('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
    ('I like Bangkok and Buangkok.', [(7, 14, 'LOC'), (19, 27, 'LOC')])
]

nlp2 = spacy.load('en', entity=False, parser=False)
ner = EntityRecognizer(nlp2.vocab, entity_types=['PERSON', 'LOC'])

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc2 = nlp2.make_doc(raw_text)
        gold = GoldParse(doc2, entities=entity_offsets)

        nlp.tagger(doc2)
        ner.update(doc2, gold)
ner.model.end_training()
nlp.save_to_directory('./sample_ner/')



In [27]:

    
nlp3 = spacy.load('en', path='./sample_ner/')
example_sent = "Who is Tai Seng Tan?"
doc3 = nlp3(example_sent)
for ent in doc3.ents:
            print(ent.label_, ent.text)









    



PERSON Tai Seng Tan

Reference

https://spacy.io/
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
https://spacy.io/docs/usage/pos-tagging

[Installation]
pip install spacy
python -m spacy download en



In [ ]: