In [1]:
# Import spacy and English models
import spacy
nlp = spacy.load('en')
Loading spaCy can take a while; in the meantime, here are a few definitions to help you on your NLP journey.
Stop words are the common words in a vocabulary which are of little value when considering word frequencies in text, because they provide little useful information about what the sentence is telling the reader.
Example: "the", "and", "a", "are", "is"
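As a quick illustration, spaCy can flag stop words on each token (a minimal sketch; whether `token.is_stop` is populated depends on the spaCy version and model you loaded):
In [ ]:
# Flag stop words in a short sentence (sketch; assumes the loaded model populates token.is_stop)
doc_sw = nlp(u'The cat and the dog are on a mat.')
for token in doc_sw:
    print('{:<5} stop word: {}'.format(str(token), token.is_stop))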
A corpus (plural: corpora) is a large collection of text or documents and can provide useful training data for NLP models. A corpus might be built from transcribed speech or a collection of manuscripts. Each item in a corpus is not necessarily unique, and frequency counts of words can assist in uncovering the structure of a corpus.
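For example, a rough word-frequency count over a tiny toy corpus can be built with the standard library (a minimal sketch; the corpus here is made up for illustration):
In [ ]:
# Count word frequencies over a tiny toy corpus (sketch; the corpus is illustrative only)
from collections import Counter

toy_corpus = [u'the cat sat on the mat', u'the dog sat on the log']
word_counts = Counter(word for text in toy_corpus for word in text.split())
print(word_counts.most_common(3))  # e.g. [('the', 4), ('sat', 2), ('on', 2)]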
Examples:
In [2]:
# Process sentences 'Hello, world. Natural Language Processing in 10 lines of code.' using spaCy
doc = nlp(u'Hello, world. Natural Language Processing in 10 lines of code.')
A token is a single chopped-up element of the sentence, which could be a word, a punctuation mark, or a group of words to analyse. The task of chopping the sentence up is called "tokenisation".
Example: The following sentence can be tokenised by splitting up the sentence into individual words.
"Cytora is going to PyCon!"
["Cytora","is","going","to","PyCon!"]
In [3]:
# Get first token of the processed document
token = doc[0]
print(token)
# Print sentences (one sentence per line)
for sent in doc.sents:
    print(sent)
A part-of-speech tag is a context-sensitive description of what a word means in the context of the whole sentence. More information about the kinds of part-of-speech tags used in NLP can be found here.
Examples:
In [4]:
# For each token, print corresponding part of speech tag
for token in doc:
    print('{} - {}'.format(token, token.pos_))
We have the part-of-speech tags and we have all of the tokens in a sentence, but how do we relate the two to uncover the syntax of a sentence? Syntactic dependencies describe how each word relates to the others in a sentence. This is important in NLP for extracting structure and understanding grammar in plain text.
Example:
In [5]:
# Write a function that walks up the syntactic tree of the given token and collects all tokens to the root token (including the root token).
def tokens_to_root(token):
    """
    Walk up the syntactic tree, collecting tokens to the root of the given `token`.
    :param token: spaCy token
    :return: list of spaCy tokens
    """
    tokens_to_r = []
    while token.head is not token:
        tokens_to_r.append(token)
        token = token.head
    tokens_to_r.append(token)
    return tokens_to_r

# For every token in the document, print its path to the root token
for token in doc:
    print('{} --> {}'.format(token, tokens_to_root(token)))

# Print dependency labels along each token's path to the root
for token in doc:
    print('-> '.join(['{}-{}'.format(dependent_token, dependent_token.dep_) for dependent_token in tokens_to_root(token)]))
In [6]:
# Print all named entities with named entity types
doc_2 = nlp(u"I went to Paris where I met my old friend Jack from uni.")
for ent in doc_2.ents:
    print('{} - {}'.format(ent, ent.label_))
In [7]:
# Print noun chunks for doc_2
print([chunk for chunk in doc_2.noun_chunks])
In [11]:
# Note that a chunk is a spaCy Span object, not a string (use str(chunk) or chunk.text for the string)
print([[chunk, type(chunk)] for chunk in doc_2.noun_chunks])
In [12]:
# Print the unique noun chunks of doc_2 as strings
print(set([str(chunk) for chunk in doc_2.noun_chunks]))
In [34]:
# For every token in doc_2, collect the log-probability of the word, estimated from counts over a large corpus
tok_dic = {}
for token in doc_2:
    # print(token, ',', token.prob)
    tok_dic[str(token)] = token.prob
In [35]:
tok_dic
Out[35]:
In [36]:
# Import a personal helper library (wgonglib) from a local path
import sys
if 'zapme' not in ';'.join(sys.path):
    sys.path.insert(0, 'D:\\zapme\\Dropbox\\docs\\codes\\python\\github')
import wgonglib as wg
In [41]:
#sys.path
In [40]:
wg.sort_dict_by_val(tok_dic, reverse=True)
Out[40]:
In [39]:
wg.sort_dict_by_key(tok_dic)
Out[39]:
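If wgonglib is not available, the same orderings can be produced with the standard library alone (a minimal sketch mirroring the two helper calls above):
In [ ]:
# Standard-library equivalents of sort_dict_by_val and sort_dict_by_key (sketch)
sorted(tok_dic.items(), key=lambda kv: kv[1], reverse=True)  # sort by value, descending
sorted(tok_dic.items(), key=lambda kv: kv[0])                # sort by key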
A word embedding is a representation of a word, and by extension a whole language corpus, as a vector or other form of numerical mapping. This allows words to be treated numerically, with word similarity represented as spatial distance in the embedding space.
Example:
With word embeddings we can see that vector operations describe word similarity. This means that relationships between words can be expressed as vector arithmetic, for example:
king - queen == man - woman
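A quick way to check such an analogy is with cosine similarity over the model's word vectors (a minimal sketch; it assumes the loaded model ships with word vectors, and the exact numbers depend on the model):
In [ ]:
# Check the king - man + woman ≈ queen analogy with cosine similarity
# (sketch; assumes the loaded model includes word vectors)
import numpy as np

king = nlp.vocab[u'king'].vector
queen = nlp.vocab[u'queen'].vector
man = nlp.vocab[u'man'].vector
woman = nlp.vocab[u'woman'].vector

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# If the analogy holds, king - man + woman should lie close to queen
print(cosine(king - man + woman, queen))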
In [42]:
# For a given document, calculate similarity
# between 'apples' and 'oranges', and between 'boots' and 'hippos'
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]
print(apples.similarity(oranges))
print(boots.similarity(hippos))
In [43]:
# Print similarity between sentence and word 'fruit'
apples_sent, boots_sent = doc.sents
fruit = doc.vocab[u'fruit']
print(apples_sent.similarity(fruit))
print(boots_sent.similarity(fruit))