spaCy introduction

Load spaCy resources

In [1]:
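# Import spaCy and load the English model
import spacy

# 'en' is a shortcut name from older spaCy releases; spaCy v3 removed shortcuts,
# so there you would load a named package such as 'en_core_web_sm' instead.
nlp = spacy.load('en')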

Loading spaCy can take a while. In the meantime, here are a few definitions to help you on your NLP journey.

What are Stop Words?

Stop words are common words in a vocabulary that are of little value when considering word frequencies in text, because they provide little information about what a sentence is telling the reader.

Example: "the", "and", "a", "are", "is"
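To make the idea concrete, here is a minimal sketch in plain Python (no spaCy needed) that counts word frequencies while skipping stop words. The `STOP_WORDS` set and the sample sentence are made up for illustration; real stop-word lists are much longer.

```python
from collections import Counter

# Hypothetical, hand-picked stop-word list (illustration only)
STOP_WORDS = {"the", "and", "a", "are", "is"}

def content_word_counts(text):
    """Count word frequencies in `text`, ignoring stop words."""
    words = text.lower().split()
    return Counter(w for w in words if w not in STOP_WORDS)

counts = content_word_counts("the cat and the dog are friends and the cat is happy")
print(counts.most_common(1))  # [('cat', 2)]
```

Without the filter, "the" would top the frequency list even though it says nothing about the topic of the sentence.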

What is a Corpus?

A corpus (plural: corpora) is a large collection of text or documents, and it can provide useful training data for NLP models. A corpus might be built from transcribed speech or from a collection of manuscripts. Items in a corpus are not necessarily unique, and word frequency counts can help uncover the structure of a corpus.

Examples:
  1. Every word written in the complete works of Shakespeare
  2. Every word spoken on BBC Radio channels for the past 30 years
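As a rough sketch of what counting word frequencies in a corpus looks like, the snippet below treats three short lines as a toy corpus and counts words across all of them. The `corpus` list is a stand-in chosen for illustration; a real corpus would hold thousands of documents.

```python
from collections import Counter

# Toy corpus: each item stands in for one document
corpus = [
    "to be or not to be",
    "all the world is a stage",
    "to thine own self be true",
]

word_counts = Counter()
for document in corpus:
    # Naive whitespace split; a real pipeline would use a proper tokeniser
    word_counts.update(document.split())

print(word_counts["to"])  # 3
print(len(word_counts))   # 14 distinct words
```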

Process text

In [2]:
# Process sentences 'Hello, world. Natural Language Processing in 10 lines of code.' using spaCy
doc = nlp(u'Hello, world. Natural Language Processing in 10 lines of code.')

Get tokens and sentences

What is a Token?

A token is a single element chopped out of a sentence, such as a word, a punctuation mark or a group of words, that serves as a unit of analysis. The task of chopping a sentence up is called "tokenisation".

Example: The following sentence can be tokenised by splitting it into individual words.

"Cytora is going to PyCon!"

In [3]:
# Get first token of the processed document
token = doc[0]

# Print sentences (one sentence per line)
for sent in doc.sents:
    print(sent)

Hello, world.
Natural Language Processing in 10 lines of code.

Part of speech tags

What is a Part-of-Speech Tag?

A part-of-speech (POS) tag is a context-sensitive label for the grammatical role a word plays in its sentence. More information about the kinds of part-of-speech tags which are used in NLP can be found here.

Examples:

  1. NUM, Numeral - 1, 2, 3
  2. PROPN, Proper Noun - "Matic", "Andraz", "Cardiff"
  3. INTJ, Interjection - "Uhhhhhhhhhhh"

In [4]:
# For each token, print corresponding part of speech tag
for token in doc:
    print('{} - {}'.format(token, token.pos_))

Hello - INTJ
world - NOUN
Natural - PROPN
Language - PROPN
Processing - PROPN
in - ADP
10 - NUM
lines - NOUN
of - ADP
code - NOUN

Visual part of speech tagging (displaCy)

Syntactic dependencies

What are syntactic dependencies?

We have the part-of-speech tags and all of the tokens in a sentence, but how do we relate the two to uncover the syntax of the sentence? Syntactic dependencies describe how the words in a sentence relate to one another: each word attaches to a head word, and these attachments form a tree. This is important in NLP for extracting structure and understanding grammar in plain text.
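The tree structure can be sketched without spaCy by storing, for each word, the index of the word it attaches to. The `words` and `heads` lists below are written by hand for the second sentence used in this notebook, and `path_to_root` is a hypothetical helper; by the convention assumed here, the root word points to itself.

```python
# Hand-written toy parse: heads[i] is the index of the word that word i attaches to.
# The root ("Processing") is its own head.
words = ["Natural", "Language", "Processing", "in", "10", "lines", "of", "code", "."]
heads = [2, 2, 2, 2, 5, 3, 5, 6, 2]

def path_to_root(index):
    """Follow head links from the word at `index` up to the root, inclusive."""
    path = [words[index]]
    while heads[index] != index:
        index = heads[index]
        path.append(words[index])
    return path

print(path_to_root(7))  # ['code', 'of', 'lines', 'in', 'Processing']
```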


In [5]:
# Write a function that walks up the syntactic tree of the given token and collects all tokens to the root token (including root token).

def tokens_to_root(token):
    """
    Walk up the syntactic tree, collecting tokens to the root of the given `token`.
    :param token: Spacy token
    :return: list of Spacy tokens
    """
    tokens_to_r = []
    # The root token is its own head; compare positions rather than object identity
    while token.head.i != token.i:
        tokens_to_r.append(token)
        token = token.head
    tokens_to_r.append(token)  # include the root token itself

    return tokens_to_r

# For every token in document, print its tokens to the root
for token in doc:
    print('{} --> {}'.format(token, tokens_to_root(token)))

# Print dependency labels of the tokens
for token in doc:
    print('-> '.join(['{}-{}'.format(dependent_token, dependent_token.dep_) for dependent_token in tokens_to_root(token)]))

Hello --> [Hello]
, --> [,, Hello]
world --> [world, Hello]
. --> [., Hello]
Natural --> [Natural, Processing]
Language --> [Language, Processing]
Processing --> [Processing]
in --> [in, Processing]
10 --> [10, lines, in, Processing]
lines --> [lines, in, Processing]
of --> [of, lines, in, Processing]
code --> [code, of, lines, in, Processing]
. --> [., Processing]

Hello-ROOT
,-punct-> Hello-ROOT
world-npadvmod-> Hello-ROOT
.-punct-> Hello-ROOT
Natural-compound-> Processing-ROOT
Language-compound-> Processing-ROOT
Processing-ROOT
in-prep-> Processing-ROOT
10-nummod-> lines-pobj-> in-prep-> Processing-ROOT
lines-pobj-> in-prep-> Processing-ROOT
of-prep-> lines-pobj-> in-prep-> Processing-ROOT
code-pobj-> of-prep-> lines-pobj-> in-prep-> Processing-ROOT
.-punct-> Processing-ROOT

Named entities

What is a Named Entity?

A named entity is any real-world object, such as a person, location, organisation or product, that has a proper name.

Examples:
1. Barack Obama
2. Edinburgh
3. Ferrari Enzo

In [6]:
# Print all named entities with named entity types

doc_2 = nlp(u"I went to Paris where I met my old friend Jack from uni.")
for ent in doc_2.ents:
    print('{} - {}'.format(ent, ent.label_))

Paris - GPE

Noun chunks

What is a Noun Chunk?

Noun chunks are phrases built around a noun (the noun together with the words that describe it), recovered from tokenized text using the part-of-speech tags.

Example: The sentence "The boy saw the yellow dog" contains two nouns, "boy" and "dog". The noun chunks are therefore:

1. "The boy"
2. "the yellow dog"
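The example above can be reproduced with a deliberately naive rule: collect determiners and adjectives, and close the chunk when a noun arrives. This is a sketch for intuition only, not spaCy's algorithm; the tags in `tagged` are assigned by hand and `simple_noun_chunks` is a hypothetical helper.

```python
# Hand-assigned (word, tag) pairs; a real pipeline would predict these tags
tagged = [("The", "DET"), ("boy", "NOUN"), ("saw", "VERB"),
          ("the", "DET"), ("yellow", "ADJ"), ("dog", "NOUN")]

def simple_noun_chunks(tagged_words):
    """Group runs of determiners/adjectives that end in a noun."""
    chunks, current = [], []
    for word, tag in tagged_words:
        if tag in ("DET", "ADJ"):
            current.append(word)
        elif tag == "NOUN":
            current.append(word)
            chunks.append(" ".join(current))
            current = []
        else:
            current = []  # any other tag breaks the chunk
    return chunks

print(simple_noun_chunks(tagged))  # ['The boy', 'the yellow dog']
```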

In [7]:
# Print noun chunks for doc_2
print([chunk for chunk in doc_2.noun_chunks])

[I, Paris, I, my old friend, uni]

In [11]:
# Note that each chunk is a Span object, not a string
print([[chunk,type(chunk)] for chunk in doc_2.noun_chunks])

[[I, <class 'spacy.tokens.span.Span'>], [Paris, <class 'spacy.tokens.span.Span'>], [I, <class 'spacy.tokens.span.Span'>], [my old friend, <class 'spacy.tokens.span.Span'>], [uni, <class 'spacy.tokens.span.Span'>]]

In [12]:
# Print the unique noun chunks of doc_2 as strings
print(set([str(chunk) for chunk in doc_2.noun_chunks]))

{'my old friend', 'uni', 'Paris', 'I'}

Unigram probabilities

In [34]:
# For every token in doc_2, store the log-probability of the word, estimated from counts from a large corpus
tok_dic = {}
for token in doc_2:
    #print(token, ',', token.prob)
    tok_dic[token.text] = token.prob

In [35]:
tok_dic

{'.': -3.0729479789733887,
 'I': -4.064180850982666,
 'Jack': -11.20296573638916,
 'Paris': -11.6917724609375,
 'friend': -8.825821876525879,
 'from': -6.028810501098633,
 'met': -9.784490585327148,
 'my': -5.918124675750732,
 'old': -7.7954816818237305,
 'to': -3.83851957321167,
 'uni': -19.579313278198242,
 'went': -8.474893569946289,
 'where': -7.183883190155029}
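To see where numbers like these come from, here is a minimal sketch that estimates unigram log-probabilities from raw counts. It uses unsmoothed estimates on a made-up sentence, whereas spaCy's values are smoothed and come from a much larger corpus, so the absolute numbers are not comparable. `unigram_log_probs` is a hypothetical helper.

```python
import math
from collections import Counter

def unigram_log_probs(tokens):
    """Return log(count / total) for every distinct token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: math.log(count / total) for word, count in counts.items()}

log_probs = unigram_log_probs("the cat sat on the mat".split())

# Frequent words have log-probabilities closer to zero
print(log_probs["the"] > log_probs["cat"])  # True
```

This also explains the pattern in the output above: common words such as "to" and "I" score close to zero, while rare words such as "uni" are strongly negative.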

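In [36]:
# Stand-in for a helper from the author's private wgonglib module,
# which is not distributed with this notebook:
# return (value, key) pairs sorted by value
def sort_dict_by_val(d, reverse=False):
    return sorted(((v, k) for k, v in d.items()), reverse=reverse)

In [40]:
sort_dict_by_val(tok_dic, reverse=True)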

[(-3.0729479789733887, '.'),
 (-3.83851957321167, 'to'),
 (-4.064180850982666, 'I'),
 (-5.918124675750732, 'my'),
 (-6.028810501098633, 'from'),
 (-7.183883190155029, 'where'),
 (-7.7954816818237305, 'old'),
 (-8.474893569946289, 'went'),
 (-8.825821876525879, 'friend'),
 (-9.784490585327148, 'met'),
 (-11.20296573638916, 'Jack'),
 (-11.6917724609375, 'Paris'),
 (-19.579313278198242, 'uni')]

In [39]:
sorted(tok_dic.items())

[('.', -3.0729479789733887),
 ('I', -4.064180850982666),
 ('Jack', -11.20296573638916),
 ('Paris', -11.6917724609375),
 ('friend', -8.825821876525879),
 ('from', -6.028810501098633),
 ('met', -9.784490585327148),
 ('my', -5.918124675750732),
 ('old', -7.7954816818237305),
 ('to', -3.83851957321167),
 ('uni', -19.579313278198242),
 ('went', -8.474893569946289),
 ('where', -7.183883190155029)]

Word embedding / Similarity

What are Word embeddings?

A word embedding is a representation of a word, and by extension of a whole language corpus, as a vector or other numerical mapping. This allows words to be treated numerically, with word similarity represented as spatial distance in the dimensions of the embedding.
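One common way to turn vectors into a similarity score is cosine similarity, the cosine of the angle between two vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (not real embeddings)
apple = [0.9, 0.8, 0.1]
orange = [0.8, 0.9, 0.2]
boot = [0.1, 0.2, 0.9]

# Vectors pointing in similar directions score close to 1
print(cosine_similarity(apple, orange) > cosine_similarity(apple, boot))  # True
```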


With word embeddings, vector operations describe word similarity, so statements about how words relate to one another can be checked with vector arithmetic.


In [42]:
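# For a given document, calculate similarity
# between 'apples' and 'oranges' and 'boots' and 'hippos'
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]

# Token.similarity compares the word vectors of the two tokens
print(apples.similarity(oranges))
print(boots.similarity(hippos))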


In [43]:
# Print similarity between each sentence and the word 'fruit'
apples_sent, boots_sent = doc.sents
fruit = doc.vocab[u'fruit']
print(apples_sent.similarity(fruit))
print(boots_sent.similarity(fruit))
