In [1]:
# Import spacy and English models
import spacy
nlp = spacy.load('en')
Loading spaCy can take a while; in the meantime, here are a few definitions to help you on your NLP journey.
Stop words are the common words in a vocabulary which are of little value when considering word frequencies in text, because they provide little useful information about what the sentence is telling the reader.
Example: "the", "and", "a", "are", "is"
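As a quick illustration, spaCy can flag stop words on each token (a minimal sketch; whether `token.is_stop` is populated depends on the spaCy version and model you loaded):
In [ ]:
# Flag stop words in a short sentence (sketch; assumes the loaded model populates token.is_stop)
doc_sw = nlp(u'The cat and the dog are on a mat.')
for token in doc_sw:
    print('{:<5} stop word: {}'.format(str(token), token.is_stop))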
A corpus (plural: corpora) is a large collection of text or documents and can provide useful training data for NLP models. A corpus might be built from transcribed speech or a collection of manuscripts. Each item in a corpus is not necessarily unique, and frequency counts of words can assist in uncovering the structure of a corpus.
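For example, a rough word-frequency count over a tiny toy corpus can be built with the standard library (a minimal sketch; the corpus here is made up for illustration):
In [ ]:
# Count word frequencies over a tiny toy corpus (sketch; the corpus is illustrative only)
from collections import Counter

toy_corpus = [u'the cat sat on the mat', u'the dog sat on the log']
word_counts = Counter(word for text in toy_corpus for word in text.split())
print(word_counts.most_common(3))  # e.g. [('the', 4), ('sat', 2), ('on', 2)]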
Examples:
In [2]:
# Process sentences 'Hello, world. Natural Language Processing in 10 lines of code.' using spaCy
doc = nlp(u'Hello, world. Natural Language Processing in 10 lines of code.')
A token is a single chopped-up element of the sentence, which could be a word, a punctuation mark, or a group of words to analyse. The task of chopping the sentence up is called "tokenisation".
Example: The following sentence can be tokenised by splitting up the sentence into individual words.
"Cytora is going to PyCon!"
["Cytora","is","going","to","PyCon!"]
In [3]:
# Get first token of the processed document
token = doc[0]
print(token)
# Print sentences (one sentence per line)
for sent in doc.sents:
    print(sent)
A part-of-speech tag is a context-sensitive description of what a word means in the context of the whole sentence. More information about the kinds of part-of-speech tags used in NLP can be found here.
Examples:
In [4]:
# For each token, print corresponding part of speech tag
for token in doc:
    print('{} - {}'.format(token, token.pos_))
We have the part-of-speech tags and we have all of the tokens in a sentence, but how do we relate the two to uncover the syntax of a sentence? Syntactic dependencies describe how each word relates to the others in a sentence. This is important in NLP for extracting structure and understanding grammar in plain text.
Example:
In [5]:
# Write a function that walks up the syntactic tree of the given token and collects all tokens to the root token (including the root token).
def tokens_to_root(token):
    """
    Walk up the syntactic tree, collecting tokens to the root of the given `token`.
    :param token: spaCy token
    :return: list of spaCy tokens
    """
    tokens_to_r = []
    while token.head is not token:
        tokens_to_r.append(token)
        token = token.head
    tokens_to_r.append(token)
    return tokens_to_r

# For every token in the document, print its path to the root token
for token in doc:
    print('{} --> {}'.format(token, tokens_to_root(token)))

# Print dependency labels along each token's path to the root
for token in doc:
    print('-> '.join(['{}-{}'.format(dependent_token, dependent_token.dep_) for dependent_token in tokens_to_root(token)]))
In [6]:
# Print all named entities with named entity types
doc_2 = nlp(u"I went to Paris where I met my old friend Jack from uni.")
for ent in doc_2.ents:
    print('{} - {}'.format(ent, ent.label_))
In [7]:
# Print noun chunks for doc_2
print([chunk for chunk in doc_2.noun_chunks])
In [11]:
# Note that a chunk is a spaCy Span object, not a string (use str(chunk) or chunk.text for the string)
print([[chunk, type(chunk)] for chunk in doc_2.noun_chunks])
In [12]:
# Print the unique noun chunks of doc_2 as strings
print(set([str(chunk) for chunk in doc_2.noun_chunks]))
In [34]:
# For every token in doc_2, collect the log-probability of the word, estimated from counts over a large corpus
tok_dic = {}
for token in doc_2:
    # print(token, ',', token.prob)
    tok_dic[str(token)] = token.prob
In [35]:
tok_dic
Out[35]:
In [36]:
# Import a personal helper library (wgonglib) from a local path
import sys
if 'zapme' not in ';'.join(sys.path):
    sys.path.insert(0, 'D:\\zapme\\Dropbox\\docs\\codes\\python\\github')
import wgonglib as wg
In [41]:
#sys.path
In [40]:
wg.sort_dict_by_val(tok_dic, reverse=True)
Out[40]:
In [39]:
wg.sort_dict_by_key(tok_dic)
Out[39]:
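If wgonglib is not available, the same orderings can be produced with the standard library alone (a minimal sketch mirroring the two helper calls above):
In [ ]:
# Standard-library equivalents of sort_dict_by_val and sort_dict_by_key (sketch)
sorted(tok_dic.items(), key=lambda kv: kv[1], reverse=True)  # sort by value, descending
sorted(tok_dic.items(), key=lambda kv: kv[0])                # sort by key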
A word embedding is a representation of a word, and by extension a whole language corpus, as a vector or other form of numerical mapping. This allows words to be treated numerically, with word similarity represented as spatial distance in the embedding space.
Example:
With word embeddings we can see that vector operations describe word similarity. This means that relationships between words can be expressed as vector arithmetic, for example:
king - queen == man - woman
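A quick way to check such an analogy is with cosine similarity over the model's word vectors (a minimal sketch; it assumes the loaded model ships with word vectors, and the exact numbers depend on the model):
In [ ]:
# Check the king - man + woman ≈ queen analogy with cosine similarity
# (sketch; assumes the loaded model includes word vectors)
import numpy as np

king = nlp.vocab[u'king'].vector
queen = nlp.vocab[u'queen'].vector
man = nlp.vocab[u'man'].vector
woman = nlp.vocab[u'woman'].vector

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# If the analogy holds, king - man + woman should lie close to queen
print(cosine(king - man + woman, queen))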
In [42]:
# For a given document, calculate similarity
# between 'apples' and 'oranges', and between 'boots' and 'hippos'
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]
print(apples.similarity(oranges))
print(boots.similarity(hippos))
In [43]:
# Print similarity between sentence and word 'fruit'
apples_sent, boots_sent = doc.sents
fruit = doc.vocab[u'fruit']
print(apples_sent.similarity(fruit))
print(boots_sent.similarity(fruit))