Intro to spaCy


In [1]:
!pip install spacy nltk



spaCy is an NLP/computational linguistics package built from the ground up. It is written in Cython, so it is fast.

Let's check it out. Here's some text from Alice in Wonderland, available for free on Project Gutenberg.


In [2]:
text = """'Please would you tell me,' said Alice, a little timidly, for she was not quite sure whether it was good manners for her to speak first, 'why your cat grins like that?'
'It's a Cheshire cat,' said the Duchess, 'and that's why. Pig!'
She said the last word with such sudden violence that Alice quite jumped; but she saw in another moment that it was addressed to the baby, and not to her, so she took courage, and went on again:—
'I didn't know that Cheshire cats always grinned; in fact, I didn't know that cats could grin.'
'They all can,' said the Duchess; 'and most of 'em do.'
'I don't know of any that do,' Alice said very politely, feeling quite pleased to have got into a conversation.
'You don't know much,' said the Duchess; 'and that's a fact.'"""

Download and load the model. spaCy has an excellent English NLP processor with the following features, which we shall explore:

  • Entity recognition
  • Dependency Parsing
  • Part of Speech tagging
  • Word Vectorization
  • Tokenization
  • Lemmatization
  • Noun Chunks

Download the model; it may take a while.


In [3]:
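import spacy
import spacy.en.download
# Uncomment on the first run to download the English model data.
# spacy.en.download.main()
# Note: spacy.en is the API of early spaCy releases; newer releases load models with spacy.load().
processor = spacy.en.English()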

In [4]:
processed_text = processor(text)
processed_text


Out[4]:
'Please would you tell me,' said Alice, a little timidly, for she was not quite sure whether it was good manners for her to speak first, 'why your cat grins like that?'
'It's a Cheshire cat,' said the Duchess, 'and that's why. Pig!'
She said the last word with such sudden violence that Alice quite jumped; but she saw in another moment that it was addressed to the baby, and not to her, so she took courage, and went on again:—
'I didn't know that Cheshire cats always grinned; in fact, I didn't know that cats could grin.'
'They all can,' said the Duchess; 'and most of 'em do.'
'I don't know of any that do,' Alice said very politely, feeling quite pleased to have got into a conversation.
'You don't know much,' said the Duchess; 'and that's a fact.'

Looks like the same text? Let's dig a little deeper.

Tokenization

Sentences


In [5]:
for n, sentence in enumerate(processed_text.sents):
    print(n, sentence)


0 'Please would you tell me,' said Alice, a little timidly, for she was not quite sure whether it was good manners for her to speak first, 'why your cat grins like that?'
'It's a Cheshire cat
1 ,' said the Duchess, 'and that's why.
2 Pig!'

3 She said the last word with such sudden violence that Alice quite jumped; but she saw in another moment that it was addressed to the baby, and not to her, so she took courage, and went on again:—
'I didn't know that Cheshire cats always grinned; in fact, I didn't know that cats could grin.'
'They all can,' said the Duchess; 'and most of 'em do.'
'I don't know of any that do,'
4 Alice said very politely, feeling quite pleased to have got into a conversation.
'
5 You don't know much,' said the Duchess; 'and that's a fact.'

Words and Punctuation - Along with POS tagging


In [6]:
n = 0
for sentence in processed_text.sents:
    for token in sentence:
        print(n, token, token.pos_, token.lemma_)
        n+=1


0 ' PUNCT '
1 Please INTJ please
2 would VERB would
3 you PRON you
4 tell VERB tell
5 me PRON me
6 , PUNCT ,
7 ' PUNCT '
8 said VERB say
9 Alice PROPN alice
10 , PUNCT ,
11 a DET a
12 little ADJ little
13 timidly ADV timidly
14 , PUNCT ,
15 for ADP for
16 she PRON she
17 was VERB be
18 not ADV not
19 quite ADV quite
20 sure ADJ sure
21 whether ADP whether
22 it PRON it
23 was VERB be
24 good ADJ good
25 manners NOUN manner
26 for ADP for
27 her PRON her
28 to PART to
29 speak VERB speak
30 first ADV first
31 , PUNCT ,
32 ' PUNCT '
33 why ADV why
34 your ADJ your
35 cat NOUN cat
36 grins VERB grin
37 like ADP like
38 that DET that
39 ? PUNCT ?
40 ' PUNCT '
41 
 SPACE 

42 ' PUNCT '
43 It PRON it
44 's VERB '
45 a DET a
46 Cheshire PROPN cheshire
47 cat NOUN cat
48 , PUNCT ,
49 ' PUNCT '
50 said VERB say
51 the DET the
52 Duchess PROPN duchess
53 , PUNCT ,
54 ' PUNCT '
55 and CONJ and
56 that DET that
57 's VERB '
58 why ADV why
59 . PUNCT .
60 Pig PROPN pig
61 ! PUNCT !
62 ' PUNCT '
63 
 SPACE 

64 She PRON she
65 said VERB say
66 the DET the
67 last ADJ last
68 word NOUN word
69 with ADP with
70 such ADJ such
71 sudden ADJ sudden
72 violence NOUN violence
73 that ADJ that
74 Alice PROPN alice
75 quite ADV quite
76 jumped VERB jump
77 ; PUNCT ;
78 but CONJ but
79 she PRON she
80 saw VERB saw
81 in ADP in
82 another DET another
83 moment NOUN moment
84 that ADJ that
85 it PRON it
86 was VERB be
87 addressed VERB address
88 to ADP to
89 the DET the
90 baby NOUN baby
91 , PUNCT ,
92 and CONJ and
93 not ADV not
94 to ADP to
95 her PRON her
96 , PUNCT ,
97 so ADV so
98 she PRON she
99 took VERB take
100 courage NOUN courage
101 , PUNCT ,
102 and CONJ and
103 went VERB go
104 on ADP on
105 again:— PROPN again:—
106 
 SPACE 

107 ' PUNCT '
108 I PRON i
109 did VERB do
110 n't ADV not
111 know VERB know
112 that ADP that
113 Cheshire PROPN cheshire
114 cats NOUN cat
115 always ADV always
116 grinned VERB grin
117 ; PUNCT ;
118 in ADP in
119 fact NOUN fact
120 , PUNCT ,
121 I PRON i
122 did VERB do
123 n't ADV not
124 know VERB know
125 that ADP that
126 cats NOUN cat
127 could VERB could
128 grin VERB grin
129 . PUNCT .
130 ' PUNCT '
131 
 SPACE 

132 ' PUNCT '
133 They PRON they
134 all DET all
135 can VERB can
136 , PUNCT ,
137 ' PUNCT '
138 said VERB say
139 the DET the
140 Duchess NOUN duchess
141 ; PUNCT ;
142 ' PUNCT '
143 and CONJ and
144 most ADJ most
145 of ADP of
146 'em PRON 'em
147 do VERB do
148 . PUNCT .
149 ' PUNCT '
150 
 SPACE 

151 ' PUNCT '
152 I PRON i
153 do VERB do
154 n't ADV not
155 know VERB know
156 of ADP of
157 any DET any
158 that ADJ that
159 do VERB do
160 , PUNCT ,
161 ' PUNCT '
162 Alice PROPN alice
163 said VERB say
164 very ADV very
165 politely ADV politely
166 , PUNCT ,
167 feeling VERB feel
168 quite ADV quite
169 pleased ADJ pleased
170 to PART to
171 have VERB have
172 got VERB get
173 into ADP into
174 a DET a
175 conversation NOUN conversation
176 . PUNCT .
177 
 SPACE 

178 ' PUNCT '
179 You PRON you
180 do VERB do
181 n't ADV not
182 know VERB know
183 much ADJ much
184 , PUNCT ,
185 ' PUNCT '
186 said VERB say
187 the DET the
188 Duchess NOUN duchess
189 ; PUNCT ;
190 ' PUNCT '
191 and CONJ and
192 that DET that
193 's VERB '
194 a DET a
195 fact NOUN fact
196 . PUNCT .
197 ' PUNCT '
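
The `pos_` and `lemma_` attributes printed above are plain strings, so standard-library tools are enough to summarize them. The sketch below is only an illustration: `tagged` is a hand-copied sample of (text, POS, lemma) rows from the output above (tokens 8-13 and 35-36) rather than a live spaCy document, and the variable names are my own.

```python
from collections import Counter, defaultdict

# Hand-copied (text, pos, lemma) rows from the output above; not a live spaCy doc.
tagged = [
    ("said", "VERB", "say"),
    ("Alice", "PROPN", "alice"),
    (",", "PUNCT", ","),
    ("a", "DET", "a"),
    ("little", "ADJ", "little"),
    ("timidly", "ADV", "timidly"),
    ("cat", "NOUN", "cat"),
    ("grins", "VERB", "grin"),
]

# How often each part-of-speech tag occurs in the sample.
pos_counts = Counter(pos for _, pos, _ in tagged)

# Lemmas grouped by tag, skipping punctuation and whitespace tokens.
lemmas_by_pos = defaultdict(list)
for text, pos, lemma in tagged:
    if pos not in ("PUNCT", "SPACE"):
        lemmas_by_pos[pos].append(lemma)

print(pos_counts.most_common(1))
print(dict(lemmas_by_pos))
```

With a real document, the same loop could presumably run over `(token.text, token.pos_, token.lemma_)` for each token instead of the hand-copied rows.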

Entity Recognition


In [7]:
for entity in processed_text.ents:
    print(entity, entity.label_)


Alice PERSON
first ORDINAL
Cheshire GPE
Alice PERSON
Cheshire GPE
Alice PERSON

Noun Chunks


In [8]:
for noun_chunk in processed_text.noun_chunks:
    print(noun_chunk)


you
me
Alice
she
it
good manners
her
your cat
It
a Cheshire cat
the Duchess
She
the last word
such sudden violence
Alice
she
another moment
it
the baby
her
she
courage
again:—
I
Cheshire cats
fact
I
cats
They
the Duchess
'em
I
Alice
a conversation
You
the Duchess
a fact

The Semi-Holy Grail - Syntactic Dependency Parsing. See the demo for clarity.


In [9]:
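def pr_tree(word, level):
    """Print the subtree rooted at `word` in sentence order, indented by depth."""
    # Skip punctuation tokens (and anything attached beneath them).
    if word.is_punct:
        return
    # Children to the left of the word come first...
    for child in word.lefts:
        pr_tree(child, level+1)
    print('\t'* level + word.text + ' - ' + word.dep_)
    # ...then children to the right.
    for child in word.rights:
        pr_tree(child, level+1)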

In [10]:
for sentence in processed_text.sents:
    pr_tree(sentence.root, 0)
    print('-------------------------------------------')


		Please - intj
		would - aux
		you - nsubj
	tell - ccomp
		me - dobj
said - ROOT
	Alice - nsubj
			a - det
		little - npadvmod
	timidly - advmod
		for - mark
		she - nsubj
	was - advcl
		not - neg
			quite - advmod
		sure - acomp
				whether - mark
				it - nsubj
			was - ccomp
					good - amod
				manners - attr
						for - mark
						her - nsubj
						to - aux
					speak - relcl
						first - advmod
		why - advmod
			your - poss
		cat - nsubj
	grins - ccomp
		like - prep
			that - pobj
		It - nsubj
	's - ccomp
			a - det
			Cheshire - compound
		cat - attr
-------------------------------------------
said - ROOT
		the - det
	Duchess - nsubj
		and - cc
		that - nsubj
	's - conj
		why - ccomp
-------------------------------------------
Pig - ROOT
-------------------------------------------
	She - nsubj
said - ROOT
		the - det
		last - amod
	word - dobj
		with - prep
				such - amod
				sudden - amod
			violence - pobj
			that - nsubj
			Alice - nsubj
			quite - advmod
		jumped - relcl
	but - cc
		she - nsubj
	saw - conj
		in - prep
				another - det
			moment - pobj
			that - mark
			it - nsubjpass
			was - auxpass
		addressed - ccomp
			to - prep
					the - det
				baby - pobj
			and - cc
				not - neg
			to - conj
				her - pobj
		so - advmod
		she - nsubj
	took - conj
		courage - dobj
		and - cc
		went - conj
			on - prep
				again:— - pobj
					
 - 
			I - nsubj
			did - aux
			n't - neg
		know - conj
				that - mark
					Cheshire - compound
				cats - nsubj
				always - advmod
			grinned - ccomp
		in - prep
			fact - pobj
		I - nsubj
		did - aux
		n't - neg
	know - ccomp
			that - mark
			cats - nsubj
			could - aux
		grin - ccomp
		They - nsubj
			all - appos
	can - ccomp
	said - conj
			the - det
		Duchess - dobj
		and - cc
			most - nsubj
				of - prep
					'em - pobj
		do - conj
		I - nsubj
		do - aux
		n't - neg
	know - ccomp
		of - prep
			any - pobj
					that - nsubj
				do - relcl
-------------------------------------------
	Alice - nsubj
said - ROOT
		very - advmod
	politely - advmod
	feeling - advcl
			quite - advmod
		pleased - acomp
				to - aux
				have - aux
			got - xcomp
				into - prep
						a - det
					conversation - pobj
-------------------------------------------
		You - nsubj
		do - aux
		n't - neg
	know - ccomp
		much - dobj
said - ROOT
		the - det
	Duchess - nsubj
	and - cc
		that - nsubj
	's - conj
			a - det
		fact - attr
-------------------------------------------
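
To see why `pr_tree` prints words in sentence order, it can help to run the same in-order traversal on a small hand-built tree. The sketch below does not use spaCy: `Node` is a hypothetical stand-in that mimics only the token attributes `pr_tree` touches (`text`, `dep_`, `is_punct`, `lefts`, `rights`), and the tree is the start of the "Alice said very politely" parse from the output above. It collects the lines in a list instead of printing directly so the result is easy to check.

```python
class Node:
    """Hypothetical stand-in for a spaCy token (not part of spaCy)."""

    def __init__(self, text, dep, lefts=(), rights=(), is_punct=False):
        self.text = text
        self.dep_ = dep
        self.lefts = list(lefts)
        self.rights = list(rights)
        self.is_punct = is_punct


def tree_lines(word, level=0):
    """Return the lines pr_tree would print: left children, the word, right children."""
    if word.is_punct:
        return []
    lines = []
    for child in word.lefts:
        lines.extend(tree_lines(child, level + 1))
    lines.append('\t' * level + word.text + ' - ' + word.dep_)
    for child in word.rights:
        lines.extend(tree_lines(child, level + 1))
    return lines


# "Alice said very politely": 'said' is the root, 'Alice' sits to its left,
# and 'politely' (with its own left child 'very') sits to its right.
politely = Node('politely', 'advmod', lefts=[Node('very', 'advmod')])
root = Node('said', 'ROOT', lefts=[Node('Alice', 'nsubj')], rights=[politely])

for line in tree_lines(root):
    print(line)
```

Because each word is emitted between its left and right children, reading the output top to bottom recovers the original word order, while the indentation shows how far each word is from the root.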

What is 'nsubj'? 'acomp'? See the Universal Dependencies documentation.

Word Vectorization - Word2Vec


In [11]:
proc_fruits = processor('''I think green apples are delicious. 
                            While pears have a strange texture to them. 
                            The bowls they sit in are ugly.''')
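# Unpack the three sentences of the short fruit text.
apples, pears, bowls = proc_fruits.sents
# Look up the vocabulary entry for the word 'fruit'.
fruit = processed_text.vocab['fruit']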
print(apples.similarity(fruit))
print(pears.similarity(fruit))
print(bowls.similarity(fruit))


0.36491481428
0.350866559366
0.272271037182
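
The numbers above are similarity scores between each sentence and the word 'fruit'. To my knowledge spaCy computes these as the cosine similarity between vectors, with a sentence represented by the average of its word vectors; treat that as an assumption to verify against the documentation for your version. The sketch below shows the calculation in plain Python on made-up three-dimensional vectors (real word vectors have hundreds of dimensions). The helper names `average` and `cosine_similarity` are hypothetical, not spaCy functions.

```python
import math


def average(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy vectors, for illustration only.
toy_vectors = {
    'fruit': [1.0, 0.0, 0.0],
    'apples': [0.8, 0.6, 0.0],
    'bowls': [0.0, 0.0, 1.0],
}

# A "sentence" vector built by averaging the vectors of its words.
sentence_vector = average([toy_vectors['apples'], toy_vectors['bowls']])

print(cosine_similarity(toy_vectors['apples'], toy_vectors['fruit']))
print(cosine_similarity(toy_vectors['bowls'], toy_vectors['fruit']))
print(cosine_similarity(sentence_vector, toy_vectors['fruit']))
```

Averaging dilutes the signal from any single word, which may be part of why whole-sentence scores like the ones above sit fairly close together.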

Assignment - In Class

Find your favorite news source and grab the article text.

  1. Show the most common words in the article.
  2. Show the most common words under each part of speech (e.g. NOUN: {'Bob': 12, 'Alice': 4}).
  3. Find a subject/object relationship through the dependency parser in any sentence.
  4. Show the most common Entities and their types.
  5. Find entities and their dependencies (hint: entity.root.head).
  6. Find the most similar words in the article.