Introduction to Spanish NLP Workshop

First, let's load spaCy and a Spanish parsing model; this may take a while.


In [8]:
import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import codecs
import math

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from gensim.models import KeyedVectors

# Display plots in this notebook, instead of externally. 
from pylab import rcParams
rcParams['figure.figsize'] = 16, 8
%matplotlib inline
# Load the Spanish model (install it first with: python -m spacy download es)
nlp = spacy.load('es')

Tokenization and POS Tagging

Now we can pick a sentence and analyze it. A sentence is split into tokens (units of meaning: a token can span more than one word in some cases), and we can get the syntactic function of each one. This is called POS (Part of Speech) tagging.


In [9]:
doc = nlp('Las naranjas y las manzanas se parecen')
pd.DataFrame([[word.text, word.tag_, word.pos_] for word in doc], columns=['Token', 'TAG', 'POS'])


Out[9]:
Token TAG POS
0 Las DET__Gender=Fem|Number=Plur|PronType=Art DET
1 naranjas NOUN__Gender=Fem|Number=Plur NOUN
2 y CCONJ___ CONJ
3 las DET__Definite=Def|Gender=Fem|Number=Plur|PronT... DET
4 manzanas NOUN__Gender=Fem|Number=Plur NOUN
5 se PRON__Person=3 PRON
6 parecen VERB__Mood=Ind|Number=Plur|Person=3|Tense=Pres... VERB
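
If a tag is unfamiliar, spaCy ships a small glossary (available since spaCy 2.0); a quick sketch:

In [ ]:
# Human-readable gloss for a coarse POS tag or detailed tag.
print(spacy.explain('DET'))    # e.g. 'determiner'
print(spacy.explain('CCONJ'))  # e.g. 'coordinating conjunction'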

Language Models

Another key concept in NLP is that of language models. A language model is a function that tells us the likelihood of a sentence appearing in the real world. One such model, albeit a very simple one, multiplies the probabilities of every token in the sentence.


In [17]:
pd.DataFrame([[word.text, math.exp(word.prob), word.prob] for word in doc], columns=['Token', 'Prob', 'Log Prob'])


Out[17]:
Token Prob Log Prob
0 Las 0.000644 -7.347541
1 naranjas 0.000004 -12.366278
2 y 0.020835 -3.871129
3 las 0.007129 -4.943562
4 manzanas 0.000006 -12.064260
5 se 0.007493 -4.893838
6 parecen 0.000036 -10.242378
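
As a sketch of that simple model: multiplying per-token probabilities is the same as summing their log probabilities, which is numerically safer:

In [ ]:
# Unigram language model: the sum of log probabilities equals the log of the product.
sentence_log_prob = sum(word.prob for word in doc)
print(sentence_log_prob, math.exp(sentence_log_prob))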

Named Entity Recognition

We are also interested in extracting the proper entities that appear in a phrase; this problem is referred to as Named Entity Recognition, or NER. Here's an example with different kinds of entities, such as people and locations.


In [23]:
doc2 = nlp(u"La premier alemana Angela Merkel visitó Buenos Aires esta semana")
for ent in doc2.ents:
    print('{} \t {}'.format(ent, ent.label_))


Angela Merkel 	 PERSON
Buenos Aires 	 LOC
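
Each entity span also knows its character offsets into the original text, which is handy for highlighting matches:

In [ ]:
# Character offsets of each entity in the original string.
for ent in doc2.ents:
    print(ent.start_char, ent.end_char, ent.text, ent.label_)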

Word Embeddings

Embeddings are the silver bullet brought from the world of deep learning into the realm of NLP through the word2vec algorithm. They are a meaningful representation of tokens as dense, high-dimensional vectors. They can be obtained from large unlabelled corpora easily available on the web, and they are able to express the syntactic, semantic, morphological and even polysemic richness of a word in a way that a computer can understand.
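
The KeyedVectors import above is meant for exactly this: loading pretrained vectors from disk. A sketch, assuming you have downloaded a word2vec-format file such as the Spanish Billion Words embeddings (the file name below is illustrative):

In [ ]:
# Load pretrained Spanish vectors in word2vec text format (path is hypothetical).
vectors = KeyedVectors.load_word2vec_format('SBW-vectors-300-min5.txt')
vectors.most_similar('naranja', topn=5)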


In [37]:
orange = doc[1]
apple = doc[4]
orange.similarity(apple)


Out[37]:
0.83445842963978367

One amazing property of word vectors is that they represent analogies really well, for instance (see the sketch after this list):

  • Argentina - Macri = Alemania - Merkel
  • Reina - Mujer = Rey - Hombre
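
Below is a minimal sketch checking the second analogy, rearranged as Rey - Hombre + Mujer ≈ Reina. It assumes the loaded model ships word vectors for its vocabulary and brute-forces the nearest-neighbour search (slow, but illustrative); most_similar here is our own helper, not a library call:

In [ ]:
rey, hombre, mujer = nlp('rey hombre mujer')
target = rey.vector - hombre.vector + mujer.vector

def most_similar(vec, exclude, topn=5):
    # Cosine similarity of `vec` against every lowercase word with a vector.
    def cosine(v, w):
        return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w) + 1e-8)
    words = [w for w in nlp.vocab
             if w.has_vector and w.is_lower and w.orth_ not in exclude]
    return sorted(words, key=lambda w: -cosine(vec, w.vector))[:topn]

print([w.orth_ for w in most_similar(target, {'rey', 'hombre', 'mujer'})])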

We can also average out the vectors of a whole sentence to get a similarity between sentences. NLP methods like this one that disregard word order are commonly referred to as Bag of Words.


In [41]:
doc.similarity(doc2)


Out[41]:
0.82128151045732101
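
Under the hood, doc.similarity compares the documents' averaged token vectors with cosine similarity; here is a hand-rolled version of the same bag-of-words comparison:

In [ ]:
# Average the token vectors of each sentence, then compare with cosine similarity.
def avg_vector(d):
    return np.mean([w.vector for w in d], axis=0)

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

cosine(avg_vector(doc), avg_vector(doc2))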

Let's look at some real-world data

These are the last 100 news articles from clarin.com, the biggest news media outlet in Argentina.


In [121]:
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen('http://api-editoriales.clarin.com/api/instant_articles?limit=100').read()
soup = BeautifulSoup(html, 'html.parser')
# Each RSS <item> carries its article body in a <content:encoded> tag;
# parse that inner HTML and keep only the text before the first inline
# 'function' (which strips embedded JavaScript).
news = [
    BeautifulSoup(c.text, 'html.parser').text.split('function', 1)[0]
    for item in soup.findAll('item')
    for c in item.children if c.name == 'content:encoded'
]
news[0][0:100]


Out[121]:
'De Vido, López y Jaime estarían en la lista de sobornos de la brasileña OdebrechtDe Vido, López y Ja'

In [124]:
corpus = nlp('\n'.join(news))

In [155]:
visited = {}   # token -> frequency
nouns = []     # first occurrence of each distinct token
for word in corpus:
    # pos_.startswith('N') matches both NOUN and NUM, so numbers show up too
    if word.pos_.startswith('N') and len(word.string) < 15:
        token = word.string.strip().lower()
        if token in visited:
            visited[token] += 1
        else:
            visited[token] = 1
            nouns.append(word)
# Keep the 150 most frequent ones
nouns = sorted(nouns, key=lambda w: -visited[w.string.strip().lower()])[:150]
pd.DataFrame([[w.text, visited[w.string.strip().lower()]] for w in nouns], columns=['Noun', 'Freq'])


Out[155]:
Noun Freq
0 2017 241
1 Junio 158
2 años 149
3 dos 115
4 edificio 69
5 año 65
6 vez 62
7 tres 59
8 país 58
9 parte 56
10 gobierno 51
11 8 49
12 proyecto 48
13 tiempo 44
14 trabajo 44
15 presidente 42
16 frente 41
17 mundo 41
18 partido 39
19 millones 38
20 caso 37
21 personas 36
22 forma 36
23 ciudad 35
24 gente 34
25 estudio 34
26 canciller 33
27 vida 33
28 lugar 33
29 obra 32
... ... ...
120 30 17
121 mil 17
122 chicos 17
123 amigos 17
124 efedrina 17
125 juez 16
126 empresas 16
127 tanto 16
128 hora 16
129 cena 16
130 política 16
131 piso 16
132 papá 16
133 22 16
134 crimen 16
135 fachada 16
136 gastos 15
137 causa 15
138 relación 15
139 resultado 15
140 laboristas 15
141 efecto 15
142 fútbol 15
143 ejemplo 15
144 4 15
145 idea 15
146 valor 15
147 18 15
148 teatro 15
149 video 15

150 rows × 2 columns
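
As an aside, the counting loop above can be written more compactly with collections.Counter, keeping the same NOUN/NUM filter:

In [ ]:
from collections import Counter

# Same filter as above: pos_ starting with 'N', short tokens only.
counts = Counter(
    word.string.strip().lower()
    for word in corpus
    if word.pos_.startswith('N') and len(word.string) < 15
)
counts.most_common(10)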


In [152]:
def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y, s=1.0)
        plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
    plt.show()
#     plt.savefig(filename)

# Creating the tsne plot [Warning: will take time]
tsne = TSNE(perplexity=30.0, n_components=2, init='pca', n_iter=5000)

low_dim_embedding = tsne.fit_transform(np.array([word.vector for word in nouns]))

# Finally plotting and saving the fig 
plot_with_labels(low_dim_embedding, [word.text for word in nouns])


