First, let's load spaCy and a Spanish parsing model; this may take a while.
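If the Spanish model isn't available locally yet, it can usually be fetched first from a notebook cell (this assumes a spaCy 2.x-style install; older versions used a different download command):

!python -m spacy download es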
In [8]:
import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import codecs
import math
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from gensim.models import KeyedVectors
# Use a large default figure size and display plots inline in this notebook, instead of externally.
from pylab import rcParams
rcParams['figure.figsize'] = 16, 8
%matplotlib inline
nlp = spacy.load('es')
In [9]:
doc = nlp('Las naranjas y las manzanas se parecen')
pd.DataFrame([[word.text, word.tag_, word.pos_] for word in doc], columns=['Token', 'TAG', 'POS'])
Out[9]:
In [17]:
pd.DataFrame([[word.text, math.exp(word.prob), word.prob] for word in doc], columns=['Token', 'Prob', 'Log Prob'])
Out[17]:
In [23]:
doc2 = nlp(u"La premier alemana Angela Merkel visitó Buenos Aires esta semana")
for ent in doc2.ents:
    print('{} \t {}'.format(ent, ent.label_))
Embeddings are the silver bullet brought from the world of deep learning into the realm of NLP through the word2vec algorithm. They are a meaningful representation of tokens as dense, high-dimensional vectors. They can be obtained from large unlabelled corpora readily available on the web, and they capture the syntactic, semantic, morphological and even polysemic richness of a word in a way that a computer can work with.
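For example, every token in the loaded model carries such a dense vector that we can inspect directly (the exact dimensionality depends on the model that ships with 'es'; 300 is common but not guaranteed):

naranja = nlp('naranja')[0]
print(naranja.vector.shape)   # e.g. (300,), depending on the model's vectors
print(naranja.vector[:10])    # first few components of the dense vector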
In [37]:
orange = doc[1]
apple = doc[4]
orange.similarity(apple)
Out[37]:
One amazing property of word vectors is that they capture analogies remarkably well; for instance:
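As an illustrative sketch of the classic king/queen analogy (using whatever vectors come bundled with the 'es' model, so the exact numbers will vary), "rey" - "hombre" + "mujer" should land close to "reina" under cosine similarity:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rey, hombre, mujer, reina = (nlp(w)[0].vector for w in ('rey', 'hombre', 'mujer', 'reina'))
print(cosine(rey - hombre + mujer, reina))   # should be relatively high
print(cosine(rey - hombre + mujer, hombre))  # for comparison, should be lower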
We can also average the vectors over a whole sentence to compute a similarity between sentences. NLP methods that disregard word order, such as this one, are commonly referred to as Bag of Words.
In [41]:
doc.similarity(doc2)
Out[41]:
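Under the hood, spaCy's Doc.vector is by default just the average of the token vectors, and similarity is the cosine between those averages; a quick sketch to check this, reusing the doc and doc2 objects from above:

import numpy as np

# Doc.vector should default to the mean of the token vectors (the Bag of Words representation)
manual_avg = np.mean([w.vector for w in doc], axis=0)
print(np.allclose(manual_avg, doc.vector))

# doc.similarity(doc2) should match the cosine similarity of the averaged vectors
cosine = np.dot(doc.vector, doc2.vector) / (np.linalg.norm(doc.vector) * np.linalg.norm(doc2.vector))
print(cosine)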
These are the last 100 news articles from clarin.com, the biggest news media outlet in Argentina.
In [121]:
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen('http://api-editoriales.clarin.com/api/instant_articles?limit=100').read()
soup = BeautifulSoup(html, 'html.parser')
news = [
    BeautifulSoup(c.text, 'html.parser').text.split('function', 1)[0]
    for item in soup.findAll('item')
    for c in item.children if c.name == 'content:encoded'
]
news[0][0:100]
Out[121]:
In [124]:
corpus = nlp('\n'.join(news))
In [155]:
visited = {}
nouns = []
# Count noun frequencies, keeping one Token object per distinct noun
for word in corpus:
    if word.pos_.startswith('N') and len(word.string) < 15:
        token = word.string.strip().lower()
        if token in visited:
            visited[token] += 1
            continue
        else:
            visited[token] = 1
        nouns.append(word)
# Keep the 150 most frequent nouns
nouns = sorted(nouns, key=lambda w: -visited[w.string.strip().lower()])[:150]
pd.DataFrame([[w.text, visited[w.string.strip().lower()]] for w in nouns], columns=['Noun', 'Freq'])
Out[155]:
In [152]:
def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y, s=1.0)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()
    # plt.savefig(filename)

# Creating the t-SNE plot [Warning: will take time]
tsne = TSNE(perplexity=30.0, n_components=2, init='pca', n_iter=5000)
low_dim_embedding = tsne.fit_transform(np.array([word.vector for word in nouns]))
# Finally plotting and saving the fig
plot_with_labels(low_dim_embedding, [word.text for word in nouns])
In [ ]: