Vector Math

In this notebook we'll demonstrate that the learned vectors retain word2vec-like properties: sums and analogies in vector space land on sensible neighbors. You can download the vectors, follow along at home, and make your own queries if you'd like; a quick sketch after the examples below shows how one of these analogies maps onto raw vector arithmetic.

Sums:

  1. silicon valley ~ california + technology
  2. uber ~ taxis + company
  3. baidu ~ china + search engine

Analogies:

  1. Mark Zuckerberg - Facebook + Amazon = Jeff Bezos
  2. Hacker News - story + article = StackOverflow
  3. VIM - terminal + graphics = Photoshop

And slightly more whimsically:

  1. vegetables - eat + drink = tea
  2. scala - features + simple = haskell
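
To make the arithmetic concrete, here's a minimal sketch of the first analogy using only raw vector operations. It assumes you've already run the loading and helper cells further down (get_vector, word_vectors, and vocab are defined there); the cosmul and most_similar_posneg helpers do the same thing with a bit more care.

# Minimal sketch of analogy 1 -- run after the loading/helper cells below.
# mark zuckerberg - facebook + amazon should land near jeff bezos.
query = get_vector('mark zuckerberg') - get_vector('facebook') + get_vector('amazon')
closest = np.argsort(np.dot(word_vectors, query))[::-1][:5]
print [vocab[i] for i in closest]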

In [37]:
!wget https://zenodo.org/record/49903/files/vocab.npy


--2016-04-17 12:56:06--  https://zenodo.org/record/49903/files/vocab.npy
Resolving zenodo.org (zenodo.org)... 188.184.66.202
Connecting to zenodo.org (zenodo.org)|188.184.66.202|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81754640 (78M) [application/octet-stream]
Saving to: ‘vocab.npy’

vocab.npy           100%[=====================>]  77.97M  9.21MB/s   in 23s    

2016-04-17 12:56:32 (3.37 MB/s) - ‘vocab.npy’ saved [81754640/81754640]


In [36]:
!wget https://zenodo.org/record/49903/files/word_vectors.npy


--2016-04-17 12:55:41--  https://zenodo.org/record/49903/files/word_vectors.npy
Resolving zenodo.org (zenodo.org)... 188.184.66.202
Connecting to zenodo.org (zenodo.org)|188.184.66.202|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116273232 (111M) [application/octet-stream]
Saving to: ‘word_vectors.npy’

word_vectors.npy    100%[=====================>] 110.89M  6.64MB/s   in 21s    

2016-04-17 12:56:06 (5.31 MB/s) - ‘word_vectors.npy’ saved [116273232/116273232]

You only need to run the commented-out code below if you've trained your own lda2vec model; otherwise, just use the word vectors downloaded above.


In [32]:
# Only needed if you've trained your own lda2vec model: this converts the
# saved checkpoint into the word_vectors.npy / vocab.npy files used below.
#from lda2vec_model import LDA2Vec
#from chainer import serializers
#import numpy as np
#import pandas as pd
#import pickle
#
#features = pd.read_pickle("../data/features.pd")
#npz = np.load(open('topics.story.pyldavis.npz', 'r'))
#dat = {k: v for (k, v) in npz.iteritems()}
#vocab = dat['vocab'].tolist()
#dat = np.load("../data/data.npz")
#n_stories = features.story_id_codes.max() + 1
#n_units = 256
#n_vocab = dat['flattened'].max() + 1
#model = LDA2Vec(n_stories=n_stories, n_story_topics=40,
#                n_authors=5664, n_author_topics=20,
#                n_units=n_units, n_vocab=n_vocab, counts=np.zeros(n_vocab),
#                n_samples=15)
#serializers.load_hdf5("/home/chris/lda2vec-12/examples/hacker_news/lda2vec/lda2vec.hdf5", model)
#np.save("word_vectors", model.sampler.W.data)
#np.save("vocab", vocab)

In [2]:
import numpy as np
word_vectors_raw = np.load("word_vectors.npy")
vocab = np.load("vocab.npy").tolist()
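
A quick sanity check (a sketch; the exact dimensions depend on the trained model) that the vocabulary list and the vector matrix line up row-for-row:

# One row of word vectors per vocabulary entry
print word_vectors_raw.shape, len(vocab)
assert word_vectors_raw.shape[0] == len(vocab)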

L2-normalize the word vectors


In [15]:
word_vectors = word_vectors_raw / np.linalg.norm(word_vectors_raw, axis=-1)[:, None]
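
After this step every row has unit length (assuming no all-zero rows), so a plain dot product between rows is exactly cosine similarity; that's what the helpers below rely on. A quick check:

# Every row should now have (approximately) unit L2 norm
assert np.allclose(np.linalg.norm(word_vectors, axis=-1), 1.0)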

In [16]:
def get_vector(token):
    # Look up the (L2-normalized) vector for a single vocabulary token
    index = vocab.index(token)
    return word_vectors[index, :].copy()

def most_similar(token, n=20):
    # Rank the whole vocabulary by cosine similarity to a single token
    word_vector = get_vector(token)
    similarities = np.dot(word_vectors, word_vector)
    top = np.argsort(similarities)[::-1][:n]
    return [vocab[i] for i in top]

# This is Levy & Goldberg's 3CosMul metric
# Based on the Gensim implementation: https://github.com/piskvorky/gensim/blob/master/gensim/models/word2vec.py
def cosmul(positives, negatives, topn=20):
    positive = [get_vector(p) for p in positives]
    negative = [get_vector(n) for n in negatives]
    # Shift cosine similarities from [-1, 1] into [0, 1] before multiplying
    pos_dists = [((1 + np.dot(word_vectors, term)) / 2.) for term in positive]
    neg_dists = [((1 + np.dot(word_vectors, term)) / 2.) for term in negative]
    dists = np.prod(pos_dists, axis=0) / (np.prod(neg_dists, axis=0) + 1e-6)
    idxs = np.argsort(dists)[::-1][:topn]
    return [vocab[i] for i in idxs if (vocab[i] not in positives) and (vocab[i] not in negatives)]

# Traditional additive analogy: sum the positives, subtract the negatives, rank by dot product
def most_similar_posneg(positives, negatives, topn=20):
    positive = np.sum([get_vector(p) for p in positives], axis=0)
    negative = np.sum([get_vector(n) for n in negatives], axis=0)
    vector = positive - negative
    dists = np.dot(word_vectors, vector)
    idxs = np.argsort(dists)[::-1][:topn]
    return [vocab[i] for i in idxs if (vocab[i] not in positives) and (vocab[i] not in negatives)]
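
To see what cosmul is computing, here's a sketch that scores a single candidate by hand (it assumes 'queen', 'king', 'woman', and 'man' are all in the vocabulary); the result should match the corresponding entry of dists inside cosmul:

candidate = get_vector('queen')
pos = [get_vector(t) for t in ['king', 'woman']]
neg = [get_vector(t) for t in ['man']]
# shift each cosine from [-1, 1] into [0, 1], multiply positives, divide by negatives
score = (np.prod([(1 + candidate.dot(t)) / 2. for t in pos])
         / (np.prod([(1 + candidate.dot(t)) / 2. for t in neg]) + 1e-6))
print score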

In [17]:
most_similar('san francisco')


Out[17]:
[u'san francisco',
 u'new york',
 u'nyc',
 u'palo alto',
 u'mountain view',
 u'boston',
 u'seattle',
 u'sf',
 u'los angeles',
 u'new york city',
 u'london',
 u'ny',
 u'brooklyn',
 u'chicago',
 u'austin',
 u'atlanta',
 u'portland',
 u'san jose',
 u'san mateo',
 u'sunnyvale']

In [18]:
cosmul(['california', 'technology'], [], topn=20)


Out[18]:
[u'silicon valley',
 u'in',
 u'new york',
 u'u.s.',
 u'west',
 u'tech',
 u'usa',
 u'san francisco',
 u'japan',
 u'america',
 u'dc',
 u'industry',
 u'canada',
 u'new york city',
 u'nyc',
 u'area',
 u'valley',
 u'china']

In [19]:
cosmul(['digital', 'currency'], [], topn=20)


Out[19]:
[u'currencies',
 u'bitcoin',
 u'goods',
 u'physical',
 u'gold',
 u'fiat',
 u'trading',
 u'cryptocurrency',
 u'bitcoins',
 u'electronic',
 u'analog',
 u'transfers',
 u'banking',
 u'commodity',
 u'mining',
 u'virtual currency',
 u'other currencies',
 u'media']

In [20]:
cosmul(['text editor', 'terminal'], [], topn=20)


Out[20]:
[u'vim',
 u'emacs',
 u'editor',
 u'sublime',
 u'tmux',
 u'shell',
 u'iterm',
 u'vi',
 u'ide',
 u'debugger',
 u'latex',
 u'gui',
 u'gvim',
 u'notepad',
 u'eclipse',
 u'command line',
 u'terminal.app',
 u'window manager']

In [35]:
cosmul(['china'], [], topn=20)


Out[35]:
[u'russia',
 u'india',
 u'japan',
 u'africa',
 u'korea',
 u'germany',
 u'other countries',
 u'asia',
 u'ukraine',
 u'iran',
 u'brazil',
 u'israel',
 u'usa',
 u'vietnam',
 u'france',
 u'countries',
 u'south korea',
 u'hong kong',
 u'europe']

In [21]:
cosmul(['china', 'search engine'], [], topn=20)


Out[21]:
[u'baidu',
 u'google',
 u'google search',
 u'india',
 u'russia',
 u'japan',
 u'iran',
 u'country',
 u'yandex',
 u'africa',
 u'duckduckgo',
 u'south korea',
 u'bing',
 u'france',
 u'beijing',
 u'hong kong',
 u'great firewall',
 u'search engines']

In [22]:
cosmul(['microsoft'], [], topn=20)


Out[22]:
[u'apple',
 u'ms',
 u'msft',
 u'google',
 u'nokia',
 u'adobe',
 u'samsung',
 u'hp',
 u'rim',
 u'oracle',
 u'valve',
 u'mozilla',
 u'ibm',
 u'motorola',
 u'oems',
 u'ballmer',
 u'intel',
 u'ms.',
 u'canonical']

In [23]:
cosmul(['microsoft', 'cloud'], [], topn=20)


Out[23]:
[u'apple',
 u'google',
 u'enterprise',
 u'azure',
 u'ms',
 u'skydrive',
 u'sharepoint',
 u'walled garden',
 u'icloud',
 u'oracle',
 u'chrome os',
 u'cloud services',
 u'android market',
 u'adobe',
 u'app store',
 u'rackspace',
 u'hp',
 u'samsung']

Queen is several rankings down, so this isn't exactly the same as out-of-the-box word2vec!


In [24]:
cosmul(['king', 'woman'], ['man'], topn=20)


Out[24]:
[u'professional context',
 u'female',
 u'pawn',
 u'content farm',
 u'queen',
 u'career trajectory',
 u'real risk',
 u'philadelphia',
 u'teen',
 u'shitty place',
 u'prussia',
 u'criminal offense',
 u'main theme',
 u'she',
 u'magician',
 u'gray area',
 u'herself',
 u'best site']

In [25]:
print 'Most similar'
print '\n'.join(most_similar('mark zuckerberg'))
print '\nCosmul'
pos = ['mark zuckerberg', 'amazon']
neg = ['facebook']
print '\n'.join(cosmul(pos, neg, topn=20))
print '\nTraditional Similarity'
print '\n'.join(most_similar_posneg(pos, neg, topn=20))


Most similar
mark zuckerberg
bill gates
zuckerberg
larry page
zuck
steve jobs
sergey brin
jeff bezos
gates
warren buffet
ceo
peter thiel
paul allen
sean parker
jack dorsey
paul graham
richard branson
sergey
linus torvalds
larry ellison

Cosmul
jeff bezos
elon musk
warren buffet
bezos
michael dell
bill gates
musk
hp
toshiba
dell
richard branson
elon
buffet
john carmack
steve wozniak
asus
ford
morgan

Traditional Similarity
jeff bezos
bill gates
elon musk
bezos
warren buffet
michael dell
hp
musk
richard branson
dell
toshiba
john carmack
buffet
peter thiel
steve wozniak
gates
steve jobs
ford

In [26]:
pos = ['hacker news', 'question']
neg = ['story']

print 'Most similar'
print '\n'.join(most_similar(pos[0]))
print '\nCosmul'
print '\n'.join(cosmul(pos, neg, topn=20))
print '\nTraditional Similarity'
print '\n'.join(most_similar_posneg(pos, neg, topn=20))


Most similar
hacker news
hn
hn.
reddit
front page
hackernews
commenting
posted
frontpage
comment
posting
upvoted
slashdot
news.yc
comments
posts
proggit
post
techcrunch
top story

Cosmul
stack overflow
stackoverflow
answers
answering
answer
questions
quora
answered
ask
hn
other questions
other question
programming questions
asking
stackexchange
stack exchange
why
basic questions

Traditional Similarity
stack overflow
answer
stackoverflow
answering
answers
hn
questions
answered
quora
ask
asking
other question
other questions
first question
stackexchange
hn.
programming questions
hackernews

In [27]:
pos = ['san francisco']
neg = []

print 'Most similar'
print '\n'.join(most_similar(pos[0]))
print '\nCosmul'
print '\n'.join(cosmul(pos, neg, topn=20))
print '\nTraditional Similarity'
print '\n'.join(most_similar_posneg(pos, neg, topn=20))


Most similar
san francisco
new york
nyc
palo alto
mountain view
boston
seattle
sf
los angeles
new york city
london
ny
brooklyn
chicago
austin
atlanta
portland
san jose
san mateo
sunnyvale

Cosmul
new york
nyc
palo alto
mountain view
boston
seattle
sf
los angeles
new york city
london
ny
brooklyn
chicago
austin
atlanta
portland
san jose
san mateo
sunnyvale

Traditional Similarity
new york
nyc
palo alto
mountain view
boston
seattle
sf
los angeles
new york city
london
ny
brooklyn
chicago
austin
atlanta
portland
san jose
san mateo
sunnyvale

In [28]:
pos = ['nlp', 'image']
neg = ['text']

print 'Most similar'
print '\n'.join(most_similar(pos[0]))
print '\nCosmul'
print '\n'.join(cosmul(pos, neg, topn=20))
print '\nTraditional Similarity'
print '\n'.join(most_similar_posneg(pos, neg, topn=20))


Most similar
nlp
machine learning
data mining
computer vision
natural language processing
ml
image processing
analytics
classification
algorithms
data science
hadoop
analysis
ai
clustering
mapreduce
algorithm design
information retrieval
data analysis
statistical

Cosmul
computer vision
machine learning
data mining
image processing
ai
analytics
algorithm
randomized
classification
natural language processing
hadoop
engine
statistical
analysis
machine
clustering
ml
artificial intelligence
neo4j

Traditional Similarity
computer vision
machine learning
data mining
image processing
ai
analytics
algorithm
natural language processing
classification
randomized
analysis
ml
hadoop
engine
machine
statistical
clustering
visualization

In [29]:
pos = ['vim', 'graphics']
neg = ['terminal']

print 'Most similar'
print '\n'.join(most_similar(pos[0]))
print '\nCosmul'
print '\n'.join(cosmul(pos, neg, topn=20))
print '\nTraditional Similarity'
print '\n'.join(most_similar_posneg(pos, neg, topn=20))


Most similar
vim
emacs
vi
sublime
tmux
textmate
eclipse
sublime text
macvim
zsh
org-mode
terminal
st2
bbedit
intellij
text editor
latex
notepad++
netbeans
other editors

Cosmul
photoshop
animations
typography
programming
layout
textures
web design
fonts
coding
illustrator
common lisp
design
prototyping
canvas
css.
css
diagrams
vector graphics
usability

Traditional Similarity
photoshop
animations
textures
layout
typography
programming
fonts
coding
illustrator
design
web design
common lisp
canvas
photography
ides
visual
animation
css

In [30]:
pos = ['vegetables', 'drink']
neg = ['eat']

print 'Most similar'
print '\n'.join(most_similar(pos[0]))
print '\nCosmul'
print '\n'.join(cosmul(pos, neg, topn=20))
print '\nTraditional Similarity'
print '\n'.join(most_similar_posneg(pos, neg, topn=20))


Most similar
vegetables
meat
rice
meats
fruit
veggies
pasta
salads
eat
fruits
cheese
carrots
potatoes
beans
seafood
soy
yogurt
spices
dairy
fats

Cosmul
tea
coffee
beer
drinking
red wine
soda
cup
alcohol
cups
vodka
rice
fruit
whisky
orange juice
milk
espresso
drinks
carrots

Traditional Similarity
tea
coffee
beer
drinking
soda
red wine
cup
alcohol
rice
cups
fruit
vodka
milk
drinks
orange juice
carrots
whisky
pasta

In [31]:
pos = ['lda']

print 'Most similar'
print '\n'.join(most_similar(pos[0]))


Most similar
lda
linear
kmeans
clustering
-2
176
classification
svm
10000000
minaway
mb/s
statistical
173
ans
joiner
stdev
because:<p><pre><code
regression
	
gaussian