A Word2Vec playground

To play with this notebook, you'll need NumPy, Annoy, Gensim, and the GoogleNews word2vec model:

  • pip install numpy
  • pip install annoy
  • pip install gensim
  • you can find the GoogleNews vectors by searching for GoogleNews-vectors-negative300.bin; put the file next to this notebook, since it is loaded from ./GoogleNews-vectors-negative300.bin

Inspired by: https://github.com/chrisjmccormick/inspect_word2vec


In [1]:
# import and init
from annoy import AnnoyIndex
import gensim
import os.path
import numpy as np

prefix_filename = 'word2vec'
ann_filename = prefix_filename + '.ann'
i2k_filename = prefix_filename + '_i2k.npy'
k2i_filename = prefix_filename + '_k2i.npy'

Load the model, then create the Annoy index or load it from disk


In [2]:
# Load Google's pre-trained Word2Vec model.
print "load GoogleNews Model"
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)  
print "loading done"

hello = model['hello']
vector_size = len(hello)
print 'model size=', len(model.vocab)
print 'vector size=', vector_size


load GoogleNews Model
loading done
model size= 3000000
vector size= 300

In [3]:
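# build the Annoy index and the word <-> index mappings, then save them,
# or load them directly if they were already saved
vocab = model.vocab.keys()
# 'angular' is Annoy's default metric; newer Annoy releases expect it to be passed explicitly
indexNN = AnnoyIndex(vector_size, metric='angular')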
index2key = [None]*len(model.vocab)
key2index = {}

if not os.path.isfile(ann_filename): 
    print 'creating indexes'
    i = 0
    try:
        for key in vocab:
            indexNN.add_item(i, model[key])
            key2index[key]=i
            index2key[i]=key
            i=i+1
            if (i%10000==0):
                print i, key
    except TypeError:
        print 'Error with key', key
    print 'building 10 trees'
    indexNN.build(10) # 10 trees
    print 'save files'
    indexNN.save(ann_filename)
    np.save(i2k_filename, index2key)
    np.save(k2i_filename, key2index)
    print 'done'
else:
    print "loading files"
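    indexNN.load(ann_filename)
    # allow_pickle is required to load object arrays with recent NumPy versions
    index2key = np.load(i2k_filename, allow_pickle=True)
    # np.save wrapped the dict in a 0-d object array; .item() unwraps it back into a dict
    key2index = np.load(k2i_filename, allow_pickle=True).item()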
    print "loading done:", indexNN.get_n_items(), "items"


creating indexes
10000 DeLille_Cellars
20000 igned
30000 industrial_Ruhr
40000 ANSI_ASHRAE_IESNA_Standard
50000 coach_Jay_Vidovich
60000 Kizil
70000 Nanakshahi
80000 iSink_U_Facebook
90000 Renfrey
100000 Doctorate_Degree
110000 Synthetic_Cannabinoids
120000 Employee_Jeff_Colucy
130000 Kolbek
140000 dunce_hat
150000 Irn_Bru_First
160000 model_Maggie_Rizer
170000 OTTAWA_Karlheinz_Schreiber
180000 BGiles
190000 prMac.com_Vienna_Austria
200000 Tina_Pisnik
210000 undersigned_Rubin_Lublin
220000 Willnett_Crockett
230000 Sony_Pictures_Studios
240000 Voices
250000 salmon_Delta_smelt
260000 Yasuaki_Iwamoto_auto
270000 Ambrose
280000 DeLamatre
290000 BY_JOYCE_J._PERSICO
300000 Austin_Ruse
310000 Adeline_Teoh
320000 1_Utama_Shopping
330000 iSimCity
340000 symbol_TOT.UN
350000 southeastwards
360000 Whitchurch_Heath
370000 WEXFORD
380000 Kirk_Baert
390000 church_renounced_polygamy
400000 Whitney_Otis_elevators
410000 Fonze
420000 Fabian_Babich
430000 Desmodur_®
440000 Michael_Egholm_Ph.D.
450000 Cookie_Zip
460000 David_Dal_Maso
470000 Santa_Barbara_Botanic_Garden
480000 Jellicoe_Road
490000 E_coli
500000 Burleith
510000 LSBK
520000 Maidie
530000 Buddha_Nallah
540000 Coast_Guard_watchstanders
550000 EPIRB_signal
560000 Comtech_Telecommunications_Corp.
570000 Silver_oz.
580000 Dominique_Darden
590000 Raiders_George_Blanda
600000 attract_younger_hipper
610000 backstabbed
620000 Omro_Rushford_Fire
630000 Lasse
640000 BY_DENNIS_BARTLOW
650000 Yorkshire_Casualty_Reduction
660000 Korleone_Young
670000 David_Dzhokhadze
680000 Rudy_Currence
690000 Archive_retrievals
700000 Dousis
710000 Albert_Kligman
720000 actress_Archie_Panjabi
730000 Wolf_Haldenstein
740000 Lendel_Thomas
750000 Grubka
760000 Rene_Charlebois
770000 arsenic
780000 Minkus_Electronic_Display
790000 Ibi_Kaslik
800000 Valhall
810000 visibly_irritated_plainclothes
820000 Norwegian_expressionist_Edvard
830000 troy_oz
840000 PfEMP1
850000 μ_velOSity
860000 active_RFID_RTLS
870000 Pitched_Perfectly
880000 JRT
890000 scotch_carts
900000 Insurer_Humana
910000 Monica_Isley
920000 Julie_Deardorff
930000 Doc_Paulin
940000 http://www.opengeospatial.org
950000 Microcell_Telecommunications_Inc.
960000 PRNewswire_FirstCall_FNDS####_Corp
970000 MENNONITE
980000 Gabelsville
990000 PLN_##.#mn
1000000 Graebe
1010000 midges_swarmed
1020000 Gokool
1030000 Laura_Stotler_writes
1040000 remotely_piloted_Predators
1050000 Antlered_deer
1060000 desolate_desert
1070000 DRCE
1080000 Dhiya_al_Kenani
1090000 Lily_Ledbetter
1100000 Flatford
1110000 nurdled
1120000 Electron_Optics
1130000 Hostos_Community_College
1140000 NVIDIA_PhysX
1150000 Eurythmics
1160000 Crazed_Fan
1170000 Tullett_Liberty
1180000 Hunka_Hunka_Burnin
1190000 Eat_Your_Own
1200000 ENSOR
1210000 nemesis_Captain_Hook
1220000 U._Shrinivas
1230000 penile_enlargement
1240000 Antolin_Alcaraz
1250000 parimutuel_betting
1260000 Shell_Petroleum
1270000 Woodenbong
1280000 Kishoreganj_district
1290000 tee'd
1300000 Niweigha
1310000 Engine_Overhaul
1320000 Jen_Marlowe
1330000 Nitsch
1340000 Tiananmen_dissident
1350000 expletive_laden_banter
1360000 eurocentric
1370000 provincial_spokesman_Zulmi
1380000 Thurrock_Harriers
1390000 Psychosocial_Factors
1400000 Imagi_Mangal
1410000 Marco_Rubio
1420000 distinguishes
1430000 Malaman
1440000 bowled_Peter_Ongondo
1450000 Patricia_A._Vinchesi
1460000 Keltbray
1470000 Jolliff
1480000 SERIOUS_MAN
1490000 Egyptians_resealed
1500000 BROKE_INTO
1510000 Texas_hold'em
1520000 Caruso_Benefits
1530000 ENTrigue_Surgical
1540000 Southern_Tagalog_Arterial
1550000 OHL'_murt
1560000 DeSean_Jackson
1570000 Adepoju
1580000 Kachkar
1590000 &_quotWe
1600000 Eyewear_Collection
1610000 novitiate
1620000 Inferiority_Complex
1630000 papal_nuncio_Benedict
1640000 Lyddiard_website
1650000 MTM###_MPEG_Transport
1660000 gypsum_stack
1670000 Taiwan_Straits_ARATS
1680000 AWARD_FOR_BEST
1690000 Elmaghraby
1700000 fright_fests
1710000 IMMUNE_SYSTEMS
1720000 spokesman_Larry_Solters
1730000 burial
1740000 BY_RAMONA_SHELBURNE
1750000 Maaroufi
1760000 bin_Qasim
1770000 assassinate_Sheik_Jaber
1780000 Marc_Axton
1790000 Lockerman
1800000 Isberto
1810000 Gatson
1820000 Jennifer_Klinkert
1830000 Memento_mori
1840000 Desi_Arnez
1850000 HEALTH_PLANS
1860000 Finger_Lakes_Riesling
1870000 Colonel_Mengistu_Haile_Mariam
1880000 Hotel_Sacher
1890000 Monticello_Ky.
1900000 Dr._Josyann_Abisaab
1910000 By_Lawerence_Synett
1920000 Aprimo_Marketing_Studio
1930000 Severe_thunderstorms
1940000 raster_imagery
1950000 settlement_blocs_Maaleh_Adumim
1960000 postherpetic_neuralgia
1970000 Dave_Betras
1980000 HKY_FLA
1990000 Gulbudin_Hekmatyar
2000000 Klip_south
2010000 crumbling_lakeside
2020000 Philadelphia_Pa_Lippincott
2030000 JOBE
2040000 FABIO_Capello
2050000 T2_weighted
2060000 Burke_Badenhop
2070000 Weather_Stations
2080000 Sogluizzo
2090000 AUTHORITY_OF
2100000 Slim_Goodbody
2110000 Nizam_Mir
2120000 NewYork_Presbyterian_Hospital
2130000 Chairman_Jimmy_Iovine
2140000 Quail_Run_Elementary
2150000 neckware_line
2160000 aminopyralid
2170000 Essex_Fells
2180000 DOJ_OIG
2190000 BONDING
2200000 fking
2210000 REALTOR_®_Lockbox_NXT
2220000 therapeutic_compound_GAMMAGARD
2230000 Loof
2240000 Nomir_Medical_Technologies
2250000 Guilia
2260000 MetaSphere_application
2270000 Qi_Ji
2280000 brewers_Anheuser_Busch
2290000 8dec
2300000 Balachander
2310000 Baoyu
2320000 forwards_Colby_Armstrong
2330000 Hanjin_Heavy_Industries
2340000 Permanently_extending
2350000 Mohsenian
2360000 dark_Lord_Voldemort
2370000 visit_http://www.elpaso.com
2380000 CONTACT_Universal_Stainless
2390000 Steve_Schale_Democratic
2400000 BY_ROB_STEIN
2410000 INTERNET_TELEPHONY_Conference
2420000 Cloux_France
2430000 Superwire_Inc.
2440000 Rothiemurchus
2450000 Menarik_Property
2460000 TOM_MACK
2470000 noni_juice
2480000 Larder_Lake_Property
2490000 confortable
2500000 Gowad
2510000 racewinner
2520000 Hazard_Elimination
2530000 wet_distiller_grains
2540000 downtown_honky_tonks
2550000 potency_dosing
2560000 Lupercalia
2570000 Niuatoputapu_wiping
2580000 lambasted
2590000 caisse
2600000 evrything
2610000 Molecular_Pharmacology_Physiology
2620000 Bill_Kostroun_FILE
2630000 directorial_reigns
2640000 Calle_Ridderwall
2650000 Schlappi
2660000 Orin_Hatch
2670000 cricketer_Anil_Kumble
2680000 Yoshinori_Nagano_strategist
2690000 Brahim_Boulami
2700000 Klaasen
2710000 Dovetail_Solar
2720000 Southwark_diocese
2730000 AND_COUPLE_DANCING
2740000 disassociates_itself
2750000 R._Madhavan
2760000 Kibwezi
2770000 KIMBERLY_EDDS
2780000 Sunway_Lagoon_Surf
2790000 PKR_supreme
2800000 nonsmokers_groused
2810000 Glyco
2820000 IronStone
2830000 Billings_Forge
2840000 Famer_Gordie_Howe
2850000 By_Eric_Mchugh
2860000 Ingrid_Beckles
2870000 Savory_Spice
2880000 Jamnong
2890000 TVonics
2900000 MENAFN_Arab
2910000 Imam_Khomeini_mausoleum
2920000 Mayor_Arturo_Garino
2930000 Dogwood_Trail
2940000 Joel_Scodnick
2950000 Brenner_CSO
2960000 Brownscombe
2970000 Ezra_Cray
2980000 Katrina_Relief_Efforts
2990000 EMILY_KAISER
3000000 Kenneth_Klinge
building 10 trees
save files
done
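A note on the metric: as far as I understand the Annoy documentation, the 'angular' distance is not the angle itself but the Euclidean distance between the normalized vectors, i.e. sqrt(2 * (1 - cos(u, v))). Here is a minimal pure-Python sketch of that relation. The helper names (cosine_similarity, angular_distance, angular_to_cosine) are mine and are not part of Annoy.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_distance(u, v):
    """Euclidean distance between the normalized vectors: sqrt(2 * (1 - cos))."""
    # max() guards against tiny negative values caused by rounding
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine_similarity(u, v))))

def angular_to_cosine(distance):
    """Invert the formula above to recover the cosine similarity."""
    return 1.0 - (distance * distance) / 2.0

# orthogonal vectors: cosine 0, so the distance is sqrt(2)
print(angular_distance([1.0, 0.0], [0.0, 1.0]))
# same direction, different length: distance 0
print(angular_distance([1.0, 0.0], [2.0, 0.0]))
```

If you ask Annoy for distances (get_nns_by_vector accepts include_distances=True), angular_to_cosine should turn them back into the more familiar cosine similarity, assuming the formula above matches your Annoy version.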

King - Male + Female = Queen?

Nope!

At least not with a word2vec model trained on the GoogleNews corpus...


In [10]:
what_vec = model['king'] - model['male'] + model['female']

what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
print index2key[what_indexes[0]]


king
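Getting 'king' back is less surprising than it looks: the result vector is still very close to model['king'], so the query word itself tends to win. Gensim's own most_similar leaves the input words out of its results for this reason. Below is a small sketch of the same filtering on top of an Annoy-style index. Everything in it is an assumption made for illustration: nearest_words is a hypothetical helper, BruteForceIndex is a toy stand-in for AnnoyIndex (exact search, only get_nns_by_vector), and the 3-dimensional vectors are made up.

```python
import math

class BruteForceIndex(object):
    """Toy stand-in for AnnoyIndex: exact cosine search over a short list of vectors."""

    def __init__(self, vectors):
        self.vectors = vectors

    def get_nns_by_vector(self, vec, n):
        def cosine(u, v):
            dot = sum(a * b for a, b in zip(u, v))
            norm_u = math.sqrt(sum(a * a for a in u))
            norm_v = math.sqrt(sum(b * b for b in v))
            return dot / (norm_u * norm_v)
        # indexes sorted from most to least similar
        order = sorted(range(len(self.vectors)),
                       key=lambda i: -cosine(vec, self.vectors[i]))
        return order[:n]

def nearest_words(index, index2key, vec, n=1, exclude=()):
    """Return the n nearest words to vec, skipping the words listed in exclude."""
    banned = set(exclude)
    # ask for extra neighbours so that n remain after filtering
    candidates = index.get_nns_by_vector(vec, n + len(banned))
    words = [index2key[i] for i in candidates if index2key[i] not in banned]
    return words[:n]

# made-up vectors: [royalty, maleness, femaleness]
toy_words = ['king', 'queen', 'male', 'female']
toy_vectors = [[4.0, 1.0, 0.0], [4.0, 0.0, 3.0], [0.0, 1.0, 0.2], [0.0, 0.2, 1.0]]
toy_index = BruteForceIndex(toy_vectors)

king, queen, male, female = toy_vectors
what_vec = [k - m + f for k, m, f in zip(king, male, female)]

print(nearest_words(toy_index, toy_words, what_vec))  # ['king']
print(nearest_words(toy_index, toy_words, what_vec,
                    exclude=['king', 'male', 'female']))  # ['queen']
```

With the real objects from this notebook, the call would look like nearest_words(indexNN, index2key, what_vec, exclude=['king', 'male', 'female']). I have not checked what it returns on the GoogleNews vectors.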

King - boy + girl = Queen?

Yes :)
but it doesn't work with man and women :(


In [12]:
what_vec = model['king'] - model['boy'] + model['girl']

what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
print index2key[what_indexes[0]]


queen

In [15]:
# note: this uses the plural 'women', not 'woman'
what_vec = model['king'] - model['man'] + model['women']

what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
print index2key[what_indexes[0]]


absolute_monarch

Berlin - Germany + France = Paris?

Yes!

This makes me happy, but if someone understands why it works, please tell me!


In [14]:
what_vec = model['Berlin'] - model['Germany'] + model['France']

what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
print index2key[what_indexes[0]]


Paris
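One explanation that is often given: word2vec seems to encode some relations as a roughly constant offset, so that capital ≈ country + offset. If that holds, Berlin - Germany isolates the offset, and adding France lands near Paris. The toy below only illustrates that idea. The 4-dimensional vectors, the noise values and the helper names are all made up, and it proves nothing about the real GoogleNews model.

```python
import math

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up embedding: one axis per country, plus one "is a capital" axis
capital_offset = [0.0, 0.0, 0.0, 1.0]
toy = {
    'Germany': [1.0, 0.0, 0.0, 0.0],
    'France':  [0.0, 1.0, 0.0, 0.0],
    'Italy':   [0.0, 0.0, 1.0, 0.0],
}
# each capital = its country + the shared offset + a little arbitrary noise
toy['Berlin'] = add(add(toy['Germany'], capital_offset), [0.05, -0.02, 0.01, 0.0])
toy['Paris'] = add(add(toy['France'], capital_offset), [-0.03, 0.04, 0.0, 0.02])
toy['Rome'] = add(add(toy['Italy'], capital_offset), [0.0, 0.01, -0.04, 0.03])

def nearest(vec):
    """Word whose toy vector has the highest cosine similarity with vec."""
    return max(toy, key=lambda word: cosine(vec, toy[word]))

# Berlin - Germany + France
query = add(sub(toy['Berlin'], toy['Germany']), toy['France'])
print(nearest(query))  # Paris
```

When the offset is not shared (there is no "capital of cars"), the nearest neighbour of the shifted vector is more or less arbitrary.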

Trump - USA + Germany = Hitler?

FAKE NEWS


In [12]:
what_vec = model['Trump'] + model['Germany'] - model['USA']
what_indexes = indexNN.get_nns_by_vector(what_vec, 1)

for i in what_indexes:
    print index2key[i]


Dean_Gitter

Let's explore the stereotypes hidden in the news:


In [53]:
# offset from 'boy' to 'girl', used as a rough male-to-female direction
man2women = model['girl'] - model['boy']

word_list = ["king","prince", "male", "boy","dad", "father", "president", "dentist",
             "scientist", "efficient",  "teacher", "doctor", "minister", "lover"]
for word in word_list:
    what_vec = model[word] + man2women
    what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
    print word, "for him,", index2key[what_indexes[0]], "for her."


king for him, queen for her.
prince for him, duchess for her.
male for him, female for her.
boy for him, girl for her.
dad for him, motherly_instincts for her.
father for him, mother for her.
president for him, president for her.
dentist for him, plastic_surgeon for her.
scientist for him, linguistics_professor for her.
efficient for him, efficient for her.
teacher for him, teacher for her.
doctor for him, doctor for her.
minister for him, minister for her.
lover for him, seductress for her.

In [54]:
# offset from a country to its capital, estimated from a single pair
capital = model['Berlin'] - model['Germany']

word_list = ["Germany", "France", "Italy", "USA", "Russia", "boys", "cars", "flowers", "soldiers",
             "scientists", ]
for word in word_list:
    what_vec = model[word] + capital
    what_indexes = indexNN.get_nns_by_vector(what_vec, 1)
    print index2key[what_indexes[0]], "is the capital of", word


Berlin is the capital of Germany
Paris is the capital of France
Rome is the capital of Italy
Teen_Poetry_Slam is the capital of USA
Moscow is the capital of Russia
kids is the capital of boys
paddywagon is the capital of cars
flower is the capital of flowers
civilians is the capital of soldiers
Humberto_Campins is the capital of scientists

If you play with this notebook and find good word2vec equations, please tweet them to me!
@dh7net