Wiki2vec

Jupyter notebook for creating a Word2vec model from a Wikipedia dump. This model file can then be read into gensim's Word2Vec class. Feel free to edit this script as you see fit.

Dependencies

Python 3
Jupyter
Gensim

Steps

Download a Wikipedia dump by visiting

https://dumps.wikimedia.org/<locale>wiki/latest/<locale>wiki-latest-pages-articles.xml.bz2

E.x. https://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2

Once downloaded assign the following paths below:



In [1]:

    
WIKIPEDIA_DUMP_PATH = './data/wiki-corpuses/enwiki-latest-pages-articles.xml.bz2'

# Choose a path that the word2vec model should be saved to
# (during training), and read from afterwards.
WIKIPEDIA_W2V_PATH = './data/enwiki.model'

Train Word2vec on Wikipedia dump

Here is where we train the word2vec model on the given Wikipedia dump. Specifically we,

Read given Wikipedia dump with gensim
Write to temporary text file (will get deleted)
Train word2vec model
Save word2vec model

NB: 1 Wikipedia article is fed into word2vec as a single sentence.



In [2]:

    
import sys
import os
import tempfile
import multiprocessing
import logging

from gensim.corpora import WikiCorpus
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec









    



/Users/JVillella/Development/ml-playground/wiki2vec/venv/lib/python3.5/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")



In [3]:

    
def write_wiki_corpus(wiki, output_file):
    """Write a WikiCorpus as plain text to file."""
    
    i = 0
    for text in wiki.get_texts():
        text_output_file.write(b' '.join(text) + b'\n')
        i = i + 1
        if (i % 10000 == 0):
            print('\rSaved %d articles' % i, end='', flush=True)
            
    print('\rFinished saving %d articles' % i, end='', flush=True)
    
def build_trained_model(text_file):
    """Reads text file and returns a trained model."""
    
    sentences = LineSentence(text_file)
    model = Word2Vec(sentences, size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # Trim unneeded model memory to reduce RAM usage
    model.init_sims(replace=True)
    return model



In [4]:

    
logging_format = '%(asctime)s : %(levelname)s : %(message)s'
logging.basicConfig(format=logging_format, level=logging.INFO)

with tempfile.NamedTemporaryFile(suffix='.txt') as text_output_file:
    # Create wiki corpus, and save text to temp file
    wiki_corpus = WikiCorpus(WIKIPEDIA_DUMP_PATH, lemmatize=False, dictionary={})
    write_wiki_corpus(wiki_corpus, text_output_file)
    del wiki_corpus

    # Train model on wiki corpus
    model = build_trained_model(text_output_file)    
    model.save(WIKIPEDIA_W2V_PATH)









    



Saved 4210000 articles





    



2017-02-21 01:15:54,111 : INFO : finished iterating over Wikipedia corpus of 4211808 documents with 2303160832 positions (total 17246072 articles, 2365552029 positions before pruning articles shorter than 50 words)






    



Finished saving 4211808 articles

Demo word2vec

Read in the saved word2vec model and perform some basic analysis on it.



In [5]:

    
import random



In [6]:

    
%time
model = Word2Vec.load(WIKIPEDIA_W2V_PATH)









    



2017-02-21 04:58:25,818 : INFO : loading Word2Vec object from ./data/enwiki.model






    



CPU times: user 1 µs, sys: 1 µs, total: 2 µs
Wall time: 17.9 µs






    



2017-02-21 04:58:33,557 : INFO : loading syn1neg from ./data/enwiki.model.syn1neg.npy with mmap=None
2017-02-21 04:58:39,667 : INFO : loading syn0 from ./data/enwiki.model.syn0.npy with mmap=None
2017-02-21 04:58:58,673 : INFO : setting ignored attribute cum_table to None
2017-02-21 04:58:58,674 : INFO : setting ignored attribute syn0norm to None
2017-02-21 04:58:58,675 : INFO : loaded ./data/enwiki.model



In [7]:

    
vocab = list(model.vocab.keys())
print('Vocabulary sample:', vocab[:5])









    



Vocabulary sample: ['nettlestead', 'maaples', 'giniel', 'zahivi', 'mievs']



In [8]:

    
word = random.choice(vocab)

print('Similar words to:', word)
model.most_similar(word)









    



2017-02-21 04:59:17,888 : INFO : precomputing L2-norms of word weight vectors






    



Similar words to: parasphenoid






    Out[8]:





[('quadratojugal', 0.7142307162284851),
 ('basisphenoid', 0.713142991065979),
 ('basioccipital', 0.7118856310844421),
 ('squamosal', 0.697265625),
 ('coracoid', 0.6788373589515686),
 ('premaxillae', 0.6749427914619446),
 ('postorbital', 0.6736751794815063),
 ('uncinate', 0.6717867851257324),
 ('basipterygoid', 0.6691710948944092),
 ('pterygoid', 0.6659449338912964)]



In [9]:

    
word1 = random.choice(vocab)
word2 = random.choice(vocab)
print('similarity(%s, %s) = %f' % (word1, word2, model.similarity(word1, word2)))









    



similarity(xuanhe, vatnsskarð) = 0.070190