Wiki2vec

Jupyter notebook for creating a Word2vec model from a Wikipedia dump. This model file can then be read into gensim's Word2Vec class. Feel free to edit this script as you see fit.

Dependencies

  • Python 3
  • Jupyter
  • Gensim

Steps

  • Download a Wikipedia dump by visiting
https://dumps.wikimedia.org/<locale>wiki/latest/<locale>wiki-latest-pages-articles.xml.bz2

E.x. https://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
  • Once downloaded assign the following paths below:

In [1]:
WIKIPEDIA_DUMP_PATH = './data/wiki-corpuses/enwiki-latest-pages-articles.xml.bz2'

# Choose a path that the word2vec model should be saved to
# (during training), and read from afterwards.
WIKIPEDIA_W2V_PATH = './data/enwiki.model'

Train Word2vec on Wikipedia dump

Here is where we train the word2vec model on the given Wikipedia dump. Specifically we,

  1. Read given Wikipedia dump with gensim
  2. Write to temporary text file (will get deleted)
  3. Train word2vec model
  4. Save word2vec model

NB: 1 Wikipedia article is fed into word2vec as a single sentence.


In [2]:
import sys
import os
import tempfile
import multiprocessing
import logging

from gensim.corpora import WikiCorpus
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec


/Users/JVillella/Development/ml-playground/wiki2vec/venv/lib/python3.5/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")

In [3]:
def write_wiki_corpus(wiki, output_file):
    """Write a WikiCorpus as plain text to file."""
    
    i = 0
    for text in wiki.get_texts():
        text_output_file.write(b' '.join(text) + b'\n')
        i = i + 1
        if (i % 10000 == 0):
            print('\rSaved %d articles' % i, end='', flush=True)
            
    print('\rFinished saving %d articles' % i, end='', flush=True)
    
def build_trained_model(text_file):
    """Reads text file and returns a trained model."""
    
    sentences = LineSentence(text_file)
    model = Word2Vec(sentences, size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # Trim unneeded model memory to reduce RAM usage
    model.init_sims(replace=True)
    return model

In [4]:
logging_format = '%(asctime)s : %(levelname)s : %(message)s'
logging.basicConfig(format=logging_format, level=logging.INFO)

with tempfile.NamedTemporaryFile(suffix='.txt') as text_output_file:
    # Create wiki corpus, and save text to temp file
    wiki_corpus = WikiCorpus(WIKIPEDIA_DUMP_PATH, lemmatize=False, dictionary={})
    write_wiki_corpus(wiki_corpus, text_output_file)
    del wiki_corpus

    # Train model on wiki corpus
    model = build_trained_model(text_output_file)    
    model.save(WIKIPEDIA_W2V_PATH)


Saved 4210000 articles
2017-02-21 01:15:54,111 : INFO : finished iterating over Wikipedia corpus of 4211808 documents with 2303160832 positions (total 17246072 articles, 2365552029 positions before pruning articles shorter than 50 words)
Finished saving 4211808 articles

Demo word2vec

Read in the saved word2vec model and perform some basic analysis on it.


In [5]:
import random

In [6]:
%time
model = Word2Vec.load(WIKIPEDIA_W2V_PATH)


2017-02-21 04:58:25,818 : INFO : loading Word2Vec object from ./data/enwiki.model
CPU times: user 1 µs, sys: 1 µs, total: 2 µs
Wall time: 17.9 µs
2017-02-21 04:58:33,557 : INFO : loading syn1neg from ./data/enwiki.model.syn1neg.npy with mmap=None
2017-02-21 04:58:39,667 : INFO : loading syn0 from ./data/enwiki.model.syn0.npy with mmap=None
2017-02-21 04:58:58,673 : INFO : setting ignored attribute cum_table to None
2017-02-21 04:58:58,674 : INFO : setting ignored attribute syn0norm to None
2017-02-21 04:58:58,675 : INFO : loaded ./data/enwiki.model

In [7]:
vocab = list(model.vocab.keys())
print('Vocabulary sample:', vocab[:5])


Vocabulary sample: ['nettlestead', 'maaples', 'giniel', 'zahivi', 'mievs']

In [8]:
word = random.choice(vocab)

print('Similar words to:', word)
model.most_similar(word)


2017-02-21 04:59:17,888 : INFO : precomputing L2-norms of word weight vectors
Similar words to: parasphenoid
Out[8]:
[('quadratojugal', 0.7142307162284851),
 ('basisphenoid', 0.713142991065979),
 ('basioccipital', 0.7118856310844421),
 ('squamosal', 0.697265625),
 ('coracoid', 0.6788373589515686),
 ('premaxillae', 0.6749427914619446),
 ('postorbital', 0.6736751794815063),
 ('uncinate', 0.6717867851257324),
 ('basipterygoid', 0.6691710948944092),
 ('pterygoid', 0.6659449338912964)]

In [9]:
word1 = random.choice(vocab)
word2 = random.choice(vocab)
print('similarity(%s, %s) = %f' % (word1, word2, model.similarity(word1, word2)))


similarity(xuanhe, vatnsskarð) = 0.070190