In [ ]:
%matplotlib inline
In [ ]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
One of Gensim's features is simple and easy access to some common data.
The gensim-data <https://github.com/RaRe-Technologies/gensim-data>
_ project stores a variety of corpora, models and other data.
Gensim has a :py:mod:gensim.downloader
module for programmatically accessing this data.
The module leverages a local cache that ensures data is downloaded at most once.
This tutorial:
sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py
for a detailed tutorial)Let's start by importing the api module.
In [ ]:
import gensim.downloader as api
Now, lets download the text8 corpus and load it to memory (automatically)
In [ ]:
corpus = api.load('text8')
In this case, corpus is an iterable. If you look under the covers, it has the following definition:
In [ ]:
import inspect
print(inspect.getsource(corpus.__class__))
For more details, look inside the file that defines the Dataset class for your particular resource.
In [ ]:
print(inspect.getfile(corpus.__class__))
As the corpus has been downloaded and loaded, let's create a word2vec model of our corpus.
In [ ]:
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)
Now that we have our word2vec model, let's find words that are similar to 'tree'
In [ ]:
print(model.most_similar('tree'))
You can use the API to download many corpora and models. You can get the list of all the models and corpora that are provided, by using the code below:
In [ ]:
import json
info = api.info()
print(json.dumps(info, indent=4))
There are two types of data: corpora and models.
In [ ]:
print(info.keys())
Let's have a look at the available corpora:
In [ ]:
for corpus_name, corpus_data in sorted(info['corpora'].items()):
print(
'%s (%d records): %s' % (
corpus_name,
corpus_data.get('num_records', -1),
corpus_data['description'][:40] + '...',
)
)
... and the same for models:
In [ ]:
for model_name, model_data in sorted(info['models'].items()):
print(
'%s (%d records): %s' % (
model_name,
model_data.get('num_records', -1),
model_data['description'][:40] + '...',
)
)
If you want to get detailed information about the model/corpus, use:
In [ ]:
fake_news_info = api.info('fake-news')
print(json.dumps(fake_news_info, indent=4))
Sometimes, you do not want to load the model to memory. You would just want to get the path to the model. For that, use :
In [ ]:
print(api.load('glove-wiki-gigaword-50', return_path=True))
If you want to load the model to memory, then:
In [ ]:
model = api.load("glove-wiki-gigaword-50")
model.most_similar("glass")
In corpora, the corpus is never loaded to memory, all corpuses wrapped to special class Dataset
and provide __iter__
method