In this tutorial, you will learn how to use the author-topic model in Gensim for authorship prediction, based on the topic distributions and measuring their similarity. We will train the author-topic model on a Reuters dataset, which contains 50 authors, each with 50 documents for training and another 50 documents for testing: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50 .
If you wish to learn more about the author-topic model and LDA, and how to train them, you should check out these tutorials beforehand. A lot of the preprocessing and configuration here follows their examples:
NOTE:
To run this tutorial on your own, install Jupyter, Gensim, SpaCy, Scikit-Learn, Bokeh and Pandas, e.g. using pip:
pip install jupyter gensim spacy sklearn bokeh pandas
Note that you need to download some data for SpaCy using
python -m spacy.en.download
Download the notebook at https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_prediction_tutorial.ipynb.
Predicting the author of a document is a difficult task, and current approaches usually turn to neural networks. These base much of their prediction on learning the stylistic and syntactic preferences of the authors, along with other features that help identify an author.
In our case, we first model the domain knowledge of each author, based on what the author writes about. We do this by calculating the topic distributions for each author using the author-topic model. After that, we perform new-author inference on the held-out subset, which again calculates a topic distribution for this new, unknown author. To make the prediction, we find, out of all known authors, the one most similar to the unknown author. Mathematically speaking, we find the author whose topic distribution is closest to the topic distribution of the new author, according to a certain distance function or metric. Here we use the Hellinger distance to measure the distance between two discrete topic distributions.
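To make the distance concrete, here is a minimal sketch (with made-up numbers, not part of the dataset) of how the Hellinger distance between two topic distributions can be computed with Gensim, and how it is turned into the similarity score used for ranking authors later in this notebook:
from gensim import matutils
# Two toy topic distributions over 3 topics; each sums to 1.
known_author = [0.1, 0.6, 0.3]    # topic distribution of a known author
unknown_author = [0.2, 0.5, 0.3]  # topic distribution inferred for the unknown author
dist = matutils.hellinger(known_author, unknown_author)
sim = 1.0 / (1.0 + dist)  # similarity score: 1 means identical distributions
print(dist, sim)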
We start off by downloading the dataset. You can do it manually using the aforementioned link, or run the following code cell.
In [1]:
!wget -O - "https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip" > /tmp/C50.zip
--2018-03-25 17:24:26-- https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip
Resolving archive.ics.uci.edu... 128.195.10.249
Connecting to archive.ics.uci.edu|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8194031 (7.8M) [application/zip]
Saving to: 'STDOUT'
- 100%[===================>] 7.81M 2.30MB/s in 3.4s
2018-03-25 17:24:31 (2.30 MB/s) - written to stdout [8194031/8194031]
In [2]:
import logging
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')
In [3]:
import zipfile
filename = '/tmp/C50.zip'
zip_ref = zipfile.ZipFile(filename, 'r')
zip_ref.extractall("/tmp/")
zip_ref.close()
We wrap all the preprocessing steps, which you can read more about in the author-topic notebook, in one function so that we can iterate over different preprocessing parameters.
In [4]:
import os, re, io

def preprocess_docs(data_dir):
    doc_ids = []
    author2doc = {}
    docs = []

    folders = os.listdir(data_dir)  # List of author folder names.
    for authorname in folders:
        files = os.listdir(data_dir + '/' + authorname)
        for filen in files:
            (idx1, idx2) = re.search('[0-9]+', filen).span()  # Matches the indexes of the start and end of the ID.
            if not author2doc.get(authorname):
                # This is a new author.
                author2doc[authorname] = []
            doc_id = str(int(filen[idx1:idx2]))
            doc_ids.append(doc_id)
            author2doc[authorname].extend([doc_id])

            # Read document text.
            # Note: ignoring characters that cause encoding errors.
            with io.open(data_dir + '/' + authorname + '/' + filen, errors='ignore', encoding='utf-8') as fid:
                txt = fid.read()

            # Replace any whitespace (newline, tabs, etc.) by a single space.
            txt = re.sub(r'\s', ' ', txt)
            docs.append(txt)

    doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))
    # Replace dataset IDs by integer IDs.
    for a, a_doc_ids in author2doc.items():
        for i, doc_id in enumerate(a_doc_ids):
            author2doc[a][i] = doc_id_dict[doc_id]

    import spacy
    nlp = spacy.load('en')

    processed_docs = []
    for doc in nlp.pipe(docs, n_threads=4, batch_size=100):
        # Process document using Spacy NLP pipeline.
        ents = doc.ents  # Named entities.

        # Keep only words (no numbers, no punctuation).
        # Lemmatize tokens, remove punctuation and remove stopwords.
        doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

        # Remove common words from a stopword list.
        #doc = [token for token in doc if token not in STOPWORDS]

        # Add named entities, but only if they are a compound of more than one word.
        doc.extend([str(entity) for entity in ents if len(entity) > 1])
        processed_docs.append(doc)

    docs = processed_docs
    del processed_docs

    # Compute bigrams.
    from gensim.models import Phrases
    # Add bigrams to docs (only ones that appear 20 times or more).
    bigram = Phrases(docs, min_count=20)
    for idx in range(len(docs)):
        for token in bigram[docs[idx]]:
            if '_' in token:
                # Token is a bigram, add to document.
                docs[idx].append(token)

    return docs, author2doc
We create the corpora for the train and test data using two separate functions, since each corpus is tied to a particular dictionary that maps words to their ids. To create the test corpus, we use the dictionary from the train data, because the trained model has to share the same id2word mapping as the new test data. Otherwise the token with id 1 in the test data wouldn't mean the same thing as the token with id 1 the model was trained on.
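As a small illustration (with made-up tokens, not taken from the Reuters data) of why the train dictionary must be reused: doc2bow maps identical words to identical ids and simply ignores tokens that are not in the dictionary, so a test corpus built with the train dictionary stays compatible with the trained model:
from gensim.corpora import Dictionary
train_dict = Dictionary([["apple", "computer", "software"]])
test_doc = ["apple", "software", "unseen_word"]
print(train_dict.doc2bow(test_doc))
# -> [(0, 1), (2, 1)]: "apple" and "software" keep their train ids, "unseen_word" is dropped.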
In [5]:
def create_corpus_dictionary(docs, max_freq=0.5, min_wordcount=20):
    # Create a dictionary representation of the documents, and filter out frequent and rare words.
    from gensim.corpora import Dictionary
    dictionary = Dictionary(docs)

    # Remove rare and common tokens.
    # Filter out words that occur too frequently or too rarely.
    dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)
    _ = dictionary[0]  # This sort of "initializes" dictionary.id2token.

    # Vectorize data.
    # Bag-of-words representation of the documents.
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    return corpus, dictionary

def create_test_corpus(train_dictionary, docs):
    # Create the test corpus using the dictionary from the train data.
    return [train_dictionary.doc2bow(doc) for doc in docs]
For our first training, we set the parameters max_freq and min_wordcount to 0.5 (i.e. 50% of the documents) and 20, as proposed by the original notebook tutorial. We will find out whether this configuration is good enough for us.
In [6]:
traindata_dir = "/tmp/C50train"
train_docs, train_author2doc = preprocess_docs(traindata_dir)
train_corpus_50_20, train_dictionary_50_20 = create_corpus_dictionary(train_docs, 0.5, 20)
05:24:36 DEBUG:Registered VCS backend: git
05:24:36 DEBUG:Registered VCS backend: hg
05:24:36 DEBUG:Registered VCS backend: svn
05:24:36 DEBUG:Registered VCS backend: bzr
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs
05:26:17 INFO:'pattern' package not found; tag filters are not available for English
05:26:17 INFO:collecting all words and their counts
05:26:17 INFO:PROGRESS: at sentence #0, processed 0 words and 0 word types
05:26:19 INFO:collected 437598 word types from a corpus of 746622 words (unigram + bigrams) and 2500 sentences
05:26:19 INFO:using 437598 counts as vocab in Phrases<0 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>
/Users/martin/Projects/bachelor/gensim/gensim/models/phrases.py:490: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")
05:26:24 INFO:adding document #0 to Dictionary(0 unique tokens: [])
05:26:25 INFO:built Dictionary(46905 unique tokens: ['$83.4 million', 'boarder', '$2.72 billion', 'checking', 'suzuki']...) from 2500 documents (total 786032 corpus positions)
05:26:25 INFO:discarding 42991 tokens: [('$1.4 billion', 11), ('$15', 3), ('$17.25', 1), ('$380 million', 2), ('12.5 cents', 7), ('Big B', 3), ('Big B Inc.', 2), ("Big B's", 3), ('Big B. I', 1), ('Dwayne Hoven', 1)]...
05:26:25 INFO:keeping 3914 tokens which were in no less than 20 and no more than 1250 (=50.0%) documents
05:26:25 DEBUG:rebuilding dictionary, shrinking gaps
05:26:25 INFO:resulting dictionary: Dictionary(3914 unique tokens: ['chris_patten', 'online', 'loss', 'hub', 'sound']...)
In [7]:
print('Number of unique tokens: %d' % len(train_dictionary_50_20))
Number of unique tokens: 3914
In [8]:
testdata_dir = "/tmp/C50test"
test_docs, test_author2doc = preprocess_docs(testdata_dir)
test_corpus_50_20 = create_test_corpus(train_dictionary_50_20, test_docs)
CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 15 µs
05:28:06 INFO:collecting all words and their counts
05:28:06 INFO:PROGRESS: at sentence #0, processed 0 words and 0 word types
05:28:08 INFO:collected 448895 word types from a corpus of 758070 words (unigram + bigrams) and 2500 sentences
05:28:08 INFO:using 448895 counts as vocab in Phrases<0 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>
/Users/martin/Projects/bachelor/gensim/gensim/models/phrases.py:490: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")
We also wrap the model training in a function, so that we can again iterate over different parametrizations.
In [9]:
def train_model(corpus, author2doc, dictionary, num_topics=20, eval_every=0, iterations=50, passes=20):
    from gensim.models import AuthorTopicModel
    model = AuthorTopicModel(corpus=corpus, num_topics=num_topics, id2word=dictionary.id2token,
                             author2doc=author2doc, chunksize=2500, passes=passes,
                             eval_every=eval_every, iterations=iterations, random_state=1)
    # Print the average topic coherence as a rough quality measure of the trained model.
    top_topics = model.top_topics(corpus)
    tc = sum([t[1] for t in top_topics])
    print(tc / num_topics)
    return model
In [10]:
# NOTE: The author of the logic of this function is Olavur Mortensen, from his notebook tutorial.
def predict_author(new_doc, atmodel, top_n=10, smallest_author=1):
    from gensim import matutils
    import pandas as pd

    def similarity(vec1, vec2):
        '''Get similarity between two vectors.'''
        dist = matutils.hellinger(matutils.sparse2full(vec1, atmodel.num_topics),
                                  matutils.sparse2full(vec2, atmodel.num_topics))
        sim = 1.0 / (1.0 + dist)
        return sim

    def get_sims(vec):
        '''Get similarity of vector to all authors.'''
        sims = [similarity(vec, vec2) for vec2 in author_vecs]
        return sims

    author_vecs = [atmodel.get_author_topics(author) for author in atmodel.id2author.values()]
    new_doc_topics = atmodel.get_new_author_topics(new_doc)

    # Get similarities.
    sims = get_sims(new_doc_topics)

    # Arrange author names, similarities, and author sizes in a list of tuples.
    table = []
    for elem in enumerate(sims):
        author_name = atmodel.id2author[elem[0]]
        sim = elem[1]
        author_size = len(atmodel.author2doc[author_name])
        if author_size >= smallest_author:
            table.append((author_name, sim, author_size))

    # Make dataframe and retrieve top authors.
    df = pd.DataFrame(table, columns=['Author', 'Score', 'Size'])
    df = df.sort_values('Score', ascending=False)[:top_n]
    return df
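As a quick usage sketch (assuming the 20-topic model atmodel_standard trained further below and the test corpus test_corpus_50_20 created above), we could inspect the ranked prediction for a single held-out document like this:
# Rank the known authors by similarity to the first held-out test document.
predict_author(test_corpus_50_20[0:1], atmodel=atmodel_standard, top_n=5)
# Returns a pandas DataFrame with columns Author, Score and Size, sorted by Score.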
We define a custom function that measures the prediction accuracy, following the precision-at-k principle. The accuracy is parametrized by k: k=1 means we need an exact match to count a prediction as correct, while k=5 means the true author only has to appear among the top 5 results, ordered by similarity. For example, if the true author is among the top 5 predictions for 2,000 of the 2,500 test documents, precision@5 is 0.8.
In [11]:
def prediction_accuracy(test_author2doc, test_corpus, model, k=5):
    print("Precision@k: top_n={}".format(k))
    matches = 0
    tries = 0
    for author in test_author2doc:
        author_id = model.author2id[author]
        for doc_id in test_author2doc[author]:
            predicted_authors = predict_author(test_corpus[doc_id:doc_id+1], atmodel=model, top_n=k)
            tries = tries + 1
            # Note: membership test on a pandas Series checks its index;
            # here the index labels coincide with the author ids.
            if author_id in predicted_authors["Author"]:
                matches = matches + 1
    accuracy = matches / tries
    print("Prediction accuracy: {}".format(accuracy))
    return accuracy, k
In [12]:
def plot_accuracy(scores1, label1, scores2=None, label2=None):
    import matplotlib.pyplot as plt
    s = [score*100 for score in scores1.values()]
    t = list(scores1.keys())
    plt.plot(t, s, "b-", label=label1)
    plt.plot(t, s, "r^", label=label1+" data points")
    if scores2 is not None:
        s2 = [score*100 for score in scores2.values()]
        plt.plot(t, s2, label=label2)
        plt.plot(t, s2, "o", label=label2+" data points")
    plt.legend(loc="lower right")
    plt.xlabel('parameter k')
    plt.ylabel('prediction accuracy')
    plt.title('Precision at k')
    plt.xticks(t)
    plt.grid(True)
    plt.yticks([30, 40, 50, 60, 70, 80, 90, 100])
    plt.axis([0, 11, 30, 100])
    plt.show()
We calculate the accuracy for a range of values k = [1, 2, 3, 4, 5, 6, 8, 10] and plot how the prediction accuracy naturally rises with higher k.
In [13]:
atmodel_standard = train_model(train_corpus_50_20, train_author2doc, train_dictionary_50_20)
05:28:14 INFO:Vocabulary consists of 3914 words.
05:28:14 INFO:using symmetric alpha at 0.05
05:28:14 INFO:using symmetric eta at 0.05
05:28:14 INFO:running online author-topic training, 20 topics, 50 authors, 20 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000
05:28:14 INFO:PROGRESS: pass 0, at document #2500/2500
05:28:14 DEBUG:performing inference on a chunk of 2500 documents
05:28:22 DEBUG:3/2500 documents converged within 50 iterations
05:28:22 DEBUG:updating topics
05:28:22 INFO:topic #11 (0.050): 0.028*"gm" + 0.013*"plant" + 0.012*"strike" + 0.009*"worker" + 0.009*"uaw" + 0.007*"automaker" + 0.007*"share" + 0.007*"union" + 0.006*"truck" + 0.006*"analyst"
05:28:22 INFO:topic #17 (0.050): 0.018*"apple" + 0.008*"computer" + 0.008*"software" + 0.008*"share" + 0.008*"analyst" + 0.007*"quarter" + 0.006*"microsoft" + 0.006*"service" + 0.006*"base" + 0.005*"plan"
05:28:22 INFO:topic #15 (0.050): 0.009*"analyst" + 0.008*"computer" + 0.008*"stock" + 0.007*"billion" + 0.007*"quarter" + 0.007*"share" + 0.006*"industry" + 0.005*"software" + 0.005*"oil" + 0.005*"sale"
05:28:22 INFO:topic #9 (0.050): 0.009*"analyst" + 0.006*"share" + 0.006*"china" + 0.006*"gold" + 0.006*"chinese" + 0.005*"price" + 0.005*"government" + 0.005*"stock" + 0.004*"base" + 0.004*"drug"
05:28:22 INFO:topic #14 (0.050): 0.010*"pound" + 0.009*"share" + 0.008*"profit" + 0.007*"billion" + 0.007*"analyst" + 0.007*"group" + 0.007*"bank" + 0.006*"business" + 0.005*"million_pound" + 0.005*"price"
05:28:22 INFO:topic diff=2.864277, rho=1.000000
05:28:22 INFO:PROGRESS: pass 1, at document #2500/2500
05:28:22 DEBUG:performing inference on a chunk of 2500 documents
05:28:25 DEBUG:2491/2500 documents converged within 50 iterations
05:28:25 DEBUG:updating topics
05:28:25 INFO:topic #0 (0.050): 0.011*"bank" + 0.009*"analyst" + 0.005*"share" + 0.005*"billion" + 0.005*"government" + 0.004*"news" + 0.004*"business" + 0.004*"rule" + 0.004*"profit" + 0.004*"group"
05:28:25 INFO:topic #14 (0.050): 0.011*"pound" + 0.010*"share" + 0.009*"profit" + 0.008*"group" + 0.008*"analyst" + 0.008*"billion" + 0.007*"bank" + 0.006*"business" + 0.006*"million_pound" + 0.005*"penny"
05:28:25 INFO:topic #15 (0.050): 0.010*"analyst" + 0.009*"stock" + 0.008*"share" + 0.008*"billion" + 0.007*"quarter" + 0.007*"computer" + 0.006*"oil" + 0.006*"bank" + 0.005*"industry" + 0.005*"high"
05:28:25 INFO:topic #1 (0.050): 0.014*"bank" + 0.010*"china" + 0.009*"hong_kong" + 0.009*"kong" + 0.008*"hong" + 0.008*"billion" + 0.007*"Hong Kong" + 0.006*"analyst" + 0.006*"stock" + 0.006*"fund"
05:28:25 INFO:topic #7 (0.050): 0.007*"analyst" + 0.007*"sale" + 0.007*"share" + 0.006*"group" + 0.005*"business" + 0.005*"price" + 0.004*"profit" + 0.004*"industry" + 0.004*"pound" + 0.004*"billion"
05:28:25 INFO:topic diff=1.147566, rho=0.577350
05:28:25 INFO:PROGRESS: pass 2, at document #2500/2500
05:28:25 DEBUG:performing inference on a chunk of 2500 documents
05:28:27 DEBUG:2498/2500 documents converged within 50 iterations
05:28:27 DEBUG:updating topics
05:28:27 INFO:topic #9 (0.050): 0.011*"drug" + 0.009*"colombia" + 0.007*"analyst" + 0.006*"government" + 0.006*"sale" + 0.005*"share" + 0.005*"base" + 0.004*"price" + 0.004*"stock" + 0.004*"united"
05:28:27 INFO:topic #15 (0.050): 0.010*"stock" + 0.009*"analyst" + 0.008*"share" + 0.008*"billion" + 0.007*"bank" + 0.007*"oil" + 0.006*"quarter" + 0.006*"canada" + 0.006*"toronto" + 0.005*"high"
05:28:27 INFO:topic #8 (0.050): 0.026*"bre" + 0.024*"gold" + 0.024*"bre_x" + 0.024*"x" + 0.018*"Bre-X" + 0.015*"barrick" + 0.011*"analyst" + 0.010*"busang" + 0.010*"indonesian" + 0.008*"government"
05:28:27 INFO:topic #19 (0.050): 0.019*"hong" + 0.018*"kong" + 0.018*"hong_kong" + 0.014*"china" + 0.012*"Hong Kong" + 0.006*"chinese" + 0.005*"price" + 0.005*"british" + 0.005*"tell" + 0.004*"tung"
05:28:27 INFO:topic #10 (0.050): 0.009*"billion" + 0.008*"bank" + 0.005*"loan" + 0.005*"tonne" + 0.005*"yen" + 0.005*"price" + 0.005*"exporter" + 0.004*"real_estate" + 0.004*"analyst" + 0.004*"real"
05:28:27 INFO:topic diff=1.010061, rho=0.500000
05:28:27 INFO:PROGRESS: pass 3, at document #2500/2500
05:28:27 DEBUG:performing inference on a chunk of 2500 documents
05:28:29 DEBUG:2500/2500 documents converged within 50 iterations
05:28:29 DEBUG:updating topics
05:28:29 INFO:topic #13 (0.050): 0.017*"china" + 0.014*"wang" + 0.012*"beijing" + 0.011*"taiwan" + 0.009*"court" + 0.009*"party" + 0.008*"chinese" + 0.008*"government" + 0.007*"official" + 0.007*"communist"
05:28:29 INFO:topic #19 (0.050): 0.022*"hong" + 0.021*"kong" + 0.021*"hong_kong" + 0.015*"china" + 0.014*"Hong Kong" + 0.006*"chinese" + 0.005*"airbus" + 0.005*"tung" + 0.005*"british" + 0.005*"Hong Kong's"
05:28:29 INFO:topic #12 (0.050): 0.012*"czech" + 0.007*"bank" + 0.007*"crown" + 0.006*"government" + 0.006*"klaus" + 0.005*"billion" + 0.005*"price" + 0.005*"party" + 0.005*"prague" + 0.005*"foreign"
05:28:29 INFO:topic #17 (0.050): 0.041*"apple" + 0.026*"computer" + 0.022*"software" + 0.020*"quarter" + 0.013*"microsoft" + 0.013*"analyst" + 0.010*"share" + 0.009*"sale" + 0.009*"macintosh" + 0.008*"pc"
05:28:29 INFO:topic #6 (0.050): 0.020*"share" + 0.017*"analyst" + 0.010*"bank" + 0.010*"shanghai" + 0.009*"stock" + 0.007*"sale" + 0.007*"b" + 0.006*"quarter" + 0.006*"base" + 0.005*"business"
05:28:29 INFO:topic diff=0.877566, rho=0.447214
05:28:29 INFO:PROGRESS: pass 4, at document #2500/2500
05:28:29 DEBUG:performing inference on a chunk of 2500 documents
05:28:31 DEBUG:2500/2500 documents converged within 50 iterations
05:28:31 DEBUG:updating topics
05:28:31 INFO:topic #14 (0.050): 0.012*"pound" + 0.010*"profit" + 0.010*"share" + 0.009*"group" + 0.008*"analyst" + 0.008*"billion" + 0.007*"business" + 0.007*"bank" + 0.006*"million_pound" + 0.005*"british"
05:28:31 INFO:topic #9 (0.050): 0.014*"drug" + 0.011*"colombia" + 0.006*"government" + 0.005*"analyst" + 0.005*"sale" + 0.005*"united" + 0.005*"colombian" + 0.004*"guerrilla" + 0.004*"base" + 0.004*"force"
05:28:31 INFO:topic #13 (0.050): 0.018*"china" + 0.015*"wang" + 0.013*"beijing" + 0.012*"taiwan" + 0.009*"court" + 0.009*"party" + 0.009*"chinese" + 0.008*"government" + 0.007*"communist" + 0.007*"official"
05:28:31 INFO:topic #2 (0.050): 0.010*"share" + 0.009*"analyst" + 0.008*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business"
05:28:31 INFO:topic #15 (0.050): 0.010*"stock" + 0.009*"analyst" + 0.008*"billion" + 0.008*"share" + 0.008*"bank" + 0.007*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"russia" + 0.005*"high"
05:28:31 INFO:topic diff=0.761073, rho=0.408248
05:28:31 INFO:PROGRESS: pass 5, at document #2500/2500
05:28:31 DEBUG:performing inference on a chunk of 2500 documents
05:28:33 DEBUG:2500/2500 documents converged within 50 iterations
05:28:33 DEBUG:updating topics
05:28:33 INFO:topic #15 (0.050): 0.010*"stock" + 0.009*"analyst" + 0.009*"bank" + 0.008*"billion" + 0.008*"share" + 0.007*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"russia" + 0.006*"tonne"
05:28:33 INFO:topic #2 (0.050): 0.011*"share" + 0.009*"analyst" + 0.008*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business"
05:28:33 INFO:topic #13 (0.050): 0.018*"china" + 0.015*"wang" + 0.013*"beijing" + 0.012*"taiwan" + 0.009*"court" + 0.009*"chinese" + 0.009*"party" + 0.008*"government" + 0.007*"communist" + 0.007*"official"
05:28:33 INFO:topic #8 (0.050): 0.027*"gold" + 0.026*"bre" + 0.026*"bre_x" + 0.026*"x" + 0.019*"Bre-X" + 0.015*"barrick" + 0.012*"analyst" + 0.011*"busang" + 0.010*"indonesian" + 0.009*"government"
05:28:33 INFO:topic #1 (0.050): 0.017*"bank" + 0.010*"fund" + 0.010*"china" + 0.009*"billion" + 0.008*"hong_kong" + 0.008*"kong" + 0.008*"hong" + 0.007*"financial" + 0.007*"japan" + 0.006*"Hong Kong"
05:28:33 INFO:topic diff=0.658823, rho=0.377964
05:28:33 INFO:PROGRESS: pass 6, at document #2500/2500
05:28:33 DEBUG:performing inference on a chunk of 2500 documents
05:28:35 DEBUG:2500/2500 documents converged within 50 iterations
05:28:35 DEBUG:updating topics
05:28:35 INFO:topic #17 (0.050): 0.038*"apple" + 0.027*"computer" + 0.022*"software" + 0.021*"quarter" + 0.014*"analyst" + 0.013*"microsoft" + 0.010*"sale" + 0.010*"share" + 0.008*"pc" + 0.008*"macintosh"
05:28:35 INFO:topic #9 (0.050): 0.015*"drug" + 0.012*"colombia" + 0.006*"government" + 0.005*"united" + 0.005*"sale" + 0.005*"colombian" + 0.005*"guerrilla" + 0.005*"analyst" + 0.004*"force" + 0.004*"week"
05:28:35 INFO:topic #15 (0.050): 0.010*"stock" + 0.009*"bank" + 0.009*"analyst" + 0.008*"billion" + 0.008*"share" + 0.007*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"russia" + 0.006*"tonne"
05:28:35 INFO:topic #13 (0.050): 0.018*"china" + 0.015*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"court" + 0.009*"chinese" + 0.009*"party" + 0.008*"government" + 0.007*"communist" + 0.007*"official"
05:28:35 INFO:topic #16 (0.050): 0.016*"franc" + 0.015*"french" + 0.015*"air" + 0.014*"france" + 0.011*"thomson" + 0.010*"billion" + 0.009*"group" + 0.007*"government" + 0.007*"plan" + 0.007*"bid"
05:28:35 INFO:topic diff=0.568497, rho=0.353553
05:28:35 INFO:PROGRESS: pass 7, at document #2500/2500
05:28:35 DEBUG:performing inference on a chunk of 2500 documents
05:28:36 DEBUG:2500/2500 documents converged within 50 iterations
05:28:36 DEBUG:updating topics
05:28:37 INFO:topic #17 (0.050): 0.037*"apple" + 0.027*"computer" + 0.022*"software" + 0.021*"quarter" + 0.014*"analyst" + 0.013*"microsoft" + 0.010*"sale" + 0.010*"share" + 0.008*"pc" + 0.008*"macintosh"
05:28:37 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business"
05:28:37 INFO:topic #3 (0.050): 0.027*"bt" + 0.017*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"billion" + 0.011*"analyst" + 0.011*"deal" + 0.010*"british" + 0.010*"share" + 0.010*"group"
05:28:37 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"court" + 0.009*"chinese" + 0.009*"party" + 0.008*"government" + 0.007*"communist" + 0.007*"official"
05:28:37 INFO:topic #4 (0.050): 0.018*"china" + 0.011*"official" + 0.009*"state" + 0.008*"beijing" + 0.008*"tibet" + 0.007*"chinese" + 0.007*"government" + 0.007*"wang" + 0.006*"people" + 0.005*"dissident"
05:28:37 INFO:topic diff=0.488932, rho=0.333333
05:28:37 INFO:PROGRESS: pass 8, at document #2500/2500
05:28:37 DEBUG:performing inference on a chunk of 2500 documents
05:28:38 DEBUG:2500/2500 documents converged within 50 iterations
05:28:38 DEBUG:updating topics
05:28:38 INFO:topic #17 (0.050): 0.037*"apple" + 0.027*"computer" + 0.022*"software" + 0.021*"quarter" + 0.014*"analyst" + 0.013*"microsoft" + 0.010*"sale" + 0.010*"share" + 0.008*"pc" + 0.008*"macintosh"
05:28:38 INFO:topic #5 (0.050): 0.032*"china" + 0.016*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.009*"tonne" + 0.007*"hong" + 0.007*"hong_kong" + 0.007*"kong" + 0.007*"trade" + 0.006*"state"
05:28:38 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"chinese" + 0.009*"court" + 0.009*"party" + 0.008*"government" + 0.007*"communist" + 0.007*"official"
05:28:38 INFO:topic #14 (0.050): 0.012*"pound" + 0.011*"profit" + 0.010*"share" + 0.009*"analyst" + 0.009*"group" + 0.008*"billion" + 0.007*"bank" + 0.007*"business" + 0.006*"million_pound" + 0.005*"british"
05:28:38 INFO:topic #6 (0.050): 0.019*"share" + 0.016*"analyst" + 0.012*"shanghai" + 0.011*"bank" + 0.009*"stock" + 0.007*"b" + 0.007*"sale" + 0.006*"exchange" + 0.006*"base" + 0.006*"quarter"
05:28:38 INFO:topic diff=0.419457, rho=0.316228
05:28:38 INFO:PROGRESS: pass 9, at document #2500/2500
05:28:38 DEBUG:performing inference on a chunk of 2500 documents
05:28:40 DEBUG:2500/2500 documents converged within 50 iterations
05:28:40 DEBUG:updating topics
05:28:40 INFO:topic #3 (0.050): 0.027*"bt" + 0.017*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"billion" + 0.011*"analyst" + 0.011*"deal" + 0.010*"british" + 0.010*"share" + 0.010*"group"
05:28:40 INFO:topic #1 (0.050): 0.018*"bank" + 0.011*"fund" + 0.009*"billion" + 0.008*"china" + 0.008*"financial" + 0.007*"japan" + 0.007*"kong" + 0.007*"hong_kong" + 0.007*"hong" + 0.006*"analyst"
05:28:40 INFO:topic #15 (0.050): 0.010*"bank" + 0.009*"stock" + 0.008*"analyst" + 0.008*"billion" + 0.008*"share" + 0.008*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"tonne" + 0.006*"russia"
05:28:40 INFO:topic #19 (0.050): 0.029*"hong" + 0.028*"kong" + 0.028*"hong_kong" + 0.019*"Hong Kong" + 0.019*"china" + 0.008*"chinese" + 0.007*"tung" + 0.007*"Hong Kong's" + 0.006*"beijing" + 0.006*"airbus"
05:28:40 INFO:topic #8 (0.050): 0.028*"gold" + 0.026*"bre" + 0.026*"bre_x" + 0.026*"x" + 0.019*"Bre-X" + 0.015*"barrick" + 0.012*"analyst" + 0.011*"busang" + 0.010*"indonesian" + 0.009*"government"
05:28:40 INFO:topic diff=0.359320, rho=0.301511
05:28:40 INFO:PROGRESS: pass 10, at document #2500/2500
05:28:40 DEBUG:performing inference on a chunk of 2500 documents
05:28:41 DEBUG:2500/2500 documents converged within 50 iterations
05:28:41 DEBUG:updating topics
05:28:41 INFO:topic #6 (0.050): 0.019*"share" + 0.016*"analyst" + 0.013*"shanghai" + 0.011*"bank" + 0.009*"stock" + 0.008*"b" + 0.007*"sale" + 0.007*"exchange" + 0.006*"china" + 0.006*"base"
05:28:41 INFO:topic #11 (0.050): 0.042*"gm" + 0.028*"plant" + 0.016*"uaw" + 0.016*"strike" + 0.015*"worker" + 0.011*"automaker" + 0.010*"local" + 0.010*"truck" + 0.009*"part" + 0.008*"ford"
05:28:41 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business"
05:28:41 INFO:topic #1 (0.050): 0.019*"bank" + 0.011*"fund" + 0.009*"billion" + 0.008*"financial" + 0.008*"china" + 0.008*"japan" + 0.007*"kong" + 0.007*"hong_kong" + 0.007*"hong" + 0.006*"analyst"
05:28:41 INFO:topic #12 (0.050): 0.014*"czech" + 0.008*"crown" + 0.008*"bank" + 0.007*"klaus" + 0.007*"government" + 0.006*"billion" + 0.006*"prague" + 0.005*"price" + 0.005*"foreign" + 0.005*"party"
05:28:41 INFO:topic diff=0.307661, rho=0.288675
05:28:41 INFO:PROGRESS: pass 11, at document #2500/2500
05:28:41 DEBUG:performing inference on a chunk of 2500 documents
05:28:43 DEBUG:2500/2500 documents converged within 50 iterations
05:28:43 DEBUG:updating topics
05:28:43 INFO:topic #15 (0.050): 0.011*"bank" + 0.009*"stock" + 0.008*"billion" + 0.008*"analyst" + 0.008*"share" + 0.008*"oil" + 0.007*"canada" + 0.007*"toronto" + 0.006*"russia" + 0.006*"tonne"
05:28:43 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.005*"business"
05:28:43 INFO:topic #5 (0.050): 0.032*"china" + 0.016*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.009*"tonne" + 0.007*"hong" + 0.007*"hong_kong" + 0.007*"kong" + 0.007*"trade" + 0.006*"state"
05:28:43 INFO:topic #7 (0.050): 0.009*"sale" + 0.009*"analyst" + 0.007*"share" + 0.007*"group" + 0.006*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive"
05:28:43 INFO:topic #19 (0.050): 0.029*"hong" + 0.029*"kong" + 0.029*"hong_kong" + 0.019*"Hong Kong" + 0.019*"china" + 0.008*"chinese" + 0.008*"tung" + 0.007*"Hong Kong's" + 0.006*"beijing" + 0.006*"airbus"
05:28:43 INFO:topic diff=0.263525, rho=0.277350
05:28:43 INFO:PROGRESS: pass 12, at document #2500/2500
05:28:43 DEBUG:performing inference on a chunk of 2500 documents
05:28:44 DEBUG:2500/2500 documents converged within 50 iterations
05:28:44 DEBUG:updating topics
05:28:45 INFO:topic #11 (0.050): 0.042*"gm" + 0.028*"plant" + 0.016*"uaw" + 0.016*"strike" + 0.015*"worker" + 0.011*"automaker" + 0.010*"local" + 0.010*"truck" + 0.009*"part" + 0.008*"ford"
05:28:45 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"chinese" + 0.009*"court" + 0.008*"party" + 0.008*"government" + 0.007*"official" + 0.007*"communist"
05:28:45 INFO:topic #4 (0.050): 0.021*"china" + 0.012*"official" + 0.010*"beijing" + 0.009*"chinese" + 0.009*"wang" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.006*"people" + 0.006*"dissident"
05:28:45 INFO:topic #16 (0.050): 0.020*"franc" + 0.018*"french" + 0.017*"air" + 0.017*"france" + 0.014*"thomson" + 0.012*"billion" + 0.010*"group" + 0.008*"billion_franc" + 0.008*"telecom" + 0.007*"plan"
05:28:45 INFO:topic #18 (0.050): 0.014*"analyst" + 0.011*"computer" + 0.010*"quarter" + 0.010*"internet" + 0.008*"share" + 0.008*"business" + 0.008*"service" + 0.008*"stock" + 0.007*"industry" + 0.007*"software"
05:28:45 INFO:topic diff=0.226015, rho=0.267261
05:28:45 INFO:PROGRESS: pass 13, at document #2500/2500
05:28:45 DEBUG:performing inference on a chunk of 2500 documents
05:28:46 DEBUG:2500/2500 documents converged within 50 iterations
05:28:46 DEBUG:updating topics
05:28:46 INFO:topic #4 (0.050): 0.021*"china" + 0.012*"official" + 0.010*"beijing" + 0.009*"wang" + 0.009*"chinese" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.006*"people" + 0.006*"dissident"
05:28:46 INFO:topic #3 (0.050): 0.027*"bt" + 0.017*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"billion" + 0.011*"analyst" + 0.011*"deal" + 0.010*"british" + 0.010*"share" + 0.010*"group"
05:28:46 INFO:topic #12 (0.050): 0.015*"czech" + 0.009*"crown" + 0.008*"bank" + 0.007*"klaus" + 0.007*"government" + 0.006*"prague" + 0.006*"billion" + 0.005*"foreign" + 0.005*"party" + 0.005*"price"
05:28:46 INFO:topic #19 (0.050): 0.030*"hong" + 0.030*"kong" + 0.030*"hong_kong" + 0.020*"Hong Kong" + 0.020*"china" + 0.008*"chinese" + 0.008*"tung" + 0.007*"Hong Kong's" + 0.007*"beijing" + 0.006*"airbus"
05:28:46 INFO:topic #5 (0.050): 0.032*"china" + 0.016*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.010*"tonne" + 0.007*"hong" + 0.007*"kong" + 0.007*"hong_kong" + 0.007*"trade" + 0.006*"state"
05:28:46 INFO:topic diff=0.194260, rho=0.258199
05:28:46 INFO:PROGRESS: pass 14, at document #2500/2500
05:28:46 DEBUG:performing inference on a chunk of 2500 documents
05:28:48 DEBUG:2500/2500 documents converged within 50 iterations
05:28:48 DEBUG:updating topics
05:28:48 INFO:topic #5 (0.050): 0.033*"china" + 0.016*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.010*"tonne" + 0.008*"hong" + 0.007*"kong" + 0.007*"hong_kong" + 0.007*"trade" + 0.007*"state"
05:28:48 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"chinese" + 0.009*"court" + 0.008*"party" + 0.008*"government" + 0.007*"official" + 0.007*"communist"
05:28:48 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.006*"business"
05:28:48 INFO:topic #0 (0.050): 0.011*"bank" + 0.009*"internet" + 0.009*"fcc" + 0.008*"service" + 0.008*"phone" + 0.007*"rule" + 0.006*"local" + 0.006*"tv" + 0.006*"court" + 0.006*"law"
05:28:48 INFO:topic #16 (0.050): 0.020*"franc" + 0.019*"french" + 0.018*"air" + 0.017*"france" + 0.014*"thomson" + 0.013*"billion" + 0.010*"group" + 0.008*"billion_franc" + 0.008*"telecom" + 0.007*"plan"
05:28:48 INFO:topic diff=0.167433, rho=0.250000
05:28:48 INFO:PROGRESS: pass 15, at document #2500/2500
05:28:48 DEBUG:performing inference on a chunk of 2500 documents
05:28:49 DEBUG:2500/2500 documents converged within 50 iterations
05:28:49 DEBUG:updating topics
05:28:49 INFO:topic #3 (0.050): 0.027*"bt" + 0.017*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"analyst" + 0.011*"billion" + 0.011*"deal" + 0.010*"british" + 0.010*"share" + 0.010*"group"
05:28:49 INFO:topic #10 (0.050): 0.001*"billion" + 0.000*"bank" + 0.000*"loan" + 0.000*"tonne" + 0.000*"yen" + 0.000*"price" + 0.000*"exporter" + 0.000*"real_estate" + 0.000*"analyst" + 0.000*"real"
05:28:49 INFO:topic #5 (0.050): 0.033*"china" + 0.017*"chinese" + 0.013*"beijing" + 0.012*"official" + 0.010*"tonne" + 0.008*"hong" + 0.008*"kong" + 0.007*"hong_kong" + 0.007*"trade" + 0.007*"state"
05:28:49 INFO:topic #4 (0.050): 0.021*"china" + 0.012*"official" + 0.011*"beijing" + 0.009*"wang" + 0.009*"chinese" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.006*"people" + 0.006*"dissident"
05:28:49 INFO:topic #0 (0.050): 0.010*"bank" + 0.009*"internet" + 0.009*"fcc" + 0.008*"service" + 0.008*"phone" + 0.007*"rule" + 0.007*"local" + 0.007*"tv" + 0.006*"court" + 0.006*"law"
05:28:49 INFO:topic diff=0.144777, rho=0.242536
05:28:49 INFO:PROGRESS: pass 16, at document #2500/2500
05:28:49 DEBUG:performing inference on a chunk of 2500 documents
05:28:51 DEBUG:2500/2500 documents converged within 50 iterations
05:28:51 DEBUG:updating topics
05:28:51 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.006*"business"
05:28:51 INFO:topic #9 (0.050): 0.016*"drug" + 0.013*"colombia" + 0.006*"government" + 0.006*"united" + 0.005*"colombian" + 0.005*"guerrilla" + 0.005*"force" + 0.004*"oil" + 0.004*"country" + 0.004*"police"
05:28:51 INFO:topic #7 (0.050): 0.009*"sale" + 0.008*"analyst" + 0.007*"share" + 0.007*"group" + 0.007*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive"
05:28:51 INFO:topic #4 (0.050): 0.021*"china" + 0.012*"official" + 0.011*"beijing" + 0.009*"wang" + 0.009*"chinese" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.006*"people" + 0.006*"dissident"
05:28:51 INFO:topic #13 (0.050): 0.018*"china" + 0.016*"wang" + 0.014*"beijing" + 0.012*"taiwan" + 0.009*"chinese" + 0.009*"court" + 0.008*"party" + 0.008*"government" + 0.007*"official" + 0.007*"communist"
05:28:51 INFO:topic diff=0.125646, rho=0.235702
05:28:51 INFO:PROGRESS: pass 17, at document #2500/2500
05:28:51 DEBUG:performing inference on a chunk of 2500 documents
05:28:52 DEBUG:2500/2500 documents converged within 50 iterations
05:28:52 DEBUG:updating topics
05:28:52 INFO:topic #6 (0.050): 0.019*"share" + 0.016*"analyst" + 0.013*"shanghai" + 0.012*"bank" + 0.009*"stock" + 0.008*"china" + 0.007*"b" + 0.007*"sale" + 0.007*"exchange" + 0.006*"base"
05:28:52 INFO:topic #14 (0.050): 0.011*"pound" + 0.011*"profit" + 0.010*"share" + 0.009*"analyst" + 0.009*"group" + 0.008*"billion" + 0.007*"bank" + 0.007*"business" + 0.006*"million_pound" + 0.005*"british"
05:28:52 INFO:topic #17 (0.050): 0.036*"apple" + 0.026*"computer" + 0.021*"software" + 0.021*"quarter" + 0.014*"analyst" + 0.013*"microsoft" + 0.010*"sale" + 0.010*"share" + 0.008*"pc" + 0.008*"technology"
05:28:52 INFO:topic #7 (0.050): 0.009*"sale" + 0.008*"analyst" + 0.007*"share" + 0.007*"group" + 0.007*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive"
05:28:52 INFO:topic #8 (0.050): 0.028*"gold" + 0.026*"bre" + 0.026*"bre_x" + 0.026*"x" + 0.019*"Bre-X" + 0.016*"barrick" + 0.012*"analyst" + 0.011*"busang" + 0.010*"indonesian" + 0.009*"government"
05:28:52 INFO:topic diff=0.109484, rho=0.229416
05:28:52 INFO:PROGRESS: pass 18, at document #2500/2500
05:28:52 DEBUG:performing inference on a chunk of 2500 documents
05:28:54 DEBUG:2500/2500 documents converged within 50 iterations
05:28:54 DEBUG:updating topics
05:28:54 INFO:topic #2 (0.050): 0.011*"share" + 0.010*"analyst" + 0.007*"service" + 0.007*"billion" + 0.007*"deal" + 0.006*"offer" + 0.006*"stock" + 0.006*"corp" + 0.006*"industry" + 0.006*"business"
05:28:54 INFO:topic #15 (0.050): 0.012*"bank" + 0.009*"stock" + 0.008*"billion" + 0.008*"analyst" + 0.008*"oil" + 0.007*"canada" + 0.007*"share" + 0.007*"toronto" + 0.007*"russia" + 0.006*"tonne"
05:28:54 INFO:topic #0 (0.050): 0.010*"bank" + 0.009*"internet" + 0.009*"fcc" + 0.009*"service" + 0.008*"phone" + 0.007*"rule" + 0.007*"local" + 0.007*"tv" + 0.007*"court" + 0.006*"law"
05:28:54 INFO:topic #7 (0.050): 0.009*"sale" + 0.008*"analyst" + 0.007*"share" + 0.007*"group" + 0.007*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive"
05:28:54 INFO:topic #18 (0.050): 0.014*"analyst" + 0.011*"computer" + 0.010*"quarter" + 0.010*"internet" + 0.008*"share" + 0.008*"business" + 0.008*"stock" + 0.008*"service" + 0.007*"industry" + 0.007*"software"
05:28:54 INFO:topic diff=0.095805, rho=0.223607
05:28:54 INFO:PROGRESS: pass 19, at document #2500/2500
05:28:54 DEBUG:performing inference on a chunk of 2500 documents
05:28:55 DEBUG:2500/2500 documents converged within 50 iterations
05:28:55 DEBUG:updating topics
05:28:55 INFO:topic #7 (0.050): 0.009*"sale" + 0.008*"analyst" + 0.007*"share" + 0.007*"group" + 0.007*"profit" + 0.006*"business" + 0.005*"pound" + 0.005*"price" + 0.005*"billion" + 0.005*"executive"
05:28:55 INFO:topic #4 (0.050): 0.022*"china" + 0.012*"official" + 0.011*"beijing" + 0.010*"wang" + 0.009*"chinese" + 0.008*"tibet" + 0.007*"state" + 0.007*"government" + 0.007*"people" + 0.006*"dissident"
05:28:55 INFO:topic #19 (0.050): 0.032*"hong" + 0.031*"kong" + 0.031*"hong_kong" + 0.021*"Hong Kong" + 0.021*"china" + 0.009*"chinese" + 0.008*"tung" + 0.008*"Hong Kong's" + 0.007*"beijing" + 0.007*"airbus"
05:28:55 INFO:topic #3 (0.050): 0.027*"bt" + 0.018*"telecom" + 0.015*"mci" + 0.013*"pound" + 0.011*"deal" + 0.011*"analyst" + 0.011*"billion" + 0.011*"british" + 0.010*"share" + 0.010*"group"
05:28:55 INFO:topic #8 (0.050): 0.028*"gold" + 0.026*"bre" + 0.026*"bre_x" + 0.026*"x" + 0.019*"Bre-X" + 0.016*"barrick" + 0.012*"analyst" + 0.011*"busang" + 0.010*"indonesian" + 0.009*"government"
05:28:55 INFO:topic diff=0.084200, rho=0.218218
05:28:55 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=3914, num_topics=20, num_authors=50, decay=0.5, chunksize=2500)
05:28:55 INFO:CorpusAccumulator accumulated stats from 1000 documents
05:28:55 INFO:CorpusAccumulator accumulated stats from 2000 documents
-1.50354141347
We run our first training and check that the passes and iterations parameters are set high enough for the model to converge. The single number printed at the end of train_model is the average topic coherence of the trained model, which serves as a rough quality check. Log output like
07:47:24 INFO:PROGRESS: pass 15, at document #2500/2500
07:47:24 DEBUG:performing inference on a chunk of 2500 documents
07:47:27 DEBUG:2500/2500 documents converged within 50 iterations
tells us that the model indeed converges well.
In [14]:
accuracy_scores_20topic = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_50_20, atmodel_standard, k=i)
    accuracy_scores_20topic[k] = accuracy
plot_accuracy(scores1=accuracy_scores_20topic, label1="20 topics")
Precision@k: top_n=1
Prediction accuracy: 0.3548
Precision@k: top_n=2
Prediction accuracy: 0.5228
Precision@k: top_n=3
Prediction accuracy: 0.6456
Precision@k: top_n=4
Prediction accuracy: 0.7208
Precision@k: top_n=5
Prediction accuracy: 0.7748
Precision@k: top_n=6
Prediction accuracy: 0.8188
Precision@k: top_n=8
Prediction accuracy: 0.8576
Precision@k: top_n=10
Prediction accuracy: 0.8936
This is rather poor accuracy. We increase the number of topics to 100.
In [15]:
atmodel_100topics = train_model(train_corpus_50_20, train_author2doc, train_dictionary_50_20, num_topics=100, eval_every=0, iterations=50, passes=10)
05:31:51 INFO:Vocabulary consists of 3914 words.
05:31:51 INFO:using symmetric alpha at 0.01
05:31:51 INFO:using symmetric eta at 0.01
05:31:53 INFO:running online author-topic training, 100 topics, 50 authors, 10 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000
05:31:53 INFO:PROGRESS: pass 0, at document #2500/2500
05:31:53 DEBUG:performing inference on a chunk of 2500 documents
05:32:05 DEBUG:5/2500 documents converged within 50 iterations
05:32:05 DEBUG:updating topics
05:32:05 INFO:topic #18 (0.010): 0.007*"analyst" + 0.007*"business" + 0.005*"billion" + 0.005*"stock" + 0.005*"boeing" + 0.004*"quarter" + 0.004*"industry" + 0.004*"share" + 0.004*"shareholder" + 0.004*"sale"
05:32:05 INFO:topic #71 (0.010): 0.015*"fcc" + 0.015*"phone" + 0.011*"local" + 0.011*"carrier" + 0.010*"service" + 0.009*"rule" + 0.008*"court" + 0.008*"distance" + 0.008*"long" + 0.007*"tv"
05:32:05 INFO:topic #79 (0.010): 0.011*"china" + 0.010*"beijing" + 0.007*"official" + 0.006*"chinese" + 0.006*"lama" + 0.006*"tibet" + 0.006*"share" + 0.005*"analyst" + 0.005*"region" + 0.005*"billion"
05:32:05 INFO:topic #93 (0.010): 0.015*"ibm" + 0.011*"analyst" + 0.011*"computer" + 0.010*"pc" + 0.009*"sale" + 0.009*"quarter" + 0.008*"industry" + 0.008*"price" + 0.007*"consumer" + 0.007*"service"
05:32:05 INFO:topic #99 (0.010): 0.008*"world" + 0.008*"czech" + 0.007*"analyst" + 0.006*"stock" + 0.006*"win" + 0.005*"billion" + 0.005*"team" + 0.005*"game" + 0.005*"bank" + 0.005*"second"
05:32:05 INFO:topic diff=25.070898, rho=1.000000
05:32:05 INFO:PROGRESS: pass 1, at document #2500/2500
05:32:05 DEBUG:performing inference on a chunk of 2500 documents
05:32:12 DEBUG:2492/2500 documents converged within 50 iterations
05:32:12 DEBUG:updating topics
05:32:12 INFO:topic #70 (0.010): 0.019*"shanghai" + 0.018*"share" + 0.017*"china" + 0.011*"stock" + 0.011*"beijing" + 0.010*"b" + 0.010*"foreign" + 0.010*"exchange" + 0.009*"analyst" + 0.008*"investor"
05:32:12 INFO:topic #2 (0.010): 0.020*"mci" + 0.012*"long" + 0.012*"service" + 0.010*"distance" + 0.010*"analyst" + 0.010*"sprint" + 0.010*"billion" + 0.010*"corp" + 0.008*"local" + 0.008*"deal"
05:32:12 INFO:topic #57 (0.010): 0.024*"china" + 0.013*"beijing" + 0.011*"chinese" + 0.010*"wang" + 0.010*"hong_kong" + 0.009*"hong" + 0.008*"kong" + 0.008*"official" + 0.007*"Hong Kong" + 0.006*"people"
05:32:12 INFO:topic #45 (0.010): 0.021*"time" + 0.016*"executive" + 0.013*"cable" + 0.011*"rise" + 0.011*"sale" + 0.010*"billion" + 0.009*"quarter" + 0.008*"share" + 0.008*"group" + 0.007*"analyst"
05:32:12 INFO:topic #18 (0.010): 0.007*"analyst" + 0.006*"business" + 0.005*"billion" + 0.004*"stock" + 0.004*"boeing" + 0.004*"quarter" + 0.004*"industry" + 0.004*"share" + 0.004*"shareholder" + 0.003*"sale"
05:32:12 INFO:topic diff=7.998665, rho=0.577350
05:32:12 INFO:PROGRESS: pass 2, at document #2500/2500
05:32:12 DEBUG:performing inference on a chunk of 2500 documents
05:32:19 DEBUG:2500/2500 documents converged within 50 iterations
05:32:19 DEBUG:updating topics
05:32:19 INFO:topic #70 (0.010): 0.021*"share" + 0.021*"shanghai" + 0.019*"china" + 0.012*"b" + 0.011*"foreign" + 0.011*"stock" + 0.011*"bank" + 0.011*"analyst" + 0.010*"beijing" + 0.010*"exchange"
05:32:19 INFO:topic #71 (0.010): 0.020*"fcc" + 0.015*"phone" + 0.013*"carrier" + 0.013*"tv" + 0.012*"local" + 0.010*"service" + 0.010*"rule" + 0.010*"long" + 0.009*"distance" + 0.008*"long_distance"
05:32:19 INFO:topic #22 (0.010): 0.033*"bank" + 0.010*"rate" + 0.010*"cut" + 0.009*"analyst" + 0.008*"day" + 0.008*"merger" + 0.007*"profit" + 0.007*"australia" + 0.007*"financial" + 0.006*"ltd"
05:32:19 INFO:topic #18 (0.010): 0.006*"analyst" + 0.005*"business" + 0.004*"billion" + 0.004*"stock" + 0.004*"boeing" + 0.003*"quarter" + 0.003*"industry" + 0.003*"share" + 0.003*"shareholder" + 0.003*"sale"
05:32:19 INFO:topic #5 (0.010): 0.018*"china" + 0.009*"beijing" + 0.008*"tonne" + 0.007*"chinese" + 0.006*"official" + 0.005*"trade" + 0.005*"price" + 0.005*"chen" + 0.005*"trader" + 0.004*"million_tonne"
05:32:19 INFO:topic diff=7.090922, rho=0.500000
05:32:19 INFO:PROGRESS: pass 3, at document #2500/2500
05:32:19 DEBUG:performing inference on a chunk of 2500 documents
05:32:25 DEBUG:2500/2500 documents converged within 50 iterations
05:32:25 DEBUG:updating topics
05:32:26 INFO:topic #9 (0.010): 0.004*"analyst" + 0.003*"government" + 0.002*"share" + 0.002*"china" + 0.002*"cost" + 0.002*"sale" + 0.002*"right" + 0.002*"stock" + 0.002*"big" + 0.002*"end"
05:32:26 INFO:topic #20 (0.010): 0.016*"gold" + 0.016*"bre" + 0.015*"x" + 0.015*"bre_x" + 0.010*"barrick" + 0.009*"Bre-X" + 0.008*"gm" + 0.008*"price" + 0.008*"analyst" + 0.008*"plant"
05:32:26 INFO:topic #4 (0.010): 0.015*"franc" + 0.015*"thomson" + 0.014*"french" + 0.009*"group" + 0.009*"share" + 0.008*"france" + 0.008*"government" + 0.008*"plan" + 0.008*"lagardere" + 0.007*"billion"
05:32:26 INFO:topic #87 (0.010): 0.017*"analyst" + 0.011*"sale" + 0.010*"share" + 0.009*"business" + 0.008*"quarter" + 0.008*"price" + 0.007*"add" + 0.007*"chemical" + 0.006*"stock" + 0.006*"earning"
05:32:26 INFO:topic #62 (0.010): 0.022*"profit" + 0.014*"pound" + 0.011*"sale" + 0.011*"rise" + 0.010*"analyst" + 0.010*"stg" + 0.010*"group" + 0.009*"business" + 0.009*"half" + 0.009*"million_stg"
05:32:26 INFO:topic diff=6.178695, rho=0.447214
05:32:26 INFO:PROGRESS: pass 4, at document #2500/2500
05:32:26 DEBUG:performing inference on a chunk of 2500 documents
05:32:31 DEBUG:2500/2500 documents converged within 50 iterations
05:32:31 DEBUG:updating topics
05:32:32 INFO:topic #47 (0.010): 0.012*"gold" + 0.008*"oil" + 0.008*"share" + 0.007*"stock" + 0.006*"analyst" + 0.006*"government" + 0.006*"price" + 0.006*"colombia" + 0.005*"rise" + 0.005*"issue"
05:32:32 INFO:topic #86 (0.010): 0.014*"cargo" + 0.010*"service" + 0.010*"kong" + 0.009*"hong" + 0.009*"air" + 0.009*"airline" + 0.008*"hong_kong" + 0.007*"Hong Kong" + 0.007*"route" + 0.006*"rate"
05:32:32 INFO:topic #25 (0.010): 0.010*"boeing" + 0.009*"share" + 0.009*"analyst" + 0.007*"billion" + 0.006*"service" + 0.006*"mci" + 0.006*"business" + 0.005*"stock" + 0.005*"jet" + 0.005*"growth"
05:32:32 INFO:topic #74 (0.010): 0.043*"china" + 0.020*"chinese" + 0.014*"official" + 0.014*"beijing" + 0.010*"trade" + 0.009*"state" + 0.007*"states" + 0.006*"united" + 0.006*"united_states" + 0.006*"import"
05:32:32 INFO:topic #53 (0.010): 0.032*"fund" + 0.012*"investment" + 0.011*"hong_kong" + 0.011*"hong" + 0.010*"stock" + 0.010*"management" + 0.010*"week" + 0.009*"manager" + 0.009*"billion" + 0.009*"kong"
05:32:32 INFO:topic diff=5.327576, rho=0.408248
05:32:32 INFO:PROGRESS: pass 5, at document #2500/2500
05:32:32 DEBUG:performing inference on a chunk of 2500 documents
05:32:36 DEBUG:2500/2500 documents converged within 50 iterations
05:32:36 DEBUG:updating topics
05:32:37 INFO:topic #60 (0.010): 0.002*"financial" + 0.002*"official" + 0.002*"stock" + 0.002*"policy" + 0.002*"group" + 0.002*"share" + 0.002*"china" + 0.001*"chinese" + 0.001*"beijing" + 0.001*"bank"
05:32:37 INFO:topic #77 (0.010): 0.006*"computer" + 0.005*"internet" + 0.005*"quarter" + 0.005*"analyst" + 0.004*"business" + 0.004*"share" + 0.004*"service" + 0.003*"profit" + 0.003*"industry" + 0.003*"system"
05:32:37 INFO:topic #43 (0.010): 0.003*"bre_x" + 0.003*"bre" + 0.002*"analyst" + 0.002*"gold" + 0.002*"barrick" + 0.002*"government" + 0.002*"Bre-X" + 0.002*"x" + 0.002*"share" + 0.002*"stock"
05:32:37 INFO:topic #10 (0.010): 0.002*"billion" + 0.001*"investment" + 0.001*"tonne" + 0.001*"quarter" + 0.001*"venture" + 0.001*"industry" + 0.001*"price" + 0.001*"cocoa" + 0.001*"coast" + 0.001*"month"
05:32:37 INFO:topic #99 (0.010): 0.005*"world" + 0.005*"czech" + 0.004*"analyst" + 0.004*"stock" + 0.004*"win" + 0.003*"billion" + 0.003*"team" + 0.003*"game" + 0.003*"bank" + 0.003*"second"
05:32:37 INFO:topic diff=4.560862, rho=0.377964
05:32:37 INFO:PROGRESS: pass 6, at document #2500/2500
05:32:37 DEBUG:performing inference on a chunk of 2500 documents
05:32:41 DEBUG:2500/2500 documents converged within 50 iterations
05:32:41 DEBUG:updating topics
05:32:41 INFO:topic #38 (0.010): 0.015*"analyst" + 0.014*"australian" + 0.013*"ltd" + 0.012*"share" + 0.011*"australia" + 0.011*"profit" + 0.010*"sydney" + 0.009*"group" + 0.009*"news" + 0.008*"corp"
05:32:41 INFO:topic #46 (0.010): 0.004*"hong" + 0.004*"kong" + 0.003*"china" + 0.002*"Hong Kong" + 0.002*"official" + 0.002*"hong_kong" + 0.002*"chinese" + 0.002*"united" + 0.002*"singapore" + 0.001*"month"
05:32:41 INFO:topic #97 (0.010): 0.021*"internet" + 0.017*"bank" + 0.008*"law" + 0.008*"court" + 0.008*"congress" + 0.007*"service" + 0.007*"credit" + 0.007*"allow" + 0.007*"bill" + 0.006*"policy"
05:32:41 INFO:topic #75 (0.010): 0.028*"bank" + 0.016*"japan" + 0.015*"billion" + 0.014*"yen" + 0.014*"financial" + 0.012*"loan" + 0.011*"japanese" + 0.010*"problem" + 0.010*"analyst" + 0.009*"firm"
05:32:41 INFO:topic #11 (0.010): 0.063*"gm" + 0.032*"plant" + 0.024*"strike" + 0.021*"automaker" + 0.021*"worker" + 0.017*"uaw" + 0.013*"truck" + 0.013*"local" + 0.013*"union" + 0.012*"chrysler"
05:32:41 INFO:topic diff=3.882969, rho=0.353553
05:32:41 INFO:PROGRESS: pass 7, at document #2500/2500
05:32:41 DEBUG:performing inference on a chunk of 2500 documents
05:32:46 DEBUG:2500/2500 documents converged within 50 iterations
05:32:46 DEBUG:updating topics
05:32:46 INFO:topic #82 (0.010): 0.003*"quarter" + 0.003*"executive" + 0.003*"internet" + 0.003*"high" + 0.003*"share" + 0.002*"loss" + 0.002*"technology" + 0.002*"high_tech" + 0.002*"stock" + 0.002*"software"
05:32:46 INFO:topic #38 (0.010): 0.015*"analyst" + 0.014*"australian" + 0.013*"ltd" + 0.012*"share" + 0.011*"australia" + 0.011*"profit" + 0.010*"sydney" + 0.009*"group" + 0.009*"news" + 0.008*"corp"
05:32:46 INFO:topic #9 (0.010): 0.001*"analyst" + 0.001*"government" + 0.001*"share" + 0.001*"china" + 0.001*"cost" + 0.001*"sale" + 0.001*"right" + 0.001*"stock" + 0.001*"big" + 0.001*"end"
05:32:46 INFO:topic #13 (0.010): 0.001*"china" + 0.001*"share" + 0.001*"official" + 0.001*"analyst" + 0.001*"group" + 0.001*"sale" + 0.001*"beijing" + 0.001*"party" + 0.001*"month" + 0.001*"billion"
05:32:46 INFO:topic #76 (0.010): 0.024*"cocoa" + 0.019*"exporter" + 0.019*"tonne" + 0.012*"ivory" + 0.012*"coast" + 0.012*"ivory_coast" + 0.011*"crop" + 0.011*"price" + 0.010*"buyer" + 0.009*"export"
05:32:46 INFO:topic diff=3.291750, rho=0.333333
05:32:46 INFO:PROGRESS: pass 8, at document #2500/2500
05:32:46 DEBUG:performing inference on a chunk of 2500 documents
05:32:50 DEBUG:2500/2500 documents converged within 50 iterations
05:32:50 DEBUG:updating topics
05:32:50 INFO:topic #18 (0.010): 0.001*"analyst" + 0.001*"business" + 0.001*"billion" + 0.001*"stock" + 0.001*"boeing" + 0.001*"quarter" + 0.001*"industry" + 0.001*"share" + 0.001*"shareholder" + 0.001*"sale"
05:32:50 INFO:topic #61 (0.010): 0.014*"analyst" + 0.014*"microsoft" + 0.010*"share" + 0.009*"software" + 0.009*"quarter" + 0.009*"boeing" + 0.009*"office" + 0.008*"computer" + 0.008*"worker" + 0.008*"fiscal"
05:32:50 INFO:topic #2 (0.010): 0.019*"mci" + 0.013*"analyst" + 0.011*"long" + 0.011*"share" + 0.011*"service" + 0.010*"distance" + 0.010*"long_distance" + 0.010*"billion" + 0.010*"corp" + 0.008*"local"
05:32:50 INFO:topic #98 (0.010): 0.031*"tonne" + 0.030*"china" + 0.019*"trader" + 0.018*"chinese" + 0.018*"price" + 0.016*"hong_kong" + 0.016*"hong" + 0.016*"kong" + 0.013*"source" + 0.013*"import"
05:32:50 INFO:topic #71 (0.010): 0.019*"fcc" + 0.014*"tv" + 0.014*"phone" + 0.013*"carrier" + 0.011*"local" + 0.010*"service" + 0.010*"long" + 0.009*"rule" + 0.009*"distance" + 0.009*"long_distance"
05:32:50 INFO:topic diff=2.781235, rho=0.316228
05:32:50 INFO:PROGRESS: pass 9, at document #2500/2500
05:32:50 DEBUG:performing inference on a chunk of 2500 documents
05:32:54 DEBUG:2500/2500 documents converged within 50 iterations
05:32:54 DEBUG:updating topics
05:32:55 INFO:topic #99 (0.010): 0.002*"world" + 0.002*"czech" + 0.002*"analyst" + 0.002*"stock" + 0.002*"win" + 0.001*"billion" + 0.001*"team" + 0.001*"game" + 0.001*"bank" + 0.001*"second"
05:32:55 INFO:topic #26 (0.010): 0.003*"business" + 0.002*"analyst" + 0.002*"gm" + 0.001*"share" + 0.001*"internet" + 0.001*"billion" + 0.001*"stock" + 0.001*"access" + 0.001*"chemical" + 0.001*"service"
05:32:55 INFO:topic #80 (0.010): 0.012*"analyst" + 0.012*"computer" + 0.009*"stock" + 0.009*"internet" + 0.008*"quarter" + 0.008*"technology" + 0.008*"service" + 0.007*"software" + 0.007*"share" + 0.007*"business"
05:32:55 INFO:topic #67 (0.010): 0.045*"gm" + 0.033*"plant" + 0.021*"uaw" + 0.017*"strike" + 0.017*"worker" + 0.012*"part" + 0.011*"local" + 0.010*"truck" + 0.010*"automaker" + 0.010*"contract"
05:32:55 INFO:topic #97 (0.010): 0.021*"internet" + 0.017*"bank" + 0.008*"law" + 0.008*"court" + 0.008*"congress" + 0.007*"service" + 0.007*"credit" + 0.007*"allow" + 0.007*"bill" + 0.006*"policy"
05:32:55 INFO:topic diff=2.344407, rho=0.301511
05:32:55 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=3914, num_topics=100, num_authors=50, decay=0.5, chunksize=2500)
05:32:55 INFO:CorpusAccumulator accumulated stats from 1000 documents
05:32:55 INFO:CorpusAccumulator accumulated stats from 2000 documents
-1.89056657258
In [16]:
accuracy_scores_100topic = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_50_20, atmodel_100topics, k=i)
    accuracy_scores_100topic[k] = accuracy
plot_accuracy(scores1=accuracy_scores_20topic, label1="20 topics", scores2=accuracy_scores_100topic, label2="100 topics")
Precision@k: top_n=1
Prediction accuracy: 0.5808
Precision@k: top_n=2
Prediction accuracy: 0.7472
Precision@k: top_n=3
Prediction accuracy: 0.8252
Precision@k: top_n=4
Prediction accuracy: 0.8732
Precision@k: top_n=5
Prediction accuracy: 0.8956
Precision@k: top_n=6
Prediction accuracy: 0.9072
Precision@k: top_n=8
Prediction accuracy: 0.9276
Precision@k: top_n=10
Prediction accuracy: 0.9412
The 100-topic model is much more accurate than the 20-topic model. We continue to increase the number of topics until the accuracy converges.
In [17]:
atmodel_150topics = train_model(train_corpus_50_20, train_author2doc, train_dictionary_50_20, num_topics=150, eval_every=0, iterations=50, passes=15)
05:36:37 INFO:Vocabulary consists of 3914 words.
05:36:37 INFO:using symmetric alpha at 0.006666666666666667
05:36:37 INFO:using symmetric eta at 0.006666666666666667
05:36:40 INFO:running online author-topic training, 150 topics, 50 authors, 15 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000
05:36:40 INFO:PROGRESS: pass 0, at document #2500/2500
05:36:40 DEBUG:performing inference on a chunk of 2500 documents
05:36:55 DEBUG:15/2500 documents converged within 50 iterations
05:36:55 DEBUG:updating topics
05:36:56 INFO:topic #51 (0.007): 0.015*"profit" + 0.012*"price" + 0.012*"group" + 0.012*"analyst" + 0.009*"share" + 0.009*"steel" + 0.008*"tell" + 0.008*"australian" + 0.007*"month" + 0.007*"forecast"
05:36:56 INFO:topic #86 (0.007): 0.011*"china" + 0.008*"kong" + 0.007*"hong" + 0.007*"cargo" + 0.006*"hong_kong" + 0.006*"service" + 0.006*"Hong Kong" + 0.005*"profit" + 0.005*"analyst" + 0.005*"month"
05:36:56 INFO:topic #125 (0.007): 0.009*"analyst" + 0.007*"share" + 0.007*"bank" + 0.005*"problem" + 0.004*"billion" + 0.004*"sale" + 0.004*"loan" + 0.004*"plant" + 0.004*"gm" + 0.004*"corp"
05:36:56 INFO:topic #4 (0.007): 0.020*"franc" + 0.018*"thomson" + 0.016*"french" + 0.011*"group" + 0.011*"share" + 0.010*"government" + 0.009*"plan" + 0.009*"france" + 0.009*"lagardere" + 0.009*"billion"
05:36:56 INFO:topic #114 (0.007): 0.006*"analyst" + 0.006*"sale" + 0.005*"chairman" + 0.005*"business" + 0.004*"social" + 0.004*"party" + 0.004*"month" + 0.004*"industry" + 0.003*"share" + 0.003*"government"
05:36:56 INFO:topic diff=43.566047, rho=1.000000
05:36:56 INFO:PROGRESS: pass 1, at document #2500/2500
05:36:56 DEBUG:performing inference on a chunk of 2500 documents
05:37:04 DEBUG:2493/2500 documents converged within 50 iterations
05:37:04 DEBUG:updating topics
05:37:04 INFO:topic #72 (0.007): 0.024*"gold" + 0.024*"bre_x" + 0.023*"x" + 0.023*"bre" + 0.020*"Bre-X" + 0.013*"barrick" + 0.010*"government" + 0.010*"indonesian" + 0.010*"busang" + 0.010*"analyst"
05:37:04 INFO:topic #133 (0.007): 0.013*"group" + 0.012*"pound" + 0.011*"share" + 0.009*"billion" + 0.006*"bt" + 0.006*"business" + 0.006*"analyst" + 0.005*"british" + 0.005*"profit" + 0.005*"britain"
05:37:04 INFO:topic #19 (0.007): 0.007*"billion" + 0.006*"group" + 0.006*"airbus" + 0.005*"state" + 0.005*"profit" + 0.005*"industry" + 0.005*"tobacco" + 0.005*"tell" + 0.004*"price" + 0.004*"cost"
05:37:04 INFO:topic #90 (0.007): 0.029*"bank" + 0.017*"canadian" + 0.016*"billion" + 0.014*"canada" + 0.010*"toronto" + 0.009*"analyst" + 0.008*"stock" + 0.008*"fund" + 0.008*"share" + 0.007*"high"
05:37:04 INFO:topic #91 (0.007): 0.008*"analyst" + 0.007*"bre" + 0.005*"bre_x" + 0.005*"x" + 0.005*"gm" + 0.005*"billion" + 0.005*"Bre-X" + 0.005*"stock" + 0.004*"sale" + 0.004*"share"
05:37:04 INFO:topic diff=12.489199, rho=0.577350
05:37:04 INFO:PROGRESS: pass 2, at document #2500/2500
05:37:04 DEBUG:performing inference on a chunk of 2500 documents
05:37:12 DEBUG:2497/2500 documents converged within 50 iterations
05:37:12 DEBUG:updating topics
05:37:12 INFO:topic #58 (0.007): 0.011*"shanghai" + 0.010*"china" + 0.005*"bank" + 0.005*"chinese" + 0.004*"city" + 0.004*"stock" + 0.004*"chen" + 0.003*"beijing" + 0.003*"analyst" + 0.003*"modern"
05:37:12 INFO:topic #26 (0.007): 0.010*"business" + 0.005*"analyst" + 0.005*"share" + 0.004*"billion" + 0.004*"stock" + 0.004*"continue" + 0.003*"states" + 0.003*"chemical" + 0.003*"united" + 0.003*"internet"
05:37:12 INFO:topic #61 (0.007): 0.027*"boeing" + 0.014*"analyst" + 0.013*"billion" + 0.012*"microsoft" + 0.012*"jet" + 0.010*"airbus" + 0.009*"share" + 0.009*"order" + 0.008*"mcdonnell" + 0.007*"revenue"
05:37:12 INFO:topic #149 (0.007): 0.018*"china" + 0.010*"official" + 0.010*"chinese" + 0.008*"beijing" + 0.006*"trade" + 0.006*"world" + 0.005*"foreign" + 0.005*"united_states" + 0.005*"drug" + 0.005*"metre"
05:37:12 INFO:topic #74 (0.007): 0.044*"china" + 0.022*"chinese" + 0.012*"official" + 0.012*"tonne" + 0.011*"beijing" + 0.008*"trade" + 0.008*"import" + 0.008*"trader" + 0.007*"price" + 0.007*"state"
05:37:12 INFO:topic diff=10.945011, rho=0.500000
05:37:12 INFO:PROGRESS: pass 3, at document #2500/2500
05:37:12 DEBUG:performing inference on a chunk of 2500 documents
05:37:19 DEBUG:2499/2500 documents converged within 50 iterations
05:37:19 DEBUG:updating topics
05:37:19 INFO:topic #125 (0.007): 0.004*"analyst" + 0.003*"share" + 0.003*"bank" + 0.002*"problem" + 0.002*"billion" + 0.002*"sale" + 0.002*"loan" + 0.002*"plant" + 0.002*"gm" + 0.002*"corp"
05:37:19 INFO:topic #95 (0.007): 0.038*"bank" + 0.018*"billion" + 0.016*"society" + 0.010*"analyst" + 0.009*"debt" + 0.009*"eurotunnel" + 0.008*"banking" + 0.008*"pound" + 0.008*"member" + 0.007*"convert"
05:37:19 INFO:topic #19 (0.007): 0.007*"billion" + 0.006*"state" + 0.005*"group" + 0.005*"airbus" + 0.005*"loss" + 0.005*"cost" + 0.005*"profit" + 0.005*"industry" + 0.005*"sale" + 0.004*"executive"
05:37:19 INFO:topic #115 (0.007): 0.003*"share" + 0.002*"stock" + 0.002*"billion" + 0.002*"analyst" + 0.002*"china" + 0.002*"industry" + 0.001*"month" + 0.001*"rise" + 0.001*"big" + 0.001*"deal"
05:37:20 INFO:topic #29 (0.007): 0.018*"czech" + 0.008*"klaus" + 0.007*"government" + 0.007*"crown" + 0.007*"party" + 0.007*"bank" + 0.007*"prague" + 0.005*"country" + 0.005*"foreign" + 0.005*"election"
05:37:20 INFO:topic diff=9.415271, rho=0.447214
05:37:20 INFO:PROGRESS: pass 4, at document #2500/2500
05:37:20 DEBUG:performing inference on a chunk of 2500 documents
05:37:26 DEBUG:2499/2500 documents converged within 50 iterations
05:37:26 DEBUG:updating topics
05:37:27 INFO:topic #31 (0.007): 0.010*"franc" + 0.009*"french" + 0.008*"china" + 0.008*"billion" + 0.006*"shanghai" + 0.006*"analyst" + 0.006*"share" + 0.005*"government" + 0.005*"plan" + 0.005*"exchange"
05:37:27 INFO:topic #76 (0.007): 0.007*"china" + 0.006*"hong_kong" + 0.006*"price" + 0.005*"kong" + 0.005*"hong" + 0.004*"tonne" + 0.004*"world" + 0.004*"analyst" + 0.003*"chinese" + 0.003*"Hong Kong"
05:37:27 INFO:topic #36 (0.007): 0.024*"bid" + 0.021*"penny" + 0.020*"analyst" + 0.017*"share" + 0.015*"electric" + 0.013*"electricity" + 0.012*"offer" + 0.012*"price" + 0.011*"northern" + 0.010*"water"
05:37:27 INFO:topic #144 (0.007): 0.008*"computer" + 0.007*"software" + 0.006*"technology" + 0.006*"internet" + 0.005*"web" + 0.004*"site" + 0.004*"people" + 0.004*"quarter" + 0.004*"industry" + 0.004*"base"
05:37:27 INFO:topic #96 (0.007): 0.013*"tv" + 0.011*"industry" + 0.010*"group" + 0.010*"system" + 0.008*"television" + 0.008*"plan" + 0.008*"service" + 0.007*"rating" + 0.006*"american" + 0.006*"long"
05:37:27 INFO:topic diff=8.020445, rho=0.408248
05:37:27 INFO:PROGRESS: pass 5, at document #2500/2500
05:37:27 DEBUG:performing inference on a chunk of 2500 documents
05:37:33 DEBUG:2500/2500 documents converged within 50 iterations
05:37:33 DEBUG:updating topics
05:37:33 INFO:topic #128 (0.007): 0.019*"ford" + 0.017*"gm" + 0.015*"sale" + 0.015*"plant" + 0.011*"car" + 0.011*"vehicle" + 0.008*"chrysler" + 0.008*"worker" + 0.008*"automaker" + 0.007*"truck"
05:37:33 INFO:topic #82 (0.007): 0.006*"china" + 0.004*"tonne" + 0.004*"chinese" + 0.003*"trader" + 0.003*"copper" + 0.002*"price" + 0.002*"source" + 0.002*"kong" + 0.002*"shanghai" + 0.002*"metal"
05:37:33 INFO:topic #96 (0.007): 0.013*"tv" + 0.011*"industry" + 0.010*"group" + 0.010*"system" + 0.008*"plan" + 0.008*"television" + 0.008*"service" + 0.007*"rating" + 0.007*"american" + 0.006*"long"
05:37:33 INFO:topic #41 (0.007): 0.016*"australian" + 0.014*"bank" + 0.013*"profit" + 0.013*"share" + 0.013*"news" + 0.013*"sydney" + 0.013*"australia" + 0.011*"ltd" + 0.011*"analyst" + 0.011*"corp"
05:37:33 INFO:topic #79 (0.007): 0.006*"china" + 0.006*"beijing" + 0.004*"official" + 0.004*"lama" + 0.004*"tibet" + 0.004*"chinese" + 0.003*"region" + 0.003*"dalai_lama" + 0.003*"share" + 0.003*"analyst"
05:37:33 INFO:topic diff=6.797042, rho=0.377964
05:37:33 INFO:PROGRESS: pass 6, at document #2500/2500
05:37:33 DEBUG:performing inference on a chunk of 2500 documents
05:37:40 DEBUG:2500/2500 documents converged within 50 iterations
05:37:40 DEBUG:updating topics
05:37:40 INFO:topic #76 (0.007): 0.005*"china" + 0.004*"hong_kong" + 0.004*"price" + 0.003*"kong" + 0.003*"hong" + 0.003*"tonne" + 0.003*"world" + 0.003*"analyst" + 0.002*"chinese" + 0.002*"Hong Kong"
05:37:40 INFO:topic #31 (0.007): 0.008*"franc" + 0.007*"french" + 0.006*"china" + 0.006*"billion" + 0.005*"shanghai" + 0.005*"analyst" + 0.004*"share" + 0.004*"government" + 0.004*"plan" + 0.004*"exchange"
05:37:40 INFO:topic #140 (0.007): 0.026*"china" + 0.020*"beijing" + 0.016*"chinese" + 0.013*"official" + 0.010*"wang" + 0.006*"foreign" + 0.005*"right" + 0.005*"human" + 0.005*"washington" + 0.005*"state"
05:37:40 INFO:topic #46 (0.007): 0.004*"hong" + 0.004*"kong" + 0.003*"china" + 0.003*"Hong Kong" + 0.002*"official" + 0.002*"hong_kong" + 0.002*"chinese" + 0.002*"singapore" + 0.002*"united" + 0.002*"plan"
05:37:40 INFO:topic #148 (0.007): 0.001*"network" + 0.001*"analyst" + 0.001*"stock" + 0.001*"share" + 0.001*"price" + 0.001*"remote" + 0.001*"recent" + 0.001*"industry" + 0.001*"chinese" + 0.001*"billion"
05:37:40 INFO:topic diff=5.743502, rho=0.353553
05:37:40 INFO:PROGRESS: pass 7, at document #2500/2500
05:37:40 DEBUG:performing inference on a chunk of 2500 documents
05:37:46 DEBUG:2500/2500 documents converged within 50 iterations
05:37:46 DEBUG:updating topics
05:37:47 INFO:topic #99 (0.007): 0.003*"czech" + 0.003*"world" + 0.003*"team" + 0.002*"win" + 0.002*"game" + 0.002*"play" + 0.002*"second" + 0.002*"stock" + 0.002*"billion" + 0.002*"end"
05:37:47 INFO:topic #81 (0.007): 0.003*"pound" + 0.002*"share" + 0.002*"profit" + 0.002*"million_pound" + 0.002*"sale" + 0.001*"business" + 0.001*"analyst" + 0.001*"rise" + 0.001*"group" + 0.001*"fall"
05:37:47 INFO:topic #97 (0.007): 0.001*"analyst" + 0.001*"business" + 0.001*"china" + 0.001*"internet" + 0.001*"stock" + 0.001*"sale" + 0.001*"service" + 0.001*"chairman" + 0.001*"continue" + 0.001*"base"
05:37:47 INFO:topic #28 (0.007): 0.000*"large" + 0.000*"share" + 0.000*"stock" + 0.000*"property" + 0.000*"analyst" + 0.000*"taiwan" + 0.000*"china" + 0.000*"bank" + 0.000*"news" + 0.000*"billion"
05:37:47 INFO:topic #93 (0.007): 0.020*"ibm" + 0.018*"internet" + 0.016*"computer" + 0.013*"pc" + 0.012*"service" + 0.011*"analyst" + 0.009*"industry" + 0.009*"software" + 0.009*"quarter" + 0.009*"consumer"
05:37:47 INFO:topic diff=4.844379, rho=0.333333
05:37:47 INFO:PROGRESS: pass 8, at document #2500/2500
05:37:47 DEBUG:performing inference on a chunk of 2500 documents
05:37:53 DEBUG:2500/2500 documents converged within 50 iterations
05:37:53 DEBUG:updating topics
05:37:54 INFO:topic #97 (0.007): 0.001*"analyst" + 0.001*"business" + 0.001*"china" + 0.001*"internet" + 0.001*"stock" + 0.001*"sale" + 0.001*"service" + 0.001*"chairman" + 0.001*"continue" + 0.001*"base"
05:37:54 INFO:topic #77 (0.007): 0.002*"computer" + 0.002*"internet" + 0.002*"quarter" + 0.002*"business" + 0.002*"service" + 0.002*"analyst" + 0.001*"share" + 0.001*"system" + 0.001*"cost" + 0.001*"industry"
05:37:54 INFO:topic #70 (0.007): 0.024*"share" + 0.023*"shanghai" + 0.021*"china" + 0.015*"bank" + 0.013*"b" + 0.013*"analyst" + 0.012*"foreign" + 0.010*"exchange" + 0.010*"investor" + 0.010*"stock"
05:37:54 INFO:topic #26 (0.007): 0.003*"business" + 0.002*"analyst" + 0.001*"share" + 0.001*"billion" + 0.001*"stock" + 0.001*"continue" + 0.001*"states" + 0.001*"chemical" + 0.001*"united" + 0.001*"internet"
05:37:54 INFO:topic #145 (0.007): 0.019*"analyst" + 0.015*"sale" + 0.013*"share" + 0.012*"quarter" + 0.009*"business" + 0.008*"base" + 0.007*"earning" + 0.007*"stock" + 0.007*"drug" + 0.006*"amp"
05:37:54 INFO:topic diff=4.081138, rho=0.316228
05:37:54 INFO:PROGRESS: pass 9, at document #2500/2500
05:37:54 DEBUG:performing inference on a chunk of 2500 documents
05:38:00 DEBUG:2500/2500 documents converged within 50 iterations
05:38:00 DEBUG:updating topics
05:38:00 INFO:topic #116 (0.007): 0.003*"x" + 0.002*"bre" + 0.002*"analyst" + 0.002*"Bre-X" + 0.002*"bre_x" + 0.001*"government" + 0.001*"barrick" + 0.001*"gold" + 0.001*"mining" + 0.001*"indonesian"
05:38:00 INFO:topic #67 (0.007): 0.002*"hong" + 0.002*"china" + 0.002*"kong" + 0.001*"hong_kong" + 0.001*"Hong Kong" + 0.001*"beijing" + 0.001*"legislature" + 0.001*"rule" + 0.001*"chinese" + 0.001*"plan"
05:38:00 INFO:topic #69 (0.007): 0.001*"tibet" + 0.001*"chen" + 0.001*"dalai_lama" + 0.001*"china" + 0.001*"beijing" + 0.001*"group" + 0.001*"dalai" + 0.000*"lama" + 0.000*"billion" + 0.000*"region"
05:38:00 INFO:topic #146 (0.007): 0.001*"hong_kong" + 0.001*"analyst" + 0.000*"share" + 0.000*"hong" + 0.000*"kong" + 0.000*"china" + 0.000*"news" + 0.000*"Hong Kong" + 0.000*"billion" + 0.000*"price"
05:38:00 INFO:topic #66 (0.007): 0.001*"bank" + 0.001*"china" + 0.000*"hong" + 0.000*"government" + 0.000*"hong_kong" + 0.000*"plan" + 0.000*"bre" + 0.000*"x" + 0.000*"kong" + 0.000*"financial"
05:38:00 INFO:topic diff=3.435844, rho=0.301511
05:38:00 INFO:PROGRESS: pass 10, at document #2500/2500
05:38:00 DEBUG:performing inference on a chunk of 2500 documents
05:38:06 DEBUG:2500/2500 documents converged within 50 iterations
05:38:06 DEBUG:updating topics
05:38:06 INFO:topic #80 (0.007): 0.019*"analyst" + 0.015*"microsoft" + 0.013*"quarter" + 0.011*"business" + 0.010*"computer" + 0.009*"sale" + 0.008*"revenue" + 0.008*"windows" + 0.008*"share" + 0.008*"system"
05:38:06 INFO:topic #44 (0.007): 0.001*"sale" + 0.001*"china" + 0.001*"analyst" + 0.001*"share" + 0.001*"service" + 0.000*"plan" + 0.000*"bank" + 0.000*"deal" + 0.000*"billion" + 0.000*"world"
05:38:06 INFO:topic #58 (0.007): 0.001*"shanghai" + 0.001*"china" + 0.001*"bank" + 0.001*"chinese" + 0.000*"city" + 0.000*"stock" + 0.000*"chen" + 0.000*"beijing" + 0.000*"analyst" + 0.000*"modern"
05:38:06 INFO:topic #36 (0.007): 0.022*"penny" + 0.022*"bid" + 0.021*"analyst" + 0.018*"share" + 0.014*"electric" + 0.012*"price" + 0.012*"electricity" + 0.012*"offer" + 0.012*"pound" + 0.011*"northern"
05:38:06 INFO:topic #143 (0.007): 0.018*"mci" + 0.012*"analyst" + 0.012*"allen" + 0.011*"long" + 0.011*"billion" + 0.011*"distance" + 0.010*"long_distance" + 0.009*"stock" + 0.009*"share" + 0.009*"executive"
05:38:06 INFO:topic diff=2.892181, rho=0.288675
05:38:06 INFO:PROGRESS: pass 11, at document #2500/2500
05:38:06 DEBUG:performing inference on a chunk of 2500 documents
05:38:12 DEBUG:2500/2500 documents converged within 50 iterations
05:38:12 DEBUG:updating topics
05:38:13 INFO:topic #141 (0.007): 0.006*"internet" + 0.005*"bank" + 0.003*"law" + 0.003*"congress" + 0.003*"court" + 0.002*"service" + 0.002*"export" + 0.002*"security" + 0.002*"member" + 0.002*"credit"
05:38:13 INFO:topic #33 (0.007): 0.000*"billion" + 0.000*"service" + 0.000*"plan" + 0.000*"industry" + 0.000*"china" + 0.000*"internet" + 0.000*"tonne" + 0.000*"price" + 0.000*"chinese" + 0.000*"share"
05:38:13 INFO:topic #49 (0.007): 0.001*"eurotunnel" + 0.001*"service" + 0.000*"billion" + 0.000*"share" + 0.000*"fire" + 0.000*"pound" + 0.000*"tunnel" + 0.000*"debt" + 0.000*"group" + 0.000*"financial"
05:38:13 INFO:topic #116 (0.007): 0.002*"x" + 0.001*"bre" + 0.001*"analyst" + 0.001*"Bre-X" + 0.001*"bre_x" + 0.001*"government" + 0.001*"barrick" + 0.001*"gold" + 0.001*"mining" + 0.001*"indonesian"
05:38:13 INFO:topic #83 (0.007): 0.001*"beijing" + 0.001*"chinese" + 0.001*"billion" + 0.001*"profit" + 0.001*"tell" + 0.001*"china" + 0.001*"bank" + 0.001*"analyst" + 0.001*"australian" + 0.001*"share"
05:38:13 INFO:topic diff=2.435570, rho=0.277350
05:38:13 INFO:PROGRESS: pass 12, at document #2500/2500
05:38:13 DEBUG:performing inference on a chunk of 2500 documents
05:38:19 DEBUG:2500/2500 documents converged within 50 iterations
05:38:19 DEBUG:updating topics
05:38:19 INFO:topic #138 (0.007): 0.000*"china" + 0.000*"beijing" + 0.000*"share" + 0.000*"states" + 0.000*"news" + 0.000*"analyst" + 0.000*"long" + 0.000*"chinese" + 0.000*"trade" + 0.000*"the United States"
05:38:19 INFO:topic #125 (0.007): 0.000*"analyst" + 0.000*"share" + 0.000*"bank" + 0.000*"problem" + 0.000*"billion" + 0.000*"sale" + 0.000*"loan" + 0.000*"plant" + 0.000*"gm" + 0.000*"corp"
05:38:19 INFO:topic #21 (0.007): 0.024*"stock" + 0.021*"toronto" + 0.018*"share" + 0.017*"bank" + 0.016*"canada" + 0.013*"gold" + 0.012*"billion" + 0.012*"index" + 0.011*"close" + 0.010*"point"
05:38:19 INFO:topic #5 (0.007): 0.002*"china" + 0.001*"beijing" + 0.001*"chen" + 0.001*"official" + 0.001*"trade" + 0.001*"economic" + 0.001*"chinese" + 0.001*"party" + 0.001*"survey" + 0.001*"month"
05:38:19 INFO:topic #31 (0.007): 0.002*"franc" + 0.002*"french" + 0.002*"china" + 0.002*"billion" + 0.001*"shanghai" + 0.001*"analyst" + 0.001*"share" + 0.001*"government" + 0.001*"plan" + 0.001*"exchange"
05:38:19 INFO:topic diff=2.053082, rho=0.267261
05:38:19 INFO:PROGRESS: pass 13, at document #2500/2500
05:38:19 DEBUG:performing inference on a chunk of 2500 documents
05:38:26 DEBUG:2500/2500 documents converged within 50 iterations
05:38:26 DEBUG:updating topics
05:38:26 INFO:topic #3 (0.007): 0.000*"pound" + 0.000*"billion" + 0.000*"share" + 0.000*"group" + 0.000*"british" + 0.000*"deal" + 0.000*"bank" + 0.000*"sale" + 0.000*"mci" + 0.000*"service"
05:38:26 INFO:topic #42 (0.007): 0.000*"technology" + 0.000*"russia" + 0.000*"computer" + 0.000*"industry" + 0.000*"russian" + 0.000*"internet" + 0.000*"analyst" + 0.000*"world" + 0.000*"price" + 0.000*"software"
05:38:26 INFO:topic #104 (0.007): 0.044*"cent" + 0.043*"bank" + 0.028*"cent_share" + 0.022*"league" + 0.019*"football" + 0.016*"share" + 0.014*"card" + 0.013*"earning" + 0.012*"cos" + 0.011*"canada"
05:38:26 INFO:topic #69 (0.007): 0.000*"tibet" + 0.000*"chen" + 0.000*"dalai_lama" + 0.000*"china" + 0.000*"beijing" + 0.000*"group" + 0.000*"dalai" + 0.000*"lama" + 0.000*"billion" + 0.000*"region"
05:38:26 INFO:topic #115 (0.007): 0.000*"share" + 0.000*"stock" + 0.000*"billion" + 0.000*"analyst" + 0.000*"china" + 0.000*"industry" + 0.000*"month" + 0.000*"rise" + 0.000*"big" + 0.000*"deal"
05:38:26 INFO:topic diff=1.733322, rho=0.258199
05:38:26 INFO:PROGRESS: pass 14, at document #2500/2500
05:38:26 DEBUG:performing inference on a chunk of 2500 documents
05:38:32 DEBUG:2500/2500 documents converged within 50 iterations
05:38:32 DEBUG:updating topics
05:38:33 INFO:topic #82 (0.007): 0.001*"china" + 0.000*"tonne" + 0.000*"chinese" + 0.000*"trader" + 0.000*"copper" + 0.000*"price" + 0.000*"source" + 0.000*"kong" + 0.000*"shanghai" + 0.000*"metal"
05:38:33 INFO:topic #122 (0.007): 0.000*"share" + 0.000*"analyst" + 0.000*"quarter" + 0.000*"pc" + 0.000*"computer" + 0.000*"ibm" + 0.000*"profit" + 0.000*"service" + 0.000*"industry" + 0.000*"compaq"
05:38:33 INFO:topic #126 (0.007): 0.000*"group" + 0.000*"europe" + 0.000*"plan" + 0.000*"air" + 0.000*"model" + 0.000*"pound" + 0.000*"hong" + 0.000*"month" + 0.000*"japan" + 0.000*"Hong Kong"
05:38:33 INFO:topic #1 (0.007): 0.005*"bank" + 0.004*"stock" + 0.004*"billion" + 0.004*"japan" + 0.003*"analyst" + 0.003*"financial" + 0.003*"asset" + 0.003*"japanese" + 0.002*"big" + 0.002*"yen"
05:38:33 INFO:topic #120 (0.007): 0.022*"stiff" + 0.019*"court" + 0.018*"rating" + 0.018*"frequently" + 0.012*"mercury" + 0.012*"williams" + 0.012*"remove" + 0.011*"judge" + 0.011*"armed" + 0.009*"ford"
05:38:33 INFO:topic diff=1.466340, rho=0.250000
05:38:33 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=3914, num_topics=150, num_authors=50, decay=0.5, chunksize=2500)
05:38:33 INFO:CorpusAccumulator accumulated stats from 1000 documents
05:38:33 INFO:CorpusAccumulator accumulated stats from 2000 documents
-1.90810257282
In [18]:
accuracy_scores_150topic = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_50_20, atmodel_150topics, k=i)
    accuracy_scores_150topic[k] = accuracy
plot_accuracy(scores1=accuracy_scores_100topic, label1="100 topics", scores2=accuracy_scores_150topic, label2="150 topics")
Precision@k: top_n=1
Prediction accuracy: 0.6004
Precision@k: top_n=2
Prediction accuracy: 0.7632
Precision@k: top_n=3
Prediction accuracy: 0.8452
Precision@k: top_n=4
Prediction accuracy: 0.8796
Precision@k: top_n=5
Prediction accuracy: 0.8988
Precision@k: top_n=6
Prediction accuracy: 0.914
Precision@k: top_n=8
Prediction accuracy: 0.9324
Precision@k: top_n=10
Prediction accuracy: 0.9464
The 150-topic model is again slightly better, especially at the lower end of k, but the accuracy is clearly converging. We try 200 topics to be sure.
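Since we now have per-k accuracy dictionaries for several models, it can be convenient to line them up in one table, for example with pandas (an optional sketch, not part of the original evaluation code):

import pandas as pd

# One column per model size, indexed by k.
comparison = pd.DataFrame({
    '20 topics': accuracy_scores_20topic,
    '100 topics': accuracy_scores_100topic,
    '150 topics': accuracy_scores_150topic,
})
print(comparison)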
In [19]:
atmodel_200topics = train_model(train_corpus_50_20, train_author2doc, train_dictionary_50_20, num_topics=200, eval_every=0, iterations=50, passes=15)
05:43:01 INFO:Vocabulary consists of 3914 words.
05:43:01 INFO:using symmetric alpha at 0.005
05:43:01 INFO:using symmetric eta at 0.005
05:43:05 INFO:running online author-topic training, 200 topics, 50 authors, 15 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000
05:43:05 INFO:PROGRESS: pass 0, at document #2500/2500
05:43:05 DEBUG:performing inference on a chunk of 2500 documents
05:43:25 DEBUG:2/2500 documents converged within 50 iterations
05:43:25 DEBUG:updating topics
05:43:26 INFO:topic #198 (0.005): 0.006*"plant" + 0.006*"analyst" + 0.006*"gm" + 0.005*"sale" + 0.005*"group" + 0.005*"service" + 0.004*"share" + 0.004*"internet" + 0.004*"plan" + 0.004*"uaw"
05:43:26 INFO:topic #186 (0.005): 0.009*"pound" + 0.007*"quarter" + 0.007*"analyst" + 0.007*"share" + 0.006*"group" + 0.006*"business" + 0.005*"million_pound" + 0.005*"software" + 0.005*"sale" + 0.005*"industry"
05:43:26 INFO:topic #188 (0.005): 0.011*"share" + 0.010*"analyst" + 0.008*"billion" + 0.007*"sale" + 0.007*"stock" + 0.006*"mci" + 0.006*"business" + 0.005*"british" + 0.005*"quarter" + 0.005*"deal"
05:43:26 INFO:topic #162 (0.005): 0.012*"analyst" + 0.008*"business" + 0.007*"quarter" + 0.006*"share" + 0.006*"industry" + 0.006*"sale" + 0.005*"base" + 0.005*"billion" + 0.005*"high" + 0.005*"price"
05:43:26 INFO:topic #22 (0.005): 0.039*"bank" + 0.011*"rate" + 0.011*"day" + 0.011*"cut" + 0.010*"analyst" + 0.008*"australia" + 0.008*"profit" + 0.008*"financial" + 0.007*"ltd" + 0.007*"merger"
05:43:26 INFO:topic diff=65.500588, rho=1.000000
05:43:26 INFO:PROGRESS: pass 1, at document #2500/2500
05:43:26 DEBUG:performing inference on a chunk of 2500 documents
05:43:36 DEBUG:2494/2500 documents converged within 50 iterations
05:43:36 DEBUG:updating topics
05:43:37 INFO:topic #77 (0.005): 0.013*"internet" + 0.011*"computer" + 0.009*"business" + 0.009*"quarter" + 0.009*"service" + 0.008*"revenue" + 0.007*"analyst" + 0.007*"cost" + 0.007*"industry" + 0.006*"compaq"
05:43:37 INFO:topic #25 (0.005): 0.010*"share" + 0.007*"analyst" + 0.007*"service" + 0.007*"business" + 0.006*"growth" + 0.006*"mci" + 0.006*"billion" + 0.006*"long" + 0.005*"distance" + 0.005*"stock"
05:43:37 INFO:topic #133 (0.005): 0.011*"group" + 0.011*"share" + 0.010*"pound" + 0.009*"billion" + 0.007*"profit" + 0.007*"business" + 0.006*"sale" + 0.005*"good" + 0.005*"bank" + 0.005*"analyst"
05:43:37 INFO:topic #180 (0.005): 0.011*"billion" + 0.006*"venture" + 0.006*"quarter" + 0.005*"investment" + 0.005*"industry" + 0.005*"analyst" + 0.004*"price" + 0.003*"group" + 0.003*"high" + 0.003*"rise"
05:43:37 INFO:topic #1 (0.005): 0.018*"japan" + 0.014*"japanese" + 0.013*"billion" + 0.011*"yen" + 0.011*"stock" + 0.011*"bank" + 0.010*"life" + 0.010*"financial" + 0.010*"big" + 0.008*"profit"
05:43:37 INFO:topic diff=17.080447, rho=0.577350
05:43:37 INFO:PROGRESS: pass 2, at document #2500/2500
05:43:37 DEBUG:performing inference on a chunk of 2500 documents
05:43:46 DEBUG:2499/2500 documents converged within 50 iterations
05:43:46 DEBUG:updating topics
05:43:47 INFO:topic #92 (0.005): 0.013*"analyst" + 0.011*"share" + 0.007*"sale" + 0.006*"profit" + 0.005*"pound" + 0.004*"high" + 0.004*"revenue" + 0.004*"quarter" + 0.004*"billion" + 0.004*"cent"
05:43:47 INFO:topic #117 (0.005): 0.015*"access" + 0.012*"local" + 0.011*"internet" + 0.011*"fee" + 0.010*"distance" + 0.010*"long" + 0.008*"long_distance" + 0.008*"service" + 0.007*"issue" + 0.006*"provider"
05:43:47 INFO:topic #81 (0.005): 0.011*"pound" + 0.010*"profit" + 0.009*"share" + 0.008*"sale" + 0.007*"million_pound" + 0.006*"analyst" + 0.006*"business" + 0.006*"rise" + 0.005*"group" + 0.005*"fall"
05:43:47 INFO:topic #97 (0.005): 0.025*"internet" + 0.018*"bill" + 0.014*"administration" + 0.014*"product" + 0.012*"key" + 0.011*"policy" + 0.011*"export" + 0.010*"law" + 0.008*"access" + 0.008*"bank"
05:43:47 INFO:topic #24 (0.005): 0.005*"crop" + 0.005*"price" + 0.005*"share" + 0.004*"tonne" + 0.004*"analyst" + 0.004*"exporter" + 0.004*"cocoa" + 0.003*"ivory_coast" + 0.003*"government" + 0.003*"reuters"
05:43:47 INFO:topic diff=14.773285, rho=0.500000
05:43:47 INFO:PROGRESS: pass 3, at document #2500/2500
05:43:47 DEBUG:performing inference on a chunk of 2500 documents
05:43:56 DEBUG:2499/2500 documents converged within 50 iterations
05:43:56 DEBUG:updating topics
05:43:56 INFO:topic #16 (0.005): 0.003*"group" + 0.003*"billion" + 0.002*"gm" + 0.002*"hong_kong" + 0.002*"pound" + 0.002*"kong" + 0.002*"china" + 0.002*"bid" + 0.002*"analyst" + 0.002*"hong"
05:43:56 INFO:topic #133 (0.005): 0.012*"group" + 0.011*"pound" + 0.010*"share" + 0.008*"billion" + 0.007*"business" + 0.006*"profit" + 0.005*"good" + 0.005*"sale" + 0.005*"add" + 0.005*"cost"
05:43:56 INFO:topic #23 (0.005): 0.017*"boeing" + 0.010*"billion" + 0.009*"analyst" + 0.006*"share" + 0.006*"microsoft" + 0.006*"industry" + 0.005*"quarter" + 0.005*"jet" + 0.005*"windows" + 0.005*"mcdonnell"
05:43:56 INFO:topic #156 (0.005): 0.005*"analyst" + 0.004*"bank" + 0.003*"share" + 0.003*"service" + 0.003*"internet" + 0.002*"china" + 0.002*"plan" + 0.002*"profit" + 0.002*"billion" + 0.002*"cost"
05:43:56 INFO:topic #131 (0.005): 0.024*"analyst" + 0.017*"share" + 0.013*"price" + 0.011*"business" + 0.011*"penny" + 0.009*"bid" + 0.007*"electric" + 0.006*"offer" + 0.006*"add" + 0.006*"northern"
05:43:56 INFO:topic diff=12.542799, rho=0.447214
05:43:56 INFO:PROGRESS: pass 4, at document #2500/2500
05:43:56 DEBUG:performing inference on a chunk of 2500 documents
05:44:05 DEBUG:2500/2500 documents converged within 50 iterations
05:44:05 DEBUG:updating topics
05:44:05 INFO:topic #92 (0.005): 0.009*"analyst" + 0.008*"share" + 0.005*"sale" + 0.004*"profit" + 0.003*"pound" + 0.003*"high" + 0.003*"revenue" + 0.003*"quarter" + 0.003*"billion" + 0.003*"cent"
05:44:05 INFO:topic #86 (0.005): 0.029*"cargo" + 0.021*"kong" + 0.020*"hong" + 0.020*"hong_kong" + 0.016*"air" + 0.015*"Hong Kong" + 0.015*"airline" + 0.009*"service" + 0.009*"route" + 0.009*"airport"
05:44:05 INFO:topic #80 (0.005): 0.017*"analyst" + 0.014*"microsoft" + 0.013*"quarter" + 0.010*"computer" + 0.010*"business" + 0.009*"windows" + 0.008*"revenue" + 0.008*"internet" + 0.007*"system" + 0.007*"sale"
05:44:05 INFO:topic #23 (0.005): 0.015*"boeing" + 0.009*"billion" + 0.008*"analyst" + 0.006*"share" + 0.005*"microsoft" + 0.005*"industry" + 0.005*"quarter" + 0.005*"jet" + 0.004*"windows" + 0.004*"mcdonnell"
05:44:05 INFO:topic #177 (0.005): 0.011*"investment" + 0.010*"pound" + 0.010*"group" + 0.008*"cable" + 0.008*"british" + 0.007*"fleming" + 0.007*"management" + 0.006*"fund" + 0.006*"share" + 0.006*"merger"
05:44:05 INFO:topic diff=10.561281, rho=0.408248
05:44:05 INFO:PROGRESS: pass 5, at document #2500/2500
05:44:05 DEBUG:performing inference on a chunk of 2500 documents
05:44:14 DEBUG:2500/2500 documents converged within 50 iterations
05:44:14 DEBUG:updating topics
05:44:15 INFO:topic #61 (0.005): 0.038*"boeing" + 0.017*"billion" + 0.016*"jet" + 0.013*"analyst" + 0.012*"mcdonnell" + 0.011*"microsoft" + 0.010*"airbus" + 0.010*"order" + 0.010*"douglas" + 0.009*"share"
05:44:15 INFO:topic #73 (0.005): 0.001*"bank" + 0.001*"china" + 0.001*"group" + 0.001*"big" + 0.001*"analyst" + 0.001*"sale" + 0.001*"shanghai" + 0.001*"deal" + 0.001*"gm" + 0.001*"pound"
05:44:15 INFO:topic #147 (0.005): 0.012*"czech" + 0.011*"crown" + 0.010*"week" + 0.009*"analyst" + 0.009*"point" + 0.008*"investor" + 0.007*"round" + 0.007*"prague" + 0.007*"billion" + 0.006*"second"
05:44:15 INFO:topic #50 (0.005): 0.006*"british" + 0.006*"telecom" + 0.006*"deal" + 0.005*"analyst" + 0.005*"drug" + 0.004*"share" + 0.004*"mci" + 0.004*"billion" + 0.004*"group" + 0.003*"sale"
05:44:15 INFO:topic #99 (0.005): 0.003*"stock" + 0.002*"business" + 0.002*"share" + 0.002*"analyst" + 0.002*"end" + 0.002*"day" + 0.002*"sale" + 0.002*"world" + 0.002*"billion" + 0.001*"quarter"
05:44:15 INFO:topic diff=8.863923, rho=0.377964
05:44:15 INFO:PROGRESS: pass 6, at document #2500/2500
05:44:15 DEBUG:performing inference on a chunk of 2500 documents
05:44:23 DEBUG:2500/2500 documents converged within 50 iterations
05:44:23 DEBUG:updating topics
05:44:24 INFO:topic #70 (0.005): 0.011*"stock" + 0.010*"shanghai" + 0.008*"share" + 0.008*"exchange" + 0.007*"trading" + 0.007*"china" + 0.007*"bank" + 0.006*"future" + 0.006*"beijing" + 0.005*"index"
05:44:24 INFO:topic #177 (0.005): 0.011*"investment" + 0.010*"pound" + 0.010*"group" + 0.008*"cable" + 0.008*"british" + 0.007*"fleming" + 0.007*"management" + 0.007*"share" + 0.006*"fund" + 0.006*"merger"
05:44:24 INFO:topic #37 (0.005): 0.015*"bank" + 0.012*"czech" + 0.008*"crown" + 0.006*"prague" + 0.005*"foreign" + 0.005*"billion" + 0.004*"state" + 0.004*"deficit" + 0.004*"communist" + 0.004*"central"
05:44:24 INFO:topic #56 (0.005): 0.032*"kong" + 0.031*"hong" + 0.031*"hong_kong" + 0.022*"Hong Kong" + 0.016*"china" + 0.007*"fund" + 0.007*"Hong Kong's" + 0.006*"chinese" + 0.005*"tung" + 0.005*"british"
05:44:24 INFO:topic #32 (0.005): 0.004*"china" + 0.003*"beijing" + 0.002*"taiwan" + 0.002*"bre" + 0.002*"bre_x" + 0.002*"share" + 0.002*"x" + 0.002*"chinese" + 0.002*"analyst" + 0.002*"party"
05:44:24 INFO:topic diff=7.433677, rho=0.353553
05:44:24 INFO:PROGRESS: pass 7, at document #2500/2500
05:44:24 DEBUG:performing inference on a chunk of 2500 documents
05:44:32 DEBUG:2500/2500 documents converged within 50 iterations
05:44:32 DEBUG:updating topics
05:44:33 INFO:topic #194 (0.005): 0.024*"cocoa" + 0.020*"tonne" + 0.019*"exporter" + 0.012*"ivory" + 0.012*"ivory_coast" + 0.012*"coast" + 0.011*"crop" + 0.011*"price" + 0.010*"buyer" + 0.009*"export"
05:44:33 INFO:topic #116 (0.005): 0.004*"x" + 0.003*"analyst" + 0.003*"bre" + 0.003*"Bre-X" + 0.002*"bre_x" + 0.002*"share" + 0.002*"government" + 0.002*"bank" + 0.002*"billion" + 0.002*"sale"
05:44:33 INFO:topic #174 (0.005): 0.001*"quarter" + 0.001*"venture" + 0.001*"china" + 0.001*"billion" + 0.001*"beijing" + 0.000*"investment" + 0.000*"share" + 0.000*"level" + 0.000*"chinese" + 0.000*"official"
05:44:33 INFO:topic #169 (0.005): 0.001*"china" + 0.001*"tell" + 0.001*"service" + 0.001*"hong_kong" + 0.001*"billion" + 0.001*"share" + 0.001*"beijing" + 0.001*"group" + 0.001*"kong" + 0.001*"analyst"
05:44:33 INFO:topic #88 (0.005): 0.024*"franc" + 0.023*"french" + 0.022*"air" + 0.021*"france" + 0.017*"thomson" + 0.014*"billion" + 0.011*"group" + 0.010*"telecom" + 0.010*"billion_franc" + 0.009*"government"
05:44:33 INFO:topic diff=6.234229, rho=0.333333
05:44:33 INFO:PROGRESS: pass 8, at document #2500/2500
05:44:33 DEBUG:performing inference on a chunk of 2500 documents
05:44:41 DEBUG:2500/2500 documents converged within 50 iterations
05:44:41 DEBUG:updating topics
05:44:42 INFO:topic #149 (0.005): 0.010*"china" + 0.005*"chinese" + 0.005*"official" + 0.004*"beijing" + 0.004*"metre" + 0.003*"world" + 0.003*"trade" + 0.003*"foreign" + 0.003*"united_states" + 0.003*"united"
05:44:42 INFO:topic #62 (0.005): 0.002*"property" + 0.002*"increase" + 0.002*"month" + 0.002*"klaus" + 0.001*"social" + 0.001*"commission" + 0.001*"pound" + 0.001*"analyst" + 0.001*"large" + 0.001*"party"
05:44:42 INFO:topic #38 (0.005): 0.016*"analyst" + 0.014*"australian" + 0.014*"ltd" + 0.013*"share" + 0.012*"australia" + 0.011*"profit" + 0.011*"sydney" + 0.009*"news" + 0.009*"group" + 0.009*"corp"
05:44:42 INFO:topic #155 (0.005): 0.002*"china" + 0.001*"fund" + 0.001*"stock" + 0.001*"billion" + 0.001*"economic" + 0.001*"hong" + 0.001*"bank" + 0.001*"group" + 0.001*"kong" + 0.001*"canada"
05:44:42 INFO:topic #130 (0.005): 0.019*"mci" + 0.012*"analyst" + 0.011*"service" + 0.011*"share" + 0.011*"long" + 0.010*"billion" + 0.009*"long_distance" + 0.009*"distance" + 0.009*"corp" + 0.008*"deal"
05:44:42 INFO:topic diff=5.231049, rho=0.316228
05:44:42 INFO:PROGRESS: pass 9, at document #2500/2500
05:44:42 DEBUG:performing inference on a chunk of 2500 documents
05:44:50 DEBUG:2500/2500 documents converged within 50 iterations
05:44:51 DEBUG:updating topics
05:44:51 INFO:topic #166 (0.005): 0.001*"oil" + 0.001*"russian" + 0.001*"russia" + 0.001*"internet" + 0.001*"export" + 0.001*"world" + 0.001*"service" + 0.001*"tonne" + 0.001*"analyst" + 0.001*"output"
05:44:51 INFO:topic #187 (0.005): 0.033*"china" + 0.011*"beijing" + 0.011*"official" + 0.010*"chinese" + 0.008*"state" + 0.008*"foreign" + 0.008*"trade" + 0.006*"united" + 0.005*"united_states" + 0.005*"states"
05:44:51 INFO:topic #139 (0.005): 0.013*"drug" + 0.012*"group" + 0.010*"pound" + 0.010*"sale" + 0.009*"plc" + 0.009*"british" + 0.009*"share" + 0.008*"product" + 0.008*"profit" + 0.008*"analyst"
05:44:51 INFO:topic #43 (0.005): 0.001*"tonne" + 0.001*"cocoa" + 0.001*"china" + 0.000*"share" + 0.000*"bank" + 0.000*"government" + 0.000*"exporter" + 0.000*"stock" + 0.000*"plan" + 0.000*"close"
05:44:51 INFO:topic #71 (0.005): 0.025*"fcc" + 0.018*"phone" + 0.016*"carrier" + 0.015*"local" + 0.012*"rule" + 0.011*"long" + 0.011*"service" + 0.011*"distance" + 0.010*"tv" + 0.010*"long_distance"
05:44:51 INFO:topic diff=4.391602, rho=0.301511
05:44:51 INFO:PROGRESS: pass 10, at document #2500/2500
05:44:51 DEBUG:performing inference on a chunk of 2500 documents
05:44:59 DEBUG:2500/2500 documents converged within 50 iterations
05:44:59 DEBUG:updating topics
05:45:00 INFO:topic #156 (0.005): 0.001*"analyst" + 0.001*"bank" + 0.001*"share" + 0.001*"service" + 0.000*"internet" + 0.000*"china" + 0.000*"plan" + 0.000*"profit" + 0.000*"billion" + 0.000*"cost"
05:45:00 INFO:topic #63 (0.005): 0.037*"oil" + 0.029*"russia" + 0.026*"russian" + 0.016*"tonne" + 0.016*"aluminium" + 0.015*"smelter" + 0.014*"output" + 0.012*"world" + 0.012*"export" + 0.010*"western"
05:45:00 INFO:topic #83 (0.005): 0.001*"profit" + 0.001*"bank" + 0.001*"australian" + 0.001*"analyst" + 0.001*"billion" + 0.001*"australia" + 0.001*"share" + 0.001*"tell" + 0.001*"ltd" + 0.001*"beijing"
05:45:00 INFO:topic #144 (0.005): 0.001*"computer" + 0.001*"software" + 0.001*"site" + 0.001*"quarter" + 0.001*"technology" + 0.001*"internet" + 0.001*"industry" + 0.001*"web" + 0.001*"product" + 0.001*"high"
05:45:00 INFO:topic #84 (0.005): 0.015*"klaus" + 0.014*"czech" + 0.014*"bank" + 0.011*"billion" + 0.011*"crown" + 0.009*"state" + 0.009*"price" + 0.008*"minister" + 0.007*"tell" + 0.007*"low"
05:45:00 INFO:topic diff=3.689287, rho=0.288675
05:45:00 INFO:PROGRESS: pass 11, at document #2500/2500
05:45:00 DEBUG:performing inference on a chunk of 2500 documents
05:45:08 DEBUG:2500/2500 documents converged within 50 iterations
05:45:08 DEBUG:updating topics
05:45:09 INFO:topic #50 (0.005): 0.001*"british" + 0.001*"telecom" + 0.001*"deal" + 0.001*"analyst" + 0.001*"drug" + 0.001*"share" + 0.001*"mci" + 0.001*"billion" + 0.001*"group" + 0.001*"sale"
05:45:09 INFO:topic #194 (0.005): 0.024*"cocoa" + 0.020*"tonne" + 0.020*"exporter" + 0.012*"ivory" + 0.012*"coast" + 0.012*"ivory_coast" + 0.011*"crop" + 0.011*"price" + 0.010*"buyer" + 0.009*"export"
05:45:09 INFO:topic #138 (0.005): 0.001*"china" + 0.000*"beijing" + 0.000*"share" + 0.000*"news" + 0.000*"states" + 0.000*"chinese" + 0.000*"analyst" + 0.000*"month" + 0.000*"long" + 0.000*"the United States"
05:45:09 INFO:topic #100 (0.005): 0.001*"chinese" + 0.001*"china" + 0.001*"beijing" + 0.000*"hong" + 0.000*"hong_kong" + 0.000*"official" + 0.000*"tibet" + 0.000*"magazine" + 0.000*"kong" + 0.000*"lama"
05:45:09 INFO:topic #94 (0.005): 0.000*"share" + 0.000*"stock" + 0.000*"election" + 0.000*"analyst" + 0.000*"low" + 0.000*"bank" + 0.000*"havel" + 0.000*"government" + 0.000*"high" + 0.000*"large"
05:45:09 INFO:topic diff=3.102101, rho=0.277350
05:45:09 INFO:PROGRESS: pass 12, at document #2500/2500
05:45:09 DEBUG:performing inference on a chunk of 2500 documents
05:45:17 DEBUG:2500/2500 documents converged within 50 iterations
05:45:17 DEBUG:updating topics
05:45:18 INFO:topic #122 (0.005): 0.000*"share" + 0.000*"analyst" + 0.000*"plant" + 0.000*"gm" + 0.000*"industry" + 0.000*"quarter" + 0.000*"service" + 0.000*"law" + 0.000*"month" + 0.000*"large"
05:45:18 INFO:topic #20 (0.005): 0.029*"gold" + 0.019*"bre" + 0.019*"x" + 0.018*"bre_x" + 0.014*"price" + 0.014*"analyst" + 0.011*"Bre-X" + 0.010*"busang" + 0.010*"barrick" + 0.010*"toronto"
05:45:18 INFO:topic #107 (0.005): 0.031*"russia" + 0.027*"oil" + 0.016*"russian" + 0.014*"export" + 0.012*"output" + 0.010*"moscow" + 0.009*"tonne" + 0.009*"domestic" + 0.009*"world" + 0.008*"western"
05:45:18 INFO:topic #130 (0.005): 0.019*"mci" + 0.012*"analyst" + 0.011*"service" + 0.011*"share" + 0.010*"long" + 0.010*"billion" + 0.009*"long_distance" + 0.009*"distance" + 0.009*"corp" + 0.008*"deal"
05:45:18 INFO:topic #151 (0.005): 0.012*"billion" + 0.008*"sale" + 0.007*"computer" + 0.007*"industry" + 0.006*"good" + 0.006*"analyst" + 0.006*"product" + 0.006*"quarter" + 0.005*"forecast" + 0.005*"internet"
05:45:18 INFO:topic diff=2.611657, rho=0.267261
05:45:18 INFO:PROGRESS: pass 13, at document #2500/2500
05:45:18 DEBUG:performing inference on a chunk of 2500 documents
05:45:28 DEBUG:2500/2500 documents converged within 50 iterations
05:45:28 DEBUG:updating topics
05:45:28 INFO:topic #74 (0.005): 0.036*"china" + 0.029*"tonne" + 0.021*"chinese" + 0.021*"trader" + 0.018*"price" + 0.014*"import" + 0.013*"source" + 0.011*"copper" + 0.010*"official" + 0.010*"million_tonne"
05:45:28 INFO:topic #138 (0.005): 0.000*"china" + 0.000*"beijing" + 0.000*"share" + 0.000*"news" + 0.000*"states" + 0.000*"chinese" + 0.000*"analyst" + 0.000*"month" + 0.000*"long" + 0.000*"the United States"
05:45:28 INFO:topic #60 (0.005): 0.000*"half" + 0.000*"financial" + 0.000*"northern" + 0.000*"policy" + 0.000*"official" + 0.000*"group" + 0.000*"product" + 0.000*"draft" + 0.000*"administration" + 0.000*"stock"
05:45:28 INFO:topic #42 (0.005): 0.000*"russia" + 0.000*"russian" + 0.000*"industry" + 0.000*"technology" + 0.000*"oil" + 0.000*"world" + 0.000*"export" + 0.000*"price" + 0.000*"diamond" + 0.000*"analyst"
05:45:28 INFO:topic #36 (0.005): 0.026*"bid" + 0.025*"penny" + 0.016*"analyst" + 0.015*"electric" + 0.015*"share" + 0.014*"electricity" + 0.013*"pound" + 0.013*"offer" + 0.011*"northern" + 0.011*"british"
05:45:28 INFO:topic diff=2.202422, rho=0.258199
05:45:28 INFO:PROGRESS: pass 14, at document #2500/2500
05:45:28 DEBUG:performing inference on a chunk of 2500 documents
05:45:36 DEBUG:2500/2500 documents converged within 50 iterations
05:45:36 DEBUG:updating topics
05:45:36 INFO:topic #89 (0.005): 0.001*"bre_x" + 0.001*"x" + 0.001*"bre" + 0.001*"analyst" + 0.001*"barrick" + 0.001*"Bre-X" + 0.001*"government" + 0.001*"gold" + 0.001*"indonesian" + 0.001*"billion"
05:45:36 INFO:topic #121 (0.005): 0.000*"time" + 0.000*"share" + 0.000*"second" + 0.000*"tobacco" + 0.000*"group" + 0.000*"industry" + 0.000*"action" + 0.000*"month" + 0.000*"plan" + 0.000*"hand"
05:45:36 INFO:topic #188 (0.005): 0.015*"sale" + 0.013*"analyst" + 0.011*"share" + 0.008*"mercury" + 0.008*"bank" + 0.007*"stock" + 0.007*"billion" + 0.006*"amp" + 0.006*"think" + 0.006*"base"
05:45:36 INFO:topic #80 (0.005): 0.016*"microsoft" + 0.015*"analyst" + 0.013*"quarter" + 0.010*"windows" + 0.010*"computer" + 0.009*"business" + 0.009*"revenue" + 0.009*"sale" + 0.008*"system" + 0.008*"software"
05:45:36 INFO:topic #56 (0.005): 0.033*"hong" + 0.033*"kong" + 0.032*"hong_kong" + 0.023*"Hong Kong" + 0.017*"china" + 0.007*"Hong Kong's" + 0.006*"chinese" + 0.006*"tung" + 0.006*"british" + 0.004*"government"
05:45:36 INFO:topic diff=1.861192, rho=0.250000
05:45:36 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=3914, num_topics=200, num_authors=50, decay=0.5, chunksize=2500)
05:45:37 INFO:CorpusAccumulator accumulated stats from 1000 documents
05:45:37 INFO:CorpusAccumulator accumulated stats from 2000 documents
-1.93149366596
In [20]:
accuracy_scores_200topic = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_50_20, atmodel_200topics, k=i)
    accuracy_scores_200topic[k] = accuracy
plot_accuracy(scores1=accuracy_scores_150topic, label1="150 topics", scores2=accuracy_scores_200topic, label2="200 topics")
Precision@k: top_n=1
Prediction accuracy: 0.6232
Precision@k: top_n=2
Prediction accuracy: 0.7664
Precision@k: top_n=3
Prediction accuracy: 0.8456
Precision@k: top_n=4
Prediction accuracy: 0.8816
Precision@k: top_n=5
Prediction accuracy: 0.9032
Precision@k: top_n=6
Prediction accuracy: 0.9164
Precision@k: top_n=8
Prediction accuracy: 0.9368
Precision@k: top_n=10
Prediction accuracy: 0.9464
The 200-topic model seems to perform a bit better for low k, which might be due to a slight over-representation at high topic numbers. So let us stop increasing the topic number here and focus some more on the dictionary; either of the larger models will do (we use 150 topics below). Currently we filter out tokens that appear in more than 50% of all documents or fewer than 20 times overall, which drastically decreases the size of our dictionary. We know that the underlying topics of this dataset are not very diverse and are structured around the corporate/industrial topic class, so it makes sense to enlarge the dictionary by filtering out fewer tokens.
We set the parameters to max_freq=25% and min_wordcount=10.
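For reference, a minimal sketch of what such a corpus/dictionary helper might look like, assuming it wraps Gensim's Dictionary.filter_extremes (the function name here is hypothetical, not necessarily the one defined earlier in the notebook):

from gensim.corpora import Dictionary

def create_corpus_dictionary_sketch(docs, max_freq=0.25, min_wordcount=10):
    dictionary = Dictionary(docs)
    # Drop tokens that occur in fewer than `min_wordcount` documents or in
    # more than `max_freq` (as a fraction) of all documents.
    dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    return corpus, dictionary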
In [30]:
train_corpus_25_10, train_dictionary_25_10 = create_corpus_dictionary(train_docs, 0.25, 10)
06:18:50 INFO:adding document #0 to Dictionary(0 unique tokens: [])
06:18:51 INFO:built Dictionary(46905 unique tokens: ['$83.4 million', 'boarder', '$2.72 billion', 'checking', 'suzuki']...) from 2500 documents (total 786032 corpus positions)
06:18:51 INFO:discarding 40690 tokens: [('$15', 3), ('$17.25', 1), ('$380 million', 2), ('12.5 cents', 7), ('Big B', 3), ('Big B Inc.', 2), ("Big B's", 3), ('Big B. I', 1), ('Dwayne Hoven', 1), ('Eckerd Corp.', 1)]...
06:18:51 INFO:keeping 6215 tokens which were in no less than 10 and no more than 625 (=25.0%) documents
06:18:51 DEBUG:rebuilding dictionary, shrinking gaps
06:18:51 INFO:resulting dictionary: Dictionary(6215 unique tokens: ['offshoot', 'shore', 'loss', 'merger', 'disappointing']...)
In [31]:
test_corpus_25_10 = create_test_corpus(train_dictionary_25_10, test_docs)
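The test corpus has to be built against the training dictionary so that both share the same token ids; a sketch of such a helper (not necessarily the exact one defined earlier) is simply doc2bow with that dictionary, where out-of-vocabulary tokens are dropped:

def create_test_corpus_sketch(train_dictionary, test_docs):
    # doc2bow silently ignores tokens that are not in the dictionary.
    return [train_dictionary.doc2bow(doc) for doc in test_docs]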
In [32]:
print('Number of unique tokens: %d' % len(train_dictionary_25_10))
Number of unique tokens: 6215
We have now nearly doubled the number of tokens. Let's train and evaluate.
In [33]:
atmodel_150topics_25_10 = train_model(train_corpus_25_10, train_author2doc, train_dictionary_25_10, num_topics=150, eval_every=0, iterations=50, passes=15)
06:18:53 INFO:Vocabulary consists of 6215 words.
06:18:53 INFO:using symmetric alpha at 0.006666666666666667
06:18:53 INFO:using symmetric eta at 0.006666666666666667
06:18:57 INFO:running online author-topic training, 150 topics, 50 authors, 15 passes over the supplied corpus of 2500 documents, updating model once every 2500 documents, evaluating perplexity every 0 documents, iterating 50x with a convergence threshold of 0.001000
06:18:57 INFO:PROGRESS: pass 0, at document #2500/2500
06:18:57 DEBUG:performing inference on a chunk of 2500 documents
06:19:11 DEBUG:17/2500 documents converged within 50 iterations
06:19:11 DEBUG:updating topics
06:19:12 INFO:topic #141 (0.007): 0.031*"gm" + 0.016*"plant" + 0.011*"worker" + 0.010*"uaw" + 0.009*"strike" + 0.009*"truck" + 0.008*"local" + 0.007*"automaker" + 0.006*"part" + 0.005*"contract"
06:19:12 INFO:topic #105 (0.007): 0.013*"china" + 0.010*"tonne" + 0.009*"chinese" + 0.008*"trader" + 0.007*"copper" + 0.007*"product" + 0.005*"drug" + 0.005*"hong_kong" + 0.004*"soybean" + 0.004*"hong"
06:19:12 INFO:topic #15 (0.007): 0.006*"china" + 0.004*"network" + 0.003*"drug" + 0.003*"trade" + 0.003*"united" + 0.003*"states" + 0.003*"boeing" + 0.003*"chinese" + 0.003*"beijing" + 0.002*"product"
06:19:12 INFO:topic #30 (0.007): 0.010*"amp" + 0.009*"bank" + 0.005*"ernst" + 0.005*"claim" + 0.005*"bre" + 0.004*"bre_x" + 0.004*"gold" + 0.003*"rate" + 0.003*"x" + 0.003*"pay"
06:19:12 INFO:topic #114 (0.007): 0.019*"bank" + 0.010*"japan" + 0.009*"pound" + 0.008*"problem" + 0.008*"loan" + 0.007*"financial" + 0.006*"yen" + 0.005*"bt" + 0.005*"million_pound" + 0.005*"japanese"
06:19:12 INFO:topic diff=61.971494, rho=1.000000
06:19:12 INFO:PROGRESS: pass 1, at document #2500/2500
06:19:12 DEBUG:performing inference on a chunk of 2500 documents
06:19:19 DEBUG:2491/2500 documents converged within 50 iterations
06:19:19 DEBUG:updating topics
06:19:19 INFO:topic #45 (0.007): 0.006*"property" + 0.004*"china" + 0.003*"holding" + 0.003*"survey" + 0.003*"sector" + 0.002*"bank" + 0.002*"gold" + 0.002*"fall" + 0.002*"debt" + 0.002*"air"
06:19:19 INFO:topic #139 (0.007): 0.024*"colombia" + 0.021*"drug" + 0.008*"guerrilla" + 0.008*"colombian" + 0.007*"police" + 0.006*"extradition" + 0.005*"late" + 0.005*"anti" + 0.005*"congress" + 0.005*"contract"
06:19:19 INFO:topic #15 (0.007): 0.005*"china" + 0.003*"network" + 0.002*"drug" + 0.002*"trade" + 0.002*"united" + 0.002*"states" + 0.002*"boeing" + 0.002*"chinese" + 0.002*"beijing" + 0.002*"product"
06:19:19 INFO:topic #2 (0.007): 0.004*"bre_x" + 0.004*"x" + 0.003*"bid" + 0.003*"bre" + 0.003*"product" + 0.003*"Bre-X" + 0.003*"drug" + 0.003*"gold" + 0.002*"mining" + 0.002*"pound"
06:19:19 INFO:topic #116 (0.007): 0.007*"china" + 0.006*"bank" + 0.004*"tonne" + 0.004*"problem" + 0.004*"hong_kong" + 0.003*"trader" + 0.003*"chinese" + 0.003*"loan" + 0.003*"kong" + 0.003*"hong"
06:19:19 INFO:topic diff=11.411593, rho=0.577350
06:19:19 INFO:PROGRESS: pass 2, at document #2500/2500
06:19:19 DEBUG:performing inference on a chunk of 2500 documents
06:19:26 DEBUG:2499/2500 documents converged within 50 iterations
06:19:26 DEBUG:updating topics
06:19:26 INFO:topic #116 (0.007): 0.005*"china" + 0.005*"bank" + 0.003*"tonne" + 0.003*"problem" + 0.003*"hong_kong" + 0.002*"trader" + 0.002*"chinese" + 0.002*"loan" + 0.002*"kong" + 0.002*"hong"
06:19:26 INFO:topic #79 (0.007): 0.030*"china" + 0.020*"beijing" + 0.014*"chinese" + 0.009*"taiwan" + 0.008*"trade" + 0.008*"wang" + 0.007*"foreign" + 0.006*"united" + 0.006*"washington" + 0.006*"states"
06:19:26 INFO:topic #58 (0.007): 0.006*"pound" + 0.004*"million_pound" + 0.003*"hong_kong" + 0.002*"hong" + 0.002*"kong" + 0.002*"pay" + 0.002*"china" + 0.002*"Hong Kong" + 0.002*"shareholder" + 0.002*"service"
06:19:26 INFO:topic #19 (0.007): 0.036*"bre" + 0.035*"x" + 0.033*"bre_x" + 0.031*"gold" + 0.026*"Bre-X" + 0.019*"barrick" + 0.015*"busang" + 0.013*"indonesian" + 0.011*"mining" + 0.009*"deposit"
06:19:26 INFO:topic #112 (0.007): 0.005*"bank" + 0.003*"russia" + 0.002*"x" + 0.002*"diamond" + 0.002*"bre" + 0.002*"bre_x" + 0.002*"canada" + 0.002*"export" + 0.002*"canadian" + 0.002*"Bre-X"
06:19:26 INFO:topic diff=9.522079, rho=0.500000
06:19:26 INFO:PROGRESS: pass 3, at document #2500/2500
06:19:26 DEBUG:performing inference on a chunk of 2500 documents
06:19:32 DEBUG:2500/2500 documents converged within 50 iterations
06:19:32 DEBUG:updating topics
06:19:33 INFO:topic #38 (0.007): 0.003*"block" + 0.002*"quarter" + 0.002*"service" + 0.002*"compuserve" + 0.002*"china" + 0.002*"pound" + 0.002*"loss" + 0.001*"chinese" + 0.001*"time_warner" + 0.001*"cent"
06:19:33 INFO:topic #148 (0.007): 0.018*"franc" + 0.015*"french" + 0.014*"airbus" + 0.014*"france" + 0.013*"thomson" + 0.009*"air" + 0.009*"billion_franc" + 0.007*"boeing" + 0.007*"state" + 0.007*"air_france"
06:19:33 INFO:topic #9 (0.007): 0.023*"shanghai" + 0.021*"china" + 0.018*"bank" + 0.014*"b" + 0.014*"foreign" + 0.011*"investor" + 0.011*"exchange" + 0.011*"b_share" + 0.010*"beijing" + 0.010*"shenzhen"
06:19:33 INFO:topic #11 (0.007): 0.013*"tobacco" + 0.010*"florida" + 0.009*"quick" + 0.008*"state" + 0.007*"car" + 0.007*"amp" + 0.007*"trial" + 0.006*"cigarette" + 0.006*"television" + 0.006*"maker"
06:19:33 INFO:topic #116 (0.007): 0.004*"china" + 0.003*"bank" + 0.002*"tonne" + 0.002*"problem" + 0.002*"hong_kong" + 0.002*"trader" + 0.002*"chinese" + 0.002*"loan" + 0.001*"kong" + 0.001*"hong"
06:19:33 INFO:topic diff=7.935955, rho=0.447214
06:19:33 INFO:PROGRESS: pass 4, at document #2500/2500
06:19:33 DEBUG:performing inference on a chunk of 2500 documents
06:19:39 DEBUG:2500/2500 documents converged within 50 iterations
06:19:39 DEBUG:updating topics
06:19:39 INFO:topic #118 (0.007): 0.003*"tonne" + 0.002*"cocoa" + 0.002*"exporter" + 0.001*"chad" + 0.001*"bank" + 0.001*"coast" + 0.001*"ivory" + 0.001*"crop" + 0.001*"ivory_coast" + 0.001*"cable"
06:19:39 INFO:topic #134 (0.007): 0.038*"bank" + 0.020*"canada" + 0.017*"canadian" + 0.011*"toronto" + 0.009*"fund" + 0.008*"cent" + 0.007*"molson" + 0.006*"earning" + 0.005*"royal_bank" + 0.005*"royal"
06:19:39 INFO:topic #7 (0.007): 0.002*"soybean" + 0.002*"china" + 0.002*"monsanto" + 0.002*"director" + 0.002*"adm" + 0.002*"hong" + 0.002*"crop" + 0.001*"hong_kong" + 0.001*"united" + 0.001*"equipment"
06:19:39 INFO:topic #93 (0.007): 0.005*"earning" + 0.005*"point" + 0.004*"quarter" + 0.004*"investor" + 0.004*"fund" + 0.004*"growth" + 0.003*"exchange" + 0.003*"investment" + 0.003*"strong" + 0.003*"trade"
06:19:39 INFO:topic #26 (0.007): 0.012*"bank" + 0.010*"yen" + 0.008*"billion_yen" + 0.005*"financial" + 0.005*"affiliate" + 0.004*"daiwa" + 0.004*"non" + 0.004*"non_bank" + 0.004*"half" + 0.004*"post"
06:19:39 INFO:topic diff=6.627219, rho=0.408248
06:19:39 INFO:PROGRESS: pass 5, at document #2500/2500
06:19:39 DEBUG:performing inference on a chunk of 2500 documents
06:19:45 DEBUG:2500/2500 documents converged within 50 iterations
06:19:45 DEBUG:updating topics
06:19:46 INFO:topic #16 (0.007): 0.030*"toronto" + 0.020*"index" + 0.019*"bank" + 0.018*"canada" + 0.016*"point" + 0.015*"gold" + 0.012*"canadian" + 0.011*"toronto_stock" + 0.011*"fall" + 0.010*"gain"
06:19:46 INFO:topic #114 (0.007): 0.010*"bank" + 0.005*"japan" + 0.005*"pound" + 0.004*"problem" + 0.004*"loan" + 0.003*"financial" + 0.003*"yen" + 0.003*"bt" + 0.003*"million_pound" + 0.003*"japanese"
06:19:46 INFO:topic #52 (0.007): 0.019*"bank" + 0.017*"airbus" + 0.008*"canada" + 0.006*"fund" + 0.006*"canadian" + 0.006*"service" + 0.005*"boeing" + 0.004*"aircraft" + 0.004*"aerospace" + 0.004*"office"
06:19:46 INFO:topic #146 (0.007): 0.007*"china" + 0.005*"party" + 0.004*"pound" + 0.003*"british" + 0.003*"plc" + 0.003*"stg" + 0.003*"drug" + 0.002*"million_pound" + 0.002*"country" + 0.002*"technology"
06:19:46 INFO:topic #47 (0.007): 0.009*"tonne" + 0.008*"smelter" + 0.007*"oil" + 0.007*"aluminium" + 0.006*"state" + 0.006*"plant" + 0.006*"russia" + 0.006*"trader" + 0.006*"source" + 0.005*"metal"
06:19:46 INFO:topic diff=5.554374, rho=0.377964
06:19:46 INFO:PROGRESS: pass 6, at document #2500/2500
06:19:46 DEBUG:performing inference on a chunk of 2500 documents
06:19:51 DEBUG:2500/2500 documents converged within 50 iterations
06:19:51 DEBUG:updating topics
06:19:52 INFO:topic #44 (0.007): 0.007*"internet" + 0.004*"committee" + 0.003*"proposal" + 0.003*"address" + 0.003*"trade" + 0.003*"china" + 0.003*"congress" + 0.002*"member" + 0.002*"financial" + 0.002*"name"
06:19:52 INFO:topic #131 (0.007): 0.009*"bank" + 0.008*"internet" + 0.007*"court" + 0.005*"exchange" + 0.004*"foreign" + 0.004*"currency" + 0.004*"trading" + 0.004*"policy" + 0.003*"law" + 0.003*"security"
06:19:52 INFO:topic #112 (0.007): 0.001*"bank" + 0.001*"russia" + 0.001*"x" + 0.001*"diamond" + 0.001*"bre" + 0.001*"bre_x" + 0.001*"canada" + 0.001*"export" + 0.001*"canadian" + 0.001*"Bre-X"
06:19:52 INFO:topic #49 (0.007): 0.008*"bid" + 0.008*"penny" + 0.005*"pound" + 0.004*"northern" + 0.004*"electric" + 0.003*"midlands" + 0.003*"offer" + 0.003*"sector" + 0.003*"electricity" + 0.003*"east"
06:19:52 INFO:topic #61 (0.007): 0.008*"china" + 0.008*"tibet" + 0.005*"chinese" + 0.005*"beijing" + 0.005*"foreign" + 0.004*"wang" + 0.004*"hong_kong" + 0.004*"kong" + 0.003*"hong" + 0.003*"region"
06:19:52 INFO:topic diff=4.666072, rho=0.353553
06:19:52 INFO:PROGRESS: pass 7, at document #2500/2500
06:19:52 DEBUG:performing inference on a chunk of 2500 documents
06:19:58 DEBUG:2500/2500 documents converged within 50 iterations
06:19:58 DEBUG:updating topics
06:19:58 INFO:topic #8 (0.007): 0.001*"french" + 0.001*"bank" + 0.001*"service" + 0.001*"financial" + 0.001*"internet" + 0.001*"china" + 0.000*"mfs" + 0.000*"sell" + 0.000*"state" + 0.000*"product"
06:19:58 INFO:topic #45 (0.007): 0.001*"property" + 0.000*"china" + 0.000*"holding" + 0.000*"survey" + 0.000*"sector" + 0.000*"bank" + 0.000*"gold" + 0.000*"fall" + 0.000*"debt" + 0.000*"air"
06:19:58 INFO:topic #22 (0.007): 0.016*"pound" + 0.012*"drug" + 0.011*"plc" + 0.011*"british" + 0.011*"million_pound" + 0.008*"product" + 0.007*"penny" + 0.006*"cancer" + 0.006*"stg" + 0.005*"biotech"
06:19:58 INFO:topic #10 (0.007): 0.023*"bank" + 0.015*"pound" + 0.008*"society" + 0.006*"banking" + 0.006*"fund" + 0.006*"shareholder" + 0.005*"investment" + 0.005*"eurotunnel" + 0.005*"lloyds" + 0.005*"debt"
06:19:58 INFO:topic #11 (0.007): 0.013*"tobacco" + 0.012*"florida" + 0.009*"quick" + 0.009*"state" + 0.008*"car" + 0.007*"amp" + 0.007*"trial" + 0.007*"television" + 0.006*"news" + 0.006*"maker"
06:19:58 INFO:topic diff=3.925478, rho=0.333333
06:19:58 INFO:PROGRESS: pass 8, at document #2500/2500
06:19:58 DEBUG:performing inference on a chunk of 2500 documents
06:20:04 DEBUG:2500/2500 documents converged within 50 iterations
06:20:04 DEBUG:updating topics
06:20:04 INFO:topic #9 (0.007): 0.024*"shanghai" + 0.022*"china" + 0.018*"bank" + 0.014*"b" + 0.014*"foreign" + 0.011*"investor" + 0.011*"exchange" + 0.011*"b_share" + 0.010*"beijing" + 0.010*"shenzhen"
06:20:04 INFO:topic #42 (0.007): 0.001*"news" + 0.000*"china" + 0.000*"corp" + 0.000*"net" + 0.000*"property" + 0.000*"news_corp" + 0.000*"value" + 0.000*"shareholder" + 0.000*"bre_x" + 0.000*"x"
06:20:04 INFO:topic #131 (0.007): 0.006*"bank" + 0.005*"internet" + 0.004*"court" + 0.003*"exchange" + 0.003*"foreign" + 0.003*"currency" + 0.002*"trading" + 0.002*"policy" + 0.002*"law" + 0.002*"security"
06:20:04 INFO:topic #99 (0.007): 0.011*"mci" + 0.007*"digital" + 0.007*"camera" + 0.006*"rockwell" + 0.005*"technology" + 0.005*"kong" + 0.005*"hand" + 0.005*"hong_kong" + 0.005*"system" + 0.004*"trade"
06:20:04 INFO:topic #5 (0.007): 0.018*"bt" + 0.013*"telecom" + 0.011*"pound" + 0.010*"british" + 0.008*"mci" + 0.007*"service" + 0.006*"merger" + 0.005*"penny" + 0.005*"britain" + 0.005*"ntt"
06:20:04 INFO:topic diff=3.304076, rho=0.316228
06:20:04 INFO:PROGRESS: pass 9, at document #2500/2500
06:20:04 DEBUG:performing inference on a chunk of 2500 documents
06:20:10 DEBUG:2500/2500 documents converged within 50 iterations
06:20:10 DEBUG:updating topics
06:20:10 INFO:topic #108 (0.007): 0.005*"gm" + 0.004*"computer" + 0.004*"quarter" + 0.004*"ibm" + 0.004*"car" + 0.004*"technology" + 0.004*"france" + 0.003*"thomson" + 0.003*"plant" + 0.003*"service"
06:20:11 INFO:topic #111 (0.007): 0.008*"computer" + 0.006*"software" + 0.005*"apple" + 0.005*"quarter" + 0.004*"microsoft" + 0.003*"technology" + 0.003*"design" + 0.003*"pc" + 0.002*"oracle" + 0.002*"financial"
06:20:11 INFO:topic #82 (0.007): 0.003*"china" + 0.002*"shanghai" + 0.002*"future" + 0.002*"exchange" + 0.001*"b" + 0.001*"index" + 0.001*"authority" + 0.001*"investor" + 0.001*"trading" + 0.001*"foreign"
06:20:11 INFO:topic #89 (0.007): 0.016*"internet" + 0.015*"computer" + 0.014*"technology" + 0.010*"quarter" + 0.010*"software" + 0.009*"product" + 0.008*"microsoft" + 0.008*"sun" + 0.007*"netscape" + 0.007*"web"
06:20:11 INFO:topic #101 (0.007): 0.001*"china" + 0.001*"kong" + 0.001*"hong" + 0.001*"hong_kong" + 0.000*"Hong Kong" + 0.000*"macau" + 0.000*"tung" + 0.000*"chinese" + 0.000*"formula" + 0.000*"beijing"
06:20:11 INFO:topic diff=2.781140, rho=0.301511
06:20:11 INFO:PROGRESS: pass 10, at document #2500/2500
06:20:11 DEBUG:performing inference on a chunk of 2500 documents
06:20:16 DEBUG:2500/2500 documents converged within 50 iterations
06:20:16 DEBUG:updating topics
06:20:17 INFO:topic #60 (0.007): 0.011*"oil" + 0.009*"colombia" + 0.008*"colombian" + 0.008*"paramilitary" + 0.008*"country" + 0.008*"drug" + 0.008*"police" + 0.007*"attack" + 0.007*"force" + 0.007*"medellin"
06:20:17 INFO:topic #99 (0.007): 0.010*"mci" + 0.008*"digital" + 0.007*"camera" + 0.007*"rockwell" + 0.005*"technology" + 0.005*"hand" + 0.005*"system" + 0.005*"agreement" + 0.005*"personal" + 0.005*"trade"
06:20:17 INFO:topic #109 (0.007): 0.025*"pound" + 0.016*"million_pound" + 0.012*"life" + 0.011*"insurance" + 0.011*"scotam" + 0.009*"offer" + 0.009*"abbey" + 0.007*"policyholder" + 0.007*"british" + 0.006*"scottish"
06:20:17 INFO:topic #55 (0.007): 0.028*"internet" + 0.027*"court" + 0.019*"foreign" + 0.017*"exchange" + 0.017*"currency" + 0.014*"case" + 0.014*"foreign_currency" + 0.014*"trading" + 0.012*"amendment" + 0.012*"address"
06:20:17 INFO:topic #9 (0.007): 0.024*"shanghai" + 0.022*"china" + 0.018*"bank" + 0.014*"b" + 0.014*"foreign" + 0.011*"investor" + 0.011*"exchange" + 0.011*"b_share" + 0.010*"beijing" + 0.010*"shenzhen"
06:20:17 INFO:topic diff=2.340896, rho=0.288675
06:20:17 INFO:PROGRESS: pass 11, at document #2500/2500
06:20:17 DEBUG:performing inference on a chunk of 2500 documents
06:20:22 DEBUG:2500/2500 documents converged within 50 iterations
06:20:22 DEBUG:updating topics
06:20:23 INFO:topic #20 (0.007): 0.000*"china" + 0.000*"de" + 0.000*"russia" + 0.000*"chinese" + 0.000*"beijing" + 0.000*"diamond" + 0.000*"kong" + 0.000*"export" + 0.000*"oil" + 0.000*"service"
06:20:23 INFO:topic #7 (0.007): 0.000*"soybean" + 0.000*"china" + 0.000*"monsanto" + 0.000*"director" + 0.000*"adm" + 0.000*"hong" + 0.000*"crop" + 0.000*"hong_kong" + 0.000*"united" + 0.000*"equipment"
06:20:23 INFO:topic #24 (0.007): 0.024*"czech" + 0.011*"crown" + 0.011*"bank" + 0.010*"prague" + 0.010*"klaus" + 0.008*"party" + 0.006*"havel" + 0.006*"foreign" + 0.006*"country" + 0.006*"election"
06:20:23 INFO:topic #80 (0.007): 0.025*"king" + 0.021*"silver" + 0.013*"network" + 0.012*"station" + 0.012*"shopping" + 0.012*"home_shopping" + 0.012*"television" + 0.009*"latin" + 0.009*"news" + 0.009*"home"
06:20:23 INFO:topic #82 (0.007): 0.001*"china" + 0.001*"shanghai" + 0.001*"future" + 0.001*"exchange" + 0.001*"b" + 0.001*"index" + 0.001*"authority" + 0.001*"investor" + 0.001*"trading" + 0.001*"foreign"
06:20:23 INFO:topic diff=1.970765, rho=0.277350
06:20:23 INFO:PROGRESS: pass 12, at document #2500/2500
06:20:23 DEBUG:performing inference on a chunk of 2500 documents
06:20:29 DEBUG:2500/2500 documents converged within 50 iterations
06:20:29 DEBUG:updating topics
06:20:30 INFO:topic #123 (0.007): 0.021*"china" + 0.013*"wang" + 0.011*"beijing" + 0.010*"chinese" + 0.007*"tibet" + 0.006*"dissident" + 0.006*"state" + 0.006*"party" + 0.005*"communist" + 0.005*"court"
06:20:30 INFO:topic #86 (0.007): 0.015*"internet" + 0.014*"computer" + 0.014*"ibm" + 0.012*"quarter" + 0.011*"service" + 0.009*"pc" + 0.008*"software" + 0.007*"system" + 0.007*"consumer" + 0.006*"network"
06:20:30 INFO:topic #50 (0.007): 0.034*"hong_kong" + 0.033*"kong" + 0.033*"hong" + 0.029*"china" + 0.021*"Hong Kong" + 0.018*"tung" + 0.013*"beijing" + 0.012*"chinese" + 0.012*"Hong Kong's" + 0.010*"britain"
06:20:30 INFO:topic #9 (0.007): 0.025*"shanghai" + 0.022*"china" + 0.018*"bank" + 0.014*"b" + 0.014*"foreign" + 0.011*"investor" + 0.011*"exchange" + 0.011*"b_share" + 0.010*"beijing" + 0.010*"shenzhen"
06:20:30 INFO:topic #24 (0.007): 0.024*"czech" + 0.011*"crown" + 0.011*"bank" + 0.010*"prague" + 0.009*"klaus" + 0.008*"party" + 0.006*"havel" + 0.006*"foreign" + 0.006*"country" + 0.006*"election"
06:20:30 INFO:topic diff=1.660230, rho=0.267261
06:20:30 INFO:PROGRESS: pass 13, at document #2500/2500
06:20:30 DEBUG:performing inference on a chunk of 2500 documents
06:20:37 DEBUG:2500/2500 documents converged within 50 iterations
06:20:37 DEBUG:updating topics
06:20:38 INFO:topic #108 (0.007): 0.003*"gm" + 0.002*"computer" + 0.002*"quarter" + 0.002*"ibm" + 0.002*"car" + 0.002*"technology" + 0.002*"france" + 0.002*"thomson" + 0.002*"plant" + 0.002*"service"
06:20:38 INFO:topic #116 (0.007): 0.000*"china" + 0.000*"bank" + 0.000*"tonne" + 0.000*"problem" + 0.000*"hong_kong" + 0.000*"trader" + 0.000*"chinese" + 0.000*"loan" + 0.000*"kong" + 0.000*"hong"
06:20:38 INFO:topic #59 (0.007): 0.000*"pound" + 0.000*"lloyds" + 0.000*"bank" + 0.000*"pension" + 0.000*"insurance" + 0.000*"amp" + 0.000*"bhp" + 0.000*"claim" + 0.000*"million_pound" + 0.000*"scottish"
06:20:38 INFO:topic #19 (0.007): 0.034*"bre" + 0.033*"x" + 0.032*"bre_x" + 0.031*"gold" + 0.025*"Bre-X" + 0.018*"barrick" + 0.015*"busang" + 0.013*"indonesian" + 0.011*"mining" + 0.008*"exploration"
06:20:38 INFO:topic #112 (0.007): 0.000*"bank" + 0.000*"russia" + 0.000*"x" + 0.000*"diamond" + 0.000*"bre" + 0.000*"bre_x" + 0.000*"canada" + 0.000*"export" + 0.000*"canadian" + 0.000*"Bre-X"
06:20:38 INFO:topic diff=1.400248, rho=0.258199
06:20:38 INFO:PROGRESS: pass 14, at document #2500/2500
06:20:38 DEBUG:performing inference on a chunk of 2500 documents
06:20:47 DEBUG:2500/2500 documents converged within 50 iterations
06:20:47 DEBUG:updating topics
06:20:47 INFO:topic #43 (0.007): 0.001*"czech" + 0.001*"party" + 0.001*"klaus" + 0.001*"coalition" + 0.001*"election" + 0.001*"havel" + 0.000*"house" + 0.000*"crown" + 0.000*"prague" + 0.000*"parliament"
06:20:47 INFO:topic #49 (0.007): 0.001*"bid" + 0.001*"penny" + 0.001*"pound" + 0.001*"northern" + 0.001*"electric" + 0.000*"midlands" + 0.000*"offer" + 0.000*"sector" + 0.000*"electricity" + 0.000*"east"
06:20:47 INFO:topic #122 (0.007): 0.000*"wang" + 0.000*"china" + 0.000*"beijing" + 0.000*"law" + 0.000*"trial" + 0.000*"death" + 0.000*"dissident" + 0.000*"pound" + 0.000*"hong" + 0.000*"sentence"
06:20:47 INFO:topic #132 (0.007): 0.001*"bank" + 0.000*"crown" + 0.000*"klaus" + 0.000*"czech" + 0.000*"social" + 0.000*"banka" + 0.000*"party" + 0.000*"minister" + 0.000*"state" + 0.000*"billion_crown"
06:20:47 INFO:topic #16 (0.007): 0.030*"toronto" + 0.020*"index" + 0.019*"bank" + 0.019*"canada" + 0.017*"gold" + 0.016*"point" + 0.012*"canadian" + 0.011*"toronto_stock" + 0.011*"fall" + 0.010*"gain"
06:20:47 INFO:topic diff=1.182972, rho=0.250000
06:20:47 DEBUG:Setting topics to those of the model: AuthorTopicModel(num_terms=6215, num_topics=150, num_authors=50, decay=0.5, chunksize=2500)
06:20:47 INFO:CorpusAccumulator accumulated stats from 1000 documents
06:20:48 INFO:CorpusAccumulator accumulated stats from 2000 documents
-2.83261288295
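The single value printed above is most likely an aggregate topic-coherence score: the CorpusAccumulator messages in the log are emitted while Gensim evaluates u_mass coherence against the corpus. The cell that produced it is not shown here, so the sketch below is only one plausible way to obtain such a number; the variable names atmodel_150topics_25_10 and train_corpus_25_10 are assumptions standing in for the model and training corpus built earlier in the notebook.
import numpy as np

# A sketch, not the notebook's exact cell: top_topics() computes u_mass
# coherence for every topic (this is what triggers the CorpusAccumulator
# log lines above), and we average the per-topic scores into one number.
# atmodel_150topics_25_10 and train_corpus_25_10 are assumed names.
top_topics = atmodel_150topics_25_10.top_topics(train_corpus_25_10)
print(np.mean([coherence for topic, coherence in top_topics]))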
In [35]:
accuracy_scores_150topic_25_10 = {}
for i in [1, 2, 3, 4, 5, 6, 8, 10]:
    accuracy, k = prediction_accuracy(test_author2doc, test_corpus_25_10, atmodel_150topics_25_10, k=i)
    accuracy_scores_150topic_25_10[k] = accuracy

plot_accuracy(scores1=accuracy_scores_150topic_25_10, label1="150 topics, max_freq=25%, min_wordcount=10",
              scores2=accuracy_scores_150topic, label2="150 topics, standard")
Precision@k: top_n=1
Prediction accuracy: 0.6176
Precision@k: top_n=2
Prediction accuracy: 0.7712
Precision@k: top_n=3
Prediction accuracy: 0.8268
Precision@k: top_n=4
Prediction accuracy: 0.8656
Precision@k: top_n=5
Prediction accuracy: 0.8916
Precision@k: top_n=6
Prediction accuracy: 0.9112
Precision@k: top_n=8
Prediction accuracy: 0.9308
Precision@k: top_n=10
Prediction accuracy: 0.9408
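The prediction_accuracy and plot_accuracy helpers used in the cell above were defined earlier in the notebook. As a reminder of what the precision@k numbers mean, here is a minimal sketch of how such a helper can be written with Gensim's matutils.hellinger and AuthorTopicModel.get_new_author_topics(): each held-out document is treated as a new, unknown author, all known authors are ranked by the Hellinger distance H(p, q) = (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) between their topic distribution and the inferred one, and the prediction counts as correct if the true author appears among the k closest. The notebook's actual implementation may differ in detail.
from gensim import matutils

def prediction_accuracy(test_author2doc, test_corpus, atmodel, k=5):
    # Sketch of a Hellinger-based precision@k evaluation (not necessarily
    # identical to the notebook's own definition).
    correct, total = 0, 0
    for true_author, doc_ids in test_author2doc.items():
        for doc_id in doc_ids:
            # Infer a topic distribution as if this one held-out document
            # had been written by a single new, unseen author.
            new_author_vec = atmodel.get_new_author_topics([test_corpus[doc_id]])
            # Rank all known authors by Hellinger distance to that distribution.
            distances = {
                author: matutils.hellinger(new_author_vec, atmodel.get_author_topics(author))
                for author in atmodel.id2author.values()
            }
            top_k = sorted(distances, key=distances.get)[:k]  # closest authors first
            if true_author in top_k:
                correct += 1
            total += 1
    accuracy = float(correct) / total
    print("Precision@k: top_n=%d" % k)
    print("Prediction accuracy: %.4f" % accuracy)
    return accuracy, k
Because the Hellinger distance between two probability distributions is symmetric and bounded between 0 and 1, ranking the known authors by increasing distance directly yields the most similar ones first, which is all precision@k needs.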
The results seem rather ambiguous and do not show a clear trend, which is why we stop iterating over the preprocessing parameters here.