Gensim Benchmark Visualizations


In [ ]:
import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="white")

In [ ]:
OUT_DIR = "../../Data/out"
perf_out_file = "perf_enron.csv"
df_perf = pd.read_csv(os.path.join(OUT_DIR, perf_out_file))
# Columns whose names contain 'pass' hold the timing of each benchmark run
time_cols = [col for col in df_perf.columns if 'pass' in col]
# Average the per-run timings into a single mean_time per configuration
df_perf['mean_time'] = df_perf[time_cols].mean(axis=1)
df_perf.head()

In [ ]:
df_perf.describe()

In [ ]:
sns.pointplot(x='num_topics', hue="implementation", y='mean_time', data=df_perf, pallete="Set2")

Performance

Going deeper down the rabbit hole, let's see how Gensim's LdaMulticore fares against Mallet's LDA in terms of training time.
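
The benchmark script that produced perf_enron.csv is not part of this notebook, so as a rough illustration only: per-run timings of the kind averaged into mean_time could be collected with something like the sketch below, assuming gensim's LdaMulticore and pre-built corpus and dictionary objects (the function name and n_runs parameter are purely illustrative).


In [ ]:
import time
from gensim.models import LdaMulticore

def time_lda_runs(corpus, dictionary, num_topics, workers, n_runs=3):
    """Train LdaMulticore n_runs times and return the wall-clock time of each run."""
    timings = []
    for _ in range(n_runs):
        start = time.time()
        LdaMulticore(corpus=corpus, id2word=dictionary,
                     num_topics=num_topics, workers=workers)
        timings.append(time.time() - start)
    return timings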


In [ ]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.pointplot(x='num_topics', hue="iterations", y='mean_time', data=df_perf[df_perf['implementation'] == 'Gensim'], pallete="Set1", ax=ax1)
sns.pointplot(x='num_topics', hue="iterations", y='mean_time', data=df_perf[df_perf['implementation'] == 'Mallet'], pallete="Set3", ax=ax2)
plt.suptitle('Num topics vs mean time', fontsize=20)
ax1.set_title('Gensim')
ax2.set_title('Mallet')

In [ ]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.pointplot(x='num_topics', hue="workers", y='mean_time', data=df_perf[df_perf['implementation'] == 'Gensim'], pallete="Pastel", ax=ax1)
sns.pointplot(x='num_topics', hue="workers", y='mean_time', data=df_perf[df_perf['implementation'] == 'Mallet'], pallete="Set3", ax=ax2)
plt.suptitle('Num topics vs mean time', fontsize=20)
ax1.set_title('Gensim')
ax2.set_title('Mallet')

Coherence

Next we will look at the coherence of the models we obtain. Note up front that this will not be an apples-to-apples comparison: both libraries expose many more parameters over which the models can be tuned, and this study has not exhausted all of them.

So it is entirely possible to end up with a much better model than the ones displayed here.

Also note that unlike the LDA timings, which were averaged over n runs, the current code does not average coherence over those n runs, because computing coherence is an expensive call. These numbers will therefore be slightly noisier.
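For context, a sketch of the kind of call involved, using gensim's CoherenceModel (lda_model, corpus and dictionary are assumed to already exist and are not defined in this notebook):


In [ ]:
from gensim.models import CoherenceModel

# u_mass coherence can be computed from the bag-of-words corpus alone,
# but it still scans the corpus, which is what makes it expensive to repeat.
cm = CoherenceModel(model=lda_model, corpus=corpus,
                    dictionary=dictionary, coherence='u_mass')
print(cm.get_coherence())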


In [ ]:
if 'UMass_coherence' in df_perf.columns:
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
    sns.pointplot(x='num_topics', hue="iterations", y='UMass_coherence', data=df_perf[df_perf['implementation'] == 'Gensim'], palette="Blues", ax=ax1)
    sns.pointplot(x='num_topics', hue="iterations", y='UMass_coherence', data=df_perf[df_perf['implementation'] == 'Mallet'], palette="Blues", ax=ax2)
    plt.suptitle('Num topics vs UMass coherence', fontsize=20)
    ax1.set_title('Gensim')
    ax2.set_title('Mallet')
else:
    print("UMass_coherence not available in file")

Whether a tf-idf transformation is appropriate for our LDA
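
For reference, the tf-idf variant amounts to re-weighting the bag-of-words corpus before training, roughly as in the sketch below (corpus and dictionary are assumed to already exist; the num_topics and workers values are placeholders, not the benchmark settings).


In [ ]:
from gensim.models import TfidfModel, LdaMulticore

# Re-weight the bag-of-words corpus with tf-idf, then train LDA on the result
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda_tfidf = LdaMulticore(corpus=corpus_tfidf, id2word=dictionary,
                         num_topics=20, workers=2)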


In [ ]:
if 'UMass_coherence' in df_perf.columns:
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
    sns.pointplot(x='num_topics', hue="tf-idf", y='UMass_coherence', data=df_perf[df_perf['implementation'] == 'Gensim'], palette="Blues", ax=ax1)
    sns.pointplot(x='num_topics', hue="tf-idf", y='UMass_coherence', data=df_perf[df_perf['implementation'] == 'Mallet'], palette="Blues", ax=ax2)
    plt.suptitle('Num topics vs UMass coherence', fontsize=20)
    ax1.set_title('Gensim')
    ax2.set_title('Mallet')
else:
    print("UMass_coherence not available in file")