In this session, we're going to analyze the results of the small experiment to determine which method and distance metric perform best, so we can use only those in the larger experiment. A bit of explanation: in this small experiment, we didn't vary the number of normal posts, the number of OOT posts, the number of posts in the top list, or the features. The reason is that we think a good method and distance metric should be able to detect OOT posts regardless of the domain, so the forum variation is more important here. That's why we varied the threads instead and left the other settings fixed.
Import our toolkits and display matplotlib plots inline.
In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
Read our experiment result.
In [2]:
df = pd.read_hdf('../reports/small-exp.h5', 'df')
df
Out[2]:
Let's see how well each method performed. We need to compute the average baseline and the performance distribution of each method over all features, distance metrics, and threads.
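To illustrate what `groupby(level=...)` does here, below is a minimal sketch on made-up data (the method and thread names are placeholders, not the real experiment's index): averaging a MultiIndex-ed frame over one index level collapses all the other levels.

```python
import pandas as pd

# Hypothetical (method, thread) index -- not the real experiment data
index = pd.MultiIndex.from_product(
    [['txt_comp_dist', 'clust_dist'], ['thread1', 'thread2']],
    names=['method', 'thread'])
scores = pd.DataFrame({'perf': [0.8, 0.6, 0.5, 0.7]}, index=index)

# Keep one row per method: the mean over all threads
avg = scores.groupby(level='method').mean()
print(avg)
```

Each method's row in `avg` is the mean of its two thread rows, which is exactly the averaging applied to `df` below.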
In [3]:
df2 = df.groupby(level='method').mean()
In [4]:
df2
Out[4]:
We don't actually need the baseline, so let's remove it.
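As a small aside, dropping one label from a single level of a column MultiIndex works like this sketch (the `feature`/`result` column names mirror the experiment frame, but the data is made up):

```python
import pandas as pd

# Hypothetical columns with a 'result' level holding 'base' and 'perf'
cols = pd.MultiIndex.from_product(
    [['f1'], ['base', 'perf']], names=['feature', 'result'])
toy = pd.DataFrame([[0.2, 0.9]], columns=cols)

# Remove every 'base' column, whatever feature it belongs to
trimmed = toy.drop('base', axis=1, level='result')
print(trimmed.columns.get_level_values('result').tolist())  # ['perf']
```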
In [5]:
df3 = df2.drop('base', axis=1, level='result')
In [6]:
df3
Out[6]:
In [7]:
df3.T.plot(kind='bar', subplots=True, ylim=(0.,1.))
Out[7]:
From the plot, we can see that txt_comp_dist
outperformed the other two methods, since its probability distribution is more negatively skewed. To make this clearer, let's compute the expected value of each method's distribution.
In [8]:
df3
Out[8]:
In [9]:
df4 = df3 * np.arange(4)
In [10]:
df4
Out[10]:
In [11]:
df4.sum(axis=1, level='result')
Out[11]:
We see that txt_comp_dist
is indeed superior to the others.
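The expected-value trick used above (multiplying by `np.arange(4)` and summing) can be sanity-checked on a toy distribution over four performance ranks; the probabilities here are invented for illustration:

```python
import numpy as np
import pandas as pd

# A hypothetical distribution over ranks 0..3 for one method
probs = pd.DataFrame([[0.1, 0.2, 0.3, 0.4]],
                     index=['txt_comp_dist'],
                     columns=range(4))

# E[X] = sum_k k * P(X = k)
expected = (probs * np.arange(4)).sum(axis=1)
print(expected)  # 0*0.1 + 1*0.2 + 2*0.3 + 3*0.4 = 2.0
```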
Now, let's see which distance metric is best. Again, we have to compute the performance distribution of each distance metric over all methods, features, and threads. Let's do it. And don't forget to remove the baseline. (I'm getting better at this, yay!)
In [12]:
df5 = df.groupby(level='metric').mean().drop('base', axis=1, level='result')
In [13]:
df5
Out[13]:
In [14]:
df5.T.plot(kind='bar', subplots=True, ylim=(0.,1.))
Out[14]:
The differences are very subtle. Let's compute the expected values instead.
In [15]:
df6 = df5 * np.arange(4)
In [16]:
df6
Out[16]:
In [17]:
df6.sum(axis=1, level='result')
Out[17]:
Although they don't differ much, we can still conclude that euclidean
is better.
OK. Now we can safely use only the txt_comp_dist
method with the euclidean
distance metric in our larger experiment. Yay!