In many cases, normalizing the tfidf weights of each term favors the terms of shorter documents. The pivoted document length normalization scheme counters this bias toward short documents by making tfidf independent of document length.
This is achieved by tilting the normalization curve around a user-defined pivot point with some slope, roughly following the equation:
pivoted_norm = (1 - slope) * pivot + slope * old_norm
This scheme was proposed in the paper "Pivoted Document Length Normalization".
Overall, this approach can in many cases increase the accuracy of the model when document lengths vary widely across the entire corpus.
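As a minimal sketch of the equation above, the snippet below applies the pivoted norm to a toy weight vector in plain Python. The helper names `pivoted_norm` and `normalize` and the example weights are assumptions made for illustration. This is not gensim's implementation.

```python
import math


def pivoted_norm(old_norm, pivot, slope):
    """Tilt the normalization factor around the pivot with the given slope."""
    return (1 - slope) * pivot + slope * old_norm


def normalize(weights, pivot, slope):
    """Divide raw weights by the pivoted version of their Euclidean norm."""
    old_norm = math.sqrt(sum(w * w for w in weights))
    return [w / pivoted_norm(old_norm, pivot, slope) for w in weights]


# Hypothetical raw tfidf weights for one document (Euclidean norm = 5.0).
weights = [3.0, 4.0]

# slope = 1 leaves the old norm untouched, i.e. ordinary cosine normalization.
print(normalize(weights, pivot=10, slope=1.0))

# slope = 0 divides every document by the pivot, whatever its own norm is.
print(normalize(weights, pivot=10, slope=0.0))
```

A document whose old norm equals the pivot is normalized identically for every slope, which is why the pivot acts as the fixed point of the tilt.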
In [1]:
%matplotlib inline
from sklearn.linear_model import LogisticRegression
from gensim.corpora import Dictionary
# NOTE: gensim.sklearn_api was removed in Gensim 4.0, so this import needs Gensim 3.x.
from gensim.sklearn_api.tfidf import TfIdfTransformer
from gensim.matutils import corpus2csc
import numpy as np
import matplotlib.pyplot as py
import gensim.downloader as api
In [2]:
# Return the model accuracy and the individual document decision scores using
# gensim's TfIdfTransformer and sklearn's LogisticRegression.
def get_tfidf_scores(kwargs):
    tfidf_transformer = TfIdfTransformer(**kwargs).fit(train_corpus)

    # corpus2csc returns a terms x documents matrix, so transpose it for sklearn.
    X_train_tfidf = corpus2csc(tfidf_transformer.transform(train_corpus), num_terms=len(id2word)).T
    X_test_tfidf = corpus2csc(tfidf_transformer.transform(test_corpus), num_terms=len(id2word)).T

    clf = LogisticRegression().fit(X_train_tfidf, y_train)

    model_accuracy = clf.score(X_test_tfidf, y_test)
    doc_scores = clf.decision_function(X_test_tfidf)

    return model_accuracy, doc_scores
In [3]:
# Sort the documents by ascending score and return the sorted scores
# together with the corresponding document lengths (in characters).
def sort_length_by_score(doc_scores, X_test):
    doc_scores = sorted(enumerate(doc_scores), key=lambda x: x[1])
    doc_leng = np.empty(len(doc_scores))

    ds = np.empty(len(doc_scores))

    for i, (doc_index, score) in enumerate(doc_scores):
        doc_leng[i] = len(X_test[doc_index])
        ds[i] = score

    return ds, doc_leng
In [4]:
nws = api.load("20-newsgroups")
In [5]:
cat1, cat2 = ('sci.electronics', 'sci.space')

X_train = []
X_test = []
y_train = []
y_test = []

# Keep only the two chosen topics; label cat1 as 0 and cat2 as 1.
for i in nws:
    if i["set"] == "train" and i["topic"] == cat1:
        X_train.append(i["data"])
        y_train.append(0)
    elif i["set"] == "train" and i["topic"] == cat2:
        X_train.append(i["data"])
        y_train.append(1)
    elif i["set"] == "test" and i["topic"] == cat1:
        X_test.append(i["data"])
        y_test.append(0)
    elif i["set"] == "test" and i["topic"] == cat2:
        X_test.append(i["data"])
        y_test.append(1)
In [6]:
id2word = Dictionary([_.split() for _ in X_train])
train_corpus = [id2word.doc2bow(i.split()) for i in X_train]
test_corpus = [id2word.doc2bow(i.split()) for i in X_test]
In [7]:
print(len(X_train), len(X_test))
In [8]:
# We perform our analysis on the top k documents, where k is roughly 10% of the test set.
# Integer division keeps k usable as a slice index.
k = len(X_test) // 10
In [9]:
params = {}
model_accuracy, doc_scores = get_tfidf_scores(params)
print(model_accuracy)
In [10]:
print(
    "Normal cosine normalisation favors short documents as our top {} "
    "docs have a smaller mean doc length of {:.3f} compared to the corpus mean doc length of {:.3f}"
    .format(
        k, sort_length_by_score(doc_scores, X_test)[1][:k].mean(),
        sort_length_by_score(doc_scores, X_test)[1].mean()
    )
)
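The comparison printed above boils down to ranking documents by score and averaging the lengths of the first k. A minimal pure-Python sketch of that idea follows. The scores, texts, and the name `toy_k` are invented for illustration and are not taken from 20-newsgroups.

```python
# Hypothetical decision scores and documents, invented purely for illustration.
scores = [-2.5, 0.4, -0.1, 1.8, -1.2]
docs = ["short", "a much longer document", "medium text", "tiny", "brief one"]

# Rank document indices by ascending score, mirroring sort_length_by_score.
order = sorted(range(len(scores)), key=lambda i: scores[i])

# Document length is measured in characters, as in the notebook.
lengths = [len(docs[i]) for i in order]

toy_k = 2  # hypothetical cut-off for the "top" documents
top_k_mean = sum(lengths[:toy_k]) / toy_k
corpus_mean = sum(lengths) / len(lengths)

print("top-k mean length:", top_k_mean)
print("corpus mean length:", corpus_mean)
```

If the top-k mean is well below the corpus mean, the ranking favors short documents, which is the pattern the cell above reports for plain cosine normalisation.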
In [11]:
best_model_accuracy = 0
optimum_slope = 0

# Try slopes from 0 to 1 in steps of 0.1 and keep the most accurate one.
for slope in np.arange(0, 1.1, 0.1):
    params = {"pivot": 10, "slope": slope}

    model_accuracy, doc_scores = get_tfidf_scores(params)

    if model_accuracy > best_model_accuracy:
        best_model_accuracy = model_accuracy
        optimum_slope = slope

    print("Score for slope {} is {}".format(slope, model_accuracy))

print("We get best score of {} at slope {}".format(best_model_accuracy, optimum_slope))
In [12]:
params = {"pivot": 10, "slope": optimum_slope}
model_accuracy, doc_scores = get_tfidf_scores(params)
print(model_accuracy)
In [13]:
print(
    "With pivoted normalisation top {} docs have mean length of {:.3f} "
    "which is much closer to the corpus mean doc length of {:.3f}"
    .format(
        k, sort_length_by_score(doc_scores, X_test)[1][:k].mean(),
        sort_length_by_score(doc_scores, X_test)[1].mean()
    )
)
Cosine normalization favors the retrieval of short documents. In the plot below, at slope 1 (that is, without pivoted normalisation), short documents with a length of around 500 score very well, which shows the bias toward short documents. As we lowered the slope from 1 toward 0, we introduced a new bias toward long documents that counters the bias caused by cosine normalisation. At a certain point we therefore reach an optimum slope (the value reported by the search above) at which the overall accuracy of the model increases.
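To make this trade-off concrete, the sketch below compares how much a document's weights grow or shrink relative to plain cosine normalization as the slope changes. It assumes that longer documents have larger norms. The helper name, the pivot, and both example norms are hypothetical values chosen for illustration.

```python
def pivoted_norm(old_norm, pivot, slope):
    """Pivoted normalization factor from the equation at the top of the notebook."""
    return (1 - slope) * pivot + slope * old_norm


pivot = 10
short_norm, long_norm = 4.0, 40.0  # hypothetical norms below and above the pivot

# For each slope, store the factor by which weights change compared with
# cosine normalization: old_norm / pivoted_norm.
boosts = {}
for slope in (1.0, 0.5, 0.0):
    short_boost = short_norm / pivoted_norm(short_norm, pivot, slope)
    long_boost = long_norm / pivoted_norm(long_norm, pivot, slope)
    boosts[slope] = (short_boost, long_boost)
    print("slope={}: short x{:.3f}, long x{:.3f}".format(slope, short_boost, long_boost))
```

At slope 1 both factors equal 1, so nothing changes. As the slope decreases, the factor for the document below the pivot drops under 1 while the factor for the document above the pivot rises over 1, which is the counter-bias toward long documents described above.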
In [14]:
best_model_accuracy = 0
optimum_slope = 0

# One panel per slope: slope=1 (no pivoting) next to slope=0.2.
slopes = [1, 0.2]
f, axarr = py.subplots(1, len(slopes), figsize=(15, 7))

for ax, slope in zip(axarr, slopes):
    params = {"pivot": 10, "slope": slope}

    model_accuracy, doc_scores = get_tfidf_scores(params)

    if model_accuracy > best_model_accuracy:
        best_model_accuracy = model_accuracy
        optimum_slope = slope

    doc_scores, doc_leng = sort_length_by_score(doc_scores, X_test)

    # Plot 1-D arrays: absolute score against document length for the top k documents.
    y = abs(doc_scores[:k])
    x = doc_leng[:k]

    ax.bar(x, y, linewidth=10.)
    ax.set_title("slope = " + str(slope) + " Model accuracy = " + str(model_accuracy))
    ax.set_ylim([0, 4.5])
    ax.set_xlim([0, 3200])
    ax.set_xlabel("document length")
    ax.set_ylabel("confidence score")

py.tight_layout()
py.show()
The bar plots above help us visualize the effect of slope. For the top k documents, document length is on the x axis and the score of belonging to a specific class is on the y axis.
As we decrease the slope, the density of bars shifts from short documents (lengths of around 250-500) to documents longer than about 500. This suggests that the positive bias toward short documents seen at slope=1 (that is, with regular tfidf) is now reduced. We get the optimum slope, or the maximum model accuracy, at slope 0.2.