When predicting a large number of classes, it often becomes inefficient to train a generalized linear model for each distinct class (i.e. the parameters for each softmax node at the end of a neural network). One proposed solution has been to use offline clustering of output label representations to create a binary code for each label and then use only about log2(K) distinct sigmoid outputs at the end of a neural network, where K is the number of classes (see "A Scalable Hierarchical Distributed Language Model" by Mnih and Hinton).
Why not instead learn an optimal encoding of the output labels interleaved with training the predictive model? Here we're experimenting with assigning initial random codes, training a multi-output linear regression, and then finding better output codes by averaging the predicted output codes of the samples associated with each class (sketched at the end of the notebook).
In [1]:
import numpy as np
import sklearn.datasets
import sklearn.linear_model
In [2]:
# sparse token-count features and integer newsgroup labels for the 20 Newsgroups posts
data = sklearn.datasets.fetch_20newsgroups_vectorized()
In [3]:
data
Out[3]:
In [4]:
X = data['data']; Y = data['target']
In [10]:
model = sklearn.linear_model.Ridge()
In [12]:
from sklearn.feature_extraction.text import TfidfTransformer
# with use_idf=False and the default norm='l2', this simply L2-normalizes each count vector
tfidf = TfidfTransformer(use_idf=False).fit(X)
X_tfidf = tfidf.transform(X)
In [19]:
n_samples = len(Y)
n_labels = len(set(Y))                    # 20 newsgroups -> 20 labels
n_bits = int(np.ceil(np.log2(n_labels)))  # ceil(log2(20)) = 5 bits -> 2**5 = 32 possible codes
In [28]:
# generate a random, distinct binary code for each output label
output_codes = {}
for output_label in set(Y):
    candidate_code = tuple((np.random.randn(n_bits) > 0).astype(float))
    while candidate_code in output_codes.values():
        candidate_code = tuple((np.random.randn(n_bits) > 0).astype(float))
    output_codes[output_label] = candidate_code
In [29]:
output_codes
Out[29]:
In [30]:
encoded_Y = np.array([output_codes[yi] for yi in Y])
In [31]:
encoded_Y
Out[31]:
In [32]:
model.fit(X_tfidf, encoded_Y)
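The notebook stops after the fit, so here is one possible next cell for evaluating the random codes: decode each prediction back to the nearest output code and check training accuracy. The nearest-code (Euclidean) decoding rule and the variable names below are assumptions, not something from the original notebook.
In [ ]:
# decode each prediction to the label whose code is nearest (assumed Euclidean rule)
labels_sorted = sorted(output_codes)
code_matrix = np.array([output_codes[label] for label in labels_sorted])  # (n_labels, n_bits)

predicted = model.predict(X_tfidf)  # (n_samples, n_bits), real-valued code predictions

dists = ((predicted[:, None, :] - code_matrix[None, :, :]) ** 2).sum(axis=2)
decoded = np.array(labels_sorted)[dists.argmin(axis=1)]
(decoded == Y).mean()  # training accuracy under the random codes
Below is a sketch of the re-estimation step described in the introduction: better codes are found by averaging the predicted codes of the samples in each class and thresholding the result. The 0.5 threshold and the note on collisions are assumptions.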
In [ ]:
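# re-estimate each label's code as the thresholded mean of its samples' predictions
new_codes = {}
for label in labels_sorted:
    mean_code = predicted[Y == label].mean(axis=0)
    new_codes[label] = tuple((mean_code > 0.5).astype(float))
# thresholding can produce duplicate codes, so collisions would need to be
# resolved before re-encoding Y with new_codes and refitting the model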