The task is to classify each chord segment with a key.
Each data point represents a chord segment.
Inputs are chord segments described by their duration and 12 binary flags indicating the pitch classes contained in the chord. Thus the input dimension is 1 + 12 = 13.
The output for each data point is the key active within the segment, represented by its canonical diatonic key root pitch class, so there are 12 possible class values. With one-hot encoding the output dimension is 12.
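For illustration, a single data point might look like this (a hypothetical example; the pitch-class order matches the columns used below):
In [ ]:
# hypothetical example: a C major chord (pitch classes C, E, G) lasting 2.0 beats,
# labeled with the key of C (root pitch class 0)
x_example = [2.0,                                 # duration
             1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # C Db D Eb E F Gb G Ab A Bb B
y_example = 0                                     # key root pitch class (C)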
Segments with no key (silence) are dropped from the data. First we can try to classify each chord segment independently to establish a baseline performance. In practice this is not too realistic, since the key may be distinguishable only from a set of multiple chords.
Then we can try using short fixed-size sequences of adjacent chord segments as inputs. This is more realistic and still usable with non-recurrent ML models.
Ideally we should handle the fact that the sequence of chords is of a variable length and use recurrent models and not be constrained by a fixed window.
As for the ML models, we should start with a plain multi-class logistic regression.
Then we could try deep neural networks:
TODO
In [2]:
%pylab inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import glob
pylab.rcParams['figure.figsize'] = (16, 12)
In [11]:
dataset = pd.read_csv('data/beatles/derived/all_chords_with_keys.tsv', sep='\t')
dataset = dataset.dropna()
dataset
Out[11]:
In [4]:
pcs_columns = ['C','Db','D','Eb','E','F','Gb','G','Ab','A','Bb','B']
# note: we could also try using features like 'root' or 'bass'
input_cols = pcs_columns
output_col = 'key_diatonic_root'
def select_X_y(df):
    X = df[input_cols].astype(np.int32)
    y = df[output_col].astype(np.int32)
    return X, y
In [ ]:
X, y = select_X_y(dataset)
In [197]:
len(dataset)
Out[197]:
In [198]:
X.head()
Out[198]:
In [199]:
y.head()
Out[199]:
In [24]:
# train_test_split moved from sklearn.cross_validation to sklearn.model_selection in scikit-learn 0.18+
from sklearn.model_selection import train_test_split
def split_dataset(X, y, test_percentage=0.2, valid_percentage=0.2, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_percentage, random_state=random_state)
    # the validation fraction is taken relative to the remaining non-test data
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=valid_percentage/(1-test_percentage), random_state=random_state)
    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)
In [ ]:
((X_train, y_train), (X_valid, y_valid), (X_test, y_test)) = split_dataset(X, y)
[d.shape for d in (X_train, y_train, X_valid, y_valid, X_test, y_test)]
In [27]:
from sklearn.linear_model import LogisticRegression
In [202]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train.ravel())
Out[202]:
In [31]:
from sklearn.metrics import classification_report, confusion_matrix
def accuracy_report(model, X_train, y_train, X_valid, y_valid):
    # score() returns the mean accuracy on the given data and labels as a fraction
    accuracy_train, accuracy_valid = model.score(X_train, y_train), model.score(X_valid, y_valid)
    print('training set accuracy (%):', 100 * accuracy_train)
    print('validation set accuracy (%):', 100 * accuracy_valid)
    print('difference (% points):', 100 * (accuracy_train - accuracy_valid))
In [ ]:
accuracy_report(lr_model, X_train, y_train, X_valid, y_valid)
In [271]:
# TODO: we could better use http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html
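A minimal sketch of that DummyClassifier baseline (the 'stratified' strategy samples predictions from the training-set class priors):
In [ ]:
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='stratified', random_state=42)
dummy.fit(X_train, y_train)
print('stratified dummy accuracy (%):', 100 * dummy.score(X_valid, y_valid))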
In [208]:
random_batches = 100
# np.random.random_integers is deprecated; note that randint's upper bound is exclusive
y_random_uniform = np.random.randint(low=0, high=12, size=(len(X_valid), random_batches))
y_pred_lr_valid = lr_model.predict(X_valid)
In [209]:
def prior_class_probabilities(y):
    priors = pd.Series(np.zeros(12))
    hist = pd.Series(y.ravel()).value_counts() / len(y)
    priors[hist.index] = hist
    return priors
priors_valid = prior_class_probabilities(y_valid)
priors_valid
Out[209]:
In [211]:
# random samples of classes based on prior probabilities of each class
y_random_prior = np.random.choice(np.arange(12), size=(len(y_valid), random_batches), p=priors_valid)
# sanity check: the empirical class distribution of the first random batch (column)
pd.Series(y_random_prior[:, 0]).value_counts() / len(y_valid)
Out[211]:
In [218]:
mean_random_uniform_accuracy = ((y_random_uniform == np.outer(np.ones(y_random_uniform.shape[1]), y_valid).T).sum(axis=0) / len(X_valid)).mean() * 100
print('validation set, mean accuracy of uniform random guess (%):', mean_random_uniform_accuracy)
mean_random_prior_accuracy = ((y_random_prior == np.outer(np.ones(y_random_prior.shape[1]), y_valid).T).sum(axis=0) / len(X_valid)).mean() * 100
print('validation set, mean accuracy of random guess with prior class probabilities (%):', mean_random_prior_accuracy)
lr_accuracy = sum(y_pred_lr_valid == y_valid.ravel()) / len(X_valid) * 100
print('validation set, accuracy of LR model (%):', lr_accuracy)
print('the model is better than uniform random guess', lr_accuracy / mean_random_uniform_accuracy, 'times')
print('the model is better than prior random guess', lr_accuracy / mean_random_prior_accuracy, 'times')
The seemingly good performance of the simple model is largely an artifact of the skewed class priors it was trained on. Compared to random guessing based on those priors, the model is only about 3x better.
In [219]:
print(classification_report(y_valid, y_pred_lr_valid))
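A confusion matrix over the 12 keys makes the error structure more visible; a minimal sketch using the confusion_matrix imported above (assuming class labels 0-11 map to the pcs_columns names):
In [ ]:
cm = confusion_matrix(y_valid, y_pred_lr_valid)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=pcs_columns, yticklabels=pcs_columns)
xlabel('predicted key root')
ylabel('true key root');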
Let's look at the actual results of classification on the validation set.
In [56]:
def classification_results(X, y_true, y_pred):
    return pd.DataFrame(
        np.hstack([X, np.asarray(y_true).reshape(-1, 1), np.asarray(y_pred).reshape(-1, 1)]),
        columns=input_cols + ['label_true', 'label_pred'])
In [39]:
classification_valid = classification_results(X_valid, y_valid, y_pred_lr_valid)
classification_valid['is_correct'] = classification_valid['label_true'] == classification_valid['label_pred']
print('accuracy on validation set (%)', sum(classification_valid['is_correct']) / len(classification_valid) * 100)
The score() from the LogisticRegression model indeed corresponds to the manually computed accuracy.
In [59]:
def key_classification_error_report(y_true, y_pred):
    errors = pd.Series(((y_true - y_pred) + 12) % 12)
    errors_hist = errors.value_counts()
    stem(errors_hist / len(errors) * 100)
    gca().set_xticks(np.arange(len(errors_hist)))
    gca().set_xticklabels(errors_hist.index)
    xlabel('error (true - predicted key pitch class)')
    ylabel('%')
    return errors, errors_hist
In [264]:
key_classification_error_report(classification_valid['label_true'], classification_valid['label_pred']);
In [248]:
dataset_synth = pd.read_csv('data/beatles/derived/all_chords_with_keys_synth.tsv', sep='\t')
In [249]:
dataset_synth = dataset_synth.dropna()
X, y = select_X_y(dataset_synth)
In [250]:
len(dataset_synth), len(X), len(y)
Out[250]:
In [251]:
((X_train, y_train), (X_valid, y_valid), (X_test, y_test)) = split_dataset(X, y)
In [253]:
lr_model_synth = LogisticRegression()
lr_model_synth.fit(X_train, y_train)
Out[253]:
In [254]:
accuracy_report(lr_model_synth, X_train, y_train, X_valid, y_valid)
lr_synth_accuracy = 100 * lr_model_synth.score(X_valid, y_valid)
print('accuracy of the model without synth data (%):', lr_accuracy)
print('the model is better than uniform random guess', lr_synth_accuracy / mean_random_uniform_accuracy, 'times')
We can see the model is worse than the model trained on the data with skewed classes, but not dramatically so.
In [266]:
y_pred_lr_synth_valid = lr_model_synth.predict(X_valid)
In [267]:
key_classification_error_report(y_valid, y_pred_lr_synth_valid);
Since there is almost no gap between the training and validation accuracy, the model seems to be highly biased, i.e. just too simple. We can try a few combinations of regularization parameters, but we do not expect regularization to help here.
As for the kinds of errors, we can see the predicted key is often quite close to the true key on the circle of fifths. Moreover, the distribution of errors nicely follows the two-way distance on the circle of fifths. This is important and should be taken into account when choosing or designing the cost function or validation metric.
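For example, such a metric could weight errors by distance on the circle of fifths rather than treating all misclassifications equally; a minimal sketch:
In [ ]:
def circle_of_fifths_distance(key_a, key_b):
    # a pitch class p sits at position (p * 7) % 12 on the circle of fifths
    # (a fifth is 7 semitones); the distance is the shorter way around the circle
    pos_a, pos_b = (key_a * 7) % 12, (key_b * 7) % 12
    d = abs(pos_a - pos_b)
    return min(d, 12 - d)
# e.g. C (0) and G (7) are neighbors on the circle: distance 1
print(circle_of_fifths_distance(0, 7))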
In [259]:
# GridSearchCV moved from sklearn.grid_search to sklearn.model_selection in scikit-learn 0.18+
from sklearn.model_selection import GridSearchCV
def tune_lr_model_on_grid():
    # the liblinear solver supports both the l1 and l2 penalties
    model = LogisticRegression(solver='liblinear')
    grid = GridSearchCV(estimator=model, param_grid=dict(C=[0.1, 0.5, 1], penalty=['l2', 'l1']))
    # let the CV do multiple train-valid splits; hold out only the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    grid.fit(X_train, y_train)
    return grid
lr_model_synth_grid = tune_lr_model_on_grid()
print(lr_model_synth_grid)
print('best mean accuracy:', lr_model_synth_grid.best_score_)
print(lr_model_synth_grid.best_estimator_.C, lr_model_synth_grid.best_estimator_.penalty)
print(lr_model_synth_grid.cv_results_['mean_test_score'])  # grid_scores_ in older scikit-learn
As we can see, regularization didn't help.
In [290]:
from sklearn.linear_model import SGDClassifier
lsgd_model_synth = SGDClassifier(loss='log')  # renamed to 'log_loss' in scikit-learn >= 1.1
lsgd_scores = {'train': [], 'valid': []}
for i in range(15):
    # partial_fit on the whole training set each iteration amounts to full-batch gradient steps
    lsgd_model_synth.partial_fit(X_train, y_train, classes=np.arange(12))
    lsgd_scores['train'].append(lsgd_model_synth.score(X_train, y_train))
    lsgd_scores['valid'].append(lsgd_model_synth.score(X_valid, y_valid))
    print('iter:', i, 'validation score:', lsgd_scores['valid'][-1])
plot(lsgd_scores['train'], label='train')
plot(lsgd_scores['valid'], label='validation')
legend();
It seems that logistic regression trained with SGD (here fed the full batch each iteration) reaches a very similar result.
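For genuinely stochastic updates we could feed shuffled mini-batches to partial_fit instead of the full training set; a minimal sketch (the batch size is an arbitrary choice):
In [ ]:
batch_size = 256
idx = np.random.permutation(len(X_train))
for start in range(0, len(idx), batch_size):
    batch = idx[start:start + batch_size]
    lsgd_model_synth.partial_fit(X_train.iloc[batch], y_train.iloc[batch], classes=np.arange(12))
print('validation score:', lsgd_model_synth.score(X_valid, y_valid))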
In [293]:
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier()
tree_model = tree_model.fit(X_train, y_train)
print('training score', tree_model.score(X_train, y_train))
print('validation score', tree_model.score(X_valid, y_valid))
Also a simple decision tree gives a similar result.
As we could see, classifying the key from a single chord gives results that are not completely bad, but the approach is very limited. Naturally, even humans perceive the key in the context of multiple chords. So we should now move to input features that consist of sequences of multiple chords, where we expect much better results.
In order to keep things simple we restrict the problem so that each sequence of adjacent chords lies within a single key, i.e. no sequence spans multiple keys. We then classify the key for the whole sequence.
A question is how the context size affects the accuracy. What is a good tradeoff between accuracy and redundancy?
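The rolling dataset loaded below was precomputed, but conceptually the features could be built roughly like this (a sketch; the 'song_id' grouping column is an assumption about the source data):
In [ ]:
def make_rolling_features(df, window=16):
    rows = []
    for _, song in df.groupby('song_id'):  # hypothetical per-song grouping column
        pcs = song[pcs_columns].values
        keys = song[output_col].values
        for i in range(len(song) - window + 1):
            # keep only windows that lie entirely within a single key
            if (keys[i:i + window] == keys[i]).all():
                rows.append(np.concatenate([pcs[i:i + window].ravel(), [keys[i]]]))
    return pd.DataFrame(rows)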
In [72]:
dataset = pd.read_csv('data/beatles/derived/all_chords_with_keys_synth_rolling_16.tsv', sep='\t')
X, y = dataset.iloc[:, :-1].astype(np.int32), dataset.iloc[:, -1].astype(np.int32)
((X_train, y_train), (X_valid, y_valid), (X_test, y_test)) = split_dataset(X, y)
[d.shape for d in (X_train, y_train, X_valid, y_valid, X_test, y_test)]
Out[72]:
In [39]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
Out[39]:
In [52]:
lr_model.score(X_train, y_train), lr_model.score(X_valid, y_valid)
Out[52]:
In [60]:
key_classification_error_report(y_valid, lr_model.predict(X_valid));
By adding context we get a significant improvement in classification performance, e.g. from 38% to 71% (!) even with a simple logistic regression model.
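To quantify the context-size tradeoff we could repeat the experiment for several window sizes (assuming the corresponding precomputed files exist alongside the 16-chord one):
In [ ]:
for window in (2, 4, 8, 16):
    path = 'data/beatles/derived/all_chords_with_keys_synth_rolling_%d.tsv' % window
    df_w = pd.read_csv(path, sep='\t')
    X_w, y_w = df_w.iloc[:, :-1].astype(np.int32), df_w.iloc[:, -1].astype(np.int32)
    (X_tr, y_tr), (X_va, y_va), _ = split_dataset(X_w, y_w)
    model = LogisticRegression().fit(X_tr, y_tr)
    print('window %2d: validation accuracy (%%): %.1f' % (window, 100 * model.score(X_va, y_va)))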
In [88]:
import theanets
from sklearn.metrics import accuracy_score
In [95]:
def theano_accuracy_report(model):
    y_train_pred = model.predict(X_train.astype(float32))
    y_valid_pred = model.predict(X_valid.astype(float32))
    print('training score:', accuracy_score(y_train, y_train_pred))
    print('validation score:', accuracy_score(y_valid, y_valid_pred))
    key_classification_error_report(y_valid, y_valid_pred)
In [75]:
# no hidden layers, no pretraining or regularization
exp = theanets.Experiment(theanets.Classifier, layers=[X.shape[1], 12])
exp.train(
    (X_train.astype(float32), y_train),
    (X_valid.astype(float32), y_valid),
    algo='sgd', learning_rate=1e-4, momentum=0.9)
Out[75]:
In [96]:
theano_accuracy_report(exp.network)
In [102]:
exp.save('data/beatles/models/chords_keys_synth_theano_01.model')
In [77]:
# one hidden layer of size 100, autoencoder pretraining, no regularization
exp02 = theanets.Experiment(theanets.Classifier, layers=[X.shape[1], 100, 12])
exp02.train(
    (X_train.astype(float32), y_train),
    (X_valid.astype(float32), y_valid),
    algo='pretraining', learning_rate=1e-4, momentum=0.9)
Out[77]:
In [97]:
theano_accuracy_report(exp02.network)
In [104]:
exp02.save('data/beatles/models/chords_keys_synth_theano_02.model')
In [83]:
# one hidden layer of size 100, autoencoder pretraining, input dropout regularization
exp03 = theanets.Experiment(theanets.Classifier, layers=[X.shape[1], 100, 12])
exp03.train(
    (X_train.astype(float32), y_train),
    (X_valid.astype(float32), y_valid),
    algo='pretraining', learning_rate=1e-4, momentum=0.9, input_dropout=0.5)
In [98]:
theano_accuracy_report(exp03.network)
In [105]:
exp03.save('data/beatles/models/chords_keys_synth_theano_03.model')
In [100]:
# two hidden layers of size 100, autoencoder pretraining, input dropout regularization
exp04 = theanets.Experiment(theanets.Classifier, layers=[X.shape[1], 100, 100, 12])
exp04.train(
    (X_train.astype(float32), y_train),
    (X_valid.astype(float32), y_valid),
    algo='pretraining', learning_rate=1e-4, momentum=0.9, input_dropout=0.5)
Out[100]:
In [101]:
theano_accuracy_report(exp04.network)
In [106]:
exp04.save('data/beatles/models/chords_keys_synth_theano_04.model')
We can see that neural network models are able to squeeze even more out of the data. By training a few models with theanets (up to two hidden layers, with autoencoder pretraining and input dropout regularization) we were able to improve the accuracy by 10 percentage points over logistic regression (71% to 81%).
The error analysis still shows that the models mostly have problems discriminating keys that are close on the circle of fifths.
In summary, with features designed with a bit of care and a few rather simple models we were able to push the accuracy of classifying keys from chord sequences past 80%. This is quite a nice result for a few evenings of work.