This assignment is worth a total of 100 points. You will hand in a report for your solution to the Music Genre Recognition competition posted on Kaggle (https://www.kaggle.com/c/uri-dl-hw-1). Your report must be submitted via Gradescope in PDF form and must follow the structure below.
The list below shows a few examples of items to be discussed/presented in your description: • have you performed any feature transformation or data preprocessing? Yes, normalization is the only preprocessing applied to the data: the features are centered (zero mean) and scaled to unit variance.
• what is the structure of your classifier/neural network? A multi-layer perceptron with one input layer, one hidden layer, and one output layer. This feed-forward neural network can also be made deeper by adding more hidden layers (a minimal sketch of this setup on toy data follows this list).
• what activation functions are being used? Tanh: zero-centered, monotonic, differentiable, and saturating at -1 and +1.
• what is your loss function? have you tried others? Cross-entropy loss, with a softmax layer to generate the class probabilities.
• are you implementing any sort of regularization? provide details. No regularization is implemented.
• what types of gradient descent are being explored? batch? stochastic? Vanilla (plain) stochastic gradient descent on minibatches.
• are you implementing any improved optimization? e.g. learning rate schedules or adding momentum. No improved optimization is implemented.
• how are you performing model selection? is cross-validation being used? No formal model selection is performed, but a held-out validation split (the last 10% of the training data) is used to monitor training and check that neither underfitting nor overfitting occurs.
• what are the hyperparameters you are tuning in your model selection?
n_iter = 1000  # number of epochs
alpha = 1e-2  # learning rate
mb_size = 128  # minibatch size
num_hidden_units = 64  # number of hidden units in each layer
num_layers = 1  # depth of the NN (number of hidden layers)
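For concreteness, the following is a minimal sketch of the setup described above (zero-mean/unit-variance normalization, one tanh hidden layer, and a softmax cross-entropy loss). The shapes and random data are illustrative only and are not taken from the competition files.

import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 8, 5, 4, 3                 # batch size, input features, hidden units, classes (toy values)
X = rng.normal(size=(N, D))             # toy inputs
y = rng.integers(0, C, size=N)          # toy integer class labels

# Preprocessing: center and scale the features
X = (X - X.mean(axis=0)) / X.std(axis=0)

# One tanh hidden layer followed by a linear output layer
W1, b1 = rng.normal(size=(D, H)) / np.sqrt(D / 2.), np.zeros((1, H))
W2, b2 = rng.normal(size=(H, C)) / np.sqrt(H / 2.), np.zeros((1, C))
h = np.tanh(X @ W1 + b1)                # hidden activations
logits = h @ W2 + b2                    # unnormalized class scores

# Softmax + cross-entropy loss
e = np.exp(logits - logits.max(axis=1, keepdims=True))
prob = e / e.sum(axis=1, keepdims=True)
loss = -np.log(prob[np.arange(N), y]).mean()
print(loss)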
The Kaggle score is 0.21313, ranked 6th, with 25 submissions. The hyperparameters are listed below:
n_iter = 1000  # number of epochs
alpha = 1e-2  # learning rate
mb_size = 128  # minibatch size
num_hidden_units = 64  # number of hidden units in each layer
num_layers = 1  # depth of the NN (number of hidden layers)
Source code is provided in this notebook with comments, separated into blocks.
In [1]:
import pandas as pd # to read CSV files (Comma Separated Values)
train_x = pd.read_csv(filepath_or_buffer='../data/kaggle-music-genre/train.x.csv')
train_x.head()
Out[1]:
In [2]:
train_y = pd.read_csv(filepath_or_buffer='../data/kaggle-music-genre/train.y.csv')
train_y.head()
Out[2]:
In [3]:
test_x = pd.read_csv(filepath_or_buffer='../data/kaggle-music-genre/test.x.csv')
test_x.head()
Out[3]:
In [4]:
test_y_sample = pd.read_csv(filepath_or_buffer='../data/kaggle-music-genre/submission-random.csv')
test_y_sample.head()
Out[4]:
In [5]:
test_y_sample[:0]
Out[5]:
In [6]:
import numpy as np
train_X = np.array(train_x)
train_Y = np.array(train_y[:]['class_label'])
test_X = np.array(test_x)
# Drop the non-feature columns: Id and msd_track_id (first and last) from the training features, and Id from the test features
X_train_val = np.array(train_X[:, 1:-1], dtype=float)
X_test = np.array(test_X[:, 1:], dtype=float)
train_Y.shape
Out[6]:
In [7]:
from collections import Counter
# Count the freq of the keys in the training labels
counted_labels = Counter(train_Y)
labels_keys = counted_labels.keys()
labels_keys
Out[7]:
In [8]:
labels_keys_sorted = sorted(labels_keys)
labels_keys_sorted
Out[8]:
In [9]:
# Dictionary comprehension: map each genre label (key) to an integer index (value)
key_to_val = {key: val for val, key in enumerate(labels_keys_sorted)}
key_to_val['Country']
key_to_val
Out[9]:
In [10]:
val_to_key = {val: key for val, key in enumerate(labels_keys_sorted)}
val_to_key[1]
val_to_key
Out[10]:
In [11]:
Y_train_vec = []
for each in train_y[:]['class_label']:
    # print(each, key_to_val[each])
    Y_train_vec.append(key_to_val[each])
Y_train_val = np.array(Y_train_vec)
Y_train_val.shape
Out[11]:
In [12]:
# Preprocessing: normalizing the data based on the training set
mean = X_train_val.mean(axis=0)
std = X_train_val.std(axis=0)
X_train_val, X_test = (X_train_val - mean)/ std, (X_test - mean)/ std
X_train_val.shape, X_test.shape, X_train_val.dtype, X_test.dtype
Out[12]:
In [13]:
# Create the validation set: the last 10% (1/10) of the labeled training data
valid_size = X_train_val.shape[0]//10
valid_size
X_val = X_train_val[-valid_size:]
Y_val = Y_train_val[-valid_size:]
X_train = X_train_val[: -valid_size]
Y_train = Y_train_val[: -valid_size]
X_train_val.shape,
X_train.shape, X_val.shape, X_test.shape, Y_val.shape, Y_train.shape
# X_train.dtype, X_val.dtype
# Y_train.dtype, Y_val
Out[13]:
In [14]:
def softmax(X):
    # Numerically stable softmax: subtract the row-wise max before exponentiating
    eX = np.exp((X.T - np.max(X, axis=1)).T)
    return (eX.T / eX.sum(axis=1)).T

def tanh_forward(X):
    out = np.tanh(X)
    cache = out  # cache tanh(X) for the backward pass
    return out, cache

def tanh_backward(dout, cache):
    # dTanh = 1 - tanh(X)**2; the cache already holds tanh(X)
    dX = (1 - cache**2) * dout
    return dX

def cross_entropy(y_pred, y_train):
    m = y_pred.shape[0]
    prob = softmax(y_pred)
    log_like = -np.log(prob[range(m), y_train])  # negative log-likelihood of the true class
    data_loss = np.sum(log_like) / m
    return data_loss

def dcross_entropy(y_pred, y_train):
    # Gradient of softmax + cross-entropy; no regularization term, so no extra gradient
    m = y_pred.shape[0]
    grad_y = softmax(y_pred)
    grad_y[range(m), y_train] -= 1.
    grad_y /= m
    return grad_y
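As a quick sanity check of these helpers (not part of the graded submission), the analytic gradients can be compared against central-difference numerical gradients on a tiny random example; the shapes below are arbitrary.

# Numerical check of dcross_entropy against cross_entropy
np.random.seed(0)
scores = np.random.randn(4, 3)          # 4 samples, 3 classes (toy shapes)
labels = np.array([0, 2, 1, 2])
eps = 1e-5
num_grad = np.zeros_like(scores)
for i in range(scores.shape[0]):
    for j in range(scores.shape[1]):
        plus, minus = scores.copy(), scores.copy()
        plus[i, j] += eps
        minus[i, j] -= eps
        num_grad[i, j] = (cross_entropy(plus, labels) - cross_entropy(minus, labels)) / (2 * eps)
print(np.max(np.abs(num_grad - dcross_entropy(scores, labels))))  # expected to be tiny (~1e-8 or less)

# Numerical check of tanh_backward: d/dx tanh(x) = 1 - tanh(x)**2
x = np.random.randn(5)
out, cache = tanh_forward(x)
analytic = tanh_backward(np.ones_like(x), cache)
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # expected to be tiny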
In [15]:
from sklearn.utils import shuffle as skshuffle
class FFNN:
    def __init__(self, D, C, H, L):
        self.L = L  # number of hidden layers (depth)
        self.losses = {'train': [], 'train_acc': [],
                       'valid': [], 'valid_acc': []}
        # The learnable (updatable) feed-forward model, initialized at random
        self.model = []
        self.grads = []
        low, high = -1, 1
        # Input layer: weights/biases
        m = dict(W=np.random.uniform(size=(D, H), low=low, high=high) / np.sqrt(D / 2.),
                 b=np.zeros((1, H)))
        self.model.append(m)
        # Input layer: gradients
        self.grads.append({key: np.zeros_like(val) for key, val in self.model[0].items()})
        # Hidden layers: weights/biases
        m_L = []
        for _ in range(L):
            m = dict(W=np.random.uniform(size=(H, H), low=low, high=high) / np.sqrt(H / 2.),
                     b=np.zeros((1, H)))
            m_L.append(m)
        self.model.append(m_L)
        # Hidden layers: gradients
        grad_L = []
        for _ in range(L):
            grad_L.append({key: np.zeros_like(val) for key, val in self.model[1][0].items()})
        self.grads.append(grad_L)
        # Output layer: weights/biases
        m = dict(W=np.random.uniform(size=(H, C), low=low, high=high) / np.sqrt(H / 2.),
                 b=np.zeros((1, C)))
        self.model.append(m)
        # Output layer: gradients
        self.grads.append({key: np.zeros_like(val) for key, val in self.model[2].items()})

    def fc_forward(self, X, W, b):
        out = (X @ W) + b
        cache = (W, X)
        return out, cache

    def fc_backward(self, dout, cache):
        W, X = cache
        dW = X.T @ dout
        db = np.sum(dout, axis=0).reshape(1, -1)  # bias gradient, shape 1 x n
        dX = dout @ W.T  # backprop to the previous layer
        return dX, dW, db

    def train_forward(self, X, train):
        caches, ys = [], []
        # Input layer
        y, fc_cache = self.fc_forward(X=X, W=self.model[0]['W'], b=self.model[0]['b'])
        y, nl_cache = tanh_forward(X=y)
        X = y.copy()  # pass to the next layer
        if train:
            caches.append((fc_cache, nl_cache))
        # Hidden layers
        fc_caches, nl_caches = [], []
        for layer in range(self.L):
            y, fc_cache = self.fc_forward(X=X, W=self.model[1][layer]['W'], b=self.model[1][layer]['b'])
            y, nl_cache = tanh_forward(X=y)
            X = y.copy()  # pass to the next layer
            if train:
                fc_caches.append(fc_cache)
                nl_caches.append(nl_cache)
        if train:
            caches.append((fc_caches, nl_caches))  # caches[1]
        # Output layer
        y, fc_cache = self.fc_forward(X=X, W=self.model[2]['W'], b=self.model[2]['b'])
        # Softmax is included in the loss function
        if train:
            caches.append(fc_cache)
        return y, caches  # caches are needed for backpropagating the error

    def loss_function(self, y, y_train):
        loss = cross_entropy(y, y_train)  # softmax is included
        dy = dcross_entropy(y, y_train)  # dsoftmax is included
        return loss, dy

    def train_backward(self, dy, caches):
        grads = self.grads
        # Output layer
        fc_cache = caches[2]
        # dSoftmax is included in the loss function
        dX, dW, db = self.fc_backward(dout=dy, cache=fc_cache)
        dy = dX.copy()
        grads[2]['W'] = dW
        grads[2]['b'] = db
        # Hidden layers
        fc_caches, nl_caches = caches[1]
        for layer in reversed(range(self.L)):
            dy = tanh_backward(cache=nl_caches[layer], dout=dy)
            dX, dW, db = self.fc_backward(dout=dy, cache=fc_caches[layer])
            dy = dX.copy()
            grads[1][layer]['W'] = dW
            grads[1][layer]['b'] = db
        # Input layer
        fc_cache, nl_cache = caches[0]
        dy = tanh_backward(cache=nl_cache, dout=dy)
        dX, dW, db = self.fc_backward(dout=dy, cache=fc_cache)
        grads[0]['W'] = dW
        grads[0]['b'] = db
        return grads

    def test(self, X):
        y_logit, _ = self.train_forward(X, train=False)
        y_prob = softmax(y_logit)  # class probabilities
        y_pred = np.argmax(y_prob, axis=1)  # predicted class indices
        return y_pred, y_logit

    def get_minibatch(self, X, y, minibatch_size, shuffle):
        minibatches = []
        if shuffle:
            X, y = skshuffle(X, y)
        for i in range(0, X.shape[0], minibatch_size):
            X_mini = X[i:i + minibatch_size]
            y_mini = y[i:i + minibatch_size]
            minibatches.append((X_mini, y_mini))
        return minibatches

    def sgd(self, train_set, val_set, alpha, mb_size, n_iter, print_after):
        X_train, y_train = train_set
        X_val, y_val = val_set
        # Iterations
        for iter in range(1, n_iter + 1):
            # Draw one random minibatch per iteration
            minibatches = self.get_minibatch(X_train, y_train, mb_size, shuffle=True)
            idx = np.random.randint(0, len(minibatches))
            X_mini, y_mini = minibatches[idx]
            # Forward pass, loss gradient, backward pass
            y, caches = self.train_forward(X_mini, train=True)
            _, dy = self.loss_function(y, y_mini)
            grads = self.train_backward(dy, caches)
            # Update the input layer
            for key in grads[0].keys():
                self.model[0][key] -= alpha * grads[0][key]
            # Update the hidden layers
            for layer in range(self.L):
                for key in grads[1][layer].keys():
                    self.model[1][layer][key] -= alpha * grads[1][layer][key]
            # Update the output layer
            for key in grads[2].keys():
                self.model[2][key] -= alpha * grads[2][key]
            # Training loss/accuracy on the current minibatch
            y_pred, y_logit = self.test(X_mini)
            loss, _ = self.loss_function(y_logit, y_mini)  # softmax is included in the cross-entropy loss
            self.losses['train'].append(loss)
            acc = np.mean(y_pred == y_mini)
            self.losses['train_acc'].append(acc)
            # Validation loss/accuracy
            y_pred, y_logit = self.test(X_val)
            valid_loss, _ = self.loss_function(y_logit, y_val)  # softmax is included in the cross-entropy loss
            self.losses['valid'].append(valid_loss)
            valid_acc = np.mean(y_pred == y_val)
            self.losses['valid_acc'].append(valid_acc)
            # Print the training and validation loss/accuracy
            if iter % print_after == 0:
                print('Iter: {}, train loss: {:.4f}, train acc: {:.4f}, valid loss: {:.4f}, valid acc: {:.4f}'.format(
                    iter, loss, acc, valid_loss, valid_acc))
In [16]:
Y_train.shape, X_train.shape, X_val.shape, Y_val.shape
Out[16]:
In [19]:
# Hyper-parameters
n_iter = 1000  # number of training iterations (epochs)
alpha = 1e-2  # learning rate
mb_size = 128  # minibatch size
print_after = 10  # print train/validation loss and accuracy every 10 iterations
num_hidden_units = 64  # number of hidden units in each layer
num_input_units = X_train.shape[1]  # number of input features
num_output_units = Y_train.max() + 1  # number of classes in this classification problem
num_layers = 1  # depth (number of hidden layers)
# Build the neural network and train it
nn = FFNN(C=num_output_units, D=num_input_units, H=num_hidden_units, L=num_layers)
nn.sgd(train_set=(X_train, Y_train), val_set=(X_val, Y_val), mb_size=mb_size, alpha=alpha,
       n_iter=n_iter, print_after=print_after)
In [20]:
# Display the learning curves: training and validation losses
# %matplotlib inline
# %config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
plt.plot(nn.losses['train'], label='Train loss')
plt.plot(nn.losses['valid'], label='Valid loss')
plt.legend()
plt.show()
In [21]:
loss_train = np.array(nn.losses['train'], dtype=float)
loss_valid = np.array(nn.losses['valid'], dtype=float)
loss_train.shape, loss_valid.shape
Out[21]:
In [22]:
loss_train_norm = (loss_train - loss_train.mean(axis=0))/ loss_train.std(axis=0)
loss_valid_norm = (loss_valid - loss_valid.mean(axis=0))/ loss_valid.std(axis=0)
In [23]:
plt.plot(loss_train_norm, label='Normalized train loss')
plt.plot(loss_valid_norm, label='Normalized valid loss')
plt.legend()
plt.show()
In [24]:
plt.plot(nn.losses['train_acc'], label='Train accuracy')
plt.plot(nn.losses['valid_acc'], label='Valid accuracy')
plt.legend()
plt.show()
In [25]:
heading = labels_keys_sorted.copy()
heading.insert(0, 'Id')
heading
Out[25]:
In [26]:
y_pred, y_logits = nn.test(X_test)
y_prob = softmax(y_logits)
y_prob.shape, X_test.shape, y_logits.shape, test_y_sample.shape, test_y_sample[:1]
Out[26]:
In [27]:
pred_list = []
for Id, pred in enumerate(y_prob):
    # print(Id+1, *pred)
    pred_list.append([Id+1, *pred])
In [28]:
pred_file = open(file='prediction.csv', mode='w')
pred_file.write('\n')  # note: this writes a leading blank line before the header
# Write the header row: Id followed by the sorted genre names
for idx in range(len(heading)):
    if idx < len(heading) - 1:
        pred_file.write(heading[idx] + ',')
    else:
        pred_file.write(heading[idx] + '\n')
# Write one row of class probabilities per test example
for i in range(len(pred_list)):  # rows
    for j in range(len(pred_list[i])):  # cols
        if j < (len(pred_list[i]) - 1):
            pred_file.write(str(pred_list[i][j]))
            pred_file.write(',')
        else:  # last item in the row, start a new line
            pred_file.write(str(pred_list[i][j]) + '\n')
pred_file.close()
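As a design note (not the submitted code), the same submission content could be produced more compactly with pandas; this sketch assumes the y_prob array and labels_keys_sorted list from the cells above and writes to a separate file name.

# Alternative: build the submission DataFrame and let pandas handle the CSV formatting
submission = pd.DataFrame(y_prob, columns=labels_keys_sorted)
submission.insert(0, 'Id', np.arange(1, len(y_prob) + 1))
submission.to_csv('prediction_pandas.csv', index=False)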
In [29]:
pd.read_csv(filepath_or_buffer='prediction.csv').head()
Out[29]:
In [30]:
pd.read_csv(filepath_or_buffer='prediction.csv').shape, test_y_sample.shape
Out[30]:
In [31]:
test_y_sample.head()
Out[31]: