Prediction with SVMs over SLM.

In this notebook I confirm that the good performance that I get with the letters does not depend on some quirk involved with the fact that the letters (that is the targers) are represented as text and not as number

Libraries and files



In [1]:

    
import numpy as np
import h5py
from sklearn import svm, cross_validation, preprocessing



In [2]:

    
# First we load the file 
file_location = '../results_database/text_wall_street_big.hdf5'
run_name = '/low-resolution'
f = h5py.File(file_location, 'r')

# Now we need to get the letters and align them
text_directory = '../data/wall_street_letters.npy'
letters_sequence = np.load(text_directory)
Nletters = len(letters_sequence)
symbols = set(letters_sequence)

Now we transform the letters into number with a dictionary



In [3]:

    
symbol_to_number = {}

for number, symbol in enumerate(symbols):
    symbol_to_number[symbol] = number

letters_sequence = [symbol_to_number[letter] for letter in letters_sequence]

Load nexa with its parameters



In [4]:

    
# Nexa parameters
Nspatial_clusters = 5
Ntime_clusters = 15
Nembedding = 3

parameters_string = '/' + str(Nspatial_clusters)
parameters_string += '-' + str(Ntime_clusters)
parameters_string += '-' + str(Nembedding)

nexa = f[run_name + parameters_string]

Now we make the predictions



In [5]:

    
delay = 4
N = 5000
cache_size = 1000



In [6]:

    
# Exctrat and normalized SLM
SLM = np.array(f[run_name]['SLM'])

print('Standarized')
X = SLM[:,:(N - delay)].T
y = letters_sequence[delay:N]
# We now scale X
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.10)

clf_linear = svm.SVC(C=1.0, cache_size=cache_size, kernel='linear')
clf_linear.fit(X_train, y_train)
score = clf_linear.score(X_test, y_test) * 100.0
print('Score in linear', score)

clf_rbf = svm.SVC(C=1.0, cache_size=cache_size, kernel='rbf')
clf_rbf.fit(X_train, y_train)
score = clf_rbf.score(X_test, y_test) * 100.0
print('Score in rbf', score)

print('Not standarized')
X = SLM[:,:(N - delay)].T
y = letters_sequence[delay:N]

# We now scale X
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.10)

clf_linear = svm.SVC(C=1.0, cache_size=cache_size, kernel='linear')
clf_linear.fit(X_train, y_train)
score = clf_linear.score(X_test, y_test) * 100.0
print('Score in linear', score)

clf_rbf = svm.SVC(C=1.0, cache_size=cache_size, kernel='linear')
clf_rbf.fit(X_train, y_train)
score = clf_rbf.score(X_test, y_test) * 100.0
print('Score in rbf', score)









    



Standarized
Score in linear 98.8
Score in rbf 97.6
Not standarized
Score in linear 99.6
Score in rbf 99.6