This demo extends the Kaggle competition. We train on images of characters obtained from Google Street View, and then classify test images of various sizes. First, k-Nearest Neighbors (kNN) is used to identify the characters. One of the takeaways is how easy it is to build fast, parallelizable systems in Julia. We also try several other approaches and compare them.
Download the following files from here, and place them in the /data directory:
In [2]:
using Images, Colors, DataFrames, TestImages, Gadfly, ArrayFire;
In [3]:
include("$(Pkg.dir())/MLDemos/src/characters/knndemo.jl");
In [4]:
# Some configurations
path = "$(Pkg.dir())/MLDemos/";
imageSize = 400;
# true = for loop, false = vectorised
flag = true
# Read the training labels
labelsInfoTrain = readtable("$(path)data/characters/trainLabels.csv");
# Read the test labels
labelsInfoTest = readtable("$(path)data/characters/testLabels.csv");
In [5]:
#chars=unique(labelsInfoTrain[:Class])
counts=by(labelsInfoTrain, :Class, nrow);
p1=Gadfly.plot(x = counts[:Class], y=counts[:x1], Guide.xlabel("Characters"), Guide.ylabel("# of samples"), Geom.bar, Guide.title("Distribution of training data"))
Out[5]:
The training images range from 1.Bmp to 6283.Bmp. Let's see what the characters look like:
In [64]:
#1<n<6283
n=6250
@show labelsInfoTrain[n,:Class]
showimtrain(n)
Out[64]:
Below is the test dataframe, in which every test image is initially labelled 'A'. The goal is to predict the labels for all 6220 test images, indexed from 6284.Bmp to 12503.Bmp.
In [48]:
labelsInfoTest
Out[48]:
In [13]:
#read the images in from the training data.
xTrain = read_data_sv("train", labelsInfoTrain, imageSize, "$(path)data/characters");
#read the test images
xTest = read_data_sv("test", labelsInfoTest, imageSize, "$(path)data/characters");
In [15]:
#Map the training characters to ASCII values
yTrain = map(x -> x[1], labelsInfoTrain[:Class]);
yTrain = int(yTrain);
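As a quick illustration of this mapping (not part of the demo itself): indexing a one-character string gives a `Char`, and converting it to an integer yields its ASCII code, which is what kNN compares. The notebook's `int(...)` is the older Julia spelling of `Int(...)`.

```julia
# A single-character class label, as stored in the :Class column
label = "A"

# label[1] extracts the Char; Int(...) gives its ASCII code
c = label[1]          # 'A'
code = Int(c)         # 65

# The mapping is reversible, which is how the integer predictions
# are converted back to characters later in the demo
println(code, " => ", Char(code))
```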
In [16]:
# Transposing the images, so that the columns represent each character.
xTrain = xTrain'
xTest = xTest'
Out[16]:
In [30]:
#=
addprocs(1)
@everywhere using DataFrames
include("$(Pkg.dir())/MLDemos/src/characters/knndemo.jl")
procs()
=#
In [18]:
# Cross-validate on the training set: assign a label to each training
# image and find the ratio of correctly classified labels to the total
k = 3
@time sumValues = @parallel (+) for i in 1:size(xTrain, 2)
assign_label(xTrain, yTrain, k, i) == yTrain[i, 1]
end
looCvAccuracy = sumValues / size(xTrain, 2)
Out[18]:
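`assign_label` is defined in knndemo.jl and is not shown in this notebook. As a rough sketch (an illustrative stand-in under assumed behaviour, not the demo's actual implementation), a kNN label assignment over column-major data could look like this:

```julia
# Sketch of a kNN label assignment: given training data with one image
# per column, its integer labels, k, and a query image, return the
# majority label among the k nearest neighbours. Squared Euclidean
# distance is used, since the square root does not change the ranking.
function knn_label(xTrain, yTrain, k, image)
    nCols = size(xTrain, 2)
    dists = zeros(Float64, nCols)
    for j in 1:nCols
        d = 0.0
        for i in 1:size(xTrain, 1)
            diff = xTrain[i, j] - image[i]
            d += diff * diff
        end
        dists[j] = d
    end
    # indices of the k smallest distances
    nearest = sortperm(dists)[1:k]
    # majority vote among the k nearest labels
    votes = Dict{Int,Int}()
    for j in nearest
        votes[yTrain[j]] = get(votes, yTrain[j], 0) + 1
    end
    best, bestCount = yTrain[nearest[1]], -1
    for (lbl, cnt) in votes
        if cnt > bestCount
            best, bestCount = lbl, cnt
        end
    end
    return best
end
```

For the leave-one-out variant used in the cell above, the demo's `assign_label(xTrain, yTrain, k, i)` presumably excludes column `i` itself from the neighbour search.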
In [19]:
# Running the kNN on test data set
tic()
k = 3 # The CV accuracy shows this value to be the best.
yPredictions = @parallel (vcat) for i in 1:size(xTest, 2)
nRows = size(xTrain, 1)
imageI = Array(Float32, nRows)
for index in 1:nRows
imageI[index] = xTest[index, i]
end
assign_label(xTrain, yTrain, k, imageI)
end
toc()
Out[19]:
In [20]:
#Convert integer predictions to character
labelsInfoTest[:Class] = map(Char, yPredictions);
In [68]:
m=12003
show(labelsInfoTest[labelsInfoTest[:ID].==m,:Class])
showimtest(m)
Out[68]:
In [49]:
labelsInfoTest
Out[49]:
In [23]:
t=[34.749010, 34.001, 23.3853, 16.9243, 15.7945, 15.5830, 17.09088];
N = [1,2,3,4,5,6,10]
Gadfly.plot(x=N, y=t, Geom.point, Geom.line, Guide.xlabel("no. of procs"), Guide.ylabel("t in sec"), Guide.title("Parallel Performance"))
Out[23]:
To compute the distance we use the squared Euclidean distance (the square root is unnecessary when only ranking neighbours), which can be computed either as a vectorised operation or with a for loop.
In most high-level languages the vectorised version is the faster one, but thanks to Julia's typing and the clever compiler decisions that happen behind the scenes, the opposite is often true: the for loop can be faster, since the vectorised `dot(a-b, a-b)` allocates a temporary array for `a-b` while the loop allocates nothing.
In [24]:
# Vectorised euclidean distance
function euclidean_distance_vectorise(a, b)
return dot(a-b, a-b)
end
Out[24]:
In [25]:
# For loop euclidean distance
function euclidean_distance_for(a, b)
distance = 0.0
for index in 1:size(a, 1)
distance += (a[index]-b[index]) * (a[index]-b[index])
end
return distance
end
Out[25]:
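As a quick sanity check (illustrative only, redefining both functions so the snippet is self-contained), the two implementations agree up to floating-point rounding. Note that `dot` lives in `LinearAlgebra` on modern Julia; it was in `Base` when this notebook was written.

```julia
# `dot` is in Base on old Julia and in LinearAlgebra on Julia 1.x;
# this snippet assumes a modern Julia
using LinearAlgebra

# Vectorised version: allocates a temporary array for a - b
euclidean_distance_vectorise(a, b) = dot(a - b, a - b)

# Loop version: no intermediate allocations
function euclidean_distance_for(a, b)
    distance = 0.0
    for index in 1:length(a)
        distance += (a[index] - b[index])^2
    end
    return distance
end

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
# (1-4)^2 + (2-6)^2 + (3-3)^2 = 9 + 16 + 0 = 25
println(euclidean_distance_for(a, b))        # 25.0
println(euclidean_distance_vectorise(a, b))  # 25.0
```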
In [29]:
#=
a = rand(100000)
b = rand(100000)
@time euclidean_distance_for(a,b)
@time euclidean_distance_vectorise(a,b)
af = AFArray(a)
bf = AFArray(b)
@time euclidean_distance_vectorise(af,bf)
=#
Out[29]:
In [37]:
df=DataFrame(names = ["For", "Vectorised", "ArrayFire"], t2 = [34,70,20])
p1=Gadfly.plot(x=df[:names], y=df[:t2], Guide.ylabel("Time in sec"), Geom.bar, Guide.title("For loop vs Vectorised vs ArrayFire"))
Out[37]: