Character recognition

This demo builds on the Kaggle character recognition competition. We train on images of characters taken from Google Street View and classify test images of various sizes. We first use k-Nearest Neighbours (kNN) to identify the characters, then try several other methods and compare them.

Some key takeaways:

  • Ease of prototyping deployable models.
  • Parallelisation that is easy to implement.
  • For loops that are fast, unlike in most high-level languages.
  • Gains from vectorisation using ArrayFire.

Download the following files from here and place them in the data directory:

  • testResized/
  • trainResized/
  • sampleSubmission.csv
  • trainLabels.csv

In [2]:
using Images, Colors, DataFrames, TestImages, Gadfly, ArrayFire;

In [3]:
include("$(Pkg.dir())/MLDemos/src/characters/knndemo.jl");

In [4]:
# Some configurations
path = "$(Pkg.dir())/MLDemos/";
imageSize = 400;
# true = for loop, false = vectorised
flag = true
# Read the training labels
labelsInfoTrain = readtable("$(path)data/characters/trainLabels.csv");
# Read the test labels
labelsInfoTest = readtable("$(path)data/characters/testLabels.csv");

In [5]:
#chars=unique(labelsInfoTrain[:Class])
counts=by(labelsInfoTrain, :Class, nrow);
p1=Gadfly.plot(x = counts[:Class], y=counts[:x1], Guide.xlabel("Characters"), Guide.ylabel("# of samples"), Geom.bar, Guide.title("Distribution of training data"))


Out[5]:
[Bar chart "Distribution of training data": number of samples (y axis: # of samples) for each character class (x axis: Characters).]

The training images range from 1.Bmp to 6283.Bmp. Let's see what the characters look like:


In [64]:
# 1 <= n <= 6283
n=6250
@show labelsInfoTrain[n,:Class]
showimtrain(n)


labelsInfoTrain[n,:Class] = "h"
Out[64]:

Below is the test dataframe, in which every test image is initially labelled 'A'. The goal is to predict the labels for all 6220 test images, indexed from 6284.Bmp to 12503.Bmp.


In [48]:
labelsInfoTest


Out[48]:
     ID    Class
1    6284  A
2    6285  A
3    6286  A
4    6287  A
5    6288  A
6    6289  A
7    6290  A
8    6291  A
9    6292  A
10   6293  A
11   6294  A
12   6295  A
13   6296  A
14   6297  A
15   6298  A
16   6299  A
17   6300  A
18   6301  A
19   6302  A
20   6303  A
21   6304  A
22   6305  A
23   6306  A
24   6307  A
25   6308  A
26   6309  A
27   6310  A
28   6311  A
29   6312  A
30   6313  A
⋮    ⋮     ⋮

In [13]:
# Read the training images
xTrain = read_data_sv("train", labelsInfoTrain, imageSize, "$(path)data/characters");

# Read the test images
xTest = read_data_sv("test", labelsInfoTest, imageSize, "$(path)data/characters");

In [15]:
# Map the training characters to their integer (ASCII) codes
yTrain = map(x -> x[1], labelsInfoTrain[:Class]);
yTrain = map(Int, yTrain);
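
Each class label is a one-character string, so taking its first character and converting it to an integer yields the character's code point. For example:

map(Int, ['A', 'a', 'z'])  # [65, 97, 122]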

In [16]:
# Transpose so that each column holds one image.
xTrain = xTrain'
xTest = xTest'


Out[16]:
400x6220 Array{Float64,2}:
 0.45098   0.282353  0.113725  0.592157  …  0.329412  0.635294  0.403922
 0.447059  0.286275  0.152941  0.545098     0.341176  0.611765  0.423529
 0.443137  0.309804  0.156863  0.564706     0.345098  0.662745  0.396078
 0.443137  0.301961  0.156863  0.529412     0.345098  0.635294  0.388235
 0.435294  0.309804  0.156863  0.568627     0.341176  0.627451  0.368627
 0.431373  0.341176  0.156863  0.607843  …  0.341176  0.615686  0.368627
 0.458824  0.34902   0.160784  0.568627     0.345098  0.611765  0.411765
 0.462745  0.333333  0.160784  0.576471     0.337255  0.619608  0.501961
 0.447059  0.341176  0.160784  0.537255     0.337255  0.643137  0.690196
 0.462745  0.364706  0.156863  0.560784     0.341176  0.654902  0.819608
 0.458824  0.364706  0.156863  0.580392  …  0.345098  0.627451  0.858824
 0.458824  0.376471  0.156863  0.592157     0.337255  0.65098   0.862745
 0.458824  0.388235  0.156863  0.592157     0.345098  0.631373  0.796078
 ⋮                                       ⋱                              
 0.466667  0.45098   0.160784  0.494118     0.329412  0.643137  0.407843
 0.47451   0.458824  0.156863  0.380392     0.329412  0.643137  0.423529
 0.494118  0.443137  0.156863  0.258824  …  0.32549   0.658824  0.415686
 0.486275  0.443137  0.152941  0.290196     0.317647  0.643137  0.439216
 0.482353  0.443137  0.14902   0.239216     0.317647  0.662745  0.431373
 0.47451   0.447059  0.145098  0.235294     0.32549   0.615686  0.435294
 0.47451   0.447059  0.145098  0.258824     0.321569  0.639216  0.423529
 0.486275  0.439216  0.145098  0.247059  …  0.313725  0.631373  0.411765
 0.490196  0.45098   0.145098  0.388235     0.317647  0.607843  0.392157
 0.486275  0.431373  0.145098  0.556863     0.32549   0.658824  0.505882
 0.482353  0.282353  0.145098  0.588235     0.32549   0.576471  0.611765
 0.470588  0.294118  0.101961  0.592157     0.313725  0.584314  0.392157

Parallelisation:

  1. Call addprocs() before running the code.
  2. Prefix each function definition with @everywhere.
  3. Prefix each parallel for loop with @parallel.

A self-contained toy example of this recipe appears after the next cell.

In [30]:
#=
addprocs(1)
@everywhere using DataFrames
include("$(Pkg.dir())/MLDemos/src/characters/knndemo.jl")
procs()
=#
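
As a toy illustration of the three-step recipe above (independent of the demo code), summing squares across workers looks like this:

addprocs(3)                    # 1. add worker processes

@everywhere square(x) = x * x  # 2. define the function on every worker

# 3. @parallel splits the loop across the workers and reduces with (+)
total = @parallel (+) for i in 1:1000
 square(i)
end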

Training the model


In [18]:
# Assign the labels to the training images and find the ratio
# of correctly classified labels to the total number of labels

k = 3
@time sumValues = @parallel (+) for i in 1:size(xTrain, 2)
 assign_label(xTrain, yTrain, k, i) == yTrain[i, 1]
end
loofCvAccuracy = sumValues / size(xTrain, 2)


 36.474307 seconds (112.12 M allocations: 2.126 GB, 0.73% gc time)
Out[18]:
0.04918032786885246
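
assign_label itself lives in knndemo.jl and is not shown here. As a rough, hypothetical sketch (names and details are assumptions, not the demo's actual implementation), a kNN labeller of this shape could look like:

# Hypothetical sketch, not the knndemo.jl implementation.
# Label a query image by majority vote among its k nearest training columns.
function assign_label_sketch(xTrain, yTrain, k, imageI::Vector)
 nCols = size(xTrain, 2)
 dists = zeros(nCols)
 for j in 1:nCols
  d = xTrain[:, j] - imageI
  dists[j] = dot(d, d)            # squared Euclidean distance
 end
 nearest = sortperm(dists)[1:k]   # indices of the k closest columns
 votes = Dict{Int,Int}()          # label => number of votes
 for j in nearest
  votes[yTrain[j]] = get(votes, yTrain[j], 0) + 1
 end
 bestLabel, bestCount = 0, 0
 for (label, count) in votes
  if count > bestCount
   bestLabel, bestCount = label, count
  end
 end
 return bestLabel
end

# Leave-one-out variant: classify training column i against all other columns.
function assign_label_sketch(xTrain, yTrain, k, i::Int)
 others = setdiff(1:size(xTrain, 2), [i])
 return assign_label_sketch(xTrain[:, others], yTrain[others], k, xTrain[:, i])
end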

Predicting on the test images:


In [19]:
# Running the kNN on test data set
tic()
k = 3 # The CV accuracy shows this value to be the best.                             
yPredictions = @parallel (vcat) for i in 1:size(xTest, 2)
 nRows = size(xTrain, 1)
 # Copy the i-th test image into a contiguous vector
 imageI = Array(Float32, nRows)
 for index in 1:nRows
  imageI[index] = xTest[index, i]
 end
 assign_label(xTrain, yTrain, k, imageI)
end
toc()


elapsed time: 65.206551157 seconds
Out[19]:
65.206551157

Assign the predicted values to labelsInfoTest


In [20]:
# Convert integer predictions to characters
labelsInfoTest[:Class] = map(Char, yPredictions);

Let's see what the predictions look like:

Test images range from 6284.Bmp to 12503.Bmp.


In [68]:
m=12003

show(labelsInfoTest[labelsInfoTest[:ID].==m,:Class])
showimtest(m)


['W']
Out[68]:

In [49]:
labelsInfoTest


Out[49]:
     ID    Class
1    6284  H
2    6285  E
3    6286  R
4    6287  d
5    6288  E
6    6289  C
7    6290  0
8    6291  a
9    6292  8
10   6293  H
11   6294  N
12   6295  R
13   6296  k
14   6297  H
15   6298  r
16   6299  5
17   6300  A
18   6301  R
19   6302  X
20   6303  Y
21   6304  M
22   6305  r
23   6306  A
24   6307  a
25   6308  R
26   6309  S
27   6310  D
28   6311  D
29   6312  n
30   6313  F
⋮    ⋮     ⋮

Performance figures:

The Julia parallel implementation:

As we add worker processes, performance improves up to nprocs = 6 and degrades beyond that; the data set is small, and 6 processes turn out to be optimal on the test system.


In [23]:
t=[34.749010, 34.001, 23.3853, 16.9243, 15.7945, 15.5830, 17.09088];
N = [1,2,3,4,5,6,10]
Gadfly.plot(x=N, y=t, Geom.point, Geom.line, Guide.xlabel("no. of procs"), Guide.ylabel("t in sec"), Guide.title("Parallel Performance"))


Out[23]:
[Line plot "Parallel Performance": wall-clock time (y axis: t in sec) versus the number of worker processes (x axis: no. of procs).]

Advantages of for loops:

Distance measure:

As the distance we use the squared Euclidean distance; since the square root is monotonic, it ranks neighbours exactly as the true Euclidean distance does. It can be computed either as a vectorised operation or with an explicit for loop.

In most high-level languages, vectorised code is the only fast path. Thanks to Julia's type inference and the compiler decisions that happen behind the scenes, the opposite is often true: plain for loops can be as fast as, or faster than, vectorised operations!


In [24]:
# Vectorised (squared) Euclidean distance
function euclidean_distance_vectorise(a, b)
   return dot(a-b, a-b)
end


Out[24]:
euclidean_distance_vectorise (generic function with 1 method)

In [25]:
# For-loop (squared) Euclidean distance
function euclidean_distance_for(a, b)
 distance = 0.0 
 for index in 1:size(a, 1) 
  distance += (a[index]-b[index]) * (a[index]-b[index])
 end
 return distance
end


Out[25]:
euclidean_distance_for (generic function with 1 method)
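
Both functions return the same squared distance, so either can back the kNN. For example:

a = [1.0, 2.0]; b = [3.0, 4.0]
euclidean_distance_vectorise(a, b)  # 8.0
euclidean_distance_for(a, b)        # 8.0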

ArrayFire

ArrayFire is a library for GPU and accelerated computing. ArrayFire.jl wraps it for Julia and provides a Julian interface.


In [29]:
#=
a = rand(100000)
b = rand(100000)
@time euclidean_distance_for(a,b)
@time euclidean_distance_vectorise(a,b)
af = AFArray(a)
bf = AFArray(b)
@time euclidean_distance_vectorise(af,bf)
=#


  0.000178 seconds (5 allocations: 176 bytes)
  0.000516 seconds (13 allocations: 1.526 MB)
  0.000295 seconds (27 allocations: 944 bytes)
Out[29]:
1-element ArrayFire.AFArray{Float64,1}:
 16669.3

Plot showing the time taken to train the model: for loop vs vectorised vs ArrayFire vectorised.


In [37]:
df=DataFrame(names = ["For", "Vectorised", "ArrayFire"], t2 = [34,70,20])

p1=Gadfly.plot( x=df[:names], y=df[:t2],  Guide.ylabel("Time in sec"), Geom.bar, Guide.title("For loop Vs Vectorised Vs ArrayFire."))


Out[37]:
[Bar chart "For loop Vs Vectorised Vs ArrayFire.": training time in seconds (For 34 s, Vectorised 70 s, ArrayFire 20 s).]