``````

In [1]:

%pylab inline
%qtconsole

``````
``````

Populating the interactive namespace from numpy and matplotlib

``````
``````

In [10]:

import sklearn
from sklearn import cluster,datasets

``````

# Dataset

Let's generate a simple gaussian mixture to benchmark K-Means.

``````

In [7]:

N=10**4
D_gen=2
K_gen=10

``````
``````

In [27]:

X,A_true=sklearn.datasets.make_blobs(n_samples=N, n_features=D_gen, centers=K_gen, cluster_std=1.0, center_box=(-100.0, 100.0))

``````

# K-Means

``````

In [12]:

K=10
N_inits=500
max_iter=20

``````
``````

In [17]:

%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
precompute_distances=True, n_init=N_inits,
max_iter=max_iter, random_state=None, copy_x=True,
n_jobs=1)

``````
``````

1 loops, best of 3: 5.07 s per loop

``````

The documentation claims that the precompute_distances (True by default) will consume more memory but make the algorithm faster. For the our goal (run K-Means many times with a small number of iterations) this is not the case, as shown by the result below. The documentation does not offer insight on the influence that this parameter is having under the hood.

``````

In [18]:

%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
precompute_distances=False, n_init=N_inits,
max_iter=max_iter, random_state=None, copy_x=True,
n_jobs=1)

``````
``````

1 loops, best of 3: 3.66 s per loop

``````

The K-Means function from the scikit-learn already supports CPU parallelization by changing a simple parameter in the function. As expected, using multiple cores resulted in a speedup, albeit small. The parallelization is being done on the pairwise distances computation. It breaks down the pairwise matrix by the number of jobs selected.

``````

In [21]:

%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
precompute_distances=True, n_init=N_inits,
max_iter=max_iter, random_state=None, copy_x=True,
n_jobs=-1)

``````
``````

1 loops, best of 3: 6.63 s per loop

``````
``````

In [24]:

%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
precompute_distances=False, n_init=N_inits,
max_iter=max_iter, random_state=None, copy_x=True,
n_jobs=-1)

``````
``````

1 loops, best of 3: 2.29 s per loop

``````

It should be noted that it seems that this implementation is being accelerated by compiled C code.n

This are the times from the code above before the installation of optimized MKL sklearn (from the Anaconda Accelerate addon).

``````1 loops, best of 3: 5.67 s per loop
1 loops, best of 3: 4.05 s per loop
1 loops, best of 3: 3.29 s per loop
1 loops, best of 3: 2.57 s per loop``````

# CUDA

Following video tutorial on Youtube. Notebook available here.

``````

In [2]:

import numbapro
from numbapro import jit, cuda, int32, float32, float64

#@jit(complex64(int32, float32, complex64), target="cpu")

``````
``````

In [3]:

numbapro.check_cuda()

``````
``````

------------------------------libraries detection-------------------------------
Finding cublas
located at /home/chiroptera/anaconda/lib/libcublas.so.6.0.37
trying to open library...	ok
Finding cusparse
located at /home/chiroptera/anaconda/lib/libcusparse.so.6.0.37
trying to open library...	ok
Finding cufft
located at /home/chiroptera/anaconda/lib/libcufft.so.6.0.37
trying to open library...	ok
Finding curand
located at /home/chiroptera/anaconda/lib/libcurand.so.6.0.37
trying to open library...	ok
Finding nvvm
located at /home/chiroptera/anaconda/lib/libnvvm.so.2.0.0
trying to open library...	ok
finding libdevice for compute_20...	ok
finding libdevice for compute_30...	ok
finding libdevice for compute_35...	ok
-------------------------------hardware detection-------------------------------
Found 1 CUDA devices
id 0         GeForce GT 520M                              [SUPPORTED]
compute capability: 2.1
pci device id: 0
pci bus id: 1
Summary:
1/1 devices are supported
PASSED

Out[3]:

True

``````
``````

In [5]:

#CPU version
@numbapro.vectorize(['float32(float32,float32)',
'float64(float64,float64)'],target='cpu')
def cpu_sincos(x,y):
return math.sin(x) * math.cos(y)

#GPU version
@numbapro.vectorize(['float32(float32,float32)',
'float64(float64,float64)'],target='gpu')
def gpu_sincos(x,y):
return math.sin(x) * math.cos(y)

``````
``````

In [6]:

n=10**6
x = np.linspace(0,np.pi,n)
y = np.linspace(0,np.pi,n)

print 'Data of type ',x.dtype

print 'Unoptimized\t',
%timeit np.sin(x) * np.cos(y)
print 'CPU vectorize\t',
%timeit cpu_sincos(x,y)
print 'GPU vectorize\t',
%timeit gpu_sincos(x,y)

del x,y

``````
``````

Data of type  float64
Unoptimized	10 loops, best of 3: 77.5 ms per loop
CPU vectorize	10 loops, best of 3: 73.2 ms per loop
GPU vectorize	100 loops, best of 3: 17.8 ms per loop

``````

Unoptimized and CPU vectorize times are similar because sin() and cos() calls dominate time. GPU is quite faster already and would get even faster in a better GPU (current GPU has 48 CUDA cores).

All the data has to go from the RAM to the GPU memory, band back again. The work is distributed across all threads on the GPU and scheduling and memory management are taken cared of.

## Another example

``````

In [39]:

@numbapro.vectorize(['float32(float32,float32,float32,float32)'])
def cpu_powers(p,q,r,s):
return math.sqrt(p**2 + q**3 + r**4 + s**5)

@numbapro.vectorize(['float32(float32,float32,float32,float32)'],target='gpu')
def gpu_powers(p,q,r,s):
return math.sqrt(p**2 + q**3 + r**4 + s**5)

``````

When target is not specified, it defaults to 'cpu'.

``````

In [40]:

## Benchmarking
#Generate data
n=5e6
p = np.random.random(n).astype(np.float32)
q = np.random.random(n).astype(np.float32)
r = np.random.random(n).astype(np.float32)
s = np.random.random(n).astype(np.float32)

print 'Unoptimized\t',
%timeit np.sqrt(p**2 + q**3 + r**4 + s**5)
print 'CPU vectorize\t',
%timeit cpu_powers(p,q,r,s)
print 'GPU vectorize\t',
%timeit gpu_powers(p,q,r,s)

del p,q,r,s

``````
``````

Unoptimized	1 loops, best of 3: 2.99 s per loop
CPU vectorize	1 loops, best of 3: 2.25 s per loop
GPU vectorize	10 loops, best of 3: 80 ms per loop

``````

Anaconda Accelerate includes the MKL optimization to several libraries. I still have to check if the numpy routines are not faster than the normal ones.

The speedup of GPU over CPU is of \$\$2.1 s / 69.4 ms = 30.26\$\$

We've now seen that @vectorize accelerates functions, specially on GPU, but it only works on scalar functions. Even though more complex operations like matrix multiplication can be decomposed in several scalar operations over its elements, it would be more convenient to operate on arrays directly.

## GUVectorize and General Universal Functions (ufunc)

Now the function can't return anything and we should pass as a parameter where we want our result to be stored. Furthermore, it's now clear that the types are arrays and we have an extra descriptive string for the shape signature.

``````

In [7]:

@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],'(m,n),(n,p)->(m,p)',target='gpu')
def batch_matrix_mult(a,b,c):
for i in range(c.shape[0]):
for j in range(c.shape[1]):
tmp=0
for n in range(a.shape[1]):
tmp += a[i,n] * b[n,j]
c[i,j]=tmp
#batch_matrix_mult.max_blocksize = 32  # limits to 32 threads per block

``````

This is the code from the video. It seems to be a simple matrix multiplication though. In the video they say that it's a batch multuplication...

``````

In [9]:

n=4e6
dim=2

print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'

a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)

print 'Exception when too much resources of GPU are used:'
try:
batch_matrix_mult(a,b)
except Exception as exception:
print exception

batch_matrix_mult.max_blocksize = 64  # limits to 32 threads per block

import numpy.core.umath_tests as ut

print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)

print 'GUVectorize GPU\t',
%timeit c2 = batch_matrix_mult(a,b)

del n,dim,a,b

``````
``````

Memory for both matrices:	125000.0  KBytes
Exception when too much resources of GPU are used:
Unoptimized	1 loops, best of 3: 169 ms per loop
GUVectorize GPU	1 loops, best of 3: 255 ms per loop

``````
``````

In [23]:

del n,dim,a,b

``````

From the documentation:

There are times when the gufunc kernel uses too many of a GPU’s resources, which can cause the kernel launch to fail. The user can explicitly control the maximum size of the thread block by setting the max_blocksize attribute on the compiled gufunc object.

To control the size of the thread block we use (e.g. for 32 threds per block):

```very_complex_kernel.max_blocksize = 32
```

The speedup was of

## Manual memory management

The performance increases when the memory is manually managed. We'll run the same test as before, but now we'll copy the arrays to the GPU's memory and also allocate memory for the result.

``````

In [11]:

n=4e6
dim=2

print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'

a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)

dc = numbapro.cuda.device_array_like(a)
da = numbapro.cuda.to_device(a)
db = numbapro.cuda.to_device(b)

def check_pure_compute_time(da,db,dc):
batch_matrix_mult(da,db,out=dc)
numbapro.cuda.synchronize()

import numpy.core.umath_tests as ut

print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)

print 'Memory management'
batch_matrix_mult.max_blocksize = 64  # limits to 32 threads per block

print 'Automatic:\t\t',
%timeit c=batch_matrix_mult(a,b)
print 'Manual:\t\t\t',
%timeit check_pure_compute_time(da,db,dc)

del n,dim,a,b,da,db,dc

``````
``````

Memory for both matrices:	125000.0  KBytes
Unoptimized	1 loops, best of 3: 161 ms per loop
Memory management
Automatic:		1 loops, best of 3: 239 ms per loop
Manual:			10 loops, best of 3: 115 ms per loop

``````

## CUDA Execution Model

Kernel function

• are only visible o the host CPU
• cannot return any value (use output argument)
• associates to a grid

A grid is a group of blocks (that can be 1D,2D or 3D) and each block is a group of threads (which can also be 1D, 2D or 3D). Every thread will execute the same kernel function.

When we're using @jit, we're writing functions on the CUDA execution model. However, we wrote functions that executed on GPU before with a return type (which can't happen here). That is because before we were using the @vectorize and not working under the CUDA execution model.

# DEBUG

``````

In [12]:

@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],'(m,n),(n,p)->(m,p)',target='gpu')
def matrix_mult(a,b,c):
for i in range(c.shape[0]):
for j in range(c.shape[1]):
tmp=0
for n in range(a.shape[1]):
tmp += a[i,n] * b[n,j]
c[i,j]=tmp
#batch_matrix_mult.max_blocksize = 32  # limits to 32 threads per block

``````
``````

In [13]:

n=4e6
dim=2

print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'

a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)

dc = numbapro.cuda.device_array_like(a)
da = numbapro.cuda.to_device(a)
db = numbapro.cuda.to_device(b)

def check_pure_compute_time(da,db,dc):
batch_matrix_mult(da,db,out=dc)
numbapro.cuda.synchronize()

del a,b,da,db,dc

``````
``````

Memory for both matrices:	125000.0  KBytes

``````
``````

In [47]:

tpb=10e6 / (10*10)
print tpb
print np.sqrt(tpb)

``````
``````

100000.0
316.227766017

``````
``````

In [46]:

from numbapro import *
from timeit import default_timer as timer

m=1000
n=1000
A = np.array(np.random.random((n,m)), dtype=np.float32)
C = np.empty([n,n])
B = np.array(np.random.random((m,n)), dtype=np.float32)

%timeit X=np.dot(A,B)

@cuda.jit(void(float32[:,:],float32[:,:],float32[:,:]))
def cu_square_matrix_mul(A, B, C):

bx = cuda.blockIdx.x
by = cuda.blockIdx.y

bw = cuda.blockDim.x
bh = cuda.blockDim.y

x = tx + bx * bw
y = ty + by * bh

n = C.shape[0]

if x >= n or y >= n:
return

cs = 0
for i in range(n):
cs += A[y,i]*B[i,x]
C[y,x]= cs

def check_pure_compute_time(dA,dB,dC):
stream = cuda.stream()
blockdim = 10,10
griddim = 10,10
cu_square_matrix_mul[griddim,blockdim,stream](dA, dB,dC)
dC.to_host()
stream.synchronize()

stream = cuda.stream()
dA = cuda.to_device(A, stream)
dC = cuda.to_device(C, stream)
B = np.array(np.random.random((m,n)), dtype=np.float32)
dB = cuda.to_device(B, stream)

check_pure_compute_time(dA,dB,dC)
"""
print
print("Numpy took    %f seconds" % numpy_time)
print("CUDA JIT took %f seconds, %.5fx speedup" % (cuda_time, numpy_time / cuda_time))"""

``````
``````

10 loops, best of 3: 46 ms per loop

---------------------------------------------------------------------------
CudaAPIError                              Traceback (most recent call last)
<ipython-input-46-99de67136a24> in <module>()
49 dB = cuda.to_device(B, stream)
50
---> 51 check_pure_compute_time(dA,dB,dC)
52 """
53 print

<ipython-input-46-99de67136a24> in check_pure_compute_time(dA, dB, dC)
37     blockdim = 32,32
38     griddim = 10,10
---> 39     cu_square_matrix_mul[griddim,blockdim,stream](dA, dB,dC)
40     dC.to_host()
41     stream.synchronize()

/home/chiroptera/anaconda/lib/python2.7/site-packages/numba/cuda/compiler.pyc in __call__(self, *args, **kwargs)
305                           blockdim=self.blockdim,
306                           stream=self.stream,
--> 307                           sharedmem=self.sharedmem)
308
309     def bind(self):

/home/chiroptera/anaconda/lib/python2.7/site-packages/numba/cuda/compiler.pyc in _kernel_call(self, args, griddim, blockdim, stream, sharedmem)
345                                    sharedmem=sharedmem)
346         # invoke kernel
--> 347         cu_func(*args)
348
349         if self.debug:

1043
1044         launch_kernel(self.handle, self.griddim, self.blockdim,
-> 1045                       self.sharedmem, streamhandle, args)
1046
1047     @property

/home/chiroptera/anaconda/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.pyc in launch_kernel(cufunc_handle, griddim, blockdim, sharedmem, hstream, args)
1087                           hstream,
1088                           params,
-> 1089                           None)
1090
1091

213         def safe_cuda_api_call(*args):
214             retcode = libfn(*args)
--> 215             self._check_error(fname, retcode)
216
217         setattr(self, fname, safe_cuda_api_call)

243             errname = ERROR_MAP.get(retcode, "UNKNOWN_CUDA_ERROR")
244             msg = "Call to %s results in %s" % (fname, errname)
--> 245             raise CudaAPIError(retcode, msg)
246
247     def get_device(self, devnum=0):

CudaAPIError: Call to cuLaunchKernel results in CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

``````
``````

In [25]:

``````
``````

In [29]:

my_gpu = numba.cuda.get_current_device()
print "Running on GPU:", my_gpu.name
cores_per_capability = {1: 8,2: 32,3: 192,}
cc = my_gpu.compute_capability
print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
majorcc = cc[0]
print "Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT
cores_per_multiprocessor = cores_per_capability[majorcc]
print "Number of cores per mutliprocessor:", cores_per_multiprocessor
total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
print "Number of cores on GPU:", total_cores

``````
``````

Running on GPU: GeForce GT 520M
Compute capability:  2.1 (Numba requires >= 2.0)
Number of streaming multiprocessor: 1
Number of cores per mutliprocessor: 32
Number of cores on GPU: 32

``````
``````

In [19]:

%qtconsole

``````

# Distance calculation

``````

In [2]:

import numbapro
from numbapro import *

``````
``````

In [88]:

def dist(a,b,c):
N,K = c.shape
dim = a.shape[1]

for n in range(N):
for k in range(K):
dist=0
for d in range(dim):
diff = a[n,d]-b[k,d]
dist += np.power(diff,2)
c[n,k]=dist

``````
``````

In [ ]:

def dist2(a,b,c):
N,K = c.shape
dim = a.shape[1]

for k in range(K):
dist = a - b[k]
dist = dist ** 2
c[:,k] = dist.sum(axis=1)

``````
``````

In [23]:

@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
'(m,n),(p,n)->(m,p)',target='gpu')
def guvect_dist(a,b,c):
N,K = c.shape
dim = a.shape[1]

for n in range(N):
for k in range(K):
dist=0
for d in range(dim):
diff = a[n,d]-b[k,d]
#dist += np.power(diff,2)
dist += diff ** 2
c[n,k]=dist

``````
``````

In [179]:

@numbapro.cuda.jit("void(float32[:,:], float32[:,:], float64[:,:])")
def cu_dist(a,b,c):
"""

# block ID
bx = cuda.blockIdx.x
by = cuda.blockIdx.y

# block dimensions
bw = cuda.blockDim.x
bh = cuda.blockDim.y

# compute thread's x and y index (i.e. datapoint and cluster)
n = tx + bx * bw # row for each datapoint
k = ty + by * bh # column for each cluster
"""

n,k = numbapro.cuda.grid(2)

ch, cw = c.shape # c width and height

if n >= ch or k >= cw:
return

dist = 0.0
for d in range(a.shape[1]):
diff = a[n,d]-b[k,d]
dist += diff ** 2
c[n,k]= dist

def cu_dist_wrap(a,b,c,blockDim,gridDim):
dA = cuda.to_device(a)
dB = cuda.to_device(b)
#dC = cuda.to_device(c)
#dC = numbapro.cuda.device_array(c)
dC = numbapro.cuda.device_array_like(c)
#stream = cuda.stream()
cu_dist[gridDim,blockDim](dA,dB,dC)
#dC.to_host()
#stream.synchronize() #block untill stream is done
dC.copy_to_host(ary=c)

``````
``````

In [171]:

#%debug
##generate data
n = 1e4
d = 2
k = 20

total_bytes = (n * d + k * d + n * k) * 4
print 'Memory used by arrays:\t',total_bytes/1024,'\tKBytes'
print '\t\t\t',total_bytes/(1024*1024),'\tMBytes'

data = np.random.random((n,d)).astype(np.float32)
clusters = np.random.random((k,d)).astype(np.float32)
#dists = np.empty((n,k)).astype(np.float32)
distsDim = np.int(n),np.int(k)

blockDim = 8,6 # GT520M has 48 CUDA cores

numBlocks = np.ceil((n * k) / 48)
gridSquareSide = np.int(np.ceil(np.sqrt(numBlocks)))
gridDim = gridSquareSide,gridSquareSide

print  np.prod(gridDim)*np.prod(blockDim) - (n * k)

print '\n','Shape of dists:',dists.shape

``````
``````

Memory used by arrays:	859.53125 	KBytes
0.839385986328 	MBytes

Shape of dists: (10000, 20)

``````
``````

In [172]:

distsCUDA = np.empty(distsDim)
distsCPU = np.empty(distsDim)

dist2(data,clusters,distsCPU)
cu_dist_wrap(data,clusters,distsCUDA,blockDim,gridDim)

``````
``````

In [173]:

#%timeit -r 1 dist(data,clusters,dists)
#%timeit -r 1 guvect_dist(data,clusters,dists)
%timeit cu_dist_wrap(data,clusters,distsCUDA,blockDim,gridDim)
#print type(dists)
#del data,clusters,dists

``````
``````

100 loops, best of 3: 3.01 ms per loop

``````
``````

In [85]:

speedup=2.92 / 2.58e-3
print speedup

``````
``````

1131.78294574

``````