This notebook was originally presented as part of a talk at PGDay Austin 2016 by Davin Potts (davin AT discontinuity DOT net).
The talk built up to this notebook by first providing stories about computer vision technologies employed by Stipple, Inc., providing pointers to how similar functionality could be implemented by others using Python+NumPy+scikit-learn, and then introducing Numba to drastically accelerate the execution of such functionality inside a postgres database (performing data-intensive computations where the data lives).
Links mentioned during the talk:
Let's create a compute-intensive Python function, compile it using Numba inside PL/Python, then test its relative speedup.
In [1]:
import psycopg2
We'll create two functions in PL/Python. The first does the work of defining the function and even compiling it with Numba. The second performs a speedtest to compare the pure-Python version of the function with the Numba compiled version of it.
Imagining a more real-world use case, the first might be a function we want to make fast and the second is perhaps the code that uses that function (which could have been conventional SQL, no Python needed).
In [ ]:
with psycopg2.connect(database='radon_fingerprints',
host='localhost',
port=5437) as conn:
with conn.cursor() as cursor:
cursor.execute("CREATE EXTENSION IF NOT EXISTS plpythonu;")
cursor.execute("""
CREATE OR REPLACE FUNCTION prep_sum2d ()
RETURNS integer
AS $$
# GD is a so-called "global dictionary" made available
# to us by PL/Python. It allows us to share information
# across functions/code within a single session. We
# will use it to "share" our Numba-compiled function
# with a different Python function defined later in
# this cell.
if 'numba' in GD and 'numpy' in GD:
numpy = GD['numpy']
numba = GD['numba']
else:
import numpy
import numba
GD['numpy'] = numpy
GD['numba'] = numba
# Define our compute-intensive function to play with.
# (This is the example offered on the main Numba webpage.)
def sum2d(arr):
M, N = arr.shape
result = 0.0
for i in range(M):
for j in range(N):
result += arr[i,j]
return result
# Store it in PL/Python's special 'GD' dict for ease of later use.
GD['sum2d'] = sum2d
# Compile a version of sum2d using Numba, and store it for later use.
jitsum2d = numba.jit(sum2d, target='gpu')
csum2d = jitsum2d.compile(numba.double(numba.double[:,::1]))
GD['jitsum2d'] = jitsum2d
GD['csum2d'] = csum2d
return 1
$$ LANGUAGE plpythonu;
""")
#cursor.execute("DROP FUNCTION speedtest_sum2d();")
cursor.execute("""
CREATE OR REPLACE FUNCTION speedtest_sum2d ()
RETURNS float
AS $$
import time
if 'numba' in GD and 'numpy' in GD:
numpy = GD['numpy']
numba = GD['numba']
else:
import numpy
import numba
GD['numpy'] = numpy
GD['numba'] = numba
sum2d = GD['sum2d']
jitsum2d = GD['jitsum2d']
csum2d = GD['csum2d']
# Create some random input data to play with.
arr = numpy.random.randn(100, 100)
# Exercise the pure-Python function, sum2d.
start = time.time()
res = sum2d(arr)
duration = time.time() - start
plpy.log("Result from python is %s in %s (msec)" % (res, duration*1000))
csum2d(arr) # Warm up
# Exercise the Numba version of that same function, csum2d.
start = time.time()
res = csum2d(arr)
duration2 = time.time() - start
plpy.log("Result from compiled is %s in %s (msec)" % (res, duration2*1000))
plpy.log("Speed up is %s" % (duration / duration2))
return (duration / duration2)
$$ LANGUAGE plpythonu;
""")
conn.commit()
In [3]:
with psycopg2.connect(database='radon_fingerprints',
host='localhost',
port=5437) as conn:
with conn.cursor() as cursor:
cursor.execute("SELECT prep_sum2d();")
rows = cursor.fetchall()
conn.commit()
rows
Out[3]:
The messages from plpy.log()
will end up in the postgres logfile. The output from running this function will be the relative speedup of using the Numba version of the function. Speaking more generally, the amount of speedup will, of course, vary depending upon the amount and character of data being consumed, the nature of the algorithm, how data is consumed from a table (presumably what you'd likely be doing inside postgres), et cetera, et cetera.
In [4]:
with psycopg2.connect(database='radon_fingerprints',
host='localhost',
port=5437) as conn:
with conn.cursor() as cursor:
cursor.execute("SELECT prep_sum2d();")
cursor.execute("SELECT speedtest_sum2d();")
rows = cursor.fetchall()
conn.commit()
rows
Out[4]:
In [ ]: