Very Brief Demo of Numba Speedup inside PostgreSQL

Background

This notebook was originally presented as part of a talk at PGDay Austin 2016 by Davin Potts (davin AT discontinuity DOT net).

The talk built up to this notebook by first telling stories about computer-vision technologies employed by Stipple, Inc., then pointing to how similar functionality could be implemented by others using Python + NumPy + scikit-learn, and finally introducing Numba as a way to drastically accelerate the execution of such functionality inside a postgres database (performing data-intensive computations where the data lives).

Links mentioned during the talk:

Purpose

Let's create a compute-intensive Python function, compile it using Numba inside PL/Python, then test its relative speedup.


In [1]:
import psycopg2

Create Compute-Intensive Python Function

We'll create two functions in PL/Python. The first defines our compute-intensive function and compiles it with Numba. The second runs a speed test comparing the pure-Python version of the function against its Numba-compiled counterpart.

In a more real-world use case, the first might be a function we want to make fast, and the second might be the code that uses it (which could just as well have been conventional SQL, with no Python needed).
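Before wiring this into PostgreSQL, the same pattern can be sketched in plain Python. This is a minimal illustration of the approach, not the in-database code: it jit-compiles `sum2d` with Numba when available, and (as an assumption for portability) falls back to the uncompiled function if Numba is not installed, in which case no speedup will be observed.

```python
import time

import numpy as np

try:
    import numba
    jit = numba.njit  # nopython-mode compilation
except ImportError:
    # Fallback so the sketch still runs without Numba:
    # "compilation" becomes a no-op.
    jit = lambda f: f

def sum2d(arr):
    # Naive double loop over a 2-D array; slow in pure Python,
    # fast once compiled by Numba.
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

csum2d = jit(sum2d)

arr = np.random.randn(100, 100)
csum2d(arr)  # warm-up call triggers compilation

start = time.time()
res_py = sum2d(arr)
t_py = time.time() - start

start = time.time()
res_c = csum2d(arr)
t_c = time.time() - start

print("pure Python: %.3f msec, compiled: %.3f msec" % (t_py * 1e3, t_c * 1e3))
```

The warm-up call matters: Numba compiles on the first invocation, so timing that first call would mix compilation cost into the measurement.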


In [ ]:
with psycopg2.connect(database='radon_fingerprints',
                      host='localhost',
                      port=5437) as conn:
    with conn.cursor() as cursor:
        cursor.execute("CREATE EXTENSION IF NOT EXISTS plpythonu;")
        cursor.execute("""
            CREATE OR REPLACE FUNCTION prep_sum2d ()
                RETURNS integer
            AS $$

                # GD is a so-called "global dictionary" made available
                # to us by PL/Python.  It allows us to share information
                # across functions/code within a single session.  We
                # will use it to "share" our Numba-compiled function
                # with a different Python function defined later in
                # this cell.
                if 'numba' in GD and 'numpy' in GD:
                    numpy = GD['numpy']
                    numba = GD['numba']
                else:
                    import numpy
                    import numba
                    GD['numpy'] = numpy
                    GD['numba'] = numba

                # Define our compute-intensive function to play with.
                # (This is the example offered on the main Numba webpage.)
                def sum2d(arr):
                    M, N = arr.shape
                    result = 0.0
                    for i in range(M):
                        for j in range(N):
                            result += arr[i,j]
                    return result
                
                # Store it in PL/Python's special 'GD' dict for ease of later use.
                GD['sum2d'] = sum2d

                # Compile a version of sum2d using Numba, and store it for later use.
                jitsum2d = numba.jit(sum2d, nopython=True)
                csum2d = jitsum2d.compile(numba.double(numba.double[:,::1]))
                GD['jitsum2d'] = jitsum2d
                GD['csum2d'] = csum2d

                return 1

            $$ LANGUAGE plpythonu;
        """)
        #cursor.execute("DROP FUNCTION speedtest_sum2d();")
        cursor.execute("""
            CREATE OR REPLACE FUNCTION speedtest_sum2d ()
                RETURNS float
            AS $$

                import time

                if 'numba' in GD and 'numpy' in GD:
                    numpy = GD['numpy']
                    numba = GD['numba']
                else:
                    import numpy
                    import numba
                    GD['numpy'] = numpy
                    GD['numba'] = numba

                sum2d = GD['sum2d']
                jitsum2d = GD['jitsum2d']
                csum2d = GD['csum2d']

                # Create some random input data to play with.
                arr = numpy.random.randn(100, 100)

                # Exercise the pure-Python function, sum2d.
                start = time.time()
                res = sum2d(arr)
                duration = time.time() - start
                plpy.log("Result from python is %s in %s (msec)" % (res, duration*1000))

                csum2d(arr)       # Warm up

                # Exercise the Numba version of that same function, csum2d.
                start = time.time()
                res = csum2d(arr)
                duration2 = time.time() - start
                plpy.log("Result from compiled is %s in %s (msec)" % (res, duration2*1000))

                plpy.log("Speed up is %s" % (duration / duration2))

                return (duration / duration2)

            $$ LANGUAGE plpythonu;
        """)
        conn.commit()

Quick Test on the Setup Function


In [3]:
with psycopg2.connect(database='radon_fingerprints',
                      host='localhost',
                      port=5437) as conn:
    with conn.cursor() as cursor:
        
        cursor.execute("SELECT prep_sum2d();")
        
        rows = cursor.fetchall()
        conn.commit()

rows


Out[3]:
[(1,)]

Compare Performance of the Numba Compiled Version with the Pure-Python Function

The messages from plpy.log() end up in the postgres logfile. The return value of this function is the relative speedup of the Numba-compiled version. More generally, the amount of speedup will of course vary with the amount and character of the data being consumed, the nature of the algorithm, how data is read from a table (presumably what you'd be doing inside postgres), and so on.
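To see how the speedup varies with problem size, the same timing harness can be repeated at several array sizes. This is a hypothetical sketch run outside the database; the try/except fallback (ratios near 1 without Numba installed) and the specific sizes are assumptions for illustration.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    njit = lambda f: f  # no-op fallback: ratios stay near 1 without Numba

def sum2d(arr):
    # Same naive double-loop sum as in the PL/Python function.
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

csum2d = njit(sum2d)

speedups = {}
for n in (10, 100, 300):
    arr = np.random.randn(n, n)
    csum2d(arr)  # warm up; compilation happens on the first iteration

    start = time.perf_counter()
    sum2d(arr)
    t_py = time.perf_counter() - start

    start = time.perf_counter()
    csum2d(arr)
    t_c = time.perf_counter() - start

    # Guard against a zero denominator on very coarse clocks.
    speedups[n] = t_py / max(t_c, 1e-9)

for n, s in speedups.items():
    print("%4d x %-4d -> %.1fx" % (n, n, s))
```

Single-run timings like these are noisy; for anything beyond a demo, averaging over repeated calls (e.g. with `timeit`) gives steadier ratios.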


In [4]:
with psycopg2.connect(database='radon_fingerprints',
                       host='localhost',
                       port=5437) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT prep_sum2d();")
        
        cursor.execute("SELECT speedtest_sum2d();")
        
        rows = cursor.fetchall()
        conn.commit()

rows


Out[4]:
[(198.829787234,)]
