Executing JavaScript Code Directly in SQL Queries Using the `jseval` Function Tutorial

MLDB provides a complete implementation of the SQL SELECT statement. Most of the functions you are used to using are available in your queries.

MLDB also supports additional functions that extend standard SQL in very interesting ways. One of those function is the jseval function that can be used to execute arbitrary JavaScript code inline in an SQL query.

In this tutorial, we will show some basic usage example followed by two different use-cases for the jseval function:

Formatting data during the import step
Designing custom feature generators

Setting up

Before we begin, let's start by importing the pymldb library so we can make REST API calls to MLDB. You can check out the Using pymldb Tutorial for more details.



In [1]:

    
from pymldb import Connection
mldb = Connection("http://localhost")

Basic usage examples

Let's start by writing a simple SQL query that will multiply an input number by 2 in JavaScript:



In [33]:

    
mldb.query("""
    SELECT
        jseval('
            return val * 2;
        ','val', 5) AS output
""")









    Out[33]:






  
    
      
      output
    
    
      _rowName
      
    
  
  
    
      result
      10

The variable val takes the input value 5 and the code is then evaluated.

Our function can also take in multiple parameters as input, and return different output values:



In [35]:

    
mldb.query("""
    SELECT
        jseval('
            var output = {};
            output["mult"] = val * 2;
            output["name"] = str_val + " Hello!";
            return output;
        ','val,str_val', 5, 'Bonjour!') AS output
""")









    Out[35]:






  
    
      
      output.mult
      output.name
    
    
      _rowName
      
      
    
  
  
    
      result
      10
      Bonjour! Hello!

In the above example, the string val,str_val mean that the function takes 2 input variables. Those values will be 5 and the string Bonjour!. Since we return a JavaScript object, we essentially return a row where the keys are the objects' keys and the cell values are the object's values.

Now that we have the basics in place, let's continue to a real use-case below.

Formatting data during the import step

In the Loading Data From An HTTP Server Tutorial tutorial, we loaded a specific file from an archive that was located on the Stanford Network Analysis Project (SNAP) website.

The dataset contains all the circles of friends in which user no. 3980 is part of. Each row represents a circle of friends, and all the users that are part of that circle will be enumerated on the line.

Let's check's out the unformated version of the data first, by running the import.text procedure:



In [44]:

    
dataUrl = "http://snap.stanford.edu/data/facebook.tar.gz"

mldb.put("/v1/procedures/import_data", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "archive+" + dataUrl + "#facebook/3980.circles",
        "delimiter": " ", 
        "quoteChar": "",
        "outputDataset": "import_URL2",
        "runOnCreation": True
    }
})

mldb.query("SELECT * NAMED rowName() FROM import_URL2 LIMIT 10")









    Out[44]:






  
    
      
      "circle0	3989	4009"
    
    
      _rowName
      
    
  
  
    
      2
      circle1\t4010\t4037
    
    
      3
      circle2\t4013
    
    
      4
      circle3\t4024\t3987\t4015
    
    
      5
      circle4\t4006
    
    
      6
      circle5\t4035
    
    
      7
      circle6\t3999\t4028\t4005\t3996\t4031\t4018\t3...
    
    
      8
      circle7\t3984
    
    
      9
      circle8\t3988\t4030\t4026\t4021
    
    
      10
      circle9\t3983\t3992\t4033\t4017\t4000\t3986
    
    
      11
      circle10\t3990\t4007\t4016\t4025

We see that each line contains the circle number followed by user ids. This type of data is an ideal candidate for MLDB, since we can store it as bags of words, or rather, bags of friends. A dataset of type sparse.mutable can store sparse representations like this one very efficiently.

Normally, we could use the tokenize function to deal with data like this. However, since splitting the data on the <TAB> character yields a variable number of columns, the standard way of importing this won't work very nicely in the import.text procedure.

In the code below, we will use the jseval function to do the following in JavaScript:

create an empty object
split each line on the <TAB> character
store the first element of each line under the key rowName in the object (circle0, circle1, etc...)
store all remaining elements of the line using the element's name as the key, and the number 1 as the value



In [2]:

    
dataUrl = "http://snap.stanford.edu/data/facebook.tar.gz"

print mldb.put("/v1/procedures/import_non_formated", {
    "type": "import.text",
    "params": {
        "dataFileUrl": "archive+" + dataUrl + "#facebook/3980.circles",
        "headers": ["circles"],
        "select": """
            jseval('
                var row_val = val.split("\t");
                var rtn = {};
                rtn["rowName"] = row_val[0];
                for(i=1; i<row_val.length; i++) {
                    rtn[row_val[i]] = 1;
                }
                return rtn;
                ','val', circles) AS *
        """,
        "outputDataset": {
            "id": "import_non_formated",
            "type": "sparse.mutable"
        },
        "runOnCreation": True
    }
})









    



<Response [201]>

We can now run a SELECT query on the resulting dataset and get a nice sparse representation:



In [3]:

    
mldb.query("""
    SELECT * EXCLUDING(rowName)
    NAMED rowName
    FROM import_non_formated 
    ORDER BY CAST(rowName() AS integer) 
    LIMIT 5
""")

We can now answer a simple question like: Is there any friend of user 3980 that appears in more than one of his circle of friends? It can be answered with the following query:



In [13]:

    
mldb.query("""
    SELECT *
    FROM transpose(
        (
            SELECT sum({* EXCLUDING(rowName)}) as * 
            NAMED 'result'
            FROM import_non_formated
        )
    )
    ORDER BY result DESC
    LIMIT 5
""")

Since the maximum value is 1, we now know that the answer to the above question is no.

Although there are other ways to obtain the same result, using jseval and the dataset of type sparse.mutable allowed us to transform our data in a single step, without knowing its characteristics in advance. This shows how much added flexibility is added by such a function.

Designing custom feature generators

Another very powerful way the jseval function can be used is as a feature generator. When trying to prototype and iterate quickly, this can be a very efficient way to try out new ideas.

Let's start by creating a toy dataset using the description of machine learning concepts from Wikipedia:



In [17]:

    
print mldb.put('/v1/procedures/import_ML_concepts', {
        "type":"import.text",
        "params": {
            "dataFileUrl":"http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
            "outputDataset": "ml_concepts",
            "named": "Concepts",
            "select": "Text",
            "runOnCreation": True
        }
    }
)









    



<Response [201]>

Taking a peek at our data, we see there is a single column called Text that contains a textual description of an ML concept:



In [23]:

    
mldb.query("SELECT * FROM ml_concepts")









    Out[23]:






  
    
      
      Text
    
    
      _rowName
      
    
  
  
    
      Artificial neural network
      In machine learning and cognitive science, art...
    
    
      Autoencoder
      An autoencoder, autoassociator or Diabolo netw...
    
    
      Hopfield network
      A Hopfield network is a form of recurrent arti...
    
    
      Boltzmann machine
      Boltzmann machine is a type of stochastic recu...
    
    
      Restricted boltzmann machines
      A restricted Boltzmann machine (RBM) is a gene...
    
    
      Deep belief network
      In machine learning, a deep belief network (DB...
    
    
      Logistic regression
      In statistics, logistic regression, or logit r...
    
    
      Naive bayes classifier
      In machine learning, naive Bayes classifiers a...
    
    
      Support vector machine
      In machine learning, support vector machines (...

Let's now create a function of type sql.expression containing a jseval function that calculates different statistics about the string it is given. It calculates things like the number of words in the string, the number of capital letters, etc.

Putting it in an sql.expression allows us to reuse it easily later on.



In [21]:

    
print mldb.put("/v1/functions/getStats", {
    "type": "sql.expression",
    "params": {
        "expression": """
            jseval(' 
                var result = {};

                result["len"] = txt.length;
                result["numWords"] = txt.split(" ").length;
                result["numCapital"] = txt.replace(/[^A-Z]/g, "").length;
                result["numExpl"] = txt.replace(/[^!]/g, "").length;
                result["numQst"] = txt.replace(/[^?]/g, "").length;
                result["containsHashSign"] = txt.replace(/[^#]/g, "").length >= 1;
                result["numNumbers"] = txt.replace(/[^0-9]/g, "").length;

                result["capitalProportion"] = result["numCapital"] / result["len"];
                result["explProportion"] = result["numExpl"] / result["len"];
                result["qstProportion"] = result["numQst"] / result["len"];
                result["numberProportion"] = result["numNumbers"] / result["len"];

                return result;
            ', 'txt', text) as stats
        """
    }
})









    



<Response [201]>

Now that we have created our getStats function, we can call it on a single string:



In [25]:

    
mldb.query("SELECT getStats({text: 'This is a test #hopethisworks #mldb'}) as *")









    Out[25]:






  
    
      
      stats.capitalProportion
      stats.containsHashSign
      stats.explProportion
      stats.len
      stats.numCapital
      stats.numExpl
      stats.numNumbers
      stats.numQst
      stats.numWords
      stats.numberProportion
      stats.qstProportion
    
    
      _rowName
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      result
      0.028571
      true
      0
      35
      1
      0
      0
      0
      6
      0
      0

Looks like it works! We can also call it on the Text column of our ml_concepts dataset to get the statistics for all the rows of our dataset:



In [22]:

    
mldb.query("SELECT getStats({text: Text}) as * FROM ml_concepts")









    Out[22]:






  
    
      
      stats.capitalProportion
      stats.containsHashSign
      stats.explProportion
      stats.len
      stats.numCapital
      stats.numExpl
      stats.numNumbers
      stats.numQst
      stats.numWords
      stats.numberProportion
      stats.qstProportion
    
    
      _rowName
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Artificial neural network
      0.012469
      false
      0
      401
      5
      0
      1
      0
      60
      0.002494
      0
    
    
      Autoencoder
      0.010582
      false
      0
      378
      4
      0
      7
      0
      54
      0.018519
      0
    
    
      Hopfield network
      0.016461
      false
      0
      486
      8
      0
      10
      0
      73
      0.020576
      0
    
    
      Boltzmann machine
      0.015890
      false
      0
      944
      15
      0
      5
      0
      140
      0.005297
      0
    
    
      Restricted boltzmann machines
      0.020817
      false
      0
      1249
      26
      0
      14
      0
      180
      0.011209
      0
    
    
      Deep belief network
      0.021021
      false
      0
      999
      21
      0
      3
      0
      155
      0.003003
      0
    
    
      Logistic regression
      0.010979
      false
      0
      1093
      12
      0
      7
      0
      158
      0.006404
      0
    
    
      Naive bayes classifier
      0.016049
      false
      0
      1620
      26
      0
      19
      0
      238
      0.011728
      0
    
    
      Support vector machine
      0.012709
      false
      0
      1495
      19
      0
      2
      0
      236
      0.001338
      0

Doing most of this is possible in standard SQL, but the jseval implementation is simple, fast and compact. This is a great way to quickly experiment with ideas and gives maximum flexibility to manipulate data.

Where to next?

Check out the other Tutorials and Demos.

	3989	4009	4010	4037	4013	3987	4015	4024	4006
_rowName
circle0	1	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
circle1	NaN	NaN	1	1	NaN	NaN	NaN	NaN	NaN
circle2	NaN	NaN	NaN	NaN	1	NaN	NaN	NaN	NaN
circle3	NaN	NaN	NaN	NaN	NaN	1	1	1	NaN
circle4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1

	"circle0 3989 4009"
_rowName
2	circle1\t4010\t4037
3	circle2\t4013
4	circle3\t4024\t3987\t4015
5	circle4\t4006
6	circle5\t4035
7	circle6\t3999\t4028\t4005\t3996\t4031\t4018\t3...
8	circle7\t3984
9	circle8\t3988\t4030\t4026\t4021
10	circle9\t3983\t3992\t4033\t4017\t4000\t3986
11	circle10\t3990\t4007\t4016\t4025

	Text
_rowName
Artificial neural network	In machine learning and cognitive science, art...
Autoencoder	An autoencoder, autoassociator or Diabolo netw...
Hopfield network	A Hopfield network is a form of recurrent arti...
Boltzmann machine	Boltzmann machine is a type of stochastic recu...
Restricted boltzmann machines	A restricted Boltzmann machine (RBM) is a gene...
Deep belief network	In machine learning, a deep belief network (DB...
Logistic regression	In statistics, logistic regression, or logit r...
Naive bayes classifier	In machine learning, naive Bayes classifiers a...
Support vector machine	In machine learning, support vector machines (...