jseval
Function TutorialMLDB provides a complete implementation of the SQL SELECT statement. Most of the functions you are used to using are available in your queries.
MLDB also supports additional functions that extend standard SQL in very interesting ways. One of those function is the jseval
function that can be used to execute arbitrary JavaScript code inline in an SQL query.
In this tutorial, we will show some basic usage example followed by two different use-cases for the jseval
function:
Before we begin, let's start by importing the pymldb
library so we can make REST API calls to MLDB. You can check out the Using pymldb
Tutorial for more details.
In [1]:
from pymldb import Connection
mldb = Connection("http://localhost")
In [33]:
mldb.query("""
SELECT
jseval('
return val * 2;
','val', 5) AS output
""")
Out[33]:
The variable val
takes the input value 5 and the code is then evaluated.
Our function can also take in multiple parameters as input, and return different output values:
In [35]:
mldb.query("""
SELECT
jseval('
var output = {};
output["mult"] = val * 2;
output["name"] = str_val + " Hello!";
return output;
','val,str_val', 5, 'Bonjour!') AS output
""")
Out[35]:
In the above example, the string val,str_val
mean that the function takes 2 input variables. Those values will be 5 and the string Bonjour!
. Since we return a JavaScript object, we essentially return a row where the keys are the objects' keys and the cell values are the object's values.
Now that we have the basics in place, let's continue to a real use-case below.
In the Loading Data From An HTTP Server Tutorial tutorial, we loaded a specific file from an archive that was located on the Stanford Network Analysis Project (SNAP) website.
The dataset contains all the circles of friends in which user no. 3980 is part of. Each row represents a circle of friends, and all the users that are part of that circle will be enumerated on the line.
Let's check's out the unformated version of the data first, by running the import.text
procedure:
In [44]:
dataUrl = "http://snap.stanford.edu/data/facebook.tar.gz"
mldb.put("/v1/procedures/import_data", {
"type": "import.text",
"params": {
"dataFileUrl": "archive+" + dataUrl + "#facebook/3980.circles",
"delimiter": " ",
"quoteChar": "",
"outputDataset": "import_URL2",
"runOnCreation": True
}
})
mldb.query("SELECT * NAMED rowName() FROM import_URL2 LIMIT 10")
Out[44]:
We see that each line contains the circle number followed by user ids. This type of data is an ideal candidate for MLDB, since we can store it as bags of words, or rather, bags of friends. A dataset of type sparse.mutable can store sparse representations like this one very efficiently.
Normally, we could use the tokenize
function to deal with data like this. However, since splitting the data on the <TAB>
character yields a variable number of columns, the standard way of importing this won't work very nicely in the import.text
procedure.
In the code below, we will use the jseval
function to do the following in JavaScript:
<TAB>
characterrowName
in the object (circle0, circle1, etc...)
In [2]:
dataUrl = "http://snap.stanford.edu/data/facebook.tar.gz"
print mldb.put("/v1/procedures/import_non_formated", {
"type": "import.text",
"params": {
"dataFileUrl": "archive+" + dataUrl + "#facebook/3980.circles",
"headers": ["circles"],
"select": """
jseval('
var row_val = val.split("\t");
var rtn = {};
rtn["rowName"] = row_val[0];
for(i=1; i<row_val.length; i++) {
rtn[row_val[i]] = 1;
}
return rtn;
','val', circles) AS *
""",
"outputDataset": {
"id": "import_non_formated",
"type": "sparse.mutable"
},
"runOnCreation": True
}
})
We can now run a SELECT
query on the resulting dataset and get a nice sparse representation:
In [3]:
mldb.query("""
SELECT * EXCLUDING(rowName)
NAMED rowName
FROM import_non_formated
ORDER BY CAST(rowName() AS integer)
LIMIT 5
""")
Out[3]:
We can now answer a simple question like: Is there any friend of user 3980 that appears in more than one of his circle of friends? It can be answered with the following query:
In [13]:
mldb.query("""
SELECT *
FROM transpose(
(
SELECT sum({* EXCLUDING(rowName)}) as *
NAMED 'result'
FROM import_non_formated
)
)
ORDER BY result DESC
LIMIT 5
""")
Out[13]:
Since the maximum value is 1, we now know that the answer to the above question is no.
Although there are other ways to obtain the same result, using jseval
and the dataset of type sparse.mutable
allowed us to transform our data in a single step, without knowing its characteristics in advance. This shows how much added flexibility is added by such a function.
Another very powerful way the jseval
function can be used is as a feature generator. When trying to prototype and iterate quickly, this can be a very efficient way to try out new ideas.
Let's start by creating a toy dataset using the description of machine learning concepts from Wikipedia:
In [17]:
print mldb.put('/v1/procedures/import_ML_concepts', {
"type":"import.text",
"params": {
"dataFileUrl":"http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
"outputDataset": "ml_concepts",
"named": "Concepts",
"select": "Text",
"runOnCreation": True
}
}
)
Taking a peek at our data, we see there is a single column called Text
that contains a textual description of an ML concept:
In [23]:
mldb.query("SELECT * FROM ml_concepts")
Out[23]:
Let's now create a function of type sql.expression
containing a jseval
function that calculates different statistics about the string it is given. It calculates things like the number of words in the string, the number of capital letters, etc.
Putting it in an sql.expression
allows us to reuse it easily later on.
In [21]:
print mldb.put("/v1/functions/getStats", {
"type": "sql.expression",
"params": {
"expression": """
jseval('
var result = {};
result["len"] = txt.length;
result["numWords"] = txt.split(" ").length;
result["numCapital"] = txt.replace(/[^A-Z]/g, "").length;
result["numExpl"] = txt.replace(/[^!]/g, "").length;
result["numQst"] = txt.replace(/[^?]/g, "").length;
result["containsHashSign"] = txt.replace(/[^#]/g, "").length >= 1;
result["numNumbers"] = txt.replace(/[^0-9]/g, "").length;
result["capitalProportion"] = result["numCapital"] / result["len"];
result["explProportion"] = result["numExpl"] / result["len"];
result["qstProportion"] = result["numQst"] / result["len"];
result["numberProportion"] = result["numNumbers"] / result["len"];
return result;
', 'txt', text) as stats
"""
}
})
Now that we have created our getStats
function, we can call it on a single string:
In [25]:
mldb.query("SELECT getStats({text: 'This is a test #hopethisworks #mldb'}) as *")
Out[25]:
Looks like it works! We can also call it on the Text
column of our ml_concepts
dataset to get the statistics for all the rows of our dataset:
In [22]:
mldb.query("SELECT getStats({text: Text}) as * FROM ml_concepts")
Out[22]:
Doing most of this is possible in standard SQL, but the jseval
implementation is simple, fast and compact. This is a great way to quickly experiment with ideas and gives maximum flexibility to manipulate data.
Check out the other Tutorials and Demos.