MLDB comes with a powerful implementation of SQL's SELECT
Queries. This tutorial will walk you through the basics of SELECT
, and some MLDB-specific features.
The notebook cells below use pymldb
's .query()
method; you can check out the Using pymldb
Tutorial for more details.
In [1]:
from pymldb import Connection
mldb = Connection()
In [2]:
mldb.query("""
select 1
""")
Out[2]:
Of course we can ask for more: the query below does a little math and shows how you can rename your columns with the as
keyword. Note that single-quotes ('
) are used to denote strings and double-quotes ("
) denote column names, both of which can contain any Unicode character.
In [3]:
mldb.query("""
select 1+1, 3+4 as var, 'UTF8 striñg' as "hello, François"
""")
Out[3]:
We can use a variety of operators in a SELECT
expression, like this:
In [4]:
mldb.query("""
select
1 between 0 and 2,
2 in (1,2,3),
3 is integer,
(case when 4<5 then 'yes' else 'no' end)
""")
Out[4]:
In [5]:
print mldb.put('/v1/procedures/import_titanic', {
"type": "import.text",
"params": {
"dataFileUrl": "http://public.mldb.ai/titanic_train.csv",
"outputDataset": "titanic",
"runOnCreation": True
}
})
Now let's query all columns with the star (*
) operator FROM
our titanic
dataset, using the LIMIT
keyword to avoid getting too much output.
In [6]:
mldb.query("""
select *
from titanic
limit 10
""")
Out[6]:
We can also ask for just certain columns by name.
In [7]:
mldb.query("""
select Name, Age
from titanic
limit 10
""")
Out[7]:
In [8]:
mldb.query("""
select Name, Age
from titanic
order by Age desc
limit 10
""")
Out[8]:
In [9]:
mldb.query("""
select Name, Age, Pclass, Sex, SibSp, Parch, label
from titanic
where Pclass in (1,3) and Sex='female' and (SibSp>3 or Parch=2) and label=1 and Age is null
""")
Out[9]:
In the query above we used the special operator is
to retrieve only rows where Age is null
. This is worth pointing out because null
is a special value in SQL: it means "unknown". null
has some strange properties, as you can see below: any comparison between Age
and 1 returns null
. This makes sense because if, say, Age
is unknown, then we don't know if Age
is less than, equal to or greater than anything else. SQL works according to 3-valued logic).
The only reliable way to check if a value is null is with the is null
operator.
In [10]:
mldb.query("""
select Age, Age = 1, Age < 1, Age > 1, Age + 1, Age / 1, Age is null, Age is not null
from titanic
where Age is null
limit 1
""")
Out[10]:
In [11]:
mldb.query("""
select Name, upper(Name), lower(Name)
from titanic
order by Age desc limit 10
""")
Out[11]:
The functions below are special: they're aggregate functions, so they operate on multiple rows and give you a single output. They operate only on non-null
values of their input.
In [12]:
mldb.query("""
select count(Age), sum(Age), sum(Age)/count(Age), avg(Age), min(Age), max(Age)
from titanic
""")
Out[12]:
The count
aggregate function is special in that it accepts *
as an input, in which case it will return the count of all rows:
In [13]:
mldb.query("""
select count(*)
from titanic
""")
Out[13]:
In [14]:
mldb.query("""
select Pclass, avg(Age)
from titanic
group by Pclass
""")
Out[14]:
You cannot use aggregate functions in a WHERE
clause. The HAVING
clause is a little bit like a WHERE
clause which is applied after GROUP BY
, and in which you can use aggregate functions.
In [15]:
mldb.query("""
select Pclass, avg(Age)
from titanic
group by Pclass
having avg(Age) > 30
""")
Out[15]:
FROM
with subqueriesSQL allows you to use the output of one query as the input to another by putting queries in the FROM
clause, at which point they become "subqueries". The following example shows how to emulate the HAVING
example above with a subquery, although it should be noted that the HAVING
form will be faster in this case.
In [16]:
mldb.query("""
select *
from (
select Pclass, avg(Age) as mean_age
from titanic
group by Pclass
)
where mean_age > 30
""")
Out[16]:
In [17]:
not_supported = """
select Pclass, avg(Age) as mean_age
into class_stats
from titanic
group by Pclass
"""
supported = mldb.post('/v1/procedures', {
"type": "transform",
"params": {
"inputData": """
select Pclass, avg(Age) as mean_age
from titanic
group by Pclass
""",
"outputDataset": "class_stats",
"runOnCreation": True
}
})
print supported
We can now query our new dataset!
In [18]:
mldb.query("""
select *
from class_stats
""")
Out[18]:
In [19]:
mldb.query("""
select titanic.Name, titanic.Pclass, class_stats.*
from titanic JOIN class_stats ON titanic.Pclass = class_stats.Pclass
order by Age desc limit 10
""")
Out[19]:
MLDB has some notable differences with more conventional SQL databases like PostgreSQL, MySQL, Oracle or SQLServer. For example, MLDB datasets are not SQL tables:
In order to accomodate this, MLDB provides a number of extensions to standard SQL. Examples are provided below.
Selecting columns based on a prefix:
In [20]:
mldb.query("""
select P*
from titanic
limit 1
""")
Out[20]:
Renaming columns based on a prefix pattern:
In [21]:
mldb.query("""
select P* as x*
from titanic
limit 1
""")
Out[21]:
Excluding columns from a selection:
In [22]:
mldb.query("""
select * excluding(P*)
from titanic
limit 1
""")
Out[22]:
NOTE: Selecting a column which is not in the dataset will not cause an error, instead it will return NULL
.
In [23]:
mldb.query("""
select nothing
from titanic
limit 1
""")
Out[23]:
MLDB supports JSON-like objects in queries.
In [24]:
mldb.query("""
select {a: 1, b:2, c: 'hello'}
""")
Out[24]:
In [25]:
mldb.query("""
select {a: 1, b:2, c: 'hello'} as obj
""")
Out[25]:
In [26]:
mldb.query("""
select {a: 1, b:{x:2}, c: 'hello'} as *
""")
Out[26]:
This is especially useful for tokenizing text into bags of words, or importing semi-structured JSON data.
In [27]:
mldb.query("""
select tokenize('Hello world, Hello!', {splitChars: ' ,!'}) as *
""")
Out[27]:
In [28]:
mldb.query("""
select parse_json('{"hello":"world","list":[1,2,3,4]}') as *
""")
Out[28]:
MLDB's object notation also allows you to run aggregate functions on multiple columns at once, with the special {*}
notation, which refers to all fields in the current row as an object.
In [29]:
mldb.query("""
select Pclass, count(*), count({*})
from titanic
group by Pclass
""")
Out[29]:
MLDB's flexible output model also enables powerful aggregate functions like pivot()
to operate.
In [30]:
mldb.query("""
select Pclass, pivot(Sex, "count(*)") as *
from (
select Pclass, Sex, count(*) from titanic group by Pclass, Sex
)
group by Pclass
""")
Out[30]:
MLDB supports multi-dimensional arrays called embeddings, also known as tensors.
In [31]:
mldb.query("""
select [1,2,3] as embedding
""")
Out[31]:
In [32]:
mldb.query("""
select normalize([1,2,3], 1) as n, [1,2,3] / norm([1,2,3] ,1) as d
""")
Out[32]:
MLDB datasets have named rows as well as columns, and the NAMED
keyword allows you to control the names of your output rows.
In [33]:
mldb.query("""
select *
named Name + ' aged ' + cast(Age as string)
from titanic
order by Age desc limit 10
""")
Out[33]:
Having named rows as well as columns allows us to easily operate on the transpose of a dataset
In [34]:
mldb.query("""
select * from transpose(
(select * from titanic order by Age desc limit 5)
)
""")
Out[34]:
MLDB supports inline Javascript application via the jseval()
function.
In [35]:
mldb.query("""
select Name,
jseval(
'return Name.replace(/([A-Z])/g, function(m, p) { return " "+p; });',
'Name', Name
) as processed_name
from titanic
order by Age desc limit 10
""")
Out[35]:
MLDB datasets handle millions of columns, and deal very well with sparse datasets, making them ideal for operating on bags of words.
In [36]:
mldb.query("""
select tokenize(
jseval('
return Name.replace(/([A-Z])/g, function(m, p) { return " "+p; });
', 'Name', Name),
{splitChars: ' .()"', quoteChar:''}) as *
from titanic
order by Age desc
limit 10
""")
Out[36]:
Putting it all together, here are the top 20 tokens present in the names of Titanic passengers.
In [37]:
mldb.query("""
select * from transpose((
select sum(
tokenize(
jseval(
'return Name.replace(/([A-Z])/g, function(m, p) { return " "+p; });',
'Name', Name
),
{splitChars: ' .()"', quoteChar:''}
)
) as *
named 'counts'
from titanic
))
order by counts desc limit 20
""")
Out[37]:
Check out the other Tutorials and Demos.
In [ ]: