SELECT Tutorial

MLDB comes with a powerful implementation of SQL's SELECT Queries. This tutorial will walk you through the basics of SELECT, and some MLDB-specific features.

The notebook cells below use pymldb's .query() method; you can check out the Using pymldb Tutorial for more details.


In [1]:
from pymldb import Connection
mldb = Connection()

SELECT

All queries start with the keyword SELECT. Here is the simplest possible query: we ask for 1 and we get a very short result set consisting of one row with one column named 1 and the single cell in it also contains 1.


In [2]:
mldb.query("""

select 1

""")


Out[2]:
1
_rowName
1

Of course we can ask for more: the query below does a little math and shows how you can rename your columns with the as keyword. Note that single-quotes (') are used to denote strings and double-quotes (") denote column names, both of which can contain any Unicode character.


In [3]:
mldb.query("""

select 1+1, 3+4 as var, 'UTF8 striñg' as "hello, François"

""")


Out[3]:
1+1 var hello, François
_rowName
2 7 UTF8 striñg

We can use a variety of operators in a SELECT expression, like this:


In [4]:
mldb.query("""

select 
    1 between 0 and 2,
    2 in (1,2,3),
    3 is integer,
    (case when 4<5 then 'yes' else 'no' end)

""")


Out[4]:
1 between 0 and 2 2 in (1,2,3) 3 is integer (case when 4<5 then 'yes' else 'no' end)
_rowName
1 1 1 yes

FROM and LIMIT

Queries are mostly useful when run on actual datasets, so let's import part of the passenger manifest from the Titanic.


In [5]:
print mldb.put('/v1/procedures/import_titanic', { 
    "type": "import.text",
    "params": { 
        "dataFileUrl": "http://public.mldb.ai/titanic_train.csv",
        "outputDataset": "titanic",
        "runOnCreation": True
    } 
})


<Response [201]>

Now let's query all columns with the star (*) operator FROM our titanic dataset, using the LIMIT keyword to avoid getting too much output.


In [6]:
mldb.query("""

select *
from titanic
limit 10

""")


Out[6]:
PassengerId label Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
_rowName
97 96 0 3 ShorneyMr.CharlesJoseph male NaN 0 0 374910 8.0500 None S
273 272 1 3 TornquistMr.WilliamHenry male 25 0 0 LINE 0.0000 None S
524 523 0 3 LahoudMr.Sarkis male NaN 0 0 2624 7.2250 None C
278 277 0 3 LindblomMiss.AugustaCharlotta female 45 0 0 347073 7.7500 None S
211 210 1 1 BlankMr.Henry male 40 0 0 112277 31.0000 A31 C
210 209 1 3 CarrMiss.Helen"Ellen" female 16 0 0 367231 7.7500 None Q
11 10 1 2 NasserMrs.Nicholas(AdeleAchem) female 14 1 0 237736 30.0708 None C
281 280 1 3 AbbottMrs.Stanton(RosaHunt) female 35 1 1 C.A.2673 20.2500 None S
698 697 0 3 KellyMr.James male 44 0 0 363592 8.0500 None S
460 459 1 2 ToomeyMiss.Ellen female 50 0 0 F.C.C.13531 10.5000 None S

We can also ask for just certain columns by name.


In [7]:
mldb.query("""

select Name, Age
from titanic
limit 10
""")


Out[7]:
Name Age
_rowName
97 ShorneyMr.CharlesJoseph NaN
273 TornquistMr.WilliamHenry 25
524 LahoudMr.Sarkis NaN
278 LindblomMiss.AugustaCharlotta 45
211 BlankMr.Henry 40
210 CarrMiss.Helen"Ellen" 16
11 NasserMrs.Nicholas(AdeleAchem) 14
281 AbbottMrs.Stanton(RosaHunt) 35
698 KellyMr.James 44
460 ToomeyMiss.Ellen 50

ORDER BY

When we've used the LIMIT keyword above, we were just getting an arbitrary set of 10 rows. Using the ORDER BY keyword we can ask for the 'top 10' according to some criterion, for example Age.


In [8]:
mldb.query("""

select Name, Age
from titanic
order by Age desc 
limit 10

""")


Out[8]:
Name Age
_rowName
632 BarkworthMr.AlgernonHenryWilson 80.0
853 SvenssonMr.Johan 74.0
98 GoldschmidtMr.GeorgeB 71.0
495 ArtagaveytiaMr.Ramon 71.0
118 ConnorsMr.Patrick 70.5
674 MitchellMr.HenryMichael 70.0
747 CrosbyCapt.EdwardGifford 70.0
35 WheadonMr.EdwardH 66.0
282 DuaneMr.Frank 65.0
458 MilletMr.FrancisDavis 65.0

WHERE

Beyond limiting the number of records, sometimes we want to look at records which match certain criteria, which we can do with the WHERE keyword. You can use the same operators in the WHERE clause as in the SELECT clause.


In [9]:
mldb.query("""

select Name, Age, Pclass, Sex, SibSp, Parch, label
from titanic
where Pclass in (1,3) and Sex='female' and (SibSp>3 or Parch=2) and label=1 and Age is null

""")


Out[9]:
Name Age Pclass Sex SibSp Parch label
_rowName
535 PeterMrs.Catherine(CatherineRizk) None 3 female 0 2 1

In the query above we used the special operator is to retrieve only rows where Age is null. This is worth pointing out because null is a special value in SQL: it means "unknown". null has some strange properties, as you can see below: any comparison between Age and 1 returns null. This makes sense because if, say, Age is unknown, then we don't know if Age is less than, equal to or greater than anything else. SQL works according to 3-valued logic).

The only reliable way to check if a value is null is with the is null operator.


In [10]:
mldb.query("""

select Age, Age = 1, Age < 1, Age > 1, Age + 1, Age / 1, Age is null, Age is not null
from titanic
where Age is null
limit 1

""")


Out[10]:
Age Age = 1 Age < 1 Age > 1 Age + 1 Age / 1 Age is null Age is not null
_rowName
97 None None None None None None 1 0

Functions and Aggregate Functions

MLDB comes with a number of builtin functions to operate on your data. Here's an example where we convert a string to uppercase and lowercase.


In [11]:
mldb.query("""

select Name, upper(Name), lower(Name)
from titanic
order by Age desc limit 10

""")


Out[11]:
Name upper(Name) lower(Name)
_rowName
632 BarkworthMr.AlgernonHenryWilson BARKWORTHMR.ALGERNONHENRYWILSON barkworthmr.algernonhenrywilson
853 SvenssonMr.Johan SVENSSONMR.JOHAN svenssonmr.johan
98 GoldschmidtMr.GeorgeB GOLDSCHMIDTMR.GEORGEB goldschmidtmr.georgeb
495 ArtagaveytiaMr.Ramon ARTAGAVEYTIAMR.RAMON artagaveytiamr.ramon
118 ConnorsMr.Patrick CONNORSMR.PATRICK connorsmr.patrick
674 MitchellMr.HenryMichael MITCHELLMR.HENRYMICHAEL mitchellmr.henrymichael
747 CrosbyCapt.EdwardGifford CROSBYCAPT.EDWARDGIFFORD crosbycapt.edwardgifford
35 WheadonMr.EdwardH WHEADONMR.EDWARDH wheadonmr.edwardh
282 DuaneMr.Frank DUANEMR.FRANK duanemr.frank
458 MilletMr.FrancisDavis MILLETMR.FRANCISDAVIS milletmr.francisdavis

The functions below are special: they're aggregate functions, so they operate on multiple rows and give you a single output. They operate only on non-null values of their input.


In [12]:
mldb.query("""

select count(Age), sum(Age), sum(Age)/count(Age), avg(Age), min(Age), max(Age)
from titanic

""")


Out[12]:
count(Age) sum(Age) sum(Age)/count(Age) avg(Age) min(Age) max(Age)
_rowName
[] 714 21205.17 29.699118 29.699118 0.42 80

The count aggregate function is special in that it accepts * as an input, in which case it will return the count of all rows:


In [13]:
mldb.query("""

select count(*)
from titanic

""")


Out[13]:
count(*)
_rowName
[] 891

GROUP BY & HAVING

You can get aggregate functions to return multiple rows by grouping the input according to some criteria with the GROUP BY keyword. If you use an aggregate function in your SELECT clause, then you cannot use any non-aggregate expressions unless they appear in a GROUP BY clause.


In [14]:
mldb.query("""

select Pclass, avg(Age)
from titanic
group by Pclass

""")


Out[14]:
Pclass avg(Age)
_rowName
[1] 1 38.233441
[2] 2 29.877630
[3] 3 25.140620

You cannot use aggregate functions in a WHERE clause. The HAVING clause is a little bit like a WHERE clause which is applied after GROUP BY, and in which you can use aggregate functions.


In [15]:
mldb.query("""

select Pclass, avg(Age)
from titanic
group by Pclass
having avg(Age) > 30

""")


Out[15]:
Pclass avg(Age)
_rowName
[1] 1 38.233441

Advanced FROM with subqueries

SQL allows you to use the output of one query as the input to another by putting queries in the FROM clause, at which point they become "subqueries". The following example shows how to emulate the HAVING example above with a subquery, although it should be noted that the HAVING form will be faster in this case.


In [16]:
mldb.query("""

select *
from (
    select Pclass, avg(Age) as mean_age
    from titanic
    group by Pclass
)
where mean_age > 30

""")


Out[16]:
Pclass mean_age
_rowName
[1] 1 38.233441

INTO: supported via transform Procedures

Standard SQL defines an INTO clause to create new datasets from the output of queries. MLDB SELECT queries are idempotent (they do not modify anything) so INTO is not supported directly. You can accomplish the same task with a transform procedure, however:


In [17]:
not_supported = """
    select Pclass, avg(Age) as mean_age
    into class_stats
    from titanic
    group by Pclass
"""

supported = mldb.post('/v1/procedures', { 
    "type": "transform",
    "params": { 
        "inputData": """
                select Pclass, avg(Age) as mean_age
                from titanic
                group by Pclass
        """,
        "outputDataset": "class_stats",
        "runOnCreation": True
    } 
})

print supported


<Response [201]>

We can now query our new dataset!


In [18]:
mldb.query("""

select *
from class_stats

""")


Out[18]:
Pclass mean_age
_rowName
[1] 1 38.233441
[3] 3 25.140620
[2] 2 29.877630

Advanced FROM with JOIN

You can run queries across multiple datasets with the JOIN keyword, using the ON keyword to define how to combine the datasets.


In [19]:
mldb.query("""

select titanic.Name, titanic.Pclass, class_stats.*
from titanic JOIN class_stats ON titanic.Pclass = class_stats.Pclass    
order by Age desc limit 10

""")


Out[19]:
titanic.Name titanic.Pclass class_stats.Pclass class_stats.mean_age
_rowName
[835]-[[3]] AugustssonMr.Albert 3 3 25.14062
[60]-[[2]] WestMiss.ConstanceMirium 2 2 29.87763
[78]-[[3]] StaneffMr.Ivan 3 3 25.14062
[356]-[[3]] YousifMr.Wazli 3 3 25.14062
[134]-[[3]] RobinsMrs.AlexanderA(GraceCharityLaury) 3 3 25.14062
[740]-[[3]] IvanoffMr.Kanio 3 3 25.14062
[652]-[[3]] MitkoffMr.Mito 3 3 25.14062
[10]-[[3]] JohnsonMrs.OscarW(ElisabethVilhelminaBerg) 3 3 25.14062
[249]-[[2]] HamalainenMrs.William(Anna) 2 2 29.87763
[280]-[[3]] RiceMaster.Eric 3 3 25.14062

MLDB extensions to conventional SQL

MLDB has some notable differences with more conventional SQL databases like PostgreSQL, MySQL, Oracle or SQLServer. For example, MLDB datasets are not SQL tables:

  • datasets have no fixed schema
  • datasets can have a variable number of columns, numbering into the millions
  • columns can contain mixed types (i.e. both numbers and strings in the same column)
  • both rows and columns have names

In order to accomodate this, MLDB provides a number of extensions to standard SQL. Examples are provided below.

Selecting columns based on a prefix:


In [20]:
mldb.query("""

select P*
from titanic
limit 1

""")


Out[20]:
PassengerId Pclass Parch
_rowName
97 96 3 0

Renaming columns based on a prefix pattern:


In [21]:
mldb.query("""

select P* as x*
from titanic
limit 1

""")


Out[21]:
xassengerId xclass xarch
_rowName
97 96 3 0

Excluding columns from a selection:


In [22]:
mldb.query("""

select * excluding(P*)
from titanic
limit 1

""")


Out[22]:
label Name Sex Age SibSp Ticket Fare Cabin Embarked
_rowName
97 0 ShorneyMr.CharlesJoseph male None 0 374910 8.05 None S

NOTE: Selecting a column which is not in the dataset will not cause an error, instead it will return NULL.


In [23]:
mldb.query("""

select nothing
from titanic
limit 1

""")


Out[23]:
nothing
_rowName
97 None

MLDB supports JSON-like objects in queries.


In [24]:
mldb.query("""

select {a: 1, b:2, c: 'hello'}

""")


Out[24]:
{a: 1, b:2, c: 'hello'}.a {a: 1, b:2, c: 'hello'}.b {a: 1, b:2, c: 'hello'}.c
_rowName
1 2 hello

In [25]:
mldb.query("""

select {a: 1, b:2, c: 'hello'} as obj

""")


Out[25]:
obj.a obj.b obj.c
_rowName
1 2 hello

In [26]:
mldb.query("""

select {a: 1, b:{x:2}, c: 'hello'} as *

""")


Out[26]:
a b.x c
_rowName
1 2 hello

This is especially useful for tokenizing text into bags of words, or importing semi-structured JSON data.


In [27]:
mldb.query("""

select tokenize('Hello world, Hello!', {splitChars: ' ,!'}) as *

""")


Out[27]:
world Hello
_rowName
1 2

In [28]:
mldb.query("""

select parse_json('{"hello":"world","list":[1,2,3,4]}') as *

""")


Out[28]:
hello list.0 list.1 list.2 list.3
_rowName
world 1 2 3 4

MLDB's object notation also allows you to run aggregate functions on multiple columns at once, with the special {*} notation, which refers to all fields in the current row as an object.


In [29]:
mldb.query("""

select Pclass, count(*), count({*})
from titanic
group by Pclass

""")


Out[29]:
Pclass count(*) count({*}).Age count({*}).Cabin count({*}).Embarked count({*}).Fare count({*}).Name count({*}).Parch count({*}).PassengerId count({*}).Pclass count({*}).Sex count({*}).SibSp count({*}).Ticket count({*}).label
_rowName
[1] 1 216 186 176 214 216 216 216 216 216 216 216 216 216
[2] 2 184 173 16 184 184 184 184 184 184 184 184 184 184
[3] 3 491 355 12 491 491 491 491 491 491 491 491 491 491

MLDB's flexible output model also enables powerful aggregate functions like pivot() to operate.


In [30]:
mldb.query("""

select Pclass, pivot(Sex, "count(*)") as *
from (
    select Pclass, Sex, count(*) from titanic group by Pclass, Sex
)
group by Pclass

""")


Out[30]:
Pclass female male
_rowName
[1] 1 94 122
[2] 2 76 108
[3] 3 144 347

MLDB supports multi-dimensional arrays called embeddings, also known as tensors.


In [31]:
mldb.query("""

select [1,2,3] as embedding

""")


Out[31]:
embedding.0 embedding.1 embedding.2
_rowName
1 2 3

In [32]:
mldb.query("""

select normalize([1,2,3], 1) as n,  [1,2,3] / norm([1,2,3] ,1) as d

""")


Out[32]:
n.0 n.1 n.2 d.0 d.1 d.2
_rowName
0.166667 0.333333 0.5 0.166667 0.333333 0.5

MLDB datasets have named rows as well as columns, and the NAMED keyword allows you to control the names of your output rows.


In [33]:
mldb.query("""

select *
named Name + ' aged ' + cast(Age as string)
from titanic
order by Age desc limit 10

""")


Out[33]:
PassengerId label Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
_rowName
BarkworthMr.AlgernonHenryWilson aged 80 631 1 1 BarkworthMr.AlgernonHenryWilson male 80.0 0 0 27042 30.0000 A23 S
SvenssonMr.Johan aged 74 852 0 3 SvenssonMr.Johan male 74.0 0 0 347060 7.7750 None S
GoldschmidtMr.GeorgeB aged 71 97 0 1 GoldschmidtMr.GeorgeB male 71.0 0 0 PC17754 34.6542 A5 C
ArtagaveytiaMr.Ramon aged 71 494 0 1 ArtagaveytiaMr.Ramon male 71.0 0 0 PC17609 49.5042 None C
ConnorsMr.Patrick aged 70.5 117 0 3 ConnorsMr.Patrick male 70.5 0 0 370369 7.7500 None Q
MitchellMr.HenryMichael aged 70 673 0 2 MitchellMr.HenryMichael male 70.0 0 0 C.A.24580 10.5000 None S
CrosbyCapt.EdwardGifford aged 70 746 0 1 CrosbyCapt.EdwardGifford male 70.0 1 1 WE/P5735 71.0000 B22 S
WheadonMr.EdwardH aged 66 34 0 2 WheadonMr.EdwardH male 66.0 0 0 C.A.24579 10.5000 None S
DuaneMr.Frank aged 65 281 0 3 DuaneMr.Frank male 65.0 0 0 336439 7.7500 None Q
MilletMr.FrancisDavis aged 65 457 0 1 MilletMr.FrancisDavis male 65.0 0 0 13509 26.5500 E38 S

Having named rows as well as columns allows us to easily operate on the transpose of a dataset


In [34]:
mldb.query("""

select * from transpose(
    (select * from titanic order by Age desc limit 5)
)

""")


Out[34]:
632 853 98 495 118
_rowName
Cabin A23 None A5 None None
Fare 30 7.775 34.6542 49.5042 7.75
SibSp 0 0 0 0 0
Ticket 27042 347060 PC17754 PC17609 370369
PassengerId 631 852 97 494 117
label 1 0 0 0 0
Age 80 74 71 71 70.5
Pclass 1 3 1 1 3
Name BarkworthMr.AlgernonHenryWilson SvenssonMr.Johan GoldschmidtMr.GeorgeB ArtagaveytiaMr.Ramon ConnorsMr.Patrick
Sex male male male male male
Parch 0 0 0 0 0
Embarked S S C C Q

MLDB supports inline Javascript application via the jseval() function.


In [35]:
mldb.query("""

select Name,
    jseval(
        'return Name.replace(/([A-Z])/g, function(m, p) { return " "+p; });',
        'Name', Name
    ) as processed_name
from titanic
order by Age desc limit 10

""")


Out[35]:
Name processed_name
_rowName
632 BarkworthMr.AlgernonHenryWilson Barkworth Mr. Algernon Henry Wilson
853 SvenssonMr.Johan Svensson Mr. Johan
98 GoldschmidtMr.GeorgeB Goldschmidt Mr. George B
495 ArtagaveytiaMr.Ramon Artagaveytia Mr. Ramon
118 ConnorsMr.Patrick Connors Mr. Patrick
674 MitchellMr.HenryMichael Mitchell Mr. Henry Michael
747 CrosbyCapt.EdwardGifford Crosby Capt. Edward Gifford
35 WheadonMr.EdwardH Wheadon Mr. Edward H
282 DuaneMr.Frank Duane Mr. Frank
458 MilletMr.FrancisDavis Millet Mr. Francis Davis

MLDB datasets handle millions of columns, and deal very well with sparse datasets, making them ideal for operating on bags of words.


In [36]:
mldb.query("""

select tokenize(
    jseval('
        return Name.replace(/([A-Z])/g, function(m, p) { return " "+p; });
    ', 'Name', Name),
    {splitChars: ' .()"', quoteChar:''}) as *
from titanic
order by Age desc
limit 10

""")


Out[36]:
Henry Algernon Mr Wilson Barkworth Johan Svensson B George Goldschmidt ... Edward Capt Crosby H Wheadon Frank Duane Davis Francis Millet
_rowName
632 1 1 1 1 1 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
853 NaN NaN 1 NaN NaN 1 1 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
98 NaN NaN 1 NaN NaN NaN NaN 1 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
495 NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
118 NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
674 1 NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
747 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1 1 1 NaN NaN NaN NaN NaN NaN NaN
35 NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN ... 1 NaN NaN 1 1 NaN NaN NaN NaN NaN
282 NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 1 1 NaN NaN NaN
458 NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 1 1 1

10 rows × 27 columns

Putting it all together, here are the top 20 tokens present in the names of Titanic passengers.


In [37]:
mldb.query("""

select * from transpose((
    select sum(
        tokenize(
            jseval(
                'return Name.replace(/([A-Z])/g, function(m, p) { return " "+p; });', 
                'Name', Name
            ),
            {splitChars: ' .()"', quoteChar:''}
        )
    ) as *
    named 'counts'
    from titanic
))
order by counts desc limit 20

""")


Out[37]:
counts
_rowName
Mr 521
Miss 182
Mrs 128
William 64
John 44
Master 40
Henry 35
George 24
James 24
Charles 24
Thomas 22
Mary 20
Edward 18
Anna 17
Joseph 16
Frederick 15
Johan 15
Elizabeth 15
Richard 14
Samuel 13

Where to next?

Check out the other Tutorials and Demos.


In [ ]: