SELECT Tutorial

MLDB comes with a powerful implementation of SQL's SELECT Queries. This tutorial will walk you through the basics of SELECT, and some MLDB-specific features.

The notebook cells below use pymldb's .query() method; you can check out the Using pymldb Tutorial for more details.



In [1]:

    
from pymldb import Connection
mldb = Connection()

`SELECT`

All queries start with the keyword SELECT. Here is the simplest possible query: we ask for 1 and we get a very short result set consisting of one row with one column named 1 and the single cell in it also contains 1.



In [2]:

    
mldb.query("""

select 1

""")

Of course we can ask for more: the query below does a little math and shows how you can rename your columns with the as keyword. Note that single-quotes (') are used to denote strings and double-quotes (") denote column names, both of which can contain any Unicode character.



In [3]:

    
mldb.query("""

select 1+1, 3+4 as var, 'UTF8 striñg' as "hello, François"

""")









    Out[3]:






  
    
      
      1+1
      var
      hello, François
    
    
      _rowName
      
      
      
    
  
  
    
      
      2
      7
      UTF8 striñg

We can use a variety of operators in a SELECT expression, like this:



In [4]:

    
mldb.query("""

select 
    1 between 0 and 2,
    2 in (1,2,3),
    3 is integer,
    (case when 4<5 then 'yes' else 'no' end)

""")









    Out[4]:






  
    
      
      1 between 0 and 2
      2 in (1,2,3)
      3 is integer
      (case when 4<5 then 'yes' else 'no' end)
    
    
      _rowName
      
      
      
      
    
  
  
    
      
      1
      1
      1
      yes

`FROM` and `LIMIT`

Queries are mostly useful when run on actual datasets, so let's import part of the passenger manifest from the Titanic.



In [5]:

    
print mldb.put('/v1/procedures/import_titanic', { 
    "type": "import.text",
    "params": { 
        "dataFileUrl": "http://public.mldb.ai/titanic_train.csv",
        "outputDataset": "titanic",
        "runOnCreation": True
    } 
})









    



<Response [201]>

Now let's query all columns with the star (*) operator FROM our titanic dataset, using the LIMIT keyword to avoid getting too much output.



In [6]:

    
mldb.query("""

select *
from titanic
limit 10

""")









    Out[6]:






  
    
      
      PassengerId
      label
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
    
      _rowName
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      97
      96
      0
      3
      ShorneyMr.CharlesJoseph
      male
      NaN
      0
      0
      374910
      8.0500
      None
      S
    
    
      273
      272
      1
      3
      TornquistMr.WilliamHenry
      male
      25
      0
      0
      LINE
      0.0000
      None
      S
    
    
      524
      523
      0
      3
      LahoudMr.Sarkis
      male
      NaN
      0
      0
      2624
      7.2250
      None
      C
    
    
      278
      277
      0
      3
      LindblomMiss.AugustaCharlotta
      female
      45
      0
      0
      347073
      7.7500
      None
      S
    
    
      211
      210
      1
      1
      BlankMr.Henry
      male
      40
      0
      0
      112277
      31.0000
      A31
      C
    
    
      210
      209
      1
      3
      CarrMiss.Helen"Ellen"
      female
      16
      0
      0
      367231
      7.7500
      None
      Q
    
    
      11
      10
      1
      2
      NasserMrs.Nicholas(AdeleAchem)
      female
      14
      1
      0
      237736
      30.0708
      None
      C
    
    
      281
      280
      1
      3
      AbbottMrs.Stanton(RosaHunt)
      female
      35
      1
      1
      C.A.2673
      20.2500
      None
      S
    
    
      698
      697
      0
      3
      KellyMr.James
      male
      44
      0
      0
      363592
      8.0500
      None
      S
    
    
      460
      459
      1
      2
      ToomeyMiss.Ellen
      female
      50
      0
      0
      F.C.C.13531
      10.5000
      None
      S

We can also ask for just certain columns by name.



In [7]:

    
mldb.query("""

select Name, Age
from titanic
limit 10
""")









    Out[7]:






  
    
      
      Name
      Age
    
    
      _rowName
      
      
    
  
  
    
      97
      ShorneyMr.CharlesJoseph
      NaN
    
    
      273
      TornquistMr.WilliamHenry
      25
    
    
      524
      LahoudMr.Sarkis
      NaN
    
    
      278
      LindblomMiss.AugustaCharlotta
      45
    
    
      211
      BlankMr.Henry
      40
    
    
      210
      CarrMiss.Helen"Ellen"
      16
    
    
      11
      NasserMrs.Nicholas(AdeleAchem)
      14
    
    
      281
      AbbottMrs.Stanton(RosaHunt)
      35
    
    
      698
      KellyMr.James
      44
    
    
      460
      ToomeyMiss.Ellen
      50

`ORDER BY`

When we've used the LIMIT keyword above, we were just getting an arbitrary set of 10 rows. Using the ORDER BY keyword we can ask for the 'top 10' according to some criterion, for example Age.



In [8]:

    
mldb.query("""

select Name, Age
from titanic
order by Age desc 
limit 10

""")









    Out[8]:






  
    
      
      Name
      Age
    
    
      _rowName
      
      
    
  
  
    
      632
      BarkworthMr.AlgernonHenryWilson
      80.0
    
    
      853
      SvenssonMr.Johan
      74.0
    
    
      98
      GoldschmidtMr.GeorgeB
      71.0
    
    
      495
      ArtagaveytiaMr.Ramon
      71.0
    
    
      118
      ConnorsMr.Patrick
      70.5
    
    
      674
      MitchellMr.HenryMichael
      70.0
    
    
      747
      CrosbyCapt.EdwardGifford
      70.0
    
    
      35
      WheadonMr.EdwardH
      66.0
    
    
      282
      DuaneMr.Frank
      65.0
    
    
      458
      MilletMr.FrancisDavis
      65.0

`WHERE`

Beyond limiting the number of records, sometimes we want to look at records which match certain criteria, which we can do with the WHERE keyword. You can use the same operators in the WHERE clause as in the SELECT clause.



In [9]:

    
mldb.query("""

select Name, Age, Pclass, Sex, SibSp, Parch, label
from titanic
where Pclass in (1,3) and Sex='female' and (SibSp>3 or Parch=2) and label=1 and Age is null

""")









    Out[9]:






  
    
      
      Name
      Age
      Pclass
      Sex
      SibSp
      Parch
      label
    
    
      _rowName
      
      
      
      
      
      
      
    
  
  
    
      535
      PeterMrs.Catherine(CatherineRizk)
      None
      3
      female
      0
      2
      1

In the query above we used the special operator is to retrieve only rows where Age is null. This is worth pointing out because null is a special value in SQL: it means "unknown". null has some strange properties, as you can see below: any comparison between Age and 1 returns null. This makes sense because if, say, Age is unknown, then we don't know if Age is less than, equal to or greater than anything else. SQL works according to 3-valued logic).

The only reliable way to check if a value is null is with the is null operator.



In [10]:

    
mldb.query("""

select Age, Age = 1, Age < 1, Age > 1, Age + 1, Age / 1, Age is null, Age is not null
from titanic
where Age is null
limit 1

""")









    Out[10]:






  
    
      
      Age
      Age = 1
      Age < 1
      Age > 1
      Age + 1
      Age / 1
      Age is null
      Age is not null
    
    
      _rowName
      
      
      
      
      
      
      
      
    
  
  
    
      97
      None
      None
      None
      None
      None
      None
      1
      0

Functions and Aggregate Functions

MLDB comes with a number of builtin functions to operate on your data. Here's an example where we convert a string to uppercase and lowercase.



In [11]:

    
mldb.query("""

select Name, upper(Name), lower(Name)
from titanic
order by Age desc limit 10

""")









    Out[11]:






  
    
      
      Name
      upper(Name)
      lower(Name)
    
    
      _rowName
      
      
      
    
  
  
    
      632
      BarkworthMr.AlgernonHenryWilson
      BARKWORTHMR.ALGERNONHENRYWILSON
      barkworthmr.algernonhenrywilson
    
    
      853
      SvenssonMr.Johan
      SVENSSONMR.JOHAN
      svenssonmr.johan
    
    
      98
      GoldschmidtMr.GeorgeB
      GOLDSCHMIDTMR.GEORGEB
      goldschmidtmr.georgeb
    
    
      495
      ArtagaveytiaMr.Ramon
      ARTAGAVEYTIAMR.RAMON
      artagaveytiamr.ramon
    
    
      118
      ConnorsMr.Patrick
      CONNORSMR.PATRICK
      connorsmr.patrick
    
    
      674
      MitchellMr.HenryMichael
      MITCHELLMR.HENRYMICHAEL
      mitchellmr.henrymichael
    
    
      747
      CrosbyCapt.EdwardGifford
      CROSBYCAPT.EDWARDGIFFORD
      crosbycapt.edwardgifford
    
    
      35
      WheadonMr.EdwardH
      WHEADONMR.EDWARDH
      wheadonmr.edwardh
    
    
      282
      DuaneMr.Frank
      DUANEMR.FRANK
      duanemr.frank
    
    
      458
      MilletMr.FrancisDavis
      MILLETMR.FRANCISDAVIS
      milletmr.francisdavis

The functions below are special: they're aggregate functions, so they operate on multiple rows and give you a single output. They operate only on non-null values of their input.



In [12]:

    
mldb.query("""

select count(Age), sum(Age), sum(Age)/count(Age), avg(Age), min(Age), max(Age)
from titanic

""")









    Out[12]:






  
    
      
      count(Age)
      sum(Age)
      sum(Age)/count(Age)
      avg(Age)
      min(Age)
      max(Age)
    
    
      _rowName
      
      
      
      
      
      
    
  
  
    
      []
      714
      21205.17
      29.699118
      29.699118
      0.42
      80

The count aggregate function is special in that it accepts * as an input, in which case it will return the count of all rows:



In [13]:

    
mldb.query("""

select count(*)
from titanic

""")









    Out[13]:






  
    
      
      count(*)
    
    
      _rowName
      
    
  
  
    
      []
      891

GROUP BY & HAVING

You can get aggregate functions to return multiple rows by grouping the input according to some criteria with the GROUP BY keyword. If you use an aggregate function in your SELECT clause, then you cannot use any non-aggregate expressions unless they appear in a GROUP BY clause.



In [14]:

    
mldb.query("""

select Pclass, avg(Age)
from titanic
group by Pclass

""")

You cannot use aggregate functions in a WHERE clause. The HAVING clause is a little bit like a WHERE clause which is applied after GROUP BY, and in which you can use aggregate functions.



In [15]:

    
mldb.query("""

select Pclass, avg(Age)
from titanic
group by Pclass
having avg(Age) > 30

""")









    Out[15]:






  
    
      
      Pclass
      avg(Age)
    
    
      _rowName
      
      
    
  
  
    
      [1]
      1
      38.233441

Advanced `FROM` with subqueries

SQL allows you to use the output of one query as the input to another by putting queries in the FROM clause, at which point they become "subqueries". The following example shows how to emulate the HAVING example above with a subquery, although it should be noted that the HAVING form will be faster in this case.



In [16]:

    
mldb.query("""

select *
from (
    select Pclass, avg(Age) as mean_age
    from titanic
    group by Pclass
)
where mean_age > 30

""")









    Out[16]:






  
    
      
      Pclass
      mean_age
    
    
      _rowName
      
      
    
  
  
    
      [1]
      1
      38.233441

`INTO`: supported via `transform` Procedures

Standard SQL defines an INTO clause to create new datasets from the output of queries. MLDB SELECT queries are idempotent (they do not modify anything) so INTO is not supported directly. You can accomplish the same task with a transform procedure, however:



In [17]:

    
not_supported = """
    select Pclass, avg(Age) as mean_age
    into class_stats
    from titanic
    group by Pclass
"""

supported = mldb.post('/v1/procedures', { 
    "type": "transform",
    "params": { 
        "inputData": """
                select Pclass, avg(Age) as mean_age
                from titanic
                group by Pclass
        """,
        "outputDataset": "class_stats",
        "runOnCreation": True
    } 
})

print supported









    



<Response [201]>

We can now query our new dataset!



In [18]:

    
mldb.query("""

select *
from class_stats

""")

Advanced `FROM` with `JOIN`

You can run queries across multiple datasets with the JOIN keyword, using the ON keyword to define how to combine the datasets.



In [19]:

    
mldb.query("""

select titanic.Name, titanic.Pclass, class_stats.*
from titanic JOIN class_stats ON titanic.Pclass = class_stats.Pclass    
order by Age desc limit 10

""")









    Out[19]:






  
    
      
      titanic.Name
      titanic.Pclass
      class_stats.Pclass
      class_stats.mean_age
    
    
      _rowName
      
      
      
      
    
  
  
    
      [835]-[[3]]
      AugustssonMr.Albert
      3
      3
      25.14062
    
    
      [60]-[[2]]
      WestMiss.ConstanceMirium
      2
      2
      29.87763
    
    
      [78]-[[3]]
      StaneffMr.Ivan
      3
      3
      25.14062
    
    
      [356]-[[3]]
      YousifMr.Wazli
      3
      3
      25.14062
    
    
      [134]-[[3]]
      RobinsMrs.AlexanderA(GraceCharityLaury)
      3
      3
      25.14062
    
    
      [740]-[[3]]
      IvanoffMr.Kanio
      3
      3
      25.14062
    
    
      [652]-[[3]]
      MitkoffMr.Mito
      3
      3
      25.14062
    
    
      [10]-[[3]]
      JohnsonMrs.OscarW(ElisabethVilhelminaBerg)
      3
      3
      25.14062
    
    
      [249]-[[2]]
      HamalainenMrs.William(Anna)
      2
      2
      29.87763
    
    
      [280]-[[3]]
      RiceMaster.Eric
      3
      3
      25.14062

MLDB extensions to conventional SQL

MLDB has some notable differences with more conventional SQL databases like PostgreSQL, MySQL, Oracle or SQLServer. For example, MLDB datasets are not SQL tables:

datasets have no fixed schema
datasets can have a variable number of columns, numbering into the millions
columns can contain mixed types (i.e. both numbers and strings in the same column)
both rows and columns have names

In order to accomodate this, MLDB provides a number of extensions to standard SQL. Examples are provided below.

Selecting columns based on a prefix:



In [20]:

    
mldb.query("""

select P*
from titanic
limit 1

""")









    Out[20]:






  
    
      
      PassengerId
      Pclass
      Parch
    
    
      _rowName
      
      
      
    
  
  
    
      97
      96
      3
      0

Renaming columns based on a prefix pattern:



In [21]:

    
mldb.query("""

select P* as x*
from titanic
limit 1

""")









    Out[21]:






  
    
      
      xassengerId
      xclass
      xarch
    
    
      _rowName
      
      
      
    
  
  
    
      97
      96
      3
      0

Excluding columns from a selection:



In [22]:

    
mldb.query("""

select * excluding(P*)
from titanic
limit 1

""")









    Out[22]:






  
    
      
      label
      Name
      Sex
      Age
      SibSp
      Ticket
      Fare
      Cabin
      Embarked
    
    
      _rowName
      
      
      
      
      
      
      
      
      
    
  
  
    
      97
      0
      ShorneyMr.CharlesJoseph
      male
      None
      0
      374910
      8.05
      None
      S

NOTE: Selecting a column which is not in the dataset will not cause an error, instead it will return NULL.



In [23]:

    
mldb.query("""

select nothing
from titanic
limit 1

""")









    Out[23]:






  
    
      
      nothing
    
    
      _rowName
      
    
  
  
    
      97
      None

MLDB supports JSON-like objects in queries.



In [24]:

    
mldb.query("""

select {a: 1, b:2, c: 'hello'}

""")









    Out[24]:






  
    
      
      {a: 1, b:2, c: 'hello'}.a
      {a: 1, b:2, c: 'hello'}.b
      {a: 1, b:2, c: 'hello'}.c
    
    
      _rowName
      
      
      
    
  
  
    
      
      1
      2
      hello



In [25]:

    
mldb.query("""

select {a: 1, b:2, c: 'hello'} as obj

""")









    Out[25]:






  
    
      
      obj.a
      obj.b
      obj.c
    
    
      _rowName
      
      
      
    
  
  
    
      
      1
      2
      hello



In [26]:

    
mldb.query("""

select {a: 1, b:{x:2}, c: 'hello'} as *

""")

This is especially useful for tokenizing text into bags of words, or importing semi-structured JSON data.



In [27]:

    
mldb.query("""

select tokenize('Hello world, Hello!', {splitChars: ' ,!'}) as *

""")



In [28]:

    
mldb.query("""

select parse_json('{"hello":"world","list":[1,2,3,4]}') as *

""")

MLDB's object notation also allows you to run aggregate functions on multiple columns at once, with the special {*} notation, which refers to all fields in the current row as an object.



In [29]:

    
mldb.query("""

select Pclass, count(*), count({*})
from titanic
group by Pclass

""")









    Out[29]:






  
    
      
      Pclass
      count(*)
      count({*}).Age
      count({*}).Cabin
      count({*}).Embarked
      count({*}).Fare
      count({*}).Name
      count({*}).Parch
      count({*}).PassengerId
      count({*}).Pclass
      count({*}).Sex
      count({*}).SibSp
      count({*}).Ticket
      count({*}).label
    
    
      _rowName
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      [1]
      1
      216
      186
      176
      214
      216
      216
      216
      216
      216
      216
      216
      216
      216
    
    
      [2]
      2
      184
      173
      16
      184
      184
      184
      184
      184
      184
      184
      184
      184
      184
    
    
      [3]
      3
      491
      355
      12
      491
      491
      491
      491
      491
      491
      491
      491
      491
      491

MLDB's flexible output model also enables powerful aggregate functions like pivot() to operate.



In [30]:

    
mldb.query("""

select Pclass, pivot(Sex, "count(*)") as *
from (
    select Pclass, Sex, count(*) from titanic group by Pclass, Sex
)
group by Pclass

""")

MLDB supports multi-dimensional arrays called embeddings, also known as tensors.



In [31]:

    
mldb.query("""

select [1,2,3] as embedding

""")









    Out[31]:






  
    
      
      embedding.0
      embedding.1
      embedding.2
    
    
      _rowName
      
      
      
    
  
  
    
      
      1
      2
      3



In [32]:

    
mldb.query("""

select normalize([1,2,3], 1) as n,  [1,2,3] / norm([1,2,3] ,1) as d

""")

MLDB datasets have named rows as well as columns, and the NAMED keyword allows you to control the names of your output rows.



In [33]:

    
mldb.query("""

select *
named Name + ' aged ' + cast(Age as string)
from titanic
order by Age desc limit 10

""")









    Out[33]:






  
    
      
      PassengerId
      label
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
    
      _rowName
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      BarkworthMr.AlgernonHenryWilson aged 80
      631
      1
      1
      BarkworthMr.AlgernonHenryWilson
      male
      80.0
      0
      0
      27042
      30.0000
      A23
      S
    
    
      SvenssonMr.Johan aged 74
      852
      0
      3
      SvenssonMr.Johan
      male
      74.0
      0
      0
      347060
      7.7750
      None
      S
    
    
      GoldschmidtMr.GeorgeB aged 71
      97
      0
      1
      GoldschmidtMr.GeorgeB
      male
      71.0
      0
      0
      PC17754
      34.6542
      A5
      C
    
    
      ArtagaveytiaMr.Ramon aged 71
      494
      0
      1
      ArtagaveytiaMr.Ramon
      male
      71.0
      0
      0
      PC17609
      49.5042
      None
      C
    
    
      ConnorsMr.Patrick aged 70.5
      117
      0
      3
      ConnorsMr.Patrick
      male
      70.5
      0
      0
      370369
      7.7500
      None
      Q
    
    
      MitchellMr.HenryMichael aged 70
      673
      0
      2
      MitchellMr.HenryMichael
      male
      70.0
      0
      0
      C.A.24580
      10.5000
      None
      S
    
    
      CrosbyCapt.EdwardGifford aged 70
      746
      0
      1
      CrosbyCapt.EdwardGifford
      male
      70.0
      1
      1
      WE/P5735
      71.0000
      B22
      S
    
    
      WheadonMr.EdwardH aged 66
      34
      0
      2
      WheadonMr.EdwardH
      male
      66.0
      0
      0
      C.A.24579
      10.5000
      None
      S
    
    
      DuaneMr.Frank aged 65
      281
      0
      3
      DuaneMr.Frank
      male
      65.0
      0
      0
      336439
      7.7500
      None
      Q
    
    
      MilletMr.FrancisDavis aged 65
      457
      0
      1
      MilletMr.FrancisDavis
      male
      65.0
      0
      0
      13509
      26.5500
      E38
      S

Having named rows as well as columns allows us to easily operate on the transpose of a dataset



In [34]:

    
mldb.query("""

select * from transpose(
    (select * from titanic order by Age desc limit 5)
)

""")









    Out[34]:






  
    
      
      632
      853
      98
      495
      118
    
    
      _rowName
      
      
      
      
      
    
  
  
    
      Cabin
      A23
      None
      A5
      None
      None
    
    
      Fare
      30
      7.775
      34.6542
      49.5042
      7.75
    
    
      SibSp
      0
      0
      0
      0
      0
    
    
      Ticket
      27042
      347060
      PC17754
      PC17609
      370369
    
    
      PassengerId
      631
      852
      97
      494
      117
    
    
      label
      1
      0
      0
      0
      0
    
    
      Age
      80
      74
      71
      71
      70.5
    
    
      Pclass
      1
      3
      1
      1
      3
    
    
      Name
      BarkworthMr.AlgernonHenryWilson
      SvenssonMr.Johan
      GoldschmidtMr.GeorgeB
      ArtagaveytiaMr.Ramon
      ConnorsMr.Patrick
    
    
      Sex
      male
      male
      male
      male
      male
    
    
      Parch
      0
      0
      0
      0
      0
    
    
      Embarked
      S
      S
      C
      C
      Q

MLDB supports inline Javascript application via the jseval() function.



In [35]:

    
mldb.query("""

select Name,
    jseval(
        'return Name.replace(/([A-Z])/g, function(m, p) { return " "+p; });',
        'Name', Name
    ) as processed_name
from titanic
order by Age desc limit 10

""")









    Out[35]:






  
    
      
      Name
      processed_name
    
    
      _rowName
      
      
    
  
  
    
      632
      BarkworthMr.AlgernonHenryWilson
      Barkworth Mr. Algernon Henry Wilson
    
    
      853
      SvenssonMr.Johan
      Svensson Mr. Johan
    
    
      98
      GoldschmidtMr.GeorgeB
      Goldschmidt Mr. George B
    
    
      495
      ArtagaveytiaMr.Ramon
      Artagaveytia Mr. Ramon
    
    
      118
      ConnorsMr.Patrick
      Connors Mr. Patrick
    
    
      674
      MitchellMr.HenryMichael
      Mitchell Mr. Henry Michael
    
    
      747
      CrosbyCapt.EdwardGifford
      Crosby Capt. Edward Gifford
    
    
      35
      WheadonMr.EdwardH
      Wheadon Mr. Edward H
    
    
      282
      DuaneMr.Frank
      Duane Mr. Frank
    
    
      458
      MilletMr.FrancisDavis
      Millet Mr. Francis Davis

MLDB datasets handle millions of columns, and deal very well with sparse datasets, making them ideal for operating on bags of words.



In [36]:

    
mldb.query("""

select tokenize(
    jseval('
        return Name.replace(/([A-Z])/g, function(m, p) { return " "+p; });
    ', 'Name', Name),
    {splitChars: ' .()"', quoteChar:''}) as *
from titanic
order by Age desc
limit 10

""")









    Out[36]:






  
    
      
      Henry
      Algernon
      Mr
      Wilson
      Barkworth
      Johan
      Svensson
      B
      George
      Goldschmidt
      ...
      Edward
      Capt
      Crosby
      H
      Wheadon
      Frank
      Duane
      Davis
      Francis
      Millet
    
    
      _rowName
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      632
      1
      1
      1
      1
      1
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      853
      NaN
      NaN
      1
      NaN
      NaN
      1
      1
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      98
      NaN
      NaN
      1
      NaN
      NaN
      NaN
      NaN
      1
      1
      1
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      495
      NaN
      NaN
      1
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      118
      NaN
      NaN
      1
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      674
      1
      NaN
      1
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      747
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      1
      1
      1
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      35
      NaN
      NaN
      1
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      1
      NaN
      NaN
      1
      1
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      282
      NaN
      NaN
      1
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      1
      1
      NaN
      NaN
      NaN
    
    
      458
      NaN
      NaN
      1
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      1
      1
      1
    
  

10 rows × 27 columns

Putting it all together, here are the top 20 tokens present in the names of Titanic passengers.



In [37]:

    
mldb.query("""

select * from transpose((
    select sum(
        tokenize(
            jseval(
                'return Name.replace(/([A-Z])/g, function(m, p) { return " "+p; });', 
                'Name', Name
            ),
            {splitChars: ' .()"', quoteChar:''}
        )
    ) as *
    named 'counts'
    from titanic
))
order by counts desc limit 20

""")

Where to next?

Check out the other Tutorials and Demos.



In [ ]:

	hello	list.0	list.1	list.2	list.3
_rowName
	world	1	2	3	4

	n.0	n.1	n.2	d.0	d.1	d.2
_rowName
	0.166667	0.333333	0.5	0.166667	0.333333	0.5

	PassengerId	label	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
_rowName
97	96	0	3	ShorneyMr.CharlesJoseph	male	NaN	0	0	374910	8.0500	None	S
273	272	1	3	TornquistMr.WilliamHenry	male	25	0	0	LINE	0.0000	None	S
524	523	0	3	LahoudMr.Sarkis	male	NaN	0	0	2624	7.2250	None	C
278	277	0	3	LindblomMiss.AugustaCharlotta	female	45	0	0	347073	7.7500	None	S
211	210	1	1	BlankMr.Henry	male	40	0	0	112277	31.0000	A31	C
210	209	1	3	CarrMiss.Helen"Ellen"	female	16	0	0	367231	7.7500	None	Q
11	10	1	2	NasserMrs.Nicholas(AdeleAchem)	female	14	1	0	237736	30.0708	None	C
281	280	1	3	AbbottMrs.Stanton(RosaHunt)	female	35	1	1	C.A.2673	20.2500	None	S
698	697	0	3	KellyMr.James	male	44	0	0	363592	8.0500	None	S
460	459	1	2	ToomeyMiss.Ellen	female	50	0	0	F.C.C.13531	10.5000	None	S

	Name	Age
_rowName
632	BarkworthMr.AlgernonHenryWilson	80.0
853	SvenssonMr.Johan	74.0
98	GoldschmidtMr.GeorgeB	71.0
495	ArtagaveytiaMr.Ramon	71.0
118	ConnorsMr.Patrick	70.5
674	MitchellMr.HenryMichael	70.0
747	CrosbyCapt.EdwardGifford	70.0
35	WheadonMr.EdwardH	66.0
282	DuaneMr.Frank	65.0
458	MilletMr.FrancisDavis	65.0

	titanic.Name	titanic.Pclass	class_stats.Pclass	class_stats.mean_age
_rowName
[835]-[[3]]	AugustssonMr.Albert	3	3	25.14062
[60]-[[2]]	WestMiss.ConstanceMirium	2	2	29.87763
[78]-[[3]]	StaneffMr.Ivan	3	3	25.14062
[356]-[[3]]	YousifMr.Wazli	3	3	25.14062
[134]-[[3]]	RobinsMrs.AlexanderA(GraceCharityLaury)	3	3	25.14062
[740]-[[3]]	IvanoffMr.Kanio	3	3	25.14062
[652]-[[3]]	MitkoffMr.Mito	3	3	25.14062
[10]-[[3]]	JohnsonMrs.OscarW(ElisabethVilhelminaBerg)	3	3	25.14062
[249]-[[2]]	HamalainenMrs.William(Anna)	2	2	29.87763
[280]-[[3]]	RiceMaster.Eric	3	3	25.14062

	PassengerId	label	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
_rowName
BarkworthMr.AlgernonHenryWilson aged 80	631	1	1	BarkworthMr.AlgernonHenryWilson	male	80.0	0	0	27042	30.0000	A23	S
SvenssonMr.Johan aged 74	852	0	3	SvenssonMr.Johan	male	74.0	0	0	347060	7.7750	None	S
GoldschmidtMr.GeorgeB aged 71	97	0	1	GoldschmidtMr.GeorgeB	male	71.0	0	0	PC17754	34.6542	A5	C
ArtagaveytiaMr.Ramon aged 71	494	0	1	ArtagaveytiaMr.Ramon	male	71.0	0	0	PC17609	49.5042	None	C
ConnorsMr.Patrick aged 70.5	117	0	3	ConnorsMr.Patrick	male	70.5	0	0	370369	7.7500	None	Q
MitchellMr.HenryMichael aged 70	673	0	2	MitchellMr.HenryMichael	male	70.0	0	0	C.A.24580	10.5000	None	S
CrosbyCapt.EdwardGifford aged 70	746	0	1	CrosbyCapt.EdwardGifford	male	70.0	1	1	WE/P5735	71.0000	B22	S
WheadonMr.EdwardH aged 66	34	0	2	WheadonMr.EdwardH	male	66.0	0	0	C.A.24579	10.5000	None	S
DuaneMr.Frank aged 65	281	0	3	DuaneMr.Frank	male	65.0	0	0	336439	7.7500	None	Q
MilletMr.FrancisDavis aged 65	457	0	1	MilletMr.FrancisDavis	male	65.0	0	0	13509	26.5500	E38	S

	632	853	98	495	118
_rowName
Cabin	A23	None	A5	None	None
Fare	30	7.775	34.6542	49.5042	7.75
SibSp	0	0	0	0	0
Ticket	27042	347060	PC17754	PC17609	370369
PassengerId	631	852	97	494	117
label	1	0	0	0	0
Age	80	74	71	71	70.5
Pclass	1	3	1	1	3
Name	BarkworthMr.AlgernonHenryWilson	SvenssonMr.Johan	GoldschmidtMr.GeorgeB	ArtagaveytiaMr.Ramon	ConnorsMr.Patrick
Sex	male	male	male	male	male
Parch	0	0	0	0	0
Embarked	S	S	C	C	Q

	Name	processed_name
_rowName
632	BarkworthMr.AlgernonHenryWilson	Barkworth Mr. Algernon Henry Wilson
853	SvenssonMr.Johan	Svensson Mr. Johan
98	GoldschmidtMr.GeorgeB	Goldschmidt Mr. George B
495	ArtagaveytiaMr.Ramon	Artagaveytia Mr. Ramon
118	ConnorsMr.Patrick	Connors Mr. Patrick
674	MitchellMr.HenryMichael	Mitchell Mr. Henry Michael
747	CrosbyCapt.EdwardGifford	Crosby Capt. Edward Gifford
35	WheadonMr.EdwardH	Wheadon Mr. Edward H
282	DuaneMr.Frank	Duane Mr. Frank
458	MilletMr.FrancisDavis	Millet Mr. Francis Davis

	counts
_rowName
Mr	521
Miss	182
Mrs	128
William	64
John	44
Master	40
Henry	35
George	24
James	24
Charles	24
Thomas	22
Mary	20
Edward	18
Anna	17
Joseph	16
Frederick	15
Johan	15
Elizabeth	15
Richard	14
Samuel	13

SELECT Tutorial

SELECT

FROM and LIMIT

ORDER BY

WHERE