Selecting Columns Programmatically Using Column Expressions Tutorial

MLDB provides a complete implementation of the SQL SELECT statement. Most of the functions you are accustomed to using are available in your queries.

MLDB is different from traditional SQL databases in that there is no enforced schema on rows, allowing you to work with millions of columns of sparse data. This makes it easy to load and manipulate sparse datasets, even when there are millions of columns. To reduce the size of your dataset or use only specific variables, we may need to select columns based on specific critera. Column Expressions is an MLDB extension that provides additional control over your column selection. With a column expression, you can programmatically return specific columns with a SQL SELECT statement.

In this tutorial, we will provide examples of COLUMN EXPR within SELECT statements. This tutorial assumes familiarity with Procedures and Datasets. We suggest going through the Procedures and Functions Tutorial and the Loading Data Tutorial beforehand.

Setting up

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.


In [12]:
from pymldb import Connection
mldb = Connection()

Basic usage example

Let's begin by loading and visualizing our data. We will be using the dataset from the Virtual Manipulation of Datasets Tutorial. We had chosen the tokenize function to count the number of words in the Wikipedia descriptions of several Machine Learning concepts (please check out the tutorial for more details).


In [13]:
print mldb.put("/v1/procedures/import_ML_concepts", {
    "type":"import.text",
    "params": {
        "dataFileUrl":"http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
        "outputDataset":{
            "id":"ml_concepts",
            "type": "sparse.mutable"
        },
        "named": "Concepts",
        "select": """ 
            tokenize(
                lower(Text), 
                {splitChars: ' -''"?!;:/[]*,().',  
                minTokenLength: 4}) AS *
        """,
        "runOnCreation": True
    }
})


<Response [201]>

Each word is represented by a column and each Machine Learning concept by a row. We can run a simple SELECT query to take a quick look at the first 5 rows of our dataset.


In [14]:
mldb.query("SELECT * FROM ml_concepts LIMIT 5")


Out[14]:
addition algorithm algorithms also analysis analyze applications approach assigns associated ... popularized provide rather recurrent serve stored systems threshold understanding wrong
_rowName
Support vector machine 1 2 1 1 1 1 1 1 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Logistic regression NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Deep belief network NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Restricted boltzmann machines NaN 1 2 NaN NaN NaN 1 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Hopfield network NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN ... 1 1 1 1 1 1 1 1 1 1

5 rows × 286 columns

There are 286 columns, some of which may or may not be useful to the data analysis at hand. For example, we may want to rebuild a dataset with:

  • verbs and adverbs that end with "ing"
  • words that appear at least twice in each of the descriptions of the Machine Learning concepts.

This can be done in a few queries as you will see below.

Using column expressions to keep columns that end with "ing"

Column Expressions provide efficient ways of picking and choosing our columns. For example, we can only choose verbs and adverbs that end with "ing" to understand the overall meaning of a description.

We use the columnName column expression function along with the LIKE SQL expression, as you will see below.


In [15]:
mldb.query("""
    SELECT COLUMN EXPR (WHERE columnName() LIKE '%ing')
    FROM ml_concepts
    LIMIT 5
""")


Out[15]:
belonging clustering learning making mapping performing preprocessing training using estimating being starting depending filtering modelling understanding
_rowName
Support vector machine 1 3 5 1 1 1 1 2 1 NaN NaN NaN NaN NaN NaN NaN
Logistic regression NaN NaN NaN NaN NaN NaN NaN NaN 3 1 NaN NaN NaN NaN NaN NaN
Deep belief network NaN NaN 2 NaN NaN NaN NaN 2 NaN NaN 1 1 NaN NaN NaN NaN
Restricted boltzmann machines NaN NaN 2 NaN NaN NaN NaN 1 NaN NaN NaN NaN 1 1 1 NaN
Hopfield network NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1

This is very powerful because the LIKE statement in Standard SQL is typically found in row operations and more rarely in column operations. MLDB makes it simple to use such SQL expressions on columns.

Using column expressions to keep columns that appear in multiple descriptions

With Column Expressions, we can select columns based on specific row selection criteria. COLUMN EXPR</code> will allow us for example to choose words that appear in multiple descriptions. In this case, we filter on words that show up at least 4 times.

To achieve the desired outcome, we use a Built-in Function available in column expressions called rowCount. rowCount iterates through each column and returns the number of rows that have a value for the specific column.


In [16]:
mldb.query("""
    SELECT COLUMN EXPR (WHERE rowCount() > 4)
    FROM ml_concepts
""")


Out[16]:
learning machine machines model training used with network neural
_rowName
Support vector machine 5 1 2 2 2 2 1 NaN NaN
Logistic regression NaN NaN NaN 4 NaN 1 1 NaN NaN
Deep belief network 2 1 1 1 2 NaN 1 4 1
Restricted boltzmann machines 2 1 3 NaN 1 NaN 1 1 1
Hopfield network NaN NaN NaN 1 NaN NaN 1 2 1
Naive bayes classifier 2 1 1 1 1 1 4 NaN NaN
Boltzmann machine 3 2 2 NaN 1 NaN 1 2 2
Autoencoder 2 NaN NaN NaN NaN 2 NaN 2 1
Artificial neural network 1 1 NaN NaN NaN 1 NaN NaN 3

The results make sense. The words that we found above in the columns are common in Machine Learning concept descriptions. With a plain SQL statement and the rowCount function, we reduced our dataset to include words that appear at least 4 times.

Nested JSON example

Nested JSON objects can have complex schemas, often involving multi-level and multidimensional data structures. In this section we will create a more complex dataset to illustrate ways to simplify data structures and column selection with Built-in Function and Column Expression.

Let's first create an empty dataset called 'toy_example'.


In [17]:
# create dataset
print mldb.put('/v1/datasets/toy_example', { "type":"sparse.mutable" })


<Response [201]>

We will now create one row in the 'toy_example' dataset with the 'row1' JSON object below.


In [18]:
import json

row1 = {
  "name": "Bob", 
  "address": {"city": "Montreal", "street": "Stanley"}, 
  "sports": ["soccer","hockey"], 
  "friends": [{"name": "Mich", "age": 25}, {"name": "Jean", "age": 28}]
}

# update dataset by adding a row
mldb.post('/v1/datasets/toy_example/rows', {
    "rowName": "row1",
    "columns": [["data", json.dumps(row1), 0]]
})
# save changes
mldb.post("/v1/datasets/toy_example/commit")


Out[18]:
POST http://localhost/v1/datasets/toy_example/commit
200 OK

We will check out our data with a SELECT query.


In [19]:
mldb.query("SELECT * FROM toy_example")


Out[19]:
data
_rowName
row1 {"sports": ["soccer", "hockey"], "friends": [{...

There are many elements within the cell above. We will need to better structure elements within the nested JSON object.

Working with nested JSON objects with built-in functions and column expressions

To understand and query nested JSON objects, we will be using a Built-in Function called parse_json</code> and a Column Expression columnPathElement</code>.

This is where the parse_json function comes in handy. It will help us turn a multidimensional JSON object into a 2D dataset.


In [20]:
mldb.query("""
        SELECT parse_json(data, {arrays: 'parse'}) AS * 
        FROM toy_example
""")


Out[20]:
address.city address.street friends.0.age friends.0.name friends.1.age friends.1.name name sports.0 sports.1
_rowName
row1 Montreal Stanley 25 Mich 28 Jean Bob soccer hockey

parse_json is a powerful feature since we can create 2D representations out of multidimensional data. We can read all of the elements of the JSON object on one line. It is also easier to SQL as we will see below.

columnPathElement makes it convenient to navigate specific parts of the data structure. In the next block of code, we will do the following:

  • use parse_json to parse each data element of the object on one row (same as above)
  • select specific cells using columnPathElement where the the column path name at index = 2 is 'name' (note that 'friends' is at index = 0)

In [21]:
mldb.query("""
    SELECT COLUMN EXPR (WHERE columnPathElement(2) = 'name') 
    FROM (
        SELECT parse_json(data, {arrays: 'parse'}) AS * NAMED rowPath() FROM toy_example
    )
""")


Out[21]:
friends.0.name friends.1.name
_rowName
row1 Mich Jean

We now know the name of Bob's two friends... As you may have noticed, this is very practical if we want to query a specific attribute of a nested object. The columnPathElement Column Expression allows us to easily query specific JSON data levels or dimensions.

Column operations such as the ones shown in this tutorial can be difficult without column expressions. Column Expressions offer a compact and flexible way to programmatically select columns. It is a great tool to carve out the data that is most needed for your analysis.

In this tutorial, we covered three Column Expressions:

  • columnName which returns the name of the columns inside our dataset
  • rowCount which returns the number of non-empty rows for each column
  • columnPathElement which allows us to chose columns at specific sub-levels

Where to next?

Check out the other Tutorials and Demos.