Loading Data Tutorial

MLDB operates on data via Datasets, which can be created and populated in three different ways:

You can create a mutable Dataset and insert data row by row via REST.
You can create a Dataset from an existing file.
You can create a Dataset by running a Procedure.

Creating a dataset via REST by POSTing rows

Creating a Dataset is a simple REST call. The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.



In [11]:

    
from pymldb import Connection
mldb = Connection()

Let's create a dataset called example.



In [12]:

    
mldb.put('/v1/datasets/example', { "type":"sparse.mutable" })









    Out[12]:




PUT http://localhost/v1/datasets/example
201 Created
 {
  "status": {
    "columnCount": 0, 
    "rowCount": 0, 
    "valueCount": 0
  }, 
  "config": {
    "type": "sparse.mutable", 
    "id": "example"
  }, 
  "state": "ok", 
  "type": "sparse.mutable", 
  "id": "example"
}

That's all there is to it, and now we can add some rows.



In [13]:

    
mldb.post('/v1/datasets/example/rows', {
    "rowName": "first row",
    "columns": [
        ["first column", 1, 0],
        ["second column", 2, 0]
    ]
})

mldb.post('/v1/datasets/example/rows', {
    "rowName": "second row",
    "columns": [
        ["first column", 3, 0],
        ["second column", 4, 0]
    ]
})
mldb.post("/v1/datasets/example/commit")









    Out[13]:




POST http://localhost/v1/datasets/example/commit
200 OK

So now we have a little bit of data in our dataset. Let's check.



In [14]:

    
mldb.query("select * from example")









    Out[14]:






  
    
      
      first column
      second column
    
    
      _rowName
      
      
    
  
  
    
      first row
      1
      2
    
    
      second row
      3
      4

Creating a Dataset from a file

Now we'll load a dataset from a CSV file. Let's start by creating a simple CSV file to import:



In [15]:

    
%mkdir /mldb_data/datasets









    



mkdir: cannot create directory ‘/mldb_data/datasets’: File exists



In [16]:

    
%%writefile /mldb_data/datasets/sample.csv
a column,another column
a,b
c,d









    



Overwriting /mldb_data/datasets/sample.csv

We can import this dataset with a simple procedure.



In [17]:

    
mldb.put('/v1/procedures/import_example2', {
    "type":"import.text",
    "params": {
        "dataFileUrl":"file:///mldb_data/datasets/sample.csv",
        "outputDataset": "example2",
        "runOnCreation": True
    }
})









    Out[17]:




PUT http://localhost/v1/procedures/import_example2
201 Created
 {
  "status": {
    "firstRun": {
      "runStarted": "2016-03-22T16:20:51.3476582Z", 
      "status": {
        "numLineErrors": 0
      }, 
      "runFinished": "2016-03-22T16:20:51.3526092Z", 
      "id": "2016-03-22T16:20:51.347551Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "outputDataset": "example2", 
      "runOnCreation": true, 
      "dataFileUrl": "file:///mldb_data/datasets/sample.csv"
    }, 
    "type": "import.text", 
    "id": "import_example2"
  }, 
  "state": "ok", 
  "type": "import.text", 
  "id": "import_example2"
}

And a query to validate that things got loaded correctly!



In [18]:

    
mldb.query("select * from example2")









    Out[18]:






  
    
      
      a column
      another column
    
    
      _rowName
      
      
    
  
  
    
      2
      a
      b
    
    
      3
      c
      d

Creating a Dataset by running a Procedure on another Dataset

Procedures take Datasets as inputs and can create Datasets as outputs. This is how you can do data cleanup/transformation in MLDB. Here's a simple example with the transform Procedure:



In [19]:

    
mldb.put('/v1/procedures/example', {
    "type": "transform",
    "params": {
        "inputData": 'select "first column" + "second column" as "transformed column" from example',
        "outputDataset": "example3"
    }
})
mldb.post('/v1/procedures/example/runs')









    Out[19]:




POST http://localhost/v1/procedures/example/runs
201 Created
 {
  "status": {
    "columnCount": 1, 
    "rowCount": 2, 
    "valueCount": 3
  }, 
  "runStarted": "2016-03-22T16:20:51.3896525Z", 
  "state": "finished", 
  "runFinished": "2016-03-22T16:20:51.3908167Z", 
  "id": "2016-03-22T16:20:51.389521Z-5bc7042b732cb41f"
}



In [20]:

    
mldb.query("select * from example3")









    Out[20]:






  
    
      
      transformed column
    
    
      _rowName
      
    
  
  
    
      first row
      3
    
    
      second row
      7

Where to next?

Check out the other Tutorials and Demos.



In [ ]: