The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.
In [1]:
from pymldb import Connection
mldb = Connection("http://localhost")
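Under the hood, Connection is a thin wrapper over HTTP. As a rough sketch (assuming MLDB is listening on http://localhost as above), the same kind of REST call can be made by hand with the requests library; pymldb adds conveniences such as error reporting and turning query results into DataFrames:
import requests

# The same sort of REST call that Connection issues, spelled out by hand.
resp = requests.get("http://localhost/v1/datasets")  # list available datasets
print(resp.status_code, resp.json())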
The classic Iris Flower Dataset isn't very big, but it's well-known and easy to reason about, so it makes a good dataset for machine learning examples.
We can import it directly from a remote URL:
In [2]:
mldb.put('/v1/procedures/import_iris', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/iris.data",
        "headers": ["sepal length", "sepal width", "petal length", "petal width", "class"],
        "outputDataset": "iris",
        "runOnCreation": True
    }
})
Out[2]:
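Before querying it, we can quickly confirm the import worked by fetching the new dataset's metadata over REST (a quick check; the response should include status details such as the row count):
# Sanity check (sketch): fetch the iris dataset's metadata via the REST API.
mldb.get('/v1/datasets/iris')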
We can use the Query API to get the data into a Pandas DataFrame to take a quick look at it.
In [3]:
df = mldb.query("select * from iris")
df.head()
Out[3]:
In [4]:
%matplotlib inline
import seaborn as sns, pandas as pd
sns.pairplot(df, hue="class", height=2.5)  # older seaborn versions call this parameter "size"
Out[4]:
kmeans.train Procedure
We will create and run a Procedure of type kmeans.train. This will train an unsupervised K-means model and use it to assign each row of the input dataset to a cluster in the output dataset.
In [5]:
mldb.put('/v1/procedures/iris_train_kmeans', {
    'type': 'kmeans.train',
    'params': {
        'trainingData': 'select * EXCLUDING(class) from iris',
        'outputDataset': 'iris_clusters',
        'numClusters': 3,
        'metric': 'euclidean',
        'runOnCreation': True
    }
})
Out[5]:
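First, a quick peek at a few raw cluster assignments in iris_clusters, using the same Query API as before:
# Each output row carries the cluster assigned to the corresponding input row.
mldb.query("select * from iris_clusters limit 5")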
Now we can look at the output dataset and compare the clusters the model learned with the three types of flower in the dataset.
In [6]:
mldb.query("""
select pivot(class, num) as *
from (
select cluster, class, count(*) as num
from merge(iris_clusters, iris)
group by cluster, class
)
group by cluster
""")
Out[6]:
As you can see, the K-means algorithm doesn't do a great job of clustering this data (as is mentioned in the Wikipedia article!).
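To put a rough number on that, here is a quick sketch in pandas: it re-runs the grouped counts from the query above (keeping the same aliases: cluster, class, num) and computes overall cluster purity, i.e. the fraction of rows that belong to their cluster's majority class:
# Grouped counts of (cluster, class) pairs, as in the pivot query above.
counts = mldb.query("""
    select cluster, class, count(*) as num
    from merge(iris_clusters, iris)
    group by cluster, class
""")

# Purity: sum over clusters of the majority-class count, over all rows.
purity = counts.groupby("cluster")["num"].max().sum() / float(counts["num"].sum())
print("overall cluster purity: %.2f" % purity)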
classifier.train and classifier.test Procedures
We will now create and run a Procedure of type classifier.train. The configuration below will use 20% of the data to train a decision tree to classify rows into the three classes of Iris. The output of this procedure is a Function, which we will be able to call from REST or SQL.
In [7]:
mldb.put('/v1/procedures/iris_train_classifier', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
            select
                {* EXCLUDING(class)} as features,
                class as label
            from iris
            where rowHash() % 5 = 0
        """,
        'algorithm': 'dt',
        'modelFileUrl': 'file://models/iris.cls',
        'mode': 'categorical',
        'functionName': 'iris_classify',
        'runOnCreation': True
    }
})
Out[7]:
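As noted above, the trained Function can be called from SQL as well as from REST. As a quick sketch, here it scores a few rows straight from the iris dataset, using the same features expression the test procedure below will use:
mldb.query("""
    select iris_classify({
        features: {* EXCLUDING(class)}
    }) as *
    from iris
    limit 5
""")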
We can now test the classifier we just trained on the subset of the data we didn't use for training. To do so, we use a procedure of type classifier.test.
In [8]:
rez = mldb.put('/v1/procedures/iris_test_classifier', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            select
                iris_classify({
                    features: {* EXCLUDING(class)}
                }) as score,
                class as label
            from iris
            where rowHash() % 5 != 0
        """,
        'mode': 'categorical',
        'runOnCreation': True
    }
})
runResults = rez.json()["status"]["firstRun"]["status"]
print(rez)
The procedure returns a confusion matrix, which you can compare with the one that resulted from the K-means procedure.
In [9]:
pd.DataFrame(runResults["confusionMatrix"])\
    .pivot_table(index="actual", columns="predicted", fill_value=0)
Out[9]:
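From the same confusion matrix we can also derive an overall accuracy figure: correct predictions (the diagonal) over the total. A small sketch in pandas, assuming each confusionMatrix entry has actual, predicted, and count fields, as the pivot above implies:
cm = pd.DataFrame(runResults["confusionMatrix"])
# Accuracy: rows where the prediction matched the actual label, over all rows.
correct = cm[cm["actual"] == cm["predicted"]]["count"].sum()
print("accuracy: %.3f" % (correct / float(cm["count"].sum())))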
As you can see, the decision tree does a much better job of classifying the data than the K-means model, using 20% of the examples as training data.
The procedure also returns standard classification statistics on how the classifier performed on the test set. Below are performance statistics for each label:
In [10]:
pd.DataFrame.from_dict(runResults["labelStatistics"]).transpose()
Out[10]:
They are also available, averaged over all labels:
In [11]:
pd.DataFrame.from_dict({"weightedStatistics": runResults["weightedStatistics"]})
Out[11]:
Finally, since the procedure exposed the model as the iris_classify Function, we can apply it directly over REST to score a new flower:
In [12]:
mldb.get('/v1/functions/iris_classify/application', input={
    "features": {
        "petal length": 1,
        "petal width": 2,
        "sepal length": 3,
        "sepal width": 4
    }
})
Out[12]:
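For convenience, that REST call can be wrapped in a small helper (hypothetical, not part of pymldb) so new flowers can be scored from plain Python dicts:
# Hypothetical convenience wrapper around the REST call shown above.
def classify_iris(features):
    """Score one flower; features maps column names to measurements."""
    return mldb.get('/v1/functions/iris_classify/application',
                    input={"features": features}).json()

classify_iris({"petal length": 1, "petal width": 2,
               "sepal length": 3, "sepal width": 4})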
Check out the other Tutorials and Demos.
You can also take a look at the classifier.experiment procedure type, which can be used to train and test a classifier in a single call.