The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.
In [1]:
from pymldb import Connection
mldb = Connection("http://localhost")
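Under the hood, Connection is a thin wrapper over HTTP. As a rough sketch (assuming MLDB is listening on http://localhost as above), the same kind of REST call can be made by hand with the requests library; pymldb adds conveniences such as error reporting and turning query results into DataFrames:
import requests

# The same sort of REST call that Connection issues, spelled out by hand.
resp = requests.get("http://localhost/v1/datasets")  # list available datasets
print(resp.status_code, resp.json())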
The classic Iris Flower Dataset isn't very big, but it's well-known and easy to reason about, so it makes a good dataset for machine learning examples.
We can import it directly from a remote URL:
In [2]:
mldb.put('/v1/procedures/import_iris', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/iris.data",
        "headers": ["sepal length", "sepal width", "petal length", "petal width", "class"],
        "outputDataset": "iris",
        "runOnCreation": True
    }
})
Out[2]:
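Before querying it, we can quickly confirm the import worked by fetching the new dataset's metadata over REST (a quick check; the response should include status details such as the row count):
# Sanity check (sketch): fetch the iris dataset's metadata via the REST API.
mldb.get('/v1/datasets/iris')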
We can use the Query API to get the data into a Pandas DataFrame to take a quick look at it.
In [3]:
df = mldb.query("select * from iris")
df.head()
Out[3]:
In [4]:
%matplotlib inline
import seaborn as sns, pandas as pd
sns.pairplot(df, hue="class", height=2.5)  # older seaborn versions call this parameter "size"
Out[4]:
kmeans.train Procedure
We will create and run a Procedure of type kmeans.train. This will train an unsupervised K-means model and use it to assign each row of the input dataset to a cluster in the output dataset.
In [5]:
mldb.put('/v1/procedures/iris_train_kmeans', {
    'type': 'kmeans.train',
    'params': {
        'trainingData': 'select * EXCLUDING(class) from iris',
        'outputDataset': 'iris_clusters',
        'numClusters': 3,
        'metric': 'euclidean',
        'runOnCreation': True
    }
})
Out[5]:
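First, a quick peek at a few raw cluster assignments in iris_clusters, using the same Query API as before:
# Each output row carries the cluster assigned to the corresponding input row.
mldb.query("select * from iris_clusters limit 5")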
Now we can look at the output dataset and compare the clusters the model learned with the three types of flower in the dataset.
In [6]:
mldb.query("""
select pivot(class, num) as *
from (
select cluster, class, count(*) as num
from merge(iris_clusters, iris)
group by cluster, class
)
group by cluster
""")
Out[6]:
As you can see, the K-means algorithm doesn't do a great job of clustering this data (as is mentioned in the Wikipedia article!).
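To put a rough number on that, here is a quick sketch in pandas: it re-runs the grouped counts from the query above (keeping the same aliases: cluster, class, num) and computes overall cluster purity, i.e. the fraction of rows that belong to their cluster's majority class:
# Grouped counts of (cluster, class) pairs, as in the pivot query above.
counts = mldb.query("""
    select cluster, class, count(*) as num
    from merge(iris_clusters, iris)
    group by cluster, class
""")

# Purity: sum over clusters of the majority-class count, over all rows.
purity = counts.groupby("cluster")["num"].max().sum() / float(counts["num"].sum())
print("overall cluster purity: %.2f" % purity)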
classifier.train and classifier.test Procedures
We will now create and run a Procedure of type classifier.train. The configuration below will use 20% of the data to train a decision tree to classify rows into the three classes of Iris. The output of this procedure is a Function, which we will be able to call from REST or SQL.
In [7]:
mldb.put('/v1/procedures/iris_train_classifier', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
            select
                {* EXCLUDING(class)} as features,
                class as label
            from iris
            where rowHash() % 5 = 0
        """,
        'algorithm': 'dt',
        'modelFileUrl': 'file://models/iris.cls',
        'mode': 'categorical',
        'functionName': 'iris_classify',
        'runOnCreation': True
    }
})
Out[7]:
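As noted above, the trained Function can be called from SQL as well as from REST. As a quick sketch, here it scores a few rows straight from the iris dataset, using the same features expression the test procedure below will use:
mldb.query("""
    select iris_classify({
        features: {* EXCLUDING(class)}
    }) as *
    from iris
    limit 5
""")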
We can now test the classifier we just trained on the subset of the data we didn't use for training. To do so, we use a procedure of type classifier.test.
In [8]:
rez = mldb.put('/v1/procedures/iris_test_classifier', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            select
                iris_classify({
                    features: {* EXCLUDING(class)}
                }) as score,
                class as label
            from iris
            where rowHash() % 5 != 0
        """,
        'mode': 'categorical',
        'runOnCreation': True
    }
})
runResults = rez.json()["status"]["firstRun"]["status"]
print(rez)
The procedure returns a confusion matrix, which you can compare with the one that resulted from the K-means procedure.
In [9]:
pd.DataFrame(runResults["confusionMatrix"])\
    .pivot_table(index="actual", columns="predicted", fill_value=0)
Out[9]:
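From the same confusion matrix we can also derive an overall accuracy figure: correct predictions (the diagonal) over the total. A small sketch in pandas, assuming each confusionMatrix entry has actual, predicted, and count fields, as the pivot above implies:
cm = pd.DataFrame(runResults["confusionMatrix"])
# Accuracy: rows where the prediction matched the actual label, over all rows.
correct = cm[cm["actual"] == cm["predicted"]]["count"].sum()
print("accuracy: %.3f" % (correct / float(cm["count"].sum())))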
As you can see, the decision tree does a much better job of classifying the data than the K-means model, using 20% of the examples as training data.
The procedure also returns standard classification statistics on how the classifier performed on the test set. Below are performance statistics for each label:
In [10]:
pd.DataFrame.from_dict(runResults["labelStatistics"]).transpose()
Out[10]:
They are also available, averaged over all labels:
In [11]:
pd.DataFrame.from_dict({"weightedStatistics": runResults["weightedStatistics"]})
Out[11]:
Finally, since the procedure exposed the model as the iris_classify Function, we can apply it directly over REST to score a new flower:
In [12]:
mldb.get('/v1/functions/iris_classify/application', input={
    "features": {
        "petal length": 1,
        "petal width": 2,
        "sepal length": 3,
        "sepal width": 4
    }
})
Out[12]:
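For convenience, that REST call can be wrapped in a small helper (hypothetical, not part of pymldb) so new flowers can be scored from plain Python dicts:
# Hypothetical convenience wrapper around the REST call shown above.
def classify_iris(features):
    """Score one flower; features maps column names to measurements."""
    return mldb.get('/v1/functions/iris_classify/application',
                    input={"features": features}).json()

classify_iris({"petal length": 1, "petal width": 2,
               "sepal length": 3, "sepal width": 4})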
Check out the other Tutorials and Demos.
You can also take a look at the classifier.experiment procedure type, which can be used to train and test a classifier in a single call.