This is a small example of a machine learning engine that is built with Python and ticdat,
and is ready for deployment on the Opalytics Cloud Platform.
This example is adapted from this notebook demonstrating the use of sklearn
to build a "spam"/"ham" classifier. Bear in mind that the math here is fairly complex, and that the black-box details of this math aren't the focus of this demonstration. Instead, we give an example of what an OCP-ready engine might look like, and how the "fit" and "predict" modes can be tested in isolation. So if you don't fully understand the radimrehurek notebook, don't worry about it. Just start by examining the ml_spam.py
file to get a general sense of how the math interacts with the data.
In [1]:
from ml_spam import dataFactory, run, solnFactory
from ticdat import LogFile
First we load some data that we want fitted.
In [2]:
fit_data = dataFactory.csv.create_tic_dat("fit_me/")
The run routine knows this data is ready to be fitted because of the "mode" parameter setting.
In [3]:
fit_data.parameters
Out[3]:
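The idea is simply that the parameters table carries a "mode" value of either "fit" or "predict", and the engine branches on it. A minimal sketch of that dispatch pattern (a plain dict stands in for the TicDat parameters table, and the `run_sketch` name and its stub return values are hypothetical, not part of ml_spam.py):

```python
def run_sketch(parameters):
    # Dispatch on the "mode" parameter, defaulting to "fit" when absent.
    mode = parameters.get("mode", "fit")
    if mode == "fit":
        return "fitting"      # stand-in for the real fitting logic
    if mode == "predict":
        return "predicting"   # stand-in for the real prediction logic
    raise ValueError("unexpected mode %s" % mode)

print(run_sketch({}))                     # no mode set -> "fitting"
print(run_sketch({"mode": "predict"}))    # -> "predicting"
```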
Here we fit the data. We create some log files for diagnostic purposes. (Don't use the logging builtins; use the ticdat
classes instead.)
In [4]:
fit_results = run(fit_data, LogFile("output_fit.txt"), LogFile("error_fit.txt"))
Because we did a "fit" run, the fit_results
object consists of a big slab of honking text in the parameters table and not much else.
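To see why a fitted model ends up as one big text blob, recall that any Python object can be serialized to a string with the pickle module (cPickle in the Python 2 of this notebook). A small stdlib-only sketch, with a toy dict standing in for the fitted predictor:

```python
import pickle

toy_predictor = {"weights": [0.1, 0.2, 0.3], "classes": ["ham", "spam"]}

# Serialize to bytes with the ASCII-safe protocol 0, then decode into a
# text CLOB that could live in a parameters table.
clob = pickle.dumps(toy_predictor, protocol=0).decode("ascii")

# Round-trip: recover an equivalent object from the CLOB.
recovered = pickle.loads(clob.encode("ascii"))
assert recovered == toy_predictor
```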
In [5]:
assert not fit_results.predictions
print("The length of the big CLOB is %s" % len(fit_results.parameters["fitted CLOB"]["value"]))
This is a big honking string, but let's not be afraid. CSV should be OK handling it.
In [6]:
solnFactory.csv.write_directory(fit_results, "archived_results", allow_overwrite=True)
OK, now pretend the whole system took a nice long nap. When it woke up, some user came along and entered data that just so happened to be equivalent to every 10th record (starting with the 6th record) of the original data.
In [7]:
data_to_predict = dataFactory.TicDat()
for i, r in enumerate(fit_data.messages): # messages has no primary key, so it is akin to a list
    if i % 10 == 5:
        data_to_predict.messages.append({"message": r["message"],
                                         "label": "IGNORED BECAUSE I'M NOW GOING TO PREDICT"})
The user wants predictions for this data, so the parameter is set accordingly.
In [8]:
data_to_predict.parameters["mode"] = "predict"
Here, the fitted CLOB that we archived before the nap is recovered and used to recreate the predictor object.
(The OCP will have some special gizmos to facilitate this "what used to be an output is now an input" pattern.)
In [9]:
import sys
import csv # doing some extra magic here to make sure csv works properly...
csv.field_size_limit(sys.maxsize) # with our massive field from before
recovered_fit_results = solnFactory.csv.create_tic_dat("archived_results")
data_to_predict.parameters["fitted CLOB"] = \
    recovered_fit_results.parameters["fitted CLOB"]["value"]
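The field size limit matters because Python's csv module refuses, by default, to read any field longer than 131072 characters. A stdlib-only sketch of both the failure and the fix (using an in-memory buffer rather than the archived_results directory):

```python
import csv
import io
import sys

big_field = "x" * 200000  # longer than the default 131072-character limit
buf = io.StringIO()
csv.writer(buf).writerow(["fitted CLOB", big_field])

# With the default limit, reading this row raises csv.Error.
buf.seek(0)
try:
    next(csv.reader(buf))
    raised = False
except csv.Error:
    raised = True

# Raising the limit lets the oversized field through.
csv.field_size_limit(sys.maxsize)
buf.seek(0)
row = next(csv.reader(buf))
assert row[1] == big_field
```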
OK, let's see if it works!
In [10]:
predict_results = run(data_to_predict,
                      LogFile("output_predict.txt"), LogFile("error_predict.txt"))
In [11]:
len(predict_results.predictions)
Out[11]:
In [12]:
len(predict_results.predictions) == len(data_to_predict.messages)
Out[12]:
OK, it looked like something good happened, but let's examine it a little more closely. I'm going to go back and pick out the labels associated with the data_to_predict
messages.
In [13]:
labels = []
for i, r in enumerate(fit_data.messages): # messages has no primary key, so it is akin to a list
    if i % 10 == 5:
        labels.append(r["label"])
assert len(labels) == len(predict_results.predictions) == len(data_to_predict.messages)
{_:len([x for x in labels if x == _]) for _ in ["ham","spam"]}
Out[13]:
And now I'll just see how well those labels matched up against the predictions. (Again, this notebook isn't focused on rigorous ML methodology. I'm exercising the ml_spam
file, in the same way you can exercise your own ticdat
compatible ML engine.)
In [14]:
worked = 0
for l, p in zip(labels, predict_results.predictions):
    if l == p["prediction"]:
        worked += 1
print("It recreated %s out of %s" % (worked, len(labels)))
In [15]:
{_:len([x for x in predict_results.predictions if x["prediction"] == _]) for _ in ["ham","spam"]}
Out[15]:
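As an aside, the label-tally dict comprehensions above can also be written with collections.Counter, and the matching loop reduces to a sum over zipped pairs. A tiny sketch on made-up toy data:

```python
from collections import Counter

toy_labels = ["ham", "ham", "spam", "ham", "spam"]
toy_predictions = ["ham", "spam", "spam", "ham", "spam"]

# One-pass tally of each distinct label.
tally = Counter(toy_labels)

# Count how many predictions matched their labels.
worked = sum(l == p for l, p in zip(toy_labels, toy_predictions))
print("recreated %s out of %s" % (worked, len(toy_labels)))  # recreated 4 out of 5
```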
This seems good enough. Let's recap what we saw here.
- TicDatFactories for the input and output schemas, as always.
- sklearn does the fancy ML work inside of a run routine.
- The run routine has two modes, which are read from the parameters table.
- In "fit" mode, the fitted predictor object is serialized into a string with cPickle.dumps, and that string is returned as a parameters result in the output schema.
- In "predict" mode, the archived string is deserialized with cPickle.loads. The recreated predictor object then makes predictions about the input data. These predictions populate the appropriate output schema table.
- You can test the run routine with scripts similar to this notebook.
- An engine that defines run, dataFactory and solnFactory is compatible with Opalytics, and can be deployed on our system.
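Putting the pieces together, the overall shape of such an engine can be sketched in plain Python. This is a heavily simplified, hypothetical stand-in: dicts replace TicDat objects, pickle replaces cPickle, the log-file arguments are dropped, and the lookup-table "predictor" is a toy, where a real engine would define schemas with TicDatFactory and fit an sklearn model.

```python
import pickle

def run(dat):
    # Dispatch on the "mode" parameter, as the real run routine does.
    if dat["parameters"].get("mode", "fit") == "fit":
        # "Fit": learn something from the messages, then pickle it into a CLOB.
        predictor = {m["message"]: m["label"] for m in dat["messages"]}
        clob = pickle.dumps(predictor, protocol=0).decode("ascii")
        return {"parameters": {"fitted CLOB": clob}, "predictions": []}
    # "Predict": recreate the predictor from the CLOB and label each message.
    predictor = pickle.loads(dat["parameters"]["fitted CLOB"].encode("ascii"))
    predictions = [{"message": m["message"],
                    "prediction": predictor.get(m["message"], "ham")}
                   for m in dat["messages"]]
    return {"parameters": {}, "predictions": predictions}

# Fit, archive the CLOB, then predict with it, mimicking the notebook's flow.
fit_dat = {"parameters": {"mode": "fit"},
           "messages": [{"message": "free $$$", "label": "spam"},
                        {"message": "hi mom", "label": "ham"}]}
clob = run(fit_dat)["parameters"]["fitted CLOB"]
predict_dat = {"parameters": {"mode": "predict", "fitted CLOB": clob},
               "messages": [{"message": "free $$$", "label": "IGNORED"}]}
print(run(predict_dat)["predictions"][0]["prediction"])  # spam
```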