Small Example of a Machine Learning Engine Ready for OCP

This is a small example of a machine learning engine, built with Python and ticdat, that is ready for deployment on the Opalytics Cloud Platform.

This example is built from this notebook demonstrating the use of sklearn to build a "spam"/"ham" classifier. Bear in mind that the math here is fairly complex, and that the black-box details of this math aren't the focus of this demonstration. Instead, we give an example of what an OCP-ready engine might look like, and show how the "fit" and "predict" modes can be tested in isolation. So if you don't understand the full radimrehurek notebook, don't worry about it. Just start by examining the ml_spam.py file and getting a general sense of how the math interacts with the data.
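
For orientation, here is a rough sketch of what the two schemas inside ml_spam.py presumably look like. The field names below are inferred from the rest of this notebook (the primary key field name on the parameters tables is a guess), so treat this as a sketch rather than a copy of the real file.

    from ticdat import TicDatFactory

    # Input schema sketch. messages has no primary key, so it behaves like a
    # list of {"label":..., "message":...} records.
    dataFactory = TicDatFactory(
        parameters=[["key"], ["value"]],
        messages=[[], ["label", "message"]])

    # Output schema sketch. After a "fit" run, parameters holds the pickled
    # predictor; after a "predict" run, predictions holds one row per message.
    solnFactory = TicDatFactory(
        parameters=[["key"], ["value"]],
        predictions=[[], ["prediction"]])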


In [1]:
from ml_spam import dataFactory, run, solnFactory
from ticdat import LogFile

First, we load the data we want to fit.


In [2]:
fit_data = dataFactory.csv.create_tic_dat("fit_me/")

The run routine knows this data is meant to be fitted because of the parameter setting.


In [3]:
fit_data.parameters


Out[3]:
{'mode': _td:{'value': 'fit'}}
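
Inside run, reading that mode is just a lookup on the parameters table. A minimal sketch of the dispatch (using the fit_data object we just loaded; the real ml_spam.py may phrase this differently):

    # sketch: how run might read the mode off the parameters table
    mode = fit_data.parameters["mode"]["value"]
    assert mode in ("fit", "predict")
    print "running in %s mode" % mode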

Here we fit the data. We create some log files for diagnostic purposes. (Don't use the logging builtins for this; use the ticdat LogFile class instead.)


In [4]:
fit_results = run(fit_data, LogFile("output_fit.txt"), LogFile("error_fit.txt"))


/Users/petercacioppi/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2645: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)

Because we did a "fit" run, the fit_results object consists of a big slab of honking text in the parameters table and not much else.
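
The fit branch of run might look roughly like the sketch below. The CountVectorizer/MultinomialNB pipeline is just a stand-in for whatever the real ml_spam.py fits (that detail comes from the radimrehurek notebook); the ticdat-facing move to notice is that the fitted predictor is pickled into a single string and returned as a parameters entry.

    import cPickle
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    def _fit(dat): # sketch; assumes the solnFactory imported above
        texts = [r["message"] for r in dat.messages]
        labels = [r["label"] for r in dat.messages]
        # stand-in predictor; the real engine's pipeline may differ
        predictor = Pipeline([("bow", CountVectorizer()),
                              ("classifier", MultinomialNB())])
        predictor.fit(texts, labels)
        soln = solnFactory.TicDat()
        # serialize the whole fitted predictor into one big string
        soln.parameters["fitted CLOB"] = cPickle.dumps(predictor)
        return soln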


In [5]:
assert not fit_results.predictions
print  "The length of the big CLOB is %s"%len(fit_results.parameters["fitted CLOB"]["value"])


The length of the big CLOB is 1832065

This is a big honking string, but let's not be afraid. CSV should be OK handling it (though, as we'll see below, we need to raise Python's csv field size limit to read it back).


In [6]:
solnFactory.csv.write_directory(fit_results, "archived_results", allow_overwrite=True)

OK, now pretend the whole system took a nice long nappy nap. When it woke up, some user came along and entered data that just so happened to be equivalent to every 10th record (starting with the 6th record) of the original data.


In [7]:
data_to_predict = dataFactory.TicDat()
for i,r in enumerate(fit_data.messages): # messages has no primary key so it is akin to a list
    if i%10 == 5:
        data_to_predict.messages.append({"message":r["message"], 
                                         "label":"IGNORED BECAUSE I'M NOW GOING TO PREDICT"})

The user knows he wants to predict this data, so he sets the parameter correctly.


In [8]:
data_to_predict.parameters["mode"] = "predict"

Here, the fitted CLOB that we archived before the nap is now going to be recovered and used to recreate the predictor object.

(The OCP will have some special gizmos to facilitate this "what used to be an output is now an input".)


In [9]:
import csv, sys # doing some extra magic here to make sure csv works properly...
csv.field_size_limit(sys.maxsize) # with our massive field from before
recovered_fit_results = solnFactory.csv.create_tic_dat("archived_results")
data_to_predict.parameters["fitted CLOB"] = \
    recovered_fit_results.parameters["fitted CLOB"]["value"]

OK, let's see if it works!


In [10]:
predict_results = run(data_to_predict, 
                      LogFile("output_predict.txt"), LogFile("error_predict.txt"))

In [11]:
len(predict_results.predictions)


Out[11]:
557

In [12]:
len(predict_results.predictions) == len(data_to_predict.messages)


Out[12]:
True

OK, it looked like something good happened, but let's examine it a little more closely. I'm going to go back and pick out the labels associated with the data_to_predict messages.


In [13]:
labels = []
for i,r in enumerate(fit_data.messages): # messages has no primary key so it is akin to a list
    if i%10 == 5:
        labels.append(r["label"])
assert len(labels) == len(predict_results.predictions) == len(data_to_predict.messages)
{_:len([x for x in labels if x == _]) for _ in ["ham","spam"]}


Out[13]:
{'ham': 490, 'spam': 67}

And now I'll just see how well those labels matched up against the predictions. (Again, this notebook isn't focused on rigorous ML methodology. I'm exercising the ml_spam file, in the same way you can exercise your own ticdat-compatible ML engine.)


In [14]:
worked = 0
for l,p in zip(labels, predict_results.predictions):
    if l == p["prediction"]:
        worked += 1
print "It recreated %s out of %s"%(worked, len(labels))


It recreated 446 out of 557

In [15]:
{_:len([x for x in predict_results.predictions if x["prediction"] == _]) for _ in ["ham","spam"]}


Out[15]:
{'ham': 491, 'spam': 66}
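
If you want to look a little closer than raw agreement counts, a small confusion tally is easy to build from the labels and predictions we already have in hand. (Another sketch; rigorous evaluation is out of scope for this notebook.)

    # tally (true label, predicted label) pairs from the cells above
    from collections import Counter
    confusion = Counter((l, p["prediction"])
                        for l, p in zip(labels, predict_results.predictions))
    for (truth, predicted), count in sorted(confusion.items()):
        print "true %s predicted as %s : %s" % (truth, predicted, count)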

This seems good enough. Let's recap what we saw here.

  1. You make TicDatFactories for the input and output schemas, as always.
  2. You use things like sklearn to do the fancy ML work inside a run routine.
    1. This run routine has two modes, which are read from a parameters table.
      1. The fit mode creates a predictor object from a big data set. This predictor object is turned into a string with cPickle.dumps, and that string is returned as a parameters result in the output schema.
      2. The predict mode looks for both a data set and a big parameter string that can be turned back into a predictor object with cPickle.loads. The recreated predictor object then makes predictions about the input data, and those predictions populate the appropriate output schema table (see the sketch after this list).
  3. You test this run routine with scripts similar to this notebook.
  4. The file that defines run, dataFactory and solnFactory is compatible with Opalytics, and can be deployed on our system.
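
To make point 2 concrete, here is a rough sketch of the predict branch, the mirror image of the fit sketch above. The helper name and details are assumptions; the essential part is the cPickle.loads round trip that recreates the predictor from the big parameter string.

    import cPickle

    def _predict(dat): # sketch; assumes the solnFactory imported above
        # recreate the predictor object from the big parameter string;
        # str() in case the CLOB came back from csv as unicode
        predictor = cPickle.loads(str(dat.parameters["fitted CLOB"]["value"]))
        soln = solnFactory.TicDat()
        for r in dat.messages:
            # one prediction row per input message; the label field is ignored
            soln.predictions.append(
                {"prediction": predictor.predict([r["message"]])[0]})
        return soln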