By Chris Emmery, 14-04-2016, 5 minute read
This is a follow-up to this post.
In my last entry, I wrote about several hurdles on the way to replacing pickle with JSON for storing scikit-learn pipelines. While my previous solution was satisfactory for handling a class per file, storing an entire pipeline introduces more complexity than I initially assumed. In this follow-up, I will quickly illustrate one of these issues and provide an effective solution.
Please note that all code is in Python 3.x, sklearn 0.17, and numpy 1.9.
We left off using __dict__ representations for each of the scikit-learn
classes, converting their data structures (including those from numpy) with
a small script and storing them per pipeline item. This would make a final
application look as follows:
vec = deserialize(DictVectorizer(), json.load(open('vec.json')))
svm = deserialize(LinearSVC(), json.load(open('clf.json')))
user_input = "hey can I get more examples"
grams = vec.transform(extract_grams(user_input, [2]))
print(svm.predict(grams))
# output ---------
[1]
The assumptions are that 1) your pipeline is quite small, so it's not too convoluted to store its items separately, and 2) it has static components, e.g. it will always use an SVM and never do any preprocessing. If you're interested in reproducibility only, this is good enough. For demos, however, flexibility can be important.
Let's say we just want to allow selecting a trained model. The easiest way would be to store the pipeline in a dictionary, for example:
pipeline = {
    "clf": GaussianNB(),
    "vec": DictVectorizer(),
}
It shouldn't really matter what clf is, as long as it has the same
methods as all other sklearn classes. Subsequently, our application can be
reduced to the following:
pl = deserialize(Pipeline(), json.load(open('pipeline.json')))
user_input = "hey can I get more examples"
grams = pl['vec'].transform(extract_grams(user_input, [2]))
print(pl['clf'].predict(grams))
However, to achieve this, we would need to serialize the classes in a way that
we can deserialize them to their initialized form. Hence, just storing them as
their __dict__ representation is not enough.
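To see why, note that `__dict__` only captures the attribute values, not the class they belong to. A minimal illustration (`Dummy` is a made-up class, just for demonstration):

```python
class Dummy:
    def __init__(self, n=0):
        self.n = n

d = Dummy(3)
# the attributes survive, but the fact that this was a Dummy is lost
print(d.__dict__)  # {'n': 3}
```

From `{'n': 3}` alone, a deserializer has no way of knowing which class to instantiate.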
How does one store a Python object in a form that JSON can handle, and that we can deserialize in our application? Remember that before, we set class attributes like so:
In [1]:
import serialize_sk as sr
def deserialize(class_init, attr):
    for k, v in attr.items():
        setattr(class_init, k, sr.json_to_data(v))
    return class_init
We already know how to set the attributes (with __dict__), but we need a way
to get a representation from a class object which we can use to initialize it.
Python lets you inspect a class through attributes such as __class__,
__name__, and __module__, like so:
In [2]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
print(str(vec.__class__))
print(vec.__class__.__name__)
print(vec.__module__)
As we can see from the output, the first returns the full class representation, and the second just its name. However, we would need the full import path in order to import it, which is exactly what the third option provides. From there, we can easily import and initialize the class by string, like so:
In [3]:
import sys
class_ = getattr(sys.modules[vec.__module__], vec.__class__.__name__)
new_vec = class_()
new_vec
Out[3]:
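Note that the sys.modules lookup above only works when the module has already been imported somewhere in the application. For the general case, importlib can import the module by its dotted path first. A small sketch (not from the original post; class_from_strings is a hypothetical helper name):

```python
import importlib

def class_from_strings(module_path, class_name):
    # import the module by its dotted path, then fetch the class by name
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# e.g. rebuild an OrderedDict without having imported collections up front
cls = class_from_strings('collections', 'OrderedDict')
new_instance = cls()
print(type(new_instance).__name__)  # OrderedDict
```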
Afterwards, we can use setattr again, as in the deserialize function above, to
restore our settings. We just need to store the module and class name in a
format along with the __dict__ to pass to the deserializer. Something like:
In [4]:
import json
def serialize_class(cls_):
    return sr.data_to_json({'mod': cls_.__module__,
                            'name': cls_.__class__.__name__,
                            'attr': cls_.__dict__})

def deserialize_class(cls_repr):
    cls_repr = sr.json_to_data(cls_repr)
    cls_ = getattr(sys.modules[cls_repr['mod']], cls_repr['name'])
    cls_init = cls_()
    for k, v in cls_repr['attr'].items():
        setattr(cls_init, k, v)
    return cls_init
cls_str = serialize_class(vec)
json.dump(cls_str, open('./vec_class.json', 'w'))
cls_js = json.load(open('./vec_class.json'))
deserialize_class(cls_js)
Out[4]:
Great! Now the classes can be used in a pipeline dictionary. As the script I provided in the previous post is recursive, these methods can be built in without much effort. However, while reading up on these object serialization techniques, I found an even better alternative (provided you don't mind dependencies).
So far, I had managed to manually convert most numpy cases in scikit-learn's
modules, and to store the modules themselves in dictionaries for flexibility.
However, I decided to sweep all of this off the table in favor of
jsonpickle. This package covers a lot more edge cases with a far more
extensive implementation. A quick demonstration:
In [5]:
import jsonpickle
vec_repr = jsonpickle.encode(vec)
vec_repr
Out[5]:
And with a quick decode we're back to our old Python storage format!
That's it for now; if I encounter any more challenges, there will be another follow-up. As before, I've written this up in a Jupyter notebook.