By Chris Emmery, 11-04-2016, 7 minute read
First off, I would like to thank Sebastian Raschka and Chris Wagner for providing the text and code that proved essential for writing this blog post.
For some time now, I have been wanting to replace simply pickling my
sklearn
pipelines. Pickle is
incredibly convenient, but it can be easy to corrupt, is not very transparent, and
has compatibility issues. The latter has
been quite a thorn in my side for several projects, and I stumbled upon it again
while working on my own small text mining
framework. Persistence is imperative when
deploying a pipeline to a practical application such as a demo: each piece of new
data needs to be converted to a vector of exactly the same size as the data offered
during development. Therefore, feature extraction, hashing, normalization, etc.
all have to be exactly the same, feeding the data to the same model that was trained.
After reading Sebastian Raschka's notebook on model persistence for scikit-learn,
I figured I might give it a go myself.
Please note that all code is in Python 3.x, sklearn 0.17, and numpy 1.9.
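For reference, the pickle-based persistence I want to move away from takes only a few lines (a minimal sketch; the model and file name are just examples):

import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression().fit(iris.data, iris.target)

# convenient one-liners, but the result is an opaque binary blob tied to the
# pickle protocol and the library versions used at training time
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('model.pkl', 'rb') as f:
    clf2 = pickle.load(f)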
I also tried to use JSON as the storage format. In addition, however,
I aimed to store other parts of a pipeline as well. The biggest
hurdles are definitely due to numpy.
These special Python objects cannot be serialized in JSON, which is limited to
bool, int, float, and str for data types, and list and dict
for structures. Following Sebastian's notes, I first tried to reproduce his approach
for storing classifiers. For trained models, we can access the parameters via
get_params, and the fit information via the class attributes (e.g. classes_ and
intercept_ for LogisticRegression). Alternatively, we can just store all
class information as follows:
In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

lr = LogisticRegression(multi_class='multinomial', solver='newton-cg')
lr.fit(X, y)

# all instance attributes: both model parameters and fit information
attr = lr.__dict__
attr
Out[1]:
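(For comparison, and not part of the original notebook: get_params returns only the constructor parameters, without any of the fit information.)

# model (constructor) parameters only; no coef_, intercept_ or classes_
print(lr.get_params())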
Great, so the underscore-suffixed keys are fit parameters, whereas the rest are model
parameters. The first issue arises here: some of our values are
numpy arrays, which are incompatible with JSON. These are pretty straightforward
to serialize; we can simply convert them to lists:
In [2]:
import json
import numpy as np

# convert the numpy arrays in the fit attributes to plain lists
for k, v in attr.items():
    if isinstance(v, np.ndarray) and k[-1:] == '_':
        attr[k] = v.tolist()

json.dump(attr, open('./attributes.json', 'w'))
!head ./attributes.json
And sure enough, if we port these back into a new instance of the
LogisticRegression class, we are good to go:
In [3]:
lr2 = LogisticRegression()

# restore the stored attributes, converting lists back into numpy arrays
for k, v in attr.items():
    if isinstance(v, list):
        setattr(lr2, k, np.array(v))
    else:
        setattr(lr2, k, v)

lr2.predict(X)  # just for testing :)
Out[3]:
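As a quick sanity check (my own addition, not in the original notebook), the restored model should reproduce the original predictions exactly:

import numpy as np

# both models should agree on every training sample
assert np.array_equal(lr.predict(X), lr2.predict(X))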
Sadly, life isn't always this easy.
In a broader scenario, one might use other sklearn classes to create a fancy
data-to-prediction pipeline. Say that we want to accept some text input, and
generate $n$-gram features. I wrote about using the DictVectorizer for
efficient gram extraction in my previous post,
so I'll use it here:
In [4]:
from collections import Counter

def extract_grams(sentence, n_list):
    # count the grams (tuples of tokens) occurring in a sentence
    tokens = sentence.split()
    return Counter([gram for gram in zip(*[tokens[i:]
                    for n in n_list for i in range(n)])])
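Note, as a quick illustration (not a cell from the original notebook), that the resulting features are keyed by tuples of tokens; this detail will come back to bite us shortly:

print(extract_grams("this is an example", [2]))
# e.g. Counter({('this', 'is'): 1, ('is', 'an'): 1, ('an', 'example'): 1})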
Assume we have some form that accepts user input, represented by text_input,
and our training data corpus. First we extract features and fit the vectorizer:
In [5]:
from sklearn.feature_extraction import DictVectorizer
corpus = ["this is an example", "hey more examples", "can we get more examples"]
text_input = "hey can I get more examples"
vec = DictVectorizer().fit([extract_grams(s, [2]) for s in corpus])
print(vec.transform(extract_grams(text_input, [2])))
Sweet, the vectorizer works. Now it can be serialized as before, right?
In [6]:
vec_attr = vec.__dict__

for k, v in vec_attr.items():
    if isinstance(v, np.ndarray) and k[-1:] == '_':
        vec_attr[k] = v.tolist()

json.dump(vec_attr, open('./vec_attributes.json', 'w'))
Nope. The tuples used to fit the vectorizer are not among the data types accepted
by JSON. Ok, no problem, we just alter the extract_grams function to
concatenate the tokens into a string and run it again:
In [7]:
def extract_grams(sentence, n_list):
    # same as before, but join each gram tuple into a single string
    tokens = sentence.split()
    return Counter(['_'.join(list(gram)) for gram in zip(*[tokens[i:]
                    for n in n_list for i in range(n)])])

vec = DictVectorizer().fit([extract_grams(s, [2]) for s in corpus])
vec_attr = vec.__dict__

for k, v in vec_attr.items():
    if isinstance(v, np.ndarray) and k[-1:] == '_':
        vec_attr[k] = v.tolist()

json.dump(vec_attr, open('./vec_attributes.json', 'w'))
Uh oh.
Life is not simple, and neither is scikit-learn. Actually, across the range of
pipeline pieces I have tested, there are many different sources of JSON
serialization errors. These can be variables that store types, or any other
numpy data format (np.int32 and np.float64 are both used in LinearSVC,
for example). While some objects have a (limited) Python object representation,
one of the harder cases was the error thrown by the DictVectorizer. To
convert a numpy type object, the following is required:
In [8]:
target = np.float64
serialisation = target.__name__                  # 'float64'
deserialisation = np.dtype(serialisation).type   # back to the numpy.float64 type
print(target, serialisation, deserialisation)
So, we actually need a couple of functions that can serialize an entire
dictionary of Python and numpy objects, and then deserialize it when we need
it again. I was helped considerably by Chris Wagner's blog, which already
provides quite a big code snippet that does exactly this. I inserted the
following lines myself:
def serialize(data):
    ...
    if isinstance(data, type):
        return {"py/numpy.type": data.__name__}
    if isinstance(data, np.integer):
        return {"py/numpy.int": int(data)}
    if isinstance(data, np.float):
        return {"py/numpy.float": data.hex()}
    ...

def deserialize(dct):
    ...
    if "py/numpy.type" in dct:
        return np.dtype(dct["py/numpy.type"]).type
    if "py/numpy.int" in dct:
        return np.int32(dct["py/numpy.int"])
    if "py/numpy.float" in dct:
        return np.float64.fromhex(dct["py/numpy.float"])
    ...
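To sketch how such hooks can be wired into the json module, here is a self-contained toy version under my own names encode and decode (not Chris Wagner's actual data_to_json/json_to_data wrappers, which cover many more cases):

import json
import numpy as np

def encode(obj):
    # called by json.dumps for anything it cannot handle natively
    if isinstance(obj, type):
        return {"py/numpy.type": obj.__name__}
    if isinstance(obj, np.integer):
        return {"py/numpy.int": int(obj)}
    if isinstance(obj, np.floating):
        return {"py/numpy.float": float(obj).hex()}
    raise TypeError("cannot serialize %r" % (obj,))

def decode(dct):
    # called by json.loads for every decoded dict; undoes the tagging above
    if "py/numpy.type" in dct:
        return np.dtype(dct["py/numpy.type"]).type
    if "py/numpy.int" in dct:
        return np.int32(dct["py/numpy.int"])
    if "py/numpy.float" in dct:
        return np.float64(float.fromhex(dct["py/numpy.float"]))
    return dct

blob = json.dumps({"dtype": np.float64, "n_features": np.int32(9)}, default=encode)
print(json.loads(blob, object_hook=decode))
# roughly: {'dtype': <class 'numpy.float64'>, 'n_features': 9}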
Now the whole thing can be stored in JSON. This even retains the floating point
precision, by hexing the floats for serialization. So, using these scripts, we can
run the full pipeline by importing the adapted script as serialize_sk. First
we fit our amazing corpus again, and train the model:
In [9]:
import json
import numpy as np
import serialize_sk as sr

from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

corpus = ["this is an example", "hey more examples", "can we get more examples"]

def extract_grams(sentence, n_list):
    tokens = sentence.split()
    return Counter(['_'.join(list(gram)) for gram in zip(*[tokens[i:]
                    for n in n_list for i in range(n)])])

vec = DictVectorizer()
D = vec.fit_transform([extract_grams(s, [2]) for s in corpus])

svm = LinearSVC()
svm.fit(D, [1, 0, 1])
Out[9]:
Serialize the vectorizer and model:
In [10]:
atb_vec = vec.__dict__
atb_clf = svm.__dict__

def serialize(d, name):
    # convert each attribute to a JSON-safe representation and dump it to disk
    for k, v in d.items():
        d[k] = sr.data_to_json(v)
    json.dump(d, open(name + '.json', 'w'))

serialize(atb_clf, 'clf')
serialize(atb_vec, 'vec')
Now we assume that this is a new application. First, we load the .jsons and deserialize:
In [11]:
new_vec = json.load(open('vec.json'))
new_clf = json.load(open('clf.json'))

def deserialize(class_init, attr):
    # set the stored attributes on a fresh (unfitted) instance
    for k, v in attr.items():
        setattr(class_init, k, sr.json_to_data(v))
    return class_init

vec2 = deserialize(DictVectorizer(), new_vec)
svm2 = deserialize(LinearSVC(), new_clf)
And finally we accept user input, and give back a classification label:
In [12]:
user_input = "hey can I get more examples"
grams = vec2.transform(extract_grams(user_input, [2]))
print(grams, "\n")
print(svm2.predict(grams))
And it works!
Chances are that when using different classes in sklearn, other
issues might present themselves. However, for now I've got my most-used pieces
covered; handling new cases will probably mostly entail refining serialize_sk. Of course, even
when using JSON there is no protection from the fact that parameters might
change between different versions of scikit-learn. At least the JSONs stored
with old versions are now transparent
enough to be easily modifiable. Any suggestions and/or improvements are
obviously more than welcome. I hereby also provide my version of Chris Wagner's
script, as well as a Jupyter notebook.