Return data to JSON or leave in BSON

bsonsearch has the capability to project data back into Python. Depending on your application, you may want to efficiently convert to a more standardized format such as JSON.

Much has been made of assertion #3 from bsonspec.org:

Efficient

Encoding data to BSON and decoding from BSON can be performed very quickly in most languages due to the use of C data types.

At least in this case, there is some reason to let this library keep the data in BSON.

Set up the compare / projection engine

This specific dataset exercises the named strengths of BSON over JSON (unique-ish IDs, datetimes, and regexes).


In [1]:
import bson
import re
from bsonsearch import bsoncompare
from datetime import datetime
bc = bsoncompare()
# one field exercising the three BSON-specific types: ObjectId, datetime, regex
source_data = {"a": [bson.objectid.ObjectId(), datetime.now(), re.compile(r".*test string.*", re.IGNORECASE)]}
# matcher that simply echoes field "a" back out of the document
echo_projection = bc.generate_matcher({"$project": {"a": True}})
# serialize the dict into a C-side BSON document and keep the handle
source_data_doc_id = bc.generate_doc(source_data)

Standard BSON->JSON->BSON encoding using the default json_util

The only valid reason to transform in this manner is when you need to pass data around in python/dict form to manipulate it, but a library in the tool chain does not support null characters in strings, such as anything that casts to a cstring along the way. Otherwise, there's no reason to take the serialization out of BSON if you intend to put it back into BSON in the same tool chain.
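
For reference, this is roughly the round trip the heading describes, done by hand with pymongo's bson.json_util (a minimal sketch; the internals of project_json_as_dict may differ):

from bson import json_util

# sketch only: dict with BSON types -> extended JSON text -> dict again
as_json = json_util.dumps(source_data)   # ObjectId/datetime/regex -> $oid/$date/$regex
back_again = json_util.loads(as_json)    # extended JSON -> rich python types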


In [2]:
print source_data
print bc.project_json_as_dict(echo_projection, source_data_doc_id)


{'a': [ObjectId('57269865e1382332ba4346e3'), datetime.datetime(2016, 5, 1, 18, 59, 33, 102667), <_sre.SRE_Pattern object at 0x16c1030>]}
{u'a': [ObjectId('57269865e1382332ba4346e3'), datetime.datetime(2016, 5, 1, 18, 59, 33, 102000, tzinfo=<bson.tz_util.FixedOffset object at 0x169fa90>), Regex(u'.*test string.*', 2)]}

The bson.json_util library has correctly translated the data back to a fair representation of the source.


In [3]:
%%timeit
bc.project_json_as_dict(echo_projection, source_data_doc_id)


10000 loops, best of 3: 48.6 µs per loop

Standard BSON->BSON pointer passing

If everything lives within the same process space and you can pass pointers around the C library, this is going to be the easiest method.
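
Because the doc id returned by generate_doc is just a handle to the C-side document, you can hold onto it and project it as many times as you like without re-serializing. A small sketch (the second matcher here is purely illustrative):

# reuse the same C-side document handle across multiple projections
# another_projection is a hypothetical extra matcher for illustration
another_projection = bc.generate_matcher({"$project": {"a": True}})
first_pass = bc.project_bson_as_dict(echo_projection, source_data_doc_id)
second_pass = bc.project_bson_as_dict(another_projection, source_data_doc_id)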


In [4]:
print source_data
print bc.project_bson_as_dict(echo_projection, source_data_doc_id)


{'a': [ObjectId('57269865e1382332ba4346e3'), datetime.datetime(2016, 5, 1, 18, 59, 33, 102667), <_sre.SRE_Pattern object at 0x16c1030>]}
{u'a': [ObjectId('57269865e1382332ba4346e3'), datetime.datetime(2016, 5, 1, 18, 59, 33, 102000), Regex(u'.*test string.*', 2)]}

Passing the BSON pointer straight through also yields a fair representation of the source. Note the datetime comes back without the tzinfo that json_util attached in the previous example, since json_util never touches this path.


In [5]:
%%timeit
bc.project_bson_as_dict(echo_projection, source_data_doc_id)


100000 loops, best of 3: 9.44 µs per loop

Standard BSON->JSON->DICT

This may be a good idea if your application has no intention of modifying the response data but does need it deserialized in order to embed the response into another JSON wrapper within web services.

ujson (https://github.com/esnme/ultrajson) has statistics on its GitHub page showing its ability to (de)serialize strings that match the strict formatting requirements. There's basically no error checking. Fortunately, you should be able to assume libbson correctly error-checks and only emits valid JSON.
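
As a sketch of the web-service case described above (the envelope keys here are illustrative, not any particular API):

import ujson

# embed the projection untouched inside a larger JSON response wrapper
payload = ujson.loads(bc.project_json(echo_projection, source_data_doc_id))
response_body = ujson.dumps({"status": "ok", "result": payload})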


In [6]:
import ujson

In [7]:
print source_data
print ujson.loads(bc.project_json(echo_projection, source_data_doc_id))


{'a': [ObjectId('57269865e1382332ba4346e3'), datetime.datetime(2016, 5, 1, 18, 59, 33, 102667), <_sre.SRE_Pattern object at 0x16c1030>]}
{u'a': [{u'$oid': u'57269865e1382332ba4346e3'}, {u'$date': 1462129173102L}, {u'$options': u'i', u'$regex': u'.*test string.*'}]}

In [8]:
%%timeit
ujson.loads(bc.project_json(echo_projection, source_data_doc_id))


100000 loops, best of 3: 8.11 µs per loop

It is slightly faster than BSON->BSON, but the output may be unusable if you need to cast these values back to their original types.
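
If you do need the rich types back after the fact, pymongo's json_util understands the $oid/$date/$regex markers that project_json emits, which effectively lands you back on the In [2] path and its cost:

from bson import json_util

# parses extended JSON back into ObjectId, tz-aware datetime, and Regex
restored = json_util.loads(bc.project_json(echo_projection, source_data_doc_id))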

Conclusion

As with everything, the road you choose depends highly on what you intend to do with this data.

If you stick to standard types (utf8/number/document/array), go ahead and immediately cast to JSON using the project_json call from this library, and use the most efficient JSON decoder available (perhaps ujson) to pull it back into a python dict if you need one.

If you use BSON-specific types AND want to use those values within the pipeline, you should use the project_bson_as_dict call from this library, check/modify/play with the data as you see fit, then use json_util to push to JSON when complete. I don't see much use for the default json_util conversion at any time other than immediately before you serialize to a system where it won't be cast back into a python dict (like encoding to JSON immediately prior to passing to a JavaScript client).

If error checking and data trustworthiness are more important to your application than speed, you should let project_json_as_dict handle the conversion. The upstream vendor of that function does perform input checking.
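
Whichever path you choose, the matcher and document handles live in C memory, so free them when you're done. A minimal cleanup sketch, assuming the wrapper's destroy_doc/destroy_matcher calls (verify against your installed bsonsearch version):

# assumed bsonsearch cleanup calls; check your version's API
bc.destroy_doc(source_data_doc_id)
bc.destroy_matcher(echo_projection)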

