bsonsearch can project data back into Python. Depending on your application, you may want to efficiently convert it to a more standardized format such as JSON.
Much has been made of assertion #3 from bsonspec.org:
Efficient
Encoding data to BSON and decoding from BSON can be performed very quickly in most languages due to the use of C data types.
In certain cases, at least, there is reason to let this library keep the data in BSON.
In [1]:
import bson
import re
from bsonsearch import bsoncompare
from datetime import datetime
bc = bsoncompare()
# a source document containing three BSON-specific types: ObjectId, datetime, and a compiled regex
source_data = {"a":[bson.objectid.ObjectId(), datetime.now(), re.compile(r".*test string.*", re.IGNORECASE)]}
# a projection that simply echoes field "a" back out
echo_projection = bc.generate_matcher({"$project":{"a":True}})
source_data_doc_id = bc.generate_doc(source_data)
The only valid reason to transform in this manner is when you intend to manipulate the data, require the Python dict format, and some library in the tool chain does not support null characters in strings (for example, anything that casts to a cstring). Otherwise, there's no reason to take the serialization out of BSON if you intend to put it back into BSON in the same tool chain.
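As an aside, here is a minimal sketch of the null-character problem; ctypes stands in for any tool in the chain that casts to a cstring:

import ctypes

# An embedded null byte is legal in a Python string (and in a BSON string value).
payload = b"abc\x00def"

# Anything that treats the buffer as a null-terminated cstring silently
# truncates at the first null byte.
buf = ctypes.create_string_buffer(payload)
print buf.value  # 'abc' -- the '\x00def' tail is lost
print buf.raw    # the full buffer (plus a trailing null terminator)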
In [2]:
print source_data
print bc.project_json_as_dict(echo_projection, source_data_doc_id)
The bson.json_util library has correctly translated the data back to a fair representation of the source.
In [3]:
%%timeit
bc.project_json_as_dict(echo_projection, source_data_doc_id)
In [4]:
print source_data
print bc.project_bson_as_dict(echo_projection, source_data_doc_id)
Again, the output is a fair representation of the source, this time without a JSON intermediate.
In [5]:
%%timeit
bc.project_bson_as_dict(echo_projection, source_data_doc_id)
This may be a good approach if your application has no intention of modifying the response data but does need it deserialized in order to embed the response into another JSON wrapper within a web service.
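One way that might look (the envelope fields here are invented for illustration):

from bson import json_util

# Nest the projected dict inside a hypothetical service envelope, then
# serialize the whole response once; json_util handles the BSON types.
envelope = {"status": "ok",
            "result": bc.project_bson_as_dict(echo_projection, source_data_doc_id)}
print json_util.dumps(envelope)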
ujson (https://github.com/esnme/ultrajson) has statistics on its GitHub page showing its ability to (de)serialize strings that match strict formatting requirements. It does basically no error checking. Fortunately, you should be able to assume libbson correctly checks for errors and only returns valid JSON.
In [6]:
import ujson
In [7]:
print source_data
print ujson.loads(bc.project_json(echo_projection, source_data_doc_id))
In [8]:
%%timeit
ujson.loads(bc.project_json(echo_projection, source_data_doc_id))
It is slightly faster than the BSON-to-dict path, but the output may be unusable if you need to cast these values back to their original types.
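For example, assuming libbson's extended JSON spellings ($oid, and $date in milliseconds), recovering the original types means rebuilding them by hand:

doc = ujson.loads(bc.project_json(echo_projection, source_data_doc_id))

# BSON-specific values arrive as plain extended-JSON dicts, e.g.
# {"$oid": "..."} and {"$date": <ms>}, so the original types must be
# reconstructed manually.
oid = bson.objectid.ObjectId(doc["a"][0]["$oid"])
when = datetime.utcfromtimestamp(doc["a"][1]["$date"] / 1000.0)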
As with everything, the road you choose depends highly on what you intend to do with this data.
If you stick to standard types (utf8/number/document/array), go ahead and immediately cast to JSON using the project_json call from this library, then use the most efficient JSON decoder (perhaps ujson) to pull it back into a Python dict if you need to.
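A minimal sketch of that flow, reusing the matcher from above (the sample document is invented and contains only standard types):

# Standard types only: straight to JSON, then decode with ujson.
plain_doc_id = bc.generate_doc({"a": [1, 2.5, "text", {"b": True}]})
print ujson.loads(bc.project_json(echo_projection, plain_doc_id))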
If you use BSON-specific types AND want to use those values within the pipeline, you should use the project_bson_as_dict call from this library, check/modify/play with the data as you see fit, then use json_util to push to JSON when complete. I don't see much use for the default json_util conversion at any time other than immediately before you serialize to a system where it won't be cast back into a Python dict (like encoding to JSON immediately before passing to a JavaScript client).
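A sketch of that flow (the added field here is invented for illustration):

from bson import json_util

# Keep real ObjectId/datetime objects while working with the data,
# then encode to JSON once, at the edge of the pipeline.
doc = bc.project_bson_as_dict(echo_projection, source_data_doc_id)
doc["checked_at"] = datetime.now()  # manipulate using native types
print json_util.dumps(doc)          # serialize only when leaving Python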
If error checking and data trustworthiness are more important to your application than speed, you should let project_json_as_dict handle the serialization. The upstream vendor of that function does perform input checking.