JSON Schema Parser

Notes on the JSON schema / Traitlets package

Goal: write a function that, given a JSON Schema, will generate code for traitlets objects which provide equivalent validation.

By-Hand Example

First confirm that we're doing things correctly with the jsonschema package:


In [1]:
import json
import jsonschema

simple_schema = {
    "type": "object",
    "properties": {
        "foo": {"type": "string"},
        "bar": {"type": "number"}
    }
}

In [2]:
good_instance = {
    "foo": "hello world",
    "bar": 3.141592653,
}

In [3]:
bad_instance = {
    "foo" : 42,
    "bar" : "string"
}

In [4]:
# Should succeed
jsonschema.validate(good_instance, simple_schema)

In [5]:
# Should fail
try:
    jsonschema.validate(bad_instance, simple_schema)
except jsonschema.ValidationError as err:
    print(err)


42 is not of type 'string'

Failed validating 'type' in schema['properties']['foo']:
    {'type': 'string'}

On instance['foo']:
    42

OK, now let's write a traitlets class that does the same thing:


In [6]:
import traitlets as T

class SimpleInstance(T.HasTraits):
    foo = T.Unicode()
    bar = T.Float()

In [7]:
# Should succeed

SimpleInstance(**good_instance)


Out[7]:
<__main__.SimpleInstance at 0x10b06c828>

In [8]:
# Should fail

try:
    SimpleInstance(**bad_instance)
except T.TraitError as err:
    print(err)


The 'foo' trait of a SimpleInstance instance must be a unicode string, but a value of 42 <class 'int'> was specified.

Roadmap

  1. Start by recognizing all simple JSON types in the schema ("string", "number", "integer", "boolean", "null")

  2. Next recognize objects containing simple types

  3. Next recognize compound simple types (i.e. where type is a list of simple types)

  4. Next recognize arrays & enums

  5. Next recognize "$ref" definitions

  6. Next recognize "anyOf", "oneOf", "allOf" definitions... first is essentially a traitlets Union, second is a Union where only one must match, and "allOf" is essentially a composite object (not sure if traitlets has that...) Note that among these, Vega-Lite only uses "anyOf"

  7. Catalog all validation keywords... Implement custom traitlets that support all the various validation keywords for each type. (Validation keywords listed here)

  8. Use hypothesis for testing?

Challenges & Questions to think about

  • JSONSchema ignores any keys/properties which are not listed in the schema. Traitlets warns, and in the future will raise an error for undefined keys/properties

    • this may be OK... we can just document the fact that traitlets is more prescriptive than JSONschema
  • JSON allows undefined values as well as explicit nulls, which map to None. Traitlets treats None as undefined. How to resolve this?

    • Best option is probably to use an undefined sentinel within the traitlets structure, such that the code knows when to ignore keys & produces dicts which translate directly to the correct JSON
  • Will probably need to define some custom trait types, e.g. Null, and also extend simple trait types to allow for the more extensive validations allowed in JSON Schema.

    • Generate subclasses with the code
  • What version of the schema should we target? Perhaps try to target multiple versions?

    • start with 04 because this is what's supported by jsonschema and used by Vega(-Lite)

Ideas

Easiest way: validate everything with a single HasTraits class via jsonschema, splitting out properties into traitlets

Interface

  • root schema and all definitions should become their own T.HasTraits class
  • Objects defined inline should also have their own class with a generated anonymous name
  • Use Jinja templating; allow output to one file or multiple files with relative imports
  • root object must have type="object"... this differs from jsonschema

Testing

  • test cases should be an increasingly complicated set of jsonschema objects, with test cases that should pass and fail. Perhaps store these in a JSON structure? (With a schema?)

An initial prototype

Let's try generating some traitlets classes for simple cases


In [9]:
import jinja2


OBJECT_TEMPLATE = """
{%- for import in cls.imports %}
{{ import }}
{%- endfor %}

class {{ cls.classname }}({{ cls.baseclass }}):
    {%- for (name, prop) in cls.properties.items() %}
    {{ name }} = {{ prop.trait_code }}
    {%- endfor %}
"""

class JSONSchema(object):
    """A class to wrap JSON Schema objects and reason about their contents"""
    object_template = OBJECT_TEMPLATE
    
    def __init__(self, schema, root=None):
        self.schema = schema
        self.root = root or schema
        
    @property
    def type(self):
        # TODO: should the default type be considered object?
        return self.schema.get('type', 'object')
    
    @property
    def trait_code(self):
        type_dict = {'string': 'T.Unicode()',
                     'number': 'T.Float()',
                     'integer': 'T.Integer()',
                     'boolean': 'T.Bool()'}
        if self.type not in type_dict:
            raise NotImplementedError()
        return type_dict[self.type]
    
    @property
    def classname(self):
        # TODO: deal with non-root schemas somehow...
        if self.schema is self.root:
            return "RootInstance"
        else:
            raise NotImplementedError("Non-root object schema")
            
    @property
    def baseclass(self):
        return "T.HasTraits"
    
    @property
    def imports(self):
        return ["import traitlets as T"]
    
    @property
    def properties(self):
        return {key: JSONSchema(val) for key, val in self.schema.get('properties', {}).items()}
    
    def object_code(self):
        return jinja2.Template(self.object_template).render(cls=self)

Trying it out...


In [10]:
code = JSONSchema(simple_schema).object_code()
print(code)


import traitlets as T

class RootInstance(T.HasTraits):
    foo = T.Unicode()
    bar = T.Float()

Testing the result


In [11]:
exec(code)  # defines RootInstance

In [12]:
# Good instance should validate correctly
RootInstance(**good_instance)


Out[12]:
<__main__.RootInstance at 0x10b7fefd0>

In [13]:
# Bad instance should raise a TraitError
try:
    RootInstance(**bad_instance)
except T.TraitError as err:
    print(err)


The 'foo' trait of a RootInstance instance must be a unicode string, but a value of 42 <class 'int'> was specified.

Seems to work 😀

We'll start with something like this in the package, and then build from there.