03 More Canonicalisation Logics for JSON Schemas


As the comments in the _canonicalise.py file note, the canonical format is not intended for human consumption; it prioritizes reasoning and performance instead. What does this mean, and what does it imply when we design a new normalization implementation?

Before we start designing a new implementation with different rulesets and configurations, let’s first look at the current implementation:

The current implementation: the one-way normalization

The current implementation doesn't follow a systematic ordering by JSON Schema keyword. Instead, it is ordered for optimization, as the sketch after the list below illustrates:

  • Processes "fast path" cases first (like booleans, const, enum)

  • Handles transformations with dependencies between them

  • Applies type-specific optimizations scattered throughout the code

  • Processes logical combinations later in the flow

  • Removes redundancies as a final step
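
To make this concrete, here is a minimal sketch of that optimization-first flow. The helper names are illustrative placeholders of mine, not the library's actual API:

    # A minimal sketch of the optimization-first flow; the helpers are
    # illustrative placeholders, not the real hypothesis-jsonschema API.
    def resolve_type_constraints(schema: dict) -> dict:
        ...  # e.g. normalize "type" before dependent keywords like "not"
        return schema

    def apply_logical_combinations(schema: dict) -> dict:
        ...  # e.g. merge "allOf" subschemas into the parent schema
        return schema

    def remove_redundancies(schema: dict) -> dict:
        ...  # e.g. drop keywords that can no longer affect validation
        return schema

    def canonicalise(schema) -> dict:
        # Fast paths first: trivial schemas that need no further processing.
        if schema is True or schema == {}:
            return {}           # accepts every instance
        if schema is False:
            return {"not": {}}  # accepts no instance
        if "const" in schema or "enum" in schema:
            return dict(schema)  # already maximally specific

        # Then dependency-ordered transformations, type-specific
        # optimizations, logical combinations, and redundancy removal.
        schema = resolve_type_constraints(dict(schema))
        schema = apply_logical_combinations(schema)
        schema = remove_redundancies(schema)
        return schema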

While performance optimization is essential, we must also consider human factors. A monolithic implementation can increase maintenance costs and discourage beginners from contributing, so we need to balance performance with maintainability. We also want to enable selective rulesets under different configurations, which calls for a more structured and systematic implementation.

A problematic implementation: check by JSON Schema keyword

One possible implementation that would make the processing logic more systematic and human-readable is to check keyword by keyword, but this approach raises some issues:

  1. Dependencies Between Transformations

    Some transformations depend on the results of others. For example, processing the not keyword requires type constraints to be processed first. In the example below, if we handle not before type, the normalizer might simply remove not without removing "string" from the type array, even though the schema should canonicalise to { "type": "number" } (see the sketch after this list).

      {
          "type": ["string", "number"],
          "not": {
              "type": "string"
          }
      }

  2. Performance Optimization Considerations

    The code is ordered for efficiency, prioritizing transformations that simplify schemas quickly. If we processed keyword by keyword, we might miss some of these optimization opportunities. The code comments emphasize how much performance matters:

    "That's the difference between 'I'd like it to be faster' and 'doesn't finish at all'"

A better implementation: check by category layers

If we still want to process by keyword, a better implementation would check by keyword category layers rather than by individual JSON Schema keywords. We can follow this order (a dispatch sketch follows the list):

  • Metadata keywords (title, description, etc.)

  • Validation keywords (type, enum, etc.)

  • Type-organized keywords (numeric, string, array, object)

  • Logical keywords (allOf, anyOf, etc.)
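
Here is a hedged sketch of what layer-based dispatch could look like; the exact groupings and the handler interface are my assumptions, not settled design:

    # Process whole keyword categories in a fixed order; the groupings
    # below are assumptions, not a settled taxonomy.
    CATEGORY_LAYERS = [
        ("metadata",   {"title", "description", "examples", "$comment"}),
        ("validation", {"type", "enum", "const"}),
        ("typed",      {"minimum", "maximum", "minLength", "maxLength",
                        "items", "properties", "required"}),
        ("logical",    {"allOf", "anyOf", "oneOf", "not"}),
    ]

    def normalize_by_layers(schema: dict, handlers: dict) -> dict:
        # Every keyword in one layer is handled before any keyword in the
        # next, which keeps the processing order predictable.
        for name, keywords in CATEGORY_LAYERS:
            present = keywords & schema.keys()
            if present and name in handlers:
                schema = handlers[name](schema, present)
        return schema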

An even better implementation: Multi-stage processing

  • Stage 1: Preprocessing and conflict detection

    Some schemas contain logical conflicts: invalid constraints that can never be satisfied, such as incompatible type combinations, contradictory numeric ranges, or impossible property requirements. These conflicts should always be detected early and removed first, except perhaps for learning purposes (though I doubt anyone would use this library to learn JSON Schema). If we adopt the configurable rulesets Julian mentioned, this stage should be set to True by default in almost every ruleset. A pipeline sketch follows this list.

  • Stage 2: Type-specific normalization

    From this stage onward, users can customize their rulesets or choose to keep certain elements. (Details TBD)

  • Stage 3: Logical combination normalization

    Users can configure this stage for different use cases. For example, they can choose one of the existing rulesets like basic, modest, strict, … or keep certain elements as required; other candidate rulesets include high efficiency, validation, high readability… (TBD)

  • Stage 4: Redundancy removal and final cleanup

    This is the most flexible stage. Users can choose to remove certain redundancies or keep those elements. For example, they can remove descriptive elements (title, description) for API documentation generation, or keep examples for testing.
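
Here is a pipeline sketch under loose assumptions: the stage functions, the Ruleset fields, and the single conflict check are all illustrative, not a settled API:

    from dataclasses import dataclass, field

    UNSATISFIABLE = {"not": {}}  # canonical "matches nothing" placeholder

    def detect_conflicts(schema: dict) -> dict:
        # Stage 1: catch constraints that can never be satisfied,
        # e.g. a contradictory numeric range.
        if schema.get("minimum", float("-inf")) > schema.get("maximum", float("inf")):
            return UNSATISFIABLE
        return schema

    def normalize_types(schema: dict) -> dict:
        ...  # Stage 2: type-specific normalization (details TBD)
        return schema

    def normalize_logic(schema: dict) -> dict:
        ...  # Stage 3: allOf/anyOf/oneOf/not handling (ruleset-dependent)
        return schema

    def remove_redundancy(schema: dict, keep: set) -> dict:
        # Stage 4: drop descriptive keywords the user has not asked to keep.
        droppable = {"title", "description", "examples"} - keep
        return {k: v for k, v in schema.items() if k not in droppable}

    @dataclass
    class Ruleset:
        detect_conflicts: bool = True  # on by default in almost every ruleset
        normalize_types: bool = True
        normalize_logic: bool = True
        keep: set = field(default_factory=set)  # what Stage 4 preserves

    def canonicalise(schema: dict, rules: Ruleset | None = None) -> dict:
        rules = rules or Ruleset()
        if rules.detect_conflicts:
            schema = detect_conflicts(schema)
        if rules.normalize_types:
            schema = normalize_types(schema)
        if rules.normalize_logic:
            schema = normalize_logic(schema)
        return remove_redundancy(schema, rules.keep)

    print(canonicalise({"minimum": 5, "maximum": 3}))  # -> {'not': {}}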

Advantages of this implementation:

  • Clear separation of concerns: Each stage focuses on specific types of transformations

  • Predictable processing order: Reduces unexpected interactions between rules

  • Better configurability: Specific stages can be enabled or disabled as needed

  • Maintainability and extensibility: New rules can be added to their appropriate stages

More to come: balancing flexibility (users can customize the ruleset) and simplicity (there should be some commonly used, predefined rulesets for current users). One possible shape is sketched below.
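
One idea, with every name here assumed rather than decided: a small registry of predefined rulesets for common users, plus keyword-argument overrides for those who want to customize:

    # Named presets cover common cases; overrides give full flexibility.
    # Preset names echo this post; everything else is an assumption.
    PRESET_RULESETS = {
        "basic":  {"detect_conflicts": True, "normalize_logic": False,
                   "keep": {"title", "description", "examples"}},
        "modest": {"detect_conflicts": True, "normalize_logic": True,
                   "keep": {"description"}},
        "strict": {"detect_conflicts": True, "normalize_logic": True,
                   "keep": set()},
    }

    def get_ruleset(name: str = "modest", **overrides) -> dict:
        # Most users just pick a preset; power users tweak single options.
        return {**PRESET_RULESETS[name], **overrides}

    # e.g. strict normalization, but keep "examples" for testing:
    rules = get_ruleset("strict", keep={"examples"})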
