03 More Canonicalisation Logics for JSON Schemas

As the comments in the _canonicalise.py file note, the canonical format is not intended for human consumption; instead, it prioritizes ease of reasoning and performance. What does this mean? And what does it imply when we consider a new normalization implementation?
Before we start designing a new implementation with different rulesets and configurations, let’s first look at the current implementation:
The current implementation: the one-way normalization
The current implementation doesn't follow a systematic ordering by JSON Schema keywords. Instead, its ordering favors optimization (a minimal sketch follows the list):
Processes "fast path" cases first (like booleans, const, enum)
Handles transformations with dependencies between them
Applies type-specific optimizations scattered throughout the code
Processes logical combinations later in the flow
Removes redundancies as a final step
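To make that ordering concrete, here is a purely illustrative sketch of such a one-way flow. The `normalize` function and every detail in it are hypothetical and far simpler than the real code in _canonicalise.py:

```python
# Purely illustrative: a hypothetical `normalize` showing the one-way ordering
# described above, far simpler than the real code in _canonicalise.py.
from typing import Any, Dict, Union

Schema = Union[bool, Dict[str, Any]]

NOTHING: Dict[str, Any] = {"not": {}}  # conventional "matches nothing" schema


def normalize(schema: Schema) -> Schema:
    # 1. Fast paths: trivial schemas are resolved immediately.
    if schema is True:
        return {}
    if schema is False:
        return NOTHING
    if schema.get("enum") == []:
        return NOTHING  # an empty enum can never match

    out = dict(schema)
    # 2. Transformations with dependencies (e.g. resolve `type` before `not`).
    # 3. Type-specific optimizations (numeric bounds, string lengths, ...).
    # 4. Logical combinations (allOf / anyOf / oneOf) later in the flow.
    # 5. Redundancy removal as the final step.
    return out
```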
While performance optimization is essential, we must also consider human factors. A heavily optimized, monolithic codebase can increase maintenance costs and discourage beginners from contributing, so we need to balance performance and maintainability. We also need to enable selective rulesets based on different configurations, which calls for a more structured and systematic implementation.
A problematic implementation: check by JSON Schema keyword
One possible implementation that would make the processing logic more systematic and human-readable is to check the schema keyword by keyword, but this approach raises some issues:
Dependencies Between Transformations
Some transformations depend on the results of others. For example, processing the `not` keyword requires type constraints to be processed first. In the example below, if we handle `not` before `type`, the normalizer might simply remove `not` without removing `"string"` from the type array.

{ "type": ["string", "number"], "not": { "type": "string" } }
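As an illustration of this dependency, the hypothetical helper below folds a negated single-type constraint into a `type` list that has already been normalized. The function name and structure are my own, not the library's:

```python
# Hypothetical helper, for illustration only: fold a negated single-type
# constraint into a `type` list that has already been normalized.
def apply_not_type(schema: dict) -> dict:
    out = dict(schema)
    negated = out.get("not", {})
    if set(negated) == {"type"} and isinstance(out.get("type"), list):
        banned = negated["type"]
        banned = banned if isinstance(banned, list) else [banned]
        out["type"] = [t for t in out["type"] if t not in banned]
        del out["not"]
    return out


schema = {"type": ["string", "number"], "not": {"type": "string"}}
print(apply_not_type(schema))  # {'type': ['number']}
```

If `type` had not been normalized into a list first, the helper would have nothing to subtract from, which is exactly the ordering problem described above.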
Performance Optimization Considerations
The code appears ordered to optimize efficiency, prioritizing transformations that quickly simplify schemas. If we process strictly keyword by keyword, we might miss some of these optimization opportunities. The code comments emphasize how much performance matters:
"That's the difference between 'I'd like it to be faster' and 'doesn't finish at all'"
A better implementation: check by category layers
If we still want to check keyword by keyword, a better implementation would be to check by keyword category layers rather than by individual JSON Schema keywords. We can follow this order (a sketch follows the list):
Metadata keywords (title, description, etc.)
Validation keywords (type, enum, etc.)
Type-organized keywords (numeric, string, array, object)
Logical keywords (allOf, anyOf, etc.)
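Here is what that could look like. The `KEYWORD_LAYERS` grouping and the `normalize_by_layers` function below are hypothetical and only partially populated:

```python
# Illustrative only: a possible ordering of keyword-category layers.
# The category membership below is a partial, hypothetical grouping.
KEYWORD_LAYERS = [
    # 1. Metadata keywords
    {"title", "description", "$comment", "examples", "default"},
    # 2. General validation keywords
    {"type", "enum", "const"},
    # 3. Type-organized keywords
    {"minimum", "maximum", "multipleOf",          # numeric
     "minLength", "maxLength", "pattern",         # string
     "items", "minItems", "maxItems",             # array
     "properties", "required", "minProperties"},  # object
    # 4. Logical keywords
    {"allOf", "anyOf", "oneOf", "not", "if", "then", "else"},
]


def normalize_by_layers(schema: dict, rules: dict) -> dict:
    # `rules` maps a keyword to the handler that normalizes it.
    out = dict(schema)
    for layer in KEYWORD_LAYERS:
        for keyword in sorted(layer & out.keys()):
            handler = rules.get(keyword)
            if handler is not None:
                out = handler(out)
    return out
```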
An even better implementation: Multi-stage processing
Stage 1: Preprocessing and conflict detection
Some schemas have logical conflicts and invalid constraints that can never be satisfied, such as incompatible type combinations, contradictory numeric ranges, or impossible property requirements. These conflicts should always be detected early and removed first, unless they are kept for learning purposes (I do not think anyone would use this library to learn JSON Schema, though). If we start thinking about the "different configuration rules" idea Julian mentioned, this check should default to `True` in almost every ruleset.
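A minimal sketch of what such a conflict-detection pass could look like; the `detect_conflicts` name and the specific checks are illustrative assumptions, not the library's actual behaviour:

```python
# Sketch of Stage 1, for illustration: detect constraints that can never be
# satisfied and collapse the schema to a "matches nothing" form.
from typing import Any, Dict

NOTHING: Dict[str, Any] = {"not": {}}  # conventional "unsatisfiable" schema


def detect_conflicts(schema: Dict[str, Any]) -> Dict[str, Any]:
    # Contradictory numeric range: minimum greater than maximum.
    if "minimum" in schema and "maximum" in schema:
        if schema["minimum"] > schema["maximum"]:
            return NOTHING
    # Incompatible type combination: an empty `type` list allows nothing.
    if schema.get("type") == []:
        return NOTHING
    # Impossible property requirement: more required names than maxProperties.
    if len(schema.get("required", [])) > schema.get("maxProperties", float("inf")):
        return NOTHING
    return schema
```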
Stage 2: Type-specific normalization
From this stage onward, users can customize their rulesets or choose which elements to keep. (Details TBD)
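As a purely hypothetical example of a type-specific rule that a ruleset could switch on or off, the sketch below rewrites numeric exclusive bounds on integer schemas into inclusive ones; the function name and the `enabled` flag are assumptions, not an existing API:

```python
import math


# Hypothetical Stage 2 rule: for integer schemas, rewrite numeric exclusive
# bounds into inclusive ones. Whether this runs could be a per-ruleset option.
# (A real rule would also reconcile with any existing minimum/maximum.)
def normalize_integer_bounds(schema: dict, *, enabled: bool = True) -> dict:
    if not enabled or schema.get("type") != "integer":
        return schema
    out = dict(schema)
    if isinstance(out.get("exclusiveMinimum"), (int, float)):
        out["minimum"] = math.floor(out.pop("exclusiveMinimum")) + 1
    if isinstance(out.get("exclusiveMaximum"), (int, float)):
        out["maximum"] = math.ceil(out.pop("exclusiveMaximum")) - 1
    return out


print(normalize_integer_bounds({"type": "integer", "exclusiveMinimum": 0}))
# {'type': 'integer', 'minimum': 1}
```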
Stage 3: Logical combination normalization
Users can configure this stage based on different use cases. For example, they can choose one of the existing rulesets like `basic`, `modest`, `strict`, … or `keep ... as required by comments`, `high efficiency`, `validation`, `high readability` … (TBD)
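As an illustration of what ruleset-gated behaviour in this stage could look like, the sketch below flattens `allOf` members that do not clash with the parent schema; the ruleset names and the `flatten_allOf` option are placeholders, not an existing API:

```python
# Hypothetical Stage 3 behaviour: flatten `allOf` members that don't clash
# with the parent schema. The ruleset names and options are placeholders.
RULESETS = {
    "basic": {"flatten_allOf": False},
    "modest": {"flatten_allOf": True},
    "strict": {"flatten_allOf": True},
}


def normalize_logic(schema: dict, ruleset: str = "modest") -> dict:
    options = RULESETS[ruleset]
    out = dict(schema)
    if options["flatten_allOf"] and "allOf" in out:
        remaining = []
        for sub in out.pop("allOf"):
            # Merge a sub-schema only if none of its keywords are already set.
            if isinstance(sub, dict) and not (sub.keys() & out.keys()):
                out.update(sub)
            else:
                remaining.append(sub)
        if remaining:
            out["allOf"] = remaining
    return out


print(normalize_logic({"type": "object", "allOf": [{"required": ["id"]}]}))
# {'type': 'object', 'required': ['id']}
```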
Stage 4: Redundancy removal and final cleanup
This is the most flexible stage. Users can choose to remove some redundancy or to keep those elements. For example, they can remove some descriptive elements (`title`, `description`) for API documentation generation. They can also keep `examples` for testing.
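A hypothetical sketch of this cleanup, where the caller decides which descriptive keywords survive (a real implementation would also need to distinguish keyword positions from property names, which this sketch ignores):

```python
# Hypothetical Stage 4 cleanup: strip descriptive keywords the user does not
# want to keep, recursing into nested dictionaries. Names are made up.
DESCRIPTIVE_KEYWORDS = {"title", "description", "examples", "$comment"}


def cleanup(schema, keep=frozenset()):
    if not isinstance(schema, dict):
        return schema
    return {
        key: cleanup(value, keep)
        for key, value in schema.items()
        if key not in DESCRIPTIVE_KEYWORDS - keep
    }


# Keep `examples` for testing, drop `title` and `description`:
doc = {
    "title": "User",
    "examples": [{"id": 1}],
    "properties": {"id": {"type": "integer", "description": "primary key"}},
}
print(cleanup(doc, keep={"examples"}))
# {'examples': [{'id': 1}], 'properties': {'id': {'type': 'integer'}}}
```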
Advantages of this implementation:
Clear separation of concerns: Each stage focuses on specific types of transformations
Predictable processing order: Reduces unexpected interactions between rules
Better configurability: Specific stages can be enabled or disabled as needed
Maintainability and Extensibility: New rules can be added to their appropriate stages
More to come: balancing flexibility (users can customize the ruleset) and simplicity (there should be some commonly used, ready-made rulesets for current users)
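To show how the stages and rulesets could fit together, here is a hypothetical top-level pipeline that reuses the stage functions sketched earlier; every name here is a placeholder, not the library's API:

```python
from dataclasses import dataclass


# Hypothetical configuration object: each field enables or tunes one stage.
@dataclass(frozen=True)
class Ruleset:
    conflict_detection: bool = True          # Stage 1
    type_specific: bool = True               # Stage 2
    logic_ruleset: str = "modest"            # Stage 3
    keep_keywords: frozenset = frozenset()   # Stage 4


def canonicalise(schema: dict, rules: Ruleset = Ruleset()) -> dict:
    if rules.conflict_detection:
        schema = detect_conflicts(schema)           # Stage 1, sketched above
    if rules.type_specific:
        schema = normalize_integer_bounds(schema)   # Stage 2, sketched above
    schema = normalize_logic(schema, rules.logic_ruleset)  # Stage 3
    return cleanup(schema, keep=rules.keep_keywords)        # Stage 4
```

A `validation`-oriented or `high readability` ruleset would then just be another `Ruleset(...)` instance with different defaults, which keeps the common cases simple while still letting users customize each stage.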