JSON has reigned supreme since the early 2000s, replacing XML as the de facto standard for web data exchange. Its strengths were its alignment with JavaScript and its human readability. But in the age of AI, "human readability" is taking a backseat to "machine inference."
Semantic Compression
The next generation of data formats isn't just about binary packing in the style of Protocol Buffers or MessagePack. While efficient for bandwidth, binary formats are opaque to LLMs: a model cannot "read" binary without a separate decoding step.
Semantic Compression—the philosophy behind TOON—is the idea that data formats should be optimized for the cognitive architecture of the consumer. When the consumer is a transformer model, the format should leverage the model's ability to infer context, removing redundancy that traditional parsers strictly require.
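As a rough sketch of the idea, the snippet below renders the same records as standard JSON and as a header-once tabular layout in the spirit of TOON. The tabular syntax shown is illustrative rather than a normative spec: the point is that the keys are declared once and the reader is trusted to map each row onto them positionally.

```python
import json

# The same three records, once as standard JSON and once in a
# header-once tabular layout inspired by TOON. The exact tabular
# syntax here is illustrative, not a normative spec.
records = [
    {"id": 1, "name": "Ada", "role": "admin"},
    {"id": 2, "name": "Bo", "role": "editor"},
    {"id": 3, "name": "Cy", "role": "viewer"},
]

as_json = json.dumps(records)

# Keys appear once in the header; each row maps positionally onto
# {id,name,role}, so repeating the keys (and most of the quoting)
# adds no information for a reader that can infer the schema.
as_tabular = "users[3]{id,name,role}:\n" + "\n".join(
    f"  {r['id']},{r['name']},{r['role']}" for r in records
)

print(as_json)
print(as_tabular)
```

A strict parser would reject the tabular form without an explicit grammar, but a transformer reads it the way a person reads a CSV with a header row: the surrounding context carries the schema.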
The "Token Economy"
We are moving into an API economy where cost is measured in tokens, not bytes. A format that is byte-heavy but token-light (longer, common words that tokenizers map to few tokens) can actually be cheaper than a byte-light format whose terse symbols fragment into many rare tokens.
For example, specialized symbols and dense punctuation often split into multiple tokens, while simple English words usually compress into a single token. TOON optimizes for this tokenizer behavior, ensuring that the structural glue of your data costs as few tokens as possible.
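One way to see this effect on real data is to count tokens directly. The sketch below uses the tiktoken library with the cl100k_base encoding; both the encoding choice and the sample payloads are assumptions made for illustration, and actual counts depend on the tokenizer of the model you call.

```python
# A minimal token-count comparison. Assumes `pip install tiktoken`;
# cl100k_base is one common encoding, chosen here for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

json_payload = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}]'
tabular_payload = "users[2]{id,name}:\n  1,Ada\n  2,Bo"

for label, text in [("JSON", json_payload), ("tabular", tabular_payload)]:
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(enc.encode(text))
    # Bytes and tokens diverge: punctuation-heavy structure tends to
    # fragment into many short tokens, while plain words stay whole.
    print(f"{label}: {n_bytes} bytes, {n_tokens} tokens")
```

The only number that matters for cost is the count produced by the tokenizer of the model you actually send the data to, so any format comparison is worth re-running against that tokenizer.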
What's Next?
We expect to see more "LLM-native" formats emerge. These might include embedding-friendly structures or formats that mix natural language descriptions with structured data more fluidly than JSON Schema ever could.