Retrieval-Augmented Generation (RAG) is the standard architecture for enterprise AI: relevant documents are retrieved from a vector database and injected into the LLM's context window alongside the user's query. But context windows, while growing (128k, even 1M tokens), are still finite and expensive.
The Metadata Overhead
When you retrieve a chunk of text from a vector store (like Pinecone or Milvus), you typically get associated metadata: author, timestamp, source URL, tags, and permissions. In a standard JSON response, this metadata can easily outweigh the actual text content in terms of token count.
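To make the imbalance concrete, here is a minimal Python illustration. The hit shape and field names are hypothetical (modeled loosely on a typical Pinecone/Milvus response), and tiktoken is used only to count tokens:

```python
import json

import tiktoken  # OpenAI tokenizer; any tokenizer gives a similar picture

# A hypothetical search hit, shaped like a typical vector-store response.
hit = {
    "id": "doc-4821#chunk-3",
    "score": 0.8731,
    "metadata": {
        "author": "jane.doe@example.com",
        "timestamp": "2024-05-17T09:32:11Z",
        "source_url": "https://wiki.example.com/policies/expense-reporting",
        "tags": ["finance", "policy", "travel"],
        "permissions": ["group:finance", "group:managers"],
    },
    "content": "Expenses over $500 require prior approval from a manager.",
}

enc = tiktoken.get_encoding("cl100k_base")
meta_tokens = len(enc.encode(json.dumps(hit["metadata"])))
text_tokens = len(enc.encode(hit["content"]))
print(f"metadata: {meta_tokens} tokens, content: {text_tokens} tokens")
# For short chunks like this, the metadata envelope often costs more than the text itself.
```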
TOON Strategy for RAG
By converting the retrieved metadata arrays into TOON format before injection, you effectively "zip" the structural overhead. This allows you to:
- Retrieve more chunks: If each retrieved chunk costs roughly 30% fewer tokens, about 40% more chunks fit in the same context budget (1/0.7 ≈ 1.43). See the arithmetic sketch after this list.
- Reduce hallucinations: More relevant context in the prompt gives the model better grounding for its answers.
- Cut time-to-first-token: Fewer input tokens mean less work in the LLM's prefill stage, so the first output token arrives sooner.
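The "more chunks" claim is just budget arithmetic. A rough sketch with illustrative numbers (not benchmarks):

```python
# How many chunks fit in a fixed context budget, before and after TOON conversion.
context_budget = 8_000          # tokens reserved for retrieved context
json_chunk_cost = 220           # assumed average tokens per chunk serialized as JSON
toon_chunk_cost = int(json_chunk_cost * 0.7)  # assuming ~30% savings per chunk

print(context_budget // json_chunk_cost)  # ~36 chunks as JSON
print(context_budget // toon_chunk_cost)  # ~51 chunks as TOON (~40% more)
```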
Implementation Example
Instead of injecting a list of JSON objects representing search results, format them as a readable table string using TOON. The LLM understands the schema from the header row (@cols(id,score,content)) and can reference items by ID just as effectively as if they were full JSON objects.
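A minimal sketch of such a formatter follows. The @cols(...) header matches the convention described above; the pipe delimiter and the helper name to_toon_table are illustrative assumptions, not the official TOON library API:

```python
from typing import Any


def to_toon_table(rows: list[dict[str, Any]], cols: list[str]) -> str:
    """Serialize search hits as a TOON-style table string.

    One @cols(...) header line declares the schema, then one compact row
    per hit. Pipes inside values are escaped so rows stay unambiguous.
    """
    lines = [f"@cols({','.join(cols)})"]
    for row in rows:
        cells = [str(row.get(col, "")).replace("|", r"\|") for col in cols]
        lines.append("|".join(cells))
    return "\n".join(lines)


hits = [
    {"id": "doc-4821#chunk-3", "score": 0.87, "content": "Expenses over $500 require prior approval."},
    {"id": "doc-1904#chunk-1", "score": 0.81, "content": "Approved expenses are reimbursed within 14 days."},
]

context_block = to_toon_table(hits, ["id", "score", "content"])
# @cols(id,score,content)
# doc-4821#chunk-3|0.87|Expenses over $500 require prior approval.
# doc-1904#chunk-1|0.81|Approved expenses are reimbursed within 14 days.
```

The resulting string is injected into the prompt in place of the JSON array; the model can still cite individual chunks by their id column when answering.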
Impact on Vector Stores
While TOON is primarily a serialization format for data going into LLM prompts, some advanced teams are beginning to store TOON-formatted strings directly in their vector metadata fields to save on storage and bandwidth costs.
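A hedged sketch of what that might look like at indexing time, assuming a Pinecone-style record shape (adapt the upsert call to your client):

```python
# Placeholder embedding; in practice this comes from your embedding model.
embedding = [0.0] * 1536

record = {
    "id": "doc-4821#chunk-3",
    "values": embedding,
    "metadata": {
        # One compact TOON-style string instead of several separate JSON fields.
        "toon": "@cols(author,timestamp,tags)\n"
                "jane.doe@example.com|2024-05-17T09:32:11Z|finance;policy;travel",
        # Keep keys you need to filter on as native metadata fields.
        "source_url": "https://wiki.example.com/policies/expense-reporting",
    },
}
# With a Pinecone-style Python client this is roughly: index.upsert(vectors=[record])
```

The trade-off is that a packed TOON string cannot be used for native metadata filtering, so any key you filter on (permissions, source, tenant) should remain a separate field.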