Retrieval-Augmented Generation (RAG) is the standard architecture for enterprise AI: relevant documents are retrieved from a vector database and injected into the LLM's context window alongside the user's query. But context windows, while growing (128k, even 1M tokens), are still finite and expensive.
The Metadata Overhead
When you retrieve a chunk of text from a vector store (like Pinecone or Milvus), you typically get associated metadata: author, timestamp, source URL, tags, and permissions. In a standard JSON response, this metadata can easily outweigh the actual text content in terms of token count.
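To make the imbalance concrete, here is a minimal Python illustration. The hit shape and field names are hypothetical (modeled loosely on a typical Pinecone/Milvus response), and tiktoken is used only to count tokens:

```python
import json

import tiktoken  # OpenAI tokenizer; any tokenizer gives a similar picture

# A hypothetical search hit, shaped like a typical vector-store response.
hit = {
    "id": "doc-4821#chunk-3",
    "score": 0.8731,
    "metadata": {
        "author": "jane.doe@example.com",
        "timestamp": "2024-05-17T09:32:11Z",
        "source_url": "https://wiki.example.com/policies/expense-reporting",
        "tags": ["finance", "policy", "travel"],
        "permissions": ["group:finance", "group:managers"],
    },
    "content": "Expenses over $500 require prior approval from a manager.",
}

enc = tiktoken.get_encoding("cl100k_base")
meta_tokens = len(enc.encode(json.dumps(hit["metadata"])))
text_tokens = len(enc.encode(hit["content"]))
print(f"metadata: {meta_tokens} tokens, content: {text_tokens} tokens")
# For short chunks like this, the metadata envelope often costs more than the text itself.
```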
TOON Strategy for RAG
By converting the retrieved metadata arrays into TOON format before injection, you effectively "zip" the structural overhead. This allows you to:
- Retrieve more chunks: If each retrieved chunk costs roughly 30% fewer tokens, about 40% more chunks fit in the same context budget (1/0.7 ≈ 1.43). See the arithmetic sketch after this list.
- Reduce hallucinations: More relevant context in the prompt gives the model better grounding for its answers.
- Cut time-to-first-token: Fewer input tokens mean less work in the LLM's prefill stage, so the first output token arrives sooner.
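The "more chunks" claim is just budget arithmetic. A rough sketch with illustrative numbers (not benchmarks):

```python
# How many chunks fit in a fixed context budget, before and after TOON conversion.
context_budget = 8_000          # tokens reserved for retrieved context
json_chunk_cost = 220           # assumed average tokens per chunk serialized as JSON
toon_chunk_cost = int(json_chunk_cost * 0.7)  # assuming ~30% savings per chunk

print(context_budget // json_chunk_cost)  # ~36 chunks as JSON
print(context_budget // toon_chunk_cost)  # ~51 chunks as TOON (~40% more)
```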
Implementation Example
Instead of injecting a list of JSON objects representing search results, format them as a readable table string using TOON. The LLM understands the schema from the header row (@cols(id,score,content)) and can reference items by ID just as effectively as if they were full JSON objects.
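A minimal sketch of such a formatter follows. The @cols(...) header matches the convention described above; the pipe delimiter and the helper name to_toon_table are illustrative assumptions, not the official TOON library API:

```python
from typing import Any


def to_toon_table(rows: list[dict[str, Any]], cols: list[str]) -> str:
    """Serialize search hits as a TOON-style table string.

    One @cols(...) header line declares the schema, then one compact row
    per hit. Pipes inside values are escaped so rows stay unambiguous.
    """
    lines = [f"@cols({','.join(cols)})"]
    for row in rows:
        cells = [str(row.get(col, "")).replace("|", r"\|") for col in cols]
        lines.append("|".join(cells))
    return "\n".join(lines)


hits = [
    {"id": "doc-4821#chunk-3", "score": 0.87, "content": "Expenses over $500 require prior approval."},
    {"id": "doc-1904#chunk-1", "score": 0.81, "content": "Approved expenses are reimbursed within 14 days."},
]

context_block = to_toon_table(hits, ["id", "score", "content"])
# @cols(id,score,content)
# doc-4821#chunk-3|0.87|Expenses over $500 require prior approval.
# doc-1904#chunk-1|0.81|Approved expenses are reimbursed within 14 days.
```

The resulting string is injected into the prompt in place of the JSON array; the model can still cite individual chunks by their id column when answering.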
Impact on Vector Stores
While TOON is primarily a serialization format for data going into LLM prompts, some advanced teams are beginning to store TOON-formatted strings directly in their vector metadata fields to save on storage and bandwidth costs.
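A hedged sketch of what that might look like at indexing time, assuming a Pinecone-style record shape (adapt the upsert call to your client):

```python
# Placeholder embedding; in practice this comes from your embedding model.
embedding = [0.0] * 1536

record = {
    "id": "doc-4821#chunk-3",
    "values": embedding,
    "metadata": {
        # One compact TOON-style string instead of several separate JSON fields.
        "toon": "@cols(author,timestamp,tags)\n"
                "jane.doe@example.com|2024-05-17T09:32:11Z|finance;policy;travel",
        # Keep keys you need to filter on as native metadata fields.
        "source_url": "https://wiki.example.com/policies/expense-reporting",
    },
}
# With a Pinecone-style Python client this is roughly: index.upsert(vectors=[record])
```

The trade-off is that a packed TOON string cannot be used for native metadata filtering, so any key you filter on (permissions, source, tenant) should remain a separate field.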