Optimizing RAG Pipelines: Fitting More Context for Less

Published on Nov 25, 2025 • 6 min read

Retrieval-Augmented Generation (RAG) is the standard architecture for enterprise AI: retrieve relevant documents from a vector database and feed them into the LLM's context window. But context windows, while growing (128k, even 1M tokens), are still finite, and every token you inject costs money and latency.


The Metadata Overhead

When you retrieve a chunk of text from a vector store (like Pinecone or Milvus), you typically get associated metadata: author, timestamp, source URL, tags, and permissions. In a standard JSON response, this metadata can easily outweigh the actual text content in terms of token count.

Fact: In a typical legal document retrieval system, JSON structure overhead accounts for 30% of the total prompt size.
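To see where those tokens go, consider a single retrieved match in roughly the shape many vector stores return. The field names, the response layout, and the size comparison below are illustrative assumptions, not the output of any specific client.

```python
import json

# Hypothetical retrieved match (field names are illustrative, not taken
# from a specific vector store client).
match = {
    "id": "contract-2024-0017#chunk-3",
    "score": 0.8921,
    "metadata": {
        "author": "legal-ops@example.com",
        "timestamp": "2024-06-11T09:32:00Z",
        "source_url": "https://example.com/contracts/2024/0017",
        "tags": ["msa", "indemnification", "v2"],
        "permissions": {"groups": ["legal", "procurement"], "level": "read"},
        "text": "Indemnification is capped at the total contract value.",
    },
}

# Compare the full JSON payload against the one sentence we actually want
# the model to read. Character counts are a crude proxy for token counts.
serialized = json.dumps(match, indent=2)
content = match["metadata"]["text"]
print(f"full JSON: {len(serialized)} chars, content only: {len(content)} chars")
```

Everything outside the text field (keys, quotes, braces, permissions) rides along into the prompt if you inject the raw JSON.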

TOON Strategy for RAG

By converting the retrieved metadata arrays into TOON format before injection, you effectively "zip" the structural overhead. This lets you fit more retrieved chunks into the same context window, or cut per-query token costs while keeping retrieval depth the same.

Implementation Example

Instead of injecting a list of JSON objects representing search results, format them as a readable table string using TOON. The LLM understands the schema from the header row (@cols(id,score,content)) and can reference items by ID just as effectively as if they were full JSON objects.
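Here is a minimal sketch of that idea in Python. The to_toon_table helper and the pipe delimiter between row values are assumptions (no official TOON library is assumed); the @cols(...) header follows the convention described above.

```python
def to_toon_table(results: list[dict], cols: tuple[str, ...] = ("id", "score", "content")) -> str:
    """Render search results as a compact header-plus-rows string."""
    header = f"@cols({','.join(cols)})"
    rows = []
    for r in results:
        # One pipe-delimited row per result, in the column order declared above.
        rows.append("|".join(str(r.get(c, "")) for c in cols))
    return "\n".join([header, *rows])


results = [
    {"id": "doc-42", "score": 0.91, "content": "Termination clauses require 30 days notice."},
    {"id": "doc-17", "score": 0.87, "content": "Indemnification is capped at contract value."},
]

prompt_context = to_toon_table(results)
# @cols(id,score,content)
# doc-42|0.91|Termination clauses require 30 days notice.
# doc-17|0.87|Indemnification is capped at contract value.
```

The resulting string is what you splice into the prompt in place of the JSON array; the model can still cite doc-42 or doc-17 by ID when it answers.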

Impact on Vector Stores

While TOON is primarily a format for serializing prompt input to LLMs, some advanced teams are beginning to store TOON-formatted strings directly in their vector metadata fields to save on storage and bandwidth costs.
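A minimal sketch of that pattern, assuming a generic upsert interface: the pack_metadata_toon helper, the "toon" key, and the placeholder vector are all illustrative, not a vendor convention.

```python
def pack_metadata_toon(meta: dict) -> dict:
    """Collapse sidecar fields into one TOON-style string to shrink stored metadata."""
    fields = ("author", "timestamp", "source_url")
    packed = "@cols({})\n{}".format(
        ",".join(fields),
        "|".join(str(meta.get(f, "")) for f in fields),
    )
    # Keep the raw text separate so it stays usable for filtering and reranking.
    return {"toon": packed, "text": meta.get("text", "")}


chunk_meta = {
    "author": "legal-ops@example.com",
    "timestamp": "2024-06-11T09:32:00Z",
    "source_url": "https://example.com/contracts/2024/0017",
    "text": "Indemnification is capped at the total contract value.",
}
embedding = [0.0] * 1536  # placeholder vector from your embedding model

record = {
    "id": "contract-2024-0017#chunk-3",
    "values": embedding,
    "metadata": pack_metadata_toon(chunk_meta),
}
# index.upsert(vectors=[record])  # whichever vector store client you use
```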