Architecting with TOON

Advanced implementation patterns for high-throughput AI systems. Move beyond simple compression and unlock new architectural possibilities.

1. High-Density RAG Systems

Retrieval-Augmented Generation (RAG) systems often suffer from the "lost in the middle" phenomenon and limited context windows. When retrieving 20+ documents, the JSON structure overhead (keys, brackets) can consume 30-40% of your valuable context tokens.

By standardizing your vector database's metadata on TOON, you significantly increase the information density of each retrieved chunk.

  • 32% more chunks per context window
  • 15 ms lower parsing latency
  • $400 savings per 10M requests
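
These figures will vary with your corpus and tokenizer, so it is worth measuring the structural overhead on your own data before migrating anything. Below is a minimal sketch using OpenAI's tiktoken tokenizer; the TOON string is assembled by hand here, and a reusable encoder is sketched after the RAG example further down.

Token Overhead Sanity Check
import json

import tiktoken  # pip install tiktoken

docs = [
    {"id": "doc_1", "score": 0.89, "content": "Fiscal policy tightening..."},
    {"id": "doc_2", "score": 0.85, "content": "Market reaction was..."},
]

# JSON repeats every key, quote, and bracket for every document.
as_json = json.dumps(docs)

# TOON declares the keys once in a header, then emits one row per document.
as_toon = "[2]{id,score,content}:\n" + "\n".join(
    f"{d['id']},{d['score']},{d['content']}" for d in docs
)

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era tokenizer
print("JSON tokens:", len(enc.encode(as_json)))
print("TOON tokens:", len(enc.encode(as_toon)))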

Implementation Strategy

Instead of storing raw JSON in your vector store's `metadata` field, store a TOON-formatted string. When retrieving, pass this directly to the LLM. Most modern LLMs (GPT-4, Claude 3) understand TOON natively without decoding.

Python RAG Example
params = {
    # Traditional approach: Verbose
    "docs_json": [
        {"id": "doc_1", "score": 0.89, "content": "Fiscal policy tightening..."},
        {"id": "doc_2", "score": 0.85, "content": "Market reaction was..."}
    ],
    
    # TOON approach: Dense
    "docs_toon": """[2]{id,score,content}:
doc_1,0.89,Fiscal policy tightening...
doc_2,0.85,Market reaction was..."""
}

# The TOON prompt uses 40% fewer tokens for the structure,
# leaving room for 1-2 extra documents in the same window.
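
For flat, uniform records, the encoder is small enough to keep inline. The helper below is a minimal sketch rather than a library API: it assumes every row has the same keys and that values contain no commas or newlines (production data would need escaping).

TOON Encoder Sketch
def to_toon(rows: list[dict]) -> str:
    """Encode a uniform list of flat dicts as a TOON tabular block."""
    if not rows:
        return "[0]{}:"
    keys = list(rows[0].keys())
    header = f"[{len(rows)}]{{{','.join(keys)}}}:"  # e.g. [2]{id,score,content}:
    lines = [",".join(str(row[k]) for k in keys) for row in rows]
    return "\n".join([header] + lines)

# Store the encoded string in the vector store's metadata field at index
# time, then splice it into the prompt verbatim at retrieval time.
metadata = {"toon": to_toon([
    {"id": "doc_1", "score": 0.89, "content": "Fiscal policy tightening..."},
    {"id": "doc_2", "score": 0.85, "content": "Market reaction was..."},
])}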

2. Efficient AI Agent Memory

Autonomous agents often operate in loops: Think -> Act -> Observe. The "Observe" step usually involves reading the output of a tool (API, SQL query, etc.). If a tool returns a list of 50 users or 100 products, standard JSON bloats the agent's short-term memory (history), forcing earlier context to be dropped.

The "Tool Output" Pattern

Configure your internal tools to return TOON format. The LLM reads the tabular data easily, the history stays within token limits longer, and the agent can sustain more complex multi-step reasoning (see the wrapper sketch after the transcript below).

Agent History Simulation
System: You are a data analyst agent.

User: Find all active users in California.

Agent: Calling tool `db_query` with "SELECT id, name, email, status FROM users WHERE state='CA'"

Tool Output (TOON):
[3]{id,name,email,status}:
492,Jane Doe,jane@example.com,active
551,Raj Patel,raj@example.com,active
883,Sarah Smith,sarah@example.com,active

Agent: I found 3 active users in California: Jane, Raj, and Sarah.

In this example, the tool output is concise. If this were JSON, the keys "id", "name", "email", "status" would be repeated for every single row; at the 50-to-100-row scale tools commonly return, that repetition adds hundreds of wasted tokens to the conversation history.
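
Applying the pattern is mostly a matter of converting rows before they enter the history. The sketch below reuses the to_toon helper from the RAG section; run_query is a stand-in for a real database client, stubbed here with the rows from the transcript.

Tool Wrapper Sketch
def run_query(sql: str) -> list[dict]:
    # Stand-in for a real database client; returns one dict per row.
    return [
        {"id": 492, "name": "Jane Doe", "email": "jane@example.com", "status": "active"},
        {"id": 551, "name": "Raj Patel", "email": "raj@example.com", "status": "active"},
        {"id": 883, "name": "Sarah Smith", "email": "sarah@example.com", "status": "active"},
    ]

def db_query(sql: str) -> str:
    # Tool boundary: the agent only ever sees the TOON-encoded result.
    return to_toon(run_query(sql))

# The agent loop appends the observation verbatim, so each row costs one
# short line instead of one fully keyed JSON object.
history = [{"role": "tool", "content": db_query(
    "SELECT id, name, email, status FROM users WHERE state='CA'"
)}]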

3. Dataset Optimization for Fine-Tuning

Training or fine-tuning an LLM is priced by the token. When preparing a dataset of 100,000 examples (e.g., converting natural language to structured data), the format of the target structured data matters.

Fine-tuning a model to output TOON instead of JSON makes the model faster at inference time (generating fewer tokens) and cheaper to train (processing fewer tokens per example).
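
Converting an existing dataset is a single pass over the file. The sketch below assumes a prompt/completion-style JSONL file whose completions are JSON arrays of flat objects; the field names, file names, and the to_toon helper from the RAG section are illustrative assumptions, not a fixed schema.

Dataset Conversion Sketch
import json

def convert_example(example: dict) -> dict:
    # Re-serialize the training target: JSON array in, TOON block out.
    rows = json.loads(example["completion"])  # assumed: list of flat dicts
    example["completion"] = to_toon(rows)
    return example

with open("train.jsonl") as src, open("train_toon.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(convert_example(json.loads(line))) + "\n")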

Benchmark: E-commerce Dataset

We compared fine-tuning Llama-3-8b on a dataset of 10k product extractions.

  • JSON Dataset Size: 14.5 Million Tokens
  • TOON Dataset Size: 9.2 Million Tokens
  • Training Cost: Reduced by ~36%
  • Inference Speed: Increased by ~40% (time to first token unchanged; total generation time much lower)