# Getting started
**Zero-install Colab notebook**
Want to try CompactBench in your browser without any setup? Open
notebooks/try_compactbench.ipynb
in Colab — it walks through installation, runs the four built-in baselines against a real
model on Groq's free tier, and ends with writing your own compactor. Under two minutes end-to-end.
## Install
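With pip (assuming the package is published on PyPI under the same name as the CLI, `compactbench`):

```bash
pip install compactbench
```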
Or with uv:
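```bash
# Same package-name assumption as above.
uv pip install compactbench
```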
Verify:
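```bash
# --version is an assumed flag; `compactbench --help` also confirms the install.
compactbench --version
```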
## Configure a provider
CompactBench needs a model provider to evaluate compacted context. Any of the following work out of the box:
| Provider | `--provider` key | Env var | Free tier |
|---|---|---|---|
| Groq | `groq` | `COMPACTBENCH_GROQ_API_KEY` | Yes; ~100k tokens/day on the Llama 3.3 70B free tier, upgrade for higher limits |
| Google AI Studio | `google-ai-studio` | `COMPACTBENCH_GOOGLE_AI_STUDIO_API_KEY` | Yes; free on Gemini 2.0 Flash, RPM limits apply |
| Anthropic | `anthropic` | `COMPACTBENCH_ANTHROPIC_API_KEY` | Credit-based; $5 starter credit with a new account |
| OpenAI | `openai` | `COMPACTBENCH_OPENAI_API_KEY` | Credit-based; usage-priced, no true free tier |
| Ollama (local) | `ollama` | `COMPACTBENCH_OLLAMA_BASE_URL` | Local only; constrained by your machine |
| Mock (tests) | `mock` | (no config needed) | n/a; returns canned "hello" responses, so scores are not interpretable, but the full pipeline runs. Useful for CI and pipeline smoke tests |
**Running the full Elite practice suite**

15 templates × 5 cases (the default) × 2 drift cycles × ~3 eval items per cycle ≈ 450 LLM calls, which exceeds Groq's free-tier daily token limit. For real evaluation runs, use Anthropic, OpenAI, or a paid Groq account. The benchmark auto-detects daily-quota 429s and surfaces them immediately rather than retrying futilely.
**Prompt caching on the evaluation layer**

Every cycle asks the target model N evaluation questions against the same compaction artifact. CompactBench automatically threads the artifact as a `cached_prefix` on each call, so providers that support prompt caching pay for those input tokens only once per cycle:
- **Anthropic**: explicit `cache_control: ephemeral` on the artifact block (~90% input-cost drop on cache hits)
- **OpenAI**: automatic prompt caching kicks in on stable prefixes of ≥1024 tokens at the start of a user message
- **Groq / Ollama / Gemini**: no network-side caching today; behaviour is identical to a single concatenated prompt
- **Mock**: the `cached_prefix` is recorded on every call for test inspection
This saves roughly 40–60% of input tokens on the evaluation layer for Anthropic and OpenAI runs, with no user-facing change. The `compactbench run --estimate` projection reports total tokens before caching, so real spend on those two providers is typically lower than the estimate.
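To illustrate what the Anthropic path does with a `cached_prefix`, here is a minimal sketch of mapping it onto Anthropic's block-level `cache_control`; the `build_anthropic_messages` helper is a hypothetical stand-in for illustration, not CompactBench's actual adapter:

```python
# Illustrative sketch only: a hypothetical helper showing how a provider
# adapter can place the compaction artifact in a cacheable content block.
# This is not CompactBench's real internals.

def build_anthropic_messages(cached_prefix: str, question: str) -> list[dict]:
    """Return an Anthropic-style message where the artifact is cache-marked
    and only the per-question text varies between calls."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": cached_prefix,
                    # Anthropic caches the prefix up to and including this
                    # block, so the artifact's tokens are paid once per cycle.
                    "cache_control": {"type": "ephemeral"},
                },
                # The evaluation question changes every call and stays uncached.
                {"type": "text", "text": question},
            ],
        }
    ]


if __name__ == "__main__":
    msgs = build_anthropic_messages("<compaction artifact>", "Which constraints still hold?")
    print(msgs[0]["content"][0]["cache_control"])  # -> {'type': 'ephemeral'}
```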
Put keys in a `.env` file at the project root, or export them in your shell.
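For example, a minimal `.env` for a Groq run (the value is a placeholder, not a real key):

```bash
# .env at the project root
COMPACTBENCH_GROQ_API_KEY=your-groq-key-here
```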
## Run your first benchmark
Against a local Ollama model:
```bash
compactbench run \
  --method built-in:hybrid-ledger \
  --suite starter \
  --provider ollama \
  --model llama3.2
```
Against Groq:
```bash
compactbench run \
  --method built-in:structured-state \
  --suite starter \
  --provider groq \
  --model llama-3.3-70b-versatile
```
Against Anthropic Claude:
```bash
compactbench run \
  --method built-in:hybrid-ledger \
  --suite starter \
  --provider anthropic \
  --model claude-3-5-haiku-latest
```
Against OpenAI:
```bash
compactbench run \
  --method built-in:hybrid-ledger \
  --suite starter \
  --provider openai \
  --model gpt-4o-mini
```
Results are written to `results.jsonl` by default; use `--output` to change the path.
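For example (the output path here is arbitrary):

```bash
compactbench run \
  --method built-in:hybrid-ledger \
  --suite starter \
  --provider groq \
  --model llama-3.3-70b-versatile \
  --output runs/groq-starter.jsonl
```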
## Inspect a single generated case
The same template + seed always produces the same case, so this is useful for debugging your compactor against a specific failure.
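The exact inspection command isn't shown here; a plausible shape, where the subcommand and flag names are assumptions:

```bash
# Hypothetical subcommand and flags; verify against `compactbench --help`.
compactbench case show \
  --template <template-id> \
  --seed 42
```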
## Score results
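Assuming the scorer is exposed as a `score` subcommand that reads the run output (the subcommand name is an assumption):

```bash
compactbench score results.jsonl
```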
This prints a per-case breakdown plus run-level metrics: overall score, drift resistance, constraint retention, contradiction rate, and compression ratio.