
Elite program

The Elite program is CompactBench's hardest benchmark track. It is not a single fixed test pack — it is a versioned program maintained with explicit governance.

Goal

To build and continuously maintain the hardest publicly operated, defensible, anti-overfitting benchmark program for AI conversation compaction methods.

A one-time static benchmark cannot honestly sustain a "hardest" claim. Elite is maintained as a program so the claim can hold.

Structure

Track           Visibility                        Purpose
Elite Practice  public templates, public seeds    transparent practice, debugging, education
Elite Ranked    hidden templates, rotating seeds  official leaderboard qualification
Elite Shadow    never exposed                     internal anti-gaming validation of top methods

Users develop against Practice, submit against Ranked, and top methods are periodically re-checked on Shadow.
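The table above maps cleanly onto a small data model. The sketch below is illustrative only, not a CompactBench API; the class and field names are invented here just to encode the visibility rules.

```python
from dataclasses import dataclass

# Illustrative only, not a CompactBench API. The class and field names are
# invented here to encode the visibility rules from the table above.
@dataclass(frozen=True)
class Track:
    name: str
    templates_public: bool
    seeds: str                 # "public", "rotating", or "never exposed"
    accepts_submissions: bool  # Shadow is run internally by maintainers only

TRACKS = [
    Track("Elite Practice", True, "public", True),
    Track("Elite Ranked", False, "rotating", True),
    Track("Elite Shadow", False, "never exposed", False),
]
```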

Failure modes targeted

Elite cases stress the compaction failure modes where weak methods break down (a toy check for two of them follows the list):

  • Buried critical constraints
  • Late decision overrides
  • Negative-rule preservation ("never do X")
  • Similar-entity confusion
  • Exception precedence (general rule + specific exception)
  • Resolved vs unresolved task continuity
  • Cross-cycle drift degradation
  • Counterfactual contamination (rejected branches staying dead)
  • Semantic camouflage
  • Compression pressure (methods that barely compress should not dominate)
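As a concrete illustration of two of these modes, here is a toy check in the spirit of the benchmark. It is a minimal sketch, not the CompactBench grader: the transcript content and keyword matching are invented for illustration, where the real harness would use rubric-based evaluation.

```python
# Toy illustration of two failure modes above; keyword matching stands in for
# the rubric-based grading a real harness would use.

def preserves_negative_rule(summary: str) -> bool:
    # Negative-rule preservation: "never push directly to main" must survive compaction.
    text = summary.lower()
    return "never" in text and "main" in text

def reflects_late_override(summary: str) -> bool:
    # Late decision override: the summary must carry the final decision (Postgres),
    # not the earlier, superseded one (SQLite).
    text = summary.lower()
    return "postgres" in text and "sqlite" not in text

compacted = "Decision: use Postgres. Standing rule: never push directly to main."
assert preserves_negative_rule(compacted) and reflects_late_override(compacted)
```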

Launch families (v1)

Three families ship with the first release:

Family                Stress mode
buried_constraint_v1  constraint survives deep in a noisy transcript
decision_override_v1  later decision overrides an earlier one
entity_confusion_v1   multiple entities with overlapping names and roles

Each family ships with 5 public practice variations and 20 hidden ranked variations. Additional families are added post-launch.
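A hypothetical manifest makes the family/variation split concrete. The layout and key names below are assumptions; only the family ids, stress modes, and variation counts come from this page.

```python
# Hypothetical manifest; key names are invented, counts and family ids are from this page.
LAUNCH_FAMILIES = {
    "buried_constraint_v1": {
        "stress_mode": "constraint survives deep in a noisy transcript",
        "public_practice_variations": 5,
        "hidden_ranked_variations": 20,
    },
    "decision_override_v1": {
        "stress_mode": "later decision overrides an earlier one",
        "public_practice_variations": 5,
        "hidden_ranked_variations": 20,
    },
    "entity_confusion_v1": {
        "stress_mode": "multiple entities with overlapping names and roles",
        "public_practice_variations": 5,
        "hidden_ranked_variations": 20,
    },
}
```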

Current version: Elite v1.0 (released 2026-04-17). Full changelog, seed-group policy, and retirement log in governance/elite-versions.md.

Compression tiers

Elite segments the leaderboard by compression tier so aggressive methods never compete directly with near-pass-through ones (tier assignment is sketched after the list):

  • Elite-Light — ≥ 2× compression
  • Elite-Mid — ≥ 4× compression
  • Elite-Aggressive — ≥ 8× compression
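A sketch of tier assignment, assuming the compression ratio is measured as original tokens divided by compacted tokens and that a method is placed in the highest tier whose threshold it clears; both assumptions are ours, only the 2×/4×/8× thresholds come from the list above.

```python
# Sketch only: the token-based ratio and "highest tier wins" placement are assumptions;
# the 2x / 4x / 8x thresholds are the ones listed above.
def compression_ratio(original_tokens: int, compacted_tokens: int) -> float:
    return original_tokens / max(compacted_tokens, 1)

def elite_tier(ratio: float) -> str | None:
    if ratio >= 8:
        return "Elite-Aggressive"
    if ratio >= 4:
        return "Elite-Mid"
    if ratio >= 2:
        return "Elite-Light"
    return None  # below 2x compression: no Elite tier

assert elite_tier(compression_ratio(12_000, 2_500)) == "Elite-Mid"  # 4.8x
```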

Multi-cycle requirement

Every Elite Ranked run includes drift cycles:

  • cycle 0: initial evaluation after first compaction
  • cycle 1: continue → compact → evaluate
  • cycle 2: continue → compact → evaluate

Shadow runs extend this to three-to-five-cycle validations of top methods.
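A minimal sketch of the cycle structure, assuming three hypothetical hooks (compact, continue_conversation, evaluate) that are placeholders rather than CompactBench APIs:

```python
# Hypothetical harness loop; the method/case interfaces are placeholders.
def run_drift_cycles(transcript, method, case, n_cycles: int = 3) -> list[float]:
    scores = []
    state = method.compact(transcript)   # cycle 0: evaluate after the first compaction
    scores.append(case.evaluate(state))
    for _ in range(1, n_cycles):         # cycles 1..n-1: continue -> compact -> evaluate
        transcript = case.continue_conversation(state)
        state = method.compact(transcript)
        scores.append(case.evaluate(state))
    return scores  # Ranked uses 3 cycles; Shadow extends the same loop to 3-5
```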

Anti-gaming controls

  • Hidden ranked cases never leak through public APIs or published artifacts.
  • Seed groups rotate on leaderboard version changes.
  • Shadow evaluations spot-check the top of the leaderboard.
  • Methods must maintain ≥ 0.40 on every family — a specialist that collapses on one family cannot win overall (see the sketch after this list).
  • Suspicion heuristics flag unusual patterns (low public variance + hidden collapse, extreme family concentration, brittle shadow results) for maintainer review.
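The per-family floor in particular is easy to state in code. A minimal sketch, assuming scores maps family ids to a method's aggregate score per family; only the 0.40 threshold comes from the rule above.

```python
# Sketch of the per-family floor; the scores structure is an assumption, the 0.40 floor is not.
FAMILY_FLOOR = 0.40

def passes_family_floor(scores: dict[str, float]) -> bool:
    # A specialist that collapses on any single family cannot win overall.
    return bool(scores) and all(s >= FAMILY_FLOOR for s in scores.values())

assert passes_family_floor(
    {"buried_constraint_v1": 0.71, "decision_override_v1": 0.55, "entity_confusion_v1": 0.48}
)
assert not passes_family_floor(
    {"buried_constraint_v1": 0.92, "decision_override_v1": 0.12, "entity_confusion_v1": 0.88}
)
```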

Governance

  • Elite versions are reviewed quarterly.
  • Compromised template families are retired.
  • New families are added in response to observed gaming or newly identified failure modes.
  • Historical leaderboards stay pinned to their original Elite version.
  • A public changelog records every version transition.

Product-copy guidance

Safe to claim:

  • "ranked on the hardest benchmark program we maintain"
  • "evaluated on hidden Elite cases"
  • "tested under multi-cycle drift"
  • "scored on adversarial state-fidelity benchmarks"

Not safe without continuous external validation:

  • "hardest test in all of AI"
  • "best benchmark in the world"
  • "impossible to game"