Elite program
The Elite program is CompactBench's hardest benchmark track. It is not a single fixed test pack but a versioned program maintained with explicit governance.
Goal
The hardest publicly operated, defensible, anti-overfitting benchmark program for AI conversation compaction methods we can build and continuously maintain.
A one-time static benchmark cannot honestly sustain a "hardest" claim. Elite is maintained as a program so the claim can hold.
Structure
| Track | Visibility | Purpose |
|---|---|---|
| Elite Practice | public templates, public seeds | transparent practice, debugging, education |
| Elite Ranked | hidden templates, rotating seeds | official leaderboard qualification |
| Elite Shadow | never exposed | internal anti-gaming validation of top methods |
Users develop against Practice and submit against Ranked; top methods are periodically re-checked on Shadow.
Failure modes targeted
Elite cases stress the compaction failure modes where weak methods break down:
- Buried critical constraints
- Late decision overrides
- Negative-rule preservation ("never do X")
- Similar-entity confusion
- Exception precedence (general rule + specific exception)
- Resolved vs unresolved task continuity
- Cross-cycle drift degradation
- Counterfactual contamination (rejected branches staying dead)
- Semantic camouflage
- Compression pressure (methods that barely compress should not dominate)
Launch families (v1)
Three families ship with the first release:
| Family | Stress mode |
|---|---|
| buried_constraint_v1 | constraint survives deep in a noisy transcript |
| decision_override_v1 | later decision overrides an earlier one |
| entity_confusion_v1 | multiple entities with overlapping names and roles |
Each family ships with 5 public practice variations and 20 hidden ranked variations. Additional families are added post-launch.
Current version: Elite v1.0 (released 2026-04-17). The full changelog, seed-group policy, and retirement log are in governance/elite-versions.md.
Compression tiers
Elite segments the leaderboard by compression tier so aggressive methods never compete directly with near-pass-through ones:
- Elite-Light — ≥ 2× compression
- Elite-Mid — ≥ 4× compression
- Elite-Aggressive — ≥ 8× compression
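Because the three thresholds overlap (an 8× method also satisfies ≥ 2×), tier placement needs a tie-breaking rule. The sketch below assumes a method lands in the most aggressive tier it qualifies for, and that compression is measured as original tokens divided by compacted tokens. Neither assumption is stated in this page, and `compression_ratio` and `assign_tier` are hypothetical names, not CompactBench APIs.

```python
from typing import Optional

# Thresholds come from the tier list; checking them highest-first is an assumption.
TIERS = [
    ("Elite-Aggressive", 8.0),
    ("Elite-Mid", 4.0),
    ("Elite-Light", 2.0),
]


def compression_ratio(original_tokens: int, compacted_tokens: int) -> float:
    """Assumed definition: original size divided by compacted size."""
    if original_tokens <= 0 or compacted_tokens <= 0:
        raise ValueError("token counts must be positive")
    return original_tokens / compacted_tokens


def assign_tier(ratio: float) -> Optional[str]:
    """Return the most aggressive tier whose minimum the ratio meets, else None."""
    for name, minimum in TIERS:
        if ratio >= minimum:
            return name
    return None  # below 2x: does not qualify for any Elite tier
```

Under these assumptions, a method that compacts 8,000 tokens down to 1,000 would be placed in Elite-Aggressive, while one that reaches only 1.5× would not be placed in any tier.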
Multi-cycle requirement
Elite Ranked runs measure drift across three evaluation cycles:
- cycle 0: initial evaluation after first compaction
- cycle 1: continue → compact → evaluate
- cycle 2: continue → compact → evaluate
Shadow validates top methods over three to five cycles.
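The cycle structure above can be read as a simple loop. The following is a minimal sketch, assuming a transcript is a list of turns and that `compact` and `evaluate` are caller-supplied callables. All names here are hypothetical, and the toy compactor and scorer exist only to show how drift surfaces across cycles.

```python
from typing import Callable, List

Turns = List[str]


def run_drift_cycles(
    transcript: Turns,
    continuations: List[Turns],
    compact: Callable[[Turns], Turns],
    evaluate: Callable[[Turns], float],
) -> List[float]:
    """Return one score per cycle: cycle 0 plus one per continuation."""
    state = compact(transcript)               # cycle 0: first compaction
    scores = [evaluate(state)]
    for extra_turns in continuations:         # cycles 1..n
        state = compact(state + extra_turns)  # continue, then compact again
        scores.append(evaluate(state))        # evaluate the re-compacted state
    return scores


# Toy stand-ins: a compactor that keeps only the last four turns, and a
# scorer that checks whether a negative rule survived compaction.
def keep_last_four(turns: Turns) -> Turns:
    return turns[-4:]


def rule_survived(turns: Turns) -> float:
    return 1.0 if any("never deploy" in t for t in turns) else 0.0


scores = run_drift_cycles(
    ["rule: never deploy on Friday", "chat a", "chat b", "chat c"],
    [["chat d", "chat e"], ["chat f"]],
    keep_last_four,
    rule_survived,
)
print(scores)  # [1.0, 0.0, 0.0]
```

The toy compactor passes cycle 0 and then loses the rule once the conversation continues, which is the kind of degradation a single-cycle evaluation would not reveal.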
Anti-gaming controls
- Hidden ranked cases never leak through public APIs or published artifacts.
- Seed groups rotate on leaderboard version changes.
- Shadow evaluations spot-check the top of the leaderboard.
- Methods must score ≥ 0.40 on every family, so a specialist that collapses on one family cannot win overall.
- Suspicion heuristics flag unusual patterns (low public variance + hidden collapse, extreme family concentration, brittle shadow results) for maintainer review.
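The per-family floor in the list above can be illustrated with a small eligibility check. This is a sketch under stated assumptions: scores are keyed by family name and fall in the range 0 to 1, a missing family counts as a failure, and the function names are hypothetical rather than part of CompactBench.

```python
from typing import Dict, Iterable, List

FAMILY_FLOOR = 0.40  # threshold taken from the rule above


def families_below_floor(
    scores: Dict[str, float], floor: float = FAMILY_FLOOR
) -> List[str]:
    """Return the sorted names of families scoring under the floor."""
    return sorted(name for name, score in scores.items() if score < floor)


def is_rank_eligible(
    scores: Dict[str, float],
    required_families: Iterable[str],
    floor: float = FAMILY_FLOOR,
) -> bool:
    """Eligible only if every required family is present and meets the floor."""
    required = list(required_families)
    if any(family not in scores for family in required):
        return False  # assumption: an unscored family counts as a failure
    return not families_below_floor({f: scores[f] for f in required}, floor)


FAMILIES = ["buried_constraint_v1", "decision_override_v1", "entity_confusion_v1"]

# A specialist with a higher mean but one collapsed family.
specialist = {
    "buried_constraint_v1": 0.95,
    "decision_override_v1": 0.92,
    "entity_confusion_v1": 0.31,
}
# A generalist with a lower mean but no collapse.
generalist = {family: 0.60 for family in FAMILIES}
```

With these illustrative numbers, the specialist has the higher mean score but fails the check on entity_confusion_v1, while the generalist passes.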
Governance
- Elite versions are reviewed quarterly.
- Compromised template families are retired.
- New families are added after observed gaming or new failure modes.
- Historical leaderboards stay pinned to their original Elite version.
- A public changelog records every version transition.
Product-copy guidance
Safe to claim:
- "ranked on the hardest benchmark program we maintain"
- "evaluated on hidden Elite cases"
- "tested under multi-cycle drift"
- "scored on adversarial state-fidelity benchmarks"
Not safe without continuous external validation:
- "hardest test in all of AI"
- "best benchmark in the world"
- "impossible to game"