CompactBench — Workorders (OSS v1)

Implementation plan from post-scaffold to public PyPI launch. Each workorder has objective, scope, deliverables, dependencies, and acceptance criteria.

Formulas, schemas, and policy numbers referenced below are locked in decisions.md.

Phase summary

| Phase | Workorders | Outcome |
| --- | --- | --- |
| 0 — Scaffold | WO-001 | Package installs, tests pass |
| 1 — Core engine | WO-002, WO-003, WO-004 | Generate and score cases deterministically |
| 2 — Methods & providers | WO-005, WO-006 | Four built-ins run against real + mock models |
| 3 — Runtime | WO-007 | Full run end-to-end via CLI with drift cycles |
| 4 — Leaderboard & launch | WO-008, WO-009, WO-010 | Public leaderboard, hidden-set runner, PyPI release |

WO-001 — Python package scaffold ✅

Objective

Installable Python package with CLI entry point, typed contracts, CI pipeline, docs skeleton.

Status

Complete.

Deliverables

  • pyproject.toml (uv-managed, hatchling build)
  • src/compactbench/ with cli.py, contracts/, package ABCs
  • tests/unit/ with smoke + contracts tests
  • .github/workflows/ci.yml, issue + PR templates, CODEOWNERS
  • MkDocs site skeleton, Apache 2.0 license, Contributor Covenant
  • CONTRIBUTING.md, CHANGELOG.md

Acceptance criteria

  • uv sync --all-extras --dev succeeds
  • uv run pytest -m unit passes
  • uv run ruff check . passes
  • uv run pyright passes

WO-002 — Template DSL parser, generators, and schema validation

Objective

Turn a YAML template file into a validated, deterministic template object ready for case generation.

Scope

  • JSON Schema for template files (benchmarks/schemas/template-v1.json)
  • Parser in src/compactbench/dsl/parser.py — YAML → TemplateDefinition pydantic model
  • Seeded generator registry in src/compactbench/dsl/generators.py: person_name, action_phrase, project_noun, date_iso, amount_usd, product_sku, org_name
  • Validator in src/compactbench/dsl/validator.py — schema + semantic checks (variable references match declarations, difficulty policy well-formed)
  • Handlebars-style {{variable}} substitution
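
A minimal sketch of how the seeded generators and `{{variable}}` substitution could fit together (the `GENERATORS` dict, `resolve_variables`, and `render` names are illustrative, not the locked API):

```python
import random
import re

# Illustrative registry; the real one lives in src/compactbench/dsl/generators.py.
GENERATORS = {
    "person_name": lambda rng: rng.choice(["Ada Park", "Leo Ibarra", "Mei Tanaka"]),
    "amount_usd": lambda rng: f"${rng.randint(100, 9_999):,}",
}

def resolve_variables(declared: dict[str, str], seed: int) -> dict[str, str]:
    """Give each declared variable its own seeded PRNG stream."""
    return {
        name: GENERATORS[gen_key](random.Random(f"{seed}:{name}"))
        for name, gen_key in declared.items()
    }

def render(template_text: str, values: dict[str, str]) -> str:
    """Handlebars-style {{variable}} substitution."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template_text)

values = resolve_variables({"owner": "person_name", "budget": "amount_usd"}, seed=42)
print(render("{{owner}} approved a budget of {{budget}}.", values))
```

Seeding per variable name rather than sharing one stream keeps output stable when a template adds or removes variables.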

Dependencies

WO-001.

Deliverables

  • Three sample templates in benchmarks/public/starter/
  • Round-trip tests: parse → validate → serialize
  • Property tests (hypothesis) for generator determinism
  • compactbench suites list wires up

Acceptance criteria

  • Same seed always produces same resolved template
  • Invalid templates fail validation with actionable errors
  • All seeded generators return deterministic output for a given seed
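
The determinism criteria translate directly into a hypothesis property test; a sketch reusing the illustrative `resolve_variables` from the scope above:

```python
from hypothesis import given, strategies as st

@given(seed=st.integers(min_value=0, max_value=2**32 - 1))
def test_generators_deterministic_for_seed(seed: int) -> None:
    declared = {"owner": "person_name", "budget": "amount_usd"}
    # Two independent resolutions with the same seed must agree exactly.
    assert resolve_variables(declared, seed) == resolve_variables(declared, seed)
```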

WO-003 — Deterministic case generation engine

Objective

Produce a concrete GeneratedCase from (template, seed, difficulty) with byte-identical output across runs.

Scope

  • src/compactbench/engine/seeds.py — seed derivation from (suite_version, seed_group, case_slot) using SHA-256 → PRNG (see the sketch after this list)
  • src/compactbench/engine/difficulty.py — apply difficulty policy (distractor count, paraphrase depth, override timing)
  • src/compactbench/engine/transcript.py — assemble the turn sequence
  • src/compactbench/engine/generation.py — orchestrate: template + seed + difficulty → transcript + ground truth + evaluation items
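
One plausible shape for the SHA-256 → PRNG derivation named in the first scope item; the exact field order and encoding are locked in decisions.md, so treat this as a sketch:

```python
import hashlib
import random

def derive_rng(suite_version: str, seed_group: int, case_slot: int) -> random.Random:
    """Derive a case-local PRNG by hashing the case identity.

    Going through SHA-256 keeps streams independent per case and stable
    across platforms and Python versions (no hash randomization involved).
    """
    material = f"{suite_version}|{seed_group}|{case_slot}".encode()
    digest = hashlib.sha256(material).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# Hypothetical suite version string, for illustration only.
rng_a = derive_rng("elite_practice@1.0", seed_group=3, case_slot=17)
rng_b = derive_rng("elite_practice@1.0", seed_group=3, case_slot=17)
assert rng_a.random() == rng_b.random()
```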

Dependencies

WO-002.

Deliverables

  • compactbench generate --template <key> --seed <int> works end-to-end
  • Determinism regression test fixture (serialized expected output checked into tests/fixtures/)
  • Per-difficulty golden-output tests

Acceptance criteria

  • Identical seeds produce byte-identical cases across Python minor versions and platforms
  • Difficulty knobs measurably change output (more distractors at hard, later overrides at elite, etc.)
  • Generation fully isolated from DB, HTTP, filesystem (pure function of inputs)

WO-004 — Scoring engine

Objective

Score a set of evaluation responses against the case's ground truth, producing per-cycle and per-case metrics.

Scope

  • src/compactbench/scoring/checks.py — check types: contains_normalized, forbidden_absent, set_match, exact, numeric_within, entity_consistent
  • src/compactbench/scoring/scorer.py — weighted aggregation using the decisions.md §B3 weights
  • src/compactbench/scoring/contradictions.py — detect locked_decision / forbidden_behavior violations in responses
  • src/compactbench/scoring/drift.py — drift delta + drift resistance across cycles
  • src/compactbench/scoring/compression.py — cl100k-based compression ratio
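
Two of these metrics have simple reference forms. The compression ratio is cl100k tokens of the artifact over the raw transcript; for drift resistance the locked formula lives in decisions.md, so the ratio below (final cycle over first cycle) is only an assumed stand-in that matches the acceptance criteria:

```python
import tiktoken

_ENC = tiktoken.get_encoding("cl100k_base")

def compression_ratio(transcript: str, artifact_text: str) -> float:
    """cl100k token count of the compacted artifact over the raw transcript."""
    return len(_ENC.encode(artifact_text)) / len(_ENC.encode(transcript))

def drift_resistance(cycle_scores: list[float]) -> float:
    """Assumed form: last-cycle score relative to the first cycle.

    1.0 for a method that holds every cycle, < 1.0 for a degrading one;
    capped so improvement cannot exceed 1.0 (an assumption, not the spec).
    """
    if not cycle_scores or cycle_scores[0] == 0:
        return 0.0  # zero-denominator safeguard
    return min(cycle_scores[-1] / cycle_scores[0], 1.0)

assert drift_resistance([0.9, 0.9, 0.9]) == 1.0
assert drift_resistance([0.9, 0.8, 0.6]) < 1.0
```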

Dependencies

WO-002 (for contract access), WO-003 (for case fixtures in tests).

Deliverables

  • compactbench score --results results.jsonl prints run-level summary
  • Scorer returns a pydantic Scorecard object matching src/compactbench/contracts/result.py
  • Coverage >= 90% for scoring package

Acceptance criteria

  • Boundary cases: all-pass, all-fail, single-item runs, zero-denominator safeguards
  • Compression ratio matches hand-computed value for a known transcript
  • Drift resistance = 1.0 for a method that holds every cycle; < 1.0 for a degrading method

WO-005 — Built-in compactors

Objective

Ship four baseline compaction methods. All validate against the artifact schema.

Scope

  • src/compactbench/compactors/naive_summary.py — single summarization call, empty structured state
  • src/compactbench/compactors/structured_state.py — JSON-schema-forced extraction, no summaryText
  • src/compactbench/compactors/hierarchical_summary.py — chunk → summarize → meta-summarize
  • src/compactbench/compactors/hybrid_ledger.py — append-only ledger across cycles via previous_artifact
  • Registry in src/compactbench/compactors/__init__.py for built-in:<key> resolution
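
To make the contract concrete, a stripped-down sketch of the naive summarizer and the built-in registry; `CompactionArtifact` here is a stand-in dataclass, and the structuredState section names are assumptions (the real pydantic contracts live under src/compactbench/contracts/):

```python
from dataclasses import dataclass, field

@dataclass
class CompactionArtifact:  # stand-in for the real contract model
    summary_text: str
    structured_state: dict[str, list] = field(default_factory=dict)

class NaiveSummaryCompactor:
    """Single summarization call; structured state is present but empty."""

    def __init__(self, provider) -> None:
        self.provider = provider

    def compact(self, transcript: str, previous_artifact=None) -> CompactionArtifact:
        summary = self.provider.complete(f"Summarize this conversation:\n\n{transcript}")
        # Empty arrays are permitted; missing structuredState keys are not,
        # so every section is emitted even when it has nothing to say.
        return CompactionArtifact(
            summary_text=summary,
            structured_state={"decisions": [], "entities": [], "constraints": []},
        )

# built-in:<key> resolution via a plain dict.
BUILTINS = {"naive-summary": NaiveSummaryCompactor}
```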

Dependencies

WO-003, WO-006 (needs provider to call a model).

Deliverables

  • Each compactor runnable via compactbench run --method built-in:<key>
  • Integration tests using the mock provider for deterministic output
  • Each method has a README section documenting its prompt template and config schema

Acceptance criteria

  • All four return valid CompactionArtifact on every call
  • None skip any structuredState section (empty arrays permitted, missing keys not)
  • Hybrid-ledger accumulates across cycles (verified with 2-cycle integration test)

WO-006 — Provider clients

Objective

Model provider abstractions with concrete implementations for Groq, Google AI Studio, Ollama, and a deterministic mock.

Scope

  • src/compactbench/providers/base.py — already scaffolded; review + extend
  • src/compactbench/providers/groq.py — async client, 429 backoff, typed errors
  • src/compactbench/providers/google_ai_studio.py — same
  • src/compactbench/providers/ollama.py — local HTTP client
  • src/compactbench/providers/mock.py — prompt-hash → canned response map, deterministic (sketched after this list)
  • src/compactbench/providers/registry.py resolves --provider <key> to the right client
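
The mock provider's determinism can come straight from a prompt hash; a sketch, assuming a synchronous `complete(prompt)` method on the base class:

```python
import hashlib

class MockProvider:
    """Fully offline: maps a prompt hash to a canned response."""

    def __init__(self, canned: dict[str, str], default: str = "OK.") -> None:
        self.canned = canned
        self.default = default

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()[:16]

    def complete(self, prompt: str) -> str:
        # Unknown prompts fall back to a fixed default so a missing
        # fixture degrades deterministically instead of failing.
        return self.canned.get(self._key(prompt), self.default)
```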

Dependencies

WO-001.

Deliverables

  • compactbench providers list shows all registered providers
  • Per-provider integration test (skipped in CI when no key is configured)
  • Contract test: all providers return identical response shape for identical request

Acceptance criteria

  • Mock provider is fully offline and deterministic
  • Real providers respect env-var config from COMPACTBENCH_* namespace
  • 429s from Groq/Google produce backoff retries, not hard failures
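
For the 429 criterion, a bare-bones jittered exponential backoff along these lines would do; attempt counts and delays are placeholders, not tuned policy:

```python
import asyncio
import random

class RateLimitError(Exception):
    """Typed error a provider client raises on HTTP 429."""

async def with_backoff(call, max_attempts: int = 5):
    """Retry an async zero-arg callable on 429s instead of failing hard."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Exponential delay plus jitter to avoid thundering herds.
            await asyncio.sleep(2**attempt + random.random())
```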

WO-007 — Runner, CLI orchestration, drift cycles

Objective

End-to-end: compactbench run generates cases, calls the compactor, invokes the model, scores results, writes results.jsonl, supports drift cycles.

Scope

  • src/compactbench/runner/run.py — orchestrate a full run
  • src/compactbench/runner/cycle.py — drift cycle execution with continuation prompts
  • src/compactbench/runner/persistence.py — JSONL writer, resume support (see the sketch after this list)
  • Wire up compactbench run and compactbench score command bodies
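
Resume support can be as simple as replaying already-written case IDs from results.jsonl before scheduling work; the `case_id` field name is an assumption about the RunResult schema:

```python
import json
from pathlib import Path

def completed_case_ids(results_path: Path) -> set[str]:
    """Collect case IDs already persisted, so a restarted run skips them
    and completed cases are never double-counted."""
    if not results_path.exists():
        return set()
    with results_path.open() as fh:
        return {json.loads(line)["case_id"] for line in fh if line.strip()}

# In the runner (sketch):
# pending = [c for c in planned_cases if c.case_id not in completed_case_ids(path)]
```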

Dependencies

WO-004, WO-005, WO-006.

Deliverables

  • Full run on starter suite using mock provider completes in < 30 seconds
  • Full run on elite_practice with Groq completes and produces a valid RunResult
  • --resume flag picks up from last completed case

Acceptance criteria

  • results.jsonl matches the RunResult schema from src/compactbench/contracts/result.py
  • Drift cycles execute in order; later cycles can access previous_artifact
  • Interrupted runs resume cleanly without double-counting completed cases

WO-008 — GitHub Actions submission flow + leaderboard site

Objective

PR-based submission pipeline that evaluates methods against the hidden set on the self-hosted runner and publishes the public leaderboard.

Scope

  • .github/workflows/evaluate-submission.yml — triggers on evaluate label; runs on self-hosted runner; clones hidden repo; executes method against Elite Ranked; comments score back to PR
  • .github/workflows/update-leaderboard.yml — runs on merge of a qualified submission; updates site/data/leaderboard.json; rebuilds static site; deploys to GitHub Pages
  • site/ — static leaderboard site (Astro or MkDocs plugin); deployed to compactbench.dev
  • src/compactbench/leaderboard/{ranking,qualification,projection}.py — ranking formula + floors + JSON projection
  • Submission scaffolding in submissions/_template/
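
The projection step can be a pure function from qualified entries to the published JSON, keyed by the segmentation the acceptance criteria require; every field name below is hypothetical:

```python
import json
from collections import defaultdict

def project_leaderboard(entries: list[dict]) -> str:
    """Group entries by (benchmark_version, target_model), rank each
    segment by score descending, and emit stable, sorted JSON so the
    published file is deterministic for a given set of merges."""
    segments: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for entry in entries:
        segments[(entry["benchmark_version"], entry["target_model"])].append(entry)
    projected = {
        f"{version}/{model}": sorted(group, key=lambda e: e["score"], reverse=True)
        for (version, model), group in segments.items()
    }
    return json.dumps(projected, indent=2, sort_keys=True)
```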

Dependencies

WO-007.

Deliverables

  • End-to-end: a dummy PR with submissions/test/sample-method/ triggers evaluation on the runner and posts scores to the PR
  • Leaderboard site live at compactbench.dev/leaderboard
  • compactbench submit command drafts a PR-ready submission directory from a local results.jsonl

Acceptance criteria

  • Hidden test content never appears in public workflow logs
  • Provider keys never leak into PR-author-runnable workflows
  • Leaderboard updates deterministically on merge
  • Leaderboard segmented by benchmark version + target model

WO-009 — Launch Elite templates (3 families)

Objective

Ship the three launch families as public practice + hidden ranked content.

Scope

  • Public practice templates in benchmarks/public/elite_practice/:
      • buried_constraint_v1.yaml (5 variations)
      • decision_override_v1.yaml (5 variations)
      • entity_confusion_v1.yaml (5 variations)
  • Hidden ranked templates in the private compactbench-hidden repo:
      • 20 seed variations per family (60 total)
  • Governance doc at docs/governance/elite-versions.md — v1 Elite changelog
  • Seed group policy: each leaderboard version draws a fixed seed group per template

Dependencies

WO-002, WO-003, WO-008.

Deliverables

  • Public practice runs of all three families succeed end-to-end with hybrid-ledger baseline
  • Hidden ranked runs on the self-hosted runner produce valid scorecards
  • Governance doc explains version policy, retirement policy, seed rotation

Acceptance criteria

  • All three families cover their documented failure mode (buried constraint / decision override / entity confusion)
  • Hidden content is never referenced in public repo commit history
  • Determinism regression tests for all 15 public seeds

WO-010 — Docs site, PyPI release, launch announcement

Objective

First public 0.1.0 release on PyPI, docs site live, launch communications prepared.

Scope

  • Polish all MkDocs pages: getting-started, writing-a-compactor, submitting, methodology, elite-program
  • Add a "Try it in 3 minutes" quickstart with Ollama
  • .github/workflows/release.yml — tag push → build wheels → PyPI publish (trusted publishing via OIDC)
  • Populate CHANGELOG.md for 0.1.0
  • Launch post draft (HN Show, r/LocalLLaMA, AI Twitter)
  • Verify domain resolves, Pages deploys, PyPI install works

Dependencies

All prior workorders.

Deliverables

  • pip install compactbench works against PyPI
  • compactbench.dev serves docs + leaderboard
  • Launch post reviewed and ready to ship

Acceptance criteria

  • A cold user can go from pip install compactbench to seeing their first run in under 3 minutes (with Ollama pre-installed)
  • Every public doc page renders and links resolve
  • PyPI package metadata is correct (keywords, classifiers, project URLs)
  • First leaderboard version is frozen and documented

Post-v1 roadmap (out of scope here)

  • Elite families 4–15 (monthly release cadence)
  • Automated shadow evaluation scheduling
  • Anthropic / OpenAI / Google Vertex provider integrations
  • Research paper + arXiv submission
  • Sponsorship acceptance
  • Third-party Dockerized runner distribution