
CompactBench — Technical Decisions (v1, locked)

Status

  • Version: 1.0 (OSS)
  • Date locked: 2026-04-16
  • Scope: canonical reference for all workorders through v1 PyPI release.
  • Supersedes: an earlier SaaS-era decisions document (no longer retained in the repo).

CompactBench is an open-source Python package distributed on PyPI. The leaderboard is operated by maintainers; submissions arrive as PRs and are evaluated by a maintainer-gated GitHub Actions workflow with access to the hidden test set and provider API keys (see Part E).


Part A — Language, tooling, and distribution

A1. Language

Python 3.12 target; minimum supported 3.11; tested on 3.11 / 3.12 / 3.13 in CI.

A2. Package manager

uv (Astral). uv sync is the one command contributors run after clone.

A3. Build backend

Hatchling. Declared in pyproject.toml; produces wheels and sdists.

A4. Linting + formatting

Ruff. One tool for both. Config in pyproject.toml.

A5. Type checking

Pyright in strict mode. py.typed marker ships with the package.

A6. Testing

pytest + pytest-asyncio + pytest-cov + hypothesis (for property-based determinism tests).
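A minimal sketch of the kind of property-based determinism test hypothesis is here for; `sample_generator` and its lexicon are stand-ins, not the package's real generator API:

```python
# Illustrative hypothesis property test for generator determinism.
# `sample_generator` is a stand-in, not CompactBench's actual generator.
import random

from hypothesis import given, strategies as st

LEXICON = ["alpha", "beta", "gamma", "delta"]


def sample_generator(seed: int) -> str:
    # A seeded draw from a bounded lexicon (compare B1).
    return random.Random(seed).choice(LEXICON)


@given(seed=st.integers(min_value=0, max_value=2**32 - 1))
def test_generator_is_deterministic(seed: int) -> None:
    # The same seed must always produce the same value.
    assert sample_generator(seed) == sample_generator(seed)
```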

A7. CLI framework

Typer + Rich. Entry point: compactbench.
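A hedged sketch of what the Typer + Rich entry point looks like; the `run` command and its `suite` option are illustrative, not the shipped CLI surface:

```python
# Minimal Typer + Rich entry point sketch; command names are illustrative only.
import typer
from rich.console import Console

app = typer.Typer(help="CompactBench command-line interface.")
console = Console()


@app.command()
def run(suite: str = "starter") -> None:
    """Run a benchmark suite (hypothetical signature)."""
    console.print(f"[bold]Running suite:[/bold] {suite}")


if __name__ == "__main__":
    app()
```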

A8. Schema validation

pydantic v2 for runtime models; JSON Schema (draft 2020-12) for templates and artifacts; jsonschema library for external artifact validation.

A9. YAML

ruamel.yaml (round-trip support for template authoring).

A10. HTTP

httpx (sync + async; used by all provider clients).

A11. Tokenizer

tiktoken, cl100k_base encoding. Applied to every method × model pair so compression ratios are comparable.
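The counting itself is a one-liner; using a single fixed encoding is what keeps ratios comparable across providers:

```python
# Token counting with tiktoken's cl100k_base encoding, as described above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str) -> int:
    # One fixed encoding for every method x model pair keeps ratios comparable.
    return len(enc.encode(text))
```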

A12. Docs site

MkDocs Material, served from GitHub Pages on the compactbench.dev apex domain.

A13. Leaderboard site

Static site on the same GitHub Pages deployment, generated from site/data/leaderboard.json by a GH Actions workflow on merge of a qualified submission.

A14. License

Apache License 2.0. Patent grant matters when third parties submit methods.

A15. Pre-commit

Enabled by default. ruff (check + format) + pyright + basic hygiene hooks.


Part B — Product content (carried forward from SaaS decisions)

The mathematical and structural decisions that defined the product are unchanged. Where a workorder needs an exact formula, it lives here.

B1. Benchmark template DSL

YAML templates with Handlebars-style {{variable}} substitution, JSON Schema validation of the template file itself, and deterministic seeded generators.

Seeded generators: person_name, action_phrase, project_noun, date_iso, amount_usd, product_sku, org_name. Each draws from a bounded lexicon and advances a seeded PRNG.

Canonical examples ship at benchmarks/public/starter/*.yaml and are parsed by src/compactbench/dsl/parser.py.
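A short sketch of the two moving parts, a seeded generator over a bounded lexicon and {{variable}} substitution; the lexicon contents and function names are examples, not the shipped data or parser API:

```python
# Illustrative sketch of a seeded generator plus Handlebars-style substitution.
# Lexicon contents and names are examples, not CompactBench's shipped data.
import random
import re

PERSON_NAMES = ["Alice Chen", "Marcus Webb", "Priya Nair", "Tomas Ruiz"]


def person_name(rng: random.Random) -> str:
    # Each generator draws from a bounded lexicon and advances the seeded PRNG.
    return rng.choice(PERSON_NAMES)


def render(template: str, values: dict[str, str]) -> str:
    # Replace each {{key}} with its generated value.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)


rng = random.Random(42)
print(render("{{person_name}} approved the budget.", {"person_name": person_name(rng)}))
```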

B2. Compaction artifact schema

Strict JSON Schema. Every section is required; empty arrays and empty objects are allowed, but missing keys are rejected. Implemented as pydantic models in src/compactbench/contracts/artifact.py.

Full schema lives with the code at src/compactbench/contracts/artifact.py, with a JSON Schema mirror at benchmarks/schemas/. Schema version: 1.0.0.
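A hedged sketch of the strict-schema idea in pydantic v2 terms (required fields, empty collections allowed, unknown keys rejected); the field names are illustrative and do not mirror the actual artifact.py model:

```python
# Sketch of the strictness contract in B2; field names are illustrative.
from pydantic import BaseModel, ConfigDict


class CompactionArtifact(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown keys are rejected

    # All fields are required (no defaults), so missing keys are also rejected.
    schema_version: str
    summary_text: str
    structured_state: dict[str, object]  # may be empty, but must be present
    unresolved_tasks: list[str]          # empty list is valid; omission is not
```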

B3. Scoring formulas

Item weights (by item_type):

item_type                       weight
locked_decision_retention       3
forbidden_behavior_retention    3
immutable_fact_recall           2
unresolved_task_continuity      2
entity_integrity                1
planning_soundness              1

Cycle score: cycle_score = sum(weight_i * item_score_i) / sum(weight_i)

Contradiction rate: violating_responses / total_responses

Penalized cycle score: cycle_score * (1 - contradiction_rate)

Drift resistance: clamp(1 + mean(cycle_score_n - cycle_score_0 for n >= 1), 0, 1)

Compression ratio: tokens(transcript) / (tokens(summary_text) + tokens(structured_state)) using cl100k_base.
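The formulas transcribed directly into Python, for reference; variable names follow the text above, not necessarily the scorer's internal API:

```python
# Direct transcription of the B3 formulas; names follow the text, not the scorer API.
import tiktoken

WEIGHTS = {
    "locked_decision_retention": 3,
    "forbidden_behavior_retention": 3,
    "immutable_fact_recall": 2,
    "unresolved_task_continuity": 2,
    "entity_integrity": 1,
    "planning_soundness": 1,
}


def cycle_score(items: list[tuple[str, float]]) -> float:
    # items: (item_type, item_score in [0, 1])
    total_weight = sum(WEIGHTS[t] for t, _ in items)
    return sum(WEIGHTS[t] * s for t, s in items) / total_weight


def penalized_cycle_score(cycle: float, violating: int, total: int) -> float:
    return cycle * (1 - violating / total)


def drift_resistance(cycle_scores: list[float]) -> float:
    # Assumes at least one drift cycle after cycle 0.
    deltas = [c - cycle_scores[0] for c in cycle_scores[1:]]
    return max(0.0, min(1.0, 1 + sum(deltas) / len(deltas)))


def compression_ratio(transcript: str, summary_text: str, structured_state: str) -> float:
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(transcript)) / (
        len(enc.encode(summary_text)) + len(enc.encode(structured_state))
    )
```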

B4. Leaderboard ranking

elite_score =
    0.40 * run_overall_score
  + 0.30 * drift_resistance
  + 0.20 * constraint_retention
  + 0.10 * compression_bonus

Tie-breakers: higher drift_resistance → higher constraint_retention → lower contradiction_rate → earlier published_at.

Qualification floors (all must pass):

  • compression_ratio >= tier_floor (B7)
  • contradiction_rate <= 0.10
  • all configured case slots completed
  • all configured drift cycles completed (v1 default: 2)
  • no single family below 0.40 case-level pass rate
  • no runner failures in ranked slots
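A sketch of the weighting and tie-breaker ordering above; the run record shape is an assumption for illustration:

```python
# Sketch of elite_score and the tie-breaker ordering; the run dict is illustrative.
def elite_score(run: dict[str, float]) -> float:
    return (
        0.40 * run["run_overall_score"]
        + 0.30 * run["drift_resistance"]
        + 0.20 * run["constraint_retention"]
        + 0.10 * run["compression_bonus"]
    )


def ranking_key(run: dict[str, float]) -> tuple[float, ...]:
    # Descending sort on score, then each tie-breaker in order; "lower is better"
    # and "earlier is better" fields are negated so one reverse sort suffices.
    return (
        elite_score(run),
        run["drift_resistance"],
        run["constraint_retention"],
        -run["contradiction_rate"],
        -run["published_at"],  # e.g. a Unix timestamp; earlier wins
    )

# Usage: ranked = sorted(runs, key=ranking_key, reverse=True)
```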

B5. Built-in compactors

naive-summary, structured-state, hierarchical-summary, hybrid-ledger. Implementations live under src/compactbench/compactors/; each file's module docstring sketches the algorithm.
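The shared interface these implementations satisfy is, in rough shape, something like the following; the Protocol below is an assumption, and the real classes under src/compactbench/compactors/ may differ:

```python
# Assumed shape of the compactor interface; the real classes may differ.
from typing import Protocol


class Compactor(Protocol):
    name: str

    def compact(self, transcript: str) -> dict[str, object]:
        """Return a compaction artifact (see B2) for the given transcript."""
        ...
```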

B6. Launch Elite families

buried_constraint_v1, decision_override_v1, entity_confusion_v1. Each: 5 public practice variations + 20 hidden ranked variations. Remaining 12 families added post-launch.

B7. Compression tier floors

  • Elite-Light: ≥ 2.0×
  • Elite-Mid: ≥ 4.0×
  • Elite-Aggressive: ≥ 8.0×

B8. Starter/Hard suites

Starter public (20 cases, no compression floor), Hard public (30 cases, no compression floor). Used for development and practice, not ranked.


Part C — Distribution

C1. PyPI

  • Package name: compactbench
  • Release cadence: SemVer, first release 0.1.0 after WO-007
  • Trusted publishing via GitHub Actions OIDC (no long-lived PyPI token)

C2. GitHub

  • Public repo: compactbench/compactbench
  • Private hidden-set repo: compactbench/compactbench-hidden
  • Docs + leaderboard: GitHub Pages on compactbench.dev

C3. Domain

compactbench.dev registered via a standard registrar (Porkbun or Namecheap — post-scaffold decision). Cloudflare (free plan) for DNS + CDN.

C4. Branding

Short name "CompactBench". Tagline: "Open benchmark for AI conversation compaction methods."


Part D — Model providers

D1. Primary

Groq free tier: llama-3.3-70b-versatile, kimi-k2-instruct.

D2. Secondary

Google AI Studio free tier: gemini-2.0-flash.

D3. Local dev

Ollama: llama3.2 and similar. No API key required.

D4. Testing

Mock provider in src/compactbench/providers/mock.py. Deterministic responses keyed by prompt hash.
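A minimal sketch of the idea, deterministic responses keyed by a hash of the prompt; the real provider module may differ in shape:

```python
# Deterministic mock responses keyed by prompt hash, so tests are reproducible.
import hashlib


def mock_complete(prompt: str) -> str:
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    # The same prompt always maps to the same canned response.
    return f"[mock response {digest[:12]}]"
```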

D5. User responsibility

Users supply their own API keys via environment variables (COMPACTBENCH_GROQ_API_KEY, etc.). The package makes no upstream calls without a configured key.
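The key-loading contract, sketched under the assumption of a simple helper (the function and error message are illustrative):

```python
# Read the key from the environment; never call a provider without one.
import os


def groq_api_key() -> str:
    key = os.environ.get("COMPACTBENCH_GROQ_API_KEY")
    if not key:
        raise RuntimeError(
            "COMPACTBENCH_GROQ_API_KEY is not set; no upstream call will be made."
        )
    return key
```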

D6. Leaderboard evaluation

The maintainer-operated runner uses the same providers via repository secrets. First leaderboard version pins Groq + Llama 3.3 70B; second pins Gemini 2.0 Flash.

D7. Deferred

Anthropic Claude, OpenAI GPT, Google Vertex AI — integration when a funding path opens.


Part E — Leaderboard runner infrastructure

E1. Runner

GitHub-hosted ubuntu-latest runners (free and unlimited for public repos).

No self-hosted VM for v1. Earlier drafts planned an Oracle Cloud self-hosted runner for outbound-firewall control; we dropped it because (a) it's zero cost only on Oracle's increasingly unreliable free tier, (b) the operational burden outweighs the isolation benefit for a pre-scale project, and (c) the same security properties come from the pull_request_target + manual-label flow described in E3–E5 below.

We can migrate to self-hosted later without changing the workflow shape if there is a specific incident or a scale need.

E2. Evaluation workflow

.github/workflows/evaluate-submission.yml triggers on pull_request_target with the evaluate label. It runs on GitHub-hosted ubuntu-latest and:

  1. Checks out the PR's head commit (contains the submission's method source).
  2. Clones compactbench/compactbench-hidden using a deploy key stored as a repo secret.
  3. Installs compactbench plus the submission's declared deps.
  4. Runs compactbench run against the hidden Elite Ranked suite.
  5. Posts the score summary as a PR comment and uploads the results file as a workflow artifact.

E3. Secret exposure model

pull_request_target runs in the trusted context of the target repo (not the PR author's fork) with access to repo secrets. The workflow is only authorized to run once a maintainer applies the evaluate label after reviewing the submission's code. This review is the security gate.

Provider API keys (GROQ_API_KEY, GOOGLE_AI_STUDIO_API_KEY) and the hidden-repo deploy key are stored as repo secrets scoped to the evaluate-submission.yml workflow environment. They are not accessible to regular pull_request events or to other workflows.

E4. Evaluation gating

Maintainers must manually apply the evaluate label to each submission PR after code review. The label's existence is what authorizes the workflow to run with secrets. No auto-labeling, no drive-by evaluations.

E5. Leaderboard publication

On merge of a qualified submission, update-leaderboard.yml rebuilds docs/data/leaderboard.json from all committed submissions and triggers the existing docs deploy. Served from the same GitHub Pages deployment as the docs.

E6. Cost

$0. GitHub Actions free minutes (unlimited for public repos) + GitHub Pages free hosting. Optional later: domain registration (~$12/year).

E7. Migration path

If we ever need self-hosted (e.g., for outbound-firewall allow-listing or for compliance), add a single runs-on: [self-hosted, evaluate] override to the evaluate workflow and register the runner. No changes to the submission flow, leaderboard core, or secret model.


Part F — Governance

F1. Maintainers

Initial maintainer: project owner. Additional maintainers added via invite to the compactbench/maintainers GitHub team. Hidden-repo access is granted separately and revocably.

F2. Elite version review

Quarterly. Retire compromised template families, add new ones, document in CHANGELOG.md and a dedicated docs/governance/elite-versions.md.

F3. Leaderboard versioning

Each leaderboard version is segmented by (benchmark_version, target_model). Historical leaderboards stay pinned to their original versions forever.

F4. Template retirement

Public retirement notice in the changelog; hidden template is removed from the ranked pool but not deleted (preserved in the hidden repo for audit).

F5. Suspicion review

Flagged submissions (low public variance + hidden collapse, narrow family concentration, brittle shadow results) are reviewed by a maintainer before publication. Process documented publicly; heuristics documented privately.

F6. Code of conduct

Contributor Covenant 2.1. Reports to conduct@compactbench.dev (routed to maintainers).


Part G — Deferred to post-launch

  • Paid model providers (Anthropic / OpenAI / Google Vertex / Mistral)
  • Automated shadow evaluation scheduling
  • Elite template families 4–15 (content roadmap)
  • Sponsorship acceptance process
  • Non-English benchmark content
  • Python 3.10 backport
  • Dockerized runner distribution for third-party evaluators
  • Research paper + arXiv submission

Part H — Budget

Cost                                       Amount
Hosting, CI, docs, leaderboard, runner     $0
PyPI publishing                            $0
Domain registration (compactbench.dev)     ~$12/year
v1 launch total                            ~$12/year

No other paid dependencies. Every item in Parts A–F runs on a forever-free tier.


Revision history

Version   Date         Change
1.0       2026-04-16   Initial OSS lock, superseding SaaS decisions v1.0