APPENDIX B — The Benchmark
The implementation test for the machine plane compares six conditions on audit-dependent tasks:
- A — single unscaffolded frontier model, one-shot.
- B — single scaffolded model with deterministic proof structure.
- C — multiple unscaffolded models, consensus voting.
- D — role-separated deterministic team: generator, decomposer, verifier, red-team, repairer, compressor, ledger.
- E — LLM-as-OS dynamic router: deterministic command plane selecting per task among local/open-weight/frontier models, tools, context, proof depth, red-team depth, privacy mode, and ledgering, under cost, privacy, latency, and surety constraints.
- F — new in v3.0: a live object-grammar deployment (Book VI pattern): one dispatch door, contract-resolved invocation, mandatory receipts, repair lineage, scheduled zero-context review. F tests what A–E cannot: the grammar under real operation over time — reuse rates, repair-lineage integrity, review-loop effect on artifact quality, delegation safety under scoped tokens.
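One way to picture condition E's deterministic command plane is as a constrained selection over candidate routes. The sketch below is illustrative only: the `Route` and `Constraints` types, their field names, the example route names, and every numeric value are assumptions made for this example, not part of the benchmark definition. It filters routes by cost, latency, surety, and privacy constraints, then picks the cheapest feasible one with a fixed tie-break so the same inputs always yield the same route.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Route:
    """A candidate execution path (hypothetical fields, for illustration)."""
    name: str
    cost: float        # assumed cost units per task
    latency_s: float   # assumed expected latency in seconds
    surety: float      # assumed surety estimate in [0, 1]
    local: bool        # True if task data never leaves local infrastructure


@dataclass(frozen=True)
class Constraints:
    """Per-task limits the router must respect (hypothetical fields)."""
    max_cost: float
    max_latency_s: float
    min_surety: float
    require_local: bool  # stand-in for a strict privacy mode


def select_route(routes: Sequence[Route], c: Constraints) -> Optional[Route]:
    """Return the cheapest route meeting every constraint, or None."""
    feasible = [
        r for r in routes
        if r.cost <= c.max_cost
        and r.latency_s <= c.max_latency_s
        and r.surety >= c.min_surety
        and (r.local or not c.require_local)
    ]
    if not feasible:
        return None
    # Deterministic tie-break: cost, then latency, then name.
    return min(feasible, key=lambda r: (r.cost, r.latency_s, r.name))


# Invented example routes; the numbers are placeholders, not measurements.
ROUTES = [
    Route("local-open-weight", 1.0, 2.0, 0.70, True),
    Route("frontier-one-shot", 5.0, 4.0, 0.80, False),
    Route("frontier-proof-and-red-team", 20.0, 30.0, 0.95, False),
]
```

For example, `select_route(ROUTES, Constraints(10.0, 10.0, 0.75, False))` returns the `frontier-one-shot` route, while a task that demands local execution and surety of at least 0.9 has no feasible route and returns `None`. A real router would also choose tools, context, proof depth, red-team depth, and ledgering; this sketch collapses those choices into one route record.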
Metrics: correctness, auditability, reproducibility, adversarial survival, token cost, compute cost, latency, human verification time and time saved, failure cost (domain-weighted), reuse value, proof-reuse rate, repair-lineage integrity (fraction of failures with attached fixes), review-score trajectory over versions, data-custody and privacy cost, actionability. Derived: surety, logical energy, logical density, task-adjusted logical density.
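Of the metrics above, repair-lineage integrity is the one defined explicitly here: the fraction of failures with attached fixes. A minimal sketch of that computation follows. The record shape (a dict with a `fix_id` key) is an assumption made for illustration; the benchmark does not specify a ledger format.

```python
from typing import Iterable, Mapping, Optional


def repair_lineage_integrity(failures: Iterable[Mapping]) -> Optional[float]:
    """Fraction of failure records that carry an attached fix.

    Each record is assumed (hypothetically) to be a mapping whose
    'fix_id' entry is set when a fix has been linked to the failure.
    Returns None when there are no failures, since the ratio is undefined.
    """
    records = list(failures)
    if not records:
        return None
    linked = sum(1 for f in records if f.get("fix_id"))
    return linked / len(records)


# Invented example ledger: three of four failures have a linked fix.
example_failures = [
    {"id": "f1", "fix_id": "r1"},
    {"id": "f2", "fix_id": "r2"},
    {"id": "f3", "fix_id": None},
    {"id": "f4", "fix_id": "r3"},
]
print(repair_lineage_integrity(example_failures))  # 0.75
```

Returning `None` for an empty ledger, rather than 1.0, is a design choice in this sketch: it avoids reporting perfect integrity for a window in which nothing failed.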
Predictions: D dominates A and C where surety gain exceeds coordination cost; E dominates D across heterogeneous task sets; F's review-score trajectory rises across versions (S8's constructive prediction) and F's repair-lineage integrity stays near unity where A–E's unlinked-guess rate grows with volume.
Validity requirements: demonstrably audit-dependent tasks; diverse error distributions; measured (not assumed) coordination cost; defined deployment window; pre-published failure-cost weighting; ground truth independent of the evaluated systems; pre-defined privacy scoring; for F, review parameters declared before the window opens (IX.10).
Falsifiers: A consistently beats D/E/F on task-adjusted logical density; surety/alpha cost curves fail to fall under deterministic scaffolding; proof reuse fails to beat regeneration over the window; routing overhead exceeds task-adjusted gain; F's review scores stagnate or degrade across versions (S8); F's repair lineage decays with scale (S7).
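The S8 falsifier (review scores that stagnate or degrade across versions) can be made mechanically checkable. The sketch below is one possible operationalization, not the benchmark's own definition: it fits an ordinary least-squares slope to the per-version review scores and flags the falsifier when that slope does not exceed a threshold. The function names and the `min_slope` parameter are assumptions; in keeping with the validity requirements, any such threshold would need to be declared before the deployment window opens.

```python
from typing import Sequence


def review_score_slope(scores: Sequence[float]) -> float:
    """Least-squares slope of review score against version index 0..n-1."""
    n = len(scores)
    if n < 2:
        raise ValueError("need review scores for at least two versions")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def stagnates_or_degrades(scores: Sequence[float], min_slope: float = 0.0) -> bool:
    """True when the trend does not exceed the pre-declared minimum slope.

    min_slope is a hypothetical parameter; 0.0 treats a flat trend as
    stagnation.
    """
    return review_score_slope(scores) <= min_slope


# Invented score series, one value per artifact version.
print(stagnates_or_degrades([3.0, 4.0, 5.0]))  # False: scores rise
print(stagnates_or_degrades([3.0, 3.0, 3.0]))  # True: scores are flat
print(stagnates_or_degrades([5.0, 4.0, 3.0]))  # True: scores fall
```

A slope test is a coarse summary: it ignores noise and variance across reviewers, so a fuller analysis would pair it with an uncertainty estimate fixed in advance.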
---