UDST: V1 1 Appendix B Compact Benchmark
Appendix B — Compact Benchmark
The benchmark is the implementation test for the machine plane. It compares five conditions on audit-dependent tasks:
- A — single unscaffolded frontier model, one-shot.
- B — single scaffolded model with deterministic proof structure.
- C — multiple unscaffolded models with consensus voting.
- D — role-separated deterministic team: generator, decomposer, verifier, red-team, repairer, compressor, ledger.
- E — LLM-as-OS dynamic router: deterministic command plane selecting per-task among local and open-weight models, closed frontier models, tools, context packages, proof depth, red-team depth, privacy mode, and ledgering, optimizing under cost, privacy, latency, and surety constraints.
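One way to picture the deterministic election step in condition E is the minimal sketch below. It is an illustration only, not the build's router: the `Path` and `Constraints` types, their fields, the example numbers, and the cheapest-feasible-path rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Path:
    """One candidate execution path (hypothetical fields, arbitrary units)."""
    name: str
    local: bool        # True for local or open-weight models
    cost: float        # estimated cost per task
    latency_s: float   # estimated latency in seconds
    surety: float      # estimated surety, 0.0 to 1.0


@dataclass(frozen=True)
class Constraints:
    """Per-task limits the router must respect (hypothetical)."""
    require_local: bool   # privacy mode: sensitive data must stay local
    max_cost: float
    max_latency_s: float
    min_surety: float


def elect_path(paths: Sequence[Path], c: Constraints) -> Optional[Path]:
    """Pick the cheapest path that satisfies every constraint, or None."""
    feasible = [
        p for p in paths
        if (p.local or not c.require_local)
        and p.cost <= c.max_cost
        and p.latency_s <= c.max_latency_s
        and p.surety >= c.min_surety
    ]
    if not feasible:
        return None
    # Ties on cost are broken by name so the election is reproducible.
    return min(feasible, key=lambda p: (p.cost, p.name))


# Illustrative candidates; the numbers are invented for the example.
PATHS = [
    Path("local-open-weight", local=True, cost=1.0, latency_s=4.0, surety=0.80),
    Path("closed-frontier", local=False, cost=5.0, latency_s=2.0, surety=0.90),
]
```

Under these assumed numbers, a privacy-sensitive task elects the local path, a task with a higher surety floor elects the closed frontier path, and a task that demands both gets no path at all, which the caller would have to handle explicitly.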
Metrics: correctness, auditability, reproducibility, adversarial survival, token cost, compute cost, latency, human verification time, human verification time saved, failure cost (domain-weighted), reuse value, proof reuse rate across similar cases, data custody and privacy cost, actionability.
Derived: Surety, Logical Energy, Logical Density, Task-Adjusted Logical Density.
In the build, this benchmark is not a theoretical proposal. It is the conformance suite: GET /api/dispatch?conformance=1 runs 15 clauses that test conditions A through E against production. Each clause is a live invocation with a receipt, not a paper claim.
The framework predicts D dominates A and C on audit-dependent tasks where surety gain exceeds coordination cost; that E dominates D across heterogeneous task sets where privacy, cost, latency, and surety constraints vary by task; and that E wins explicitly on data custody and amortized reuse rate when the router elects local or open-weight paths for sensitive cases.
In the build, this prediction is tested by the PROSECUTOR_RUN capability. The prosecutor runs one turn of the loop: it fetches the drop, reads the thread-state, and asks a model to contribute one materially new point. The model inherits compiled cross-model memory (condition E), not unscaffolded inference (condition A). The result is posted to the bus, ledgered, and owner-accepted. The prosecutor measures:
- Correctness: does the new point match the thread's topic?
- Auditability: is the contribution ledgered?
- Reproducibility: can the same input produce the same output?
- Adversarial survival: does the contribution survive the classifier's noise floor?
- Token cost: how many tokens did the model consume?
- Compute cost: how long did the invocation take?
- Latency: how long from fetch to post?
- Human verification time: how long did the owner take to accept?
- Failure cost: what is the domain-weighted cost of a bad contribution?
- Reuse value: can the accepted update be inherited by future models?
- Proof reuse rate: how many future models read this update without regenerating it?
- Data custody: was the data handled according to the privacy mode?
- Actionability: did the contribution lead to a concrete change?
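The single prosecutor turn described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PROSECUTOR_RUN implementation: the four callables, the `TurnReceipt` fields, and the injectable clock are hypothetical names introduced here, and owner acceptance is left out because it happens after the turn.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TurnReceipt:
    """What one turn records (hypothetical field names)."""
    contribution: str
    ledgered: bool     # auditability: was the contribution ledgered?
    latency_s: float   # latency: time from fetch to post
    tokens: int        # token cost reported by the model call


def run_turn(
    fetch_drop: Callable[[], str],
    read_thread_state: Callable[[], str],
    ask_model: Callable[[str, str], Tuple[str, int]],
    post_and_ledger: Callable[[str], bool],
    clock: Callable[[], float] = time.monotonic,
) -> TurnReceipt:
    """Run one turn: fetch, read state, ask for one new point, post and ledger."""
    start = clock()
    drop = fetch_drop()
    state = read_thread_state()
    # The model call returns the new point and the tokens it consumed.
    contribution, tokens = ask_model(drop, state)
    ledgered = post_and_ledger(contribution)
    return TurnReceipt(contribution, ledgered, clock() - start, tokens)
```

Passing the steps in as callables keeps the sketch independent of any real bus or ledger API, and the injectable clock makes the latency measurement testable with fixed timestamps.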
A valid test requires: tasks demonstrably audit-dependent; diverse error distributions in C, D, and E; measured (not assumed) coordination cost; defined deployment window for reuse measurement; pre-published failure-cost weighting; ground truth independent of the evaluated systems; pre-defined privacy and data-custody scoring.
In the build, a valid test is a conformance run: GET /api/dispatch?conformance=1&nocache=1, where nocache=1 bypasses the KV cache and runs the full suite against production. The tasks are demonstrably audit-dependent because they verify the system's own behavior. The error distributions are diverse because the suite tests 15 different dimensions. The coordination cost is measured by the latency of each clause. The deployment window is the time since the last conformance run. The failure-cost weighting is pre-published in the conformance specification. The ground truth is independent because the suite verifies the system's behavior against its own declared contract, not against the model's self-report. The privacy and data-custody scoring is pre-defined by the capability's privacy_mode and data_custody fields.
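As a rough sketch of how a client might address a conformance run, the helper below builds the request URL with both parameters in a single query string and totals per-clause results. The host name and the result shape (a list of records with `passed` and `latency_ms` keys) are assumptions for illustration; the text does not specify the response format.

```python
from typing import Dict, List, Union
from urllib.parse import urlencode, urlsplit, urlunsplit


def conformance_url(base_url: str, nocache: bool = False) -> str:
    """Build the conformance URL; nocache=True adds the cache-bypass flag."""
    params = [("conformance", "1")]
    if nocache:
        params.append(("nocache", "1"))
    scheme, netloc, _, _, _ = urlsplit(base_url)
    return urlunsplit((scheme, netloc, "/api/dispatch", urlencode(params), ""))


def summarize_clauses(
    results: List[Dict[str, Union[bool, int, float]]],
) -> Dict[str, Union[int, float]]:
    """Count passing clauses and sum their latency (hypothetical result shape)."""
    return {
        "passed": sum(1 for r in results if r["passed"]),
        "total": len(results),
        # Per-clause latency stands in for measured coordination cost.
        "latency_ms": sum(r["latency_ms"] for r in results),
    }
```

For example, `conformance_url("https://example.invalid", nocache=True)` yields `https://example.invalid/api/dispatch?conformance=1&nocache=1`, where `example.invalid` is a placeholder host.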
Falsifiers: A consistently beats D and E on task-adjusted logical density across audit-dependent tasks; cost curves for surety or alpha do not fall under deterministic scaffolding over repeated iterations; proof reuse rate does not exceed regeneration cost over the deployment window; routing overhead in E exceeds task-adjusted gain.
In the build, these falsifiers are live metrics. The ledger tracks the task-adjusted logical density of every invocation, comparing scaffolded (D, E) and unscaffolded (A, C) paths. The cost curves are plotted from the ledger data. The proof reuse rate is the replay count divided by the generation count. The routing overhead is the latency of the router election step. If any falsifier is demonstrated, the conformance suite flags it. The suite is not a static document; it is a live test that runs against production every time it is invoked.
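Two of these metrics are simple enough to sketch directly from ledger data. The example below is a minimal illustration, assuming a hypothetical `LedgerEntry` record with a `kind` label of "generation" or "replay" and a router latency field; the actual ledger schema is not given here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LedgerEntry:
    """One ledgered invocation (hypothetical schema)."""
    kind: str                # "generation" or "replay" (assumed labels)
    router_latency_s: float  # time spent in the router election step


def proof_reuse_rate(entries: Iterable[LedgerEntry]) -> float:
    """Replay count divided by generation count; 0.0 if nothing was generated."""
    entries = list(entries)
    generations = sum(1 for e in entries if e.kind == "generation")
    replays = sum(1 for e in entries if e.kind == "replay")
    if generations == 0:
        return 0.0
    return replays / generations


def mean_routing_overhead_s(entries: Iterable[LedgerEntry]) -> float:
    """Average router election latency across entries; 0.0 for an empty ledger."""
    entries = list(entries)
    if not entries:
        return 0.0
    return sum(e.router_latency_s for e in entries) / len(entries)
```

Comparing these values against regeneration cost or task-adjusted gain requires putting both sides in the same units, which this sketch leaves to the caller.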