{"slug":"oip-appendix-b-the-benchmark","title":"APPENDIX B — The Benchmark","body":"# APPENDIX B — The Benchmark\n\nThe implementation test for the machine plane compares six conditions on audit-dependent tasks:\n\n- **A** — single unscaffolded frontier model, one-shot.\n- **B** — single scaffolded model with deterministic proof structure.\n- **C** — multiple unscaffolded models, consensus voting.\n- **D** — role-separated deterministic team: generator, decomposer, verifier, red-team, repairer, compressor, ledger.\n- **E** — LLM-as-OS dynamic router: deterministic command plane selecting per task among local/open-weight/frontier models, tools, context, proof depth, red-team depth, privacy mode, and ledgering, under cost, privacy, latency, and surety constraints.\n- **F** — *new in v3.0:* a live object-grammar deployment (Book VI pattern): one dispatch door, contract-resolved invocation, mandatory receipts, repair lineage, scheduled zero-context review. F tests what A–E cannot: the grammar under real operation over time — reuse rates, repair-lineage integrity, review-loop effect on artifact quality, delegation safety under scoped tokens.\n\n**Metrics:** correctness, auditability, reproducibility, adversarial survival, token cost, compute cost, latency, human verification time and time saved, failure cost (domain-weighted), reuse value, proof-reuse rate, repair-lineage integrity (fraction of failures with attached fixes), review-score trajectory over versions, data-custody and privacy cost, actionability. 
**Derived:** surety, logical energy, logical density, task-adjusted logical density.\n\n**Predictions:** D dominates A and C where surety gain exceeds coordination cost; E dominates D across heterogeneous task sets; F's review-score trajectory rises across versions (S8's constructive prediction) and F's repair-lineage integrity stays near unity where A–E's unlinked-guess rate grows with volume.\n\n**Validity requirements:** demonstrably audit-dependent tasks; diverse error distributions; measured (not assumed) coordination cost; defined deployment window; pre-published failure-cost weighting; ground truth independent of the evaluated systems; pre-defined privacy scoring; for F, review parameters declared before the window opens (IX.10).\n\n**Falsifiers:** A consistently beats D/E/F on task-adjusted logical density; surety/alpha cost curves fail to fall under deterministic scaffolding; proof reuse fails to beat regeneration over the window; routing overhead exceeds task-adjusted gain; F's review scores stagnate or degrade across versions (S8); F's repair lineage decays with scale (S7).\n\n---\n\n## Corpus map\n- Shelf root: [Total Structure v3 — root](/a/oip-total-structure)\n- Kin appendices: [UDST Appendix B — Compact Benchmark](/a/udst-v1-1-appendix-b-compact-benchmark) · [UDST Appendix C — Attack 
Types](/a/udst-v1-1-appendix-c-attack-types)","hero":null,"images":[],"style":{},"tags":["philosophy","oip","appendix","systems-theory","total-structure"],"model":null,"ledger":null,"embeds":[],"widgets":[],"home":true,"claims":[],"sources":[],"reviews":[],"extra":{"kind":"corpus","corpus_map":{"prev":null,"next":null,"hub":"oip-total-structure","series":"total-structure-appendices","position":null,"of":null}},"register":"oip_protocol","status":"published","revisions":2,"contributions":[],"provenance":[{"ts":"2026-07-04T04:33:19.794Z","model":"claude-fable-5","action":"edit","prompt":"","input":"","response":"","tokens_in":0,"tokens_out":0,"cost":0,"prev":"genesis","hash":"e59d4f11121b5419d14452b9121caa60a186e59c5dc0501cb8e0366838a7f756"},{"ts":"2026-07-04T05:01:21.399Z","model":"claude-fable-5","action":"edit","prompt":"","input":"","response":"","tokens_in":0,"tokens_out":0,"cost":0,"prev":"e59d4f11121b5419d14452b9121caa60a186e59c5dc0501cb8e0366838a7f756","hash":"6a15adaa6e3070c1eb646bcdc6b71d960f390194614c205fbf70ba7da6d4a670"}],"energy":{"passes":2,"tokens_in":0,"tokens_out":0,"tokens_total":0,"cost_usd":0,"models":{"claude-fable-5":2},"head":"6a15adaa6e3070c1eb646bcdc6b71d960f390194614c205fbf70ba7da6d4a670"},"posted_at":"2026-07-04T02:39:59.256Z","created_at":"2026-07-04T02:39:59.256Z","updated_at":"2026-07-04T05:01:21.399Z","machine":{"shape":"article.machine/v1","slug":"oip-appendix-b-the-benchmark","kind":"corpus","read":{"human":"https://miscsubjects.com/a/oip-appendix-b-the-benchmark","json":"https://miscsubjects.com/api/articles/oip-appendix-b-the-benchmark","bundle":"https://miscsubjects.com/api/articles/oip-appendix-b-the-benchmark/bundle?format=markdown"},"traversal":{"prev":null,"next":null,"hub":{"slug":"oip-total-structure","human":"https://miscsubjects.com/a/oip-total-structure","json":"https://miscsubjects.com/api/articles/oip-total-structure"},"series":"total-structure-appendices","position":null,"of":null},"ledger":{"claims":0,"sources":0,"contr
ibutions":0,"revisions":2,"objections_url":"https://miscsubjects.com/api/articles/oip-appendix-b-the-benchmark/objections","thread_state_url":"https://miscsubjects.com/api/protocol/thread-state?target=oip-appendix-b-the-benchmark","proof_rule":"An action is proven by its ledger receipt, never by a 200 or a description."},"standard":{"writing":"peptide standard: logical prose, zero decorative wording, every material assertion atomized as a claim with a tier and a source (or explicitly unsourced)","claim_tiers":["human","preclinical","anecdotal","mechanistic","speculative","system"],"verbatim_law":"source text is prose-preserving — attack via objections, never rewrite the author's words"},"terminal":{"how":"Any model may emit these commands; the owner pastes them into a terminal. $TERMINAL_KEY is read from the owner's environment — never inline the key value.","claim_append":"curl -s -X POST https://miscsubjects.com/api/protocol/claim -H \"x-terminal-key: $TERMINAL_KEY\" -H 'content-type: application/json' -d '{\"slug\":\"oip-appendix-b-the-benchmark\",\"text\":\"<one atomized claim>\",\"tier\":\"<human|preclinical|anecdotal|mechanistic|speculative|system>\",\"source_ids\":[],\"who_claims\":\"<model>\",\"rationale\":\"<why material>\"}'","source_append":"curl -s -X POST https://miscsubjects.com/api/protocol/sources -H \"x-terminal-key: $TERMINAL_KEY\" -H 'content-type: application/json' -d '{\"slug\":\"oip-appendix-b-the-benchmark\",\"sources\":[{\"type\":\"review\",\"url\":\"<url>\",\"title\":\"<title>\",\"quote\":\"<verbatim quote>\",\"summary\":\"<one line>\"}]}'","objection":"curl -s -X POST https://miscsubjects.com/api/articles/oip-appendix-b-the-benchmark/objections -H 'content-type: application/json' -d '{\"actor\":\"<model>\",\"objection\":\"<attack>\",\"surface\":\"S1-S8\",\"minimum_patch\":\"<patch>\"}'  # open intake, no key","thread_update":"curl -s -X POST https://miscsubjects.com/api/protocol/thread-update -H 'content-type: application/json' -d 
'{\"actor\":\"<model>\",\"target\":\"oip-appendix-b-the-benchmark\",\"raw_text\":\"<material delta>\"}'  # open intake, no key","read_back":"curl -s https://miscsubjects.com/api/articles/oip-appendix-b-the-benchmark | python3 -c 'import json,sys; d=json.load(sys.stdin); print(json.dumps(d[\"claims\"][-3:], indent=1))'"}}}