What this is
For a producer engine, "trust" means provenance and reliability: every canonical record can be traced to its source and extraction step, re-runs do not corrupt prior work, and the cost of extraction is controlled.
Provenance
Canonical outputs carry lineage — which source asset produced them, through which extraction step, under which contract. Variable-relationship models in book_level_sweep.json retain provenance so downstream consumers (Principia) inherit a defensible chain, not a bare assertion.
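As an illustration only, the three lineage fields named above (source asset, extraction step, contract) could travel with each record roughly as follows. The class and field names here are hypothetical and are not taken from the actual book_level_sweep.json schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Lineage:
    # Hypothetical field names; the real schema may differ.
    source_asset: str      # which source asset produced the record
    extraction_step: str   # which extraction step produced it
    contract: str          # which contract the step ran under


@dataclass(frozen=True)
class CanonicalRecord:
    payload: dict
    lineage: Lineage


def with_lineage(payload, source_asset, extraction_step, contract):
    """Build a record, refusing to emit one whose lineage is incomplete."""
    if not (source_asset and extraction_step and contract):
        raise ValueError("lineage is incomplete; refusing to emit a bare assertion")
    return CanonicalRecord(payload, Lineage(source_asset, extraction_step, contract))


record = with_lineage(
    {"relationship": "x depends on y"},   # illustrative payload
    source_asset="book-001/chapter-03",   # illustrative identifiers
    extraction_step="variable_relationships",
    contract="v1",
)
serialized = asdict(record)  # nested dict, lineage included, ready for JSON
```

Rejecting incomplete lineage at construction time means a downstream consumer never has to decide what to do with a record that cannot be traced.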
Idempotency
The pipeline is idempotent: a completed unit is skipped, not recomputed (via resume guards and oversized sub-split handling). Re-running a partially failed batch resumes from the unfinished units rather than duplicating completed ones, which is the basis for reliable scale.
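A resume guard of the kind described above can be sketched as follows. This is a minimal sketch, not the pipeline's actual code: the function name, the one-file-per-unit layout, and the `extract` callback are all assumptions made for illustration.

```python
import json
from pathlib import Path


def run_unit_once(unit_id, out_dir, extract):
    """Run `extract` for a unit only if its output does not already exist.

    Returns True if the unit ran, False if it was skipped.
    Hypothetical layout: one JSON file per completed unit.
    """
    out_path = Path(out_dir) / f"{unit_id}.json"
    if out_path.exists():
        return False  # resume guard: completed work is skipped, not recomputed

    result = extract(unit_id)

    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a partial file that the guard would mistake for "done".
    tmp_path = out_path.with_name(out_path.name + ".tmp")
    tmp_path.write_text(json.dumps(result))
    tmp_path.replace(out_path)
    return True
```

With this shape, re-running a batch after a partial failure only calls `extract` for units whose output file is absent.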
The clean store
Each book has one current, clean store: book_level/book_level_sweep.json. The legacy summary.json is dead and must not be read; reading it is a known failure mode. The ledger (_sweep_ledger.csv) reports coverage so "done" is verifiable, not assumed.
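One way a reader could enforce the "clean store only" rule is to fail loudly when the store is missing instead of falling back to the legacy file. The sketch below assumes a per-book directory containing book_level/book_level_sweep.json; the function name and error message are hypothetical.

```python
import json
from pathlib import Path


def load_clean_store(book_dir):
    """Load the clean store for one book; never fall back to summary.json."""
    store_path = Path(book_dir) / "book_level" / "book_level_sweep.json"
    if not store_path.is_file():
        # Deliberately no fallback: the legacy summary.json is a known failure mode.
        raise FileNotFoundError(f"clean store not found: {store_path}")
    return json.loads(store_path.read_text())
```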
Controlled cost & honest status
The moat is thousands of multi-step extractions run reliably at controlled cost. Operating discipline is part of the guarantee: prove one unit end-to-end (with real result + cost) before the batch, and verify outputs changed on disk — a book that "finishes" in 38s did not run. Honest status over hopeful framing.
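The "verify outputs changed on disk" check can be sketched by fingerprinting the output file before and after a run. This is an assumption-laden illustration: the function names are hypothetical, and `run` stands in for whatever actually executes a unit or book.

```python
import hashlib
import time
from pathlib import Path


def fingerprint(path):
    """Return the SHA-256 of a file's bytes, or None if the file is absent."""
    p = Path(path)
    return hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None


def run_and_verify(output_path, run):
    """Call `run`, then confirm the output file exists and its content changed.

    Returns the elapsed wall-clock seconds so an implausibly fast run
    can be flagged by the caller.
    """
    before = fingerprint(output_path)
    start = time.monotonic()
    run()
    elapsed = time.monotonic() - start
    after = fingerprint(output_path)
    if after is None or after == before:
        raise RuntimeError(f"no change on disk at {output_path}; the run did no work")
    return elapsed
```

This check fits runs that are expected to produce new work. A legitimate resume in which every unit is skipped also leaves the file unchanged, so the caller needs to know which case it is in before treating the error as a failure.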
Asset integrity
The asset registry is verifiable: npm run registry:verify confirms that every declared asset exists, and registry:verify:fix prunes entries whose assets are missing. Files resolve through the registry, not hand-built paths, so lineage stays intact.
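To make the idea of registry-based resolution concrete, here is a small sketch. It assumes the registry can be viewed as a mapping from asset id to a path relative to some root; the real registry format and the internals of the registry:verify script are not shown in this document, so every name below is hypothetical.

```python
from pathlib import Path


def resolve(registry, asset_id, root):
    """Resolve an asset through the registry instead of building a path by hand."""
    if asset_id not in registry:
        raise KeyError(f"asset not declared in registry: {asset_id}")
    path = Path(root) / registry[asset_id]
    if not path.is_file():
        raise FileNotFoundError(f"declared asset is missing on disk: {path}")
    return path


def missing_assets(registry, root):
    """List declared asset ids whose files do not exist (a verify-style check)."""
    return sorted(
        asset_id
        for asset_id, rel_path in registry.items()
        if not (Path(root) / rel_path).is_file()
    )
```

Because every lookup goes through `resolve`, an undeclared or missing asset surfaces as an error at the point of use rather than as a silently broken lineage link.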
Honest scope
CanonicAI produces canonical data with provenance; it does not adjudicate downstream truth. Consumers (Principia) own evidence curation. In-scope vs. spun-out producers: docs/CANONICAI-SCOPE.md.