What this is

For a producer engine, "trust" means provenance and reliability: every canonical record can be traced to its source and step, re-runs don't corrupt prior work, and the cost of extraction is controlled.

Provenance

Canonical outputs carry lineage — which source asset produced them, through which extraction step, under which contract. Variable-relationship models in book_level_sweep.json retain provenance so downstream consumers (Principia) inherit a defensible chain, not a bare assertion.
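
As a rough illustration of what "carrying lineage" can look like, the sketch below attaches the three pieces named above (source asset, extraction step, contract) to a record. The field names, the "provenance" key, and both helper functions are assumptions made for illustration; they are not the actual schema of book_level_sweep.json.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Lineage:
    # All three field names are hypothetical, chosen to mirror the prose above.
    source_asset: str     # which source asset produced the record
    extraction_step: str  # which extraction step produced it
    contract: str         # which contract the output was produced under


def with_lineage(payload: dict, lineage: Lineage) -> dict:
    """Return a copy of the payload with lineage stored under a 'provenance' key."""
    return {**payload, "provenance": asdict(lineage)}


def is_traceable(record: dict) -> bool:
    """True only if every lineage field is present and non-empty."""
    prov = record.get("provenance") or {}
    return all(prov.get(k) for k in ("source_asset", "extraction_step", "contract"))
```

A consumer could run a check like is_traceable before accepting a record, so a bare assertion without lineage is rejected at the boundary.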

Idempotency

The pipeline is idempotent: a completed unit is skipped, not recomputed (resume guards, oversized sub-split handling). Re-running a partially failed batch resumes from the incomplete units rather than duplicating finished work, which is the basis for reliable scale.
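
The resume-guard idea can be sketched in a few lines. This is a minimal illustration, not the pipeline's real code: run_batch, the done set, and the process callback are hypothetical names, and a real implementation would persist the completed set to disk rather than hold it in memory.

```python
from typing import Callable, Iterable


def run_batch(
    units: Iterable[str],
    done: set[str],
    process: Callable[[str], None],
) -> list[str]:
    """Process every unit not already in `done`; return the units run this time."""
    ran = []
    for unit in units:
        if unit in done:
            # Resume guard: a completed unit is skipped, not recomputed.
            continue
        process(unit)   # if this raises, the unit is NOT marked done
        done.add(unit)  # mark complete only after the work succeeded
        ran.append(unit)
    return ran
```

Because a unit is marked done only after process returns, a batch that fails midway leaves the failed unit unmarked, and the next run picks up from there.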

The clean store

Each book has one current, clean store: book_level/book_level_sweep.json. The legacy summary.json is dead — reading it is a known failure mode. The ledger (_sweep_ledger.csv) reports coverage so "done" is verifiable, not assumed.
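
To make "done is verifiable" concrete, here is a small sketch that computes coverage from ledger text. The column names unit and status and the value "done" are assumptions; the real layout of _sweep_ledger.csv is not described here and may differ.

```python
import csv
import io


def coverage(ledger_csv: str, done_value: str = "done") -> tuple[int, int]:
    """Return (completed, total) row counts from CSV text with a 'status' column."""
    rows = list(csv.DictReader(io.StringIO(ledger_csv)))
    completed = sum(1 for r in rows if r["status"].strip() == done_value)
    return completed, len(rows)
```

A book would then count as done only when completed equals total, which is a check on the ledger rather than an impression from the logs.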

Controlled cost & honest status

The moat is thousands of multi-step extractions run reliably at controlled cost. Operating discipline is part of the guarantee: prove one unit end-to-end (with real result + cost) before the batch, and verify outputs changed on disk — a book that "finishes" in 38s did not run. Honest status over hopeful framing.
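
One way to check that outputs changed on disk is to fingerprint the output file before the run and compare afterwards. The sketch below assumes a content hash is an acceptable signal; fingerprint and output_changed are hypothetical helpers, not part of the engine.

```python
import hashlib
from pathlib import Path
from typing import Optional


def fingerprint(path: Path) -> Optional[str]:
    """SHA-256 of the file's bytes, or None if the file does not exist."""
    if not path.exists():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def output_changed(path: Path, before: Optional[str]) -> bool:
    """True if the file exists now and its content differs from the earlier fingerprint."""
    after = fingerprint(path)
    return after is not None and after != before
```

This check applies to a unit that was expected to do work. A fully resumed run that correctly skips every completed unit will leave the file unchanged, so the result should be read alongside the ledger.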

Asset integrity

The asset registry is verifiable: npm run registry:verify confirms every declared asset exists; registry:verify:fix prunes missing ones. Files resolve through the registry, not hand-built paths, so lineage stays intact.
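
The verify and fix behaviours described above can be illustrated conceptually as follows. This is not the implementation behind the npm scripts: the registry shape (asset id mapped to a path relative to a root directory) and both function names are assumptions.

```python
from pathlib import Path


def missing_assets(registry: dict[str, str], root: Path) -> list[str]:
    """Return the ids of declared assets whose files do not exist under root."""
    return sorted(
        asset_id for asset_id, rel in registry.items()
        if not (root / rel).exists()
    )


def prune(registry: dict[str, str], root: Path) -> dict[str, str]:
    """Return a new registry containing only assets that exist on disk."""
    missing = set(missing_assets(registry, root))
    return {a: rel for a, rel in registry.items() if a not in missing}
```

Returning a new mapping from prune, rather than mutating the input, keeps the verify step read-only and makes the fix step an explicit, separate decision.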

Honest scope

CanonicAI produces canonical data with provenance; it does not adjudicate downstream truth. Consumers (Principia) own evidence curation. In-scope vs. spun-out producers: docs/CANONICAI-SCOPE.md.

See also

Data Contracts & Consumers · Architecture · Concepts