
What this is

The core terms used across CanonicAI. The mental model: a corpus is ingested, extracted in multiple steps, registered as assets, and emitted as canonical outputs with provenance — then published to the federation spine for downstream consumers.

Corpus

A body of source material — books, research papers, domain documents — ingested for extraction. Book IDs are slugs (a_new_way_to_think); research IDs are {author}_{year}_{title_slug}.
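
The two ID conventions above can be sketched as small helpers. This is an illustration only: `slugify`, `bookId`, and `researchId` are hypothetical names, and the normalization rule (lowercase, non-alphanumerics collapsed to underscores) is an assumption inferred from the a_new_way_to_think example, not CanonicAI's documented behavior. The research paper in the last line is made up for the example.

```typescript
// Hypothetical helpers; the normalization rules are assumptions.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_") // collapse each run of non-alphanumerics into one underscore
    .replace(/^_+|_+$/g, "");    // trim leading and trailing underscores
}

// Book IDs are plain slugs of the title.
function bookId(title: string): string {
  return slugify(title);
}

// Research IDs follow {author}_{year}_{title_slug}.
function researchId(author: string, year: number, title: string): string {
  return `${slugify(author)}_${year}_${slugify(title)}`;
}

console.log(bookId("A New Way to Think"));
// a_new_way_to_think
console.log(researchId("Smith", 2020, "Job Crafting at Work"));
// smith_2020_job_crafting_at_work
```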

Asset & Asset Registry

Every file in the system is an asset with ID {domain}:{type}:{source_id} (e.g. books:extracted_text:a_new_way_to_think). The asset registry (asset-registry.json) is the single source of truth; loaders resolve through it ("Strategy 0") before any legacy path. Domains: books · research · onet · bls · hr_metrics · competency. Never construct file paths by hand — use getRegistry() / getResolver() from core.
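
As a rough sketch of the ID shape described above, the following parses and validates a {domain}:{type}:{source_id} string against the listed domains. `parseAssetId`, `formatAssetId`, and the `AssetId` interface are hypothetical names for illustration and are not part of core. The sketch also assumes none of the three parts contains a colon. It handles only the ID string; resolving an ID to a file still goes through the registry.

```typescript
// Domains as listed in the glossary entry above.
const DOMAINS = ["books", "research", "onet", "bls", "hr_metrics", "competency"] as const;
type Domain = (typeof DOMAINS)[number];

// Hypothetical structured form of an asset ID.
interface AssetId {
  domain: Domain;
  type: string;
  sourceId: string;
}

// Split "{domain}:{type}:{source_id}" and reject malformed or unknown-domain IDs.
// Assumption: no part contains a colon.
function parseAssetId(id: string): AssetId {
  const parts = id.split(":");
  if (parts.length !== 3 || parts.some((p) => p.length === 0)) {
    throw new Error(`Expected {domain}:{type}:{source_id}, got "${id}"`);
  }
  const [domain, type, sourceId] = parts as [string, string, string];
  if ((DOMAINS as readonly string[]).indexOf(domain) === -1) {
    throw new Error(`Unknown domain "${domain}"`);
  }
  return { domain: domain as Domain, type, sourceId };
}

// Inverse of parseAssetId.
function formatAssetId(asset: AssetId): string {
  return `${asset.domain}:${asset.type}:${asset.sourceId}`;
}

const parsed = parseAssetId("books:extracted_text:a_new_way_to_think");
console.log(parsed.domain, parsed.type, parsed.sourceId);
// books extracted_text a_new_way_to_think
```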

Canonical output

The structured product of the pipeline. Per book: canonical_outputs/<book>/chapters/<chXX>/ holds chunks.json, summary_core.json, gaps.json, summary.md; the clean, current book-level store is canonical_outputs/<book>/book_level/book_level_sweep.json (with _sweep_ledger.csv for coverage).

Provenance & idempotency

Outputs carry lineage, recording which source produced them and through which step. Re-running the pipeline is idempotent: a completed unit is skipped, not recomputed. This is the moat: reliable, repeatable extraction with lineage rather than one-off prompting.
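
A minimal sketch of the skip-if-completed behavior, assuming an in-memory ledger keyed by step and unit. `Ledger`, `LedgerEntry`, and `run` are hypothetical names, not the pipeline's actual implementation, which would presumably persist completion state to disk. Each entry records its source and step, which is the lineage part.

```typescript
// One completed unit of work, with the lineage that produced it.
interface LedgerEntry<T> {
  source: string; // which source produced the output
  step: string;   // which pipeline step produced it
  unitId: string; // e.g. a chapter identifier
  output: T;
}

// Hypothetical in-memory ledger; a real pipeline would persist this.
class Ledger<T> {
  private done = new Map<string, LedgerEntry<T>>();

  // Run `work` for (step, unitId) only if no completed entry exists yet.
  run(
    source: string,
    step: string,
    unitId: string,
    work: (unitId: string) => T
  ): { entry: LedgerEntry<T>; skipped: boolean } {
    const key = `${step}:${unitId}`;
    const existing = this.done.get(key);
    if (existing !== undefined) {
      return { entry: existing, skipped: true }; // completed unit: skip, do not recompute
    }
    const entry: LedgerEntry<T> = { source, step, unitId, output: work(unitId) };
    this.done.set(key, entry);
    return { entry, skipped: false };
  }
}

const demoLedger = new Ledger<string>();
const demoSource = "books:extracted_text:a_new_way_to_think";
const first = demoLedger.run(demoSource, "summarize", "ch01", (id) => `summary of ${id}`);
const second = demoLedger.run(demoSource, "summarize", "ch01", (id) => `summary of ${id}`);
console.log(first.skipped, second.skipped);
// false true
```

Keying on step plus unit means a re-run after a partial failure repeats only the units that never completed.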

Federation spine

The shared cross-repo contracts that publish canonical results:

  • measurement-core: canonical measurement vocabulary, shared with Principia + toolbox.
  • measurement-ingest: publishes records to Principia's canonical registry.
  • library-core: deriveLibraryId work-identity.
  • variableizer: variable normalization.
  • reliability: psychometric coefficients.

The factories

  • Book Factory (collector, organizer, referee): PDF → text → chapter detection → per-chapter summaries (breadth + depth).
  • Article / Research Factory — papers → instruments, constructs, citations, effect data → Principia.

See also

Architecture · Data Contracts & Consumers · Getting Started