
What this is

CanonicAI is the corpus-to-dataset engine: it turns large bodies of unstructured knowledge (books, research papers, domain documents) into canonical, queryable datasets with provenance, via a structured, repeatable generative-AI pipeline. It is the producer in the portfolio — ingest, extraction, registry, and a federation spine — not a public storefront. The consumable surfaces are downstream (Principia, peopleanalyst.com, the toolbox).

Corpus in. Canonical data out. The moat is thousands of multi-step extractions run reliably, idempotently, with lineage, at controlled cost — a production line, not a prompt.
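
To make "idempotently, with lineage" concrete, here is a minimal sketch of one way a pipeline step could be made idempotent: key each step by a hash of its name, input, and parent step, and skip the work when that key has already been recorded. Every name in it (StepCache, StepResult, Lineage, the word_count step) is hypothetical and chosen for illustration; none is taken from the CanonicAI codebase, and the real pipeline may work differently.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration only; not CanonicAI's actual types.
interface Lineage {
  stepName: string;
  inputHash: string; // SHA-256 of the step's input text
  parentKey: string | null; // key of the step that produced the input, if any
}

interface StepResult<T> {
  key: string;
  output: T;
  lineage: Lineage;
}

const sha256 = (text: string): string =>
  createHash("sha256").update(text).digest("hex");

class StepCache {
  private store = new Map<string, StepResult<unknown>>();
  calls = 0; // how many times real extraction work was performed

  run<T>(
    stepName: string,
    input: string,
    parentKey: string | null,
    extract: (input: string) => T,
  ): StepResult<T> {
    const inputHash = sha256(input);
    // Same step + same input + same parent always yields the same key.
    const key = sha256(`${stepName}:${inputHash}:${parentKey ?? ""}`);
    const cached = this.store.get(key);
    if (cached) return cached as StepResult<T>; // idempotent: no rework

    this.calls += 1;
    const result: StepResult<T> = {
      key,
      output: extract(input),
      lineage: { stepName, inputHash, parentKey },
    };
    this.store.set(key, result);
    return result;
  }
}

// Usage: running the same step twice performs the work once.
const cache = new StepCache();
const countWords = (t: string): number => t.split(/\s+/).filter(Boolean).length;
const first = cache.run("word_count", "Corpus in. Canonical data out.", null, countWords);
const second = cache.run("word_count", "Corpus in. Canonical data out.", null, countWords);
```

In this sketch the in-memory Map stands in for whatever durable store a real run would use. The lineage record is what lets a downstream consumer trace an output back to the exact input that produced it.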

Who it's for

  • Operators running the Book Factory / Article Factory pipelines (largely via DevPlane).
  • Downstream consumers (Principia, the toolbox, the PA-site at peopleanalyst.com) that read canonical outputs and the federation-spine contracts.

A note on shape (how this differs from a service repo)

CanonicAI's interface is not a REST/MCP gateway. It consists of CLI pipelines, canonical-output data contracts, and package exports. This docs surface therefore replaces the standard's "API reference" section with a Data Contracts & Consumers section, which documents the product's real interface. (Standard: ~/.claude/DOCUMENTATION-STANDARD.md, under which the interface section flexes to the product's actual surface.)

Documentation tree

Overview ........................ this page
Concepts ........................ concepts.md          (corpus · canonical output · asset registry · provenance · federation spine)
Getting Started ................. getting-started.md   (run the pipeline on one book · registry commands)
Architecture .................... architecture.md      (collector → organizer → referee → outputs · packages · registry)
Data Contracts & Consumers ...... data-contracts.md    (canonical_outputs schema · federation spine · who consumes what)
Trust & Provenance .............. trust-and-provenance.md  (lineage · idempotency · the clean store)
Interface reference (generated) . reference-mcp.generated.md  (sparse by design; see "A note on shape" above)

Source of truth

  • Assets: asset-registry.json (+ packages/core/src/asset-registry.ts) — ~8,491 assets across 6 domains.
  • Capabilities: docs/capability-manifest.json (programs + realization gates).
  • Canonical outputs: canonical_outputs/<book>/… (per-chapter + book_level/book_level_sweep.json).
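
As a small illustration of the canonical-outputs location listed above, the sketch below builds the path to a book's book-level sweep file. It assumes that book_level/book_level_sweep.json sits directly under canonical_outputs/<book>/, which is one reading of the layout above and not a confirmed contract. The function name bookLevelSweepPath and the book id "example-book" are hypothetical.

```typescript
import * as path from "node:path";

// Assumed layout (not confirmed by the source):
//   canonical_outputs/<book>/book_level/book_level_sweep.json
function bookLevelSweepPath(book: string, root = "canonical_outputs"): string {
  // Reject ids that are empty or that could escape the book directory.
  const unsafe =
    book.length === 0 ||
    book === "." ||
    book === ".." ||
    book.includes("/") ||
    book.includes("\\");
  if (unsafe) {
    throw new Error(`invalid book id: ${JSON.stringify(book)}`);
  }
  // posix.join keeps forward slashes on every platform.
  return path.posix.join(root, book, "book_level", "book_level_sweep.json");
}

// Usage with a made-up book id.
const sweepPath = bookLevelSweepPath("example-book");
```

Per-chapter paths are left out on purpose, because the layout above elides that part of the tree.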

See also

Concepts · Getting Started · Architecture