What this is

How CanonicAI is built and how a corpus flows from source to published canonical data.

Shape

A TypeScript producer pipeline (ES2020 / CommonJS, run via ts-node), operated through DevPlane (cd ~/devplane && npm start ~/canonicai). Everything routes through packages/core — the spine. It is a production line, not a service: inputs are corpora, outputs are canonical datasets + federation records.

Data flow (Book Factory)

PDF / source asset
   → collector   (extract text → detect chapters → per-chapter summaries: breadth + depth)
   → organizer   (normalize + structure)
   → referee     (quality scoring + validation)
   → canonical_outputs/<book>/  (per-chapter + book_level_sweep.json, with provenance + ledger)
   → measurement-ingest → Principia canonical registry
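
The first three stages of the flow above can be read as a chain of typed functions. The sketch below is illustrative only: every type and function name is hypothetical, chapter detection is reduced to matching lines that start with "Chapter ", and the "score" is a toy ratio rather than the real referee logic.

```typescript
// Hypothetical types; none of these names come from the CanonicAI codebase.
interface Chapter { title: string; text: string; }
interface Extracted { book: string; chapters: Chapter[]; }
interface Structured { book: string; chapters: { title: string; wordCount: number }[]; }
interface Scored extends Structured { score: number; }

// collector stand-in: split raw text into chapters on "Chapter " heading lines.
function collect(book: string, rawText: string): Extracted {
  const chapters: Chapter[] = [];
  for (const line of rawText.split("\n")) {
    if (line.startsWith("Chapter ")) {
      chapters.push({ title: line.trim(), text: "" });
    } else if (chapters.length > 0) {
      const current = chapters[chapters.length - 1];
      current.text = (current.text + " " + line).trim();
    }
  }
  return { book, chapters };
}

// organizer stand-in: normalize each chapter into a fixed structure.
function organize(input: Extracted): Structured {
  return {
    book: input.book,
    chapters: input.chapters.map((c) => ({
      title: c.title,
      wordCount: c.text.split(/\s+/).filter((w) => w.length > 0).length,
    })),
  };
}

// referee stand-in: toy quality score = share of chapters that have any text.
function referee(input: Structured): Scored {
  const nonEmpty = input.chapters.filter((c) => c.wordCount > 0).length;
  const score = input.chapters.length === 0 ? 0 : nonEmpty / input.chapters.length;
  return { ...input, score };
}

// Each stage consumes the previous stage's output, as in the flow above.
function runBookFactory(book: string, rawText: string): Scored {
  return referee(organize(collect(book, rawText)));
}
```

Because each stage only sees the previous stage's output type, a stage can be re-run or swapped without touching the others.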

Each step is idempotent and registered; re-runs skip completed units.
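
A minimal sketch of how "re-runs skip completed units" might work, assuming a ledger keyed by step name and unit id. StepLedger and runStep are hypothetical names, and the ledger here lives in memory; a real pipeline would persist it between runs.

```typescript
// Hypothetical in-memory ledger; a real pipeline would persist this between runs.
class StepLedger {
  private readonly completed = new Set<string>();

  private key(step: string, unit: string): string {
    return `${step}::${unit}`;
  }

  isDone(step: string, unit: string): boolean {
    return this.completed.has(this.key(step, unit));
  }

  markDone(step: string, unit: string): void {
    this.completed.add(this.key(step, unit));
  }
}

interface StepReport {
  ran: string[];
  skipped: string[];
}

// Runs `work` once per unit, skipping units the ledger already records as done.
function runStep(
  ledger: StepLedger,
  step: string,
  units: string[],
  work: (unit: string) => void,
): StepReport {
  const report: StepReport = { ran: [], skipped: [] };
  for (const unit of units) {
    if (ledger.isDone(step, unit)) {
      report.skipped.push(unit);
      continue;
    }
    work(unit); // if this throws, the unit is not marked done and will be retried
    ledger.markDone(step, unit);
    report.ran.push(unit);
  }
  return report;
}
```

Marking a unit done only after its work returns means a crash mid-run leaves that unit eligible for the next run, which is what makes repeating the step safe.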

Core packages

Package          Role
core             Spine: asset registry, storage resolver, canonical schemas, llm.ts (model abstraction), text loaders
collector        Book extraction (PDF → text → chapters → summaries)
organizer        Normalize + structure extracted data
referee          Quality scoring + validation
orchestrator     Pipeline orchestration
research_agent   Article Factory: papers → instruments, constructs, citations, effect data → Principia

Federation spine (shared cross-repo contracts)

measurement-core (canonical measurement vocabulary, shared with Principia + toolbox) · measurement-ingest (publishes records to Principia) · library-core (deriveLibraryId) · variableizer (variable normalization) · reliability (psychometric coefficients).

Asset registry (the resolver)

asset-registry.json (+ packages/core/src/asset-registry.ts) is the single source of truth: roughly 8,491 assets across 6 domains. All loaders try the registry first ("Strategy 0") and fall back to legacy paths only when that lookup fails. Use getRegistry() / getResolver(); never hand-build paths.
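
The registry-first lookup can be sketched as follows. This is not the real getRegistry() / getResolver() API, whose signatures are not shown here; AssetEntry, Resolution, and resolveAsset are hypothetical, and the file-existence check is injected so the example runs without touching disk.

```typescript
// Hypothetical shapes; the real asset-registry.json schema is not shown in this document.
interface AssetEntry {
  id: string;
  domain: string;
  path: string;
}

interface Resolution {
  path: string;
  strategy: "registry" | "legacy";
}

function resolveAsset(
  id: string,
  registry: Map<string, AssetEntry>,
  legacyCandidates: string[],
  exists: (path: string) => boolean,
): Resolution | undefined {
  // "Strategy 0": the registry entry wins whenever it points at a real file.
  const entry = registry.get(id);
  if (entry !== undefined && exists(entry.path)) {
    return { path: entry.path, strategy: "registry" };
  }
  // Fallback: probe legacy locations in the order given.
  for (const candidate of legacyCandidates) {
    if (exists(candidate)) {
      return { path: candidate, strategy: "legacy" };
    }
  }
  return undefined; // caller decides whether a missing asset is an error
}
```

In a Node process, fs.existsSync could be passed as the exists argument. Returning the strategy alongside the path makes it easy to log how often the legacy fallback is still being hit.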

Substrate

TypeScript + ts-node; Express surface (rest-express); Drizzle (+ drizzle-zod); LLM via @anthropic-ai/sdk + openai, routed through core/src/llm.ts. Heavy ingestion runs on Modal (off-local). Organized storage in ~/meta-factory-storage/.
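
As a rough illustration of routing two LLM SDKs through one module, the sketch below selects a provider from the model id. It does not mirror the real core/src/llm.ts: the interface, the prefix-based routing rule, and the stub providers are all assumptions, and the stubs are synchronous so the example runs offline (real SDK calls would be asynchronous).

```typescript
// Hypothetical provider interface; it does not mirror the real core/src/llm.ts.
type ProviderName = "anthropic" | "openai";

interface LlmProvider {
  name: ProviderName;
  complete(prompt: string): string;
}

// Stub providers stand in for the real SDK clients so the sketch runs offline.
function stubProvider(name: ProviderName): LlmProvider {
  return { name, complete: (prompt: string) => `[${name}] ${prompt}` };
}

// Assumed routing rule: choose the provider from the model id prefix.
function pickProvider(model: string): ProviderName {
  if (model.startsWith("claude")) return "anthropic";
  if (model.startsWith("gpt")) return "openai";
  throw new Error(`No provider registered for model "${model}"`);
}

// Single entry point: callers name a model and never touch an SDK directly.
function complete(
  model: string,
  prompt: string,
  providers: Record<ProviderName, LlmProvider>,
): string {
  return providers[pickProvider(model)].complete(prompt);
}
```

Keeping SDK details behind one entry point means pipeline stages depend only on a model id and a prompt, so switching providers does not ripple through collector, organizer, or referee code.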

Honest scope

Some packages are domain factories that stay in this repo only until their downstream route is ready (e.g. competency_agent → Vela). Trust docs/CANONICAI-SCOPE.md for what is in scope and what has been spun out.

See also

Concepts · Data Contracts & Consumers · Trust & Provenance