What this is
How CanonicAI is built and how a corpus flows from source to published canonical data.
Shape
A TypeScript producer pipeline (ES2020 / CommonJS, run via ts-node), operated through DevPlane (cd ~/devplane && npm start ~/canonicai). Everything routes through packages/core — the spine. It is a production line, not a service: inputs are corpora, outputs are canonical datasets + federation records.
Data flow (Book Factory)
PDF / source asset
→ collector (extract text → detect chapters → per-chapter summaries: breadth + depth)
→ organizer (normalize + structure)
→ referee (quality scoring + validation)
→ canonical_outputs/<book>/ (per-chapter + book_level_sweep.json, with provenance + ledger)
→ measurement-ingest → Principia canonical registry

Each step is idempotent and registered; re-runs skip completed units.
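The skip-on-re-run behaviour can be illustrated with a minimal sketch. It assumes an in-memory ledger keyed by step name and unit id; `Step`, `Ledger`, `runStep`, and the `book-a/ch1` unit id are hypothetical names for illustration, not the pipeline's actual API.

```typescript
// Hypothetical types: none of these names come from the CanonicAI codebase.
type Step<I, O> = { name: string; run: (input: I) => O };

// In-memory stand-in for a ledger that records completed units.
class Ledger {
  private done = new Map<string, unknown>();

  private key(step: string, unitId: string): string {
    return `${step}:${unitId}`;
  }

  has(step: string, unitId: string): boolean {
    return this.done.has(this.key(step, unitId));
  }

  get<T>(step: string, unitId: string): T {
    return this.done.get(this.key(step, unitId)) as T;
  }

  record(step: string, unitId: string, output: unknown): void {
    this.done.set(this.key(step, unitId), output);
  }
}

// Run a step for one unit, or return the recorded output if it already ran.
function runStep<I, O>(ledger: Ledger, step: Step<I, O>, unitId: string, input: I): O {
  if (ledger.has(step.name, unitId)) {
    return ledger.get<O>(step.name, unitId);
  }
  const output = step.run(input);
  ledger.record(step.name, unitId, output);
  return output;
}

// Count invocations so the skip is observable.
let calls = 0;
const summarize: Step<string, string> = {
  name: "collector.summarize",
  run: (text) => {
    calls += 1;
    return text.slice(0, 20);
  },
};

const ledger = new Ledger();
const chapterText = "Chapter one text that is fairly long";
const first = runStep(ledger, summarize, "book-a/ch1", chapterText);
const second = runStep(ledger, summarize, "book-a/ch1", chapterText); // skipped: served from the ledger
```

A persistent ledger (for example, a JSON file next to the outputs) would follow the same check-then-record pattern; the in-memory map just keeps the sketch self-contained.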
Core packages
| Package | Role |
|---|---|
| core | Spine: asset registry, storage resolver, canonical schemas, llm.ts (model abstraction), text loaders |
| collector | Book extraction (PDF → text → chapters → summaries) |
| organizer | Normalize + structure extracted data |
| referee | Quality scoring + validation |
| orchestrator | Pipeline orchestration |
| research_agent | Article Factory: papers → instruments, constructs, citations, effect data → Principia |
Federation spine (shared cross-repo contracts)
measurement-core (canonical measurement vocabulary, shared with Principia + toolbox) · measurement-ingest (publishes records to Principia) · library-core (deriveLibraryId) · variableizer (variable normalization) · reliability (psychometric coefficients).
Asset registry (the resolver)
asset-registry.json (+ packages/core/src/asset-registry.ts) is the single source of truth — ~8,491 assets across 6 domains. All loaders try the registry first ("Strategy 0"), then fall back to legacy paths. Use getRegistry() / getResolver(); never hand-build paths.
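The registry-first lookup order can be sketched as follows. This is an assumption-laden illustration: `AssetEntry`, `resolveAssetPath`, the asset ids, and every path shown are hypothetical, and the real `getRegistry()` / `getResolver()` signatures are not reproduced here.

```typescript
// Hypothetical shapes, for illustration only.
interface AssetEntry {
  id: string;
  domain: string;
  path: string;
}

type Resolution = { path: string; strategy: "registry" | "legacy" };

// Try the registry first ("Strategy 0"), then fall back to a legacy path table.
function resolveAssetPath(
  registry: Map<string, AssetEntry>,
  legacyPaths: Record<string, string>,
  assetId: string,
): Resolution {
  const entry = registry.get(assetId);
  if (entry) {
    return { path: entry.path, strategy: "registry" };
  }
  const legacy = legacyPaths[assetId];
  if (legacy !== undefined) {
    return { path: legacy, strategy: "legacy" };
  }
  throw new Error(`Unknown asset: ${assetId}`);
}

// Made-up registry contents and legacy table.
const registry = new Map<string, AssetEntry>([
  ["example-book", { id: "example-book", domain: "books", path: "books/example-book/source.pdf" }],
]);
const legacyPaths: Record<string, string> = {
  "old-paper": "legacy/papers/old-paper.pdf",
};

const fromRegistry = resolveAssetPath(registry, legacyPaths, "example-book");
const fromLegacy = resolveAssetPath(registry, legacyPaths, "old-paper");
```

Returning the strategy alongside the path makes it easy to log which assets still depend on legacy locations.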
Substrate
TypeScript + ts-node; Express surface (rest-express); Drizzle (+ drizzle-zod); LLM via @anthropic-ai/sdk + openai, routed through core/src/llm.ts. Heavy ingestion runs on Modal (off-local). Organized storage in ~/meta-factory-storage/.
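Routing every model call through one module usually means callers depend on a small interface instead of a vendor SDK. The sketch below shows that shape under stated assumptions: `LlmProvider`, `LlmRouter`, and the stub providers are hypothetical, no real SDK is called, and the actual core/src/llm.ts may be organized quite differently.

```typescript
// Hypothetical interface; not taken from core/src/llm.ts.
interface LlmProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Single entry point: callers name a provider and never touch an SDK directly.
class LlmRouter {
  private providers = new Map<string, LlmProvider>();

  register(provider: LlmProvider): void {
    this.providers.set(provider.name, provider);
  }

  async complete(providerName: string, prompt: string): Promise<string> {
    const provider = this.providers.get(providerName);
    if (!provider) {
      throw new Error(`No provider registered as "${providerName}"`);
    }
    return provider.complete(prompt);
  }
}

// Stub provider that echoes its input; a real one would wrap an SDK client.
const stubProvider = (name: string): LlmProvider => ({
  name,
  complete: async (prompt) => `[${name}] ${prompt}`,
});

const router = new LlmRouter();
router.register(stubProvider("anthropic")); // the keys are illustrative labels
router.register(stubProvider("openai"));

router
  .complete("anthropic", "Summarize chapter 1")
  .then((text) => console.log(text));
```

Keeping the interface this narrow means a provider can be swapped or stubbed in tests without touching pipeline code.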
Honest scope
Some packages are domain factories kept pending a downstream route (e.g. competency_agent → Vela). Trust docs/CANONICAI-SCOPE.md for what's in-scope vs. spun-out.