Architecture
A 10,000-foot view of the kuroi codebase, the data flow through a single
kuroi run, and the invariants the design relies on.
Module map
src/kuroi/
├── cli/                           Typer entry points, one file per subcommand
├── core/                          Pure-ish business logic; no Typer or HTTP imports
│   ├── pdf.py                     PyMuPDF wrappers (word index, redaction)
│   ├── rules.py                   RuleSet / Category dataclasses + YAML loader
│   ├── findings.py                The Finding dataclass: what every detector emits
│   ├── redaction.py               Apply Findings to a PDF (true PyMuPDF redaction)
│   ├── chunking.py                Per-batch orchestrator over Provider calls (retry, progress)
│   ├── instruction_decomposer.py  Split multi-rule --instruct strings (Ollama path)
│   ├── audit.py                   Per-run audit JSONL writer
│   ├── audit_records.py           Audit record dataclasses
│   ├── backup.py                  Pre-redaction backup + retention
│   ├── locks.py                   Advisory lockfiles per output
│   ├── state.py                   Resume-key store
│   ├── config.py                  Config dataclasses + precedence resolver
│   ├── pricing.py                 Per-provider/model token pricing
│   ├── output_resolution.py       -o / --in-place resolution
│   ├── verification.py            kuroi verify implementation
│   ├── diff.py                    kuroi diff implementation
│   └── log.py                     Logging setup
├── providers/                     LLM client implementations
│   ├── base.py                    Provider Protocol
│   ├── factory.py                 Resolve config → Provider instance
│   ├── anthropic.py               Anthropic API client
│   ├── claude_cli.py              Claude CLI client (claude-agent-sdk subscription billing)
│   ├── ollama.py                  Ollama client (httpx → /api/chat)
│   └── _shared.py                 parse_findings_payload helper
├── rules/                         Built-in YAML rule packs
│   └── pii-en.yaml
└── data/                          Static assets (pricing.json)
The directional rule: cli/ may import from core/ and providers/,
but core/ must not import from cli/. providers/ only imports from
core/.
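The directional rule lends itself to a mechanical check. The following is a minimal sketch, not part of kuroi: it assumes absolute imports of the form kuroi.<package>, and the FORBIDDEN table and both function names are hypothetical. The rule does not say whether core/ may import from providers/, so the sketch does not flag that direction.

```python
import ast

# Hypothetical encoding of the rule: for each top-level package under kuroi,
# the sibling packages it must not import. core -> providers is not addressed
# by the rule, so it is not flagged here.
FORBIDDEN = {
    "cli": set(),
    "core": {"cli"},
    "providers": {"cli"},
}


def imported_packages(source: str) -> set[str]:
    """Return the kuroi sub-packages that `source` imports (absolute imports only)."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Join module and name so `from kuroi import cli` is seen as kuroi.cli.
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        for name in names:
            parts = name.split(".")
            if len(parts) >= 2 and parts[0] == "kuroi":
                found.add(parts[1])
    return found


def violations(package: str, source: str) -> set[str]:
    """Packages imported by `source` that `package` must not depend on."""
    return imported_packages(source) & FORBIDDEN[package]
```

Run over every file under src/kuroi/ in a test, a non-empty result from violations would point at the offending import direction.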
Data flow for kuroi run
flowchart TD
A[CLI: parse args, resolve config] --> B[core.locks: acquire output lock]
B --> C[core.pdf: extract word-indexed pages]
C --> D[core.rules: apply regex categories]
C --> E[providers: run LLM judge on llm categories]
D --> F[merge findings, dedupe spans]
E --> F
F --> G[core.backup: snapshot original PDF]
G --> H[core.redaction: write redacted PDF]
H --> I[core.audit: append run record]
I --> J[print summary]
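The merge step (node F) can be pictured with a small sketch. Everything here is an assumption made for illustration: the Finding fields, the dedupe_spans name, and the policy of keeping the first finding per identical span are hypothetical, and the real dataclass lives in core/findings.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Hypothetical fields; the real dataclass lives in core/findings.py.
    page: int
    start_word: int  # index of the first word in the span
    end_word: int    # index one past the last word
    category: str
    source: str      # e.g. "regex" or "llm"


def dedupe_spans(findings: list[Finding]) -> list[Finding]:
    """Keep the first finding for each (page, start_word, end_word) span."""
    seen: set[tuple[int, int, int]] = set()
    unique = []
    for f in findings:
        key = (f.page, f.start_word, f.end_word)
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique


regex_findings = [Finding(0, 4, 6, "email", "regex")]
llm_findings = [
    Finding(0, 4, 6, "email", "llm"),          # same span as the regex hit
    Finding(1, 10, 12, "person_name", "llm"),
]

all_findings = regex_findings + llm_findings   # full list, e.g. for an audit log
to_redact = dedupe_spans(all_findings)         # one entry per distinct span
```

Keeping the undeduplicated list alongside the deduplicated one is one way to redact each span once while still reporting every detector hit.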
Key invariants
- Atomic writes. Redacted output is written to a temp path and renamed; partial files never survive a crash.
- Backup before redact. core.backup always runs before core.redaction; kuroi undo is therefore always safe.
- Audit completeness. Every Finding emitted by any detector is recorded in the audit JSONL with its source field, regardless of whether it was applied to the PDF.
- No PDF mutation in core.pdf. That module only reads. Writes happen exclusively in core.redaction and core.backup.
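The atomic-write invariant follows the usual write-to-temp-then-rename pattern. Below is a minimal standard-library sketch of that pattern; atomic_write_bytes is a hypothetical helper name, and kuroi's own implementation may differ in detail.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(dest: Path, data: bytes) -> None:
    """Write `data` to `dest` so readers see either the old file or the new one."""
    # Create the temp file next to dest so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=dest.parent, prefix=dest.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, dest)  # atomic rename; overwrites an existing file
    except BaseException:
        # Remove the partial temp file; dest itself is left untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Because the final path only ever changes through os.replace, a crash mid-write leaves at most a stray temp file, never a truncated output.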
Where the LLM enters
providers.base.Provider is the only seam between core logic and an LLM.
Providers see word-indexed Page objects and a list of
llm_category_ids — they never touch the PDF binary directly.
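To make the seam concrete, here is a sketch of what such a Protocol could look like. Only the inputs (word-indexed pages and a list of llm_category_ids) come from the description above; the method name find, the Page and Finding fields, and the KeywordProvider stand-in are all hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Page:
    # Hypothetical shape: a page number plus its words in reading order.
    number: int
    words: list[str]


@dataclass(frozen=True)
class Finding:
    # Hypothetical fields, for illustration only.
    page: int
    word_index: int
    category_id: str
    source: str


class Provider(Protocol):
    # The method name `find` is an assumption; only its inputs are documented.
    def find(self, pages: list[Page], llm_category_ids: list[str]) -> list[Finding]: ...


class KeywordProvider:
    """Stand-in provider for tests: flags a fixed word, makes no network calls."""

    def find(self, pages: list[Page], llm_category_ids: list[str]) -> list[Finding]:
        if "secret" not in llm_category_ids:
            return []
        return [
            Finding(page.number, i, "secret", "fake")
            for page in pages
            for i, word in enumerate(page.words)
            if word.lower() == "secret"
        ]


def run_llm_stage(provider: Provider, pages: list[Page], ids: list[str]) -> list[Finding]:
    # Calling code depends only on the Protocol, never on a concrete client.
    return provider.find(pages, ids)
```

A structural Protocol means a test double like KeywordProvider needs no inheritance, and no provider ever receives a file handle or PDF bytes.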
Where to look first
- Adding a new CLI subcommand → cli/__init__.py + a new cli/<name>.py.
- Adding a new detector category → edit rules/pii-en.yaml (regex) or add an LLM-handled category there + adjust the prompt in the appropriate provider.
- Adding a new LLM provider → see Adding an LLM provider.
- Changing the on-disk audit format → core/audit_records.py (versioned schema; bump on breaking changes).
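As an illustration of what a versioned JSONL schema can mean in practice, the sketch below stamps each record with a version and rejects lines written under a different one. The FindingRecord fields, the schema_version key, and both function names are hypothetical; the actual record dataclasses are defined in core/audit_records.py.

```python
import json
from dataclasses import asdict, dataclass

SCHEMA_VERSION = 1  # hypothetical; bump on breaking changes to the record shape


@dataclass(frozen=True)
class FindingRecord:
    # Hypothetical record; the real dataclasses live in core/audit_records.py.
    category: str
    source: str     # which detector emitted the finding
    applied: bool   # whether it was actually redacted in the PDF
    schema_version: int = SCHEMA_VERSION


def to_jsonl_line(record: FindingRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


def parse_jsonl_line(line: str) -> FindingRecord:
    """Parse a line, refusing versions this reader does not understand."""
    data = json.loads(line)
    version = data.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported audit schema version: {version!r}")
    return FindingRecord(**data)
```

Failing loudly on an unknown version is one design choice; a reader could instead dispatch to per-version parsers if old logs must stay readable.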