Writing rule packs¶
A rule pack is a YAML file that lists categories of things to redact
and how each one should be detected. The shipping pack pii-en.yaml
covers common English-language PII; you can write your own for other
languages or domains.
Anatomy of pii-en.yaml¶
name: pii-en
display_name: PII (English)
description: Common personally identifiable information in English text.
version: 1
categories:
  - id: email
    label: Email address
    detection: regex
    confidence: high
    pattern: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
  - id: phone
    label: Phone number
    detection: regex
    confidence: high
    pattern: '\+?\d[\d\s\-]{7,}\d'
  - id: person_name
    label: Person name
    detection: llm
    confidence: medium
    pattern: null
The full schema is documented in the Rule schema reference, which is
auto-generated from the dataclasses in kuroi.core.rules.
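To get a feel for what the two regex categories above match, the patterns can be tried directly with Python's re module. This is a minimal sketch, not kuroi's matcher: the PATTERNS dict and the find_spans helper are hypothetical names used only for illustration, and the sample text is invented.

```python
import re

# Patterns copied from the email and phone categories in pii-en.yaml above.
PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "phone": r"\+?\d[\d\s\-]{7,}\d",
}


def find_spans(text: str) -> list[tuple[str, int, int, str]]:
    """Return (category id, start, end, matched text) tuples, ordered by start offset."""
    spans = []
    for cat_id, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            spans.append((cat_id, match.start(), match.end(), match.group()))
    return sorted(spans, key=lambda span: span[1])


sample = "Contact jane.doe@example.com or +44 20 7946 0958."
for span in find_spans(sample):
    print(span)
```

Running the patterns on a few lines of your own data before adding them to a pack is a cheap way to spot over-matching early.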
Regex vs LLM detection¶
| detection: regex | detection: llm |
|---|---|
| pattern is required. | pattern is null. |
| Match runs locally, no LLM. | The LLM is asked to find spans of this category by id. |
| Best for fixed shapes (emails, IBANs, credit-card numbers). | Best for context-dependent things (person names, addresses, internal codenames). |
| confidence should usually be high. | confidence reflects how trustworthy the LLM judge is for this category. |
A category is regex or llm, never both.
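The rules in the table can be checked before a pack is loaded. The sketch below works on the plain dicts that yaml.safe_load would return for each category; check_category is a hypothetical helper, not part of kuroi, and kuroi's own schema validation may apply different or stricter checks.

```python
import re


def check_category(cat: dict) -> list[str]:
    """Return a list of problems found in one category mapping (empty if none)."""
    problems = []
    cat_id = cat.get("id")
    detection = cat.get("detection")
    pattern = cat.get("pattern")

    if detection == "regex":
        # Regex categories need a pattern, and it has to compile.
        if not pattern:
            problems.append(f"{cat_id}: regex category needs a pattern")
        else:
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append(f"{cat_id}: pattern does not compile ({exc})")
    elif detection == "llm":
        # LLM categories leave the pattern null (or omit it).
        if pattern is not None:
            problems.append(f"{cat_id}: llm category must leave pattern null")
    else:
        problems.append(f"{cat_id}: detection must be 'regex' or 'llm'")
    return problems


print(check_category({"id": "phone", "detection": "regex", "pattern": r"\+?\d[\d\s\-]{7,}\d"}))
print(check_category({"id": "person_name", "detection": "llm", "pattern": "x"}))
```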
Built-in pack aliases¶
kuroi.core.rules._ALIASES maps short names to packaged packs. The
shipping alias pii resolves to pii-en, so --rules pii is equivalent to
--rules pii-en. Add aliases when you contribute a pack with a more
specific name.
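Conceptually, alias resolution is a dictionary lookup that falls back to the name as given. The sketch below illustrates that idea only: ALIASES is a stand-in for kuroi.core.rules._ALIASES, resolve_pack_name is a hypothetical function, and the real lookup in kuroi may be structured differently.

```python
# Stand-in for kuroi.core.rules._ALIASES; the real mapping may hold more entries.
ALIASES = {"pii": "pii-en"}


def resolve_pack_name(name: str) -> str:
    """Return the packaged pack name for an alias, or the name unchanged."""
    return ALIASES.get(name, name)


print(resolve_pack_name("pii"))
print(resolve_pack_name("pii-en"))
```

Both calls yield pii-en, which is why the two --rules spellings behave the same.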
Adding a pack to the kuroi distribution¶
- Drop your YAML in src/kuroi/rules/<name>.yaml.
- (Optional) Register a short alias in _ALIASES if the file name is long.
- Run kuroi run my.pdf --rules <name> -o out.pdf to verify the pack loads. Schema errors raise on load_rule_set(name) with a specific message pointing at the offending field.
- Add a small representative PDF + expected findings in tests/.
Loading a private pack¶
If you don't want to vendor the pack into kuroi, load it from your own code. The simplest path is to construct the dataclasses directly:
from pathlib import Path

import yaml

from kuroi.core.rules import Category, RuleSet


def load_my_pack(path: Path) -> RuleSet:
    """Build a RuleSet from a YAML pack that lives outside the kuroi package."""
    raw = yaml.safe_load(path.read_text(encoding="utf-8"))
    # One Category per entry; pattern is absent or null for llm categories.
    cats = tuple(
        Category(
            id=c["id"],
            label=c["label"],
            detection=c["detection"],
            confidence=c["confidence"],
            pattern=c.get("pattern"),
        )
        for c in raw["categories"]
    )
    return RuleSet(
        name=raw["name"],
        display_name=raw["display_name"],
        description=raw["description"],
        version=int(raw["version"]),
        categories=cats,
    )
Then pass the resulting RuleSet into apply_regex_rules and to your
provider's detect_redactions. See
Using kuroi as a library for the full pipeline.
Per-category model routing¶
Each category accepts an optional model: field. When present, the
chunker dispatches that category's calls against the named model instead
of the default model selected at the top level. The chunker groups by
model and runs each group concurrently per batch, so adding model:
to a few categories does not serialize the run.
categories:
  - id: emails
    label: Email addresses
    detection: llm
    confidence: high
    model: claude-haiku-4-5-20251001  # cheap, plenty for emails
  - id: full_names
    label: Personal names
    detection: llm
    confidence: high
    # no model: → uses the run's default model
The override applies regardless of which provider you choose; pass a model id that the configured provider can serve. See LLM providers → Per-category model routing for an end-user-facing summary.
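The grouping behaviour described in this section can be pictured with a small asyncio sketch. It is an illustration of the idea only, not kuroi's chunker: group_by_model, fake_detect, run_batch, and the default-model placeholder are all hypothetical, and fake_detect stands in for a real provider call.

```python
import asyncio
from collections import defaultdict

DEFAULT_MODEL = "default-model"  # hypothetical placeholder for the run's default model


def group_by_model(categories: list[dict], default_model: str = DEFAULT_MODEL) -> dict[str, list[str]]:
    """Map each model id to the category ids that should be sent to it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for cat in categories:
        groups[cat.get("model") or default_model].append(cat["id"])
    return dict(groups)


async def fake_detect(model: str, category_ids: list[str]) -> tuple[str, list[str]]:
    # Stand-in for a provider call; a real call would send the batch text to `model`.
    await asyncio.sleep(0)
    return model, category_ids


async def run_batch(categories: list[dict]) -> list[tuple[str, list[str]]]:
    groups = group_by_model(categories)
    # One task per model group; gather runs the groups concurrently.
    return await asyncio.gather(*(fake_detect(m, ids) for m, ids in groups.items()))


categories = [
    {"id": "emails", "model": "claude-haiku-4-5-20251001"},
    {"id": "full_names"},  # no model: falls back to the default
]
print(asyncio.run(run_batch(categories)))
```

Because each model group becomes its own task, a batch takes roughly as long as its slowest group rather than the sum of all groups.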
Tips¶
- Keep regex patterns tight. A loose pattern produces false positives the LLM has to undo, which costs tokens.
- Use the confidence field meaningfully: kuroi diff does not yet filter by it, but the audit JSONL records it for downstream use.
- Bump version when the schema changes. The version is recorded in the audit log so you can correlate findings to the pack revision that produced them.
- Test packs against a small representative PDF in tests/.
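As an illustration of the first tip, the sketch below compares the phone pattern from pii-en.yaml with a deliberately loose variant. The loose pattern and the sample sentence are invented for this example.

```python
import re

# Phone pattern from pii-en.yaml: requires a run of at least nine digit-like characters.
TIGHT = r"\+?\d[\d\s\-]{7,}\d"
# Invented loose variant: any digit run with separators, with no minimum length.
LOOSE = r"\d[\d\s\-]*\d"

text = "Invoice 2024-117 was paid; call +1 415 555 0100."

print(re.findall(TIGHT, text))
print(re.findall(LOOSE, text))
```

The tight pattern returns only the phone number, while the loose one also flags the invoice number 2024-117, which is exactly the kind of finding that later has to be filtered out.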