rewrite 2.0.0: real process — extract the algorithm into DMN

The 1.x package was a single ai.extract call wrapped in three BPMN service
tasks. No decision logic, no dmn cornerstone, no weights: the
risk/routing/validation algorithm lived invisibly in host code. There was
nothing for a runtime to actually execute.

2.0.0 makes it a real process:

- dmn cornerstone added with three decision tables:
  * assess-personal-data-risk: PII regex signals -> risk level
  * gdpr-processing-route: risk x centralisation -> CENTRAL/LOCAL,
    anonymisation, redaction level
  * human-validation-gate: confidence thresholds + PII re-scan ->
    REJECTED/PENDING_REVIEW/APPROVED_AUTO
- BPMN expanded 3 -> 6 nodes (3 serviceTask + 3 businessRuleTask), with
  horizontal DI.
- Task ids, mappings, docs, manifest (dmn:true), uapf.yaml, lifecycle and
  eval-set updated; added a PII-bearing fixture.

Only the semantic extraction remains a model step. Risk classification,
GDPR routing and validation gating are now explicit ranked DMN rules:
inspectable, versioned, portable.

Breaking change: structure + outputs.
2026-05-17 20:00:36 +00:00
parent 3f1d62c748
commit dd69a04355
15 changed files with 496 additions and 120 deletions
--- a/README.md
+++ b/README.md
@@ -1,19 +1,68 @@
-# dev.uapf.semantic-document-analysis
+# Semantic Document Analysis

-UAPF v1.1 SSOT-conformant Level 4 process package providing reusable
-semantic document analysis as `Process_SemanticDocumentAnalysis`.
+A UAPF Level-4 process package for extracting VDVC-conformant semantic
+metadata from free-text documents.

-See `docs/00-overview.md` for what it does, `docs/01-eu-ai-act.md` for
-the regulatory analysis, and `docs/02-integration.md` for runtime
-integration notes.
+## What this package is

-## Validates against
+A **real, inspectable process** — not a single AI call in BPMN costume.
+The flow has six executable nodes; three of them are DMN decision tables
+that carry the actual algorithm, with explicit ranked rules and weights.

-Run `jsonschema -i <each yaml-rendered-as-json> <each schema>` against
-the canonical schemas in
-`github.com/UAPFormat/UAPF-specification/schemas/`:
+```
+Start
+  -> [service]  Detect and redact PII          ai.redact@1
+  -> [decision] Assess personal-data risk      DMN assess-personal-data-risk
+  -> [decision] Decide GDPR processing route   DMN gdpr-processing-route
+  -> [service]  Extract semantic metadata      ai.extract@1
+  -> [decision] Determine validation status    DMN human-validation-gate
+  -> [service]  Emit completed event           event.emit@1
+End
+```

- `uapf-manifest.schema.json` (root manifest)
- `ownership.schema.json` (metadata/ownership.yaml)
- `lifecycle.schema.json` (metadata/lifecycle.yaml)
- `resource-mapping.schema.json` (resources/mappings.yaml)
+Only **one** node performs model inference (semantic extraction). PII
+detection, risk classification, GDPR routing and the human-validation
+gate are deterministic — the host cannot make them up.
+
+## The decision tables (dmn/)
+
+### assess-personal-data-risk
+PII regex signals -> `personalDataRisk`. A personas kods (Latvian personal
+code) or IBAN forces HIGH; two or more PII categories, or contact data, gives
+MEDIUM; one category gives LOW; none gives NONE. Hit policy FIRST (ranked).
+
+### gdpr-processing-route
+`personalDataRisk` x `allowCentralization` -> `processingRoute`
+(CENTRAL | LOCAL), `anonymizationRequired`, `redactionLevel`. A
+sensitive document whose owner has not permitted centralisation stays
+LOCAL with full redaction. This is the routing rule lifted out of the
+host's `generate_semantic_metadata`.
+
+### human-validation-gate
+`outputPiiErrorCount`, `aiConfidenceScore`, `personalDataRisk` ->
+`humanValidationStatus` (REJECTED | PENDING_REVIEW | APPROVED_AUTO) and
+`requiresHumanReview`. Any leaked PII or confidence below 0.3 -> REJECTED;
+below 0.7 or HIGH risk -> PENDING_REVIEW; 0.7+ with clean output ->
+APPROVED_AUTO. The thresholds 0.3 / 0.7 are the weights.
+
+## Capabilities required of the host
+
+| Capability     | Used by                | Purpose                          |
+|----------------|------------------------|----------------------------------|
+| ai.redact@1    | Task_DetectRedactPii   | Mask PII + return regex signals  |
+| ai.extract@1   | Task_ExtractSemantics  | VDVC semantic extraction         |
+| event.emit@1   | Task_EmitResult        | Publish completion CloudEvent    |
+
+DMN decisions need no host capability — the runtime evaluates them.
+
+## Output contract
+
+`resources/schemas/vdvc-semantic-summary.schema.json` — the ai.extract@1
+output. The process additionally yields the DMN-decided fields
+(`personalDataRisk`, `processingRoute`, `redactionLevel`,
+`humanValidationStatus`, `requiresHumanReview`).
+
+## Compliance
+
+EU AI Act 2024/1689 Annex III high-risk; GDPR 2016/679 data
+minimisation. See `resources/guardrails.yaml` and `docs/`.
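
Review note: the ranked rules that the README hunk above describes for `assess-personal-data-risk` and `human-validation-gate` can be sketched in Python as ordered rule lists evaluated with a first-hit policy. This is a minimal illustration, not the package's DMN files. The input key `piiCategories`, the category names (`personas_kods`, `iban`, `email`, `phone`) and the helper `evaluate_first` are hypothetical; the README only says "PII regex signals". `requiresHumanReview` is left out because the README does not say how it maps to each status, and `gdpr-processing-route` is left out because only one of its rules is stated.

```python
from typing import Any, Callable, Mapping, Sequence

# A rule pairs a condition with the outputs it yields; a table is an
# ordered sequence of rules, ranked from highest to lowest priority.
Rule = tuple[Callable[[Mapping[str, Any]], bool], dict[str, Any]]


def evaluate_first(rules: Sequence[Rule], ctx: Mapping[str, Any]) -> dict[str, Any]:
    """First-hit evaluation: return the outputs of the first matching rule."""
    for condition, outputs in rules:
        if condition(ctx):
            return dict(outputs)
    raise ValueError("no rule matched; the table is incomplete")


# Hypothetical category names (assumption, not taken from the package).
FORCING = {"personas_kods", "iban"}   # any of these forces HIGH
CONTACT = {"email", "phone"}          # contact data gives MEDIUM

ASSESS_PERSONAL_DATA_RISK: list[Rule] = [
    (lambda c: bool(c["piiCategories"] & FORCING),
     {"personalDataRisk": "HIGH"}),
    (lambda c: len(c["piiCategories"]) >= 2 or bool(c["piiCategories"] & CONTACT),
     {"personalDataRisk": "MEDIUM"}),
    (lambda c: len(c["piiCategories"]) == 1,
     {"personalDataRisk": "LOW"}),
    (lambda c: True,
     {"personalDataRisk": "NONE"}),
]

# Thresholds 0.3 and 0.7 are the ones quoted in the README.
HUMAN_VALIDATION_GATE: list[Rule] = [
    (lambda c: c["outputPiiErrorCount"] > 0 or c["aiConfidenceScore"] < 0.3,
     {"humanValidationStatus": "REJECTED"}),
    (lambda c: c["aiConfidenceScore"] < 0.7 or c["personalDataRisk"] == "HIGH",
     {"humanValidationStatus": "PENDING_REVIEW"}),
    (lambda c: True,
     {"humanValidationStatus": "APPROVED_AUTO"}),
]

if __name__ == "__main__":
    risk = evaluate_first(ASSESS_PERSONAL_DATA_RISK,
                          {"piiCategories": {"iban", "email"}})
    gate = evaluate_first(HUMAN_VALIDATION_GATE,
                          {"outputPiiErrorCount": 0,
                           "aiConfidenceScore": 0.82,
                           **risk})
    print(risk, gate)
```

Rule order carries the ranking: a document with only an email address matches the MEDIUM rule before the single-category LOW rule is reached, and a high-confidence extraction on a HIGH-risk document still lands in PENDING_REVIEW because that rule precedes the catch-all.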