Import UAPF package
rewrite 2.0.0: real process, extract the algorithm into DMN

The 1.x package was a single ai.extract call wrapped in three BPMN
service tasks. No decision logic, no DMN cornerstone, no weights: the
risk/routing/validation algorithm lived invisibly in host code. There
was nothing for a runtime to actually execute.

2.0.0 makes it a real process:

- DMN cornerstone added with three decision tables:
  * assess-personal-data-risk: PII regex signals -> risk level
  * gdpr-processing-route: risk x centralisation -> CENTRAL/LOCAL,
    anonymisation, redaction level
  * human-validation-gate: confidence thresholds + PII re-scan
    -> REJECTED/PENDING_REVIEW/APPROVED_AUTO
- BPMN expanded 3 -> 6 nodes (3 serviceTask + 3 businessRuleTask),
  with horizontal DI.
- Task ids, mappings, docs, manifest (dmn:true), uapf.yaml, lifecycle
  and eval-set updated; added a PII-bearing fixture.

Only the semantic extraction remains a model step. Risk classification,
GDPR routing and validation gating are now explicit ranked DMN rules:
inspectable, versioned, portable. Breaking change: structure + outputs.
README.md
@@ -1,19 +1,68 @@
-# dev.uapf.semantic-document-analysis
+# Semantic Document Analysis
 
-UAPF v1.1 SSOT-conformant Level 4 process package providing reusable
-semantic document analysis as `Process_SemanticDocumentAnalysis`.
+A UAPF Level-4 process package for extracting VDVC-conformant semantic
+metadata from free-text documents.
 
-See `docs/00-overview.md` for what it does, `docs/01-eu-ai-act.md` for
-the regulatory analysis, and `docs/02-integration.md` for runtime
-integration notes.
+## What this package is
 
-## Validates against
+A **real, inspectable process**, not a single AI call in BPMN costume.
+The flow has six executable nodes; three of them are DMN decision tables
+that carry the actual algorithm, with explicit ranked rules and weights.
 
-Run `jsonschema -i <each yaml-rendered-as-json> <each schema>` against
-the canonical schemas in
-`github.com/UAPFormat/UAPF-specification/schemas/`:
+```
+Start
+  -> [service]  Detect and redact PII          ai.redact@1
+  -> [decision] Assess personal-data risk      DMN assess-personal-data-risk
+  -> [decision] Decide GDPR processing route   DMN gdpr-processing-route
+  -> [service]  Extract semantic metadata      ai.extract@1
+  -> [decision] Determine validation status    DMN human-validation-gate
+  -> [service]  Emit completed event           event.emit@1
+End
+```
 
-- `uapf-manifest.schema.json` (root manifest)
-- `ownership.schema.json` (metadata/ownership.yaml)
-- `lifecycle.schema.json` (metadata/lifecycle.yaml)
-- `resource-mapping.schema.json` (resources/mappings.yaml)
+Only **one** node performs model inference (semantic extraction). PII
+detection, risk classification, GDPR routing and the human-validation
+gate are deterministic; the host cannot make them up.
+
+## The decision tables (dmn/)
+
+### assess-personal-data-risk
+PII regex signals -> `personalDataRisk`. Personas kods or IBAN forces
+HIGH; two or more PII categories, or contact data, gives MEDIUM; one
+category LOW; nothing NONE. Hit policy FIRST (ranked).
+
+### gdpr-processing-route
+`personalDataRisk` x `allowCentralization` -> `processingRoute`
+(CENTRAL | LOCAL), `anonymizationRequired`, `redactionLevel`. A
+sensitive document whose owner has not permitted centralisation stays
+LOCAL with full redaction. This is the routing rule lifted out of the
+host's `generate_semantic_metadata`.
+
+### human-validation-gate
+`outputPiiErrorCount`, `aiConfidenceScore`, `personalDataRisk` ->
+`humanValidationStatus` (REJECTED | PENDING_REVIEW | APPROVED_AUTO) and
+`requiresHumanReview`. Any leaked PII or confidence below 0.3 -> REJECTED;
+below 0.7 or HIGH risk -> PENDING_REVIEW; 0.7+ with clean output ->
+APPROVED_AUTO. The thresholds 0.3 / 0.7 are the weights.
+
+## Capabilities required of the host
+
+| Capability     | Used by                | Purpose                          |
+|----------------|------------------------|----------------------------------|
+| ai.redact@1    | Task_DetectRedactPii   | Mask PII + return regex signals  |
+| ai.extract@1   | Task_ExtractSemantics  | VDVC semantic extraction         |
+| event.emit@1   | Task_EmitResult        | Publish completion CloudEvent    |
+
+DMN decisions need no host capability: the runtime evaluates them.
+
+## Output contract
+
+`resources/schemas/vdvc-semantic-summary.schema.json`: the ai.extract@1
+output. The process additionally yields the DMN-decided fields
+(`personalDataRisk`, `processingRoute`, `redactionLevel`,
+`humanValidationStatus`, `requiresHumanReview`).
+
+## Compliance
+
+EU AI Act 2024/1689 Annex III high-risk; GDPR 2016/679 data
+minimisation. See `resources/guardrails.yaml` and `docs/`.
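
The README text in this diff spells out two of the three decision tables fully enough to sketch: `assess-personal-data-risk` and `human-validation-gate`. The Python below is a minimal illustration of how a FIRST hit policy evaluates such ranked rules. It is not the package's DMN and not a DMN engine. The helper `first_hit`, the function names, and the category names (`personas_kods` for the Latvian personal identity code, `iban`, `email`, `phone`) are assumptions made for this example, as is treating email and phone as "contact data". `gdpr-processing-route` is left out because the README states only one of its rules, and `requiresHumanReview` is left out because its mapping is not given.

```python
from typing import Any, Callable, List, Set, Tuple

# One rule = (condition over the inputs, output returned when it matches).
Rule = Tuple[Callable[[Any], bool], str]


def first_hit(rules: List[Rule], inputs: Any) -> str:
    """Evaluate rules top to bottom and return the first matching output."""
    for condition, output in rules:
        if condition(inputs):
            return output
    raise ValueError("no rule matched; the table needs a catch-all rule")


# Hypothetical category names; the package's real signal names may differ.
FORCES_HIGH = {"personas_kods", "iban"}
CONTACT_DATA = {"email", "phone"}

# Ranked rules as described for assess-personal-data-risk.
RISK_RULES: List[Rule] = [
    (lambda cats: bool(cats & FORCES_HIGH), "HIGH"),
    (lambda cats: len(cats) >= 2 or bool(cats & CONTACT_DATA), "MEDIUM"),
    (lambda cats: len(cats) == 1, "LOW"),
    (lambda cats: True, "NONE"),
]

# Ranked rules as described for human-validation-gate (thresholds 0.3 / 0.7).
GATE_RULES: List[Rule] = [
    (lambda d: d["outputPiiErrorCount"] > 0 or d["aiConfidenceScore"] < 0.3,
     "REJECTED"),
    (lambda d: d["aiConfidenceScore"] < 0.7 or d["personalDataRisk"] == "HIGH",
     "PENDING_REVIEW"),
    (lambda d: True, "APPROVED_AUTO"),
]


def assess_personal_data_risk(categories: Set[str]) -> str:
    """Map the set of detected PII categories to a risk level."""
    return first_hit(RISK_RULES, categories)


def human_validation_gate(pii_errors: int, confidence: float, risk: str) -> str:
    """Map the re-scan result, model confidence and risk level to a status."""
    return first_hit(GATE_RULES, {
        "outputPiiErrorCount": pii_errors,
        "aiConfidenceScore": confidence,
        "personalDataRisk": risk,
    })


risk = assess_personal_data_risk({"iban", "email"})
print(risk, human_validation_gate(0, 0.92, risk))  # HIGH PENDING_REVIEW
```

Rule order carries the logic here: the final catch-all in each list only means "clean output with confidence of 0.7 or more" (or "no PII found") because every stricter rule above it has already failed to match.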