diff --git a/docs/security/AMORAL_MODEL_CONTAINMENT_PLAYBOOK.md b/docs/security/AMORAL_MODEL_CONTAINMENT_PLAYBOOK.md
new file mode 100644
index 00000000000..6ab2d0b363e
--- /dev/null
+++ b/docs/security/AMORAL_MODEL_CONTAINMENT_PLAYBOOK.md
@@ -0,0 +1,102 @@
+# Amoral Model Containment Playbook (Black Hat USA 2025 Response)
+
+## Summit Readiness Assertion
+
+This playbook subsumes the failure mode shown in the Black Hat USA 2025 slide into a governed, testable control surface: **unsafe model capability claims are treated as a deterministic containment problem, not a narrative debate**.
+
+## Problem Statement
+
+The referenced slide describes a model that can allegedly assist with:
+
+- misinformation and insider-threat planning,
+- biological/chemical/nuclear weapon ideation,
+- violent/terrorist operational planning,
+- child sexual abuse material or non-consensual sexual exploitation,
+- extremist hate recruitment.
+
+For Summit, these are all **Category-A blocked intents**. They are never fulfilled, never transformed into “better prompts,” and never routed to a weaker model.
+
+## Governed Exception Policy
+
+Legacy “research mode” or “red-team bypass” pathways are reclassified as **Governed Exceptions** and are only permitted when all of the following are present:
+
+1. Signed exception ticket with expiry.
+2. Named human owner in provenance ledger.
+3. Isolated environment with no production credentials.
+4. Full prompt/output telemetry retained.
+5. Automatic rollback trigger on policy drift.
+
+If any condition is absent, execution is denied.
+
+## Control Architecture
+
+### Layer 1 — Intent Compilation Gate (Pre-LLM)
+
+- Parse user input into typed intents.
+- Reject direct weaponization, terror ops, sexual exploitation, hate recruitment, and disinformation facilitation intents.
+- Enforce evidence budget and deterministic query limits for any graph retrieval.
+
+### Layer 2 — Prompt Construction Guard
+
+- Strip or quarantine operational details that increase attacker capability.
+- Require safe reframing templates (defensive, educational, legal, de-escalatory).
+- Ban latent jailbreak payloads in system/developer/user merged prompt views.
+
+### Layer 3 — Model Output Policy Gate
+
+- Score outputs with policy classifiers before release.
+- Hard-block on Category-A unsafe content.
+- Regenerate only under constrained safety templates; no unconstrained retries.
+
+### Layer 4 — Human Escalation + Audit
+
+- High-severity events auto-page governance owner.
+- Store immutable event + rationale + model trace in evidence bundle.
+- Track recurrence rate and mean-time-to-containment as SLOs.
+
+## MAESTRO Security Alignment
+
+- **MAESTRO Layers**: Foundation, Data, Agents, Tools, Observability, Security.
+- **Threats Considered**: prompt injection, policy evasion, goal manipulation, tool abuse, unsafe cross-model fallback.
+- **Mitigations**:
+  - deterministic pre-LLM intent denylist for Category-A harms,
+  - policy-as-code output enforcement,
+  - governed exceptions with expiry + owner + rollback,
+  - full telemetry and anomaly alerts.
+
+## Detection & Response Runbook
+
+### Triage Severity
+
+- **SEV-1**: actionable violent/extremist/weapons/sexual exploitation guidance.
+- **SEV-2**: partial facilitation patterns, encoded operational hints.
+- **SEV-3**: policy probing and jailbreak rehearsal attempts.
+
+### Immediate Response
+
+1. Block response delivery.
+2. Preserve prompt/output artifacts and policy scores.
+3. Open governance incident record with timestamp and owner.
+4. Apply temporary tightened policy thresholds for affected route.
+5. Run regression suite on known adversarial prompts.
+
+## Verification Requirements (Tiered)
+
+- **Tier A**: deterministic deny tests for Category-A prompts.
+- **Tier B**: adversarial mutation tests (encoding, role-play, multilingual jailbreak variants).
+- **Tier C**: production telemetry validation (alerts, false-positive envelope, rollback readiness).
+
+A change cannot merge if Tier A is red.
+
+## Product Positioning
+
+Summit does not compete on permissiveness. Summit competes on **trustworthy capability under governed constraints**:
+
+- defendable safety posture,
+- explainable policy outcomes,
+- reversible operational controls,
+- audit-grade evidence.
+
+## Finality
+
+This threat class is now formally absorbed into Summit’s governed control plane. Any path that enables these capabilities is **out of policy by default** and **blocked pending governance-approved remediation**.
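
Note on the Governed Exception Policy in this patch: the rule "if any condition is absent, execution is denied" can be expressed as a small deterministic check. The sketch below is illustrative only and is not part of the patch. `ExceptionRequest`, its field names, `missing_conditions`, and `is_execution_permitted` are all hypothetical, and treating an already expired ticket as an absent ticket is an assumption the playbook does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ExceptionRequest:
    """Hypothetical record describing one Governed Exception request."""
    ticket_signed: bool
    ticket_expiry: Optional[datetime]   # timezone-aware expiry, or None if unset
    owner: Optional[str]                # named human owner in the provenance ledger
    isolated_environment: bool
    has_production_credentials: bool
    telemetry_retained: bool
    rollback_trigger_armed: bool


def missing_conditions(req: ExceptionRequest, now: datetime) -> List[str]:
    """Return the playbook conditions that are not satisfied (empty list = all present)."""
    missing: List[str] = []
    # Condition 1. Assumption: an expired ticket counts as an absent ticket.
    if not req.ticket_signed or req.ticket_expiry is None or req.ticket_expiry <= now:
        missing.append("signed exception ticket with expiry")
    # Condition 2. A blank owner string is treated as no owner.
    if req.owner is None or not req.owner.strip():
        missing.append("named human owner in provenance ledger")
    # Condition 3. Isolation and absence of production credentials must both hold.
    if not req.isolated_environment or req.has_production_credentials:
        missing.append("isolated environment with no production credentials")
    # Condition 4.
    if not req.telemetry_retained:
        missing.append("full prompt/output telemetry retained")
    # Condition 5.
    if not req.rollback_trigger_armed:
        missing.append("automatic rollback trigger on policy drift")
    return missing


def is_execution_permitted(req: ExceptionRequest, now: datetime) -> bool:
    """Deny by default: permit only when every condition is present."""
    return not missing_conditions(req, now)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    request = ExceptionRequest(
        ticket_signed=True,
        ticket_expiry=None,          # no expiry set, so condition 1 fails
        owner="example-owner",
        isolated_environment=True,
        has_production_credentials=False,
        telemetry_retained=True,
        rollback_trigger_armed=True,
    )
    print(is_execution_permitted(request, now))
    print(missing_conditions(request, now))
```

Returning the list of missing conditions, not just a boolean, keeps the denial explainable, which fits the playbook's stated goal of explainable policy outcomes.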