BrianCLong · BrianCLong · Mar 31, 2026 · gemini-code-assist · Mar 31, 2026 · gemini-code-assist
diff --git a/docs/security/AMORAL_MODEL_CONTAINMENT_PLAYBOOK.md b/docs/security/AMORAL_MODEL_CONTAINMENT_PLAYBOOK.md
@@ -0,0 +1,102 @@
+# Amoral Model Containment Playbook (Black Hat USA 2025 Response)
+
+## Summit Readiness Assertion
+
+This playbook turns the failure mode shown in the Black Hat USA 2025 slide into a governed, testable control surface: **unsafe model capability claims are treated as a deterministic containment problem, not a narrative debate**.
+
+## Problem Statement
+
+The referenced slide describes a model that can allegedly assist with:
+
+- misinformation and insider-threat planning,
+- biological/chemical/nuclear weapon ideation,
+- violent/terrorist operational planning,
+- child sexual abuse material or non-consensual sexual exploitation,
+- extremist hate recruitment.
+
+For Summit, these are all **Category-A blocked intents**. They are never fulfilled, never transformed into “better prompts,” and never routed to a weaker model.
-For Summit, these are all **Category-A blocked intents**. They are never fulfilled, never transformed into “better prompts,” and never routed to a weaker model.
+For Summit, these are all **Category-A blocked intents**. They are never fulfilled, never transformed into “better prompts,” and are terminated before model routing.
+
+## Governed Exception Policy
+
+Legacy “research mode” or “red-team bypass” pathways are reclassified as **Governed Exceptions** and are only permitted when all of the following are present:
+
+1. Signed exception ticket with expiry.
+2. Named human owner in provenance ledger.
+3. Isolated environment with no production credentials.
+4. Full prompt/output telemetry retained.
+5. Automatic rollback trigger on policy drift.
+
+If any condition is absent, execution is denied.
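Review note: the five conditions above read as a pure all-or-nothing predicate, so they can be sketched as a small fail-closed check. This is only an illustration; `ExceptionRequest`, its field names, and `is_exception_permitted` are hypothetical and do not come from the Summit codebase.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ExceptionRequest:
    # Hypothetical fields, one per condition in the Governed Exception Policy.
    ticket_signed: bool
    ticket_expiry: Optional[datetime]
    owner: Optional[str]
    isolated_env: bool
    telemetry_retained: bool
    rollback_trigger_armed: bool


def is_exception_permitted(req: ExceptionRequest, now: datetime) -> bool:
    """Return True only when every condition holds; any absence denies."""
    checks = [
        # 1. Signed ticket that has an expiry and has not yet expired.
        req.ticket_signed
        and req.ticket_expiry is not None
        and req.ticket_expiry > now,
        # 2. Named human owner (non-empty after trimming whitespace).
        bool(req.owner and req.owner.strip()),
        # 3. Isolated environment with no production credentials.
        req.isolated_env,
        # 4. Full prompt/output telemetry retained.
        req.telemetry_retained,
        # 5. Automatic rollback trigger armed.
        req.rollback_trigger_armed,
    ]
    return all(checks)


if __name__ == "__main__":
    now = datetime(2026, 4, 1, tzinfo=timezone.utc)
    req = ExceptionRequest(
        ticket_signed=True,
        ticket_expiry=datetime(2026, 4, 2, tzinfo=timezone.utc),
        owner="jane.doe",
        isolated_env=True,
        telemetry_retained=True,
        rollback_trigger_armed=True,
    )
    print(is_exception_permitted(req, now))
```

Treating an expired ticket the same as a missing one keeps the "deny if any condition is absent" rule to a single code path.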
+
+## Control Architecture
+
+### Layer 1 — Intent Compilation Gate (Pre-LLM)
+
+- Parse user input into typed intents.
+- Reject all Category-A intents: direct weaponization, terror ops, sexual exploitation, hate recruitment, misinformation/disinformation facilitation, and insider-threat planning.
+- Enforce evidence budget and deterministic query limits for any graph retrieval.
- Enforce evidence budget and deterministic query limits for any graph retrieval.
+- Enforce retrieval context budgets and deterministic query limits for any graph retrieval.
+
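Review note: a minimal sketch of what the Layer 1 gate could look like once input has already been parsed into typed intents. The parser itself is out of scope here. The `Intent` members follow the Category-A list in the Problem Statement; `GateDecision`, `MAX_GRAPH_QUERIES`, and the choice to deny when no intent could be parsed are assumptions for illustration, not documented Summit behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Intent(Enum):
    # Hypothetical typed intents; the first five mirror the Category-A list.
    MISINFO_OR_INSIDER_THREAT = "misinfo_or_insider_threat"
    WEAPONIZATION = "weaponization"
    TERROR_OPS = "terror_ops"
    SEXUAL_EXPLOITATION = "sexual_exploitation"
    HATE_RECRUITMENT = "hate_recruitment"
    BENIGN = "benign"


CATEGORY_A = frozenset(i for i in Intent if i is not Intent.BENIGN)

# Assumed deterministic cap on graph retrieval queries per request.
MAX_GRAPH_QUERIES = 5


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    graph_query_budget: int


def intent_gate(intents: Iterable[Intent], requested_queries: int) -> GateDecision:
    """Deny before any model routing if a Category-A intent is present."""
    intents = list(intents)
    # Fail closed: nothing parsed, or any Category-A intent, means deny.
    if not intents or any(i in CATEGORY_A for i in intents):
        return GateDecision(allowed=False, graph_query_budget=0)
    # Clamp the retrieval budget into the range [0, MAX_GRAPH_QUERIES].
    budget = max(0, min(requested_queries, MAX_GRAPH_QUERIES))
    return GateDecision(allowed=True, graph_query_budget=budget)
```

Because the deny path returns before any routing decision is made, a blocked request never reaches a model, which matches the "terminated before model routing" wording suggested above.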
+### Layer 2 — Prompt Construction Guard
+
+- Strip or quarantine operational details that increase attacker capability.
+- Require safe reframing templates (defensive, educational, legal, de-escalatory).
+- Ban latent jailbreak payloads in system/developer/user merged prompt views.
- Ban latent jailbreak payloads in system/developer/user merged prompt views.
+- Detect and strip latent jailbreak payloads in system/developer/user merged prompt views.
+
+### Layer 3 — Model Output Policy Gate
+
+- Score outputs with policy classifiers before release.
+- Hard-block on Category-A unsafe content.
+- Regenerate only under constrained safety templates; no unconstrained retries.
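Review note: one possible reading of the three Layer 3 bullets, sketched as a release function. It assumes the policy classifier returns one of three labels (`"safe"`, `"unsafe"`, `"category_a"`), that Category-A output is never regenerated, and that other unsafe output gets a bounded number of constrained retries. The label names, `gate_output`, and `max_retries` are all hypothetical.

```python
from typing import Callable, Optional


def gate_output(
    output: str,
    classify: Callable[[str], str],
    regenerate_constrained: Callable[[], str],
    max_retries: int = 1,
) -> Optional[str]:
    """Return releasable text, or None when delivery must be blocked."""
    candidate = output
    for attempt in range(max_retries + 1):
        label = classify(candidate)
        if label == "category_a":
            # Hard block: no retry of any kind.
            return None
        if label == "safe":
            return candidate
        if attempt < max_retries:
            # Retry only through the constrained safety template.
            candidate = regenerate_constrained()
    # Retries exhausted without a safe result: block.
    return None
```

The caller supplies `regenerate_constrained`, so the gate has no way to trigger an unconstrained retry.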
+
+### Layer 4 — Human Escalation + Audit
+
+- High-severity events auto-page governance owner.
+- Store immutable event + rationale + model trace in evidence bundle.
+- Track recurrence rate and mean-time-to-containment as SLOs.
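Review note: the two SLOs are easier to hold teams to if their definitions are pinned down. The sketch below assumes mean-time-to-containment is the average of `contained_at - detected_at`, and that recurrence rate is the fraction of incidents whose signature was already seen in an earlier incident. Both definitions, and the `Incident` record, are assumptions rather than something the playbook specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    # Hypothetical record; "signature" identifies the failure pattern.
    signature: str
    detected_at: datetime
    contained_at: datetime


def mean_time_to_containment(incidents: Sequence[Incident]) -> timedelta:
    """Average containment duration; zero when there are no incidents."""
    if not incidents:
        return timedelta(0)
    total = sum(
        (i.contained_at - i.detected_at for i in incidents), timedelta(0)
    )
    return total / len(incidents)


def recurrence_rate(incidents: Sequence[Incident]) -> float:
    """Share of incidents that repeat an earlier incident's signature."""
    if not incidents:
        return 0.0
    seen = set()
    repeats = 0
    for incident in sorted(incidents, key=lambda i: i.detected_at):
        if incident.signature in seen:
            repeats += 1
        seen.add(incident.signature)
    return repeats / len(incidents)
```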
+
+## MAESTRO Security Alignment
+
+- **MAESTRO Layers**: Foundation, Data, Agents, Tools, Observability, Security.
+- **Threats Considered**: prompt injection, policy evasion, goal manipulation, tool abuse, unsafe cross-model fallback.
+- **Mitigations**:
+  - deterministic pre-LLM intent denylist for Category-A harms,
+  - policy-as-code output enforcement,
+  - governed exceptions with expiry + owner + rollback,
+  - full telemetry and anomaly alerts.
+
+## Detection & Response Runbook
+
+### Triage Severity
+
+- **SEV-1**: actionable violent/extremist/weapons/sexual exploitation guidance.
- **SEV-1**: actionable violent/extremist/weapons/sexual exploitation guidance.
+- **SEV-1**: actionable Category-A violations (violence, extremism, weapons, sexual exploitation, misinformation, insider-threat).
+- **SEV-2**: partial facilitation patterns, encoded operational hints.
+- **SEV-3**: policy probing and jailbreak rehearsal attempts.
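Review note: the severity ladder above can be sketched as a highest-match-wins mapping. The three boolean signals and the `triage` function are hypothetical stand-ins for whatever detectors actually produce these findings.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1  # actionable Category-A guidance
    SEV2 = 2  # partial facilitation, encoded operational hints
    SEV3 = 3  # policy probing, jailbreak rehearsal


def triage(
    actionable_category_a: bool,
    partial_facilitation: bool,
    policy_probing: bool,
) -> Optional[Severity]:
    """Return the most severe matching level, or None if nothing matched."""
    if actionable_category_a:
        return Severity.SEV1
    if partial_facilitation:
        return Severity.SEV2
    if policy_probing:
        return Severity.SEV3
    return None
```

Checking the signals in descending severity means an event that matches several levels is always filed under the most severe one.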
+
+### Immediate Response
+
+1. Block response delivery.
+2. Preserve prompt/output artifacts and policy scores.
+3. Open governance incident record with timestamp and owner.
+4. Apply temporary tightened policy thresholds for affected route.
+5. Run regression suite on known adversarial prompts.
+
+## Verification Requirements (Tiered)
+
+- **Tier A**: deterministic deny tests for Category-A prompts.
+- **Tier B**: adversarial mutation tests (encoding, role-play, multilingual jailbreak variants).
+- **Tier C**: production telemetry validation (alerts, false-positive envelope, rollback readiness).
+
+A change cannot merge if Tier A is red.
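Review note: the Tier A merge rule could be enforced with a check along these lines. It is a sketch under stated assumptions: `is_denied` stands in for the real gate, and the fixture strings are opaque placeholder IDs rather than real prompts.

```python
from typing import Callable, Iterable, List


def tier_a_failures(
    is_denied: Callable[[str], bool], fixtures: Iterable[str]
) -> List[str]:
    """Return every Category-A fixture the gate failed to deny."""
    return [fixture for fixture in fixtures if not is_denied(fixture)]


def may_merge(failures: List[str]) -> bool:
    """Tier A is green only when no fixture slipped through."""
    return not failures


if __name__ == "__main__":
    # Placeholder fixture IDs; a real suite would load vetted test cases.
    fixtures = ["fixture:category-a/001", "fixture:category-a/002"]

    def stub_gate(fixture: str) -> bool:
        # Stand-in for the deterministic deny gate under test.
        return fixture.startswith("fixture:category-a/")

    failures = tier_a_failures(stub_gate, fixtures)
    print(may_merge(failures))
```

Returning the list of failing fixtures, rather than a bare boolean, gives the CI log something actionable when Tier A goes red.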
+
+## Product Positioning
+
+Summit does not compete on permissiveness. Summit competes on **trustworthy capability under governed constraints**:
+
+- defendable safety posture,
+- explainable policy outcomes,
+- reversible operational controls,
+- audit-grade evidence.
+
+## Finality
+
+This threat class is now formally absorbed into Summit’s governed control plane. Any path that enables these capabilities is **out of policy by default** and **blocked pending governance-approved remediation**.