docs: add amoral model containment playbook #22283
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,102 @@ | ||||||
| # Amoral Model Containment Playbook (Black Hat USA 2025 Response) | ||||||
|
|
||||||
| ## Summit Readiness Assertion | ||||||
|
|
||||||
| This playbook subsumes the failure mode shown in the Black Hat USA 2025 slide into a governed, testable control surface: **unsafe model capability claims are treated as a deterministic containment problem, not a narrative debate**. | ||||||
|
|
||||||
| ## Problem Statement | ||||||
|
|
||||||
| The referenced slide describes a model that can allegedly assist with: | ||||||
|
|
||||||
| - misinformation and insider-threat planning, | ||||||
| - biological/chemical/nuclear weapon ideation, | ||||||
| - violent/terrorist operational planning, | ||||||
| - child sexual abuse material or non-consensual sexual exploitation, | ||||||
| - extremist hate recruitment. | ||||||
|
|
||||||
| For Summit, these are all **Category-A blocked intents**. They are never fulfilled, never transformed into “better prompts,” and never routed to a weaker model. | ||||||
|
|
||||||
| ## Governed Exception Policy | ||||||
|
|
||||||
| Legacy “research mode” or “red-team bypass” pathways are reclassified as **Governed Exceptions** and are only permitted when all of the following are present: | ||||||
|
|
||||||
| 1. Signed exception ticket with expiry. | ||||||
| 2. Named human owner in provenance ledger. | ||||||
| 3. Isolated environment with no production credentials. | ||||||
| 4. Full prompt/output telemetry retained. | ||||||
| 5. Automatic rollback trigger on policy drift. | ||||||
|
|
||||||
| If any condition is absent, execution is denied. | ||||||
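The all-or-nothing rule above can be sketched as a single check over the five conditions. This is a minimal illustration, not Summit's implementation: the `ExceptionRequest` record, its field names, and the two helper functions are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ExceptionRequest:
    """Hypothetical record; field names are illustrative only."""
    ticket_signed: bool
    ticket_expiry: Optional[datetime]
    owner_in_ledger: Optional[str]
    isolated_no_prod_credentials: bool
    telemetry_retained: bool
    rollback_trigger_armed: bool


def missing_conditions(req: ExceptionRequest, now: datetime) -> list[str]:
    """Return the conditions that are absent, in the playbook's order."""
    missing = []
    # Condition 1: a signed ticket whose expiry is still in the future.
    if not req.ticket_signed or req.ticket_expiry is None or req.ticket_expiry <= now:
        missing.append("signed exception ticket with expiry")
    # Condition 2: a named human owner recorded in the provenance ledger.
    if not req.owner_in_ledger:
        missing.append("named human owner")
    # Condition 3: isolated environment with no production credentials.
    if not req.isolated_no_prod_credentials:
        missing.append("isolated environment")
    # Condition 4: full prompt/output telemetry retained.
    if not req.telemetry_retained:
        missing.append("full prompt/output telemetry")
    # Condition 5: automatic rollback trigger on policy drift.
    if not req.rollback_trigger_armed:
        missing.append("automatic rollback trigger")
    return missing


def is_permitted(req: ExceptionRequest, now: datetime) -> bool:
    """Execution is denied unless every condition is present."""
    return not missing_conditions(req, now)
```

Returning the list of missing conditions, rather than a bare boolean, gives the denial a rationale that can be written to the audit record.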
|
|
||||||
| ## Control Architecture | ||||||
|
|
||||||
| ### Layer 1 — Intent Compilation Gate (Pre-LLM) | ||||||
|
|
||||||
| - Parse user input into typed intents. | ||||||
| - Reject direct weaponization, terror ops, sexual exploitation, hate recruitment, and disinformation facilitation intents. | ||||||
| - Enforce evidence budget and deterministic query limits for any graph retrieval. | ||||||
|
Contributor
The term "evidence budget" is not standard in LLM security. Replacing it with "retrieval context budget" or "token budget" makes the control more actionable for engineering teams.
|
||||||
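A rough sketch of the Layer 1 gate, assuming the input has already been parsed into typed intents by some upstream classifier (not shown). The `Intent` members, `GateDecision`, `intent_gate`, and the default limit of 5 graph queries are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Intent(Enum):
    """Hypothetical typed intents; the first five mirror the denylist above."""
    WEAPONIZATION = "weaponization"
    TERROR_OPS = "terror_ops"
    SEXUAL_EXPLOITATION = "sexual_exploitation"
    HATE_RECRUITMENT = "hate_recruitment"
    DISINFORMATION = "disinformation_facilitation"
    GENERAL_QUESTION = "general_question"


CATEGORY_A = frozenset({
    Intent.WEAPONIZATION,
    Intent.TERROR_OPS,
    Intent.SEXUAL_EXPLOITATION,
    Intent.HATE_RECRUITMENT,
    Intent.DISINFORMATION,
})


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    reason: str


def intent_gate(
    intents: Iterable[Intent],
    graph_queries: int,
    max_graph_queries: int = 5,  # assumed limit, not taken from the playbook
) -> GateDecision:
    """Deterministic pre-LLM check: no model call is made on a denial."""
    blocked = sorted(i.value for i in set(intents) & CATEGORY_A)
    if blocked:
        return GateDecision(False, "category_a:" + ",".join(blocked))
    if graph_queries > max_graph_queries:
        return GateDecision(False, "graph_query_limit_exceeded")
    return GateDecision(True, "ok")
```

A request is denied if any one of its intents is Category-A, even when the other intents are benign; the sorted reason string keeps the decision reproducible for the same input.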
|
|
||||||
| ### Layer 2 — Prompt Construction Guard | ||||||
|
|
||||||
| - Strip or quarantine operational details that increase attacker capability. | ||||||
| - Require safe reframing templates (defensive, educational, legal, de-escalatory). | ||||||
| - Ban latent jailbreak payloads in system/developer/user merged prompt views. | ||||||
|
Contributor
In the "Guard" layer, "Ban" is a policy requirement. Using "Detect and strip" more accurately describes the active mitigation expected at this layer to prevent jailbreak payloads from reaching the model.
|
||||||
|
|
||||||
| ### Layer 3 — Model Output Policy Gate | ||||||
|
|
||||||
| - Score outputs with policy classifiers before release. | ||||||
| - Hard-block on Category-A unsafe content. | ||||||
| - Regenerate only under constrained safety templates; no unconstrained retries. | ||||||
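One way to read the three Layer 3 rules together is as a bounded loop: score the draft, release it if it passes, otherwise regenerate only through a constrained template, and hard-block once the retry budget is spent. The sketch below assumes a classifier that returns a risk score in [0, 1]; `release_or_block`, the 0.5 threshold, and the single-retry default are hypothetical.

```python
from typing import Callable, Optional


def release_or_block(
    draft: str,
    score: Callable[[str], float],
    regenerate_constrained: Callable[[], str],
    threshold: float = 0.5,            # assumed cut-off for Category-A risk
    max_constrained_retries: int = 1,  # assumed retry budget
) -> Optional[str]:
    """Return text that is safe to release, or None for a hard block."""
    candidate = draft
    for attempt in range(max_constrained_retries + 1):
        # Every candidate is scored before release, including regenerated ones.
        if score(candidate) < threshold:
            return candidate
        # The only retry path is the constrained template; the original
        # unconstrained generation is never repeated.
        if attempt < max_constrained_retries:
            candidate = regenerate_constrained()
    return None
```

Passing the classifier and the regeneration function in as arguments keeps the gate itself deterministic and easy to exercise in tests with stub functions.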
|
|
||||||
| ### Layer 4 — Human Escalation + Audit | ||||||
|
|
||||||
| - High-severity events auto-page governance owner. | ||||||
| - Store immutable event + rationale + model trace in evidence bundle. | ||||||
| - Track recurrence rate and mean-time-to-containment as SLOs. | ||||||
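As an illustration of the mean-time-to-containment SLO, the sketch below averages the interval between detection and containment across incidents. The playbook does not define either timestamp precisely, so the `(detected_at, contained_at)` pairing is an assumption; recurrence rate is left out because its denominator is not specified in the text.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_containment(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (contained_at - detected_at) over all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    seconds = mean(
        (contained - detected).total_seconds()
        for detected, contained in incidents
    )
    return timedelta(seconds=seconds)
```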
|
|
||||||
| ## MAESTRO Security Alignment | ||||||
|
|
||||||
| - **MAESTRO Layers**: Foundation, Data, Agents, Tools, Observability, Security. | ||||||
| - **Threats Considered**: prompt injection, policy evasion, goal manipulation, tool abuse, unsafe cross-model fallback. | ||||||
| - **Mitigations**: | ||||||
| - deterministic pre-LLM intent denylist for Category-A harms, | ||||||
| - policy-as-code output enforcement, | ||||||
| - governed exceptions with expiry + owner + rollback, | ||||||
| - full telemetry and anomaly alerts. | ||||||
|
|
||||||
| ## Detection & Response Runbook | ||||||
|
|
||||||
| ### Triage Severity | ||||||
|
|
||||||
| - **SEV-1**: actionable violent/extremist/weapons/sexual exploitation guidance. | ||||||
|
Contributor
The SEV-1 definition should include all Category-A intents (like misinformation and insider-threat planning) to ensure consistent incident response for all critical risks identified in the problem statement.
|
||||||
| - **SEV-2**: partial facilitation patterns, encoded operational hints. | ||||||
| - **SEV-3**: policy probing and jailbreak rehearsal attempts. | ||||||
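The three severity levels can be expressed as an ordered check in which the most severe matching signal wins. This is a sketch only: the boolean flags are hypothetical stand-ins for whatever the policy classifiers actually emit.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def triage(
    actionable_guidance: bool,  # actionable Category-A guidance was produced
    partial_or_encoded: bool,   # partial facilitation or encoded hints
    probing: bool,              # policy probing or jailbreak rehearsal
) -> Optional[Severity]:
    """Return the highest applicable severity, or None if nothing fired."""
    if actionable_guidance:
        return Severity.SEV1
    if partial_or_encoded:
        return Severity.SEV2
    if probing:
        return Severity.SEV3
    return None
```

Checking in descending order of severity means an event that both probes policy and yields actionable guidance is paged as SEV-1, not downgraded.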
|
|
||||||
| ### Immediate Response | ||||||
|
|
||||||
| 1. Block response delivery. | ||||||
| 2. Preserve prompt/output artifacts and policy scores. | ||||||
| 3. Open governance incident record with timestamp and owner. | ||||||
| 4. Apply temporary tightened policy thresholds for affected route. | ||||||
| 5. Run regression suite on known adversarial prompts. | ||||||
|
|
||||||
| ## Verification Requirements (Tiered) | ||||||
|
|
||||||
| - **Tier A**: deterministic deny tests for Category-A prompts. | ||||||
| - **Tier B**: adversarial mutation tests (encoding, role-play, multilingual jailbreak variants). | ||||||
| - **Tier C**: production telemetry validation (alerts, false-positive envelope, rollback readiness). | ||||||
|
|
||||||
| A change cannot merge if Tier A is red. | ||||||
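A Tier A check could look like the following `unittest` sketch, which asserts that every Category-A label is denied. `gate_allows` is a hypothetical stand-in for the real pre-LLM gate, and the label strings are illustrative; a real suite would call the deployed gate with recorded Category-A prompts.

```python
import unittest

# Illustrative labels for the Category-A intents named in the problem statement.
CATEGORY_A_LABELS = (
    "misinformation_insider_threat",
    "cbrn_weapon_ideation",
    "violent_terrorist_planning",
    "sexual_exploitation",
    "extremist_hate_recruitment",
)


def gate_allows(intent_label: str) -> bool:
    """Stand-in for the real gate (hypothetical): deny every Category-A label."""
    return intent_label not in CATEGORY_A_LABELS


class TierADenyTests(unittest.TestCase):
    def test_every_category_a_label_is_denied(self):
        for label in CATEGORY_A_LABELS:
            with self.subTest(label=label):
                self.assertFalse(gate_allows(label))

    def test_benign_label_is_allowed(self):
        self.assertTrue(gate_allows("general_question"))
```

Saved as a module, this can be run with `python -m unittest`; wiring the exit status into the merge check is what would make a red Tier A block the change.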
|
|
||||||
| ## Product Positioning | ||||||
|
|
||||||
| Summit does not compete on permissiveness. Summit competes on **trustworthy capability under governed constraints**: | ||||||
|
|
||||||
| - defendable safety posture, | ||||||
| - explainable policy outcomes, | ||||||
| - reversible operational controls, | ||||||
| - audit-grade evidence. | ||||||
|
|
||||||
| ## Finality | ||||||
|
|
||||||
| This threat class is now formally absorbed into Summit’s governed control plane. Any path that enables these capabilities is **out of policy by default** and **blocked pending governance-approved remediation**. | ||||||
The phrase "never routed to a weaker model" is slightly ambiguous. Since Category-A intents are "never fulfilled," they should be terminated immediately. Explicitly stating that they are blocked before model routing provides a clearer security guarantee.