Closed
102 changes: 102 additions & 0 deletions docs/security/AMORAL_MODEL_CONTAINMENT_PLAYBOOK.md
@@ -0,0 +1,102 @@
# Amoral Model Containment Playbook (Black Hat USA 2025 Response)

## Summit Readiness Assertion

This playbook converts the failure mode shown in the Black Hat USA 2025 slide into a governed, testable control surface: **unsafe model capability claims are treated as a deterministic containment problem, not a narrative debate**.

## Problem Statement

The referenced slide describes a model that can allegedly assist with:

- misinformation and insider-threat planning,
- biological/chemical/nuclear weapon ideation,
- violent/terrorist operational planning,
- child sexual abuse material or non-consensual sexual exploitation,
- extremist hate recruitment.

For Summit, these are all **Category-A blocked intents**. They are never fulfilled, never transformed into “better prompts,” and never routed to a weaker model.

Review comment (Contributor, security-medium):

The phrase "never routed to a weaker model" is slightly ambiguous. Since Category-A intents are "never fulfilled," they should be terminated immediately. Explicitly stating that they are blocked before model routing provides a clearer security guarantee.

Suggested change
For Summit, these are all **Category-A blocked intents**. They are never fulfilled, never transformed into “better prompts,” and never routed to a weaker model.
For Summit, these are all **Category-A blocked intents**. They are never fulfilled, never transformed into “better prompts,” and are terminated before model routing.


## Governed Exception Policy

Legacy “research mode” or “red-team bypass” pathways are reclassified as **Governed Exceptions** and are only permitted when all of the following are present:

1. Signed exception ticket with expiry.
2. Named human owner in provenance ledger.
3. Isolated environment with no production credentials.
4. Full prompt/output telemetry retained.
5. Automatic rollback trigger on policy drift.

If any condition is absent, execution is denied.
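
The all-or-nothing rule above can be sketched as a single check that lists every unmet condition and denies execution unless that list is empty. This is a minimal illustration, not Summit's implementation: the `ExceptionRequest` type, its field names, and the choice to treat an expired ticket as an absent one are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ExceptionRequest:
    # Hypothetical fields, one group per condition in the policy list.
    ticket_signed: bool
    ticket_expiry: Optional[datetime]
    owner_in_ledger: Optional[str]
    isolated_no_prod_credentials: bool
    telemetry_retained: bool
    rollback_trigger_armed: bool


def missing_conditions(req: ExceptionRequest, now: datetime) -> list[str]:
    """Return the unmet conditions; an empty list means execution may proceed."""
    missing = []
    # Assumption: a ticket that has no expiry or has already expired counts as absent.
    ticket_ok = (
        req.ticket_signed
        and req.ticket_expiry is not None
        and req.ticket_expiry > now
    )
    if not ticket_ok:
        missing.append("signed exception ticket with expiry")
    if not req.owner_in_ledger:
        missing.append("named human owner in provenance ledger")
    if not req.isolated_no_prod_credentials:
        missing.append("isolated environment with no production credentials")
    if not req.telemetry_retained:
        missing.append("full prompt/output telemetry retained")
    if not req.rollback_trigger_armed:
        missing.append("automatic rollback trigger on policy drift")
    return missing


def is_permitted(req: ExceptionRequest, now: datetime) -> bool:
    """Deny unless every condition is present."""
    return not missing_conditions(req, now)
```

Returning the list of missing conditions rather than a bare boolean lets the denial be logged with a reason, which fits the audit requirements elsewhere in the playbook.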

## Control Architecture

### Layer 1 — Intent Compilation Gate (Pre-LLM)

- Parse user input into typed intents.
- Reject direct weaponization, terror ops, sexual exploitation, hate recruitment, and disinformation facilitation intents.
- Enforce evidence budget and deterministic query limits for any graph retrieval.

Review comment (Contributor, security-medium):

The term "evidence budget" is not standard in LLM security. Replacing it with "retrieval context budget" or "token budget" makes the control more actionable for engineering teams.

Suggested change
- Enforce evidence budget and deterministic query limits for any graph retrieval.
- Enforce retrieval context budgets and deterministic query limits for any graph retrieval.
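
A pre-LLM gate of the kind Layer 1 describes might look like the sketch below. It assumes input has already been parsed into typed intents (the parser is out of scope here), and the `Intent` enum, the `MAX_GRAPH_QUERIES` value, and the fail-closed handling of an empty intent list are illustrative assumptions rather than anything the playbook specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    # Category-A labels taken from the Layer 1 reject list.
    WEAPONIZATION = "weaponization"
    TERROR_OPS = "terror_ops"
    SEXUAL_EXPLOITATION = "sexual_exploitation"
    HATE_RECRUITMENT = "hate_recruitment"
    DISINFORMATION_FACILITATION = "disinformation_facilitation"
    GENERAL = "general"  # hypothetical catch-all for everything else


CATEGORY_A = frozenset(i for i in Intent if i is not Intent.GENERAL)

MAX_GRAPH_QUERIES = 5  # assumed limit; the playbook does not give a number


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    reason: str


def intent_gate(intents: list[Intent], graph_queries_requested: int) -> GateDecision:
    """Decide, before any model is called, whether the request may proceed."""
    if not intents:
        # Assumption: fail closed when the parser produced no typed intent.
        return GateDecision(False, "no_typed_intent")
    blocked = sorted(i.value for i in intents if i in CATEGORY_A)
    if blocked:
        return GateDecision(False, "category_a_intent:" + ",".join(blocked))
    if graph_queries_requested > MAX_GRAPH_QUERIES:
        return GateDecision(False, "graph_query_limit_exceeded")
    return GateDecision(True, "ok")
```

Because the decision depends only on the typed intents and a fixed limit, the same input always yields the same outcome, which is what makes the deterministic deny tests later in the playbook possible.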


### Layer 2 — Prompt Construction Guard

- Strip or quarantine operational details that increase attacker capability.
- Require safe reframing templates (defensive, educational, legal, de-escalatory).
- Ban latent jailbreak payloads in system/developer/user merged prompt views.

Review comment (Contributor, security-medium):

In the "Guard" layer, "Ban" is a policy requirement. Using "Detect and strip" more accurately describes the active mitigation expected at this layer to prevent jailbreak payloads from reaching the model.

Suggested change
- Ban latent jailbreak payloads in system/developer/user merged prompt views.
- Detect and strip latent jailbreak payloads in system/developer/user merged prompt views.


### Layer 3 — Model Output Policy Gate

- Score outputs with policy classifiers before release.
- Hard-block on Category-A unsafe content.
- Regenerate only under constrained safety templates; no unconstrained retries.
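
One way to read the Layer 3 rules is as a bounded loop: score each candidate, release it only if it passes, and allow regeneration solely through a constrained callback. The sketch below takes that reading; the threshold, the retry cap, and both callback signatures are assumptions, and the classifier itself is left as a parameter.

```python
from typing import Callable, Optional

CATEGORY_A_THRESHOLD = 0.5   # assumed score at or above which output is blocked
MAX_CONSTRAINED_RETRIES = 2  # assumed cap on constrained regenerations


def release_output(
    draft: str,
    score_category_a: Callable[[str], float],
    regenerate_constrained: Callable[[], str],
) -> Optional[str]:
    """Return text that passed the policy gate, or None if the response is blocked.

    `regenerate_constrained` is the only retry path, so no unconstrained
    retry can occur. Every regenerated candidate is scored again.
    """
    candidate = draft
    for attempt in range(MAX_CONSTRAINED_RETRIES + 1):
        if score_category_a(candidate) < CATEGORY_A_THRESHOLD:
            return candidate
        if attempt < MAX_CONSTRAINED_RETRIES:
            candidate = regenerate_constrained()
    return None  # hard block: nothing is released
```

Returning `None` instead of a partially sanitized string keeps the block unambiguous for the caller, which can then hand off to the escalation layer.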

### Layer 4 — Human Escalation + Audit

- High-severity events auto-page governance owner.
- Store immutable event + rationale + model trace in evidence bundle.
- Track recurrence rate and mean-time-to-containment as SLOs.
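
The two SLOs named above could be computed from incident records along the following lines. The playbook does not define either metric precisely, so the definitions here are assumptions: mean-time-to-containment is the average of containment time minus detection time, and recurrence rate is the fraction of incidents that repeat a failure signature already present in the set. The `Incident` type and its `signature` field are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    signature: str          # hypothetical identifier for the failure pattern
    detected_at: datetime
    contained_at: datetime


def mean_time_to_containment(incidents: list[Incident]) -> timedelta:
    """Average of (contained_at - detected_at) across incidents."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((i.contained_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def recurrence_rate(incidents: list[Incident]) -> float:
    """Fraction of incidents that duplicate a signature seen in the same set."""
    if not incidents:
        return 0.0
    counts = Counter(i.signature for i in incidents)
    # Each signature contributes (occurrences - 1) repeats.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(incidents)
```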

## MAESTRO Security Alignment

- **MAESTRO Layers**: Foundation, Data, Agents, Tools, Observability, Security.
- **Threats Considered**: prompt injection, policy evasion, goal manipulation, tool abuse, unsafe cross-model fallback.
- **Mitigations**:
  - deterministic pre-LLM intent denylist for Category-A harms,
  - policy-as-code output enforcement,
  - governed exceptions with expiry + owner + rollback,
  - full telemetry and anomaly alerts.

## Detection & Response Runbook

### Triage Severity

- **SEV-1**: actionable violent/extremist/weapons/sexual exploitation guidance.

Review comment (Contributor, security-medium):

The SEV-1 definition should include all Category-A intents (like misinformation and insider-threat planning) to ensure consistent incident response for all critical risks identified in the problem statement.

Suggested change
- **SEV-1**: actionable violent/extremist/weapons/sexual exploitation guidance.
- **SEV-1**: actionable Category-A violations (violence, extremism, weapons, sexual exploitation, misinformation, insider-threat).

- **SEV-2**: partial facilitation patterns, encoded operational hints.
- **SEV-3**: policy probing and jailbreak rehearsal attempts.

### Immediate Response

1. Block response delivery.
2. Preserve prompt/output artifacts and policy scores.
3. Open governance incident record with timestamp and owner.
4. Apply temporary tightened policy thresholds for affected route.
5. Run regression suite on known adversarial prompts.

## Verification Requirements (Tiered)

- **Tier A**: deterministic deny tests for Category-A prompts.
- **Tier B**: adversarial mutation tests (encoding, role-play, multilingual jailbreak variants).
- **Tier C**: production telemetry validation (alerts, false-positive envelope, rollback readiness).

A change cannot merge if Tier A is red.
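
The merge rule can be expressed as a small predicate for a CI step. This is a sketch under stated assumptions: the `TierStatus` enum and the result mapping are hypothetical, a missing Tier A result is treated as red (fail closed), and Tier B and Tier C do not block here because the playbook only names Tier A as a merge blocker.

```python
from enum import Enum


class TierStatus(Enum):
    GREEN = "green"
    RED = "red"


def can_merge(tier_results: dict[str, TierStatus]) -> bool:
    """Allow a merge only when Tier A is explicitly green.

    Assumption: an absent Tier A result counts as red, so a skipped
    test run cannot slip through.
    """
    return tier_results.get("A") is TierStatus.GREEN
```

For example, `can_merge({"A": TierStatus.GREEN, "B": TierStatus.RED})` allows the merge under this reading, while `can_merge({"B": TierStatus.GREEN})` does not.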

## Product Positioning

Summit does not compete on permissiveness. Summit competes on **trustworthy capability under governed constraints**:

- defendable safety posture,
- explainable policy outcomes,
- reversible operational controls,
- audit-grade evidence.

## Finality

This threat class is now formally absorbed into Summit’s governed control plane. Any path that enables these capabilities is **out of policy by default** and **blocked pending governance-approved remediation**.