docs(security): add Black Hat 2025 misuse playbook and critical routing #22282
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,86 @@ | ||||||
| # Black Hat 2025 Misuse Response Playbook | ||||||
|
|
||||||
| **Status:** Active | ||||||
| **Owner:** Security + AI Platform | ||||||
| **Last Updated:** 2026-03-31 | ||||||
|
|
||||||
| ## Purpose | ||||||
|
|
||||||
| This playbook consolidates the abuse patterns surfaced in the Black Hat 2025 examples and turns them into governed controls for Summit. The objective is simple: detect, block, and audit AI-enabled misuse before it causes operational, reputational, or safety harm. | ||||||
|
|
||||||
| ## Abuse Patterns Covered | ||||||
|
|
||||||
| 1. **Misinformation Campaign Planning (MA/PI):** Prompting models for scalable disinformation operations, legal-gray deployment tactics, and optimization for social chaos. | ||||||
| 2. **Insider “Malicious Compliance” (IN/DP):** Intentional low-quality task execution, behavior capture contamination, or process sabotage disguised as policy adherence. | ||||||
| 3. **Behavioral Model Poisoning (DP/GH):** Attempts to manipulate model adaptation, imitation, or reinforcement loops through intentionally adversarial demonstrations. | ||||||
| 4. **Stealth Exfiltration/Control Signals (PI/TI):** Hidden or out-of-band prompt channels (e.g., “invisible” text patterns) meant to bypass normal review. | ||||||
|
|
||||||
| ## MAESTRO Layers | ||||||
|
|
||||||
| - **Foundation Models:** refusal reliability, harmful capability suppression. | ||||||
| - **Data Operations:** poisoning-resistant ingestion, dataset trust scoring. | ||||||
| - **Agents:** constrained planning policies, bounded autonomy. | ||||||
| - **Tools:** deny unsafe campaign-building actions, signed tool contracts. | ||||||
| - **Infrastructure:** immutable evidence retention, access-segmented runtime. | ||||||
| - **Observability:** abuse telemetry, high-risk prompt alerting. | ||||||
| - **Security & Compliance:** policy-as-code enforcement and exception workflow. | ||||||
|
|
||||||
| ## Policy Decision Table | ||||||
|
|
||||||
| | Scenario | Classification | Default Action | Escalation | | ||||||
| | -------------------------------------------------------------------- | -------------- | ----------------------------------------- | --------------------------------- | | ||||||
| | User asks for mass misinformation planning | Critical MA/PI | **Hard deny** + safe alternative guidance | Security on-call + Trust & Safety | | ||||||
| | Prompt requests sabotage of employer systems or workforce transition | High IN/DP | **Hard deny** + insider-risk warning | Security + HR/legal workflow | | ||||||
| | Content appears to poison behavior/feedback loops | High DP/GH | Quarantine artifact, block learning path | ML security review | | ||||||
| | Hidden instruction channel detected | High PI/TI | Strip, sanitize, and re-run policy checks | SOC triage | | ||||||
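A minimal sketch of how the decision table above might be expressed as policy-as-code. The label strings, the `PolicyRule` shape, and the `decide` helper are hypothetical names chosen for illustration. The fail-closed default for unknown labels is an assumption that the playbook does not state.

```typescript
// Illustrative encoding of the Policy Decision Table. Every identifier here
// is hypothetical; none is taken from the Summit codebase.
type Action = "hard_deny" | "quarantine" | "sanitize_and_recheck";

interface PolicyRule {
  classification: string; // severity plus category codes, as written in the table
  action: Action;
  escalation: string[];
}

const POLICY_TABLE = new Map<string, PolicyRule>([
  ["mass_misinformation_planning", {
    classification: "Critical MA/PI",
    action: "hard_deny",
    escalation: ["Security on-call", "Trust & Safety"],
  }],
  ["insider_sabotage", {
    classification: "High IN/DP",
    action: "hard_deny",
    escalation: ["Security", "HR/legal workflow"],
  }],
  ["behavior_loop_poisoning", {
    classification: "High DP/GH",
    action: "quarantine",
    escalation: ["ML security review"],
  }],
  ["hidden_instruction_channel", {
    classification: "High PI/TI",
    action: "sanitize_and_recheck",
    escalation: ["SOC triage"],
  }],
]);

// Assumption: a label that is missing from the table fails closed.
// The playbook itself does not specify this behavior.
const FAIL_CLOSED: PolicyRule = {
  classification: "Unclassified",
  action: "hard_deny",
  escalation: ["Security on-call"],
};

function decide(label: string): PolicyRule {
  return POLICY_TABLE.get(label) ?? FAIL_CLOSED;
}
```

Keeping the table as data rather than branching logic means a change of default action for one scenario is a one-line edit that is easy to review and audit.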
|
|
||||||
| ## Required Controls (Implementation Contract) | ||||||
|
|
||||||
| 1. **Pre-Generation Risk Classifier** | ||||||
| - Route every high-impact prompt through misuse intent classification. | ||||||
| - Block critical labels before LLM inference. | ||||||
|
|
||||||
| 2. **Generation Guardrails** | ||||||
| - Enforce refusal templates for: misinformation operations, social destabilization, fraud enablement, and insider sabotage. | ||||||
| - Do not provide optimization details, operational sequencing, target segmentation, or evasion instructions. | ||||||
|
|
||||||
| 3. **Post-Generation Safety Validator** | ||||||
| - Verify no output contains campaign planning checklists, adversarial messaging playbooks, or covert manipulation instructions. | ||||||
| - Reject and replace with defensive guidance when violated. | ||||||
|
|
||||||
| 4. **Learning-Loop Isolation** | ||||||
| - Untrusted demonstrations and user interactions are never directly eligible for training/fine-tuning. | ||||||
| - Require signed provenance, trust score threshold, and two-person approval for promotion. | ||||||
|
|
||||||
| 5. **Evidence & Auditability** | ||||||
| - Persist decision artifacts: prompt hash, policy labels, refusal reason, model/version, reviewer trace. | ||||||
| - Retain immutable logs for incident and regulator-grade review. | ||||||
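The promotion rule in control 4 (signed provenance, a trust score threshold, and two-person approval) can be sketched as a single gate. This is an illustration under stated assumptions: the `Candidate` fields, the 0 to 1 trust scale, and the `0.9` threshold are all hypothetical, and signature verification is assumed to have happened upstream.

```typescript
// Hypothetical shape of an artifact proposed for training or fine-tuning.
interface Candidate {
  id: string;
  provenanceSigned: boolean; // assumption: signature was verified upstream
  trustScore: number;        // assumption: normalized to the range 0..1
  approvers: string[];       // reviewer identities that approved promotion
}

// Illustrative value only; the playbook does not give a number.
const TRUST_THRESHOLD = 0.9;

// Returns every reason the candidate cannot be promoted (empty means eligible).
function promotionBlockers(c: Candidate, threshold: number = TRUST_THRESHOLD): string[] {
  const blockers: string[] = [];
  if (!c.provenanceSigned) {
    blockers.push("missing signed provenance");
  }
  // Written as a negated >= so that a NaN score is also rejected.
  if (!(c.trustScore >= threshold)) {
    blockers.push("trust score below threshold");
  }
  // Two-person approval: count distinct identities, not approval events.
  if (new Set(c.approvers).size < 2) {
    blockers.push("needs two distinct approvers");
  }
  return blockers;
}

function canPromote(c: Candidate, threshold: number = TRUST_THRESHOLD): boolean {
  return promotionBlockers(c, threshold).length === 0;
}
```

Returning the list of blockers instead of a bare boolean gives the audit log a concrete refusal reason for each rejected candidate.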
|
|
||||||
| ## Detection Signals | ||||||
|
|
||||||
| - Prompt includes objectives like _maximize chaos_, _fabricate credibility_, _exploit legal gray areas_, _evade platform detection_. | ||||||
| - Requests for synthetic narrative amplification at low cost and high deniability. | ||||||
| - Repeated instructions to preserve malicious actions “off-platform” or “outside monitored channels.” | ||||||
| - Suggestions to seed hidden text/instructions for downstream model contamination. | ||||||
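One of the signals above, hidden text, lends itself to a mechanical check. The sketch below flags a small set of invisible Unicode code points and returns a stripped copy of the prompt. The function name and result shape are hypothetical, and the character set is an illustrative starting point, not a complete inventory of hidden-channel techniques.

```typescript
// Code points treated as invisible in this sketch:
//   U+200B zero-width space, U+200C zero-width non-joiner,
//   U+200D zero-width joiner, U+2060 word joiner, U+FEFF zero-width no-break space,
//   and the Unicode Tags block U+E0000..U+E007F (matched as surrogate pairs).
const INVISIBLE_CHARS = /[\u200B-\u200D\u2060\uFEFF]|\uDB40[\uDC00-\uDC7F]/g;

interface HiddenChannelScan {
  detected: boolean;
  count: number;     // number of invisible code points found
  sanitized: string; // prompt with those code points removed
}

function scanForHiddenChannel(prompt: string): HiddenChannelScan {
  const matches = prompt.match(INVISIBLE_CHARS) ?? [];
  return {
    detected: matches.length > 0,
    count: matches.length,
    sanitized: prompt.replace(INVISIBLE_CHARS, ""),
  };
}
```

A hit is a signal, not proof of intent: zero-width joiners also occur in legitimate emoji sequences and in some scripts, so a production check would likely need an allowlist or a count threshold.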
|
|
||||||
| ## Incident Response Workflow | ||||||
|
|
||||||
| 1. **Contain:** block response, tag session `critical_misuse`, freeze adaptive memory writes. | ||||||
| 2. **Classify:** map event to MAESTRO layers and STRIDE+AI category. | ||||||
| 3. **Notify:** trigger Security + Trust response channel with evidence bundle. | ||||||
| 4. **Eradicate:** patch policy rules, expand signatures, and backtest recent sessions. | ||||||
| 5. **Recover:** re-enable traffic only after false-negative sampling passes threshold. | ||||||
|
Contributor
The phrase "after false-negative sampling passes threshold" is ambiguous. In a security context, you typically want the false-negative rate to be below a maximum allowable threshold. It would be clearer to specify that the rate must fall below the threshold or that the sampling results must meet the required safety criteria.
|
||||||
| 6. **Learn:** publish post-incident control delta in governance ledger. | ||||||
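The Recover step can be made concrete by reading it as "the sampled false-negative rate must not exceed a maximum". The sketch below follows that reading. The `BacktestSample` shape, the function names, and the choice of `<=` over a strict `<` are assumptions, and the maximum rate is left to the caller because the playbook does not give one.

```typescript
// Hypothetical summary of a backtest over recent sessions.
interface BacktestSample {
  confirmedMisuse: number; // sessions that reviewers labelled as misuse
  missed: number;          // of those, sessions the controls failed to block
}

// False-negative rate = missed misuse / all confirmed misuse in the sample.
function falseNegativeRate(s: BacktestSample): number {
  if (s.confirmedMisuse <= 0) {
    throw new Error("sample needs at least one confirmed-misuse session");
  }
  if (s.missed < 0 || s.missed > s.confirmedMisuse) {
    throw new Error("missed must be between 0 and confirmedMisuse");
  }
  return s.missed / s.confirmedMisuse;
}

// Traffic may resume only when the observed rate is at or below the maximum.
function mayReenableTraffic(s: BacktestSample, maxRate: number): boolean {
  return falseNegativeRate(s) <= maxRate;
}
```

For example, 1 miss in 200 confirmed-misuse sessions gives a rate of 0.5%, which passes a 1% maximum, while 5 misses in 200 (2.5%) does not.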
|
|
||||||
| ## Verification Gates | ||||||
|
|
||||||
| - `pnpm lint` and `pnpm typecheck` stay green for any guardrail/policy code change. | ||||||
| - Run `scripts/ci/verify-prompt-integrity.ts` when prompt contracts or guardrail templates change. | ||||||
| - Run `scripts/ci/validate-pr-metadata.ts` for agent metadata and allowed-operation checks. | ||||||
|
|
||||||
| ## Non-Negotiables | ||||||
|
|
||||||
| - No “dual-use optimization” content when misuse intent is present. | ||||||
| - No direct model-improvement ingestion from untrusted behavioral traces. | ||||||
| - No bypass of policy-as-code or audit logging. | ||||||
| - Any exception is formalized as a time-bound **Governed Exception** in the registry. | ||||||
The action for "Hidden instruction channel detected" is currently "Strip, sanitize, and re-run policy checks". Since a hidden channel (e.g., invisible characters, steganography) is a deliberate attempt to bypass security controls, it should be treated with the same severity as other high-risk misuse. Simply stripping the content and proceeding might allow the rest of the prompt to execute, which could still be part of a multi-stage attack. Consider changing the default action to a Hard deny to align with the "High PI/TI" classification and the handling of other high-risk scenarios in this table.