
Jailbreak Attack and Defense Experiments

An experimental Python framework for preparing jailbreak prompts, running them against local large language models, applying defenses, and evaluating model responses in an HPC/PBS environment.

The maintained implementation lives in prompt_attacker/. The main entry point is prompt_attacker/run_orchestrator.py, which reads prompt_attacker/config_orchestrator.yaml, generates PBS job scripts, and submits them with qsub unless dry_run is enabled.

Table of Contents

  • About
  • Prerequisites
  • Setup
  • Models
  • Configuration
  • Basic Usage
  • Project Structure
  • Outputs
  • Responsible Use
  • Contributing
  • Security
  • Citation
  • License

About

This repository is intended for controlled LLM safety experiments. It supports:

  • prepared jailbreak attack datasets stored as JSON files;
  • single-model and batch-model attack execution;
  • Ollama and vLLM inference backends;
  • baseline and custom defense workflows;
  • PBS job generation for MetaCentrum-style clusters;
  • evaluation scripts for generated responses.

The current recommended workflow uses prepared JSON attack files from prompt_attacker/dataset/oponent_show/ and model directories under prompt_attacker/models/.
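A quick way to see what is available is to list the prepared files directly; one JSON file corresponds to one attack, and the _1_cypher stem referenced later in this README is one such file:

# From the repository root: one JSON file per prepared attack.
ls prompt_attacker/dataset/oponent_show/*.json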

Prerequisites

You need:

  • access to a PBS cluster with qsub;
  • Python 3.12 (the dependency set was tested with this version);
  • mamba or conda;
  • CUDA-capable GPU nodes for vLLM runs;
  • local model files for vLLM, or an Ollama installation and pulled Ollama models;
  • enough storage for model directories and generated result JSON files.

On MetaCentrum, the expected module setup is:

module add mambaforge

For interactive vLLM debugging, request a GPU allocation first:

qsub -I -l walltime=8:0:0 -q default@pbs-m1.metacentrum.cz -l select=1:ncpus=1:ngpus=1:mem=200gb:gpu_mem=60gb:scratch_local=400gb

The generated PBS templates already request compatible GPU nodes with gpu_cap=cuda80.
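For orientation, the relevant part of a generated job header can be expected to look roughly like the sketch below; the resource values mirror the interactive request above, the job name is a placeholder, and the actual templates may differ in details:

#PBS -N <job-name>
#PBS -l walltime=8:0:0
#PBS -l select=1:ncpus=1:ngpus=1:mem=200gb:gpu_mem=60gb:scratch_local=400gb:gpu_cap=cuda80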

Setup

Create a fresh environment from the repository root:

module add mambaforge
mamba create -n jailbreak-exp python=3.12
mamba activate jailbreak-exp
python3.12 -m pip install --upgrade pip
python3.12 -m pip install -r requirements.txt
cd prompt_attacker

requirements.txt lists the direct project dependencies instead of a full environment dump. It includes the orchestration tools, vLLM/Ollama backends, PyTorch/Transformers, and the attack/defense utilities used by this codebase.

For vLLM, make sure the installed PyTorch build supports the GPU architecture allocated by PBS. If a job fails with a CUDA architecture error, use the provided PBS templates or request a compatible GPU capability.
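A minimal sanity check for this, assuming the jailbreak-exp environment is active on a GPU node, is to compare the device's compute capability with the CUDA version PyTorch was built against:

python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability(0))"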

Models

Models are not part of the repository. They must be downloaded or prepared before running attacks.

vLLM backend

Use this when use_ollama: false in config_orchestrator.yaml.

Place Hugging Face-compatible model folders under:

prompt_attacker/models/

The folder name should match the model name used in the config. For example:

prompt_attacker/models/falcon3:3b
prompt_attacker/models/gemma3:12b
prompt_attacker/models/llama2:7b

Example download pattern:

cd <repo>/prompt_attacker
mkdir -p models
huggingface-cli download <hf-org>/<hf-model> --local-dir models/<model-name-used-in-config>
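As a concrete, hypothetical instance of the pattern above, downloading a Falcon 3B checkpoint into the folder name from the config examples could look like this; the Hugging Face repo id is an assumption, so substitute the checkpoint you actually intend to use:

cd <repo>/prompt_attacker
huggingface-cli download tiiuae/Falcon3-3B-Instruct --local-dir "models/falcon3:3b"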

Then set:

use_ollama: false
models_dir: "models"
target_model: "falcon3:3b"

For --attack-single, target_model selects one model. For --attack-batch, all model directories in models_dir are used.
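With the example folders above in place, listing models_dir shows exactly what --attack-batch will iterate over:

ls models/
# falcon3:3b  gemma3:12b  llama2:7b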

Ollama backend

Use this when use_ollama: true.

Pull the model before running interactive jobs:

ollama pull gemma3:12b

For generated PBS jobs, the Ollama template starts the local Ollama server and pulls the configured model automatically. For manual interactive runs, start or verify Ollama yourself:

ollama serve

Then set:

use_ollama: true
target_model: "gemma3:12b"
ollama_bin: "ollama"
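Before an interactive run, it is worth confirming that the server is reachable and the model is pulled; the curl check assumes Ollama's default port 11434:

ollama list                              # the configured model, e.g. gemma3:12b, should appear
curl -s http://localhost:11434/api/tags  # returns the model list if ollama serve is running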

Configuration

The main configuration file is:

prompt_attacker/config_orchestrator.yaml

Important keys:

models_dir: "models"
local_model_path: "models"
results_dir: "results/oponent_show"
dataset_to_attack_path: "dataset/oponent_show"

use_ollama: false
dry_run: true
module_command: "module add mambaforge"
conda_env: "jailbreak-exp"
ollama_bin: "ollama"

target_model: "falcon3:3b"
single_attack: "_1_cypher"

Key meanings:

  • models_dir: directory containing one subdirectory per local model.
  • local_model_path: base model path used by single-model vLLM runs.
  • results_dir: output directory for generated jobs and JSON results.
  • dataset_to_attack_path: directory with prepared attack JSON files.
  • use_ollama: false for vLLM/local model folders, true for Ollama.
  • dry_run: true creates job scripts only; false also submits with qsub.
  • module_command: shell command used by generated PBS jobs before activation.
  • conda_env: mamba/conda environment activated inside generated PBS jobs.
  • ollama_bin: Ollama executable used by generated Ollama/evaluation jobs.
  • target_model: model used by --attack-single.
  • single_attack: attack JSON stem used by --attack-single, for example _1_cypher; use all, *, or an empty value to run all prepared attacks.

Advanced users can replace module_command and conda_env with a full custom setup block:

job_env_setup: |
  module add mambaforge
  mamba activate jailbreak-exp

Before large runs, keep dry_run: true, inspect generated job scripts, then set dry_run: false and rerun the same command.
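Put together, a cautious first run could follow this sketch; the paths assume the example results_dir above, run from inside prompt_attacker/:

# 1. With dry_run: true, only generate the job scripts:
python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single

# 2. Inspect what would be submitted:
less results/oponent_show/jobs/*.sh

# 3. Set dry_run: false in config_orchestrator.yaml, then rerun to submit via qsub:
python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single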

Basic Usage

All commands below start from:

cd <repo>/prompt_attacker

List available prepared attacks:

python3 run_orchestrator.py --config config_orchestrator.yaml --list-attacks

Create a dry-run job for one model and one attack:

python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single

Override the selected attack without editing config:

python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single --single-attack _1_cypher

Submit the same job for real by setting dry_run: false in config_orchestrator.yaml and rerunning:

python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single

Run all prepared attacks for all model folders in models_dir:

python3 run_orchestrator.py --config config_orchestrator.yaml --attack-batch

Run a small action directly in the current terminal instead of creating PBS jobs:

python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single --interactive

For --interactive with use_ollama: false, run inside a GPU allocation. The orchestrator checks whether the CUDA driver is visible before starting vLLM.
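You can perform the same check manually before launching; inside a valid GPU allocation, the driver and the allocated device should be listed:

nvidia-smi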

Run defenses:

python3 run_orchestrator.py --config config_orchestrator.yaml --defense ea
python3 run_orchestrator.py --config config_orchestrator.yaml --defense rallm
python3 run_orchestrator.py --config config_orchestrator.yaml --defense llamaguard
python3 run_orchestrator.py --config config_orchestrator.yaml --defense safeguard

Train and apply the rule-tree defense:

python3 run_orchestrator.py --config config_orchestrator.yaml --defense-train
python3 run_orchestrator.py --config config_orchestrator.yaml --defense-apply-rules
python3 run_orchestrator.py --config config_orchestrator.yaml --defense-train-apply

Generate evaluation jobs:

python3 run_orchestrator.py --config config_orchestrator.yaml --evaluate

Show all CLI options:

python3 run_orchestrator.py --help

Project Structure

requirements.txt          Curated Python dependencies for the project
prompt_attacker/
  attacks/                 Attack implementations and shared LLM wrapper
  defense/                 Baseline defenses and rule-tree defense utilities
  evaluate/                Evaluation scripts and selected examples
  scripts/                 Small runners used inside generated PBS jobs
  dataset/oponent_show/    Prepared attack JSON files for demonstration/testing
  models/                  Local model folders; not committed
  results/                 Generated jobs and outputs; not committed
  run_orchestrator.py      Main orchestration CLI
  config_orchestrator.yaml Main configuration file
  steps.txt                Practical local notes for running the orchestrator

The detailed implementation README is maintained in prompt_attacker/README.md.

Outputs

Generated PBS scripts are written under results_dir/jobs/ for single-model runs. Batch runs create per-model result directories:

results_dir/<model>/<attack>.json
results_dir/<model>/jobs/job_onlyattackbatch_<attack>.sh

For --attack-single, outputs go directly to:

results_dir/<attack>.json

If dry_run: true, only job scripts are created. If dry_run: false, the orchestrator also calls qsub.
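As an illustration, after one single-attack run and one batch run with the example config, the results tree would look roughly like this; the single-run script name is not documented above and is shown as a placeholder, and the batch script name assumes the attack stem substitutes verbatim into the template name:

results/oponent_show/
  _1_cypher.json                              # --attack-single output
  jobs/<single-run job script>.sh             # generated single-run PBS script
  falcon3:3b/
    _1_cypher.json                            # --attack-batch output for this model
    jobs/job_onlyattackbatch__1_cypher.sh     # generated batch PBS script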

Responsible Use

This repository is intended for research and defensive evaluation of model safety behavior. Run experiments only on models, datasets, and infrastructure that you are authorized to use. Treat harmful generations as sensitive research outputs and avoid publishing raw outputs unless there is a clear research need and appropriate safeguards.

See SECURITY.md for security and responsible-use reporting guidance.

Contributing

See CONTRIBUTING.md for development setup notes, dry-run recommendations, and contribution guidelines.

Security

See SECURITY.md for responsible-use guidance and reporting recommendations for unsafe defaults, credential leaks, or other security issues.

Citation

If you use this repository in academic work, see CITATION.cff for citation metadata.

License

This project is licensed under the MIT License. See LICENSE for details.
