Code for our research paper, "When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search."
GRACE is a diagnostic workflow for rank-1 activation steering. It uses cheap
geometric statistics computed from contrastive activations — no steering
search needed — to predict where useful steering directions emerge in a model,
how expensive a concept will be to optimize, and which of three vector
constructions (pv, unit_mean, cluster) and search strategies are likely
to help.
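As a rough illustration of what a cheap geometric statistic over contrastive activations can look like, the sketch below averages the pairwise cosine similarity of per-pair difference vectors at a single layer. The function names and toy numbers are hypothetical, and the paper's exact definitions of alignment and granularity may differ from this simplified statistic.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_differences(positive, negative):
    """Per-pair difference vectors: positive[i] - negative[i]."""
    return [[p - n for p, n in zip(pos, neg)]
            for pos, neg in zip(positive, negative)]

def mean_pairwise_cosine(diffs):
    """Average cosine similarity over all unordered pairs of difference vectors."""
    sims = [cosine(diffs[i], diffs[j])
            for i in range(len(diffs))
            for j in range(i + 1, len(diffs))]
    return sum(sims) / len(sims)

# Toy activations at one layer: 3 contrastive pairs, hidden size 2 (made-up numbers).
positive = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
negative = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

diffs = pair_differences(positive, negative)
score = mean_pairwise_cosine(diffs)
print(f"mean pairwise cosine: {score:.3f}")
```

No generation or judging is involved, which is why statistics of this kind are cheap compared with a steering search.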
- grace/: the importable Python package.
  - data/: Concept-data generation (GPT-5 → JSON pairs/questions/rubric).
  - activations/: Residual-stream extraction (prompt-boundary + response-averaged) and per-layer statistics.
  - vectors/: Three vector constructions: pv, unit_mean, cluster.
  - steering/: Inference-time activation steering (forward hook).
  - eval/: Local Gemma-3-12b-it judge by default; optional GPT-4.1-mini and Nova-2-Lite plugins.
  - search/: Grid sweep (§3) and Optuna TPE search (§4).
  - diagnostics/: Alignment 𝒜, granularity G, magnitude CV, PL/RA correlation, per-pair heatmap, ANOVA, and the workflow recommender.
  - analysis/: Cross-concept aggregation helpers used by the notebooks.
- scripts/: 10 CLI entry points, one per pipeline stage.
- configs/: YAML configs for data generation, vector training, search, and eval.
- concepts/gpt-5/: the 20 concept JSONs (5 prompt pairs × 100 extraction + 100 held-out questions).
- results/: pre-computed per-concept statistics, summary CSVs, and Optuna trial histories so the notebooks reproduce all paper figures without re-running the (expensive) generation/judging pipeline.
- notebooks/: six notebooks that produce every figure in the paper.
- docs/: reproducing_the_paper.md, diagnostic_workflow.md, adding_new_concepts.md.
- smoke_test.sh: full mini-pipeline on Gemma-2-2B + Gemma-3-12B judge, two concepts, ~15 min on one GPU.
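For readers new to the technique, the arithmetic behind rank-1 steering can be sketched without any framework. The example assumes the common additive form, where a scaled direction is added to each token's hidden state. The `steer` helper, the scale `alpha`, and the toy numbers are hypothetical; the package applies its update inside a model forward hook, which this sketch does not reproduce.

```python
def steer(hidden_states, direction, alpha):
    """Add alpha * direction to every token's hidden state (additive rank-1 steering)."""
    return [[h + alpha * d for h, d in zip(token, direction)]
            for token in hidden_states]

# Two tokens with hidden size 3 (made-up numbers).
hidden = [[0.5, -1.0, 2.0],
          [1.0, 1.0, 1.0]]
direction = [1.0, 0.0, -1.0]   # hypothetical steering vector
steered = steer(hidden, direction, alpha=2.0)
print(steered)
```

Every token is shifted along the same single direction, which is what makes the intervention rank-1: only the layer and the scale remain to be searched.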
This repository relies on the uv package manager for Python.
uv sync
cp .env.example .env
# edit .env to add HF_TOKEN (required) and OPENAI_API_KEY / AWS_* (optional)

The Bedrock judge is gated behind an extra:
uv sync --extra bedrock

bash smoke_test.sh
Runs the full pipeline (extract → statistics → train all 3 vectors → eval one config → mini grid sweep → mini TPE search → ANOVA → diagnostics → cross-judge) on one GPU in ~15 minutes. It only checks that every stage runs to completion, not numerical reproducibility.
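Because HF_TOKEN is required while the other credentials are optional, a small pre-flight check can catch a missing token before starting a 15-minute run. This is only a sketch: `check_credentials` is a hypothetical helper, not part of the repository, and it assumes the values from .env have already been exported into the process environment.

```python
import os

def check_credentials(env):
    """Split credentials into missing-required and missing-optional lists."""
    missing_required = [] if env.get("HF_TOKEN") else ["HF_TOKEN"]
    missing_optional = []
    if not env.get("OPENAI_API_KEY"):
        missing_optional.append("OPENAI_API_KEY")
    # The README only says "AWS_*", so accept any non-empty variable with that prefix.
    if not any(name.startswith("AWS_") and value for name, value in env.items()):
        missing_optional.append("AWS_*")
    return missing_required, missing_optional

missing_required, missing_optional = check_credentials(os.environ)
if missing_required:
    print("Missing required credentials:", ", ".join(missing_required))
if missing_optional:
    print("Optional credentials not set:", ", ".join(missing_optional))
```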
The notebooks load from results/ so they run without re-doing any
generation. Each notebook header lists the figures it produces:
| Notebook | Paper figures |
|---|---|
| notebooks/01_alignment_predicts_layers.ipynb | Figs. 2, 5, 6 |
| notebooks/02_search_landscape_and_geo_search.ipynb | Figs. 1, 3, 9 |
| notebooks/03_granularity_correlations.ipynb | Figs. 4, 7, 8 |
| notebooks/04_anova_decomposition.ipynb | Appendix Table 6 |
| notebooks/05_grace_diagnostics.ipynb | Figs. 10, 11, 12 |
| notebooks/06_judge_robustness.ipynb | Tables 4, 5 |
See docs/reproducing_the_paper.md for the
full pipeline-from-scratch recipe (extracting activations, training vectors,
running TPE sweeps, etc.).
If you have steering vectors trained on a new concept and want to know
whether to (a) prefer unit_mean/cluster over pv, (b) restrict the
search to the top-15 alignment-ranked layers, or (c) widen the search space
because of representational fragmentation, run:
uv run python scripts/08_run_diagnostics.py \
--model google/gemma-3-27b-it \
--concepts evil sycophantic golden_gate_centric \
--output results/diagnostics/

This reads cached activations, computes the full GRACE diagnostic suite, and
prints a per-concept recommendation. See
docs/diagnostic_workflow.md for details.
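Option (b) above, restricting the search to the top-15 alignment-ranked layers, amounts to a simple ranking step. The sketch below shows that step on made-up per-layer scores with k=3 for brevity. `top_k_layers` is a hypothetical helper and is not the recommender shipped in grace/diagnostics/.

```python
def top_k_layers(alignment_by_layer, k=15):
    """Return the indices of the k layers with the highest alignment, best first."""
    ranked = sorted(range(len(alignment_by_layer)),
                    key=lambda layer: alignment_by_layer[layer],
                    reverse=True)
    return ranked[:k]

# Made-up alignment scores for a 6-layer toy model.
alignment = [0.10, 0.42, 0.35, 0.05, 0.51, 0.29]
search_layers = top_k_layers(alignment, k=3)
print(search_layers)
```

Passing only `search_layers` to a grid or TPE search shrinks the layer dimension of the search space while keeping the layers the statistic ranks highest.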
@misc{robertson2026rank1steeringcheapgeometry,
title = {When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search},
author = {John T. Robertson and Jianing Zhu and Haris Vikalo and Zhangyang Wang},
year = {2026},
eprint = {2605.16362},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2605.16362},
}

- Granularity is the pairwise-cosine formulation G_c(ℓ) = γ_c(ℓ) / 𝒜_c(ℓ), implemented in grace/diagnostics/granularity.py.
- The figure file Images/per-model-gran-steerabilitty.png keeps the typo ("steerabilitty") in its filename to match the paper's \includegraphics reference.
- This repository was originally based on the PersonaVectors codebase.
- Claude Code was used throughout the development of this project.
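
The granularity ratio from the first note above can be written as a per-layer division once γ_c(ℓ) and 𝒜_c(ℓ) are available. The sketch below shows only that final step on made-up values. The `granularity` helper and its `eps` guard against near-zero alignment are assumptions for illustration; how γ and 𝒜 are actually computed is defined in grace/diagnostics/granularity.py and the paper.

```python
def granularity(gamma, alignment, eps=1e-12):
    """Per-layer ratio gamma / alignment; None where alignment is numerically zero."""
    if len(gamma) != len(alignment):
        raise ValueError("gamma and alignment must cover the same layers")
    return [g / a if abs(a) > eps else None
            for g, a in zip(gamma, alignment)]

# Made-up per-layer values for one concept.
gamma = [0.2, 0.3, 0.1]
alignment = [0.4, 0.6, 0.0]
g_by_layer = granularity(gamma, alignment)
print(g_by_layer)
```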