
feat(RL): RL integration layer to support Prime-RL #9131

Open
biswapanda wants to merge 17 commits into main from bis/dynamo-rl

Conversation


@biswapanda biswapanda commented May 5, 2026

Overview

End-to-end RL integration into Dynamo's vLLM backend. Replaces / supersedes #8630.

This PR delivers the design captured in two in-tree planning docs:

  • plans/rl-crate.md — standalone dynamo-rl crate, request-plane fan-out, RlClient SDK, optional /v1/rl/* HTTP facade.
  • plans/weight-transfer-config.md (v5) — single discriminated update_weights body covering full-FT and LoRA across pluggable backends; two-layer worker seam (WeightTransport + EngineAdapter).

Branch: bis/dynamo-rl @ 7c15b6b9cd.

Architecture in three pieces

1. Standalone dynamo-rl crate (lib/rl/)

Dependency direction is dynamo-llm → dynamo-rl → dynamo-runtime. The crate has no dependency on dynamo-llm.

// lib/rl/src/lib.rs (~1.1 k LoC)

pub struct RlClient {
    runtime: Arc<DistributedRuntime>,
    namespace: String,
    rl_endpoint: String,        // default: "rl"
    policy:  FanoutPolicy,
}

impl RlClient {
    pub async fn snapshot(&self) -> Result<MembershipSnapshot>;
    pub async fn describe(&self) -> Result<Vec<RlResponse>>;
    pub async fn pause(&self,           req: PauseRequest)         -> Result<FanoutReport>;
    pub async fn resume(&self,          req: ResumeRequest)        -> Result<FanoutReport>;
    pub async fn init_transport(&self,  req: InitTransportRequest) -> Result<FanoutReport>;
    pub async fn update_weights(&self,  req: UpdateWeightsRequest) -> Result<FanoutReport>;
}

pub struct MembershipSnapshot { pub epoch: u64, pub targets: Vec<WorkerTarget> }
pub struct WorkerTarget { pub namespace: String, pub component: String,
                          pub endpoint: String, pub instance_id: u64 }
// FanoutPolicy field types below are illustrative, not exact signatures:
pub struct FanoutPolicy { pub min_workers: usize,
                          pub membership_timeout: Duration,
                          pub request_timeout: Duration,
                          pub strict_direct: bool,
                          pub abort_on_membership_change: bool,
                          pub component_filter: Option<Vec<String>> }

// HTTP facade is a thin wrapper over the SDK:
pub fn rl_router(deps: RlHttpDeps) -> axum::Router;

The SDK owns: discovery list/watch, per-component request-plane clients, fan-out across all matching rl instances, optional component filters, strict direct addressing, membership epoch/fingerprint checks, empty-fanout behavior, timeout/retry policy, and per-worker success/failure reporting.

Strict direct addressing. For admin RPC (weight update / pause / etc.) we cannot silently fall back to a different instance if the target disappears mid-call — the SDK fails with RlError::MembershipChanged { before_epoch, after_epoch } and forces the caller to reconcile.
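
A conceptual sketch of that guard, rendered in Python for brevity (the real implementation is Rust in lib/rl; client.call_direct and the exception class here are illustrative names, not the SDK's API):

from dataclasses import dataclass

class MembershipChanged(Exception):
    """Python stand-in for RlError::MembershipChanged { before_epoch, after_epoch }."""
    def __init__(self, before_epoch: int, after_epoch: int):
        super().__init__(f"membership changed: epoch {before_epoch} -> {after_epoch}")
        self.before_epoch = before_epoch
        self.after_epoch = after_epoch

@dataclass
class MembershipSnapshot:
    epoch: int
    targets: list

async def fan_out_strict(client, op: str, body: dict) -> list:
    # Snapshot membership, address each live instance directly, then
    # re-snapshot: if the epoch moved, a worker joined or left mid-call
    # and the caller must reconcile instead of trusting a partial fan-out.
    before = await client.snapshot()
    results = [await client.call_direct(t, {"op": op, "body": body})
               for t in before.targets]
    after = await client.snapshot()
    if after.epoch != before.epoch:
        raise MembershipChanged(before.epoch, after.epoch)
    return results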

2. Worker contract — single rl request-plane endpoint

Per rl-crate.md § "Worker contract". When DYN_ENABLE_RL=true, each worker registers exactly one discoverable endpoint per component:

dyn://<DYN_NAMESPACE>.<component>.rl

served by BaseWorkerHandler.rl_dispatch(request) which dispatches on request["op"] to the existing handler methods. Request envelope:

pub enum RlRequest {
    Describe(DescribeRequest),
    Pause(PauseRequest),
    Resume(ResumeRequest),
    InitTransport(InitTransportRequest),
    UpdateWeights(UpdateWeightsRequest),
    LoadLoraAdapter(LoadLoraAdapterRequest),     // legacy migration
    UnloadLoraAdapter(UnloadLoraAdapterRequest), // legacy migration
}

pub struct RlResponse {
    pub status:       RlStatus,
    pub instance_id:  u64,
    pub component:    String,
    pub worker_role:  Option<String>,
    pub message:      Option<String>,
    pub details:      serde_json::Value,
}
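
A minimal, self-contained sketch of the dispatch shape (the handler bodies are stubs, not the real BaseWorkerHandler methods; the {op, body} envelope matches what the frontend fan-out sends, see step 5 below):

# Sketch of the single-endpoint dispatcher registered at <ns>.<comp>.rl.
class RlDispatchMixin:
    async def pause(self, body: dict) -> dict:
        return {"status": "ok"}                                   # stub

    async def update_weights(self, body: dict) -> dict:
        return {"status": "ok",
                "details": {"version": body.get("version")}}      # stub

    async def rl_dispatch(self, request: dict) -> dict:
        ops = {
            "pause": self.pause,
            "update_weights": self.update_weights,
            # resume / init_transport / describe / legacy LoRA ops likewise
        }
        handler = ops.get(request.get("op"))
        if handler is None:
            return {"status": "error",
                    "message": f"unknown op: {request.get('op')}"}
        return await handler(request.get("body", {}))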

The legacy per-RL-op register_engine_route("pause_generation"…) HTTP-on-system-port mechanism and the DYN_RL_WORKER_SYSTEM_URLS static URL list are gone from the steady-state path. Discovery is the only source of truth for "which workers are live".

3. Discovery + fan-out — the request plane

Frontend fan-out flow (dynamo_rl::RlState::fan_out):

  1. Build a Client for <ns>.<comp>.rl from the runtime.
  2. client.wait_for_instances() (5 s bound) so the etcd watcher populates before dispatch.
  3. client.instance_ids() → list of live workers.
  4. PushRouter::<Value, Annotated<Value>>::from_client(client, RouterMode::Direct).
  5. Per instance: router.direct(SingleIn::new({op, body}), instance_id), drain first Annotated.data.

NATS / shared TCP is the wire. No reqwest::Client, no system-port HTTP.

HTTP facade — /v1/rl/*

Per weight-transfer-config.md v5 + rl-crate.md § "HTTP facade". Mounted on a dedicated listener (port DYN_RL_PORT / --rl-port, default 8002) when DYN_ENABLE_RL_ENDPOINTS=true. Inference plane on :8000 is untouched.

POST /v1/rl/pause
POST /v1/rl/resume

POST /v1/rl/init_transport       # idempotent backend setup (filesystem no-op; nccl bootstrap)
POST /v1/rl/update_weights       # discriminated body covering full-FT + LoRA

POST /v1/rl/load_lora_adapter    # temporary legacy alias (removed after prime-rl migration)
POST /v1/rl/unload_lora_adapter  # temporary legacy alias
GET  /v1/rl/health               # optional temporary alias to /live

After prime-rl migrates: the two LoRA aliases collapse into update_weights {target.kind="lora"}; /v1/rl/health is removed or stays as a trivial alias to the frontend's existing /live. No read-side state endpoint is added until a real caller appears (deferred design captured in plans/rl-state-endpoint.md).
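
For orientation, a minimal trainer-side sketch of driving the facade over HTTP. It assumes a local frontend on the default DYN_RL_PORT (8002); the checkpoint path is hypothetical, and the bodies follow the discriminated shape detailed in the next section:

import requests

RL_BASE = "http://localhost:8002/v1/rl"   # frontend with the default --rl-port

def update_weights_full_ft(version: str, path: str) -> dict:
    # Full-FT base reload over the filesystem backend.
    body = {
        "version": version,
        "target": {"kind": "base"},
        "transport": {"backend": "filesystem",
                      "filesystem": {"path": path, "require_marker": "STABLE"}},
    }
    resp = requests.post(f"{RL_BASE}/update_weights", json=body, timeout=120)
    resp.raise_for_status()
    return resp.json()

# Typical training-step sequence: pause -> update_weights -> resume.
requests.post(f"{RL_BASE}/pause", json={"mode": "keep"},   # 3-mode pause body
              timeout=30).raise_for_status()
print(update_weights_full_ft("step_42", "/ckpts/step_42"))  # hypothetical path
requests.post(f"{RL_BASE}/resume", json={}, timeout=30).raise_for_status()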

WeightTransferConfig — update_weights wire shape

Per weight-transfer-config.md v5. Single discriminated body covering both full-FT and LoRA. No legacy {weight_dir, weight_version, reset_prefix_cache} form.

// 1. full-FT base reload
POST /v1/rl/update_weights
{ "version":"step_42",
  "target":  {"kind":"base"},
  "transport":{"backend":"filesystem","filesystem":{"path":"…/step_42","require_marker":"STABLE"}} }

// 2. NCCL collective broadcast (after init_transport)
POST /v1/rl/update_weights
{ "version":"step_42",
  "target":  {"kind":"base"},
  "transport":{"backend":"nccl","nccl":{"transport_id":"rl-weights-step"}} }

// 3. LoRA load / swap (same endpoint)
POST /v1/rl/update_weights
{ "version":"step_42",
  "target":  {"kind":"lora","name":"adapter-A","op":"load"},
  "transport":{"backend":"filesystem","filesystem":{"path":"…/step_42"}} }

// 4. LoRA unload — transport optional
POST /v1/rl/update_weights
{ "version":"step_42",
  "target":  {"kind":"lora","name":"adapter-A","op":"unload"} }

init_transport is idempotent. filesystem treats it as a no-op. nccl performs the worker-side init_weight_transfer_engine collective bootstrap and returns once every rank has joined the StatelessProcessGroup.
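
A worker-side sketch of that idempotency contract (illustrative only; the real nccl path goes through the init_weight_transfer_engine collective_rpc, stubbed here as a hypothetical _bootstrap helper):

# Idempotent init: filesystem's is a true no-op, nccl's bootstraps the
# collective exactly once and reports success on every later call.
class NcclTransportSketch:
    backend_id = "nccl"

    def __init__(self) -> None:
        self._initialized = False

    async def init(self, ctx, cfg) -> dict:
        if self._initialized:
            return {"status": "ok", "details": {"already_initialized": True}}
        await self._bootstrap(ctx, cfg)   # joins the StatelessProcessGroup
        self._initialized = True
        return {"status": "ok"}

    async def _bootstrap(self, ctx, cfg) -> None:
        ...  # hypothetical stand-in for the real collective bootstrap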

In-scope this iteration: filesystem, nccl. Future (deferred): nixl, model_express, ipc. The trait shape is designed so each future backend drops in as one extra WeightTransport subclass — see § "WeightTransport seam" below.

WeightTransport seam — pluggable backends

Per weight-transfer-config.md v5 § "Trait on the worker side". Two-layer seam:

# components/src/dynamo/vllm/weight_transports/base.py
class WeightTransport(Protocol):
    backend_id: str

    async def init(self, ctx: InitCtx, cfg: BackendConfig) -> InitResult: ...
    async def update_weights(self, req: UpdateWeightsRequest) -> UpdateResult: ...
    async def teardown(self) -> None: ...

    @property
    def state(self) -> TransportState: ...

# Engine-flavor seam, separate:
class EngineAdapter(Protocol):
    async def update_weights_from_disk(...) -> Result: ...
    async def update_weights_from_distributed(...) -> Result: ...
    async def update_weights_from_tensor(...) -> Result: ...
    async def update_weights_from_ipc(...) -> Result: ...
    async def add_lora(...) -> Result: ...
    async def remove_lora(...) -> Result: ...

Mapping (concern → owner):

  • HTTP wire shape → dynamo_rl::handlers::*
  • backend orchestration (path / nccl group / future nixl-mx poll-and-recv) → WeightTransport impl
  • engine-flavor invocation (vLLM collective_rpc(...), future sglang tokenizer_manager.update_weights_from_*) → EngineAdapter impl

Adding a new backend = one WeightTransport subclass. Adding sglang = one EngineAdapter subclass. Axes don't multiply.

This iteration ships:

components/src/dynamo/vllm/weight_transports/
├── base.py            # WeightTransport Protocol, EngineAdapter Protocol, request types
├── engine_adapter.py  # VllmEngineAdapter — single shim over engine_client.collective_rpc(...)
├── filesystem.py      # FilesystemTransport (Phase 1 default; safetensors + STABLE marker)
├── nccl.py            # NcclTransport     (Phase 4; init_weight_transfer_engine + update_weights)
└── __init__.py        # build_transport(backend, engine_adapter, cfg) factory
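
A plausible shape for that factory, with the transport classes stubbed (only the backend-id-to-class mapping is the point; the real constructors live in filesystem.py and nccl.py):

# One WeightTransport impl per backend id; the engine flavor is injected
# separately via the EngineAdapter, so backends and engines don't multiply.
class FilesystemTransport:
    backend_id = "filesystem"
    def __init__(self, engine_adapter, cfg): ...

class NcclTransport:
    backend_id = "nccl"
    def __init__(self, engine_adapter, cfg): ...

def build_transport(backend: str, engine_adapter, cfg):
    transports = {cls.backend_id: cls
                  for cls in (FilesystemTransport, NcclTransport)}
    try:
        return transports[backend](engine_adapter, cfg)
    except KeyError:
        raise ValueError(f"unknown weight-transfer backend: {backend}") from None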

LoRA folds into update_weights {target.kind="lora", op="load|swap|unload"}. The internal handlers (vllm_engine.add_lora vs collective_rpc("reload_weights")) stay separate; the wire surface is a single endpoint.

TITO on /v1/chat/completions

No separate URI. Pre-tokenized input on the standard chat-completions request:

{
  "model": "...",
  "messages": [],            // empty allowed when prompt_token_ids is set
  "prompt_token_ids":        [...],
  "stop_token_ids":          [...],
  "allowed_token_ids":       [...],
  "bad_words_token_ids":     [...],
  "truncate_prompt_tokens":  N,
  "nvext": { "return_token_ids": true, "return_prompt_logprobs": true, ... }
}
  • validate.rs::PASSTHROUGH_EXTRA_FIELDS allowlists these plus weight_version, return_routed_experts, cache_salt.
  • OpenAIStopConditionsProvider::get_stop_token_ids() returns typed Result<Option<Vec<TokenIdType>>> so malformed input (stop_token_ids: "not-an-array") surfaces a 400 instead of being silently dropped.
  • extract_stop_conditions plumbs into common::StopConditions::stop_token_ids_hidden, honored end-to-end by vLLM SamplingParams.
  • NvExtProvider::get_pretokenized_input() reads from both legacy nvext.token_data and the new top-level prompt_token_ids so existing renderer-mode callers keep working.
  • apply_template short-circuits when pre-tokenized input is present (no MiniJinja "undefined value" error on empty messages).
  • chat_completions/delta.rs, completions/delta.rs: nvext build paths replaced silent if let Ok(json) = … with match + tracing::warn! on Err. A dropped nvext means promoted token IDs / weight version never reach the trainer — silent-corruption class bug, now logged.
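
Putting the pieces together, a hedged client-side sketch of a TITO call. The token IDs are illustrative, and the exact response location of the completion token IDs (under each choice's nvext, per the notes above) is an assumption:

import requests

body = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [],                       # allowed: prompt_token_ids is set
    "prompt_token_ids": [101, 102, 103],  # illustrative token IDs
    "stop_token_ids": [104],
    "nvext": {"return_token_ids": True},
}
resp = requests.post("http://localhost:8000/v1/chat/completions",
                     json=body, timeout=60)
resp.raise_for_status()
choice = resp.json()["choices"][0]
print(choice.get("nvext", {}).get("completion_token_ids"))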

Removed dead URIs (404 in this branch)

  • /v1/chat/completions/tokens — TITO collapses into /v1/chat/completions with prompt_token_ids.
  • /v1/tokenize, /v1/detokenize — owned by #7699 (feat: NeMo-rl support; NeMo-rl scope), not required by prime-rl.
  • Verified end-to-end via curl. Handler bodies + their routers stay in-tree under #[allow(dead_code)] for downstream compat; physical deletion is a follow-up.

Env vars (frontend + worker)

  • DYN_ENABLE_RL (frontend + worker; default false) — Frontend: inference-plane RL extensions on /v1/chat/completions (TITO, nvext). Worker: register the <ns>.<comp>.rl request-plane endpoint + RL handlers.
  • DYN_ENABLE_RL_ENDPOINTS (frontend / admin; default false) — Mount the optional /v1/rl/* HTTP facade.
  • DYN_NAMESPACE (both; default: runtime default) — Resolved namespace used by the SDK. The crate receives the resolved value; it does not read env itself.
  • DYN_RL_PORT (frontend / admin; default 8002) — Port for the /v1/rl/* listener. Also --rl-port.

No DYN_RL_WORKER_SYSTEM_URLS, no DYN_RL_TARGETS, no DYN_RL_COMPONENTS for v1. Optional component filters can be request fields.

Useful deployment combinations:

  • public OpenAI ingress — frontend DYN_ENABLE_RL=false, DYN_ENABLE_RL_ENDPOINTS=false, workers DYN_ENABLE_RL=false.
  • RL training cluster, single frontend — all three true.
  • split: rollout frontend (public) + admin frontend (internal) — DYN_ENABLE_RL: rollout true, admin optional; DYN_ENABLE_RL_ENDPOINTS: admin only; workers true.
  • SDK-only Slime integration — frontend optional, DYN_ENABLE_RL_ENDPOINTS false, workers true.

End-to-end smokes

Against bis/dynamo-rl @ 7c15b6b9cd, all on RTX PRO 6000 Blackwell with Qwen3-0.6B + math-env (GSM8K) + filesystem broadcast:

  • bis-dev/7/wt-transfer/sft-file/run.sh — PASS. Full-FT pause → update_weights {target.kind=base, transport.backend=filesystem} step_1 → resume; worker logs [RL] filesystem.update_weights: base reload from … (version=step_1).
  • bis-dev/7/wt-transfer/lora-file/run.sh — PASS. update_weights {target.kind=lora, op=load} step_1; worker logs [RL] LoRA adapter loaded: name=….
  • bis-dev/7/wt-transfer/sft-file-2w/run.sh — PASS. 2 vLLM aggregator workers sharing one GPU. Both init_transport and update_weights fan out to 2 workers, both status:ok. Validates the request-plane path (Discovery + PushRouter::direct).
  • bis-dev/7/wt-transfer/sft-nccl/run.sh — wire-shape PASS, hardware NCCL fail. init_transport {backend=nccl} reaches the worker → vLLM creates NCCLWeightTransferEngine; pynccl.PyNcclCommunicator rejects two ranks on one physical GPU. Confirms the init_weight_transfer_engine collective_rpc dispatch path; needs a 2-GPU host (or backend="ipc") for the data plane.
  • cargo test test_common_ext — 15 / 15. get_stop_token_ids typed Result + 400-on-malformed regression tests.
  • cargo check --workspace — clean; one pre-existing benign warning.

The TITO + LoRA combined smoke (tito-multiturn) confirmed the "_generate_token_mode … will use LoRA adapter: …" log line on both unary and streaming paths.

Files touched

51 files, +4,733 / −182. Hot spots:

  • lib/rl/src/lib.rs — +1,072 (new crate: RlClient SDK, request types, fan-out, HTTP handlers, route definitions)
  • docs/RL.md — +667 (full API reference)
  • components/src/dynamo/vllm/handlers.py — +693 (worker handlers: rl_dispatch, weight_transport_*)
  • components/src/dynamo/vllm/weight_transports/* — +666 across 5 files (WeightTransport Protocol + impls + factory)
  • lib/llm/src/http/service/service_v2.rs — +349 (separate listener wiring, env-var split, RlHttpDeps)
  • lib/llm/src/http/service/openai.rs — +456 (TITO mutual-exclusion, nvext promotion, dead-URI 404s)
  • components/src/dynamo/vllm/worker_factory.py — +62 (rl_endpoint = runtime.endpoint("…rl").serve_endpoint(handler.rl_dispatch, …))

Where reviewers should start

  1. plans/rl-crate.md + plans/weight-transfer-config.md — design / contract for the entire RL surface.
  2. lib/rl/src/lib.rs — RlClient SDK + HTTP handlers; the headline of the request-plane refactor.
  3. components/src/dynamo/vllm/handlers.py::rl_dispatch + worker_factory.py::rl_endpoint.serve_endpoint(...) — worker side of the request-plane endpoint.
  4. components/src/dynamo/vllm/weight_transports/{base,filesystem,nccl,engine_adapter}.py — WeightTransport trait + concrete impls.
  5. lib/llm/src/protocols/openai/{chat_completions,validate,nvext}.rs — TITO mutual-exclusion + PASSTHROUGH_EXTRA_FIELDS.
  6. lib/llm/tests/test_common_ext.rs — typed Result regression tests for stop_token_ids.

Commit walk-through

  • 0d74ecaf — Baseline import; RL surface squashed in from bis/parity-tokenize-tcp.
  • d295ebc6 — Composite state endpoint, 3-mode pause, RL extras on /v1/chat/completions, drop dead URIs.
  • f034171 — /v1/chat/completions absorbs TITO with full SamplingParams parity.
  • a2cc90da — get_stop_token_ids typed Result; mutual-exclusion relaxed to the canonical channel only; regression tests.
  • 2cb5e604 — RL API doc.
  • 0ee6cbe4 / 4aac7e89 / 8e08e32b — Review-driven cleanup: no unwrap/expect, structured tracing, dead-code removal, Tier 1/2/3/5 + CodeRabbit.
  • 6007c77c — PR A: extract the dynamo-rl crate at lib/rl/. Pure refactor; behavior unchanged.
  • 67058be7 — WeightTransport Phases 1 + 4: WeightTransport Protocol + EngineAdapter shim + FilesystemTransport + NcclTransport.
  • 93a7e411 — cargo fmt cleanup.
  • 575afd9e — PR C: drop legacy /v1/rl/{state,health,ready,liveness,weight_version,*_lora_adapter} + the legacy update_weights body; gate the listener on DYN_ENABLE_RL_ENDPOINTS.
  • b6e471de — PR B: request-plane fan-out via Discovery + PushRouter::direct. Worker registers a single <ns>.<comp>.rl endpoint with rl_dispatch; frontend dispatches via Client::wait_for_instances + direct().
  • 626d3e44 — RlClient SDK: MembershipSnapshot, FanoutPolicy, typed RlRequest variants, separate listener on DYN_RL_PORT.
  • 7c15b6b9 — LoRA unload allowed without a transport block.

Pairs with

prime-rl bis/prime-rl-merged adds VLLMGenerateClient / DynamoGenerateClient + setup_generate_client and threads them through compute_teacher_logprobs. Same client.backend axis as setup_admin_api — one config field drives both admin and data paths. After prime-rl migrates, admin_base_url switches from http://dynamo-frontend:8000/v1/rl (legacy single-listener) to http://dynamo-frontend:8002/v1/rl once DYN_ENABLE_RL_ENDPOINTS=true and the dedicated listener is live.

Closes / relates

  • Closes review comments from #8630 ([draft] prime-rl integration): CR-8, CR-9, CR-10
  • Closes design-doc items: HH-19 (composite state-style endpoint, then dropped per Phase 3 — plan in rl-state-endpoint.md), HH-21 (3-mode pause), HH-22 / HH-26 (TITO via prompt_token_ids on /v1/chat/completions), HH-23 (liveness via the frontend's existing /live), HH-25 (RL-specific endpoint surface), HH-27 (weight_version folded into the version field of update_weights)
  • Plans of record live in-tree:
    • plans/rl-crate.md — frontend / SDK / listener / worker contract
    • plans/weight-transfer-config.md — wire shape + worker trait
    • plans/rl-state-endpoint.md — deferred read-side endpoint design

Summary by CodeRabbit

Release Notes

  • New Features

    • Standalone dynamo-rl crate hosting the /v1/rl/* admin surface with a request-plane fan-out SDK (RlClient, MembershipSnapshot, FanoutPolicy)
    • Discriminated update_weights body covering both full-FT and LoRA with pluggable backends (filesystem, nccl)
    • Pluggable WeightTransport Python Protocol + FilesystemTransport / NcclTransport impls; EngineAdapter shim isolates engine flavor
    • Idempotent init_transport for backend setup (e.g. NCCL group bootstrap)
    • Separate /v1/rl/* listener (DYN_RL_PORT / --rl-port) gated by DYN_ENABLE_RL_ENDPOINTS
    • TITO (token-in-token-out) on /v1/chat/completions via prompt_token_ids + nvext extensions
    • Workers register a single <ns>.<component>.rl request-plane endpoint with rl_dispatch; replaces legacy HTTP-on-system-port engine routes
  • Documentation

    • docs/RL.md — full API reference
    • plans/rl-crate.md and plans/weight-transfer-config.md — design / contract docs

biswapanda added 2 commits May 4, 2026 06:55
…for bis/dynamo-rl

Squash-imports the v1 RL surface from bis/parity-tokenize-tcp as the
testable starting point for bis/dynamo-rl. Subsequent commits refactor
toward the cleaner design in bis-dev/design-docs/rl-support.md
(phases 0-5).

Merge conflict resolutions:
- nvext.rs: keep main's NvExtResponseFieldSelection refactor; add v1's
  completion_token_ids field + selection flag + gating in build_response_nvext.
- preprocessor.rs: combine main's tracing::info! [SIDECAR-SKIP-TOKENIZE] log
  with v1's skip_token_annotation = has_backend_instance_id semantics
  (so RL/TITO callers without backend_instance_id keep the token_ids
  annotation; GAIE EPP callers continue to skip it).
- chat_completions/delta.rs: route through main's
  build_response_nvext helper; layer v1's RL completion_token_ids
  accumulator on top so the full token list is emitted only on the
  finish chunk.
- completions/delta.rs: drop v1's redundant inline NvExtResponse build
  (main's helper covers it).
- worker_factory.py: extend main's register_engine_routes() helper with
  v1's RL routes (pause_generation, resume_generation, flush_cache,
  update_weights_from_path, get_weight_version, load_lora_adapter,
  unload_lora_adapter).
- handlers.py: keep both main's start_profile/stop_profile and v1's
  RL handlers (pause_generation through unload_lora_adapter).
- publisher.py: keep v1's 'scheduler_stats can be None right after a
  weight reload / cache reset' explanatory comment.
- install_vllm.sh: keep VLLM_VER=0.19.1 from v1 (matches the venv
  the smoke runs against; main's 0.20.0 will be picked up by the
  long-term plan's upgrade workstream).

cargo check --workspace clean (1 pre-existing benign warning).

Test loop: ~/dev/rl/work/bis-dev/4-02/{lora,sft}/run.sh — mirrors the
known-working bis-dev/4 reference smokes.
…d URIs

Incremental refactor on bis/dynamo-rl per docs/design-docs/rl-support.md.
Test loop ~/dev/rl/work/bis-dev/4-02/{lora,sft}/run.sh — both smokes PASS
end-to-end after this commit.

Composite state + liveness probe
  handlers.py: new engine routes get_state and liveness_probe.
    - liveness_probe round-trips through engine_client.check_health()
      so a wedged event loop surfaces as 503 (closes hhzhang16 HH-23).
    - get_state returns per-worker {engine_alive, pause_state,
      applied_weight_version, loras} — aggregated by Rust frontend.
    - pause_generation / resume_generation now track the _paused flag.
  worker_factory.py: register the two new engine routes alongside the
    existing pause/resume/flush_cache/update_weights/load_lora ones.
  openai.rs: new GET /v1/rl/state composite endpoint and GET /v1/rl/liveness
    probe (5s timeout, override via DYN_RL_LIVENESS_TIMEOUT_MS). State
    aggregates per-worker payloads and surfaces ready / engine_alive /
    pause_state / applied_weight_version / loras / per-worker workers.
    Closes HH-19 (single state endpoint), HH-25 (RL-specific), HH-27
    (weight_version folded in). Legacy /v1/rl/{health,ready,weight_version}
    kept for back-compat — drop once prime-rl AdminAPI migrates to /state.

3-mode pause + structured update_weights body
  handlers.py: pause_generation accepts {mode, clear_cache} body.
    - mode=keep|wait|abort with default keep (matches prime-rl
      client.py:_pause_engines). Closes HH-21.
    - mode=abort triggers collective_rpc(abort_all_requests) when
      available; gracefully falls back with a warning on vLLM 0.19
      where that RPC isn't implemented.
    - clear_cache=true triggers reset_prefix_cache after pause.
  openai.rs: rl_pause now extracts ?mode= and ?clear_cache= via
    axum::extract::Query, validates mode, propagates to worker.
    Returns 400 on unknown mode (verified end-to-end via curl).
    rl_update_weights body migrates to typed RlUpdateWeightsBody
    {weight_dir, weight_version?, reset_prefix_cache=true}; the prior
    flush+reload sequence is now optional (controlled by reset_prefix_cache,
    default true) and the response carries applied_weight_version.

RL extras on /v1/chat/completions
  validate.rs: PASSTHROUGH_EXTRA_FIELDS expanded from {cache_salt} to
    {cache_salt, prompt_token_ids, weight_version, return_routed_experts,
    return_token_ids, return_prompt_logprobs}. RL clients can now send
    these as top-level extras without 400s. Closes HH-22 / HH-26 (TITO
    via prompt_token_ids on /v1/chat/completions instead of forked URI).
  chat_completions/delta.rs and completions/delta.rs: replace the silent
    'if let Ok(json) = serde_json::to_value(...)' fallback with a match
    that emits tracing::warn! on the Err branch. A dropped nvext means
    promoted token_ids / weight_version never reach the RL trainer —
    silently corrupting training. Closes CR-9.
  openai.rs: doc-block for token-ID promotion moved from
    rl_tokenize_prompt to rl_promote_token_ids_in_response (where it
    actually applies). Closes CR-10.

Drop dead URIs
  service_v2.rs: drop tokenization_router (/v1/tokenize, /v1/detokenize)
    mounting — owned by jthomson04 PR #7699 for NeMo-rl scope, not
    required by prime-rl. Drop chat_completions_tokens_router mounting
    — TITO collapses into /v1/chat/completions with prompt_token_ids
    extension. All three URIs verified 404 end-to-end via curl.
  openai.rs: tokenize, detokenize, tokenization_router,
    chat_completions_tokens_router, handler_chat_completions_tokens,
    bad_request all marked #[allow(dead_code)] (kept for downstream
    compat; physical deletion is a follow-up). bad_request doc-block
    cleaned of stale 'Not Implemented' lines. Closes CR-8.

End-to-end verification (smokes ran against this commit)
  /v1/rl/state         200 ready=true engine_alive=true (manual curl)
  /v1/rl/liveness      200 alive=true (manual curl)
  /v1/rl/pause?mode=abort&clear_cache=true   200 (manual curl)
    └─ vllm_worker.log: '[RL] Engine paused (..., mode=abort, clear_cache=True)'
  /v1/rl/pause?mode=invalid                   400 'Invalid mode'
  /v1/rl/resume                               200
  /v1/chat/completions/tokens                 404
  /v1/tokenize                                404
  /v1/detokenize                              404

  4-02/sft smoke: PASS (full-FT pause->update_weights->resume, mismatch_kl<=0.0007)
  4-02/lora smoke: PASS (1 hot-swap closed, lora_id=1626203954)

Reviewer-comment closures landed in this commit
  CR-8  bad_request doc cleaned
  CR-9  serde_json::to_value Err logged via tracing::warn!
  CR-10 token-ID promotion doc-block re-attached
  HH-19 single state endpoint /v1/rl/state
  HH-21 3-mode pause keep|wait|abort + clear_cache
  HH-22 prompt_token_ids in PASSTHROUGH_EXTRA_FIELDS (TITO collapse)
  HH-23 liveness_probe via engine_client.check_health() + 5s timeout
  HH-25 /v1/rl/state is RL-specific (vs broader /health)
  HH-26 no /v1/chat/completions/tokens distinction (URI dropped)
  HH-27 weight_version folded into /v1/rl/state.applied_weight_version
@biswapanda biswapanda requested review from a team as code owners May 5, 2026 03:23
@github-actions github-actions Bot added the feat, documentation, backend::vllm, frontend, and container labels May 5, 2026


coderabbitai Bot commented May 5, 2026

Walkthrough

This PR introduces Prime-RL training workflow support to the Dynamo serving stack by adding RL-specific request/response token ID fields, streaming completion token ID accumulation, generation pause/resume/weight-update handlers in the vLLM worker, Rust-side RL admin routes and tokenization endpoints, and enhanced tokenizer trait implementations for token-to-string conversion. Additionally, vLLM is pinned to v0.19.1 with prime-rl v0.5.1.dev101 plugin installation.

Changes

RL Token ID Flow & Integration

  • Request/Response Shape (lib/llm/src/protocols/openai/chat_completions.rs, lib/llm/src/protocols/openai/nvext.rs, lib/llm/src/audit/stream.rs) — New request fields tokens, return_token_ids and response field prompt_token_ids added to chat completion types. NvExtResponse gains an optional completion_token_ids field with selection flag.
  • Streaming Accumulation (lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs) — Delta generator now accumulates completion_token_ids across streamed chunks, emitting the full list on the finish chunk. nvext serialization errors are now logged as warnings instead of silently dropped.
  • Entrypoint & Aggregation (lib/llm/src/entrypoint/input/text.rs, lib/llm/src/protocols/openai/chat_completions/aggregator.rs, lib/llm/src/protocols/anthropic/types.rs, lib/llm/src/protocols/openai/responses/mod.rs) — Chat completion requests initialize the new RL fields (tokens: None, return_token_ids: None). Aggregation fallback responses initialize prompt_token_ids: None.
  • Frontend RL Field Promotion (lib/llm/src/http/service/openai.rs, handlers section) — Chat completions handler promotes tokens to nvext.token_data, ensures messages are non-empty, forces logprobs, and appends RL completion token IDs to each choice via JSON rewriting in non-streaming post-processing.
  • TITO & Tokenization Endpoints (lib/llm/src/http/service/openai.rs tokenization section, lib/llm/src/protocols/openai.rs, lib/llm/src/protocols/openai/tokenization.rs, lib/llm/src/protocols/openai/validate.rs) — New tokenization protocol types for chat/completion requests and responses. Tokenization routers mounted at POST /v1/tokenize and POST /v1/detokenize. TITO endpoint (POST /v1/chat/completions/tokens) consumes pre-tokenized prompts via the tokens field. Passthrough allowlist permits RL request fields.
  • RL Admin Routes (lib/llm/src/http/service/openai.rs RL admin section, lib/llm/src/http/service/service_v2.rs) — New RL admin router under /v1/rl/* with health, pause/resume, weight updates, LoRA load/unload, and state endpoints. The enable_rl config flag conditionally mounts routes when set or when DYN_ENABLE_RL is truthy.
  • Preprocessor & Request Handling (lib/llm/src/preprocessor.rs) — Token annotation suppression now depends on backend_instance_id presence rather than blanket suppression when token_data is provided.
  • vLLM Worker Handlers (components/src/dynamo/vllm/handlers.py) — New async methods on BaseWorkerHandler: pause_generation, resume_generation, liveness_probe, get_state, flush_cache, update_weights_from_path, get_weight_version, load_lora_adapter, unload_lora_adapter. Includes LoRA hot-swap, per-LoRA locking, discovery registration, and rollback semantics.
  • Worker Route Registration & Frontend Streaming (components/src/dynamo/vllm/worker_factory.py, components/src/dynamo/frontend/vllm_processor.py, components/src/dynamo/vllm/publisher.py) — Engine routes for the RL lifecycle registered in register_engine_routes. Streaming processor injects RL logprobs and passes through completion_token_ids from the backend. Publisher gains a clarifying comment on scheduler-stats nullability post-reload.
  • Documentation & Configuration (docs/Dynamo-RL-api-draft.md, container/deps/vllm/install_vllm.sh) — RL API surface documented, including control routes, tokenization, token ID invariants, and validation results. vLLM pinned to v0.19.1 with prime-rl v0.5.1.dev101 plugin installation and entry-point validation.

Tokenizer Trait Enhancement

  • Trait Interface (lib/tokenizers/src/lib.rs) — Encoder trait adds a default encode_with_special_tokens(input, add_special_tokens) method. Tokenizer trait adds a default convert_ids_to_tokens(token_ids) implementation that decodes individual IDs and collects into strings. The public Tokenizer wrapper type gains corresponding forwarding methods.
  • Tokenizer Implementations (lib/tokenizers/src/fastokens.rs, lib/tokenizers/src/hf.rs, lib/tokenizers/src/tiktoken.rs) — FastTokenizer routes encoding through the new helper and delegates ID-to-token via the HuggingFace decoder. HuggingFaceTokenizer implements both new trait methods with consistent special-token handling. TikTokenTokenizer precomputes decoder lookup maps and implements ID-to-token conversion via lossy UTF-8 decoding from cached byte sequences.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check — ⚠️ Warning. The pull request description is entirely a template with no substantive content from the author. Resolution: replace the template placeholders with detailed information about the changes, objectives, and testing performed as documented in the PR objectives and raw summary.

✅ Passed checks (4 passed)

  • Docstring Coverage — Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).
  • Linked Issues check — Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — Skipped because no linked issues were found for this pull request.
  • Title check — The title 'feat(RL): RL integration layer to support Prime-RL' clearly and specifically describes the main change (adding RL integration to support Prime-RL), which aligns with the extensive PR scope.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/protocols/openai/nvext.rs (1)

141-148: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Update the exhaustive NvExtResponseFieldSelection literals in this file.

Adding completion_token_ids to the struct makes the manual NvExtResponseFieldSelection { ... } literals later in these tests non-exhaustive, so this file will not compile until they either set completion_token_ids: false or switch to ..Default::default(). The breakage is in the explicit literals used by test_build_response_nvext_combined_emission (line 887) and test_nvext_response_field_selection_multiple_extra_fields (line 923).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/src/protocols/openai/nvext.rs` around lines 141 - 148, The tests fail
because NvExtResponseFieldSelection gained a new field completion_token_ids,
making existing struct literals non-exhaustive; update the explicit literals
used in test_build_response_nvext_combined_emission and
test_nvext_response_field_selection_multiple_extra_fields to include
completion_token_ids: false (or replace the full literal with
NvExtResponseFieldSelection { ..Default::default() }) so the constructors are
exhaustive again; locate the struct NvExtResponseFieldSelection and adjust the
two test usages accordingly.
🧹 Nitpick comments (1)
docs/Dynamo-RL-api-draft.md (1)

208-208: 💤 Low value

Consider adding language identifiers to fenced code blocks.

Several code blocks lack language specifiers (lines 208, 287, 317, 359, 819, 844, 856, 871, 882). Adding identifiers like text, http, or rust improves syntax highlighting and accessibility.

For example:

  • HTTP endpoint descriptions → ```http or ```text
  • SSE responses → ```text
  • Rust-like struct definitions → ```rust or ```text

Also applies to: 287-287, 317-317, 359-359, 819-890

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/Dynamo-RL-api-draft.md` at line 208, Several fenced code blocks in the
document are missing language identifiers; update each triple-backtick block
(``` ) that contains HTTP endpoint examples, SSE responses, or Rust-like struct
definitions to include an appropriate language tag (e.g., ```http or ```text for
endpoint/SSE blocks, ```rust or ```text for struct examples) so syntax
highlighting and accessibility are improved — search for the bare ``` fences
around the HTTP examples, SSE examples, and struct-like snippets and add the
corresponding language identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/src/dynamo/vllm/handlers.py`:
- Around line 1178-1187: The handler currently removes the LoRA from the local
engine (remove_lora) but treats unregister_model failures as non-fatal, leaving
the model routable incorrectly; make this operation transactional: if
unregister_model(endpoint=self.generate_endpoint, lora_name=lora_name) fails,
either perform a rollback by re-adding the LoRA to the engine (call the
corresponding register/register_model or load_lora function to restore state) or
propagate/raise the exception so the handler returns an error and the removal is
not considered successful. Update the code paths around remove_lora,
unregister_model, and any callers of _resolve_lora_request to ensure consistency
with the legacy unload_lora transactional behavior.
- Around line 887-889: The generic except block in the RL handlers (e.g., the
pause handler shown) must first catch EngineDeadError and run the same shutdown
sequence used elsewhere in this file instead of returning a JSON error; add a
specific except EngineDeadError: branch before the existing except Exception
that logs the engine-dead event (using logger.exception or logger.error) and
invokes the module's existing worker shutdown logic (the same behavior already
used for EngineDeadError in other handlers), then re-raise or exit as
appropriate; keep the generic except Exception to return the JSON
{"status":"error","message":str(e)} for other errors. Ensure you apply this
change to all analogous handlers referenced (lines around 899-901, 926-928,
957-959, 968-970, 999-1001, 1144-1148, 1198-1202) and reference the
EngineDeadError symbol when editing each handler.
- Around line 839-843: Remove the internal Linear ticket reference "HH-21" from
the comment/docstring that documents the pause API (the block describing Body
keys mode and clear_cache near the handler for pause behavior) and replace it
with either a plain descriptive phrase (e.g., "implements 3-mode pause
behavior") or a public GitHub issue reference (e.g., "GH-<number>" or
"#<number>"); apply the same change to the other occurrences flagged (the
comment blocks around the code near the sections referencing HH-* at the other
two locations) so no HH-* IDs remain in source comments.
- Around line 1049-1086: If a hot-swap partially fails we must not claim
success: change the hot-swap path around engine_client.remove_lora,
engine_client.add_lora, and engine_client.reset_prefix_cache so failures abort
and roll back instead of continuing. Specifically, if remove_lora(lora_name)
raises, stop and propagate/return an error (do not call add_lora or update
loaded_loras); if add_lora(...) raises, ensure loaded_loras is not set and
propagate the error; if reset_prefix_cache() raises after a successful add,
remove the newly-added LoRA via engine_client.remove_lora(new_id), restore
loaded_loras to the previous LoRAInfo, and then propagate the reset error (or
return a non-ok result) so the control plane won’t believe the swap succeeded.
Use the existing symbols (remove_lora, add_lora, reset_prefix_cache,
loaded_loras, LoRARequest, LoRAInfo, engine_client) to locate and implement
these early-returns and rollback steps.
- Around line 941-946: In get_state(), currently if self.engine_client lacks
check_health it leaves engine_alive True; change that branch to mirror
liveness_probe(): when not hasattr(self.engine_client, "check_health") call the
same collective_rpc probe used in liveness_probe() (e.g., await
collective_rpc(self.engine_client, "get_status" or the exact probe used there)),
set engine_alive based on the probe result, and catch/handle exceptions the same
way as liveness_probe() so a failed IPC round-trip marks engine_alive False and
logs the error; keep existing behavior when check_health exists but still catch
exceptions from it.

In `@container/deps/vllm/install_vllm.sh`:
- Around line 319-323: Replace the tag-based ref with the immutable commit SHA:
update the PRIME_RL_REF usage in the install block so the pip install line
("prime-rl @ git+https://github.com/PrimeIntellect-ai/prime-rl@${PRIME_RL_REF}")
references the commit SHA instead of the tag (i.e., set PRIME_RL_REF to
d49f3939e7dca29bceb9ed515cc1782497b67e81 or substitute that SHA directly into
the git+ URL) so the install step in install_vllm.sh is pinned to a specific
commit.

In `@lib/llm/src/http/service/openai.rs`:
- Around line 207-210: Remove all internal Linear ticket IDs (e.g., CR-8, HH-21,
CR-10, HH-23, HH-19, HH-25, HH-27) from the source comments in openai.rs (look
for the doc comments around the Bad Request Error and the other affected comment
blocks at the noted regions) and replace them with neutral prose or GitHub issue
references (e.g., "See `#NNNN`" or "GH-NNNN") as per guidelines; update the
comment lines to describe the issue/closure in plain language without internal
ticket IDs so the doc comments remain informative and compliant.
- Around line 3130-3133: rl_ready() (and the other RL endpoints around the same
area) currently reuses the long-lived reqwest client built with a 600s timeout,
which lets a wedged worker hang readiness for minutes; change the readiness
probe to use a shorter timeout by either creating a separate reqwest::Client
with a small timeout (e.g., 1–5s) for rl_ready() or by setting a per-request
timeout via reqwest::Client::execute with tokio::time::timeout; update the code
paths that call rl_ready() (and the similar endpoints referenced around the
3182-3203 range) to use this short-timeout client/request wrapper so readiness
fails fast with a 503 instead of blocking on the shared 600s http_client.
- Around line 977-988: The RL promotion blocks that check rl_want_token_ids must
force request.inner.logprobs = Some(true) even if it was explicitly set to
false; locate the two places where rl_want_token_ids is used (the block that
manipulates nvext.extra_fields and the similar RL promotion block around lines
~2403-2416) and set request.inner.logprobs = Some(true) unconditionally when
adding "token_ids"/"completion_token_ids" to nvext.extra_fields so
completion_token_ids isn't lost on the backend.
- Around line 1280-1298: The current logic only populates rl_saved_messages and
rl_tito_token_ids when DYN_ENABLE_RL is true, causing prompt_token_ids to be
missing or incorrect for requests that explicitly requested return_token_ids or
used tools/chat_template_args; before calling engine.generate() (the point where
the prompt is finalized), capture the fully rendered prompt text and/or preserve
all prompt-shaping inputs (request.inner.messages, request.tools,
request.chat_template_args and request.nvext.token_data) unconditionally when
request.return_token_ids is set (or equivalent request intent), and use those
captured values for post-response tokenization instead of gating on
DYN_ENABLE_RL; update the references rl_saved_messages and rl_tito_token_ids so
they are derived from the captured prompt/inputs (and fall back to token_data if
present) and ensure downstream post-processing keys off request.return_token_ids
rather than the environment variable.
- Around line 3299-3404: The current rl_update_weights handler
(rl_update_weights, RlState::all_ok, state.fan_out) claims atomicity but can
leave a mixed fleet when update_weights_from_path partially fails; fix by making
the update transactional: before calling update_weights_from_path, fan_out a
"get_weight_version" (or similar) to capture each worker's current version, then
call update_weights_from_path as now; if load_results is not all_ok, call
state.fan_out("update_weights_from_path", {"path": <prev_path>, "version":
<prev_version>}) targeted only to workers that reported success to rollback them
to their captured pre-update version, and return a BAD_GATEWAY with detailed
per-worker status including which were rolled back and which failed; update the
docblock to stop advertising implicit atomicity or note the implemented rollback
behavior and reference rl_update_weights, fan_out, update_weights_from_path, and
the get_weight_version/rollback calls.

In `@lib/llm/src/protocols/openai/chat_completions.rs`:
- Around line 63-64: Update the stale doc comment that currently references the
removed `/v1/chat/completions/tokens` endpoint so it instead tells integrators
to use the standard `/v1/chat/completions` endpoint with `prompt_token_ids`;
locate the comment near the chat completions implementation in
openai::chat_completions (file chat_completions.rs) and replace the two-line
note about TITO mode to say that on `/v1/chat/completions` the field is accepted
but ignored and that TITO-style token-authoritative requests should be made by
passing `prompt_token_ids` on `/v1/chat/completions`.

In `@lib/llm/src/protocols/openai/chat_completions/delta.rs`:
- Around line 117-120: The shared Vec accumulated_completion_token_ids in the
streaming delta handling must be made per-choice instead of global: change
accumulated_completion_token_ids to a map keyed by the chunk's choice index
(e.g., HashMap<usize, Vec<TokenIdType>>) and append token IDs into
accumulated_completion_token_ids[chunk.index]; when promoting to
choices[i].token_ids (the code paths that read/emit completion_token_ids) pull
from the per-index entry and clear it on finish_reason; alternatively, if you
prefer to keep a single accumulator, explicitly reject multi-choice requests at
the start of the RL path (when n > 1) so chunks from different indices cannot
interleave. Ensure the same per-index approach is applied to the other
accumulator usages referenced around the choices handling and promotion logic.

In `@lib/llm/src/protocols/openai/tokenization.rs`:
- Around line 61-84: Call TokenizeChatRequest::validate() when handling
TokenizeRequest::Chat before using merged_chat_template_kwargs(); specifically,
after deserializing the chat tokenize request in the TokenizeRequest::Chat
branch, invoke request.validate()? and propagate any error (return Err) so
invalid states (both continue_final_message and add_generation_prompt true) are
rejected; this ensures TokenizeChatRequest::validate enforces constraints prior
to calling TokenizeChatRequest::merged_chat_template_kwargs().

In `@lib/llm/src/protocols/openai/validate.rs`:
- Around line 117-118: Remove the internal Linear ticket IDs "HH-22 / HH-26"
from the comment block in validate.rs and replace them with either a GitHub
issue reference like "#NNNN" or neutral prose (e.g., "see related issue" or
"related enhancement") so the comment no longer contains internal ticket IDs;
update the comment text surrounding the line that mentions the "tokens variant
of /v1/chat/completions" to use the new reference or neutral wording.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 100e02b7-f3fb-4a7f-9860-7bf2a725db68

📥 Commits

Reviewing files that changed from the base of the PR and between 8db5ad2 and d295ebc.

📒 Files selected for processing (25)
  • components/src/dynamo/frontend/vllm_processor.py
  • components/src/dynamo/vllm/handlers.py
  • components/src/dynamo/vllm/publisher.py
  • components/src/dynamo/vllm/worker_factory.py
  • container/deps/vllm/install_vllm.sh
  • docs/Dynamo-RL-api-draft.md
  • lib/llm/src/audit/stream.rs
  • lib/llm/src/entrypoint/input/text.rs
  • lib/llm/src/http/service/openai.rs
  • lib/llm/src/http/service/service_v2.rs
  • lib/llm/src/preprocessor.rs
  • lib/llm/src/protocols/anthropic/types.rs
  • lib/llm/src/protocols/openai.rs
  • lib/llm/src/protocols/openai/chat_completions.rs
  • lib/llm/src/protocols/openai/chat_completions/aggregator.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
  • lib/llm/src/protocols/openai/nvext.rs
  • lib/llm/src/protocols/openai/responses/mod.rs
  • lib/llm/src/protocols/openai/tokenization.rs
  • lib/llm/src/protocols/openai/validate.rs
  • lib/tokenizers/src/fastokens.rs
  • lib/tokenizers/src/hf.rs
  • lib/tokenizers/src/lib.rs
  • lib/tokenizers/src/tiktoken.rs

End-to-end TITO support on /v1/chat/completions: callers post
prompt_token_ids as a top-level extension (and stop_token_ids,
allowed_token_ids, bad_words_token_ids, truncate_prompt_tokens for
SamplingParams parity with vLLM /inference/v1/generate). vLLM 0.20+
already accepts these on /v1/chat/completions natively; this commit is
the bridge until that upgrade lands.

Plumbing:
  validate.rs: PASSTHROUGH_EXTRA_FIELDS adds stop_token_ids,
    bad_words_token_ids, allowed_token_ids, truncate_prompt_tokens.
  protocols/openai.rs: OpenAIStopConditionsProvider gains
    get_stop_token_ids() (default None); extract_stop_conditions()
    plumbs into common::StopConditions::stop_token_ids_hidden so
    vLLM's SamplingParams.stop_token_ids is honored end-to-end.
  protocols/openai/chat_completions.rs: NvCreateChatCompletionRequest
    overrides get_stop_token_ids and get_pretokenized_input to read
    unsupported_fields. ValidateRequest::validate accepts empty
    messages when prompt_token_ids is present, 400s on mutual-exclusion
    violations.
  protocols/openai/nvext.rs: NvExtProvider gains get_pretokenized_input
    with a default impl reading nvext.token_data; chat-completions
    overrides to also read top-level prompt_token_ids extension.
  preprocessor.rs: apply_template short-circuits to None when pre-
    tokenized input is present (avoids 'undefined value' template
    errors on empty messages); gather_tokens reads via the same
    NvExtProvider hook covering both channels.
  http/service/openai.rs: validate_chat_completion_required_fields
    accepts empty messages when pre-tokenized input is provided.

End-to-end probes against Qwen/Qwen3-0.6B:
  /v1/chat/completions with prompt_token_ids + stop_token_ids:
    200, finish_reason=stop on stop-token, nvext.completion_token_ids
    populated.
  stop_token_ids=[16] forces stop on first generated token: 200,
    immediate stop (proves SamplingParams.stop_token_ids honored).
  messages + prompt_token_ids: 400 mutual-exclusion error.
  messages-only (MITO): unchanged — chat template applied normally.

Smoke regression on bis-dev/4-02 against Qwen/Qwen3-0.6B:
  sft (full-FT bracket): PASS, mismatch_kl 0.0006/0.0007
  lora (hot-swap):       PASS, lora_id=1626203954

Pairs with prime-rl bis/prime-rl-merged commits that add
VLLMGenerateClient / DynamoGenerateClient + setup_generate_client and
thread them through compute_teacher_logprobs. Same client.backend axis
as setup_admin_api; one config field drives both admin and data paths.
…clusion

Three improvements borrowed from PR #9141 (Ameen Patel) on top of the
existing PASSTHROUGH_EXTRA_FIELDS expansion:

1. get_stop_token_ids returns Result<Option<Vec<TokenIdType>>>, not
   Option<Vec<TokenIdType>>. Malformed payloads (e.g.
   stop_token_ids: "not-an-array") now surface as a typed 400 with the
   diagnostic 'stop_token_ids must be an array of unsigned token IDs:
   {err}'. extract_stop_conditions propagates the Result via ?.

   Replaces the prior silent-fallback Option<> variant which dropped
   malformed inputs without telling the caller. Silent drops of RL
   correctness primitives (stop conditions affect which tokens the
   engine emits) are the bug class CR-9 was about; the same principle
   applies here.

2. Mutual-exclusion between messages and pre-tokenized input is now
   scoped to the canonical TOP-LEVEL prompt_token_ids extension only.
   The legacy nvext.token_data channel — which the verifiers
   dynamo_chat_nvext renderer transport (#1287) uses with placeholder
   messages 'role: user, content: (token-in mode)' — is allowed to
   coexist with non-empty messages. validate_messages still gates the
   empty-messages-with-no-tokens case.

   Without this relaxation, the renderer transport's placeholder
   pattern would 400 on every request.

3. Two new tests in test_common_ext.rs:
   - test_chat_completions_stop_token_ids_extraction: positive case
     with nvext.token_data + top-level stop_token_ids (lifted from
     PR #9141 verbatim).
   - test_chat_completions_stop_token_ids_malformed_returns_400:
     verifies the typed-error path on bad input.

Pre-existing test struct-init sites in test_common_ext.rs were missing
required fields (return_token_ids, tokens) added to the
NvCreateChatCompletionRequest struct since the tests were written. Three
sites updated to construct cleanly. cargo test test_common_ext: 15
tests, 15 passes.
@biswapanda biswapanda marked this pull request as draft May 5, 2026 18:20
@biswapanda biswapanda changed the title feat: dynamo <> prime-rl integration feat: dynamo RL integration May 7, 2026
@biswapanda biswapanda marked this pull request as ready for review May 7, 2026 22:44
@biswapanda biswapanda changed the title feat: dynamo RL integration feat(RL): Dynamo RL integration May 7, 2026
@biswapanda biswapanda changed the title feat(RL): Dynamo RL integration feat(RL): Dynamo RL integration APIs May 7, 2026
biswapanda added 3 commits May 7, 2026 15:59
Cleanup-only — no behavior change. Strips review-tracker noise that
accumulated on top of PR-added text during iteration:

  - "Closes hhzhang16 HH-19/HH-21/HH-22/HH-23/HH-25/HH-26/HH-27"
  - "CR-8 / CR-9 / CR-10 closure" prefixes on serde-error / doc-attach fixes
  - Branch-name references: bis/parity-tokenize-tcp, bis/prime-rl-merged
  - Internal PR numbers: #6094, #7699, #8197, #9141
  - Phase numbers from internal design docs (rl-support.md Phase 1/4/5)
  - "prime-rl" mentions in narrative copy and mermaid diagrams →
    generic "RL trainer / RL orchestrator / external client"

Technical content (semantics, invariants, why-this-exists rationale)
preserved everywhere; only the internal-process scaffolding is removed.

Scope verification: every removed line is one this branch ADDED
(diff main..HEAD shows the removed text on a `+` line). No edits land
on pre-existing main-branch comments. Specifically reverted the
nvext.rs cleanup attempt — its target lines (GAIE Stage 1/2,
SGLang-specific) live on main, not in this PR's diff.

Files touched:
  components/src/dynamo/vllm/handlers.py            +9  -10
  components/src/dynamo/vllm/worker_factory.py      +6  -4
  docs/dynamo-RL-api.md                             +19 -32
  lib/llm/src/http/service/openai.rs                +32 -34
  lib/llm/src/protocols/openai/chat_completions/delta.rs  +4 -4
  lib/llm/src/protocols/openai/completions/delta.rs       +3 -3
  lib/llm/src/protocols/openai/validate.rs                +20 -20

cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
…acing

Four blocking findings from a Graham-style review.

1. resolve_model_for_chat (openai.rs ~2086): replace
   served_models.into_iter().next().unwrap() after a len()==1 check with
   let-Some on drain().next() (see the sketch after this list).
   Eliminates the silent-panic-if-empty path without adding a Safety:
   comment for what is now self-evident.

2. RlState::from_env (openai.rs ~3128): builder().build() failures (TLS
   init, resolver init) panicked the frontend on first request boot via
   .expect("Failed to create RL router HTTP client"). Now returns
   anyhow::Result<Self> and surfaces a typed error. Caller rl_router()
   becomes anyhow::Result<(Vec<RouteDoc>, Router)>; service_v2::build()
   propagates with `?` (it already returns Result<HttpService, anyhow::Error>).

3. rl_update_weights (openai.rs ~3349): replace `if weight_dir.is_none() {
   return ... } let weight_dir = weight_dir.unwrap();` with
   `let Some(weight_dir) = body.weight_dir.clone() else { return ... };`.
   One match instead of two; no unwrap.

4. Structured tracing fields, not format strings (8 sites in rl_pause /
   rl_resume / rl_update_weights / rl_load_lora_adapter /
   rl_unload_lora_adapter):

     tracing::warn!("RL pause: some workers failed: {:?}", results);
   →
     tracing::warn!(?results, "RL pause: some workers failed");

   Same shape applied to info!/warn! calls that interpolated
   worker_count, mode, weight_dir, version, lora_name, lora_path. Use %
   for Display, ? for Debug per tracing::Value docs.
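
Findings 1 and 3 share one shape; a minimal sketch, assuming a HashSet
and anyhow for the error type (names and messages are illustrative, not
the openai.rs code):

```rust
use std::collections::HashSet;

// Illustrative stand-in for resolve_model_for_chat's lookup.
fn first_served_model(mut served_models: HashSet<String>) -> anyhow::Result<String> {
    // let-else turns the empty case into a typed error instead of a
    // panic, even if an upstream len()==1 check is ever removed.
    let Some(model) = served_models.drain().next() else {
        anyhow::bail!("no served models registered");
    };
    Ok(model)
}
```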

Verification:
  cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
  cargo test -p dynamo-llm --test test_common_ext: 15 passed.
Quick-win review fixes from PR #9131. Heavy-lift items (#9
prompt_token_ids env-gate, #11 update_weights atomicity, #13
per-choice completion_token_ids) tracked separately as follow-ups.

handlers.py
  - Catch EngineDeadError before the generic except in all 8 RL handlers
    (pause/resume/liveness_probe/get_state/flush_cache/update_weights_from_path/
    load_lora_adapter/unload_lora_adapter): match the existing shutdown
    pattern in this file so admin calls also surface engine death instead
    of leaving a broken worker alive.
  - get_state: fall back to a no-op collective_rpc when check_health is
    absent — same fallback liveness_probe already uses, otherwise older
    engines without check_health always look alive.
  - load_lora_adapter hot-swap path: a remove_lora() failure now returns
    a 400-style error response (was: silent log warn + continue, leaving
    add_lora to no-op against the still-registered ID); a
    reset_prefix_cache() failure after add_lora succeeds also returns
    error (was: log error and continue, leaving stale KV from the old
    adapter routable).
  - unload_lora_adapter: an unregister_model() failure after engine
    remove_lora succeeds now returns error (was: log warn and report
    success, leaving model=<lora_name> still routed to this worker even
    though _resolve_lora_request would now fall back to the base model).

container/deps/vllm/install_vllm.sh
  - Pin prime-rl install to an immutable commit SHA
    (d49f3939e7dca29bceb9ed515cc1782497b67e81 ↔ tag v0.5.1.dev101) so a
    re-pointed tag upstream can't change what we ship. PRIME_RL_REF kept
    in build logs for human readability; PRIME_RL_COMMIT is the
    authoritative pin.
  - Replace `echo "\n=== ..."` with `printf '\n=== ...\n'` (shellcheck SC2028).

lib/llm/src/http/service/openai.rs
  - Force `request.inner.logprobs = Some(true)` unconditionally in both
    RL token-id promotion blocks (was: only when None). RL extraction of
    completion_token_ids depends on logprobs being on at the engine; an
    explicit logprobs=false would otherwise silently drop them.
  - Bound `/v1/rl/ready` per-worker probes with a 5s timeout (override
    via DYN_RL_LIVENESS_TIMEOUT_MS); a sketch follows this list. Was
    reusing the shared 600s http_client, so one wedged worker could
    block readiness for 10 minutes instead of failing fast as 503.
  - Tokenize Chat handler: call `request.validate()?` before
    `merged_chat_template_kwargs()` so the
    continue_final_message + add_generation_prompt mutual-exclusion
    constraint is enforced (validate() existed but was never invoked).
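
The timeout-bounded probe amounts to something like the following
sketch; assume `client` is a reqwest::Client and `probe_timeout` came
from DYN_RL_LIVENESS_TIMEOUT_MS (default 5s). The function and URL are
placeholders, not the openai.rs handler:

```rust
use std::time::Duration;

// Sketch only: bounds one worker's readiness probe instead of reusing
// the shared 600s HTTP client timeout.
async fn worker_ready(client: &reqwest::Client, url: &str, probe_timeout: Duration) -> bool {
    match tokio::time::timeout(probe_timeout, client.get(url).send()).await {
        Ok(Ok(resp)) => resp.status().is_success(), // worker answered
        Ok(Err(_)) => false,                        // network / protocol error
        Err(_) => false,                            // probe timed out: report not-ready
    }
}
```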

lib/llm/src/protocols/openai/chat_completions.rs
  - Update stale doc comments on the legacy `tokens` and
    `return_token_ids` fields: they pointed callers at the now-404
    `/v1/chat/completions/tokens` URI. Direct callers to the canonical
    top-level `prompt_token_ids` extension and `nvext.extra_fields`
    instead.

cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --test test_common_ext: 15 passed.
…e shutdowns, tier the docs

Stack-ranked review fixes from the latest Graham-style pass on this branch.
Tier 4 (commit message conventions and tokenizer-crate scope) deliberately
deferred.

Tier 1 — correctness / contract claims
  - Drop the word "atomic" from the `/v1/rl/update_weights` docblock and
    spell out the partial-failure semantics: workers `0..N-1` may have
    switched while worker `N` failed; per-worker status is in the
    response, true rollback is a follow-up.
  - Reject `n > 1` on `/v1/chat/completions` when RL token IDs are
    requested (guard sketched after this list). The streaming
    aggregator's `completion_token_ids` Vec is shared across all
    choices, so per-choice promotion downstream cannot recover which
    tokens belong to which choice with `n > 1`. Hard-reject is the
    interim guard until the keyed-by-index refactor lands.
  - `update_weights` empty `weight_dir` (`""`) is now treated the same
    as null/missing (NCCL-mode no-op) instead of being forwarded to the
    engine as `path=""`.

Tier 2 — hard rules
  - Hot-path `[RL]` `logger.info` → `logger.debug` for pause / resume /
    flush_cache / weights-load / LoRA load / LoRA unload (8 sites). RL
    trainers fire these per training step; info-level was a log flood.
  - Extract the duplicated 4-line `EngineDeadError` shutdown stanza into
    `BaseWorkerHandler._shutdown_on_engine_dead(e) -> NoReturn` and
    collapse the 9 call sites (8 RL handlers + 1 generate path) to a
    single line each. ~32 lines removed.
  - Strip the remaining internal-tracker comments missed by an earlier
    chore-scrub: `service_v2.rs` had `jthomson04 PR #7699`,
    `bis-dev/design-docs/rl-support.md §1`, and `hhzhang16 HH-22 / HH-26`
    references in two places. Replaced with neutral prose.
  - Strip the SGLang-coordination comment in `handlers.py` ("Signatures
    intentionally line up with the SGLang RL admin routes"). The kind of
    line that goes stale when SGLang's admin set drifts.
  - Delete dead-code carcasses for the dropped routes
    (`/v1/chat/completions/tokens`, `/v1/tokenize`, `/v1/detokenize`):
    remove `tokenize`, `detokenize`, `tokenization_router`,
    `chat_completions_tokens_router`, `handler_chat_completions_tokens`
    (~240 lines), drop `pub mod tokenization;`, delete the
    `tokenization.rs` module file (124 lines), drop unused tokenize-type
    imports. All five were behind `#[allow(dead_code)]` and unmounted.

Tier 3 — tests
  - Make `RlState::new(...)` a pub(super) test-friendly constructor so
    handler-level tests don't need `from_env` / process env vars.
  - Convert `RlPauseQuery::mode: Option<String>` to `Option<PauseMode>`,
    a typed enum with `serde(rename_all = "lowercase")`; see the sketch
    after this list. Axum now returns 400 on `mode=foo` before the
    handler runs; the runtime string match is gone.
  - Add four behavior tests in `mod tests`:
      test_pause_mode_serde_roundtrip
      test_pause_mode_rejects_unknown_value
      test_rl_update_weights_body_defaults
      test_rl_state_new_constructs_without_env
  - Fix unrelated test struct-init breakage that shipped earlier in
    this branch: 26 sites across 11 files were missing the
    `prompt_token_ids`, `return_token_ids`, `tokens`, and
    `completion_token_ids` fields added to the response/request/nvext
    structs. `cargo test -p dynamo-llm` now compiles cleanly.
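
A sketch of the typed query enum; the variant names here are
illustrative guesses, the real set lives in openai.rs:

```rust
use serde::Deserialize;

// Illustrative variants only.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum PauseMode {
    Drain,
    Abort,
}

#[derive(Debug, Deserialize)]
struct RlPauseQuery {
    mode: Option<PauseMode>,
}

// Wrapped in axum::extract::Query<RlPauseQuery>, `?mode=drain` parses,
// while `?mode=foo` fails deserialization and yields a 400 before the
// handler body runs.
```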

Tier 5 — design / nits
  - `DYN_RL_LIVENESS_TIMEOUT_MS` is read once in `RlState::from_env` and
    cached as `RlState.probe_timeout: Duration`; `rl_ready` and
    `rl_liveness` use the cached value instead of re-parsing the env on
    every request (sketched after this list).
  - `rl_ready` worker-probe body simplified from a four-link
    `.ok().and_then(Result::ok).map(...).unwrap_or(false)` chain to a
    `match (timeout, send_result)` that surfaces the timeout/network
    distinction for a future log line.
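
The read-once caching in the first Tier 5 bullet, sketched with the 5s
default from the earlier probe-timeout item (the helper name is
illustrative):

```rust
use std::time::Duration;

// Sketch: parse DYN_RL_LIVENESS_TIMEOUT_MS once at construction and
// cache the Duration; handlers never touch the env per request.
fn probe_timeout_from_env() -> Duration {
    std::env::var("DYN_RL_LIVENESS_TIMEOUT_MS")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_millis)
        .unwrap_or(Duration::from_secs(5))
}
```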

cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --lib: 58 passed (4 new RL tests).
cargo test -p dynamo-llm --tests integration suites: all green.
@biswapanda biswapanda self-assigned this May 8, 2026
biswapanda added 2 commits May 8, 2026 02:01
…ht_version,*_lora_adapter} + legacy update_weights body; gate listener on DYN_ENABLE_RL_ENDPOINTS
…er registers <ns>.<comp>.rl with rl_dispatch; frontend dispatches via Client::wait_for_instances + direct())
@biswapanda biswapanda changed the title feat(RL): Dynamo RL integration APIs feat(RL): Dynamo RL integration APIs supporting prime-rl May 8, 2026
@biswapanda biswapanda changed the title feat(RL): Dynamo RL integration APIs supporting prime-rl feat(RL): Dynamo RL integration supporting Prime-RL May 8, 2026
@biswapanda biswapanda changed the title feat(RL): Dynamo RL integration supporting Prime-RL feat(RL): RL integration layer to support Prime-RL May 8, 2026