
feat(RL): RL integration layer to support Prime-RL #9131

Open
biswapanda wants to merge 17 commits into main from bis/dynamo-rl

Conversation


@biswapanda biswapanda commented May 5, 2026

Overview

End-to-end RL integration into Dynamo's vLLM backend. Replaces / supersedes #8630.

This PR delivers the design captured in two in-tree planning docs:

  • plans/rl-crate.md — standalone dynamo-rl crate, request-plane fan-out, RlClient SDK, optional /v1/rl/* HTTP facade.
  • plans/weight-transfer-config.md (v5) — single discriminated update_weights body covering full-FT and LoRA across pluggable backends; two-layer worker seam (WeightTransport + EngineAdapter).

Branch: bis/dynamo-rl @ 7c15b6b9cd.

Architecture in three pieces

1. Standalone dynamo-rl crate (lib/rl/)

Dependency direction is dynamo-llm → dynamo-rl → dynamo-runtime. The crate has no dependency on dynamo-llm.

// lib/rl/src/lib.rs (~1.1 k LoC)

pub struct RlClient {
    runtime: Arc<DistributedRuntime>,
    namespace: String,
    rl_endpoint: String,        // default: "rl"
    policy:  FanoutPolicy,
}

impl RlClient {
    pub async fn snapshot(&self) -> Result<MembershipSnapshot>;
    pub async fn describe(&self) -> Result<Vec<RlResponse>>;
    pub async fn pause(&self,           req: PauseRequest)         -> Result<FanoutReport>;
    pub async fn resume(&self,          req: ResumeRequest)        -> Result<FanoutReport>;
    pub async fn init_transport(&self,  req: InitTransportRequest) -> Result<FanoutReport>;
    pub async fn update_weights(&self,  req: UpdateWeightsRequest) -> Result<FanoutReport>;
}

pub struct MembershipSnapshot { pub epoch: u64, pub targets: Vec<WorkerTarget> }
pub struct WorkerTarget { pub namespace: String, pub component: String,
                          pub endpoint: String, pub instance_id: u64 }
// FanoutPolicy field types below are illustrative, not exact signatures:
pub struct FanoutPolicy { pub min_workers: usize,
                          pub membership_timeout: Duration,
                          pub request_timeout: Duration,
                          pub strict_direct: bool,
                          pub abort_on_membership_change: bool,
                          pub component_filter: Option<Vec<String>> }

// HTTP facade is a thin wrapper over the SDK:
pub fn rl_router(deps: RlHttpDeps) -> axum::Router;

The SDK owns: discovery list/watch, per-component request-plane clients, fan-out across all matching rl instances, optional component filters, strict direct addressing, membership epoch/fingerprint checks, empty-fanout behavior, timeout/retry policy, and per-worker success/failure reporting.

Strict direct addressing. For admin RPC (weight update / pause / etc.) we cannot silently fall back to a different instance if the target disappears mid-call — the SDK fails with RlError::MembershipChanged { before_epoch, after_epoch } and forces the caller to reconcile.
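
A conceptual sketch of that guard, rendered in Python for brevity (the real implementation is Rust in lib/rl; client.call_direct and the exception class here are illustrative names, not the SDK's API):

from dataclasses import dataclass

class MembershipChanged(Exception):
    """Python stand-in for RlError::MembershipChanged { before_epoch, after_epoch }."""
    def __init__(self, before_epoch: int, after_epoch: int):
        super().__init__(f"membership changed: epoch {before_epoch} -> {after_epoch}")
        self.before_epoch = before_epoch
        self.after_epoch = after_epoch

@dataclass
class MembershipSnapshot:
    epoch: int
    targets: list

async def fan_out_strict(client, op: str, body: dict) -> list:
    # Snapshot membership, address each live instance directly, then
    # re-snapshot: if the epoch moved, a worker joined or left mid-call
    # and the caller must reconcile instead of trusting a partial fan-out.
    before = await client.snapshot()
    results = [await client.call_direct(t, {"op": op, "body": body})
               for t in before.targets]
    after = await client.snapshot()
    if after.epoch != before.epoch:
        raise MembershipChanged(before.epoch, after.epoch)
    return results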

2. Worker contract — single rl request-plane endpoint

Per rl-crate.md § "Worker contract". When DYN_ENABLE_RL=true, each worker registers exactly one discoverable endpoint per component:

dyn://<DYN_NAMESPACE>.<component>.rl

served by BaseWorkerHandler.rl_dispatch(request) which dispatches on request["op"] to the existing handler methods. Request envelope:

pub enum RlRequest {
    Describe(DescribeRequest),
    Pause(PauseRequest),
    Resume(ResumeRequest),
    InitTransport(InitTransportRequest),
    UpdateWeights(UpdateWeightsRequest),
    LoadLoraAdapter(LoadLoraAdapterRequest),     // legacy migration
    UnloadLoraAdapter(UnloadLoraAdapterRequest), // legacy migration
}

pub struct RlResponse {
    pub status:       RlStatus,
    pub instance_id:  u64,
    pub component:    String,
    pub worker_role:  Option<String>,
    pub message:      Option<String>,
    pub details:      serde_json::Value,
}
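
A minimal, self-contained sketch of the dispatch shape (the handler bodies are stubs, not the real BaseWorkerHandler methods; the {op, body} envelope matches what the frontend fan-out sends, see step 5 below):

# Sketch of the single-endpoint dispatcher registered at <ns>.<comp>.rl.
class RlDispatchMixin:
    async def pause(self, body: dict) -> dict:
        return {"status": "ok"}                                   # stub

    async def update_weights(self, body: dict) -> dict:
        return {"status": "ok",
                "details": {"version": body.get("version")}}      # stub

    async def rl_dispatch(self, request: dict) -> dict:
        ops = {
            "pause": self.pause,
            "update_weights": self.update_weights,
            # resume / init_transport / describe / legacy LoRA ops likewise
        }
        handler = ops.get(request.get("op"))
        if handler is None:
            return {"status": "error",
                    "message": f"unknown op: {request.get('op')}"}
        return await handler(request.get("body", {}))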

The legacy per-RL-op register_engine_route("pause_generation"…) HTTP-on-system-port mechanism and the DYN_RL_WORKER_SYSTEM_URLS static URL list are gone from the steady-state path. Discovery is the only source of truth for "which workers are live".

3. Discovery + fan-out — the request plane

Frontend fan-out flow (dynamo_rl::RlState::fan_out):

  1. Build a Client for <ns>.<comp>.rl from the runtime.
  2. client.wait_for_instances() (5 s bound) so the etcd watcher populates before dispatch.
  3. client.instance_ids() → list of live workers.
  4. PushRouter::<Value, Annotated<Value>>::from_client(client, RouterMode::Direct).
  5. Per instance: router.direct(SingleIn::new({op, body}), instance_id), drain first Annotated.data.

NATS / shared TCP is the wire. No reqwest::Client, no system-port HTTP.

HTTP facade — /v1/rl/*

Per weight-transfer-config.md v5 + rl-crate.md § "HTTP facade". Mounted on a dedicated listener (port DYN_RL_PORT / --rl-port, default 8002) when DYN_ENABLE_RL_ENDPOINTS=true. Inference plane on :8000 is untouched.

POST /v1/rl/pause
POST /v1/rl/resume

POST /v1/rl/init_transport       # idempotent backend setup (filesystem no-op; nccl bootstrap)
POST /v1/rl/update_weights       # discriminated body covering full-FT + LoRA

POST /v1/rl/load_lora_adapter    # temporary legacy alias (removed after prime-rl migration)
POST /v1/rl/unload_lora_adapter  # temporary legacy alias
GET  /v1/rl/health               # optional temporary alias to /live

After prime-rl migrates: the two LoRA aliases collapse into update_weights {target.kind="lora"}; /v1/rl/health is removed or stays as a trivial alias to the frontend's existing /live. No read-side state endpoint is added until a real caller appears (deferred design captured in plans/rl-state-endpoint.md).
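
For orientation, a minimal trainer-side sketch of driving the facade over HTTP. It assumes a local frontend on the default DYN_RL_PORT (8002); the checkpoint path is hypothetical, and the bodies follow the discriminated shape detailed in the next section:

import requests

RL_BASE = "http://localhost:8002/v1/rl"   # frontend with the default --rl-port

def update_weights_full_ft(version: str, path: str) -> dict:
    # Full-FT base reload over the filesystem backend.
    body = {
        "version": version,
        "target": {"kind": "base"},
        "transport": {"backend": "filesystem",
                      "filesystem": {"path": path, "require_marker": "STABLE"}},
    }
    resp = requests.post(f"{RL_BASE}/update_weights", json=body, timeout=120)
    resp.raise_for_status()
    return resp.json()

# Typical training-step sequence: pause -> update_weights -> resume.
requests.post(f"{RL_BASE}/pause", json={"mode": "keep"},   # 3-mode pause body
              timeout=30).raise_for_status()
print(update_weights_full_ft("step_42", "/ckpts/step_42"))  # hypothetical path
requests.post(f"{RL_BASE}/resume", json={}, timeout=30).raise_for_status()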

WeightTransferConfig — update_weights wire shape

Per weight-transfer-config.md v5. Single discriminated body covering both full-FT and LoRA. No legacy {weight_dir, weight_version, reset_prefix_cache} form.

// 1. full-FT base reload
POST /v1/rl/update_weights
{ "version":"step_42",
  "target":  {"kind":"base"},
  "transport":{"backend":"filesystem","filesystem":{"path":"…/step_42","require_marker":"STABLE"}} }

// 2. NCCL collective broadcast (after init_transport)
POST /v1/rl/update_weights
{ "version":"step_42",
  "target":  {"kind":"base"},
  "transport":{"backend":"nccl","nccl":{"transport_id":"rl-weights-step"}} }

// 3. LoRA load / swap (same endpoint)
POST /v1/rl/update_weights
{ "version":"step_42",
  "target":  {"kind":"lora","name":"adapter-A","op":"load"},
  "transport":{"backend":"filesystem","filesystem":{"path":"…/step_42"}} }

// 4. LoRA unload — transport optional
POST /v1/rl/update_weights
{ "version":"step_42",
  "target":  {"kind":"lora","name":"adapter-A","op":"unload"} }

init_transport is idempotent. filesystem treats it as a no-op. nccl performs the worker-side init_weight_transfer_engine collective bootstrap and returns once every rank has joined the StatelessProcessGroup.
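
A worker-side sketch of that idempotency contract (illustrative only; the real nccl path goes through the init_weight_transfer_engine collective_rpc, stubbed here as a hypothetical _bootstrap helper):

# Idempotent init: filesystem's is a true no-op, nccl's bootstraps the
# collective exactly once and reports success on every later call.
class NcclTransportSketch:
    backend_id = "nccl"

    def __init__(self) -> None:
        self._initialized = False

    async def init(self, ctx, cfg) -> dict:
        if self._initialized:
            return {"status": "ok", "details": {"already_initialized": True}}
        await self._bootstrap(ctx, cfg)   # joins the StatelessProcessGroup
        self._initialized = True
        return {"status": "ok"}

    async def _bootstrap(self, ctx, cfg) -> None:
        ...  # hypothetical stand-in for the real collective bootstrap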

In-scope this iteration: filesystem, nccl. Future (deferred): nixl, model_express, ipc. The trait shape is designed so each future backend drops in as one extra WeightTransport subclass — see § "WeightTransport seam" below.

WeightTransport seam — pluggable backends

Per weight-transfer-config.md v5 § "Trait on the worker side". Two-layer seam:

# components/src/dynamo/vllm/weight_transports/base.py
class WeightTransport(Protocol):
    backend_id: str

    async def init(self, ctx: InitCtx, cfg: BackendConfig) -> InitResult: ...
    async def update_weights(self, req: UpdateWeightsRequest) -> UpdateResult: ...
    async def teardown(self) -> None: ...

    @property
    def state(self) -> TransportState: ...

# Engine-flavor seam, separate:
class EngineAdapter(Protocol):
    async def update_weights_from_disk(...) -> Result: ...
    async def update_weights_from_distributed(...) -> Result: ...
    async def update_weights_from_tensor(...) -> Result: ...
    async def update_weights_from_ipc(...) -> Result: ...
    async def add_lora(...) -> Result: ...
    async def remove_lora(...) -> Result: ...

Mapping (concern → owner):

  • HTTP wire shape → dynamo_rl::handlers::*
  • backend orchestration (path / nccl group / future nixl-mx poll-and-recv) → WeightTransport impl
  • engine-flavor invocation (vLLM collective_rpc(...), future sglang tokenizer_manager.update_weights_from_*) → EngineAdapter impl

Adding a new backend = one WeightTransport subclass. Adding sglang = one EngineAdapter subclass. Axes don't multiply.

This iteration ships:

components/src/dynamo/vllm/weight_transports/
├── base.py            # WeightTransport Protocol, EngineAdapter Protocol, request types
├── engine_adapter.py  # VllmEngineAdapter — single shim over engine_client.collective_rpc(...)
├── filesystem.py      # FilesystemTransport (Phase 1 default; safetensors + STABLE marker)
├── nccl.py            # NcclTransport     (Phase 4; init_weight_transfer_engine + update_weights)
└── __init__.py        # build_transport(backend, engine_adapter, cfg) factory
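
A plausible shape for that factory, with the transport classes stubbed (only the backend-id-to-class mapping is the point; the real constructors live in filesystem.py and nccl.py):

# One WeightTransport impl per backend id; the engine flavor is injected
# separately via the EngineAdapter, so backends and engines don't multiply.
class FilesystemTransport:
    backend_id = "filesystem"
    def __init__(self, engine_adapter, cfg): ...

class NcclTransport:
    backend_id = "nccl"
    def __init__(self, engine_adapter, cfg): ...

def build_transport(backend: str, engine_adapter, cfg):
    transports = {cls.backend_id: cls
                  for cls in (FilesystemTransport, NcclTransport)}
    try:
        return transports[backend](engine_adapter, cfg)
    except KeyError:
        raise ValueError(f"unknown weight-transfer backend: {backend}") from None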

LoRA folds into update_weights {target.kind="lora", op="load|swap|unload"}. The internal handlers (vllm_engine.add_lora vs collective_rpc("reload_weights")) stay separate; the wire surface is a single endpoint.

TITO on /v1/chat/completions

No separate URI. Pre-tokenized input on the standard chat-completions request:

{
  "model": "...",
  "messages": [],            // empty allowed when prompt_token_ids is set
  "prompt_token_ids":        [...],
  "stop_token_ids":          [...],
  "allowed_token_ids":       [...],
  "bad_words_token_ids":     [...],
  "truncate_prompt_tokens":  N,
  "nvext": { "return_token_ids": true, "return_prompt_logprobs": true, ... }
}
  • validate.rs::PASSTHROUGH_EXTRA_FIELDS allowlists these plus weight_version, return_routed_experts, cache_salt.
  • OpenAIStopConditionsProvider::get_stop_token_ids() returns typed Result<Option<Vec<TokenIdType>>> so malformed input (stop_token_ids: "not-an-array") surfaces a 400 instead of being silently dropped.
  • extract_stop_conditions plumbs into common::StopConditions::stop_token_ids_hidden, honored end-to-end by vLLM SamplingParams.
  • NvExtProvider::get_pretokenized_input() reads from both legacy nvext.token_data and the new top-level prompt_token_ids so existing renderer-mode callers keep working.
  • apply_template short-circuits when pre-tokenized input is present (no MiniJinja "undefined value" error on empty messages).
  • chat_completions/delta.rs, completions/delta.rs: nvext build paths replaced silent if let Ok(json) = … with match + tracing::warn! on Err. A dropped nvext means promoted token IDs / weight version never reach the trainer — silent-corruption class bug, now logged.
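
Putting the pieces together, a hedged client-side sketch of a TITO call. The token IDs are illustrative, and the exact response location of the completion token IDs (under each choice's nvext, per the notes above) is an assumption:

import requests

body = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [],                       # allowed: prompt_token_ids is set
    "prompt_token_ids": [101, 102, 103],  # illustrative token IDs
    "stop_token_ids": [104],
    "nvext": {"return_token_ids": True},
}
resp = requests.post("http://localhost:8000/v1/chat/completions",
                     json=body, timeout=60)
resp.raise_for_status()
choice = resp.json()["choices"][0]
print(choice.get("nvext", {}).get("completion_token_ids"))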

Removed dead URIs (404 in this branch)

  • /v1/chat/completions/tokens — TITO collapses into /v1/chat/completions with prompt_token_ids.
  • /v1/tokenize, /v1/detokenize — owned by #7699 (feat: NeMo-rl support; NeMo-rl scope), not required by prime-rl.
  • Verified end-to-end via curl. Handler bodies + their routers stay in-tree under #[allow(dead_code)] for downstream compat; physical deletion is a follow-up.

Env vars (frontend + worker)

  • DYN_ENABLE_RL (frontend + worker; default false) — Frontend: inference-plane RL extensions on /v1/chat/completions (TITO, nvext). Worker: register the <ns>.<comp>.rl request-plane endpoint + RL handlers.
  • DYN_ENABLE_RL_ENDPOINTS (frontend / admin; default false) — Mount the optional /v1/rl/* HTTP facade.
  • DYN_NAMESPACE (both; default: runtime default) — Resolved namespace used by the SDK. The crate receives the resolved value; it does not read env itself.
  • DYN_RL_PORT (frontend / admin; default 8002) — Port for the /v1/rl/* listener. Also --rl-port.

No DYN_RL_WORKER_SYSTEM_URLS, no DYN_RL_TARGETS, no DYN_RL_COMPONENTS for v1. Optional component filters can be request fields.

Useful deployment combinations:

  • public OpenAI ingress — frontend DYN_ENABLE_RL=false, DYN_ENABLE_RL_ENDPOINTS=false, workers DYN_ENABLE_RL=false.
  • RL training cluster, single frontend — all three true.
  • split: rollout frontend (public) + admin frontend (internal) — DYN_ENABLE_RL: rollout true, admin optional; DYN_ENABLE_RL_ENDPOINTS: admin only; workers true.
  • SDK-only Slime integration — frontend optional, DYN_ENABLE_RL_ENDPOINTS false, workers true.

End-to-end smokes

Against bis/dynamo-rl @ 7c15b6b9cd, all on RTX PRO 6000 Blackwell with Qwen3-0.6B + math-env (GSM8K) + filesystem broadcast:

  • bis-dev/7/wt-transfer/sft-file/run.sh — PASS. Full-FT pause → update_weights {target.kind=base, transport.backend=filesystem} step_1 → resume; worker logs [RL] filesystem.update_weights: base reload from … (version=step_1).
  • bis-dev/7/wt-transfer/lora-file/run.sh — PASS. update_weights {target.kind=lora, op=load} step_1; worker logs [RL] LoRA adapter loaded: name=….
  • bis-dev/7/wt-transfer/sft-file-2w/run.sh — PASS. 2 vLLM aggregator workers sharing one GPU. Both init_transport and update_weights fan out to 2 workers, both status:ok. Validates the request-plane path (Discovery + PushRouter::direct).
  • bis-dev/7/wt-transfer/sft-nccl/run.sh — wire-shape PASS, hardware NCCL fail. init_transport {backend=nccl} reaches the worker → vLLM creates NCCLWeightTransferEngine; pynccl.PyNcclCommunicator rejects two ranks on one physical GPU. Confirms the init_weight_transfer_engine collective_rpc dispatch path; needs a 2-GPU host (or backend="ipc") for the data plane.
  • cargo test test_common_ext — 15 / 15. get_stop_token_ids typed Result + 400-on-malformed regression tests.
  • cargo check --workspace — clean; one pre-existing benign warning.

The TITO + LoRA combined smoke (tito-multiturn) confirmed the "_generate_token_mode … will use LoRA adapter: …" log line on both unary and streaming paths.

Files touched

51 files, +4,733 / −182. Hot spots:

  • lib/rl/src/lib.rs — +1,072 (new crate: RlClient SDK, request types, fan-out, HTTP handlers, route definitions)
  • docs/RL.md — +667 (full API reference)
  • components/src/dynamo/vllm/handlers.py — +693 (worker handlers: rl_dispatch, weight_transport_*)
  • components/src/dynamo/vllm/weight_transports/* — +666 across 5 files (WeightTransport Protocol + impls + factory)
  • lib/llm/src/http/service/service_v2.rs — +349 (separate listener wiring, env-var split, RlHttpDeps)
  • lib/llm/src/http/service/openai.rs — +456 (TITO mutual-exclusion, nvext promotion, dead-URI 404s)
  • components/src/dynamo/vllm/worker_factory.py — +62 (rl_endpoint = runtime.endpoint("…rl").serve_endpoint(handler.rl_dispatch, …))

Where reviewers should start

  1. plans/rl-crate.md + plans/weight-transfer-config.md — design / contract for the entire RL surface.
  2. lib/rl/src/lib.rs — RlClient SDK + HTTP handlers; the headline of the request-plane refactor.
  3. components/src/dynamo/vllm/handlers.py::rl_dispatch + worker_factory.py::rl_endpoint.serve_endpoint(...) — worker side of the request-plane endpoint.
  4. components/src/dynamo/vllm/weight_transports/{base,filesystem,nccl,engine_adapter}.py — WeightTransport trait + concrete impls.
  5. lib/llm/src/protocols/openai/{chat_completions,validate,nvext}.rs — TITO mutual-exclusion + PASSTHROUGH_EXTRA_FIELDS.
  6. lib/llm/tests/test_common_ext.rs — typed Result regression tests for stop_token_ids.

Commit walk-through

  • 0d74ecaf — Baseline import; RL surface squashed in from bis/parity-tokenize-tcp.
  • d295ebc6 — Composite state endpoint, 3-mode pause, RL extras on /v1/chat/completions, drop dead URIs.
  • f034171 — /v1/chat/completions absorbs TITO with full SamplingParams parity.
  • a2cc90da — get_stop_token_ids typed Result; mutual-exclusion relaxed to the canonical channel only; regression tests.
  • 2cb5e604 — RL API doc.
  • 0ee6cbe4 / 4aac7e89 / 8e08e32b — Review-driven cleanup: no unwrap/expect, structured tracing, dead-code removal, Tier 1/2/3/5 + CodeRabbit.
  • 6007c77c — PR A: extract the dynamo-rl crate at lib/rl/. Pure refactor; behavior unchanged.
  • 67058be7 — WeightTransport Phases 1 + 4: WeightTransport Protocol + EngineAdapter shim + FilesystemTransport + NcclTransport.
  • 93a7e411 — cargo fmt cleanup.
  • 575afd9e — PR C: drop legacy /v1/rl/{state,health,ready,liveness,weight_version,*_lora_adapter} + the legacy update_weights body; gate the listener on DYN_ENABLE_RL_ENDPOINTS.
  • b6e471de — PR B: request-plane fan-out via Discovery + PushRouter::direct. Worker registers a single <ns>.<comp>.rl endpoint with rl_dispatch; frontend dispatches via Client::wait_for_instances + direct().
  • 626d3e44 — RlClient SDK: MembershipSnapshot, FanoutPolicy, typed RlRequest variants, separate listener on DYN_RL_PORT.
  • 7c15b6b9 — LoRA unload allowed without a transport block.

Pairs with

prime-rl bis/prime-rl-merged adds VLLMGenerateClient / DynamoGenerateClient + setup_generate_client and threads them through compute_teacher_logprobs. Same client.backend axis as setup_admin_api — one config field drives both admin and data paths. After prime-rl migrates, admin_base_url switches from http://dynamo-frontend:8000/v1/rl (legacy single-listener) to http://dynamo-frontend:8002/v1/rl once DYN_ENABLE_RL_ENDPOINTS=true and the dedicated listener is live.

Closes / relates

  • Closes review comments from #8630 ([draft] prime-rl integration): CR-8, CR-9, CR-10
  • Closes design-doc items: HH-19 (composite state-style endpoint, then dropped per Phase 3 — plan in rl-state-endpoint.md), HH-21 (3-mode pause), HH-22 / HH-26 (TITO via prompt_token_ids on /v1/chat/completions), HH-23 (liveness via the frontend's existing /live), HH-25 (RL-specific endpoint surface), HH-27 (weight_version folded into the version field of update_weights)
  • Plans of record live in-tree:
    • plans/rl-crate.md — frontend / SDK / listener / worker contract
    • plans/weight-transfer-config.md — wire shape + worker trait
    • plans/rl-state-endpoint.md — deferred read-side endpoint design

Summary by CodeRabbit

Release Notes

  • New Features

    • Standalone dynamo-rl crate hosting the /v1/rl/* admin surface with a request-plane fan-out SDK (RlClient, MembershipSnapshot, FanoutPolicy)
    • Discriminated update_weights body covering both full-FT and LoRA with pluggable backends (filesystem, nccl)
    • Pluggable WeightTransport Python Protocol + FilesystemTransport / NcclTransport impls; EngineAdapter shim isolates engine flavor
    • Idempotent init_transport for backend setup (e.g. NCCL group bootstrap)
    • Separate /v1/rl/* listener (DYN_RL_PORT / --rl-port) gated by DYN_ENABLE_RL_ENDPOINTS
    • TITO (token-in-token-out) on /v1/chat/completions via prompt_token_ids + nvext extensions
    • Workers register a single <ns>.<component>.rl request-plane endpoint with rl_dispatch; replaces legacy HTTP-on-system-port engine routes
  • Documentation

    • docs/RL.md — full API reference
    • plans/rl-crate.md and plans/weight-transfer-config.md — design / contract docs

biswapanda added 2 commits May 4, 2026 06:55
…for bis/dynamo-rl

Squash-imports the v1 RL surface from bis/parity-tokenize-tcp as the
testable starting point for bis/dynamo-rl. Subsequent commits refactor
toward the cleaner design in bis-dev/design-docs/rl-support.md
(phases 0-5).

Merge conflict resolutions:
- nvext.rs: keep main's NvExtResponseFieldSelection refactor; add v1's
  completion_token_ids field + selection flag + gating in build_response_nvext.
- preprocessor.rs: combine main's tracing::info! [SIDECAR-SKIP-TOKENIZE] log
  with v1's skip_token_annotation = has_backend_instance_id semantics
  (so RL/TITO callers without backend_instance_id keep the token_ids
  annotation; GAIE EPP callers continue to skip it).
- chat_completions/delta.rs: route through main's
  build_response_nvext helper; layer v1's RL completion_token_ids
  accumulator on top so the full token list is emitted only on the
  finish chunk.
- completions/delta.rs: drop v1's redundant inline NvExtResponse build
  (main's helper covers it).
- worker_factory.py: extend main's register_engine_routes() helper with
  v1's RL routes (pause_generation, resume_generation, flush_cache,
  update_weights_from_path, get_weight_version, load_lora_adapter,
  unload_lora_adapter).
- handlers.py: keep both main's start_profile/stop_profile and v1's
  RL handlers (pause_generation through unload_lora_adapter).
- publisher.py: keep v1's 'scheduler_stats can be None right after a
  weight reload / cache reset' explanatory comment.
- install_vllm.sh: keep VLLM_VER=0.19.1 from v1 (matches the venv
  the smoke runs against; main's 0.20.0 will be picked up by the
  long-term plan's upgrade workstream).

cargo check --workspace clean (1 pre-existing benign warning).

Test loop: ~/dev/rl/work/bis-dev/4-02/{lora,sft}/run.sh — mirrors the
known-working bis-dev/4 reference smokes.
…d URIs

Incremental refactor on bis/dynamo-rl per docs/design-docs/rl-support.md.
Test loop ~/dev/rl/work/bis-dev/4-02/{lora,sft}/run.sh — both smokes PASS
end-to-end after this commit.

Composite state + liveness probe
  handlers.py: new engine routes get_state and liveness_probe.
    - liveness_probe round-trips through engine_client.check_health()
      so a wedged event loop surfaces as 503 (closes hhzhang16 HH-23).
    - get_state returns per-worker {engine_alive, pause_state,
      applied_weight_version, loras} — aggregated by Rust frontend.
    - pause_generation / resume_generation now track the _paused flag.
  worker_factory.py: register the two new engine routes alongside the
    existing pause/resume/flush_cache/update_weights/load_lora ones.
  openai.rs: new GET /v1/rl/state composite endpoint and GET /v1/rl/liveness
    probe (5s timeout, override via DYN_RL_LIVENESS_TIMEOUT_MS). State
    aggregates per-worker payloads and surfaces ready / engine_alive /
    pause_state / applied_weight_version / loras / per-worker workers.
    Closes HH-19 (single state endpoint), HH-25 (RL-specific), HH-27
    (weight_version folded in). Legacy /v1/rl/{health,ready,weight_version}
    kept for back-compat — drop once prime-rl AdminAPI migrates to /state.

3-mode pause + structured update_weights body
  handlers.py: pause_generation accepts {mode, clear_cache} body.
    - mode=keep|wait|abort with default keep (matches prime-rl
      client.py:_pause_engines). Closes HH-21.
    - mode=abort triggers collective_rpc(abort_all_requests) when
      available; gracefully falls back with a warning on vLLM 0.19
      where that RPC isn't implemented.
    - clear_cache=true triggers reset_prefix_cache after pause.
  openai.rs: rl_pause now extracts ?mode= and ?clear_cache= via
    axum::extract::Query, validates mode, propagates to worker.
    Returns 400 on unknown mode (verified end-to-end via curl).
    rl_update_weights body migrates to typed RlUpdateWeightsBody
    {weight_dir, weight_version?, reset_prefix_cache=true}; the prior
    flush+reload sequence is now optional (controlled by reset_prefix_cache,
    default true) and the response carries applied_weight_version.

RL extras on /v1/chat/completions
  validate.rs: PASSTHROUGH_EXTRA_FIELDS expanded from {cache_salt} to
    {cache_salt, prompt_token_ids, weight_version, return_routed_experts,
    return_token_ids, return_prompt_logprobs}. RL clients can now send
    these as top-level extras without 400s. Closes HH-22 / HH-26 (TITO
    via prompt_token_ids on /v1/chat/completions instead of forked URI).
  chat_completions/delta.rs and completions/delta.rs: replace the silent
    'if let Ok(json) = serde_json::to_value(...)' fallback with a match
    that emits tracing::warn! on the Err branch. A dropped nvext means
    promoted token_ids / weight_version never reach the RL trainer —
    silently corrupting training. Closes CR-9.
  openai.rs: doc-block for token-ID promotion moved from
    rl_tokenize_prompt to rl_promote_token_ids_in_response (where it
    actually applies). Closes CR-10.

Drop dead URIs
  service_v2.rs: drop tokenization_router (/v1/tokenize, /v1/detokenize)
    mounting — owned by jthomson04 PR #7699 for NeMo-rl scope, not
    required by prime-rl. Drop chat_completions_tokens_router mounting
    — TITO collapses into /v1/chat/completions with prompt_token_ids
    extension. All three URIs verified 404 end-to-end via curl.
  openai.rs: tokenize, detokenize, tokenization_router,
    chat_completions_tokens_router, handler_chat_completions_tokens,
    bad_request all marked #[allow(dead_code)] (kept for downstream
    compat; physical deletion is a follow-up). bad_request doc-block
    cleaned of stale 'Not Implemented' lines. Closes CR-8.

End-to-end verification (smokes ran against this commit)
  /v1/rl/state         200 ready=true engine_alive=true (manual curl)
  /v1/rl/liveness      200 alive=true (manual curl)
  /v1/rl/pause?mode=abort&clear_cache=true   200 (manual curl)
    └─ vllm_worker.log: '[RL] Engine paused (..., mode=abort, clear_cache=True)'
  /v1/rl/pause?mode=invalid                   400 'Invalid mode'
  /v1/rl/resume                               200
  /v1/chat/completions/tokens                 404
  /v1/tokenize                                404
  /v1/detokenize                              404

  4-02/sft smoke: PASS (full-FT pause->update_weights->resume, mismatch_kl<=0.0007)
  4-02/lora smoke: PASS (1 hot-swap closed, lora_id=1626203954)

Reviewer-comment closures landed in this commit
  CR-8  bad_request doc cleaned
  CR-9  serde_json::to_value Err logged via tracing::warn!
  CR-10 token-ID promotion doc-block re-attached
  HH-19 single state endpoint /v1/rl/state
  HH-21 3-mode pause keep|wait|abort + clear_cache
  HH-22 prompt_token_ids in PASSTHROUGH_EXTRA_FIELDS (TITO collapse)
  HH-23 liveness_probe via engine_client.check_health() + 5s timeout
  HH-25 /v1/rl/state is RL-specific (vs broader /health)
  HH-26 no /v1/chat/completions/tokens distinction (URI dropped)
  HH-27 weight_version folded into /v1/rl/state.applied_weight_version
@biswapanda biswapanda requested review from a team as code owners May 5, 2026 03:23
@github-actions github-actions Bot added the feat, documentation, backend::vllm, frontend, and container labels May 5, 2026


coderabbitai Bot commented May 5, 2026

Walkthrough

This PR introduces Prime-RL training workflow support to the Dynamo serving stack by adding RL-specific request/response token ID fields, streaming completion token ID accumulation, generation pause/resume/weight-update handlers in the vLLM worker, Rust-side RL admin routes and tokenization endpoints, and enhanced tokenizer trait implementations for token-to-string conversion. Additionally, vLLM is pinned to v0.19.1 with prime-rl v0.5.1.dev101 plugin installation.

Changes

RL Token ID Flow & Integration

  • Request/Response Shape (lib/llm/src/protocols/openai/chat_completions.rs, lib/llm/src/protocols/openai/nvext.rs, lib/llm/src/audit/stream.rs) — New request fields tokens, return_token_ids and response field prompt_token_ids added to chat completion types. NvExtResponse gains an optional completion_token_ids field with selection flag.
  • Streaming Accumulation (lib/llm/src/protocols/openai/chat_completions/delta.rs, lib/llm/src/protocols/openai/completions/delta.rs) — Delta generator now accumulates completion_token_ids across streamed chunks, emitting the full list on the finish chunk. nvext serialization errors are now logged as warnings instead of silently dropped.
  • Entrypoint & Aggregation (lib/llm/src/entrypoint/input/text.rs, lib/llm/src/protocols/openai/chat_completions/aggregator.rs, lib/llm/src/protocols/anthropic/types.rs, lib/llm/src/protocols/openai/responses/mod.rs) — Chat completion requests initialize the new RL fields (tokens: None, return_token_ids: None). Aggregation fallback responses initialize prompt_token_ids: None.
  • Frontend RL Field Promotion (lib/llm/src/http/service/openai.rs, handlers section) — Chat completions handler promotes tokens to nvext.token_data, ensures messages are non-empty, forces logprobs, and appends RL completion token IDs to each choice via JSON rewriting in non-streaming post-processing.
  • TITO & Tokenization Endpoints (lib/llm/src/http/service/openai.rs tokenization section, lib/llm/src/protocols/openai.rs, lib/llm/src/protocols/openai/tokenization.rs, lib/llm/src/protocols/openai/validate.rs) — New tokenization protocol types for chat/completion requests and responses. Tokenization routers mounted at POST /v1/tokenize and POST /v1/detokenize. TITO endpoint (POST /v1/chat/completions/tokens) consumes pre-tokenized prompts via the tokens field. Passthrough allowlist permits RL request fields.
  • RL Admin Routes (lib/llm/src/http/service/openai.rs RL admin section, lib/llm/src/http/service/service_v2.rs) — New RL admin router under /v1/rl/* with health, pause/resume, weight updates, LoRA load/unload, and state endpoints. The enable_rl config flag conditionally mounts routes when set or when DYN_ENABLE_RL is truthy.
  • Preprocessor & Request Handling (lib/llm/src/preprocessor.rs) — Token annotation suppression now depends on backend_instance_id presence rather than blanket suppression when token_data is provided.
  • vLLM Worker Handlers (components/src/dynamo/vllm/handlers.py) — New async methods on BaseWorkerHandler: pause_generation, resume_generation, liveness_probe, get_state, flush_cache, update_weights_from_path, get_weight_version, load_lora_adapter, unload_lora_adapter. Includes LoRA hot-swap, per-LoRA locking, discovery registration, and rollback semantics.
  • Worker Route Registration & Frontend Streaming (components/src/dynamo/vllm/worker_factory.py, components/src/dynamo/frontend/vllm_processor.py, components/src/dynamo/vllm/publisher.py) — Engine routes for the RL lifecycle registered in register_engine_routes. Streaming processor injects RL logprobs and passes through completion_token_ids from the backend. Publisher gains a clarifying comment on scheduler-stats nullability post-reload.
  • Documentation & Configuration (docs/Dynamo-RL-api-draft.md, container/deps/vllm/install_vllm.sh) — RL API surface documented, including control routes, tokenization, token ID invariants, and validation results. vLLM pinned to v0.19.1 with prime-rl v0.5.1.dev101 plugin installation and entry-point validation.

Tokenizer Trait Enhancement

  • Trait Interface (lib/tokenizers/src/lib.rs) — Encoder trait adds a default encode_with_special_tokens(input, add_special_tokens) method. Tokenizer trait adds a default convert_ids_to_tokens(token_ids) implementation that decodes individual IDs and collects into strings. The public Tokenizer wrapper type gains corresponding forwarding methods.
  • Tokenizer Implementations (lib/tokenizers/src/fastokens.rs, lib/tokenizers/src/hf.rs, lib/tokenizers/src/tiktoken.rs) — FastTokenizer routes encoding through the new helper and delegates ID-to-token via the HuggingFace decoder. HuggingFaceTokenizer implements both new trait methods with consistent special-token handling. TikTokenTokenizer precomputes decoder lookup maps and implements ID-to-token conversion via lossy UTF-8 decoding from cached byte sequences.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check — ⚠️ Warning. The pull request description is entirely a template with no substantive content from the author. Resolution: replace the template placeholders with detailed information about the changes, objectives, and testing performed as documented in the PR objectives and raw summary.

✅ Passed checks (4 passed)

  • Docstring Coverage — Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).
  • Linked Issues check — Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — Skipped because no linked issues were found for this pull request.
  • Title check — The title 'feat(RL): RL integration layer to support Prime-RL' clearly and specifically describes the main change (adding RL integration to support Prime-RL), which aligns with the extensive PR scope.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/protocols/openai/nvext.rs (1)

141-148: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Update the exhaustive NvExtResponseFieldSelection literals in this file.

Adding completion_token_ids to the struct makes the manual NvExtResponseFieldSelection { ... } literals later in these tests non-exhaustive, so this file will not compile until they either set completion_token_ids: false or switch to ..Default::default(). The breakage is in the explicit literals used by test_build_response_nvext_combined_emission (line 887) and test_nvext_response_field_selection_multiple_extra_fields (line 923).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/llm/src/protocols/openai/nvext.rs` around lines 141 - 148, The tests fail
because NvExtResponseFieldSelection gained a new field completion_token_ids,
making existing struct literals non-exhaustive; update the explicit literals
used in test_build_response_nvext_combined_emission and
test_nvext_response_field_selection_multiple_extra_fields to include
completion_token_ids: false (or replace the full literal with
NvExtResponseFieldSelection { ..Default::default() }) so the constructors are
exhaustive again; locate the struct NvExtResponseFieldSelection and adjust the
two test usages accordingly.
🧹 Nitpick comments (1)
docs/Dynamo-RL-api-draft.md (1)

208-208: 💤 Low value

Consider adding language identifiers to fenced code blocks.

Several code blocks lack language specifiers (lines 208, 287, 317, 359, 819, 844, 856, 871, 882). Adding identifiers like text, http, or rust improves syntax highlighting and accessibility.

For example:

  • HTTP endpoint descriptions → ```http or ```text
  • SSE responses → ```text
  • Rust-like struct definitions → ```rust or ```text

Also applies to: 287-287, 317-317, 359-359, 819-890

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/Dynamo-RL-api-draft.md` at line 208, Several fenced code blocks in the
document are missing language identifiers; update each triple-backtick block
(``` ) that contains HTTP endpoint examples, SSE responses, or Rust-like struct
definitions to include an appropriate language tag (e.g., ```http or ```text for
endpoint/SSE blocks, ```rust or ```text for struct examples) so syntax
highlighting and accessibility are improved — search for the bare ``` fences
around the HTTP examples, SSE examples, and struct-like snippets and add the
corresponding language identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/src/dynamo/vllm/handlers.py`:
- Around line 1178-1187: The handler currently removes the LoRA from the local
engine (remove_lora) but treats unregister_model failures as non-fatal, leaving
the model routable incorrectly; make this operation transactional: if
unregister_model(endpoint=self.generate_endpoint, lora_name=lora_name) fails,
either perform a rollback by re-adding the LoRA to the engine (call the
corresponding register/register_model or load_lora function to restore state) or
propagate/raise the exception so the handler returns an error and the removal is
not considered successful. Update the code paths around remove_lora,
unregister_model, and any callers of _resolve_lora_request to ensure consistency
with the legacy unload_lora transactional behavior.
- Around line 887-889: The generic except block in the RL handlers (e.g., the
pause handler shown) must first catch EngineDeadError and run the same shutdown
sequence used elsewhere in this file instead of returning a JSON error; add a
specific except EngineDeadError: branch before the existing except Exception
that logs the engine-dead event (using logger.exception or logger.error) and
invokes the module's existing worker shutdown logic (the same behavior already
used for EngineDeadError in other handlers), then re-raise or exit as
appropriate; keep the generic except Exception to return the JSON
{"status":"error","message":str(e)} for other errors. Ensure you apply this
change to all analogous handlers referenced (lines around 899-901, 926-928,
957-959, 968-970, 999-1001, 1144-1148, 1198-1202) and reference the
EngineDeadError symbol when editing each handler.
- Around line 839-843: Remove the internal Linear ticket reference "HH-21" from
the comment/docstring that documents the pause API (the block describing Body
keys mode and clear_cache near the handler for pause behavior) and replace it
with either a plain descriptive phrase (e.g., "implements 3-mode pause
behavior") or a public GitHub issue reference (e.g., "GH-<number>" or
"#<number>"); apply the same change to the other occurrences flagged (the
comment blocks around the code near the sections referencing HH-* at the other
two locations) so no HH-* IDs remain in source comments.
- Around line 1049-1086: If a hot-swap partially fails we must not claim
success: change the hot-swap path around engine_client.remove_lora,
engine_client.add_lora, and engine_client.reset_prefix_cache so failures abort
and roll back instead of continuing. Specifically, if remove_lora(lora_name)
raises, stop and propagate/return an error (do not call add_lora or update
loaded_loras); if add_lora(...) raises, ensure loaded_loras is not set and
propagate the error; if reset_prefix_cache() raises after a successful add,
remove the newly-added LoRA via engine_client.remove_lora(new_id), restore
loaded_loras to the previous LoRAInfo, and then propagate the reset error (or
return a non-ok result) so the control plane won’t believe the swap succeeded.
Use the existing symbols (remove_lora, add_lora, reset_prefix_cache,
loaded_loras, LoRARequest, LoRAInfo, engine_client) to locate and implement
these early-returns and rollback steps.
- Around line 941-946: In get_state(), currently if self.engine_client lacks
check_health it leaves engine_alive True; change that branch to mirror
liveness_probe(): when not hasattr(self.engine_client, "check_health") call the
same collective_rpc probe used in liveness_probe() (e.g., await
collective_rpc(self.engine_client, "get_status" or the exact probe used there)),
set engine_alive based on the probe result, and catch/handle exceptions the same
way as liveness_probe() so a failed IPC round-trip marks engine_alive False and
logs the error; keep existing behavior when check_health exists but still catch
exceptions from it.

In `@container/deps/vllm/install_vllm.sh`:
- Around line 319-323: Replace the tag-based ref with the immutable commit SHA:
update the PRIME_RL_REF usage in the install block so the pip install line
("prime-rl @ git+https://github.com/PrimeIntellect-ai/prime-rl@${PRIME_RL_REF}")
references the commit SHA instead of the tag (i.e., set PRIME_RL_REF to
d49f3939e7dca29bceb9ed515cc1782497b67e81 or substitute that SHA directly into
the git+ URL) so the install step in install_vllm.sh is pinned to a specific
commit.

In `@lib/llm/src/http/service/openai.rs`:
- Around line 207-210: Remove all internal Linear ticket IDs (e.g., CR-8, HH-21,
CR-10, HH-23, HH-19, HH-25, HH-27) from the source comments in openai.rs (look
for the doc comments around the Bad Request Error and the other affected comment
blocks at the noted regions) and replace them with neutral prose or GitHub issue
references (e.g., "See `#NNNN`" or "GH-NNNN") as per guidelines; update the
comment lines to describe the issue/closure in plain language without internal
ticket IDs so the doc comments remain informative and compliant.
- Around line 3130-3133: rl_ready() (and the other RL endpoints around the same
area) currently reuses the long-lived reqwest client built with a 600s timeout,
which lets a wedged worker hang readiness for minutes; change the readiness
probe to use a shorter timeout by either creating a separate reqwest::Client
with a small timeout (e.g., 1–5s) for rl_ready() or by setting a per-request
timeout via reqwest::Client::execute with tokio::time::timeout; update the code
paths that call rl_ready() (and the similar endpoints referenced around the
3182-3203 range) to use this short-timeout client/request wrapper so readiness
fails fast with a 503 instead of blocking on the shared 600s http_client.
- Around line 977-988: The RL promotion blocks that check rl_want_token_ids must
force request.inner.logprobs = Some(true) even if it was explicitly set to
false; locate the two places where rl_want_token_ids is used (the block that
manipulates nvext.extra_fields and the similar RL promotion block around lines
~2403-2416) and set request.inner.logprobs = Some(true) unconditionally when
adding "token_ids"/"completion_token_ids" to nvext.extra_fields so
completion_token_ids isn't lost on the backend.
- Around line 1280-1298: The current logic only populates rl_saved_messages and
rl_tito_token_ids when DYN_ENABLE_RL is true, causing prompt_token_ids to be
missing or incorrect for requests that explicitly requested return_token_ids or
used tools/chat_template_args; before calling engine.generate() (the point where
the prompt is finalized), capture the fully rendered prompt text and/or preserve
all prompt-shaping inputs (request.inner.messages, request.tools,
request.chat_template_args and request.nvext.token_data) unconditionally when
request.return_token_ids is set (or equivalent request intent), and use those
captured values for post-response tokenization instead of gating on
DYN_ENABLE_RL; update the references rl_saved_messages and rl_tito_token_ids so
they are derived from the captured prompt/inputs (and fall back to token_data if
present) and ensure downstream post-processing keys off request.return_token_ids
rather than the environment variable.
- Around line 3299-3404: The current rl_update_weights handler
(rl_update_weights, RlState::all_ok, state.fan_out) claims atomicity but can
leave a mixed fleet when update_weights_from_path partially fails; fix by making
the update transactional: before calling update_weights_from_path, fan_out a
"get_weight_version" (or similar) to capture each worker's current version, then
call update_weights_from_path as now; if load_results is not all_ok, call
state.fan_out("update_weights_from_path", {"path": <prev_path>, "version":
<prev_version>}) targeted only to workers that reported success to rollback them
to their captured pre-update version, and return a BAD_GATEWAY with detailed
per-worker status including which were rolled back and which failed; update the
docblock to stop advertising implicit atomicity or note the implemented rollback
behavior and reference rl_update_weights, fan_out, update_weights_from_path, and
the get_weight_version/rollback calls.

In `@lib/llm/src/protocols/openai/chat_completions.rs`:
- Around line 63-64: Update the stale doc comment that currently references the
removed `/v1/chat/completions/tokens` endpoint so it instead tells integrators
to use the standard `/v1/chat/completions` endpoint with `prompt_token_ids`;
locate the comment near the chat completions implementation in
openai::chat_completions (file chat_completions.rs) and replace the two-line
note about TITO mode to say that on `/v1/chat/completions` the field is accepted
but ignored and that TITO-style token-authoritative requests should be made by
passing `prompt_token_ids` on `/v1/chat/completions`.

In `@lib/llm/src/protocols/openai/chat_completions/delta.rs`:
- Around line 117-120: The shared Vec accumulated_completion_token_ids in the
streaming delta handling must be made per-choice instead of global: change
accumulated_completion_token_ids to a map keyed by the chunk's choice index
(e.g., HashMap<usize, Vec<TokenIdType>>) and append token IDs into
accumulated_completion_token_ids[chunk.index]; when promoting to
choices[i].token_ids (the code paths that read/emit completion_token_ids) pull
from the per-index entry and clear it on finish_reason; alternatively, if you
prefer to keep a single accumulator, explicitly reject multi-choice requests at
the start of the RL path (when n > 1) so chunks from different indices cannot
interleave. Ensure the same per-index approach is applied to the other
accumulator usages referenced around the choices handling and promotion logic.

In `@lib/llm/src/protocols/openai/tokenization.rs`:
- Around line 61-84: Call TokenizeChatRequest::validate() when handling
TokenizeRequest::Chat before using merged_chat_template_kwargs(); specifically,
after deserializing the chat tokenize request in the TokenizeRequest::Chat
branch, invoke request.validate()? and propagate any error (return Err) so
invalid states (both continue_final_message and add_generation_prompt true) are
rejected; this ensures TokenizeChatRequest::validate enforces constraints prior
to calling TokenizeChatRequest::merged_chat_template_kwargs().

In `@lib/llm/src/protocols/openai/validate.rs`:
- Around line 117-118: Remove the internal Linear ticket IDs "HH-22 / HH-26"
from the comment block in validate.rs and replace them with either a GitHub
issue reference like "#NNNN" or neutral prose (e.g., "see related issue" or
"related enhancement") so the comment no longer contains internal ticket IDs;
update the comment text surrounding the line that mentions the "tokens variant
of /v1/chat/completions" to use the new reference or neutral wording.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 100e02b7-f3fb-4a7f-9860-7bf2a725db68

📥 Commits

Reviewing files that changed from the base of the PR and between 8db5ad2 and d295ebc.

📒 Files selected for processing (25)
  • components/src/dynamo/frontend/vllm_processor.py
  • components/src/dynamo/vllm/handlers.py
  • components/src/dynamo/vllm/publisher.py
  • components/src/dynamo/vllm/worker_factory.py
  • container/deps/vllm/install_vllm.sh
  • docs/Dynamo-RL-api-draft.md
  • lib/llm/src/audit/stream.rs
  • lib/llm/src/entrypoint/input/text.rs
  • lib/llm/src/http/service/openai.rs
  • lib/llm/src/http/service/service_v2.rs
  • lib/llm/src/preprocessor.rs
  • lib/llm/src/protocols/anthropic/types.rs
  • lib/llm/src/protocols/openai.rs
  • lib/llm/src/protocols/openai/chat_completions.rs
  • lib/llm/src/protocols/openai/chat_completions/aggregator.rs
  • lib/llm/src/protocols/openai/chat_completions/delta.rs
  • lib/llm/src/protocols/openai/completions/delta.rs
  • lib/llm/src/protocols/openai/nvext.rs
  • lib/llm/src/protocols/openai/responses/mod.rs
  • lib/llm/src/protocols/openai/tokenization.rs
  • lib/llm/src/protocols/openai/validate.rs
  • lib/tokenizers/src/fastokens.rs
  • lib/tokenizers/src/hf.rs
  • lib/tokenizers/src/lib.rs
  • lib/tokenizers/src/tiktoken.rs

End-to-end TITO support on /v1/chat/completions: callers post
prompt_token_ids as a top-level extension (and stop_token_ids,
allowed_token_ids, bad_words_token_ids, truncate_prompt_tokens for
SamplingParams parity with vLLM /inference/v1/generate). vLLM 0.20+
already accepts these on /v1/chat/completions natively; this commit is
the bridge until that upgrade lands.

Plumbing:
  validate.rs: PASSTHROUGH_EXTRA_FIELDS adds stop_token_ids,
    bad_words_token_ids, allowed_token_ids, truncate_prompt_tokens.
  protocols/openai.rs: OpenAIStopConditionsProvider gains
    get_stop_token_ids() (default None); extract_stop_conditions()
    plumbs into common::StopConditions::stop_token_ids_hidden so
    vLLM's SamplingParams.stop_token_ids is honored end-to-end.
  protocols/openai/chat_completions.rs: NvCreateChatCompletionRequest
    overrides get_stop_token_ids and get_pretokenized_input to read
    unsupported_fields. ValidateRequest::validate accepts empty
    messages when prompt_token_ids is present, 400s on mutual-exclusion
    violations.
  protocols/openai/nvext.rs: NvExtProvider gains get_pretokenized_input
    with a default impl reading nvext.token_data; chat-completions
    overrides to also read top-level prompt_token_ids extension.
  preprocessor.rs: apply_template short-circuits to None when pre-
    tokenized input is present (avoids 'undefined value' template
    errors on empty messages); gather_tokens reads via the same
    NvExtProvider hook covering both channels.
  http/service/openai.rs: validate_chat_completion_required_fields
    accepts empty messages when pre-tokenized input is provided.

End-to-end probes against Qwen/Qwen3-0.6B:
  /v1/chat/completions with prompt_token_ids + stop_token_ids:
    200, finish_reason=stop on stop-token, nvext.completion_token_ids
    populated.
  stop_token_ids=[16] forces stop on first generated token: 200,
    immediate stop (proves SamplingParams.stop_token_ids honored).
  messages + prompt_token_ids: 400 mutual-exclusion error.
  messages-only (MITO): unchanged — chat template applied normally.

Smoke regression on bis-dev/4-02 against Qwen/Qwen3-0.6B:
  sft (full-FT bracket): PASS, mismatch_kl 0.0006/0.0007
  lora (hot-swap):       PASS, lora_id=1626203954

Pairs with prime-rl bis/prime-rl-merged commits that add
VLLMGenerateClient / DynamoGenerateClient + setup_generate_client and
thread them through compute_teacher_logprobs. Same client.backend axis
as setup_admin_api; one config field drives both admin and data paths.
…clusion

Three improvements borrowed from PR #9141 (Ameen Patel) on top of the
existing PASSTHROUGH_EXTRA_FIELDS expansion:

1. get_stop_token_ids returns Result<Option<Vec<TokenIdType>>>, not
   Option<Vec<TokenIdType>>. Malformed payloads (e.g.
   stop_token_ids: "not-an-array") now surface as a typed 400 with the
   diagnostic 'stop_token_ids must be an array of unsigned token IDs:
   {err}'. extract_stop_conditions propagates the Result via ?.

   Replaces the prior silent-fallback Option<> variant which dropped
   malformed inputs without telling the caller. Silent drops of RL
   correctness primitives (stop conditions affect which tokens the
   engine emits) are the bug class CR-9 was about; the same principle
   applies here.

2. Mutual-exclusion between messages and pre-tokenized input is now
   scoped to the canonical TOP-LEVEL prompt_token_ids extension only.
   The legacy nvext.token_data channel — which the verifiers
   dynamo_chat_nvext renderer transport (#1287) uses with placeholder
   messages 'role: user, content: (token-in mode)' — is allowed to
   coexist with non-empty messages. validate_messages still gates the
   empty-messages-with-no-tokens case.

   Without this relaxation, the renderer transport's placeholder
   pattern would 400 on every request.

3. Two new tests in test_common_ext.rs:
   - test_chat_completions_stop_token_ids_extraction: positive case
     with nvext.token_data + top-level stop_token_ids (lifted from
     PR #9141 verbatim).
   - test_chat_completions_stop_token_ids_malformed_returns_400:
     verifies the typed-error path on bad input.

Pre-existing test struct-init sites in test_common_ext.rs were missing
required fields (return_token_ids, tokens) added to the
NvCreateChatCompletionRequest struct since the tests were written. Three
sites updated to construct cleanly. cargo test test_common_ext: 15
tests, 15 passes.
@biswapanda biswapanda marked this pull request as draft May 5, 2026 18:20
@biswapanda biswapanda changed the title feat: dynamo <> prime-rl integration feat: dynamo RL integration May 7, 2026
@biswapanda biswapanda marked this pull request as ready for review May 7, 2026 22:44
@biswapanda biswapanda changed the title feat: dynamo RL integration feat(RL): Dynamo RL integration May 7, 2026
@biswapanda biswapanda changed the title feat(RL): Dynamo RL integration feat(RL): Dynamo RL integration APIs May 7, 2026
biswapanda added 3 commits May 7, 2026 15:59
Cleanup-only — no behavior change. Strips review-tracker noise that
accumulated on top of PR-added text during iteration:

  - "Closes hhzhang16 HH-19/HH-21/HH-22/HH-23/HH-25/HH-26/HH-27"
  - "CR-8 / CR-9 / CR-10 closure" prefixes on serde-error / doc-attach fixes
  - Branch-name references: bis/parity-tokenize-tcp, bis/prime-rl-merged
  - Internal PR numbers: #6094, #7699, #8197, #9141
  - Phase numbers from internal design docs (rl-support.md Phase 1/4/5)
  - "prime-rl" mentions in narrative copy and mermaid diagrams →
    generic "RL trainer / RL orchestrator / external client"

Technical content (semantics, invariants, why-this-exists rationale)
preserved everywhere; only the internal-process scaffolding is removed.

Scope verification: every removed line is one this branch ADDED
(diff main..HEAD shows the removed text on a `+` line). No edits land
on pre-existing main-branch comments. Specifically reverted the
nvext.rs cleanup attempt — its target lines (GAIE Stage 1/2,
SGLang-specific) live on main, not in this PR's diff.

Files touched:
  components/src/dynamo/vllm/handlers.py            +9  -10
  components/src/dynamo/vllm/worker_factory.py      +6  -4
  docs/dynamo-RL-api.md                             +19 -32
  lib/llm/src/http/service/openai.rs                +32 -34
  lib/llm/src/protocols/openai/chat_completions/delta.rs  +4 -4
  lib/llm/src/protocols/openai/completions/delta.rs       +3 -3
  lib/llm/src/protocols/openai/validate.rs                +20 -20

cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
…acing

Four blocking findings from a Graham-style review.

1. resolve_model_for_chat (openai.rs ~2086): replace
   served_models.into_iter().next().unwrap() after a len()==1 check with
   let-Some on drain().next() (see the sketch after this list).
   Eliminates the silent-panic-if-empty path without adding a Safety:
   comment for what is now self-evident.

2. RlState::from_env (openai.rs ~3128): builder().build() failures (TLS
   init, resolver init) panicked the frontend on first request boot via
   .expect("Failed to create RL router HTTP client"). Now returns
   anyhow::Result<Self> and surfaces a typed error. Caller rl_router()
   becomes anyhow::Result<(Vec<RouteDoc>, Router)>; service_v2::build()
   propagates with `?` (it already returns Result<HttpService, anyhow::Error>).

3. rl_update_weights (openai.rs ~3349): replace `if weight_dir.is_none() {
   return ... } let weight_dir = weight_dir.unwrap();` with
   `let Some(weight_dir) = body.weight_dir.clone() else { return ... };`.
   One match instead of two; no unwrap.

4. Structured tracing fields, not format strings (8 sites in rl_pause /
   rl_resume / rl_update_weights / rl_load_lora_adapter /
   rl_unload_lora_adapter):

     tracing::warn!("RL pause: some workers failed: {:?}", results);
   →
     tracing::warn!(?results, "RL pause: some workers failed");

   Same shape applied to info!/warn! calls that interpolated
   worker_count, mode, weight_dir, version, lora_name, lora_path. Use %
   for Display, ? for Debug per tracing::Value docs.
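
Findings 1 and 3 share one shape; a minimal sketch, assuming a HashSet
and anyhow for the error type (names and messages are illustrative, not
the openai.rs code):

```rust
use std::collections::HashSet;

// Illustrative stand-in for resolve_model_for_chat's lookup.
fn first_served_model(mut served_models: HashSet<String>) -> anyhow::Result<String> {
    // let-else turns the empty case into a typed error instead of a
    // panic, even if an upstream len()==1 check is ever removed.
    let Some(model) = served_models.drain().next() else {
        anyhow::bail!("no served models registered");
    };
    Ok(model)
}
```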

Verification:
  cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
  cargo test -p dynamo-llm --test test_common_ext: 15 passed.
Quick-win review fixes from PR #9131. Heavy-lift items (#9
prompt_token_ids env-gate, #11 update_weights atomicity, #13
per-choice completion_token_ids) tracked separately as follow-ups.

handlers.py
  - Catch EngineDeadError before the generic except in all 8 RL handlers
    (pause/resume/liveness_probe/get_state/flush_cache/update_weights_from_path/
    load_lora_adapter/unload_lora_adapter): match the existing shutdown
    pattern in this file so admin calls also surface engine death instead
    of leaving a broken worker alive.
  - get_state: fall back to a no-op collective_rpc when check_health is
    absent — same fallback liveness_probe already uses, otherwise older
    engines without check_health always look alive.
  - load_lora_adapter hot-swap path: a remove_lora() failure now returns
    a 400-style error response (was: silent log warn + continue, leaving
    add_lora to no-op against the still-registered ID); a
    reset_prefix_cache() failure after add_lora succeeds also returns
    error (was: log error and continue, leaving stale KV from the old
    adapter routable).
  - unload_lora_adapter: an unregister_model() failure after engine
    remove_lora succeeds now returns error (was: log warn and report
    success, leaving model=<lora_name> still routed to this worker even
    though _resolve_lora_request would now fall back to the base model).

container/deps/vllm/install_vllm.sh
  - Pin prime-rl install to an immutable commit SHA
    (d49f3939e7dca29bceb9ed515cc1782497b67e81 ↔ tag v0.5.1.dev101) so a
    re-pointed tag upstream can't change what we ship. PRIME_RL_REF kept
    in build logs for human readability; PRIME_RL_COMMIT is the
    authoritative pin.
  - Replace `echo "\n=== ..."` with `printf '\n=== ...\n'` (shellcheck SC2028).

lib/llm/src/http/service/openai.rs
  - Force `request.inner.logprobs = Some(true)` unconditionally in both
    RL token-id promotion blocks (was: only when None). RL extraction of
    completion_token_ids depends on logprobs being on at the engine; an
    explicit logprobs=false would otherwise silently drop them.
  - Bound `/v1/rl/ready` per-worker probes with a 5s timeout (override
    via DYN_RL_LIVENESS_TIMEOUT_MS); a sketch follows this list. Was
    reusing the shared 600s http_client, so one wedged worker could
    block readiness for 10 minutes instead of failing fast as 503.
  - Tokenize Chat handler: call `request.validate()?` before
    `merged_chat_template_kwargs()` so the
    continue_final_message + add_generation_prompt mutual-exclusion
    constraint is enforced (validate() existed but was never invoked).
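
The timeout-bounded probe amounts to something like the following
sketch; assume `client` is a reqwest::Client and `probe_timeout` came
from DYN_RL_LIVENESS_TIMEOUT_MS (default 5s). The function and URL are
placeholders, not the openai.rs handler:

```rust
use std::time::Duration;

// Sketch only: bounds one worker's readiness probe instead of reusing
// the shared 600s HTTP client timeout.
async fn worker_ready(client: &reqwest::Client, url: &str, probe_timeout: Duration) -> bool {
    match tokio::time::timeout(probe_timeout, client.get(url).send()).await {
        Ok(Ok(resp)) => resp.status().is_success(), // worker answered
        Ok(Err(_)) => false,                        // network / protocol error
        Err(_) => false,                            // probe timed out: report not-ready
    }
}
```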

lib/llm/src/protocols/openai/chat_completions.rs
  - Update stale doc comments on the legacy `tokens` and
    `return_token_ids` fields: they pointed callers at the now-404
    `/v1/chat/completions/tokens` URI. Direct callers to the canonical
    top-level `prompt_token_ids` extension and `nvext.extra_fields`
    instead.

cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --test test_common_ext: 15 passed.
…e shutdowns, tier the docs

Stack-ranked review fixes from the latest Graham-style pass on this branch.
Tier 4 (commit message conventions and tokenizer-crate scope) deliberately
deferred.

Tier 1 — correctness / contract claims
  - Drop the word "atomic" from the `/v1/rl/update_weights` docblock and
    spell out the partial-failure semantics: workers `0..N-1` may have
    switched while worker `N` failed; per-worker status is in the
    response, true rollback is a follow-up.
  - Reject `n > 1` on `/v1/chat/completions` when RL token IDs are
    requested (guard sketched after this list). The streaming
    aggregator's `completion_token_ids` Vec is shared across all
    choices, so per-choice promotion downstream cannot recover which
    tokens belong to which choice with `n > 1`. Hard-reject is the
    interim guard until the keyed-by-index refactor lands.
  - `update_weights` empty `weight_dir` (`""`) is now treated the same
    as null/missing (NCCL-mode no-op) instead of being forwarded to the
    engine as `path=""`.

Tier 2 — hard rules
  - Hot-path `[RL]` `logger.info` → `logger.debug` for pause / resume /
    flush_cache / weights-load / LoRA load / LoRA unload (8 sites). RL
    trainers fire these per training step; info-level was a log flood.
  - Extract the duplicated 4-line `EngineDeadError` shutdown stanza into
    `BaseWorkerHandler._shutdown_on_engine_dead(e) -> NoReturn` and
    collapse the 9 call sites (8 RL handlers + 1 generate path) to a
    single line each. ~32 lines removed.
  - Strip the remaining internal-tracker comments missed by an earlier
    chore-scrub: `service_v2.rs` had `jthomson04 PR #7699`,
    `bis-dev/design-docs/rl-support.md §1`, and `hhzhang16 HH-22 / HH-26`
    references in two places. Replaced with neutral prose.
  - Strip the SGLang-coordination comment in `handlers.py` ("Signatures
    intentionally line up with the SGLang RL admin routes"). The kind of
    line that goes stale when SGLang's admin set drifts.
  - Delete dead-code carcasses for the dropped routes
    (`/v1/chat/completions/tokens`, `/v1/tokenize`, `/v1/detokenize`):
    remove `tokenize`, `detokenize`, `tokenization_router`,
    `chat_completions_tokens_router`, `handler_chat_completions_tokens`
    (~240 lines), drop `pub mod tokenization;`, delete the
    `tokenization.rs` module file (124 lines), drop unused tokenize-type
    imports. All five were behind `#[allow(dead_code)]` and unmounted.

Tier 3 — tests
  - Make `RlState::new(...)` a pub(super) test-friendly constructor so
    handler-level tests don't need `from_env` / process env vars.
  - Convert `RlPauseQuery::mode: Option<String>` to `Option<PauseMode>`,
    a typed enum with `serde(rename_all = "lowercase")`; see the sketch
    after this list. Axum now returns 400 on `mode=foo` before the
    handler runs; the runtime string match is gone.
  - Add four behavior tests in `mod tests`:
      test_pause_mode_serde_roundtrip
      test_pause_mode_rejects_unknown_value
      test_rl_update_weights_body_defaults
      test_rl_state_new_constructs_without_env
  - Fix unrelated test struct-init breakage that shipped earlier in
    this branch: 26 sites across 11 files were missing the
    `prompt_token_ids`, `return_token_ids`, `tokens`, and
    `completion_token_ids` fields added to the response/request/nvext
    structs. `cargo test -p dynamo-llm` now compiles cleanly.
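
A sketch of the typed query enum; the variant names here are
illustrative guesses, the real set lives in openai.rs:

```rust
use serde::Deserialize;

// Illustrative variants only.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum PauseMode {
    Drain,
    Abort,
}

#[derive(Debug, Deserialize)]
struct RlPauseQuery {
    mode: Option<PauseMode>,
}

// Wrapped in axum::extract::Query<RlPauseQuery>, `?mode=drain` parses,
// while `?mode=foo` fails deserialization and yields a 400 before the
// handler body runs.
```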

Tier 5 — design / nits
  - `DYN_RL_LIVENESS_TIMEOUT_MS` is read once in `RlState::from_env` and
    cached as `RlState.probe_timeout: Duration`; `rl_ready` and
    `rl_liveness` use the cached value instead of re-parsing the env on
    every request (sketched after this list).
  - `rl_ready` worker-probe body simplified from a four-link
    `.ok().and_then(Result::ok).map(...).unwrap_or(false)` chain to a
    `match (timeout, send_result)` that surfaces the timeout/network
    distinction for a future log line.
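
The read-once caching in the first Tier 5 bullet, sketched with the 5s
default from the earlier probe-timeout item (the helper name is
illustrative):

```rust
use std::time::Duration;

// Sketch: parse DYN_RL_LIVENESS_TIMEOUT_MS once at construction and
// cache the Duration; handlers never touch the env per request.
fn probe_timeout_from_env() -> Duration {
    std::env::var("DYN_RL_LIVENESS_TIMEOUT_MS")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_millis)
        .unwrap_or(Duration::from_secs(5))
}
```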

cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --lib: 58 passed (4 new RL tests).
cargo test -p dynamo-llm --tests integration suites: all green.
@biswapanda biswapanda self-assigned this May 8, 2026
biswapanda added 2 commits May 8, 2026 02:01
…ht_version,*_lora_adapter} + legacy update_weights body; gate listener on DYN_ENABLE_RL_ENDPOINTS
…er registers <ns>.<comp>.rl with rl_dispatch; frontend dispatches via Client::wait_for_instances + direct())
@biswapanda biswapanda changed the title feat(RL): Dynamo RL integration APIs feat(RL): Dynamo RL integration APIs supporting prime-rl May 8, 2026
@biswapanda biswapanda changed the title feat(RL): Dynamo RL integration APIs supporting prime-rl feat(RL): Dynamo RL integration supporting Prime-RL May 8, 2026
@biswapanda biswapanda changed the title feat(RL): Dynamo RL integration supporting Prime-RL feat(RL): RL integration layer to support Prime-RL May 8, 2026