feat(RL): RL integration layer to support Prime-RL #9131
biswapanda wants to merge 17 commits into main from bis/dynamo-rl
Conversation
…for bis/dynamo-rl
Squash-imports the v1 RL surface from bis/parity-tokenize-tcp as the
testable starting point for bis/dynamo-rl. Subsequent commits refactor
toward the cleaner design in bis-dev/design-docs/rl-support.md
(phases 0-5).
Merge conflict resolutions:
- nvext.rs: keep main's NvExtResponseFieldSelection refactor; add v1's
completion_token_ids field + selection flag + gating in build_response_nvext.
- preprocessor.rs: combine main's tracing::info! [SIDECAR-SKIP-TOKENIZE] log
with v1's skip_token_annotation = has_backend_instance_id semantics
(so RL/TITO callers without backend_instance_id keep the token_ids
annotation; GAIE EPP callers continue to skip it).
- chat_completions/delta.rs: route through main's
build_response_nvext helper; layer v1's RL completion_token_ids
accumulator on top so the full token list is emitted only on the
finish chunk.
- completions/delta.rs: drop v1's redundant inline NvExtResponse build
(main's helper covers it).
- worker_factory.py: extend main's register_engine_routes() helper with
v1's RL routes (pause_generation, resume_generation, flush_cache,
update_weights_from_path, get_weight_version, load_lora_adapter,
unload_lora_adapter).
- handlers.py: keep both main's start_profile/stop_profile and v1's
RL handlers (pause_generation through unload_lora_adapter).
- publisher.py: keep v1's 'scheduler_stats can be None right after a
weight reload / cache reset' explanatory comment.
- install_vllm.sh: keep VLLM_VER=0.19.1 from v1 (matches the venv
the smoke runs against; main's 0.20.0 will be picked up by the
long-term plan's upgrade workstream).
cargo check --workspace clean (1 pre-existing benign warning).
Test loop: ~/dev/rl/work/bis-dev/4-02/{lora,sft}/run.sh — mirrors the
known-working bis-dev/4 reference smokes.
…d URIs
Incremental refactor on bis/dynamo-rl per docs/design-docs/rl-support.md.
Test loop ~/dev/rl/work/bis-dev/4-02/{lora,sft}/run.sh — both smokes PASS
end-to-end after this commit.
Composite state + liveness probe
handlers.py: new engine routes get_state and liveness_probe.
- liveness_probe round-trips through engine_client.check_health()
so a wedged event loop surfaces as 503 (closes hhzhang16 HH-23).
- get_state returns per-worker {engine_alive, pause_state,
applied_weight_version, loras} — aggregated by Rust frontend.
- pause_generation / resume_generation now track the _paused flag.
worker_factory.py: register the two new engine routes alongside the
existing pause/resume/flush_cache/update_weights/load_lora ones.
openai.rs: new GET /v1/rl/state composite endpoint and GET /v1/rl/liveness
probe (5s timeout, override via DYN_RL_LIVENESS_TIMEOUT_MS). State
aggregates per-worker payloads and surfaces ready / engine_alive /
pause_state / applied_weight_version / loras / per-worker workers.
Closes HH-19 (single state endpoint), HH-25 (RL-specific), HH-27
(weight_version folded in). Legacy /v1/rl/{health,ready,weight_version}
kept for back-compat — drop once prime-rl AdminAPI migrates to /state.
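A minimal sketch of the aggregated /v1/rl/state payload described above; the struct and field names are illustrative, not the exact types in openai.rs or handlers.py.

```rust
use serde::Serialize;
use std::collections::HashMap;

// Hypothetical shapes for illustration only.
#[derive(Serialize)]
struct WorkerRlState {
    engine_alive: bool,
    pause_state: String,                     // e.g. "running" | "paused"
    applied_weight_version: Option<String>,
    loras: Vec<String>,
}

#[derive(Serialize)]
struct CompositeRlState {
    ready: bool,                             // true only when every worker reports engine_alive
    engine_alive: bool,
    pause_state: String,
    applied_weight_version: Option<String>,
    loras: Vec<String>,
    workers: HashMap<String, WorkerRlState>, // per-worker payloads, keyed by worker id
}
```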
3-mode pause + structured update_weights body
handlers.py: pause_generation accepts {mode, clear_cache} body.
- mode=keep|wait|abort with default keep (matches prime-rl
client.py:_pause_engines). Closes HH-21.
- mode=abort triggers collective_rpc(abort_all_requests) when
available; gracefully falls back with a warning on vLLM 0.19
where that RPC isn't implemented.
- clear_cache=true triggers reset_prefix_cache after pause.
openai.rs: rl_pause now extracts ?mode= and ?clear_cache= via
axum::extract::Query, validates mode, propagates to worker.
Returns 400 on unknown mode (verified end-to-end via curl).
rl_update_weights body migrates to typed RlUpdateWeightsBody
{weight_dir, weight_version?, reset_prefix_cache=true}; the prior
flush+reload sequence is now optional (controlled by reset_prefix_cache,
default true) and the response carries applied_weight_version.
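A rough sketch of the typed update_weights body described above; the field names come from the text, while the serde defaults and struct name spelling are assumptions.

```rust
use serde::Deserialize;

// Illustrative only; mirrors the {weight_dir, weight_version?, reset_prefix_cache=true} shape above.
#[derive(Deserialize)]
struct RlUpdateWeightsBody {
    #[serde(default)]
    weight_dir: Option<String>,
    #[serde(default)]
    weight_version: Option<String>,
    #[serde(default = "default_true")]
    reset_prefix_cache: bool, // defaults to true; the flush+reload sequence runs only when requested
}

fn default_true() -> bool {
    true
}
```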
RL extras on /v1/chat/completions
validate.rs: PASSTHROUGH_EXTRA_FIELDS expanded from {cache_salt} to
{cache_salt, prompt_token_ids, weight_version, return_routed_experts,
return_token_ids, return_prompt_logprobs}. RL clients can now send
these as top-level extras without 400s. Closes HH-22 / HH-26 (TITO
via prompt_token_ids on /v1/chat/completions instead of forked URI).
chat_completions/delta.rs and completions/delta.rs: replace the silent
'if let Ok(json) = serde_json::to_value(...)' fallback with a match
that emits tracing::warn! on the Err branch. A dropped nvext means
promoted token_ids / weight_version never reach the RL trainer —
silently corrupting training. Closes CR-9.
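A minimal sketch of that Err-logging pattern, assuming serde_json and tracing; the helper name and the map it writes into are placeholders, not the real delta.rs code.

```rust
use serde_json::{Map, Value};

// Illustrative only: attach a serialized nvext payload to a response's extra fields,
// warning (instead of silently dropping) when serialization fails.
fn attach_nvext<T: serde::Serialize>(extra: &mut Map<String, Value>, nvext: &T) {
    match serde_json::to_value(nvext) {
        Ok(json) => {
            extra.insert("nvext".to_string(), json);
        }
        Err(err) => {
            // A dropped nvext silently loses promoted token_ids / weight_version downstream.
            tracing::warn!(%err, "failed to serialize nvext; RL fields will be missing from this chunk");
        }
    }
}
```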
openai.rs: doc-block for token-ID promotion moved from
rl_tokenize_prompt to rl_promote_token_ids_in_response (where it
actually applies). Closes CR-10.
Drop dead URIs
service_v2.rs: drop tokenization_router (/v1/tokenize, /v1/detokenize)
mounting — owned by jthomson04 PR #7699 for NeMo-rl scope, not
required by prime-rl. Drop chat_completions_tokens_router mounting
— TITO collapses into /v1/chat/completions with prompt_token_ids
extension. All three URIs verified 404 end-to-end via curl.
openai.rs: tokenize, detokenize, tokenization_router,
chat_completions_tokens_router, handler_chat_completions_tokens,
bad_request all marked #[allow(dead_code)] (kept for downstream
compat; physical deletion is a follow-up). bad_request doc-block
cleaned of stale 'Not Implemented' lines. Closes CR-8.
End-to-end verification (smokes ran against this commit)
/v1/rl/state 200 ready=true engine_alive=true (manual curl)
/v1/rl/liveness 200 alive=true (manual curl)
/v1/rl/pause?mode=abort&clear_cache=true 200 (manual curl)
└─ vllm_worker.log: '[RL] Engine paused (..., mode=abort, clear_cache=True)'
/v1/rl/pause?mode=invalid 400 'Invalid mode'
/v1/rl/resume 200
/v1/chat/completions/tokens 404
/v1/tokenize 404
/v1/detokenize 404
4-02/sft smoke: PASS (full-FT pause->update_weights->resume, mismatch_kl<=0.0007)
4-02/lora smoke: PASS (1 hot-swap closed, lora_id=1626203954)
Reviewer-comment closures landed in this commit
CR-8 bad_request doc cleaned
CR-9 serde_json::to_value Err logged via tracing::warn!
CR-10 token-ID promotion doc-block re-attached
HH-19 single state endpoint /v1/rl/state
HH-21 3-mode pause keep|wait|abort + clear_cache
HH-22 prompt_token_ids in PASSTHROUGH_EXTRA_FIELDS (TITO collapse)
HH-23 liveness_probe via engine_client.check_health() + 5s timeout
HH-25 /v1/rl/state is RL-specific (vs broader /health)
HH-26 no /v1/chat/completions/tokens distinction (URI dropped)
HH-27 weight_version folded into /v1/rl/state.applied_weight_version
Walkthrough
This PR introduces Prime-RL training workflow support to the Dynamo serving stack by adding RL-specific request/response token ID fields, streaming completion token ID accumulation, generation pause/resume/weight-update handlers in the vLLM worker, Rust-side RL admin routes and tokenization endpoints, and enhanced tokenizer trait implementations for token-to-string conversion. Additionally, vLLM is pinned to v0.19.1 with prime-rl v0.5.1.dev101 plugin installation.
Changes
RL Token ID Flow & Integration
Tokenizer Trait Enhancement
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/protocols/openai/nvext.rs (1)
141-148: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win: Update the exhaustive NvExtResponseFieldSelection literals in this file.
Adding completion_token_ids to the struct makes the manual NvExtResponseFieldSelection { ... } literals later in these tests non-exhaustive, so this file will not compile until they either set completion_token_ids: false or switch to ..Default::default(). The breakage is in the explicit literals used by test_build_response_nvext_combined_emission (line 887) and test_nvext_response_field_selection_multiple_extra_fields (line 923).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/llm/src/protocols/openai/nvext.rs` around lines 141 - 148, The tests fail because NvExtResponseFieldSelection gained a new field completion_token_ids, making existing struct literals non-exhaustive; update the explicit literals used in test_build_response_nvext_combined_emission and test_nvext_response_field_selection_multiple_extra_fields to include completion_token_ids: false (or replace the full literal with NvExtResponseFieldSelection { ..Default::default() }) so the constructors are exhaustive again; locate the struct NvExtResponseFieldSelection and adjust the two test usages accordingly.
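A hedged sketch of the two fixes the comment suggests; the struct here is abbreviated to two fields for illustration and is not the real NvExtResponseFieldSelection definition.

```rust
// Illustrative only; the real NvExtResponseFieldSelection in nvext.rs has more fields.
#[derive(Default)]
struct NvExtResponseFieldSelection {
    token_ids: bool,
    completion_token_ids: bool, // field added by this PR
}

fn example() -> (NvExtResponseFieldSelection, NvExtResponseFieldSelection) {
    // Option 1: keep the literal exhaustive by naming the new field.
    let explicit = NvExtResponseFieldSelection { token_ids: true, completion_token_ids: false };
    // Option 2: spread in defaults so later field additions don't break the test again.
    let defaulted = NvExtResponseFieldSelection { token_ids: true, ..Default::default() };
    (explicit, defaulted)
}
```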
🧹 Nitpick comments (1)
docs/Dynamo-RL-api-draft.md (1)
208-208: 💤 Low value: Consider adding language identifiers to fenced code blocks.
Several code blocks lack language specifiers (lines 208, 287, 317, 359, 819, 844, 856, 871, 882). Adding identifiers like `text`, `http`, or `rust` improves syntax highlighting and accessibility.
For example:
- HTTP endpoint descriptions → ```http or ```text
- SSE responses → ```text
- Rust-like struct definitions → ```rust or ```text
Also applies to: 287-287, 317-317, 359-359, 819-890
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/Dynamo-RL-api-draft.md` at line 208, Several fenced code blocks in the document are missing language identifiers; update each triple-backtick block (``` ) that contains HTTP endpoint examples, SSE responses, or Rust-like struct definitions to include an appropriate language tag (e.g., ```http or ```text for endpoint/SSE blocks, ```rust or ```text for struct examples) so syntax highlighting and accessibility are improved — search for the bare ``` fences around the HTTP examples, SSE examples, and struct-like snippets and add the corresponding language identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/src/dynamo/vllm/handlers.py`:
- Around line 1178-1187: The handler currently removes the LoRA from the local
engine (remove_lora) but treats unregister_model failures as non-fatal, leaving
the model routable incorrectly; make this operation transactional: if
unregister_model(endpoint=self.generate_endpoint, lora_name=lora_name) fails,
either perform a rollback by re-adding the LoRA to the engine (call the
corresponding register/register_model or load_lora function to restore state) or
propagate/raise the exception so the handler returns an error and the removal is
not considered successful. Update the code paths around remove_lora,
unregister_model, and any callers of _resolve_lora_request to ensure consistency
with the legacy unload_lora transactional behavior.
- Around line 887-889: The generic except block in the RL handlers (e.g., the
pause handler shown) must first catch EngineDeadError and run the same shutdown
sequence used elsewhere in this file instead of returning a JSON error; add a
specific except EngineDeadError: branch before the existing except Exception
that logs the engine-dead event (using logger.exception or logger.error) and
invokes the module's existing worker shutdown logic (the same behavior already
used for EngineDeadError in other handlers), then re-raise or exit as
appropriate; keep the generic except Exception to return the JSON
{"status":"error","message":str(e)} for other errors. Ensure you apply this
change to all analogous handlers referenced (lines around 899-901, 926-928,
957-959, 968-970, 999-1001, 1144-1148, 1198-1202) and reference the
EngineDeadError symbol when editing each handler.
- Around line 839-843: Remove the internal Linear ticket reference "HH-21" from
the comment/docstring that documents the pause API (the block describing Body
keys mode and clear_cache near the handler for pause behavior) and replace it
with either a plain descriptive phrase (e.g., "implements 3-mode pause
behavior") or a public GitHub issue reference (e.g., "GH-<number>" or
"#<number>"); apply the same change to the other occurrences flagged (the
comment blocks around the code near the sections referencing HH-* at the other
two locations) so no HH-* IDs remain in source comments.
- Around line 1049-1086: If a hot-swap partially fails we must not claim
success: change the hot-swap path around engine_client.remove_lora,
engine_client.add_lora, and engine_client.reset_prefix_cache so failures abort
and roll back instead of continuing. Specifically, if remove_lora(lora_name)
raises, stop and propagate/return an error (do not call add_lora or update
loaded_loras); if add_lora(...) raises, ensure loaded_loras is not set and
propagate the error; if reset_prefix_cache() raises after a successful add,
remove the newly-added LoRA via engine_client.remove_lora(new_id), restore
loaded_loras to the previous LoRAInfo, and then propagate the reset error (or
return a non-ok result) so the control plane won’t believe the swap succeeded.
Use the existing symbols (remove_lora, add_lora, reset_prefix_cache,
loaded_loras, LoRARequest, LoRAInfo, engine_client) to locate and implement
these early-returns and rollback steps.
- Around line 941-946: In get_state(), currently if self.engine_client lacks
check_health it leaves engine_alive True; change that branch to mirror
liveness_probe(): when not hasattr(self.engine_client, "check_health") call the
same collective_rpc probe used in liveness_probe() (e.g., await
collective_rpc(self.engine_client, "get_status" or the exact probe used there)),
set engine_alive based on the probe result, and catch/handle exceptions the same
way as liveness_probe() so a failed IPC round-trip marks engine_alive False and
logs the error; keep existing behavior when check_health exists but still catch
exceptions from it.
In `@container/deps/vllm/install_vllm.sh`:
- Around line 319-323: Replace the tag-based ref with the immutable commit SHA:
update the PRIME_RL_REF usage in the install block so the pip install line
("prime-rl @ git+https://github.com/PrimeIntellect-ai/prime-rl@${PRIME_RL_REF}")
references the commit SHA instead of the tag (i.e., set PRIME_RL_REF to
d49f3939e7dca29bceb9ed515cc1782497b67e81 or substitute that SHA directly into
the git+ URL) so the install step in install_vllm.sh is pinned to a specific
commit.
In `@lib/llm/src/http/service/openai.rs`:
- Around line 207-210: Remove all internal Linear ticket IDs (e.g., CR-8, HH-21,
CR-10, HH-23, HH-19, HH-25, HH-27) from the source comments in openai.rs (look
for the doc comments around the Bad Request Error and the other affected comment
blocks at the noted regions) and replace them with neutral prose or GitHub issue
references (e.g., "See `#NNNN`" or "GH-NNNN") as per guidelines; update the
comment lines to describe the issue/closure in plain language without internal
ticket IDs so the doc comments remain informative and compliant.
- Around line 3130-3133: rl_ready() (and the other RL endpoints around the same
area) currently reuses the long-lived reqwest client built with a 600s timeout,
which lets a wedged worker hang readiness for minutes; change the readiness
probe to use a shorter timeout by either creating a separate reqwest::Client
with a small timeout (e.g., 1–5s) for rl_ready() or by setting a per-request
timeout via reqwest::Client::execute with tokio::time::timeout; update the code
paths that call rl_ready() (and the similar endpoints referenced around the
3182-3203 range) to use this short-timeout client/request wrapper so readiness
fails fast with a 503 instead of blocking on the shared 600s http_client.
- Around line 977-988: The RL promotion blocks that check rl_want_token_ids must
force request.inner.logprobs = Some(true) even if it was explicitly set to
false; locate the two places where rl_want_token_ids is used (the block that
manipulates nvext.extra_fields and the similar RL promotion block around lines
~2403-2416) and set request.inner.logprobs = Some(true) unconditionally when
adding "token_ids"/"completion_token_ids" to nvext.extra_fields so
completion_token_ids isn't lost on the backend.
- Around line 1280-1298: The current logic only populates rl_saved_messages and
rl_tito_token_ids when DYN_ENABLE_RL is true, causing prompt_token_ids to be
missing or incorrect for requests that explicitly requested return_token_ids or
used tools/chat_template_args; before calling engine.generate() (the point where
the prompt is finalized), capture the fully rendered prompt text and/or preserve
all prompt-shaping inputs (request.inner.messages, request.tools,
request.chat_template_args and request.nvext.token_data) unconditionally when
request.return_token_ids is set (or equivalent request intent), and use those
captured values for post-response tokenization instead of gating on
DYN_ENABLE_RL; update the references rl_saved_messages and rl_tito_token_ids so
they are derived from the captured prompt/inputs (and fall back to token_data if
present) and ensure downstream post-processing keys off request.return_token_ids
rather than the environment variable.
- Around line 3299-3404: The current rl_update_weights handler
(rl_update_weights, RlState::all_ok, state.fan_out) claims atomicity but can
leave a mixed fleet when update_weights_from_path partially fails; fix by making
the update transactional: before calling update_weights_from_path, fan_out a
"get_weight_version" (or similar) to capture each worker's current version, then
call update_weights_from_path as now; if load_results is not all_ok, call
state.fan_out("update_weights_from_path", {"path": <prev_path>, "version":
<prev_version>}) targeted only to workers that reported success to rollback them
to their captured pre-update version, and return a BAD_GATEWAY with detailed
per-worker status including which were rolled back and which failed; update the
docblock to stop advertising implicit atomicity or note the implemented rollback
behavior and reference rl_update_weights, fan_out, update_weights_from_path, and
the get_weight_version/rollback calls.
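The short-timeout readiness probe suggested earlier in this group (the /v1/rl/ready comment) could look roughly like the sketch below. reqwest and tokio are assumed; the function name, the caller-supplied timeout, and the 5 s default are illustrative.

```rust
use std::time::Duration;

// Illustrative only: probe one worker with a short bound instead of the shared 600 s client.
// `timeout` would come from DYN_RL_LIVENESS_TIMEOUT_MS (default assumed to be 5 s).
async fn probe_worker_ready(client: &reqwest::Client, url: &str, timeout: Duration) -> bool {
    match tokio::time::timeout(timeout, client.get(url).send()).await {
        Ok(Ok(resp)) => resp.status().is_success(),
        // Network error or timeout: report not ready so the aggregate endpoint can answer 503.
        Ok(Err(_)) | Err(_) => false,
    }
}
```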
In `@lib/llm/src/protocols/openai/chat_completions.rs`:
- Around line 63-64: Update the stale doc comment that currently references the
removed `/v1/chat/completions/tokens` endpoint so it instead tells integrators
to use the standard `/v1/chat/completions` endpoint with `prompt_token_ids`;
locate the comment near the chat completions implementation in
openai::chat_completions (file chat_completions.rs) and replace the two-line
note about TITO mode to say that on `/v1/chat/completions` the field is accepted
but ignored and that TITO-style token-authoritative requests should be made by
passing `prompt_token_ids` on `/v1/chat/completions`.
In `@lib/llm/src/protocols/openai/chat_completions/delta.rs`:
- Around line 117-120: The shared Vec accumulated_completion_token_ids in the
streaming delta handling must be made per-choice instead of global: change
accumulated_completion_token_ids to a map keyed by the chunk's choice index
(e.g., HashMap<usize, Vec<TokenIdType>>) and append token IDs into
accumulated_completion_token_ids[chunk.index]; when promoting to
choices[i].token_ids (the code paths that read/emit completion_token_ids) pull
from the per-index entry and clear it on finish_reason; alternatively, if you
prefer to keep a single accumulator, explicitly reject multi-choice requests at
the start of the RL path (when n > 1) so chunks from different indices cannot
interleave. Ensure the same per-index approach is applied to the other
accumulator usages referenced around the choices handling and promotion logic.
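A minimal sketch of the keyed-by-index accumulator suggested above; TokenIdType, the struct, and the method names are placeholders rather than the real delta.rs types.

```rust
use std::collections::HashMap;

type TokenIdType = u32; // placeholder alias for the real token ID type

// Illustrative only: accumulate streamed token IDs per choice index instead of one shared Vec.
#[derive(Default)]
struct PerChoiceTokenIds {
    by_choice: HashMap<usize, Vec<TokenIdType>>,
}

impl PerChoiceTokenIds {
    fn push(&mut self, choice_index: usize, token_ids: &[TokenIdType]) {
        self.by_choice.entry(choice_index).or_default().extend_from_slice(token_ids);
    }

    // On a choice's finish chunk, take (and clear) its accumulated IDs for promotion.
    fn take(&mut self, choice_index: usize) -> Vec<TokenIdType> {
        self.by_choice.remove(&choice_index).unwrap_or_default()
    }
}
```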
In `@lib/llm/src/protocols/openai/tokenization.rs`:
- Around line 61-84: Call TokenizeChatRequest::validate() when handling
TokenizeRequest::Chat before using merged_chat_template_kwargs(); specifically,
after deserializing the chat tokenize request in the TokenizeRequest::Chat
branch, invoke request.validate()? and propagate any error (return Err) so
invalid states (both continue_final_message and add_generation_prompt true) are
rejected; this ensures TokenizeChatRequest::validate enforces constraints prior
to calling TokenizeChatRequest::merged_chat_template_kwargs().
In `@lib/llm/src/protocols/openai/validate.rs`:
- Around line 117-118: Remove the internal Linear ticket IDs "HH-22 / HH-26"
from the comment block in validate.rs and replace them with either a GitHub
issue reference like "#NNNN" or neutral prose (e.g., "see related issue" or
"related enhancement") so the comment no longer contains internal ticket IDs;
update the comment text surrounding the line that mentions the "tokens variant
of /v1/chat/completions" to use the new reference or neutral wording.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 100e02b7-f3fb-4a7f-9860-7bf2a725db68
📒 Files selected for processing (25)
components/src/dynamo/frontend/vllm_processor.py
components/src/dynamo/vllm/handlers.py
components/src/dynamo/vllm/publisher.py
components/src/dynamo/vllm/worker_factory.py
container/deps/vllm/install_vllm.sh
docs/Dynamo-RL-api-draft.md
lib/llm/src/audit/stream.rs
lib/llm/src/entrypoint/input/text.rs
lib/llm/src/http/service/openai.rs
lib/llm/src/http/service/service_v2.rs
lib/llm/src/preprocessor.rs
lib/llm/src/protocols/anthropic/types.rs
lib/llm/src/protocols/openai.rs
lib/llm/src/protocols/openai/chat_completions.rs
lib/llm/src/protocols/openai/chat_completions/aggregator.rs
lib/llm/src/protocols/openai/chat_completions/delta.rs
lib/llm/src/protocols/openai/completions/delta.rs
lib/llm/src/protocols/openai/nvext.rs
lib/llm/src/protocols/openai/responses/mod.rs
lib/llm/src/protocols/openai/tokenization.rs
lib/llm/src/protocols/openai/validate.rs
lib/tokenizers/src/fastokens.rs
lib/tokenizers/src/hf.rs
lib/tokenizers/src/lib.rs
lib/tokenizers/src/tiktoken.rs
End-to-end TITO support on /v1/chat/completions: callers post
prompt_token_ids as a top-level extension (and stop_token_ids,
allowed_token_ids, bad_words_token_ids, truncate_prompt_tokens for
SamplingParams parity with vLLM /inference/v1/generate). vLLM 0.20+
already accepts these on /v1/chat/completions natively; this commit is
the bridge until that upgrade lands.
Plumbing:
validate.rs: PASSTHROUGH_EXTRA_FIELDS adds stop_token_ids,
bad_words_token_ids, allowed_token_ids, truncate_prompt_tokens.
protocols/openai.rs: OpenAIStopConditionsProvider gains
get_stop_token_ids() (default None); extract_stop_conditions()
plumbs into common::StopConditions::stop_token_ids_hidden so
vLLM's SamplingParams.stop_token_ids is honored end-to-end.
protocols/openai/chat_completions.rs: NvCreateChatCompletionRequest
overrides get_stop_token_ids and get_pretokenized_input to read
unsupported_fields. ValidateRequest::validate accepts empty
messages when prompt_token_ids is present, 400s on mutual-exclusion
violations.
protocols/openai/nvext.rs: NvExtProvider gains get_pretokenized_input
with a default impl reading nvext.token_data; chat-completions
overrides to also read top-level prompt_token_ids extension.
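A sketch of that hook and its chat-completions override; the trait and struct shapes here are simplified stand-ins for the real NvExtProvider / NvExt definitions in nvext.rs.

```rust
type TokenIdType = u32; // placeholder alias

// Illustrative shapes only.
struct NvExt {
    token_data: Option<Vec<TokenIdType>>, // legacy pre-tokenized channel
}

trait NvExtProvider {
    fn nvext(&self) -> Option<&NvExt>;

    // Default: read the legacy nvext.token_data channel.
    fn get_pretokenized_input(&self) -> Option<Vec<TokenIdType>> {
        self.nvext().and_then(|ext| ext.token_data.clone())
    }
}

// Chat completions also check the new top-level prompt_token_ids extension,
// falling back to the legacy channel so renderer-mode callers keep working.
struct ChatRequest {
    prompt_token_ids: Option<Vec<TokenIdType>>,
    nvext: Option<NvExt>,
}

impl NvExtProvider for ChatRequest {
    fn nvext(&self) -> Option<&NvExt> {
        self.nvext.as_ref()
    }
    fn get_pretokenized_input(&self) -> Option<Vec<TokenIdType>> {
        self.prompt_token_ids
            .clone()
            .or_else(|| self.nvext().and_then(|ext| ext.token_data.clone()))
    }
}
```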
preprocessor.rs: apply_template short-circuits to None when pre-
tokenized input is present (avoids 'undefined value' template
errors on empty messages); gather_tokens reads via the same
NvExtProvider hook covering both channels.
http/service/openai.rs: validate_chat_completion_required_fields
accepts empty messages when pre-tokenized input is provided.
End-to-end probes against Qwen/Qwen3-0.6B:
/v1/chat/completions with prompt_token_ids + stop_token_ids:
200, finish_reason=stop on stop-token, nvext.completion_token_ids
populated.
stop_token_ids=[16] forces stop on first generated token: 200,
immediate stop (proves SamplingParams.stop_token_ids honored).
messages + prompt_token_ids: 400 mutual-exclusion error.
messages-only (MITO): unchanged — chat template applied normally.
Smoke regression on bis-dev/4-02 against Qwen/Qwen3-0.6B:
sft (full-FT bracket): PASS, mismatch_kl 0.0006/0.0007
lora (hot-swap): PASS, lora_id=1626203954
Pairs with prime-rl bis/prime-rl-merged commits that add
VLLMGenerateClient / DynamoGenerateClient + setup_generate_client and
thread them through compute_teacher_logprobs. Same client.backend axis
as setup_admin_api; one config field drives both admin and data paths.
…clusion
Three improvements borrowed from PR #9141 (Ameen Patel) on top of the existing PASSTHROUGH_EXTRA_FIELDS expansion:
1. get_stop_token_ids returns Result<Option<Vec<TokenIdType>>>, not Option<Vec<TokenIdType>>. Malformed payloads (e.g. stop_token_ids: "not-an-array") now surface as a typed 400 with the diagnostic 'stop_token_ids must be an array of unsigned token IDs: {err}'. extract_stop_conditions propagates the Result via ?. This replaces the prior silent-fallback Option<> variant, which dropped malformed inputs without telling the caller. Silent drops on RL correctness primitives (stop conditions affect what tokens the engine emits) are the bug class CR-9 was about; the same principle applies here.
2. Mutual exclusion between messages and pre-tokenized input is now scoped to the canonical top-level prompt_token_ids extension only. The legacy nvext.token_data channel, which the verifiers dynamo_chat_nvext renderer transport (#1287) uses with placeholder messages 'role: user, content: (token-in mode)', is allowed to coexist with non-empty messages. validate_messages still gates the empty-messages-with-no-tokens case. Without this relaxation, the renderer transport's placeholder pattern would 400 on every request.
3. Two new tests in test_common_ext.rs:
   - test_chat_completions_stop_token_ids_extraction: positive case with nvext.token_data + top-level stop_token_ids (lifted from PR #9141 verbatim).
   - test_chat_completions_stop_token_ids_malformed_returns_400: verifies the typed-error path on bad input.
Pre-existing test struct-init sites in test_common_ext.rs were missing required fields (return_token_ids, tokens) added to the NvCreateChatCompletionRequest struct since the tests were written. Three sites updated to construct cleanly.
cargo test test_common_ext: 15 tests, 15 passes.
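A minimal sketch of the typed extraction described in item 1 above, assuming the field arrives as a top-level JSON extra; the function signature and the use of anyhow are illustrative, not the exact trait method.

```rust
use serde_json::Value;

type TokenIdType = u32; // placeholder alias

// Illustrative only: parse a top-level `stop_token_ids` extra field into a typed result so
// malformed payloads surface as a 400 instead of being silently dropped.
fn get_stop_token_ids(extra: &serde_json::Map<String, Value>) -> anyhow::Result<Option<Vec<TokenIdType>>> {
    match extra.get("stop_token_ids") {
        None => Ok(None),
        Some(value) => serde_json::from_value::<Vec<TokenIdType>>(value.clone())
            .map(Some)
            .map_err(|err| anyhow::anyhow!("stop_token_ids must be an array of unsigned token IDs: {err}")),
    }
}
```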
Cleanup-only — no behavior change. Strips review-tracker noise that accumulated on top of PR-added text during iteration:
- "Closes hhzhang16 HH-19/HH-21/HH-22/HH-23/HH-25/HH-26/HH-27"
- "CR-8 / CR-9 / CR-10 closure" prefixes on serde-error / doc-attach fixes
- Branch-name references: bis/parity-tokenize-tcp, bis/prime-rl-merged
- Internal PR numbers: #6094, #7699, #8197, #9141
- Phase numbers from internal design docs (rl-support.md Phase 1/4/5)
- "prime-rl" mentions in narrative copy and mermaid diagrams → generic "RL trainer / RL orchestrator / external client"
Technical content (semantics, invariants, why-this-exists rationale) preserved everywhere; only the internal-process scaffolding is removed.
Scope verification: every removed line is one this branch ADDED (diff main..HEAD shows the removed text on a `+` line). No edits land on pre-existing main-branch comments. Specifically reverted the nvext.rs cleanup attempt — its target lines (GAIE Stage 1/2, SGLang-specific) live on main, not in this PR's diff.
Files touched:
  components/src/dynamo/vllm/handlers.py                   +9 -10
  components/src/dynamo/vllm/worker_factory.py             +6 -4
  docs/dynamo-RL-api.md                                    +19 -32
  lib/llm/src/http/service/openai.rs                       +32 -34
  lib/llm/src/protocols/openai/chat_completions/delta.rs   +4 -4
  lib/llm/src/protocols/openai/completions/delta.rs        +3 -3
  lib/llm/src/protocols/openai/validate.rs                 +20 -20
cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
…acing
Four blocking findings from a Graham-style review.
1. resolve_model_for_chat (openai.rs ~2086): replace
served_models.into_iter().next().unwrap() after a len()==1 check with
let-Some on drain().next(). Eliminates the silent-panic-if-empty path
without adding a Safety: comment for what's now self-evident.
2. RlState::from_env (openai.rs ~3128): builder().build() failures (TLS
init, resolver init) panicked the frontend on first request boot via
.expect("Failed to create RL router HTTP client"). Now returns
anyhow::Result<Self> and surfaces a typed error. Caller rl_router()
becomes anyhow::Result<(Vec<RouteDoc>, Router)>; service_v2::build()
propagates with `?` (it already returns Result<HttpService, anyhow::Error>).
3. rl_update_weights (openai.rs ~3349): replace `if weight_dir.is_none() {
return ... } let weight_dir = weight_dir.unwrap();` with
`let Some(weight_dir) = body.weight_dir.clone() else { return ... };`.
One match instead of two; no unwrap.
4. Structured tracing fields, not format strings (8 sites in rl_pause /
rl_resume / rl_update_weights / rl_load_lora_adapter /
rl_unload_lora_adapter):
tracing::warn!("RL pause: some workers failed: {:?}", results);
→
tracing::warn!(?results, "RL pause: some workers failed");
Same shape applied to info!/warn! calls that interpolated
worker_count, mode, weight_dir, version, lora_name, lora_path. Use %
for Display, ? for Debug per tracing::Value docs.
Verification:
cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --test test_common_ext: 15 passed.
Quick-win review fixes from PR #9131. Heavy-lift items (#9 prompt_token_ids env-gate, #11 update_weights atomicity, #13 per-choice completion_token_ids) tracked separately as follow-ups.
handlers.py
- Catch EngineDeadError before the generic except in all 8 RL handlers (pause/resume/liveness_probe/get_state/flush_cache/update_weights_from_path/load_lora_adapter/unload_lora_adapter): match the existing shutdown pattern in this file so admin calls also surface engine death instead of leaving a broken worker alive.
- get_state: fall back to a no-op collective_rpc when check_health is absent — same fallback liveness_probe already uses; otherwise older engines without check_health always look alive.
- load_lora_adapter hot-swap path: a remove_lora() failure now returns a 400-style error response (was: silent log warn + continue, leaving add_lora to no-op against the still-registered ID); a reset_prefix_cache() failure after add_lora succeeds also returns an error (was: log error and continue, leaving stale KV from the old adapter routable).
- unload_lora_adapter: an unregister_model() failure after engine remove_lora succeeds now returns an error (was: log warn and report success, leaving model=<lora_name> still routed to this worker even though _resolve_lora_request would now fall back to the base model).
container/deps/vllm/install_vllm.sh
- Pin the prime-rl install to an immutable commit SHA (d49f3939e7dca29bceb9ed515cc1782497b67e81 ↔ tag v0.5.1.dev101) so a re-pointed tag upstream can't change what we ship. PRIME_RL_REF kept in build logs for human readability; PRIME_RL_COMMIT is the authoritative pin.
- Replace `echo "\n=== ..."` with `printf '\n=== ...\n'` (shellcheck SC2028).
lib/llm/src/http/service/openai.rs
- Force `request.inner.logprobs = Some(true)` unconditionally in both RL token-id promotion blocks (was: only when None). RL extraction of completion_token_ids depends on logprobs being on at the engine; an explicit logprobs=false would otherwise silently drop them.
- Bound `/v1/rl/ready` per-worker probes with a 5s timeout (override via DYN_RL_LIVENESS_TIMEOUT_MS). Was reusing the shared 600s http_client, so one wedged worker could block readiness for 10 minutes instead of failing fast as 503.
- Tokenize Chat handler: call `request.validate()?` before `merged_chat_template_kwargs()` so the continue_final_message + add_generation_prompt mutual-exclusion constraint is enforced (validate() existed but was never invoked).
lib/llm/src/protocols/openai/chat_completions.rs
- Update stale doc comments on the legacy `tokens` and `return_token_ids` fields: they pointed callers at the now-404 `/v1/chat/completions/tokens` URI. Direct callers to the canonical top-level `prompt_token_ids` extension and `nvext.extra_fields` instead.
cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --test test_common_ext: 15 passed.
…e shutdowns, tier the docs
Stack-ranked review fixes from the latest Graham-style pass on this branch.
Tier 4 (commit message conventions and tokenizer-crate scope) deliberately
deferred.
Tier 1 — correctness / contract claims
- Drop the word "atomic" from the `/v1/rl/update_weights` docblock and
spell out the partial-failure semantics: workers `0..N-1` may have
switched while worker `N` failed; per-worker status is in the
response, true rollback is a follow-up.
- Reject `n > 1` on `/v1/chat/completions` when RL token IDs are
requested. The streaming aggregator's `completion_token_ids` Vec is
shared across all choices, so per-choice promotion downstream cannot
recover which tokens belong to which choice with `n > 1`. Hard-reject
is the interim guard until the keyed-by-index refactor lands.
- `update_weights` empty `weight_dir` (`""`) is now treated the same
as null/missing (NCCL-mode no-op) instead of being forwarded to the
engine as `path=""`.
Tier 2 — hard rules
- Hot-path `[RL]` `logger.info` → `logger.debug` for pause / resume /
flush_cache / weights-load / LoRA load / LoRA unload (8 sites). RL
trainers fire these per training step; info-level was a log flood.
- Extract the duplicated 4-line `EngineDeadError` shutdown stanza into
`BaseWorkerHandler._shutdown_on_engine_dead(e) -> NoReturn` and
collapse the 9 call sites (8 RL handlers + 1 generate path) to a
single line each. ~32 lines removed.
- Strip the remaining internal-tracker comments missed by an earlier
chore-scrub: `service_v2.rs` had `jthomson04 PR #7699`,
`bis-dev/design-docs/rl-support.md §1`, and `hhzhang16 HH-22 / HH-26`
references in two places. Replaced with neutral prose.
- Strip the SGLang-coordination comment in `handlers.py` ("Signatures
intentionally line up with the SGLang RL admin routes"). The kind of
line that goes stale when SGLang's admin set drifts.
- Delete dead-code carcasses for the dropped routes
(`/v1/chat/completions/tokens`, `/v1/tokenize`, `/v1/detokenize`):
remove `tokenize`, `detokenize`, `tokenization_router`,
`chat_completions_tokens_router`, `handler_chat_completions_tokens`
(~240 lines), drop `pub mod tokenization;`, delete the
`tokenization.rs` module file (124 lines), drop unused tokenize-type
imports. All five were behind `#[allow(dead_code)]` and unmounted.
Tier 3 — tests
- Make `RlState::new(...)` a pub(super) test-friendly constructor so
handler-level tests don't need `from_env` / process env vars.
- Convert `RlPauseQuery::mode: Option<String>` to `Option<PauseMode>`,
a typed enum with `serde(rename_all = "lowercase")`. Axum now returns
400 on `mode=foo` before the handler runs; the runtime string match
is gone.
- Add four behavior tests in `mod tests`:
test_pause_mode_serde_roundtrip
test_pause_mode_rejects_unknown_value
test_rl_update_weights_body_defaults
test_rl_state_new_constructs_without_env
- Fix unrelated test struct-init breakage that shipped earlier in
this branch: 26 sites across 11 files were missing the
`prompt_token_ids`, `return_token_ids`, `tokens`, and
`completion_token_ids` fields added to the response/request/nvext
structs. `cargo test -p dynamo-llm` now compiles cleanly.
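A minimal sketch of the typed pause mode from the Tier 3 item above, assuming serde and axum's Query extractor; the test names mirror the ones listed but the bodies are illustrative.

```rust
use serde::{Deserialize, Serialize};

// Illustrative only: typed pause mode so axum's Query extractor rejects `mode=foo`
// with a 400 before the handler runs.
#[derive(Debug, Clone, Copy, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
enum PauseMode {
    Keep,
    Wait,
    Abort,
}

#[derive(Deserialize)]
struct RlPauseQuery {
    mode: Option<PauseMode>,
    clear_cache: Option<bool>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn pause_mode_serde_roundtrip() {
        let mode: PauseMode = serde_json::from_str("\"abort\"").unwrap();
        assert_eq!(mode, PauseMode::Abort);
        assert_eq!(serde_json::to_string(&mode).unwrap(), "\"abort\"");
    }

    #[test]
    fn pause_mode_rejects_unknown_value() {
        assert!(serde_json::from_str::<PauseMode>("\"foo\"").is_err());
    }
}
```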
Tier 5 — design / nits
- `DYN_RL_LIVENESS_TIMEOUT_MS` is read once in `RlState::from_env` and
cached as `RlState.probe_timeout: Duration`; `rl_ready` and
`rl_liveness` use the cached value instead of re-parsing the env on
every request.
- `rl_ready` worker-probe body simplified from a four-link
`.ok().and_then(Result::ok).map(...).unwrap_or(false)` chain to a
`match (timeout, send_result)` that surfaces the timeout/network
distinction for a future log line.
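A sketch of the cached timeout from the first Tier 5 item, assuming a plain env read at construction; the struct is reduced to the one field relevant here and the 5 s default is taken from the earlier description.

```rust
use std::time::Duration;

// Illustrative only: read DYN_RL_LIVENESS_TIMEOUT_MS once at construction and reuse it,
// instead of re-parsing the env var on every /v1/rl/ready and /v1/rl/liveness request.
struct RlState {
    probe_timeout: Duration,
}

impl RlState {
    fn from_env() -> Self {
        let ms = std::env::var("DYN_RL_LIVENESS_TIMEOUT_MS")
            .ok()
            .and_then(|v| v.parse::<u64>().ok())
            .unwrap_or(5_000); // assumed 5 s default
        Self {
            probe_timeout: Duration::from_millis(ms),
        }
    }
}
```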
cargo check -p dynamo-llm: clean (1 pre-existing benign warning).
cargo test -p dynamo-llm --lib: 58 passed (4 new RL tests).
cargo test -p dynamo-llm --tests integration suites: all green.
…, no behavior change)
…ansport (Phase 1+4 of weight-transfer-config.md)
… chained .or_else()
…ht_version,*_lora_adapter} + legacy update_weights body; gate listener on DYN_ENABLE_RL_ENDPOINTS
…er registers <ns>.<comp>.rl with rl_dispatch; frontend dispatches via Client::wait_for_instances + direct())
Overview
End-to-end RL integration into Dynamo's vLLM backend. Replaces / supersedes #8630.
This PR delivers the design captured in two in-tree planning docs:
- plans/rl-crate.md — standalone dynamo-rl crate, request-plane fan-out, RlClient SDK, optional /v1/rl/* HTTP facade.
- plans/weight-transfer-config.md (v5) — single discriminated update_weights body covering full-FT and LoRA across pluggable backends; two-layer worker seam (WeightTransport + EngineAdapter).
Branch: bis/dynamo-rl@7c15b6b9cd.
Architecture in three pieces
1. Standalone dynamo-rl crate (lib/rl/)
Dependency direction is dynamo-llm → dynamo-rl → dynamo-runtime. The crate has zero dep on dynamo-llm.
The SDK owns: discovery list/watch, per-component request-plane clients, fan-out across all matching rl instances, optional component filters, strict direct addressing, membership epoch/fingerprint checks, empty-fanout behavior, timeout/retry policy, and per-worker success/failure reporting.
Strict direct addressing: for admin RPC (weight update / pause / etc.) we cannot silently fall back to a different instance if the target disappears mid-call — the SDK fails with RlError::MembershipChanged { before_epoch, after_epoch } and forces the caller to reconcile.
2. Worker contract — single rl request-plane endpoint
Per rl-crate.md § "Worker contract". When DYN_ENABLE_RL=true, each worker registers exactly one discoverable endpoint per component, served by BaseWorkerHandler.rl_dispatch(request), which dispatches on request["op"] to the existing handler methods. Request envelope: {"op": ..., plus op-specific body fields}.
The legacy per-RL-op register_engine_route("pause_generation"…) HTTP-on-system-port mechanism + DYN_RL_WORKER_SYSTEM_URLS static URL list are gone from the steady-state path. Discovery is the only source of truth for "which workers are live".
3. Discovery + fan-out — the request plane
Frontend fan-out flow (dynamo_rl::RlState::fan_out):
- Get the Client for <ns>.<comp>.rl from the runtime.
- client.wait_for_instances() (5 s bound) so the etcd watcher populates before dispatch.
- client.instance_ids() → list of live workers.
- PushRouter::<Value, Annotated<Value>>::from_client(client, RouterMode::Direct).
- router.direct(SingleIn::new({op, body}), instance_id), drain first Annotated.data.
NATS / shared TCP is the wire. No reqwest::Client, no system-port HTTP.
HTTP facade — /v1/rl/*
Per weight-transfer-config.md v5 + rl-crate.md § "HTTP facade". Mounted on a dedicated listener (port DYN_RL_PORT / --rl-port, default 8002) when DYN_ENABLE_RL_ENDPOINTS=true. Inference plane on :8000 is untouched.
After prime-rl migrates: the two LoRA aliases collapse into update_weights {target.kind="lora"}; /v1/rl/health is removed or stays as a trivial alias to the frontend's existing /live. No read-side state endpoint is added until a real caller appears (deferred design captured in plans/rl-state-endpoint.md).
WeightTransferConfig — update_weights wire shape
Per weight-transfer-config.md v5. Single discriminated body covering both full-FT and LoRA. No legacy {weight_dir, weight_version, reset_prefix_cache} form. init_transport is idempotent: filesystem treats it as a no-op; nccl performs the worker-side init_weight_transfer_engine collective bootstrap and returns once every rank has joined the StatelessProcessGroup.
In-scope this iteration: filesystem, nccl. Future (deferred): nixl, model_express, ipc. The trait shape is designed so each future backend drops in as one extra WeightTransport subclass — see § "WeightTransport seam" below.
WeightTransport seam — pluggable backends
Per weight-transfer-config.md v5 § "Trait on the worker side". Two-layer seam, mapping dynamo_rl::handlers::* → WeightTransport impl → EngineAdapter impl (vLLM collective_rpc(...), future sglang tokenizer_manager.update_weights_from_*).
Adding a new backend = one WeightTransport subclass. Adding sglang = one EngineAdapter subclass. Axes don't multiply. This iteration ships the filesystem and nccl transports.
LoRA folds into update_weights {target.kind="lora", op="load|swap|unload"}. The internal handlers (vllm_engine.add_lora vs collective_rpc("reload_weights")) stay separate; the wire surface is one.
TITO on /v1/chat/completions
No separate URI. Pre-tokenized input on the standard chat-completions request:
```json
{
  "model": "...",
  "messages": [],               // empty allowed when prompt_token_ids is set
  "prompt_token_ids": [...],
  "stop_token_ids": [...],
  "allowed_token_ids": [...],
  "bad_words_token_ids": [...],
  "truncate_prompt_tokens": N,
  "nvext": { "return_token_ids": true, "return_prompt_logprobs": true, ... }
}
```
- validate.rs::PASSTHROUGH_EXTRA_FIELDS allowlists these plus weight_version, return_routed_experts, cache_salt.
- OpenAIStopConditionsProvider::get_stop_token_ids() returns typed Result<Option<Vec<TokenIdType>>> so malformed input (stop_token_ids: "not-an-array") surfaces a 400 instead of being silently dropped. extract_stop_conditions plumbs into common::StopConditions::stop_token_ids_hidden, honored end-to-end by vLLM SamplingParams.
- NvExtProvider::get_pretokenized_input() reads from both legacy nvext.token_data and the new top-level prompt_token_ids so existing renderer-mode callers keep working.
- apply_template short-circuits when pre-tokenized input is present (no MiniJinja "undefined value" error on empty messages).
- chat_completions/delta.rs, completions/delta.rs: nvext build paths replaced silent if let Ok(json) = … with match + tracing::warn! on Err. A dropped nvext means promoted token IDs / weight version never reach the trainer — silent-corruption class bug, now logged.
Removed dead URIs (404 in this branch)
- /v1/chat/completions/tokens — TITO collapses into /v1/chat/completions with prompt_token_ids.
- /v1/tokenize, /v1/detokenize — owned by feat: Nemo-rl support #7699 (NeMo-rl scope), not required by prime-rl.
- Handlers kept behind #[allow(dead_code)] for downstream compat; physical deletion is a follow-up.
Env vars (frontend + worker)
- DYN_ENABLE_RL (default false): Frontend enables RL extensions on /v1/chat/completions (TITO, nvext). Worker registers the <ns>.<comp>.rl request-plane endpoint + RL handlers.
- DYN_ENABLE_RL_ENDPOINTS (default false): mounts the /v1/rl/* HTTP facade.
- DYN_NAMESPACE: namespace component of the <ns>.<comp>.rl endpoint.
- DYN_RL_PORT (default 8002): /v1/rl/* listener. Also --rl-port.
No DYN_RL_WORKER_SYSTEM_URLS, no DYN_RL_TARGETS, no DYN_RL_COMPONENTS for v1. Optional component filters can be request fields.
Useful deployment combinations (DYN_ENABLE_RL frontend / DYN_ENABLE_RL_ENDPOINTS / DYN_ENABLE_RL workers): false false false · true true true · true, admin: optional · true false true
End-to-end smokes
Against bis/dynamo-rl@7c15b6b9cd, all on RTX PRO 6000 Blackwell with Qwen3-0.6B + math-env (GSM8K) + filesystem broadcast:
- bis-dev/7/wt-transfer/sft-file/run.sh: update_weights {target.kind=base, transport.backend=filesystem} step_1 → resume; worker logs [RL] filesystem.update_weights: base reload from … (version=step_1).
- bis-dev/7/wt-transfer/lora-file/run.sh: update_weights {target.kind=lora, op=load} step_1; worker logs [RL] LoRA adapter loaded: name=….
- bis-dev/7/wt-transfer/sft-file-2w/run.sh: init_transport and update_weights fan out to 2 workers, both status:ok. Validates the request-plane path (Discovery + PushRouter::direct).
- bis-dev/7/wt-transfer/sft-nccl/run.sh: init_transport {backend=nccl} reaches worker → vLLM creates NCCLWeightTransferEngine → pynccl.PyNcclCommunicator rejects two ranks on one physical GPU. Confirms the init_weight_transfer_engine collective_rpc dispatch path; needs a 2-GPU host (or backend="ipc") for the data plane.
- cargo test test_common_ext: get_stop_token_ids typed Result + 400-on-malformed regression tests.
- cargo check --workspace.
- TITO + LoRA combined smoke (tito-multiturn) confirmed _generate_token_mode … will use LoRA adapter: … on both unary and streaming paths.
Files touched
51 files, +4,733 / −182. Hot spots:
- lib/rl/src/lib.rs — +1,072 (new crate: RlClient SDK, request types, fan-out, HTTP handlers, route definitions)
- docs/RL.md — +667 (full API reference)
- components/src/dynamo/vllm/handlers.py — +693 (worker handlers: rl_dispatch, weight_transport_*)
- components/src/dynamo/vllm/weight_transports/* — +666 across 5 files (WeightTransport Protocol + impls + factory)
- lib/llm/src/http/service/service_v2.rs — +349 (separate listener wiring, env-var split, RlHttpDeps)
- lib/llm/src/http/service/openai.rs — +456 (TITO mutual-exclusion, nvext promotion, dead-URI 404s)
- components/src/dynamo/vllm/worker_factory.py — +62 (rl_endpoint = runtime.endpoint("…rl").serve_endpoint(handler.rl_dispatch, …))
Where reviewers should start
- plans/rl-crate.md + plans/weight-transfer-config.md — design / contract for the entire RL surface.
- lib/rl/src/lib.rs — RlClient SDK + HTTP handlers; the headline of the request-plane refactor.
- components/src/dynamo/vllm/handlers.py::rl_dispatch + worker_factory.py::rl_endpoint.serve_endpoint(...) — worker side of the request-plane endpoint.
- components/src/dynamo/vllm/weight_transports/{base,filesystem,nccl,engine_adapter}.py — WeightTransport trait + concrete impls.
- lib/llm/src/protocols/openai/{chat_completions,validate,nvext}.rs — TITO mutual-exclusion + PASSTHROUGH_EXTRA_FIELDS.
- lib/llm/tests/test_common_ext.rs — typed Result regression tests for stop_token_ids.
Commit walk-through
- 0d74ecaf: bis/parity-tokenize-tcp.
- d295ebc6: /v1/chat/completions, drop dead URIs.
- f034171: /v1/chat/completions absorbs TITO with full SamplingParams parity.
- a2cc90da: get_stop_token_ids typed Result; mutual-exclusion relaxed to canonical channel only; regression tests.
- 2cb5e604, 0ee6cbe4 / 4aac7e89 / 8e08e32b: unwrap/expect, structured tracing, dead-code removal, Tier 1/2/3/5 + CodeRabbit.
- 6007c77c: dynamo-rl crate at lib/rl/. Pure refactor; behavior unchanged.
- 67058be7: WeightTransport Protocol + EngineAdapter shim + FilesystemTransport + NcclTransport.
- 93a7e411: cargo fmt cleanup.
- 575afd9e: /v1/rl/{state,health,ready,liveness,weight_version,*_lora_adapter} + legacy update_weights body; gate listener on DYN_ENABLE_RL_ENDPOINTS.
- b6e471de: PushRouter::direct. Worker registers single <ns>.<comp>.rl endpoint with rl_dispatch; frontend dispatches via Client::wait_for_instances + direct().
- 626d3e44: RlClient SDK — MembershipSnapshot, FanoutPolicy, typed RlRequest variants, separate listener on DYN_RL_PORT.
- 7c15b6b9: unload allowed without a transport block.
Pairs with prime-rl
bis/prime-rl-merged adds VLLMGenerateClient / DynamoGenerateClient + setup_generate_client and threads them through compute_teacher_logprobs. Same client.backend axis as setup_admin_api — one config field drives both admin and data paths. After prime-rl migrates, admin_base_url switches from http://dynamo-frontend:8000/v1/rl (legacy single-listener) to http://dynamo-frontend:8002/v1/rl once DYN_ENABLE_RL_ENDPOINTS=true and the dedicated listener is live.
Closes / relates
- HH-19 (single state endpoint deferred; design in plans/rl-state-endpoint.md), HH-21 (3-mode pause), HH-22 / HH-26 (TITO via prompt_token_ids on /v1/chat/completions), HH-23 (liveness via the frontend's existing /live), HH-25 (RL-specific endpoint surface), HH-27 (weight_version folded into the version field of update_weights)
- plans/rl-crate.md — frontend / SDK / listener / worker contract
- plans/weight-transfer-config.md — wire shape + worker trait
- plans/rl-state-endpoint.md — deferred read-side endpoint design
Summary by CodeRabbit
Release Notes
New Features
- dynamo-rl crate hosting the /v1/rl/* admin surface with a request-plane fan-out SDK (RlClient, MembershipSnapshot, FanoutPolicy)
- Single discriminated update_weights body covering both full-FT and LoRA with pluggable backends (filesystem, nccl)
- WeightTransport Python Protocol + FilesystemTransport / NcclTransport impls; EngineAdapter shim isolates engine flavor
- init_transport for backend setup (e.g. NCCL group bootstrap)
- Dedicated /v1/rl/* listener (DYN_RL_PORT / --rl-port) gated by DYN_ENABLE_RL_ENDPOINTS
- TITO on /v1/chat/completions via prompt_token_ids + nvext extensions
- <ns>.<component>.rl request-plane endpoint with rl_dispatch; replaces legacy HTTP-on-system-port engine routes
Documentation
- docs/RL.md — full API reference
- plans/rl-crate.md and plans/weight-transfer-config.md — design / contract docs