[1/N] Polish evaluation skills and common skills based on an E2E workflow testing #1239

Edwardf0t1 wants to merge 8 commits into main from
Conversation
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Walkthrough

Adds documentation tying PTQ → Deployment → Evaluation into a single pipeline: workspace conventions and artifact locations, cluster/CI execution guidance (SLURM vs NEL CI), registry/auth instructions, workspace persistence across stages, runtime deployment override patterns, and operational troubleshooting for JET and SLURM.
Sequence diagram:

```mermaid
sequenceDiagram
    participant User as User / Agent
    participant PTQ as PTQ Job
    participant WS as Workspace\n(workspaces/<model>)
    participant Deploy as Deployment Job\n(NEL/SLURM/CI)
    participant Serve as Model Server
    participant Eval as Evaluation Job
    participant Storage as Cluster Storage
    User->>PTQ: submit PTQ (produce quantized checkpoint -> WS/output/)
    PTQ->>WS: write checkpoint & artifacts
    User->>Deploy: trigger deployment (consumes WS/output/, may set deployment.command / NEL_DEPLOYMENT_COMMAND)
    Deploy->>Serve: stage model and start serving
    Serve->>Eval: run evaluation tasks (reads from Serve, writes WS/eval_results/)
    Eval->>WS: write eval artifacts and eval_config.yaml
    Eval->>Storage: persist final results
```
Codecov Report: All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main    #1239   +/-   ##
=======================================
  Coverage   76.91%   76.91%
=======================================
  Files         350      350
  Lines       40480    40480
=======================================
  Hits        31137    31137
  Misses       9343     9343
```

Flags with carried forward coverage won't be shown.
- Add common/end-to-end-workflow.md documenting the PTQ → Deploy → Eval pipeline, workspace continuity, unsupported model handling, NEL deployment.command pattern, and NEL CI vs SLURM executor decision table
- Add cross-skill workspace flow to workspace-management.md
- Add "Next steps" to ptq/SKILL.md pointing to deployment/evaluation
- Add pipeline integration note to evaluation/SKILL.md

Depends on PR #1236 (deployment/references/unsupported-models.md).

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Pull request overview
Adds end-to-end documentation that connects the PTQ, deployment, and evaluation “skills”, with cluster-specific guidance for running NeMo Evaluator Launcher (NEL) on SLURM vs JET/NEL CI.
Changes:
- Added a new PTQ → Deploy → Eval workflow doc and cross-skill workspace flow guidance.
- Added a comprehensive NEL CI / cluster guide (JET vs SLURM executor, storage paths, env var formats, common issues).
- Updated existing skill docs to point users to the new workflow and evaluation references.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| .claude/skills/ptq/SKILL.md | Adds a "Next steps" pointer into the end-to-end pipeline doc. |
| .claude/skills/evaluation/SKILL.md | Adds pipeline integration notes and a pointer to the new NEL CI guide. |
| .claude/skills/evaluation/references/nel-ci-guide.md | New detailed guide for NEL CI/JET vs SLURM executor usage, clusters, storage, and troubleshooting. |
| .claude/skills/common/workspace-management.md | Documents how a single workspace is reused across PTQ/deploy/eval stages. |
| .claude/skills/common/slurm-setup.md | Adds enroot/pyxis private registry credential setup for nvcr.io. |
| .claude/skills/common/remote-execution.md | Adds compute-node storage availability guidance to prevent "checkpoint not found" issues. |
| .claude/skills/common/end-to-end-workflow.md | New end-to-end workflow doc tying the three skills together. |
| Stage | What can go wrong | Reference |
|-------|-------------------|-----------|
| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` |
| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` |
When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
  command: >-
    pip install "transformers>=5.0.0.dev0" --pre -q &&
    sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py &&
    ${deployment.base_command}
```
```bash
export GITLAB_TOKEN=<your_gitlab_token>

curl -k --request POST \
```
If the checkpoint is on a workstation, **copy it to cluster storage first**:

```bash
rsync -av /path/to/local/checkpoint \
  <cluster-login>:/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/
```

For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes.
```bash
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
```

`chmod 777` is required because `svc-jet` (the JET service account) runs containers and needs write access.
```bash
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
```
- Binds to `0.0.0.0` by default — health checks work out of the box
- For NVFP4: `--quantization modelopt_fp4`
- For unsupported models (e.g., ministral3): may need custom `deployment.command` to patch the framework before serving (see `deployment/references/unsupported-models.md`)
Actionable comments posted: 1
🧹 Nitpick comments (2)
.claude/skills/common/slurm-setup.md (1)
62-62: Clarify that `$oauthtoken` is a literal string. Users might misinterpret `$oauthtoken` as a shell variable that needs to be set. Consider adding a brief note that `$oauthtoken` is NGC's required literal login string (not a variable).

📝 Suggested clarification
```diff
 # To add NGC credentials:
 mkdir -p ~/.config/enroot
-echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
+# Note: '$oauthtoken' is a literal string required by NGC (not a shell variable)
+echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
 chmod 600 ~/.config/enroot/.credentials
```
178-193: Consider noting the security implications of `chmod 777`. While `chmod 777` is necessary for the JET service account to access these directories (as explained), it makes them world-writable. Consider adding a brief note about the security trade-off, or suggesting users scope these permissions to only the directories actually needed for evaluation runs.

📝 Suggested addition
```diff
 chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
+
+# Note: chmod 777 is required for svc-jet service account access, but makes
+# directories world-writable. Limit to only the directories needed for your runs.
```
## NEL Evaluation with Custom Deployments

When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
  command: >-
    pip install "transformers>=5.0.0.dev0" --pre -q &&
    sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py &&
    ${deployment.base_command}
```

This works with both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`).
🧩 Analysis chain
🌐 Web query:
Does NeMo Evaluator Launcher (NEL) support overriding deployment.command in both SLURM executor and GitLab CI modes?
💡 Result:
Yes, NeMo Evaluator Launcher (NEL) supports overriding deployment.command in SLURM executor mode. No evidence found for GitLab CI mode support; official docs only document local, Slurm, and Lepton AI executors.
Citations:
- 1: https://docs.nvidia.com/nemo/evaluator/latest/deployment/launcher-orchestrated/slurm.html
- 2: https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator-launcher/configuration/executors/slurm.html
- 3: https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator-launcher/index.html
Remove or verify unsupported CI mode claim for deployment.command override.
Line 59 claims the deployment.command override pattern works with "both NEL SLURM executor and NEL CI (via NEL_DEPLOYMENT_COMMAND)". However, official NEL documentation only documents support for SLURM executor, local, and Lepton AI executors—not GitLab CI. Either remove the CI claim or add clear documentation that this is an extension beyond officially supported NEL features.
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…ra escaping

- Add wrapper script pattern for complex deployment commands that break Hydra's override parser (put serve.sh in checkpoint dir, reference as `bash /checkpoint/serve.sh`)
- Add NEL_CONFIG_BASE64 + Python trigger pattern for custom configs
- Add cross-cluster checkpoint copy via tar pipe
- Add Hydra LexerNoViableAltException and Bad Request to common issues

Learned from triggering full AA evaluation (MMLU-PRO, GPQA Diamond, LiveCodeBench, SCICODE, AIME 2025, Terminal-Bench Hard) for Devstral-Small-2-24B NVFP4 on oci-hsg via NEL CI.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
.claude/skills/evaluation/references/nel-ci-guide.md (1)
70-70: Prefer least-privilege wording for `chmod 777` guidance. Given this is shared infra guidance, consider documenting `777` as a fallback-only option and prefer group ACL/owner-group write where possible to reduce accidental overexposure.

Also applies to: 128-128, 270-275
An additional inline comment in .claude/skills/evaluation/references/nel-ci-guide.md (lines 133-135): the fenced code block containing the JSON snippet {"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"} lacks a language tag (markdownlint MD040); mark the fence as json so the snippet is recognized as JSON.
```bash
export GITLAB_TOKEN=<your_gitlab_token>

curl -k --request POST \
```
**Avoid `curl -k` in token-bearing API examples**

Both trigger examples disable TLS verification while transmitting `PRIVATE-TOKEN`, which normalizes an insecure pattern. Please remove `-k` (or document a pinned CA/corporate CA path explicitly) in the primary examples.

Proposed doc fix:

```diff
-curl -k --request POST \
+curl --request POST \
```

```diff
-    ["curl", "-k", "--request", "POST",
+    ["curl", "--request", "POST",
```

Also applies to: 166-166
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
When using NEL_DEPLOYMENT_COMMAND with a custom --served-model-name, deployment.served_model_name must also be overridden via NEL_OTHER_OVERRIDES — NEL uses the config field (not the actual serve command) to set the eval client's model_id. Without this, the client sends the checkpoint path as model_id, causing 404 errors. Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@@ -0,0 +1,276 @@
# NEL CI Evaluation Guide
this is Nvidia internal, suggest:
- Put all Nvidia internal stuff under something like:
.claude/skills/evaluation/references/nvidia-internal/
- Put it in the ModelOpt-Internal repo
What does this PR do?
Type of change: Documentation
Polish evaluation and common skills based on end-to-end experience quantizing + deploying + evaluating Devstral-Small-2-24B-Instruct (NVFP4 MLP-only) on dlcluster B100 GPUs.
Depends on PR #1236 (creates `deployment/references/unsupported-models.md`, referenced here). Merge #1236 first.

Changes
New files:

- `common/end-to-end-workflow.md` — Cross-skill pipeline doc covering the PTQ → Deploy → Eval workflow, workspace continuity, unsupported model handling across stages, the NEL `deployment.command` pattern for runtime patches, and a NEL CI vs SLURM executor decision table
- `evaluation/references/nel-ci-guide.md` — Comprehensive NEL CI evaluation guide covering:
  - cluster storage paths (`/lustre/`)
  - `NEL_DEPLOYMENT_COMMAND`
  - env var formats (`host:`, `lit:`) for SLURM executor vs JET secrets
  - the `--host 0.0.0.0` requirement
  - `chmod 777` for svc-jet
Updated files:

- `common/remote-execution.md` — Added "Checkpoint and storage availability" section (compute nodes may not share login node filesystems; dlcluster uses `/home/omniml_data_*`, not `/lustre/`)
- `common/slurm-setup.md` — Added "Container registry credentials (pyxis)" section for NGC `enroot` auth setup
- `common/workspace-management.md` — Added "Cross-Skill Workspace Flow" showing directory layout across PTQ/deploy/eval stages
- `evaluation/SKILL.md` — Added "NEL CI and Cluster-Specific Notes" pointer section; renamed workspace section to include pipeline integration note
- `ptq/SKILL.md` — Added "Next steps" after Step 5 pointing to deployment/evaluation skills + end-to-end workflow

Motivation
Learnings from:
Testing
Validated end-to-end: PTQ (6 min) → vLLM deployment (3 debug iterations) → NEL evaluation (MMLU 77.4%, GSM8K 80%, GPQA 40% on limit_samples=10).
Before your PR is "Ready for review"

- CONTRIBUTING.md: N/A