feat: JudgeLLM evaluation with ProposalAmender#248
Conversation
|
Warning Review limit reached
More reviews will be available in 49 minutes and 22 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughAdds derive_phase(), a CLIClient/KubeCLI, and a ProposalAmender that aggregates Result CRs and renders workflow summaries; refactors ProposalDriver to use these components; registers a new LLM-judged metric proposal_evaluation_correctness with prompt and JSON parsing; updates config, docs, tests, and integration scripts. ChangesProposal Evaluation and Enrichment Pipeline
Sequence Diagram(s)sequenceDiagram
participant Client as ProposalDriver
participant Amender as ProposalAmender
participant CLI as KubeCLI
participant Results as Result CRs
Client->>CLI: apply(...) / get status / delete(...)
Client->>Amender: amend(turn_data, proposal_status)
Amender->>CLI: get_resource(resource_plural, name)
CLI->>Results: fetch CR
Results-->>CLI: status JSON
CLI-->>Amender: parsed status
Amender->>Amender: aggregate into proposal_results & build_summary
Amender-->>Client: enriched turn_data.response
Client->>Phase: derive_phase(conditions, spec)
Phase-->>Client: terminal phase
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
354b221 to
5c2f4b1
Compare
|
|
||
| import json | ||
| import os | ||
| import subprocess |
ed62f0f to
5492575
Compare
5492575 to
c47a46f
Compare
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/integration/test_evaluation_data_proposal.yaml (1)
95-104: 💤 Low value
expected_responseis likely unused for this metric.
custom:proposal_evaluation_correctnessis a turn-level LLM-as-judge metric that requires onlyresponse(no ground truth), soexpected_responsehere will not be consulted during scoring. It's harmless but can mislead readers into thinking the judge compares against it. Consider dropping it or adding a brief comment clarifying it's documentation-only.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/test_evaluation_data_proposal.yaml` around lines 95 - 104, The test includes an unnecessary expected_response alongside the turn-level judge metric custom:proposal_evaluation_correctness (defined in turn_metrics and turn_metrics_metadata) which doesn't use ground truth; remove the expected_response block from the test or, if you want to keep it for human-readable documentation, add a short inline comment next to expected_response stating it is documentation-only and not used by the custom:proposal_evaluation_correctness metric so readers aren’t misled.tests/integration/test_proposal_evaluation.py (1)
204-236: ⚡ Quick winTest asserts less than its docstring claims.
The docstring states this verifies that
custom:proposal_evaluation_correctnessruns against the response and the pipeline completes, but the body only checks thatturn.responseis populated — identical totest_full_lifecycle's response check. Nothing confirms the judge metric actually produced a result.Since live LLM scores are nondeterministic, asserting a specific score is fragile, but you can confirm the metric path executed by checking the emitted results (e.g., the JSON output written to
tmp_path / "eval_output") contain acustom:proposal_evaluation_correctnessentry for the turn. This makes the test meaningfully distinct from the lifecycle test.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/test_proposal_evaluation.py` around lines 204 - 236, The test_judge_evaluation currently only asserts that ProposalDriver populated turn.response; update it to also verify that the judge metric ran by reading the evaluation output written to the configured storage (the FileBackend output_dir set to tmp_path / "eval_output") after calling evaluate(system_config, eval_data) and assert that a result entry for custom:proposal_evaluation_correctness exists and is associated with the evaluated turn; locate this logic near the test_judge_evaluation function and use the same identifiers (system_config, evaluate, eval_data, tmp_path, and the turn from eval_data[0].turns[0]) to load the JSON results and assert the presence of the custom:proposal_evaluation_correctness metric for that turn.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/cli.py`:
- Around line 61-75: The KubeCLI.run method can raise subprocess.TimeoutExpired
which escapes callers like KubeCLI.get_resource and ProposalAmender.amend;
modify KubeCLI.run to catch subprocess.TimeoutExpired and normalize it by
returning a failing subprocess.CompletedProcess (non-zero returncode, empty
stdout, stderr describing the timeout and including the timeout value/command)
so callers always receive a CompletedProcess rather than an exception; update
any references in get_resource/ProposalAmender.amend to rely on CompletedProcess
return semantics (or alternatively, raise the project-specific EvaluationError
consistently if your codebase prefers exceptions).
In `@src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py`:
- Around line 36-39: The try/except in ProposalAmender that calls self._do_amend
currently only catches KeyError/TypeError/ValueError and therefore misses
subprocess errors from self._cli.get_resource (via KubeCLI.run); update the
except clause in ProposalAmender.execute (the block wrapping self._do_amend) to
also catch subprocess.SubprocessError and subprocess.TimeoutExpired (or broaden
to Exception if preferred), or alternatively normalize CLI exceptions inside
KubeCLI.run/_cli.get_resource so they raise a common custom exception that
ProposalAmender can catch; reference _do_amend, ProposalAmender,
_cli.get_resource, KubeCLI.run and ProposalDriver.execute_turn when making the
change.
- Line 80: Remove the stray stdout dump by replacing the print call that outputs
turn_data.response with structured logging: call logger.debug(...) (using the
module logger or create one via logging.getLogger(__name__) if absent) so the
Markdown summary is logged at debug level instead of printed; update the
location where print(turn_data.response) appears in proposal_amender.py (the
code handling turn_data response/amend flow) to use logger.debug and ensure
imports/logger declaration are present.
---
Nitpick comments:
In `@tests/integration/test_evaluation_data_proposal.yaml`:
- Around line 95-104: The test includes an unnecessary expected_response
alongside the turn-level judge metric custom:proposal_evaluation_correctness
(defined in turn_metrics and turn_metrics_metadata) which doesn't use ground
truth; remove the expected_response block from the test or, if you want to keep
it for human-readable documentation, add a short inline comment next to
expected_response stating it is documentation-only and not used by the
custom:proposal_evaluation_correctness metric so readers aren’t misled.
In `@tests/integration/test_proposal_evaluation.py`:
- Around line 204-236: The test_judge_evaluation currently only asserts that
ProposalDriver populated turn.response; update it to also verify that the judge
metric ran by reading the evaluation output written to the configured storage
(the FileBackend output_dir set to tmp_path / "eval_output") after calling
evaluate(system_config, eval_data) and assert that a result entry for
custom:proposal_evaluation_correctness exists and is associated with the
evaluated turn; locate this logic near the test_judge_evaluation function and
use the same identifiers (system_config, evaluate, eval_data, tmp_path, and the
turn from eval_data[0].turns[0]) to load the JSON results and assert the
presence of the custom:proposal_evaluation_correctness metric for that turn.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e2a40ded-3b7b-4b8b-9ed7-9c9473a5bf72
📒 Files selected for processing (20)
README.mdconfig/system.yamldocs/EVALUATION_GUIDE.mdsrc/lightspeed_evaluation/core/metrics/custom/custom.pysrc/lightspeed_evaluation/core/metrics/custom/prompts.pysrc/lightspeed_evaluation/core/metrics/custom/proposal_eval.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/proposal/__init__.pysrc/lightspeed_evaluation/core/proposal/phase.pysrc/lightspeed_evaluation/core/system/validator.pysrc/lightspeed_evaluation/pipeline/evaluation/cli.pysrc/lightspeed_evaluation/pipeline/evaluation/driver.pysrc/lightspeed_evaluation/pipeline/evaluation/proposal_amender.pytests/integration/system-config-agents-proposal.yamltests/integration/test_evaluation_data_proposal.yamltests/integration/test_proposal_evaluation.pytests/unit/core/metrics/custom/test_custom.pytests/unit/core/metrics/custom/test_proposal_eval.pytests/unit/pipeline/evaluation/test_proposal_amender.pytests/unit/pipeline/evaluation/test_proposal_driver.py
37b4196 to
4a5dc4c
Compare
4a5dc4c to
20926e7
Compare
| logger.warning( | ||
| "Proposal '%s' terminated with outcome: %s", cr_name, outcome | ||
| ) | ||
| return (None, None) |
There was a problem hiding this comment.
Not clear about this. Shouldn't we add some error status.
There was a problem hiding this comment.
I've adjusted this in recent changes. Especially the logic is:
- If the proposal status is FAILED, ESCALATED then the turn execution will fail with an error and no metric will be evaluated.
- If the proposal is user denied, so not completed, this is not considered as an error and a warning is logged. metric evaluation is permed.
If the proposal ends as COMPLETED everything is fine, no error is returned
| {workflow_summary} | ||
|
|
||
| ## Expected Outcome | ||
| {expected_response} |
There was a problem hiding this comment.
So here we are relying on the final expected response only as facts. My concern is that we are putting it more on LLM side to decide whether the path was correct just based on expected_response/outcome.. Probably it is okay, as good models will have more kubernetes related context.
There was a problem hiding this comment.
As per offline discussion, the initial plan is to move from expected_response to
expected_outcome + other optional expectations (expected_analysis_outcome, expected_execution_outcome, expected_verification_outcome. The idea is to leverage on this to further improve judge scoring even if not required.
20926e7 to
0db3bf8
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/integration/agentic/scripts/_setup_infra-openai.sh (1)
34-44: 💤 Low valueOptional: align secret creation with the safer
oc create secret --dry-runpattern.Interpolating
${OPENAI_API_KEY}directly into the heredoc YAML can break if the value ever contains YAML-special characters. The Claude/Vertex infra script already usesoc create secret … --dry-run=client -o yaml | oc apply -f -, which sidesteps escaping. Consider using the same approach here for consistency and robustness.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/agentic/scripts/_setup_infra-openai.sh` around lines 34 - 44, Replace the heredoc that directly interpolates ${OPENAI_API_KEY} with the safer oc create secret --dry-run=client -o yaml pattern: instead of the cat <<EOF | oc apply ... block that creates the Secret named eval-llm-credentials, call oc create secret generic eval-llm-credentials with --from-literal entries for OPENAI_API_KEY, LIGHTSPEED_AGENT_PROVIDER and OPENAI_MODEL (using the AGENT_MODEL and OPENAI_API_KEY vars), add --dry-run=client -o yaml, and pipe that output into oc apply -n "$OPERATOR_NS" -f -; update the script to remove the heredoc and use this create|apply pipeline so values containing YAML-special chars are correctly handled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_evaluation/core/metrics/custom/custom.py`:
- Around line 326-355: The parsing logic in _parse_proposal_eval_response uses
diagnosis/action/verification/average directly from the JSON and
formats/arithmetic-uses them, but _try_parse_float is unused so string numbers
break f"{v:.2f}" and sum(present)/len(present); update
_parse_proposal_eval_response to call _try_parse_float on each of
data.get("diagnosis"), data.get("action"), data.get("verification"), and
data.get("average") (store back into sub_scores and average), so present becomes
a list of floats and formatting with f"{v:.2f}" and average calculations are
safe; ensure _try_parse_float accepts generic input (str/None/number) and
returns Optional[float] as already defined.
In `@src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py`:
- Around line 126-129: The loop that builds option headings currently hardcodes
"(Approved)" for idx == 0; change it to derive approval from explicit signals:
inspect each option dict for common approval keys (e.g., "approved",
"is_approved", "selected") and/or consult the surrounding proposal
status/analysis_results (e.g., a "selected_index" or "approved_option" field)
and set label = "(Approved)" only when such an explicit indicator is true/
matches this idx; if no explicit indicator exists, omit the label. Update the
logic in the for idx, option in enumerate(options) block (and any code that
constructs options/analysis_results) to prefer these explicit fields rather than
relying on idx == 0 as a fallback.
---
Nitpick comments:
In `@tests/integration/agentic/scripts/_setup_infra-openai.sh`:
- Around line 34-44: Replace the heredoc that directly interpolates
${OPENAI_API_KEY} with the safer oc create secret --dry-run=client -o yaml
pattern: instead of the cat <<EOF | oc apply ... block that creates the Secret
named eval-llm-credentials, call oc create secret generic eval-llm-credentials
with --from-literal entries for OPENAI_API_KEY, LIGHTSPEED_AGENT_PROVIDER and
OPENAI_MODEL (using the AGENT_MODEL and OPENAI_API_KEY vars), add
--dry-run=client -o yaml, and pipe that output into oc apply -n "$OPERATOR_NS"
-f -; update the script to remove the heredoc and use this create|apply pipeline
so values containing YAML-special chars are correctly handled.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7824cbcf-518b-42ff-8fdf-9d6e338f0123
📒 Files selected for processing (37)
MakefileREADME.mdconfig/system.yamldocs/EVALUATION_GUIDE.mdsrc/lightspeed_evaluation/core/metrics/custom/custom.pysrc/lightspeed_evaluation/core/metrics/custom/prompts.pysrc/lightspeed_evaluation/core/metrics/custom/proposal_eval.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/proposal/__init__.pysrc/lightspeed_evaluation/core/proposal/phase.pysrc/lightspeed_evaluation/core/system/validator.pysrc/lightspeed_evaluation/pipeline/evaluation/cli.pysrc/lightspeed_evaluation/pipeline/evaluation/driver.pysrc/lightspeed_evaluation/pipeline/evaluation/proposal_amender.pytests/integration/agentic/fixtures/crashloop-probe-demo.yamltests/integration/agentic/scripts/_cleanup_infra-claude-vertex.shtests/integration/agentic/scripts/_cleanup_infra-openai.shtests/integration/agentic/scripts/_setup_infra-claude-vertex.shtests/integration/agentic/scripts/_setup_infra-openai.shtests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.shtests/integration/agentic/scripts/cleanup_crashloop_probe-openai.shtests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.shtests/integration/agentic/scripts/cleanup_oomkill-openai.shtests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.shtests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.shtests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.shtests/integration/agentic/scripts/setup_crashloop_probe-openai.shtests/integration/agentic/scripts/setup_oomkill-claude-vertex.shtests/integration/agentic/scripts/setup_oomkill-openai.shtests/integration/system-config-agents-proposal.yamltests/integration/test_evaluation_data_proposal.yamltests/integration/test_proposal_evaluation.pytests/unit/core/metrics/custom/test_custom.pytests/unit/core/metrics/custom/test_proposal_eval.pytests/unit/pipeline/evaluation/test_driver.pytests/unit/pipeline/evaluation/test_proposal_amender.pytests/unit/pipeline/evaluation/test_proposal_driver.py
💤 Files with no reviewable changes (2)
- tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
- tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
✅ Files skipped from review due to trivial changes (6)
- tests/integration/system-config-agents-proposal.yaml
- tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
- tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
- src/lightspeed_evaluation/core/proposal/init.py
- docs/EVALUATION_GUIDE.md
- README.md
🚧 Files skipped from review as they are similar to previous changes (6)
- src/lightspeed_evaluation/core/system/validator.py
- tests/unit/core/metrics/custom/test_custom.py
- src/lightspeed_evaluation/pipeline/evaluation/driver.py
- src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
- src/lightspeed_evaluation/core/proposal/phase.py
- tests/unit/core/metrics/custom/test_proposal_eval.py
Extract CLI operations (run, get_resource, apply, delete) into an injectable CLIClient interface with KubeCLI implementation backed by oc/kubectl. ProposalDriver now delegates to KubeCLI instead of internal subprocess calls, enabling dependency injection for the upcoming ProposalAmender. ProposalAmender fetches AnalysisResult, ExecutionResult, VerificationResult, and EscalationResult CRs via CLIClient and populates turn_data.proposal_results with structured status data. It also builds a Markdown workflow summary into turn_data.response. - Add proposal_results field to TurnData model - Create ProposalAmender with CLIClient dependency injection - Integrate ProposalAmender into ProposalDriver (always enabled) - Fallback to _extract_summary if amender fails add custom:proposal_evaluation_correctness LLM-as-judge metric New metric that evaluates agentic remediation workflow quality using an LLM judge. Scores 0.0-1.0 based on diagnosis quality, action appropriateness, risk management, and verification thoroughness. - Add PROPOSAL_EVALUATION_CORRECTNESS_PROMPT template - Register metric in CustomMetrics.supported_metrics - Add METRIC_REQUIREMENTS entry (requires response field) - Add metrics_metadata threshold (0.75) in system.yaml Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0db3bf8 to
79d3c09
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/lightspeed_evaluation/pipeline/evaluation/driver.py (1)
143-148:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHonor
enabled=Falsebefore creating cluster dependencies.
validate_config()explicitly skips the CLI presence check when the driver is disabled, but this constructor still eagerly buildsKubeCLIandProposalAmender. That leavesProposalDriverinconsistent withHttpApiDriver: a disabled proposal driver can still reach runtime code paths that try to use an empty CLI path instead of behaving like a no-op.Proposed fix
class ProposalDriver(AgentDriver): """Driver that manages Proposal CR lifecycle via oc/kubectl CLI.""" def __init__(self, config: dict[str, Any], *, enabled: bool = True) -> None: """Initialize the proposal driver.""" super().__init__(config, enabled=enabled) self._cli = self._resolve_cli() - self._kube_cli = KubeCLI( - cli_path=self._cli, - namespace=self._config.namespace, - timeout=self._config.cli_timeout, - ) - self._amender = ProposalAmender(self._kube_cli) + self._kube_cli = None + self._amender = None + if enabled: + self._kube_cli = KubeCLI( + cli_path=self._cli, + namespace=self._config.namespace, + timeout=self._config.cli_timeout, + ) + self._amender = ProposalAmender(self._kube_cli) @@ def execute_turn( self, turn_data: TurnData, conversation_id: Optional[str] = None ) -> tuple[Optional[str], Optional[str]]: """Execute a Proposal CR lifecycle for a single turn.""" + if not self._enabled or self._kube_cli is None or self._amender is None: + return None, conversation_id + # Proposal CR lifecycle:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/pipeline/evaluation/driver.py` around lines 143 - 148, The constructor currently always instantiates KubeCLI and ProposalAmender even when the driver is disabled; change the ProposalDriver constructor to only create self._kube_cli = KubeCLI(...) and self._amender = ProposalAmender(...) when the driver is enabled (e.g., if self._config.enabled: ...), otherwise set those attributes to None (or leave uninitialized) so no cluster dependencies are created for a disabled driver, and update any methods that reference self._kube_cli or self._amender to guard for None or perform lazy initialization.
🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_proposal_driver.py (1)
624-625: ⚡ Quick winAssert denied turns still produce a response.
Deniedis now a non-error path, so downstream evaluation will still consume the amended summary. This test only checksproposal_status, which would miss regressions where denied turns stop populatingturn.response.Proposed test addition
error, _ = driver.execute_turn(turn) assert error is None + assert turn.response assert turn.proposal_status == status🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/pipeline/evaluation/test_proposal_driver.py` around lines 624 - 625, The test currently only asserts error is None and turn.proposal_status == status, but when status is Denied we must also ensure the turn still produces a response; update the test in test_proposal_driver.py to assert that turn.response is populated for denied turns (e.g., assert turn.response is not None and/or assert turn.response.text or length > 0) so regressions that drop the amended summary are caught; reference the existing locals error, turn, and proposal_status/status when adding the extra assertion(s).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/driver.py`:
- Around line 247-250: The current warning treats any non-COMPLETED
TerminalOutcome (variable outcome) as a warning; change the condition around
logger.warning in driver.py so that TerminalOutcome.DENIED is considered a
normal terminal state and does not emit a warning. Concretely, update the check
that uses TerminalOutcome.COMPLETED (referenced alongside cr_name and
logger.warning) to only warn when outcome is neither TerminalOutcome.COMPLETED
nor TerminalOutcome.DENIED (or use an explicit if outcome in
[TerminalOutcome.COMPLETED, TerminalOutcome.DENIED]: skip warning; else: log
warning).
In `@tests/integration/agentic/scripts/setup_crashloop_probe-openai.sh`:
- Around line 14-16: The problem is that the unconditional "|| true" after the
oc wait call (oc wait --for=condition=Available=false
deployment/crashloop-probe-demo -n "$TEST_NS" --timeout=60s 2>/dev/null || true)
masks failures and allows the script to print "Setup complete." even when oc
wait fails; remove the "|| true", check the exit status of oc wait, and on
failure print a clear error including "$TEST_NS" and the oc output (avoid
redirecting stderr to /dev/null) and exit with a non-zero status so downstream
tests fail fast; only proceed to the sleep 10 and echo "Setup complete." when oc
wait succeeds.
---
Duplicate comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/driver.py`:
- Around line 143-148: The constructor currently always instantiates KubeCLI and
ProposalAmender even when the driver is disabled; change the ProposalDriver
constructor to only create self._kube_cli = KubeCLI(...) and self._amender =
ProposalAmender(...) when the driver is enabled (e.g., if self._config.enabled:
...), otherwise set those attributes to None (or leave uninitialized) so no
cluster dependencies are created for a disabled driver, and update any methods
that reference self._kube_cli or self._amender to guard for None or perform lazy
initialization.
---
Nitpick comments:
In `@tests/unit/pipeline/evaluation/test_proposal_driver.py`:
- Around line 624-625: The test currently only asserts error is None and
turn.proposal_status == status, but when status is Denied we must also ensure
the turn still produces a response; update the test in test_proposal_driver.py
to assert that turn.response is populated for denied turns (e.g., assert
turn.response is not None and/or assert turn.response.text or length > 0) so
regressions that drop the amended summary are caught; reference the existing
locals error, turn, and proposal_status/status when adding the extra
assertion(s).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 384b0716-bdea-4ee2-9c2c-4f4efc9b7bcc
📒 Files selected for processing (37)
MakefileREADME.mdconfig/system.yamldocs/EVALUATION_GUIDE.mdsrc/lightspeed_evaluation/core/metrics/custom/custom.pysrc/lightspeed_evaluation/core/metrics/custom/prompts.pysrc/lightspeed_evaluation/core/metrics/custom/proposal_eval.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/proposal/__init__.pysrc/lightspeed_evaluation/core/proposal/phase.pysrc/lightspeed_evaluation/core/system/validator.pysrc/lightspeed_evaluation/pipeline/evaluation/cli.pysrc/lightspeed_evaluation/pipeline/evaluation/driver.pysrc/lightspeed_evaluation/pipeline/evaluation/proposal_amender.pytests/integration/agentic/fixtures/crashloop-probe-demo.yamltests/integration/agentic/scripts/_cleanup_infra-claude-vertex.shtests/integration/agentic/scripts/_cleanup_infra-openai.shtests/integration/agentic/scripts/_setup_infra-claude-vertex.shtests/integration/agentic/scripts/_setup_infra-openai.shtests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.shtests/integration/agentic/scripts/cleanup_crashloop_probe-openai.shtests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.shtests/integration/agentic/scripts/cleanup_oomkill-openai.shtests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.shtests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.shtests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.shtests/integration/agentic/scripts/setup_crashloop_probe-openai.shtests/integration/agentic/scripts/setup_oomkill-claude-vertex.shtests/integration/agentic/scripts/setup_oomkill-openai.shtests/integration/system-config-agents-proposal.yamltests/integration/test_evaluation_data_proposal.yamltests/integration/test_proposal_evaluation.pytests/unit/core/metrics/custom/test_custom.pytests/unit/core/metrics/custom/test_proposal_eval.pytests/unit/pipeline/evaluation/test_driver.pytests/unit/pipeline/evaluation/test_proposal_amender.pytests/unit/pipeline/evaluation/test_proposal_driver.py
💤 Files with no reviewable changes (2)
- tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
- tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
✅ Files skipped from review due to trivial changes (5)
- README.md
- tests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.sh
- tests/integration/agentic/scripts/setup_oomkill-claude-vertex.sh
- docs/EVALUATION_GUIDE.md
- src/lightspeed_evaluation/core/system/validator.py
🚧 Files skipped from review as they are similar to previous changes (26)
- tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
- tests/integration/system-config-agents-proposal.yaml
- tests/integration/agentic/scripts/cleanup_oomkill-openai.sh
- tests/integration/agentic/scripts/setup_oomkill-openai.sh
- tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
- tests/integration/agentic/scripts/_cleanup_infra-openai.sh
- tests/integration/agentic/scripts/cleanup_crashloop_probe-openai.sh
- src/lightspeed_evaluation/core/models/data.py
- tests/unit/pipeline/evaluation/test_driver.py
- tests/integration/agentic/scripts/_setup_infra-openai.sh
- tests/integration/test_proposal_evaluation.py
- tests/integration/agentic/fixtures/crashloop-probe-demo.yaml
- src/lightspeed_evaluation/core/proposal/init.py
- src/lightspeed_evaluation/pipeline/evaluation/cli.py
- src/lightspeed_evaluation/core/metrics/custom/prompts.py
- src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
- tests/integration/test_evaluation_data_proposal.yaml
- tests/integration/agentic/scripts/_setup_infra-claude-vertex.sh
- src/lightspeed_evaluation/core/proposal/phase.py
- tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh
- tests/unit/core/metrics/custom/test_custom.py
- config/system.yaml
- src/lightspeed_evaluation/core/metrics/custom/custom.py
- tests/unit/pipeline/evaluation/test_proposal_amender.py
- src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
- tests/unit/core/metrics/custom/test_proposal_eval.py
…-openai.sh Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Summary
This PR adds LLM-as-judge evaluation and enriched data capture for the ProposalDriver evaluation pipeline, along with per-scenario test infrastructure and a new CrashLoopBackOff test fixture.
Core changes
ProposalAmender — after a Proposal CR reaches terminal state, fetches child Result CRs (AnalysisResult, ExecutionResult, VerificationResult, EscalationResult) from the cluster and enriches
TurnDatawith:proposal_results: structured dict with the complete.statusof each child CRresponse: a Markdown workflow summary suitable for both human review and LLM-as-judge evaluationcustom:proposal_evaluation_correctness— a new LLM-as-judge metric (score 0–1) with a multi-dimensional evaluation prompt:_parse_proposal_eval_response): extracts sub-scores and average from the multi-dimensional output formatPer-scenario test infrastructure — setup/cleanup scripts refactored from monolithic per-provider scripts to:
_setup_infra-openai.sh,_setup_infra-claude-vertex.sh) sourced by scenario scriptssetup_oomkill-openai.sh,setup_crashloop_probe-openai.sh, etc.)crashloop-probe-demofixture (nginx with misconfigured liveness probe at/nonexistent-health)Shellcheck compliance —
exportfor variables consumed by sourced scripts (SC2034), exclude SC1091 for dynamicsourcepaths in MakefileWhy
The existing
ProposalDriveronly extracts conditionmessagefields via_extract_summary, losing the rich structured data from child Result CRs. In particular, the Diagnosis from AnalysisResult (root cause analysis, confidence level, detailed summary) is never captured, making it impossible to evaluate whether the agentic workflow diagnosed and remediated the issue correctly.custom:proposal_statusprovides deterministic pass/fail on workflow phase, but cannot assess the quality of diagnosis, actions, or verification. The newproposal_evaluation_correctnessmetric fills this gap using an LLM judge with a structured, multi-dimensional prompt.Design choices
CLIClient abstraction
CLI operations (
run,get_resource,apply,delete) are extracted into aCLIClientABC with aKubeCLIimplementation. BothProposalDriverandProposalAmenderuse the same interface; tests inject a mockCLIClientwithout patching subprocess internals.ProposalAmender as a separate class
Follows the
APIDataAmenderpattern — a dedicated class composed into the driver, responsible for enrichingTurnDatain-place. Navigatesproposal_status.steps.<step>.results[]to readStepResultRefentries, then fetches each child CR viaCLIClient.get_resource().Multi-dimensional judge prompt
The prompt produces per-dimension scores instead of a single holistic score:
Per-scenario scripts
Each scenario (oomkill, crashloop-probe) × provider (openai, claude-vertex) has its own setup/cleanup script that sources a shared
_setup_infra-{provider}.sh/_cleanup_infra-{provider}.sh. This avoids deploying all fixtures for every conversation and makes adding new scenarios mechanical.Relationship between the two proposal metrics
custom:proposal_statusexpected_proposal_statuscustom:proposal_evaluation_correctnessresponse(from amender) +expected_responseThey are complementary:
proposal_statuschecks what happened,proposal_evaluation_correctnesschecks how well it was done.Test plan
test_proposal_driver.pytests pass unchanged (regression)ProposalAmenderunit tests: analysis-only, analysis+execution, full pipeline, failed step, empty results, Markdown summary formattingproposal_evaluation_correctnessmetric unit tests: mock LLM, multi-dimensional score parsing, missing response handling, conversation-level skip, SRE persona verification_parse_proposal_eval_responseparser unit tests: all dimensions, N/A dimensions, fallback average computation, unparseable inputtest_oomkill_full_lifecycle,test_analysis_only,test_oomkill_claude_vertexcrashloop-probe-demofixture and integration eval data for both providersmake pre-commit && make testgreen🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Enhancements