Skip to content

fix: Add retry configuration and warning for GEval score=None bug#216

Merged
asamal4 merged 2 commits into
lightspeed-core:mainfrom
x86girl:prgutier/fix-silent-score-mismatch
Jun 4, 2026
Merged

fix: Add retry configuration and warning for GEval score=None bug#216
asamal4 merged 2 commits into
lightspeed-core:mainfrom
x86girl:prgutier/fix-silent-score-mismatch

Conversation

@x86girl
Copy link
Copy Markdown

@x86girl x86girl commented Apr 16, 2026

BUG FIX: Prevent silent score mismatch when LLM judge fails

Problem Statement

GEval metrics silently convert None scores to 0.0 without warning when LLM judge evaluation fails. This masks critical infrastructure issues:

  • Rate limiting (429 errors) after retries exhausted
  • LLM provider timeouts
  • Invalid JSON responses from judge
  • API quota/credits exhausted
  • Network connection failures

Without this fix, failed evaluations appear as valid low scores (0.0), making debugging nearly impossible and corrupting evaluation data.

What Changed

1. Added configurable retry logic (geval.py)

  • New num_retries parameter from user config (default: 6)
  • Custom retry decorator respecting user settings
  • DeepEval hardcodes MAX_RETRIES=6, now we add our own layer

2. Added warning when score=None (geval.py:385-397, 515-527)

Before:

score = metric.score if metric.score is not None else 0.0  # Silent!
return score, reason

After:

score = metric.score
if score is None:
    logger.warning(
        "GEval metric returned None score; defaulting to 0.0. "
        "This typically indicates LLM judge failure (rate limiting, "
        "timeout, invalid JSON, or quota exhausted). Reason: %s",
        reason
    )
    score = 0.0
return score, reason

3. Updated system.yaml documentation

Added comment explaining num_retries applies to GEval and defaults to 6.

4. Fixed test mocks

Updated test fixtures to include llm_params with num_retries to match new initialization logic.

Files Changed

  • src/lightspeed_evaluation/core/metrics/geval.py

    • Add tenacity imports for retry logic
    • Add num_retries initialization from config
    • Add _is_retryable_exception() method
    • Add _measure_with_retry() method
    • Add warning logs for None scores (turn & conversation level)
    • Replace direct metric.measure() with retry wrapper
  • config/system.yaml

    • Document num_retries behavior for GEval
  • tests/unit/core/metrics/test_geval.py

    • Add llm_params mock to fixture

Testing

  • All 833 tests pass
  • All pre-commit checks pass (black, pylint, ruff, mypy, bandit, pyright)
  • Manual verification of retry logic and warning logs

Impact

Users can now:

  • Configure retry attempts via system.yaml (was hardcoded to 6)
  • Immediately see warnings when evaluations fail
  • Distinguish between real low scores vs failed evaluations
  • Debug rate limiting and infrastructure issues

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Bug Fixes

    • Preserve None scores from metric evaluations instead of coercing them to 0.0; emit warnings and surface the None score with its reason.
  • Improvements

    • Make LLM retry behavior configurable at runtime and log the applied retry count; patch underlying retry behavior to respect configured values.
    • Return (score, reason) tuples where score may be None, with clearer reason messages when absent.
  • Tests

    • Added and updated unit tests covering no-score handling, retry patching, routing, and related logging.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 69c1fb7a-084f-4932-a01c-a59b227fbb1b

📥 Commits

Reviewing files that changed from the base of the PR and between 5e8736e and e28e2b8.

📒 Files selected for processing (7)
  • src/lightspeed_evaluation/core/llm/deepeval.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
  • tests/unit/core/llm/test_deepeval_manager.py
  • tests/unit/core/metrics/conftest.py
  • tests/unit/core/metrics/test_deepeval.py
  • tests/unit/core/metrics/test_geval.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • tests/unit/core/metrics/test_geval.py
  • src/lightspeed_evaluation/core/llm/deepeval.py
  • tests/unit/core/llm/test_deepeval_manager.py
  • tests/unit/core/metrics/conftest.py
  • tests/unit/core/metrics/test_deepeval.py

Walkthrough

Adds DeepEval retry patching (from llm_params) and changes DeepEval/GEval metric evaluation to detect and warn when metric.score is None (no coercion to numeric), with unit tests and fixtures covering retries, warnings, routing, and cache initialization.

Changes

DeepEval robustness and retry configuration

Layer / File(s) Summary
DeepEval LLM Manager retry patching
src/lightspeed_evaluation/core/llm/deepeval.py
DeepEvalLLMManager reads num_retries from llm_params (defaulting to DEFAULT_LLM_RETRIES), constructs LiteLLMModel without forwarding num_retries, and monkey-patches decorated generate methods' .retry.stop via _patch_deepeval_retries to apply stop_after_attempt(max_retries).
DeepEval metric None-score handling
src/lightspeed_evaluation/core/metrics/deepeval.py
_evaluate_metric prefers metric.reason, formats fallback reasons when missing, logs a warning when metric.score is None, flushes pending DeepEval tasks, and returns (score, reason) without coercing None to a numeric default; docstring updated to note score may be None.
GEval None-score handling
src/lightspeed_evaluation/core/metrics/geval.py, tests/unit/core/metrics/test_geval.py
GEvalHandler turn- and conversation-level evaluation read metric.score directly, derive reason from metric.reason, log a warning if score is None, and return (None, reason) without coercion; tests now assert warning emission rather than score coercion.
Test infrastructure and coverage
tests/unit/core/llm/test_deepeval_manager.py, tests/unit/core/metrics/conftest.py, tests/unit/core/metrics/test_deepeval.py
New fixtures mock DeepEval dependencies and tests validate retry patching, INFO logging of configured retries, DeepEvalMetrics _evaluate_metric None/valid-score behaviors, cache-enabled init, conversation-level routing, exception handling, and routing of geval:... metrics.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant DeepEvalLLMManager
  participant stop_after_attempt
  participant LiteLLMModel
  Client->>DeepEvalLLMManager: initialize with llm_params
  DeepEvalLLMManager->>stop_after_attempt: compute stop condition(max_retries)
  DeepEvalLLMManager->>LiteLLMModel: patch generate methods' .retry.stop
  Client->>LiteLLMModel: call generate -> uses patched retry.stop
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • asamal4
  • VladimirKadlec
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: adding retry configuration to respect user settings and adding warning logs when GEval returns score=None, preventing silent conversion to 0.0.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (5)
config/system.yaml (1)

30-30: Clarify the comment to avoid confusion about the default value.

The comment says "default: 6 if not specified" which is technically correct (the code defaults to 6 when the key is absent), but could confuse users since the configured value here is 3. Users might wonder if 3 or 6 is actually used.

Consider rewording to clarify the distinction:

📝 Suggested clarification
-    num_retries: 3  # Retry attempts for LLM judge (applies to GEval, custom metrics, etc.; default: 6 if not specified)
+    num_retries: 3  # Retry attempts for LLM judge (applies to GEval, custom metrics, etc.). Code defaults to 6 when omitted.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/system.yaml` at line 30, The inline comment for the num_retries
setting is ambiguous about the configured value versus the fallback; update the
comment for num_retries to clearly state that the current configured value is 3
and that 6 is the application default used only when the key is omitted (e.g.,
"num_retries: 3  # Number of retry attempts; currently set to 3. If this key is
absent, the system defaults to 6."). Ensure the change targets the num_retries
entry so readers understand the distinction between the explicit setting and the
implicit default.
src/lightspeed_evaluation/core/metrics/geval.py (2)

217-235: Docstring doesn't match implementation - all exceptions are retried.

The docstring lists specific "retryable conditions" (rate limiting, timeouts, network failures), but the implementation unconditionally returns True for any Exception. While the comment on line 232-234 explains this matches DeepEval's behavior, the docstring creates misleading expectations.

Consider updating the docstring to accurately reflect the behavior:

📝 Suggested fix
     def _is_retryable_exception(self, exception: BaseException) -> bool:
         """Check if exception should trigger a retry.
 
-        Retryable conditions:
-        - Rate limiting (429 errors from LLM provider)
-        - Timeout errors
-        - Temporary network failures
-        - LLM provider temporary errors
+        This method returns True for all Exception subclasses to match
+        DeepEval's hardcoded behavior (retryable_exceptions = (Exception,)).
+        
+        Common retryable scenarios include rate limiting, timeouts, and
+        network failures, but we retry broadly to handle unknown transient
+        errors from various LLM providers.
 
         Args:
             exception: The exception to check
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/core/metrics/geval.py` around lines 217 - 235,
Update the _is_retryable_exception method docstring to reflect actual behavior:
state explicitly that it returns True for any Exception (i.e., it retries on all
Exception subclasses), remove or change the specific "Retryable conditions" list
to note that DeepEval/LiteLLM handles specific error types internally and that
this function mirrors their hardcoded behavior, and adjust the Returns section
to say it always returns True for Exception objects; reference
_is_retryable_exception and DeepEval/LiteLLM in the text for clarity.

67-70: Consider adding explicit type annotation for num_retries.

The num_retries attribute is inferred from .get() which returns Any. For clarity and type safety, consider adding an explicit annotation:

📝 Suggested fix
         # Get num_retries from LLM configuration (default: 6 to match DeepEval's hardcoded value)
         # Note: DeepEval's internal retry logic uses hardcoded MAX_RETRIES=6,
         # but we add our own retry layer to respect user configuration
-        self.num_retries = self.deepeval_llm_manager.llm_params.get("num_retries", 6)
+        self.num_retries: int = self.deepeval_llm_manager.llm_params.get("num_retries", 6)

As per coding guidelines: "Include type hints for all public functions and methods in Python" - while this is an instance attribute, explicit typing improves maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lightspeed_evaluation/core/metrics/geval.py` around lines 67 - 70, The
attribute self.num_retries is inferred as Any from
deepeval_llm_manager.llm_params.get; add an explicit int type and ensure the
value is cast/validated. Change the assignment to annotate and coerce the value,
e.g. set self.num_retries: int =
int(self.deepeval_llm_manager.llm_params.get("num_retries", 6)) (or perform a
safe int conversion/validation before assigning) so num_retries is explicitly
typed as int in the class/constructor where it is defined.
bug_test_geval_score_mismatch.yaml (1)

1-11: Update comments to reflect the fix is now in place.

The header comments describe the bug as "ACTUAL BEHAVIOR: Silent conversion from None -> 0.0" and "NO WARNING will be shown". However, with this PR merged, warnings WILL be shown. Consider updating the comments to indicate this file now serves as a regression test rather than a bug reproduction:

📝 Suggested update
-# BUG REPRODUCTION: GEval Score Mismatch Without Warning
-# This configuration demonstrates the bug where GEval metrics silently default
-# None scores to 0.0 without any warning or logging.
+# REGRESSION TEST: GEval Score None Handling
+# This configuration tests that GEval metrics properly warn when score is None
+# before defaulting to 0.0.
 #
 # HOW TO REPRODUCE:
-# 1. Run: lightspeed-evaluation --system-config config/system.yaml --eval-config bug_test_geval_score_mismatch.yaml
-# 2. Observe the output - if GEval returns None score, it will silently become 0.0
-# 3. Check logs - NO WARNING will be shown about the score mismatch
+# 1. Run: lightspeed-evaluation --system-config config/system.yaml --eval-config bug_test_geval_score_mismatch.yaml
+# 2. If GEval returns None score, it will become 0.0
+# 3. Check logs - A WARNING should be shown about the None score
 #
-# EXPECTED BEHAVIOR: A warning should be logged when score is None
-# ACTUAL BEHAVIOR: Silent conversion from None -> 0.0
+# EXPECTED BEHAVIOR: Warning logged when score is None, then defaults to 0.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bug_test_geval_score_mismatch.yaml` around lines 1 - 11, Update the
top-of-file comments to reflect that the fix is merged and this YAML now acts as
a regression test: change "ACTUAL BEHAVIOR: Silent conversion from None -> 0.0"
and "NO WARNING will be shown" to state that warnings are now logged when GEval
returns None scores, rename the header to indicate "REGRESSION TEST" instead of
"BUG REPRODUCTION", and adjust the HOW TO REPRODUCE/EXPECTED BEHAVIOR sections
to instruct running lightspeed-evaluation and verifying that a warning is
emitted when GEval returns None (so the file documents the fixed behavior and
test purpose).
tests/unit/core/metrics/test_geval.py (1)

360-398: Consider adding tests for retry behavior and warning logs.

The existing test_evaluate_turn_none_score_returns_zero verifies the None→0.0 conversion, but there's no explicit verification that:

  1. The retry logic actually retries on transient failures
  2. A warning is logged when score is None

These would strengthen confidence in the new retry/warning functionality.

Would you like me to help generate additional test cases for retry behavior and warning log verification?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/core/metrics/test_geval.py` around lines 360 - 398, Add two
assertions/tests alongside test_evaluate_turn_none_score_returns_zero to (1)
simulate transient failures and verify retry behavior by having the patched
GEval (mocker.patch("lightspeed_evaluation.core.metrics.geval.GEval")) raise a
transient exception on the first call and return a valid metric on a subsequent
call, then call GEvalHandler.evaluate and assert the final score/reason come
from the successful retry; and (2) verify a warning is logged when the metric
score is None by patching or capturing the handler's logger (or using caplog)
around GEvalHandler.evaluate and asserting logger.warning was called with the
expected message when mock_metric.score is None. Ensure you reference GEval,
GEvalHandler.evaluate, mock_metric_manager.get_metric_metadata, and the existing
test fixture names (turn_data, conv_data) when locating where to add these
assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@bug_test_geval_score_mismatch.yaml`:
- Around line 1-11: Update the top-of-file comments to reflect that the fix is
merged and this YAML now acts as a regression test: change "ACTUAL BEHAVIOR:
Silent conversion from None -> 0.0" and "NO WARNING will be shown" to state that
warnings are now logged when GEval returns None scores, rename the header to
indicate "REGRESSION TEST" instead of "BUG REPRODUCTION", and adjust the HOW TO
REPRODUCE/EXPECTED BEHAVIOR sections to instruct running lightspeed-evaluation
and verifying that a warning is emitted when GEval returns None (so the file
documents the fixed behavior and test purpose).

In `@config/system.yaml`:
- Line 30: The inline comment for the num_retries setting is ambiguous about the
configured value versus the fallback; update the comment for num_retries to
clearly state that the current configured value is 3 and that 6 is the
application default used only when the key is omitted (e.g., "num_retries: 3  #
Number of retry attempts; currently set to 3. If this key is absent, the system
defaults to 6."). Ensure the change targets the num_retries entry so readers
understand the distinction between the explicit setting and the implicit
default.

In `@src/lightspeed_evaluation/core/metrics/geval.py`:
- Around line 217-235: Update the _is_retryable_exception method docstring to
reflect actual behavior: state explicitly that it returns True for any Exception
(i.e., it retries on all Exception subclasses), remove or change the specific
"Retryable conditions" list to note that DeepEval/LiteLLM handles specific error
types internally and that this function mirrors their hardcoded behavior, and
adjust the Returns section to say it always returns True for Exception objects;
reference _is_retryable_exception and DeepEval/LiteLLM in the text for clarity.
- Around line 67-70: The attribute self.num_retries is inferred as Any from
deepeval_llm_manager.llm_params.get; add an explicit int type and ensure the
value is cast/validated. Change the assignment to annotate and coerce the value,
e.g. set self.num_retries: int =
int(self.deepeval_llm_manager.llm_params.get("num_retries", 6)) (or perform a
safe int conversion/validation before assigning) so num_retries is explicitly
typed as int in the class/constructor where it is defined.

In `@tests/unit/core/metrics/test_geval.py`:
- Around line 360-398: Add two assertions/tests alongside
test_evaluate_turn_none_score_returns_zero to (1) simulate transient failures
and verify retry behavior by having the patched GEval
(mocker.patch("lightspeed_evaluation.core.metrics.geval.GEval")) raise a
transient exception on the first call and return a valid metric on a subsequent
call, then call GEvalHandler.evaluate and assert the final score/reason come
from the successful retry; and (2) verify a warning is logged when the metric
score is None by patching or capturing the handler's logger (or using caplog)
around GEvalHandler.evaluate and asserting logger.warning was called with the
expected message when mock_metric.score is None. Ensure you reference GEval,
GEvalHandler.evaluate, mock_metric_manager.get_metric_metadata, and the existing
test fixture names (turn_data, conv_data) when locating where to add these
assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4dfcca80-a393-4ccd-bd2d-6dd684b93fdd

📥 Commits

Reviewing files that changed from the base of the PR and between 5170019 and 8de2181.

📒 Files selected for processing (4)
  • bug_test_geval_score_mismatch.yaml
  • config/system.yaml
  • src/lightspeed_evaluation/core/metrics/geval.py
  • tests/unit/core/metrics/test_geval.py

@x86girl
Copy link
Copy Markdown
Author

x86girl commented Apr 16, 2026

The reasons of the above changes are if the llm fails to grab the content, before, it just changed the score to 0, without throwing any warning:
score = metrics.score if is not None else 0.0
(inside ./lightspeed-evaluation/deepeval/models/llms/litellm+model.py)
This problem is mainly caused by expired quota, exceeded number of max retries and about the last one, it was ignoring when the user changes system.yaml and the max_retries number and it had
MAX_RETRIES = 6
hardcoded, besides being caught by this class:
def init(
self
...
**kawargs,
...
)
I changed it to allow the user to set it up ( and to keep it to 6 if doesn't especified)
Now we have a warning in the logs to have a cue if one of the listed problems are happening allowing us to do not get buggy results
these problems can cause rag chuncks to be missed and without no reason for the user, may leading to a wrong conclusion about the real quality of the chuncks.

Copy link
Copy Markdown
Member

@VladimirKadlec VladimirKadlec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 👍
Please remove the bug_test_geval_score_mismatch.yaml file from the PR. Some test would be nice if you are able to mock the error.

@asamal4
Copy link
Copy Markdown
Collaborator

asamal4 commented Apr 20, 2026

@x86girl Thank you !!
As the issue is for DeepEval llm configuration, I would suggest to do (modular) changes so that it will work for both standard DeepEval metrics and GEval, currently issue is resolved only for GEval.

@x86girl
Copy link
Copy Markdown
Author

x86girl commented May 14, 2026

Is this PR still worth to fix and merge? I was working on another stuff in the meantime.

@xmican10
Copy link
Copy Markdown
Collaborator

Hi @x86girl, this PR is still valuable and definitely worth fixing and merging. If you'd prefer not to continue working on it, I'm happy to take it over, just let me know what works best for you.

@xmican10
Copy link
Copy Markdown
Collaborator

Hi @x86girl, just wanted to give you a heads up that I'm going to start working on finishing up this PR so we can get it merged soon.

Thanks again for getting this started!

@xmican10 xmican10 force-pushed the prgutier/fix-silent-score-mismatch branch from 8de2181 to 11ed093 Compare May 28, 2026 15:04
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/unit/core/metrics/test_deepeval.py (1)

28-28: ⚡ Quick win

Avoid mutating mock_metric.__class__.__name__; use a tiny stub metric class instead.

DeepEvalMetrics._evaluate_metric includes metric.__class__.__name__ in the “returned None score” warning, so this test relies on that. Since mock_metric is created inside the test, this won’t affect other tests, but changing a MagicMock’s class name is brittle—define a small stub object with measure() and score/reason so the class name is stable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/metrics/test_deepeval.py` at line 28, The test mutates
mock_metric.__class__.__name__ which is brittle; instead define a tiny stub
class (e.g., class TestMetric:) that implements the required interface
(measure() and attributes/props score and reason) and instantiate that as
mock_metric in the test so DeepEvalMetrics._evaluate_metric sees a stable class
name; replace the MagicMock and the assignment to __class__.__name__ with the
stub instance used where mock_metric is referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/llm/deepeval.py`:
- Around line 82-83: Remove the inline pylint disable and guard the assignment
to method.retry.stop by checking/ensuring the retry attribute exists on method
(and that it has a stop field) before setting it; replace the direct assignment
"method.retry.stop = stop_after_attempt(...)" with a safe pattern that either
creates/initializes method.retry (or uses setattr) when missing or calls a small
helper (e.g., ensure_retry_and_set_stop(method, stop_after_attempt(...))) so no
pylint suppression is necessary and member access is explicit and safe.
- Around line 55-58: DeepEvalLLMManager._patch_deepeval_retries currently
mutates deepeval.models.LiteLLMModel.<method>.retry.stop for each instance
causing global races and uses an inline pylint disable; change this to perform a
one-time, class-scoped patch guarded by a threading.Lock (or similar) and cached
max_retries value so subsequent initializations either reuse the same patch or
log/error when a different num_retries is requested, and remove the "# pylint:
disable=no-member" by using a typed-safe access (e.g., getattr with typing.cast
or an explicit protocol/type on the target methods) to set method.retry.stop in
a typesafe way; keep the API using num_retries (and DEFAULT_LLM_RETRIES) but
ensure _patch_deepeval_retries only mutates shared state once and is
concurrency-safe.

In `@tests/unit/core/metrics/test_deepeval.py`:
- Line 1: Remove the file-level "# pylint: disable=protected-access,
unused-variable" from tests/unit/core/metrics/test_deepeval.py and fix the
underlying issues: replace any direct uses of protected members (e.g.,
ClassName._private_attr or instance._private_method) with the public API or a
tested accessor, or if truly necessary limit the suppression to a single-line "#
pylint: disable=protected-access" with a short justification next to that
specific access; for unused-variable warnings, either remove the unused
assignment, rename the variable to start with an underscore (e.g., _unused) or
use the value in an assertion/pytest.mark.usefixtures so it is meaningfully
consumed. Ensure changes target the specific test functions and fixtures in
test_deepeval.py rather than reintroducing a file-level disable.

---

Nitpick comments:
In `@tests/unit/core/metrics/test_deepeval.py`:
- Line 28: The test mutates mock_metric.__class__.__name__ which is brittle;
instead define a tiny stub class (e.g., class TestMetric:) that implements the
required interface (measure() and attributes/props score and reason) and
instantiate that as mock_metric in the test so DeepEvalMetrics._evaluate_metric
sees a stable class name; replace the MagicMock and the assignment to
__class__.__name__ with the stub instance used where mock_metric is referenced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9ba83dfa-b162-4400-b4fd-ff34d68a66ca

📥 Commits

Reviewing files that changed from the base of the PR and between 8de2181 and 11ed093.

📒 Files selected for processing (7)
  • src/lightspeed_evaluation/core/llm/deepeval.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
  • tests/unit/core/llm/test_deepeval_manager.py
  • tests/unit/core/metrics/conftest.py
  • tests/unit/core/metrics/test_deepeval.py
  • tests/unit/core/metrics/test_geval.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/core/metrics/test_geval.py

Comment thread src/lightspeed_evaluation/core/llm/deepeval.py
Comment thread src/lightspeed_evaluation/core/llm/deepeval.py
Comment thread tests/unit/core/metrics/test_deepeval.py Outdated
@xmican10
Copy link
Copy Markdown
Collaborator

Instead of wrapping DeepEval/GEval evaluation methods in additional Tenacity retries (which created a multiplication effect, 3 outer × 3 inner = 9 total retries), I now monkey-patch DeepEval's MAX_RETRIES directly in LiteLLMModel to sync with system.yaml configuration. This eliminates retry multiplication and unexpected behavior.

Extended score=None detection from GEval to DeepEval metrics, adding warning logs that identify root causes (rate limiting, timeouts, malformed JSON, quota exhaustion) instead of silently converting to 0.0.

Added tests for retry patching and score=None handling and increasing overall DeepEval test coverage.

@xmican10 xmican10 force-pushed the prgutier/fix-silent-score-mismatch branch from 11ed093 to b37303f Compare May 28, 2026 15:39
Copy link
Copy Markdown
Member

@VladimirKadlec VladimirKadlec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove bug_test_geval_score_mismatch.yaml, the tests cover the case, am I right?
Otherwise LGTM.

Comment on lines +81 to +82
method = getattr(LiteLLMModel, method_name)
method.retry.stop = stop_after_attempt( # pylint: disable=no-member
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like this but I can't figure out a better solution :-(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug_test_geval_score_mismatch.yaml was alredy removed, the tests do cover the case :-)

asamal4
asamal4 previously approved these changes Jun 1, 2026
Copy link
Copy Markdown
Collaborator

@asamal4 asamal4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you !!..
LGTM.. however I am concerned about yet another litellm patch. This may create problem in future.

Copy link
Copy Markdown
Collaborator

@asamal4 asamal4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xmican10 I just realized that making the score 0 from None will probably add some issues. PTAL

Comment thread src/lightspeed_evaluation/core/metrics/geval.py Outdated
@asamal4 asamal4 dismissed their stale review June 1, 2026 22:59

Added another comment

@xmican10 xmican10 force-pushed the prgutier/fix-silent-score-mismatch branch from b37303f to 5e8736e Compare June 2, 2026 12:40
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)

142-147: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Broaden DeepEval standard-metric exception normalization in evaluate()

_evaluate_metric() calls metric.measure() and self.llm_manager.flush_deepevals_pending_tasks() without guarding, so any runtime exception other than (ValueError, AttributeError, KeyError) thrown by DeepEval/LiteLLM will escape DeepEvalMetrics.evaluate() instead of returning (None, reason) (the None-score path is only for metrics that return score=None normally, not for thrown errors). Existing tests cover score=None and the ValueError case, not generic exceptions from measure()/flush.

Suggested fix
         if metric_name in self.supported_metrics:
             try:
                 return self.supported_metrics[metric_name](
                     conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
                 )
-            except (ValueError, AttributeError, KeyError) as e:
+            except Exception as e:
+                logger.warning(
+                    "DeepEval %s evaluation failed before producing a score: %s",
+                    metric_name,
+                    e,
+                )
                 return None, f"DeepEval {metric_name} evaluation failed: {str(e)}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/metrics/deepeval.py` around lines 142 - 147,
The evaluate flow in DeepEvalMetrics currently only catches (ValueError,
AttributeError, KeyError) in _evaluate_metric/evaluate, so exceptions raised by
metric.measure() or self.llm_manager.flush_deepevals_pending_tasks() can escape;
update the exception handling in DeepEvalMetrics._evaluate_metric (and/or
DeepEvalMetrics.evaluate) to catch all runtime exceptions (catch Exception)
around the calls to metric.measure() and
llm_manager.flush_deepevals_pending_tasks(), and on exception return (None,
f"DeepEval {metric_name} evaluation failed: {str(e)}") to normalize failures the
same way as existing ValueError handling while preserving existing behavior for
known exceptions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/core/metrics/test_deepeval.py`:
- Around line 24-28: The test mutates mock_metric.__class__.__name__ which
doesn't affect other MagicMock instances and is misleading; instead create a
small throwaway type and either pass it as the spec to the MagicMock or
instantiate it and attach the mock attributes so the type name is correct only
for this object. Replace the mutation of mock_metric.__class__.__name__ with
e.g. defining class TestMetric: pass and then create mock_metric =
mocker.MagicMock(spec=TestMetric) (or create an instance test_metric =
TestMetric(); set test_metric.score=None; test_metric.reason="..." and use that)
so the test uses a localized, non-leaking type name.

In `@tests/unit/core/metrics/test_geval.py`:
- Around line 390-398: The test sets up caplog.at_level(logging.WARNING) but
only asserts on reason; update the test in tests/unit/core/metrics/test_geval.py
that calls handler.evaluate (the evaluate call for metric_name="test_metric") to
also assert that the returned score is None and that caplog captured the
expected warning (e.g., assert the warning message appears in caplog.records or
caplog.text). Locate the evaluate invocation and add two assertions: one
checking score is None and one checking caplog contains the None-score warning
string to ensure the warning is actually emitted.

---

Outside diff comments:
In `@src/lightspeed_evaluation/core/metrics/deepeval.py`:
- Around line 142-147: The evaluate flow in DeepEvalMetrics currently only
catches (ValueError, AttributeError, KeyError) in _evaluate_metric/evaluate, so
exceptions raised by metric.measure() or
self.llm_manager.flush_deepevals_pending_tasks() can escape; update the
exception handling in DeepEvalMetrics._evaluate_metric (and/or
DeepEvalMetrics.evaluate) to catch all runtime exceptions (catch Exception)
around the calls to metric.measure() and
llm_manager.flush_deepevals_pending_tasks(), and on exception return (None,
f"DeepEval {metric_name} evaluation failed: {str(e)}") to normalize failures the
same way as existing ValueError handling while preserving existing behavior for
known exceptions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ad377350-58bc-436f-b45c-2b13d501aea2

📥 Commits

Reviewing files that changed from the base of the PR and between 11ed093 and 5e8736e.

📒 Files selected for processing (7)
  • src/lightspeed_evaluation/core/llm/deepeval.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
  • tests/unit/core/llm/test_deepeval_manager.py
  • tests/unit/core/metrics/conftest.py
  • tests/unit/core/metrics/test_deepeval.py
  • tests/unit/core/metrics/test_geval.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/unit/core/llm/test_deepeval_manager.py
  • src/lightspeed_evaluation/core/llm/deepeval.py
  • tests/unit/core/metrics/conftest.py

Comment thread tests/unit/core/metrics/test_deepeval.py Outdated
Comment thread tests/unit/core/metrics/test_geval.py
@xmican10 xmican10 force-pushed the prgutier/fix-silent-score-mismatch branch from 5e8736e to b7d9bfd Compare June 2, 2026 13:22
@xmican10 xmican10 requested a review from VladimirKadlec June 2, 2026 13:24
Comment thread src/lightspeed_evaluation/core/metrics/geval.py
Copy link
Copy Markdown
Collaborator

@asamal4 asamal4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor change

Priscila Gutierres and others added 2 commits June 3, 2026 14:26
CRITICAL BUG FIX: Prevent silent score mismatch when LLM judge fails

## Problem Statement

GEval metrics silently convert None scores to 0.0 without warning when
LLM judge evaluation fails. This masks critical infrastructure issues:

- Rate limiting (429 errors) after retries exhausted
- LLM provider timeouts
- Invalid JSON responses from judge
- API quota/credits exhausted
- Network connection failures

Without this fix, failed evaluations appear as valid low scores (0.0),
making debugging nearly impossible and corrupting evaluation data.

## What Changed

### 1. Added configurable retry logic (geval.py)

- New `num_retries` parameter from user config (default: 6)
- Custom retry decorator respecting user settings
- DeepEval hardcodes MAX_RETRIES=6, now we add our own layer

### 2. Added warning when score=None (geval.py:385-397, 515-527)

Before:
```python
score = metric.score if metric.score is not None else 0.0  # Silent!
return score, reason
```

After:
```python
score = metric.score
if score is None:
    logger.warning(
        "GEval metric returned None score; defaulting to 0.0. "
        "This typically indicates LLM judge failure (rate limiting, "
        "timeout, invalid JSON, or quota exhausted). Reason: %s",
        reason
    )
    score = 0.0
return score, reason
```

### 3. Updated system.yaml documentation

Added comment explaining num_retries applies to GEval and defaults to 6.

### 4. Fixed test mocks

Updated test fixtures to include llm_params with num_retries to match
new initialization logic.

## Files Changed

- src/lightspeed_evaluation/core/metrics/geval.py
  - Add tenacity imports for retry logic
  - Add num_retries initialization from config
  - Add _is_retryable_exception() method
  - Add _measure_with_retry() method
  - Add warning logs for None scores (turn & conversation level)
  - Replace direct metric.measure() with retry wrapper

- config/system.yaml
  - Document num_retries behavior for GEval

- tests/unit/core/metrics/test_geval.py
  - Add llm_params mock to fixture

## Testing

- All 833 tests pass
- All pre-commit checks pass (black, pylint, ruff, mypy, bandit, pyright)
- Manual verification of retry logic and warning logs

## Impact

Users can now:
- Configure retry attempts via system.yaml (was hardcoded to 6)
- Immediately see warnings when evaluations fail
- Distinguish between real low scores vs failed evaluations
- Debug rate limiting and infrastructure issues

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@xmican10 xmican10 force-pushed the prgutier/fix-silent-score-mismatch branch from b7d9bfd to e28e2b8 Compare June 3, 2026 12:40
@xmican10 xmican10 requested a review from asamal4 June 3, 2026 12:42
Copy link
Copy Markdown
Collaborator

@asamal4 asamal4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@asamal4 asamal4 merged commit fbfc4a7 into lightspeed-core:main Jun 4, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants