feat(eval): add FRAMES benchmark evaluation pipeline #305

Matteovanypersele wants to merge 1 commit into linagora:dev from
Conversation
📝 Walkthrough

This pull request introduces two new scripts for the FRAMES benchmark pipeline: `setup_frames.py` and `eval_frames.py`.
Sequence Diagram(s)

sequenceDiagram
participant User
participant SetupScript as setup_frames.py
participant HF as HuggingFace
participant WikiAPI as Wikipedia API
participant OpenRAG
User->>SetupScript: Run with --partition, --limit, --wiki-concurrency
SetupScript->>HF: Load FRAMES dataset (cached or fresh)
HF-->>SetupScript: Return test split
SetupScript->>WikiAPI: Download articles asynchronously<br/>(markdown/HTML/PDF with retries)
WikiAPI-->>SetupScript: Article content
SetupScript->>OpenRAG: Check health
OpenRAG-->>SetupScript: Health status
SetupScript->>OpenRAG: Create partition (if needed)
OpenRAG-->>SetupScript: Partition ready
SetupScript->>OpenRAG: Upload & track files concurrently<br/>(post then poll task_status_url)
OpenRAG-->>SetupScript: Terminal state (COMPLETED/FAILED/SKIPPED)
SetupScript->>User: Print aggregated results
sequenceDiagram
participant User
participant EvalScript as eval_frames.py
participant OpenRAG
participant LLM
participant Judge as LLM Judge
User->>EvalScript: Run with mode flag (--no-rag/--oracle/--gold-workspaces)
EvalScript->>EvalScript: Load dataset (cached)
alt Default RAG Mode
EvalScript->>OpenRAG: Query chat completions with question
OpenRAG->>LLM: Generate answer
LLM-->>OpenRAG: Answer + extra.sources
OpenRAG-->>EvalScript: Answer + chunk references
else No-RAG Mode
EvalScript->>LLM: Send question directly
LLM-->>EvalScript: Answer
else Oracle Mode
EvalScript->>EvalScript: Load gold article & extract context
EvalScript->>LLM: Send question + gold context
LLM-->>EvalScript: Answer
else Gold-Workspaces Mode
EvalScript->>OpenRAG: Create workspace per question
EvalScript->>OpenRAG: Attach gold file IDs to workspace
EvalScript->>OpenRAG: Query using workspace
OpenRAG-->>EvalScript: Answer
end
EvalScript->>EvalScript: Compute exact-match score
EvalScript->>Judge: Run accuracy judging (structured output)
Judge-->>EvalScript: {correct, justification}
EvalScript->>EvalScript: Aggregate by reasoning_type
EvalScript->>User: Save results + print summary
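The upload-and-poll step in the setup diagram above (post the file, then poll `task_status_url` until a terminal state) can be sketched as below. This is a minimal illustration, not the PR's implementation: the state names and the injected `get_status`/`sleep` callables are assumptions.

```python
import time

# Terminal states assumed from the diagram; the real API vocabulary may differ.
TERMINAL_STATES = {"COMPLETED", "FAILED", "SKIPPED"}


def poll_until_terminal(get_status, max_attempts=10, base_delay=1.0, sleep=time.sleep):
    """Poll a task-status callable with capped exponential backoff."""
    for attempt in range(max_attempts):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # Back off exponentially, never waiting more than 60 seconds.
        sleep(min(base_delay * 2 ** attempt, 60))
    return "TIMEOUT"
```

Injecting `sleep` keeps the loop testable without real delays.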
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes

The changes introduce two substantial, feature-complete scripts with heterogeneous logic patterns: async HTTP operations with exponential backoff across Wikipedia and OpenRAG APIs, dataset loading and caching, four distinct evaluation modes with different control flows, structured output parsing, and concurrent task management with polling. The interaction between external services (HuggingFace, Wikipedia, OpenRAG, LLM) and internal state handling requires careful verification of error handling, retry strategies, and metric computation accuracy.
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (3)
automatic-evaluation-pipeline/eval_frames.py (2)
125-146: Consider extracting shared utilities to reduce duplication.
`load_dataset_cached`, `extract_title_from_url`, `_safe_filename`, and the wiki link parsing are duplicated between `setup_frames.py` and `eval_frames.py`. Extracting these to a shared module (e.g., `frames_utils.py`) would improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@automatic-evaluation-pipeline/eval_frames.py` around lines 125 - 146, The same utilities (load_dataset_cached, extract_title_from_url, _safe_filename, and the wiki link parsing logic) are duplicated in setup_frames.py and eval_frames.py; consolidate them into a new shared module (e.g., frames_utils.py) that exports those functions, move the implementation there, update both files to import the functions (referencing load_dataset_cached, extract_title_from_url, _safe_filename and the wiki link parsing helper by name), and ensure any relative path logic (_default_dataset_cache usage) is either moved into the shared module or passed in as parameters so both callers work without duplication.
289-290: Creating a new HTTP client per question is inefficient.

Each call to `query_openrag_answer` creates a new `httpx.AsyncClient`, which incurs connection setup overhead. Consider passing a shared client or using a connection pool for better performance when evaluating 824 questions.

Suggested approach
Pass the client as a parameter:
```diff
 async def query_openrag_answer(
     question: str,
     partition: str,
     semaphore: asyncio.Semaphore,
+    client: httpx.AsyncClient,
     workspace: str | None = None,
 ) -> tuple[str | None, str]:
     """Call OpenRAG chat completions for a single question."""
     headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
     async with semaphore:
-        async with httpx.AsyncClient(timeout=300) as client:
```

Then create the client once in `main()` and pass it to all calls.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@automatic-evaluation-pipeline/eval_frames.py` around lines 289 - 290, The code creates a new httpx.AsyncClient inside query_openrag_answer each time (inside the async with block guarded by semaphore), which is inefficient; refactor query_openrag_answer to accept an optional shared_client parameter (or required client) and remove the internal "async with httpx.AsyncClient" so it uses the passed httpx.AsyncClient; then instantiate a single AsyncClient in main() (or the calling scope) and pass that client into all calls to query_openrag_answer (while keeping the existing semaphore usage intact) to enable connection reuse and pooling.

automatic-evaluation-pipeline/setup_frames.py (1)
43-47: Default AUTH_TOKEN exposes a hardcoded credential.

The default `AUTH_TOKEN = "sk-1234"` could accidentally be used in non-local environments if the `.env` file is missing or misconfigured. Consider failing explicitly if `AUTH_TOKEN` is not set when indexing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@automatic-evaluation-pipeline/setup_frames.py` around lines 43 - 47, The code currently provides a hardcoded default AUTH_TOKEN ("sk-1234"); remove the default so AUTH_TOKEN = os.environ.get("AUTH_TOKEN") (no fallback) and add an explicit check where indexing is initiated (or at module import if indexing runs on startup): if not AUTH_TOKEN: raise RuntimeError("AUTH_TOKEN must be set for indexing") or call sys.exit(1) with a clear message; keep APP_URL, APP_PORT and OPENRAG_BASE_URL unchanged but ensure any code that relies on AUTH_TOKEN uses the validated value.
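The fail-fast validation suggested above could look like this minimal sketch; `require_env` is a hypothetical helper name, and the exact exit message is an illustration, not the PR's code.

```python
import os
import sys


def require_env(name: str) -> str:
    """Return the value of a required environment variable (e.g. AUTH_TOKEN).

    Exits with a clear message instead of silently falling back to a
    hardcoded default such as "sk-1234".
    """
    value = os.environ.get(name)
    if not value:
        sys.exit(f"ERROR: {name} must be set (check your .env file)")
    return value
```

Calling `require_env("AUTH_TOKEN")` at the point where indexing starts surfaces a misconfigured environment immediately.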
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: abb66cbf-67a5-4a95-89a4-1ce44edcf361
📒 Files selected for processing (2)
- automatic-evaluation-pipeline/eval_frames.py
- automatic-evaluation-pipeline/setup_frames.py
```python
MODEL = os.environ.get("MODEL")
BASE_URL = os.environ.get("BASE_URL")
API_KEY = os.environ.get("API_KEY")
JUDGE_MODEL = os.environ.get("MODEL_JUDGE", MODEL)
JUDGE_BASE_URL = os.environ.get("BASE_URL_JUDGE", BASE_URL)
JUDGE_API_KEY = os.environ.get("API_KEY_JUDGE", API_KEY)
```
Missing validation for required LLM environment variables.
MODEL, BASE_URL, and API_KEY can be None if not set in .env. The --no-rag, --oracle, and judge functionality will fail with unclear errors when ChatOpenAI is instantiated with model=None.
Suggested fix
Add validation before using the LLM:
```python
def _require_llm_config() -> None:
    """Validate LLM environment variables are set."""
    missing = [
        name
        for name, val in [("MODEL", MODEL), ("BASE_URL", BASE_URL), ("API_KEY", API_KEY)]
        if not val
    ]
    if missing:
        raise SystemExit(f"ERROR: Missing required environment variables for LLM: {', '.join(missing)}")
```

Call this at the start of --no-rag and --oracle modes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/eval_frames.py` around lines 53 - 58, Add
validation to ensure required LLM env vars (MODEL, BASE_URL, API_KEY and their
judge variants) are present before any ChatOpenAI instantiation: implement a
helper like _require_llm_config() that checks MODEL, BASE_URL, API_KEY (and
optionally JUDGE_MODEL/JUDGE_BASE_URL/JUDGE_API_KEY) and raises a clear
SystemExit with the missing names if any are unset, and call this helper at the
start of the code paths for --no-rag and --oracle (and the judge initialization)
so ChatOpenAI is never created with model=None.
```python
async def add_files_to_workspace(
    client: httpx.AsyncClient,
    partition: str,
    workspace_id: str,
    file_ids: list[str],
) -> None:
    """Attach files to a workspace."""
    headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
    resp = await client.post(
        f"{OPENRAG_BASE_URL}/partition/{partition}/workspaces/{workspace_id}/files",
        json={"file_ids": file_ids},
        headers=headers,
    )
    resp.raise_for_status()
```
Missing error handling for non-existent files in workspace attachment.
add_files_to_workspace calls raise_for_status() which will raise on any error, but there's no specific handling if some gold files don't exist in the partition. This could cause the entire evaluation to fail if one article wasn't indexed.
Suggested improvement
```diff
 async def add_files_to_workspace(
     client: httpx.AsyncClient,
     partition: str,
     workspace_id: str,
     file_ids: list[str],
 ) -> None:
     """Attach files to a workspace."""
     headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
     resp = await client.post(
         f"{OPENRAG_BASE_URL}/partition/{partition}/workspaces/{workspace_id}/files",
         json={"file_ids": file_ids},
         headers=headers,
     )
-    resp.raise_for_status()
+    if resp.status_code not in (200, 201):
+        logger.warning(f"Failed to attach files to workspace '{workspace_id}': {resp.status_code} - {resp.text}")
+    resp.raise_for_status()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/eval_frames.py` around lines 362 - 376, The
add_files_to_workspace function currently calls resp.raise_for_status() and will
abort on any HTTP error; wrap the POST call in a try/except for
httpx.HTTPStatusError, catch failures that indicate non-existent/missing files
(e.g., status 404 or 400) and inspect resp.json() for a field indicating missing
file ids (e.g., "missing_files" or similar), log which file_ids are missing and
either filter them out and retry a POST with the remaining ids or
return/propagate a controlled result instead of raising; update
add_files_to_workspace to handle missing files gracefully (log missing ids and
continue) and reference OPENRAG_BASE_URL, AUTH_TOKEN, and the
add_files_to_workspace function when making changes.
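The filter-and-retry behaviour described in the prompt above can be sketched transport-agnostically. This is an illustration only: `post` stands in for the real HTTP call, and the `"missing_files"` field name is a hypothetical API shape, not something the PR's OpenRAG server is known to return.

```python
def attach_files_resilient(post, file_ids):
    """Attach files, tolerating ids the server reports as missing.

    `post` is any callable taking a list of ids and returning
    (status_code, body_dict). Returns the ids actually attached.
    """
    status, body = post(file_ids)
    if status in (200, 201):
        return list(file_ids)
    if status in (400, 404):
        # Hypothetical error shape: the body lists the unknown ids.
        missing = set(body.get("missing_files", []))
        remaining = [f for f in file_ids if f not in missing]
        if remaining and missing:
            # Retry once with only the ids the server knows about.
            status, _ = post(remaining)
            if status in (200, 201):
                return remaining
        return []
    raise RuntimeError(f"attach failed: HTTP {status}")
```

Because the transport is injected, the logic is unit-testable with a fake `post` and can later be wired to the real `httpx` call.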
```python
else:
    print(f"Evaluating {len(dataset)} questions on partition '{args.partition}' at {OPENRAG_BASE_URL}")
    sem = asyncio.Semaphore(args.concurrency)
    raw = await tqdm.gather(
        *[query_openrag_answer(row["Prompt"], args.partition, sem) for row in dataset],
        desc="Querying OpenRAG",
    )
    results = [_build_result(row, ans, src, "rag") for row, (ans, src) in zip(dataset, raw)]
    annotate_exact_match(results)
    await run_accuracy_judging(results, concurrency=args.judge_concurrency)
```
Default RAG mode skips OpenRAG health check.
The --oracle and --gold-workspaces modes verify OpenRAG is reachable before proceeding, but the default RAG mode (lines 812-821) does not. If OpenRAG is down, users will see confusing errors instead of a clear health check failure.
Suggested fix
```diff
 else:
     print(f"Evaluating {len(dataset)} questions on partition '{args.partition}' at {OPENRAG_BASE_URL}")
+    async with httpx.AsyncClient(timeout=60) as client:
+        if not await _check_openrag_health(client):
+            return
     sem = asyncio.Semaphore(args.concurrency)
```
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/eval_frames.py` around lines 812 - 821, Default
RAG path prints and begins querying OpenRAG without verifying the service is
reachable; add the same OpenRAG health check used by the other modes before the
print/async gather block so we fail fast with a clear error if the service is
down. Specifically, before printing the message that includes OPENRAG_BASE_URL
and before creating the semaphore and calling query_openrag_answer across the
dataset, invoke the existing OpenRAG health-check helper (the same call used in
--oracle/--gold-workspaces) and abort/raise/log a clear error on failure; then
proceed to create sem, call tqdm.gather(*[query_openrag_answer(...)]), build
results and run_accuracy_judging as before.
```python
if resp.status_code == 429 or resp.status_code >= 500:
    wait = float(resp.headers.get("retry-after") or min(2 ** attempt + 1, 60))
```
retry-after header parsing may fail on non-numeric values.
The Retry-After header can be either a number of seconds or an HTTP-date string. Passing an HTTP-date to float() will raise ValueError.
Proposed fix
```diff
-    wait = float(resp.headers.get("retry-after") or min(2 ** attempt + 1, 60))
+    retry_after = resp.headers.get("retry-after")
+    try:
+        wait = float(retry_after) if retry_after else min(2 ** attempt + 1, 60)
+    except ValueError:
+        wait = min(2 ** attempt + 1, 60)
```
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/setup_frames.py` around lines 155 - 156, The
current calculation of wait uses float(resp.headers.get("retry-after")) which
will raise ValueError for HTTP-date values; update the logic where wait is set
(the block using resp, resp.headers, attempt) to: read the "Retry-After" header
into a variable, try to parse it as an integer/float first, and if that fails
parse it as an HTTP-date (e.g., with email.utils.parsedate_to_datetime) and
compute seconds until that date (clamped >=0); if parsing fails or header
missing, fall back to the existing exponential backoff expression min(2 **
attempt + 1, 60). Ensure the final wait remains a float and preserve existing
behavior for numeric headers.
```python
completed = sum(1 for r in results if r["status"] == "COMPLETED")
failed = sum(1 for r in results if r["status"] == "FAILED")
skipped = sum(1 for r in results if r["status"] == "skipped")
errors = sum(1 for r in results if r["status"] == "ERROR")
```
Result status counting depends on corrected terminal states.
After fixing TERMINAL_STATES, ensure the status comparison here aligns. If the API returns "SUCCESS", the current check for "COMPLETED" won't match any results.
Proposed fix (after normalizing status in upload_and_track)
If you normalize status in upload_and_track to return "COMPLETED" for successful tasks, this code will work. Otherwise, update to:
```diff
-completed = sum(1 for r in results if r["status"] == "COMPLETED")
+completed = sum(1 for r in results if r["status"] in ("COMPLETED", "SUCCESS"))
```
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/setup_frames.py` around lines 343 - 346, The
status counters rely on specific terminal strings but the API may return
different tokens (e.g., "SUCCESS" vs "COMPLETED"); update the counting in the
block that computes completed/failed/skipped/errors (which iterates over
results) to compare against the normalized status values produced by
upload_and_track or to accept both variants — for example change the completed
check to accept both "COMPLETED" and "SUCCESS" (and similarly align
failed/skipped/errors with any alternate API tokens) or, preferably, normalize
each result.status (via the same normalization logic used in
upload_and_track/TERMINAL_STATES) before counting so the sums reliably reflect
terminal states.
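The normalization approach the prompt above prefers could be sketched as follows. The alias map is an assumption about which tokens the API may emit (e.g. "SUCCESS" vs "COMPLETED"), not a documented vocabulary.

```python
from collections import Counter

# Assumed alias table; extend it as the real API's tokens are observed.
_STATUS_ALIASES = {
    "SUCCESS": "COMPLETED",
    "COMPLETED": "COMPLETED",
    "FAILED": "FAILED",
    "FAILURE": "FAILED",
    "SKIPPED": "SKIPPED",
    "ERROR": "ERROR",
}


def normalize_status(raw):
    """Map heterogeneous terminal-state tokens onto one vocabulary."""
    return _STATUS_ALIASES.get((raw or "").strip().upper(), "UNKNOWN")


def count_statuses(results):
    """Count results per normalized status; results is a list of dicts
    with a "status" key, as in the snippet above."""
    return Counter(normalize_status(r["status"]) for r in results)
```

Counting over normalized values makes the summary robust to casing and synonym drift in the API's responses.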
Summary
This PR adds a FRAMES benchmark evaluation pipeline under `automatic-evaluation-pipeline/` (not sure if it's the best place).

- `setup_frames.py`: downloads the Wikipedia articles referenced by FRAMES in `md`, `pdf`, or `html` format, then uploads and indexes them into an OpenRAG partition
- `eval_frames.py`: runs FRAMES evaluation over the 824-question benchmark with LLM-judge scoring and a fuzzy exact match

Evaluation modes

- `--no-rag`: bypass retrieval and query the model directly without chunks
- `--oracle`: bypass retrieval and provide the gold Wikipedia articles directly to the model (heavily impacted by the 8192-token limit)
- `--gold-workspaces`: create one workspace per question and attach only the gold files
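A "fuzzy exact match" as described above could be sketched as a normalized string comparison. What exactly the PR normalizes is not shown here, so the case/punctuation/whitespace folding below is an assumption for illustration.

```python
import re
import string


def normalized_exact_match(prediction: str, gold: str) -> bool:
    """Case-, punctuation-, and whitespace-insensitive comparison."""
    def norm(s: str) -> str:
        # Lowercase, strip punctuation, collapse runs of whitespace.
        s = s.lower().translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", s).strip()
    return norm(prediction) == norm(gold)
```

Pairing a lenient check like this with an LLM judge lets the exact-match score act as a cheap lower bound on accuracy.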