
feat(eval): add FRAMES benchmark evaluation pipeline#305

Open
Matteovanypersele wants to merge 1 commit intolinagora:devfrom
Matteovanypersele:feat/frames-benchmark-pipeline

Conversation


@Matteovanypersele Matteovanypersele commented Apr 15, 2026

Summary

This PR adds a FRAMES benchmark evaluation pipeline under automatic-evaluation-pipeline/ (not sure if it's the best place).

  • setup_frames.py: downloads the Wikipedia articles referenced by FRAMES in md, pdf, or html format, then uploads and indexes them into an OpenRAG partition
  • eval_frames.py: runs the evaluation over the 824-question FRAMES benchmark, scoring with both an LLM judge and a fuzzy exact-match metric

Evaluation modes

  • Default RAG mode: query OpenRAG with retrieval enabled
  • --no-rag: bypass retrieval and query the model directly, without any retrieved chunks
  • --oracle: bypass retrieval and provide the gold Wikipedia articles directly to the model (heavily impacted by the 8192-token limit)
  • --gold-workspaces: create one workspace per question and attach only the gold files
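
The four modes above are mutually exclusive. A minimal sketch of how the flags might be wired with argparse — flag names are taken from the list above, everything else (parser structure, help text) is illustrative, not the PR's actual code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI sketch for eval_frames.py's evaluation modes (illustrative only)."""
    parser = argparse.ArgumentParser(description="FRAMES benchmark evaluation")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--no-rag", action="store_true",
                      help="bypass retrieval and query the model directly")
    mode.add_argument("--oracle", action="store_true",
                      help="provide the gold Wikipedia articles as context")
    mode.add_argument("--gold-workspaces", action="store_true",
                      help="one workspace per question, attaching only gold files")
    return parser

# No mode flag at all falls through to the default RAG mode.
args = build_parser().parse_args(["--oracle"])
```

With a mutually exclusive group, argparse itself rejects contradictory invocations such as `--no-rag --oracle`, so the script never has to disambiguate modes by hand.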

Summary by CodeRabbit

  • New Features
    • Added FRAMES benchmark evaluation pipeline supporting multiple evaluation modes: standard RAG retrieval, oracle context, and direct LLM querying.
    • Automated benchmark setup with concurrent Wikipedia content download and OpenRAG document indexing.
    • Includes exact-match and LLM-based accuracy metrics with reasoning type breakdowns and result resumption capability.


coderabbitai bot commented Apr 15, 2026

📝 Walkthrough


This pull request introduces two new scripts for the FRAMES benchmark pipeline: setup_frames.py prepares Wikipedia content for OpenRAG indexing (downloading articles and managing concurrent uploads), while eval_frames.py implements multi-mode evaluation against OpenRAG with exact-match and LLM-based accuracy scoring.

Changes

  • FRAMES Benchmark Setup — automatic-evaluation-pipeline/setup_frames.py
    New script that loads the FRAMES dataset from HuggingFace, extracts Wikipedia article titles, downloads content (markdown/HTML/PDF) from Wikipedia APIs with retry logic, and uploads documents to OpenRAG with concurrent indexing and task polling. Supports caching, skip-aware resumption, and various CLI modes (--index-only, --retry-missing).
  • FRAMES Benchmark Evaluation — automatic-evaluation-pipeline/eval_frames.py
    New script that evaluates FRAMES benchmark questions against OpenRAG using four mutually exclusive modes: default RAG retrieval, --no-rag direct LLM querying, --oracle with gold article context, and --gold-workspaces with per-question workspace setup. Computes normalized exact-match and LLM-based factual accuracy scoring with breakdowns by reasoning type.
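
The "normalized exact-match" metric is not spelled out in the walkthrough; a common implementation of this style of metric — shown here as an assumed sketch, not the PR's actual code — lowercases, strips punctuation and English articles, and collapses whitespace before comparing:

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized forms of prediction and gold answer coincide."""
    return normalize_answer(prediction) == normalize_answer(gold)
```

Exact match alone is brittle for free-form answers (a correct paraphrase scores zero), which is why the pipeline pairs it with LLM-judge accuracy scoring.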

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SetupScript as setup_frames.py
    participant HF as HuggingFace
    participant WikiAPI as Wikipedia API
    participant OpenRAG
    
    User->>SetupScript: Run with --partition, --limit, --wiki-concurrency
    SetupScript->>HF: Load FRAMES dataset (cached or fresh)
    HF-->>SetupScript: Return test split
    SetupScript->>WikiAPI: Download articles asynchronously<br/>(markdown/HTML/PDF with retries)
    WikiAPI-->>SetupScript: Article content
    SetupScript->>OpenRAG: Check health
    OpenRAG-->>SetupScript: Health status
    SetupScript->>OpenRAG: Create partition (if needed)
    OpenRAG-->>SetupScript: Partition ready
    SetupScript->>OpenRAG: Upload & track files concurrently<br/>(post then poll task_status_url)
    OpenRAG-->>SetupScript: Terminal state (COMPLETED/FAILED/SKIPPED)
    SetupScript->>User: Print aggregated results
sequenceDiagram
    participant User
    participant EvalScript as eval_frames.py
    participant OpenRAG
    participant LLM
    participant Judge as LLM Judge
    
    User->>EvalScript: Run with mode flag (--no-rag/--oracle/--gold-workspaces)
    EvalScript->>EvalScript: Load dataset (cached)
    
    alt Default RAG Mode
        EvalScript->>OpenRAG: Query chat completions with question
        OpenRAG->>LLM: Generate answer
        LLM-->>OpenRAG: Answer + extra.sources
        OpenRAG-->>EvalScript: Answer + chunk references
    else No-RAG Mode
        EvalScript->>LLM: Send question directly
        LLM-->>EvalScript: Answer
    else Oracle Mode
        EvalScript->>EvalScript: Load gold article & extract context
        EvalScript->>LLM: Send question + gold context
        LLM-->>EvalScript: Answer
    else Gold-Workspaces Mode
        EvalScript->>OpenRAG: Create workspace per question
        EvalScript->>OpenRAG: Attach gold file IDs to workspace
        EvalScript->>OpenRAG: Query using workspace
        OpenRAG-->>EvalScript: Answer
    end
    
    EvalScript->>EvalScript: Compute exact-match score
    EvalScript->>Judge: Run accuracy judging (structured output)
    Judge-->>EvalScript: {correct, justification}
    EvalScript->>EvalScript: Aggregate by reasoning_type
    EvalScript->>User: Save results + print summary

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

The changes introduce two substantial, feature-complete scripts with heterogeneous logic patterns: async HTTP operations with exponential backoff across Wikipedia and OpenRAG APIs, dataset loading and caching, four distinct evaluation modes with different control flows, structured output parsing, and concurrent task management with polling. The interaction between external services (HuggingFace, Wikipedia, OpenRAG, LLM) and internal state handling requires careful verification of error handling, retry strategies, and metric computation accuracy.

Poem

🐰 Hops in with glee!

Two scripts hop into the warren,
One fetches Wikipedia, no tarren,
The other evaluates with flair—
OpenRAG benchmarks everywhere!
FRAMES shine bright, assessments complete,
This pipeline's a cottontail treat! 🏆

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 63.04%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately summarizes the main change: adding a FRAMES benchmark evaluation pipeline with two scripts for setup and evaluation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (3)
automatic-evaluation-pipeline/eval_frames.py (2)

125-146: Consider extracting shared utilities to reduce duplication.

load_dataset_cached, extract_title_from_url, _safe_filename, and wiki link parsing are duplicated between setup_frames.py and eval_frames.py. Extracting these to a shared module (e.g., frames_utils.py) would improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automatic-evaluation-pipeline/eval_frames.py` around lines 125 - 146, The
same utilities (load_dataset_cached, extract_title_from_url, _safe_filename, and
the wiki link parsing logic) are duplicated in setup_frames.py and
eval_frames.py; consolidate them into a new shared module (e.g.,
frames_utils.py) that exports those functions, move the implementation there,
update both files to import the functions (referencing load_dataset_cached,
extract_title_from_url, _safe_filename and the wiki link parsing helper by
name), and ensure any relative path logic (_default_dataset_cache usage) is
either moved into the shared module or passed in as parameters so both callers
work without duplication.

289-290: Creating a new HTTP client per question is inefficient.

Each call to query_openrag_answer creates a new httpx.AsyncClient, which incurs connection setup overhead. Consider passing a shared client or using a connection pool for better performance when evaluating 824 questions.

Suggested approach

Pass the client as a parameter:

 async def query_openrag_answer(
     question: str,
     partition: str,
     semaphore: asyncio.Semaphore,
+    client: httpx.AsyncClient,
     workspace: str | None = None,
 ) -> tuple[str | None, str]:
     """Call OpenRAG chat completions for a single question."""
     headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}

     async with semaphore:
-        async with httpx.AsyncClient(timeout=300) as client:

Then create the client once in main() and pass it to all calls.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automatic-evaluation-pipeline/eval_frames.py` around lines 289 - 290, The
code creates a new httpx.AsyncClient inside query_openrag_answer each time
(inside the async with block guarded by semaphore), which is inefficient;
refactor query_openrag_answer to accept an optional shared_client parameter (or
required client) and remove the internal "async with httpx.AsyncClient" so it
uses the passed httpx.AsyncClient; then instantiate a single AsyncClient in
main() (or the calling scope) and pass that client into all calls to
query_openrag_answer (while keeping the existing semaphore usage intact) to
enable connection reuse and pooling.
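
The refactor this comment describes is a standard dependency-injection pattern: construct one client in the outermost scope and pass it down, so every request reuses the same connection pool. A network-free sketch of the shape — `FakeClient` stands in for `httpx.AsyncClient`, and all names are illustrative:

```python
import asyncio

class FakeClient:
    """Stand-in for httpx.AsyncClient; counts constructions to show reuse."""
    instances = 0

    def __init__(self) -> None:
        FakeClient.instances += 1

    async def post(self, url: str) -> str:
        return f"response from {url}"

async def query_one(client: FakeClient, sem: asyncio.Semaphore, question: str) -> str:
    # The client is shared; the semaphore only bounds concurrency per call.
    async with sem:
        return await client.post(f"/chat?q={question}")

async def run_all(questions: list[str]) -> list[str]:
    client = FakeClient()  # created once, reused for every question
    sem = asyncio.Semaphore(8)
    return await asyncio.gather(*[query_one(client, sem, q) for q in questions])

results = asyncio.run(run_all([f"q{i}" for i in range(5)]))
```

Over 824 questions, the difference between one client and 824 is 823 avoided TCP/TLS handshakes, which is exactly the overhead the review comment flags.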
automatic-evaluation-pipeline/setup_frames.py (1)

43-47: Default AUTH_TOKEN exposes a hardcoded credential.

The default AUTH_TOKEN = "sk-1234" could accidentally be used in non-local environments if the .env file is missing or misconfigured. Consider failing explicitly if AUTH_TOKEN is not set when indexing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automatic-evaluation-pipeline/setup_frames.py` around lines 43 - 47, The code
currently provides a hardcoded default AUTH_TOKEN ("sk-1234"); remove the
default so AUTH_TOKEN = os.environ.get("AUTH_TOKEN") (no fallback) and add an
explicit check where indexing is initiated (or at module import if indexing runs
on startup): if not AUTH_TOKEN: raise RuntimeError("AUTH_TOKEN must be set for
indexing") or call sys.exit(1) with a clear message; keep APP_URL, APP_PORT and
OPENRAG_BASE_URL unchanged but ensure any code that relies on AUTH_TOKEN uses
the validated value.
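
The fail-fast check the prompt asks for is only a few lines; a sketch, with the variable name taken from the review and the message text assumed:

```python
import os

def require_auth_token() -> str:
    """Return AUTH_TOKEN from the environment, failing loudly if unset."""
    token = os.environ.get("AUTH_TOKEN")
    if not token:
        raise RuntimeError(
            "AUTH_TOKEN must be set (e.g. in .env) before indexing; "
            "refusing to fall back to a hardcoded default."
        )
    return token
```

Raising at the point of use keeps local workflows intact (set the variable in `.env`) while making an unconfigured deployment fail with a clear message instead of silently authenticating with `sk-1234`.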

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: abb66cbf-67a5-4a95-89a4-1ce44edcf361

📥 Commits

Reviewing files that changed from the base of the PR and between 922714e and a5d8bf4.

📒 Files selected for processing (2)
  • automatic-evaluation-pipeline/eval_frames.py
  • automatic-evaluation-pipeline/setup_frames.py

Comment on lines +53 to +58
MODEL = os.environ.get("MODEL")
BASE_URL = os.environ.get("BASE_URL")
API_KEY = os.environ.get("API_KEY")
JUDGE_MODEL = os.environ.get("MODEL_JUDGE", MODEL)
JUDGE_BASE_URL = os.environ.get("BASE_URL_JUDGE", BASE_URL)
JUDGE_API_KEY = os.environ.get("API_KEY_JUDGE", API_KEY)

⚠️ Potential issue | 🟡 Minor

Missing validation for required LLM environment variables.

MODEL, BASE_URL, and API_KEY can be None if not set in .env. The --no-rag, --oracle, and judge functionality will fail with unclear errors when ChatOpenAI is instantiated with model=None.

Suggested fix

Add validation before using the LLM:

def _require_llm_config() -> None:
    """Validate LLM environment variables are set."""
    missing = [name for name, val in [("MODEL", MODEL), ("BASE_URL", BASE_URL), ("API_KEY", API_KEY)] if not val]
    if missing:
        raise SystemExit(f"ERROR: Missing required environment variables for LLM: {', '.join(missing)}")

Call this at the start of --no-rag and --oracle modes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automatic-evaluation-pipeline/eval_frames.py` around lines 53 - 58, Add
validation to ensure required LLM env vars (MODEL, BASE_URL, API_KEY and their
judge variants) are present before any ChatOpenAI instantiation: implement a
helper like _require_llm_config() that checks MODEL, BASE_URL, API_KEY (and
optionally JUDGE_MODEL/JUDGE_BASE_URL/JUDGE_API_KEY) and raises a clear
SystemExit with the missing names if any are unset, and call this helper at the
start of the code paths for --no-rag and --oracle (and the judge initialization)
so ChatOpenAI is never created with model=None.

Comment on lines +362 to +376
async def add_files_to_workspace(
client: httpx.AsyncClient,
partition: str,
workspace_id: str,
file_ids: list[str],
) -> None:
"""Attach files to a workspace."""
headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
resp = await client.post(
f"{OPENRAG_BASE_URL}/partition/{partition}/workspaces/{workspace_id}/files",
json={"file_ids": file_ids},
headers=headers,
)
resp.raise_for_status()


⚠️ Potential issue | 🟡 Minor

Missing error handling for non-existent files in workspace attachment.

add_files_to_workspace calls raise_for_status() which will raise on any error, but there's no specific handling if some gold files don't exist in the partition. This could cause the entire evaluation to fail if one article wasn't indexed.

Suggested improvement
 async def add_files_to_workspace(
     client: httpx.AsyncClient,
     partition: str,
     workspace_id: str,
     file_ids: list[str],
 ) -> None:
     """Attach files to a workspace."""
     headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
     resp = await client.post(
         f"{OPENRAG_BASE_URL}/partition/{partition}/workspaces/{workspace_id}/files",
         json={"file_ids": file_ids},
         headers=headers,
     )
-    resp.raise_for_status()
+    if resp.status_code not in (200, 201):
+        logger.warning(f"Failed to attach files to workspace '{workspace_id}': {resp.status_code} - {resp.text}")
+        resp.raise_for_status()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automatic-evaluation-pipeline/eval_frames.py` around lines 362 - 376, The
add_files_to_workspace function currently calls resp.raise_for_status() and will
abort on any HTTP error; wrap the POST call in a try/except for
httpx.HTTPStatusError, catch failures that indicate non-existent/missing files
(e.g., status 404 or 400) and inspect resp.json() for a field indicating missing
file ids (e.g., "missing_files" or similar), log which file_ids are missing and
either filter them out and retry a POST with the remaining ids or
return/propagate a controlled result instead of raising; update
add_files_to_workspace to handle missing files gracefully (log missing ids and
continue) and reference OPENRAG_BASE_URL, AUTH_TOKEN, and the
add_files_to_workspace function when making changes.

Comment on lines +812 to +821
else:
print(f"Evaluating {len(dataset)} questions on partition '{args.partition}' at {OPENRAG_BASE_URL}")
sem = asyncio.Semaphore(args.concurrency)
raw = await tqdm.gather(
*[query_openrag_answer(row["Prompt"], args.partition, sem) for row in dataset],
desc="Querying OpenRAG",
)
results = [_build_result(row, ans, src, "rag") for row, (ans, src) in zip(dataset, raw)]
annotate_exact_match(results)
await run_accuracy_judging(results, concurrency=args.judge_concurrency)

⚠️ Potential issue | 🟡 Minor

Default RAG mode skips OpenRAG health check.

The --oracle and --gold-workspaces modes verify OpenRAG is reachable before proceeding, but the default RAG mode (lines 812-821) does not. If OpenRAG is down, users will see confusing errors instead of a clear health check failure.

Suggested fix
     else:
         print(f"Evaluating {len(dataset)} questions on partition '{args.partition}' at {OPENRAG_BASE_URL}")
+        async with httpx.AsyncClient(timeout=60) as client:
+            if not await _check_openrag_health(client):
+                return
         sem = asyncio.Semaphore(args.concurrency)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automatic-evaluation-pipeline/eval_frames.py` around lines 812 - 821, Default
RAG path prints and begins querying OpenRAG without verifying the service is
reachable; add the same OpenRAG health check used by the other modes before the
print/async gather block so we fail fast with a clear error if the service is
down. Specifically, before printing the message that includes OPENRAG_BASE_URL
and before creating the semaphore and calling query_openrag_answer across the
dataset, invoke the existing OpenRAG health-check helper (the same call used in
--oracle/--gold-workspaces) and abort/raise/log a clear error on failure; then
proceed to create sem, call tqdm.gather(*[query_openrag_answer(...)]), build
results and run_accuracy_judging as before.

Comment on lines +155 to +156
if resp.status_code == 429 or resp.status_code >= 500:
wait = float(resp.headers.get("retry-after") or min(2 ** attempt + 1, 60))

⚠️ Potential issue | 🟡 Minor

retry-after header parsing may fail on non-numeric values.

The Retry-After header can be either a number of seconds or an HTTP-date string. Passing an HTTP-date to float() will raise ValueError.

Proposed fix
-                wait = float(resp.headers.get("retry-after") or min(2 ** attempt + 1, 60))
+                retry_after = resp.headers.get("retry-after")
+                try:
+                    wait = float(retry_after) if retry_after else min(2 ** attempt + 1, 60)
+                except ValueError:
+                    wait = min(2 ** attempt + 1, 60)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automatic-evaluation-pipeline/setup_frames.py` around lines 155 - 156, The
current calculation of wait uses float(resp.headers.get("retry-after")) which
will raise ValueError for HTTP-date values; update the logic where wait is set
(the block using resp, resp.headers, attempt) to: read the "Retry-After" header
into a variable, try to parse it as an integer/float first, and if that fails
parse it as an HTTP-date (e.g., with email.utils.parsedate_to_datetime) and
compute seconds until that date (clamped >=0); if parsing fails or header
missing, fall back to the existing exponential backoff expression min(2 **
attempt + 1, 60). Ensure the final wait remains a float and preserve existing
behavior for numeric headers.

Comment on lines +343 to +346
completed = sum(1 for r in results if r["status"] == "COMPLETED")
failed = sum(1 for r in results if r["status"] == "FAILED")
skipped = sum(1 for r in results if r["status"] == "skipped")
errors = sum(1 for r in results if r["status"] == "ERROR")

⚠️ Potential issue | 🟠 Major

Result status counting depends on corrected terminal states.

After fixing TERMINAL_STATES, ensure the status comparison here aligns. If the API returns "SUCCESS", the current check for "COMPLETED" won't match any results.

Proposed fix (after normalizing status in upload_and_track)

If you normalize status in upload_and_track to return "COMPLETED" for successful tasks, this code will work. Otherwise, update to:

-    completed = sum(1 for r in results if r["status"] == "COMPLETED")
+    completed = sum(1 for r in results if r["status"] in ("COMPLETED", "SUCCESS"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automatic-evaluation-pipeline/setup_frames.py` around lines 343 - 346, The
status counters rely on specific terminal strings but the API may return
different tokens (e.g., "SUCCESS" vs "COMPLETED"); update the counting in the
block that computes completed/failed/skipped/errors (which iterates over
results) to compare against the normalized status values produced by
upload_and_track or to accept both variants — for example change the completed
check to accept both "COMPLETED" and "SUCCESS" (and similarly align
failed/skipped/errors with any alternate API tokens) or, preferably, normalize
each result.status (via the same normalization logic used in
upload_and_track/TERMINAL_STATES) before counting so the sums reliably reflect
terminal states.
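
Normalizing once before counting, as the prompt prefers, keeps the alias handling in a single place. A sketch with an assumed alias table — the real terminal tokens depend on the OpenRAG API and should be checked against `TERMINAL_STATES` in the script:

```python
from collections import Counter

# Assumed alias map: the actual tokens OpenRAG returns may differ.
STATUS_ALIASES = {
    "SUCCESS": "COMPLETED",
    "COMPLETED": "COMPLETED",
    "FAILED": "FAILED",
    "ERROR": "ERROR",
    "SKIPPED": "SKIPPED",
}

def normalize_status(raw: str) -> str:
    """Map an API status token to a canonical terminal state."""
    return STATUS_ALIASES.get(raw, STATUS_ALIASES.get(raw.upper(), "UNKNOWN"))

def count_statuses(results: list[dict]) -> Counter:
    """Tally results by canonical status, tolerant of casing and synonyms."""
    return Counter(normalize_status(r["status"]) for r in results)

counts = count_statuses([
    {"status": "SUCCESS"},
    {"status": "COMPLETED"},
    {"status": "skipped"},
    {"status": "FAILED"},
])
```

With this in place the summary block reduces to `counts["COMPLETED"]`, `counts["FAILED"]`, and so on, and the lowercase-"skipped" vs uppercase mismatch flagged in the review disappears.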

