feat(eval): add FRAMES benchmark evaluation pipeline #305

Matteovanypersele wants to merge 1 commit into linagora:dev from
Conversation
📝 Walkthrough

This pull request introduces two new scripts for the FRAMES benchmark pipeline: `setup_frames.py` and `eval_frames.py`.
Sequence Diagram(s)

sequenceDiagram
participant User
participant SetupScript as setup_frames.py
participant HF as HuggingFace
participant WikiAPI as Wikipedia API
participant OpenRAG
User->>SetupScript: Run with --partition, --limit, --wiki-concurrency
SetupScript->>HF: Load FRAMES dataset (cached or fresh)
HF-->>SetupScript: Return test split
SetupScript->>WikiAPI: Download articles asynchronously<br/>(markdown/HTML/PDF with retries)
WikiAPI-->>SetupScript: Article content
SetupScript->>OpenRAG: Check health
OpenRAG-->>SetupScript: Health status
SetupScript->>OpenRAG: Create partition (if needed)
OpenRAG-->>SetupScript: Partition ready
SetupScript->>OpenRAG: Upload & track files concurrently<br/>(post then poll task_status_url)
OpenRAG-->>SetupScript: Terminal state (COMPLETED/FAILED/SKIPPED)
SetupScript->>User: Print aggregated results
sequenceDiagram
participant User
participant EvalScript as eval_frames.py
participant OpenRAG
participant LLM
participant Judge as LLM Judge
User->>EvalScript: Run with mode flag (--no-rag/--oracle/--gold-workspaces)
EvalScript->>EvalScript: Load dataset (cached)
alt Default RAG Mode
EvalScript->>OpenRAG: Query chat completions with question
OpenRAG->>LLM: Generate answer
LLM-->>OpenRAG: Answer + extra.sources
OpenRAG-->>EvalScript: Answer + chunk references
else No-RAG Mode
EvalScript->>LLM: Send question directly
LLM-->>EvalScript: Answer
else Oracle Mode
EvalScript->>EvalScript: Load gold article & extract context
EvalScript->>LLM: Send question + gold context
LLM-->>EvalScript: Answer
else Gold-Workspaces Mode
EvalScript->>OpenRAG: Create workspace per question
EvalScript->>OpenRAG: Attach gold file IDs to workspace
EvalScript->>OpenRAG: Query using workspace
OpenRAG-->>EvalScript: Answer
end
EvalScript->>EvalScript: Compute exact-match score
EvalScript->>Judge: Run accuracy judging (structured output)
Judge-->>EvalScript: {correct, justification}
EvalScript->>EvalScript: Aggregate by reasoning_type
EvalScript->>User: Save results + print summary
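The upload-and-poll step in the setup diagram above (post the file, then poll `task_status_url` until a terminal state) can be sketched as below. This is a minimal illustration, not the PR's implementation: the state names and the injected `get_status`/`sleep` callables are assumptions.

```python
import time

# Terminal states assumed from the diagram; the real API vocabulary may differ.
TERMINAL_STATES = {"COMPLETED", "FAILED", "SKIPPED"}


def poll_until_terminal(get_status, max_attempts=10, base_delay=1.0, sleep=time.sleep):
    """Poll a task-status callable with capped exponential backoff."""
    for attempt in range(max_attempts):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # Back off exponentially, never waiting more than 60 seconds.
        sleep(min(base_delay * 2 ** attempt, 60))
    return "TIMEOUT"
```

Injecting `sleep` keeps the loop testable without real delays.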
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes

The changes introduce two substantial, feature-complete scripts with heterogeneous logic patterns: async HTTP operations with exponential backoff across Wikipedia and OpenRAG APIs, dataset loading and caching, four distinct evaluation modes with different control flows, structured output parsing, and concurrent task management with polling. The interaction between external services (HuggingFace, Wikipedia, OpenRAG, LLM) and internal state handling requires careful verification of error handling, retry strategies, and metric computation accuracy.
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (3)
automatic-evaluation-pipeline/eval_frames.py (2)
125-146: Consider extracting shared utilities to reduce duplication.
`load_dataset_cached`, `extract_title_from_url`, `_safe_filename`, and the wiki link parsing are duplicated between `setup_frames.py` and `eval_frames.py`. Extracting these to a shared module (e.g., `frames_utils.py`) would improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@automatic-evaluation-pipeline/eval_frames.py` around lines 125 - 146, The same utilities (load_dataset_cached, extract_title_from_url, _safe_filename, and the wiki link parsing logic) are duplicated in setup_frames.py and eval_frames.py; consolidate them into a new shared module (e.g., frames_utils.py) that exports those functions, move the implementation there, update both files to import the functions (referencing load_dataset_cached, extract_title_from_url, _safe_filename and the wiki link parsing helper by name), and ensure any relative path logic (_default_dataset_cache usage) is either moved into the shared module or passed in as parameters so both callers work without duplication.
289-290: Creating a new HTTP client per question is inefficient.

Each call to `query_openrag_answer` creates a new `httpx.AsyncClient`, which incurs connection setup overhead. Consider passing a shared client or using a connection pool for better performance when evaluating 824 questions.

Suggested approach
Pass the client as a parameter:
```diff
 async def query_openrag_answer(
     question: str,
     partition: str,
     semaphore: asyncio.Semaphore,
+    client: httpx.AsyncClient,
     workspace: str | None = None,
 ) -> tuple[str | None, str]:
     """Call OpenRAG chat completions for a single question."""
     headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
     async with semaphore:
-        async with httpx.AsyncClient(timeout=300) as client:
```

Then create the client once in `main()` and pass it to all calls.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@automatic-evaluation-pipeline/eval_frames.py` around lines 289 - 290, The code creates a new httpx.AsyncClient inside query_openrag_answer each time (inside the async with block guarded by semaphore), which is inefficient; refactor query_openrag_answer to accept an optional shared_client parameter (or required client) and remove the internal "async with httpx.AsyncClient" so it uses the passed httpx.AsyncClient; then instantiate a single AsyncClient in main() (or the calling scope) and pass that client into all calls to query_openrag_answer (while keeping the existing semaphore usage intact) to enable connection reuse and pooling.

automatic-evaluation-pipeline/setup_frames.py (1)
43-47: Default AUTH_TOKEN exposes a hardcoded credential.

The default `AUTH_TOKEN = "sk-1234"` could accidentally be used in non-local environments if the `.env` file is missing or misconfigured. Consider failing explicitly if `AUTH_TOKEN` is not set when indexing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@automatic-evaluation-pipeline/setup_frames.py` around lines 43 - 47, The code currently provides a hardcoded default AUTH_TOKEN ("sk-1234"); remove the default so AUTH_TOKEN = os.environ.get("AUTH_TOKEN") (no fallback) and add an explicit check where indexing is initiated (or at module import if indexing runs on startup): if not AUTH_TOKEN: raise RuntimeError("AUTH_TOKEN must be set for indexing") or call sys.exit(1) with a clear message; keep APP_URL, APP_PORT and OPENRAG_BASE_URL unchanged but ensure any code that relies on AUTH_TOKEN uses the validated value.
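The fail-fast validation suggested above could look like this minimal sketch; `require_env` is a hypothetical helper name, and the exact exit message is an illustration, not the PR's code.

```python
import os
import sys


def require_env(name: str) -> str:
    """Return the value of a required environment variable (e.g. AUTH_TOKEN).

    Exits with a clear message instead of silently falling back to a
    hardcoded default such as "sk-1234".
    """
    value = os.environ.get(name)
    if not value:
        sys.exit(f"ERROR: {name} must be set (check your .env file)")
    return value
```

Calling `require_env("AUTH_TOKEN")` at the point where indexing starts surfaces a misconfigured environment immediately.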
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: abb66cbf-67a5-4a95-89a4-1ce44edcf361
📒 Files selected for processing (2)
- automatic-evaluation-pipeline/eval_frames.py
- automatic-evaluation-pipeline/setup_frames.py
```python
MODEL = os.environ.get("MODEL")
BASE_URL = os.environ.get("BASE_URL")
API_KEY = os.environ.get("API_KEY")
JUDGE_MODEL = os.environ.get("MODEL_JUDGE", MODEL)
JUDGE_BASE_URL = os.environ.get("BASE_URL_JUDGE", BASE_URL)
JUDGE_API_KEY = os.environ.get("API_KEY_JUDGE", API_KEY)
```
Missing validation for required LLM environment variables.
MODEL, BASE_URL, and API_KEY can be None if not set in .env. The --no-rag, --oracle, and judge functionality will fail with unclear errors when ChatOpenAI is instantiated with model=None.
Suggested fix
Add validation before using the LLM:
```python
def _require_llm_config() -> None:
    """Validate LLM environment variables are set."""
    missing = [
        name
        for name, val in [("MODEL", MODEL), ("BASE_URL", BASE_URL), ("API_KEY", API_KEY)]
        if not val
    ]
    if missing:
        raise SystemExit(f"ERROR: Missing required environment variables for LLM: {', '.join(missing)}")
```

Call this at the start of --no-rag and --oracle modes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/eval_frames.py` around lines 53 - 58, Add
validation to ensure required LLM env vars (MODEL, BASE_URL, API_KEY and their
judge variants) are present before any ChatOpenAI instantiation: implement a
helper like _require_llm_config() that checks MODEL, BASE_URL, API_KEY (and
optionally JUDGE_MODEL/JUDGE_BASE_URL/JUDGE_API_KEY) and raises a clear
SystemExit with the missing names if any are unset, and call this helper at the
start of the code paths for --no-rag and --oracle (and the judge initialization)
so ChatOpenAI is never created with model=None.
```python
async def add_files_to_workspace(
    client: httpx.AsyncClient,
    partition: str,
    workspace_id: str,
    file_ids: list[str],
) -> None:
    """Attach files to a workspace."""
    headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
    resp = await client.post(
        f"{OPENRAG_BASE_URL}/partition/{partition}/workspaces/{workspace_id}/files",
        json={"file_ids": file_ids},
        headers=headers,
    )
    resp.raise_for_status()
```
Missing error handling for non-existent files in workspace attachment.
add_files_to_workspace calls raise_for_status() which will raise on any error, but there's no specific handling if some gold files don't exist in the partition. This could cause the entire evaluation to fail if one article wasn't indexed.
Suggested improvement
```diff
 async def add_files_to_workspace(
     client: httpx.AsyncClient,
     partition: str,
     workspace_id: str,
     file_ids: list[str],
 ) -> None:
     """Attach files to a workspace."""
     headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
     resp = await client.post(
         f"{OPENRAG_BASE_URL}/partition/{partition}/workspaces/{workspace_id}/files",
         json={"file_ids": file_ids},
         headers=headers,
     )
-    resp.raise_for_status()
+    if resp.status_code not in (200, 201):
+        logger.warning(f"Failed to attach files to workspace '{workspace_id}': {resp.status_code} - {resp.text}")
+    resp.raise_for_status()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/eval_frames.py` around lines 362 - 376, The
add_files_to_workspace function currently calls resp.raise_for_status() and will
abort on any HTTP error; wrap the POST call in a try/except for
httpx.HTTPStatusError, catch failures that indicate non-existent/missing files
(e.g., status 404 or 400) and inspect resp.json() for a field indicating missing
file ids (e.g., "missing_files" or similar), log which file_ids are missing and
either filter them out and retry a POST with the remaining ids or
return/propagate a controlled result instead of raising; update
add_files_to_workspace to handle missing files gracefully (log missing ids and
continue) and reference OPENRAG_BASE_URL, AUTH_TOKEN, and the
add_files_to_workspace function when making changes.
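The filter-and-retry behaviour described in the prompt above can be sketched transport-agnostically. This is an illustration only: `post` stands in for the real HTTP call, and the `"missing_files"` field name is a hypothetical API shape, not something the PR's OpenRAG server is known to return.

```python
def attach_files_resilient(post, file_ids):
    """Attach files, tolerating ids the server reports as missing.

    `post` is any callable taking a list of ids and returning
    (status_code, body_dict). Returns the ids actually attached.
    """
    status, body = post(file_ids)
    if status in (200, 201):
        return list(file_ids)
    if status in (400, 404):
        # Hypothetical error shape: the body lists the unknown ids.
        missing = set(body.get("missing_files", []))
        remaining = [f for f in file_ids if f not in missing]
        if remaining and missing:
            # Retry once with only the ids the server knows about.
            status, _ = post(remaining)
            if status in (200, 201):
                return remaining
        return []
    raise RuntimeError(f"attach failed: HTTP {status}")
```

Because the transport is injected, the logic is unit-testable with a fake `post` and can later be wired to the real `httpx` call.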
```python
else:
    print(f"Evaluating {len(dataset)} questions on partition '{args.partition}' at {OPENRAG_BASE_URL}")
    sem = asyncio.Semaphore(args.concurrency)
    raw = await tqdm.gather(
        *[query_openrag_answer(row["Prompt"], args.partition, sem) for row in dataset],
        desc="Querying OpenRAG",
    )
    results = [_build_result(row, ans, src, "rag") for row, (ans, src) in zip(dataset, raw)]
    annotate_exact_match(results)
    await run_accuracy_judging(results, concurrency=args.judge_concurrency)
```
Default RAG mode skips OpenRAG health check.
The --oracle and --gold-workspaces modes verify OpenRAG is reachable before proceeding, but the default RAG mode (lines 812-821) does not. If OpenRAG is down, users will see confusing errors instead of a clear health check failure.
Suggested fix
```diff
 else:
     print(f"Evaluating {len(dataset)} questions on partition '{args.partition}' at {OPENRAG_BASE_URL}")
+    async with httpx.AsyncClient(timeout=60) as client:
+        if not await _check_openrag_health(client):
+            return
     sem = asyncio.Semaphore(args.concurrency)
```
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/eval_frames.py` around lines 812 - 821, Default
RAG path prints and begins querying OpenRAG without verifying the service is
reachable; add the same OpenRAG health check used by the other modes before the
print/async gather block so we fail fast with a clear error if the service is
down. Specifically, before printing the message that includes OPENRAG_BASE_URL
and before creating the semaphore and calling query_openrag_answer across the
dataset, invoke the existing OpenRAG health-check helper (the same call used in
--oracle/--gold-workspaces) and abort/raise/log a clear error on failure; then
proceed to create sem, call tqdm.gather(*[query_openrag_answer(...)]), build
results and run_accuracy_judging as before.
```python
if resp.status_code == 429 or resp.status_code >= 500:
    wait = float(resp.headers.get("retry-after") or min(2 ** attempt + 1, 60))
```
retry-after header parsing may fail on non-numeric values.
The Retry-After header can be either a number of seconds or an HTTP-date string. Passing an HTTP-date to float() will raise ValueError.
Proposed fix
```diff
-    wait = float(resp.headers.get("retry-after") or min(2 ** attempt + 1, 60))
+    retry_after = resp.headers.get("retry-after")
+    try:
+        wait = float(retry_after) if retry_after else min(2 ** attempt + 1, 60)
+    except ValueError:
+        wait = min(2 ** attempt + 1, 60)
```
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/setup_frames.py` around lines 155 - 156, The
current calculation of wait uses float(resp.headers.get("retry-after")) which
will raise ValueError for HTTP-date values; update the logic where wait is set
(the block using resp, resp.headers, attempt) to: read the "Retry-After" header
into a variable, try to parse it as an integer/float first, and if that fails
parse it as an HTTP-date (e.g., with email.utils.parsedate_to_datetime) and
compute seconds until that date (clamped >=0); if parsing fails or header
missing, fall back to the existing exponential backoff expression min(2 **
attempt + 1, 60). Ensure the final wait remains a float and preserve existing
behavior for numeric headers.
```python
completed = sum(1 for r in results if r["status"] == "COMPLETED")
failed = sum(1 for r in results if r["status"] == "FAILED")
skipped = sum(1 for r in results if r["status"] == "skipped")
errors = sum(1 for r in results if r["status"] == "ERROR")
```
Result status counting depends on corrected terminal states.
After fixing TERMINAL_STATES, ensure the status comparison here aligns. If the API returns "SUCCESS", the current check for "COMPLETED" won't match any results.
Proposed fix (after normalizing status in upload_and_track)
If you normalize status in upload_and_track to return "COMPLETED" for successful tasks, this code will work. Otherwise, update to:
```diff
-completed = sum(1 for r in results if r["status"] == "COMPLETED")
+completed = sum(1 for r in results if r["status"] in ("COMPLETED", "SUCCESS"))
```
Verify each finding against the current code and only fix it if needed.
In `@automatic-evaluation-pipeline/setup_frames.py` around lines 343 - 346, The
status counters rely on specific terminal strings but the API may return
different tokens (e.g., "SUCCESS" vs "COMPLETED"); update the counting in the
block that computes completed/failed/skipped/errors (which iterates over
results) to compare against the normalized status values produced by
upload_and_track or to accept both variants — for example change the completed
check to accept both "COMPLETED" and "SUCCESS" (and similarly align
failed/skipped/errors with any alternate API tokens) or, preferably, normalize
each result.status (via the same normalization logic used in
upload_and_track/TERMINAL_STATES) before counting so the sums reliably reflect
terminal states.
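The normalization approach the prompt above prefers could be sketched as follows. The alias map is an assumption about which tokens the API may emit (e.g. "SUCCESS" vs "COMPLETED"), not a documented vocabulary.

```python
from collections import Counter

# Assumed alias table; extend it as the real API's tokens are observed.
_STATUS_ALIASES = {
    "SUCCESS": "COMPLETED",
    "COMPLETED": "COMPLETED",
    "FAILED": "FAILED",
    "FAILURE": "FAILED",
    "SKIPPED": "SKIPPED",
    "ERROR": "ERROR",
}


def normalize_status(raw):
    """Map heterogeneous terminal-state tokens onto one vocabulary."""
    return _STATUS_ALIASES.get((raw or "").strip().upper(), "UNKNOWN")


def count_statuses(results):
    """Count results per normalized status; results is a list of dicts
    with a "status" key, as in the snippet above."""
    return Counter(normalize_status(r["status"]) for r in results)
```

Counting over normalized values makes the summary robust to casing and synonym drift in the API's responses.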
Summary
This PR adds a FRAMES benchmark evaluation pipeline under `automatic-evaluation-pipeline/` (not sure if it's the best place).

- `setup_frames.py`: downloads the Wikipedia articles referenced by FRAMES in `md`, `pdf`, or `html` format, then uploads and indexes them into an OpenRAG partition
- `eval_frames.py`: runs FRAMES evaluation over the 824-question benchmark with LLM-judge scoring and a fuzzy exact match

Evaluation modes

- `--no-rag`: bypass retrieval and query the model directly without chunks
- `--oracle`: bypass retrieval and provide the gold Wikipedia articles directly to the model (heavily impacted by the 8192-token limit)
- `--gold-workspaces`: create one workspace per question and attach only the gold files
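A "fuzzy exact match" as described above could be sketched as a normalized string comparison. What exactly the PR normalizes is not shown here, so the case/punctuation/whitespace folding below is an assumption for illustration.

```python
import re
import string


def normalized_exact_match(prediction: str, gold: str) -> bool:
    """Case-, punctuation-, and whitespace-insensitive comparison."""
    def norm(s: str) -> str:
        # Lowercase, strip punctuation, collapse runs of whitespace.
        s = s.lower().translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", s).strip()
    return norm(prediction) == norm(gold)
```

Pairing a lenient check like this with an LLM judge lets the exact-match score act as a cheap lower bound on accuracy.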