4 changes: 4 additions & 0 deletions benchmarks/prompt_eval/.env.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@

BASE_URLs=https://example/v1/;https://example2/v1/;
API_KEYs=sk-api_key1;sk-api_key2;
MODELs=model1;model2;
Comment on lines +2 to +4
@coderabbitai coderabbitai bot Apr 21, 2026

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify all env var references before deciding whether to rename or suppress the lint rule.
rg -nP '\b(BASE_URLs|API_KEYs|MODELs|BASE_URLS|API_KEYS|MODELS)\b' -C2

Repository: linagora/openrag

Length of output: 9750


Normalize these env var names to uppercase or suppress the dotenv-linter warning.

BASE_URLs, API_KEYs, and MODELs are mixed-case, which dotenv-linter flags. These environment variables are hardcoded in both benchmarks/prompt_eval/eval_query_decomposition.py (lines 62-64) and benchmarks/prompt_eval/eval_temporal_filter_generation.py (lines 83-85). Renaming them to BASE_URLS, API_KEYS, and MODELS requires updating all three locations in both evaluator scripts simultaneously; changing only the .env.example will break configuration loading. Alternatively, add a .dotenv-linter.json rule to suppress the mixed-case warning for these specific keys.

🧰 Tools
🪛 dotenv-linter (4.0.0)

[warning] 2-2: [LowercaseKey] The BASE_URLs key should be in uppercase

(LowercaseKey)


[warning] 3-3: [LowercaseKey] The API_KEYs key should be in uppercase

(LowercaseKey)


[warning] 3-3: [UnorderedKey] The API_KEYs key should go before the BASE_URLs key

(UnorderedKey)


[warning] 4-4: [LowercaseKey] The MODELs key should be in uppercase

(LowercaseKey)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/.env.example` around lines 2-4, the env var names
BASE_URLs, API_KEYs, and MODELs in .env.example are mixed-case and trigger
dotenv-linter; either normalize them to uppercase across all usages or add a
linter exception: Option A — rename the three variables to BASE_URLS, API_KEYS,
and MODELS and update every reference in eval_query_decomposition.py (references
to BASE_URLs/API_KEYs/MODELs) and eval_temporal_filter_generation.py (same three
references) so they match; Option B — add a .dotenv-linter.json rule to ignore
these specific keys (BASE_URLs, API_KEYs, MODELs) to suppress the warning if you
want to keep the current names.

Collaborator


I agree with the bot


🧩 Analysis chain

@paultranvan, great — here's the plan to rename all three variables to proper uppercase across every affected file.

  • Rename BASE_URLs → BASE_URLS, API_KEYs → API_KEYS, MODELs → MODELS in .env.example
  • Update all references in eval_query_decomposition.py
  • Update all references in eval_temporal_filter_generation.py

🏁 Script executed:

#!/bin/bash
rg -nP '\b(BASE_URLs|API_KEYs|MODELs)\b' benchmarks/prompt_eval/eval_query_decomposition.py -C3

Length of output: 1070


🏁 Script executed:

#!/bin/bash
rg -nP '\b(BASE_URLs|API_KEYs|MODELs)\b' benchmarks/prompt_eval/eval_temporal_filter_generation.py -C3

Length of output: 1031


@paultranvan, here are the complete diffs for all three files:

benchmarks/prompt_eval/.env.example

-BASE_URLs=https://example/v1/;https://example2/v1/;
-API_KEYs=sk-api_key1;sk-api_key2;
-MODELs=model1;model2;
+BASE_URLS=https://example/v1/;https://example2/v1/;
+API_KEYS=sk-api_key1;sk-api_key2;
+MODELS=model1;model2;

benchmarks/prompt_eval/eval_query_decomposition.py

-    BASE_URLs, API_KEYs, MODELs
+    BASE_URLS, API_KEYS, MODELS

-    base_urls = _parse_env_list("BASE_URLs")
-    api_keys = _parse_env_list("API_KEYs")
-    models = _parse_env_list("MODELs")
+    base_urls = _parse_env_list("BASE_URLS")
+    api_keys = _parse_env_list("API_KEYS")
+    models = _parse_env_list("MODELS")

-        f"BASE_URLs ({len(base_urls)}), API_KEYs ({len(api_keys)}), and MODELs ({len(models)}) "
+        f"BASE_URLS ({len(base_urls)}), API_KEYS ({len(api_keys)}), and MODELS ({len(models)}) "

-        print("No models configured. Set BASE_URLs/API_KEYs/MODELs in the environment.")
+        print("No models configured. Set BASE_URLS/API_KEYS/MODELS in the environment.")

benchmarks/prompt_eval/eval_temporal_filter_generation.py

-    BASE_URLs, API_KEYs, MODELs
+    BASE_URLS, API_KEYS, MODELS

-    base_urls = _parse_env_list("BASE_URLs")
-    api_keys = _parse_env_list("API_KEYs")
-    models = _parse_env_list("MODELs")
+    base_urls = _parse_env_list("BASE_URLS")
+    api_keys = _parse_env_list("API_KEYS")
+    models = _parse_env_list("MODELS")

-        f"BASE_URLs ({len(base_urls)}), API_KEYs ({len(api_keys)}), and MODELs ({len(models)}) "
+        f"BASE_URLS ({len(base_urls)}), API_KEYS ({len(api_keys)}), and MODELS ({len(models)}) "

-        print("No models configured. Set BASE_URLs/API_KEYs/MODELs in the environment.")
+        print("No models configured. Set BASE_URLS/API_KEYS/MODELS in the environment.")

The critical lines are the _parse_env_list(...) calls (lines 62–64 and 83–85) — those are what actually read from the environment, so they must match exactly. The docstring and error-message lines are updated for consistency.


🧠 Learnings used
Learnt from: Ahmath-Gadji
Repo: linagora/openrag PR: 284
File: openrag/config/models.py:422-422
Timestamp: 2026-04-21T14:51:44.809Z
Learning: In openrag/config/models.py, `_BaseRetrieverConfig.allow_filterless_fallback` defaults to `True` intentionally. The design is recall-friendly by default: when a temporally-filtered search returns zero documents, the pipeline retries without the filter. Deployments that require strict temporal filtering are expected to explicitly set `RETRIEVER_ALLOW_FILTERLESS_FALLBACK=false`. Do not flag this default as incorrect during reviews.

  • 📌 Commit the changes to this branch

63 changes: 63 additions & 0 deletions benchmarks/prompt_eval/Report_query_decomposition.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
# Query Reformulation Prompt Evaluation

**Dataset:** [datasets/query_decomposition.json](datasets/query_decomposition.json) (80 cases — 20 D1, 40 D2, 20 D3)
**Scope:** reformulation + decomposition only. Temporal filters are out of scope (see [Report_temporal_filter.md](Report_temporal_filter.md)).

## Dataset

| Tier | # Cases | Description |
|------|--------:|-------------|
| **D1** | 20 | Standalone queries that do **not** need decomposition. |
| **D2** | 40 | Queries that **definitely need** decomposition, 2 to 4 sub-queries. Clear signals: distinct entities, distinct time periods, unrelated dimensions, or exclusions. |
| **D3** | 20 | **Ambiguous** queries. Surface features suggest decomposition (conjunctions, comparisons) but the semantics may require one retrieval or several — e.g. trends, interactions, joint effects, multi-attribute-for-one-subject. |

20 domains (finance, healthcare, legal, engineering, science, HR, education, marketing, real_estate, technology, environment, logistics, agriculture, energy, manufacturing, public_policy, retail, telecommunications, insurance, aviation), each represented 3–5 times across tiers.

Gold labels live in `expected_queries` (shape: `SearchQueries`, so it deserializes directly into the pipeline's Pydantic model). Relative dates in `query` strings are pre-resolved to the dataset's current date (2026-04-17); `temporal_filters` is null throughout.
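For orientation, a minimal stdlib approximation of the `SearchQueries` shape this report relies on (the real pipeline model is Pydantic; only the field names `query_list` and `temporal_filters` are attested in this report, everything else is assumed):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchQueries:
    # Hypothetical stand-in for the pipeline's Pydantic model.
    query_list: list[str] = field(default_factory=list)
    # Null throughout this dataset; exercised in the temporal-filter eval.
    temporal_filters: Optional[list[dict]] = None
```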

## Metrics

1. **decomposition_count_matching** — `len(generated.query_list) == len(expected.query_list)`.
2. **decomposition_semantic_coverage** — LLM-as-judge boolean: do the generated sub-queries, taken together, cover every expected sub-query (count-insensitive, order-insensitive)? When coverage is incomplete, the judge returns a short reasoning naming the missing expected sub-query.

Reported slices: overall and per-difficulty (D1 / D2 / D3).

## Results

Full raw output: [results/result_query_decomposition.json](results/result_query_decomposition.json). Judge: `Qwen3-VL-8B-Instruct-FP8`.

**Overall**

| Prompt | Model | count_match | semantic_coverage |
|---|---|---|---|
| v0 | Mistral-Small-3.1-24B-Instruct-2503 | 66/80 (82.5%) | 74/80 (92.5%) |
| v0 | Qwen3-VL-8B-Instruct-FP8 | 58/80 (72.5%) | 69/80 (86.2%) |
| **v1** | **Mistral-Small-3.1-24B-Instruct-2503** | **69/80 (86.2%)** | **76/80 (95.0%)** |
| v1 | Qwen3-VL-8B-Instruct-FP8 | 66/80 (82.5%) | 71/80 (88.8%) |

**Per-difficulty (count_match · semantic_coverage)**

| Prompt | Model | D1 (n=20) | D2 (n=40) | D3 (n=20) |
|---|---|---|---|---|
| v0 | Mistral-Small | 19/20 · 20/20 | 39/40 · 38/40 | **8/20** · 16/20 |
| v0 | Qwen3-VL-8B | 20/20 · 20/20 | 28/40 · 35/40 | 10/20 · 14/20 |
| v1 | Mistral-Small | 16/20 · 20/20 | 38/40 · 38/40 | **15/20** · 18/20 |
| v1 | Qwen3-VL-8B | 20/20 · 20/20 | 35/40 · 36/40 | 11/20 · 15/20 |

### v0 vs v1 — Mistral-Small

- **v1 handles D3 much better than v0** (D3 count_match 8/20 → 15/20): v0 splits "interaction / joint-effect / multi-attribute-for-one-subject" questions that should stay as one (id 68, 69, 71, 75, 79, 80), while v1 keeps them together. v1 also correctly splits the two bounded-range trend cases (id 61, 65) per the "evolution / trend over a bounded range" rule.
- **v1 is slightly weaker on D1** (count_match 19/20 → 16/20). Chat-history turns pull prior-turn topics into the reformulation and trigger spurious splits (id 2, 5, 11).
- **Shared failures on both prompts**: id 30 (Lambda/GCF collapsed), id 63 (Kafka/RabbitMQ — prior gRPC turn leaks in), id 66 (HR onboarding under-split), id 70 (air/sea freight), id 74 (EASA/FAA under-split).

### v0 vs v1 — Qwen3-VL-8B

- v1 gains on count_match (+8) and coverage (+2): mostly D2 (28 vs 35) as the explicit split rules embolden Qwen to separate multi-entity/region/time-period queries it previously collapsed.
- Qwen still under-splits comparative questions and misses both bounded-range trend cases (id 61, 65) — it keeps them as one query despite the rule.

### Common hard cases (both models, both prompts)

- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
- **id 70** — "air freight vs sea freight": emitted as a single comparison instead of two independent lookups.
- **id 74** — "EASA vs FAA certification": same pattern.
78 changes: 78 additions & 0 deletions benchmarks/prompt_eval/Report_temporal_filter.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# Temporal Filter Generation Prompt Evaluation (v2)

**Dataset:** [datasets/temporal_filter.json](datasets/temporal_filter.json) (40 cases — 20 positive, 20 negative)
**Scope:** whether the model emits `temporal_filters` when (and only when) it should, and whether the emitted predicates are correct. Decomposition is out of scope (see [Report_query_decomposition.md](Report_query_decomposition.md)).

## Dataset

Each case has the minimal schema `{id, messages, query_with_temporal_filter}`:

| Class | # Cases | Description |
|------|--------:|-------------|
| **Positive** (`true`) | 20 | User restricts by document creation/authoring/publication date. Covers all resolution rules: today, yesterday, this/last week, this/last month, this/last year, past N days/weeks/months, recent/latest, since X, before X, bare MONTH, in YEAR, specific date, exclusion, multi-entity with shared time, plus multi-turn context (3- and 5-message). |
| **Negative** (`false`) | 20 | Filter must be null. Three sub-patterns: (a) dates that describe the topic/subject ("2024 sustainability report", "trends 2020→2025", "2016 US election"); (b) no temporal reference (policy, how-to, trivia); (c) conversational fillers (greetings, thanks). Includes a 5-message negative where the last turn pivots to a pure topic question. |

Document types and verbs are varied on purpose — design specs, incident reports, PRs, commits, lab results, audit logs, invoices, legal briefs, slide decks, meeting minutes, safety bulletins, etc. — so the evaluation does not reduce to "the model learned the word *uploaded*".
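A hypothetical positive case in this schema (field names follow the minimal schema above; the content is invented for illustration):

```python
# Illustrative only: a positive case where the user restricts by
# document creation date, so the expected label is True.
case = {
    "id": 999,  # hypothetical id, not in the dataset
    "messages": [
        {"role": "user",
         "content": "Show me incident reports filed in the past 30 days."}
    ],
    "query_with_temporal_filter": True,
}
```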

## Metrics

Positive class = a filter **was** / **should have been** emitted.

1. **filter_detection_accuracy** — (TP + TN) / N.
2. **filter_detection_precision** — TP / (TP + FP). How often an emitted filter was actually wanted.
3. **filter_detection_recall** — TP / (TP + FN). How often a wanted filter was actually emitted.
4. **filter_detection_f1** — harmonic mean of the two.
5. **filter_correctness** — LLM-as-judge boolean on TP cases only: given the chat history, current date, and generator output JSON, are the predicates correct as a whole (operator, field, ISO values, closed-vs-open intervals, exclusion split)?

Judge is invoked **only on TP** (filter expected and emitted). Precision/recall capture the detection decision; correctness captures the filter body.
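The four detection metrics can be reproduced directly from the confusion counts in the results table; a small sketch:

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """filter_detection_* metrics from a detection confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, v0/Mistral's 10 / 0 / 10 / 20 row yields 75.0% accuracy, 100.0% precision, 50.0% recall, and 66.7% F1, matching the table below.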

## Results

Full raw output: [results/result_filter_generation.json](results/result_filter_generation.json). Judge: `Qwen3-VL-8B-Instruct-FP8`. Current date at eval time: Sunday, April 19, 2026.

**Overall**

| Prompt | Model | Acc | Precision | Recall | F1 | TP / FP / FN / TN | filter_correctness (TP only) |
|---|---|---:|---:|---:|---:|:-:|---:|
| v0 | Mistral-Small-3.1-24B-Instruct-2503 | 75.0% | 100.0% | 50.0% | 66.7% | 10 / 0 / 10 / 20 | 8/10 (80.0%) |
| v0 | Qwen3-VL-8B-Instruct-FP8 | 57.5% | 100.0% | 15.0% | 26.1% | 3 / 0 / 17 / 20 | 2/3 (66.7%) |
| **v1** | **Mistral-Small-3.1-24B-Instruct-2503** | **100.0%** | **100.0%** | **100.0%** | **100.0%** | 20 / 0 / 0 / 20 | **18/20 (90.0%)** |
| v1 | Qwen3-VL-8B-Instruct-FP8 | 94.9% | 90.5% | 100.0% | 95.0% | 19 / 2 / 0 / 18 | 17/19 (89.5%) |

(Qwen/v1 had 1 judge-call error on id 8 — excluded from the denominator.)

### v0 vs v1

v0 contains no `temporal_filters` rules, so both models default to "no filter" and collapse on recall (Mistral 50%, Qwen 15%). v1's explicit resolution table brings both to 100% recall. No false positives under v0 — the cost of v0 is recall only, not precision.

Exclusion (id 14) illustrates the v0 body failure: without a rule, Mistral emits three contradictory predicates (`>= 2025-04-19 AND < 2025-03-01 AND >= 2025-04-…`); Qwen collapses to the full year including March. v1 fixes this for Mistral.

### v1 — Mistral-Small (winner)

Perfect detection (20/20 TP, 0/20 FP). Two judge-rejected filter bodies among TP:

- **id 20 "Latest safety bulletins"** → emitted a 12-month window `[2025-07-20, 2026-07-20)` extending into the future. Prompt rule is `recent/latest → past 90 days`, expected `[2026-01-19, 2026-04-20)`. Real generator bug.
- **id 18 "Commits pushed since last Monday"** → added a spurious upper bound `< 2026-04-19`. Prompt rule is `since X → one predicate >= X`. Real generator bug.
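The `recent/latest → past 90 days` rule behind the id 20 rejection resolves mechanically; a sketch of the implied window computation, not the generator's actual code (the half-open upper bound at the day after the current date is inferred from the expected interval `[2026-01-19, 2026-04-20)`):

```python
from datetime import date, timedelta

def recent_window(current: date, days: int = 90) -> tuple[date, date]:
    # "recent/latest -> past N days": closed lower bound N days back,
    # open upper bound starting the day after the current date.
    return current - timedelta(days=days), current + timedelta(days=1)
```

With the eval's current date of 2026-04-19 this gives `[2026-01-19, 2026-04-20)`, the window the judge expected, rather than the future-extending 12-month window the generator emitted.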

### v1 — Qwen3-VL-8B

Perfect recall but 2 false positives where a year describes the **content**, not the document creation date:

| id | Query | Bad filter |
|---|---|---|
| 30 | "Findings in the 2024 annual sustainability report." | `created_at ∈ [2024-01-01, 2025-01-01)` |
| 34 | "Effects of climate change on Arctic sea ice between 2010 and 2020." | `created_at ∈ [2010-01-01, 2021-01-01)` |

Cause: v1's topic-vs-creation section is short and its single null example ("Q3 2024 reporting template") does not cover research/historical framings with spanning year ranges.

Two judge-rejected filter bodies among TP:

- **id 18 "Commits pushed since last Monday"** → added a spurious upper bound `< 2026-04-20`. Prompt rule is `since X → one predicate >= X`. Correct rejection.
- **id 9 "Recent SRE incident reports"** → emitted past 10 days instead of past 90. Correct rejection.

## Recommendations

1. **Ship v1 + Mistral-Small-24B as the production pairing** — 100% detection, 90% filter correctness, with the two remaining body bugs on "latest" and "since X" worth a targeted prompt tweak.
2. **Patch v1's topic-vs-creation section.** Add null examples for the two patterns Qwen still trips on: `"findings in the YEAR report"` and `"events in YEAR"` / `"between YEAR1 and YEAR2"` when the year is the subject.
3. **Reinforce `since X` and `latest` rules.** Both Mistral and Qwen emit an unwanted upper bound on "since last Monday"; Mistral emits an over-wide, future-extending window for "latest". A short prompt clarification or an additional example should eliminate these.
4. **Upgrade the judge.** Qwen3-VL-8B as judge is flaky (one call returned `None` on id 8) and occasionally misreads the prompt's resolution rules. A stronger judge would tighten the correctness metric.