# Step 3/3: Feat/temporal filters in rag pipeline (#284)
Ahmath-Gadji wants to merge 4 commits into `dev` from `feat/temporal_filters_in_rag_pipeline`.
**Commits (4, all by Ahmath-Gadji):**

- `61d34ba` chore: adding query decomposition evaluation
- `240f1c1` chore: adding temporal filtering evaluation
- `34ec1e7` feat(rag): use json_schema + function_calling for reliable structured…
- `4f47041` Merge pull request #328 from linagora/feat/fallback_for_tool_calling
**`.env.example`** (new file):

```
BASE_URLs=https://example/v1/;https://example2/v1/;
API_KEYs=sk-api_key1;sk-api_key2;
MODELs=model1;model2;
```
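Each variable holds a semicolon-separated list so several backends can be evaluated in one run. A minimal sketch of how the evaluator scripts' `_parse_env_list` helper might split these values (the function body here is an assumption, not copied from the repo):

```python
import os

def parse_env_list(name: str) -> list[str]:
    """Split a semicolon-separated env var into a list, dropping empties.

    Hypothetical stand-in for the repo's `_parse_env_list` helper.
    """
    return [item.strip() for item in os.environ.get(name, "").split(";") if item.strip()]

os.environ["BASE_URLs"] = "https://example/v1/;https://example2/v1/;"
print(parse_env_list("BASE_URLs"))  # ['https://example/v1/', 'https://example2/v1/']
```

The trailing `;` in the example file produces an empty trailing element, which the filter drops.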
**`Report_query_decomposition.md`** (new file):
# Query Reformulation Prompt Evaluation

**Dataset:** [datasets/query_decomposition.json](datasets/query_decomposition.json) (80 cases — 20 D1, 40 D2, 20 D3)
**Scope:** reformulation + decomposition only. Temporal filters are out of scope (see [Report_temporal_filter.md](Report_temporal_filter.md)).
## Dataset

| Tier | # Cases | Description |
|------|--------:|-------------|
| **D1** | 20 | Standalone queries that do **not** need decomposition. |
| **D2** | 40 | Queries that **definitely need** decomposition, 2 to 4 sub-queries. Clear signals: distinct entities, distinct time periods, unrelated dimensions, or exclusions. |
| **D3** | 20 | **Ambiguous** queries. Surface features suggest decomposition (conjunctions, comparisons) but the semantics may require one retrieval or several — e.g. trends, interactions, joint effects, multi-attribute-for-one-subject. |

20 domains (finance, healthcare, legal, engineering, science, HR, education, marketing, real_estate, technology, environment, logistics, agriculture, energy, manufacturing, public_policy, retail, telecommunications, insurance, aviation), each represented 3–5 times across tiers.
Gold labels live in `expected_queries` (shape: `SearchQueries`, so it deserializes directly into the pipeline's Pydantic model). Relative dates in `query` strings are pre-resolved to the dataset's current date (2026-04-17); `temporal_filters` is null throughout.
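For illustration, a stand-in with the two field names this report references (`query_list` and `temporal_filters`) might look like the following. The real `SearchQueries` is a Pydantic model whose full field set is not shown here, so this stdlib dataclass is only a sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchQueries:
    """Stdlib stand-in for the pipeline's Pydantic `SearchQueries` model.

    Field names come from this report; the actual model may differ.
    """
    query_list: list          # decomposed sub-queries
    temporal_filters: Optional[list] = None  # null throughout this dataset

gold = SearchQueries(query_list=["sub-query A", "sub-query B"])
print(len(gold.query_list), gold.temporal_filters)  # 2 None
```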
## Metrics

1. **decomposition_count_matching** — `len(generated.query_list) == len(expected.query_list)`.
2. **decomposition_semantic_coverage** — LLM-as-judge boolean: do the generated sub-queries, taken together, cover every expected sub-query (count-insensitive, order-insensitive)? When coverage is incomplete, the judge returns a short reasoning naming the missing expected sub-query.

Reported slices: overall and per-difficulty (D1 / D2 / D3).
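Metric 1 and the per-difficulty slicing can be sketched as follows; the scorer and the `difficulty` bookkeeping are hypothetical, since only the comparison rule itself is stated in the report:

```python
# Hypothetical scorer for decomposition_count_matching plus slice aggregation.
def count_match(generated: list[str], expected: list[str]) -> bool:
    # The report's rule: generated and expected sub-query counts must agree.
    return len(generated) == len(expected)

def slice_scores(results: list[tuple[str, bool]]) -> dict[str, str]:
    # results: (difficulty tier, pass/fail) pairs; returns "passed/total" per tier.
    by_tier: dict[str, list[bool]] = {}
    for tier, ok in results:
        by_tier.setdefault(tier, []).append(ok)
    return {t: f"{sum(v)}/{len(v)}" for t, v in by_tier.items()}

print(slice_scores([("D1", True), ("D1", False), ("D2", True)]))
# {'D1': '1/2', 'D2': '1/1'}
```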
## Results

Full raw output: [results/result_query_decomposition.json](results/result_query_decomposition.json). Judge: `Qwen3-VL-8B-Instruct-FP8`.

**Overall**

| Prompt | Model | count_match | semantic_coverage |
|---|---|---|---|
| v0 | Mistral-Small-3.1-24B-Instruct-2503 | 66/80 (82.5%) | 74/80 (92.5%) |
| v0 | Qwen3-VL-8B-Instruct-FP8 | 58/80 (72.5%) | 69/80 (86.2%) |
| **v1** | **Mistral-Small-3.1-24B-Instruct-2503** | **69/80 (86.2%)** | **76/80 (95.0%)** |
| v1 | Qwen3-VL-8B-Instruct-FP8 | 66/80 (82.5%) | 71/80 (88.8%) |
**Per-difficulty (count_match · semantic_coverage)**

| Prompt | Model | D1 (n=20) | D2 (n=40) | D3 (n=20) |
|---|---|---|---|---|
| v0 | Mistral-Small | 19/20 · 20/20 | 39/40 · 38/40 | **8/20** · 16/20 |
| v0 | Qwen3-VL-8B | 20/20 · 20/20 | 28/40 · 35/40 | 10/20 · 14/20 |
| v1 | Mistral-Small | 16/20 · 20/20 | 38/40 · 38/40 | **15/20** · 18/20 |
| v1 | Qwen3-VL-8B | 20/20 · 20/20 | 35/40 · 36/40 | 11/20 · 15/20 |

### v0 vs v1 — Mistral-Small

- **v1 handles D3 much better than v0** (D3 count_match: v0 8/20 vs v1 15/20): v0 splits "interaction / joint-effect / multi-attribute-for-one-subject" questions that should stay as one (id 68, 69, 71, 75, 79, 80), while v1 keeps them together. v1 also correctly splits the two bounded-range trend cases (id 61, 65) per the "evolution / trend over a bounded range" rule.
- **v1 is slightly weaker on D1** (v0 19/20 vs v1 16/20). Chat-history turns pull prior-turn topics into the reformulation and trigger spurious splits (id 2, 5, 11).
- **Shared failures on both prompts**: id 30 (Lambda/GCF collapsed), id 63 (Kafka/RabbitMQ — prior gRPC turn leaks in), id 66 (HR onboarding under-split), id 70 (air/sea freight), id 74 (EASA/FAA under-split).
### v0 vs v1 — Qwen3-VL-8B

- v1 gains on count_match (+8) and coverage (+2): mostly on D2 (v0 28/40 vs v1 35/40), as the explicit split rules embolden Qwen to separate multi-entity/region/time-period queries it previously collapsed.
- Qwen still under-splits comparative questions and misses both bounded-range trend cases (id 61, 65) — it keeps them as one query despite the rule.

### Common hard cases (both models, both prompts)

- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
- **id 70** — "air freight vs sea freight": emitted as a single comparison instead of two independent lookups.
- **id 74** — "EASA vs FAA certification": same pattern.
**`Report_temporal_filter.md`** (new file):
# Temporal Filter Generation Prompt Evaluation (v2)

**Dataset:** [datasets/temporal_filter.json](datasets/temporal_filter.json) (40 cases — 20 positive, 20 negative)
**Scope:** whether the model emits `temporal_filters` when (and only when) it should, and whether the emitted predicates are correct. Decomposition is out of scope (see [Report_query_decomposition.md](Report_query_decomposition.md)).
## Dataset

Each case has the minimal schema `{id, messages, query_with_temporal_filter}`:

| Class | # Cases | Description |
|-------|--------:|-------------|
| **Positive** (`true`) | 20 | User restricts by document creation/authoring/publication date. Covers all resolution rules: today, yesterday, this/last week, this/last month, this/last year, past N days/weeks/months, recent/latest, since X, before X, bare MONTH, in YEAR, specific date, exclusion, multi-entity with shared time, plus multi-turn context (3- and 5-message). |
| **Negative** (`false`) | 20 | Filter must be null. Three sub-patterns: (a) dates that describe the topic/subject ("2024 sustainability report", "trends 2020→2025", "2016 US election"); (b) no temporal reference (policy, how-to, trivia); (c) conversational fillers (greetings, thanks). Includes a 5-message negative where the last turn pivots to a pure topic question. |

Document types and verbs are varied on purpose — design specs, incident reports, PRs, commits, lab results, audit logs, invoices, legal briefs, slide decks, meeting minutes, safety bulletins, etc. — so the evaluation does not reduce to "the model learned the word *uploaded*".
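A positive case following the minimal schema above might look like the following sketch; the id and message content are invented for illustration, not taken from the dataset:

```python
# Illustrative case shaped like {id, messages, query_with_temporal_filter}.
case = {
    "id": 999,  # hypothetical id, not present in the real dataset
    "messages": [
        {"role": "user", "content": "Any incident reports filed since last Monday?"},
    ],
    # True: the query restricts by document creation date, so a filter
    # should be emitted (the "since X" resolution rule applies).
    "query_with_temporal_filter": True,
}
print(sorted(case.keys()))  # ['id', 'messages', 'query_with_temporal_filter']
```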
## Metrics

Positive class = a filter **was** / **should have been** emitted.

1. **filter_detection_accuracy** — (TP + TN) / N.
2. **filter_detection_precision** — TP / (TP + FP). How often an emitted filter was actually wanted.
3. **filter_detection_recall** — TP / (TP + FN). How often a wanted filter was actually emitted.
4. **filter_detection_f1** — harmonic mean of precision and recall.
5. **filter_correctness** — LLM-as-judge boolean on TP cases only: given the chat history, current date, and generator output JSON, are the predicates correct as a whole (operator, field, ISO values, closed-vs-open intervals, exclusion split)?

The judge is invoked **only on TP** (filter expected and emitted). Precision/recall capture the detection decision; correctness captures the filter body.
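Metrics 1–4 are standard confusion-matrix arithmetic; the sketch below reproduces the v0 / Mistral-Small row of the results table (TP=10, FP=0, FN=10, TN=20) from the formulas above:

```python
# Detection metrics 1-4 from the confusion-matrix counts.
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / n, "precision": precision,
            "recall": recall, "f1": f1}

m = detection_metrics(tp=10, fp=0, fn=10, tn=20)
print({k: round(v, 3) for k, v in m.items()})
# {'accuracy': 0.75, 'precision': 1.0, 'recall': 0.5, 'f1': 0.667}
```

This matches the reported 75.0% / 100.0% / 50.0% / 66.7% for that row.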
## Results

Full raw output: [results/result_filter_generation.json](results/result_filter_generation.json). Judge: `Qwen3-VL-8B-Instruct-FP8`. Current date at eval time: Sunday, April 19, 2026.

**Overall**

| Prompt | Model | Acc | Precision | Recall | F1 | TP / FP / FN / TN | filter_correctness (TP only) |
|---|---|---:|---:|---:|---:|:-:|---:|
| v0 | Mistral-Small-3.1-24B-Instruct-2503 | 75.0% | 100.0% | 50.0% | 66.7% | 10 / 0 / 10 / 20 | 8/10 (80.0%) |
| v0 | Qwen3-VL-8B-Instruct-FP8 | 57.5% | 100.0% | 15.0% | 26.1% | 3 / 0 / 17 / 20 | 2/3 (66.7%) |
| **v1** | **Mistral-Small-3.1-24B-Instruct-2503** | **100.0%** | **100.0%** | **100.0%** | **100.0%** | 20 / 0 / 0 / 20 | **18/20 (90.0%)** |
| v1 | Qwen3-VL-8B-Instruct-FP8 | 94.9% | 90.5% | 100.0% | 95.0% | 19 / 2 / 0 / 18 | 17/19 (89.5%) |

(Qwen/v1 had 1 judge-call error on id 8 — excluded from the denominator.)
### v0 vs v1

v0 contains no `temporal_filters` rules, so both models default to "no filter" and collapse on recall (Mistral 50%, Qwen 15%). v1's explicit resolution table brings both to 100% recall. There were no false positives under v0 — its cost is recall only, not precision.

Exclusion (id 14) illustrates the v0 body failure: without a rule, Mistral emits three contradictory predicates (`>= 2025-04-19 AND < 2025-03-01 AND >= 2025-04-…`); Qwen collapses to the full year including March. v1 fixes this for Mistral.
### v1 — Mistral-Small (winner)

Perfect detection (20/20 TP, 0/20 FP). Two judge-rejected filter bodies among TP:

- **id 20 "Latest safety bulletins"** → emitted a 12-month window `[2025-07-20, 2026-07-20)` extending into the future. Prompt rule is `recent/latest → past 90 days`, expected `[2026-01-19, 2026-04-20)`. Real generator bug.
- **id 18 "Commits pushed since last Monday"** → added a spurious upper bound `< 2026-04-19`. Prompt rule is `since X → one predicate >= X`. Real generator bug.
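The "past 90 days" arithmetic behind the expected id-20 window can be checked with a short sketch; the helper name is invented, but the rule, eval-time current date, and half-open interval convention come from this report:

```python
from datetime import date, timedelta

TODAY = date(2026, 4, 19)  # current date at eval time (Sunday, April 19, 2026)

def past_n_days(n: int, today: date = TODAY) -> tuple[str, str]:
    """Resolve 'recent/latest -> past N days' to a half-open ISO interval
    [today - n days, today + 1 day). Hypothetical helper mirroring the rule."""
    return ((today - timedelta(days=n)).isoformat(),
            (today + timedelta(days=1)).isoformat())

print(past_n_days(90))  # ('2026-01-19', '2026-04-20')
```

The output matches the expected id-20 window `[2026-01-19, 2026-04-20)`, confirming the emitted `[2025-07-20, 2026-07-20)` window is a genuine generator bug rather than a judge error.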
### v1 — Qwen3-VL-8B

Perfect recall but 2 false positives where a year describes the **content**, not the document creation date:

| id | Query | Bad filter |
|---|---|---|
| 30 | "Findings in the 2024 annual sustainability report." | `created_at ∈ [2024-01-01, 2025-01-01)` |
| 34 | "Effects of climate change on Arctic sea ice between 2010 and 2020." | `created_at ∈ [2010-01-01, 2021-01-01)` |

Cause: v1's topic-vs-creation section is short, and its single null example ("Q3 2024 reporting template") does not cover research/historical framings with spanning year ranges.
Two judge-rejected filter bodies among TP:

- **id 18 "Commits pushed since last Monday"** → added a spurious upper bound `< 2026-04-20`. Prompt rule is `since X → one predicate >= X`. Correct rejection.
- **id 9 "Recent SRE incident reports"** → emitted past 10 days instead of past 90. Correct rejection.
## Recommendations

1. **Ship v1 + Mistral-Small-24B as the production pairing** — 100% detection, 90% filter correctness, with the two remaining body bugs on "latest" and "since X" worth a targeted prompt tweak.
2. **Patch v1's topic-vs-creation section.** Add null examples for the two patterns Qwen still trips on: `"findings in the YEAR report"` and `"events in YEAR"` / `"between YEAR1 and YEAR2"` when the year is the subject.
3. **Reinforce the `since X` and `latest` rules.** Both Mistral and Qwen emit an unwanted upper bound on "since last Monday"; Mistral emits an over-wide, future-extending window for "latest". A short prompt clarification or an additional example should eliminate these.
4. **Upgrade the judge.** Qwen3-VL-8B as judge is flaky (one call returned `None` on id 8) and occasionally misreads the prompt's resolution rules. A stronger judge would tighten the correctness metric.
**Review comments** (repository: `linagora/openrag`):
Normalize these env var names to uppercase or suppress the dotenv-linter warning. `BASE_URLs`, `API_KEYs`, and `MODELs` are mixed-case, which dotenv-linter flags. These environment variables are hardcoded in both `benchmarks/prompt_eval/eval_query_decomposition.py` (lines 62-64) and `benchmarks/prompt_eval/eval_temporal_filter_generation.py` (lines 83-85). Renaming them to `BASE_URLS`, `API_KEYS`, and `MODELS` requires updating all three locations in both evaluator scripts simultaneously; changing only the `.env.example` will break configuration loading. Alternatively, add a `.dotenv-linter.json` rule to suppress the mixed-case warning for these specific keys.

dotenv-linter (4.0.0) warnings:

- [warning] 2-2: [LowercaseKey] The BASE_URLs key should be in uppercase
- [warning] 3-3: [LowercaseKey] The API_KEYs key should be in uppercase
- [warning] 3-3: [UnorderedKey] The API_KEYs key should go before the BASE_URLs key
- [warning] 4-4: [LowercaseKey] The MODELs key should be in uppercase
**paultranvan:** I agree with the bot
**Bot follow-up:** @paultranvan, great — here's the plan to rename all three variables to proper uppercase across every affected file: `BASE_URLs` → `BASE_URLS`, `API_KEYs` → `API_KEYS`, `MODELs` → `MODELS`, in `.env.example`, `eval_query_decomposition.py`, and `eval_temporal_filter_generation.py`.

Complete diffs cover all three files: `benchmarks/prompt_eval/.env.example`, `benchmarks/prompt_eval/eval_query_decomposition.py`, and `benchmarks/prompt_eval/eval_temporal_filter_generation.py`. The critical lines are the `_parse_env_list(...)` calls (lines 62–64 and 83–85) — those are what actually read from the environment, so they must match exactly. The docstring and error-message lines are updated for consistency.
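A mechanical way to apply such a rename would be a `sed` pass over the affected files. The sketch below runs on a temporary copy of the `.env.example` content rather than the real repo files (whose paths come from the review thread), so it is safe to try standalone:

```shell
# Hypothetical batch rename, demonstrated on a temp copy; for the real change
# the loop would iterate over benchmarks/prompt_eval/{.env.example,*.py}.
tmp=$(mktemp)
printf 'BASE_URLs=https://example/v1/;\nAPI_KEYs=sk-api_key1;\nMODELs=model1;\n' > "$tmp"
sed -i 's/BASE_URLs/BASE_URLS/g; s/API_KEYs/API_KEYS/g; s/MODELs/MODELS/g' "$tmp"
cat "$tmp"
rm -f "$tmp"
```

Note `sed -i` without a backup suffix is GNU sed syntax; BSD/macOS sed requires `sed -i ''`.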