4 changes: 4 additions & 0 deletions benchmarks/prompt_eval/.env.example
@@ -0,0 +1,4 @@

BASE_URLS=https://example/v1/;https://example2/v1/;
API_KEYS=sk-api_key1;sk-api_key2;
MODELS=model1;model2;
63 changes: 63 additions & 0 deletions benchmarks/prompt_eval/Report_query_decomposition.md
@@ -0,0 +1,63 @@
# Query Reformulation Prompt Evaluation

**Dataset:** [datasets/query_decomposition.json](datasets/query_decomposition.json) (80 cases — 20 D1, 40 D2, 20 D3)
**Scope:** reformulation + decomposition only. Temporal filters are out of scope (see [Report_temporal_filter.md](Report_temporal_filter.md)).

## Dataset

| Tier | # Cases | Description |
|------|--------:|-------------|
| **D1** | 20 | Standalone queries that do **not** need decomposition. |
| **D2** | 40 | Queries that **definitely need** decomposition, 2 to 4 sub-queries. Clear signals: distinct entities, distinct time periods, unrelated dimensions, or exclusions. |
| **D3** | 20 | **Ambiguous** queries. Surface features suggest decomposition (conjunctions, comparisons) but the semantics may require one retrieval or several — e.g. trends, interactions, joint effects, multi-attribute-for-one-subject. |

20 domains (finance, healthcare, legal, engineering, science, HR, education, marketing, real_estate, technology, environment, logistics, agriculture, energy, manufacturing, public_policy, retail, telecommunications, insurance, aviation), each represented 3–5 times across tiers.

Gold labels live in `expected_queries` (shape: `SearchQueries`, so it deserializes directly into the pipeline's Pydantic model). Relative dates in `query` strings are pre-resolved to the dataset's current date (2026-04-17); `temporal_filters` is null throughout.
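
The gold-label shape can be sketched as follows (field names `query_list` / `temporal_filters` are taken from this report; the real pipeline uses a Pydantic `SearchQueries` model, approximated here with a stdlib dataclass):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchQueries:
    # Sub-queries to retrieve for; a single element when no decomposition is needed.
    query_list: list[str] = field(default_factory=list)
    # Null throughout this dataset; evaluated separately in the temporal-filter report.
    temporal_filters: Optional[list] = None

# Illustrative D2-style gold label (not a real dataset entry):
expected = SearchQueries(
    query_list=["revenue of entity A in 2024", "revenue of entity B in 2024"]
)
```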

## Metrics

1. **decomposition_count_matching** — `len(generated.query_list) == len(expected.query_list)`.
2. **decomposition_semantic_coverage** — LLM-as-judge boolean: do the generated sub-queries, taken together, cover every expected sub-query (count-insensitive, order-insensitive)? When coverage is incomplete, the judge returns a short reasoning naming the missing expected sub-query.

Reported slices: overall and per-difficulty (D1 / D2 / D3).
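
A minimal sketch of how the two metrics and the slices combine (the case fields `difficulty`, `generated`, `expected` are assumed names, and the LLM-as-judge call is stubbed as a boolean function):

```python
from collections import defaultdict

def count_match(generated: dict, expected: dict) -> bool:
    # Metric 1: exact agreement on the number of sub-queries.
    return len(generated["query_list"]) == len(expected["query_list"])

def aggregate(cases, judge) -> dict:
    # Returns {slice: (n, count_match hits, coverage hits)}
    # for "overall" and each difficulty tier (D1 / D2 / D3).
    agg = defaultdict(lambda: [0, 0, 0])
    for case in cases:
        cm = count_match(case["generated"], case["expected"])
        cov = judge(case)  # Metric 2: semantic coverage (LLM-as-judge boolean)
        for key in ("overall", case["difficulty"]):
            agg[key][0] += 1
            agg[key][1] += cm
            agg[key][2] += cov
    return {k: tuple(v) for k, v in agg.items()}
```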

## Results

Full raw output: [results/result_query_decomposition.json](results/result_query_decomposition.json). Judge: `Qwen3-VL-8B-Instruct-FP8`.

**Overall**

| Prompt | Model | count_match | semantic_coverage |
|---|---|---|---|
| v0 | Mistral-Small-3.1-24B-Instruct-2503 | 66/80 (82.5%) | 74/80 (92.5%) |
| v0 | Qwen3-VL-8B-Instruct-FP8 | 58/80 (72.5%) | 69/80 (86.2%) |
| **v1** | **Mistral-Small-3.1-24B-Instruct-2503** | **69/80 (86.2%)** | **76/80 (95.0%)** |
| v1 | Qwen3-VL-8B-Instruct-FP8 | 66/80 (82.5%) | 71/80 (88.8%) |

**Per-difficulty (count_match · semantic_coverage)**

| Prompt | Model | D1 (n=20) | D2 (n=40) | D3 (n=20) |
|---|---|---|---|---|
| v0 | Mistral-Small | 19/20 · 20/20 | 39/40 · 38/40 | **8/20** · 16/20 |
| v0 | Qwen3-VL-8B | 20/20 · 20/20 | 28/40 · 35/40 | 10/20 · 14/20 |
| v1 | Mistral-Small | 16/20 · 20/20 | 38/40 · 38/40 | **15/20** · 18/20 |
| v1 | Qwen3-VL-8B | 20/20 · 20/20 | 35/40 · 36/40 | 11/20 · 15/20 |

### v0 vs v1 — Mistral-Small

- **v1 handles D3 much better than v0** (count_match 15/20 vs v0's 8/20): v0 splits "interaction / joint-effect / multi-attribute-for-one-subject" questions that should stay as one (id 68, 69, 71, 75, 79, 80), while v1 keeps them together. v1 also correctly splits the two bounded-range trend cases (id 61, 65) per the "evolution / trend over a bounded range" rule.
- **v1 is slightly weaker on D1** (16/20 vs v0's 19/20). Chat-history turns pull prior-turn topics into the reformulation and trigger spurious splits (id 2, 5, 11).
- **Shared failures on both prompts**: id 30 (Lambda/GCF collapsed), id 63 (Kafka/RabbitMQ — prior gRPC turn leaks in), id 66 (HR onboarding under-split), id 70 (air/sea freight), id 74 (EASA/FAA under-split).

### v0 vs v1 — Qwen3-VL-8B

- v1 gains on count_match (+8) and coverage (+2), mostly on D2 (35/40 vs v0's 28/40), as the explicit split rules embolden Qwen to separate multi-entity/region/time-period queries it previously collapsed.
- Qwen still under-splits comparative questions and misses both bounded-range trend cases (id 61, 65) — it keeps them as one query despite the rule.

### Common hard cases (both models, both prompts)

- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
- **id 70** — "air freight vs sea freight": emitted as a single comparison instead of two independent lookups.
- **id 74** — "EASA vs FAA certification": same pattern.
78 changes: 78 additions & 0 deletions benchmarks/prompt_eval/Report_temporal_filter.md
@@ -0,0 +1,78 @@
# Temporal Filter Generation Prompt Evaluation (v2)

**Dataset:** [datasets/temporal_filter.json](datasets/temporal_filter.json) (40 cases — 20 positive, 20 negative)
**Scope:** whether the model emits `temporal_filters` when (and only when) it should, and whether the emitted predicates are correct. Decomposition is out of scope (see [Report_query_decomposition.md](Report_query_decomposition.md)).

## Dataset

Each case has the minimal schema `{id, messages, query_with_temporal_filter}`:

| Class | # Cases | Description |
|------|--------:|-------------|
| **Positive** (`true`) | 20 | User restricts by document creation/authoring/publication date. Covers all resolution rules: today, yesterday, this/last week, this/last month, this/last year, past N days/weeks/months, recent/latest, since X, before X, bare MONTH, in YEAR, specific date, exclusion, multi-entity with shared time, plus multi-turn context (3- and 5-message). |
| **Negative** (`false`) | 20 | Filter must be null. Three sub-patterns: (a) dates that describe the topic/subject ("2024 sustainability report", "trends 2020→2025", "2016 US election"); (b) no temporal reference (policy, how-to, trivia); (c) conversational fillers (greetings, thanks). Includes a 5-message negative where the last turn pivots to a pure topic question. |

Document types and verbs are varied on purpose — design specs, incident reports, PRs, commits, lab results, audit logs, invoices, legal briefs, slide decks, meeting minutes, safety bulletins, etc. — so the evaluation does not reduce to "the model learned the word *uploaded*".
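
Under this schema, a case is as small as (illustrative values only, not taken from the dataset):

```python
# One positive and one negative case in the {id, messages, query_with_temporal_filter} shape.
positive_case = {
    "id": 101,  # hypothetical id
    "messages": [{"role": "user", "content": "Show incident reports filed in the past 7 days."}],
    "query_with_temporal_filter": True,   # restricts by document creation date
}
negative_case = {
    "id": 102,  # hypothetical id
    "messages": [{"role": "user", "content": "Summarize the 2024 sustainability report."}],
    "query_with_temporal_filter": False,  # the year describes the topic, not the creation date
}
```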

## Metrics

Positive class = a filter **was** / **should have been** emitted.

1. **filter_detection_accuracy** — (TP + TN) / N.
2. **filter_detection_precision** — TP / (TP + FP). How often an emitted filter was actually wanted.
3. **filter_detection_recall** — TP / (TP + FN). How often a wanted filter was actually emitted.
4. **filter_detection_f1** — harmonic mean of precision and recall.
5. **filter_correctness** — LLM-as-judge boolean on TP cases only: given the chat history, current date, and generator output JSON, are the predicates correct as a whole (operator, field, ISO values, closed-vs-open intervals, exclusion split)?

Judge is invoked **only on TP** (filter expected and emitted). Precision/recall capture the detection decision; correctness captures the filter body.
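
The four detection metrics follow directly from the confusion counts; for example, the v0 Mistral row in the results table (TP 10, FP 0, FN 10, TN 20) yields the 75.0% / 100.0% / 50.0% / 66.7% line:

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    # Positive class = "a filter was / should have been emitted".
    n = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }
```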

## Results

Full raw output: [results/result_filter_generation.json](results/result_filter_generation.json). Judge: `Qwen3-VL-8B-Instruct-FP8`. Current date at eval time: Sunday, April 19, 2026.

**Overall**

| Prompt | Model | Acc | Precision | Recall | F1 | TP / FP / FN / TN | filter_correctness (TP only) |
|---|---|---:|---:|---:|---:|:-:|---:|
| v0 | Mistral-Small-3.1-24B-Instruct-2503 | 75.0% | 100.0% | 50.0% | 66.7% | 10 / 0 / 10 / 20 | 8/10 (80.0%) |
| v0 | Qwen3-VL-8B-Instruct-FP8 | 57.5% | 100.0% | 15.0% | 26.1% | 3 / 0 / 17 / 20 | 2/3 (66.7%) |
| **v1** | **Mistral-Small-3.1-24B-Instruct-2503** | **100.0%** | **100.0%** | **100.0%** | **100.0%** | 20 / 0 / 0 / 20 | **19/20 (95.0%)** |
| v1 | Qwen3-VL-8B-Instruct-FP8 | 92.5% | 87.0% | 100.0% | 93.0% | 20 / 3 / 0 / 17 | 17/20 (85.0%) |

(Numbers above are from a matched-conditions rerun with `DATASET_CURRENT_DATE = "Sunday, April 19, 2026"` pinned in the evaluator. Minor v0/v1 drift vs earlier archived runs reflects generator + LLM-as-judge variance.)

### v0 vs v1

v0 contains no `temporal_filters` rules, so both models default to "no filter" and collapse on recall (Mistral 50%, Qwen 15%). v1's explicit resolution table brings both to 100% recall. No false positives under v0 — the cost of v0 is recall only, not precision.
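
For reference, a few of v1's resolution rules can be reconstructed as date arithmetic over half-open `[start, end)` windows on the creation date (illustrative only — the authoritative wording lives in the prompt):

```python
from datetime import date, timedelta

def resolve(rule: str, today: date) -> tuple[date, date]:
    # today / yesterday -> one-day windows; recent/latest -> past 90 days.
    if rule == "today":
        return today, today + timedelta(days=1)
    if rule == "yesterday":
        return today - timedelta(days=1), today
    if rule in ("recent", "latest"):
        return today - timedelta(days=90), today + timedelta(days=1)
    raise ValueError(f"unhandled rule: {rule}")

def resolve_since(x: date):
    # since X -> a single lower-bound predicate, no upper bound.
    return (">=", x)
```

With the eval's current date (2026-04-19), `resolve("latest", ...)` reproduces the expected `[2026-01-19, 2026-04-20)` window quoted for id 20 below.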

Exclusion (id 14) illustrates the v0 body failure: without a rule, Mistral emits three contradictory predicates (`>= 2025-04-19 AND < 2025-03-01 AND >= 2025-04-…`); Qwen collapses to the full year including March. v1 fixes this for Mistral.

### v1 — Mistral-Small (winner)

Perfect detection (20/20 TP, 0/20 FP). Two judge-rejected filter bodies among TP:

- **id 20 "Latest safety bulletins"** → emitted a 12-month window `[2025-07-20, 2026-07-20)` extending into the future. Prompt rule is `recent/latest → past 90 days`, expected `[2026-01-19, 2026-04-20)`. Real generator bug.
- **id 18 "Commits pushed since last Monday"** → added a spurious upper bound `< 2026-04-19`. Prompt rule is `since X → one predicate >= X`. Real generator bug.

### v1 — Qwen3-VL-8B

Perfect recall, but false positives where a year describes the **content**, not the document creation date — for example:

| id | Query | Bad filter |
|---|---|---|
| 30 | "Findings in the 2024 annual sustainability report." | `created_at ∈ [2024-01-01, 2025-01-01)` |
| 34 | "Effects of climate change on Arctic sea ice between 2010 and 2020." | `created_at ∈ [2010-01-01, 2021-01-01)` |

Cause: v1's topic-vs-creation section is short and its single null example ("Q3 2024 reporting template") does not cover research/historical framings with spanning year ranges.

Two judge-rejected filter bodies among TP:

- **id 18 "Commits pushed since last Monday"** → added a spurious upper bound `< 2026-04-20`. Prompt rule is `since X → one predicate >= X`. Correct rejection.
- **id 9 "Recent SRE incident reports"** → emitted past 10 days instead of past 90. Correct rejection.

## Recommendations

1. **Ship v1 + Mistral-Small-24B as the production pairing** — 100% detection, 90% filter correctness, with the two remaining body bugs on "latest" and "since X" worth a targeted prompt tweak.
2. **Patch v1's topic-vs-creation section.** Add null examples for the two patterns Qwen still trips on: `"findings in the YEAR report"` and `"events in YEAR"` / `"between YEAR1 and YEAR2"` when the year is the subject.
3. **Reinforce `since X` and `latest` rules.** Both Mistral and Qwen emit an unwanted upper bound on "since last Monday"; Mistral emits an over-wide, future-extending window for "latest". A short prompt clarification or an additional example should eliminate these.
4. **Upgrade the judge.** Qwen3-VL-8B as judge is flaky (one call returned `None` on id 8) and occasionally misreads the prompt's resolution rules. A stronger judge would tighten the correctness metric.