
Step 3/3 Feat/temporal filters in rag pipeline#284

Open
Ahmath-Gadji wants to merge 2 commits into main from feat/temporal_filters_in_rag_pipeline

Conversation

Collaborator

@Ahmath-Gadji Ahmath-Gadji commented Mar 13, 2026

feat: temporal created_at filters in the RAG query pipeline

Summary

This PR adds LLM-generated temporal filters to the retrieval pipeline, allowing users to naturally ask time-bounded questions ("last week's meeting notes", "reports uploaded this month") and have them translated into Milvus created_at filter expressions before vector search.

Core changes:

  • New structured query models (Query, SearchQueries) in pipeline.py — the query contextualizer now returns a list of typed sub-queries, each optionally carrying a Milvus created_at filter expression instead of a bare string
  • RetrieverPipeline.retrieve_docs() updated to accept a Query object and forward its filter field to the retriever, enabling date-scoped vector search without schema changes
  • Query contextualizer prompt (prompts/example1/query_contextualizer_tmpl.txt) extended with:
    • Query decomposition guidelines (split by time period, entity, or dimension)
    • Temporal filter rules with clear lower/upper-bound semantics
    • Current date injection ({current_date}) for resolving relative expressions like "past 7 days"
    • Explicit language pass-through ({query_language})
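A minimal sketch of what the structured query models described above might look like (plain dataclasses are used here for illustration; the PR itself uses Pydantic models, and any field beyond `query` and `filter` is an assumption):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Query:
    query: str                    # reformulated sub-query text
    filter: Optional[str] = None  # optional Milvus created_at filter expression

@dataclass
class SearchQueries:
    query_list: List[Query] = field(default_factory=list)

# The contextualizer returns typed sub-queries instead of bare strings:
qs = SearchQueries(query_list=[
    Query(query="meeting notes",
          filter='created_at >= "2026-03-06T00:00:00+00:00"'),
    Query(query="Q1 2024 sales figures"),  # content date, so no filter
])
```

`retrieve_docs()` can then forward `q.filter` to the retriever unchanged, and existing callers are unaffected because `filter` defaults to `None`.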

Key design decision — filter vs. no filter:

The prompt carefully distinguishes between dates that describe when a document was ingested (→ generate a filter) and dates that describe what a document is about (→ no filter). For example:

  • "Sales figures in Q1 2024" → content topic, no filter
  • "Reports uploaded in Q1 2024" → ingestion date, generate filter

Benchmarks

An offline evaluation harness (benchmarks/prompt_eval/) was built to validate filter generation quality across models (80 cases: 40 filter, 40 split):

| Model | Overall | Filter pass | Split pass | Filter presence |
| --- | --- | --- | --- | --- |
| claude-opus-4.6-fast | 85.0% | 70% | 100% | 97.5% |
| gpt-5.4 | 78.8% | 67.5% | 90% | 90.0% |
| Mistral-Small-3.1-24B | 60.0% | 22.5% | 97.5% | 47.5% |

Claude achieves perfect query decomposition (40/40 split cases) and near-perfect filter detection (97.5%). Its 11 remaining failures share a single root cause: adding a spurious AND created_at <= today upper bound to open-ended "past N days/weeks" queries — a known prompt issue already addressed by the lower-bound-only rule in the template.

GPT-5.4 is competitive overall (78.8%) but struggles with the same upper-bound issue: all 9 of its filter rejections come from adding an unnecessary <= today to open-ended queries.

Mistral-Small is not suitable for this task. Filter presence is only 47.5% — it misses more than half of all required created_at filters, systematically failing on quarter/month expressions ("Q1 2024", "February 2025") and relative periods ("this month", "last week"). When it does generate a filter, judge accuracy is only 47.4%, reflecting poor date arithmetic (wrong lower bounds, spurious upper bounds).

Test plan

  • Verify that time-bounded queries ("reports from last month", "documents added this week") produce correct Milvus filter expressions in retrieval logs
  • Verify that content-date queries ("Q1 2024 financial results") produce no filter
  • Verify that multi-period queries ("2022 vs 2023 sales") decompose into two sub-queries without filters
  • Confirm existing retrieval (no temporal context) is unaffected — Query.filter defaults to None
  • Run unit test suite: uv run pytest

Summary by CodeRabbit

Release Notes

  • New Features

    • Added temporal filter support for time-bound query searches with automatic date resolution
    • Implemented query decomposition for splitting complex requests into multiple sub-queries
    • Enhanced retrieval with fallback mechanism when filtered queries return no results
    • Added strictified structured output validation for improved reliability
  • Documentation

    • New benchmark evaluation reports and datasets for query decomposition and temporal filtering
    • Updated prompt templates with structured JSON schema and context date injection
  • Improvements

    • Refactored file indexing with deterministic SHA-256-based file ID generation
    • Added advanced error handling with automatic retry logic for query generation


coderabbitai bot commented Mar 13, 2026

Warning

Rate limit exceeded

@EnjoyBacon7 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 38 minutes and 10 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 38 minutes and 10 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5d3d6b7d-0c40-4c50-a884-b83845e60c97

📥 Commits

Reviewing files that changed from the base of the PR and between 7d30ac3 and 40ed25d.

⛔ Files ignored due to path filters (1)
  • benchmarks/prompt_eval/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • benchmarks/prompt_eval/.env.example
  • benchmarks/prompt_eval/Report_query_decomposition.md
  • benchmarks/prompt_eval/Report_temporal_filter.md
  • benchmarks/prompt_eval/datasets/query_decomposition.json
  • benchmarks/prompt_eval/datasets/temporal_filter.json
  • benchmarks/prompt_eval/eval_query_decomposition.py
  • benchmarks/prompt_eval/eval_temporal_filter_generation.py
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
  • benchmarks/prompt_eval/pyproject.toml
  • benchmarks/prompt_eval/results/result_filter_generation.json
  • benchmarks/prompt_eval/results/result_query_decomposition.json
  • openrag/components/pipeline.py
  • prompts/example1/query_contextualizer_tmpl.txt
  • prompts/example1/spoken_style_answer_tmpl.txt
  • prompts/example1/sys_prompt_tmpl.txt
  • pyproject.toml
  • tests/api_tests/api_run/mock_vllm.py
  • utility/data_indexer.py
  • utility/data_indexer2.py
📝 Walkthrough

Walkthrough

This PR introduces a comprehensive temporal filter generation and query decomposition evaluation framework. It adds benchmark datasets, evaluation scripts with LLM-as-judge metrics, and updated prompt templates, while simultaneously extending the OpenRAG pipeline to support structured Query objects with optional temporal filter predicates and implementing retry-on-failure logic for retrieval.

Changes

- **Benchmark Suite Infrastructure** (benchmarks/prompt_eval/.env.example, benchmarks/prompt_eval/pyproject.toml): Added environment configuration template and project manifest for the benchmark suite with dependencies on langchain-openai, openai, langdetect, tqdm, and python-dotenv.
- **Benchmark Reports** (benchmarks/prompt_eval/Report_query_decomposition.md, benchmarks/prompt_eval/Report_temporal_filter.md): Added evaluation documentation defining query decomposition (exact count matching + judge-based semantic coverage) and temporal filter generation (v2) metrics with per-model results, per-difficulty breakdowns, and qualitative analysis.
- **Benchmark Datasets** (benchmarks/prompt_eval/datasets/query_decomposition.json, benchmarks/prompt_eval/datasets/temporal_filter.json): Added JSON datasets containing 2,026 query decomposition examples with expected sub-query lists and 40 temporal filter classification cases with multi-turn message histories.
- **Benchmark Evaluation Scripts** (benchmarks/prompt_eval/eval_query_decomposition.py, benchmarks/prompt_eval/eval_temporal_filter_generation.py): Added asynchronous evaluation frameworks implementing structured output parsing, LLM-as-judge metrics, confusion matrices, per-model reporting, and CLI interfaces supporting dataset/prompt/output path configuration.
- **Benchmark Prompt Templates** (benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt, benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt): Added prompt templates instructing query reformulation/decomposition with explicit rules for temporal filter generation using ISO-8601 UTC predicates, half-open interval semantics, and relative date handling.
- **Core Pipeline Temporal Filter Support** (openrag/components/pipeline.py): Introduced TemporalPredicate and Query Pydantic models, updated SearchQueries.query_list to use structured Query objects, added a to_milvus_filter() conversion method, implemented retrieval fallback when filters return no results, and enabled strict structured output parsing.
- **Production Prompt Templates** (prompts/example1/query_contextualizer_tmpl.txt, prompts/example1/spoken_style_answer_tmpl.txt, prompts/example1/sys_prompt_tmpl.txt): Updated the query contextualizer to enforce a JSON schema with a temporal_filters field; added current_date context injection to the system and answer templates.
- **Utility Scripts** (utility/data_indexer.py): Refactored to use deterministic SHA-256 file ID generation from absolute paths, removed conditional existence checking, and simplified upload endpoint construction.
- **Data Indexer Extension** (utility/data_indexer2.py): Added a new utility for selective PDF indexing with health-check validation, page-count filtering (11–40 pages), multipart form uploads, and statistics logging.
- **Test Infrastructure** (tests/api_tests/api_run/mock_vllm.py): Enhanced mock tool-call response generation with JSON Schema $ref dereferencing and recursive _mock_value logic to handle nested objects, arrays, primitives, and union types.
- **Project Configuration** (pyproject.toml): Added a [tool.uv.workspace] section excluding benchmarks/prompt_eval from workspace handling.
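The deterministic file-ID scheme mentioned for utility/data_indexer.py can be sketched as follows (the helper name is hypothetical; the actual script may structure this differently):

```python
import hashlib

def file_id_from_path(absolute_path: str) -> str:
    # Deterministic ID: the same absolute path always hashes to the same ID,
    # so re-running the indexer addresses the same document instead of
    # creating duplicates. (Illustrative sketch of the approach described above.)
    return hashlib.sha256(absolute_path.encode("utf-8")).hexdigest()

a = file_id_from_path("/data/reports/q1.pdf")
b = file_id_from_path("/data/reports/q1.pdf")
```

One consequence, noted later in the review, is that re-uploading an already-indexed path now targets an existing ID, so the uploader must decide how to handle conflicts.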

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant QueryGen as Query Generator<br/>(LLM)
    participant RetrieverPipeline
    participant Milvus as Vector DB<br/>(Milvus)
    participant RAG as RAG System

    User->>QueryGen: User message +<br/>Current date context
    QueryGen->>QueryGen: Generate structured<br/>Query with optional<br/>temporal_filters
    QueryGen->>RetrieverPipeline: Query object<br/>(query + temporal_filters)
    RetrieverPipeline->>RetrieverPipeline: Convert temporal_filters<br/>to Milvus filter string
    RetrieverPipeline->>Milvus: retrieve(query_text,<br/>milvus_filter)
    alt Filter returns results
        Milvus-->>RetrieverPipeline: Filtered documents
    else No results with filter
        RetrieverPipeline->>Milvus: retrieve(query_text,<br/>no filter) - Retry
        Milvus-->>RetrieverPipeline: Fallback documents
    end
    RetrieverPipeline->>RAG: Relevant documents
    RAG-->>User: Final answer
```
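The fallback branch in the diagram could be implemented roughly like this (a sketch against a hypothetical async retriever interface, not the actual RetrieverPipeline code):

```python
import asyncio
from typing import List, Optional

async def retrieve_with_fallback(retriever, query_text: str,
                                 milvus_filter: Optional[str]) -> List[dict]:
    # If the filtered search comes back empty, retry once without the filter
    # so an over-strict temporal bound never silently yields zero documents.
    docs = await retriever.retrieve(query_text, filter=milvus_filter)
    if not docs and milvus_filter is not None:
        docs = await retriever.retrieve(query_text, filter=None)
    return docs

# Minimal stand-in retriever for demonstration: it returns nothing whenever
# a filter is applied, which forces the fallback path.
class FakeRetriever:
    async def retrieve(self, query_text: str, filter: Optional[str] = None):
        return [] if filter else [{"id": 1, "text": "doc"}]

docs = asyncio.run(retrieve_with_fallback(
    FakeRetriever(), "last week's meeting notes",
    'created_at >= "2026-03-06T00:00:00+00:00"'))
```

A later review comment suggests making this fallback configurable and logging it at a visible level, since silently dropping the filter changes result semantics.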

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • BREAKING CHANGE Step 2/3: Adding temporal fields #259 — Introduces temporal field and filter-related API changes to openrag/components/pipeline.py that directly overlap with this PR's TemporalPredicate/Query model introduction and retrieval signature modifications.
  • Add workspaces #272 — Modifies RetrieverPipeline.retrieve_docs signature and query/filter handling within the same file as this PR's core pipeline changes.
  • Refactor/hydra to pydantic #295 — Introduces TemporalPredicate and Query models with SearchQueries.query_list refactoring, directly aligned with the same structural changes in this PR.

Suggested labels

feat

Suggested reviewers

  • Ahmath-Gadji
  • paultranvan

🐰 A benchmark hops through the temporal maze,
Decomposing queries in structured ways,
The Milvus retriever now filters with grace,
With fallback hops when no documents trace,
The pipeline leaps forward, time-aware and keen! ⏰✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

- **Docstring Coverage** (⚠️ Warning): docstring coverage is 37.21%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

- **Description Check** (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
- **Title check** (✅ Passed): the title 'Step 3/3 Feat/temporal filters in rag pipeline' clearly summarizes the main change: adding temporal filter functionality to the RAG retrieval pipeline.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@Ahmath-Gadji changed the title from "Feat/temporal filters in rag pipeline" to "3/3 Feat/temporal filters in rag pipeline" on Mar 17, 2026
@Ahmath-Gadji changed the title from "3/3 Feat/temporal filters in rag pipeline" to "Step 3/3 Feat/temporal filters in rag pipeline" on Mar 17, 2026
- Assistant: Italy offers a wealth of history and culinary delights.
- User: What are the must-see sites?
Reformulated query: What are the must-see historical monuments and local cuisine restaurants in Italy?
Collaborator


Did we evaluate this prompt? How does it perform with real-case examples? How does it impact latency?

@Ahmath-Gadji force-pushed the feat/temporal_filters_in_rag_pipeline branch 10 times, most recently from 22fc7d7 to e043548 on April 14, 2026 14:55
@Ahmath-Gadji Ahmath-Gadji marked this pull request as ready for review April 14, 2026 15:06
Comment thread utility/data_indexer2.py


```python
# # How to run this code:
# uv run python utility/data_indexer2.py \
```
Collaborator


please avoid non-descriptive numbered naming such as indexer2 🙏
if we need several data_indexer scripts, let's name them explicitly by their purpose

Collaborator Author


This was only for my local use; it was not meant to be pushed to GitHub.

@coderabbitai coderabbitai bot added the feat Add a new feature label Apr 19, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
prompts/example1/spoken_style_answer_tmpl.txt (1)

22-22: ⚠️ Potential issue | 🟡 Minor

Fix typo in user-facing prompt text.

"humain" should be "human".

📝 Proposed fix
-   * Keep your answer succinct and straight to the point as if speaking to humain.
+   * Keep your answer succinct and straight to the point as if speaking to a human.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@prompts/example1/spoken_style_answer_tmpl.txt` at line 22, Fix the typo in
the user-facing prompt template: in the prompt string that begins "Keep your
answer succinct and straight to the point as if speaking to humain." (in
spoken_style_answer_tmpl.txt / the "Keep your answer..." prompt), change
"humain" to "human" so the sentence reads "...speaking to human."
openrag/components/pipeline.py (1)

294-301: ⚠️ Potential issue | 🔴 Critical

Fix web search calls to pass query string instead of Query object.

At lines 297 and 320, web_search_service.search() receives a Query Pydantic model instead of the expected str. This silently calls Query.__str__(), which returns f"Query: {self.query}, Filter: {self.to_milvus_filter()}", polluting the search engine with literal tokens ("Query:", "Filter:") and Milvus filter expressions, significantly degrading result quality.

Fix
-            web_tasks = [self.web_search_service.search(q) for q in queries.query_list]
+            web_tasks = [self.web_search_service.search(q.query) for q in queries.query_list]

Apply the same fix at line 320 (web-only branch).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openrag/components/pipeline.py` around lines 294 - 301, The web search is
being called with a Query Pydantic model instead of the raw string, which causes
Query.__str__() to pollute the search; update the web search invocations to pass
the actual query string (use q.query) rather than the Query object: change the
web_tasks comprehension that builds from queries.query_list to call
self.web_search_service.search(q.query), and make the identical change in the
web-only branch where web_tasks is constructed (the second occurrence near the
other web-only logic).
♻️ Duplicate comments (1)
utility/data_indexer2.py (1)

1-121: 🛠️ Refactor suggestion | 🟠 Major

This script appears to duplicate data_indexer.py and was reportedly not intended for upstream.

Per prior discussion on this PR, you mentioned this file was only for local use and not meant to be pushed. It also largely duplicates utility/data_indexer.py with added PDF-only filtering, a 100-file cap, page-count bounds (11–40), and size/page statistics. If it is intentionally kept, please:

  • Rename it to reflect its purpose (e.g., pdf_benchmark_indexer.py) rather than data_indexer2.
  • Fold the shared logic (arg parsing, __check_api, upload loop) into a common helper to avoid drift between the two scripts.
  • Document the rationale for the hardcoded limit = 100 and page bounds (10, 40] — these look benchmark-specific.

Otherwise, consider dropping it from this PR since it is orthogonal to the temporal-filters feature.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@utility/data_indexer2.py` around lines 1 - 121, This file duplicates existing
data_indexer.py and shouldn't be upstream as-is; either remove it from the PR or
rename it to reflect purpose (e.g., pdf_benchmark_indexer.py) and refactor
common logic (arg parsing, __check_api, upload loop) into a shared helper module
to prevent drift; if you keep it, add a short docstring near the top declaring
its benchmark intent and justify the hard-coded limit = 100 and the page bounds
check (the page_count <= 10 or page_count > 40 filter) so reviewers know they
are intentional.
🧹 Nitpick comments (5)
utility/data_indexer.py (1)

36-46: __check_api does not fail on non-200 responses.

If the health check returns e.g. 500 or 404, the function silently returns and the script proceeds to index. Only httpx.RequestError (connection-level) is raised. Consider response.raise_for_status() or an explicit non-200 branch so a misconfigured URL/auth fails fast.
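A stdlib-only sketch of the explicit status branch this suggests (the real fix would most likely just call `response.raise_for_status()` on the httpx response; the function and exception names below are illustrative):

```python
class HealthCheckError(RuntimeError):
    pass

def check_health_response(status_code: int) -> None:
    # Explicit non-2xx branch, equivalent in spirit to raise_for_status():
    # a misconfigured URL or auth setup fails fast instead of letting the
    # script proceed to index against a broken endpoint.
    if not 200 <= status_code < 300:
        raise HealthCheckError(f"health check failed with HTTP {status_code}")
```

With this in place, a 404 or 500 from the health endpoint aborts the run, while connection-level errors continue to surface as before.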

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@utility/data_indexer.py` around lines 36 - 46, The __check_api function
currently only raises on httpx.RequestError and silently continues on non-200
responses; modify __check_api to detect non-200 HTTP responses (e.g., call
response.raise_for_status() or explicitly check response.status_code != 200) and
raise an exception (or log and raise) so the caller fails fast; update the
try/except around httpx.get in __check_api to include HTTP status errors (using
response.raise_for_status() is preferred) so any 4xx/5xx result causes the
function to raise instead of returning silently.
benchmarks/prompt_eval/Report_query_decomposition.md (1)

50-63: Optional: Consider varying sentence structure for readability.

The static analysis tool flags repetitive sentence openings in the "Common hard cases" section. While not critical, varying the structure could improve readability.

✍️ Optional rewording suggestion
 ### Common hard cases (both models, both prompts)
 
-- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
-- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
-- **id 70** — "air freight vs sea freight": emitted as a single comparison instead of two independent lookups.
-- **id 74** — "EASA vs FAA certification": same pattern.
+- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
+- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
+- For **id 70** ("air freight vs sea freight"), the model emits a single comparison instead of two independent lookups.
+- **id 74** ("EASA vs FAA certification") exhibits the same pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/Report_query_decomposition.md` around lines 50 - 63,
The "Common hard cases (both models, both prompts)" bullet list uses repetitive
sentence openings and punctuation; please rewrite that section (the header and
the four bullets referencing id 30, id 63, id 70, id 74) to vary sentence
structure and lead-ins for readability — e.g., turn some bullets into full
sentences starting with the case description ("Compare AWS Lambda and Google
Cloud Functions") and others starting with the observed failure ("Earlier gRPC
turn..."), replace repeated em-dash openings with varied connectors or clauses,
and ensure each item still mentions the id and the failure mode (single
comparison, contaminated reformulation, under-splitting) without changing
meaning.
benchmarks/prompt_eval/eval_temporal_filter_generation.py (1)

55-230: Extract shared infra with eval_query_decomposition.py to prevent drift.

_parse_env_list, _build_models, _judge_config, _model_kwargs, build_llm_messages, format_prompt, the TemporalPredicate/Query/SearchQueries schemas, and the CLI path-sandboxing logic are duplicated almost verbatim between the two evaluators. Any future change (e.g. adding strict=True, switching to UTC, changing the default max_completion_tokens) will need to be applied in both places, and already-present asymmetries will grow. Consider extracting a benchmarks/prompt_eval/_common.py (or lib/) module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/eval_temporal_filter_generation.py` around lines 55 -
230, Extract duplicated utilities and schemas into a shared module (e.g.,
benchmarks/prompt_eval/_common.py): move _parse_env_list, _build_models,
_judge_config, _model_kwargs, build_llm_messages, format_prompt and the Pydantic
models TemporalPredicate, Query, SearchQueries (and any CLI path-sandboxing
helpers) into that module; then import and use them from
eval_temporal_filter_generation.py (and eval_query_decomposition.py) so both
files reference the single implementation; ensure imports/exports preserve names
and behavior (including default values like max_completion_tokens and any
langdetect fallback), run lints/tests, and update any relative references to
MODELS or judge_config usage to use the shared module.
benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt (1)

31-42: Tighten the past N/bare MONTH rules — the formulas drop the time unit.

Line 37 writes [Current − N, tomorrow) without carrying over the unit (days/weeks/months/years), so a model parsing "past 2 weeks" could read the lower bound as Current − 2. Line 41 ("nearest past occurrence") is ambiguous when the bare month equals the current month or is in the future (e.g. "December" queried in April 2026 — Dec 2025 or Dec 2026?). Consider:

✏️ Proposed clarification
-- past N days/weeks/months/years → `[Current − N, tomorrow)`
+- past N days/weeks/months/years → `[Current − N <same unit>, tomorrow)` (e.g. "past 2 weeks" → `[Current − 14 days, tomorrow)`)
 - recent / latest → treat as "past 90 days"
 - since X → one predicate `>= X`
 - before X → one predicate `< X`
-- bare MONTH (no year) → nearest past occurrence
+- bare MONTH (no year) → the most recent fully-elapsed occurrence of that month (if the named month equals the current month, use the occurrence one year prior)
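To make the unit-carrying arithmetic concrete, here is a small sketch (illustrative only, restricted to days and weeks; months and years need calendar arithmetic rather than a fixed timedelta):

```python
from datetime import date, timedelta

def past_n_lower_bound(current: date, n: int, unit: str) -> date:
    # Lower bound of the half-open interval [Current - N <unit>, tomorrow).
    # Carrying the unit into the subtraction means "past 2 weeks" subtracts
    # 14 days rather than a unitless 2.
    days_per_unit = {"days": 1, "weeks": 7}
    return current - timedelta(days=n * days_per_unit[unit])

# "past 2 weeks" on 2026-04-14 -> lower bound 2026-03-31
lower = past_n_lower_bound(date(2026, 4, 14), 2, "weeks")
```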
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt` around lines
31 - 42, The "past N" and "bare MONTH" rules are ambiguous: change the "past N
days/weeks/months/years → `[Current − N, tomorrow)`" line to explicitly carry
the time unit (e.g., "past N days/weeks/months/years → `[Current − N
{days|weeks|months|years}, tomorrow)`") and ensure "weeks" and "months" are
interpreted as whole-week and whole-month arithmetic (not unitless subtraction).
Clarify the "bare MONTH (no year) → nearest past occurrence" rule by specifying
deterministic behavior: if MONTH is before the current month use MONTH of the
current year; if MONTH is the current month or after, use MONTH of the previous
year (i.e., the most recent calendar occurrence strictly before today). Ensure
these exact phrases (the "past N days/weeks/months/years" rule and the "bare
MONTH (no year)" rule) are updated so models include the unit and the
tie-breaker rule unambiguously.
openrag/components/pipeline.py (1)

73-77: The double-quote syntax is correct; single and double quotes are both valid in Milvus TIMESTAMPTZ filter expressions.

The codebase's use of double quotes in ISO "2026-03-15T00:00:00+00:00" is valid. Milvus's expression parser accepts both single and double quotes for string literals, including ISO timestamp literals, and treats them equivalently. This is consistent across the codebase (tests, docstrings, and the to_milvus_filter() method).

However, the broader design note stands: TemporalPredicate.operator permits ["==", "!=", ">", "<", ">=", "<="], but all tested and documented filter cases use only ordered comparisons (<, >, >=, <=). Equality and inequality checks on TIMESTAMPTZ fields are semantically unusual. If equality comparisons are not intentional, consider narrowing the Literal to exclude "==" and "!=".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openrag/components/pipeline.py` around lines 73 - 77, The TemporalPredicate
currently allows equality operators but we only use ordered comparisons for
TIMESTAMPTZ; update the type/definition for the TemporalPredicate.operator (the
Literal used where TemporalPredicate is declared) to remove "==" and "!=" so
only "<", ">", "<=", ">=" are permitted; keep the to_milvus_filter() behavior
(it can continue to emit ISO "..." strings) and ensure any type hints or
validations referencing TemporalPredicate.operator (e.g., construction sites or
validators) are updated to the narrower set.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmarks/prompt_eval/eval_query_decomposition.py`:
- Around line 1-32: The default dataset filename is misspelled as
"query_decompostion" in multiple places; update every occurrence to
"query_decomposition" — specifically fix the docstring references and the
argparse default for the --dataset option (the string
"datasets/query_decompostion.json" and any plain "query_decompostion"
occurrences) so that the script loads datasets/query_decomposition.json by
default and no longer raises FileNotFoundError when run without --dataset.

In `@openrag/components/pipeline.py`:
- Around line 112-122: The current silent fallback that re-runs
self.retriever.retrieve without filters when docs is empty should be made
explicit and configurable: add a pipeline config flag (e.g.,
allow_filterless_fallback) and only perform the unfiltered re-query when that
flag is True; when you do fallback, set a metadata flag (e.g.,
filter_dropped=True) on the returned result (or attach to the docs/container) so
callers can surface that the temporal filter was dropped; replace the
logger.debug call with logger.warning or logger.info so it’s visible at INFO
level and include context (query=query.query, partition, reason); keep the
original retriever call parameters (partition, query.query, filter_params) when
not falling back and ensure all code paths return the same result shape
including the new metadata key.

In `@utility/data_indexer.py`:
- Around line 57-60: The deterministic file_id generation (absolute_path ->
file_id) now causes duplicate POSTs to url =
f"{args.url}/indexer/partition/{args.partition}/file/{file_id}" to return HTTP
409 and break reruns; update the uploader call in utility/data_indexer.py (the
block that posts to url) to treat 409 as non-fatal: catch the HTTP response or
exception from the POST, and if response.status_code == 409 (or the raised
HTTPError maps to 409), log an informative message and continue (skip that file)
instead of raising; otherwise preserve existing error handling for other status
codes so real failures still surface.

In `@utility/data_indexer2.py`:
- Around line 63-68: The try/except around fitz.open(file_path) currently
swallows all exceptions; change it to except Exception as e and log the failure
before falling back to page_count = 0 (e.g. logger.warning or logger.exception)
so operators can see why a PDF was skipped; update the block that calls
fitz.open(file_path) and sets page_count to capture the exception object (e) and
include file_path and e in the log message.

---

Outside diff comments:
In `@openrag/components/pipeline.py`:
- Around line 294-301: The web search is being called with a Query Pydantic
model instead of the raw string, which causes Query.__str__() to pollute the
search; update the web search invocations to pass the actual query string (use
q.query) rather than the Query object: change the web_tasks comprehension that
builds from queries.query_list to call self.web_search_service.search(q.query),
and make the identical change in the web-only branch where web_tasks is
constructed (the second occurrence near the other web-only logic).

In `@prompts/example1/spoken_style_answer_tmpl.txt`:
- Line 22: Fix the typo in the user-facing prompt template: in the prompt string
that begins "Keep your answer succinct and straight to the point as if speaking
to humain." (in spoken_style_answer_tmpl.txt / the "Keep your answer..."
prompt), change "humain" to "human" so the sentence reads "...speaking to
human."

---

Duplicate comments:
In `@utility/data_indexer2.py`:
- Around line 1-121: This file duplicates existing data_indexer.py and shouldn't
be upstream as-is; either remove it from the PR or rename it to reflect purpose
(e.g., pdf_benchmark_indexer.py) and refactor common logic (arg parsing,
__check_api, upload loop) into a shared helper module to prevent drift; if you
keep it, add a short docstring near the top declaring its benchmark intent and
justify the hard-coded limit = 100 and the page bounds check (the page_count <=
10 or page_count > 40 filter) so reviewers know they are intentional.

---

Nitpick comments:
In `@benchmarks/prompt_eval/eval_temporal_filter_generation.py`:
- Around line 55-230: Extract duplicated utilities and schemas into a shared
module (e.g., benchmarks/prompt_eval/_common.py): move _parse_env_list,
_build_models, _judge_config, _model_kwargs, build_llm_messages, format_prompt
and the Pydantic models TemporalPredicate, Query, SearchQueries (and any CLI
path-sandboxing helpers) into that module; then import and use them from
eval_temporal_filter_generation.py (and eval_query_decomposition.py) so both
files reference the single implementation; ensure imports/exports preserve names
and behavior (including default values like max_completion_tokens and any
langdetect fallback), run lints/tests, and update any relative references to
MODELS or judge_config usage to use the shared module.

In `@benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt`:
- Around line 31-42: The "past N" and "bare MONTH" rules are ambiguous: change
the "past N days/weeks/months/years → `[Current − N, tomorrow)`" line to
explicitly carry the time unit (e.g., "past N days/weeks/months/years →
`[Current − N {days|weeks|months|years}, tomorrow)`") and ensure "weeks" and
"months" are interpreted as whole-week and whole-month arithmetic (not unitless
subtraction). Clarify the "bare MONTH (no year) → nearest past occurrence" rule
by specifying deterministic behavior: if MONTH is before the current month use
MONTH of the current year; if MONTH is the current month or after, use MONTH of
the previous year (i.e., the most recent calendar occurrence strictly before
today). Ensure these exact phrases (the "past N days/weeks/months/years" rule
and the "bare MONTH (no year)" rule) are updated so models include the unit and
the tie-breaker rule unambiguously.

In `@benchmarks/prompt_eval/Report_query_decomposition.md`:
- Around line 50-63: The "Common hard cases (both models, both prompts)" bullet
list uses repetitive sentence openings and punctuation; please rewrite that
section (the header and the four bullets referencing id 30, id 63, id 70, id 74)
to vary sentence structure and lead-ins for readability — e.g., turn some
bullets into full sentences starting with the case description ("Compare AWS
Lambda and Google Cloud Functions") and others starting with the observed
failure ("Earlier gRPC turn..."), replace repeated em-dash openings with varied
connectors or clauses, and ensure each item still mentions the id and the
failure mode (single comparison, contaminated reformulation, under-splitting)
without changing meaning.

In `@openrag/components/pipeline.py`:
- Around line 73-77: The TemporalPredicate currently allows equality operators
but we only use ordered comparisons for TIMESTAMPTZ; update the type/definition
for the TemporalPredicate.operator (the Literal used where TemporalPredicate is
declared) to remove "==" and "!=" so only "<", ">", "<=", ">=" are permitted;
keep the to_milvus_filter() behavior (it can continue to emit ISO "..." strings)
and ensure any type hints or validations referencing TemporalPredicate.operator
(e.g., construction sites or validators) are updated to the narrower set.
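A stdlib-only sketch of the narrowed operator set (the actual model is a Pydantic class; the dataclass below just mirrors the constraint, and the emitted filter format is illustrative):

```python
from dataclasses import dataclass
from typing import Literal, get_args

Operator = Literal["<", ">", "<=", ">="]  # "==" and "!=" removed

@dataclass(frozen=True)
class TemporalPredicate:
    operator: Operator
    value: str  # ISO-8601 timestamp

    def __post_init__(self) -> None:
        # Pydantic would enforce the Literal at validation time; a plain
        # dataclass needs an explicit runtime check.
        if self.operator not in get_args(Operator):
            raise ValueError(f"unsupported operator: {self.operator}")

    def to_milvus_filter(self) -> str:
        # Ordered comparison against a quoted ISO timestamp literal.
        return f'created_at {self.operator} "{self.value}"'
```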

In `@utility/data_indexer.py`:
- Around line 36-46: The __check_api function currently only raises on
httpx.RequestError and silently continues on non-200 responses; modify
__check_api to detect non-200 HTTP responses (e.g., call
response.raise_for_status() or explicitly check response.status_code != 200) and
raise an exception (or log and raise) so the caller fails fast; update the
try/except around httpx.get in __check_api to include HTTP status errors (using
response.raise_for_status() is preferred) so any 4xx/5xx result causes the
function to raise instead of returning silently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 395f8e6d-4bcc-412e-9833-038603131849

📥 Commits

Reviewing files that changed from the base of the PR and between 922714e and 7d30ac3.

⛔ Files ignored due to path filters (1)
  • benchmarks/prompt_eval/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • benchmarks/prompt_eval/.env.example
  • benchmarks/prompt_eval/Report_query_decomposition.md
  • benchmarks/prompt_eval/Report_temporal_filter.md
  • benchmarks/prompt_eval/datasets/query_decomposition.json
  • benchmarks/prompt_eval/datasets/temporal_filter.json
  • benchmarks/prompt_eval/eval_query_decomposition.py
  • benchmarks/prompt_eval/eval_temporal_filter_generation.py
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
  • benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
  • benchmarks/prompt_eval/pyproject.toml
  • benchmarks/prompt_eval/results/result_filter_generation.json
  • benchmarks/prompt_eval/results/result_query_decomposition.json
  • openrag/components/pipeline.py
  • prompts/example1/query_contextualizer_tmpl.txt
  • prompts/example1/spoken_style_answer_tmpl.txt
  • prompts/example1/sys_prompt_tmpl.txt
  • pyproject.toml
  • tests/api_tests/api_run/mock_vllm.py
  • utility/data_indexer.py
  • utility/data_indexer2.py

Comment on lines +1 to +32
"""
Prompt evaluation script for query reformulation prompts.

Loads the query_decompostion dataset, runs each test case through one or more
prompt templates against every model defined in the `MODELS` dict, and scores
two decomposition metrics.

Metrics (this pass):
1. decomposition_count_matching — generated query count equals expected
2. decomposition_semantic_coverage — LLM-as-judge boolean: does the generated
split, taken as a whole, semantically
cover every expected sub-query? The
count does not have to match exactly;
only semantic coverage matters.

Usage:
uv run python eval_query_decompostion.py [OPTIONS]

Options:
--dataset PATH Path to the dataset JSON file
(default: datasets/query_decompostion.json)
--prompt PATH Path to a specific prompt template file.
If omitted, all *.txt files in ./prompts/ are evaluated.
--output PATH Write JSON results to this file.

Required environment (candidate models under evaluation — semicolon-separated):
BASE_URLs, API_KEYs, MODELs

Optional environment (LLM-as-judge for semantic coverage; defaults to the
first candidate model if unset):
JUDGE_BASE_URL, JUDGE_API_KEY, JUDGE_MODEL
"""

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the dataset filename on disk vs. what the evaluator defaults to.
fd -t f 'query_decomp' benchmarks/prompt_eval/datasets
rg -nP 'query_decompos?t?ion' benchmarks/prompt_eval

Repository: linagora/openrag

Length of output: 763


Default dataset path is misspelled — script fails out of the box.

The actual dataset file is query_decomposition.json (with second i), but the script references query_decompostion in five locations: the docstring (lines 4, 17, 21), the argparse description (line 449), and critically, the default path argument (line 453). Running python eval_query_decomposition.py without --dataset will fail with FileNotFoundError.

✏️ Proposed fix
-Loads the query_decompostion dataset, runs each test case through one or more
+Loads the query_decomposition dataset, runs each test case through one or more
@@
-    uv run python eval_query_decompostion.py [OPTIONS]
+    uv run python eval_query_decomposition.py [OPTIONS]
@@
-                     (default: datasets/query_decompostion.json)
+                     (default: datasets/query_decomposition.json)
@@
-        default=str(HERE / "datasets" / "query_decompostion.json"),
+        default=str(HERE / "datasets" / "query_decomposition.json"),
@@
-        description="Evaluate query_decompostion prompts (decomposition count matching + semantic coverage)."
+        description="Evaluate query_decomposition prompts (decomposition count matching + semantic coverage)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/prompt_eval/eval_query_decomposition.py` around lines 1 - 32, The
default dataset filename is misspelled as "query_decompostion" in multiple
places; update every occurrence to "query_decomposition" — specifically fix the
docstring references and the argparse default for the --dataset option (the
string "datasets/query_decompostion.json" and any plain "query_decompostion"
occurrences) so that the script loads datasets/query_decomposition.json by
default and no longer raises FileNotFoundError when run without --dataset.

Comment on lines +112 to +122
# Fallback
if not docs:
logger.debug(
"No documents retrieved with temporal filters, falling back to retrieval without filters",
query=str(query.query),
)

docs = await self.retriever.retrieve(
partition=partition, query=query.query, filter=None, filter_params=filter_params
)


⚠️ Potential issue | 🟡 Minor

Silent filter-less fallback can return temporally-incorrect results without the caller knowing.

When the filtered retrieval returns zero docs, you re-query without the filter. For a user who asked "show me reports from last week", returning arbitrary older reports is often worse than returning nothing — and nothing upstream (response payload, sources metadata, UI) signals that the date constraint was dropped. At minimum consider:

  • Surfacing a flag in the returned metadata (e.g. filter_dropped=True) so the caller can warn the user, and
  • Gating the fallback on a config flag so deployments that prefer "strict temporal" can opt out.

A logger.debug alone will not be visible to operators at INFO level.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openrag/components/pipeline.py` around lines 112 - 122, The current silent
fallback that re-runs self.retriever.retrieve without filters when docs is empty
should be made explicit and configurable: add a pipeline config flag (e.g.,
allow_filterless_fallback) and only perform the unfiltered re-query when that
flag is True; when you do fallback, set a metadata flag (e.g.,
filter_dropped=True) on the returned result (or attach to the docs/container) so
callers can surface that the temporal filter was dropped; replace the
logger.debug call with logger.warning or logger.info so it’s visible at INFO
level and include context (query=query.query, partition, reason); keep the
original retriever call parameters (partition, query.query, filter_params) when
not falling back and ensure all code paths return the same result shape
including the new metadata key.

Comment thread utility/data_indexer.py
Comment on lines +57 to +60
 absolute_path = str(file_path.resolve())
 file_id = hashlib.sha256(absolute_path.encode()).hexdigest()

-url_template = f"{args.url}/indexer/partition/{args.partition}/file/{file_id}"
-url = url_template.format(partition_name=args.partition, file_id=file_id)
+url = f"{args.url}/indexer/partition/{args.partition}/file/{file_id}"

⚠️ Potential issue | 🟡 Minor

Re-runs now fail with 409 due to removed pre-check.

With __check_file_exists removed, re-running the script against the same directory will hit /indexer/partition/{partition}/file/{file_id}, which rejects duplicates with HTTP 409 CONFLICT (see openrag/routers/indexer.py:138-142). Since file_id is now a deterministic hash of the absolute path, the same input produces the same ID and all previously indexed files error out on subsequent runs. Consider either skipping on 409 client-side or treating 409 as a non-fatal outcome when logging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@utility/data_indexer.py` around lines 57 - 60, The deterministic file_id
generation (absolute_path -> file_id) now causes duplicate POSTs to url =
f"{args.url}/indexer/partition/{args.partition}/file/{file_id}" to return HTTP
409 and break reruns; update the uploader call in utility/data_indexer.py (the
block that posts to url) to treat 409 as non-fatal: catch the HTTP response or
exception from the POST, and if response.status_code == 409 (or the raised
HTTPError maps to 409), log an informative message and continue (skip that file)
instead of raising; otherwise preserve existing error handling for other status
codes so real failures still surface.

Comment thread utility/data_indexer2.py
Comment on lines +63 to +68
try:
doc = fitz.open(file_path)
page_count = doc.page_count
doc.close()
except Exception:
page_count = 0

⚠️ Potential issue | 🟡 Minor

Silently swallowing fitz.open errors masks real problems.

A bare except Exception with page_count = 0 means corrupt/encrypted PDFs become indistinguishable from legitimately empty docs and are silently filtered out by the <= 10 check. At minimum, log the exception (e.g. logger.warning(f"Failed to read {file_path}: {e}")) so operators can tell why a file was skipped.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@utility/data_indexer2.py` around lines 63 - 68, The try/except around
fitz.open(file_path) currently swallows all exceptions; change it to except
Exception as e and log the failure before falling back to page_count = 0 (e.g.
logger.warning or logger.exception) so operators can see why a PDF was skipped;
update the block that calls fitz.open(file_path) and sets page_count to capture
the exception object (e) and include file_path and e in the log message.
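A sketch of the logged fallback; the helper name is illustrative, and `fitz` (PyMuPDF) is imported lazily so the example degrades gracefully when it is not installed:

```python
import logging

logger = logging.getLogger(__name__)

def safe_page_count(file_path: str) -> int:
    """Return the PDF's page count, logging (not swallowing) read failures."""
    try:
        import fitz  # PyMuPDF
        doc = fitz.open(file_path)
        page_count = doc.page_count
        doc.close()
        return page_count
    except Exception as e:
        # Corrupt, encrypted, or missing PDFs are now visible to operators
        # instead of silently masquerading as empty documents.
        logger.warning("Failed to read %s: %s", file_path, e)
        return 0
```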

@EnjoyBacon7 EnjoyBacon7 force-pushed the feat/temporal_filters_in_rag_pipeline branch from 7d30ac3 to 40ed25d Compare April 19, 2026 18:00
Base automatically changed from dev to main April 20, 2026 15:26

Labels

feat Add a new feature

2 participants