Step 3/3 Feat/temporal filters in rag pipeline #284
Ahmath-Gadji wants to merge 2 commits into main
Conversation
📝 Walkthrough

This PR introduces a comprehensive temporal filter generation and query decomposition evaluation framework. It adds benchmark datasets, evaluation scripts with LLM-as-judge metrics, and updated prompt templates, while extending the OpenRAG pipeline to support structured Query objects with optional temporal filter predicates and implementing retry-on-failure logic for retrieval.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant QueryGen as Query Generator<br/>(LLM)
    participant RetrieverPipeline
    participant Milvus as Vector DB<br/>(Milvus)
    participant RAG as RAG System
    User->>QueryGen: User message +<br/>Current date context
    QueryGen->>QueryGen: Generate structured<br/>Query with optional<br/>temporal_filters
    QueryGen->>RetrieverPipeline: Query object<br/>(query + temporal_filters)
    RetrieverPipeline->>RetrieverPipeline: Convert temporal_filters<br/>to Milvus filter string
    RetrieverPipeline->>Milvus: retrieve(query_text,<br/>milvus_filter)
    alt Filter returns results
        Milvus-->>RetrieverPipeline: Filtered documents
    else No results with filter
        RetrieverPipeline->>Milvus: retrieve(query_text,<br/>no filter) - Retry
        Milvus-->>RetrieverPipeline: Fallback documents
    end
    RetrieverPipeline->>RAG: Relevant documents
    RAG-->>User: Final answer
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```
- Assistant: Italy offers a wealth of history and culinary delights.
- User: What are the must-see sites?
Reformulated query: What are the must-see historical monuments and local cuisine restaurants in Italy?
```
Did we evaluate this prompt? How does it perform with real-case examples? How does it impact latency?
Force-pushed from 22fc7d7 to e043548
```python
# How to run this code:
# uv run python utility/data_indexer2.py \
```
please avoid numbered naming such as indexer2 🙏
if we need several data_indexer scripts, let's name them explicitly for their purpose
This was only for my local use. It was not meant to be pushed to GitHub.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
prompts/example1/spoken_style_answer_tmpl.txt (1)

22-22: ⚠️ Potential issue | 🟡 Minor

Fix typo in user-facing prompt text.
"humain" should be "human".
📝 Proposed fix
```diff
- * Keep your answer succinct and straight to the point as if speaking to humain.
+ * Keep your answer succinct and straight to the point as if speaking to human.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `prompts/example1/spoken_style_answer_tmpl.txt` at line 22, change "humain" to "human" in the prompt string that begins "Keep your answer succinct...", so the sentence reads "...speaking to human."

openrag/components/pipeline.py (1)
294-301: ⚠️ Potential issue | 🔴 Critical

Fix web search calls to pass the query string instead of the Query object.

At lines 297 and 320, `web_search_service.search()` receives a `Query` Pydantic model instead of the expected `str`. This silently calls `Query.__str__()`, which returns `f"Query: {self.query}, Filter: {self.to_milvus_filter()}"`, polluting the search engine with literal tokens ("Query:", "Filter:") and Milvus filter expressions, significantly degrading result quality.

Fix

```diff
- web_tasks = [self.web_search_service.search(q) for q in queries.query_list]
+ web_tasks = [self.web_search_service.search(q.query) for q in queries.query_list]
```

Apply the same fix at line 320 (the web-only branch).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openrag/components/pipeline.py` around lines 294 - 301, The web search is being called with a Query Pydantic model instead of the raw string, which causes Query.__str__() to pollute the search; update the web search invocations to pass the actual query string (use q.query) rather than the Query object: change the web_tasks comprehension that builds from queries.query_list to call self.web_search_service.search(q.query), and make the identical change in the web-only branch where web_tasks is constructed (the second occurrence near the other web-only logic).
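The failure mode above can be reproduced in isolation. Below is an illustrative sketch with plain dataclasses standing in for the project's Pydantic models (all names and the `search` stand-in are hypothetical, mirroring the review, not the actual code), showing how string coercion of the `Query` object pollutes the search input and how `q.query` avoids it:

```python
# Illustrative sketch: why passing the model instead of its .query string
# pollutes the web search input. Not the project's actual classes.
from dataclasses import dataclass, field

@dataclass
class Query:
    """Hypothetical stand-in for the pipeline's Query model."""
    query: str
    temporal_filters: list = field(default_factory=list)

    def to_milvus_filter(self) -> str:
        return " and ".join(self.temporal_filters)

    def __str__(self) -> str:
        # Mirrors the __str__ behavior described in the review comment.
        return f"Query: {self.query}, Filter: {self.to_milvus_filter()}"

def search(q: str) -> str:
    # Stand-in for web_search_service.search(): coerces its input to str.
    return f"searching for: {q}"

q = Query(query="latest EASA rules", temporal_filters=['created_at >= "2026-01-01"'])

polluted = search(str(q))   # what search(q) effectively sends today
clean = search(q.query)     # the proposed fix

print(polluted)  # contains literal "Query:" / "Filter:" tokens
print(clean)
```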
♻️ Duplicate comments (1)
utility/data_indexer2.py (1)
1-121: 🛠️ Refactor suggestion | 🟠 Major

This script appears to duplicate `data_indexer.py` and was reportedly not intended for upstream.

Per prior discussion on this PR, you mentioned this file was only for local use and not meant to be pushed. It also largely duplicates `utility/data_indexer.py` with added PDF-only filtering, a 100-file cap, page-count bounds (11–40), and size/page statistics. If it is intentionally kept, please:

- Rename it to reflect its purpose (e.g., `pdf_benchmark_indexer.py`) rather than `data_indexer2`.
- Fold the shared logic (arg parsing, `__check_api`, the upload loop) into a common helper to avoid drift between the two scripts.
- Document the rationale for the hardcoded `limit = 100` and page bounds `(10, 40]`; these look benchmark-specific.

Otherwise, consider dropping it from this PR since it is orthogonal to the temporal-filters feature.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@utility/data_indexer2.py` around lines 1 - 121, This file duplicates existing data_indexer.py and shouldn't be upstream as-is; either remove it from the PR or rename it to reflect purpose (e.g., pdf_benchmark_indexer.py) and refactor common logic (arg parsing, __check_api, upload loop) into a shared helper module to prevent drift; if you keep it, add a short docstring near the top declaring its benchmark intent and justify the hard-coded limit = 100 and the page bounds check (the page_count <= 10 or page_count > 40 filter) so reviewers know they are intentional.
🧹 Nitpick comments (5)
utility/data_indexer.py (1)
36-46: `__check_api` does not fail on non-200 responses.

If the health check returns e.g. 500 or 404, the function silently returns and the script proceeds to index. Only `httpx.RequestError` (connection-level) is raised. Consider `response.raise_for_status()` or an explicit non-200 branch so a misconfigured URL/auth fails fast.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@utility/data_indexer.py` around lines 36 - 46, The __check_api function currently only raises on httpx.RequestError and silently continues on non-200 responses; modify __check_api to detect non-200 HTTP responses (e.g., call response.raise_for_status() or explicitly check response.status_code != 200) and raise an exception (or log and raise) so the caller fails fast; update the try/except around httpx.get in __check_api to include HTTP status errors (using response.raise_for_status() is preferred) so any 4xx/5xx result causes the function to raise instead of returning silently.

benchmarks/prompt_eval/Report_query_decomposition.md (1)
50-63: Optional: Consider varying sentence structure for readability.

The static analysis tool flags repetitive sentence openings in the "Common hard cases" section. While not critical, varying the structure could improve readability.
✍️ Optional rewording suggestion
```diff
 ### Common hard cases (both models, both prompts)
-- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
-- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
-- **id 70** — "air freight vs sea freight": emitted as a single comparison instead of two independent lookups.
-- **id 74** — "EASA vs FAA certification": same pattern.
+- **id 30** — "Compare AWS Lambda and Google Cloud Functions" stays as a single comparison query.
+- **id 63** — "Kafka vs RabbitMQ": earlier gRPC turn in the history contaminates the reformulation.
+- For **id 70** ("air freight vs sea freight"), the model emits a single comparison instead of two independent lookups.
+- **id 74** ("EASA vs FAA certification") exhibits the same pattern.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/prompt_eval/Report_query_decomposition.md` around lines 50 - 63, The "Common hard cases (both models, both prompts)" bullet list uses repetitive sentence openings and punctuation; please rewrite that section (the header and the four bullets referencing id 30, id 63, id 70, id 74) to vary sentence structure and lead-ins for readability — e.g., turn some bullets into full sentences starting with the case description ("Compare AWS Lambda and Google Cloud Functions") and others starting with the observed failure ("Earlier gRPC turn..."), replace repeated em-dash openings with varied connectors or clauses, and ensure each item still mentions the id and the failure mode (single comparison, contaminated reformulation, under-splitting) without changing meaning.

benchmarks/prompt_eval/eval_temporal_filter_generation.py (1)
55-230: Extract shared infra with `eval_query_decomposition.py` to prevent drift.

`_parse_env_list`, `_build_models`, `_judge_config`, `_model_kwargs`, `build_llm_messages`, `format_prompt`, the `TemporalPredicate`/`Query`/`SearchQueries` schemas, and the CLI path-sandboxing logic are duplicated almost verbatim between the two evaluators. Any future change (e.g. adding `strict=True`, switching to UTC, changing the default `max_completion_tokens`) will need to be applied in both places, and already-present asymmetries will grow. Consider extracting a `benchmarks/prompt_eval/_common.py` (or `lib/`) module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/prompt_eval/eval_temporal_filter_generation.py` around lines 55 - 230, Extract duplicated utilities and schemas into a shared module (e.g., benchmarks/prompt_eval/_common.py): move _parse_env_list, _build_models, _judge_config, _model_kwargs, build_llm_messages, format_prompt and the Pydantic models TemporalPredicate, Query, SearchQueries (and any CLI path-sandboxing helpers) into that module; then import and use them from eval_temporal_filter_generation.py (and eval_query_decomposition.py) so both files reference the single implementation; ensure imports/exports preserve names and behavior (including default values like max_completion_tokens and any langdetect fallback), run lints/tests, and update any relative references to MODELS or judge_config usage to use the shared module.

benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt (1)
31-42: Tighten the `past N` / `bare MONTH` rules — the formulas drop the time unit.

Line 37 writes `[Current − N, tomorrow)` without carrying over the unit (days/weeks/months/years), so a model parsing "past 2 weeks" could read the lower bound as `Current − 2`. Line 41 ("nearest past occurrence") is ambiguous when the bare month equals the current month or is in the future (e.g. "December" queried in April 2026: Dec 2025 or Dec 2026?). Consider:

✏️ Proposed clarification

```diff
-- past N days/weeks/months/years → `[Current − N, tomorrow)`
+- past N days/weeks/months/years → `[Current − N <same unit>, tomorrow)` (e.g. "past 2 weeks" → `[Current − 14 days, tomorrow)`)
 - recent / latest → treat as "past 90 days"
 - since X → one predicate `>= X`
 - before X → one predicate `< X`
-- bare MONTH (no year) → nearest past occurrence
+- bare MONTH (no year) → the most recent fully-elapsed occurrence of that month (if the named month equals the current month, use the occurrence one year prior)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt` around lines 31 - 42, The "past N" and "bare MONTH" rules are ambiguous: change the "past N days/weeks/months/years → `[Current − N, tomorrow)`" line to explicitly carry the time unit (e.g., "past N days/weeks/months/years → `[Current − N {days|weeks|months|years}, tomorrow)`") and ensure "weeks" and "months" are interpreted as whole-week and whole-month arithmetic (not unitless subtraction). Clarify the "bare MONTH (no year) → nearest past occurrence" rule by specifying deterministic behavior: if MONTH is before the current month use MONTH of the current year; if MONTH is the current month or after, use MONTH of the previous year (i.e., the most recent calendar occurrence strictly before today). Ensure these exact phrases (the "past N days/weeks/months/years" rule and the "bare MONTH (no year)" rule) are updated so models include the unit and the tie-breaker rule unambiguously.

openrag/components/pipeline.py (1)
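The bare-MONTH tie-breaker suggested above can be sketched in a few lines. This is an illustrative helper (`resolve_bare_month` is not part of the codebase), assuming the rule "most recent calendar occurrence strictly before today":

```python
# Sketch of the proposed deterministic "bare MONTH (no year)" rule:
# resolve to the most recent calendar occurrence strictly before today.
from datetime import date

def resolve_bare_month(month: int, today: date) -> tuple[date, date]:
    """Return the [start, end) date range for the nearest past occurrence of `month`."""
    # Month strictly before the current one -> this year; otherwise -> last year.
    year = today.year if month < today.month else today.year - 1
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

# "December" asked in April 2026 resolves to December 2025, not the upcoming one.
print(resolve_bare_month(12, date(2026, 4, 10)))
# "February" asked in April 2026 resolves to February 2026 (already elapsed).
print(resolve_bare_month(2, date(2026, 4, 10)))
```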
73-77: The double-quote syntax is correct; single and double quotes are both valid in Milvus TIMESTAMPTZ filter expressions.

The codebase's use of double quotes in ISO `"2026-03-15T00:00:00+00:00"` is valid. Milvus's expression parser accepts both single and double quotes for string literals, including ISO timestamp literals, and treats them equivalently. This is consistent across the codebase (tests, docstrings, and the `to_milvus_filter()` method).

However, the broader design note stands: `TemporalPredicate.operator` permits `["==", "!=", ">", "<", ">=", "<="]`, but all tested and documented filter cases use only ordered comparisons (`<`, `>`, `>=`, `<=`). Equality and inequality checks on TIMESTAMPTZ fields are semantically unusual. If equality comparisons are not intentional, consider narrowing the `Literal` to exclude `"=="` and `"!="`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openrag/components/pipeline.py` around lines 73 - 77, The TemporalPredicate currently allows equality operators but we only use ordered comparisons for TIMESTAMPTZ; update the type/definition for the TemporalPredicate.operator (the Literal used where TemporalPredicate is declared) to remove "==" and "!=" so only "<", ">", "<=", ">=" are permitted; keep the to_milvus_filter() behavior (it can continue to emit ISO "..." strings) and ensure any type hints or validations referencing TemporalPredicate.operator (e.g., construction sites or validators) are updated to the narrower set.
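A minimal sketch of the suggested narrowing, using a plain dataclass rather than the project's actual Pydantic model (the names mirror the review, but this implementation is hypothetical):

```python
# Hedged sketch: only ordered comparisons permitted on the TIMESTAMPTZ field,
# with "==" / "!=" dropped per the review suggestion.
from dataclasses import dataclass
from typing import Literal, get_args

Operator = Literal["<", ">", "<=", ">="]

@dataclass
class TemporalPredicate:
    operator: Operator
    value: str  # ISO-8601 timestamp

    def __post_init__(self) -> None:
        # Runtime guard mirroring the narrowed Literal.
        if self.operator not in get_args(Operator):
            raise ValueError(f"unsupported operator: {self.operator}")

    def to_milvus_filter(self) -> str:
        # Milvus accepts both single and double quotes for string literals.
        return f'created_at {self.operator} "{self.value}"'

p = TemporalPredicate(">=", "2026-03-15T00:00:00+00:00")
print(p.to_milvus_filter())  # created_at >= "2026-03-15T00:00:00+00:00"
```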
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 395f8e6d-4bcc-412e-9833-038603131849
⛔ Files ignored due to path filters (1)
`benchmarks/prompt_eval/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (20)
- benchmarks/prompt_eval/.env.example
- benchmarks/prompt_eval/Report_query_decomposition.md
- benchmarks/prompt_eval/Report_temporal_filter.md
- benchmarks/prompt_eval/datasets/query_decomposition.json
- benchmarks/prompt_eval/datasets/temporal_filter.json
- benchmarks/prompt_eval/eval_query_decomposition.py
- benchmarks/prompt_eval/eval_temporal_filter_generation.py
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v0.txt
- benchmarks/prompt_eval/prompts/query_contextualizer_tmpl_v1.txt
- benchmarks/prompt_eval/pyproject.toml
- benchmarks/prompt_eval/results/result_filter_generation.json
- benchmarks/prompt_eval/results/result_query_decomposition.json
- openrag/components/pipeline.py
- prompts/example1/query_contextualizer_tmpl.txt
- prompts/example1/spoken_style_answer_tmpl.txt
- prompts/example1/sys_prompt_tmpl.txt
- pyproject.toml
- tests/api_tests/api_run/mock_vllm.py
- utility/data_indexer.py
- utility/data_indexer2.py
```python
"""
Prompt evaluation script for query reformulation prompts.

Loads the query_decompostion dataset, runs each test case through one or more
prompt templates against every model defined in the `MODELS` dict, and scores
two decomposition metrics.

Metrics (this pass):
1. decomposition_count_matching — generated query count equals expected
2. decomposition_semantic_coverage — LLM-as-judge boolean: does the generated
   split, taken as a whole, semantically cover every expected sub-query? The
   count does not have to match exactly; only semantic coverage matters.

Usage:
    uv run python eval_query_decompostion.py [OPTIONS]

Options:
    --dataset PATH   Path to the dataset JSON file
                     (default: datasets/query_decompostion.json)
    --prompt PATH    Path to a specific prompt template file.
                     If omitted, all *.txt files in ./prompts/ are evaluated.
    --output PATH    Write JSON results to this file.

Required environment (candidate models under evaluation — semicolon-separated):
    BASE_URLs, API_KEYs, MODELs

Optional environment (LLM-as-judge for semantic coverage; defaults to the
first candidate model if unset):
    JUDGE_BASE_URL, JUDGE_API_KEY, JUDGE_MODEL
"""
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify the dataset filename on disk vs. what the evaluator defaults to.
fd -t f 'query_decomp' benchmarks/prompt_eval/datasets
rg -nP 'query_decompos?t?ion' benchmarks/prompt_eval
```

Repository: linagora/openrag
Length of output: 763
Default dataset path is misspelled — script fails out of the box.
The actual dataset file is query_decomposition.json (with the second i), but the script references query_decompostion in five locations: the docstring (lines 4, 17, 21), the argparse description (line 449), and critically, the default path argument (line 453). Running python eval_query_decomposition.py without --dataset will fail with FileNotFoundError.
✏️ Proposed fix
-Loads the query_decompostion dataset, runs each test case through one or more
+Loads the query_decomposition dataset, runs each test case through one or more
@@
- uv run python eval_query_decompostion.py [OPTIONS]
+ uv run python eval_query_decomposition.py [OPTIONS]
@@
- (default: datasets/query_decompostion.json)
+ (default: datasets/query_decomposition.json)
@@
- default=str(HERE / "datasets" / "query_decompostion.json"),
+ default=str(HERE / "datasets" / "query_decomposition.json"),
@@
- description="Evaluate query_decompostion prompts (decomposition count matching + semantic coverage)."
+ description="Evaluate query_decomposition prompts (decomposition count matching + semantic coverage)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/prompt_eval/eval_query_decomposition.py` around lines 1 - 32, The
default dataset filename is misspelled as "query_decompostion" in multiple
places; update every occurrence to "query_decomposition" — specifically fix the
docstring references and the argparse default for the --dataset option (the
string "datasets/query_decompostion.json" and any plain "query_decompostion"
occurrences) so that the script loads datasets/query_decomposition.json by
default and no longer raises FileNotFoundError when run without --dataset.
```python
# Fallback
if not docs:
    logger.debug(
        "No documents retrieved with temporal filters, falling back to retrieval without filters",
        query=str(query.query),
    )

    docs = await self.retriever.retrieve(
        partition=partition, query=query.query, filter=None, filter_params=filter_params
    )
```
Silent filter-less fallback can return temporally-incorrect results without the caller knowing.
When the filtered retrieval returns zero docs, you re-query without the filter. For a user who asked "show me reports from last week", returning arbitrary older reports is often worse than returning nothing — and nothing upstream (response payload, sources metadata, UI) signals that the date constraint was dropped. At minimum consider:
- Surfacing a flag in the returned metadata (e.g. `filter_dropped=True`) so the caller can warn the user, and
- Gating the fallback on a config flag so deployments that prefer "strict temporal" can opt out.
A `logger.debug` alone will not be visible to operators at INFO level.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openrag/components/pipeline.py` around lines 112 - 122, The current silent
fallback that re-runs self.retriever.retrieve without filters when docs is empty
should be made explicit and configurable: add a pipeline config flag (e.g.,
allow_filterless_fallback) and only perform the unfiltered re-query when that
flag is True; when you do fallback, set a metadata flag (e.g.,
filter_dropped=True) on the returned result (or attach to the docs/container) so
callers can surface that the temporal filter was dropped; replace the
logger.debug call with logger.warning or logger.info so it’s visible at INFO
level and include context (query=query.query, partition, reason); keep the
original retriever call parameters (partition, query.query, filter_params) when
not falling back and ensure all code paths return the same result shape
including the new metadata key.
```diff
 absolute_path = str(file_path.resolve())
 file_id = hashlib.sha256(absolute_path.encode()).hexdigest()

-url_template = f"{args.url}/indexer/partition/{args.partition}/file/{file_id}"
-url = url_template.format(partition_name=args.partition, file_id=file_id)
+url = f"{args.url}/indexer/partition/{args.partition}/file/{file_id}"
```
Re-runs now fail with 409 due to removed pre-check.
With __check_file_exists removed, re-running the script against the same directory will hit /indexer/partition/{partition}/file/{file_id}, which rejects duplicates with HTTP 409 CONFLICT (see openrag/routers/indexer.py:138-142). Since file_id is now a deterministic hash of the absolute path, the same input produces the same ID and all previously indexed files error out on subsequent runs. Consider either skipping on 409 client-side or treating 409 as a non-fatal outcome when logging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@utility/data_indexer.py` around lines 57 - 60, The deterministic file_id
generation (absolute_path -> file_id) now causes duplicate POSTs to url =
f"{args.url}/indexer/partition/{args.partition}/file/{file_id}" to return HTTP
409 and break reruns; update the uploader call in utility/data_indexer.py (the
block that posts to url) to treat 409 as non-fatal: catch the HTTP response or
exception from the POST, and if response.status_code == 409 (or the raised
HTTPError maps to 409), log an informative message and continue (skip that file)
instead of raising; otherwise preserve existing error handling for other status
codes so real failures still surface.
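Client-side, the 409-as-skip handling could be as small as the sketch below. The `Response` class is a minimal stand-in for httpx's response object, and `handle_upload_response` is a hypothetical helper, not the script's actual upload loop:

```python
# Sketch of the suggested handling: treat HTTP 409 (already indexed) as a
# skip rather than a failure, so reruns with deterministic file_ids succeed.
from dataclasses import dataclass

@dataclass
class Response:
    status_code: int

def handle_upload_response(resp: Response, file_path: str) -> str:
    if resp.status_code == 409:
        # Deterministic file_id means reruns hit duplicates: skip, don't raise.
        print(f"already indexed, skipping: {file_path}")
        return "skipped"
    if resp.status_code >= 400:
        # Other 4xx/5xx remain fatal so real failures still surface.
        raise RuntimeError(f"upload failed ({resp.status_code}): {file_path}")
    return "indexed"

print(handle_upload_response(Response(201), "a.pdf"))
print(handle_upload_response(Response(409), "a.pdf"))
```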
```python
try:
    doc = fitz.open(file_path)
    page_count = doc.page_count
    doc.close()
except Exception:
    page_count = 0
```
Silently swallowing `fitz.open` errors masks real problems.

A bare `except Exception` with `page_count = 0` means corrupt/encrypted PDFs become indistinguishable from legitimately empty docs and are silently filtered out by the `<= 10` check. At minimum log the exception (e.g. `logger.warning(f"Failed to read {file_path}: {e}")`) so operators can tell why a file was skipped.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@utility/data_indexer2.py` around lines 63 - 68, The try/except around
fitz.open(file_path) currently swallows all exceptions; change it to except
Exception as e and log the failure before falling back to page_count = 0 (e.g.
logger.warning or logger.exception) so operators can see why a PDF was skipped;
update the block that calls fitz.open(file_path) and sets page_count to capture
the exception object (e) and include file_path and e in the log message.
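The suggested logging fix can be sketched with an injectable opener so it runs without PyMuPDF installed; `safe_page_count` and `open_pdf` are hypothetical names, standing in for the script's inline `fitz.open` block:

```python
# Sketch of the fix: log why a PDF could not be opened instead of silently
# mapping every failure to page_count = 0.
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("indexer")

def safe_page_count(file_path: str, open_pdf) -> int:
    try:
        doc = open_pdf(file_path)  # stand-in for fitz.open(file_path)
        page_count = doc.page_count
        doc.close()
        return page_count
    except Exception as e:
        # Operators can now see why a file was skipped.
        logger.warning("Failed to read %s: %s", file_path, e)
        return 0

class FakeDoc:
    page_count = 23
    def close(self):
        pass

def broken(path):
    raise ValueError("encrypted PDF")

print(safe_page_count("ok.pdf", lambda p: FakeDoc()))
print(safe_page_count("bad.pdf", broken))  # logs a warning, returns 0
```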
Force-pushed from 7d30ac3 to 40ed25d
feat: temporal `created_at` filters in the RAG query pipeline

Summary

This PR adds LLM-generated temporal filters to the retrieval pipeline, allowing users to naturally ask time-bounded questions ("last week's meeting notes", "reports uploaded this month") and have them translated into Milvus `created_at` filter expressions before vector search.

Core changes:

- Structured query models (`Query`, `SearchQueries`) in `pipeline.py`: the query contextualizer now returns a list of typed sub-queries, each optionally carrying a Milvus `created_at` filter expression instead of a bare string
- `RetrieverPipeline.retrieve_docs()` updated to accept a `Query` object and forward its `filter` field to the retriever, enabling date-scoped vector search without schema changes
- The query contextualizer template (`prompts/example1/query_contextualizer_tmpl.txt`) extended with a current-date placeholder (`{current_date}`) for resolving relative expressions like "past 7 days", and a query-language placeholder (`{query_language}`)

Key design decision — filter vs. no filter:

The prompt carefully distinguishes between dates that describe when a document was ingested (generate a filter) and dates that describe what a document is about (no filter).
Benchmarks

An offline evaluation harness (`benchmarks/prompt_eval/`) was built to validate filter generation quality across models (80 cases: 40 filter, 40 split):

- Claude achieves perfect query decomposition (40/40 split cases) and near-perfect filter detection (97.5%). Its 11 remaining failures share a single root cause: adding a spurious `AND created_at <= today` upper bound to open-ended "past N days/weeks" queries — a known prompt issue already addressed by the lower-bound-only rule in the template.
- GPT-5.4 is competitive overall (78.8%) but struggles with the same upper-bound issue: all 9 of its filter rejections come from adding an unnecessary `<= today` to open-ended queries.
- Mistral-Small is not suitable for this task. Filter presence is only 47.5% — it misses more than half of all required `created_at` filters, systematically failing on quarter/month expressions ("Q1 2024", "February 2025") and relative periods ("this month", "last week"). When it does generate a filter, judge accuracy is only 47.4%, reflecting poor date arithmetic (wrong lower bounds, spurious upper bounds).

Test plan
- `Query.filter` defaults to `None`
- `uv run pytest`

Summary by CodeRabbit
Release Notes
New Features
Documentation
Improvements