WIP: Merge Dev to Main by danielaskdd · Pull Request #2846 · HKUDS/LightRAG

danielaskdd · 2026-03-27T05:21:38Z

Dummy PR: Merge Dev to Main (Never try to merge this PR)

…documentation - delete historical document describing RAG-Anything parser compatibility notes - remove context that is no longer needed since LightRAG now consumes external parser services directly

…lpers

The orchestrator previously held ~1250 lines including a ~790-line nested process_document closure, three more worker closures, six near-duplicate doc_status.upsert templates, and two ~70-line FAILED-handling blocks. That made every change to the state machine touch many places at once and prevented any unit-level testing.

Pure structural refactor: behaviour and the pipeline_status concurrency contract are preserved (offline test suite passes: 1123 passed).

Key extractions inside _PipelineMixin:
- _upsert_doc_status_transition: single source of truth for every PARSING / ANALYZING / PROCESSING / PROCESSED / FAILED upsert
- _finalize_doc_failure: shared epilogue for extract / merge stage failures (cancel pending tasks, flush LLM cache, FAILED upsert)
- _purge_stale_extraction_if_resuming: resume-guard that clears stale chunks + KG before re-running extraction under new process_options
- _raise_if_cancelled: cancellation check helper
- _format_job_name: pipeline_status job-name formatter
- _parse_worker / _analyze_worker / _process_worker / _process_single_document: queue-driven workers promoted from nested closures to instance methods
- _run_pipeline_batch: per-batch queue + worker orchestration
- _BatchRunContext (module-level dataclass): shared per-batch state (queues, semaphore, processed_count, etc.) consumed by all workers
- _INFLIGHT_DOC_STATUSES: module-level constant replacing the per-call list

Also drops the unreachable ``pre_parsed_data is None`` branch in process_document (~130 lines): _process_worker is the only caller and always passes pre-parsed data, so the inline parse/analyze fallback is dead code from before the queue-worker model.

apipeline_process_enqueue_documents itself is now ~180 lines focused on busy/request_pending lifecycle and the main loop. A pure helper, build_chunks_dict_from_chunking_result, also moves to utils_pipeline.py so chunk-id resolution can be unit-tested independently.

pipeline.py: 3740 → 3531 lines (-209)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
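The closure-to-method promotion described above can be illustrated with a small sketch. This is not the LightRAG code: `BatchContext`, `MiniPipeline`, and the `None` shutdown sentinel are hypothetical names chosen for illustration. It only shows the general shape of workers that read shared per-batch state from a dataclass instead of capturing it from an enclosing function.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class BatchContext:
    """Hypothetical stand-in for per-batch state shared by all workers."""

    parse_queue: asyncio.Queue = field(default_factory=asyncio.Queue)
    process_queue: asyncio.Queue = field(default_factory=asyncio.Queue)
    processed: list = field(default_factory=list)


class MiniPipeline:
    async def _parse_worker(self, ctx: BatchContext) -> None:
        # Pull raw documents, "parse" them, and hand them to the next stage.
        while True:
            doc = await ctx.parse_queue.get()
            if doc is None:  # sentinel: forward the shutdown signal and stop
                await ctx.process_queue.put(None)
                return
            await ctx.process_queue.put(doc.strip().lower())

    async def _process_worker(self, ctx: BatchContext) -> None:
        # Consume parsed documents until the sentinel arrives.
        while True:
            doc = await ctx.process_queue.get()
            if doc is None:
                return
            ctx.processed.append(doc)

    async def run_batch(self, docs: list[str]) -> list[str]:
        # The context is created per batch, so workers share no hidden state.
        ctx = BatchContext()
        for doc in docs:
            ctx.parse_queue.put_nowait(doc)
        ctx.parse_queue.put_nowait(None)
        await asyncio.gather(self._parse_worker(ctx), self._process_worker(ctx))
        return ctx.processed
```

Because each worker is an ordinary method taking an explicit context, it can be driven in isolation from a unit test, which is the testability benefit the commit message points to.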

Pure method reorder inside lightrag/pipeline.py::_PipelineMixin: no implementation, signature, or behaviour changes. Method definition order in a Python class body is independent of runtime resolution, so this is text-only.

Before this commit, the 26 methods of the mixin were arranged roughly in implementation-history order: public APIs were scattered (the three apipeline_* entry points spanned ~1700 lines apart), parser engines came after their internals, and multimodal helpers were split across three distant locations.

The new layout groups methods into 8 sections with region banners:
1. Public document ingestion API (apipeline_enqueue_documents, apipeline_enqueue_error_documents, apipeline_process_enqueue_documents)
2. Pipeline orchestration (_run_pipeline_batch, _validate_and_fix_document_consistency, _atomic_release_busy_or_consume_pending, _format_job_name)
3. Cascading queue workers (_parse_worker, _analyze_worker, _process_worker)
4. Single-document state machine (_process_single_document, _purge_stale_extraction_if_resuming)
5. doc_status state-machine helpers (_upsert_doc_status_transition, _raise_if_cancelled, _finalize_doc_failure)
6. Parser engines (parse_native, parse_mineru, parse_docling)
7. Parser internals (_call_protocol_parse_service, _persist_parsed_full_docs, _mark_duplicate_after_parse, _resolve_source_file_for_parser, _write_lightrag_document_from_content_list)
8. Multimodal / VLM (analyze_multimodal, _run_multimodal_postprocess_hook, _build_mm_chunks_from_sidecars)

Verified: ruff check passes; full offline pytest suite still 1123 passed (matches the pre-reorder baseline); AST method set is identical to HEAD~1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
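One way to run an "AST method set is identical" check is sketched below. The commit does not say how the check was actually performed, so this is an assumed approach using the standard `ast` module; `method_names` and the two sample sources are illustrative only.

```python
import ast


def method_names(source: str, class_name: str) -> set[str]:
    """Collect the names of methods defined directly in the body of `class_name`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            }
    raise ValueError(f"class {class_name!r} not found")


# Two versions of the same class that differ only in method order.
BEFORE = '''
class Mixin:
    def _helper(self): ...
    async def public_api(self): ...
'''

AFTER = '''
class Mixin:
    async def public_api(self): ...
    def _helper(self): ...
'''

# A pure reorder leaves the set of method names unchanged.
same = method_names(BEFORE, "Mixin") == method_names(AFTER, "Mixin")
```

Comparing name sets only proves that no method was added or dropped. Confirming that bodies are unchanged would need a per-method comparison, for example of `ast.dump` output.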

…nqueue-split refactor(pipeline): split apipeline_process_enqueue_documents into helpers

- replace audit-specific function and message references with generic docx/chunking terminology - update user-facing error messages to reference LightRAG workflow instead of audit workflow - adjust docstrings and comments to reflect chunking purpose over auditing purpose

…for file processing pipeline - add new content extraction engines: native, mineru, docling - add new text chunking methods: Recursive, Vector, Paragraph - replace deprecated `S` option with `P` (Paragraph semantic chunking) - clarify pipeline stage names and option effects - update version note to indicate current dev branch status

…ontent extraction engine descriptions
- replace legacy `S` chunking option with new `V` and `P` options across documentation
- add explanation of raw vs lightrag dual format for different extraction engines
- clarify paragraph semantic chunking only works with lightrag format content

♻️ refactor(constants): restructure chunking option constants with type aliases
- introduce `ProcessChunkingOption` type alias for literal chunking values
- replace `PROCESS_OPTION_CHUNK_HEADING` with `PROCESS_OPTION_CHUNK_VECTOR` and `PROCESS_OPTION_CHUNK_PARAGRAH`
- update frozensets definitions to use new typed constants

♻️ refactor(parser_routing): update process options parsing for new chunking types
- replace `Literal` type with `ProcessChunkingOption` type alias
- update validation and parsing logic to support `V` and `P` chunking options
- adjust error messages and comments to reflect new chunking options

✅ test(chunking): update tests for expanded chunking option support
- update T5 test case to cover `R/V/P` deferred strategies instead of `R/S`
- rename test function from `test_rs_chunking_warns_and_falls_back_to_fixed` to `test_nonfixed_chunking_warns_and_falls_back_to_fixed`
- update warning message assertions to match new `R/V/P` format

…pi pipelines - introduce PROCESS_OPTION_CHUNK_FIXED as default for api document processing - ensure process_options is always passed to apipeline_enqueue_documents - update pipeline_index_texts to use fixed chunk processing - add test coverage for process_options in enqueue and index operations

- introduce new `lightrag.chunker` package with `chunking_by_paragraph_semantic` for heading-aware, table-preserving document splits - implement four-stage pipeline: heading-driven split (A), table resplit/glue (B), anchor-driven long-block split (C), level-aware merge (D) - add `chunking_by_fixed_token` under standardized file-chunker contract alongside legacy `chunking_by_token_size` - wire chunker dispatch in `_process_single_document` via `chunking_explicit` flag to route between legacy and new contracts - document full algorithm in `docs/ParagraphSemanticChunking-zh.md` with thresholds, invariants, and comparison table - relocate `chunking_by_token_size` from `operate.py` to `chunker/token_size.py` and update imports/tests accordingly

…chunking - add explicit fallback_reason tracking for empty blocks_path and empty rows - consolidate warning log to single location with detailed context - add keyword argument for chunk_token_size parameter

- correct automatic fallback behavior from F to R for P option in legacy path

…atch contract - replace legacy chunking_func spy with new contract spy on chunking_by_fixed_token - add monkeypatch to verify dispatcher routes to standardized file-chunker - update warning assertion from "R/V/P" to "R/V" to match new deferred-strategy message - expand docstring to explain new chunker dispatch behavior and test purpose

Wires up the recursive-character (R) and semantic-vector (V) chunking strategies with LangChain integrations, and introduces a per-document chunk_options snapshot persisted in full_docs so chunker parameters become reproducible and per-file overridable.

- R chunker: chunking_by_recursive_character wraps RecursiveCharacterTextSplitter with the LightRAG Tokenizer plugged in as length_function so chunk_size measures tokens rather than characters
- V chunker: chunking_by_semantic_vector (async) wraps SemanticChunker via an _AsyncEmbeddingFuncAdapter that bridges LightRAG's async EmbeddingFunc to LangChain's sync Embeddings interface using run_coroutine_threadsafe; falls back to R when embedding_func is None
- _process_single_document dispatcher: full F/R/V/P branch matrix; reads per-strategy kwargs from full_docs[doc_id].chunk_options and splats into the chunker call
- chunk_options as the canonical chunker-config carrier:
  - new dict-shaped column on full_docs persisted at enqueue time
  - apipeline_enqueue_documents accepts chunk_options (dict | list[dict] | None); None falls back to resolve_chunk_options(self.addon_params)
  - per-strategy sub-dicts mirror each chunker's keyword-only signature
  - frozen at enqueue so env / addon_params changes don't break re-runs
- addon_params['chunker'] as the runtime-mutable config source:
  - default_chunker_config() builds env-driven defaults (CHUNK_F_*, CHUNK_R_*, CHUNK_V_*) and is invoked from default_addon_params / normalize_addon_params
  - users can mutate rag.addon_params['chunker'] at runtime to change defaults for subsequently enqueued documents
- parse_optional_float helper relocated to lightrag.utils for reuse
- resolve_chunk_options moved to lightrag.parser_routing as the natural pair to parse_process_options
- LightRAG dataclass cleanup: removed 5 chunker-specific fields (now in addon_params['chunker']) and the _resolve_chunk_options method
- apipeline_enqueue_documents / apipeline_process_enqueue_documents: removed split_by_character / split_by_character_only parameters; the F-strategy runtime args are now baked into chunk_options by ainsert before enqueue, so the pipeline plumbing no longer carries them
- paragraph_semantic chunker fallback target switched from F to R per the documented contract; raises ImportError when langchain isn't installed instead of silently degrading
- pyproject.toml: langchain-text-splitters>=0.3 and langchain-experimental>=0.3 added to [api] extras
- env.example: full set of chunker env vars documented
- docs/FileProcessingConfiguration-zh.md: env table extended with F vars, new "分块器参数与 chunk_options" (chunker parameters and chunk_options) section explains the three-tier config flow (env → addon_params → chunk_options snapshot → chunker call)

Tests: +14 new offline tests across recursive_character, semantic_vector, chunk_options_persistence, plus updates to test_chunking_raw_lightrag_parity for the new R/V dispatch contract. Full suite: 1137 passed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
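The async-to-sync bridge mentioned for the V chunker can be sketched with the standard library alone. This is a minimal illustration, not the project's _AsyncEmbeddingFuncAdapter: `SyncEmbeddingAdapter`, `fake_embed`, and `start_background_loop` are hypothetical names, and the sketch assumes the event loop runs in a different thread from the caller.

```python
import asyncio
import threading


class SyncEmbeddingAdapter:
    """Hypothetical adapter exposing a blocking call over an async embedding function."""

    def __init__(self, async_embed, loop: asyncio.AbstractEventLoop):
        self._async_embed = async_embed
        self._loop = loop

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Schedule the coroutine on the loop owned by another thread and
        # block the calling thread until the result is available.
        future = asyncio.run_coroutine_threadsafe(self._async_embed(texts), self._loop)
        return future.result()


async def fake_embed(texts: list[str]) -> list[list[float]]:
    # Placeholder embedding: one dimension holding the text length.
    await asyncio.sleep(0)
    return [[float(len(text))] for text in texts]


def start_background_loop() -> tuple[asyncio.AbstractEventLoop, threading.Thread]:
    # Run an event loop in a daemon thread so sync callers can submit work to it.
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread
```

Calling `embed_documents` from the loop's own thread would deadlock, because `future.result()` blocks the thread that would have to run the coroutine. A bridge of this kind only works when the synchronous caller lives on a separate thread.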

- exclude index 0 from anchor candidates to avoid empty leading slices - add regression tests for short lead paragraph edge case - document recursion guard behavior in comments

…oles - cap target_chunks at len(rows) to avoid zero-row slices when total tokens exceed target_max - enforce forward progress with max(start + 1, ...) in row boundary calculation - track cur_role across flush operations to preserve "first"/"last" table chunk semantics - add regression tests for empty slice bug and role propagation between consecutive oversized tables
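The two guards named above (capping the chunk count at the row count and forcing each boundary to advance by at least one row) can be shown in isolation. The function below is an assumed, simplified row splitter written for illustration. `split_rows` and its even-by-row-count boundary rule are not taken from the project.

```python
import math


def split_rows(row_tokens: list[int], target_max: int) -> list[tuple[int, int]]:
    """Split table rows into half-open (start, end) ranges, never producing an empty range."""
    n = len(row_tokens)
    if n == 0:
        return []
    total = sum(row_tokens)
    target_chunks = max(1, math.ceil(total / target_max))
    # Guard 1: never ask for more chunks than there are rows.
    target_chunks = min(target_chunks, n)

    slices = []
    start = 0
    for i in range(1, target_chunks + 1):
        if i == target_chunks:
            end = n  # the last chunk takes all remaining rows
        else:
            # Guard 2: each boundary must move forward by at least one row.
            end = min(n, max(start + 1, (i * n) // target_chunks))
        slices.append((start, end))
        start = end
        if start >= n:
            break
    return slices
```

With two 500-token rows and `target_max=100`, the uncapped chunk count would be 10 and most slices would be empty. The cap reduces it to 2, giving one row per slice.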

Add paragraph semantic chunking strategy

- replace eager `setdefault` with lazy `in` check for chunker key - prevent environment parsing errors from blocking caller-supplied chunker values
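The difference between the eager and lazy forms comes from Python evaluating the default argument of `dict.setdefault` before the call, even when the key already exists. A minimal sketch follows. The `default_chunker_config` here is a simplified stand-in, not the project's function, and the default value is an arbitrary placeholder.

```python
def default_chunker_config(env: dict) -> dict:
    # Stand-in: raises ValueError when the env value is not an integer.
    return {"chunk_size": int(env.get("CHUNK_SIZE", "1000"))}


def normalize_eager(params: dict, env: dict) -> dict:
    # The default is computed first, so a bad env value raises even though
    # the caller already supplied a "chunker" entry.
    params.setdefault("chunker", default_chunker_config(env))
    return params


def normalize_lazy(params: dict, env: dict) -> dict:
    # The env is only parsed when the caller did not supply a value.
    if "chunker" not in params:
        params["chunker"] = default_chunker_config(env)
    return params
```

With `env = {"CHUNK_SIZE": "oops"}` and a caller-supplied `chunker` entry, `normalize_lazy` returns the caller's value untouched while `normalize_eager` raises `ValueError`.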

…size resolution - introduce `_apply_chunk_size_overlay` to reconcile `chunk_token_size` and `chunk_overlap_token_size` across config tiers - change `chunk_token_size` and `chunk_overlap_token_size` fields to `Optional[int]` with `None` default - update `default_chunker_config` to only read strategy-specific env vars, leaving slots empty for overlay fallback - add precedence chain: addon_params explicit > strategy env > legacy constructor field > legacy env - back-fill legacy instance fields after resolution for backward compatibility with downstream readers - update Chinese documentation to reflect new configuration hierarchy and priority rules - add comprehensive tests covering constructor overlay, addon_params precedence, strategy env wins, and legacy fallback
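The precedence chain listed above amounts to "first value that is set wins". The sketch below shows that rule only. The function and parameter names are hypothetical, and the real _apply_chunk_size_overlay may handle more cases (for example back-filling legacy fields).

```python
from typing import Optional


def resolve_chunk_token_size(
    addon_value: Optional[int],
    strategy_env: Optional[int],
    constructor_field: Optional[int],
    legacy_env: Optional[int],
) -> Optional[int]:
    """Return the first non-None candidate in documented priority order."""
    # Order mirrors: addon_params explicit > strategy env >
    # legacy constructor field > legacy env.
    for candidate in (addon_value, strategy_env, constructor_field, legacy_env):
        if candidate is not None:
            return candidate
    return None
```

Using `None` rather than a numeric default as the "unset" marker is what makes the chain work, which matches the change of the two fields to `Optional[int]` with a `None` default.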

…h semantic strategy - introduce CHUNK_P_SIZE env variable to decouple P strategy chunk size from global CHUNK_SIZE - update default_chunker_config to parse and inject CHUNK_P_SIZE into paragraph_semantic options - modify pipeline to extract and apply per-strategy chunk_token_size for P strategy with fallback to resolved top-level size - document new env variable and configuration in Chinese docs with usage guidance - add tests verifying env override behavior and fallback to global chunk size when unset

- add upper version bounds for langchain-text-splitters (<2) and langchain-experimental (<1) - remove duplicate langchain 1.x and langchain-core 1.x entries from uv.lock - add missing explicit dependencies (defusedxml, langchain-experimental, langchain-text-splitters) to api/evaluation/offline/test extras - pin async-timeout to 4.0.3 for python < 3.11 to resolve version conflicts

…e file processing documentation - reorganize document with numbered sections for server deployment workflow - add quick start section with legacy, native, and combined configuration examples - introduce detailed chunk_options configuration with environment variable reference - add new chapter for python sdk usage covering runtime api and deprecated parameters - improve clarity on engine fallback, validation, and priority chains - relocate and expand storage layout, duplicate detection, concurrency, and resume rules sections - add appendix for upgrade notes regarding deprecated multimodal global switch

…n params - ensure chunk size configuration is reconciled when runtime addon params are set - maintain consistency across all four configuration tiers

feat(chunker): add R/V chunkers and chunk_options snapshot mechanism

- move extraction-related settings below multimodal parsing section - uncomment CHUNK_P_SIZE to set default value of 3000 - improve logical grouping by placing docling settings before extraction configs

… to reflect slim snapshot design
- clarify distinction between addon_params["chunker"] full baseline and full_docs["chunk_options"] slim snapshot
- document F/R/V/P selector mapping to strategy sub-dict keys
- explain that only selected strategy parameters are persisted, others are dropped
- update reprocess behavior to overwrite both process_options and chunk_options together

♻️ refactor(parser_routing): introduce slim chunk options projection
- add _CHUNK_STRATEGY_KEYS mapping for F/R/V/P selectors to sub-dict keys
- implement chunk_strategy_key() to resolve active strategy from process_options
- implement slim_chunk_options() to project full config down to active strategy only
- update resolve_chunk_options() to accept process_options and return slim snapshot
- conditionally apply F-strategy runtime args only when fixed_token is active

♻️ refactor(pipeline): integrate slim chunk options into enqueue and process paths
- update _chunk_options_at() to project caller-supplied dicts against per-doc process_options
- use slim_chunk_options() for caller-supplied chunk_options to avoid mutable shared state
- pass process_options through resolve_chunk_options() for fresh snapshot builds
- update legacy fallback in process path to scope snapshot to per-doc strategy

✅ test(chunk_options_persistence): adapt tests to slim snapshot contract
- split single-doc snapshot test into F/R/V strategy-specific enqueues
- verify non-active strategy keys are absent from persisted chunk_options
- update per-file chunk_options list test to use R selector and verify slim drops
- adjust constructor overlay and precedence tests to match single-strategy shape
- add explicit process_options selectors where strategy-specific envs are tested
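A slim projection of this kind can be sketched as a dictionary filter. The code below is an assumption-laden illustration: the sub-dict key names for R and V are guessed from the chunker names, the selector is passed as a single letter rather than parsed from process_options, and the function is not the project's slim_chunk_options().

```python
import copy

# Assumed selector-to-key mapping; only "fixed_token" and "paragraph_semantic"
# appear verbatim in the commit messages.
CHUNK_STRATEGY_KEYS = {
    "F": "fixed_token",
    "R": "recursive_character",
    "V": "semantic_vector",
    "P": "paragraph_semantic",
}


def slim_chunk_options(full_config: dict, selector: str) -> dict:
    """Keep shared top-level keys plus the active strategy's sub-dict only."""
    active = CHUNK_STRATEGY_KEYS[selector]
    inactive = set(CHUNK_STRATEGY_KEYS.values()) - {active}
    # Deep-copy so the persisted snapshot never aliases the caller's dict.
    return {
        key: copy.deepcopy(value)
        for key, value in full_config.items()
        if key not in inactive
    }
```

The deep copy reflects the "avoid mutable shared state" point: mutating the full baseline after enqueue must not change a snapshot that was already taken.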

refactor(full-docs): unify path handling and slim chunk_options snapshot

PG storage previously dropped pipeline-derived fields (parse_engine, content_hash, chunk_options, sidecar/heading metadata) because the underlying tables and SQL templates listed a fixed column set. JSON storage persists full payload dicts and thus retained these fields naturally, leading to a behavioral divergence under PG deployments.

This change brings the three doc tables in line with what the pipeline writes:
  LIGHTRAG_DOC_FULL + sidecar_location, parse_format, content_hash, process_options, chunk_options (JSONB), parse_engine
  LIGHTRAG_DOC_STATUS + content_hash
  LIGHTRAG_DOC_CHUNKS + heading (JSONB), sidecar (JSONB)

Notable behavior:
* doc_status upsert protects existing content_hash on state transitions that omit the field via COALESCE(NULLIF(EXCLUDED.content_hash, ''), existing).
* full_docs upsert uses straight EXCLUDED overwrite; callers in _persist_parsed_full_docs already do read-merge-write, so COALESCE is unnecessary there.
* JSONB read paths normalize missing/None values to {} for parity with downstream isinstance(dict) / .get(...) checks.
* PGDocStatusStorage overrides get_doc_by_file_basename and get_doc_by_content_hash with indexed SQL instead of the base-class full-table scan. The basename lookup includes a defensive LIKE fallback for legacy a.[hint].ext rows, pushed down to the DB.

Migrations:
* _migrate_doc_full_add_pipeline_fields adds the six new full_docs columns + partial workspace+content_hash index.
* _migrate_doc_status_add_content_hash adds the column + partial index.
* _migrate_text_chunks_add_heading_sidecar adds heading/sidecar JSONB.

Each migration is idempotent, individually try-wrapped per column so one failure does not block the rest, and called from check_tables(). No JSONB GIN indexes on heading/sidecar, since access is always by chunk id.

Tests:
* Updated tests/test_postgres_upsert.py to assert the new tuple positions for text_chunks and full_docs upserts, including default fallbacks for missing fields and explicit None -> {} coercion for JSONB columns.
* Added doc_status upsert tests covering the content_hash position and the COALESCE/NULLIF SQL guard.
* Added tests/test_postgres_doc_status_lookup.py covering the new get_doc_by_file_basename and get_doc_by_content_hash overrides, including the legacy hint-segment LIKE fallback.
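The COALESCE/NULLIF guard can be tried out without a PostgreSQL server. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in, since SQLite accepts the same `ON CONFLICT ... DO UPDATE` and `excluded` syntax for this case. The table and column set are simplified assumptions, not the LIGHTRAG_DOC_STATUS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_status (id TEXT PRIMARY KEY, status TEXT, content_hash TEXT)"
)

# An empty or NULL incoming hash is turned into NULL by NULLIF, and COALESCE
# then falls back to the value already stored in the row.
UPSERT = """
INSERT INTO doc_status (id, status, content_hash) VALUES (?, ?, ?)
ON CONFLICT (id) DO UPDATE SET
    status = excluded.status,
    content_hash = COALESCE(NULLIF(excluded.content_hash, ''), doc_status.content_hash)
"""

conn.execute(UPSERT, ("doc-1", "PARSING", "abc123"))
# A later state transition that omits the hash must not erase it.
conn.execute(UPSERT, ("doc-1", "PROCESSED", ""))

row = conn.execute(
    "SELECT status, content_hash FROM doc_status WHERE id = ?", ("doc-1",)
).fetchone()
```

After both statements the row holds the new status and the original hash, which is the protection the commit describes for state transitions that omit content_hash.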

…emantics - change content_hash from VARCHAR(64) to TEXT for hash algorithm agnosticism - add COALESCE/NULLIF guards to upsert_doc_full for pipeline-derived columns - prevent partial upserts from overwriting existing metadata with defaults - escape LIKE metacharacters in get_doc_by_file_basename for safe pattern matching - add ORDER BY precedence to rank exact matches above legacy hint rows - update tests to verify SQL-level protection and escaping behavior
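Escaping LIKE metacharacters matters because `_` and `%` are common in file names and would otherwise act as wildcards. A minimal sketch of the idea follows, again using SQLite as a stand-in for PostgreSQL. `escape_like` and `like_matches` are illustrative helpers, not the project's code.

```python
import sqlite3


def escape_like(value: str, escape: str = "\\") -> str:
    """Escape the escape character first, then the % and _ wildcards."""
    return (
        value.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def like_matches(text: str, pattern: str) -> bool:
    # Evaluate `text LIKE pattern` with backslash declared as the escape character.
    conn = sqlite3.connect(":memory:")
    try:
        (hit,) = conn.execute("SELECT ? LIKE ? ESCAPE '\\'", (text, pattern)).fetchone()
        return bool(hit)
    finally:
        conn.close()
```

For example, the pattern `escape_like("a_b") + ".%"` matches `a_b.pdf` but not `axb.pdf`, whereas the unescaped pattern `a_b.%` would match both.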

…layer - remove normalization logic from doc status storage implementations - move canonicalization responsibility to callers in lightrag.py and pipeline.py - simplify postgres, json and base storage to perform exact match only - update tests to verify canonical paths are persisted and matched correctly

- introduce _call_source_file_resolver helper to tolerate legacy test doubles - add source_file_name tracking in content_data and doc_status metadata - extend _resolve_source_file_for_parser with parser engine and source_file_name params - implement canonical name matching and parser hint variant resolution - add support for finding hinted variants by canonical name and engine - carry source_file_name through doc_status metadata transitions - add comprehensive tests for resolver priority and mineru integration

…r handling - introduce FilenameParserHintError exception for invalid filename hints - add _validate_filename_hint_for_resolution to fail fast on malformed hints - update resolve_file_parser_directives to validate hints before processing - handle FilenameParserHintError in document_routes with error document enqueue - add comprehensive tests for invalid filename hint rejection - fix regex to reject empty bracket hints while preserving non-throwing helpers - ensure endpoint requirement validation during ingestion entrypoints

- change options-only syntax from [OPTIONS] to [-OPTIONS] to disambiguate from engine names - update split_engine_and_options to reject bare options without leading hyphen - add validation for empty options in both [-] and [engine-] forms - update documentation and tests to reflect new syntax requirements
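The disambiguation rule can be sketched as a small parser over the text inside the brackets. Everything below is a hedged illustration: `split_hint` and `HintError` are hypothetical names, the engine list is taken from the engines named elsewhere in this PR, and the real split_engine_and_options may differ in validation details.

```python
KNOWN_ENGINES = {"native", "mineru", "docling"}


class HintError(ValueError):
    """Raised for a malformed bracket hint (illustrative exception type)."""


def split_hint(hint: str) -> tuple:
    """Split the bracket content into (engine, options); either may be None."""
    if not hint:
        raise HintError("empty hint")
    engine, sep, options = hint.partition("-")
    if sep and not options:
        # Covers both the "-" and "engine-" forms.
        raise HintError(f"empty options in hint {hint!r}")
    if engine and engine not in KNOWN_ENGINES:
        # Bare options such as "P" land here, because without a leading
        # hyphen the text is read as an engine name.
        raise HintError(f"unknown engine {engine!r}")
    return (engine or None, options or None)
```

Under this sketch `split_hint("-P")` yields `(None, "P")` and `split_hint("mineru-P")` yields `("mineru", "P")`, while `"P"`, `"-"`, and `"mineru-"` are all rejected.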

…rtlib - remove unused lazy_external_import utility - use importlib.import_module with explicit package anchor for reliable resolution - add inline comments explaining import path resolution behavior
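For reference, `importlib.import_module` resolves a relative module name against an explicit package anchor. The sketch below uses the standard-library `json` package purely as a neutral example. The module paths that the factory actually imports are not shown in this PR.

```python
import importlib

# A leading dot makes the name relative; `package` supplies the anchor,
# so this resolves to the absolute module name "json.decoder".
module = importlib.import_module(".decoder", package="json")

# An absolute name needs no anchor and returns the same module object.
same_module = importlib.import_module("json.decoder")
```

Omitting `package` with a relative name raises `TypeError`, which is why an explicit anchor is needed for reliable resolution.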

- remove redundant partial index on workspace and content_hash - prevent potential migration failures from duplicate index attempts

refactor(pg): align PG storage fields with JSON storage for parity

Mirror the two-stage dedup helpers already shipped on JsonDocStatusStorage and PostgreSQLDocStatusStorage so the ingestion pipeline can deduplicate documents when the doc-status backend is Redis. Without these methods the pipeline falls back to a missing-attribute path and silently loses dedup.

- get_doc_by_file_basename(basename): early-returns on empty input and on the "unknown_source" sentinel, then SCAN+pipeline.GET across the namespace for an exact file_path match. Returns (doc_id, doc_data).
- get_doc_by_content_hash(content_hash): early-returns on empty input so legacy rows without the field are never matched, then scans for an exact content_hash match. Returns (doc_id, doc_data).

Both follow the existing get_doc_by_file_path SCAN pattern (no new index) to keep behavior parity with the JSON backend.

Tests use an in-memory fake Redis (no live service required); covers hit/miss, empty input, the unknown_source sentinel, and legacy rows that predate the content_hash field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
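The lookup contract described above (short-circuit on empty input and on the sentinel, then exact match) can be sketched independently of Redis. The functions below scan a plain dict that stands in for the stored rows. They are synchronous and backend-free by assumption, and returning `None` on a miss is also an assumption of this sketch.

```python
from typing import Optional

UNKNOWN_SOURCE = "unknown_source"


def get_doc_by_file_basename(rows: dict, basename: str) -> Optional[tuple]:
    # Empty input and the sentinel can never identify a real document.
    if not basename or basename == UNKNOWN_SOURCE:
        return None
    for doc_id, data in rows.items():
        if data.get("file_path") == basename:
            return doc_id, data
    return None


def get_doc_by_content_hash(rows: dict, content_hash: str) -> Optional[tuple]:
    # An empty hash must miss, so legacy rows lacking the field never match.
    if not content_hash:
        return None
    for doc_id, data in rows.items():
        if data.get("content_hash") == content_hash:
            return doc_id, data
    return None
```

Without the empty-hash guard, a query for `""` could match rows that store an empty `content_hash`, and two unrelated legacy documents would be reported as duplicates.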

JSON storage recently added two new lookup methods on DocStatusStorage and a content_hash field used for content-based dedup. Mongo inherited the base-class scan-all fallbacks, which work but bypass the indexable content_hash field, so dedup on large workspaces fell back to fetching every status row.

This change brings MongoDocStatusStorage up to par with JSON storage and the recently-landed PG and Redis updates (Redis PR #3098):
* Override get_doc_by_file_basename with an indexed find_one on file_path (mirrors the existing file_path index already created in create_and_migrate_indexes_if_not_exists). Honors the empty-string and "unknown_source" sentinel short-circuits required by the base contract.
* Override get_doc_by_content_hash with an indexed find_one on content_hash. Empty hash is a guaranteed miss so legacy rows missing the field cannot match.
* Add a partial index on content_hash (filter expression: exists, string, non-empty) to keep the index small and match the PG partial-index semantics. Legacy unprefixed "content_hash" index name added to the workspace-prefix migration list so historical names get cleaned up.
* Wire partialFilterExpression through the create_index dispatch.

For MongoKVStorage no schema changes are needed: BSON is schemaless and the existing $set upsert path persists whatever pipeline-derived fields the caller hands in (sidecar_location, parse_format, parse_engine, content_hash, process_options, chunk_options on full_docs; heading + sidecar on text_chunks). Reads via find_one return full docs.

The DocProcessingStatus dataclass already defaults content_hash to None, so legacy doc_status rows without the field still decode cleanly via _prepare_doc_status_data + DocProcessingStatus(**data).

Tests: extend tests/test_mongo_storage.py with TestMongoDocStatusLookup covering hit, miss, empty-input short-circuit, and the "unknown_source" sentinel for both new overrides. All 8 tests in the file pass offline.

feat(redis): add basename and content_hash lookups for doc status

Mirror the JSON / MongoDB / Redis storage parity work for the OpenSearch backend so the new pipeline dedup paths can resolve documents by canonical basename or content_hash without falling back to the per-status full-scan default implementation.

What's changed:
- Add `content_hash` keyword mapping when creating new doc status indices.
- For pre-existing indices, call put_mapping on initialize to add the content_hash keyword field (idempotent; skipped when already present). Older indices created by prior releases get the mapping on first startup.
- Implement `get_doc_by_file_basename` using a term query on the file_path keyword field with the same empty/unknown_source short-circuits as the base class.
- Implement `get_doc_by_content_hash` using a term query on the new content_hash keyword field; empty values short-circuit so legacy rows cannot match via coercion.

Tolerance: full_docs / text_chunks rely on the existing dynamic mapping, so legacy records without the new sidecar / parse / chunk fields are read back with the missing keys absent; consumers use `.get()` defaults. The new content_hash optional attribute on DocProcessingStatus also defaults to None, so doc_status rows from older indices deserialize cleanly.

…-jSuYr refactor(mongo): align doc-status storage with JSON storage for parity

- verify PyMongoError is caught and returns None in get_doc_by_file_basename - verify PyMongoError is caught and returns None in get_doc_by_content_hash - ensure storage failures are treated as "no match" with error logged

…TNxo8 feat(opensearch): add basename and content_hash lookups for doc status

…tion - eliminate unnecessary `:-/` fallback in redis uri path capture - ensure exact path preservation from original uri during local service normalization

… setup scripts - add /app/data/prompts directory creation in dockerfile and dockerfile.lite - add PROMPT_DIR environment variable and volume mounts in all compose files - update setup scripts to support PROMPT_DIR configuration and idempotent mount injection - fix redis test default uri to remove trailing slash

- consolidate verbose log strings in parse_mineru and parse_docling to reduce noise - shorten analyze_multimodal opt-in missing and backfill log lines for clarity - remove redundant file_path references from completion and cache hit logs - update chinese documentation to match simplified log format

@abstractmethod

… methods - remove default implementations of get_doc_by_file_basename and get_doc_by_content_hash - add @abstractmethod decorator to enforce implementation in subclasses - clean up unused asdict import from dataclasses module - simplify docstrings to reflect abstract nature of methods
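The effect of marking the two lookups abstract can be seen in a small sketch. The class names below are illustrative stand-ins rather than the project's DocStatusStorage hierarchy, and the return types are left unspecified on purpose.

```python
from abc import ABC, abstractmethod


class DocStatusStorageSketch(ABC):
    """Illustrative base class: both lookups must be provided by backends."""

    @abstractmethod
    async def get_doc_by_file_basename(self, basename: str):
        """Return the matching document or None."""

    @abstractmethod
    async def get_doc_by_content_hash(self, content_hash: str):
        """Return the matching document or None."""


class IncompleteBackend(DocStatusStorageSketch):
    # Only one of the two abstract methods is implemented.
    async def get_doc_by_file_basename(self, basename: str):
        return None


class CompleteBackend(IncompleteBackend):
    async def get_doc_by_content_hash(self, content_hash: str):
        return None
```

Instantiating `IncompleteBackend` raises `TypeError`, so a backend that forgets one of the lookups fails at construction time instead of silently inheriting a full-scan default.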

…ation - correct the info log message format for empty equations sidecar in analyze_multimodal

- replace specific entity_type subdirectory with entire prompts directory - update comment to reflect user customized prompt directory purpose

danielaskdd temporarily deployed to pypi April 26, 2026 21:48 — with GitHub Actions Inactive

danielaskdd force-pushed the dev branch from 8ef8a29 to 5c2f738 Compare April 27, 2026 06:07

danielaskdd and others added 28 commits May 8, 2026 21:44

📝 docs(RAGAnythingParserAlignment): remove outdated parser alignment …

ee5672b


Merge pull request #3041 from danielaskdd/refactor/pipeline-process-e…

27f9042


♻️ refactor(chunker): improve fallback logging in paragraph semantic …

8e8f8d6


📝 docs(docs): fix file processing configuration documentation

c16970f


Fix lintings

3585ddc

🐛 fix(chunker): prevent infinite recursion in paragraph semantic split

46e40b1


Merge branch 'feat/paragraph-chunker' into feat/add-R-V-chunker

f583b1f

Merge pull request #3044 from danielaskdd/feat/paragraph-chunker

77255be


Merge branch 'dev' into feat/add-R-V-chunker

9ff6533

♻️ refactor(addon_params): defer chunker default evaluation to runtime

c951f8f


♻️ refactor(lightrag): integrate chunk size overlay into runtime addo…

3a0628f


Merge pull request #3046 from danielaskdd/feat/add-R-V-chunker

50fda8e


Merge branch 'main' into dev

7fa7144

🔧 chore(env.example): reorganize configuration sections

ad19e9f


danielaskdd and others added 30 commits May 19, 2026 18:34

Fix lintings

d12a559

Merge pull request #3093 from danielaskdd/refact/full-docs-storage

a80574b


♻️ refactor(factory): replace lazy_external_import with standard impo…

6ef5423


🐛 fix(postgres): remove duplicate partial index creation

ed09c28


Merge pull request #3094 from HKUDS/claude/refactor-pg-storage-zIrOi

8022ef4


Merge pull request #3098 from danielaskdd/refact/redis

c149fde


Fix lintings

efa44a5

Merge branch 'dev' into claude/sync-mongo-storage-updates-jSuYr

1d4142a

Merge pull request #3099 from HKUDS/claude/sync-mongo-storage-updates…

324cca3


Merge branch 'claude/sync-mongo-storage-updates-jSuYr' into dev

49d07fe

Merge branch 'dev' into claude/update-opensearch-storage-TNxo8

4aff9ae

Merge pull request #3100 from HKUDS/claude/update-opensearch-storage-…

66f3a9a


Fix lintings

cc102ac

🔧 chore(setup): remove redundant default slash in redis uri normaliza…

1e3283d


📝 docs(docs): update log message in file processing pipeline document…

f3dddb1


🔧 chore(gitignore): update ignore rules for prompts directory

c050cc4



WIP: Merge Dev to Main #2846
danielaskdd wants to merge 602 commits into main from dev


3 participants
