WIP: Merge Dev to Main #2846
Draft
danielaskdd wants to merge 602 commits into
…documentation - delete historical document describing RAG-Anything parser compatibility notes - remove context that is no longer needed since LightRAG now consumes external parser services directly
…lpers
The orchestrator previously held ~1250 lines including a ~790-line nested process_document closure, three more worker closures, six near-duplicate doc_status.upsert templates, and two ~70-line FAILED-handling blocks. That made every change to the state machine touch many places at once and prevented any unit-level testing.
Pure structural refactor: behaviour and the pipeline_status concurrency contract are preserved (offline test suite passes: 1123 passed).
Key extractions inside _PipelineMixin:
- _upsert_doc_status_transition: single source of truth for every PARSING / ANALYZING / PROCESSING / PROCESSED / FAILED upsert
- _finalize_doc_failure: shared epilogue for extract / merge stage failures (cancel pending tasks, flush LLM cache, FAILED upsert)
- _purge_stale_extraction_if_resuming: resume-guard that clears stale chunks + KG before re-running extraction under new process_options
- _raise_if_cancelled: cancellation check helper
- _format_job_name: pipeline_status job-name formatter
- _parse_worker / _analyze_worker / _process_worker / _process_single_document: queue-driven workers promoted from nested closures to instance methods
- _run_pipeline_batch: per-batch queue + worker orchestration
- _BatchRunContext (module-level dataclass): shared per-batch state (queues, semaphore, processed_count, etc.) consumed by all workers
- _INFLIGHT_DOC_STATUSES: module-level constant replacing the per-call list
Also drops the unreachable ``pre_parsed_data is None`` branch in process_document (~130 lines): _process_worker is the only caller and always passes pre-parsed data, so the inline parse/analyze fallback is dead code from before the queue-worker model.
apipeline_process_enqueue_documents itself is now ~180 lines focused on busy/request_pending lifecycle and the main loop. A pure helper, build_chunks_dict_from_chunking_result, also moves to utils_pipeline.py so chunk-id resolution can be unit-tested independently.
pipeline.py: 3740 → 3531 lines (-209)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pure method reorder inside lightrag/pipeline.py::_PipelineMixin — no
implementation, signature, or behaviour changes. Method definition order
in a Python class body is independent of runtime resolution, so this is
text-only.
Before this commit, the 26 methods of the mixin were arranged roughly
in implementation-history order: public APIs were scattered (the three
apipeline_* entry points spanned ~1700 lines apart), parser engines came
after their internals, and multimodal helpers were split across three
distant locations.
The new layout groups methods into 8 sections with region banners:
1. Public document ingestion API (apipeline_enqueue_documents,
apipeline_enqueue_error_documents, apipeline_process_enqueue_documents)
2. Pipeline orchestration (_run_pipeline_batch,
_validate_and_fix_document_consistency,
_atomic_release_busy_or_consume_pending, _format_job_name)
3. Cascading queue workers (_parse_worker, _analyze_worker,
_process_worker)
4. Single-document state machine (_process_single_document,
_purge_stale_extraction_if_resuming)
5. doc_status state-machine helpers (_upsert_doc_status_transition,
_raise_if_cancelled, _finalize_doc_failure)
6. Parser engines (parse_native, parse_mineru, parse_docling)
7. Parser internals (_call_protocol_parse_service, _persist_parsed_full_docs,
_mark_duplicate_after_parse, _resolve_source_file_for_parser,
_write_lightrag_document_from_content_list)
8. Multimodal / VLM (analyze_multimodal, _run_multimodal_postprocess_hook,
_build_mm_chunks_from_sidecars)
Verified: ruff check passes; full offline pytest suite still 1123 passed
(matches the pre-reorder baseline); AST method set is identical to
HEAD~1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
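The "AST method set is identical" check in the reorder commit could be reproduced roughly as follows. This is a minimal sketch, not the script the author ran: the helper name `method_names` and the two inline toy sources are hypothetical stand-ins for the real `pipeline.py` at HEAD and HEAD~1.

```python
import ast


def method_names(source: str, class_name: str) -> set[str]:
    """Return the names of all methods defined directly in `class_name`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            }
    raise ValueError(f"class {class_name!r} not found")


# Two toy versions of a mixin that differ only in method order.
BEFORE = """
class _PipelineMixin:
    def _format_job_name(self): ...
    async def apipeline_enqueue_documents(self): ...
"""
AFTER = """
class _PipelineMixin:
    async def apipeline_enqueue_documents(self): ...
    def _format_job_name(self): ...
"""

# A set comparison ignores definition order, which is what a reorder needs.
same = method_names(BEFORE, "_PipelineMixin") == method_names(AFTER, "_PipelineMixin")
```

Comparing sets only proves that no method was added or dropped; it does not prove the bodies are unchanged, so it complements the test run rather than replacing it.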
…nqueue-split refactor(pipeline): split apipeline_process_enqueue_documents into helpers
- replace audit-specific function and message references with generic docx/chunking terminology - update user-facing error messages to reference LightRAG workflow instead of audit workflow - adjust docstrings and comments to reflect chunking purpose over auditing purpose
…for file processing pipeline - add new content extraction engines: native, mineru, docling - add new text chunking methods: Recursive, Vector, Paragraph - replace deprecated `S` option with `P` (Paragraph semantic chunking) - clarify pipeline stage names and option effects - update version note to indicate current dev branch status
…ontent extraction engine descriptions
- replace legacy `S` chunking option with new `V` and `P` options across documentation
- add explanation of raw vs lightrag dual format for different extraction engines
- clarify paragraph semantic chunking only works with lightrag format content
♻️ refactor(constants): restructure chunking option constants with type aliases
- introduce `ProcessChunkingOption` type alias for literal chunking values
- replace `PROCESS_OPTION_CHUNK_HEADING` with `PROCESS_OPTION_CHUNK_VECTOR` and `PROCESS_OPTION_CHUNK_PARAGRAH`
- update frozensets definitions to use new typed constants
♻️ refactor(parser_routing): update process options parsing for new chunking types
- replace `Literal` type with `ProcessChunkingOption` type alias
- update validation and parsing logic to support `V` and `P` chunking options
- adjust error messages and comments to reflect new chunking options
✅ test(chunking): update tests for expanded chunking option support
- update T5 test case to cover `R/V/P` deferred strategies instead of `R/S`
- rename test function from `test_rs_chunking_warns_and_falls_back_to_fixed` to `test_nonfixed_chunking_warns_and_falls_back_to_fixed`
- update warning message assertions to match new `R/V/P` format
…pi pipelines - introduce PROCESS_OPTION_CHUNK_FIXED as default for api document processing - ensure process_options is always passed to apipeline_enqueue_documents - update pipeline_index_texts to use fixed chunk processing - add test coverage for process_options in enqueue and index operations
- introduce new `lightrag.chunker` package with `chunking_by_paragraph_semantic` for heading-aware, table-preserving document splits - implement four-stage pipeline: heading-driven split (A), table resplit/glue (B), anchor-driven long-block split (C), level-aware merge (D) - add `chunking_by_fixed_token` under standardized file-chunker contract alongside legacy `chunking_by_token_size` - wire chunker dispatch in `_process_single_document` via `chunking_explicit` flag to route between legacy and new contracts - document full algorithm in `docs/ParagraphSemanticChunking-zh.md` with thresholds, invariants, and comparison table - relocate `chunking_by_token_size` from `operate.py` to `chunker/token_size.py` and update imports/tests accordingly
…chunking - add explicit fallback_reason tracking for empty blocks_path and empty rows - consolidate warning log to single location with detailed context - add keyword argument for chunk_token_size parameter
- correct automatic fallback behavior from F to R for P option in legacy path
…atch contract - replace legacy chunking_func spy with new contract spy on chunking_by_fixed_token - add monkeypatch to verify dispatcher routes to standardized file-chunker - update warning assertion from "R/V/P" to "R/V" to match new deferred-strategy message - expand docstring to explain new chunker dispatch behavior and test purpose
Wires up the recursive-character (R) and semantic-vector (V) chunking
strategies with LangChain integrations, and introduces a per-document
chunk_options snapshot persisted in full_docs so chunker parameters
become reproducible and per-file overridable.
- R chunker: chunking_by_recursive_character wraps RecursiveCharacterTextSplitter
with the LightRAG Tokenizer plugged in as length_function so chunk_size
measures tokens rather than characters
- V chunker: chunking_by_semantic_vector (async) wraps SemanticChunker via
an _AsyncEmbeddingFuncAdapter that bridges LightRAG's async EmbeddingFunc
to LangChain's sync Embeddings interface using run_coroutine_threadsafe;
falls back to R when embedding_func is None
- _process_single_document dispatcher: full F/R/V/P branch matrix; reads
per-strategy kwargs from full_docs[doc_id].chunk_options and splats into
the chunker call
- chunk_options as the canonical chunker-config carrier:
- new dict-shaped column on full_docs persisted at enqueue time
- apipeline_enqueue_documents accepts chunk_options (dict | list[dict] | None);
None falls back to resolve_chunk_options(self.addon_params)
- per-strategy sub-dicts mirror each chunker's keyword-only signature
- frozen at enqueue so env / addon_params changes don't break re-runs
- addon_params['chunker'] as the runtime-mutable config source:
- default_chunker_config() builds env-driven defaults (CHUNK_F_*, CHUNK_R_*,
CHUNK_V_*) and is invoked from default_addon_params / normalize_addon_params
- users can mutate rag.addon_params['chunker'] at runtime to change defaults
for subsequently enqueued documents
- parse_optional_float helper relocated to lightrag.utils for reuse
- resolve_chunk_options moved to lightrag.parser_routing as the natural
pair to parse_process_options
- LightRAG dataclass cleanup: removed 5 chunker-specific fields (now in
addon_params['chunker']) and the _resolve_chunk_options method
- apipeline_enqueue_documents / apipeline_process_enqueue_documents:
removed split_by_character / split_by_character_only parameters; the
F-strategy runtime args are now baked into chunk_options by ainsert
before enqueue, so the pipeline plumbing no longer carries them
- paragraph_semantic chunker fallback target switched from F to R per
the documented contract; raises ImportError when langchain isn't
installed instead of silently degrading
- pyproject.toml: langchain-text-splitters>=0.3 and langchain-experimental>=0.3
added to [api] extras
- env.example: full set of chunker env vars documented
- docs/FileProcessingConfiguration-zh.md: env table extended with F vars,
new "分块器参数与 chunk_options" section explains the three-tier config
flow (env → addon_params → chunk_options snapshot → chunker call)
Tests: +14 new offline tests across recursive_character, semantic_vector,
chunk_options_persistence, plus updates to test_chunking_raw_lightrag_parity
for the new R/V dispatch contract. Full suite: 1137 passed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
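The async-to-sync bridge described for the V chunker can be sketched as below. This is an illustration under assumptions, not the actual `_AsyncEmbeddingFuncAdapter`: the class name `AsyncToSyncEmbeddings` and the `fake_embed` function are hypothetical, and the only LangChain detail relied on is that its `Embeddings` interface exposes blocking `embed_documents` and `embed_query` methods.

```python
import asyncio
import threading
from typing import Awaitable, Callable, List

AsyncEmbed = Callable[[List[str]], Awaitable[List[List[float]]]]


class AsyncToSyncEmbeddings:
    """Expose an async embedding function through blocking methods."""

    def __init__(self, embed: AsyncEmbed, loop: asyncio.AbstractEventLoop):
        self._embed = embed
        self._loop = loop  # must be running in a *different* thread

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Schedule the coroutine on the owning loop, then block this thread
        # until the result is available.
        future = asyncio.run_coroutine_threadsafe(self._embed(texts), self._loop)
        return future.result()

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]


async def fake_embed(texts: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for a real embedding call: one number per text.
    await asyncio.sleep(0)
    return [[float(len(t))] for t in texts]


# Run an event loop in a background thread, then call the adapter from here.
loop = asyncio.new_event_loop()
thread = threading.Thread(target=loop.run_forever, daemon=True)
thread.start()
try:
    adapter = AsyncToSyncEmbeddings(fake_embed, loop)
    vectors = adapter.embed_documents(["ab", "cde"])
    query_vector = adapter.embed_query("hello")
finally:
    loop.call_soon_threadsafe(loop.stop)
    thread.join()
    loop.close()
```

Calling `future.result()` from the same thread that runs the loop would deadlock, which is why a bridge of this kind has to be invoked from a worker thread.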
- exclude index 0 from anchor candidates to avoid empty leading slices - add regression tests for short lead paragraph edge case - document recursion guard behavior in comments
…oles - cap target_chunks at len(rows) to avoid zero-row slices when total tokens exceed target_max - enforce forward progress with max(start + 1, ...) in row boundary calculation - track cur_role across flush operations to preserve "first"/"last" table chunk semantics - add regression tests for empty slice bug and role propagation between consecutive oversized tables
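The two guards named in this fix (capping `target_chunks` at `len(rows)` and forcing forward progress with `max(start + 1, ...)`) can be illustrated with a simplified row splitter. This is a sketch under assumptions: the function `split_rows` is hypothetical and divides rows evenly by count, whereas the real chunker presumably weighs token counts per row.

```python
import math
from typing import List


def split_rows(rows: List[str], row_tokens: List[int], target_max: int) -> List[List[str]]:
    """Split table rows into roughly even, never-empty slices."""
    if not rows:
        return []
    total = sum(row_tokens)
    # Slices needed by token budget, capped so there is at most one per row.
    target_chunks = max(1, min(len(rows), math.ceil(total / target_max)))
    slices: List[List[str]] = []
    start = 0
    for i in range(1, target_chunks + 1):
        if i == target_chunks:
            end = len(rows)
        else:
            # max(start + 1, ...) guarantees each slice takes at least one row.
            end = max(start + 1, round(i * len(rows) / target_chunks))
            # Leave at least one row for every slice still to come.
            end = min(end, len(rows) - (target_chunks - i))
        slices.append(rows[start:end])
        start = end
    return slices


# Three oversized rows: the token budget alone would ask for 15 slices.
oversized = split_rows(["a", "b", "c"], [500, 500, 500], target_max=100)
even = split_rows(["a", "b", "c", "d", "e"], [100] * 5, target_max=250)
```

Without the cap, 15 requested slices over 3 rows would necessarily produce empty slices, which is the failure mode the regression tests guard against.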
Add paragraph semantic chunking strategy
- replace eager `setdefault` with lazy `in` check for chunker key - prevent environment parsing errors from blocking caller-supplied chunker values
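The difference between the eager and lazy forms is that `dict.setdefault` evaluates its default argument before checking the key. A minimal sketch, in which `default_chunker_config` is a hypothetical stand-in that only records calls (the real one parses environment variables and may raise):

```python
calls = []


def default_chunker_config() -> dict:
    """Hypothetical stand-in: record each call instead of parsing env vars."""
    calls.append("called")
    return {"fixed_token": {}}


def normalize_eager(params: dict) -> dict:
    # The default is built even when "chunker" is already present.
    params.setdefault("chunker", default_chunker_config())
    return params


def normalize_lazy(params: dict) -> dict:
    # The default is built only when the caller did not supply one.
    if "chunker" not in params:
        params["chunker"] = default_chunker_config()
    return params


supplied = {"chunker": {"fixed_token": {"chunk_token_size": 800}}}

normalize_lazy(dict(supplied))
lazy_calls = len(calls)

normalize_eager(dict(supplied))
eager_calls = len(calls) - lazy_calls
```

With the lazy form, a malformed environment variable can only fail callers that actually rely on the env-driven defaults.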
…size resolution - introduce `_apply_chunk_size_overlay` to reconcile `chunk_token_size` and `chunk_overlap_token_size` across config tiers - change `chunk_token_size` and `chunk_overlap_token_size` fields to `Optional[int]` with `None` default - update `default_chunker_config` to only read strategy-specific env vars, leaving slots empty for overlay fallback - add precedence chain: addon_params explicit > strategy env > legacy constructor field > legacy env - back-fill legacy instance fields after resolution for backward compatibility with downstream readers - update Chinese documentation to reflect new configuration hierarchy and priority rules - add comprehensive tests covering constructor overlay, addon_params precedence, strategy env wins, and legacy fallback
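The precedence chain (addon_params explicit > strategy env > legacy constructor field > legacy env) amounts to a first-non-empty lookup. The sketch below is not `_apply_chunk_size_overlay` itself: the function `resolve_chunk_size` and the variable name `CHUNK_R_SIZE` are assumptions made for illustration.

```python
from typing import Mapping, Optional


def resolve_chunk_size(
    addon_value: Optional[int],
    strategy_env_name: str,
    constructor_value: Optional[int],
    env: Mapping[str, str],
    legacy_env_name: str = "CHUNK_SIZE",
) -> Optional[int]:
    """Return the first value that is set, in descending precedence order."""
    if addon_value is not None:          # 1. explicit addon_params entry
        return addon_value
    if env.get(strategy_env_name):       # 2. strategy-specific env var
        return int(env[strategy_env_name])
    if constructor_value is not None:    # 3. legacy constructor field
        return constructor_value
    if env.get(legacy_env_name):         # 4. legacy env var
        return int(env[legacy_env_name])
    return None


# "CHUNK_R_SIZE" is a hypothetical strategy-specific variable name.
env = {"CHUNK_R_SIZE": "900", "CHUNK_SIZE": "1500"}
from_addon = resolve_chunk_size(1000, "CHUNK_R_SIZE", 1200, env)
from_strategy_env = resolve_chunk_size(None, "CHUNK_R_SIZE", 1200, env)
from_constructor = resolve_chunk_size(None, "CHUNK_R_SIZE", 1200, {"CHUNK_SIZE": "1500"})
from_legacy_env = resolve_chunk_size(None, "CHUNK_R_SIZE", None, {"CHUNK_SIZE": "1500"})
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps each tier testable in isolation.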
…h semantic strategy - introduce CHUNK_P_SIZE env variable to decouple P strategy chunk size from global CHUNK_SIZE - update default_chunker_config to parse and inject CHUNK_P_SIZE into paragraph_semantic options - modify pipeline to extract and apply per-strategy chunk_token_size for P strategy with fallback to resolved top-level size - document new env variable and configuration in Chinese docs with usage guidance - add tests verifying env override behavior and fallback to global chunk size when unset
- add upper version bounds for langchain-text-splitters (<2) and langchain-experimental (<1) - remove duplicate langchain 1.x and langchain-core 1.x entries from uv.lock - add missing explicit dependencies (defusedxml, langchain-experimental, langchain-text-splitters) to api/evaluation/offline/test extras - pin async-timeout to 4.0.3 for python < 3.11 to resolve version conflicts
…e file processing documentation - reorganize document with numbered sections for server deployment workflow - add quick start section with legacy, native, and combined configuration examples - introduce detailed chunk_options configuration with environment variable reference - add new chapter for python sdk usage covering runtime api and deprecated parameters - improve clarity on engine fallback, validation, and priority chains - relocate and expand storage layout, duplicate detection, concurrency, and resume rules sections - add appendix for upgrade notes regarding deprecated multimodal global switch
…n params - ensure chunk size configuration is reconciled when runtime addon params are set - maintain consistency across all four configuration tiers
feat(chunker): add R/V chunkers and chunk_options snapshot mechanism
- move extraction-related settings below multimodal parsing section - uncomment CHUNK_P_SIZE to set default value of 3000 - improve logical grouping by placing docling settings before extraction configs
… to reflect slim snapshot design
- clarify distinction between addon_params["chunker"] full baseline and full_docs["chunk_options"] slim snapshot
- document F/R/V/P selector mapping to strategy sub-dict keys
- explain that only selected strategy parameters are persisted, others are dropped
- update reprocess behavior to overwrite both process_options and chunk_options together
♻️ refactor(parser_routing): introduce slim chunk options projection
- add _CHUNK_STRATEGY_KEYS mapping for F/R/V/P selectors to sub-dict keys
- implement chunk_strategy_key() to resolve active strategy from process_options
- implement slim_chunk_options() to project full config down to active strategy only
- update resolve_chunk_options() to accept process_options and return slim snapshot
- conditionally apply F-strategy runtime args only when fixed_token is active
♻️ refactor(pipeline): integrate slim chunk options into enqueue and process paths
- update _chunk_options_at() to project caller-supplied dicts against per-doc process_options
- use slim_chunk_options() for caller-supplied chunk_options to avoid mutable shared state
- pass process_options through resolve_chunk_options() for fresh snapshot builds
- update legacy fallback in process path to scope snapshot to per-doc strategy
✅ test(chunk_options_persistence): adapt tests to slim snapshot contract
- split single-doc snapshot test into F/R/V strategy-specific enqueues
- verify non-active strategy keys are absent from persisted chunk_options
- update per-file chunk_options list test to use R selector and verify slim drops
- adjust constructor overlay and precedence tests to match single-strategy shape
- add explicit process_options selectors where strategy-specific envs are tested
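A slim projection of this kind can be sketched as follows. The sub-dict key names are assumptions inferred from the chunker names in these commit messages, the function takes a selector letter directly instead of a full process_options string, and keeping top-level entries alongside the active sub-dict is also an assumption. The real `slim_chunk_options()` may differ on all three points.

```python
from copy import deepcopy
from typing import Any, Dict

# Selector letter -> sub-dict key. Key names are assumed, not confirmed.
CHUNK_STRATEGY_KEYS = {
    "F": "fixed_token",
    "R": "recursive_character",
    "V": "semantic_vector",
    "P": "paragraph_semantic",
}


def slim_chunk_options(full: Dict[str, Any], selector: str) -> Dict[str, Any]:
    """Drop every strategy sub-dict except the one the selector activates."""
    active = CHUNK_STRATEGY_KEYS[selector]
    inactive = set(CHUNK_STRATEGY_KEYS.values()) - {active}
    # Deep-copy so the snapshot never shares mutable state with the baseline.
    return {k: deepcopy(v) for k, v in full.items() if k not in inactive}


baseline = {
    "chunk_token_size": 1200,
    "fixed_token": {"split_by_character": None},
    "recursive_character": {"separators": ["\n\n", "\n"]},
    "semantic_vector": {},
    "paragraph_semantic": {"chunk_token_size": 3000},
}

snapshot = slim_chunk_options(baseline, "R")
snapshot["recursive_character"]["separators"].append(" ")  # baseline unaffected
```

Copying on projection is what lets the persisted snapshot stay frozen while `addon_params["chunker"]` keeps changing at runtime.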
refactor(full-docs): unify path handling and slim chunk_options snapshot
PG storage previously dropped pipeline-derived fields (parse_engine,
content_hash, chunk_options, sidecar/heading metadata) because the
underlying tables and SQL templates listed a fixed column set. JSON
storage persists full payload dicts and thus retained these fields
naturally, leading to a behavioral divergence under PG deployments.
This change brings the three doc tables in line with what the pipeline
writes:
LIGHTRAG_DOC_FULL + sidecar_location, parse_format, content_hash,
process_options, chunk_options (JSONB),
parse_engine
LIGHTRAG_DOC_STATUS + content_hash
LIGHTRAG_DOC_CHUNKS + heading (JSONB), sidecar (JSONB)
Notable behavior:
* doc_status upsert protects existing content_hash on state transitions
that omit the field via
COALESCE(NULLIF(EXCLUDED.content_hash, ''), existing).
* full_docs upsert uses straight EXCLUDED overwrite — callers in
_persist_parsed_full_docs already do read-merge-write, so COALESCE
is unnecessary there.
* JSONB read paths normalize missing/None values to {} for parity with
downstream isinstance(dict) / .get(...) checks.
* PGDocStatusStorage overrides get_doc_by_file_basename and
get_doc_by_content_hash with indexed SQL instead of the base-class
full-table scan. The basename lookup includes a defensive LIKE
fallback for legacy a.[hint].ext rows, pushed down to the DB.
Migrations:
* _migrate_doc_full_add_pipeline_fields adds the six new full_docs
columns + partial workspace+content_hash index.
* _migrate_doc_status_add_content_hash adds the column + partial index.
* _migrate_text_chunks_add_heading_sidecar adds heading/sidecar JSONB.
Each migration is idempotent, individually try-wrapped per column so
one failure does not block the rest, and called from check_tables().
No JSONB GIN indexes on heading/sidecar — access is always by chunk id.
Tests:
* Updated tests/test_postgres_upsert.py to assert the new tuple
positions for text_chunks and full_docs upserts, including default
fallbacks for missing fields and explicit None -> {} coercion for
JSONB columns.
* Added doc_status upsert tests covering the content_hash position
and the COALESCE/NULLIF SQL guard.
* Added tests/test_postgres_doc_status_lookup.py covering the new
get_doc_by_file_basename and get_doc_by_content_hash overrides,
including the legacy hint-segment LIKE fallback.
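The `COALESCE(NULLIF(EXCLUDED.content_hash, ''), existing)` guard can be tried out locally. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL (its upsert syntax is close enough for this purpose) and a simplified three-column table; the real LIGHTRAG_DOC_STATUS schema and SQL template are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_status (id TEXT PRIMARY KEY, status TEXT, content_hash TEXT)"
)

# An empty incoming hash becomes NULL via NULLIF, so COALESCE falls back
# to the value already stored in the row.
UPSERT = """
INSERT INTO doc_status (id, status, content_hash)
VALUES (?, ?, ?)
ON CONFLICT (id) DO UPDATE SET
    status = excluded.status,
    content_hash = COALESCE(NULLIF(excluded.content_hash, ''), doc_status.content_hash)
"""

conn.execute(UPSERT, ("doc-1", "PARSING", "abc123"))   # initial insert with hash
conn.execute(UPSERT, ("doc-1", "PROCESSED", ""))       # transition omits the hash

row = conn.execute(
    "SELECT status, content_hash FROM doc_status WHERE id = 'doc-1'"
).fetchone()
```

The status column is overwritten unconditionally while the hash survives, which matches the "protect on state transitions" behavior described above.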
…emantics - change content_hash from VARCHAR(64) to TEXT for hash algorithm agnosticism - add COALESCE/NULLIF guards to upsert_doc_full for pipeline-derived columns - prevent partial upserts from overwriting existing metadata with defaults - escape LIKE metacharacters in get_doc_by_file_basename for safe pattern matching - add ORDER BY precedence to rank exact matches above legacy hint rows - update tests to verify SQL-level protection and escaping behavior
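Escaping LIKE metacharacters before building the legacy `a.[hint].ext` pattern might look like this. It is a sketch under assumptions: `escape_like` is a hypothetical helper, SQLite stands in for PostgreSQL, and the table is reduced to a single column.

```python
import sqlite3


def escape_like(value: str, escape: str = "\\") -> str:
    """Escape the LIKE wildcards % and _, plus the escape character itself."""
    return (
        value.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (file_path TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?)",
    [("my_report.[mineru].pdf",), ("myXreport.[mineru].pdf",)],
)

# Only the hint segment is a wildcard; stem and extension are literal.
stem, ext = "my_report", "pdf"
pattern = f"{escape_like(stem)}.[%].{escape_like(ext)}"

rows = conn.execute(
    "SELECT file_path FROM docs WHERE file_path LIKE ? ESCAPE '\\'", (pattern,)
).fetchall()
```

Without the escape step, the underscore in `my_report` would act as a single-character wildcard and also match `myXreport.[mineru].pdf`.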
…layer - remove normalization logic from doc status storage implementations - move canonicalization responsibility to callers in lightrag.py and pipeline.py - simplify postgres, json and base storage to perform exact match only - update tests to verify canonical paths are persisted and matched correctly
- introduce _call_source_file_resolver helper to tolerate legacy test doubles - add source_file_name tracking in content_data and doc_status metadata - extend _resolve_source_file_for_parser with parser engine and source_file_name params - implement canonical name matching and parser hint variant resolution - add support for finding hinted variants by canonical name and engine - carry source_file_name through doc_status metadata transitions - add comprehensive tests for resolver priority and mineru integration
…r handling - introduce FilenameParserHintError exception for invalid filename hints - add _validate_filename_hint_for_resolution to fail fast on malformed hints - update resolve_file_parser_directives to validate hints before processing - handle FilenameParserHintError in document_routes with error document enqueue - add comprehensive tests for invalid filename hint rejection - fix regex to reject empty bracket hints while preserving non-throwing helpers - ensure endpoint requirement validation during ingestion entrypoints
- change options-only syntax from [OPTIONS] to [-OPTIONS] to disambiguate from engine names - update split_engine_and_options to reject bare options without leading hyphen - add validation for empty options in both [-] and [engine-] forms - update documentation and tests to reflect new syntax requirements
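The disambiguation rule can be illustrated with a small parser for the text inside the brackets. This is a hypothetical re-implementation, not the project's `split_engine_and_options`: the name `split_hint_sketch`, the use of `ValueError` instead of `FilenameParserHintError`, and the return shape are all assumptions.

```python
from typing import Optional, Tuple

KNOWN_ENGINES = {"native", "mineru", "docling"}  # engines named in this PR


def split_hint_sketch(hint: str) -> Tuple[Optional[str], Optional[str]]:
    """Split the text inside [...] into (engine, options)."""
    if not hint:
        raise ValueError("empty hint")
    engine, sep, options = hint.partition("-")
    if sep and not options:
        # Covers both "[-]" and "[engine-]".
        raise ValueError(f"empty options in hint {hint!r}")
    if engine and engine not in KNOWN_ENGINES:
        # A bare token such as "P" is no longer read as options.
        raise ValueError(f"unknown engine {engine!r}; options need a leading '-'")
    return (engine or None, options or None)


def is_rejected(hint: str) -> bool:
    """Return True when the hint fails validation."""
    try:
        split_hint_sketch(hint)
    except ValueError:
        return True
    return False


engine_only = split_hint_sketch("mineru")
engine_and_options = split_hint_sketch("mineru-P")
options_only = split_hint_sketch("-P")
```

Requiring the leading hyphen means an unknown bare token can be reported as a mistyped engine name instead of being silently interpreted as options.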
…rtlib - remove unused lazy_external_import utility - use importlib.import_module with explicit package anchor for reliable resolution - add inline comments explaining import path resolution behavior
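Resolving a module relative to an explicit package anchor with `importlib.import_module` works as sketched below. The standard-library `json` package is used purely as a stand-in; the actual module paths touched by this commit are not shown in the message.

```python
import importlib

# Relative name + explicit package anchor: resolves to "json.decoder"
# regardless of which module makes the call.
decoder = importlib.import_module(".decoder", package="json")

# The absolute form returns the same cached module object.
same = importlib.import_module("json.decoder")
```

The `package` argument is required whenever the name starts with a dot; omitting it raises `TypeError`.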
- remove redundant partial index on workspace and content_hash - prevent potential migration failures from duplicate index attempts
refactor(pg): align PG storage fields with JSON storage for parity
Mirror the two-stage dedup helpers already shipped on JsonDocStatusStorage and PostgreSQLDocStatusStorage so the ingestion pipeline can deduplicate documents when the doc-status backend is Redis. Without these methods the pipeline falls back to a missing-attribute path and silently loses dedup.
- get_doc_by_file_basename(basename): early-returns on empty input and on the "unknown_source" sentinel, then SCAN+pipeline.GET across the namespace for an exact file_path match. Returns (doc_id, doc_data).
- get_doc_by_content_hash(content_hash): early-returns on empty input so legacy rows without the field are never matched, then scans for an exact content_hash match. Returns (doc_id, doc_data).
Both follow the existing get_doc_by_file_path SCAN pattern (no new index) to keep behavior parity with the JSON backend.
Tests use an in-memory fake Redis (no live service required); covers hit/miss, empty input, the unknown_source sentinel, and legacy rows that predate the content_hash field.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JSON storage recently added two new lookup methods on DocStatusStorage and a content_hash field used for content-based dedup. Mongo inherited the base-class scan-all fallbacks, which work but bypass the indexable content_hash field, so dedup on large workspaces fell back to fetching every status row.
This change brings MongoDocStatusStorage up to par with JSON storage and the recently-landed PG and Redis updates (Redis PR #3098):
* Override get_doc_by_file_basename with an indexed find_one on file_path (mirrors the existing file_path index already created in create_and_migrate_indexes_if_not_exists). Honors the empty-string and "unknown_source" sentinel short-circuits required by the base contract.
* Override get_doc_by_content_hash with an indexed find_one on content_hash. Empty hash is a guaranteed miss so legacy rows missing the field cannot match.
* Add a partial index on content_hash (filter expression: exists, string, non-empty) to keep the index small and match the PG partial-index semantics. Legacy unprefixed "content_hash" index name added to the workspace-prefix migration list so historical names get cleaned up.
* Wire partialFilterExpression through the create_index dispatch.
For MongoKVStorage no schema changes are needed: BSON is schemaless and the existing $set upsert path persists whatever pipeline-derived fields the caller hands in (sidecar_location, parse_format, parse_engine, content_hash, process_options, chunk_options on full_docs; heading + sidecar on text_chunks). Reads via find_one return full docs.
The DocProcessingStatus dataclass already defaults content_hash to None, so legacy doc_status rows without the field still decode cleanly via _prepare_doc_status_data + DocProcessingStatus(**data).
Tests: extend tests/test_mongo_storage.py with TestMongoDocStatusLookup covering hit, miss, empty-input short-circuit, and the "unknown_source" sentinel for both new overrides. All 8 tests in the file pass offline.
feat(redis): add basename and content_hash lookups for doc status
Mirror the JSON / MongoDB / Redis storage parity work for the OpenSearch backend so the new pipeline dedup paths can resolve documents by canonical basename or content_hash without falling back to the per-status full-scan default implementation.
What's changed:
- Add `content_hash` keyword mapping when creating new doc status indices.
- For pre-existing indices, call put_mapping on initialize to add the content_hash keyword field (idempotent; skipped when already present). Older indices created by prior releases get the mapping on first startup.
- Implement `get_doc_by_file_basename` using a term query on the file_path keyword field with the same empty/unknown_source short-circuits as the base class.
- Implement `get_doc_by_content_hash` using a term query on the new content_hash keyword field; empty values short-circuit so legacy rows cannot match via coercion.
Tolerance: full_docs / text_chunks rely on the existing dynamic mapping, so legacy records without the new sidecar / parse / chunk fields are read back with the missing keys absent; consumers use `.get()` defaults. The new content_hash optional attribute on DocProcessingStatus also defaults to None, so doc_status rows from older indices deserialize cleanly.
…-jSuYr refactor(mongo): align doc-status storage with JSON storage for parity
- verify PyMongoError is caught and returns None in get_doc_by_file_basename - verify PyMongoError is caught and returns None in get_doc_by_content_hash - ensure storage failures are treated as "no match" with error logged
…TNxo8 feat(opensearch): add basename and content_hash lookups for doc status
…tion - eliminate unnecessary `:-/` fallback in redis uri path capture - ensure exact path preservation from original uri during local service normalization
… setup scripts - add /app/data/prompts directory creation in dockerfile and dockerfile.lite - add PROMPT_DIR environment variable and volume mounts in all compose files - update setup scripts to support PROMPT_DIR configuration and idempotent mount injection - fix redis test default uri to remove trailing slash
- consolidate verbose log strings in parse_mineru and parse_docling to reduce noise - shorten analyze_multimodal opt-in missing and backfill log lines for clarity - remove redundant file_path references from completion and cache hit logs - update chinese documentation to match simplified log format
… methods - remove default implementations of get_doc_by_file_basename and get_doc_by_content_hash - add `@abstractmethod` decorator to enforce implementation in subclasses - clean up unused asdict import from dataclasses module - simplify docstrings to reflect abstract nature of methods
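Making the two lookups abstract means a backend that forgets to implement them fails at instantiation instead of silently losing dedup. A minimal sketch, assuming the `(doc_id, doc_data)` return shape and the "unknown_source" sentinel described in the storage commits; the class names `DocStatusStorageSketch` and `InMemoryDocStatus` are hypothetical.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple

LookupResult = Optional[Tuple[str, Dict[str, Any]]]


class DocStatusStorageSketch(ABC):
    @abstractmethod
    async def get_doc_by_file_basename(self, basename: str) -> LookupResult: ...

    @abstractmethod
    async def get_doc_by_content_hash(self, content_hash: str) -> LookupResult: ...


class InMemoryDocStatus(DocStatusStorageSketch):
    def __init__(self, docs: Dict[str, Dict[str, Any]]):
        self._docs = docs

    async def get_doc_by_file_basename(self, basename: str) -> LookupResult:
        # Empty input and the sentinel never match anything.
        if not basename or basename == "unknown_source":
            return None
        for doc_id, data in self._docs.items():
            if data.get("file_path") == basename:
                return doc_id, data
        return None

    async def get_doc_by_content_hash(self, content_hash: str) -> LookupResult:
        # Empty hash is a guaranteed miss, so rows without the field cannot match.
        if not content_hash:
            return None
        for doc_id, data in self._docs.items():
            if data.get("content_hash") == content_hash:
                return doc_id, data
        return None


store = InMemoryDocStatus(
    {
        "doc-1": {"file_path": "a.pdf", "content_hash": "h1"},
        "doc-2": {"file_path": "unknown_source"},
    }
)
hit = asyncio.run(store.get_doc_by_content_hash("h1"))
sentinel = asyncio.run(store.get_doc_by_file_basename("unknown_source"))
```

Instantiating `DocStatusStorageSketch` directly, or a subclass missing either method, raises `TypeError`.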
…ation - correct the info log message format for empty equations sidecar in analyze_multimodal
- replace specific entity_type subdirectory with entire prompts directory - update comment to reflect user customized prompt directory purpose
Dummy PR: Merge Dev to Main (Never try to merge this PR)