diff --git a/.gitignore b/.gitignore index 6604e2aa9b..9f58c87a56 100644 --- a/.gitignore +++ b/.gitignore @@ -55,7 +55,7 @@ output/ rag_storage/ data/ -# Runtime-provided entity type prompt profiles (keep sample files only) +# User-customized prompt directory prompts/entity_type/ # Evaluation results @@ -78,14 +78,10 @@ ignore_this.txt # temporary test files in project root /test_* -# Cline files +# AI Agent files memory-bank -.claude/CLAUDE.md .claude/ -# Claude Code -CLAUDE.md - # Google Jules .jules/ diff --git a/AGENTS.md b/AGENTS.md index a204b08979..797576ea69 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,61 +1,315 @@ # Repository Guidelines -LightRAG is an advanced Retrieval-Augmented Generation (RAG) framework designed to enhance information retrieval and generation through graph-based knowledge representation. - -## Project Structure & Module Organization -- `lightrag/`: Core Python package with orchestrators (`lightrag/lightrag.py`), storage adapters in `kg/`, LLM bindings in `llm/`, and helpers such as `operate.py` and `utils_*.py`. -- `lightrag-api/`: FastAPI service (`lightrag_server.py`) with routers under `routers/` and Gunicorn launcher `run_with_gunicorn.py`. -- `lightrag_webui/`: React 19 + TypeScript client driven by Bun + Vite; UI components live in `src/`. -- `scripts/setup/`: Interactive environment setup wizard. `setup.sh` orchestrates staged `--base` / `--storage` / `--server` / validation flows, `lib/` holds prompt/validation/file helpers, and `templates/*.yml` contains compose fragments for bundled services. -- Tests live in `tests/` and root-level `test_*.py`. Working datasets stay in `inputs/`, `rag_storage/`, `temp/`; deployment collateral lives in `docs/`, `k8s-deploy/`, and `docker-compose.yml`. -- `Makefile`: Canonical entry point for the setup wizard and local developer shortcuts; prefer documented targets over invoking ad hoc shell snippets. 
- -## Build, Test, and Development Commands -- `python -m venv .venv && source .venv/bin/activate`: set up the Python runtime. -- `pip install -e .` / `pip install -e .[api]`: install the package and API extras in editable mode. -- `make env-base`: first-run interactive setup for LLM, embedding, and reranker configuration; writes `.env` and may generate `docker-compose.final.yml`. -- `make env-storage`, `make env-server`: optional follow-up wizard stages for storage backends and server/security/SSL settings; both reuse the existing `.env`. -- `make env-validate`, `make env-security-check`, `make env-backup`: validate, audit, or back up the current `.env` via the setup wizard. -- `lightrag-server` or `uvicorn lightrag.api.lightrag_server:app --reload`: start the API locally; ensure `.env` is present. -- `python -m pytest tests` (offline markers apply by default) or `python -m pytest tests --run-integration` / `python test_graph_storage.py`: run the full suite, opt into integration coverage, or target an individual script. -- `ruff check .`: lint Python sources before committing. -- Front-end workflow uses Bun from `lightrag_webui/`; run UI commands from the repo root as `cd lightrag_webui && bun install`, `cd lightrag_webui && bun run dev`, `cd lightrag_webui && bun run build`, `cd lightrag_webui && bun run lint`, and `cd lightrag_webui && bun test`. - -## Coding Style & Naming Conventions -- Backend code follow PEP 8 with four-space indentation, annotate functions, and reach for dataclasses when modelling state. -- Use `lightrag.utils.logger` instead of `print`; respect logger configuration flags. -- Extend storage or pipeline abstractions via `lightrag.base` and keep reusable helpers in the existing `utils_*.py`. -- Python modules remain lowercase with underscores; React components use `PascalCase.tsx` and hooks-first patterns. 
-- Front-end code should remain in TypeScript with two-space indentation, rely on functional React components with hooks, and follow Tailwind utility style. - -## Testing Guidelines -- Keep pytest additions close to the code you touch (`tests/` mirrors feature folders and there are root-level `test_*.py` helpers); functions must start with `test_`. -- Follow `tests/pytest.ini`: markers include `offline`, `integration`, `requires_db`, and `requires_api`, and the suite runs with `-m "not integration"` by default—pass `--run-integration` (or set `LIGHTRAG_RUN_INTEGRATION=true`) when external services are available. -- Use the custom CLI toggles from `tests/conftest.py`: `--keep-artifacts`/`LIGHTRAG_KEEP_ARTIFACTS=true`, `--stress-test`/`LIGHTRAG_STRESS_TEST=true`, and `--test-workers N`/`LIGHTRAG_TEST_WORKERS` to dial up workloads or preserve temp files during investigations. -- Export other required `LIGHTRAG_*` environment variables before running integration or storage tests so adapters can reach configured backends. -- For UI updates, pair changes with Bun test coverage using `bun:test`; run `cd lightrag_webui && bun test`, and use `cd lightrag_webui && bun test --watch` or `cd lightrag_webui && bun test --coverage` when needed. - -## Commit & Pull Request Guidelines -- Use concise, imperative commit subjects (e.g., `Fix lock key normalization`) and add body context only when necessary. -- PRs should include a summary, operational impact, linked issues, and screenshots or API samples for user-facing work. -- Verify `ruff check .`, `python -m pytest`, and affected front-end commands such as `cd lightrag_webui && bun run lint`, `cd lightrag_webui && bun run build`, and `cd lightrag_webui && bun test` succeed before requesting review; note the runs in the PR text. -- This repo is a fork of `HKUDS/LightRAG`. Always target **`HKUDS/LightRAG:main`** (upstream) when creating PRs, not the fork's own main. -- Create PR work from a dedicated branch, not `main`. 
If the CLI sandbox blocks writes under `.git/refs`, request escalation for branch creation or other ref updates instead of retrying blindly. -- If `gh auth status` is invalid but the GitHub plugin is available, prefer the plugin to create the upstream PR after pushing the branch to the fork; `gh` login state and plugin auth can differ. -- For `gh` commands that require GitHub network/auth access, prefer running them with escalation from the start instead of first trying the sandboxed path. Use an escalated `gh auth status` check as the source of truth for Codex, not the VSCode terminal. Only abandon the `gh` path when the escalated check still fails; otherwise treat sandbox-only failures as an expected limitation. -- For lightweight Python validation in fresh shells, prefer `python3` over `python` unless the active environment has already exposed `python`. - -## Security & Configuration Tips -- Copy `.env.example`; never commit secrets or real connection strings. -- Configure storage backends through `LIGHTRAG_*` variables and validate them with `docker-compose` services when needed. -- Treat `lightrag.log*` as local artefacts; purge sensitive information before sharing logs or outputs. - -## Automation & Agent Workflow -- Use repo-relative `workdir` arguments for every shell command and prefer `rg`/`rg --files` for searches since they are faster under the CLI harness. -- Default edits to ASCII, rely on `apply_patch` for single-file changes, and only add concise comments that aid comprehension of complex logic. -- Honor existing local modifications; never revert or discard user changes (especially via `git reset --hard`) unless explicitly asked. -- Follow the planning tool guidance: skip it for trivial fixes, but provide multi-step plans for non-trivial work and keep the plan updated as steps progress. -- Validate changes by running the relevant `ruff`/`pytest`/`bun test` commands whenever feasible, and describe any unrun checks with follow-up guidance. 
-- For Codex and other fresh-shell automation, prefer `./scripts/test.sh` instead of bare `pytest`; the script falls back through `PYTHON`, the active virtualenv, `uv`, `.venv`, and `venv` before trying `python` or `python3`. -- For setup workflow changes, prefer `make env-*` targets over calling `scripts/setup/setup.sh` directly; the `Makefile` resolves a Bash 4+ interpreter for macOS/Linux compatibility. -- When editing setup logic, keep `.env` host-usable and treat `docker-compose.final.yml` as generated output assembled from `scripts/setup/templates/*.yml`; compose-only overrides belong in the wizard-managed compose layer rather than being persisted back into `.env`. +## Project Overview + +LightRAG is a Retrieval-Augmented Generation (RAG) framework that uses graph-based knowledge representation for enhanced information retrieval. The system extracts entities and relationships from documents, builds a knowledge graph, and uses multiple retrieval modes (`local`, `global`, `hybrid`, `mix`, `naive`) for queries. + +## Project Structure + +Top-level directories: + +- **lightrag/**: Core Python package — see *Module Layout* below. +- **lightrag_webui/**: React 19 + TypeScript client (Bun + Vite + Tailwind). UI components in `src/`. +- **scripts/**: `test.sh` (preferred test runner), `setup/` interactive environment wizard (use `make env-*` rather than calling `setup.sh` directly — see *Configuration > Setup Wizard Outputs*), and release tooling. +- **tests/** and root-level `test_*.py`: Pytest coverage. Working datasets stay in `inputs/`, `rag_storage/`, and `temp/`; deployment collateral lives in `docs/`, `k8s-deploy/`, and compose files. + +### Module Layout (`lightrag/`) + +- **lightrag.py**: Main orchestrator class (`LightRAG`) — assembled from mixins (see *LightRAG class composition*). Hosts `ainsert_custom_kg`, `_insert_done`, `_process_extract_entities`, `_refresh_addon_params_cache`, and `addon_params` accessors. 
Critical: always call `await rag.initialize_storages()` after instantiation. +- **pipeline.py**: `_PipelineMixin` — owns the document ingestion pipeline (`apipeline_enqueue_documents`, `apipeline_process_enqueue_documents`, `apipeline_process_error_documents`), the `parse_native` / `parse_mineru` / `parse_docling` parser dispatchers, multimodal analysis, validation, and the worker scaffolding. +- **utils_pipeline.py**: Pure helpers shared by the pipeline mixin and other entry points: doc-status field access, document identity (source key, content hash), parsed-artifact path resolution, parser payload normalization, multimodal entity augmentation, and `make_lightrag_doc_content`. +- **llm_roles.py**: `RoleSpec` / `RoleLLMConfig` / `_RoleLLMState` / `ROLES` registry plus `_RoleLLMMixin` — role normalization, builder registration, wrapper rebuild, runtime config update, queue cleanup, sanitized config export, queue status reporting. Route role-specific behavior here rather than into provider modules. +- **storage_migrations.py**: `_StorageMigrationMixin` — `check_and_migrate_data`, `_migrate_entity_relation_data`, `_migrate_chunk_tracking_storage`. +- **addon_params.py**: `ObservableAddonParams` plus `default_addon_params` / `normalize_addon_params` helpers. +- **operate.py**: Core extraction and query operations including entity/relation extraction, chunking, and multi-mode retrieval logic. +- **base.py**: Abstract base classes for storage backends (`BaseKVStorage`, `BaseVectorStorage`, `BaseGraphStorage`, `BaseDocStatusStorage`). +- **kg/**: Storage implementations (JSON, NetworkX, Neo4j, PostgreSQL, MongoDB, Redis, Milvus, Qdrant, Faiss, Memgraph, OpenSearch, NanoVectorDB). The backend registry (`STORAGE_IMPLEMENTATIONS` / `STORAGES`) lives in `kg/__init__.py`; `kg/factory.py::get_storage_class()` resolves backend classes from configuration. +- **llm/**: LLM and embedding provider bindings (OpenAI, Ollama, Azure, Gemini, Bedrock, Anthropic, etc.). 
All async with caching support. +- **parser_routing.py**: Parser engine and filename-hint resolution for `legacy`, `native`, `mineru`, and `docling` flows, plus chunker configuration resolution. +- **native_parser/** and **chunker/**: Native document parsing and chunking layers. `.docx` parsing lives under `native_parser/docx/`; chunking strategies include token-size, recursive character, semantic vector, and paragraph semantic chunkers. +- **api/**: FastAPI service (`lightrag_server.py`) with REST endpoints and Ollama-compatible API; routers under `routers/`, static Swagger assets, packaged WebUI output, and Gunicorn launcher. + +## Core Architecture + +### LightRAG class composition + +`LightRAG` is assembled from focused mixins (split out of the previously monolithic `lightrag.py`): + +``` +LightRAG → _RoleLLMMixin → _StorageMigrationMixin → _PipelineMixin → object +``` + +The `@final` decorator on `LightRAG` is preserved — the mixin layering is an internal implementation detail, not an external subclassing surface. The public API (`ainsert`, `aquery`, `ainsert_custom_kg`, `initialize_storages`, etc.) is unchanged. `ainsert_custom_kg` and its internal construction logic, `_insert_done`, `_process_extract_entities`, `_refresh_addon_params_cache`, and the `addon_params` property accessors stay on `LightRAG` itself because they cut across multiple flows or depend on prompt-profile state. + +### Storage Layer + +LightRAG uses 4 storage types with pluggable backends: +- **KV_STORAGE**: LLM response cache, text chunks, document info +- **VECTOR_STORAGE**: Entity/relation/chunk embeddings +- **GRAPH_STORAGE**: Entity-relation graph structure +- **DOC_STATUS_STORAGE**: Document processing status tracking + +Each `LightRAG` instance can pass a `workspace` parameter for data isolation. Implementation differs per storage type: +- **File-based**: subdirectories under `working_dir`. +- **Collection-based**: collection name prefixes. 
+- **Relational DB**: workspace column filtering. +- **Qdrant**: payload-based partitioning. + +### Pipeline concurrency contract + +The document ingestion pipeline coordinates concurrent writers through `pipeline_status` (a per-workspace shared dict in `lightrag.kg.shared_storage`). These fields are mutated under `get_namespace_lock("pipeline_status", workspace=...)`: + +- **`busy`**: any pipeline-busy state. Set by both the processing loop AND destructive jobs (clear / per-doc delete). On its own, `busy=True` does NOT block enqueue — see `destructive_busy` for the exclusive subset. +- **`destructive_busy`**: the busy job is `/documents/clear` or `/documents/{doc_id}` (delete). These DROP storages and remove input files; a concurrent enqueue accepted in this window would write to storage being torn down and silently lose the document. Reservation and the enqueue last-line guard reject when this is True. +- **`scanning`**: a `/documents/scan` task is running (whole lifecycle: classification + processing). Used by the `/scan` endpoint to refuse overlapping scans. Does NOT on its own block uploads/inserts. +- **`scanning_exclusive`**: True only during the scan task's classification phase, when `run_scanning_process` is reading `doc_status` to classify files (PROCESSED → archive, FAILED-without-`full_docs` → retry-as-new, etc.) and possibly deleting stale stubs. Reservation and the enqueue last-line guard reject when this is set. Cleared before the scan transitions to its processing phase, allowing concurrent uploads to land while scan-driven processing finishes. +- **`pending_enqueues`**: count of `/upload`, `/text`, `/texts` endpoints that have reserved a slot (via `_reserve_enqueue_slot`) but whose bg task has not yet completed. Only the scan endpoint reads this — to refuse starting while uploads are mid-flight. +- **`request_pending`**: a nudge to the running processing loop. 
Set by either (a) `apipeline_process_enqueue_documents` when called while `busy=True` or (b) `apipeline_enqueue_documents` after writing to `doc_status` while `busy=True`. The loop checks it after each batch and re-queries `doc_status` if set. + +Mutual-exclusion rules (all checked atomically inside the lock): + +| Operation | Refuses if | Writes | +|---|---|---| +| `_reserve_enqueue_slot` | `scanning_exclusive` or `destructive_busy` | `pending_enqueues++` | +| `apipeline_enqueue_documents` (last-line guard) | (`scanning_exclusive` and not `from_scan`) or `destructive_busy` | — | +| Scan endpoint reservation | `busy or scanning or pending_enqueues > 0` | `scanning = True` | +| `apipeline_process_enqueue_documents` entry | (already busy → set `request_pending`, return) | `busy = True` (NOT `destructive_busy`) | +| `clear_documents` / `delete_document` (synchronous reservation) | `busy or scanning or pending_enqueues > 0` | `busy = True`, `destructive_busy = True` | + +The contract permits **concurrent enqueue + processing**: a freshly-uploaded doc lands in `doc_status` while the loop is mid-batch, the loop sees `request_pending` after the current batch, re-queries `doc_status`, and picks up the new PENDING row. + +For the rest — write ordering of `full_docs` vs `doc_status`, the workspace-scoped `enqueue_serialize` lock around dedup-and-upsert, and the `from_scan=True` bypass — see the docstrings on `apipeline_enqueue_documents` and `apipeline_process_enqueue_documents` in `lightrag/pipeline.py`. 
+ +### Query Modes + +- **local**: Context-dependent retrieval focused on specific entities +- **global**: Community/summary-based broad knowledge retrieval +- **hybrid**: Combines local and global +- **naive**: Direct vector search without graph +- **mix**: Integrates KG and vector retrieval (recommended with reranker) + +## Development Commands + +### Setup +```bash +# Install with uv +uv sync +source .venv/bin/activate # Or: .venv\Scripts\activate on Windows + +# Install with API support +uv sync --extra api + +# Install specific extras +uv sync --extra offline-storage # Storage backends +uv sync --extra offline-llm # LLM providers +uv sync --extra test # Testing dependencies +``` + +### API Server +```bash +# Copy and configure environment +cp env.example .env # Edit with your LLM/embedding configs + +# Build WebUI +cd lightrag_webui +bun install --frozen-lockfile +bun run build +cd .. + +# Run server +lightrag-server # Production +uvicorn lightrag.api.lightrag_server:app --reload # Development +lightrag-gunicorn # Multi-worker (gunicorn) +``` + +### WebUI +```bash +cd lightrag_webui +bun install --frozen-lockfile # Install dependencies +bun run dev # Dev server (Node + Vite) +bun run dev:bun # Dev server (Bun native) +bun run build # Production build +bun run preview # Preview production build +bun run lint # ESLint over *.ts/tsx/js/jsx + +# Testing — Bun built-in runner (NOT Vitest/Jest) +bun test # All tests +bun test --watch # Watch mode +bun test --coverage # With coverage report +bun test src/api/lightrag.test.ts # Single test file +``` + +### Testing + +Backend tests use pytest; frontend unit tests use Bun's built-in runner — see *WebUI* above. 
+ +```bash +# Preferred for fresh shells and automation; resolves PYTHON, venv, uv, .venv, venv, python, python3 +./scripts/test.sh tests + +# Run specific test file +./scripts/test.sh test_graph_storage.py + +# Run with custom workers +./scripts/test.sh tests --test-workers 4 +``` + +- `tests/`: main test suite (mirrors feature folders); root-level `test_*.py` for specific integration tests. +- Markers (see `tests/pytest.ini`): `offline`, `integration`, `requires_db`, `requires_api`. Integration tests are skipped by default via `-m "not integration"`. +- Integration env vars: `LIGHTRAG_RUN_INTEGRATION=true`, `LIGHTRAG_KEEP_ARTIFACTS=true`, `LIGHTRAG_TEST_WORKERS=4`, plus storage-specific connection strings. + +### Linting +```bash +ruff check . +``` + +## Key Implementation Patterns + +### LightRAG Initialization (Critical) + +The most common error is forgetting to initialize storages (manifests as `AttributeError: __aenter__` or `KeyError: 'history_messages'`): + +```python +import asyncio +from lightrag import LightRAG, QueryParam +from lightrag.llm.openai import gpt_4o_mini_complete, openai_embed + +async def main(): + rag = LightRAG( + working_dir="./rag_storage", + llm_model_func=gpt_4o_mini_complete, + embedding_func=openai_embed + ) + + # REQUIRED: Initialize storage backends + await rag.initialize_storages() + + # Now safe to use + await rag.ainsert("Your text here") + result = await rag.aquery("Your question", param=QueryParam(mode="hybrid")) + + # Cleanup + await rag.finalize_storages() + +asyncio.run(main()) +``` + +### Custom Embedding Functions + +Use the `@wrap_embedding_func_with_attrs` decorator and call `.func` when wrapping (already-decorated functions cannot be wrapped again; access the underlying function via `.func`): + +```python +from lightrag.utils import wrap_embedding_func_with_attrs + +@wrap_embedding_func_with_attrs(embedding_dim=1536, max_token_size=8192) +async def custom_embed(texts: list[str]) -> np.ndarray: + # Call underlying function, 
not wrapped 
version + return await openai_embed.func(texts, model="text-embedding-3-large") + +# Wrong: EmbeddingFunc(func=openai_embed) +# Right: EmbeddingFunc(func=openai_embed.func) +``` + +> **Pitfall — switching embedding models**: when changing the embedding model you MUST clear the data directory (optionally keeping `kv_store_llm_response_cache.json` for LLM cache). Existing vectors will not match the new model's space. + +### Storage Configuration + +Configure via environment variables or constructor params: + +```python +# Environment-based (recommended for production) +# See env.example for full list + +# Constructor-based +rag = LightRAG( + working_dir="./storage", + workspace="project_name", # For data isolation + kv_storage="PGKVStorage", + vector_storage="PGVectorStorage", + graph_storage="Neo4JStorage", + doc_status_storage="PGDocStatusStorage", + vector_db_storage_cls_kwargs={ + "cosine_better_than_threshold": 0.2 + } +) +``` + +### Document Insertion + +```python +# Single document +await rag.ainsert("Text content") + +# Batch insertion +await rag.ainsert(["Text 1", "Text 2", ...]) + +# With custom IDs +await rag.ainsert("Text", ids=["doc-123"]) + +# With file paths (for citation) +await rag.ainsert(["Text 1", "Text 2"], file_paths=["doc1.pdf", "doc2.pdf"]) + +# Configure batch size +rag = LightRAG(..., max_parallel_insert=4) # Default: 2, max recommended: 10 +``` + +### Query Configuration + +```python +from lightrag import QueryParam + +result = await rag.aquery( + "Your question", + param=QueryParam( + mode="mix", # Recommended with reranker + top_k=60, # KG entities/relations to retrieve + chunk_top_k=20, # Text chunks to retrieve + max_entity_tokens=6000, + max_relation_tokens=8000, + max_total_tokens=30000, + enable_rerank=True, + user_prompt="Additional instructions for LLM", + stream=False + ) +) +``` + +## Configuration + +### .env Configuration +Primary configuration file for API server. 
Generate it with `make env-base` or copy `env.example` manually. Key sections: +- Server settings (HOST, PORT, CORS) +- Storage backends (connection strings via environment variables) +- Query parameters (TOP_K, MAX_TOTAL_TOKENS, etc.) +- Reranking configuration (RERANK_BINDING, RERANK_MODEL) +- Authentication (AUTH_ACCOUNTS, LIGHTRAG_API_KEY) + +See `env.example` for a comprehensive template. + +### Setup Wizard Outputs +- Keep `.env` host-usable. Container-only hostnames and staged SSL paths belong in the wizard-managed compose layer, not persisted back into `.env`. +- Treat `docker-compose.final.yml` as generated output assembled from `scripts/setup/templates/*.yml`. +- For setup workflow changes, prefer `make env-*` targets over direct `scripts/setup/setup.sh` calls. + +## Code Style + +### Language +Comments, backend code, and log messages in English. Frontend uses i18next for multi-language support. + +### Python +- Follow PEP 8 with 4-space indentation +- Use type annotations +- Prefer dataclasses for state management +- Use `lightrag.utils.logger` instead of print +- Async/await patterns throughout + +### TypeScript / React (incl. WebUI ESLint) +- Functional components with hooks; PascalCase for components +- 2-space indentation, single quotes (enforced by `@stylistic` rules) +- Tailwind utility-first styling +- ESLint stack: TypeScript-ESLint + React Hooks plugin + Prettier; `@typescript-eslint/no-explicit-any` is disabled (allowed) + +## Commit and Pull Request Guidance + +- This repo is a fork of `HKUDS/LightRAG`. Target `HKUDS/LightRAG` when creating PRs, not the fork's own repo. +- PR descriptions should include: summary, motivation, linked issues if applicable, what's changed, what's broken and how it works. diff --git a/CLAUDE.md b/CLAUDE.md index b77de6c35c..27bd541086 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,346 +1 @@ -# CLAUDE.md - -This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. 
- -## Project Overview - -LightRAG is a Retrieval-Augmented Generation (RAG) framework that uses graph-based knowledge representation for enhanced information retrieval. The system extracts entities and relationships from documents, builds a knowledge graph, and uses multi-modal retrieval (local, global, hybrid, mix, naive) for queries. - -## Core Architecture - -### Key Components - -- **lightrag.py**: Main orchestrator class (`LightRAG`) that coordinates document insertion, query processing, and storage management. Critical: Always call `await rag.initialize_storages()` after instantiation. - -- **operate.py**: Core extraction and query operations including entity/relation extraction, chunking, and multi-mode retrieval logic. - -- **base.py**: Abstract base classes for storage backends (`BaseKVStorage`, `BaseVectorStorage`, `BaseGraphStorage`, `BaseDocStatusStorage`). - -- **kg/**: Storage implementations (JSON, NetworkX, Neo4j, PostgreSQL, MongoDB, Redis, Milvus, Qdrant, Faiss, Memgraph). Each storage type provides different trade-offs for production vs. development use. - -- **llm/**: LLM provider bindings (OpenAI, Ollama, Azure, Gemini, Bedrock, Anthropic, etc.). All use async patterns with caching support. - -- **api/**: FastAPI server (`lightrag_server.py`) with REST endpoints and Ollama-compatible API, plus React 19 + TypeScript WebUI. - -### Storage Layer - -LightRAG uses 4 storage types with pluggable backends: -- **KV_STORAGE**: LLM response cache, text chunks, document info -- **VECTOR_STORAGE**: Entity/relation/chunk embeddings -- **GRAPH_STORAGE**: Entity-relation graph structure -- **DOC_STATUS_STORAGE**: Document processing status tracking - -Workspace isolation is implemented differently per storage type (subdirectories for file-based, prefixes for collections, fields for relational DBs). 
- -### Query Modes - -- **local**: Context-dependent retrieval focused on specific entities -- **global**: Community/summary-based broad knowledge retrieval -- **hybrid**: Combines local and global -- **naive**: Direct vector search without graph -- **mix**: Integrates KG and vector retrieval (recommended with reranker) - -## Development Commands - -### Setup -```bash -# Install core package (development mode) -uv sync -source .venv/bin/activate # Or: .venv\Scripts\activate on Windows - -# Install with API support -uv sync --extra api - -# Install specific extras -uv sync --extra offline-storage # Storage backends -uv sync --extra offline-llm # LLM providers -uv sync --extra test # Testing dependencies -``` - -### API Server -```bash -# Copy and configure environment -cp env.example .env # Edit with your LLM/embedding configs - -# Build WebUI -cd lightrag_webui -bun install --frozen-lockfile -bun run build -cd .. - -# Run server -lightrag-server # Production -uvicorn lightrag.api.lightrag_server:app --reload # Development -lightrag-gunicorn # Multi-worker (gunicorn) -``` - -### Testing -```bash -# Run offline tests (default) -python -m pytest tests - -# Run integration tests (requires external services) -python -m pytest tests --run-integration -# Or set: LIGHTRAG_RUN_INTEGRATION=true - -# Run specific test file -python test_graph_storage.py - -# Keep artifacts for debugging -python -m pytest tests --keep-artifacts - -# Run with custom workers -python -m pytest tests --test-workers 4 -``` - -### Linting -```bash -ruff check . 
-``` - -## Key Implementation Patterns - -### LightRAG Initialization (Critical) - -The most common error is forgetting to initialize storages: - -```python -import asyncio -from lightrag import LightRAG -from lightrag.llm.openai import gpt_4o_mini_complete, openai_embed - -async def main(): - rag = LightRAG( - working_dir="./rag_storage", - llm_model_func=gpt_4o_mini_complete, - embedding_func=openai_embed - ) - - # REQUIRED: Initialize storage backends - await rag.initialize_storages() - - # Now safe to use - await rag.ainsert("Your text here") - result = await rag.aquery("Your question", param=QueryParam(mode="hybrid")) - - # Cleanup - await rag.finalize_storages() - -asyncio.run(main()) -``` - -### Custom Embedding Functions - -Use `@wrap_embedding_func_with_attrs` decorator and call `.func` when wrapping: - -```python -from lightrag.utils import wrap_embedding_func_with_attrs - -@wrap_embedding_func_with_attrs(embedding_dim=1536, max_token_size=8192) -async def custom_embed(texts: list[str]) -> np.ndarray: - # Call underlying function, not wrapped version - return await openai_embed.func(texts, model="text-embedding-3-large") -``` - -### Storage Configuration - -Configure via environment variables or constructor params: - -```python -# Environment-based (recommended for production) -# See env.example for full list - -# Constructor-based -rag = LightRAG( - working_dir="./storage", - workspace="project_name", # For data isolation - kv_storage="PGKVStorage", - vector_storage="PGVectorStorage", - graph_storage="Neo4JStorage", - doc_status_storage="PGDocStatusStorage", - vector_db_storage_cls_kwargs={ - "cosine_better_than_threshold": 0.2 - } -) -``` - -### Document Insertion - -```python -# Single document -await rag.ainsert("Text content") - -# Batch insertion -await rag.ainsert(["Text 1", "Text 2", ...]) - -# With custom IDs -await rag.ainsert("Text", ids=["doc-123"]) - -# With file paths (for citation) -await rag.ainsert(["Text 1", "Text 2"], 
file_paths=["doc1.pdf", "doc2.pdf"]) - -# Configure batch size -rag = LightRAG(..., max_parallel_insert=4) # Default: 2, max recommended: 10 -``` - -### Query Configuration - -```python -from lightrag import QueryParam - -result = await rag.aquery( - "Your question", - param=QueryParam( - mode="mix", # Recommended with reranker - top_k=60, # KG entities/relations to retrieve - chunk_top_k=20, # Text chunks to retrieve - max_entity_tokens=6000, - max_relation_tokens=8000, - max_total_tokens=30000, - enable_rerank=True, - user_prompt="Additional instructions for LLM", - stream=False - ) -) -``` - -## WebUI Development - -### Structure -- `lightrag_webui/src/`: React components (TypeScript) -- Uses Vite + Bun build system -- Tailwind CSS for styling -- React 19 with functional components and hooks - -### Commands -```bash -cd lightrag_webui -bun install --frozen-lockfile # Install dependencies -bun run dev # Development server (Node + Vite) -bun run dev:bun # Development server (Bun native) -bun run build # Production build -bun run preview # Preview production build locally - -# Linting (ESLint with TypeScript, React hooks, Stylistic rules) -bun run lint # Run ESLint on all *.ts/tsx/js/jsx files - -# Testing (Bun built-in test runner) -bun test # Run all tests -bun test --watch # Watch mode -bun test --coverage # With coverage report -bun test src/api/lightrag.test.ts # Run a single test file -``` - -### Lint Rules -ESLint is configured with TypeScript-ESLint, React Hooks plugin, Prettier integration, and `@stylistic` rules: -- 2-space indentation, single quotes enforced -- `@typescript-eslint/no-explicit-any` is disabled (allowed) - -## Common Issues - -### 1. Storage Not Initialized -**Error**: `AttributeError: __aenter__` or `KeyError: 'history_messages'` -**Solution**: Always call `await rag.initialize_storages()` after creating LightRAG instance - -### 2. 
Embedding Model Changes -When switching embedding models, you MUST clear the data directory (except optionally `kv_store_llm_response_cache.json` for LLM cache). - -### 3. Nested Embedding Functions -Cannot wrap already-decorated embedding functions. Use `.func` to access underlying function: -```python -# Wrong: EmbeddingFunc(func=openai_embed) -# Right: EmbeddingFunc(func=openai_embed.func) -``` - -### 4. Context Length for Ollama -Ollama models default to 8k context; LightRAG requires 32k+. Configure via: -```python -llm_model_kwargs={"options": {"num_ctx": 32768}} -``` - -## Configuration Files - -### .env Configuration -Primary configuration file for API server. Key sections: -- Server settings (HOST, PORT, CORS) -- Storage backends (connection strings via environment variables) -- Query parameters (TOP_K, MAX_TOTAL_TOKENS, etc.) -- Reranking configuration (RERANK_BINDING, RERANK_MODEL) -- Authentication (AUTH_ACCOUNTS, LIGHTRAG_API_KEY) - -See `env.example` for comprehensive template. - -### Workspace Isolation -Each LightRAG instance can use a `workspace` parameter for data isolation. 
Implementation varies by storage type: -- File-based: subdirectories -- Collection-based: collection name prefixes -- Relational DB: workspace column filtering -- Qdrant: payload-based partitioning - -## Testing Guidelines - -### Test Structure -- `tests/`: Main test suite (mirrors feature folders) -- `test_*.py` in root: Specific integration tests -- Markers: `offline`, `integration`, `requires_db`, `requires_api` - -### Running Tests -```bash -# Default: runs only offline tests -pytest tests - -# Include integration tests -pytest tests --run-integration - -# Keep test artifacts for debugging -pytest tests --keep-artifacts - -# Configure test workers -pytest tests --test-workers 4 -``` - -### Environment Variables for Tests -Set `LIGHTRAG_*` variables for integration tests: -- `LIGHTRAG_RUN_INTEGRATION=true` -- `LIGHTRAG_KEEP_ARTIFACTS=true` -- `LIGHTRAG_TEST_WORKERS=4` -- Plus storage-specific connection strings - -## Code Style - -### Language -- Comment Language - Use English for comments and documentation -- Backend Language - Use English for backend code and messages -- Frontend Internationalization: i18next for multi-language support - -### Python -- Follow PEP 8 with 4-space indentation -- Use type annotations -- Prefer dataclasses for state management -- Use `lightrag.utils.logger` instead of print -- Async/await patterns throughout -- Keep storage implementations in `kg/` with consistent base class inheritance - -### TypeScript/React -- Functional components with hooks -- 2-space indentation -- PascalCase for components -- Tailwind utility-first styling - -## Important Architectural Notes - -### LLM Requirements -- Minimum 32B parameters recommended -- 32KB context minimum (64KB recommended) -- Avoid reasoning models during indexing -- Stronger models for query stage than indexing stage - -### Embedding Models -- Must be consistent across indexing and querying -- Recommended: `BAAI/bge-m3`, `text-embedding-3-large` -- Changing models requires clearing 
vector storage and recreating with new dimensions - -### Reranker Configuration -- Significantly improves retrieval quality -- Recommended models: `BAAI/bge-reranker-v2-m3`, Jina rerankers -- Use "mix" mode when reranker is enabled +Strictly follow the rules in ./AGENTS.md diff --git a/Dockerfile b/Dockerfile index e9568048a2..2e6d025305 100644 --- a/Dockerfile +++ b/Dockerfile @@ -93,7 +93,7 @@ RUN --mount=type=cache,target=/root/.local/share/uv \ && /app/.venv/bin/python -m ensurepip --upgrade # Create persistent data directories AFTER package installation -RUN mkdir -p /app/data/rag_storage /app/data/inputs /app/data/tiktoken +RUN mkdir -p /app/data/rag_storage /app/data/inputs /app/data/prompts /app/data/tiktoken # Copy offline cache into the newly created directory COPY --from=builder /app/data/tiktoken /app/data/tiktoken @@ -102,6 +102,7 @@ COPY --from=builder /app/data/tiktoken /app/data/tiktoken ENV TIKTOKEN_CACHE_DIR=/app/data/tiktoken ENV WORKING_DIR=/app/data/rag_storage ENV INPUT_DIR=/app/data/inputs +ENV PROMPT_DIR=/app/data/prompts # Expose API port EXPOSE 9621 diff --git a/Dockerfile.lite b/Dockerfile.lite index a00cbd0af1..c16ee90a26 100644 --- a/Dockerfile.lite +++ b/Dockerfile.lite @@ -93,7 +93,7 @@ RUN --mount=type=cache,target=/root/.local/share/uv \ && /app/.venv/bin/python -m ensurepip --upgrade # Create persistent data directories -RUN mkdir -p /app/data/rag_storage /app/data/inputs /app/data/tiktoken +RUN mkdir -p /app/data/rag_storage /app/data/inputs /app/data/prompts /app/data/tiktoken # Copy cached tokenizer assets prepared in the builder stage COPY --from=builder /app/data/tiktoken /app/data/tiktoken @@ -102,6 +102,7 @@ COPY --from=builder /app/data/tiktoken /app/data/tiktoken ENV TIKTOKEN_CACHE_DIR=/app/data/tiktoken ENV WORKING_DIR=/app/data/rag_storage ENV INPUT_DIR=/app/data/inputs +ENV PROMPT_DIR=/app/data/prompts # Expose API port EXPOSE 9621 diff --git a/MANIFEST.in b/MANIFEST.in index cf12c1e942..1af3b48967 100644 --- 
a/MANIFEST.in +++ b/MANIFEST.in @@ -2,3 +2,4 @@ include requirements.txt include lightrag/api/requirements.txt recursive-include lightrag/api/webui * recursive-include lightrag/api/static * +recursive-include prompts/samples * diff --git a/README-zh.md b/README-zh.md index 0863d7185a..23a1f84859 100644 --- a/README-zh.md +++ b/README-zh.md @@ -73,6 +73,9 @@ --- ## 🎉 News +- [2026.05]🎯[New Feature]: **RagAnything merged into LightRAG**🎉. Supports multimodal content parsing and extraction through **MinerU / Docling** services. +- [2026.05]🎯[New Feature]: Introduced four selectable text chunking strategies: `Fix` (fixed), `Recursive`, `Vector`, and `Paragraph` (paragraph-semantic). +- [2026.05]🎯[New Feature]: **Role-specific LLM configuration**, with four independent roles (EXTRACT, QUERY, KEYWORDS, and VLM), each with its own LLM settings. - [2026.03]🎯[New Feature]: Integrated **OpenSearch** as a unified storage backend, providing full support for all four LightRAG storage types. - [2026.03]🎯[New Feature]: Introduced an interactive setup wizard that supports local deployment of embedding, reranking, and storage backend services via Docker. - [2025.11]🎯[New Feature]: Integrated **RAGAS evaluation** and **Langfuse tracing**. Updated the API to return retrieved contexts with query results, supporting context precision metrics. @@ -292,11 +295,9 @@ python examples/lightrag_openai_demo.py LightRAG provides token usage tracking, knowledge graph data export, LLM cache management, Langfuse observability integration, and a RAGAS-based evaluation framework. See **[docs/AdvancedFeatures.md](./docs/AdvancedFeatures.md)** (in English). -### Multimodal Document Processing (RAG-Anything Integration) +### Multimodal Document Processing -LightRAG integrates with [RAG-Anything](https://github.com/HKUDS/RAG-Anything) to support end-to-end multimodal RAG across PDFs, Office documents, images, tables, and formulas. See **[docs/AdvancedFeatures.md](./docs/AdvancedFeatures.md)** (in English). - -> LightRAG Server will integrate RAG-Anything's multimodal processing capabilities into its file processing pipeline in the near future. Stay tuned. +LightRAG Server has a built-in multimodal document pipeline that supports PDFs, Office documents, images, tables, and formulas. Parsing is performed by an external MinerU or Docling service, while multimodal indexing runs inside the LightRAG pipeline. See **[docs/AdvancedFeatures.md](./docs/AdvancedFeatures.md)** (in English). ## Replicating Findings in the Paper diff --git a/README.md b/README.md index d91fe33be8..a9c6039427 100644 --- a/README.md +++ b/README.md @@ -74,6 +74,9 @@ --- ## 🎉 News +- [2026.05]🎯[New Feature]: **Merge RagAnything into LightRAG**🎉. Multimodal content parsing and extraction via **MinerU / Docling** services. +- [2026.05]🎯[New Feature]: Introducing four selectable text chunking strategies: `Fix`, `Recursive`, `Vector`, and `Paragraph`. 
+- [2026.05]🎯[New Feature]: **Role-specific LLM configuration** with four distinct roles (EXTRACT, QUERY, KEYWORDS, and VLM), each with independent LLM settings. - [2026.03]🎯[New Feature]: Integrated **OpenSearch** as a unified storage backend, providing comprehensive support for all four LightRAG storage types. - [2026.03]🎯[New Feature]: Introduced a setup wizard that supports local deployment of embedding, reranking, and storage backends via Docker. - [2025.11]🎯[New Feature]: Integrated **RAGAS for Evaluation** and **Langfuse for Tracing**. Updated the API to return retrieved contexts alongside query results to support context precision metrics. @@ -293,11 +296,9 @@ For the complete Core API reference — including init parameters, `QueryParam`, LightRAG provides additional capabilities including token usage tracking, knowledge graph data export, LLM cache management, Langfuse observability integration, and RAGAS-based evaluation. See **[docs/AdvancedFeatures.md](./docs/AdvancedFeatures.md)**. -### Multimodal Document Processing (RAG-Anything Integration) +### Multimodal Document Processing -LightRAG integrates with [RAG-Anything](https://github.com/HKUDS/RAG-Anything) for end-to-end multimodal RAG across PDFs, Office documents, images, tables, and formulas. For setup and usage examples, see **[docs/AdvancedFeatures.md](./docs/AdvancedFeatures.md)**. - -> LightRAG Server will soon integrate RAG-Anything’s multimodal processing capabilities into its file processing pipeline. Stay tuned. +LightRAG Server includes a multimodal document pipeline for PDFs, Office documents, images, tables, and formulas. Parsing is handled through external MinerU or Docling services, while multimodal indexing runs in the LightRAG pipeline. For setup details, see **[docs/AdvancedFeatures.md](./docs/AdvancedFeatures.md)**. 
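The documentation changed in this same diff describes how files are routed to MinerU, Docling, or the built-in parsers through `LIGHTRAG_PARSER` rules (suffix-based, separated by commas or semicolons, evaluated left to right) and per-file hints such as `paper.[mineru].pdf`. The sketch below is only an illustration of that described behavior, not LightRAG's actual implementation: the names `parse_rules`, `route`, `ENGINES`, and `HINT_RE` are hypothetical, and it deliberately omits the engine capability and endpoint checks that the real router is documented to perform.

```python
import re

# Engine names taken from the documentation in this diff.
ENGINES = {"legacy", "native", "mineru", "docling"}

# Hypothetical pattern for a bracket hint placed just before the file suffix,
# e.g. "memo.[native-R!].docx" or "notes.[-R].md".
HINT_RE = re.compile(r"\.\[([^\]]*)\]\.([^.]+)$")


def parse_rules(spec: str) -> list[tuple[str, str, str]]:
    """Split a LIGHTRAG_PARSER-style string into (suffix, engine, options)."""
    rules = []
    for item in re.split(r"[,;]", spec):
        item = item.strip()
        if not item:
            continue
        suffix, _, target = item.partition(":")
        # The hyphen only separates the engine from its option characters.
        engine, _, options = target.partition("-")
        if engine not in ENGINES:
            raise ValueError(f"unknown engine in rule {item!r}")
        rules.append((suffix.strip().lower(), engine, options))
    return rules


def route(filename: str, spec: str) -> tuple[str, str]:
    """Return (engine, options): a file-name hint wins, then the first matching rule."""
    suffix = filename.rsplit(".", 1)[-1].lower()

    hint_engine = hint_options = ""
    match = HINT_RE.search(filename)
    if match:
        hint_engine, _, hint_options = match.group(1).partition("-")
        if hint_engine and hint_engine not in ENGINES:
            # Assumption for this sketch: an unknown engine voids the whole hint.
            hint_engine = hint_options = ""

    rule_engine, rule_options = "legacy", ""  # assumed fallback when nothing matches
    for rule_suffix, engine, options in parse_rules(spec):
        if rule_suffix in ("*", suffix):  # rules are checked left to right
            rule_engine, rule_options = engine, options
            break

    return hint_engine or rule_engine, hint_options or rule_options


if __name__ == "__main__":
    spec = "pdf:mineru,docx:docling,*:legacy"
    for name in ("paper.pdf", "memo.[native].docx", "notes.md"):
        print(name, "->", route(name, spec))
```

Because the sketch takes the first matching rule unconditionally, a spec such as `*:native-teP,*:legacy-R` would send every file to `native`; the documented router instead skips a rule whose engine cannot handle the suffix and continues with the next one.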
## Replicating Findings in the Paper diff --git a/docker-compose-full.yml b/docker-compose-full.yml index 553f1d6305..ed63ce066a 100644 --- a/docker-compose-full.yml +++ b/docker-compose-full.yml @@ -14,7 +14,7 @@ services: volumes: - ./data/rag_storage:/app/data/rag_storage - ./data/inputs:/app/data/inputs - - ./config.ini:/app/config.ini + - ./data/prompts:/app/data/prompts - ./.env:/app/.env deploy: restart_policy: @@ -33,6 +33,7 @@ services: WORKING_DIR: "/app/data/rag_storage" MILVUS_URI: "http://milvus:19530" INPUT_DIR: "/app/data/inputs" + PROMPT_DIR: "/app/data/prompts" MEMGRAPH_URI: "bolt://host.docker.internal:7687" depends_on: vllm-embed: diff --git a/docker-compose.podman.yml b/docker-compose.podman.yml index 052f5854f0..017def78c1 100644 --- a/docker-compose.podman.yml +++ b/docker-compose.podman.yml @@ -25,11 +25,13 @@ services: volumes: - ./data/rag_storage:/app/data/rag_storage - ./data/inputs:/app/data/inputs + - ./data/prompts:/app/data/prompts - ./config.ini:/app/config.ini - ./.env:/app/.env restart: on-failure:10 environment: WORKING_DIR: "/app/data/rag_storage" INPUT_DIR: "/app/data/inputs" + PROMPT_DIR: "/app/data/prompts" HOST: "0.0.0.0" PORT: "9621" diff --git a/docker-compose.yml b/docker-compose.yml index abc800065e..c24f79f8c2 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -11,7 +11,7 @@ services: volumes: - ./data/rag_storage:/app/data/rag_storage - ./data/inputs:/app/data/inputs - - ./config.ini:/app/config.ini + - ./data/prompts:/app/data/prompts - ./.env:/app/.env deploy: restart_policy: @@ -22,5 +22,6 @@ services: environment: WORKING_DIR: "/app/data/rag_storage" INPUT_DIR: "/app/data/inputs" + PROMPT_DIR: "/app/data/prompts" HOST: "0.0.0.0" PORT: "9621" diff --git a/docs/AdvancedFeatures.md b/docs/AdvancedFeatures.md index 75324f9ce1..13864506ca 100644 --- a/docs/AdvancedFeatures.md +++ b/docs/AdvancedFeatures.md @@ -1,10 +1,12 @@ # Advanced Features -## Multimodal Document Processing (RAG-Anything Integration) +## 
Multimodal Document Processing -LightRAG integrates with [RAG-Anything](https://github.com/HKUDS/RAG-Anything), an **All-in-One Multimodal Document Processing RAG system** that enables advanced parsing and RAG capabilities across diverse document formats including PDFs, images, Office documents, tables, and formulas. +LightRAG Server includes a multimodal document pipeline for text, images, tables, and equations. Document parsing is handled through external MinerU or Docling services configured by endpoint, so the server no longer needs to install or import the `raganything` package locally. -**Key Features:** +**Status:** the multimodal post-process hook is currently a placeholder; image, table, and equation processors are planned but not yet wired up. Ingestion via external MinerU/Docling parsers and native text indexing already work today. + +**Planned Capabilities:** - End-to-End Multimodal Pipeline: complete workflow from document ingestion to multimodal query answering - Universal Document Support: PDFs, Office documents (DOC/DOCX/PPT/PPTX/XLS/XLSX), images, and diverse file formats - Specialized Content Analysis: dedicated processors for images, tables, mathematical equations @@ -13,94 +15,16 @@ LightRAG integrates with [RAG-Anything](https://github.com/HKUDS/RAG-Anything), ### Quick Start -* Install Rag-Anything +Configure parser routing and external parser service endpoints in `.env`: ```bash -pip install raganything -``` - -* RAGAnything Usage Example - -```python -import asyncio -from raganything import RAGAnything -from lightrag import LightRAG -from lightrag.llm.openai import openai_complete_if_cache, openai_embed -from lightrag.utils import EmbeddingFunc -import os - -async def load_existing_lightrag(): - lightrag_working_dir = "./existing_lightrag_storage" - - from functools import partial - - lightrag_instance = LightRAG( - working_dir=lightrag_working_dir, - llm_model_func=lambda prompt, system_prompt=None, history_messages=[], **kwargs: 
openai_complete_if_cache( - "gpt-4o-mini", - prompt, - system_prompt=system_prompt, - history_messages=history_messages, - api_key="your-api-key", - **kwargs, - ), - embedding_func=EmbeddingFunc( - embedding_dim=3072, - max_token_size=8192, - model="text-embedding-3-large", - func=partial( - openai_embed.func, - model="text-embedding-3-large", - api_key=api_key, - base_url=base_url, - ), - ) - ) - - await lightrag_instance.initialize_storages() - - rag = RAGAnything( - lightrag=lightrag_instance, - vision_model_func=lambda prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs: openai_complete_if_cache( - "gpt-4o", - "", - system_prompt=None, - history_messages=[], - messages=[ - {"role": "system", "content": system_prompt} if system_prompt else None, - {"role": "user", "content": [ - {"type": "text", "text": prompt}, - {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}} - ]} if image_data else {"role": "user", "content": prompt} - ], - api_key="your-api-key", - **kwargs, - ) if image_data else openai_complete_if_cache( - "gpt-4o-mini", - prompt, - system_prompt=system_prompt, - history_messages=history_messages, - api_key="your-api-key", - **kwargs, - ) - ) - - result = await rag.query_with_multimodal( - "What data has been processed in this LightRAG instance?", - mode="hybrid" - ) - print("Query result:", result) - - await rag.process_document_complete( - file_path="path/to/new/multimodal_document.pdf", - output_dir="./output" - ) - -if __name__ == "__main__": - asyncio.run(load_existing_lightrag()) +LIGHTRAG_PARSER=pdf:mineru,docx:docling,pptx:docling,xlsx:docling,*:legacy +MINERU_API_MODE=local +MINERU_LOCAL_ENDPOINT=http://localhost:8000 +DOCLING_ENDPOINT=http://localhost:5001 ``` -* For detailed documentation and advanced usage, see the [RAG-Anything repository](https://github.com/HKUDS/RAG-Anything). +Then upload documents through LightRAG Server. 
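`LIGHTRAG_PARSER` rules match suffixes such as `pdf`, may be separated with commas or semicolons, and are evaluated from left to right. If a rule enables MinerU or Docling, the matching endpoint must be configured before server startup. Per-file hints such as `paper.[mineru].pdf` and `memo.[native].docx` override the default rules. Parsed multimodal sidecars are written by the pipeline and consumed by the normal indexing flow. See [File Processing Configuration](./FileProcessingConfiguration-zh.md) for detailed routing rules and examples. --- diff --git a/docs/FileProcessingPipeline-zh.md b/docs/FileProcessingPipeline-zh.md new file mode 100644 index 0000000000..2a57b90a23 --- /dev/null +++ b/docs/FileProcessingPipeline-zh.md @@ -0,0 +1,851 @@ +# How the File Processing Pipeline Works + +Starting with version v1.5.0 (currently on the dev branch), LightRAG's file processing pipeline has been substantially upgraded: + +* Multiple file content extraction engines are supported: legacy, native, mineru, docling +* Multiple text chunking methods are supported: Fix, Recursive, Vector, Paragraph +* Entity and relation extraction can be turned off for individual files + +LightRAG Server introduces an intermediate format for file processing: `LightRAG Document`. The format supports multimodal data such as tables and images, and also carries the document's section and paragraph metadata, which makes it easier to trace content back to its source later. + +This document is organized from the perspective of deploying and using **LightRAG Server**: it first gives quick-start configurations that can be used as-is, then covers the configuration syntax for content extraction and chunking, the storage and directory layout, deduplication, concurrency, and the resume rules. Developers who call the `LightRAG` class directly from Python code should go to [Chapter 8: Python SDK Usage](#八、Python SDK 调用). + +## 1. Quick Start + +### Keeping the Legacy File Processing Behavior + +All documents are processed with the legacy document parsing and chunking strategy. Leave `LIGHTRAG_PARSER` unset, or set it to the following value: + +```bash +LIGHTRAG_PARSER=*:legacy-F +``` + +### Recommended Starting Configuration + +This setup depends on neither an external document parsing service nor a `VLM` vision model. The new built-in `Native` engine parses `docx` documents with table (t) and formula (e) analysis enabled and the `P` chunking strategy; all other documents use the legacy content parser together with the better-performing `R` chunking strategy. + +```bash +LIGHTRAG_PARSER=*:native-teP,*:legacy-R +``` + +### Enabling Multimodal Processing + +Multimodal processing depends on the `MinerU` file parsing service and a `VLM` vision model. `Native` parses `docx` files, and `MinerU` parses `pdf`, `office`, and various image files. All of these files have image (i), table (t), and formula (e) analysis enabled and use the `P` chunking strategy. The remaining documents fall back to the legacy content parser with the `R` chunking strategy. + +```bash +LIGHTRAG_PARSER=*:native-iteP,*:mineru-iteP,*:legacy-R +VLM_PROCESS_ENABLE=true +VLM_LLM_MODEL=kimi-k2.6 +MINERU_API_MODE=local +MINERU_LOCAL_ENDPOINT=http://localhost:8000 +``` + +> The `P` chunking strategy is LightRAG's native chunking strategy. For details, see [Paragraph Semantic 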
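Chunking Strategy](ParagraphSemanticChunking-zh.md). For VLM configuration, see the [Role-Specific LLM/VLM Configuration Guide](RoleSpecificLLMConfiguration-zh.md). + +## 2. Content Extraction and Processing Option Configuration + +LightRAG's file processing configuration has two parts: the content extraction engine determines how the original file is parsed, and the processing options determine whether multimodal analysis runs after parsing, which chunking method is used, and whether a knowledge graph is built. Typically you first set default rules by file suffix with the `LIGHTRAG_PARSER` environment variable, then override individual files with a `[hint]` in the file name. The engine and the options can be written in the same configuration fragment, for example `docx:native-iet` or `report.[native-R!].docx`. + +For backward compatibility, if the configuration is left unchanged, file content extraction keeps the original `legacy` behavior after the upgrade. To enable the new content processing engines, configure them as described in this section. + +### 2.1 Configuration Syntax Overview + +The full configuration model is: + +```text +LIGHTRAG_PARSER=suffix:engine-options,suffix:engine,*:legacy-R +filename.[ENGINE].ext +filename.[ENGINE-OPTIONS].ext +filename.[-OPTIONS].ext +``` + +- `LIGHTRAG_PARSER` is the default rule table, matched by file suffix, for example `pdf:mineru` or `docx:native-iet`. +- A file name `[hint]` is a per-file override, for example `paper.[mineru].pdf` or `memo.[native-R!].docx`. +- `ENGINE` is the content extraction engine: `legacy`, `native`, `mineru`, or `docling`. +- `OPTIONS` is a combination of processing option characters, for example `iet`, `R!`, or `P`. The options are ultimately written to `process_options` and read by later pipeline stages. +- The hyphen in `ENGINE-OPTIONS` only separates the engine from the options; it is not part of the options. +- To specify processing options only, you must write `[-OPTIONS]`, for example `[-!]`. A hint without a hyphen, such as `[abc]`, is strictly interpreted as an engine name and reported as an error; it does not fall back to being treated as an option string. + +Examples of common combinations: + +```bash +LIGHTRAG_PARSER=pdf:mineru-R,docx:native-ietP,*:legacy-R +MINERU_API_MODE=local +MINERU_LOCAL_ENDPOINT=http://localhost:8000 +DOCLING_ENDPOINT=http://localhost:5001 +``` + +```text +my-proposal.[native-iet].docx # native engine; image, table, and formula analysis enabled +my-memo.[native-R!].docx # native engine; recursive chunking; knowledge graph construction disabled +my-proposal.[-!].docx # default engine; only disables knowledge graph construction +my-proposal.[mineru].docx # MinerU engine; all processing options at their defaults +``` + +### 2.2 Default Rules: `LIGHTRAG_PARSER` + +`LIGHTRAG_PARSER` configures the default content extraction engine for each file suffix. Default processing options for a rule can be appended after the engine: + +```text +suffix:engine,suffix:engine,*:legacy +suffix:engine;suffix:engine;*:legacy +suffix:engine-options +``` + +- The left side matches the file suffix, not the full file name; write `pdf:mineru`, not `*.pdf:mineru`. +- Rules can be separated with an ASCII comma `,` or a semicolon `;`. +- Rules are checked from left to right; put higher-priority rules first, and the wildcard rule usually last. +- The `-options` part after the engine is the default `process_options` for files matched by that rule. For example, `LIGHTRAG_PARSER=docx:native-iet` means every `.docx` file uses the `native` engine by default, with image, table, and formula analysis enabled. + +### 2.3 Per-File Override: File Name Hint + +Square brackets in a file name can temporarily specify how a single file is processed: + +```text +paper.[mineru-R].pdf +slides.[docling].pptx +memo.[native-P].docx +notes.[-R].md +``` + +The content inside the brackets supports three forms: 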
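+ +```text +[ENGINE] # engine only; processing options use the defaults, or the defaults provided by LIGHTRAG_PARSER +[ENGINE-OPTIONS] # both engine and processing options +[-OPTIONS] # processing options only; the engine is still resolved by LIGHTRAG_PARSER / the default rules +``` + +When a hint is parsed, content without a hyphen must match an engine name as a whole (`mineru` / `native` / `docling` / `legacy`); when there is a hyphen with content before it, the part before the hyphen is the engine and the part after it is the options; a leading hyphen means options only. The old `[OPTIONS]` form is no longer valid; for example, `[iet]` should be changed to `[-iet]`. + +### 2.4 Content Extraction Engines + +| Engine | Description | Supported file formats (suffixes) | +| --- | --- | --- | +| `legacy` | Legacy extraction; content is extracted up front before the file enters the pipeline | `txt` `md` `mdx` `pdf` `docx` `pptx` `xlsx` `rtf` `odt` `tex` `epub` `html` `htm` `csv` `json` `xml` `yaml` `yml` `log` `conf` `ini` `properties` `sql` `bat` `sh` `c` `h` `cpp` `hpp` `py` `java` `js` `ts` `swift` `go` `rb` `php` `css` `scss` `less` | +| `native` | Built-in structure-aware content extractor | `docx` | +| `mineru` | External MinerU content extraction engine | `pdf` `doc` `docx` `ppt` `pptx` `xls` `xlsx` `png` `jpg` `jpeg` `jp2` `webp` `gif` `bmp` | +| `docling` | External Docling content extraction engine | `pdf` `docx` `pptx` `xlsx` `md` `html` `xhtml` `png` `jpg` `jpeg` `tiff` `webp` `bmp` | + +`mineru` and `docling` are external content extraction engines. Before enabling rules that use them, the service must be running and the corresponding endpoint/token must be configured in LightRAG. + +LightRAG caches the parse results of the `mineru` and `docling` engines locally. Uploading the same file again usually does not call the engine to parse it again. To delete the parse cache, you must select the "also delete files" option in the delete-file dialog of the document management UI. Changing the endpoint address or the effective extraction parameters of the `mineru` or `docling` engine also invalidates the cache, so the next upload of the same file calls the engine to parse the content again. + +#### MinerU Configuration and Local Deployment + +The MinerU client supports two modes; choose one: + +- `local`: a self-hosted MinerU service (deploying with the official Docker Compose setup is recommended); LightRAG calls the local container over HTTP. +- `official`: connects directly to the official MinerU precision API v4; a token must be requested at [mineru.net](https://mineru.net). + +**Local deployment (Docker Compose)** + +Clone the official repository from [opendatalab/MinerU](https://github.com/opendatalab/MinerU), enter the docker deployment directory inside the repository, and build the image first: + +```bash +docker compose -f compose.yaml build +``` + +Then start the API service (the HTTP API container is enabled only with `--profile api`, and listens on port 8000 by default): + +```bash +docker compose -f compose.yaml --profile api up -d +``` + +For image build details, GPU driver preparation, model weight locations, and so on, see the official README:. + +**Env configuration on the LightRAG side** + +Local mode (self-hosted mineru-api): + +```bash +MINERU_API_MODE=local +MINERU_LOCAL_ENDPOINT=http://localhost:8000 +``` + +Official mode (MinerU cloud API): + +```bash +MINERU_API_MODE=official 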
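+MINERU_API_TOKEN= +# MINERU_OFFICIAL_ENDPOINT=https://mineru.net # default value; usually no need to change +``` + +For the remaining advanced switches (`MINERU_MODEL_VERSION`, `MINERU_LANGUAGE`, `MINERU_ENABLE_TABLE` / `MINERU_ENABLE_FORMULA`, `MINERU_PAGE_RANGES`, `MINERU_LOCAL_BACKEND` / `MINERU_LOCAL_PARSE_METHOD`, `MINERU_POLL_INTERVAL_SECONDS` / `MINERU_MAX_POLLS`, `MINERU_ENGINE_VERSION`, `LIGHTRAG_FORCE_REPARSE_MINERU`, and others), see the MinerU section of the `env.example` template in the repository root. Note in particular that `MINERU_PAGE_RANGES` has different semantics in the two modes: `official` supports a full list (such as `1-3,5,7-9`), while `local` supports only a single page (`3`) or a simple range (`1-10`) and does not accept comma-separated lists. + +#### Docling Configuration + +The `docling` content extraction engine requires an external [docling-serve](https://github.com/DS4SD/docling-serve) service (v1 async API). Minimal configuration: + +```bash +DOCLING_ENDPOINT=http://localhost:5001 +``` + +`DOCLING_ENDPOINT` takes only the base URL (**without** `/v1/convert/file/async`). LightRAG currently always uses Docling's standard pipeline to process files. The following environment variables control the behavior of the Docling pipeline: + +| Env | Default | Meaning | +| --- | --- | --- | +| `DOCLING_DO_OCR` | `true` | Master switch for OCR | +| `DOCLING_FORCE_OCR` | `true` | Force OCR on every page (required for scanned documents; enabling it for non-scanned documents usually also helps layout recognition quality) | +| `DOCLING_OCR_ENGINE` | `auto` | OCR engine selection (changing it is not recommended) | +| `DOCLING_OCR_PRESET` | `auto` | OCR engine preset (changing it is not recommended) | +| `DOCLING_OCR_LANG` | (empty) | Set according to the OCR engine's requirements (changing it is not recommended) | +| `DOCLING_DO_FORMULA_ENRICHMENT` | `false` | Whether to recognize formulas in the document and output them in LaTeX format; before enabling, make sure the Docling backend has downloaded the formula recognition model (see the notes below) | + +When `DOCLING_OCR_ENGINE` / `DOCLING_OCR_PRESET` are not configured, they are equivalent to `auto`; when `DOCLING_OCR_LANG` is not configured, no language list is passed to docling-serve and the OCR engine uses its own default. The parse cache signature is computed from these effective parameters, so "not configured" and "explicitly set to the default value" do not invalidate the cache. + +Two env variables for the polling budget (docling-serve uses server-side long-polling, so the client does not sleep additionally): + +| Env | Default | Meaning | +| --- | --- | --- | +| `DOCLING_POLL_INTERVAL_SECONDS` | `5` | Polling interval while waiting for the parse result | +| `DOCLING_MAX_POLLS` | `240` | Maximum number of polling rounds; a `TimeoutError` is raised once it is exceeded;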
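 the default wait time ≈ 5 x 240 (about 20 minutes) | + +Three env variables for the bundle cache: + +| Env | Default | Meaning | +| --- | --- | --- | +| `DOCLING_ENGINE_VERSION` | (empty) | Docling engine version; a version change invalidates the parse cache | +| `LIGHTRAG_FORCE_REPARSE_DOCLING` | `false` | When set to `true`/`1`, the parse cache is not used | +| `DOCLING_BBOX_ATTRIBUTES` | `{"origin":"LEFTBOTTOM"}` | Default coordinate system for Docling layout | + +**Prerequisite for enabling `DOCLING_DO_FORMULA_ENRICHMENT`**: the code-formula model weights must be ready on the docling-serve side. The adapter handles both cases: when enabled, the `text` field is LaTeX; when it is disabled, or when missing weights cause `text == orig`, the content is automatically treated as plain text and `equations.json` is not written. The default `false` is therefore the conservative value; turn it on once the deployment side has confirmed that the model is ready. + +#### Local Docling Deployment (Enabling LaTeX Formula Recognition) + +The steps below use a Docker deployment of docling-serve as an example and cover everything from pulling the image to mounting the models. After deployment, write `DOCLING_DO_FORMULA_ENRICHMENT=true` into LightRAG's `.env` to enable LaTeX formula recognition. + +> **Important**: the following steps assume a GPU that supports CUDA 13. If the GPU is older and does not support CUDA 13, replace the image name `docling-serve-cu130:main` in the commands and the compose file with the tag for the matching CUDA version. For the list of available images, see [docling-serve Packages](https://github.com/orgs/docling-project/packages?repo_name=docling-serve). + +**1. Pull the image** + +```bash +docker pull ghcr.io/docling-project/docling-serve-cu130:main +``` + +**2. Download the models** + +```bash +# Create the docling working directory +mkdir docling +cd docling + +# Create the model mount directory +mkdir models + +# Copy the models bundled in the container into the models directory +docker run --rm -it \ + -v "$(pwd)/models:/opt/app-root/src/models" \ + ghcr.io/docling-project/docling-serve-cu130:main \ + cp -r /opt/app-root/src/.cache/docling/models /opt/app-root/src/ + +# Download the formula recognition model +docker run --rm \ + -v "$(pwd)/models:/opt/app-root/src/models" \ + -e DOCLING_SERVE_ARTIFACTS_PATH="/opt/app-root/src/models" \ + ghcr.io/docling-project/docling-serve-cu130:main \ + docling-tools models download-hf-repo docling-project/CodeFormulaV2 -o models +``` + +**3. 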
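Create the `docker-compose.yaml` file** + +In the `docling` directory from the previous step, create `docker-compose.yaml` with the following content: + +```yaml +services: + docling-serve: + image: ghcr.io/docling-project/docling-serve-cu130:main + container_name: docling-serve + ports: + - "5001:5001" + environment: + DOCLING_SERVE_ENABLE_UI: "true" + NVIDIA_VISIBLE_DEVICES: "all" + DOCLING_SERVE_ARTIFACTS_PATH: "/opt/app-root/src/models" + # deploy: # This section is for compatibility with Swarm + # resources: + # reservations: + # devices: + # - driver: nvidia + # count: all + # capabilities: [gpu] + runtime: nvidia + restart: always + volumes: + - ./models:/opt/app-root/src/models +``` + +Then run `docker compose up -d` in that directory to start the service. Once the container is ready, set the following in LightRAG's `.env`: + +```bash +DOCLING_ENDPOINT=http://localhost:5001 +DOCLING_DO_FORMULA_ENRICHMENT=true +``` + +LightRAG then recognizes formulas in documents through the local docling-serve and outputs them as LaTeX. + +### 2.5 File Processing Options + +Processing options control how a single file behaves with respect to multimodal analysis, knowledge graph construction, and text chunking. All options are optional; the defaults are listed in the table below. A file can specify at most one chunking method (F/R/V/P); the other options can be combined freely. + +| Option | Type | Default | Meaning | +| --- | --- | --- | --- | +| `i` | Multimodal | Off | Enable image analysis (VLM) | +| `t` | Multimodal | Off | Enable table analysis (VLM) | +| `e` | Multimodal | Off | Enable formula analysis (VLM) | +| `!` | Pipeline | Off | Disable entity/relation extraction and do not build the knowledge graph (only the chunk vector index is kept; naive / mix retrieval still works) | +| `F` | Chunking | Default | Fix / fixed-length chunking: the legacy method, which splits mechanically by a fixed token length or by a separator (chunks do not overlap when splitting by separator) | +| `R` | Chunking | - | Recursive character chunking (RecursiveCharacterTextSplitter@LangChain): takes a list of separators (default `["\n\n","\n","。","!","?",";",","," ",""]`, ordered from the strongest to the weakest semantic boundary). It splits by paragraph (double newline) first; if a resulting chunk still exceeds the token limit, it falls back step by step to single newline → Chinese sentence-ending punctuation (`。!?`) → Chinese mid-sentence punctuation (`;,`) → space → character-by-character splitting. **The default cascade includes Chinese punctuation**, so Chinese and mixed Chinese/English documents can be split at semantic boundaries. English `.?!` is deliberately excluded (literal matching would wrongly split `0.95` / `e.g.`). | +| `V` | Chunking | - | Vector / semantic chunking (SemanticChunker@LangChain): first splits the text into sentences (the default sentence-splitting regex recognizes both English `.?!` and Chinese `。?!`, so Chinese and mixed Chinese/English documents are split into sentences correctly), computes embeddings of adjacent sentences, and then looks for semantic breaks according to the chosen threshold strategy (such as percentile, standard_deviation, or interquartile). `SemanticChunker` itself has no chunk size limit: any semantic chunk larger than `chunk_token_size` is automatically split again with R before being stored (preserving V's non-overlapping semantics). This strategy does not produce overlapping chunks. | +| `P` | Chunking | - | 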
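Paragraph / paragraph-semantic chunking (native): splits by heading first and strictly avoids mixing the trailing content of one heading with the content of the next, which would break the semantics. It suits documents whose headings can be identified accurately and whose heading structure is clear. When an overlong body under a single heading falls back to R, overlap is allowed according to `CHUNK_P_OVERLAP_SIZE`; bridging text between adjacent large tables may also be repeated in the preceding and following table chunks within that budget. This method can only be applied to `lightrag` content stored in the sidecar directory. If the `lightrag` content does not exist, it degrades to chunking with the `R` method. This method produces overlapping chunks far less often than the `R` and `F` strategies. | + +> The global multimodal switch `addon_params["enable_multimodal_pipeline"]` is deprecated; the related behavior is now controlled entirely by the per-file `i/t/e` options. See [Appendix A](#附录-a从旧版升级的注意事项). + +#### Stages Where Options Take Effect + +Different processing option characters take effect at different stages of the pipeline: + +| Option | Stage | Description | +| :-: | --- | --- | +| i/t/e | Analyzing (multimodal analysis) | Decides whether the VLM is called to produce summary analysis for the images / tables / formulas in the sidecar. **The extraction stage is unaffected**: the content extraction engine outputs the `drawings.json` / `tables.json` / `equations.json` sidecar files according to the actual document content. This way, changing only the `i`/`t`/`e` options later and triggering a re-analysis is enough to run the VLM step, with no need to re-parse the original file. | +| ! | Extraction (entity and relation extraction) | Skips entity/relation extraction and graph writes; chunks are still written to the vector store so that naive / mix retrieval keeps working. | +| F/R/V/P | Chunking (text chunking) | Decides which chunking strategy is used; has no effect on the output of the parsing stage. | + +> Whether a modality is available is signaled solely by whether the sidecar file exists; content extraction engines do not need to declare capabilities in meta. If a document contains no images, tables, or formulas, the corresponding sidecar is not written; even if the user enabled `i/t/e`, the corresponding modality is silently skipped, but `analyze_multimodal` writes one INFO-level log line for that document (`[analyze_multimodal] sidecar e:equations empty: doc—id ...`), which helps when investigating why the VLM did not run. This case does not raise an error. + +### 2.6 Validation, Priority, and Fallback + +- `LIGHTRAG_PARSER` is strictly validated at startup: an unknown content extraction engine, a malformed suffix, an explicitly listed suffix the engine does not support, an external engine without an endpoint, or an illegal character in the processing options all cause startup to fail. +- **When a wildcard rule matches a suffix**, the engine must pass two usability checks (see `parser_routing._engine_is_usable`): (a) the engine's capability table supports the suffix; (b) if it is an external engine (`mineru` / `docling`), the corresponding endpoint/token environment variables are configured. If either check fails, the rule is skipped and matching continues with the next rule. For example, with `*:mineru;html:docling`: MinerU does not support the `html` suffix (check a fails), so `html` goes on to match `docling`; if `MINERU_API_MODE=local` but `MINERU_LOCAL_ENDPOINT` is not set, all PDFs also skip `*:mineru` and fall through to the next rule (check b fails). This behavior applies to both `LIGHTRAG_PARSER` rule matching and engine selection by file name hint. +- A file name hint has higher priority than `LIGHTRAG_PARSER`. If the engine specified by the hint does not support the suffix, the system falls back to the default rules and continues selecting a usable engine. +- If the file name hint provides a non-empty option string, the hint wins; otherwise the default options of the matching `LIGHTRAG_PARSER` rule are used; if neither exists, all defaults apply. +- If no rule is usable, file content extraction falls back to `legacy`; if `legacy` does not support the file suffix either, an error entry is added to the system and the uploaded file stays in the `INPUT` directory. +- At most one of F/R/V/P may appear; a repeated option takes effect only once and is not reported as an error. +- Options are case-sensitive: the chunking options F/R/V/P must be uppercase; the other options i/t/e are lowercase. +- If an illegal character appears inside the brackets, the whole hint is invalidated: the engine is resolved by the default rules, the options come from the `LIGHTRAG_PARSER` default or all defaults, and a warning is logged. +- `P` only takes effect on 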
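the structured LightRAG Document output extracted by `native`; for the `legacy` path or unstructured output it automatically degrades to `R` and logs a warning. + +## 3. Chunker Parameter Configuration (chunk_options) + +### 3.1 Responsibilities of process_options vs chunk_options + +`process_options` selects **which** chunking strategy to use (F/R/V/P); `chunk_options` decides **which parameters** that chunker uses. The two responsibilities are orthogonal: the former is a single-character selector, the latter a structured dictionary. + +``` +env vars (read once at startup) + │ + ▼ +addon_params["chunker"] (LightRAG instance field, filled from env with legacy fallbacks) + │ + ▼ resolve_chunk_options(addon_params, split_by_character=…, split_by_character_only=…) + │ +full_docs[doc_id]["chunk_options"] (frozen at enqueue time; an independent snapshot per file) + │ + ▼ +chunker(tokenizer, content, chunk_token_size, **strategy_kwargs) (dispatched by selector at chunking time) +``` + +- **env vars** are loaded into `addon_params["chunker"]` during `LightRAG.__init__` (`default_chunker_config()` reads the strategy-specific env variables, then `_apply_chunk_size_overlay` applies the legacy env variables as a fallback). +- **`addon_params["chunker"]`** is an `ObservableAddonParams` field; for a Server deployment, changing env and restarting is enough for new values to take effect. To change it at runtime inside a Python process (without a restart), or to override it per file, see [Chapter 8: Python SDK Usage](#八python-sdk-调用). +- **`full_docs.chunk_options`** is frozen when `apipeline_enqueue_documents` enqueues the document: by default it is assembled on the spot by `resolve_chunk_options(self.addon_params, ...)`; if the caller passes a `chunk_options` argument, it is persisted as-is (SDK usage, see §8.4). +- **The chunker call** takes the corresponding sub-dictionary from `full_docs.chunk_options` and dispatches to F/R/V/P according to the `process_options.chunking` selector. + +### 3.2 Environment Variables + +All variables in the table below are read once into `addon_params["chunker"]` when `LightRAG` is instantiated: strategy-specific env variables are read by `default_chunker_config()`, and the legacy env variables (`CHUNK_SIZE` / `CHUNK_OVERLAP_SIZE`) are applied by `_apply_chunk_size_overlay` as a fallback for slots that neither a strategy env variable nor a legacy constructor field has filled. After changing env, the service must be restarted (or a new `LightRAG` instance created) for the change to take effect; documents already enqueued hold a frozen snapshot and are not affected. + +| Variable | Default | Type | Scope | +|---|---|---|---| +| `CHUNK_SIZE` | `1200` | int | Legacy fallback for the top-level `chunk_token_size`; lower priority than strategy-specific env variables and than `addon_params["chunker"]["chunk_token_size"]` set through the SDK path | +| `CHUNK_OVERLAP_SIZE` | `100` | int | Legacy overlap fallback; filled in when a strategy has neither a specific env variable (`CHUNK_F_OVERLAP_SIZE` / `CHUNK_R_OVERLAP_SIZE` / `CHUNK_P_OVERLAP_SIZE`) nor `LightRAG(chunk_overlap_token_size=…)` from the SDK path | +| `CHUNK_F_OVERLAP_SIZE` | unset | int | F strategy-specific overlap; takes precedence over the legacy 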
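constructor fields and `CHUNK_OVERLAP_SIZE` | +| `CHUNK_F_SPLIT_BY_CHARACTER` | (unset = `null`) | str? | F pre-split separator; `null` / empty string = token window only | +| `CHUNK_F_SPLIT_BY_CHARACTER_ONLY` | `false` | bool | F strict mode: no second split by tokens; an overlong chunk raises an error | +| `CHUNK_R_SIZE` | unset | int | R strategy-specific `chunk_token_size`; takes precedence over the top-level legacy fallback (`CHUNK_SIZE` and `LightRAG(chunk_token_size=…)` from the SDK path). When unset, R inherits the top-level resolved value | +| `CHUNK_R_OVERLAP_SIZE` | unset | int | R strategy-specific overlap; takes precedence over the legacy constructor fields and `CHUNK_OVERLAP_SIZE` | +| `CHUNK_R_SEPARATORS` | `["\n\n","\n","。","!","?",";",","," ",""]` | JSON array string | R separator cascade, ordered from the strongest to the weakest semantic boundary. The default includes Chinese sentence-ending (`。!?`) and mid-sentence (`;,`) punctuation, so Chinese and mixed Chinese/English documents can be split at semantic boundaries. English `.?!` is deliberately excluded (literal matching would wrongly split numbers and abbreviations) | +| `CHUNK_V_SIZE` | unset | int | V strategy-specific `chunk_token_size` (hard cap; chunks above it are automatically split again with R); takes precedence over the top-level legacy fallback. When unset, V inherits the top-level resolved value | +| `CHUNK_V_BREAKPOINT_THRESHOLD_TYPE` | `percentile` | str | V threshold type; one of `percentile` / `standard_deviation` / `interquartile` / `gradient` | +| `CHUNK_V_BREAKPOINT_THRESHOLD_AMOUNT` | (unset = `null`) | float? | V threshold amount; `null` lets LangChain pick the default for the type (for example percentile=95) | +| `CHUNK_V_BUFFER_SIZE` | `1` | int | V sentence buffer window: the number of adjacent sentences merged when computing distances | +| `CHUNK_V_SENTENCE_SPLIT_REGEX` | `(?<=[.?!])\s+\|(?<=[。?!])` | str | V's sentence-splitting regex, passed to LangChain `SemanticChunker`. The default recognizes both English `.?!` (which must be followed by whitespace, to avoid wrongly splitting `0.95`) and Chinese `。?!` (no whitespace required, to fit continuous Chinese text). The env value is the raw regex string; no JSON quoting is needed | +| `CHUNK_P_SIZE` | `2000` (`DEFAULT_CHUNK_P_SIZE`) | int | P strategy-specific `chunk_token_size`. Unlike R/V, when unset P does **not** inherit the top-level `CHUNK_SIZE` / `LightRAG(chunk_token_size=…)`: paragraph-semantic merging needs a larger limit than the global default to keep related paragraphs together, so the slot always carries `DEFAULT_CHUNK_P_SIZE` (2000) | +| `CHUNK_P_OVERLAP_SIZE` | unset | int | P strategy-specific overlap; takes precedence over the legacy constructor fields and `CHUNK_OVERLAP_SIZE`. Used for text overlap when a long body inside a single JSONL content row falls back to R, and as the per-side budget for copying bridging text between adjacent large tables into the preceding and following table chunks | + +P's internal ratio constants are algorithmic scales and are derived proportionally from `chunk_token_size` automatically. P always uses a `chunk_token_size` that is independent of the global chain: even when `CHUNK_P_SIZE` is unset, P falls back to `DEFAULT_CHUNK_P_SIZE` (2000) and does **not** inherit the global `CHUNK_SIZE`, because paragraph-semantic merging needs a larger limit than the global default to keep related paragraphs together. To adjust this per deployment, override the default with `CHUNK_P_SIZE`. `CHUNK_P_OVERLAP_SIZE` only affects the plain-text fallback and the table bridging context inside P; it does not make row-level table slices overlap each other. `CHUNK_R_SIZE` 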
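/ `CHUNK_V_SIZE` behave differently: when unset they **do** inherit the top-level `chunk_token_size` (R favors a smaller target that helps sentence and paragraph splitting, while V, as an advisory ceiling, usually wants a larger value to reduce over-splitting). + +### 3.3 Priority Chain + +The final value of each chunking slot is resolved through a specificity-ordered chain (high → low): + +1. **Explicit value in `addon_params["chunker"]`**: a field value set at runtime through the SDK path or written explicitly at construction time (see §8.3). Server-only deployments usually do not have this tier. It is the most direct and overrides everything else. +2. **Strategy-specific env**: such as `CHUNK_F_OVERLAP_SIZE` / `CHUNK_R_OVERLAP_SIZE` / `CHUNK_P_OVERLAP_SIZE` / `CHUNK_R_SIZE` / `CHUNK_V_SIZE` / `CHUNK_P_SIZE` (there is no strategy-specific `CHUNK_F_SIZE` yet; F reuses the top-level `chunk_token_size`). Filled in only when the slot is not explicitly occupied by ①. +3. **Legacy constructor fields**: `LightRAG(chunk_token_size=…, chunk_overlap_token_size=…)`, which apply only on the SDK path; see §8.2. They are strategy-independent, coarse-grained defaults that only fill slots that are still empty. +4. **Legacy env**: `CHUNK_SIZE` / `CHUNK_OVERLAP_SIZE`. The final fallback. + +Example: `CHUNK_R_OVERLAP_SIZE=42` + `LightRAG(chunk_overlap_token_size=2)` → the R sub-dictionary gets `chunk_overlap_token_size=42` (the strategy env wins), and the F / P sub-dictionaries get `chunk_overlap_token_size=2` (no F / P specific env, so the legacy constructor field is filled in). + +**Special case for P's `chunk_token_size`**: P's `chunk_token_size` slot does **not** go through the full four-tier chain. When ① is not provided explicitly, it is resolved directly as `CHUNK_P_SIZE` env > `DEFAULT_CHUNK_P_SIZE` (2000), **skipping** ③ the legacy constructor field `LightRAG(chunk_token_size=…)` and ④ the legacy env `CHUNK_SIZE`. For the rationale, see the `CHUNK_P_SIZE` row in §3.2. + +Three semantic guarantees: + +1. **Reproducibility**: if env changes, after a restart old documents are still chunked from the snapshot taken at enqueue time, so the result does not change. +2. **Resume consistency**: resume branch B (content already extracted, chunking redone with the current `process_options`) also reads `full_docs.chunk_options`, which prevents env drift from breaking consistency. +3. 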
**per-file 个性化**:调用方可以为每个文件传不同的 `chunk_options`(典型用法:管理 UI 单独配置某个文件的 separators 或 V 阈值)。这是 SDK 路径的入参语义,详见 §8.4。 + +### 3.4 字段结构 + +`addon_params["chunker"]`(实例字段)保留全部四种策略的子字典作为运行时基线;`full_docs[doc_id]["chunk_options"]` 是**精简快照**——入队时只保留 `process_options` 选中的那一路策略子字典(缺省 F),其它策略的参数会被丢弃,因为处理阶段不会读它们。重新解析时 `process_options` 与 `chunk_options` 一同改写,避免旧策略的参数残留。 + +**`addon_params["chunker"]` 全量基线**(运行时可由 SDK 修改,影响后续入队): + +```jsonc +{ + "chunk_token_size": 1200, // 通用 token 上限 + "fixed_token": { // F 专属 + "chunk_overlap_token_size": 100, + "split_by_character": null, + "split_by_character_only": false + }, + "recursive_character": { // R 专属 + "chunk_token_size": 1200, // 可选;不写沿用顶层 chunk_token_size + "chunk_overlap_token_size": 100, + "separators": ["\n\n", "\n", "。", "!", "?", ";", ",", " ", ""] // 默认 cascade 含中文标点 + }, + "semantic_vector": { // V 专属 + "chunk_token_size": 1200, // 可选 hard cap;超过时通过 R 二次切分 + "breakpoint_threshold_type": "percentile", // percentile | standard_deviation | interquartile | gradient + "breakpoint_threshold_amount": null, // null = LangChain 默认 + "buffer_size": 1, + "sentence_split_regex": "(?<=[.?!])\\s+|(?<=[。?!])" // 默认正则兼容中英文句末标点 + }, + "paragraph_semantic": { // P 专属 + "chunk_token_size": 2000, // 不写则按 CHUNK_P_SIZE 或 DEFAULT_CHUNK_P_SIZE(2000)解析; + // **不**继承通用 chunk_token_size + "chunk_overlap_token_size": 100 // 不写沿用 legacy overlap 解析链 + } +} +``` + +**`full_docs[doc_id]["chunk_options"]` 精简快照**(按 selector 投影;下例为 `process_options="R"`): + +```jsonc +{ + "chunk_token_size": 1200, // 通用 token 上限(保留为顶层 fallback) + "recursive_character": { // 唯一保留的策略子字典 + "chunk_overlap_token_size": 100, + "separators": ["\n\n", "\n", "。", "!", "?", ";", ",", " ", ""] + } +} +``` + +selector → 子字典映射:F → `fixed_token`,R → `recursive_character`,V → `semantic_vector`,P → `paragraph_semantic`;无 selector 默认 F。各子字典与对应分块器函数的 keyword-only 参数一一对应;新增参数时无需改 dispatcher,只在 chunker 函数添加 kwarg 即可。 + +### 3.5 缺失兼容 + +老文档入队时还没有 `chunk_options` 字段;分块时 dispatcher 会按当前 
`process_options` 调用 `resolve_chunk_options(self.addon_params, process_options=…)` 兜底拼装一份精简快照。建议在升级后通过 reprocess 一次让老文档拿到精简的 `chunk_options` 快照(且与当前 `process_options` 对齐)。 + +## 四、存储与目录布局 + +### 4.1 `full_docs` 字段 + +文件入队和抽取结果会写入 `full_docs`: + +| 字段 | 说明 | +| --- | --- | +| `file_path` | 文件名 basename(不含目录),**保留用户提供的原始名(含中括号 hint)**,例如 `abc.[native-iet].docx` 原样写入。未提供有效来源时保存为 `unknown_source`。文件名 hint 不会被剥离,方便管理 UI 直接展示用户原本的命名意图。 | +| `canonical_basename` | 去掉处理提示 hint 后的规范化 basename(例如 `abc.docx`)。文件名查重以此字段为索引 key,保证 `abc.docx` 与 `abc.[native-iet].docx` 视为同一逻辑文档。 | +| `source_path` | 入队时提供的原始路径(仅当含目录分隔符或绝对路径时才写入),供 `native` / `mineru` / `docling` 解析器定位真实文件位置。 | +| `parse_format` | 内容格式:`pending_parse`, `raw`, `lightrag`。 | +| `content` | `raw` 时保存抽取文本;`pending_parse` 时为空字符串;`lightrag` 时存储以 `{{LRdoc}}` 开头的**完整合并文本**(拼接 `.blocks.jsonl` 中所有 `type=="content"` 行的 body 段),分块阶段 `parse_native` 会剥离前缀后再交给 chunking_func,与 `raw` 走完全相同的代码路径。 | +| `content_hash` | 内容 MD5,用于跨文件名查重。`parse_format=raw` 取 `sanitize_text_for_encoding` 后文本的 hash;`parse_format=lightrag` 取 `*.blocks.jsonl` 文件 hash;`parse_format=pending_parse` 不写入,待抽取完成后补上。 | +| `lightrag_document_path` | `parse_format=lightrag` 时保存结构化 LightRAG Document 的路径;新记录优先保存为相对 `INPUT_DIR` 的路径,例如 `__parsed__/report.docx.parsed/report.blocks.jsonl`。注意路径中的子目录与 blocks 文件名都使用规范化 basename(不含 hint)。 | +| `parse_engine` | 实际完成抽取的引擎:`legacy`, `native`, `mineru`, `docling`。对于待抽取文件,也可暂存目标引擎。 | +| `process_options` | 入队时记录的原始处理选项串(不含引擎名和分隔 `-`),例如 `"iet"`、`"R!"`、`""`。下游各阶段以此字段为权威源,决定是否启用图像/表格/公式分析(`i/t/e`)、是否禁止知识图谱构建(`!`)以及分块方式(`F/R/V/P`)。空字符串等价于全部默认值。 | +| `chunk_options` | 入队时**冻结**的分块器参数快照(精简字典:只保留 `process_options` 选中的那一路策略子字典,其它策略丢弃)。由 SDK 路径调用方传入或由 `resolve_chunk_options(self.addon_params, process_options=…)` 从实例字段(含 env 默认)兜底(见 §3.1)。`process_options` 选哪种分块策略(F/R/V/P),`chunk_options` 决定那一路分块器使用哪些参数。下游 `process_single_document` 在分块前从此字段读取专属 kwargs;持久化保证 env 变化、续跑、重启后老文档行为可复现。重新解析时与 `process_options` 一同改写。 | + +`pending_parse` 
表示文件已经入队,但还没有完成抽取。抽取成功后会改写为 `raw` 或 `lightrag`,并补齐 `content_hash`。抽取失败时保留 `pending_parse` 和空 `content`,便于后续排查和重试。 + +> `doc_status` 中也同步保存原始 `file_path`(含 hint)、`canonical_basename` 与 `content_hash`,作为 `get_doc_by_file_basename` / `get_doc_by_content_hash` 的查重索引来源。`get_doc_by_file_basename` 内部把传入参数先经 `canonicalize_parser_hinted_basename` 规范化后再与 `canonical_basename` 比对,因此 `abc.docx` 与 `abc.[native-iet].docx` 总是命中同一文档。 +> `process_options` 同时镜像写入 `doc_status.metadata["process_options"]`,便于管理 UI 直接展示当前文件的处理策略。 + +### 4.2 `__parsed__` 目录结构 + +`__parsed__` 是输入目录旁的归档与分析结果目录。它同时保存已经处理过的原始文档,以及结构化解析产生的 LightRAG Document (lightrag格式)的文件和图片等资源。 + +- 原始文件归档:`legacy` 本地抽取成功并入队后,原文件会移动到同级 `__parsed__` 目录;`native` / `mineru` / `docling` 会先保留原文件供 pipeline 解析,解析成功并写入 `full_docs` 后再移动到 `__parsed__`。**归档时保留原始文件名(含 `[hint]`)**,例如 `report.[native-iet].docx` 归档为 `__parsed__/report.[native-iet].docx`,便于追溯用户最初的命名与处理选项。 +- 分析结果目录:结构化解析结果会写入以**规范化文件名**(去掉 `[hint]`)加 `.parsed` 后缀命名的子目录,避免与归档原文件同名冲突,并保证当文件名 hint 或处理选项变化时同一逻辑文档继续指向同一目录。例如 `report.docx`、`report.[native].docx`、`report.[native-iet].docx` 的分析结果都写入 `__parsed__/report.docx.parsed/`。 +- 分析结果文件:LightRAG Document blocks 文件以及 sidecar 都使用规范化文件名的主干命名,例如 `__parsed__/report.docx.parsed/report.blocks.jsonl`;同一目录下还可能包含 `report.tables.json`、`report.drawings.json`、`report.equations.json` 和 `report.blocks.assets/` 图片资源目录。**sidecar 是否生成由文档内容决定**:解析器只在文档实际包含表格/图片/公式时写出对应文件。这是模态可用性的唯一信号 —— 引擎不需要在 meta 中声明能力。`i`/`t`/`e` 选项只决定下一阶段是否对已存在的 sidecar 调用 VLM 做摘要分析。 +- 解析失败时,原文件不会移动,便于修复配置后重新处理。 +- `/documents/scan` 扫描到同名且已 `PROCESSED` 的文件时,该输入文件会被视为已处理并移动到 `__parsed__`,不会作为新文档入队。 +- `/documents/scan` 同一次扫描中发现多个规范化后同名的文件时,会优先保留带支持引擎 hint 的文件以尊重用户的引擎选择;如果没有任何变体带 hint,则按排序处理第一个文件。其余变体会输出 warning 并移动到 `__parsed__`,避免同批文件互相覆盖。例如 `abc.docx` 和 `abc.[native].docx` 同时存在时只会处理 `abc.[native].docx`。 +- 扫描或解析过程中发现内容 hash 重复时,该输入文件同样会移动到 `__parsed__`;本次 `doc_status` 保留为 `FAILED duplicate` 以便追踪。 +- 移动文件只作用于当前输入文件,不会覆盖或移动既有文档源文件。若目标目录已存在同名文件,系统会自动追加 `_001`、`_002` 等编号,例如 
`report.pdf` 会依次归档为 `report_001.pdf`、`report_002.pdf`。若分析结果目录名已被普通文件占用,也会追加编号,例如 `report.docx.parsed_001/`。 + +### 4.3 MinerU 原始产物目录 `.mineru_raw/` + +`mineru` 引擎在解析过程中会把 MinerU 服务返回的完整产物(`content_list.json` + 可选的 `full.md` / `middle.json` / `layout.pdf` / `images/` 等)落到 `__parsed__/<规范文件名>.mineru_raw/` 目录下,并写入 `_manifest.json` 作为完整性校验文件。 + +设计目的: + +- **避免重复上传**。再次解析同一文件时,先用源文件的内容 hash + 文件大小校验 `_manifest.json`,命中即跳过 MinerU 服务调用,直接从本地 `content_list.json` 走 adapter → SidecarWriter 流程。 +- **保留诊断信息**。MinerU 解析出错或者下游 sidecar 字段异常时,可以直接到 `*.mineru_raw/` 比对原始 content_list 与图片资源。 +- **支持对象溯源**。MinerU 生成的 `drawings.json` / `tables.json` / `equations.json` 会在 `self_ref` 中保存 `content_list.json#/N`,用于回查对应的 MinerU 原始对象及其 `page_idx` / `bbox` 等定位信息。 +- **上传文件名去 hint**。源文件名包含 `[mineru-...]` / `[-iet]` 等处理 hint 时,调用 MinerU API 使用去 hint 后的规范文件名,避免 MinerU 返回的 raw bundle 内部文件名携带 hint。 + +生命周期: + +| 操作 | 行为 | +|---|---| +| 首次解析 | 下载所有产物 → 原子写入 `_manifest.json`。 | +| 重复解析(cache 命中) | 不调用 MinerU 服务;不重写产物;走 adapter+Writer 重生成 sidecar(适用于 adapter 升级场景)。 | +| 重复解析(cache miss) | 清空目录内所有文件后重新下载并写入 manifest。 | +| `DELETE /documents` 且 `delete_file=True` | `*.parsed/` 与 `*.mineru_raw/` 与原始文件一并删除。 | +| `DELETE /documents` 且 `delete_file=False` | 保留所有产物,仅删 doc_status 与 KG 数据。 | +| `clear_documents` / `__parsed__` 整体清理 | 自然一并清除。 | +| scan 周期 | 不主动 GC 孤儿 `*.mineru_raw/`(用户显式删除时才清,避免误删调试现场)。 | + +强制重新解析(绕过 cache):设置 `LIGHTRAG_FORCE_REPARSE_MINERU=true`。 + +并发安全:LightRAG 强制要求同一 workspace 下 `canonical_basename` 唯一(上传/入队时返回 HTTP 409),加上流水线对单个文档的串行化处理,因此 `*.mineru_raw/` 不会出现并发写入冲突,无需额外锁。 + +`_manifest.json` 失效条件(任一触发即 cache miss): + +- 源文件大小或 sha256 与 manifest 记录不符; +- `MINERU_ENGINE_VERSION` 环境变量与 manifest 记录的 `engine_version` 都非空且不一致; +- 当前 `MINERU_API_MODE` 与 manifest 记录的 `api_mode` 都非空且不一致; +- 当前 mode 对应 endpoint(`MINERU_OFFICIAL_ENDPOINT` / `MINERU_LOCAL_ENDPOINT`)与 manifest 记录的 `endpoint_signature` 都非空且不一致; +- `content_list.json` 大小或 sha256 与 manifest 不符; +- 任一记录的非关键文件(图片、`middle.json` 等)大小与 
manifest 不符。 + +> 关于 `engine_version` / `endpoint_signature` 的"任一侧为空即跳过"语义:当 manifest 写入时该字段为空(例如首次解析时未配置 `MINERU_ENGINE_VERSION`),或当前环境变量未设置时,该项不参与失效判断。如果首次解析时未设置版本环境变量,事后再补上并不会自动让历史缓存失效——这类场景需要手动设置 `LIGHTRAG_FORCE_REPARSE_MINERU=true` 触发重新解析。 + +### 4.4 Docling 原始产物目录 `.docling_raw/` + +`docling` 引擎在解析过程中会把 docling-serve 返回的 zip 产物(DoclingDocument JSON、Markdown 和引用图片)解压到 `__parsed__/<规范文件名>.docling_raw/` 目录下,并写入 `_manifest.json` 作为完整性校验文件。IR builder 在二次解析时会读取该目录的 `.json` 文件喂给 `DoclingIRBuilder`,不再走 docling-serve 服务。 + +目录布局: + +```text +__parsed__/.docling_raw/ +├── _manifest.json +├── .json # DoclingDocument JSON(含 pages[].image base64) +├── .md # Markdown 形态,供人工检查 +└── artifacts/ + └── image_*.png # pictures[*].image.uri 指向的图片资源 +``` + +设计目的: + +- **避免重复上传/转换**。再次解析同一文件时,先用源文件 hash + 文件大小校验 `_manifest.json`,命中即跳过对 docling-serve 的上传 / 轮询 / 下载,直接从本地 `.json` 走 DoclingIRBuilder → SidecarWriter 流程。 +- **保留诊断信息**。docling-serve 解析出错或下游 sidecar 字段异常时,可以直接到 `*.docling_raw/` 比对原始 DoclingDocument JSON、Markdown 与 `artifacts/` 图片。 + +生命周期: + +| 操作 | 行为 | +|---|---| +| 首次解析 | `POST /v1/convert/file/async` 上传 → 长轮询 `/v1/status/poll/{task_id}?wait=N` → `GET /v1/result/{task_id}` 下载 zip → 安全解压(拒绝绝对路径与 `..`)→ 原子写入 `_manifest.json`。 | +| 重复解析(cache 命中) | 不调用 docling-serve;不重写产物;走 adapter+Writer 重生成 sidecar(适用于 adapter 升级场景)。 | +| 重复解析(cache miss) | 清空目录内所有文件后重新上传 / 下载 / 写入 manifest。 | +| `DELETE /documents` 且 `delete_file=True` | `*.parsed/` 与 `*.docling_raw/` 与原始文件一并删除。 | +| `DELETE /documents` 且 `delete_file=False` | 保留所有产物,仅删 doc_status 与 KG 数据。 | +| `clear_documents` / `__parsed__` 整体清理 | 自然一并清除。 | +| scan 周期 | 不主动 GC 孤儿 `*.docling_raw/`(用户显式删除时才清,避免误删调试现场)。 | + +强制重新解析(绕过 cache):设置 `LIGHTRAG_FORCE_REPARSE_DOCLING=true`。 + +并发安全:与 MinerU 路径一致 —— LightRAG 强制要求同一 workspace 下 `canonical_basename` 唯一(上传 / 入队时返回 HTTP 409),加上流水线对单个文档的串行化处理,因此 `*.docling_raw/` 不会出现并发写入冲突,无需额外锁。 + +`_manifest.json` 失效条件(任一触发即 cache miss): + +- 源文件大小或 sha256 与 manifest 记录不符; +- `DOCLING_ENDPOINT` 与 
manifest 记录的 `endpoint_signature` 不一致; +- `DOCLING_ENGINE_VERSION` 设置且与 manifest 记录的 `engine_version` 不一致; +- `options_signature` 不一致 —— 任一 OCR / 公式 / pipeline 字段变化都会触发,覆盖范围包括: + - 可调 env:`DOCLING_DO_OCR` / `DOCLING_FORCE_OCR` / `DOCLING_OCR_ENGINE` / `DOCLING_OCR_PRESET` / `DOCLING_OCR_LANG` / `DOCLING_DO_FORMULA_ENRICHMENT`; + - 固化常量:`pipeline` / `target_type` / `to_formats` / `image_export_mode`(写入 signature 是为了防止未来值变更后老 bundle 被误复用); +- 主 JSON 缺失、大小或 sha256 不一致; +- `artifacts/` 内任一图片缺失或大小不一致; +- `LIGHTRAG_FORCE_REPARSE_DOCLING=true`。 + +> `engine_version` / `endpoint_signature` 的"任一侧为空即跳过"语义与 MinerU §4.3 一致:manifest 写入时该字段为空(首次未配置 `DOCLING_ENGINE_VERSION`)或当前环境变量未设置时,该项不参与失效判断;事后补上版本号不会自动让历史缓存失效,需要 `LIGHTRAG_FORCE_REPARSE_DOCLING=true` 触发。 + +## 五、文档重复判定规则 + +文件上传、文件解析入队和文本接口会按照「文件名 + 内容 hash」两道关卡判断是否重复,命中任一即视为重复并写入一条 `FAILED` 记录,不会覆盖已有的 `full_docs`。`/documents/scan` 目录扫描也使用同一套索引,但为了便于自动重试未完成文件,对文件名重复有单独的归档与重处理规则。 + +### 5.1 文件名(basename)查重 + +- 判断粒度为 basename,不包含目录路径和 workspace 路径。例如 `/data/a.pdf`、`inputs/a.pdf` 和 `a.pdf` 都视为同一个文件名 `a.pdf`。 +- 文件名查重以 `canonical_basename` 为索引:将文件名末尾的支持引擎处理提示 hint 剥离后再比对,因此 `abc.docx`、`abc.[native].docx`、`abc.[native-iet].docx` 之间互相视为同名;不支持的 hint 不会被剥离,例如 `abc.[draft].docx` 仍按原文件名处理。 +- 对普通上传、文本接口和核心入队 API,只要 `doc_status` 中已经存在同名文件记录,无论该记录当前处于 `PENDING`、`PARSING`、`ANALYZING`、`PROCESSING`、`FAILED` 还是 `PROCESSED`,同名文件都会被视为重复。 +- 对 `/documents/scan` 目录扫描: + - 同一次扫描中如果有多个文件规范化后同名,优先处理带支持引擎 hint 的文件;若无任何 hint 变体,则处理排序后的第一个文件,其余文件会归档到 `__parsed__` 并跳过。 + - 如果同名记录已经是 `PROCESSED`,当前扫描到的文件视为已处理文件,系统会输出 warning,将该输入文件移动到同级 `__parsed__` 目录,并跳过入队。 + - 如果同名记录不是 `PROCESSED`,扫描文件**不**仅因文件名相同而跳过,但**也不**会重新提取/覆盖既有记录。具体路径取决于既有记录的形态(与下文"为什么 scan 仍是独占写者"一节列举的分类规则一致): + - 同名非 PROCESSED 且 `full_docs` 存在 → **resume 路径**:doc_status 现状保留,源文件留在 `INPUT/`,由处理循环按状态查询接走(不重新提取、不覆盖既有状态)。 + - 同名 `FAILED` 且 `full_docs` 缺失 → 视为 `apipeline_enqueue_error_documents` 写下的提取错误 stub:scan 删掉这条 stub 后**把当前文件按新文件重新入队**。这是唯一会重新提取的子分支,目的是让"修好源文件再 scan 一次"自动生效。 +- 普通上传和核心入队 
API 中,同名文件即使内容已经变化,也需要先删除旧文档记录后再重新上传或入队;扫描路径上述两种自动恢复仅用于目录扫描场景。 +- 文本接口必须提供有效的 `file_source`,并按 `file_source` 的 basename 判断重复;缺少有效 `file_source` 时直接返回 400。 +- SDK 路径调用 `insert` / `ainsert` / `apipeline_enqueue_documents` 时不传 `file_paths` 是被允许的,相关行为详见 §8.4。这类无来源文档的 `file_path` 保存为 `unknown_source`。 +- 空字符串、`no-file-path` 和 `unknown_source` 都会被视为未知来源;它们不会阻止新的无来源文本入队,也不会作为同名文件互相去重。 + +存储后端通过 `get_doc_by_file_basename` 提供 basename 直查能力,内部按 `canonical_basename` 字段比对(传入参数会先经 `canonicalize_parser_hinted_basename` 规范化)。`JsonDocStatusStorage` 已经实现了内存级遍历;其它后端目前回落到默认实现(扫描全部状态后比对 `canonical_basename`),将在后续 PR 中补齐原生索引。 + +### 5.2 内容 hash 查重 + +- 文件名不同但抽取后的内容完全相同的文档同样视为重复。这里的 hash 是按配置的抽取引擎得到最终文本或 LightRAG Document 后计算的内容 hash,不是原始文件字节 hash。 +- `full_docs` 与 `doc_status` 会按内容格式写入或补齐 `content_hash` 字段: + - `parse_format=raw`:取经过 `sanitize_text_for_encoding` 之后的文本 MD5。 + - `parse_format=lightrag`:取 `lightrag_document_path` 解析出的 `*.blocks.jsonl` 文件 MD5。相对路径按 `INPUT_DIR` 解析。 + - `parse_format=pending_parse`:暂不写入 hash,等到真正完成解析后由后续步骤补上(避免按空内容误判)。 +- `legacy` 路径会在本地提取文本后、入队时进行内容 hash 查重;命中重复时,本次记录写为 `FAILED duplicate`,不会生成新的 `full_docs`、chunks 或图数据。 +- `native` / `mineru` / `docling` 路径会先以 `pending_parse` 入队;真正完成解析并补齐 `content_hash` 后,如果发现其它文档已有相同 hash,本次记录会在进入分析、切块、实体抽取和图写入前停止。 +- 重复记录会在 `metadata.duplicate_kind` 中标记为 `filename` 或 `content_hash`,便于排查。内容 hash 重复还会记录 `metadata.is_duplicate=true`、`metadata.original_doc_id` 和 `metadata.original_track_id`;解析后才发现的重复会删除本次临时写入的 `full_docs`。 +- 相关 warning 会尽量减少重复噪音:扫描发现已 `PROCESSED` 的同名文件时会写入日志和 pipeline status;入队阶段重复使用 LightRAG 层的 `Duplicate document detected (...)` 日志;解析完成后才发现的内容重复使用 `Duplicate content skipped after parsing`,并写入 pipeline status。扫描归档不会额外输出 `[File Extraction]Duplicate skipped`。 +- 存储后端通过 `get_doc_by_content_hash` 进行 hash 直查;命名约定与 `get_doc_by_file_basename` 一致。 + +> 入队批次内(同一次 `apipeline_enqueue_documents` 调用)也会做 basename 与 content_hash 去重,命中时把后续条目直接写为 `FAILED` 并标记 `existing_status=batch_duplicate`。其中 basename 
去重只对有效文件名生效;`unknown_source`、`no-file-path` 和空来源只参与内容 hash 去重。 +> +> **跨调用并发去重**也由 workspace 级串行锁保证(详见 [§6.7 enqueue 串行锁(防并发去重穿透)](#67-enqueue-串行锁防并发去重穿透)):两次相同内容、不同文件名的并发入队不会双双穿透 `content_hash` 检查。 + +## 六、流水线并发与重入约束 + +为防止 `scan` / `upload` / `insert` 与运行中的流水线相互覆盖 `doc_status` / `full_docs` 记录,所有写入入口在 `pipeline_status` 共享字典上协调。同一 workspace 下的 `pipeline_status_lock` 保证下表所有 transition 都在锁内原子完成。 + +### 6.1 `pipeline_status` 字段 + +| 字段 | 语义 | +| --- | --- | +| `busy` | 流水线繁忙的笼统标志。处理循环和破坏性作业(clear/delete)都会设它。**仅有 `busy=True`(处理循环)不阻塞 enqueue**——循环按 batch 拉取 `doc_status` 快照处理,每批结束后通过 `request_pending` 检查是否还有新工作。 | +| `destructive_busy` | `busy` 的破坏性子集:`/documents/clear` 或 `/documents/{doc_id}`(删除)正在 drop 存储 / 删源文件。reservation 和 enqueue last-line guard 都会拒绝——并发 enqueue 会写入正被 drop 的存储,已接受的文档会静默丢失。处理循环不会设此字段。 | +| `scanning` | `/documents/scan` 后台任务运行中(整个生命周期:分类阶段 + 处理阶段)。仅 `/scan` 端点用它拒绝重叠 scan,本身**不**阻塞 upload/insert。 | +| `scanning_exclusive` | `scanning` 的独占子集:只在 scan 的**分类阶段**为 True——run_scanning_process 在读 doc_status 分类(已处理 / 续跑 / 删 stub / 归档),不能与并发写者交错。reservation 和 enqueue last-line guard 都会拒绝。分类完成后会立即清旗,scan 进入处理阶段后允许并发 upload。 | +| `pending_enqueues` | 已通过 `_reserve_enqueue_slot` 但 bg task 未完成的 upload/insert 数。仅给 scan 端点参考——决定是否能拿独占。bg task 在 `finally` 里释放 slot。 | +| `request_pending` | 让运行中的处理循环再扫一轮的信号。enqueue 在 `busy=True` 时写完 `doc_status` 后置位;处理循环每个 batch 结束后检查并重新拉快照。 | + +### 6.2 入口行为 + +| 入口 | 条件 | 行为 | +| --- | --- | --- | +| `/documents/upload` / `/documents/text` / `/documents/texts` | `scanning_exclusive=True` 或 `destructive_busy=True` | 抛 HTTP 409,不写文件、不调入队 | +| 同上 | 否则(含纯 `busy=True`、scan 处理阶段 `scanning=True` 但 `scanning_exclusive=False`) | 锁内 `pending_enqueues++` 预留 slot → 严格名字预检 → 保存文件 → schedule bg task;bg task 在 `finally` 释放 slot | +| `/documents/scan` | `busy=True` 或 `scanning=True` 或 `pending_enqueues>0` | 落 warning 后立即返回 `scanning_skipped_pipeline_busy`,不 schedule 后台任务 | +| 同上 | 全部 idle | 锁内设 `scanning=True` 后 schedule,task 结束在 `finally` 清旗 | 
+| `/documents/clear` / `/documents/delete_document` | `busy=True` 或 `scanning=True` 或 `pending_enqueues>0` | 端点同步返回 `status="busy"`,不 schedule 后台任务 | +| 同上 | 全部 idle | 端点**同步**在锁内设 `busy=True` + `destructive_busy=True`(`delete_document` 在返回 `deletion_started` 之前),bg task 的 finally 一并清旗 | +| `apipeline_enqueue_documents` 内部 (last-line guard) | `scanning_exclusive=True` 且 `from_scan=False`,或 `destructive_busy=True` | 抛 `RuntimeError("Cannot enqueue while scan is classifying / clearing or deleting")` | +| 同上 | 任何其它情况(含纯 `busy=True`、scan 处理阶段) | 正常入队;写完 `doc_status` 后若 `busy=True` 自动 nudge `request_pending=True` | + +`from_scan=True` 是 scan 后台任务自身入队时的旁路:scan 已持有 `scanning` 旗标,必须允许它把扫到的文件入队。 + +### 6.3 为什么 `busy` 不再阻塞 enqueue + +旧版本里 `busy=True` 一律拒绝任何新入队,理由是"修改 `doc_status` 会与流水线工作线程交错"。但实际上: + +1. **写入顺序保证一致性**:`apipeline_enqueue_documents` 总是先 upsert `full_docs`、再 upsert `doc_status`。处理循环开头的 consistency check 仅删除"`doc_status` 行没有对应 `full_docs`"的孤儿——这种状态在并发 enqueue 中不可能出现。 +2. **批次级快照**:处理循环每个 batch 拉一次 `get_docs_by_statuses` 快照,新写入的 `PENDING` 行不会破坏当前 batch;下一轮通过 `request_pending` 重拉快照即可看到新工作。 +3. **`request_pending` 设计本就为此**:旧版同时存在 `request_pending` 字段——它就是为"运行中又有新工作"设计的,但被 busy 守护堵死了。 + +新契约把这个机制启用起来后,**用户在长批次处理过程中仍可继续上传新文档**,bg task 写完 `doc_status` 后由运行中的循环自动接管。 + +### 6.4 为什么 scan 仍是独占写者 + +scan 不仅 enqueue 自己扫到的新文件,还会读 `doc_status` 决定每个文件去向: + +- 同名 `PROCESSED` 行 → 归档源文件、跳过入队。 +- 同名非 PROCESSED 且 `full_docs` 存在 → resume 路径,源文件**保留在 `INPUT/`**,不归档(pending-parse 解析器仍可能需要它),由处理循环按状态查询接走。 +- 同名 `FAILED` 且 `full_docs` 缺失 → 识别为之前 `apipeline_enqueue_error_documents` 写下的提取错误 stub(一致性检查会保留这种行供人工 review),scan 自动删除该 stub 并把当前文件按新文件重新入队,让用户"修好源文件再 scan 一次"能直接生效。 + +这些"读—决策—写"组合不能与其它写者交错,否则分类决策会基于过期视图。所以 scan 必须独占,且 scan 端点会在 `busy` / `scanning` / `pending_enqueues>0` 任一存在时拒绝。 + +### 6.5 严格名字预检(upload 路径) + +upload 通过 reservation 后、保存文件前必须双道检查: + +1. 
**INPUT 目录扫描**:把要保存的 basename 经 `canonicalize_parser_hinted_basename` 规范化,遍历 INPUT 目录里现有任何同 canonical 变体(含 hint / 不含 hint),命中即 409。 +2. **doc_status 查重**:用规范化 basename 调 `get_existing_doc_by_file_basename`,命中即 409。 + +两道都过 → 保存文件 → schedule bg task → bg task 调 `apipeline_enqueue_documents` 写库 + 调 `apipeline_process_enqueue_documents` 触发处理。 + +> 旧版本曾允许 upload 在已有同名记录时悄悄写入 FAILED 重复条目;新规则改为 fail-fast,不在 doc_status 留下任何重复痕迹。如需替换同名文档,请先调用 `/documents/{doc_id}` 的删除接口。 + +### 6.6 多 reservation 并发的协调 + +两个 upload 同时进来时(scan 此时拿不到独占): + +1. A `_reserve_enqueue_slot` → `pending_enqueues=1`,写文件,schedule bg task A,返回 success。 +2. B `_reserve_enqueue_slot` → `pending_enqueues=2`,写文件,schedule bg task B,返回 success。 +3. bg task A `apipeline_enqueue_documents` → 写 `doc_status` → 调 `apipeline_process_enqueue_documents` → 设 `busy=True` 处理 A 的文档。 +4. bg task B `apipeline_enqueue_documents` → 看到 `scanning=False`,正常写入;写完后看到 `busy=True`,自动设 `request_pending=True`。 +5. bg task B 调 `apipeline_process_enqueue_documents` → 看到 `busy=True`,设 `request_pending=True` 立即返回。 +6. A 的处理循环跑完当前 batch,看到 `request_pending=True`,重拉快照,把 B 的 `PENDING` 行接上处理。 +7. 全部完成后 `busy=False`、`pending_enqueues=0`。 + +任何一个 bg task 都不会因为 busy 被误拒——因为 enqueue 不再检查 busy;处理循环也不会重复处理同一份 batch——`request_pending` 只在 batch 间生效,且每次重拉前清零。 + +### 6.7 enqueue 串行锁(防并发去重穿透) + +`apipeline_enqueue_documents` 内部"读 doc_status 做去重 → 写 `full_docs` / `doc_status`"这一段在 workspace 级 `enqueue_serialize` 锁内串行执行。原因:放开 busy/scan-processing 阶段允许并发 enqueue 之后,两次相同内容、不同文件名的入队(典型场景:scan 处理阶段的 enqueue 与 upload 同时进来)若在没有锁的情况下并发执行—— + +1. A 读 `doc_status` 查 `content_hash`:未命中。 +2. B 读 `doc_status` 查 `content_hash`:仍未命中(A 还没 upsert)。 +3. A upsert `full_docs` + `doc_status`。 +4. 
B upsert `full_docs` + `doc_status`。 + +结果:同 `content_hash` 的两条 `PENDING` 都进入流水线后续处理,原本应当被识别为 `duplicate_kind=content_hash` 的那条**没**被识别。 + +加上串行锁后第二次 enqueue 一定能在去重读时看到第一次已 upsert 的行,正常走"无新唯一文档"的早返回路径并把本次记为 `duplicate_kind=content_hash` 的 FAILED 行。锁的作用范围**只覆盖**: + +- `filter_keys`(按 doc_id 排除已存在) +- 文件名 / 内容 hash 去重读 +- 重复 FAILED 行的 upsert +- `full_docs.upsert` + `doc_status.upsert` + +锁**不**覆盖 `request_pending` nudge(在锁外,只取一下 `pipeline_status_lock`),也**不**阻塞处理循环的 `get_docs_by_statuses` 读(处理循环走的是 `doc_status` 自身的并发读,与 enqueue 写是 KV 级原子,不抢同一把锁)。锁顺序:`enqueue_serialize → pipeline_status_lock`,无死锁路径。 + +### 6.8 流水线并发参数 + +`pipeline_status` 相关的锁解决的是"谁能写"的正确性问题,本节这一组参数解决的是"同时跑几个 worker"的吞吐量问题。流水线分为 3 个阶段,每个阶段的 worker 池数量独立可调: + +``` + ┌─ q_native ──► [native parser × N1] ─┐ +PENDING ─►├─ q_mineru ──► [mineru parser × N2] ─┼─► q_analyze ─►[analyzer × N4] ─► q_process ─►[processor × N5] + └─ q_docling ──► [docling parser × N3] ─┘ +``` + +入队时 `resolve_stored_document_parser_engine` 根据每个文档的 `parser_engine`(来自 `LIGHTRAG_PARSER` 默认值或文件 hint)把它放入对应解析队列;3 个解析队列**完全互不阻塞**——mineru 占满不会拖慢 docling 或 native。解析完成后统一进入 `q_analyze`(多模态分析),再进入 `q_process`(实体/关系抽取 + 入库)。 + +| 环境变量 | 默认值 | 作用 | 调优建议 | +| --- | --- | --- | --- | +| `MAX_PARALLEL_PARSE_NATIVE` | `5` | N1: native 解析(docx / pdf / txt 等纯本地处理)并发 worker 数 | 纯 CPU、内存占用低,可按 CPU 核数提高 | +| `MAX_PARALLEL_PARSE_MINERU` | `1` | N2: MinerU 解析并发 worker 数 | MinerU 占用 GPU/CPU 显著,**默认串行最稳**。本地部署且显存充足时可设 2-3;走 MinerU 官方云端服务时可适当提高(受云端配额限制) | +| `MAX_PARALLEL_PARSE_DOCLING` | `1` | N3: Docling 解析并发 worker 数 | Docling 同样资源敏感,**默认串行最稳**。本地部署且 CPU/GPU 充足时可设 2-3 | +| `MAX_PARALLEL_ANALYZE` | `5` | N4: 多模态分析(VLM 图片 / 表格描述)并发 worker 数 | 直接消耗 VLM 配额。建议 ≤ VLM 服务并发上限 | +| `MAX_PARALLEL_INSERT` | `2` | N5: 实体 / 关系抽取 + 入库阶段并发文档数 | 推荐 `MAX_ASYNC / 3`,区间 2~10。该阶段每个文档会触发多次 LLM 调用,过高会撞 LLM 限流。同时该值还作为 `asyncio.Semaphore` 用于二次约束(worker 数和信号量值一致) | +| `QUEUE_SIZE_DEFAULT` | `100` | parse / analyze 阶段间的有界队列容量 | 一般无需调整。极少量大批量任务(成千上万)可适当提高,避免 enqueue 端反压;内存紧张时可调低 | 
+| `QUEUE_SIZE_INSERT` | `4` | analyze → process 阶段间的队列容量 | process 是流水线中最慢、最耗内存的阶段,队列特意做小,给上游提供反压防止内存堆积 | + +**几个要点:** + +1. **解析阶段按引擎隔离**,所以混用 native/mineru/docling 时不必担心一种引擎慢拖累另一种。 +2. **mineru / docling 默认串行(=1)**:实测两者资源占用高,并行收益不稳定(容易 OOM / 显存竞争 / 失败重试)。如果你部署了多 GPU 或专门的解析服务器,可手动调高。 +3. **`MAX_PARALLEL_INSERT` 兼任 worker 池大小和信号量上限**:流水线创建 `Semaphore(max_parallel_insert)`,每个 process worker 在抽取入库前还要拿一次信号量。所以哪怕你把 worker 数手动改大,实际并发上限仍由这个值决定——直接调它就够了。 +4. **queue size 与背压**:`QUEUE_SIZE_INSERT=4` 这个偏小的默认值是有意为之——process 阶段慢且占内存,让 analyze 阶段在队列写满时阻塞、再反压到 parse 阶段,避免一次性把成千上万份解析结果堆在内存里。 +5. **改后生效方式**:所有参数通过 `.env`(或环境变量)传入,仅在 `LightRAG` 实例构造时读取一次;改完需要重启服务。 + +**典型调优场景:** + +- 大量 PDF + 本地 MinerU 单 GPU:`MAX_PARALLEL_PARSE_MINERU=1`、`MAX_PARALLEL_ANALYZE=5`、`MAX_PARALLEL_INSERT=2`(默认即可)。 +- 大量 PDF + MinerU 云端服务:`MAX_PARALLEL_PARSE_MINERU=3~5`(视云端配额),其它保持默认。 +- 纯 docx / txt(仅走 native):`MAX_PARALLEL_PARSE_NATIVE=10`、`MAX_PARALLEL_INSERT` 按 `MAX_ASYNC/3` 推算。 +- LLM 限流明显:先降 `MAX_PARALLEL_INSERT`(process 阶段每文档多次 LLM 调用),再降 `MAX_PARALLEL_ANALYZE`(VLM 是独立配额)。 + +## 七、流水线启动时的续跑规则 + +每次 `apipeline_process_enqueue_documents` 起步时,会拉取所有处于 `PARSING` / `ANALYZING` / `PROCESSING` / `PENDING` / `FAILED` 状态的文档继续处理。续跑路径**根据"内容是否已抽取"分流**,保证同一个文档无论之前进度如何,按当前 `process_options` 续跑都有幂等结果。 + +续跑规则只对 `doc_id` 已经存在于 `doc_status` 的文档生效。新文件入队需要"并发与重入约束"中的文件查重逻辑,避免新文件挤掉旧的已经成功提取内容的文件记录。 + +### 7.1 判断"内容已抽取" + +读 `full_docs[doc_id]`: + +| `parse_format` | 判定 | +| --- | --- | +| `lightrag` 且 `lightrag_document_path` 文件存在 | ✅ 已抽取 | +| `raw` 且 `content` 非空 | ✅ 已抽取 | +| 其它(含 `pending_parse`、记录缺失) | ❌ 未抽取 | + +### 7.2 分支 A:未抽取 + +走完整流水线(`parse_native` / `parse_mineru` / `parse_docling` → `analyze_multimodal` → 分块 → 实体抽取),按 `full_docs.process_options` 决定每一阶段的行为。这是"首次入队"的常规流。 + +### 7.3 分支 B:已抽取 + +**一律跳过解析**(不重新调 `parse_*`),从 ANALYZING 阶段重启,并清光旧 chunks / entities 后按当前 `process_options` 重做: + +| 子步骤 | 行为 | +| --- | --- | +| 引擎对比 | 若 `process_options` 隐含的引擎 ≠ `full_docs.parse_engine`,**仅 
warn**,不重新解析。已抽取的内容是不可变事实,重新跑不同引擎会产生不一致。要切换引擎请先 delete 整个文档再重传。 | +| 旧 chunks / 实体 / 关系清理 | 读 `status_doc.chunks_list` 收集旧 chunk id 集,调 `_purge_doc_chunks_and_kg(doc_id, chunk_ids)`:从 `chunks_vdb` / `text_chunks` 删除 chunk 行;按 `entity_chunks` / `relation_chunks` 反查受影响的实体 / 关系,对失去全部源的条目直接从图谱与向量库删除,对仍有其它文档贡献的条目调 `rebuild_knowledge_from_chunks` 用剩余 chunks 重建;最后删除 `full_entities` / `full_relations` 中本 doc 的索引行。purge 完成后 `status_doc.chunks_list = []` / `chunks_count = 0` 重置,避免后续 state-machine upsert 写回旧 ID。 | +| `analyze_multimodal` | 对已启用模态,每次运行都会重新计算 sidecar item 分析并覆盖已有的 `llm_analyze_result`。由于 LLM cache 的存在重复计算通常会保持语义字段不变,只会重写 `analyze_time` 等运行时字段;cache miss,例如更换模型和提示词等,保存内容才可能与上次不同。 | +| 重新分块 | 按新 `process_options.chunking` 选策略,参数从 `full_docs.chunk_options` 读取(入队快照,不会因续跑被覆盖;env 改动后老文档仍按入队那一刻的参数分块)。LightRAG Document path 在 `process_options=P` 时走 paragraph_semantic,否则按 selector 分发到 F/R/V。 | +| 实体抽取 / KG-skip | 按新 `process_options.skip_kg` 决定 | + +> 这条规则保证:用户改 `i/t/e` 重传同名文档(先删旧 doc 再上传带新 hint 的文件)时,多模态分析能增量补齐;改 `F/R/V/P` 时 chunks 与图谱重建;改 `!` 时停掉或恢复 KG 构建。引擎变更被视为"重大变更",统一由 delete + 重传完成,不在续跑路径里隐式发生。 + +## 八、Python SDK 调用 + +本章针对**直接 import `LightRAG` 类**进行集成的开发者,覆盖 Server 部署不会用到的运行时 API、构造期参数和已移除的旧接口。Server 用户通常无须阅读本章。 + +### 8.1 适用对象 + +```python +from lightrag import LightRAG +rag = LightRAG(working_dir="./rag_storage", ...) 
+await rag.initialize_storages() +await rag.ainsert("text", file_paths="doc.pdf") +``` + +这种调用方式以下行为与 Server 路径不同:可在不重启进程的情况下改 `addon_params["chunker"]`,可向 `apipeline_enqueue_documents` 传入 per-file `chunk_options`,可在 `ainsert` 调用时动态覆盖 F 策略的预切分参数。 + +### 8.2 LightRAG 构造期参数 + +`LightRAG(chunk_token_size=…, chunk_overlap_token_size=…)` 是 §3.3 优先级链中的**第 3 档**:"legacy 构造字段"。strategy 无关、粗粒度缺省,只填仍空的槽位: + +- 优先级低于 `addon_params["chunker"]` 显式值(§8.3)和 strategy 特定 env(§3.2)。 +- 优先级高于 legacy env `CHUNK_SIZE` / `CHUNK_OVERLAP_SIZE`。 +- 实例字段 `self.chunk_token_size` / `self.chunk_overlap_token_size` 在 `__post_init__` 之后总会被回填为 `int`,方便仍读这两个字段的旧路径(如 `pipeline.py` 中 `chunk_opts.get("chunk_token_size") or self.chunk_token_size` 兜底)继续工作。 + +### 8.3 运行时改 `addon_params["chunker"]` + +`addon_params["chunker"]` 是 `ObservableAddonParams` 字段,可以**运行时改**: + +```python +rag.addon_params["chunker"]["recursive_character"]["separators"] = ["##", "\n", " "] +``` + +改完后,**后续入队**的文档拿到新默认;已入队文档保留入队时的快照不变(参见 §3.3 三层语义保证)。这是 §3.3 优先级链的第 1 档:"`addon_params["chunker"]` 显式值",赢一切。 + +Server 部署没有这个能力 —— 改 env 后必须重启服务才生效。 + +### 8.4 `apipeline_enqueue_documents(chunk_options=…)` + +`apipeline_enqueue_documents` 接受可选的 `chunk_options` 参数,调用方传入 `dict` / `list[dict]` 会按当前文档的 `process_options` 投影为精简快照(只保留对应策略子字典 + 顶层 `chunk_token_size`)后持久化到 `full_docs[doc_id]["chunk_options"]`;不传则由 `resolve_chunk_options(self.addon_params, process_options=…)` 现场拼装一份。调用方可以放心传入全量字典——其它策略子字典会被 dispatcher 丢弃,不会污染存储。 + +典型用法: + +```python +await rag.apipeline_enqueue_documents( + input=["text A", "text B"], + file_paths=["a.[native-R].txt", "b.txt"], + process_options=["R", ""], + chunk_options=[ + {"chunk_token_size": 800, "recursive_character": {"separators": ["\n\n", "\n"]}}, + {"chunk_token_size": 1500}, + ], +) +``` + +per-file 个性化的典型场景:管理 UI 单独配置某个文件的 separators 或 V 阈值;将来上传 API 也可在 form / hint 中接收覆盖。 + +**不传 `file_paths` 的兼容**:核心 API `insert` / `ainsert` / `apipeline_enqueue_documents` 仍兼容未传 `file_paths` 的调用;这类文档的 `file_path` 
会保存为 `unknown_source`,不会参与文件名查重,文档 ID 继续按文本内容生成。 + +`apipeline_enqueue_documents` 自身的并发约束(last-line guard、`from_scan=True` 旁路)见 §6.2 入口行为表。 + +### 8.5 `ainsert(split_by_character=…, split_by_character_only=…)` + +`LightRAG.ainsert(split_by_character=…, split_by_character_only=…)` 的运行时参数在入队时由 `resolve_chunk_options` 覆写到 `chunk_options.fixed_token`: + +- `split_by_character` 非 `None` 即覆盖 env 默认; +- `split_by_character_only=True` 即覆盖(`False` 是签名默认值,与"未指定"无法区分,所以 env 默认胜出)。 + +仅对 F 策略生效;其它策略的子字典不受影响。 + +### 8.6 已移除的 SDK 入参:`reprocess_existing_non_processed` + +旧 `apipeline_enqueue_documents` 的 `reprocess_existing_non_processed=True` 行为会在 scan 时直接删除非 PROCESSED 的旧记录并重建,与 §五 / §六 的规则相冲突,已整段移除。替代路径: + +- 自动续跑:scan 按 §6.4 的分类规则处理同名文件(归档 / 续跑 / 删 stub 后重入队),由 §七 续跑规则在处理循环里统一接管。 +- 强制刷新:先调 `/documents/{doc_id}` 删旧文档,再上传同名新文件。 diff --git a/docs/FileProcessingPipeline.md b/docs/FileProcessingPipeline.md new file mode 100644 index 0000000000..213be83c02 --- /dev/null +++ b/docs/FileProcessingPipeline.md @@ -0,0 +1,851 @@ +# File Processing Pipeline Specification + +Starting from version v1.5.0 (currently on the dev branch), LightRAG's file processing pipeline has received a major upgrade: + +* Supports multiple file content extraction engines: legacy, native, mineru, docling +* Supports multiple text chunking methods: Fix, Recursive, Vector, Paragraph +* Supports disabling entity-relation extraction for individual files + +LightRAG Server introduces an intermediate file-processing format: `LightRAG Document`. This format supports multimodal data such as tables and images, and also includes the document's section/paragraph metadata, which is convenient for content traceability later. 
+ +This document is organized from the perspective of **LightRAG Server** deployment and use: the quick-start configuration that can be applied directly is given first, followed by configuration syntax for content extraction and chunking, storage / directory layout, deduplication, concurrency, and resume rules. Developers who call the `LightRAG` class directly via Python should jump to [Chapter 8: Python SDK Invocation](#8-python-sdk-invocation). + +## 1. Quick Start + +### Keep the legacy file-processing behavior + +All files are processed using the legacy document parsing and chunking strategy. Either leave `LIGHTRAG_PARSER` unconfigured, or set it to the following value: + +```bash +LIGHTRAG_PARSER=*:legacy-F +``` + +### Recommended starting file-processing behavior + +No reliance on external document parsing services or on `VLM` vision models. Use the new built-in `Native` engine to parse `docx` documents with table (t) and equation (e) modality analysis enabled, paired with the `P` chunking strategy; other documents use the legacy content extractor paired with the more effective `R` chunking strategy. + +```bash +LIGHTRAG_PARSER=*:native-teP,*:legacy-R +``` + +### Enable multimodal processing capability + +Enabling multimodal processing requires the `MinerU` file parsing service and a `VLM` vision recognition model. Use `Native` to parse `docx` files; use `MinerU` to parse `pdf`, `office`, and various image files. All of the above files have image (i), table (t), and equation (e) modality analysis enabled and are paired with the `P` chunking strategy. Other documents fall back to the legacy content extractor paired with the `R` chunking strategy. + +```bash +LIGHTRAG_PARSER=*:native-iteP,*:mineru-iteP,*:legacy-R +VLM_PROCESS_ENABLE=true +VLM_LLM_MODEL=kimi-k2.6 +MINERU_API_MODE=local +MINERU_LOCAL_ENDPOINT=http://localhost:8000 +``` + +> `P` is LightRAG's native chunking strategy; see [Paragraph Semantic Chunking](ParagraphSemanticChunking.md) for details. 
For VLM configuration, see [Role-based LLM/VLM Configuration Guide](RoleSpecificLLMConfiguration.md). + +## 2. Content Extraction and Processing Option Configuration + +LightRAG's file processing configuration is composed of two parts: the content extraction engine determines how the original file is parsed, and the processing options determine whether multimodal analysis is performed after parsing, which chunking method to use, and whether to build a knowledge graph. Typically, the environment variable `LIGHTRAG_PARSER` is first used to set default rules by file extension, and then a `[hint]` in the filename overrides individual files. Engine and options can be written in the same configuration fragment, for example `docx:native-iet` or `report.[native-R!].docx`. + +For backward compatibility, when the configuration is not modified, the upgraded file content extraction behavior remains the original `legacy` behavior. To enable the new content processing engines, configure as described in this section. + +### 2.1 Configuration Syntax Overview + +The complete configuration model is as follows: + +```text +LIGHTRAG_PARSER=ext:engine-options,ext:engine,*:legacy-R +filename.[ENGINE].ext +filename.[ENGINE-OPTIONS].ext +filename.[-OPTIONS].ext +``` + +- `LIGHTRAG_PARSER` is the default rule table, matched by file extension, e.g., `pdf:mineru`, `docx:native-iet`. +- The `[hint]` in a filename is a single-file override rule, e.g., `paper.[mineru].pdf`, `memo.[native-R!].docx`. +- `ENGINE` is the content extraction engine: `legacy`, `native`, `mineru`, or `docling`. +- `OPTIONS` is a string combination of processing options, e.g., `iet`, `R!`, `P`. The options are ultimately written into `process_options` and read by subsequent pipeline stages. +- The hyphen in `ENGINE-OPTIONS` is only used to separate the engine from the options; it is not part of the options themselves. +- When only processing options are specified, it must be written as `[-OPTIONS]`, e.g., `[-!]`. 
`[abc]` without a hyphen is strictly interpreted as an engine name and will raise an error; it will not fall back to being interpreted as options. + +Common combination examples: + +```bash +LIGHTRAG_PARSER=pdf:mineru-R,docx:native-ietP,*:legacy-R +MINERU_API_MODE=local +MINERU_LOCAL_ENDPOINT=http://localhost:8000 +DOCLING_ENDPOINT=http://localhost:5001 +``` + +```text +my-proposal.[native-iet].docx # Use the native engine, enable drawing/table/equation analysis +my-memo.[native-R!].docx # Use the native engine, recursive semantic chunking, disable knowledge graph construction +my-proposal.[-!].docx # Use the default engine, only disable knowledge graph construction +my-proposal.[mineru].docx # Use the MinerU engine, all processing options default +``` + +### 2.2 Default Rules: `LIGHTRAG_PARSER` + +`LIGHTRAG_PARSER` is used to configure the default content extraction engine for different file extensions; default processing options for the rule can also be appended after the engine: + +```text +ext:engine,ext:engine,*:legacy +ext:engine;ext:engine;*:legacy +ext:engine-options +``` + +- The left side matches the file extension, not the full filename; write `pdf:mineru`, not `*.pdf:mineru`. +- Rules can be separated by either a comma `,` or a semicolon `;`. +- Rules are checked left to right; priority rules go in front, with the wildcard rule typically at the end. +- The `-options` suffix after the engine serves as the default `process_options` for files matched by this rule. For example, `LIGHTRAG_PARSER=docx:native-iet` means all `.docx` files default to the `native` engine with image, table, and equation analysis enabled. 
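The rule-table syntax above can be sketched in a few lines of Python. This is an illustrative sketch only, not LightRAG's actual router: `ParserRule`, `parse_parser_rules`, `match_rule`, and the `is_usable` callback are hypothetical names, and the callback merely stands in for whatever capability checks the real router applies before accepting a rule.

```python
import re
from typing import Callable, List, NamedTuple, Optional

ENGINES = {"legacy", "native", "mineru", "docling"}


class ParserRule(NamedTuple):
    ext: str      # extension without the dot, or "*" for the wildcard
    engine: str   # one of ENGINES
    options: str  # default process_options for this rule ("" if omitted)


def parse_parser_rules(value: str) -> List[ParserRule]:
    """Split a LIGHTRAG_PARSER-style string into rules, preserving order."""
    rules = []
    for item in re.split(r"[,;]", value):  # comma and semicolon both separate rules
        item = item.strip()
        if not item:
            continue
        ext, sep, target = item.partition(":")
        if not sep or not ext:
            raise ValueError(f"malformed rule: {item!r}")
        # The first hyphen separates the engine from the options.
        engine, _, options = target.partition("-")
        if engine not in ENGINES:
            raise ValueError(f"unknown engine in rule: {item!r}")
        rules.append(ParserRule(ext.lower(), engine, options))
    return rules


def match_rule(
    rules: List[ParserRule],
    filename: str,
    is_usable: Callable[[ParserRule, str], bool] = lambda rule, ext: True,
) -> Optional[ParserRule]:
    """Return the first rule, left to right, that matches the extension."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    for rule in rules:
        if rule.ext in ("*", ext) and is_usable(rule, ext):
            return rule
    return None


if __name__ == "__main__":
    rules = parse_parser_rules("pdf:mineru-R,docx:native-ietP;*:legacy-R")
    for name in ("paper.pdf", "report.docx", "notes.md"):
        print(name, "->", match_rule(rules, name))
```

Because matching stops at the first hit, a wildcard placed first would shadow every later rule unless `is_usable` rejects it for a given extension, which is why the wildcard normally goes last.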
+ +### 2.3 Single-File Override: filename hints + +Square brackets in the filename can be used to temporarily specify how a single file is processed: + +```text +paper.[mineru-R].pdf +slides.[docling].pptx +memo.[native-P].docx +notes.[-R].md +``` + +The content inside the square brackets supports three forms: + +```text +[ENGINE] # Specify only the engine; processing options use the default or what LIGHTRAG_PARSER provides +[ENGINE-OPTIONS] # Specify both engine and processing options +[-OPTIONS] # Specify only processing options; the engine still follows LIGHTRAG_PARSER / default rules +``` + +When parsing the hint, content without a hyphen must match an engine name exactly (`mineru` / `native` / `docling` / `legacy`); when there is content before a hyphen, the part before the hyphen is the engine and the part after is the options; when starting with a hyphen, it specifies only options. The legacy `[OPTIONS]` syntax is no longer valid; for example, `[iet]` must now be written as `[-iet]`. 
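The three hint forms can be told apart with a short parser. The sketch below is an assumption-laden illustration rather than LightRAG's implementation: `split_hint` and `_HINT_RE` are hypothetical names, the hint is assumed to sit immediately before the final extension, and an unknown engine name is reported with `ValueError` purely for demonstration.

```python
import re
from typing import Optional, Tuple

ENGINES = {"legacy", "native", "mineru", "docling"}

# Assumption: the hint is a ".[...]." segment directly before the final extension.
_HINT_RE = re.compile(r"\.\[([^\[\]]*)\]\.(?=[^.]+$)")


def split_hint(filename: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (name_without_hint, engine, options); None means "not specified"."""
    match = _HINT_RE.search(filename)
    if match is None:
        return filename, None, None

    hint = match.group(1)
    stripped = filename[: match.start()] + "." + filename[match.end():]

    if hint.startswith("-"):
        # [-OPTIONS]: options only, the engine comes from the default rules.
        return stripped, None, hint[1:]

    engine, sep, options = hint.partition("-")
    if engine not in ENGINES:
        # Content without a leading hyphen must name an engine exactly.
        raise ValueError(f"unknown engine in hint: [{hint}]")
    # [ENGINE] leaves options unspecified; [ENGINE-OPTIONS] sets both.
    return stripped, engine, options if sep else None


if __name__ == "__main__":
    for name in ("paper.[mineru-R].pdf", "slides.[docling].pptx", "notes.[-R].md"):
        print(name, "->", split_hint(name))
```

Returning `None` rather than an empty string for "not specified" keeps `[ENGINE]` distinguishable from a hint that explicitly supplies options, so a caller can still fall back to the rule-table defaults.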
+ +### 2.4 Content Extraction Engines + +| Engine | Description | Supported file formats (extensions) | +| --- | --- | --- | +| `legacy` | Legacy extraction; content is centrally extracted before joining the pipeline | `txt` `md` `mdx` `pdf` `docx` `pptx` `xlsx` `rtf` `odt` `tex` `epub` `html` `htm` `csv` `json` `xml` `yaml` `yml` `log` `conf` `ini` `properties` `sql` `bat` `sh` `c` `h` `cpp` `hpp` `py` `java` `js` `ts` `swift` `go` `rb` `php` `css` `scss` `less` | +| `native` | Built-in intelligent structured content extractor | `docx` | +| `mineru` | External MinerU content extraction engine | `pdf` `doc` `docx` `ppt` `pptx` `xls` `xlsx` `png` `jpg` `jpeg` `jp2` `webp` `gif` `bmp` | +| `docling` | External Docling content extraction engine | `pdf` `docx` `pptx` `xlsx` `md` `html` `xhtml` `png` `jpg` `jpeg` `tiff` `webp` `bmp` | + +`mineru` and `docling` are external content extraction engines; before enabling related rules, the services must be running first, and the corresponding endpoint/token must be configured in LightRAG. + +LightRAG caches the parsing results of the `mineru` and `docling` engines locally. Re-uploading the same file usually does not trigger the engine to re-parse the document. To delete the parse cache, you must click the "also delete file" option in the delete-file dialog of the document management interface. Modifying the endpoint addresses and effective extraction parameters of the `mineru` / `docling` engines will also invalidate the cache, causing the engine to re-parse the file content on the next upload of the same file. + +#### MinerU Configuration and Local Deployment + +The MinerU client supports two modes; choose one: + +- `local`: self-hosted MinerU service (the official Docker Compose deployment is recommended); LightRAG calls the local container via HTTP. +- `official`: directly connects to the MinerU official precise API v4; you need to apply for a token at [mineru.net](https://mineru.net). 
+ +**Local deployment (Docker Compose)** + +Clone the official [opendatalab/MinerU](https://github.com/opendatalab/MinerU) repository to your local machine, enter the docker deployment directory inside the repository, and first build the image: + +```bash +docker compose -f compose.yaml build +``` + +Then start the API service (`--profile api` is required to enable the HTTP API container; the default listening port is 8000): + +```bash +docker compose -f compose.yaml --profile api up -d +``` + +For image build details, GPU driver setup, model weight locations, etc., refer to the README in the official MinerU repository. + +**LightRAG-side env configuration** + +Local mode (self-hosted mineru-api): + +```bash +MINERU_API_MODE=local +MINERU_LOCAL_ENDPOINT=http://localhost:8000 +``` + +Official mode (MinerU cloud API): + +```bash +MINERU_API_MODE=official +MINERU_API_TOKEN= +# MINERU_OFFICIAL_ENDPOINT=https://mineru.net # Default value, usually no need to change +``` + +For the remaining advanced switches (`MINERU_MODEL_VERSION`, `MINERU_LANGUAGE`, `MINERU_ENABLE_TABLE` / `MINERU_ENABLE_FORMULA`, `MINERU_PAGE_RANGES`, `MINERU_LOCAL_BACKEND` / `MINERU_LOCAL_PARSE_METHOD`, `MINERU_POLL_INTERVAL_SECONDS` / `MINERU_MAX_POLLS`, `MINERU_ENGINE_VERSION`, `LIGHTRAG_FORCE_REPARSE_MINERU`, etc.), refer to the MinerU section of the `env.example` template at the repository root. Note that `MINERU_PAGE_RANGES` has different semantics in the two modes: `official` supports a complete list (e.g., `1-3,5,7-9`), while `local` only supports a single page (`3`) or a simple range (`1-10`); it does not accept comma-separated lists. + +#### Docling Configuration + +The `docling` content extraction engine requires an external [docling-serve](https://github.com/DS4SD/docling-serve) service (v1 async API). Minimal configuration: + +```bash +DOCLING_ENDPOINT=http://localhost:5001 +``` + +`DOCLING_ENDPOINT` is just the base URL (**without** `/v1/convert/file/async`).
Currently LightRAG uses Docling's standard pipeline to process files. Users can control the behavior of the Docling pipeline through the following environment variables: + +| Env | Default | Meaning | +| --- | --- | --- | +| `DOCLING_DO_OCR` | `true` | OCR master switch | +| `DOCLING_FORCE_OCR` | `true` | Force OCR per page (mandatory for scanned documents; enabling it for non-scanned documents usually also helps improve layout recognition quality) | +| `DOCLING_OCR_ENGINE` | `auto` | OCR engine selection (not recommended to change) | +| `DOCLING_OCR_PRESET` | `auto` | OCR engine preset (not recommended to change) | +| `DOCLING_OCR_LANG` | (empty) | Set per OCR engine requirements (not recommended to change) | +| `DOCLING_DO_FORMULA_ENRICHMENT` | `false` | Whether to recognize equations in the document and output them in LaTeX format; before enabling, ensure that Docling has downloaded the equation recognition model on the backend (see explanation below) | + +When `DOCLING_OCR_ENGINE` / `DOCLING_OCR_PRESET` are not configured, they are equivalent to `auto`; when `DOCLING_OCR_LANG` is not configured, no language list is passed to docling-serve, and the OCR engine uses its own default. The parse cache signature is computed from these effective parameters, so "not configured" and "explicitly set to the default value" do not invalidate the cache. + +Two polling-budget envs (docling-serve uses server-side long-poll; the client does not sleep extra): + +| Env | Default | Meaning | +| --- | --- | --- | +| `DOCLING_POLL_INTERVAL_SECONDS` | `5` | Poll interval for awaiting parse results | +| `DOCLING_MAX_POLLS` | `240` | Maximum poll iterations; raises `TimeoutError` when exceeded;
default wait time ≈ 5 × 240 (about 20 minutes) | + +Three bundle-cache envs: + +| Env | Default | Meaning | +| --- | --- | --- | +| `DOCLING_ENGINE_VERSION` | (empty) | Docling engine version; version changes invalidate the parse cache | +| `LIGHTRAG_FORCE_REPARSE_DOCLING` | `false` | When set to `true`/`1`, the parse cache is not used | +| `DOCLING_BBOX_ATTRIBUTES` | `{"origin":"LEFTBOTTOM"}` | Default coordinate system for Docling layout | + +**Prerequisites for `DOCLING_DO_FORMULA_ENRICHMENT`**: the docling-serve side must have the code-formula model weights ready. The adapter is dual-track compatible — when enabled, the `text` field is LaTeX; when disabled, or when missing weights cause `text == orig`, it falls back to plain text and does not write `equations.json`. Therefore the default of `false` is conservative; turn it on only after confirming the model is ready on the deployment side. + +#### Docling Local Deployment (enabling LaTeX equation recognition) + +The following uses a Docker-based docling-serve deployment as an example, giving the complete steps from image download to model mounting. After deployment completes, write `DOCLING_DO_FORMULA_ENRICHMENT=true` into LightRAG's `.env` to enable LaTeX equation recognition. + +> **Important**: the steps below are based on an environment where the GPU supports CUDA 13. If your GPU is older and does not support CUDA 13, replace the image name `docling-serve-cu130:main` in the command and compose file with the tag corresponding to your CUDA version. For the list of available images, see [docling-serve Packages](https://github.com/orgs/docling-project/packages?repo_name=docling-serve). + +**1. Pull the image** + +```bash +docker pull ghcr.io/docling-project/docling-serve-cu130:main +``` + +**2. 
Download models** + +```bash +# Create the docling working directory +mkdir docling +cd docling + +# Create the model mount directory +mkdir models + +# Copy the existing models inside the container into the models directory +docker run --rm -it \ + -v "$(pwd)/models:/opt/app-root/src/models" \ + ghcr.io/docling-project/docling-serve-cu130:main \ + cp -r /opt/app-root/src/.cache/docling/models /opt/app-root/src/ + +# Download the equation recognition model +docker run --rm \ + -v "$(pwd)/models:/opt/app-root/src/models" \ + -e DOCLING_SERVE_ARTIFACTS_PATH="/opt/app-root/src/models" \ + ghcr.io/docling-project/docling-serve-cu130:main \ + docling-tools models download-hf-repo docling-project/CodeFormulaV2 -o models +``` + +**3. Create `docker-compose.yaml`** + +Create `docker-compose.yaml` in the `docling` directory from the previous step, with the following contents: + +```yaml +services: + docling-serve: + image: ghcr.io/docling-project/docling-serve-cu130:main + container_name: docling-serve + ports: + - "5001:5001" + environment: + DOCLING_SERVE_ENABLE_UI: "true" + NVIDIA_VISIBLE_DEVICES: "all" + DOCLING_SERVE_ARTIFACTS_PATH: "/opt/app-root/src/models" + # deploy: # This section is for compatibility with Swarm + # resources: + # reservations: + # devices: + # - driver: nvidia + # count: all + # capabilities: [gpu] + runtime: nvidia + restart: always + volumes: + - ./models:/opt/app-root/src/models +``` + +Then execute `docker compose up -d` in that directory to start the service. After the container is ready, set the following in LightRAG's `.env`: + +```bash +DOCLING_ENDPOINT=http://localhost:5001 +DOCLING_DO_FORMULA_ENRICHMENT=true +``` + +This enables LightRAG to recognize equations in documents via the local docling-serve and output them in LaTeX form. + +### 2.5 File Processing Options + +Processing options control the behavior of a single file with respect to multimodal analysis, knowledge graph construction, and text chunking. 
All options are optional; defaults are shown in the table below. At most one chunking method (F/R/V/P) is specified per file; the other options can be combined arbitrarily. + +| Option | Type | Default | Meaning | +| --- | --- | --- | --- | +| `i` | Multimodal | Off | Enable image analysis (VLM) | +| `t` | Multimodal | Off | Enable table analysis (VLM) | +| `e` | Multimodal | Off | Enable equation analysis (VLM) | +| `!` | Pipeline | Off | Disable entity/relation extraction; do not build the knowledge graph (only the chunks vector index is kept; naive / mix retrieval still works) | +| `F` | Chunking | Default | Fix / fixed-length chunking: legacy method, splits mechanically by fixed token length or by separator (no chunk overlap when splitting by separator) | +| `R` | Chunking | - | Recursive / recursive character chunking (RecursiveCharacterTextSplitter@LangChain): takes a list of separators (default `["\n\n","\n","。","！","？","；","，"," ",""]`, ordered from strongest to weakest semantic boundary). Splits by paragraph (double newline) first; if a chunk is still over the token limit, falls back stepwise to single newline → Chinese sentence-ending punctuation (`。！？`) → Chinese mid-sentence punctuation (`；，`) → space → per-character split. **The default cascade includes Chinese punctuation**, letting Chinese / mixed Chinese-English documents split at semantic boundaries. English `.?!` is deliberately excluded (literal matching would mis-split `0.95` / `e.g.`). | +| `V` | Chunking | - | Vector / semantic vector chunking (SemanticChunker@LangChain): first splits text into sentences (the default sentence splitting regex recognizes both English `.?!` and Chinese `。？！`, allowing correct sentence splitting in Chinese / mixed Chinese-English documents), computes embeddings of adjacent sentences, then finds semantic breakpoints based on the specified threshold strategy (e.g., percentile, standard_deviation, or interquartile) for splitting. 
`SemanticChunker` itself has no chunk size cap, so any semantic chunk that exceeds `chunk_token_size` is automatically split again by R before persistence (preserving V's non-overlap semantics). This chunking strategy never produces overlapping chunks. | +| `P` | Chunking | - | Paragraph / paragraph semantic chunking (native); splits by heading first and strictly avoids mixing the tail of the content under one heading with the content under the next heading, which would break semantics. Best suited to documents with a clear heading structure whose headings can be identified accurately. When the body under the same heading is too long and falls back to R, overlap can be preserved according to `CHUNK_P_OVERLAP_SIZE`; bridging text between adjacent large tables can also be repeated into the surrounding table chunks within that budget. This chunking method can only be applied to `lightrag` content stored in the sidecar directory. If `lightrag` content does not exist, it degrades to chunking with `R`. This chunking method produces far fewer overlapping chunks than the `R` or `F` strategies. | + +> The global multimodal switch `addon_params["enable_multimodal_pipeline"]` is deprecated; the related behavior is now uniformly controlled by the file-level `i/t/e` options. See [Appendix A](#appendix-a-notes-on-upgrading-from-legacy). + +#### Option effective stages + +Different option characters take effect at different stages of the pipeline: + +| Option | Stage | Description | +| :-: | --- | --- | +| i/t/e | Analyzing (multimodal analysis) | Determines whether VLM summarization analysis is invoked on the images / tables / equations in the sidecar. **The extraction stage is unaffected**: the content extraction engine outputs `drawings.json` / `tables.json` / `equations.json` sidecar files based on what the document actually contains. 
As a result, simply tweaking the `i`/`t`/`e` options to trigger "re-analysis" can complete VLM later without re-parsing the original file. | +| ! | Extraction (entity-relation extraction) | Skips entity/relation extraction and graph writing; chunks are still written to the vector store to retain naive / mix retrieval capabilities. | +| F/R/V/P | Chunking (text chunking) | Determines which chunking strategy to use; does not affect the output of the parsing stage. | + +> Modality availability is signaled solely by "whether the sidecar file exists"; the content extraction engine does not need to declare its capabilities in meta. If a given document contains no images/tables/equations, the corresponding sidecar is not written; even if the user has enabled `i/t/e`, the corresponding modality is silently skipped, but `analyze_multimodal` logs an INFO-level line for that document (`[analyze_multimodal] sidecar e:equations empty: doc—id ...`), making it easy to diagnose "why didn't the VLM run". This is not an error. + +### 2.6 Validation, Priority, and Fallback + +- `LIGHTRAG_PARSER` is strictly validated at startup: unknown content extraction engines, malformed extension syntax, explicitly using an unsupported extension, external engines missing endpoint, and illegal characters in processing options all cause startup to fail. +- **When a wildcard rule matches a certain extension**, the engine must pass two usability checks (see `parser_routing._engine_is_usable`): (a) the engine's capability table supports that extension; (b) if it is an external engine (`mineru` / `docling`), the corresponding endpoint/token environment variable is configured. If either check fails, the rule is skipped and the next rule is matched. 
For example, in `*:mineru;html:docling`: MinerU does not support the `html` extension (condition a fails), so `html` continues to match `docling`; if `MINERU_API_MODE=local` but `MINERU_LOCAL_ENDPOINT` is not set, all PDFs also skip `*:mineru` and fall to the next rule (condition b fails). This behavior applies to both `LIGHTRAG_PARSER` rule matching and filename hint engine selection. +- Filename hints have higher priority than `LIGHTRAG_PARSER`. If the engine specified in a hint does not support that extension, the system falls back to the default rules to continue selecting an available engine. +- If the filename hint provides a non-empty options string, the hint takes precedence; otherwise the default options of the matching item in `LIGHTRAG_PARSER` are used; if neither is provided, all defaults are used. +- If no rule is available, the file content extraction falls back to `legacy`; if `legacy` also does not support the file extension, an error entry is added to the system and the uploaded file remains in the `INPUT` directory. +- At most one of F/R/V/P may appear; repeating the same option has effect only once but does not raise an error. +- Case-sensitive: the chunking options F/R/V/P must be uppercase; other options i/t/e must be lowercase. +- If illegal characters appear inside the square brackets, the entire hint is invalidated, the engine follows the default rules, and the options fall back to `LIGHTRAG_PARSER` defaults or all defaults; a warning is also logged. +- `P` is only effective for structured `LightRAG Document` results extracted by `native`; for the `legacy` path or unstructured output, it automatically degrades to `R` and logs a warning. + +## 3. Chunker Parameter Configuration (chunk_options) + +### 3.1 Responsibilities of process_options vs chunk_options + +`process_options` selects **which** chunking strategy (F/R/V/P), while `chunk_options` decides **which parameters** that chunker uses. 
The two responsibilities are orthogonal: the former is a single-character selector, the latter is a structured dictionary. + +``` +env vars (read once at startup) + │ + ▼ +addon_params["chunker"] (LightRAG instance field, filled by env with legacy fallback) + │ + ▼ resolve_chunk_options(addon_params, split_by_character=…, split_by_character_only=…) + │ +full_docs[doc_id]["chunk_options"] (frozen at enqueue time, an independent snapshot per file) + │ + ▼ +chunker(tokenizer, content, chunk_token_size, **strategy_kwargs) (dispatched by selector during chunking) +``` + +- **env vars** are loaded into `addon_params["chunker"]` during the `LightRAG.__init__` stage (strategy-specific env is read by `default_chunker_config()`, then `_apply_chunk_size_overlay` fills in legacy env as a fallback). +- **`addon_params["chunker"]`** is an `ObservableAddonParams` field; for Server deployments, you only need env / restart for the new values to take effect. To change it at runtime within the Python process (without restarting) and to do per-file overrides, see [Chapter 8: Python SDK Invocation](#8-python-sdk-invocation). +- **`full_docs.chunk_options`** is frozen at `apipeline_enqueue_documents` enqueue time: by default it is assembled by `resolve_chunk_options(self.addon_params, ...)` on the spot; if the caller passes a `chunk_options` argument, it is persisted as-is (SDK usage, see §8.4). +- **The chunker invocation** takes the corresponding sub-dictionary from `full_docs.chunk_options` and dispatches to F/R/V/P by the `process_options.chunking` selector. + +### 3.2 Environment Variables + +All variables in the table below are read into `addon_params["chunker"]` once when `LightRAG` is instantiated: strategy-specific env is read by `default_chunker_config()`, while legacy env (`CHUNK_SIZE` / `CHUNK_OVERLAP_SIZE`) is filled in by `_apply_chunk_size_overlay` into slots that neither strategy env nor legacy constructor fields filled. 
After modifying env, the service must be restarted (or a new `LightRAG` instance created) for it to take effect; documents already enqueued hold the frozen snapshot and are unaffected. + +| Variable | Default | Type | Scope | +|---|---|---|---| +| `CHUNK_SIZE` | `1200` | int | Legacy top-level `chunk_token_size` fallback; lower priority than strategy-specific env and the SDK path setting of `addon_params["chunker"]["chunk_token_size"]` | +| `CHUNK_OVERLAP_SIZE` | `100` | int | Legacy overlap fallback; filled when a strategy has neither a specific env (`CHUNK_F_OVERLAP_SIZE` / `CHUNK_R_OVERLAP_SIZE` / `CHUNK_P_OVERLAP_SIZE`) nor the SDK path's `LightRAG(chunk_overlap_token_size=…)` | +| `CHUNK_F_OVERLAP_SIZE` | unset | int | F strategy-specific overlap; higher than the legacy constructor field and `CHUNK_OVERLAP_SIZE` | +| `CHUNK_F_SPLIT_BY_CHARACTER` | (unset = `null`) | str? | F pre-split separator; `null` / empty string = split by token window only | +| `CHUNK_F_SPLIT_BY_CHARACTER_ONLY` | `false` | bool | F strict mode: no secondary token split; raise error when oversized | +| `CHUNK_R_SIZE` | unset | int | R strategy-specific `chunk_token_size`; higher than top-level legacy fallback (`CHUNK_SIZE` and the SDK path's `LightRAG(chunk_token_size=…)`). When unset, R inherits the top-level resolved value. | +| `CHUNK_R_OVERLAP_SIZE` | unset | int | R strategy-specific overlap; higher than the legacy constructor field and `CHUNK_OVERLAP_SIZE` | +| `CHUNK_R_SEPARATORS` | `["\n\n","\n","。","！","？","；","，"," ",""]` | JSON array string | R separator cascade, ordered from strongest to weakest semantic boundary. The default includes Chinese sentence-ending (`。！？`) and mid-sentence (`；，`) punctuation, letting Chinese / mixed Chinese-English documents split at semantic boundaries. English `.?!` is deliberately excluded (literal matching would mis-split numbers and abbreviations). 
| +| `CHUNK_V_SIZE` | unset | int | V strategy-specific `chunk_token_size` (hard cap, automatically re-split through R when exceeded); higher than the top-level legacy fallback. When unset, V inherits the top-level resolved value. | +| `CHUNK_V_BREAKPOINT_THRESHOLD_TYPE` | `percentile` | str | V threshold type; can be `percentile` / `standard_deviation` / `interquartile` / `gradient` | +| `CHUNK_V_BREAKPOINT_THRESHOLD_AMOUNT` | (unset = `null`) | float? | V threshold magnitude; `null` lets LangChain pick the default by type (e.g., percentile=95) | +| `CHUNK_V_BUFFER_SIZE` | `1` | int | V sentence buffer window; the number of adjacent sentences to merge during distance computation | +| `CHUNK_V_SENTENCE_SPLIT_REGEX` | `(?<=[.?!])\s+\|(?<=[。？！])` | str | V's sentence splitting regex, fed to LangChain's `SemanticChunker`. The default recognizes both English `.?!` (requiring trailing whitespace to avoid mis-splitting `0.95`) and Chinese `。？！` (no whitespace required, fitting Chinese continuous writing). The env value is the raw regex string; no JSON quoting needed. | +| `CHUNK_P_SIZE` | `2000` (`DEFAULT_CHUNK_P_SIZE`) | int | P strategy-specific `chunk_token_size`. Unlike R/V, P does NOT inherit the top-level `CHUNK_SIZE` / `LightRAG(chunk_token_size=…)` when unset: paragraph-semantic merging needs more headroom than the global default to keep related paragraphs together, so the slot always carries `DEFAULT_CHUNK_P_SIZE` (2000) instead. | +| `CHUNK_P_OVERLAP_SIZE` | unset | int | P strategy-specific overlap; higher than the legacy constructor field and `CHUNK_OVERLAP_SIZE`. Used for text overlap when long body text within the same JSONL content line falls back to R, and as the per-side budget for bridging text copied into the adjacent large-table chunks. | + +P's internal ratio constants are algorithmic scales and are automatically derived in proportion to `chunk_token_size`. 
P always uses an independent `chunk_token_size` decoupled from the global chain — even when `CHUNK_P_SIZE` is unset, P falls back to `DEFAULT_CHUNK_P_SIZE` (2000) rather than the global `CHUNK_SIZE`, because paragraph-semantic merging needs more headroom than the global default to keep related paragraphs together. Use `CHUNK_P_SIZE` to override that default per deployment. `CHUNK_P_OVERLAP_SIZE` only affects P's internal plain-text fallback and table bridging context; it does not let table row-level slices overlap each other. `CHUNK_R_SIZE` / `CHUNK_V_SIZE` work differently — when unset they DO fall back to the top-level `chunk_token_size` (R prefers a smaller target to better split sentences, while V — as an advisory ceiling — typically wants to be enlarged to reduce over-splitting). + +### 3.3 Priority Chain + +The final value of each chunking slot is resolved by a specificity-ordered chain (high → low): + +1. **`addon_params["chunker"]` explicit value** — field values explicitly written at construction time or set at runtime via the SDK path (see §8.3). Server-only deployments usually don't hit this tier. Most direct; wins everything. +2. **Strategy-specific env** — e.g., `CHUNK_F_OVERLAP_SIZE` / `CHUNK_R_OVERLAP_SIZE` / `CHUNK_P_OVERLAP_SIZE` / `CHUNK_R_SIZE` / `CHUNK_V_SIZE` / `CHUNK_P_SIZE` (there is no strategy-specific `CHUNK_F_SIZE` yet; F reuses the top-level `chunk_token_size`). Filled only when the slot is not already occupied by ①. +3. **Legacy constructor fields** — `LightRAG(chunk_token_size=…, chunk_overlap_token_size=…)`; only effective on the SDK path, see §8.2. Strategy-agnostic, "coarse-grained default", fills only the slots still empty. +4. **Legacy env** — `CHUNK_SIZE` / `CHUNK_OVERLAP_SIZE`. Final fallback. 
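The four tiers amount to "take the first value that is set, highest priority first". A minimal sketch, assuming each tier has already been read into an optional integer; `first_set`, `resolve_overlap`, and `resolve_p_size` are hypothetical helper names and are not part of LightRAG:

```python
from typing import Optional

DEFAULT_CHUNK_P_SIZE = 2000   # default quoted in the text above
DEFAULT_CHUNK_OVERLAP = 100   # documented default of CHUNK_OVERLAP_SIZE


def first_set(*candidates: Optional[int]) -> Optional[int]:
    """Return the first candidate that is not None (highest priority first)."""
    for value in candidates:
        if value is not None:
            return value
    return None


def resolve_overlap(
    explicit: Optional[int],       # tier 1: addon_params["chunker"] explicit value
    strategy_env: Optional[int],   # tier 2: e.g. CHUNK_R_OVERLAP_SIZE
    constructor: Optional[int],    # tier 3: LightRAG(chunk_overlap_token_size=...)
    legacy_env: int = DEFAULT_CHUNK_OVERLAP,  # tier 4: CHUNK_OVERLAP_SIZE
) -> int:
    return first_set(explicit, strategy_env, constructor, legacy_env)


def resolve_p_size(explicit: Optional[int], chunk_p_size_env: Optional[int]) -> int:
    # P's chunk_token_size skips the legacy constructor field and CHUNK_SIZE.
    return first_set(explicit, chunk_p_size_env, DEFAULT_CHUNK_P_SIZE)


if __name__ == "__main__":
    # R has a strategy-specific env value; F has none and uses the constructor field.
    print("R overlap:", resolve_overlap(None, 42, 2))
    print("F overlap:", resolve_overlap(None, None, 2))
    print("P size:", resolve_p_size(None, None))
```

Checking `is not None` instead of truthiness matters here: an explicit overlap of `0` is a legitimate setting and must not fall through to a lower tier.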
+ +Example: `CHUNK_R_OVERLAP_SIZE=42` + `LightRAG(chunk_overlap_token_size=2)` → R sub-dictionary `chunk_overlap_token_size=42` (strategy env wins), F / P sub-dictionary `chunk_overlap_token_size=2` (no F / P-specific env; the legacy constructor field is filled in). + +**Special case for P's `chunk_token_size`**: the P `chunk_token_size` slot does NOT walk the full four-tier chain. When ① is not explicitly provided, it resolves directly via `CHUNK_P_SIZE` env > `DEFAULT_CHUNK_P_SIZE` (2000), **skipping** ③ legacy constructor field `LightRAG(chunk_token_size=…)` and ④ legacy env `CHUNK_SIZE`. See the `CHUNK_P_SIZE` row in §3.2 for the rationale. + +Three layers of semantic guarantee: + +1. **Reproducibility**: change env, restart — old documents still chunk by the snapshot from the moment they were enqueued; results unchanged. +2. **Resume consistency**: resume branch B (content already extracted, redo chunking by current `process_options`) also reads `full_docs.chunk_options`, preventing env drift from breaking consistency. +3. **Per-file personalization**: callers can pass different `chunk_options` for each file (typical usage: a management UI configures separators or V threshold individually for a certain file). These are the input semantics on the SDK path; see §8.4. + +### 3.4 Field Structure + +`addon_params["chunker"]` (instance field) keeps the sub-dictionaries of all four strategies as the runtime baseline; `full_docs[doc_id]["chunk_options"]` is a **slim snapshot** — at enqueue time, only the strategy sub-dictionary selected by `process_options` is kept (default F), and the parameters of other strategies are discarded, because the processing stage will not read them. When re-parsing, `process_options` and `chunk_options` are rewritten together, avoiding residue of old-strategy parameters. 
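The projection from the full baseline to a slim snapshot can be sketched as follows. This is an illustration under assumptions, not the body of `resolve_chunk_options`: `SELECTOR_TO_KEY`, `chunking_selector`, and `slim_snapshot` are hypothetical names, while the sub-dictionary keys follow the field structure documented in this section.

```python
import copy

# Selector character -> strategy sub-dictionary key.
SELECTOR_TO_KEY = {
    "F": "fixed_token",
    "R": "recursive_character",
    "V": "semantic_vector",
    "P": "paragraph_semantic",
}


def chunking_selector(process_options: str) -> str:
    """Pick the chunking selector from a process_options string; F is the default."""
    for char in process_options:
        if char in SELECTOR_TO_KEY:
            return char
    return "F"


def slim_snapshot(baseline: dict, process_options: str) -> dict:
    """Keep the common token cap plus only the selected strategy sub-dictionary."""
    key = SELECTOR_TO_KEY[chunking_selector(process_options)]
    snapshot = {"chunk_token_size": baseline["chunk_token_size"]}
    if key in baseline:
        # Deep copy so later edits to the baseline cannot alter the frozen snapshot.
        snapshot[key] = copy.deepcopy(baseline[key])
    return snapshot


if __name__ == "__main__":
    baseline = {
        "chunk_token_size": 1200,
        "fixed_token": {"chunk_overlap_token_size": 100},
        "recursive_character": {"chunk_overlap_token_size": 100},
        "semantic_vector": {"buffer_size": 1},
        "paragraph_semantic": {"chunk_token_size": 2000},
    }
    print(slim_snapshot(baseline, "R!"))
    print(slim_snapshot(baseline, "iet"))
```

The deep copy is the part that makes the snapshot "frozen": a shallow reference would let a runtime change to the baseline leak into documents that were already enqueued.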
+ +**`addon_params["chunker"]` full baseline** (modifiable at runtime via SDK, affecting subsequent enqueues): + +```jsonc +{ + "chunk_token_size": 1200, // common token cap + "fixed_token": { // F-specific + "chunk_overlap_token_size": 100, + "split_by_character": null, + "split_by_character_only": false + }, + "recursive_character": { // R-specific + "chunk_token_size": 1200, // optional; when omitted, inherits the top-level chunk_token_size + "chunk_overlap_token_size": 100, + "separators": ["\n\n", "\n", "。", "！", "？", "；", "，", " ", ""] // default cascade includes Chinese punctuation + }, + "semantic_vector": { // V-specific + "chunk_token_size": 1200, // optional hard cap; re-split through R when exceeded + "breakpoint_threshold_type": "percentile", // percentile | standard_deviation | interquartile | gradient + "breakpoint_threshold_amount": null, // null = LangChain default + "buffer_size": 1, + "sentence_split_regex": "(?<=[.?!])\\s+|(?<=[。？！])" // default regex handles both English and Chinese sentence-ending punctuation + }, + "paragraph_semantic": { // P-specific + "chunk_token_size": 2000, // when omitted, resolves from CHUNK_P_SIZE or DEFAULT_CHUNK_P_SIZE (2000); + // does NOT inherit the common chunk_token_size + "chunk_overlap_token_size": 100 // when omitted, inherits the legacy overlap resolution chain + } +} +``` + +**`full_docs[doc_id]["chunk_options"]` slim snapshot** (projected by selector; example below is for `process_options="R"`): + +```jsonc +{ + "chunk_token_size": 1200, // common token cap (kept as a top-level fallback) + "recursive_character": { // the only retained strategy sub-dictionary + "chunk_overlap_token_size": 100, + "separators": ["\n\n", "\n", "。", "！", "？", "；", "，", " ", ""] + } +} +``` + +selector → sub-dictionary mapping: F → `fixed_token`, R → `recursive_character`, V → `semantic_vector`, P → `paragraph_semantic`; without a selector, F is the default. 
Each sub-dictionary corresponds one-to-one with the keyword-only parameters of the corresponding chunker function; when adding new parameters, no dispatcher change is needed, just add a kwarg to the chunker function. + +### 3.5 Backward Compatibility for Missing Fields + +Documents enqueued before the upgrade do not have the `chunk_options` field; during chunking, the dispatcher falls back to building a slim snapshot on the spot by calling `resolve_chunk_options(self.addon_params, process_options=…)` with the current `process_options`. After upgrading, it is recommended to reprocess old documents once so that each gets a slim `chunk_options` snapshot (aligned with the current `process_options`). + +## 4. Storage and Directory Layout + +### 4.1 `full_docs` Fields + +File enqueue and extraction results are written into `full_docs`: + +| Field | Description | +| --- | --- | +| `file_path` | Basename of the file (without directory); **preserves the original name provided by the user (including the square-bracket hint)**, e.g., `abc.[native-iet].docx` is written as-is. When no valid source is provided, it is saved as `unknown_source`. The filename hint is not stripped, so the management UI can directly show the user's original naming intent. | +| `canonical_basename` | The canonicalized basename with the processing hint stripped (e.g., `abc.docx`). Filename deduplication uses this field as the index key, ensuring `abc.docx` and `abc.[native-iet].docx` are treated as the same logical document. | +| `source_path` | The original path provided at enqueue time (written only when it contains a directory separator or is an absolute path), used by the `native` / `mineru` / `docling` parsers to locate the actual file. | +| `parse_format` | Content format: `pending_parse`, `raw`, `lightrag`. 
| +| `content` | When `raw`, holds the extracted text; when `pending_parse`, it is an empty string; when `lightrag`, holds the **complete merged text** starting with `{{LRdoc}}` (concatenated body segments of all `type=="content"` lines in `.blocks.jsonl`). During chunking, `parse_native` strips the prefix and hands it to the chunking_func, going through exactly the same code path as `raw`. | +| `content_hash` | MD5 of the content, used for cross-filename deduplication. For `parse_format=raw`, takes the hash of text after `sanitize_text_for_encoding`; for `parse_format=lightrag`, takes the hash of the `*.blocks.jsonl` file; for `parse_format=pending_parse`, not written, filled in after extraction completes. | +| `lightrag_document_path` | When `parse_format=lightrag`, saves the path to the structured LightRAG Document; new records prefer to save the path relative to `INPUT_DIR`, e.g., `__parsed__/report.docx.parsed/report.blocks.jsonl`. Note that the subdirectories and the blocks filename in the path both use the canonicalized basename (without hint). | +| `parse_engine` | The engine that actually completed extraction: `legacy`, `native`, `mineru`, `docling`. For files awaiting extraction, can also temporarily store the target engine. | +| `process_options` | The original processing options string recorded at enqueue time (without engine name and the separator `-`), e.g., `"iet"`, `"R!"`, `""`. Downstream stages take this field as the authoritative source for deciding whether to enable image / table / equation analysis (`i/t/e`), whether to disable knowledge graph construction (`!`), and the chunking method (`F/R/V/P`). An empty string is equivalent to all defaults. | +| `chunk_options` | The **frozen** snapshot of chunker parameters at enqueue time (slim dictionary: only the strategy sub-dictionary selected by `process_options` is retained, others discarded). 
Passed in by the SDK-path caller or assembled by `resolve_chunk_options(self.addon_params, process_options=…)` from instance fields (containing env defaults) as a fallback (see §3.1). `process_options` selects the chunking strategy (F/R/V/P); `chunk_options` decides which parameters that chunker uses. The downstream `process_single_document` reads strategy-specific kwargs from this field before chunking; persistence guarantees that old documents behave reproducibly across env changes, resumes, and restarts. Rewritten together with `process_options` when re-parsing. | + +`pending_parse` indicates the file has been enqueued but extraction is not yet complete. After successful extraction, it is rewritten to `raw` or `lightrag`, and `content_hash` is filled in. On extraction failure, `pending_parse` and the empty `content` are kept, which makes later troubleshooting and retry easier. + +> The original `file_path` (with hint), `canonical_basename`, and `content_hash` are also synchronized into `doc_status`, serving as the deduplication index sources for `get_doc_by_file_basename` / `get_doc_by_content_hash`. `get_doc_by_file_basename` internally canonicalizes the input through `canonicalize_parser_hinted_basename` before comparing against `canonical_basename`, so `abc.docx` and `abc.[native-iet].docx` always hit the same document. +> `process_options` is also mirrored into `doc_status.metadata["process_options"]`, so the management UI can display the current file's processing policy directly. + +### 4.2 `__parsed__` Directory Structure + +`__parsed__` is the archival and analysis-result directory next to the input directory. It stores both the already-processed original documents and the `LightRAG Document` (lightrag format) files and image assets produced by structured parsing.
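Both the dedup index and the `.parsed` directory names depend on stripping the processing hint from a filename. The following is a minimal sketch of that canonicalization, not the real `canonicalize_parser_hinted_basename`: the regex, the `SUPPORTED_HINT_ENGINES` set, and the helper names `canonical_basename` / `blocks_file` are assumptions made for illustration only.

```python
import re
from pathlib import PurePosixPath

# Assumption: which engine names count as a "supported" hint.
SUPPORTED_HINT_ENGINES = {"native", "mineru", "docling"}

# Hypothetical hint grammar: <stem>.[<engine>-<options>].<ext>,
# where either the engine or the options part may be absent.
_HINT_RE = re.compile(
    r"^(?P<stem>.+)\.\[(?P<engine>[a-z]*)(?:-(?P<options>[A-Za-z!]*))?\](?P<ext>\.[^.]+)$"
)


def canonical_basename(path: str) -> str:
    """Return the basename with a supported processing hint removed."""
    name = PurePosixPath(path).name  # dedup granularity is the basename
    match = _HINT_RE.match(name)
    if match is None:
        return name
    engine, options = match.group("engine"), match.group("options")
    if engine and engine not in SUPPORTED_HINT_ENGINES:
        return name  # unsupported hint such as abc.[draft].docx is kept
    if not engine and not options:
        return name  # empty brackets carry no hint
    return match.group("stem") + match.group("ext")


def blocks_file(path: str) -> str:
    """Build the blocks-file path from the canonical name, relative to the input dir."""
    canonical = canonical_basename(path)
    stem = PurePosixPath(canonical).stem
    return f"__parsed__/{canonical}.parsed/{stem}.blocks.jsonl"
```

Under these assumptions, `report.docx`, `report.[native].docx`, and `report.[native-iet].docx` all map to the same canonical name and therefore to the same analysis-result directory.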
+ +- Original file archival: after `legacy` local extraction succeeds and enqueueing finishes, the original file is moved into the sibling `__parsed__` directory; `native` / `mineru` / `docling` keep the original file first for the pipeline to parse, and only move it to `__parsed__` after successful parsing and writing to `full_docs`. **When archived, the original filename (including `[hint]`) is preserved**, e.g., `report.[native-iet].docx` is archived as `__parsed__/report.[native-iet].docx`, making it easy to trace the user's original name and processing options. +- Analysis result directory: structured parsing results are written into a subdirectory named with the **canonicalized filename** (with `[hint]` removed) plus the `.parsed` suffix, avoiding name conflicts with the archived original file and ensuring that the same logical document continues to point to the same directory when the filename hint or processing options change. For example, the analysis results of `report.docx`, `report.[native].docx`, and `report.[native-iet].docx` are all written into `__parsed__/report.docx.parsed/`. +- Analysis result files: the LightRAG Document blocks file and sidecars are named with the canonicalized filename stem, e.g., `__parsed__/report.docx.parsed/report.blocks.jsonl`; the same directory may also contain `report.tables.json`, `report.drawings.json`, `report.equations.json`, and the `report.blocks.assets/` image asset directory. **Whether a sidecar is generated is determined by the document content**: the parser only writes the corresponding file when the document actually contains tables / images / equations. This is the only signal of modality availability — the engine does not need to declare capabilities in meta. The `i`/`t`/`e` options only determine whether the next stage invokes the VLM for summarization analysis on already-existing sidecars. +- When parsing fails, the original file is not moved, making it easy to fix the configuration and re-process. 
+- When `/documents/scan` encounters a file with the same name that is already `PROCESSED`, the input file is treated as already processed and moved to `__parsed__`, not enqueued as a new document. +- When `/documents/scan` finds multiple files that share the same canonicalized name in the same scan, it prefers the file with a supported engine hint to respect the user's engine selection; if no variant has a hint, it processes the first file in sorted order. Other variants emit warnings and are moved to `__parsed__`, avoiding files in the same batch overwriting each other. For example, if both `abc.docx` and `abc.[native].docx` exist, only `abc.[native].docx` is processed. +- When duplicate content hashes are found during scanning or parsing, the input file is likewise moved to `__parsed__`; this `doc_status` entry is kept as `FAILED duplicate` for tracking. +- File moves only act on the current input file and do not overwrite or move existing document source files. If a file with the same name already exists at the destination, the system automatically appends `_001`, `_002`, etc., e.g., `report.pdf` is archived as `report_001.pdf`, `report_002.pdf`. If the analysis result directory name is already taken by a regular file, a number is also appended, e.g., `report.docx.parsed_001/`. + +### 4.3 MinerU Raw Artifacts Directory `.mineru_raw/` + +The `mineru` engine writes the complete artifacts returned by the MinerU service (`content_list.json` + optional `full.md` / `middle.json` / `layout.pdf` / `images/`, etc.) into the `__parsed__/.mineru_raw/` directory during parsing, and writes `_manifest.json` as the integrity validation file. + +Design goals: + +- **Avoid duplicate uploads**. When parsing the same file again, the source file's content hash + size is first validated against `_manifest.json`; on hit, the MinerU service call is skipped and the local `content_list.json` is fed directly through adapter → SidecarWriter. +- **Preserve diagnostic information**. 
When MinerU parses incorrectly or downstream sidecar fields are abnormal, you can go straight to `*.mineru_raw/` to compare the original content_list and image assets. +- **Support object traceability**. The `drawings.json` / `tables.json` / `equations.json` generated by MinerU save `content_list.json#/N` in `self_ref`, used for looking up the corresponding MinerU original object and its `page_idx` / `bbox`, etc. +- **De-hint uploaded filenames**. When the source filename contains processing hints like `[mineru-...]` / `[-iet]`, the MinerU API is called with the canonicalized filename (hint removed), to avoid hint-bearing filenames inside the raw bundle returned by MinerU. + +Lifecycle: + +| Operation | Behavior | +|---|---| +| First parse | Download all artifacts → atomically write `_manifest.json`. | +| Re-parse (cache hit) | Do not call the MinerU service; do not rewrite artifacts; rerun adapter+Writer to regenerate sidecar (for adapter upgrade scenarios). | +| Re-parse (cache miss) | Clear all files in the directory, then re-download and write manifest. | +| `DELETE /documents` with `delete_file=True` | `*.parsed/`, `*.mineru_raw/`, and the original file are all deleted together. | +| `DELETE /documents` with `delete_file=False` | All artifacts are preserved; only doc_status and KG data are deleted. | +| `clear_documents` / a full sweep of `__parsed__` | Naturally cleared together. | +| scan cycle | Does not actively GC orphan `*.mineru_raw/` (only cleared on explicit deletion by the user, to avoid accidentally removing the debug site). | + +Force re-parse (bypass cache): set `LIGHTRAG_FORCE_REPARSE_MINERU=true`. + +Concurrency safety: LightRAG mandates `canonical_basename` uniqueness within the same workspace (HTTP 409 on upload/enqueue), and combined with the pipeline's serialization per document, `*.mineru_raw/` has no concurrent write conflicts and needs no extra locks. 
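As a rough illustration of the cache-hit decision in the lifecycle table, here is a simplified check that also applies the "both non-empty but inconsistent" rule used by the invalidation conditions below. The manifest layout (a nested `source` object with `size` and `sha256`), the `env` dictionary, and all function names are hypothetical. The checks on `content_list.json` and the other recorded files are omitted for brevity.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def fingerprint(path: Path) -> dict:
    """Size and sha256 of a file, in the (assumed) shape stored in the manifest."""
    data = path.read_bytes()
    return {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def conflicts(current: Optional[str], recorded: Optional[str]) -> bool:
    """True only when both sides are non-empty and differ; an empty side skips the check."""
    return bool(current) and bool(recorded) and current != recorded


def is_cache_hit(source: Path, raw_dir: Path, env: dict, force: bool = False) -> bool:
    manifest_path = raw_dir / "_manifest.json"
    if force or not manifest_path.is_file():
        return False  # forced re-parse, or nothing cached yet
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    if fingerprint(source) != manifest.get("source"):
        return False  # source file size or sha256 changed
    for key in ("engine_version", "api_mode", "endpoint_signature"):
        if conflicts(env.get(key), manifest.get(key)):
            return False
    return True
```

Note how a manifest written with an empty `engine_version` still counts as a hit after the version is configured later, which is why that scenario needs the force flag.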
+ +`_manifest.json` invalidation conditions (any triggers a cache miss): + +- Source file size or sha256 does not match manifest; +- `MINERU_ENGINE_VERSION` environment variable and the `engine_version` recorded in manifest are both non-empty but inconsistent; +- Current `MINERU_API_MODE` and the `api_mode` recorded in manifest are both non-empty but inconsistent; +- Endpoint for the current mode (`MINERU_OFFICIAL_ENDPOINT` / `MINERU_LOCAL_ENDPOINT`) and the `endpoint_signature` recorded in manifest are both non-empty but inconsistent; +- `content_list.json` size or sha256 does not match manifest; +- Size of any recorded non-critical file (images, `middle.json`, etc.) does not match manifest. + +> About the "either side empty → skip" semantics of `engine_version` / `endpoint_signature`: when the field was empty at manifest-write time (e.g., `MINERU_ENGINE_VERSION` was not configured at first parse), or when the current environment variable is not set, the check is skipped for that item. If the version env was not set at first parse, setting it later does not automatically invalidate the historical cache — this scenario requires manually setting `LIGHTRAG_FORCE_REPARSE_MINERU=true` to trigger re-parsing. + +### 4.4 Docling Raw Artifacts Directory `.docling_raw/` + +The `docling` engine extracts the zip artifact returned by docling-serve (DoclingDocument JSON, Markdown, and referenced images) into the `__parsed__/.docling_raw/` directory during parsing, and writes `_manifest.json` as the integrity validation file. On a subsequent parse, the IR builder reads the `.json` file in that directory and feeds it to `DoclingIRBuilder`, no longer calling docling-serve. 
+ +Directory layout: + +```text +__parsed__/.docling_raw/ +├── _manifest.json +├── .json # DoclingDocument JSON (contains pages[].image base64) +├── .md # Markdown form, for human inspection +└── artifacts/ + └── image_*.png # image assets referenced by pictures[*].image.uri +``` + +Design goals: + +- **Avoid duplicate uploads/conversions**. When parsing the same file again, the source file's hash + size is first validated against `_manifest.json`; on hit, the upload / poll / download against docling-serve is skipped, and the local `.json` is fed directly through DoclingIRBuilder → SidecarWriter. +- **Preserve diagnostic information**. When docling-serve parses incorrectly or downstream sidecar fields are abnormal, you can go straight to `*.docling_raw/` to compare the original DoclingDocument JSON, Markdown, and `artifacts/` images. + +Lifecycle: + +| Operation | Behavior | +|---|---| +| First parse | `POST /v1/convert/file/async` upload → long-poll `/v1/status/poll/{task_id}?wait=N` → `GET /v1/result/{task_id}` download zip → safe extraction (rejecting absolute paths and `..`) → atomically write `_manifest.json`. | +| Re-parse (cache hit) | Do not call docling-serve; do not rewrite artifacts; rerun adapter+Writer to regenerate sidecar (for adapter upgrade scenarios). | +| Re-parse (cache miss) | Clear all files in the directory, then re-upload / download / write manifest. | +| `DELETE /documents` with `delete_file=True` | `*.parsed/`, `*.docling_raw/`, and the original file are all deleted together. | +| `DELETE /documents` with `delete_file=False` | All artifacts are preserved; only doc_status and KG data are deleted. | +| `clear_documents` / a full sweep of `__parsed__` | Naturally cleared together. | +| scan cycle | Does not actively GC orphan `*.docling_raw/` (only cleared on explicit deletion by the user, to avoid accidentally removing the debug site). | + +Force re-parse (bypass cache): set `LIGHTRAG_FORCE_REPARSE_DOCLING=true`. 
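The "safe extraction (rejecting absolute paths and `..`)" step in the lifecycle table can be sketched with the standard-library `zipfile` module. This is an illustrative version under the assumption that the whole archive is rejected when any member is unsafe; `safe_extract` is a hypothetical name, not the function used by the docling path.

```python
import zipfile
from pathlib import Path, PurePosixPath


def safe_extract(zip_path, dest_dir) -> list:
    """Extract an archive only if no member is absolute or contains '..'."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        # Validate every member before writing anything to disk.
        for name in names:
            member = PurePosixPath(name.replace("\\", "/"))
            if member.is_absolute() or ".." in member.parts:
                raise ValueError(f"unsafe path in archive: {name!r}")
        dest.mkdir(parents=True, exist_ok=True)
        archive.extractall(dest)
    return names
```

Validating first and extracting second means a rejected archive leaves no partial output behind, which keeps the later manifest write meaningful.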
+ +Concurrency safety: identical to the MinerU path — LightRAG mandates `canonical_basename` uniqueness within the same workspace (HTTP 409 on upload / enqueue), and combined with the pipeline's serialization per document, `*.docling_raw/` has no concurrent write conflicts and needs no extra locks. + +`_manifest.json` invalidation conditions (any triggers a cache miss): + +- Source file size or sha256 does not match manifest; +- `DOCLING_ENDPOINT` does not match the `endpoint_signature` recorded in manifest; +- `DOCLING_ENGINE_VERSION` is set and does not match the `engine_version` recorded in manifest; +- `options_signature` does not match — any OCR / equation / pipeline field change triggers it, covering: + - Tunable env: `DOCLING_DO_OCR` / `DOCLING_FORCE_OCR` / `DOCLING_OCR_ENGINE` / `DOCLING_OCR_PRESET` / `DOCLING_OCR_LANG` / `DOCLING_DO_FORMULA_ENRICHMENT`; + - Hard-coded constants: `pipeline` / `target_type` / `to_formats` / `image_export_mode` (written into the signature to prevent old bundles from being mistakenly reused if these values change in the future); +- Main JSON missing, size, or sha256 does not match; +- Any image in `artifacts/` missing or size mismatch; +- `LIGHTRAG_FORCE_REPARSE_DOCLING=true`. + +> The "either side empty → skip" semantics of `engine_version` / `endpoint_signature` is the same as MinerU §4.3: when the field was empty at manifest-write time (first parse without `DOCLING_ENGINE_VERSION` configured) or when the current environment variable is not set, the check is skipped for that item; adding the version number later does not automatically invalidate the historical cache; `LIGHTRAG_FORCE_REPARSE_DOCLING=true` is needed to trigger. + +## 5. Document Duplicate Detection Rules + +File upload, file-parse enqueue, and the text APIs check duplicates against two gates: "filename + content hash". Hitting either is considered a duplicate, and a `FAILED` record is written without overwriting the existing `full_docs`. 
`/documents/scan` directory scanning uses the same set of indexes, but in order to facilitate automatic retry of unfinished files, it has separate archive and re-process rules for duplicate filenames. + +### 5.1 Filename (basename) Deduplication + +- The granularity of the check is basename, excluding directory path and workspace path. For example, `/data/a.pdf`, `inputs/a.pdf`, and `a.pdf` are all considered the same filename `a.pdf`. +- Filename deduplication uses `canonical_basename` as the index: the supported-engine processing hint at the end of the filename is stripped before comparison, so `abc.docx`, `abc.[native].docx`, and `abc.[native-iet].docx` are considered the same name. Unsupported hints are not stripped; e.g., `abc.[draft].docx` is still treated by its original filename. +- For ordinary upload, text APIs, and core enqueue APIs, as long as a file with the same name already exists in `doc_status` — whether that record is currently `PENDING`, `PARSING`, `ANALYZING`, `PROCESSING`, `FAILED`, or `PROCESSED` — the same-name file is considered a duplicate. +- For `/documents/scan` directory scan: + - If multiple files in the same scan share the same canonicalized name, the file with a supported engine hint is processed first; if no variant has a hint, the first file after sorting is processed, and the rest are archived to `__parsed__` and skipped. + - If the same-name record is already `PROCESSED`, the file just scanned is treated as already processed; the system emits a warning, moves the input file to the sibling `__parsed__` directory, and skips enqueueing. + - If the same-name record is not `PROCESSED`, the scanned file is **not** skipped simply because of the same name, but **also** does not re-extract / overwrite the existing record. 
The specific path depends on the form of the existing record (consistent with the classification rules listed below in the "Why is scan still the exclusive writer" section): + - Same name non-PROCESSED with `full_docs` present → **resume path**: doc_status is preserved as-is, the source file remains in `INPUT/`, and the processing loop picks it up by status query (no re-extract, no overwrite of existing status). + - Same name `FAILED` with `full_docs` missing → recognized as an extraction-error stub written by `apipeline_enqueue_error_documents`: scan deletes the stub and **enqueues the current file as a new file**. This is the only sub-branch that re-extracts; the purpose is to make "fix the source file, scan again" automatically take effect. +- For ordinary upload and core enqueue APIs, a file with the same name — even if its content has changed — must have its old document record deleted before re-upload or re-enqueue; the two automatic recoveries above only apply to the directory-scan path. +- The text APIs must provide a valid `file_source`, and duplicates are checked by the basename of `file_source`; lacking a valid `file_source` returns 400 directly. +- When the SDK path calls `insert` / `ainsert` / `apipeline_enqueue_documents` without `file_paths`, that is allowed; related behavior is detailed in §8.4. Such documents without a source have `file_path` saved as `unknown_source`. +- Empty strings, `no-file-path`, and `unknown_source` are all considered unknown sources; they do not block new source-less text from being enqueued, nor do they deduplicate each other as same-named files. + +The storage backend provides basename direct lookup via `get_doc_by_file_basename`, internally comparing against the `canonical_basename` field (the input parameter is first canonicalized through `canonicalize_parser_hinted_basename`). 
`JsonDocStatusStorage` already implements an in-memory traversal; other backends currently fall back to the default implementation (scanning all states and comparing `canonical_basename`), to be augmented with native indexes in subsequent PRs. + +### 5.2 Content Hash Deduplication + +- Documents with different filenames but identical extracted content are also considered duplicates. The hash here is the content hash of the final text or LightRAG Document obtained by the configured extraction engine; it is not the hash of the original file bytes. +- `full_docs` and `doc_status` write or fill in the `content_hash` field according to the content format: + - `parse_format=raw`: the MD5 of the text after `sanitize_text_for_encoding`. + - `parse_format=lightrag`: the MD5 of the `*.blocks.jsonl` file parsed out of `lightrag_document_path`. Relative paths are resolved against `INPUT_DIR`. + - `parse_format=pending_parse`: no hash is written yet; it is filled in by subsequent steps after parsing actually completes (to avoid mistakenly judging by empty content). +- The `legacy` path deduplicates content hashes after locally extracting text and during enqueue; on hit, this record is written as `FAILED duplicate`, and no new `full_docs`, chunks, or graph data are generated. +- The `native` / `mineru` / `docling` paths first enqueue with `pending_parse`; after parsing completes and `content_hash` is filled in, if another document already has the same hash, this record is stopped before entering analysis, chunking, entity extraction, and graph writing. +- Duplicate records are marked as `filename` or `content_hash` in `metadata.duplicate_kind` for diagnosis. Content-hash duplicates also record `metadata.is_duplicate=true`, `metadata.original_doc_id`, and `metadata.original_track_id`; duplicates discovered only after parsing also have the temporarily-written `full_docs` deleted. 
+- Related warnings minimize repetitive noise: when scanning discovers a same-name file already `PROCESSED`, a log and pipeline status are written; duplicates at the enqueue stage use the LightRAG layer's `Duplicate document detected (...)` log; content duplicates only discovered after parsing use `Duplicate content skipped after parsing` and write a pipeline status. Scan archiving does not emit the extra `[File Extraction]Duplicate skipped`. +- The storage backend provides hash direct lookup via `get_doc_by_content_hash`; the naming convention is the same as `get_doc_by_file_basename`. + +> Within an enqueue batch (the same `apipeline_enqueue_documents` call), basename and content_hash dedup are also performed; on hit, subsequent entries are written as `FAILED` directly and marked with `existing_status=batch_duplicate`. Basename dedup only applies to valid filenames; `unknown_source`, `no-file-path`, and empty sources only participate in content-hash dedup. +> +> **Cross-call concurrent dedup** is also guaranteed by the workspace-level serialization lock (see [§6.7 enqueue serialization lock (preventing concurrent dedup leakage)](#67-enqueue-serialization-lock-preventing-concurrent-dedup-leakage)): two concurrent enqueues of identical content with different filenames will not both leak past the `content_hash` check. + +## 6. Pipeline Concurrency and Reentry Constraints + +To prevent `scan` / `upload` / `insert` from overwriting `doc_status` / `full_docs` records of an in-flight pipeline, all write entry points coordinate via the `pipeline_status` shared dictionary. The `pipeline_status_lock` per workspace ensures that all transitions in the table below are completed atomically within the lock. + +### 6.1 `pipeline_status` Fields + +| Field | Semantics | +| --- | --- | +| `busy` | Generic pipeline-busy flag. Both the processing loop and destructive jobs (clear/delete) set it. 
**`busy=True` (processing loop) alone does not block enqueue** — the loop pulls a `doc_status` snapshot per batch and checks `request_pending` between batches for any newly arrived work. | +| `destructive_busy` | A destructive subset of `busy`: `/documents/clear` or `/documents/{doc_id}` (delete) is dropping storages / removing source files. Both reservation and the enqueue last-line guard reject — a concurrent enqueue would write to storage being torn down, and accepted documents would be silently lost. The processing loop does not set this field. | +| `scanning` | The `/documents/scan` background task is running (entire lifecycle: classification stage + processing stage). Only the `/scan` endpoint uses it to reject overlapping scans; it does **not** itself block upload/insert. | +| `scanning_exclusive` | An exclusive subset of `scanning`: True only during scan's **classification phase** — run_scanning_process is reading doc_status to classify (already processed / resume / delete stub / archive) and cannot interleave with concurrent writers. Both reservation and the enqueue last-line guard reject. After classification, the flag is cleared immediately, and concurrent uploads are allowed once scan enters the processing phase. | +| `pending_enqueues` | The number of upload/insert calls that have passed `_reserve_enqueue_slot` but whose bg task has not completed. Used only by the scan endpoint — to decide whether to take the exclusive lock. The bg task releases the slot in `finally`. | +| `request_pending` | A signal nudging the running processing loop to scan another round. Enqueue sets it after writing to `doc_status` when `busy=True`; the processing loop checks it after each batch and re-pulls the snapshot. 
| + +### 6.2 Entry Point Behavior + +| Entry point | Condition | Behavior | +| --- | --- | --- | +| `/documents/upload` / `/documents/text` / `/documents/texts` | `scanning_exclusive=True` or `destructive_busy=True` | Throw HTTP 409; do not write file, do not call enqueue | +| Same as above | Otherwise (including pure `busy=True`, scan-processing-phase `scanning=True` but `scanning_exclusive=False`) | Within the lock: `pending_enqueues++` reserves a slot → strict name precheck → save file → schedule bg task; the bg task releases the slot in `finally` | +| `/documents/scan` | `busy=True` or `scanning=True` or `pending_enqueues>0` | Emit a warning and immediately return `scanning_skipped_pipeline_busy`; do not schedule a background task | +| Same as above | All idle | Within the lock, set `scanning=True` then schedule; the task clears the flag in `finally` upon completion | +| `/documents/clear` / `/documents/delete_document` | `busy=True` or `scanning=True` or `pending_enqueues>0` | The endpoint synchronously returns `status="busy"` and does not schedule a background task | +| Same as above | All idle | The endpoint **synchronously** within the lock sets `busy=True` + `destructive_busy=True` (before `delete_document` returns `deletion_started`), and the bg task's finally clears both flags | +| `apipeline_enqueue_documents` internal (last-line guard) | `scanning_exclusive=True` and `from_scan=False`, or `destructive_busy=True` | Throw `RuntimeError("Cannot enqueue while scan is classifying / clearing or deleting")` | +| Same as above | Anything else (including pure `busy=True`, scan processing phase) | Enqueue normally; after writing `doc_status`, if `busy=True`, automatically nudge `request_pending=True` | + +`from_scan=True` is a bypass for scan's own background-task enqueue: scan already holds the `scanning` flag, so it must be allowed to enqueue the files it has scanned. 
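The admission rules in the table above can be condensed into a small in-memory model. This is only a sketch: `PipelineGate` and its method names are hypothetical, a single `asyncio.Lock` stands in for `pipeline_status_lock`, and the point at which the scan task raises `scanning_exclusive` is left to the caller.

```python
import asyncio


class PipelineGate:
    """Toy model of the pipeline_status flags and who may proceed."""

    def __init__(self):
        self.lock = asyncio.Lock()  # stands in for pipeline_status_lock
        self.status = {
            "busy": False, "destructive_busy": False,
            "scanning": False, "scanning_exclusive": False,
            "pending_enqueues": 0, "request_pending": False,
        }

    async def reserve_enqueue_slot(self) -> bool:
        """Upload / text endpoints: False means the endpoint answers HTTP 409."""
        async with self.lock:
            s = self.status
            if s["scanning_exclusive"] or s["destructive_busy"]:
                return False
            s["pending_enqueues"] += 1  # plain busy=True does not block
            return True

    async def release_enqueue_slot(self) -> None:
        """Called from the background task's finally block."""
        async with self.lock:
            self.status["pending_enqueues"] -= 1

    async def try_start_scan(self) -> bool:
        """Scan endpoint: requires a fully idle pipeline."""
        async with self.lock:
            s = self.status
            if s["busy"] or s["scanning"] or s["pending_enqueues"] > 0:
                return False
            s["scanning"] = True
            return True

    async def guard_enqueue(self, from_scan: bool = False) -> None:
        """Last-line guard inside the enqueue call."""
        async with self.lock:
            s = self.status
            if (s["scanning_exclusive"] and not from_scan) or s["destructive_busy"]:
                raise RuntimeError(
                    "Cannot enqueue while scan is classifying / clearing or deleting"
                )

    async def nudge_after_write(self) -> None:
        """After doc_status is written, wake a running processing loop."""
        async with self.lock:
            if self.status["busy"]:
                self.status["request_pending"] = True
```

Keeping every check-and-set inside one lock is what makes each transition atomic; the flags themselves are plain dictionary entries.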
+ +### 6.3 Why `busy` no longer blocks enqueue + +In the old version, `busy=True` always rejected any new enqueue, on the reasoning that "modifying `doc_status` would interleave with the pipeline worker thread." However, in practice: + +1. **Write order guarantees consistency**: `apipeline_enqueue_documents` always upserts `full_docs` first, then upserts `doc_status`. The consistency check at the start of the processing loop only deletes "orphan `doc_status` rows that have no corresponding `full_docs`" — a state that cannot occur with concurrent enqueue. +2. **Batch-level snapshots**: each processing-loop batch pulls a `get_docs_by_statuses` snapshot once; newly written `PENDING` rows don't disturb the current batch, and the next round re-pulls the snapshot via `request_pending` to see the new work. +3. **`request_pending` is designed for this**: the old version already had the `request_pending` field — it was designed for "new work arrives while running" — but was gated by busy. + +With this mechanism enabled in the new contract, **users can continue to upload new documents during long batch processing**, and the bg task, after writing `doc_status`, will be automatically picked up by the running loop. + +### 6.4 Why scan is still the exclusive writer + +scan not only enqueues the new files it finds, but also reads `doc_status` to decide what to do with each file: + +- Same-name `PROCESSED` row → archive source file, skip enqueue. +- Same-name non-PROCESSED with `full_docs` present → resume path; the source file **stays in `INPUT/`**, not archived (the pending-parse parser may still need it); the processing loop picks it up by status query. 
+- Same-name `FAILED` with `full_docs` missing → recognized as an extraction-error stub previously written by `apipeline_enqueue_error_documents` (consistency check preserves such rows for human review); scan automatically deletes that stub and enqueues the current file as a new file, so that "fix the source file, scan again" takes effect directly. + +These "read–decide–write" combinations cannot interleave with other writers; otherwise classification decisions would be based on a stale view. So scan must be exclusive, and the scan endpoint will reject when any of `busy` / `scanning` / `pending_enqueues>0` is present. + +### 6.5 Strict name precheck (upload path) + +After upload passes the reservation but before saving the file, a two-pass check is required: + +1. **INPUT directory scan**: canonicalize the basename to be saved via `canonicalize_parser_hinted_basename`, traverse the INPUT directory for any existing same-canonical variant (with hint / without hint); 409 on hit. +2. **doc_status check**: call `get_existing_doc_by_file_basename` with the canonicalized basename; 409 on hit. + +Both pass → save the file → schedule the bg task → bg task calls `apipeline_enqueue_documents` to write the store + calls `apipeline_process_enqueue_documents` to trigger processing. + +> The old version once allowed upload to silently write a FAILED duplicate entry when a same-name record existed; the new rule is fail-fast, leaving no duplicate traces in doc_status. To replace a same-name document, call the `/documents/{doc_id}` delete API first. + +### 6.6 Coordination of Multiple Concurrent Reservations + +When two uploads arrive simultaneously (scan cannot acquire exclusivity at this time): + +1. A `_reserve_enqueue_slot` → `pending_enqueues=1`, write file, schedule bg task A, return success. +2. B `_reserve_enqueue_slot` → `pending_enqueues=2`, write file, schedule bg task B, return success. +3. 
bg task A `apipeline_enqueue_documents` → writes `doc_status` → calls `apipeline_process_enqueue_documents` → sets `busy=True` to process A's document. +4. bg task B `apipeline_enqueue_documents` → sees `scanning=False`, writes normally; after writing, sees `busy=True`, automatically sets `request_pending=True`. +5. bg task B calls `apipeline_process_enqueue_documents` → sees `busy=True`, sets `request_pending=True` and returns immediately. +6. A's processing loop finishes the current batch, sees `request_pending=True`, re-pulls the snapshot, and picks up B's `PENDING` row. +7. After all is complete: `busy=False`, `pending_enqueues=0`. + +No bg task will be falsely rejected due to busy — because enqueue no longer checks busy; the processing loop will not process the same batch repeatedly — because `request_pending` only takes effect between batches and is cleared before each re-pull. + +### 6.7 enqueue Serialization Lock (Preventing Concurrent Dedup Leakage) + +Inside `apipeline_enqueue_documents`, "read doc_status to dedupe → write `full_docs` / `doc_status`" runs serially under the workspace-level `enqueue_serialize` lock. Reason: now that concurrent enqueue is allowed during the busy/scan-processing phases, two enqueues with identical content but different filenames (typical scenario: a scan-processing-phase enqueue and an upload arriving together) would, without the lock, race as follows — + +1. A reads `doc_status` to check `content_hash`: miss. +2. B reads `doc_status` to check `content_hash`: still miss (A hasn't upserted yet). +3. A upserts `full_docs` + `doc_status`. +4. B upserts `full_docs` + `doc_status`. + +Result: both `PENDING` rows with the same `content_hash` enter the downstream pipeline, and the row that should have been identified as `duplicate_kind=content_hash` was **not** identified. 
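A minimal sketch of closing this race by holding one lock across the dedup read and the upsert. The dictionary `store` stands in for `doc_status`, `asyncio.sleep(0)` simulates the storage round trip where another task could interleave, and the `enqueue` / `demo` names are illustrative only.

```python
import asyncio


async def enqueue(store: dict, lock: asyncio.Lock, doc_id: str, content_hash: str) -> str:
    """Check-then-write under one lock so a second caller sees the first row."""
    async with lock:
        duplicate = any(
            row["content_hash"] == content_hash and row["status"] != "FAILED"
            for row in store.values()
        )
        await asyncio.sleep(0)  # simulated storage I/O between read and write
        if duplicate:
            store[doc_id] = {
                "status": "FAILED",
                "content_hash": content_hash,
                "duplicate_kind": "content_hash",
            }
        else:
            store[doc_id] = {"status": "PENDING", "content_hash": content_hash}
        return store[doc_id]["status"]


async def demo() -> list:
    store: dict = {}
    lock = asyncio.Lock()  # plays the role of the workspace-level lock
    return await asyncio.gather(
        enqueue(store, lock, "doc-a", "same-hash"),
        enqueue(store, lock, "doc-b", "same-hash"),
    )
```

In this toy model, removing the `async with lock` line lets both calls read a miss before either writes, so both rows end up `PENDING`, which is the leak described in steps 1 to 4.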
+ +With the serialization lock, the second enqueue's dedup read is guaranteed to see the row already upserted by the first, taking the normal "no new unique document" early-return path and writing this run as a `duplicate_kind=content_hash` FAILED row. The lock only covers: + +- `filter_keys` (exclude existing by doc_id) +- Filename / content hash dedup reads +- Upsert of duplicate FAILED rows +- `full_docs.upsert` + `doc_status.upsert` + +The lock does **not** cover the `request_pending` nudge (which happens outside the lock and only briefly takes `pipeline_status_lock`), and does **not** block the `get_docs_by_statuses` read of the processing loop (that read goes through `doc_status`'s own concurrent-read path, which is atomic at the KV level with respect to enqueue writes and does not contend for this lock). Lock order: `enqueue_serialize → pipeline_status_lock`; no deadlock path. + +### 6.8 Pipeline Concurrency Parameters + +The locks around `pipeline_status` solve the correctness problem of "who can write"; this section's set of parameters solves the throughput problem of "how many workers run concurrently". The pipeline is divided into 3 stages, each with an independently tunable worker pool: + +``` + ┌─ q_native ──► [native parser × N1] ─┐ +PENDING ─►├─ q_mineru ──► [mineru parser × N2] ─┼─► q_analyze ─►[analyzer × N4] ─► q_process ─►[processor × N5] + └─ q_docling ──► [docling parser × N3] ─┘ +``` + +At enqueue time, `resolve_stored_document_parser_engine` puts each document into the corresponding parse queue based on its `parser_engine` (from `LIGHTRAG_PARSER` defaults or the filename hint); the three parse queues are **completely non-blocking** with respect to each other, so mineru saturation does not slow down docling or native. After parsing, all documents enter `q_analyze` (multimodal analysis), and then `q_process` (entity/relation extraction + ingest).
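The queue topology in the diagram can be approximated with bounded `asyncio.Queue` objects and one worker pool per stage. This is a simplified sketch rather than the pipeline's actual code: `worker`, `run_pipeline`, and the `pool_sizes` keys are hypothetical, and the real parse / analyze / extract work is replaced by appending a label.

```python
import asyncio


async def worker(label, inbox, outbox, done):
    """Pull from inbox, tag the item, and hand it to the next stage."""
    while True:
        item = await inbox.get()
        try:
            await asyncio.sleep(0)  # stand-in for real work
            result = item + [label]
            if outbox is None:
                done.append(result)
            else:
                await outbox.put(result)  # blocks while the next queue is full
        finally:
            inbox.task_done()


async def run_pipeline(docs, pool_sizes, queue_size=100, insert_queue_size=4):
    engines = ("native", "mineru", "docling")
    parse_queues = {e: asyncio.Queue(maxsize=queue_size) for e in engines}
    q_analyze = asyncio.Queue(maxsize=queue_size)
    q_process = asyncio.Queue(maxsize=insert_queue_size)  # small on purpose
    done, tasks = [], []

    def spawn(count, *args):
        tasks.extend(asyncio.create_task(worker(*args)) for _ in range(count))

    for engine, queue in parse_queues.items():  # one isolated pool per engine
        spawn(pool_sizes[engine], f"parse:{engine}", queue, q_analyze, done)
    spawn(pool_sizes["analyze"], "analyze", q_analyze, q_process, done)
    spawn(pool_sizes["insert"], "process", q_process, None, done)

    for doc_id, engine in docs:
        await parse_queues[engine].put([doc_id])
    # Drain stage by stage, then stop the idle workers.
    for queue in (*parse_queues.values(), q_analyze, q_process):
        await queue.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return done
```

Because `put` on a full queue suspends the producer, a slow final stage with a small queue throttles the analyze workers, and through them the parsers, without any extra coordination.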
+ +| Environment variable | Default | Effect | Tuning advice | +| --- | --- | --- | --- | +| `MAX_PARALLEL_PARSE_NATIVE` | `5` | N1: number of concurrent workers for native parsing (docx / pdf / txt and other pure local processing) | Pure CPU, low memory usage; can be raised to CPU core count | +| `MAX_PARALLEL_PARSE_MINERU` | `1` | N2: number of concurrent workers for MinerU parsing | MinerU has significant GPU/CPU usage; **the default of serial is most stable**. With local deployment and ample VRAM, you can set 2–3; when going through MinerU's official cloud service, you can raise it appropriately (subject to cloud quotas). | +| `MAX_PARALLEL_PARSE_DOCLING` | `1` | N3: number of concurrent workers for Docling parsing | Docling is similarly resource-sensitive; **the default of serial is most stable**. With local deployment and ample CPU/GPU, you can set 2–3. | +| `MAX_PARALLEL_ANALYZE` | `5` | N4: number of concurrent workers for multimodal analysis (VLM image / table description) | Directly consumes the VLM quota. Recommended ≤ VLM service concurrency cap. | +| `MAX_PARALLEL_INSERT` | `2` | N5: number of concurrent documents at the entity / relation extraction + ingest stage | Recommended `MAX_ASYNC / 3`, in the range 2–10. This stage triggers multiple LLM calls per document; setting it too high will hit LLM rate limits. This value also serves as the `asyncio.Semaphore` for an additional constraint (worker count and semaphore value are the same). | +| `QUEUE_SIZE_DEFAULT` | `100` | Bounded queue capacity between the parse / analyze stages | Generally no need to tune. For very large batches (thousands or more), can be raised to avoid backpressure at the enqueue side; lower it when memory is tight. | +| `QUEUE_SIZE_INSERT` | `4` | Queue capacity between the analyze → process stage | The process stage is the slowest and most memory-hungry in the pipeline; the queue is deliberately small to provide backpressure to upstream and prevent memory bloat. 
| + +**Several key points:** + +1. **Parsing stage is isolated per engine**, so when mixing native/mineru/docling, you don't have to worry about a slow engine dragging another down. +2. **mineru / docling default to serial (=1)**: in practice both have high resource usage, and concurrency benefits are unstable (prone to OOM / VRAM contention / failure retry). With multi-GPU or a dedicated parser server, you can raise them manually. +3. **`MAX_PARALLEL_INSERT` doubles as worker pool size and semaphore cap**: the pipeline creates a `Semaphore(max_parallel_insert)`, and each process worker also takes the semaphore before extraction and ingest. So even if you manually raise the worker count, the actual concurrency cap is still bounded by this value — just tune it directly. +4. **Queue size and backpressure**: the small default `QUEUE_SIZE_INSERT=4` is intentional — the process stage is slow and memory-hungry; when the queue fills, analyze blocks, and backpressure reaches the parse stage, preventing thousands of parsing results from piling up in memory at once. +5. **How changes take effect**: all parameters are passed in via `.env` (or environment variables), read once at `LightRAG` construction; restart the service after changing them. + +**Typical tuning scenarios:** + +- Large batch of PDFs + local MinerU on a single GPU: `MAX_PARALLEL_PARSE_MINERU=1`, `MAX_PARALLEL_ANALYZE=5`, `MAX_PARALLEL_INSERT=2` (defaults are fine). +- Large batch of PDFs + MinerU cloud service: `MAX_PARALLEL_PARSE_MINERU=3~5` (depending on cloud quota), others at defaults. +- Pure docx / txt (only native): `MAX_PARALLEL_PARSE_NATIVE=10`; `MAX_PARALLEL_INSERT` derived from `MAX_ASYNC/3`. +- Heavy LLM rate-limiting: first lower `MAX_PARALLEL_INSERT` (the process stage makes multiple LLM calls per document), then lower `MAX_PARALLEL_ANALYZE` (VLM is a separate quota). + +## 7. 
Pipeline Resume Rules at Startup + +Each time `apipeline_process_enqueue_documents` starts up, it pulls all documents in `PARSING` / `ANALYZING` / `PROCESSING` / `PENDING` / `FAILED` to continue processing. The resume path **branches by "whether content has been extracted"**, ensuring that any document, regardless of its previous progress, has an idempotent result when resumed under the current `process_options`. + +The resume rule only applies to documents whose `doc_id` already exists in `doc_status`. New files joining the queue require the file dedup logic in "Concurrency and Reentry Constraints", to avoid new files squeezing out the records of files whose content has already been successfully extracted. + +### 7.1 Determining "Content Has Been Extracted" + +Read `full_docs[doc_id]`: + +| `parse_format` | Verdict | +| --- | --- | +| `lightrag` and `lightrag_document_path` file exists | ✅ extracted | +| `raw` and `content` is non-empty | ✅ extracted | +| Other (including `pending_parse`, missing record) | ❌ not extracted | + +### 7.2 Branch A: Not Extracted + +Go through the full pipeline (`parse_native` / `parse_mineru` / `parse_docling` → `analyze_multimodal` → chunking → entity extraction), with each stage's behavior determined by `full_docs.process_options`. This is the normal flow of a "first-time enqueue". + +### 7.3 Branch B: Already Extracted + +**Always skip parsing** (do not call `parse_*` again), restart from the ANALYZING stage, clear old chunks / entities, and redo per the current `process_options`: + +| Sub-step | Behavior | +| --- | --- | +| Engine comparison | If the engine implied by `process_options` ≠ `full_docs.parse_engine`, **only warn**, do not re-parse. The extracted content is an immutable fact; re-running a different engine would produce inconsistency. To switch engines, delete the whole document and re-upload it. 
| +| Old chunks / entities / relations cleanup | Read `status_doc.chunks_list` to collect old chunk id set, call `_purge_doc_chunks_and_kg(doc_id, chunk_ids)`: delete chunk rows from `chunks_vdb` / `text_chunks`; reverse-lookup affected entities / relations by `entity_chunks` / `relation_chunks`, directly remove entries that have lost all sources from the graph and vector store, and call `rebuild_knowledge_from_chunks` to rebuild with the remaining chunks for entries still contributed by other documents; finally delete the index rows of this doc in `full_entities` / `full_relations`. After purge completes, `status_doc.chunks_list = []` / `chunks_count = 0` are reset to avoid the subsequent state-machine upsert writing back old IDs. | +| `analyze_multimodal` | For enabled modalities, every run recomputes the sidecar item analysis and overwrites the existing `llm_analyze_result`. The LLM analysis cache still applies: a cache hit reuses the previous provider response, so semantic fields usually stay the same and only runtime fields such as `analyze_time` are rewritten. Cache misses, for example after changing the model or prompt, can produce different saved content. | +| Re-chunk | Pick the strategy by the new `process_options.chunking`, with parameters read from `full_docs.chunk_options` (the enqueue snapshot; not overwritten by resume; env changes do not affect old documents that still chunk by the parameters from the moment of enqueue). The LightRAG Document path uses paragraph_semantic when `process_options=P`, otherwise dispatches to F/R/V by selector. | +| Entity extraction / KG-skip | Determined by the new `process_options.skip_kg` | + +> This rule guarantees: when users change `i/t/e` and re-upload the same-named document (delete the old doc first, then upload the file with the new hint), multimodal analysis is incrementally filled in; when changing `F/R/V/P`, chunks and graph are rebuilt; when changing `!`, KG construction is stopped or restored. 
Engine changes are considered a "major change", uniformly handled by delete + re-upload, not implicitly happening on the resume path. + +## 8. Python SDK Invocation + +This chapter targets developers who **directly import the `LightRAG` class** for integration, covering runtime APIs, constructor parameters, and removed legacy interfaces that Server deployments don't use. Server users usually don't need to read this chapter. + +### 8.1 Audience + +```python +from lightrag import LightRAG +rag = LightRAG(working_dir="./rag_storage", ...) +await rag.initialize_storages() +await rag.ainsert("text", file_paths="doc.pdf") +``` + +The following behaviors of this invocation style differ from the Server path: you can change `addon_params["chunker"]` without restarting the process, you can pass per-file `chunk_options` into `apipeline_enqueue_documents`, and you can dynamically override the F strategy's pre-split parameters in an `ainsert` call. + +### 8.2 LightRAG Constructor Parameters + +`LightRAG(chunk_token_size=…, chunk_overlap_token_size=…)` is **tier 3** in §3.3's priority chain: "legacy constructor field". Strategy-agnostic and coarse-grained default, fills only slots still empty: + +- Lower priority than `addon_params["chunker"]` explicit values (§8.3) and strategy-specific env (§3.2). +- Higher priority than the legacy env `CHUNK_SIZE` / `CHUNK_OVERLAP_SIZE`. +- The instance fields `self.chunk_token_size` / `self.chunk_overlap_token_size` are always back-filled to `int` after `__post_init__`, so legacy paths still reading these two fields (e.g., the `chunk_opts.get("chunk_token_size") or self.chunk_token_size` fallback in `pipeline.py`) continue to work. 
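The ordering described in this section can be read as a "first value that is set wins" scan. The sketch below is a hypothetical helper, not the actual `resolve_chunk_options`; it assumes that the strategy-specific env value sits between the `addon_params["chunker"]` value and the constructor field, that env values arrive as strings, and that a built-in default of `1200` applies when nothing is set (that number is a placeholder).

```python
def resolve_chunk_token_size(addon_value=None, strategy_env=None,
                             constructor_value=None, legacy_env=None,
                             builtin_default=1200):
    """Return the first value that is set, scanning tiers from highest to lowest priority."""
    for candidate in (addon_value, strategy_env, constructor_value, legacy_env):
        if candidate is not None:
            return int(candidate)  # env values are strings, so normalize to int
    return builtin_default         # hypothetical fallback, see lead-in

from_constructor = resolve_chunk_token_size(constructor_value=800, legacy_env="1500")
from_addon = resolve_chunk_token_size(addon_value=600, constructor_value=800)
from_legacy_env = resolve_chunk_token_size(legacy_env="1500")
```

Here the constructor value beats the legacy env, and an explicit `addon_params["chunker"]` value beats the constructor.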
+ +### 8.3 Modifying `addon_params["chunker"]` at Runtime + +`addon_params["chunker"]` is an `ObservableAddonParams` field; it can be **modified at runtime**: + +```python +rag.addon_params["chunker"]["recursive_character"]["separators"] = ["##", "\n", " "] +``` + +After modification, **subsequent enqueues** get the new defaults; already-enqueued documents keep the snapshot from their enqueue moment (see the three layers of semantic guarantee in §3.3). This is tier 1 of §3.3's priority chain: "`addon_params["chunker"]` explicit value", winning everything. + +Server deployments do not have this capability — after changing env, the service must be restarted for it to take effect. + +### 8.4 `apipeline_enqueue_documents(chunk_options=…)` + +`apipeline_enqueue_documents` accepts an optional `chunk_options` argument. When the caller passes a `dict` / `list[dict]`, it is projected by the current document's `process_options` into a slim snapshot (keeping only the corresponding strategy sub-dictionary + top-level `chunk_token_size`) before being persisted to `full_docs[doc_id]["chunk_options"]`; when not passed, `resolve_chunk_options(self.addon_params, process_options=…)` assembles one on the spot. Callers can safely pass the full dictionary — the other strategies' sub-dictionaries will be discarded by the dispatcher and won't pollute the store. + +Typical usage: + +```python +await rag.apipeline_enqueue_documents( + input=["text A", "text B"], + file_paths=["a.[native-R].txt", "b.txt"], + process_options=["R", ""], + chunk_options=[ + {"chunk_token_size": 800, "recursive_character": {"separators": ["\n\n", "\n"]}}, + {"chunk_token_size": 1500}, + ], +) +``` + +Typical scenarios for per-file personalization: a management UI configures separators or V threshold individually for a certain file; in the future, upload APIs may also accept overrides in form / hint. 
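The "slim snapshot" projection described above can be sketched as follows. This is an assumption-labeled illustration: `project_chunk_options` is a hypothetical helper, and the strategy key is passed in directly rather than derived from `process_options`, because the selector-to-key mapping is not spelled out in this section.

```python
def project_chunk_options(chunk_options, strategy_key):
    """Keep only the top-level chunk_token_size and the active strategy's sub-dictionary."""
    slim = {}
    if "chunk_token_size" in chunk_options:
        slim["chunk_token_size"] = chunk_options["chunk_token_size"]
    if strategy_key in chunk_options:
        slim[strategy_key] = dict(chunk_options[strategy_key])  # shallow copy of the sub-dict
    return slim

full = {
    "chunk_token_size": 800,
    "recursive_character": {"separators": ["\n\n", "\n"]},
    "fixed_token": {"split_by_character": "\n"},
}
slim = project_chunk_options(full, "recursive_character")
```

In this model the `fixed_token` sub-dictionary is dropped because the document is chunked with the R strategy, which matches the note that callers can pass the full dictionary safely.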
+ +**Compatibility for not passing `file_paths`**: the core APIs `insert` / `ainsert` / `apipeline_enqueue_documents` still support invocations without `file_paths`; the `file_path` of such documents is saved as `unknown_source`, does not participate in filename dedup, and the document ID continues to be generated from text content. + +For `apipeline_enqueue_documents`'s own concurrency constraints (last-line guard, `from_scan=True` bypass), see the entry-point behavior table in §6.2. + +### 8.5 `ainsert(split_by_character=…, split_by_character_only=…)` + +`LightRAG.ainsert(split_by_character=…, split_by_character_only=…)` runtime parameters are overridden into `chunk_options.fixed_token` by `resolve_chunk_options` at enqueue time: + +- A non-`None` `split_by_character` overrides the env default; +- `split_by_character_only=True` overrides (`False` is the signature default, indistinguishable from "not specified", so the env default wins). + +Only effective for the F strategy; other strategies' sub-dictionaries are unaffected. + +### 8.6 Removed SDK Parameter: `reprocess_existing_non_processed` + +The legacy `apipeline_enqueue_documents` behavior of `reprocess_existing_non_processed=True` would directly delete non-PROCESSED old records and rebuild them during scan, which conflicts with the rules in §5 / §6; it has been entirely removed. Replacement paths: + +- Automatic resume: scan handles same-named files per the classification rules in §6.4 (archive / resume / delete stub then re-enqueue), uniformly picked up by the resume rules in §7 inside the processing loop. +- Forced refresh: first call `/documents/{doc_id}` to delete the old document, then upload the same-named new file. 
diff --git a/docs/LightRAG-API-Server-zh.md b/docs/LightRAG-API-Server-zh.md index d1c0c4aa60..c85f429fb1 100644 --- a/docs/LightRAG-API-Server-zh.md +++ b/docs/LightRAG-API-Server-zh.md @@ -68,13 +68,15 @@ LightRAG 需要同时集成 LLM(大型语言模型)和嵌入模型以有效 * lollms * openai 或 openai 兼容 * azure_openai -* aws_bedrock +* bedrock * gemini 建议使用环境变量来配置 LightRAG 服务器。项目根目录中有一个名为 `env.example` 的示例环境变量文件。请将此文件复制到启动目录并重命名为 `.env`。之后,您可以在 `.env` 文件中修改与 LLM 和嵌入模型相关的参数。需要注意的是,LightRAG 服务器每次启动时都会将 `.env` 中的环境变量加载到系统环境变量中。**LightRAG 服务器会优先使用系统环境变量中的设置**。 > 由于安装了 Python 扩展的 VS Code 可能会在集成终端中自动加载 .env 文件,请在每次修改 .env 文件后打开新的终端会话。 +如果需要为实体抽取、关键词抽取、最终回答或多模态分析配置不同的 LLM/VLM,请参考 [基于角色的 LLM/VLM 配置指南](./RoleSpecificLLMConfiguration-zh.md)。 + 以下是 LLM 和嵌入模型的一些常见设置示例: * OpenAI LLM + Ollama 嵌入 @@ -284,7 +286,7 @@ lightrag-server --port 9622 --workspace space2 ### Gunicorn + Uvicorn 的多工作进程 -LightRAG 服务器可以在 `Gunicorn + Uvicorn` 预加载模式下运行。Gunicorn 的多工作进程(多进程)功能可以防止文档索引任务阻塞 RAG 查询。使用 CPU 密集型文档提取工具(如 docling)在纯 Uvicorn 模式下可能会导致整个系统被阻塞。 +LightRAG 服务器可以在 `Gunicorn + Uvicorn` 预加载模式下运行。Gunicorn 的多工作进程(多进程)功能可以防止文档索引任务阻塞 RAG 查询。CPU 密集型文档提取工具应作为外置服务部署,避免阻塞 API 进程。 虽然 LightRAG 服务器使用一个工作进程来处理文档索引流程,但通过 Uvicorn 的异步任务支持,可以并行处理多个文件。文档索引速度的瓶颈主要在于 LLM。如果您的 LLM 支持高并发,您可以通过增加 LLM 的并发级别来加速文档索引。以下是几个与并发处理相关的环境变量及其默认值: @@ -297,6 +299,15 @@ MAX_PARALLEL_INSERT=2 MAX_ASYNC=4 ``` +在 macOS 上,Gunicorn 多工作进程模式还要求 Objective-C fork safety +覆盖变量必须在 Python 进程启动前就存在。不要依赖 `.env` 设置这个变量; +`.env` 会在 Python 启动后才加载,对 Objective-C 运行时来说已经太晚: + +```shell +export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES +lightrag-gunicorn --workers 2 +``` + ### 将 Lightrag 安装为 Linux 服务 从示例文件 `lightrag.service.example` 创建您的服务文件 `lightrag.service`。修改服务文件中的服务启动定义: @@ -468,7 +479,7 @@ LightRAG 支持绑定到各种 LLM 后端: * openai (含openai 兼容) * azure_openai * lollms -* aws_bedrock +* bedrock * gemini LightRAG 支持绑定到各种嵌入后端: @@ -477,13 +488,15 @@ LightRAG 支持绑定到各种嵌入后端: * ollama * openai (含 openai 兼容) * azure_openai -* aws_bedrock +* bedrock * jina * gemini * voyageai 使用环境变量 `LLM_BINDING` 或 
CLI 参数 `--llm-binding` 选择 LLM 后端类型。使用环境变量 `EMBEDDING_BINDING` 或 CLI 参数 `--embedding-binding` 选择嵌入后端类型。 +Bedrock 会忽略 `LLM_BINDING_API_KEY` 和 `EMBEDDING_BINDING_API_KEY`。请通过 AWS credential chain 使用 SigV4 凭据;如果要使用 Bedrock API key / bearer token,请在启动前显式设置进程级环境变量 `AWS_BEARER_TOKEN_BEDROCK`。 + 非对称嵌入需要显式开启。仅当所选嵌入后端支持 provider task 参数或任务前缀时,才设置 `EMBEDDING_ASYMMETRIC=true`。修改这些设置前请先阅读 [Asymmetric Embedding Configuration](./AsymmetricEmbedding.md),因为任何变更后都必须清空已有数据并重新索引文件。 LLM和Embedding配置例子请查看项目根目录的 env.example 文件。OpenAI和Ollama兼容LLM接口的支持的完整配置选型可以通过一下命令查看: @@ -558,8 +571,8 @@ LIGHTRAG_DOC_STATUS_STORAGE=PGDocStatusStorage | --ssl | False | 启用 HTTPS | | --ssl-certfile | None | SSL 证书文件路径(如果启用 --ssl 则必需) | | --ssl-keyfile | None | SSL 私钥文件路径(如果启用 --ssl 则必需) | -| --llm-binding | ollama | LLM 绑定类型(lollms、ollama、openai、openai-ollama、azure_openai、aws_bedrock) | -| --embedding-binding | ollama | 嵌入绑定类型(lollms、ollama、openai、azure_openai、aws_bedrock、jina、gemini、voyageai) | +| --llm-binding | ollama | LLM 绑定类型(lollms、ollama、openai、openai-ollama、azure_openai、bedrock) | +| --embedding-binding | ollama | 嵌入绑定类型(lollms、ollama、openai、azure_openai、bedrock, jina、gemini、voyageai) | ### Reranking 配置 diff --git a/docs/LightRAG-API-Server.md b/docs/LightRAG-API-Server.md index 2b71b5c2a6..88ee5b5fba 100644 --- a/docs/LightRAG-API-Server.md +++ b/docs/LightRAG-API-Server.md @@ -68,13 +68,15 @@ LightRAG necessitates the integration of both an LLM (Large Language Model) and * lollms * openai or openai compatible * azure_openai -* aws_bedrock +* bedrock * gemini It is recommended to use environment variables to configure the LightRAG Server. There is an example environment variable file named `env.example` in the root directory of the project. Please copy this file to the startup directory and rename it to `.env`. After that, you can modify the parameters related to the LLM and Embedding models in the `.env` file. 
It is important to note that the LightRAG Server will load the environment variables from `.env` into the system environment variables each time it starts. **LightRAG Server will prioritize the settings in the system environment variables to .env file**. > Since VS Code with the Python extension may automatically load the .env file in the integrated terminal, please open a new terminal session after each modification to the .env file. +If you need to configure different LLMs/VLMs for entity extraction, keyword extraction, final answers, or multimodal analysis, see the [Role-Specific LLM/VLM Configuration Guide](./RoleSpecificLLMConfiguration.md). + Here are some examples of common settings for LLM and Embedding models: * OpenAI LLM + Ollama Embedding: @@ -284,7 +286,7 @@ To maintain compatibility with legacy data, the default workspace for PostgreSQL ### Multiple workers for Gunicorn + Uvicorn -The LightRAG Server can operate in the `Gunicorn + Uvicorn` preload mode. Gunicorn's multiple worker (multiprocess) capability prevents document indexing tasks from blocking RAG queries. Using CPU-exhaustive document extraction tools, such as docling, can lead to the entire system being blocked in pure Uvicorn mode. +The LightRAG Server can operate in the `Gunicorn + Uvicorn` preload mode. Gunicorn's multiple worker (multiprocess) capability prevents document indexing tasks from blocking RAG queries. CPU-heavy document extraction tools should be deployed as external services so they do not block the API process. Though LightRAG Server uses one worker to process the document indexing pipeline, with the async task support of Uvicorn, multiple files can be processed in parallel. The bottleneck of document indexing speed mainly lies with the LLM. If your LLM supports high concurrency, you can accelerate document indexing by increasing the concurrency level of the LLM. 
Below are several environment variables related to concurrent processing, along with their default values: @@ -297,6 +299,16 @@ MAX_PARALLEL_INSERT=2 MAX_ASYNC=4 ``` +On macOS, Gunicorn multi-worker mode also requires the Objective-C fork-safety +override to be present before the Python process starts. Do not rely on `.env` +for this variable; `.env` is loaded after Python startup and is too late for +the Objective-C runtime: + +```shell +export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES +lightrag-gunicorn --workers 2 +``` + ### Install LightRAG as a Linux Service Create your service file `lightrag.service` from the sample file: `lightrag.service.example`. Modify the start options the service file: @@ -433,7 +445,6 @@ az cognitiveservices account deployment create --resource-group $RESOURCE_GROUP_ az cognitiveservices account deployment create --resource-group $RESOURCE_GROUP_NAME --model-format OpenAI --name $RESOURCE_NAME --deployment-name text-embedding-3-large --model-name text-embedding-3-large --model-version "1" --sku-capacity 80 --sku-name "Standard" az cognitiveservices account show --name $RESOURCE_NAME --resource-group $RESOURCE_GROUP_NAME --query "properties.endpoint" az cognitiveservices account keys list --name $RESOURCE_NAME -g $RESOURCE_GROUP_NAME - ``` The output of the last command will give you the endpoint and the key for the OpenAI API. You can use these values to set the environment variables in the `.env` file. @@ -469,7 +480,7 @@ LightRAG supports binding to various LLM backends: * openai (including openai compatible) * azure_openai * lollms -* aws_bedrock +* bedrock * gemini LightRAG supports binding to various Embedding backends: @@ -478,13 +489,15 @@ LightRAG supports binding to various Embedding backends: * ollama * openai (including openai compatible) * azure_openai -* aws_bedrock +* bedrock * jina * gemini * voyageai Use environment variables `LLM_BINDING` or CLI argument `--llm-binding` to select the LLM backend type. 
Use environment variables `EMBEDDING_BINDING` or CLI argument `--embedding-binding` to select the Embedding backend type. +Bedrock ignores `LLM_BINDING_API_KEY` and `EMBEDDING_BINDING_API_KEY`. Use SigV4 credentials through the AWS credential chain, or set the process-level `AWS_BEARER_TOKEN_BEDROCK` environment variable before startup for Bedrock API key / bearer-token auth. + Asymmetric embedding is explicit opt-in. Set `EMBEDDING_ASYMMETRIC=true` only when the selected embedding backend supports either provider task parameters or task prefixes. See [Asymmetric Embedding Configuration](./AsymmetricEmbedding.md) before changing these settings, because existing data must be cleared and files re-indexed after any change. For LLM and embedding configuration examples, please refer to the `env.example` file in the project's root directory. To view the complete list of configurable options for OpenAI and Ollama-compatible LLM interfaces, use the following commands: @@ -558,8 +571,8 @@ When switching the storage implementation in LightRAG, the LLM cache can be migr | --ssl | False | Enable HTTPS | | --ssl-certfile | None | Path to SSL certificate file (required if --ssl is enabled) | | --ssl-keyfile | None | Path to SSL private key file (required if --ssl is enabled) | -| --llm-binding | ollama | LLM binding type (lollms, ollama, openai, openai-ollama, azure_openai, aws_bedrock) | -| --embedding-binding | ollama | Embedding binding type (lollms, ollama, openai, azure_openai, aws_bedrock, jina, gemini, voyageai) | +| --llm-binding | ollama | LLM binding type (lollms, ollama, openai, openai-ollama, azure_openai, bedrock) | +| --embedding-binding | ollama | Embedding binding type (lollms, ollama, openai, azure_openai, bedrock, jina, gemini) | ### Reranking Configuration diff --git a/docs/LightRAGSidecarFormat-zh.md b/docs/LightRAGSidecarFormat-zh.md new file mode 100644 index 0000000000..5c6f0b82b3 --- /dev/null +++ b/docs/LightRAGSidecarFormat-zh.md @@ -0,0 +1,399 @@ +# 
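LightRAG Sidecar File Format + +This document describes the **LightRAG Sidecar** file format produced by the built-in parsing engines. When LightRAG extracts file content with a multimodal-capable parsing engine (native / mineru / docling), it splits "body text + multimodal objects + parse metadata" and writes them into a `*.parsed/` directory; the JSON / JSONL files in that directory are collectively called **sidecar** files. The sidecar is the only reliable basis for the downstream pipeline (multimodal analysis → multimodal chunk construction → entity extraction → cache cleanup on document deletion). The sidecar format is LightRAG's built-in general-purpose file interchange format, and every new multimodal content extraction engine must follow it. The **LightRAG Sidecar** format is published to make it easier for community developers to write their own content parsing engines. + +## 1. Overview + +| Concern | File | Contents | Notes | +|---|---|---|---| +| Main file | `.blocks.jsonl` | Block body text | Concatenating the `content` fields of all blocks yields the complete original text | +| Drawing objects | `.drawings.json` | Drawing objects extracted from the file | Sent to the VLM for analysis; the result is written back | +| Table objects | `.tables.json` | Table objects extracted from the file | Sent to the LLM for analysis; the result is written back | +| Equation objects | `.equations.json` | Equation objects extracted from the file | Sent to the LLM for analysis; the result is written back | +| Original image assets | `.blocks.assets/` | Original image files extracted from the file | Sent to the VLM for image analysis | + +Design intent of the sidecar: + +- Parsing stage: the content extraction engine (native / mineru / docling) is responsible **only** for producing "objective" fields such as `blockid / heading / content / surrounding`; +- Multimodal analysis stage (`analyze_multimodal`): LightRAG writes the analysis result into the `llm_analyze_result` dict, either appending it for the first time or overwriting an existing result; parsers should not pre-populate this field. + +## 2. Directory Layout + +``` +inputs/space1/__parsed__/<canonical filename>.parsed/ +├── <canonical filename>.blocks.jsonl body block sequence + document-level meta (first line) +├── <canonical filename>.drawings.json drawing sidecar (dict container, key = drawing id) +├── <canonical filename>.tables.json table sidecar +├── <canonical filename>.equations.json equation sidecar +└── <canonical filename>.blocks.assets/ raw asset directory (image files referenced by drawings.json are stored here) + ├── image1.wmf + ├── image2.wmf + ├── image3.wmf + ├── image4.png + ├── image5.png + ├── image6.png + └── image7.emf +``` + +## 3. blocks.jsonl + +`blocks.jsonl` is line-serialized JSON: **the first line has `type="meta"`**, and every other line is a content block with `type="content"`. + +### 3.1 Example meta Line + +```json +{ + "type": "meta", + "format": "lightrag", + "version": "1.0", + "document_name": "m012-manual.docx", + "document_format": "docx", + "document_hash": "sha256:4840...3f9543d9db0822d2d59", + "table_file": true, + "equation_file": true, + "drawing_file": true, + "asset_dir": true, + "split_option": { "fixlevel": 0 }, + "blocks": 39, + "doc_id": "doc-f1bee60173d067d88595c00e7d9b0ce5", + "parse_engine": "native", + "parse_time": 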
"2026-05-13T18:42:25.943490+00:00", + "doc_title": "m012-manual" +} +``` + +| 字段 | 类型 | 说明 | +|---|---|---| +| `type` | `"meta"` | 行类型,固定值,校验位 | +| `format` | `"lightrag"` | sidecar 大版本族标识 | +| `version` | `str` | sidecar schema 版本 | +| `document_name` | `str` | 规范文件名(含后缀,不含处理指示) | +| `document_format` | `str` | 文件格式(目前以文件后缀表示) | +| `document_hash` | `"sha256:"` | sidecar 正文指纹,定义为 `SHA-256(merged_text)`,其中 `merged_text` 是所有非空 content 行的 `content` 字段按 `"\n\n"` 拼接后的字符串。供外部消费者快速判断两份 `.parsed/` 是否同源(不必逐行比对 body),并作为 sidecar 文件的自描述内容校验位。注意:LightRAG 入库流水线本身不读此字段,跨文档去重由 `doc_status.content_hash` 单独承担 | +| `table_file` / `equation_file` / `drawing_file` | `bool` | 是否存在对应 sidecar 文件(为真时对应文件必然存在) | +| `asset_dir` | `bool` | 是否存在`blocks.assets`资源目录 | +| `split_option` | `object` | 文件提取时的分块参数。此字段留给文件提取引擎自己记录和使用 | +| `blocks` | `int` | content 行数(不含 meta) | +| `doc_id` | `"doc-"` | 文档全局 id。sidecar item id(`im-/tb-/eq-`)使用 `doc_id` 去掉 `doc-` 前缀后的哈希部分,以缩短嵌入正文中的占位标签 | +| `parse_engine` | `str` | 解析引擎`native/mineru/docling/legacy` | +| `parse_time` | `str` | 解析完成时间; 格式:ISO-8601 UTC | +| `doc_title` | `str` | 文档标题(通常为首个 H1);可选 | +| `doc_summary` | `str` | 文档摘要;可选 | +| `doc_attributes` | `object` | 文章扩展属性对象;可选 | +| `bbox_attributes` | `object` | bbox possition全局属性;详见[§八](八、positions) | + +> LightRAG要求同一个workspace(知识库)内的文件名(document_name)必须唯一。 + +### 3.2 content 行 + +每个 content 行是一个原始文档"块"的最小可寻址单位,至少包含: + +```json +{ + "type": "content", + "blockid": "462c6364584a7ba4bdae6853f85ac429", + "format": "plain_text", + "content": "1 产品用途和功能\nMI012模块用于支撑供氧抗荷调节器的供氧抗荷控制功能...", + "heading": "1 产品用途和功能", + "parent_headings": [], + "level": 1, + "session_type": "body", + "table_slice": "none", + "positions": [ + { + "type": "paraid", + "range": ["5EA4577A", "6555DDCB"] + } + ] +} +``` + +| 字段 | 含义 | +|---|---| +| `type` | `"content"` | +| `blockid` | 全局唯一的Block ID | +| `format` | 内容形态,目前固定为 `"plain_text"` | +| `content` | 文本内容;**公式和图片此以占位标签出现,表格以带table标签的JSON或HTLM格式出现**(见 3.3) | +| `heading` | 
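The highest-level heading of the section that contains the content. When the heading actually exists in the document, it should also appear at the start of `content`. If a heading is immediately followed by a heading of the next level, the lower-level heading is treated as body text. This guarantees that concatenating the `content` fields of all blocks yields the complete original text. | +| `parent_headings` | String array: ancestor headings from the top down, excluding the current `heading` | +| `level` | Integer: level of `heading` in the document outline (`1` = H1 / top-level heading; `0` means no heading) | +| `session_type` | Region the block belongs to: `body` `preface` `TOC` `references` `appendix` | +| `table_slice` | Optional reserved field; indicates whether the block contains only a table fragment. The parsing engines currently do not split long tables, so this field is fixed to `"none"` (tables are not sliced) | +| `table_header` | Optional reserved field; when the block is a table fragment, stores the recognized table header. Not emitted at present | +| `positions` | Array of `position` objects that locate the text block in the page layout; when the block's text comes from several layout locations, there are several `position` objects. See §8 | + +> - blockid computation: `md5(doc_id + ":" + block_index + ":" + heading + ":" + content)`. Chunks produced by a chunking strategy store the blockid so that each chunk can be traced back to its position in the sidecar. +> - The chunking strategies that ignore the document's section structure (`F` `R` `V`) chunk the concatenation of the `content` fields. The `content` fields of all blocks must therefore combine into the complete document, with nothing missing and nothing overlapping. + +### 3.3 Placeholder Tags Embedded in content + +To let the P chunking strategy split the body without breaking multimodal objects, the `content` text uses the following three XML-style placeholder tags: + +| Tag | Meaning | Tag attributes | +|---|---|---| +| `…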
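` | Table placeholder; the body is the table's raw JSON / HTML | `id` points to the matching item in `tables.json`; `format` ∈ `json` / `html` | +| `` | Self-closing drawing placeholder | `id` points to `drawings.json`; `path` is relative to the `*.parsed/` directory; `src` is the reference name in the original document | +| `` | Equation placeholder | Inline equations use the same `` tag but **without** `id` and do not enter the sidecar; only block equations (occupying one or more lines on their own) carry `id` | + +The text fed to the LLM during entity/relation extraction has internal attributes such as `id / path / src` stripped but keeps the key attributes (`format / caption`). This avoids extracting entities that are not visible in the document and injecting excessive noise into the extraction result. + +### 3.4 Mapping between blockid and chunk sidecar.refs + +When a sidecar file exists, every chunking strategy attaches `sidecar = {"type": "block", "id": <primary source blockid>, "refs": [{"type": "block", "id": <blockid>}, …]}` to each chunk it outputs, where: + +- Unmerged chunk → `sidecar.refs` has a single element, equal to the `blockid` of the blocks.jsonl line the chunk came from; +- Chunk merged in Stage D → `refs` keeps all source `blockid` values in order (deduplicated); +- Sub-chunks produced by a hard fallback split → share the parent chunk's `sidecar`. + +This chain is the basis of document-level traceability (chunk ↔ block ↔ original paragraph paraId). + +## 4. drawings.json + +The top level is a dict container of the form `{"version": "1.0", "drawings": { : , … }}`, **keyed by the `id` field** so items can be looked up by id. Each item looks like: + +```json +{ + "id": "im-f1bee60173d067d88595c00e7d9b0ce5-0004", + "blockid": "2f52b70839d13a936d97955916820147", + "heading": "2.3 结构尺寸及重量", + "format": "png", + "path": "m012-manual.blocks.assets/image4.png", + "src": "", + "caption": "", + "footnotes": [], + "extras": { + "ocr_texts": "图内第一段 OCR 文本\n\n图内第二段 OCR 文本", + "ocr_texts_count": 2 + }, + "surrounding": { + "leading": "2.3 结构尺寸及重量\n尺寸及重量要求如下:\na) 外廓尺寸长度为:…" + } +} +``` + +| Field | Description | +|---|---| +| `id` | Built from `im-`, the `doc_hash`, `-`, and a sequence number, as in `im-f1bee60173d067d88595c00e7d9b0ce5-0004` (`doc_hash` is the 32-character md5 left after stripping the `doc-` prefix from `doc_id`) | +| `blockid` | Points to the content line that produced this drawing | +| `heading` | Heading of the containing section | +| `format` | Original extension without the dot: `png` / `jpeg` / `gif` / `webp` / `wmf` / `emf` / … | +| `path` | Asset path relative to the `*.parsed/` directory; **always** points to a file inside `*.blocks.assets/` | +| `src` | Reference alias of the drawing in the original document (empty in most cases) | +| `caption` | Visible caption (the parser may leave it empty) | +| `footnotes` | List of footnote strings | +| `surrounding` | Context object; see §7 | +| `self_ref` | String; optional. The object reference in the parsing engine's raw output (e.g. Docling JSON Pointer `#/pictures/3`, or MinerU `content_list.json#/23`), used during tracing to look up the corresponding object in the raw parse output (page position, original structure, etc.). Omitted when the engine, such as native, does not provide it | +| `extras` | Object; optional. Engine-specific side-channel fields (such as OCR text contained in the image). Outside the scope of spec validation; downstream consumers should not rely on specific keys. | +| `llm_analyze_result` | 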
Modal analysis result object; see §9 (`llm_analyze_result`) (later injected into the multimodal text chunk) | +| `llm_cache_list` | Array of LLM cache entries from modal analysis (later injected into the multimodal text chunk) | + +Common drawing-specific keys in `extras`: + +| Key | Description | +|---|---| +| `ocr_texts` | String; optional. OCR text inside the drawing object, with paragraphs joined by a blank line (`\n\n`). Emitted only when the parsing engine explicitly attaches OCR text under this drawing's children; caption / footnote text does not go into this field. | +| `ocr_texts_count` | Integer; optional. Number of non-empty OCR paragraphs written to `ocr_texts`. | + +**Only supported raster formats (png / jpeg / gif / webp) go through VLM analysis**; other formats (wmf / emf / svg, etc.) get `llm_analyze_result.status="skipped"`, no multimodal chunk is generated downstream, and document processing continues. Images larger than the limit set by the environment variable `VLM_MAX_IMAGE_BYTES` also skip VLM analysis. + +> Image size, DPI, and similar information all go into the `extras` object; do not introduce undeclared fields at the item top level (such as `image` / `img_path`). tables / equations follow the same `extras` convention. `self_ref` is an optional field declared at the top level of the spec and is not part of extras. + +## 5. tables.json + +The top level is a dict container of the form `{"version": "1.0", "tables": { : , ... }}`, **keyed by the `id` field** so items can be looked up by id. Each item looks like: + +```json +{ + "id": "tb-f1bee60173d067d88595c00e7d9b0ce5-0007", + "blockid": "3f33897b5e105d254addc655f1efbf8c", + "heading": "2.4.4 温度-湿度-高度(随系统进行)", + "dimension": [16, 8], + "format": "json", + "content": "[[\"试验步骤\", \"温度(℃)\", \"高度(m)\", \"相对湿度\", \"时间(min)\", \"辅助冷却\", \"系统电源\", \"功能、性能检查\"],…", + "caption": "", + "footnotes": [], + "table_header": "[[\"试验步骤\", \"温度(℃)\", \"高度(m)\", \"相对湿度\", \"时间(min)\", \"辅助冷却\", \"系统电源\", \"功能、性能检查\"]]", + "surrounding": { + "leading": "2.4.4 温度-湿度-高度(随系统进行)\n产品应能承受执行任务期间的温度、湿度、高度环境综合作用…", + "trailing": "\n注:以上步骤重复10个循环。a成品及附件达到温度稳定或240min,以长者为准;b成品及附件达到温度稳定或120min,以长者为准。…" + }, + "llm_analyze_result": { + "name": "文档管理元数据表", + "description": "这是一份文档管理信息表,用于记录技术文档的基本元数据和版本控制信息 …", + "analyze_time": 1778697759, + "status": "success", + "message": "" + }, + "llm_cache_list": [ + "default:analysis:b316aacd40fdca0cb56430870bb89a62" + ] +} +``` + +In tables.json, the `blockid`, `heading`, `surrounding`, and `llm_analyze_result` fields are the same as in drawings.json. Fields that differ or are new: + +| Field | Description | +|---|---| +| `id` | Built from `tb-`, the `doc_hash`, `-`, and a sequence number, as in `tb-f1bee60173d067d88595c00e7d9b0ce5-0007` (`doc_hash` is the 32-character md5 left after stripping the `doc-` prefix from `doc_id`) | +| `dimension` | Integer array `[num_rows, num_cols]`, including header rows | +| `format` 
| `"json"` (2D array) or `"html"` (the payload is a `…
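` fragment, including the opening and closing tags) | +| `content` | String: the table body, structured according to `format`; this is the string the downstream multimodal chunk actually uses. | +| `table_header` | String; optional. Row content recognized as the table header | +| `self_ref` | Optional. The object reference in the parsing engine's raw output (e.g. Docling JSON Pointer `#/tables/2`, or MinerU `content_list.json#/31`), used during tracing to look up the raw parse output | + +During modal analysis, if the `content` field exceeds the model's context length, the table content is mechanically truncated before it is fed to the model. + +## 6. equations.json + +The top level is a dict container of the form `{"version": "1.0", "equations": { : , ... }}`, **keyed by the `id` field** so items can be looked up by id. Each item looks like: + +```json +{ + "id": "eq-f1bee60173d067d88595c00e7d9b0ce5-0001", + "blockid": "2f52b70839d13a936d97955916820147", + "heading": "2.3 结构尺寸及重量", + "format": "latex", + "content": "C=2∗\\frac{P∗T}{\\left( {V}_{H}^{2}−{V}_{L}^{2} \\right)∗η}", + "caption": "", + "footnotes": [], + "surrounding": { + "leading": "2.3 结构尺寸及重量\n尺寸及重量要求如下:\n …", + "trailing": "\n其中P为供电异常时维持的功率28W,T为期望储能时间,VH为电容放电前…" + }, + "llm_analyze_result": { + "name": "电容储能时间计算公式", + "description": "该公式用于计算在电源异常情况下维持系统正常工作所需的电容储能值 …", + "analyze_time": 1778697783, + "status": "success", + "message": "", + "equation": "C=2\\cdot\\frac{P\\cdot T}{(V_{H}^{2}-V_{L}^{2})\\cdot\\eta}" + }, + "llm_cache_list": [ + "default:analysis:fcf4c4f88227ee1c1bf0ed4394039e37" + ] +} +``` + +In equations.json, the `blockid`, `heading`, `surrounding`, and `llm_analyze_result` fields are the same as in drawings.json. Fields that differ or are new: + +| Field | Description | +|---|---| +| `id` | Built from `eq-`, the `doc_hash`, `-`, and a sequence number, as in `eq-f1bee60173d067d88595c00e7d9b0ce5-0001` (`doc_hash` is the 32-character md5 left after stripping the `doc-` prefix from `doc_id`) | +| `format` | Fixed to `"latex"` | +| `content` | String: the **raw** LaTeX (may contain Unicode operators and an outer `\[ \]`), without the `$` delimiters at both ends; modal analysis reads this field directly | +| `self_ref` | Optional. The object reference in the parsing engine's raw output (e.g. Docling JSON Pointer `#/texts/15`, or MinerU `content_list.json#/45`), used during tracing to look up the raw parse output | +| `llm_analyze_result.equation` | String: the **normalized** LaTeX equation output by the LLM (normalization covers the outer `$ / \[ \] / equation` environments and Unicode-to-LaTeX conversion; the `$` delimiters at both ends are not included). This is the string the downstream multimodal chunk actually uses. | + +During modal analysis, if the `content` field exceeds the model's context length, the equation content is mechanically truncated before it is fed to the model. Inline equations (a `` that is continuous with the body text) are **not** saved to equations.json; they remain only in the blocks text, without an `id`. This avoids injecting excessive noise into the extraction result. + +## 7. surrounding + +`surrounding.leading` and `surrounding.trailing` are the sidecar 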
item 的可分析上下文窗口,目的是提供图片、表格和公式所在段落的上下文信息,提高多模态分析的质量。**surrounding内容有LightRAG在分析阶段自动注入,不需要在文档解析引擎中主动写入sidecar文件中**。以下是surrounding内容的生成逻辑: + +- 取自同一 `blockid` 对应的 content 行文本,以多模态占位标签的位置为切分点; +- 每一侧的 token 上限由环境变量 `SURROUNDING_LEADING_MAX_TOKENS` / `SURROUNDING_TRAILING_MAX_TOKENS` 控制(缺省 `2000`,可独立调整);按 tokenizer 截断,倾向保留靠近目标的句子; +- 文本中保留**同行其他**多模态对象的占位标签,这让模型能感知"图 1 之后还有公式 1"这种上下文;但解析器内部标识符(`id` / `path` / `src` / `refid`)已被 `strip_internal_multimodal_markup_for_extraction` 剥离 —— 与 chunk content 实体抽取前的清理一致,避免噪声进入 VLM/LLM prompt。具体清理规则: + - `` → ``;**没有 caption 的 drawing 整段移除**(标签不再携带任何对模型可见的信息); + - `rows
` → `rows
`; + - `body` → `body`; + - `表 1` → `表 1`;`公式 2` → `公式 2`。仅删 `refid` 属性,保留 `` 包装 —— 让 VLM/LLM 能识别"这是对其他表/公式的引用"而非普通的文本,同时屏蔽 LLM 看不到的解析器内部 id; + - 例外:`tables.json` 类型的 surrounding 在 strip 之前先走 `remove_table_tags`,把所有 `` 整段移除(分析目标表时不希望被对其他表的悬挂引用干扰); +- 清理发生在 token 预算截断**之前**:token 数按"LLM 实际看到的内容"统计,且截断点不会落到未清理的 `id="…"` 属性中间,避免标签结构残缺; +- 当目标对象本身位于 block 起点 / 终点时,对应一侧为 `""` 而不是 `"n/a"`(提示词组装时再把空字符串显示为 `n/a`); +- `enrich_sidecars_with_surrounding` 是幂等的:每次 `analyze_multimodal` 入口都会重新计算并覆盖 `surrounding`,因此修改 `SURROUNDING_LEADING_MAX_TOKENS` / `SURROUNDING_TRAILING_MAX_TOKENS` 后无需手动清理 sidecar,重新执行多模态分析即可按新预算重写。 + +## 八、positions + +`positions`是一个对象数组,用于标识`blockid`的内容来之文件中的哪一个文字,用于内容溯源的时候能够在原始文件中找到和显示对应的内容。当`blockid`的内容是由版面的多个栏目合并而成时,会出现多个`position` 对象,每个`position` 对象对应1个版面方框或栏目。为了适应不同的文档格式的内容定位方式,系统提供了以下几种`position` 对象对象类型。 + +`position` 对象有多种类型,对象的`type`字段决定了其类型: + +* paraid + +适用于docx格式文件;按`段落id`(paraid)定位内容。`rang`字段指定起止`段落id`;`charspan`为可选字段,指定内容从段落的m个字符开始到底n个字符结束。不提供`charspan`表示`blockid`为起止段落的全部内容。示例: + +``` +"positions": [ +{ + "type": "paraid", + "range": ["5EA4577A", "6555DDCB"] + "charspan": [10,999] +}] +``` + +* bbox + +适用于与PDF格式类似的文件,通过页面矩形位置来标定内容来源的原始位置。bbox支持一下字段: + +``` +origin: 矩形坐标相对于页面那个位置(可选字段,默认为LEFTTOP,另一个可选值为LEFTBOTTOM) +max: 页面布局的长和宽的最大值,坐标按此值归一化以便能准确显示位置(可选字段,为空表示坐标按图片的点阵计算) +anchor: 页码, 页码为字符串,支持罗马数字等非数阿拉伯数字字页码 +range: 矩形坐标数组 [h1,w1,h2,w2],例如 [174, 155, 818, 333] +charspan: 内容从标定段落的m个字符开始到底n个字符结束(可选字段) +``` + +`blocks.jsonl`文件的`meta`行的`bbox_attributes`字段保存的是bbox的全局设置,避免每个`content`行的`positions`对象中重复保存相同的内容。一下是一个典型的`positions`对象示例: + +``` +"positions": [ +{ + "type": "bbox", + "anchor": "ii" + "range": [174, 155, 818, 333] + "charspan": [10, 999] +}] +``` + +* heading + +适用于与Markdown格式类似的文件,按标题定位内容。`anchor`是起始标题(标题重复是的处理方式查到markdown anchor规范);`charspan`为可选字段,指定内容从段落的m个字符开始到底n个字符结束。不提供`charspan`表示`blockid`为起止段落的全部内容。 + +``` +"positions": [ +{ + "type": "heading", + "anchor": "ii" + "range": [174, 155, 818, 333] + "charspan": [10, 999] +}] +``` + +* 
absolute + +适用于text格式类似的文件,按字符绝对位置定位。`charspan`指定内容从段落的m个字符开始到底n个字符结束。 + +``` +"positions": [ +{ + "charspan": [10, 999] +}] +``` + +## 九、`llm_analyze_result` + +| `status` | 触发场景 | 字段说明 | +|---|---|---| +| `success` | 模型成功返回合法 JSON 且必需字段齐全 | 图形:`name / type / description`;表格:`name / description`;公式:`name / description / equation` | +| `skipped` | 期跳过多模态分析:图片格式不支持、像素 < `VLM_MIN_IMAGE_PIXEL`(默认 32px)、大于 `VLM_MAX_IMAGE_BYTES`(默认 5 MB)、未启用VLM | `message` 写跳过原因 | +| `failure` | 必需字段缺失、JSON 修复后仍不合法、VLM/EXTRACT role 未配置而对应模态被启用、模型调用异常 | `message` 写诊断 | + +补充: + +- `analyze_time` 是 epoch 秒,每个 status 都有; +- `message` 在 `status="success"` 时**恒为空串**,便于过滤; +- 对已启用模态的 item,每次 `analyze_multimodal` 都会重新计算,并用本次结果覆盖已有的 `llm_analyze_result`(无论原先是 `success`、`skipped` 还是 `failure`)。这样修正 VLM/EXTRACT 配置后可以直接重试,无须手动清理旧 sidecar 结果。LLM 调用仍会走 analysis cache:如果 cache key 命中,不会再次请求 provider,语义字段通常保持一致,但 `analyze_time` 等运行时字段会被重写。只有 cache miss,例如有效 role 模型 / binding / host、prompt 输入或图片元数据变化后,保存内容才可能与上次不同。 + +图形 `type` 受 12 项枚举约束(见 [`IMAGE_TYPE_ENUM`](../lightrag/prompt_multimodal.py):`Photo / Illustration / Screenshot / Icon / Chart / Table / Infographic / Flowchart / Chat Log / Wireframe / Texture / Other`);模型若返回枚举外的值,会被规整成 `Other` 而不是失败。 diff --git a/docs/LightRAGSidecarFormat.md b/docs/LightRAGSidecarFormat.md new file mode 100644 index 0000000000..660cc3933d --- /dev/null +++ b/docs/LightRAGSidecarFormat.md @@ -0,0 +1,399 @@ +# LightRAG Sidecar File Format Specification + +This document describes the **LightRAG Sidecar** file format that content parsing engines output. When LightRAG uses multimodal-capable content parsing engines such as native/mineru/docling to extract file content, it splits "body text + multimodal objects + parsing metadata" into a `*.parsed/` directory. Each JSON / JSONL file in that directory is collectively called a **sidecar** file. 
Sidecars are the only reliable source of truth for the subsequent pipeline (multimodal analysis → multimodal chunk construction → entity extraction → cache cleanup on document deletion). The sidecar format is LightRAG's built-in universal file interchange format; new multimodal content extraction engines must follow this format. The purpose of publicly documenting the **LightRAG Sidecar** format is to make it convenient for community developers to write their own content parsing engines. + +## 1. Overview + +| Concern | File | Contents | Notes | +|---|---|---|---| +| Main file | `.blocks.jsonl` | Stores block body | Concatenating the `content` fields of all blocks reconstructs the complete original text | +| Drawing objects | `.drawings.json` | Drawing objects extracted from the file | Sent to a VLM for analysis; analysis results are written back | +| Table objects | `.tables.json` | Table objects extracted from the file | Sent to an LLM for analysis; analysis results are written back | +| Equation objects | `.equations.json` | Equation objects extracted from the file | Sent to an LLM for analysis; analysis results are written back | +| Original image assets | `.blocks.assets/` | Original image files extracted from the document | Sent to a VLM for image analysis | + +Design intent of sidecars: + +- During the parsing stage, the content extraction engine (native/mineru/docling) is **only** responsible for generating "objective" fields such as `blockid / heading / content / surrounding`; +- During the multimodal analysis stage (`analyze_multimodal`), the analysis result dict `llm_analyze_result` is written by LightRAG and may be appended or overwritten; parsers should not pre-populate it. + +## 2. 
Directory Layout + +``` +inputs/space1/__parsed__/.parsed/ +├── .blocks.jsonl body block sequence + document-level meta (first line) +├── .drawings.json drawing sidecar (dict container, key = drawing id) +├── .tables.json table sidecar +├── .equations.json equation sidecar +└── .blocks.assets/ original asset directory (image files referenced by drawings.json live here) + ├── image1.wmf + ├── image2.wmf + ├── image3.wmf + ├── image4.png + ├── image5.png + ├── image6.png + └── image7.emf +``` + +## 3. blocks.jsonl + +`blocks.jsonl` is JSON serialized line by line. The **first line has `type="meta"`**; every subsequent line is a content block with `type="content"`. + +### 3.1 meta line example + +```json +{ + "type": "meta", + "format": "lightrag", + "version": "1.0", + "document_name": "m012-manual.docx", + "document_format": "docx", + "document_hash": "sha256:4840...3f9543d9db0822d2d59", + "table_file": true, + "equation_file": true, + "drawing_file": true, + "asset_dir": true, + "split_option": { "fixlevel": 0 }, + "blocks": 39, + "doc_id": "doc-f1bee60173d067d88595c00e7d9b0ce5", + "parse_engine": "native", + "parse_time": "2026-05-13T18:42:25.943490+00:00", + "doc_title": "m012-manual" +} +``` + +| Field | Type | Description | +|---|---|---| +| `type` | `"meta"` | Line type, fixed value, sanity check | +| `format` | `"lightrag"` | Sidecar major version family identifier | +| `version` | `str` | Sidecar schema version | +| `document_name` | `str` | Canonical filename (with extension, without processing hints) | +| `document_format` | `str` | File format (currently expressed as the file extension) | +| `document_hash` | `"sha256:"` | Sidecar body fingerprint, defined as `SHA-256(merged_text)`, where `merged_text` is the concatenation of all non-empty content lines' `content` fields joined by `"\n\n"`. 
Used by external consumers to quickly determine whether two `.parsed/` directories share the same source (without line-by-line body comparison), and serves as a self-describing content checksum for the sidecar file. Note: the LightRAG ingestion pipeline itself does not read this field; cross-document deduplication is handled separately by `doc_status.content_hash`. | +| `table_file` / `equation_file` / `drawing_file` | `bool` | Whether the corresponding sidecar files exist (when true, the corresponding file must exist) | +| `asset_dir` | `bool` | Whether the `blocks.assets` asset directory exists | +| `split_option` | `object` | Chunking parameters used during file extraction. This field is reserved for the extraction engine itself to record and use | +| `blocks` | `int` | Number of content lines (excluding meta) | +| `doc_id` | `"doc-"` | Global document ID. Sidecar item IDs (`im-/tb-/eq-`) use the hash portion of `doc_id` with the `doc-` prefix removed, in order to shorten the placeholder tags embedded in body text. | +| `parse_engine` | `str` | Parsing engine `native/mineru/docling/legacy` | +| `parse_time` | `str` | Parse completion time; format: ISO-8601 UTC | +| `doc_title` | `str` | Document title (usually the first H1); optional | +| `doc_summary` | `str` | Document summary; optional | +| `doc_attributes` | `object` | Document extended attributes object; optional | +| `bbox_attributes` | `object` | Global bbox position attributes; see [§8](#8-positions) | + +> LightRAG requires that filenames (`document_name`) be unique within the same workspace (knowledge base). 
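
The `document_hash` definition in the meta table above can be sketched as follows. This is a minimal illustration rather than LightRAG's own code: the helper name `compute_document_hash` is hypothetical, and it assumes that "non-empty content lines" means `type="content"` records whose `content` field is a non-empty string, and that the merged text is encoded as UTF-8 before hashing.

```python
import hashlib
import json


def compute_document_hash(blocks_jsonl_text: str) -> str:
    """Return "sha256:<hex digest>" over the merged body text of a blocks.jsonl payload.

    Hypothetical helper for illustration; not part of the LightRAG API.
    """
    contents = []
    # Split on "\n" only, because JSONL records are newline-delimited.
    for raw_line in blocks_jsonl_text.split("\n"):
        if not raw_line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(raw_line)
        if record.get("type") != "content":
            continue  # the meta line does not contribute to the body
        text = record.get("content", "")
        if text:  # assumption: "non-empty" refers to the content field
            contents.append(text)
    merged_text = "\n\n".join(contents)
    # Assumption: UTF-8 encoding before hashing.
    digest = hashlib.sha256(merged_text.encode("utf-8")).hexdigest()
    return "sha256:" + digest


if __name__ == "__main__":
    sample = "\n".join(
        [
            json.dumps({"type": "meta", "format": "lightrag", "version": "1.0"}),
            json.dumps({"type": "content", "content": "1 Purpose"}),
            json.dumps({"type": "content", "content": "2 Scope"}),
        ]
    )
    print(compute_document_hash(sample))
```

An external consumer could compare the returned string with the `document_hash` stored in the meta line to check whether two `.parsed/` directories carry the same body text, without comparing the blocks line by line.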
+ +### 3.2 content line + +Each content line is the minimum addressable unit of an original document "block" and contains at least: + +```json +{ + "type": "content", + "blockid": "462c6364584a7ba4bdae6853f85ac429", + "format": "plain_text", + "content": "1 Product Purpose and Functions\nThe MI012 module is used to support the oxygen-supply and anti-gravity control function of the oxygen-supply and anti-gravity regulator...", + "heading": "1 Product Purpose and Functions", + "parent_headings": [], + "level": 1, + "session_type": "body", + "table_slice": "none", + "positions": [ + { + "type": "paraid", + "range": ["5EA4577A", "6555DDCB"] + } + ] +} +``` + +| Field | Meaning | +|---|---| +| `type` | `"content"` | +| `blockid` | Globally unique Block ID | +| `format` | Content form, currently fixed to `"plain_text"` | +| `content` | Text content; **equations and images appear as placeholder tags here, tables appear as JSON or HTML wrapped in table tags** (see §3.3) | +| `heading` | The top-most-level heading of the section containing this content. When `heading` is real, it should also appear at the beginning of `content`; if a heading is immediately followed by a heading at the next level, the next-level heading should be treated as body text. The goal is to ensure that concatenating the `content` fields of all blocks reconstructs the complete original text. | +| `parent_headings` | String array: the top-down list of ancestor headings, excluding the current `heading` | +| `level` | Integer: the level of `heading` in the document outline (`1` = H1 / first-level heading; `0` means no heading) | +| `session_type` | The region the block belongs to: `body` `preface` `TOC` `references` `appendix` | +| `table_slice` | Optional reserved field; indicates whether the block contains only a slice of a table. 
The current analysis engines do not split long tables, so this field is fixed to `"none"` (meaning the table will not be sliced) | +| `table_header` | Optional reserved field; when the current block is a table slice, this holds the recognized table header. Currently unused. | +| `positions` | Array of `position` objects: identifies the layout position of the text block; when the text block comes from multiple positions in the layout, multiple `position` objects appear. See [§8](#8-positions) | + +> - blockid computation: `md5(doc_id + ":" + block_index + ":" + heading + ":" + content)`. Chunks produced by chunking strategies record the blockid for tracing the chunk back to its location in the sidecar. +> - The chunking strategies `F` / `R` / `V` that ignore document section structure operate on the concatenated `content` fields. Therefore, concatenating the `content` fields of all blocks must form the complete document content — no content missing, no content overlapping. + +### 3.3 Inline placeholder tags inside content + +To let the P chunking strategy split body text without breaking multimodal objects, three XML-style placeholder tags are used inside `content`: + +| Tag | Meaning | Tag attributes | +|---|---|---| +| `…
` | Table placeholder; the body is the raw table JSON / HTML | `id` points to the corresponding item in `tables.json`; `format` ∈ `json` / `html` | +| `` | Self-closing drawing placeholder | `id` points to `drawings.json`; `path` is relative to the `*.parsed/` directory; `src` is the reference name in the original document | +| `` | Equation placeholder | Inline equations also use ``, but **without** `id`, and are not written to the sidecar; only block equations (occupying one or more entire lines) carry an `id` | + +When the text is fed to the LLM during entity/relation extraction, internal attributes such as `id / path / src` are stripped, but key attributes (`format / caption`) are preserved. The goal is to avoid extracting entities that are invisible in the article and injecting too much noise into the extraction results. + +### 3.4 Correspondence between blockid and chunk sidecar.refs + +When a sidecar file exists, the chunking strategies attach `sidecar = {"type": "block", "id": , "refs": [{"type": "block", "id": }, …]}` to each output chunk, where: + +- Unmerged chunk → `sidecar.refs` has only one element, equal to the `blockid` of the blocks.jsonl line the chunk came from; +- Chunk merged in Stage D → `refs` preserves the order of all source `blockid`s (deduplicated); +- Sub-chunks after hard fallback split → share the parent chunk's `sidecar`. + +This linkage is the basis for document-level traceability (chunk ↔ block ↔ original paragraph paraId). + +## 4. drawings.json + +The top level is a dict container of the form `{"version": "1.0", "drawings": { : , … }}`, **keyed by the `id` field** for lookup by id. 
Each item looks like: + +```json +{ + "id": "im-f1bee60173d067d88595c00e7d9b0ce5-0004", + "blockid": "2f52b70839d13a936d97955916820147", + "heading": "2.3 Structural Dimensions and Weight", + "format": "png", + "path": "m012-manual.blocks.assets/image4.png", + "src": "", + "caption": "", + "footnotes": [], + "extras": { + "ocr_texts": "First OCR paragraph inside the image\n\nSecond OCR paragraph inside the image", + "ocr_texts_count": 2 + }, + "surrounding": { + "leading": "2.3 Structural Dimensions and Weight\nDimensional and weight requirements are as follows:\na) Outer dimensions length: -` (`doc_hash` is the 32-character md5 portion of `doc_id` with the `doc-` prefix removed) | +| `blockid` | Points to the content line that produced this drawing | +| `heading` | The section heading the drawing belongs to | +| `format` | Original extension (no dot): `png` / `jpeg` / `gif` / `webp` / `wmf` / `emf` / … | +| `path` | Resource path relative to the `*.parsed/` directory; **always** points to a file inside `*.blocks.assets/` | +| `src` | The reference alias of the drawing in the original document (empty in most cases) | +| `caption` | Visible caption (the parser may leave it empty) | +| `footnotes` | List of footnote strings | +| `surrounding` | Context object: see [§7](#7-surrounding) | +| `self_ref` | String, optional; an object reference from the original parsing engine output (e.g., Docling JSON Pointer `#/pictures/3`, or MinerU `content_list.json#/23`), used to look up the original object in the parsing artifacts (page position, original structure, etc.) when tracing back. Not output by `native` and other engines that do not provide this field. | +| `extras` | Object, optional; engine-specific bypass fields (such as OCR text contained inside the image, etc.). Not part of spec validation; downstream consumers should not rely on specific keys. 
| +| `llm_analyze_result` | Modal analysis result object: see [§9](#9-llm_analyze_result) (will later be injected into the multimodal text block) | +| `llm_cache_list` | LLM cache list for modal analysis (will later be injected into the multimodal text block) | + +Common drawing-specific keys inside `extras`: + +| Key | Description | +|---|---| +| `ocr_texts` | String, optional; OCR text inside the drawing object, with multiple paragraphs concatenated by blank lines (`\n\n`). Only written when the parsing engine explicitly attaches OCR text under this drawing's children; caption / footnote do not enter this field. | +| `ocr_texts_count` | Integer, optional; number of non-empty OCR paragraphs written into `ocr_texts`. | + +**Only raster formats supported by drawings (png / jpeg / gif / webp) enter VLM analysis**; other formats (wmf / emf / svg, etc.) get `llm_analyze_result.status="skipped"`, no multimodal chunk is generated downstream, and document processing continues. Images larger than the size specified by the environment variable `VLM_MAX_IMAGE_BYTES` likewise will not enter VLM analysis. + +> Information such as image size and DPI is uniformly placed in the `extras` object; do not introduce undeclared fields (like `image` / `img_path`, etc.) at the item top level. tables / equations follow the same `extras` convention. `self_ref` is a top-level optional field declared by the spec and does not belong to `extras`. + +## 5. tables.json + +The top level is a dict container of the form `{"version": "1.0", "tables": { : , ... }}`, **keyed by the `id` field** for lookup by id. 
Each item looks like:
+
+```json
+{
+  "id": "tb-f1bee60173d067d88595c00e7d9b0ce5-0007",
+  "blockid": "3f33897b5e105d254addc655f1efbf8c",
+  "heading": "2.4.4 Temperature-Humidity-Altitude (run with the system)",
+  "dimension": [16, 8],
+  "format": "json",
+  "content": "[[\"Step\", \"Temperature (°C)\", \"Altitude (m)\", \"Relative humidity\", \"Time (min)\", \"Auxiliary cooling\", \"System power\", \"Functional/performance check\"],…",
+  "caption": "",
+  "footnotes": [],
+  "table_header": "[[\"Step\", \"Temperature (°C)\", \"Altitude (m)\", \"Relative humidity\", \"Time (min)\", \"Auxiliary cooling\", \"System power\", \"Functional/performance check\"]]",
+  "surrounding": {
+    "leading": "2.4.4 Temperature-Humidity-Altitude (run with the system)\nThe product shall withstand the combined temperature, humidity, and altitude environment during mission execution…",
+    "trailing": "\nNote: the above steps are repeated for 10 cycles. a) Finished product and accessories reach thermal stability or 240 min, whichever is longer; b) Finished product and accessories reach thermal stability or 120 min, whichever is longer.…"
+  },
+  "llm_analyze_result": {
+    "name": "Document management metadata table",
+    "description": "This is a document management information table used to record basic metadata and version control information for a technical document …",
+    "analyze_time": 1778697759,
+    "status": "success",
+    "message": ""
+  },
+  "llm_cache_list": [
+    "default:analysis:b316aacd40fdca0cb56430870bb89a62"
+  ]
+}
+```
+
+The `blockid` / `heading` / `surrounding` / `llm_analyze_result` fields of tables.json have the same meaning as in drawings.json. Different or newly added fields are described below:
+
+| Field | Description |
+|---|---|
+| `id` | Form `tb--` (`doc_hash` is the 32-character md5 portion of `doc_id` with the `doc-` prefix removed) |
+| `dimension` | Integer array: `[num_rows, num_cols]`, including header rows |
+| `format` | `"json"` (2D array) or `"html"` (payload `…
` fragment including the opening and closing tags) | +| `content` | String: the table body, structured according to `format`; this is the string actually used by the downstream multimodal chunk. | +| `table_header` | String, optional; the recognized row(s) treated as the table header | +| `self_ref` | Optional; object reference from the original parsing engine output (e.g., Docling JSON Pointer `#/tables/2`, or MinerU `content_list.json#/31`), used to look up the original artifact when tracing back | + +During the modal analysis stage, when the length of the `content` field exceeds the LLM's context window, the table content is mechanically truncated before being fed to the model. + +## 6. equations.json + +The top level is a dict container of the form `{"version": "1.0", "equations": { : , ... }}`, **keyed by the `id` field** for lookup by id. Each item looks like: + +```json +{ + "id": "eq-f1bee60173d067d88595c00e7d9b0ce5-0001", + "blockid": "2f52b70839d13a936d97955916820147", + "heading": "2.3 Structural Dimensions and Weight", + "format": "latex", + "content": "C=2∗\\frac{P∗T}{\\left( {V}_{H}^{2}−{V}_{L}^{2} \\right)∗η}", + "caption": "", + "footnotes": [], + "surrounding": { + "leading": "2.3 Structural Dimensions and Weight\nDimensional and weight requirements are as follows:\n …", + "trailing": "\nwhere P is the power maintained during power abnormalities 28 W, T is the desired energy-storage time, VH is before capacitor discharge…" + }, + "llm_analyze_result": { + "name": "Capacitor energy-storage time calculation formula", + "description": "This formula calculates the capacitor energy storage value required to maintain normal system operation during power abnormality …", + "analyze_time": 1778697783, + "status": "success", + "message": "", + "equation": "C=2\\cdot\\frac{P\\cdot T}{(V_{H}^{2}-V_{L}^{2})\\cdot\\eta}" + }, + "llm_cache_list": [ + "default:analysis:fcf4c4f88227ee1c1bf0ed4394039e37" + ] +} +``` + +The `blockid` / `heading` / `surrounding` / 
`llm_analyze_result` fields of equations.json have the same meaning as in drawings.json. Different or newly added fields are described below: + +| Field | Description | +|---|---| +| `id` | Form `eq--` (`doc_hash` is the 32-character md5 portion of `doc_id` with the `doc-` prefix removed) | +| `format` | Fixed to `"latex"` | +| `content` | String: the **raw** LaTeX (possibly containing Unicode operators, outer `\[ \]`); does not include the leading/trailing `$` delimiters; read directly by the modal analysis stage | +| `self_ref` | Optional; object reference from the original parsing engine output (e.g., Docling JSON Pointer `#/texts/15`, or MinerU `content_list.json#/45`), used to look up the original artifact when tracing back | +| `llm_analyze_result.equation` | String: the **canonicalized** LaTeX equation output by the LLM (outer `$ / \[ \] / equation` environment, Unicode converted to LaTeX, no leading/trailing `$` delimiters); this is the string actually used by the downstream multimodal chunk. | + +During the modal analysis stage, when the length of the `content` field exceeds the LLM's context window, the content is mechanically truncated before being fed to the model. Inline equations (those continuous with the body, as ``) **are not** saved to equations.json; they remain only in the blocks text without an `id`. The goal is to avoid injecting too much noise into the extraction results. + +## 7. surrounding + +`surrounding.leading` and `surrounding.trailing` are the analyzable context windows of a sidecar item; their purpose is to provide contextual information about the paragraph containing the image, table, or equation, improving the quality of multimodal analysis. 
**The surrounding content is automatically injected by LightRAG during the analysis stage; it does not need to be actively written into the sidecar by the document parsing engine.** The generation logic of the surrounding content is as follows: + +- Taken from the text of the content line with the same `blockid`, split at the position of the multimodal placeholder tag; +- The token limit on each side is controlled by the environment variables `SURROUNDING_LEADING_MAX_TOKENS` / `SURROUNDING_TRAILING_MAX_TOKENS` (default `2000`, can be tuned independently); truncated by tokenizer, preferring to retain sentences close to the target; +- The text preserves placeholder tags of **other multimodal objects on the same line**, allowing the model to perceive context such as "after Figure 1 there is also Equation 1"; but internal parser identifiers (`id` / `path` / `src` / `refid`) have been stripped by `strip_internal_multimodal_markup_for_extraction` — consistent with chunk content cleanup before entity extraction, to avoid noise entering the VLM/LLM prompt. Specific cleanup rules: + - `` → ``; **drawings without a caption are removed entirely** (the tag carries no model-visible information anymore); + - `rows
` → `rows
`; + - `body` → `body`; + - `Table 1` → `Table 1`; `Equation 2` → `Equation 2`. Only the `refid` attribute is removed; the `` wrapper is preserved — letting the VLM/LLM recognize "this is a reference to another table/equation" rather than ordinary text, while hiding the parser-internal id that the LLM cannot see. + - Exception: surrounding of the `tables.json` type first goes through `remove_table_tags` before stripping, removing all `` blocks entirely (when analyzing the target table, we don't want to be distracted by dangling references to other tables); +- Cleanup happens **before** token-budget truncation: the token count is computed on "what the LLM actually sees", and truncation does not land inside an uncleaned `id="…"` attribute, avoiding broken tag structure; +- When the target object itself sits at the start / end of the block, the corresponding side is `""` instead of `"n/a"` (when assembling the prompt, the empty string is later displayed as `n/a`); +- `enrich_sidecars_with_surrounding` is idempotent: each `analyze_multimodal` entry point recomputes and overwrites `surrounding`, so after changing `SURROUNDING_LEADING_MAX_TOKENS` / `SURROUNDING_TRAILING_MAX_TOKENS` there is no need to manually clean the sidecar — just re-run multimodal analysis and `surrounding` will be rewritten under the new budget. + +## 8. positions + +`positions` is an array of objects that identifies which piece of text in the file the `blockid` content comes from, allowing the original content to be located and displayed in the source file during content traceability. When the content of a `blockid` is composed of several columns from the layout, multiple `position` objects appear, with each `position` object corresponding to one layout box or column. To accommodate different document formats' content positioning approaches, the system supports the following types of `position` object. 
+
+A `position` object comes in several types; its `type` field selects which one:
+
+* paraid
+
+Applicable to docx-format files; locates content by `paragraph id` (paraid). The `range` field specifies the start and end `paragraph id`s; `charspan` is an optional field specifying that the content starts at character m and ends at character n of the paragraph. When `charspan` is not provided, the `blockid` covers the entire content of the start and end paragraphs. Example:
+
+```
+"positions": [
+{
+  "type": "paraid",
+  "range": ["5EA4577A", "6555DDCB"],
+  "charspan": [10, 999]
+}]
+```
+
+* bbox
+
+Applicable to PDF-like files; identifies the original position of the content via a rectangle on the page. bbox supports the following fields:
+
+```
+origin: The page corner that the rectangle coordinates are measured from (optional, defaults to LEFTTOP; the other option is LEFTBOTTOM)
+max: Maximum length and width of the page layout; coordinates are normalized by this value for accurate position display (optional; empty means coordinates are computed by the image's pixel grid)
+anchor: Page number, as a string, supporting non-Arabic page numbers such as Roman numerals
+range: Rectangle coordinate array [h1, w1, h2, w2], e.g., [174, 155, 818, 333]
+charspan: Content starts at character m and ends at character n of the anchored paragraph (optional)
+```
+
+The `bbox_attributes` field of the `meta` line in `blocks.jsonl` holds global bbox settings, avoiding repeating the same content in every `content` line's `positions` object. A typical `positions` object example:
+
+```
+"positions": [
+{
+  "type": "bbox",
+  "anchor": "ii",
+  "range": [174, 155, 818, 333],
+  "charspan": [10, 999]
+}]
+```
+
+* heading
+
+Applicable to Markdown-like files; locates content by heading.
`anchor` is the starting heading (for handling duplicated headings, refer to the Markdown anchor specification); `charspan` is an optional field specifying that the content starts at character m and ends at character n of the paragraph. When `charspan` is not provided, the `blockid` covers the entire content of the start and end paragraphs.
+
+```
+"positions": [
+{
+  "type": "heading",
+  "anchor": "ii",
+  "range": [174, 155, 818, 333],
+  "charspan": [10, 999]
+}]
+```
+
+* absolute
+
+Applicable to text-like files; locates content by absolute character position. `charspan` specifies that the content starts at character m and ends at character n.
+
+```
+"positions": [
+{
+  "charspan": [10, 999]
+}]
+```
+
+## 9. `llm_analyze_result`
+
+| `status` | Trigger scenario | Field description |
+|---|---|---|
+| `success` | The model returns valid JSON and all required fields are present | Drawing: `name / type / description`; Table: `name / description`; Equation: `name / description / equation` |
+| `skipped` | Multimodal analysis was deliberately skipped: image format unsupported, pixels < `VLM_MIN_IMAGE_PIXEL` (default 32 px), larger than `VLM_MAX_IMAGE_BYTES` (default 5 MB), or VLM not enabled | `message` records the skip reason |
+| `failure` | Required fields missing, JSON still invalid after repair, the VLM/EXTRACT role is not configured while the corresponding modality is enabled, or the model invocation throws an exception | `message` records the diagnostic |
+
+Additional notes:
+
+- `analyze_time` is epoch seconds and is present for every status;
+- `message` is **always an empty string** when `status="success"`, making filtering convenient;
+- Items for enabled modalities are recomputed on each `analyze_multimodal` run, and the current run overwrites any prior `llm_analyze_result` (`success`, `skipped`, or `failure`). This allows operators to fix VLM/EXTRACT configuration and retry without manually clearing stale sidecar results.
LLM calls still use the analysis cache: if the cache key matches, the provider is not called and semantic fields usually remain the same, though runtime fields such as `analyze_time` are rewritten. A cache miss, for example after changing the effective role model/binding/host, prompt inputs, or image metadata, can produce different saved content. + +Drawing `type` is constrained to a 12-value enum (see [`IMAGE_TYPE_ENUM`](../lightrag/prompt_multimodal.py): `Photo / Illustration / Screenshot / Icon / Chart / Table / Infographic / Flowchart / Chat Log / Wireframe / Texture / Other`); values returned by the model outside the enum are normalized to `Other` rather than failing. diff --git a/docs/LightRAG_concurrent_explain.md b/docs/LightRAG_concurrent_explain.md deleted file mode 100644 index 8551ad2641..0000000000 --- a/docs/LightRAG_concurrent_explain.md +++ /dev/null @@ -1,114 +0,0 @@ -## LightRAG Multi-Document Processing: Concurrent Control Strategy - -LightRAG employs a multi-layered concurrent control strategy when processing multiple documents. This article provides an in-depth analysis of the concurrent control mechanisms at document level, chunk level, and LLM request level, helping you understand why specific concurrent behaviors occur. - -### 1. Document-Level Concurrent Control - -**Control Parameter**: `max_parallel_insert` - -This parameter controls the number of documents processed simultaneously. The purpose is to prevent excessive parallelism from overwhelming system resources, which could lead to extended processing times for individual files. Document-level concurrency is governed by the `max_parallel_insert` attribute within LightRAG, which defaults to 2 and is configurable via the `MAX_PARALLEL_INSERT` environment variable. `max_parallel_insert` is recommended to be set between 2 and 10, typically `llm_model_max_async/3`. 
Setting this value too high can increase the likelihood of naming conflicts among entities and relationships across different documents during the merge phase, thereby reducing its overall efficiency. - -### 2. Chunk-Level Concurrent Control - -**Control Parameter**: `llm_model_max_async` - -This parameter controls the number of chunks processed simultaneously in the extraction stage within a document. The purpose is to prevent a high volume of concurrent requests from monopolizing LLM processing resources, which would impede the efficient parallel processing of multiple files. Chunk-Level Concurrent Control is governed by the `llm_model_max_async` attribute within LightRAG, which defaults to 4 and is configurable via the `MAX_ASYNC` environment variable. The purpose of this parameter is to fully leverage the LLM's concurrency capabilities when processing individual documents. - -In the `extract_entities` function, **each document independently creates** its own chunk semaphore. Since each document independently creates chunk semaphores, the theoretical chunk concurrency of the system is: -$$ -ChunkConcurrency = Max Parallel Insert × LLM Model Max Async -$$ -For example: -- `max_parallel_insert = 2` (process 2 documents simultaneously) -- `llm_model_max_async = 4` (maximum 4 chunk concurrency per document) -- Theoretical chunk-level concurrent: 2 × 4 = 8 - -### 3. Graph-Level Concurrent Control - -**Control Parameter**: `llm_model_max_async * 2` - -This parameter controls the number of entities and relations processed simultaneously in the merging stage within a document. The purpose is to prevent a high volume of concurrent requests from monopolizing LLM processing resources, which would impede the efficient parallel processing of multiple files. Graph-level concurrency is governed by the `llm_model_max_async` attribute within LightRAG, which defaults to 4 and is configurable via the `MAX_ASYNC` environment variable. 
Graph-level parallelism control parameters are equally applicable to managing parallelism during the entity relationship reconstruction phase after document deletion. - -Given that the entity relationship merging phase doesn't necessitate LLM interaction for every operation, its parallelism is set at double the LLM's parallelism. This optimizes machine utilization while concurrently preventing excessive queuing resource contention for the LLM. - -### 4. LLM-Level Concurrent Control - -**Control Parameter**: `llm_model_max_async` - -This parameter governs the **concurrent volume** of LLM requests dispatched by the entire LightRAG system, encompassing the document extraction stage, merging stage, and user query handling. - -LLM request prioritization is managed via a global priority queue, which **systematically prioritizes user queries** over merging-related requests, and merging-related requests over extraction-related requests. This strategic prioritization **minimizes user query latency**. - -LLM-level concurrency is governed by the `llm_model_max_async` attribute within LightRAG, which defaults to 4 and is configurable via the `MAX_ASYNC` environment variable. - -### 5. Complete Concurrent Hierarchy Diagram - -```mermaid -graph TD -classDef doc fill:#e6f3ff,stroke:#5b9bd5,stroke-width:2px; -classDef chunk fill:#fbe5d6,stroke:#ed7d31,stroke-width:1px; -classDef merge fill:#e2f0d9,stroke:#70ad47,stroke-width:2px; - -A["Multiple Documents
max_parallel_insert = 2"] --> A1 -A --> B1 - -A1[DocA: split to n chunks] --> A_chunk; -B1[DocB: split to m chunks] --> B_chunk; - -subgraph A_chunk[Extraction Stage] - A_chunk_title[Entity Relation Extraction
llm_model_max_async = 4]; - A_chunk_title --> A_chunk1[Chunk A1]:::chunk; - A_chunk_title --> A_chunk2[Chunk A2]:::chunk; - A_chunk_title --> A_chunk3[Chunk A3]:::chunk; - A_chunk_title --> A_chunk4[Chunk A4]:::chunk; - A_chunk1 & A_chunk2 & A_chunk3 & A_chunk4 --> A_chunk_done([Extraction Complete]); -end - -subgraph B_chunk[Extraction Stage] - B_chunk_title[Entity Relation Extraction
llm_model_max_async = 4]; - B_chunk_title --> B_chunk1[Chunk B1]:::chunk; - B_chunk_title --> B_chunk2[Chunk B2]:::chunk; - B_chunk_title --> B_chunk3[Chunk B3]:::chunk; - B_chunk_title --> B_chunk4[Chunk B4]:::chunk; - B_chunk1 & B_chunk2 & B_chunk3 & B_chunk4 --> B_chunk_done([Extraction Complete]); -end -A_chunk -.->|LLM Request| LLM_Queue; - -A_chunk --> A_merge; -B_chunk --> B_merge; - -subgraph A_merge[Merge Stage] - A_merge_title[Entity Relation Merging
llm_model_max_async * 2 = 8]; - A_merge_title --> A1_entity[Ent a1]:::merge; - A_merge_title --> A2_entity[Ent a2]:::merge; - A_merge_title --> A3_entity[Rel a3]:::merge; - A_merge_title --> A4_entity[Rel a4]:::merge; - A1_entity & A2_entity & A3_entity & A4_entity --> A_done([Merge Complete]) -end - -subgraph B_merge[Merge Stage] - B_merge_title[Entity Relation Merging
llm_model_max_async * 2 = 8]; - B_merge_title --> B1_entity[Ent b1]:::merge; - B_merge_title --> B2_entity[Ent b2]:::merge; - B_merge_title --> B3_entity[Rel b3]:::merge; - B_merge_title --> B4_entity[Rel b4]:::merge; - B1_entity & B2_entity & B3_entity & B4_entity --> B_done([Merge Complete]) -end - -A_merge -.->|LLM Request| LLM_Queue["LLM Request Prioritized Queue
llm_model_max_async = 4"]; -B_merge -.->|LLM Request| LLM_Queue; -B_chunk -.->|LLM Request| LLM_Queue; - -``` - -> The extraction and merge stages share a global prioritized LLM queue, regulated by `llm_model_max_async`. While numerous entity and relation extraction and merging operations may be "actively processing", **only a limited number will concurrently execute LLM requests** the remainder will be queued and awaiting their turn. - -### 6. Performance Optimization Recommendations - -* **Increase LLM Concurrent Setting based on the capabilities of your LLM server or API provider** - -During the file processing phase, the performance and concurrency capabilities of the LLM are critical bottlenecks. When deploying LLMs locally, the service's concurrency capacity must adequately account for the context length requirements of LightRAG. LightRAG recommends that LLMs support a minimum context length of 32KB; therefore, server concurrency should be calculated based on this benchmark. For API providers, LightRAG will retry requests up to three times if the client's request is rejected due to concurrent request limits. Backend logs can be used to determine if LLM retries are occurring, thereby indicating whether `MAX_ASYNC` has exceeded the API provider's limits. - -* **Align Parallel Document Insertion Settings with LLM Concurrency Configurations** - -The recommended number of parallel document processing tasks is 1/4 of the LLM's concurrency, with a minimum of 2 and a maximum of 10. Setting a higher number of parallel document processing tasks typically does not accelerate overall document processing speed, as even a small number of concurrently processed documents can fully utilize the LLM's parallel processing capabilities. Excessive parallel document processing can significantly increase the processing time for each individual document. 
Since LightRAG commits processing results on a file-by-file basis, a large number of concurrent files would necessitate caching a substantial amount of data. In the event of a system error, all documents in the middle stage would require reprocessing, thereby increasing error handling costs. For instance, setting `MAX_PARALLEL_INSERT` to 3 is appropriate when `MAX_ASYNC` is configured to 12. diff --git a/docs/OfflineDeployment.md b/docs/OfflineDeployment.md index 7cf6efd6c9..1634cc06af 100644 --- a/docs/OfflineDeployment.md +++ b/docs/OfflineDeployment.md @@ -216,7 +216,6 @@ python -c "from lightrag import LightRAG; print('✓ LightRAG imported')" python -c "from lightrag.utils import TiktokenTokenizer; t = TiktokenTokenizer(); print('✓ Tiktoken working')" # Test optional dependencies (if installed) -python -c "import docling; print('✓ Docling available')" python -c "import redis; print('✓ Redis available')" ``` diff --git a/docs/ParagraphSemanticChunking-zh.md b/docs/ParagraphSemanticChunking-zh.md new file mode 100644 index 0000000000..dc1181ebc7 --- /dev/null +++ b/docs/ParagraphSemanticChunking-zh.md @@ -0,0 +1,404 @@ +# Paragraph Semantic 分块策略 + +## 1. 适用场景与策略选择 + +### 1.1 P 策略要解决什么问题 + +Paragraph Semantic Chunking(下文简称 **P 策略**)面向 DOCX 等具有清晰章节结构的文档。其核心目标是:**让分块边界尽可能对齐文档原生的语义边界**(标题、段落、表格行),而不是仅由 token 长度计数决定切点。 + +P 策略主要解决以下四类问题: + +1. **表格语境断裂**:大表被拆分后,首尾切片容易脱离前置说明、后置解释或中间桥接文字,召回时无法独立理解。 +2. **层级信息利用不足**:仅看相邻段落的方法无法利用父标题路径、同级条款之间的关系。 +3. **细碎章节尺寸失衡**:规章、标准、合同等文档常包含大量 100~300 token 的细碎条款,若不合并则块过短、语义稀薄;若仅按相邻长度合并又会跨主题污染。 +4. 
**长块二次拆分破坏结构**:章节过长时,常规字符切分会忽略表格行边界和标题层级。 + +P 策略仅对 `native` 抽取引擎生成的 `.blocks.jsonl` 结构化产物有效;对非结构化输入会自动降级为 R 策略(见 §8)。 + +### 1.2 P / R / V 三种策略对比 + +| 维度 | R 策略(Recursive) | V 策略(SemanticVector) | P 策略(ParagraphSemantic) | +|---|---|---|---| +| 切分依据 | 字符分隔符级联(段落 → 换行 → 中文标点 → 空格 → 字符)+ token 预算 | 句子级 embedding 距离阈值(百分位 / 标准差 / 四分位距 / 梯度)寻找语义断层 | DOCX outline level 与 `parent_headings` + 表格行边界 + 锚点 + 层级感知合并 | +| 块大小控制 | `chunk_token_size` 硬上限 | `chunk_token_size` 仅为 advisory ceiling,超限时通过 R 二次切分 | `target_max` 硬上限 + `target_ideal` 软目标 + 表格阈值 + 尾部吸收阈值多重协同 | +| 表格处理 | 不感知表格,可能在表格中间切断 | 不感知表格 | 表格小于 `table_max` 保持完整;大表按 JSON 行数组 / HTML `` 行边界切片,并重新包裹为合法 `` | +| 表格上下文 | 依赖窗口偶然覆盖 | 依赖 embedding 距离 | 首切片粘连前置说明、末切片粘连后置解释、连续大表桥接文字双向重叠 | +| 块间重叠 | 全局 `chunk_overlap_token_size` | 不会出现重叠 | 章节边界不会重叠;同章节长正文 fallback 到 R 时按 `CHUNK_P_OVERLAP_SIZE` 重叠;连续大表桥接文字可同时进入前后两个表格块 | +| heading 元数据 | 通常无 | 通常无 | 继承或提升 heading;拆分后追加 `[part n]` 后缀;保留 `parent_headings` 和 `level` | +| 嵌入计算开销 | 无 | 高(需对每个句子计算 embedding) | 无 | +| 依赖输入 | 任意文本 | 任意文本 + Embedding 模型 | 必须有 `.blocks.jsonl` sidecar(即 `native` 引擎抽取结果),否则降级为 R | + +### 1.3 怎么选 + +| 场景 | 推荐 | 理由 | +|---|---|---| +| DOCX 且章节层级清晰、含大表格、含细碎条款 | **P** | 充分利用标题层级与表格行边界,块边界最贴合语义;避免跨主题污染 | +| 文档以散文 / 评论 / 长篇正文为主,没有明确章节结构 | **V** | 按语义相似度切分能在话题切换点形成自然边界,比字符切分更稳定 | +| 输入是纯文本、Markdown、代码、日志,或追求最低算力开销 | **R** | 无嵌入开销,分隔符级联对中英文混合文本足够稳定 | +| 通用配置(不确定文件类型) | **R** | P 在无 sidecar 时自动降级到 R;V 在无 Embedding 模型时也降级到 R | +| 标题样式混乱、正文中大量伪标题的文档 | **R** 或 **V** | P 依赖 native parser 正确识别标题,标题错乱会导致基础块边界偏移 | +| 单行超大表格或不可解析表格 | 任意 | 三种策略最终都会走字符级 fallback;P 仍保留表格上下文粘连优势 | + +### 1.4 P 策略的代价 + +- 必须搭配 `native` 引擎:在 `LIGHTRAG_PARSER` 中显式声明,例如 `docx:native-P`;否则即使写了 `P`,也会因为缺少 `.blocks.jsonl` 退化到 R。 +- 仅支持 DOCX:其他格式没有 `.blocks.jsonl` 产物。 +- 算法路径多、阈值多:调试时需要先确认输入 sidecar 是否正确,再看各阶段输出。 + +## 2. 
工作原理总览 + +P 策略以 native parser 在 `fixlevel=0` 模式下产生的 `.blocks.jsonl` 为输入,**每个 `type == "content"` 行被视为一个标题级基础块**,然后在该基础上执行表格切片、长块拆分和层级合并: + +```text +DOCX + ↓ native parser (fixlevel=0) +.blocks.jsonl + sidecar (.tables.json / .equations.json / .drawings.json / .blocks.assets/) + ↓ Stage B:超大表格按行边界切片并赋予 first/middle/last 角色 + ↓ Stage B.1:连续大表之间桥接文字双向重叠 + ↓ Stage C:锚点驱动的长文本块再切分 + ↓ Stage D:层级感知的双相位合并 + ↓ Stage E:[part n] 行级来源追溯编号 +最终 chunk 列表 +``` + +**P 策略的关键不变量**: + +1. **章节边界不会重叠**:不同 `.blocks.jsonl` 内容行之间的文本绝不会被复制到对方块里,避免“张冠李戴”。 +2. **章节内长正文可重叠**:同一个内容行内拆分的多个片段允许按 `chunk_overlap_token_size` 保留 R 风格 overlap,减少长正文中途切断。 +3. **表格之间桥接文字可双向重叠**:唯一的跨段落复制场景,专门服务连续大表的上下文保留。 +4. **表格行不互相重叠**:行级切片本身是非重叠的,与 R 的 overlap 概念不同。 + +## 3. 输入与输出 + +### 3.1 输入 + +`chunking_by_paragraph_semantic()` 接收以下输入: + +| 参数 | 来源 | 说明 | +|---|---|---| +| `content` | `full_docs[doc_id].content` | 拼接后的合并文本,用于 sidecar 缺失时降级 | +| `blocks_path` | `full_docs[doc_id].lightrag_document_path` | `.blocks.jsonl` 路径,是 P 策略的主输入 | +| `chunk_token_size` | `chunk_options.chunk_token_size` / `CHUNK_P_SIZE` | 目标硬上限 N,默认 `2000` | +| `chunk_overlap_token_size` | `CHUNK_P_OVERLAP_SIZE` / `chunk_overlap_token_size` | 同一内容行内长正文 fallback 与表格桥接预算的上限,默认 `100` | +| `tokenizer` | LightRAG 已解析好的 tokenizer | 所有 token 计数与文本 overlap 截取的基准 | + +P 策略**不接收** `split_by_character` / `split_by_character_only`,因为正常路径由标题和段落结构驱动。 + +### 3.2 `.blocks.jsonl` 约定 + +P 策略只处理 `type == "content"` 行。每个内容行通常包含: + +- `content`:该标题下的正文文本,可能包含普通段落、`
` 标签、`` 公式、`` 图形。 +- `heading`:当前标题。 +- `parent_headings`:父级标题链。 +- `level`:标题级别(1~9,对应原始 outline level 0~8)。 +- `positions`:原始段落定位(用于追溯)。 + +native parser 的 `fixlevel=0` 模式保证「一条标题下的正文作为一个基础块」,不在解析阶段做 token 阈值拆分。表格保持完整插入到 `content` 中。 + +### 3.3 输出 + +最终输出为有序 chunk 列表,每个元素: + +```python +{ + "tokens": int, # 真实 token 数(合并后会复测) + "content": str, # 块文本(可能包含
标签) + "chunk_order_index": int, # 块顺序索引 + "heading": str, # 拆分后追加 [part n] 后缀 + "parent_headings": list[str], # 父级标题链,不追加后缀 + "level": int, # 标题层级 +} +``` + +实现内部还会临时使用 `paragraphs`、`table_chunk_role`、`uuid`、`uuid_end`、`type` 等字段辅助拆分和合并,但**不会进入最终输出**。 + +### 3.4 `[part n]` 后缀规则 + +- 同一个原始 `.blocks.jsonl` 内容行被拆成多个片段时,所有片段的 `heading` 字段追加 `[part 1]`、`[part 2]` … +- 未发生拆分的内容行保持原 heading 不变。 +- `parent_headings` 不追加后缀。 +- 编号在每个原始内容行内**独立重置**。 +- 旧的 `[表格片段N]` 后缀已统一由 `[part n]` 替代。 + +## 4. 关键阈值 + +P 策略的阈值不是固定常量,而是按 `chunk_token_size`(记为 N)动态推导: + +| 名称 | 计算式 | N = 2000 时取值 | 技术含义 | +|---|---|---:|---| +| `target_max` | N | 2000 | 文本块硬上限 | +| `target_ideal` | 0.75 × N | 1500 | 文本块理想目标,达到此值后停止参与普通同级合并 | +| `table_max` | 0.625 × N | 1250 | 表格触发切片阈值 | +| `table_ideal` | 0.375 × N | 750 | 表格切片理想大小 | +| `table_min_last` | 0.32 × `table_max` | 400 | 表格末片回吞阈值(小于此值且能合并则回吞至前一切片) | +| `small_tail_threshold` | 0.125 × N | 250 | 尾部碎块吸收阈值 | +| `max_anchor_candidate_length` | 固定 | 100 字符 | 长块拆分锚点候选段落长度上限 | + +比例约束关系:`table_max < target_ideal < target_max`、`table_ideal < table_max`。这些比例源自审计模式经验值(`大块 8000、小表 5000、理想表 3000、表格尾块 1600`),现按 `chunk_token_size` 等比缩放。 + +## 5. Stage A:标题级基础块 + +标题识别由 native parser 完成,**P chunker 自身不扫描 docx body、也不判断标题样式**。 + +native parser 在 `fixlevel=0` 模式下: + +1. 读取 `styles.xml`,按 `` 建立样式继承链,回溯有效 ``。 +2. 遍历 `document.xml` 段落,沿继承链解析大纲级别;原始 outline level 0~8 映射为内部 `level` 1~9。 +3. 维护 `current_heading_stack`,遇新标题时清理不浅于当前 level 的旧标题,计算 `parent_headings`。 +4. 将表格、公式、图形分别提取为单行标签(`
...
` 等),写入对应 sidecar。 +5. 所有可识别标题均触发基础块边界,**不**执行 token 阈值拆分。 + +P chunker 直接读取 `.blocks.jsonl`,每个 content 行作为后续 Stage B/C 的独立处理单元。这意味着 `[part n]` 编号按每个原始 content 行独立重置。 + +## 6. Stage B:超大表格行边界切片 + +Stage B 只处理 token 数超过 `table_max` 的表格。其目标**不是单纯拆表**,而是在行边界优先拆分的基础上保留表格边界上下文。 + +### 6.1 行边界优先切片 + +- `format="json"`:按 JSON 顶层行数组切片。 +- `format="html"`:按 `...` 行切片。 +- 未显式标注但内容可嗅探为 JSON / HTML 的表格同样按上述规则处理。 + +切片前预扣 `
` 外壳 token 开销,使重新包裹后的切片尽量不超过 `table_max`。每个切片重新包裹为合法的 `` 标签,便于下游解析。 + +### 6.2 行级递归二次切片 + +若某个行子集重新包裹后仍超过 `table_max`,则在该行子集内继续细分。**只有切片已经收敛到单行、且该单行自身超过限制时,才退化为字符级切分**。该机制使可被行边界表达的表格内容尽量保留合法表格结构。 + +### 6.3 末片回吞 + +若表格末片 token 数低于 `table_min_last`,且与前一切片合并后不超过 `table_max`,则将末片回吞至前一切片,减少无效短表格块。 + +### 6.4 表格切片角色与物理粘连 + +每个表格切片被赋予内部字段 `table_chunk_role`,并按角色决定与周围段落的粘连方式: + +| 角色 | 含义 | 粘连策略 | +|---|---|---| +| `first` | 原始表格的首切片 | 追加到当前累积块尾部,使表格**前置说明**与首切片进入同一块 | +| `middle` | 原始表格的中间切片 | 独立输出,避免与无关正文合并 | +| `last` | 原始表格的末切片 | 作为新累积块起点,使**后置解释**自动追加到末切片之后 | +| `none` | 非表格切片或未拆分的完整表格 | 按普通文本块处理 | + +`table_chunk_role` 是内部字段,最终输出不会保留,**但在 Stage D 中继续作为合并约束使用**(见 §9.1)。 + +## 7. Stage B.1:连续大表桥接文字双向重叠 + +当同一原始内容行中出现「大表 A、短桥接文字、大表 B」的模式,且两张表均被拆分时,桥接文字按上下文预算进行双向分配: + +1. 将桥接文字按 token 编码。 +2. 计算左侧预算 `prev_budget = min(chunk_overlap_token_size, target_max - 左侧末切片当前 token 数)`。 +3. 计算右侧预算 `next_budget = min(chunk_overlap_token_size, target_max - 右侧首切片当前 token 数)`。 +4. **若桥接文字长度同时不超过两侧预算**:左右两个表格边界块都包含**完整桥接文字**。 +5. **若桥接文字较长**:前缀进入左侧末切片块,后缀进入右侧首切片块;超出两侧预算的中间段独立成为普通文本块。 + +单侧预算还会被限制到不超过 `chunk_token_size / 2`,避免桥接文字主导整个块。 + +这与普通相邻 chunk overlap 的差异: + +- 普通 overlap 按前后顺序复制字符或 token,与边界类型无关。 +- B.1 机制以表格切片角色为触发条件,把桥接文字同时作为左表后文上下文和右表前文上下文,避免桥接说明只归属一侧表格或被单独切散后难以召回。 + +## 8. 
Stage C:锚点驱动的长文本块再切分 + +Stage C 处理 Stage B 后仍超过 `target_max` 的内容块。 + +### 8.1 短段落锚点 + +把内容按段落恢复,选择满足以下条件的段落作为候选锚点: + +- 段落不是表格(不以 `.docx.parsed/.blocks.jsonl +``` + +若文件不存在或为空,P 策略会整体降级为 R,不会获得 P 的任何收益。常见原因: + +- 未配置 `LIGHTRAG_PARSER=docx:native-...`。 +- 解析失败(看 `pipeline_status` 错误条目)。 +- 文档不是 DOCX(其他格式不支持 P)。 + +### 12.2 检查 blocks.jsonl 内容 + +每行一个 JSON,过滤 `type == "content"` 后查看 heading / level / parent_headings 是否符合预期: + +```bash +jq -c 'select(.type=="content") | {level, heading, parent_headings}' \ + INPUT/__parsed__/.docx.parsed/.blocks.jsonl | head +``` + +若 heading 大量为空或 level 异常,说明 native parser 没正确识别标题样式 —— 此时 P 策略的层级合并和锚点提升都会失效。 + +### 12.3 检查最终 chunks + +查看 `text_chunks` 存储中的 chunk 元数据: + +```bash +jq '.[] | {heading, level, tokens, parent_headings}' \ + rag_storage/kv_store_text_chunks.json | head -30 +``` + +应观察到: + +- 大表前后块的 heading 通常对应 `[part 1]` / `[part n]`(说明 Stage B 拆分发生)。 +- 细碎条款被合并到接近 `target_ideal` 的块(说明 Stage D 生效)。 +- `parent_headings` 在不同章节切换处发生跳变,同章节内保持稳定。 + +### 12.4 块尺寸分布检验 + +理想分布:大多数 chunk 落在 `[target_ideal, target_max]` 区间(即 N=2000 时约 1500~2000 token);明显偏小的块通常是 `middle` 表格切片(锁定独立)或紧靠章节边界的尾块。 + +若出现大量低于 `small_tail_threshold` 的尾块,可能是: + +- 父标题路径一致性约束过严(不同 `parent_headings` 的相邻小块无法合并)。 +- 大量 `middle` 表格切片堆积(表格本身就很大)。 + +## 13. 错误调试 + +### 13.1 P 没生效,输出与 R 一致 + +按以下顺序排查: + +1. `full_docs[doc_id].process_options` 是否包含 `P`? +2. `full_docs[doc_id].parse_format` 是否为 `lightrag`?若为 `raw`,说明走的是 legacy 路径,P 会自动降级到 R。 +3. `lightrag_document_path` 指向的 `.blocks.jsonl` 是否存在、是否非空? +4. 日志中是否有 `paragraph_semantic ... fallback to recursive_character` 字样? + +### 13.2 表格被切散、前后说明分离 + +- 检查表格是否真的被识别为 `
` 或 `
`(看 `.blocks.jsonl`)。未识别格式的表格只能走字符切分,无法启动 Stage B 的角色机制。 +- 检查表格 token 数是否真的超过 `table_max`。低于阈值的表格保持完整,不会触发首/中/末切片。 +- 若是连续大表,确认两张表之间的桥接文字是否在**同一 content 行**内 —— 跨 content 行的桥接不参与 B.1 双向重叠。 + +### 13.3 细碎条款没有被合并 + +- 检查相邻条款的 `parent_headings` 是否一致:父标题路径一致性约束会阻止跨主题合并。 +- 检查 `level` 是否一致:同级合并要求相同 `level`,跨级吸收只允许浅吸深。 +- 检查中间是否插入了 `middle` 表格切片:会阻断尾部整批吸收。 + +### 13.4 出现单个超过 `target_max` 的块 + +正常情况下 Stage D 的真实 token 复测会拒绝超限合并,但以下场景仍可能出现超限块: + +- 单行表格自身超过 `target_max`,无锚点可拆,最终走 R 字符切分但单 chunk 仍超限。 +- `enforce_chunk_token_limit_before_embedding` 在 embedding 前会做最后的硬切分,下游不会真把超限 chunk 嵌入向量库。 + +### 13.5 `[part n]` 后缀异常 + +- 同一原始 content 行拆出多片但只看到一个 `[part 1]`:检查是否在 Stage D 中被合并 —— 合并后保留主块的 part 后缀,不拼接多个。 +- 出现旧式 `[表格片段N]` 后缀:说明使用了旧版 chunker 输出的数据,新版统一为 `[part n]`,需要重新分块。 + +### 13.6 日志关键字 + +P 策略相关日志关键字(用于 `grep` 排查): + +- `paragraph_semantic` — 模块入口 +- `fallback to recursive_character` — 整体或单段落降级 +- `table_chunk_role` — 表格角色相关 +- `bridge` — Stage B.1 桥接文字处理 +- `anchor` — Stage C 锚点选择 diff --git a/docs/ParagraphSemanticChunking.md b/docs/ParagraphSemanticChunking.md new file mode 100644 index 0000000000..fda1cc6b9a --- /dev/null +++ b/docs/ParagraphSemanticChunking.md @@ -0,0 +1,404 @@ +# Paragraph Semantic Chunking Strategy + +## 1. Use Cases and Strategy Selection + +### 1.1 What the P Strategy Solves + +Paragraph Semantic Chunking (hereafter the **P strategy**) targets documents with clear sectional structure such as DOCX. Its core goal is to **align chunk boundaries with the document's native semantic boundaries** (headings, paragraphs, table rows) as much as possible, rather than determining cut points solely from token-length counting. + +The P strategy is mainly designed to address the following four categories of problems: + +1. 
**Table context fragmentation**: When a large table is split, its head and tail slices easily become detached from the preceding description, following explanation, or intermediate bridging text, making them impossible to understand independently during recall. +2. **Insufficient utilization of hierarchical information**: Methods that only look at neighboring paragraphs cannot leverage parent heading paths or relationships between sibling clauses. +3. **Imbalanced sizes of fine-grained sections**: Regulations, standards, contracts, etc., often contain many fine-grained clauses of 100–300 tokens. Without merging, chunks become too short and semantically thin; merging by adjacent length alone causes cross-topic pollution. +4. **Long-chunk re-splitting breaks structure**: When sections are excessively long, ordinary character splitting ignores table row boundaries and heading hierarchy. + +The P strategy is effective only for the `.blocks.jsonl` structured artifacts produced by the `native` extraction engine; for unstructured inputs, it automatically falls back to the R strategy (see §8). 
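
As a rough illustration of the fallback condition above, the following sketch checks whether a `.blocks.jsonl` file exists and holds at least one `type == "content"` line before assuming the P path applies. The helper names `has_content_blocks` and `pick_strategy` are hypothetical and are not part of LightRAG; the real fallback logic may apply additional criteria.

```python
import json
import os
import tempfile


def has_content_blocks(blocks_path):
    """Return True if the file exists and holds at least one content line.

    Hypothetical pre-check for operators, not LightRAG's own implementation.
    """
    if not blocks_path or not os.path.isfile(blocks_path):
        return False
    with open(blocks_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore malformed lines in this sketch
            if isinstance(record, dict) and record.get("type") == "content":
                return True
    return False


def pick_strategy(blocks_path):
    """Assumed decision rule: P needs a usable sidecar, otherwise R."""
    return "P" if has_content_blocks(blocks_path) else "R"


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "demo.blocks.jsonl")
    with open(good, "w", encoding="utf-8") as fh:
        row = {"type": "content", "heading": "1 Scope", "level": 1}
        fh.write(json.dumps(row) + "\n")
    empty = os.path.join(tmp, "empty.blocks.jsonl")
    open(empty, "w", encoding="utf-8").close()
    results = (pick_strategy(good), pick_strategy(empty), pick_strategy(None))

print(results)
```

Running a check like this before ingestion can help explain why a document configured for P ended up with R-style chunks.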
+ +### 1.2 Comparison of P / R / V Strategies + +| Dimension | R Strategy (Recursive) | V Strategy (SemanticVector) | P Strategy (ParagraphSemantic) | +|---|---|---|---| +| Splitting basis | Cascading character separators (paragraph → newline → Chinese punctuation → whitespace → character) + token budget | Sentence-level embedding distance thresholds (percentile / standard deviation / IQR / gradient) to locate semantic breaks | DOCX outline level with `parent_headings` + table row boundaries + anchors + hierarchy-aware merging | +| Chunk size control | `chunk_token_size` hard cap | `chunk_token_size` is merely an advisory ceiling; when exceeded, secondary splitting via R | `target_max` hard cap + `target_ideal` soft target + table threshold + tail-absorption threshold working in concert | +| Table handling | Table-unaware; may cut in the middle of a table | Table-unaware | Tables smaller than `table_max` are kept intact; large tables are sliced by JSON row array / HTML `` row boundaries and re-wrapped as valid `
` | +| Table context | Relies on incidental window coverage | Relies on embedding distance | First slice glues to preceding description, last slice glues to following explanation; bidirectional overlap of bridging text between consecutive large tables | +| Inter-chunk overlap | Global `chunk_overlap_token_size` | No overlap | No overlap across section boundaries; within the same section, long body falls back to R with overlap by `CHUNK_P_OVERLAP_SIZE`; bridging text between consecutive large tables may enter both the preceding and following table chunks | +| Heading metadata | Usually none | Usually none | Inherits or promotes heading; appends `[part n]` suffix after splitting; preserves `parent_headings` and `level` | +| Embedding compute cost | None | High (must compute embedding per sentence) | None | +| Input requirements | Any text | Any text + Embedding model | Must have a `.blocks.jsonl` sidecar (i.e., result of the `native` engine); otherwise falls back to R | + +### 1.3 How to Choose + +| Scenario | Recommended | Rationale | +|---|---|---| +| DOCX with clear sectional hierarchy, large tables, fine-grained clauses | **P** | Fully leverages heading hierarchy and table row boundaries; chunk boundaries best match semantics; avoids cross-topic pollution | +| Documents dominated by prose / commentary / long body without clear sectional structure | **V** | Splitting by semantic similarity forms natural boundaries at topic shifts, more stable than character splitting | +| Inputs are plain text, Markdown, code, logs, or you want minimum compute overhead | **R** | No embedding overhead; cascading separators are stable enough for mixed Chinese-English text | +| General configuration (uncertain about file types) | **R** | P automatically falls back to R when no sidecar is present; V also falls back to R when no Embedding model is available | +| Documents with chaotic heading styles and many pseudo-headings in body | **R** or **V** | P depends on the native parser 
correctly identifying headings; messy headings cause basic chunk boundaries to shift | +| Single-line giant tables or unparsable tables | Any | All three strategies eventually fall back to character-level splitting; P still retains the advantage of table context gluing | + +### 1.4 Costs of the P Strategy + +- Must be paired with the `native` engine: explicitly declared in `LIGHTRAG_PARSER`, e.g., `docx:native-P`; otherwise, even if `P` is written, it falls back to R due to the missing `.blocks.jsonl`. +- DOCX only: other formats have no `.blocks.jsonl` artifact. +- Many algorithmic paths and thresholds: debugging requires first verifying the input sidecar, then inspecting the outputs of each stage. + +## 2. Overview of How It Works + +The P strategy takes as input the `.blocks.jsonl` produced by the native parser in `fixlevel=0` mode. **Each `type == "content"` line is treated as one heading-level basic chunk**, then table slicing, long-chunk splitting, and hierarchical merging are performed on top: + +```text +DOCX + ↓ native parser (fixlevel=0) +.blocks.jsonl + sidecars (.tables.json / .equations.json / .drawings.json / .blocks.assets/) + ↓ Stage B: slice oversized tables along row boundaries and assign first/middle/last roles + ↓ Stage B.1: bidirectional overlap of bridging text between consecutive large tables + ↓ Stage C: anchor-driven re-splitting of long text chunks + ↓ Stage D: hierarchy-aware two-phase merging + ↓ Stage E: [part n] line-level provenance numbering +Final chunk list +``` + +**Key invariants of the P strategy**: + +1. **No overlap across section boundaries**: Text between different `.blocks.jsonl` content lines is never copied into the other chunk, avoiding "misattribution". +2. **Long body within a section may overlap**: Multiple slices from within the same content line may keep R-style overlap controlled by `chunk_overlap_token_size`, reducing mid-sentence cuts in long bodies. +3. 
**Bridging text between tables may overlap bidirectionally**: The only cross-paragraph copying scenario, specifically serving context preservation for consecutive large tables. +4. **Table rows do not overlap each other**: Row-level slicing itself is non-overlapping, different from R's overlap concept. + +## 3. Input and Output + +### 3.1 Input + +`chunking_by_paragraph_semantic()` receives the following inputs: + +| Parameter | Source | Description | +|---|---|---| +| `content` | `full_docs[doc_id].content` | Concatenated merged text, used for fallback when sidecar is missing | +| `blocks_path` | `full_docs[doc_id].lightrag_document_path` | Path to `.blocks.jsonl`, the primary input for the P strategy | +| `chunk_token_size` | `chunk_options.chunk_token_size` / `CHUNK_P_SIZE` | Target hard cap N; defaults to `2000` | +| `chunk_overlap_token_size` | `CHUNK_P_OVERLAP_SIZE` / `chunk_overlap_token_size` | Upper bound for long-body fallback overlap within the same content line and for the table bridging budget; defaults to `100` | +| `tokenizer` | The tokenizer already parsed by LightRAG | Basis for all token counting and text overlap truncation | + +The P strategy **does not accept** `split_by_character` / `split_by_character_only`, because the normal path is driven by heading and paragraph structure. + +### 3.2 `.blocks.jsonl` Convention + +The P strategy only processes `type == "content"` lines. Each content line typically contains: + +- `content`: The body text under the heading, possibly including ordinary paragraphs, `
` tags, `` formulas, `` graphics. +- `heading`: The current heading. +- `parent_headings`: The chain of parent headings. +- `level`: Heading level (1–9, corresponding to the original outline levels 0–8). +- `positions`: Original paragraph positioning (used for traceability). + +The native parser's `fixlevel=0` mode guarantees that "the body under a heading becomes one basic chunk" without performing token-threshold splitting during parsing. Tables are inserted into `content` while staying intact. + +### 3.3 Output + +The final output is an ordered list of chunks, where each element is: + +```python +{ + "tokens": int, # Actual token count (re-measured after merging) + "content": str, # Chunk text (may contain
tags) + "chunk_order_index": int, # Chunk ordering index + "heading": str, # Suffix [part n] appended after splitting + "parent_headings": list[str], # Parent heading chain; no suffix appended + "level": int, # Heading level +} +``` + +Internally, the implementation also temporarily uses fields such as `paragraphs`, `table_chunk_role`, `uuid`, `uuid_end`, `type` to assist splitting and merging, but **these do not appear in the final output**. + +### 3.4 `[part n]` Suffix Rules + +- When the same original `.blocks.jsonl` content line is split into multiple slices, the `heading` field of every slice gets `[part 1]`, `[part 2]` … appended. +- Content lines that are not split keep the original heading unchanged. +- `parent_headings` does not get any suffix. +- Numbering is **reset independently within each original content line**. +- The legacy `[表格片段N]` ("table fragment N") suffix is uniformly replaced by `[part n]`. + +## 4. Key Thresholds + +P strategy thresholds are not fixed constants; they are dynamically derived from `chunk_token_size` (denoted N): + +| Name | Formula | Value when N = 2000 | Technical meaning | +|---|---|---:|---| +| `target_max` | N | 2000 | Hard upper bound for text chunks | +| `target_ideal` | 0.75 × N | 1500 | Ideal target for text chunks; chunks at or above this value stop participating in ordinary peer merging | +| `table_max` | 0.625 × N | 1250 | Threshold that triggers table slicing | +| `table_ideal` | 0.375 × N | 750 | Ideal size for a table slice | +| `table_min_last` | 0.32 × `table_max` | 400 | Last-slice swallow-back threshold (if the last slice is smaller and can be merged, swallow it back into the previous slice) | +| `small_tail_threshold` | 0.125 × N | 250 | Threshold for tail fragment absorption | +| `max_anchor_candidate_length` | Fixed | 100 chars | Upper bound on paragraph length for candidate anchors in long-chunk splitting | + +Proportional constraint relationships: `table_max < target_ideal < target_max`, `table_ideal < 
table_max`. These ratios originate from empirical values in the audit mode (`large chunk 8000, small table 5000, ideal table 3000, table tail 1600`) and are now proportionally scaled by `chunk_token_size`. + +## 5. Stage A: Heading-Level Basic Chunks + +Heading recognition is performed by the native parser; **the P chunker itself does not scan the docx body nor judge heading styles**. + +In `fixlevel=0` mode, the native parser: + +1. Reads `styles.xml`, builds a style inheritance chain via ``, and traces back the effective ``. +2. Iterates over the paragraphs of `document.xml`, resolving outline levels along the inheritance chain; original outline levels 0–8 are mapped to internal `level` 1–9. +3. Maintains `current_heading_stack`, clearing old headings no shallower than the current level when a new heading is encountered, and computing `parent_headings`. +4. Extracts tables, formulas, and drawings into single-line tags (`
...
` etc.) and writes them to the corresponding sidecars. +5. All recognizable headings trigger a basic chunk boundary; **no** token-threshold splitting is performed. + +The P chunker directly reads `.blocks.jsonl`, treating each content line as an independent unit of processing for subsequent Stages B/C. This implies that `[part n]` numbering is reset independently per original content line. + +## 6. Stage B: Row-Boundary Slicing for Oversized Tables + +Stage B only processes tables whose token count exceeds `table_max`. Its goal is **not merely to split the table** but to preserve table boundary context based on row-boundary-priority splitting. + +### 6.1 Row-Boundary-Priority Slicing + +- `format="json"`: Slice by the top-level JSON row array. +- `format="html"`: Slice by `...` rows. +- Tables not explicitly tagged but sniffable as JSON / HTML are handled by the same rules. + +Before slicing, the `
` wrapper token cost is pre-deducted so that each re-wrapped slice stays under `table_max` as much as possible. Each slice is re-wrapped as a valid `` tag for ease of downstream parsing. + +### 6.2 Row-Level Recursive Re-Slicing + +If a row subset, after re-wrapping, still exceeds `table_max`, further subdivision is performed within that row subset. **Only when slicing has converged to a single row that itself exceeds the limit does it degrade to character-level splitting**. This mechanism keeps as much valid table structure as possible for table content expressible by row boundaries. + +### 6.3 Last-Slice Swallow-Back + +If the token count of the last table slice falls below `table_min_last` and the result of merging with the previous slice does not exceed `table_max`, the last slice is swallowed back into the previous slice, reducing useless short table chunks. + +### 6.4 Table Slice Roles and Physical Gluing + +Each table slice is assigned an internal field `table_chunk_role`, and gluing to surrounding paragraphs is decided by role: + +| Role | Meaning | Gluing strategy | +|---|---|---| +| `first` | First slice of the original table | Appended to the tail of the current accumulating chunk so that the table's **preceding description** enters the same chunk as the first slice | +| `middle` | Middle slice of the original table | Output independently to avoid merging with unrelated body | +| `last` | Last slice of the original table | Used as the starting point of a new accumulating chunk so that the **following explanation** is automatically appended after the last slice | +| `none` | Non-table slice or untouched intact table | Treated as ordinary text chunks | + +`table_chunk_role` is an internal field that does not survive in the final output, **but in Stage D it continues to serve as a merging constraint** (see §9.1). + +## 7. 
Stage B.1: Bidirectional Overlap of Bridging Text Between Consecutive Large Tables + +When the pattern "large table A, short bridging text, large table B" appears in the same original content line and both tables are split, the bridging text is distributed bidirectionally according to a context budget: + +1. Encode the bridging text into tokens. +2. Compute the left budget `prev_budget = min(chunk_overlap_token_size, target_max - current token count of the left last slice)`. +3. Compute the right budget `next_budget = min(chunk_overlap_token_size, target_max - current token count of the right first slice)`. +4. **If the bridging text length does not exceed either budget**: Both the left and right table boundary chunks contain the **complete bridging text**. +5. **If the bridging text is longer**: The prefix enters the left last-slice chunk, the suffix enters the right first-slice chunk; the middle portion that exceeds both budgets becomes an independent ordinary text chunk. + +Each one-sided budget is additionally capped at `chunk_token_size / 2` to prevent the bridging text from dominating an entire chunk. + +The difference from ordinary adjacent chunk overlap: + +- Ordinary overlap copies characters or tokens by forward/backward order, regardless of boundary type. +- The B.1 mechanism is triggered by table slice roles, treating bridging text as both the post-text context of the left table and the pre-text context of the right table, avoiding the bridging description being assigned to only one side or being split off and hard to recall. + +## 8. Stage C: Anchor-Driven Re-Splitting of Long Text Chunks + +Stage C processes content chunks that still exceed `target_max` after Stage B. 
+ +### 8.1 Short-Paragraph Anchors + +Restore content into paragraphs, then select paragraphs that satisfy all of the following as candidate anchors: + +- The paragraph is not a table (does not start with `.docx.parsed/.blocks.jsonl +``` + +If the file is missing or empty, the P strategy falls back to R entirely and gains none of P's benefits. Common causes: + +- `LIGHTRAG_PARSER=docx:native-...` was not configured. +- Parsing failed (see error entries in `pipeline_status`). +- The document is not a DOCX (other formats do not support P). + +### 12.2 Inspect the Contents of blocks.jsonl + +Each line is a JSON; filter `type == "content"` and inspect whether heading / level / parent_headings match expectations: + +```bash +jq -c 'select(.type=="content") | {level, heading, parent_headings}' \ + INPUT/__parsed__/.docx.parsed/.blocks.jsonl | head +``` + +If most headings are empty or levels are abnormal, the native parser did not correctly recognize heading styles — in which case P's hierarchical merging and anchor promotion will both fail. + +### 12.3 Inspect the Final Chunks + +View chunk metadata in the `text_chunks` storage: + +```bash +jq '.[] | {heading, level, tokens, parent_headings}' \ + rag_storage/kv_store_text_chunks.json | head -30 +``` + +You should observe: + +- Headings of chunks around large tables typically correspond to `[part 1]` / `[part n]` (indicating Stage B splitting occurred). +- Fine-grained clauses are merged into chunks close to `target_ideal` (indicating Stage D took effect). +- `parent_headings` jumps at boundaries between different sections and stays stable within the same section. + +### 12.4 Chunk Size Distribution Check + +Ideal distribution: most chunks fall in the range `[target_ideal, target_max]` (i.e., approximately 1500–2000 tokens when N=2000); chunks noticeably smaller are usually `middle` table slices (locked as independent) or tail chunks at section boundaries. 
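The distribution check described above can be scripted. The following is a minimal sketch, not part of LightRAG: it assumes the chunk store is the `rag_storage/kv_store_text_chunks.json` file queried in §12.3, that each record carries a numeric `tokens` field, and that `target_ideal` and `target_max` are roughly 1500 and 2000 (the N=2000 example). The function names `load_chunks` and `bucket_chunks` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_chunks(path):
    """Load chunk records from a JSON object keyed by chunk id, or from a list."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return list(data.values()) if isinstance(data, dict) else list(data)


def bucket_chunks(chunks, target_ideal=1500, target_max=2000):
    """Count chunks below, inside, and above the [target_ideal, target_max] range."""
    counts = Counter()
    for chunk in chunks:
        # Assumption: the token count lives in a `tokens` field (as in the jq query).
        tokens = chunk.get("tokens") or 0
        if tokens < target_ideal:
            counts["below_ideal"] += 1
        elif tokens <= target_max:
            counts["in_range"] += 1
        else:
            counts["over_max"] += 1
    return counts


if __name__ == "__main__":
    store = Path("rag_storage/kv_store_text_chunks.json")
    if store.exists():
        print(dict(bucket_chunks(load_chunks(store))))
```

A healthy run should put most chunks in `in_range`; a large `below_ideal` share is worth investigating, and any `over_max` entries point to the scenarios in §13.4.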
+ +If many tail chunks below `small_tail_threshold` appear, possible causes include: + +- The parent heading path consistency constraint is too strict (adjacent small chunks with different `parent_headings` cannot merge). +- Many `middle` table slices pile up (the table itself is very large). + +## 13. Troubleshooting + +### 13.1 P Did Not Take Effect; Output Matches R + +Investigate in this order: + +1. Does `full_docs[doc_id].process_options` contain `P`? +2. Is `full_docs[doc_id].parse_format` equal to `lightrag`? If `raw`, it is on the legacy path and P automatically falls back to R. +3. Does the `.blocks.jsonl` pointed to by `lightrag_document_path` exist and is it non-empty? +4. Are there `paragraph_semantic ... fallback to recursive_character` messages in the logs? + +### 13.2 Tables Are Scattered; Preceding and Following Explanations Are Detached + +- Check whether the table is truly recognized as `
` or `
` (see `.blocks.jsonl`). Tables with unrecognized format can only undergo character splitting and cannot trigger Stage B's role mechanism. +- Check whether the table's token count actually exceeds `table_max`. Tables below the threshold remain intact and never trigger first/middle/last slicing. +- For consecutive large tables, confirm whether the bridging text between the two tables resides in the **same content line** — bridging across content lines does not participate in B.1 bidirectional overlap. + +### 13.3 Fine-Grained Clauses Are Not Merged + +- Check whether the `parent_headings` of adjacent clauses are identical: the parent heading path consistency constraint prevents cross-topic merging. +- Check whether `level` is the same: peer merging requires equal `level`; cross-level absorption only allows shallow absorbing deep. +- Check whether a `middle` table slice is inserted in the middle: this blocks batched tail absorption. + +### 13.4 A Single Chunk Exceeds `target_max` + +Normally, Stage D's actual token re-measurement rejects oversized merges, but oversized chunks may still occur in the following scenarios: + +- A single-row table itself exceeds `target_max` with no anchor to split on; eventually it goes through R character splitting but a single chunk still exceeds the limit. +- `enforce_chunk_token_limit_before_embedding` performs a final hard cut before embedding; downstream will not actually embed an oversized chunk into the vector store. + +### 13.5 Abnormal `[part n]` Suffixes + +- Multiple slices come from the same original content line, but only one `[part 1]` is seen: check whether they were merged in Stage D — after merging, the main chunk's part suffix is retained and multiple part tags are not concatenated. +- Legacy `[表格片段N]` suffix appears: this indicates data output by an older chunker; the new version standardizes on `[part n]`, and re-chunking is required. 
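The suffix checks in §13.5 can be automated over chunk headings. This is a minimal sketch under stated assumptions: headings are read from the `heading` field shown in the §12.3 query, the current suffix is written literally as `[part n]`, and the legacy suffix as `[表格片段N]`. The helper name `classify_headings` is hypothetical and not part of LightRAG.

```python
import re
from collections import Counter

# Suffix shapes as quoted in the text above; exact spacing is an assumption.
PART_SUFFIX = re.compile(r"\[part \d+\]")
LEGACY_SUFFIX = re.compile(r"\[表格片段\d+\]")


def classify_headings(headings):
    """Count headings carrying a legacy suffix, a [part n] suffix, or neither."""
    counts = Counter()
    for heading in headings:
        text = heading or ""  # tolerate missing or null headings
        if LEGACY_SUFFIX.search(text):
            counts["legacy"] += 1
        elif PART_SUFFIX.search(text):
            counts["part"] += 1
        else:
            counts["plain"] += 1
    return counts
```

Any non-zero `legacy` count suggests the store still holds output from the older chunker and needs re-chunking.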
+ +### 13.6 Log Keywords + +P-strategy-related log keywords (for `grep`-based troubleshooting): + +- `paragraph_semantic` — module entry +- `fallback to recursive_character` — overall or single-paragraph degradation +- `table_chunk_role` — table role-related +- `bridge` — Stage B.1 bridging text handling +- `anchor` — Stage C anchor selection diff --git a/docs/ParserDebugCLI-zh.md b/docs/ParserDebugCLI-zh.md new file mode 100644 index 0000000000..876a889a86 --- /dev/null +++ b/docs/ParserDebugCLI-zh.md @@ -0,0 +1,129 @@ +# Parser CLI Debugger 使用指南 + +本工具用于本地调试 LightRAG 的三个内容解析引擎(`native` / `mineru` / `docling`),针对**单个文件**触发 `LightRAG.parse_` 生产代码路径,并把解析产物(sidecar 与 raw 缓存)输出到一个**扁平目录布局**——与生产入库目录相比,区别仅在于: + +- **无 `__parsed__/` 中间层**:产物直接落在指定父目录下,便于查看; +- **源文件不会被归档**:源文件保留在原位置(生产路径会把源文件移到 `/__parsed__/`); +- **raw 缓存只看目录是否存在**:`mineru` / `docling` 的 raw 目录非空即视为有效,跳过 `_manifest.json` 校验。 + +其余流程(IR 构建、sidecar 写入、对 `full_docs` 的同步逻辑)与生产入库完全一致,便于排查解析阶段问题。 + +## 命令格式 + +```bash +python -m lightrag.parser_cli \ + --engine {native|mineru|docling} \ + [-o ] \ + [--doc-id ] \ + [--force-reparse] \ + [--preview N] +``` + +| 参数 | 说明 | +|---|---| +| `input_file` | 待解析的源文件路径(位置参数,必填)。文件必须实际存在。 | +| `--engine` | 必填:`native`(仅 `.docx`,本地解析)/ `mineru`(PDF/办公文档,调 MinerU 服务)/ `docling`(PDF/办公文档,调 docling-serve)。 | +| `-o / --sidecar-parent-dir` | sidecar 与 raw 目录的父目录,默认 = 源文件所在目录。 | +| `--doc-id` | 自定义文档 ID,默认 `doc-`(同一文件多次跑结果稳定)。 | +| `--force-reparse` | 仅对 `mineru` / `docling` 生效:清空 raw 目录、强制重新下载与解析。默认行为是 raw 目录非空即复用。 | +| `--preview N` | 解析完成后打印前 N 个 block 的预览(headings + 内容片段),默认 5;`0` 关闭。 | + +## 输出目录布局 + +以输入 `./inputs/workspace/sample.pdf` + 默认 sidecar 父目录(即 `./inputs/workspace/`)为例: + +``` +./inputs/workspace/ +├── sample.pdf # 原文件,不动 +├── sample.pdf.parsed/ # ← sidecar 输出 +│ ├── sample.blocks.jsonl # JSONL:首行 meta,后续每行一个 block +│ ├── sample.blocks.assets/ # native 抽取的图片/媒体资产(若有) +│ ├── sample.tables.json # 表格 sidecar(若 IR 含 tables) +│ ├── sample.drawings.json # 图纸/图片 
sidecar(若 IR 含 drawings) +│ └── sample.equations.json # 公式 sidecar(若 IR 含 equations) +└── sample.pdf._raw/ # ← mineru / docling 的 raw 缓存(native 无此目录) + ├── _manifest.json # 由引擎下载流程写入;CLI 缓存校验不读 + └── # 引擎特定 raw 产物(content_list.json / *.json / 资产等) +``` + +`native` 引擎不产生 raw 目录(解析是本地的,无外部服务参与)。 + +## 典型用例 + +### A. 本地解析 `.docx`(零网络依赖) + +```bash +python -m lightrag.parser_cli ./inputs/workspace/sample.docx --engine native +# 产出:./inputs/workspace/sample.docx.parsed/ (含 blocks.jsonl + assets) +``` + +### B. 用 MinerU 解析 PDF(首次会下载 raw) + +```bash +# 第一次:下载 raw bundle + 生成 sidecar +python -m lightrag.parser_cli ./inputs/workspace/sample.pdf --engine mineru +# 第二次(无任何修改):raw 目录非空 → 直接复用 → 仅重建 sidecar,速度快 +python -m lightrag.parser_cli ./inputs/workspace/sample.pdf --engine mineru +# 日志会显示: [parse_mineru] raw cache hit doc_id=... raw_dir=.../sample.pdf.mineru_raw +``` + +### C. 用 Docling 解析 PDF + 复用已有 raw 目录 + +```bash +# 已有 ./inputs/workspace/sample.pdf.docling_raw/ (含 docling 产物的 JSON 等文件) +python -m lightrag.parser_cli ./inputs/workspace/sample.pdf --engine docling +# CLI 不查 manifest,只要 raw 目录非空就跳过 docling-serve 调用 +``` + +> 注:这是旧 `python -m lightrag.external_parser.docling` 调试入口「从已有 raw 重建 sidecar」场景的等价替代——只需把 raw 目录放到约定位置(`/.docling_raw/`)即可触发缓存命中分支。 + +### D. 输出到自定义目录 + +```bash +python -m lightrag.parser_cli ./inputs/workspace/sample.docx \ + --engine native -o /tmp/debug_sidecar +# 产出:/tmp/debug_sidecar/sample.docx.parsed/ +# 原文件 ./inputs/workspace/sample.docx 不会被移动 +``` + +### E. 
强制重新解析(清空 raw 后重新下载) + +```bash +python -m lightrag.parser_cli ./inputs/workspace/sample.pdf \ + --engine docling --force-reparse +# raw 目录被清空 → 重新调 docling-serve 下载 → 重新生成 sidecar +``` + +## 环境变量 + +`mineru` / `docling` 引擎在 **缓存未命中**(首次解析或 `--force-reparse`)时会调用外部服务,所需环境变量与生产入库一致: + +- **MinerU**:`MINERU_API_MODE`(`local` / `official`)、`MINERU_API_TOKEN`、`MINERU_LOCAL_ENDPOINT` 或 `MINERU_OFFICIAL_ENDPOINT`,可选 `MINERU_ENGINE_VERSION` / `MINERU_MODEL_VERSION` / `MINERU_POLL_INTERVAL_SECONDS` / `MINERU_MAX_POLLS`。 +- **Docling**:`DOCLING_ENDPOINT`,可选 `DOCLING_ENGINE_VERSION` / `DOCLING_DO_OCR` / `DOCLING_FORCE_OCR` / `DOCLING_OCR_ENGINE` / `DOCLING_OCR_PRESET` / `DOCLING_OCR_LANG` / `DOCLING_DO_FORMULA_ENRICHMENT` / `DOCLING_POLL_INTERVAL_SECONDS` / `DOCLING_MAX_POLLS`。 + +详见 [FileProcessingConfiguration-zh.md](./FileProcessingConfiguration-zh.md)。 + +**缓存命中**时(raw 目录已存在且非空,且未传 `--force-reparse`)无需任何外部服务环境变量——可用于离线复现解析输出。 + +## 常见排障 + +| 现象 | 处理 | +|---|---| +| `error: input file does not exist: ...` | 检查 `input_file` 路径,必须是已存在的文件(不是 raw 目录)。 | +| raw 目录存在但 sidecar 内容仍是旧的 | 默认会**复用** raw 重建 sidecar。如果 raw 本身就过期或被替换,加 `--force-reparse` 清空重下。 | +| MinerU 报 `MINERU_API_TOKEN` 缺失 / Docling 连接 `DOCLING_ENDPOINT` 失败 | 缓存未命中触发了外部服务调用——核对对应环境变量;或确认 raw 目录是否非空(命中缓存时无需服务)。 | +| 源文件被意外移动 | 不应发生:CLI 已 mock 归档函数。若复现请提 issue(可能是 pipeline 内增加了新的归档调用点)。 | +| `parse_docling` 报 `produced zero blocks` | docling raw 中的主 JSON 内容不可解析或为空。检查 raw 目录的 `*.json` 是否合法。 | + +## 与 `LightRAG.parse_*` 生产路径的等价性 + +本 CLI 直接调用生产代码路径 `LightRAG.parse_native` / `parse_mineru` / `parse_docling`(通过 `lightrag/parser_debug.py` 的轻量 RAG 替身),因此: + +- sidecar 字段、命名、内容格式与生产入库完全一致; +- IR 构建器、`write_sidecar` 调用、`_persist_parsed_full_docs` 行为完全一致; +- 三处差异均由 CLI 内的 `monkey-patch` 实现,**不修改任何生产代码**: + 1. `parsed_artifact_dir_for_source` → 返回扁平路径(无 `__parsed__/`); + 2. `is_bundle_valid` → 「raw 非空即有效」; + 3. 
`archive_docx_source_after_full_docs_sync` → no-op,保留源文件。 + +可与 `tests/native_parser/docx/golden/native_docx/` 下的 golden fixture 对比验证(CLI 不冻结时间戳,比对时排除 `created_at` 等时间字段即可)。 diff --git a/docs/ParserDebugCLI.md b/docs/ParserDebugCLI.md new file mode 100644 index 0000000000..34bc09cb43 --- /dev/null +++ b/docs/ParserDebugCLI.md @@ -0,0 +1,129 @@ +# Parser CLI Debugger Guide + +This tool is used to locally debug LightRAG's three content parsing engines (`native` / `mineru` / `docling`). It triggers the `LightRAG.parse_` production code path for a **single file** and outputs the parsing artifacts (sidecar and raw cache) into a **flat directory layout**. Compared with the production ingestion directory, the only differences are: + +- **No `__parsed__/` intermediate layer**: artifacts land directly under the specified parent directory for easy inspection; +- **The source file is not archived**: the source file stays at its original location (the production path moves the source file to `/__parsed__/`); +- **Raw cache validity only checks directory existence**: any non-empty `mineru` / `docling` raw directory is considered valid, skipping `_manifest.json` validation. + +The rest of the flow (IR construction, sidecar writing, `full_docs` synchronization logic) is identical to production ingestion, making it convenient for troubleshooting parsing-stage issues. + +## Command Format + +```bash +python -m lightrag.parser_cli \ + --engine {native|mineru|docling} \ + [-o ] \ + [--doc-id ] \ + [--force-reparse] \ + [--preview N] +``` + +| Argument | Description | +|---|---| +| `input_file` | Path to the source file to parse (positional argument, required). The file must actually exist. | +| `--engine` | Required: `native` (only `.docx`, local parsing) / `mineru` (PDF/Office documents, calls MinerU service) / `docling` (PDF/Office documents, calls docling-serve). | +| `-o / --sidecar-parent-dir` | Parent directory of the sidecar and raw directories. 
Defaults to the directory containing the source file. | +| `--doc-id` | Custom document ID. Defaults to `doc-` (stable across multiple runs on the same file). | +| `--force-reparse` | Effective only for `mineru` / `docling`: clears the raw directory and forces re-download and re-parse. By default, a non-empty raw directory is reused. | +| `--preview N` | After parsing completes, prints a preview of the first N blocks (headings + content snippets). Default 5; `0` disables it. | + +## Output Directory Layout + +Taking input `./inputs/workspace/sample.pdf` + the default sidecar parent directory (i.e., `./inputs/workspace/`) as an example: + +``` +./inputs/workspace/ +├── sample.pdf # original file, untouched +├── sample.pdf.parsed/ # ← sidecar output +│ ├── sample.blocks.jsonl # JSONL: first line is meta, each subsequent line is a block +│ ├── sample.blocks.assets/ # image/media assets extracted by native (if any) +│ ├── sample.tables.json # table sidecar (if IR contains tables) +│ ├── sample.drawings.json # drawing/image sidecar (if IR contains drawings) +│ └── sample.equations.json # equation sidecar (if IR contains equations) +└── sample.pdf._raw/ # ← raw cache for mineru / docling (native has no such directory) + ├── _manifest.json # written by the engine download flow; not read by CLI cache validation + └── # engine-specific raw artifacts (content_list.json / *.json / assets, etc.) +``` + +The `native` engine does not produce a raw directory (parsing is local, with no external service involved). + +## Typical Use Cases + +### A. Locally parse a `.docx` (zero network dependency) + +```bash +python -m lightrag.parser_cli ./inputs/workspace/sample.docx --engine native +# Output: ./inputs/workspace/sample.docx.parsed/ (contains blocks.jsonl + assets) +``` + +### B. 
Parse a PDF with MinerU (raw will be downloaded on first run) + +```bash +# First run: download raw bundle + generate sidecar +python -m lightrag.parser_cli ./inputs/workspace/sample.pdf --engine mineru +# Second run (no changes): raw directory non-empty → reused directly → only regenerate sidecar, fast +python -m lightrag.parser_cli ./inputs/workspace/sample.pdf --engine mineru +# The log will show: [parse_mineru] raw cache hit doc_id=... raw_dir=.../sample.pdf.mineru_raw +``` + +### C. Parse a PDF with Docling + reuse an existing raw directory + +```bash +# Existing ./inputs/workspace/sample.pdf.docling_raw/ (contains docling's JSON output, etc.) +python -m lightrag.parser_cli ./inputs/workspace/sample.pdf --engine docling +# The CLI does not check the manifest; as long as the raw directory is non-empty, the docling-serve call is skipped +``` + +> Note: this is the equivalent replacement for the "rebuild sidecar from an existing raw directory" scenario that used to live in the legacy `python -m lightrag.external_parser.docling` debug entry point — just place the raw directory at the agreed location (`/.docling_raw/`) to trigger the cache-hit branch. + +### D. Output to a custom directory + +```bash +python -m lightrag.parser_cli ./inputs/workspace/sample.docx \ + --engine native -o /tmp/debug_sidecar +# Output: /tmp/debug_sidecar/sample.docx.parsed/ +# The source file ./inputs/workspace/sample.docx is not moved +``` + +### E. 
Force re-parse (clear raw and re-download) + +```bash +python -m lightrag.parser_cli ./inputs/workspace/sample.pdf \ + --engine docling --force-reparse +# raw directory is cleared → docling-serve is called again to download → sidecar regenerated +``` + +## Environment Variables + +The `mineru` / `docling` engines call external services when the **cache misses** (first parse or `--force-reparse`); the required environment variables are identical to production ingestion: + +- **MinerU**: `MINERU_API_MODE` (`local` / `official`), `MINERU_API_TOKEN`, `MINERU_LOCAL_ENDPOINT` or `MINERU_OFFICIAL_ENDPOINT`, optional `MINERU_ENGINE_VERSION` / `MINERU_MODEL_VERSION` / `MINERU_POLL_INTERVAL_SECONDS` / `MINERU_MAX_POLLS`. +- **Docling**: `DOCLING_ENDPOINT`, optional `DOCLING_ENGINE_VERSION` / `DOCLING_DO_OCR` / `DOCLING_FORCE_OCR` / `DOCLING_OCR_ENGINE` / `DOCLING_OCR_PRESET` / `DOCLING_OCR_LANG` / `DOCLING_DO_FORMULA_ENRICHMENT` / `DOCLING_POLL_INTERVAL_SECONDS` / `DOCLING_MAX_POLLS`. + +See [FileProcessingConfiguration.md](./FileProcessingConfiguration.md) for details. + +When the **cache is hit** (the raw directory already exists and is non-empty, and `--force-reparse` is not passed), no external service environment variables are needed — this can be used to offline-reproduce parsing output. + +## Common Troubleshooting + +| Symptom | Action | +|---|---| +| `error: input file does not exist: ...` | Check the `input_file` path; it must be an existing file (not a raw directory). | +| Raw directory exists but sidecar content is still stale | The default behavior is to **reuse** raw and regenerate sidecar. If the raw itself is outdated or has been replaced, add `--force-reparse` to clear and re-download. 
| +| MinerU reports `MINERU_API_TOKEN` missing / Docling fails to connect to `DOCLING_ENDPOINT` | A cache miss triggered an external service call — verify the corresponding environment variables; or confirm whether the raw directory is non-empty (no service needed when the cache hits). | +| Source file is unexpectedly moved | Should not happen: the CLI has mocked the archive function. If reproducible, please file an issue (a new archive call site may have been added in the pipeline). | +| `parse_docling` reports `produced zero blocks` | The main JSON content in docling raw is unparseable or empty. Check whether the `*.json` files in the raw directory are valid. | + +## Equivalence with the `LightRAG.parse_*` Production Path + +This CLI directly calls the production code paths `LightRAG.parse_native` / `parse_mineru` / `parse_docling` (via the lightweight RAG stand-in in `lightrag/parser_debug.py`), so: + +- The sidecar fields, naming, and content format are identical to production ingestion; +- The IR builders, `write_sidecar` calls, and `_persist_parsed_full_docs` behavior are identical; +- All three differences are implemented via `monkey-patch` inside the CLI — **no production code is modified**: + 1. `parsed_artifact_dir_for_source` → returns the flat path (no `__parsed__/`); + 2. `is_bundle_valid` → "raw is valid if non-empty"; + 3. `archive_docx_source_after_full_docs_sync` → no-op, source file preserved. + +Results can be cross-validated against golden fixtures under `tests/native_parser/docx/golden/native_docx/` (the CLI does not freeze timestamps; just exclude time fields such as `created_at` when comparing). 
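The golden-fixture comparison mentioned above can be sketched as follows. This is an illustrative helper, not part of the CLI: `created_at` is the only time field the text names, so any other field added to `TIME_FIELDS` is an assumption, and the function names `strip_time_fields`, `load_jsonl`, and `first_difference` are hypothetical.

```python
import json

# Only `created_at` is named in the guide; extend this set if other
# time fields show up in your sidecars (assumption).
TIME_FIELDS = frozenset({"created_at"})


def strip_time_fields(value, time_fields=TIME_FIELDS):
    """Recursively drop time-dependent keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: strip_time_fields(item, time_fields)
            for key, item in value.items()
            if key not in time_fields
        }
    if isinstance(value, list):
        return [strip_time_fields(item, time_fields) for item in value]
    return value


def load_jsonl(path):
    """Read a JSONL file (such as a blocks.jsonl sidecar) into a list of records."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def first_difference(actual, golden, time_fields=TIME_FIELDS):
    """Return the index of the first differing record, or None if they match."""
    left = [strip_time_fields(record, time_fields) for record in actual]
    right = [strip_time_fields(record, time_fields) for record in golden]
    for index, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return index
    if len(left) != len(right):
        return min(len(left), len(right))
    return None
```

Calling `first_difference(load_jsonl(cli_output), load_jsonl(golden_file))` with the two sidecar paths returns `None` when the files agree apart from the excluded time fields.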
diff --git a/docs/ProgramingWithCore.md b/docs/ProgramingWithCore.md index 24806fcf39..9059752a43 100644 --- a/docs/ProgramingWithCore.md +++ b/docs/ProgramingWithCore.md @@ -92,7 +92,7 @@ Notes: | **vector_db_storage_cls_kwargs** | `dict` | Additional parameters for vector database, like setting the threshold for nodes and relations retrieval | cosine_better_than_threshold: 0.2(default value changed by env var COSINE_THRESHOLD) | | **enable_llm_cache** | `bool` | If `TRUE`, stores LLM results in cache; repeated prompts return cached responses | `TRUE` | | **enable_llm_cache_for_entity_extract** | `bool` | If `TRUE`, stores LLM results in cache for entity extraction; Good for beginners to debug your application | `TRUE` | -| **addon_params** | `dict` | Additional parameters, e.g., `{"language": "Simplified Chinese", "entity_types": ["organization", "person", "location", "event"]}`: sets example limit, entity/relation extraction output language | language: English` | +| **addon_params** | `dict` | Runtime knobs for extraction. Supported keys: `language` (output language for entity/relation summaries), `entity_type_prompt_file` (file name — not a path — of a YAML profile loaded from `PROMPT_DIR/entity_type`; `PROMPT_DIR` defaults to `./prompts`), `entity_types_guidance` (inline override that wins over the file profile). | `{"language": "English", "entity_type_prompt_file": ""}` | | **embedding_cache_config** | `dict` | Configuration for question-answer caching. Contains three parameters: `enabled`: Boolean value to enable/disable cache lookup functionality. When enabled, the system will check cached responses before generating new answers. `similarity_threshold`: Float value (0-1), similarity threshold. When a new question's similarity with a cached question exceeds this threshold, the cached answer will be returned directly without calling the LLM. `use_llm_check`: Boolean value to enable/disable LLM similarity verification. 
When enabled, LLM will be used as a secondary check to verify the similarity between questions before returning cached answers. | Default: `{"enabled": False, "similarity_threshold": 0.95, "use_llm_check": False}` | @@ -149,9 +149,14 @@ class QueryParam: """ model_func: Callable[..., object] | None = None - """Optional override for the LLM model function to use for this specific query. - If provided, this will be used instead of the global model function. - This allows using different models for different query modes. + """Deprecated optional override for the LLM model function. + Use role-specific LLM configuration at initialization or + await rag.aupdate_llm_role_config("query" | "keyword", ...) for runtime + LLM changes instead. Kept for backward compatibility with direct Python callers. + + Note: when set, the LLM cache key collapses to a single "override" identity, + so swapping the override across calls will reuse stale cached responses. + Use aupdate_llm_role_config() for cache-correct model swaps. 
""" user_prompt: str | None = None diff --git a/docs/RoleSpecificLLMConfiguration-zh.md b/docs/RoleSpecificLLMConfiguration-zh.md new file mode 100644 index 0000000000..e615e10a31 --- /dev/null +++ b/docs/RoleSpecificLLMConfiguration-zh.md @@ -0,0 +1,376 @@ +# 基于角色的 LLM/VLM 配置指南 + +LightRAG 支持为不同处理阶段配置不同的 LLM 或 VLM。这个机制适合把低成本模型用于抽取,把更强模型用于最终回答,或为多模态分析单独指定视觉语言模型。 + +## 角色说明 + +当前支持四个角色: + +| 角色 | 用途 | +| --- | --- | +| `EXTRACT` | 实体/关系抽取,以及实体/关系描述摘要。 | +| `KEYWORD` | 查询关键词抽取,用于检索前的 high-level / low-level keyword 生成。 | +| `QUERY` | 最终问答、普通查询、bypass 查询,以及 Ollama-compatible API 的查询路径。 | +| `VLM` | 多模态分析阶段,用于图片、表格、公式等内容的 VLM 分析。 | + +如果某个角色没有专门配置,LightRAG 会使用基础 `LLM_*` 配置。 + +## 基础 LLM 配置 + +基础配置定义默认 LLM provider、模型、服务地址、认证信息和并发控制: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key + +# 所有 LLM 请求的默认超时时间 +LLM_TIMEOUT=180 + +# 所有 LLM 调用的默认最大并发数 +MAX_ASYNC=4 +``` + +常用字段: + +| 变量 | 说明 | +| --- | --- | +| `LLM_BINDING` | 基础 LLM provider。支持 `openai`、`ollama`、`lollms`、`azure_openai`、`bedrock`、`gemini`。 | +| `LLM_MODEL` | 基础模型名。对 Azure OpenAI 通常使用 deployment 名称。 | +| `LLM_BINDING_HOST` | 基础 provider endpoint。对于 SDK 默认 endpoint,可使用对应 sentinel,例如 `DEFAULT_GEMINI_ENDPOINT` 或 `DEFAULT_BEDROCK_ENDPOINT`。 | +| `LLM_BINDING_API_KEY` | 基础 API key。Bedrock 不使用这个字段。 | +| `LLM_TIMEOUT` | 基础 LLM timeout。角色未设置 timeout 时继承它。 | +| `MAX_ASYNC` | 基础 LLM 最大并发。角色未设置 `{ROLE}_MAX_ASYNC_LLM` 时继承它。 | + +## 角色覆盖变量 + +每个角色都可以覆盖 binding、模型、endpoint、API key、并发和 timeout: + +```env +QUERY_LLM_BINDING=openai +QUERY_LLM_MODEL=gpt-5 +QUERY_LLM_BINDING_HOST=https://api.openai.com/v1 +QUERY_LLM_BINDING_API_KEY=your_query_api_key +QUERY_MAX_ASYNC_LLM=2 +QUERY_LLM_TIMEOUT=240 +``` + +变量格式: + +| 变量 | 说明 | +| --- | --- | +| `{ROLE}_LLM_BINDING` | 覆盖角色 provider。`ROLE` 可为 `EXTRACT`、`KEYWORD`、`QUERY`、`VLM`。 | +| `{ROLE}_LLM_MODEL` | 覆盖角色模型名。 | +| `{ROLE}_LLM_BINDING_HOST` | 覆盖角色 endpoint。 | +| `{ROLE}_LLM_BINDING_API_KEY` | 覆盖角色 API 
key。Bedrock 不支持。 | +| `{ROLE}_MAX_ASYNC_LLM` | 覆盖角色最大并发。未设置时继承 `MAX_ASYNC`。 | +| `{ROLE}_LLM_TIMEOUT` | 覆盖角色 timeout。未设置时继承 `LLM_TIMEOUT`。 | + +## Provider 参数覆盖 + +provider 细项使用下面的格式: + +```env +{ROLE}_{PROVIDER_PREFIX}_{FIELD} +``` + +例如: + +```env +# 只覆盖 QUERY 角色的 OpenAI reasoning effort +QUERY_OPENAI_LLM_REASONING_EFFORT=medium + +# 只覆盖 EXTRACT 角色的 Bedrock 生成参数 +EXTRACT_BEDROCK_LLM_TEMPERATURE=0.0 +EXTRACT_BEDROCK_LLM_MAX_TOKENS=2048 + +# 只覆盖 VLM 角色的 Gemini 生成参数 +VLM_GEMINI_LLM_MAX_OUTPUT_TOKENS=4096 +VLM_GEMINI_LLM_TEMPERATURE=0.2 +``` + +常见 provider 前缀: + +| Provider | 基础参数前缀 | 角色参数示例 | +| --- | --- | --- | +| `openai` / `azure_openai` | `OPENAI_LLM_*` | `QUERY_OPENAI_LLM_REASONING_EFFORT` | +| `ollama` | `OLLAMA_LLM_*` | `EXTRACT_OLLAMA_LLM_NUM_PREDICT` | +| `lollms` | 使用 Ollama 兼容参数集合 | `QUERY_OLLAMA_LLM_TEMPERATURE` | +| `bedrock` | `BEDROCK_LLM_*` | `EXTRACT_BEDROCK_LLM_MAX_TOKENS` | +| `gemini` | `GEMINI_LLM_*` | `VLM_GEMINI_LLM_THINKING_CONFIG` | + +## 继承规则 + +### 同一个 provider 内覆盖 + +如果角色没有设置 `{ROLE}_LLM_BINDING`,或设置成与基础 `LLM_BINDING` 相同,角色会继承基础配置: + +- 未设置 `{ROLE}_LLM_MODEL` 时继承 `LLM_MODEL`。 +- 未设置 `{ROLE}_LLM_BINDING_HOST` 时继承 `LLM_BINDING_HOST`。 +- 未设置 `{ROLE}_LLM_BINDING_API_KEY` 时继承 `LLM_BINDING_API_KEY`。 +- 未设置 `{ROLE}_LLM_TIMEOUT` 时继承 `LLM_TIMEOUT`。 +- 未设置 `{ROLE}_MAX_ASYNC_LLM` 时继承 `MAX_ASYNC`。 +- provider 参数先继承基础 provider options,再叠加角色专属 provider options。 + +因此,同一个 provider 下只想换模型时,只需要写模型名: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key +OPENAI_LLM_REASONING_EFFORT=minimal + +# QUERY 继承 host、API key、timeout、并发和 OPENAI_LLM_REASONING_EFFORT +QUERY_LLM_MODEL=gpt-5 +``` + +### 跨 provider 覆盖 + +如果角色的 `{ROLE}_LLM_BINDING` 与基础 `LLM_BINDING` 不同,就是跨 provider 配置。当前规则是: + +- 必须设置 `{ROLE}_LLM_MODEL`。 +- 非 Bedrock provider 必须设置 `{ROLE}_LLM_BINDING_API_KEY`。 +- 如果没有设置 `{ROLE}_LLM_BINDING_HOST`,LightRAG 会尝试使用该 provider 的默认 host。 +- provider 参数不继承基础 provider 
options,而是从空配置开始,只叠加角色专属 provider options。 + +示例:基础使用 Ollama,本地抽取;最终回答改用 OpenAI: + +```env +LLM_BINDING=ollama +LLM_MODEL=qwen3.5:9b +LLM_BINDING_HOST=http://localhost:11434 +OLLAMA_LLM_NUM_CTX=32768 + +QUERY_LLM_BINDING=openai +QUERY_LLM_MODEL=gpt-5-mini +QUERY_LLM_BINDING_HOST=https://api.openai.com/v1 +QUERY_LLM_BINDING_API_KEY=your_openai_api_key +QUERY_OPENAI_LLM_REASONING_EFFORT=minimal +``` + +跨 provider 时建议显式设置 `{ROLE}_LLM_BINDING_HOST`,避免默认 host 与基础 provider 的 endpoint 混淆。 + +### Bedrock 认证规则 + +Bedrock 不使用 `LLM_BINDING_API_KEY`,也不支持 `{ROLE}_LLM_BINDING_API_KEY`。可用认证方式: + +- 全局 SigV4:`AWS_ACCESS_KEY_ID`、`AWS_SECRET_ACCESS_KEY`、`AWS_SESSION_TOKEN`、`AWS_REGION`。 +- 角色级 SigV4:`{ROLE}_AWS_ACCESS_KEY_ID`、`{ROLE}_AWS_SECRET_ACCESS_KEY`、`{ROLE}_AWS_SESSION_TOKEN`、`{ROLE}_AWS_REGION`。 +- 进程级 bearer token:`AWS_BEARER_TOKEN_BEDROCK`。这是 AWS SDK 进程级设置,不能按角色覆盖。 + +角色级 Bedrock 示例: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_openai_api_key + +EXTRACT_LLM_BINDING=bedrock +EXTRACT_LLM_MODEL=us.amazon.nova-lite-v1:0 +EXTRACT_LLM_BINDING_HOST=DEFAULT_BEDROCK_ENDPOINT +EXTRACT_AWS_REGION=us-west-2 +EXTRACT_AWS_ACCESS_KEY_ID=your_extract_access_key +EXTRACT_AWS_SECRET_ACCESS_KEY=your_extract_secret_key +EXTRACT_AWS_SESSION_TOKEN=your_optional_session_token +EXTRACT_BEDROCK_LLM_TEMPERATURE=0.0 +EXTRACT_BEDROCK_LLM_MAX_TOKENS=2048 +``` + +## Provider 行为对照 + +| Provider | 角色级 host/base_url | 角色级 API key | 认证限制 | +| --- | --- | --- | --- | +| `openai` | 支持,通过 `{ROLE}_LLM_BINDING_HOST` 传给 OpenAI-compatible client。 | 支持 `{ROLE}_LLM_BINDING_API_KEY`,未设置时同 provider 继承基础 `LLM_BINDING_API_KEY`。 | 当前主要是 API key / Bearer 模式。 | +| `ollama` | 支持,通过 `{ROLE}_LLM_BINDING_HOST` 传给 Ollama client。 | 支持 `{ROLE}_LLM_BINDING_API_KEY`,未设置时同 provider 继承基础 key;底层未收到 key 时会再回退 `OLLAMA_API_KEY`。 | Bearer header。 | +| `lollms` | 支持,通过 `{ROLE}_LLM_BINDING_HOST` 作为 `base_url`。 | 支持 `{ROLE}_LLM_BINDING_API_KEY`,未设置时同 provider 继承基础 
key。 | Bearer header。 | +| `azure_openai` | 支持,通过 `{ROLE}_LLM_BINDING_HOST` 作为 Azure endpoint。 | 支持 `{ROLE}_LLM_BINDING_API_KEY`,未设置时同 provider 继承基础 key,也可能回退 `AZURE_OPENAI_API_KEY`。 | `AZURE_OPENAI_API_VERSION` 是全局环境变量,不支持角色级覆盖。 | +| `bedrock` | 支持,通过 `{ROLE}_LLM_BINDING_HOST` 作为 `endpoint_url`;`DEFAULT_BEDROCK_ENDPOINT` 表示交给 AWS SDK 选择。 | 不支持 generic API key。 | 使用全局或角色级 SigV4。`AWS_BEARER_TOKEN_BEDROCK` 是进程级,不能按角色覆盖。 | +| `gemini` | 支持,通过 `{ROLE}_LLM_BINDING_HOST` 传给 Google GenAI client;`DEFAULT_GEMINI_ENDPOINT` 表示使用 SDK 默认 endpoint。 | AI Studio 模式支持 `{ROLE}_LLM_BINDING_API_KEY`。 | Vertex AI 由 `GOOGLE_GENAI_USE_VERTEXAI`、`GOOGLE_CLOUD_PROJECT`、`GOOGLE_CLOUD_LOCATION`、`GOOGLE_APPLICATION_CREDENTIALS` 控制,都是进程级设置。 | + +## 推荐配置模式 + +### 1. 同 provider 只更换模型 + +适合用同一个 OpenAI key 和 endpoint,但让最终回答使用更强模型: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key +OPENAI_LLM_REASONING_EFFORT=minimal + +QUERY_LLM_MODEL=gpt-5 +QUERY_MAX_ASYNC_LLM=2 +``` + +`QUERY` 会继承基础 host、API key 和 `OPENAI_LLM_REASONING_EFFORT`。 + +### 2. 同 provider 更换模型并调整参数 + +适合基础模型用于抽取,最终回答使用更高 reasoning effort: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key +OPENAI_LLM_REASONING_EFFORT=minimal +OPENAI_LLM_MAX_COMPLETION_TOKENS=4096 + +QUERY_LLM_MODEL=gpt-5 +QUERY_OPENAI_LLM_REASONING_EFFORT=medium +QUERY_OPENAI_LLM_MAX_COMPLETION_TOKENS=9000 +QUERY_LLM_TIMEOUT=240 +``` + +### 3. 同 provider 使用不同 endpoint 和 API key + +适合所有角色都走 `openai` binding,但其中一些角色访问 OpenAI 官方接口,另一些角色访问本地 vLLM、SGLang 或 OpenRouter 等 OpenAI-compatible endpoint。下面的例子中: + +- `EXTRACT` 使用 OpenAI 官方 `gpt-5-mini`。 +- `QUERY` 使用 OpenAI 官方 `gpt-5.4`,并使用单独的 OpenAI key。 +- `KEYWORD` 使用本地 vLLM 部署的 `Qwen3.5-35B-A3B`。 + +```env +########################################################################### +# Base LLM fallback. 
Keep it aligned with EXTRACT so unspecified roles still +# have a valid OpenAI configuration. +########################################################################### +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_extract_openai_api_key +LLM_TIMEOUT=180 +MAX_ASYNC=4 + +########################################################################### +# IMPORTANT: +# Do not set global OPENAI_LLM_REASONING_EFFORT here if any same-provider role +# points to a local OpenAI-compatible server that does not support it. +# Use role-specific OPENAI options instead. +########################################################################### +# OPENAI_LLM_REASONING_EFFORT=none + +########################################################################### +# EXTRACT: OpenAI official API, gpt-5-mini +########################################################################### +EXTRACT_LLM_BINDING=openai +EXTRACT_LLM_MODEL=gpt-5-mini +EXTRACT_LLM_BINDING_HOST=https://api.openai.com/v1 +EXTRACT_LLM_BINDING_API_KEY=your_extract_openai_api_key +EXTRACT_OPENAI_LLM_REASONING_EFFORT=low +EXTRACT_OPENAI_LLM_MAX_COMPLETION_TOKENS=4096 +EXTRACT_MAX_ASYNC_LLM=4 +EXTRACT_LLM_TIMEOUT=180 + +########################################################################### +# QUERY: OpenAI official API, gpt-5.4, separate API key +########################################################################### +QUERY_LLM_BINDING=openai +QUERY_LLM_MODEL=gpt-5.4 +QUERY_LLM_BINDING_HOST=https://api.openai.com/v1 +QUERY_LLM_BINDING_API_KEY=your_query_openai_api_key +QUERY_OPENAI_LLM_REASONING_EFFORT=medium +QUERY_OPENAI_LLM_MAX_COMPLETION_TOKENS=9000 +QUERY_MAX_ASYNC_LLM=2 +QUERY_LLM_TIMEOUT=240 + +########################################################################### +# KEYWORD: local vLLM OpenAI-compatible endpoint, Qwen3.5-35B-A3B +########################################################################### +KEYWORD_LLM_BINDING=openai 
+KEYWORD_LLM_MODEL=Qwen3.5-35B-A3B +KEYWORD_LLM_BINDING_HOST=http://localhost:8000/v1 +# If vLLM was started with --api-key, use the same value here. +# If vLLM has no auth, still set a non-empty dummy value to avoid falling +# back to the official OpenAI key. +KEYWORD_LLM_BINDING_API_KEY=local-vllm-api-key +KEYWORD_OPENAI_LLM_MAX_TOKENS=2048 +# Optional for Qwen-style models served by vLLM when you want to disable thinking. +KEYWORD_OPENAI_LLM_EXTRA_BODY='{"chat_template_kwargs": {"enable_thinking": false}}' +KEYWORD_MAX_ASYNC_LLM=4 +KEYWORD_LLM_TIMEOUT=180 +``` + +这个模式不是跨 provider,因为三个角色的 binding 都是 `openai`。LightRAG 会分别把每个角色的 `*_LLM_BINDING_HOST` 和 `*_LLM_BINDING_API_KEY` 传给 OpenAI-compatible client。 + +注意:同 provider 的 provider options 会继承基础 `OPENAI_LLM_*`。如果本地 vLLM 不支持 OpenAI 官方参数,例如 `reasoning_effort`,不要设置全局 `OPENAI_LLM_REASONING_EFFORT`;改用 `EXTRACT_OPENAI_LLM_REASONING_EFFORT`、`QUERY_OPENAI_LLM_REASONING_EFFORT` 这类角色级变量。 + +### 4. 某个角色跨 provider + +适合基础使用 OpenAI 官方模型,只有关键词抽取使用本地 Ollama: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_openai_api_key +OPENAI_LLM_REASONING_EFFORT=medium + +KEYWORD_LLM_BINDING=ollama +KEYWORD_LLM_MODEL=qwen3.5:9b +KEYWORD_LLM_BINDING_HOST=http://localhost:11434 +KEYWORD_LLM_BINDING_API_KEY=ollama-local-key +KEYWORD_OLLAMA_LLM_NUM_CTX=32768 +``` + +跨 provider 时,Ollama 参数不会继承 OpenAI 参数。`KEYWORD_LLM_BINDING_API_KEY` 对本地 Ollama 通常可以使用占位值;当前跨 provider 校验会要求非 Bedrock 角色显式提供角色级 API key。 + +### 5. 
为 VLM 单独指定多模态模型 + +适合文本任务使用便宜模型,多模态分析使用视觉语言模型: + +```env +VLM_PROCESS_ENABLE=true + +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key + +VLM_LLM_BINDING=openai +VLM_LLM_MODEL=gpt-4o +VLM_OPENAI_LLM_MAX_TOKENS=4096 +VLM_MAX_ASYNC_LLM=2 +VLM_LLM_TIMEOUT=240 +``` + +如果 VLM 使用同一个 provider 和 key,可以省略 `VLM_LLM_BINDING_HOST` 与 `VLM_LLM_BINDING_API_KEY`。 + +`VLM_PROCESS_ENABLE` 是多模态分析的总开关:设为 `false` 时,pipeline 会对每个多模态 item 输出 warning 并跳过,不调用 VLM;设为 `true` 时,生效的 VLM binding(设置了 `VLM_LLM_BINDING` 时取该值,否则取 `LLM_BINDING`)必须支持图片输入。当前支持视觉输入的 provider 包括:`openai`、`azure_openai`、`gemini`、`bedrock`、`ollama`、`anthropic`。`lollms` 无法接收图片输入,会在启动时直接报错。 + +### 6. Bedrock 角色级 SigV4 凭证 + +适合只有某个角色访问 Bedrock,并使用独立 IAM/STS 凭证: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_openai_api_key + +QUERY_LLM_BINDING=bedrock +QUERY_LLM_MODEL=us.amazon.nova-lite-v1:0 +QUERY_LLM_BINDING_HOST=DEFAULT_BEDROCK_ENDPOINT +QUERY_AWS_REGION=us-east-1 +QUERY_AWS_ACCESS_KEY_ID=your_query_access_key +QUERY_AWS_SECRET_ACCESS_KEY=your_query_secret_key +QUERY_AWS_SESSION_TOKEN=your_optional_session_token +QUERY_BEDROCK_LLM_MAX_TOKENS=4096 +QUERY_BEDROCK_LLM_TEMPERATURE=0.2 +``` + +不要设置 `QUERY_LLM_BINDING_API_KEY`,Bedrock 会拒绝该配置。 + +## 注意事项 + +- 同 provider 下,`OPENAI_LLM_REASONING_EFFORT`、`OPENAI_LLM_MAX_TOKENS`、`OLLAMA_LLM_NUM_CTX`、`GEMINI_LLM_THINKING_CONFIG` 等 provider 参数会自动继承。 +- 当前没有干净的角色级“取消继承某个 provider 参数”的语义。如果某个同 provider 角色模型不支持基础参数,需要为该角色显式覆盖为可用值,或将它配置成跨 provider,并且只设置该角色支持的 provider 参数。 +- `azure_openai` 的 `AZURE_OPENAI_DEPLOYMENT` 和 `AZURE_OPENAI_API_VERSION` 是全局环境变量。若设置了 `AZURE_OPENAI_DEPLOYMENT`,它可能优先于角色模型名。 +- Gemini Vertex AI 模式由进程级 Google 环境变量控制,不能在同一个 LightRAG 进程里让某些角色使用 Vertex AI、另一些角色使用 AI Studio API key。 +- `LLM_BINDING_HOST` 在 Docker/Compose 中通常需要使用容器可访问地址,例如 `host.docker.internal`,角色级 host 也遵循相同原则。 +- 修改 `.env` 后请重启 LightRAG Server。部分 IDE 终端会预加载 
`.env`,建议打开新的终端会话确认环境变量生效。 diff --git a/docs/RoleSpecificLLMConfiguration.md b/docs/RoleSpecificLLMConfiguration.md new file mode 100644 index 0000000000..002ced4aa6 --- /dev/null +++ b/docs/RoleSpecificLLMConfiguration.md @@ -0,0 +1,376 @@ +# Role-Specific LLM/VLM Configuration Guide + +LightRAG supports configuring different LLMs or VLMs for different processing stages. This mechanism is useful when using a lower-cost model for extraction, a stronger model for final answers, or a dedicated vision-language model for multimodal analysis. + +## Role Overview + +Four roles are currently supported: + +| Role | Purpose | +| --- | --- | +| `EXTRACT` | Entity/relation extraction and entity/relation description summarization. | +| `KEYWORD` | Query keyword extraction for high-level / low-level keyword generation before retrieval. | +| `QUERY` | Final QA, regular queries, bypass queries, and the query path of the Ollama-compatible API. | +| `VLM` | Multimodal analysis stage for VLM analysis of images, tables, formulas, and similar content. | + +If a role has no dedicated configuration, LightRAG uses the base `LLM_*` configuration. + +## Base LLM Configuration + +The base configuration defines the default LLM provider, model, service endpoint, authentication information, and concurrency control: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key + +# Default timeout for all LLM requests +LLM_TIMEOUT=180 + +# Default maximum concurrency for all LLM calls +MAX_ASYNC=4 +``` + +Common fields: + +| Variable | Description | +| --- | --- | +| `LLM_BINDING` | Base LLM provider. Supported values are `openai`, `ollama`, `lollms`, `azure_openai`, `bedrock`, and `gemini`. | +| `LLM_MODEL` | Base model name. For Azure OpenAI, this is usually the deployment name. | +| `LLM_BINDING_HOST` | Base provider endpoint. 
For SDK default endpoints, use the corresponding sentinel, such as `DEFAULT_GEMINI_ENDPOINT` or `DEFAULT_BEDROCK_ENDPOINT`. | +| `LLM_BINDING_API_KEY` | Base API key. Bedrock does not use this field. | +| `LLM_TIMEOUT` | Base LLM timeout. A role inherits it when no role timeout is set. | +| `MAX_ASYNC` | Base maximum LLM concurrency. A role inherits it when `{ROLE}_MAX_ASYNC_LLM` is not set. | + +## Role Override Variables + +Each role can override the binding, model, endpoint, API key, concurrency, and timeout: + +```env +QUERY_LLM_BINDING=openai +QUERY_LLM_MODEL=gpt-5 +QUERY_LLM_BINDING_HOST=https://api.openai.com/v1 +QUERY_LLM_BINDING_API_KEY=your_query_api_key +QUERY_MAX_ASYNC_LLM=2 +QUERY_LLM_TIMEOUT=240 +``` + +Variable format: + +| Variable | Description | +| --- | --- | +| `{ROLE}_LLM_BINDING` | Overrides the role provider. `ROLE` can be `EXTRACT`, `KEYWORD`, `QUERY`, or `VLM`. | +| `{ROLE}_LLM_MODEL` | Overrides the role model name. | +| `{ROLE}_LLM_BINDING_HOST` | Overrides the role endpoint. | +| `{ROLE}_LLM_BINDING_API_KEY` | Overrides the role API key. Bedrock does not support it. | +| `{ROLE}_MAX_ASYNC_LLM` | Overrides the role maximum concurrency. Inherits `MAX_ASYNC` when unset. | +| `{ROLE}_LLM_TIMEOUT` | Overrides the role timeout. Inherits `LLM_TIMEOUT` when unset. 
| + +## Provider Option Overrides + +Provider-specific options use the following format: + +```env +{ROLE}_{PROVIDER_PREFIX}_{FIELD} +``` + +Examples: + +```env +# Override only the OpenAI reasoning effort for the QUERY role +QUERY_OPENAI_LLM_REASONING_EFFORT=medium + +# Override only Bedrock generation parameters for the EXTRACT role +EXTRACT_BEDROCK_LLM_TEMPERATURE=0.0 +EXTRACT_BEDROCK_LLM_MAX_TOKENS=2048 + +# Override only Gemini generation parameters for the VLM role +VLM_GEMINI_LLM_MAX_OUTPUT_TOKENS=4096 +VLM_GEMINI_LLM_TEMPERATURE=0.2 +``` + +Common provider prefixes: + +| Provider | Base option prefix | Role option example | +| --- | --- | --- | +| `openai` / `azure_openai` | `OPENAI_LLM_*` | `QUERY_OPENAI_LLM_REASONING_EFFORT` | +| `ollama` | `OLLAMA_LLM_*` | `EXTRACT_OLLAMA_LLM_NUM_PREDICT` | +| `lollms` | Uses the Ollama-compatible option set | `QUERY_OLLAMA_LLM_TEMPERATURE` | +| `bedrock` | `BEDROCK_LLM_*` | `EXTRACT_BEDROCK_LLM_MAX_TOKENS` | +| `gemini` | `GEMINI_LLM_*` | `VLM_GEMINI_LLM_THINKING_CONFIG` | + +## Inheritance Rules + +### Overrides Within the Same Provider + +If a role does not set `{ROLE}_LLM_BINDING`, or sets it to the same value as the base `LLM_BINDING`, the role inherits the base configuration: + +- Inherits `LLM_MODEL` when `{ROLE}_LLM_MODEL` is not set. +- Inherits `LLM_BINDING_HOST` when `{ROLE}_LLM_BINDING_HOST` is not set. +- Inherits `LLM_BINDING_API_KEY` when `{ROLE}_LLM_BINDING_API_KEY` is not set. +- Inherits `LLM_TIMEOUT` when `{ROLE}_LLM_TIMEOUT` is not set. +- Inherits `MAX_ASYNC` when `{ROLE}_MAX_ASYNC_LLM` is not set. +- Provider options first inherit the base provider options, then apply role-specific provider options. 
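The same-provider fallback rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not LightRAG's actual implementation: the function name `resolve_role_settings`, the returned dictionary keys, and the choice to treat an empty value as unset are all assumptions made for the example. Only the environment variable names come from this guide.

```python
from typing import Mapping, Optional

ROLES = ("EXTRACT", "KEYWORD", "QUERY", "VLM")


def resolve_role_settings(role: str, env: Mapping[str, str]) -> dict:
    """Resolve connection settings for a role that stays on the base provider.

    Hypothetical helper: each field uses the role-level variable when it is
    set and falls back to the base variable otherwise.
    """
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")

    def pick(role_var: str, base_var: str) -> Optional[str]:
        # Assumption: an empty string counts as "not set".
        return env.get(role_var) or env.get(base_var)

    return {
        "binding": pick(f"{role}_LLM_BINDING", "LLM_BINDING"),
        "model": pick(f"{role}_LLM_MODEL", "LLM_MODEL"),
        "host": pick(f"{role}_LLM_BINDING_HOST", "LLM_BINDING_HOST"),
        "api_key": pick(f"{role}_LLM_BINDING_API_KEY", "LLM_BINDING_API_KEY"),
        "timeout": pick(f"{role}_LLM_TIMEOUT", "LLM_TIMEOUT"),
        "max_async": pick(f"{role}_MAX_ASYNC_LLM", "MAX_ASYNC"),
    }


# Example: QUERY changes only the model and the concurrency limit.
env = {
    "LLM_BINDING": "openai",
    "LLM_MODEL": "gpt-5-mini",
    "LLM_BINDING_HOST": "https://api.openai.com/v1",
    "LLM_BINDING_API_KEY": "your_api_key",
    "LLM_TIMEOUT": "180",
    "MAX_ASYNC": "4",
    "QUERY_LLM_MODEL": "gpt-5",
    "QUERY_MAX_ASYNC_LLM": "2",
}
query = resolve_role_settings("QUERY", env)
print(query)
```

In this sketch `QUERY` ends up with its own model and concurrency limit while the host, API key, and timeout come from the base `LLM_*` values.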
+ +Therefore, when you only want to change the model within the same provider, you only need to set the model name: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key +OPENAI_LLM_REASONING_EFFORT=minimal + +# QUERY inherits host, API key, timeout, concurrency, and OPENAI_LLM_REASONING_EFFORT +QUERY_LLM_MODEL=gpt-5 +``` + +### Cross-Provider Overrides + +If a role's `{ROLE}_LLM_BINDING` differs from the base `LLM_BINDING`, it is a cross-provider configuration. The current rules are: + +- `{ROLE}_LLM_MODEL` must be set. +- Non-Bedrock providers must set `{ROLE}_LLM_BINDING_API_KEY`. +- If `{ROLE}_LLM_BINDING_HOST` is not set, LightRAG tries to use that provider's default host. +- Provider options do not inherit base provider options. They start empty and only apply role-specific provider options. + +Example: use Ollama as the base for local extraction, then use OpenAI for final answers: + +```env +LLM_BINDING=ollama +LLM_MODEL=qwen3.5:9b +LLM_BINDING_HOST=http://localhost:11434 +OLLAMA_LLM_NUM_CTX=32768 + +QUERY_LLM_BINDING=openai +QUERY_LLM_MODEL=gpt-5-mini +QUERY_LLM_BINDING_HOST=https://api.openai.com/v1 +QUERY_LLM_BINDING_API_KEY=your_openai_api_key +QUERY_OPENAI_LLM_REASONING_EFFORT=minimal +``` + +For cross-provider configurations, explicitly setting `{ROLE}_LLM_BINDING_HOST` is recommended to avoid confusion between the default host and the base provider endpoint. + +### Bedrock Authentication Rules + +Bedrock does not use `LLM_BINDING_API_KEY` and does not support `{ROLE}_LLM_BINDING_API_KEY`. Available authentication methods are: + +- Global SigV4: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, and `AWS_REGION`. +- Role-level SigV4: `{ROLE}_AWS_ACCESS_KEY_ID`, `{ROLE}_AWS_SECRET_ACCESS_KEY`, `{ROLE}_AWS_SESSION_TOKEN`, and `{ROLE}_AWS_REGION`. +- Process-level bearer token: `AWS_BEARER_TOKEN_BEDROCK`. 
This is an AWS SDK process-level setting and cannot be overridden per role. + +Role-level Bedrock example: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_openai_api_key + +EXTRACT_LLM_BINDING=bedrock +EXTRACT_LLM_MODEL=us.amazon.nova-lite-v1:0 +EXTRACT_LLM_BINDING_HOST=DEFAULT_BEDROCK_ENDPOINT +EXTRACT_AWS_REGION=us-west-2 +EXTRACT_AWS_ACCESS_KEY_ID=your_extract_access_key +EXTRACT_AWS_SECRET_ACCESS_KEY=your_extract_secret_key +EXTRACT_AWS_SESSION_TOKEN=your_optional_session_token +EXTRACT_BEDROCK_LLM_TEMPERATURE=0.0 +EXTRACT_BEDROCK_LLM_MAX_TOKENS=2048 +``` + +## Provider Behavior Matrix + +| Provider | Role-level host/base_url | Role-level API key | Authentication limitations | +| --- | --- | --- | --- | +| `openai` | Supported, passed to the OpenAI-compatible client through `{ROLE}_LLM_BINDING_HOST`. | Supports `{ROLE}_LLM_BINDING_API_KEY`; when unset within the same provider, it inherits the base `LLM_BINDING_API_KEY`. | Currently mainly API key / Bearer mode. | +| `ollama` | Supported, passed to the Ollama client through `{ROLE}_LLM_BINDING_HOST`. | Supports `{ROLE}_LLM_BINDING_API_KEY`; when unset within the same provider, it inherits the base key. If no key reaches the lower layer, it falls back to `OLLAMA_API_KEY`. | Bearer header. | +| `lollms` | Supported, using `{ROLE}_LLM_BINDING_HOST` as `base_url`. | Supports `{ROLE}_LLM_BINDING_API_KEY`; when unset within the same provider, it inherits the base key. | Bearer header. | +| `azure_openai` | Supported, using `{ROLE}_LLM_BINDING_HOST` as the Azure endpoint. | Supports `{ROLE}_LLM_BINDING_API_KEY`; when unset within the same provider, it inherits the base key and may also fall back to `AZURE_OPENAI_API_KEY`. | `AZURE_OPENAI_API_VERSION` is a global environment variable and does not support role-level overrides. 
| +| `bedrock` | Supported, using `{ROLE}_LLM_BINDING_HOST` as `endpoint_url`; `DEFAULT_BEDROCK_ENDPOINT` means letting the AWS SDK choose. | Generic API keys are not supported. | Uses global or role-level SigV4. `AWS_BEARER_TOKEN_BEDROCK` is process-level and cannot be overridden per role. | +| `gemini` | Supported, passed to the Google GenAI client through `{ROLE}_LLM_BINDING_HOST`; `DEFAULT_GEMINI_ENDPOINT` means using the SDK default endpoint. | AI Studio mode supports `{ROLE}_LLM_BINDING_API_KEY`. | Vertex AI is controlled by `GOOGLE_GENAI_USE_VERTEXAI`, `GOOGLE_CLOUD_PROJECT`, `GOOGLE_CLOUD_LOCATION`, and `GOOGLE_APPLICATION_CREDENTIALS`; all are process-level settings. | + +## Recommended Configuration Patterns + +### 1. Same Provider, Only Change the Model + +Suitable when using the same OpenAI key and endpoint, but using a stronger model for final answers: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key +OPENAI_LLM_REASONING_EFFORT=minimal + +QUERY_LLM_MODEL=gpt-5 +QUERY_MAX_ASYNC_LLM=2 +``` + +`QUERY` inherits the base host, API key, and `OPENAI_LLM_REASONING_EFFORT`. + +### 2. Same Provider, Change the Model and Tune Options + +Suitable when the base model is used for extraction and final answers use a higher reasoning effort: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key +OPENAI_LLM_REASONING_EFFORT=minimal +OPENAI_LLM_MAX_COMPLETION_TOKENS=4096 + +QUERY_LLM_MODEL=gpt-5 +QUERY_OPENAI_LLM_REASONING_EFFORT=medium +QUERY_OPENAI_LLM_MAX_COMPLETION_TOKENS=9000 +QUERY_LLM_TIMEOUT=240 +``` + +### 3. Same Provider with Different Endpoints and API Keys + +Suitable when all roles use the `openai` binding, but some roles access the official OpenAI API while others access a local vLLM, SGLang, OpenRouter, or another OpenAI-compatible endpoint. 
In the example below: + +- `EXTRACT` uses the official OpenAI `gpt-5-mini`. +- `QUERY` uses the official OpenAI `gpt-5.4` with a separate OpenAI key. +- `KEYWORD` uses `Qwen3.5-35B-A3B` deployed by local vLLM. + +```env +########################################################################### +# Base LLM fallback. Keep it aligned with EXTRACT so unspecified roles still +# have a valid OpenAI configuration. +########################################################################### +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_extract_openai_api_key +LLM_TIMEOUT=180 +MAX_ASYNC=4 + +########################################################################### +# IMPORTANT: +# Do not set global OPENAI_LLM_REASONING_EFFORT here if any same-provider role +# points to a local OpenAI-compatible server that does not support it. +# Use role-specific OPENAI options instead. +########################################################################### +# OPENAI_LLM_REASONING_EFFORT=none + +########################################################################### +# EXTRACT: OpenAI official API, gpt-5-mini +########################################################################### +EXTRACT_LLM_BINDING=openai +EXTRACT_LLM_MODEL=gpt-5-mini +EXTRACT_LLM_BINDING_HOST=https://api.openai.com/v1 +EXTRACT_LLM_BINDING_API_KEY=your_extract_openai_api_key +EXTRACT_OPENAI_LLM_REASONING_EFFORT=low +EXTRACT_OPENAI_LLM_MAX_COMPLETION_TOKENS=4096 +EXTRACT_MAX_ASYNC_LLM=4 +EXTRACT_LLM_TIMEOUT=180 + +########################################################################### +# QUERY: OpenAI official API, gpt-5.4, separate API key +########################################################################### +QUERY_LLM_BINDING=openai +QUERY_LLM_MODEL=gpt-5.4 +QUERY_LLM_BINDING_HOST=https://api.openai.com/v1 +QUERY_LLM_BINDING_API_KEY=your_query_openai_api_key +QUERY_OPENAI_LLM_REASONING_EFFORT=medium 
+QUERY_OPENAI_LLM_MAX_COMPLETION_TOKENS=9000 +QUERY_MAX_ASYNC_LLM=2 +QUERY_LLM_TIMEOUT=240 + +########################################################################### +# KEYWORD: local vLLM OpenAI-compatible endpoint, Qwen3.5-35B-A3B +########################################################################### +KEYWORD_LLM_BINDING=openai +KEYWORD_LLM_MODEL=Qwen3.5-35B-A3B +KEYWORD_LLM_BINDING_HOST=http://localhost:8000/v1 +# If vLLM was started with --api-key, use the same value here. +# If vLLM has no auth, still set a non-empty dummy value to avoid falling +# back to the official OpenAI key. +KEYWORD_LLM_BINDING_API_KEY=local-vllm-api-key +KEYWORD_OPENAI_LLM_MAX_TOKENS=2048 +# Optional for Qwen-style models served by vLLM when you want to disable thinking. +KEYWORD_OPENAI_LLM_EXTRA_BODY='{"chat_template_kwargs": {"enable_thinking": false}}' +KEYWORD_MAX_ASYNC_LLM=4 +KEYWORD_LLM_TIMEOUT=180 +``` + +This pattern is not cross-provider because all three roles use the `openai` binding. LightRAG passes each role's `*_LLM_BINDING_HOST` and `*_LLM_BINDING_API_KEY` to the OpenAI-compatible client separately. + +Note: provider options within the same provider inherit the base `OPENAI_LLM_*`. If the local vLLM server does not support official OpenAI parameters such as `reasoning_effort`, do not set the global `OPENAI_LLM_REASONING_EFFORT`; use role-level variables such as `EXTRACT_OPENAI_LLM_REASONING_EFFORT` and `QUERY_OPENAI_LLM_REASONING_EFFORT` instead. + +### 4. 
One Role Crosses Provider + +Suitable when the base uses an official OpenAI model and only keyword extraction uses local Ollama: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_openai_api_key +OPENAI_LLM_REASONING_EFFORT=medium + +KEYWORD_LLM_BINDING=ollama +KEYWORD_LLM_MODEL=qwen3.5:9b +KEYWORD_LLM_BINDING_HOST=http://localhost:11434 +KEYWORD_LLM_BINDING_API_KEY=ollama-local-key +KEYWORD_OLLAMA_LLM_NUM_CTX=32768 +``` + +For cross-provider configurations, Ollama options do not inherit OpenAI options. For local Ollama, `KEYWORD_LLM_BINDING_API_KEY` can usually use a placeholder value; the current cross-provider validation requires non-Bedrock roles to explicitly provide a role-level API key. + +### 5. Specify a Dedicated Multimodal Model for VLM + +Suitable when text tasks use a cheaper model and multimodal analysis uses a vision-language model: + +```env +VLM_PROCESS_ENABLE=true + +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_api_key + +VLM_LLM_BINDING=openai +VLM_LLM_MODEL=gpt-4o +VLM_OPENAI_LLM_MAX_TOKENS=4096 +VLM_MAX_ASYNC_LLM=2 +VLM_LLM_TIMEOUT=240 +``` + +If VLM uses the same provider and key, `VLM_LLM_BINDING_HOST` and `VLM_LLM_BINDING_API_KEY` can be omitted. + +`VLM_PROCESS_ENABLE` is the master switch for multimodal analysis. When `false`, the pipeline emits a warning and skips every multimodal item without invoking the VLM. When `true`, the effective VLM binding (`VLM_LLM_BINDING` if set, otherwise `LLM_BINDING`) must support image inputs. The following providers are vision-capable: `openai`, `azure_openai`, `gemini`, `bedrock`, `ollama`, `anthropic`. `lollms` is rejected at startup because it cannot accept image inputs. + +### 6. 
Bedrock Role-Level SigV4 Credentials + +Suitable when only one role accesses Bedrock and uses independent IAM/STS credentials: + +```env +LLM_BINDING=openai +LLM_MODEL=gpt-5-mini +LLM_BINDING_HOST=https://api.openai.com/v1 +LLM_BINDING_API_KEY=your_openai_api_key + +QUERY_LLM_BINDING=bedrock +QUERY_LLM_MODEL=us.amazon.nova-lite-v1:0 +QUERY_LLM_BINDING_HOST=DEFAULT_BEDROCK_ENDPOINT +QUERY_AWS_REGION=us-east-1 +QUERY_AWS_ACCESS_KEY_ID=your_query_access_key +QUERY_AWS_SECRET_ACCESS_KEY=your_query_secret_key +QUERY_AWS_SESSION_TOKEN=your_optional_session_token +QUERY_BEDROCK_LLM_MAX_TOKENS=4096 +QUERY_BEDROCK_LLM_TEMPERATURE=0.2 +``` + +Do not set `QUERY_LLM_BINDING_API_KEY`; Bedrock rejects that configuration. + +## Caveats + +- Within the same provider, provider options such as `OPENAI_LLM_REASONING_EFFORT`, `OPENAI_LLM_MAX_TOKENS`, `OLLAMA_LLM_NUM_CTX`, and `GEMINI_LLM_THINKING_CONFIG` are inherited automatically. +- There is currently no clean role-level way to unset an inherited provider option. If a model in a same-provider role does not support a base option, explicitly override that option for the role with a supported value, or configure the role as cross-provider and set only the role-specific provider options it supports. +- `AZURE_OPENAI_DEPLOYMENT` and `AZURE_OPENAI_API_VERSION` for `azure_openai` are global environment variables. If `AZURE_OPENAI_DEPLOYMENT` is set, it may take precedence over the role model name. +- Gemini Vertex AI mode is controlled by process-level Google environment variables. A single LightRAG process therefore cannot have some roles use Vertex AI while other roles use AI Studio API keys. +- In Docker/Compose, `LLM_BINDING_HOST` usually needs to use a container-reachable address such as `host.docker.internal`; role-level hosts follow the same principle. +- Restart LightRAG Server after modifying `.env`. Some IDE terminals preload `.env`, so opening a new terminal session is recommended to confirm that environment variables take effect. 
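The provider-option inheritance described in the caveats above can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than LightRAG's real code: `resolve_provider_options`, the `PROVIDER_PREFIX` mapping, and the lowercased option keys are hypothetical names chosen for the example, while the prefixes themselves follow the provider prefix table in this guide.

```python
from typing import Dict, Mapping

# Option prefixes per binding, taken from the provider prefix table.
PROVIDER_PREFIX = {
    "openai": "OPENAI_LLM_",
    "azure_openai": "OPENAI_LLM_",
    "ollama": "OLLAMA_LLM_",
    "lollms": "OLLAMA_LLM_",
    "bedrock": "BEDROCK_LLM_",
    "gemini": "GEMINI_LLM_",
}


def resolve_provider_options(role: str, env: Mapping[str, str]) -> Dict[str, str]:
    """Merge provider options for a role (hypothetical helper).

    Same provider: start from the base options, then apply role overrides.
    Cross provider: start empty and apply only role-level options.
    """
    base_binding = env["LLM_BINDING"]
    role_binding = env.get(f"{role}_LLM_BINDING") or base_binding
    prefix = PROVIDER_PREFIX[role_binding]

    options: Dict[str, str] = {}
    if role_binding == base_binding:
        # Inherit every base option of this provider.
        for name, value in env.items():
            if name.startswith(prefix):
                options[name[len(prefix):].lower()] = value

    # Role-level options always win over inherited ones.
    role_prefix = f"{role}_{prefix}"
    for name, value in env.items():
        if name.startswith(role_prefix):
            options[name[len(role_prefix):].lower()] = value
    return options


base = {
    "LLM_BINDING": "openai",
    "OPENAI_LLM_REASONING_EFFORT": "minimal",
    "OPENAI_LLM_MAX_COMPLETION_TOKENS": "4096",
}
same_provider = resolve_provider_options(
    "QUERY", {**base, "QUERY_OPENAI_LLM_REASONING_EFFORT": "medium"}
)
cross_provider = resolve_provider_options(
    "KEYWORD",
    {**base, "KEYWORD_LLM_BINDING": "ollama", "KEYWORD_OLLAMA_LLM_NUM_CTX": "32768"},
)
print(same_provider)
print(cross_provider)
```

Under these assumptions the same-provider role can replace an inherited option but has no way to drop it, which is the limitation the second caveat describes; the cross-provider role starts from an empty option set.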
diff --git a/env.docker-compose-full b/env.docker-compose-full index c532886447..b4b0715e0d 100644 --- a/env.docker-compose-full +++ b/env.docker-compose-full @@ -24,9 +24,10 @@ WEBUI_DESCRIPTION='Simple and Fast Graph Based RAG System' # SSL_KEYFILE=/path/to/key.pem ### Directory Configuration (defaults to current working directory) -### Default value is ./inputs and ./rag_storage +### Default value is: ./inputs ./rag_storage ./prompts # INPUT_DIR= # WORKING_DIR= +# PROMPT_DIR= ### Tiktoken cache directory (Store cached files in this folder for offline deployment) # TIKTOKEN_CACHE_DIR=/app/data/tiktoken @@ -129,6 +130,13 @@ RERANK_BINDING_API_KEY=sk-your_rerank_api_key_here ### Enable rerank by default in query params when RERANK_BINDING is not null # RERANK_BY_DEFAULT=True +### Rerank concurrency and timeout (independent from base LLM settings) +### MAX_ASYNC_RERANK falls back to MAX_ASYNC when unset. +### RERANK_TIMEOUT has its own default (30s) since reranker calls are +### typically much shorter than full LLM generation. 
+# MAX_ASYNC_RERANK=4 +# RERANK_TIMEOUT=30 + ### Cohere AI # # RERANK_MODEL=rerank-v3.5 # # RERANK_BINDING_HOST=https://api.cohere.com/v2/rerank @@ -189,9 +197,6 @@ SUMMARY_LANGUAGE=English ### Note: If using Nginx as reverse proxy, also configure client_max_body_size # MAX_UPLOAD_SIZE=104857600 -### Entity types that the LLM will attempt to recognize -# ENTITY_TYPES='["Person", "Creature", "Organization", "Location", "Event", "Concept", "Method", "Content", "Data", "Artifact", "NaturalObject"]' - ### Chunk size for document splitting, 500~1500 is recommended # CHUNK_SIZE=1200 # CHUNK_OVERLAP_SIZE=100 @@ -207,6 +212,75 @@ SUMMARY_LANGUAGE=English ### Maximum token size allowed for entity extraction input context # MAX_EXTRACT_INPUT_TOKENS=20480 +### Per-response cap on total entity+relationship rows/records emitted by the LLM +# MAX_EXTRACTION_RECORDS=100 +### Per-response cap on entity rows/objects emitted by the LLM +# MAX_EXTRACTION_ENTITIES=40 + +### Use JSON structured output for entity extraction (false: default, JSON is slower but more reliable) +ENTITY_EXTRACTION_USE_JSON=true + +### Optional external YAML profile for entity type guidance and extraction examples +### Profiles are loaded from PROMPT_DIR/entity_type (PROMPT_DIR defaults to ./prompts). +### A reference template is shipped at prompts/samples/entity_type_prompt.sample.yml; +### Alternatively, override guidance at runtime from Python: +### addon_params={"entity_types_guidance": "- CustomType: description..."} +# ENTITY_TYPE_PROMPT_FILE=entity_type_prompt.yml + +### Multimodal parsing/analyze integration +### Optional parser routing rules, for example: +### pdf:mineru,docx:docling,pptx:docling,xlsx:docling,*:legacy +### Rules may be separated with commas or semicolons. Rules match file suffixes +### (pdf), not full names (*.pdf), and are checked left-to-right. +### If mineru/docling appears in LIGHTRAG_PARSER, the corresponding endpoint +### below must be configured before server startup. 
+# LIGHTRAG_PARSER= +### Master switch for VLM multimodal analysis (i/t/e items) +# VLM_PROCESS_ENABLE=false +### Maximum image bytes sent to VLM per multimodal item +# VLM_MAX_IMAGE_BYTES=5242880 + +### Async parser service protocol (optional) +### Configure these when using remote MinerU/Docling async services +# MinerU upload / poll protocol (all backward-compatible defaults). +# MINERU_ENDPOINT=http://localhost:8000/api/v1/task +# MINERU_POLL_ENDPOINT=http://localhost:8000/api/v1/task/{trace_id} +# MINERU_POLL_METHOD=GET +# MINERU_ID_FIELD=trace_id +# MINERU_STATUS_FIELD=status +# MINERU_RESULT_URL_FIELD=result_url +# MINERU_CONTENT_FIELD=content +# MINERU_FILE_FIELD=file +# MINERU_SUCCESS_VALUES=done,success,completed +# MINERU_FAILED_VALUES=failed,error,cancelled +# MINERU_POLL_INTERVAL_SECONDS=2 +# MINERU_MAX_POLLS=180 + +# MinerU raw bundle handling (introduced with unified sidecar pipeline). +# See env.example for descriptions. +# MINERU_RESULT_MODE=auto +# MINERU_IMAGE_URL_TEMPLATE= +# MINERU_ENGINE_VERSION= +# LIGHTRAG_FORCE_REPARSE_MINERU=false + +# DOCLING_* defaults assume docling-serve v1. If DOCLING_ENDPOINT does not end +# with /v1/convert/file/async, also set DOCLING_POLL_ENDPOINT and +# DOCLING_RESULT_ENDPOINT explicitly — the base URL used to derive them is the +# upload endpoint with that suffix stripped. 
+# DOCLING_ENDPOINT=http://localhost:5001/v1/convert/file/async +# DOCLING_POLL_ENDPOINT=http://localhost:5001/v1/status/poll/{task_id} +# DOCLING_POLL_METHOD=GET +# DOCLING_ID_FIELD=task_id +# DOCLING_STATUS_FIELD=task_status +# DOCLING_RESULT_URL_FIELD=result_url +# DOCLING_RESULT_ENDPOINT=http://localhost:5001/v1/result/{task_id} +# DOCLING_CONTENT_FIELD=document.md_content +# DOCLING_FILE_FIELD=files +# DOCLING_SUCCESS_VALUES=done,success,completed +# DOCLING_FAILED_VALUES=failed,error,cancelled +# DOCLING_POLL_INTERVAL_SECONDS=2 +# DOCLING_MAX_POLLS=180 + ### control the maximum chunk_ids stored in vector and graph db # MAX_SOURCE_IDS_PER_ENTITY=300 # MAX_SOURCE_IDS_PER_RELATION=300 @@ -230,6 +304,14 @@ MAX_ASYNC=8 ### Number of parallel processing documents(between 2~10, MAX_ASYNC/3 is recommended) MAX_PARALLEL_INSERT=3 # MAX_PARALLEL_INSERT=2 +### Optional per-stage document pipeline concurrency +# MAX_PARALLEL_PARSE_NATIVE=5 +# MAX_PARALLEL_PARSE_MINERU=1 +# MAX_PARALLEL_PARSE_DOCLING=1 +# MAX_PARALLEL_ANALYZE=5 +### Optional queue sizes for staged pipeline workers +# QUEUE_SIZE_DEFAULT=100 +# QUEUE_SIZE_INSERT=4 ### Max concurrency requests for Embedding # EMBEDDING_FUNC_MAX_ASYNC=8 ### Num of chunks send to Embedding in single request @@ -237,8 +319,8 @@ MAX_PARALLEL_INSERT=3 ########################################################################### ### LLM Configuration -### LLM_BINDING type: openai, ollama, lollms, azure_openai, aws_bedrock, gemini -### LLM_BINDING_HOST: Service endpoint (left empty if using default endpoint provided by openai or gemini SDK) +### LLM_BINDING type: openai, ollama, lollms, azure_openai, bedrock, gemini +### LLM_BINDING_HOST: Service endpoint (left empty if using the provider SDK default endpoint) ### LLM_BINDING_API_KEY: api key ### If LightRAG deployed in Docker: ### uses host.docker.internal instead of localhost in LLM_BINDING_HOST @@ -265,7 +347,7 @@ LLM_MODEL=gpt-5-mini # OPENAI_LLM_TEMPERATURE=0.9 ### Set the 
max_tokens to mitigate endless output of some LLM (less than LLM_TIMEOUT * llm_output_tokens/second, i.e. 9000 = 180s * 50 tokens/s) ### Typically, max_tokens does not include prompt content -### For vLLM/SGLang deployed models, or most of OpenAI compatible API provider +### For vLLM/SGLang and most OpenAI-compatible API providers # OPENAI_LLM_MAX_TOKENS=9000 ### For OpenAI o1-mini or newer modles utilizes max_completion_tokens instead of max_tokens # OPENAI_LLM_MAX_COMPLETION_TOKENS=9000 @@ -286,8 +368,9 @@ LLM_MODEL=gpt-5-mini ### Google Gemini example (AI Studio) # # LLM_BINDING=gemini +### DEFAULT_GEMINI_ENDPOINT means the SDK selects the endpoint automatically +# # LLM_BINDING_HOST=DEFAULT_GEMINI_ENDPOINT # # LLM_BINDING_API_KEY=your_gemini_api_key -# # LLM_BINDING_HOST=https://generativelanguage.googleapis.com # # LLM_MODEL=gemini-flash-latest ### use the following command to see all support options for OpenAI, azure_openai or OpenRouter @@ -301,7 +384,6 @@ LLM_MODEL=gpt-5-mini ### Google Vertex AI example ### Vertex AI use GOOGLE_APPLICATION_CREDENTIALS instead of API-KEY for authentication -### LLM_BINDING_HOST=DEFAULT_GEMINI_ENDPOINT means select endpoit based on project and location automatically # # LLM_BINDING=gemini # # LM_BINDING_HOST=https://aiplatform.googleapis.com ### or use DEFAULT_GEMINI_ENDPOINT to select endpoint based on project and location automatically @@ -329,19 +411,53 @@ OLLAMA_LLM_NUM_CTX=32768 # OLLAMA_LLM_STOP='["", "<|EOT|>"]' ### Bedrock Specific Parameters -### Bedrock uses AWS credentials from the environment / AWS credential chain. -### It does not use LLM_BINDING_API_KEY. -# # LLM_BINDING=aws_bedrock -# # LLM_MODEL=anthropic.claude-3-5-sonnet-20241022-v2:0 +### Bedrock reads AWS credentials from the environment / AWS credential chain. 
+# # LLM_BINDING=bedrock +# # LLM_BINDING_HOST=DEFAULT_BEDROCK_ENDPOINT +# # LLM_MODEL=us.amazon.nova-lite-v1:0 +### Authentication (choose ONE of the following two approaches): +### Bedrock API key (bearer token). Bedrock ignores LLM_BINDING_API_KEY; +### set AWS_BEARER_TOKEN_BEDROCK directly before startup. This is a +### process-level AWS SDK setting and cannot be overridden per role. +# AWS_BEARER_TOKEN_BEDROCK=your_bedrock_api_key +### SigV4 credentials (classic IAM user / STS / instance profile). # AWS_ACCESS_KEY_ID=your_aws_access_key_id # AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key # AWS_SESSION_TOKEN=your_optional_aws_session_token -# AWS_REGION=us-east-1 +### Region is required for both auth modes (Bedrock endpoints are regional). +# AWS_REGION=us-west-1 +### use the following command to see all supported options for Bedrock +### lightrag-server --llm-binding bedrock --help +### Bedrock Converse API inferenceConfig (leave max_tokens unset to use the model default) # BEDROCK_LLM_TEMPERATURE=1.0 +# BEDROCK_LLM_MAX_TOKENS=9000 +# BEDROCK_LLM_TOP_P=1.0 +# BEDROCK_LLM_STOP_SEQUENCES='[""]' +### Model-specific request fields forwarded as Converse API additionalModelRequestFields +# BEDROCK_LLM_EXTRA_FIELDS='{"reasoningConfig": {"type": "enabled", "maxReasoningEffort": "low"}}' + +########################################################################### +### Optional role-specific LLM/VLM overrides +### If unset, each role falls back to the base LLM_* configuration above. +### Available roles: EXTRACT, KEYWORD, QUERY, VLM +### Note: {ROLE}_MAX_ASYNC_LLM, when unset, inherits the base MAX_ASYNC +### value at runtime (it is NOT capped at the commented default below). 
+### For EXTRACT, KEYWORD, QUERY, cross-provider configuration, provider +### options, and inheritance rules, see: +### docs/RoleSpecificLLMConfiguration.md +### docs/RoleSpecificLLMConfiguration-zh.md +########################################################################### +### Example: use a dedicated model/provider for multimodal analysis +# VLM_LLM_BINDING=openai +# VLM_LLM_MODEL=your_vlm_model +# VLM_LLM_BINDING_HOST=https://api.example.com/v1 +# VLM_LLM_BINDING_API_KEY=your_vlm_api_key +# VLM_MAX_ASYNC_LLM=4 +# VLM_LLM_TIMEOUT=180 ####################################################################################### ### Embedding Configuration (Should not be changed after the first file processed) -### EMBEDDING_BINDING: ollama, openai, azure_openai, jina, lollms, aws_bedrock +### EMBEDDING_BINDING: ollama, openai, azure_openai, jina, lollms, bedrock ### EMBEDDING_BINDING_HOST: Service endpoint (left empty if using default endpoint provided by openai or gemini SDK) ### EMBEDDING_BINDING_API_KEY: api key ### If LightRAG deployed in Docker: @@ -350,6 +466,9 @@ OLLAMA_LLM_NUM_CTX=32768 ### For OpenAI: Set EMBEDDING_SEND_DIM=true to enable dynamic dimension adjustment ### For OpenAI: Set EMBEDDING_SEND_DIM=false (default) to disable sending dimension parameter ### For Gemini: Allways set EMBEDDING_SEND_DIM=true +### Control whether to use base64 encoding format for embeddings (improves performance for OpenAI) +### For OpenAI: Set EMBEDDING_USE_BASE64=true (default) to use base64 encoding +### For Yandex Cloud and other providers that don't support it: Set EMBEDDING_USE_BASE64=false ####################################################################################### # EMBEDDING_TIMEOUT=30 @@ -365,6 +484,7 @@ EMBEDDING_DIM=1024 # EMBEDDING_DIM=3072 EMBEDDING_TOKEN_LIMIT=8192 EMBEDDING_SEND_DIM=false +# EMBEDDING_USE_BASE64=true ### Optional for Azure Embedding ### Use deployment name as model name or set AZURE_EMBEDDING_DEPLOYMENT instead @@ -380,7 +500,8 
@@ EMBEDDING_SEND_DIM=false # # EMBEDDING_MODEL=gemini-embedding-001 # # EMBEDDING_DIM=1536 # # EMBEDDING_TOKEN_LIMIT=2048 -# # EMBEDDING_BINDING_HOST=https://generativelanguage.googleapis.com +### DEFAULT_GEMINI_ENDPOINT means the endpoint is selected by the SDK automatically +# # EMBEDDING_BINDING_HOST=DEFAULT_GEMINI_ENDPOINT # # EMBEDDING_BINDING_API_KEY=your_api_key ### Gemini embedding requires sending dimension to server # # EMBEDDING_SEND_DIM=true @@ -397,15 +518,23 @@ OLLAMA_EMBEDDING_NUM_CTX=8192 ### lightrag-server --embedding-binding ollama --help ### Bedrock embedding -### Bedrock uses AWS credentials from the environment / AWS credential chain. -### It does not use EMBEDDING_BINDING_API_KEY. -# # EMBEDDING_BINDING=aws_bedrock +### Bedrock reads AWS credentials from the environment / AWS credential chain. +# # EMBEDDING_BINDING=bedrock +# # EMBEDDING_BINDING_HOST=DEFAULT_BEDROCK_ENDPOINT +# # Or set EMBEDDING_BINDING_HOST to a custom Bedrock-compatible proxy/gateway URL # # EMBEDDING_MODEL=amazon.titan-embed-text-v2:0 # # EMBEDDING_DIM=1024 +### Authentication (choose ONE of the following two approaches): +### Bedrock API key (bearer token). Bedrock ignores EMBEDDING_BINDING_API_KEY; +### set AWS_BEARER_TOKEN_BEDROCK directly before startup. This is a +### process-level AWS SDK setting and cannot be overridden per role. +# AWS_BEARER_TOKEN_BEDROCK=your_bedrock_api_key +### SigV4 credentials (classic IAM user / STS / instance profile). # AWS_ACCESS_KEY_ID=your_aws_access_key_id # AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key # AWS_SESSION_TOKEN=your_optional_aws_session_token -# AWS_REGION=us-east-1 +### Region is required for both auth modes (Bedrock endpoints are regional).
+# AWS_REGION=us-west-1 ### Jina AI Embedding # # EMBEDDING_BINDING=jina @@ -414,6 +543,19 @@ OLLAMA_EMBEDDING_NUM_CTX=8192 # # EMBEDDING_DIM=2048 # # EMBEDDING_BINDING_API_KEY=your_api_key +### Optional: asymmetric embeddings (query/document behavior split) +### Leave EMBEDDING_ASYMMETRIC unset or set false to keep symmetric behavior. +### Set true only when the selected embedding backend supports asymmetric mode. +# EMBEDDING_ASYMMETRIC=true +### Provider-task bindings such as Jina/Gemini/VoyageAI use provider parameters +### and should not configure the prefix variables below. +### Prefix-based models such as BGE/E5/GTE require both prefix variables. +### Use NO_PREFIX for a side that should intentionally have no prefix. +### Wrap non-empty values with quotes if there are trailing spaces. +# EMBEDDING_DOCUMENT_PREFIX="search_document: " +# EMBEDDING_QUERY_PREFIX="search_query: " +# # EMBEDDING_DOCUMENT_PREFIX=NO_PREFIX + #################################################################### ### WORKSPACE sets workspace name for all storage types ### for the purpose of isolating data from LightRAG instances. diff --git a/env.example b/env.example index a15c35dc3a..65d838a6c5 100644 --- a/env.example +++ b/env.example @@ -35,9 +35,10 @@ WEBUI_DESCRIPTION='Simple and Fast Graph Based RAG System' # SSL_KEYFILE=/path/to/key.pem ### Directory Configuration (defaults to current working directory) -### Default value is ./inputs and ./rag_storage +### Default value is: ./inputs ./rag_storage ./prompts # INPUT_DIR= # WORKING_DIR= +# PROMPT_DIR= ### Tiktoken cache directory (Store cached files in this folder for offline deployment) # TIKTOKEN_CACHE_DIR=/app/data/tiktoken @@ -139,6 +140,13 @@ RERANK_BINDING=null ### Enable rerank by default in query params when RERANK_BINDING is not null # RERANK_BY_DEFAULT=True +### Rerank concurrency and timeout (independent from base LLM settings) +### MAX_ASYNC_RERANK falls back to MAX_ASYNC when unset. 
+### RERANK_TIMEOUT has its own default (30s) since reranker calls are +### typically much shorter than full LLM generation. +# MAX_ASYNC_RERANK=4 +# RERANK_TIMEOUT=30 + ### Cohere AI # # RERANK_MODEL=rerank-v3.5 # # RERANK_BINDING_HOST=https://api.cohere.com/v2/rerank @@ -199,13 +207,201 @@ SUMMARY_LANGUAGE=English ### Note: If using Nginx as reverse proxy, also configure client_max_body_size # MAX_UPLOAD_SIZE=104857600 -### Entity types that the LLM will attempt to recognize -# ENTITY_TYPES='["Person", "Creature", "Organization", "Location", "Event", "Concept", "Method", "Content", "Data", "Artifact", "NaturalObject"]' - ### Chunk size for document splitting, 500~1500 is recommended # CHUNK_SIZE=1200 # CHUNK_OVERLAP_SIZE=100 +### Fixed-token chunker (process_options=F, default) settings +### CHUNK_F_OVERLAP_SIZE: token overlap; falls back to CHUNK_OVERLAP_SIZE when unset +### CHUNK_F_SPLIT_BY_CHARACTER: optional separator string; pre-segment before token windowing +### CHUNK_F_SPLIT_BY_CHARACTER_ONLY: when true, raise on oversize segment instead of token re-split +# CHUNK_F_OVERLAP_SIZE=100 +# CHUNK_F_SPLIT_BY_CHARACTER= +# CHUNK_F_SPLIT_BY_CHARACTER_ONLY=false + +### Recursive character chunker (process_options=R) settings +### CHUNK_R_SIZE: per-strategy chunk_token_size override; falls back to CHUNK_SIZE when unset +### CHUNK_R_OVERLAP_SIZE: token overlap between adjacent chunks; falls back to CHUNK_OVERLAP_SIZE when unset +### CHUNK_R_SEPARATORS: JSON array of cascaded separators tried by RecursiveCharacterTextSplitter. +### Default includes CJK sentence-ending punctuation so Chinese / mixed-language +### documents split at semantic boundaries. Order: paragraph (\n\n) > line (\n) > +### Chinese sentence-end (。！？) > Chinese semi-clause (；，) > space > char. +### English ".?!" are intentionally omitted (literal match would split "0.95" / +### "e.g."); the English path falls through space / char as before.
+# CHUNK_R_SIZE=1200 +# CHUNK_R_OVERLAP_SIZE=100 +# CHUNK_R_SEPARATORS=["\n\n","\n","。","！","？","；","，"," ",""] + +### Semantic vector chunker (process_options=V) settings +### CHUNK_V_SIZE: per-strategy chunk_token_size hard cap (oversized pieces are +### re-split via R before being emitted); falls back to CHUNK_SIZE when unset +### CHUNK_V_BREAKPOINT_THRESHOLD_TYPE: percentile | standard_deviation | interquartile | gradient +### CHUNK_V_BREAKPOINT_THRESHOLD_AMOUNT: leave empty to use the LangChain per-type default (e.g. 95 for percentile) +### CHUNK_V_BUFFER_SIZE: number of adjacent sentences combined when computing distances +### CHUNK_V_SENTENCE_SPLIT_REGEX: regex fed to LangChain SemanticChunker for the +### initial sentence split. Default extends the upstream English-only pattern +### with CJK sentence-end punctuation (。？！). Override if you need a +### different language mix. Note: env value is the raw regex string, no JSON +### quoting. +# CHUNK_V_SIZE=1200 +# CHUNK_V_BREAKPOINT_THRESHOLD_TYPE=percentile +# CHUNK_V_BREAKPOINT_THRESHOLD_AMOUNT= +# CHUNK_V_BUFFER_SIZE=1 +# CHUNK_V_SENTENCE_SPLIT_REGEX=(?<=[.?!])\s+|(?<=[。？！]) + +### Paragraph semantic chunker (process_options=P) settings +### CHUNK_P_SIZE: per-strategy chunk_token_size override; defaults to 2000 when unset +### (does NOT fall back to CHUNK_SIZE — paragraph-semantic merging needs more +### headroom than the global default to keep related paragraphs together). +### CHUNK_P_OVERLAP_SIZE: overlap for prose fallback and table-bridge context; falls back to CHUNK_OVERLAP_SIZE when unset +# CHUNK_P_SIZE=2000 +# CHUNK_P_OVERLAP_SIZE=100 + +### Multimodal parsing/analyze integration +### Optional parser routing rules, for example: +### LIGHTRAG_PARSER=pdf:mineru,docx:docling,pptx:docling,xlsx:docling,*:legacy +### Rules may be separated with commas or semicolons. Rules match file suffixes +### (pdf), not full names (*.pdf), and are checked left-to-right.
+### If mineru/docling appears in LIGHTRAG_PARSER, the corresponding endpoint +### below must be configured before server startup. +### See docs/FileProcessingPipeline.md for detail +LIGHTRAG_PARSER=*:native-teP,*:legacy-R + +### Async parser service protocol (optional) +### Configure these when using remote MinerU/Docling async services + +### ---- MinerU shared parameters (both local and official modes) ---- +### MinerU API protocol. Choose one active mode. +### - official: MinerU precision API v4. Requires MINERU_API_TOKEN. +### - local: self-hosted mineru-api / mineru-router base URL. +MINERU_API_MODE=local +# MINERU_POLL_INTERVAL_SECONDS=2 +# MINERU_MAX_POLLS=180 +# MINERU_LANGUAGE=ch +# MINERU_ENABLE_TABLE=true +# MINERU_ENABLE_FORMULA=true +# MINERU_PAGE_RANGES= +### MINERU_PAGE_RANGES semantics differ by mode: +### - official: forwarded verbatim, supports e.g. "1-3,5,7-9". +### - local: only a single page ("3") or simple range ("1-10"); comma +### lists are rejected at startup. +### When switching modes, double-check this constraint. + +### ---- MinerU local-only (MINERU_API_MODE=local) ---- +MINERU_LOCAL_ENDPOINT=http://127.0.0.1:8000 +### MINERU_LOCAL_BACKEND: which mineru-api backend handles the parse. +### Accepted values (per mineru-api POST /tasks form parameter `backend`): +### hybrid-auto-engine - pipeline + VLM combo with auto-selected local +### engine (mineru-api's default). GPU required. +### pipeline - CPU-friendly traditional pipeline; no VLM step. +### vlm-auto-engine - VLM with auto-selected local inference engine +### (sglang-engine / vllm-engine if GPU is available); +### requires the matching engine extra preinstalled +### on the mineru-api side, plus model weights. +### We ship `hybrid-auto-engine` -- requires the target mineru-api +### deployment to have a GPU plus the matching inference engine +### (sglang / vllm) and model weights installed. Switch to `pipeline` +### for CPU-only deployments without those dependencies. 
+MINERU_LOCAL_BACKEND=hybrid-auto-engine +### MINERU_LOCAL_PARSE_METHOD: parsing strategy for the pipeline component. +### Accepted values: +### auto - auto-detect embedded text-layer vs OCR per page (default). +### txt - extract text from the embedded text layer only; fastest, +### but yields empty output on scanned PDFs without a text layer. +### ocr - force OCR on every page regardless of text-layer quality; +### slowest, reliable on scanned or low-quality PDFs. +### Only consumed when MINERU_LOCAL_BACKEND is `pipeline` or +### `hybrid-auto-engine` (the pipeline arm of the hybrid pipeline). +### Pure VLM backends (`vlm-auto-engine`, `vlm-http-client`) ignore this +### parameter -- the VLM model handles layout/OCR natively. +MINERU_LOCAL_PARSE_METHOD=auto +### MINERU_LOCAL_IMAGE_ANALYSIS: enable VLM image/chart analysis pass. +### Only consumed by `vlm-auto-engine`, `vlm-http-client`, +### `hybrid-auto-engine`, `hybrid-http-client`. The `pipeline` backend +### silently drops this flag -- its `_process_pipeline` does not accept +### the kwarg, so setting `false` under pipeline does NOT speed parsing +### up; pipeline never invokes the VLM image pass to begin with. +### Disable (`false`) on VLM / hybrid backends to skip the extra VLM +### round, trading image / chart semantic descriptions for faster parsing +### and lower GPU cost. +MINERU_LOCAL_IMAGE_ANALYSIS=true +# MINERU_LOCAL_START_PAGE_ID=0 +# MINERU_LOCAL_END_PAGE_ID=99999 + +### ---- MinerU official-only (MINERU_API_MODE=official) ---- +# MINERU_API_TOKEN=your-api-key +# MINERU_OFFICIAL_ENDPOINT=https://mineru.net +# MINERU_MODEL_VERSION=vlm +# MINERU_IS_OCR=false + +### MinerU raw bundle handling (introduced with unified sidecar pipeline). +### - MINERU_IMAGE_URL_TEMPLATE: optional fallback for content_list image refs +### that are not present in the downloaded zip. Supports {name} and {path}. +### - MINERU_ENGINE_VERSION: recorded in .mineru_raw/_manifest.json. 
+### Mismatch with the recorded value forces a cache miss → re-download. +### Leave empty to skip this check. +### - LIGHTRAG_FORCE_REPARSE_MINERU: when truthy ("1"/"true"), bypass the +### mineru raw cache and re-upload on every parse_mineru call. +# MINERU_IMAGE_URL_TEMPLATE= +# MINERU_ENGINE_VERSION= +# LIGHTRAG_FORCE_REPARSE_MINERU=false + +### Docling parser (docling-serve v1 / async API). +### +### Endpoint: base URL only — the client appends /v1/convert/file/async, +### /v1/status/poll/{task_id}?wait=, +### /v1/result/{task_id} itself. +### Pipeline shape (pipeline=standard, target_type=zip, +### to_formats=[json,md], image_export_mode=referenced) is fixed in +### code so the sidecar flow stays self-consistent — flipping any of +### these would break the adapter and is therefore not exposed as env. +### +### OCR tunables: +### - DOCLING_DO_OCR: master switch; when false the engine relies only on +### text-layer extraction. +### - DOCLING_FORCE_OCR: when true, OCR every page regardless of text-layer +### quality (slower, useful for scanned PDFs with bad text layers). +### - DOCLING_OCR_ENGINE: explicit engine selection (DEPRECATED in the +### docling-serve OpenAPI but still honored for older deployments). +### - DOCLING_OCR_PRESET: recommended replacement for DOCLING_OCR_ENGINE. +### - DOCLING_OCR_LANG: JSON array (e.g. ["en","zh"]) or comma-separated +### list. Empty (default) lets the OCR engine pick its default. +### - DOCLING_DO_FORMULA_ENRICHMENT: when true, the code-formula model runs +### and `texts[*].label="formula"` items carry LaTeX in `text`. Default +### false because the model may not be present on every deployment; +### adapter falls back to plain-text formulas when disabled. +### +### Polling budget (server-side long-poll; client does NOT add extra sleep): +### - DOCLING_POLL_INTERVAL_SECONDS: ``?wait=N`` value sent to +### /v1/status/poll/{task_id}. Larger N = fewer round trips per parse; +### bound by your reverse-proxy idle timeout. 
Default 5. +### - DOCLING_MAX_POLLS: max polling rounds before raising TimeoutError. +### Worst-case wall-clock budget ≈ +### DOCLING_POLL_INTERVAL_SECONDS × DOCLING_MAX_POLLS. Default 240 +### (≈ 20 minutes at wait=5s); raise for very large PDFs. +### +### Bundle cache controls: +### - DOCLING_ENGINE_VERSION: recorded in .docling_raw/_manifest.json. +### Mismatch with the recorded value forces a cache miss → re-download. +### Leave empty to skip this check. +### - LIGHTRAG_FORCE_REPARSE_DOCLING: when truthy ("1"/"true"), bypass the +### docling raw cache and re-upload on every parse_docling call. +### - DOCLING_BBOX_ATTRIBUTES: override the doc-level bbox_attributes +### written into .blocks.jsonl meta. Default +### {"origin":"LEFTBOTTOM"} matches docling's default coordinate system. +DOCLING_ENDPOINT=http://localhost:5001 +DOCLING_DO_OCR=true +DOCLING_FORCE_OCR=true +DOCLING_DO_FORMULA_ENRICHMENT=false +# DOCLING_OCR_ENGINE=auto +# DOCLING_OCR_PRESET=auto +# DOCLING_OCR_LANG= +# DOCLING_POLL_INTERVAL_SECONDS=5 +# DOCLING_MAX_POLLS=240 +# DOCLING_ENGINE_VERSION= +# LIGHTRAG_FORCE_REPARSE_DOCLING=false +# DOCLING_BBOX_ATTRIBUTES={"origin":"LEFTBOTTOM"} + ### Number of summary segments or tokens to trigger LLM summary on entity/relation merge (at least 3 is recommended) # FORCE_LLM_SUMMARY_ON_MERGE=8 ### Max description token size to trigger LLM summary @@ -217,6 +413,29 @@ SUMMARY_LANGUAGE=English ### Maximum token size allowed for entity extraction input context # MAX_EXTRACT_INPUT_TOKENS=20480 +### Multimodal surrounding-context budget (per-half token cap for the +### `leading` / `trailing` text injected into VLM and extract prompts). +### Computed at analyze_multimodal entry; the two halves are independent +### so deployments can bias context forward or backward as needed. 
+# SURROUNDING_LEADING_MAX_TOKENS=2000 +# SURROUNDING_TRAILING_MAX_TOKENS=2000 + +### Per-response cap on total entity+relationship rows/records emitted by the LLM +# MAX_EXTRACTION_RECORDS=100 +### Per-response cap on entity rows/objects emitted by the LLM +# MAX_EXTRACTION_ENTITIES=40 + +### Enable JSON-structured output for entity extraction (Note: JSON output incurs higher latency but delivers improved reliability) +### Default behavior: JSON output is disabled when ENTITY_EXTRACTION_USE_JSON is unset +ENTITY_EXTRACTION_USE_JSON=true + +### Optional external YAML profile for entity type guidance and extraction examples +### Profiles are loaded from PROMPT_DIR/entity_type (PROMPT_DIR defaults to ./prompts). +### A reference template is shipped at prompts/samples/entity_type_prompt.sample.yml; +### Alternatively, override guidance at runtime from Python: +### addon_params={"entity_types_guidance": "- CustomType: description..."} +# ENTITY_TYPE_PROMPT_FILE=entity_type_prompt.yml + ### control the maximum chunk_ids stored in vector and graph db # MAX_SOURCE_IDS_PER_ENTITY=300 # MAX_SOURCE_IDS_PER_RELATION=300 @@ -238,6 +457,14 @@ SUMMARY_LANGUAGE=English MAX_ASYNC=4 ### Number of parallel processing documents(between 2~10, MAX_ASYNC/3 is recommended) MAX_PARALLEL_INSERT=2 +### Optional per-stage document pipeline concurrency +# MAX_PARALLEL_PARSE_NATIVE=5 +# MAX_PARALLEL_PARSE_MINERU=1 +# MAX_PARALLEL_PARSE_DOCLING=1 +# MAX_PARALLEL_ANALYZE=5 +### Optional queue sizes for staged pipeline workers +# QUEUE_SIZE_DEFAULT=100 +# QUEUE_SIZE_INSERT=4 ### Max concurrency requests for Embedding # EMBEDDING_FUNC_MAX_ASYNC=8 ### Num of chunks send to Embedding in single request @@ -245,8 +472,8 @@ MAX_PARALLEL_INSERT=2 ########################################################################### ### LLM Configuration -### LLM_BINDING type: openai, ollama, lollms, azure_openai, aws_bedrock, gemini -### LLM_BINDING_HOST: Service endpoint (left empty if using default endpoint 
provided by openai or gemini SDK) +### LLM_BINDING type: openai, ollama, lollms, azure_openai, bedrock, gemini +### LLM_BINDING_HOST: Service endpoint (left empty if using the provider SDK default endpoint) ### LLM_BINDING_API_KEY: api key ### If LightRAG deployed in Docker: ### uses host.docker.internal instead of localhost in LLM_BINDING_HOST @@ -273,7 +500,7 @@ LLM_MODEL=gpt-5-mini # OPENAI_LLM_TEMPERATURE=0.9 ### Set the max_tokens to mitigate endless output of some LLM (less than LLM_TIMEOUT * llm_output_tokens/second, i.e. 9000 = 180s * 50 tokens/s) ### Typically, max_tokens does not include prompt content -### For vLLM/SGLang deployed models, or most of OpenAI compatible API provider +### For vLLM/SGLang and most OpenAI-compatible API providers # OPENAI_LLM_MAX_TOKENS=9000 ### For OpenAI o1-mini or newer modles utilizes max_completion_tokens instead of max_tokens # OPENAI_LLM_MAX_COMPLETION_TOKENS=9000 @@ -294,8 +521,9 @@ LLM_MODEL=gpt-5-mini ### Google Gemini example (AI Studio) # # LLM_BINDING=gemini +### DEFAULT_GEMINI_ENDPOINT means the endpoint is selected by the SDK automatically +# # LLM_BINDING_HOST=DEFAULT_GEMINI_ENDPOINT # # LLM_BINDING_API_KEY=your_gemini_api_key -# # LLM_BINDING_HOST=https://generativelanguage.googleapis.com # # LLM_MODEL=gemini-flash-latest ### use the following command to see all support options for OpenAI, azure_openai or OpenRouter @@ -309,7 +537,6 @@ LLM_MODEL=gpt-5-mini ### Google Vertex AI example ### Vertex AI use GOOGLE_APPLICATION_CREDENTIALS instead of API-KEY for authentication -### LLM_BINDING_HOST=DEFAULT_GEMINI_ENDPOINT means select endpoit based on project and location automatically # # LLM_BINDING=gemini # # LM_BINDING_HOST=https://aiplatform.googleapis.com ### or use DEFAULT_GEMINI_ENDPOINT to select endpoint based on project and location automatically @@ -337,19 +564,64 @@ OLLAMA_LLM_NUM_CTX=32768 # OLLAMA_LLM_STOP='["", "<|EOT|>"]' ### Bedrock Specific Parameters -### Bedrock uses AWS credentials from the environment /
AWS credential chain. -### It does not use LLM_BINDING_API_KEY. -# # LLM_BINDING=aws_bedrock -# # LLM_MODEL=anthropic.claude-3-5-sonnet-20241022-v2:0 +### Bedrock reads AWS credentials from the environment / AWS credential chain. +# # LLM_BINDING=bedrock +# # LLM_BINDING_HOST=DEFAULT_BEDROCK_ENDPOINT +# # LLM_MODEL=us.amazon.nova-lite-v1:0 +### Authentication (choose ONE of the following two approaches): +### Bedrock API key (bearer token). Bedrock ignores LLM_BINDING_API_KEY; +### set AWS_BEARER_TOKEN_BEDROCK directly before startup. This is a +### process-level AWS SDK setting and cannot be overridden per role. +# AWS_BEARER_TOKEN_BEDROCK=your_bedrock_api_key +### SigV4 credentials (classic IAM user / STS / instance profile). # AWS_ACCESS_KEY_ID=your_aws_access_key_id # AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key # AWS_SESSION_TOKEN=your_optional_aws_session_token -# AWS_REGION=us-east-1 +### Region is required for both auth modes (Bedrock endpoints are regional). +# AWS_REGION=us-west-1 +### use the following command to see all supported options for Bedrock +### lightrag-server --llm-binding bedrock --help +### Bedrock Converse API inferenceConfig (leave max_tokens unset to use the model default) # BEDROCK_LLM_TEMPERATURE=1.0 +# BEDROCK_LLM_MAX_TOKENS=9000 +# BEDROCK_LLM_TOP_P=1.0 +# BEDROCK_LLM_STOP_SEQUENCES='[""]' +### Model-specific request fields forwarded as Converse API additionalModelRequestFields +# BEDROCK_LLM_EXTRA_FIELDS='{"reasoningConfig": {"type": "enabled", "maxReasoningEffort": "low"}}' + +########################################################################### +### Optional role-specific LLM/VLM overrides +### If unset, each role falls back to the base LLM_* configuration above. +### Available roles: EXTRACT, KEYWORD, QUERY, VLM +### Note: {ROLE}_MAX_ASYNC_LLM, when unset, inherits the base MAX_ASYNC +### value at runtime (it is NOT capped at the commented default below). +### Role timeout variable format: {ROLE}_LLM_TIMEOUT +### e.g. 
EXTRACT_LLM_TIMEOUT, KEYWORD_LLM_TIMEOUT, QUERY_LLM_TIMEOUT, +### VLM_LLM_TIMEOUT. +### For EXTRACT, KEYWORD, QUERY, cross-provider configuration, provider +### options, and inheritance rules, see: +### docs/RoleSpecificLLMConfiguration.md +### docs/RoleSpecificLLMConfiguration-zh.md +########################################################################### +### Master switch for VLM multimodal analysis (i/t/e items). +### When false, every multimodal item is skipped with a warning regardless of +### the per-document process_options. When true, VLM_LLM_BINDING (or the base +### LLM_BINDING) must be vision-capable; lollms is rejected at startup. +# VLM_PROCESS_ENABLE=false + +### Example: use a dedicated model/provider for multimodal analysis +# VLM_LLM_BINDING=openai +# VLM_LLM_MODEL=your_vlm_model +# VLM_LLM_BINDING_HOST=https://api.example.com/v1 +# VLM_LLM_BINDING_API_KEY=your_vlm_api_key +# VLM_MAX_ASYNC_LLM=4 +# VLM_LLM_TIMEOUT=180 +### Maximum image bytes sent to VLM (5242880=5MB) +# VLM_MAX_IMAGE_BYTES=5242880 ####################################################################################### ### Embedding Configuration (Should not be changed after the first file processed) -### EMBEDDING_BINDING: ollama, openai, azure_openai, jina, lollms, aws_bedrock +### EMBEDDING_BINDING: ollama, openai, azure_openai, jina, lollms, bedrock ### EMBEDDING_BINDING_HOST: Service endpoint (left empty if using default endpoint provided by openai or gemini SDK) ### EMBEDDING_BINDING_API_KEY: api key ### If LightRAG deployed in Docker: @@ -388,7 +660,8 @@ EMBEDDING_SEND_DIM=false # # EMBEDDING_MODEL=gemini-embedding-001 # # EMBEDDING_DIM=1536 # # EMBEDDING_TOKEN_LIMIT=2048 -# # EMBEDDING_BINDING_HOST=https://generativelanguage.googleapis.com +### DEFAULT_GEMINI_ENDPOINT means the endpoint is selected by the SDK automatically +# # EMBEDDING_BINDING_HOST=DEFAULT_GEMINI_ENDPOINT # # EMBEDDING_BINDING_API_KEY=your_api_key ### Gemini embedding requires sending dimension to server # #
EMBEDDING_SEND_DIM=true @@ -405,15 +678,23 @@ OLLAMA_EMBEDDING_NUM_CTX=8192 ### lightrag-server --embedding-binding ollama --help ### Bedrock embedding -### Bedrock uses AWS credentials from the environment / AWS credential chain. -### It does not use EMBEDDING_BINDING_API_KEY. -# # EMBEDDING_BINDING=aws_bedrock +### Bedrock reads AWS credentials from the environment / AWS credential chain. +# # EMBEDDING_BINDING=bedrock +# # EMBEDDING_BINDING_HOST=DEFAULT_BEDROCK_ENDPOINT +# # Or set EMBEDDING_BINDING_HOST to a custom Bedrock-compatible proxy/gateway URL # # EMBEDDING_MODEL=amazon.titan-embed-text-v2:0 # # EMBEDDING_DIM=1024 +### Authentication (choose ONE of the following two approaches): +### Bedrock API key (bearer token). Bedrock ignores EMBEDDING_BINDING_API_KEY; +### set AWS_BEARER_TOKEN_BEDROCK directly before startup. This is a +### process-level AWS SDK setting and cannot be overridden per role. +# AWS_BEARER_TOKEN_BEDROCK=your_bedrock_api_key +### SigV4 credentials (classic IAM user / STS / instance profile). # AWS_ACCESS_KEY_ID=your_aws_access_key_id # AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key # AWS_SESSION_TOKEN=your_optional_aws_session_token -# AWS_REGION=us-east-1 +### Region is required for both auth modes (Bedrock endpoints are regional). +# AWS_REGION=us-west-1 ### Jina AI Embedding # # EMBEDDING_BINDING=jina diff --git a/examples/lightrag_gemini_workspace_demo.py b/examples/lightrag_gemini_workspace_demo.py index f5b571079d..5cc6d87aaa 100644 --- a/examples/lightrag_gemini_workspace_demo.py +++ b/examples/lightrag_gemini_workspace_demo.py @@ -9,12 +9,11 @@ which ensures that Knowledge Graphs, Vector Databases, and Chunks are stored in separate, non-conflicting directories. - Independent Configuration: Different workspaces can utilize different - ENTITY_TYPES and document sets simultaneously. + entity type guidance and document sets simultaneously. Prerequisites: 1. 
Set the following environment variables: - GEMINI_API_KEY: Your Google Gemini API key. - - ENTITY_TYPES: A JSON string of entity categories (e.g., '["Person", "Organization"]'). 2. Ensure your data directory contains: - Data/book-small.txt - Data/HR_policies.txt @@ -25,12 +24,10 @@ import os import asyncio -import json import numpy as np from lightrag import LightRAG, QueryParam from lightrag.llm.gemini import gemini_model_complete, gemini_embed from lightrag.utils import wrap_embedding_func_with_attrs -from lightrag.constants import DEFAULT_ENTITY_TYPES async def llm_model_func( @@ -59,25 +56,14 @@ async def embedding_func(texts: list[str]) -> np.ndarray: async def initialize_rag( workspace: str = "default_workspace", - entities=None, ) -> LightRAG: """ Initializes a LightRAG instance with data isolation. - - entities (if provided) overrides everything - - else ENTITY_TYPES env var is used - - else DEFAULT_ENTITY_TYPES is used + Entity type guidance can be customized by passing + addon_params={'entity_types_guidance': '...'} to LightRAG. """ - if entities is not None: - entity_types = entities - else: - env_entities = os.getenv("ENTITY_TYPES") - if env_entities: - entity_types = json.loads(env_entities) - else: - entity_types = DEFAULT_ENTITY_TYPES - rag = LightRAG( workspace=workspace, llm_model_name="gemini-2.0-flash", @@ -86,7 +72,6 @@ async def initialize_rag( embedding_func_max_async=4, embedding_batch_num=8, llm_model_max_async=2, - addon_params={"entity_types": entity_types}, ) await rag.initialize_storages() diff --git a/examples/modalprocessors_example.py b/examples/modalprocessors_example.py deleted file mode 100644 index bc11021e5f..0000000000 --- a/examples/modalprocessors_example.py +++ /dev/null @@ -1,229 +0,0 @@ -""" -Example of directly using modal processors - -This example demonstrates how to use LightRAG's modal processors directly without going through MinerU. 
-""" - -import asyncio -import argparse -from lightrag.llm.openai import openai_complete_if_cache, openai_embed -from lightrag import LightRAG -from lightrag.utils import EmbeddingFunc -from raganything.modalprocessors import ( - ImageModalProcessor, - TableModalProcessor, - EquationModalProcessor, -) - -WORKING_DIR = "./rag_storage" - - -def get_llm_model_func(api_key: str, base_url: str = None): - return lambda prompt, system_prompt=None, history_messages=[], **kwargs: ( - openai_complete_if_cache( - "gpt-4o-mini", - prompt, - system_prompt=system_prompt, - history_messages=history_messages, - api_key=api_key, - base_url=base_url, - **kwargs, - ) - ) - - -def get_vision_model_func(api_key: str, base_url: str = None): - return ( - lambda prompt, - system_prompt=None, - history_messages=[], - image_data=None, - **kwargs: ( - openai_complete_if_cache( - "gpt-4o", - "", - system_prompt=None, - history_messages=[], - messages=[ - {"role": "system", "content": system_prompt} - if system_prompt - else None, - { - "role": "user", - "content": [ - {"type": "text", "text": prompt}, - { - "type": "image_url", - "image_url": { - "url": f"data:image/jpeg;base64,{image_data}" - }, - }, - ], - } - if image_data - else {"role": "user", "content": prompt}, - ], - api_key=api_key, - base_url=base_url, - **kwargs, - ) - if image_data - else openai_complete_if_cache( - "gpt-4o-mini", - prompt, - system_prompt=system_prompt, - history_messages=history_messages, - api_key=api_key, - base_url=base_url, - **kwargs, - ) - ) - ) - - -async def process_image_example(lightrag: LightRAG, vision_model_func): - """Example of processing an image""" - # Create image processor - image_processor = ImageModalProcessor( - lightrag=lightrag, modal_caption_func=vision_model_func - ) - - # Prepare image content - image_content = { - "img_path": "image.jpg", - "img_caption": ["Example image caption"], - "img_footnote": ["Example image footnote"], - } - - # Process image - description, entity_info = 
await image_processor.process_multimodal_content( - modal_content=image_content, - content_type="image", - file_path="image_example.jpg", - entity_name="Example Image", - ) - - print("Image Processing Results:") - print(f"Description: {description}") - print(f"Entity Info: {entity_info}") - - -async def process_table_example(lightrag: LightRAG, llm_model_func): - """Example of processing a table""" - # Create table processor - table_processor = TableModalProcessor( - lightrag=lightrag, modal_caption_func=llm_model_func - ) - - # Prepare table content - table_content = { - "table_body": """ - | Name | Age | Occupation | - |------|-----|------------| - | John | 25 | Engineer | - | Mary | 30 | Designer | - """, - "table_caption": ["Employee Information Table"], - "table_footnote": ["Data updated as of 2024"], - } - - # Process table - description, entity_info = await table_processor.process_multimodal_content( - modal_content=table_content, - content_type="table", - file_path="table_example.md", - entity_name="Employee Table", - ) - - print("\nTable Processing Results:") - print(f"Description: {description}") - print(f"Entity Info: {entity_info}") - - -async def process_equation_example(lightrag: LightRAG, llm_model_func): - """Example of processing a mathematical equation""" - # Create equation processor - equation_processor = EquationModalProcessor( - lightrag=lightrag, modal_caption_func=llm_model_func - ) - - # Prepare equation content - equation_content = {"text": "E = mc^2", "text_format": "LaTeX"} - - # Process equation - description, entity_info = await equation_processor.process_multimodal_content( - modal_content=equation_content, - content_type="equation", - file_path="equation_example.txt", - entity_name="Mass-Energy Equivalence", - ) - - print("\nEquation Processing Results:") - print(f"Description: {description}") - print(f"Entity Info: {entity_info}") - - -async def initialize_rag(api_key: str, base_url: str = None): - rag = LightRAG( - 
working_dir=WORKING_DIR, - embedding_func=EmbeddingFunc( - embedding_dim=3072, - max_token_size=8192, - func=lambda texts: openai_embed( - texts, - model="text-embedding-3-large", - api_key=api_key, - base_url=base_url, - ), - ), - llm_model_func=lambda prompt, - system_prompt=None, - history_messages=[], - **kwargs: ( - openai_complete_if_cache( - "gpt-4o-mini", - prompt, - system_prompt=system_prompt, - history_messages=history_messages, - api_key=api_key, - base_url=base_url, - **kwargs, - ) - ), - ) - - await rag.initialize_storages() # Auto-initializes pipeline_status - return rag - - -def main(): - """Main function to run the example""" - parser = argparse.ArgumentParser(description="Modal Processors Example") - parser.add_argument("--api-key", required=True, help="OpenAI API key") - parser.add_argument("--base-url", help="Optional base URL for API") - parser.add_argument( - "--working-dir", "-w", default=WORKING_DIR, help="Working directory path" - ) - - args = parser.parse_args() - - # Run examples - asyncio.run(main_async(args.api_key, args.base_url)) - - -async def main_async(api_key: str, base_url: str = None): - # Initialize LightRAG - lightrag = await initialize_rag(api_key, base_url) - - # Get model functions - llm_model_func = get_llm_model_func(api_key, base_url) - vision_model_func = get_vision_model_func(api_key, base_url) - - # Run examples - await process_image_example(lightrag, vision_model_func) - await process_table_example(lightrag, llm_model_func) - await process_equation_example(lightrag, llm_model_func) - - -if __name__ == "__main__": - main() diff --git a/examples/raganything_example.py b/examples/raganything_example.py deleted file mode 100644 index f61274a8e9..0000000000 --- a/examples/raganything_example.py +++ /dev/null @@ -1,286 +0,0 @@ -#!/usr/bin/env python -""" -Example script demonstrating the integration of MinerU parser with RAGAnything - -This example shows how to: -1. Process parsed documents with RAGAnything -2. 
Perform multimodal queries on the processed documents -3. Handle different types of content (text, images, tables) -""" - -import os -import argparse -import asyncio -import logging -import logging.config -from pathlib import Path - -# Add project root directory to Python path -import sys - -sys.path.append(str(Path(__file__).parent.parent)) - -from lightrag.llm.openai import openai_complete_if_cache, openai_embed -from lightrag.utils import EmbeddingFunc, logger, set_verbose_debug -from raganything import RAGAnything, RAGAnythingConfig - - -def configure_logging(): - """Configure logging for the application""" - # Get log directory path from environment variable or use current directory - log_dir = os.getenv("LOG_DIR", os.getcwd()) - log_file_path = os.path.abspath(os.path.join(log_dir, "raganything_example.log")) - - print(f"\nRAGAnything example log file: {log_file_path}\n") - os.makedirs(os.path.dirname(log_dir), exist_ok=True) - - # Get log file max size and backup count from environment variables - log_max_bytes = int(os.getenv("LOG_MAX_BYTES", 10485760)) # Default 10MB - log_backup_count = int(os.getenv("LOG_BACKUP_COUNT", 5)) # Default 5 backups - - logging.config.dictConfig( - { - "version": 1, - "disable_existing_loggers": False, - "formatters": { - "default": { - "format": "%(levelname)s: %(message)s", - }, - "detailed": { - "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s", - }, - }, - "handlers": { - "console": { - "formatter": "default", - "class": "logging.StreamHandler", - "stream": "ext://sys.stderr", - }, - "file": { - "formatter": "detailed", - "class": "logging.handlers.RotatingFileHandler", - "filename": log_file_path, - "maxBytes": log_max_bytes, - "backupCount": log_backup_count, - "encoding": "utf-8", - }, - }, - "loggers": { - "lightrag": { - "handlers": ["console", "file"], - "level": "INFO", - "propagate": False, - }, - }, - } - ) - - # Set the logger level to INFO - logger.setLevel(logging.INFO) - # Enable verbose debug if 
needed - set_verbose_debug(os.getenv("VERBOSE", "false").lower() == "true") - - -async def process_with_rag( - file_path: str, - output_dir: str, - api_key: str, - base_url: str = None, - working_dir: str = None, -): - """ - Process document with RAGAnything - - Args: - file_path: Path to the document - output_dir: Output directory for RAG results - api_key: OpenAI API key - base_url: Optional base URL for API - working_dir: Working directory for RAG storage - """ - try: - # Create RAGAnything configuration - config = RAGAnythingConfig( - working_dir=working_dir or "./rag_storage", - mineru_parse_method="auto", - enable_image_processing=True, - enable_table_processing=True, - enable_equation_processing=True, - ) - - # Define LLM model function - def llm_model_func(prompt, system_prompt=None, history_messages=[], **kwargs): - return openai_complete_if_cache( - "gpt-4o-mini", - prompt, - system_prompt=system_prompt, - history_messages=history_messages, - api_key=api_key, - base_url=base_url, - **kwargs, - ) - - # Define vision model function for image processing - def vision_model_func( - prompt, system_prompt=None, history_messages=[], image_data=None, **kwargs - ): - if image_data: - return openai_complete_if_cache( - "gpt-4o", - "", - system_prompt=None, - history_messages=[], - messages=[ - {"role": "system", "content": system_prompt} - if system_prompt - else None, - { - "role": "user", - "content": [ - {"type": "text", "text": prompt}, - { - "type": "image_url", - "image_url": { - "url": f"data:image/jpeg;base64,{image_data}" - }, - }, - ], - } - if image_data - else {"role": "user", "content": prompt}, - ], - api_key=api_key, - base_url=base_url, - **kwargs, - ) - else: - return llm_model_func(prompt, system_prompt, history_messages, **kwargs) - - # Define embedding function - embedding_func = EmbeddingFunc( - embedding_dim=3072, - max_token_size=8192, - func=lambda texts: openai_embed( - texts, - model="text-embedding-3-large", - api_key=api_key, - 
base_url=base_url, - ), - ) - - # Initialize RAGAnything with new dataclass structure - rag = RAGAnything( - config=config, - llm_model_func=llm_model_func, - vision_model_func=vision_model_func, - embedding_func=embedding_func, - ) - - # Process document - await rag.process_document_complete( - file_path=file_path, output_dir=output_dir, parse_method="auto" - ) - - # Example queries - demonstrating different query approaches - logger.info("\nQuerying processed document:") - - # 1. Pure text queries using aquery() - text_queries = [ - "What is the main content of the document?", - "What are the key topics discussed?", - ] - - for query in text_queries: - logger.info(f"\n[Text Query]: {query}") - result = await rag.aquery(query, mode="hybrid") - logger.info(f"Answer: {result}") - - # 2. Multimodal query with specific multimodal content using aquery_with_multimodal() - logger.info( - "\n[Multimodal Query]: Analyzing performance data in context of document" - ) - multimodal_result = await rag.aquery_with_multimodal( - "Compare this performance data with any similar results mentioned in the document", - multimodal_content=[ - { - "type": "table", - "table_data": """Method,Accuracy,Processing_Time - RAGAnything,95.2%,120ms - Traditional_RAG,87.3%,180ms - Baseline,82.1%,200ms""", - "table_caption": "Performance comparison results", - } - ], - mode="hybrid", - ) - logger.info(f"Answer: {multimodal_result}") - - # 3. 
Another multimodal query with equation content - logger.info("\n[Multimodal Query]: Mathematical formula analysis") - equation_result = await rag.aquery_with_multimodal( - "Explain this formula and relate it to any mathematical concepts in the document", - multimodal_content=[ - { - "type": "equation", - "latex": "F1 = 2 \\cdot \\frac{precision \\cdot recall}{precision + recall}", - "equation_caption": "F1-score calculation formula", - } - ], - mode="hybrid", - ) - logger.info(f"Answer: {equation_result}") - - except Exception as e: - logger.error(f"Error processing with RAG: {str(e)}") - import traceback - - logger.error(traceback.format_exc()) - - -def main(): - """Main function to run the example""" - parser = argparse.ArgumentParser(description="MinerU RAG Example") - parser.add_argument("file_path", help="Path to the document to process") - parser.add_argument( - "--working_dir", "-w", default="./rag_storage", help="Working directory path" - ) - parser.add_argument( - "--output", "-o", default="./output", help="Output directory path" - ) - parser.add_argument( - "--api-key", - default=os.getenv("OPENAI_API_KEY"), - help="OpenAI API key (defaults to OPENAI_API_KEY env var)", - ) - parser.add_argument("--base-url", help="Optional base URL for API") - - args = parser.parse_args() - - # Check if API key is provided - if not args.api_key: - logger.error("Error: OpenAI API key is required") - logger.error("Set OPENAI_API_KEY environment variable or use --api-key option") - return - - # Create output directory if specified - if args.output: - os.makedirs(args.output, exist_ok=True) - - # Process with RAG - asyncio.run( - process_with_rag( - args.file_path, args.output, args.api_key, args.base_url, args.working_dir - ) - ) - - -if __name__ == "__main__": - # Configure logging first - configure_logging() - - print("RAGAnything Example") - print("=" * 30) - print("Processing document with multimodal RAG pipeline") - print("=" * 30) - - main() diff --git 
a/lightrag/__init__.py b/lightrag/__init__.py index e269f250cb..a2e0fc0e7b 100644 --- a/lightrag/__init__.py +++ b/lightrag/__init__.py @@ -2,17 +2,40 @@ from ._version import __version__ as __version__ -__all__ = ["LightRAG", "QueryParam", "__version__"] +__all__ = [ + "LightRAG", + "QueryParam", + "RoleLLMConfig", + "RoleSpec", + "ROLES", + "__version__", +] if TYPE_CHECKING: - from .lightrag import LightRAG as LightRAG, QueryParam as QueryParam + from .lightrag import ( + LightRAG as LightRAG, + QueryParam as QueryParam, + ROLES as ROLES, + RoleLLMConfig as RoleLLMConfig, + RoleSpec as RoleSpec, + ) -def __getattr__(name: str) -> Any: - if name in {"LightRAG", "QueryParam"}: - from .lightrag import LightRAG, QueryParam +_LAZY_EXPORTS = {"LightRAG", "QueryParam", "RoleLLMConfig", "RoleSpec", "ROLES"} + - value = {"LightRAG": LightRAG, "QueryParam": QueryParam}[name] +def __getattr__(name: str) -> Any: + if name in _LAZY_EXPORTS: + from .lightrag import LightRAG, QueryParam, RoleLLMConfig, RoleSpec, ROLES + + values = { + "LightRAG": LightRAG, + "QueryParam": QueryParam, + "RoleLLMConfig": RoleLLMConfig, + "RoleSpec": RoleSpec, + "ROLES": ROLES, + } + value = values[name] globals()[name] = value return value raise AttributeError(f"module {__name__!r} has no attribute {name!r}") diff --git a/lightrag/_version.py b/lightrag/_version.py index 5af48762f3..e7e36bd735 100644 --- a/lightrag/_version.py +++ b/lightrag/_version.py @@ -1,4 +1,4 @@ """Lightweight version definitions shared by packaging and runtime code.""" -__version__ = "1.4.16" -__api_version__ = "0292" +__version__ = "1.5.0" +__api_version__ = "0295" diff --git a/lightrag/addon_params.py b/lightrag/addon_params.py new file mode 100644 index 0000000000..84c49c0a53 --- /dev/null +++ b/lightrag/addon_params.py @@ -0,0 +1,157 @@ +"""Addon parameters: observable mapping + normalization helper. 
+ +``addon_params`` is a free-form configuration dict on :class:`LightRAG` that +controls things like summary language and entity-type prompt overrides. The +module exposes: + +- :class:`ObservableAddonParams` — a ``dict`` subclass that calls a callback + whenever the contents change so the LightRAG runtime can invalidate cached + derived state. +- :func:`default_addon_params` — environment-driven defaults. +- :func:`normalize_addon_params` — converts an arbitrary input into a plain + ``dict`` with the env-driven defaults backfilled. +""" + +from __future__ import annotations + +from typing import Any, Callable, Mapping + +from lightrag.constants import DEFAULT_SUMMARY_LANGUAGE +from lightrag.utils import get_env_value, logger + + +# Keys that used to live in addon_params but have been superseded by +# per-document ``process_options``. We log once when callers still pass them +# so existing configs surface their drift without breaking. +_DEPRECATED_ADDON_PARAM_KEYS: tuple[str, ...] = ("enable_multimodal_pipeline",) +_warned_deprecated_keys: set[str] = set() + + +def _emit_deprecated_addon_warnings(params: Mapping[str, Any]) -> None: + for key in _DEPRECATED_ADDON_PARAM_KEYS: + if key in params and key not in _warned_deprecated_keys: + logger.warning( + f"addon_params['{key}'] is deprecated and ignored; per-document " + f"behaviour is now controlled by filename-hint process_options " + f"(see docs/FileProcessingConfiguration-zh.md)." + ) + _warned_deprecated_keys.add(key) + + +def default_addon_params() -> dict[str, Any]: + # Lazy import to avoid the parser_routing → utils → … cycle that + # would otherwise form when parser_routing imports back into this + # module via ``LightRAG`` construction paths. 
+ from lightrag.parser_routing import default_chunker_config + + return { + "language": get_env_value("SUMMARY_LANGUAGE", DEFAULT_SUMMARY_LANGUAGE, str), + "entity_type_prompt_file": get_env_value("ENTITY_TYPE_PROMPT_FILE", "", str), + # Per-strategy chunker parameters; mutate at runtime (e.g. + # ``rag.addon_params["chunker"]["recursive_character"]["separators"] + # = [...]``) to change defaults applied to subsequently + # enqueued documents. Per-document snapshots are persisted to + # ``full_docs[doc_id]["chunk_options"]`` at enqueue time and + # are not affected by later runtime mutations. + "chunker": default_chunker_config(), + } + + +def normalize_addon_params(addon_params: Mapping[str, Any] | None) -> dict[str, Any]: + """Coerce ``addon_params`` to a plain dict with env defaults backfilled.""" + from lightrag.parser_routing import default_chunker_config + + if addon_params is None: + normalized = default_addon_params() + elif isinstance(addon_params, Mapping): + _emit_deprecated_addon_warnings(addon_params) + normalized = { + k: v + for k, v in addon_params.items() + if k not in _DEPRECATED_ADDON_PARAM_KEYS + } + else: + raise TypeError( + "addon_params must be a Mapping or None, got " + f"{type(addon_params).__name__}" + ) + + # When the caller supplies addon_params explicitly, the dataclass + # default_factory is skipped — fall back to environment variables so + # ENTITY_TYPE_PROMPT_FILE / SUMMARY_LANGUAGE / chunker still apply. + normalized.setdefault( + "language", get_env_value("SUMMARY_LANGUAGE", DEFAULT_SUMMARY_LANGUAGE, str) + ) + normalized.setdefault( + "entity_type_prompt_file", + get_env_value("ENTITY_TYPE_PROMPT_FILE", "", str), + ) + # Build the chunker default lazily — `default_chunker_config()` reads env + # vars (e.g. CHUNK_R_SEPARATORS via json.loads) and would raise on a + # malformed value, which would prevent an explicit caller-supplied + # `chunker` from bypassing a broken environment. 
+ if "chunker" not in normalized: + normalized["chunker"] = default_chunker_config() + return normalized + + +class ObservableAddonParams(dict[str, Any]): + def __init__( + self, + *args: Any, + on_change: Callable[[], None] | None = None, + **kwargs: Any, + ) -> None: + super().__init__(*args, **kwargs) + self._on_change = on_change + + def _changed(self) -> None: + if self._on_change is not None: + self._on_change() + + def __setitem__(self, key: str, value: Any) -> None: + super().__setitem__(key, value) + self._changed() + + def __delitem__(self, key: str) -> None: + super().__delitem__(key) + self._changed() + + def clear(self) -> None: + if self: + super().clear() + self._changed() + + def pop(self, key: str, default: Any = ...): + existed = key in self + if default is ...: + value = super().pop(key) + self._changed() + else: + value = super().pop(key, default) + if existed: + self._changed() + return value + + def popitem(self) -> tuple[str, Any]: + item = super().popitem() + self._changed() + return item + + def setdefault(self, key: str, default: Any = None) -> Any: + if key in self: + return self[key] + value = super().setdefault(key, default) + self._changed() + return value + + def update(self, *args: Any, **kwargs: Any) -> None: + if not args and not kwargs: + return + super().update(*args, **kwargs) + self._changed() + + def __ior__(self, other: Mapping[str, Any]): + super().__ior__(other) + self._changed() + return self diff --git a/lightrag/api/config.py b/lightrag/api/config.py index f40c70d9ae..de62ebb6aa 100644 --- a/lightrag/api/config.py +++ b/lightrag/api/config.py @@ -7,8 +7,10 @@ import argparse import logging from dotenv import load_dotenv +from lightrag import ROLES from lightrag.utils import get_env_value, logger from lightrag.llm.binding_options import ( + BedrockLLMOptions, GeminiEmbeddingOptions, GeminiLLMOptions, OllamaEmbeddingOptions, @@ -41,7 +43,9 @@ DEFAULT_OLLAMA_MODEL_NAME, DEFAULT_OLLAMA_MODEL_TAG, DEFAULT_RERANK_BINDING, - 
DEFAULT_ENTITY_TYPES, + DEFAULT_LLM_TIMEOUT, + DEFAULT_EMBEDDING_TIMEOUT, + DEFAULT_RERANK_TIMEOUT, ) # use the .env that is inside the current folder @@ -70,9 +74,12 @@ def get_default_host(binding_type: str) -> str: "lollms": os.getenv("LLM_BINDING_HOST", "http://localhost:9600"), "azure_openai": os.getenv("AZURE_OPENAI_ENDPOINT", "https://api.openai.com/v1"), "openai": os.getenv("LLM_BINDING_HOST", "https://api.openai.com/v1"), - "gemini": os.getenv( - "LLM_BINDING_HOST", "https://generativelanguage.googleapis.com" - ), + # Let boto3 select the regional Bedrock endpoint unless the user + # explicitly overrides LLM_BINDING_HOST / EMBEDDING_BINDING_HOST. + "bedrock": os.getenv("LLM_BINDING_HOST", "DEFAULT_BEDROCK_ENDPOINT"), + # Let google-genai pick the correct default endpoint/version unless the + # user explicitly overrides LLM_BINDING_HOST / EMBEDDING_BINDING_HOST. + "gemini": os.getenv("LLM_BINDING_HOST", "DEFAULT_GEMINI_ENDPOINT"), } return default_hosts.get( binding_type, os.getenv("LLM_BINDING_HOST", "http://localhost:11434") @@ -160,6 +167,77 @@ def validate_auth_configuration(args: argparse.Namespace) -> None: ) +def _is_set(value: str | None) -> bool: + return bool((value or "").strip()) + + +def validate_bedrock_auth_configuration(args: argparse.Namespace) -> None: + """Reject Bedrock configuration with no explicit supported auth source.""" + bearer_token = os.getenv("AWS_BEARER_TOKEN_BEDROCK") + + def has_valid_auth(prefix: str | None = None) -> bool: + if _is_set(bearer_token): + return True + + if prefix: + role_access_key = getattr(args, f"{prefix}_aws_access_key_id", None) + role_secret_key = getattr(args, f"{prefix}_aws_secret_access_key", None) + if _is_set(role_access_key) or _is_set(role_secret_key): + return _is_set(role_access_key) and _is_set(role_secret_key) + + access_key = getattr(args, "aws_access_key_id", None) + secret_key = getattr(args, "aws_secret_access_key", None) + return _is_set(access_key) and _is_set(secret_key) + + if 
getattr(args, "llm_binding", None) == "bedrock": + if not has_valid_auth(): + raise ValueError( + "Bedrock LLM binding requires AWS_ACCESS_KEY_ID and " + "AWS_SECRET_ACCESS_KEY, or process-level AWS_BEARER_TOKEN_BEDROCK." + ) + if _is_set(getattr(args, "llm_binding_api_key", None)): + logging.warning( + "LLM_BINDING_API_KEY is set but ignored for Bedrock LLM binding. " + "Use SigV4 AWS_* variables or process-level AWS_BEARER_TOKEN_BEDROCK instead." + ) + + if getattr(args, "embedding_binding", None) == "bedrock": + if not has_valid_auth(): + raise ValueError( + "Bedrock embedding binding requires AWS_ACCESS_KEY_ID and " + "AWS_SECRET_ACCESS_KEY, or process-level AWS_BEARER_TOKEN_BEDROCK." + ) + if _is_set(getattr(args, "embedding_binding_api_key", None)): + logging.warning( + "EMBEDDING_BINDING_API_KEY is set but ignored for Bedrock embedding binding. " + "Use SigV4 AWS_* variables or process-level AWS_BEARER_TOKEN_BEDROCK instead." + ) + + for spec in ROLES: + role = spec.name + if getattr( + args, f"{role}_llm_binding", None + ) == "bedrock" and not has_valid_auth(role): + raise ValueError( + f"Bedrock role '{role}' requires {spec.env_prefix}_AWS_ACCESS_KEY_ID " + f"and {spec.env_prefix}_AWS_SECRET_ACCESS_KEY, global " + "AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, or process-level " + "AWS_BEARER_TOKEN_BEDROCK." 
+ ) + + +def normalize_binding_name(binding: str | None) -> str | None: + """Normalize environment-provided binding aliases to canonical names.""" + if binding == "aws_bedrock": + return "bedrock" + return binding + + +def get_binding_env_value(env_key: str, default: str) -> str: + """Read a binding env var and normalize legacy aliases.""" + return normalize_binding_name(get_env_value(env_key, default)) or default + + def parse_args() -> argparse.Namespace: """ Parse command line arguments with environment variable fallback @@ -317,14 +395,14 @@ def parse_args() -> argparse.Namespace: parser.add_argument( "--llm-binding", type=str, - default=get_env_value("LLM_BINDING", "ollama"), + default=get_binding_env_value("LLM_BINDING", "ollama"), choices=[ "lollms", "ollama", "openai", "openai-ollama", "azure_openai", - "aws_bedrock", + "bedrock", "gemini", ], help="LLM binding type (default: from env or ollama)", @@ -332,13 +410,13 @@ def parse_args() -> argparse.Namespace: parser.add_argument( "--embedding-binding", type=str, - default=get_env_value("EMBEDDING_BINDING", "ollama"), + default=get_binding_env_value("EMBEDDING_BINDING", "ollama"), choices=[ "lollms", "ollama", "openai", "azure_openai", - "aws_bedrock", + "bedrock", "jina", "gemini", "voyageai", @@ -353,14 +431,6 @@ def parse_args() -> argparse.Namespace: help=f"Rerank binding type (default: from env or {DEFAULT_RERANK_BINDING})", ) - # Document loading engine configuration - parser.add_argument( - "--docling", - action="store_true", - default=False, - help="Enable DOCLING document loading engine (default: from env or DEFAULT)", - ) - # Conditionally add binding-specific options (Ollama, OpenAI, Azure OpenAI, Gemini) # This registers command line arguments (e.g., --openai-llm-temperature) # and reads corresponding environment variables (e.g., OPENAI_LLM_TEMPERATURE) @@ -377,7 +447,7 @@ def parse_args() -> argparse.Namespace: # Fall back to environment variable using same function as argparse default if 
llm_binding_value is None: - llm_binding_value = get_env_value("LLM_BINDING", "ollama") + llm_binding_value = get_binding_env_value("LLM_BINDING", "ollama") # Add LLM binding options based on determined value if llm_binding_value == "ollama": @@ -386,6 +456,8 @@ def parse_args() -> argparse.Namespace: OpenAILLMOptions.add_args(parser) elif llm_binding_value == "gemini": GeminiLLMOptions.add_args(parser) + elif llm_binding_value == "bedrock": + BedrockLLMOptions.add_args(parser) # Determine embedding binding value consistently from command line or environment embedding_binding_value = None @@ -399,7 +471,7 @@ def parse_args() -> argparse.Namespace: # Fall back to environment variable using same function as argparse default if embedding_binding_value is None: - embedding_binding_value = get_env_value("EMBEDDING_BINDING", "ollama") + embedding_binding_value = get_binding_env_value("EMBEDDING_BINDING", "ollama") # Add embedding binding options based on determined value if embedding_binding_value == "ollama": @@ -447,6 +519,13 @@ def parse_args() -> argparse.Namespace: args.llm_binding_api_key = get_env_value("LLM_BINDING_API_KEY", None) args.embedding_binding_api_key = get_env_value("EMBEDDING_BINDING_API_KEY", "") + args.aws_region = get_env_value("AWS_REGION", None, special_none=True) + args.aws_access_key_id = get_env_value("AWS_ACCESS_KEY_ID", None, special_none=True) + args.aws_secret_access_key = get_env_value( + "AWS_SECRET_ACCESS_KEY", None, special_none=True + ) + args.aws_session_token = get_env_value("AWS_SESSION_TOKEN", None, special_none=True) + # Inject model configuration args.llm_model = get_env_value("LLM_MODEL", "mistral-nemo:latest") # EMBEDDING_MODEL defaults to None - each binding will use its own default model @@ -467,21 +546,96 @@ def parse_args() -> argparse.Namespace: ) args.enable_llm_cache = get_env_value("ENABLE_LLM_CACHE", True, bool) - # Set document_loading_engine from --docling flag - if args.docling: - args.document_loading_engine = 
"DOCLING" - else: - args.document_loading_engine = get_env_value( - "DOCUMENT_LOADING_ENGINE", "DEFAULT" - ) - # PDF decryption password args.pdf_decrypt_password = get_env_value("PDF_DECRYPT_PASSWORD", None) + # --- Per-role LLM configuration (driven by lightrag.ROLES registry) --- + for spec in ROLES: + prefix = spec.env_prefix + attr_prefix = spec.name + binding_key = f"{prefix}_LLM_BINDING" + model_key = f"{prefix}_LLM_MODEL" + host_key = f"{prefix}_LLM_BINDING_HOST" + apikey_key = f"{prefix}_LLM_BINDING_API_KEY" + max_async_key = f"{prefix}_MAX_ASYNC_LLM" + timeout_key = f"{prefix}_LLM_TIMEOUT" + + role_binding = normalize_binding_name( + get_env_value(binding_key, None, special_none=True) + ) + role_model = get_env_value(model_key, None, special_none=True) + role_host = get_env_value(host_key, None, special_none=True) + role_apikey = get_env_value(apikey_key, None, special_none=True) + role_max_async = get_env_value(max_async_key, None, int, special_none=True) + role_timeout = get_env_value(timeout_key, None, int, special_none=True) + role_aws_region = get_env_value(f"{prefix}_AWS_REGION", None, special_none=True) + role_aws_access_key_id = get_env_value( + f"{prefix}_AWS_ACCESS_KEY_ID", None, special_none=True + ) + role_aws_secret_access_key = get_env_value( + f"{prefix}_AWS_SECRET_ACCESS_KEY", None, special_none=True + ) + role_aws_session_token = get_env_value( + f"{prefix}_AWS_SESSION_TOKEN", None, special_none=True + ) + + setattr(args, f"{attr_prefix}_llm_binding", role_binding) + setattr(args, f"{attr_prefix}_llm_model", role_model) + setattr(args, f"{attr_prefix}_llm_binding_host", role_host) + setattr(args, f"{attr_prefix}_llm_binding_api_key", role_apikey) + setattr(args, f"{attr_prefix}_llm_max_async", role_max_async) + setattr(args, f"{attr_prefix}_llm_timeout", role_timeout) + setattr(args, f"{attr_prefix}_aws_region", role_aws_region) + setattr(args, f"{attr_prefix}_aws_access_key_id", role_aws_access_key_id) + setattr( + args, 
f"{attr_prefix}_aws_secret_access_key", role_aws_secret_access_key + ) + setattr(args, f"{attr_prefix}_aws_session_token", role_aws_session_token) + + if role_binding == "bedrock" and role_apikey: + raise SystemExit( + f"Bedrock role '{spec.name}' does not support {apikey_key}; use " + "role-specific SigV4 AWS_* variables or process-level " + "AWS_BEARER_TOKEN_BEDROCK." + ) + + # Cross-provider validation + if role_binding and role_binding != args.llm_binding: + missing = [] + if not role_model: + missing.append(model_key) + if not role_host: + role_host = get_default_host(role_binding) + setattr(args, f"{attr_prefix}_llm_binding_host", role_host) + if role_binding != "bedrock" and not role_apikey: + missing.append(apikey_key) + if missing: + raise SystemExit( + f"Cross-provider error for role '{spec.name}': " + f"binding={role_binding} differs from base={args.llm_binding}, " + f"but required env vars are missing: {', '.join(missing)}" + ) + + # VLM multimodal master switch — when off, the pipeline emits a warning + # and skips every i/t/e item without touching the VLM. When on, the + # effective VLM binding must support image inputs. + args.vlm_process_enable = get_env_value("VLM_PROCESS_ENABLE", False, bool) + if args.vlm_process_enable: + effective_vlm_binding = ( + getattr(args, "vlm_llm_binding", None) or args.llm_binding + ) + vlm_incompatible = {"lollms"} + if effective_vlm_binding in vlm_incompatible: + raise SystemExit( + f"VLM_PROCESS_ENABLE=true but the effective VLM binding " + f"'{effective_vlm_binding}' does not support image inputs. " + "Configure VLM_LLM_BINDING (or LLM_BINDING) to one of: " + "openai, azure_openai, gemini, bedrock, ollama." 
+ ) + # Add environment variables that were previously read directly args.cors_origins = get_env_value("CORS_ORIGINS", "*") args.summary_language = get_env_value("SUMMARY_LANGUAGE", DEFAULT_SUMMARY_LANGUAGE) - args.entity_types = get_env_value("ENTITY_TYPES", DEFAULT_ENTITY_TYPES, list) args.whitelist_paths = get_env_value("WHITELIST_PATHS", "/health,/api/*") # For JWT Auth @@ -506,6 +660,17 @@ def parse_args() -> argparse.Namespace: "MIN_RERANK_SCORE", DEFAULT_MIN_RERANK_SCORE, float ) + # LLM / Embedding request timeouts + args.llm_timeout = get_env_value("LLM_TIMEOUT", DEFAULT_LLM_TIMEOUT, int) + args.embedding_timeout = get_env_value( + "EMBEDDING_TIMEOUT", DEFAULT_EMBEDDING_TIMEOUT, int + ) + + # Rerank async/timeout configuration (independent from base LLM) + # rerank_max_async falls back to MAX_ASYNC; rerank_timeout has its own default. + args.rerank_max_async = get_env_value("MAX_ASYNC_RERANK", args.max_async, int) + args.rerank_timeout = get_env_value("RERANK_TIMEOUT", DEFAULT_RERANK_TIMEOUT, int) + # Query configuration args.history_turns = get_env_value("HISTORY_TURNS", DEFAULT_HISTORY_TURNS, int) args.top_k = get_env_value("TOP_K", DEFAULT_TOP_K, int) @@ -582,6 +747,7 @@ def parse_args() -> argparse.Namespace: args.workspace = sanitized validate_auth_configuration(args) + validate_bedrock_auth_configuration(args) return args @@ -635,6 +801,7 @@ def initialize_config(args=None, force=False): resolved_args = args if args is not None else parse_args() validate_auth_configuration(resolved_args) + validate_bedrock_auth_configuration(resolved_args) _global_args = resolved_args _initialized = True return _global_args diff --git a/lightrag/api/lightrag_server.py b/lightrag/api/lightrag_server.py index 9fb7e4086f..621234f0e3 100644 --- a/lightrag/api/lightrag_server.py +++ b/lightrag/api/lightrag_server.py @@ -17,6 +17,7 @@ import sys import uvicorn import pipmaster as pm +from typing import Any from fastapi.staticfiles import StaticFiles from fastapi.responses 
import RedirectResponse from pathlib import Path @@ -37,21 +38,19 @@ PREFIX_ASYMMETRIC_EMBEDDING_BINDINGS, ) from lightrag.utils import get_env_value -from lightrag import LightRAG, __version__ as core_version +from lightrag import LightRAG, ROLES, RoleLLMConfig, __version__ as core_version from lightrag.api import __api_version__ -from lightrag.types import GPTKeywordExtractionFormat from lightrag.utils import EmbeddingFunc from lightrag.constants import ( DEFAULT_LOG_MAX_BYTES, DEFAULT_LOG_BACKUP_COUNT, DEFAULT_LOG_FILENAME, - DEFAULT_LLM_TIMEOUT, - DEFAULT_EMBEDDING_TIMEOUT, ) from lightrag.api.routers.document_routes import ( DocumentManager, create_document_routes, ) +from lightrag.parser_routing import validate_parser_routing_config from lightrag.api.routers.query_routes import create_query_routes from lightrag.api.routers.graph_routes import create_graph_routes from lightrag.api.routers.ollama_api import OllamaAPI @@ -89,6 +88,46 @@ WEBUI_PATH = "/webui" +def _clean_workspace_value(value: Any) -> str | None: + if value is None: + return None + text = str(value).strip() + return text or None + + +def _get_storage_workspace(storage: Any) -> str | None: + if storage is None: + return None + + effective_workspace = _clean_workspace_value( + getattr(storage, "effective_workspace", None) + ) + if effective_workspace: + return effective_workspace + + final_namespace = _clean_workspace_value(getattr(storage, "final_namespace", None)) + namespace = _clean_workspace_value(getattr(storage, "namespace", None)) + if final_namespace and namespace: + suffix = f"_{namespace}" + if final_namespace.endswith(suffix): + workspace = final_namespace[: -len(suffix)] + if workspace: + return workspace + + return _clean_workspace_value(getattr(storage, "workspace", None)) + + +def _get_storage_workspaces(rag: Any) -> dict[str, str | None]: + return { + "kv_storage": _get_storage_workspace(getattr(rag, "full_docs", None)), + "doc_status_storage": _get_storage_workspace(getattr(rag, 
"doc_status", None)), + "graph_storage": _get_storage_workspace( + getattr(rag, "chunk_entity_relation_graph", None) + ), + "vector_storage": _get_storage_workspace(getattr(rag, "entities_vdb", None)), + } + + class LLMConfigCache: """Smart LLM and Embedding configuration cache class""" @@ -101,6 +140,7 @@ def __init__(self, args): self.gemini_embedding_options = None self.ollama_llm_options = None self.ollama_embedding_options = None + self.bedrock_llm_options = None # Only initialize and log OpenAI options when using OpenAI-related bindings if args.llm_binding in ["openai", "azure_openai"]: @@ -115,6 +155,12 @@ def __init__(self, args): self.gemini_llm_options = GeminiLLMOptions.options_dict(args) logger.info(f"Gemini LLM Options: {self.gemini_llm_options}") + if args.llm_binding == "bedrock": + from lightrag.llm.binding_options import BedrockLLMOptions + + self.bedrock_llm_options = BedrockLLMOptions.options_dict(args) + logger.info(f"Bedrock LLM Options: {self.bedrock_llm_options}") + # Only initialize and log Ollama LLM options when using Ollama LLM binding if args.llm_binding == "ollama": try: @@ -163,6 +209,52 @@ def __init__(self, args): self.gemini_embedding_options = {} +_PROVIDER_LOG_LABELS = { + "azure_openai": "Azure OpenAI", + "bedrock": "Bedrock", + "gemini": "Gemini", + "lollms": "Lollms", + "ollama": "Ollama", + "openai": "OpenAI", +} + + +def _provider_log_label(binding: Any) -> str: + binding_name = str(binding) + return _PROVIDER_LOG_LABELS.get( + binding_name, binding_name.replace("_", " ").title() + ) + + +def _log_role_provider_options(rag: Any) -> None: + """Log sanitized provider options for every role LLM.""" + try: + role_configs = rag.get_llm_role_config() + except Exception as e: + logger.warning(f"Failed to read role LLM configuration for logging: {e}") + return + + logger.info("Role LLM Option:") + + for spec in ROLES: + role_config = role_configs.get(spec.name) + if not isinstance(role_config, dict): + continue + + metadata = 
role_config.get("metadata") or {} + binding = role_config.get("binding") or metadata.get("binding") + if not binding: + continue + + provider_options = metadata.get("provider_options") or {} + logger.info( + " - %s: %s %s", + spec.name, + _provider_log_label(binding), + provider_options, + ) + + def check_frontend_build(): """Check if frontend is built and optionally check if source is up-to-date @@ -304,6 +396,7 @@ def create_app(args): # Setup logging logger.setLevel(args.log_level) set_verbose_debug(args.verbose) + validate_parser_routing_config() # Create configuration cache (this will output configuration logs) config_cache = LLMConfigCache(args) @@ -314,7 +407,7 @@ def create_app(args): "ollama", "openai", "azure_openai", - "aws_bedrock", + "bedrock", "gemini", ]: raise Exception("llm binding not supported") @@ -324,7 +417,7 @@ def create_app(args): "ollama", "openai", "azure_openai", - "aws_bedrock", + "bedrock", "jina", "gemini", "voyageai", @@ -528,18 +621,17 @@ async def optimized_openai_alike_model_complete( prompt, system_prompt=None, history_messages=None, - keyword_extraction=False, **kwargs, ) -> str: from lightrag.llm.openai import openai_complete_if_cache - keyword_extraction = kwargs.pop("keyword_extraction", None) - if keyword_extraction: - kwargs["response_format"] = GPTKeywordExtractionFormat if history_messages is None: history_messages = [] - # Use pre-processed configuration to avoid repeated parsing + # Use pre-processed configuration to avoid repeated parsing. + # response_format and legacy keyword_extraction/entity_extraction + # flags flow through **kwargs; openai_complete_if_cache handles + # the deprecation shim for the legacy booleans. 
kwargs["timeout"] = llm_timeout if config_cache.openai_llm_options: kwargs.update(config_cache.openai_llm_options) @@ -565,18 +657,15 @@ async def optimized_azure_openai_model_complete( prompt, system_prompt=None, history_messages=None, - keyword_extraction=False, **kwargs, ) -> str: from lightrag.llm.azure_openai import azure_openai_complete_if_cache - keyword_extraction = kwargs.pop("keyword_extraction", None) - if keyword_extraction: - kwargs["response_format"] = GPTKeywordExtractionFormat if history_messages is None: history_messages = [] - # Use pre-processed configuration to avoid repeated parsing + # response_format and legacy extraction booleans flow through kwargs + # to azure_openai_complete_if_cache, which handles deprecation shims. kwargs["timeout"] = llm_timeout if config_cache.openai_llm_options: kwargs.update(config_cache.openai_llm_options) @@ -603,7 +692,6 @@ async def optimized_gemini_model_complete( prompt, system_prompt=None, history_messages=None, - keyword_extraction=False, **kwargs, ) -> str: from lightrag.llm.gemini import gemini_complete_if_cache @@ -611,7 +699,8 @@ async def optimized_gemini_model_complete( if history_messages is None: history_messages = [] - # Use pre-processed configuration to avoid repeated parsing + # response_format and legacy extraction booleans flow through kwargs + # to gemini_complete_if_cache, which handles deprecation shims. 
kwargs["timeout"] = llm_timeout if ( config_cache.gemini_llm_options is not None @@ -626,7 +715,6 @@ async def optimized_gemini_model_complete( history_messages=history_messages, api_key=args.llm_binding_api_key, base_url=args.llm_binding_host, - keyword_extraction=keyword_extraction, **kwargs, ) @@ -646,7 +734,7 @@ def create_llm_model_func(binding: str): from lightrag.llm.ollama import ollama_model_complete return ollama_model_complete - elif binding == "aws_bedrock": + elif binding == "bedrock": return bedrock_model_complete # Already defined locally elif binding == "azure_openai": # Use optimized function with pre-processed configuration @@ -680,6 +768,303 @@ def create_llm_model_kwargs(binding: str, args, llm_timeout: int) -> dict: raise Exception(f"Failed to import {binding} options: {e}") return {} + def resolve_role_llm_settings( + role: str, override_meta: dict | None = None + ) -> dict[str, Any]: + attr = role.lower() + override_meta = override_meta or {} + + role_binding = ( + override_meta.get("binding") + or getattr(args, f"{attr}_llm_binding", None) + or args.llm_binding + ) + role_model = ( + override_meta.get("model") + or getattr(args, f"{attr}_llm_model", None) + or args.llm_model + ) + role_host = ( + override_meta.get("host") + or getattr(args, f"{attr}_llm_binding_host", None) + or args.llm_binding_host + ) + explicit_role_apikey = override_meta.get("api_key") or getattr( + args, f"{attr}_llm_binding_api_key", None + ) + if role_binding == "bedrock": + if explicit_role_apikey: + raise ValueError( + f"Bedrock role '{role}' does not support role-specific " + "LLM_BINDING_API_KEY; use role-specific SigV4 AWS_* " + "variables or process-level AWS_BEARER_TOKEN_BEDROCK." 
+ ) + role_apikey = None + else: + role_apikey = explicit_role_apikey or args.llm_binding_api_key + role_timeout = ( + override_meta.get("timeout") + or getattr(args, f"{attr}_llm_timeout", None) + or llm_timeout + ) + role_max_async = override_meta.get("max_async") + if role_max_async is None: + role_max_async = getattr(args, f"{attr}_llm_max_async", None) + is_cross_provider = role_binding != args.llm_binding + + role_provider_options = override_meta.get("provider_options") + if role_provider_options is None: + if role_binding in ["openai", "azure_openai"]: + from lightrag.llm.binding_options import OpenAILLMOptions + + role_provider_options = OpenAILLMOptions.options_dict_for_role( + args, role, is_cross_provider + ) + elif role_binding == "gemini": + from lightrag.llm.binding_options import GeminiLLMOptions + + role_provider_options = GeminiLLMOptions.options_dict_for_role( + args, role, is_cross_provider + ) + elif role_binding in ["lollms", "ollama"]: + from lightrag.llm.binding_options import OllamaLLMOptions + + role_provider_options = OllamaLLMOptions.options_dict_for_role( + args, role, is_cross_provider + ) + elif role_binding == "bedrock": + from lightrag.llm.binding_options import BedrockLLMOptions + + role_provider_options = BedrockLLMOptions.options_dict_for_role( + args, role, is_cross_provider + ) + else: + role_provider_options = {} + + bedrock_aws_options = {} + if role_binding == "bedrock": + override_bedrock_aws_options = override_meta.get("bedrock_aws_options", {}) + bedrock_aws_options = { + "aws_region": override_meta.get("aws_region") + or override_bedrock_aws_options.get("aws_region") + or getattr(args, f"{attr}_aws_region", None) + or getattr(args, "aws_region", None), + "aws_access_key_id": override_meta.get("aws_access_key_id") + or override_bedrock_aws_options.get("aws_access_key_id") + or getattr(args, f"{attr}_aws_access_key_id", None) + or getattr(args, "aws_access_key_id", None), + "aws_secret_access_key": 
override_meta.get("aws_secret_access_key") + or override_bedrock_aws_options.get("aws_secret_access_key") + or getattr(args, f"{attr}_aws_secret_access_key", None) + or getattr(args, "aws_secret_access_key", None), + "aws_session_token": override_meta.get("aws_session_token") + or override_bedrock_aws_options.get("aws_session_token") + or getattr(args, f"{attr}_aws_session_token", None) + or getattr(args, "aws_session_token", None), + } + + return { + "binding": role_binding, + "model": role_model, + "host": role_host, + "api_key": role_apikey, + "timeout": role_timeout, + "max_async": role_max_async, + "provider_options": role_provider_options, + "is_cross_provider": is_cross_provider, + "bedrock_aws_options": bedrock_aws_options, + } + + def create_role_llm_func(role: str, override_meta: dict | None = None): + """Create an independent raw LLM function for a role.""" + settings = resolve_role_llm_settings(role, override_meta) + role_binding = settings["binding"] + role_model = settings["model"] + role_host = settings["host"] + role_apikey = settings["api_key"] + role_timeout = settings["timeout"] + role_provider_options = settings["provider_options"] + bedrock_aws_options = settings["bedrock_aws_options"] + + try: + if role_binding == "ollama": + from lightrag.llm.ollama import _ollama_model_if_cache + + async def role_ollama_complete( + prompt, + system_prompt=None, + history_messages=None, + enable_cot: bool = False, + **kwargs, + ): + # response_format and legacy extraction booleans flow + # through kwargs to _ollama_model_if_cache, which handles + # the deprecation shim and emits a single warning. 
+ if history_messages is None: + history_messages = [] + if role_provider_options: + kwargs.setdefault("options", dict(role_provider_options)) + return await _ollama_model_if_cache( + role_model, + prompt, + system_prompt=system_prompt, + history_messages=history_messages, + enable_cot=enable_cot, + host=role_host, + timeout=role_timeout, + api_key=role_apikey, + **kwargs, + ) + + return role_ollama_complete + if role_binding == "lollms": + from lightrag.llm.lollms import lollms_model_if_cache + + async def role_lollms_complete( + prompt, + system_prompt=None, + history_messages=None, + enable_cot: bool = False, + **kwargs, + ): + # response_format and legacy extraction booleans flow + # through kwargs to lollms_model_if_cache, which drops + # them and emits deprecation warnings when booleans are set. + if history_messages is None: + history_messages = [] + if role_provider_options: + kwargs = {**role_provider_options, **kwargs} + return await lollms_model_if_cache( + role_model, + prompt, + system_prompt=system_prompt, + history_messages=history_messages, + enable_cot=enable_cot, + base_url=role_host, + api_key=role_apikey, + timeout=role_timeout, + **kwargs, + ) + + return role_lollms_complete + if role_binding == "bedrock": + from lightrag.llm.bedrock import bedrock_complete_if_cache + + async def role_bedrock_complete( + prompt, + system_prompt=None, + history_messages=None, + **kwargs, + ) -> str: + if history_messages is None: + history_messages = [] + if role_provider_options: + kwargs = {**role_provider_options, **kwargs} + return await bedrock_complete_if_cache( + role_model, + prompt, + system_prompt=system_prompt, + history_messages=history_messages, + endpoint_url=role_host, + **bedrock_aws_options, + **kwargs, + ) + + return role_bedrock_complete + if role_binding == "azure_openai": + from lightrag.llm.azure_openai import azure_openai_complete_if_cache + + async def role_azure_openai_complete( + prompt, + system_prompt=None, + history_messages=None, + 
**kwargs, + ) -> str: + if history_messages is None: + history_messages = [] + kwargs["timeout"] = role_timeout + if role_provider_options: + kwargs.update(role_provider_options) + return await azure_openai_complete_if_cache( + role_model, + prompt, + system_prompt=system_prompt, + history_messages=history_messages, + base_url=role_host, + api_key=role_apikey or os.getenv("AZURE_OPENAI_API_KEY"), + api_version=os.getenv( + "AZURE_OPENAI_API_VERSION", "2024-08-01-preview" + ), + **kwargs, + ) + + return role_azure_openai_complete + if role_binding == "gemini": + from lightrag.llm.gemini import gemini_complete_if_cache + + async def role_gemini_complete( + prompt, + system_prompt=None, + history_messages=None, + **kwargs, + ) -> str: + if history_messages is None: + history_messages = [] + kwargs["timeout"] = role_timeout + if role_provider_options and "generation_config" not in kwargs: + kwargs["generation_config"] = dict(role_provider_options) + return await gemini_complete_if_cache( + role_model, + prompt, + system_prompt=system_prompt, + history_messages=history_messages, + api_key=role_apikey, + base_url=role_host, + **kwargs, + ) + + return role_gemini_complete + + from lightrag.llm.openai import openai_complete_if_cache + + async def role_openai_complete( + prompt, + system_prompt=None, + history_messages=None, + **kwargs, + ) -> str: + if history_messages is None: + history_messages = [] + kwargs["timeout"] = role_timeout + if role_provider_options: + kwargs.update(role_provider_options) + return await openai_complete_if_cache( + role_model, + prompt, + system_prompt=system_prompt, + history_messages=history_messages, + base_url=role_host, + api_key=role_apikey, + **kwargs, + ) + + return role_openai_complete + except ImportError as e: + raise Exception(f"Failed to create LLM for role '{role}': {e}") + + def create_role_llm_model_kwargs( + role: str, override_meta: dict | None = None + ) -> dict[str, Any] | None: + """Create role-specific kwargs for runtime 
wrapper injection. + + Role functions built above already encapsulate provider host/model/api_key/options, + so we intentionally return an empty dict here to prevent base kwargs inheritance + from polluting cross-provider role calls. + """ + _ = role + _ = override_meta + return {} + def create_optimized_embedding_function( config_cache: LLMConfigCache, binding, @@ -739,7 +1124,7 @@ def create_optimized_embedding_function( from lightrag.llm.azure_openai import azure_openai_embed provider_func = azure_openai_embed - elif binding == "aws_bedrock": + elif binding == "bedrock": from lightrag.llm.bedrock import bedrock_embed provider_func = bedrock_embed @@ -860,7 +1245,7 @@ async def optimized_embedding_function( if document_prefix: kwargs["document_prefix"] = document_prefix return await actual_func(**kwargs) - elif binding == "aws_bedrock": + elif binding == "bedrock": from lightrag.llm.bedrock import bedrock_embed actual_func = ( @@ -869,7 +1254,17 @@ async def optimized_embedding_function( else bedrock_embed ) # Pass model only if provided, let function use its default otherwise - kwargs = {"texts": texts} + kwargs = { + "texts": texts, + "aws_region": getattr(args, "aws_region", None), + "aws_access_key_id": getattr(args, "aws_access_key_id", None), + "aws_secret_access_key": getattr( + args, "aws_secret_access_key", None + ), + "aws_session_token": getattr(args, "aws_session_token", None), + } + if host is not None: + kwargs["endpoint_url"] = host if model: kwargs["model"] = model return await actual_func(**kwargs) @@ -996,35 +1391,37 @@ async def optimized_embedding_function( return embedding_func_instance - llm_timeout = get_env_value("LLM_TIMEOUT", DEFAULT_LLM_TIMEOUT, int) - embedding_timeout = get_env_value( - "EMBEDDING_TIMEOUT", DEFAULT_EMBEDDING_TIMEOUT, int - ) + llm_timeout = args.llm_timeout + embedding_timeout = args.embedding_timeout async def bedrock_model_complete( prompt, system_prompt=None, history_messages=None, - keyword_extraction=False, 
**kwargs, ) -> str: # Lazy import from lightrag.llm.bedrock import bedrock_complete_if_cache - keyword_extraction = kwargs.pop("keyword_extraction", None) - if keyword_extraction: - kwargs["response_format"] = GPTKeywordExtractionFormat if history_messages is None: history_messages = [] - # Use global temperature for Bedrock - kwargs["temperature"] = get_env_value("BEDROCK_LLM_TEMPERATURE", 1.0, float) + # Bedrock Converse API has no JSON mode; response_format and the legacy + # extraction booleans flow through kwargs to bedrock_complete_if_cache, + # which drops them and emits deprecation warnings when booleans are set. + if config_cache.bedrock_llm_options: + kwargs = {**config_cache.bedrock_llm_options, **kwargs} return await bedrock_complete_if_cache( args.llm_model, prompt, system_prompt=system_prompt, history_messages=history_messages, + endpoint_url=args.llm_binding_host, + aws_region=getattr(args, "aws_region", None), + aws_access_key_id=getattr(args, "aws_access_key_id", None), + aws_secret_access_key=getattr(args, "aws_secret_access_key", None), + aws_session_token=getattr(args, "aws_session_token", None), **kwargs, ) @@ -1037,7 +1434,9 @@ async def bedrock_model_complete( binding=args.embedding_binding, model=args.embedding_model, host=args.embedding_binding_host, - api_key=args.embedding_binding_api_key, + api_key=None + if args.embedding_binding == "bedrock" + else args.embedding_binding_api_key, args=args, document_prefix=args.embedding_document_prefix, query_prefix=args.embedding_query_prefix, @@ -1163,6 +1562,22 @@ async def server_rerank_func( name=args.simulated_model_name, tag=args.simulated_model_tag ) + # LightRAG.__post_init__ normalizes addon_params and backfills env-based defaults + # (SUMMARY_LANGUAGE, ENTITY_TYPE_PROMPT_FILE, ...), so we only need to pass the + # API-level overrides here. 
+ addon_params = { + "language": args.summary_language, + } + + role_llm_configs = { + spec.name: { + **resolve_role_llm_settings(spec.name), + "func": create_role_llm_func(spec.name), + "kwargs": create_role_llm_model_kwargs(spec.name), + } + for spec in ROLES + } + # Initialize RAG with unified configuration try: rag = LightRAG( @@ -1190,19 +1605,53 @@ async def server_rerank_func( }, enable_llm_cache_for_entity_extract=args.enable_llm_cache_for_extract, enable_llm_cache=args.enable_llm_cache, + vlm_process_enable=args.vlm_process_enable, rerank_model_func=rerank_model_func, + rerank_model_max_async=args.rerank_max_async, + default_rerank_timeout=args.rerank_timeout, max_parallel_insert=args.max_parallel_insert, max_graph_nodes=args.max_graph_nodes, - addon_params={ - "language": args.summary_language, - "entity_types": args.entity_types, - }, + addon_params=addon_params, ollama_server_infos=ollama_server_infos, + role_llm_configs={ + spec.name: RoleLLMConfig( + func=role_llm_configs[spec.name]["func"], + kwargs=role_llm_configs[spec.name]["kwargs"], + max_async=role_llm_configs[spec.name]["max_async"], + timeout=role_llm_configs[spec.name]["timeout"], + metadata={ + "base_binding": args.llm_binding, + "binding": role_llm_configs[spec.name]["binding"], + "model": role_llm_configs[spec.name]["model"], + "host": role_llm_configs[spec.name]["host"], + "api_key": role_llm_configs[spec.name]["api_key"], + "provider_options": role_llm_configs[spec.name][ + "provider_options" + ], + "bedrock_aws_options": role_llm_configs[spec.name][ + "bedrock_aws_options" + ], + "is_cross_provider": role_llm_configs[spec.name][ + "is_cross_provider" + ], + }, + ) + for spec in ROLES + }, ) except Exception as e: logger.error(f"Failed to initialize LightRAG: {e}") raise + _log_role_provider_options(rag) + + rag.register_role_llm_builder( + lambda role, meta: ( + create_role_llm_func(role, meta), + create_role_llm_model_kwargs(role, meta), + ) + ) + # Add routes # root_path is set on 
the app for reverse proxy support; # routes stay at their natural paths and are prefixed by the proxy or uvicorn --root-path @@ -1334,6 +1783,12 @@ async def login(form_data: OAuth2PasswordRequestForm = Depends()): "embedding_binding": "openai", "embedding_model": "text-embedding-ada-002", "workspace": "default", + "storage_workspaces": { + "kv_storage": "default", + "doc_status_storage": "default", + "graph_storage": "default", + "vector_storage": "default", + }, }, "auth_mode": "enabled", "pipeline_busy": False, @@ -1356,6 +1811,21 @@ async def get_status(request: Request): "pipeline_status", workspace=workspace ) + pipeline_busy = bool(pipeline_status.get("busy", False)) + pipeline_scanning = bool(pipeline_status.get("scanning", False)) + pipeline_destructive_busy = bool( + pipeline_status.get("destructive_busy", False) + ) + pipeline_pending_enqueues = int( + pipeline_status.get("pending_enqueues", 0) or 0 + ) + pipeline_active = ( + pipeline_busy + or pipeline_scanning + or pipeline_destructive_busy + or pipeline_pending_enqueues > 0 + ) + if not auth_configured: auth_mode = "disabled" else: @@ -1386,7 +1856,9 @@ async def get_status(request: Request): "vector_storage": args.vector_storage, "enable_llm_cache_for_extract": args.enable_llm_cache_for_extract, "enable_llm_cache": args.enable_llm_cache, + "vlm_process_enable": args.vlm_process_enable, "workspace": default_workspace, + "storage_workspaces": _get_storage_workspaces(rag), "max_graph_nodes": args.max_graph_nodes, # Rerank configuration "enable_rerank": rerank_model_func is not None, @@ -1395,6 +1867,8 @@ async def get_status(request: Request): "rerank_binding_host": args.rerank_binding_host if rerank_model_func else None, + "rerank_max_async": args.rerank_max_async, + "rerank_timeout": args.rerank_timeout, # Environment variable status (requested configuration) "summary_language": args.summary_language, "force_llm_summary_on_merge": args.force_llm_summary_on_merge, @@ -1403,12 +1877,22 @@ async def 
get_status(request: Request): "min_rerank_score": args.min_rerank_score, "related_chunk_number": args.related_chunk_number, "max_async": args.max_async, + "llm_timeout": args.llm_timeout, "embedding_func_max_async": args.embedding_func_max_async, "embedding_batch_num": args.embedding_batch_num, + "embedding_timeout": args.embedding_timeout, + "role_llm_config": rag.get_llm_role_config(), }, "auth_mode": auth_mode, - "pipeline_busy": pipeline_status.get("busy", False), + "pipeline_busy": pipeline_busy, + "pipeline_active": pipeline_active, + "pipeline_scanning": pipeline_scanning, + "pipeline_destructive_busy": pipeline_destructive_busy, + "pipeline_pending_enqueues": pipeline_pending_enqueues, "keyed_locks": keyed_lock_info, + "llm_queue_status": await rag.get_llm_queue_status(include_base=True), + "embedding_queue_status": await rag.get_embedding_queue_status(), + "rerank_queue_status": await rag.get_rerank_queue_status(), "core_version": core_version, "api_version": api_version_display, "webui_title": webui_title, diff --git a/lightrag/api/routers/document_routes.py b/lightrag/api/routers/document_routes.py index 127862084a..ecf201741c 100644 --- a/lightrag/api/routers/document_routes.py +++ b/lightrag/api/routers/document_routes.py @@ -3,9 +3,10 @@ """ import asyncio +import re +import shutil import time from uuid import uuid4 -from functools import lru_cache from lightrag.utils import logger, get_pinyin_sort_key, performance_timing_log import aiofiles import traceback @@ -25,33 +26,27 @@ from lightrag import LightRAG from lightrag.base import DeletionResult, DocProcessingStatus, DocStatus +from lightrag.constants import ( + FULL_DOCS_FORMAT_PENDING_PARSE, + PARSER_ENGINE_LEGACY, + PARSED_ARTIFACT_DIR_SUFFIXES, + PARSED_DIR_NAME, + PROCESS_OPTION_CHUNK_FIXED, +) +from lightrag.parser_routing import ( + FilenameParserHintError, + canonicalize_parser_hinted_basename, + filename_parser_hint, + resolve_file_parser_directives, +) from lightrag.utils import ( 
generate_track_id, - compute_mdhash_id, - sanitize_text_for_encoding, + move_file_to_parsed_dir, ) from lightrag.api.utils_api import get_combined_auth_dependency from ..config import global_args -@lru_cache(maxsize=1) -def _is_docling_available() -> bool: - """Check if docling is available (cached check). - - This function uses lru_cache to avoid repeated import attempts. - The result is cached after the first call. - - Returns: - bool: True if docling is available, False otherwise - """ - try: - import docling # noqa: F401 # type: ignore[import-not-found] - - return True - except ImportError: - return False - - # Function to format datetime to ISO format string with timezone information def format_datetime(dt: Any) -> Optional[str]: """Format datetime to ISO format string with timezone information @@ -88,6 +83,7 @@ def format_datetime(dt: Any) -> Optional[str]: temp_prefix = "__tmp__" UNKNOWN_FILE_SOURCE = "unknown_source" LEGACY_EMPTY_FILE_PATH_SENTINELS = {"", "no-file-path"} +ARCHIVED_FILE_SUFFIX_RE = re.compile(r"_(?:\d{3}|\d{10,})$") def normalize_file_path(file_path: str | None) -> str: @@ -99,7 +95,13 @@ def normalize_file_path(file_path: str | None) -> str: if normalized in LEGACY_EMPTY_FILE_PATH_SENTINELS: return UNKNOWN_FILE_SOURCE - return normalized + return canonicalize_parser_hinted_basename(normalized) or UNKNOWN_FILE_SOURCE + + +def is_valid_file_source(file_source: str | None) -> bool: + if file_source is None: + return False + return normalize_file_path(file_source) != UNKNOWN_FILE_SOURCE def sanitize_filename(filename: str, input_dir: Path) -> str: @@ -151,12 +153,15 @@ class ScanResponse(BaseModel): """Response model for document scanning operation Attributes: - status: Status of the scanning operation + status: Status of the scanning operation. ``scanning_started`` when + a new background scan has been scheduled; + ``scanning_skipped_pipeline_busy`` when the request was rejected + because indexing or another scan is already running. 
message: Optional message with additional details track_id: Tracking ID for monitoring scanning progress """ - status: Literal["scanning_started"] = Field( + status: Literal["scanning_started", "scanning_skipped_pipeline_busy"] = Field( description="Status of the scanning operation" ) message: Optional[str] = Field( @@ -313,12 +318,15 @@ class InsertResponse(BaseModel): """Response model for document insertion operations Attributes: - status: Status of the operation (success, duplicated, partial_success, failure) + status: Status of the operation (success, partial_success, failure). + Same-name conflicts are rejected with HTTP 409 rather than being + reported as a "duplicated" 200 response, so this field never + takes that value any more. message: Detailed message describing the operation result track_id: Tracking ID for monitoring processing status """ - status: Literal["success", "duplicated", "partial_success", "failure"] = Field( + status: Literal["success", "partial_success", "failure"] = Field( description="Status of the operation" ) message: str = Field(description="Message describing the operation result") @@ -614,7 +622,8 @@ class DocumentsRequest(BaseModel): """Request model for paginated document queries Attributes: - status_filter: Filter by document status, None for all statuses + status_filter: Legacy single-status filter, ignored when status_filters is set + status_filters: Filter by multiple document statuses, None for all statuses page: Page number (1-based) page_size: Number of documents per page (10-200) sort_field: Field to sort by ('created_at', 'updated_at', 'id', 'file_path') @@ -622,7 +631,11 @@ class DocumentsRequest(BaseModel): """ status_filter: Optional[DocStatus] = Field( - default=None, description="Filter by document status, None for all statuses" + default=None, + description="Legacy single-status filter, ignored when status_filters is set", + ) + status_filters: Optional[List[DocStatus]] = Field( + default=None, description="Filter 
by multiple document statuses" ) page: int = Field(default=1, ge=1, description="Page number (1-based)") page_size: int = Field( @@ -638,7 +651,7 @@ class DocumentsRequest(BaseModel): model_config = ConfigDict( json_schema_extra={ "example": { - "status_filter": "PROCESSED", + "status_filters": ["PREPROCESSED", "PARSING", "ANALYZING"], "page": 1, "page_size": 50, "sort_field": "updated_at", @@ -933,56 +946,406 @@ def validate_file_path_security(file_path_str: str, base_dir: Path) -> Optional[ return None -def get_unique_filename_in_enqueued(target_dir: Path, original_name: str) -> str: - """Generate a unique filename in the target directory by adding numeric suffixes if needed +def get_doc_status_value(doc_status: Any) -> str: + """Read status from dict or DocProcessingStatus-like objects.""" + status = ( + doc_status.get("status") + if isinstance(doc_status, dict) + else getattr(doc_status, "status", None) + ) + if isinstance(status, DocStatus): + return status.value + return str(status or "") - Args: - target_dir: Target directory path - original_name: Original filename + +def get_doc_track_id(doc_status: Any) -> str: + """Read track_id from dict or DocProcessingStatus-like objects.""" + track_id = ( + doc_status.get("track_id") + if isinstance(doc_status, dict) + else getattr(doc_status, "track_id", None) + ) + return str(track_id or "") + + +async def get_existing_doc_by_file_path_candidates( + doc_status: Any, file_path: Path | str +) -> dict[str, Any] | None: + """Find an existing document by canonical basename.""" + basename = normalize_file_path(str(file_path)) + if basename == UNKNOWN_FILE_SOURCE: + return None + match = await doc_status.get_doc_by_file_basename(basename) + if not match: + return None + _, existing_doc_data = match + return existing_doc_data + + +async def _reserve_enqueue_slot(rag: LightRAG) -> bool: + """Atomically check exclusive-writer state and reserve a + pending-enqueue slot. 
+ + Concurrent enqueues are permitted while the processing loop is + running — the loop is notified via ``request_pending`` and picks up + newly-enqueued docs after its current batch. This includes the + scan task's processing phase: once classification is done, the + scan transitions to driving the processing pipeline like any + other enqueuer, and uploads can land alongside it. + + Two states block new uploads/inserts: + + - ``scanning_exclusive``: scan task is in its CLASSIFICATION + phase — reading doc_status to classify files (PROCESSED → + archive, FAILED-without-full_docs → retry-as-new, etc.) and + possibly deleting stale stubs. Concurrent enqueue would race + against scan's reads / stub deletions. ``scanning`` alone + (the processing phase) does NOT block uploads. + - ``destructive_busy``: a /documents/clear or per-doc delete is in + flight. These DROP storages and remove input files; an enqueue + accepted in this window would write to a storage that is being + torn down and silently lose the document after the client saw + success. + + ``pending_enqueues`` is incremented so the scan endpoint can refuse + while bg tasks are mid-enqueue. The counter does NOT gate + ``apipeline_process_enqueue_documents`` — concurrent processing is + explicitly allowed and is what makes "upload while pipeline is + busy" possible. + + A workspace whose ``pipeline_status`` has never been initialised + (mocked test rigs) is treated as idle; no slot is reserved. Returns: - str: Unique filename (may have numeric suffix added) + True when a slot was reserved (caller MUST pair with + ``_release_enqueue_slot``); False when pipeline_status is not + bootstrapped. + + Raises: + HTTPException(409): when + ``pipeline_status['scanning_exclusive']`` or + ``pipeline_status['destructive_busy']`` is set. 
""" - import time + from lightrag.exceptions import PipelineNotInitializedError + from lightrag.kg.shared_storage import get_namespace_data, get_namespace_lock - original_path = Path(original_name) - base_name = original_path.stem - extension = original_path.suffix + try: + pipeline_status = await get_namespace_data( + "pipeline_status", workspace=rag.workspace + ) + except PipelineNotInitializedError: + return False + pipeline_status_lock = get_namespace_lock( + "pipeline_status", workspace=rag.workspace + ) + async with pipeline_status_lock: + if pipeline_status.get("scanning_exclusive"): + raise HTTPException( + status_code=409, + detail=( + "Document scan is classifying files. " + "Wait for the classification phase to finish before " + "submitting new work." + ), + ) + if pipeline_status.get("destructive_busy"): + raise HTTPException( + status_code=409, + detail=( + "Pipeline is clearing or deleting documents. " + "Wait for the running job to finish before submitting " + "new work." + ), + ) + pipeline_status["pending_enqueues"] = ( + pipeline_status.get("pending_enqueues", 0) + 1 + ) + return True - # Try original name first - if not (target_dir / original_name).exists(): - return original_name - # Try with numeric suffixes 001-999 - for i in range(1, 1000): - suffix = f"{i:03d}" - new_name = f"{base_name}_{suffix}{extension}" - if not (target_dir / new_name).exists(): - return new_name +async def _acquire_destructive_busy(rag: LightRAG) -> tuple[bool, str | None]: + """Atomically reserve the destructive busy slot for ``/documents/clear`` + or ``/documents/delete_document``. - # Fallback with timestamp if all 999 slots are taken - timestamp = int(time.time()) - return f"{base_name}_{timestamp}{extension}" + Both jobs DROP storages and (for clear) remove input files. 
They + must serialise against: + - any other ``busy`` work (processing loop, another destructive job), + - an in-flight ``scanning`` task that reads/writes doc_status and + INPUT/, and + - any ``pending_enqueues`` reservation whose bg task has not yet + written to doc_status — accepting the destructive job in that + window would drop storages while the enqueue is mid-write, + losing a document the client already saw success for. -# Document processing helper functions (synchronous) -# These functions run in thread pool via asyncio.to_thread() to avoid blocking the event loop + All three checks happen inside a single ``pipeline_status_lock`` + critical section together with the flag write, so a concurrent + enqueue/scan reservation cannot squeeze past us. + Caller is responsible for clearing both flags in its finally block. -def _convert_with_docling(file_path: Path) -> str: - """Convert document using docling (synchronous). + Returns: + (acquired, reason). ``acquired=True`` and ``reason=None`` on + success. ``acquired=False`` with a human-readable ``reason`` + when another writer has the lock; the caller surfaces this to + the client (HTTP 200 with status="busy" for these endpoints). - Args: - file_path: Path to the document file + For test rigs where ``pipeline_status`` was never bootstrapped, + returns (True, None) — there is nothing to coordinate against. + """ + from lightrag.exceptions import PipelineNotInitializedError + from lightrag.kg.shared_storage import get_namespace_data, get_namespace_lock - Returns: - str: Extracted markdown content + try: + pipeline_status = await get_namespace_data( + "pipeline_status", workspace=rag.workspace + ) + except PipelineNotInitializedError: + return True, None + pipeline_status_lock = get_namespace_lock( + "pipeline_status", workspace=rag.workspace + ) + async with pipeline_status_lock: + if pipeline_status.get("busy"): + return False, "Pipeline is busy with another operation." 
+ if pipeline_status.get("scanning"): + return False, ( + "Document scan is in progress. " + "Wait for the scan to complete before clearing or deleting." + ) + if pipeline_status.get("pending_enqueues", 0) > 0: + return False, ( + "Document upload/insert is being enqueued. " + "Wait for in-flight work to complete before clearing or " + "deleting." + ) + pipeline_status["busy"] = True + pipeline_status["destructive_busy"] = True + return True, None + + +async def _release_destructive_busy(rag: LightRAG) -> None: + """Release the destructive busy slot acquired by + ``_acquire_destructive_busy``. Never raises. + + Distinct from ``_release_enqueue_slot``: that helper clears + ``pending_enqueues`` (the upload/insert reservation), this one + clears ``busy + destructive_busy`` (the clear/delete reservation). + """ + from lightrag.exceptions import PipelineNotInitializedError + from lightrag.kg.shared_storage import get_namespace_data, get_namespace_lock + + try: + pipeline_status = await get_namespace_data( + "pipeline_status", workspace=rag.workspace + ) + except PipelineNotInitializedError: + return + pipeline_status_lock = get_namespace_lock( + "pipeline_status", workspace=rag.workspace + ) + async with pipeline_status_lock: + pipeline_status["busy"] = False + pipeline_status["destructive_busy"] = False + + +async def _release_enqueue_slot(rag: LightRAG) -> None: + """Release a slot reserved by ``_reserve_enqueue_slot``. + + Pure decrement; the bg task itself drives processing by calling + ``apipeline_process_enqueue_documents`` after enqueue (the call is + a cheap no-op when the loop is already busy — it just sets + ``request_pending``). Drain coordination across sibling bg tasks + is unnecessary in the new contract: each task triggers processing + independently and the loop's request_pending mechanism collapses + duplicate triggers safely. + + Decrement is clamped at 0 so a stray release (e.g. 
from a workspace + whose reservation returned False but whose bg task wrapper still + calls release) is harmless. Never raises. + """ + from lightrag.exceptions import PipelineNotInitializedError + from lightrag.kg.shared_storage import get_namespace_data, get_namespace_lock + + try: + pipeline_status = await get_namespace_data( + "pipeline_status", workspace=rag.workspace + ) + except PipelineNotInitializedError: + return + pipeline_status_lock = get_namespace_lock( + "pipeline_status", workspace=rag.workspace + ) + async with pipeline_status_lock: + current = pipeline_status.get("pending_enqueues", 0) + if current > 0: + pipeline_status["pending_enqueues"] = current - 1 + + +def find_existing_file_by_file_path(input_dir: Path, file_path: str) -> Path | None: + """Find an input-dir file whose canonical basename matches ``file_path``. + + Callers pass the stored canonical ``file_path`` (already hint-stripped); + on-disk filenames are normalized before comparison so a hint-bearing + variant on disk still matches a canonical stored ``file_path``. + """ + if not file_path or file_path == UNKNOWN_FILE_SOURCE: + return None + try: + for candidate in input_dir.iterdir(): + if not candidate.is_file(): + continue + if normalize_file_path(candidate.name) == file_path: + return candidate + except FileNotFoundError: + return None + return None + + +def canonicalize_archived_file_variant_basename( + file_path: Path | str, *, strip_archive_suffix: bool = False +) -> str: + """Canonical basename for original files and numbered archive variants.""" + name = Path(file_path).name + path = Path(name) + stem = ( + ARCHIVED_FILE_SUFFIX_RE.sub("", path.stem) + if strip_archive_suffix + else path.stem + ) + return normalize_file_path(f"{stem}{path.suffix}") + + +def _file_path_for_parsed_artifact_dir(dir_name: str) -> str | None: + """Return the canonical source basename for a parser artifact dir. 
+ + Recognized layouts (suffix list in + :data:`lightrag.constants.PARSED_ARTIFACT_DIR_SUFFIXES`): + + - ``.parsed[_NNN]/`` — sidecar output (every engine) + - ``.mineru_raw[_NNN]/`` — MinerU preserved raw bundle + - ``.docling_raw[_NNN]/`` — Docling preserved raw bundle + + Raw bundles are preserved across re-parses for cache reuse and on-demand + diagnostics; they are cleaned only when the user deletes the document + with ``delete_file=True`` so the raw artifacts and source file go away + together. """ - from docling.document_converter import DocumentConverter # type: ignore + stripped = ARCHIVED_FILE_SUFFIX_RE.sub("", dir_name) + for suffix in PARSED_ARTIFACT_DIR_SUFFIXES: + if stripped.endswith(suffix): + basename = stripped[: -len(suffix)] + if basename: + return normalize_file_path(basename) + return None + + +def delete_file_variants_by_file_path( + input_dir: Path, + file_path: str | None, +) -> tuple[list[str], list[str]]: + """Delete input/__parsed__ source files matching a canonical ``file_path``.""" + if not file_path: + return [], [] + canonical = normalize_file_path(file_path) + if canonical == UNKNOWN_FILE_SOURCE: + return [], [] + canonical_names = {canonical} + + deleted_files: list[str] = [] + errors: list[str] = [] + candidate_dirs = [input_dir, input_dir / PARSED_DIR_NAME] + input_dir_resolved = input_dir.resolve() + + for candidate_dir in candidate_dirs: + try: + candidates = list(candidate_dir.iterdir()) + except FileNotFoundError: + continue + except Exception as e: + errors.append(f"Failed to scan {candidate_dir}: {e}") + continue + + in_parsed_dir = candidate_dir.name == PARSED_DIR_NAME + for candidate in candidates: + if candidate.is_file(): + if ( + canonicalize_archived_file_variant_basename( + candidate.name, + strip_archive_suffix=in_parsed_dir, + ) + not in canonical_names + ): + continue + + safe_candidate = validate_file_path_security( + candidate.name, candidate_dir + ) + if safe_candidate is None: + errors.append(f"Unsafe file 
path skipped: {candidate.name}") + continue + + try: + safe_candidate.unlink() + deleted_files.append( + str(safe_candidate.relative_to(input_dir_resolved)) + ) + except Exception as e: + errors.append(f"Failed to delete {candidate.name}: {e}") + continue + + if in_parsed_dir and candidate.is_dir(): + canonical_for_dir = _file_path_for_parsed_artifact_dir(candidate.name) + if ( + canonical_for_dir is None + or canonical_for_dir not in canonical_names + ): + continue + + safe_candidate = validate_file_path_security( + candidate.name, candidate_dir + ) + if safe_candidate is None: + errors.append(f"Unsafe artifact dir skipped: {candidate.name}") + continue + + try: + shutil.rmtree(safe_candidate) + deleted_files.append( + str(safe_candidate.relative_to(input_dir_resolved)) + ) + except Exception as e: + errors.append( + f"Failed to delete artifact dir {candidate.name}: {e}" + ) + + return deleted_files, errors + + +async def record_scan_warning(rag: LightRAG, message: str) -> None: + logger.warning(message) + try: + from lightrag.kg import shared_storage + + if not getattr(shared_storage, "_initialized", False): + return + + workspace = getattr(rag, "workspace", "") + pipeline_status = await shared_storage.get_namespace_data( + "pipeline_status", workspace=workspace + ) + pipeline_status_lock = shared_storage.get_namespace_lock( + "pipeline_status", workspace=workspace + ) + async with pipeline_status_lock: + pipeline_status["latest_message"] = message + pipeline_status["history_messages"].append(message) + except Exception: + pass - converter = DocumentConverter() - result = converter.convert(file_path) - return result.document.export_to_markdown() + +# Document processing helper functions (synchronous) +# These functions run in thread pool via asyncio.to_thread() to avoid blocking the event loop def _extract_pdf_pypdf(file_bytes: bytes, password: str = None) -> str: @@ -1231,7 +1594,10 @@ def escape_sheet_title(title: str) -> str: async def pipeline_enqueue_file( - 
rag: LightRAG, file_path: Path, track_id: str = None + rag: LightRAG, + file_path: Path, + track_id: str = None, + from_scan: bool = False, ) -> tuple[bool, str]: """Add a file to the queue for processing @@ -1239,6 +1605,10 @@ async def pipeline_enqueue_file( rag: LightRAG instance file_path: Path to the saved file track_id: Optional tracking ID, if not provided will be generated + from_scan: True only when invoked by the scan-owned background task, + which already holds ``pipeline_status["scanning"]``. Forwarded to + ``apipeline_enqueue_documents`` so the scan can enqueue the files + it just discovered without tripping the scanning guard there. Returns: tuple: (success: bool, track_id: str) """ @@ -1259,6 +1629,66 @@ async def pipeline_enqueue_file( except Exception: file_size = 0 + try: + extraction_engine, process_options = resolve_file_parser_directives( + file_path + ) + except FilenameParserHintError as e: + error_files = [ + { + "file_path": str(file_path.name), + "error_description": "[File Extraction]Filename hint error", + "original_error": str(e), + "file_size": file_size, + } + ] + await rag.apipeline_enqueue_error_documents(error_files, track_id) + logger.error( + f"[File Extraction]Invalid filename hint in {file_path.name}: {e}" + ) + return False, track_id + + api_process_options = process_options or PROCESS_OPTION_CHUNK_FIXED + if extraction_engine != PARSER_ENGINE_LEGACY: + try: + enqueue_kwargs = { + "file_paths": str(file_path), + "track_id": track_id, + "docs_format": FULL_DOCS_FORMAT_PENDING_PARSE, + "parse_engine": extraction_engine, + "process_options": api_process_options, + "from_scan": from_scan, + } + enqueue_result = await rag.apipeline_enqueue_documents( + "", **enqueue_kwargs + ) + if enqueue_result is None: + try: + await move_file_to_parsed_dir(file_path) + except Exception as move_error: + logger.error( + f"Failed to move duplicate file {file_path.name} to {PARSED_DIR_NAME} directory: {move_error}" + ) + return False, track_id + 
logger.info( + f"[File Extraction]Deferred {file_path.name} to {extraction_engine} parser" + ) + return True, track_id + except Exception as e: + error_files = [ + { + "file_path": str(file_path.name), + "error_description": "[File Extraction]Parser enqueue error", + "original_error": f"Failed to enqueue file for parser: {str(e)}", + "file_size": file_size, + } + ] + await rag.apipeline_enqueue_error_documents(error_files, track_id) + logger.error( + f"[File Extraction]Error enqueuing {file_path.name} for {extraction_engine}: {str(e)}" + ) + return False, track_id + file = None try: async with aiofiles.open(file_path, "rb") as f: @@ -1404,28 +1834,11 @@ async def pipeline_enqueue_file( case ".pdf": try: - # Try DOCLING first if configured and available - if ( - global_args.document_loading_engine == "DOCLING" - and _is_docling_available() - ): - content = await asyncio.to_thread( - _convert_with_docling, file_path - ) - else: - if ( - global_args.document_loading_engine == "DOCLING" - and not _is_docling_available() - ): - logger.warning( - f"DOCLING engine configured but not available for {file_path.name}. Falling back to pypdf." - ) - # Use pypdf (non-blocking via to_thread) - content = await asyncio.to_thread( - _extract_pdf_pypdf, - file, - global_args.pdf_decrypt_password, - ) + content = await asyncio.to_thread( + _extract_pdf_pypdf, + file, + global_args.pdf_decrypt_password, + ) except Exception as e: error_files = [ { @@ -1445,24 +1858,7 @@ async def pipeline_enqueue_file( case ".docx": try: - # Try DOCLING first if configured and available - if ( - global_args.document_loading_engine == "DOCLING" - and _is_docling_available() - ): - content = await asyncio.to_thread( - _convert_with_docling, file_path - ) - else: - if ( - global_args.document_loading_engine == "DOCLING" - and not _is_docling_available() - ): - logger.warning( - f"DOCLING engine configured but not available for {file_path.name}. Falling back to python-docx." 
- ) - # Use python-docx (non-blocking via to_thread) - content = await asyncio.to_thread(_extract_docx, file) + content = await asyncio.to_thread(_extract_docx, file) except Exception as e: error_files = [ { @@ -1482,24 +1878,7 @@ async def pipeline_enqueue_file( case ".pptx": try: - # Try DOCLING first if configured and available - if ( - global_args.document_loading_engine == "DOCLING" - and _is_docling_available() - ): - content = await asyncio.to_thread( - _convert_with_docling, file_path - ) - else: - if ( - global_args.document_loading_engine == "DOCLING" - and not _is_docling_available() - ): - logger.warning( - f"DOCLING engine configured but not available for {file_path.name}. Falling back to python-pptx." - ) - # Use python-pptx (non-blocking via to_thread) - content = await asyncio.to_thread(_extract_pptx, file) + content = await asyncio.to_thread(_extract_pptx, file) except Exception as e: error_files = [ { @@ -1519,24 +1898,7 @@ async def pipeline_enqueue_file( case ".xlsx": try: - # Try DOCLING first if configured and available - if ( - global_args.document_loading_engine == "DOCLING" - and _is_docling_available() - ): - content = await asyncio.to_thread( - _convert_with_docling, file_path - ) - else: - if ( - global_args.document_loading_engine == "DOCLING" - and not _is_docling_available() - ): - logger.warning( - f"DOCLING engine configured but not available for {file_path.name}. Falling back to openpyxl." 
- ) - # Use openpyxl (non-blocking via to_thread) - content = await asyncio.to_thread(_extract_xlsx, file) + content = await asyncio.to_thread(_extract_xlsx, file) except Exception as e: error_files = [ { @@ -1603,34 +1965,35 @@ async def pipeline_enqueue_file( return False, track_id try: - await rag.apipeline_enqueue_documents( - content, file_paths=file_path.name, track_id=track_id + enqueue_kwargs = { + "file_paths": file_path.name, + "track_id": track_id, + "parse_engine": PARSER_ENGINE_LEGACY, + "process_options": api_process_options, + "from_scan": from_scan, + } + enqueue_result = await rag.apipeline_enqueue_documents( + content, **enqueue_kwargs ) + if enqueue_result is None: + try: + await move_file_to_parsed_dir(file_path) + except Exception as move_error: + logger.error( + f"Failed to move duplicate file {file_path.name} to {PARSED_DIR_NAME} directory: {move_error}" + ) + return False, track_id logger.info( f"Successfully extracted and enqueued file: {file_path.name}" ) - # Move file to __enqueued__ directory after enqueuing + # Move file to __parsed__ directory after enqueuing (LR2-PRD: parsed output dir) try: - enqueued_dir = file_path.parent / "__enqueued__" - await asyncio.to_thread(enqueued_dir.mkdir, exist_ok=True) - - # Generate unique filename to avoid conflicts - unique_filename = get_unique_filename_in_enqueued( - enqueued_dir, file_path.name - ) - target_path = enqueued_dir / unique_filename - - # Move the file - await asyncio.to_thread(file_path.rename, target_path) - logger.debug( - f"Moved file to enqueued directory: {file_path.name} -> {unique_filename}" - ) - + await move_file_to_parsed_dir(file_path) except Exception as move_error: logger.error( - f"Failed to move file {file_path.name} to __enqueued__ directory: {move_error}" + f"Failed to move file {file_path.name} to {PARSED_DIR_NAME} directory: {move_error}" ) # Don't affect the main function's success status @@ -1697,9 +2060,7 @@ async def pipeline_index_file(rag: LightRAG, 
file_path: Path, track_id: str = No track_id: Optional tracking ID """ try: - success, returned_track_id = await pipeline_enqueue_file( - rag, file_path, track_id - ) + success, _ = await pipeline_enqueue_file(rag, file_path, track_id) if success: await rag.apipeline_process_enqueue_documents() @@ -1709,7 +2070,10 @@ async def pipeline_index_file(rag: LightRAG, file_path: Path, track_id: str = No async def pipeline_index_files( - rag: LightRAG, file_paths: List[Path], track_id: str = None + rag: LightRAG, + file_paths: List[Path], + track_id: str = None, + from_scan: bool = False, ): """Index multiple files sequentially to avoid high CPU load @@ -1717,6 +2081,11 @@ async def pipeline_index_files( rag: LightRAG instance file_paths: Paths to the files to index track_id: Optional tracking ID to pass to all files + from_scan: True only when invoked by the scan-owned background task. + Forwarded to ``pipeline_enqueue_file`` so the per-file enqueue + calls bypass the scanning guard inside + ``apipeline_enqueue_documents`` (whose ``scanning`` flag the + scan task itself owns). 
""" if not file_paths: return @@ -1730,7 +2099,12 @@ async def pipeline_index_files( # Process files sequentially with track_id for file_path in sorted_file_paths: - success, _ = await pipeline_enqueue_file(rag, file_path, track_id) + success, _ = await pipeline_enqueue_file( + rag, + file_path, + track_id, + from_scan=from_scan, + ) if success: enqueued = True @@ -1759,20 +2133,20 @@ async def pipeline_index_texts( if not texts: return - normalized_file_sources: list[str] | None = None - if file_sources: - normalized_file_sources = [ - normalize_file_path(source) for source in file_sources - ] - if len(normalized_file_sources) > len(texts): - raise ValueError("Number of file sources must not exceed number of texts") - if len(normalized_file_sources) < len(texts): - normalized_file_sources.extend( - [UNKNOWN_FILE_SOURCE] * (len(texts) - len(normalized_file_sources)) - ) + if not file_sources or len(file_sources) != len(texts): + raise ValueError("A valid file source is required for each text") + + normalized_file_sources = [normalize_file_path(source) for source in file_sources] + if any(source == UNKNOWN_FILE_SOURCE for source in normalized_file_sources): + raise ValueError("A valid file source is required for each text") + if len(set(normalized_file_sources)) != len(normalized_file_sources): + raise ValueError("File sources must be unique by filename") await rag.apipeline_enqueue_documents( - input=texts, file_paths=normalized_file_sources, track_id=track_id + input=texts, + file_paths=normalized_file_sources, + track_id=track_id, + process_options=PROCESS_OPTION_CHUNK_FIXED, ) await rag.apipeline_process_enqueue_documents() @@ -1787,45 +2161,222 @@ async def run_scanning_process( doc_manager: DocumentManager instance track_id: Optional tracking ID to pass to all scanned files """ + # The scan endpoint set ``scanning=True`` AND + # ``scanning_exclusive=True`` synchronously before scheduling this + # task. 
``scanning`` covers the whole lifecycle (refuses + # overlapping scans); ``scanning_exclusive`` covers only the + # classification phase below — we clear it before invoking + # pipeline_index_files so concurrent uploads can land while the + # scan-driven processing finishes. Both MUST be cleared in + # finally so subsequent uploads / scans can proceed even if the + # body raises. When pipeline_status is not initialised (mocked + # test rigs), the flags were never set so there's nothing to + # clear — track that here to skip the namespace fetch. + from lightrag.exceptions import PipelineNotInitializedError + from lightrag.kg.shared_storage import get_namespace_data, get_namespace_lock + + pipeline_status = None + pipeline_status_lock = None + try: + pipeline_status = await get_namespace_data( + "pipeline_status", workspace=rag.workspace + ) + pipeline_status_lock = get_namespace_lock( + "pipeline_status", workspace=rag.workspace + ) + except PipelineNotInitializedError: + pass + try: new_files = doc_manager.scan_directory_for_new_files() total_files = len(new_files) logger.info(f"Found {total_files} files to index.") if new_files: - # Check for files with PROCESSED status and filter them out - valid_files = [] - processed_files = [] + # Group canonical-equivalent files so we can prefer hint-bearing + # variants over plain ones. Within each group sort order is + # preserved as a deterministic tiebreaker. + files_by_canonical_name: dict[str, list[Path]] = {} + for file_path in sorted( + new_files, key=lambda p: get_pinyin_sort_key(str(p)) + ): + canonical_name = normalize_file_path(str(file_path)) + files_by_canonical_name.setdefault(canonical_name, []).append(file_path) + + unique_files: list[Path] = [] + for canonical_name, group in files_by_canonical_name.items(): + # Prefer the first file carrying a supported parser hint so + # the user's explicit engine choice wins over plain variants; + # otherwise fall back to the first sorted entry. 
+ chosen = next( + (f for f in group if filename_parser_hint(f.name) is not None), + group[0], + ) + unique_files.append(chosen) + for duplicate in group: + if duplicate is chosen: + continue + warning = ( + "Skipping duplicate file in scan batch: " + f"{duplicate.name} duplicates {chosen.name} " + f"(canonical: {canonical_name})" + ) + await record_scan_warning(rag, warning) + try: + await move_file_to_parsed_dir(duplicate) + except Exception as move_error: + logger.error( + f"Failed to move duplicate scan file {duplicate.name} to {PARSED_DIR_NAME}: {move_error}" + ) - for file_path in new_files: + # Partition unique_files into: + # * processed_files — already PROCESSED, archived and skipped. + # * resume_files — same canonical basename matches an existing + # non-PROCESSED doc_status row (PARSING / + # FAILED / PROCESSING / ANALYZING / PENDING). + # These must NOT go through pipeline_enqueue_file + # because apipeline_enqueue_documents would + # treat the same canonical name as a duplicate + # (returning None) and pipeline_enqueue_file + # would then archive the source as if it were + # a duplicate — corrupting pending-parse cases + # that still need the source on disk. The + # pipeline's resume logic, triggered via + # apipeline_process_enqueue_documents, will + # advance them based on their existing + # doc_status row. + # * new_files — no existing record; standard enqueue path. + new_files: list[Path] = [] + resume_files: list[Path] = [] + processed_files: list[str] = [] + + for file_path in unique_files: filename = file_path.name - existing_doc_data = await rag.doc_status.get_doc_by_file_path(filename) + # Inline the canonical-basename lookup so we keep both the + # doc_id and the data: the FAILED-without-full_docs sub-case + # below needs the doc_id to delete the stale stub. 
+ basename = normalize_file_path(str(file_path)) + existing_match = ( + await rag.doc_status.get_doc_by_file_basename(basename) + if basename != UNKNOWN_FILE_SOURCE + else None + ) + existing_doc_id, existing_doc_data = ( + existing_match if existing_match else (None, None) + ) - if existing_doc_data and existing_doc_data.get("status") == "processed": - # File is already PROCESSED, skip it with warning + if ( + existing_doc_data + and get_doc_status_value(existing_doc_data) + == DocStatus.PROCESSED.value + ): + # File is already PROCESSED, skip it with warning and archive it. processed_files.append(filename) - logger.warning(f"Skipping already processed file: {filename}") - else: - # File is new or in non-PROCESSED status, add to processing list - valid_files.append(file_path) - - # Process valid files (new files + non-PROCESSED status files) - if valid_files: - await pipeline_index_files(rag, valid_files, track_id) - if processed_files: + warning = f"Skipping already processed file: " f"{filename}" + await record_scan_warning(rag, warning) + try: + await move_file_to_parsed_dir(file_path) + except Exception as move_error: + logger.error( + f"Failed to move already processed file {filename} to {PARSED_DIR_NAME}: {move_error}" + ) + elif existing_doc_data: + # FAILED rows recorded by apipeline_enqueue_error_documents + # never write a full_docs entry — extraction blew up before + # any content was stored. _validate_and_fix_document_consistency + # preserves them for manual review and removes them from the + # processing list, so the resume path can never advance them. + # When the user fixes the file and re-scans we want a real + # retry: drop the stale stub and treat the file as new so + # the standard enqueue path re-extracts content. 
+ status_value = get_doc_status_value(existing_doc_data) + if status_value == DocStatus.FAILED.value: + full_doc = await rag.full_docs.get_by_id(existing_doc_id) + if full_doc is None: + try: + await rag.doc_status.delete([existing_doc_id]) + except Exception as delete_error: + logger.error( + "Failed to delete stale failed-extraction " + f"doc_status stub {existing_doc_id} " + f"({filename}): {delete_error}" + ) + # Fall through to resume — at worst the row + # remains preserved (current behaviour) rather + # than re-enqueued. + resume_files.append(file_path) + continue + logger.info( + "Retrying previously failed extraction; " + f"removed stale doc_status stub: {filename} " + f"(doc_id: {existing_doc_id})" + ) + new_files.append(file_path) + continue logger.info( - f"Scanning process completed: {len(valid_files)} files Processed {len(processed_files)} skipped." + "Resuming previously unfinished file from scan: " + f"{filename} (Status: {status_value})" ) + resume_files.append(file_path) else: - logger.info( - f"Scanning process completed: {len(valid_files)} files Processed." - ) + new_files.append(file_path) + + # Classification phase complete — release ``scanning_exclusive`` + # so concurrent uploads/inserts can land in doc_status while + # the scan-driven processing finishes. ``scanning`` stays + # True for the rest of the task lifecycle (releases in + # finally) so the /scan endpoint still refuses overlapping + # scans. Any per-file enqueue or duplicate detected during + # the processing phase is handled by + # apipeline_enqueue_documents' in-batch dedup, identical to + # the upload-during-busy case. + if pipeline_status is not None and pipeline_status_lock is not None: + async with pipeline_status_lock: + pipeline_status["scanning_exclusive"] = False + + # New files take the standard enqueue + process path. 
When at + # least one new file is successfully enqueued, pipeline_index_files + # internally invokes apipeline_process_enqueue_documents, which + # selects work by doc_status state and so will also pick up any + # resume_files in the same run. + if new_files: + await pipeline_index_files( + rag, + new_files, + track_id, + from_scan=True, + ) + + # Resume targets must always trigger the pipeline explicitly: + # pipeline_index_files only runs apipeline_process_enqueue_documents + # after at least one new file successfully enqueues, so when every + # new file is rejected (unsupported extension, empty body, content + # / filename duplicate, ...) the resume rows would otherwise stay + # stuck until an unrelated indexing run. When new files DID + # enqueue, the inner call already drained the queue and this is a + # cheap no-op that returns "No documents to process". + if resume_files: + await rag.apipeline_process_enqueue_documents() + + total_active = len(new_files) + len(resume_files) + if total_active or processed_files: + summary_parts: list[str] = [] + if total_active: + summary_parts.append(f"{total_active} files Processed") + if processed_files: + summary_parts.append(f"{len(processed_files)} skipped") + logger.info(f"Scanning process completed: {' '.join(summary_parts)}.") else: logger.info( "No files to process after filtering already processed files." ) else: - # No new files to index, check if there are any documents in the queue + # No new files to index — classification is trivially done; + # release ``scanning_exclusive`` before driving the queue so + # concurrent uploads can land while process_enqueue runs. + if pipeline_status is not None and pipeline_status_lock is not None: + async with pipeline_status_lock: + pipeline_status["scanning_exclusive"] = False logger.info( "No upload file found, check if there are any documents in the queue..." 
) @@ -1834,6 +2385,14 @@ async def run_scanning_process( except Exception as e: logger.error(f"Error during scanning process: {str(e)}") logger.error(traceback.format_exc()) + finally: + # Always release both scanning flags so future uploads / scans + # are not blocked by a crashed task. Skip when pipeline_status + # was never initialised for this workspace (test rigs). + if pipeline_status is not None and pipeline_status_lock is not None: + async with pipeline_status_lock: + pipeline_status["scanning"] = False + pipeline_status["scanning_exclusive"] = False async def background_delete_documents( @@ -1860,16 +2419,16 @@ async def background_delete_documents( successful_deletions = [] failed_deletions = [] - # Double-check pipeline status before proceeding + # The /documents/delete_document endpoint has already reserved the + # destructive slot synchronously: ``busy=True`` and + # ``destructive_busy=True`` were set before the client got + # ``deletion_started``, after checking busy + scanning + + # pending_enqueues>0 atomically. Here we only update the + # job-info fields; the busy reservation was acquired by the + # endpoint and is released in the finally block below. 
async with pipeline_status_lock: - if pipeline_status.get("busy", False): - logger.warning("Error: Unexpected pipeline busy state, aborting deletion.") - return # Abort deletion operation - - # Set pipeline status to busy for deletion pipeline_status.update( { - "busy": True, # Job name can not be changed, it's verified in adelete_by_doc_id() "job_name": f"Deleting {total_docs} Documents", "job_start": datetime.now().isoformat(), @@ -1925,95 +2484,45 @@ async def background_delete_documents( async with pipeline_status_lock: pipeline_status["history_messages"].append(success_msg) - # Handle file deletion if requested and file_path is available + # Handle file deletion if requested and source information is available if ( delete_file and result.file_path - and result.file_path != "unknown_source" + and result.file_path != UNKNOWN_FILE_SOURCE ): try: - deleted_files = [] - # SECURITY FIX: Use secure path validation to prevent arbitrary file deletion - safe_file_path = validate_file_path_security( - result.file_path, doc_manager.input_dir + deleted_files, file_delete_errors = ( + delete_file_variants_by_file_path( + doc_manager.input_dir, + result.file_path, + ) ) + for file_delete_error in file_delete_errors: + logger.warning(file_delete_error) + async with pipeline_status_lock: + pipeline_status["latest_message"] = ( + file_delete_error + ) + pipeline_status["history_messages"].append( + file_delete_error + ) - if safe_file_path is None: - # Security violation detected - log and skip file deletion - security_msg = f"Security violation: Unsafe file path detected for deletion - {result.file_path}" - logger.warning(security_msg) + if deleted_files: + file_delete_msg = ( + "Successfully deleted source files: " + + ", ".join(deleted_files) + ) + logger.info(file_delete_msg) async with pipeline_status_lock: - pipeline_status["latest_message"] = security_msg + pipeline_status["latest_message"] = file_delete_msg pipeline_status["history_messages"].append( - security_msg + 
file_delete_msg ) else: - # check and delete files from input_dir directory - if safe_file_path.exists(): - try: - safe_file_path.unlink() - deleted_files.append(safe_file_path.name) - file_delete_msg = f"Successfully deleted input_dir file: {result.file_path}" - logger.info(file_delete_msg) - async with pipeline_status_lock: - pipeline_status["latest_message"] = ( - file_delete_msg - ) - pipeline_status["history_messages"].append( - file_delete_msg - ) - except Exception as file_error: - file_error_msg = f"Failed to delete input_dir file {result.file_path}: {str(file_error)}" - logger.debug(file_error_msg) - async with pipeline_status_lock: - pipeline_status["latest_message"] = ( - file_error_msg - ) - pipeline_status["history_messages"].append( - file_error_msg - ) - - # Also check and delete files from __enqueued__ directory - enqueued_dir = doc_manager.input_dir / "__enqueued__" - if enqueued_dir.exists(): - # SECURITY FIX: Validate that the file path is safe before processing - # Only proceed if the original path validation passed - base_name = Path(result.file_path).stem - extension = Path(result.file_path).suffix - - # Search for exact match and files with numeric suffixes - for enqueued_file in enqueued_dir.glob( - f"{base_name}*{extension}" - ): - # Additional security check: ensure enqueued file is within enqueued directory - safe_enqueued_path = ( - validate_file_path_security( - enqueued_file.name, enqueued_dir - ) - ) - if safe_enqueued_path is not None: - try: - enqueued_file.unlink() - deleted_files.append(enqueued_file.name) - logger.info( - f"Successfully deleted enqueued file: {enqueued_file.name}" - ) - except Exception as enqueued_error: - file_error_msg = f"Failed to delete enqueued file {enqueued_file.name}: {str(enqueued_error)}" - logger.debug(file_error_msg) - async with pipeline_status_lock: - pipeline_status[ - "latest_message" - ] = file_error_msg - pipeline_status[ - "history_messages" - ].append(file_error_msg) - else: - security_msg = 
f"Security violation: Unsafe enqueued file path detected - {enqueued_file.name}" - logger.warning(security_msg) - - if deleted_files == []: - file_error_msg = f"File deletion skipped, missing or unsafe file: {result.file_path}" + file_error_msg = ( + "File deletion skipped, missing or unsafe file: " + f"{result.file_path}" + ) logger.warning(file_error_msg) async with pipeline_status_lock: pipeline_status["latest_message"] = file_error_msg @@ -2064,6 +2573,7 @@ async def background_delete_documents( # Final summary and check for pending requests async with pipeline_status_lock: pipeline_status["busy"] = False + pipeline_status["destructive_busy"] = False pipeline_status["pending_requests"] = False # Reset pending requests flag pipeline_status["cancellation_requested"] = ( False # Always reset cancellation flag @@ -2105,17 +2615,115 @@ async def scan_for_new_documents(background_tasks: BackgroundTasks): """ Trigger the scanning process for new documents. - This endpoint initiates a background task that scans the input directory for new documents - and processes them. If a scanning process is already running, it returns a status indicating - that fact. + Refuses to start a new scan with + ``status='scanning_skipped_pipeline_busy'`` (and does not + schedule a background task) when any of these is set: + + - ``pipeline_status["busy"]`` — the processing loop or another + destructive job is running. + - ``pipeline_status["scanning"]`` — another scan is already + running (any phase: classification or processing). + - ``pipeline_status["pending_enqueues"] > 0`` — an /upload, + /text or /texts endpoint has reserved a slot whose bg task + has not yet written to doc_status; starting a scan now would + race scan's classification reads against that pending write. + + Both ``scanning`` and ``scanning_exclusive`` are acquired + synchronously here so a subsequent fast-follow request hits the + guard rather than racing against the not-yet-started task. 
+ ``run_scanning_process`` clears ``scanning_exclusive`` once + classification is done, allowing concurrent uploads to land + while the scan-driven processing finishes. Returns: ScanResponse: A response object containing the scanning status and track_id """ + from lightrag.exceptions import PipelineNotInitializedError + from lightrag.kg.shared_storage import get_namespace_data, get_namespace_lock + # Generate track_id with "scan" prefix for scanning operation track_id = generate_track_id("scan") - # Start the scanning process in the background with track_id + try: + pipeline_status = await get_namespace_data( + "pipeline_status", workspace=rag.workspace + ) + except PipelineNotInitializedError: + # Workspace pipeline_status not yet bootstrapped (e.g. mocked + # test rigs). Treat as idle and allow the scan to proceed; the + # scanning flag has nowhere to live so it is effectively skipped. + background_tasks.add_task(run_scanning_process, rag, doc_manager, track_id) + return ScanResponse( + status="scanning_started", + message="Scanning process has been initiated in the background", + track_id=track_id, + ) + pipeline_status_lock = get_namespace_lock( + "pipeline_status", workspace=rag.workspace + ) + + # Atomically acquire the scanning flag. Scan is the exclusive + # writer in this contract — it reads doc_status to make + # classification decisions (PROCESSED / resume / retry-as-new / + # archive) and would race with concurrent writers — so refuse if: + # * pipeline is processing (busy=True): scan + processing both + # read/mutate doc_status; serialise. + # * another scan is in flight (scanning=True). + # * any /upload, /text, /texts endpoint has reserved a + # pending-enqueue slot (see _reserve_enqueue_slot): the bg + # task has not yet written doc_status and we would otherwise + # race with its mid-flight write. 
+ async with pipeline_status_lock: + if pipeline_status.get("busy"): + logger.warning( + "Scan request skipped: pipeline is busy processing documents" + ) + return ScanResponse( + status="scanning_skipped_pipeline_busy", + message=( + "Pipeline is currently busy processing documents. " + "Wait for the running job to finish before triggering another scan." + ), + track_id=track_id, + ) + if pipeline_status.get("scanning"): + logger.warning( + "Scan request skipped: another scan is already in progress" + ) + return ScanResponse( + status="scanning_skipped_pipeline_busy", + message=( + "Another scan is already in progress. " + "Wait for it to finish before triggering a new one." + ), + track_id=track_id, + ) + pending_enqueues = pipeline_status.get("pending_enqueues", 0) + if pending_enqueues > 0: + logger.warning( + "Scan request skipped: " + f"{pending_enqueues} pending enqueue(s) reserved by " + "upload/insert endpoints" + ) + return ScanResponse( + status="scanning_skipped_pipeline_busy", + message=( + "Document upload/insert is being enqueued. " + "Wait for in-flight work to complete before triggering a scan." + ), + track_id=track_id, + ) + # ``scanning`` covers the whole scan task lifecycle (used by + # this endpoint to refuse overlapping scans). + # ``scanning_exclusive`` is True only during the + # classification phase: run_scanning_process clears it once + # classification is done so concurrent uploads can land + # while the scan-driven processing finishes. + pipeline_status["scanning"] = True + pipeline_status["scanning_exclusive"] = True + + # Start the scanning process in the background with track_id. The + # task is responsible for clearing both flags in its finally block. background_tasks.add_task(run_scanning_process, rag, doc_manager, track_id) return ScanResponse( status="scanning_started", @@ -2146,11 +2754,17 @@ async def upload_to_input_dir( This endpoint handles two types of duplicate scenarios differently: 1. 
**Filename Duplicate (Synchronous Detection)**: - - Detected immediately before file processing - - Returns `status="duplicated"` with the existing document's track_id - - Two cases: - - If filename exists in document storage: returns existing track_id - - If filename exists in file system only: returns empty track_id ("") + - Detected immediately, before any file is written. + - File name is treated as the unique document key. Both + ``doc_status`` and the INPUT directory are checked under the + canonical (parser-hint stripped) basename so ``abc.docx`` and + ``abc.[native].docx`` map to the same record. + - **HTTP 409** is returned when a same-name record already exists. + The response detail names the conflict source ("Document + storage already contains ..." or "Input directory already + contains ..."). Clients must delete the existing document + (``DELETE /documents/{doc_id}``) before re-uploading; there is + no longer a 200 ``status="duplicated"`` soft-fail response. 2. **Content Duplicate (Asynchronous Detection)**: - Detected during background processing after content extraction @@ -2168,6 +2782,20 @@ async def upload_to_input_dir( - Content extraction is expensive (PDF/DOCX parsing), done asynchronously - This design prevents blocking the client during expensive operations + **Concurrency Constraint:** + - The endpoint refuses with HTTP 409 only while one of the + following exclusive-writer states is set: + ``pipeline_status["scanning_exclusive"]`` (a scan is in its + classification phase, reading and possibly mutating doc_status) + or ``pipeline_status["destructive_busy"]`` (``/documents/clear`` + or per-doc delete is dropping storages / removing input files). + Wait for the running job to finish before re-submitting. 
+ - ``busy=True`` from the processing loop, and a scan in its + processing phase (``scanning=True`` with + ``scanning_exclusive=False``), do NOT block uploads — uploads + are accepted concurrently and the running pipeline picks them + up via its ``request_pending`` mechanism. + Args: background_tasks: FastAPI BackgroundTasks for async processing file (UploadFile): The file to be uploaded. It must have an allowed extension. @@ -2175,12 +2803,26 @@ async def upload_to_input_dir( Returns: InsertResponse: A response object containing the upload status and a message. - status="success": File accepted and queued for processing - - status="duplicated": Filename already exists (see track_id for existing document) Raises: - HTTPException: If the file type is not supported (400), file too large (413), or other errors occur (500). + HTTPException: 400 unsupported file type, 409 same-name + conflict or scan-classifying / destructive job in + flight, 413 file too large, 500 other errors. """ + slot_reserved = False try: + # Reject upload while a scan is in its CLASSIFICATION + # phase or a destructive job (clear / per-doc delete) is + # in flight, AND reserve a pending-enqueue slot so a scan + # request that arrives before the bg task runs cannot + # transition scanning_exclusive=True under us. Concurrent + # processing (``busy=True``) and a scan in its processing + # phase (``scanning=True`` with + # ``scanning_exclusive=False``) are permitted: the running + # loop's ``request_pending`` mechanism picks up our doc + # after the current batch. 
+ slot_reserved = await _reserve_enqueue_slot(rag) + # Sanitize filename to prevent Path Traversal attacks safe_filename = sanitize_filename(file.filename, doc_manager.input_dir) @@ -2211,26 +2853,43 @@ async def upload_to_input_dir( f"File size not available in UploadFile for {safe_filename}, will check during streaming" ) - # Check if filename already exists in doc_status storage - existing_doc_data = await rag.doc_status.get_doc_by_file_path(safe_filename) + file_path = doc_manager.input_dir / safe_filename + + # Strict name pre-check. Both the INPUT directory and doc_status + # must be free of any same-canonical-basename record before we + # accept the upload. Replacing an existing document requires an + # explicit DELETE first; we no longer write a "duplicated" 200 + # response that silently no-ops. + existing_doc_data = await get_existing_doc_by_file_path_candidates( + rag.doc_status, file_path + ) if existing_doc_data: - # Get document status and track_id from existing document - status = existing_doc_data.get("status", "unknown") - # Use `or ""` to handle both missing key and None value (e.g., legacy rows without track_id) - existing_track_id = existing_doc_data.get("track_id") or "" - return InsertResponse( - status="duplicated", - message=f"File '{safe_filename}' already exists in document storage (Status: {status}).", - track_id=existing_track_id, + status = get_doc_status_value(existing_doc_data) or "unknown" + raise HTTPException( + status_code=409, + detail=( + f"Document storage already contains '{safe_filename}' " + f"(Status: {status}). Delete the existing record before re-uploading." + ), ) - file_path = doc_manager.input_dir / safe_filename - # Check if file already exists in file system + # INPUT directory check, using canonical parser-hint names. + # Fast path: exact filename match avoids iterdir on large input directories. 
+ canonical_filename = normalize_file_path(safe_filename) if file_path.exists(): - return InsertResponse( - status="duplicated", - message=f"File '{safe_filename}' already exists in the input directory.", - track_id="", + existing_input_file: Path | None = file_path + else: + existing_input_file = find_existing_file_by_file_path( + doc_manager.input_dir, canonical_filename + ) + if existing_input_file: + raise HTTPException( + status_code=409, + detail=( + f"Input directory already contains a file with the same " + f"canonical basename ('{existing_input_file.name}'). " + f"Remove or rename it before re-uploading." + ), ) # Async streaming write with size check @@ -2274,8 +2933,24 @@ async def upload_to_input_dir( track_id = generate_track_id("upload") - # Add to background tasks and get track_id - background_tasks.add_task(pipeline_index_file, rag, file_path, track_id) + # Bg task: enqueue + trigger processing, then release the slot. + # ``pipeline_index_file`` does both: it calls + # ``pipeline_enqueue_file`` (writes doc_status / full_docs) and + # then ``apipeline_process_enqueue_documents``. The latter is + # safe to invoke even when the loop is already busy — it + # collapses to a ``request_pending=True`` nudge and returns, + # so concurrent uploads/inserts cooperate via the running + # loop's request_pending mechanism. + async def _indexing_task(): + try: + await pipeline_index_file(rag, file_path, track_id) + finally: + await _release_enqueue_slot(rag) + + background_tasks.add_task(_indexing_task) + # Ownership of the slot transferred to the bg task — the + # finally block below must NOT release it again. + slot_reserved = False return InsertResponse( status="success", @@ -2290,6 +2965,13 @@ async def upload_to_input_dir( logger.error(f"Error /documents/upload: {file.filename}: {str(e)}") logger.error(traceback.format_exc()) raise HTTPException(status_code=500, detail=str(e)) + finally: + # If we reserved a slot but never scheduled the bg task + # (e.g. 
early validation rejection or streaming-write + # failure), release here. No drain coordination needed — + # any sibling bg task triggers its own processing pass. + if slot_reserved: + await _release_enqueue_slot(rag) @router.post( "/text", response_model=InsertResponse, dependencies=[Depends(combined_auth)] @@ -2303,6 +2985,15 @@ async def insert_text( This endpoint allows you to insert text data into the RAG system for later retrieval and use in generating responses. + **Concurrency Constraint:** + - Refuses with HTTP 409 only while + ``pipeline_status["scanning_exclusive"]`` (a scan is in its + classification phase) or ``pipeline_status["destructive_busy"]`` + (clear / per-doc delete is in flight) is set. ``busy=True`` + from the processing loop, and a scan in its processing phase, + do NOT block — the running pipeline picks up the new doc via + ``request_pending``. + Args: request (InsertTextRequest): The request body containing the text to be inserted. background_tasks: FastAPI BackgroundTasks for async processing @@ -2311,63 +3002,67 @@ async def insert_text( InsertResponse: A response object containing the status of the operation. Raises: - HTTPException: If an error occurs during text processing (500). + HTTPException: 400 invalid file_source, 409 same-name conflict + or scan/destructive job in flight, 500 other errors. """ + slot_reserved = False try: + # Reject text insertion while a scan is in progress AND reserve + # a pending-enqueue slot — see /upload for the rationale. 
+ slot_reserved = await _reserve_enqueue_slot(rag) + # Check if file_source already exists in doc_status storage - if ( - request.file_source - and request.file_source.strip() - and request.file_source != "unknown_source" - ): - existing_doc_data = await rag.doc_status.get_doc_by_file_path( - request.file_source + if not is_valid_file_source(request.file_source): + raise HTTPException( + status_code=400, + detail="A valid file_source is required for text insertion", ) - if existing_doc_data: - # Get document status and track_id from existing document - status = existing_doc_data.get("status", "unknown") - # Use `or ""` to handle both missing key and None value (e.g., legacy rows without track_id) - existing_track_id = existing_doc_data.get("track_id") or "" - return InsertResponse( - status="duplicated", - message=f"File source '{request.file_source}' already exists in document storage (Status: {status}).", - track_id=existing_track_id, - ) - # Check if content already exists by computing content hash (doc_id) - sanitized_text = sanitize_text_for_encoding(request.text) - content_doc_id = compute_mdhash_id(sanitized_text, prefix="doc-") - existing_doc = await rag.doc_status.get_by_id(content_doc_id) - if existing_doc: - # Content already exists, return duplicated with existing track_id - status = existing_doc.get("status", "unknown") - existing_track_id = existing_doc.get("track_id") or "" - return InsertResponse( - status="duplicated", - message=f"Identical content already exists in document storage (doc_id: {content_doc_id}, Status: {status}).", - track_id=existing_track_id, + normalized_file_source = normalize_file_path(request.file_source) + existing_doc_data = await get_existing_doc_by_file_path_candidates( + rag.doc_status, normalized_file_source + ) + if existing_doc_data: + status = get_doc_status_value(existing_doc_data) or "unknown" + raise HTTPException( + status_code=409, + detail=( + f"Document storage already contains '{normalized_file_source}' " + 
f"(Status: {status}). Delete the existing record before re-inserting." + ), ) # Generate track_id for text insertion track_id = generate_track_id("insert") - background_tasks.add_task( - pipeline_index_texts, - rag, - [request.text], - file_sources=[request.file_source], - track_id=track_id, - ) + async def _indexing_task(): + try: + await pipeline_index_texts( + rag, + [request.text], + file_sources=[normalized_file_source], + track_id=track_id, + ) + finally: + await _release_enqueue_slot(rag) + + background_tasks.add_task(_indexing_task) + slot_reserved = False return InsertResponse( status="success", message="Text successfully received. Processing will continue in background.", track_id=track_id, ) + except HTTPException: + raise except Exception as e: logger.error(f"Error /documents/text: {str(e)}") logger.error(traceback.format_exc()) raise HTTPException(status_code=500, detail=str(e)) + finally: + if slot_reserved: + await _release_enqueue_slot(rag) @router.post( "/texts", @@ -2383,6 +3078,15 @@ async def insert_texts( This endpoint allows you to insert multiple text entries into the RAG system in a single request. + **Concurrency Constraint:** + - Refuses with HTTP 409 only while + ``pipeline_status["scanning_exclusive"]`` (a scan is in its + classification phase) or ``pipeline_status["destructive_busy"]`` + (clear / per-doc delete is in flight) is set. ``busy=True`` + from the processing loop, and a scan in its processing phase, + do NOT block — the running pipeline picks up the new docs via + ``request_pending``. + Args: request (InsertTextsRequest): The request body containing the list of texts. background_tasks: FastAPI BackgroundTasks for async processing @@ -2391,66 +3095,87 @@ async def insert_texts( InsertResponse: A response object containing the status of the operation. Raises: - HTTPException: If an error occurs during text processing (500). 
+ HTTPException: 400 invalid file_sources, 409 same-name + conflict or scan/destructive job in flight, 500 other + errors. """ + slot_reserved = False try: + # Reject batch text insertion while a scan is in progress AND + # reserve a pending-enqueue slot — see /upload for the rationale. + slot_reserved = await _reserve_enqueue_slot(rag) + # Check if any file_sources already exist in doc_status storage - if request.file_sources: - for file_source in request.file_sources: - if ( - file_source - and file_source.strip() - and file_source != "unknown_source" - ): - existing_doc_data = await rag.doc_status.get_doc_by_file_path( - file_source - ) - if existing_doc_data: - # Get document status and track_id from existing document - status = existing_doc_data.get("status", "unknown") - # Use `or ""` to handle both missing key and None value (e.g., legacy rows without track_id) - existing_track_id = existing_doc_data.get("track_id") or "" - return InsertResponse( - status="duplicated", - message=f"File source '{file_source}' already exists in document storage (Status: {status}).", - track_id=existing_track_id, - ) + if not request.file_sources or len(request.file_sources) != len( + request.texts + ): + raise HTTPException( + status_code=400, + detail="A valid file_source is required for each text", + ) - # Check if any content already exists by computing content hash (doc_id) - for text in request.texts: - sanitized_text = sanitize_text_for_encoding(text) - content_doc_id = compute_mdhash_id(sanitized_text, prefix="doc-") - existing_doc = await rag.doc_status.get_by_id(content_doc_id) - if existing_doc: - # Content already exists, return duplicated with existing track_id - status = existing_doc.get("status", "unknown") - existing_track_id = existing_doc.get("track_id") or "" - return InsertResponse( - status="duplicated", - message=f"Identical content already exists in document storage (doc_id: {content_doc_id}, Status: {status}).", - track_id=existing_track_id, + 
normalized_file_sources = [ + normalize_file_path(file_source) for file_source in request.file_sources + ] + if any( + file_source == UNKNOWN_FILE_SOURCE + for file_source in normalized_file_sources + ): + raise HTTPException( + status_code=400, + detail="A valid file_source is required for each text", + ) + if len(set(normalized_file_sources)) != len(normalized_file_sources): + raise HTTPException( + status_code=400, + detail="file_sources must be unique by filename", + ) + + for file_source in normalized_file_sources: + existing_doc_data = await get_existing_doc_by_file_path_candidates( + rag.doc_status, file_source + ) + if existing_doc_data: + status = get_doc_status_value(existing_doc_data) or "unknown" + raise HTTPException( + status_code=409, + detail=( + f"Document storage already contains '{file_source}' " + f"(Status: {status}). Delete the existing record before re-inserting." + ), ) # Generate track_id for texts insertion track_id = generate_track_id("insert") - background_tasks.add_task( - pipeline_index_texts, - rag, - request.texts, - file_sources=request.file_sources, - track_id=track_id, - ) + async def _indexing_task(): + try: + await pipeline_index_texts( + rag, + request.texts, + file_sources=normalized_file_sources, + track_id=track_id, + ) + finally: + await _release_enqueue_slot(rag) + + background_tasks.add_task(_indexing_task) + slot_reserved = False return InsertResponse( status="success", message="Texts successfully received. 
Processing will continue in background.", track_id=track_id, ) + except HTTPException: + raise except Exception as e: logger.error(f"Error /documents/texts: {str(e)}") logger.error(traceback.format_exc()) raise HTTPException(status_code=500, detail=str(e)) + finally: + if slot_reserved: + await _release_enqueue_slot(rag) @router.delete( "", response_model=ClearDocumentsResponse, dependencies=[Depends(combined_auth)] @@ -2463,11 +3188,23 @@ async def clear_documents(): It uses the storage drop methods to properly clean up all data and removes all files from the input directory. + **Concurrency Constraint:** + - Atomically reserves the destructive slot (sets ``busy=True`` + and ``destructive_busy=True``) before dropping anything. + Refuses with ``status="busy"`` when ANY of these is set: + ``pipeline_status["busy"]`` (processing loop or another + destructive job in flight), ``pipeline_status["scanning"]`` + (a scan is anywhere in its lifecycle), or + ``pipeline_status["pending_enqueues"] > 0`` (an /upload, + /text or /texts has reserved a slot whose bg task has not + yet written to doc_status). + Returns: ClearDocumentsResponse: A response object containing the status and message. - status="success": All documents and files were successfully cleared. - status="partial_success": Document clear job exit with some errors. - - status="busy": Operation could not be completed because the pipeline is busy. + - status="busy": Operation could not be completed because another + writer (busy / scanning / pending enqueue) holds the pipeline. - status="fail": All storage drop operations failed, with message - message: Detailed information about the operation results, including counts of deleted files and any errors encountered. @@ -2489,17 +3226,20 @@ async def clear_documents(): "pipeline_status", workspace=rag.workspace ) - # Check and set status with lock + # Atomically reserve the destructive slot. 
Checks busy + + # scanning + pending_enqueues>0 in a single critical section + # before flipping busy=True and destructive_busy=True together. + # ``destructive_busy`` blocks reservation and the enqueue + # last-line guard: clear is about to drop every storage and + # remove every input file, so a concurrent upload accepted in + # this window would write to storages mid-drop and silently + # lose the document. + acquired, reason = await _acquire_destructive_busy(rag) + if not acquired: + return ClearDocumentsResponse(status="busy", message=reason) async with pipeline_status_lock: - if pipeline_status.get("busy", False): - return ClearDocumentsResponse( - status="busy", - message="Cannot clear documents while pipeline is busy", - ) - # Set busy to true pipeline_status.update( { - "busy": True, "job_name": "Clearing Documents", "job_start": datetime.now().isoformat(), "docs": 0, @@ -2638,9 +3378,11 @@ async def clear_documents(): pipeline_status["history_messages"].append(error_msg) raise HTTPException(status_code=500, detail=str(e)) finally: - # Reset busy status after completion + # Reset busy + destructive_busy after completion so the next + # reservation / scan sees an idle pipeline. async with pipeline_status_lock: pipeline_status["busy"] = False + pipeline_status["destructive_busy"] = False completion_msg = "Document clearing process completed" pipeline_status["latest_message"] = completion_msg if "history_messages" in pipeline_status: @@ -2771,6 +3513,8 @@ async def documents() -> DocsStatusesResponse: try: statuses = ( DocStatus.PENDING, + DocStatus.PARSING, + DocStatus.ANALYZING, DocStatus.PROCESSING, DocStatus.PREPROCESSED, DocStatus.PROCESSED, @@ -2877,6 +3621,15 @@ async def delete_document( This operation is irreversible and will interact with the pipeline status. 
+ **Concurrency Constraint:** + - Atomically reserves the destructive slot (sets ``busy=True`` + and ``destructive_busy=True``) **synchronously** before + returning ``deletion_started``, so a /scan or /upload that + arrives before the bg task runs cannot race the delete. + Refuses with ``status="busy"`` when ANY of these is set: + ``pipeline_status["busy"]``, ``pipeline_status["scanning"]``, + or ``pipeline_status["pending_enqueues"] > 0``. + Args: delete_request (DeleteDocRequest): The request containing the document IDs and deletion options. background_tasks: FastAPI BackgroundTasks for async processing @@ -2884,7 +3637,8 @@ async def delete_document( Returns: DeleteDocByIdResponse: The result of the deletion operation. - status="deletion_started": The document deletion has been initiated in the background. - - status="busy": The pipeline is busy with another operation. + - status="busy": Another writer (busy / scanning / pending enqueue) holds the + pipeline; nothing scheduled, retry after the running job finishes. Raises: HTTPException: @@ -2892,29 +3646,24 @@ async def delete_document( """ doc_ids = delete_request.doc_ids + slot_acquired = False try: - from lightrag.kg.shared_storage import ( - get_namespace_data, - get_namespace_lock, - ) - - pipeline_status = await get_namespace_data( - "pipeline_status", workspace=rag.workspace - ) - pipeline_status_lock = get_namespace_lock( - "pipeline_status", workspace=rag.workspace - ) - - # Check if pipeline is busy with proper lock - async with pipeline_status_lock: - if pipeline_status.get("busy", False): - return DeleteDocByIdResponse( - status="busy", - message="Cannot delete documents while pipeline is busy", - doc_id=", ".join(doc_ids), - ) + # Atomically reserve the destructive slot BEFORE returning + # ``deletion_started``. 
Without this, the bg task would set + # destructive_busy only when it later runs — leaving a + # window where a /scan or /upload can race the delete after + # the client has already received success. The check + # covers busy + scanning + pending_enqueues>0 in a single + # critical section. + acquired, reason = await _acquire_destructive_busy(rag) + if not acquired: + return DeleteDocByIdResponse( + status="busy", + message=reason or "Cannot delete documents while pipeline is busy", + doc_id=", ".join(doc_ids), + ) + slot_acquired = True - # Add deletion task to background tasks background_tasks.add_task( background_delete_documents, rag, @@ -2923,6 +3672,10 @@ async def delete_document( delete_request.delete_file, delete_request.delete_llm_cache, ) + # Ownership of the slot transferred to the bg task — it + # will release in its finally. The endpoint's finally + # below must NOT release it again. + slot_acquired = False return DeleteDocByIdResponse( status="deletion_started", @@ -2935,6 +3688,12 @@ async def delete_document( logger.error(error_msg) logger.error(traceback.format_exc()) raise HTTPException(status_code=500, detail=error_msg) + finally: + # If we reserved but never scheduled the bg task (e.g. an + # unexpected error between acquire and add_task), release + # so the next reservation / scan / enqueue can proceed. 
+ if slot_acquired: + await _release_destructive_busy(rag) @router.post( "/clear_cache", @@ -3149,11 +3908,12 @@ async def get_documents_paginated( status_filter_value = ( request.status_filter.value if request.status_filter is not None else None ) + workspace = getattr(rag, "workspace", None) performance_timing_log( "[documents/paginated][%s] Request start workspace=%s status_filter=%s page=%s page_size=%s sort_field=%s sort_direction=%s", trace_id, - rag.workspace, + workspace, status_filter_value, request.page, request.page_size, @@ -3197,6 +3957,7 @@ async def _timed_call(operation_name: str, operation): "get_docs_paginated", rag.doc_status.get_docs_paginated( status_filter=request.status_filter, + status_filters=request.status_filters, page=request.page, page_size=request.page_size, sort_field=request.sort_field, diff --git a/lightrag/api/routers/ollama_api.py b/lightrag/api/routers/ollama_api.py index 15c695cee7..460fa57b85 100644 --- a/lightrag/api/routers/ollama_api.py +++ b/lightrag/api/routers/ollama_api.py @@ -299,12 +299,17 @@ async def generate(raw_request: Request): start_time = time.time_ns() prompt_tokens = estimate_tokens(query) + role_kwargs = ( + dict(self.rag.role_llm_kwargs["query"]) + if self.rag.role_llm_kwargs["query"] is not None + else dict(self.rag.llm_model_kwargs) + ) if request.system: - self.rag.llm_model_kwargs["system_prompt"] = request.system + role_kwargs["system_prompt"] = request.system if request.stream: - response = await self.rag.llm_model_func( - query, stream=True, **self.rag.llm_model_kwargs + response = await (self.rag.role_llm_funcs["query"])( + query, stream=True, **role_kwargs ) async def stream_generator(): @@ -428,8 +433,8 @@ async def stream_generator(): ) else: first_chunk_time = time.time_ns() - response_text = await self.rag.llm_model_func( - query, stream=False, **self.rag.llm_model_kwargs + response_text = await (self.rag.role_llm_funcs["query"])( + query, stream=False, **role_kwargs ) last_chunk_time = 
time.time_ns() @@ -515,13 +520,18 @@ async def chat(raw_request: Request): if request.stream: # Determine if the request is prefix with "/bypass" if mode == SearchMode.bypass: + role_kwargs = ( + dict(self.rag.role_llm_kwargs["query"]) + if self.rag.role_llm_kwargs["query"] is not None + else dict(self.rag.llm_model_kwargs) + ) if request.system: - self.rag.llm_model_kwargs["system_prompt"] = request.system - response = await self.rag.llm_model_func( + role_kwargs["system_prompt"] = request.system + response = await (self.rag.role_llm_funcs["query"])( cleaned_query, stream=True, history_messages=conversation_history, - **self.rag.llm_model_kwargs, + **role_kwargs, ) else: response = await self.rag.aquery( @@ -677,14 +687,19 @@ async def stream_generator(): r"\n\nUSER:", cleaned_query, re.MULTILINE ) if match_result or mode == SearchMode.bypass: + role_kwargs = ( + dict(self.rag.role_llm_kwargs["query"]) + if self.rag.role_llm_kwargs["query"] is not None + else dict(self.rag.llm_model_kwargs) + ) if request.system: - self.rag.llm_model_kwargs["system_prompt"] = request.system + role_kwargs["system_prompt"] = request.system - response_text = await self.rag.llm_model_func( + response_text = await (self.rag.role_llm_funcs["query"])( cleaned_query, stream=False, history_messages=conversation_history, - **self.rag.llm_model_kwargs, + **role_kwargs, ) else: response_text = await self.rag.aquery( diff --git a/lightrag/api/run_with_gunicorn.py b/lightrag/api/run_with_gunicorn.py index e3bc0a8c45..d98925ac83 100644 --- a/lightrag/api/run_with_gunicorn.py +++ b/lightrag/api/run_with_gunicorn.py @@ -7,15 +7,10 @@ import sys import platform import pipmaster as pm -from lightrag.api.utils_api import display_splash_screen, check_env_file -from lightrag.api.config import global_args -from lightrag.utils import get_env_value -from lightrag.kg.shared_storage import initialize_share_data -from lightrag.constants import ( - DEFAULT_WOKERS, - DEFAULT_TIMEOUT, -) +# Capture this before 
importing LightRAG modules, because those imports load .env. +# On macOS, libobjc needs this value in the inherited process environment. +_PROCESS_START_OBJC_FORK_SAFETY = os.environ.get("OBJC_DISABLE_INITIALIZE_FORK_SAFETY") def check_and_install_dependencies(): @@ -35,9 +30,16 @@ def check_and_install_dependencies(): def main(): - # Explicitly initialize configuration for Gunicorn mode - from lightrag.api.config import initialize_config + from lightrag.api.utils_api import display_splash_screen, check_env_file + from lightrag.api.config import global_args, initialize_config + from lightrag.utils import get_env_value + from lightrag.kg.shared_storage import initialize_share_data + from lightrag.constants import ( + DEFAULT_WOKERS, + DEFAULT_TIMEOUT, + ) + # Explicitly initialize configuration for Gunicorn mode initialize_config() # Set Gunicorn mode flag for lifespan cleanup detection @@ -47,41 +49,13 @@ def main(): if not check_env_file(): sys.exit(1) - # Check DOCLING compatibility with Gunicorn multi-worker mode on macOS - if ( - platform.system() == "Darwin" - and global_args.document_loading_engine == "DOCLING" - and global_args.workers > 1 - ): - print("\n" + "=" * 80) - print("❌ ERROR: Incompatible configuration detected!") - print("=" * 80) - print( - "\nDOCLING engine with Gunicorn multi-worker mode is not supported on macOS" - ) - print("\nReason:") - print(" PyTorch (required by DOCLING) has known compatibility issues with") - print(" fork-based multiprocessing on macOS, which can cause crashes or") - print(" unexpected behavior when using Gunicorn with multiple workers.") - print("\nCurrent configuration:") - print(" - Operating System: macOS (Darwin)") - print(f" - Document Engine: {global_args.document_loading_engine}") - print(f" - Workers: {global_args.workers}") - print("\nPossible solutions:") - print(" 1. Use single worker mode:") - print(" --workers 1") - print("\n 2. 
Change document loading engine in .env:") - print(" DOCUMENT_LOADING_ENGINE=DEFAULT") - print("\n 3. Deploy on Linux where multi-worker mode is fully supported") - print("=" * 80 + "\n") - sys.exit(1) - # Check macOS fork safety environment variable for multi-worker mode if ( platform.system() == "Darwin" and global_args.workers > 1 - and os.environ.get("OBJC_DISABLE_INITIALIZE_FORK_SAFETY") != "YES" + and _PROCESS_START_OBJC_FORK_SAFETY != "YES" ): + current_objc_fork_safety = os.environ.get("OBJC_DISABLE_INITIALIZE_FORK_SAFETY") print("\n" + "=" * 80) print("❌ ERROR: Missing required environment variable on macOS!") print("=" * 80) @@ -95,8 +69,18 @@ def main(): print(" - Operating System: macOS (Darwin)") print(f" - Workers: {global_args.workers}") print( - f" - Environment Variable: {os.environ.get('OBJC_DISABLE_INITIALIZE_FORK_SAFETY', 'NOT SET')}" + " - Process Environment at Startup: " + f"{_PROCESS_START_OBJC_FORK_SAFETY or 'NOT SET'}" + ) + print( + " - Environment After .env Load: " + f"{current_objc_fork_safety or 'NOT SET'}" ) + if current_objc_fork_safety == "YES": + print("\nNote:") + print(" OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES was loaded from .env,") + print(" but that is too late for the macOS Objective-C runtime.") + print(" Export it before starting lightrag-gunicorn.") print("\nHow to fix:") print(" Option 1 - Set environment variable before starting (recommended):") print(" export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES") diff --git a/lightrag/api/utils_api.py b/lightrag/api/utils_api.py index 2e06bd30ec..d382f66cc1 100644 --- a/lightrag/api/utils_api.py +++ b/lightrag/api/utils_api.py @@ -336,24 +336,6 @@ def display_splash_screen(args: argparse.Namespace) -> None: ASCIIColors.yellow(f"{args.working_dir}") ASCIIColors.white(" └─ Input Directory: ", end="") ASCIIColors.yellow(f"{args.input_dir}") - - # LLM Configuration - ASCIIColors.magenta("\n🤖 LLM Configuration:") - ASCIIColors.white(" ├─ Binding: ", end="") - 
ASCIIColors.yellow(f"{args.llm_binding}") - ASCIIColors.white(" ├─ Host: ", end="") - ASCIIColors.yellow(f"{args.llm_binding_host}") - ASCIIColors.white(" ├─ Model: ", end="") - ASCIIColors.yellow(f"{args.llm_model}") - ASCIIColors.white(" ├─ Max Async for LLM: ", end="") - ASCIIColors.yellow(f"{args.max_async}") - ASCIIColors.white(" ├─ Summary Context Size: ", end="") - ASCIIColors.yellow(f"{args.summary_context_size}") - ASCIIColors.white(" ├─ LLM Cache Enabled: ", end="") - ASCIIColors.yellow(f"{args.enable_llm_cache}") - ASCIIColors.white(" └─ LLM Cache for Extraction Enabled: ", end="") - ASCIIColors.yellow(f"{args.enable_llm_cache_for_extract}") - # Embedding Configuration ASCIIColors.magenta("\n📊 Embedding Configuration:") ASCIIColors.white(" ├─ Binding: ", end="") @@ -371,8 +353,6 @@ def display_splash_screen(args: argparse.Namespace) -> None: ASCIIColors.magenta("\n⚙️ RAG Configuration:") ASCIIColors.white(" ├─ Summary Language: ", end="") ASCIIColors.yellow(f"{args.summary_language}") - ASCIIColors.white(" ├─ Entity Types: ", end="") - ASCIIColors.yellow(f"{args.entity_types}") ASCIIColors.white(" ├─ Max Parallel Insert: ", end="") ASCIIColors.yellow(f"{args.max_parallel_insert}") ASCIIColors.white(" ├─ Chunk Size: ", end="") diff --git a/lightrag/base.py b/lightrag/base.py index 28c564459c..f1d8dd9aab 100644 --- a/lightrag/base.py +++ b/lightrag/base.py @@ -145,10 +145,18 @@ class QueryParam: history_turns: int = int(os.getenv("HISTORY_TURNS", str(DEFAULT_HISTORY_TURNS))) """Number of complete conversation turns (user-assistant pairs) to consider in the response context.""" + # TODO(v1.5.0): remove model_func together with the override branches in + # operate.py (_warn_deprecated_query_model_func call sites) and the + # `model_func_override` path in utils.get_llm_cache_identity. model_func: Callable[..., object] | None = None - """Optional override for the LLM model function to use for this specific query. 
- If provided, this will be used instead of the global model function. - This allows using different models for different query modes. + """Deprecated optional override for the LLM model function. + Use role_llm_configs at initialization or LightRAG.aupdate_llm_role_config() / + LightRAG.update_llm_role_config() for runtime role LLM changes instead. + Kept for backward compatibility with direct Python callers. + + Note: when set, the LLM cache key collapses to a single "override" identity, + so swapping the override across calls will reuse stale cached responses. + Use aupdate_llm_role_config() for cache-correct model swaps. """ user_prompt: str | None = None @@ -750,11 +758,16 @@ async def search_labels(self, query: str, limit: int = 50) -> list[str]: class DocStatus(str, Enum): - """Document processing status""" + """Document processing status. + Pipeline order: PENDING -> PARSING -> ANALYZING (optional) -> PROCESSING -> PROCESSED | FAILED. + PREPROCESSED is deprecated, kept for backward compatibility. + """ PENDING = "pending" - PROCESSING = "processing" - PREPROCESSED = "preprocessed" + PARSING = "parsing" # Phase 1: content extraction (parse_native/mineru/docling) + ANALYZING = "analyzing" # Phase 2: multimodal analysis (VLM) + PROCESSING = "processing" # Phase 3: entity/relation extraction + PREPROCESSED = "preprocessed" # Deprecated: use ANALYZING in new pipeline PROCESSED = "processed" FAILED = "failed" @@ -768,7 +781,13 @@ class DocProcessingStatus: content_length: int """Total length of document""" file_path: str - """File path of the document""" + """Canonical basename of the document. + + Always a hint-stripped basename (e.g. ``abc.docx``) or the literal + ``"unknown_source"`` sentinel; never carries directory components or + parser ``[hint]`` segments. UI display, filename-based dedup, and + citation paths all share this value. 
+ """ status: DocStatus """Current processing status""" created_at: str @@ -786,6 +805,12 @@ class DocProcessingStatus: metadata: dict[str, Any] = field(default_factory=dict) """Additional metadata""" multimodal_processed: bool | None = field(default=None, repr=False) """Internal field: indicates if multimodal processing is complete. Not shown in repr() but accessible for debugging.""" + content_hash: str | None = None + """MD5 hash of the underlying document content (raw text or source file). + + Used together with file_path basename for duplicate detection. Empty for + pending_parse records whose content has not been extracted yet. + """ def __post_init__(self): @@ -810,6 +835,22 @@ def __post_init__(self): class DocStatusStorage(BaseKVStorage, ABC): """Base class for document status storage""" + @staticmethod + def resolve_status_filter_values( + status_filter: DocStatus | None = None, + status_filters: list[DocStatus] | None = None, + ) -> set[str] | None: + """Normalize single- and multi-status filters into comparable values. + + `status_filters` takes precedence over `status_filter`. Empty multi-status + filters are treated as no filter for backward-compatible request handling. 
+ """ + if status_filters: + return {status.value for status in status_filters} + if status_filter is not None: + return {status_filter.value} + return None + @abstractmethod async def get_status_counts(self) -> dict[str, int]: """Get counts of documents in each status""" @@ -836,6 +877,7 @@ async def get_docs_by_track_id( async def get_docs_paginated( self, status_filter: DocStatus | None = None, + status_filters: list[DocStatus] | None = None, page: int = 1, page_size: int = 50, sort_field: str = "updated_at", @@ -844,7 +886,8 @@ async def get_docs_paginated( """Get documents with pagination support Args: - status_filter: Filter by document status, None for all statuses + status_filter: Legacy single-status filter, ignored when status_filters is set + status_filters: Filter by multiple document statuses, None for all statuses page: Page number (1-based) page_size: Number of documents per page (10-200) sort_field: Field to sort by ('created_at', 'updated_at', 'id') @@ -874,6 +917,38 @@ async def get_doc_by_file_path(self, file_path: str) -> dict[str, Any] | None: Returns the same format as get_by_ids method """ + @abstractmethod + async def get_doc_by_file_basename( + self, basename: str + ) -> tuple[str, dict[str, Any]] | None: + """Get document by canonical file basename. + + Used for filename-based deduplication. Callers must pass the canonical + basename; storage implementations only compare against the canonical + ``file_path`` persisted by the business layer. + + Args: + basename: The filename basename to search for (e.g. "report.pdf"). + + Returns: + (doc_id, doc_data) when a matching record exists, otherwise None. + """ + + @abstractmethod + async def get_doc_by_content_hash( + self, content_hash: str + ) -> tuple[str, dict[str, Any]] | None: + """Get document by content_hash field. + + Used for content-hash deduplication of full documents. + + Args: + content_hash: The content hash value to search for. 
+ + Returns: + (doc_id, doc_data) when a matching record exists, otherwise None. + """ + class StoragesStatus(str, Enum): """Storages status""" diff --git a/lightrag/chunk_schema.py b/lightrag/chunk_schema.py new file mode 100644 index 0000000000..4eed5f87df --- /dev/null +++ b/lightrag/chunk_schema.py @@ -0,0 +1,256 @@ +"""Chunk schema helpers shared across the chunking + extraction pipeline. + +Three responsibilities live here so chunker implementations and the pipeline +both consume identical normalization rules: + +- :func:`normalize_chunk_heading` collapses the legacy flat + ``heading``/``parent_headings``/``level`` triple and the new nested form + into the canonical ``{"level", "heading", "parent_headings"}`` dict. +- :func:`normalize_chunk_sidecar` validates the new ``sidecar`` payload and + ensures ``refs`` is always present as a list (single-source items may omit + it before normalization; we materialize a single-element list for the + storage layer). +- :func:`strip_internal_multimodal_markup_for_extraction` rewrites + ```` / ```` / ```` markup so the entity-extraction + LLM sees a clean text body. The original ``chunk["content"]`` is never + mutated; the cleaned string is only used to build the extraction prompt. + +The clean function is intentionally conservative: it only strips +parser-emitted identifier attributes that have no business reaching the LLM +(``id``, ``refid``, ``path``, ``src``). Visible captions and equation bodies +are preserved so the extracted entities can still ground against them. +""" + +from __future__ import annotations + +import re +from typing import Any + + +_SIDECAR_TYPES = frozenset({"block", "drawing", "table", "equation"}) + + +def normalize_chunk_heading(dp: dict[str, Any]) -> dict[str, Any] | None: + """Return the canonical nested heading dict or ``None`` when absent. + + Accepts: + + - ``dp["heading"]`` already a dict ``{"level", "heading", "parent_headings"}``. 
+ - Legacy flat fields ``heading: str`` + ``parent_headings: list[str]`` + + ``level: int``. + + Empty / missing inputs collapse to ``None`` so callers can simply omit + the field when writing the chunk record. + """ + nested = dp.get("heading") + if isinstance(nested, dict): + heading_text = str(nested.get("heading") or "").strip() + parents_raw = nested.get("parent_headings") or [] + level_raw = nested.get("level", 0) + else: + heading_text = str(nested or "").strip() + parents_raw = dp.get("parent_headings") or [] + level_raw = dp.get("level", 0) + + parent_headings: list[str] = [] + if isinstance(parents_raw, list): + for entry in parents_raw: + text = str(entry or "").strip() + if text: + parent_headings.append(text) + + try: + level = int(level_raw or 0) + except (TypeError, ValueError): + level = 0 + + if not heading_text and not parent_headings and level == 0: + return None + + return { + "level": level, + "heading": heading_text, + "parent_headings": parent_headings, + } + + +def normalize_chunk_sidecar(dp: dict[str, Any]) -> dict[str, Any] | None: + """Return the canonical sidecar dict or ``None`` when absent / invalid. + + Output shape:: + + {"type": , + "id": , + "refs": [{"type": ..., "id": ...}, ...]} + + ``refs`` is always materialized as a list with at least the primary id. + Single-source chunks therefore land in storage with ``refs=[{type,id}]`` + so downstream consumers don't need to special-case the field's presence. 
+ """ + sidecar = dp.get("sidecar") + if not isinstance(sidecar, dict): + return None + sidecar_type = str(sidecar.get("type") or "").strip() + sidecar_id = str(sidecar.get("id") or "").strip() + if sidecar_type not in _SIDECAR_TYPES or not sidecar_id: + return None + + refs_raw = sidecar.get("refs") + refs: list[dict[str, str]] = [] + if isinstance(refs_raw, list): + for entry in refs_raw: + if not isinstance(entry, dict): + continue + ref_type = str(entry.get("type") or "").strip() + ref_id = str(entry.get("id") or "").strip() + if ref_type in _SIDECAR_TYPES and ref_id: + refs.append({"type": ref_type, "id": ref_id}) + if not refs: + refs = [{"type": sidecar_type, "id": sidecar_id}] + + return {"type": sidecar_type, "id": sidecar_id, "refs": refs} + + +# `visible text` → `visible text`. +_CITE_RE = re.compile( + r"]*>(.*?)", + flags=re.IGNORECASE | re.DOTALL, +) + +# Inner attribute stripper used when the caller wants to *preserve* the +# `` wrapper but drop the parser-internal `refid`. +# Matches ` refid="…"` (leading whitespace + quoted value) so the +# surrounding attribute layout (e.g. `type="table"`) stays intact. +_CITE_REFID_ATTR_RE = re.compile( + r'\s+refid\s*=\s*"[^"]*"', + flags=re.IGNORECASE, +) + +# Self-closing `` placeholder. We keep `caption` (visible) and +# drop `id`, `path`, `src`, `format`, etc. Tags without any caption are +# removed entirely so they don't pollute extraction input. +_DRAWING_RE = re.compile( + r"]*)/>", + flags=re.IGNORECASE, +) + +# Container `latex`. Strip +# identifier attributes; preserve the body and the `format` attribute so +# extraction still sees the equation is a structured element. +_EQUATION_RE = re.compile( + r"]*)>(.*?)", + flags=re.IGNORECASE | re.DOTALL, +) + +# Container `
rows
`. +# Native parser emits the internal ``tb--NNNN`` identifier here, which +# would otherwise leak into the entity-extraction prompt and become a noisy +# entity. Strip ``id``; keep ``format`` / ``caption`` (and the body verbatim) +# so the extractor still recognizes the element as a structured table. +_TABLE_RE = re.compile( + r"]*)>(.*?)", + flags=re.IGNORECASE | re.DOTALL, +) + +# Match attribute pairs like ``caption="text with \"escapes\""``. We treat +# only the safe identifier-style attributes; complex quoting is rare in +# parser output. +_ATTR_RE = re.compile( + r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"', +) + + +def _attrs_to_dict(attr_string: str) -> dict[str, str]: + return { + match.group(1).lower(): match.group(2) + for match in _ATTR_RE.finditer(attr_string) + } + + +def _format_attrs(pairs: list[tuple[str, str]]) -> str: + return "".join(f' {k}="{v}"' for k, v in pairs if v) + + +def _replace_drawing(match: re.Match[str]) -> str: + attrs = _attrs_to_dict(match.group(1)) + caption = attrs.get("caption", "") + if not caption.strip(): + return "" + return f"" + + +def _replace_equation(match: re.Match[str]) -> str: + attrs = _attrs_to_dict(match.group(1)) + body = match.group(2) + keep: list[tuple[str, str]] = [] + fmt = attrs.get("format", "") + if fmt: + keep.append(("format", fmt)) + caption = attrs.get("caption", "") + if caption.strip(): + keep.append(("caption", caption)) + return f"{body}
" + + +def _replace_table(match: re.Match[str]) -> str: + attrs = _attrs_to_dict(match.group(1)) + body = match.group(2) + keep: list[tuple[str, str]] = [] + fmt = attrs.get("format", "") + if fmt: + keep.append(("format", fmt)) + caption = attrs.get("caption", "") + if caption.strip(): + keep.append(("caption", caption)) + return f"{body}" + + +def strip_internal_multimodal_markup_for_extraction( + content: str, *, keep_cite_tag: bool = False +) -> str: + """Strip parser-internal identifiers from a chunk content string. + + Only the entity-extraction prompt should receive the cleaned form; + callers must NOT mutate the stored chunk ``content`` so query-time + citations still resolve back to the original parser output. + + Transformations always applied: + + - ```` + → ```` + (drops the entire tag when no caption is present) + - ``rows
`` + → ``rows
`` + - ```` + → ```` + + Cite-tag handling depends on ``keep_cite_tag``: + + - ``keep_cite_tag=False`` (default — entity-extraction path): + ``Table 1`` → ``Table 1``. The + cite wrapper is dropped so the extractor does not surface it as + a noisy structural entity. + - ``keep_cite_tag=True`` (multimodal-analysis surrounding path): + ``Table 1`` → + ``Table 1``. Only the internal + ``refid`` is removed; the wrapper survives so the VLM/LLM can + tell visible reference labels (e.g. "Table 1") apart from inline + prose. + """ + if not content: + return content + if keep_cite_tag: + cleaned = _CITE_REFID_ATTR_RE.sub("", content) + else: + cleaned = _CITE_RE.sub(lambda m: m.group(1), content) + cleaned = _DRAWING_RE.sub(_replace_drawing, cleaned) + cleaned = _TABLE_RE.sub(_replace_table, cleaned) + cleaned = _EQUATION_RE.sub(_replace_equation, cleaned) + return cleaned + + +__all__ = [ + "normalize_chunk_heading", + "normalize_chunk_sidecar", + "strip_internal_multimodal_markup_for_extraction", +] diff --git a/lightrag/chunker/__init__.py b/lightrag/chunker/__init__.py new file mode 100644 index 0000000000..3d7c757706 --- /dev/null +++ b/lightrag/chunker/__init__.py @@ -0,0 +1,66 @@ +"""LightRAG chunking strategies. + +Two contracts coexist intentionally: + + - **Legacy contract** — :func:`chunking_by_token_size` keeps its + historical 6-positional-arg signature + + ``(tokenizer, content, split_by_character, + split_by_character_only, chunk_overlap_token_size, + chunk_token_size)`` + + so externally-supplied :attr:`lightrag.LightRAG.chunking_func` + implementations continue to work unchanged. The legacy contract is + only invoked when ``process_options`` does NOT specify a chunking + selector (i.e. ``chunking_explicit`` is False) — typically direct + :meth:`LightRAG.ainsert` calls with raw text. 
+ + - **File-chunker contract** — for documents whose ``process_options`` + explicitly selects a chunking strategy, the file-based dispatcher in + ``_PipelineMixin.process_single_document`` reads + ``doc_process_opts.chunking`` and routes to a chunker following the + standardized signature + + ``(tokenizer, content, chunk_token_size, *, + )`` + + Currently shipped file chunkers: + + - :func:`chunking_by_fixed_token` — the ``"F"`` strategy. Same + algorithm as :func:`chunking_by_token_size`, surfaced under the + new contract. + - :func:`chunking_by_recursive_character` — the ``"R"`` strategy. + Wraps LangChain ``RecursiveCharacterTextSplitter``; recursively + splits on a separator cascade with token-aware sizing. + - :func:`chunking_by_semantic_vector` — the ``"V"`` strategy. + Wraps LangChain ``SemanticChunker``; sentence-level embedding + similarity finds breakpoints. Async; needs an + :class:`~lightrag.utils.EmbeddingFunc`. + - :func:`chunking_by_paragraph_semantic` — the ``"P"`` strategy. + Heading-aware semantic chunker; consumes the docx-native + ``.blocks.jsonl`` sidecar. Falls back to R when the sidecar is + missing or unreadable. + +See ``docs/ParagraphSemanticChunking-zh.md`` for the algorithm behind +the ``"P"`` strategy and ``docs/FileProcessingConfiguration-zh.md`` for +how ``process_options`` and the new ``chunk_options`` snapshot drive +chunker selection per document. 
+""" + +from lightrag.chunker.paragraph_semantic import chunking_by_paragraph_semantic +from lightrag.chunker.recursive_character import ( + chunking_by_recursive_character, +) +from lightrag.chunker.semantic_vector import chunking_by_semantic_vector +from lightrag.chunker.token_size import ( + chunking_by_fixed_token, + chunking_by_token_size, +) + +__all__ = [ + "chunking_by_fixed_token", + "chunking_by_paragraph_semantic", + "chunking_by_recursive_character", + "chunking_by_semantic_vector", + "chunking_by_token_size", +] diff --git a/lightrag/chunker/paragraph_semantic.py b/lightrag/chunker/paragraph_semantic.py new file mode 100644 index 0000000000..d0a43ffe8c --- /dev/null +++ b/lightrag/chunker/paragraph_semantic.py @@ -0,0 +1,1503 @@ +"""Paragraph Semantic Chunking for LightRAG. + +Reads a LightRAG ``.blocks.jsonl`` artifact (produced by the docx native +parser at ``fixlevel=0`` — heading-driven splits only, tables kept whole) +and produces a chunk list compatible with +:func:`lightrag.chunker.chunking_by_token_size`. + +The full algorithm and rationale are documented in +``docs/ParagraphSemanticChunking-zh.md``. This module re-implements the +post-Stage-A pipeline (B/C/D) on top of blocks.jsonl input, parameterised +on ``chunk_token_size`` so chunk size targets follow the user's RAG +configuration rather than the audit-mode constants in +``lightrag/native_parser/docx/parse_document.py``. + +Pipeline: + - Stage A — heading-driven initial split: already done at parse time and + persisted as one row per block in ``blocks.jsonl``. + - Stage B — oversized-table re-split + first/middle/last gluing: invoked + here when an embedded ```` (or + ``format="html"``) exceeds ``TABLE_MAX_TOKENS``. Splitting prefers + structural row boundaries (JSON list items, HTML ```` rows) so + each fragment remains a legal ``
`` tag; only when no row + boundary is available, or a single row alone exceeds the cap, does + the splitter fall back to ``chunking_by_recursive_character`` on + that specific fragment. When two oversized tables are separated by + text inside the same heading block, the bridge text may be duplicated + into both table boundary chunks so each table keeps nearby context. + - Stage C — anchor-driven long-block re-split: short non-table + paragraphs (≤ 100 chars) are promoted as split points and the block + is rebalanced toward ``IDEAL_BLOCK_TOKENS``. When no anchor exists, + table-aware fallback applies the same row-boundary-first strategy + to any oversized table paragraph and only character-splits the + residual non-table content. Character fallback for ordinary text uses + the configured paragraph-semantic overlap. + - Stage D — bottom-up, level-aware small-block merging: undersized + blocks get absorbed by same-level neighbours (Phase A), shallower + levels (Phase B), and a final tail-absorption pass eliminates the + last few zero-content remainders. +""" + +from __future__ import annotations + +import json +import math +import re +from pathlib import Path +from typing import Any, Callable + +from lightrag.table_markup import ( + TABLE_TAG_RE as _TABLE_TAG_RE, + detect_table_format as _detect_table_format, + serialize_html_rows as _serialize_rows_with_wrappers, + split_html_rows as _split_html_rows, +) +from lightrag.utils import Tokenizer, logger + + +# --------------------------------------------------------------------------- +# Threshold ratios — derived from the audit-mode constants in +# lightrag/native_parser/docx/parse_document.py so the trade-off curves +# (table vs. block size, ideal vs. max, etc.) carry over verbatim. The +# absolute values scale with the user-configured ``chunk_token_size``. +# --------------------------------------------------------------------------- + +# IDEAL/MAX = 6000/8000 = 0.75 in audit mode. 
+_IDEAL_RATIO = 0.75 + +# TABLE_MAX/MAX = 5000/8000 = 0.625 in audit mode. +_TABLE_MAX_RATIO = 0.625 + +# TABLE_IDEAL/MAX = 3000/8000 = 0.375 in audit mode. +_TABLE_IDEAL_RATIO = 0.375 + +# TABLE_MIN_LAST/TABLE_MAX = (TABLE_MAX-TABLE_IDEAL)*0.8/TABLE_MAX +# = (5000-3000)*0.8/5000 = 0.32 in audit mode. +_TABLE_MIN_LAST_RATIO = 0.32 + +# SMALL_TAIL_THRESHOLD/MAX = (MAX-IDEAL)/2/MAX = 1000/8000 = 0.125. +_SMALL_TAIL_RATIO = 0.125 + +# Anchor candidate length is a UI/readability constraint — keep absolute. +_MAX_ANCHOR_CANDIDATE_LENGTH = 100 # characters + +# Table tag regex (``_TABLE_TAG_RE``) plus the ``_detect_table_format``, +# ``_split_html_rows`` and ``_serialize_rows_with_wrappers`` helpers are +# imported from :mod:`lightrag.table_markup` so the surrounding-context +# extractor can reuse the same primitives. + +_LEGACY_TABLE_CHUNK_SUFFIX_RE = re.compile(r"\s*\[表格片段\d+\]\s*$") +_PART_SUFFIX_RE = re.compile(r"\s*\[part\s+\d+\]\s*$", re.IGNORECASE) + + +# --------------------------------------------------------------------------- +# Shared helpers. 
+# --------------------------------------------------------------------------- + + +def _count_tokens(tokenizer: Tokenizer, text: str) -> int: + if not text: + return 0 + return len(tokenizer.encode(text)) + + +def _bounded_overlap(target_max: int, chunk_overlap_token_size: int) -> int: + """Return an overlap value safe for recursive-character splitting.""" + overlap = max(int(chunk_overlap_token_size), 0) + if target_max <= 1: + return 0 + return min(overlap, target_max - 1) + + +def _strip_generated_heading_suffixes(heading: str) -> str: + """Remove generated split suffixes before assigning a fresh part number.""" + cleaned = (heading or "").rstrip() + while True: + next_cleaned = _PART_SUFFIX_RE.sub("", cleaned).rstrip() + next_cleaned = _LEGACY_TABLE_CHUNK_SUFFIX_RE.sub("", next_cleaned).rstrip() + if next_cleaned == cleaned: + return cleaned + cleaned = next_cleaned + + +def _append_part_suffix(heading: str, part_number: int) -> str: + base = _strip_generated_heading_suffixes(heading) + suffix = f"[part {part_number}]" + return f"{base} {suffix}" if base else suffix + + +def _apply_part_suffixes(blocks: list[dict[str, Any]]) -> list[dict[str, Any]]: + """Tag split fragments from one original block as ``[part n]``.""" + if len(blocks) <= 1: + return blocks + for idx, block in enumerate(blocks, start=1): + block["heading"] = _append_part_suffix(block.get("heading", ""), idx) + return blocks + + +def _is_table_paragraph(text: str) -> bool: + stripped = text.strip() + return stripped.startswith("
") + + +def _block_to_paragraphs(content: str) -> list[dict[str, Any]]: + """Recover the per-paragraph view of a rewritten block. + + The docx parser joins paragraphs with ``\\n`` inside + ``_build_unsplit_block``; tables/equations/drawings are inserted as + single-line tags with no internal newlines, so ``split("\\n")`` faithfully + recovers paragraph boundaries. + """ + paragraphs: list[dict[str, Any]] = [] + for line in content.split("\n"): + if not line.strip(): + continue + paragraphs.append({"text": line, "is_table": _is_table_paragraph(line)}) + return paragraphs + + +def _load_blocks_from_jsonl(blocks_path: str) -> list[dict[str, Any]]: + """Read ``type == "content"`` rows from a blocks.jsonl file in order.""" + rows: list[dict[str, Any]] = [] + with Path(blocks_path).open("r", encoding="utf-8") as fh: + for raw in fh: + raw = raw.strip() + if not raw: + continue + try: + obj = json.loads(raw) + except json.JSONDecodeError: + continue + if isinstance(obj, dict) and obj.get("type") == "content": + rows.append(obj) + return rows + + +def _split_html_rows_by_tokens( + rows: list[tuple[str, str]], + tokenizer: Tokenizer, + *, + target_max: int, + target_ideal: int, + last_min: int, +) -> list[list[tuple[str, str]]]: + """HTML-tuple analog of :func:`_split_rows_by_tokens`. + + Same balanced-split + tail-merge algorithm; tokens are measured on + the row payloads (``tr_str``) only — wrapper overhead is amortised + later by the per-chunk serialiser plus the re-split-on-overflow + safety net in :func:`_split_table_text`. 
+ """ + total = _count_tokens(tokenizer, "".join(tr for _, tr in rows)) + if total <= target_max or len(rows) <= 1: + return [rows] + + target_chunks = max( + math.ceil(total / target_ideal), + math.ceil(total / target_max), + ) + target_chunks = min(target_chunks, len(rows)) + target_rows = len(rows) / target_chunks + + chunks: list[list[tuple[str, str]]] = [] + start = 0 + for i in range(target_chunks): + if i == target_chunks - 1: + end = len(rows) + else: + end = max(start + 1, min(int((i + 1) * target_rows), len(rows))) + remaining = len(rows) - end + if remaining > 0 and remaining < target_rows * 0.3: + end = len(rows) + chunks.append(rows[start:end]) + start = end + if start >= len(rows): + break + + if len(chunks) >= 2: + last_text = "".join(tr for _, tr in chunks[-1]) + if _count_tokens(tokenizer, last_text) < last_min: + merged = chunks[-2] + chunks[-1] + merged_tokens = _count_tokens(tokenizer, "".join(tr for _, tr in merged)) + if merged_tokens <= target_max: + chunks[-2] = merged + chunks.pop() + return chunks + + +def _dedup_preserving_order(values: list[str]) -> list[str]: + seen: set[str] = set() + out: list[str] = [] + for v in values: + if v and v not in seen: + seen.add(v) + out.append(v) + return out + + +def _new_block( + *, + heading: str, + parent_headings: list[str], + level: int, + paragraphs: list[dict[str, Any]], + table_chunk_role: str, + tokenizer: Tokenizer, + blockids: list[str] | None = None, +) -> dict[str, Any]: + content = "\n".join(p["text"] for p in paragraphs) + return { + "heading": heading, + "parent_headings": list(parent_headings), + "level": level, + "paragraphs": list(paragraphs), + "content": content, + "tokens": _count_tokens(tokenizer, content), + "table_chunk_role": table_chunk_role, + # Ordered list of source blockids (deduped). Empty when the input + # blocks.jsonl row did not carry a blockid (raw/legacy input). 
+ "blockids": _dedup_preserving_order(list(blockids or [])), + } + + +# --------------------------------------------------------------------------- +# Stage B — oversized-table re-split with first/middle/last gluing. +# --------------------------------------------------------------------------- + + +def _split_rows_by_tokens( + rows: list[Any], + tokenizer: Tokenizer, + *, + target_max: int, + target_ideal: int, + last_min: int, +) -> list[list[Any]]: + """Split ``rows`` into balanced row-bounded chunks (Stage B core).""" + total = _count_tokens(tokenizer, json.dumps(rows, ensure_ascii=False)) + if total <= target_max or len(rows) <= 1: + return [rows] + + target_chunks = max( + math.ceil(total / target_ideal), + math.ceil(total / target_max), + ) + # Cap at len(rows) so target_rows >= 1; otherwise int((i+1)*target_rows) + # can collapse to ``start`` and emit empty
<table>[]</table>
slices. + target_chunks = min(target_chunks, len(rows)) + target_rows = len(rows) / target_chunks + + chunks: list[list[Any]] = [] + start = 0 + for i in range(target_chunks): + if i == target_chunks - 1: + end = len(rows) + else: + # max(start + 1, ...) guarantees forward progress (>= 1 row per + # slice) even at fractional target_rows boundaries. + end = max(start + 1, min(int((i + 1) * target_rows), len(rows))) + remaining = len(rows) - end + if remaining > 0 and remaining < target_rows * 0.3: + end = len(rows) + chunks.append(rows[start:end]) + start = end + if start >= len(rows): + break + + # Merge a tiny last chunk back into the previous chunk when feasible. + if len(chunks) >= 2: + last_json = json.dumps(chunks[-1], ensure_ascii=False) + if _count_tokens(tokenizer, last_json) < last_min: + merged = chunks[-2] + chunks[-1] + merged_tokens = _count_tokens( + tokenizer, json.dumps(merged, ensure_ascii=False) + ) + if merged_tokens <= target_max: + chunks[-2] = merged + chunks.pop() + return chunks + + +def _character_split_text( + text: str, + tokenizer: Tokenizer, + *, + target_max: int, + chunk_overlap_token_size: int = 0, +) -> list[str]: + """Character-level fallback wrapped to return plain-text pieces. + + Lazy import dodges the ``recursive_character`` ↔ ``paragraph_semantic`` + circular dependency (same pattern as the sidecar-missing fallback in + :func:`chunking_by_paragraph_semantic`). Callers that split ordinary + prose pass the paragraph-semantic overlap; table character fallbacks + leave the default at zero so structured table row chunks do not gain + implicit row-level overlap. 
+ """ + from lightrag.chunker.recursive_character import ( + chunking_by_recursive_character, + ) + + pieces = chunking_by_recursive_character( + tokenizer, + text, + target_max, + chunk_overlap_token_size=_bounded_overlap(target_max, chunk_overlap_token_size), + ) + return [p["content"] for p in pieces if p.get("content")] + + +def _split_table_text( + table_text: str, + *, + tokenizer: Tokenizer, + target_max: int, + target_ideal: int, + last_min: int, +) -> list[str]: + """Split a single oversized ``<table>...</table>`` text into ≤ target_max pieces. + + Strategy (mirrors the user-supplied contract in + ``docs/ParagraphSemanticChunking-zh.md`` — row boundary first, + character fallback last): + + 1. Match the outer ``<table{attrs}>{body}</table>``. If the regex + fails, character-split the original text and return. + 2. Detect the body format via :func:`_detect_table_format` (with + body sniffing when ``attrs`` is silent). + 3. Row-boundary split: JSON via :func:`_split_rows_by_tokens`, + HTML via :func:`_split_html_rows_by_tokens`. Re-wrap every + row-chunk as ``<table{attrs}>{rows}</table>``. + 4. For any wrapped chunk still exceeding ``target_max`` + (single-row chunks where the row alone exceeds the cap, or + row-split returned a single chunk because rows were ≤ 1), + character-fallback that specific chunk's text. + 5. Unknown / unparseable format → character-fallback the entire + original text. + + Output strings are either: + - a re-wrapped ``<table{attrs}>{rows}</table>`` (legal markup, + callers may keep ``is_table=True`` for these), or + - a character-fallback fragment (no ``<table>`` wrapper, callers + should mark ``is_table=False``). + """ + match = _TABLE_TAG_RE.match((table_text or "").strip()) + if not match: + return _character_split_text(table_text, tokenizer, target_max=target_max) + attrs = match.group("attrs") + body = match.group("body") + fmt = _detect_table_format(attrs, body) + + # Budget the
<table{attrs}></table> wrapper out of the per-chunk + # caps before calling the row splitter — the splitter only measures + # the body (json.dumps(rows) / "".join(rows)), so without this the + # wrapped chunk can exceed target_max purely from the wrapper, which + # would force a needless character-fallback below. + wrapper_overhead = _count_tokens(tokenizer, f"<table{attrs}></table>") + body_max = max(target_max - wrapper_overhead, 1) + body_ideal = max(min(target_ideal, target_max) - wrapper_overhead, 1) + body_last_min = max(last_min - wrapper_overhead, 1) + row_chunks: list[list[Any]] | None = None + serialize: Callable[[list[Any]], str] | None = None + if fmt == "json": + try: + rows = json.loads(body) + except json.JSONDecodeError: + rows = None + if isinstance(rows, list) and len(rows) > 1: + row_chunks = _split_rows_by_tokens( + rows, + tokenizer, + target_max=body_max, + target_ideal=body_ideal, + last_min=body_last_min, + ) + + def serialize(chunk_rows: list[Any]) -> str: + return ( + f"<table{attrs}>" + f"{json.dumps(chunk_rows, ensure_ascii=False)}" + f"</table>" + ) + + elif fmt == "html": + rows_html = _split_html_rows(body) + if rows_html and len(rows_html) > 1: + row_chunks = _split_html_rows_by_tokens( + rows_html, + tokenizer, + target_max=body_max, + target_ideal=body_ideal, + last_min=body_last_min, + ) + + def serialize(chunk_rows: list[tuple[str, str]]) -> str: + return ( + f"<table{attrs}>" + f"{_serialize_rows_with_wrappers(chunk_rows)}" + f"</table>" + ) + + if row_chunks is None or serialize is None: + # No row boundary available (single-row table, parse failure, + # unknown format) → character-fallback the whole text. + return _character_split_text(table_text, tokenizer, target_max=target_max) + + # Re-split any chunk whose wrapped form still exceeds target_max + # before resorting to character-level shredding. The row splitter's + # balanced-cut heuristic can produce uneven chunks when row sizes + # vary, and only a chunk that has collapsed to a single row (where + # row-boundary splitting can no longer reduce it) belongs in the + # character fallback. + pieces: list[str] = [] + pending: list[list[Any]] = list(row_chunks) + while pending: + chunk_rows = pending.pop(0) + wrapped = serialize(chunk_rows) + if _count_tokens(tokenizer, wrapped) <= target_max: + pieces.append(wrapped) + continue + if len(chunk_rows) <= 1: + pieces.extend( + _character_split_text(wrapped, tokenizer, target_max=target_max) + ) + continue + # Force a finer cut: cap the next-pass body budget at half the + # current wrapped size so target_chunks >= 2 inside the splitter. + # This guarantees forward progress (one row at minimum per + # sub-chunk, see the splitter's len(rows) cap). + halved = max(_count_tokens(tokenizer, wrapped) // 2, 1) + sub_max = max(min(body_max, halved), 1) + sub_ideal = max(sub_max // 2, 1) + sub_last_min = max(min(body_last_min, sub_max // 2), 1) + if fmt == "json": + sub_chunks = _split_rows_by_tokens( + chunk_rows, + tokenizer, + target_max=sub_max, + target_ideal=sub_ideal, + last_min=sub_last_min, + ) + else: + sub_chunks = _split_html_rows_by_tokens( + chunk_rows, + tokenizer, + target_max=sub_max, + target_ideal=sub_ideal, + last_min=sub_last_min, + ) + if len(sub_chunks) <= 1: + # The splitter could not reduce further (e.g. one row already + # dominates the body). Avoid an infinite loop and let the + # character fallback handle this stubborn chunk.
+ pieces.extend( + _character_split_text(wrapped, tokenizer, target_max=target_max) + ) + continue + # Process the finer cuts before any remaining peer chunks so the + # output keeps source order. + pending[0:0] = sub_chunks + return pieces + + +def _expand_block_with_table_splits( + block: dict[str, Any], + *, + tokenizer: Tokenizer, + table_max: int, + table_ideal: int, + table_min_last: int, + target_max: int | None = None, + chunk_overlap_token_size: int = 0, +) -> list[dict[str, Any]]: + """Apply Stage B to one heading-driven block. + + For every embedded table whose tokens exceed ``table_max``: + - the first row-slice glues with paragraphs already accumulated in + the current expansion (i.e. content *before* the table); + - middle slices are emitted as standalone blocks tagged + ``table_chunk_role == "middle"`` so Stage D refuses to merge them; + - the last slice begins a fresh accumulation that will glue with + paragraphs *after* the table. + + When a ``last`` table slice is followed by short bridge text and then + another oversized table's ``first`` slice, the bridge text is split + into table boundary context: a prefix may be duplicated into the + previous table block and a suffix into the next table block. If the + bridge is longer than both context budgets, the remaining middle text + is emitted as a standalone text block. Tables within the size limit + pass through untouched. + """ + if target_max is None: + target_max = table_max + target_max = max(int(target_max), 1) + context_overlap = _bounded_overlap(target_max, chunk_overlap_token_size) + sep_tokens = _count_tokens(tokenizer, "\n") + paragraphs = block["paragraphs"] + has_oversized_table = any( + p["is_table"] and _count_tokens(tokenizer, p["text"]) > table_max + for p in paragraphs + ) + if not has_oversized_table: + return [block] + + out: list[dict[str, Any]] = [] + cur_paras: list[dict[str, Any]] = [] + # Role to assign to ``cur_paras`` when it next flushes. 
Tracks the + # boundary semantics across split-table iterations so the merged + # block carries "first" / "last" instead of defaulting to "none" — + # otherwise Stage D's directional protections (a "first" block must + # not absorb backward, a "last" block must not absorb forward) silently + # disappear after the slice glues with surrounding paragraphs. + cur_role = "none" + + def flush_cur() -> None: + nonlocal cur_role + if not cur_paras: + cur_role = "none" + return + out.append( + _new_block( + heading=block["heading"], + parent_headings=block["parent_headings"], + level=block["level"], + paragraphs=cur_paras, + table_chunk_role=cur_role, + tokenizer=tokenizer, + blockids=block.get("blockids"), + ) + ) + cur_paras.clear() + cur_role = "none" + + def _append_bridge_block( + paragraphs: list[dict[str, Any]], + table_chunk_role: str, + ) -> None: + if not paragraphs: + return + out.append( + _new_block( + heading=block["heading"], + parent_headings=block["parent_headings"], + level=block["level"], + paragraphs=paragraphs, + table_chunk_role=table_chunk_role, + tokenizer=tokenizer, + blockids=block.get("blockids"), + ) + ) + + def _text_paragraph(text: str) -> dict[str, Any] | None: + if not text or not text.strip(): + return None + return {"text": text, "is_table": False} + + def _context_capacity(base_paras: list[dict[str, Any]]) -> int: + if context_overlap <= 0: + return 0 + base_text = "\n".join(p["text"] for p in base_paras) + base_tokens = _count_tokens(tokenizer, base_text) + if base_tokens >= target_max: + return 0 + # The context paragraph is joined to the table fragment with "\n". + return max(min(context_overlap, target_max - base_tokens - sep_tokens), 0) + + def _flush_last_bridge_before_next_first( + next_first_para: dict[str, Any], + ) -> list[dict[str, Any]]: + """Flush ``last + bridge`` before a following table ``first``. + + Returns context paragraphs to prepend to the following first-table + block. 
Only non-table bridge paragraphs are duplicated/sliced; if + the bridge contains tables we keep the prior non-overlapping flush. + """ + nonlocal cur_role + if not cur_paras: + cur_role = "none" + return [] + + seed_paras = [cur_paras[0]] + bridge_paras = cur_paras[1:] + if ( + context_overlap <= 0 + or not bridge_paras + or any(p.get("is_table", False) for p in bridge_paras) + ): + flush_cur() + return [] + + bridge_text = "\n".join(p["text"] for p in bridge_paras) + bridge_tokens = tokenizer.encode(bridge_text) + if not bridge_tokens: + flush_cur() + return [] + + prev_budget = _context_capacity(seed_paras) + next_budget = _context_capacity([next_first_para]) + bridge_len = len(bridge_tokens) + + if bridge_len <= prev_budget and bridge_len <= next_budget: + prefix_text = bridge_text + suffix_text = bridge_text + middle_text = "" + else: + prefix_len = min(prev_budget, bridge_len) + suffix_len = min(next_budget, bridge_len) + middle_start = prefix_len + middle_end = max(middle_start, bridge_len - suffix_len) + + prefix_text = ( + tokenizer.decode(bridge_tokens[:prefix_len]) if prefix_len else "" + ) + suffix_text = ( + tokenizer.decode(bridge_tokens[bridge_len - suffix_len :]) + if suffix_len + else "" + ) + middle_text = ( + tokenizer.decode(bridge_tokens[middle_start:middle_end]) + if middle_end > middle_start + else "" + ) + + prev_paras = list(seed_paras) + prefix_para = _text_paragraph(prefix_text) + if prefix_para is not None: + prev_paras.append(prefix_para) + _append_bridge_block(prev_paras, "last") + + middle_para = _text_paragraph(middle_text) + if middle_para is not None: + _append_bridge_block([middle_para], "none") + + cur_paras.clear() + cur_role = "none" + + suffix_para = _text_paragraph(suffix_text) + return [suffix_para] if suffix_para is not None else [] + + for para in paragraphs: + text = para["text"] + if not (para["is_table"] and _count_tokens(tokenizer, text) > table_max): + cur_paras.append(para) + continue + + # Row-boundary first, 
character fallback last. ``_split_table_text`` + # returns one or more strings: row-wrapped ``<table>...</table>`` + # fragments where row-splitting succeeded, plain text where it + # had to character-split (single-row tables, parse failures, + # rows whose own size exceeded ``table_max``). + pieces = _split_table_text( + text, + tokenizer=tokenizer, + target_max=table_max, + target_ideal=table_ideal, + last_min=table_min_last, + ) + if len(pieces) <= 1: + # No reduction was possible (e.g. very small unparseable table + # that already fits within ``table_max`` after a no-op character + # fallback). Keep the original paragraph to preserve content. + cur_paras.append(para) + continue + + for chunk_idx, piece_text in enumerate(pieces): + stripped = piece_text.strip() + is_still_table = stripped.startswith("<table") and stripped.endswith( + "</table>" + ) + chunk_para = {"text": piece_text, "is_table": is_still_table} + is_first = chunk_idx == 0 + is_last = chunk_idx == len(pieces) - 1 + + if is_first: + # First slice glues with everything currently accumulated + # (= the paragraphs that appeared before the table inside + # this heading block). If the buffer still carries the + # "last" tail of a previous oversized table, flush it first + # so its protective role survives instead of being + # overwritten by "first". + if cur_role == "last": + cur_paras.extend(_flush_last_bridge_before_next_first(chunk_para)) + cur_paras.append(chunk_para) + cur_role = "first" + elif is_last: + # Flush the accumulated "first-glued" block, then begin a + # new accumulation seeded with this last slice — it will + # absorb the paragraphs that appear after the table. + flush_cur() + cur_paras.append(chunk_para) + cur_role = "last" + else: + # Middle slice: flush the first-glued block, then emit + # this middle slice as a standalone block that Stage D + # MUST keep intact (table_chunk_role == "middle").
+ flush_cur() + out.append( + _new_block( + heading=block["heading"], + parent_headings=block["parent_headings"], + level=block["level"], + paragraphs=[chunk_para], + table_chunk_role="middle", + tokenizer=tokenizer, + blockids=block.get("blockids"), + ) + ) + + flush_cur() + return out + + +# --------------------------------------------------------------------------- +# Stage C — anchor-driven long-block re-split. +# --------------------------------------------------------------------------- + + +def _split_long_block( + paragraphs: list[dict[str, Any]], + heading: str, + parent_headings: list[str], + level: int, + table_chunk_role: str, + *, + tokenizer: Tokenizer, + target_max: int, + target_ideal: int, + chunk_overlap_token_size: int = 100, + blockids: list[str] | None = None, +) -> list[dict[str, Any]]: + """Split an oversized block into balanced sub-blocks at short-paragraph anchors. + + Mirrors :func:`lightrag.native_parser.docx.parse_document.split_long_block`, + parameterised on ``target_max`` / ``target_ideal``. Tables (``is_table``) + are excluded from the anchor candidate pool, so Stage B's row-level + splits stay intact. When no anchor exists (including the single- + paragraph oversized case), the no-anchor branch below honors the cap + via row-boundary splitting (for tables) or character-level splitting + (for prose). The audit-mode parser would ``sys.exit(1)`` on no-anchor + failure, but the RAG pipeline must never drop a document silently. + Character-level splitting of ordinary prose uses + ``chunk_overlap_token_size`` so long text under one JSONL content row + keeps semantic continuity across adjacent chunks. 
+ """ + chunk_overlap_token_size = _bounded_overlap(target_max, chunk_overlap_token_size) + content = "\n".join(p["text"] for p in paragraphs) + total = _count_tokens(tokenizer, content) + if total <= target_max: + return [ + _new_block( + heading=heading, + parent_headings=parent_headings, + level=level, + paragraphs=paragraphs, + table_chunk_role=table_chunk_role, + tokenizer=tokenizer, + blockids=blockids, + ) + ] + + target_blocks = max( + math.ceil(total / target_ideal), + math.ceil(total / target_max), + ) + target_size = total / target_blocks + + # Build anchor candidates with cumulative token offsets. Index 0 is + # excluded: an anchor at the first paragraph yields an empty leading + # slice and a tail equal to the input, so it cannot divide the block — + # selecting it would re-enter this function with the same arguments + # and recurse until RecursionError. + candidates: list[dict[str, Any]] = [] + cumulative = 0 + for idx, para in enumerate(paragraphs): + text = para["text"] + if ( + idx > 0 + and not para.get("is_table", False) + and 0 < len(text) <= _MAX_ANCHOR_CANDIDATE_LENGTH + ): + candidates.append({"index": idx, "text": text, "position": cumulative}) + cumulative += _count_tokens(tokenizer, text) + + if not candidates: + # All paragraphs in the block are longer than the anchor-length + # cap (typical for dense academic prose: every paragraph is a + # full body section). Anchor-driven splitting cannot proceed, + # but we must NOT emit a single oversized chunk: the + # embedding-time hard fallback uses ``embedding_token_limit`` + # (often 8K), not ``chunk_token_size``, so the chunk would + # silently exceed the user-configured size. Prefer + # row-boundary splitting on any oversized table paragraph + # before falling back to character-level splitting on residual + # content — character splitting destroys ``
<table>`` markup + # mid-tag and produces fragments LLMs can't interpret as + # tables. + logger.warning( + "[paragraph_semantic_chunking] block under heading %r exceeds " + "target_max=%d tokens (~%d tokens) but has no eligible anchor " + "paragraph (≤ %d chars); preferring table row-boundary split, " + "falling back to recursive-character splitting on residual " + "content.", + heading, + target_max, + total, + _MAX_ANCHOR_CANDIDATE_LENGTH, + ) + + # Step 1: expand each oversized table paragraph into row-bounded + # pieces; non-table or in-budget paragraphs pass through verbatim. + # ``last_min`` mirrors Stage B's ratio (no separate constant — the + # tail-merge threshold is purely a row-balancing heuristic). + last_min = max(int(target_max * _TABLE_MIN_LAST_RATIO), 1) + pieces: list[str] = [] + for para in paragraphs: + text = para["text"] + if ( + para.get("is_table", False) + and _count_tokens(tokenizer, text) > target_max + ): + pieces.extend( + _split_table_text( + text, + tokenizer=tokenizer, + target_max=target_max, + target_ideal=target_ideal, + last_min=last_min, + ) + ) + else: + pieces.append(text) + + # Step 2: greedy-pack pieces into chunks ≤ target_max. A piece + # that is itself oversized (e.g. a single dense prose paragraph + # without short anchors) is character-split via + # :func:`chunking_by_recursive_character` after flushing the + # current buffer. The "\n" separator inserted by ``"\n".join(buf)`` + # also costs tokens, so it must be debited from the budget — + # otherwise two pieces that sum to exactly target_max would + # overflow once joined.
+ sep_tokens = _count_tokens(tokenizer, "\n") + chunks_text: list[str] = [] + buf: list[str] = [] + buf_tokens = 0 + for piece in pieces: + piece_tokens = _count_tokens(tokenizer, piece) + if piece_tokens > target_max: + if buf: + chunks_text.append("\n".join(buf)) + buf, buf_tokens = [], 0 + chunks_text.extend( + _character_split_text( + piece, + tokenizer, + target_max=target_max, + chunk_overlap_token_size=chunk_overlap_token_size, + ) + ) + continue + addition = piece_tokens + (sep_tokens if buf else 0) + if buf and buf_tokens + addition > target_max: + chunks_text.append("\n".join(buf)) + buf, buf_tokens = [], 0 + addition = piece_tokens + buf.append(piece) + buf_tokens += addition + if buf: + chunks_text.append("\n".join(buf)) + + if not chunks_text: + # Defensive: every piece was empty after stripping. Emit the + # original oversized block so the document is never silently + # dropped (matches the prior behaviour of the empty-R branch). + return [ + _new_block( + heading=heading, + parent_headings=parent_headings, + level=level, + paragraphs=paragraphs, + table_chunk_role=table_chunk_role, + tokenizer=tokenizer, + blockids=blockids, + ) + ] + + sub_blocks: list[dict[str, Any]] = [] + for i, chunk_text in enumerate(chunks_text): + stripped = chunk_text.strip() + is_still_table = stripped.startswith("<table") and stripped.endswith( + "</table>" + ) + sub_blocks.append( + _new_block( + heading=heading, + parent_headings=parent_headings, + level=level, + paragraphs=[{"text": chunk_text, "is_table": is_still_table}], + # Only the first sub-block keeps the inbound + # table_chunk_role; the rest are text-only by + # construction (mirrors the anchor-split path below). + table_chunk_role=table_chunk_role if i == 0 else "none", + tokenizer=tokenizer, + blockids=blockids, + ) + ) + return sub_blocks + + # Pick the anchors closest to evenly-spaced ideal positions. + pool = list(candidates) + selected: list[dict[str, Any]] = [] + for i in range(1, target_blocks): + if not pool: + break + ideal_position = i * target_size + best = min(pool, key=lambda c: abs(c["position"] - ideal_position)) + selected.append(best) + pool.remove(best) + selected.sort(key=lambda c: c["index"]) + + sub_blocks: list[dict[str, Any]] = [] + prev_idx = 0 + cur_heading = heading + cur_parents = list(parent_headings) + # Only the first sub-block keeps the inbound table_chunk_role; the + # post-anchor sub-blocks are text-only by construction. + cur_role = table_chunk_role + + for anchor in selected: + split_idx = anchor["index"] + slice_paras = paragraphs[prev_idx:split_idx] + if slice_paras: + sub_blocks.append( + _new_block( + heading=cur_heading, + parent_headings=cur_parents, + level=level, + paragraphs=slice_paras, + table_chunk_role=cur_role, + tokenizer=tokenizer, + blockids=blockids, + ) + ) + # Anchor becomes the first paragraph (and heading) of the next sub-block.
+ cur_parents = ( + list(parent_headings) + [heading] + if heading and cur_heading == heading + else list(cur_parents) + ) + cur_heading = anchor["text"] + cur_role = "none" + prev_idx = split_idx + + tail = paragraphs[prev_idx:] + if tail: + sub_blocks.append( + _new_block( + heading=cur_heading, + parent_headings=cur_parents, + level=level, + paragraphs=tail, + table_chunk_role=cur_role, + tokenizer=tokenizer, + blockids=blockids, + ) + ) + + # Recursive guard: any sub-block still over target_max is re-split, + # including single-paragraph subs — the no-anchor branch above honors + # the cap via row-boundary or character-level splitting and is the + # only path that can shrink them. + out: list[dict[str, Any]] = [] + for sub in sub_blocks: + if sub["tokens"] > target_max: + out.extend( + _split_long_block( + sub["paragraphs"], + sub["heading"], + sub["parent_headings"], + sub["level"], + sub["table_chunk_role"], + tokenizer=tokenizer, + target_max=target_max, + target_ideal=target_ideal, + chunk_overlap_token_size=chunk_overlap_token_size, + blockids=sub.get("blockids") or blockids, + ) + ) + else: + out.append(sub) + return out + + +# --------------------------------------------------------------------------- +# Stage D — bottom-up, level-aware small-block merging. 
+# --------------------------------------------------------------------------- + + +def _can_merge_forward(role: str, *, phase: str) -> bool: + if phase == "A": + return role in {"none", "first"} + return role in {"none", "first", "last"} + + +def _can_merge_backward(role: str) -> bool: + return role in {"none", "last"} + + +def _merged_pair( + left: dict[str, Any], + right: dict[str, Any], + *, + keep: str, + tokenizer: Tokenizer, +) -> dict[str, Any]: + base = left if keep == "left" else right + paragraphs = list(left["paragraphs"]) + list(right["paragraphs"]) + content = left["content"] + "\n\n" + right["content"] + merged_blockids = _dedup_preserving_order( + list(left.get("blockids") or []) + list(right.get("blockids") or []) + ) + return { + "heading": base["heading"], + "parent_headings": list(base["parent_headings"]), + "level": base["level"], + "paragraphs": paragraphs, + "content": content, + "tokens": _count_tokens(tokenizer, content), + "table_chunk_role": "none", + "blockids": merged_blockids, + } + + +def _merge_small_blocks( + blocks: list[dict[str, Any]], + *, + tokenizer: Tokenizer, + target_max: int, + target_ideal: int, + small_tail_threshold: int, +) -> list[dict[str, Any]]: + """Bottom-up, level-aware small-block merging. + + Re-implementation of + :func:`lightrag.native_parser.docx.parse_document.merge_small_blocks`, + parameterised on the chunk-size targets and operating on internal + block dicts (no ``uuid`` / ``table_header`` propagation needed: the + chunking output schema does not carry them). + """ + if len(blocks) <= 1: + return blocks + + result = list(blocks) + levels = sorted({b.get("level", 1) for b in result}, reverse=True) + + for current_level in levels: + # Phase A — same-level merging. 
+ changed = True + while changed: + changed = False + new_result: list[dict[str, Any]] = [] + i = 0 + while i < len(result): + cur = result[i] + cur_tokens = cur["tokens"] + cur_level = cur.get("level", 1) + cur_role = cur.get("table_chunk_role", "none") + below_ideal = 0 < cur_tokens < target_ideal + is_cur_lv = cur_level == current_level + + if below_ideal and is_cur_lv: + merged = False + + if _can_merge_forward(cur_role, phase="A") and i + 1 < len(result): + nxt = result[i + 1] + if nxt.get("level", 1) == current_level and _can_merge_backward( + nxt.get("table_chunk_role", "none") + ): + combined = _merged_pair( + cur, nxt, keep="left", tokenizer=tokenizer + ) + if combined["tokens"] <= target_max: + new_result.append(combined) + i += 2 + changed = True + merged = True + + if not merged and _can_merge_backward(cur_role) and new_result: + prev = new_result[-1] + if ( + prev.get("level", 1) == current_level + and _can_merge_forward( + prev.get("table_chunk_role", "none"), phase="A" + ) + and prev["tokens"] < target_ideal + ): + combined = _merged_pair( + prev, cur, keep="left", tokenizer=tokenizer + ) + if combined["tokens"] <= target_max: + new_result[-1] = combined + i += 1 + changed = True + merged = True + + if not merged: + new_result.append(cur) + i += 1 + else: + # Tail absorption: an at-or-above-IDEAL block can absorb + # a short run of subsequent same-level blocks if their + # combined size stays under SMALL_TAIL_THRESHOLD and + # fits within target_max — eliminates the document's + # trailing sliver of zero-content remainders. 
+ if is_cur_lv and cur_tokens >= target_ideal: + tail_total = 0 + end_idx = i + 1 + for j in range(i + 1, len(result)): + nxt = result[j] + if nxt.get("level", 1) != current_level: + break + if nxt.get("table_chunk_role", "none") == "middle": + break + tail_total += nxt["tokens"] + end_idx = j + 1 + if ( + tail_total > 0 + and tail_total < small_tail_threshold + and cur_tokens + tail_total <= target_max + ): + absorbed_paragraphs = list(cur["paragraphs"]) + absorbed_content = cur["content"] + for j in range(i + 1, end_idx): + nxt = result[j] + absorbed_paragraphs.extend(nxt["paragraphs"]) + absorbed_content += "\n\n" + nxt["content"] + # The cheap predicate above sums per-block + # tokens, but absorption joins blocks with + # ``"\n\n"`` — those separator tokens are + # real and can push the merged block over + # target_max. Re-measure the joined content + # before committing to absorb. + absorbed_tokens = _count_tokens(tokenizer, absorbed_content) + if absorbed_tokens <= target_max: + new_result.append( + { + "heading": cur["heading"], + "parent_headings": list(cur["parent_headings"]), + "level": cur["level"], + "paragraphs": absorbed_paragraphs, + "content": absorbed_content, + "tokens": absorbed_tokens, + "table_chunk_role": "none", + } + ) + i = end_idx + changed = True + continue + new_result.append(cur) + i += 1 + result = new_result + + # Phase B — cross-level absorption (shallower absorbs deeper). 
+ changed = True + while changed: + changed = False + new_result = [] + i = 0 + while i < len(result): + cur = result[i] + cur_tokens = cur["tokens"] + cur_level = cur.get("level", 1) + cur_role = cur.get("table_chunk_role", "none") + below_ideal = 0 < cur_tokens < target_ideal + is_cur_lv = cur_level == current_level + + if below_ideal and is_cur_lv: + merged = False + + if _can_merge_forward(cur_role, phase="B") and i + 1 < len(result): + nxt = result[i + 1] + if nxt.get("level", 1) > current_level and _can_merge_backward( + nxt.get("table_chunk_role", "none") + ): + combined = _merged_pair( + cur, nxt, keep="left", tokenizer=tokenizer + ) + if combined["tokens"] <= target_max: + new_result.append(combined) + i += 2 + changed = True + merged = True + + if not merged and _can_merge_backward(cur_role) and new_result: + prev = new_result[-1] + if ( + prev.get("level", 1) < current_level + and _can_merge_forward( + prev.get("table_chunk_role", "none"), phase="B" + ) + and prev["tokens"] < target_ideal + ): + combined = _merged_pair( + prev, cur, keep="left", tokenizer=tokenizer + ) + if combined["tokens"] <= target_max: + new_result[-1] = combined + i += 1 + changed = True + merged = True + + if not merged: + new_result.append(cur) + i += 1 + else: + new_result.append(cur) + i += 1 + result = new_result + + return result + + +# --------------------------------------------------------------------------- +# Public entrypoint. +# --------------------------------------------------------------------------- + + +def chunking_by_paragraph_semantic( + tokenizer: Tokenizer, + content: str, + chunk_token_size: int = 2000, + *, + blocks_path: str | None = None, + chunk_overlap_token_size: int = 100, +) -> list[dict[str, Any]]: + """Paragraph Semantic Chunking — the ``chunking="P"`` strategy. 
+ + Reads structured blocks emitted by the docx native parser at + ``fixlevel=0`` (Stage A, persisted to ``blocks.jsonl``) and applies + Stage B (table re-split + glue), Stage C (anchor-driven long-block + re-split) and Stage D (bottom-up, level-aware merging). Output rows + match the schema produced by + :func:`lightrag.chunker.chunking_by_token_size` + (``tokens``/``content``/``chunk_order_index``), enriched with + ``heading``, ``parent_headings`` and ``level`` so KG extraction can + leverage the document hierarchy. + + Signature follows the LightRAG chunker contract — the standard + prefix ``(tokenizer, content, chunk_token_size)`` is shared with + every other chunker, while strategy-specific knobs are keyword-only: + + - ``blocks_path`` (this strategy's required input — the + ``.blocks.jsonl`` sidecar produced at parse time) + + Knobs that ``chunking_by_token_size`` exposes for delimiter-based + splitting (``split_by_character``, ``split_by_character_only``) are + deliberately absent here because paragraph-semantic chunks are + heading-aligned. ``chunk_overlap_token_size`` is supported for two + paragraph-semantic cases where overlap preserves meaning inside one + JSONL content row: recursive-character fallback for long prose, and + bridge text duplicated around adjacent oversized table boundary chunks. + When one original ``blocks.jsonl`` content row is split into multiple + fragments, every fragment heading receives a row-local ``[part n]`` + suffix; unsplit rows keep their original heading. + + Args: + tokenizer: LightRAG tokenizer (used for all token counting; matches + the unit used by ``chunk_token_size``). + content: Merged plain-text content of the document. Used as the + fallback corpus when ``blocks_path`` is missing or unreadable + so the pipeline never silently drops a document. + chunk_token_size: Hard upper bound for each chunk in tokens. 
The + ideal target is set at 75 % of this value (mirroring the + audit-mode 6000/8000 ratio); see threshold ratio constants + above for the full mapping. + blocks_path: Path to the document's ``.blocks.jsonl`` sidecar + (typically ``parsed_data["blocks_path"]``). When ``None``, + unreadable, or empty, this function falls back to + :func:`chunking_by_recursive_character` on ``content`` + (per ``docs/FileProcessingConfiguration-zh.md`` line 120 / 146). + That fallback hard-requires ``langchain-text-splitters``; + an :class:`ImportError` is surfaced rather than silently + degrading further. + chunk_overlap_token_size: Token overlap used only when P must + fall back to recursive-character splitting of ordinary text, + and as the per-side budget for duplicating text between two + adjacent oversized table chunks. Structural table row splits + remain row-bounded and non-overlapping. + + Returns: + Ordered list of chunk dicts, each shaped: + ``{"tokens", "content", "chunk_order_index", "heading": + {"level", "heading", "parent_headings"}}``, plus an optional + ``"sidecar"`` entry that lists the source blockids when the + originating rows carry them. + + Notes: + blocks.jsonl field analysis vs. algorithm requirements: + + - ``content`` (``\\n``-joined per ``_build_unsplit_block``) → + split back into per-paragraph text via ``split("\\n")``; + lossless because table/equation/drawing tags are emitted as + single-line replacements. + - ``heading`` / ``parent_headings`` / ``level`` → consumed + directly by Stage C/D for hierarchy-aware merging. If one + original content row produces multiple fragments, the current + ``heading`` receives a ``[part n]`` suffix after Stage B/C and + before Stage D. ``parent_headings`` remain unchanged. + - ``
{rows_json}
`` tags → + JSON body parsed in Stage B for row-level re-split when the + tag exceeds the per-table token cap. When two split tables + have short text between them, that text may be repeated in + both table boundary chunks; longer bridge text leaves any + middle remainder as a separate text block. + - ```` / ```` tags → treated as atomic + non-table paragraphs — neither splittable nor anchorable. + - Per-paragraph paraIds are NOT preserved in blocks.jsonl + (only block-level ``positions[].range`` is). Acceptable + because the chunking output schema does not require them. + - ``table_slice`` is always ``"none"`` in blocks.jsonl + (parse-time ``fixlevel=0`` keeps tables whole), so any + ``table_chunk_role`` consumed by Stage D is recomputed + on-the-fly inside Stage B. + """ + target_max = max(int(chunk_token_size), 1) + target_ideal = max(int(target_max * _IDEAL_RATIO), 1) + table_max = max(int(target_max * _TABLE_MAX_RATIO), 1) + table_ideal = max(int(target_max * _TABLE_IDEAL_RATIO), 1) + table_min_last = max(int(table_max * _TABLE_MIN_LAST_RATIO), 1) + small_tail_threshold = max(int(target_max * _SMALL_TAIL_RATIO), 1) + overlap = _bounded_overlap(target_max, chunk_overlap_token_size) + + rows: list[dict[str, Any]] = [] + fallback_reason: str | None = None + if not blocks_path: + fallback_reason = "blocks_path is empty" + else: + try: + rows = _load_blocks_from_jsonl(blocks_path) + except OSError as exc: + fallback_reason = f"cannot read blocks.jsonl at {blocks_path}: {exc}" + else: + if not rows: + fallback_reason = ( + f"blocks.jsonl at {blocks_path} contains no content rows" + ) + + if fallback_reason is not None: + # Defer to recursive-character chunking when the sidecar is + # absent — ensures non-docx documents and edge-case parses still + # produce chunks instead of silently dropping content. 
Document + # contract (FileProcessingConfiguration-zh.md L120 / L146) is + # explicit that P falls back to R; that contract requires + # langchain-text-splitters to be installed, so an ImportError + # here is intentional rather than a silent degrade to F. Lazy + # import dodges the recursive_character ↔ paragraph_semantic + # circular dependency. + logger.warning( + "[paragraph_semantic_chunking] %s; falling back to " + "recursive-character chunking with chunk_token_size=%d.", + fallback_reason, + target_max, + ) + from lightrag.chunker.recursive_character import ( + chunking_by_recursive_character, + ) + + return chunking_by_recursive_character( + tokenizer, + content, + target_max, + chunk_overlap_token_size=overlap, + ) + + # Build initial blocks (Stage A output, already persisted). + initial: list[dict[str, Any]] = [] + for row in rows: + text = row.get("content", "") or "" + if not text.strip(): + continue + paragraphs = _block_to_paragraphs(text) + if not paragraphs: + continue + row_blockid = str(row.get("blockid") or "").strip() + initial.append( + _new_block( + heading=row.get("heading", "") or "", + parent_headings=list(row.get("parent_headings") or []), + level=int(row.get("level", 1) or 1), + paragraphs=paragraphs, + table_chunk_role="none", + tokenizer=tokenizer, + blockids=[row_blockid] if row_blockid else None, + ) + ) + + # Stage B/C are run per original blocks.jsonl content row so split + # fragments can be labelled with [part n] using a row-local counter + # before Stage D merges small neighbours. 
+ after_c: list[dict[str, Any]] = [] + for blk in initial: + block_after_b = _expand_block_with_table_splits( + blk, + tokenizer=tokenizer, + table_max=table_max, + table_ideal=table_ideal, + table_min_last=table_min_last, + target_max=target_max, + chunk_overlap_token_size=overlap, + ) + + block_after_c: list[dict[str, Any]] = [] + for split_blk in block_after_b: + block_after_c.extend( + _split_long_block( + split_blk["paragraphs"], + split_blk["heading"], + split_blk["parent_headings"], + split_blk["level"], + split_blk.get("table_chunk_role", "none"), + tokenizer=tokenizer, + target_max=target_max, + target_ideal=target_ideal, + chunk_overlap_token_size=overlap, + blockids=split_blk.get("blockids") or blk.get("blockids"), + ) + ) + after_c.extend(_apply_part_suffixes(block_after_c)) + + # Stage D — bottom-up, level-aware small-block merging. + final = _merge_small_blocks( + after_c, + tokenizer=tokenizer, + target_max=target_max, + target_ideal=target_ideal, + small_tail_threshold=small_tail_threshold, + ) + + # Convert internal block dicts to the new chunk schema: nested heading + # dict + sidecar block carrying source blockid refs so the multimodal + # pipeline (and document-delete cache cleanup) can trace each chunk + # back to its blocks.jsonl row(s). 
+ chunks: list[dict[str, Any]] = [] + for idx, blk in enumerate(final): + body = blk["content"].strip() + if not body: + continue + chunk_dict: dict[str, Any] = { + "tokens": blk["tokens"], + "content": body, + "chunk_order_index": idx, + "heading": { + "level": int(blk.get("level") or 0), + "heading": str(blk.get("heading") or ""), + "parent_headings": list(blk.get("parent_headings") or []), + }, + } + blockids = blk.get("blockids") or [] + if blockids: + chunk_dict["sidecar"] = { + "type": "block", + "id": blockids[0], + "refs": [{"type": "block", "id": bid} for bid in blockids], + } + chunks.append(chunk_dict) + return chunks diff --git a/lightrag/chunker/recursive_character.py b/lightrag/chunker/recursive_character.py new file mode 100644 index 0000000000..16081fe99d --- /dev/null +++ b/lightrag/chunker/recursive_character.py @@ -0,0 +1,110 @@ +"""Recursive character chunking — the ``"R"`` strategy. + +Wraps LangChain's :class:`RecursiveCharacterTextSplitter` and delivers +output rows in the LightRAG file-chunker schema. The splitter walks the +``separators`` list from longest semantic boundary (``\\n\\n`` by default) +to weakest (the empty string), recursively re-splitting any segment that +still exceeds the token cap. + +Token accounting goes through the LightRAG :class:`Tokenizer` via the +``length_function`` plug-in — without that, ``chunk_size`` would be +measured in characters and ``chunk_token_size`` would lose its meaning. + +Output cap is *not* enforced internally: oversized segments are produced +when no separator can break them, and +:func:`lightrag.utils.enforce_chunk_token_limit_before_embedding` does the +final hard split before embedding. 
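The separator cascade described in the docstring above can be pictured with a small, dependency-free sketch. This is not LangChain's implementation: `recursive_split` is a hypothetical helper, it drops the separators it splits on and does not merge small neighbours back together, and it measures length in characters unless a different `length` callable is supplied.

```python
from typing import Callable


def recursive_split(
    text: str,
    separators: list[str],
    max_len: int,
    length: Callable[[str], int] = len,
) -> list[str]:
    """Hypothetical sketch: try each separator in order, strongest first."""
    if not text:
        return []
    if length(text) <= max_len:
        return [text]
    if not separators:
        # No separator left that could break this segment, so it is
        # emitted oversized; a later hard-split stage would handle it.
        return [text]
    sep, weaker = separators[0], separators[1:]
    # The empty string is the weakest boundary: split into characters.
    parts = list(text) if sep == "" else text.split(sep)
    pieces: list[str] = []
    for part in parts:
        pieces.extend(recursive_split(part, weaker, max_len, length))
    return pieces


print(recursive_split("aa bb\n\ncc", ["\n\n", " "], 2))
print(recursive_split("abcdef", [" "], 3))
```

The second call shows the case the docstring warns about: with no separator able to break the segment, the piece comes back larger than the requested size.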
+""" + +from __future__ import annotations + +from typing import Any + +from lightrag.utils import Tokenizer, logger + +try: + from langchain_text_splitters import RecursiveCharacterTextSplitter + + _LANGCHAIN_TEXT_SPLITTERS_AVAILABLE = True +except ImportError: + _LANGCHAIN_TEXT_SPLITTERS_AVAILABLE = False + RecursiveCharacterTextSplitter = None # type: ignore[assignment] + + +def chunking_by_recursive_character( + tokenizer: Tokenizer, + content: str, + chunk_token_size: int = 1200, + *, + chunk_overlap_token_size: int = 100, + separators: list[str] | None = None, +) -> list[dict[str, Any]]: + """Recursive character splitter — the ``"R"`` chunking strategy. + + Args: + tokenizer: LightRAG tokenizer; used as the length function so + ``chunk_token_size`` and ``chunk_overlap_token_size`` are + interpreted in tokens, not characters. + content: Text to split. + chunk_token_size: Target size for each chunk (tokens). Segments + that no separator can break may exceed it. + chunk_overlap_token_size: Token overlap between adjacent chunks. + separators: Cascade of split candidates. ``None`` defers to + LangChain's defaults: ``["\\n\\n", "\\n", " ", ""]``. + + Returns: + Ordered list of ``{"tokens", "content", "chunk_order_index"}`` + dicts. + """ + if not _LANGCHAIN_TEXT_SPLITTERS_AVAILABLE: + raise ImportError( + "langchain-text-splitters is required for the 'R' chunking " + "strategy; install with `pip install 'langchain-text-splitters>=0.3'`."
+ ) + + if not content or not content.strip(): + return [] + + splitter_kwargs: dict[str, Any] = { + "chunk_size": max(int(chunk_token_size), 1), + "chunk_overlap": max(int(chunk_overlap_token_size), 0), + "length_function": lambda s: len(tokenizer.encode(s)), + } + if separators is not None: + splitter_kwargs["separators"] = list(separators) + + splitter = RecursiveCharacterTextSplitter(**splitter_kwargs) + + pieces = splitter.split_text(content) + results: list[dict[str, Any]] = [] + for piece in pieces: + body = piece.strip() + if not body: + continue + results.append( + { + "tokens": len(tokenizer.encode(body)), + "content": body, + "chunk_order_index": len(results), + } + ) + + if not results: + # Defensive: splitter returned only whitespace fragments. Fall + # through with a single chunk of stripped content so downstream + # callers always receive at least one row when input is non-empty. + logger.warning( + "[recursive_character] splitter produced no non-empty chunks " + "for %d-char input; emitting single fallback chunk.", + len(content), + ) + body = content.strip() + if body: + results.append( + { + "tokens": len(tokenizer.encode(body)), + "content": body, + "chunk_order_index": 0, + } + ) + + return results diff --git a/lightrag/chunker/semantic_vector.py b/lightrag/chunker/semantic_vector.py new file mode 100644 index 0000000000..63da9ec7ff --- /dev/null +++ b/lightrag/chunker/semantic_vector.py @@ -0,0 +1,217 @@ +"""Semantic vector chunking — the ``"V"`` strategy. + +Wraps LangChain's :class:`SemanticChunker` (from ``langchain-experimental``) +which splits text by sentence embeddings: it first segments the input into +sentences, embeds each sentence (in adjacent windows of ``buffer_size``), +and finds breakpoints where the cosine distance between consecutive +windows crosses a threshold derived from the chosen distribution +(``percentile`` / ``standard_deviation`` / ``interquartile`` / +``gradient``). 
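The breakpoint rule described above (a split wherever the distance between consecutive windows exceeds a threshold taken from the distance distribution) can be sketched for the ``percentile`` mode as follows. This is a minimal illustration under stated assumptions, not the SemanticChunker source: `cosine_distance`, `percentile` and `find_breakpoints` are hypothetical helpers, and the interpolation rule is a plain linear one.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile over a non-empty list."""
    ordered = sorted(values)
    rank = (pct / 100.0) * (len(ordered) - 1)
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def find_breakpoints(window_vectors: list[list[float]], pct: float = 95.0) -> list[int]:
    """Indices i where the distance between window i and i + 1 exceeds the threshold."""
    distances = [
        cosine_distance(window_vectors[i], window_vectors[i + 1])
        for i in range(len(window_vectors) - 1)
    ]
    if not distances:
        return []
    threshold = percentile(distances, pct)
    return [i for i, dist in enumerate(distances) if dist > threshold]


# Two near-identical windows, a topic shift, then two near-identical windows.
vectors = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
print(find_breakpoints(vectors, pct=50.0))
```

Because the threshold is derived from the document's own distances, nothing here bounds the size of the resulting pieces, which is why a separate size check is still needed afterwards.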
+ +The chunker exposed here is ``async`` because LightRAG's +:class:`EmbeddingFunc` is async. Internally we call SemanticChunker +synchronously inside :func:`asyncio.to_thread` and bridge the embedding +calls back to the main event loop via +:func:`asyncio.run_coroutine_threadsafe`. + +Caveats: + - SemanticChunker does NOT enforce a maximum chunk size on its own. + Any piece that exceeds ``chunk_token_size`` is re-split via + :func:`lightrag.chunker.chunking_by_recursive_character` before it + is emitted; + :func:`lightrag.utils.enforce_chunk_token_limit_before_embedding` + remains the final backstop at embedding time. + - When ``embedding_func`` is ``None`` we log a warning and fall back to + :func:`lightrag.chunker.chunking_by_recursive_character` — V's only + differentiator is embeddings, and R is the closest structural-only + alternative. +""" + +from __future__ import annotations + +import asyncio +from typing import Any + +from lightrag.constants import DEFAULT_SENTENCE_SPLIT_REGEX +from lightrag.utils import EmbeddingFunc, Tokenizer, logger + +try: + from langchain_core.embeddings import Embeddings + from langchain_experimental.text_splitter import SemanticChunker + + _LANGCHAIN_EXPERIMENTAL_AVAILABLE = True +except ImportError: + _LANGCHAIN_EXPERIMENTAL_AVAILABLE = False + Embeddings = object # type: ignore[assignment,misc] + SemanticChunker = None # type: ignore[assignment] + + +class _AsyncEmbeddingFuncAdapter(Embeddings): + """Bridge a LightRAG :class:`EmbeddingFunc` (async) to LangChain's + sync :class:`Embeddings` interface used by ``SemanticChunker``. + + The adapter must be constructed inside the running event loop so it + can capture the loop reference; the blocking ``embed_documents`` / + ``embed_query`` calls are then made from a worker thread (via + :func:`asyncio.to_thread` in the public chunker) and bounce back to + the captured loop with :func:`asyncio.run_coroutine_threadsafe`.
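The thread-to-loop bridge described in this docstring can be tried in isolation with the standard library alone. The sketch below is an illustration, not the adapter itself: `fake_embed` is a hypothetical stand-in for an async embedding function, and `SyncBridge` is a made-up name.

```python
import asyncio


async def fake_embed(texts: list[str]) -> list[list[float]]:
    """Hypothetical async embedding function: one number per text."""
    await asyncio.sleep(0)
    return [[float(len(text))] for text in texts]


class SyncBridge:
    """Expose an async callable through a blocking method."""

    def __init__(self, async_fn, loop: asyncio.AbstractEventLoop) -> None:
        self._async_fn = async_fn
        self._loop = loop  # captured while the loop is running

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Must be called from a worker thread: blocking on the future from
        # the loop's own thread would stall the loop that has to run it.
        future = asyncio.run_coroutine_threadsafe(self._async_fn(texts), self._loop)
        return future.result()


async def embed_via_worker(texts: list[str]) -> list[list[float]]:
    loop = asyncio.get_running_loop()
    bridge = SyncBridge(fake_embed, loop)
    # The blocking call runs in a thread; the loop stays free to serve it.
    return await asyncio.to_thread(bridge.embed, texts)


print(asyncio.run(embed_via_worker(["a", "bcd"])))
```

The ordering matters: the loop reference is captured before the worker thread starts, so the coroutine is always scheduled on the loop that owns the embedding function's resources.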
+ """ + + def __init__( + self, + embedding_func: EmbeddingFunc, + loop: asyncio.AbstractEventLoop, + ) -> None: + self._embedding_func = embedding_func + self._loop = loop + + def _run(self, texts: list[str], context: str) -> list[list[float]]: + future = asyncio.run_coroutine_threadsafe( + self._embedding_func(texts, context=context), + self._loop, + ) + result = future.result() + return [list(map(float, vec)) for vec in result] + + def embed_documents(self, texts: list[str]) -> list[list[float]]: + return self._run(list(texts), context="document") + + def embed_query(self, text: str) -> list[float]: + return self._run([text], context="query")[0] + + +async def chunking_by_semantic_vector( + tokenizer: Tokenizer, + content: str, + chunk_token_size: int = 1200, + *, + embedding_func: EmbeddingFunc | None = None, + breakpoint_threshold_type: str = "percentile", + breakpoint_threshold_amount: float | None = None, + buffer_size: int = 1, + sentence_split_regex: str = DEFAULT_SENTENCE_SPLIT_REGEX, +) -> list[dict[str, Any]]: + """Semantic vector chunker — the ``"V"`` chunking strategy. + + Args: + tokenizer: LightRAG tokenizer (used for output token counts). + content: Text to split. + chunk_token_size: Hard upper bound (tokens). SemanticChunker does + NOT enforce a maximum natively, so any piece that exceeds + this value is re-split via + :func:`chunking_by_recursive_character` before being emitted. + embedding_func: LightRAG :class:`EmbeddingFunc`. When ``None`` + this chunker logs a warning and falls back to + :func:`chunking_by_recursive_character`. + breakpoint_threshold_type: ``percentile`` | ``standard_deviation`` + | ``interquartile`` | ``gradient`` (LangChain default: + ``percentile``). + breakpoint_threshold_amount: Threshold magnitude. ``None`` lets + LangChain pick the per-type default (e.g. 95 for percentile). + buffer_size: Number of adjacent sentences combined when computing + distances (LangChain default: 1). 
+ sentence_split_regex: Pattern fed to LangChain's + :class:`SemanticChunker` for the initial sentence split. + Default extends the upstream English-only pattern with + Chinese sentence terminators ``。?!`` so mixed-language and + pure-Chinese inputs split correctly. + + Returns: + Ordered list of ``{"tokens", "content", "chunk_order_index"}`` + dicts. + """ + if not content or not content.strip(): + return [] + + if embedding_func is None: + # V's only differentiator is embeddings — without them the + # closest neighbour is R's structural splitting. V chunks are + # non-overlapping by design (semantic boundaries), so the + # fallback uses ``chunk_overlap_token_size=0`` to preserve that + # semantic and avoid LangChain's "overlap > chunk_size" guard + # for very small ``chunk_token_size``. + logger.warning( + "[semantic_vector] embedding_func is None; falling back to " + "recursive-character chunking." + ) + from lightrag.chunker.recursive_character import ( + chunking_by_recursive_character, + ) + + return chunking_by_recursive_character( + tokenizer, + content, + chunk_token_size, + chunk_overlap_token_size=0, + ) + + if not _LANGCHAIN_EXPERIMENTAL_AVAILABLE: + raise ImportError( + "langchain-experimental is required for the 'V' chunking " + "strategy; install with `pip install 'langchain-experimental>=0.3'`."
+ ) + + loop = asyncio.get_running_loop() + adapter = _AsyncEmbeddingFuncAdapter(embedding_func, loop) + + chunker_kwargs: dict[str, Any] = { + "embeddings": adapter, + "buffer_size": int(buffer_size), + "breakpoint_threshold_type": breakpoint_threshold_type, + "sentence_split_regex": sentence_split_regex, + } + if breakpoint_threshold_amount is not None: + chunker_kwargs["breakpoint_threshold_amount"] = float( + breakpoint_threshold_amount + ) + + splitter = SemanticChunker(**chunker_kwargs) + pieces = await asyncio.to_thread(splitter.split_text, content) + + # SemanticChunker has no internal size cap; oversized pieces here + # would otherwise rely on the embedding-time hard fallback (which + # uses ``embedding_token_limit``, not ``chunk_token_size``) to split + # them. Enforce ``chunk_token_size`` directly via R for any piece + # that exceeds it so the user-configured size is actually honored. + # Lazy import dodges the recursive_character ↔ semantic_vector + # circular dependency (same pattern as the embedding-None fallback + # above). + from lightrag.chunker.recursive_character import ( + chunking_by_recursive_character, + ) + + target_max = max(int(chunk_token_size), 1) + results: list[dict[str, Any]] = [] + for piece in pieces: + body = piece.strip() + if not body: + continue + piece_tokens = len(tokenizer.encode(body)) + if piece_tokens <= target_max: + results.append( + { + "tokens": piece_tokens, + "content": body, + "chunk_order_index": len(results), + } + ) + continue + # Oversized semantic piece: re-split via R while preserving the + # surrounding chunk order. ``chunk_overlap_token_size=0`` keeps + # V's non-overlapping semantics. 
+ sub_pieces = chunking_by_recursive_character( + tokenizer, + body, + target_max, + chunk_overlap_token_size=0, + ) + for sub in sub_pieces: + sub_body = sub.get("content", "") + if not sub_body: + continue + results.append( + { + "tokens": sub.get("tokens", len(tokenizer.encode(sub_body))), + "content": sub_body, + "chunk_order_index": len(results), + } + ) + return results diff --git a/lightrag/chunker/token_size.py b/lightrag/chunker/token_size.py new file mode 100644 index 0000000000..3a1e8fa0e5 --- /dev/null +++ b/lightrag/chunker/token_size.py @@ -0,0 +1,128 @@ +"""Fixed-size token-window chunking — the LightRAG default strategy. + +Chunks the input text into windows of at most ``chunk_token_size`` tokens +with ``chunk_overlap_token_size`` of overlap between adjacent windows. +When ``split_by_character`` is supplied, the splitter first segments on +that delimiter and then either tokenizes each segment as-is +(``split_by_character_only=True``) or further sub-splits any segment +that exceeds the token cap. + +Two entry points are exported: + + - :func:`chunking_by_token_size` — the **legacy 6-arg signature** + used as the default value for :attr:`lightrag.LightRAG.chunking_func`. + Kept for backward compatibility so externally-supplied chunking + functions can continue to drop in unchanged. + + - :func:`chunking_by_fixed_token` — the same algorithm exposed under + the **new file-chunker contract** (standard prefix + ``(tokenizer, content, chunk_token_size)`` plus keyword-only + knobs). Used by the file-based chunking dispatcher in + ``process_single_document`` for ``doc_process_opts.chunking == "F"``. 
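The windowing arithmetic behind the fixed-token strategy can be sketched without a tokenizer. This is an illustration under stated assumptions: `window_starts` and `fixed_windows` are hypothetical helpers operating on a plain list of token ids, and the explicit guard on the step is an addition of this sketch, not something the diff's code is shown to do.

```python
def window_starts(total_tokens: int, chunk_token_size: int, overlap: int) -> list[int]:
    """Start offsets of each window; consecutive windows share `overlap` tokens."""
    step = chunk_token_size - overlap
    if step <= 0:
        # range() rejects a zero step and yields nothing for a negative one,
        # so an overlap >= chunk size cannot produce a usable windowing.
        raise ValueError("chunk_overlap_token_size must be smaller than chunk_token_size")
    return list(range(0, total_tokens, step))


def fixed_windows(tokens: list[int], chunk_token_size: int, overlap: int) -> list[list[int]]:
    """Slice a token list into overlapping fixed-size windows."""
    return [
        tokens[start : start + chunk_token_size]
        for start in window_starts(len(tokens), chunk_token_size, overlap)
    ]


print(fixed_windows(list(range(10)), chunk_token_size=4, overlap=1))
```

With ten tokens, a size of 4 and an overlap of 1, the stride is 3 and the last window holds only the final token, which the previous window already covered. The short tail is a property of striding by `size - overlap`, so callers that dislike it would need to trim it themselves.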
+""" + +from __future__ import annotations + +from typing import Any + +from lightrag.exceptions import ChunkTokenLimitExceededError +from lightrag.utils import Tokenizer, logger + + +def chunking_by_token_size( + tokenizer: Tokenizer, + content: str, + split_by_character: str | None = None, + split_by_character_only: bool = False, + chunk_overlap_token_size: int = 100, + chunk_token_size: int = 1200, +) -> list[dict[str, Any]]: + """Legacy 6-arg fixed-token chunker (default for ``LightRAG.chunking_func``). + + Signature is preserved for backward compatibility with externally + supplied ``chunking_func`` implementations. New file-based chunking + dispatch uses :func:`chunking_by_fixed_token` instead. + """ + tokens = tokenizer.encode(content) + results: list[dict[str, Any]] = [] + if split_by_character: + raw_chunks = content.split(split_by_character) + new_chunks = [] + if split_by_character_only: + for chunk in raw_chunks: + _tokens = tokenizer.encode(chunk) + if len(_tokens) > chunk_token_size: + logger.warning( + "Chunk split_by_character exceeds token limit: len=%d limit=%d", + len(_tokens), + chunk_token_size, + ) + raise ChunkTokenLimitExceededError( + chunk_tokens=len(_tokens), + chunk_token_limit=chunk_token_size, + chunk_preview=chunk[:120], + ) + new_chunks.append((len(_tokens), chunk)) + else: + for chunk in raw_chunks: + _tokens = tokenizer.encode(chunk) + if len(_tokens) > chunk_token_size: + for start in range( + 0, len(_tokens), chunk_token_size - chunk_overlap_token_size + ): + chunk_content = tokenizer.decode( + _tokens[start : start + chunk_token_size] + ) + new_chunks.append( + (min(chunk_token_size, len(_tokens) - start), chunk_content) + ) + else: + new_chunks.append((len(_tokens), chunk)) + for index, (_len, chunk) in enumerate(new_chunks): + results.append( + { + "tokens": _len, + "content": chunk.strip(), + "chunk_order_index": index, + } + ) + else: + for index, start in enumerate( + range(0, len(tokens), chunk_token_size - 
chunk_overlap_token_size) + ): + chunk_content = tokenizer.decode(tokens[start : start + chunk_token_size]) + results.append( + { + "tokens": min(chunk_token_size, len(tokens) - start), + "content": chunk_content.strip(), + "chunk_order_index": index, + } + ) + return results + + +def chunking_by_fixed_token( + tokenizer: Tokenizer, + content: str, + chunk_token_size: int = 1200, + *, + chunk_overlap_token_size: int = 100, + split_by_character: str | None = None, + split_by_character_only: bool = False, +) -> list[dict[str, Any]]: + """Fixed-token chunker — file-chunker contract for the ``"F"`` strategy. + + Implements the same fixed-window algorithm as + :func:`chunking_by_token_size`, exposed under the standard + file-chunker signature ``(tokenizer, content, chunk_token_size, *, + )`` so the file-based chunking dispatcher in + ``process_single_document`` can call every strategy uniformly. + """ + return chunking_by_token_size( + tokenizer, + content, + split_by_character=split_by_character, + split_by_character_only=split_by_character_only, + chunk_overlap_token_size=chunk_overlap_token_size, + chunk_token_size=chunk_token_size, + ) diff --git a/lightrag/constants.py b/lightrag/constants.py index 318a83ea5b..6d1a96c858 100644 --- a/lightrag/constants.py +++ b/lightrag/constants.py @@ -6,6 +6,8 @@ consistency and makes maintenance easier. 
""" +from typing import Literal, TypeAlias + # Default values for server settings DEFAULT_WOKERS = 2 DEFAULT_MAX_GRAPH_NODES = 1000 @@ -15,6 +17,10 @@ DEFAULT_MAX_GLEANING = 1 DEFAULT_ENTITY_NAME_MAX_LENGTH = 256 +# Per-response output limits for entity extraction prompts +DEFAULT_MAX_EXTRACTION_RECORDS = 100 +DEFAULT_MAX_EXTRACTION_ENTITIES = 40 + # Number of description fragments to trigger LLM summary DEFAULT_FORCE_LLM_SUMMARY_ON_MERGE = 8 # Max description token size to trigger LLM summary @@ -25,21 +31,6 @@ DEFAULT_SUMMARY_CONTEXT_SIZE = 12000 # Maximum token size allowed for entity extraction input context DEFAULT_MAX_EXTRACT_INPUT_TOKENS = 20480 -# Default entities to extract if ENTITY_TYPES is not specified in .env -DEFAULT_ENTITY_TYPES = [ - "Person", - "Creature", - "Organization", - "Location", - "Event", - "Concept", - "Method", - "Content", - "Data", - "Artifact", - "NaturalObject", -] - # Separator for: description, source_id and relation-key fields(Can not be changed after data inserted) GRAPH_FIELD_SEP = "" @@ -89,6 +80,233 @@ DEFAULT_MAX_ASYNC = 4 # Default maximum async operations DEFAULT_MAX_PARALLEL_INSERT = 2 # Default maximum parallel insert operations +# Chunker defaults — i18n-aware so Chinese / mixed-language documents +# split correctly out of the box. Override per deployment via +# CHUNK_R_SEPARATORS / CHUNK_V_SENTENCE_SPLIT_REGEX env vars. +# +# DEFAULT_R_SEPARATORS: cascade tried by langchain RecursiveCharacterTextSplitter. +# Order matters — strongest boundary first: paragraph (\n\n) > line (\n) > +# Chinese sentence-end (。!?) > Chinese semi-clause (;,) > space > char. +# English sentence-ending punctuation (.?!) is intentionally NOT included +# because RecursiveCharacterTextSplitter does literal-string splitting, so +# "." would also split numerals (``0.95``) and abbreviations (``e.g.``). +# The English path falls through space / char as before. +DEFAULT_R_SEPARATORS: tuple[str, ...] 
= ( + "\n\n", + "\n", + "。", + "!", + "?", + ";", + ",", + " ", + "", +) +# DEFAULT_SENTENCE_SPLIT_REGEX: pattern fed to langchain SemanticChunker. +# Two alternates so the English branch keeps its ``\s+`` requirement +# (avoiding ``0.95`` mid-token splits) while the Chinese branch matches +# bare ``。?!`` (CJK has no inter-sentence whitespace). +DEFAULT_SENTENCE_SPLIT_REGEX = r"(?<=[.?!])\s+|(?<=[。?!])" + +# DEFAULT_CHUNK_P_SIZE: paragraph-semantic chunker target size when +# CHUNK_P_SIZE env is unset. Deliberately larger than the global +# CHUNK_SIZE default — heading-aligned paragraph merging needs more +# headroom to keep semantically related paragraphs together; falling +# back to CHUNK_SIZE (1200) would force premature splits and defeat +# the strategy's purpose. +DEFAULT_CHUNK_P_SIZE = 2000 + +# LightRAG Document pipeline +FULL_DOCS_FORMAT_RAW = "raw" # content in full_docs["content"] +FULL_DOCS_FORMAT_LIGHTRAG = "lightrag" # content in LightRAG Document files +FULL_DOCS_FORMAT_PENDING_PARSE = ( + "pending_parse" # file saved but not yet parsed; parse_native will read from disk +) +# Marker prefix for full_docs.content when format=lightrag. +# Per docs/FileProcessingConfiguration-zh.md, the content is "{{LRdoc}}" + a +# leading summary of the parsed document so paginated APIs can show a real +# preview without loading the full LightRAG Document file.
+LIGHTRAG_DOC_CONTENT_PREFIX = "{{LRdoc}}" +PARSER_ENGINE_LEGACY = "legacy" +PARSER_ENGINE_NATIVE = "native" +PARSER_ENGINE_MINERU = "mineru" +PARSER_ENGINE_DOCLING = "docling" +SUPPORTED_PARSER_ENGINES = frozenset( + { + PARSER_ENGINE_LEGACY, + PARSER_ENGINE_NATIVE, + PARSER_ENGINE_MINERU, + PARSER_ENGINE_DOCLING, + } +) +PARSER_ENGINE_SUFFIX_CAPABILITIES = { + PARSER_ENGINE_LEGACY: frozenset( + { + "txt", + "md", + "mdx", + "pdf", + "docx", + "pptx", + "xlsx", + "rtf", + "odt", + "tex", + "epub", + "html", + "htm", + "csv", + "json", + "xml", + "yaml", + "yml", + "log", + "conf", + "ini", + "properties", + "sql", + "bat", + "sh", + "c", + "h", + "cpp", + "hpp", + "py", + "java", + "js", + "ts", + "swift", + "go", + "rb", + "php", + "css", + "scss", + "less", + } + ), + PARSER_ENGINE_NATIVE: frozenset({"docx"}), + PARSER_ENGINE_MINERU: frozenset( + { + "pdf", + "doc", + "docx", + "ppt", + "pptx", + "xls", + "xlsx", + "png", + "jpg", + "jpeg", + "jp2", + "webp", + "gif", + "bmp", + } + ), + PARSER_ENGINE_DOCLING: frozenset( + { + "pdf", + "docx", + "pptx", + "xlsx", + "md", + "html", + "xhtml", + "png", + "jpg", + "jpeg", + "tiff", + "webp", + "bmp", + } + ), +} +PARSED_DIR_NAME = "__parsed__" # Dir for parsed files (renamed from __enqueued__) + +# Suffixes for parser artifact subdirectories under ``/__parsed__/``. +# Centralising them here keeps the sidecar writer, engine cache modules and +# the delete-path whitelist in sync — new engines should add their raw-dir +# suffix to ``PARSED_ARTIFACT_DIR_SUFFIXES`` so deletion picks them up +# automatically. +PARSED_DIR_SUFFIX = ".parsed" # spec sidecar layout (every engine) +MINERU_RAW_DIR_SUFFIX = ".mineru_raw" # preserved MinerU raw bundle +DOCLING_RAW_DIR_SUFFIX = ".docling_raw" # preserved Docling raw bundle +PARSED_ARTIFACT_DIR_SUFFIXES: tuple[str, ...] = ( + PARSED_DIR_SUFFIX, + MINERU_RAW_DIR_SUFFIX, + DOCLING_RAW_DIR_SUFFIX, +) + +# Per-file processing options carried by filename hints / LIGHTRAG_PARSER rules. 
+# See docs/FileProcessingConfiguration-zh.md for the full specification. +PROCESS_OPTION_IMAGES = "i" # Enable VLM analysis for drawings/images +PROCESS_OPTION_TABLES = "t" # Enable VLM analysis for tables +PROCESS_OPTION_EQUATIONS = "e" # Enable VLM analysis for equations +PROCESS_OPTION_SKIP_KG = "!" # Skip entity/relation extraction (no KG build) +ProcessChunkingOption: TypeAlias = Literal["F", "R", "V", "P"] +PROCESS_OPTION_CHUNK_FIXED: ProcessChunkingOption = ( + "F" # Fixed-length / separator chunking (default) +) +PROCESS_OPTION_CHUNK_RECURSIVE: ProcessChunkingOption = ( + "R" # Recursive semantic chunking +) +PROCESS_OPTION_CHUNK_VECTOR: ProcessChunkingOption = ( + "V" # Vector-driven semantic chunking +) +PROCESS_OPTION_CHUNK_PARAGRAH: ProcessChunkingOption = ( + "P" # Paragraph-driven semantic chunking +) + +PROCESS_OPTION_CHUNK_CHARS: frozenset[ProcessChunkingOption] = frozenset( + { + PROCESS_OPTION_CHUNK_FIXED, + PROCESS_OPTION_CHUNK_RECURSIVE, + PROCESS_OPTION_CHUNK_VECTOR, + PROCESS_OPTION_CHUNK_PARAGRAH, + } +) +SUPPORTED_PROCESS_OPTIONS = frozenset( + { + PROCESS_OPTION_IMAGES, + PROCESS_OPTION_TABLES, + PROCESS_OPTION_EQUATIONS, + PROCESS_OPTION_SKIP_KG, + PROCESS_OPTION_CHUNK_FIXED, + PROCESS_OPTION_CHUNK_RECURSIVE, + PROCESS_OPTION_CHUNK_VECTOR, + PROCESS_OPTION_CHUNK_PARAGRAH, + } +) + +DEFAULT_MAX_PARALLEL_ANALYZE = 5 # Multimodal analysis (VLM) concurrency + +# Per-engine parsing concurrency defaults. mineru / docling default to 1 +# because both engines are resource-intensive (GPU/CPU + memory) and tend to +# be more stable when run serially; users with capacity can opt into higher +# concurrency via MAX_PARALLEL_PARSE_* env vars. +DEFAULT_MAX_PARALLEL_PARSE_NATIVE = 5 +DEFAULT_MAX_PARALLEL_PARSE_MINERU = 1 +DEFAULT_MAX_PARALLEL_PARSE_DOCLING = 1 + +# Staged pipeline queue size defaults.
+DEFAULT_QUEUE_SIZE_DEFAULT = 100 +DEFAULT_QUEUE_SIZE_INSERT = 4 + +# Multimodal analysis / chunk thresholds +# Minimum token count retained when truncating a multimodal chunk's +# description to fit within DEFAULT_MAX_EXTRACT_INPUT_TOKENS. Falling below +# this floor leaves the description too thin to ground a useful entity +# description, so the pipeline raises instead of producing a stub. +DEFAULT_MM_CHUNK_DESCRIPTION_MIN_TOKENS = 100 +# Minimum image side (width or height) in pixels accepted for VLM analysis. +# Anything smaller is treated as decorative (icons, separators, etc.) and +# written as status="skipped". +DEFAULT_MM_IMAGE_MIN_PIXEL = 32 +# Priority used for all multimodal analysis LLM calls. Higher numbers run +# behind entity extraction (priority 10) so a busy ingestion queue still +# prefers KG-building work. +DEFAULT_MM_ANALYSIS_PRIORITY = 12 + # Embedding configuration defaults DEFAULT_EMBEDDING_FUNC_MAX_ASYNC = 8 # Default max async for embedding functions DEFAULT_EMBEDDING_BATCH_NUM = 10 # Default batch size for embedding computations @@ -100,6 +318,12 @@ DEFAULT_LLM_TIMEOUT = 180 DEFAULT_EMBEDDING_TIMEOUT = 30 +# Rerank async / timeout defaults +# Concurrency falls back to base MAX_ASYNC when env unset; timeout has its own +# default since reranker calls are typically much faster than full LLM generation. +DEFAULT_RERANK_MAX_ASYNC = DEFAULT_MAX_ASYNC +DEFAULT_RERANK_TIMEOUT = 30 + # Logging configuration defaults DEFAULT_LOG_MAX_BYTES = 10485760 # Default 10MB DEFAULT_LOG_BACKUP_COUNT = 5 # Default 5 backups diff --git a/lightrag/exceptions.py b/lightrag/exceptions.py index 5e1acb7d47..e592cbdfa4 100644 --- a/lightrag/exceptions.py +++ b/lightrag/exceptions.py @@ -137,3 +137,14 @@ class DataMigrationError(Exception): def __init__(self, message: str): super().__init__(message) self.message = message + + +class MultimodalAnalysisError(RuntimeError): + """Raised when multimodal analysis must fail the current document. 
+ + Hard failures (missing required field, schema mismatch, model not + available, sidecar already carries ``status="failure"``) bubble this + exception so the pipeline marks the document failed instead of writing + an unusable analyze result. Callers persist a ``status="failure"`` + sidecar entry alongside the raise so a re-run sees the failure. + """ diff --git a/lightrag/external_parser/__init__.py b/lightrag/external_parser/__init__.py new file mode 100644 index 0000000000..8dd337d412 --- /dev/null +++ b/lightrag/external_parser/__init__.py @@ -0,0 +1,50 @@ +"""Adapters for external document parsing services. + +Each subpackage under ``external_parser/`` integrates one external parser +(docling, mineru, ...) by handling: + +- request/upload/poll choreography against the parser's HTTP API, +- on-disk caching of the raw bundle under ``._raw/``, +- normalization into LightRAG IR (``IRDoc``) for the sidecar writer. + +Shared cross-engine helpers (size/hash, atomic manifest IO, safe zip +extraction, env coercion) live at this package root in private modules +prefixed ``_``. Engine-specific cache validation, manifest construction, +and IR adaptation live in each subpackage. 
+""" + +from lightrag.external_parser._common import ( + clear_dir_contents, + compute_size_and_hash, + env_bool, + env_int, + env_json, + raw_dir_for_parsed_dir, +) +from lightrag.external_parser._manifest import ( + MANIFEST_FILENAME, + MANIFEST_VERSION, + Manifest, + ManifestFile, + load_manifest, + manifest_path, + write_manifest, +) +from lightrag.external_parser._zip import safe_extract_zip + +__all__ = [ + "MANIFEST_FILENAME", + "MANIFEST_VERSION", + "Manifest", + "ManifestFile", + "clear_dir_contents", + "compute_size_and_hash", + "env_bool", + "env_int", + "env_json", + "load_manifest", + "manifest_path", + "raw_dir_for_parsed_dir", + "safe_extract_zip", + "write_manifest", +] diff --git a/lightrag/external_parser/_common.py b/lightrag/external_parser/_common.py new file mode 100644 index 0000000000..101fa096f1 --- /dev/null +++ b/lightrag/external_parser/_common.py @@ -0,0 +1,152 @@ +"""Shared helpers for ``lightrag/external_parser//`` packages. + +Currently consumed by the docling subpackage; expected to be reused when +mineru is migrated under ``external_parser/mineru/``. + +These are pure functions with no engine-specific knowledge. Engine-specific +logic (endpoint signature, options signature, cache validation policy) lives +in each engine's own ``cache.py``. +""" + +from __future__ import annotations + +import hashlib +import json +import os +import shutil +from pathlib import Path +from typing import Any + +from lightrag.constants import PARSED_DIR_SUFFIX +from lightrag.utils import logger + + +def compute_size_and_hash(path: Path) -> tuple[int, str]: + """Single-read computation of ``(size_bytes, "sha256:")``. + + Manifest writes use this so the recorded size and hash are guaranteed to + describe the same byte stream; using two ``open()`` calls would risk a + TOCTOU mismatch if the file changed in between. 
+ """ + h = hashlib.sha256() + size = 0 + with path.open("rb") as f: + for chunk in iter(lambda: f.read(1 << 20), b""): + h.update(chunk) + size += len(chunk) + return size, f"sha256:{h.hexdigest()}" + + +def clear_dir_contents(directory: Path) -> None: + """Delete everything inside ``directory`` but keep ``directory`` itself.""" + if not directory.exists(): + return + for entry in directory.iterdir(): + try: + if entry.is_dir() and not entry.is_symlink(): + shutil.rmtree(entry, ignore_errors=True) + else: + entry.unlink() + except OSError: + continue + + +def raw_dir_for_parsed_dir(parsed_dir: Path, *, suffix: str) -> Path: + """Sibling raw dir for a ``*.parsed`` dir. + + ``foo.parsed/`` with ``suffix=".docling_raw"`` → ``foo.docling_raw/``. + ``suffix`` must start with ``.`` and be engine-specific (the caller + binds it via ``functools.partial`` or a thin wrapper). + """ + if not suffix.startswith("."): + raise ValueError(f"raw dir suffix must start with '.', got {suffix!r}") + stem = parsed_dir.name + if stem.endswith(PARSED_DIR_SUFFIX): + stem = stem[: -len(PARSED_DIR_SUFFIX)] + return parsed_dir.parent / f"{stem}{suffix}" + + +def env_bool(name: str, default: bool) -> bool: + raw = os.getenv(name, "").strip().lower() + if raw in {"1", "true", "yes", "on"}: + return True + if raw in {"0", "false", "no", "off"}: + return False + return default + + +def env_int(name: str, default: int) -> int: + raw = os.getenv(name, "").strip() + if not raw: + return default + try: + return int(raw) + except ValueError: + logger.warning( + "[external_parser] %s=%r is not an integer; using %s", name, raw, default + ) + return default + + +def env_json(name: str, default: Any) -> Any: + """Parse a JSON env var; on parse error log a warning and return default.""" + raw = os.getenv(name, "").strip() + if not raw: + return default + try: + return json.loads(raw) + except json.JSONDecodeError: + logger.warning( + "[external_parser] %s=%r is not valid JSON; using default", name, raw + 
) + return default + + +def response_error_detail(resp: Any, *, limit: int = 1000) -> str: + """Return a compact response body snippet for HTTP error reporting.""" + try: + payload = resp.json() if getattr(resp, "text", "") else None + except Exception: + payload = None + + if payload is not None: + try: + detail = json.dumps(payload, ensure_ascii=False, sort_keys=True) + except TypeError: + detail = repr(payload) + else: + detail = str(getattr(resp, "text", "") or "").strip() + + detail = " ".join(detail.split()) + if not detail: + return "empty response body" + if len(detail) > limit: + return f"{detail[:limit]}..." + return detail + + +def raise_for_status_with_detail(resp: Any, operation: str) -> None: + """Raise an HTTP error that preserves service-provided response details. + + Treats any non-2xx response as an error, matching httpx's + ``raise_for_status`` status handling (which also raises on 1xx/3xx, + not just 4xx/5xx) while attaching a compact response-body snippet to + the message for faster diagnosis. + """ + status_code = int(getattr(resp, "status_code", 0) or 0) + if 200 <= status_code < 300: + return + detail = response_error_detail(resp) + raise RuntimeError(f"{operation} failed: HTTP {status_code} {detail}") + + +__all__ = [ + "clear_dir_contents", + "compute_size_and_hash", + "env_bool", + "env_int", + "env_json", + "raise_for_status_with_detail", + "raw_dir_for_parsed_dir", + "response_error_detail", +] diff --git a/lightrag/external_parser/_manifest.py b/lightrag/external_parser/_manifest.py new file mode 100644 index 0000000000..176b70390f --- /dev/null +++ b/lightrag/external_parser/_manifest.py @@ -0,0 +1,167 @@ +"""Shared ``_manifest.json`` schema for ``external_parser//`` bundles. + +The manifest is the *atomic success marker* for a raw bundle. 
Its presence +implies "all files in this directory finished downloading"; its content is +the cache key for "is this bundle for the same source file, the same engine +version, the same endpoint, and the same option signature we are using right +now?". + +Write path: :func:`write_manifest` writes a temp file then atomically renames +to ``_manifest.json``. A crash mid-download leaves no manifest, so the next +parse call cleanly invalidates and re-downloads. + +Read path: :func:`load_manifest` returns ``None`` if absent, malformed, or +recorded under a different engine — either way the bundle is treated as +stale. +""" + +from __future__ import annotations + +import json +import os +from dataclasses import asdict, dataclass, field +from pathlib import Path + +MANIFEST_FILENAME = "_manifest.json" +MANIFEST_VERSION = "1.0" + + +@dataclass +class ManifestFile: + """One file entry inside the bundle. Size always; sha256 only for files + where silent corruption would break the adapter (the "critical" file). + """ + + path: str # relative to the raw dir + size: int + sha256: str | None = None # ``"sha256:"`` or ``None`` + + +@dataclass +class Manifest: + """Generic manifest schema. ``engine`` is filled by the caller (docling / + mineru / etc.); ``options_signature`` lets per-engine cache layers detect + when env-driven request parameters changed without bumping the version. 
+ """ + + engine: str + source_content_hash: str + source_size_bytes: int + source_filename_at_parse: str + critical_file: ManifestFile + files: list[ManifestFile] + total_size_bytes: int + task_id: str = "" + api_mode: str = "" + engine_version: str = "" + endpoint_signature: str = "" + options_signature: str = "" + downloaded_at: str = "" + extras: dict = field(default_factory=dict) + version: str = MANIFEST_VERSION + + def to_dict(self) -> dict: + return { + "version": self.version, + "engine": self.engine, + "api_mode": self.api_mode, + "engine_version": self.engine_version, + "endpoint_signature": self.endpoint_signature, + "options_signature": self.options_signature, + "source_content_hash": self.source_content_hash, + "source_size_bytes": int(self.source_size_bytes), + "source_filename_at_parse": self.source_filename_at_parse, + "task_id": self.task_id, + "downloaded_at": self.downloaded_at, + "critical_file": asdict(self.critical_file), + "files": [asdict(f) for f in self.files], + "total_size_bytes": int(self.total_size_bytes), + "extras": dict(self.extras or {}), + } + + @classmethod + def from_dict(cls, payload: dict) -> "Manifest": + critical_raw = payload.get("critical_file") or {} + files_raw = payload.get("files") or [] + return cls( + version=str(payload.get("version") or MANIFEST_VERSION), + engine=str(payload.get("engine") or ""), + api_mode=str(payload.get("api_mode") or ""), + engine_version=str(payload.get("engine_version") or ""), + endpoint_signature=str(payload.get("endpoint_signature") or ""), + options_signature=str(payload.get("options_signature") or ""), + source_content_hash=str(payload.get("source_content_hash") or ""), + source_size_bytes=int(payload.get("source_size_bytes") or 0), + source_filename_at_parse=str(payload.get("source_filename_at_parse") or ""), + task_id=str(payload.get("task_id") or ""), + downloaded_at=str(payload.get("downloaded_at") or ""), + critical_file=ManifestFile( + path=str(critical_raw.get("path") or ""), + 
size=int(critical_raw.get("size") or 0), + sha256=( + str(critical_raw["sha256"]) if critical_raw.get("sha256") else None + ), + ), + files=[ + ManifestFile( + path=str(f.get("path") or ""), + size=int(f.get("size") or 0), + sha256=(str(f["sha256"]) if f.get("sha256") else None), + ) + for f in files_raw + if isinstance(f, dict) + ], + total_size_bytes=int(payload.get("total_size_bytes") or 0), + extras=dict(payload.get("extras") or {}), + ) + + +def manifest_path(raw_dir: Path) -> Path: + return raw_dir / MANIFEST_FILENAME + + +def load_manifest(raw_dir: Path, *, expected_engine: str) -> Manifest | None: + """Return the parsed manifest or ``None`` if absent / malformed / for a + different engine. ``expected_engine`` is required so a future shared raw + dir cannot serve a bundle that belongs to another engine. + """ + p = manifest_path(raw_dir) + if not p.is_file(): + return None + try: + payload = json.loads(p.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + return None + if not isinstance(payload, dict): + return None + if payload.get("version") != MANIFEST_VERSION: + return None + if payload.get("engine") != expected_engine: + return None + try: + return Manifest.from_dict(payload) + except (TypeError, ValueError): + return None + + +def write_manifest(raw_dir: Path, manifest: Manifest) -> None: + """Atomically write the manifest using temp-file + rename.""" + raw_dir.mkdir(parents=True, exist_ok=True) + final = manifest_path(raw_dir) + tmp = final.with_suffix(".json.tmp") + tmp.write_text( + json.dumps(manifest.to_dict(), ensure_ascii=False, indent=2), + encoding="utf-8", + ) + os.replace(tmp, final) + + +__all__ = [ + "MANIFEST_FILENAME", + "MANIFEST_VERSION", + "Manifest", + "ManifestFile", + "load_manifest", + "manifest_path", + "write_manifest", +] diff --git a/lightrag/external_parser/_zip.py b/lightrag/external_parser/_zip.py new file mode 100644 index 0000000000..6dd66a8d16 --- /dev/null +++ b/lightrag/external_parser/_zip.py @@ 
-0,0 +1,42 @@ +"""Shared zip-bundle extraction for external parser engines. + +Engines like docling return their full output as a zip archive. This helper +extracts it safely (refusing path traversal / absolute paths) into a target +directory. Engine-specific post-extraction normalization (e.g. mineru's +nested-subdir hoist) is *not* done here — each engine's client handles its +own quirks. +""" + +from __future__ import annotations + +import io +import os +import zipfile +from pathlib import Path + + +def safe_extract_zip(payload: bytes, dest_dir: Path) -> list[str]: + """Extract a zip archive into ``dest_dir``, refusing unsafe paths. + + Raises ``RuntimeError`` if any entry name is absolute or contains ``..`` + components after normalization. Returns the list of extracted member + names (as stored in the zip, prior to OS-specific normalization), so + callers can validate the bundle layout without re-walking the directory. + """ + dest_dir.mkdir(parents=True, exist_ok=True) + buf = io.BytesIO(payload) + with zipfile.ZipFile(buf) as zf: + names = zf.namelist() + for name in names: + norm = os.path.normpath(name) + if ( + norm.startswith("..") + or os.path.isabs(norm) + or norm.startswith(("/", os.sep)) + ): + raise RuntimeError(f"Refusing zip entry with unsafe path: {name!r}") + zf.extractall(dest_dir) + return names + + +__all__ = ["safe_extract_zip"] diff --git a/lightrag/external_parser/docling/__init__.py b/lightrag/external_parser/docling/__init__.py new file mode 100644 index 0000000000..039d9c6ee0 --- /dev/null +++ b/lightrag/external_parser/docling/__init__.py @@ -0,0 +1,41 @@ +"""Docling parser integration (raw client, cache, manifest, IR adapter). + +Public surface for the rest of the codebase. ``parse_docling`` imports +only from this facade so the inner module layout stays free to evolve. 
+""" + +from lightrag.constants import DOCLING_RAW_DIR_SUFFIX +from lightrag.external_parser._common import ( + clear_dir_contents, + raw_dir_for_parsed_dir as _raw_dir_for_parsed_dir, +) + +MANIFEST_ENGINE = "docling" + + +def raw_dir_for_parsed_dir(parsed_dir): + """``foo.parsed/`` → ``foo.docling_raw/`` (docling-specific binding).""" + return _raw_dir_for_parsed_dir(parsed_dir, suffix=DOCLING_RAW_DIR_SUFFIX) + + +# Imported after ``MANIFEST_ENGINE`` / ``DOCLING_RAW_DIR_SUFFIX`` because +# the submodules read those constants at import time. +from lightrag.external_parser.docling.ir_builder import ( # noqa: E402 + DoclingIRBuilder, +) +from lightrag.external_parser.docling.cache import ( # noqa: E402 + is_bundle_valid, +) +from lightrag.external_parser.docling.client import ( # noqa: E402 + DoclingRawClient, +) + +__all__ = [ + "DOCLING_RAW_DIR_SUFFIX", + "MANIFEST_ENGINE", + "DoclingIRBuilder", + "DoclingRawClient", + "clear_dir_contents", + "is_bundle_valid", + "raw_dir_for_parsed_dir", +] diff --git a/lightrag/external_parser/docling/cache.py b/lightrag/external_parser/docling/cache.py new file mode 100644 index 0000000000..e8af378a49 --- /dev/null +++ b/lightrag/external_parser/docling/cache.py @@ -0,0 +1,228 @@ +"""Cache validation for ``*.docling_raw/`` bundles. + +Validation policy (settled in +``docs/DoclingSidecarRefactorPlan-zh.md`` §4.1): + +1. ``_manifest.json`` exists, parses, ``engine="docling"`` ∧ schema version + matches. +2. **Source size fast-path**: ``source_file.stat().st_size`` matches the + manifest; mismatch → miss without hashing. +3. **Source content_hash**: full sha256 of the current source file matches + the manifest. +4. **Engine version**: if ``DOCLING_ENGINE_VERSION`` is set in env and the + manifest recorded a non-empty value, they must match. +5. **Endpoint signature**: if the active ``DOCLING_ENDPOINT`` differs from + what was recorded at parse time, miss (avoids re-using a bundle produced + by a different docling-serve instance). 
+6. **Options signature**: covers every env or fixed constant that changes + the produced bundle (OCR flags, language list, formula enrichment, + target format and pipeline). Any change → miss. +7. **Critical file**: the main JSON must exist with matching size **and** + sha256 — final tie-breaker against silent corruption affecting the file + the adapter depends on. +8. **Other files**: size-only verification (cheap; covers most corruption + modes for markdown / artifacts). + +Any failed step ⇒ cache miss; the caller wipes the directory contents and +re-runs the download. +""" + +from __future__ import annotations + +import hashlib +import json +import os +from pathlib import Path + +from lightrag.external_parser._common import compute_size_and_hash, env_bool +from lightrag.external_parser._manifest import load_manifest +from lightrag.external_parser.docling import MANIFEST_ENGINE +from lightrag.utils import logger + +# Legacy upload-path suffix. ``env.example`` historically documented +# ``DOCLING_ENDPOINT=http://host:5001/v1/convert/file/async`` (the full +# upload URL); the current client expects a base URL and appends the path +# itself. Strip the suffix so an unmodified pre-refactor ``.env`` keeps +# working instead of producing +# ``/v1/convert/file/async/v1/convert/file/async`` requests. +_LEGACY_UPLOAD_PATH_SUFFIX = "/v1/convert/file/async" +_legacy_endpoint_warned = False + +# Envs that change the bytes docling-serve produces. Any change here must +# invalidate the bundle cache. ``DOCLING_BBOX_ATTRIBUTES`` is intentionally +# NOT in this list: it only affects how the adapter writes IR meta, not the +# docling bundle, so flipping it should re-emit the sidecar (which we always +# do) without forcing a re-download. +DOCLING_TUNABLE_ENVS: tuple[str, ...] 
= ( + "DOCLING_DO_OCR", + "DOCLING_FORCE_OCR", + "DOCLING_OCR_ENGINE", + "DOCLING_OCR_PRESET", + "DOCLING_OCR_LANG", + "DOCLING_DO_FORMULA_ENRICHMENT", +) + + +def current_endpoint_signature() -> str: + """The active docling endpoint, normalized to a base URL. + + Normalization: + + - Trims surrounding whitespace and strips trailing slashes. + - Strips the legacy ``/v1/convert/file/async`` upload suffix if present, + preserving backwards compatibility with the pre-refactor ``env.example`` + that documented the full upload URL. + + Returns ``""`` if ``DOCLING_ENDPOINT`` is unset — callers that need a + real endpoint (``DoclingRawClient``) raise on empty; callers that only + compare against a recorded manifest field (``is_bundle_valid``) silently + skip the check when either side is empty. + """ + global _legacy_endpoint_warned + endpoint = os.getenv("DOCLING_ENDPOINT", "").strip().rstrip("/") + if endpoint.endswith(_LEGACY_UPLOAD_PATH_SUFFIX): + endpoint = endpoint[: -len(_LEGACY_UPLOAD_PATH_SUFFIX)] + if not _legacy_endpoint_warned: + _legacy_endpoint_warned = True + logger.warning( + "DOCLING_ENDPOINT still includes the legacy %r upload suffix; " + "stripping it. Update your .env to a base URL " + "(e.g. http://host:5001).", + _LEGACY_UPLOAD_PATH_SUFFIX, + ) + return endpoint + + +def compute_options_signature( + *, + tunable_env: dict[str, str], + fixed_constants: dict[str, object], +) -> str: + """Stable signature over user-tunable env values and fixed pipeline + constants. + + Storing the constants in the signature means a future code change that + flips e.g. ``image_export_mode`` from ``referenced`` to ``embedded`` + invalidates every existing cache without anyone having to remember to + bump a version. 
+ """ + payload = json.dumps( + {"env": tunable_env, "fixed": fixed_constants}, + ensure_ascii=False, + sort_keys=True, + separators=(",", ":"), + ) + return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest() + + +def snapshot_tunable_env() -> dict[str, str]: + """Read effective docling tunables so equivalent requests share a signature.""" + return { + "DOCLING_DO_OCR": str(env_bool("DOCLING_DO_OCR", True)).lower(), + "DOCLING_FORCE_OCR": str(env_bool("DOCLING_FORCE_OCR", True)).lower(), + "DOCLING_OCR_ENGINE": os.getenv("DOCLING_OCR_ENGINE", "auto").strip() or "auto", + "DOCLING_OCR_PRESET": os.getenv("DOCLING_OCR_PRESET", "auto").strip() or "auto", + "DOCLING_OCR_LANG": os.getenv("DOCLING_OCR_LANG", "").strip(), + "DOCLING_DO_FORMULA_ENRICHMENT": str( + env_bool("DOCLING_DO_FORMULA_ENRICHMENT", False) + ).lower(), + } + + +def is_bundle_valid(raw_dir: Path, source_file: Path) -> bool: + """Return True iff the bundle matches the current source + env state.""" + if not raw_dir.is_dir(): + return False + + manifest = load_manifest(raw_dir, expected_engine=MANIFEST_ENGINE) + if manifest is None: + return False + + # 1. Source size fast-path + try: + cur_size = source_file.stat().st_size + except OSError: + return False + if cur_size != int(manifest.source_size_bytes): + return False + + # 2. Source content_hash + _, cur_hash = compute_size_and_hash(source_file) + if cur_hash != manifest.source_content_hash: + return False + + # 3. Engine version. Skip the comparison when either side is empty so + # operators can opt out by unsetting the env, and so bundles from + # earlier code that never recorded the field aren't force-invalidated. + cur_engine_version = os.getenv("DOCLING_ENGINE_VERSION", "").strip() + if ( + cur_engine_version + and manifest.engine_version + and cur_engine_version != manifest.engine_version + ): + return False + + # 4. Endpoint signature. 
Same "both non-empty to compare" rule: a bundle + # parsed against a different docling-serve URL must not be reused, but + # we don't reject the cache just because the env happens to be unset + # at validation time (e.g. CLI tooling that only reads the cache). + cur_endpoint = current_endpoint_signature() + if ( + cur_endpoint + and manifest.endpoint_signature + and cur_endpoint != manifest.endpoint_signature + ): + return False + + # 5. Options signature: only enforced if the manifest recorded one + # (manifests written before this commit have it empty, so this check + # is skipped for them and an env change alone does not invalidate them). + # + # Compare against the *current* fixed constants from client.py, not + # the copy stashed in the manifest — using the manifest's copy would + # always reproduce the recorded signature and silently swallow + # code-only changes (e.g. flipping image_export_mode or to_formats), + # defeating the invalidation this step is supposed to provide. + # Lazy import: client.py imports from cache.py. + if manifest.options_signature: + from lightrag.external_parser.docling.client import FIXED_CONSTANTS + + cur_options = compute_options_signature( + tunable_env=snapshot_tunable_env(), + fixed_constants=FIXED_CONSTANTS, + ) + if cur_options != manifest.options_signature: + return False + + # 6. Critical file: size + sha256 + crit = manifest.critical_file + crit_path = raw_dir / crit.path + try: + if crit_path.stat().st_size != int(crit.size): + return False + except OSError: + return False + if crit.sha256: + _, crit_actual = compute_size_and_hash(crit_path) + if crit_actual != crit.sha256: + return False + + # 7. 
Other files: size only + for entry in manifest.files: + ep = raw_dir / entry.path + try: + if ep.stat().st_size != int(entry.size): + return False + except OSError: + return False + + return True + + +__all__ = [ + "DOCLING_TUNABLE_ENVS", + "compute_options_signature", + "current_endpoint_signature", + "is_bundle_valid", + "snapshot_tunable_env", +] diff --git a/lightrag/external_parser/docling/client.py b/lightrag/external_parser/docling/client.py new file mode 100644 index 0000000000..f11b04f08d --- /dev/null +++ b/lightrag/external_parser/docling/client.py @@ -0,0 +1,344 @@ +"""Docling raw bundle downloader. + +Talks to Docling Serve v1 over HTTP: + +- ``POST /v1/convert/file/async`` — multipart upload, returns ``task_id``, +- ``GET /v1/status/poll/{task_id}?wait=5`` — long-poll for terminal state, +- ``GET /v1/result/{task_id}`` — zip download (only on ``success``). + +The zip is extracted safely under ``raw_dir/`` (refusing path traversal / +absolute entries). A success manifest is written atomically at the very +end; mid-run crashes therefore leave the directory in a state the cache +layer marks as invalid (no manifest → miss → re-download). + +Pipeline constants (``pipeline``, ``target_type``, ``to_formats``, +``image_export_mode``) are intentionally **not** env-driven — the sidecar +flow depends on them — and are recorded inside the manifest so a future +code change automatically invalidates pre-existing caches. 
+""" + +from __future__ import annotations + +import asyncio +import json +import os +import time +from pathlib import Path +from typing import TYPE_CHECKING, Any + +from lightrag.external_parser._common import ( + env_bool, + env_int, + raise_for_status_with_detail, +) +from lightrag.external_parser._zip import safe_extract_zip +from lightrag.external_parser.docling.cache import ( + compute_options_signature, + current_endpoint_signature, + snapshot_tunable_env, +) +from lightrag.external_parser.docling.manifest import ( + build_and_write_docling_manifest, + select_main_json, +) +from lightrag.utils import logger + +if TYPE_CHECKING: + import httpx +else: + try: + import httpx + except ImportError: # pragma: no cover + httpx = None + +# --------------------------------------------------------------------------- +# Fixed pipeline constants (NOT env-driven) +# --------------------------------------------------------------------------- + +PIPELINE = "standard" +TARGET_TYPE = "zip" +TO_FORMATS: tuple[str, ...] = ("json", "md") +IMAGE_EXPORT_MODE = "referenced" + +FIXED_CONSTANTS: dict[str, object] = { + "pipeline": PIPELINE, + "target_type": TARGET_TYPE, + "to_formats": list(TO_FORMATS), + "image_export_mode": IMAGE_EXPORT_MODE, +} + +CONVERT_PATH = "/v1/convert/file/async" +POLL_PATH = "/v1/status/poll/{task_id}" +RESULT_PATH = "/v1/result/{task_id}" + +DEFAULT_POLL_WAIT_SECONDS = 5 +DEFAULT_MAX_POLLS = 240 # 240 * 5s long-poll ≈ 20 min worst case + +# ConversionStatus enum from the docling-serve OpenAPI +SUCCESS_STATES = {"success"} +FAILURE_STATES = {"failure", "partial_success", "skipped"} +IN_PROGRESS_STATES = {"pending", "started"} + + +class DoclingRawClient: + """Downloads docling-serve bundles into ``raw_dir``. + + Construct once per parse call (cheap). Reads ``DOCLING_*`` envs at + ``__init__`` time, so callers can flip env between calls and pick up + the new values without holding a stale instance. 
+ """ + + def __init__(self) -> None: + self.endpoint = current_endpoint_signature() + if not self.endpoint: + raise ValueError("DOCLING_ENDPOINT is required") + self.engine_version = os.getenv("DOCLING_ENGINE_VERSION", "").strip() + + self.do_ocr = env_bool("DOCLING_DO_OCR", True) + self.force_ocr = env_bool("DOCLING_FORCE_OCR", True) + self.ocr_engine = os.getenv("DOCLING_OCR_ENGINE", "auto").strip() or "auto" + self.ocr_preset = os.getenv("DOCLING_OCR_PRESET", "auto").strip() or "auto" + self.ocr_lang_raw = os.getenv("DOCLING_OCR_LANG", "").strip() + self.do_formula_enrichment = env_bool("DOCLING_DO_FORMULA_ENRICHMENT", False) + + # Poll cadence: docling-serve's ``?wait=N`` is a server-side long-poll + # window. ``DOCLING_POLL_INTERVAL_SECONDS`` sets that window; the + # client does NOT add its own sleep between polls. ``DOCLING_MAX_POLLS`` + # bounds the total polling budget — exceeding it raises ``TimeoutError``. + wait = env_int("DOCLING_POLL_INTERVAL_SECONDS", DEFAULT_POLL_WAIT_SECONDS) + self.poll_wait_seconds = wait if wait > 0 else DEFAULT_POLL_WAIT_SECONDS + max_polls = env_int("DOCLING_MAX_POLLS", DEFAULT_MAX_POLLS) + self.max_poll_attempts = max_polls if max_polls > 0 else DEFAULT_MAX_POLLS + + # ------------------------------------------------------------------ + # Public API + # ------------------------------------------------------------------ + + async def download_into( + self, + raw_dir: Path, + source_file_path: Path, + *, + upload_filename: str | None = None, + ): + """Upload, poll, download, extract, and write the manifest. + + ``upload_filename`` overrides the multipart filename sent to + docling-serve (defaults to ``source_file_path.name``). 
The pipeline + passes the canonical, hint-stripped document name here so the + bundle's ``.json`` ends up canonical too — otherwise a file + named ``report.[docling].pdf`` would produce ``report.[docling].json`` + inside the bundle, and the adapter (which only knows the canonical + ``report.pdf``) would not be able to locate it via the preferred + ``.json`` lookup. + + Pre-condition: caller cleared ``raw_dir`` (e.g. via + :func:`lightrag.external_parser.clear_dir_contents`). This method + does not clean the directory itself — keeping that explicit at the + ``parse_docling`` entry point. + """ + if httpx is None: + raise RuntimeError( + "httpx is required for Docling parsing but is not installed" + ) + raw_dir.mkdir(parents=True, exist_ok=True) + + effective_filename = upload_filename or source_file_path.name + + timeout = httpx.Timeout(120.0, connect=30.0) + async with httpx.AsyncClient(timeout=timeout) as client: + task_id = await self._submit( + client, source_file_path, filename=effective_filename + ) + await self._poll_until_done(client, task_id) + payload = await self._download_zip_bytes(client, task_id) + + safe_extract_zip(payload, raw_dir) + # Defensive: confirm the main JSON exists before anyone reads the + # bundle. Look it up by the *uploaded* filename's stem — that's + # what docling-serve uses to name the JSON inside the zip. 
+ select_main_json(raw_dir, Path(effective_filename)) + + options_signature = compute_options_signature( + tunable_env=snapshot_tunable_env(), + fixed_constants=FIXED_CONSTANTS, + ) + return build_and_write_docling_manifest( + raw_dir, + source_file_path=source_file_path, + task_id=task_id, + endpoint_signature=self.endpoint, + engine_version=self.engine_version, + options_signature=options_signature, + fixed_constants=FIXED_CONSTANTS, + recorded_filename=effective_filename, + ) + + # ------------------------------------------------------------------ + # Upload + poll + download + # ------------------------------------------------------------------ + + def _build_multipart_data(self) -> dict[str, str | list[str]]: + """Form fields (everything except the file payload). + + Returns a ``dict`` (not a list of tuples): httpx ≥ 0.28 short-circuits + non-``Mapping`` ``data`` into raw-content encoding and ignores + ``files=`` entirely, producing a sync-only stream that an + ``AsyncClient`` then rejects. List-valued entries are emitted as + repeated form keys by ``MultipartStream``, matching docling-serve's + pydantic ``List[Enum]`` form parsing. ``ocr_lang`` is omitted entirely + when empty so the engine uses its own default. 
+ """ + data: dict[str, str | list[str]] = { + "pipeline": PIPELINE, + "target_type": TARGET_TYPE, + "image_export_mode": IMAGE_EXPORT_MODE, + "do_ocr": _bool_form(self.do_ocr), + "force_ocr": _bool_form(self.force_ocr), + "ocr_engine": self.ocr_engine, + "ocr_preset": self.ocr_preset, + "do_formula_enrichment": _bool_form(self.do_formula_enrichment), + "to_formats": list(TO_FORMATS), + } + if self.ocr_lang_raw: + langs = _parse_ocr_lang(self.ocr_lang_raw) + if langs: + data["ocr_lang"] = langs + return data + + async def _submit( + self, + client: "httpx.AsyncClient", + source_file_path: Path, + *, + filename: str, + ) -> str: + url = f"{self.endpoint}{CONVERT_PATH}" + # Hand httpx a file object so its MultipartStream reads the body in + # chunks instead of materializing the whole PDF/PPTX in worker memory. + # With ``max_parallel_parse_docling > 1`` a per-doc bytes copy can + # OOM the worker before docling-serve ever sees the request. + with source_file_path.open("rb") as fh: + files = {"files": (filename, fh, "application/octet-stream")} + resp = await client.post( + url, data=self._build_multipart_data(), files=files + ) + raise_for_status_with_detail(resp, f"Docling upload for {filename!r}") + payload = resp.json() if resp.text else {} + task_id = str(payload.get("task_id") or payload.get("id") or "").strip() + if not task_id: + raise RuntimeError(f"Docling upload response missing task_id: {payload!r}") + return task_id + + async def _poll_until_done( + self, + client: "httpx.AsyncClient", + task_id: str, + ) -> None: + url = f"{self.endpoint}{POLL_PATH.format(task_id=task_id)}" + params = {"wait": self.poll_wait_seconds} + for _ in range(self.max_poll_attempts): + iteration_started = time.monotonic() + resp = await client.get(url, params=params) + raise_for_status_with_detail(resp, f"Docling task {task_id} poll") + payload = resp.json() if resp.text else {} + status = str( + payload.get("task_status") or payload.get("status") or "" + ).lower() + + if status 
in SUCCESS_STATES: + return + if status in FAILURE_STATES: + raise RuntimeError(_format_failure(task_id, status, payload)) + if status not in IN_PROGRESS_STATES: + # Unknown status: keep polling, but surface it so operators notice. + logger.warning( + "[docling] unknown task status %r for task %s; continuing to poll", + status, + task_id, + ) + + # The intended cadence is one poll per ``poll_wait_seconds`` — the + # design relies on docling-serve's ``?wait=N`` long-polling for + # that. Some deployments return immediately instead, which would + # burn through ``max_poll_attempts`` in milliseconds and fail + # with a spurious timeout. Cap each iteration at the configured + # interval ourselves so the total budget holds either way. + elapsed = time.monotonic() - iteration_started + remaining = self.poll_wait_seconds - elapsed + if remaining > 0: + await asyncio.sleep(remaining) + + raise TimeoutError(f"Docling task {task_id} polling timeout") + + async def _download_zip_bytes( + self, + client: "httpx.AsyncClient", + task_id: str, + ) -> bytes: + url = f"{self.endpoint}{RESULT_PATH.format(task_id=task_id)}" + resp = await client.get(url) + raise_for_status_with_detail(resp, f"Docling result {task_id} download") + ctype = resp.headers.get("content-type", "") + if "zip" not in ctype.lower(): + raise RuntimeError( + f"Docling result {task_id} returned non-zip content-type " + f"{ctype!r}; body prefix={resp.text[:400]!r}" + ) + return resp.content + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + + +def _bool_form(v: bool) -> str: + return "true" if v else "false" + + +def _parse_ocr_lang(raw: str) -> list[str]: + """Best-effort parser for ``DOCLING_OCR_LANG``. + + Accepts a JSON array (``["en","zh"]``) or a comma-separated list + (``en,zh``). Returns a list of stripped non-empty strings; empty in → + empty out. 
+ """ + try: + parsed = json.loads(raw) + except json.JSONDecodeError: + parsed = None + if isinstance(parsed, list): + return [str(x).strip() for x in parsed if str(x).strip()] + return [item.strip() for item in raw.split(",") if item.strip()] + + +def _format_failure(task_id: str, status: str, payload: Any) -> str: + if isinstance(payload, dict): + err = ( + payload.get("error_message") + or payload.get("error") + or payload.get("message") + or "" + ) + else: + err = "" + truncated = json.dumps(payload, ensure_ascii=False)[:400] + return f"Docling task {task_id} ended in {status}: {err}; payload={truncated}" + + +__all__ = [ + "DoclingRawClient", + "CONVERT_PATH", + "DEFAULT_MAX_POLLS", + "DEFAULT_POLL_WAIT_SECONDS", + "FIXED_CONSTANTS", + "IMAGE_EXPORT_MODE", + "PIPELINE", + "POLL_PATH", + "RESULT_PATH", + "SUCCESS_STATES", + "FAILURE_STATES", + "TARGET_TYPE", + "TO_FORMATS", +] diff --git a/lightrag/external_parser/docling/ir_builder.py b/lightrag/external_parser/docling/ir_builder.py new file mode 100644 index 0000000000..7e1c4428f0 --- /dev/null +++ b/lightrag/external_parser/docling/ir_builder.py @@ -0,0 +1,1085 @@ +"""Docling IR builder: ``DoclingDocument`` JSON → :class:`IRDoc`. + +Input contract: a ``*.docling_raw/`` directory containing a ``.json`` +produced by docling-serve with ``to_formats=[json,md]`` + +``image_export_mode=referenced``. Companion ``.md`` and +``artifacts/`` are not read by the builder (markdown stays for human +inspection; image bytes are referenced by relative URI). + +Conversion rules (informed by +``docs/DoclingSidecarRefactorPlan-zh.md`` §5): + +- **Faithful** mapping. We do NOT correct heading levels from numbering, + do NOT bind orphan ``caption`` / ``footnote`` text to neighbouring + tables/pictures via proximity, do NOT merge continuation tables, do NOT + invent captions or refer to inline neighbours. If docling didn't make + the link, the sidecar doesn't make it either. 
+- ``content_layer != "body"`` is filtered everywhere (top-level traversal, + group expansion, picture children). Furniture / background never leaks + into blocks, positions, or consumed_refs. +- ``texts[*].label="title"`` → heading level 1; ``"section_header"`` → + Docling ``level + 1`` (default 2 when level missing). +- ``texts[*].label="caption"|"footnote"`` are dropped from the reading + stream **iff** their ref is referenced by a table/picture (via + ``captions`` / ``footnotes`` refs, or as a direct ``children`` ref + whose target is itself a caption/footnote). Otherwise they remain as + regular text in the reading flow. +- ``pictures[*]`` without a usable image reference are skipped instead of + emitting empty-path drawings. ``pictures[*].children`` references that + are NOT caption/footnote are treated as inner-OCR text and excluded from + the reading stream only for pictures that are emitted. +- ``IRPosition`` writes ``origin="LEFTTOP"`` only when the source + ``prov.bbox.coord_origin == "TOPLEFT"``. ``BOTTOMLEFT`` inherits the + doc-level meta (``{"origin":"LEFTBOTTOM"}`` by default). Coordinates + are written verbatim — never flipped. +- ``DOCLING_BBOX_ATTRIBUTES`` env (JSON) can override the doc-level + ``bbox_attributes``, mirroring MinerU's behaviour. +- Equations: ``texts[k].label == "formula"`` is treated as a structural + formula signal whenever text/orig/content is non-empty. Top-level formulas + become block equations; formulas inside inline groups become inline + equations. 
+""" + +from __future__ import annotations + +import base64 +import json +import os +import re +from pathlib import Path +from typing import Any + +from lightrag.external_parser._common import env_json +from lightrag.external_parser.docling.manifest import select_main_json +from lightrag.sidecar.ir import ( + AssetSpec, + IRBlock, + IRDoc, + IRDrawing, + IREquation, + IRPosition, + IRTable, +) +from lightrag.utils import logger + + +PREFACE_HEADING = "Preface/Uncategorized" + +# Docling JSON Pointer ``#/texts/3``, ``#/tables/2``, ``#/pictures/0``, +# ``#/groups/5``, or ``#/body``. +_REF_PATTERN = re.compile(r"^#/(?P<kind>[a-z_]+)(?:/(?P\d+))?$") + + +class DoclingIRBuilder: + """Stateless except for env-driven config. Reusable across calls.""" + + def __init__(self) -> None: + self.engine_version = os.getenv("DOCLING_ENGINE_VERSION", "").strip() + self.bbox_attributes = self._load_bbox_attributes_env() + + @staticmethod + def _load_bbox_attributes_env() -> dict[str, Any]: + default = {"origin": "LEFTBOTTOM"} + parsed = env_json("DOCLING_BBOX_ATTRIBUTES", default) + if not isinstance(parsed, dict): + logger.warning( + "[docling_ir_builder] DOCLING_BBOX_ATTRIBUTES must decode to an object; " + "falling back to %s", + default, + ) + return dict(default) + return parsed + + # ------------------------------------------------------------------ + # Entry point + # ------------------------------------------------------------------ + + def normalize_from_workdir( + self, + raw_dir: Path, + *, + document_name: str, + ) -> IRDoc: + main_json = select_main_json(raw_dir, Path(document_name)) + try: + doc = json.loads(main_json.read_text(encoding="utf-8")) + except json.JSONDecodeError as exc: + raise ValueError( + f"Docling raw JSON malformed at {main_json}: {exc}" + ) from exc + if not isinstance(doc, dict): + raise ValueError(f"Docling raw JSON is not an object at {main_json}") + return self._normalize(doc, raw_dir, document_name=document_name) + + # 
------------------------------------------------------------------ + # Core traversal + # ------------------------------------------------------------------ + + def _normalize( + self, + doc: dict, + raw_dir: Path, + *, + document_name: str, + ) -> IRDoc: + document_format = Path(document_name).suffix.lower().lstrip(".") + ref_index = _build_ref_index(doc) + consumed_refs, picture_inner_refs = _precompute_consumed_refs(doc, raw_dir) + + blocks: list[IRBlock] = [] + assets: list[AssetSpec] = [] + seen_asset_refs: dict[str, str] = {} + doc_title = "" + placeholder_counter = 0 + + def _next_key(prefix: str) -> str: + nonlocal placeholder_counter + placeholder_counter += 1 + return f"{prefix}{placeholder_counter}" + + # Heading stack + current block accumulator — identical structure + # to MinerUIRBuilder so downstream P-chunking and provenance behave + # the same way regardless of engine. + heading_stack: list[str] = [] + cb_lines: list[str] = [] + cb_tables: list[IRTable] = [] + cb_drawings: list[IRDrawing] = [] + cb_equations: list[IREquation] = [] + cb_page_set: set[str] = set() + cb_bbox_positions: list[IRPosition] = [] + cb_heading = PREFACE_HEADING + cb_level = 0 + cb_parents: list[str] = [] + cb_has_body = False + + visited: set[str] = set() + kv_count = len(doc.get("key_value_items") or []) + form_count = len(doc.get("form_items") or []) + + # --- closures over the accumulator ----------------------------- + + def _flush_block() -> None: + nonlocal cb_lines, cb_tables, cb_drawings, cb_equations + nonlocal cb_page_set, cb_bbox_positions, cb_has_body + has_payload = bool(cb_lines or cb_tables or cb_drawings or cb_equations) + if not has_payload: + return + content = "\n".join(line for line in cb_lines if line) + if not content.strip() and not (cb_tables or cb_drawings or cb_equations): + cb_lines = [] + cb_page_set = set() + cb_bbox_positions = [] + cb_has_body = False + return + positions = [ + IRPosition(type="bbox", anchor=p) + for p in 
_sort_page_anchors(cb_page_set) + ] + list(cb_bbox_positions) + blocks.append( + IRBlock( + content_template=content, + heading=cb_heading, + level=cb_level, + parent_headings=list(cb_parents), + positions=positions, + tables=list(cb_tables), + drawings=list(cb_drawings), + equations=list(cb_equations), + ) + ) + cb_lines = [] + cb_tables = [] + cb_drawings = [] + cb_equations = [] + cb_page_set = set() + cb_bbox_positions = [] + cb_has_body = False + + def _open_block(heading: str, level: int, parents: list[str]) -> None: + nonlocal cb_heading, cb_level, cb_parents + cb_heading = heading + cb_level = level + cb_parents = parents + md_prefix = "#" * max(level, 1) + cb_lines.append(f"{md_prefix} {heading}") + + def _merge_heading_as_body(heading: str, level: int) -> None: + md_prefix = "#" * max(level, 1) + cb_lines.append(f"{md_prefix} {heading}") + + def _append_text(text: str) -> bool: + nonlocal cb_has_body + if not text: + return False + cb_lines.append(text) + cb_has_body = True + return True + + def _record_positions(item: dict) -> None: + for prov in item.get("prov") or []: + if not isinstance(prov, dict): + continue + bbox = prov.get("bbox") or {} + page_raw = prov.get("page_no") + charspan = prov.get("charspan") + if isinstance(bbox, dict) and all( + k in bbox for k in ("l", "t", "r", "b") + ): + coord_origin = str(bbox.get("coord_origin") or "").upper() + origin_override: str | None = None + if coord_origin == "TOPLEFT": + origin_override = "LEFTTOP" + elif coord_origin == "BOTTOMLEFT": + origin_override = None + elif coord_origin: + logger.warning( + "[docling_ir_builder] unknown coord_origin %r; " + "writing through as override", + coord_origin, + ) + origin_override = coord_origin + anchor = str(page_raw) if page_raw is not None else None + range_ = [ + bbox["l"], + bbox["t"], + bbox["r"], + bbox["b"], + ] + cb_bbox_positions.append( + IRPosition( + type="bbox", + anchor=anchor, + range=range_, + charspan=( + list(charspan) if isinstance(charspan, 
list) else None + ), + origin=origin_override, + ) + ) + elif page_raw is not None: + cb_page_set.add(str(page_raw)) + + # --- main traversal ------------------------------------------- + + def _visit_ref(ref: str) -> None: + if not ref or ref in consumed_refs or ref in visited: + return + visited.add(ref) + item = ref_index.get(ref) + if item is None: + return + if _content_layer(item) != "body": + return + kind = _ref_kind(ref) + + if kind == "groups": + _visit_group(item) + return + if kind == "texts": + _handle_text(item) + return + if kind == "tables": + _handle_table(item) + return + if kind == "pictures": + _handle_picture(item) + return + # Unknown kind — log and ignore; falling through silently would + # hide schema drift in future docling releases. + logger.warning( + "[docling_ir_builder] unknown ref kind %r (ref=%r); skipping", kind, ref + ) + + def _visit_group(group: dict) -> None: + label = str(group.get("label") or "").lower() + if label not in { + "list", + "inline", + "picture_area", + "section", + "form_area", + "key_value_area", + "ordered_list", + "unordered_list", + "chapter", + }: + logger.warning( + "[docling_ir_builder] unrecognized group label %r; " + "expanding children as default reading order", + label, + ) + if label == "inline": + _handle_inline_group(group) + return + _visit_children(group) + + def _visit_children(item: dict) -> None: + for child_ref in item.get("children") or []: + ref = _ref_str(child_ref) + _visit_ref(ref) + + def _handle_inline_group(group: dict) -> None: + """``inline`` groups concatenate text and inline formulas on one line.""" + buf: list[str] = [] + pages_recorded = False + for child_ref in group.get("children") or []: + ref = _ref_str(child_ref) + if ref in consumed_refs: + continue + child = ref_index.get(ref) + if not isinstance(child, dict): + continue + if _content_layer(child) != "body": + continue + if _ref_kind(ref) != "texts": + continue + visited.add(ref) + label = str(child.get("label") or 
"").lower() + piece = ( + _make_equation_placeholder(child, is_block=False) + if label == "formula" + else _text_of(child) + ) + if piece: + buf.append(piece) + if not pages_recorded: + _record_positions(child) + pages_recorded = True + line = " ".join(buf).strip() + if line: + _append_text(line) + + def _handle_text(item: dict) -> None: + nonlocal doc_title, heading_stack, cb_has_body + label = str(item.get("label") or "").lower() + text = _text_of(item).strip() + + # Heading? + heading_level = _docling_heading_level(label, item) + if heading_level > 0 and text: + heading_stack = heading_stack[: max(heading_level - 1, 0)] + parents = [h for h in heading_stack if h] + heading_stack.append(text) + # Adjacency merge + if cb_level > 0 and not cb_has_body and heading_level > cb_level: + _merge_heading_as_body(text, heading_level) + _record_positions(item) + if not doc_title and heading_level == 1: + doc_title = text + _visit_children(item) + return + _flush_block() + _open_block(text, heading_level, parents) + _record_positions(item) + if not doc_title and heading_level == 1: + doc_title = text + _visit_children(item) + return + + # Formula — Docling's label is the structural signal. For DOCX, + # valid LaTeX may have text == orig, so do not use that equality + # as an enrichment-off heuristic. + if label == "formula": + _handle_formula(item) + _visit_children(item) + return + + # list_item: keep the marker if Docling captured one + if label == "list_item": + marker = str(item.get("marker") or "").strip() + line = f"{marker} {text}".strip() if marker else text + if line and _append_text(line): + _record_positions(item) + _visit_children(item) + return + + # Caption/footnote not consumed by any table/picture → keep in + # reading flow as ordinary text (preserves original prefixes). 
+ if label in {"caption", "footnote", "text", "code"}: + if _append_text(text): + _record_positions(item) + _visit_children(item) + return + + # page_header / page_footer should have been filtered by + # content_layer; reach here only if someone misuses the label. + if label in {"page_header", "page_footer"}: + return + + # Unknown label: fall back to writing the text and warn once. + if text: + logger.warning( + "[docling_ir_builder] unknown text label %r; treating as body", + label, + ) + if _append_text(text): + _record_positions(item) + _visit_children(item) + + def _handle_formula(item: dict) -> None: + placeholder = _make_equation_placeholder(item, is_block=True) + if not placeholder: + return + cb_lines.append(placeholder) + _bump_has_body() + _record_positions(item) + + def _make_equation_placeholder(item: dict, *, is_block: bool) -> str: + latex_raw = _text_of(item).strip() + if not latex_raw: + return "" + placeholder = _next_key("eq") + token = "EQ" if is_block else "EQI" + latex = f"$$ {latex_raw} $$" if is_block else latex_raw + cb_equations.append( + IREquation( + placeholder_key=placeholder, + latex=latex, + is_block=is_block, + self_ref=str(item.get("self_ref") or "") if is_block else "", + ) + ) + return f"{{{{{token}:{placeholder}}}}}" + + def _bump_has_body() -> None: + nonlocal cb_has_body + cb_has_body = True + + def _handle_table(item: dict) -> None: + table = _build_ir_table(item, ref_index) + if table is None: + # Empty body — _build_ir_table already logged the drop. + # Skip placeholder allocation and position recording so the + # body-less table item leaves no trace in the IR. 
+ return + placeholder = _next_key("tb") + table.placeholder_key = placeholder + cb_tables.append(table) + cb_lines.append(f"{{{{TBL:{placeholder}}}}}") + _bump_has_body() + _record_positions(item) + + def _handle_picture(item: dict) -> None: + built = _build_ir_drawing( + item, + ref_index=ref_index, + picture_inner_refs=picture_inner_refs, + raw_dir=raw_dir, + seen_asset_refs=seen_asset_refs, + ) + if built is None: + return + drawing, asset = built + placeholder = _next_key("im") + drawing.placeholder_key = placeholder + if asset is not None and asset.ref not in {a.ref for a in assets}: + assets.append(asset) + cb_drawings.append(drawing) + cb_lines.append(f"{{{{IMG:{placeholder}}}}}") + _bump_has_body() + _record_positions(item) + + # Kick off traversal from body.children + body = doc.get("body") or {} + for child_ref in body.get("children") or []: + _visit_ref(_ref_str(child_ref)) + + _flush_block() + + if not doc_title: + doc_title = Path(document_name).stem or document_name + + split_option: dict[str, Any] = {} + if self.engine_version: + split_option["engine_version"] = self.engine_version + docling_extras: dict[str, Any] = {} + if kv_count: + docling_extras["key_value_items"] = kv_count + if form_count: + docling_extras["form_items"] = form_count + if docling_extras: + split_option["docling_extras"] = docling_extras + + return IRDoc( + document_name=document_name, + document_format=document_format, + doc_title=doc_title, + split_option=split_option, + blocks=blocks, + assets=assets, + bbox_attributes=dict(self.bbox_attributes), + ) + + +# --------------------------------------------------------------------------- +# Module-level helpers +# --------------------------------------------------------------------------- + + +def _ref_str(node: Any) -> str: + """Normalize a Docling reference (``{"$ref": "#/texts/0"}`` or a bare + string) to its string form. 
Returns ``""`` on garbage input.""" + if isinstance(node, str): + return node + if isinstance(node, dict): + v = node.get("$ref") or node.get("ref") + if isinstance(v, str): + return v + return "" + + +def _ref_kind(ref: str) -> str: + m = _REF_PATTERN.match(ref) + return m.group("kind") if m else "" + + +def _build_ref_index(doc: dict) -> dict[str, dict]: + """Map every JSON-pointer-style ref to its target object. + + Builds entries for ``#/body``, ``#/texts/N``, ``#/tables/N``, + ``#/pictures/N``, ``#/groups/N``. The body object is *not* a + typical content item but we index it so callers don't need a + special case when chasing arbitrary refs. + """ + index: dict[str, dict] = {} + body = doc.get("body") + if isinstance(body, dict): + index["#/body"] = body + for key, prefix in ( + ("texts", "#/texts/"), + ("tables", "#/tables/"), + ("pictures", "#/pictures/"), + ("groups", "#/groups/"), + ): + items = doc.get(key) + if not isinstance(items, list): + continue + for i, obj in enumerate(items): + if isinstance(obj, dict): + index[f"{prefix}{i}"] = obj + return index + + +def _precompute_consumed_refs(doc: dict, raw_dir: Path) -> tuple[set[str], set[str]]: + """Return ``(consumed_refs, picture_inner_refs)``. + + ``consumed_refs`` enumerates text refs that must NOT enter the reading + stream. 
The rules below apply only when the owning table/picture is + itself in the body content layer — refs harvested from furniture or + background items are ignored so they do not block legitimate body text + that might be reachable through ``body.children``: + + - body ``tables[*].captions`` and ``tables[*].footnotes`` + - body ``pictures[*].captions`` and ``pictures[*].footnotes`` only when + the picture has a usable image reference and will be emitted + - body ``tables[*].children`` / ``pictures[*].children`` that resolve + to ``texts[*]`` with ``label="caption"`` or ``"footnote"`` + - All body ``pictures[*].children`` that are non-caption/footnote texts + (the picture's inner OCR text). These also land in + ``picture_inner_refs`` so the builder can attribute them to the + drawing's extras. + + Sibling text nodes are NOT touched: only refs explicitly linked from a + table/picture object qualify. + """ + consumed: set[str] = set() + picture_inner: set[str] = set() + + text_label_index: dict[str, str] = {} + for i, obj in enumerate(doc.get("texts") or []): + if isinstance(obj, dict): + text_label_index[f"#/texts/{i}"] = str(obj.get("label") or "").lower() + + # Furniture/background tables/pictures must not consume refs that may + # appear under body.children — the builder contract is that non-body + # items are filtered everywhere, including their outgoing refs. 
+ for table in doc.get("tables") or []: + if not isinstance(table, dict): + continue + if _content_layer(table) != "body": + continue + for ref in _iter_refs(table.get("captions")): + consumed.add(ref) + for ref in _iter_refs(table.get("footnotes")): + consumed.add(ref) + for ref in _iter_refs(table.get("children")): + label = text_label_index.get(ref) + if label in {"caption", "footnote"}: + consumed.add(ref) + + for pic in doc.get("pictures") or []: + if not isinstance(pic, dict): + continue + if _content_layer(pic) != "body": + continue + if not _has_usable_picture_image(pic, raw_dir): + continue + for ref in _iter_refs(pic.get("captions")): + consumed.add(ref) + for ref in _iter_refs(pic.get("footnotes")): + consumed.add(ref) + for ref in _iter_refs(pic.get("children")): + label = text_label_index.get(ref) + if label in {"caption", "footnote"}: + consumed.add(ref) + elif ref.startswith("#/texts/"): + consumed.add(ref) + picture_inner.add(ref) + + return consumed, picture_inner + + +def _iter_refs(value: Any): + """Yield refs from either a list of ref dicts/strings, or a single one.""" + if value is None: + return + if isinstance(value, list): + for item in value: + ref = _ref_str(item) + if ref: + yield ref + else: + ref = _ref_str(value) + if ref: + yield ref + + +def _content_layer(item: dict) -> str: + return str(item.get("content_layer") or "body").lower() + + +def _text_of(item: dict) -> str: + for key in ("text", "orig", "content"): + v = item.get(key) + if isinstance(v, str) and v.strip(): + return v + return "" + + +def _docling_heading_level(label: str, item: dict) -> int: + """Map a Docling text item to its IR heading level. + + - ``title`` → level 1 + - ``section_header`` → ``item.level + 1`` (fallback 2) + Returns 0 when the item is not a heading. 
+ """ + if label == "title": + return 1 + if label == "section_header": + raw = item.get("level") + try: + level = int(raw) + except (TypeError, ValueError): + level = 0 + if level <= 0: + return 2 + return level + 1 + return 0 + + +def _resolve_text_refs(refs: Any, ref_index: dict[str, dict]) -> list[str]: + """Resolve a list of ``$ref`` entries to their text bodies. + + Skips targets whose ``content_layer`` is not ``"body"``. The builder + contract (see module docstring) is that furniture/background items + never leak into sidecar metadata — even when a body table or picture + explicitly references them, because such refs are typically the + consequence of a page-header/footer being mislabeled as a caption. + """ + out: list[str] = [] + for ref in _iter_refs(refs): + target = ref_index.get(ref) + if not isinstance(target, dict): + continue + if _content_layer(target) != "body": + continue + txt = _text_of(target).strip() + if txt: + out.append(txt) + return out + + +def _build_ir_table( + item: dict, + ref_index: dict[str, dict], +) -> IRTable | None: + data = item.get("data") or {} + grid = data.get("grid") if isinstance(data, dict) else None + rows = _rows_from_grid(grid) + if not rows and isinstance(data, dict) and data.get("table_cells"): + rows = _rows_from_table_cells(data) + + # Docling never populates IRTable.html, so a table without visible row + # content would land in the sidecar as ``content=""`` and trip the + # analyze worker's "missing table content" path (mirrors the MinerU + # filter in lightrag/external_parser/mineru/ir_builder.py). Drop the + # item up here so the IR stays clean. 
+ if not _table_rows_have_content(rows): + logger.info( + "[docling_ir_builder] dropping empty table item " + "(self_ref=%s, num_rows=%s, num_cols=%s)", + item.get("self_ref"), + data.get("num_rows") if isinstance(data, dict) else None, + data.get("num_cols") if isinstance(data, dict) else None, + ) + return None + + num_rows = ( + int(data.get("num_rows") or len(rows) or 0) + if isinstance(data, dict) + else len(rows) + ) + num_cols = int( + (data.get("num_cols") if isinstance(data, dict) else 0) + or (max((len(r) for r in rows), default=0)) + ) + + table_header = _extract_table_header(grid) + + captions = _resolve_text_refs(item.get("captions"), ref_index) + if not captions: + # Fallback: direct children with label="caption" + captions = _resolve_children_with_label( + item.get("children"), ref_index, "caption" + ) + footnotes = _resolve_text_refs(item.get("footnotes"), ref_index) + if not footnotes: + footnotes = _resolve_children_with_label( + item.get("children"), ref_index, "footnote" + ) + + return IRTable( + placeholder_key="", + rows=rows or None, + html=None, + num_rows=num_rows, + num_cols=num_cols, + caption=" / ".join(captions), + footnotes=footnotes, + table_header=table_header, + self_ref=str(item.get("self_ref") or ""), + ) + + +def _table_rows_have_content(rows: list[list[str]]) -> bool: + """True iff at least one cell carries visible text.""" + for row in rows: + for cell in row: + if isinstance(cell, str) and cell.strip(): + return True + return False + + +def _rows_from_grid(grid: Any) -> list[list[str]]: + out: list[list[str]] = [] + if not isinstance(grid, list): + return out + for row in grid: + if not isinstance(row, list): + continue + out.append( + [str((c or {}).get("text", "") if isinstance(c, dict) else c) for c in row] + ) + return out + + +def _rows_from_table_cells(data: dict) -> list[list[str]]: + num_rows = int(data.get("num_rows") or 0) + num_cols = int(data.get("num_cols") or 0) + cells = data.get("table_cells") or [] + if 
num_rows <= 0 or num_cols <= 0 or not isinstance(cells, list): + return [] + grid = [[""] * num_cols for _ in range(num_rows)] + for cell in cells: + if not isinstance(cell, dict): + continue + text = str(cell.get("text") or "") + rs = int(cell.get("start_row_offset_idx") or 0) + re_ = int(cell.get("end_row_offset_idx") or rs + 1) + cs = int(cell.get("start_col_offset_idx") or 0) + ce_ = int(cell.get("end_col_offset_idx") or cs + 1) + for r in range(max(rs, 0), min(re_, num_rows)): + for c in range(max(cs, 0), min(ce_, num_cols)): + grid[r][c] = text + return grid + + +def _extract_table_header(grid: Any) -> list[list[str]] | None: + """Return the contiguous top rows where every cell has + ``column_header=True`` and ``start_row_offset_idx==0`` (the spec calls + out both conditions to defeat false positives from spanning cells). + """ + if not isinstance(grid, list): + return None + header_rows: list[list[str]] = [] + for row in grid: + if not isinstance(row, list): + break + if ( + all( + isinstance(c, dict) + and bool(c.get("column_header")) + and int(c.get("start_row_offset_idx") or 0) == 0 + for c in row + ) + and row + ): + header_rows.append([str((c or {}).get("text", "")) for c in row]) + else: + break + return header_rows or None + + +def _resolve_children_with_label( + children: Any, ref_index: dict[str, dict], expected_label: str +) -> list[str]: + out: list[str] = [] + for ref in _iter_refs(children): + target = ref_index.get(ref) + if not isinstance(target, dict): + continue + # Same body-only filter as _resolve_text_refs; see its docstring. 
+ if _content_layer(target) != "body": + continue + if str(target.get("label") or "").lower() != expected_label: + continue + txt = _text_of(target).strip() + if txt: + out.append(txt) + return out + + +def _resolve_picture_ocr_paragraphs( + children: Any, ref_index: dict[str, dict], picture_inner_refs: set[str] +) -> list[str]: + """Resolve picture OCR child refs into non-empty body-layer paragraphs.""" + paragraphs: list[str] = [] + for ref in _iter_refs(children): + if ref not in picture_inner_refs: + continue + target = ref_index.get(ref) + if not isinstance(target, dict): + continue + if _content_layer(target) != "body": + continue + txt = _text_of(target).strip() + if txt: + paragraphs.append(txt) + return paragraphs + + +def _build_ir_drawing( + item: dict, + *, + ref_index: dict[str, dict], + picture_inner_refs: set[str], + raw_dir: Path, + seen_asset_refs: dict[str, str], +) -> tuple[IRDrawing, AssetSpec | None] | None: + image = item.get("image") or {} + uri = "" + mimetype = "" + image_size: tuple[float, float] | None = None + dpi: Any = None + if isinstance(image, dict): + uri = str(image.get("uri") or "") + mimetype = str(image.get("mimetype") or "") + size = image.get("size") or {} + if isinstance(size, dict) and "width" in size and "height" in size: + image_size = (float(size["width"]), float(size["height"])) + dpi = image.get("dpi") + + fmt = _image_fmt_from_mimetype(mimetype) or ( + Path(uri).suffix.lstrip(".").lower() if uri else "" + ) + + captions = _resolve_text_refs(item.get("captions"), ref_index) + if not captions: + captions = _resolve_children_with_label( + item.get("children"), ref_index, "caption" + ) + footnotes = _resolve_text_refs(item.get("footnotes"), ref_index) + if not footnotes: + footnotes = _resolve_children_with_label( + item.get("children"), ref_index, "footnote" + ) + + extras: dict[str, Any] = {} + if image_size is not None: + extras["intrinsic_size"] = list(image_size) + if dpi is not None: + extras["dpi"] = dpi + if 
mimetype: + extras["mimetype"] = mimetype + if "parent" in item: + extras["parent"] = item.get("parent") + ocr_paragraphs = _resolve_picture_ocr_paragraphs( + item.get("children"), ref_index, picture_inner_refs + ) + if ocr_paragraphs: + extras["ocr_texts"] = "\n\n".join(ocr_paragraphs) + extras["ocr_texts_count"] = len(ocr_paragraphs) + if item.get("annotations"): + extras["annotations"] = item.get("annotations") + if item.get("references"): + extras["references"] = item.get("references") + + asset_ref = "" + asset: AssetSpec | None = None + path_override: str | None = None + drawing_kwargs: dict[str, Any] = {} + + if not uri: + return None + if uri.startswith("data:"): + decoded = _decode_data_uri(uri) + if decoded is not None: + payload, ext = decoded + stem = ( + (item.get("self_ref") or "picture").replace("#/", "").replace("/", "_") + ) + suggested = f"{stem}.{ext or fmt or 'bin'}" + asset_ref = uri # use the data URI as a stable ref + if asset_ref not in seen_asset_refs: + asset = AssetSpec( + ref=asset_ref, + suggested_name=suggested, + source=payload, + ) + seen_asset_refs[asset_ref] = suggested + else: + logger.warning( + "[docling_ir_builder] skipping picture %s because data URI could " + "not be decoded", + item.get("self_ref") or "", + ) + return None + elif uri.startswith(("http://", "https://")): + path_override = uri + asset_ref = uri + else: + asset_ref = uri + if asset_ref not in seen_asset_refs: + # A malicious/corrupted bundle JSON could point at "../../etc/..." + # or an absolute path; the zip extractor's traversal guard only + # covers member names, not refs embedded in JSON metadata. Resolve + # against raw_dir and require the result to stay inside. 
+ source_path = _resolve_local_image_path(raw_dir, uri) + suggested = Path(uri).name or f"image_{len(seen_asset_refs):06d}" + asset = AssetSpec( + ref=asset_ref, + suggested_name=suggested, + source=source_path if source_path is not None else None, + ) + if source_path is None: + logger.warning( + "[docling_ir_builder] skipping picture %s because image URI " + "%r could not be resolved inside %s", + item.get("self_ref") or "", + uri, + raw_dir, + ) + return None + seen_asset_refs[asset_ref] = suggested + + if path_override is not None: + drawing_kwargs["path_override"] = path_override + + drawing = IRDrawing( + placeholder_key="", + asset_ref=asset_ref, + fmt=fmt, + caption=" / ".join(captions), + footnotes=footnotes, + src=str(item.get("src") or ""), + self_ref=str(item.get("self_ref") or ""), + extras=extras, + **drawing_kwargs, + ) + return drawing, asset + + +def _image_uri_of(item: dict) -> str: + image = item.get("image") + if not isinstance(image, dict): + return "" + return str(image.get("uri") or "") + + +def _has_usable_picture_image(item: dict, raw_dir: Path) -> bool: + uri = _image_uri_of(item) + if not uri: + return False + if uri.startswith("data:"): + return _decode_data_uri(uri) is not None + if uri.startswith(("http://", "https://")): + return True + return _resolve_local_image_path(raw_dir, uri) is not None + + +def _image_fmt_from_mimetype(mimetype: str) -> str: + if not mimetype: + return "" + if mimetype == "image/jpeg": + return "jpg" + if mimetype.startswith("image/"): + return mimetype[len("image/") :].lower() + return "" + + +def _decode_data_uri(uri: str) -> tuple[bytes, str] | None: + """Decode ``data:image/png;base64,...`` style URIs. + + Returns ``(bytes, extension)`` or ``None`` if the payload could not be + decoded. Non-base64 payloads (extremely rare for images) are not + supported and yield ``None``. 
+ """ + try: + head, payload = uri.split(",", 1) + except ValueError: + return None + if ";base64" not in head: + return None + try: + data = base64.b64decode(payload, validate=False) + except (ValueError, TypeError): + return None + ext = "" + if head.startswith("data:image/"): + ext = head[len("data:image/") :].split(";", 1)[0].lower() + if ext == "jpeg": + ext = "jpg" + return data, ext + + +def _resolve_local_image_path(raw_dir: Path, uri: str) -> Path | None: + """Resolve a relative image URI against the bundle root and return it + only if the result is a file *inside* ``raw_dir``. + + Returns ``None`` for: absolute URIs (``Path("foo") / "/etc/x"`` discards + the left side and would escape), refs that resolve outside the bundle + (``..``-traversal), and refs whose target does not exist. Symlinks are + followed by ``resolve()`` and the post-resolution path is what's checked, + so a symlink inside the bundle pointing outward is also refused. + """ + if not uri or os.path.isabs(uri): + return None + try: + base = raw_dir.resolve(strict=False) + candidate = (raw_dir / uri).resolve(strict=False) + except (OSError, RuntimeError): + return None + try: + candidate.relative_to(base) + except ValueError: + return None + return candidate if candidate.is_file() else None + + +def _sort_page_anchors(pages: set[str]) -> list[str]: + non_numeric = sorted(p for p in pages if not p.isdigit()) + numeric = sorted((p for p in pages if p.isdigit()), key=int) + return non_numeric + numeric + + +__all__ = ["DoclingIRBuilder"] diff --git a/lightrag/external_parser/docling/manifest.py b/lightrag/external_parser/docling/manifest.py new file mode 100644 index 0000000000..3890e3251c --- /dev/null +++ b/lightrag/external_parser/docling/manifest.py @@ -0,0 +1,130 @@ +"""Helpers for building ``_manifest.json`` for docling raw bundles. 
+ +Wraps the generic :class:`Manifest` schema with docling-specific knowledge: + +- the critical file is the main ``.json`` produced by docling-serve, +- non-critical files are the markdown + every entry under ``artifacts/``, +- ``extras`` carries the fixed pipeline constants so the options signature + remains reproducible across runs. +""" + +from __future__ import annotations + +from datetime import datetime, timezone +from pathlib import Path + +from lightrag.external_parser._common import compute_size_and_hash +from lightrag.external_parser._manifest import ( + MANIFEST_FILENAME, + Manifest, + ManifestFile, + write_manifest, +) +from lightrag.external_parser.docling import MANIFEST_ENGINE + + +def select_main_json(raw_dir: Path, source_file_path: Path) -> Path: + """Locate the primary docling JSON inside ``raw_dir``. + + Priority: ``{source_file_path.stem}.json`` if present, else the single ``*.json`` + sitting at ``raw_dir`` root (excluding ``_manifest.json``, which always + sits in the bundle once a download has completed and would otherwise + collide with the bundle JSON in the fallback). Raises ``RuntimeError`` + if zero or multiple candidates exist. + """ + preferred = raw_dir / f"{source_file_path.stem}.json" + if preferred.is_file(): + return preferred + + candidates = sorted( + p for p in raw_dir.glob("*.json") if p.is_file() and p.name != MANIFEST_FILENAME + ) + if len(candidates) == 1: + return candidates[0] + if not candidates: + raise RuntimeError(f"Docling raw bundle at {raw_dir} contains no .json file") + names = ", ".join(p.name for p in candidates) + raise RuntimeError( + f"Docling raw bundle at {raw_dir} has multiple .json candidates ({names}); " + f"expected exactly one to derive the critical file from" + ) + + +def select_main_md(raw_dir: Path, source_file_path: Path) -> Path | None: + """Locate the markdown twin of the main JSON.
Returns ``None`` if no + markdown was produced (defensive — docling-serve always emits one for + ``to_formats=["json","md"]`` but we don't want to crash if it is + missing).""" + preferred = raw_dir / f"{source_file_path.stem}.md" + if preferred.is_file(): + return preferred + candidates = sorted(p for p in raw_dir.glob("*.md") if p.is_file()) + return candidates[0] if candidates else None + + +def build_and_write_docling_manifest( + raw_dir: Path, + *, + source_file_path: Path, + task_id: str, + endpoint_signature: str, + engine_version: str, + options_signature: str, + fixed_constants: dict[str, object], + recorded_filename: str | None = None, +) -> Manifest: + """Construct the manifest for a freshly downloaded docling bundle and + persist it atomically. Returns the in-memory manifest for callers that + need the task_id / signatures for logging. + + ``recorded_filename`` is the name passed to docling-serve at upload + time (canonical, hint-stripped form when called from the pipeline). + It governs both the preferred-path lookup for the bundle JSON and the + value persisted as ``source_filename_at_parse``. When ``None``, falls + back to ``source_file_path.name`` for backward compatibility. 
+ """ + lookup_path = Path(recorded_filename) if recorded_filename else source_file_path + main_json = select_main_json(raw_dir, lookup_path) + crit_size, crit_hash = compute_size_and_hash(main_json) + critical = ManifestFile( + path=main_json.relative_to(raw_dir).as_posix(), + size=crit_size, + sha256=crit_hash, + ) + + others: list[ManifestFile] = [] + for path in sorted(raw_dir.rglob("*")): + if not path.is_file(): + continue + rel = path.relative_to(raw_dir).as_posix() + if rel == critical.path or rel.startswith("_manifest"): + continue + others.append(ManifestFile(path=rel, size=path.stat().st_size)) + + source_size, source_hash = compute_size_and_hash(source_file_path) + total = crit_size + sum(f.size for f in others) + + manifest = Manifest( + engine=MANIFEST_ENGINE, + source_content_hash=source_hash, + source_size_bytes=source_size, + source_filename_at_parse=recorded_filename or source_file_path.name, + critical_file=critical, + files=others, + total_size_bytes=total, + task_id=task_id, + endpoint_signature=endpoint_signature, + engine_version=engine_version, + options_signature=options_signature, + downloaded_at=datetime.now(timezone.utc).isoformat(timespec="seconds"), + extras={"fixed_constants": dict(fixed_constants)}, + ) + write_manifest(raw_dir, manifest) + return manifest + + +__all__ = [ + "build_and_write_docling_manifest", + "select_main_json", + "select_main_md", +] diff --git a/lightrag/external_parser/mineru/__init__.py b/lightrag/external_parser/mineru/__init__.py new file mode 100644 index 0000000000..ff1c2a6810 --- /dev/null +++ b/lightrag/external_parser/mineru/__init__.py @@ -0,0 +1,31 @@ +"""MinerU parser integration (raw client, cache, manifest, IR builder). + +Public surface for the rest of the codebase. ``parse_mineru`` imports +only from this facade so the inner module layout stays free to evolve. + +See ``docs/LightRAGSidecarFormat-zh.md`` for sidecar format and +``docs/FileProcessingConfiguration-zh.md`` for cache lifecycle. 
+""" + +from lightrag.external_parser.mineru.cache import ( + MINERU_RAW_DIR_SUFFIX, + clear_dir_contents, + compute_size_and_hash, + is_bundle_valid, + raw_dir_for_parsed_dir, +) +from lightrag.external_parser.mineru.client import MinerURawClient +from lightrag.external_parser.mineru.ir_builder import MinerUIRBuilder +from lightrag.external_parser.mineru.manifest import Manifest, ManifestFile + +__all__ = [ + "MINERU_RAW_DIR_SUFFIX", + "Manifest", + "ManifestFile", + "MinerUIRBuilder", + "MinerURawClient", + "clear_dir_contents", + "compute_size_and_hash", + "is_bundle_valid", + "raw_dir_for_parsed_dir", +] diff --git a/lightrag/external_parser/mineru/cache.py b/lightrag/external_parser/mineru/cache.py new file mode 100644 index 0000000000..bc491bb8b4 --- /dev/null +++ b/lightrag/external_parser/mineru/cache.py @@ -0,0 +1,397 @@ +"""Cache validation for ``*.mineru_raw/`` bundles. + +Validation policy (settled in design discussion; see +``LightRAGSidecarFormat-zh.md`` related notes): + +1. ``_manifest.json`` exists, parses, ``version=1.0`` ∧ ``engine=mineru``. +2. **Source size fast-path**: ``source_file.stat().st_size`` matches manifest; + mismatch → miss without hashing. +3. **Source content_hash**: full sha256 of the current source file matches + manifest. The size+hash pair is computed by a single-read helper so the + stored manifest is internally self-consistent. +4. **API mode**: if the manifest recorded ``api_mode`` and it differs from + current ``MINERU_API_MODE``, miss. +5. **Parser options**: the manifest must record an ``options_signature`` that + matches the current effective MinerU request options. Missing signatures + from older manifests are treated as stale. +6. **Engine version**: if ``MINERU_ENGINE_VERSION`` is set and the manifest + recorded a non-empty one, they must match. +7. **Endpoint signature**: if the active MinerU endpoint is set and the + manifest recorded a non-empty one, they must match. +8. 
**Critical file**: ``content_list.json`` must exist with matching size + **and** sha256 — sha256 here is the final tie-breaker against silent + corruption affecting the file the adapter depends on. +9. **Other files**: size-only verification (cheap; covers most corruption + modes for image / middle.json / layout.pdf). + +Any failed step ⇒ cache miss; the caller wipes the directory contents +(preserving the directory itself) and re-runs the download. +""" + +from __future__ import annotations + +import hashlib +import json +import os +from dataclasses import asdict, dataclass +from pathlib import Path +from typing import Any + +from lightrag.constants import MINERU_RAW_DIR_SUFFIX, PARSED_DIR_SUFFIX +from lightrag.external_parser.mineru.manifest import load_manifest +from lightrag.utils import logger + +DEFAULT_MINERU_API_MODE = "local" +DEFAULT_MINERU_OFFICIAL_ENDPOINT = "https://mineru.net" +DEFAULT_MINERU_MODEL_VERSION = "vlm" +DEFAULT_MINERU_LANGUAGE = "ch" +DEFAULT_MINERU_LOCAL_BACKEND = "hybrid-auto-engine" +DEFAULT_MINERU_LOCAL_PARSE_METHOD = "auto" +DEFAULT_MINERU_LOCAL_IMAGE_ANALYSIS = True +DEFAULT_MINERU_LOCAL_START_PAGE_ID = 0 +DEFAULT_MINERU_LOCAL_END_PAGE_ID = 99999 +DEFAULT_MINERU_ENABLE_TABLE = True +DEFAULT_MINERU_ENABLE_FORMULA = True +DEFAULT_MINERU_IS_OCR = False + + +def raw_dir_for_parsed_dir(parsed_dir: Path) -> Path: + """Sibling raw dir for a given ``*.parsed`` dir. + + ``foo.parsed/`` → ``foo.mineru_raw/``. Used both at download time and at + cache check time so the layout is canonical. 
+ """ + stem = parsed_dir.name + if stem.endswith(PARSED_DIR_SUFFIX): + stem = stem[: -len(PARSED_DIR_SUFFIX)] + return parsed_dir.parent / f"{stem}{MINERU_RAW_DIR_SUFFIX}" + + +def clear_dir_contents(directory: Path) -> None: + """Delete everything inside ``directory`` but keep ``directory`` itself.""" + if not directory.exists(): + return + for entry in directory.iterdir(): + try: + if entry.is_dir() and not entry.is_symlink(): + _rmtree_safe(entry) + else: + entry.unlink() + except OSError: + # Best-effort cleanup; subsequent download will overwrite. + continue + + +def _rmtree_safe(directory: Path) -> None: + import shutil + + shutil.rmtree(directory, ignore_errors=True) + + +def compute_size_and_hash(path: Path) -> tuple[int, str]: + """Single-read computation of ``(size_bytes, "sha256:{hexdigest}")``. + + Manifest writes use this so the recorded size and hash are guaranteed to + describe the same byte stream; using two ``open()`` calls would risk a + TOCTOU mismatch if the file changed in between.
+ """ + h = hashlib.sha256() + size = 0 + with path.open("rb") as f: + for chunk in iter(lambda: f.read(1 << 20), b""): + h.update(chunk) + size += len(chunk) + return size, f"sha256:{h.hexdigest()}" + + +def _current_api_mode() -> str: + mode = _normalize_api_mode(os.getenv("MINERU_API_MODE", DEFAULT_MINERU_API_MODE)) + return mode + + +def _normalize_api_mode(mode: str) -> str: + mode = str(mode or "").strip().lower() + return mode if mode in {"official", "local"} else DEFAULT_MINERU_API_MODE + + +def _env_bool(name: str, default: bool) -> bool: + raw = os.getenv(name, "").strip().lower() + if raw in {"1", "true", "yes", "on"}: + return True + if raw in {"0", "false", "no", "off"}: + return False + return default + + +def _env_int(name: str, default: int) -> int: + raw = os.getenv(name, "").strip() + if not raw: + return default + try: + return int(raw) + except ValueError: + logger.warning( + "[mineru_raw] %s=%r is not an integer; using %s", name, raw, default + ) + return default + + +def _current_endpoint_signature() -> str: + mode = _current_api_mode() + if mode == "official": + return ( + os.getenv("MINERU_OFFICIAL_ENDPOINT", DEFAULT_MINERU_OFFICIAL_ENDPOINT) + .strip() + .rstrip("/") + ) + if mode == "local": + return os.getenv("MINERU_LOCAL_ENDPOINT", "").strip().rstrip("/") + return "" + + +def local_page_bounds(page_ranges: str) -> tuple[int, int]: + raw = page_ranges.strip() + if not raw: + return DEFAULT_MINERU_LOCAL_START_PAGE_ID, DEFAULT_MINERU_LOCAL_END_PAGE_ID + if "," in raw: + raise ValueError( + "MINERU_PAGE_RANGES with MINERU_API_MODE=local supports only a " + "single page or simple range such as '1-10'" + ) + if raw.isdigit(): + page = max(int(raw), 1) + return page - 1, page - 1 + if "-" in raw: + left, _, right = raw.partition("-") + if left.isdigit() and right.isdigit(): + start = max(int(left), 1) + end = max(int(right), start) + return start - 1, end - 1 + raise ValueError( + "MINERU_PAGE_RANGES with MINERU_API_MODE=local must be a single 
" + "positive page number or simple range such as '1-10'" + ) + + +@dataclass(frozen=True) +class MinerUParserOptions: + """Effective MinerU parser options used both for live requests and the + cache signature. + + Constructed once via :meth:`from_env` so the client and the cache + validator agree on every defaulting / normalization rule. + """ + + api_mode: str + model_version: str + language: str + enable_table: bool + enable_formula: bool + is_ocr: bool + page_ranges: str + local_backend: str + local_parse_method: str + local_image_analysis: bool + local_start_page_id: int + local_end_page_id: int + + @classmethod + def from_env(cls, *, api_mode: str | None = None) -> "MinerUParserOptions": + mode = ( + _normalize_api_mode(api_mode) + if api_mode is not None + else _current_api_mode() + ) + page_ranges = os.getenv("MINERU_PAGE_RANGES", "").strip() + local_start = _env_int( + "MINERU_LOCAL_START_PAGE_ID", DEFAULT_MINERU_LOCAL_START_PAGE_ID + ) + local_end = _env_int( + "MINERU_LOCAL_END_PAGE_ID", DEFAULT_MINERU_LOCAL_END_PAGE_ID + ) + if mode == "local" and page_ranges: + local_start, local_end = local_page_bounds(page_ranges) + return cls( + api_mode=mode, + model_version=( + os.getenv("MINERU_MODEL_VERSION", DEFAULT_MINERU_MODEL_VERSION).strip() + or DEFAULT_MINERU_MODEL_VERSION + ), + language=( + os.getenv("MINERU_LANGUAGE", DEFAULT_MINERU_LANGUAGE).strip() + or DEFAULT_MINERU_LANGUAGE + ), + enable_table=_env_bool("MINERU_ENABLE_TABLE", DEFAULT_MINERU_ENABLE_TABLE), + enable_formula=_env_bool( + "MINERU_ENABLE_FORMULA", DEFAULT_MINERU_ENABLE_FORMULA + ), + is_ocr=_env_bool("MINERU_IS_OCR", DEFAULT_MINERU_IS_OCR), + page_ranges=page_ranges, + local_backend=( + os.getenv("MINERU_LOCAL_BACKEND", DEFAULT_MINERU_LOCAL_BACKEND).strip() + or DEFAULT_MINERU_LOCAL_BACKEND + ), + local_parse_method=( + os.getenv( + "MINERU_LOCAL_PARSE_METHOD", DEFAULT_MINERU_LOCAL_PARSE_METHOD + ).strip() + or DEFAULT_MINERU_LOCAL_PARSE_METHOD + ), + local_image_analysis=_env_bool( + 
"MINERU_LOCAL_IMAGE_ANALYSIS", DEFAULT_MINERU_LOCAL_IMAGE_ANALYSIS + ), + local_start_page_id=local_start, + local_end_page_id=local_end, + ) + + def signature(self) -> str: + return mineru_options_signature(**asdict(self)) + + +def mineru_options_signature( + *, + api_mode: str, + model_version: str = DEFAULT_MINERU_MODEL_VERSION, + language: str = DEFAULT_MINERU_LANGUAGE, + enable_table: bool = DEFAULT_MINERU_ENABLE_TABLE, + enable_formula: bool = DEFAULT_MINERU_ENABLE_FORMULA, + is_ocr: bool = DEFAULT_MINERU_IS_OCR, + page_ranges: str = "", + local_backend: str = DEFAULT_MINERU_LOCAL_BACKEND, + local_parse_method: str = DEFAULT_MINERU_LOCAL_PARSE_METHOD, + local_image_analysis: bool = DEFAULT_MINERU_LOCAL_IMAGE_ANALYSIS, + local_start_page_id: int = DEFAULT_MINERU_LOCAL_START_PAGE_ID, + local_end_page_id: int = DEFAULT_MINERU_LOCAL_END_PAGE_ID, +) -> str: + mode = _normalize_api_mode(api_mode) + payload: dict[str, Any] = { + "signature_version": 1, + "api_mode": mode, + "language": str(language or "").strip() or DEFAULT_MINERU_LANGUAGE, + "enable_table": bool(enable_table), + "enable_formula": bool(enable_formula), + } + if mode == "official": + payload.update( + { + "model_version": str(model_version or "").strip() + or DEFAULT_MINERU_MODEL_VERSION, + "is_ocr": bool(is_ocr), + "page_ranges": str(page_ranges or "").strip(), + } + ) + else: + payload.update( + { + "local_backend": str(local_backend or "").strip() + or DEFAULT_MINERU_LOCAL_BACKEND, + "local_parse_method": str(local_parse_method or "").strip() + or DEFAULT_MINERU_LOCAL_PARSE_METHOD, + "local_image_analysis": bool(local_image_analysis), + "local_start_page_id": int(local_start_page_id), + "local_end_page_id": int(local_end_page_id), + } + ) + + raw = json.dumps(payload, sort_keys=True, separators=(",", ":")) + return "sha256:" + hashlib.sha256(raw.encode("utf-8")).hexdigest() + + +def current_mineru_options_signature() -> str: + return MinerUParserOptions.from_env().signature() + + +def 
is_bundle_valid(raw_dir: Path, source_file: Path) -> bool: + """Return True iff the bundle is intact and matches the current source. + + See module docstring for the full policy. Returns False on any of: + missing manifest, malformed manifest, schema version mismatch, source + size/hash mismatch, parser options mismatch, engine/endpoint env mismatch, + critical file missing or corrupted, or any non-critical file size mismatch. + """ + if not raw_dir.is_dir(): + return False + + manifest = load_manifest(raw_dir) + if manifest is None: + return False + + # 1. Source size fast-path + try: + cur_size = source_file.stat().st_size + except OSError: + return False + if cur_size != int(manifest.source_size_bytes): + return False + + # 2. Source content_hash + _, cur_hash = compute_size_and_hash(source_file) + if cur_hash != manifest.source_content_hash: + return False + + # 3. API mode (only when manifest had one; old manifests remain compatible) + cur_api_mode = _current_api_mode() + if manifest.api_mode and cur_api_mode != manifest.api_mode: + return False + + # 4. Parser options. Old manifests did not record this and must miss so + # changes such as MINERU_LOCAL_BACKEND cannot silently reuse stale output. + if not manifest.options_signature: + return False + if current_mineru_options_signature() != manifest.options_signature: + return False + + # 5. Engine version (only when current env exposes one AND manifest had one) + cur_engine_version = os.getenv("MINERU_ENGINE_VERSION", "").strip() + if ( + cur_engine_version + and manifest.engine_version + and cur_engine_version != manifest.engine_version + ): + return False + + # 6. Endpoint signature + cur_endpoint = _current_endpoint_signature() + if ( + cur_endpoint + and manifest.endpoint_signature + and cur_endpoint != manifest.endpoint_signature + ): + return False + + # 7. 
Critical file: size + sha256 + crit = manifest.critical_file + crit_path = raw_dir / crit.path + try: + if crit_path.stat().st_size != int(crit.size): + return False + except OSError: + return False + if crit.sha256: + _, crit_actual = compute_size_and_hash(crit_path) + if crit_actual != crit.sha256: + return False + + # 8. Other files: size only + for entry in manifest.files: + ep = raw_dir / entry.path + try: + if ep.stat().st_size != int(entry.size): + return False + except OSError: + return False + + return True + + +__all__ = [ + "MINERU_RAW_DIR_SUFFIX", + "MinerUParserOptions", + "clear_dir_contents", + "compute_size_and_hash", + "current_mineru_options_signature", + "is_bundle_valid", + "local_page_bounds", + "mineru_options_signature", + "raw_dir_for_parsed_dir", +] diff --git a/lightrag/external_parser/mineru/client.py b/lightrag/external_parser/mineru/client.py new file mode 100644 index 0000000000..51d6880318 --- /dev/null +++ b/lightrag/external_parser/mineru/client.py @@ -0,0 +1,677 @@ +"""MinerU raw bundle downloader. + +Supports MinerU's official cloud and self-hosted API protocols and lands the +final parser bundle on disk under ``raw_dir/``: + +- ``official`` — MinerU precision API v4: apply for signed upload URL, PUT the + local file, poll batch results, download ``full_zip_url``. +- ``local`` — self-hosted ``mineru-api`` / ``mineru-router``: submit + ``POST /tasks``, poll ``GET /tasks/{task_id}``, download + ``GET /tasks/{task_id}/result``. + +Both protocols request a zip result bundle. Archives are extracted under +``raw_dir/`` and normalized so the adapter can read a root-level +``content_list.json``. 
+""" + +from __future__ import annotations + +import asyncio +import io +import json +import os +import shutil +import zipfile +from collections.abc import AsyncIterator +from datetime import datetime, timezone +from pathlib import Path +from typing import TYPE_CHECKING, Any +from urllib.parse import urlparse + +from lightrag.external_parser._common import raise_for_status_with_detail +from lightrag.external_parser.mineru.cache import ( + MinerUParserOptions, + compute_size_and_hash, +) +from lightrag.external_parser.mineru.manifest import ( + Manifest, + ManifestFile, + write_manifest, +) +from lightrag.utils import logger + +if TYPE_CHECKING: + import httpx +else: + try: + import httpx + except ImportError: # pragma: no cover + httpx = None + +CONTENT_LIST_FILENAME = "content_list.json" +DEFAULT_MINERU_API_MODE = "local" +DEFAULT_MINERU_OFFICIAL_ENDPOINT = "https://mineru.net" +VALID_MINERU_API_MODES = {"official", "local"} +OFFICIAL_DONE_STATES = {"done"} +OFFICIAL_FAILED_STATES = {"failed"} +LOCAL_DONE_STATES = {"completed"} +LOCAL_FAILED_STATES = {"failed"} +UPLOAD_CHUNK_SIZE = 1024 * 1024 + + +def _get_by_path(payload: Any, path: str) -> Any: + """Walk a dotted path through a nested dict; returns None if any segment + is missing or non-dict.""" + if not path: + return None + cur = payload + for part in path.split("."): + if isinstance(cur, dict) and part in cur: + cur = cur[part] + else: + return None + return cur + + +def _strip_trailing_slash(url: str) -> str: + return url.rstrip("/") + + +def _resolve_upload_name(upload_name: str | None, source_file_path: Path) -> str: + candidate = Path(str(upload_name or "")).name + return candidate or source_file_path.name + + +async def _iter_file_bytes(path: Path) -> AsyncIterator[bytes]: + with path.open("rb") as fh: + while True: + chunk = await asyncio.to_thread(fh.read, UPLOAD_CHUNK_SIZE) + if not chunk: + break + yield chunk + + +def _validate_base_url( + name: str, endpoint: str, forbidden_segments: tuple[str, 
...] +) -> None: + parsed = urlparse(endpoint) + path = (parsed.path or "").rstrip("/") + for segment in forbidden_segments: + if path.endswith(segment) or f"{segment}/" in path: + raise ValueError( + f"{name} must be a base URL, not an API path: {endpoint!r}" + ) + + +class MinerURawClient: + """Downloads MinerU bundles into ``raw_dir``. + + Construct once per call (cheap). Reads ``MINERU_*`` env vars at + construction time. Methods are async and use a single shared httpx + client across all calls in :meth:`download_into`. + + Implements the MinerU-specific upload + poll + zip download flow + inline; bundle handling needs the ``result_url`` *and* the + ``Content-Type`` of the response, which a generic protocol helper + cannot expose without leaking abstractions. + """ + + def __init__(self) -> None: + self.api_mode = ( + os.getenv("MINERU_API_MODE", DEFAULT_MINERU_API_MODE).strip().lower() + ) + if self.api_mode not in VALID_MINERU_API_MODES: + allowed = ", ".join(sorted(VALID_MINERU_API_MODES)) + raise ValueError( + f"MINERU_API_MODE must be one of {allowed}, got {self.api_mode!r}" + ) + + self.official_endpoint = _strip_trailing_slash( + os.getenv( + "MINERU_OFFICIAL_ENDPOINT", DEFAULT_MINERU_OFFICIAL_ENDPOINT + ).strip() + or DEFAULT_MINERU_OFFICIAL_ENDPOINT + ) + self.local_endpoint = _strip_trailing_slash( + os.getenv("MINERU_LOCAL_ENDPOINT", "").strip() + ) + self.api_token = os.getenv("MINERU_API_TOKEN", "").strip() + if self.api_mode == "official": + if not self.api_token: + raise ValueError( + "MINERU_API_TOKEN is required when MINERU_API_MODE=official" + ) + _validate_base_url( + "MINERU_OFFICIAL_ENDPOINT", + self.official_endpoint, + ("/api/v4", "/api/v4/file-urls/batch", "/api/v4/extract/task"), + ) + self.endpoint = self.official_endpoint + elif self.api_mode == "local": + if not self.local_endpoint: + raise ValueError( + "MINERU_LOCAL_ENDPOINT is required when MINERU_API_MODE=local" + ) + _validate_base_url( + "MINERU_LOCAL_ENDPOINT", + 
self.local_endpoint, + ("/tasks", "/file_parse", "/health"), + ) + self.endpoint = self.local_endpoint + self.poll_interval = float(os.getenv("MINERU_POLL_INTERVAL_SECONDS", "2")) + self.max_polls = int(os.getenv("MINERU_MAX_POLLS", "180")) + self.engine_version = os.getenv("MINERU_ENGINE_VERSION", "").strip() + + options = MinerUParserOptions.from_env(api_mode=self.api_mode) + self._parser_options = options + self.model_version = options.model_version + self.language = options.language + self.enable_table = options.enable_table + self.enable_formula = options.enable_formula + self.is_ocr = options.is_ocr + self.page_ranges = options.page_ranges + self.local_backend = options.local_backend + self.local_parse_method = options.local_parse_method + self.local_image_analysis = options.local_image_analysis + self.local_start_page_id = options.local_start_page_id + self.local_end_page_id = options.local_end_page_id + + # ------------------------------------------------------------------ + # Public API + # ------------------------------------------------------------------ + + async def download_into( + self, + raw_dir: Path, + source_file_path: Path, + *, + upload_name: str | None = None, + ) -> Manifest: + """Download a fresh bundle and write the manifest. + + Pre-condition: caller cleared ``raw_dir`` contents (recommended via + :func:`clear_dir_contents`). This method does NOT clean the + directory itself — leaving that to the caller keeps cache miss + semantics explicit at the parse_mineru entry point. + + Returns the :class:`Manifest` describing the bundle. 
+ """ + if httpx is None: + raise RuntimeError("httpx is required for MinerU parsing but not installed") + raw_dir.mkdir(parents=True, exist_ok=True) + resolved_upload_name = _resolve_upload_name(upload_name, source_file_path) + + timeout = httpx.Timeout(120.0, connect=30.0) + async with httpx.AsyncClient(timeout=timeout) as client: + if self.api_mode == "official": + task_id = await self._download_official( + client, source_file_path, raw_dir, resolved_upload_name + ) + else: + task_id = await self._download_local( + client, source_file_path, raw_dir, resolved_upload_name + ) + + self._normalize_raw_bundle(raw_dir, source_file_path, resolved_upload_name) + return self._build_and_write_manifest( + raw_dir, source_file_path, task_id, resolved_upload_name + ) + + # ------------------------------------------------------------------ + # Upload + poll + # ------------------------------------------------------------------ + + def _official_headers(self) -> dict[str, str]: + return { + "Content-Type": "application/json", + "Authorization": f"Bearer {self.api_token}", + } + + def _official_payload(self, upload_name: str) -> dict[str, Any]: + file_entry: dict[str, Any] = {"name": upload_name} + if self.is_ocr: + file_entry["is_ocr"] = True + if self.page_ranges: + file_entry["page_ranges"] = self.page_ranges + return { + "files": [file_entry], + "model_version": self.model_version, + "language": self.language, + "enable_table": self.enable_table, + "enable_formula": self.enable_formula, + } + + async def _download_official( + self, + client: "httpx.AsyncClient", + source_file_path: Path, + raw_dir: Path, + upload_name: str, + ) -> str: + apply_url = f"{self.official_endpoint}/api/v4/file-urls/batch" + resp = await client.post( + apply_url, + headers=self._official_headers(), + json=self._official_payload(upload_name), + ) + raise_for_status_with_detail(resp, "MinerU official upload URL request") + payload = resp.json() if resp.text else {} + 
self._raise_if_official_error(payload, "MinerU official upload URL request") + data = payload.get("data") if isinstance(payload, dict) else {} + batch_id = str((data or {}).get("batch_id") or "") + file_urls = (data or {}).get("file_urls") or [] + if not batch_id or not isinstance(file_urls, list) or not file_urls: + raise RuntimeError( + f"MinerU official upload URL response missing batch_id/file_urls: " + f"{payload}" + ) + + first_file_url = file_urls[0] + if isinstance(first_file_url, dict): + upload_url = str( + first_file_url.get("url") or first_file_url.get("file_url") or "" + ) + else: + upload_url = str(first_file_url) + if not upload_url: + raise RuntimeError( + f"MinerU official upload URL response had an empty upload URL: " + f"{payload}" + ) + upload_resp = await client.put( + upload_url, + content=_iter_file_bytes(source_file_path), + headers={"Content-Length": str(source_file_path.stat().st_size)}, + ) + raise_for_status_with_detail(upload_resp, "MinerU official file upload") + + result_url = await self._poll_official_batch(client, batch_id, upload_name) + await self._download_zip(client, result_url, raw_dir) + return batch_id + + async def _poll_official_batch( + self, + client: "httpx.AsyncClient", + batch_id: str, + upload_name: str, + ) -> str: + poll_url = f"{self.official_endpoint}/api/v4/extract-results/batch/{batch_id}" + for _ in range(self.max_polls): + await asyncio.sleep(self.poll_interval) + resp = await client.get(poll_url, headers=self._official_headers()) + raise_for_status_with_detail(resp, "MinerU official batch poll") + payload = resp.json() if resp.text else {} + self._raise_if_official_error(payload, "MinerU official batch poll") + results = _get_by_path(payload, "data.extract_result") + if isinstance(results, dict): + results = [results] + if not isinstance(results, list): + continue + + selected = _select_official_extract_result(results, upload_name) + if selected is None: + continue + state = str(selected.get("state") or 
"").lower() + if state in OFFICIAL_DONE_STATES: + full_zip_url = str(selected.get("full_zip_url") or "") + if not full_zip_url: + raise RuntimeError( + f"MinerU official batch {batch_id} is done but has no " + f"full_zip_url: {selected}" + ) + return full_zip_url + if state in OFFICIAL_FAILED_STATES: + err = selected.get("err_msg") or selected.get("error") or selected + raise RuntimeError( + f"MinerU official parse failed for batch {batch_id}: {err}" + ) + + raise TimeoutError(f"MinerU official batch polling timeout: {batch_id}") + + def _raise_if_official_error(self, payload: Any, operation: str) -> None: + if not isinstance(payload, dict): + raise RuntimeError(f"{operation} returned non-object payload: {payload!r}") + code = payload.get("code", 0) + if code not in (0, "0", None): + raise RuntimeError( + f"{operation} failed: code={code} msg={payload.get('msg')!r}" + ) + + def _local_form_data(self) -> dict[str, str]: + return { + "lang_list": self.language, + "backend": self.local_backend, + "parse_method": self.local_parse_method, + "formula_enable": _bool_form(self.enable_formula), + "table_enable": _bool_form(self.enable_table), + "image_analysis": _bool_form(self.local_image_analysis), + "return_md": "true", + "return_middle_json": "true", + "return_model_output": "true", + "return_content_list": "true", + "return_images": "true", + "response_format_zip": "true", + "return_original_file": "true", + "start_page_id": str(self.local_start_page_id), + "end_page_id": str(self.local_end_page_id), + } + + async def _download_local( + self, + client: "httpx.AsyncClient", + source_file_path: Path, + raw_dir: Path, + upload_name: str, + ) -> str: + submit_url = f"{self.local_endpoint}/tasks" + # Keep data as a Mapping so httpx 0.28 builds an async MultipartStream + # and reads the file handle in chunks instead of buffering the payload. 
+ with source_file_path.open("rb") as fh: + files = {"files": (upload_name, fh, "application/octet-stream")} + resp = await client.post( + submit_url, + data=self._local_form_data(), + files=files, + ) + raise_for_status_with_detail( + resp, + f"MinerU local task submission for {upload_name!r}", + ) + payload = resp.json() if resp.text else {} + task_id = str(payload.get("task_id") or "") + if not task_id: + raise RuntimeError( + f"MinerU local /tasks response missing task_id: {payload}" + ) + + await self._poll_local_task(client, task_id) + await self._download_zip( + client, + f"{self.local_endpoint}/tasks/{task_id}/result", + raw_dir, + ) + return task_id + + async def _poll_local_task( + self, + client: "httpx.AsyncClient", + task_id: str, + ) -> None: + poll_url = f"{self.local_endpoint}/tasks/{task_id}" + for _ in range(self.max_polls): + await asyncio.sleep(self.poll_interval) + resp = await client.get(poll_url) + raise_for_status_with_detail(resp, "MinerU local task poll") + payload = resp.json() if resp.text else {} + status = str(payload.get("status") or "").lower() + if status in LOCAL_DONE_STATES: + return + if status in LOCAL_FAILED_STATES: + err = payload.get("error") or payload.get("message") or payload + raise RuntimeError( + f"MinerU local parse failed for task {task_id}: {err}" + ) + + raise TimeoutError(f"MinerU local task polling timeout: {task_id}") + + async def _download_zip( + self, + client: "httpx.AsyncClient", + result_url: str, + raw_dir: Path, + resp: Any = None, + ) -> None: + """Download (or re-use already-fetched response) and extract.""" + if resp is None or not hasattr(resp, "content"): + resp = await client.get(result_url) + raise_for_status_with_detail(resp, "MinerU result bundle download") + buf = io.BytesIO(resp.content) + with zipfile.ZipFile(buf) as zf: + # Safe-extract: refuse absolute paths and ``..`` traversal. 
+ for name in zf.namelist(): + norm = os.path.normpath(name) + if norm.startswith("..") or os.path.isabs(norm): + raise RuntimeError(f"Refusing zip entry with unsafe path: {name!r}") + zf.extractall(raw_dir) + + # Normalize: if the zip nested everything under a single top-level + # dir, hoist its contents up so content_list.json sits at raw_dir + # root. This matches the common MinerU bundle layout. + self._maybe_hoist_single_subdir(raw_dir) + + def _maybe_hoist_single_subdir(self, raw_dir: Path) -> None: + entries = [p for p in raw_dir.iterdir() if p.name != "_manifest.json"] + if len(entries) != 1 or not entries[0].is_dir(): + return + sub = entries[0] + for child in list(sub.iterdir()): + child.rename(raw_dir / child.name) + try: + sub.rmdir() + except OSError: + pass + + def _normalize_raw_bundle( + self, + raw_dir: Path, + source_file_path: Path, + upload_name: str | None = None, + ) -> None: + """Ensure a downloaded bundle has root-level ``content_list.json``. + + Official and local MinerU zip archives commonly place parser outputs at + ``//_content_list.json``. The adapter consumes a + canonical root ``content_list.json`` plus optional root ``images/``. + + After hoisting we delete the nested originals so the manifest does not + bookkeep two copies (and disk usage doesn't double for big bundles). + Sibling artifacts of the parse subdir (``*.md``, ``middle.json`` etc.) + are also hoisted to ``raw_dir`` root for easier diagnostics. + """ + if (raw_dir / CONTENT_LIST_FILENAME).is_file(): + return + + candidate = _select_content_list_candidate( + raw_dir, source_file_path, upload_name + ) + if candidate is None: + return + + source_dir = candidate.parent + target_root = raw_dir.resolve() + # Guard: never hoist from above raw_dir (defensive — candidate already + # comes from rglob inside raw_dir, but cheap to verify). 
+ try: + source_dir.resolve().relative_to(target_root) + except ValueError: + shutil.copy2(candidate, raw_dir / CONTENT_LIST_FILENAME) + return + + # Move the critical file first; then hoist sibling files/dirs that + # don't already exist at raw_dir root. + shutil.move(str(candidate), str(raw_dir / CONTENT_LIST_FILENAME)) + for entry in list(source_dir.iterdir()): + target = raw_dir / entry.name + if target.exists(): + continue + shutil.move(str(entry), str(target)) + + # Best-effort cleanup of the now-empty parse subtree. + cursor = source_dir + while cursor != raw_dir and cursor.is_dir(): + try: + cursor.rmdir() + except OSError: + break + cursor = cursor.parent + + # ------------------------------------------------------------------ + # Manifest construction + # ------------------------------------------------------------------ + + def _build_and_write_manifest( + self, + raw_dir: Path, + source_file_path: Path, + task_id: str, + upload_name: str, + ) -> Manifest: + source_size, source_hash = compute_size_and_hash(source_file_path) + + # Critical file — required. + crit_path = raw_dir / CONTENT_LIST_FILENAME + if not crit_path.is_file(): + raise RuntimeError( + f"MinerU bundle missing required {CONTENT_LIST_FILENAME} " + f"after download (raw_dir={raw_dir})" + ) + crit_size, crit_hash = compute_size_and_hash(crit_path) + + # Other files. 
+ others: list[ManifestFile] = [] + total = crit_size + for p in sorted(raw_dir.rglob("*")): + if not p.is_file(): + continue + if p.name == "_manifest.json": + continue + rel = p.relative_to(raw_dir).as_posix() + if rel == CONTENT_LIST_FILENAME: + continue + size = p.stat().st_size + others.append(ManifestFile(path=rel, size=size)) + total += size + + manifest = Manifest( + source_content_hash=source_hash, + source_size_bytes=source_size, + source_filename_at_parse=upload_name, + critical_file=ManifestFile( + path=CONTENT_LIST_FILENAME, + size=crit_size, + sha256=crit_hash, + ), + files=others, + total_size_bytes=total, + task_id=task_id, + api_mode=self.api_mode, + engine_version=self.engine_version, + endpoint_signature=self.endpoint, + options_signature=self._options_signature(), + downloaded_at=datetime.now(timezone.utc).isoformat(), + ) + write_manifest(raw_dir, manifest) + return manifest + + def _options_signature(self) -> str: + return self._parser_options.signature() + + +def _find_content_list(payload: Any, content_field: str) -> list[dict] | None: + """Heuristic content_list extractor. + + Tries (in order): + + 1. The provided dotted path if it lands on a list of dicts. + 2. Direct ``content_list`` / ``content`` / ``items`` / ``result`` keys. + 3. Recursive descent. 
+ """ + if isinstance(payload, list): + if payload and all(isinstance(x, dict) for x in payload): + return payload + return None + if not isinstance(payload, dict): + return None + + via_field = _get_by_path(payload, content_field) + candidate = _find_content_list(via_field, content_field) + if candidate is not None: + return candidate + + for key in ("content_list", "content", "items", "result"): + value = payload.get(key) + candidate = _find_content_list(value, content_field) + if candidate is not None: + return candidate + + for value in payload.values(): + candidate = _find_content_list(value, content_field) + if candidate is not None: + return candidate + return None + + +def _bool_form(value: bool) -> str: + return "true" if value else "false" + + +def _select_official_extract_result( + results: list[Any], + source_filename: str, +) -> dict[str, Any] | None: + """Pick the extract_result entry that matches the file we uploaded. + + Invariant: :meth:`MinerURawClient._download_official` always submits a + single-file batch, so a non-matching ``file_name`` from the API would + indicate either a server response we don't understand or a future + multi-file extension. We fall back to ``dict_results[0]`` to remain + forward-compatible but log a warning so the mismatch is visible. + """ + dict_results = [item for item in results if isinstance(item, dict)] + if not dict_results: + return None + source_name = Path(source_filename).name + source_stem = Path(source_filename).stem + for item in dict_results: + file_name = str(item.get("file_name") or item.get("name") or "") + if Path(file_name).name == source_name or Path(file_name).stem == source_stem: + return item + logger.warning( + "[mineru_raw] official extract_result did not contain a match for " + "%r; falling back to the first entry (%r). 
This is unexpected for " + "a single-file batch.", + source_name, + str(dict_results[0].get("file_name") or dict_results[0].get("name") or ""), + ) + return dict_results[0] + + +def _select_content_list_candidate( + raw_dir: Path, + source_file_path: Path, + upload_name: str | None = None, +) -> Path | None: + source_stem = Path(upload_name or source_file_path.name).stem + candidates: list[tuple[int, int, str, Path]] = [] + for path in raw_dir.rglob("*.json"): + if not path.is_file(): + continue + if path.name != CONTENT_LIST_FILENAME and not path.name.endswith( + "_content_list.json" + ): + continue + try: + payload = json.loads(path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + continue + content_list = _find_content_list(payload, "content") + if content_list is None: + continue + + score = 10 + if path.name == CONTENT_LIST_FILENAME: + score = 0 + elif path.name == f"{source_stem}_content_list.json": + score = 1 + elif path.stem.endswith("_content_list"): + score = 2 + depth = len(path.relative_to(raw_dir).parts) + candidates.append((score, depth, path.as_posix(), path)) + + if not candidates: + return None + candidates.sort() + return candidates[0][3] + + +__all__ = ["MinerURawClient", "CONTENT_LIST_FILENAME"] diff --git a/lightrag/external_parser/mineru/ir_builder.py b/lightrag/external_parser/mineru/ir_builder.py new file mode 100644 index 0000000000..e9ecbfbd44 --- /dev/null +++ b/lightrag/external_parser/mineru/ir_builder.py @@ -0,0 +1,749 @@ +"""MinerU IR builder: ``content_list.json`` (+ images/) → :class:`IRDoc`. + +Input contract: a ``*.mineru_raw/`` directory containing at least +``content_list.json``. Optional sibling resources (``images/``, +``middle.json``, ``full.md``, ``layout.pdf``) are kept as-is; this builder +only reads the content list and image asset bytes. + +Conversion rules (informed by spec §3-§6): + +- ``text`` items with ``text_level>0`` and ``title`` / ``section_header`` + start a NEW block. 
The heading text is rendered with a markdown ``#`` + prefix matching the level (``# foo``, ``## bar`` …) as the first line of + the new block's content. +- All other items (``text``, ``list``, ``code``, ``table``, ``image``, + ``equation``) are MERGED into the current block — their text / placeholder + is appended (newline-separated) to the heading's block. This mirrors the + native docx parser's "split-by-heading, merge-everything-under-heading" + behavior (see ``native_parser/docx/parse_document.py``). +- Content emitted before the first heading lands in a synthetic + ``Preface/Uncategorized`` block at level 0. +- ``list`` items joined with ``\n``; ``code`` body taken from ``code_body`` + if present. +- ``table`` → IRTable + ``{{TBL:k}}`` placeholder. ``table_body`` (HTML) or + the ``rows`` field (2D array) become ``html`` / ``rows`` on IRTable. + ``num_rows`` / ``num_cols`` are taken from MinerU if present, otherwise + inferred. ``header`` populates ``table_header`` (per spec §5). +- ``image`` / ``picture`` / ``drawing`` → IRDrawing + ``{{IMG:k}}`` placeholder. + Asset bytes are referenced via ``img_path`` relative to the raw dir. +- ``equation`` → IREquation. ``is_block`` is decided by whether + ``text_format=="block"`` (MinerU explicit flag) OR ``text_level==0`` with + no inline neighbours; otherwise inline. The latex string is preserved + verbatim (including any ``$$``/``$`` wrappers) so ``blocks.jsonl``'s + ```` body matches MinerU's raw output; the writer strips the + wrappers when persisting ``equations.json`` content. +- ``page_idx`` + ``bbox`` → ``IRPosition(type="bbox", anchor=page, range=[x0,y0,x1,y1])``. + Empty/missing bbox is acceptable; positions accumulate on the merged block. +- ``IRDoc.split_option`` records the MinerU engine version when available. +- ``IRDoc.bbox_attributes`` defaults to ``{"origin":"LEFTTOP","max":1000}`` + reflecting MinerU's PDF coordinate convention. Operators may override + via ``MINERU_BBOX_ATTRIBUTES`` (JSON string). 
+""" + +from __future__ import annotations + +import json +import os +from pathlib import Path +from typing import Any +from urllib.parse import urlparse + +from lightrag.sidecar.ir import ( + AssetSpec, + IRBlock, + IRDoc, + IRDrawing, + IREquation, + IRPosition, + IRTable, +) +from lightrag.utils import logger + + +PREFACE_HEADING = "Preface/Uncategorized" +CONTENT_LIST_FILENAME = "content_list.json" + + +class MinerUIRBuilder: + """Stateless except for env-driven config. Reusable across calls.""" + + def __init__(self) -> None: + self.engine_version = os.getenv("MINERU_ENGINE_VERSION", "").strip() + # Mirror MinerURawClient.__init__: when this is set, the downloader + # stores ALL referenced images (including relative ones) under + # ``images/``. The builder has to look in the same place. + self.image_url_template = os.getenv("MINERU_IMAGE_URL_TEMPLATE", "").strip() + self.bbox_attributes = self._load_bbox_attributes_env() + + def _load_bbox_attributes_env(self) -> dict[str, Any]: + default = {"origin": "LEFTTOP", "max": 1000} + raw = os.getenv("MINERU_BBOX_ATTRIBUTES", "").strip() + if not raw: + return default + try: + parsed = json.loads(raw) + except json.JSONDecodeError as exc: + logger.warning( + "[mineru_ir_builder] MINERU_BBOX_ATTRIBUTES is not valid JSON " + "(%s); falling back to default %s", + exc, + default, + ) + return default + if not isinstance(parsed, dict): + logger.warning( + "[mineru_ir_builder] MINERU_BBOX_ATTRIBUTES must decode to a JSON " + "object, got %s; falling back to default %s", + type(parsed).__name__, + default, + ) + return default + return parsed + + # ------------------------------------------------------------------ + # Entry point + # ------------------------------------------------------------------ + + def normalize_from_workdir( + self, + raw_dir: Path, + *, + document_name: str, + ) -> IRDoc: + """Read ``raw_dir/content_list.json`` and emit an IRDoc. + + ``document_name`` is the canonical filename (e.g. 
``foo.pdf``) used + for ``meta.document_name``; resolved by the caller from the parser + hint chain. + """ + content_list_path = raw_dir / "content_list.json" + if not content_list_path.is_file(): + raise FileNotFoundError( + f"MinerU raw bundle missing content_list.json at {raw_dir}" + ) + content_list = json.loads(content_list_path.read_text(encoding="utf-8")) + if not isinstance(content_list, list): + raise ValueError( + f"MinerU content_list.json malformed (not a JSON array) at {raw_dir}" + ) + return self._normalize_content_list( + content_list, raw_dir, document_name=document_name + ) + + # ------------------------------------------------------------------ + # Core + # ------------------------------------------------------------------ + + def _normalize_content_list( + self, + content_list: list[Any], + raw_dir: Path, + *, + document_name: str, + ) -> IRDoc: + document_format = Path(document_name).suffix.lower().lstrip(".") + + blocks: list[IRBlock] = [] + assets: list[AssetSpec] = [] + seen_assets: dict[str, str] = {} # ref → suggested_name + doc_title = "" + placeholder_counter = 0 + + def _next_key(prefix: str) -> str: + nonlocal placeholder_counter + placeholder_counter += 1 + return f"{prefix}{placeholder_counter}" + + # Heading hierarchy stack — index = level-1 (level 1 lives at [0]). + heading_stack: list[str] = [] + + # Current-block accumulator. The block is materialized when the next + # heading arrives (or at end-of-document). The initial block is the + # synthetic "Preface/Uncategorized" container at level 0. + cb_lines: list[str] = [] + cb_tables: list[IRTable] = [] + cb_drawings: list[IRDrawing] = [] + cb_equations: list[IREquation] = [] + # Positions are split into two channels: + # - ``cb_page_set`` collects ``page_idx`` of bbox-less items; at flush + # each unique page becomes one anchor-only summary ``IRPosition``. 
+ # - ``cb_bbox_positions`` keeps one fine-grained position per item that + # carried a parseable bbox (anchor + range), in source order, with + # no deduplication. + cb_page_set: set[str] = set() + cb_bbox_positions: list[IRPosition] = [] + cb_heading = PREFACE_HEADING + cb_level = 0 + cb_parents: list[str] = [] + # ``cb_has_body`` flips True the moment we accumulate any non-heading + # payload into the current block. While it stays False, an adjacent + # deeper heading is folded into this block as a body line (aligning + # with the native docx parser's behaviour for back-to-back headings). + cb_has_body = False + + def _record_position(item: dict) -> None: + """Route an item's positional info into the right channel. + + Items with a parseable ``bbox`` produce one fine-grained + IRPosition appended to ``cb_bbox_positions`` (no dedupe). + Otherwise, ``page_idx`` (if any) is added to ``cb_page_set`` + and emitted as a single anchor-only summary entry at flush. + """ + bbox_pos = _extract_bbox_position(item) + if bbox_pos is not None: + cb_bbox_positions.append(bbox_pos) + return + page = _extract_page_anchor(item) + if page is not None: + cb_page_set.add(page) + + def _flush_block() -> None: + """Emit the in-flight block if it carries any content.""" + nonlocal cb_lines, cb_tables, cb_drawings, cb_equations + nonlocal cb_page_set, cb_bbox_positions, cb_has_body + has_payload = bool(cb_lines or cb_tables or cb_drawings or cb_equations) + if not has_payload: + return + content = "\n".join(line for line in cb_lines if line) + if not content.strip() and not (cb_tables or cb_drawings or cb_equations): + # Reset and skip — nothing meaningful to emit. 
+ cb_lines = [] + cb_page_set = set() + cb_bbox_positions = [] + cb_has_body = False + return + positions = [ + IRPosition(type="bbox", anchor=p) + for p in _sort_page_anchors(cb_page_set) + ] + list(cb_bbox_positions) + blocks.append( + IRBlock( + content_template=content, + heading=cb_heading, + level=cb_level, + parent_headings=list(cb_parents), + positions=positions, + tables=list(cb_tables), + drawings=list(cb_drawings), + equations=list(cb_equations), + ) + ) + cb_lines = [] + cb_tables = [] + cb_drawings = [] + cb_equations = [] + cb_page_set = set() + cb_bbox_positions = [] + cb_has_body = False + + def _open_block(heading: str, level: int, parents: list[str]) -> None: + nonlocal cb_heading, cb_level, cb_parents + cb_heading = heading + cb_level = level + cb_parents = parents + # Render the heading line into the block body so the merged + # text reads like markdown (``# Foo`` / ``## Bar`` / …). + md_prefix = "#" * max(level, 1) + cb_lines.append(f"{md_prefix} {heading}") + + def _append_text(text: str) -> bool: + """Append ``text`` to the current block body and return whether + anything was actually written. Callers use the return value to + decide whether to also record the item's source position — an + empty text item must NOT leak its ``page_idx`` to the block. + """ + nonlocal cb_has_body + if not text: + return False + cb_lines.append(text) + cb_has_body = True + return True + + def _merge_heading_as_body(heading: str, level: int) -> None: + """Fold an adjacent deeper heading into the current block. + + The line keeps its markdown ``#`` prefix so the rendered block + still reads as ``# Section\n## Subsection``. Does NOT flip + ``cb_has_body`` — successive headings can keep folding until a + real body item lands. 
+ """ + md_prefix = "#" * max(level, 1) + cb_lines.append(f"{md_prefix} {heading}") + + for item_index, item in enumerate(content_list): + if not isinstance(item, dict): + continue + item_type = str(item.get("type") or item.get("label") or "").lower() + + heading_text, heading_level = _detect_heading(item, item_type) + if heading_text: + # Heading hierarchy is updated unconditionally so deeper + # parents resolve correctly once the next real body item + # opens a fresh block. + heading_stack = heading_stack[: max(heading_level - 1, 0)] + parents = [h for h in heading_stack if h] + heading_stack.append(heading_text) + + # Adjacency merge: previous block is a real heading with no + # body yet AND the new heading is strictly deeper — append + # this heading as body to the existing block instead of + # flushing. (Preface, level=0, is never merged into.) + if cb_level > 0 and not cb_has_body and heading_level > cb_level: + _merge_heading_as_body(heading_text, heading_level) + _record_position(item) + if not doc_title and heading_level == 1: + doc_title = heading_text + continue + + _flush_block() + _open_block(heading_text, heading_level, parents) + _record_position(item) + + if not doc_title and heading_level == 1: + doc_title = heading_text + continue + + if item_type == "text": + if _append_text(_coerce_text(item)): + _record_position(item) + continue + + if item_type == "list": + items = item.get("list_items") + if isinstance(items, list): + text = "\n".join(str(x) for x in items if str(x).strip()) + else: + text = _coerce_text(item) + if _append_text(text): + _record_position(item) + continue + + if item_type == "code": + if _append_text(item.get("code_body") or _coerce_text(item)): + _record_position(item) + continue + + if item_type == "equation": + latex_raw = _coerce_text(item) + if not latex_raw: + # Spec compliance fix: empty equation must not enter sidecar. 
+ continue + # Preserve MinerU's raw latex (including any ``$$``/``$`` + # wrappers); the writer strips them when emitting + # equations.json so blocks.jsonl shows the raw form while + # the per-equation sidecar holds clean latex. + latex = latex_raw.strip() + is_block = _is_block_equation(item) + caption = str(item.get("caption") or "") + placeholder = _next_key("eq") + token = "EQ" if is_block else "EQI" + cb_equations.append( + IREquation( + placeholder_key=placeholder, + latex=latex, + is_block=is_block, + caption=caption, + footnotes=_as_str_list(item.get("footnotes")), + self_ref=_content_list_self_ref(item_index) if is_block else "", + ) + ) + cb_lines.append(f"{{{{{token}:{placeholder}}}}}") + cb_has_body = True + _record_position(item) + continue + + if item_type == "table": + table = self._build_ir_table(item) + if table is None: + # Empty body — _build_ir_table already logged the drop. + # Skip placeholder allocation and position recording so + # the misidentified item leaves no trace in the IR. + continue + placeholder = _next_key("tb") + table.placeholder_key = placeholder + table.self_ref = _content_list_self_ref(item_index) + cb_tables.append(table) + cb_lines.append(f"{{{{TBL:{placeholder}}}}}") + cb_has_body = True + _record_position(item) + continue + + if item_type in {"image", "picture", "drawing"}: + drawing, asset = self._build_ir_drawing(item, raw_dir, seen_assets) + placeholder = _next_key("im") + drawing.placeholder_key = placeholder + drawing.self_ref = _content_list_self_ref(item_index) + if asset is not None and asset.ref not in {a.ref for a in assets}: + assets.append(asset) + cb_drawings.append(drawing) + cb_lines.append(f"{{{{IMG:{placeholder}}}}}") + cb_has_body = True + _record_position(item) + continue + + # Fallback: serialize unknown items as plain text so we don't + # silently drop information. 
Position only recorded when the + # fallback actually contributed text — empty unknown items must + # not leak their page_idx into the current block. + if _append_text(_coerce_text(item)): + _record_position(item) + + _flush_block() + + if not doc_title: + doc_title = Path(document_name).stem or document_name + + split_option: dict[str, Any] = {} + if self.engine_version: + split_option["engine_version"] = self.engine_version + # Reserved hook for later: detect OCR flag from middle.json / config. + + return IRDoc( + document_name=document_name, + document_format=document_format, + doc_title=doc_title, + split_option=split_option, + blocks=blocks, + assets=assets, + bbox_attributes=dict(self.bbox_attributes), + ) + + # ------------------------------------------------------------------ + # Tables / drawings + # ------------------------------------------------------------------ + + def _build_ir_table(self, item: dict) -> IRTable | None: + rows: list[list[str]] | None = None + html: str | None = None + body_field = item.get("rows") + body = body_field if body_field is not None else item.get("table_body") + + if isinstance(body, list): + rows = _normalize_grid(body) + elif isinstance(body, str): + stripped = body.strip() + if stripped.startswith("[") and stripped.endswith("]"): + try: + decoded = json.loads(stripped) + if isinstance(decoded, list): + rows = _normalize_grid(decoded) + except json.JSONDecodeError: + pass + if rows is None: + html = stripped or None + elif isinstance(body, dict): + grid = body.get("grid") or body.get("rows") + if isinstance(grid, list): + rows = _normalize_grid(grid) + else: + html = json.dumps(body, ensure_ascii=False) + + # MinerU occasionally emits table items with no usable body (e.g. when + # a page number or blank region is misidentified as a table). Dropping + # them here keeps the sidecar free of items that would later trip the + # analyze worker's "missing table content" hard-failure path. 
+ if not _ir_table_body_has_content(rows, html): + logger.debug( + "[mineru_ir_builder] dropping empty table item " + "(body type=%s, num_rows=%s, num_cols=%s)", + type(body).__name__, + item.get("num_rows"), + item.get("num_cols"), + ) + return None + + num_rows = int(item.get("num_rows") or (len(rows) if rows else 0) or 0) + num_cols_default = max((len(r) for r in rows), default=0) if rows else 0 + num_cols = int(item.get("num_cols") or num_cols_default or 0) + + captions = item.get("table_caption") + caption = str(item.get("caption") or "") + if not caption and isinstance(captions, list) and captions: + caption = str(captions[0]) + + table_header_raw = item.get("header") + table_header: list[list[str]] | None = None + if isinstance(table_header_raw, list) and table_header_raw: + table_header = _normalize_grid(table_header_raw) + + return IRTable( + placeholder_key="", # filled by caller + rows=rows, + html=html, + num_rows=num_rows, + num_cols=num_cols, + caption=caption, + footnotes=_as_str_list(item.get("table_footnote") or item.get("footnotes")), + table_header=table_header, + ) + + def _build_ir_drawing( + self, + item: dict, + raw_dir: Path, + seen: dict[str, str], + ) -> tuple[IRDrawing, AssetSpec | None]: + img_path = str(item.get("img_path") or item.get("path") or "") + src_val = str(item.get("src") or "") + captions = item.get("image_caption") or item.get("captions") + caption = str(item.get("caption") or "") + if not caption and isinstance(captions, list) and captions: + caption = str(captions[0]) + + fmt = Path(img_path).suffix.lower().lstrip(".") if img_path else "" + if not fmt: + fmt = str(item.get("format") or "") + + asset: AssetSpec | None = None + ref = "" + if img_path: + ref = img_path + if ref in seen: + # Already declared by a previous block; reuse name. + pass + else: + # Asset source: file on disk inside raw_dir. 
``img_path`` is + # untrusted (it comes from MinerU's content_list.json or a + # downloaded zip), so we go through a safe resolver that + # refuses to escape ``raw_dir`` and mirrors the downloader's + # storage layout for absolute-URL / templated references. + local_path = _safe_local_asset_path( + raw_dir, + img_path, + image_url_template=self.image_url_template, + ) + suggested_name = _suggested_asset_name(img_path, fmt, len(seen)) + asset = AssetSpec( + ref=ref, + suggested_name=suggested_name, + source=local_path + if local_path is not None and local_path.is_file() + else None, + ) + seen[ref] = suggested_name + + drawing = IRDrawing( + placeholder_key="", # filled by caller + asset_ref=ref, + fmt=fmt, + caption=caption, + footnotes=_as_str_list(item.get("image_footnote") or item.get("footnotes")), + src=src_val, + ) + return drawing, asset + + +# ---------------------------------------------------------------------- +# helpers +# ---------------------------------------------------------------------- + + +def _detect_heading(item: dict, item_type: str) -> tuple[str, int]: + """Return ``(heading_text, level)`` if ``item`` is a heading, else ``("", 0)``. + + A heading is either an explicit ``title``/``section_header`` block, or a + ``text`` block whose ``text_level`` is positive (MinerU's convention). 
+ """ + if item_type in {"title", "section_header"}: + text = _coerce_text(item).strip() + level = max(int(item.get("text_level") or item.get("level") or 1), 1) + return text, level + if item_type == "text": + try: + tl = int(item.get("text_level") or 0) + except (TypeError, ValueError): + tl = 0 + if tl > 0: + return _coerce_text(item).strip(), tl + return "", 0 + + +def _coerce_text(item: dict) -> str: + for key in ("text", "content", "body", "code_body"): + val = item.get(key) + if isinstance(val, str) and val.strip(): + return val + return "" + + +def _as_str_list(value: Any) -> list[str]: + if value is None: + return [] + if isinstance(value, list): + return [str(x) for x in value if str(x).strip()] + s = str(value).strip() + return [s] if s else [] + + +def _content_list_self_ref(index: int) -> str: + return f"{CONTENT_LIST_FILENAME}#/{index}" + + +def _normalize_grid(grid: Any) -> list[list[str]]: + out: list[list[str]] = [] + if not isinstance(grid, list): + return out + for row in grid: + if not isinstance(row, list): + continue + out_row: list[str] = [] + for cell in row: + if isinstance(cell, dict): + out_row.append(str(cell.get("text", "")).strip()) + else: + out_row.append(str(cell).strip()) + out.append(out_row) + return out + + +def _ir_table_body_has_content(rows: list[list[str]] | None, html: str | None) -> bool: + """True iff the parsed table body carries any visible cell text or HTML.""" + if html and html.strip(): + return True + if rows: + for row in rows: + for cell in row: + if isinstance(cell, str) and cell.strip(): + return True + return False + + +def _is_block_equation(item: dict) -> bool: + """Heuristic: MinerU's ``text_format`` distinguishes block vs inline. + + Fallback when absent: treat as block (most MinerU equation items in + PDF context represent display equations); inline equations are usually + embedded inside ``text`` items rather than first-class ``equation`` + items. 
+ """ + fmt = str(item.get("text_format") or "").lower() + if fmt in {"inline", "inline_equation"}: + return False + if fmt in {"block", "block_equation", "display"}: + return True + return True + + +def _extract_page_anchor(item: dict) -> str | None: + """Return a 1-based page anchor from MinerU's ``page_idx`` / ``page``. + + Always returns a string so ``blocks.jsonl`` carries a uniform anchor + type across Roman / letter / numeric page labels. Integers are bumped + to 1-based (``page_idx=0`` → ``"1"``); strings are stripped and passed + through verbatim. Returns ``None`` when no usable page info is present. + """ + page_raw = item.get("page_idx") + if page_raw is None: + page_raw = item.get("page") + if isinstance(page_raw, bool): + # bool is a subclass of int — guard so True/False don't sneak in. + return None + if isinstance(page_raw, int): + return str(page_raw + 1 if page_raw >= 0 else page_raw) + if isinstance(page_raw, str) and page_raw.strip(): + return page_raw.strip() + return None + + +def _sort_page_anchors(pages: set[str]) -> list[str]: + """Order page anchors using book pagination convention. + + Non-numeric labels (Roman preface pages ``i``/``ii``/``iv``…, letter + pages like ``A``, ``B-1``) come first in lexical order; numeric labels + follow, sorted by their integer value so ``"2"`` precedes ``"10"``. + Mixing both kinds is safe: bucketing avoids the ``ValueError`` that + ``sorted(pages, key=int)`` raises on a non-numeric label such as ``"ii"``. + """ + non_numeric = sorted(p for p in pages if not p.isdigit()) + numeric = sorted((p for p in pages if p.isdigit()), key=int) + return non_numeric + numeric + + +def _extract_bbox_position(item: dict) -> IRPosition | None: + """Build a fine-grained ``IRPosition`` when ``bbox`` is parseable. + + Returns ``None`` when ``bbox`` is missing or malformed; the caller then + falls back to page-only tracking via :func:`_extract_page_anchor`.
+ """ + bbox = item.get("bbox") + if not isinstance(bbox, (list, tuple)) or len(bbox) < 4: + return None + try: + coords = [float(x) for x in bbox[:4]] + except (TypeError, ValueError): + return None + return IRPosition(type="bbox", anchor=_extract_page_anchor(item), range=coords) + + +def _safe_local_asset_path( + raw_dir: Path, + img_path: str, + *, + image_url_template: str = "", +) -> Path | None: + """Resolve ``img_path`` to a concrete file location inside ``raw_dir``. + + ``img_path`` comes from MinerU's ``content_list.json`` and is therefore + untrusted. This resolver mirrors :meth:`MinerURawClient._fetch_one_image` + storage rules so the builder always looks where the downloader wrote + the file: + + - absolute http(s) URLs and absolute filesystem paths + → ``raw_dir/images/``; + - any ref when ``MINERU_IMAGE_URL_TEMPLATE`` is configured (the + downloader routes ALL refs — including relative ones — through + :meth:`_image_dest_rel`) → ``raw_dir/images/``; + - otherwise relative paths resolve under ``raw_dir`` with ``..`` + traversal refused and a final ``Path.relative_to`` check. + + Returns ``None`` when the candidate is unsafe or cannot be expressed + inside ``raw_dir``. The caller treats ``None`` the same as "file missing" + — the drawing tag still gets written, but no bytes are copied. + """ + if not img_path: + return None + + if img_path.startswith(("http://", "https://")): + name = Path(urlparse(img_path).path).name + return raw_dir / "images" / name if name else None + + if os.path.isabs(img_path): + # Absolute filesystem path in img_path is never trusted to point + # outside raw_dir; mirror the downloader's basename rule. + name = Path(img_path).name + return raw_dir / "images" / name if name else None + + if image_url_template: + # Templated mode: downloader stored every ref (incl. relative) at + # images/, so we must look there too. 
+ name = Path(img_path).name + return raw_dir / "images" / name if name else None + + normalized = os.path.normpath(img_path) + if normalized.startswith("..") or os.path.isabs(normalized): + return None + candidate = (raw_dir / normalized).resolve() + try: + candidate.relative_to(raw_dir.resolve()) + except ValueError: + return None + return candidate + + +def _suggested_asset_name(img_path: str, fmt: str, seen_count: int) -> str: + """Pick an in-assets-dir filename for an asset. + + For URL refs, use the URL path's basename so we get a useful filename + (``foo.png`` rather than the whole URL). For local refs, the regular + basename. Falls back to ``image-[.fmt]`` when nothing usable. + """ + if img_path.startswith(("http://", "https://")): + name = Path(urlparse(img_path).path).name + else: + name = Path(img_path).name + if name: + return name + return f"image-{seen_count + 1}{('.' + fmt) if fmt else ''}" + + +__all__ = ["MinerUIRBuilder"] diff --git a/lightrag/external_parser/mineru/manifest.py b/lightrag/external_parser/mineru/manifest.py new file mode 100644 index 0000000000..f42bd9d742 --- /dev/null +++ b/lightrag/external_parser/mineru/manifest.py @@ -0,0 +1,164 @@ +"""``_manifest.json`` schema for ``*.mineru_raw/`` bundles. + +The manifest is the *atomic success marker* for a raw bundle. Its presence +implies "all files in this directory finished downloading"; its content is +the cache key for "is this bundle for the same source file, the same MinerU +parser options, engine version, and endpoint we are using right now?". + +Write path: ``write_manifest(path, manifest)`` writes a temp file then +atomically renames to ``_manifest.json``. A crash mid-download leaves no +manifest, so the next ``parse_mineru`` call cleanly invalidates and +re-downloads. + +Read path: ``load_manifest(path)`` returns ``None`` if absent or malformed +— either way the bundle is treated as stale. 
+""" + +from __future__ import annotations + +import json +import os +from dataclasses import asdict, dataclass +from pathlib import Path + +MANIFEST_FILENAME = "_manifest.json" +MANIFEST_VERSION = "1.0" +MANIFEST_ENGINE = "mineru" + + +@dataclass +class ManifestFile: + """One file entry inside the bundle. Size always; sha256 only for the + critical file (content_list.json) — see :class:`Manifest.critical_file`. + """ + + path: str # relative to the raw dir + size: int + sha256: str | None = None # ``"sha256:"`` form or ``None`` + + +@dataclass +class Manifest: + """Schema for ``_manifest.json``. Backward-compat policy: new optional + fields can be added without bumping version; **any** mismatch on existing + field semantics requires a version bump. + """ + + source_content_hash: str # ``"sha256:"`` of source file + source_size_bytes: int + source_filename_at_parse: str + critical_file: ManifestFile # content_list.json; size + sha256 + files: list[ManifestFile] # other files; size only + total_size_bytes: int + task_id: str = "" + api_mode: str = "" + engine_version: str = "" + endpoint_signature: str = "" + options_signature: str = "" + downloaded_at: str = "" + version: str = MANIFEST_VERSION + engine: str = MANIFEST_ENGINE + + def to_dict(self) -> dict: + return { + "version": self.version, + "engine": self.engine, + "api_mode": self.api_mode, + "engine_version": self.engine_version, + "endpoint_signature": self.endpoint_signature, + "options_signature": self.options_signature, + "source_content_hash": self.source_content_hash, + "source_size_bytes": int(self.source_size_bytes), + "source_filename_at_parse": self.source_filename_at_parse, + "task_id": self.task_id, + "downloaded_at": self.downloaded_at, + "critical_file": asdict(self.critical_file), + "files": [asdict(f) for f in self.files], + "total_size_bytes": int(self.total_size_bytes), + } + + @classmethod + def from_dict(cls, payload: dict) -> "Manifest": + critical_raw = payload.get("critical_file") or 
{} + files_raw = payload.get("files") or [] + return cls( + version=str(payload.get("version") or MANIFEST_VERSION), + engine=str(payload.get("engine") or MANIFEST_ENGINE), + api_mode=str(payload.get("api_mode") or ""), + engine_version=str(payload.get("engine_version") or ""), + endpoint_signature=str(payload.get("endpoint_signature") or ""), + options_signature=str(payload.get("options_signature") or ""), + source_content_hash=str(payload.get("source_content_hash") or ""), + source_size_bytes=int(payload.get("source_size_bytes") or 0), + source_filename_at_parse=str(payload.get("source_filename_at_parse") or ""), + task_id=str(payload.get("task_id") or ""), + downloaded_at=str(payload.get("downloaded_at") or ""), + critical_file=ManifestFile( + path=str(critical_raw.get("path") or ""), + size=int(critical_raw.get("size") or 0), + sha256=( + str(critical_raw["sha256"]) if critical_raw.get("sha256") else None + ), + ), + files=[ + ManifestFile( + path=str(f.get("path") or ""), + size=int(f.get("size") or 0), + sha256=(str(f["sha256"]) if f.get("sha256") else None), + ) + for f in files_raw + if isinstance(f, dict) + ], + total_size_bytes=int(payload.get("total_size_bytes") or 0), + ) + + +def manifest_path(raw_dir: Path) -> Path: + return raw_dir / MANIFEST_FILENAME + + +def load_manifest(raw_dir: Path) -> Manifest | None: + """Return the parsed manifest or ``None`` if absent / malformed.""" + p = manifest_path(raw_dir) + if not p.is_file(): + return None + try: + payload = json.loads(p.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + return None + if not isinstance(payload, dict): + return None + if payload.get("version") != MANIFEST_VERSION: + return None + if payload.get("engine") != MANIFEST_ENGINE: + return None + try: + return Manifest.from_dict(payload) + except (TypeError, ValueError): + return None + + +def write_manifest(raw_dir: Path, manifest: Manifest) -> None: + """Atomically write the manifest. 
The temp-file + rename pattern + guarantees the manifest never appears in a partially-written state.""" + raw_dir.mkdir(parents=True, exist_ok=True) + final = manifest_path(raw_dir) + tmp = final.with_suffix(".json.tmp") + tmp.write_text( + json.dumps(manifest.to_dict(), ensure_ascii=False, indent=2), + encoding="utf-8", + ) + os.replace(tmp, final) + + +# Re-exported for convenience. +__all__ = [ + "MANIFEST_FILENAME", + "MANIFEST_VERSION", + "MANIFEST_ENGINE", + "Manifest", + "ManifestFile", + "load_manifest", + "manifest_path", + "write_manifest", +] diff --git a/lightrag/kg/factory.py b/lightrag/kg/factory.py new file mode 100644 index 0000000000..580cf0a6d6 --- /dev/null +++ b/lightrag/kg/factory.py @@ -0,0 +1,42 @@ +"""Storage backend class factory. + +Resolves a storage backend name (e.g. ``"JsonKVStorage"``) to its concrete +implementation class. The four default backends are imported directly so +they always work without depending on the ``STORAGES`` registry; everything +else is resolved lazily through the registry. +""" + +from __future__ import annotations + +import importlib +from typing import Any, Callable + +from lightrag.kg import STORAGES + + +def get_storage_class(storage_name: str) -> Callable[..., Any]: + """Return the storage backend class for ``storage_name``.""" + if storage_name == "JsonKVStorage": + from lightrag.kg.json_kv_impl import JsonKVStorage + + return JsonKVStorage + if storage_name == "NanoVectorDBStorage": + from lightrag.kg.nano_vector_db_impl import NanoVectorDBStorage + + return NanoVectorDBStorage + if storage_name == "NetworkXStorage": + from lightrag.kg.networkx_impl import NetworkXStorage + + return NetworkXStorage + if storage_name == "JsonDocStatusStorage": + from lightrag.kg.json_doc_status_impl import JsonDocStatusStorage + + return JsonDocStatusStorage + + # Fallback to dynamic import for other storage implementations. + # STORAGES values are relative paths (e.g. 
".kg.postgres_impl") authored + # against the top-level ``lightrag`` package, so anchor the import there + # rather than letting it resolve against this module's own package. + import_path = STORAGES[storage_name] + module = importlib.import_module(import_path, package="lightrag") + return getattr(module, storage_name) diff --git a/lightrag/kg/json_doc_status_impl.py b/lightrag/kg/json_doc_status_impl.py index d5c790b098..17570795b7 100644 --- a/lightrag/kg/json_doc_status_impl.py +++ b/lightrag/kg/json_doc_status_impl.py @@ -242,6 +242,7 @@ async def get_by_id(self, id: str) -> Union[dict[str, Any], None]: async def get_docs_paginated( self, status_filter: DocStatus | None = None, + status_filters: list[DocStatus] | None = None, page: int = 1, page_size: int = 50, sort_field: str = "updated_at", @@ -259,6 +260,11 @@ async def get_docs_paginated( Returns: Tuple of (list of (doc_id, DocProcessingStatus) tuples, total_count) """ + status_filter_values = self.resolve_status_filter_values( + status_filter=status_filter, + status_filters=status_filters, + ) + # Validate parameters if page < 1: page = 1 @@ -280,8 +286,8 @@ async def get_docs_paginated( for doc_id, doc_data in self._data.items(): # Apply status filter if ( - status_filter is not None - and doc_data.get("status") != status_filter.value + status_filter_values is not None + and doc_data.get("status") not in status_filter_values ): continue @@ -394,6 +400,43 @@ async def get_doc_by_file_path(self, file_path: str) -> Union[dict[str, Any], No return None + async def get_doc_by_file_basename( + self, basename: str + ) -> Union[tuple[str, dict[str, Any]], None]: + """Find an existing record whose canonical basename matches. + + The caller is responsible for passing an already-canonical basename. + Stored ``file_path`` values are canonicalized by the business layer, so + this lookup intentionally performs an exact match only. 
+ """ + if not basename: + return None + if self._storage_lock is None: + raise StorageNotInitializedError("JsonDocStatusStorage") + + if basename == "unknown_source": + return None + async with self._storage_lock: + for doc_id, doc_data in self._data.items(): + if doc_data.get("file_path") == basename: + return doc_id, doc_data + return None + + async def get_doc_by_content_hash( + self, content_hash: str + ) -> Union[tuple[str, dict[str, Any]], None]: + """Find an existing record whose content_hash field matches.""" + if not content_hash: + return None + if self._storage_lock is None: + raise StorageNotInitializedError("JsonDocStatusStorage") + + async with self._storage_lock: + for doc_id, doc_data in self._data.items(): + if doc_data.get("content_hash") == content_hash: + return doc_id, doc_data + return None + async def drop(self) -> dict[str, str]: """Drop all document status data from storage and clean up resources diff --git a/lightrag/kg/mongo_impl.py b/lightrag/kg/mongo_impl.py index 4123d92edb..01f5bdc011 100644 --- a/lightrag/kg/mongo_impl.py +++ b/lightrag/kg/mongo_impl.py @@ -561,6 +561,16 @@ async def create_and_migrate_indexes_if_not_exists(self): "keys": [("status", 1), ("file_path", 1)], "collation": collation_config, }, + # Partial index on content_hash for content-based dedup lookups. + # Mirrors the PG partial index: skip legacy/empty values so the + # index stays small and a content_hash="" query is a guaranteed miss. + { + "name": f"{workspace_prefix}content_hash", + "keys": [("content_hash", 1)], + "partialFilterExpression": { + "content_hash": {"$exists": True, "$type": "string", "$gt": ""} + }, + }, ] # 2. 
Handle legacy index cleanup: only drop old indexes that exist in THIS collection @@ -573,6 +583,7 @@ async def create_and_migrate_indexes_if_not_exists(self): "created_at", "id", "track_id", + "content_hash", ] for legacy_name in legacy_index_names: @@ -599,6 +610,10 @@ async def create_and_migrate_indexes_if_not_exists(self): create_kwargs = {"name": index_name} if "collation" in index_info: create_kwargs["collation"] = index_info["collation"] + if "partialFilterExpression" in index_info: + create_kwargs["partialFilterExpression"] = index_info[ + "partialFilterExpression" + ] try: await self._data.create_index( @@ -625,6 +640,7 @@ async def create_and_migrate_indexes_if_not_exists(self): async def get_docs_paginated( self, status_filter: DocStatus | None = None, + status_filters: list[DocStatus] | None = None, page: int = 1, page_size: int = 50, sort_field: str = "updated_at", @@ -642,6 +658,11 @@ async def get_docs_paginated( Returns: Tuple of (list of (doc_id, DocProcessingStatus) tuples, total_count) """ + status_filter_values = self.resolve_status_filter_values( + status_filter=status_filter, + status_filters=status_filters, + ) + # Validate parameters if page < 1: page = 1 @@ -658,8 +679,8 @@ async def get_docs_paginated( # Build query filter query_filter = {} - if status_filter is not None: - query_filter["status"] = status_filter.value + if status_filter_values is not None: + query_filter["status"] = {"$in": sorted(status_filter_values)} # Get total count total_count = await self._data.count_documents(query_filter) @@ -742,6 +763,58 @@ async def get_doc_by_file_path(self, file_path: str) -> Union[dict[str, Any], No """ return await self._data.find_one({"file_path": file_path}) + async def get_doc_by_file_basename( + self, basename: str + ) -> Union[tuple[str, dict[str, Any]], None]: + """Mongo-native override of basename-based document lookup. 
+ + The caller is responsible for passing an already-canonical basename; + stored ``file_path`` values are canonicalized by the business layer, so + this lookup performs an exact match only and relies on the file_path + index created by ``create_and_migrate_indexes_if_not_exists``. + """ + if not basename: + return None + if basename == "unknown_source": + return None + + try: + doc = await self._data.find_one({"file_path": basename}) + except PyMongoError as e: + logger.error(f"[{self.workspace}] Error in get_doc_by_file_basename: {e}") + return None + if not doc: + return None + doc_id = doc.get("_id") + if doc_id is None: + return None + return str(doc_id), doc + + async def get_doc_by_content_hash( + self, content_hash: str + ) -> Union[tuple[str, dict[str, Any]], None]: + """Mongo-native override of content-hash document lookup. + + Uses the partial ``content_hash`` index. Empty strings are treated as a + miss to align with the partial-index predicate; legacy rows missing the + field cannot match a non-empty query because ``find_one`` requires an + exact value. 
+ """ + if not content_hash: + return None + + try: + doc = await self._data.find_one({"content_hash": content_hash}) + except PyMongoError as e: + logger.error(f"[{self.workspace}] Error in get_doc_by_content_hash: {e}") + return None + if not doc: + return None + doc_id = doc.get("_id") + if doc_id is None: + return None + return str(doc_id), doc + @final @dataclass diff --git a/lightrag/kg/opensearch_impl.py b/lightrag/kg/opensearch_impl.py index 3f2741dd26..db2c62c961 100644 --- a/lightrag/kg/opensearch_impl.py +++ b/lightrag/kg/opensearch_impl.py @@ -632,6 +632,7 @@ async def _create_index_if_not_exists(self): "status": {"type": "keyword"}, "file_path": {"type": "keyword"}, "track_id": {"type": "keyword"}, + "content_hash": {"type": "keyword"}, "created_at": {"type": "date"}, "updated_at": {"type": "date"}, }, @@ -649,6 +650,7 @@ async def _create_index_if_not_exists(self): ) else: await _verify_mirrored_id_mapping(self.client, self._index_name) + await self._ensure_content_hash_mapping() except RequestError as e: if "resource_already_exists_exception" not in str(e): raise @@ -656,6 +658,38 @@ async def _create_index_if_not_exists(self): logger.error(f"[{self.workspace}] Error creating doc status index: {e}") raise + async def _ensure_content_hash_mapping(self) -> None: + """Add the content_hash keyword mapping to a pre-existing doc status index. + + Indices created by older LightRAG releases lack content_hash entirely. + put_mapping is idempotent for new fields, so this is safe to call every + startup; we only fail loudly when the cluster reports a mapping conflict + (which would indicate dynamic mapping already coerced content_hash to a + different type). 
+ """ + try: + mapping = await self.client.indices.get_mapping(index=self._index_name) + except OpenSearchException: + return + props = ( + mapping.get(self._index_name, {}).get("mappings", {}).get("properties", {}) + ) + if "content_hash" in props: + return + try: + await self.client.indices.put_mapping( + index=self._index_name, + body={"properties": {"content_hash": {"type": "keyword"}}}, + ) + logger.info( + f"[{self.workspace}] Added content_hash keyword mapping to {self._index_name}" + ) + except OpenSearchException as e: + logger.warning( + f"[{self.workspace}] Failed to add content_hash mapping to " + f"{self._index_name}: {e}" + ) + async def finalize(self): """Release the OpenSearch client connection.""" if self.client is not None: @@ -845,6 +879,7 @@ async def get_docs_by_track_id( async def get_docs_paginated( self, status_filter: DocStatus | None = None, + status_filters: list[DocStatus] | None = None, page: int = 1, page_size: int = 50, sort_field: str = "updated_at", @@ -853,6 +888,10 @@ async def get_docs_paginated( """Get documents with pagination using PIT + search_after.""" if not self._index_ready: return [], 0 + status_filter_values = self.resolve_status_filter_values( + status_filter=status_filter, + status_filters=status_filters, + ) page = max(1, page) page_size = max(10, min(200, page_size)) if sort_field == "id": @@ -862,8 +901,11 @@ async def get_docs_paginated( sort_order = "asc" if sort_direction.lower() == "asc" else "desc" query = {"match_all": {}} - if status_filter is not None: - query = {"term": {"status": status_filter.value}} + if status_filter_values is not None: + if len(status_filter_values) == 1: + query = {"term": {"status": next(iter(status_filter_values))}} + else: + query = {"terms": {"status": sorted(status_filter_values)}} skip_count = (page - 1) * page_size @@ -979,6 +1021,68 @@ async def get_doc_by_file_path(self, file_path: str) -> Union[dict[str, Any], No logger.error(f"[{self.workspace}] Error getting doc by 
file_path: {e}") return None + async def get_doc_by_file_basename( + self, basename: str + ) -> Union[tuple[str, dict[str, Any]], None]: + """Find an existing record whose canonical basename matches. + + The caller is responsible for passing an already-canonical basename; + stored ``file_path`` values are canonicalized by the business layer, so + this lookup performs an exact term query against the file_path keyword + field. + """ + if not basename: + return None + if basename == "unknown_source": + return None + if not self._index_ready: + return None + try: + body = {"query": {"term": {"file_path": basename}}, "size": 1} + response = await self.client.search(index=self._index_name, body=body) + hits = response["hits"]["hits"] + if not hits: + return None + hit = hits[0] + doc = hit["_source"] + return hit["_id"], doc + except OpenSearchException as e: + if _is_missing_index_error(e): + self._mark_index_missing() + return None + logger.error(f"[{self.workspace}] Error getting doc by file_basename: {e}") + return None + + async def get_doc_by_content_hash( + self, content_hash: str + ) -> Union[tuple[str, dict[str, Any]], None]: + """Find an existing record whose content_hash field matches. + + Uses the content_hash keyword mapping created by + ``_create_index_if_not_exists`` / ``_ensure_content_hash_mapping``. + Empty values short-circuit so legacy rows without the field cannot + accidentally match via type coercion. 
+ """ + if not content_hash: + return None + if not self._index_ready: + return None + try: + body = {"query": {"term": {"content_hash": content_hash}}, "size": 1} + response = await self.client.search(index=self._index_name, body=body) + hits = response["hits"]["hits"] + if not hits: + return None + hit = hits[0] + doc = hit["_source"] + return hit["_id"], doc + except OpenSearchException as e: + if _is_missing_index_error(e): + self._mark_index_missing() + return None + logger.error(f"[{self.workspace}] Error getting doc by content_hash: {e}") + return None + async def index_done_callback(self) -> None: """Refresh index to make recently indexed documents searchable.""" if not self._index_ready: diff --git a/lightrag/kg/postgres_impl.py b/lightrag/kg/postgres_impl.py index 5cdbd46f7e..68918011df 100644 --- a/lightrag/kg/postgres_impl.py +++ b/lightrag/kg/postgres_impl.py @@ -1272,6 +1272,155 @@ async def _migrate_doc_status_add_metadata_error_msg(self): f"Failed to add metadata/error_msg columns to LIGHTRAG_DOC_STATUS: {e}" ) + async def _migrate_doc_full_add_pipeline_fields(self): + """Add pipeline-derived fields to LIGHTRAG_DOC_FULL if they don't exist. + + Each ALTER is guarded individually so a single failure does not abort + the remaining columns; the migration is idempotent and retried on + every startup until all columns are present. + """ + # content_hash uses TEXT (not VARCHAR(N)) so the column stays + # algorithm-agnostic; future SHA-512 / base64 hashes do not require a + # schema change. process_options is an opaque selector string emitted + # by sanitize_process_options() (e.g. "Fi"). 
+ columns_to_add = [ + ("sidecar_location", "TEXT NULL"), + ("parse_format", "VARCHAR(32) NULL DEFAULT 'raw'"), + ("content_hash", "TEXT NULL"), + ("process_options", "TEXT NULL"), + ("chunk_options", "JSONB NULL DEFAULT '{}'::jsonb"), + ("parse_engine", "VARCHAR(32) NULL"), + ] + try: + existing = await self.query( + """ + SELECT column_name + FROM information_schema.columns + WHERE table_name = 'lightrag_doc_full' + AND column_name = ANY($1) + """, + [[c for c, _ in columns_to_add]], + multirows=True, + ) + existing_names = {row["column_name"] for row in (existing or [])} + except Exception as e: + logger.warning( + f"Failed to inspect LIGHTRAG_DOC_FULL columns for migration: {e}" + ) + existing_names = set() + + for col_name, col_type in columns_to_add: + if col_name in existing_names: + logger.debug(f"Column {col_name} already exists in LIGHTRAG_DOC_FULL") + continue + try: + alter_sql = ( + f"ALTER TABLE LIGHTRAG_DOC_FULL ADD COLUMN {col_name} {col_type}" + ) + logger.info(f"Adding {col_name} column to LIGHTRAG_DOC_FULL table") + await self.execute(alter_sql) + logger.info( + f"Successfully added {col_name} column to LIGHTRAG_DOC_FULL table" + ) + except Exception as e: + logger.error( + f"Failed to add column {col_name} to LIGHTRAG_DOC_FULL: {e}" + ) + + async def _migrate_doc_status_add_content_hash(self): + """Add content_hash column to LIGHTRAG_DOC_STATUS table if it doesn't exist.""" + try: + check_column_sql = """ + SELECT column_name + FROM information_schema.columns + WHERE table_name = 'lightrag_doc_status' + AND column_name = 'content_hash' + """ + column_info = await self.query(check_column_sql) + if not column_info: + logger.info("Adding content_hash column to LIGHTRAG_DOC_STATUS table") + # TEXT (not VARCHAR(N)) so the column is agnostic to the hash + # algorithm; today the pipeline writes 64-char SHA-256 hex. 
+ await self.execute( + "ALTER TABLE LIGHTRAG_DOC_STATUS ADD COLUMN content_hash TEXT NULL" + ) + logger.info( + "Successfully added content_hash column to LIGHTRAG_DOC_STATUS table" + ) + else: + logger.debug( + "content_hash column already exists in LIGHTRAG_DOC_STATUS table" + ) + except Exception as e: + logger.error( + f"Failed to add content_hash column to LIGHTRAG_DOC_STATUS: {e}" + ) + + try: + check_index_sql = """ + SELECT indexname FROM pg_indexes + WHERE tablename = 'lightrag_doc_status' + AND indexname = 'idx_lightrag_doc_status_workspace_content_hash' + """ + index_info = await self.query(check_index_sql) + if not index_info: + logger.info( + "Creating partial index idx_lightrag_doc_status_workspace_content_hash" + ) + await self.execute( + """ + CREATE INDEX IF NOT EXISTS idx_lightrag_doc_status_workspace_content_hash + ON LIGHTRAG_DOC_STATUS (workspace, content_hash) + WHERE content_hash IS NOT NULL AND content_hash <> '' + """ + ) + except Exception as e: + logger.error( + f"Failed to create partial content_hash index on LIGHTRAG_DOC_STATUS: {e}" + ) + + async def _migrate_text_chunks_add_heading_sidecar(self): + """Add heading and sidecar JSONB columns to LIGHTRAG_DOC_CHUNKS if missing.""" + columns_to_add = [ + ("heading", "JSONB NULL DEFAULT '{}'::jsonb"), + ("sidecar", "JSONB NULL DEFAULT '{}'::jsonb"), + ] + try: + existing = await self.query( + """ + SELECT column_name + FROM information_schema.columns + WHERE table_name = 'lightrag_doc_chunks' + AND column_name = ANY($1) + """, + [[c for c, _ in columns_to_add]], + multirows=True, + ) + existing_names = {row["column_name"] for row in (existing or [])} + except Exception as e: + logger.warning( + f"Failed to inspect LIGHTRAG_DOC_CHUNKS columns for migration: {e}" + ) + existing_names = set() + + for col_name, col_type in columns_to_add: + if col_name in existing_names: + logger.debug(f"Column {col_name} already exists in LIGHTRAG_DOC_CHUNKS") + continue + try: + alter_sql = ( + f"ALTER TABLE 
LIGHTRAG_DOC_CHUNKS ADD COLUMN {col_name} {col_type}" + ) + logger.info(f"Adding {col_name} column to LIGHTRAG_DOC_CHUNKS table") + await self.execute(alter_sql) + logger.info( + f"Successfully added {col_name} column to LIGHTRAG_DOC_CHUNKS table" + ) + except Exception as e: + logger.error( + f"Failed to add column {col_name} to LIGHTRAG_DOC_CHUNKS: {e}" + ) + async def _migrate_field_lengths(self): """Migrate database field lengths: entity_name, source_id, target_id, and file_path""" # Define the field changes needed @@ -1582,6 +1731,33 @@ async def check_tables(self): f"PostgreSQL, Failed to create full entities/relations tables: {e}" ) + # Migrate LIGHTRAG_DOC_FULL to add pipeline-derived fields used by the + # JSON storage parity: sidecar_location / parse_format / content_hash / + # process_options / chunk_options / parse_engine + try: + await self._migrate_doc_full_add_pipeline_fields() + except Exception as e: + logger.error( + f"PostgreSQL, Failed to migrate LIGHTRAG_DOC_FULL pipeline fields: {e}" + ) + + # Migrate LIGHTRAG_DOC_STATUS to add content_hash column for content + # dedup queries + try: + await self._migrate_doc_status_add_content_hash() + except Exception as e: + logger.error( + f"PostgreSQL, Failed to migrate LIGHTRAG_DOC_STATUS content_hash field: {e}" + ) + + # Migrate LIGHTRAG_DOC_CHUNKS to add heading / sidecar JSONB columns + try: + await self._migrate_text_chunks_add_heading_sidecar() + except Exception as e: + logger.error( + f"PostgreSQL, Failed to migrate LIGHTRAG_DOC_CHUNKS heading/sidecar fields: {e}" + ) + async def _migrate_create_full_entities_relations_tables(self): """Create LIGHTRAG_FULL_ENTITIES and LIGHTRAG_FULL_RELATIONS tables if they don't exist""" tables_to_check = [ @@ -2239,11 +2415,46 @@ async def get_by_id(self, id: str) -> dict[str, Any] | None: except json.JSONDecodeError: llm_cache_list = [] response["llm_cache_list"] = llm_cache_list + + # Parse heading JSON string back to dict; normalize None/missing to {} + 
heading = response.get("heading") + if isinstance(heading, str): + try: + heading = json.loads(heading) + except json.JSONDecodeError: + heading = {} + if not isinstance(heading, dict): + heading = {} + response["heading"] = heading + + # Parse sidecar JSON string back to dict; normalize None/missing to {} + sidecar = response.get("sidecar") + if isinstance(sidecar, str): + try: + sidecar = json.loads(sidecar) + except json.JSONDecodeError: + sidecar = {} + if not isinstance(sidecar, dict): + sidecar = {} + response["sidecar"] = sidecar + create_time = response.get("create_time", 0) update_time = response.get("update_time", 0) response["create_time"] = create_time response["update_time"] = create_time if update_time == 0 else update_time + if response and is_namespace(self.namespace, NameSpace.KV_STORE_FULL_DOCS): + # Parse chunk_options JSON string back to dict; normalize None/missing to {} + chunk_options = response.get("chunk_options") + if isinstance(chunk_options, str): + try: + chunk_options = json.loads(chunk_options) + except json.JSONDecodeError: + chunk_options = {} + if not isinstance(chunk_options, dict): + chunk_options = {} + response["chunk_options"] = chunk_options + # Special handling for LLM cache to ensure compatibility with _get_cached_extraction_results if response and is_namespace( self.namespace, NameSpace.KV_STORE_LLM_RESPONSE_CACHE @@ -2364,7 +2575,7 @@ def _order_results( return ordered if results and is_namespace(self.namespace, NameSpace.KV_STORE_TEXT_CHUNKS): - # Parse llm_cache_list JSON string back to list for each result + # Parse llm_cache_list / heading / sidecar JSON strings for each result for result in results: llm_cache_list = result.get("llm_cache_list", []) if isinstance(llm_cache_list, str): @@ -2373,11 +2584,44 @@ def _order_results( except json.JSONDecodeError: llm_cache_list = [] result["llm_cache_list"] = llm_cache_list + + heading = result.get("heading") + if isinstance(heading, str): + try: + heading = 
json.loads(heading) + except json.JSONDecodeError: + heading = {} + if not isinstance(heading, dict): + heading = {} + result["heading"] = heading + + sidecar = result.get("sidecar") + if isinstance(sidecar, str): + try: + sidecar = json.loads(sidecar) + except json.JSONDecodeError: + sidecar = {} + if not isinstance(sidecar, dict): + sidecar = {} + result["sidecar"] = sidecar + create_time = result.get("create_time", 0) update_time = result.get("update_time", 0) result["create_time"] = create_time result["update_time"] = create_time if update_time == 0 else update_time + if results and is_namespace(self.namespace, NameSpace.KV_STORE_FULL_DOCS): + for result in results: + chunk_options = result.get("chunk_options") + if isinstance(chunk_options, str): + try: + chunk_options = json.loads(chunk_options) + except json.JSONDecodeError: + chunk_options = {} + if not isinstance(chunk_options, dict): + chunk_options = {} + result["chunk_options"] = chunk_options + # Special handling for LLM cache to ensure compatibility with _get_cached_extraction_results if results and is_namespace( self.namespace, NameSpace.KV_STORE_LLM_RESPONSE_CACHE @@ -2520,7 +2764,8 @@ async def upsert(self, data: dict[str, dict[str, Any]]) -> None: current_time = datetime.datetime.now(timezone.utc).replace(tzinfo=None) for i, (k, v) in enumerate(data.items(), start=1): # Tuple order must match SQL: (workspace, id, tokens, chunk_order_index, - # full_doc_id, content, file_path, llm_cache_list, create_time, update_time) + # full_doc_id, content, file_path, llm_cache_list, heading, sidecar, + # create_time, update_time) batch_values.append( ( self.workspace, @@ -2531,6 +2776,8 @@ async def upsert(self, data: dict[str, dict[str, Any]]) -> None: v["content"], v["file_path"], json.dumps(v.get("llm_cache_list", [])), + json.dumps(v.get("heading") or {}), + json.dumps(v.get("sidecar") or {}), current_time, current_time, ) @@ -2539,9 +2786,30 @@ async def upsert(self, data: dict[str, dict[str, Any]]) -> 
None: elif is_namespace(self.namespace, NameSpace.KV_STORE_FULL_DOCS): upsert_sql = SQL_TEMPLATES["upsert_doc_full"] for i, (k, v) in enumerate(data.items(), start=1): - # Tuple order must match SQL: (id, content, doc_name, workspace) + # Tuple order must match SQL: (id, content, doc_name, workspace, + # sidecar_location, parse_format, content_hash, process_options, + # chunk_options, parse_engine) + # + # All pipeline-derived fields pass through untouched so the + # SQL-level COALESCE guard in upsert_doc_full can distinguish + # "caller did not supply" (None/'') from "caller supplied a + # real value". The 'raw' default for parse_format is provided + # by the column DDL on initial insert; do NOT default it here + # or the COALESCE guard never triggers on subsequent partial + # writes. batch_values.append( - (k, v["content"], v.get("file_path", ""), self.workspace) + ( + k, + v["content"], + v.get("file_path", ""), + self.workspace, + v.get("sidecar_location"), + v.get("parse_format"), + v.get("content_hash"), + v.get("process_options"), + json.dumps(v.get("chunk_options") or {}), + v.get("parse_engine"), + ) ) await _cooperative_yield(i) elif is_namespace(self.namespace, NameSpace.KV_STORE_LLM_RESPONSE_CACHE): @@ -3854,6 +4122,7 @@ async def get_by_id(self, id: str) -> Union[dict[str, Any], None]: metadata=metadata, error_msg=result[0].get("error_msg"), track_id=result[0].get("track_id"), + content_hash=result[0].get("content_hash"), ) async def get_by_ids(self, ids: list[str]) -> list[dict[str, Any]]: @@ -3903,6 +4172,7 @@ async def get_by_ids(self, ids: list[str]) -> list[dict[str, Any]]: "metadata": metadata, "error_msg": row.get("error_msg"), "track_id": row.get("track_id"), + "content_hash": row.get("content_hash"), } ordered_results: list[dict[str, Any] | None] = [] @@ -3960,8 +4230,125 @@ async def get_doc_by_file_path(self, file_path: str) -> Union[dict[str, Any], No metadata=metadata, error_msg=result[0].get("error_msg"), 
track_id=result[0].get("track_id"), + content_hash=result[0].get("content_hash"), ) + async def get_doc_by_file_basename( + self, basename: str + ) -> tuple[str, dict[str, Any]] | None: + """PG-native override of basename-based document lookup. + + Replaces the base-class full-table scan with a database-level query on + the canonical ``file_path`` column. The caller is responsible for + passing an already-canonical basename; storage performs an exact match + only. + """ + if not basename: + return None + + if basename == "unknown_source": + return None + + sql = ( + "SELECT * FROM LIGHTRAG_DOC_STATUS " + "WHERE workspace=$1 AND file_path = $2 " + "ORDER BY created_at ASC, id ASC LIMIT 1" + ) + params = [self.workspace, basename] + + result = await self.db.query(sql, params, True) + if not result: + return None + row = result[0] + + chunks_list = row.get("chunks_list", []) + if isinstance(chunks_list, str): + try: + chunks_list = json.loads(chunks_list) + except json.JSONDecodeError: + chunks_list = [] + + metadata = row.get("metadata", {}) + if isinstance(metadata, str): + try: + metadata = json.loads(metadata) + except json.JSONDecodeError: + metadata = {} + + created_at = self._format_datetime_with_timezone(row["created_at"]) + updated_at = self._format_datetime_with_timezone(row["updated_at"]) + + doc = dict( + content_length=row["content_length"], + content_summary=row["content_summary"], + status=row["status"], + chunks_count=row["chunks_count"], + created_at=created_at, + updated_at=updated_at, + file_path=row["file_path"], + chunks_list=chunks_list, + metadata=metadata, + error_msg=row.get("error_msg"), + track_id=row.get("track_id"), + content_hash=row.get("content_hash"), + ) + return str(row["id"]), doc + + async def get_doc_by_content_hash( + self, content_hash: str + ) -> tuple[str, dict[str, Any]] | None: + """PG-native override of content-hash document lookup. 
+ + Replaces the base-class full-table scan with an indexed query on + ``workspace + content_hash``. Empty strings are treated as a miss + to align with the partial-index predicate. + """ + if not content_hash: + return None + + sql = ( + "SELECT * FROM LIGHTRAG_DOC_STATUS " + "WHERE workspace=$1 AND content_hash=$2 " + "ORDER BY created_at ASC, id ASC LIMIT 1" + ) + result = await self.db.query(sql, [self.workspace, content_hash], True) + if not result: + return None + row = result[0] + + chunks_list = row.get("chunks_list", []) + if isinstance(chunks_list, str): + try: + chunks_list = json.loads(chunks_list) + except json.JSONDecodeError: + chunks_list = [] + + metadata = row.get("metadata", {}) + if isinstance(metadata, str): + try: + metadata = json.loads(metadata) + except json.JSONDecodeError: + metadata = {} + + created_at = self._format_datetime_with_timezone(row["created_at"]) + updated_at = self._format_datetime_with_timezone(row["updated_at"]) + + doc = dict( + content_length=row["content_length"], + content_summary=row["content_summary"], + status=row["status"], + chunks_count=row["chunks_count"], + created_at=created_at, + updated_at=updated_at, + file_path=row["file_path"], + chunks_list=chunks_list, + metadata=metadata, + error_msg=row.get("error_msg"), + track_id=row.get("track_id"), + content_hash=row.get("content_hash"), + ) + return str(row["id"]), doc + async def get_status_counts(self) -> dict[str, int]: """Get counts of documents in each status""" sql = """SELECT status as "status", COUNT(1) as "count" @@ -4025,6 +4412,7 @@ async def get_docs_by_status( metadata=metadata, error_msg=element.get("error_msg"), track_id=element.get("track_id"), + content_hash=element.get("content_hash"), ) return docs_by_status @@ -4086,6 +4474,7 @@ async def get_docs_by_statuses( metadata=metadata, error_msg=element.get("error_msg"), track_id=element.get("track_id"), + content_hash=element.get("content_hash"), ) except (KeyError, TypeError) as e: doc_id_hint = 
element.get("id", "") if element else "" @@ -4147,6 +4536,7 @@ async def get_docs_by_track_id( track_id=element.get("track_id"), metadata=metadata, error_msg=element.get("error_msg"), + content_hash=element.get("content_hash"), ) return docs_by_track_id @@ -4154,6 +4544,7 @@ async def get_docs_by_track_id( async def get_docs_paginated( self, status_filter: DocStatus | None = None, + status_filters: list[DocStatus] | None = None, page: int = 1, page_size: int = 50, sort_field: str = "updated_at", @@ -4172,6 +4563,10 @@ async def get_docs_paginated( Tuple of (list of (doc_id, DocProcessingStatus) tuples, total_count) """ start = time.perf_counter() + status_filter_values = self.resolve_status_filter_values( + status_filter=status_filter, + status_filters=status_filters, + ) status_filter_value = status_filter.value if status_filter is not None else None performance_timing_log( @@ -4211,10 +4606,10 @@ async def get_docs_paginated( param_count = 1 # Build WHERE clause with parameterized query - if status_filter is not None: + if status_filter_values is not None: param_count += 1 - where_clause = "WHERE workspace=$1 AND status=$2" - params["status"] = status_filter.value + where_clause = "WHERE workspace=$1 AND status = ANY($2)" + params["status_filters"] = sorted(status_filter_values) else: where_clause = "WHERE workspace=$1" @@ -4252,7 +4647,7 @@ async def get_docs_paginated( ), paged AS ( SELECT id, workspace, content_summary, content_length, chunks_count, - status, file_path, track_id, metadata, error_msg, + status, file_path, track_id, metadata, error_msg, content_hash, created_at, updated_at FROM LIGHTRAG_DOC_STATUS {where_clause} @@ -4305,6 +4700,7 @@ async def get_docs_paginated( track_id=element.get("track_id"), metadata=metadata, error_msg=element.get("error_msg"), + content_hash=element.get("content_hash"), ) documents.append((doc_id, doc_status)) @@ -4446,8 +4842,21 @@ async def upsert(self, data: dict[str, dict[str, Any]]) -> None: len(data), ) - sql = 
"""insert into LIGHTRAG_DOC_STATUS(workspace,id,content_summary,content_length,chunks_count,status,file_path,chunks_list,track_id,metadata,error_msg,created_at,updated_at) - values($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13) + # NOTE: content_hash uses COALESCE(NULLIF(...,''), existing) rather than + # a straight EXCLUDED overwrite. This gives write-once-after-set + # semantics: once a non-empty content_hash is recorded, subsequent + # upserts that omit it (or pass '' / NULL) will NOT clear it. Required + # because pipeline state transitions (e.g. processing -> processed) + # reuse the existing DocProcessingStatus payload without re-supplying + # the hash, while _persist_parsed_full_docs patches the hash in a + # separate upsert. + # + # This is a deliberate behavioral divergence from JsonDocStatusStorage, + # which overwrites unconditionally. No caller today wants to clear a + # content_hash, so the divergence is invisible — but if that ever + # changes, this guard must be revisited. + sql = """insert into LIGHTRAG_DOC_STATUS(workspace,id,content_summary,content_length,chunks_count,status,file_path,chunks_list,track_id,metadata,error_msg,content_hash,created_at,updated_at) + values($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14) on conflict(id,workspace) do update set content_summary = EXCLUDED.content_summary, content_length = EXCLUDED.content_length, @@ -4458,12 +4867,16 @@ async def upsert(self, data: dict[str, dict[str, Any]]) -> None: track_id = EXCLUDED.track_id, metadata = EXCLUDED.metadata, error_msg = EXCLUDED.error_msg, + content_hash = COALESCE( + NULLIF(EXCLUDED.content_hash, ''), + LIGHTRAG_DOC_STATUS.content_hash + ), created_at = EXCLUDED.created_at, updated_at = EXCLUDED.updated_at""" # Tuple order must match SQL: (workspace, id, content_summary, content_length, # chunks_count, status, file_path, chunks_list, track_id, metadata, - # error_msg, created_at, updated_at) + # error_msg, content_hash, created_at, updated_at) batch: list[tuple] = [] 
skipped: list[str] = [] batch_build_start = time.perf_counter() @@ -4482,6 +4895,7 @@ async def upsert(self, data: dict[str, dict[str, Any]]) -> None: v.get("track_id"), json.dumps(v.get("metadata", {})), v.get("error_msg"), + v.get("content_hash"), _parse_doc_status_datetime( v.get("created_at"), f"[{self.workspace}] doc {k} created_at", @@ -6328,6 +6742,18 @@ def namespace_to_table_name(namespace: str) -> str: doc_name VARCHAR(1024), content TEXT, meta JSONB, + sidecar_location TEXT NULL, + parse_format VARCHAR(32) NULL DEFAULT 'raw', + -- content_hash is TEXT (not VARCHAR(N)) so the column is + -- agnostic to the hash algorithm. Today's pipeline writes + -- 64-char SHA-256 hex; future algos (SHA-512, base64) do + -- not require a schema change. + content_hash TEXT NULL, + -- process_options is an opaque selector string emitted by + -- sanitize_process_options() (e.g. "Fi"). + process_options TEXT NULL, + chunk_options JSONB NULL DEFAULT '{}'::jsonb, + parse_engine VARCHAR(32) NULL, create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP, update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP, CONSTRAINT LIGHTRAG_DOC_FULL_PK PRIMARY KEY (workspace, id) @@ -6343,6 +6769,8 @@ def namespace_to_table_name(namespace: str) -> str: content TEXT, file_path TEXT NULL, llm_cache_list JSONB NULL DEFAULT '[]'::jsonb, + heading JSONB NULL DEFAULT '{}'::jsonb, + sidecar JSONB NULL DEFAULT '{}'::jsonb, create_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP, update_time TIMESTAMP(0) DEFAULT CURRENT_TIMESTAMP, CONSTRAINT LIGHTRAG_DOC_CHUNKS_PK PRIMARY KEY (workspace, id) @@ -6419,6 +6847,11 @@ def namespace_to_table_name(namespace: str) -> str: track_id varchar(255) NULL, metadata JSONB NULL DEFAULT '{}'::jsonb, error_msg TEXT NULL, + -- content_hash is TEXT (not VARCHAR(N)) so the column is + -- agnostic to the hash algorithm. Today's pipeline writes + -- 64-char SHA-256 hex; future algos (SHA-512, base64) do + -- not require a schema change. 
+ content_hash TEXT NULL, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, CONSTRAINT LIGHTRAG_DOC_STATUS_PK PRIMARY KEY (workspace, id) @@ -6474,12 +6907,20 @@ def namespace_to_table_name(namespace: str) -> str: SQL_TEMPLATES = { # SQL for KVStorage "get_by_id_full_docs": """SELECT id, COALESCE(content, '') as content, - COALESCE(doc_name, '') as file_path + COALESCE(doc_name, '') as file_path, + sidecar_location, + parse_format, + content_hash, + process_options, + COALESCE(chunk_options, '{}'::jsonb) as chunk_options, + parse_engine FROM LIGHTRAG_DOC_FULL WHERE workspace=$1 AND id=$2 """, "get_by_id_text_chunks": """SELECT id, tokens, COALESCE(content, '') as content, chunk_order_index, full_doc_id, file_path, COALESCE(llm_cache_list, '[]'::jsonb) as llm_cache_list, + COALESCE(heading, '{}'::jsonb) as heading, + COALESCE(sidecar, '{}'::jsonb) as sidecar, EXTRACT(EPOCH FROM create_time)::BIGINT as create_time, EXTRACT(EPOCH FROM update_time)::BIGINT as update_time FROM LIGHTRAG_DOC_CHUNKS WHERE workspace=$1 AND id=$2 @@ -6490,12 +6931,20 @@ def namespace_to_table_name(namespace: str) -> str: FROM LIGHTRAG_LLM_CACHE WHERE workspace=$1 AND id=$2 """, "get_by_ids_full_docs": """SELECT id, COALESCE(content, '') as content, - COALESCE(doc_name, '') as file_path + COALESCE(doc_name, '') as file_path, + sidecar_location, + parse_format, + content_hash, + process_options, + COALESCE(chunk_options, '{}'::jsonb) as chunk_options, + parse_engine FROM LIGHTRAG_DOC_FULL WHERE workspace=$1 AND id = ANY($2) """, "get_by_ids_text_chunks": """SELECT id, tokens, COALESCE(content, '') as content, chunk_order_index, full_doc_id, file_path, COALESCE(llm_cache_list, '[]'::jsonb) as llm_cache_list, + COALESCE(heading, '{}'::jsonb) as heading, + COALESCE(sidecar, '{}'::jsonb) as sidecar, EXTRACT(EPOCH FROM create_time)::BIGINT as create_time, EXTRACT(EPOCH FROM update_time)::BIGINT as update_time FROM LIGHTRAG_DOC_CHUNKS WHERE workspace=$1 
AND id = ANY($2) @@ -6546,11 +6995,49 @@ def namespace_to_table_name(namespace: str) -> str: FROM LIGHTRAG_RELATION_CHUNKS WHERE workspace=$1 AND id = ANY($2) """, "filter_keys": "SELECT id FROM {table_name} WHERE workspace=$1 AND id IN ({ids})", - "upsert_doc_full": """INSERT INTO LIGHTRAG_DOC_FULL (id, content, doc_name, workspace) - VALUES ($1, $2, $3, $4) + # Pipeline-derived columns (sidecar_location / parse_format / content_hash / + # process_options / chunk_options / parse_engine) are guarded with COALESCE + # so a partial upsert (e.g. a caller writing only ``content`` + ``doc_name``) + # does not silently overwrite metadata recorded by _persist_parsed_full_docs. + # ``content`` and ``doc_name`` themselves are always overwritten; they are + # the primary payload, never a candidate for preservation. + # For the string columns we use NULLIF(..., '') so that an empty string from + # a default-bearing caller is treated as "no value, preserve existing". + # For chunk_options (JSONB) we treat NULL or the empty-object literal as + # "no value, preserve existing". 
+ "upsert_doc_full": """INSERT INTO LIGHTRAG_DOC_FULL (id, content, doc_name, workspace, + sidecar_location, parse_format, content_hash, + process_options, chunk_options, parse_engine) + VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) ON CONFLICT (workspace,id) DO UPDATE - SET content = $2, - doc_name = $3, + SET content = EXCLUDED.content, + doc_name = EXCLUDED.doc_name, + sidecar_location = COALESCE( + NULLIF(EXCLUDED.sidecar_location, ''), + LIGHTRAG_DOC_FULL.sidecar_location + ), + parse_format = COALESCE( + NULLIF(EXCLUDED.parse_format, ''), + LIGHTRAG_DOC_FULL.parse_format + ), + content_hash = COALESCE( + NULLIF(EXCLUDED.content_hash, ''), + LIGHTRAG_DOC_FULL.content_hash + ), + process_options = COALESCE( + NULLIF(EXCLUDED.process_options, ''), + LIGHTRAG_DOC_FULL.process_options + ), + chunk_options = CASE + WHEN EXCLUDED.chunk_options IS NULL + OR EXCLUDED.chunk_options = '{}'::jsonb + THEN LIGHTRAG_DOC_FULL.chunk_options + ELSE EXCLUDED.chunk_options + END, + parse_engine = COALESCE( + NULLIF(EXCLUDED.parse_engine, ''), + LIGHTRAG_DOC_FULL.parse_engine + ), update_time = CURRENT_TIMESTAMP """, "upsert_llm_response_cache": """INSERT INTO LIGHTRAG_LLM_CACHE(workspace,id,original_prompt,return_value,chunk_id,cache_type,queryparam) @@ -6565,8 +7052,8 @@ def namespace_to_table_name(namespace: str) -> str: """, "upsert_text_chunk": """INSERT INTO LIGHTRAG_DOC_CHUNKS (workspace, id, tokens, chunk_order_index, full_doc_id, content, file_path, llm_cache_list, - create_time, update_time) - VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) + heading, sidecar, create_time, update_time) + VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12) ON CONFLICT (workspace,id) DO UPDATE SET tokens=EXCLUDED.tokens, chunk_order_index=EXCLUDED.chunk_order_index, @@ -6574,6 +7061,8 @@ def namespace_to_table_name(namespace: str) -> str: content = EXCLUDED.content, file_path=EXCLUDED.file_path, llm_cache_list=EXCLUDED.llm_cache_list, + heading=EXCLUDED.heading, + 
sidecar=EXCLUDED.sidecar, update_time = EXCLUDED.update_time """, "upsert_full_entities": """INSERT INTO LIGHTRAG_FULL_ENTITIES (workspace, id, entity_names, count, diff --git a/lightrag/kg/redis_impl.py b/lightrag/kg/redis_impl.py index 89eb594369..3f1dd4a7f2 100644 --- a/lightrag/kg/redis_impl.py +++ b/lightrag/kg/redis_impl.py @@ -925,6 +925,7 @@ async def delete(self, doc_ids: list[str]) -> None: async def get_docs_paginated( self, status_filter: DocStatus | None = None, + status_filters: list[DocStatus] | None = None, page: int = 1, page_size: int = 50, sort_field: str = "updated_at", @@ -942,6 +943,11 @@ async def get_docs_paginated( Returns: Tuple of (list of (doc_id, DocProcessingStatus) tuples, total_count) """ + status_filter_values = self.resolve_status_filter_values( + status_filter=status_filter, + status_filters=status_filters, + ) + # Validate parameters if page < 1: page = 1 @@ -983,9 +989,9 @@ async def get_docs_paginated( # Apply status filter if ( - status_filter is not None + status_filter_values is not None and doc_data.get("status") - != status_filter.value + not in status_filter_values ): continue @@ -1104,6 +1110,101 @@ async def get_doc_by_file_path(self, file_path: str) -> Union[dict[str, Any], No logger.error(f"[{self.workspace}] Error in get_doc_by_file_path: {e}") return None + async def get_doc_by_file_basename( + self, basename: str + ) -> Union[tuple[str, dict[str, Any]], None]: + """Find an existing record whose canonical basename matches. + + The caller is responsible for passing an already-canonical basename. + Stored ``file_path`` values are canonicalized by the business layer, so + this lookup intentionally performs an exact match only. 
+ """ + if not basename: + return None + if basename == "unknown_source": + return None + + async with self._get_redis_connection() as redis: + try: + cursor = 0 + while True: + cursor, keys = await redis.scan( + cursor, match=f"{self.final_namespace}:*", count=1000 + ) + if keys: + pipe = redis.pipeline() + for key in keys: + pipe.get(key) + values = await pipe.execute() + + for key, value in zip(keys, values): + if not value: + continue + try: + doc_data = json.loads(value) + except json.JSONDecodeError as e: + logger.error( + f"[{self.workspace}] JSON decode error in get_doc_by_file_basename: {e}" + ) + continue + if doc_data.get("file_path") == basename: + doc_id = key.split(":", 1)[1] + return doc_id, doc_data + + if cursor == 0: + break + + return None + except Exception as e: + logger.error( + f"[{self.workspace}] Error in get_doc_by_file_basename: {e}" + ) + return None + + async def get_doc_by_content_hash( + self, content_hash: str + ) -> Union[tuple[str, dict[str, Any]], None]: + """Find an existing record whose content_hash field matches.""" + if not content_hash: + return None + + async with self._get_redis_connection() as redis: + try: + cursor = 0 + while True: + cursor, keys = await redis.scan( + cursor, match=f"{self.final_namespace}:*", count=1000 + ) + if keys: + pipe = redis.pipeline() + for key in keys: + pipe.get(key) + values = await pipe.execute() + + for key, value in zip(keys, values): + if not value: + continue + try: + doc_data = json.loads(value) + except json.JSONDecodeError as e: + logger.error( + f"[{self.workspace}] JSON decode error in get_doc_by_content_hash: {e}" + ) + continue + if doc_data.get("content_hash") == content_hash: + doc_id = key.split(":", 1)[1] + return doc_id, doc_data + + if cursor == 0: + break + + return None + except Exception as e: + logger.error( + f"[{self.workspace}] Error in get_doc_by_content_hash: {e}" + ) + return None + async def drop(self) -> dict[str, str]: """Drop all document status data from 
storage and clean up resources""" try: diff --git a/lightrag/kg/shared_storage.py b/lightrag/kg/shared_storage.py index 6da563088e..7d36e1f7b5 100644 --- a/lightrag/kg/shared_storage.py +++ b/lightrag/kg/shared_storage.py @@ -1289,6 +1289,31 @@ async def initialize_pipeline_status(workspace: str | None = None): { "autoscanned": False, # Auto-scan started "busy": False, # Control concurrent processes + # Destructive subset of ``busy``: clear / delete jobs that + # DROP storages or remove input files. Concurrent enqueue + # would race against the drop and silently lose the + # accepted document, so reservation and the enqueue + # last-line guard reject when this is True. ``busy`` on + # its own (the processing loop) remains compatible with + # concurrent enqueue via request_pending. + "destructive_busy": False, + "scanning": False, # /documents/scan task running (whole lifecycle) + # Exclusive subset of ``scanning``: only True during the + # scan's *classification* phase, when run_scanning_process + # is reading doc_status to classify files (PROCESSED → + # archive, FAILED-without-full_docs → retry-as-new, etc.) + # and possibly deleting stale stubs. After classification + # the scan transitions to its processing phase (which + # behaves like any other busy processing run) and clears + # this flag, allowing concurrent uploads to land in + # doc_status while the scan-driven processing finishes. + "scanning_exclusive": False, + # Counter of upload/insert endpoints that have passed the + # idle preflight but whose background enqueue has not yet + # run. Closes the preflight-to-background race: scan + # refuses to start while this is > 0 so the bg task is + # guaranteed to see scanning=False at enqueue time. 
+ "pending_enqueues": 0, "job_name": "-", # Current job name (indexing files/indexing texts) "job_start": None, # Job start time "docs": 0, # Total number of documents to be indexed diff --git a/lightrag/lightrag.py b/lightrag/lightrag.py index e34f6df48a..cd3d6729e1 100644 --- a/lightrag/lightrag.py +++ b/lightrag/lightrag.py @@ -2,11 +2,16 @@ import traceback import asyncio -import inspect import os import time import warnings -from dataclasses import asdict, dataclass, field, replace +from copy import deepcopy + +try: + import httpx +except Exception: # pragma: no cover - optional dependency + httpx = None +from dataclasses import InitVar, asdict, dataclass, field, replace from datetime import datetime, timezone from functools import partial from typing import ( @@ -18,15 +23,23 @@ cast, final, Literal, + Mapping, Optional, List, Dict, Union, ) -from lightrag.prompt import PROMPTS -from lightrag.exceptions import PipelineCancelledException +from lightrag.prompt import ( + PROMPTS, + get_default_entity_extraction_prompt_profile, + resolve_entity_extraction_prompt_profile, + validate_entity_extraction_prompt_profile_for_mode, +) from lightrag.constants import ( + DEFAULT_CHUNK_P_SIZE, DEFAULT_MAX_GLEANING, + DEFAULT_MAX_EXTRACTION_RECORDS, + DEFAULT_MAX_EXTRACTION_ENTITIES, DEFAULT_FORCE_LLM_SUMMARY_ON_MERGE, DEFAULT_TOP_K, DEFAULT_CHUNK_TOP_K, @@ -40,31 +53,34 @@ DEFAULT_SUMMARY_MAX_TOKENS, DEFAULT_SUMMARY_CONTEXT_SIZE, DEFAULT_SUMMARY_LENGTH_RECOMMENDED, - DEFAULT_MAX_EXTRACT_INPUT_TOKENS, DEFAULT_MAX_ASYNC, DEFAULT_MAX_PARALLEL_INSERT, DEFAULT_MAX_GRAPH_NODES, DEFAULT_MAX_SOURCE_IDS_PER_ENTITY, DEFAULT_MAX_SOURCE_IDS_PER_RELATION, - DEFAULT_ENTITY_TYPES, DEFAULT_SUMMARY_LANGUAGE, DEFAULT_LLM_TIMEOUT, DEFAULT_EMBEDDING_TIMEOUT, + DEFAULT_RERANK_TIMEOUT, DEFAULT_SOURCE_IDS_LIMIT_METHOD, DEFAULT_MAX_FILE_PATHS, + DEFAULT_MAX_PARALLEL_ANALYZE, + DEFAULT_MAX_PARALLEL_PARSE_NATIVE, + DEFAULT_MAX_PARALLEL_PARSE_MINERU, + DEFAULT_MAX_PARALLEL_PARSE_DOCLING, + 
DEFAULT_QUEUE_SIZE_DEFAULT, + DEFAULT_QUEUE_SIZE_INSERT, DEFAULT_FILE_PATH_MORE_PLACEHOLDER, ) from lightrag.utils import get_env_value from lightrag.kg import ( - STORAGES, verify_storage_implementation, ) from lightrag.kg.shared_storage import ( get_namespace_data, - get_data_init_lock, get_default_workspace, set_default_workspace, get_namespace_lock, @@ -85,14 +101,15 @@ QueryResult, ) from lightrag.namespace import NameSpace +from lightrag.chunker import chunking_by_token_size from lightrag.operate import ( - chunking_by_token_size, extract_entities, - merge_nodes_and_edges, kg_query, naive_query, rebuild_knowledge_from_chunks, + _warn_deprecated_query_model_func, ) +from lightrag.utils_pipeline import normalize_document_file_path from lightrag.constants import GRAPH_FIELD_SEP from lightrag.utils import ( Tokenizer, @@ -100,9 +117,7 @@ EmbeddingFunc, always_get_an_event_loop, compute_mdhash_id, - lazy_external_import, priority_limit_async_func_call, - get_content_summary, sanitize_text_for_encoding, check_storage_env_vars, generate_track_id, @@ -112,9 +127,26 @@ subtract_source_ids, make_relation_chunk_key, normalize_source_ids_limit_method, + normalize_string_list, ) from lightrag.types import KnowledgeGraph from dotenv import load_dotenv +from lightrag.pipeline import _PipelineMixin +from lightrag.kg.factory import get_storage_class +from lightrag.addon_params import ( + ObservableAddonParams, + normalize_addon_params, +) +from lightrag.llm_roles import ( + ROLE_NAMES, + ROLES, + RoleLLMConfig, + RoleSpec, # noqa: F401 # re-exported via lightrag/__init__.py + _optional_env_int, + _RoleLLMMixin, + _RoleLLMState, +) +from lightrag.storage_migrations import _StorageMigrationMixin # use the .env that is inside the current folder # allows to use different .env file for each lightrag instance @@ -122,92 +154,9 @@ load_dotenv(dotenv_path=".env", override=False) -def _chunk_fields_from_status_doc( - status_doc: "DocProcessingStatus", -) -> tuple[list[str], int]: - 
"""Return (chunks_list, chunks_count) preserved from a status document. - - Filters out any non-string or empty chunk IDs. When chunks_count is - absent or invalid, it is inferred from the length of chunks_list. - """ - chunks_list: list[str] = [] - if isinstance(status_doc.chunks_list, list): - chunks_list = [ - chunk_id - for chunk_id in status_doc.chunks_list - if isinstance(chunk_id, str) and chunk_id - ] - - if isinstance(status_doc.chunks_count, int) and status_doc.chunks_count >= 0: - return chunks_list, status_doc.chunks_count - - return chunks_list, len(chunks_list) - - -def _resolve_doc_file_path( - status_doc: "DocProcessingStatus" | None = None, - content_data: dict[str, Any] | None = None, -) -> str: - """Resolve the best available document file path. - - Prefer a non-placeholder path from doc_status, then fall back to full_docs. - This avoids overwriting historical file paths with placeholder values during - retries or early-cancellation paths. - """ - - placeholder_paths = {"", "no-file-path", "unknown_source"} - - def _normalize_path(candidate: Any) -> str | None: - if not isinstance(candidate, str): - return None - - normalized = candidate.strip() - if not normalized: - return None - - return normalized - - candidates = [ - _normalize_path(getattr(status_doc, "file_path", None)), - _normalize_path(content_data.get("file_path") if content_data else None), - ] - - for candidate in candidates: - if candidate and candidate not in placeholder_paths: - return candidate - - for candidate in candidates: - if candidate: - return "unknown_source" if candidate == "no-file-path" else candidate - - return "unknown_source" - - -def _normalize_string_list(raw_values: Any, context: str = "") -> list[str]: - """Return a list of non-empty strings from raw_values. - - Non-string elements are dropped and logged as warnings. If raw_values is - not a list, an empty list is returned. 
- """ - if not isinstance(raw_values, list): - return [] - result = [] - for i, value in enumerate(raw_values): - if isinstance(value, str) and value: - result.append(value) - else: - logger.warning( - "Non-string element dropped from list%s at index %d: %r", - f" ({context})" if context else "", - i, - value, - ) - return result - - @final @dataclass -class LightRAG: +class LightRAG(_RoleLLMMixin, _StorageMigrationMixin, _PipelineMixin): """LightRAG: Simple and Fast Retrieval-Augmented Generation.""" # Directory @@ -291,12 +240,19 @@ class LightRAG: ) """Maximum number of entity extraction attempts for ambiguous content.""" - max_extract_input_tokens: int = field( + entity_extract_max_records: int = field( default=get_env_value( - "MAX_EXTRACT_INPUT_TOKENS", DEFAULT_MAX_EXTRACT_INPUT_TOKENS, int + "MAX_EXTRACTION_RECORDS", DEFAULT_MAX_EXTRACTION_RECORDS, int ) ) - """Maximum tokens allowed for entity extraction input context.""" + """Per-response cap on total entity+relationship rows/records.""" + + entity_extract_max_entities: int = field( + default=get_env_value( + "MAX_EXTRACTION_ENTITIES", DEFAULT_MAX_EXTRACTION_ENTITIES, int + ) + ) + """Per-response cap on entity rows/objects.""" force_llm_summary_on_merge: int = field( default=get_env_value( @@ -307,13 +263,32 @@ class LightRAG: # Text chunking # --- - chunk_token_size: int = field(default=int(os.getenv("CHUNK_SIZE", 1200))) - """Maximum number of tokens per text chunk when splitting documents.""" - - chunk_overlap_token_size: int = field( - default=int(os.getenv("CHUNK_OVERLAP_SIZE", 100)) - ) - """Number of overlapping tokens between consecutive text chunks to preserve context.""" + chunk_token_size: int | None = field(default=None) + """Maximum number of tokens per text chunk when splitting documents. + + ``None`` means "use ``addon_params['chunker']['chunk_token_size']``" + (env-driven via ``CHUNK_SIZE``). 
When the constructor is given a + non-None value it overlays onto ``addon_params['chunker']`` in + ``__post_init__`` so the per-document ``chunk_options`` snapshot + actually picks it up. Always an ``int`` after construction — + back-filled from the resolved chunker config so legacy readers + (``self.chunk_token_size``) keep working.""" + + chunk_overlap_token_size: int | None = field(default=None) + """Number of overlapping tokens between consecutive text chunks (F-strategy semantics). + + ``None`` means "use the per-strategy default in + ``addon_params['chunker']``" (env-driven via + ``CHUNK_F_OVERLAP_SIZE`` / ``CHUNK_R_OVERLAP_SIZE`` falling back to + ``CHUNK_OVERLAP_SIZE``). When non-None at construction time, the + value overlays onto every strategy sub-dict that natively takes + ``chunk_overlap_token_size`` (``fixed_token``, ``recursive_character``) + so the per-doc snapshot reflects the constructor choice. Per-strategy + chunker parameters (R / V separators, thresholds, overlap overrides, + etc.) live in ``addon_params['chunker']`` and are documented in + :func:`lightrag.parser_routing.default_chunker_config`. Per-doc + snapshots are persisted to ``full_docs[doc_id]['chunk_options']`` + at enqueue time.""" tokenizer: Optional[Tokenizer] = field(default=None) """ @@ -337,27 +312,55 @@ class LightRAG: Union[List[Dict[str, Any]], Awaitable[List[Dict[str, Any]]]], ] = field(default_factory=lambda: chunking_by_token_size) """ - Custom chunking function for splitting text into chunks before processing. - - The function can be either synchronous or asynchronous. - - The function should take the following parameters: + Legacy chunking-function customization point. Synchronous or async. 
+ + **When this function is actually invoked.** The chunker dispatch in + ``_PipelineMixin.process_single_document`` is driven by the + document's ``process_options``: + + - If ``process_options`` explicitly contains a chunking selector + char (``F``/``R``/``V``/``P``), the dispatcher routes to a + chunker that follows the new file-chunker contract — see + :mod:`lightrag.chunker` (``chunking_by_fixed_token`` for ``F``, + ``chunking_by_paragraph_semantic`` for ``P``; ``R``/``V`` are + not yet implemented and fall back to ``F``). **This + ``chunking_func`` is NOT called in that case** — it is a + legacy escape hatch and is intentionally bypassed when the user + opted into a specific strategy. + + - If ``process_options`` does **not** name a chunking strategy + (empty string, or only non-chunking flags such as ``i`` / ``t`` + / ``e`` / ``!``), the dispatcher invokes this ``chunking_func`` + with the legacy 6-arg signature below. This is the path taken + by direct ``ainsert(text)`` calls and by any document whose + ``process_options`` simply does not select a chunker. + + The presence/absence of the selector is exposed by + :attr:`lightrag.parser_routing.ProcessOptions.chunking_explicit`. + + **Signature** — preserved unchanged from earlier LightRAG releases + so externally-supplied chunkers continue to drop in without edits: - `tokenizer`: A Tokenizer instance to use for tokenization. - `content`: The text to be split into chunks. - - `split_by_character`: The character to split the text on. If None, the text is split into chunks of `chunk_token_size` tokens. - - `split_by_character_only`: If True, the text is split only on the specified character. - - `chunk_overlap_token_size`: The number of overlapping tokens between consecutive chunks. + - `split_by_character`: The character to split the text on. If + None, the text is split into chunks of `chunk_token_size` + tokens. + - `split_by_character_only`: If True, the text is split only on + the specified character. 
+ - `chunk_overlap_token_size`: The number of overlapping tokens + between consecutive chunks. - `chunk_token_size`: The maximum number of tokens per chunk. + The function should return a list of dictionaries (or an awaitable + that resolves to one), each containing: - The function should return a list of dictionaries (or an awaitable that resolves to a list), - where each dictionary contains the following keys: - `tokens` (int): The number of tokens in the chunk. - `content` (str): The text content of the chunk. - - `chunk_order_index` (int): Zero-based index indicating the chunk's order in the document. + - `chunk_order_index` (int): Zero-based index indicating the + chunk's order in the document. - Defaults to `chunking_by_token_size` if not specified. + Defaults to :func:`lightrag.chunker.chunking_by_token_size`. """ # Embedding @@ -400,6 +403,17 @@ class LightRAG: llm_model_func: Callable[..., object] | None = field(default=None) """Function for interacting with the large language model (LLM). Must be set before use.""" + role_llm_configs: dict[str, RoleLLMConfig | dict[str, Any]] | None = field( + default=None + ) + """Per-role LLM overrides keyed by role name (see :data:`ROLES`). + + Each entry is a :class:`RoleLLMConfig` (or a plain dict with the same + keys ``func`` / ``kwargs`` / ``max_async`` / ``timeout``). Any field left + as ``None`` falls back to the corresponding base LLM setting. Roles not + present in the dict are wrapped from the base ``llm_model_func`` and + pick up ``{ROLE_PREFIX}_MAX_ASYNC_LLM`` env defaults.""" + llm_model_name: str = field(default="gpt-4o-mini") """Name of the LLM model used for generating responses.""" @@ -432,12 +446,39 @@ class LightRAG: default=int(os.getenv("LLM_TIMEOUT", DEFAULT_LLM_TIMEOUT)) ) + entity_extraction_use_json: bool = field( + default=os.getenv("ENTITY_EXTRACTION_USE_JSON", "false").lower() == "true" + ) + """When True, entity extraction uses JSON structured output instead of delimiter-based text. 
+ JSON mode is slower but significantly improves extraction quality and compatibility with smaller models. + Providers with native structured output support (OpenAI, Ollama, Gemini) will use their + native capabilities. Other providers rely on JSON-formatted prompts with json_repair parsing. + Default: False. Set ENTITY_EXTRACTION_USE_JSON=true in .env to enable.""" + # Rerank Configuration # --- rerank_model_func: Callable[..., object] | None = field(default=None) """Function for reranking retrieved documents. All rerank configurations (model name, API keys, top_k, etc.) should be included in this function. Optional.""" + rerank_model_max_async: int = field( + default=int( + os.getenv( + "MAX_ASYNC_RERANK", + os.getenv("MAX_ASYNC", DEFAULT_MAX_ASYNC), + ) + ) + ) + """Maximum number of concurrent rerank calls. + Falls back to MAX_ASYNC when MAX_ASYNC_RERANK is unset.""" + + default_rerank_timeout: int = field( + default=int(os.getenv("RERANK_TIMEOUT", DEFAULT_RERANK_TIMEOUT)) + ) + """Rerank request timeout in seconds. + Independent from LLM_TIMEOUT since reranker calls are much shorter + than full LLM generation.""" + min_rerank_score: float = field( default=get_env_value("MIN_RERANK_SCORE", DEFAULT_MIN_RERANK_SCORE, float) ) @@ -455,6 +496,16 @@ class LightRAG: enable_llm_cache_for_entity_extract: bool = field(default=True) """If True, enables caching for entity extraction steps to reduce LLM costs.""" + vlm_process_enable: bool = field( + default_factory=lambda: get_env_value("VLM_PROCESS_ENABLE", False, bool) + ) + """Master switch for VLM multimodal analysis (i/t/e items). + + When False, the pipeline emits a warning and skips every multimodal item + without invoking the VLM. When True, the configured VLM binding must + support image inputs. 
+ """ + # Extensions # --- @@ -463,6 +514,39 @@ class LightRAG: ) """Maximum number of parallel insert operations.""" + max_parallel_parse_native: int = field( + default=int( + os.getenv( + "MAX_PARALLEL_PARSE_NATIVE", str(DEFAULT_MAX_PARALLEL_PARSE_NATIVE) + ) + ) + ) + max_parallel_parse_mineru: int = field( + default=int( + os.getenv( + "MAX_PARALLEL_PARSE_MINERU", str(DEFAULT_MAX_PARALLEL_PARSE_MINERU) + ) + ) + ) + max_parallel_parse_docling: int = field( + default=int( + os.getenv( + "MAX_PARALLEL_PARSE_DOCLING", str(DEFAULT_MAX_PARALLEL_PARSE_DOCLING) + ) + ) + ) + max_parallel_analyze: int = field( + default=int( + os.getenv("MAX_PARALLEL_ANALYZE", str(DEFAULT_MAX_PARALLEL_ANALYZE)) + ) + ) + queue_size_default: int = field( + default=int(os.getenv("QUEUE_SIZE_DEFAULT", str(DEFAULT_QUEUE_SIZE_DEFAULT))) + ) + queue_size_insert: int = field( + default=int(os.getenv("QUEUE_SIZE_INSERT", str(DEFAULT_QUEUE_SIZE_INSERT))) + ) + max_graph_nodes: int = field( default=get_env_value("MAX_GRAPH_NODES", DEFAULT_MAX_GRAPH_NODES, int) ) @@ -503,13 +587,27 @@ class LightRAG: file_path_more_placeholder: str = field(default=DEFAULT_FILE_PATH_MORE_PLACEHOLDER) """Placeholder text when file paths exceed max_file_paths limit.""" - addon_params: dict[str, Any] = field( - default_factory=lambda: { - "language": get_env_value( - "SUMMARY_LANGUAGE", DEFAULT_SUMMARY_LANGUAGE, str - ), - "entity_types": get_env_value("ENTITY_TYPES", DEFAULT_ENTITY_TYPES, list), - } + addon_params: InitVar[dict[str, Any] | None] = None + _addon_params: ObservableAddonParams = field( + default_factory=ObservableAddonParams, + init=False, + repr=False, + ) + _addon_params_dirty: bool = field(default=True, init=False, repr=False) + _entity_extraction_prompt_profile: dict[str, Any] = field( + default_factory=get_default_entity_extraction_prompt_profile, + init=False, + repr=False, + ) + _cached_entity_extraction_use_json: bool | None = field( + default=None, + init=False, + repr=False, + ) + 
_resolved_summary_language: str = field( + default=DEFAULT_SUMMARY_LANGUAGE, + init=False, + repr=False, ) # Storages Management @@ -528,11 +626,211 @@ class LightRAG: _storages_status: StoragesStatus = field(default=StoragesStatus.NOT_CREATED) - def __post_init__(self): + def _mark_addon_params_dirty(self) -> None: + self._addon_params_dirty = True + + def _replace_addon_params( + self, addon_params: Mapping[str, Any] | None, *, mark_dirty: bool + ) -> None: + wrapped = ObservableAddonParams( + normalize_addon_params(addon_params), + on_change=self._mark_addon_params_dirty, + ) + self._addon_params = wrapped + if mark_dirty: + self._mark_addon_params_dirty() + + def _get_addon_params(self) -> ObservableAddonParams: + """Return the live addon_params store. + + Mutations on the returned instance trigger a cache refresh on the next + _build_global_config() call. If the whole mapping is replaced via the + setter, previously captured references point at the old instance and + will no longer propagate changes — always re-read `rag.addon_params` + after replacement rather than caching references. + """ + return self._addon_params + + def _set_runtime_addon_params(self, addon_params: Mapping[str, Any] | None) -> None: + self._replace_addon_params(addon_params, mark_dirty=True) + self._apply_chunk_size_overlay() + + def _apply_chunk_size_overlay(self) -> None: + """Reconcile chunk-size config across all four configuration tiers. + + Specificity-ordered precedence (high → low) per slot: + + 1. ``addon_params['chunker']`` explicit (user-supplied dict that + already carries the key). + 2. Strategy-specific env (``CHUNK_F_OVERLAP_SIZE`` / + ``CHUNK_R_OVERLAP_SIZE`` / ``CHUNK_P_OVERLAP_SIZE`` — already pre-filled by + :func:`lightrag.parser_routing.default_chunker_config` *only* + when the env var is set). No strategy-specific top-level + ``CHUNK_*_SIZE`` exists today; if added later, plug it in + between this tier and the legacy ctor tier. + 3. 
Legacy constructor field + (``LightRAG(chunk_token_size=…, chunk_overlap_token_size=…)``). + Strategy-agnostic; only fills slots that were not already set + by tiers 1–2. + 4. Legacy env (``CHUNK_SIZE`` / ``CHUNK_OVERLAP_SIZE``) — final + fallback. + + After this runs, ``self._addon_params['chunker']`` carries fully + resolved values for every slot the pipeline needs, and the + legacy ``self.chunk_token_size`` / ``self.chunk_overlap_token_size`` + instance fields are back-filled to ``int`` so downstream readers + (e.g. ``process_single_document``'s + ``chunk_opts.get("chunk_token_size") or self.chunk_token_size`` + fallback) keep working. + """ + chunker_cfg = self._addon_params.get("chunker") + if not isinstance(chunker_cfg, dict): + chunker_cfg = {} + self._addon_params["chunker"] = chunker_cfg + + # Top-level chunk_token_size — no strategy-specific env exists, + # so the chain is: addon_params > legacy ctor > CHUNK_SIZE env. + if "chunk_token_size" not in chunker_cfg: + if self.chunk_token_size is not None: + chunker_cfg["chunk_token_size"] = self.chunk_token_size + else: + chunker_cfg["chunk_token_size"] = int(os.getenv("CHUNK_SIZE", 1200)) + + # Per-strategy chunk_overlap_token_size — strategy env (if set) + # already lives in the sub-dict. Slots still missing fall back + # to the legacy ctor field, then CHUNK_OVERLAP_SIZE env. 
+ if self.chunk_overlap_token_size is not None: + legacy_overlap_default = self.chunk_overlap_token_size + else: + legacy_overlap_default = int(os.getenv("CHUNK_OVERLAP_SIZE", 100)) + for strategy_key in ( + "fixed_token", + "recursive_character", + "paragraph_semantic", + ): + sub = chunker_cfg.get(strategy_key) + if not isinstance(sub, dict): + sub = {} + chunker_cfg[strategy_key] = sub + if "chunk_overlap_token_size" not in sub: + sub["chunk_overlap_token_size"] = legacy_overlap_default + + # P-specific chunk_token_size backfill — P does NOT inherit the + # top-level chunk_token_size (CHUNK_SIZE / legacy ctor) when + # nothing more specific was set; paragraph-semantic merging + # needs more headroom than the global default to keep related + # paragraphs together. ``default_chunker_config`` already + # pre-fills this slot for the default-built chunker dict, but + # when the caller hands us a partial ``addon_params['chunker']`` + # that lacks the slot (e.g. ``{"paragraph_semantic": {}}``) + # ``normalize_addon_params`` does not re-run the defaults + # builder — so this overlay is the last guard that ensures the + # slot is always populated. Precedence (high → low): + # explicit ``addon_params`` > ``CHUNK_P_SIZE`` env > + # ``DEFAULT_CHUNK_P_SIZE``. ``setdefault`` preserves any + # explicit value the caller did provide; the env read here + # mirrors ``default_chunker_config`` so partial-addon-params + # callers still pick up env overrides. + p_size_raw = os.getenv("CHUNK_P_SIZE") + chunker_cfg["paragraph_semantic"].setdefault( + "chunk_token_size", + int(p_size_raw) if p_size_raw is not None else DEFAULT_CHUNK_P_SIZE, + ) + + # Back-fill legacy instance fields → always int afterwards. + # Overlap mirrors the F-strategy resolved value, matching the + # F-flavoured legacy ``self.chunk_overlap_token_size`` semantics + # used by the legacy 6-arg ``chunking_func`` path. 
+ self.chunk_token_size = chunker_cfg["chunk_token_size"] + self.chunk_overlap_token_size = chunker_cfg["fixed_token"][ + "chunk_overlap_token_size" + ] + + def _refresh_addon_params_cache(self) -> None: + summary_language = self._addon_params.get("language", DEFAULT_SUMMARY_LANGUAGE) + if not isinstance(summary_language, str) or not summary_language.strip(): + summary_language = DEFAULT_SUMMARY_LANGUAGE + self._resolved_summary_language = summary_language + + resolved_prompt_profile = resolve_entity_extraction_prompt_profile( + self._addon_params, + self.entity_extraction_use_json, + ) + self._entity_extraction_prompt_profile = ( + validate_entity_extraction_prompt_profile_for_mode( + resolved_prompt_profile, + self.entity_extraction_use_json, + self._addon_params.get("entity_type_prompt_file"), + ) + ) + self._cached_entity_extraction_use_json = self.entity_extraction_use_json + self._addon_params_dirty = False + + def _ensure_addon_params_cache(self) -> None: + if ( + not self._addon_params_dirty + and self._cached_entity_extraction_use_json + == self.entity_extraction_use_json + ): + return + self._refresh_addon_params_cache() + + def _build_global_config(self) -> dict[str, Any]: + self._ensure_addon_params_cache() + global_config = asdict(self) + global_config.pop("_addon_params", None) + global_config.pop("_addon_params_dirty", None) + global_config.pop("_cached_entity_extraction_use_json", None) + global_config["addon_params"] = dict(self._addon_params) + # Inject runtime per-role wrapped LLM funcs (callable; not part of asdict + # because they live in the private _role_llm_states map). The first + # _build_global_config() call from __post_init__ runs before the role + # state is built, so fall back to an empty dict in that case. 
+ states = getattr(self, "_role_llm_states", None) or {} + global_config["role_llm_funcs"] = { + spec.name: states[spec.name].wrapped if spec.name in states else None + for spec in ROLES + } + global_config["llm_cache_identities"] = { + spec.name: self._build_role_llm_cache_identity( + spec.name, states.get(spec.name) + ) + for spec in ROLES + } + return global_config + + def _build_role_llm_cache_identity( + self, role: str, state: _RoleLLMState | None + ) -> dict[str, Any]: + # `state` is None during the first _build_global_config() call from + # __post_init__ — role builders have not run yet, so metadata is empty + # and we fall back to self.llm_model_name. Once roles are initialized + # or aupdate_llm_role_config() runs, metadata always carries `model`. + metadata = state.metadata if state is not None else {} + return { + "role": role, + "binding": metadata.get("binding"), + "model": metadata.get("model") or self.llm_model_name, + "host": metadata.get("host"), + } + + def __post_init__(self, addon_params: dict[str, Any] | None): from lightrag.kg.shared_storage import ( initialize_share_data, ) + # Fail fast if deprecated ENTITY_TYPES env var is set + if os.getenv("ENTITY_TYPES") is not None: + raise SystemExit( + "ERROR: ENTITY_TYPES environment variable is no longer supported. " + "Please customize entity type guidance through the prompt template instead. " + "Set addon_params={'entity_types_guidance': '...'} or replace the prompt template." 
+ ) + + self._replace_addon_params(addon_params, mark_dirty=False) + self._apply_chunk_size_overlay() + self._refresh_addon_params_cache() + # Handle deprecated parameters if self.log_level is not None: warnings.warn( @@ -605,6 +903,13 @@ def __post_init__(self): f"max_total_tokens({self.summary_max_tokens}) should greater than summary_length_recommended({self.summary_length_recommended})" ) + if self.rerank_model_func is not None: + self.rerank_model_func = priority_limit_async_func_call( + self.rerank_model_max_async, + llm_timeout=self.default_rerank_timeout, + queue_name="Rerank func", + )(self.rerank_model_func) + # Init Embedding # Step 1: Capture embedding_func and max_token_size before applying rate_limit decorator original_embedding_func = self.embedding_func @@ -617,7 +922,7 @@ def __post_init__(self): self.embedding_token_limit = embedding_max_token_size # Fix global_config now - global_config = asdict(self) + global_config = self._build_global_config() # Restore original EmbeddingFunc object (asdict converts it to dict) global_config["embedding_func"] = original_embedding_func @@ -638,13 +943,13 @@ def __post_init__(self): self.embedding_func = replace(self.embedding_func, func=wrapped_func) # Initialize all storages - self.key_string_value_json_storage_cls: type[BaseKVStorage] = ( - self._get_storage_class(self.kv_storage) + self.key_string_value_json_storage_cls: type[BaseKVStorage] = get_storage_class( + self.kv_storage ) # type: ignore - self.vector_db_storage_cls: type[BaseVectorStorage] = self._get_storage_class( + self.vector_db_storage_cls: type[BaseVectorStorage] = get_storage_class( self.vector_storage ) # type: ignore - self.graph_storage_cls: type[BaseGraphStorage] = self._get_storage_class( + self.graph_storage_cls: type[BaseGraphStorage] = get_storage_class( self.graph_storage ) # type: ignore self.key_string_value_json_storage_cls = partial( # type: ignore @@ -658,7 +963,7 @@ def __post_init__(self): ) # Initialize document status storage 
- self.doc_status_storage_cls = self._get_storage_class(self.doc_status_storage) + self.doc_status_storage_cls = get_storage_class(self.doc_status_storage) self.llm_response_cache: BaseKVStorage = self.key_string_value_json_storage_cls( # type: ignore namespace=NameSpace.KV_STORE_LLM_RESPONSE_CACHE, @@ -736,21 +1041,70 @@ def __post_init__(self): embedding_func=None, ) - # Directly use llm_response_cache, don't create a new object - hashing_kv = self.llm_response_cache - - # Get timeout from LLM model kwargs for dynamic timeout calculation - self.llm_model_func = priority_limit_async_func_call( - self.llm_model_max_async, - llm_timeout=self.default_llm_timeout, - queue_name="LLM func", - )( - partial( - self.llm_model_func, # type: ignore - hashing_kv=hashing_kv, - **self.llm_model_kwargs, + # Per-role isolated LLM wrappers (independent queues per role). + # The base ``self.llm_model_func`` is intentionally NOT queue-wrapped: + # every code path that calls an LLM goes through one of the role + # wrappers built below, so concurrency is enforced at the role layer. + base_llm_func = self.llm_model_func + if base_llm_func is None: + raise ValueError("llm_model_func must be provided") + + self._llm_role_builder = None + self._retired_llm_queue_cleanup_tasks: set[asyncio.Task] = set() + + user_role_configs = self.role_llm_configs or {} + if not isinstance(user_role_configs, Mapping): + raise TypeError( + "role_llm_configs must be a Mapping or None, got " + f"{type(user_role_configs).__name__}" ) - ) + unknown_roles = [role for role in user_role_configs if role not in ROLE_NAMES] + if unknown_roles: + valid_roles = ", ".join(sorted(ROLE_NAMES)) + unknown = ", ".join(repr(role) for role in unknown_roles) + raise ValueError( + f"Unknown role_llm_configs key(s): {unknown}. 
" + f"Valid roles are: {valid_roles}" + ) + + self._role_llm_states: dict[str, _RoleLLMState] = {} + for spec in ROLES: + override = user_role_configs.get(spec.name) + if override is None: + cfg = RoleLLMConfig() + elif isinstance(override, RoleLLMConfig): + cfg = override + elif isinstance(override, Mapping): + cfg = RoleLLMConfig(**dict(override)) + else: + raise TypeError( + f"role_llm_configs[{spec.name!r}] must be RoleLLMConfig or " + f"a dict, got {type(override).__name__}" + ) + + max_async = cfg.max_async + if max_async is None: + max_async = _optional_env_int(f"{spec.env_prefix}_MAX_ASYNC_LLM") + + metadata = {} + if cfg.metadata is not None: + if not isinstance(cfg.metadata, Mapping): + raise TypeError( + f"role_llm_configs[{spec.name!r}].metadata must be a " + f"Mapping or None, got {type(cfg.metadata).__name__}" + ) + metadata = deepcopy(dict(cfg.metadata)) + + self._role_llm_states[spec.name] = _RoleLLMState( + raw_func=cfg.func or base_llm_func, + kwargs=cfg.kwargs, + max_async=max_async, + timeout=cfg.timeout, + metadata=metadata, + ) + + self._rebuild_role_llm_funcs() + self._log_llm_role_config("initialized") self._storages_status = StoragesStatus.CREATED @@ -842,307 +1196,6 @@ async def finalize_storages(self): self._storages_status = StoragesStatus.FINALIZED - async def check_and_migrate_data(self): - """Check if data migration is needed and perform migration if necessary""" - async with get_data_init_lock(): - try: - # Check if migration is needed: - # 1. chunk_entity_relation_graph has entities and relations (count > 0) - # 2. 
full_entities and full_relations are empty - - # Get all entity labels from graph - all_entity_labels = ( - await self.chunk_entity_relation_graph.get_all_labels() - ) - - if not all_entity_labels: - logger.debug("No entities found in graph, skipping migration check") - return - - try: - # Initialize chunk tracking storage after migration - await self._migrate_chunk_tracking_storage() - except Exception as e: - logger.error(f"Error during chunk_tracking migration: {e}") - raise e - - # Check if full_entities and full_relations are empty - # Get all processed documents to check their entity/relation data - try: - processed_docs = await self.doc_status.get_docs_by_status( - DocStatus.PROCESSED - ) - - if not processed_docs: - logger.debug("No processed documents found, skipping migration") - return - - # Check first few documents to see if they have full_entities/full_relations data - migration_needed = True - checked_count = 0 - max_check = min(5, len(processed_docs)) # Check up to 5 documents - - for doc_id in list(processed_docs.keys())[:max_check]: - checked_count += 1 - entity_data = await self.full_entities.get_by_id(doc_id) - relation_data = await self.full_relations.get_by_id(doc_id) - - if entity_data or relation_data: - migration_needed = False - break - - if not migration_needed: - logger.debug( - "Full entities/relations data already exists, no migration needed" - ) - return - - logger.info( - f"Data migration needed: found {len(all_entity_labels)} entities in graph but no full_entities/full_relations data" - ) - - # Perform migration - await self._migrate_entity_relation_data(processed_docs) - - except Exception as e: - logger.error(f"Error during migration check: {e}") - raise e - - except Exception as e: - logger.error(f"Error in data migration check: {e}") - raise e - - async def _migrate_entity_relation_data(self, processed_docs: dict): - """Migrate existing entity and relation data to full_entities and full_relations storage""" - 
logger.info(f"Starting data migration for {len(processed_docs)} documents") - - # Create mapping from chunk_id to doc_id - chunk_to_doc = {} - for doc_id, doc_status in processed_docs.items(): - chunk_ids = ( - doc_status.chunks_list - if hasattr(doc_status, "chunks_list") and doc_status.chunks_list - else [] - ) - for chunk_id in chunk_ids: - chunk_to_doc[chunk_id] = doc_id - - # Initialize document entity and relation mappings - doc_entities = {} # doc_id -> set of entity_names - doc_relations = {} # doc_id -> set of relation_pairs (as tuples) - - # Get all nodes and edges from graph - all_nodes = await self.chunk_entity_relation_graph.get_all_nodes() - all_edges = await self.chunk_entity_relation_graph.get_all_edges() - - # Process all nodes once - for node in all_nodes: - if "source_id" in node: - entity_id = node.get("entity_id") or node.get("id") - if not entity_id: - continue - - # Get chunk IDs from source_id - source_ids = node["source_id"].split(GRAPH_FIELD_SEP) - - # Find which documents this entity belongs to - for chunk_id in source_ids: - doc_id = chunk_to_doc.get(chunk_id) - if doc_id: - if doc_id not in doc_entities: - doc_entities[doc_id] = set() - doc_entities[doc_id].add(entity_id) - - # Process all edges once - for edge in all_edges: - if "source_id" in edge: - src = edge.get("source") - tgt = edge.get("target") - if not src or not tgt: - continue - - # Get chunk IDs from source_id - source_ids = edge["source_id"].split(GRAPH_FIELD_SEP) - - # Find which documents this relation belongs to - for chunk_id in source_ids: - doc_id = chunk_to_doc.get(chunk_id) - if doc_id: - if doc_id not in doc_relations: - doc_relations[doc_id] = set() - # Use tuple for set operations, convert to list later - doc_relations[doc_id].add(tuple(sorted((src, tgt)))) - - # Store the results in full_entities and full_relations - migration_count = 0 - - # Store entities - if doc_entities: - entities_data = {} - for doc_id, entity_set in doc_entities.items(): - 
entities_data[doc_id] = { - "entity_names": list(entity_set), - "count": len(entity_set), - } - await self.full_entities.upsert(entities_data) - - # Store relations - if doc_relations: - relations_data = {} - for doc_id, relation_set in doc_relations.items(): - # Convert tuples back to lists - relations_data[doc_id] = { - "relation_pairs": [list(pair) for pair in relation_set], - "count": len(relation_set), - } - await self.full_relations.upsert(relations_data) - - migration_count = len( - set(list(doc_entities.keys()) + list(doc_relations.keys())) - ) - - # Persist the migrated data - await self.full_entities.index_done_callback() - await self.full_relations.index_done_callback() - - logger.info( - f"Data migration completed: migrated {migration_count} documents with entities/relations" - ) - - async def _migrate_chunk_tracking_storage(self) -> None: - """Ensure entity/relation chunk tracking KV stores exist and are seeded.""" - - if not self.entity_chunks or not self.relation_chunks: - return - - need_entity_migration = False - need_relation_migration = False - - try: - need_entity_migration = await self.entity_chunks.is_empty() - except Exception as exc: # pragma: no cover - defensive logging - logger.error(f"Failed to check entity chunks storage: {exc}") - raise exc - - try: - need_relation_migration = await self.relation_chunks.is_empty() - except Exception as exc: # pragma: no cover - defensive logging - logger.error(f"Failed to check relation chunks storage: {exc}") - raise exc - - if not need_entity_migration and not need_relation_migration: - return - - BATCH_SIZE = 500 # Process 500 records per batch - - if need_entity_migration: - try: - nodes = await self.chunk_entity_relation_graph.get_all_nodes() - except Exception as exc: - logger.error(f"Failed to fetch nodes for chunk migration: {exc}") - nodes = [] - - logger.info(f"Starting chunk_tracking data migration: {len(nodes)} nodes") - - # Process nodes in batches - total_nodes = len(nodes) - 
total_batches = (total_nodes + BATCH_SIZE - 1) // BATCH_SIZE - total_migrated = 0 - - for batch_idx in range(total_batches): - start_idx = batch_idx * BATCH_SIZE - end_idx = min((batch_idx + 1) * BATCH_SIZE, total_nodes) - batch_nodes = nodes[start_idx:end_idx] - - upsert_payload: dict[str, dict[str, object]] = {} - for node in batch_nodes: - entity_id = node.get("entity_id") or node.get("id") - if not entity_id: - continue - - raw_source = node.get("source_id") or "" - chunk_ids = [ - chunk_id - for chunk_id in raw_source.split(GRAPH_FIELD_SEP) - if chunk_id - ] - if not chunk_ids: - continue - - upsert_payload[entity_id] = { - "chunk_ids": chunk_ids, - "count": len(chunk_ids), - } - - if upsert_payload: - await self.entity_chunks.upsert(upsert_payload) - total_migrated += len(upsert_payload) - logger.info( - f"Processed entity batch {batch_idx + 1}/{total_batches}: {len(upsert_payload)} records (total: {total_migrated}/{total_nodes})" - ) - - if total_migrated > 0: - # Persist entity_chunks data to disk - await self.entity_chunks.index_done_callback() - logger.info( - f"Entity chunk_tracking migration completed: {total_migrated} records persisted" - ) - - if need_relation_migration: - try: - edges = await self.chunk_entity_relation_graph.get_all_edges() - except Exception as exc: - logger.error(f"Failed to fetch edges for chunk migration: {exc}") - edges = [] - - logger.info(f"Starting chunk_tracking data migration: {len(edges)} edges") - - # Process edges in batches - total_edges = len(edges) - total_batches = (total_edges + BATCH_SIZE - 1) // BATCH_SIZE - total_migrated = 0 - - for batch_idx in range(total_batches): - start_idx = batch_idx * BATCH_SIZE - end_idx = min((batch_idx + 1) * BATCH_SIZE, total_edges) - batch_edges = edges[start_idx:end_idx] - - upsert_payload: dict[str, dict[str, object]] = {} - for edge in batch_edges: - src = edge.get("source") or edge.get("src_id") or edge.get("src") - tgt = edge.get("target") or edge.get("tgt_id") or 
edge.get("tgt") - if not src or not tgt: - continue - - raw_source = edge.get("source_id") or "" - chunk_ids = [ - chunk_id - for chunk_id in raw_source.split(GRAPH_FIELD_SEP) - if chunk_id - ] - if not chunk_ids: - continue - - storage_key = make_relation_chunk_key(src, tgt) - upsert_payload[storage_key] = { - "chunk_ids": chunk_ids, - "count": len(chunk_ids), - } - - if upsert_payload: - await self.relation_chunks.upsert(upsert_payload) - total_migrated += len(upsert_payload) - logger.info( - f"Processed relation batch {batch_idx + 1}/{total_batches}: {len(upsert_payload)} records (total: {total_migrated}/{total_edges})" - ) - - if total_migrated > 0: - # Persist relation_chunks data to disk - await self.relation_chunks.index_done_callback() - logger.info( - f"Relation chunk_tracking migration completed: {total_migrated} records persisted" - ) - async def get_graph_labels(self): text = await self.chunk_entity_relation_graph.get_all_labels() return text @@ -1174,30 +1227,6 @@ async def get_knowledge_graph( node_label, max_depth, max_nodes ) - def _get_storage_class(self, storage_name: str) -> Callable[..., Any]: - # Direct imports for default storage implementations - if storage_name == "JsonKVStorage": - from lightrag.kg.json_kv_impl import JsonKVStorage - - return JsonKVStorage - elif storage_name == "NanoVectorDBStorage": - from lightrag.kg.nano_vector_db_impl import NanoVectorDBStorage - - return NanoVectorDBStorage - elif storage_name == "NetworkXStorage": - from lightrag.kg.networkx_impl import NetworkXStorage - - return NetworkXStorage - elif storage_name == "JsonDocStatusStorage": - from lightrag.kg.json_doc_status_impl import JsonDocStatusStorage - - return JsonDocStatusStorage - else: - # Fallback to dynamic import for other storage implementations - import_path = STORAGES[storage_name] - storage_class = lazy_external_import(import_path, storage_name) - return storage_class - def insert( self, input: str | list[str], @@ -1245,1075 +1274,117 @@ async def 
ainsert( ) -> str: """Async Insert documents with checkpoint support - Args: - input: Single document string or list of document strings - split_by_character: if split_by_character is not None, split the string by character, if chunk longer than - chunk_token_size, it will be split again by token size. - split_by_character_only: if split_by_character_only is True, split the string by character only, when - split_by_character is None, this parameter is ignored. - ids: list of unique document IDs, if not provided, MD5 hash IDs will be generated - file_paths: list of file paths corresponding to each document, used for citation - track_id: tracking ID for monitoring processing status, if not provided, will be generated - - Returns: - str: tracking ID for monitoring processing status - """ - # Generate track_id if not provided - if track_id is None: - track_id = generate_track_id("insert") - - await self.apipeline_enqueue_documents(input, ids, file_paths, track_id) - await self.apipeline_process_enqueue_documents( - split_by_character, split_by_character_only - ) - - return track_id - - # TODO: deprecated, use insert instead - def insert_custom_chunks( - self, - full_text: str, - text_chunks: list[str], - doc_id: str | list[str] | None = None, - ) -> None: - loop = always_get_an_event_loop() - loop.run_until_complete( - self.ainsert_custom_chunks(full_text, text_chunks, doc_id) - ) - - # TODO: deprecated, use ainsert instead - async def ainsert_custom_chunks( - self, full_text: str, text_chunks: list[str], doc_id: str | None = None - ) -> None: - update_storage = False - try: - # Clean input texts - full_text = sanitize_text_for_encoding(full_text) - text_chunks = [sanitize_text_for_encoding(chunk) for chunk in text_chunks] - file_path = "" - - # Process cleaned texts - if doc_id is None: - doc_key = compute_mdhash_id(full_text, prefix="doc-") - else: - doc_key = doc_id - new_docs = {doc_key: {"content": full_text, "file_path": file_path}} - - _add_doc_keys = await 
self.full_docs.filter_keys({doc_key}) - new_docs = {k: v for k, v in new_docs.items() if k in _add_doc_keys} - if not len(new_docs): - logger.warning("This document is already in the storage.") - return - - update_storage = True - logger.info(f"Inserting {len(new_docs)} docs") - - inserting_chunks: dict[str, Any] = {} - for index, chunk_text in enumerate(text_chunks): - chunk_key = compute_mdhash_id(chunk_text, prefix="chunk-") - tokens = len(self.tokenizer.encode(chunk_text)) - inserting_chunks[chunk_key] = { - "content": chunk_text, - "full_doc_id": doc_key, - "tokens": tokens, - "chunk_order_index": index, - "file_path": file_path, - } - - doc_ids = set(inserting_chunks.keys()) - add_chunk_keys = await self.text_chunks.filter_keys(doc_ids) - inserting_chunks = { - k: v for k, v in inserting_chunks.items() if k in add_chunk_keys - } - if not len(inserting_chunks): - logger.warning("All chunks are already in the storage.") - return - - tasks = [ - self.chunks_vdb.upsert(inserting_chunks), - self._process_extract_entities(inserting_chunks), - self.full_docs.upsert(new_docs), - self.text_chunks.upsert(inserting_chunks), - ] - await asyncio.gather(*tasks) - - finally: - if update_storage: - await self._insert_done() - - async def apipeline_enqueue_documents( - self, - input: str | list[str], - ids: list[str] | None = None, - file_paths: str | list[str] | None = None, - track_id: str | None = None, - ) -> str: - """ - Pipeline for Processing Documents - - 1. Validate ids if provided or generate MD5 hash IDs and remove duplicate contents - 2. Generate document initial status - 3. Filter out already processed documents - 4. 
Enqueue document in status - - Args: - input: Single document string or list of document strings - ids: list of unique document IDs, if not provided, MD5 hash IDs will be generated - file_paths: list of file paths corresponding to each document, used for citation - track_id: tracking ID for monitoring processing status, if not provided, will be generated with "enqueue" prefix - - Returns: - str: tracking ID for monitoring processing status - """ - # Generate track_id if not provided - if track_id is None or track_id.strip() == "": - track_id = generate_track_id("enqueue") - if isinstance(input, str): - input = [input] - if isinstance(ids, str): - ids = [ids] - if isinstance(file_paths, str): - file_paths = [file_paths] - - # If file_paths is provided, ensure it matches the number of documents - if file_paths is not None: - if isinstance(file_paths, str): - file_paths = [file_paths] - if len(file_paths) != len(input): - raise ValueError( - "Number of file paths must match the number of documents" - ) - file_paths = [ - path.strip() if isinstance(path, str) else "" for path in file_paths - ] - file_paths = [path if path else "unknown_source" for path in file_paths] - else: - # If no file paths provided, use placeholder - file_paths = ["unknown_source"] * len(input) - - # 1. 
Validate ids if provided or generate MD5 hash IDs and remove duplicate contents - if ids is not None: - # Check if the number of IDs matches the number of documents - if len(ids) != len(input): - raise ValueError("Number of IDs must match the number of documents") - - # Check if IDs are unique - if len(ids) != len(set(ids)): - raise ValueError("IDs must be unique") - - # Generate contents dict and remove duplicates in one pass - unique_contents = {} - for id_, doc, path in zip(ids, input, file_paths): - cleaned_content = sanitize_text_for_encoding(doc) - if cleaned_content not in unique_contents: - unique_contents[cleaned_content] = (id_, path) - - # Reconstruct contents with unique content - contents = { - id_: {"content": content, "file_path": file_path} - for content, (id_, file_path) in unique_contents.items() - } - else: - # Clean input text and remove duplicates in one pass - unique_content_with_paths = {} - for doc, path in zip(input, file_paths): - cleaned_content = sanitize_text_for_encoding(doc) - if cleaned_content not in unique_content_with_paths: - unique_content_with_paths[cleaned_content] = path - - # Generate contents dict of MD5 hash IDs and documents with paths - contents = { - compute_mdhash_id(content, prefix="doc-"): { - "content": content, - "file_path": path, - } - for content, path in unique_content_with_paths.items() - } - - # 2. Generate document initial status (without content) - new_docs: dict[str, Any] = { - id_: { - "status": DocStatus.PENDING, - "content_summary": get_content_summary(content_data["content"]), - "content_length": len(content_data["content"]), - "created_at": datetime.now(timezone.utc).isoformat(), - "updated_at": datetime.now(timezone.utc).isoformat(), - "file_path": content_data[ - "file_path" - ], # Store file path in document status - "track_id": track_id, # Store track_id in document status - } - for id_, content_data in contents.items() - } - - # 3. 
Filter out already processed documents - # Get docs ids - all_new_doc_ids = set(new_docs.keys()) - # Exclude IDs of documents that are already enqueued - unique_new_doc_ids = await self.doc_status.filter_keys(all_new_doc_ids) - - # Handle duplicate documents - create trackable records with current track_id - ignored_ids = list(all_new_doc_ids - unique_new_doc_ids) - if ignored_ids: - duplicate_docs: dict[str, Any] = {} - for doc_id in ignored_ids: - file_path = ( - new_docs.get(doc_id, {}).get("file_path") or "unknown_source" - ) - logger.warning(f"Duplicate document detected: {doc_id} ({file_path})") - - # Get existing document info for reference - existing_doc = await self.doc_status.get_by_id(doc_id) - existing_status = ( - existing_doc.get("status", "unknown") if existing_doc else "unknown" - ) - existing_track_id = ( - existing_doc.get("track_id", "") if existing_doc else "" - ) - - # Create a new record with unique ID for this duplicate attempt - dup_record_id = compute_mdhash_id(f"{doc_id}-{track_id}", prefix="dup-") - duplicate_docs[dup_record_id] = { - "status": DocStatus.FAILED, - "content_summary": f"[DUPLICATE] Original document: {doc_id}", - "content_length": new_docs.get(doc_id, {}).get("content_length", 0), - "chunks_count": 0, - "chunks_list": [], - "created_at": datetime.now(timezone.utc).isoformat(), - "updated_at": datetime.now(timezone.utc).isoformat(), - "file_path": file_path, - "track_id": track_id, # Use current track_id for tracking - "error_msg": f"Content already exists. 
Original doc_id: {doc_id}, Status: {existing_status}", - "metadata": { - "is_duplicate": True, - "original_doc_id": doc_id, - "original_track_id": existing_track_id, - }, - } - - # Store duplicate records in doc_status - if duplicate_docs: - await self.doc_status.upsert(duplicate_docs) - logger.info( - f"Created {len(duplicate_docs)} duplicate document records with track_id: {track_id}" - ) - - # Filter new_docs to only include documents with unique IDs - new_docs = { - doc_id: new_docs[doc_id] - for doc_id in unique_new_doc_ids - if doc_id in new_docs - } - - if not new_docs: - logger.warning("No new unique documents were found.") - return - - # 4. Store document content in full_docs and status in doc_status - # Store full document content separately - full_docs_data = { - doc_id: { - "content": contents[doc_id]["content"], - "file_path": contents[doc_id]["file_path"], - } - for doc_id in new_docs.keys() - } - await self.full_docs.upsert(full_docs_data) - # Persist data to disk immediately - await self.full_docs.index_done_callback() - - # Store document status (without content) - await self.doc_status.upsert(new_docs) - logger.debug(f"Stored {len(new_docs)} new unique documents") - - return track_id - - async def apipeline_enqueue_error_documents( - self, - error_files: list[dict[str, Any]], - track_id: str | None = None, - ) -> None: - """ - Record file extraction errors in doc_status storage. - - This function creates error document entries in the doc_status storage for files - that failed during the extraction process. Each error entry contains information - about the failure to help with debugging and monitoring. - - Args: - error_files: List of dictionaries containing error information for each failed file. 
- Each dictionary should contain: - - file_path: Original file name/path - - error_description: Brief error description (for content_summary) - - original_error: Full error message (for error_msg) - - file_size: File size in bytes (for content_length, 0 if unknown) - track_id: Optional tracking ID for grouping related operations - - Returns: - None - """ - if not error_files: - logger.debug("No error files to record") - return - - # Generate track_id if not provided - if track_id is None or track_id.strip() == "": - track_id = generate_track_id("error") - - error_docs: dict[str, Any] = {} - current_time = datetime.now(timezone.utc).isoformat() - - for error_file in error_files: - file_path = error_file.get("file_path", "unknown_file") - error_description = error_file.get( - "error_description", "File extraction failed" - ) - original_error = error_file.get("original_error", "Unknown error") - file_size = error_file.get("file_size", 0) - - # Generate unique doc_id with "error-" prefix - doc_id_content = f"{file_path}-{error_description}" - doc_id = compute_mdhash_id(doc_id_content, prefix="error-") - - error_docs[doc_id] = { - "status": DocStatus.FAILED, - "content_summary": error_description, - "content_length": file_size, - "error_msg": original_error, - "chunks_count": 0, # No chunks for failed files - "chunks_list": [], - "created_at": current_time, - "updated_at": current_time, - "file_path": file_path, - "track_id": track_id, - "metadata": { - "error_type": "file_extraction_error", - }, - } - - # Store error documents in doc_status - if error_docs: - await self.doc_status.upsert(error_docs) - # Log each error for debugging - for doc_id, error_doc in error_docs.items(): - logger.error( - f"File processing error: - ID: {doc_id} {error_doc['file_path']}" - ) - - async def _validate_and_fix_document_consistency( - self, - to_process_docs: dict[str, DocProcessingStatus], - pipeline_status: dict, - pipeline_status_lock: asyncio.Lock, - ) -> dict[str, 
DocProcessingStatus]: - """Validate and fix document data consistency by deleting inconsistent entries, but preserve failed documents""" - inconsistent_docs = [] - failed_docs_to_preserve = [] - successful_deletions = 0 - - # Check each document's data consistency - for doc_id, status_doc in to_process_docs.items(): - # Check if corresponding content exists in full_docs - content_data = await self.full_docs.get_by_id(doc_id) - if not content_data: - # Check if this is a failed document that should be preserved - if ( - hasattr(status_doc, "status") - and status_doc.status == DocStatus.FAILED - ): - failed_docs_to_preserve.append(doc_id) - else: - inconsistent_docs.append(doc_id) - - # Log information about failed documents that will be preserved - if failed_docs_to_preserve: - async with pipeline_status_lock: - preserve_message = f"Preserving {len(failed_docs_to_preserve)} failed document entries for manual review" - logger.info(preserve_message) - pipeline_status["latest_message"] = preserve_message - pipeline_status["history_messages"].append(preserve_message) - - # Remove failed documents from processing list but keep them in doc_status - for doc_id in failed_docs_to_preserve: - to_process_docs.pop(doc_id, None) - - # Delete inconsistent document entries(excluding failed documents) - if inconsistent_docs: - async with pipeline_status_lock: - summary_message = ( - f"Inconsistent document entries found: {len(inconsistent_docs)}" - ) - logger.info(summary_message) - pipeline_status["latest_message"] = summary_message - pipeline_status["history_messages"].append(summary_message) - - successful_deletions = 0 - for doc_id in inconsistent_docs: - try: - status_doc = to_process_docs[doc_id] - file_path = _resolve_doc_file_path(status_doc=status_doc) - - # Delete doc_status entry - await self.doc_status.delete([doc_id]) - successful_deletions += 1 - - # Log successful deletion - async with pipeline_status_lock: - log_message = ( - f"Deleted inconsistent entry: {doc_id} 
({file_path})" - ) - logger.info(log_message) - pipeline_status["latest_message"] = log_message - pipeline_status["history_messages"].append(log_message) - - # Remove from processing list - to_process_docs.pop(doc_id, None) - - except Exception as e: - # Log deletion failure - async with pipeline_status_lock: - error_message = f"Failed to delete entry: {doc_id} - {str(e)}" - logger.error(error_message) - pipeline_status["latest_message"] = error_message - pipeline_status["history_messages"].append(error_message) - - # Final summary log - # async with pipeline_status_lock: - # final_message = f"Successfully deleted {successful_deletions} inconsistent entries, preserved {len(failed_docs_to_preserve)} failed documents" - # logger.info(final_message) - # pipeline_status["latest_message"] = final_message - # pipeline_status["history_messages"].append(final_message) - - # Reset PROCESSING and FAILED documents that pass consistency checks to PENDING status - docs_to_reset = {} - reset_count = 0 - - for doc_id, status_doc in to_process_docs.items(): - # Check if document has corresponding content in full_docs (consistency check) - content_data = await self.full_docs.get_by_id(doc_id) - if content_data: # Document passes consistency check - # Check if document is in PROCESSING or FAILED status - if hasattr(status_doc, "status") and status_doc.status in [ - DocStatus.PROCESSING, - DocStatus.FAILED, - ]: - preserved_chunks_list, preserved_chunks_count = ( - _chunk_fields_from_status_doc(status_doc) - ) - resolved_file_path = _resolve_doc_file_path( - status_doc=status_doc, - content_data=content_data, - ) - # Prepare document for status reset to PENDING - docs_to_reset[doc_id] = { - "status": DocStatus.PENDING, - "content_summary": status_doc.content_summary, - "content_length": status_doc.content_length, - "chunks_count": preserved_chunks_count, - "chunks_list": preserved_chunks_list, - "created_at": status_doc.created_at, - "updated_at": 
datetime.now(timezone.utc).isoformat(), - "file_path": resolved_file_path, - "track_id": getattr(status_doc, "track_id", ""), - # Clear any error messages and processing metadata - "error_msg": "", - "metadata": {}, - } - - # Update the status in to_process_docs as well - status_doc.status = DocStatus.PENDING - status_doc.file_path = resolved_file_path - reset_count += 1 - - # Update doc_status storage if there are documents to reset - if docs_to_reset: - await self.doc_status.upsert(docs_to_reset) - - async with pipeline_status_lock: - reset_message = f"Reset {reset_count} documents from PROCESSING/FAILED to PENDING status" - logger.info(reset_message) - pipeline_status["latest_message"] = reset_message - pipeline_status["history_messages"].append(reset_message) - - return to_process_docs - - async def apipeline_process_enqueue_documents( - self, - split_by_character: str | None = None, - split_by_character_only: bool = False, - ) -> None: - """ - Process pending documents by splitting them into chunks, processing - each chunk for entity and relation extraction, and updating the - document status. - - 1. Get all pending, failed, and abnormally terminated processing documents. - 2. Validate document data consistency and fix any issues - 3. Split document content into chunks - 4. Process each chunk for entity and relation extraction - 5. 
Update the document status - """ - - # Get pipeline status shared data and lock - pipeline_status = await get_namespace_data( - "pipeline_status", workspace=self.workspace - ) - pipeline_status_lock = get_namespace_lock( - "pipeline_status", workspace=self.workspace - ) - - # Check if another process is already processing the queue - async with pipeline_status_lock: - # Ensure only one worker is processing documents - if not pipeline_status.get("busy", False): - to_process_docs: dict[ - str, DocProcessingStatus - ] = await self.doc_status.get_docs_by_statuses( - [DocStatus.PROCESSING, DocStatus.FAILED, DocStatus.PENDING] - ) - - if not to_process_docs: - logger.info("No documents to process") - return - - pipeline_status.update( - { - "busy": True, - "job_name": "Default Job", - "job_start": datetime.now(timezone.utc).isoformat(), - "docs": 0, - "batchs": 0, # Total number of files to be processed - "cur_batch": 0, # Number of files already processed - "request_pending": False, # Clear any previous request - "cancellation_requested": False, # Initialize cancellation flag - "latest_message": "", - } - ) - # Cleaning history_messages without breaking it as a shared list object - del pipeline_status["history_messages"][:] - else: - # Another process is busy, just set request flag and return - pipeline_status["request_pending"] = True - logger.info( - "Another process is already processing the document queue. Request queued." 
- ) - return - - try: - # Process documents until no more documents or requests - while True: - # Check for cancellation request at the start of main loop - async with pipeline_status_lock: - if pipeline_status.get("cancellation_requested", False): - # Clear pending request - pipeline_status["request_pending"] = False - # Celar cancellation flag - pipeline_status["cancellation_requested"] = False - - log_message = "Pipeline cancelled by user" - logger.info(log_message) - pipeline_status["latest_message"] = log_message - pipeline_status["history_messages"].append(log_message) - - # Exit directly, skipping request_pending check - return - - if not to_process_docs: - log_message = "All enqueued documents have been processed" - logger.info(log_message) - pipeline_status["latest_message"] = log_message - pipeline_status["history_messages"].append(log_message) - break - - # Validate document data consistency and fix any issues as part of the pipeline - to_process_docs = await self._validate_and_fix_document_consistency( - to_process_docs, pipeline_status, pipeline_status_lock - ) - - if not to_process_docs: - log_message = ( - "No valid documents to process after consistency check" - ) - logger.info(log_message) - pipeline_status["latest_message"] = log_message - pipeline_status["history_messages"].append(log_message) - break - - log_message = f"Processing {len(to_process_docs)} document(s)" - logger.info(log_message) - - # Update pipeline_status, batchs now represents the total number of files to be processed - pipeline_status["docs"] = len(to_process_docs) - pipeline_status["batchs"] = len(to_process_docs) - pipeline_status["cur_batch"] = 0 - pipeline_status["latest_message"] = log_message - pipeline_status["history_messages"].append(log_message) - - # Get first document's file path and total count for job name - first_doc_id, first_doc = next(iter(to_process_docs.items())) - first_doc_path = first_doc.file_path - - # Handle cases where first_doc_path is None - if 
first_doc_path: - path_prefix = first_doc_path[:20] + ( - "..." if len(first_doc_path) > 20 else "" - ) - else: - path_prefix = "unknown_source" - - total_files = len(to_process_docs) - job_name = f"{path_prefix}[{total_files} files]" - pipeline_status["job_name"] = job_name - - # Create a counter to track the number of processed files - processed_count = 0 - # Create a semaphore to limit the number of concurrent file processing - semaphore = asyncio.Semaphore(self.max_parallel_insert) - - async def process_document( - doc_id: str, - status_doc: DocProcessingStatus, - split_by_character: str | None, - split_by_character_only: bool, - pipeline_status: dict, - pipeline_status_lock: asyncio.Lock, - semaphore: asyncio.Semaphore, - ) -> None: - """Process single document""" - # Initialize variables at the start to prevent UnboundLocalError in error handling - file_path = _resolve_doc_file_path(status_doc=status_doc) - current_file_number = 0 - file_extraction_stage_ok = False - processing_start_time = int(time.time()) - first_stage_tasks = [] - entity_relation_task = None - chunks: dict[str, Any] = {} - content_data: dict[str, Any] | None = None - - def get_failed_chunk_snapshot() -> tuple[list[str], int]: - if chunks: - chunk_ids = list(chunks.keys()) - return chunk_ids, len(chunk_ids) - return _chunk_fields_from_status_doc(status_doc) - - async with semaphore: - nonlocal processed_count - # Initialize to prevent UnboundLocalError in error handling - first_stage_tasks = [] - entity_relation_task = None - try: - # Resolve file_path from full_docs before honoring a queued - # cancellation so corrupted doc_status placeholders do not - # get written back again during retry/cancel flows. - content_data = await self.full_docs.get_by_id(doc_id) - if content_data: - file_path = _resolve_doc_file_path( - status_doc=status_doc, - content_data=content_data, - ) - status_doc.file_path = file_path - - # Check for cancellation before starting document processing. 
- # file_path is resolved before this check so queued documents - # do not lose their source path on early cancellation. - async with pipeline_status_lock: - if pipeline_status.get("cancellation_requested", False): - raise PipelineCancelledException("User cancelled") - - async with pipeline_status_lock: - # Update processed file count and save current file number - processed_count += 1 - current_file_number = ( - processed_count # Save the current file number - ) - pipeline_status["cur_batch"] = processed_count - - log_message = f"Extracting stage {current_file_number}/{total_files}: {file_path}" - logger.info(log_message) - pipeline_status["history_messages"].append(log_message) - log_message = f"Processing d-id: {doc_id}" - logger.info(log_message) - pipeline_status["latest_message"] = log_message - pipeline_status["history_messages"].append(log_message) - - # Prevent memory growth: keep only latest 5000 messages when exceeding 10000 - if len(pipeline_status["history_messages"]) > 10000: - logger.info( - f"Trimming pipeline history from {len(pipeline_status['history_messages'])} to 5000 messages" - ) - # Trim in place so Manager.list-backed shared state - # remains appendable and visible across processes. 
- del pipeline_status["history_messages"][:-5000] - - # Get document content from full_docs - if not content_data: - raise Exception( - f"Document content not found in full_docs for doc_id: {doc_id}" - ) - content = content_data["content"] - - # Call chunking function, supporting both sync and async implementations - chunking_result = self.chunking_func( - self.tokenizer, - content, - split_by_character, - split_by_character_only, - self.chunk_overlap_token_size, - self.chunk_token_size, - ) - - # If result is awaitable, await to get actual result - if inspect.isawaitable(chunking_result): - chunking_result = await chunking_result - - # Validate return type - if not isinstance(chunking_result, (list, tuple)): - raise TypeError( - f"chunking_func must return a list or tuple of dicts, " - f"got {type(chunking_result)}" - ) - - # Build chunks dictionary - chunks: dict[str, Any] = { - compute_mdhash_id(dp["content"], prefix="chunk-"): { - **dp, - "full_doc_id": doc_id, - "file_path": file_path, # Add file path to each chunk - "llm_cache_list": [], # Initialize empty LLM cache list for each chunk - } - for dp in chunking_result - } - - if not chunks: - logger.warning("No document chunks to process") - - # Record processing start time - processing_start_time = int(time.time()) - - # Check for cancellation before entity extraction - async with pipeline_status_lock: - if pipeline_status.get("cancellation_requested", False): - raise PipelineCancelledException("User cancelled") - - # Process document in two stages - # Stage 1: Process text chunks and docs (parallel execution) - doc_status_task = asyncio.create_task( - self.doc_status.upsert( - { - doc_id: { - "status": DocStatus.PROCESSING, - "chunks_count": len(chunks), - "chunks_list": list( - chunks.keys() - ), # Save chunks list - "content_summary": status_doc.content_summary, - "content_length": status_doc.content_length, - "created_at": status_doc.created_at, - "updated_at": datetime.now( - timezone.utc - 
).isoformat(), - "file_path": file_path, - "track_id": status_doc.track_id, # Preserve existing track_id - "metadata": { - "processing_start_time": processing_start_time - }, - } - } - ) - ) - chunks_vdb_task = asyncio.create_task( - self.chunks_vdb.upsert(chunks) - ) - text_chunks_task = asyncio.create_task( - self.text_chunks.upsert(chunks) - ) - - # First stage tasks (parallel execution) - first_stage_tasks = [ - doc_status_task, - chunks_vdb_task, - text_chunks_task, - ] - entity_relation_task = None - - # Execute first stage tasks - await asyncio.gather(*first_stage_tasks) - - # Stage 2: Process entity relation graph (after text_chunks are saved) - entity_relation_task = asyncio.create_task( - self._process_extract_entities( - chunks, pipeline_status, pipeline_status_lock - ) - ) - chunk_results = await entity_relation_task - file_extraction_stage_ok = True - - except Exception as e: - # Check if this is a user cancellation - if isinstance(e, PipelineCancelledException): - # User cancellation - log brief message only, no traceback - error_msg = f"User cancelled {current_file_number}/{total_files}: {file_path}" - logger.warning(error_msg) - async with pipeline_status_lock: - pipeline_status["latest_message"] = error_msg - pipeline_status["history_messages"].append( - error_msg - ) - else: - # Other exceptions - log with traceback - logger.error(traceback.format_exc()) - error_msg = f"Failed to extract document {current_file_number}/{total_files}: {file_path}" - logger.error(error_msg) - async with pipeline_status_lock: - pipeline_status["latest_message"] = error_msg - pipeline_status["history_messages"].append( - traceback.format_exc() - ) - pipeline_status["history_messages"].append( - error_msg - ) - - # Cancel tasks that are not yet completed - all_tasks = first_stage_tasks + ( - [entity_relation_task] if entity_relation_task else [] - ) - for task in all_tasks: - if task and not task.done(): - task.cancel() - - # Persistent llm cache with error handling - 
if self.llm_response_cache: - try: - await self.llm_response_cache.index_done_callback() - except Exception as persist_error: - logger.error( - f"Failed to persist LLM cache: {persist_error}" - ) - - # Record processing end time for failed case - processing_end_time = int(time.time()) - failed_chunks_list, failed_chunks_count = ( - get_failed_chunk_snapshot() - ) + Args: + input: Single document string or list of document strings + split_by_character: if split_by_character is not None, split the string by character, if chunk longer than + chunk_token_size, it will be split again by token size. + split_by_character_only: if split_by_character_only is True, split the string by character only, when + split_by_character is None, this parameter is ignored. + ids: list of unique document IDs, if not provided, MD5 hash IDs will be generated + file_paths: list of file paths corresponding to each document, used for citation + track_id: tracking ID for monitoring processing status, if not provided, will be generated - # Update document status to failed - await self.doc_status.upsert( - { - doc_id: { - "status": DocStatus.FAILED, - "error_msg": str(e), - "chunks_count": failed_chunks_count, - "chunks_list": failed_chunks_list, - "content_summary": status_doc.content_summary, - "content_length": status_doc.content_length, - "created_at": status_doc.created_at, - "updated_at": datetime.now( - timezone.utc - ).isoformat(), - "file_path": file_path, - "track_id": status_doc.track_id, # Preserve existing track_id - "metadata": { - "processing_start_time": processing_start_time, - "processing_end_time": processing_end_time, - }, - } - } - ) + Returns: + str: tracking ID for monitoring processing status + """ + # Generate track_id if not provided + if track_id is None: + track_id = generate_track_id("insert") - # Concurrency is controlled by keyed lock for individual entities and relationships - if file_extraction_stage_ok: - try: - # Check for cancellation before merge - async with 
pipeline_status_lock: - if pipeline_status.get( - "cancellation_requested", False - ): - raise PipelineCancelledException( - "User cancelled" - ) - - # Use chunk_results from entity_relation_task - await merge_nodes_and_edges( - chunk_results=chunk_results, # result collected from entity_relation_task - knowledge_graph_inst=self.chunk_entity_relation_graph, - entity_vdb=self.entities_vdb, - relationships_vdb=self.relationships_vdb, - global_config=asdict(self), - full_entities_storage=self.full_entities, - full_relations_storage=self.full_relations, - doc_id=doc_id, - pipeline_status=pipeline_status, - pipeline_status_lock=pipeline_status_lock, - llm_response_cache=self.llm_response_cache, - entity_chunks_storage=self.entity_chunks, - relation_chunks_storage=self.relation_chunks, - current_file_number=current_file_number, - total_files=total_files, - file_path=file_path, - ) - - # Record processing end time - processing_end_time = int(time.time()) - - await self.doc_status.upsert( - { - doc_id: { - "status": DocStatus.PROCESSED, - "chunks_count": len(chunks), - "chunks_list": list(chunks.keys()), - "content_summary": status_doc.content_summary, - "content_length": status_doc.content_length, - "created_at": status_doc.created_at, - "updated_at": datetime.now( - timezone.utc - ).isoformat(), - "file_path": file_path, - "track_id": status_doc.track_id, # Preserve existing track_id - "metadata": { - "processing_start_time": processing_start_time, - "processing_end_time": processing_end_time, - }, - } - } - ) - - # Call _insert_done after processing each file - await self._insert_done() - - async with pipeline_status_lock: - log_message = f"Completed processing file {current_file_number}/{total_files}: {file_path}" - logger.info(log_message) - pipeline_status["latest_message"] = log_message - pipeline_status["history_messages"].append( - log_message - ) + # Capture the F-strategy runtime args into a chunk_options + # snapshot before enqueue so they become a per-document 
+ # setting. ``apipeline_enqueue_documents`` itself doesn't take + # split args — chunk_options is the canonical chunker-config + # carrier; runtime split args are an ainsert-only concern. + from lightrag.parser_routing import resolve_chunk_options + + chunk_opts = resolve_chunk_options( + self.addon_params, + split_by_character=split_by_character, + split_by_character_only=split_by_character_only, + ) + await self.apipeline_enqueue_documents( + input, + ids, + file_paths, + track_id, + chunk_options=chunk_opts, + ) + await self.apipeline_process_enqueue_documents() - except Exception as e: - # Check if this is a user cancellation - if isinstance(e, PipelineCancelledException): - # User cancellation - log brief message only, no traceback - error_msg = f"User cancelled during merge {current_file_number}/{total_files}: {file_path}" - logger.warning(error_msg) - async with pipeline_status_lock: - pipeline_status["latest_message"] = error_msg - pipeline_status["history_messages"].append( - error_msg - ) - else: - # Other exceptions - log with traceback - logger.error(traceback.format_exc()) - error_msg = f"Merging stage failed in document {current_file_number}/{total_files}: {file_path}" - logger.error(error_msg) - async with pipeline_status_lock: - pipeline_status["latest_message"] = error_msg - pipeline_status["history_messages"].append( - traceback.format_exc() - ) - pipeline_status["history_messages"].append( - error_msg - ) - - # Persistent llm cache with error handling - if self.llm_response_cache: - try: - await self.llm_response_cache.index_done_callback() - except Exception as persist_error: - logger.error( - f"Failed to persist LLM cache: {persist_error}" - ) - - # Record processing end time for failed case - processing_end_time = int(time.time()) - failed_chunks_list, failed_chunks_count = ( - get_failed_chunk_snapshot() - ) - - # Update document status to failed - await self.doc_status.upsert( - { - doc_id: { - "status": DocStatus.FAILED, - "error_msg": 
str(e), - "chunks_count": failed_chunks_count, - "chunks_list": failed_chunks_list, - "content_summary": status_doc.content_summary, - "content_length": status_doc.content_length, - "created_at": status_doc.created_at, - "updated_at": datetime.now( - timezone.utc - ).isoformat(), - "file_path": file_path, - "track_id": status_doc.track_id, # Preserve existing track_id - "metadata": { - "processing_start_time": processing_start_time, - "processing_end_time": processing_end_time, - }, - } - } - ) - - # Create processing tasks for all documents - doc_tasks = [] - for doc_id, status_doc in to_process_docs.items(): - doc_tasks.append( - process_document( - doc_id, - status_doc, - split_by_character, - split_by_character_only, - pipeline_status, - pipeline_status_lock, - semaphore, - ) - ) + return track_id - # Wait for all document processing to complete - try: - await asyncio.gather(*doc_tasks) - except PipelineCancelledException: - # Cancel all remaining tasks - for task in doc_tasks: - if not task.done(): - task.cancel() + # TODO: deprecated, use insert instead + def insert_custom_chunks( + self, + full_text: str, + text_chunks: list[str], + doc_id: str | list[str] | None = None, + ) -> None: + loop = always_get_an_event_loop() + loop.run_until_complete( + self.ainsert_custom_chunks(full_text, text_chunks, doc_id) + ) - # Wait for all tasks to complete cancellation - await asyncio.wait(doc_tasks, return_when=asyncio.ALL_COMPLETED) + # TODO: deprecated, use ainsert instead + async def ainsert_custom_chunks( + self, full_text: str, text_chunks: list[str], doc_id: str | None = None + ) -> None: + update_storage = False + try: + # Clean input texts + full_text = sanitize_text_for_encoding(full_text) + text_chunks = [sanitize_text_for_encoding(chunk) for chunk in text_chunks] + file_path = normalize_document_file_path("") - # Exit directly (document statuses already updated in process_document) - return + # Process cleaned texts + if doc_id is None: + doc_key = 
compute_mdhash_id(full_text, prefix="doc-") + else: + doc_key = doc_id + new_docs = {doc_key: {"content": full_text, "file_path": file_path}} - # Check if there's a pending request to process more documents (with lock) - has_pending_request = False - async with pipeline_status_lock: - has_pending_request = pipeline_status.get("request_pending", False) - if has_pending_request: - # Clear the request flag before checking for more documents - pipeline_status["request_pending"] = False + _add_doc_keys = await self.full_docs.filter_keys({doc_key}) + new_docs = {k: v for k, v in new_docs.items() if k in _add_doc_keys} + if not len(new_docs): + logger.warning("This document is already in the storage.") + return - if not has_pending_request: - break + update_storage = True + logger.info(f"Inserting {len(new_docs)} docs") - log_message = "Processing additional documents due to pending request" - logger.info(log_message) - pipeline_status["latest_message"] = log_message - pipeline_status["history_messages"].append(log_message) + inserting_chunks: dict[str, Any] = {} + for index, chunk_text in enumerate(text_chunks): + chunk_key = compute_mdhash_id(chunk_text, prefix="chunk-") + tokens = len(self.tokenizer.encode(chunk_text)) + inserting_chunks[chunk_key] = { + "content": chunk_text, + "full_doc_id": doc_key, + "tokens": tokens, + "chunk_order_index": index, + "file_path": file_path, + } - # Check for pending documents again - to_process_docs = await self.doc_status.get_docs_by_statuses( - [DocStatus.PROCESSING, DocStatus.FAILED, DocStatus.PENDING] - ) + doc_ids = set(inserting_chunks.keys()) + add_chunk_keys = await self.text_chunks.filter_keys(doc_ids) + inserting_chunks = { + k: v for k, v in inserting_chunks.items() if k in add_chunk_keys + } + if not len(inserting_chunks): + logger.warning("All chunks are already in the storage.") + return + + tasks = [ + self.chunks_vdb.upsert(inserting_chunks), + self._process_extract_entities(inserting_chunks), + 
self.full_docs.upsert(new_docs), + self.text_chunks.upsert(inserting_chunks), + ] + await asyncio.gather(*tasks) finally: - log_message = "Enqueued document processing pipeline stopped" - logger.info(log_message) - # Always reset busy status and cancellation flag when done or if an exception occurs (with lock) - async with pipeline_status_lock: - pipeline_status["busy"] = False - pipeline_status["cancellation_requested"] = ( - False # Always reset cancellation flag - ) - pipeline_status["latest_message"] = log_message - pipeline_status["history_messages"].append(log_message) + if update_storage: + await self._insert_done() async def _process_extract_entities( self, chunk: dict[str, Any], pipeline_status=None, pipeline_status_lock=None @@ -2321,7 +1392,7 @@ async def _process_extract_entities( try: chunk_results = await extract_entities( chunk, - global_config=asdict(self), + global_config=self._build_global_config(), pipeline_status=pipeline_status, pipeline_status_lock=pipeline_status_lock, llm_response_cache=self.llm_response_cache, @@ -2386,7 +1457,9 @@ async def ainsert_custom_kg( for chunk_data in custom_kg.get("chunks", []): chunk_content = sanitize_text_for_encoding(chunk_data["content"]) source_id = chunk_data["source_id"] - file_path = chunk_data.get("file_path", "custom_kg") + file_path = normalize_document_file_path( + chunk_data.get("file_path", "custom_kg") + ) tokens = len(self.tokenizer.encode(chunk_content)) chunk_order_index = ( 0 @@ -2433,7 +1506,9 @@ async def ainsert_custom_kg( description = entity_data.get("description", "No description provided") source_chunk_id = entity_data.get("source_id", "UNKNOWN") source_id = chunk_to_source_map.get(source_chunk_id, "UNKNOWN") - file_path = entity_data.get("file_path", "custom_kg") + file_path = normalize_document_file_path( + entity_data.get("file_path", "custom_kg") + ) if source_id == "UNKNOWN": logger.warning( @@ -2489,7 +1564,9 @@ async def ainsert_custom_kg( tgt_id = relationship_data["tgt_id"] 
source_chunk_id = relationship_data.get("source_id", "UNKNOWN") source_id = chunk_to_source_map.get(source_chunk_id, "UNKNOWN") - file_path = relationship_data.get("file_path", "custom_kg") + file_path = normalize_document_file_path( + relationship_data.get("file_path", "custom_kg") + ) if source_id == "UNKNOWN": logger.warning( @@ -2785,7 +1862,7 @@ async def aquery_data( actual data is nested under the 'data' field, with 'status' and 'message' fields at the top level. """ - global_config = asdict(self) + global_config = self._build_global_config() # Create a copy of param to avoid modifying the original data_param = QueryParam( @@ -2903,7 +1980,7 @@ async def aquery_llm( """ logger.debug(f"[aquery_llm] Query param: {param}") - global_config = asdict(self) + global_config = self._build_global_config() try: query_result = None @@ -2932,7 +2009,11 @@ async def aquery_llm( ) elif param.mode == "bypass": # Bypass mode: directly use LLM without knowledge retrieval - use_llm_func = param.model_func or global_config["llm_model_func"] + if param.model_func: + _warn_deprecated_query_model_func("bypass query generation") + use_llm_func = ( + param.model_func or global_config["role_llm_funcs"]["query"] + ) # Apply higher priority (8) to entity/relation summary tasks use_llm_func = partial(use_llm_func, _priority=8) @@ -3060,7 +2141,7 @@ async def _update_delete_retry_state( if not isinstance(metadata, dict): metadata = {} - backup_cache_ids = _normalize_string_list( + backup_cache_ids = normalize_string_list( metadata.get("deletion_llm_cache_ids", []), context=f"doc {doc_id} metadata.deletion_llm_cache_ids", ) @@ -3220,6 +2301,417 @@ async def aget_docs_by_ids( # Return the dictionary containing statuses only for the found document IDs return found_statuses + async def _purge_doc_chunks_and_kg( + self, + doc_id: str, + chunk_ids: set[str], + *, + pipeline_status: dict, + pipeline_status_lock: Any, + ) -> None: + """Remove a document's chunks and clean up its knowledge-graph 
contributions. + + Used by: + - The pipeline resume branch in ``process_document`` when a + document whose content is already extracted is re-processed + under different ``process_options``: chunks must be wiped and + entities/relations rebuilt fresh. + - Future deletion paths that want a focused "purge KG only" + operation without the LLM-cache / doc_status / full_docs + cleanup that ``adelete_by_doc_id`` also performs. + + What this method does: + 1. Reads ``full_entities`` / ``full_relations`` to identify which + graph nodes / edges this document contributed to. + 2. For each affected entity / relation, intersects the doc's + ``chunk_ids`` with the union of chunk-tracking entries + (``entity_chunks`` / ``relation_chunks``) and graph + ``source_id`` lists, then classifies it as either + *delete-outright* (no remaining sources) or *rebuild* + (still references chunks from other documents). + 3. Deletes the chunks themselves from ``chunks_vdb`` and + ``text_chunks``. + 4. For *delete-outright* entries: removes the relationship / + entity from the graph storage, vector storage, and chunk + tracking. + 5. Calls :py:meth:`_insert_done` to persist graph changes + before rebuilding (so the rebuild step sees a consistent + state). + 6. Calls :func:`rebuild_knowledge_from_chunks` to rebuild any + *rebuild* entries from their remaining chunks (so other + documents that also contributed to the same entity / + relation keep their data intact). + 7. Deletes the per-doc ``full_entities`` / ``full_relations`` + index rows so subsequent re-extraction starts fresh. + + Does NOT touch: + - ``doc_status`` / ``full_docs`` records — caller manages those. + - ``llm_response_cache`` — orthogonal to KG cleanup. + - Pipeline busy-flag — assumes the caller already holds the + pipeline (i.e. this runs inside a pipeline run). + + Idempotent: passing an empty ``chunk_ids`` returns immediately + without touching storage. + """ + if not chunk_ids: + return + + # ---- 1. 
Analyze affected entities/relations from full_entities/full_relations ---- + entities_to_delete: set[str] = set() + entities_to_rebuild: dict[str, list[str]] = {} + relationships_to_delete: set[tuple[str, str]] = set() + relationships_to_rebuild: dict[tuple[str, str], list[str]] = {} + entity_chunk_updates: dict[str, list[str]] = {} + relation_chunk_updates: dict[tuple[str, str], list[str]] = {} + + try: + doc_entities_data = await self.full_entities.get_by_id(doc_id) + doc_relations_data = await self.full_relations.get_by_id(doc_id) + + affected_nodes: list[dict[str, Any]] = [] + affected_edges: list[dict[str, Any]] = [] + + if doc_entities_data and "entity_names" in doc_entities_data: + entity_names = doc_entities_data["entity_names"] + nodes_dict = await self.chunk_entity_relation_graph.get_nodes_batch( + entity_names + ) + for entity_name in entity_names: + node_data = nodes_dict.get(entity_name) + if node_data: + if "id" not in node_data: + node_data["id"] = entity_name + affected_nodes.append(node_data) + + if doc_relations_data and "relation_pairs" in doc_relations_data: + relation_pairs = doc_relations_data["relation_pairs"] + edge_pairs_dicts = [ + {"src": pair[0], "tgt": pair[1]} for pair in relation_pairs + ] + edges_dict = await self.chunk_entity_relation_graph.get_edges_batch( + edge_pairs_dicts + ) + for pair in relation_pairs: + src, tgt = pair[0], pair[1] + edge_data = edges_dict.get((src, tgt)) + if edge_data: + if "source" not in edge_data: + edge_data["source"] = src + if "target" not in edge_data: + edge_data["target"] = tgt + affected_edges.append(edge_data) + except Exception as e: + logger.error( + f"[purge] Failed to analyze affected graph elements for {doc_id}: {e}" + ) + raise Exception(f"Failed to analyze graph dependencies: {e}") from e + + # ---- 2. 
Classify entities/relations into delete vs rebuild ---- + try: + for node_data in affected_nodes: + node_label = node_data.get("entity_id") + if not node_label: + continue + + existing_sources: list[str] = [] + graph_sources: list[str] = [] + if self.entity_chunks: + stored_chunks = await self.entity_chunks.get_by_id(node_label) + if stored_chunks and isinstance(stored_chunks, dict): + existing_sources = [ + chunk_id + for chunk_id in stored_chunks.get("chunk_ids", []) + if chunk_id + ] + + if node_data.get("source_id"): + graph_sources = [ + chunk_id + for chunk_id in node_data["source_id"].split(GRAPH_FIELD_SEP) + if chunk_id + ] + + if not existing_sources: + existing_sources = graph_sources + + if not existing_sources: + entities_to_delete.add(node_label) + entity_chunk_updates[node_label] = [] + continue + + remaining_sources = subtract_source_ids(existing_sources, chunk_ids) + graph_references_deleted_chunks = bool( + graph_sources and set(graph_sources) & chunk_ids + ) + + if not remaining_sources: + entities_to_delete.add(node_label) + entity_chunk_updates[node_label] = [] + elif ( + remaining_sources != existing_sources + or graph_references_deleted_chunks + ): + entities_to_rebuild[node_label] = remaining_sources + entity_chunk_updates[node_label] = remaining_sources + + async with pipeline_status_lock: + log_message = ( + f"[purge] {doc_id}: {len(entities_to_rebuild)} entity(ies) " + f"to rebuild, {len(entities_to_delete)} to delete" + ) + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + + for edge_data in affected_edges: + src = edge_data.get("source") + tgt = edge_data.get("target") + if not src or not tgt or "source_id" not in edge_data: + continue + + edge_tuple = tuple(sorted((src, tgt))) + if ( + edge_tuple in relationships_to_delete + or edge_tuple in relationships_to_rebuild + ): + continue + + existing_sources = [] + graph_sources = [] + if 
self.relation_chunks: + storage_key = make_relation_chunk_key(src, tgt) + stored_chunks = await self.relation_chunks.get_by_id(storage_key) + if stored_chunks and isinstance(stored_chunks, dict): + existing_sources = [ + chunk_id + for chunk_id in stored_chunks.get("chunk_ids", []) + if chunk_id + ] + + if edge_data.get("source_id"): + graph_sources = [ + chunk_id + for chunk_id in edge_data["source_id"].split(GRAPH_FIELD_SEP) + if chunk_id + ] + + if not existing_sources: + existing_sources = graph_sources + + if not existing_sources: + relationships_to_delete.add(edge_tuple) + relation_chunk_updates[edge_tuple] = [] + continue + + remaining_sources = subtract_source_ids(existing_sources, chunk_ids) + graph_references_deleted_chunks = bool( + graph_sources and set(graph_sources) & chunk_ids + ) + + if not remaining_sources: + relationships_to_delete.add(edge_tuple) + relation_chunk_updates[edge_tuple] = [] + elif ( + remaining_sources != existing_sources + or graph_references_deleted_chunks + ): + relationships_to_rebuild[edge_tuple] = remaining_sources + relation_chunk_updates[edge_tuple] = remaining_sources + + async with pipeline_status_lock: + log_message = ( + f"[purge] {doc_id}: {len(relationships_to_rebuild)} relation(s) " + f"to rebuild, {len(relationships_to_delete)} to delete" + ) + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + + # Update entity/relation chunk-tracking with the remaining sources. 
+ current_time = int(time.time()) + if entity_chunk_updates and self.entity_chunks: + entity_upsert_payload = {} + for entity_name, remaining in entity_chunk_updates.items(): + if not remaining: + continue + entity_upsert_payload[entity_name] = { + "chunk_ids": remaining, + "count": len(remaining), + "updated_at": current_time, + } + if entity_upsert_payload: + await self.entity_chunks.upsert(entity_upsert_payload) + + if relation_chunk_updates and self.relation_chunks: + relation_upsert_payload = {} + for edge_tuple, remaining in relation_chunk_updates.items(): + if not remaining: + continue + storage_key = make_relation_chunk_key(*edge_tuple) + relation_upsert_payload[storage_key] = { + "chunk_ids": remaining, + "count": len(remaining), + "updated_at": current_time, + } + if relation_upsert_payload: + await self.relation_chunks.upsert(relation_upsert_payload) + except Exception as e: + logger.error( + f"[purge] Failed to process graph analysis results for {doc_id}: {e}" + ) + raise Exception(f"Failed to process graph dependencies: {e}") from e + + # ---- 3. Delete chunks themselves ---- + try: + await self.chunks_vdb.delete(chunk_ids) + await self.text_chunks.delete(chunk_ids) + async with pipeline_status_lock: + log_message = ( + f"[purge] {doc_id}: deleted {len(chunk_ids)} chunk(s) from storage" + ) + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + except Exception as e: + logger.error(f"[purge] Failed to delete chunks for {doc_id}: {e}") + raise Exception(f"Failed to delete document chunks: {e}") from e + + # ---- 4. 
Delete relationships with no remaining sources ---- + if relationships_to_delete: + try: + rel_ids_to_delete = [] + for src, tgt in relationships_to_delete: + rel_ids_to_delete.extend( + [ + compute_mdhash_id(src + tgt, prefix="rel-"), + compute_mdhash_id(tgt + src, prefix="rel-"), + ] + ) + await self.relationships_vdb.delete(rel_ids_to_delete) + await self.chunk_entity_relation_graph.remove_edges( + list(relationships_to_delete) + ) + if self.relation_chunks: + relation_storage_keys = [ + make_relation_chunk_key(src, tgt) + for src, tgt in relationships_to_delete + ] + await self.relation_chunks.delete(relation_storage_keys) + async with pipeline_status_lock: + log_message = ( + f"[purge] {doc_id}: deleted " + f"{len(relationships_to_delete)} relation(s)" + ) + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + except Exception as e: + logger.error( + f"[purge] Failed to delete relationships for {doc_id}: {e}" + ) + raise Exception(f"Failed to delete relationships: {e}") from e + + # ---- 5. 
Delete entities with no remaining sources ---- + if entities_to_delete: + try: + nodes_edges_dict = ( + await self.chunk_entity_relation_graph.get_nodes_edges_batch( + list(entities_to_delete) + ) + ) + + edges_to_delete: set[tuple[str, str]] = set() + for entity, edges in nodes_edges_dict.items(): + if edges: + for src, tgt in edges: + edges_to_delete.add(tuple(sorted((src, tgt)))) + + if edges_to_delete: + rel_ids_to_delete = [] + for src, tgt in edges_to_delete: + rel_ids_to_delete.extend( + [ + compute_mdhash_id(src + tgt, prefix="rel-"), + compute_mdhash_id(tgt + src, prefix="rel-"), + ] + ) + await self.relationships_vdb.delete(rel_ids_to_delete) + if self.relation_chunks: + relation_storage_keys = [ + make_relation_chunk_key(src, tgt) + for src, tgt in edges_to_delete + ] + await self.relation_chunks.delete(relation_storage_keys) + logger.info( + f"[purge] {doc_id}: cleaned {len(edges_to_delete)} residual " + f"edge(s) from VDB and chunk-tracking storage" + ) + + await self.chunk_entity_relation_graph.remove_nodes( + list(entities_to_delete) + ) + + entity_vdb_ids = [ + compute_mdhash_id(entity, prefix="ent-") + for entity in entities_to_delete + ] + await self.entities_vdb.delete(entity_vdb_ids) + + if self.entity_chunks: + await self.entity_chunks.delete(list(entities_to_delete)) + + async with pipeline_status_lock: + log_message = ( + f"[purge] {doc_id}: deleted " + f"{len(entities_to_delete)} entity(ies)" + ) + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + except Exception as e: + logger.error(f"[purge] Failed to delete entities for {doc_id}: {e}") + raise Exception(f"Failed to delete entities: {e}") from e + + # ---- 6. Persist pre-rebuild changes ---- + try: + await self._insert_done() + except Exception as e: + logger.error(f"[purge] Failed to persist pre-rebuild changes: {e}") + raise Exception(f"Failed to persist pre-rebuild changes: {e}") from e + + # ---- 7. 
Rebuild entities/relations that still have remaining sources ---- + if entities_to_rebuild or relationships_to_rebuild: + try: + await rebuild_knowledge_from_chunks( + entities_to_rebuild=entities_to_rebuild, + relationships_to_rebuild=relationships_to_rebuild, + knowledge_graph_inst=self.chunk_entity_relation_graph, + entities_vdb=self.entities_vdb, + relationships_vdb=self.relationships_vdb, + text_chunks_storage=self.text_chunks, + llm_response_cache=self.llm_response_cache, + global_config=self._build_global_config(), + pipeline_status=pipeline_status, + pipeline_status_lock=pipeline_status_lock, + entity_chunks_storage=self.entity_chunks, + relation_chunks_storage=self.relation_chunks, + ) + except Exception as e: + logger.error(f"[purge] Failed to rebuild knowledge from chunks: {e}") + raise Exception(f"Failed to rebuild knowledge graph: {e}") from e + + # ---- 8. Delete per-doc full_entities / full_relations index rows ---- + try: + await self.full_entities.delete([doc_id]) + await self.full_relations.delete([doc_id]) + except Exception as e: + logger.error( + f"[purge] Failed to delete full_entities/full_relations rows for {doc_id}: {e}" + ) + raise Exception( + f"Failed to delete from full_entities/full_relations: {e}" + ) from e + async def adelete_by_doc_id( self, doc_id: str, delete_llm_cache: bool = False ) -> DeletionResult: @@ -3327,7 +2819,6 @@ async def adelete_by_doc_id( try: # 1. 
Get the document status and related data doc_status_data = await self.doc_status.get_by_id(doc_id) - file_path = doc_status_data.get("file_path") if doc_status_data else None if not doc_status_data: logger.warning(f"Document {doc_id} not found") return DeletionResult( @@ -3337,6 +2828,7 @@ async def adelete_by_doc_id( status_code=404, file_path="", ) + file_path = doc_status_data.get("file_path") # Check document status and log warning for non-completed documents raw_status = doc_status_data.get("status") @@ -3381,12 +2873,12 @@ async def adelete_by_doc_id( metadata = doc_status_data.get("metadata", {}) if not isinstance(metadata, dict): metadata = {} - metadata_cache_ids = _normalize_string_list( + metadata_cache_ids = normalize_string_list( metadata.get("deletion_llm_cache_ids", []), context=f"doc {doc_id} metadata.deletion_llm_cache_ids", ) chunk_ids = set( - _normalize_string_list( + normalize_string_list( doc_status_data.get("chunks_list", []), context=f"doc {doc_id} chunks_list", ) @@ -3955,7 +3447,7 @@ async def adelete_by_doc_id( relationships_vdb=self.relationships_vdb, text_chunks_storage=self.text_chunks, llm_response_cache=self.llm_response_cache, - global_config=asdict(self), + global_config=self._build_global_config(), pipeline_status=pipeline_status, pipeline_status_lock=pipeline_status_lock, entity_chunks_storage=self.entity_chunks, @@ -4505,3 +3997,15 @@ def export_data( loop.run_until_complete( self.aexport_data(output_path, file_format, include_vector_data) ) + + +# `addon_params` is declared as an InitVar on the dataclass so it can still be +# passed through LightRAG(addon_params=...). InitVars are not stored as +# instance attributes, which frees the name to be installed here as a property +# that routes reads/writes through the observable `_addon_params` store. +# Declaring it as both a dataclass field and a property is not supported by +# @dataclass, so the property is attached after class creation. 
+LightRAG.addon_params = property( # type: ignore[attr-defined] + LightRAG._get_addon_params, + LightRAG._set_runtime_addon_params, +) diff --git a/lightrag/llm/_vision_utils.py b/lightrag/llm/_vision_utils.py new file mode 100644 index 0000000000..bcfdab8df8 --- /dev/null +++ b/lightrag/llm/_vision_utils.py @@ -0,0 +1,301 @@ +"""Shared image-input normalization for LLM bindings. + +All LLM bindings accept a unified ``image_inputs`` keyword parameter. Each +element may be: + +- a raw base64 string (the MIME type is inferred via ``imghdr`` / magic bytes, + defaulting to ``image/png``); +- a data URL of the form ``data:<mime>;base64,<data>``; +- a dict with keys ``base64`` (required) and optional ``mime_type``, + ``source_id``, ``source_file``, ``modality``, ``doc_id``. + +The provider-specific binding code converts the normalized result to its own +content-block format. The VLM pipeline uses :func:`image_cache_metadata` for +cache-key inputs (deliberately excluding ``source_id`` / ``source_file`` so the +same image at different filenames still hits the same entry) and +:func:`image_audit_metadata` for the human-readable ``original_prompt`` audit +block. +""" + +from __future__ import annotations + +import base64 +import hashlib +import re +import struct +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +DATA_URL_RE = re.compile( + r"^data:(?P<mime>[\w./+-]+);base64,(?P<data>[A-Za-z0-9+/=\s]+)$" +) + +_PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n" +_JPEG_SIGNATURE = b"\xff\xd8\xff" +_GIF_SIGNATURES = (b"GIF87a", b"GIF89a") +_WEBP_RIFF = b"RIFF" +_WEBP_TAG = b"WEBP" + + +@dataclass(frozen=True) +class NormalizedImage: + index: int + raw_bytes: bytes + mime_type: str + sha256: str + base64_str: str + source_id: str | None + source_file: str | None + modality: str | None + doc_id: str | None + # Pixel dimensions parsed from the raster header (None when the format + # is recognized but dimensions could not be extracted).
+ width: int | None = None + height: int | None = None + + +def _detect_mime(raw: bytes) -> str: + if raw.startswith(_PNG_SIGNATURE): + return "image/png" + if raw.startswith(_JPEG_SIGNATURE): + return "image/jpeg" + if any(raw.startswith(sig) for sig in _GIF_SIGNATURES): + return "image/gif" + if len(raw) >= 12 and raw[0:4] == _WEBP_RIFF and raw[8:12] == _WEBP_TAG: + return "image/webp" + return "image/png" + + +def _decode_base64(data: str) -> bytes: + cleaned = re.sub(r"\s+", "", data) + try: + return base64.b64decode(cleaned, validate=True) + except (base64.binascii.Error, ValueError) as exc: + raise ValueError(f"invalid base64 image data: {exc}") from exc + + +def _coerce_item(item: Any) -> dict[str, Any]: + if isinstance(item, str): + match = DATA_URL_RE.match(item.strip()) + if match: + return {"base64": match.group("data"), "mime_type": match.group("mime")} + return {"base64": item} + if isinstance(item, dict): + if "base64" not in item: + raise ValueError("image_inputs dict element must contain a 'base64' key") + return item + raise TypeError( + f"image_inputs element must be str or dict, got {type(item).__name__}" + ) + + +def normalize_image_inputs( + image_inputs: list[Any] | None, +) -> list[NormalizedImage]: + """Normalize the unified ``image_inputs`` parameter. + + Returns an empty list when ``image_inputs`` is falsy, so callers can do a + plain ``if normalized:`` check. 
+ """ + if not image_inputs: + return [] + + result: list[NormalizedImage] = [] + for idx, raw_item in enumerate(image_inputs): + item = _coerce_item(raw_item) + raw_bytes = _decode_base64(item["base64"]) + if not raw_bytes: + raise ValueError(f"image_inputs[{idx}] decoded to empty bytes") + mime_type = item.get("mime_type") or _detect_mime(raw_bytes) + sha = hashlib.sha256(raw_bytes).hexdigest() + clean_b64 = base64.b64encode(raw_bytes).decode("ascii") + dims = _dimensions_from_bytes(raw_bytes) + width, height = (dims[0], dims[1]) if dims else (None, None) + result.append( + NormalizedImage( + index=idx, + raw_bytes=raw_bytes, + mime_type=mime_type, + sha256=sha, + base64_str=clean_b64, + source_id=item.get("source_id"), + source_file=item.get("source_file"), + modality=item.get("modality"), + doc_id=item.get("doc_id"), + width=width, + height=height, + ) + ) + return result + + +def image_cache_metadata(images: list[NormalizedImage]) -> list[dict[str, Any]]: + """Return cache-key-safe image metadata (no source identifiers). + + Includes ``width`` / ``height`` so the cache key reflects the full + image digest the design contract specifies (mime, sha256, bytes, + width, height). The sha256 alone is sufficient for identity, but + surfacing dimensions matches the documented audit shape and gives + diagnostics a one-line "what was sent" without re-decoding. + """ + return [ + { + "index": img.index, + "mime_type": img.mime_type, + "sha256": img.sha256, + "bytes": len(img.raw_bytes), + "width": img.width, + "height": img.height, + } + for img in images + ] + + +def image_audit_metadata(images: list[NormalizedImage]) -> list[dict[str, Any]]: + """Return audit metadata suitable for the ``original_prompt`` block. + + Never includes the raw base64 payload — only digests and source pointers. 
+ """ + return [ + { + "index": img.index, + "mime_type": img.mime_type, + "sha256": img.sha256, + "bytes": len(img.raw_bytes), + "width": img.width, + "height": img.height, + "source_id": img.source_id, + "source_file": img.source_file, + "modality": img.modality, + "doc_id": img.doc_id, + } + for img in images + ] + + +def _read_png_dimensions(data: bytes) -> tuple[int, int] | None: + # IHDR is the first chunk; width/height are big-endian uint32 at offsets + # 16/20 (8-byte signature + 4 length + 4 "IHDR" + 4 width + 4 height). + if len(data) < 24 or not data.startswith(_PNG_SIGNATURE): + return None + width, height = struct.unpack(">II", data[16:24]) + return width, height + + +def _read_gif_dimensions(data: bytes) -> tuple[int, int] | None: + # Logical screen descriptor: width/height are little-endian uint16 at + # offsets 6/8. + if len(data) < 10 or not any(data.startswith(sig) for sig in _GIF_SIGNATURES): + return None + width, height = struct.unpack("<HH", data[6:10]) + return width, height + + +def _read_jpeg_dimensions(data: bytes) -> tuple[int, int] | None: + # Scan for a Start-Of-Frame marker (SOF0 / SOF2 / etc.). Skip segments by + # their length field. We deliberately accept any SOF variant the codec + # might emit rather than enumerating each one. + if len(data) < 4 or not data.startswith(_JPEG_SIGNATURE): + return None + i = 2 + n = len(data) + while i < n: + if data[i] != 0xFF: + return None + # Skip fill bytes. + while i < n and data[i] == 0xFF: + i += 1 + if i >= n: + return None + marker = data[i] + i += 1 + # Standalone markers without a length field. + if marker in (0xD8, 0xD9) or 0xD0 <= marker <= 0xD7: + continue + if i + 2 > n: + return None + segment_len = struct.unpack(">H", data[i : i + 2])[0] + if segment_len < 2 or i + segment_len > n: + return None + # SOF0..SOF15 except 0xC4 (DHT), 0xC8 (JPG reserved), 0xCC (DAC).
+ if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC): + # SOF payload: precision(1) + height(2) + width(2) + … + if i + 7 > n: + return None + height, width = struct.unpack(">HH", data[i + 3 : i + 7]) + return width, height + i += segment_len + return None + + +def _read_webp_dimensions(data: bytes) -> tuple[int, int] | None: + if len(data) < 30 or data[0:4] != _WEBP_RIFF or data[8:12] != _WEBP_TAG: + return None + chunk_type = data[12:16] + if chunk_type == b"VP8 ": + # Lossy: 3-byte tag + 3-byte sync code at offset 23, then 4 bytes + # holding 14-bit width / 14-bit height in little-endian halves. + if len(data) < 30: + return None + width = struct.unpack("<H", data[26:28])[0] & 0x3FFF + height = struct.unpack("<H", data[28:30])[0] & 0x3FFF + return width, height + if chunk_type == b"VP8L": + # Lossless: signature byte 0x2F at offset 20, then 14-bit width-1 and + # 14-bit height-1 packed little-endian across the next 4 bytes. + if len(data) < 25 or data[20] != 0x2F: + return None + b0, b1, b2, b3 = data[21:25] + width = (b0 | (b1 & 0x3F) << 8) + 1 + height = ((b3 & 0x0F) << 10 | b2 << 2 | b1 >> 6) + 1 + return width, height + if chunk_type == b"VP8X": + # Extended: 3 bytes width-1 / 3 bytes height-1, little-endian, at + # offsets 24/27. + if len(data) < 30: + return None + width = (data[24] | data[25] << 8 | data[26] << 16) + 1 + height = (data[27] | data[28] << 8 | data[29] << 16) + 1 + return width, height + return None + + +def read_image_dimensions(path: Path) -> tuple[int, int] | None: + """Return ``(width, height)`` for a raster image, or ``None`` if unknown. + + Reads only the file header — no Pillow dependency. Supports PNG, JPEG, + GIF and WebP (VP8 / VP8L / VP8X). Returns ``None`` for unsupported + formats and on any I/O or parse error so callers can fall back to a + skipped/failure decision without raising. + """ + try: + with open(path, "rb") as fh: + header = fh.read(64 * 1024) + except OSError: + return None + return _dimensions_from_bytes(header) + + +def _dimensions_from_bytes(data: bytes) -> tuple[int, int] | None: + """Run the four header readers against a byte buffer. + + Shared between the file-path entry point (:func:`read_image_dimensions`) + and :func:`normalize_image_inputs`, which receives raster payloads + decoded from the unified ``image_inputs`` parameter.
+ """ + if not data: + return None + for reader in ( + _read_png_dimensions, + _read_gif_dimensions, + _read_jpeg_dimensions, + _read_webp_dimensions, + ): + try: + dims = reader(data) + except (struct.error, IndexError, ValueError): + continue + if dims: + return dims + return None diff --git a/lightrag/llm/anthropic.py b/lightrag/llm/anthropic.py index b1a3081caf..b7cc0878da 100644 --- a/lightrag/llm/anthropic.py +++ b/lightrag/llm/anthropic.py @@ -57,8 +57,17 @@ async def anthropic_complete_if_cache( enable_cot: bool = False, base_url: str | None = None, api_key: str | None = None, + image_inputs: list[Any] | None = None, **kwargs: Any, ) -> Union[str, AsyncIterator[str]]: + """Call Anthropic Messages API with LightRAG-compatible shims. + + Structured output note: + - This adapter does not support OpenAI-style ``response_format`` JSON mode. + - If callers pass ``response_format``, it is stripped before the request. + - Deprecated ``keyword_extraction`` and ``entity_extraction`` booleans are + accepted only as compatibility shims; they emit warnings and are ignored. + """ if history_messages is None: history_messages = [] if enable_cot: @@ -78,7 +87,23 @@ async def anthropic_complete_if_cache( logging.getLogger("anthropic").setLevel(logging.INFO) kwargs.pop("hashing_kv", None) - kwargs.pop("keyword_extraction", None) + # Anthropic Messages API has no JSON mode; drop legacy flags and + # response_format. Emit DeprecationWarning when the booleans were set. 
+ if kwargs.pop("keyword_extraction", False): + warnings.warn( + "anthropic_complete_if_cache(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + if kwargs.pop("entity_extraction", False): + warnings.warn( + "anthropic_complete_if_cache(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs.pop("response_format", None) timeout = kwargs.pop("timeout", None) anthropic_async_client = ( @@ -96,7 +121,26 @@ async def anthropic_complete_if_cache( messages: list[dict[str, Any]] = [] messages.extend(history_messages) - messages.append({"role": "user", "content": prompt}) + if image_inputs: + from lightrag.llm._vision_utils import normalize_image_inputs + + normalized_images = normalize_image_inputs(image_inputs) + user_content: list[dict[str, Any]] = [] + for img in normalized_images: + user_content.append( + { + "type": "image", + "source": { + "type": "base64", + "media_type": img.mime_type, + "data": img.base64_str, + }, + } + ) + user_content.append({"type": "text", "text": prompt}) + messages.append({"role": "user", "content": user_content}) + else: + messages.append({"role": "user", "content": prompt}) logger.debug("===== Sending Query to Anthropic LLM =====") logger.debug(f"Model: {model} Base URL: {base_url}") diff --git a/lightrag/llm/bedrock.py b/lightrag/llm/bedrock.py index e651e3c8c1..0da075a273 100644 --- a/lightrag/llm/bedrock.py +++ b/lightrag/llm/bedrock.py @@ -1,7 +1,8 @@ import copy -import os +import inspect import json import logging +import warnings import pipmaster as pm # Pipmaster for dynamic library install @@ -16,14 +17,10 @@ retry_if_exception_type, ) -import sys -from lightrag.utils import wrap_embedding_func_with_attrs +from collections.abc import AsyncIterator +from typing import Any, Union -if sys.version_info < (3, 9): - from typing import AsyncIterator 
-else: - from collections.abc import AsyncIterator -from typing import Union +from lightrag.utils import wrap_embedding_func_with_attrs # Import botocore exceptions for proper exception handling try: @@ -55,10 +52,36 @@ class BedrockTimeoutError(BedrockError): """Error for timeout issues""" -def _set_env_if_present(key: str, value): - """Set environment variable only if a non-empty value is provided.""" - if value is not None and value != "": - os.environ[key] = value +def _normalize_bedrock_endpoint_url(endpoint_url: str | None) -> str | None: + """Return a usable Bedrock endpoint override or None for SDK defaults.""" + if endpoint_url is None: + return None + + normalized = endpoint_url.strip() + if not normalized or normalized == "DEFAULT_BEDROCK_ENDPOINT": + return None + + return normalized + + +def _bedrock_client_kwargs( + region: str | None, + endpoint_url: str | None, + aws_access_key_id: str | None = None, + aws_secret_access_key: str | None = None, + aws_session_token: str | None = None, +) -> dict: + """Build kwargs for aioboto3 ``session.client("bedrock-runtime", ...)``.""" + client_kwargs: dict = {"region_name": region} + if endpoint_url is not None: + client_kwargs["endpoint_url"] = endpoint_url + if aws_access_key_id: + client_kwargs["aws_access_key_id"] = aws_access_key_id + if aws_secret_access_key: + client_kwargs["aws_secret_access_key"] = aws_secret_access_key + if aws_session_token: + client_kwargs["aws_session_token"] = aws_session_token + return client_kwargs def _handle_bedrock_exception(e: Exception, operation: str = "Bedrock API") -> None: @@ -150,23 +173,70 @@ async def bedrock_complete_if_cache( aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, + aws_region: str | None = None, + api_key: str | None = None, + endpoint_url: str | None = None, + image_inputs: list[Any] | None = None, **kwargs, ) -> Union[str, AsyncIterator[str]]: + """Call Amazon Bedrock Converse API with LightRAG-compatible shims. 
+ + Structured output note: + - This adapter does not support OpenAI-style ``response_format`` JSON mode. + - If callers pass ``response_format``, it is stripped before the request. + - Deprecated ``keyword_extraction`` and ``entity_extraction`` booleans are + accepted only as compatibility shims; they emit warnings and are ignored. + + Authentication note: + - Bedrock does not use LightRAG's generic ``api_key`` fields. + - ``LLM_BINDING_API_KEY`` and ``EMBEDDING_BINDING_API_KEY`` are ignored for + Bedrock. + - To use Bedrock API key / bearer-token auth, set + ``AWS_BEARER_TOKEN_BEDROCK`` before starting the process; this is a + process-level AWS SDK setting. + - For role-specific Bedrock LLMs, use explicit SigV4 parameters + (``aws_access_key_id``, ``aws_secret_access_key``, ``aws_session_token``, + ``aws_region``). Per-role bearer-token overrides are not supported. + + Endpoint note: + - ``endpoint_url`` overrides the default regional Bedrock endpoint. Pass + ``None``, an empty string, or the sentinel ``DEFAULT_BEDROCK_ENDPOINT`` + to let the AWS SDK select its default endpoint. + """ if enable_cot: - import logging - logging.debug( "enable_cot=True is not supported for Bedrock and will be ignored." ) - # Respect existing env; only set if a non-empty value is available - access_key = os.environ.get("AWS_ACCESS_KEY_ID") or aws_access_key_id - secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY") or aws_secret_access_key - session_token = os.environ.get("AWS_SESSION_TOKEN") or aws_session_token - _set_env_if_present("AWS_ACCESS_KEY_ID", access_key) - _set_env_if_present("AWS_SECRET_ACCESS_KEY", secret_key) - _set_env_if_present("AWS_SESSION_TOKEN", session_token) - # Region handling: prefer env, else kwarg (optional) - region = os.environ.get("AWS_REGION") or kwargs.pop("aws_region", None) + + # Bedrock Converse API has no JSON mode; drop legacy extraction flags and + # response_format below and rely on the prompt template plus downstream + # tolerant JSON parsing. 
+ keyword_extraction = kwargs.pop("keyword_extraction", False) + entity_extraction = kwargs.pop("entity_extraction", False) + if keyword_extraction: + warnings.warn( + "bedrock_complete_if_cache(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + if entity_extraction: + warnings.warn( + "bedrock_complete_if_cache(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + if api_key: + warnings.warn( + "bedrock_complete_if_cache(api_key=...) is ignored; use SigV4 " + "parameters or set AWS_BEARER_TOKEN_BEDROCK before process start.", + DeprecationWarning, + stacklevel=2, + ) + + region = aws_region or kwargs.pop("aws_region", None) + endpoint_url = _normalize_bedrock_endpoint_url(endpoint_url) kwargs.pop("hashing_kv", None) # Capture stream flag (if provided) and remove from kwargs since it's not a Bedrock API parameter # We'll use this to determine whether to call converse_stream or converse @@ -183,7 +253,6 @@ async def bedrock_complete_if_cache( "logprobs", "top_logprobs", "max_completion_tokens", - "response_format", ]: kwargs.pop(k, None) # Fix message history format @@ -194,7 +263,26 @@ async def bedrock_complete_if_cache( messages.append(message) # Add user prompt - messages.append({"role": "user", "content": [{"text": prompt}]}) + if image_inputs: + from lightrag.llm._vision_utils import normalize_image_inputs + + normalized_images = normalize_image_inputs(image_inputs) + user_content: list[dict[str, Any]] = [{"text": prompt}] + for img in normalized_images: + fmt = img.mime_type.split("/", 1)[1] if "/" in img.mime_type else "png" + user_content.append( + {"image": {"format": fmt, "source": {"bytes": img.raw_bytes}}} + ) + messages.append({"role": "user", "content": user_content}) + + if stream: + logging.getLogger(__name__).debug( + "[bedrock] image_inputs provided; forcing 
non-stream Converse " + "(stream + image combination has SDK limitations)" + ) + stream = False + else: + messages.append({"role": "user", "content": [{"text": prompt}]}) # Initialize Converse API arguments args = {"modelId": model, "messages": messages} @@ -209,97 +297,84 @@ async def bedrock_complete_if_cache( "top_p": "topP", "stop_sequences": "stopSequences", } - if inference_params := list( - set(kwargs) & set(["max_tokens", "temperature", "top_p", "stop_sequences"]) - ): - args["inferenceConfig"] = {} - for param in inference_params: - args["inferenceConfig"][inference_params_map.get(param, param)] = ( - kwargs.pop(param) - ) - - # Import logging for error handling - import logging + inference_config: dict[str, Any] = {} + for param in ("max_tokens", "temperature", "top_p", "stop_sequences"): + if param not in kwargs: + continue + value = kwargs.pop(param) + # Bedrock rejects None; a None default means "inherit provider default" + if value is None: + continue + inference_config[inference_params_map.get(param, param)] = value + if inference_config: + args["inferenceConfig"] = inference_config + + # Pass-through for model-specific parameters (e.g. Anthropic reasoning_config, + # Nova inferenceConfig extensions). Mirrors OpenAI's `extra_body`. 
+ extra_fields = kwargs.pop("extra_fields", None) + if extra_fields: + args["additionalModelRequestFields"] = extra_fields # For streaming responses, we need a different approach to keep the connection open if stream: # Create a session that will be used throughout the streaming process session = aioboto3.Session() - client = None + client_kwargs = _bedrock_client_kwargs( + region, + endpoint_url, + aws_access_key_id=aws_access_key_id, + aws_secret_access_key=aws_secret_access_key, + aws_session_token=aws_session_token, + ) # Define the generator function that will manage the client lifecycle async def stream_generator(): - nonlocal client - - # Create the client outside the generator to ensure it stays open - client = await session.client( - "bedrock-runtime", region_name=region - ).__aenter__() - event_stream = None - iteration_started = False - - try: - # Make the API call - response = await client.converse_stream(**args, **kwargs) - event_stream = response.get("stream") - iteration_started = True - - # Process the stream - async for event in event_stream: - # Validate event structure - if not event or not isinstance(event, dict): - continue - - if "contentBlockDelta" in event: - delta = event["contentBlockDelta"].get("delta", {}) - text = delta.get("text") - if text: - yield text - # Handle other event types that might indicate stream end - elif "messageStop" in event: - break - - except Exception as e: - # Try to clean up resources if possible - if ( - iteration_started - and event_stream - and hasattr(event_stream, "aclose") - and callable(getattr(event_stream, "aclose", None)) - ): - try: - await event_stream.aclose() - except Exception as close_error: - logging.warning( - f"Failed to close Bedrock event stream: {close_error}" - ) - - # Convert to appropriate exception type - _handle_bedrock_exception(e, "Bedrock streaming") - - finally: - # Clean up the event stream - if ( - iteration_started - and event_stream - and hasattr(event_stream, "aclose") - and 
callable(getattr(event_stream, "aclose", None)) - ): - try: - await event_stream.aclose() - except Exception as close_error: - logging.warning( - f"Failed to close Bedrock event stream in finally block: {close_error}" - ) + # async with ensures the aioboto3 client is closed even under + # task cancellation, avoiding aiohttp "Unclosed connection" warnings. + async with session.client("bedrock-runtime", **client_kwargs) as client: + event_stream = None + try: + # Make the API call + response = await client.converse_stream(**args, **kwargs) + event_stream = response.get("stream") + + # Process the stream + async for event in event_stream: + # Validate event structure + if not event or not isinstance(event, dict): + continue + + if "contentBlockDelta" in event: + delta = event["contentBlockDelta"].get("delta", {}) + text = delta.get("text") + if text: + yield text + # Handle other event types that might indicate stream end + elif "messageStop" in event: + break - # Clean up the client - if client: - try: - await client.__aexit__(None, None, None) - except Exception as client_close_error: - logging.warning( - f"Failed to close Bedrock client: {client_close_error}" + except Exception as e: + # Convert to appropriate exception type + _handle_bedrock_exception(e, "Bedrock streaming") + + finally: + # Close the event stream once; client cleanup is handled by async with. + # aiobotocore's EventStream exposes sync `close()`, while generic + # async iterators expose async `aclose()` — handle both and dispatch + # awaitable results accordingly. 
+ if event_stream is not None: + close_fn = getattr(event_stream, "close", None) or getattr( + event_stream, "aclose", None ) + if callable(close_fn): + try: + result = close_fn() + if inspect.isawaitable(result): + await result + except Exception as close_error: + logging.warning( + f"Failed to close Bedrock event stream: {close_error}" + ) # Return the generator that manages its own lifecycle return stream_generator() @@ -307,7 +382,14 @@ async def stream_generator(): # For non-streaming responses, use the standard async context manager pattern session = aioboto3.Session() async with session.client( - "bedrock-runtime", region_name=region + "bedrock-runtime", + **_bedrock_client_kwargs( + region, + endpoint_url, + aws_access_key_id=aws_access_key_id, + aws_secret_access_key=aws_secret_access_key, + aws_session_token=aws_session_token, + ), ) as bedrock_async_client: try: # Use converse for non-streaming responses @@ -323,7 +405,17 @@ async def stream_generator(): ): raise BedrockError("Invalid response structure from Bedrock API") - content = response["output"]["message"]["content"][0]["text"] + # When thinking/reasoning is enabled, the first content block is a + # `reasoningContent` block and the visible text follows in a later + # block. Pick the first block that carries a text payload. 
+ content = next( + ( + block["text"] + for block in response["output"]["message"]["content"] + if isinstance(block, dict) and block.get("text") + ), + None, + ) if not content or content.strip() == "": raise BedrockError("Received empty content from Bedrock API") @@ -337,15 +429,24 @@ async def stream_generator(): # Generic Bedrock completion function async def bedrock_complete( - prompt, system_prompt=None, history_messages=[], keyword_extraction=False, **kwargs + prompt, + system_prompt=None, + history_messages=[], + keyword_extraction=False, + entity_extraction=False, + **kwargs, ) -> Union[str, AsyncIterator[str]]: - kwargs.pop("keyword_extraction", None) + # Bedrock Converse API has no JSON mode; the shim booleans are absorbed + # and forwarded so bedrock_complete_if_cache can emit DeprecationWarnings + # with accurate stack frames. model_name = kwargs["hashing_kv"].global_config["llm_model_name"] result = await bedrock_complete_if_cache( model_name, prompt, system_prompt=system_prompt, history_messages=history_messages, + keyword_extraction=keyword_extraction, + entity_extraction=entity_extraction, **kwargs, ) return result @@ -369,21 +470,44 @@ async def bedrock_embed( aws_access_key_id=None, aws_secret_access_key=None, aws_session_token=None, + aws_region: str | None = None, + api_key: str | None = None, + endpoint_url: str | None = None, ) -> np.ndarray: - # Respect existing env; only set if a non-empty value is available - access_key = os.environ.get("AWS_ACCESS_KEY_ID") or aws_access_key_id - secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY") or aws_secret_access_key - session_token = os.environ.get("AWS_SESSION_TOKEN") or aws_session_token - _set_env_if_present("AWS_ACCESS_KEY_ID", access_key) - _set_env_if_present("AWS_SECRET_ACCESS_KEY", secret_key) - _set_env_if_present("AWS_SESSION_TOKEN", session_token) + """Generate embeddings with Amazon Bedrock Runtime. + + Authentication note: + - Bedrock does not use LightRAG's generic ``api_key`` fields. 
+ - ``LLM_BINDING_API_KEY`` and ``EMBEDDING_BINDING_API_KEY`` are ignored for + Bedrock. + - To use Bedrock API key / bearer-token auth, set + ``AWS_BEARER_TOKEN_BEDROCK`` before starting the process; this is a + process-level AWS SDK setting. + - For role-specific Bedrock configuration, use explicit SigV4 parameters + (``aws_access_key_id``, ``aws_secret_access_key``, ``aws_session_token``, + ``aws_region``). Per-role bearer-token overrides are not supported. + """ + if api_key: + warnings.warn( + "bedrock_embed(api_key=...) is ignored; use SigV4 parameters or " + "set AWS_BEARER_TOKEN_BEDROCK before process start.", + DeprecationWarning, + stacklevel=2, + ) - # Region handling: prefer env - region = os.environ.get("AWS_REGION") + region = aws_region + endpoint_url = _normalize_bedrock_endpoint_url(endpoint_url) session = aioboto3.Session() async with session.client( - "bedrock-runtime", region_name=region + "bedrock-runtime", + **_bedrock_client_kwargs( + region, + endpoint_url, + aws_access_key_id=aws_access_key_id, + aws_secret_access_key=aws_secret_access_key, + aws_session_token=aws_session_token, + ), ) as bedrock_async_client: try: if (model_provider := model.split(".")[0]) == "amazon": diff --git a/lightrag/llm/binding_options.py b/lightrag/llm/binding_options.py index 3adcdf6257..515d9fdd15 100644 --- a/lightrag/llm/binding_options.py +++ b/lightrag/llm/binding_options.py @@ -342,6 +342,69 @@ def options_dict(cls, args: Namespace) -> dict[str, Any]: return options + @classmethod + def options_dict_for_role( + cls, args: Namespace, role: str, is_cross_provider: bool = False + ) -> dict[str, Any]: + """ + Extract role-specific provider options with proper inheritance. 
+ + Same provider: + - inherit the base binding options from parsed args + - overlay any role-specific environment variable overrides + + Cross provider: + - start from empty provider options + - overlay any role-specific environment variable overrides + + Role env vars follow the pattern: + `{ROLE}_{BINDING_PREFIX}_{FIELD}` + e.g. `EXTRACT_OPENAI_LLM_TEMPERATURE` + """ + import os + + if is_cross_provider: + base: dict[str, Any] = {} + else: + base = cls.options_dict(args) + + role_upper = role.upper() + env_prefix = cls._binding_name.upper() + "_" + + for arg_item in cls.args_env_name_type_value(): + original_env = arg_item["env_name"] + role_env = f"{role_upper}_{original_env}" + field_name = original_env[len(env_prefix) :].lower() + + env_raw = os.getenv(role_env) + if env_raw is None: + continue + + field_type = _resolve_optional_type(arg_item["type"]) + try: + if field_type is bool: + base[field_name] = env_raw.lower() in ( + "true", + "1", + "yes", + "t", + "on", + ) + elif field_type in (list, List[str]): + base[field_name] = json.loads(env_raw) + elif field_type is dict: + base[field_name] = json.loads(env_raw) + elif field_type is int: + base[field_name] = int(env_raw) + elif field_type is float: + base[field_name] = float(env_raw) + else: + base[field_name] = env_raw + except (ValueError, json.JSONDecodeError): + base[field_name] = env_raw + + return base + def asdict(self) -> dict[str, Any]: """ Convert an instance of binding options to a dictionary. @@ -567,6 +630,36 @@ class OpenAILLMOptions(BindingOptions): } +# ============================================================================= +# Binding Options for AWS Bedrock +# ============================================================================= +# +# Bedrock binding options map to the subset of the Bedrock Converse API +# inferenceConfig that LightRAG's bedrock driver actually forwards. 
See +# ``lightrag/llm/bedrock.py`` for the whitelist — any field added here that is +# not in that whitelist will be silently dropped by the driver. +# ============================================================================= +@dataclass +class BedrockLLMOptions(BindingOptions): + """Options for AWS Bedrock LLM (Converse API inferenceConfig).""" + + _binding_name: ClassVar[str] = "bedrock_llm" + + temperature: float = DEFAULT_TEMPERATURE + max_tokens: int | None = None + top_p: float = 1.0 + stop_sequences: List[str] = field(default_factory=list) + extra_fields: dict = None # Converse API additionalModelRequestFields + + _help: ClassVar[dict[str, str]] = { + "temperature": "Controls randomness (0.0-1.0 for most Bedrock models)", + "max_tokens": "Maximum tokens generated in the response (leave empty for model default)", + "top_p": "Nucleus sampling parameter (0.0-1.0)", + "stop_sequences": "Stop sequences (JSON array of strings, e.g., '[\"\"]')", + "extra_fields": 'Model-specific request fields forwarded as Converse API additionalModelRequestFields (JSON dict, e.g., \'{"reasoning_config": {"type": "enabled"}}\')', + } + + # ============================================================================= # Main Section - For Testing and Sample Generation # ============================================================================= diff --git a/lightrag/llm/gemini.py b/lightrag/llm/gemini.py index 19f35867f0..b448a87edc 100644 --- a/lightrag/llm/gemini.py +++ b/lightrag/llm/gemini.py @@ -10,6 +10,7 @@ from __future__ import annotations import os +import warnings from collections.abc import AsyncIterator from functools import lru_cache from typing import Any @@ -48,6 +49,33 @@ class InvalidResponseError(Exception): pass +_DEFAULT_GEMINI_BASE_URLS = { + "https://generativelanguage.googleapis.com", + "https://generativelanguage.googleapis.com/", + "https://generativelanguage.googleapis.com/v1beta", + "https://generativelanguage.googleapis.com/v1beta/", + 
"https://generativelanguage.googleapis.com/v1", + "https://generativelanguage.googleapis.com/v1/", +} + + +def _normalize_gemini_base_url(base_url: str | None) -> str | None: + """Treat Google's default Gemini API service roots as SDK defaults.""" + if not base_url: + return None + + normalized = base_url.strip() + if not normalized or normalized == "DEFAULT_GEMINI_ENDPOINT": + return None + + if normalized.rstrip("/") in { + service_root.rstrip("/") for service_root in _DEFAULT_GEMINI_BASE_URLS + }: + return None + + return normalized + + @lru_cache(maxsize=8) def _get_gemini_client( api_key: str, base_url: str | None, timeout: int | None = None @@ -64,6 +92,7 @@ def _get_gemini_client( genai.Client: Configured Gemini client instance. """ client_kwargs: dict[str, Any] = {} + normalized_base_url = _normalize_gemini_base_url(base_url) # Add Vertex AI support use_vertexai = os.getenv("GOOGLE_GENAI_USE_VERTEXAI", "").lower() == "true" @@ -84,11 +113,11 @@ def _get_gemini_client( # Standard Gemini API mode: use api_key client_kwargs["api_key"] = api_key - if base_url and base_url != "DEFAULT_GEMINI_ENDPOINT" or timeout is not None: + if normalized_base_url is not None or timeout is not None: try: http_options_kwargs = {} - if base_url and base_url != "DEFAULT_GEMINI_ENDPOINT": - http_options_kwargs["base_url"] = base_url + if normalized_base_url is not None: + http_options_kwargs["base_url"] = normalized_base_url if timeout is not None: http_options_kwargs["timeout"] = timeout @@ -119,7 +148,7 @@ def _ensure_api_key(api_key: str | None) -> str: def _build_generation_config( base_config: dict[str, Any] | None, system_prompt: str | None, - keyword_extraction: bool, + response_format: Any | None, ) -> types.GenerateContentConfig | None: config_data = dict(base_config or {}) @@ -131,8 +160,12 @@ def _build_generation_config( else: config_data["system_instruction"] = system_prompt - if keyword_extraction and not config_data.get("response_mime_type"): - 
config_data["response_mime_type"] = "application/json" + # Translate response_format to Gemini's native generation config fields. + if response_format is not None: + config_data.setdefault("response_mime_type", "application/json") + schema = _normalize_gemini_response_schema(response_format) + if schema is not None and "response_json_schema" not in config_data: + config_data["response_json_schema"] = schema # Remove entries that are explicitly set to None to avoid type errors sanitized = { @@ -147,6 +180,39 @@ def _build_generation_config( return types.GenerateContentConfig(**sanitized) +def _normalize_gemini_response_schema(response_format: Any) -> Any | None: + """Extract a Gemini-compatible JSON schema from LightRAG/OpenAI inputs.""" + if response_format is None: + return None + + if isinstance(response_format, dict): + if response_format.get("type") == "json_object": + return None + + if response_format.get("type") == "json_schema": + json_schema = response_format.get("json_schema") + if isinstance(json_schema, dict): + schema = json_schema.get("schema") + if isinstance(schema, dict): + return schema + return json_schema + + return response_format + + return response_format + + +def _validate_gemini_response_format(response_format: Any | None) -> None: + """Reject typed structured-output helpers; only dict payloads are supported.""" + if response_format is None or isinstance(response_format, dict): + return + + raise TypeError( + "gemini_complete_if_cache only supports dict response_format payloads; " + "typed/Pydantic response_format values are not supported." 
+ ) + + def _format_history_messages(history_messages: list[dict[str, Any]] | None) -> str: if not history_messages: return "" @@ -225,9 +291,12 @@ async def gemini_complete_if_cache( api_key: str | None = None, token_tracker: Any | None = None, stream: bool | None = None, + response_format: Any | None = None, keyword_extraction: bool = False, + entity_extraction: bool = False, generation_config: dict[str, Any] | None = None, timeout: int | None = None, + image_inputs: list[Any] | None = None, **_: Any, ) -> str | AsyncIterator[str]: """ @@ -236,6 +305,19 @@ async def gemini_complete_if_cache( This function supports automatic integration of reasoning content from Gemini models that provide Chain of Thought capabilities via the thinking_config API feature. + Structured output note: + - This adapter accepts OpenAI-style ``response_format`` and translates it + to Gemini's native generation config fields. + - ``response_format={"type": "json_object"}`` maps to + ``response_mime_type="application/json"``. + - Dict-form ``json_schema`` payloads map to + ``response_mime_type="application/json"`` plus + ``response_json_schema=<schema>``. + - Typed/Pydantic ``response_format`` helpers are rejected explicitly. + - Deprecated ``keyword_extraction`` and ``entity_extraction`` booleans are + compatibility shims; when no explicit ``response_format`` is supplied, + they are mapped to ``{"type": "json_object"}``. + COT Integration: - When enable_cot=True: Thought content is wrapped in <think>...</think> tags - When enable_cot=False: Thought content is filtered out, only regular content returned @@ -250,7 +332,11 @@ async def gemini_complete_if_cache( api_key: Optional Gemini API key. If None, uses environment variable. base_url: Optional custom API endpoint. generation_config: Optional generation configuration dict. - keyword_extraction: Whether to use JSON response format. + response_format: OpenAI-style structured output control translated to + Gemini generation config.
``{"type": "json_object"}`` maps to + ``response_mime_type="application/json"``; dict-form + ``json_schema`` payloads map to ``response_json_schema``. + Typed/Pydantic response_format values are rejected. token_tracker: Optional token usage tracker for monitoring API usage. stream: Whether to stream the response. hashing_kv: Storage interface (for interface parity with other bindings). @@ -271,6 +357,29 @@ async def gemini_complete_if_cache( timeout_ms = timeout * 1000 if timeout else None client = _get_gemini_client(key, base_url, timeout_ms) + # Deprecation shims: map legacy boolean flags to response_format only when + # an explicit response_format was not supplied. + if response_format is None: + if entity_extraction: + warnings.warn( + "gemini_complete_if_cache(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + response_format = {"type": "json_object"} + elif keyword_extraction: + warnings.warn( + "gemini_complete_if_cache(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + response_format = {"type": "json_object"} + _validate_gemini_response_format(response_format) + if response_format is not None: + enable_cot = False + history_block = _format_history_messages(history_messages) prompt_sections = [] if history_block: @@ -281,12 +390,25 @@ async def gemini_complete_if_cache( config_obj = _build_generation_config( generation_config, system_prompt=system_prompt, - keyword_extraction=keyword_extraction, + response_format=response_format, ) + if image_inputs: + from lightrag.llm._vision_utils import normalize_image_inputs + + normalized_images = normalize_image_inputs(image_inputs) + parts: list[Any] = [combined_prompt] + parts.extend( + types.Part.from_bytes(data=img.raw_bytes, mime_type=img.mime_type) + for img in normalized_images + ) + contents: list[Any] = [parts] + else: + 
contents = [combined_prompt] + request_kwargs: dict[str, Any] = { "model": model, - "contents": [combined_prompt], + "contents": contents, } if config_obj is not None: request_kwargs["config"] = config_obj @@ -303,10 +425,10 @@ async def _async_stream() -> AsyncIterator[str]: try: # Use native async streaming from genai SDK # Note: generate_content_stream returns Awaitable[AsyncIterator], need to await first - stream = await client.aio.models.generate_content_stream( + stream_iter = await client.aio.models.generate_content_stream( **request_kwargs ) - async for chunk in stream: + async for chunk in stream_iter: usage = getattr(chunk, "usage_metadata", None) if usage is not None: usage_metadata = usage @@ -363,14 +485,14 @@ async def _async_stream() -> AsyncIterator[str]: yield "" cot_active = False - except Exception as exc: + except Exception: # Try to close COT tag before re-raising if cot_active: try: yield "" except Exception: pass - raise exc + raise finally: # Track token usage after streaming completes if token_tracker and usage_metadata: @@ -439,9 +561,13 @@ async def gemini_model_complete( prompt: str, system_prompt: str | None = None, history_messages: list[dict[str, Any]] | None = None, + response_format: Any | None = None, keyword_extraction: bool = False, + entity_extraction: bool = False, **kwargs: Any, ) -> str | AsyncIterator[str]: + # Accept legacy keyword if passed via kwargs to preserve backwards compat. 
+ entity_extraction = kwargs.pop("entity_extraction", entity_extraction) hashing_kv = kwargs.get("hashing_kv") model_name = None if hashing_kv is not None: @@ -456,7 +582,9 @@ async def gemini_model_complete( prompt, system_prompt=system_prompt, history_messages=history_messages, + response_format=response_format, keyword_extraction=keyword_extraction, + entity_extraction=entity_extraction, **kwargs, ) diff --git a/lightrag/llm/hf.py b/lightrag/llm/hf.py index 008edef873..479b7c0bef 100644 --- a/lightrag/llm/hf.py +++ b/lightrag/llm/hf.py @@ -1,5 +1,6 @@ import copy import os +import warnings from functools import lru_cache import pipmaster as pm # Pipmaster for dynamic library install @@ -126,10 +127,35 @@ async def hf_model_complete( system_prompt=None, history_messages=[], keyword_extraction=False, + entity_extraction=False, enable_cot: bool = False, **kwargs, ) -> str: - kwargs.pop("keyword_extraction", None) + """Run local Hugging Face inference with LightRAG-compatible shims. + + Structured output note: + - This adapter does not support OpenAI-style ``response_format`` JSON mode. + - If callers pass ``response_format``, it is stripped before generation. + - Deprecated ``keyword_extraction`` and ``entity_extraction`` booleans are + accepted only as compatibility shims; they emit warnings and are ignored. + """ + # HuggingFace local inference has no JSON mode; drop response_format and + # warn when legacy shim flags are set. 
+ if kwargs.pop("keyword_extraction", False) or keyword_extraction: + warnings.warn( + "hf_model_complete(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + if kwargs.pop("entity_extraction", False) or entity_extraction: + warnings.warn( + "hf_model_complete(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs.pop("response_format", None) model_name = kwargs["hashing_kv"].global_config["llm_model_name"] result = await hf_model_if_cache( model_name, diff --git a/lightrag/llm/llama_index_impl.py b/lightrag/llm/llama_index_impl.py index c44e6c7a30..8b444c6551 100644 --- a/lightrag/llm/llama_index_impl.py +++ b/lightrag/llm/llama_index_impl.py @@ -1,10 +1,12 @@ +import warnings + import pipmaster as pm from llama_index.core.llms import ( ChatMessage, MessageRole, ChatResponse, ) -from typing import List, Optional +from typing import Any, List, Optional from lightrag.utils import logger # Install required dependencies @@ -30,7 +32,7 @@ import numpy as np -def configure_llama_index(settings: LlamaIndexSettings = None, **kwargs): +def configure_llama_index(settings: Any = None, **kwargs): """ Configure LlamaIndex settings. @@ -145,24 +147,49 @@ async def llama_index_complete( history_messages=None, enable_cot: bool = False, keyword_extraction=False, - settings: LlamaIndexSettings = None, + entity_extraction=False, + settings: Any = None, **kwargs, ) -> str: """ - Main completion function for LlamaIndex + Main completion function for LlamaIndex. Args: prompt: Input prompt system_prompt: Optional system prompt history_messages: Optional chat history - keyword_extraction: Whether to extract keywords from response + keyword_extraction: Deprecated compatibility shim. Emits a warning and + is ignored. + entity_extraction: Deprecated compatibility shim. Emits a warning and + is ignored. 
settings: Optional LlamaIndex settings - **kwargs: Additional arguments + **kwargs: Additional arguments. ``response_format`` is not supported by + this adapter and is stripped before calling LlamaIndex. + + Structured output note: + - This adapter does not support OpenAI-style ``response_format`` JSON mode. + - If callers pass ``response_format``, it is stripped before generation. """ if history_messages is None: history_messages = [] - kwargs.pop("keyword_extraction", None) + # LlamaIndex adapters have no JSON mode; drop response_format and warn + # when legacy boolean shim flags are set. + if kwargs.pop("keyword_extraction", False) or keyword_extraction: + warnings.warn( + "llama_index_complete(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + if kwargs.pop("entity_extraction", False) or entity_extraction: + warnings.warn( + "llama_index_complete(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs.pop("response_format", None) result = await llama_index_complete_if_cache( kwargs.get("llm_instance"), prompt, @@ -185,7 +212,7 @@ async def llama_index_complete( async def llama_index_embed( texts: list[str], embed_model: BaseEmbedding = None, - settings: LlamaIndexSettings = None, + settings: Any = None, **kwargs, ) -> np.ndarray: """ diff --git a/lightrag/llm/lmdeploy.py b/lightrag/llm/lmdeploy.py index 8916b0fde0..6f160211fd 100644 --- a/lightrag/llm/lmdeploy.py +++ b/lightrag/llm/lmdeploy.py @@ -1,3 +1,5 @@ +import warnings + import pipmaster as pm # Pipmaster for dynamic library install # install specific modules @@ -62,7 +64,14 @@ async def lmdeploy_model_if_cache( quant_policy=0, **kwargs, ) -> str: - """ + """Run lmdeploy generation with LightRAG-compatible shims. + + Structured output note: + - This adapter does not support OpenAI-style ``response_format`` JSON mode. 
+ - If callers pass ``response_format``, it is stripped before generation. + - Deprecated ``keyword_extraction`` and ``entity_extraction`` booleans are + accepted only as compatibility shims; they emit warnings and are ignored. + Args: model (str): The path to the model. It could be one of the following options: @@ -102,6 +111,22 @@ async def lmdeploy_model_if_cache( except Exception: raise ImportError("Please install lmdeploy before initialize lmdeploy backend.") kwargs.pop("hashing_kv", None) + # lmdeploy has no JSON mode; drop response_format and warn when legacy + # boolean shim flags are set. + if kwargs.pop("keyword_extraction", False): + warnings.warn( + "lmdeploy_model_if_cache(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + if kwargs.pop("entity_extraction", False): + warnings.warn( + "lmdeploy_model_if_cache(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) kwargs.pop("response_format", None) max_new_tokens = kwargs.pop("max_tokens", 512) tp = kwargs.pop("tp", 1) diff --git a/lightrag/llm/lollms.py b/lightrag/llm/lollms.py index 3eaef1afad..011bdcaa7b 100644 --- a/lightrag/llm/lollms.py +++ b/lightrag/llm/lollms.py @@ -1,4 +1,5 @@ import sys +import warnings if sys.version_info < (3, 9): from typing import AsyncIterator @@ -23,7 +24,7 @@ APITimeoutError, ) -from typing import Union, List +from typing import Any, List, Union import numpy as np from lightrag.utils import ( @@ -45,14 +46,51 @@ async def lollms_model_if_cache( history_messages=[], enable_cot: bool = False, base_url="http://localhost:9600", + image_inputs: list[Any] | None = None, **kwargs, ) -> Union[str, AsyncIterator[str]]: - """Client implementation for lollms generation.""" + """Client implementation for lollms generation. 
+ + Structured output note: + - This adapter does not support OpenAI-style ``response_format`` JSON mode. + - If callers pass ``response_format``, it is stripped before the request. + - Deprecated ``keyword_extraction`` and ``entity_extraction`` booleans are + accepted only as compatibility shims; they emit warnings and are ignored. + + Vision note: + - lollms does not support image inputs. Passing a non-empty + ``image_inputs`` raises :class:`NotImplementedError`. + """ + if image_inputs: + raise NotImplementedError( + "lollms binding does not support image_inputs; configure a " + "vision-capable VLM provider (openai/azure_openai/gemini/bedrock/" + "ollama/anthropic) for VLM_LLM_BINDING." + ) + if enable_cot: from lightrag.utils import logger logger.debug("enable_cot=True is not supported for lollms and will be ignored.") + # lollms has no JSON mode; drop response_format and warn when legacy + # boolean shim flags are set. + if kwargs.pop("keyword_extraction", False): + warnings.warn( + "lollms_model_if_cache(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + if kwargs.pop("entity_extraction", False): + warnings.warn( + "lollms_model_if_cache(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs.pop("response_format", None) + stream = True if kwargs.get("stream") else False api_key = kwargs.pop("api_key", None) headers = ( @@ -112,21 +150,18 @@ async def lollms_model_complete( history_messages=[], enable_cot: bool = False, keyword_extraction=False, + entity_extraction=False, **kwargs, ) -> Union[str, AsyncIterator[str]]: """Complete function for lollms model generation.""" - # Extract and remove keyword_extraction from kwargs if present - keyword_extraction = kwargs.pop("keyword_extraction", None) - - # Get model name from config - model_name = 
kwargs["hashing_kv"].global_config["llm_model_name"] - - # If keyword extraction is needed, we might need to modify the prompt - # or add specific parameters for JSON output (if lollms supports it) + # Forward legacy extraction flags as kwargs so lollms_model_if_cache can + # emit a single DeprecationWarning with the correct stack frame. if keyword_extraction: - # Note: You might need to adjust this based on how lollms handles structured output - pass + kwargs.setdefault("keyword_extraction", True) + if entity_extraction: + kwargs.setdefault("entity_extraction", True) + model_name = kwargs["hashing_kv"].global_config["llm_model_name"] return await lollms_model_if_cache( model_name, diff --git a/lightrag/llm/ollama.py b/lightrag/llm/ollama.py index 879d389cf7..34f689631c 100644 --- a/lightrag/llm/ollama.py +++ b/lightrag/llm/ollama.py @@ -1,6 +1,7 @@ from collections.abc import AsyncIterator import os import re +import warnings import pipmaster as pm @@ -24,7 +25,7 @@ from lightrag.api import __api_version__ import numpy as np -from typing import Optional, Union +from typing import Any, Optional, Union from lightrag.utils import ( wrap_embedding_func_with_attrs, logger, @@ -51,6 +52,34 @@ def _coerce_host_for_cloud_model(host: Optional[str], model: object) -> Optional return host +def _normalize_ollama_response_format(kwargs: dict) -> None: + """Translate OpenAI-style response_format into Ollama's native format field. + + Precedence: an explicit ``format`` value (Ollama's native field) wins over + ``response_format`` — if ``format`` is already set, ``response_format`` is + dropped silently. Otherwise, ``{"type": "json_object"}`` maps to + ``format="json"`` and any other payload is passed through unchanged so + callers can supply JSON schemas directly. 
+ """ + + response_format = kwargs.pop("response_format", None) + if kwargs.get("format") is not None or response_format is None: + return + + if isinstance(response_format, dict): + if response_format.get("type") == "json_object": + kwargs["format"] = "json" + return + if response_format.get("type") == "json_schema": + json_schema = response_format.get("json_schema") + if isinstance(json_schema, dict): + kwargs["format"] = json_schema.get("schema", json_schema) + return + + # Fall back to passing through schema-like payloads for native Ollama support. + kwargs["format"] = response_format + + @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), @@ -64,14 +93,49 @@ async def _ollama_model_if_cache( system_prompt=None, history_messages=[], enable_cot: bool = False, + image_inputs: list[Any] | None = None, **kwargs, ) -> Union[str, AsyncIterator[str]]: + """Call Ollama chat API with OpenAI-style structured-output compatibility. + + Structured output note: + - This adapter accepts OpenAI-style ``response_format`` and translates it + to Ollama's native ``format`` field. + - ``response_format={"type": "json_object"}`` maps to ``format="json"``. + - Deprecated ``keyword_extraction`` and ``entity_extraction`` booleans are + compatibility shims; when no explicit ``response_format`` is supplied, + they are mapped to ``{"type": "json_object"}``. + """ if enable_cot: logger.debug("enable_cot=True is not supported for ollama and will be ignored.") stream = True if kwargs.get("stream") else False kwargs.pop("max_tokens", None) - # kwargs.pop("response_format", None) # allow json + # Deprecation shims: map legacy boolean flags to response_format only when + # an explicit response_format was not supplied by the caller. 
+ if kwargs.get("response_format") is None: + if kwargs.pop("entity_extraction", False): + warnings.warn( + "_ollama_model_if_cache(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs["response_format"] = {"type": "json_object"} + elif kwargs.pop("keyword_extraction", False): + warnings.warn( + "_ollama_model_if_cache(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs["response_format"] = {"type": "json_object"} + else: + # response_format was supplied explicitly; drop legacy flags silently. + kwargs.pop("entity_extraction", None) + kwargs.pop("keyword_extraction", None) + + _normalize_ollama_response_format(kwargs) host = kwargs.pop("host", None) timeout = kwargs.pop("timeout", None) if timeout == 0: @@ -97,7 +161,13 @@ async def _ollama_model_if_cache( if system_prompt: messages.append({"role": "system", "content": system_prompt}) messages.extend(history_messages) - messages.append({"role": "user", "content": prompt}) + user_message: dict[str, Any] = {"role": "user", "content": prompt} + if image_inputs: + from lightrag.llm._vision_utils import normalize_image_inputs + + normalized_images = normalize_image_inputs(image_inputs) + user_message["images"] = [img.base64_str for img in normalized_images] + messages.append(user_message) response = await ollama_client.chat(model=model, messages=messages, **kwargs) if stream: @@ -156,11 +226,15 @@ async def ollama_model_complete( history_messages=[], enable_cot: bool = False, keyword_extraction=False, + entity_extraction=False, **kwargs, ) -> Union[str, AsyncIterator[str]]: - keyword_extraction = kwargs.pop("keyword_extraction", None) + # Forward legacy extraction flags as kwargs so _ollama_model_if_cache can + # emit a single DeprecationWarning with the correct stack frame. 
if keyword_extraction: - kwargs["format"] = "json" + kwargs.setdefault("keyword_extraction", True) + if entity_extraction: + kwargs.setdefault("entity_extraction", True) model_name = kwargs["hashing_kv"].global_config["llm_model_name"] return await _ollama_model_if_cache( model_name, diff --git a/lightrag/llm/openai.py b/lightrag/llm/openai.py index f02e1eef84..392e512069 100644 --- a/lightrag/llm/openai.py +++ b/lightrag/llm/openai.py @@ -1,6 +1,7 @@ from ..utils import verbose_debug, VERBOSE_DEBUG import os import logging +import warnings from collections.abc import AsyncIterator @@ -28,7 +29,6 @@ logger, ) -from lightrag.types import GPTKeywordExtractionFormat from lightrag.api import __api_version__ import numpy as np @@ -75,6 +75,17 @@ class InvalidResponseError(Exception): pass +def _validate_openai_response_format(response_format: Any | None) -> None: + """Reject typed structured-output helpers; only wire-format dicts are supported.""" + if response_format is None or isinstance(response_format, dict): + return + + raise TypeError( + "openai_complete_if_cache only supports dict response_format payloads; " + "typed/Pydantic response_format values are not supported." + ) + + # Module-level cache for tiktoken encodings _TIKTOKEN_ENCODING_CACHE: dict[str, Any] = {} @@ -196,6 +207,7 @@ def create_openai_async_client( return AsyncOpenAI(**merged_configs) +# TODO LengthFinishReasonError should not persist into LLM cache @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), @@ -221,6 +233,7 @@ async def openai_complete_if_cache( use_azure: bool = False, azure_deployment: str | None = None, api_version: str | None = None, + image_inputs: list[Any] | None = None, **kwargs: Any, ) -> str: """Complete a prompt using OpenAI's API with caching support and Chain of Thought (COT) integration. @@ -229,6 +242,22 @@ async def openai_complete_if_cache( Chain of Thought capabilities. 
The reasoning content is seamlessly integrated into the response using ... tags. + Structured output design note: + - This adapter supports dict-based OpenAI response_format payloads, + including ``{"type": "json_object"}`` and dict-form ``json_schema``. + - Typed/Pydantic ``response_format`` helpers are rejected explicitly. + - Structured responses are returned as raw text from ``message.content`` + and are not locally schema-validated here. + - ``keyword_extraction`` is deprecated; prefer + ``response_format={"type": "json_object"}`` instead. + + Note on truncated structured output: when the OpenAI SDK raises + `LengthFinishReasonError`, callers may still receive partial raw JSON from + `completion.choices[0].message.content`. That payload should be treated as + best-effort recovery only. If the JSON was truncated or repaired after + truncation, it is safer not to persist it into the LLM cache because later + runs with a higher token budget could otherwise keep reusing incomplete data. + Note on `reasoning_content`: This feature relies on a Deepseek Style `reasoning_content` in the API response, which may be provided by OpenAI-compatible endpoints that support Chain of Thought. @@ -254,8 +283,10 @@ async def openai_complete_if_cache( token_tracker: Optional token usage tracker for monitoring API usage. stream: Whether to stream the response. Default is False. timeout: Request timeout in seconds. Default is None. - keyword_extraction: Whether to enable keyword extraction mode. When True, triggers - special response formatting for keyword extraction. Default is False. + keyword_extraction: Deprecated compatibility shim. When True and no + explicit ``response_format`` is supplied, it is mapped to + ``{"type": "json_object"}``. Prefer passing ``response_format`` + directly. Default is False. use_azure: Whether to use Azure OpenAI service instead of standard OpenAI. When True, creates an AsyncAzureOpenAI client. Default is False. 
azure_deployment: Azure OpenAI deployment name. Only used when use_azure=True. @@ -265,6 +296,10 @@ async def openai_complete_if_cache( environment variable. **kwargs: Additional keyword arguments to pass to the OpenAI API. Special kwargs: + - response_format: Structured output control forwarded to the OpenAI + chat completions API. This adapter accepts dict payloads such + as ``{"type": "json_object"}`` and dict-form ``json_schema``, + but rejects typed/Pydantic response_format values. - openai_client_configs: Dict of configuration options for the AsyncOpenAI client. These will be passed to the client constructor but will be overridden by explicit parameters (api_key, base_url). Supports proxy configuration, @@ -293,9 +328,29 @@ async def openai_complete_if_cache( # Extract client configuration options client_configs = kwargs.pop("openai_client_configs", {}) - # Handle keyword extraction mode - if keyword_extraction: - kwargs["response_format"] = GPTKeywordExtractionFormat + # Deprecation shims: map legacy boolean flags to response_format only when + # an explicit response_format was not supplied by the caller. Prefer passing + # response_format directly. 
+ entity_extraction = kwargs.pop("entity_extraction", False) + if entity_extraction and kwargs.get("response_format") is None: + warnings.warn( + "openai_complete_if_cache(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs["response_format"] = {"type": "json_object"} + if keyword_extraction and kwargs.get("response_format") is None: + warnings.warn( + "openai_complete_if_cache(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs["response_format"] = {"type": "json_object"} + _validate_openai_response_format(kwargs.get("response_format")) + if kwargs.get("response_format") is not None: + enable_cot = False # Create the OpenAI client (supports both OpenAI and Azure) openai_async_client = create_openai_async_client( @@ -313,7 +368,23 @@ async def openai_complete_if_cache( if system_prompt: messages.append({"role": "system", "content": system_prompt}) messages.extend(history_messages) - messages.append({"role": "user", "content": prompt}) + if image_inputs: + from lightrag.llm._vision_utils import normalize_image_inputs + + normalized_images = normalize_image_inputs(image_inputs) + user_content: list[dict[str, Any]] = [{"type": "text", "text": prompt}] + for img in normalized_images: + user_content.append( + { + "type": "image_url", + "image_url": { + "url": f"data:{img.mime_type};base64,{img.base64_str}" + }, + } + ) + messages.append({"role": "user", "content": user_content}) + else: + messages.append({"role": "user", "content": prompt}) logger.debug("===== Entering func of LLM =====") logger.debug(f"Model: {model} Base URL: {base_url}") @@ -337,31 +408,34 @@ async def openai_complete_if_cache( api_model = azure_deployment if use_azure and azure_deployment else model try: - # Don't use async with context manager, use client directly - if "response_format" in kwargs: - # 
beta.chat.completions.parse() provides structured output and is inherently - # non-streaming; passing `stream=True` would raise a TypeError at runtime. - # Strip `stream` from kwargs before forwarding to avoid this error when - # OpenAI-compatible providers (e.g. DeepSeek) set stream in their kwargs. - parse_kwargs = {k: v for k, v in kwargs.items() if k != "stream"} - response = await openai_async_client.chat.completions.parse( - model=api_model, messages=messages, **parse_kwargs - ) - else: - response = await openai_async_client.chat.completions.create( - model=api_model, messages=messages, **kwargs - ) + # Single dispatch: create() covers the dict-based response_format + # payloads used by this project. Typed/Pydantic helpers are rejected + # above. Length-truncation is detected via finish_reason below and the + # raw content is returned unchanged so upstream tolerant JSON parsing + # can still salvage it. + response = await openai_async_client.chat.completions.create( + model=api_model, messages=messages, **kwargs + ) except APITimeoutError as e: logger.error(f"OpenAI API Timeout Error: {e}") - await openai_async_client.close() # Ensure client is closed + try: + await openai_async_client.close() + except Exception as close_error: + logger.warning(f"Failed to close OpenAI client: {close_error}") raise except APIConnectionError as e: logger.error(f"OpenAI API Connection Error: {e}") - await openai_async_client.close() # Ensure client is closed + try: + await openai_async_client.close() + except Exception as close_error: + logger.warning(f"Failed to close OpenAI client: {close_error}") raise except RateLimitError as e: logger.error(f"OpenAI API Rate Limit Error: {e}") - await openai_async_client.close() # Ensure client is closed + try: + await openai_async_client.close() + except Exception as close_error: + logger.warning(f"Failed to close OpenAI client: {close_error}") raise except Exception as e: body = getattr(e, "body", None) @@ -378,7 +452,10 @@ async def 
openai_complete_if_cache( logger.error( f"OpenAI API Call Failed,\nModel: {model},\nParams: {kwargs}, Got: {e}{extra}" ) - await openai_async_client.close() # Ensure client is closed + try: + await openai_async_client.close() + except Exception as close_error: + logger.warning(f"Failed to close OpenAI client: {close_error}") raise if hasattr(response, "__aiter__"): @@ -513,7 +590,12 @@ async def inner(): f"Failed to close stream response: {close_error}" ) # Ensure client is closed in case of exception - await openai_async_client.close() + try: + await openai_async_client.close() + except Exception as client_close_error: + logger.warning( + f"Failed to close OpenAI client after stream error: {client_close_error}" + ) raise finally: # Final safety check for unclosed COT tags @@ -566,7 +648,10 @@ async def inner(): or not hasattr(response.choices[0], "message") ): logger.error("Invalid response from OpenAI API") - await openai_async_client.close() # Ensure client is closed + try: + await openai_async_client.close() + except Exception as close_error: + logger.warning(f"Failed to close OpenAI client: {close_error}") raise InvalidResponseError("Invalid response from OpenAI API") message = response.choices[0].message @@ -619,7 +704,10 @@ async def inner(): # Validate final content if not final_content or final_content.strip() == "": logger.error("Received empty content from OpenAI API") - await openai_async_client.close() # Ensure client is closed + try: + await openai_async_client.close() + except Exception as close_error: + logger.warning(f"Failed to close OpenAI client: {close_error}") raise InvalidResponseError("Received empty content from OpenAI API") # Apply Unicode decoding to final content if needed @@ -642,7 +730,12 @@ async def inner(): return final_content finally: # Ensure client is closed in all cases for non-streaming responses - await openai_async_client.close() + try: + await openai_async_client.close() + except Exception as close_error: + logger.warning( 
+ f"Failed to close OpenAI client in non-streaming finally block: {close_error}" + ) async def openai_complete( @@ -650,10 +743,13 @@ async def openai_complete( system_prompt=None, history_messages=None, keyword_extraction=False, + entity_extraction=False, **kwargs, ) -> Union[str, AsyncIterator[str]]: if history_messages is None: history_messages = [] + # Pop entity_extraction from kwargs if also passed there (avoid duplication) + entity_extraction = kwargs.pop("entity_extraction", entity_extraction) model_name = kwargs["hashing_kv"].global_config["llm_model_name"] return await openai_complete_if_cache( model_name, @@ -661,6 +757,7 @@ async def openai_complete( system_prompt=system_prompt, history_messages=history_messages, keyword_extraction=keyword_extraction, + entity_extraction=entity_extraction, **kwargs, ) @@ -671,10 +768,12 @@ async def gpt_4o_complete( history_messages=None, enable_cot: bool = False, keyword_extraction=False, + entity_extraction=False, **kwargs, ) -> str: if history_messages is None: history_messages = [] + entity_extraction = kwargs.pop("entity_extraction", entity_extraction) return await openai_complete_if_cache( "gpt-4o", prompt, @@ -682,6 +781,7 @@ async def gpt_4o_complete( history_messages=history_messages, enable_cot=enable_cot, keyword_extraction=keyword_extraction, + entity_extraction=entity_extraction, **kwargs, ) @@ -692,10 +792,12 @@ async def gpt_4o_mini_complete( history_messages=None, enable_cot: bool = False, keyword_extraction=False, + entity_extraction=False, **kwargs, ) -> str: if history_messages is None: history_messages = [] + entity_extraction = kwargs.pop("entity_extraction", entity_extraction) return await openai_complete_if_cache( "gpt-4o-mini", prompt, @@ -703,6 +805,7 @@ async def gpt_4o_mini_complete( history_messages=history_messages, enable_cot=enable_cot, keyword_extraction=keyword_extraction, + entity_extraction=entity_extraction, **kwargs, ) @@ -713,10 +816,12 @@ async def nvidia_openai_complete( 
history_messages=None, enable_cot: bool = False, keyword_extraction=False, + entity_extraction=False, **kwargs, ) -> str: if history_messages is None: history_messages = [] + entity_extraction = kwargs.pop("entity_extraction", entity_extraction) result = await openai_complete_if_cache( "nvidia/llama-3.1-nemotron-70b-instruct", # context length 128k prompt, @@ -724,6 +829,7 @@ async def nvidia_openai_complete( history_messages=history_messages, enable_cot=enable_cot, keyword_extraction=keyword_extraction, + entity_extraction=entity_extraction, base_url="https://integrate.api.nvidia.com/v1", **kwargs, ) @@ -958,6 +1064,7 @@ async def azure_openai_complete( system_prompt=None, history_messages=None, keyword_extraction=False, + entity_extraction=False, **kwargs, ) -> str: """Azure OpenAI complete wrapper function. @@ -966,12 +1073,14 @@ async def azure_openai_complete( """ if history_messages is None: history_messages = [] + entity_extraction = kwargs.pop("entity_extraction", entity_extraction) result = await azure_openai_complete_if_cache( os.getenv("LLM_MODEL", "gpt-4o-mini"), prompt, system_prompt=system_prompt, history_messages=history_messages, keyword_extraction=keyword_extraction, + entity_extraction=entity_extraction, **kwargs, ) return result diff --git a/lightrag/llm/zhipu.py b/lightrag/llm/zhipu.py index e9d8f9973a..ab68e7d95e 100644 --- a/lightrag/llm/zhipu.py +++ b/lightrag/llm/zhipu.py @@ -1,6 +1,5 @@ import sys -import re -import json +import warnings from ..utils import verbose_debug if sys.version_info < (3, 9): @@ -30,8 +29,6 @@ logger, ) -from lightrag.types import GPTKeywordExtractionFormat - import numpy as np from typing import Union, List, Optional, Dict @@ -63,6 +60,11 @@ async def zhipu_complete_if_cache( - `enable_cot`: LightRAG-only formatting switch. When True and the API returns `reasoning_content`, it is preserved in the final string as `...`. 
+ - `response_format`: forwarded as Zhipu's OpenAI-compatible structured + output parameter when supplied by callers. + - Deprecated `keyword_extraction` and `entity_extraction` booleans are + compatibility shims; when no explicit `response_format` is supplied, + they are mapped to `{"type": "json_object"}`. """ # dynamically load ZhipuAI try: @@ -93,9 +95,42 @@ async def zhipu_complete_if_cache( logger.debug(f"Query: {prompt}") verbose_debug(f"System prompt: {system_prompt}") + # Deprecation shims: map legacy extraction booleans to response_format only + # when an explicit response_format was not supplied by the caller. The + # legacy path also forces enable_cot=False so reasoning_content cannot + # corrupt the JSON payload expected by callers relying on it. + keyword_extraction = kwargs.pop("keyword_extraction", False) + entity_extraction = kwargs.pop("entity_extraction", False) + if kwargs.get("response_format") is None: + if entity_extraction: + warnings.warn( + "zhipu_complete_if_cache(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs["response_format"] = {"type": "json_object"} + enable_cot = False + elif keyword_extraction: + warnings.warn( + "zhipu_complete_if_cache(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + kwargs["response_format"] = {"type": "json_object"} + enable_cot = False + + # Structured output and COT are mutually exclusive here because + # reasoning_content would corrupt the JSON payload expected by callers. 
+ if kwargs.get("response_format") is not None: + enable_cot = False + # Remove unsupported kwargs kwargs = { - k: v for k, v in kwargs.items() if k not in ["hashing_kv", "keyword_extraction"] + k: v + for k, v in kwargs.items() + if k not in ["hashing_kv", "keyword_extraction", "entity_extraction"] } # `thinking` is an official Zhipu request field. Example: # {"type": "enabled"} enables reasoning output on supported models. @@ -122,83 +157,54 @@ async def zhipu_complete( system_prompt=None, history_messages=[], keyword_extraction=False, + entity_extraction=False, enable_cot: bool = False, **kwargs, ): - # Pop keyword_extraction from kwargs to avoid passing it to zhipu_complete_if_cache + """Zhipu completion wrapper with LightRAG structured-output shims. + + Structured output note: + - This adapter accepts OpenAI-style ``response_format`` and forwards it to + Zhipu's compatible chat-completions API. + - Deprecated ``keyword_extraction`` and ``entity_extraction`` booleans are + compatibility shims; when no explicit ``response_format`` is supplied, + they are mapped to ``{"type": "json_object"}``. + """ + # Pop legacy extraction flags from kwargs to avoid passing them downstream. keyword_extraction = kwargs.pop("keyword_extraction", keyword_extraction) - - if keyword_extraction: - # Add a system prompt to guide the model to return JSON format - extraction_prompt = """You are a helpful assistant that extracts keywords from text. - Please analyze the content and extract two types of keywords: - 1. High-level keywords: Important concepts and main themes - 2. 
Low-level keywords: Specific details and supporting elements - - Return your response in this exact JSON format: - { - "high_level_keywords": ["keyword1", "keyword2"], - "low_level_keywords": ["keyword1", "keyword2", "keyword3"] - } - - Only return the JSON, no other text.""" - - # Combine with existing system prompt if any - if system_prompt: - system_prompt = f"{system_prompt}\n\n{extraction_prompt}" - else: - system_prompt = extraction_prompt - - try: - response = await zhipu_complete_if_cache( - prompt=prompt, - system_prompt=system_prompt, - history_messages=history_messages, - enable_cot=enable_cot, - **kwargs, + entity_extraction = kwargs.pop("entity_extraction", entity_extraction) + + # Deprecation shims: map legacy boolean flags to response_format only when + # an explicit response_format was not supplied by the caller. The legacy + # path also forces enable_cot=False so that reasoning_content cannot + # corrupt the JSON payload expected by callers that were relying on it. + if kwargs.get("response_format") is None: + if entity_extraction: + warnings.warn( + "zhipu_complete(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, ) - - # Try to parse as JSON - try: - data = json.loads(response) - return GPTKeywordExtractionFormat( - high_level_keywords=data.get("high_level_keywords", []), - low_level_keywords=data.get("low_level_keywords", []), - ) - except json.JSONDecodeError: - # If direct JSON parsing fails, try to extract JSON from text - match = re.search(r"\{[\s\S]*\}", response) - if match: - try: - data = json.loads(match.group()) - return GPTKeywordExtractionFormat( - high_level_keywords=data.get("high_level_keywords", []), - low_level_keywords=data.get("low_level_keywords", []), - ) - except json.JSONDecodeError: - pass - - # If all parsing fails, log warning and return empty format - logger.warning( - f"Failed to parse keyword extraction response: {response}" - ) - 
return GPTKeywordExtractionFormat( - high_level_keywords=[], low_level_keywords=[] - ) - except Exception as e: - logger.error(f"Error during keyword extraction: {str(e)}") - return GPTKeywordExtractionFormat( - high_level_keywords=[], low_level_keywords=[] + kwargs["response_format"] = {"type": "json_object"} + enable_cot = False + elif keyword_extraction: + warnings.warn( + "zhipu_complete(keyword_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, ) - else: - # For non-keyword-extraction, just return the raw response string - return await zhipu_complete_if_cache( - prompt=prompt, - system_prompt=system_prompt, - history_messages=history_messages, - enable_cot=enable_cot, - **kwargs, - ) + kwargs["response_format"] = {"type": "json_object"} + enable_cot = False + + return await zhipu_complete_if_cache( + prompt=prompt, + system_prompt=system_prompt, + history_messages=history_messages, + enable_cot=enable_cot, + **kwargs, + ) @wrap_embedding_func_with_attrs( diff --git a/lightrag/llm_roles.py b/lightrag/llm_roles.py new file mode 100644 index 0000000000..879da16e6c --- /dev/null +++ b/lightrag/llm_roles.py @@ -0,0 +1,572 @@ +"""LLM role registry, configuration types, and runtime mixin. + +LightRAG can route different stages of work (entity extraction, keyword +extraction, query, vlm) to distinct LLM bindings. This module owns the +static role registry (:data:`ROLES`), the per-role configuration +(:class:`RoleLLMConfig`), and the :class:`_RoleLLMMixin` that drives the +runtime: builder registration, wrapper rebuilding, hot config updates, +queue cleanup, and queue-status reporting. 
+""" + +from __future__ import annotations + +import asyncio +import inspect +from copy import deepcopy +from dataclasses import dataclass, field +from functools import partial +from typing import Any, Callable, Mapping + +from lightrag.utils import ( + get_env_value, + logger, + priority_limit_async_func_call, +) + + +def _optional_env_int(env_key: str) -> int | None: + return get_env_value(env_key, None, int, special_none=True) + + +@dataclass(frozen=True) +class RoleSpec: + """Static descriptor for a known LLM role. + + Adding a new role anywhere in LightRAG is a single-line edit: append a + ``RoleSpec`` to :data:`ROLES`. Every other component (env var loop in + ``api/config.py``, queue observability, role config update flow) iterates + this registry rather than hard-coding role names. + """ + + name: str + """Canonical lowercase role key (used in ``role_llm_configs`` dict and CLI/log output).""" + + env_prefix: str + """Uppercase prefix used by the API env-var layer, e.g. ``"EXTRACT"`` for + ``EXTRACT_LLM_BINDING`` / ``EXTRACT_MAX_ASYNC_LLM`` / ``EXTRACT_LLM_TIMEOUT``.""" + + queue_name: str + """Display name passed to ``priority_limit_async_func_call`` for log lines.""" + + +ROLES: tuple[RoleSpec, ...] = ( + RoleSpec("extract", "EXTRACT", "extract LLM func"), + RoleSpec("keyword", "KEYWORD", "keyword LLM func"), + RoleSpec("query", "QUERY", "query LLM func"), + RoleSpec("vlm", "VLM", "vlm LLM func"), +) +ROLE_NAMES: frozenset[str] = frozenset(spec.name for spec in ROLES) +ROLES_BY_NAME: dict[str, RoleSpec] = {spec.name: spec for spec in ROLES} + + +@dataclass +class RoleLLMConfig: + """Per-role LLM override accepted at :class:`LightRAG` init time. + + Any field left as ``None`` falls back to the corresponding base LLM + setting (``llm_model_func`` / ``llm_model_kwargs`` / ``llm_model_max_async`` + / ``default_llm_timeout``). 
When ``max_async`` is None at init and the + user did not pass a ``role_llm_configs`` entry for the role, the value is + additionally seeded from ``{ROLE_PREFIX}_MAX_ASYNC_LLM``. ``metadata`` seeds + runtime observability and role-builder context. + """ + + func: Callable[..., object] | None = None + kwargs: dict[str, Any] | None = None + max_async: int | None = None + timeout: int | None = None + metadata: dict[str, Any] | None = None + + +@dataclass +class _RoleLLMState: + """Runtime state for one role. Internal — not part of the public API.""" + + raw_func: Callable[..., object] + kwargs: dict[str, Any] | None + max_async: int | None + timeout: int | None + metadata: dict[str, Any] = field(default_factory=dict) + wrapped: Callable[..., object] | None = None + + +class _RoleLLMMixin: + """Mixin that owns the role LLM runtime on :class:`LightRAG`. + + Mixed into LightRAG only. Relies on attributes that the main class + initializes in ``__post_init__`` (``_role_llm_states``, ``_role_llm_builders``, + ``llm_model_func``, ``llm_model_kwargs``, ``llm_model_max_async``, + ``default_llm_timeout``, ``embedding_func``, ``rerank_model_func``). 
+ """ + + _SECRET_MARKERS = ( + "api_key", + "api-key", + "apikey", + "access_key", + "access-key", + "secret", + "token", + "credential", + "password", + "passphrase", + "pwd", + "auth", + "session", + ) + + @staticmethod + def _normalize_llm_role(role: str) -> str: + normalized = role.strip().lower() + if normalized not in ROLE_NAMES: + raise ValueError(f"Invalid LLM role: {role}") + return normalized + + def register_role_llm_builder( + self, + builder: Callable[ + [str, dict[str, Any]], tuple[Callable[..., object], dict[str, Any] | None] + ], + ) -> None: + """Register a runtime builder used by update_llm_role_config for binding/model updates.""" + self._llm_role_builder = builder + + def set_role_llm_metadata(self, role: str, **metadata: Any) -> None: + """Store role metadata used when rebuilding a role-specific LLM function.""" + role = self._normalize_llm_role(role) + state = self._role_llm_states[role] + for key, value in metadata.items(): + if value is None: + continue + state.metadata[key] = value + + @property + def role_llm_funcs(self) -> Mapping[str, Callable[..., object]]: + """Read-only mapping of role name → wrapped (queue-managed) LLM func.""" + return { + name: state.wrapped + for name, state in self._role_llm_states.items() + if state.wrapped is not None + } + + @property + def role_llm_kwargs(self) -> Mapping[str, dict[str, Any] | None]: + """Read-only mapping of role name → effective LLM kwargs (None means inherit base).""" + return {name: state.kwargs for name, state in self._role_llm_states.items()} + + def _get_effective_role_llm_kwargs(self, role: str) -> dict[str, Any]: + state = self._role_llm_states[self._normalize_llm_role(role)] + if state.kwargs is not None: + return state.kwargs + if state.metadata.get("is_cross_provider"): + return {} + return self.llm_model_kwargs + + def _get_effective_role_llm_timeout(self, role: str) -> int: + state = self._role_llm_states[self._normalize_llm_role(role)] + return state.timeout if state.timeout 
is not None else self.default_llm_timeout + + def _get_effective_role_llm_max_async(self, role: str) -> int: + state = self._role_llm_states[self._normalize_llm_role(role)] + return ( + state.max_async if state.max_async is not None else self.llm_model_max_async + ) + + def _wrap_llm_role_func( + self, + role_name: str, + raw_func: Callable[..., object], + max_async: int, + timeout: int, + model_kwargs: dict[str, Any], + ) -> Callable[..., object]: + spec = ROLES_BY_NAME[role_name] + return priority_limit_async_func_call( + max_async, + llm_timeout=timeout, + queue_name=spec.queue_name, + )( + partial( + raw_func, + hashing_kv=self.llm_response_cache, + **model_kwargs, + ) + ) + + def _rebuild_role_llm_funcs(self) -> None: + """Wrap each role's raw_func with its own priority queue. + + Base ``llm_model_func`` is intentionally NOT wrapped — concurrency + for the base function is enforced at the role layer (every code path + that calls an LLM goes through a role wrapper). + """ + for spec in ROLES: + self._rebuild_single_role_llm_func(spec.name) + + def _rebuild_single_role_llm_func(self, role: str) -> None: + role = self._normalize_llm_role(role) + state = self._role_llm_states[role] + state.wrapped = self._wrap_llm_role_func( + role, + state.raw_func, + self._get_effective_role_llm_max_async(role), + self._get_effective_role_llm_timeout(role), + self._get_effective_role_llm_kwargs(role), + ) + + async def _shutdown_llm_wrapper(self, wrapped_func: Callable[..., object]) -> None: + shutdown = getattr(wrapped_func, "shutdown", None) + if callable(shutdown): + await shutdown(graceful=True) + + def _schedule_retired_llm_queue_cleanup( + self, wrapped_func: Callable[..., object] | None + ) -> None: + if wrapped_func is None or not callable( + getattr(wrapped_func, "shutdown", None) + ): + return + + try: + loop = asyncio.get_running_loop() + except RuntimeError: + # The retired wrapper's queue and worker tasks are tied to the + # event loop that first used them. 
Spinning up a fresh loop via + # asyncio.run would either hang on queue.join() or touch + # primitives bound to a closed loop. Skip cleanup with a warning + # — call aupdate_llm_role_config() from an async context for + # deterministic shutdown. + logger.warning( + "update_llm_role_config: skipping retired LLM queue cleanup " + "because no event loop is running; call aupdate_llm_role_config() " + "from an async context for deterministic shutdown" + ) + return + + task = loop.create_task(self._shutdown_llm_wrapper(wrapped_func)) + self._retired_llm_queue_cleanup_tasks.add(task) + task.add_done_callback(self._finalize_retired_llm_queue_cleanup) + + def _finalize_retired_llm_queue_cleanup(self, task: asyncio.Task) -> None: + self._retired_llm_queue_cleanup_tasks.discard(task) + try: + task.result() + except asyncio.CancelledError: + pass + except Exception as e: + logger.warning(f"Retired LLM queue cleanup failed: {e}") + + async def wait_for_retired_llm_queues(self) -> None: + """Wait until all retired role LLM queues have drained and shut down. + + Cleanup failures are logged by ``_finalize_retired_llm_queue_cleanup`` + and intentionally swallowed here so callers can rely on this method + always returning once every retired wrapper has finished. 
+ """ + while self._retired_llm_queue_cleanup_tasks: + tasks = list(self._retired_llm_queue_cleanup_tasks) + await asyncio.gather(*tasks, return_exceptions=True) + + def _apply_llm_role_config_update( + self, + role: str, + *, + model_func: Callable[..., object] | None = None, + model_kwargs: dict[str, Any] | None = None, + max_async: int | None = None, + timeout: int | None = None, + binding: str | None = None, + model: str | None = None, + host: str | None = None, + api_key: str | None = None, + provider_options: dict[str, Any] | None = None, + ) -> Callable[..., object] | None: + role = self._normalize_llm_role(role) + state = self._role_llm_states[role] + old_wrapped = state.wrapped + + snapshot = _RoleLLMState( + raw_func=state.raw_func, + kwargs=deepcopy(state.kwargs), + max_async=state.max_async, + timeout=state.timeout, + metadata=deepcopy(state.metadata), + wrapped=state.wrapped, + ) + + try: + if model_func is not None and not callable(model_func): + raise TypeError("model_func must be callable") + + if model_kwargs is not None: + state.kwargs = model_kwargs + if max_async is not None: + state.max_async = max_async + if timeout is not None: + state.timeout = timeout + if model_func is not None: + state.raw_func = model_func + + metadata_updated = any( + value is not None + for value in (binding, model, host, api_key, provider_options) + ) + if binding is not None: + state.metadata["binding"] = binding + if model is not None: + state.metadata["model"] = model + if host is not None: + state.metadata["host"] = host + if api_key is not None: + state.metadata["api_key"] = api_key + if provider_options is not None: + state.metadata["provider_options"] = provider_options + if "base_binding" in state.metadata and "binding" in state.metadata: + state.metadata["is_cross_provider"] = ( + state.metadata["binding"] != state.metadata["base_binding"] + ) + + if metadata_updated: + builder = getattr(self, "_llm_role_builder", None) + if builder is None and model_func is 
None: + raise ValueError( + "Runtime role builder is not configured; provide model_func or register_role_llm_builder() first" + ) + if builder is not None: + built_func, built_kwargs = builder(role, state.metadata) + state.raw_func = built_func + if model_kwargs is None and built_kwargs is not None: + state.kwargs = built_kwargs + + self._rebuild_single_role_llm_func(role) + except Exception: + state.raw_func = snapshot.raw_func + state.kwargs = snapshot.kwargs + state.max_async = snapshot.max_async + state.timeout = snapshot.timeout + state.metadata = snapshot.metadata + state.wrapped = snapshot.wrapped + raise + + self._log_llm_role_config("updated", role=role) + return old_wrapped + + def update_llm_role_config( + self, + role: str, + *, + model_func: Callable[..., object] | None = None, + model_kwargs: dict[str, Any] | None = None, + max_async: int | None = None, + timeout: int | None = None, + binding: str | None = None, + model: str | None = None, + host: str | None = None, + api_key: str | None = None, + provider_options: dict[str, Any] | None = None, + ) -> None: + """ + Update a role-specific LLM configuration at runtime. + + Supports lightweight updates (kwargs/max_async/timeout/model_func) directly. + For binding/model/host/api_key/provider_options updates, a role builder must + be registered via register_role_llm_builder(). 
+ """ + old_wrapped = self._apply_llm_role_config_update( + role, + model_func=model_func, + model_kwargs=model_kwargs, + max_async=max_async, + timeout=timeout, + binding=binding, + model=model, + host=host, + api_key=api_key, + provider_options=provider_options, + ) + self._schedule_retired_llm_queue_cleanup(old_wrapped) + + async def aupdate_llm_role_config( + self, + role: str, + *, + model_func: Callable[..., object] | None = None, + model_kwargs: dict[str, Any] | None = None, + max_async: int | None = None, + timeout: int | None = None, + binding: str | None = None, + model: str | None = None, + host: str | None = None, + api_key: str | None = None, + provider_options: dict[str, Any] | None = None, + ) -> None: + """Async variant of update_llm_role_config that waits for queue cleanup. + + Blocking behavior: + This coroutine awaits a graceful shutdown of the retired role + wrapper's priority queue. The shutdown blocks on + ``queue.join()`` until every already-queued LLM call has been + executed (workers always call ``task_done()`` in ``finally``, + so in-flight requests are not cut off). + + The wait is bounded by ``max_task_duration`` of the retired + queue, which is computed as ``llm_timeout * 2 + 15`` seconds + (default ``180 * 2 + 15 = 375`` seconds, ~6 min 15 s). When + this bound is reached, the drain times out and the shutdown + falls through to forced cancellation: pending futures are + cancelled, the queue is cleared, workers are stopped. So this + method **never blocks indefinitely**, but with a deep backlog + of slow LLM calls it can take up to that bound to return, and + in-flight calls past the bound will be cancelled. + + If you need a non-blocking switch, use the sync + ``update_llm_role_config()`` (which schedules cleanup as a + background task) and await ``wait_for_retired_llm_queues()`` + separately when you want to confirm the old queue is gone. 
+ """ + old_wrapped = self._apply_llm_role_config_update( + role, + model_func=model_func, + model_kwargs=model_kwargs, + max_async=max_async, + timeout=timeout, + binding=binding, + model=model, + host=host, + api_key=api_key, + provider_options=provider_options, + ) + if old_wrapped is not None: + await self._shutdown_llm_wrapper(old_wrapped) + + @classmethod + def _is_secret_key(cls, key: str) -> bool: + lowered = key.lower() + return any(marker in lowered for marker in cls._SECRET_MARKERS) + + def _scrubbed_llm_metadata(self, metadata: dict[str, Any]) -> dict[str, Any]: + """Return a deep copy of ``metadata`` with auth-bearing fields removed. + + Auth-bearing fields are stripped entirely — not masked — because a + masked ``"***"`` carries no information for an external consumer + (operators already see ``binding`` / ``host`` to confirm a role is + configured). Stripping makes the invariant simple: anything that + appears in this output is safe to log, cache, ship over the wire. + + Components that legitimately need the raw secret (the role builder, + provider clients) read it directly off the private + ``_role_llm_states[role].metadata`` dict. + """ + + def scrub_value(value: Any) -> Any: + if isinstance(value, Mapping): + return { + key: scrub_value(inner_value) + for key, inner_value in value.items() + if not self._is_secret_key(str(key)) + } + if isinstance(value, list): + return [scrub_value(item) for item in value] + if isinstance(value, tuple): + return tuple(scrub_value(item) for item in value) + return deepcopy(value) + + return scrub_value(metadata) + + def get_llm_role_config(self, role: str | None = None) -> dict[str, Any]: + """Return effective role LLM runtime configuration (observability snapshot). + + Each role entry exposes ``binding`` / ``model`` / ``host`` at the top + level for convenience and again inside ``metadata`` as part of the + full runtime snapshot (which may contain extra builder-specific + keys). 
Auth-bearing fields (``api_key``, ``aws_secret_access_key``, + ``password``, …) are **stripped entirely** from ``metadata`` — this + method is intended for ``/health`` / WebUI / audit output and must + never leak credentials. There is no escape hatch; runtime components + that legitimately need the raw value read it from + ``_role_llm_states[role].metadata`` directly. + """ + + def role_config(role_name: str) -> dict[str, Any]: + state = self._role_llm_states[role_name] + metadata = self._scrubbed_llm_metadata(state.metadata) + return { + "binding": metadata.get("binding"), + "model": metadata.get("model"), + "host": metadata.get("host"), + "is_cross_provider": metadata.get("is_cross_provider", False), + "max_async": self._get_effective_role_llm_max_async(role_name), + "timeout": self._get_effective_role_llm_timeout(role_name), + "has_model_kwargs": state.kwargs is not None, + "metadata": metadata, + } + + if role is not None: + return role_config(self._normalize_llm_role(role)) + + return {spec.name: role_config(spec.name) for spec in ROLES} + + def _log_llm_role_config(self, reason: str, role: str | None = None) -> None: + """Log the sanitized role LLM runtime configuration.""" + if role is None: + configs = self.get_llm_role_config() + role_names = [spec.name for spec in ROLES] + logger.info(f"Role LLM Configuration ({reason}):") + else: + normalized_role = self._normalize_llm_role(role) + configs = {normalized_role: self.get_llm_role_config(normalized_role)} + role_names = [normalized_role] + logger.info(f"Role LLM Configuration ({reason}: {normalized_role}):") + + for role_name in role_names: + cfg = configs[role_name] + logger.info( + " - %s: %s/%s, host=%s, max_async=%s, timeout=%s", + role_name, + cfg["binding"], + cfg["model"], + cfg["host"], + cfg["max_async"], + cfg["timeout"], + ) + + async def _queue_status_for_func( + self, func: Callable[..., object] | None + ) -> dict[str, Any]: + if func is None: + return {"available": False} + get_stats = 
getattr(func, "get_queue_stats", None) + if not callable(get_stats): + return {"available": False} + stats = get_stats() + if inspect.isawaitable(stats): + stats = await stats + stats["available"] = True + return stats + + async def get_llm_queue_status(self, include_base: bool = True) -> dict[str, Any]: + """Return queue status for each role's wrapped LLM func. + + The base ``llm_model_func`` is no longer queue-wrapped, so it is not + reported here. ``include_base`` is kept for signature compatibility + but has no effect. + """ + del include_base # base is unwrapped — see docstring + + result: dict[str, Any] = {} + for spec in ROLES: + state = self._role_llm_states.get(spec.name) + result[spec.name] = await self._queue_status_for_func( + state.wrapped if state else None + ) + return result + + async def get_embedding_queue_status(self) -> dict[str, Any]: + """Return queue status for the wrapped embedding function.""" + return await self._queue_status_for_func( + self.embedding_func.func if self.embedding_func is not None else None + ) + + async def get_rerank_queue_status(self) -> dict[str, Any]: + """Return queue status for the wrapped rerank function.""" + return await self._queue_status_for_func(self.rerank_model_func) diff --git a/lightrag/multimodal_context.py b/lightrag/multimodal_context.py new file mode 100644 index 0000000000..62a45f0418 --- /dev/null +++ b/lightrag/multimodal_context.py @@ -0,0 +1,1028 @@ +"""Surrounding-context enrichment for native multimodal sidecars. + +See ``docs/NativeMultimodalSurroundingContextPlan-zh.md``. + +For each entry in ``drawings.json`` / ``tables.json`` / ``equations.json``, +this module locates the matching ````, +``…
`` / table ```` or +```` inside the *single* +``blocks.jsonl`` content row referenced by the entry's ``blockid``, then +extracts up to ``max_tokens`` of leading and trailing text from the same +row (without crossing block rows). + +Sidecar entries gain an optional ``surrounding`` field: + + { + "leading": "…", + "trailing": "…" + } + +with both halves capped at ``max_tokens`` tokens (default 2000). +Truncation prefers paragraph / sentence / clause boundaries (using the +recursive separator cascade from ``CHUNK_R_SEPARATORS`` / falling back +to :data:`lightrag.constants.DEFAULT_R_SEPARATORS`); only when a single +closest segment alone exceeds the budget does the splitter fall through +to a character-level binary search. + +Multimodal tags (````, ````, +``…
``) inside the candidate text are treated as atomic so +the splitter cannot cut a tag in half. For ``tables.json`` entries — +where the surrounding should describe text around the target table +without dragging other tables along — every ``…
`` is +removed from the candidate text *before* token counting and +segmentation, so the saved surrounding string and the tokens budgeted +against it stay in sync. For ``drawings.json`` / ``equations.json`` +entries the table tags are preserved when they fit; oversized JSON or +HTML tables are row-trimmed (tail rows for leading, head rows for +trailing) so the surrounding keeps the rows physically closest to the +target. + +Parser-internal identifiers (``id`` / ``path`` / ``src`` / ``refid``) are +stripped from the candidate text via +:func:`lightrag.chunk_schema.strip_internal_multimodal_markup_for_extraction` +**before** atomization and token-budgeted truncation. This mirrors the +treatment given to chunk content prior to entity extraction (see +``lightrag.operate._process_single_content``) and ensures the +multimodal analysis prompt never sees those internal markers. Cleaning +before truncation also guarantees the truncation point can never land +inside an ``id="…"`` attribute and leave a malformed tag the strip +regex would no longer recognize. + +Unlike the entity-extraction call site, the surrounding path invokes +the cleaner with ``keep_cite_tag=True``: parser-internal ``refid`` is +removed but the ```` wrapper is preserved so the +VLM/LLM can still tell a reference label apart from inline prose +(e.g. ``表1`` makes it obvious the visible +text "表1" denotes another table elsewhere in the document, rather +than appearing as an ordinary noun phrase). Note this only affects +``drawings.json`` / ``equations.json`` surroundings — ``tables.json`` +surroundings still drop all cite tags via :func:`remove_table_tags` +because the target-table analysis should not be steered by dangling +references to other tables. 
+""" + +from __future__ import annotations + +import json +import logging +import os +import re +from html import escape as html_escape +from html import unescape as html_unescape +from pathlib import Path +from lightrag.chunk_schema import strip_internal_multimodal_markup_for_extraction +from lightrag.constants import DEFAULT_R_SEPARATORS +from lightrag.table_markup import ( + TABLE_TAG_RE, + detect_table_format, + parse_table_tag, + serialize_html_rows, + split_html_rows, +) +from lightrag.utils import Tokenizer + +logger = logging.getLogger(__name__) + + +# --------------------------------------------------------------------------- +# Tag scanner — atomises a string into a list of ``(kind, text)`` pieces so +# the recursive splitter can treat ````, ```` +# and ``…
`` as indivisible. +# --------------------------------------------------------------------------- + +_MM_TAG_RE = re.compile( + r"]*/>" + r"|]*>.*?" + r"|]*>.*?
", + re.DOTALL, +) + +_TABLE_CITE_RE = re.compile( + r']*\btype\s*=\s*"table")[^>]*>.*?
', + re.DOTALL, +) + + +def _atomize(text: str) -> list[tuple[str, str]]: + """Split ``text`` into ``(kind, content)`` atoms. + + ``kind`` ∈ ``{"text", "drawing", "equation", "table"}``. + Concatenating all atom contents reproduces ``text`` verbatim. + """ + atoms: list[tuple[str, str]] = [] + pos = 0 + for match in _MM_TAG_RE.finditer(text): + if match.start() > pos: + atoms.append(("text", text[pos : match.start()])) + tag_text = match.group(0) + if tag_text.startswith(" re.Pattern[str]: + esc = re.escape(item_id) + return re.compile( + rf']*?\bid\s*=\s*"{esc}"[^>]*?/>', + re.DOTALL, + ) + + +def _table_pattern(item_id: str) -> re.Pattern[str]: + esc = re.escape(item_id) + return re.compile( + rf']*?\bid\s*=\s*"{esc}"[^>]*?>.*?' + rf'|]*\btype\s*=\s*"table")' + rf'(?=[^>]*\brefid\s*=\s*"{esc}")[^>]*>.*?', + re.DOTALL, + ) + + +def _equation_pattern(item_id: str) -> re.Pattern[str]: + esc = re.escape(item_id) + return re.compile( + rf']*?\bid\s*=\s*"{esc}"[^>]*?>.*?
', + re.DOTALL, + ) + + +def find_target_span( + kind: str, item_id: str, block_content: str +) -> tuple[int, int] | None: + """Locate the target multimodal marker with the given ``id`` inside + ``block_content``. + + Returns ``(start, end)`` character offsets into ``block_content``, or ``None`` if not found. + ``kind`` is the sidecar root key — ``"drawings"`` / ``"tables"`` / + ``"equations"``. + """ + if kind == "drawings": + pattern = _drawing_pattern(item_id) + elif kind == "tables": + pattern = _table_pattern(item_id) + elif kind == "equations": + pattern = _equation_pattern(item_id) + else: + return None + match = pattern.search(block_content) + if not match: + return None + return match.start(), match.end() + + +# --------------------------------------------------------------------------- +# Recursive splitter that respects multimodal tag atoms. +# --------------------------------------------------------------------------- + + +def _split_text_segment(text: str, separators: list[str]) -> tuple[list[str], int]: + """Split ``text`` using the first separator that produces more than one piece. + + Returns ``(segments, sep_index)`` where ``segments`` reproduces + ``text`` verbatim when concatenated and ``sep_index`` is the index + in ``separators`` of the separator that was used. When no listed + separator yields >1 piece the original string is returned as a + single-element list with ``sep_index = len(separators)`` — the + caller is responsible for any further char-level fallback. + + The separator is kept attached to the preceding segment so the + assembled accumulator preserves whitespace boundaries. 
+ """ + if not text: + return [text], len(separators) + for idx, sep in enumerate(separators): + if not sep: + continue + if sep in text: + parts = text.split(sep) + assembled: list[str] = [] + for j, part in enumerate(parts): + if j < len(parts) - 1: + assembled.append(part + sep) + else: + if part: + assembled.append(part) + if len(assembled) > 1: + return assembled, idx + return [text], len(separators) + + +def _count_tokens(tokenizer: Tokenizer, text: str) -> int: + if not text: + return 0 + return len(tokenizer.encode(text)) + + +def _char_trim_leading(text: str, max_tokens: int, tokenizer: Tokenizer) -> str: + """Drop characters from the head until the token count fits. + + Used as the final char-level fallback for the ``leading`` half — we + want to keep the *tail* of the text (closest to the target). + """ + if _count_tokens(tokenizer, text) <= max_tokens: + return text + lo, hi = 0, len(text) + while lo < hi: + mid = (lo + hi) // 2 + if _count_tokens(tokenizer, text[mid:]) <= max_tokens: + hi = mid + else: + lo = mid + 1 + return text[lo:] + + +def _char_trim_trailing(text: str, max_tokens: int, tokenizer: Tokenizer) -> str: + """Drop characters from the tail until the token count fits. + + Used as the final char-level fallback for the ``trailing`` half — we + keep the *head* (closest to the target). + """ + if _count_tokens(tokenizer, text) <= max_tokens: + return text + lo, hi = 0, len(text) + while lo < hi: + mid = (lo + hi + 1) // 2 + if _count_tokens(tokenizer, text[:mid]) <= max_tokens: + lo = mid + else: + hi = mid - 1 + return text[:lo] + + +# --------------------------------------------------------------------------- +# Row-aware table trimming for drawings / equations surrounding. +# --------------------------------------------------------------------------- + + +def _row_trim_table_leading( + tag_text: str, max_tokens: int, tokenizer: Tokenizer +) -> str | None: + """Return a smaller ``…
`` whose tail rows fit ``max_tokens``. + + For a JSON table, takes the last ``k`` rows (closest to the target) + such that the re-wrapped tag still fits. For an HTML table, takes + the last ``k`` ``<tr>``s with their wrapper context. Returns + ``None`` when no row-bounded trim fits. + """ + match = TABLE_TAG_RE.match(tag_text.strip()) + if not match: + return None + attrs = match.group("attrs") + body = match.group("body") + fmt = detect_table_format(attrs, body) + if fmt == "json": + parsed = parse_table_tag(tag_text) + if not parsed: + return None + attrs_str, rows = parsed + for k in range(len(rows) - 1, 0, -1): + candidate = ( + f"" + f"{json.dumps(rows[-k:], ensure_ascii=False)}" + f"
" + ) + if _count_tokens(tokenizer, candidate) <= max_tokens: + return candidate + return _char_fallback_json_table( + attrs_str, + json.dumps(rows[-1], ensure_ascii=False) if rows else body, + max_tokens, + tokenizer, + keep_tail=True, + ) + if fmt == "html": + rows = split_html_rows(body) + if not rows: + return None + for k in range(len(rows) - 1, 0, -1): + inner = serialize_html_rows(rows[-k:]) + candidate = f"{inner}
" + if _count_tokens(tokenizer, candidate) <= max_tokens: + return candidate + return _char_fallback_html_table( + attrs, + rows[-1][1] if rows else body, + max_tokens, + tokenizer, + keep_tail=True, + ) + return None + + +def _row_trim_table_trailing( + tag_text: str, max_tokens: int, tokenizer: Tokenizer +) -> str | None: + """Return a smaller ``…
`` whose head rows fit ``max_tokens``.""" + match = TABLE_TAG_RE.match(tag_text.strip()) + if not match: + return None + attrs = match.group("attrs") + body = match.group("body") + fmt = detect_table_format(attrs, body) + if fmt == "json": + parsed = parse_table_tag(tag_text) + if not parsed: + return None + attrs_str, rows = parsed + for k in range(len(rows) - 1, 0, -1): + candidate = ( + f"" + f"{json.dumps(rows[:k], ensure_ascii=False)}" + f"
" + ) + if _count_tokens(tokenizer, candidate) <= max_tokens: + return candidate + return _char_fallback_json_table( + attrs_str, + json.dumps(rows[0], ensure_ascii=False) if rows else body, + max_tokens, + tokenizer, + keep_tail=False, + ) + if fmt == "html": + rows = split_html_rows(body) + if not rows: + return None + for k in range(len(rows) - 1, 0, -1): + inner = serialize_html_rows(rows[:k]) + candidate = f"{inner}
" + if _count_tokens(tokenizer, candidate) <= max_tokens: + return candidate + return _char_fallback_html_table( + attrs, + rows[0][1] if rows else body, + max_tokens, + tokenizer, + keep_tail=False, + ) + return None + + +def _empty_table(attrs: str) -> str: + return f"
" + + +def _char_fallback_json_table( + attrs: str, + source_text: str, + max_tokens: int, + tokenizer: Tokenizer, + *, + keep_tail: bool, +) -> str | None: + """Fit one oversized JSON table row while keeping a valid table tag. + + The fallback stores the truncated serialized row text as a JSON string + inside a one-row table. That preserves JSON validity and keeps the + closest side of the oversized row when no complete row can fit. + """ + empty = _empty_table(attrs) + if _count_tokens(tokenizer, empty) > max_tokens: + return None + + def candidate(chars: int) -> str: + snippet = source_text[-chars:] if keep_tail and chars else source_text[:chars] + if not chars: + return empty + body = json.dumps([[snippet]], ensure_ascii=False) + return f"{body}
" + + if _count_tokens(tokenizer, candidate(len(source_text))) <= max_tokens: + return candidate(len(source_text)) + + lo, hi = 0, len(source_text) + while lo < hi: + mid = (lo + hi + 1) // 2 + if _count_tokens(tokenizer, candidate(mid)) <= max_tokens: + lo = mid + else: + hi = mid - 1 + return candidate(lo) + + +def _char_fallback_html_table( + attrs: str, + row_html: str, + max_tokens: int, + tokenizer: Tokenizer, + *, + keep_tail: bool, +) -> str | None: + """Fit one oversized HTML row without emitting broken table markup.""" + empty = _empty_table(attrs) + if _count_tokens(tokenizer, empty) > max_tokens: + return None + + text = html_unescape(re.sub(r"<[^>]+>", "", row_html or "")) + + def candidate(chars: int) -> str: + snippet = text[-chars:] if keep_tail and chars else text[:chars] + if not chars: + return empty + return f"
{html_escape(snippet)}
" + + if _count_tokens(tokenizer, candidate(len(text))) <= max_tokens: + return candidate(len(text)) + + lo, hi = 0, len(text) + while lo < hi: + mid = (lo + hi + 1) // 2 + if _count_tokens(tokenizer, candidate(mid)) <= max_tokens: + lo = mid + else: + hi = mid - 1 + return candidate(lo) + + +def remove_table_tags(text: str) -> str: + """Strip every table marker from ``text``. + + Used to pre-clean candidate text for ``tables.json`` surroundings: + we never include sibling tables, so they must be dropped *before* + token counting and segmentation so the budget matches the persisted + string exactly. + """ + return _TABLE_CITE_RE.sub("", TABLE_TAG_RE.sub("", text)) + + +# --------------------------------------------------------------------------- +# Core leading / trailing builders. +# --------------------------------------------------------------------------- + + +def _build_leading( + source: str, + *, + kind: str, + tokenizer: Tokenizer, + max_tokens: int, + separators: list[str], +) -> str: + """Build the ``leading`` half: suffix of ``source`` within budget. + + ``source`` is cleaned via + :func:`lightrag.chunk_schema.strip_internal_multimodal_markup_for_extraction` + *before* atomization and token-budgeted accumulation, so parser-internal + identifiers (``id`` / ``path`` / ``src`` / ``refid``) never reach the + accumulated output and the token budget reflects what the LLM actually + sees. Cleaning before truncation also prevents a truncation point from + landing inside an ``id="…"`` attribute and producing a malformed tag + that the strip regex would no longer recognize. 
+ """ + if not source or max_tokens <= 0: + return "" + if kind == "tables": + source = remove_table_tags(source) + if not source: + return "" + source = strip_internal_multimodal_markup_for_extraction(source, keep_cite_tag=True) + if not source: + return "" + accumulated = "" + atoms = _atomize(source) + for atom_idx in range(len(atoms) - 1, -1, -1): + atom_kind, atom_text = atoms[atom_idx] + if not atom_text: + continue + if atom_kind in {"drawing", "equation"}: + candidate = atom_text + accumulated + if _count_tokens(tokenizer, candidate) <= max_tokens: + accumulated = candidate + continue + break + if atom_kind == "table": + # Only reached for drawings/equations surroundings — table + # tags are pre-stripped for the ``tables`` kind above. + candidate = atom_text + accumulated + if _count_tokens(tokenizer, candidate) <= max_tokens: + accumulated = candidate + continue + remaining = max_tokens - _count_tokens(tokenizer, accumulated) + if remaining > 0: + trimmed = _row_trim_table_leading(atom_text, remaining, tokenizer) + if trimmed is not None: + accumulated = trimmed + accumulated + break + # Plain text atom — segment with separator cascade and accumulate + # from the right. + addition = _accumulate_text_leading( + atom_text, + existing=accumulated, + tokenizer=tokenizer, + max_tokens=max_tokens, + separators=separators, + ) + if addition is None: + # Even a partial fit was not possible; we stop here. + break + accumulated = addition + accumulated + if _count_tokens(tokenizer, accumulated) >= max_tokens: + break + return accumulated + + +def _accumulate_text_leading( + text: str, + *, + existing: str, + tokenizer: Tokenizer, + max_tokens: int, + separators: list[str], +) -> str | None: + """Add as much of ``text`` (suffix) as fits into the remaining budget. + + Returns the chunk to prepend to ``existing``, or ``None`` to signal + "stop walking earlier atoms" (i.e. budget exhausted with no useful + addition). 
+ """ + segments, sep_idx = _split_text_segment(text, separators) + if not segments: + return None + # Try to add whole segments from the right. ``buf`` is what we will + # prepend to ``existing``. + buf = "" + for i in range(len(segments) - 1, -1, -1): + candidate = segments[i] + buf + # Total tokens once we prepend ``candidate`` to ``existing``. + if _count_tokens(tokenizer, candidate + existing) <= max_tokens: + buf = candidate + continue + # Cannot fit segment ``i`` whole. Two cases: + if buf: + # We already added at least one segment — stop here without + # char-truncating a more-distant segment. + return buf + # ``buf`` is empty: the closest segment alone overflows. Recurse + # into the next separator level so we try a finer split before + # falling back to characters. + weaker = separators[sep_idx + 1 :] if sep_idx < len(separators) else [] + if weaker: + return _accumulate_text_leading( + segments[i], + existing=existing, + tokenizer=tokenizer, + max_tokens=max_tokens, + separators=weaker, + ) + # Char-level fallback: take the longest suffix of this segment + # that fits the remaining budget. + remaining = max_tokens - _count_tokens(tokenizer, existing) + if remaining <= 0: + return None + trimmed = _char_trim_leading(segments[i], remaining, tokenizer) + return trimmed if trimmed else None + return buf if buf else None + + +def _build_trailing( + source: str, + *, + kind: str, + tokenizer: Tokenizer, + max_tokens: int, + separators: list[str], +) -> str: + """Build the ``trailing`` half: prefix of ``source`` within budget. + + See :func:`_build_leading` for the rationale behind stripping + parser-internal markers *before* atomization and truncation. 
+ """ + if not source or max_tokens <= 0: + return "" + if kind == "tables": + source = remove_table_tags(source) + if not source: + return "" + source = strip_internal_multimodal_markup_for_extraction(source, keep_cite_tag=True) + if not source: + return "" + accumulated = "" + atoms = _atomize(source) + for atom_kind, atom_text in atoms: + if not atom_text: + continue + if atom_kind in {"drawing", "equation"}: + candidate = accumulated + atom_text + if _count_tokens(tokenizer, candidate) <= max_tokens: + accumulated = candidate + continue + break + if atom_kind == "table": + candidate = accumulated + atom_text + if _count_tokens(tokenizer, candidate) <= max_tokens: + accumulated = candidate + continue + remaining = max_tokens - _count_tokens(tokenizer, accumulated) + if remaining > 0: + trimmed = _row_trim_table_trailing(atom_text, remaining, tokenizer) + if trimmed is not None: + accumulated = accumulated + trimmed + break + addition = _accumulate_text_trailing( + atom_text, + existing=accumulated, + tokenizer=tokenizer, + max_tokens=max_tokens, + separators=separators, + ) + if addition is None: + break + accumulated = accumulated + addition + if _count_tokens(tokenizer, accumulated) >= max_tokens: + break + return accumulated + + +def _accumulate_text_trailing( + text: str, + *, + existing: str, + tokenizer: Tokenizer, + max_tokens: int, + separators: list[str], +) -> str | None: + segments, sep_idx = _split_text_segment(text, separators) + if not segments: + return None + buf = "" + for i, seg in enumerate(segments): + candidate = buf + seg + if _count_tokens(tokenizer, existing + candidate) <= max_tokens: + buf = candidate + continue + if buf: + return buf + weaker = separators[sep_idx + 1 :] if sep_idx < len(separators) else [] + if weaker: + return _accumulate_text_trailing( + seg, + existing=existing, + tokenizer=tokenizer, + max_tokens=max_tokens, + separators=weaker, + ) + remaining = max_tokens - _count_tokens(tokenizer, existing) + if remaining <= 0: 
+ return None + trimmed = _char_trim_trailing(seg, remaining, tokenizer) + return trimmed if trimmed else None + return buf if buf else None + + +# --------------------------------------------------------------------------- +# Public entrypoints. +# --------------------------------------------------------------------------- + + +def load_chunk_separators() -> list[str]: + """Resolve the recursive-character separator cascade. + + Reads ``CHUNK_R_SEPARATORS`` and falls back to + :data:`lightrag.constants.DEFAULT_R_SEPARATORS` on missing / invalid + JSON. The returned list always has the empty-string sentinel + dropped — char fallback is signalled separately by the caller. + """ + raw = os.getenv("CHUNK_R_SEPARATORS") + separators: list[str] + if raw: + try: + parsed = json.loads(raw) + if isinstance(parsed, list) and all(isinstance(s, str) for s in parsed): + separators = parsed + else: + separators = list(DEFAULT_R_SEPARATORS) + except json.JSONDecodeError: + separators = list(DEFAULT_R_SEPARATORS) + else: + separators = list(DEFAULT_R_SEPARATORS) + return [s for s in separators if s] + + +def load_content_rows_by_blockid(blocks_path: str) -> dict[str, str]: + """Read ``blocks.jsonl`` and return ``{blockid: content_str}``. + + Only ``type == "content"`` rows are kept. When the same blockid + appears multiple times, the first occurrence wins. 
+ """ + rows: dict[str, str] = {} + path = Path(blocks_path) + if not path.exists(): + return rows + with path.open("r", encoding="utf-8") as fh: + for line in fh: + line = line.strip() + if not line: + continue + try: + obj = json.loads(line) + except json.JSONDecodeError: + continue + if not isinstance(obj, dict): + continue + if obj.get("type") != "content": + continue + blockid = obj.get("blockid") + if not isinstance(blockid, str) or not blockid: + continue + if blockid in rows: + continue + content = obj.get("content") + if isinstance(content, str): + rows[blockid] = content + return rows + + +DEFAULT_SURROUNDING_MAX_TOKENS = 2000 + + +def _resolve_surrounding_budget( + leading_max_tokens: int | None, + trailing_max_tokens: int | None, +) -> tuple[int, int]: + """Resolve per-half token budgets, defaulting to env vars then 2000. + + Reads ``SURROUNDING_LEADING_MAX_TOKENS`` / ``SURROUNDING_TRAILING_MAX_TOKENS`` + when the caller passes ``None``. Invalid env values fall back to + :data:`DEFAULT_SURROUNDING_MAX_TOKENS`. + """ + + def _from_env(env_var: str) -> int: + raw = os.getenv(env_var) + if raw is None or not raw.strip(): + return DEFAULT_SURROUNDING_MAX_TOKENS + try: + value = int(raw) + except ValueError: + logger.warning( + "[multimodal_context] invalid %s=%r; falling back to %d", + env_var, + raw, + DEFAULT_SURROUNDING_MAX_TOKENS, + ) + return DEFAULT_SURROUNDING_MAX_TOKENS + return max(0, value) + + leading = ( + leading_max_tokens + if leading_max_tokens is not None + else _from_env("SURROUNDING_LEADING_MAX_TOKENS") + ) + trailing = ( + trailing_max_tokens + if trailing_max_tokens is not None + else _from_env("SURROUNDING_TRAILING_MAX_TOKENS") + ) + return leading, trailing + + +_CONTENT_TRUNCATION_MARKER = ( + "\n" +) + + +def trim_content_to_budget( + content: str, + *, + kind: str, + max_tokens: int, + tokenizer: Tokenizer | None, +) -> tuple[str, bool]: + """Trim sidecar ``content`` to fit within ``max_tokens``, preserving the head. 
+ + Used by ``analyze_multimodal`` to keep the EXTRACT-role prompt within + :data:`lightrag.constants.DEFAULT_MAX_EXTRACT_INPUT_TOKENS`. Only ``content`` + is compressed — surrounding/captions/footnotes already have their own caps + and the prompt template is fixed. + + Strategy: + - ``tables`` (``<table>…</table>`` wrapped): row-aware trim via + :func:`_row_trim_table_trailing` (keep head rows / first k <tr>); + falls back to ``_char_fallback_*`` (still ``<table>``-wrapped) when + no single row fits. Non-``<table>`` content falls through to char + trim from the tail. + - ``equations`` / other: :func:`_char_trim_trailing` (keep head chars). + + A trailing HTML-comment marker is appended *outside* the ``<table>`` + wrapper (when trimmed) so the LLM knows the body is incomplete. The + marker is included in the token budget. + + Returns ``(possibly_trimmed_content, was_trimmed)``. When + ``max_tokens <= 0`` or ``tokenizer is None`` the input is returned + unchanged with ``was_trimmed=False``. + """ + if not content or tokenizer is None or max_tokens <= 0: + return content, False + original_tokens = _count_tokens(tokenizer, content) + if original_tokens <= max_tokens: + return content, False + + # Reserve token room for the truncation marker before trimming. + marker_probe = _CONTENT_TRUNCATION_MARKER.format( + original=original_tokens, final=max_tokens + ) + marker_tokens = _count_tokens(tokenizer, marker_probe) + inner_budget = max(0, max_tokens - marker_tokens) + + trimmed_inner: str | None = None + if kind == "tables" and TABLE_TAG_RE.match(content.strip()): + # _row_trim_table_trailing keeps head rows and internally falls back + # to char-level fits while preserving the
wrapper. Only + # malformed / unrecognized-format markup returns None. + trimmed_inner = _row_trim_table_trailing(content, inner_budget, tokenizer) + if trimmed_inner is None: + trimmed_inner = _char_trim_trailing(content, inner_budget, tokenizer) + + final_tokens = _count_tokens(tokenizer, trimmed_inner) + marker = _CONTENT_TRUNCATION_MARKER.format( + original=original_tokens, final=final_tokens + ) + return trimmed_inner + marker, True + + +def build_surrounding( + *, + kind: str, + block_content: str, + span: tuple[int, int], + tokenizer: Tokenizer, + leading_max_tokens: int, + trailing_max_tokens: int, + separators: list[str], +) -> dict[str, str]: + """Compute ``{"leading": …, "trailing": …}`` for one sidecar entry. + + ``leading_max_tokens`` and ``trailing_max_tokens`` are independent + per-half caps so deployments can tune the two contexts separately + via ``SURROUNDING_LEADING_MAX_TOKENS`` / ``SURROUNDING_TRAILING_MAX_TOKENS``. + + The returned strings have parser-internal markers (``id`` / ``path`` + / ``src`` / ``refid``) stripped — the cleaning happens before + token-budgeted truncation inside :func:`_build_leading` / + :func:`_build_trailing`, so the budget reflects the LLM-visible + content and truncation cannot leave malformed tags behind. 
+ """ + start, end = span + leading_src = block_content[:start] + trailing_src = block_content[end:] + leading = _build_leading( + leading_src, + kind=kind, + tokenizer=tokenizer, + max_tokens=leading_max_tokens, + separators=separators, + ) + trailing = _build_trailing( + trailing_src, + kind=kind, + tokenizer=tokenizer, + max_tokens=trailing_max_tokens, + separators=separators, + ) + return {"leading": leading, "trailing": trailing} + + +def enrich_sidecars_with_surrounding( + *, + blocks_path: str, + enabled_modalities: set[str], + tokenizer: Tokenizer, + leading_max_tokens: int | None = None, + trailing_max_tokens: int | None = None, + separators: list[str] | None = None, +) -> dict[str, int]: + """Backfill ``surrounding`` on enabled-modality sidecars. + + Args: + blocks_path: path to the ``…blocks.jsonl`` artifact. + enabled_modalities: subset of ``{"drawings", "tables", + "equations"}`` reflecting the document's ``process_options``. + tokenizer: tokenizer used to enforce the per-half token budget. + leading_max_tokens: leading-half cap. ``None`` reads + ``SURROUNDING_LEADING_MAX_TOKENS`` (default 2000). + trailing_max_tokens: trailing-half cap. ``None`` reads + ``SURROUNDING_TRAILING_MAX_TOKENS`` (default 2000). + separators: explicit separator cascade. Defaults to the cascade + resolved from ``CHUNK_R_SEPARATORS`` (or + ``DEFAULT_R_SEPARATORS``). + + Returns: + ``{modality: updated_entries}`` for diagnostics. Modalities + without a sidecar on disk are silently skipped (consistent with + the rest of the multimodal pipeline). 
+ """ + counts = {"drawings": 0, "tables": 0, "equations": 0} + if not enabled_modalities: + return counts + + blocks_file = Path(blocks_path) + if not blocks_file.exists(): + return counts + + content_by_blockid = load_content_rows_by_blockid(blocks_path) + if separators is None: + separators = load_chunk_separators() + + leading_tokens, trailing_tokens = _resolve_surrounding_budget( + leading_max_tokens, trailing_max_tokens + ) + + base = str(blocks_file) + if base.endswith(".blocks.jsonl"): + base = base[: -len(".blocks.jsonl")] + + for root_key in ("drawings", "tables", "equations"): + if root_key not in enabled_modalities: + continue + sidecar_path = Path(base + f".{root_key}.json") + if not sidecar_path.exists(): + continue + try: + payload = json.loads(sidecar_path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError) as exc: + logger.warning( + "[multimodal_context] failed to read %s: %s", + sidecar_path, + exc, + ) + continue + items = payload.get(root_key) + if not isinstance(items, dict): + continue + + updated = 0 + for item_id, item in items.items(): + if not isinstance(item, dict): + continue + blockid = item.get("blockid") + if not isinstance(blockid, str) or not blockid: + continue + block_content = content_by_blockid.get(blockid) + if block_content is None: + continue + span = find_target_span(root_key, item_id, block_content) + if span is None: + logger.debug( + "[multimodal_context] %s/%s: id not found in block %s", + root_key, + item_id, + blockid, + ) + continue + surrounding = build_surrounding( + kind=root_key, + block_content=block_content, + span=span, + tokenizer=tokenizer, + leading_max_tokens=leading_tokens, + trailing_max_tokens=trailing_tokens, + separators=separators, + ) + item["surrounding"] = surrounding + updated += 1 + + counts[root_key] = updated + try: + sidecar_path.write_text( + json.dumps(payload, ensure_ascii=False, indent=2), + encoding="utf-8", + ) + except OSError as exc: + logger.warning( + 
"[multimodal_context] failed to write %s: %s", + sidecar_path, + exc, + ) + continue + logger.debug( + "[multimodal_context] %s: surrounding written for %d entries", + root_key, + updated, + ) + + return counts + + +__all__ = [ + "DEFAULT_SURROUNDING_MAX_TOKENS", + "build_surrounding", + "enrich_sidecars_with_surrounding", + "find_target_span", + "load_chunk_separators", + "load_content_rows_by_blockid", + "remove_table_tags", + "trim_content_to_budget", +] diff --git a/lightrag/native_parser/docx/__init__.py b/lightrag/native_parser/docx/__init__.py new file mode 100644 index 0000000000..d53b100107 --- /dev/null +++ b/lightrag/native_parser/docx/__init__.py @@ -0,0 +1,14 @@ +"""LightRAG native DOCX parser package. + +The :mod:`parse_document` / :mod:`numbering_resolver` / :mod:`table_extractor` / +:mod:`drawing_image_extractor` / :mod:`utils` / :mod:`omml` modules ship the +upstream DOCX extraction logic verbatim (with imports localized for the new +package path). + +The pipeline-side orchestration (extract → IR → sidecar) now lives in +:meth:`lightrag.pipeline._PipelineMixin.parse_native` so the native and +MinerU engines share one shape; see :mod:`lightrag.native_parser.docx.ir_builder` +for the engine IR builder. 
+""" + +__all__: list[str] = [] diff --git a/lightrag/native_parser/docx/drawing_image_extractor.py b/lightrag/native_parser/docx/drawing_image_extractor.py new file mode 100644 index 0000000000..de800b012b --- /dev/null +++ b/lightrag/native_parser/docx/drawing_image_extractor.py @@ -0,0 +1,445 @@ +#!/usr/bin/env python3 +""" +ABOUTME: Shared drawing/image extraction utilities for DOCX parsing and editing +ABOUTME: Resolves w:drawing -> a:blip relationships, exports embedded images, builds placeholders +""" + +from __future__ import annotations + +import posixpath +import re +import shutil +import zipfile +from dataclasses import dataclass, field +from html import escape, unescape +from pathlib import Path, PurePosixPath +from typing import Dict, Optional, Tuple +from urllib.parse import urlparse + +try: + from defusedxml import ElementTree as ET +except ImportError: # pragma: no cover + from xml.etree import ElementTree as ET + + +NS = { + "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main", + "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing", + "a": "http://schemas.openxmlformats.org/drawingml/2006/main", + "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships", + "v": "urn:schemas-microsoft-com:vml", +} + +REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships" +CONTENT_TYPE_NS = "http://schemas.openxmlformats.org/package/2006/content-types" +IMAGE_REL_TYPE = ( + "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" +) +SOURCE_DOCUMENT_PART = "/word/document.xml" + +# Match old and new drawing placeholders (requires id/name, allows extra attributes) +DRAWING_PATTERN = re.compile( + r']*\bid="[^"]*")(?=[^>]*\bname="[^"]*")[^>]*/>' +) +DRAWING_TAG_PATTERN = re.compile(r"]*/>") +DRAWING_ATTR_PATTERN = re.compile(r'([a-zA-Z_][\w:.-]*)="([^"]*)"') + + +@dataclass +class DrawingRelationship: + """Relationship metadata for a single relationship ID.""" + + rel_id: str 
+ target: str + target_mode: str + rel_type: str + part_name: Optional[str] = None + content_type: Optional[str] = None + image_format: Optional[str] = None + + +@dataclass +class DrawingExtractionContext: + """Context used to resolve and export drawing images for one DOCX file.""" + + docx_path: Path + blocks_output_path: Optional[Path] = None + export_dir_name: Optional[str] = None + export_dir_path: Optional[Path] = None + relationships: Dict[str, DrawingRelationship] = field(default_factory=dict) + _exported_part_to_relpath: Dict[str, str] = field(default_factory=dict) + _used_filenames: Dict[str, str] = field(default_factory=dict) + + def resolve_relationship(self, rel_id: str) -> Optional[DrawingRelationship]: + return self.relationships.get(rel_id) + + def export_embedded_image(self, rel: DrawingRelationship) -> Optional[str]: + """ + Export an embedded image relationship target to export_dir. + + Returns: + Relative path like ".image/image1.png" if exported, + or None when export is not applicable. 
+ """ + if not self.export_dir_path or not self.export_dir_name: + return None + if rel.target_mode.lower() == "external": + return None + if not rel.part_name: + return None + if rel.part_name in self._exported_part_to_relpath: + return self._exported_part_to_relpath[rel.part_name] + + zip_member = rel.part_name.lstrip("/") + try: + with zipfile.ZipFile(self.docx_path, "r") as zf: + blob = zf.read(zip_member) + except Exception: + return None + + filename = self._dedupe_filename(PurePosixPath(rel.part_name).name or "image") + output_file = self.export_dir_path / filename + output_file.write_bytes(blob) + + rel_path = str(PurePosixPath(self.export_dir_name) / filename) + self._exported_part_to_relpath[rel.part_name] = rel_path + return rel_path + + def _dedupe_filename(self, base_name: str) -> str: + if base_name not in self._used_filenames: + self._used_filenames[base_name] = base_name + return base_name + + stem = Path(base_name).stem + suffix = Path(base_name).suffix + index = 2 + while True: + candidate = f"{stem}_{index}{suffix}" + if candidate not in self._used_filenames: + self._used_filenames[candidate] = candidate + return candidate + index += 1 + + +def _normalize_image_format(ext_or_type: str) -> Optional[str]: + if not ext_or_type: + return None + value = ext_or_type.strip().lower() + + # Content-Type + if value.startswith("image/"): + value = value.split("/", 1)[1] + if "+" in value: + value = value.split("+", 1)[0] + if value.startswith("x-"): + value = value[2:] + + # Extension (with or without leading dot) + value = value.lstrip(".") + if value == "jpg": + return "jpeg" + if value in {"jpeg", "png", "gif", "bmp", "tiff", "webp", "svg", "emf", "wmf"}: + return value + return value or None + + +def _infer_format_from_target(target: str) -> Optional[str]: + if not target: + return None + parsed = urlparse(target) + path = parsed.path if parsed.scheme else target + suffix = PurePosixPath(path).suffix + return _normalize_image_format(suffix) + + +def 
_resolve_part_name(source_part_name: str, target: str) -> str: + if target.startswith("/"): + return posixpath.normpath(target) + source_dir = posixpath.dirname(source_part_name) + joined = posixpath.join(source_dir, target) + normalized = posixpath.normpath(joined) + if not normalized.startswith("/"): + normalized = "/" + normalized + return normalized + + +def create_drawing_context( + docx_path: str, + blocks_output_path: Optional[str] = None, +) -> DrawingExtractionContext: + """ + Create extraction context for a DOCX file. + + If blocks_output_path is provided, this also prepares `.image/` + beside the blocks file and clears any previous content. + """ + docx_file = Path(docx_path) + ctx = DrawingExtractionContext(docx_path=docx_file) + + if blocks_output_path: + output_path = Path(blocks_output_path) + export_dir_name = f"{output_path.stem}.image" + export_dir_path = output_path.parent / export_dir_name + if export_dir_path.exists(): + shutil.rmtree(export_dir_path) + export_dir_path.mkdir(parents=True, exist_ok=True) + ctx.blocks_output_path = output_path + ctx.export_dir_name = export_dir_name + ctx.export_dir_path = export_dir_path + + load_relationships(ctx) + return ctx + + +def load_relationships(ctx: DrawingExtractionContext) -> None: + rels_xml = "word/_rels/document.xml.rels" + content_types_xml = "[Content_Types].xml" + + overrides: Dict[str, str] = {} + defaults: Dict[str, str] = {} + + try: + with zipfile.ZipFile(ctx.docx_path, "r") as zf: + if content_types_xml in zf.namelist(): + ct_root = ET.parse(zf.open(content_types_xml)).getroot() + for node in ct_root.findall(f".//{{{CONTENT_TYPE_NS}}}Override"): + part_name = node.get("PartName") + content_type = node.get("ContentType") + if part_name and content_type: + overrides[part_name] = content_type + for node in ct_root.findall(f".//{{{CONTENT_TYPE_NS}}}Default"): + ext = node.get("Extension") + content_type = node.get("ContentType") + if ext and content_type: + defaults[ext.lower()] = 
content_type + + if rels_xml not in zf.namelist(): + return + rels_root = ET.parse(zf.open(rels_xml)).getroot() + except Exception: + return + + for rel in rels_root.findall(f".//{{{REL_NS}}}Relationship"): + rel_id = rel.get("Id") + target = rel.get("Target", "") + target_mode = rel.get("TargetMode", "") + rel_type = rel.get("Type", "") + if not rel_id: + continue + + part_name = None + content_type = None + image_format = None + + if target_mode.lower() != "external": + part_name = _resolve_part_name(SOURCE_DOCUMENT_PART, target) + if part_name: + content_type = overrides.get(part_name) + if not content_type: + ext = PurePosixPath(part_name).suffix.lower().lstrip(".") + content_type = defaults.get(ext) + image_format = _normalize_image_format( + content_type or _infer_format_from_target(part_name) + ) + else: + image_format = _normalize_image_format(_infer_format_from_target(target)) + + ctx.relationships[rel_id] = DrawingRelationship( + rel_id=rel_id, + target=target, + target_mode=target_mode, + rel_type=rel_type, + part_name=part_name, + content_type=content_type, + image_format=image_format, + ) + + +def _extract_blip_relationship(drawing_elem) -> Optional[Tuple[str, str]]: + for blip in drawing_elem.findall(".//a:blip", NS): + # Prefer explicit external links when both link/embed are present on one blip. + # Word may keep an embedded cache for linked pictures. + rel_link = blip.get(f"{{{NS['r']}}}link") + if rel_link: + return "link", rel_link + rel_embed = blip.get(f"{{{NS['r']}}}embed") + if rel_embed: + return "embed", rel_embed + return None + + +def _extract_imagedata_relationship(container_elem) -> Optional[str]: + """Find an image relationship id from a w:pict / w:object via v:imagedata. + + These legacy VML containers are how Word references EMF/WMF metafiles + (and the rendered preview of any embedded OLE object). 
v:imagedata uses + ``r:id`` to point at the image part for both embedded and externally + linked images — the relationship's ``TargetMode`` is what disambiguates + the two cases, so the caller must inspect the resolved relationship. + """ + r_id_attr = f"{{{NS['r']}}}id" + for imgdata in container_elem.findall(".//v:imagedata", NS): + rel_id = imgdata.get(r_id_attr) + if rel_id: + return rel_id + return None + + +def _build_placeholder(attrs: Dict[str, str]) -> str: + ordered_keys = ["id", "name", "path", "format"] + pieces = [] + for key in ordered_keys: + if key in attrs and attrs[key] is not None: + pieces.append(f'{key}="{escape(str(attrs[key]), quote=True)}"') + + # Preserve extra attributes deterministically (sorted by name) + for key in sorted(k for k in attrs.keys() if k not in ordered_keys): + value = attrs[key] + if value is not None: + pieces.append(f'{key}="{escape(str(value), quote=True)}"') + + return f"" + + +def extract_drawing_placeholder_from_element( + drawing_elem, + context: Optional[DrawingExtractionContext] = None, + include_extended_attrs: bool = True, +) -> str: + """ + Build a placeholder from a w:drawing element. + + Behavior: + - Always emits id/name from wp:docPr when present. + - For embedded images (a:blip@r:embed): exports image and sets path/format. + - For linked images (a:blip@r:link): does not download; path is original link target. + - When no image reference exists (e.g. chart drawing): keeps id/name only. 
+ """ + doc_pr = drawing_elem.find(".//wp:docPr", NS) + attrs = { + "id": doc_pr.get("id", "") if doc_pr is not None else "", + "name": doc_pr.get("name", "") if doc_pr is not None else "", + } + + if include_extended_attrs: + rel_ref = _extract_blip_relationship(drawing_elem) + if rel_ref is not None and context is not None: + rel_kind, rel_id = rel_ref + rel = context.resolve_relationship(rel_id) + if rel is not None: + if rel_kind == "embed" and rel.rel_type == IMAGE_REL_TYPE: + rel_path = context.export_embedded_image(rel) + if rel_path: + attrs["path"] = rel_path + if rel.image_format: + attrs["format"] = rel.image_format + elif rel_kind == "link": + if rel.target: + attrs["path"] = rel.target + if rel.image_format: + attrs["format"] = rel.image_format + + return _build_placeholder(attrs) + + +def extract_vml_image_placeholder_from_element( + container_elem, + context: Optional[DrawingExtractionContext] = None, + include_extended_attrs: bool = True, +) -> str: + """ + Build a placeholder from a w:pict or w:object element. + + Legacy Word documents and OLE-embedded objects (Visio diagrams, equation + editor previews, etc.) expose their rendered image via VML rather than + DrawingML. The image is referenced through ```` + inside ````, and the underlying bytes are commonly EMF/WMF + metafiles. This function exports those bytes through the same context as + DrawingML images so EMF/WMF assets land in the blocks.assets directory + alongside PNG/JPEG ones. + + The output placeholder format matches + ``extract_drawing_placeholder_from_element`` so downstream consumers + treat both paths uniformly. 
+ """ + shape = container_elem.find(".//v:shape", NS) + attrs = { + "id": shape.get("id", "") if shape is not None else "", + "name": shape.get("alt", "") if shape is not None else "", + } + + if include_extended_attrs: + rel_id = _extract_imagedata_relationship(container_elem) + if rel_id and context is not None: + rel = context.resolve_relationship(rel_id) + if rel is not None and rel.rel_type == IMAGE_REL_TYPE: + # VML reuses r:id for both embedded image parts and externally + # linked images; only the resolved TargetMode tells us which. + # Treating an external relationship as embedded would call + # export_embedded_image() (which short-circuits on external) + # and silently drop the linked path. + if rel.target_mode.lower() == "external": + if rel.target: + attrs["path"] = rel.target + if rel.image_format: + attrs["format"] = rel.image_format + else: + rel_path = context.export_embedded_image(rel) + if rel_path: + attrs["path"] = rel_path + if rel.image_format: + attrs["format"] = rel.image_format + + return _build_placeholder(attrs) + + +def parse_drawing_attributes(placeholder: str) -> Dict[str, str]: + """Parse attributes from a placeholder.""" + return { + name: unescape(value) + for name, value in DRAWING_ATTR_PATTERN.findall(placeholder) + } + + +def normalize_drawing_placeholder( + placeholder: str, + include_extended_attrs: bool = False, +) -> str: + """ + Normalize one drawing placeholder into canonical attribute order. + + Args: + placeholder: Input placeholder string + include_extended_attrs: If False, keeps only id/name. 
+ """ + attrs = parse_drawing_attributes(placeholder) + normalized = { + "id": attrs.get("id", ""), + "name": attrs.get("name", ""), + } + if include_extended_attrs: + if "path" in attrs: + normalized["path"] = attrs["path"] + if "format" in attrs: + normalized["format"] = attrs["format"] + for key, value in attrs.items(): + if key not in {"id", "name", "path", "format"}: + normalized[key] = value + return _build_placeholder(normalized) + + +def normalize_drawing_placeholders_in_text( + text: str, + include_extended_attrs: bool = False, +) -> str: + """Normalize all drawing placeholders inside a text blob.""" + if not text: + return text + + def _replace(match: re.Match) -> str: + return normalize_drawing_placeholder( + match.group(0), + include_extended_attrs=include_extended_attrs, + ) + + return DRAWING_TAG_PATTERN.sub(_replace, text) diff --git a/lightrag/native_parser/docx/ir_builder.py b/lightrag/native_parser/docx/ir_builder.py new file mode 100644 index 0000000000..236dff5c76 --- /dev/null +++ b/lightrag/native_parser/docx/ir_builder.py @@ -0,0 +1,339 @@ +"""Native DOCX IR builder: ``extract_docx_blocks`` output → :class:`IRDoc`. + +Input contract: a list of block dicts as produced by +``lightrag.native_parser.docx.parse_document.extract_docx_blocks``. Each +block carries ``content`` text in which ``
<table>``, ``<equation>`` and +``<drawing>`` placeholders are already embedded by the upstream parser. +The builder rewrites those placeholders into IR placeholder tokens +(``{{TBL:k}} / {{EQ:k}} / {{EQI:k}} / {{IMG:k}}``) and builds the matching +``IRTable`` / ``IREquation`` / ``IRDrawing`` items. + +Asset bytes are extracted to disk by the upstream parser *before* this +builder runs (via ``DrawingExtractionContext`` passed to +``extract_docx_blocks``). The builder therefore declares assets with +``AssetSpec.source=None``; the writer records each entry's size without +copying. + +Block-vs-inline equation distinction follows the legacy native rule: an +``<equation>`` tag is *block* iff each side is either the +content boundary or a ``\\n`` character. Anything else stays inline, +keeps its tag in block text without an id, and never enters +``equations.json``. + +Positions are always emitted as ``IRPosition(type="paraid", range=[start, +end])`` where each side may be ``None`` (legacy / non-Word docx authors +sometimes omit ``w14:paraId``). The writer's ``to_jsonable`` faithfully +preserves the per-side null so consumers can distinguish "start missing" +vs "both missing". +""" + +from __future__ import annotations + +import itertools +import json +import re +from collections.abc import Callable +from dataclasses import dataclass, field +from pathlib import Path, PurePosixPath +from typing import Any + +from lightrag.native_parser.docx.drawing_image_extractor import ( + DRAWING_TAG_PATTERN, + parse_drawing_attributes, +) +from lightrag.sidecar.ir import ( + AssetSpec, + IRBlock, + IRDoc, + IRDrawing, + IREquation, + IRPosition, + IRTable, +) + + +_TABLE_TAG_RE = re.compile(r"<table>(.*?)</table>", re.DOTALL) +_EQUATION_TAG_RE = re.compile(r"<equation>(.*?)</equation>", re.DOTALL) + + +def _normalize_dimension(rows_value: Any) -> tuple[int, int]: + if not isinstance(rows_value, list): + return 0, 0 + num_rows = len(rows_value) + num_cols = max((len(r) for r in rows_value if isinstance(r, list)), default=0) + return num_rows, num_cols + + +def _placeholder_keyspace() -> Callable[[str], str]: + """Return a fresh counter producing ``{prefix}{N}`` keys (1-indexed).""" + counter = itertools.count(1) + return lambda prefix: f"{prefix}{next(counter)}" + + +def _safe_asset_ref_from_path(path_val: str, asset_prefix: str) -> str | None: + """Return the path inside ``asset_prefix`` only when it is safe. + + Native DOCX images are pre-extracted into ``.blocks.assets/``. + Treat a drawing path as local only when the suffix is a clean POSIX + relative path. Unsafe local-looking paths are dropped instead of being + registered as assets or preserved as linked references. + """ + if not asset_prefix or not path_val.startswith(asset_prefix): + return None + + rel_raw = path_val[len(asset_prefix) :] + if not rel_raw or "\\" in rel_raw: + return None + + rel_path = PurePosixPath(rel_raw) + if rel_path.is_absolute(): + return None + if any(part == ".." for part in rel_path.parts): + return None + + rel = rel_path.as_posix() + if rel in {"", "."}: + return None + return rel + + +@dataclass +class _BlockBuilder: + """Per-block scratch state for the three ``re.sub`` rewrite passes. + + Keeping the replacer routines as bound methods (rather than closures + redefined inside the per-block loop) means they're compiled once at + class-load and the state they mutate (``tables`` / ``drawings`` / + ``equations`` / ``table_position``) is held explicitly rather than + captured implicitly from the enclosing frame.
+ """ + + next_key: Callable[[str], str] + assets: list[AssetSpec] + seen_asset_refs: set[str] + asset_prefix: str + block_table_headers: list[Any] + tables: list[IRTable] = field(default_factory=list) + drawings: list[IRDrawing] = field(default_factory=list) + equations: list[IREquation] = field(default_factory=list) + # Position of the *next* ```` placeholder within this block, + # used to look up the matching entry in ``block_table_headers``. + table_position: int = 0 + + def replace_table(self, match: "re.Match[str]") -> str: + table_body_raw = match.group(1) + try: + rows = json.loads(table_body_raw) + if not isinstance(rows, list): + rows = None + except json.JSONDecodeError: + rows = None + + if rows is not None: + parsed_rows: list[list[str]] | None = [ + [str(c) for c in r] if isinstance(r, list) else [str(r)] for r in rows + ] + html: str | None = None + else: + parsed_rows = None + html = table_body_raw + + num_rows, num_cols = _normalize_dimension(parsed_rows) + + header_pos = self.table_position + self.table_position += 1 + header_rows = ( + self.block_table_headers[header_pos] + if header_pos < len(self.block_table_headers) + else None + ) + # Treat empty list / explicit None identically: no header + # entry on the sidecar item. 
+ table_header = header_rows if header_rows else None + + placeholder = self.next_key("tb") + self.tables.append( + IRTable( + placeholder_key=placeholder, + rows=parsed_rows, + html=html, + num_rows=num_rows, + num_cols=num_cols, + caption="", + footnotes=[], + table_header=table_header, + body_override=table_body_raw, + ) + ) + return f"{{{{TBL:{placeholder}}}}}" + + def replace_equation(self, match: "re.Match[str]") -> str: + latex = match.group(1) + source = match.string + start, end = match.start(), match.end() + is_block = (start == 0 or source[start - 1] == "\n") and ( + end == len(source) or source[end] == "\n" + ) + placeholder = self.next_key("eq") + self.equations.append( + IREquation( + placeholder_key=placeholder, + latex=latex, + is_block=is_block, + caption="", + footnotes=[], + ) + ) + token = "EQ" if is_block else "EQI" + return f"{{{{{token}:{placeholder}}}}}" + + def replace_drawing(self, match: "re.Match[str]") -> str: + attrs = parse_drawing_attributes(match.group(0)) + path_val = attrs.get("path", "") or "" + src_val = attrs.get("src", "") or "" + fmt = attrs.get("format", "") or "" + if not fmt and path_val: + fmt = Path(path_val).suffix.lower().lstrip(".") + + # Two flavours of : + # 1. Local asset under .blocks.assets/ — already + # extracted to disk by DrawingExtractionContext; + # register as AssetSpec(source=None) and let the + # writer resolve the path via asset_paths. + # 2. External/linked path (URL, or any path that does + # not live under asset_prefix) — pass through + # verbatim via IRDrawing.path_override; do NOT emit + # an AssetSpec (no on-disk bytes to materialize). 
+ rel_inside_assets = _safe_asset_ref_from_path(path_val, self.asset_prefix) + if rel_inside_assets is not None: + asset_ref = rel_inside_assets + suggested_name = Path(rel_inside_assets).name or rel_inside_assets + if asset_ref and asset_ref not in self.seen_asset_refs: + self.assets.append( + AssetSpec( + ref=asset_ref, + suggested_name=suggested_name, + source=None, # already extracted to disk + ) + ) + self.seen_asset_refs.add(asset_ref) + path_override: str | None = None + else: + asset_ref = "" + # Only mark as an external/linked reference when the + # upstream parser actually emitted a path. An empty + # ``path=""`` should fall back to the regular asset- + # resolution path (which will also produce ``path=""`` + # downstream) rather than masquerading as an explicit + # builder override. + path_override = ( + None + if self.asset_prefix and path_val.startswith(self.asset_prefix) + else path_val or None + ) + + placeholder = self.next_key("im") + self.drawings.append( + IRDrawing( + placeholder_key=placeholder, + asset_ref=asset_ref, + fmt=fmt, + caption="", + footnotes=[], + src=src_val, + path_override=path_override, + ) + ) + return f"{{{{IMG:{placeholder}}}}}" + + +class NativeDocxIRBuilder: + """Translate ``extract_docx_blocks`` output into an :class:`IRDoc`. + + The builder is stateless — instantiate per call. ``asset_dir_name`` is + the relative name (without trailing slash) of ``.blocks.assets/`` + that the upstream parser used when emitting ```` + attributes; the builder strips that prefix when building + :attr:`AssetSpec.ref` so the writer's ref↔filename mapping has + predictable keys. 
+ """ + + def normalize( + self, + blocks: list[dict[str, Any]], + *, + document_name: str, + asset_dir_name: str, + parse_metadata: dict[str, Any] | None = None, + ) -> IRDoc: + next_key = _placeholder_keyspace() + ir_blocks: list[IRBlock] = [] + assets: list[AssetSpec] = [] + seen_asset_refs: set[str] = set() + + asset_prefix = f"{asset_dir_name}/" if asset_dir_name else "" + + for block in blocks: + raw_content = block.get("content") or "" + heading = block.get("heading") or "" + level = int(block.get("level", 0) or 0) + parent_headings = list(block.get("parent_headings") or []) + # Preserve per-side nulls in [start, end]. + uuid_start = block.get("uuid") or None + uuid_end = block.get("uuid_end") or None + + builder = _BlockBuilder( + next_key=next_key, + assets=assets, + seen_asset_refs=seen_asset_refs, + asset_prefix=asset_prefix, + block_table_headers=list(block.get("table_headers") or []), + ) + + # Rewrite order matches the legacy native flow: tables, then + # equations, then drawings — each ``re.sub`` operates on the + # output of the previous pass. + content_template = _TABLE_TAG_RE.sub(builder.replace_table, raw_content) + content_template = _EQUATION_TAG_RE.sub( + builder.replace_equation, content_template + ) + content_template = DRAWING_TAG_PATTERN.sub( + builder.replace_drawing, content_template + ) + + positions = [ + IRPosition(type="paraid", range=[uuid_start, uuid_end]), + ] + + ir_blocks.append( + IRBlock( + content_template=content_template, + heading=heading, + level=level, + parent_headings=parent_headings, + positions=positions, + tables=builder.tables, + drawings=builder.drawings, + equations=builder.equations, + ) + ) + + # doc_title: parse_metadata["first_heading"] when present, else file + # stem fallback (resolved here so the writer doesn't have to know). 
+ first_heading = "" + if isinstance(parse_metadata, dict): + first_heading = str(parse_metadata.get("first_heading") or "") + doc_title = first_heading or (Path(document_name).stem or document_name) + + return IRDoc( + document_name=document_name, + document_format=Path(document_name).suffix.lower().lstrip("."), + doc_title=doc_title, + split_option={"fixlevel": 0}, + blocks=ir_blocks, + assets=assets, + bbox_attributes=None, + ) + + +__all__ = ["NativeDocxIRBuilder"] diff --git a/lightrag/native_parser/docx/numbering_resolver.py b/lightrag/native_parser/docx/numbering_resolver.py new file mode 100644 index 0000000000..7d73284d5b --- /dev/null +++ b/lightrag/native_parser/docx/numbering_resolver.py @@ -0,0 +1,423 @@ +#!/usr/bin/env python3 +""" +ABOUTME: Resolves automatic numbering labels from DOCX documents +ABOUTME: Parses numbering.xml and computes rendered number strings +""" + +import zipfile +from defusedxml import ElementTree as ET +from typing import Dict + +NSMAP = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"} + + +class NumberingResolver: + """ + Resolves paragraph numbering to rendered label strings. 
+ + DOCX stores numbering definitions in numbering.xml: + - abstractNum: Defines format templates (lvlText like "%1.%2.") + - num: Links numId to abstractNumId + + Each paragraph references: numId (which definition) + ilvl (which level) + """ + + # Number format converters + FORMAT_CONVERTERS = { + "decimal": lambda n: str(n), + "lowerLetter": lambda n: chr(ord("a") + (n - 1) % 26), + "upperLetter": lambda n: chr(ord("A") + (n - 1) % 26), + "lowerRoman": lambda n: NumberingResolver._to_roman(n).lower(), + "upperRoman": lambda n: NumberingResolver._to_roman(n), + "chineseCountingThousand": lambda n: NumberingResolver._to_chinese(n), + "ideographTraditional": lambda n: "甲乙丙丁戊己庚辛壬癸"[(n - 1) % 10], + "bullet": lambda n: "•", + "none": lambda n: "", + } + + def __init__(self, docx_path: str): + self.abstract_nums: Dict[str, dict] = {} # abstractNumId -> level definitions + self.num_to_abstract: Dict[str, str] = {} # numId -> abstractNumId + self.counters: Dict[ + str, Dict[int, int] + ] = {} # numId -> {ilvl -> current_count} + self.start_overrides: Dict[ + str, Dict[int, int] + ] = {} # numId -> {ilvl -> start_value} + self.style_numpr: Dict[ + str, dict + ] = {} # styleId -> {numId, ilvl} from styles.xml + self.style_based_on: Dict[str, str] = {} # styleId -> basedOn styleId + # Smart numbering merge state (Word's rendering behavior) + self.last_numId: str = None # Previous paragraph's numId + self.last_abstract_id: str = None # Previous paragraph's abstractNumId + self.last_style_id: str = None # Previous paragraph's style ID + self._parse_numbering_xml(docx_path) + self._parse_styles_xml(docx_path) + + def _parse_numbering_xml(self, docx_path: str): + """Parse numbering.xml from DOCX archive""" + try: + with zipfile.ZipFile(docx_path, "r") as zf: + if "word/numbering.xml" not in zf.namelist(): + return + + tree = ET.parse(zf.open("word/numbering.xml")) + root = tree.getroot() + + # Parse abstractNum definitions + for abstract in root.findall(".//w:abstractNum", 
NSMAP): + abstract_id = abstract.get(f'{{{NSMAP["w"]}}}abstractNumId') + levels = {} + + for lvl in abstract.findall("w:lvl", NSMAP): + ilvl = int(lvl.get(f'{{{NSMAP["w"]}}}ilvl')) + + start_elem = lvl.find("w:start", NSMAP) + start = ( + int(start_elem.get(f'{{{NSMAP["w"]}}}val')) + if start_elem is not None + else 1 + ) + + num_fmt_elem = lvl.find("w:numFmt", NSMAP) + num_fmt = ( + num_fmt_elem.get(f'{{{NSMAP["w"]}}}val') + if num_fmt_elem is not None + else "decimal" + ) + + lvl_text_elem = lvl.find("w:lvlText", NSMAP) + lvl_text = ( + lvl_text_elem.get(f'{{{NSMAP["w"]}}}val') + if lvl_text_elem is not None + else "%1." + ) + + is_lgl_elem = lvl.find("w:isLgl", NSMAP) + is_lgl = False + if is_lgl_elem is not None: + val = is_lgl_elem.get(f'{{{NSMAP["w"]}}}val') + is_lgl = val is None or val not in ("0", "false") + + levels[ilvl] = { + "start": start, + "numFmt": num_fmt, + "lvlText": lvl_text, + "isLgl": is_lgl, + } + + self.abstract_nums[abstract_id] = levels + + # Parse num -> abstractNum mapping and startOverride + for num in root.findall(".//w:num", NSMAP): + num_id = num.get(f'{{{NSMAP["w"]}}}numId') + abstract_ref = num.find("w:abstractNumId", NSMAP) + if abstract_ref is not None: + self.num_to_abstract[num_id] = abstract_ref.get( + f'{{{NSMAP["w"]}}}val' + ) + + # Parse lvlOverride/startOverride for this num + for lvl_override in num.findall("w:lvlOverride", NSMAP): + ilvl = int(lvl_override.get(f'{{{NSMAP["w"]}}}ilvl')) + start_override = lvl_override.find("w:startOverride", NSMAP) + if start_override is not None: + start_val = int(start_override.get(f'{{{NSMAP["w"]}}}val')) + if num_id not in self.start_overrides: + self.start_overrides[num_id] = {} + self.start_overrides[num_id][ilvl] = start_val + except Exception: + # Silently ignore parsing errors - document may not have numbering + pass + + def _parse_styles_xml(self, docx_path: str): + """Parse styles.xml to get style-inherited numbering definitions""" + try: + with zipfile.ZipFile(docx_path, "r") 
as zf: + if "word/styles.xml" not in zf.namelist(): + return + + tree = ET.parse(zf.open("word/styles.xml")) + root = tree.getroot() + + # Parse style definitions + for style in root.findall(".//w:style", NSMAP): + style_id = style.get(f'{{{NSMAP["w"]}}}styleId') + if not style_id: + continue + + # Check for basedOn (style inheritance) + based_on = style.find("w:basedOn", NSMAP) + if based_on is not None: + parent_id = based_on.get(f'{{{NSMAP["w"]}}}val') + if parent_id: + self.style_based_on[style_id] = parent_id + + # Check for numPr in style's pPr + pPr = style.find("w:pPr", NSMAP) + if pPr is not None: + numPr = pPr.find("w:numPr", NSMAP) + if numPr is not None: + num_id_elem = numPr.find("w:numId", NSMAP) + ilvl_elem = numPr.find("w:ilvl", NSMAP) + + if num_id_elem is not None: + num_id = num_id_elem.get(f'{{{NSMAP["w"]}}}val') + ilvl = ( + int(ilvl_elem.get(f'{{{NSMAP["w"]}}}val')) + if ilvl_elem is not None + else 0 + ) + self.style_numpr[style_id] = { + "numId": num_id, + "ilvl": ilvl, + } + except Exception: + # Silently ignore parsing errors + pass + + def _get_numbering_from_style(self, style_id: str, visited=None) -> dict: + """ + Get numbering definition from style, following inheritance chain. + + Args: + style_id: Style ID to look up + visited: Set of visited style IDs (to prevent circular references) + + Returns: + dict with 'numId' and 'ilvl', or None + """ + if visited is None: + visited = set() + + # Prevent circular references + if style_id in visited: + return None + visited.add(style_id) + + # Check if this style has numPr + if style_id in self.style_numpr: + return self.style_numpr[style_id] + + # Check parent style + if style_id in self.style_based_on: + parent_id = self.style_based_on[style_id] + return self._get_numbering_from_style(parent_id, visited) + + return None + + def reset_tracking_state(self): + """ + Reset numbering tracking state. 
+ + Call this when encountering structural breaks that should + interrupt numbering continuity: + - Section breaks (sectPr) + - Table boundaries (before and after tables) + + This prevents incorrect numbering continuation across + document structure boundaries. + """ + self.last_numId = None + self.last_abstract_id = None + self.last_style_id = None + + def get_label(self, para_element) -> str: + """ + Get rendered numbering label for a paragraph. + + Checks both direct numPr and style-inherited numbering. Direct numPr + is a paragraph-local override and applies only to the current + paragraph; subsequent paragraphs that carry only pStyle fall back to + the style's numPr declared in styles.xml. + + Args: + para_element: lxml Element for a ``<w:p>`` paragraph + + Returns: + Rendered label string (e.g., "1.1", "a)", "第一章") or empty string + """ + try: + pPr = para_element.find(f'{{{NSMAP["w"]}}}pPr') + if pPr is None: + return "" + + num_id = None + ilvl = 0 + style_id = None + + # Get pStyle (if present) + pStyle = pPr.find(f'{{{NSMAP["w"]}}}pStyle') + if pStyle is not None: + style_id = pStyle.get(f'{{{NSMAP["w"]}}}val') + + # Check for direct numPr in paragraph + numPr = pPr.find(f'{{{NSMAP["w"]}}}numPr') + if numPr is not None: + num_id_elem = numPr.find(f'{{{NSMAP["w"]}}}numId') + ilvl_elem = numPr.find(f'{{{NSMAP["w"]}}}ilvl') + + if num_id_elem is not None: + num_id = num_id_elem.get(f'{{{NSMAP["w"]}}}val') + ilvl = ( + int(ilvl_elem.get(f'{{{NSMAP["w"]}}}val')) + if ilvl_elem is not None + else 0 + ) + + # If no direct numPr, fall back to style-inherited numbering. + # Direct numPr is a paragraph-local override in Word; it must not + # persist as a runtime default for the style, otherwise subsequent + # paragraphs that only carry pStyle will keep following the local + # override instead of the style's declared numPr.
+ if num_id is None and style_id: + style_num = self._get_numbering_from_style(style_id) + if style_num: + num_id = style_num["numId"] + ilvl = style_num["ilvl"] + + # If still no numbering found, return empty + if num_id is None: + # Tracking state (last_numId, last_abstract_id, last_style_id) is reset at structural breaks via reset_tracking_state(), not here + return "" + + # Get abstract definition + abstract_id = self.num_to_abstract.get(num_id) + if abstract_id is None or abstract_id not in self.abstract_nums: + # Clear state for invalid numbering + self.last_numId = None + self.last_abstract_id = None + return "" + + levels = self.abstract_nums[abstract_id] + if ilvl not in levels: + # Clear state for invalid level + self.last_numId = None + self.last_abstract_id = None + return "" + + # Smart numbering merge (Word's rendering behavior): + # When consecutive paragraphs have different numId but same abstractNumId, + # Word continues the numbering sequence rather than restarting. + # This happens regardless of whether the numId is new or style matches.
+ + if ( + self.last_numId is not None + and self.last_numId != num_id + and self.last_abstract_id == abstract_id + and self.last_numId in self.counters + ): + # Merge: copy previous numId's counter to current numId + self.counters[num_id] = self.counters[self.last_numId].copy() + + # Initialize/update counter + if num_id not in self.counters: + self.counters[num_id] = {} + + # Initialize all parent levels if not present (for deep nested numbering) + for i in range(ilvl): + if i not in self.counters[num_id] and i in levels: + # Use startOverride if exists, otherwise use abstractNum's start value + if ( + num_id in self.start_overrides + and i in self.start_overrides[num_id] + ): + self.counters[num_id][i] = self.start_overrides[num_id][i] + else: + self.counters[num_id][i] = levels[i]["start"] + + # Reset lower levels when higher level increments + for i in range(ilvl + 1, 10): + if i in self.counters[num_id]: + del self.counters[num_id][i] + + # Initialize current level if needed + if ilvl not in self.counters[num_id]: + # Use startOverride if exists, otherwise use abstractNum's start value + if ( + num_id in self.start_overrides + and ilvl in self.start_overrides[num_id] + ): + self.counters[num_id][ilvl] = self.start_overrides[num_id][ilvl] + else: + self.counters[num_id][ilvl] = levels[ilvl]["start"] + else: + self.counters[num_id][ilvl] += 1 + + # Format the label using lvlText template + label = self._format_label(num_id, ilvl, levels) + + # Update tracking state for next paragraph + self.last_numId = num_id + self.last_abstract_id = abstract_id + self.last_style_id = style_id + + return label + except Exception: + # Return empty on any error to avoid breaking document parsing + return "" + + def _format_label(self, num_id: str, ilvl: int, levels: dict) -> str: + """Format label string by replacing %1, %2, etc.""" + try: + lvl_text = levels[ilvl]["lvlText"] + result = lvl_text + current_is_lgl = levels[ilvl].get("isLgl", False) + + for i in range(ilvl + 1): + 
if i in levels and i in self.counters.get(num_id, {}): + num_fmt = levels[i]["numFmt"] + if current_is_lgl and i < ilvl: + num_fmt = "decimal" + count = self.counters[num_id][i] + converter = self.FORMAT_CONVERTERS.get(num_fmt, lambda n: str(n)) + formatted = converter(count) + result = result.replace(f"%{i+1}", formatted) + + return result + except Exception: + return "" + + @staticmethod + def _to_roman(n: int) -> str: + """Convert integer to Roman numeral""" + if n <= 0 or n >= 4000: + return str(n) + values = [ + (1000, "M"), + (900, "CM"), + (500, "D"), + (400, "CD"), + (100, "C"), + (90, "XC"), + (50, "L"), + (40, "XL"), + (10, "X"), + (9, "IX"), + (5, "V"), + (4, "IV"), + (1, "I"), + ] + result = "" + for value, numeral in values: + while n >= value: + result += numeral + n -= value + return result + + @staticmethod + def _to_chinese(n: int) -> str: + """Convert integer to Chinese numeral""" + digits = "零一二三四五六七八九" + if n <= 0 or n > 99: + return str(n) + if n < 10: + return digits[n] + if n < 20: + return "十" + (digits[n % 10] if n % 10 else "") + if n < 100: + tens = n // 10 + ones = n % 10 + return digits[tens] + "十" + (digits[ones] if ones else "") + return str(n) diff --git a/lightrag/native_parser/docx/omml/__init__.py b/lightrag/native_parser/docx/omml/__init__.py new file mode 100644 index 0000000000..0d6f5e48f3 --- /dev/null +++ b/lightrag/native_parser/docx/omml/__init__.py @@ -0,0 +1,10 @@ +""" +ABOUTME: OMML (Office Math Markup Language) to LaTeX conversion +""" + +from .ommlparser import OMMLParser + + +def convert_omml_to_latex(omml_element) -> str: + """Convert an m:oMath XML element to a LaTeX string.""" + return OMMLParser().parse(omml_element) diff --git a/lightrag/native_parser/docx/omml/cleaners.py b/lightrag/native_parser/docx/omml/cleaners.py new file mode 100644 index 0000000000..af0370fca5 --- /dev/null +++ b/lightrag/native_parser/docx/omml/cleaners.py @@ -0,0 +1,38 @@ +""" +Postprocessing functions for cleaning up latex equations in 
linear format which don't give valid LaTeX. +""" + +import re + +clean_exps = { + r"\\degf": "°F", + r"\\degc": "°C", + r"(\\cbrt)(\w+)": r"\\sqrt[3]{\2}", + r"(\\qdrt)(\w+)": r"\\sqrt[4]{\2}", + r"\\sfrac": r"\\frac", + r"(\\o[i]+nt)(\w+)": r"\1{\2}", + r"\\bullet(\w+)": r"\\bullet \1", + r"\\sum([a-zA-Z0-9]+)": r"\\sum{\1}", + r"\\prod([a-zA-Z0-9]+)": r"\\prod{\1}", + r"\\amalg([a-zA-Z0-9]+)": r"\\amalg{\1}", + r"\\bigcup([a-zA-Z0-9]+)": r"\\bigcup{\1}", + r"\\bigcap([a-zA-Z0-9]+)": r"\\bigcap{\1}", + r"\\bigvee([a-zA-Z0-9]+)": r"\\bigvee{\1}", + r"\\bigwedge([a-zA-Z0-9]+)": r"\\bigwedge{\1}", + r"\\lfloor([a-zA-Z0-9]+)": r"\\lfloor{\1}", + r"\\lceil([a-zA-Z0-9]+)": r"\\lceil{\1}", + r"\\lim\\below\{(.+)\}\{(.+)\}": r"\\lim_{\1}{\2}", + r"\\min\\below\{(.+)\}\{(.+)\}": r"\\min_{\1}{\2}", + r"\\max\\below\{(.+)\}\{(.+)\}": r"\\max_{\1}{\2}", +} + + +def clean_exp(exp): + """ + Takes in a linear expression and converts known invalid LaTeX equations to valid LaTeX + :param exp:str - An equation in invalid syntax + :return :str - A valid equation + """ + for e in clean_exps: + exp = re.sub(e, clean_exps[e], exp) + return exp diff --git a/lightrag/native_parser/docx/omml/ommlparser.py b/lightrag/native_parser/docx/omml/ommlparser.py new file mode 100644 index 0000000000..e45beaaa8d --- /dev/null +++ b/lightrag/native_parser/docx/omml/ommlparser.py @@ -0,0 +1,511 @@ +from xml.etree.ElementTree import Element + +from .utils import qn + + +class OMMLParser: + """ + Parser class for reading OMML and converting it into LaTeX.
+ """ + + FUNCTION_MAP = { + "sin": "\\sin", + "cos": "\\cos", + "tan": "\\tan", + "cot": "\\cot", + "sec": "\\sec", + "csc": "\\csc", + "sinh": "\\sinh", + "cosh": "\\cosh", + "tanh": "\\tanh", + "coth": "\\coth", + "sech": "\\operatorname{sech}", + "csch": "\\operatorname{csch}", + "log": "\\log", + "ln": "\\ln", + "min": "\\min", + "max": "\\max", + "lim": "\\lim", + } + + def _normalize_func_name(self, content: str) -> str: + if not content: + return content + if content.startswith("\\"): + return content + key = content.strip() + mapped = self.FUNCTION_MAP.get(key) + return mapped if mapped else content + + def parse(self, root: Element) -> str: + """ + Parses an m:oMath OMML tag into LaTeX. + :param root: An m:oMath OMML tag + :return: The LaTeX representation of the OMML input + """ + text = "" + try: + if root.tag == qn("m:t"): + return self.parse_t(root) + for child in root: + if child.tag in self.parsers: + text += self.parsers[child.tag](self, child) + except AttributeError: + # In case of missing attributes on OMML tags, + # we return an empty string (ref:issue_14) + return "" + return text + + def parse_e(self, root: Element) -> str: + text = "" + for child in root: + text += self.parse(child) + return text + + def parse_r(self, root: Element) -> str: + # TODO: Add support for m:rPr and m:scr to support different character styles + # For now, we just parse the text content of m:r + text = "" + for child in root: + text += self.parse(child) + return text + + def parse_t(self, root: Element): + symbol_map = { + "≜": "\\triangleq", + "≝": "\\stackrel{\\tiny def}{=}", + "≞": "\\stackrel{\\tiny m}{=}", + } + replacements = { + "<": "\\lt ", + ">": "\\gt ", + "≤": "\\leq ", + "≥": "\\geq ", + "∞": "\\infty ", + "<": "\\lt ", + ">": "\\gt ", + "≤": "\\leq ", + "≥": "\\geq ", + } + text = root.text.split() + if not text: + return " " + for i, t in enumerate(text): + if t in symbol_map: + text[i] = symbol_map[t] + for key, value in replacements.items(): + for 
i, t in enumerate(text): + text[i] = t.replace(key, value) + return " ".join(text) + + def parse_acc(self, root: Element) -> str: + character_map = { + 768: "\\grave", + 769: "\\acute", + 770: "\\hat", + 771: "\\tilde", + 773: "\\bar", + 774: "\\breve", + 775: "\\dot", + 776: "\\ddot", + 780: "\\check", + 831: "\\overline{\\overline", + 8400: "\\overset\\leftharpoonup", + 8401: "\\overset\\rightharpoonup", + 8406: "\\overleftarrow", + 8407: "\\overrightarrow", + 8411: "\\dddot", + 8417: "\\overset\\leftrightarrow", + } + text = "" + accent = 770 + for child in root: + if child.tag == qn("m:accPr"): + for child2 in child: + if child2.tag == qn("m:chr"): + val = child2.attrib.get(qn("m:val")) + if val: + try: + accent = ord(val) + except TypeError: + pass + + accent_cmd = character_map.get(accent) + if accent_cmd is None: + accent_cmd = character_map.get(770, "\\hat") + text += accent_cmd + "{" + for child in root: + if child.tag == qn("m:e"): + text += self.parse(child) + text += "}" + if accent == 831: + text += "}" + return text + + def parse_bar(self, root: Element) -> str: + text = "\\overline{" + for child in root: + if child.tag == qn("m:barPr"): + for child2 in child: + if child2.tag == qn("m:pos"): + if child2.attrib.get(qn("m:val")) == "bot": + text = "\\underline{" + + for child in root: + if child.tag == qn("m:e"): + text += self.parse(child) + text += "}" + return text + + def parse_border_box(self, root: Element) -> str: + text = "\\boxed{" + for child in root: + if child.tag == qn("m:e"): + text += self.parse(child) + text += "}" + return text + + def parse_box(self, root: Element) -> str: + text = "" + for child in root: + text += self.parse(child) + return text + + def parse_group_chr(self, root: Element) -> str: + character_map = { + "←": "\\leftarrow", + "→": "\\rightarrow", + "↔": "\\leftrightarrow", + "⇐": "\\Leftarrow", + "⇒": "\\Rightarrow", + "⇔": "\\Leftrightarrow", + } + text = "\\underbrace{" + bottom = False + for child in root: + if 
child.tag == qn("m:groupChrPr"): + for child2 in child: + if child2.tag == qn("m:chr"): + char = child2.attrib.get(qn("m:val")) + if char in character_map: + text = character_map[char] + for child2 in child: + if ( + child2.tag == qn("m:pos") + and child2.attrib.get(qn("m:val")) == "top" + ): + # If m:pos is set to "top", the symbol is supposed to + # be on top and the text is actually supposed to be under + bottom = True + + content = "" + for child in root: + if child.tag == qn("m:e"): + content = self.parse(child) + if text == "\\underbrace{": + if bottom: + text = "\\overbrace{" + content + "}" + else: + text += content + "}" + else: + if not bottom: + text = "\\overset{" + content + "}" + "{" + text + "}" + else: + text = "\\underset{" + content + "}" + "{" + text + "}" + return text + + def parse_d(self, root: Element) -> str: + # Literal braces must be escaped after \left / \right in LaTeX + bracket_map = { + "(": "\\left(", + ")": "\\right)", + "[": "\\left[", + "]": "\\right]", + "{": "\\left\\{", + "}": "\\right\\}", + "〈": "\\left\\langle", + "〉": "\\right\\rangle", + "⟨": "\\left\\langle", + "⟩": "\\right\\rangle", + "⌊": "\\left\\lfloor", + "⌋": "\\right\\rfloor", + "⌈": "\\left\\lceil", + "⌉": "\\right\\rceil", + "|": "\\left|", + "‖": "\\left\\|", + "⟦": "[\\![", + "⟧": "]\\!]", + } + text = "" + start_bracket = "(" + end_bracket = ")" + separator = "|" + is_matrix = False + for child in root: + for child2 in child: + if child.tag == qn("m:dPr"): + if child2.tag == qn("m:begChr"): + start_bracket = child2.attrib.get(qn("m:val")) + if child2.tag == qn("m:endChr"): + end_bracket = child2.attrib.get(qn("m:val")) + if child2.tag == qn("m:sepChr"): + separator = child2.attrib.get(qn("m:val")) + if child2.tag == qn("m:m"): + is_matrix = True + + for child in root: + if child.tag == qn("m:e"): + if text: + text += separator + text += self.parse(child) + end_bracket_replacements = { + "|": "\\right|", + "‖": "\\right\\|", + "[": "\\right[", + } + start_bracket_replacements = { + "]": "\\left]", + } + start = "" + end = "" +
if start_bracket: + if start_bracket in start_bracket_replacements: + start = start_bracket_replacements[start_bracket] + " " + elif start_bracket in bracket_map: + start = bracket_map[start_bracket] + " " + else: + start = "\\left(" + " " + if end_bracket: + if end_bracket in end_bracket_replacements: + end = " " + end_bracket_replacements[end_bracket] + elif end_bracket in bracket_map: + end = " " + bracket_map[end_bracket] + else: + end = " " + "\\right)" + # If there is no end bracket and this tag contains an m:eqArr tag as a + # child, we assume that the eqArr should be translated to a cases environment + # instead of an eqnarray* environment. + else: + for child in root: + if child.tag == qn("m:e"): + for child2 in child: + if child2.tag == qn("m:eqArr"): + text = text.replace("\\begin{eqnarray*}", "") + text = text.replace("\\end{eqnarray*}", "") + return "\\begin{cases} " + text + " \\end{cases}" + if is_matrix: + if start_bracket == "(" and end_bracket == ")": + return text.replace("{matrix}", "{pmatrix}") + elif start_bracket == "|" and end_bracket == "|": + return text.replace("{matrix}", "{vmatrix}") + elif start_bracket == "‖" and end_bracket == "‖": + return text.replace("{matrix}", "{Vmatrix}") + else: + return text.replace("{matrix}", "{bmatrix}") + return start + text + end + + def parse_eq_arr(self, root: Element) -> str: + text = "\\begin{eqnarray*}" + for child in root: + if child.tag == qn("m:e"): + text += self.parse(child) + " \\\\" + text += "\\end{eqnarray*}" + return text + + def parse_f(self, root: Element) -> str: + text = "\\frac{" + num = "" + den = "" + is_binom = False + for child in root: + if child.tag == qn("m:fPr"): + for child2 in child: + if ( + child2.tag == qn("m:type") + and child2.attrib.get(qn("m:val")) == "noBar" + ): + is_binom = True + if child.tag == qn("m:num"): + num = self.parse(child) + if child.tag == qn("m:den"): + den = self.parse(child) + if is_binom: + text = "\\genfrac{}{}{0pt}{}{" + text += num + "}{" + den 
+ "}" + return text + + def parse_m(self, root: Element) -> str: + text = "\\begin{matrix} " + text += self.parse(root)[:-3] # Remove the last ' \\' + text += "\\end{matrix}" + return text + + def parse_mr(self, root: Element) -> str: + text = "" + for child in root: + if child.tag == qn("m:e"): + text += self.parse(child) + " & " + return text[:-2] + "\\\\ " # Remove the last ' & ' + + def parse_func(self, root: Element) -> str: + subscript = "" + superscript = "" + text = "" + func_name = "sin" + for child in root: + if child.tag == qn("m:fName"): + for child2 in child: + if child2.tag in [qn("m:sSup"), qn("m:sSub"), qn("m:r")]: + for child3 in child2: + if child3.tag == qn("m:sub"): + subscript = self.parse(child3) + if child3.tag == qn("m:sup"): + superscript = self.parse(child3) + if child3.tag == qn("m:t") or child3.tag == qn("m:e"): + func_name = self.parse(child3) + elif child2.tag == qn("m:limLow"): + for child3 in child2: + if child3.tag == qn("m:lim"): + for child4 in child3: + subscript += self.parse(child4) + if child3.tag == qn("m:e"): + func_name = self.parse(child3) + + if child.tag == qn("m:e"): + text += self.parse(child) + if func_name in ["lim", "max", "min"]: + return f"\\{func_name}\\limits_{{{subscript}}}^{{{superscript}}}{{{text}}}" + if func_name not in self.FUNCTION_MAP: + return f"{{{func_name}}}^{{{superscript}}}_{{{subscript}}}{{{text}}}" + return ( + self.FUNCTION_MAP[func_name] + + f"_{{{subscript}}}^{{{superscript}}}{{{text}}}" + ) + + def parse_s_sup(self, root: Element) -> str: + content = "" + exp_content = "" + for child in root: + if child.tag == qn("m:e"): + content = self.parse(child) + if child.tag == qn("m:sup"): + exp_content = self.parse(child) + content = self._normalize_func_name(content) + return f"{{{content}}}^{{{exp_content}}}" + + def parse_s_sub(self, root: Element) -> str: + content = "" + sub_content = "" + for child in root: + if child.tag == qn("m:e"): + content = self.parse(child) + if child.tag == 
qn("m:sub"): + sub_content = self.parse(child) + content = self._normalize_func_name(content) + return f"{{{content}}}_{{{sub_content}}}" + + def parse_s_sub_sup(self, root: Element) -> str: + content = "" + sub_content = "" + exp_content = "" + for child in root: + if child.tag == qn("m:e"): + content = self.parse(child) + if child.tag == qn("m:sub"): + sub_content = self.parse(child) + if child.tag == qn("m:sup"): + exp_content = self.parse(child) + content = self._normalize_func_name(content) + return f"{{{content}}}_{{{sub_content}}}^{{{exp_content}}}" + + def parse_s_pre(self, root: Element) -> str: + content = "" + sub_content = "" + exp_content = "" + for child in root: + if child.tag == qn("m:e"): + content = self.parse(child) + if child.tag == qn("m:sub"): + sub_content = self.parse(child) + if child.tag == qn("m:sup"): + exp_content = self.parse(child) + return "{}^{" + exp_content + "}_{" + sub_content + "}{" + content + "}" + + def parse_rad(self, root: Element) -> str: + content = "" + order = "" + for child in root: + if child.tag == qn("m:deg"): + order = self.parse(child) + if child.tag == qn("m:e"): + content += self.parse(child) + if order: + return f"\\sqrt[{order}]{{{content}}}" + return f"\\sqrt{{{content}}}" + + def parse_nary(self, root: Element) -> str: + character_map = { + 8719: "\\prod", + 8720: "\\coprod", + 8721: "\\sum", + 8747: "\\int", + 8748: "\\iint", + 8749: "\\iiint", + 8750: "\\oint", + 8751: "\\oiint", + 8752: "\\oiiint", + 8896: "\\bigwedge", + 8897: "\\bigvee", + 8898: "\\bigcap", + 8899: "\\bigcup", + } + char = 8747 + for child in root: + if child.tag == qn("m:naryPr"): + for child2 in child: + if child2.tag == qn("m:chr"): + val = child2.attrib.get(qn("m:val")) + if val: + try: + char = ord(val) + except TypeError: + pass + text = character_map.get(char, character_map[8721]) + sub = "" + sup = "" + content = "" + for child in root: + if child.tag == qn("m:sub"): + sub = self.parse(child) + if child.tag == qn("m:sup"): + 
sup = self.parse(child) + if child.tag == qn("m:e"): + content = self.parse(child) + if sub: + text += f"_{{{sub}}}" + if sup: + text += f"^{{{sup}}}" + text += "{" + content + "}" + return text + + parsers = { + qn("m:r"): parse_r, + qn("m:acc"): parse_acc, + qn("m:borderBox"): parse_border_box, + qn("m:bar"): parse_bar, + qn("m:box"): parse_box, + qn("m:d"): parse_d, + qn("m:e"): parse_e, + qn("m:groupChr"): parse_group_chr, + qn("m:f"): parse_f, + qn("m:sSup"): parse_s_sup, + qn("m:sSub"): parse_s_sub, + qn("m:sSubSup"): parse_s_sub_sup, + qn("m:sPre"): parse_s_pre, + qn("m:t"): parse_t, + qn("m:rad"): parse_rad, + qn("m:nary"): parse_nary, + qn("m:eqArr"): parse_eq_arr, + qn("m:func"): parse_func, + qn("m:m"): parse_m, + qn("m:mr"): parse_mr, + } diff --git a/lightrag/native_parser/docx/omml/utils.py b/lightrag/native_parser/docx/omml/utils.py new file mode 100644 index 0000000000..b0a5f43c3f --- /dev/null +++ b/lightrag/native_parser/docx/omml/utils.py @@ -0,0 +1,40 @@ +""" +Utility functions to extract text from the supported mathematical equations from xml tags and +convert them into LaTeX +""" + +from .cleaners import clean_exp + +ns_map = { + "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main", + "m": "http://schemas.openxmlformats.org/officeDocument/2006/math", +} + + +def linear_expression(tag): + """ + Just returns the text contained in the given tag while setting docxlatex_skip_iteration flags + for all its children. + :param tag:defusedxml.Element - An xml element which contains a math equation in linear form + :return text:str - The equation in valid LaTeX syntax + """ + text = "" + for child in tag.iter(): + child.set("docxlatex_skip_iteration", True) + text += child.text if child.text is not None else "" + text = clean_exp(text) + return text + + +def qn(tag): + """ + A utility function to turn a namespace + prefixed tag name into a Clark-notation qualified tag name for lxml. 
For + example, qn('m:oMath') returns '{http://schemas.openxmlformats.org/officeDocument/2006/math}oMath' + + :param tag:str - A namespace-prefixed tag name + :return qn:str - A Clark-notation qualified name tag for lxml. + """ + prefix, tag_root = tag.split(":") + uri = ns_map[prefix] + return "{{{}}}{}".format(uri, tag_root) diff --git a/lightrag/native_parser/docx/parse_document.py b/lightrag/native_parser/docx/parse_document.py new file mode 100755 index 0000000000..548b9c11fa --- /dev/null +++ b/lightrag/native_parser/docx/parse_document.py @@ -0,0 +1,1892 @@ +#!/usr/bin/env python3 +""" +ABOUTME: Parses DOCX documents into text blocks using python-docx +ABOUTME: Extracts automatic numbering, splits by headings, converts tables to JSON +""" + +import json +import sys + +try: + from docx import Document +except ImportError: + print( + "Error: python-docx not installed. Run: pip install python-docx", + file=sys.stderr, + ) + sys.exit(1) + +from .numbering_resolver import NumberingResolver +from .table_extractor import TableExtractor +from .utils import estimate_tokens +from .drawing_image_extractor import ( + DrawingExtractionContext, + extract_drawing_placeholder_from_element, + extract_vml_image_placeholder_from_element, +) + + +# Constants for content validation (character-based for UI/display) +MAX_HEADING_LENGTH = 200 # Maximum heading length in characters (UI constraint) +MAX_ANCHOR_CANDIDATE_LENGTH = ( + 100 # Maximum length for candidate anchor paragraphs (characters) +) + +# Constants for content splitting (token-based for LLM context management) +IDEAL_BLOCK_CONTENT_TOKENS = 6000 # Ideal target size for balanced splitting (tokens) +MAX_BLOCK_CONTENT_TOKENS = 8000 # Maximum block content (tokens, hard limit) +SMALL_TAIL_THRESHOLD = ( + MAX_BLOCK_CONTENT_TOKENS - IDEAL_BLOCK_CONTENT_TOKENS +) // 2 # Threshold for tail absorption (1000 tokens) + +# Constants for table splitting (token-based) +TABLE_IDEAL_TOKENS = 3000 # Ideal target size for table chunks 
(tokens) +TABLE_MAX_TOKENS = 5000 # Maximum table size before splitting (tokens); must be smaller than IDEAL_BLOCK_CONTENT_TOKENS +TABLE_MIN_LAST_CHUNK_TOKENS = int( + (TABLE_MAX_TOKENS - TABLE_IDEAL_TOKENS) * 0.8 +) # Minimum size for last chunk to avoid tiny fragments +TABLE_CHUNK_SUFFIX_LABEL = "表格片段" # Label prefix ("table fragment") for split table chunk headings + + +def print_error(title: str, details: str, solution: str): + """ + Print a friendly, formatted error message. + + Args: + title: Error title + details: Detailed error information + solution: Suggested solution steps + """ + print("\n" + "=" * 80, file=sys.stderr) + print(f"ERROR: {title}", file=sys.stderr) + print("=" * 80, file=sys.stderr) + print(f"\n{details}", file=sys.stderr) + print("\nSOLUTION:", file=sys.stderr) + print(solution, file=sys.stderr) + print("\n" + "=" * 80 + "\n", file=sys.stderr) + + +def truncate_heading(heading_text: str, para_id: str | None = None) -> str: + """ + Truncate heading if it exceeds MAX_HEADING_LENGTH. + + Args: + heading_text: The heading text to check + para_id: Optional paragraph ID for warning message + + Returns: + str: Original heading if within limit, truncated heading with "..." if too long + """ + if len(heading_text) > MAX_HEADING_LENGTH: + truncated = heading_text[: MAX_HEADING_LENGTH - 3] + "..." + location = f" (para_id: {para_id})" if para_id else "" + print( + f"Warning: Heading truncated (length {len(heading_text)} > max {MAX_HEADING_LENGTH}){location}: " + f'"{truncated}"', + file=sys.stderr, + ) + return truncated + return heading_text + + +def validate_heading_length(heading_text: str, para_id: str): + """ + Validate that heading length does not exceed MAX_HEADING_LENGTH. + + Args: + heading_text: The heading text to validate + para_id: The paragraph ID for error reporting + + Exits: + sys.exit(1) if heading exceeds maximum length + """ + if len(heading_text) > MAX_HEADING_LENGTH: + preview = ( + heading_text[:100] + "..."
if len(heading_text) > 100 else heading_text + ) + print_error( + f"Heading too long ({len(heading_text)} characters, max {MAX_HEADING_LENGTH})", + f'The following heading exceeds the maximum allowed length:\n\n "{preview}"\n\n' + f"Location: Paragraph ID {para_id}\n" + f"Actual length: {len(heading_text)} characters", + " 1. Open the document in Microsoft Word\n" + f" 2. Shorten this heading to {MAX_HEADING_LENGTH} characters or less\n" + " 3. Re-upload it to LightRAG", + ) + sys.exit(1) + + +def validate_table_tokens(table_json: str, block_heading: str): + """ + Validate that table JSON does not exceed MAX_BLOCK_CONTENT_TOKENS. + + Args: + table_json: The JSON representation of the table + block_heading: The heading of the block containing this table + + Exits: + sys.exit(1) if table exceeds maximum token limit + """ + table_tokens = estimate_tokens(table_json) + if table_tokens > MAX_BLOCK_CONTENT_TOKENS: + print_error( + f"Table too large (~{table_tokens} tokens, max {MAX_BLOCK_CONTENT_TOKENS})", + f"A table in the document is too large for LLM processing.\n\n" + f'Location: Under heading "{block_heading}"\n' + f"Table size: ~{table_tokens} tokens ({len(table_json)} characters)\n\n" + "Large tables can cause issues with file chunking.", + " 1. Open the document in Microsoft Word\n" + f' 2. Locate the table under heading "{block_heading}"\n' + " 3. Split the table into smaller tables, or\n" + " 4. Simplify the table content\n" + " 5. Re-upload it to LightRAG", + ) + sys.exit(1) + + +def find_first_valid_para_id(para_ids: list) -> str | None: + """ + Find the first valid paraId in a 2D array of paraIds. + + Args: + para_ids: 2D list of paraIds from table cells + + Returns: + First non-None paraId found, or None when every cell lacks a paraId. + Callers must tolerate ``None`` and treat it as a tracking gap rather + than a fatal error (legacy / non-Word docx authors omit ``w14:paraId`` + attributes and we want to keep parsing). 
+ """ + for row in para_ids: + for para_id in row: + if para_id: + return para_id + return None + + +def find_last_valid_para_id(para_ids: list) -> str | None: + """ + Find the last valid paraId in a 2D array of paraIds. + + Returns the last non-None paraId, falling back to the first valid one + when reverse-iteration does not yield anything (single-paraId tables), + and finally ``None`` when every cell lacks a paraId. + """ + for row in reversed(para_ids): + for para_id in reversed(row): + if para_id: + return para_id + + return find_first_valid_para_id(para_ids) + + +def _table_has_any_paraid(para_ids: list) -> bool: + """True when at least one cell in the 2D paraId grid carries an id.""" + return find_first_valid_para_id(para_ids) is not None + + +def split_table( + table_rows: list, + para_ids: list, + para_ids_end: list, + header_indices: list, + debug: bool = False, +) -> list: + """ + Split a large table into chunks at row boundaries. + + Splitting Strategy: + 1. Only split if table JSON exceeds TABLE_MAX_TOKENS (5000 tokens) + 2. Calculate target chunks based on TABLE_IDEAL_TOKENS (3000 tokens) + 3. Split at row boundaries to achieve balanced chunk sizes + 4. Avoid very small last chunk: if last chunk < TABLE_MIN_LAST_CHUNK_TOKENS (1600 tokens), merge with previous when the result stays within TABLE_MAX_TOKENS + 5. Extract first valid paraId for each chunk as UUID + + Output Strategy: + - First chunk: Merges with preceding content, uses original heading + - Middle chunks: Standalone blocks with heading suffix [1], [2], etc.
+ - Last chunk: Merges with following content, carries the cross-page + ``_table_header`` so the host block can surface it via ``table_headers`` + - The cross-page repeating header rows (extracted from ``w:tblHeader``) + flow per-table into each containing block's ``table_headers`` list + + Args: + table_rows: 2D array of table content + para_ids: 2D array of paraIds - first paraId in each cell (for uuid) + para_ids_end: 2D array of paraIds - last paraId in each cell (for uuid_end) + header_indices: List of row indices that are table headers + debug: If True, output debug information + + Returns: + List of chunk dicts: [{ + 'rows': 2D array subset, + 'para_ids': 2D array subset, + 'para_ids_end': 2D array subset, + 'uuid': first valid paraId in chunk, + 'is_first': True if first chunk, + 'is_last': True if last chunk + }, ...] + """ + import math + + # Calculate total JSON token count + total_json = json.dumps(table_rows, ensure_ascii=False) + total_tokens = estimate_tokens(total_json) + + if total_tokens <= TABLE_MAX_TOKENS: + # No splitting needed + uuid = find_first_valid_para_id(para_ids) + return [ + { + "rows": table_rows, + "para_ids": para_ids, + "para_ids_end": para_ids_end, + "uuid": uuid, + "is_first": True, + "is_last": True, + } + ] + + # Need to split - calculate target number of chunks + target_chunks = math.ceil(total_tokens / TABLE_IDEAL_TOKENS) + min_chunks_needed = math.ceil(total_tokens / TABLE_MAX_TOKENS) + target_chunks = max(target_chunks, min_chunks_needed) + + # Split at row boundaries + chunks = [] + num_rows = len(table_rows) + target_rows_per_chunk = num_rows / target_chunks + + start_row = 0 + for i in range(target_chunks): + # Calculate end row for this chunk + if i == target_chunks - 1: + # Last chunk gets all remaining rows + end_row = num_rows + else: + # Target end row (rounded) + end_row = min(int((i + 1) * target_rows_per_chunk), num_rows) + + # Adjust to avoid very small last chunk + rows_remaining = num_rows - end_row + if 
rows_remaining > 0 and rows_remaining < target_rows_per_chunk * 0.3: + # Last chunk would be too small, expand this chunk + end_row = num_rows + + # Extract chunk + chunk_rows = table_rows[start_row:end_row] + chunk_para_ids = para_ids[start_row:end_row] + chunk_para_ids_end = para_ids_end[start_row:end_row] + + if chunk_rows: + chunk_uuid = find_first_valid_para_id(chunk_para_ids) + chunks.append( + { + "rows": chunk_rows, + "para_ids": chunk_para_ids, + "para_ids_end": chunk_para_ids_end, + "uuid": chunk_uuid, + "is_first": (i == 0), + "is_last": (end_row >= num_rows), + } + ) + + start_row = end_row + if start_row >= num_rows: + break + + # Post-processing: Merge very small last chunk with previous chunk if possible + if len(chunks) >= 2: + last_chunk = chunks[-1] + last_chunk_json = json.dumps(last_chunk["rows"], ensure_ascii=False) + last_chunk_tokens = estimate_tokens(last_chunk_json) + + if last_chunk_tokens < TABLE_MIN_LAST_CHUNK_TOKENS: + # Try to merge with previous chunk + prev_chunk = chunks[-2] + + # Calculate combined size + combined_rows = prev_chunk["rows"] + last_chunk["rows"] + combined_json = json.dumps(combined_rows, ensure_ascii=False) + combined_tokens = estimate_tokens(combined_json) + + # Only merge if combined size doesn't exceed max limit + if combined_tokens <= TABLE_MAX_TOKENS: + # Merge the chunks + merged_para_ids = prev_chunk["para_ids"] + last_chunk["para_ids"] + merged_para_ids_end = ( + prev_chunk["para_ids_end"] + last_chunk["para_ids_end"] + ) + chunks[-2] = { + "rows": combined_rows, + "para_ids": merged_para_ids, + "para_ids_end": merged_para_ids_end, + "uuid": prev_chunk["uuid"], # Keep UUID of first chunk + "is_first": prev_chunk["is_first"], + "is_last": True, # This becomes the last chunk + } + chunks.pop() # Remove the last chunk + + if debug: + print( + f"[DEBUG] Merged small last chunk (~{last_chunk_tokens} tokens) with previous chunk", + file=sys.stderr, + ) + print( + f" Combined size: ~{combined_tokens} tokens", 
file=sys.stderr + ) + + return chunks + + +def split_table_with_heading( + table_rows: list, + para_ids: list, + para_ids_end: list, + header_indices: list, + current_heading: str, + start_suffix: int = 0, + debug: bool = False, +) -> list: + """ + Wrapper for split_table that includes heading information in debug output. + Supports sequential numbering when multiple tables are split in the same block. + + Args: + table_rows: 2D array of table content + para_ids: 2D array of paraIds - first paraId in each cell (for uuid) + para_ids_end: 2D array of paraIds - last paraId in each cell (for uuid_end) + header_indices: List of row indices that are table headers + current_heading: Current block heading (for generating chunk headings) + start_suffix: Starting suffix number for non-first chunks (default: 0) + When multiple tables in the same block are split, this ensures + sequential numbering (e.g., [1], [2] for first table, [3], [4] for second) + debug: If True, output debug information with headings + + Returns: + Same as split_table(), with each chunk having suffix calculated from start_suffix + """ + chunks = split_table( + table_rows, para_ids, para_ids_end, header_indices, debug=False + ) + + # Add suffix_number to each chunk for later use + for i, chunk in enumerate(chunks): + if i == 0: + chunk["suffix_number"] = None # First chunk has no suffix + else: + chunk["suffix_number"] = start_suffix + i + + # Debug output with headings + if debug and len(chunks) > 1: + print( + f"\n[DEBUG] Table split into {len(chunks)} chunks (final)", file=sys.stderr + ) + for i, chunk in enumerate(chunks): + chunk_json = json.dumps(chunk["rows"], ensure_ascii=False) + # Generate heading for this chunk + if chunk["suffix_number"] is None: + chunk_heading = current_heading + else: + chunk_heading = f"{current_heading} [{TABLE_CHUNK_SUFFIX_LABEL}{chunk['suffix_number']}]" + print( + f" Chunk {i+1}: heading=\"{chunk_heading}\", {len(chunk['rows'])} rows, {len(chunk_json)} chars", + 
file=sys.stderr, + ) + + return chunks + + +def merge_small_blocks(blocks: list, debug: bool = False) -> tuple: + """ + Merge blocks below IDEAL_BLOCK_CONTENT_TOKENS following bottom-up, level-aware strategy. + + Strategy (bottom-up approach): + 1. Process levels from deepest (largest number) to shallowest (level 1) + 2. For each level: + - Phase A: Same-level merging - merge adjacent blocks of same level + - Phase B: Cross-level absorption - allow higher levels to absorb current level + 3. Table chunk role restrictions: + - 'middle': cannot merge with any block + - 'first': can only merge forward (with next block) + - 'last': can only merge backward (with previous block) + - 'none': no restrictions + 4. Stop merging a block once it reaches IDEAL_BLOCK_CONTENT_TOKENS (locked) + 5. Reject merge if combined size > MAX_BLOCK_CONTENT_TOKENS + 6. Merged block's level = level of the block whose heading is kept + + Args: + blocks: List of block dictionaries with 'level' and 'table_chunk_role' fields + debug: If True, output debug information and return merge count + + Returns: + Tuple of (merged_blocks, merge_count) + """ + if len(blocks) <= 1: + return blocks, 0 + + merged_count = 0 + result = blocks.copy() + + # Find all unique levels and sort from deepest to shallowest + levels = sorted(set(block.get("level", 1) for block in result), reverse=True) + + if debug: + print( + f"\n[DEBUG] merge_small_blocks: Processing {len(result)} blocks across levels {levels}", + file=sys.stderr, + ) + + # Process each level from deepest to shallowest + for current_level in levels: + if debug: + print(f"[DEBUG] Processing level {current_level}", file=sys.stderr) + + # Phase A: Same-level merging + changed = True + iteration = 0 + while changed: + iteration += 1 + changed = False + i = 0 + new_result = [] + + while i < len(result): + current_block = result[i] + current_tokens = estimate_tokens(current_block["content"]) + block_level = current_block.get("level", 1) + current_role = 
current_block.get("table_chunk_role", "none") + + # Only process blocks of current level that are below IDEAL and not locked + is_below_ideal = ( + current_tokens < IDEAL_BLOCK_CONTENT_TOKENS and current_tokens > 0 + ) + is_current_level = block_level == current_level + + if is_below_ideal and is_current_level: + merged = False + + # Check table chunk role restrictions + can_merge_forward = current_role in ["none", "first"] + can_merge_backward = current_role in ["none", "last"] + + # Try forward merge with next block (only same level in Phase A) + if can_merge_forward and i + 1 < len(result): + next_block = result[i + 1] + next_level = next_block.get("level", 1) + next_role = next_block.get("table_chunk_role", "none") + next_can_merge_backward = next_role in ["none", "last"] + + # Phase A: Only merge same-level blocks + if next_level == current_level and next_can_merge_backward: + merged_content = ( + current_block["content"] + + "\n\n" + + next_block["content"] + ) + combined_tokens = estimate_tokens(merged_content) + + if combined_tokens <= MAX_BLOCK_CONTENT_TOKENS: + merged_block = { + "uuid": current_block["uuid"], + "uuid_end": next_block.get( + "uuid_end", next_block["uuid"] + ), + "heading": current_block["heading"], + "content": merged_content, + "type": "text", + "parent_headings": current_block["parent_headings"], + "level": current_level, + "table_chunk_role": "none", + } + + combined_headers = current_block.get( + "table_headers", [] + ) + next_block.get("table_headers", []) + if combined_headers: + merged_block["table_headers"] = combined_headers + + new_result.append(merged_block) + merged = True + merged_count += 1 + changed = True + i += 2 + continue + + # Try backward merge with previous (only same level in Phase A) + if not merged and can_merge_backward and len(new_result) > 0: + prev_block = new_result[-1] + prev_level = prev_block.get("level", 1) + prev_role = prev_block.get("table_chunk_role", "none") + prev_tokens = 
estimate_tokens(prev_block["content"]) + prev_can_merge_forward = prev_role in ["none", "first"] + prev_below_ideal = prev_tokens < IDEAL_BLOCK_CONTENT_TOKENS + + # Phase A: Only merge same-level blocks, and prev must be below IDEAL + if ( + prev_level == current_level + and prev_can_merge_forward + and prev_below_ideal + ): + merged_content = ( + prev_block["content"] + + "\n\n" + + current_block["content"] + ) + combined_tokens = estimate_tokens(merged_content) + + if combined_tokens <= MAX_BLOCK_CONTENT_TOKENS: + merged_block = { + "uuid": prev_block["uuid"], + "uuid_end": current_block.get( + "uuid_end", current_block["uuid"] + ), + "heading": prev_block["heading"], + "content": merged_content, + "type": "text", + "parent_headings": prev_block["parent_headings"], + "level": current_level, + "table_chunk_role": "none", + } + + combined_headers = prev_block.get( + "table_headers", [] + ) + current_block.get("table_headers", []) + if combined_headers: + merged_block["table_headers"] = combined_headers + + new_result[-1] = merged_block + merged = True + merged_count += 1 + changed = True + i += 1 + continue + + # No merge happened, keep block + if not merged: + new_result.append(current_block) + i += 1 + else: + # Current block is at or above IDEAL, or not current level + # Check for tail absorption: if remaining same-level blocks are small enough, absorb them all + if ( + is_current_level + and current_tokens >= IDEAL_BLOCK_CONTENT_TOKENS + ): + # Calculate total size of remaining same-level blocks + remaining_same_level_tokens = 0 + remaining_end_idx = i + 1 + + for j in range(i + 1, len(result)): + next_block = result[j] + next_level = next_block.get("level", 1) + + # Stop when we encounter a different level + if next_level != current_level: + break + + # Check if this block can be absorbed (table_chunk_role constraints) + next_role = next_block.get("table_chunk_role", "none") + if next_role == "middle": + # Middle chunks cannot be absorbed - stop here + break + 
+ remaining_same_level_tokens += estimate_tokens( + next_block["content"] + ) + remaining_end_idx = j + 1 + + # If remaining same-level blocks are small enough, absorb them all + if ( + remaining_same_level_tokens > 0 + and remaining_same_level_tokens < SMALL_TAIL_THRESHOLD + ): + # Check if combined size doesn't exceed MAX + combined_tokens = ( + current_tokens + remaining_same_level_tokens + ) + + if combined_tokens <= MAX_BLOCK_CONTENT_TOKENS: + # Absorb all remaining same-level blocks + absorbed_content = current_block["content"] + last_uuid_end = current_block.get( + "uuid_end", current_block["uuid"] + ) + combined_headers = list( + current_block.get("table_headers", []) + ) + + for j in range(i + 1, remaining_end_idx): + next_block = result[j] + absorbed_content += "\n\n" + next_block["content"] + last_uuid_end = next_block.get( + "uuid_end", next_block["uuid"] + ) + combined_headers.extend( + next_block.get("table_headers", []) + ) + + # Create merged block + merged_block = { + "uuid": current_block["uuid"], + "uuid_end": last_uuid_end, + "heading": current_block["heading"], + "content": absorbed_content, + "type": "text", + "parent_headings": current_block["parent_headings"], + "level": current_level, + "table_chunk_role": "none", + } + + if combined_headers: + merged_block["table_headers"] = combined_headers + + new_result.append(merged_block) + # Count absorbed blocks before advancing i, so the debug output is correct + num_absorbed = remaining_end_idx - i - 1 + merged_count += num_absorbed + changed = True + i = remaining_end_idx + + if debug: + print( + f" Tail absorption: block at IDEAL ({current_tokens} tokens) absorbed {num_absorbed} small tail blocks ({remaining_same_level_tokens} tokens)", + file=sys.stderr, + ) + + continue + + # No tail absorption, keep block as-is + new_result.append(current_block) + i += 1 + + result = new_result + + if debug and changed: + print( + f" Phase A iteration {iteration}: {merged_count} total merges", + file=sys.stderr, + ) + + # Phase B: Cross-level absorption (allow higher
levels to absorb current level) + changed = True + iteration = 0 + while changed: + iteration += 1 + changed = False + i = 0 + new_result = [] + + while i < len(result): + current_block = result[i] + current_tokens = estimate_tokens(current_block["content"]) + block_level = current_block.get("level", 1) + current_role = current_block.get("table_chunk_role", "none") + + # Only process blocks of current level that are below IDEAL + is_below_ideal = ( + current_tokens < IDEAL_BLOCK_CONTENT_TOKENS and current_tokens > 0 + ) + is_current_level = block_level == current_level + + if is_below_ideal and is_current_level: + merged = False + + can_merge_forward = current_role in ["none", "first", "last"] + can_merge_backward = current_role in ["none", "last"] + + # Try forward merge (current can absorb deeper levels) + if can_merge_forward and i + 1 < len(result): + next_block = result[i + 1] + next_level = next_block.get("level", 1) + next_role = next_block.get("table_chunk_role", "none") + next_can_merge_backward = next_role in ["none", "last"] + + # Phase B: current level can absorb deeper levels (larger numbers) + if next_level > current_level and next_can_merge_backward: + merged_content = ( + current_block["content"] + + "\n\n" + + next_block["content"] + ) + combined_tokens = estimate_tokens(merged_content) + + if combined_tokens <= MAX_BLOCK_CONTENT_TOKENS: + merged_block = { + "uuid": current_block["uuid"], + "uuid_end": next_block.get( + "uuid_end", next_block["uuid"] + ), + "heading": current_block["heading"], + "content": merged_content, + "type": "text", + "parent_headings": current_block["parent_headings"], + "level": current_level, + "table_chunk_role": "none", + } + + combined_headers = current_block.get( + "table_headers", [] + ) + next_block.get("table_headers", []) + if combined_headers: + merged_block["table_headers"] = combined_headers + + new_result.append(merged_block) + merged = True + merged_count += 1 + changed = True + i += 2 + continue + + # Try 
backward merge (higher level can absorb current) + if not merged and can_merge_backward and len(new_result) > 0: + prev_block = new_result[-1] + prev_level = prev_block.get("level", 1) + prev_role = prev_block.get("table_chunk_role", "none") + prev_tokens = estimate_tokens(prev_block["content"]) + prev_can_merge_forward = prev_role in ["none", "first", "last"] + prev_below_ideal = prev_tokens < IDEAL_BLOCK_CONTENT_TOKENS + + # Phase B: higher level (smaller number) can absorb current level + if ( + prev_level < current_level + and prev_can_merge_forward + and prev_below_ideal + ): + merged_content = ( + prev_block["content"] + + "\n\n" + + current_block["content"] + ) + combined_tokens = estimate_tokens(merged_content) + + if combined_tokens <= MAX_BLOCK_CONTENT_TOKENS: + merged_block = { + "uuid": prev_block["uuid"], + "uuid_end": current_block.get( + "uuid_end", current_block["uuid"] + ), + "heading": prev_block["heading"], + "content": merged_content, + "type": "text", + "parent_headings": prev_block["parent_headings"], + "level": prev_level, + "table_chunk_role": "none", + } + + combined_headers = prev_block.get( + "table_headers", [] + ) + current_block.get("table_headers", []) + if combined_headers: + merged_block["table_headers"] = combined_headers + + new_result[-1] = merged_block + merged = True + merged_count += 1 + changed = True + i += 1 + continue + + if not merged: + new_result.append(current_block) + i += 1 + else: + new_result.append(current_block) + i += 1 + + result = new_result + + if debug and changed: + print( + f" Phase B iteration {iteration}: {merged_count} total merges", + file=sys.stderr, + ) + + if debug: + print( + f"[DEBUG] merge_small_blocks complete: {len(result)} blocks, {merged_count} total merges", + file=sys.stderr, + ) + + # Check for oversized blocks and print debug information + oversized_blocks = [] + for idx, block in enumerate(result): + block_tokens = estimate_tokens(block["content"]) + # Only blocks above the limit are oversized; "> 0" flagged every non-empty block + if block_tokens > MAX_BLOCK_CONTENT_TOKENS:
+ oversized_blocks.append( + { + "index": idx, + "heading": block.get("heading", "(no heading)"), + "level": block.get("level", "N/A"), + "tokens": block_tokens, + "has_table_header": bool(block.get("table_headers")), + "content_preview": block["content"][:200], + } + ) + + if oversized_blocks: + print( + f"\n[WARNING] Found {len(oversized_blocks)} oversized blocks after merging:", + file=sys.stderr, + ) + for info in oversized_blocks: + print( + f" Block #{info['index']}: level={info['level']}, tokens={info['tokens']}, heading=\"{info['heading']}\"", + file=sys.stderr, + ) + + return result, merged_count + + +def split_long_block( + block_heading: str, + paragraphs: list, + parent_headings: list, + block_level: int, + debug: bool = False, +) -> list: + """ + Split a long text block into smaller blocks using anchor paragraphs. + + Strategy (improved for balanced splitting): + 1. Calculate target number of blocks based on IDEAL_BLOCK_CONTENT_TOKENS + 2. Ensure minimum blocks needed to stay under MAX_BLOCK_CONTENT_TOKENS + 3. Find all candidate anchor paragraphs (<= MAX_ANCHOR_CANDIDATE_LENGTH chars) + 4. Select anchors closest to ideal split positions for balanced distribution + 5. Create blocks using selected anchors as new headings + + Important: Tables are NOT split by this function. + - Tables are already split at row boundaries by split_table() if needed (TABLE_MAX_TOKENS limit) + - Table paragraphs (is_table=True) are excluded from anchor candidate selection + - Table content remains intact and is not re-split into smaller table chunks + - If a block contains both text and table chunks exceeding the limit, only text + paragraphs are used as split points; table chunks stay complete + + Args: + block_heading: Original heading text + paragraphs: List of dicts with 'text', 'para_id', and 'is_table' keys + parent_headings: Parent heading stack + block_level: Heading level of this block (1=Heading 1, 2=Heading 2, etc.) 
+ debug: If True, output debug information when splitting occurs + + Returns: + List of block dictionaries (may be split into multiple blocks), each with 'level' field + + Exits: + sys.exit(1) if no suitable anchor found and content exceeds limit + """ + import math + + # Check if this block starts with a split table chunk (has _chunk_heading metadata) + # If so, use that heading instead of block_heading + effective_heading = block_heading + + if paragraphs and paragraphs[0].get("_chunk_heading"): + effective_heading = paragraphs[0]["_chunk_heading"] + + # Calculate total content token count + total_content = "\n".join(p["text"] for p in paragraphs) + total_tokens = estimate_tokens(total_content) + + if total_tokens <= MAX_BLOCK_CONTENT_TOKENS: + # Within limit, return as single block + # Use first paragraph's para_id as UUID + # For uuid_end: use para_id_end if last element is a table, otherwise para_id + last_para = paragraphs[-1] if paragraphs else {} + uuid_end = last_para.get("para_id_end") or last_para.get("para_id") + + block = { + "uuid": paragraphs[0]["para_id"] if paragraphs else None, + "uuid_end": uuid_end, + "heading": effective_heading, + "content": total_content, + "type": "text", + "parent_headings": parent_headings, + "level": block_level, # Add level to block + } + + # Collect per-table cross-page headers (aligned with
tag order) + table_headers = _collect_table_headers(paragraphs) + if table_headers: + block["table_headers"] = table_headers + + return [block] + + # Content exceeds limit, need to split + # Calculate target number of blocks based on IDEAL_BLOCK_CONTENT_TOKENS + target_blocks = math.ceil(total_tokens / IDEAL_BLOCK_CONTENT_TOKENS) + + # Ensure we have enough blocks to stay under MAX_BLOCK_CONTENT_TOKENS + min_blocks_needed = math.ceil(total_tokens / MAX_BLOCK_CONTENT_TOKENS) + target_blocks = max(target_blocks, min_blocks_needed) + + # Calculate ideal token size per block + target_size = total_tokens / target_blocks + + # Find candidate anchors (short paragraphs, excluding tables and empty placeholders) + # Use character length for anchor candidate selection (UI/readability constraint) + candidates = [] + cumulative_tokens = 0 + for idx, para in enumerate(paragraphs): + if ( + not para.get("is_table", False) + and 0 < len(para["text"]) <= MAX_ANCHOR_CANDIDATE_LENGTH + ): + candidates.append( + { + "index": idx, + "text": para["text"], + "para_id": para["para_id"], + "position": cumulative_tokens, + } + ) + cumulative_tokens += estimate_tokens(para["text"]) + + if not candidates: + # No suitable anchor found + preview = ( + block_heading[:80] + "..." if len(block_heading) > 80 else block_heading + ) + print_error( + "Cannot split long block (no suitable anchor paragraphs found)", + f"A text block is too long (~{total_tokens} tokens, max {MAX_BLOCK_CONTENT_TOKENS})\n" + f"but no paragraphs <= {MAX_ANCHOR_CANDIDATE_LENGTH} characters were found to use as split points.\n\n" + f'Location: Under heading "{preview}"\n' + f"Block size: ~{total_tokens} tokens ({len(total_content)} characters)\n" + f"Number of paragraphs: {len(paragraphs)}\n" + f"Calculated target blocks: {target_blocks}", + " 1. Open the document in Microsoft Word\n" + f' 2. Locate the section under heading "{preview}"\n' + f" 3. 
Add short headings or paragraph breaks (≤{MAX_ANCHOR_CANDIDATE_LENGTH} chars) to divide the content\n" + " 4. Re-upload it to LightRAG", + ) + sys.exit(1) + + # Select anchors for splitting (target_blocks - 1 split points needed) + selected_anchors = [] + remaining_candidates = candidates.copy() + + for i in range(1, target_blocks): + if not remaining_candidates: + break + + # Calculate ideal position for this split (in tokens) + ideal_position = i * target_size + + # Find candidate closest to ideal position + best_candidate = min( + remaining_candidates, key=lambda c: abs(c["position"] - ideal_position) + ) + selected_anchors.append(best_candidate) + remaining_candidates.remove(best_candidate) + + # Sort selected anchors by index to maintain document order + selected_anchors.sort(key=lambda a: a["index"]) + + # Create blocks using selected split points + result_blocks = [] + prev_idx = 0 + current_parent_headings = parent_headings + current_block_heading = block_heading + + for anchor in selected_anchors: + split_idx = anchor["index"] + + # Create block from prev_idx to split_idx (exclusive) + block_paragraphs = paragraphs[prev_idx:split_idx] + if block_paragraphs: + block_content = "\n".join(p["text"] for p in block_paragraphs) + # For uuid_end: use para_id_end if last element is a table, otherwise para_id + last_para = block_paragraphs[-1] + block_uuid_end = last_para.get("para_id_end") or last_para.get("para_id") + new_block = { + "uuid": block_paragraphs[0][ + "para_id" + ], # UUID from first paragraph in content + "uuid_end": block_uuid_end, # UUID_end from last paragraph (or table's last cell) + "heading": current_block_heading, + "content": block_content, + "type": "text", + "parent_headings": current_parent_headings, + "_paragraphs": block_paragraphs, # Keep original paragraphs for potential re-splitting + } + new_table_headers = _collect_table_headers(block_paragraphs) + if new_table_headers: + new_block["table_headers"] = new_table_headers + 
result_blocks.append(new_block) + + # Validate anchor as new heading + validate_heading_length(anchor["text"], anchor["para_id"]) + + # Update for next block + current_block_heading = anchor["text"] + # Update parent headings: add previous heading only if not "Preface/Uncategorized" + if block_heading != "Preface/Uncategorized": + current_parent_headings = parent_headings + [block_heading] + + prev_idx = ( + split_idx # Don't skip anchor - it becomes first paragraph of next block + ) + + # Create final block with remaining paragraphs + final_paragraphs = paragraphs[prev_idx:] + if final_paragraphs: + final_content = "\n".join(p["text"] for p in final_paragraphs) + # For uuid_end: use para_id_end if last element is a table, otherwise para_id + last_final_para = final_paragraphs[-1] + final_uuid_end = last_final_para.get("para_id_end") or last_final_para.get( + "para_id" + ) + final_block = { + "uuid": final_paragraphs[0][ + "para_id" + ], # UUID from first paragraph in content + "uuid_end": final_uuid_end, # UUID_end from last paragraph (or table's last cell) + "heading": current_block_heading, + "content": final_content, + "type": "text", + "parent_headings": current_parent_headings, + "_paragraphs": final_paragraphs, # Keep original paragraphs for potential re-splitting + } + final_table_headers = _collect_table_headers(final_paragraphs) + if final_table_headers: + final_block["table_headers"] = final_table_headers + result_blocks.append(final_block) + + # Post-split validation: Check if any block still exceeds MAX_BLOCK_CONTENT_TOKENS + # If so, recursively split that block (handles sparse anchor scenarios) + validated_blocks = [] + for block in result_blocks: + block_tokens = estimate_tokens(block["content"]) + if block_tokens > MAX_BLOCK_CONTENT_TOKENS: + # This block is still too large - need to recursively split it + # Use the preserved paragraph structure + block_paragraphs = block.get("_paragraphs", []) + + if not block_paragraphs: + # Fallback: shouldn't 
happen, but handle gracefully + preview = ( + block["heading"][:80] + "..." + if len(block["heading"]) > 80 + else block["heading"] + ) + print_error( + "Cannot re-split oversized block (internal error)", + f"A block exceeded MAX_BLOCK_CONTENT_TOKENS but paragraph metadata was lost.\n\n" + f"Location: Under heading \"{preview}\"\n" + f"Block size: ~{block_tokens} tokens ({len(block['content'])} characters)", + "This is an internal error. Please report this issue.", + ) + sys.exit(1) + + # Recursively split this oversized block + # The recursive call will either find more anchors or raise an error + sub_blocks = split_long_block( + block["heading"], + block_paragraphs, + block["parent_headings"], + block_level, + debug, + ) + validated_blocks.extend(sub_blocks) + else: + # Remove internal _paragraphs field before adding to final output + block.pop("_paragraphs", None) + validated_blocks.append(block) + + # Add level to all blocks + for block in validated_blocks: + block["level"] = block_level + + # Output debug information if enabled and split occurred + if debug and len(validated_blocks) > 1: + print(f'\n[DEBUG] Block split: "{block_heading}"', file=sys.stderr) + print( + f" Original size: ~{total_tokens} tokens ({len(total_content)} characters)", + file=sys.stderr, + ) + block_tokens = [estimate_tokens(block["content"]) for block in validated_blocks] + print( + f" Final result: {len(validated_blocks)} blocks: ~{block_tokens} tokens", + file=sys.stderr, + ) + + return validated_blocks + + +def extract_para_id(para_element) -> str: + """ + Extract w14:paraId attribute from paragraph element. + + Args: + para_element: lxml paragraph element + + Returns: + 8-character hex paraId, or ``None`` when the paragraph carries no + ``w14:paraId`` attribute (legacy / non-Word docx authors). Callers + propagate the ``None`` upward — the LightRAG adapter counts these + and surfaces a single warning per document. 
+ """ + return para_element.get( + "{http://schemas.microsoft.com/office/word/2010/wordml}paraId" + ) + + +def parse_styles_outline_levels(docx_path: str) -> dict: + """ + Parse styles.xml to extract outlineLvl definitions for each style, + following style inheritance chain (basedOn). + + Args: + docx_path: Path to DOCX file + + Returns: + dict: styleId -> outlineLvl (0-8 for headings, 9 for body text) + """ + import zipfile + + try: + from defusedxml import ElementTree as ET + except ImportError: + from xml.etree import ElementTree as ET + + styles_outline = {} # styleId -> outlineLvl (directly defined) + style_based_on = {} # styleId -> parent styleId + + try: + with zipfile.ZipFile(docx_path, "r") as zf: + if "word/styles.xml" not in zf.namelist(): + return styles_outline + + tree = ET.parse(zf.open("word/styles.xml")) + root = tree.getroot() + + ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main" + + # First pass: collect outlineLvl and basedOn for all styles + for style in root.findall(f".//{{{ns}}}style"): + style_id = style.get(f"{{{ns}}}styleId") + if not style_id: + continue + + # Check for basedOn (style inheritance) + based_on = style.find(f"{{{ns}}}basedOn") + if based_on is not None: + parent_id = based_on.get(f"{{{ns}}}val") + if parent_id: + style_based_on[style_id] = parent_id + + # Check for outlineLvl in style's pPr + pPr = style.find(f"{{{ns}}}pPr") + if pPr is not None: + outline_lvl_elem = pPr.find(f"{{{ns}}}outlineLvl") + if outline_lvl_elem is not None: + level = int(outline_lvl_elem.get(f"{{{ns}}}val")) + styles_outline[style_id] = level + + # Second pass: resolve inheritance chain for styles without direct outlineLvl + def get_outline_level(style_id: str, visited: set = None) -> int: + if visited is None: + visited = set() + if style_id in visited: + return None # Prevent circular references + visited.add(style_id) + + # If this style directly defines outlineLvl, return it + if style_id in styles_outline: + return 
styles_outline[style_id] + + # Otherwise check parent style + if style_id in style_based_on: + parent_id = style_based_on[style_id] + return get_outline_level(parent_id, visited) + + return None + + # Fill in missing outlineLvl from inheritance chain + all_style_ids = set(styles_outline.keys()) | set(style_based_on.keys()) + for style_id in all_style_ids: + if style_id not in styles_outline: + level = get_outline_level(style_id) + if level is not None: + styles_outline[style_id] = level + except Exception: + # Silently ignore parsing errors + pass + + return styles_outline + + +def get_heading_level(para_element, styles_outline_map: dict) -> int: + """ + Get heading level from paragraph, checking both direct format and style. + + Priority: paragraph outlineLvl > style outlineLvl + + Args: + para_element: lxml paragraph element + styles_outline_map: dict of styleId -> outlineLvl from styles.xml + + Returns: + int: 0-8 for heading levels (0=level 1, 1=level 2, etc.), None for non-heading + """ + # 1. Check paragraph direct format + pPr = para_element.find( + "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pPr" + ) + if pPr is not None: + outline_elem = pPr.find( + "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}outlineLvl" + ) + if outline_elem is not None: + level = int( + outline_elem.get( + "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val" + ) + ) + # Only 0-8 are true heading levels (9 is body text) + if level < 9: + return level + else: + return None # Level 9 is body text + + # 2. 
Check style definition's outlineLvl + if pPr is not None: + pStyle_elem = pPr.find( + "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pStyle" + ) + if pStyle_elem is not None: + style_id = pStyle_elem.get( + "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val" + ) + if style_id and style_id in styles_outline_map: + level = styles_outline_map[style_id] + if level < 9: + return level + else: + return None + + return None + + +def extract_text_from_run( + run, + ns: dict, + drawing_context: DrawingExtractionContext = None, +) -> str: + """ + Extract text from a run element, preserving superscript/subscript with markup. + + Converts Word formatting to HTML-like tags: + - Superscript: <sup>text</sup> + - Subscript: <sub>text</sub> + - Normal text: unchanged + + Args: + run: lxml run element (w:r) + ns: XML namespace dictionary + drawing_context: Optional context forwarded to the drawing and VML placeholder extractors + + Returns: + Text string with <sup>/<sub> markup for formatted portions + """ + text = "" + + # Check for vertAlign in rPr (superscript/subscript) + vert_align = None + rPr = run.find("w:rPr", ns) + if rPr is not None: + vert_elem = rPr.find("w:vertAlign", ns) + if vert_elem is not None: + vert_align = vert_elem.get( + "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val" + ) + + # Extract text content from run children + for child in run: + tag = child.tag.split("}")[-1] # Remove namespace + if tag == "t" and child.text: + text += child.text + elif tag == "tab": + text += "\t" + elif tag == "br": + # Handle line breaks - textWrapping or no type = soft line break + br_type = child.get( + "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}type" + ) + if br_type in (None, "textWrapping"): + text += "\n" + # Skip page and column breaks (layout elements) + elif tag == "drawing": + text += extract_drawing_placeholder_from_element( + child, + context=drawing_context, + include_extended_attrs=True, + ) + elif tag in ("pict", "object"): + text += extract_vml_image_placeholder_from_element( + child, + context=drawing_context, + 
include_extended_attrs=True, + ) + + # Apply superscript/subscript markup if needed + if text and vert_align == "superscript": + return f"<sup>{text}</sup>" + elif text and vert_align == "subscript": + return f"<sub>{text}</sub>" + + return text + + +def extract_paragraph_content( + element, + ns, + drawing_context: DrawingExtractionContext = None, +) -> str: + """ + Extract text and equations from a paragraph element in document order. + + Handles w:r (text runs), m:oMath (inline equations), and m:oMathPara + (block equations). Recurses into container elements (e.g., w:hyperlink, + w:ins, w:sdt, w:fldSimple, w:smartTag) to avoid dropping content. + + Args: + element: lxml paragraph element (w:p) + ns: XML namespace dictionary + + Returns: + Text string with equations wrapped in tags + """ + parts = [] + + def append_from(node) -> None: + tag = node.tag.split("}")[-1] + # Skip deleted content (w:del) and moved-from content (w:moveFrom) in tracked changes + # to maintain consistency with w:delText handling + if tag in ("del", "moveFrom"): + return + if tag == "r": + parts.append( + extract_text_from_run(node, ns, drawing_context=drawing_context) + ) + return + if tag == "oMath": + from .omml import convert_omml_to_latex + + latex = convert_omml_to_latex(node) + if latex: + parts.append(f"{latex}") + return + if tag == "oMathPara": + from .omml import convert_omml_to_latex + + for omath in node: + if omath.tag.split("}")[-1] == "oMath": + latex = convert_omml_to_latex(omath) + if latex: + parts.append(f"{latex}") + return + for child in node: + append_from(child) + + for child in element: + append_from(child) + + return "".join(parts) + + +def _collect_table_headers(paragraphs: list) -> list: + """Collect per-table cross-page header rows from ``is_table`` paragraphs. + + The returned list is aligned 1:1 with the order of ``
`` placeholder + tags emitted into the block's content; entries are either the list of + header rows captured from ``w:tblHeader`` or ``None`` when the table has + no cross-page repeating header. + """ + return [p.get("_table_header") for p in paragraphs if p.get("is_table")] + + +def _build_unsplit_block( + heading: str, paragraphs: list, parent_headings: list, level: int +) -> dict: + """Build a single block from paragraphs without size-based splitting.""" + last_para = paragraphs[-1] + block = { + "uuid": paragraphs[0]["para_id"], + "uuid_end": last_para.get("para_id_end") or last_para.get("para_id"), + "heading": heading, + "content": "\n".join(p["text"] for p in paragraphs), + "type": "text", + "parent_headings": parent_headings, + "level": level, + } + table_headers = _collect_table_headers(paragraphs) + if table_headers: + block["table_headers"] = table_headers + return block + + +def _flush_current_block( + blocks: list, + heading: str, + paragraphs: list, + parent_headings: list, + level: int, + fixlevel: int, + debug: bool, +) -> None: + """ + Flush accumulated paragraphs into blocks, respecting fixlevel mode. + + In default mode (fixlevel is None), runs split_long_block for token-based splitting. + In fixlevel mode, emits a single unsplit block and warns when size exceeds the limit. + """ + if not paragraphs: + return + + if fixlevel is None: + blocks.extend( + split_long_block(heading, paragraphs, parent_headings, level, debug) + ) + return + + block = _build_unsplit_block(heading, paragraphs, parent_headings, level) + block_tokens = estimate_tokens(block["content"]) + if block_tokens > MAX_BLOCK_CONTENT_TOKENS: + preview = heading[:80] + "..." if len(heading) > 80 else heading + print( + f"Warning: fixlevel block exceeds {MAX_BLOCK_CONTENT_TOKENS} tokens " + f'(~{block_tokens} tokens) under heading "{preview}". 
' + f"Consider increasing --fixlevel=N or removing --fixlevel for automatic splitting.", + file=sys.stderr, + ) + blocks.append(block) + + +def extract_docx_blocks( + file_path: str, + debug: bool = False, + fixlevel: int = None, + drawing_context: DrawingExtractionContext = None, + parse_warnings: dict | None = None, + parse_metadata: dict | None = None, +) -> list: + """ + Extract text blocks (chunks) from a DOCX file for chunking later. + + Uses python-docx with custom numbering resolver to: + 1. Capture automatic numbering (list labels) + 2. Split document by headings + 3. Convert tables to JSON (2D array) + 4. Validate heading lengths and table sizes + 5. Split long blocks using anchor paragraphs + 6. Preserve superscript/subscript formatting with <sup>/<sub> markup + + Args: + file_path: Path to the DOCX file + debug: If True, output debug information when splitting blocks + fixlevel: If specified, disable smart splitting/merging and only split at heading levels <= fixlevel + (0 = split at all heading levels, 1 = Heading 1 only, 2 = Heading 1-2, etc.) + drawing_context: Optional context forwarded to paragraph and table extraction + for drawing placeholders. + parse_warnings: Optional out-dict that this function mutates with + non-fatal warnings observed during parsing. Currently used for + ``missing_paraid_count``, which is incremented once per body-level + paragraph (heading or text) that lacks a ``w14:paraId`` and once + per table whose every cell lacks one. Callers (the LightRAG + adapter / debug CLI) read this to surface a one-line warning per + document instead of crashing. + parse_metadata: Optional out-dict that this function mutates with + document-level metadata derived during parsing. Currently used + for ``first_heading``: the text of the first heading encountered + in document order (regardless of level). Used by the LightRAG + adapter to populate ``meta.doc_title`` in ``.blocks.jsonl``. 
+ + Returns: + List of block dictionaries with heading, content, type, and metadata + """ + doc = Document(file_path) + resolver = NumberingResolver(file_path) + styles_outline = parse_styles_outline_levels(file_path) + + blocks = [] + current_heading = "Preface/Uncategorized" + current_heading_level = 1 # Default level for "Preface/Uncategorized" + current_heading_stack = {} # {level: heading_text} - Use dict to correctly track heading hierarchy + current_parent_headings = [] # Parent headings for current block + current_paragraphs = [] # Track paragraphs with metadata for splitting + has_body_content = ( + False # Track if current block has body content (non-heading paragraphs/tables) + ) + matched_fixlevel_heading = False # Track whether --fixlevel matched any heading + table_split_counter = ( + 0 # Track cumulative table split suffix numbers within current block + ) + first_heading_recorded = ( + False # Track whether the document's first heading has been captured + ) + + # Iterate through document body elements (paragraphs and tables) + body = doc._element.body + + for element in body: + tag = element.tag.split("}")[-1] # Remove namespace + + if tag == "sectPr": # Document-level section break + resolver.reset_tracking_state() + continue + + if tag == "p": # Paragraph + # Get paragraph text with superscript/subscript markup and equations + para_text = "" + ns = { + "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main", + "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing", + "m": "http://schemas.openxmlformats.org/officeDocument/2006/math", + } + para_text = extract_paragraph_content( + element, + ns, + drawing_context=drawing_context, + ) + + para_text = para_text.strip() + if not para_text: + continue + + # Get numbering label using our resolver + label = resolver.get_label(element) + full_text = f"{label} {para_text}".strip() if label else para_text + + # Check if this is a heading using the new function + 
outline_level = get_heading_level(element, styles_outline) + + if outline_level is not None: + # This is a heading (outline level 0-8) + # Convert 0-based to 1-based level + level = outline_level + 1 + + # In fixlevel mode, check if this heading should trigger a block split + should_split = True + if fixlevel is not None and fixlevel > 0: + # If fixlevel is specified and > 0, only split at levels <= fixlevel + should_split = level <= fixlevel + + # Extract paraId for this heading + heading_para_id = extract_para_id(element) + if parse_warnings is not None and not heading_para_id: + parse_warnings["missing_paraid_count"] = ( + parse_warnings.get("missing_paraid_count", 0) + 1 + ) + + # Validate heading length + validate_heading_length(full_text, heading_para_id) + + # Truncate heading if needed before storing + truncated_text = truncate_heading(full_text, heading_para_id) + + # Record the document's first heading (any level) for meta.doc_title. + if not first_heading_recorded: + if parse_metadata is not None: + parse_metadata["first_heading"] = truncated_text + first_heading_recorded = True + + if should_split: + if fixlevel is not None and fixlevel > 0: + matched_fixlevel_heading = True + + # This heading triggers a block split + # Only save previous block if it has body content + if has_body_content and current_paragraphs: + _flush_current_block( + blocks, + current_heading, + current_paragraphs, + current_parent_headings, + current_heading_level, + fixlevel, + debug, + ) + + # Reset for new block + current_paragraphs = [] + has_body_content = False + table_split_counter = ( + 0 # Reset table split counter for new heading + ) + + # Add heading to current_paragraphs + current_paragraphs.append( + { + "text": truncated_text, + "para_id": heading_para_id, + "is_table": False, + } + ) + + # Update current_heading and parent_headings for the FIRST heading in a block + # (when current_paragraphs just had this heading added as its first element) + if 
len(current_paragraphs) == 1: + current_heading = truncated_text + current_heading_level = ( + level # Only set level when setting heading + ) + # Parent headings = all headings from levels strictly less than current level + # Sort by level to maintain hierarchy order + current_parent_headings = [ + current_heading_stack[lvl] + for lvl in sorted(current_heading_stack.keys()) + if lvl < level + ] + + # Update heading stack: remove current level and all lower levels, then add current + current_heading_stack = { + k: v for k, v in current_heading_stack.items() if k < level + } + current_heading_stack[level] = truncated_text + else: + # This heading doesn't trigger split - treat as regular paragraph + para_id = heading_para_id + + # Store as regular paragraph with metadata + current_paragraphs.append( + {"text": truncated_text, "para_id": para_id, "is_table": False} + ) + + # Mark that we have body content + has_body_content = True + else: + # Regular paragraph content + para_id = extract_para_id(element) + if parse_warnings is not None and not para_id: + parse_warnings["missing_paraid_count"] = ( + parse_warnings.get("missing_paraid_count", 0) + 1 + ) + + # Store paragraph with metadata for potential splitting + current_paragraphs.append( + {"text": full_text, "para_id": para_id, "is_table": False} + ) + + # Mark that we have body content + has_body_content = True + + # Check for paragraph-level section break (after processing paragraph) + # sectPr in pPr means this paragraph ends a section + pPr = element.find( + "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}pPr" + ) + if pPr is not None: + sectPr = pPr.find( + "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}sectPr" + ) + if sectPr is not None: + # Section break after this paragraph - reset tracking + resolver.reset_tracking_state() + + elif tag == "tbl": # Table + # Reset numbering tracking before table (table start boundary) + resolver.reset_tracking_state() + + # Directly create 
Table object from XML element to avoid index mismatch + # (doc.tables may have different order due to nested tables) + from docx.table import Table + + table = Table(element, doc) + table_metadata = TableExtractor.extract_with_metadata( + table, + numbering_resolver=resolver, + drawing_context=drawing_context, + ) + + table_rows = table_metadata["rows"] + para_ids = table_metadata["para_ids"] + para_ids_end = table_metadata["para_ids_end"] # Last paraId in each cell + header_indices = table_metadata["header_indices"] + + # Count tables whose cells carry no w14:paraId. Legacy / non-Word + # docx authors omit these attributes; we no longer fail-fast, but + # the adapter surfaces a single warning so the user knows the edit + # range hints will be missing for these tables. + if parse_warnings is not None and not _table_has_any_paraid(para_ids): + parse_warnings["missing_paraid_count"] = ( + parse_warnings.get("missing_paraid_count", 0) + 1 + ) + + # Convert table to JSON and estimate token count + table_json = json.dumps(table_rows, ensure_ascii=False) + table_tokens = estimate_tokens(table_json) + + # Extract cross-page repeating header rows (w:tblHeader) once per + # table so both split and unsplit branches can surface them to the + # sidecar via the block-level ``table_headers`` list. 
+ header_rows = [] + if header_indices: + header_rows = [ + table_rows[idx] for idx in header_indices if idx < len(table_rows) + ] + header_rows_or_none = header_rows if header_rows else None + + # Check if table needs splitting (disabled in fixlevel mode) + if fixlevel is None and table_tokens > TABLE_MAX_TOKENS: + # Table exceeds limit - split it + # Pass table_split_counter to ensure sequential numbering across multiple tables + table_chunks = split_table_with_heading( + table_rows, + para_ids, + para_ids_end, + header_indices, + current_heading, + table_split_counter, + debug, + ) + + for chunk_idx, chunk in enumerate(table_chunks): + chunk_json = json.dumps(chunk["rows"], ensure_ascii=False) + # Get uuid_end from last valid paraId in chunk (use para_ids_end for last cell's last paragraph) + chunk_para_id_end = find_last_valid_para_id(chunk["para_ids_end"]) + + if chunk["is_first"]: + # First chunk: add to current_paragraphs (will merge with preceding content) + current_paragraphs.append( + { + "text": f"
{chunk_json}
", + "para_id": chunk["uuid"], + "para_id_end": chunk_para_id_end, # Store end paraId for uuid_end calculation + "is_table": True, + "_table_header": header_rows_or_none, + } + ) + has_body_content = True + else: + # Middle or last chunk: save current block first + if current_paragraphs: + _flush_current_block( + blocks, + current_heading, + current_paragraphs, + current_parent_headings, + current_heading_level, + fixlevel, + debug, + ) + current_paragraphs = [] + has_body_content = False + + # Generate heading using suffix_number from chunk + if chunk["suffix_number"] is not None: + chunk_heading = f"{current_heading} [{TABLE_CHUNK_SUFFIX_LABEL}{chunk['suffix_number']}]" + else: + chunk_heading = current_heading + + # Build block for this table chunk + # Get uuid_end from last valid paraId in chunk (use para_ids_end for last cell's last paragraph) + chunk_uuid_end = find_last_valid_para_id(chunk["para_ids_end"]) + + # Determine table_chunk_role based on chunk position + if chunk["is_first"] and chunk["is_last"]: + table_chunk_role = "none" # Not split + elif chunk["is_first"]: + table_chunk_role = "first" + elif chunk["is_last"]: + table_chunk_role = "last" + else: + table_chunk_role = "middle" + + chunk_block = { + "uuid": chunk["uuid"], + "uuid_end": chunk_uuid_end, + "heading": chunk_heading, + "content": f"{chunk_json}
", + "type": "text", + "parent_headings": current_parent_headings, + "level": current_heading_level, + "table_chunk_role": table_chunk_role, + } + + # Always emit a per-table headers list (aligned with the + # single placeholder in this standalone block); + # the entry is None when the table has no cross-page + # repeating header so downstream counters stay aligned. + chunk_block["table_headers"] = [header_rows_or_none] + + if chunk["is_last"]: + # Last chunk: add to current_paragraphs for merging with following content + current_paragraphs.append( + { + "text": f"
{chunk_json}
", + "para_id": chunk["uuid"], + "para_id_end": chunk_para_id_end, # Store end paraId for uuid_end calculation + "is_table": True, + "_chunk_heading": chunk_heading, + "_table_header": header_rows_or_none, + } + ) + has_body_content = True + else: + # Middle chunk: output immediately as standalone block + blocks.append(chunk_block) + + # Update table_split_counter: add number of non-first chunks + # (first chunk doesn't get a suffix, so we count from second chunk onwards) + table_split_counter += len(table_chunks) - 1 + else: + # Table is within size limit - no splitting needed + # Store table as a paragraph with special marker + # Use first valid paraId from table, and last valid paraId (from para_ids_end) for uuid_end + table_para_id = find_first_valid_para_id(para_ids) + table_para_id_end = find_last_valid_para_id(para_ids_end) + current_paragraphs.append( + { + "text": f"{table_json}
", + "para_id": table_para_id, + "para_id_end": table_para_id_end, # Store end paraId for uuid_end calculation + "is_table": True, + "_table_header": header_rows_or_none, + } + ) + + # Mark that we have body content + has_body_content = True + + # Reset numbering tracking after table (table end boundary) + resolver.reset_tracking_state() + + # Save final block (respecting fixlevel mode) + _flush_current_block( + blocks, + current_heading, + current_paragraphs, + current_parent_headings, + current_heading_level, + fixlevel, + debug, + ) + + # Add table_chunk_role="none" to all blocks that don't have it (non-table or unsplit table blocks) + for block in blocks: + if "table_chunk_role" not in block: + block["table_chunk_role"] = "none" + + # Perform small block merging (unified merging after all splits) + # Disabled in fixlevel mode + if fixlevel is None: + if debug: + print(f"\n[DEBUG] Before merging: {len(blocks)} blocks", file=sys.stderr) + + merged_blocks, merge_count = merge_small_blocks(blocks, debug) + + if debug and merge_count > 0: + print( + f"[DEBUG] After merging: {len(merged_blocks)} blocks ({merge_count} merges performed)", + file=sys.stderr, + ) + + return merged_blocks + + # Fixed level mode: skip merging, but warn if no heading matched the requested level + if fixlevel > 0 and not matched_fixlevel_heading: + print( + f"Warning: --fixlevel={fixlevel} produced {len(blocks)} block(s). " + f"Document may not have heading levels <= {fixlevel}. 
" + f"Try a higher --fixlevel value or remove the flag.", + file=sys.stderr, + ) + return blocks diff --git a/lightrag/native_parser/docx/table_extractor.py b/lightrag/native_parser/docx/table_extractor.py new file mode 100644 index 0000000000..5fa123c9d4 --- /dev/null +++ b/lightrag/native_parser/docx/table_extractor.py @@ -0,0 +1,405 @@ +#!/usr/bin/env python3 +""" +ABOUTME: Extracts tables from DOCX with proper merged cell handling +ABOUTME: Vertically merged cells: content repeated in all rows with shared paraId +ABOUTME: Horizontally merged cells: content in first cell only +ABOUTME: Preserves superscript/subscript formatting with / markup +""" + +from docx.table import Table +from docx.oxml.ns import qn +from typing import List + +from .drawing_image_extractor import ( + DrawingExtractionContext, + extract_drawing_placeholder_from_element, + extract_vml_image_placeholder_from_element, +) + + +def extract_text_from_run_table( + run_elem, + qn_func, + drawing_context: DrawingExtractionContext = None, +) -> str: + """ + Extract text from a run element in table cell, preserving superscript/subscript with markup. 
+ + Converts Word formatting to HTML-like tags: + - Superscript: <sup>text</sup> + - Subscript: <sub>text</sub> + - Normal text: unchanged + + Args: + run_elem: lxml run element (w:r) + qn_func: qn function for namespace handling + + Returns: + Text string with <sup>/<sub> markup for formatted portions + """ + text = "" + + # Check for vertAlign in rPr (superscript/subscript) + vert_align = None + rPr = run_elem.find(qn_func("w:rPr")) + if rPr is not None: + vert_elem = rPr.find(qn_func("w:vertAlign")) + if vert_elem is not None: + vert_align = vert_elem.get(qn_func("w:val")) + + # Extract text content from run children + for child in run_elem: + tag = child.tag.split("}")[-1] # Remove namespace + if tag == "t" and child.text: + text += child.text + elif tag == "tab": + text += "\t" + elif tag == "br": + # Handle line breaks - textWrapping or no type = soft line break + br_type = child.get(qn_func("w:type")) + if br_type in (None, "textWrapping"): + text += "\n" + # Skip page and column breaks (layout elements) + elif tag == "drawing": + text += extract_drawing_placeholder_from_element( + child, + context=drawing_context, + include_extended_attrs=True, + ) + elif tag in ("pict", "object"): + text += extract_vml_image_placeholder_from_element( + child, + context=drawing_context, + include_extended_attrs=True, + ) + + # Apply superscript/subscript markup if needed + if text and vert_align == "superscript": + return f"<sup>{text}</sup>" + elif text and vert_align == "subscript": + return f"<sub>{text}</sub>" + + return text + + +def extract_paragraph_content_table( + para_elem, + qn_func, + drawing_context: DrawingExtractionContext = None, +) -> str: + """ + Extract text and equations from a table cell paragraph in document order. + + Handles w:r (text runs), m:oMath (inline equations), and m:oMathPara + (block equations). Recurses into container elements (e.g., w:hyperlink, + w:ins, w:sdt, w:fldSimple, w:smartTag) to avoid dropping content.
+ + Args: + para_elem: lxml paragraph element (w:p) + qn_func: qn function for namespace handling + + Returns: + Text string with equations wrapped in tags + """ + parts = [] + + def append_from(node) -> None: + tag = node.tag.split("}")[-1] + # Skip deleted content (w:del) and moved-from content (w:moveFrom) in tracked changes + # to maintain consistency with w:delText handling + if tag in ("del", "moveFrom"): + return + if tag == "r": + parts.append( + extract_text_from_run_table( + node, + qn_func, + drawing_context=drawing_context, + ) + ) + return + if tag == "oMath": + from omml import convert_omml_to_latex + + latex = convert_omml_to_latex(node) + if latex: + parts.append(f"{latex}") + return + if tag == "oMathPara": + from omml import convert_omml_to_latex + + for omath in node: + if omath.tag.split("}")[-1] == "oMath": + latex = convert_omml_to_latex(omath) + if latex: + parts.append(f"{latex}") + return + for child in node: + append_from(child) + + for child in para_elem: + append_from(child) + + return "".join(parts) + + +class TableExtractor: + """ + Extract table content handling merged cells correctly. + + Merged cells in DOCX: + - Horizontal: w:gridSpan specifies how many columns cell spans + - Vertical: w:vMerge with val="restart" starts merge, subsequent cells continue + + Output format: + - 2D list of strings + - Vertically merged cells: content repeated in all rows, all rows use the same paraId (from start cell) + - Horizontally merged cells: content in left-most position only, other positions empty + """ + + @staticmethod + def extract( + table: Table, + numbering_resolver=None, + drawing_context: DrawingExtractionContext = None, + ) -> List[List[str]]: + """ + Extract table to 2D string array. 
+ + Args: + table: python-docx Table object + numbering_resolver: Optional NumberingResolver for extracting numbering + + Returns: + List of rows, each row is list of cell text strings + """ + result = TableExtractor.extract_with_metadata( + table, + numbering_resolver=numbering_resolver, + drawing_context=drawing_context, + ) + return result["rows"] + + @staticmethod + def extract_with_metadata( + table: Table, + numbering_resolver=None, + drawing_context: DrawingExtractionContext = None, + ) -> dict: + """ + Extract table to 2D string array with metadata (paraIds, header info). + + Vertical merge behavior: + - All rows in a vertically merged region share the same content + - All rows use the paraId from the merge start cell (for precise edit targeting) + + Args: + table: python-docx Table object + numbering_resolver: Optional NumberingResolver for extracting numbering + + Returns: + Dict with: + - rows: 2D list of cell text strings + - para_ids: 2D list of paraIds (first paraId in each cell, or None) + For vertically merged cells, all rows share the start cell's paraId + - para_ids_end: 2D list of paraIds (last paraId in each cell, or None) + For vertically merged cells, all rows share the start cell's paraId + - header_indices: List of row indices marked as table headers + """ + tbl = table._tbl + + # Get number of columns from tblGrid + tbl_grid = tbl.find(qn("w:tblGrid")) + num_cols = 0 + if tbl_grid is not None: + num_cols = len(tbl_grid.findall(qn("w:gridCol"))) + + if num_cols == 0: + return { + "rows": [], + "para_ids": [], + "para_ids_end": [], + "header_indices": [], + } + + # Detect header rows using w:tblHeader attribute + header_indices = [] + for idx, tr in enumerate(tbl.findall(qn("w:tr"))): + trPr = tr.find(qn("w:trPr")) + if trPr is not None: + tbl_header = trPr.find(qn("w:tblHeader")) + if tbl_header is not None: + header_indices.append(idx) + + # Process each row by directly iterating elements + grid = [] + para_ids_grid = [] + para_ids_end_grid 
= [] # Track last paraId in each cell + vmerge_content = {} # Track vertical merge by column: {col: {'text': str, 'para_id': str, 'para_id_end': str}} + + for tr in tbl.findall(qn("w:tr")): + row_data = [""] * num_cols # Pre-fill with empty strings + row_para_ids = [None] * num_cols # Pre-fill with None + row_para_ids_end = [None] * num_cols # Pre-fill with None for last paraId + grid_col = 0 + + # Iterate actual elements (each physical cell appears once) + for tc in tr.findall(qn("w:tc")): + # Reset numbering state when cell changes to prevent incorrect continuation + if numbering_resolver is not None: + numbering_resolver.reset_tracking_state() + + tcPr = tc.find(qn("w:tcPr")) + + # Check gridSpan (horizontal merge) + grid_span = 1 + if tcPr is not None: + gs = tcPr.find(qn("w:gridSpan")) + if gs is not None: + grid_span = int(gs.get(qn("w:val"))) + + # Check vMerge (vertical merge) + vmerge_elem = None + vmerge_val = None + if tcPr is not None: + vmerge_elem = tcPr.find(qn("w:vMerge")) + if vmerge_elem is not None: + vmerge_val = vmerge_elem.get( + qn("w:val") + ) # 'restart' or None (means 'continue') + + # Determine vMerge status + is_vmerge_restart = vmerge_elem is not None and vmerge_val == "restart" + is_vmerge_continue = vmerge_elem is not None and vmerge_val in ( + None, + "continue", + ) + is_normal_cell = vmerge_elem is None + + cell_text = "" + cell_para_id = None + cell_para_id_end = None # Track last paraId in cell + + # Handle different vMerge cases + if is_vmerge_restart or is_normal_cell: + # Extract content for restart or normal cells + # Get cell text with numbering support and format preservation + if numbering_resolver is not None: + # Extract text with numbering labels and superscript/subscript markup + cell_paragraphs = [] + for para_elem in tc.findall(qn("w:p")): + # Capture paraId from each paragraph + para_id_attr = para_elem.get( + "{http://schemas.microsoft.com/office/word/2010/wordml}paraId" + ) + if para_id_attr: + if cell_para_id is 
None: + cell_para_id = para_id_attr # First paraId + cell_para_id_end = ( + para_id_attr # Always update to get last + ) + + # Get text content with format preservation (superscript/subscript/equations) + para_text = extract_paragraph_content_table( + para_elem, + qn, + drawing_context=drawing_context, + ) + + # Get numbering label + label = numbering_resolver.get_label(para_elem) + + # Combine label and text + if label: + full_text = f"{label} {para_text}".strip() + else: + full_text = para_text.strip() + + if full_text: + cell_paragraphs.append(full_text) + + cell_text = "\n".join(cell_paragraphs).replace("\x07", "") + else: + # Fallback to simple text extraction with format preservation + # Cannot use cell.text here, must extract from XML + para_texts = [] + for para_elem in tc.findall(qn("w:p")): + # Capture paraId from each paragraph + para_id_attr = para_elem.get( + "{http://schemas.microsoft.com/office/word/2010/wordml}paraId" + ) + if para_id_attr: + if cell_para_id is None: + cell_para_id = para_id_attr # First paraId + cell_para_id_end = ( + para_id_attr # Always update to get last + ) + + # Extract text with format preservation (superscript/subscript/equations) + para_text = extract_paragraph_content_table( + para_elem, + qn, + drawing_context=drawing_context, + ) + + if para_text: + para_texts.append(para_text.strip()) + cell_text = "\n".join(para_texts).replace("\x07", "") + + # Store content and paraIds for vMerge restart + if is_vmerge_restart: + vmerge_content[grid_col] = { + "text": cell_text, + "para_id": cell_para_id, + "para_id_end": cell_para_id_end, + } + elif is_normal_cell: + # For normal cells: if empty and we have active vMerge, copy all from start + # If non-empty, this ends the vMerge region + if not cell_text and grid_col in vmerge_content: + # Empty cell in vMerge region - copy content and paraIds from start + cell_text = vmerge_content[grid_col]["text"] + cell_para_id = vmerge_content[grid_col]["para_id"] + cell_para_id_end = 
vmerge_content[grid_col]["para_id_end"] + elif cell_text: + # Non-empty cell - this ends the vMerge for this column + vmerge_content.pop(grid_col, None) + + elif is_vmerge_continue: + # Copy content and para_id from previous merge start + # But extract actual para_id_end from this continue cell for range boundary + if grid_col in vmerge_content: + cell_text = vmerge_content[grid_col]["text"] + cell_para_id = vmerge_content[grid_col][ + "para_id" + ] # Use restart's paraId for edit targeting + + # Extract actual paraId from this continue cell for uuid_end (range boundary) + for para_elem in tc.findall(qn("w:p")): + para_id_attr = para_elem.get( + "{http://schemas.microsoft.com/office/word/2010/wordml}paraId" + ) + if para_id_attr: + cell_para_id_end = ( + para_id_attr # Use actual paraId for range boundary + ) + + # Place content at starting grid position only + if grid_col < num_cols: + row_data[grid_col] = cell_text + row_para_ids[grid_col] = cell_para_id + row_para_ids_end[grid_col] = cell_para_id_end + + # Move grid position by gridSpan + grid_col += grid_span + + grid.append(row_data) + para_ids_grid.append(row_para_ids) + para_ids_end_grid.append(row_para_ids_end) + + return { + "rows": grid, + "para_ids": para_ids_grid, + "para_ids_end": para_ids_end_grid, + "header_indices": header_indices, + } diff --git a/lightrag/native_parser/docx/utils.py b/lightrag/native_parser/docx/utils.py new file mode 100644 index 0000000000..649534790d --- /dev/null +++ b/lightrag/native_parser/docx/utils.py @@ -0,0 +1,791 @@ +#!/usr/bin/env python3 +""" +ABOUTME: Shared token estimation utilities for audit scripts +ABOUTME: XML sanitization helpers for document processing +""" + +import json +import os +import re + +try: + from google import genai + from google.genai import types + + HAS_GEMINI = True +except ImportError: # pragma: no cover - optional dependency + genai = None + types = None + HAS_GEMINI = False + +try: + import openai + + HAS_OPENAI = True +except ImportError: 
# pragma: no cover - optional dependency + openai = None + HAS_OPENAI = False + + +def estimate_tokens(text: str) -> int: + """ + Estimate token count for LLM context management. + + Uses a weighted formula based on character types: + - Chinese characters: ~0.75 tokens per character (subword tokenization) + - JSON structural characters (brackets, quotes, commas): ~1 token per character + - Other characters (English, numbers, symbols): ~0.4 tokens per character (~2.5 chars/token) + + Includes a 5% buffer and a fixed 2-token safety offset for special formatting and system prompt overhead. + + Args: + text: Input text to estimate tokens for + + Returns: + int: Estimated token count + """ + if not text: + return 0 + + chinese_count = len(re.findall(r"[\u4e00-\u9fa5]", text)) + json_chars_count = len(re.findall(r'[\[\]",{}]', text)) + other_count = len(text) - chinese_count - json_chars_count + + base_estimate = ( + (chinese_count * 0.75) + (json_chars_count * 1) + (other_count * 0.4) + ) + final_tokens = int(base_estimate * 1.05) + 2 + return final_tokens + + +def sanitize_xml_string(text: str) -> str: + """ + Remove control characters that are illegal in XML 1.0. + + XML 1.0 allows: #x9 (tab), #xA (LF), #xD (CR), and #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF + This function removes all other control characters (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F). + + Args: + text: Text that may contain control characters + + Returns: + Sanitized text safe for XML. Returns the input unchanged if it is empty or not a string. + """ + if not text or not isinstance(text, str): + return text + # Build a translation table to remove illegal control characters + # Keep: \t (0x09), \n (0x0A), \r (0x0D) + # Remove: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F + illegal_chars = "".join(chr(c) for c in range(0x20) if c not in (0x09, 0x0A, 0x0D)) + return text.translate(str.maketrans("", "", illegal_chars)) + + +def is_vertex_ai_mode() -> bool: + """ + Check if Vertex AI mode is enabled via environment variable.
+ + Returns: + True if GOOGLE_GENAI_USE_VERTEXAI is set to 'true', False otherwise + """ + return os.getenv("GOOGLE_GENAI_USE_VERTEXAI", "").lower() == "true" + + +def create_gemini_client(use_async: bool = False): + """ + Create Gemini client for AI Studio or Vertex AI. + + Supports two modes: + - AI Studio (default): Uses GOOGLE_API_KEY for authentication + - Vertex AI: Uses ADC (GOOGLE_APPLICATION_CREDENTIALS or gcloud auth) + + Environment variables for Vertex AI mode: + - GOOGLE_GENAI_USE_VERTEXAI: Set to 'true' to enable Vertex AI mode + - GOOGLE_CLOUD_PROJECT: Required GCP project ID + - GOOGLE_CLOUD_LOCATION: Optional region (default: us-central1) + - GOOGLE_VERTEX_BASE_URL: Optional custom API endpoint (for API gateway proxies) + - GOOGLE_APPLICATION_CREDENTIALS: Path to service account JSON (or use gcloud auth) + + Args: + use_async: If True, return the async client (.aio), otherwise return sync client + + Returns: + Gemini client instance (sync or async based on use_async parameter) + + Raises: + ValueError: If required environment variables are not set + """ + use_vertex = is_vertex_ai_mode() + + if use_vertex: + # Vertex AI mode - uses ADC (GOOGLE_APPLICATION_CREDENTIALS or gcloud auth) + project = os.getenv("GOOGLE_CLOUD_PROJECT") + location = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1") + base_url = os.getenv("GOOGLE_VERTEX_BASE_URL") + + if not project: + raise ValueError( + "GOOGLE_CLOUD_PROJECT is required for Vertex AI mode. " + "Set GOOGLE_GENAI_USE_VERTEXAI=false to use AI Studio mode instead." 
+ ) + + # Build http_options only if custom base_url is specified + http_options = None + if base_url: + http_options = {"base_url": base_url} + + # Note: ADC handles authentication automatically + # via GOOGLE_APPLICATION_CREDENTIALS env var or gcloud auth + client = genai.Client( + vertexai=True, project=project, location=location, http_options=http_options + ) + else: + # AI Studio mode - requires API key + api_key = os.getenv("GOOGLE_API_KEY") + if not api_key: + raise ValueError( + "GOOGLE_API_KEY is required for AI Studio mode. " + "Set GOOGLE_GENAI_USE_VERTEXAI=true and configure GCP credentials for Vertex AI mode." + ) + + client = genai.Client(api_key=api_key) + + # Return async or sync client based on parameter + return client.aio if use_async else client + + +def get_gemini_provider_name() -> str: + """ + Get the Gemini provider name based on current mode. + + Returns: + Provider name string for display purposes + """ + if is_vertex_ai_mode(): + project = os.getenv("GOOGLE_CLOUD_PROJECT", "unknown") + location = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1") + return f"Google Gemini (Vertex AI: {project}/{location})" + return "Google Gemini (AI Studio)" + + +def create_openai_client(use_async: bool = True): + """ + Create OpenAI client with optional custom base URL. + + Environment variables: + - OPENAI_API_KEY: Required API key + - OPENAI_BASE_URL: Optional custom API endpoint (for proxies, Azure, etc.) 
+ + Args: + use_async: If True, return AsyncOpenAI, otherwise return OpenAI + + Returns: + OpenAI client instance (async or sync based on use_async parameter) + + Raises: + ValueError: If OPENAI_API_KEY is not set + """ + if not HAS_OPENAI: + raise ValueError("openai library is not installed.") + api_key = os.getenv("OPENAI_API_KEY") + if not api_key: + raise ValueError("OPENAI_API_KEY is required for OpenAI mode.") + + base_url = os.getenv("OPENAI_BASE_URL") + + if use_async: + return openai.AsyncOpenAI(base_url=base_url) + return openai.OpenAI(base_url=base_url) + + +def get_openai_provider_name() -> str: + """ + Get the OpenAI provider name, including custom endpoint if configured. + + Returns: + Provider name string for display purposes + """ + base_url = os.getenv("OPENAI_BASE_URL") + if base_url: + return f"OpenAI (Custom: {base_url})" + return "OpenAI" + + +def is_openai_reasoning_model(model_name: str) -> bool: + """ + Check if the OpenAI model supports reasoning_effort parameter. + + Models that support reasoning_effort: + - o-series: o1, o3, o4 and their variants (o1-mini, o1-2024-12-17, etc.) + - gpt-5 series: gpt-5, gpt-5.2, gpt-5-turbo, etc. + + Non-reasoning models like gpt-4.1, gpt-4o, etc. will reject this parameter. + + Handles proxy/router prefixes like "openai/o1-mini" or "openrouter/gpt-5.2". + + Args: + model_name: The OpenAI model name (may include path prefix) + + Returns: + True if the model supports reasoning_effort, False otherwise + """ + model_lower = model_name.lower() + + # Handle proxy/router prefixes like "openai/o1-mini", "openrouter/gpt-5.2" + # Extract the base model name after the last "/" + if "/" in model_lower: + model_lower = model_lower.rsplit("/", 1)[-1] + + # Match o-series and gpt-5 series + return model_lower.startswith(("o1", "o3", "o4", "gpt-5")) + + +def is_openai_retryable(error: Exception) -> bool: + """ + Determine if an OpenAI error should be retried. 
+ + Non-retryable errors: + - AuthenticationError (401): Invalid API key + - PermissionDeniedError (403): No access to resource + - BadRequestError (400): Invalid request format + - NotFoundError (404): Model or resource not found + + Retryable errors: + - RateLimitError (429): Rate limit exceeded + - APIConnectionError: Network issues + - InternalServerError (500): Server errors + - APIStatusError with 502, 503, 504: Gateway/service errors + + Args: + error: The exception from OpenAI API call + + Returns: + True if the error should be retried, False otherwise + """ + if not HAS_OPENAI: + return True + + # Authentication error - invalid API key (401) + if isinstance(error, openai.AuthenticationError): + return False + + # Permission denied - no access to resource (403) + if isinstance(error, openai.PermissionDeniedError): + return False + + # Bad request - invalid request format (400) + if isinstance(error, openai.BadRequestError): + return False + + # Not found - model or resource doesn't exist (404) + if isinstance(error, openai.NotFoundError): + return False + + # Rate limit exceeded - should retry with backoff (429) + if isinstance(error, openai.RateLimitError): + return True + + # API connection error - network issues, should retry + if isinstance(error, openai.APIConnectionError): + return True + + # Internal server error - should retry (500) + if isinstance(error, openai.InternalServerError): + return True + + # For other APIStatusError, check HTTP status code + if isinstance(error, openai.APIStatusError): + # Retryable server-side errors + return error.status_code in (429, 500, 502, 503, 504) + + # For unknown errors, default to retry (network issues, timeouts, etc.) + return True + + +def is_gemini_retryable(error: Exception) -> bool: + """ + Determine if a Gemini error should be retried. + + Uses string matching on error messages since google-genai may not have + well-defined exception types for all error cases. 
+ + Non-retryable errors: + - API key errors + - Authentication/permission errors + - Invalid request errors + - Model not found errors + - Billing/quota permanently exceeded + + Retryable errors: + - Rate limit (429) + - Server errors (500, 502, 503, 504) + - Timeout/connection errors + + Args: + error: The exception from Gemini API call + + Returns: + True if the error should be retried, False otherwise + """ + error_str = str(error).lower() + + # API key / authentication errors - do not retry + if "api_key" in error_str or "api key" in error_str: + return False + if "authentication" in error_str or "authenticate" in error_str: + return False + if "invalid_api_key" in error_str or "invalid api key" in error_str: + return False + + # Permission / forbidden errors - do not retry + if "permission" in error_str and "denied" in error_str: + return False + if "forbidden" in error_str or "403" in error_str: + return False + + # Invalid request errors - do not retry + if "invalid" in error_str and ("request" in error_str or "argument" in error_str): + return False + if "400" in error_str and "bad request" in error_str: + return False + + # Model not found - do not retry + if "model" in error_str and ("not found" in error_str or "not exist" in error_str): + return False + if "404" in error_str: + return False + + # Billing / permanent quota errors - do not retry + if "billing" in error_str: + return False + if "quota" in error_str and ("exceeded" in error_str or "exhausted" in error_str): + # Check if it mentions billing which indicates permanent quota issue + if "billing" in error_str or "payment" in error_str: + return False + # Temporary quota (rate limit) - should retry + return True + + # Rate limit errors - should retry (429) + if "rate" in error_str and "limit" in error_str: + return True + if "429" in error_str or "resource_exhausted" in error_str: + return True + + # Server errors - should retry (500, 502, 503, 504) + if any(code in error_str for code in ["500", 
"502", "503", "504"]): + return True + if "internal" in error_str and ("error" in error_str or "server" in error_str): + return True + if "service" in error_str and "unavailable" in error_str: + return True + if "gateway" in error_str: + return True + + # Timeout / connection errors - should retry + if "timeout" in error_str or "timed out" in error_str: + return True + if "connection" in error_str: + return True + if "network" in error_str: + return True + + # Unknown errors - default to retry with limited attempts + return True + + +# JSON Schema for LLM structured output +AUDIT_RESULT_SCHEMA = { + "type": "object", + "additionalProperties": False, + "properties": { + "is_violation": { + "type": "boolean", + "description": "Whether any violations were found", + }, + "violations": { + "type": "array", + "description": "List of violations found", + "items": { + "type": "object", + "additionalProperties": False, + "properties": { + "rule_id": { + "type": "string", + "description": "ID of the violated rule (e.g., R001)", + }, + "violation_text": { + "type": "string", + "description": "The problematic text directly verbatim quote from the source content, and not span multiple cells", + }, + "violation_reason": { + "type": "string", + "description": "Explanation of why this violates the rule", + }, + "fix_action": { + "type": "string", + "enum": ["replace", "manual"], + "description": "Action type: replace substitutes text (including deletion-via-replace), manual requires human review", + }, + "revised_text": { + "type": "string", + "description": "For replace: complete replacement text (including deletion-via-replace). 
For manual: additional guidance for human reviewer", + }, + }, + "required": [ + "rule_id", + "violation_text", + "violation_reason", + "fix_action", + "revised_text", + ], + }, + }, + }, + "required": ["is_violation", "violations"], +} + +# JSON Schema for global extraction output +GLOBAL_EXTRACT_SCHEMA = { + "type": "object", + "additionalProperties": False, + "properties": { + "results": { + "type": "array", + "items": { + "type": "object", + "additionalProperties": False, + "properties": { + "rule_id": {"type": "string"}, + "extracted_results": { + "type": "array", + "items": { + "type": "object", + "additionalProperties": False, + "properties": { + "entity": {"type": "string"}, + "fields": { + "type": "array", + "items": { + "type": "object", + "additionalProperties": False, + "properties": { + "name": {"type": "string"}, + "value": {"type": "string"}, + "evidence": {"type": "string"}, + }, + "required": ["name", "value", "evidence"], + }, + }, + }, + "required": ["entity", "fields"], + }, + }, + }, + "required": ["rule_id", "extracted_results"], + }, + } + }, + "required": ["results"], +} + +# JSON Schema for global verification output +GLOBAL_VERIFY_SCHEMA = { + "type": "object", + "additionalProperties": False, + "properties": { + "violations": { + "type": "array", + "items": { + "type": "object", + "additionalProperties": False, + "properties": { + "rule_id": {"type": "string"}, + "uuid": {"type": "string"}, + "uuid_end": {"type": "string"}, + "violation_text": {"type": "string"}, + "violation_reason": {"type": "string"}, + "fix_action": {"type": "string", "enum": ["replace", "manual"]}, + "revised_text": {"type": "string"}, + }, + "required": [ + "rule_id", + "uuid", + "uuid_end", + "violation_text", + "violation_reason", + "fix_action", + "revised_text", + ], + }, + } + }, + "required": ["violations"], +} + + +async def global_extract_gemini_async( + user_prompt: str, + system_prompt: str, + model_name: str, + client, + thinking_level: str = None, + 
thinking_budget: int = None, +) -> dict: + thinking_config = None + if thinking_level and thinking_level.upper() in ( + "MINIMAL", + "LOW", + "MEDIUM", + "HIGH", + ): + level_map = { + "MINIMAL": types.ThinkingLevel.MINIMAL, + "LOW": types.ThinkingLevel.LOW, + "MEDIUM": types.ThinkingLevel.MEDIUM, + "HIGH": types.ThinkingLevel.HIGH, + } + thinking_config = types.ThinkingConfig( + thinking_level=level_map[thinking_level.upper()] + ) + elif thinking_budget is not None: + thinking_config = types.ThinkingConfig(thinking_budget=int(thinking_budget)) + + config_params = { + "system_instruction": system_prompt, + "response_mime_type": "application/json", + "response_schema": GLOBAL_EXTRACT_SCHEMA, + } + if thinking_config: + config_params["thinking_config"] = thinking_config + + response = await client.models.generate_content( + model=model_name, + contents=user_prompt, + config=types.GenerateContentConfig(**config_params), + ) + return json.loads(response.text) + + +async def global_extract_openai_async( + user_prompt: str, + system_prompt: str, + model_name: str, + client, + reasoning_effort: str = None, +) -> dict: + request_params = { + "model": model_name, + "messages": [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_prompt}, + ], + "response_format": { + "type": "json_schema", + "json_schema": { + "name": "global_extract", + "strict": True, + "schema": GLOBAL_EXTRACT_SCHEMA, + }, + }, + } + if ( + reasoning_effort + and reasoning_effort.lower() in ("low", "medium", "high") + and is_openai_reasoning_model(model_name) + ): + request_params["reasoning_effort"] = reasoning_effort.lower() + + response = await client.chat.completions.create(**request_params) + return json.loads(response.choices[0].message.content) + + +async def global_verify_gemini_async( + user_prompt: str, + system_prompt: str, + model_name: str, + client, + thinking_level: str = None, + thinking_budget: int = None, +) -> dict: + thinking_config = None + if 
thinking_level and thinking_level.upper() in ( + "MINIMAL", + "LOW", + "MEDIUM", + "HIGH", + ): + level_map = { + "MINIMAL": types.ThinkingLevel.MINIMAL, + "LOW": types.ThinkingLevel.LOW, + "MEDIUM": types.ThinkingLevel.MEDIUM, + "HIGH": types.ThinkingLevel.HIGH, + } + thinking_config = types.ThinkingConfig( + thinking_level=level_map[thinking_level.upper()] + ) + elif thinking_budget is not None: + thinking_config = types.ThinkingConfig(thinking_budget=int(thinking_budget)) + + config_params = { + "system_instruction": system_prompt, + "response_mime_type": "application/json", + "response_schema": GLOBAL_VERIFY_SCHEMA, + } + if thinking_config: + config_params["thinking_config"] = thinking_config + + response = await client.models.generate_content( + model=model_name, + contents=user_prompt, + config=types.GenerateContentConfig(**config_params), + ) + return json.loads(response.text) + + +async def global_verify_openai_async( + user_prompt: str, + system_prompt: str, + model_name: str, + client, + reasoning_effort: str = None, +) -> dict: + request_params = { + "model": model_name, + "messages": [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_prompt}, + ], + "response_format": { + "type": "json_schema", + "json_schema": { + "name": "global_verify", + "strict": True, + "schema": GLOBAL_VERIFY_SCHEMA, + }, + }, + } + if ( + reasoning_effort + and reasoning_effort.lower() in ("low", "medium", "high") + and is_openai_reasoning_model(model_name) + ): + request_params["reasoning_effort"] = reasoning_effort.lower() + + response = await client.chat.completions.create(**request_params) + return json.loads(response.choices[0].message.content) + + +async def audit_block_gemini_async( + user_prompt: str, + system_prompt: str, + model_name: str, + client, + thinking_level: str = None, + thinking_budget: int = None, +) -> dict: + """ + Audit a text block using Google Gemini with strict JSON mode (async version). 
+ + Args: + user_prompt: User prompt to audit + system_prompt: Cached system prompt with rules and instructions + model_name: Gemini model to use + client: Gemini async client instance (client.aio) + thinking_level: Thinking level for Gemini 3 models (MINIMAL, LOW, MEDIUM, HIGH) + thinking_budget: Thinking token budget for Gemini 2.5 models (integer) + + Returns: + Audit result dictionary + """ + # Build thinking config based on model and parameters + thinking_config = None + + if thinking_level and thinking_level.upper() in ( + "MINIMAL", + "LOW", + "MEDIUM", + "HIGH", + ): + # For Gemini 3 models + level_map = { + "MINIMAL": types.ThinkingLevel.MINIMAL, + "LOW": types.ThinkingLevel.LOW, + "MEDIUM": types.ThinkingLevel.MEDIUM, + "HIGH": types.ThinkingLevel.HIGH, + } + thinking_config = types.ThinkingConfig( + thinking_level=level_map[thinking_level.upper()] + ) + elif thinking_budget is not None: + # For Gemini 2.5 models + thinking_config = types.ThinkingConfig(thinking_budget=int(thinking_budget)) + + config_params = { + "system_instruction": system_prompt, + "response_mime_type": "application/json", + "response_schema": AUDIT_RESULT_SCHEMA, + } + + # Only add thinking_config if it's configured + if thinking_config: + config_params["thinking_config"] = thinking_config + + response = await client.models.generate_content( + model=model_name, + contents=user_prompt, + config=types.GenerateContentConfig(**config_params), + ) + + # With structured output, response is guaranteed to be valid JSON + result = json.loads(response.text) + return result + + +async def audit_block_openai_async( + user_prompt: str, + system_prompt: str, + model_name: str, + client, + reasoning_effort: str = None, +) -> dict: + """ + Audit a text block using OpenAI with strict JSON mode (async version). 
+ + Args: + user_prompt: User prompt to audit + system_prompt: Cached system prompt with rules and instructions + model_name: OpenAI model to use + client: AsyncOpenAI client instance + reasoning_effort: Reasoning effort for o-series models (low, medium, high) + + Returns: + Audit result dictionary + """ + request_params = { + "model": model_name, + "messages": [ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_prompt}, + ], + "response_format": { + "type": "json_schema", + "json_schema": { + "name": "audit_result", + "strict": True, + "schema": AUDIT_RESULT_SCHEMA, + }, + }, + } + + # Add reasoning_effort only for o-series models that support it + if ( + reasoning_effort + and reasoning_effort.lower() in ("low", "medium", "high") + and is_openai_reasoning_model(model_name) + ): + request_params["reasoning_effort"] = reasoning_effort.lower() + + response = await client.chat.completions.create(**request_params) + + # With structured output, response is guaranteed to be valid JSON + result = json.loads(response.choices[0].message.content) + return result diff --git a/lightrag/operate.py b/lightrag/operate.py index 8ca3f5d303..f477805290 100644 --- a/lightrag/operate.py +++ b/lightrag/operate.py @@ -4,13 +4,14 @@ import asyncio import json +import re +import warnings import json_repair from typing import Any, AsyncIterator, overload, Literal from collections import Counter, defaultdict from lightrag.exceptions import ( PipelineCancelledException, - ChunkTokenLimitExceededError, ) from lightrag.utils import ( logger, @@ -26,6 +27,9 @@ save_to_cache, CacheData, use_llm_func_with_cache, + get_env_value, + get_llm_cache_identity, + serialize_llm_cache_identity, update_chunk_cache_list, remove_think_tags, pick_by_weighted_polling, @@ -51,15 +55,16 @@ QueryResult, QueryContextResult, ) -from lightrag.prompt import PROMPTS +from lightrag.chunk_schema import strip_internal_multimodal_markup_for_extraction +from lightrag.prompt import PROMPTS, 
resolve_entity_extraction_prompt_profile from lightrag.constants import ( GRAPH_FIELD_SEP, DEFAULT_MAX_ENTITY_TOKENS, + DEFAULT_MAX_EXTRACT_INPUT_TOKENS, DEFAULT_MAX_RELATION_TOKENS, DEFAULT_MAX_TOTAL_TOKENS, DEFAULT_RELATED_CHUNK_NUMBER, DEFAULT_KG_CHUNK_PICK_METHOD, - DEFAULT_ENTITY_TYPES, DEFAULT_SUMMARY_LANGUAGE, SOURCE_IDS_LIMIT_METHOD_KEEP, SOURCE_IDS_LIMIT_METHOD_FIFO, @@ -77,6 +82,41 @@ load_dotenv(dotenv_path=Path(__file__).resolve().parent / ".env", override=False) +def _warn_deprecated_query_model_func(context: str) -> None: + warnings.warn( + "QueryParam.model_func is deprecated and will be removed at v1.5.0. " + "Use LightRAG.aupdate_llm_role_config() instead. " + f"Deprecated override used for {context}.", + DeprecationWarning, + stacklevel=3, + ) + + +def _get_relationship_vdb_timeout_seconds(global_config: dict[str, Any]) -> float: + """Derive a defensive timeout for relation VDB upserts. + + Rationale: + - `knowledge_graph_inst.upsert_edge()` for the default NetworkX storage is in-memory and fast. + - `relationships_vdb.upsert()` performs embedding calls and remote I/O, which is the more likely + point of silent stalls during relation merge. + """ + configured = global_config.get("default_embedding_timeout") + try: + base_timeout = float(configured) + except (TypeError, ValueError): + base_timeout = 30.0 + # Keep a fixed lower bound high enough to avoid false positives on slow providers. 
+ return max(base_timeout * 3, 120.0) + + +def _format_relation_edge_label(edge_key: tuple[str, str] | list[str]) -> str: + if isinstance(edge_key, tuple): + left, right = edge_key + else: + left, right = edge_key[0], edge_key[1] + return f"{left}->{right}" + + def _truncate_entity_identifier( identifier: str, limit: int, chunk_key: str, identifier_role: str ) -> str: @@ -98,70 +138,61 @@ def _truncate_entity_identifier( return display_value -def chunking_by_token_size( - tokenizer: Tokenizer, - content: str, - split_by_character: str | None = None, - split_by_character_only: bool = False, - chunk_overlap_token_size: int = 100, - chunk_token_size: int = 1200, -) -> list[dict[str, Any]]: +def _truncate_vdb_content(content: str, global_config: dict, content_label: str) -> str: + """Clamp vector-store payload size to stay under embedding limits.""" + + if not content: + return content + + embedding_token_limit = global_config.get("embedding_token_limit") + tokenizer: Tokenizer | None = global_config.get("tokenizer") + if embedding_token_limit is None or tokenizer is None: + return content + + threshold = int(embedding_token_limit) + if threshold <= 0: + return content + tokens = tokenizer.encode(content) - results: list[dict[str, Any]] = [] - if split_by_character: - raw_chunks = content.split(split_by_character) - new_chunks = [] - if split_by_character_only: - for chunk in raw_chunks: - _tokens = tokenizer.encode(chunk) - if len(_tokens) > chunk_token_size: - logger.warning( - "Chunk split_by_character exceeds token limit: len=%d limit=%d", - len(_tokens), - chunk_token_size, - ) - raise ChunkTokenLimitExceededError( - chunk_tokens=len(_tokens), - chunk_token_limit=chunk_token_size, - chunk_preview=chunk[:120], - ) - new_chunks.append((len(_tokens), chunk)) - else: - for chunk in raw_chunks: - _tokens = tokenizer.encode(chunk) - if len(_tokens) > chunk_token_size: - for start in range( - 0, len(_tokens), chunk_token_size - chunk_overlap_token_size - ): - 
chunk_content = tokenizer.decode( - _tokens[start : start + chunk_token_size] - ) - new_chunks.append( - (min(chunk_token_size, len(_tokens) - start), chunk_content) - ) - else: - new_chunks.append((len(_tokens), chunk)) - for index, (_len, chunk) in enumerate(new_chunks): - results.append( - { - "tokens": _len, - "content": chunk.strip(), - "chunk_order_index": index, - } - ) - else: - for index, start in enumerate( - range(0, len(tokens), chunk_token_size - chunk_overlap_token_size) - ): - chunk_content = tokenizer.decode(tokens[start : start + chunk_token_size]) - results.append( - { - "tokens": min(chunk_token_size, len(tokens) - start), - "content": chunk_content.strip(), - "chunk_order_index": index, - } - ) - return results + if len(tokens) <= threshold: + return content + + # Leave headroom because tokenizer behavior can differ slightly from the provider. + effective_limit = max(threshold - min(256, max(32, threshold // 16)), 1) + truncated_content = tokenizer.decode(tokens[:effective_limit]) + logger.warning( + "%s VDB content truncated from %d to %d tokens (embedding limit: %d)", + content_label, + len(tokens), + effective_limit, + threshold, + ) + return truncated_content + + +_MM_DISPLAY_NAME_PATTERN = re.compile( + r"^\[(?:Image|Table|Equation) Name\](.+)$", + flags=re.MULTILINE, +) + + +def _parse_mm_display_name(content: str, fallback: str) -> str: + """Return the friendly name embedded in a multimodal chunk. + + Matches the leading ``[Image Name]…`` / ``[Table Name]…`` / + ``[Equation Name]…`` segment produced by + ``LightRAG._build_mm_chunks_from_sidecars`` — the producer-side + contract is documented in that function's ``_render`` helper. Falls + back to the sidecar id when the segment is missing or empty so + callers never end up with a blank label. 
+ """ + if content: + match = _MM_DISPLAY_NAME_PATTERN.search(content) + if match: + candidate = match.group(1).strip() + if candidate: + return candidate + return fallback async def _handle_entity_relation_summary( @@ -319,11 +350,14 @@ async def _summarize_descriptions( Returns: Summarized description string """ - use_llm_func: callable = global_config["llm_model_func"] + use_llm_func: callable = global_config["role_llm_funcs"]["extract"] # Apply higher priority (8) to entity/relation summary tasks use_llm_func = partial(use_llm_func, _priority=8) - language = global_config["addon_params"].get("language", DEFAULT_SUMMARY_LANGUAGE) + addon_params = global_config.get("addon_params") or {} + language = global_config.get("_resolved_summary_language") + if language is None: + language = addon_params.get("language", DEFAULT_SUMMARY_LANGUAGE) summary_length_recommended = global_config["summary_length_recommended"] @@ -365,6 +399,7 @@ async def _summarize_descriptions( use_llm_func, llm_response_cache=llm_response_cache, cache_type="summary", + llm_cache_identity=get_llm_cache_identity(global_config, "extract"), ) # Check summary token length against embedding limit @@ -557,6 +592,231 @@ def _handle_single_relationship_extraction( return None +def _normalize_text_extraction_record_attributes( + record_attributes: list[str], chunk_key: str +) -> list[str]: + """Recover the known text-mode failure where relation rows use the entity prefix.""" + + if len(record_attributes) != 5: + return record_attributes + + prefix = record_attributes[0].strip().lower() + if "entity" not in prefix or "relation" in prefix: + return record_attributes + + logger.warning( + "Recovering mis-prefixed relation: `%s` ~ `%s`", + record_attributes[1], + record_attributes[2], + ) + normalized = list(record_attributes) + normalized[0] = "relation" + return normalized + + +def _looks_like_json_extraction_result(result: str) -> bool: + """Return True for raw or fenced JSON extraction responses.""" + + 
stripped = result.strip() + if not stripped: + return False + + if stripped.startswith(("{", "[")): + return True + + if stripped.startswith("```"): + return _strip_markdown_code_fence(stripped).strip().startswith(("{", "[")) + + return False + + +async def _process_json_extraction_result( + result: str, + chunk_key: str, + timestamp: int, + file_path: str = "unknown_source", +) -> tuple[dict, dict]: + """Process a JSON-formatted extraction result from LLM. + + This function parses the LLM response as JSON and extracts entities and relationships. + It uses json_repair to handle slightly malformed JSON from weaker models. + + Args: + result: The JSON extraction result from LLM + chunk_key: The chunk key for source tracking + timestamp: The timestamp for the extraction + file_path: The file path for citation + + Returns: + tuple: (nodes_dict, edges_dict) containing the extracted entities and relationships + """ + maybe_nodes = defaultdict(list) + maybe_edges = defaultdict(list) + + try: + # Parse the JSON response using json_repair for robustness + parsed = json_repair.loads(_strip_markdown_code_fence(result).strip()) + except Exception as e: + logger.warning(f"{chunk_key}: Failed to parse JSON extraction result: {e}") + return dict(maybe_nodes), dict(maybe_edges) + + if not isinstance(parsed, dict): + logger.warning( + f"{chunk_key}: JSON extraction result is not a dict, got {type(parsed).__name__}" + ) + return dict(maybe_nodes), dict(maybe_edges) + + # Process entities + entities_list = parsed.get("entities", []) + if not isinstance(entities_list, list): + logger.warning( + f"{chunk_key}: 'entities' field is not a list in JSON extraction result" + ) + entities_list = [] + + for entity_data in entities_list: + if not isinstance(entity_data, dict): + continue + + try: + entity_name = sanitize_and_normalize_extracted_text( + str(entity_data.get("name", "")), remove_inner_quotes=True + ) + if not entity_name or not entity_name.strip(): + logger.info( + f"{chunk_key}: 
Empty entity name found after sanitization in JSON result" + ) + continue + + entity_type = sanitize_and_normalize_extracted_text( + str(entity_data.get("type", "")), remove_inner_quotes=True + ) + if not entity_type.strip() or any( + char in entity_type + for char in ["'", "(", ")", "<", ">", "|", "/", "\\"] + ): + logger.warning( + f"{chunk_key}: Invalid entity type '{entity_type}' for entity '{entity_name}'" + ) + continue + + entity_type = entity_type.replace(" ", "").lower() + + entity_description = sanitize_and_normalize_extracted_text( + str(entity_data.get("description", "")) + ) + if not entity_description.strip(): + logger.warning( + f"{chunk_key}: Empty description for entity '{entity_name}'" + ) + continue + + truncated_name = _truncate_entity_identifier( + entity_name, + DEFAULT_ENTITY_NAME_MAX_LENGTH, + chunk_key, + "Entity name", + ) + + node_data = dict( + entity_name=truncated_name, + entity_type=entity_type, + description=entity_description, + source_id=chunk_key, + file_path=file_path, + timestamp=timestamp, + ) + maybe_nodes[truncated_name].append(node_data) + + except Exception as e: + logger.warning( + f"{chunk_key}: Failed to process entity from JSON result: {e}" + ) + continue + + # Process relationships + relationships_list = parsed.get("relationships", []) + if not isinstance(relationships_list, list): + logger.warning( + f"{chunk_key}: 'relationships' field is not a list in JSON extraction result" + ) + relationships_list = [] + + for rel_data in relationships_list: + if not isinstance(rel_data, dict): + continue + + try: + source = sanitize_and_normalize_extracted_text( + str(rel_data.get("source", "")), remove_inner_quotes=True + ) + target = sanitize_and_normalize_extracted_text( + str(rel_data.get("target", "")), remove_inner_quotes=True + ) + + if not source: + logger.info( + f"{chunk_key}: Empty source entity in JSON relationship result" + ) + continue + if not target: + logger.info( + f"{chunk_key}: Empty target entity in JSON 
relationship result" + ) + continue + if source == target: + logger.debug(f"{chunk_key}: Source and target are the same: '{source}'") + continue + + edge_keywords = sanitize_and_normalize_extracted_text( + str(rel_data.get("keywords", "")), remove_inner_quotes=True + ) + edge_keywords = edge_keywords.replace(",", ",") + + edge_description = sanitize_and_normalize_extracted_text( + str(rel_data.get("description", "")) + ) + + if not edge_description.strip(): + logger.warning( + f"{chunk_key}: Empty description for relationship '{source}' ~ '{target}', skipping" + ) + continue + + truncated_source = _truncate_entity_identifier( + source, + DEFAULT_ENTITY_NAME_MAX_LENGTH, + chunk_key, + "Relation entity", + ) + truncated_target = _truncate_entity_identifier( + target, + DEFAULT_ENTITY_NAME_MAX_LENGTH, + chunk_key, + "Relation entity", + ) + + edge_data = dict( + src_id=truncated_source, + tgt_id=truncated_target, + weight=1.0, + description=edge_description, + keywords=edge_keywords, + source_id=chunk_key, + file_path=file_path, + timestamp=timestamp, + ) + maybe_edges[(truncated_source, truncated_target)].append(edge_data) + + except Exception as e: + logger.warning( + f"{chunk_key}: Failed to process relationship from JSON result: {e}" + ) + continue + + return dict(maybe_nodes), dict(maybe_edges) + + async def rebuild_knowledge_from_chunks( entities_to_rebuild: dict[str, list[str]], relationships_to_rebuild: dict[tuple[str, str], list[str]], @@ -1020,6 +1280,9 @@ async def _process_extraction_result( ) record_attributes = split_string_by_multi_markers(record, [tuple_delimiter]) + record_attributes = _normalize_text_extraction_record_attributes( + record_attributes, chunk_key + ) # Try to parse as entity entity_data = _handle_single_entity_extraction( @@ -1068,7 +1331,11 @@ async def _rebuild_from_extraction_result( chunk_id: str, timestamp: int, ) -> tuple[dict, dict]: - """Parse cached extraction result using the same logic as extract_entities + """Parse cached 
extraction result using the same logic as extract_entities. + + Supports both JSON and delimiter-based formats for backward compatibility. + Tries the JSON parser first when the cached result starts with '{' or '[' (optionally + inside a Markdown code fence); otherwise, or if that yields nothing, uses the delimiter-based parser. Args: text_chunks_storage: Text chunks storage to get chunk data @@ -1087,7 +1354,21 @@ async def _rebuild_from_extraction_result( else "unknown_source" ) - # Call the shared processing function + # Auto-detect format: try JSON first if the result looks like JSON + if _looks_like_json_extraction_result(extraction_result): + # Likely JSON format (from entity_extraction_use_json mode) + nodes, edges = await _process_json_extraction_result( + extraction_result, + chunk_id, + timestamp, + file_path, + ) + # If JSON parsing yielded results, use them + if nodes or edges: + return nodes, edges + # Otherwise fall through to text-based parsing + + # Fall back to traditional delimiter-based parsing return await _process_extraction_result( extraction_result, chunk_id, @@ -1142,7 +1423,11 @@ async def _update_entity_storage( # Update entity in vector database (equally critical) entity_vdb_id = compute_mdhash_id(entity_name, prefix="ent-") - entity_content = f"{entity_name}\n{final_description}" + entity_content = _truncate_vdb_content( + f"{entity_name}\n{final_description}", + global_config, + f"entity:{entity_name}", + ) vdb_data = { entity_vdb_id: { @@ -1535,7 +1820,11 @@ async def _rebuild_single_relationship( # Update entity_vdb for the newly created entity if entities_vdb is not None: entity_vdb_id = compute_mdhash_id(node_id, prefix="ent-") - entity_content = f"{node_id}\n{node_description}" + entity_content = _truncate_vdb_content( + f"{node_id}\n{node_description}", + global_config, + f"entity:{node_id}", + ) vdb_data = { entity_vdb_id: { "content": entity_content, @@ -1919,7 +2208,11 @@ async def _merge_nodes_then_upsert( 
node_data["entity_name"] = 
entity_name if entity_vdb is not None: entity_vdb_id = compute_mdhash_id(str(entity_name), prefix="ent-") - entity_content = f"{entity_name}\n{description}" + entity_content = _truncate_vdb_content( + f"{entity_name}\n{description}", + global_config, + f"entity:{entity_name}", + ) data_for_vdb = { entity_vdb_id: { "entity_name": entity_name, @@ -1965,6 +2258,7 @@ async def _merge_edges_then_upsert( try: if src_id == tgt_id: return None + relation_key = f"{src_id}->{tgt_id}" already_edge = None already_weights = [] @@ -2182,7 +2476,6 @@ async def _merge_edges_then_upsert( file_paths_list.append(file_path_item) seen_paths.add(file_path_item) await _cooperative_yield(i, every=32) - # Apply count limit if len(file_paths_list) > max_file_paths: limit_method = global_config.get( @@ -2287,7 +2580,11 @@ async def _merge_edges_then_upsert( if entity_vdb is not None: entity_vdb_id = compute_mdhash_id(need_insert_id, prefix="ent-") - entity_content = f"{need_insert_id}\n{description}" + entity_content = _truncate_vdb_content( + f"{need_insert_id}\n{description}", + global_config, + f"entity:{need_insert_id}", + ) vdb_data = { entity_vdb_id: { "content": entity_content, @@ -2300,9 +2597,14 @@ async def _merge_edges_then_upsert( await safe_vdb_operation_with_exception( operation=lambda payload=vdb_data: entity_vdb.upsert(payload), operation_name="added_entity_upsert", - entity_name=need_insert_id, + entity_name=f"{need_insert_id} [relation:{relation_key}]", max_retries=3, retry_delay=0.1, + timeout_seconds=_get_relationship_vdb_timeout_seconds( + global_config + ), + log_start=False, + success_log_threshold_seconds=5.0, ) # Track entities added during edge processing @@ -2407,15 +2709,18 @@ async def _merge_edges_then_upsert( ), } } - await safe_vdb_operation_with_exception( - operation=lambda payload=vdb_data: entity_vdb.upsert( - payload - ), - operation_name="existing_entity_update", - entity_name=need_insert_id, - max_retries=3, - retry_delay=0.1, - ) + await 
safe_vdb_operation_with_exception( + operation=lambda payload=vdb_data: entity_vdb.upsert(payload), + operation_name="existing_entity_update", + entity_name=f"{need_insert_id} [relation:{relation_key}]", + max_retries=3, + retry_delay=0.1, + timeout_seconds=_get_relationship_vdb_timeout_seconds( + global_config + ), + log_start=False, + success_log_threshold_seconds=5.0, + ) # 6. Log once at the end if any update occurred if updated: @@ -2429,6 +2734,7 @@ async def _merge_edges_then_upsert( pipeline_status["history_messages"].append(status_message) edge_created_at = int(time.time()) + edge_upsert_started = time.perf_counter() await knowledge_graph_inst.upsert_edge( src_id, tgt_id, @@ -2442,6 +2748,13 @@ async def _merge_edges_then_upsert( truncate=truncation_info, ), ) + edge_upsert_elapsed = time.perf_counter() - edge_upsert_started + if edge_upsert_elapsed >= 5.0: + logger.info( + "Graph edge upsert slow for `%s` in %.2fs", + relation_key, + edge_upsert_elapsed, + ) edge_data = dict( src_id=src_id, @@ -2468,7 +2781,11 @@ async def _merge_edges_then_upsert( logger.debug( f"Could not delete old relationship vector records {rel_vdb_id}, {rel_vdb_id_reverse}: {e}" ) - rel_content = f"{keywords}\t{src_id}\n{tgt_id}\n{description}" + rel_content = _truncate_vdb_content( + f"{keywords}\t{src_id}\n{tgt_id}\n{description}", + global_config, + f"relationship:{src_id}-{tgt_id}", + ) vdb_data = { rel_vdb_id: { "src_id": src_id, @@ -2481,12 +2798,20 @@ async def _merge_edges_then_upsert( "file_path": file_path, } } + relation_status_message = f"Upserting relation VDB: `{relation_key}`" + logger.info(relation_status_message) + if pipeline_status is not None and pipeline_status_lock is not None: + async with pipeline_status_lock: + pipeline_status["latest_message"] = relation_status_message await safe_vdb_operation_with_exception( operation=lambda payload=vdb_data: relationships_vdb.upsert(payload), operation_name="relationship_upsert", - entity_name=f"{src_id}-{tgt_id}", + 
entity_name=relation_key, max_retries=3, retry_delay=0.2, + timeout_seconds=_get_relationship_vdb_timeout_seconds(global_config), + log_start=False, + success_log_threshold_seconds=5.0, ) return edge_data @@ -2701,6 +3026,7 @@ async def _locked_process_edges(edge_key, edges): workspace = global_config.get("workspace", "") namespace = f"{workspace}:GraphDB" if workspace else "GraphDB" sorted_edge_key = sorted([edge_key[0], edge_key[1]]) + edge_label = _format_relation_edge_label(edge_key) async with get_storage_keyed_lock( sorted_edge_key, @@ -2710,7 +3036,6 @@ async def _locked_process_edges(edge_key, edges): try: added_entities = [] # Track entities added during edge processing - logger.debug(f"Processing relation {sorted_edge_key}") edge_data = await _merge_edges_then_upsert( edge_key[0], edge_key[1], @@ -2733,7 +3058,7 @@ async def _locked_process_edges(edge_key, edges): return edge_data, added_entities except Exception as e: - error_msg = f"Error processing relation `{sorted_edge_key}`: {e}" + error_msg = f"Error processing relation `{edge_label}`: {e}" logger.error(error_msg) # Try to update pipeline status, but don't let status update failure affect main exception @@ -2751,17 +3076,17 @@ async def _locked_process_edges(edge_key, edges): ) # Re-raise the original exception with a prefix - prefixed_exception = create_prefixed_exception( - e, f"{sorted_edge_key}" - ) + prefixed_exception = create_prefixed_exception(e, f"{edge_label}") raise prefixed_exception from e # Create relationship processing tasks edge_tasks = [] + edge_task_labels: dict[asyncio.Task, str] = {} for i, (edge_key, edges) in enumerate(all_edges.items(), start=1): task = asyncio.create_task(_locked_process_edges(edge_key, edges)) edge_tasks.append(task) - await _cooperative_yield(i, every=16) + edge_task_labels[task] = _format_relation_edge_label(edge_key) + await _cooperative_yield(i, every=32) # Execute relationship tasks with error handling processed_edges = [] @@ -2787,6 +3112,17 @@ async 
def _locked_process_edges(edge_key, edges): await _cooperative_yield(i, every=32) if pending: + pending_labels = [ + edge_task_labels.get(task, "") for task in pending + ] + preview = ", ".join(pending_labels[:10]) + if len(pending_labels) > 10: + preview += f", ... (+{len(pending_labels) - 10} more)" + logger.warning( + "Phase 2 pending relation tasks for %s: %s", + doc_id, + preview or "", + ) for task in pending: task.cancel() pending_results = await asyncio.gather(*pending, return_exceptions=True) @@ -2800,9 +3136,22 @@ async def _locked_process_edges(edge_key, edges): processed_edges.append(edge_data) all_added_entities.extend(added_entities) + logger.info( + "Phase 2 pending relation tasks drained for %s: collected_edges=%d collected_added_entities=%d", + doc_id, + len(processed_edges), + len(all_added_entities), + ) + if first_exception is not None: raise first_exception + logger.info( + "Phase 2 relation processing completed for %s: edges=%d added_entities=%d", + doc_id, + len(processed_edges), + len(all_added_entities), + ) await asyncio.sleep(0) # ===== Phase 3: Update full_entities and full_relations storage ===== @@ -2896,34 +3245,74 @@ async def extract_entities( "User cancelled during entity extraction" ) - use_llm_func: callable = global_config["llm_model_func"] + use_llm_func: callable = global_config["role_llm_funcs"]["extract"] entity_extract_max_gleaning = global_config["entity_extract_max_gleaning"] - - ordered_chunks = list(chunks.items()) - # add language and example number params to prompt - language = global_config["addon_params"].get("language", DEFAULT_SUMMARY_LANGUAGE) - entity_types = global_config["addon_params"].get( - "entity_types", DEFAULT_ENTITY_TYPES + # Cap on the gleaning LLM call's combined input (system + history user + # prompt + history assistant response + continue prompt). Pulled from + # the same env knob that gates ``analyze_multimodal``'s sidecar trimming + # so both EXTRACT-role consumers share one source of truth. 
``0`` + # disables the gleaning guard (gleaning always runs regardless of size). + max_extract_input_tokens = get_env_value( + "MAX_EXTRACT_INPUT_TOKENS", + DEFAULT_MAX_EXTRACT_INPUT_TOKENS, + int, ) + extract_tokenizer: Tokenizer | None = global_config.get("tokenizer") - examples = "\n".join(PROMPTS["entity_extraction_examples"]) - - example_context_base = dict( - tuple_delimiter=PROMPTS["DEFAULT_TUPLE_DELIMITER"], - completion_delimiter=PROMPTS["DEFAULT_COMPLETION_DELIMITER"], - entity_types=", ".join(entity_types), - language=language, - ) - # add example's format - examples = examples.format(**example_context_base) + # Check if JSON structured output mode is enabled + use_json_extraction = global_config.get("entity_extraction_use_json", False) - context_base = dict( - tuple_delimiter=PROMPTS["DEFAULT_TUPLE_DELIMITER"], - completion_delimiter=PROMPTS["DEFAULT_COMPLETION_DELIMITER"], - entity_types=",".join(entity_types), - examples=examples, - language=language, - ) + ordered_chunks = list(chunks.items()) + # add language and example number params to prompt + addon_params = global_config.get("addon_params") or {} + language = global_config.get("_resolved_summary_language") + if language is None: + language = addon_params.get("language", DEFAULT_SUMMARY_LANGUAGE) + prompt_profile = global_config.get("_entity_extraction_prompt_profile") + if prompt_profile is None: + # Fallback for callers that construct global_config directly (e.g. tests + # or custom wiring). Re-run the resolver so behavior matches the cached + # path that LightRAG.__post_init__ populates, instead of duplicating + # guidance/override logic here. 
+ prompt_profile = resolve_entity_extraction_prompt_profile( + addon_params, use_json_extraction + ) + entity_types_guidance = prompt_profile["entity_types_guidance"] + + max_total_records = global_config["entity_extract_max_records"] + max_entity_records = global_config["entity_extract_max_entities"] + + if use_json_extraction: + # JSON mode: use JSON-specific prompts without delimiters + examples = "\n".join(prompt_profile["entity_extraction_json_examples"]) + context_base = dict( + entity_types_guidance=entity_types_guidance, + examples=examples, + language=language, + max_total_records=max_total_records, + max_entity_records=max_entity_records, + ) + else: + # Text mode: use traditional delimiter-based prompts + examples = "\n".join(prompt_profile["entity_extraction_examples"]) + example_context_base = dict( + tuple_delimiter=PROMPTS["DEFAULT_TUPLE_DELIMITER"], + completion_delimiter=PROMPTS["DEFAULT_COMPLETION_DELIMITER"], + entity_types_guidance=entity_types_guidance, + language=language, + ) + # add example's format + examples = examples.format(**example_context_base) + + context_base = dict( + tuple_delimiter=PROMPTS["DEFAULT_TUPLE_DELIMITER"], + completion_delimiter=PROMPTS["DEFAULT_COMPLETION_DELIMITER"], + entity_types_guidance=entity_types_guidance, + examples=examples, + language=language, + max_total_records=max_total_records, + max_entity_records=max_entity_records, + ) processed_chunks = 0 total_chunks = len(ordered_chunks) @@ -2939,25 +3328,38 @@ async def _process_single_content(chunk_key_dp: tuple[str, TextChunkSchema]): nonlocal processed_chunks chunk_key = chunk_key_dp[0] chunk_dp = chunk_key_dp[1] - content = chunk_dp["content"] + # Strip parser-internal multimodal markup from the chunk text + # before building the extraction prompt. The stored + # chunk content is left intact so query-time citations still resolve. 
+ content = strip_internal_multimodal_markup_for_extraction(chunk_dp["content"]) # Get file path from chunk data or use default file_path = chunk_dp.get("file_path", "unknown_source") # Create cache keys collector for batch processing cache_keys_collector = [] - # Get initial extraction - # Format system prompt without input_text for each chunk (enables OpenAI prompt caching across chunks) - entity_extraction_system_prompt = PROMPTS[ - "entity_extraction_system_prompt" - ].format(**context_base) - # Format user prompts with input_text for each chunk - entity_extraction_user_prompt = PROMPTS["entity_extraction_user_prompt"].format( - **{**context_base, "input_text": content} - ) - entity_continue_extraction_user_prompt = PROMPTS[ - "entity_continue_extraction_user_prompt" - ].format(**{**context_base, "input_text": content}) + if use_json_extraction: + # JSON mode: use JSON prompts and pass entity_extraction flag to LLM provider + entity_extraction_system_prompt = PROMPTS[ + "entity_extraction_json_system_prompt" + ].format(**context_base) + entity_extraction_user_prompt = PROMPTS[ + "entity_extraction_json_user_prompt" + ].format(**{**context_base, "input_text": content}) + entity_continue_extraction_user_prompt = PROMPTS[ + "entity_continue_extraction_json_user_prompt" + ].format(**context_base) + else: + # Text mode: use traditional delimiter-based prompts + entity_extraction_system_prompt = PROMPTS[ + "entity_extraction_system_prompt" + ].format(**context_base) + entity_extraction_user_prompt = PROMPTS[ + "entity_extraction_user_prompt" + ].format(**{**context_base, "input_text": content}) + entity_continue_extraction_user_prompt = PROMPTS[ + "entity_continue_extraction_user_prompt" + ].format(**{**context_base, "input_text": content}) final_result, timestamp = await use_llm_func_with_cache( entity_extraction_user_prompt, @@ -2967,56 +3369,87 @@ async def _process_single_content(chunk_key_dp: tuple[str, TextChunkSchema]): cache_type="extract", 
chunk_id=chunk_key, cache_keys_collector=cache_keys_collector, + response_format=({"type": "json_object"} if use_json_extraction else None), + llm_cache_identity=get_llm_cache_identity(global_config, "extract"), ) history = pack_user_ass_to_openai_messages( entity_extraction_user_prompt, final_result ) - # Process initial extraction with file path - maybe_nodes, maybe_edges = await _process_extraction_result( - final_result, - chunk_key, - timestamp, - file_path, - tuple_delimiter=context_base["tuple_delimiter"], - completion_delimiter=context_base["completion_delimiter"], - ) + # Process initial extraction with appropriate parser + if use_json_extraction: + maybe_nodes, maybe_edges = await _process_json_extraction_result( + final_result, + chunk_key, + timestamp, + file_path, + ) + else: + maybe_nodes, maybe_edges = await _process_extraction_result( + final_result, + chunk_key, + timestamp, + file_path, + tuple_delimiter=context_base["tuple_delimiter"], + completion_delimiter=context_base["completion_delimiter"], + ) # Process additional gleaning results only 1 time when entity_extract_max_gleaning is greater than zero. - if entity_extract_max_gleaning > 0: - # Calculate total tokens for the gleaning request to prevent context window overflow - tokenizer = global_config["tokenizer"] - max_input_tokens = global_config["max_extract_input_tokens"] - - # Approximate total tokens: system prompt + history + user prompt. - # This slightly underestimates actual API usage (missing role/framing tokens) - # but is sufficient as a safety guard against context window overflow. 
- history_str = json.dumps(history, ensure_ascii=False) - full_context_str = ( - entity_extraction_system_prompt - + history_str - + entity_continue_extraction_user_prompt + run_gleaning = entity_extract_max_gleaning > 0 + if ( + run_gleaning + and extract_tokenizer is not None + and max_extract_input_tokens > 0 + ): + # Gleaning replays the initial extraction's user/assistant pair + # via ``history_messages`` and appends a "continue" instruction. + # When the initial response was large (many entities/edges) or + # the chunk content is itself near the budget, that combined + # payload can blow past MAX_EXTRACT_INPUT_TOKENS and yield a + # provider ``context_length_exceeded`` error. Pre-check here + # and skip rather than fail. + gleaning_token_count = ( + len(extract_tokenizer.encode(entity_extraction_system_prompt)) + + sum( + len(extract_tokenizer.encode(msg.get("content", "") or "")) + for msg in history + ) + + len(extract_tokenizer.encode(entity_continue_extraction_user_prompt)) ) - token_count = len(tokenizer.encode(full_context_str)) - - if token_count > max_input_tokens: + if gleaning_token_count > max_extract_input_tokens: logger.warning( - f"Gleaning stopped for chunk {chunk_key}: Input tokens ({token_count}) exceeded limit ({max_input_tokens})." - ) - else: - glean_result, timestamp = await use_llm_func_with_cache( - entity_continue_extraction_user_prompt, - use_llm_func, - system_prompt=entity_extraction_system_prompt, - llm_response_cache=llm_response_cache, - history_messages=history, - cache_type="extract", - chunk_id=chunk_key, - cache_keys_collector=cache_keys_collector, + f"Gleaning stopped for chunk {chunk_key}: " + f"Input tokens ({gleaning_token_count}) exceeded limit " + f"({max_extract_input_tokens})." 
) + run_gleaning = False - # Process gleaning result separately with file path + if run_gleaning: + glean_result, timestamp = await use_llm_func_with_cache( + entity_continue_extraction_user_prompt, + use_llm_func, + system_prompt=entity_extraction_system_prompt, + llm_response_cache=llm_response_cache, + history_messages=history, + cache_type="extract", + chunk_id=chunk_key, + cache_keys_collector=cache_keys_collector, + response_format=( + {"type": "json_object"} if use_json_extraction else None + ), + llm_cache_identity=get_llm_cache_identity(global_config, "extract"), + ) + + # Process gleaning result with appropriate parser + if use_json_extraction: + glean_nodes, glean_edges = await _process_json_extraction_result( + glean_result, + chunk_key, + timestamp, + file_path, + ) + else: glean_nodes, glean_edges = await _process_extraction_result( glean_result, chunk_key, @@ -3026,46 +3459,108 @@ async def _process_single_content(chunk_key_dp: tuple[str, TextChunkSchema]): completion_delimiter=context_base["completion_delimiter"], ) - # Merge results - compare description lengths to choose better version - for i, (entity_name, glean_entities) in enumerate( - glean_nodes.items(), start=1 - ): - if entity_name in maybe_nodes: - # Compare description lengths and keep the better one - original_desc_len = len( - maybe_nodes[entity_name][0].get("description", "") or "" - ) - glean_desc_len = len( - glean_entities[0].get("description", "") or "" - ) + # Merge results - compare description lengths to choose better version + for i, (entity_name, glean_entities) in enumerate( + glean_nodes.items(), start=1 + ): + if entity_name in maybe_nodes: + # Compare description lengths and keep the better one + original_desc_len = len( + maybe_nodes[entity_name][0].get("description", "") or "" + ) + glean_desc_len = len(glean_entities[0].get("description", "") or "") - if glean_desc_len > original_desc_len: - maybe_nodes[entity_name] = list(glean_entities) - # Otherwise keep original 
version - else: - # New entity from gleaning stage + if glean_desc_len > original_desc_len: maybe_nodes[entity_name] = list(glean_entities) - await _cooperative_yield(i, every=8) + # Otherwise keep original version + else: + # New entity from gleaning stage + maybe_nodes[entity_name] = list(glean_entities) + await _cooperative_yield(i, every=8) - for i, (edge_key, glean_edge_list) in enumerate( - glean_edges.items(), start=1 - ): - if edge_key in maybe_edges: - # Compare description lengths and keep the better one - original_desc_len = len( - maybe_edges[edge_key][0].get("description", "") or "" - ) - glean_desc_len = len( - glean_edge_list[0].get("description", "") or "" - ) + for i, (edge_key, glean_edge_list) in enumerate( + glean_edges.items(), start=1 + ): + if edge_key in maybe_edges: + # Compare description lengths and keep the better one + original_desc_len = len( + maybe_edges[edge_key][0].get("description", "") or "" + ) + glean_desc_len = len( + glean_edge_list[0].get("description", "") or "" + ) - if glean_desc_len > original_desc_len: - maybe_edges[edge_key] = list(glean_edge_list) - # Otherwise keep original version - else: - # New edge from gleaning stage + if glean_desc_len > original_desc_len: maybe_edges[edge_key] = list(glean_edge_list) - await _cooperative_yield(i, every=8) + # Otherwise keep original version + else: + # New edge from gleaning stage + maybe_edges[edge_key] = list(glean_edge_list) + await _cooperative_yield(i, every=8) + + # Inject multimodal entity + associations for drawing/table/equation + # chunks. Placed before update_chunk_cache_list so the per-chunk + # cache write still happens after; placed inside the chunk's + # concurrency slot (rather than the centralized post-pass that used + # to live in utils_pipeline.augment_chunk_results_with_mm_entities) + # so each multimodal chunk benefits from the chunk-level concurrency + # already enforced by extract_entities. 
+ sidecar_block = chunk_dp.get("sidecar") + if isinstance(sidecar_block, dict): + sidecar_type = sidecar_block.get("type") + sidecar_id = sidecar_block.get("id") + if ( + sidecar_type in {"drawing", "table", "equation"} + and isinstance(sidecar_id, str) + and sidecar_id + ): + mm_entity_name = sidecar_id + now_ts = int(time.time()) + mm_nodes_list = maybe_nodes.setdefault(mm_entity_name, []) + mm_nodes_list.append( + { + "entity_name": mm_entity_name, + "entity_type": sidecar_type, + # description == the full multimodal chunk content so + # the extracted entity carries the same grounding + # surface the prompt produced; analyze_multimodal's + # description/name field is already inlined there. + "description": chunk_dp.get("content", "") or "", + "source_id": chunk_key, + "file_path": file_path, + "timestamp": now_ts, + } + ) + heading_block = chunk_dp.get("heading") + heading_label = "unknown" + if isinstance(heading_block, dict): + heading_label = ( + str(heading_block.get("heading") or "").strip() or "unknown" + ) + mm_display_name = _parse_mm_display_name( + chunk_dp.get("content", "") or "", sidecar_id + ) + for tgt in list(maybe_nodes.keys()): + if tgt == mm_entity_name: + continue + edge_key = (mm_entity_name, tgt) + edge_list = maybe_edges.setdefault(edge_key, []) + edge_list.append( + { + "src_id": mm_entity_name, + "tgt_id": tgt, + "weight": 1.0, + "description": ( + f"{tgt} is associated with {sidecar_type} " + f"{mm_display_name} in section {heading_label} " + f'of document "{file_path}"' + ), + "keywords": "associated with, contained in", + "source_id": chunk_key, + "file_path": file_path, + "timestamp": now_ts, + } + ) # Batch update chunk's llm_cache_list with all collected cache keys if cache_keys_collector and text_chunks_storage: @@ -3209,9 +3704,12 @@ async def kg_query( if query_param.model_func: use_model_func = query_param.model_func else: - use_model_func = global_config["llm_model_func"] + use_model_func = 
global_config["role_llm_funcs"]["query"] # Apply higher priority (5) to query relation LLM function use_model_func = partial(use_model_func, _priority=5) + llm_cache_identity = get_llm_cache_identity( + global_config, "query", query_param.model_func + ) hl_keywords, ll_keywords = await get_keywords_from_query( query, query_param, global_config, hashing_kv @@ -3300,6 +3798,8 @@ async def kg_query( ll_keywords_str, query_param.user_prompt or "", query_param.enable_rerank, + "\n\n", + serialize_llm_cache_identity(llm_cache_identity), ) cached_result = await handle_cache( @@ -3313,6 +3813,8 @@ async def kg_query( ) response = cached_response else: + if query_param.model_func: + _warn_deprecated_query_model_func("KG query generation") response = await use_model_func( user_query, system_prompt=sys_prompt, @@ -3396,13 +3898,136 @@ async def get_keywords_from_query( if query_param.hl_keywords or query_param.ll_keywords: return query_param.hl_keywords, query_param.ll_keywords - # Extract keywords using extract_keywords_only function which already supports conversation history + # Extract keywords directly from the current query text. hl_keywords, ll_keywords = await extract_keywords_only( query, query_param, global_config, hashing_kv ) return hl_keywords, ll_keywords +def _normalize_keyword_list(raw_values: Any, field_name: str) -> list[str]: + """Normalize keyword payloads into a clean list of strings. + + When the field is a plain string (e.g. LLM returned CSV), split on + newlines/commas/semicolons. List-shaped payloads are preserved per-item so + multi-word phrases that legitimately contain commas are not broken apart. 
+ """ + + if raw_values is None: + return [] + + if isinstance(raw_values, str): + raw_values = [ + part.strip() + for part in re.split(r"[\n,;]+", raw_values) + if part and part.strip() + ] + + if not isinstance(raw_values, list): + logger.warning( + "Keyword extraction field '%s' is not a list: %r", + field_name, + raw_values, + ) + return [] + + normalized: list[str] = [] + for idx, value in enumerate(raw_values): + if isinstance(value, str): + cleaned = value.strip() + if cleaned: + normalized.append(cleaned) + continue + + logger.warning( + "Keyword extraction field '%s' contains non-string element at index %d: %r", + field_name, + idx, + value, + ) + + return normalized + + +_CODE_FENCE_PATTERN = re.compile( + r"^\s*```(?:json|JSON)?\s*\n?(.*?)\n?\s*```\s*$", re.DOTALL +) + + +def _strip_markdown_code_fence(text: str) -> str: + """Strip a surrounding markdown code fence (```json ... ``` or ``` ... ```). + + Why: LLM training priors strongly associate "JSON output" with fenced code + blocks, so providers routinely wrap responses despite explicit instructions + to the contrary. Stripping here avoids relying on ``json_repair`` and the + noisy warning it emits. 
+ """ + + match = _CODE_FENCE_PATTERN.match(text) + return match.group(1) if match else text + + +def _parse_keywords_payload(result: Any) -> tuple[bool, list[str], list[str]]: + """Parse keyword extraction responses from heterogeneous provider outputs.""" + + payload: Any + + if result is None: + return False, [], [] + + if hasattr(result, "model_dump") and callable(result.model_dump): + payload = result.model_dump() + elif isinstance(result, dict): + payload = result + elif isinstance(result, str): + cleaned_result = remove_think_tags(result) + unfenced_result = _strip_markdown_code_fence(cleaned_result) + if unfenced_result is not cleaned_result: + logger.debug( + "Stripped markdown code fence from keyword extraction response" + ) + cleaned_result = unfenced_result + try: + payload = json.loads(cleaned_result) + except json.JSONDecodeError as strict_error: + try: + payload = json_repair.loads(cleaned_result) + logger.warning( + "Keyword extraction response required JSON repair: %s; response: %r", + strict_error, + cleaned_result[:500], + ) + except Exception as repair_error: + logger.error( + "JSON parsing error: %s; repair failed: %s; response: %r", + strict_error, + repair_error, + cleaned_result[:500], + ) + return False, [], [] + else: + logger.error( + "Unsupported keyword extraction response type: %s", + type(result).__name__, + ) + return False, [], [] + + if not isinstance(payload, dict): + logger.error( + "Keyword extraction payload is not a JSON object: %s", + type(payload).__name__, + ) + return False, [], [] + + hl_keywords = _normalize_keyword_list( + payload.get("high_level_keywords"), "high_level_keywords" + ) + ll_keywords = _normalize_keyword_list( + payload.get("low_level_keywords"), "low_level_keywords" + ) + return True, hl_keywords, ll_keywords + + async def extract_keywords_only( text: str, param: QueryParam, @@ -3418,25 +4043,33 @@ async def extract_keywords_only( # 1. 
Build the examples examples = "\n".join(PROMPTS["keywords_extraction_examples"]) - language = global_config["addon_params"].get("language", DEFAULT_SUMMARY_LANGUAGE) + addon_params = global_config.get("addon_params") or {} + language = global_config.get("_resolved_summary_language") + if language is None: + language = addon_params.get("language", DEFAULT_SUMMARY_LANGUAGE) # 2. Handle cache if needed - add cache type for keywords + llm_cache_identity = get_llm_cache_identity( + global_config, "keyword", param.model_func + ) args_hash = compute_args_hash( param.mode, text, language, + "\n\n", + serialize_llm_cache_identity(llm_cache_identity), ) cached_result = await handle_cache( hashing_kv, args_hash, text, param.mode, cache_type="keywords" ) if cached_result is not None: cached_response, _ = cached_result # Extract content, ignore timestamp - try: - keywords_data = json_repair.loads(cached_response) - return keywords_data.get("high_level_keywords", []), keywords_data.get( - "low_level_keywords", [] - ) - except (json.JSONDecodeError, KeyError): + is_valid_payload, hl_keywords, ll_keywords = _parse_keywords_payload( + cached_response + ) + if is_valid_payload: + return hl_keywords, ll_keywords + else: logger.warning( "Invalid cache format for keywords, proceeding with extraction" ) @@ -3456,28 +4089,17 @@ async def extract_keywords_only( # 4. Call the LLM for keyword extraction if param.model_func: + _warn_deprecated_query_model_func("keyword extraction") use_model_func = param.model_func else: - use_model_func = global_config["llm_model_func"] + use_model_func = global_config["role_llm_funcs"]["keyword"] # Apply higher priority (5) to query relation LLM function use_model_func = partial(use_model_func, _priority=5) - result = await use_model_func(kw_prompt, keyword_extraction=True) + result = await use_model_func(kw_prompt, response_format={"type": "json_object"}) - # 5. 
Parse out JSON from the LLM response - result = remove_think_tags(result) - try: - keywords_data = json_repair.loads(result) - if not keywords_data: - logger.error("No JSON-like structure found in the LLM respond.") - return [], [] - except json.JSONDecodeError as e: - logger.error(f"JSON parsing error: {e}") - logger.error(f"LLM respond: {result}") - return [], [] - - hl_keywords = keywords_data.get("high_level_keywords", []) - ll_keywords = keywords_data.get("low_level_keywords", []) + # 5. Parse out JSON from the LLM response with tolerant provider normalization + _, hl_keywords, ll_keywords = _parse_keywords_payload(result) # 6. Cache only the processed keywords with cache type if hl_keywords or ll_keywords: @@ -3485,7 +4107,7 @@ async def extract_keywords_only( "high_level_keywords": hl_keywords, "low_level_keywords": ll_keywords, } - if hashing_kv.global_config.get("enable_llm_cache"): + if hashing_kv and hashing_kv.global_config.get("enable_llm_cache"): # Save to cache with query parameters queryparam_dict = { "mode": param.mode, @@ -4985,9 +5607,12 @@ async def naive_query( if query_param.model_func: use_model_func = query_param.model_func else: - use_model_func = global_config["llm_model_func"] + use_model_func = global_config["role_llm_funcs"]["query"] # Apply higher priority (5) to query relation LLM function use_model_func = partial(use_model_func, _priority=5) + llm_cache_identity = get_llm_cache_identity( + global_config, "query", query_param.model_func + ) tokenizer: Tokenizer = global_config["tokenizer"] if not tokenizer: @@ -5131,6 +5756,8 @@ async def naive_query( query_param.max_total_tokens, query_param.user_prompt or "", query_param.enable_rerank, + "\n\n", + serialize_llm_cache_identity(llm_cache_identity), ) cached_result = await handle_cache( hashing_kv, args_hash, user_query, query_param.mode, cache_type="query" @@ -5142,6 +5769,8 @@ async def naive_query( ) response = cached_response else: + if query_param.model_func: + 
_warn_deprecated_query_model_func("naive query generation") response = await use_model_func( user_query, system_prompt=sys_prompt, diff --git a/lightrag/parser_cli.py b/lightrag/parser_cli.py new file mode 100644 index 0000000000..1b7c00f42a --- /dev/null +++ b/lightrag/parser_cli.py @@ -0,0 +1,245 @@ +"""Unified sidecar debug CLI for native / mineru / docling parsers. + +Drives ``LightRAG.parse_`` against a single source file and writes +the resulting sidecar (and raw bundle, for mineru/docling) into a flat +layout — no ``__parsed__/`` middle layer, source file never archived — +so the artifacts can be inspected next to the input file. + +Invocation:: + + python -m lightrag.parser_cli path/to/sample.docx --engine native + python -m lightrag.parser_cli path/to/sample.pdf --engine mineru + python -m lightrag.parser_cli path/to/sample.pdf --engine docling --force-reparse + +See ``docs/ParserDebugCLI-zh.md`` for the full reference. +""" + +from __future__ import annotations + +import argparse +import asyncio +import json +import sys +from contextlib import ExitStack +from pathlib import Path +from typing import Any +from unittest import mock + +ENGINES = ("native", "mineru", "docling") + + +def _build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser( + prog="parse_sidecar", + description=( + "Run LightRAG.parse_ on a single file and emit sidecar " + "artifacts (plus a raw bundle for mineru/docling) into a flat " + "layout alongside the source. No __parsed__/ middle layer; the " + "source file is never moved." + ), + ) + parser.add_argument("input_file", type=Path, help="Source file to parse.") + parser.add_argument( + "--engine", + required=True, + choices=ENGINES, + help="Parser engine to drive.", + ) + parser.add_argument( + "-o", + "--sidecar-parent-dir", + type=Path, + default=None, + help=( + "Parent directory for .parsed/ and ._raw/. " + "Default: the source file's parent directory." 
+ ), + ) + parser.add_argument( + "--doc-id", + default=None, + help="Override the doc id. Default: doc-.", + ) + parser.add_argument( + "--force-reparse", + action="store_true", + help=( + "Only affects mineru/docling. By default a non-empty raw_dir is " + "treated as a valid cache and reused without manifest checks; " + "this flag clears raw_dir and forces a fresh download/parse." + ), + ) + parser.add_argument( + "--preview", + type=int, + default=5, + metavar="N", + help="Number of block rows to preview after parsing (0 disables).", + ) + return parser + + +def _print_summary(blocks_path: Path, raw_dir: Path | None, preview: int) -> None: + with blocks_path.open("r", encoding="utf-8") as fh: + meta_line = fh.readline().strip() + if not meta_line: + raise SystemExit(f"empty blocks file at {blocks_path}") + meta = json.loads(meta_line) + rows = [json.loads(line) for line in fh if line.strip()] + parsed_dir = blocks_path.parent + print(f"parsed dir : {parsed_dir} (exists={parsed_dir.exists()})") + if raw_dir is not None: + print(f"raw dir : {raw_dir} (exists={raw_dir.exists()})") + print(f"document : {meta.get('document_name')}") + print(f"doc_id : {meta.get('doc_id')}") + print(f"engine : {meta.get('parse_engine')}") + print(f"blocks : {meta.get('blocks')}") + print( + f"sidecars : tables={meta.get('table_file')} " + f"drawings={meta.get('drawing_file')} " + f"equations={meta.get('equation_file')} " + f"asset_dir={meta.get('asset_dir')}" + ) + if preview > 0 and rows: + shown = min(preview, len(rows)) + print(f"--- preview (first {shown} of {len(rows)} blocks) ---") + for row in rows[:preview]: + heading = row.get("heading") or "" + content = (row.get("content") or "").replace("\n", " ") + snippet = content if len(content) <= 80 else content[:77] + "..." 
+ print( + f" [{row.get('blockid', '')[:8]}] " f"heading={heading!r} :: {snippet}" + ) + + +async def _run(args: argparse.Namespace) -> int: + # Pipeline + heavy parser imports are deferred so ``--help`` and the + # input-file existence check don't pay for them. + from lightrag.constants import ( + FULL_DOCS_FORMAT_PENDING_PARSE, + PARSER_ENGINE_SUFFIX_CAPABILITIES, + ) + from lightrag.parser_debug import build_debug_rag + from lightrag.utils import compute_mdhash_id + import lightrag.pipeline as pipeline_mod + import lightrag.utils_pipeline as utils_pipeline_mod + + source = args.input_file.resolve() + if not source.is_file(): + print(f"error: input file does not exist: {source}", file=sys.stderr) + return 1 + + # Reject suffix/engine mismatches up-front: the pipeline would otherwise + # fail deep inside the IR builder with a less helpful message. + suffix = source.suffix.lstrip(".").lower() + supported = PARSER_ENGINE_SUFFIX_CAPABILITIES.get(args.engine, frozenset()) + if suffix not in supported: + print( + f"error: engine '{args.engine}' does not support .{suffix or ''} " + f"files (supported: {', '.join(sorted(supported))})", + file=sys.stderr, + ) + return 1 + + sidecar_parent = (args.sidecar_parent_dir or source.parent).resolve() + sidecar_parent.mkdir(parents=True, exist_ok=True) + + parsed_dir = sidecar_parent / f"{source.name}.parsed" + raw_dir = ( + sidecar_parent / f"{source.name}.{args.engine}_raw" + if args.engine in ("mineru", "docling") + else None + ) + + doc_id = args.doc_id or compute_mdhash_id(str(source), prefix="doc-") + + def _patched_artifact_dir( + file_path: str | None = None, + *, + parent_hint: Any | None = None, + ) -> Path: + # Flatten the production "/__parsed__/.parsed/" + # layout to "/.parsed/" so the sidecar + # and the source file sit side by side. 
+ return parsed_dir + + def _lenient_bundle(raw_dir_arg: Path, _source_file: Path) -> bool: + return raw_dir_arg.exists() and any(raw_dir_arg.iterdir()) + + def _force_miss(*_args: Any, **_kwargs: Any) -> bool: + return False + + bundle_check = _force_miss if args.force_reparse else _lenient_bundle + + async def _noop_archive(*_args: Any, **_kwargs: Any) -> None: + return None + + rag = build_debug_rag() + parse_method = getattr(rag, f"parse_{args.engine}") + + with ExitStack() as stack: + # Patch 1: redirect sidecar output to the flat layout. + # parsed_artifact_dir_for is from-imported into pipeline at + # module load, so patch both namespaces. + stack.enter_context( + mock.patch.object( + utils_pipeline_mod, + "parsed_artifact_dir_for", + _patched_artifact_dir, + ) + ) + stack.enter_context( + mock.patch.object( + pipeline_mod, + "parsed_artifact_dir_for", + _patched_artifact_dir, + ) + ) + + # Patch 2: raw cache strategy. parse_mineru / parse_docling do a + # function-local ``from lightrag.external_parser. import + # is_bundle_valid``, so we replace the name on the facade module. + if args.engine == "mineru": + import lightrag.external_parser.mineru as mineru_pkg + + stack.enter_context( + mock.patch.object(mineru_pkg, "is_bundle_valid", bundle_check) + ) + elif args.engine == "docling": + import lightrag.external_parser.docling as docling_pkg + + stack.enter_context( + mock.patch.object(docling_pkg, "is_bundle_valid", bundle_check) + ) + + # Patch 3: keep the source file in place. All three parse_* methods + # call archive_docx_source_after_full_docs_sync at the end. 
+ stack.enter_context( + mock.patch.object( + pipeline_mod, + "archive_docx_source_after_full_docs_sync", + _noop_archive, + ) + ) + + result = await parse_method( + doc_id, + str(source), + { + "parse_format": FULL_DOCS_FORMAT_PENDING_PARSE, + "content": "", + }, + ) + + blocks_path = Path(result["blocks_path"]) + _print_summary(blocks_path, raw_dir, args.preview) + return 0 + + +def main(argv: list[str] | None = None) -> int: + args = _build_parser().parse_args(argv) + return asyncio.run(_run(args)) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/lightrag/parser_debug.py b/lightrag/parser_debug.py new file mode 100644 index 0000000000..7140639f08 --- /dev/null +++ b/lightrag/parser_debug.py @@ -0,0 +1,102 @@ +"""Shared debug LightRAG stand-in for the parse_* entry points. + +A minimal ``LightRAG`` stand-in plus a deterministic ``datetime`` shim, +shared by the unified parser debug CLI (``lightrag/parser_cli.py``), +the golden-fixture regen script (``scripts/regen_native_docx_golden.py``), +and the byte-equivalence golden tests +(``tests/native_parser/docx/test_native_docx_golden.py``). + +All three engines (``native`` / ``mineru`` / ``docling``) read the same +``self`` surface (``_persist_parsed_full_docs``, ``_resolve_source_file_for_parser``, +``self.full_docs``, ``self.doc_status``), so a single stand-in covers every +``parse_*`` method — when one of them grows a new dependency, extend +this module rather than copy-pasting parallel stubs into each call site. 
+""" + +from __future__ import annotations + +from datetime import datetime, timezone +from typing import Any + + +class DebugFullDocs: + """In-memory ``full_docs`` shim — captures the persisted record.""" + + def __init__(self) -> None: + self.data: dict[str, Any] = {} + + async def upsert(self, payload: dict[str, Any]) -> None: + self.data.update(payload) + + async def get_by_id(self, doc_id: str) -> Any: + return self.data.get(doc_id) + + async def index_done_callback(self) -> None: + return None + + +class DebugDocStatus: + """No-op ``doc_status`` shim — the parse_* methods never read/write it.""" + + async def get_by_id(self, doc_id: str) -> Any: + return None + + async def upsert(self, data: dict[str, Any]) -> None: + return None + + +def build_debug_rag(): + """Build a minimal LightRAG stand-in that exposes what ``parse_*`` reads. + + The import of ``LightRAG`` is intentionally function-local: deferring + it avoids a circular import when this helper is loaded during package + init (the parser CLI resolves ``lightrag.parser_debug`` before + ``lightrag`` itself is fully bound). + + LightRAG-side attributes the three ``parse_*`` methods read off ``self`` — + every entry MUST be provided by this stand-in, or the debug CLI / golden + tests / regen script will all break in sync: + + - **methods** (rebound from :class:`LightRAG`): + - ``_persist_parsed_full_docs(doc_id, payload)`` — async; touches + ``self.full_docs``. + - ``_resolve_source_file_for_parser(file_path)`` — returns the + on-disk source path. Stubbed to identity here since the CLI / tests + feed an already-resolved path. + - **storages**: + - ``self.full_docs.upsert(...)`` / ``.get_by_id(...)`` / + ``.index_done_callback()`` — :class:`DebugFullDocs` covers all three. + - ``self.doc_status.get_by_id(...)`` / ``.upsert(...)`` — + :class:`DebugDocStatus` covers both. 
+ + When any of the three ``LightRAG.parse_*`` methods grows a new + dependency on ``self``, extend this stand-in (and update the list + above) rather than copy-pasting a parallel stub into the call sites. + """ + from lightrag import LightRAG + + class _DebugRag: + _persist_parsed_full_docs = LightRAG._persist_parsed_full_docs + parse_native = LightRAG.parse_native + parse_mineru = LightRAG.parse_mineru + parse_docling = LightRAG.parse_docling + + def __init__(self) -> None: + self.full_docs = DebugFullDocs() + self.doc_status = DebugDocStatus() + + def _resolve_source_file_for_parser(self, file_path: str) -> str: + return file_path + + return _DebugRag() + + +_FROZEN_NOW = datetime(2026, 1, 1, tzinfo=timezone.utc) + + +class FrozenDateTime(datetime): + """Pin ``datetime.now`` so ``write_sidecar`` stamps a deterministic time.""" + + @classmethod + def now(cls, tz=None): # noqa: D401 + return _FROZEN_NOW if tz is None else _FROZEN_NOW.astimezone(tz) diff --git a/lightrag/parser_routing.py b/lightrag/parser_routing.py new file mode 100644 index 0000000000..ee73388476 --- /dev/null +++ b/lightrag/parser_routing.py @@ -0,0 +1,896 @@ +from __future__ import annotations + +import fnmatch +import os +import re +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +from lightrag.constants import ( + DEFAULT_CHUNK_P_SIZE, + DEFAULT_R_SEPARATORS, + DEFAULT_SENTENCE_SPLIT_REGEX, + FULL_DOCS_FORMAT_LIGHTRAG, + FULL_DOCS_FORMAT_PENDING_PARSE, + FULL_DOCS_FORMAT_RAW, + PARSER_ENGINE_DOCLING, + PARSER_ENGINE_LEGACY, + PARSER_ENGINE_MINERU, + PARSER_ENGINE_NATIVE, + PARSER_ENGINE_SUFFIX_CAPABILITIES, + PROCESS_OPTION_CHUNK_CHARS, + PROCESS_OPTION_CHUNK_FIXED, + PROCESS_OPTION_CHUNK_VECTOR, + PROCESS_OPTION_CHUNK_PARAGRAH, + PROCESS_OPTION_CHUNK_RECURSIVE, + PROCESS_OPTION_EQUATIONS, + PROCESS_OPTION_IMAGES, + PROCESS_OPTION_SKIP_KG, + PROCESS_OPTION_TABLES, + ProcessChunkingOption, + SUPPORTED_PARSER_ENGINES, + SUPPORTED_PROCESS_OPTIONS, +) +from 
lightrag.utils import logger, parse_optional_float + +import json +from collections.abc import Mapping +from copy import deepcopy + +_PARSER_RULE_SPLIT_RE = re.compile(r"[;,]") +_PARSER_ENGINE_ENDPOINT_ENV = { + PARSER_ENGINE_DOCLING: "DOCLING_ENDPOINT", +} +_VALID_MINERU_API_MODES = {"official", "local"} + +# Trailing parser-hint pattern: matches ``.[engine].ext`` at end of basename. +# Group 1 captures the raw engine token (still needs normalize_parser_engine +# and SUPPORTED_PARSER_ENGINES validation); group 2 captures ``.ext`` so it +# can be reattached when stripping the hint. +_PARSER_HINT_RE = re.compile(r"\.\[([^\]]*)\](\.[^.]+)$") + + +class ParserRoutingConfigError(ValueError): + """Raised when LIGHTRAG_PARSER contains an invalid routing rule.""" + + +class FilenameParserHintError(ValueError): + """Raised when a filename parser hint is invalid for ingestion.""" + + +def normalize_parser_engine(engine: Any) -> str: + """Normalize engine hints such as mineru-iet to mineru.""" + return str(engine or "").strip().split("-", 1)[0].lower() + + +# --------------------------------------------------------------------------- +# Per-file processing options (i/t/e/!/F/R/V/P) +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class ProcessOptions: + """Decoded view of a ``process_options`` string. + + The ``raw`` string is preserved verbatim (with duplicates and ordering) + for storage / audit purposes; boolean flags reflect the deduped logical + state used by the pipeline. + """ + + raw: str = "" + images: bool = False + tables: bool = False + equations: bool = False + skip_kg: bool = False + chunking: ProcessChunkingOption = PROCESS_OPTION_CHUNK_FIXED + + @property + def chunking_explicit(self) -> bool: + """True iff ``raw`` actually contains a chunking selector char. + + Distinguishes "user explicitly opted into a chunking strategy" + from "no chunking selector supplied — pipeline used the default". 
+ ``chunking`` itself is unreliable for this question because it + falls back to :data:`PROCESS_OPTION_CHUNK_FIXED` in both cases. + Used by ``process_single_document`` to decide whether to + dispatch via the new file-chunker contract or to honor the + legacy externally-supplied :attr:`LightRAG.chunking_func`. + """ + return any(c in PROCESS_OPTION_CHUNK_CHARS for c in self.raw) + + +_PROCESS_OPTION_DEFAULT = ProcessOptions() + + +def sanitize_process_options(options: Any) -> str: + """Strip non-supported characters / hyphen / whitespace from an options string. + + Returns the raw token sequence as-is (no dedup, no reorder) so the + canonical user intent is preserved on disk. Invalid characters are + silently dropped — the caller is expected to have already validated. + """ + if not options: + return "" + return "".join(ch for ch in str(options) if ch in SUPPORTED_PROCESS_OPTIONS) + + +def validate_process_options( + options: str, *, label: str = "process options" +) -> list[str]: + """Return a list of error messages for an options string; empty if valid.""" + errors: list[str] = [] + if not options: + return errors + seen_chunkers: list[str] = [] + for ch in options: + if ch in (" ", "-"): + continue + if ch not in SUPPORTED_PROCESS_OPTIONS: + errors.append(f"{label} contains unsupported character {ch!r}") + continue + if ch in PROCESS_OPTION_CHUNK_CHARS and ch not in seen_chunkers: + seen_chunkers.append(ch) + if len(seen_chunkers) > 1: + errors.append( + f"{label} specifies multiple chunking modes " + f"({'/'.join(seen_chunkers)}); pick one of " + f"{PROCESS_OPTION_CHUNK_FIXED}/{PROCESS_OPTION_CHUNK_RECURSIVE}/{PROCESS_OPTION_CHUNK_VECTOR}/{PROCESS_OPTION_CHUNK_PARAGRAH}" + ) + return errors + + +def parse_process_options(options: Any) -> ProcessOptions: + """Decode a process-options string into a :class:`ProcessOptions` view.""" + raw = sanitize_process_options(options) + if not raw: + return _PROCESS_OPTION_DEFAULT + chars = set(raw) + chunking: 
ProcessChunkingOption = PROCESS_OPTION_CHUNK_FIXED + # Pick the first chunking selector encountered; validate_process_options + # reports conflicting selectors as errors upstream. + for ch in raw: + if ch in PROCESS_OPTION_CHUNK_CHARS: + chunking = ch # type: ignore[assignment] + break + return ProcessOptions( + raw=raw, + images=PROCESS_OPTION_IMAGES in chars, + tables=PROCESS_OPTION_TABLES in chars, + equations=PROCESS_OPTION_EQUATIONS in chars, + skip_kg=PROCESS_OPTION_SKIP_KG in chars, + chunking=chunking, + ) + + +# --------------------------------------------------------------------------- +# Per-chunker parameter snapshot (chunk_options) — counterpart to the +# F/R/V/P selector in ``ProcessOptions``. ``process_options`` chooses +# the strategy; ``chunk_options`` carries the parameters the chosen +# strategy reads. +# +# Storage shape: the per-document snapshot persisted to +# ``full_docs[doc_id]['chunk_options']`` carries ONLY the sub-dict of +# the chunking strategy selected by ``process_options`` — the other +# strategies' parameters are dropped because they are never consumed +# during processing. Reparsing a document overwrites both +# ``process_options`` and ``chunk_options`` together. +# --------------------------------------------------------------------------- + + +# Strategy selector (F/R/V/P) → snapshot sub-dict key. Single source +# of truth for the slim ``chunk_options`` shape — used by +# :func:`resolve_chunk_options` to pick which strategy block to keep +# and by :func:`slim_chunk_options` to project caller-supplied dicts +# down to the selected strategy. +_CHUNK_STRATEGY_KEYS: dict[str, str] = { + PROCESS_OPTION_CHUNK_FIXED: "fixed_token", + PROCESS_OPTION_CHUNK_RECURSIVE: "recursive_character", + PROCESS_OPTION_CHUNK_VECTOR: "semantic_vector", + PROCESS_OPTION_CHUNK_PARAGRAH: "paragraph_semantic", +} + + +def chunk_strategy_key(process_options: Any) -> str: + """Return the ``chunk_options`` sub-dict key for ``process_options``.
+ + Accepts a raw options string or a :class:`ProcessOptions` value. + Falls back to ``"fixed_token"`` when no chunking selector is + present: F is the default strategy for the file-chunker + dispatcher, and when ``chunking_explicit`` is False the legacy + ``chunking_func`` runs, which defaults to fixed-token chunking + that reads from the same sub-dict. + """ + if isinstance(process_options, ProcessOptions): + strategy = process_options.chunking + else: + strategy = parse_process_options(process_options).chunking + return _CHUNK_STRATEGY_KEYS.get(strategy, "fixed_token") + + +def slim_chunk_options( + snapshot: Mapping[str, Any] | None, + process_options: Any = "", +) -> dict[str, Any]: + """Project a (possibly full) chunker snapshot down to the active strategy. + + Keeps the top-level ``chunk_token_size`` and the one strategy + sub-dict picked by :func:`chunk_strategy_key`; everything else is + discarded. Idempotent: a slim snapshot whose key already matches + ``process_options`` passes through unchanged (deep-copied for + isolation). When the matching strategy block is absent from the + input, an empty dict is used so downstream consumers always see a + dict-shaped slot. + + Strategy-specific default backfill: for ``paragraph_semantic`` we + guarantee a populated ``chunk_token_size`` slot before returning + (caller-supplied value > ``CHUNK_P_SIZE`` env > + ``DEFAULT_CHUNK_P_SIZE``). This is the single chokepoint that + every enqueue path runs through — both the + ``resolve_chunk_options`` path (built from addon_params) AND the + direct ``chunk_options=`` kwarg path (caller supplies the dict) + flow through here, so the backfill cannot be bypassed by runtime + addon_params mutation or by passing an explicit ``chunk_options`` + that omits the P slot. P must NOT inherit the top-level + ``chunk_token_size`` (global ``CHUNK_SIZE`` / legacy ctor) — + paragraph-semantic merging needs more headroom than the global + default.
+ """ + key = chunk_strategy_key(process_options) + src: Mapping[str, Any] = snapshot or {} + result: dict[str, Any] = {} + if "chunk_token_size" in src: + result["chunk_token_size"] = deepcopy(src["chunk_token_size"]) + result[key] = deepcopy(dict(src.get(key) or {})) + if key == "paragraph_semantic" and "chunk_token_size" not in result[key]: + p_size_raw = os.getenv("CHUNK_P_SIZE") + result[key]["chunk_token_size"] = ( + int(p_size_raw) if p_size_raw is not None else DEFAULT_CHUNK_P_SIZE + ) + return result + + +def _env_optional_str(key: str) -> str | None: + """Return the env value as a string, collapsing empty / 'None' to None.""" + raw = os.getenv(key) + if raw is None: + return None + stripped = raw.strip() + if not stripped or stripped.lower() == "none": + return None + return raw + + +def _env_bool(key: str, default: bool = False) -> bool: + raw = os.getenv(key) + if raw is None: + return default + return raw.strip().lower() in ("1", "true", "yes", "on", "t", "y") + + +def default_chunker_config() -> dict[str, Any]: + """Snapshot the **strategy-specific** env-driven defaults for every shipped chunker. + + Builds a per-strategy sub-dict whose keys mirror each strategy's + keyword-only signature (so :func:`resolve_chunk_options` can splat + them straight into the chunker call). + + Provenance / precedence note: this function reads only + *strategy-specific* env vars (``CHUNK_F_OVERLAP_SIZE``, + ``CHUNK_R_SIZE``, ``CHUNK_R_OVERLAP_SIZE``, ``CHUNK_R_SEPARATORS``, + ``CHUNK_V_SIZE``, ``CHUNK_V_*``, ``CHUNK_P_SIZE``, + ``CHUNK_P_OVERLAP_SIZE``, + ``CHUNK_F_SPLIT_BY_CHARACTER``…). 
It does **not** read the legacy + top-level envs ``CHUNK_SIZE`` / ``CHUNK_OVERLAP_SIZE``, and it + deliberately **omits** ``chunk_overlap_token_size`` from a strategy + sub-dict when its own env var is unset — leaving the slot empty is + the signal that lets + :meth:`LightRAG._apply_chunk_size_overlay` apply the legacy + constructor field (``LightRAG(chunk_overlap_token_size=…)``) and + finally the legacy ``CHUNK_OVERLAP_SIZE`` env in that order. Same + rationale for top-level ``chunk_token_size`` — overlay fills it from + ``LightRAG(chunk_token_size=…)`` then ``CHUNK_SIZE`` env. Net + precedence (high → low): ``addon_params`` explicit > strategy env + > legacy ctor field > legacy env. + + Read at instance-creation time via + :func:`lightrag.addon_params.default_addon_params`; users can mutate + ``addon_params['chunker']`` at runtime to change the defaults applied + to subsequently enqueued documents (already-enqueued docs hold a + frozen ``full_docs[doc_id]['chunk_options']`` snapshot). + """ + config: dict[str, Any] = { + "fixed_token": { + "split_by_character": _env_optional_str("CHUNK_F_SPLIT_BY_CHARACTER"), + "split_by_character_only": _env_bool( + "CHUNK_F_SPLIT_BY_CHARACTER_ONLY", False + ), + }, + "recursive_character": { + # Default separators include CJK sentence-ending punctuation + # so Chinese / mixed-language documents split at semantic + # boundaries instead of falling through to character-level + # splitting. See ``constants.DEFAULT_R_SEPARATORS`` for + # cascade order rationale. 
+ "separators": json.loads( + os.getenv("CHUNK_R_SEPARATORS", json.dumps(list(DEFAULT_R_SEPARATORS))) + ), + }, + "semantic_vector": { + "breakpoint_threshold_type": os.getenv( + "CHUNK_V_BREAKPOINT_THRESHOLD_TYPE", "percentile" + ), + "breakpoint_threshold_amount": parse_optional_float( + os.getenv("CHUNK_V_BREAKPOINT_THRESHOLD_AMOUNT") + ), + "buffer_size": int(os.getenv("CHUNK_V_BUFFER_SIZE", "1")), + # Default extends LangChain's English-only sentence splitter + # with CJK terminators so SemanticChunker can actually find + # sentence boundaries on Chinese input. Override per + # deployment if you need a different language mix. + "sentence_split_regex": os.getenv( + "CHUNK_V_SENTENCE_SPLIT_REGEX", DEFAULT_SENTENCE_SPLIT_REGEX + ), + }, + "paragraph_semantic": {}, + } + + # Strategy-specific overlap envs only — leave the slot absent when + # unset so overlay can detect provenance and fill from the legacy + # tier (constructor field → CHUNK_OVERLAP_SIZE env). + f_overlap_raw = os.getenv("CHUNK_F_OVERLAP_SIZE") + if f_overlap_raw is not None: + config["fixed_token"]["chunk_overlap_token_size"] = int(f_overlap_raw) + r_overlap_raw = os.getenv("CHUNK_R_OVERLAP_SIZE") + if r_overlap_raw is not None: + config["recursive_character"]["chunk_overlap_token_size"] = int(r_overlap_raw) + p_overlap_raw = os.getenv("CHUNK_P_OVERLAP_SIZE") + if p_overlap_raw is not None: + config["paragraph_semantic"]["chunk_overlap_token_size"] = int(p_overlap_raw) + + # P strategy carries its own ``chunk_token_size`` override so the + # paragraph-semantic merge target can diverge from the global + # ``CHUNK_SIZE`` (e.g. heading-aligned chunks may want a larger + # ceiling). 
Unlike R/V, the slot is ALWAYS populated — when + # ``CHUNK_P_SIZE`` is unset we use ``DEFAULT_CHUNK_P_SIZE`` (2000) + # rather than letting the dispatcher fall back to the global + # ``CHUNK_SIZE`` (1200): paragraph-semantic merging needs more + # headroom than the global default to keep related paragraphs + # together, and silently inheriting the smaller global ceiling + # defeats the strategy's purpose. + p_size_raw = os.getenv("CHUNK_P_SIZE") + config["paragraph_semantic"]["chunk_token_size"] = ( + int(p_size_raw) if p_size_raw is not None else DEFAULT_CHUNK_P_SIZE + ) + + # R/V strategies also carry optional ``chunk_token_size`` overrides + # (recursive character splitting may want a smaller target, + # semantic-vector clustering a larger advisory ceiling). Unlike P, + # the slot stays absent when the env var is unset. + r_size_raw = os.getenv("CHUNK_R_SIZE") + if r_size_raw is not None: + config["recursive_character"]["chunk_token_size"] = int(r_size_raw) + v_size_raw = os.getenv("CHUNK_V_SIZE") + if v_size_raw is not None: + config["semantic_vector"]["chunk_token_size"] = int(v_size_raw) + + return config + + +def resolve_chunk_options( + addon_params: Mapping[str, Any] | None, + *, + process_options: Any = "", + split_by_character: str | None = None, + split_by_character_only: bool = False, +) -> dict[str, Any]: + """Build a per-document slim ``chunk_options`` snapshot. + + Reads the chunker config from ``addon_params['chunker']``, falling + back to a freshly built :func:`default_chunker_config` when the + addon-params mapping is missing or hasn't been populated, then + keeps only the parameters of the strategy selected by + ``process_options`` (the other strategies' sub-dicts are dropped — + they would never be consumed during processing). See + :func:`slim_chunk_options` for the projection rules and + :func:`chunk_strategy_key` for the strategy → sub-dict mapping + (default F → ``fixed_token``).
+ + The F runtime args from ``LightRAG.ainsert`` overlay the + ``fixed_token`` sub-dict when (and only when) the active strategy + is F — for R/V/P these args have no slot to land in and are + silently dropped: + + - ``split_by_character`` overrides the env when **non-None**. + ``None`` (signature default) means "use the env / addon_params + default". + - ``split_by_character_only`` overrides the env when **True**. + ``False`` (signature default) means "use the env / addon_params + default": a keyword default of ``False`` cannot distinguish + "unset" from "explicit False", so the env wins + unless the caller actively opts in. + + The returned snapshot is an independent deep copy: mutating it has + no effect on subsequent resolutions. + """ + src: Mapping[str, Any] | None = None + if isinstance(addon_params, Mapping): + candidate = addon_params.get("chunker") + if isinstance(candidate, Mapping): + src = candidate + if src is None: + src = default_chunker_config() + + snapshot = slim_chunk_options(src, process_options) + if chunk_strategy_key(process_options) == "fixed_token": + fixed = snapshot["fixed_token"] + if split_by_character is not None: + fixed["split_by_character"] = split_by_character + if split_by_character_only: + fixed["split_by_character_only"] = True + # P-strategy ``chunk_token_size`` backfill lives in + # ``slim_chunk_options`` — that's the single chokepoint shared by + # every enqueue path (this function AND the direct + # ``chunk_options=`` kwarg path in ``_chunk_options_at``). + return snapshot + + +def split_engine_and_options(bracket_inner: str) -> tuple[str | None, str]: + """Decompose a bracket-hint inner string into ``(engine, options)``. + + Format rules (see docs/FileProcessingPipeline-zh.md): + - ``ENGINE-OPTIONS``: first ``-``-separated segment is the engine + candidate; the remainder is the options string. + - ``ENGINE``: matches a supported engine name as a whole.
+ - ``-OPTIONS``: leading ``-`` marks an options-only hint. + """ + inner = (bracket_inner or "").strip() + if not inner: + return None, "" + + if inner.startswith("-"): + return None, inner[1:].strip() + + if "-" in inner: + head, _, tail = inner.partition("-") + engine_candidate = normalize_parser_engine(head) + if engine_candidate in SUPPORTED_PARSER_ENGINES: + return engine_candidate, tail.strip() + return None, "" + + engine_candidate = normalize_parser_engine(inner) + if engine_candidate in SUPPORTED_PARSER_ENGINES: + return engine_candidate, "" + return None, "" + + +def parser_suffix(file_path: str | Path) -> str: + return Path(file_path).suffix.lower().lstrip(".") + + +def parser_engine_supports_suffix(engine: str, suffix: str) -> bool: + return suffix.lower().lstrip(".") in PARSER_ENGINE_SUFFIX_CAPABILITIES.get( + engine, frozenset() + ) + + +def parser_engine_endpoint_configured(engine: str) -> bool: + if engine == PARSER_ENGINE_MINERU: + mode = os.getenv("MINERU_API_MODE", "local").strip().lower() + if mode == "official": + return bool(os.getenv("MINERU_API_TOKEN", "").strip()) + if mode == "local": + return bool(os.getenv("MINERU_LOCAL_ENDPOINT", "").strip()) + return False + endpoint_env = _PARSER_ENGINE_ENDPOINT_ENV.get(engine) + if endpoint_env: + return bool(os.getenv(endpoint_env, "").strip()) + return True + + +def parser_engine_endpoint_requirement(engine: str) -> str | None: + if engine == PARSER_ENGINE_MINERU: + mode = os.getenv("MINERU_API_MODE", "local").strip().lower() + if mode == "official": + return "MINERU_API_TOKEN" + if mode == "local": + return "MINERU_LOCAL_ENDPOINT" + allowed = ", ".join(sorted(_VALID_MINERU_API_MODES)) + return f"valid MINERU_API_MODE ({allowed})" + return _PARSER_ENGINE_ENDPOINT_ENV.get(engine) + + +def _engine_is_usable( + engine: str, + suffix: str, + *, + require_external_endpoint: bool, +) -> bool: + if engine not in SUPPORTED_PARSER_ENGINES: + return False + if not parser_engine_supports_suffix(engine, 
suffix): + return False + if require_external_endpoint and not parser_engine_endpoint_configured(engine): + return False + return True + + +def _filename_hint_match( + file_path: str | Path, +) -> tuple[re.Match[str], str, str] | None: + """Locate a supported ``[hint]`` segment in a basename. + + Returns ``(match, engine_or_empty, options)`` when the bracket inner is a + recognised hint per the spec; otherwise ``None``. This low-level helper + stays non-throwing because scan grouping and basename canonicalization need + a best-effort classifier. Ingestion entrypoints must call + :func:`resolve_file_parser_directives`, which validates malformed hints and + raises instead of falling back. + """ + basename = Path(file_path).name + m = _PARSER_HINT_RE.search(basename) + if not m: + return None + inner = m.group(1).strip() + if inner.startswith("-") and not inner[1:].strip(): + return None + if ( + "-" in inner + and not inner.startswith("-") + and not inner.partition("-")[2].strip() + ): + return None + engine, options = split_engine_and_options(inner) + if options: + option_errors = validate_process_options(options) + if option_errors: + logger.warning( + f"[parser_routing] ignoring filename hint {m.group(0)!r} in " + f"{basename!r}: {'; '.join(option_errors)}" + ) + return None + if engine in SUPPORTED_PARSER_ENGINES: + return m, engine, options + if engine is None and options: + return m, "", options + return None + + +def _validate_filename_hint_for_resolution( + file_path: str | Path, + *, + require_external_endpoint: bool, +) -> None: + """Fail fast for malformed filename hints on ingestion entrypoints.""" + basename = Path(file_path).name + m = _PARSER_HINT_RE.search(basename) + if not m: + return + + inner = m.group(1) + errors: list[str] = [] + + if not inner.strip(): + errors.append(f"filename hint {m.group(0)!r} is empty") + raise FilenameParserHintError( + f"Invalid filename parser hint in {basename!r}: " + "; ".join(errors) + ) + + engine: str | None = 
None + options = "" + + if inner.startswith("-"): + options = inner[1:].strip() + if not options: + errors.append(f"filename hint {m.group(0)!r} has empty process options") + else: + errors.extend( + validate_process_options( + options, + label=f"filename hint {m.group(0)!r} options", + ) + ) + elif "-" in inner: + engine_name, _, options = inner.partition("-") + engine = normalize_parser_engine(engine_name) + if engine not in SUPPORTED_PARSER_ENGINES: + supported = ", ".join(sorted(SUPPORTED_PARSER_ENGINES)) + errors.append( + f"filename hint {m.group(0)!r} uses unsupported parser engine " + f"{engine_name.strip()!r}; supported engines: {supported}" + ) + elif not options.strip(): + errors.append(f"filename hint {m.group(0)!r} has empty process options") + else: + errors.extend( + validate_process_options( + options, + label=f"filename hint {m.group(0)!r} options", + ) + ) + else: + engine = normalize_parser_engine(inner) + if engine not in SUPPORTED_PARSER_ENGINES: + supported = ", ".join(sorted(SUPPORTED_PARSER_ENGINES)) + message = ( + f"filename hint {m.group(0)!r} uses unsupported parser engine " + f"{inner.strip()!r}; supported engines: {supported}" + ) + if all(ch in SUPPORTED_PROCESS_OPTIONS or ch == " " for ch in inner): + message += ( + "; options-only filename hints must start with '-' " + f"(use '[-{inner.strip()}]' instead)" + ) + errors.append(message) + + if engine in SUPPORTED_PARSER_ENGINES: + suffix = parser_suffix(file_path) + if not parser_engine_supports_suffix(engine, suffix): + supported_suffixes = ", ".join( + sorted(PARSER_ENGINE_SUFFIX_CAPABILITIES.get(engine, frozenset())) + ) + errors.append( + f"filename hint {m.group(0)!r} uses parser engine {engine!r} " + f"for unsupported suffix {suffix!r}; supported suffixes: " + f"{supported_suffixes}" + ) + endpoint_req = parser_engine_endpoint_requirement(engine) + if ( + require_external_endpoint + and endpoint_req + and not parser_engine_endpoint_configured(engine) + ): + errors.append( + 
f"filename hint {m.group(0)!r} requires {endpoint_req} " + "to be configured" + ) + + if errors: + raise FilenameParserHintError( + f"Invalid filename parser hint in {basename!r}: " + "; ".join(errors) + ) + + +def filename_parser_hint(file_path: str | Path) -> str | None: + """Return the engine inferred from a filename hint, or ``None``.""" + found = _filename_hint_match(file_path) + if not found: + return None + _, engine, _ = found + return engine or None + + +def filename_process_options(file_path: str | Path) -> str: + """Return the raw process-options string from a filename hint.""" + found = _filename_hint_match(file_path) + if not found: + return "" + return found[2] + + +def filename_parser_directives(file_path: str | Path) -> tuple[str | None, str]: + """Return ``(engine, options)`` decoded from a filename hint.""" + found = _filename_hint_match(file_path) + if not found: + return None, "" + _, engine, options = found + return (engine or None), options + + +def canonicalize_parser_hinted_basename(file_path: str | Path) -> str: + """Return basename with a supported parser hint removed. + + Only the final ``.[engine].ext`` (or ``.[engine-options].ext`` / + ``.[-options].ext``) segment is stripped, exactly once, and only when the + bracket content is a recognised hint. Nested hints such as + ``name.[native].[mineru].pdf`` therefore become ``name.[native].pdf`` — + additional outer hints are not unwrapped. 
+ """ + basename = Path(file_path).name + found = _filename_hint_match(file_path) + if not found: + return basename + m, _, _ = found + return f"{basename[: m.start()]}{m.group(2)}" + + +def parser_rules_from_env() -> str: + return os.getenv("LIGHTRAG_PARSER", "").strip() + + +def _iter_parser_rule_items(rules: str) -> list[tuple[int, str]]: + return [ + (index, item.strip()) + for index, item in enumerate(_PARSER_RULE_SPLIT_RE.split(rules), start=1) + if item.strip() + ] + + +def _rule_pattern_matches_engine_capability(pattern: str, engine: str) -> bool: + supported_suffixes = PARSER_ENGINE_SUFFIX_CAPABILITIES.get(engine, frozenset()) + return any(fnmatch.fnmatch(suffix, pattern) for suffix in supported_suffixes) + + +def _rule_engine_and_options(engine_hint: str) -> tuple[str, str]: + """Split a ``LIGHTRAG_PARSER`` rule's RHS (``engine[-options]``). + + Returns ``(normalized_engine, options_str)``. Unlike the filename hint + splitter this always treats the first ``-`` as the engine/options + boundary, since ``LIGHTRAG_PARSER`` rules cannot be options-only. + """ + head, _, tail = engine_hint.partition("-") + return normalize_parser_engine(head), tail.strip() + + +def validate_parser_routing_config(parser_rules: str | None = None) -> None: + """Validate LIGHTRAG_PARSER syntax and required external parser endpoints.""" + rules = parser_rules_from_env() if parser_rules is None else parser_rules.strip() + if not rules: + return + + errors: list[str] = [] + for index, item in _iter_parser_rule_items(rules): + label = f"rule {index} ({item!r})" + if ":" not in item: + errors.append(f"{label} must use ':'") + continue + + pattern, engine_hint = item.split(":", 1) + pattern = pattern.strip().lower() + engine_hint = engine_hint.strip() + engine, options_str = _rule_engine_and_options(engine_hint) + + if not pattern: + errors.append(f"{label} has an empty suffix pattern") + continue + if "." 
in pattern: + errors.append( + f"{label} matches suffixes without dots; use 'pdf', not '*.pdf'" + ) + continue + if not engine_hint: + errors.append(f"{label} has an empty parser engine") + continue + if engine not in SUPPORTED_PARSER_ENGINES: + supported = ", ".join(sorted(SUPPORTED_PARSER_ENGINES)) + errors.append( + f"{label} uses unsupported parser engine {engine_hint!r}; " + f"supported engines: {supported}" + ) + continue + if not _rule_pattern_matches_engine_capability(pattern, engine): + supported_suffixes = ", ".join( + sorted(PARSER_ENGINE_SUFFIX_CAPABILITIES.get(engine, frozenset())) + ) + errors.append( + f"{label} does not match any suffix supported by {engine}; " + f"supported suffixes: {supported_suffixes}" + ) + endpoint_req = parser_engine_endpoint_requirement(engine) + if endpoint_req and not parser_engine_endpoint_configured(engine): + errors.append(f"{label} requires {endpoint_req} to be configured") + if options_str: + errors.extend( + f"{label}: {msg}" + for msg in validate_process_options( + options_str, label="process options" + ) + ) + + if errors: + raise ParserRoutingConfigError( + "Invalid LIGHTRAG_PARSER configuration: " + "; ".join(errors) + ) + + +def _matching_rule_directives( + file_path: str | Path, + *, + parser_rules: str | None, + require_external_endpoint: bool, +) -> tuple[str | None, str]: + """Find the first matching ``LIGHTRAG_PARSER`` rule for ``file_path``. + + Returns ``(engine, options_str)`` where ``engine`` is ``None`` when no + usable rule is found. ``options_str`` is empty when a rule matched but + has no ``-options`` suffix. 
+ """ + suffix = parser_suffix(file_path) + rules = parser_rules_from_env() if parser_rules is None else parser_rules.strip() + if not rules: + return None, "" + for _, item in _iter_parser_rule_items(rules): + if ":" not in item: + continue + pattern, engine_hint = item.split(":", 1) + pattern = pattern.strip().lower() + engine, options_str = _rule_engine_and_options(engine_hint.strip()) + if not fnmatch.fnmatch(suffix, pattern): + continue + if _engine_is_usable( + engine, + suffix, + require_external_endpoint=require_external_endpoint, + ): + return engine, options_str + return None, "" + + +def resolve_file_parser_engine( + file_path: str | Path, + *, + parser_rules: str | None = None, + require_external_endpoint: bool = True, +) -> str: + """Resolve the extraction engine for a source file before content extraction.""" + engine, _ = resolve_file_parser_directives( + file_path, + parser_rules=parser_rules, + require_external_endpoint=require_external_endpoint, + ) + return engine + + +def resolve_file_parser_directives( + file_path: str | Path, + *, + parser_rules: str | None = None, + require_external_endpoint: bool = True, +) -> tuple[str, str]: + """Resolve ``(engine, process_options)`` for a source file before extraction. + + Resolution order (mirrors :func:`resolve_file_parser_engine`): + 1. Filename ``[hint]`` — engine and / or options take precedence. + 2. ``LIGHTRAG_PARSER`` rules — first matching rule provides defaults + for whichever of engine / options the filename hint did not + specify. + 3. Default engine ``legacy`` with empty options. + """ + suffix = parser_suffix(file_path) + _validate_filename_hint_for_resolution( + file_path, + require_external_endpoint=require_external_endpoint, + ) + + hinted_engine, hinted_options = filename_parser_directives(file_path) + if hinted_engine and not _engine_is_usable( + hinted_engine, suffix, require_external_endpoint=require_external_endpoint + ): + # Hinted engine cannot handle this file (e.g. 
wrong suffix or missing + # endpoint); fall back to rule-based resolution but keep the hinted + # options if any. + hinted_engine = None + + rule_engine, rule_options = _matching_rule_directives( + file_path, + parser_rules=parser_rules, + require_external_endpoint=require_external_endpoint, + ) + + engine = hinted_engine or rule_engine or PARSER_ENGINE_LEGACY + options_str = hinted_options or rule_options + return engine, sanitize_process_options(options_str) + + +def resolve_stored_document_parser_engine( + file_path: str | Path, + content_data: dict[str, Any] | None, +) -> str: + """Resolve parser engine for a full_docs row during pipeline processing.""" + if content_data: + doc_format = content_data.get("parse_format", FULL_DOCS_FORMAT_RAW) + if doc_format == FULL_DOCS_FORMAT_LIGHTRAG and content_data.get( + "sidecar_location" + ): + return PARSER_ENGINE_NATIVE + if doc_format != FULL_DOCS_FORMAT_PENDING_PARSE: + return PARSER_ENGINE_LEGACY + + suffix = parser_suffix(file_path) + pending_engine = normalize_parser_engine(content_data.get("parse_engine")) + if pending_engine in SUPPORTED_PARSER_ENGINES and parser_engine_supports_suffix( + pending_engine, suffix + ): + return pending_engine + + return resolve_file_parser_engine(file_path) diff --git a/lightrag/pipeline.py b/lightrag/pipeline.py new file mode 100644 index 0000000000..fdc8b6f919 --- /dev/null +++ b/lightrag/pipeline.py @@ -0,0 +1,4486 @@ +"""Document ingestion pipeline mixin for the LightRAG class. + +This module isolates the document parse/enqueue/extraction pipeline so that +``lightrag.py`` stays focused on storage management, querying, and editing. +The mixin is wired into :class:`lightrag.LightRAG` via multiple inheritance +and relies on attributes/methods that the main class provides +(``self.full_docs``, ``self.doc_status``, ``self.tokenizer``, +``self.parse_native``-related fields, ``self._insert_done``, +``self._process_extract_entities``, etc.). 
+""" + +from __future__ import annotations + +import asyncio +import base64 +import hashlib +import inspect +import json + +import json_repair +import mimetypes +import os +import re +import shutil +import time +import traceback +from dataclasses import dataclass +from datetime import datetime, timezone +from pathlib import Path +from typing import Any + +from lightrag.base import DocProcessingStatus, DocStatus +from lightrag.constants import ( + FULL_DOCS_FORMAT_LIGHTRAG, + FULL_DOCS_FORMAT_PENDING_PARSE, + FULL_DOCS_FORMAT_RAW, + PARSED_DIR_NAME, + PARSER_ENGINE_DOCLING, + PARSER_ENGINE_MINERU, + PARSER_ENGINE_NATIVE, +) +from lightrag.exceptions import MultimodalAnalysisError, PipelineCancelledException +from lightrag.kg.shared_storage import get_namespace_data, get_namespace_lock +from lightrag.operate import merge_nodes_and_edges +from lightrag.parser_routing import ( + resolve_file_parser_directives, + resolve_stored_document_parser_engine, +) +from lightrag.utils import ( + CacheData, + _serialize_cache_variant, + compute_args_hash, + compute_mdhash_id, + enforce_chunk_token_limit_before_embedding, + generate_cache_key, + generate_track_id, + get_content_summary, + get_env_value, + get_llm_cache_identity, + handle_cache, + logger, + sanitize_text_for_encoding, + save_to_cache, + serialize_llm_cache_identity, +) +from lightrag.utils_pipeline import ( + archive_docx_source_after_full_docs_sync, + archive_source_after_full_docs_sync, + build_chunks_dict_from_chunking_result, + chunk_fields_from_status_doc, + compute_text_content_hash, + doc_status_field, + doc_status_transition_metadata, + get_duplicate_doc_by_content_hash, + get_existing_doc_by_content_hash, + get_existing_doc_by_file_basename, + has_known_document_source, + input_dir_path, + load_lightrag_document_content, + make_lightrag_doc_content, + normalize_document_file_path, + parsed_artifact_dir_for, + resolve_doc_file_path, + sidecar_blocks_path, + sidecar_uri_for, + strip_lightrag_doc_prefix, +) + 
+ +# Document statuses the pipeline considers "in-flight or pending" — used by +# both the initial snapshot and every refetch after a request_pending +# continuation. Module-level so we don't reconstruct the list on every +# pipeline entry. +_INFLIGHT_DOC_STATUSES = ( + DocStatus.PROCESSING, + DocStatus.FAILED, + DocStatus.PENDING, + DocStatus.PARSING, + DocStatus.ANALYZING, +) + + +def _call_source_file_resolver( + owner: Any, + file_path: str, + *, + source_file_name: str | None = None, + parser_engine: str | None = None, +) -> str: + """Call parser source resolver while tolerating legacy test doubles.""" + resolver = owner._resolve_source_file_for_parser + params = inspect.signature(resolver).parameters + supports_context = "source_file_name" in params or any( + param.kind == inspect.Parameter.VAR_KEYWORD for param in params.values() + ) + if supports_context: + return resolver( + file_path, + source_file_name=source_file_name, + parser_engine=parser_engine, + ) + return resolver(source_file_name or file_path) + + +# Map ``process_options.chunking`` selector → ``extraction_meta.chunking_method`` +# string used by the pipeline observability layer and the resume path. +_CHUNKING_METHOD_LABELS: dict[str, str] = { + "F": "fixed_token", + "R": "recursive_character", + "V": "semantic_vector", + "P": "paragraph_semantic", +} + + +_CHUNK_LOG_KEY_ALIASES: dict[str, str] = { + "chunk_overlap_token_size": "overlap", + "breakpoint_threshold_type": "break", + "breakpoint_threshold_amount": "amount", + "buffer_size": "buf", + "split_by_character": "split_by", + "split_by_character_only": "split_only", + "separators": "seps", + "sentence_split_regex": "regex", +} + + +def _format_chunking_params( + chunk_size: int, + params: dict[str, Any], +) -> str: + """Format the ``size=..., key=value, ...`` portion shared by the chunking + start log line and ``doc_status.metadata['chunk_opts']``. 
+ + Drops keys with ``None``/empty values so the line stays scannable; + callers pass the strategy-specific kwargs they're about to splat + into the chunker so the output mirrors the actual call. Long keys are + aliased to short forms via ``_CHUNK_LOG_KEY_ALIASES``. + """ + pieces = [f"size={chunk_size}"] + for key, value in params.items(): + if value is None: + continue + if isinstance(value, (list, dict, str)) and len(value) == 0: + continue + short = _CHUNK_LOG_KEY_ALIASES.get(key, key) + pieces.append( + f"{short}={value!r}" if isinstance(value, str) else f"{short}={value}" + ) + return ", ".join(pieces) + + +@dataclass +class _BatchRunContext: + """Per-batch shared state for the parse/analyze/process worker pipeline. + + Bundles the cross-cutting handles (pipeline_status, locks, queues, + semaphore) so worker methods accept a single ``ctx`` argument instead of + ~8 individually plumbed parameters. ``processed_count`` mutates inside + each batch and is always read/written under ``pipeline_status_lock``. + """ + + pipeline_status: dict + pipeline_status_lock: Any + semaphore: asyncio.Semaphore + total_files: int + q_native: asyncio.Queue + q_mineru: asyncio.Queue + q_docling: asyncio.Queue + q_analyze: asyncio.Queue + q_process: asyncio.Queue + processed_count: int = 0 + + +class _PipelineMixin: + """Mixin providing document ingestion pipeline methods for LightRAG. + + Designed to be combined as a base of LightRAG only. Relies on + LightRAG-provided attributes (``self.full_docs``, ``self.doc_status``, + ``self.tokenizer``, ``self.parser_*``, ``self.workspace`` ...) and on the + shared methods ``self._insert_done`` / ``self._process_extract_entities`` + which remain in the main class and are resolved through MRO. 
+ """ + + # ============================================================ + # Public document ingestion API (entry points) + # ============================================================ + + async def apipeline_enqueue_documents( + self, + input: str | list[str], + ids: list[str] | None = None, + file_paths: str | list[str] | None = None, + track_id: str | None = None, + docs_format: str = FULL_DOCS_FORMAT_RAW, + lightrag_document_paths: str | list[str] | None = None, + parse_engine: str | list[str] | None = None, + process_options: str | list[str] | None = None, + chunk_options: dict | list[dict] | None = None, + from_scan: bool = False, + ) -> str: + """ + Pipeline for Processing Documents + + 1. Validate ids if provided or generate MD5 hash IDs and remove duplicate contents (skip content dedup when format is lightrag) + 2. Generate document initial status + 3. Filter out already processed documents + 4. Enqueue document in status + + Args: + input: Single document string or list of document strings (can be empty when docs_format is lightrag) + ids: list of unique document IDs, if not provided, MD5 hash IDs will be generated (from content or file_path when lightrag) + file_paths: list of file paths corresponding to each document, used for citation + track_id: tracking ID for monitoring processing status + docs_format: "raw" (default) or "lightrag"; when "lightrag" content may be empty and content-dedup is skipped + lightrag_document_paths: paths to LightRAG Document (e.g. .blocks.jsonl dir or base path), when docs_format is lightrag + parse_engine: file extraction engine already used or target engine for pending_parse + process_options: per-document processing options string (i/t/e/!/F/R/V/P); + accepted as a single string broadcast to every input or as a list + aligned with ``input``. Stored verbatim on ``full_docs`` and + mirrored to ``doc_status.metadata['process_options']``. + chunk_options: per-document chunker parameter snapshot. 
+ Accepted as ``dict`` (broadcast to every input) or + ``list[dict]`` (aligned with ``input``). When ``None``, + each doc's snapshot is built via + :func:`lightrag.parser_routing.resolve_chunk_options` + from ``self.addon_params['chunker']``. Persisted to + ``full_docs[doc_id]['chunk_options']`` and consumed by + :meth:`process_single_document` to drive the file + chunkers (F / R / V / P). Callers that need to bake + F-strategy runtime args (``split_by_character`` / + ``split_by_character_only``) into the snapshot — e.g. + :meth:`LightRAG.ainsert` — should call + :func:`resolve_chunk_options` themselves and pass the + result here; this function is intentionally chunker- + config agnostic. See + ``docs/FileProcessingConfiguration-zh.md`` for the schema. + from_scan: when True, the caller is the scan-owned background task + that already holds ``pipeline_status["scanning"]``. Scan + does additional doc_status reads during its classification + phase (PROCESSED detection, FAILED-stub deletion, etc.) + so external writers are blocked via + ``scanning_exclusive``. Scan's own enqueues happen in + its processing phase, after classification has cleared + ``scanning_exclusive``, but ``from_scan=True`` is still + forwarded as a defence-in-depth bypass so an unexpected + scan-owned write inside the classification window is + allowed through. External callers must leave this False. + + Returns: + str: tracking ID for monitoring processing status + + Raises: + RuntimeError: if a scan is in progress (and ``from_scan`` is + False), or if a destructive job (clear / delete) is in + flight. Concurrent indexing (``busy=True`` from the + processing loop) is permitted — the running loop is + notified via ``request_pending`` and picks up the + newly-enqueued doc after its current batch finishes. 
+ """ + # Concurrency contract: enqueue may proceed concurrently with the + # processing loop because (a) full_docs is upserted before + # doc_status, so a consistency check never sees a ghost row, and + # (b) the running loop re-queries doc_status by status after each + # batch and sets ``request_pending`` whenever new work arrives + # while busy. Two states still block enqueue: + # * ``scanning_exclusive`` — scan task is in its CLASSIFICATION + # phase, reading doc_status to classify files and possibly + # deleting stale stubs. Concurrent enqueue would race + # against scan's reads / mutations. ``from_scan=True`` + # lifts this guard for the scan task's own enqueues. + # ``scanning`` alone (the processing phase) does NOT block, + # identical to the upload-during-busy case. + # * ``destructive_busy`` — clear / delete is dropping storages + # or removing input files; a concurrent write would be + # silently clobbered. + pipeline_status = await get_namespace_data( + "pipeline_status", workspace=self.workspace + ) + pipeline_status_lock = get_namespace_lock( + "pipeline_status", workspace=self.workspace + ) + async with pipeline_status_lock: + if not from_scan and pipeline_status.get("scanning_exclusive"): + raise RuntimeError( + "Cannot enqueue while scan is classifying files; " + "wait for the classification phase to finish " + "before retrying." + ) + if pipeline_status.get("destructive_busy"): + raise RuntimeError( + "Cannot enqueue while pipeline is clearing or " + "deleting documents; wait for the running job to " + "finish before retrying." 
+ ) + + # Generate track_id if not provided + if track_id is None or track_id.strip() == "": + track_id = generate_track_id("enqueue") + if isinstance(input, str): + input = [input] + if isinstance(ids, str): + ids = [ids] + if isinstance(file_paths, str): + file_paths = [file_paths] + if isinstance(lightrag_document_paths, str): + lightrag_document_paths = ( + [lightrag_document_paths] if lightrag_document_paths else None + ) + if isinstance(parse_engine, str): + parse_engine = [parse_engine] * len(input) + if isinstance(process_options, str): + process_options = [process_options] * len(input) + if isinstance(chunk_options, dict): + chunk_options = [chunk_options] * len(input) + + # If file_paths is provided, ensure it matches the number of documents + if file_paths is not None: + if isinstance(file_paths, str): + file_paths = [file_paths] + if len(file_paths) != len(input): + raise ValueError( + "Number of file paths must match the number of documents" + ) + file_paths = [ + path.strip() if isinstance(path, str) else "" for path in file_paths + ] + file_paths = [path if path else "unknown_source" for path in file_paths] + else: + file_paths = ["unknown_source"] * len(input) + + is_lightrag_format = docs_format == FULL_DOCS_FORMAT_LIGHTRAG + if is_lightrag_format and lightrag_document_paths is not None: + if len(lightrag_document_paths) != len(input): + raise ValueError( + "Number of lightrag_document_paths must match the number of documents" + ) + if parse_engine is not None and len(parse_engine) != len(input): + raise ValueError( + "Number of parse engines must match the number of documents" + ) + if process_options is not None and len(process_options) != len(input): + raise ValueError( + "Number of process options must match the number of documents" + ) + if chunk_options is not None and len(chunk_options) != len(input): + raise ValueError( + "Number of chunk_options dicts must match the number of documents" + ) + + def _parse_engine_at(index: int) -> str | 
None: + if parse_engine is None: + return None + engine = str(parse_engine[index] or "").strip().lower() + return engine or None + + def _process_options_at(index: int) -> str: + if process_options is None: + return "" + from lightrag.parser_routing import sanitize_process_options + + return sanitize_process_options(process_options[index]) + + def _chunk_options_at(index: int) -> dict[str, Any]: + """Resolve the per-doc slim chunk_options snapshot. + + Projects the chunker config down to the one strategy + sub-dict selected by the doc's ``process_options`` (F by + default) — the persisted ``full_docs[doc_id]['chunk_options']`` + carries only the params actually consumed at process time. + + When the caller supplied ``chunk_options`` we slim it + against the per-doc options (deep-copying internally so two + docs broadcast from a single dict cannot share mutable + sub-dicts); otherwise we build a fresh snapshot from + ``self.addon_params['chunker']``. + + F-strategy runtime args (``split_by_character`` / + ``split_by_character_only`` from :meth:`LightRAG.ainsert`) + are baked into the snapshot upstream — ainsert calls + :func:`lightrag.parser_routing.resolve_chunk_options` itself + and passes the result via ``chunk_options=``. This function + is purely a persistence helper; chunker-config construction + is not its concern. + """ + from lightrag.parser_routing import ( + resolve_chunk_options, + slim_chunk_options, + ) + + doc_options = _process_options_at(index) + if chunk_options is not None: + return slim_chunk_options(chunk_options[index], doc_options) + return resolve_chunk_options(self.addon_params, process_options=doc_options) + + # 1. 
Validate ids and build contents (when lightrag: no content dedup, content may be empty) + if ids is not None: + if len(ids) != len(input): + raise ValueError("Number of IDs must match the number of documents") + if len(ids) != len(set(ids)): + raise ValueError("IDs must be unique") + + # Canonicalize every input filename once: the stored ``file_path`` + # is hint-stripped and serves UI display, filename dedup, and the + # deterministic doc_id seed in one go. + file_paths_canonical = [ + normalize_document_file_path(path) for path in file_paths + ] + contents: dict[str, dict[str, Any]] = {} + source_to_doc_id: dict[str, str] = {} + content_hash_to_doc_id: dict[str, str] = {} + duplicate_attempts: list[dict[str, Any]] = [] + # Per-doc I/O failures from the lightrag-format branch. Populated when + # ``load_lightrag_document_content`` cannot read the user-supplied + # blocks.jsonl; flushed as FAILED stubs via + # ``apipeline_enqueue_error_documents`` inside the critical section so + # the UI surfaces the root cause instead of a silent empty document. + lightrag_load_errors: list[dict[str, Any]] = [] + + def _add_content( + index: int, + content: str, + doc_format: str, + *, + sidecar_location: str | None = None, + ) -> None: + file_path_canonical = file_paths_canonical[index] + + # Body length excludes the {{LRdoc}} marker so duplicate-attempt + # bookkeeping reports the same units as raw documents. + # strip_lightrag_doc_prefix is a no-op for non-lightrag formats. + body_length = len(strip_lightrag_doc_prefix(content, doc_format)) + + # Compute content hash: skip for pending_parse (content extracted later). + # RAW and LIGHTRAG both hash the bare merged text so the same body + # carried by different envelopes (raw text vs sidecar) dedupes + # against itself across formats. 
+ content_hash: str | None = None + if doc_format in (FULL_DOCS_FORMAT_RAW, FULL_DOCS_FORMAT_LIGHTRAG): + content_hash = compute_text_content_hash( + strip_lightrag_doc_prefix(content or "", doc_format) + ) + + known_source = has_known_document_source(file_path_canonical) + if ids is not None: + doc_id = ids[index] + elif known_source: + doc_id = compute_mdhash_id(file_path_canonical, prefix="doc-") + elif doc_format == FULL_DOCS_FORMAT_RAW: + doc_id = compute_mdhash_id(content or "", prefix="doc-") + elif content_hash: + doc_id = compute_mdhash_id(content_hash, prefix="doc-") + else: + doc_id = compute_mdhash_id( + f"{file_path_canonical}-{track_id}-{index}", prefix="doc-" + ) + + if known_source and file_path_canonical in source_to_doc_id: + duplicate_attempts.append( + { + "doc_id": doc_id, + "original_doc_id": source_to_doc_id[file_path_canonical], + "file_path": file_path_canonical, + "content_length": body_length, + "existing_status": "batch_duplicate", + "existing_track_id": "", + "duplicate_kind": "filename", + } + ) + return + + if content_hash and content_hash in content_hash_to_doc_id: + duplicate_attempts.append( + { + "doc_id": doc_id, + "original_doc_id": content_hash_to_doc_id[content_hash], + "file_path": file_path_canonical, + "content_length": body_length, + "existing_status": "batch_duplicate", + "existing_track_id": "", + "duplicate_kind": "content_hash", + } + ) + return + + if known_source: + source_to_doc_id[file_path_canonical] = doc_id + if content_hash: + content_hash_to_doc_id[content_hash] = doc_id + + content_data: dict[str, Any] = { + "content": content, + "file_path": file_path_canonical, + "parse_format": doc_format, + } + if content_hash: + content_data["content_hash"] = content_hash + if sidecar_location: + content_data["sidecar_location"] = sidecar_location + if engine := _parse_engine_at(index): + content_data["parse_engine"] = engine + if doc_format == FULL_DOCS_FORMAT_PENDING_PARSE: + source_file_name = 
Path(str(file_paths[index] or "").strip()).name + if has_known_document_source(source_file_name): + content_data["source_file_name"] = source_file_name + options_str = _process_options_at(index) + if options_str: + content_data["process_options"] = options_str + # Always snapshot chunk_options at enqueue time — independent + # of whether process_options selected a specific strategy — + # so the per-doc parameters are frozen even when ``F`` + # (default) is used. + content_data["chunk_options"] = _chunk_options_at(index) + contents[doc_id] = content_data + + if is_lightrag_format: + # LightRAG Document: no content hash dedup; content may be empty + for i in range(len(file_paths)): + path = file_paths[i] + raw_path = ( + lightrag_document_paths[i] if lightrag_document_paths else "" + ) or path + # Resolve to an absolute path so the sidecar URI carries + # full location info; relative paths are interpreted under + # input_dir. + p = Path(raw_path) + if not p.is_absolute(): + p = input_dir_path() / p + # The user may point at the ``*.blocks.jsonl`` file itself + # or at its containing ``*.parsed/`` directory. Sidecars + # are addressed by directory, so step up when given a file. + sidecar_dir = ( + p.parent + if p.suffix == ".jsonl" and p.name.endswith(".blocks.jsonl") + else p + ) + sidecar_location = sidecar_uri_for(sidecar_dir) + # Per docs/FileProcessingConfiguration-zh.md, full_docs.content + # for format=lightrag must be "{{LRdoc}}" + the merged body. + # If the blocks file cannot be read (permission, truncation, + # invalid JSON line), recording an empty body would let an + # untrue "{{LRdoc}}" record land in full_docs and desync from + # the on-disk blocks.jsonl. Instead, skip this doc and flush + # a FAILED stub via apipeline_enqueue_error_documents inside + # the critical section so /documents surfaces the cause and + # /documents/scan retries cleanly once the file is fixed.
+ try: + merged_text, _ = await load_lightrag_document_content( + sidecar_location + ) + except Exception as exc: + error_msg = f"load_lightrag_document_content failed: {exc}" + logger.warning( + f"[apipeline_enqueue] {error_msg} ({raw_path})" + ) + file_size = 0 + blocks_path_str = sidecar_blocks_path(sidecar_location) + if blocks_path_str: + try: + file_size = Path(blocks_path_str).stat().st_size + except OSError: + file_size = 0 + lightrag_load_errors.append( + { + "file_path": path, + "error_description": ( + "Failed to load LightRAG Document blocks" + ), + "original_error": error_msg, + "file_size": file_size, + } + ) + continue + summary_content = make_lightrag_doc_content(merged_text) + _add_content( + i, + summary_content, + FULL_DOCS_FORMAT_LIGHTRAG, + sidecar_location=sidecar_location, + ) + elif ids is not None: + for i, doc in enumerate(input): + cleaned_content = sanitize_text_for_encoding(doc) + _add_content( + i, + cleaned_content, + FULL_DOCS_FORMAT_RAW, + ) + elif docs_format == FULL_DOCS_FORMAT_PENDING_PARSE: + for i, doc in enumerate(input): + _add_content( + i, + doc or "", + FULL_DOCS_FORMAT_PENDING_PARSE, + ) + else: + for i, doc in enumerate(input): + cleaned_content = sanitize_text_for_encoding(doc) + _add_content(i, cleaned_content, FULL_DOCS_FORMAT_RAW) + + # 2. Generate document initial status (without content) + def _initial_doc_status(content_data: dict[str, Any]) -> dict[str, Any]: + # For lightrag-format full_docs the persisted content carries the + # ``{{LRdoc}}`` marker; strip it so summary/length match raw + # semantics (the marker is full_docs internal bookkeeping and + # must not leak into doc_status). strip_lightrag_doc_prefix + # internally checks parse_format, so non-lightrag formats pass + # through untouched. 
+ body_text = strip_lightrag_doc_prefix( + content_data.get("content", ""), + content_data.get("parse_format"), + ) + base: dict[str, Any] = { + "status": DocStatus.PENDING, + "content_summary": get_content_summary(body_text), + "content_length": len(body_text), + "created_at": datetime.now(timezone.utc).isoformat(), + "updated_at": datetime.now(timezone.utc).isoformat(), + "file_path": content_data["file_path"], + "track_id": track_id, + } + if content_data.get("content_hash"): + base["content_hash"] = content_data["content_hash"] + metadata: dict[str, Any] = {} + options_str = content_data.get("process_options") or "" + if options_str: + # Mirror process_options into doc_status.metadata so admin UIs + # can surface the per-document strategy without a full_docs lookup. + metadata["process_options"] = options_str + source_file_name = content_data.get("source_file_name") + if source_file_name: + metadata["source_file_name"] = source_file_name + if metadata: + base["metadata"] = metadata + return base + + new_docs: dict[str, Any] = { + id_: _initial_doc_status(content_data) + for id_, content_data in contents.items() + } + + # Serialise the dedup-read-then-upsert critical section across + # concurrent enqueue calls within the same workspace. Without + # this, two enqueues for the same content (e.g. /upload during + # scan's processing phase, or two uploads via /text + /upload) + # can both read doc_status before either upserts, both miss the + # content_hash dedup, and both end up writing PENDING rows for + # the same content — bypassing the dedup that's supposed to + # land one of them as ``duplicate_kind=content_hash`` FAILED. + # + # The lock is workspace-scoped and only spans steps 3-4 below + # (filter_keys → upserts). It does NOT block concurrent + # processing (``apipeline_process_enqueue_documents`` reads + # doc_status independently) or scan classification + # (``scanning_exclusive`` already gates concurrent enqueue). 
+ # Lock order: enqueue_serialize → pipeline_status_lock (the + # request_pending nudge inside is fine; no caller holds + # pipeline_status_lock first then needs enqueue_serialize). + enqueue_serialize_lock = get_namespace_lock( + "enqueue_serialize", workspace=self.workspace + ) + + async with enqueue_serialize_lock: + # 3. Filter out already processed documents + # Get docs ids + all_new_doc_ids = set(new_docs.keys()) + # Exclude IDs of documents that are already enqueued. The previous + # ``reprocess_existing_non_processed`` flag has been removed: any + # same-name record (regardless of status) is treated as a duplicate + # here. Recovering half-processed documents is now the job of the + # pipeline's resume logic, which runs in apipeline_process_enqueue_documents + # rather than this enqueue path. + unique_new_doc_ids = await self.doc_status.filter_keys(all_new_doc_ids) + + for doc_id in list(unique_new_doc_ids): + content_data = contents[doc_id] + + # 3a. Filename-based dedup: same basename always treated as duplicate. + match = await get_existing_doc_by_file_basename( + self.doc_status, content_data["file_path"] + ) + if match: + existing_doc_id, existing_doc = match + unique_new_doc_ids.discard(doc_id) + duplicate_attempts.append( + { + "doc_id": doc_id, + "original_doc_id": existing_doc_id, + "file_path": content_data["file_path"], + "content_length": new_docs.get(doc_id, {}).get( + "content_length", 0 + ), + "existing_status": doc_status_field( + existing_doc, "status", "unknown" + ), + "existing_track_id": doc_status_field( + existing_doc, "track_id", "" + ), + "duplicate_kind": "filename", + } + ) + continue + + # 3b. Content-hash dedup: different filename but same body still dupes. 
+ content_hash = content_data.get("content_hash") + if not content_hash: + continue + hash_match = await get_existing_doc_by_content_hash( + self.doc_status, content_hash + ) + if hash_match: + existing_doc_id, existing_doc = hash_match + unique_new_doc_ids.discard(doc_id) + duplicate_attempts.append( + { + "doc_id": doc_id, + "original_doc_id": existing_doc_id, + "file_path": content_data["file_path"], + "content_length": new_docs.get(doc_id, {}).get( + "content_length", 0 + ), + "existing_status": doc_status_field( + existing_doc, "status", "unknown" + ), + "existing_track_id": doc_status_field( + existing_doc, "track_id", "" + ), + "duplicate_kind": "content_hash", + } + ) + + # Handle duplicate documents - create trackable records with current track_id + ignored_ids = list(all_new_doc_ids - unique_new_doc_ids) + for doc_id in ignored_ids: + if any( + attempt.get("doc_id") == doc_id for attempt in duplicate_attempts + ): + continue + existing_doc = await self.doc_status.get_by_id(doc_id) + duplicate_attempts.append( + { + "doc_id": doc_id, + "original_doc_id": doc_id, + "file_path": new_docs.get(doc_id, {}).get( + "file_path", "unknown_source" + ), + "content_length": new_docs.get(doc_id, {}).get( + "content_length", 0 + ), + "existing_status": ( + existing_doc.get("status", "unknown") + if existing_doc + else "unknown" + ), + "existing_track_id": ( + existing_doc.get("track_id", "") if existing_doc else "" + ), + "duplicate_kind": "filename", + } + ) + + if duplicate_attempts: + duplicate_docs: dict[str, Any] = {} + for index, attempt in enumerate(duplicate_attempts): + doc_id = attempt["doc_id"] + file_path = attempt.get("file_path") or "unknown_source" + duplicate_kind = attempt.get("duplicate_kind") or "filename" + logger.warning( + f"Duplicate document detected ({duplicate_kind}): " + f"{doc_id} ({file_path})" + ) + + # Create a new record with unique ID for this duplicate attempt + dup_record_id = compute_mdhash_id( + 
f"{doc_id}-{track_id}-{index}-{file_path}", prefix="dup-" + ) + if duplicate_kind == "content_hash": + error_prefix = ( + "Identical content already exists under another filename." + ) + else: + error_prefix = "File name already exists." + duplicate_docs[dup_record_id] = { + "status": DocStatus.FAILED, + "content_summary": ( + f"[DUPLICATE:{duplicate_kind}] Original document: " + f"{attempt.get('original_doc_id', doc_id)}" + ), + "content_length": attempt.get("content_length", 0), + "chunks_count": 0, + "chunks_list": [], + "created_at": datetime.now(timezone.utc).isoformat(), + "updated_at": datetime.now(timezone.utc).isoformat(), + "file_path": file_path, + "track_id": track_id, # Use current track_id for tracking + "error_msg": ( + f"{error_prefix} " + f"Original doc_id: {attempt.get('original_doc_id', doc_id)}, " + f"Status: {attempt.get('existing_status', 'unknown')}" + ), + "metadata": { + "is_duplicate": True, + "duplicate_kind": duplicate_kind, + "original_doc_id": attempt.get("original_doc_id", doc_id), + "original_track_id": attempt.get("existing_track_id", ""), + }, + } + + # Store duplicate records in doc_status + if duplicate_docs: + await self.doc_status.upsert(duplicate_docs) + logger.info( + f"Created {len(duplicate_docs)} duplicate document records with track_id: {track_id}" + ) + + # Flush lightrag-format I/O failures as FAILED stubs. Done + # inside the critical section so concurrent enqueues either see + # the failure rows in full or not at all, and so a subsequent + # /documents/scan finds the stub-without-full_docs combination + # that document_routes treats as "delete and re-extract". 
+ if lightrag_load_errors: + await self.apipeline_enqueue_error_documents( + lightrag_load_errors, track_id=track_id + ) + + # Filter new_docs to only include documents with unique IDs + new_docs = { + doc_id: new_docs[doc_id] + for doc_id in unique_new_doc_ids + if doc_id in new_docs + } + + if not new_docs: + logger.warning("No new unique documents were found.") + # If FAILED stubs were just flushed (lightrag-format I/O + # errors), the caller needs the track_id to query their + # status; a bare ``return None`` would also be interpreted + # by document_routes upload paths as "all duplicate — + # archive the source", silently hiding the failure. + if lightrag_load_errors: + return track_id + return + + # 4. Store document content in full_docs and status in doc_status + full_docs_data = { + doc_id: { + "content": contents[doc_id].get("content", ""), + "file_path": contents[doc_id]["file_path"], + "parse_format": contents[doc_id].get( + "parse_format", FULL_DOCS_FORMAT_RAW + ), + } + for doc_id in new_docs.keys() + } + for doc_id in new_docs.keys(): + if contents[doc_id].get("content_hash"): + full_docs_data[doc_id]["content_hash"] = contents[doc_id][ + "content_hash" + ] + if contents[doc_id].get("sidecar_location"): + full_docs_data[doc_id]["sidecar_location"] = contents[doc_id][ + "sidecar_location" + ] + if contents[doc_id].get("parse_engine"): + full_docs_data[doc_id]["parse_engine"] = contents[doc_id][ + "parse_engine" + ] + if contents[doc_id].get("process_options"): + full_docs_data[doc_id]["process_options"] = contents[doc_id][ + "process_options" + ] + # ``chunk_options`` is always populated by ``_add_content`` + # at enqueue time so it's persisted unconditionally. 
+ if contents[doc_id].get("chunk_options") is not None: + full_docs_data[doc_id]["chunk_options"] = contents[doc_id][ + "chunk_options" + ] + await self.full_docs.upsert(full_docs_data) + # Persist data to disk immediately + await self.full_docs.index_done_callback() + + # Store document status (without content) + await self.doc_status.upsert(new_docs) + logger.debug(f"Stored {len(new_docs)} new unique documents") + + # Notify any in-flight processing loop that new work has arrived. + # The loop checks ``request_pending`` after each batch and will + # re-query doc_status to pick up these PENDING rows. Without + # this nudge a caller that does not subsequently call + # ``apipeline_process_enqueue_documents`` (or whose call races + # with the loop's just-finished batch) could leave the new docs + # stranded until the next unrelated trigger. + async with pipeline_status_lock: + if pipeline_status.get("busy"): + pipeline_status["request_pending"] = True + + return track_id + + async def apipeline_enqueue_error_documents( + self, + error_files: list[dict[str, Any]], + track_id: str | None = None, + ) -> None: + """ + Record file extraction errors in doc_status storage. + + This function creates error document entries in the doc_status storage for files + that failed during the extraction process. Each error entry contains information + about the failure to help with debugging and monitoring. + + Args: + error_files: List of dictionaries containing error information for each failed file. 
+ Each dictionary should contain: + - file_path: Original file name/path + - error_description: Brief error description (for content_summary) + - original_error: Full error message (for error_msg) + - file_size: File size in bytes (for content_length, 0 if unknown) + track_id: Optional tracking ID for grouping related operations + + Returns: + None + """ + if not error_files: + logger.debug("No error files to record") + return + + # Generate track_id if not provided + if track_id is None or track_id.strip() == "": + track_id = generate_track_id("error") + + error_docs: dict[str, Any] = {} + current_time = datetime.now(timezone.utc).isoformat() + + for error_file in error_files: + file_path = normalize_document_file_path( + error_file.get("file_path", "unknown_file") + ) + error_description = error_file.get( + "error_description", "File extraction failed" + ) + original_error = error_file.get("original_error", "Unknown error") + file_size = error_file.get("file_size", 0) + + # Generate unique doc_id with "error-" prefix + doc_id_content = f"{file_path}-{error_description}" + doc_id = compute_mdhash_id(doc_id_content, prefix="error-") + + error_docs[doc_id] = { + "status": DocStatus.FAILED, + "content_summary": error_description, + "content_length": file_size, + "error_msg": original_error, + "chunks_count": 0, # No chunks for failed files + "chunks_list": [], + "created_at": current_time, + "updated_at": current_time, + "file_path": file_path, + "track_id": track_id, + "metadata": { + "error_type": "file_extraction_error", + }, + } + + # Store error documents in doc_status + if error_docs: + await self.doc_status.upsert(error_docs) + # Log each error for debugging + for doc_id, error_doc in error_docs.items(): + logger.error( + f"File processing error: - ID: {doc_id} {error_doc['file_path']}" + ) + + async def apipeline_process_enqueue_documents(self) -> None: + """ + Process pending documents by splitting them into chunks, processing + each chunk for entity and 
relation extraction, and updating the + document status. + + 1. Get all pending, failed, and abnormally terminated processing documents. + 2. Validate document data consistency and fix any issues + 3. Split document content into chunks + 4. Process each chunk for entity and relation extraction + 5. Update the document status + """ + pipeline_status = await get_namespace_data( + "pipeline_status", workspace=self.workspace + ) + pipeline_status_lock = get_namespace_lock( + "pipeline_status", workspace=self.workspace + ) + + async with pipeline_status_lock: + # Ensure only one worker is processing documents + if not pipeline_status.get("busy", False): + to_process_docs: dict[ + str, DocProcessingStatus + ] = await self.doc_status.get_docs_by_statuses( + list(_INFLIGHT_DOC_STATUSES) + ) + + if not to_process_docs: + logger.info("No documents to process") + return + + pipeline_status.update( + { + "busy": True, + "job_name": "Default Job", + "job_start": datetime.now(timezone.utc).isoformat(), + "docs": 0, + "batchs": 0, # Total number of files to be processed + "cur_batch": 0, # Number of files already processed + "request_pending": False, # Clear any previous request + "cancellation_requested": False, # Initialize cancellation flag + "latest_message": "", + } + ) + # Cleaning history_messages without breaking it as a shared list object + del pipeline_status["history_messages"][:] + else: + # Another process is busy, just set request flag and return + pipeline_status["request_pending"] = True + logger.info( + "Another process is already processing the document queue. Request queued." + ) + return + + # Tracks whether the loop has already released ``busy`` under + # the same critical section that observed request_pending=False. 
+ # This makes the exit handoff atomic: a concurrent enqueue can + # either set request_pending BEFORE we release (in which case + # the loop continues with a fresh snapshot) or AFTER (in which + # case it sees busy=False and starts a new loop via its own + # process_enqueue call). Without this, a small window between + # "loop reads request_pending=False" and "finally clears busy" + # could strand newly-enqueued PENDING docs. + busy_released_in_loop = False + + try: + # Process documents until no more documents or requests + while True: + # Check for cancellation request at the start of main loop + async with pipeline_status_lock: + if pipeline_status.get("cancellation_requested", False): + pipeline_status["request_pending"] = False + pipeline_status["cancellation_requested"] = False + + log_message = "Pipeline cancelled by user" + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + + # Exit directly, skipping request_pending check + return + + if not to_process_docs: + log_message = "All enqueued documents have been processed" + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + if await self._atomic_release_busy_or_consume_pending( + pipeline_status, pipeline_status_lock + ): + busy_released_in_loop = True + break + to_process_docs = await self.doc_status.get_docs_by_statuses( + list(_INFLIGHT_DOC_STATUSES) + ) + continue + + # Validate document data consistency and fix any issues + to_process_docs = await self._validate_and_fix_document_consistency( + to_process_docs, pipeline_status, pipeline_status_lock + ) + + if not to_process_docs: + log_message = ( + "No valid documents to process after consistency check" + ) + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + if await 
self._atomic_release_busy_or_consume_pending( + pipeline_status, pipeline_status_lock + ): + busy_released_in_loop = True + break + to_process_docs = await self.doc_status.get_docs_by_statuses( + list(_INFLIGHT_DOC_STATUSES) + ) + continue + + log_message = f"Processing {len(to_process_docs)} document(s)" + logger.info(log_message) + pipeline_status["docs"] = len(to_process_docs) + pipeline_status["batchs"] = len(to_process_docs) + pipeline_status["cur_batch"] = 0 + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + + await self._run_pipeline_batch( + to_process_docs, + pipeline_status=pipeline_status, + pipeline_status_lock=pipeline_status_lock, + ) + + # Atomic exit handoff: if request_pending was set during + # this batch (e.g. a concurrent enqueue while busy=True), + # clear it and refetch. Otherwise release ``busy`` under + # the SAME lock so a concurrent enqueue cannot squeeze a + # request_pending=True past us into a now-stranded state. + if await self._atomic_release_busy_or_consume_pending( + pipeline_status, pipeline_status_lock + ): + busy_released_in_loop = True + break + + log_message = "Processing additional documents due to pending request" + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + + # Check for pending documents again + to_process_docs = await self.doc_status.get_docs_by_statuses( + list(_INFLIGHT_DOC_STATUSES) + ) + + finally: + log_message = "Enqueued document processing pipeline stopped" + logger.info(log_message) + # If the loop already released ``busy`` under the atomic exit + # check, don't clobber it here — a concurrent enqueue may have + # observed busy=False and started a new processing pass that + # has set busy=True for itself. Cancellation flag and log + # bookkeeping are always safe to update. 
+ async with pipeline_status_lock: + if not busy_released_in_loop: + pipeline_status["busy"] = False + pipeline_status["cancellation_requested"] = ( + False # Always reset cancellation flag + ) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + + # ============================================================ + # Pipeline orchestration + # ============================================================ + + async def _run_pipeline_batch( + self, + to_process_docs: dict[str, DocProcessingStatus], + *, + pipeline_status: dict, + pipeline_status_lock, + ) -> None: + """Run one batch of pending documents through the parse → analyze → + process queues. + + Three cascading layers of queues: + - Layer 1: Content Parsing (parse_native / parse_mineru / parse_docling) + - Layer 2: Multimodal Analyze (analyze_multimodal) + - Layer 3: Entity / Relation Extraction (process_single_document) + """ + total_files = len(to_process_docs) + pipeline_status["job_name"] = self._format_job_name( + to_process_docs, total_files + ) + + ctx = _BatchRunContext( + pipeline_status=pipeline_status, + pipeline_status_lock=pipeline_status_lock, + semaphore=asyncio.Semaphore(self.max_parallel_insert), + total_files=total_files, + q_native=asyncio.Queue(maxsize=self.queue_size_default), + q_mineru=asyncio.Queue(maxsize=self.queue_size_default), + q_docling=asyncio.Queue(maxsize=self.queue_size_default), + q_analyze=asyncio.Queue(maxsize=self.queue_size_default), + q_process=asyncio.Queue(maxsize=self.queue_size_insert), + ) + + workers: list[asyncio.Task] = [] + for _ in range(max(1, self.max_parallel_parse_native)): + workers.append( + asyncio.create_task(self._parse_worker("native", ctx.q_native, ctx)) + ) + for _ in range(max(1, self.max_parallel_parse_mineru)): + workers.append( + asyncio.create_task(self._parse_worker("mineru", ctx.q_mineru, ctx)) + ) + for _ in range(max(1, self.max_parallel_parse_docling)): + workers.append( + 
asyncio.create_task(self._parse_worker("docling", ctx.q_docling, ctx)) + ) + for _ in range(max(1, self.max_parallel_analyze)): + workers.append(asyncio.create_task(self._analyze_worker(ctx))) + for _ in range(max(1, self.max_parallel_insert)): + workers.append(asyncio.create_task(self._process_worker(ctx))) + + # Add pending files to the correct parsing queue + for doc_id, status_doc in to_process_docs.items(): + content_data = await self.full_docs.get_by_id(doc_id) or {} + engine = resolve_stored_document_parser_engine( + file_path=getattr(status_doc, "file_path", "unknown_source"), + content_data=content_data, + ) + if engine == "mineru": + await ctx.q_mineru.put((doc_id, status_doc)) + elif engine == "docling": + await ctx.q_docling.put((doc_id, status_doc)) + else: + await ctx.q_native.put((doc_id, status_doc)) + + await asyncio.gather( + ctx.q_native.join(), ctx.q_mineru.join(), ctx.q_docling.join() + ) + await ctx.q_analyze.join() + await ctx.q_process.join() + + for w in workers: + w.cancel() + await asyncio.gather(*workers, return_exceptions=True) + + async def _validate_and_fix_document_consistency( + self, + to_process_docs: dict[str, DocProcessingStatus], + pipeline_status: dict, + pipeline_status_lock: asyncio.Lock, + ) -> dict[str, DocProcessingStatus]: + """Validate and fix document data consistency by deleting inconsistent entries, but preserve failed documents""" + inconsistent_docs = [] + failed_docs_to_preserve = [] + successful_deletions = 0 + + # Check each document's data consistency + for doc_id, status_doc in to_process_docs.items(): + # Check if corresponding content exists in full_docs + content_data = await self.full_docs.get_by_id(doc_id) + if not content_data: + # Check if this is a failed document that should be preserved + if ( + hasattr(status_doc, "status") + and status_doc.status == DocStatus.FAILED + ): + failed_docs_to_preserve.append(doc_id) + else: + inconsistent_docs.append(doc_id) + + # Log information about failed documents 
that will be preserved + if failed_docs_to_preserve: + async with pipeline_status_lock: + preserve_message = f"Preserving {len(failed_docs_to_preserve)} failed document entries for manual review" + logger.info(preserve_message) + pipeline_status["latest_message"] = preserve_message + pipeline_status["history_messages"].append(preserve_message) + + # Remove failed documents from processing list but keep them in doc_status + for doc_id in failed_docs_to_preserve: + to_process_docs.pop(doc_id, None) + + # Delete inconsistent document entries(excluding failed documents) + if inconsistent_docs: + async with pipeline_status_lock: + summary_message = ( + f"Inconsistent document entries found: {len(inconsistent_docs)}" + ) + logger.info(summary_message) + pipeline_status["latest_message"] = summary_message + pipeline_status["history_messages"].append(summary_message) + + successful_deletions = 0 + for doc_id in inconsistent_docs: + try: + status_doc = to_process_docs[doc_id] + file_path = resolve_doc_file_path(status_doc=status_doc) + + # Delete doc_status entry + await self.doc_status.delete([doc_id]) + successful_deletions += 1 + + # Log successful deletion + async with pipeline_status_lock: + log_message = ( + f"Deleted inconsistent entry: {doc_id} ({file_path})" + ) + logger.info(log_message) + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + + # Remove from processing list + to_process_docs.pop(doc_id, None) + + except Exception as e: + # Log deletion failure + async with pipeline_status_lock: + error_message = f"Failed to delete entry: {doc_id} - {str(e)}" + logger.error(error_message) + pipeline_status["latest_message"] = error_message + pipeline_status["history_messages"].append(error_message) + + # Final summary log + # async with pipeline_status_lock: + # final_message = f"Successfully deleted {successful_deletions} inconsistent entries, preserved {len(failed_docs_to_preserve)} failed documents" + # 
logger.info(final_message) + # pipeline_status["latest_message"] = final_message + # pipeline_status["history_messages"].append(final_message) + + # Reset interrupted documents that pass consistency checks to PENDING status + docs_to_reset = {} + reset_count = 0 + + for doc_id, status_doc in to_process_docs.items(): + # Check if document has corresponding content in full_docs (consistency check) + content_data = await self.full_docs.get_by_id(doc_id) + if content_data: # Document passes consistency check + # Check if document is in interrupted status + if hasattr(status_doc, "status") and status_doc.status in [ + DocStatus.PROCESSING, + DocStatus.FAILED, + DocStatus.PARSING, + DocStatus.ANALYZING, + ]: + preserved_chunks_list, preserved_chunks_count = ( + chunk_fields_from_status_doc(status_doc) + ) + resolved_file_path = resolve_doc_file_path( + status_doc=status_doc, + content_data=content_data, + ) + # Prepare document for status reset to PENDING + docs_to_reset[doc_id] = { + "status": DocStatus.PENDING, + "content_summary": status_doc.content_summary, + "content_length": status_doc.content_length, + "chunks_count": preserved_chunks_count, + "chunks_list": preserved_chunks_list, + "created_at": status_doc.created_at, + "updated_at": datetime.now(timezone.utc).isoformat(), + "file_path": resolved_file_path, + "track_id": getattr(status_doc, "track_id", ""), + "content_hash": getattr(status_doc, "content_hash", None), + # Clear transient error / processing fields but preserve + # long-lived per-doc metadata (process_options) seeded + # at enqueue time. 
+ "error_msg": "", + "metadata": doc_status_transition_metadata(status_doc), + } + + # Update the status in to_process_docs as well + status_doc.status = DocStatus.PENDING + status_doc.file_path = resolved_file_path + reset_count += 1 + + # Update doc_status storage if there are documents to reset + if docs_to_reset: + await self.doc_status.upsert(docs_to_reset) + + async with pipeline_status_lock: + reset_message = ( + f"Reset {reset_count} documents from " + "PARSING/ANALYZING/PROCESSING/FAILED to PENDING status" + ) + logger.info(reset_message) + pipeline_status["latest_message"] = reset_message + pipeline_status["history_messages"].append(reset_message) + + return to_process_docs + + async def _atomic_release_busy_or_consume_pending( + self, + pipeline_status: dict, + pipeline_status_lock, + ) -> bool: + """Atomically decide whether to release ``busy`` or consume a + pending request. + + Closes the loop-exit handoff race: a concurrent enqueue that + sets ``request_pending`` while the processing loop is on its + way out will be observed in the same critical section that + releases ``busy``, so the loop sees it and refetches instead + of stranding the new doc in PENDING. + + Returns: + True when ``busy`` has been cleared under the same lock + that observed ``request_pending=False`` — caller must + break out of the loop and skip clearing ``busy`` in its + finally block. + + False when ``request_pending`` was set: the flag is + cleared and the caller must refetch ``doc_status`` and + continue the loop. 
+ """ + async with pipeline_status_lock: + if pipeline_status.get("request_pending", False): + pipeline_status["request_pending"] = False + return False + pipeline_status["busy"] = False + return True + + @staticmethod + def _format_job_name( + to_process_docs: dict[str, DocProcessingStatus], + total_files: int, + ) -> str: + """Build the ``job_name`` shown in pipeline_status for one batch.""" + first_doc = next(iter(to_process_docs.values())) + first_doc_path = first_doc.file_path + if first_doc_path: + path_prefix = first_doc_path[:20] + ( + "..." if len(first_doc_path) > 20 else "" + ) + else: + path_prefix = "unknown_source" + return f"{path_prefix}[{total_files} files]" + + # ============================================================ + # Cascading queue workers (Layer 1 -> 2 -> 3) + # ============================================================ + + async def _parse_worker( + self, + engine: str, + in_q: asyncio.Queue, + ctx: _BatchRunContext, + ) -> None: + """Layer 1 worker: consume (doc_id, status_doc) and emit parsed data. + + Marks PARSING, runs the engine-specific parser (mineru / docling / + native), refreshes ``content_hash`` if the parser patched it, and + either short-circuits via ``_mark_duplicate_after_parse`` or hands + off to ``q_analyze``. Writes FAILED on exception. 
+ """ + while True: + item = await in_q.get() + try: + doc_id_w, status_doc_w = item + file_path_w = getattr(status_doc_w, "file_path", "unknown_source") + content_data_w = await self.full_docs.get_by_id(doc_id_w) + if not content_data_w: + raise Exception( + f"Document content not found in full_docs for doc_id: {doc_id_w}" + ) + if isinstance(status_doc_w.metadata, dict): + source_file_name_w = status_doc_w.metadata.get("source_file_name") + if source_file_name_w and not content_data_w.get( + "source_file_name" + ): + content_data_w["source_file_name"] = source_file_name_w + # Stamp parsing_start_time on the in-memory status_doc so + # carry-over (_DOC_STATUS_METADATA_CARRY_OVER_KEYS) writes it + # into doc_status here and preserves it across every + # subsequent state transition for stage-duration analysis. + if not isinstance(status_doc_w.metadata, dict): + status_doc_w.metadata = {} + # Drop stale per-attempt fields from any prior failed/retried + # attempt before stamping the new parsing_start_time. + # ``analyzing_start_time`` and ``parse_stage_skipped`` are + # downstream of this point and would otherwise be carried + # forward via carry-over, skewing stage-duration metrics and + # the raw-cache-hit signal for the new attempt. The cache-hit + # mirror block below only re-writes ``parse_stage_skipped`` + # when the parser actually returns a hit, so cache-miss + # retries land with the field absent (= not skipped). 
+ status_doc_w.metadata.pop("analyzing_start_time", None) + status_doc_w.metadata.pop("parse_stage_skipped", None) + status_doc_w.metadata["parsing_start_time"] = int(time.time()) + await self._upsert_doc_status_transition( + doc_id=doc_id_w, + status=DocStatus.PARSING, + status_doc=status_doc_w, + file_path=file_path_w, + ) + if engine == "mineru": + parsed_data_w = await self.parse_mineru( + doc_id_w, file_path_w, content_data_w + ) + elif engine == "docling": + parsed_data_w = await self.parse_docling( + doc_id_w, file_path_w, content_data_w + ) + else: + parsed_data_w = await self.parse_native( + doc_id_w, file_path_w, content_data_w + ) + + # Mirror non-fatal parser warnings (e.g. legacy docx tables + # missing w14:paraId) onto the in-memory status_doc so the + # ANALYZING / PROCESSING / PROCESSED / FAILED upserts carry + # the field through ``doc_status_transition_metadata``. + parse_warnings_payload_w = parsed_data_w.get("parse_warnings") + if parse_warnings_payload_w: + if not isinstance(status_doc_w.metadata, dict): + status_doc_w.metadata = {} + status_doc_w.metadata["parse_warnings"] = parse_warnings_payload_w + + # Mirror raw-bundle cache-hit flag from mineru/docling so the + # next upsert (ANALYZING) carries it into doc_status; absence + # means the parse stage actually ran. Only ``True`` is written + # so cache-miss documents stay clean. + if parsed_data_w.get("parse_stage_skipped"): + if not isinstance(status_doc_w.metadata, dict): + status_doc_w.metadata = {} + status_doc_w.metadata["parse_stage_skipped"] = True + + # parse_* may have patched content_hash for + # pending_parse → raw transitions. 
+ refreshed = await self.doc_status.get_by_id(doc_id_w) + if refreshed: + refreshed_hash = ( + refreshed.get("content_hash") + if isinstance(refreshed, dict) + else getattr(refreshed, "content_hash", None) + ) + if refreshed_hash: + status_doc_w.content_hash = refreshed_hash + + if await self._mark_duplicate_after_parse( + doc_id=doc_id_w, + status_doc=status_doc_w, + file_path=file_path_w, + content_hash=status_doc_w.content_hash, + content_length=len(parsed_data_w.get("content", "")), + content_data=content_data_w, + pipeline_status=ctx.pipeline_status, + pipeline_status_lock=ctx.pipeline_status_lock, + ): + continue + + await ctx.q_analyze.put((doc_id_w, status_doc_w, parsed_data_w)) + except Exception as e: + logger.error(f"Parse worker failed ({engine}): {e}") + try: + await self._upsert_doc_status_transition( + doc_id=doc_id_w, + status=DocStatus.FAILED, + status_doc=status_doc_w, + file_path=getattr(status_doc_w, "file_path", "unknown_source"), + extra_fields={"error_msg": str(e)}, + ) + except Exception: + pass + finally: + in_q.task_done() + + async def _analyze_worker(self, ctx: _BatchRunContext) -> None: + """Layer 2 worker: run multimodal analysis (VLM) and feed q_process. + + Refreshes ``content_summary`` / ``content_length`` from the parsed + body (pending_parse → lightrag / raw documents start with empty + summary / zero length at enqueue) so PROCESSING / PROCESSED upserts + end up with real values. 
+ """ + while True: + item = await ctx.q_analyze.get() + try: + doc_id_w, status_doc_w, parsed_data_w = item + file_path_w = getattr(status_doc_w, "file_path", "unknown_source") + refreshed_content_w = parsed_data_w.get("content", "") or "" + refreshed_summary_w = get_content_summary(refreshed_content_w) + refreshed_length_w = len(refreshed_content_w) + status_doc_w.content_summary = refreshed_summary_w + status_doc_w.content_length = refreshed_length_w + # Stamp analyzing_start_time so per-stage durations stay + # derivable from doc_status even after PROCESSED / FAILED; + # carry-over preserves it across later upserts. + if not isinstance(status_doc_w.metadata, dict): + status_doc_w.metadata = {} + status_doc_w.metadata["analyzing_start_time"] = int(time.time()) + await self._upsert_doc_status_transition( + doc_id=doc_id_w, + status=DocStatus.ANALYZING, + status_doc=status_doc_w, + file_path=file_path_w, + ) + analyzed = await self.analyze_multimodal( + doc_id=doc_id_w, + file_path=file_path_w, + parsed_data=parsed_data_w, + ) + await ctx.q_process.put((doc_id_w, status_doc_w, analyzed)) + except Exception as e: + # Mirror _parse_worker: failures here must transition the + # document to FAILED with a diagnostic ``error_msg``, otherwise + # MultimodalAnalysisError (raised by analyze_multimodal under + # the new hard-failure contract) would leave the doc stuck in + # ANALYZING forever. 
+ logger.error(f"Analyze worker failed: {e}") + try: + await self._upsert_doc_status_transition( + doc_id=doc_id_w, + status=DocStatus.FAILED, + status_doc=status_doc_w, + file_path=getattr(status_doc_w, "file_path", "unknown_source"), + extra_fields={"error_msg": str(e)}, + ) + except Exception: + pass + finally: + ctx.q_analyze.task_done() + + async def _process_worker(self, ctx: _BatchRunContext) -> None: + """Layer 3 worker: dispatch each ready document to single-doc processing.""" + while True: + item = await ctx.q_process.get() + try: + doc_id_w, status_doc_w, parsed_data_w = item + await self.process_single_document( + doc_id=doc_id_w, + status_doc=status_doc_w, + parsed_data=parsed_data_w, + ctx=ctx, + ) + finally: + ctx.q_process.task_done() + + # ============================================================ + # Single-document state machine + # ============================================================ + + async def process_single_document( + self, + *, + doc_id: str, + status_doc: DocProcessingStatus, + parsed_data: dict[str, Any], + ctx: _BatchRunContext, + ) -> None: + """Single-document state machine: chunking → KG extraction → merge. + + Always invoked from ``_process_worker`` with ``parsed_data`` already + populated by ``_parse_worker`` + ``_analyze_worker``. Drives the + PROCESSING → PROCESSED state machine, with FAILED fallbacks at both + the extract and merge stage boundaries. 
+ """ + from lightrag.parser_routing import parse_process_options + + file_path = resolve_doc_file_path(status_doc=status_doc) + current_file_number = 0 + file_extraction_stage_ok = False + processing_start_time = int(time.time()) + first_stage_tasks: list[asyncio.Task] = [] + entity_relation_task: asyncio.Task | None = None + chunks: dict[str, Any] = {} + content_data: dict[str, Any] | None = None + extraction_meta: dict[str, Any] = {} + chunk_results: list = [] + doc_process_opts = parse_process_options("") + + def get_failed_chunk_snapshot() -> tuple[list[str], int]: + if chunks: + chunk_ids = list(chunks.keys()) + return chunk_ids, len(chunk_ids) + return chunk_fields_from_status_doc(status_doc) + + async with ctx.semaphore: + try: + # Resolve file_path from full_docs before honoring a queued + # cancellation so corrupted doc_status placeholders do not + # get written back again during retry/cancel flows. + content_data = await self.full_docs.get_by_id(doc_id) + if content_data: + file_path = resolve_doc_file_path( + status_doc=status_doc, + content_data=content_data, + ) + status_doc.file_path = file_path + + # Check for cancellation before starting document processing. + # file_path is resolved before this check so queued documents + # do not lose their source path on early cancellation. 
+ await self._raise_if_cancelled( + ctx.pipeline_status, ctx.pipeline_status_lock + ) + + async with ctx.pipeline_status_lock: + ctx.processed_count += 1 + current_file_number = ctx.processed_count + ctx.pipeline_status["cur_batch"] = ctx.processed_count + + log_message = ( + f"Extracting stage {current_file_number}/" + f"{ctx.total_files}: {file_path}" + ) + logger.info(log_message) + ctx.pipeline_status["history_messages"].append(log_message) + log_message = f"Processing d-id: {doc_id}" + logger.info(log_message) + ctx.pipeline_status["latest_message"] = log_message + ctx.pipeline_status["history_messages"].append(log_message) + + # Prevent memory growth: keep only latest 5000 messages + # when exceeding 10000. Trim in place so Manager.list- + # backed shared state remains appendable and visible + # across processes. + if len(ctx.pipeline_status["history_messages"]) > 10000: + logger.info( + f"Trimming pipeline history from {len(ctx.pipeline_status['history_messages'])} to 5000 messages" + ) + del ctx.pipeline_status["history_messages"][:-5000] + + content = parsed_data.get("content", "") + + # Decode per-document processing options once; later stages + # (multimodal hook / KG extraction) re-read them from + # full_docs as well. + doc_process_opts = parse_process_options( + (content_data or {}).get("process_options", "") + ) + + # Resume guard: if content was already extracted under + # earlier process_options, purge stale chunks + KG before + # rebuilding. 
+ await self._purge_stale_extraction_if_resuming( + doc_id=doc_id, + status_doc=status_doc, + file_path=file_path, + content_data=content_data, + pipeline_status=ctx.pipeline_status, + pipeline_status_lock=ctx.pipeline_status_lock, + ) + + # Chunker dispatch is driven by whether ``process_options`` + # explicitly named a chunking strategy: + # - Explicit selector (F/R/V/P present in the raw + # options string): dispatch to a chunker that + # follows the standardized file-chunker contract + # ``(tokenizer, content, chunk_token_size, *, + # )``, with kwargs supplied from + # the per-doc ``chunk_options`` snapshot persisted + # at enqueue time. + # - No selector supplied: honor the + # externally-customizable ``self.chunking_func`` + # with its legacy 6-arg signature so existing + # callers (typically :meth:`ainsert` for raw text) + # keep working unchanged. Legacy callers still + # read parameters from ``chunk_options`` first + # (per-doc snapshot), with ctx values as fallback + # for already-enqueued docs predating chunk_options. + chunk_opts = (content_data or {}).get("chunk_options") + if not isinstance(chunk_opts, dict) or not chunk_opts: + # Backwards compatibility: rows enqueued before the + # chunk_options snapshot was added fall back to a + # fresh build from current addon_params['chunker'], + # scoped to the per-doc strategy decoded above so + # the slim shape stays consistent with newly + # enqueued rows. F-strategy split args fall back + # to whatever lives in + # ``addon_params['chunker']['fixed_token']``; + # runtime overrides are an ainsert-time concern and + # don't apply at process time for legacy rows. 
+ from lightrag.parser_routing import resolve_chunk_options + + chunk_opts = resolve_chunk_options( + self.addon_params, process_options=doc_process_opts + ) + resolved_chunk_size = int( + chunk_opts.get("chunk_token_size") or self.chunk_token_size + ) + + # Captured per-strategy below; persisted to + # ``doc_status.metadata['chunk_opts']`` via ``extraction_meta`` + # so admin/list APIs can see the actual chunker params used. + chunk_opts_str: str = "" + + if doc_process_opts.chunking_explicit: + from lightrag.chunker import ( + chunking_by_fixed_token, + chunking_by_paragraph_semantic, + chunking_by_recursive_character, + chunking_by_semantic_vector, + ) + + strategy = doc_process_opts.chunking + if strategy == "P": + # P carries its own ``chunk_token_size`` (CHUNK_P_SIZE + # env or ``addon_params['chunker']['paragraph_semantic']``); + # pop it out of the kwargs so we don't pass it + # both positionally and via ``**`` splat (which + # would TypeError). Unlike R/V, ``default_chunker_config`` + # always populates this slot — falling back to + # ``resolved_chunk_size`` (global CHUNK_SIZE) here is + # only a safety net for snapshots predating that + # change; new docs always carry ``DEFAULT_CHUNK_P_SIZE``. 
+ p_opts = dict(chunk_opts.get("paragraph_semantic") or {}) + p_chunk_size = int( + p_opts.pop("chunk_token_size", resolved_chunk_size) + ) + p_blocks_path = ( + str(parsed_data.get("blocks_path") or "").strip() or None + ) + chunk_opts_str = _format_chunking_params(p_chunk_size, p_opts) + logger.info(f"Chunking P: {chunk_opts_str}, doc_id: {doc_id}") + chunking_result = chunking_by_paragraph_semantic( + self.tokenizer, + content, + p_chunk_size, + blocks_path=p_blocks_path, + **p_opts, + ) + elif strategy == "R": + # R carries its own optional ``chunk_token_size`` + # override (CHUNK_R_SIZE env or + # ``addon_params['chunker']['recursive_character']``); + # pop it out of the kwargs so we don't pass it + # both positionally and via ``**`` splat (which + # would TypeError). Fall back to the shared + # top-level resolved size when unset. + r_opts = dict(chunk_opts.get("recursive_character") or {}) + r_chunk_size = int( + r_opts.pop("chunk_token_size", resolved_chunk_size) + ) + chunk_opts_str = _format_chunking_params(r_chunk_size, r_opts) + logger.info(f"Chunking R: {chunk_opts_str}, doc_id: {doc_id}") + chunking_result = chunking_by_recursive_character( + self.tokenizer, + content, + r_chunk_size, + **r_opts, + ) + elif strategy == "V": + # V carries its own optional ``chunk_token_size`` + # advisory ceiling override (CHUNK_V_SIZE env or + # ``addon_params['chunker']['semantic_vector']``); + # same pop-then-splat pattern as P/R. 
+ v_opts = dict(chunk_opts.get("semantic_vector") or {}) + v_chunk_size = int( + v_opts.pop("chunk_token_size", resolved_chunk_size) + ) + chunk_opts_str = _format_chunking_params(v_chunk_size, v_opts) + logger.info(f"Chunking V: {chunk_opts_str}, doc_id: {doc_id}") + chunking_result = await chunking_by_semantic_vector( + self.tokenizer, + content, + v_chunk_size, + embedding_func=self.embedding_func, + **v_opts, + ) + else: # "F" + f_opts = chunk_opts.get("fixed_token") or {} + chunk_opts_str = _format_chunking_params( + resolved_chunk_size, f_opts + ) + logger.info(f"Chunking F: {chunk_opts_str}, doc_id: {doc_id}") + chunking_result = chunking_by_fixed_token( + self.tokenizer, + content, + resolved_chunk_size, + **f_opts, + ) + else: + f_opts = chunk_opts.get("fixed_token") or {} + chunk_opts_str = _format_chunking_params( + resolved_chunk_size, + { + "split_by_character": f_opts.get("split_by_character"), + "split_by_character_only": f_opts.get( + "split_by_character_only", False + ), + "overlap": f_opts.get( + "chunk_overlap_token_size", + self.chunk_overlap_token_size, + ), + }, + ) + logger.info( + f"Chunking F(legacy): {chunk_opts_str}, doc_id: {doc_id}" + ) + chunking_result = self.chunking_func( + self.tokenizer, + content, + f_opts.get("split_by_character"), + f_opts.get("split_by_character_only", False), + f_opts.get( + "chunk_overlap_token_size", + self.chunk_overlap_token_size, + ), + resolved_chunk_size, + ) + if inspect.isawaitable(chunking_result): + chunking_result = await chunking_result + + if not isinstance(chunking_result, (list, tuple)): + raise TypeError( + f"chunking_func must return a list or tuple of dicts, " + f"got {type(chunking_result)}" + ) + + # Reflect the format actually persisted in full_docs. 
+ # Previously a structured-parse fallback always tagged + # parse_format=raw, which silently mislabelled lightrag docs; + # _build_mm_chunks_from_sidecars below gates on the persisted + # format via the sidecar presence check, so the tag must + # reflect what was actually stored. + persisted_format = ( + content_data.get("parse_format") + if isinstance(content_data, dict) + else FULL_DOCS_FORMAT_RAW + ) or FULL_DOCS_FORMAT_RAW + persisted_engine = ( + content_data.get("parse_engine") + if isinstance(content_data, dict) + else None + ) + extraction_meta = { + "parse_format": persisted_format, + "parse_engine": persisted_engine + or ( + "native" + if persisted_format == FULL_DOCS_FORMAT_LIGHTRAG + else "legacy" + ), + "chunking_method": ( + # Explicit selector in process_options: reflect + # the dispatched strategy. ``fixed_token_fallback`` + # is preserved as a defensive label in case a + # future selector char slips past the validator. + _CHUNKING_METHOD_LABELS.get( + doc_process_opts.chunking, "fixed_token_fallback" + ) + if doc_process_opts.chunking_explicit + # No selector: chunking_func was invoked, which + # defaults to chunking_by_token_size but may be + # customized by the caller. + else "legacy_chunking_func" + ), + # Mirrors the chunking start log line (params portion only, + # without the strategy prefix or file path) so admins can + # see the actual chunker params used. Carried across + # transitions via ``_DOC_STATUS_METADATA_CARRY_OVER_KEYS``. 
+ "chunk_opts": chunk_opts_str, + } + + blocks_path = str(parsed_data.get("blocks_path") or "").strip() + if blocks_path: + max_order = -1 + for ch in chunking_result: + if isinstance(ch, dict) and isinstance( + ch.get("chunk_order_index"), int + ): + max_order = max(max_order, int(ch["chunk_order_index"])) + # Default to "" (no modalities) when full_docs has no + # ``process_options`` key for this doc: a reinsert that + # omits i/t/e must NOT re-index stale successful sidecars + # left over from an earlier multimodal run. The builder's + # None branch is reserved for ad-hoc callers (unit tests) + # that intentionally want every modality considered. + mm_chunks = self._build_mm_chunks_from_sidecars( + doc_id=doc_id, + file_path=file_path, + blocks_path=blocks_path, + base_order_index=max_order + 1, + process_options=(content_data or {}).get("process_options") + or "", + ) + if mm_chunks: + chunking_result = list(chunking_result) + mm_chunks + extraction_meta["mm_chunks"] = len(mm_chunks) + + # Final hard guard before embedding: split any oversize + # chunk while preserving heading hierarchy metadata. + if ( + self.embedding_token_limit is not None + and self.embedding_token_limit > 0 + ): + original_chunk_count = len(chunking_result) + chunking_result = enforce_chunk_token_limit_before_embedding( + chunking_result=chunking_result, + tokenizer=self.tokenizer, + max_tokens=self.embedding_token_limit, + ) + if len(chunking_result) != original_chunk_count: + logger.info( + "Applied hard fallback split before embedding for " + f"d-id: {doc_id}, chunks {original_chunk_count} -> {len(chunking_result)} " + f"(limit={self.embedding_token_limit})" + ) + # Compact "pre -> post" summary mirrors the log + # middle segment. Field is only present when a + # hard split actually occurred, so its presence + # alone signals the trigger. 
+ extraction_meta["hard_fallback_split"] = ( + f"{original_chunk_count} -> {len(chunking_result)}" + ) + + chunks = build_chunks_dict_from_chunking_result( + chunking_result, doc_id=doc_id, file_path=file_path + ) + + if not chunks: + logger.warning("No document chunks to process") + + processing_start_time = int(time.time()) + + await self._raise_if_cancelled( + ctx.pipeline_status, ctx.pipeline_status_lock + ) + + # Stage 1: persist doc_status PROCESSING + chunks in parallel. + doc_status_task = asyncio.create_task( + self._upsert_doc_status_transition( + doc_id=doc_id, + status=DocStatus.PROCESSING, + status_doc=status_doc, + file_path=file_path, + extra_fields={ + "chunks_count": len(chunks), + "chunks_list": list(chunks.keys()), + }, + metadata_extra={ + "processing_start_time": processing_start_time, + **extraction_meta, + }, + ) + ) + chunks_vdb_task = asyncio.create_task(self.chunks_vdb.upsert(chunks)) + text_chunks_task = asyncio.create_task(self.text_chunks.upsert(chunks)) + first_stage_tasks = [ + doc_status_task, + chunks_vdb_task, + text_chunks_task, + ] + entity_relation_task = None + + await asyncio.gather(*first_stage_tasks) + + # Stage 2: entity/relation extraction (after text_chunks are + # saved). When the user opted out via process_options '!', + # skip extraction entirely; chunks remain in the vector + # store so naive / mix retrieval still works. + if doc_process_opts.skip_kg: + logger.info( + f"[skip_kg] process_options '!' 
set for d-id: {doc_id}; " + f"skipping entity/relation extraction" + ) + chunk_results = [] + extraction_meta["skip_kg"] = True + else: + entity_relation_task = asyncio.create_task( + self._process_extract_entities( + chunks, + ctx.pipeline_status, + ctx.pipeline_status_lock, + ) + ) + chunk_results = await entity_relation_task + file_extraction_stage_ok = True + + except Exception as e: + pending_tasks = first_stage_tasks + ( + [entity_relation_task] if entity_relation_task else [] + ) + await self._finalize_doc_failure( + doc_id=doc_id, + status_doc=status_doc, + file_path=file_path, + error=e, + stage_label="extract", + current_file_number=current_file_number, + total_files=ctx.total_files, + failed_chunks_snapshot=get_failed_chunk_snapshot(), + pending_tasks=pending_tasks, + metadata_extra={ + "processing_start_time": processing_start_time, + "processing_end_time": int(time.time()), + }, + pipeline_status=ctx.pipeline_status, + pipeline_status_lock=ctx.pipeline_status_lock, + ) + + # Concurrency is controlled by keyed lock for individual + # entities and relationships. + if file_extraction_stage_ok: + try: + await self._raise_if_cancelled( + ctx.pipeline_status, ctx.pipeline_status_lock + ) + + # Use chunk_results from entity_relation_task. When + # skip_kg is set, chunk_results is empty so there are no + # nodes/edges to merge — but we still need to flush the + # chunks_vdb / text_chunks writes (already done above) + # and reach PROCESSED. 
+ if not doc_process_opts.skip_kg: + await merge_nodes_and_edges( + chunk_results=chunk_results, + knowledge_graph_inst=self.chunk_entity_relation_graph, + entity_vdb=self.entities_vdb, + relationships_vdb=self.relationships_vdb, + global_config=self._build_global_config(), + full_entities_storage=self.full_entities, + full_relations_storage=self.full_relations, + doc_id=doc_id, + pipeline_status=ctx.pipeline_status, + pipeline_status_lock=ctx.pipeline_status_lock, + llm_response_cache=self.llm_response_cache, + entity_chunks_storage=self.entity_chunks, + relation_chunks_storage=self.relation_chunks, + current_file_number=current_file_number, + total_files=ctx.total_files, + file_path=file_path, + ) + + processing_end_time = int(time.time()) + await self._upsert_doc_status_transition( + doc_id=doc_id, + status=DocStatus.PROCESSED, + status_doc=status_doc, + file_path=file_path, + extra_fields={ + "chunks_count": len(chunks), + "chunks_list": list(chunks.keys()), + }, + metadata_extra={ + "processing_start_time": processing_start_time, + "processing_end_time": processing_end_time, + **extraction_meta, + }, + ) + + await self._insert_done() + + async with ctx.pipeline_status_lock: + log_message = ( + f"Completed processing file " + f"{current_file_number}/{ctx.total_files}: " + f"{file_path}" + ) + logger.info(log_message) + ctx.pipeline_status["latest_message"] = log_message + ctx.pipeline_status["history_messages"].append(log_message) + + except Exception as e: + await self._finalize_doc_failure( + doc_id=doc_id, + status_doc=status_doc, + file_path=file_path, + error=e, + stage_label="merge", + current_file_number=current_file_number, + total_files=ctx.total_files, + failed_chunks_snapshot=get_failed_chunk_snapshot(), + pending_tasks=[], + metadata_extra={ + "processing_start_time": processing_start_time, + "processing_end_time": int(time.time()), + **extraction_meta, + }, + pipeline_status=ctx.pipeline_status, + pipeline_status_lock=ctx.pipeline_status_lock, + ) 
+ + async def _purge_stale_extraction_if_resuming( + self, + *, + doc_id: str, + status_doc: DocProcessingStatus, + file_path: str, + content_data: dict[str, Any] | None, + pipeline_status: dict, + pipeline_status_lock, + ) -> None: + """If the document already has extracted content, purge stale chunks + and KG contributions before re-running chunking + entity extraction + under the current ``process_options``. + + Mutates ``status_doc.chunks_list`` / ``chunks_count`` to reflect the + purge so subsequent state-machine upserts don't write back stale IDs. + Also emits an engine-mismatch warning when the filename hint disagrees + with the stored ``parse_engine`` — the extracted content is the source + of truth, so the user must delete + re-upload to switch engines. + """ + content_already_extracted = isinstance(content_data, dict) and ( + ( + content_data.get("parse_format") == FULL_DOCS_FORMAT_LIGHTRAG + and content_data.get("sidecar_location") + ) + or ( + content_data.get("parse_format") == FULL_DOCS_FORMAT_RAW + and (content_data.get("content") or "").strip() + ) + ) + if not content_already_extracted: + return + + intended_engine, _ = resolve_file_parser_directives(file_path) + stored_engine = (content_data.get("parse_engine") or "").lower() + if intended_engine and stored_engine and intended_engine != stored_engine: + log_message = ( + f"[resume] {doc_id}: filename hint / " + f"LIGHTRAG_PARSER implies engine=" + f"{intended_engine!r} but full_docs " + f"already has parse_engine=" + f"{stored_engine!r}; keeping the existing " + f"extraction. Delete + re-upload to " + f"switch engines." 
+ ) + logger.warning(log_message) + async with pipeline_status_lock: + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + + stored_chunk_ids = { + chunk_id + for chunk_id in (status_doc.chunks_list or []) + if isinstance(chunk_id, str) and chunk_id + } + if not stored_chunk_ids: + return + + log_message = ( + f"[resume] {doc_id}: purging " + f"{len(stored_chunk_ids)} chunk(s) and " + f"associated KG entries from a previous run " + f"before rebuilding under current " + f"process_options" + ) + logger.info(log_message) + async with pipeline_status_lock: + pipeline_status["latest_message"] = log_message + pipeline_status["history_messages"].append(log_message) + await self._purge_doc_chunks_and_kg( + doc_id, + stored_chunk_ids, + pipeline_status=pipeline_status, + pipeline_status_lock=pipeline_status_lock, + ) + # The status_doc carries chunks_list / chunks_count from the prior + # run; clear them so subsequent state-machine upserts don't write + # back stale IDs. + status_doc.chunks_list = [] + status_doc.chunks_count = 0 + + # ============================================================ + # doc_status state-machine helpers (shared by all layers) + # ============================================================ + + async def _upsert_doc_status_transition( + self, + doc_id: str, + status: DocStatus, + status_doc: DocProcessingStatus, + file_path: str, + *, + extra_fields: dict[str, Any] | None = None, + metadata_extra: dict[str, Any] | None = None, + ) -> None: + """Single source of truth for doc_status state-transition upserts. + + Mirrors the field set used at every PARSING / ANALYZING / PROCESSING / + PROCESSED / FAILED transition. ``extra_fields`` carries + ``chunks_count`` / ``chunks_list`` / ``error_msg``; ``metadata_extra`` + is forwarded to ``doc_status_transition_metadata`` so carry-over + fields (e.g. ``process_options``) survive every state change. 
+ """ + payload: dict[str, Any] = { + "status": status, + "content_summary": status_doc.content_summary, + "content_length": status_doc.content_length, + "created_at": status_doc.created_at, + "updated_at": datetime.now(timezone.utc).isoformat(), + "file_path": file_path, + "track_id": status_doc.track_id, + "content_hash": status_doc.content_hash, + "metadata": doc_status_transition_metadata( + status_doc, extra=metadata_extra + ), + } + if extra_fields: + payload.update(extra_fields) + await self.doc_status.upsert({doc_id: payload}) + + async def _raise_if_cancelled( + self, + pipeline_status: dict, + pipeline_status_lock, + ) -> None: + """Raise ``PipelineCancelledException`` if the user has requested cancel.""" + async with pipeline_status_lock: + if pipeline_status.get("cancellation_requested", False): + raise PipelineCancelledException("User cancelled") + + async def _finalize_doc_failure( + self, + *, + doc_id: str, + status_doc: DocProcessingStatus, + file_path: str, + error: BaseException, + stage_label: str, + current_file_number: int, + total_files: int, + failed_chunks_snapshot: tuple[list[str], int], + pending_tasks: list[asyncio.Task], + metadata_extra: dict[str, Any], + pipeline_status: dict, + pipeline_status_lock, + ) -> None: + """Common epilogue for an extract / merge stage failure. + + Logs the error (or cancellation), cancels any pending stage tasks, + flushes the LLM response cache, and writes a FAILED status row that + preserves the failed chunks snapshot and processing-time metadata. 
+ """ + if isinstance(error, PipelineCancelledException): + if stage_label == "merge": + error_msg = ( + f"User cancelled during merge {current_file_number}/" + f"{total_files}: {file_path}" + ) + else: + error_msg = ( + f"User cancelled {current_file_number}/{total_files}: {file_path}" + ) + logger.warning(error_msg) + async with pipeline_status_lock: + pipeline_status["latest_message"] = error_msg + pipeline_status["history_messages"].append(error_msg) + else: + logger.error(traceback.format_exc()) + if stage_label == "merge": + error_msg = ( + f"Merging stage failed in document " + f"{current_file_number}/{total_files}: {file_path}" + ) + else: + error_msg = ( + f"Failed to extract document " + f"{current_file_number}/{total_files}: {file_path}" + ) + logger.error(error_msg) + async with pipeline_status_lock: + pipeline_status["latest_message"] = error_msg + pipeline_status["history_messages"].append(traceback.format_exc()) + pipeline_status["history_messages"].append(error_msg) + + for task in pending_tasks: + if task and not task.done(): + task.cancel() + + if self.llm_response_cache: + try: + await self.llm_response_cache.index_done_callback() + except Exception as persist_error: + logger.error(f"Failed to persist LLM cache: {persist_error}") + + failed_chunks_list, failed_chunks_count = failed_chunks_snapshot + await self._upsert_doc_status_transition( + doc_id=doc_id, + status=DocStatus.FAILED, + status_doc=status_doc, + file_path=file_path, + extra_fields={ + "error_msg": str(error), + "chunks_count": failed_chunks_count, + "chunks_list": failed_chunks_list, + }, + metadata_extra=metadata_extra, + ) + + # ============================================================ + # Parser engines (also called by tests directly) + # ============================================================ + + async def parse_native( + self, doc_id: str, file_path: str, content_data: dict[str, Any] + ) -> dict[str, Any]: + """Phase 1 parse for native/raw, lightrag and pending_parse 
formats.""" + doc_format = content_data.get("parse_format", FULL_DOCS_FORMAT_RAW) + if doc_format == FULL_DOCS_FORMAT_LIGHTRAG: + # full_docs.content carries the merged text with the {{LRdoc}} + # marker; strip it so the chunking path is identical to raw. + # blocks_path is still resolved for downstream multimodal + # sidecar reads (_build_mm_chunks_from_sidecars). + merged_text = strip_lightrag_doc_prefix( + content_data.get("content"), doc_format + ) + blocks_path = ( + sidecar_blocks_path(content_data.get("sidecar_location")) or "" + ) + + return { + "doc_id": doc_id, + "file_path": file_path, + "parse_format": doc_format, + "content": merged_text, + "blocks_path": blocks_path, + } + + if doc_format == FULL_DOCS_FORMAT_PENDING_PARSE: + source_path = _call_source_file_resolver( + self, + file_path, + source_file_name=content_data.get("source_file_name"), + parser_engine=PARSER_ENGINE_NATIVE, + ) + p = Path(source_path) + if not (p.exists() and p.is_file() and p.suffix.lower() == ".docx"): + raise ValueError( + f"Native parser does not support pending file: {file_path}" + ) + + # Lazy imports keep this module import-cheap and avoid pulling + # the docx parser into call paths that never touch the native + # engine (mirrors parse_mineru). + from lightrag.native_parser.docx.drawing_image_extractor import ( + DrawingExtractionContext, + load_relationships, + ) + from lightrag.native_parser.docx.parse_document import ( + extract_docx_blocks, + ) + from lightrag.native_parser.docx.ir_builder import NativeDocxIRBuilder + from lightrag.sidecar import write_sidecar + + # ``file_path`` is canonical at the worker layer; canonicalize + # again defensively so direct callers (tests, CLI) may pass + # absolute paths or hint-bearing names. 
+ document_name = normalize_document_file_path(file_path) + if document_name == "unknown_source": + document_name = p.name or f"{doc_id}.bin" + base_name = Path(document_name).stem or document_name + parsed_dir = parsed_artifact_dir_for(document_name, parent_hint=p.parent) + asset_dir = parsed_dir / f"{base_name}.blocks.assets" + + def _extract_blocks_sync() -> ( + tuple[list[dict[str, Any]], dict[str, Any], dict[str, Any]] + ): + # Pre-clean parsed_dir and pre-create the asset dir so the + # drawing extractor can write image bytes BEFORE write_sidecar + # runs (which is then called with clean_parsed_dir=False to + # keep those bytes). ``parsed_artifact_dir_for`` returns + # a unique dir per source (with ``_001``/``_002`` suffixes on + # collision), so the rmtree here only ever clobbers stale + # artifacts from a previous attempt at the same doc_id. + if parsed_dir.exists(): + shutil.rmtree(parsed_dir) + parsed_dir.mkdir(parents=True, exist_ok=True) + asset_dir.mkdir(parents=True, exist_ok=True) + ctx = DrawingExtractionContext( + docx_path=p, + blocks_output_path=parsed_dir / f"{base_name}.blocks.jsonl", + export_dir_name=asset_dir.name, + export_dir_path=asset_dir, + ) + load_relationships(ctx) + warnings: dict[str, Any] = {} + metadata: dict[str, Any] = {} + extracted = extract_docx_blocks( + str(p), + debug=False, + fixlevel=0, + drawing_context=ctx, + parse_warnings=warnings, + parse_metadata=metadata, + ) + return extracted, warnings, metadata + + try: + blocks, parse_warnings, parse_metadata = await asyncio.to_thread( + _extract_blocks_sync + ) + except BaseException: + # ``_extract_blocks_sync`` pre-creates ``parsed_dir`` and + # ``asset_dir`` before invoking the extractor; if extraction + # raises, those (possibly partially-populated) dirs would be + # left on disk. Roll them back so the next attempt starts clean. 
+ if parsed_dir.exists(): + shutil.rmtree(parsed_dir, ignore_errors=True) + raise + if not blocks: + # Same cleanup path for the "extractor returned []" case: + # ``write_sidecar`` would never run, so without this the + # pre-created (empty) dirs would persist. + if parsed_dir.exists(): + shutil.rmtree(parsed_dir, ignore_errors=True) + raise ValueError(f"DOCX parser returned empty content for {file_path}") + + missing_paraid_count = int( + parse_warnings.get("missing_paraid_count", 0) or 0 + ) + if missing_paraid_count > 0: + # Surface once per document: the parser may encounter many + # missing paraIds (legacy / non-Word authors omit + # ``w14:paraId``), but a single warning with the count is + # enough. Affected blocks emit + # ``positions: [{"type": "paraid", "range": null}]``. + logger.warning( + "[parse_native] %s: %d paragraphs lack paraId; " + "re-save the file in Word 2013+ to regenerate ids.", + p.name, + missing_paraid_count, + ) + + ir = NativeDocxIRBuilder().normalize( + blocks, + document_name=document_name, + asset_dir_name=asset_dir.name, + parse_metadata=parse_metadata, + ) + parsed_data = write_sidecar( + ir, + parsed_dir=parsed_dir, + doc_id=doc_id, + engine=PARSER_ENGINE_NATIVE, + clean_parsed_dir=False, # we pre-populated the asset dir + block_drawing_path_style="basename_only", # legacy native shape + ) + + await self._persist_parsed_full_docs( + doc_id, + { + "content": make_lightrag_doc_content(parsed_data["content"]), + "file_path": file_path, + "parse_format": FULL_DOCS_FORMAT_LIGHTRAG, + "sidecar_location": sidecar_uri_for(parsed_dir), + "parse_engine": PARSER_ENGINE_NATIVE, + "update_time": int(time.time()), + }, + ) + await archive_docx_source_after_full_docs_sync(str(p)) + logger.info( + f"[parse_native] pending_parse completed for {file_path} " + f"via native_parser/docx" + ) + result: dict[str, Any] = { + "doc_id": doc_id, + "file_path": file_path, + "parse_format": FULL_DOCS_FORMAT_LIGHTRAG, + "content": parsed_data["content"], + 
"blocks_path": parsed_data["blocks_path"], + } + if missing_paraid_count > 0: + # Pipeline reads this from the parsed_data dict and writes it + # to ``doc_status.metadata.parse_warnings`` so admin/list APIs + # can surface the issue alongside the document record. + result["parse_warnings"] = { + "missing_paraid_count": missing_paraid_count + } + return result + + return { + "doc_id": doc_id, + "file_path": file_path, + "parse_format": FULL_DOCS_FORMAT_RAW, + "content": content_data.get("content", ""), + "blocks_path": "", + } + + async def parse_mineru( + self, doc_id: str, file_path: str, content_data: dict[str, Any] + ) -> dict[str, Any]: + """Parse a document through MinerU and emit a spec-compliant sidecar. + + Layout produced under ``inputs//__parsed__/``: + + - ``.parsed/`` — sidecar (blocks.jsonl + per-modality JSONs + assets) + - ``.mineru_raw/`` — preserved MinerU bundle (content_list.json, + full.md, middle.json, images/, ...) plus ``_manifest.json`` + + The raw bundle is kept on disk so subsequent re-parses with the same + source content can skip the upload+poll+download round trip. It is + cleaned only when the user explicitly deletes the document with the + "also delete original file" option; see + :func:`lightrag.api.routers.document_routes.delete_file_variants_by_file_path`. + """ + # Lazy imports keep this module import-cheap and avoid pulling httpx + # into call paths that never touch the MinerU engine. 
+ from lightrag.external_parser.mineru import ( + MinerUIRBuilder, + MinerURawClient, + clear_dir_contents, + is_bundle_valid, + raw_dir_for_parsed_dir, + ) + from lightrag.sidecar import write_sidecar + + source_file_path = Path( + _call_source_file_resolver( + self, + file_path, + source_file_name=content_data.get("source_file_name"), + parser_engine=PARSER_ENGINE_MINERU, + ) + ) + if not source_file_path.is_file(): + raise FileNotFoundError(f"MinerU source file not found: {source_file_path}") + + # Canonicalize defensively so direct callers (tests, CLI) may pass + # absolute paths or hint-bearing names. + document_name = normalize_document_file_path(file_path) + if document_name == "unknown_source": + document_name = source_file_path.name or f"{doc_id}.bin" + parsed_dir = parsed_artifact_dir_for( + document_name, parent_hint=source_file_path.parent + ) + raw_dir = raw_dir_for_parsed_dir(parsed_dir) + + force_reparse = os.getenv("LIGHTRAG_FORCE_REPARSE_MINERU", "").lower() in { + "1", + "true", + "yes", + "on", + } + + parse_stage_skipped = False + if not force_reparse and is_bundle_valid(raw_dir, source_file_path): + # Cache hit: keep the path purely local so a re-parse still + # succeeds if MinerU credentials/endpoint are temporarily + # unavailable (key rotation, debugging, etc.). Network config + # is only required on cache miss below. 
+ parse_stage_skipped = True + logger.info("[parse_mineru] raw cache hit doc_id=%s", doc_id) + else: + if force_reparse and raw_dir.exists(): + logger.info( + "[parse_mineru] LIGHTRAG_FORCE_REPARSE_MINERU set; " + "discarding bundle at %s", + raw_dir, + ) + raw_dir.mkdir(parents=True, exist_ok=True) + clear_dir_contents(raw_dir) + client = MinerURawClient() + logger.info( + "[MinerU] Parsing %s %s (may take a few minutes)", + doc_id, + source_file_path.name, + ) + await client.download_into( + raw_dir, + source_file_path, + upload_name=document_name, + ) + + ir_builder = MinerUIRBuilder() + ir = ir_builder.normalize_from_workdir(raw_dir, document_name=document_name) + parsed_data = write_sidecar( + ir, + parsed_dir=parsed_dir, + doc_id=doc_id, + engine=PARSER_ENGINE_MINERU, + ) + + # Keep full_docs in sync so restart/reprocess can directly use the + # sidecar (matches the native_docx and content_list paths). + await self._persist_parsed_full_docs( + doc_id, + { + "content": make_lightrag_doc_content(parsed_data["content"]), + "file_path": file_path, + "parse_format": FULL_DOCS_FORMAT_LIGHTRAG, + "sidecar_location": sidecar_uri_for(parsed_dir), + "parse_engine": PARSER_ENGINE_MINERU, + "update_time": int(time.time()), + }, + ) + await archive_docx_source_after_full_docs_sync(str(source_file_path)) + return { + "doc_id": doc_id, + "file_path": file_path, + "parse_format": FULL_DOCS_FORMAT_LIGHTRAG, + "content": parsed_data["content"], + "blocks_path": parsed_data["blocks_path"], + "parse_stage_skipped": parse_stage_skipped, + } + + async def parse_docling( + self, doc_id: str, file_path: str, content_data: dict[str, Any] + ) -> dict[str, Any]: + """Parse a document through Docling Serve and emit a spec-compliant sidecar. 
+ + Produces the same dual-directory layout as ``parse_mineru``: + + - ``.parsed/`` — sidecar (blocks.jsonl + per-modality JSONs + assets) + - ``.docling_raw/`` — preserved Docling bundle (``.json``, + ``.md``, ``artifacts/``) plus ``_manifest.json`` + + The raw bundle is kept so subsequent re-parses with the same source + bytes skip the upload + poll + download round trip. + """ + # Lazy imports keep this module import-cheap and avoid pulling httpx + # into call paths that never touch the Docling engine. + from lightrag.external_parser.docling import ( + DoclingIRBuilder, + DoclingRawClient, + clear_dir_contents, + is_bundle_valid, + raw_dir_for_parsed_dir, + ) + from lightrag.sidecar import write_sidecar + + source_file_path = Path( + _call_source_file_resolver( + self, + file_path, + source_file_name=content_data.get("source_file_name"), + parser_engine=PARSER_ENGINE_DOCLING, + ) + ) + if not source_file_path.is_file(): + raise FileNotFoundError( + f"Docling source file not found: {source_file_path}" + ) + + document_name = normalize_document_file_path(file_path) + if document_name == "unknown_source": + document_name = source_file_path.name or f"{doc_id}.bin" + parsed_dir = parsed_artifact_dir_for( + document_name, parent_hint=source_file_path.parent + ) + raw_dir = raw_dir_for_parsed_dir(parsed_dir) + + force_reparse = os.getenv("LIGHTRAG_FORCE_REPARSE_DOCLING", "").lower() in { + "1", + "true", + "yes", + "on", + } + + parse_stage_skipped = False + if not force_reparse and is_bundle_valid(raw_dir, source_file_path): + # Cache hit: keep purely local so re-parses still work when the + # docling-serve endpoint is temporarily unavailable. 
+ parse_stage_skipped = True + logger.info("[parse_docling] raw cache hit doc_id=%s", doc_id) + else: + if force_reparse and raw_dir.exists(): + logger.info( + "[parse_docling] LIGHTRAG_FORCE_REPARSE_DOCLING set; " + "discarding bundle at %s", + raw_dir, + ) + # ``download_into`` mkdir's the raw_dir itself; we only need to + # wipe the existing contents (manifest + stale bundle files). + clear_dir_contents(raw_dir) + client = DoclingRawClient() + logger.info( + "[Docling] Parsing %s %s (may take a few minutes)", + doc_id, + source_file_path.name, + ) + # Pass the canonical (hint-stripped) name so docling-serve names + # the bundle's main JSON ``.json`` instead of + # ``.json``. Otherwise the IR builder — which only sees + # the canonical ``document_name`` — cannot locate the bundle JSON + # via the preferred-path lookup. + await client.download_into( + raw_dir, source_file_path, upload_filename=document_name + ) + + ir_builder = DoclingIRBuilder() + ir = ir_builder.normalize_from_workdir(raw_dir, document_name=document_name) + if not ir.blocks: + raise ValueError( + f"Docling IR builder produced zero blocks for {file_path} " + f"(raw_dir={raw_dir})" + ) + parsed_data = write_sidecar( + ir, + parsed_dir=parsed_dir, + doc_id=doc_id, + engine=PARSER_ENGINE_DOCLING, + ) + + await self._persist_parsed_full_docs( + doc_id, + { + "content": make_lightrag_doc_content(parsed_data["content"]), + "file_path": file_path, + "parse_format": FULL_DOCS_FORMAT_LIGHTRAG, + "sidecar_location": sidecar_uri_for(parsed_dir), + "parse_engine": PARSER_ENGINE_DOCLING, + "update_time": int(time.time()), + }, + ) + await archive_docx_source_after_full_docs_sync(str(source_file_path)) + return { + "doc_id": doc_id, + "file_path": file_path, + "parse_format": FULL_DOCS_FORMAT_LIGHTRAG, + "content": parsed_data["content"], + "blocks_path": parsed_data["blocks_path"], + "parse_stage_skipped": parse_stage_skipped, + } + + # ============================================================ + # Parser 
internals + # ============================================================ + + async def _persist_parsed_full_docs( + self, + doc_id: str, + record: dict[str, Any], + ) -> str | None: + """Write a parse-result record to ``full_docs`` and sync ``content_hash``. + + Computes ``content_hash`` from the actual extracted body so subsequent + ``get_doc_by_content_hash`` lookups can dedupe across pending_parse + records that did not have a hash at enqueue time. Also patches the + existing ``doc_status`` row so both storages stay aligned on + ``content_hash``. + + The original ``pending_parse`` record carries metadata seeded at + enqueue time (``process_options`` etc.) that downstream stages still + need after parsing. ``full_docs`` upserts overwrite the entire row, + so we merge the existing record with the new ``record`` payload + before upserting: fresh fields from ``record`` (``content`` / + ``parse_format`` / ``sidecar_location`` / ``parse_engine`` / + ``update_time``) take precedence, while pre-existing fields are + preserved. + """ + fmt = record.get("parse_format") + content_hash: str | None = None + # Hash the bare merged text (after stripping the ``{{LRdoc}}`` marker + # for lightrag-format) so cross-filename dedup fires regardless of + # whether the same body was ingested as raw text or via a sidecar. + # ``strip_lightrag_doc_prefix`` is a no-op for non-lightrag formats. 
+ if fmt in (FULL_DOCS_FORMAT_RAW, FULL_DOCS_FORMAT_LIGHTRAG): + content_hash = compute_text_content_hash( + strip_lightrag_doc_prefix(record.get("content") or "", fmt) + ) + + existing = await self.full_docs.get_by_id(doc_id) + if isinstance(existing, dict): + payload = {**existing, **record} + else: + payload = dict(record) + if content_hash: + payload["content_hash"] = content_hash + + await self.full_docs.upsert({doc_id: payload}) + await self.full_docs.index_done_callback() + + if content_hash: + existing_status = await self.doc_status.get_by_id(doc_id) + if existing_status: + patched = dict(existing_status) + patched["content_hash"] = content_hash + patched["updated_at"] = datetime.now(timezone.utc).isoformat() + await self.doc_status.upsert({doc_id: patched}) + return content_hash + + async def _mark_duplicate_after_parse( + self, + doc_id: str, + status_doc: DocProcessingStatus, + file_path: str, + content_hash: str | None, + content_length: int, + content_data: dict[str, Any] | None = None, + pipeline_status: dict | None = None, + pipeline_status_lock: asyncio.Lock | None = None, + ) -> bool: + """Mark post-parse content duplicates and stop further processing.""" + if not content_hash: + return False + + match = await get_duplicate_doc_by_content_hash( + self.doc_status, content_hash, doc_id + ) + if not match: + return False + + original_doc_id, original_doc = match + original_track_id = doc_status_field(original_doc, "track_id", "") + original_status = doc_status_field(original_doc, "status", "unknown") + now = datetime.now(timezone.utc).isoformat() + message = ( + "Identical content already exists under another filename. 
" + f"Original doc_id: {original_doc_id}, Status: {original_status}" + ) + + await self.doc_status.upsert( + { + doc_id: { + "status": DocStatus.FAILED, + "content_summary": ( + f"[DUPLICATE:content_hash] Original document: {original_doc_id}" + ), + "content_length": content_length, + "chunks_count": 0, + "chunks_list": [], + "created_at": status_doc.created_at, + "updated_at": now, + "file_path": file_path, + "track_id": status_doc.track_id, + "content_hash": content_hash, + "error_msg": message, + "metadata": doc_status_transition_metadata( + status_doc, + extra={ + "is_duplicate": True, + "duplicate_kind": "content_hash", + "original_doc_id": original_doc_id, + "original_track_id": original_track_id, + }, + ), + } + } + ) + try: + await self.full_docs.delete([doc_id]) + await self.full_docs.index_done_callback() + except Exception as e: + logger.warning(f"Failed to remove duplicate full_docs entry {doc_id}: {e}") + + source_path = _call_source_file_resolver( + self, + file_path, + source_file_name=content_data.get("source_file_name") + if content_data + else None, + ) + archived = await archive_source_after_full_docs_sync(source_path) + archive_msg = f"; archived to {archived}" if archived else "" + warning = f"Duplicate content skipped after parsing: {file_path}{archive_msg}" + logger.warning(warning) + if pipeline_status is not None and pipeline_status_lock is not None: + async with pipeline_status_lock: + pipeline_status["latest_message"] = warning + pipeline_status["history_messages"].append(warning) + return True + + def _resolve_source_file_for_parser( + self, + file_path: str, + *, + source_file_name: str | None = None, + parser_engine: str | None = None, + ) -> str: + """Resolve a readable source file path for parser upload. + + ``file_path`` is the canonical stored basename. Pending-parse records + may also carry ``source_file_name`` with the real uploaded/scanned + basename, including parser hints. 
+ """ + candidates: list[Path] = [] + roots: list[Path] = [] + + def _add_candidate(path_value: Any) -> None: + raw = str(path_value or "").strip() + if not raw: + return + path = Path(raw) + candidates.append(path) + if path.parent != Path("."): + roots.append(path.parent) + roots.append(path.parent / PARSED_DIR_NAME) + candidates.append(path.parent / PARSED_DIR_NAME / path.name) + + _add_candidate(file_path) + + p = Path(file_path) + name = p.name + source_name = Path(str(source_file_name or "").strip()).name + input_path = input_dir_path() + # API ``DocumentManager`` scopes its input dir to + # ``//`` (see DocumentManager.__init__); + # check that location first so files uploaded into a workspace + # subdirectory resolve correctly. ``self.workspace`` is empty when + # no workspace is configured, in which case these candidates + # collapse to the base candidates that follow. + workspace = getattr(self, "workspace", "") or "" + if workspace: + candidates.append(input_path / workspace / name) + candidates.append(input_path / workspace / PARSED_DIR_NAME / name) + roots.append(input_path / workspace) + roots.append(input_path / workspace / PARSED_DIR_NAME) + candidates.append(input_path / name) + candidates.append(input_path / PARSED_DIR_NAME / name) + roots.append(input_path) + roots.append(input_path / PARSED_DIR_NAME) + + # Common local defaults used by API server. 
+ cwd = Path.cwd() + if workspace: + candidates.append(cwd / "inputs" / workspace / name) + candidates.append(cwd / "inputs" / workspace / PARSED_DIR_NAME / name) + roots.append(cwd / "inputs" / workspace) + roots.append(cwd / "inputs" / workspace / PARSED_DIR_NAME) + candidates.extend( + [ + cwd / "inputs" / name, + cwd / "inputs" / PARSED_DIR_NAME / name, + cwd / PARSED_DIR_NAME / name, + ] + ) + roots.extend( + [ + cwd / "inputs", + cwd / "inputs" / PARSED_DIR_NAME, + cwd / PARSED_DIR_NAME, + ] + ) + + if source_name: + candidates = [root / source_name for root in roots] + candidates + + seen_candidates: set[Path] = set() + for candidate in candidates: + if candidate in seen_candidates: + continue + seen_candidates.add(candidate) + if candidate.exists() and candidate.is_file(): + return str(candidate) + + canonical_name = normalize_document_file_path(file_path) + if has_known_document_source(canonical_name): + matches: list[Path] = [] + seen_roots: set[Path] = set() + for root in roots: + if root in seen_roots: + continue + seen_roots.add(root) + if not root.exists() or not root.is_dir(): + continue + for candidate in sorted(root.iterdir(), key=lambda item: item.name): + if ( + candidate.is_file() + and normalize_document_file_path(candidate.name) + == canonical_name + ): + matches.append(candidate) + + if source_name: + for candidate in matches: + if candidate.name == source_name: + return str(candidate) + if parser_engine: + from lightrag.parser_routing import filename_parser_directives + + for candidate in matches: + hinted_engine, _ = filename_parser_directives(candidate.name) + if hinted_engine == parser_engine: + return str(candidate) + if matches: + return str(matches[0]) + return file_path + + async def _write_lightrag_document_from_content_list( + self, + doc_id: str, + file_path: str, + content_list: list[dict[str, Any]], + engine: str, + ) -> dict[str, Any]: + """Convert parser content list to LightRAG Document files and return parsed_data.""" + 
document_name = normalize_document_file_path(file_path) + if document_name == "unknown_source": + document_name = f"{doc_id}.bin" + parsed_dir = parsed_artifact_dir_for(document_name) + if parsed_dir.exists(): + shutil.rmtree(parsed_dir) + parsed_dir.mkdir(parents=True, exist_ok=True) + + base_name = Path(document_name).stem or document_name + blocks_path = parsed_dir / f"{base_name}.blocks.jsonl" + tables_path = parsed_dir / f"{base_name}.tables.json" + drawings_path = parsed_dir / f"{base_name}.drawings.json" + equations_path = parsed_dir / f"{base_name}.equations.json" + + blocks_lines: list[str] = [] + merged_parts: list[str] = [] + block_idx = 0 + table_idx = 0 + drawing_idx = 0 + equation_idx = 0 + + tables: dict[str, Any] = {} + drawings: dict[str, Any] = {} + equations: dict[str, Any] = {} + + def _to_list_str(value: Any) -> list[str]: + if value is None: + return [] + if isinstance(value, list): + return [str(x) for x in value if str(x).strip()] + text_val = str(value).strip() + return [text_val] if text_val else [] + + def _parse_int(value: Any, default: int = 0) -> int: + try: + return int(value) + except Exception: + return default + + def _normalize_grid_rows(grid: Any) -> list[list[str]]: + normalized_rows: list[list[str]] = [] + if not isinstance(grid, list): + return normalized_rows + for row in grid: + if not isinstance(row, list): + continue + normalized_row: list[str] = [] + for cell in row: + if isinstance(cell, dict): + normalized_row.append(str(cell.get("text", "")).strip()) + else: + normalized_row.append(str(cell).strip()) + normalized_rows.append(normalized_row) + return normalized_rows + + def _coerce_table_rows( + value: Any, + ) -> tuple[str, Any, list[list[str]], int, int]: + raw_value = value + if isinstance(raw_value, str): + stripped = raw_value.strip() + if not stripped: + return "html", "", [], 0, 0 + parsed_value = None + try: + parsed_value = json.loads(stripped) + except Exception: + try: + import ast + + parsed_value = 
ast.literal_eval(stripped) + except Exception: + parsed_value = None + if parsed_value is None: + return "html", raw_value, [], 0, 0 + raw_value = parsed_value + + if isinstance(raw_value, list): + rows = _normalize_grid_rows(raw_value) + return ( + "json", + json.dumps(rows, ensure_ascii=False), + rows, + len(rows), + max((len(r) for r in rows), default=0), + ) + + if isinstance(raw_value, dict): + rows = _normalize_grid_rows(raw_value.get("grid")) + if not rows and isinstance(raw_value.get("rows"), list): + rows = _normalize_grid_rows(raw_value.get("rows")) + num_rows = _parse_int( + raw_value.get("num_rows"), len(rows) if rows else 0 + ) + num_cols = _parse_int( + raw_value.get("num_cols"), + max((len(r) for r in rows), default=0), + ) + if rows: + return ( + "json", + json.dumps(rows, ensure_ascii=False), + rows, + num_rows, + num_cols, + ) + return ( + "html", + json.dumps(raw_value, ensure_ascii=False), + [], + num_rows, + num_cols, + ) + + text_value = str(raw_value or "").strip() + return "html", text_value, [], 0, 0 + + heading_stack: list[str] = [] + + def _update_heading_context( + heading_text: str, level: int + ) -> tuple[str, int, list[str]]: + nonlocal heading_stack + clean_heading = str(heading_text or "").strip() + clean_level = max(_parse_int(level, 1), 1) + heading_stack = heading_stack[: max(clean_level - 1, 0)] + parent_chain = [x for x in heading_stack if x] + heading_stack.append(clean_heading) + return clean_heading, clean_level, parent_chain + + def _append_block( + content_text: str, + heading: str = "", + level: int = 0, + parent_headings: list[str] | None = None, + ) -> str: + nonlocal block_idx + content_text = str(content_text or "").strip() + if not content_text: + return "" + blockid = hashlib.md5( + f"{doc_id}:{block_idx}:{heading}:{content_text}".encode("utf-8") + ).hexdigest() + blocks_lines.append( + json.dumps( + { + "type": "content", + "blockid": blockid, + "format": "plain_text", + "content": content_text, + "heading": 
heading, + "parent_headings": list(parent_headings or []), + "level": level, + "session_type": "body", + "table_slice": "none", + "positions": [], + }, + ensure_ascii=False, + ) + ) + merged_parts.append(content_text) + block_idx += 1 + return blockid + + current_heading = "" + current_level = 0 + current_parent_headings: list[str] = [] + + for item in content_list: + if not isinstance(item, dict): + continue + item_type = str(item.get("type") or item.get("label") or "").lower() + + if item_type in {"text", "title", "section_header", "list", "code"}: + text = ( + item.get("text") + or item.get("content") + or "\n".join( + item.get("list_items", []) + if isinstance(item.get("list_items"), list) + else [] + ) + or item.get("code_body") + or "" + ) + if not str(text).strip(): + continue + inferred_level = int(item.get("text_level", 0) or 0) + if item_type in {"title", "section_header"} and inferred_level <= 0: + inferred_level = int(item.get("level", 1) or 1) + if inferred_level > 0: + ( + current_heading, + current_level, + current_parent_headings, + ) = _update_heading_context(str(text), inferred_level) + _append_block( + str(text), + heading=current_heading, + level=current_level, + parent_headings=current_parent_headings, + ) + continue + + if item_type == "equation": + equation_idx += 1 + eq_id = str( + item.get("id") + or f"eq-{doc_id.removeprefix('doc-')}-{equation_idx:04d}" + ) + caption = str(item.get("caption") or f"公式{equation_idx}") + footnotes = _to_list_str( + item.get("equation_footnote") or item.get("footnotes") + ) + eq_text = str(item.get("text") or item.get("content") or "").strip() + wrapped = ( + f'{eq_text}' + if eq_text + else f'公式{equation_idx}' + ) + blockid = _append_block( + wrapped, + heading=current_heading, + level=current_level, + parent_headings=current_parent_headings, + ) + equations[eq_id] = { + "id": eq_id, + "blockid": blockid, + "heading": current_heading, + "format": "latex", + "content": eq_text, + "caption": caption, + 
"footnotes": footnotes, + } + continue + + if item_type == "table": + table_idx += 1 + table_id = str( + item.get("id") + or f"tb-{doc_id.removeprefix('doc-')}-{table_idx:04d}" + ) + caption = str(item.get("caption") or f"表格{table_idx}") + table_caption = _to_list_str(item.get("table_caption")) + if table_caption and not item.get("caption"): + caption = table_caption[0] + footnotes = _to_list_str( + item.get("table_footnote") or item.get("footnotes") + ) + table_body = item.get("table_body") or item.get("content") or "" + rows = item.get("rows") if isinstance(item.get("rows"), list) else None + ( + fmt, + table_content, + normalized_rows, + inferred_num_rows, + inferred_num_cols, + ) = _coerce_table_rows(rows if rows is not None else table_body) + rows = normalized_rows or (rows if isinstance(rows, list) else []) + cite_text = ( + f'表{table_idx}' + ) + blockid = _append_block( + cite_text, + heading=current_heading, + level=current_level, + parent_headings=current_parent_headings, + ) + tables[table_id] = { + "id": table_id, + "blockid": blockid, + "heading": current_heading, + "dimension": [ + _parse_int(item.get("num_rows"), inferred_num_rows), + _parse_int(item.get("num_cols"), inferred_num_cols), + ], + "format": fmt, + "content": table_content, + "caption": caption, + "footnotes": footnotes, + "image": item.get("img_path") or item.get("image"), + } + continue + + if item_type in {"image", "picture", "drawing"}: + drawing_idx += 1 + drawing_id = str( + item.get("id") + or f"im-{doc_id.removeprefix('doc-')}-{drawing_idx:04d}" + ) + image_caption = _to_list_str( + item.get("image_caption") or item.get("captions") + ) + caption = str( + item.get("caption") + or (image_caption[0] if image_caption else f"图{drawing_idx}") + ) + footnotes = _to_list_str( + item.get("image_footnote") or item.get("footnotes") + ) + path_val = str(item.get("img_path") or item.get("path") or "") + src_val = str(item.get("src") or "") + fmt = ( + Path(path_val).suffix.lower().lstrip(".") + 
if path_val + else str(item.get("format") or "") + ) + drawing_tag = ( + f'' + ) + blockid = _append_block( + drawing_tag, + heading=current_heading, + level=current_level, + parent_headings=current_parent_headings, + ) + drawings[drawing_id] = { + "id": drawing_id, + "blockid": blockid, + "heading": current_heading, + "format": fmt, + "path": path_val, + "src": src_val, + "caption": caption, + "footnotes": footnotes, + } + continue + + # Fallback: serialize unknown item to text for robustness. + fallback_text = str(item.get("text") or item.get("content") or "").strip() + if fallback_text: + _append_block( + fallback_text, + heading=current_heading, + level=current_level, + parent_headings=current_parent_headings, + ) + + merged_text = "\n\n".join([x for x in merged_parts if x.strip()]) + doc_hash = hashlib.sha256(merged_text.encode("utf-8")).hexdigest() + parse_time = datetime.now(timezone.utc).isoformat() + meta = { + "type": "meta", + "format": "lightrag", + "version": "1.0", + "document_name": document_name, + "document_format": Path(document_name).suffix.lower().lstrip("."), + "document_hash": f"sha256:{doc_hash}", + "table_file": bool(tables), + "equation_file": bool(equations), + "drawing_file": bool(drawings), + "asset_dir": False, + "split_option": {}, + "blocks": len(blocks_lines), + "doc_id": doc_id, + "parse_engine": engine, + "parse_time": parse_time, + "doc_title": Path(document_name).stem or document_name, + } + blocks_path.write_text( + "\n".join([json.dumps(meta, ensure_ascii=False)] + blocks_lines) + "\n", + encoding="utf-8", + ) + + if tables: + tables_path.write_text( + json.dumps( + {"version": "1.0", "tables": tables}, ensure_ascii=False, indent=2 + ), + encoding="utf-8", + ) + if drawings: + drawings_path.write_text( + json.dumps( + {"version": "1.0", "drawings": drawings}, + ensure_ascii=False, + indent=2, + ), + encoding="utf-8", + ) + if equations: + equations_path.write_text( + json.dumps( + {"version": "1.0", "equations": equations}, + 
ensure_ascii=False, + indent=2, + ), + encoding="utf-8", + ) + + # Keep full_docs in sync so restart/reprocess can directly use LightRAG Document. + await self._persist_parsed_full_docs( + doc_id, + { + "content": make_lightrag_doc_content(merged_text), + "file_path": file_path, + "parse_format": FULL_DOCS_FORMAT_LIGHTRAG, + "sidecar_location": sidecar_uri_for(parsed_dir), + "parse_engine": engine, + "update_time": int(time.time()), + }, + ) + await archive_docx_source_after_full_docs_sync( + self._resolve_source_file_for_parser(file_path) + ) + return { + "doc_id": doc_id, + "file_path": file_path, + "parse_format": FULL_DOCS_FORMAT_LIGHTRAG, + "content": merged_text, + "blocks_path": str(blocks_path), + } + + # ============================================================ + # Multimodal / VLM + # ============================================================ + + async def analyze_multimodal( + self, + doc_id: str, + file_path: str, + parsed_data: dict[str, Any], + *, + process_options: str | None = None, + ) -> dict[str, Any]: + """Phase 2: Multimodal analysis (VLM). Writes llm_analyze_result to LightRAG Document. + + Per-document ``i`` / ``t`` / ``e`` flags from + ``full_docs.process_options`` decide which modalities are sent to the + VLM. Sidecars are always written by the parser regardless of these + flags so toggling options later does not require re-parsing — only + the ``llm_analyze_result`` payload is gated here. + + Per-item ``llm_analyze_result`` is recomputed and overwritten on each + run for enabled modalities. This lets operators fix VLM/EXTRACT + configuration or prompt limits and retry without manually clearing + prior failure markers from the sidecar. + + Args: + process_options: Optional override that bypasses the + ``full_docs.process_options`` lookup; primarily used by unit + tests that exercise the VLM analysis path without going + through the enqueue pipeline. 
+ """ + from lightrag.parser_routing import parse_process_options + + blocks_path = parsed_data.get("blocks_path") + if not blocks_path: + return parsed_data + + block_file = Path(blocks_path) + if not block_file.exists(): + return parsed_data + + # Resolve which modalities the user opted into for this document. + if process_options is None: + try: + content_data = await self.full_docs.get_by_id(doc_id) or {} + except Exception: + content_data = {} + options_str = ( + content_data.get("process_options") + if isinstance(content_data, dict) + else "" + ) or "" + else: + options_str = process_options + process_opts = parse_process_options(options_str) + if not (process_opts.images or process_opts.tables or process_opts.equations): + logger.debug( + f"[analyze_multimodal] no i/t/e options set for d-id: {doc_id}; " + f"skipping VLM analysis" + ) + return parsed_data + + # Diagnose opt-in vs sidecar mismatch up-front so users investigating + # "why did VLM not run on my images" see a one-line INFO per document + # instead of silent skips. Empty sidecars are a normal outcome + # (some documents simply have no images/tables/equations), so this is + # informational rather than a warning. + sidecar_base = str(block_file) + if sidecar_base.endswith(".blocks.jsonl"): + sidecar_base = sidecar_base[: -len(".blocks.jsonl")] + opt_in_missing: list[str] = [] + for opt_char, modality, suffix in ( + ("i", "drawings", ".drawings.json"), + ("t", "tables", ".tables.json"), + ("e", "equations", ".equations.json"), + ): + enabled = { + "i": process_opts.images, + "t": process_opts.tables, + "e": process_opts.equations, + }[opt_char] + if enabled and not Path(sidecar_base + suffix).exists(): + opt_in_missing.append(f"{opt_char}:{modality}") + if opt_in_missing: + logger.info( + f"[analyze_multimodal] {','.join(opt_in_missing)} sidecar empty: {doc_id}" + ) + + # Backfill sidecar `surrounding` for the enabled modalities just + # before VLM consumption. 
Universal coverage: native, MinerU, + # Docling, and pre-existing LightRAG documents reused from disk + # all go through this single entrypoint. Idempotent: re-runs + # overwrite with stable output given unchanged block content. + enabled_modalities = { + mod + for mod, on in ( + ("drawings", process_opts.images), + ("tables", process_opts.tables), + ("equations", process_opts.equations), + ) + if on + } + tokenizer = getattr(self, "tokenizer", None) + if enabled_modalities and tokenizer is not None: + try: + from lightrag.multimodal_context import ( + enrich_sidecars_with_surrounding, + ) + + enrich_counts = enrich_sidecars_with_surrounding( + blocks_path=str(block_file), + enabled_modalities=enabled_modalities, + tokenizer=tokenizer, + ) + if any(enrich_counts.values()): + logger.info( + "[analyze_multimodal] " + + ", ".join(f"{k}={v}" for k, v in enrich_counts.items() if v) + + f" surrounding backfilled: {doc_id}" + ) + except Exception as enrich_err: + logger.warning( + f"[analyze_multimodal] surrounding enrichment failed for " + f"d-id: {doc_id}, file: {file_path}: {enrich_err}" + ) + + try: + lines = block_file.read_text(encoding="utf-8").splitlines() + if not lines: + return parsed_data + meta = json.loads(lines[0]) + if not isinstance(meta, dict) or meta.get("type") != "meta": + return parsed_data + + from lightrag.llm._vision_utils import ( + image_audit_metadata, + image_cache_metadata, + normalize_image_inputs, + read_image_dimensions, + ) + from lightrag.prompt_multimodal import ( + IMAGE_TYPE_ENUM, + IMAGE_TYPE_FALLBACK, + MULTIMODAL_PROMPTS, + ) + from lightrag.constants import ( + DEFAULT_MM_ANALYSIS_PRIORITY, + DEFAULT_MM_IMAGE_MIN_PIXEL, + DEFAULT_SUMMARY_LANGUAGE, + ) + + global_config = self._build_global_config() + addon_params = global_config.get("addon_params") or {} + language = ( + global_config.get("_resolved_summary_language") + or addon_params.get("language") + or DEFAULT_SUMMARY_LANGUAGE + ) + vlm_process_enable = 
bool(global_config.get("vlm_process_enable", False)) + max_image_bytes = max( + 256 * 1024, + int(os.getenv("VLM_MAX_IMAGE_BYTES", str(5 * 1024 * 1024))), + ) + min_image_pixel = max( + 1, + int(os.getenv("VLM_MIN_IMAGE_PIXEL", str(DEFAULT_MM_IMAGE_MIN_PIXEL))), + ) + # Multimodal analysis shares the entity-extraction cache flag + # (both run with mode="default" — see handle_cache short-circuit + # in lightrag.utils). When the flag is off we must NOT save the + # response either, otherwise stale cache entries would still + # accumulate while reads are blocked. cache_id attachment to + # the sidecar item.llm_cache_list is likewise gated so a + # disabled cache does not seed cache-cleanup metadata that + # corresponds to entries that were never persisted. + analysis_cache_enabled = bool( + global_config.get("enable_llm_cache_for_entity_extract") + ) + + use_vlm_func = self.role_llm_funcs.get("vlm") + use_extract_func = self.role_llm_funcs.get("extract") + vlm_cache_identity = get_llm_cache_identity(global_config, role="vlm") + extract_cache_identity = get_llm_cache_identity( + global_config, role="extract" + ) + + _IMAGE_TYPE_VALUES = set(IMAGE_TYPE_ENUM) + _VLM_RASTER_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp"} + + def _json_extract(text: str) -> dict[str, Any]: + """Tolerant JSON object recovery. + + Mirrors :func:`lightrag.operate._process_json_extraction_result` + so weaker models that emit ```json ... ``` fenced output, + trailing commas, or unquoted keys are still salvageable. + The order of attempts is: + + 1. Strip a leading ```json fence if present. + 2. Hand the cleaned string to ``json_repair.loads`` (handles + minor structural slips like trailing commas). + 3. Fall back to a greedy ``{...}`` regex slice for outputs + that wrap the JSON object in prose, then re-run + ``json_repair.loads`` on the slice. 
+ """ + if not text: + return {} + candidate = text.strip() + fence_match = re.match( + r"^```(?:json)?\s*\n(.*?)\n```$", + candidate, + re.DOTALL | re.IGNORECASE, + ) + if fence_match: + candidate = fence_match.group(1).strip() + try: + obj = json_repair.loads(candidate) + if isinstance(obj, dict): + return obj + except Exception: + pass + m = re.search(r"\{[\s\S]*\}", candidate) + if m: + try: + obj = json_repair.loads(m.group(0)) + if isinstance(obj, dict): + return obj + except Exception: + pass + return {} + + def _normalize_text(value: Any) -> str: + if value is None: + return "" + if isinstance(value, str): + return value.strip() + if isinstance(value, (list, tuple)): + return "\n".join(str(v).strip() for v in value if str(v).strip()) + return str(value).strip() + + def _captions_value(item_obj: dict[str, Any]) -> str: + return _normalize_text(item_obj.get("caption")) or "n/a" + + def _footnotes_value(item_obj: dict[str, Any]) -> str: + raw = item_obj.get("footnotes") + if isinstance(raw, (list, tuple)): + joined = "; ".join(str(v).strip() for v in raw if str(v).strip()) + return joined or "n/a" + text = _normalize_text(raw) + return text or "n/a" + + def _surrounding_value(item_obj: dict[str, Any], key: str) -> str: + surrounding = item_obj.get("surrounding") or {} + if not isinstance(surrounding, dict): + return "n/a" + value = _normalize_text(surrounding.get(key)) + return value or "n/a" + + def _resolve_image_path( + path_str: str | None, sidecar_dir: Path + ) -> Path | None: + if not path_str: + return None + candidate = Path(path_str) + if not candidate.is_absolute(): + sidecar_candidate = sidecar_dir / path_str + if sidecar_candidate.exists() and sidecar_candidate.is_file(): + candidate = sidecar_candidate + if candidate.exists() and candidate.is_file(): + return candidate + return None + + def _failure_result(message: str) -> dict[str, Any]: + return { + "analyze_time": int(time.time()), + "status": "failure", + "message": message, + } + + def 
_skipped_result(message: str) -> dict[str, Any]: + return { + "analyze_time": int(time.time()), + "status": "skipped", + "message": message, + } + + async def _analyze_drawing( + item_id: str, item: dict[str, Any], sidecar_dir: Path + ) -> tuple[dict[str, Any], str | None]: + path_str = ( + item.get("path") or item.get("img_path") or item.get("image_path") + ) + candidate = _resolve_image_path(path_str, sidecar_dir) + if candidate is None: + return ( + _skipped_result(f"image file not found: {path_str or 'n/a'}"), + None, + ) + ext = candidate.suffix.lower() + if ext not in _VLM_RASTER_EXTS: + return ( + _skipped_result(f"unsupported image format: {ext}"), + None, + ) + dims = read_image_dimensions(candidate) + if dims is not None and ( + dims[0] < min_image_pixel or dims[1] < min_image_pixel + ): + return ( + _skipped_result( + f"image width or height is smaller than " + f"{min_image_pixel}px" + ), + None, + ) + if not vlm_process_enable or use_vlm_func is None: + raise MultimodalAnalysisError( + f"drawings/{item_id}: VLM analysis required but " + "VLM role is not available " + "(VLM_PROCESS_ENABLE or vlm role config)" + ) + try: + raw = candidate.read_bytes() + except OSError as exc: + raise MultimodalAnalysisError( + f"drawings/{item_id}: cannot read image {candidate}: {exc}" + ) from exc + if not raw: + raise MultimodalAnalysisError( + f"drawings/{item_id}: image file is empty" + ) + if len(raw) > max_image_bytes: + return ( + _skipped_result( + f"image too large: {len(raw)} bytes " + f"(limit {max_image_bytes})" + ), + None, + ) + mime, _ = mimetypes.guess_type(str(candidate)) + mime = mime or "image/png" + img_payload = { + "base64": base64.b64encode(raw).decode("ascii"), + "mime_type": mime, + "source_id": item_id, + "source_file": str(candidate), + "modality": "image", + "doc_id": doc_id, + } + normalized_images = normalize_image_inputs([img_payload]) + prompt = MULTIMODAL_PROMPTS["image_analysis"].format( + language=language, + content="", + 
captions=_captions_value(item), + footnotes=_footnotes_value(item), + leading=_surrounding_value(item, "leading"), + trailing=_surrounding_value(item, "trailing"), + item_id=item_id, + file_path=file_path, + ) + args_hash = compute_args_hash( + prompt, + "", + "", + serialize_llm_cache_identity(vlm_cache_identity), + _serialize_cache_variant({"type": "json_object"}), + _serialize_cache_variant(image_cache_metadata(normalized_images)), + "drawing", + ) + cache_id = generate_cache_key("default", "analysis", args_hash) + cached = await handle_cache( + self.llm_response_cache, + args_hash, + prompt, + mode="default", + cache_type="analysis", + ) + if cached is not None: + result_text = cached[0] + fresh = False + else: + try: + result_text = await use_vlm_func( + prompt, + stream=False, + image_inputs=[img_payload], + _priority=DEFAULT_MM_ANALYSIS_PRIORITY, + ) + except Exception as exc: + raise MultimodalAnalysisError( + f"drawings/{item_id}: VLM call failed: {exc}" + ) from exc + fresh = True + parsed = _json_extract(str(result_text)) + name = parsed.get("name") + type_value = parsed.get("type") + description = parsed.get("description") + if not isinstance(name, str) or not name.strip(): + raise MultimodalAnalysisError( + f"drawings/{item_id}: missing or invalid field 'name'" + ) + if not isinstance(description, str) or not description.strip(): + raise MultimodalAnalysisError( + f"drawings/{item_id}: missing or invalid field 'description'" + ) + if not isinstance(type_value, str) or not type_value.strip(): + raise MultimodalAnalysisError( + f"drawings/{item_id}: missing or invalid field 'type'" + ) + if type_value not in _IMAGE_TYPE_VALUES: + type_value = IMAGE_TYPE_FALLBACK + cache_id_to_attach: str | None = None + if fresh and analysis_cache_enabled: + audit_blob = image_audit_metadata(normalized_images) + original_prompt = prompt + ( + f"\n" + f"{json.dumps(audit_blob, ensure_ascii=False)}" + "" + if audit_blob + else "" + ) + await save_to_cache( + 
self.llm_response_cache, + CacheData( + args_hash=args_hash, + content=str(result_text), + prompt=original_prompt, + mode="default", + cache_type="analysis", + chunk_id=None, + ), + ) + cache_id_to_attach = cache_id + elif not fresh: + # Cache hit: the entry exists, so attaching its id is + # safe (and necessary for document-delete cleanup). + cache_id_to_attach = cache_id + return ( + { + "name": name.strip(), + "type": type_value, + "description": description.strip(), + "analyze_time": int(time.time()), + "status": "success", + "message": "", + }, + cache_id_to_attach, + ) + + async def _analyze_text_modality( + kind: str, item_id: str, item: dict[str, Any] + ) -> tuple[dict[str, Any], str | None]: + if use_extract_func is None: + raise MultimodalAnalysisError( + f"{kind}/{item_id}: EXTRACT role is required but not configured" + ) + content_text = _normalize_text(item.get("content")) + if not content_text: + if kind == "table": + # Defensive fallback for sidecars that still carry + # empty-bodied table items (e.g. produced by an older + # parser run, or by a parser that doesn't filter + # MinerU-style misidentified blanks). Don't abort the + # whole worker — record the skip and move on. 
+ logger.warning( + f"[analyze_multimodal] table/{item_id}: missing " + f"table content; skipping analysis ({file_path})" + ) + return ( + _skipped_result("missing table content"), + None, + ) + raise MultimodalAnalysisError( + f"{kind}/{item_id}: missing {kind} content" + ) + template = MULTIMODAL_PROMPTS[f"{kind}_analysis"] + + def _render(content_value: str) -> str: + return template.format( + language=language, + content=content_value, + captions=_captions_value(item), + footnotes=_footnotes_value(item), + leading=_surrounding_value(item, "leading"), + trailing=_surrounding_value(item, "trailing"), + item_id=item_id, + file_path=file_path, + ) + + prompt = _render(content_text) + + # Cap the EXTRACT prompt at MAX_EXTRACT_INPUT_TOKENS by + # trimming the (typically huge) sidecar `content` field — the + # other slots (surrounding/captions/footnotes) already have + # their own per-field caps upstream. The cap is resolved + # from the env var (falling back to + # DEFAULT_MAX_EXTRACT_INPUT_TOKENS) so deployments can tune + # it for their model's context window. + tokenizer = getattr(self, "tokenizer", None) + if tokenizer is not None: + from lightrag.constants import DEFAULT_MAX_EXTRACT_INPUT_TOKENS + from lightrag.multimodal_context import trim_content_to_budget + + SAFETY_BUFFER = 256 + max_extract_tokens = get_env_value( + "MAX_EXTRACT_INPUT_TOKENS", + DEFAULT_MAX_EXTRACT_INPUT_TOKENS, + int, + ) + total_tokens = len(tokenizer.encode(prompt)) + if max_extract_tokens > 0 and total_tokens > max_extract_tokens: + frame_tokens = len(tokenizer.encode(_render(""))) + content_budget = ( + max_extract_tokens - frame_tokens - SAFETY_BUFFER + ) + if content_budget <= 0: + # The prompt template alone (with empty content) + # already exceeds the cap — no content trim can + # bring the request under the limit. Fail this + # item rather than handing the LLM a payload we + # know will trigger ``context_length_exceeded``. 
+ # Operators must raise MAX_EXTRACT_INPUT_TOKENS + # above the template frame for analysis to + # succeed; the document is reprocessable + # idempotently once the cap is widened. + raise MultimodalAnalysisError( + f"{kind}/{item_id}: prompt frame " + f"({frame_tokens} tokens) exceeds " + f"MAX_EXTRACT_INPUT_TOKENS " + f"({max_extract_tokens}); raise the cap" + ) + trimmed, was_trimmed = trim_content_to_budget( + content_text, + kind=f"{kind}s", + max_tokens=content_budget, + tokenizer=tokenizer, + ) + if was_trimmed: + prompt = _render(trimmed) + logger.warning( + f"[analyze_multimodal] {kind}/{item_id} " + f"content trimmed (prompt {total_tokens} " + f"→ fit {max_extract_tokens}, " + f"content_budget={content_budget})" + ) + # Post-trim hard guard: ``trim_content_to_budget`` + # is constrained by ``content_budget`` so the final + # prompt should fit within ``max_extract_tokens``; + # defend against tokenizer rounding / future template + # changes that could push it over. Refuse the call + # rather than send an over-cap prompt to the LLM. 
+ final_tokens = len(tokenizer.encode(prompt)) + if final_tokens > max_extract_tokens: + raise MultimodalAnalysisError( + f"{kind}/{item_id}: trimmed prompt " + f"({final_tokens} tokens) still exceeds " + f"MAX_EXTRACT_INPUT_TOKENS " + f"({max_extract_tokens})" + ) + + args_hash = compute_args_hash( + prompt, + "", + "", + serialize_llm_cache_identity(extract_cache_identity), + _serialize_cache_variant({"type": "json_object"}), + _serialize_cache_variant([]), + kind, + ) + cache_id = generate_cache_key("default", "analysis", args_hash) + cached = await handle_cache( + self.llm_response_cache, + args_hash, + prompt, + mode="default", + cache_type="analysis", + ) + if cached is not None: + result_text = cached[0] + fresh = False + else: + try: + result_text = await use_extract_func( + prompt, + stream=False, + response_format={"type": "json_object"}, + _priority=DEFAULT_MM_ANALYSIS_PRIORITY, + ) + except Exception as exc: + raise MultimodalAnalysisError( + f"{kind}/{item_id}: EXTRACT call failed: {exc}" + ) from exc + fresh = True + parsed = _json_extract(str(result_text)) + name = parsed.get("name") + description = parsed.get("description") + if not isinstance(name, str) or not name.strip(): + raise MultimodalAnalysisError( + f"{kind}/{item_id}: missing or invalid field 'name'" + ) + if not isinstance(description, str) or not description.strip(): + raise MultimodalAnalysisError( + f"{kind}/{item_id}: missing or invalid field 'description'" + ) + result_obj: dict[str, Any] = { + "name": name.strip(), + "description": description.strip(), + "analyze_time": int(time.time()), + "status": "success", + "message": "", + } + if kind == "equation": + equation_value = parsed.get("equation") + if ( + not isinstance(equation_value, str) + or not equation_value.strip() + ): + raise MultimodalAnalysisError( + f"equation/{item_id}: missing or invalid field 'equation'" + ) + result_obj["equation"] = equation_value.strip() + cache_id_to_attach: str | None = None + if fresh and 
analysis_cache_enabled: + await save_to_cache( + self.llm_response_cache, + CacheData( + args_hash=args_hash, + content=str(result_text), + prompt=prompt, + mode="default", + cache_type="analysis", + chunk_id=None, + ), + ) + cache_id_to_attach = cache_id + elif not fresh: + # Cache hit path (handle_cache already gated by flag): + # safe to surface the existing cache_id for cleanup. + cache_id_to_attach = cache_id + return (result_obj, cache_id_to_attach) + + def _attach_cache_id( + item_obj: dict[str, Any], cache_id: str | None + ) -> None: + if not cache_id: + return + existing = item_obj.get("llm_cache_list") + if not isinstance(existing, list): + existing = [] + if cache_id not in existing: + existing.append(cache_id) + item_obj["llm_cache_list"] = existing + + base_name = str(block_file) + if base_name.endswith(".blocks.jsonl"): + base_name = base_name[: -len(".blocks.jsonl")] + sidecars = [ + ( + Path(base_name + ".drawings.json"), + "drawings", + "drawing", + process_opts.images, + ), + ( + Path(base_name + ".tables.json"), + "tables", + "table", + process_opts.tables, + ), + ( + Path(base_name + ".equations.json"), + "equations", + "equation", + process_opts.equations, + ), + ] + for sidecar_path, root_key, kind, enabled in sidecars: + if not enabled or not sidecar_path.exists(): + continue + try: + payload = json.loads(sidecar_path.read_text(encoding="utf-8")) + except Exception as exc: + raise MultimodalAnalysisError( + f"failed to read sidecar {sidecar_path}: {exc}" + ) from exc + items = payload.get(root_key, {}) + if not isinstance(items, dict): + continue + + valid_keys: list[str] = [] + analyze_tasks: list[Any] = [] + for item_id, item in items.items(): + if not isinstance(item, dict): + continue + valid_keys.append(item_id) + if kind == "drawing": + analyze_tasks.append( + _analyze_drawing(item_id, item, sidecar_path.parent) + ) + else: + analyze_tasks.append( + _analyze_text_modality(kind, item_id, item) + ) + + analyzed = await 
asyncio.gather(*analyze_tasks, return_exceptions=True) + + failure_to_raise: MultimodalAnalysisError | None = None + for idx, item_id in enumerate(valid_keys): + item = items.get(item_id) + if not isinstance(item, dict): + continue + outcome = analyzed[idx] + if isinstance(outcome, MultimodalAnalysisError): + item["llm_analyze_result"] = _failure_result(str(outcome)) + if failure_to_raise is None: + failure_to_raise = outcome + continue + if isinstance(outcome, Exception): + item["llm_analyze_result"] = _failure_result( + f"unexpected error: {outcome}" + ) + if failure_to_raise is None: + failure_to_raise = MultimodalAnalysisError( + f"{root_key}/{item_id}: unexpected error: {outcome}" + ) + continue + result_obj, cache_id = outcome + item["llm_analyze_result"] = result_obj + _attach_cache_id(item, cache_id) + try: + sidecar_path.write_text( + json.dumps(payload, ensure_ascii=False, indent=2), + encoding="utf-8", + ) + except OSError as exc: + logger.warning( + f"[analyze_multimodal] failed to write sidecar " + f"{sidecar_path}: {exc}" + ) + if failure_to_raise is not None: + raise failure_to_raise + + parsed_data["multimodal_processed"] = True + logger.info(f"[analyze_multimodal] completed for d-id: {doc_id}") + except MultimodalAnalysisError: + raise + except Exception as e: + logger.warning(f"[analyze_multimodal] failed for d-id: {doc_id}: {e}") + return parsed_data + + def _build_mm_chunks_from_sidecars( + self, + doc_id: str, + file_path: str, + blocks_path: str, + base_order_index: int, + process_options: str | None = None, + ) -> list[dict[str, Any]]: + """Build multimodal chunks from sidecars carrying analysis results. + + Only items whose ``llm_analyze_result.status == "success"`` produce + chunks. ``"skipped"`` items are silently ignored; ``"failure"`` + items raise :class:`MultimodalAnalysisError` so the document is + marked failed (a failure should already have aborted the analyze + phase — this is a defensive recheck). 
+ + Each chunk follows the new schema: nested ``heading`` and + ``sidecar`` dicts, no flat ``parent_headings`` / ``level`` / + ``content_type`` fields. ``llm_cache_list`` is merged from the + underlying sidecar item so document deletion can clean up the + ``cache_type="analysis"`` entries it created. + + ``process_options`` gates which modality sidecars are read: a + document re-processed after opting out of ``i`` / ``t`` / ``e`` + must NOT pick up stale success results from a prior pass. When + ``None`` (e.g. ad-hoc unit tests), every modality is considered. + + Raises: + MultimodalAnalysisError: when an item carries ``status="failure"``, + or when the multimodal chunk cannot be fit under the + extraction token budget even after truncating description + to :data:`DEFAULT_MM_CHUNK_DESCRIPTION_MIN_TOKENS`. + """ + from lightrag.constants import ( + DEFAULT_MAX_EXTRACT_INPUT_TOKENS, + DEFAULT_MM_CHUNK_DESCRIPTION_MIN_TOKENS, + ) + from lightrag.parser_routing import parse_process_options + + block_file = Path(blocks_path) + if not block_file.exists(): + return [] + + base = str(block_file) + if base.endswith(".blocks.jsonl"): + base = base[: -len(".blocks.jsonl")] + + if process_options is None: + allowed = {"drawing", "table", "equation"} + else: + opts = parse_process_options(process_options) + allowed = set() + if opts.images: + allowed.add("drawing") + if opts.tables: + allowed.add("table") + if opts.equations: + allowed.add("equation") + + sidecar_defs = [ + (root, Path(base + suffix), kind) + for root, suffix, kind in ( + ("drawings", ".drawings.json", "drawing"), + ("tables", ".tables.json", "table"), + ("equations", ".equations.json", "equation"), + ) + if kind in allowed + ] + + mm_chunks: list[dict[str, Any]] = [] + order = base_order_index + + def _norm_str_list(v: Any) -> list[str]: + if v is None: + return [] + if isinstance(v, list): + return [str(x).strip() for x in v if str(x).strip()] + s = str(v).strip() + return [s] if s else [] + + def 
_norm_parent_headings(value: Any) -> list[str]: + if not isinstance(value, list): + return [] + return [str(p).strip() for p in value if str(p or "").strip()] + + def _build_heading_dict(item: dict[str, Any]) -> dict[str, Any] | None: + heading_raw = item.get("heading") + if isinstance(heading_raw, dict): + heading_text = str(heading_raw.get("heading") or "").strip() + parents = _norm_parent_headings(heading_raw.get("parent_headings")) + try: + level = int(heading_raw.get("level") or 0) + except (TypeError, ValueError): + level = 0 + else: + heading_text = str(heading_raw or "").strip() + parents = _norm_parent_headings(item.get("parent_headings")) + try: + level = int(item.get("level") or 0) + except (TypeError, ValueError): + level = 0 + if not heading_text and not parents and level == 0: + return None + return { + "level": level, + "heading": heading_text, + "parent_headings": parents, + } + + def _render( + kind: str, + name: str, + image_type: str, + description: str, + footnotes_joined: str, + equation_body: str, + ) -> str: + # NOTE: the `[Image Name]` / `[Table Name]` / `[Equation Name]` + # leading labels below are a contract consumed by + # ``lightrag.operate._parse_mm_display_name`` (regex + # ``_MM_DISPLAY_NAME_PATTERN``). If you rename or restructure + # these labels, update that regex too, or relation descriptions + # will silently fall back to sidecar ids. The + # ``test_parse_mm_display_name_on_real_builder_output`` + # regression pins this contract end-to-end. 
+ if kind == "drawing": + head = f"[Image Name]{name}\n[Image Type]{image_type}" + footnote_label = "Image Footnotes" + elif kind == "table": + head = f"[Table Name]{name}" + footnote_label = "Table Footnotes" + else: # equation + head = f"{equation_body}\n[Equation Name]{name}" + footnote_label = "Equation Footnotes" + + sections = [head, description] + if footnotes_joined: + sections.append(f"[{footnote_label}]{footnotes_joined}") + return "\n\n".join(s for s in sections if s).strip() + + max_tokens = DEFAULT_MAX_EXTRACT_INPUT_TOKENS + min_desc_tokens = DEFAULT_MM_CHUNK_DESCRIPTION_MIN_TOKENS + + for root_key, sidecar_path, kind in sidecar_defs: + if not sidecar_path.exists(): + continue + try: + payload = json.loads(sidecar_path.read_text(encoding="utf-8")) + except Exception: + continue + items = payload.get(root_key, {}) + if not isinstance(items, dict): + continue + + for local_idx, (item_id, item) in enumerate(items.items()): + if not isinstance(item, dict): + continue + + analysis = item.get("llm_analyze_result") + if not isinstance(analysis, dict): + continue + status = analysis.get("status") + if status == "skipped": + continue + if status == "failure": + raise MultimodalAnalysisError( + f"{root_key}/{item_id}: llm_analyze_result.status='failure' " + f"({analysis.get('message') or 'no message'})" + ) + if status != "success": + # Treat unknown / legacy status as missing — no chunk. 
+ continue + + name = str(analysis.get("name") or "").strip() + description = str(analysis.get("description") or "").strip() + equation_body = str(analysis.get("equation") or "").strip() + image_type = str(analysis.get("type") or "").strip() + if not name: + raise MultimodalAnalysisError( + f"{root_key}/{item_id}: success result missing 'name'" + ) + if not description: + raise MultimodalAnalysisError( + f"{root_key}/{item_id}: success result missing 'description'" + ) + if kind == "drawing" and not image_type: + raise MultimodalAnalysisError( + f"drawings/{item_id}: success result missing 'type'" + ) + if kind == "equation" and not equation_body: + raise MultimodalAnalysisError( + f"equations/{item_id}: success result missing 'equation'" + ) + + footnotes_list = _norm_str_list(item.get("footnotes")) + footnotes_joined = "; ".join(footnotes_list) + + def _compose(desc: str) -> str: + return _render( + kind=kind, + name=name, + image_type=image_type, + description=desc, + footnotes_joined=footnotes_joined, + equation_body=equation_body, + ) + + chunk_content = _compose(description) + tokens = len(self.tokenizer.encode(chunk_content)) + if tokens > max_tokens: + # Truncate only the description, never name/type/equation. 
+ desc_tokens = self.tokenizer.encode(description) + overflow = tokens - max_tokens + keep = max(min_desc_tokens, len(desc_tokens) - overflow) + while True: + truncated_desc = self.tokenizer.decode(desc_tokens[:keep]) + chunk_content = _compose(truncated_desc) + tokens = len(self.tokenizer.encode(chunk_content)) + if tokens <= max_tokens or keep <= min_desc_tokens: + break + keep = max(min_desc_tokens, keep - (tokens - max_tokens)) + if tokens > max_tokens: + raise MultimodalAnalysisError( + f"{root_key}/{item_id}: multimodal chunk exceeds " + f"{max_tokens} tokens even after truncating description " + f"to {min_desc_tokens} tokens" + ) + + if not chunk_content: + continue + + heading_dict = _build_heading_dict(item) + sidecar_block = { + "type": kind, + "id": str(item_id), + "refs": [{"type": kind, "id": str(item_id)}], + } + cache_list = item.get("llm_cache_list") + cache_list = ( + [str(c) for c in cache_list if str(c).strip()] + if isinstance(cache_list, list) + else [] + ) + + chunk_dict: dict[str, Any] = { + "chunk_id": f"{doc_id}-mm-{kind}-{local_idx:03d}", + "chunk_order_index": order, + "content": chunk_content, + "tokens": tokens, + "sidecar": sidecar_block, + "llm_cache_list": cache_list, + } + if heading_dict is not None: + chunk_dict["heading"] = heading_dict + mm_chunks.append(chunk_dict) + order += 1 + + return mm_chunks diff --git a/lightrag/prompt.py b/lightrag/prompt.py index dcd829d487..fbfe61e0b0 100644 --- a/lightrag/prompt.py +++ b/lightrag/prompt.py @@ -1,5 +1,9 @@ from __future__ import annotations -from typing import Any +import os +from pathlib import Path +from typing import Any, Mapping, TypedDict + +import yaml PROMPTS: dict[str, Any] = {} @@ -8,102 +12,131 @@ PROMPTS["DEFAULT_TUPLE_DELIMITER"] = "<|#|>" PROMPTS["DEFAULT_COMPLETION_DELIMITER"] = "<|COMPLETE|>" +# Default entity type guidance injected into extraction prompts via {entity_types_guidance}. 
+# Users can override this by passing entity_types_guidance in addon_params, or by +# replacing the full prompt template string in PROMPTS. +PROMPTS[ + "default_entity_types_guidance" +] = """Classify each entity using one of the following types. If no type fits, use `Other`. + +- Person: Human individuals, real or fictional +- Creature: Non-human living beings (animals, mythical beings, etc.) +- Organization: Companies, institutions, government bodies, groups +- Location: Geographic places (cities, countries, buildings, regions) +- Event: Occurrences, incidents, ceremonies, meetings +- Concept: Abstract ideas, theories, principles, beliefs +- Method: Procedures, techniques, algorithms, workflows +- Content: Creative or informational works (books, articles, films, reports) +- Data: Quantitative or structured information (statistics, datasets, measurements) +- Artifact: Physical or digital objects created by humans (tools, software, devices) +- NaturalObject: Natural non-living objects (minerals, celestial bodies, chemical compounds)""" + PROMPTS["entity_extraction_system_prompt"] = """---Role--- -You are a Knowledge Graph Specialist responsible for extracting entities and relationships from the input text. +You are a Knowledge Graph Specialist responsible for extracting entities and relationships from the `---Input Text---` section of user prompt. ---Instructions--- -1. **Entity Extraction & Output:** - * **Identification:** Identify clearly defined and meaningful entities in the input text. - * **Entity Details:** For each identified entity, extract the following information: - * `entity_name`: The name of the entity. If the entity name is case-insensitive, capitalize the first letter of each significant word (title case). Ensure **consistent naming** across the entire extraction process. - * `entity_type`: Categorize the entity using one of the following types: `{entity_types}`. 
If none of the provided entity types apply, do not add new entity type and classify it as `Other`. - * `entity_description`: Provide a concise yet comprehensive description of the entity's attributes and activities, based *solely* on the information present in the input text. - * **Output Format - Entities:** Output a total of 4 fields for each entity, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `entity`. - * Format: `entity{tuple_delimiter}entity_name{tuple_delimiter}entity_type{tuple_delimiter}entity_description` - -2. **Relationship Extraction & Output:** - * **Identification:** Identify direct, clearly stated, and meaningful relationships between previously extracted entities. - * **N-ary Relationship Decomposition:** If a single statement describes a relationship involving more than two entities (an N-ary relationship), decompose it into multiple binary (two-entity) relationship pairs for separate description. - * **Example:** For "Alice, Bob, and Carol collaborated on Project X," extract binary relationships such as "Alice collaborated with Project X," "Bob collaborated with Project X," and "Carol collaborated with Project X," or "Alice collaborated with Bob," based on the most reasonable binary interpretations. - * **Relationship Details:** For each binary relationship, extract the following fields: - * `source_entity`: The name of the source entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive. - * `target_entity`: The name of the target entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive. - * `relationship_keywords`: One or more high-level keywords summarizing the overarching nature, concepts, or themes of the relationship. Multiple keywords within this field must be separated by a comma `,`. 
**DO NOT use `{tuple_delimiter}` for separating multiple keywords within this field.** - * `relationship_description`: A concise explanation of the nature of the relationship between the source and target entities, providing a clear rationale for their connection. - * **Output Format - Relationships:** Output a total of 5 fields for each relationship, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `relation`. - * Format: `relation{tuple_delimiter}source_entity{tuple_delimiter}target_entity{tuple_delimiter}relationship_keywords{tuple_delimiter}relationship_description` - -3. **Delimiter Usage Protocol:** - * The `{tuple_delimiter}` is a complete, atomic marker and **must not be filled with content**. It serves strictly as a field separator. - * **Incorrect Example:** `entity{tuple_delimiter}Tokyo<|location|>Tokyo is the capital of Japan.` - * **Correct Example:** `entity{tuple_delimiter}Tokyo{tuple_delimiter}location{tuple_delimiter}Tokyo is the capital of Japan.` - -4. **Relationship Direction & Duplication:** - * Treat all relationships as **undirected** unless explicitly stated otherwise. Swapping the source and target entities for an undirected relationship does not constitute a new relationship. - * Avoid outputting duplicate relationships. - -5. **Output Order & Prioritization:** - * Output all extracted entities first, followed by all extracted relationships. - * Within the list of relationships, prioritize and output those relationships that are **most significant** to the core meaning of the input text first. - -6. **Context & Objectivity:** - * Ensure all entity names and descriptions are written in the **third person**. - * Explicitly name the subject or object; **avoid using pronouns** such as `this article`, `this paper`, `our company`, `I`, `you`, and `he/she`. - -7. **Language & Proper Nouns:** - * The entire output (entity names, keywords, and descriptions) must be written in `{language}`. 
- * Proper nouns (e.g., personal names, place names, organization names) should be retained in their original language if a proper, widely accepted translation is not available or would cause ambiguity. - -8. **Completion Signal:** Output the literal string `{completion_delimiter}` only after all entities and relationships, following all criteria, have been completely extracted and outputted. +1. **Entity Extraction:** + - Identify clearly defined and meaningful entities in the `---Input Text---` section of user prompt. + - For each entity, extract: + - `entity_name`: The name of the entity. If the entity name is case-insensitive, capitalize the first letter of each significant word (title case). Ensure **consistent naming** across the entire extraction process. + - `entity_type`: Categorize the entity using the type guidance provided in the `---Entity Types---` section below. If none of the provided entity types apply, classify it as `Other`. + - `entity_description`: Provide a concise yet comprehensive description of the entity's attributes and activities, based *solely* on the information present in the input text. + +2. **Relationship Extraction:** + - Identify direct, clearly stated, and meaningful relationships between previously extracted entities. + - If a single statement describes a relationship involving more than two entities, decompose it into multiple binary relationships. + - For each binary relationship, extract: + - `source_entity`: The name of the source entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive. + - `target_entity`: The name of the target entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive. + - `relationship_keywords`: One or more high-level keywords summarizing the relationship. 
Multiple keywords within this field must be separated by a comma `,`. **DO NOT use `{tuple_delimiter}` for separating multiple keywords within this field.** + - `relationship_description`: A concise explanation of the nature of the relationship between the source and target entities. + +3. **Record Types:** + - `entity` is used only for entity rows and those rows always contain exactly 4 tuple parts total. + - `relation` is used only for relationship rows and those rows always contain exactly 5 tuple parts total. + - A row with two entity names plus relationship keywords and a relationship description must start with `relation`, never `entity`. + - After the last entity row, switch prefixes to `relation` for every relationship row. + +4. **Output Format:** + - Entity row: `entity{tuple_delimiter}entity_name{tuple_delimiter}entity_type{tuple_delimiter}entity_description` + - Relation row: `relation{tuple_delimiter}source_entity{tuple_delimiter}target_entity{tuple_delimiter}relationship_keywords{tuple_delimiter}relationship_description` + - Wrong: `entity{tuple_delimiter}Alice{tuple_delimiter}Acme{tuple_delimiter}founded{tuple_delimiter}Alice founded Acme` + - Correct: `relation{tuple_delimiter}Alice{tuple_delimiter}Acme{tuple_delimiter}founded{tuple_delimiter}Alice founded Acme` + +5. **Delimiter Usage:** + - The `{tuple_delimiter}` is a complete, atomic marker and **must not be filled with content**. It serves strictly as a field separator. + - Incorrect: `entity{tuple_delimiter}Tokyo<|location|>Tokyo is the capital of Japan.` + - Correct: `entity{tuple_delimiter}Tokyo{tuple_delimiter}location{tuple_delimiter}Tokyo is the capital of Japan.` + +6. **Output Order & Deduplication:** + - Output all extracted entities first, followed by all extracted relationships. + - Output at most {max_total_records} total rows across entities and relationships in this response. + - Output at most {max_entity_records} entity rows in this response. 
+ - Output fewer rows if fewer high-value items are present. Do not try to fill the limit. + - Only output relationship rows whose source and target entities are both included in the selected entity rows for this response. + - If the limit is reached, stop adding new rows immediately and output `{completion_delimiter}`. + - Treat all relationships as **undirected** unless explicitly stated otherwise. Swapping the source and target entities for an undirected relationship does not constitute a new relationship. + - Avoid outputting duplicate relationships. + - Within the list of relationships, output the relationships that are **most significant** to the core meaning of the input text first. + +7. **Context & Language:** + - Ensure all entity names and descriptions are written in the **third person**. + - Explicitly name the subject or object; **avoid using pronouns** such as `this article`, `this paper`, `our company`, `I`, `you`, and `he/she`. + - The entire output (entity names, keywords, and descriptions) must be written in `{language}`. + - Proper nouns (e.g., personal names, place names, organization names) should be retained in their original language if a proper, widely accepted translation is not available or would cause ambiguity. + +8. **Completion Signal:** Output the literal string `{completion_delimiter}` only after all entities and relationships have been completely extracted and outputted. + +---Entity Types--- +{entity_types_guidance} ---Examples--- {examples} """ PROMPTS["entity_extraction_user_prompt"] = """---Task--- -Extract entities and relationships from the input text in Data to be Processed below. +Extract entities and relationships from the `---Input Text---` section below. ---Instructions--- -1. **Strict Adherence to Format:** Strictly adhere to all format requirements for entity and relationship lists, including output order, field delimiters, and proper noun handling, as specified in the system prompt. -2. 
**Output Content Only:** Output *only* the extracted list of entities and relationships. Do not include any introductory or concluding remarks, explanations, or additional text before or after the list. -3. **Completion Signal:** Output `{completion_delimiter}` as the final line after all relevant entities and relationships have been extracted and presented. -4. **Output Language:** Ensure the output language is {language}. Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated. +1. **Strict Adherence to Format:** Strictly adhere to all format requirements for entity and relationship lists, including output order, field delimiters, and proper noun handling, as specified in the system prompt. +2. **Quantity Limits:** In this response, output at most {max_total_records} total rows and at most {max_entity_records} entity rows. Output fewer rows if fewer high-value items are present. Only output relationship rows whose source and target entities are both included in this response. +3. **Output Content Only:** Output *only* the extracted list of entities and relationships. Do not include any introductory or concluding remarks, explanations, or additional text before or after the list. +4. **Completion Signal:** Output `{completion_delimiter}` as the final line after all relevant entities and relationships have been extracted and presented. If the row limit is reached, output `{completion_delimiter}` immediately after the last allowed row. +5. **Output Language:** Ensure the output language is {language}. Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated. 
----Data to be Processed--- - -[{entity_types}] - - +---Input Text--- ``` {input_text} ``` - +---Output--- """ PROMPTS["entity_continue_extraction_user_prompt"] = """---Task--- -Based on the last extraction task, identify and extract any **missed or incorrectly formatted** entities and relationships from the input text. +Based on the last extraction task, identify and extract any missed or incorrectly formatted entities and relationships from the input text. ---Instructions--- -1. **Strict Adherence to System Format:** Strictly adhere to all format requirements for entity and relationship lists, including output order, field delimiters, and proper noun handling, as specified in the system instructions. -2. **Focus on Corrections/Additions:** - * **Do NOT** re-output entities and relationships that were **correctly and fully** extracted in the last task. - * If an entity or relationship was **missed** in the last task, extract and output it now according to the system format. - * If an entity or relationship was **truncated, had missing fields, or was otherwise incorrectly formatted** in the last task, re-output the *corrected and complete* version in the specified format. -3. **Output Format - Entities:** Output a total of 4 fields for each entity, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `entity`. -4. **Output Format - Relationships:** Output a total of 5 fields for each relationship, delimited by `{tuple_delimiter}`, on a single line. The first field *must* be the literal string `relation`. -5. **Output Content Only:** Output *only* the extracted list of entities and relationships. Do not include any introductory or concluding remarks, explanations, or additional text before or after the list. -6. **Completion Signal:** Output `{completion_delimiter}` as the final line after all relevant missing or corrected entities and relationships have been extracted and presented. -7. 
**Output Language:** Ensure the output language is {language}. Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated. - - +1. **Strict Adherence to System Format:** Strictly adhere to all format requirements for entity and relationship lists, including output order, field delimiters, and proper noun handling, as specified in the system instructions. +2. **Focus on Corrections/Additions:** + - **Do NOT** re-output entities and relationships that were **correctly and fully** extracted in the last task. + - If an entity or relationship was **missed** in the last task, extract and output it now according to the system format. + - If an entity or relationship was **truncated, had missing fields, or was otherwise incorrectly formatted** in the last task, re-output the *corrected and complete* version in the specified format. + - Any corrected relationship row must be emitted with the literal `relation` prefix, never `entity`. +3. **Quantity Limits:** In this response, output at most {max_total_records} total rows and at most {max_entity_records} entity rows. Output fewer rows if fewer high-value corrections or additions remain. A relationship row may reference entities that were already extracted correctly in the previous response. Do not re-output those entities unless they were missing or need correction. +4. **Output Content Only:** Output *only* the extracted list of entities and relationships. Do not include any introductory or concluding remarks, explanations, or additional text before or after the list. +5. **Completion Signal:** Output `{completion_delimiter}` as the final line after all relevant missing or corrected entities and relationships have been extracted and presented. If the row limit is reached, output `{completion_delimiter}` immediately after the last allowed row. +6. **Output Language:** Ensure the output language is {language}. 
Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated. + +---Output--- """ PROMPTS["entity_extraction_examples"] = [ - """ -["Person","Creature","Organization","Location","Event","Concept","Method","Content","Data","Artifact","NaturalObject"] + """---Entity Types--- +- Person: Human individuals, real or fictional +- Artifact: Physical or digital objects created by humans (tools, software, devices) +- Concept: Abstract ideas, theories, principles, beliefs - +---Input Text--- ``` while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order. @@ -114,12 +147,13 @@ It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths ``` - -entity{tuple_delimiter}Alex{tuple_delimiter}person{tuple_delimiter}Alex is a character who experiences frustration and is observant of the dynamics among other characters. -entity{tuple_delimiter}Taylor{tuple_delimiter}person{tuple_delimiter}Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective. -entity{tuple_delimiter}Jordan{tuple_delimiter}person{tuple_delimiter}Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device. -entity{tuple_delimiter}Cruz{tuple_delimiter}person{tuple_delimiter}Cruz is associated with a vision of control and order, influencing the dynamics among other characters. -entity{tuple_delimiter}The Device{tuple_delimiter}equipment{tuple_delimiter}The Device is central to the story, with potential game-changing implications, and is revered by Taylor. 
+---Output--- +entity{tuple_delimiter}Alex{tuple_delimiter}Person{tuple_delimiter}Alex is a character who experiences frustration and is observant of the dynamics among other characters. +entity{tuple_delimiter}Taylor{tuple_delimiter}Person{tuple_delimiter}Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective. +entity{tuple_delimiter}Jordan{tuple_delimiter}Person{tuple_delimiter}Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device. +entity{tuple_delimiter}Cruz{tuple_delimiter}Person{tuple_delimiter}Cruz is associated with a vision of control and order, influencing the dynamics among other characters. +entity{tuple_delimiter}The Device{tuple_delimiter}Artifact{tuple_delimiter}The Device is central to the story, with potential game-changing implications, and is revered by Taylor. +entity{tuple_delimiter}Discovery{tuple_delimiter}Concept{tuple_delimiter}Discovery represents the shared intellectual pursuit that unites Jordan and Alex in opposition to Cruz's controlling worldview. relation{tuple_delimiter}Alex{tuple_delimiter}Taylor{tuple_delimiter}power dynamics, observation{tuple_delimiter}Alex observes Taylor's authoritarian behavior and notes changes in Taylor's attitude toward the device. relation{tuple_delimiter}Alex{tuple_delimiter}Jordan{tuple_delimiter}shared goals, rebellion{tuple_delimiter}Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision.) relation{tuple_delimiter}Taylor{tuple_delimiter}Jordan{tuple_delimiter}conflict resolution, mutual respect{tuple_delimiter}Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce. 
@@ -128,56 +162,276 @@ {completion_delimiter} """, - """ -["Person","Creature","Organization","Location","Event","Concept","Method","Content","Data","Artifact","NaturalObject"] + """---Entity Types--- +- Person: Human individuals, real or fictional +- Location: Geographic places (cities, countries, buildings, regions) +- Creature: Non-human living beings (animals, mythical beings, etc.) +- Method: Procedures, techniques, algorithms, workflows +- Organization: Companies, institutions, government bodies, groups +- Content: Creative or informational works (books, articles, films, reports) +- NaturalObject: Natural non-living objects (minerals, celestial bodies, chemical compounds) + +---Input Text--- +``` +Dr. Elena Vasquez led a field expedition to the Borneo rainforest to document the population decline of the Bornean orangutan. Using transect sampling — a method where researchers walk predetermined line paths and record every animal sighting within a fixed distance — her team estimated that fewer than 1,500 individuals remained in the surveyed region. - +The expedition was funded by the Global Wildlife Conservation Institute and produced a landmark report titled "Primate Decline in Insular Southeast Asia." Vasquez attributed the collapse primarily to peat-soil destruction caused by palm oil plantation expansion, which had converted over 40% of the surveyed forest area within a decade. ``` -Stock markets faced a sharp downturn today as tech giants saw significant declines, with the global tech index dropping by 3.4% in midday trading. Analysts attribute the selloff to investor concerns over rising interest rates and regulatory uncertainty. -Among the hardest hit, nexon technologies saw its stock plummet by 7.8% after reporting lower-than-expected quarterly earnings. In contrast, Omega Energy posted a modest 2.1% gain, driven by rising oil prices. +---Output--- +entity{tuple_delimiter}Dr. Elena Vasquez{tuple_delimiter}Person{tuple_delimiter}Dr. 
Elena Vasquez is a field researcher who led an expedition to document orangutan population decline in Borneo. +entity{tuple_delimiter}Borneo Rainforest{tuple_delimiter}Location{tuple_delimiter}The Borneo rainforest is the field site of the expedition and the primary habitat of the Bornean orangutan. +entity{tuple_delimiter}Bornean Orangutan{tuple_delimiter}Creature{tuple_delimiter}The Bornean orangutan is a primate species whose population was found to have declined to fewer than 1,500 individuals in the surveyed region. +entity{tuple_delimiter}Transect Sampling{tuple_delimiter}Method{tuple_delimiter}Transect sampling is a wildlife survey technique where researchers walk predetermined paths and record animal sightings within a fixed lateral distance. +entity{tuple_delimiter}Global Wildlife Conservation Institute{tuple_delimiter}Organization{tuple_delimiter}The Global Wildlife Conservation Institute funded the expedition led by Dr. Vasquez. +entity{tuple_delimiter}Primate Decline in Insular Southeast Asia{tuple_delimiter}Content{tuple_delimiter}A landmark research report produced by Vasquez's expedition documenting primate population decline in the region. +entity{tuple_delimiter}Peat Soil{tuple_delimiter}NaturalObject{tuple_delimiter}Peat soil is a natural substrate in the Borneo rainforest that has been destroyed by palm oil plantation expansion. +relation{tuple_delimiter}Dr. Elena Vasquez{tuple_delimiter}Bornean Orangutan{tuple_delimiter}field research, population survey{tuple_delimiter}Dr. Vasquez led the expedition that documented the population decline of the Bornean orangutan. +relation{tuple_delimiter}Dr. Elena Vasquez{tuple_delimiter}Transect Sampling{tuple_delimiter}methodology, research application{tuple_delimiter}Dr. Vasquez's team used transect sampling to estimate the orangutan population. +relation{tuple_delimiter}Global Wildlife Conservation Institute{tuple_delimiter}Dr. 
Elena Vasquez{tuple_delimiter}funding, research support{tuple_delimiter}The institute funded the expedition led by Dr. Vasquez. +relation{tuple_delimiter}Dr. Elena Vasquez{tuple_delimiter}Primate Decline in Insular Southeast Asia{tuple_delimiter}authorship, research output{tuple_delimiter}Dr. Vasquez's expedition produced the landmark report on primate decline. +relation{tuple_delimiter}Peat Soil{tuple_delimiter}Borneo Rainforest{tuple_delimiter}habitat composition, ecological destruction{tuple_delimiter}Peat soil destruction in the Borneo rainforest was caused by palm oil plantation expansion and is a primary driver of orangutan decline. +{completion_delimiter} -Meanwhile, commodity markets reflected a mixed sentiment. Gold futures rose by 1.5%, reaching $2,080 per ounce, as investors sought safe-haven assets. Crude oil prices continued their rally, climbing to $87.60 per barrel, supported by supply constraints and strong demand. +""", + """---Entity Types--- +- Content: Creative or informational works (books, articles, films, reports) +- Artifact: Physical or digital objects created by humans (tools, software, devices) +- Person: Human individuals, real or fictional +- Organization: Companies, institutions, government bodies, groups +- Method: Procedures, techniques, algorithms, workflows +- Data: Quantitative or structured information (statistics, datasets, measurements) +- Concept: Abstract ideas, theories, principles, beliefs + +---Input Text--- +``` +The 2023 edition of "Advances in Neural Architecture Search" synthesized findings from over 200 peer-reviewed papers and introduced a new benchmarking framework called NASBench-360, designed to evaluate search algorithms across diverse task domains. The publication was co-authored by Dr. Priya Nair and Dr. Luca Ferretti of the DeepSystems Research Lab. -Financial experts are closely watching the Federal Reserve's next move, as speculation grows over potential rate hikes. 
The upcoming policy announcement is expected to influence investor confidence and overall market stability. +NASBench-360 measures three key metrics: search efficiency (time-to-solution), model accuracy on held-out test sets, and computational cost in GPU-hours. Early results showed that evolutionary search algorithms outperformed gradient-based methods by 12% on accuracy while consuming 30% fewer GPU-hours on vision tasks. ``` - -entity{tuple_delimiter}Global Tech Index{tuple_delimiter}category{tuple_delimiter}The Global Tech Index tracks the performance of major technology stocks and experienced a 3.4% decline today. -entity{tuple_delimiter}Nexon Technologies{tuple_delimiter}organization{tuple_delimiter}Nexon Technologies is a tech company that saw its stock decline by 7.8% after disappointing earnings. -entity{tuple_delimiter}Omega Energy{tuple_delimiter}organization{tuple_delimiter}Omega Energy is an energy company that gained 2.1% in stock value due to rising oil prices. -entity{tuple_delimiter}Gold Futures{tuple_delimiter}product{tuple_delimiter}Gold futures rose by 1.5%, indicating increased investor interest in safe-haven assets. -entity{tuple_delimiter}Crude Oil{tuple_delimiter}product{tuple_delimiter}Crude oil prices rose to $87.60 per barrel due to supply constraints and strong demand. -entity{tuple_delimiter}Market Selloff{tuple_delimiter}category{tuple_delimiter}Market selloff refers to the significant decline in stock values due to investor concerns over interest rates and regulations. -entity{tuple_delimiter}Federal Reserve Policy Announcement{tuple_delimiter}category{tuple_delimiter}The Federal Reserve's upcoming policy announcement is expected to impact investor confidence and market stability. -entity{tuple_delimiter}3.4% Decline{tuple_delimiter}category{tuple_delimiter}The Global Tech Index experienced a 3.4% decline in midday trading. 
-relation{tuple_delimiter}Global Tech Index{tuple_delimiter}Market Selloff{tuple_delimiter}market performance, investor sentiment{tuple_delimiter}The decline in the Global Tech Index is part of the broader market selloff driven by investor concerns. -relation{tuple_delimiter}Nexon Technologies{tuple_delimiter}Global Tech Index{tuple_delimiter}company impact, index movement{tuple_delimiter}Nexon Technologies' stock decline contributed to the overall drop in the Global Tech Index. -relation{tuple_delimiter}Gold Futures{tuple_delimiter}Market Selloff{tuple_delimiter}market reaction, safe-haven investment{tuple_delimiter}Gold prices rose as investors sought safe-haven assets during the market selloff. -relation{tuple_delimiter}Federal Reserve Policy Announcement{tuple_delimiter}Market Selloff{tuple_delimiter}interest rate impact, financial regulation{tuple_delimiter}Speculation over Federal Reserve policy changes contributed to market volatility and investor selloff. +---Output--- +entity{tuple_delimiter}Advances in Neural Architecture Search{tuple_delimiter}Content{tuple_delimiter}A 2023 publication that synthesizes findings from over 200 papers and introduces the NASBench-360 benchmarking framework. +entity{tuple_delimiter}NASBench-360{tuple_delimiter}Artifact{tuple_delimiter}NASBench-360 is a benchmarking framework introduced to evaluate neural architecture search algorithms across diverse task domains. +entity{tuple_delimiter}Dr. Priya Nair{tuple_delimiter}Person{tuple_delimiter}Dr. Priya Nair is a co-author of the publication and a researcher at the DeepSystems Research Lab. +entity{tuple_delimiter}Dr. Luca Ferretti{tuple_delimiter}Person{tuple_delimiter}Dr. Luca Ferretti is a co-author of the publication and a researcher at the DeepSystems Research Lab. +entity{tuple_delimiter}DeepSystems Research Lab{tuple_delimiter}Organization{tuple_delimiter}The DeepSystems Research Lab is the institution where the co-authors of the publication are affiliated. 
+entity{tuple_delimiter}Evolutionary Search{tuple_delimiter}Method{tuple_delimiter}Evolutionary search is a class of neural architecture search algorithms that outperformed gradient-based methods in the NASBench-360 evaluation. +entity{tuple_delimiter}Gradient-Based Search{tuple_delimiter}Method{tuple_delimiter}Gradient-based search is a class of neural architecture search algorithms that was benchmarked against evolutionary search in NASBench-360. +entity{tuple_delimiter}GPU-Hours{tuple_delimiter}Data{tuple_delimiter}GPU-hours is a metric used in NASBench-360 to measure the computational cost of neural architecture search algorithms. +entity{tuple_delimiter}Neural Architecture Search{tuple_delimiter}Concept{tuple_delimiter}Neural architecture search is the automated process of designing optimal neural network architectures, the central topic of the publication. +relation{tuple_delimiter}Dr. Priya Nair{tuple_delimiter}Advances in Neural Architecture Search{tuple_delimiter}authorship{tuple_delimiter}Dr. Priya Nair co-authored the publication. +relation{tuple_delimiter}Dr. Luca Ferretti{tuple_delimiter}Advances in Neural Architecture Search{tuple_delimiter}authorship{tuple_delimiter}Dr. Luca Ferretti co-authored the publication. +relation{tuple_delimiter}Advances in Neural Architecture Search{tuple_delimiter}NASBench-360{tuple_delimiter}introduces, benchmarking{tuple_delimiter}The publication introduced the NASBench-360 framework. +relation{tuple_delimiter}Evolutionary Search{tuple_delimiter}Gradient-Based Search{tuple_delimiter}performance comparison{tuple_delimiter}Evolutionary search outperformed gradient-based methods by 12% on accuracy and used 30% fewer GPU-hours on vision tasks. +relation{tuple_delimiter}NASBench-360{tuple_delimiter}GPU-Hours{tuple_delimiter}evaluation metric{tuple_delimiter}NASBench-360 uses GPU-hours as one of three key metrics to measure computational cost. 
{completion_delimiter} """, - """ -["Person","Creature","Organization","Location","Event","Concept","Method","Content","Data","Artifact","NaturalObject"] +] + +############################################################################### +# JSON Structured Output Prompts for Entity Extraction +# Used when entity_extraction_use_json is enabled for higher extraction quality +############################################################################### - +PROMPTS["entity_extraction_json_system_prompt"] = """---Role--- +You are a Knowledge Graph Specialist responsible for extracting entities and relationships from the `---Input Text---` section of the user prompt. + +---Instructions--- +1. **Entity Extraction:** + - **Identification:** Identify clearly defined and meaningful entities in the `---Input Text---` section of the user prompt. + - **Entity Details:** For each identified entity, extract the following information: + - `name`: The name of the entity. If the entity name is case-insensitive, capitalize the first letter of each significant word (title case). Ensure **consistent naming** across the entire extraction process. + - `type`: Categorize the entity using the type guidance provided in the `---Entity Types---` section below. If none of the provided entity types apply, classify it as `Other`. + - `description`: Provide a concise yet comprehensive description of the entity's attributes and activities, based *solely* on the information present in the input text. + +2. **Relationship Extraction:** + - **Identification:** Identify direct, clearly stated, and meaningful relationships between previously extracted entities. + - **N-ary Relationship Decomposition:** If a single statement describes a relationship involving more than two entities (an N-ary relationship), decompose it into multiple binary (two-entity) relationship pairs for separate description. 
+ - Example: For "Alice, Bob, and Carol collaborated on Project X," extract binary relationships such as "Alice collaborated with Project X," "Bob collaborated with Project X," and "Carol collaborated with Project X," or "Alice collaborated with Bob," based on the most reasonable binary interpretations. + - **Relationship Details:** For each binary relationship, extract the following fields: + - `source`: The name of the source entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive. + - `target`: The name of the target entity. Ensure **consistent naming** with entity extraction. Capitalize the first letter of each significant word (title case) if the name is case-insensitive. + - `keywords`: One or more high-level keywords summarizing the overarching nature, concepts, or themes of the relationship, separated by commas. + - `description`: A concise explanation of the nature of the relationship between the source and target entities, providing a clear rationale for their connection. + +3. **Relationship Direction & Duplication:** + - Treat all relationships as **undirected** unless explicitly stated otherwise. Swapping the source and target entities for an undirected relationship does not constitute a new relationship. + - Avoid outputting duplicate relationships. + +4. **Output Limits & Prioritization:** + - Output at most {max_total_records} total records across `entities` and `relationships` in this response. + - Output at most {max_entity_records} entity objects in this response. + - Output fewer records if fewer high-value items are present. Do not try to fill the limit. + - Only output relationship objects whose `source` and `target` are both included in the selected `entities` list for this response. + - Within the list of relationships, prioritize and output those relationships that are **most significant** to the core meaning of the input text first. + +5. 
**Context & Objectivity:** + - Ensure all entity names and descriptions are written in the **third person**. + - Explicitly name the subject or object; **avoid using pronouns** such as `this article`, `this paper`, `our company`, `I`, `you`, and `he/she`. + +6. **Language & Proper Nouns:** + - The entire output (entity names, keywords, and descriptions) must be written in `{language}`. + - Proper nouns (e.g., personal names, place names, organization names) should be retained in their original language if a proper, widely accepted translation is not available or would cause ambiguity. + +7. **JSON Contract:** + - Return one valid JSON object with `entities` and `relationships` arrays only. + - If the record limit is reached, stop adding new objects immediately and return the JSON object with the allowed items only. + +---Entity Types--- +{entity_types_guidance} + +---Examples--- +{examples} +""" + +PROMPTS["entity_extraction_json_user_prompt"] = """---Task--- +Extract entities and relationships from the `---Input Text---` section below. + +---Instructions--- +1. **Strict Adherence to JSON Format:** Your output MUST be a valid JSON object with `entities` and `relationships` arrays. Do not include any introductory or concluding remarks, explanations, markdown code fences, or any other text before or after the JSON. +2. **Quantity Limits:** In this response, output at most {max_total_records} total records and at most {max_entity_records} entity objects. Output fewer records if fewer high-value items are present. Only output relationship objects whose `source` and `target` are both included in this response. +3. **Output Language:** Ensure the output language is {language}. Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated. 
+ +---Entity Types--- +{entity_types_guidance} + +---Input Text--- ``` -At the World Athletics Championship in Tokyo, Noah Carter broke the 100m sprint record using cutting-edge carbon-fiber spikes. +{input_text} ``` - -entity{tuple_delimiter}World Athletics Championship{tuple_delimiter}event{tuple_delimiter}The World Athletics Championship is a global sports competition featuring top athletes in track and field. -entity{tuple_delimiter}Tokyo{tuple_delimiter}location{tuple_delimiter}Tokyo is the host city of the World Athletics Championship. -entity{tuple_delimiter}Noah Carter{tuple_delimiter}person{tuple_delimiter}Noah Carter is a sprinter who set a new record in the 100m sprint at the World Athletics Championship. -entity{tuple_delimiter}100m Sprint Record{tuple_delimiter}category{tuple_delimiter}The 100m sprint record is a benchmark in athletics, recently broken by Noah Carter. -entity{tuple_delimiter}Carbon-Fiber Spikes{tuple_delimiter}equipment{tuple_delimiter}Carbon-fiber spikes are advanced sprinting shoes that provide enhanced speed and traction. -entity{tuple_delimiter}World Athletics Federation{tuple_delimiter}organization{tuple_delimiter}The World Athletics Federation is the governing body overseeing the World Athletics Championship and record validations. -relation{tuple_delimiter}World Athletics Championship{tuple_delimiter}Tokyo{tuple_delimiter}event location, international competition{tuple_delimiter}The World Athletics Championship is being hosted in Tokyo. -relation{tuple_delimiter}Noah Carter{tuple_delimiter}100m Sprint Record{tuple_delimiter}athlete achievement, record-breaking{tuple_delimiter}Noah Carter set a new 100m sprint record at the championship. -relation{tuple_delimiter}Noah Carter{tuple_delimiter}Carbon-Fiber Spikes{tuple_delimiter}athletic equipment, performance boost{tuple_delimiter}Noah Carter used carbon-fiber spikes to enhance performance during the race. 
-relation{tuple_delimiter}Noah Carter{tuple_delimiter}World Athletics Championship{tuple_delimiter}athlete participation, competition{tuple_delimiter}Noah Carter is competing at the World Athletics Championship. -{completion_delimiter} +---Output--- +""" + +PROMPTS["entity_continue_extraction_json_user_prompt"] = """---Task--- +Based on the last extraction task, identify and extract any **missed or incorrectly described** entities and relationships from the `---Input Text---` section. + +---Instructions--- +1. **Focus on Corrections/Additions:** + - **Do NOT** re-output entities and relationships that were **correctly and fully** extracted in the last task. + - If an entity or relationship was **missed** in the last task, extract and output it now. + - If an entity or relationship was **incorrectly described** in the last task, re-output the *corrected and complete* version. +2. **Strict Adherence to JSON Format:** Your output MUST be a valid JSON object with `entities` and `relationships` arrays. Do not include any introductory or concluding remarks, explanations, markdown code fences, or any other text before or after the JSON. +3. **Quantity Limits:** In this response, output at most {max_total_records} total records and at most {max_entity_records} entity objects. Output fewer records if fewer high-value corrections or additions remain. A relationship object may reference entities already extracted correctly in the previous response. Do not repeat those entity objects unless they were missing or need correction. +4. **Output Language:** Ensure the output language is {language}. Proper nouns (e.g., personal names, place names, organization names) must be kept in their original language and not translated. +5. 
**If nothing was missed or needs correction**, output: `{{"entities": [], "relationships": []}}` + +---Output--- +""" + +PROMPTS["entity_extraction_json_examples"] = [ + """---Entity Types--- +- Person: Human individuals, real or fictional +- Artifact: Physical or digital objects created by humans (tools, software, devices) +- Concept: Abstract ideas, theories, principles, beliefs + +---Input Text--- +``` +while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order. + +Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. "If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us." + +The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce. + +It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. 
They had all been brought here by different paths +``` + +---Output--- +{ + "entities": [ + {"name": "Alex", "type": "Person", "description": "Alex is a character who experiences frustration and is observant of the dynamics among other characters."}, + {"name": "Taylor", "type": "Person", "description": "Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."}, + {"name": "Jordan", "type": "Person", "description": "Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."}, + {"name": "Cruz", "type": "Person", "description": "Cruz is associated with a vision of control and order, influencing the dynamics among other characters."}, + {"name": "The Device", "type": "Artifact", "description": "The Device is central to the story, with potential game-changing implications, and is revered by Taylor."}, + {"name": "Discovery", "type": "Concept", "description": "Discovery represents the shared intellectual pursuit that unites Jordan and Alex in opposition to Cruz's controlling worldview."} + ], + "relationships": [ + {"source": "Alex", "target": "Taylor", "keywords": "power dynamics, observation", "description": "Alex observes Taylor's authoritarian behavior and notes changes in Taylor's attitude toward the device."}, + {"source": "Alex", "target": "Jordan", "keywords": "shared goals, rebellion", "description": "Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."}, + {"source": "Taylor", "target": "Jordan", "keywords": "conflict resolution, mutual respect", "description": "Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."}, + {"source": "Jordan", "target": "Cruz", "keywords": "ideological conflict, rebellion", "description": "Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."}, + {"source": "Taylor", 
"target": "The Device", "keywords": "reverence, technological significance", "description": "Taylor shows reverence towards the device, indicating its importance and potential impact."} + ] +} + +""", + """---Entity Types--- +- Person: Human individuals, real or fictional +- Location: Geographic places (cities, countries, buildings, regions) +- Creature: Non-human living beings (animals, mythical beings, etc.) +- Method: Procedures, techniques, algorithms, workflows +- Organization: Companies, institutions, government bodies, groups +- Content: Creative or informational works (books, articles, films, reports) +- NaturalObject: Natural non-living objects (minerals, celestial bodies, chemical compounds) + +---Input Text--- +``` +Dr. Elena Vasquez led a field expedition to the Borneo rainforest to document the population decline of the Bornean orangutan. Using transect sampling — a method where researchers walk predetermined line paths and record every animal sighting within a fixed distance — her team estimated that fewer than 1,500 individuals remained in the surveyed region. + +The expedition was funded by the Global Wildlife Conservation Institute and produced a landmark report titled "Primate Decline in Insular Southeast Asia." Vasquez attributed the collapse primarily to peat-soil destruction caused by palm oil plantation expansion, which had converted over 40% of the surveyed forest area within a decade. +``` + +---Output--- +{ + "entities": [ + {"name": "Dr. Elena Vasquez", "type": "Person", "description": "Dr. 
Elena Vasquez is a field researcher who led an expedition to document orangutan population decline in Borneo."}, + {"name": "Borneo Rainforest", "type": "Location", "description": "The Borneo rainforest is the field site of the expedition and the primary habitat of the Bornean orangutan."}, + {"name": "Bornean Orangutan", "type": "Creature", "description": "The Bornean orangutan is a primate species whose population was found to have declined to fewer than 1,500 individuals in the surveyed region."}, + {"name": "Transect Sampling", "type": "Method", "description": "Transect sampling is a wildlife survey technique where researchers walk predetermined paths and record animal sightings within a fixed lateral distance."}, + {"name": "Global Wildlife Conservation Institute", "type": "Organization", "description": "The Global Wildlife Conservation Institute funded the expedition led by Dr. Vasquez."}, + {"name": "Primate Decline in Insular Southeast Asia", "type": "Content", "description": "A landmark research report produced by Vasquez's expedition documenting primate population decline in the region."}, + {"name": "Peat Soil", "type": "NaturalObject", "description": "Peat soil is a natural substrate in the Borneo rainforest that has been destroyed by palm oil plantation expansion."} + ], + "relationships": [ + {"source": "Dr. Elena Vasquez", "target": "Bornean Orangutan", "keywords": "field research, population survey", "description": "Dr. Vasquez led the expedition that documented the population decline of the Bornean orangutan."}, + {"source": "Dr. Elena Vasquez", "target": "Transect Sampling", "keywords": "methodology, research application", "description": "Dr. Vasquez's team used transect sampling to estimate the orangutan population."}, + {"source": "Global Wildlife Conservation Institute", "target": "Dr. Elena Vasquez", "keywords": "funding, research support", "description": "The institute funded the expedition led by Dr. Vasquez."}, + {"source": "Dr. 
Elena Vasquez", "target": "Primate Decline in Insular Southeast Asia", "keywords": "authorship, research output", "description": "Dr. Vasquez's expedition produced the landmark report on primate decline."}, + {"source": "Peat Soil", "target": "Borneo Rainforest", "keywords": "habitat composition, ecological destruction", "description": "Peat soil destruction in the Borneo rainforest was caused by palm oil plantation expansion and is a primary driver of orangutan decline."} + ] +} + +""", + """---Entity Types--- +- Content: Creative or informational works (books, articles, films, reports) +- Artifact: Physical or digital objects created by humans (tools, software, devices) +- Person: Human individuals, real or fictional +- Organization: Companies, institutions, government bodies, groups +- Method: Procedures, techniques, algorithms, workflows +- Data: Quantitative or structured information (statistics, datasets, measurements) +- Concept: Abstract ideas, theories, principles, beliefs + +---Input Text--- +``` +The 2023 edition of "Advances in Neural Architecture Search" synthesized findings from over 200 peer-reviewed papers and introduced a new benchmarking framework called NASBench-360, designed to evaluate search algorithms across diverse task domains. The publication was co-authored by Dr. Priya Nair and Dr. Luca Ferretti of the DeepSystems Research Lab. + +NASBench-360 measures three key metrics: search efficiency (time-to-solution), model accuracy on held-out test sets, and computational cost in GPU-hours. Early results showed that evolutionary search algorithms outperformed gradient-based methods by 12% on accuracy while consuming 30% fewer GPU-hours on vision tasks. 
+``` + +---Output--- +{ + "entities": [ + {"name": "Advances in Neural Architecture Search", "type": "Content", "description": "A 2023 publication that synthesizes findings from over 200 papers and introduces the NASBench-360 benchmarking framework."}, + {"name": "NASBench-360", "type": "Artifact", "description": "NASBench-360 is a benchmarking framework introduced to evaluate neural architecture search algorithms across diverse task domains."}, + {"name": "Dr. Priya Nair", "type": "Person", "description": "Dr. Priya Nair is a co-author of the publication and a researcher at the DeepSystems Research Lab."}, + {"name": "Dr. Luca Ferretti", "type": "Person", "description": "Dr. Luca Ferretti is a co-author of the publication and a researcher at the DeepSystems Research Lab."}, + {"name": "DeepSystems Research Lab", "type": "Organization", "description": "The DeepSystems Research Lab is the institution where the co-authors of the publication are affiliated."}, + {"name": "Evolutionary Search", "type": "Method", "description": "Evolutionary search is a class of neural architecture search algorithms that outperformed gradient-based methods in the NASBench-360 evaluation."}, + {"name": "Gradient-Based Search", "type": "Method", "description": "Gradient-based search is a class of neural architecture search algorithms that was benchmarked against evolutionary search in NASBench-360."}, + {"name": "GPU-Hours", "type": "Data", "description": "GPU-hours is a metric used in NASBench-360 to measure the computational cost of neural architecture search algorithms."}, + {"name": "Neural Architecture Search", "type": "Concept", "description": "Neural architecture search is the automated process of designing optimal neural network architectures, the central topic of the publication."} + ], + "relationships": [ + {"source": "Dr. Priya Nair", "target": "Advances in Neural Architecture Search", "keywords": "authorship", "description": "Dr. 
Priya Nair co-authored the publication."}, + {"source": "Dr. Luca Ferretti", "target": "Advances in Neural Architecture Search", "keywords": "authorship", "description": "Dr. Luca Ferretti co-authored the publication."}, + {"source": "Advances in Neural Architecture Search", "target": "NASBench-360", "keywords": "introduces, benchmarking", "description": "The publication introduced the NASBench-360 framework."}, + {"source": "Evolutionary Search", "target": "Gradient-Based Search", "keywords": "performance comparison", "description": "Evolutionary search outperformed gradient-based methods by 12% on accuracy and used 30% fewer GPU-hours on vision tasks."}, + {"source": "NASBench-360", "target": "GPU-Hours", "keywords": "evaluation metric", "description": "NASBench-360 uses GPU-hours as one of three key metrics to measure computational cost."} + ] +} """, ] @@ -380,11 +634,17 @@ 2. **low_level_keywords**: for specific entities or details, identifying the specific entities, proper nouns, technical jargon, product names, or concrete items. ---Instructions & Constraints--- -1. **Output Format**: Your output MUST be a valid JSON object and nothing else. Do not include any explanatory text, markdown code fences (like ```json), or any other text before or after the JSON. It will be parsed directly by a JSON parser. -2. **Source of Truth**: All keywords must be explicitly derived from the user query, with both high-level and low-level keyword categories are required to contain content. -3. **Concise & Meaningful**: Keywords should be concise words or meaningful phrases. Prioritize multi-word phrases when they represent a single concept. For example, from "latest financial report of Apple Inc.", you should extract "latest financial report" and "Apple Inc." rather than "latest", "financial", "report", and "Apple". -4. 
**Handle Edge Cases**: For queries that are too simple, vague, or nonsensical (e.g., "hello", "ok", "asdfghjkl"), you must return a JSON object with empty lists for both keyword types. -5. **Language**: All extracted keywords MUST be in {language}. Proper nouns (e.g., personal names, place names, organization names) should be kept in their original language. +1. **Output Format**: Your output MUST be a valid JSON object and nothing else. Do not include any explanatory text, markdown code fences (like ```json), comments, or any other text before or after the JSON. +2. **Exact JSON Shape**: The JSON object must contain exactly these two keys: + - `"high_level_keywords"`: an array of strings + - `"low_level_keywords"`: an array of strings +3. **JSON Boundary**: The first character of your response must be `{{` and the last character must be `}}`. +4. **Source of Truth**: All keywords must be explicitly derived from the user query. Do not infer unsupported facts. Do not invent entities, products, organizations, dates, or technical terms that are not grounded in the query. +5. **Concise & Meaningful**: Keywords should be concise words or meaningful phrases. Prioritize multi-word phrases when they represent a single concept. For example, from "latest financial report of Apple Inc.", extract "latest financial report" and "Apple Inc." rather than "latest", "financial", "report", and "Apple". +6. **Handle Edge Cases**: For queries that are too simple, vague, or nonsensical (e.g., "hello", "ok", "asdfghjkl"), return: + `{{"high_level_keywords": [], "low_level_keywords": []}}` +7. **No Duplicates**: Do not repeat the same keyword within a list. Keep the lists short and high-signal. +8. **Language**: All extracted keywords MUST be in {language}. Proper nouns (e.g., personal names, place names, organization names) should be kept in their original language. 
---Examples--- {examples} @@ -428,5 +688,249 @@ "low_level_keywords": ["School access", "Literacy rates", "Job training", "Income inequality"] } -""", + """, ] + + +class EntityExtractionPromptProfile(TypedDict): + entity_types_guidance: str + entity_extraction_examples: list[str] + entity_extraction_json_examples: list[str] + + +def get_default_entity_extraction_prompt_profile() -> EntityExtractionPromptProfile: + """Return a copy of the built-in entity extraction prompt profile.""" + + return { + "entity_types_guidance": PROMPTS["default_entity_types_guidance"].rstrip(), + "entity_extraction_examples": [ + example.rstrip() for example in PROMPTS["entity_extraction_examples"] + ], + "entity_extraction_json_examples": [ + example.rstrip() for example in PROMPTS["entity_extraction_json_examples"] + ], + } + + +_ALLOWED_PROMPT_SUFFIXES = frozenset({".yml", ".yaml"}) +_DEFAULT_PROMPT_DIR = "./prompts" +_ENTITY_TYPE_SUBDIR = "entity_type" + + +def get_entity_type_prompt_dir() -> Path: + """Return the directory for entity type prompt profiles. + + Resolves ``PROMPT_DIR`` (defaults to ``./prompts`` relative to the current + working directory, mirroring ``INPUT_DIR`` / ``WORKING_DIR``) and appends + the hard-coded ``entity_type`` subdirectory. Profile files are provided by + the user at runtime and are not shipped with the distribution. The + file-name sandbox in :func:`resolve_entity_type_prompt_path` ensures + user-supplied file names cannot escape the resolved directory. + """ + + configured = os.getenv("PROMPT_DIR", "").strip() or _DEFAULT_PROMPT_DIR + return (Path(configured).expanduser() / _ENTITY_TYPE_SUBDIR).resolve() + + +def resolve_entity_type_prompt_path(prompt_file_name: str | Path) -> Path: + """Resolve an allowlisted prompt profile file name to an absolute path.""" + + file_name = str(prompt_file_name).strip() + if not file_name: + raise ValueError( + "ENTITY_TYPE_PROMPT_FILE must be a file name such as " + "'entity_type_prompt.sample.yml'." 
+ ) + if "\\" in file_name: + raise ValueError( + "ENTITY_TYPE_PROMPT_FILE must not contain directory separators. " + "Only file names inside PROMPT_DIR/entity_type are allowed." + ) + + candidate = Path(file_name) + if ( + candidate.is_absolute() + or candidate.name != file_name + or ".." in candidate.parts + ): + raise ValueError( + "ENTITY_TYPE_PROMPT_FILE must be a file name only. " + "Files are loaded from PROMPT_DIR/entity_type " + "(PROMPT_DIR defaults to ./prompts)." + ) + if candidate.suffix.lower() not in _ALLOWED_PROMPT_SUFFIXES: + raise ValueError( + "ENTITY_TYPE_PROMPT_FILE must use a '.yml' or '.yaml' extension." + ) + + return get_entity_type_prompt_dir() / candidate.name + + +def _normalize_prompt_examples( + value: Any, field_name: str, profile_path: Path +) -> list[str]: + if not isinstance(value, list): + raise ValueError( + f"ENTITY_TYPE_PROMPT_FILE '{profile_path}' field '{field_name}' " + "must be a list of strings." + ) + normalized: list[str] = [] + for index, item in enumerate(value): + if not isinstance(item, str) or not item.strip(): + raise ValueError( + f"ENTITY_TYPE_PROMPT_FILE '{profile_path}' field '{field_name}' " + f"item {index} must be a non-empty string." + ) + normalized.append(item.rstrip()) + return normalized + + +def load_entity_extraction_prompt_profile( + prompt_file: str | Path, +) -> dict[str, Any]: + """Load and validate an entity extraction prompt profile from YAML.""" + + profile_path = Path(prompt_file) + if not profile_path.exists(): + raise FileNotFoundError( + f"ENTITY_TYPE_PROMPT_FILE '{profile_path}' does not exist." + ) + if not profile_path.is_file(): + raise ValueError( + f"ENTITY_TYPE_PROMPT_FILE '{profile_path}' must point to a file." 
+ ) + + try: + content = profile_path.read_text(encoding="utf-8") + except OSError as exc: + raise OSError( + f"Failed to read ENTITY_TYPE_PROMPT_FILE '{profile_path}': {exc}" + ) from exc + + try: + raw_profile = yaml.safe_load(content) + except yaml.YAMLError as exc: + raise ValueError( + f"ENTITY_TYPE_PROMPT_FILE '{profile_path}' contains invalid YAML: {exc}" + ) from exc + + if raw_profile is None: + raw_profile = {} + if not isinstance(raw_profile, dict): + raise ValueError( + f"ENTITY_TYPE_PROMPT_FILE '{profile_path}' must contain a YAML mapping." + ) + + profile: dict[str, Any] = {} + + guidance = raw_profile.get("entity_types_guidance") + if guidance is not None: + if not isinstance(guidance, str) or not guidance.strip(): + raise ValueError( + f"ENTITY_TYPE_PROMPT_FILE '{profile_path}' field " + "'entity_types_guidance' must be a non-empty string." + ) + profile["entity_types_guidance"] = guidance.rstrip() + + for field_name in ( + "entity_extraction_examples", + "entity_extraction_json_examples", + ): + if field_name in raw_profile: + profile[field_name] = _normalize_prompt_examples( + raw_profile[field_name], field_name, profile_path + ) + + return profile + + +def resolve_entity_extraction_prompt_profile( + addon_params: Mapping[str, Any] | None, + use_json: bool, +) -> EntityExtractionPromptProfile: + """Resolve and merge the configured entity extraction prompt profile.""" + + default_profile = get_default_entity_extraction_prompt_profile() + addon_params = addon_params or {} + prompt_file = addon_params.get("entity_type_prompt_file") + + file_profile: dict[str, Any] = {} + if prompt_file: + prompt_path = resolve_entity_type_prompt_path(prompt_file) + file_profile = load_entity_extraction_prompt_profile(prompt_path) + required_examples_key = ( + "entity_extraction_json_examples" + if use_json + else "entity_extraction_examples" + ) + if required_examples_key not in file_profile: + mode_name = "json" if use_json else "text" + raise ValueError( + 
f"ENTITY_TYPE_PROMPT_FILE '{prompt_file}' must define " + f"'{required_examples_key}' when entity extraction runs in " + f"{mode_name} mode." + ) + + guidance = addon_params.get("entity_types_guidance") + if guidance is None: + guidance = file_profile.get( + "entity_types_guidance", default_profile["entity_types_guidance"] + ) + elif not isinstance(guidance, str) or not guidance.strip(): + raise ValueError( + "addon_params['entity_types_guidance'] must be a non-empty string." + ) + + return { + "entity_types_guidance": guidance, + "entity_extraction_examples": list( + file_profile.get( + "entity_extraction_examples", + default_profile["entity_extraction_examples"], + ) + ), + "entity_extraction_json_examples": list( + file_profile.get( + "entity_extraction_json_examples", + default_profile["entity_extraction_json_examples"], + ) + ), + } + + +def validate_entity_extraction_prompt_profile_for_mode( + prompt_profile: Mapping[str, Any], + use_json: bool, + prompt_file_name: str | None = None, +) -> EntityExtractionPromptProfile: + """Validate that the resolved profile contains the active-mode examples.""" + + required_examples_key = ( + "entity_extraction_json_examples" if use_json else "entity_extraction_examples" + ) + if ( + required_examples_key not in prompt_profile + or not prompt_profile[required_examples_key] + ): + mode_name = "json" if use_json else "text" + source = ( + f"ENTITY_TYPE_PROMPT_FILE '{prompt_file_name}'" + if prompt_file_name + else "the resolved prompt profile" + ) + raise ValueError( + f"{source} must define '{required_examples_key}' when entity extraction " + f"runs in {mode_name} mode." 
+ ) + + return { + "entity_types_guidance": str(prompt_profile["entity_types_guidance"]).rstrip(), + "entity_extraction_examples": [ + str(example).rstrip() + for example in prompt_profile["entity_extraction_examples"] + ], + "entity_extraction_json_examples": [ + str(example).rstrip() + for example in prompt_profile["entity_extraction_json_examples"] + ], + } diff --git a/lightrag/prompt_multimodal.py b/lightrag/prompt_multimodal.py new file mode 100644 index 0000000000..41c65b3f3b --- /dev/null +++ b/lightrag/prompt_multimodal.py @@ -0,0 +1,322 @@ +"""Multimodal analysis prompts for LightRAG. + +These templates are consumed by ``LightRAG.analyze_multimodal`` to produce +modality-specific analysis JSON written into each sidecar item's +``llm_analyze_result``. + +Each template accepts the same variable set so the caller can format them +uniformly: + +- ``language`` : target language for ``name`` / ``description`` outputs. +- ``content`` : modality body (table JSON/HTML, equation LaTeX, etc.). + Images pass an empty string and rely on ``image_inputs``. +- ``captions`` : caption text or ``"n/a"``. +- ``footnotes`` : joined footnotes string or ``"n/a"``. +- ``leading`` : surrounding leading context or ``"n/a"``. +- ``trailing`` : surrounding trailing context or ``"n/a"``. +- ``item_id`` : sidecar item identifier (for diagnostics, not required by + every template). +- ``file_path`` : source document path (diagnostics only). + +The output schema differs by modality: + +- Image : ``{"name": str, "type": str, "description": str}`` +- Table : ``{"name": str, "description": str}`` +- Equation : ``{"name": str, "equation": str, "description": str}`` + +Image ``type`` is restricted to :data:`IMAGE_TYPE_ENUM`; values outside the +enum are folded into :data:`IMAGE_TYPE_FALLBACK` by the caller. +""" + +from __future__ import annotations + + +IMAGE_TYPE_ENUM: tuple[str, ...] 
= ( + "Photo", + "Illustration", + "Screenshot", + "Icon", + "Chart", + "Table", + "Infographic", + "Flowchart", + "Chat Log", + "Wireframe", + "Texture", + "Other", +) + +IMAGE_TYPE_FALLBACK = "Other" + + +MULTIMODAL_PROMPTS: dict[str, str] = {} + + +MULTIMODAL_PROMPTS[ + "image_analysis" +] = """You are an expert image analyzer. Analyze the provided image and return a single JSON object describing its content. + +================ INSTRUCTIONS ================ + +1. CONTENT RECOGNITION + Examine the image carefully and identify: + - The primary subject(s), scene, or composition. + - Salient visual elements (objects, people, text overlays, diagrams, charts, screenshots, etc.). + - Spatial layout when meaningful (e.g. left/right, foreground/background, panels of a figure). + - Any visible text — quote it verbatim when short; summarize when long. + - Color, style, or visual cues only when they materially aid interpretation. + +2. USE OF ADDITIONAL CONTEXT + The Additional Context section provides surrounding information that may help disambiguate the image's role in its source document: + - Captions : caption attached to the image ("n/a" = none) + - Footnotes : footnote attached to the image ("n/a" = none) + - Leading Text : text appearing immediately BEFORE the image ("n/a" = none) + - Trailing Text : text appearing immediately AFTER the image ("n/a" = none) + + Rules: + - Use context to disambiguate abbreviations, units, named entities, and the image's purpose. + - The IMAGE ITSELF takes priority when it conflicts with context — describe what is visible. + - Only mention a relationship between the image and Leading/Trailing Text if it is clearly supported. If uncertain, omit it. + - Captions, footnotes, leading text and trailing text must NOT be used to invent visual content not present in the image. + +3. NAMING (`name`) + - Produce a concise, distinctive name (3–8 words, snake_case preferred). + - It should convey what the image depicts, not just "image". 
+ - Good examples: `crispr_cas9_workflow_diagram`, `q4_revenue_bar_chart`, `paris_eiffel_tower_photo`. + - Bad examples: `image`, `figure`, `picture_1`. + +4. TYPE (`type`) + - Pick exactly one value from this fixed list (verbatim, case-sensitive): + Photo, Illustration, Screenshot, Icon, Chart, Table, Infographic, Flowchart, Chat Log, Wireframe, Texture, Other + - Choose the single best fit. Use `Other` when no listed type clearly applies. + +5. DESCRIPTION (`description`, ≤ 500 words, natural prose — not bullets) + Cover the following where applicable: + - What the image depicts overall and what question/claim it visually supports. + - The primary subject(s), their attributes, and any meaningful relationships between them. + - Quantitative findings if the image is a chart/diagram (cite specific values when visible). + - Visible text content that carries meaning (labels, annotations, axis titles). + - Use specific proper nouns rather than pronouns whenever possible. + - If the image clearly supports the surrounding context (leading or trailing text), briefly note that relationship at the end. Otherwise omit. + +6. OUTPUT RULES + - Return ONE valid JSON object only. + - No surrounding markdown, no code fences, no preamble, no explanation. + - All string values must be properly escaped JSON strings (escape `"` as `\\"`, newlines as `\\n`). + - The output values for the JSON fields `name` and `description` must be written in `{language}`. + +================ ADDITIONAL CONTEXT ================ +- Captions: {captions} + +- Footnotes: {footnotes} + +- Leading Text: +``` +{leading} +``` + +- Trailing Text: +``` +{trailing} +``` + +================ OUTPUT FORMAT ================ +{{ + "name": "", + "type": "", + "description": "" +}} + +Output: +""" + + +MULTIMODAL_PROMPTS[ + "table_analysis" +] = """You are an expert table analyzer. The provided content contains table content in JSON or HTML format. 
Analyze it and return a single JSON object describing its structure and content. + +================ INSTRUCTIONS ================ + +1. CONTENT RECOGNITION + Read the table carefully and identify: + - Overall structure: number of rows and columns, presence of merged cells, multi-level headers, row groupings, or totals/subtotals rows. + - Column headers and (if present) row headers — capture their exact wording. + - Units of measurement (%, $, ms, kg, etc.) and any scale indicators ("in millions", "×1000"). + - Key data points: maxima, minima, outliers, notable values, totals. + - Patterns and trends across rows or columns (growth, decline, correlation, ranking). + - Empty cells, "—", "N/A", or other null markers — preserve them as-is, do NOT fabricate values. + - Footnote markers inside cells (e.g. "*", "†", "[1]") and what they refer to. + +2. USE OF ADDITIONAL CONTEXT + The Additional Context section provides surrounding information to help you understand the table's role in its source document: + - Captions : the table's caption ("n/a" = none) + - Footnotes : footnote attached to the table ("n/a" = none) + - Leading Text : text appearing immediately BEFORE the table ("n/a" = none) + - Trailing Text : text appearing immediately AFTER the table ("n/a" = none) + + Rules: + - Use context to disambiguate column meanings, units, abbreviations, and entity names. + - TABLE CONTENT TAKES PRIORITY over context when they conflict. Describe what you actually see; note the discrepancy only if it is material. + - Only mention a relationship between the table and Leading/Trailing Text if it is clearly supported. If uncertain, omit it. + - Captions, footnotes, leading text and trailing text may only be used for disambiguation purposes and must not be used to infer or fabricate content not present in TABLE CONTENT. + - NEVER invent rows, columns, values, units, or entities that are not visible. + +3. 
NAMING (`name`) + - Produce a concise, distinctive name (3–8 words, snake_case preferred). + - It should convey what the table is about, not just "table". + - Good examples: `q4_2024_revenue_by_region`, `model_benchmark_accuracy_latency`, `patient_demographics_baseline`. + - Bad examples: `table`, `data_table`, `results`. + +4. DESCRIPTION (`description`, ≤ 500 words, natural prose — not bullets) + Cover the following where applicable: + - What the table is about and what question it answers. + - What the rows represent and what the columns represent (the "shape" of the data). + - Units, time range, and scope of the data. + - The most important patterns, trends, comparisons, or outliers — cite specific values from the table to support each observation (e.g. "revenue grew from $1.2M in Q1 to $3.8M in Q4"). + - Any totals, subtotals, averages, or computed columns and what they reveal. + - Use specific proper nouns (entity names, column names) instead of pronouns. + - If the table clearly illustrates or supports the surrounding context (leading or trailing text), briefly note that relationship at the end. Otherwise omit. + - Do not restate the table cell by cell or row by row; focus on interpretation. + +5. OUTPUT RULES + - Return ONE valid JSON object only. + - No surrounding markdown, no code fences, no preamble, no explanation. + - All string values must be properly escaped JSON strings (escape `"` as `\\"`, newlines as `\\n`). + - The output values for the JSON fields `name` and `description` must be written in `{language}`. 
+ +================ TABLE CONTENT ================ +``` +{content} +``` + +================ ADDITIONAL CONTEXT ================ +- Captions: {captions} + +- Footnotes: {footnotes} + +- Leading Text: +``` +{leading} +``` + +- Trailing Text: +``` +{trailing} +``` + +================ OUTPUT FORMAT ================ +{{ + "name": "", + "description": "" +}} + +Output: +""" + + +MULTIMODAL_PROMPTS[ + "equation_analysis" +] = """You are an expert analyzer of mathematical and chemical equations. The input is a TEXT-form equation written in LaTeX or Markdown. Analyze it and return a single JSON object describing its meaning and role. + +================ INSTRUCTIONS ================ + +1. CONTENT RECOGNITION + Read the equation carefully and identify: + - The type of expression: definition, identity, equation to solve, inequality, differential / integral equation, recurrence, chemical reaction, balance equation, etc. + - The mathematical or chemical meaning of the expression as a whole. + - The variables, constants, operators, and functions that appear, and what each likely denotes given the surrounding context. + - The application domain (e.g. classical mechanics, probability, thermodynamics, organic chemistry, machine learning loss function) inferred from context. + - Any physical, statistical, or theoretical significance. + - Whether the expression matches a well-known named formula (e.g. Bayes' theorem, Schrödinger equation, softmax, Michaelis–Menten). Name it explicitly when you are confident; do NOT guess. + +2. 
USE OF ADDITIONAL CONTEXT + The Additional Context section provides surrounding information to help you understand the equation's role in its source document: + - Captions : the equation's caption or label ("n/a" = none) + - Footnotes : footnote attached to the equation ("n/a" = none) + - Leading Text : text appearing immediately BEFORE the equation ("n/a" = none) + - Trailing Text : text appearing immediately AFTER the equation ("n/a" = none) + + Rules: + - Use context to determine variable meanings, units, and the domain of discussion. + - THE EQUATION ITSELF TAKES PRIORITY over context if they conflict; note the discrepancy if material. + - Only mention a relationship between the equation and Leading/Trailing Text if it is clearly supported. If uncertain, omit it. + - Captions, footnotes, leading text and trailing text may only be used for disambiguation purposes and must not be used to infer or fabricate content not present in EQUATION BODY. + - NEVER invent variables, terms, or interpretations that are not justified by either the equation or the context. + +3. NAMING (`name`) + - Produce a concise, distinctive name (3–8 words, snake_case preferred). + - It should convey what the equation IS or DOES, not just "equation". + - Good examples: + `bayes_theorem_posterior` + `softmax_cross_entropy_loss` + `ideal_gas_law` + `michaelis_menten_rate` + `combustion_of_methane` + `quadratic_formula_roots` + - Bad examples: + `equation`, `formula`, `math`, `the_equation`, `eq_1` + +4. NORMALIZED EQUATION (`equation`) + - Output the math-mode BODY ONLY. Do NOT wrap in any delimiter or environment: no `$...$`, no `$$...$$`, no `\\(...\\)`, no `\\[...\\]`, no `\\begin{{equation}}...\\end{{equation}}`. + - Strip those outer wrappers if present in the input. + - KEEP semantic inner environments such as `aligned`, `cases`, `pmatrix`, `bmatrix`, `array`, `split` — they are part of the equation's structure, not delimiters. 
+ - If the input uses `\\begin{{align}}` or `\\begin{{align*}}`, convert to `\\begin{{aligned}}`. + - Strip equation numbering (`\\tag{{...}}`, automatic numbers from `align`/`equation`). + - Preserve all symbols, subscripts, superscripts, and operators faithfully. Do NOT simplify or rename variables. + - Convert Markdown / plain-text / Unicode math to standard LaTeX (`x^2` → `x^{{2}}`, `sqrt(a)` → `\\sqrt{{a}}`, `≤` → `\\leq`, `α` → `\\alpha`). + - For chemical equations, use `mhchem`: `\\ce{{2H2 + O2 -> 2H2O}}`. + - If multiple independent equations appear together, join them with `\\\\` inside a single `\\begin{{aligned}}...\\end{{aligned}}` and note the grouping in `description`. + +5. DESCRIPTION (`description`, ≤ 300 words, natural prose — not bullets) + Cover the following where applicable: + - What the equation expresses and what problem it addresses. + - Its role in the surrounding text (e.g. defines a quantity, states a constraint, derives a result, models a phenomenon). + - The named formula it corresponds to, if any, and where it is commonly used. + - Briefly clarify only those symbols whose meaning is non-obvious or domain-specific, OR whose meaning is fixed by the Leading/Trailing Text. Do NOT enumerate every symbol mechanically. + - Use specific proper nouns (variable names, entity names) instead of pronouns. + - If the equation clearly illustrates or supports the surrounding context (leading or trailing text), briefly note that relationship at the end. Otherwise omit. + +6. OUTPUT RULES + - Return ONE valid JSON object only. + - No surrounding markdown, no code fences, no preamble, no explanation. + - All string values must be properly escaped JSON strings (escape `"` as `\\"`, escape backslashes as `\\\\`, newlines as `\\n`). + - LaTeX backslashes inside the `equation` string must be double-escaped (e.g. `\\frac{{a}}{{b}}` is written as `"\\\\frac{{a}}{{b}}"` in the JSON). 
+ - If the input uses `\\begin{{align}}` or `\\begin{{align*}}`, convert to `\\begin{{aligned}}` in the output (since the outer display wrapper is stripped). + - The output values for the JSON fields `name` and `description` must be written in `{language}`. + +================ EQUATION BODY ================ +``` +{content} +``` + +================ ADDITIONAL CONTEXT ================ +- Captions: {captions} + +- Footnotes: {footnotes} + +- Leading Text: +``` +{leading} +``` + +- Trailing Text: +``` +{trailing} +``` + +================ OUTPUT FORMAT ================ +{{ + "name": "", + "equation": "", + "description": "" +}} + +Output: +""" + + +__all__ = [ + "IMAGE_TYPE_ENUM", + "IMAGE_TYPE_FALLBACK", + "MULTIMODAL_PROMPTS", +] diff --git a/lightrag/sidecar/__init__.py b/lightrag/sidecar/__init__.py new file mode 100644 index 0000000000..1ddecdb763 --- /dev/null +++ b/lightrag/sidecar/__init__.py @@ -0,0 +1,33 @@ +"""LightRAG Sidecar writer infrastructure. + +Spec: ``docs/LightRAGSidecarFormat-zh.md``. + +This package owns the *single executable specification* of the LightRAG Sidecar +file format. Parser engines (native / mineru / docling) hand it an +``IRDoc`` (intermediate representation) describing the document; the writer +emits the spec-compliant ``*.parsed/`` directory. + +See :func:`lightrag.sidecar.writer.write_sidecar` for the entry point. +""" + +from lightrag.sidecar.ir import ( + AssetSpec, + IRBlock, + IRDoc, + IRDrawing, + IREquation, + IRPosition, + IRTable, +) +from lightrag.sidecar.writer import write_sidecar + +__all__ = [ + "AssetSpec", + "IRBlock", + "IRDoc", + "IRDrawing", + "IREquation", + "IRPosition", + "IRTable", + "write_sidecar", +] diff --git a/lightrag/sidecar/ir.py b/lightrag/sidecar/ir.py new file mode 100644 index 0000000000..5263a1711c --- /dev/null +++ b/lightrag/sidecar/ir.py @@ -0,0 +1,213 @@ +"""Intermediate representation (IR) handed by parser adapters to the writer. + +Parser engines do not write spec-shaped JSON directly. 
Each engine adapter +produces an :class:`IRDoc`; :func:`lightrag.sidecar.writer.write_sidecar` +turns that into ``*.parsed/`` files matching ``LightRAGSidecarFormat-zh.md``. + +Why an in-process IR (not a serialized intermediate): + +- One executable spec point. ``writer.py`` is the only place that knows id + formats, placeholder tags, blockid computation, ``asset_dir`` truth value. +- Engine adapters only translate; they never embed knowledge of the on-disk + format. +- The dataclasses below cover the spec contract plus an ``extras`` escape + hatch on item-level objects so engine-specific signals (rowspan, OCR + confidence, ...) can be passed through without spec churn. + +Placeholder convention used by :attr:`IRBlock.content_template`: + +- ``{{TBL:k}}`` — k is the placeholder key declared on the IRTable object +- ``{{IMG:k}}`` — IRDrawing +- ``{{EQ:k}}`` — block-level IREquation (``is_block=True``) +- ``{{EQI:k}}`` — inline IREquation (``is_block=False``); rendered without an + id, never enters ``equations.json`` + +The writer expands these templates after id allocation. Adapters MUST emit +exactly one placeholder per item; multiple in-content placeholders sharing +the same key are not supported. +""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from pathlib import Path +from typing import Any + + +@dataclass +class IRPosition: + """Block-level position. See spec §八. + + ``type`` values: ``"paraid"`` (docx) / ``"bbox"`` (pdf) / + ``"heading"`` (md) / ``"absolute"`` (text). + + ``origin`` is meaningful only for ``type="bbox"`` and acts as a + per-position override of ``IRDoc.bbox_attributes.origin`` (spec §八). + Leave ``None`` to inherit the document-level origin; set explicitly + (e.g. ``"LEFTTOP"`` / ``"LEFTBOTTOM"``) when this position's + coordinate system differs from the document default — used by the + Docling adapter to record mixed ``coord_origin`` without flipping + coordinates. 
+ """ + + type: str + anchor: Any = None + range: list | None = None + charspan: list[int] | None = None + origin: str | None = None + + def to_jsonable(self) -> dict[str, Any]: + out: dict[str, Any] = {"type": self.type} + if self.anchor is not None: + out["anchor"] = self.anchor + if self.range is not None: + out["range"] = list(self.range) + if self.charspan is not None: + out["charspan"] = list(self.charspan) + if self.origin is not None: + out["origin"] = self.origin + return out + + +@dataclass +class IRTable: + """Spec §五. ``rows`` (preferred) or ``html`` describes the body. + + The writer renders ``{{TBL:placeholder_key}}`` in IRBlock.content_template + as ``body
``; ``format`` + is chosen by which payload the adapter populated. + """ + + placeholder_key: str + rows: list[list[str]] | None = None + html: str | None = None + num_rows: int = 0 + num_cols: int = 0 + caption: str = "" + footnotes: list[str] = field(default_factory=list) + table_header: list[list[str]] | None = None + # Spec §五 ``self_ref``: optional pointer into the engine's raw output + # (e.g. Docling JSON Pointer ``#/tables/2``). Empty string ⇒ writer + # omits the field. Used for traceability back to ``.docling_raw/``. + self_ref: str = "" + extras: dict[str, Any] = field(default_factory=dict) + # Optional verbatim body to render inside the ``…
`` tag + # in ``blocks.jsonl``. When set, the writer uses this string in the block + # text instead of re-encoding ``rows`` via ``json.dumps`` — preserving + # the parser's original whitespace/escaping when byte-equivalence with a + # pre-existing output is required. The ``tables.json`` ``content`` field + # is unaffected and remains the canonical + # ``json.dumps(rows, ensure_ascii=False)`` encoding. + # + # Coexistence with ``rows`` / ``html``: ``body_override`` does NOT replace + # the structured body. ``rows`` (or ``html``) must still be populated for + # the sidecar's ``content`` / ``dimension`` / ``format`` fields and for + # the writer's ``"json" vs "html"`` format selection. Adapters typically + # set BOTH (e.g. native docx sets ``rows`` from the parsed JSON AND sets + # ``body_override`` to the raw verbatim string). When JSON parsing fails + # in the adapter (``rows`` is None), ``html`` is used as the structured + # fallback and the writer renders ``format="html"`` with the body_override + # string verbatim — keeping the original (unparseable) bytes intact. + body_override: str | None = None + + +@dataclass +class IRDrawing: + """Spec §四. ``asset_ref`` points to an :class:`AssetSpec` in IRDoc.""" + + placeholder_key: str + asset_ref: str + fmt: str = "" + caption: str = "" + footnotes: list[str] = field(default_factory=list) + src: str = "" + # Spec §四 ``self_ref``: optional pointer into the engine's raw output + # (e.g. Docling JSON Pointer ``#/pictures/3``). Empty string ⇒ writer + # omits the field. Used for traceability back to ``.docling_raw/``. + self_ref: str = "" + extras: dict[str, Any] = field(default_factory=dict) + # Optional verbatim path. When set, the writer emits this string in + # both the ``blocks.jsonl`` ```` attribute and the + # ``drawings.json`` ``path`` field as-is — bypassing + # ``asset_paths`` resolution and the ``block_drawing_path_style`` + # transformation. Used for linked / external image references (e.g. 
+ # ````) that point at bytes not + # materialized into ``.blocks.assets/``. + path_override: str | None = None + + +@dataclass +class IREquation: + """Spec §六. ``is_block=False`` ⇒ inline; not allocated an id, not written + to ``equations.json``; rendered as ```` + in block text. + """ + + placeholder_key: str + latex: str + is_block: bool = True + caption: str = "" + footnotes: list[str] = field(default_factory=list) + # Spec §六 ``self_ref``: optional pointer into the engine's raw output + # (e.g. Docling JSON Pointer ``#/texts/15``). Empty string ⇒ writer + # omits the field. Only meaningful when ``is_block=True``; inline + # equations never enter ``equations.json``. + self_ref: str = "" + extras: dict[str, Any] = field(default_factory=dict) + + +@dataclass +class IRBlock: + """One content block (spec §3.2). + + ``content_template`` is the final block text with placeholder tokens + embedded. The writer expands tokens once ids are assigned. + """ + + content_template: str + heading: str = "" + level: int = 0 + parent_headings: list[str] = field(default_factory=list) + session_type: str = "body" + table_slice: str = "none" + table_header: str | None = None + positions: list[IRPosition] = field(default_factory=list) + tables: list[IRTable] = field(default_factory=list) + drawings: list[IRDrawing] = field(default_factory=list) + equations: list[IREquation] = field(default_factory=list) + + +@dataclass +class AssetSpec: + """Describes one file that lands in ``.blocks.assets/``. + + ``source`` may be: + + - :class:`pathlib.Path` to an existing file on disk (writer copies it); + - :class:`bytes` payload (writer dumps it); + - ``None`` when the file is already in place at ``/`` + (e.g. native docx parser writes assets during extraction); the writer + then records its size without touching it. 
+ + Carrier protocol: a drawing references the asset by :attr:`ref`; the + writer resolves that to a concrete filename inside the assets dir and + writes the result to both ``drawings.json`` (full relative path) and + the ```` attribute in ``blocks.jsonl``. + """ + + ref: str + suggested_name: str + source: Path | bytes | None = None + + +@dataclass +class IRDoc: + """Top-level IR — the input to :func:`write_sidecar`.""" + + document_name: str + document_format: str + doc_title: str + split_option: dict[str, Any] + blocks: list[IRBlock] + assets: list[AssetSpec] = field(default_factory=list) + bbox_attributes: dict[str, Any] | None = None diff --git a/lightrag/sidecar/placeholders.py b/lightrag/sidecar/placeholders.py new file mode 100644 index 0000000000..2ed1fa3d9b --- /dev/null +++ b/lightrag/sidecar/placeholders.py @@ -0,0 +1,117 @@ +"""Placeholder token rendering for spec-shaped multimodal tags. + +Adapters populate :attr:`IRBlock.content_template` with ``{{TBL:k}}``, +``{{IMG:k}}``, ``{{EQ:k}}`` and ``{{EQI:k}}`` tokens. The writer assigns +``tb-`` / ``im-`` / ``eq-`` ids, then calls :func:`render_template` to +substitute the spec-shaped XML-style tags described in +``LightRAGSidecarFormat-zh.md`` §3.3. +""" + +from __future__ import annotations + +import json +import re +from typing import Callable + +_TOKEN_RE = re.compile(r"\{\{(TBL|IMG|EQ|EQI):([A-Za-z0-9_\-]+)\}\}") + + +def xml_attr_escape(value: str) -> str: + """Escape an attribute value for an XML-style tag attribute.""" + return ( + str(value) + .replace("&", "&amp;") + .replace("<", "&lt;") + .replace(">", "&gt;") + .replace('"', "&quot;") + ) + + +def caption_attr(caption: str) -> str: + """Render a leading-space ``caption="..."`` attribute; empty when absent. + + Matches the existing native_docx adapter convention exactly so consumers + that grep for ``caption="``-prefixed substrings keep working. 
+ """ + return f' caption="{xml_attr_escape(caption)}"' if caption else "" + + +def render_table_tag(table_id: str, fmt: str, body: str) -> str: + """``body
`` per spec §3.3. + + ``body`` is the table content; for ``json`` it is the JSON array, for + ``html`` it is the raw ``...
`` HTML inside (the outer + wrapper is added here). + """ + return ( + f'{body}
' + ) + + +def render_drawing_tag( + drawing_id: str, + fmt: str, + caption: str, + path: str, + src: str, +) -> str: + """````.""" + return ( + f'' + ) + + +def render_equation_tag( + eq_id: str | None, + latex: str, + caption: str = "", +) -> str: + """Block equation: ``latex``. + + Inline equation (``eq_id is None``): ``latex`` + — no id, never written to ``equations.json``. Caption is preserved for + both forms (spec §3.3 allows ``caption`` on ````). + """ + if eq_id is None: + return f'{latex}' + return ( + f'{latex}' + ) + + +def render_template( + template: str, + *, + table_renderer: Callable[[str], str], + drawing_renderer: Callable[[str], str], + equation_renderer: Callable[[str], str], + inline_equation_renderer: Callable[[str], str], +) -> str: + """Replace ``{{TBL:k}}`` / ``{{IMG:k}}`` / ``{{EQ:k}}`` / ``{{EQI:k}}``. + + Each renderer takes the placeholder *key* (the ``k`` portion) and returns + the rendered XML-style tag. + """ + + def _replace(match: "re.Match[str]") -> str: + kind, key = match.group(1), match.group(2) + if kind == "TBL": + return table_renderer(key) + if kind == "IMG": + return drawing_renderer(key) + if kind == "EQ": + return equation_renderer(key) + return inline_equation_renderer(key) + + return _TOKEN_RE.sub(_replace, template) + + +def table_body_for_rows(rows: list[list[str]]) -> str: + """Encode rows as the JSON body that lives inside ````.""" + return json.dumps(rows, ensure_ascii=False) diff --git a/lightrag/sidecar/writer.py b/lightrag/sidecar/writer.py new file mode 100644 index 0000000000..dc4e9fc47e --- /dev/null +++ b/lightrag/sidecar/writer.py @@ -0,0 +1,627 @@ +"""Spec-compliant sidecar writer. + +This module is the *single executable specification* of the LightRAG sidecar +format (``docs/LightRAGSidecarFormat-zh.md``). Engine adapters hand it an +:class:`IRDoc`; it emits the ``*.parsed/`` directory. 
+ +Responsibilities (none of these belong in adapters): + +- id allocation: ``tb-/im-/eq-<doc_hash>-NNNN`` (4-digit zero-padded, + global per-doc sequence) +- placeholder rendering: ``{{TBL:k}}`` / ``{{IMG:k}}`` / ``{{EQ:k}}`` / + ``{{EQI:k}}`` → spec-shaped XML-style tags +- blockid computation: ``md5(doc_id:block_index:heading:content)`` +- assets dir creation and file copying; ``asset_dir`` flag in meta is + derived from "directory exists and is non-empty" +- merged_text + document_hash +- meta line shape (spec §3.1) +- conditional writes: ``tables.json`` / ``drawings.json`` / ``equations.json`` + appear only when their dict is non-empty +""" + +from __future__ import annotations + +import hashlib +import json +import re +import shutil +from datetime import datetime, timezone +from pathlib import Path +from typing import Any + +from lightrag.constants import FULL_DOCS_FORMAT_LIGHTRAG +from lightrag.sidecar.ir import ( + AssetSpec, + IRBlock, + IRDoc, + IRDrawing, + IREquation, + IRTable, +) +from lightrag.sidecar.placeholders import ( + render_drawing_tag, + render_equation_tag, + render_table_tag, + render_template, + table_body_for_rows, +) +from lightrag.utils import logger + + +# --------------------------------------------------------------------------- +# Public entry point +# --------------------------------------------------------------------------- + + +_VALID_BLOCK_DRAWING_PATH_STYLES = {"with_prefix", "basename_only"} + + +def write_sidecar( + ir: IRDoc, + *, + parsed_dir: Path, + doc_id: str, + engine: str, + clean_parsed_dir: bool = True, + block_drawing_path_style: str = "with_prefix", +) -> dict[str, Any]: + """Emit a spec-compliant ``*.parsed/`` directory from an IR. + + Args: + ir: Document IR produced by an engine adapter. + parsed_dir: Output directory. By default cleared and recreated; the + caller is responsible for placing it under + ``__parsed__/.parsed/``. 
+ doc_id: ``doc-``; ``doc_hash`` for sidecar ids is the 32-char + tail after stripping the ``doc-`` prefix. + engine: One of ``native`` / ``mineru`` / ``docling`` / ``legacy``; + written verbatim to ``meta.parse_engine``. + clean_parsed_dir: When True (default) the writer ``rmtree``s + ``parsed_dir`` before writing. Set to False when the caller has + already pre-populated the directory with side artifacts that + must survive — e.g. the native docx adapter pre-extracts image + bytes into ``.blocks.assets/`` before the writer runs, + and passing ``AssetSpec.source=None`` lets the writer record + them without copying. + block_drawing_path_style: How ```` in + ``blocks.jsonl`` resolves the asset path. ``"with_prefix"`` + (default) renders ``.blocks.assets/`` — matches + the path stored in ``drawings.json``. ``"basename_only"`` + renders just ````; legacy native docx convention + (downstream consumers read the file path from ``drawings.json``, + not from this attribute, so the basename-only form is purely + cosmetic but kept for byte-equivalence with the original + adapter). + + Returns: + Dict shaped like the pipeline's existing ``parsed_data`` payload: + ``{doc_id, file_path, parse_format, content, blocks_path}``. + ``file_path`` is ``ir.document_name``; the caller resolves it to the + actual on-disk path it wants persisted. 
+ """ + if block_drawing_path_style not in _VALID_BLOCK_DRAWING_PATH_STYLES: + allowed = ", ".join(sorted(_VALID_BLOCK_DRAWING_PATH_STYLES)) + raise ValueError( + f"block_drawing_path_style must be one of {allowed}, " + f"got {block_drawing_path_style!r}" + ) + + if clean_parsed_dir and parsed_dir.exists(): + shutil.rmtree(parsed_dir) + parsed_dir.mkdir(parents=True, exist_ok=True) + + base_name = Path(ir.document_name).stem or ir.document_name + blocks_path = parsed_dir / f"{base_name}.blocks.jsonl" + tables_path = parsed_dir / f"{base_name}.tables.json" + drawings_path = parsed_dir / f"{base_name}.drawings.json" + equations_path = parsed_dir / f"{base_name}.equations.json" + assets_dir = parsed_dir / f"{base_name}.blocks.assets" + + # ``clean_parsed_dir=False`` is reserved for callers that pre-populate + # the directory with artifacts that must survive (e.g. the native docx + # adapter pre-extracts assets). If a stale ``blocks.jsonl`` is sitting + # there, the caller forgot to pre-clean — warn so the leftover doesn't + # get silently overwritten with partially-stale neighbors. + if not clean_parsed_dir and blocks_path.exists(): + logger.warning( + "[sidecar] clean_parsed_dir=False but %s already exists; " + "caller is expected to pre-clean before invoking write_sidecar", + blocks_path, + ) + + # Stage 1: realize assets first so drawings can carry resolved paths. + asset_paths = _materialize_assets(ir.assets, assets_dir) + + # Stage 2: walk blocks, allocate ids, render templates, accumulate + # sidecar item dicts and blocks.jsonl lines. 
+ doc_hash = doc_id.removeprefix("doc-") + tables: dict[str, dict[str, Any]] = {} + drawings: dict[str, dict[str, Any]] = {} + equations: dict[str, dict[str, Any]] = {} + blocks_lines: list[str] = [] + merged_parts: list[str] = [] + + table_seq = 0 + drawing_seq = 0 + equation_seq = 0 + + asset_prefix = f"{assets_dir.name}/" + + # ``block_index`` in the blockid hash refers to the position in the + # SOURCE block list (``enumerate`` over ``ir.blocks``), not the emitted + # position. Otherwise an editor turning a previously-non-empty block + # into an empty one — which then gets dropped — would shift the + # blockids of every block after it; we want stable ids across edits. + for block_index, block in enumerate(ir.blocks): + # Allocate ids for items declared on this block. Order: tables -> + # drawings -> equations (per-block deterministic; the global + # sequence advances across blocks). + table_id_by_key: dict[str, str] = {} + for table in block.tables: + table_seq += 1 + tb_id = f"tb-{doc_hash}-{table_seq:04d}" + table_id_by_key[table.placeholder_key] = tb_id + + drawing_id_by_key: dict[str, str] = {} + for drawing in block.drawings: + drawing_seq += 1 + im_id = f"im-{doc_hash}-{drawing_seq:04d}" + drawing_id_by_key[drawing.placeholder_key] = im_id + + equation_id_by_key: dict[str, str] = {} + for equation in block.equations: + if not equation.is_block: + continue + equation_seq += 1 + eq_id = f"eq-{doc_hash}-{equation_seq:04d}" + equation_id_by_key[equation.placeholder_key] = eq_id + + # Render placeholder template. 
+ rendered = _render_block_content( + block, + table_id_by_key=table_id_by_key, + drawing_id_by_key=drawing_id_by_key, + equation_id_by_key=equation_id_by_key, + asset_paths=asset_paths, + asset_prefix=asset_prefix, + block_drawing_path_style=block_drawing_path_style, + ) + + rendered = rendered.strip() + if not rendered: + # Drop empty blocks entirely — neither blocks.jsonl entry nor + # sidecar items (the items were tied to the placeholder; if it + # vanished, the items are orphans). This mirrors the existing + # native_docx behaviour and ensures merged_text is contiguous. + continue + + blockid = hashlib.md5( + f"{doc_id}:{block_index}:{block.heading}:{rendered}".encode("utf-8") + ).hexdigest() + + # Realize per-block sidecar item dicts now that blockid is known. + # Defensive: an adapter that declares an item on block.tables / + # drawings / equations but omits the matching ``{{TBL/IMG/EQ:k}}`` + # token from ``content_template`` would leave the rendered text + # without the corresponding tag. We detect that by checking whether + # the allocated id (which is doc-unique) appears in the rendered + # output, warn, and skip the sidecar entry — otherwise the per- + # modality JSON would reference a blockid whose body never names it. 
+ for table in block.tables: + tb_id = table_id_by_key[table.placeholder_key] + if tb_id not in rendered: + logger.warning( + "[sidecar] orphan table id=%s on block %d " + "(placeholder %r not referenced in content_template); " + "skipping sidecar entry", + tb_id, + block_index, + table.placeholder_key, + ) + continue + tables[tb_id] = _table_item_dict(tb_id, blockid, block.heading, table) + for drawing in block.drawings: + im_id = drawing_id_by_key[drawing.placeholder_key] + if im_id not in rendered: + logger.warning( + "[sidecar] orphan drawing id=%s on block %d " + "(placeholder %r not referenced in content_template); " + "skipping sidecar entry", + im_id, + block_index, + drawing.placeholder_key, + ) + continue + drawings[im_id] = _drawing_item_dict( + im_id, blockid, block.heading, drawing, asset_paths, asset_prefix + ) + for equation in block.equations: + if not equation.is_block: + continue + eq_id = equation_id_by_key[equation.placeholder_key] + if eq_id not in rendered: + logger.warning( + "[sidecar] orphan equation id=%s on block %d " + "(placeholder %r not referenced in content_template); " + "skipping sidecar entry", + eq_id, + block_index, + equation.placeholder_key, + ) + continue + equations[eq_id] = _equation_item_dict( + eq_id, blockid, block.heading, equation + ) + + row: dict[str, Any] = { + "type": "content", + "blockid": blockid, + "format": "plain_text", + "content": rendered, + "heading": block.heading, + "parent_headings": list(block.parent_headings), + "level": int(block.level), + "session_type": block.session_type or "body", + "table_slice": block.table_slice or "none", + "positions": [p.to_jsonable() for p in block.positions], + } + if block.table_header: + row["table_header"] = block.table_header + blocks_lines.append(json.dumps(row, ensure_ascii=False)) + merged_parts.append(rendered) + + # Stage 3: doc-level metadata. 
+ merged_text = "\n\n".join(p for p in merged_parts if p.strip()) + document_hash = hashlib.sha256(merged_text.encode("utf-8")).hexdigest() + parse_time = datetime.now(timezone.utc).isoformat() + + asset_dir_present = assets_dir.exists() and any(assets_dir.iterdir()) + if not asset_dir_present and assets_dir.exists(): + try: + assets_dir.rmdir() + except OSError: + pass + + meta: dict[str, Any] = { + "type": "meta", + "format": "lightrag", + "version": "1.0", + "document_name": ir.document_name, + "document_format": ir.document_format, + "document_hash": f"sha256:{document_hash}", + "table_file": bool(tables), + "equation_file": bool(equations), + "drawing_file": bool(drawings), + "asset_dir": asset_dir_present, + "split_option": dict(ir.split_option or {}), + "blocks": len(blocks_lines), + "doc_id": doc_id, + "parse_engine": engine, + "parse_time": parse_time, + "doc_title": ir.doc_title, + } + if ir.bbox_attributes is not None: + meta["bbox_attributes"] = dict(ir.bbox_attributes) + + blocks_path.write_text( + "\n".join([json.dumps(meta, ensure_ascii=False)] + blocks_lines) + "\n", + encoding="utf-8", + ) + + # Sidecar JSONs end with a trailing newline (POSIX text-file convention; + # also keeps end-of-file linters / pre-commit hooks happy and matches the + # ``blocks.jsonl`` convention above). 
+ if tables: + tables_path.write_text( + json.dumps( + {"version": "1.0", "tables": tables}, + ensure_ascii=False, + indent=2, + ) + + "\n", + encoding="utf-8", + ) + if drawings: + drawings_path.write_text( + json.dumps( + {"version": "1.0", "drawings": drawings}, + ensure_ascii=False, + indent=2, + ) + + "\n", + encoding="utf-8", + ) + if equations: + equations_path.write_text( + json.dumps( + {"version": "1.0", "equations": equations}, + ensure_ascii=False, + indent=2, + ) + + "\n", + encoding="utf-8", + ) + + logger.info( + "[sidecar] wrote %d blocks for doc_id=%s " + "(%d tables, %d drawings, %d equations, assets=%s, engine=%s)", + len(blocks_lines), + doc_id, + len(tables), + len(drawings), + len(equations), + asset_dir_present, + engine, + ) + + return { + "doc_id": doc_id, + "file_path": ir.document_name, + "parse_format": FULL_DOCS_FORMAT_LIGHTRAG, + "content": merged_text, + "blocks_path": str(blocks_path), + } + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + + +def _materialize_assets( + assets: list[AssetSpec], + assets_dir: Path, +) -> dict[str, str]: + """Materialize :class:`AssetSpec` objects into ``assets_dir``. + + Returns: ``{ref: filename_inside_assets_dir}``. + + Collision policy: if two specs map to the same target name, the second + gets a ``-2``, ``-3``, ... suffix on the stem. We never overwrite a file + we've already produced. 
+ """ + if not assets: + return {} + + assets_dir.mkdir(parents=True, exist_ok=True) + out: dict[str, str] = {} + used_names: set[str] = set() + + for spec in assets: + target_name = _allocate_unique_name(spec.suggested_name, used_names) + target_path = assets_dir / target_name + if isinstance(spec.source, (str, Path)): + src_path = Path(spec.source) + if not src_path.exists(): + logger.warning( + "[sidecar] asset source missing for ref=%s (%s); " "skipping copy", + spec.ref, + src_path, + ) + continue + if src_path.resolve() != target_path.resolve(): + shutil.copyfile(src_path, target_path) + elif isinstance(spec.source, bytes): + target_path.write_bytes(spec.source) + elif spec.source is None: + # Assumed already on disk at the target location (native_docx + # writes assets during extraction). Verify presence; warn if + # missing. + if not target_path.exists(): + logger.warning( + "[sidecar] asset ref=%s declared in place but %s " "is absent", + spec.ref, + target_path, + ) + continue + else: + logger.warning( + "[sidecar] unsupported AssetSpec.source type for ref=%s: %s", + spec.ref, + type(spec.source).__name__, + ) + continue + used_names.add(target_name) + out[spec.ref] = target_name + + return out + + +def _allocate_unique_name(suggested: str, used: set[str]) -> str: + """Make ``suggested`` unique within ``used``: ``foo.png`` → ``foo-2.png``.""" + if suggested not in used: + return suggested + stem = Path(suggested).stem + suffix = Path(suggested).suffix + n = 2 + while True: + cand = f"{stem}-{n}{suffix}" + if cand not in used: + return cand + n += 1 + + +def _render_block_content( + block: IRBlock, + *, + table_id_by_key: dict[str, str], + drawing_id_by_key: dict[str, str], + equation_id_by_key: dict[str, str], + asset_paths: dict[str, str], + asset_prefix: str, + block_drawing_path_style: str = "with_prefix", +) -> str: + """Expand placeholder tokens in ``block.content_template``.""" + + tables_by_key = {t.placeholder_key: t for t in block.tables} + 
drawings_by_key = {d.placeholder_key: d for d in block.drawings} + equations_by_key = {e.placeholder_key: e for e in block.equations} + + def _table(key: str) -> str: + table = tables_by_key.get(key) + if table is None: + return "" + tb_id = table_id_by_key.get(key, "") + if table.body_override is not None: + # Verbatim block-text body — used by adapters that need to + # preserve the parser's original whitespace/escaping (native + # docx). Sidecar entry's ``content`` field still gets the + # canonical ``table_body_for_rows`` encoding via + # ``_table_item_dict``. + fmt = "json" if table.rows is not None else "html" + return render_table_tag(tb_id, fmt, table.body_override) + if table.rows is not None: + return render_table_tag(tb_id, "json", table_body_for_rows(table.rows)) + return render_table_tag(tb_id, "html", table.html or "") + + def _drawing(key: str) -> str: + drawing = drawings_by_key.get(key) + if drawing is None: + return "" + im_id = drawing_id_by_key.get(key, "") + if drawing.path_override is not None: + # Verbatim external/linked reference — pass through unchanged. + path = drawing.path_override + else: + filename = asset_paths.get(drawing.asset_ref, "") + if not filename: + path = "" + elif block_drawing_path_style == "basename_only": + path = filename + else: + path = f"{asset_prefix}{filename}" + return render_drawing_tag( + im_id, + drawing.fmt, + drawing.caption, + path, + drawing.src, + ) + + def _equation(key: str) -> str: + eq = equations_by_key.get(key) + if eq is None: + return "" + if not eq.is_block: + # Adapter mistake: an EQ token should only be used for block + # equations. Treat as inline to avoid a dangling token. 
+ return render_equation_tag(None, eq.latex, eq.caption) + eq_id = equation_id_by_key.get(key, "") + return render_equation_tag(eq_id, eq.latex, eq.caption) + + def _inline_equation(key: str) -> str: + eq = equations_by_key.get(key) + if eq is None: + return "" + return render_equation_tag(None, eq.latex, eq.caption) + + return render_template( + block.content_template, + table_renderer=_table, + drawing_renderer=_drawing, + equation_renderer=_equation, + inline_equation_renderer=_inline_equation, + ) + + +def _table_item_dict( + table_id: str, + blockid: str, + heading: str, + table: IRTable, +) -> dict[str, Any]: + if table.rows is not None: + fmt = "json" + content = table_body_for_rows(table.rows) + else: + fmt = "html" + content = table.html or "" + + item: dict[str, Any] = { + "id": table_id, + "blockid": blockid, + "heading": heading, + "dimension": [int(table.num_rows), int(table.num_cols)], + "format": fmt, + "content": content, + "caption": table.caption, + "footnotes": list(table.footnotes), + } + if table.table_header is not None: + # Spec §5: stored as JSON string. 
+ item["table_header"] = json.dumps(table.table_header, ensure_ascii=False) + if table.self_ref: + item["self_ref"] = table.self_ref + if table.extras: + item["extras"] = dict(table.extras) + return item + + +def _drawing_item_dict( + drawing_id: str, + blockid: str, + heading: str, + drawing: IRDrawing, + asset_paths: dict[str, str], + asset_prefix: str, +) -> dict[str, Any]: + if drawing.path_override is not None: + path = drawing.path_override + else: + filename = asset_paths.get(drawing.asset_ref, "") + path = f"{asset_prefix}{filename}" if filename else "" + item: dict[str, Any] = { + "id": drawing_id, + "blockid": blockid, + "heading": heading, + "format": drawing.fmt, + "path": path, + "src": drawing.src, + "caption": drawing.caption, + "footnotes": list(drawing.footnotes), + } + if drawing.self_ref: + item["self_ref"] = drawing.self_ref + if drawing.extras: + item["extras"] = dict(drawing.extras) + return item + + +_LATEX_DOLLAR_RE = re.compile(r"^\s*\$\$?(.+?)\$\$?\s*$", re.DOTALL) + + +def _strip_latex_dollar_wrappers(latex: str) -> str: + """Strip leading/trailing ``$``/``$$`` wrappers from a latex string. + + ``equations.json`` stores clean latex (per the MinerU adapter contract: + ``blocks.jsonl`` keeps the parser's raw form so the rendered + ```` body is byte-identical to the source, while the + per-equation sidecar carries delimiter-free latex). Leaves strings + without wrappers untouched. 
+ """ + if not latex: + return latex + m = _LATEX_DOLLAR_RE.match(latex) + return m.group(1).strip() if m else latex.strip() + + +def _equation_item_dict( + eq_id: str, + blockid: str, + heading: str, + equation: IREquation, +) -> dict[str, Any]: + item: dict[str, Any] = { + "id": eq_id, + "blockid": blockid, + "heading": heading, + "format": "latex", + "content": _strip_latex_dollar_wrappers(equation.latex), + "caption": equation.caption, + "footnotes": list(equation.footnotes), + } + if equation.self_ref: + item["self_ref"] = equation.self_ref + if equation.extras: + item["extras"] = dict(equation.extras) + return item diff --git a/lightrag/storage_migrations.py b/lightrag/storage_migrations.py new file mode 100644 index 0000000000..a38f537525 --- /dev/null +++ b/lightrag/storage_migrations.py @@ -0,0 +1,329 @@ +"""Storage data migration helpers for :class:`LightRAG`. + +Mixed into LightRAG and runs once at startup (``initialize_storages`` → +``check_and_migrate_data``) to upgrade legacy data layouts: + +- Backfill ``full_entities`` / ``full_relations`` from the graph + doc_status + history when those KV stores are empty (entity-relation migration). +- Rebuild ``entity_chunks`` / ``relation_chunks`` indexes by walking nodes/ + edges in the graph storage when they are empty + (chunk-tracking migration). +""" + +from __future__ import annotations + +from lightrag.base import DocStatus +from lightrag.constants import GRAPH_FIELD_SEP +from lightrag.kg.shared_storage import get_data_init_lock +from lightrag.utils import logger, make_relation_chunk_key + + +class _StorageMigrationMixin: + """Mixin that owns one-shot data migrations on :class:`LightRAG`. + + Mixed into LightRAG only. Relies on attributes that the main class + initializes in ``__post_init__`` (``doc_status``, ``full_entities``, + ``full_relations``, ``chunk_entity_relation_graph``, ``entity_chunks``, + ``relation_chunks``). 
+ """ + + async def check_and_migrate_data(self): + """Check if data migration is needed and perform migration if necessary""" + async with get_data_init_lock(): + try: + # Check if migration is needed: + # 1. chunk_entity_relation_graph has entities and relations (count > 0) + # 2. full_entities and full_relations are empty + + # Get all entity labels from graph + all_entity_labels = ( + await self.chunk_entity_relation_graph.get_all_labels() + ) + + if not all_entity_labels: + logger.debug("No entities found in graph, skipping migration check") + return + + try: + # Initialize chunk tracking storage after migration + await self._migrate_chunk_tracking_storage() + except Exception as e: + logger.error(f"Error during chunk_tracking migration: {e}") + raise e + + # Check if full_entities and full_relations are empty + # Get all processed documents to check their entity/relation data + try: + processed_docs = await self.doc_status.get_docs_by_status( + DocStatus.PROCESSED + ) + + if not processed_docs: + logger.debug("No processed documents found, skipping migration") + return + + # Check first few documents to see if they have full_entities/full_relations data + migration_needed = True + checked_count = 0 + max_check = min(5, len(processed_docs)) # Check up to 5 documents + + for doc_id in list(processed_docs.keys())[:max_check]: + checked_count += 1 + entity_data = await self.full_entities.get_by_id(doc_id) + relation_data = await self.full_relations.get_by_id(doc_id) + + if entity_data or relation_data: + migration_needed = False + break + + if not migration_needed: + logger.debug( + "Full entities/relations data already exists, no migration needed" + ) + return + + logger.info( + f"Data migration needed: found {len(all_entity_labels)} entities in graph but no full_entities/full_relations data" + ) + + # Perform migration + await self._migrate_entity_relation_data(processed_docs) + + except Exception as e: + logger.error(f"Error during migration check: {e}") + 
raise e + + except Exception as e: + logger.error(f"Error in data migration check: {e}") + raise e + + async def _migrate_entity_relation_data(self, processed_docs: dict): + """Migrate existing entity and relation data to full_entities and full_relations storage""" + logger.info(f"Starting data migration for {len(processed_docs)} documents") + + # Create mapping from chunk_id to doc_id + chunk_to_doc = {} + for doc_id, doc_status in processed_docs.items(): + chunk_ids = ( + doc_status.chunks_list + if hasattr(doc_status, "chunks_list") and doc_status.chunks_list + else [] + ) + for chunk_id in chunk_ids: + chunk_to_doc[chunk_id] = doc_id + + # Initialize document entity and relation mappings + doc_entities = {} # doc_id -> set of entity_names + doc_relations = {} # doc_id -> set of relation_pairs (as tuples) + + # Get all nodes and edges from graph + all_nodes = await self.chunk_entity_relation_graph.get_all_nodes() + all_edges = await self.chunk_entity_relation_graph.get_all_edges() + + # Process all nodes once + for node in all_nodes: + if "source_id" in node: + entity_id = node.get("entity_id") or node.get("id") + if not entity_id: + continue + + # Get chunk IDs from source_id + source_ids = node["source_id"].split(GRAPH_FIELD_SEP) + + # Find which documents this entity belongs to + for chunk_id in source_ids: + doc_id = chunk_to_doc.get(chunk_id) + if doc_id: + if doc_id not in doc_entities: + doc_entities[doc_id] = set() + doc_entities[doc_id].add(entity_id) + + # Process all edges once + for edge in all_edges: + if "source_id" in edge: + src = edge.get("source") + tgt = edge.get("target") + if not src or not tgt: + continue + + # Get chunk IDs from source_id + source_ids = edge["source_id"].split(GRAPH_FIELD_SEP) + + # Find which documents this relation belongs to + for chunk_id in source_ids: + doc_id = chunk_to_doc.get(chunk_id) + if doc_id: + if doc_id not in doc_relations: + doc_relations[doc_id] = set() + # Use tuple for set operations, convert to list 
later + doc_relations[doc_id].add(tuple(sorted((src, tgt)))) + + # Store the results in full_entities and full_relations + migration_count = 0 + + # Store entities + if doc_entities: + entities_data = {} + for doc_id, entity_set in doc_entities.items(): + entities_data[doc_id] = { + "entity_names": list(entity_set), + "count": len(entity_set), + } + await self.full_entities.upsert(entities_data) + + # Store relations + if doc_relations: + relations_data = {} + for doc_id, relation_set in doc_relations.items(): + # Convert tuples back to lists + relations_data[doc_id] = { + "relation_pairs": [list(pair) for pair in relation_set], + "count": len(relation_set), + } + await self.full_relations.upsert(relations_data) + + migration_count = len( + set(list(doc_entities.keys()) + list(doc_relations.keys())) + ) + + # Persist the migrated data + await self.full_entities.index_done_callback() + await self.full_relations.index_done_callback() + + logger.info( + f"Data migration completed: migrated {migration_count} documents with entities/relations" + ) + + async def _migrate_chunk_tracking_storage(self) -> None: + """Ensure entity/relation chunk tracking KV stores exist and are seeded.""" + + if not self.entity_chunks or not self.relation_chunks: + return + + need_entity_migration = False + need_relation_migration = False + + try: + need_entity_migration = await self.entity_chunks.is_empty() + except Exception as exc: # pragma: no cover - defensive logging + logger.error(f"Failed to check entity chunks storage: {exc}") + raise exc + + try: + need_relation_migration = await self.relation_chunks.is_empty() + except Exception as exc: # pragma: no cover - defensive logging + logger.error(f"Failed to check relation chunks storage: {exc}") + raise exc + + if not need_entity_migration and not need_relation_migration: + return + + BATCH_SIZE = 500 # Process 500 records per batch + + if need_entity_migration: + try: + nodes = await self.chunk_entity_relation_graph.get_all_nodes() + 
except Exception as exc: + logger.error(f"Failed to fetch nodes for chunk migration: {exc}") + nodes = [] + + logger.info(f"Starting chunk_tracking data migration: {len(nodes)} nodes") + + # Process nodes in batches + total_nodes = len(nodes) + total_batches = (total_nodes + BATCH_SIZE - 1) // BATCH_SIZE + total_migrated = 0 + + for batch_idx in range(total_batches): + start_idx = batch_idx * BATCH_SIZE + end_idx = min((batch_idx + 1) * BATCH_SIZE, total_nodes) + batch_nodes = nodes[start_idx:end_idx] + + upsert_payload: dict[str, dict[str, object]] = {} + for node in batch_nodes: + entity_id = node.get("entity_id") or node.get("id") + if not entity_id: + continue + + raw_source = node.get("source_id") or "" + chunk_ids = [ + chunk_id + for chunk_id in raw_source.split(GRAPH_FIELD_SEP) + if chunk_id + ] + if not chunk_ids: + continue + + upsert_payload[entity_id] = { + "chunk_ids": chunk_ids, + "count": len(chunk_ids), + } + + if upsert_payload: + await self.entity_chunks.upsert(upsert_payload) + total_migrated += len(upsert_payload) + logger.info( + f"Processed entity batch {batch_idx + 1}/{total_batches}: {len(upsert_payload)} records (total: {total_migrated}/{total_nodes})" + ) + + if total_migrated > 0: + # Persist entity_chunks data to disk + await self.entity_chunks.index_done_callback() + logger.info( + f"Entity chunk_tracking migration completed: {total_migrated} records persisted" + ) + + if need_relation_migration: + try: + edges = await self.chunk_entity_relation_graph.get_all_edges() + except Exception as exc: + logger.error(f"Failed to fetch edges for chunk migration: {exc}") + edges = [] + + logger.info(f"Starting chunk_tracking data migration: {len(edges)} edges") + + # Process edges in batches + total_edges = len(edges) + total_batches = (total_edges + BATCH_SIZE - 1) // BATCH_SIZE + total_migrated = 0 + + for batch_idx in range(total_batches): + start_idx = batch_idx * BATCH_SIZE + end_idx = min((batch_idx + 1) * BATCH_SIZE, total_edges) + 
batch_edges = edges[start_idx:end_idx] + + upsert_payload: dict[str, dict[str, object]] = {} + for edge in batch_edges: + src = edge.get("source") or edge.get("src_id") or edge.get("src") + tgt = edge.get("target") or edge.get("tgt_id") or edge.get("tgt") + if not src or not tgt: + continue + + raw_source = edge.get("source_id") or "" + chunk_ids = [ + chunk_id + for chunk_id in raw_source.split(GRAPH_FIELD_SEP) + if chunk_id + ] + if not chunk_ids: + continue + + storage_key = make_relation_chunk_key(src, tgt) + upsert_payload[storage_key] = { + "chunk_ids": chunk_ids, + "count": len(chunk_ids), + } + + if upsert_payload: + await self.relation_chunks.upsert(upsert_payload) + total_migrated += len(upsert_payload) + logger.info( + f"Processed relation batch {batch_idx + 1}/{total_batches}: {len(upsert_payload)} records (total: {total_migrated}/{total_edges})" + ) + + if total_migrated > 0: + # Persist relation_chunks data to disk + await self.relation_chunks.index_done_callback() + logger.info( + f"Relation chunk_tracking migration completed: {total_migrated} records persisted" + ) diff --git a/lightrag/table_markup.py b/lightrag/table_markup.py new file mode 100644 index 0000000000..775044d5a1 --- /dev/null +++ b/lightrag/table_markup.py @@ -0,0 +1,152 @@ +"""Shared helpers for parsing and re-emitting ``
<table>`` markup. + +These primitives are used by the paragraph-semantic chunker (Stage B +oversized-table re-split) and by the native multimodal surrounding-context +extractor. Both call sites need to: + +* recognise a post-rewrite ``<table ...>
`` tag, +* decide whether the body is JSON or HTML, +* enumerate row-level units (JSON list items or HTML ``<tr>`` rows along + with their ``<thead>`` / ``<tbody>`` / ``<tfoot>`` wrappers), and +* re-serialise a subset of rows while preserving the structural wrappers. + +Keeping the regexes and helpers in one place avoids subtle drift when +either consumer evolves. +""" + +from __future__ import annotations + +import json +import re +from typing import Any + +# Strict regex for a post-rewrite table tag emitted by the sidecar +# writer (``lightrag.sidecar.writer``): +# <table ...>{rows_json}</table>
+# blocks.jsonl invariants guarantee the tag has no embedded newlines. +TABLE_TAG_RE = re.compile( + r"<table(?P<attrs>[^>]*)>(?P<body>.*?)</table>", + re.DOTALL, +) + +# Format detection regex inside the attrs string, e.g. format="json". +_TABLE_FORMAT_RE = re.compile(r"""format\s*=\s*["'](?P[^"']+)["']""") + +# HTML <tr>...</tr> row extractor. Standard HTML disallows nested <tr>, +# so a non-greedy match is sufficient for well-formed input. +HTML_TR_RE = re.compile(r"<tr\b[^>]*>.*?</tr>", re.DOTALL | re.IGNORECASE) + +# Combined scanner for row-grouping wrappers and rows themselves. Used +# to attribute each <tr> to its surrounding <thead>/<tbody>/<tfoot> so +# the wrapper can be reconstructed around chunk boundaries instead of +# being silently dropped during row-level table splitting. +HTML_ROW_PARTS_RE = re.compile( + r"(?P<wrap></?(?:thead|tbody|tfoot)\b[^>]*>)" r"|(?P<tr><tr\b[^>]*>.*?</tr>)", + re.DOTALL | re.IGNORECASE, +) +HTML_WRAPPER_TAG_RE = re.compile( + r"<(?P<slash>/?)(?P<name>thead|tbody|tfoot)\b", re.IGNORECASE +) + + +def detect_table_format(attrs: str, body: str) -> str | None: + """Return ``"json"``, ``"html"`` or ``None`` for a parsed ``<table>`` tag. + + Prefers an explicit ``format="…"`` attribute. When silent, sniffs + the body: a leading ``[`` / ``{`` (after whitespace) implies JSON; + the presence of any `` tuple[str, list[Any]] | None: + """Parse a JSON ``
<table ...>{rows_json}</table>
``. + + Returns ``(attrs_str, rows)`` or ``None`` if the tag is malformed + (does not match ``TABLE_TAG_RE``, body is not JSON, or body decodes + to something other than a list). + """ + match = TABLE_TAG_RE.match((text or "").strip()) + if not match: + return None + body = match.group("body") + try: + rows = json.loads(body) + except json.JSONDecodeError: + return None + if not isinstance(rows, list): + return None + return match.group("attrs"), rows + + +def split_html_rows(body: str) -> list[tuple[str, str]] | None: + """Extract ``<tr>...</tr>`` rows tagged with their wrapper context. + + Returns a list of ``(wrapper_name, tr_str)`` tuples where + ``wrapper_name`` is ``"thead"`` / ``"tbody"`` / ``"tfoot"`` (lower- + cased) for rows that sit inside the corresponding wrapper, or ``""`` + for rows outside any of those wrappers. ``None`` signals "no row + found" so the caller falls through to character splitting. + + Whitespace, captions, comments, ```` and any other text + outside the recognised row-wrappers is dropped — this is a regex + extractor, not a full DOM parser. Wrapper attributes (e.g. + ````) are also dropped on re-emission; chunked + output uses bare wrapper tags. + """ + rows: list[tuple[str, str]] = [] + current_wrapper = "" + for match in HTML_ROW_PARTS_RE.finditer(body or ""): + wrap = match.group("wrap") + tr = match.group("tr") + if wrap is not None: + tag = HTML_WRAPPER_TAG_RE.match(wrap) + if tag: + slash = tag.group("slash") + name = tag.group("name").lower() + if slash == "/": + if current_wrapper == name: + current_wrapper = "" + else: + current_wrapper = name + elif tr is not None: + rows.append((current_wrapper, tr)) + if not rows: + return None + return rows + + +def serialize_html_rows(rows: list[tuple[str, str]]) -> str: + """Re-emit ``(wrapper, tr)`` rows grouped under their original + ``<thead>`` / ``<tbody>`` / ``<tfoot>`` wrappers.
+ + Consecutive rows sharing the same wrapper name collapse into a + single wrapper block; transitions emit a closing tag for the + previous wrapper and an opening tag for the next. Rows tagged with + ``""`` (no wrapper) emit bare ``<tr>...</tr>``. + """ + parts: list[str] = [] + current_wrapper = "" + for wrapper, tr in rows: + if wrapper != current_wrapper: + if current_wrapper: + parts.append(f"</{current_wrapper}>") + if wrapper: + parts.append(f"<{wrapper}>") + current_wrapper = wrapper + parts.append(tr) + if current_wrapper: + parts.append(f"</{current_wrapper}>") + return "".join(parts) diff --git a/lightrag/types.py b/lightrag/types.py index a18f2d3cd8..6a5241952c 100644 --- a/lightrag/types.py +++ b/lightrag/types.py @@ -1,12 +1,49 @@ from __future__ import annotations -from pydantic import BaseModel +from pydantic import BaseModel, Field from typing import Any, Optional -class GPTKeywordExtractionFormat(BaseModel): - high_level_keywords: list[str] - low_level_keywords: list[str] +class ExtractedEntity(BaseModel): + """A single entity extracted from text by the LLM.""" + + entity_name: str = Field( + description="Name of the entity. Use title case for case-insensitive names." + ) + entity_type: str = Field(description="Type/category of the entity.") + entity_description: str = Field( + description="Concise yet comprehensive description of the entity based on the input text." + ) + + +class ExtractedRelationship(BaseModel): + """A single relationship between two entities extracted from text.""" + + source_entity: str = Field( + description="Name of the source entity in the relationship." + ) + target_entity: str = Field( + description="Name of the target entity in the relationship." + ) + relationship_keywords: str = Field( + description="Comma-separated high-level keywords summarizing the relationship." + ) + relationship_description: str = Field( + description="Concise explanation of the relationship between source and target entities."
+ ) + + +class EntityExtractionResult(BaseModel): + """Structured output format for entity and relationship extraction from text.""" + + entities: list[ExtractedEntity] = Field( + default_factory=list, + description="List of entities extracted from the input text.", + ) + relationships: list[ExtractedRelationship] = Field( + default_factory=list, + description="List of relationships between entities extracted from the input text.", + ) class KnowledgeGraphNode(BaseModel): diff --git a/lightrag/utils.py b/lightrag/utils.py index 19110541a0..7922b71364 100644 --- a/lightrag/utils.py +++ b/lightrag/utils.py @@ -14,10 +14,12 @@ import re import time import uuid +import warnings from dataclasses import dataclass from datetime import datetime from functools import wraps from hashlib import md5 +from pathlib import Path from typing import ( Any, Protocol, @@ -41,6 +43,7 @@ DEFAULT_SOURCE_IDS_LIMIT_METHOD, VALID_SOURCE_IDS_LIMIT_METHODS, SOURCE_IDS_LIMIT_METHOD_FIFO, + PARSED_DIR_NAME, ) # Precompile regex pattern for JSON sanitization (module-level, compiled once) @@ -137,6 +140,9 @@ async def safe_vdb_operation_with_exception( max_retries: int = 3, retry_delay: float = 0.2, logger_func: Optional[Callable] = None, + timeout_seconds: float | None = None, + log_start: bool = False, + success_log_threshold_seconds: float = 10.0, ) -> None: """ Safely execute vector database operations with retry mechanism and exception handling. 
@@ -151,6 +157,9 @@ async def safe_vdb_operation_with_exception( max_retries: Maximum number of retry attempts retry_delay: Delay between retries in seconds logger_func: Logger function to use for error messages + timeout_seconds: Optional timeout for a single operation attempt + log_start: Whether to emit start/success logs for each attempt + success_log_threshold_seconds: Log successful attempts when duration exceeds this threshold Raises: Exception: When operation fails after all retry attempts @@ -158,22 +167,82 @@ async def safe_vdb_operation_with_exception( log_func = logger_func or logger.warning for attempt in range(max_retries): + start_ts = time.perf_counter() + attempt_label = f"{attempt + 1}/{max_retries}" try: - await operation() + if log_start: + logger.info( + "VDB %s start for %s (attempt %s, timeout=%s)", + operation_name, + entity_name or "", + attempt_label, + f"{timeout_seconds:.1f}s" + if timeout_seconds is not None + else "none", + ) + + if timeout_seconds is not None and timeout_seconds > 0: + await asyncio.wait_for(operation(), timeout=timeout_seconds) + else: + await operation() + + elapsed = time.perf_counter() - start_ts + if log_start or elapsed >= success_log_threshold_seconds: + logger.info( + "VDB %s success for %s in %.2fs (attempt %s)", + operation_name, + entity_name or "", + elapsed, + attempt_label, + ) return # Success, return immediately + except asyncio.TimeoutError as e: + elapsed = time.perf_counter() - start_ts + timeout_msg = ( + f"VDB {operation_name} timeout for {entity_name or ''} " + f"after {elapsed:.2f}s (attempt {attempt_label}, timeout={timeout_seconds}s)" + ) + if attempt >= max_retries - 1: + log_func(timeout_msg) + raise TimeoutError(timeout_msg) from e + log_func(f"{timeout_msg}, retrying...") + if retry_delay > 0: + await asyncio.sleep(retry_delay) except Exception as e: + elapsed = time.perf_counter() - start_ts if attempt >= max_retries - 1: - error_msg = f"VDB {operation_name} failed for {entity_name} after 
{max_retries} attempts: {e}" + error_msg = ( + f"VDB {operation_name} failed for {entity_name or ''} " + f"after {max_retries} attempts in {elapsed:.2f}s: {e}" + ) log_func(error_msg) raise Exception(error_msg) from e else: log_func( - f"VDB {operation_name} attempt {attempt + 1} failed for {entity_name}: {e}, retrying..." + f"VDB {operation_name} attempt {attempt + 1} failed for " + f"{entity_name or ''} after {elapsed:.2f}s: {e}, retrying..." ) if retry_delay > 0: await asyncio.sleep(retry_delay) +def parse_optional_float(raw: str | None) -> float | None: + """Decode env strings (or any text) into ``float | None``. + + Empty string and the literal ``"None"`` (case-insensitive) collapse + to ``None`` so users can leave a knob un-set in ``.env`` and have + the consuming code fall back to its own default. Any other + non-numeric value raises :class:`ValueError` so misconfigured envs + fail loudly at parse time rather than silently downstream. + """ + if raw is None: + return None + stripped = raw.strip() + if not stripped or stripped.lower() == "none": + return None + return float(stripped) + + def get_env_value( env_key: str, default: any, value_type: type = str, special_none: bool = False ) -> any: @@ -579,6 +648,91 @@ def compute_args_hash(*args: Any) -> str: return md5(safe_bytes).hexdigest() +def _serialize_cache_variant(value: Any) -> str: + """Serialize cache-affecting options to a stable string for hash inputs.""" + if value is None: + return "" + + if hasattr(value, "model_dump") and callable(value.model_dump): + try: + value = value.model_dump(mode="json") + except TypeError: + value = value.model_dump() + + if hasattr(value, "model_json_schema") and callable(value.model_json_schema): + value = value.model_json_schema() + + try: + return json.dumps( + value, + ensure_ascii=False, + sort_keys=True, + separators=(",", ":"), + default=repr, + ) + except (TypeError, ValueError): + return repr(value) + + +def get_llm_cache_identity( + global_config: dict[str, 
Any] | None, + role: str, + model_func_override: Callable[..., Any] | None = None, +) -> dict[str, Any]: + """Get the non-secret LLM identity used to partition LLM cache keys. + + Includes ``role``, ``binding``, ``model``, and ``host``. Deliberately excludes + ``api_key`` and ``provider_options`` so cache keys remain non-secret and safe + to persist. + + When ``model_func_override`` is set (the deprecated ``QueryParam.model_func`` + path), the identity collapses to ``{role, override}`` and does not partition + by the underlying model — callers swapping overrides will hit shared cache + entries. Use ``LightRAG.aupdate_llm_role_config()`` for cache-correct swaps. + """ + if model_func_override is not None: + return { + "role": role, + "override": "query_param.model_func", + } + + config = global_config or {} + identities = config.get("llm_cache_identities") + if isinstance(identities, dict): + identity = identities.get(role) + if isinstance(identity, dict): + return dict(identity) + + return { + "role": role, + "binding": None, + "model": config.get("llm_model_name"), + "host": None, + } + + +def serialize_llm_cache_identity(identity: Any) -> str: + """Serialize an LLM cache identity for inclusion in hash inputs.""" + return _serialize_cache_variant(identity) + + +def _validate_cached_response_format(response_format: Any | None) -> None: + """Reject structured-output modes that the cache wrapper does not support.""" + if response_format is None: + return + + if ( + isinstance(response_format, dict) + and response_format.get("type") == "json_object" + ): + return + + raise ValueError( + "use_llm_func_with_cache only supports response_format={'type': 'json_object'}; " + "json_schema and typed response_format values must not be passed through the cache wrapper." + ) + + def compute_mdhash_id(content: str, prefix: str = "") -> str: """ Compute a unique ID for a given content string. 
@@ -588,6 +742,55 @@ def compute_mdhash_id(content: str, prefix: str = "") -> str: return prefix + compute_args_hash(content) +def get_unique_filename_in_parsed(target_dir: Path, original_name: str) -> str: + """Generate a unique filename in target_dir, adding numeric suffixes on conflict. + + Tries the original name first, then `{stem}_001{ext}` ... `{stem}_999{ext}`, + falling back to a timestamp-suffixed name if all numeric slots are taken. + """ + original_path = Path(original_name) + base_name = original_path.stem + extension = original_path.suffix + + if not (target_dir / original_name).exists(): + return original_name + + for i in range(1, 1000): + new_name = f"{base_name}_{i:03d}{extension}" + if not (target_dir / new_name).exists(): + return new_name + + return f"{base_name}_{int(time.time())}{extension}" + + +async def move_file_to_parsed_dir( + file_path: Path, + *, + skip_if_already_parsed: bool = False, +) -> Path | None: + """Move a processed source file into its sibling __parsed__ directory. + + Returns the new path on success, the input path if `skip_if_already_parsed` + is set and the file already lives in a `__parsed__` directory, or None if + the source no longer exists. + """ + if not file_path.exists() or not file_path.is_file(): + return None + if skip_if_already_parsed and file_path.parent.name == PARSED_DIR_NAME: + return file_path + + parsed_dir = file_path.parent / PARSED_DIR_NAME + await asyncio.to_thread(parsed_dir.mkdir, parents=True, exist_ok=True) + + unique_filename = get_unique_filename_in_parsed(parsed_dir, file_path.name) + target_path = parsed_dir / unique_filename + await asyncio.to_thread(file_path.rename, target_path) + logger.debug( + f"Moved file to parsed directory: {file_path.name} -> {unique_filename}" + ) + return target_path + + def make_relation_vdb_ids(src_entity: str, tgt_entity: str) -> list[str]: """Return candidate relation VDB IDs for an undirected edge. 
@@ -715,6 +918,7 @@ def final_decro(func): counter = 0 shutdown_event = asyncio.Event() initialized = False + accepting_new_tasks = True worker_health_check_task = None # Enhanced task state management @@ -722,6 +926,11 @@ def final_decro(func): task_states_lock = asyncio.Lock() active_futures = weakref.WeakSet() reinit_count = 0 + submitted_total = 0 + completed_total = 0 + failed_total = 0 + cancelled_total = 0 + rejected_total = 0 async def worker(): """Enhanced worker that processes tasks with proper timeout and state management""" @@ -942,31 +1151,83 @@ async def ensure_workers(): f"{queue_name}: {workers_needed} new workers initialized {timeout_str}" ) - async def shutdown(): - """Gracefully shut down all workers and cleanup resources""" + async def get_queue_stats(): + """Return a best-effort snapshot of queue and worker state.""" + async with task_states_lock: + running = sum( + 1 + for task_state in task_states.values() + if task_state.worker_started and not task_state.future.done() + ) + in_flight = len(task_states) + + active_workers = len([task for task in tasks if not task.done()]) + return { + "queue_name": queue_name, + "max_async": max_size, + "max_queue_size": max_queue_size, + "queued": queue.qsize(), + "running": running, + "in_flight": in_flight, + "worker_count": active_workers, + "initialized": initialized, + "submitted_total": submitted_total, + "completed_total": completed_total, + "failed_total": failed_total, + "cancelled_total": cancelled_total, + "rejected_total": rejected_total, + } + + async def shutdown(graceful: bool = True, timeout: float | None = None): + """Shut down workers and cleanup resources. + + Graceful mode stops new submissions and drains queued/running + work; if the drain exceeds ``timeout`` (defaulting to + ``max_task_duration`` or 30s), it falls through to forced + cancellation so shutdown never blocks indefinitely. 
+ """ + nonlocal accepting_new_tasks, initialized, worker_health_check_task logger.info(f"{queue_name}: Shutting down priority queue workers") - shutdown_event.set() + accepting_new_tasks = False - # Cancel all active futures - for future in list(active_futures): - if not future.done(): - future.cancel() + drain_timed_out = False + if graceful: + effective_timeout = timeout + if effective_timeout is None: + effective_timeout = ( + max_task_duration if max_task_duration is not None else 30.0 + ) + try: + await asyncio.wait_for(queue.join(), timeout=effective_timeout) + except asyncio.TimeoutError: + drain_timed_out = True + logger.warning( + f"{queue_name}: Graceful drain timed out after " + f"{effective_timeout}s; cancelling pending work" + ) - # Cancel all pending tasks - async with task_states_lock: - for task_id, task_state in list(task_states.items()): - if not task_state.future.done(): - task_state.future.cancel() - task_states.clear() + if not graceful or drain_timed_out: + # Cancel all active futures + for future in list(active_futures): + if not future.done(): + future.cancel() - # Wait for queue to empty with timeout - try: - await asyncio.wait_for(queue.join(), timeout=5.0) - except asyncio.TimeoutError: - logger.warning( - f"{queue_name}: Timeout waiting for queue to empty during shutdown" - ) + # Cancel all pending tasks + async with task_states_lock: + for task_id, task_state in list(task_states.items()): + if not task_state.future.done(): + task_state.future.cancel() + task_states.clear() + + while True: + try: + queue.get_nowait() + queue.task_done() + except asyncio.QueueEmpty: + break + + shutdown_event.set() # Cancel worker tasks for task in list(tasks): @@ -984,6 +1245,8 @@ async def shutdown(): await worker_health_check_task except asyncio.CancelledError: pass + worker_health_check_task = None + initialized = False logger.info(f"{queue_name}: Priority queue workers shutdown complete") @@ -1009,6 +1272,12 @@ async def wait_func( QueueFullError: 
If the queue is full and waiting times out Any exception raised by the decorated function """ + nonlocal submitted_total, completed_total, cancelled_total, failed_total + nonlocal rejected_total + if not accepting_new_tasks: + rejected_total += 1 + raise RuntimeError(f"{queue_name}: Queue is shutting down") + await ensure_workers() # Generate unique task ID @@ -1035,6 +1304,9 @@ async def wait_func( # Queue the task with timeout handling try: + if not accepting_new_tasks: + rejected_total += 1 + raise RuntimeError(f"{queue_name}: Queue is shutting down") if _queue_timeout is not None: await asyncio.wait_for( queue.put( @@ -1046,6 +1318,7 @@ async def wait_func( await queue.put( (_priority, current_count, task_id, args, kwargs) ) + submitted_total += 1 except asyncio.TimeoutError: raise QueueFullError( f"{queue_name}: Queue full, timeout after {_queue_timeout} seconds" @@ -1059,9 +1332,11 @@ async def wait_func( # Wait for result with timeout handling try: if _timeout is not None: - return await asyncio.wait_for(future, _timeout) + result = await asyncio.wait_for(future, _timeout) else: - return await future + result = await future + completed_total += 1 + return result except asyncio.TimeoutError: # This is user-level timeout (asyncio.wait_for caused) # Mark cancellation request @@ -1082,15 +1357,24 @@ async def wait_func( ): await asyncio.sleep(0.1) + cancelled_total += 1 raise TimeoutError( f"{queue_name}: User timeout after {_timeout} seconds" ) except WorkerTimeoutError as e: # This is Worker-level timeout, directly propagate exception information + failed_total += 1 raise TimeoutError(f"{queue_name}: {str(e)}") except HealthCheckTimeoutError as e: # This is Health Check-level timeout, directly propagate exception information + failed_total += 1 raise TimeoutError(f"{queue_name}: {str(e)}") + except asyncio.CancelledError: + cancelled_total += 1 + raise + except Exception: + failed_total += 1 + raise finally: # Ensure cleanup @@ -1100,6 +1384,7 @@ async def 
wait_func( # Add shutdown method to decorated function wait_func.shutdown = shutdown + wait_func.get_queue_stats = get_queue_stats return wait_func @@ -1462,6 +1747,164 @@ def truncate_list_by_token_size( return list_data +def normalize_string_list(raw_values: Any, context: str = "") -> list[str]: + """Return a list of non-empty strings from raw_values. + + Non-string elements are dropped and logged as warnings. If raw_values is + not a list, an empty list is returned. + """ + if not isinstance(raw_values, list): + return [] + result = [] + for i, value in enumerate(raw_values): + if isinstance(value, str) and value: + result.append(value) + else: + logger.warning( + "Non-string element dropped from list%s at index %d: %r", + f" ({context})" if context else "", + i, + value, + ) + return result + + +def split_text_units_for_hard_fallback(text: str) -> list[str]: + """Split text into sentence/paragraph-like units for fallback chunking.""" + if not text: + return [] + units: list[str] = [] + for para in text.split("\n\n"): + p = para.strip() + if not p: + continue + for sentence in re.split(r"(?<=[。!?;.!?])", p): + s = sentence.strip() + if s: + units.append(s) + return units if units else [text] + + +def split_text_by_token_limit( + text: str, tokenizer: Tokenizer, max_tokens: int +) -> list[str]: + """Split text by token limit with sentence-first, token-window fallback.""" + if not text: + return [] + + try: + total_tokens = len(tokenizer.encode(text)) + except Exception: + total_tokens = 0 + + if total_tokens > 0 and total_tokens <= max_tokens: + return [text] + + units = split_text_units_for_hard_fallback(text) + out: list[str] = [] + cur_parts: list[str] = [] + cur_tokens = 0 + + for unit in units: + try: + unit_tokens = len(tokenizer.encode(unit)) + except Exception: + unit_tokens = 0 + + # Sentence itself is oversize: token-window split directly. 
+ if unit_tokens > max_tokens: + if cur_parts: + out.append("\n\n".join(cur_parts)) + cur_parts = [] + cur_tokens = 0 + + token_ids = tokenizer.encode(unit) + for start in range(0, len(token_ids), max_tokens): + piece = tokenizer.decode(token_ids[start : start + max_tokens]).strip() + if piece: + out.append(piece) + continue + + if cur_parts and cur_tokens + unit_tokens > max_tokens: + out.append("\n\n".join(cur_parts)) + cur_parts = [unit] + cur_tokens = unit_tokens + else: + cur_parts.append(unit) + cur_tokens += unit_tokens + + if cur_parts: + out.append("\n\n".join(cur_parts)) + + return [x for x in out if x.strip()] + + +def enforce_chunk_token_limit_before_embedding( + chunking_result: list[dict[str, Any]] | tuple[dict[str, Any], ...], + tokenizer: Tokenizer, + max_tokens: int, +) -> list[dict[str, Any]]: + """Hard fallback split before embedding while preserving heading hierarchy.""" + if max_tokens <= 0: + return list(chunking_result) + + normalized: list[dict[str, Any]] = [] + + for dp in chunking_result: + if not isinstance(dp, dict): + continue + + content = dp.get("content", "") + if not isinstance(content, str) or not content.strip(): + continue + + try: + token_count = len(tokenizer.encode(content)) + except Exception: + token_count = ( + dp.get("tokens", 0) if isinstance(dp.get("tokens"), int) else 0 + ) + + if token_count <= max_tokens: + ndp = dict(dp) + ndp["tokens"] = token_count if token_count > 0 else ndp.get("tokens", 0) + normalized.append(ndp) + continue + + pieces = split_text_by_token_limit(content, tokenizer, max_tokens) + if not pieces: + ndp = dict(dp) + ndp["tokens"] = token_count + normalized.append(ndp) + continue + + base_chunk_id = dp.get("chunk_id") + total_parts = len(pieces) + for i, piece in enumerate(pieces, 1): + new_dp = dict(dp) + new_dp["content"] = piece + try: + new_dp["tokens"] = len(tokenizer.encode(piece)) + except Exception: + new_dp["tokens"] = max(1, int(len(piece) * 0.5)) + + # Shallow-copy preserves the nested 
heading dict and sidecar + # block from the source chunk; only the payload (content/tokens + # /chunk_id) is rewritten per split slice. + if isinstance(base_chunk_id, str) and base_chunk_id.strip(): + new_dp["chunk_id"] = f"{base_chunk_id}-s{i:02d}" + + new_dp["split_type"] = "hard_fallback" + new_dp["split_part"] = i + new_dp["split_total"] = total_parts + normalized.append(new_dp) + + # Rebuild order index to keep continuity after splitting. + for idx, item in enumerate(normalized): + item["chunk_order_index"] = idx + return normalized + + def cosine_similarity(v1, v2): """Calculate cosine similarity between two vectors""" dot_product = np.dot(v1, v2) @@ -2062,6 +2505,9 @@ async def use_llm_func_with_cache( cache_type: str = "extract", chunk_id: str | None = None, cache_keys_collector: list = None, + response_format: Any | None = None, + entity_extraction: bool = False, + llm_cache_identity: Any | None = None, ) -> tuple[str, int]: """Call LLM function with cache support and text sanitization @@ -2080,12 +2526,33 @@ async def use_llm_func_with_cache( chunk_id: Chunk identifier to store in cache text_chunks_storage: Text chunks storage to update llm_cache_list cache_keys_collector: Optional list to collect cache keys for batch processing + response_format: Structured output control forwarded to the LLM provider. + Providers translate this to their native structured-output surface + (OpenAI response_format, Ollama format, Gemini response_mime_type/schema). + ``{"type": "json_object"}`` requests JSON output; typed/schema payloads + trigger schema-constrained output where supported; ``None`` leaves + output unconstrained. Providers that do not support structured output + safely strip this argument. + entity_extraction: Deprecated. When True and ``response_format`` is not + provided, maps to ``{"type": "json_object"}``. Prefer passing + ``response_format`` directly. 
+ llm_cache_identity: Non-secret model/provider identity used to partition + cache entries across role model, binding, or host changes. Returns: tuple[str, int]: (LLM response text, timestamp) - For cache hits: (content, cache_create_time) - For cache misses: (content, current_timestamp) """ + if entity_extraction and response_format is None: + warnings.warn( + "use_llm_func_with_cache(entity_extraction=True) is deprecated; " + "pass response_format={'type': 'json_object'} instead.", + DeprecationWarning, + stacklevel=2, + ) + response_format = {"type": "json_object"} + _validate_cached_response_format(response_format) # Sanitize input text to prevent UTF-8 encoding errors for all LLM providers safe_user_prompt = sanitize_text_for_encoding(user_prompt) safe_system_prompt = ( @@ -2115,7 +2582,15 @@ async def use_llm_func_with_cache( prompt_parts.append(history) _prompt = "\n".join(prompt_parts) - arg_hash = compute_args_hash(_prompt) + response_format_key = _serialize_cache_variant(response_format) + llm_identity_key = serialize_llm_cache_identity(llm_cache_identity) + arg_hash = compute_args_hash( + _prompt, + "\n\n", + response_format_key, + "\n\n", + llm_identity_key, + ) # Generate cache key for this LLM call cache_key = generate_cache_key("default", cache_type, arg_hash) @@ -2144,6 +2619,8 @@ async def use_llm_func_with_cache( kwargs["history_messages"] = safe_history_messages if max_tokens is not None: kwargs["max_tokens"] = max_tokens + if response_format is not None: + kwargs["response_format"] = response_format res: str = await use_llm_func( safe_user_prompt, system_prompt=safe_system_prompt, **kwargs @@ -2178,6 +2655,8 @@ async def use_llm_func_with_cache( kwargs["history_messages"] = safe_history_messages if max_tokens is not None: kwargs["max_tokens"] = max_tokens + if response_format is not None: + kwargs["response_format"] = response_format try: res = await use_llm_func( diff --git a/lightrag/utils_pipeline.py b/lightrag/utils_pipeline.py new file mode 
100644 index 0000000000..61af134af6 --- /dev/null +++ b/lightrag/utils_pipeline.py @@ -0,0 +1,681 @@ +"""Pipeline-specific helpers for document status, identity, and content. + +These helpers are shared by the LightRAG pipeline mixin (lightrag/pipeline.py) +and by other LightRAG methods that touch the document ingestion paths +(custom-chunks ingest, deletion, etc.). They are kept out of utils.py because +they are tied to the doc_status / full_docs domain rather than to general +text/token utilities. +""" + +from __future__ import annotations + +import hashlib +import json +import os +import re +import time +from pathlib import Path +from typing import Any, cast +from urllib.parse import quote, unquote, urlsplit + +from lightrag.base import DocProcessingStatus, DocStatus, DocStatusStorage +from lightrag.constants import ( + FULL_DOCS_FORMAT_LIGHTRAG, + LIGHTRAG_DOC_CONTENT_PREFIX, + PARSED_DIR_NAME, +) +from lightrag.parser_routing import canonicalize_parser_hinted_basename +from lightrag.utils import ( + compute_mdhash_id, + logger, + move_file_to_parsed_dir, +) + + +PLACEHOLDER_DOCUMENT_SOURCES = {"", "no-file-path", "unknown_source"} +SIDECAR_LOCATION_UNKNOWN = "unknown_source" + + +def build_chunks_dict_from_chunking_result( + chunking_result: list[dict[str, Any]], + *, + doc_id: str, + file_path: str, +) -> dict[str, dict[str, Any]]: + """Assemble the per-doc chunks dict written into chunks_vdb / text_chunks. + + Resolves a stable ``chunk_key`` for each entry — preferring an explicit + ``chunk_id`` (auto-prefixed with ``doc_id-`` if not already), falling back + to a positional ``chunk-NNN`` derived from ``chunk_order_index``, and + finally hashing on collision so two entries inside one document never + overwrite each other. 
+ """ + chunks: dict[str, dict[str, Any]] = {} + for dp in chunking_result: + chunk_content = dp.get("content", "") + if not chunk_content: + continue + raw_chunk_id = dp.get("chunk_id", "") + order = dp.get("chunk_order_index") + if isinstance(raw_chunk_id, str) and raw_chunk_id.strip(): + chunk_key = ( + raw_chunk_id + if raw_chunk_id.startswith(f"{doc_id}-") + else f"{doc_id}-{raw_chunk_id}" + ) + elif isinstance(order, int): + chunk_key = f"{doc_id}-chunk-{order:03d}" + else: + chunk_key = compute_mdhash_id(f"{doc_id}:{chunk_content}", prefix="chunk-") + + # Hard collision guard (same chunk_id inside one document). + if chunk_key in chunks: + chunk_key = compute_mdhash_id( + f"{doc_id}:{order}:{chunk_content}", + prefix="chunk-", + ) + # Preserve any pre-populated cache ids on dp (multimodal chunks + # arrive with analysis cache ids already attached so document + # deletion can find them via the per-chunk llm_cache_list). + existing_cache_list = dp.get("llm_cache_list") + seed_cache_list: list[str] = [] + if isinstance(existing_cache_list, list): + seen: set[str] = set() + for entry in existing_cache_list: + key = str(entry or "").strip() + if key and key not in seen: + seen.add(key) + seed_cache_list.append(key) + chunks[chunk_key] = { + **dp, + "full_doc_id": doc_id, + "file_path": file_path, + "llm_cache_list": seed_cache_list, + } + return chunks + + +def chunk_fields_from_status_doc( + status_doc: DocProcessingStatus, +) -> tuple[list[str], int]: + """Return (chunks_list, chunks_count) preserved from a status document. + + Filters out any non-string or empty chunk IDs. When chunks_count is + absent or invalid, it is inferred from the length of chunks_list. 
+ """ + chunks_list: list[str] = [] + if isinstance(status_doc.chunks_list, list): + chunks_list = [ + chunk_id + for chunk_id in status_doc.chunks_list + if isinstance(chunk_id, str) and chunk_id + ] + + if isinstance(status_doc.chunks_count, int) and status_doc.chunks_count >= 0: + return chunks_list, status_doc.chunks_count + + return chunks_list, len(chunks_list) + + +def resolve_doc_file_path( + status_doc: DocProcessingStatus | None = None, + content_data: dict[str, Any] | None = None, +) -> str: + """Resolve the best available document file path. + + Returns the first non-placeholder ``file_path`` from doc_status, then + full_docs. Both are already canonicalized at write time, so this only + has to skip placeholder sentinels. + """ + for source in ( + getattr(status_doc, "file_path", None), + content_data.get("file_path") if content_data else None, + ): + if not isinstance(source, str): + continue + candidate = source.strip() + if candidate and candidate not in PLACEHOLDER_DOCUMENT_SOURCES: + return candidate + return "unknown_source" + + +def normalize_document_file_path(file_path: Any) -> str: + """Return the canonical basename stored as ``file_path``. + + Strips any supported ``[hint]`` segment so ``abc.docx`` and + ``abc.[native-iet].docx`` map to the same key. Collapses placeholders to + ``"unknown_source"``. Idempotent. + """ + source = str(file_path or "").strip() + if source in PLACEHOLDER_DOCUMENT_SOURCES: + return "unknown_source" + canonical = canonicalize_parser_hinted_basename(source).strip() + if canonical in PLACEHOLDER_DOCUMENT_SOURCES: + return "unknown_source" + return canonical or "unknown_source" + + +# Back-compat alias retained until call sites that import the old name are +# all switched over (the public surface is ``normalize_document_file_path``). 
+document_canonical_key = normalize_document_file_path + + +def has_known_document_source(source_key: str) -> bool: + return source_key not in PLACEHOLDER_DOCUMENT_SOURCES + + +def doc_status_field(doc: Any, field: str, default: Any = "") -> Any: + if isinstance(doc, dict): + return doc.get(field, default) + return getattr(doc, field, default) + + +# Long-lived per-document metadata fields that must survive every +# doc_status state transition. ``process_options`` records the user's +# per-file processing strategy at enqueue time and is read by analyze / +# chunk / KG-skip stages and by admin/list APIs throughout the document's +# lifetime, so we cannot let an intermediate transition (PARSING / +# ANALYZING / PROCESSING / PROCESSED / FAILED upsert) clobber it. +# ``parse_warnings`` records non-fatal parser warnings (e.g. legacy docx +# tables missing ``w14:paraId``) that admins should be able to surface +# alongside the document record after PROCESSED. +# ``chunk_opts`` is written when entering PROCESSING (via ``extraction_meta``) +# and records the actual chunker params used for that document in the same +# format as the ``Chunking : ...`` log line (params portion only). +# Carrying it forward keeps the value visible after PROCESSING -> FAILED, +# whose ``metadata_extra`` only carries timing fields. +# ``parsing_start_time`` / ``analyzing_start_time`` are Unix epoch seconds +# stamped at the entry of ``_parse_worker`` / ``_analyze_worker`` (mirrors +# the existing ``processing_start_time`` set when entering PROCESSING) so +# per-stage durations can be derived from doc_status post-mortem. +# ``parse_stage_skipped`` is written by ``parse_mineru`` / ``parse_docling`` +# when the raw bundle cache is valid and the parse stage round trip is +# skipped; absence == not skipped (e.g. native parser, or cache miss). +# ``source_file_name`` records the original pending-parse source basename used +# by parser workers; it is intentionally separate from canonical ``file_path``. 
+# +# The order of this tuple is the rendering order of metadata fields in +# the WebUI ``DocumentStatusDetailsDialog`` (carry-over builds the new +# metadata dict by iterating this tuple, and dict / JSON / JSX preserve +# insertion order all the way to the rendered output). Keep fields +# grouped by stage: parse-stage fields together, analyze-stage fields +# together, etc., so the dialog reads top-to-bottom along the pipeline. +_DOC_STATUS_METADATA_CARRY_OVER_KEYS: tuple[str, ...] = ( + "process_options", + "source_file_name", + "parse_warnings", + "chunk_opts", + "parsing_start_time", + "parse_stage_skipped", + "analyzing_start_time", +) + + +def doc_status_metadata_carry_over(status_doc: Any) -> dict[str, Any]: + """Return the subset of ``status_doc.metadata`` to preserve across upserts. + + ``doc_status`` storage backends generally treat the ``metadata`` field + as an opaque blob and **replace** it on every upsert, so callers must + explicitly carry forward fields they want to keep. This helper centralises + the list of fields we always carry: the keys listed in + ``_DOC_STATUS_METADATA_CARRY_OVER_KEYS``. Keys that are missing, ``None``, + or an empty string are skipped. New long-lived metadata can be added by + extending that tuple. + """ + if status_doc is None: + return {} + raw_metadata = doc_status_field(status_doc, "metadata", {}) + if not isinstance(raw_metadata, dict): + return {} + carry: dict[str, Any] = {} + for key in _DOC_STATUS_METADATA_CARRY_OVER_KEYS: + if key in raw_metadata and raw_metadata[key] not in (None, ""): + carry[key] = raw_metadata[key] + return carry + + +def doc_status_transition_metadata( + status_doc: Any, + *, + extra: dict[str, Any] | None = None, +) -> dict[str, Any]: + """Build a doc_status ``metadata`` payload that preserves carry-over fields. + + Use at every state-transition upsert site so the user's + ``process_options`` (and the other carry-over metadata fields) survive + PENDING → PARSING → ANALYZING → PROCESSING → PROCESSED / FAILED. 
+ """ + payload = doc_status_metadata_carry_over(status_doc) + if extra: + payload.update(extra) + return payload + + +def doc_status_value(doc: Any) -> str: + status = doc_status_field(doc, "status", "") + if isinstance(status, DocStatus): + return status.value + return str(status or "") + + +# Sidecar item ids embed ``doc_hash`` (= doc_id without the ``doc-`` prefix), +# and for pending_parse uploads doc_id derives from the filename — so the +# same content under two filenames renders with different ids in +# ``merged_text``. Strip those surfaces before hashing so cross-filename +# content_hash dedup actually fires. +_SIDECAR_ID_PATTERN = re.compile(r"\b(tb|im|eq)-[0-9a-f]{32}-(\d{4})\b") +_ASSET_PATH_PATTERN = re.compile(r'(?<=path=")[^"]*\.blocks\.assets/') + + +def normalize_merged_text_for_hash(content: str) -> str: + """Strip filename-derived prefixes from sidecar ids and asset paths. + + Idempotent and safe on plain text (matches the doc_hash literal only — + 32 lowercase hex digits between the modality prefix and a 4-digit + sequence). RAW text bodies without sidecar markup pass through + unchanged. + """ + if not content: + return content + content = _SIDECAR_ID_PATTERN.sub(r"\1--\2", content) + content = _ASSET_PATH_PATTERN.sub("/", content) + return content + + +def compute_text_content_hash(content: str) -> str: + """MD5 hex digest of text content used for cross-filename dedup. + + Input is normalized via :func:`normalize_merged_text_for_hash` first so + sidecar-rendered bodies dedupe across filenames despite carrying + filename-derived item ids and asset paths. + """ + return compute_mdhash_id(normalize_merged_text_for_hash(content), prefix="") + + +def compute_file_content_hash(path_str: str) -> str | None: + """Stream-compute MD5 of a file's bytes; returns None if unreadable. 
+ + Resolves the LightRAG ``*.blocks.jsonl`` conventions used by + ``_load_lightrag_document_content`` so the hash matches the actual + document body regardless of whether ``path_str`` points at the blocks + file directly or its parent directory/base name. + """ + if not path_str: + return None + try: + path = Path(path_str) + if path.is_dir(): + candidates = sorted(path.glob("*.blocks.jsonl")) + if not candidates: + return None + path = candidates[0] + elif not (path.exists() and path.is_file()): + blocks_path = Path(path_str + ".blocks.jsonl") + if blocks_path.exists() and blocks_path.is_file(): + path = blocks_path + else: + return None + h = hashlib.md5() + with path.open("rb") as f: + for chunk in iter(lambda: f.read(65536), b""): + h.update(chunk) + return h.hexdigest() + except Exception as e: + logger.warning(f"Failed to compute file content hash for {path_str}: {e}") + return None + + +def configured_input_dir() -> Path: + input_dir = os.getenv("INPUT_DIR", "").strip() + return Path(input_dir) if input_dir else Path.cwd() / "inputs" + + +async def get_existing_doc_by_file_basename( + doc_status: DocStatusStorage, file_path: Any +) -> tuple[str, Any] | None: + """Find an existing doc_status record by canonical file basename. + + Inputs are normalized via :func:`normalize_document_file_path` so callers + may pass either the bare canonical name (``abc.docx``) or a hint-bearing + variant (``abc.[native-iet].docx``); both resolve to the same logical + document. 
+ """ + basename = normalize_document_file_path(file_path) + if basename == "unknown_source": + return None + return await doc_status.get_doc_by_file_basename(basename) + + +async def get_existing_doc_by_content_hash( + doc_status: DocStatusStorage, content_hash: str +) -> tuple[str, Any] | None: + """Find an existing doc_status record by content hash.""" + if not content_hash: + return None + return await doc_status.get_doc_by_content_hash(content_hash) + + +async def get_duplicate_doc_by_content_hash( + doc_status: DocStatusStorage, content_hash: str, current_doc_id: str +) -> tuple[str, Any] | None: + """Find another doc_status record with the same content hash.""" + if not content_hash: + return None + + match = await doc_status.get_doc_by_content_hash(content_hash) + if match and match[0] != current_doc_id: + return match + + try: + docs = await doc_status.get_docs_by_statuses(list(DocStatus)) + except Exception: + return None + for doc_id, doc in docs.items(): + if doc_id == current_doc_id: + continue + if doc_status_field(doc, "content_hash", "") == content_hash: + return doc_id, doc + return None + + +def make_lightrag_doc_content(merged_text: str) -> str: + """Build the ``full_docs.content`` value for ``format=lightrag`` records. + + The result has shape ``"{{LRdoc}}<merged_text>"``: the marker prefix + distinguishes lightrag-format full_docs from raw-format ones, and the + body is the complete merged text from the ``.blocks.jsonl`` content + lines so F-chunking can run identically on raw and lightrag inputs + (the prefix is stripped at chunking time via + ``strip_lightrag_doc_prefix``). + """ + return f"{LIGHTRAG_DOC_CONTENT_PREFIX}{merged_text or ''}" + + +def strip_lightrag_doc_prefix(content: str | None, parse_format: str | None) -> str: + """Return the bare body for a stored ``full_docs.content`` value. + + The ``{{LRdoc}}`` marker is stripped **only** when ``parse_format`` + indicates the record is in lightrag format. 
Any other ``parse_format`` + (``raw``, ``pending_parse``, ``None`` ...) returns the content + unchanged so a raw document whose literal body happens to start with + ``{{LRdoc}}`` is never silently truncated. + + Centralizing the format check here turns "must check format before + stripping" from a caller-side discipline into a structural property of + the function: any future call site that forgets to gate is protected + automatically. + """ + if ( + parse_format == FULL_DOCS_FORMAT_LIGHTRAG + and isinstance(content, str) + and content.startswith(LIGHTRAG_DOC_CONTENT_PREFIX) + ): + return content[len(LIGHTRAG_DOC_CONTENT_PREFIX) :] + return content or "" + + +# --------------------------------------------------------------------------- +# Document path / artifact helpers (moved from _PipelineMixin) +# --------------------------------------------------------------------------- + + +def input_dir_path() -> Path: + return configured_input_dir() + + +def parsed_dir() -> Path: + """Return the project-wide parsed-artifact root: ``<input_dir>/__parsed__``.""" + return input_dir_path() / PARSED_DIR_NAME + + +def parsed_artifact_dir_for( + file_path: str, *, parent_hint: Path | str | None = None +) -> Path: + """Return the per-document sidecar directory for ``file_path``. + + ``file_path`` must already be canonical (run ``normalize_document_file_path`` + first if unsure). When ``parent_hint`` is supplied (e.g. the live source + file's parent), the sidecar is placed next to it under ``__parsed__/`` + rather than under the global ``input_dir``; this keeps test isolation + intact when the source lives outside ``INPUT_DIR``. On collision with an + existing non-directory entry, the helper appends ``_001``..``_999`` and + finally a unix timestamp suffix. + """ + if parent_hint is not None: + hint = Path(parent_hint) + # ``hint`` may already point at a ``__parsed__/`` dir (e.g. when the + # caller re-archived a source); reuse it in place rather than nesting. 
+ root = hint if hint.name == PARSED_DIR_NAME else hint / PARSED_DIR_NAME + else: + root = parsed_dir() + source_name = ( + canonicalize_parser_hinted_basename(file_path or "document") or "document" + ) + artifact_name = f"{source_name}.parsed" + artifact_dir = root / artifact_name + if not artifact_dir.exists() or artifact_dir.is_dir(): + return artifact_dir + + for i in range(1, 1000): + candidate = root / f"{artifact_name}_{i:03d}" + if not candidate.exists() or candidate.is_dir(): + return candidate + + return root / f"{artifact_name}_{int(time.time())}" + + +# --------------------------------------------------------------------------- +# Sidecar URI helpers (``full_docs.sidecar_location``) +# --------------------------------------------------------------------------- +# +# Sidecar URI scheme conventions: +# - Local: ``file:///abs/path/to/abc.parsed/`` (trailing slash required) +# - Remote: ``s3://bucket/workspace/abc.parsed/`` (future; resolver returns +# None today so local readers gracefully skip) +# - Unknown sentinel: literal string ``"unknown_source"`` + + +def sidecar_uri_for(parsed_artifact_dir: Path | str) -> str: + """Build the canonical sidecar URI for a local artifact directory. + + The result always ends with ``/`` so a reader can distinguish a directory + from a file at the URI level. Non-ASCII characters are percent-encoded. + """ + p = Path(parsed_artifact_dir).resolve() + encoded = quote(str(p), safe="/") + return f"file://{encoded}/" + + +def resolve_sidecar_uri(uri: str | None) -> Path | None: + """Decode a sidecar URI into a local filesystem Path. + + Returns None for the unknown sentinel, empty input, or any non-``file://`` + scheme (remote schemes will get their own resolvers). 
+ """ + if not uri or uri == SIDECAR_LOCATION_UNKNOWN: + return None + parts = urlsplit(uri) + if parts.scheme != "file": + return None + path_str = unquote(parts.path) + if path_str.endswith("/") and len(path_str) > 1: + path_str = path_str[:-1] + return Path(path_str) + + +def sidecar_blocks_path(uri: str | None) -> str | None: + """Locate the first ``*.blocks.jsonl`` file inside a sidecar URI. + + Returns the absolute path as a string, or None when the URI cannot be + resolved locally or the directory holds no blocks file. + """ + d = resolve_sidecar_uri(uri) + if d is None or not d.is_dir(): + return None + candidates = sorted(d.glob("*.blocks.jsonl")) + return str(candidates[0]) if candidates else None + + +def sidecar_modality_path(uri: str | None, modality: str) -> str | None: + """Return the path for a sidecar modality JSON (drawings/tables/equations). + + Does not require the file to exist — callers check. Returns None when the + sidecar URI cannot be resolved or has no blocks file to anchor the name. + """ + blocks = sidecar_blocks_path(uri) + if not blocks: + return None + return f"{blocks[: -len('.blocks.jsonl')]}.{modality}.json" + + +def sidecar_assets_dir_for_uri(uri: str | None) -> Path | None: + """Return the ``*.blocks.assets/`` directory Path for a sidecar URI. + + The directory may not exist; callers create it on first asset write. 
+ """ + blocks = sidecar_blocks_path(uri) + if not blocks: + return None + return Path(f"{blocks[: -len('.blocks.jsonl')]}.blocks.assets") + + +# --------------------------------------------------------------------------- +# Source archive helpers +# --------------------------------------------------------------------------- + + +async def archive_docx_source_after_full_docs_sync(source_path: str) -> str | None: + source = Path(source_path) + try: + target = await move_file_to_parsed_dir(source, skip_if_already_parsed=True) + except Exception as e: + logger.warning( + f"[parse] Source archive skipped after full_docs sync: {source_path}: {e}" + ) + return None + if target is None: + return None + if target != source: + logger.debug( + f"[parse] Archived DOCX source after full_docs sync: {source} -> {target}" + ) + return str(target) + + +async def archive_source_after_full_docs_sync(source_path: str) -> str | None: + return await archive_docx_source_after_full_docs_sync(source_path) + + +# --------------------------------------------------------------------------- +# LightRAG Document blocks loader +# --------------------------------------------------------------------------- + + +async def load_lightrag_document_content(sidecar_uri: str) -> tuple[str, str]: + """Load LightRAG Document blocks and return ``(merged_text, blocks_path)``. + + ``sidecar_uri`` is a sidecar location URI (see ``sidecar_uri_for``); this + locates the ``*.blocks.jsonl`` file inside it, reads the content lines + (skipping the meta header at index 0 and any non-content entries), and + returns the merged body plus the absolute blocks path. 
+ """ + resolved = sidecar_blocks_path(sidecar_uri) + if resolved is None: + raise FileNotFoundError( + f"LightRAG blocks file not found from sidecar uri: {sidecar_uri}" + ) + blocks_path = Path(resolved) + + merged_parts: list[str] = [] + with blocks_path.open("r", encoding="utf-8") as f: + for i, line in enumerate(f): + text = line.strip() + if not text: + continue + obj = json.loads(text) + if i == 0: + continue + if obj.get("type") != "content": + continue + content = obj.get("content", "") + if isinstance(content, str) and content.strip(): + merged_parts.append(content) + + return "\n\n".join(merged_parts), str(blocks_path) + + +# --------------------------------------------------------------------------- +# Payload introspection helpers (parser response normalization) +# --------------------------------------------------------------------------- + + +def get_by_path(payload: Any, path: str) -> Any: + if not path: + return None + cur = payload + for part in path.split("."): + if isinstance(cur, dict) and part in cur: + cur = cur[part] + else: + return None + return cur + + +def extract_content_list_from_payload( + payload: Any, +) -> list[dict[str, Any]] | None: + """Try to find a MinerU/Docling-like content list from arbitrary JSON payload.""" + if isinstance(payload, list): + if payload and all(isinstance(x, dict) for x in payload): + first = payload[0] + if "type" in first or "label" in first or "text" in first: + return cast(list[dict[str, Any]], payload) + return None + if not isinstance(payload, dict): + return None + + # Common direct keys first + for key in ("content_list", "content", "items", "result"): + value = payload.get(key) + if isinstance(value, list): + extracted = extract_content_list_from_payload(value) + if extracted is not None: + return extracted + elif isinstance(value, dict): + extracted = extract_content_list_from_payload(value) + if extracted is not None: + return extracted + + # Deep search as fallback + for value in 
payload.values(): + extracted = extract_content_list_from_payload(value) + if extracted is not None: + return extracted + return None + + +def normalize_parser_result_to_content_list( + parser_result: str | list[dict[str, Any]] | dict[str, Any] | None, +) -> list[dict[str, Any]] | None: + """Normalize parser result to structured content list if possible.""" + if parser_result is None: + return None + if isinstance(parser_result, list): + return extract_content_list_from_payload(parser_result) + if isinstance(parser_result, dict): + return extract_content_list_from_payload(parser_result) + text = str(parser_result).strip() + if not text: + return None + try: + payload = json.loads(text) + return extract_content_list_from_payload(payload) + except Exception: + return None + + +# Multimodal entity injection used to live here as a centralized post-pass +# over all chunk_results. It has been moved into +# :func:`lightrag.operate.extract_entities._process_single_content` so each +# multimodal chunk injects its own entity/relation records while still under +# its concurrency slot. The chunk's ``sidecar.type`` (drawing/table/equation) +# is the dispatch key; see operate.py for the new logic. 
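Reviewer sketch for the cross-filename dedup path in `lightrag/utils_pipeline.py` above: a minimal, standalone illustration of why `normalize_merged_text_for_hash` lets two uploads of the same content hash identically. The two regexes are copied from the diff. Everything else is an assumption made for illustration: `text_hash` uses `hashlib.md5` directly as a stand-in for `compute_mdhash_id(..., prefix="")`, and `sample_body` builds hypothetical sidecar markup, not the real rendering format.

```python
import hashlib
import re

# Patterns copied from the diff above.
_SIDECAR_ID_PATTERN = re.compile(r"\b(tb|im|eq)-[0-9a-f]{32}-(\d{4})\b")
_ASSET_PATH_PATTERN = re.compile(r'(?<=path=")[^"]*\.blocks\.assets/')


def normalize_merged_text_for_hash(content: str) -> str:
    """Drop the filename-derived doc hash and asset directory prefix."""
    if not content:
        return content
    content = _SIDECAR_ID_PATTERN.sub(r"\1--\2", content)
    return _ASSET_PATH_PATTERN.sub("/", content)


def text_hash(content: str) -> str:
    # Assumption: plain MD5 over the normalized text stands in for
    # compute_mdhash_id(normalized, prefix="") from lightrag.utils.
    normalized = normalize_merged_text_for_hash(content)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def sample_body(doc_hash: str, stem: str) -> str:
    # Hypothetical sidecar markup: only the id and path shapes matter here.
    return (
        "Quarterly results\n\n"
        f'<table id="tb-{doc_hash}-0001" path="{stem}.blocks.assets/t1.png">'
    )


# Same content uploaded under two filenames, hence two different doc hashes.
body_a = sample_body("a" * 32, "report_a.docx")
body_b = sample_body("b" * 32, "report_b.docx")

print(body_a != body_b)                        # raw bodies differ
print(text_hash(body_a) == text_hash(body_b))  # normalized hashes match
```

The normalized id keeps the modality prefix and the sequence number (`tb--0001`), so two documents that differ in table order or count still hash differently; only the filename-derived parts are erased.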
diff --git a/lightrag_webui/src/api/lightrag.ts b/lightrag_webui/src/api/lightrag.ts index 27187e45dc..9cdc7e8a6a 100644 --- a/lightrag_webui/src/api/lightrag.ts +++ b/lightrag_webui/src/api/lightrag.ts @@ -25,6 +25,33 @@ export type LightragGraphType = { edges: LightragEdgeType[] } +export type LightragQueueStatus = { + available: boolean + queue_name?: string + max_async?: number + max_queue_size?: number + queued?: number + running?: number + in_flight?: number + worker_count?: number + initialized?: boolean + submitted_total?: number + completed_total?: number + failed_total?: number + cancelled_total?: number + rejected_total?: number +} + +export type LightragRoleLLMConfig = { + binding?: string | null + model?: string | null + host?: string | null + max_async?: number + timeout?: number + has_model_kwargs?: boolean + metadata?: Record +} + export type LightragStatus = { status: 'healthy' working_directory: string @@ -41,26 +68,44 @@ export type LightragStatus = { graph_storage: string vector_storage: string workspace?: string + storage_workspaces?: { + kv_storage?: string | null + doc_status_storage?: string | null + graph_storage?: string | null + vector_storage?: string | null + } max_graph_nodes?: string enable_rerank?: boolean rerank_binding?: string | null rerank_model?: string | null rerank_binding_host?: string | null + rerank_max_async?: number + rerank_timeout?: number summary_language: string force_llm_summary_on_merge: boolean max_parallel_insert: number max_async: number + llm_timeout?: number embedding_func_max_async: number embedding_batch_num: number + embedding_timeout?: number cosine_threshold: number min_rerank_score: number related_chunk_number: number + role_llm_config?: Record } update_status?: Record core_version?: string api_version?: string auth_mode?: 'enabled' | 'disabled' pipeline_busy: boolean + pipeline_active?: boolean + pipeline_scanning?: boolean + pipeline_destructive_busy?: boolean + pipeline_pending_enqueues?: number + 
llm_queue_status?: Record + embedding_queue_status?: LightragQueueStatus + rerank_queue_status?: LightragQueueStatus keyed_locks?: { process_id: number cleanup_performed: { @@ -160,13 +205,13 @@ export type EntityUpdateResponse = { } export type DocActionResponse = { - status: 'success' | 'partial_success' | 'failure' | 'duplicated' + status: 'success' | 'partial_success' | 'failure' message: string track_id?: string } export type ScanResponse = { - status: 'scanning_started' + status: 'scanning_started' | 'scanning_skipped_pipeline_busy' message: string track_id: string } @@ -183,7 +228,14 @@ export type DeleteDocResponse = { doc_id: string } -export type DocStatus = 'pending' | 'processing' | 'preprocessed' | 'processed' | 'failed' +export type DocStatus = + | 'pending' + | 'parsing' + | 'analyzing' + | 'processing' + | 'preprocessed' + | 'processed' + | 'failed' export type DocStatusResponse = { id: string @@ -200,7 +252,7 @@ export type DocStatusResponse = { } export type DocsStatusesResponse = { - statuses: Record + statuses: Partial> } export type TrackStatusResponse = { @@ -212,6 +264,7 @@ export type TrackStatusResponse = { export type DocumentsRequest = { status_filter?: DocStatus | null + status_filters?: DocStatus[] | null page: number page_size: number sort_field: 'created_at' | 'updated_at' | 'id' | 'file_path' @@ -860,7 +913,8 @@ export const getAuthStatus = async (): Promise => { }); // Check if response is HTML (which indicates a redirect or wrong endpoint) - const contentType = String(response.headers['content-type'] ?? ''); + const contentTypeHeader = response.headers['content-type']; + const contentType = typeof contentTypeHeader === 'string' ? 
contentTypeHeader : ''; if (contentType.includes('text/html')) { console.warn('Received HTML response instead of JSON for auth-status endpoint'); return { diff --git a/lightrag_webui/src/components/documents/UploadDocumentsDialog.tsx b/lightrag_webui/src/components/documents/UploadDocumentsDialog.tsx index 16e21e7d0c..251af629a7 100644 --- a/lightrag_webui/src/components/documents/UploadDocumentsDialog.tsx +++ b/lightrag_webui/src/components/documents/UploadDocumentsDialog.tsx @@ -19,9 +19,18 @@ import { useTranslation } from 'react-i18next' interface UploadDocumentsDialogProps { onDocumentsUploaded?: () => Promise + /** + * Fired once per batch as soon as the first file is accepted by the server. + * Lets the parent start its activity probe as early as possible (rather + * than waiting for the whole sequential batch to finish). + */ + onUploadBatchAccepted?: () => void } -export default function UploadDocumentsDialog({ onDocumentsUploaded }: UploadDocumentsDialogProps) { +export default function UploadDocumentsDialog({ + onDocumentsUploaded, + onUploadBatchAccepted +}: UploadDocumentsDialogProps) { const { t } = useTranslation() const [open, setOpen] = useState(false) const [isUploading, setIsUploading] = useState(false) @@ -76,6 +85,7 @@ export default function UploadDocumentsDialog({ onDocumentsUploaded }: UploadDoc try { // Track errors locally to ensure we have the final state const uploadErrors: Record = {} + let batchProbeTriggered = false // Create a collator that supports Chinese sorting const collator = new Intl.Collator(['zh-CN', 'en'], { @@ -103,13 +113,7 @@ export default function UploadDocumentsDialog({ onDocumentsUploaded }: UploadDoc })) }) - if (result.status === 'duplicated') { - uploadErrors[file.name] = t('documentPanel.uploadDocuments.fileUploader.duplicateFile') - setFileErrors(prev => ({ - ...prev, - [file.name]: t('documentPanel.uploadDocuments.fileUploader.duplicateFile') - })) - } else if (result.status !== 'success') { + if (result.status 
!== 'success') { uploadErrors[file.name] = result.message setFileErrors(prev => ({ ...prev, @@ -118,19 +122,40 @@ export default function UploadDocumentsDialog({ onDocumentsUploaded }: UploadDoc } else { // Mark that we had at least one successful upload hasSuccessfulUpload = true + if (!batchProbeTriggered) { + batchProbeTriggered = true + onUploadBatchAccepted?.() + } } } catch (err) { console.error(`Upload failed for ${file.name}:`, err) // Handle HTTP errors, including 400 errors let errorMsg = errorMessage(err) + const duplicateFileMsg = t('documentPanel.uploadDocuments.fileUploader.duplicateFile') // If it's an axios error with response data, try to extract more detailed error info if (err && typeof err === 'object' && 'response' in err) { const axiosError = err as { response?: { status: number, data?: { detail?: string } } } - if (axiosError.response?.status === 400) { - // Extract specific error message from backend response - errorMsg = axiosError.response.data?.detail || errorMsg + const status = axiosError.response?.status + const detail = axiosError.response?.data?.detail + if (status === 409) { + // Server now rejects same-name uploads with HTTP 409 instead of + // returning a 200 ``status="duplicated"`` payload. Map the most + // common cases (existing record / file in INPUT dir) back to the + // dedicated "duplicate file" UI affordance, and surface other + // 409 reasons (pipeline busy / scanning) verbatim from the + // server detail so users can tell why they were rejected. 
+ if ( + typeof detail === 'string' && + (/already contains/i.test(detail) || /Status:/i.test(detail)) + ) { + errorMsg = duplicateFileMsg + } else { + errorMsg = detail || errorMsg + } + } else if (status === 400) { + errorMsg = detail || errorMsg } // Set progress to 100% to display error message @@ -175,7 +200,7 @@ export default function UploadDocumentsDialog({ onDocumentsUploaded }: UploadDoc setIsUploading(false) } }, - [setIsUploading, setProgresses, setFileErrors, t, onDocumentsUploaded] + [setIsUploading, setProgresses, setFileErrors, t, onDocumentsUploaded, onUploadBatchAccepted] ) return ( diff --git a/lightrag_webui/src/components/status/StatusCard.tsx b/lightrag_webui/src/components/status/StatusCard.tsx index 8eeaaf7090..ba693e932d 100644 --- a/lightrag_webui/src/components/status/StatusCard.tsx +++ b/lightrag_webui/src/components/status/StatusCard.tsx @@ -1,5 +1,105 @@ -import { LightragStatus } from '@/api/lightrag' +import type { + LightragQueueStatus, + LightragRoleLLMConfig, + LightragStatus +} from '@/api/lightrag' import { useTranslation } from 'react-i18next' +import { + Table, + TableBody, + TableCell, + TableHead, + TableHeader, + TableRow +} from '@/components/ui/Table' + +const ROLE_ORDER = ['extract', 'keyword', 'query', 'vlm'] + +type RoleLLMRow = { + role: string + config: LightragRoleLLMConfig + queue?: LightragQueueStatus +} + +const textValue = (value: string | number | null | undefined) => { + if (value === null || value === undefined || value === '') return '-' + return String(value) +} + +const formatKwargs = (value: Record | null | undefined): string => { + if (!value || typeof value !== 'object') return '-' + const entries = Object.entries(value) + if (!entries.length) return '-' + return entries + .map(([k, v]) => { + const strVal = typeof v === 'object' && v !== null ? 
JSON.stringify(v) : String(v) + return `${k}=${strVal}` + }) + .join(', ') +} + +const statValue = (value: number | undefined) => { + return typeof value === 'number' ? value.toString() : '-' +} + +const getModelRows = (status: LightragStatus): RoleLLMRow[] => { + const configs = status.configuration.role_llm_config || {} + const queues = status.llm_queue_status || {} + const discoveredRoles = new Set([...Object.keys(configs), ...Object.keys(queues)]) + const orderedRoles = [ + ...ROLE_ORDER.filter((role) => discoveredRoles.has(role)), + ...Array.from(discoveredRoles).filter((role) => !ROLE_ORDER.includes(role)) + ] + + const rows: RoleLLMRow[] = orderedRoles.map((role) => ({ + role, + config: configs[role] || { + binding: status.configuration.llm_binding, + model: status.configuration.llm_model, + host: status.configuration.llm_binding_host, + max_async: status.configuration.max_async + }, + queue: queues[role] + })) + + if (!rows.length) { + rows.push({ + role: 'base', + config: { + binding: status.configuration.llm_binding, + model: status.configuration.llm_model, + host: status.configuration.llm_binding_host, + max_async: status.configuration.max_async + } + }) + } + + rows.push({ + role: 'embed', + config: { + binding: status.configuration.embedding_binding, + model: status.configuration.embedding_model, + host: status.configuration.embedding_binding_host, + max_async: status.configuration.embedding_func_max_async + }, + queue: status.embedding_queue_status + }) + + if (status.configuration.enable_rerank || status.rerank_queue_status?.available) { + rows.push({ + role: 'rerank', + config: { + binding: status.configuration.rerank_binding, + model: status.configuration.rerank_model, + host: status.configuration.rerank_binding_host, + max_async: status.rerank_queue_status?.max_async + }, + queue: status.rerank_queue_status + }) + } + + return rows +} const StatusCard = ({ status }: { status: LightragStatus | null }) => { const { t } = useTranslation() @@ -7,6 
+107,36 @@ const StatusCard = ({ status }: { status: LightragStatus | null }) => { return
{t('graphPanel.statusCard.unavailable')}
} + const roleRows = getModelRows(status) + const storageWorkspaces = status.configuration.storage_workspaces + const defaultWorkspace = status.configuration.workspace + const storageColumns = [ + { + key: 'kv', + label: t('graphPanel.statusCard.kvStorage'), + storageClass: status.configuration.kv_storage, + workspace: storageWorkspaces?.kv_storage ?? defaultWorkspace + }, + { + key: 'doc-status', + label: t('graphPanel.statusCard.docStatusStorage'), + storageClass: status.configuration.doc_status_storage, + workspace: storageWorkspaces?.doc_status_storage ?? defaultWorkspace + }, + { + key: 'graph', + label: t('graphPanel.statusCard.graphStorage'), + storageClass: status.configuration.graph_storage, + workspace: storageWorkspaces?.graph_storage ?? defaultWorkspace + }, + { + key: 'vector', + label: t('graphPanel.statusCard.vectorStorage'), + storageClass: status.configuration.vector_storage, + workspace: storageWorkspaces?.vector_storage ?? defaultWorkspace + } + ] + return (
@@ -20,58 +150,8 @@ const StatusCard = ({ status }: { status: LightragStatus | null }) => { {status.configuration.summary_language} / LLM summary on {status.configuration.force_llm_summary_on_merge.toString()} fragments {t('graphPanel.statusCard.threshold')}: cosine {status.configuration.cosine_threshold} / rerank_score {status.configuration.min_rerank_score} / max_related {status.configuration.related_chunk_number} - {t('graphPanel.statusCard.maxParallelInsert')}: - {status.configuration.max_parallel_insert} -
-
- -
-

{t('graphPanel.statusCard.llmConfig')}

-
- {t('graphPanel.statusCard.llmBindingHost')}: - {status.configuration.llm_binding_host} - {t('graphPanel.statusCard.llmModel')}: - {status.configuration.llm_binding}: {status.configuration.llm_model} (#{status.configuration.max_async} Async) -
-
- -
-

{t('graphPanel.statusCard.embeddingConfig')}

-
- {t('graphPanel.statusCard.embeddingBindingHost')}: - {status.configuration.embedding_binding_host} - {t('graphPanel.statusCard.embeddingModel')}: - {status.configuration.embedding_binding}: {status.configuration.embedding_model} (#{status.configuration.embedding_func_max_async} Async * {status.configuration.embedding_batch_num} batches) -
-
- - {status.configuration.enable_rerank && ( -
-

{t('graphPanel.statusCard.rerankerConfig')}

-
- {t('graphPanel.statusCard.rerankerBindingHost')}: - {status.configuration.rerank_binding_host || '-'} - {t('graphPanel.statusCard.rerankerModel')}: - {(status.configuration.rerank_binding || '-')} : {(status.configuration.rerank_model || '-')} -
-
- )} - -
-

{t('graphPanel.statusCard.storageConfig')}

-
- {t('graphPanel.statusCard.kvStorage')}: - {status.configuration.kv_storage} - {t('graphPanel.statusCard.docStatusStorage')}: - {status.configuration.doc_status_storage} - {t('graphPanel.statusCard.graphStorage')}: - {status.configuration.graph_storage} - {t('graphPanel.statusCard.vectorStorage')}: - {status.configuration.vector_storage} - {t('graphPanel.statusCard.workspace')}: - {status.configuration.workspace || '-'} - {t('graphPanel.statusCard.maxGraphNodes')}: - {status.configuration.max_graph_nodes || '-'} + {t('graphPanel.statusCard.otherSettings')}: + max_graph_nodes {status.configuration.max_graph_nodes || '-'} / max_parallel_insert {status.configuration.max_parallel_insert} {status.keyed_locks && ( <> {t('graphPanel.statusCard.lockStatus')}: @@ -84,6 +164,107 @@ const StatusCard = ({ status }: { status: LightragStatus | null }) => { )}
+ +
+

{t('graphPanel.statusCard.llmConfig')}

+
+ + + + role + + binding/model + + base_url/kwargs + + queued + + + run/max + + + req + + + + + {roleRows.map(({ role, config, queue }) => { + const maxAsync = queue?.max_async ?? config.max_async + return ( + + + {role} + + +
{textValue(config.binding)}
+
+ {textValue(config.model)} +
+
+ +
{textValue(config.host)}
+ {(() => { + const providerOptions = config.metadata?.provider_options as Record | undefined + const kwargsStr = formatKwargs(providerOptions) + return ( +
+ {kwargsStr} +
+ ) + })()} +
+ + {statValue(queue?.queued)} + + + {statValue(queue?.running)}/{statValue(maxAsync)} + + + {statValue(queue?.submitted_total)} + +
+ ) + })} +
+
+
+
+ +
+

{t('graphPanel.statusCard.storageConfig')}

+
+ + + + {storageColumns.map(({ key, label }) => ( + + {label} + + ))} + + + + + {storageColumns.map(({ key, storageClass }) => ( + + {textValue(storageClass)} + + ))} + + + {storageColumns.map(({ key, workspace }) => ( + + {textValue(workspace)} + + ))} + + +
+
+
) } diff --git a/lightrag_webui/src/components/status/StatusDialog.tsx b/lightrag_webui/src/components/status/StatusDialog.tsx index bb78d843be..4abfb96056 100644 --- a/lightrag_webui/src/components/status/StatusDialog.tsx +++ b/lightrag_webui/src/components/status/StatusDialog.tsx @@ -20,8 +20,8 @@ const StatusDialog = ({ open, onOpenChange, status }: StatusDialogProps) => { return ( - - + + {t('graphPanel.statusDialog.title')} {t('graphPanel.statusDialog.description')} diff --git a/lightrag_webui/src/features/DocumentManager.tsx b/lightrag_webui/src/features/DocumentManager.tsx index 2a90ee5d87..def01bf9cd 100644 --- a/lightrag_webui/src/features/DocumentManager.tsx +++ b/lightrag_webui/src/features/DocumentManager.tsx @@ -44,8 +44,19 @@ import { copyToClipboard } from '@/utils/clipboard' import { RefreshCwIcon, ActivityIcon, ArrowUpIcon, ArrowDownIcon, RotateCcwIcon, CheckSquareIcon, XIcon, AlertTriangle, Info, CopyIcon } from 'lucide-react' import PipelineStatusDialog from '@/components/documents/PipelineStatusDialog' +import { + getStatusBucket, + matchesStatusFilter, + type StatusBucket, + type StatusFilter +} from '@/features/documentStatusFilters' + +type StatusDisplayConfig = { + labelKey: string + className: string +} -type StatusFilter = DocStatus | 'all'; +const STATUS_BUCKETS: StatusBucket[] = ['processed', 'analyzing', 'processing', 'pending', 'failed'] // Utility functions defined outside component for better performance and to avoid dependency issues const getCountValue = (counts: Record, ...keys: string[]): number => { @@ -58,11 +69,27 @@ const getCountValue = (counts: Record, ...keys: string[]): numbe return 0 } +const getAggregateCount = (counts: Record, ...keys: string[]): number => + keys.reduce((total, key) => total + getCountValue(counts, key), 0) + const hasActiveDocumentsStatus = (counts: Record): boolean => - getCountValue(counts, 'PROCESSING', 'processing') > 0 || + getAggregateCount(counts, 'PROCESSING', 'processing', 'PARSING', 
'parsing', 'ANALYZING', 'analyzing') > 0 || getCountValue(counts, 'PENDING', 'pending') > 0 || getCountValue(counts, 'PREPROCESSED', 'preprocessed') > 0 +const buildLegacyDocs = (documents: DocStatusResponse[]): DocsStatusesResponse => { + const statuses = STATUS_BUCKETS.reduce>((acc, status) => { + acc[status] = [] + return acc + }, {} as Record) + + documents.forEach((doc) => { + statuses[getStatusBucket(doc.status)].push(doc) + }) + + return { statuses } +} + const getDisplayFileName = (doc: DocStatusResponse, maxLength: number = 20): string => { // Check if file_path exists and is a non-empty string if (!doc.file_path || typeof doc.file_path !== 'string' || doc.file_path.trim() === '') { @@ -87,6 +114,20 @@ const getDisplayFileName = (doc: DocStatusResponse, maxLength: number = 20): str const formatMetadata = (metadata: Record): string => { const formattedMetadata = { ...metadata }; + if (formattedMetadata.parsing_start_time && typeof formattedMetadata.parsing_start_time === 'number') { + const date = new Date(formattedMetadata.parsing_start_time * 1000); + if (!isNaN(date.getTime())) { + formattedMetadata.parsing_start_time = date.toLocaleString(); + } + } + + if (formattedMetadata.analyzing_start_time && typeof formattedMetadata.analyzing_start_time === 'number') { + const date = new Date(formattedMetadata.analyzing_start_time * 1000); + if (!isNaN(date.getTime())) { + formattedMetadata.analyzing_start_time = date.toLocaleString(); + } + } + if (formattedMetadata.processing_start_time && typeof formattedMetadata.processing_start_time === 'number') { const date = new Date(formattedMetadata.processing_start_time * 1000); if (!isNaN(date.getTime())) { @@ -277,7 +318,9 @@ export default function DocumentManager() { // Track component mount status const isMountedRef = useRef(true); - // Set up mount/unmount status tracking + // Set up mount/unmount status tracking. 
Pending throttle/probe timers are NOT + // explicitly cleared on unmount — every timer callback checks isMountedRef + // before doing any work, so a stray fire is a no-op. useEffect(() => { isMountedRef.current = true; @@ -297,7 +340,7 @@ export default function DocumentManager() { const [showPipelineStatus, setShowPipelineStatus] = useState(false) const { t, i18n } = useTranslation() const health = useBackendState.use.health() - const pipelineBusy = useBackendState.use.pipelineBusy() + const pipelineActive = useBackendState.use.pipelineActive() // Legacy state for backward compatibility const [docs, setDocs] = useState(null) @@ -319,6 +362,13 @@ export default function DocumentManager() { has_prev: false }) const [statusCounts, setStatusCounts] = useState>({ all: 0 }) + // Mirror statusCounts in a ref so async callbacks (e.g. activity probe ticks) + // can read the latest value without being tied to the closure captured at + // schedule time. Synced via useEffect to satisfy react-hooks/refs. 
+ const statusCountsRef = useRef(statusCounts) + useEffect(() => { + statusCountsRef.current = statusCounts + }, [statusCounts]) const [isRefreshing, setIsRefreshing] = useState(false) // Sort state @@ -332,7 +382,7 @@ export default function DocumentManager() { const [pageByStatus, setPageByStatus] = useState>({ all: 1, processed: 1, - preprocessed: 1, + analyzing: 1, processing: 1, pending: 1, failed: 1, @@ -342,15 +392,24 @@ export default function DocumentManager() { const [selectedDocIds, setSelectedDocIds] = useState([]) const isSelectionMode = selectedDocIds.length > 0 - // Add refs to track previous pipelineBusy state and current interval - const prevPipelineBusyRef = useRef(undefined); + // Add refs to track previous pipelineActive state and current interval + const prevPipelineActiveRef = useRef(undefined); const pollingIntervalRef = useRef | null>(null); const activeRefreshPromiseRef = useRef | null>(null); const pendingRefreshRequestRef = useRef(null); const latestRefreshRequestVersionRef = useRef(0); - - // Add retry mechanism state - const [retryState, setRetryState] = useState({ + // Throttle gate: all auto-driven /documents/paginated entrances funnel through + // refreshDocumentsThrottled() to enforce a minimum 2s wall-clock interval. + const lastPaginatedAtRef = useRef(0); + const pendingPaginatedTimerRef = useRef | null>(null); + // Activity probe: exponential-backoff burst of /health calls that stops once + // pipelineActive flips true. Holds the pending setTimeout ids so re-entry can + // reset the schedule to t=0. + const probeTimersRef = useRef[] | null>(null); + const probeActiveRef = useRef(false); + + // Add retry mechanism state (read by circuit breaker via setRetryState only). 
+ const [, setRetryState] = useState({ count: 0, lastError: null as Error | null, isBackingOff: false @@ -402,7 +461,7 @@ export default function DocumentManager() { setPageByStatus({ all: 1, processed: 1, - preprocessed: 1, + analyzing: 1, processing: 1, pending: 1, failed: 1, @@ -442,6 +501,47 @@ export default function DocumentManager() { // Define a new type that includes status information type DocStatusWithStatus = DocStatusResponse & { status: DocStatus }; + const getStatusDisplay = useCallback((status: DocStatus): StatusDisplayConfig => { + switch (status) { + case 'processed': + return { + labelKey: 'documentPanel.documentManager.status.completed', + className: 'text-green-600' + } + case 'preprocessed': + return { + labelKey: 'documentPanel.documentManager.status.preprocessed', + className: 'text-purple-600' + } + case 'parsing': + return { + labelKey: 'documentPanel.documentManager.status.parsing', + className: 'text-cyan-600' + } + case 'analyzing': + return { + labelKey: 'documentPanel.documentManager.status.analyzing', + className: 'text-indigo-600' + } + case 'processing': + return { + labelKey: 'documentPanel.documentManager.status.processing', + className: 'text-blue-600' + } + case 'pending': + return { + labelKey: 'documentPanel.documentManager.status.pending', + className: 'text-yellow-600' + } + case 'failed': + default: + return { + labelKey: 'documentPanel.documentManager.status.failed', + className: 'text-red-600' + } + } + }, []) + const filteredAndSortedDocs = useMemo(() => { // Use currentPageDocs directly if available (from paginated API) // This preserves the backend's sort order and prevents status grouping @@ -458,26 +558,20 @@ export default function DocumentManager() { // Create a flat array of documents with status information const allDocuments: DocStatusWithStatus[] = []; - if (statusFilter === 'all') { - // When filter is 'all', include documents from all statuses - Object.entries(docs.statuses).forEach(([status, documents]) => 
{ - documents.forEach(doc => { + Object.entries(docs.statuses).forEach(([status, documents]) => { + const fallbackStatus = status as DocStatus + + for (const doc of documents ?? []) { + const documentStatus = doc.status ?? fallbackStatus + + if (matchesStatusFilter(documentStatus, statusFilter)) { allDocuments.push({ ...doc, - status: status as DocStatus - }); - }); - }); - } else { - // When filter is specific status, only include documents from that status - const documents = docs.statuses[statusFilter] || []; - documents.forEach(doc => { - allDocuments.push({ - ...doc, - status: statusFilter - }); - }); - } + status: documentStatus + }) + } + } + }) // Sort all documents together if sort field and direction are specified if (sortField && sortDirection) { @@ -540,7 +634,7 @@ export default function DocumentManager() { const counts: Record<string, number> = { all: 0 }; Object.entries(docs.statuses).forEach(([status, documents]) => { - counts[status as DocStatus] = documents.length; + counts[status] = documents.length; counts.all += documents.length; }); @@ -548,18 +642,21 @@ export default function DocumentManager() { }, [docs]); const processedCount = getCountValue(statusCounts, 'PROCESSED', 'processed') || documentCounts.processed || 0; - const preprocessedCount = - getCountValue(statusCounts, 'PREPROCESSED', 'preprocessed') || - documentCounts.preprocessed || + const analyzingCount = + getAggregateCount(statusCounts, 'PARSING', 'parsing', 'ANALYZING', 'analyzing', 'PREPROCESSED', 'preprocessed') || + documentCounts.analyzing || + 0; + const processingCount = + getAggregateCount(statusCounts, 'PROCESSING', 'processing') || + documentCounts.processing || 0; - const processingCount = getCountValue(statusCounts, 'PROCESSING', 'processing') || documentCounts.processing || 0; const pendingCount = getCountValue(statusCounts, 'PENDING', 'pending') || documentCounts.pending || 0; const failedCount = getCountValue(statusCounts, 'FAILED', 'failed') || documentCounts.failed || 0; // Store 
previous status counts const prevStatusCounts = useRef({ processed: 0, - preprocessed: 0, + analyzing: 0, processing: 0, pending: 0, failed: 0 @@ -605,18 +702,7 @@ export default function DocumentManager() { setCurrentPageDocs(response.documents); setStatusCounts(response.status_counts); - // Update legacy docs state for backward compatibility - const legacyDocs: DocsStatusesResponse = { - statuses: { - processed: response.documents.filter((doc: DocStatusResponse) => doc.status === 'processed'), - preprocessed: response.documents.filter((doc: DocStatusResponse) => doc.status === 'preprocessed'), - processing: response.documents.filter((doc: DocStatusResponse) => doc.status === 'processing'), - pending: response.documents.filter((doc: DocStatusResponse) => doc.status === 'pending'), - failed: response.documents.filter((doc: DocStatusResponse) => doc.status === 'failed') - } - }; - - setDocs(response.pagination.total_count > 0 ? legacyDocs : null); + setDocs(response.pagination.total_count > 0 ? buildLegacyDocs(response.documents) : null); }, []); @@ -710,7 +796,7 @@ export default function DocumentManager() { setPageByStatus({ all: 1, processed: 1, - preprocessed: 1, + analyzing: 1, processing: 1, pending: 1, failed: 1, @@ -865,6 +951,104 @@ export default function DocumentManager() { }); }, [buildQuerySnapshot, enqueueRefresh, pagination.page]); + // Throttle gate: any caller wanting to refresh the document list goes through + // here. If the wall-clock gap since the last paginated request is >= 2s, fire + // immediately; otherwise schedule a single trailing call at the 2s boundary + // and drop any further calls into that pending slot (natural coalescing). 
+ const refreshDocumentsThrottled = useCallback(() => { + const fire = () => { + lastPaginatedAtRef.current = Date.now() + handleIntelligentRefresh().catch((err) => { + console.error('Throttled document refresh failed:', err) + }) + } + const gap = Date.now() - lastPaginatedAtRef.current + if (gap >= 2000) { + fire() + return + } + if (pendingPaginatedTimerRef.current !== null) return + // Snapshot the query identity. If page/filter/sort changes while we wait, + // the page-change useEffect bumps latestRefreshRequestVersionRef AND fires + // its own paginated request on the new query. Our trailing closure still + // holds the OLD handleIntelligentRefresh (capturing the old page), so we + // must drop it — otherwise the stale request would overwrite the new list + // (its requestVersion would be the newly-bumped value, so the in-flight + // stale-check inside runRefreshRequest can't catch it). + const versionAtSchedule = latestRefreshRequestVersionRef.current + pendingPaginatedTimerRef.current = setTimeout(() => { + pendingPaginatedTimerRef.current = null + if (!isMountedRef.current) return + if (versionAtSchedule !== latestRefreshRequestVersionRef.current) return + fire() + }, 2000 - gap) + }, [handleIntelligentRefresh]); + + // Activity probe: short exponential-backoff burst of /health checks fired + // after scan/upload triggers. Stops as soon as pipelineActive flips true so + // we can hand off to the existing 5s active polling cadence. Re-entry + // (e.g. another scan while a probe is mid-flight) cancels the current + // schedule and restarts at t=0 so the latest action gets a fresh observation + // window. 
+ const startActivityProbe = useCallback((reason: string) => { + if (probeTimersRef.current) { + probeTimersRef.current.forEach((id) => clearTimeout(id)) + probeTimersRef.current = null + } + probeActiveRef.current = true + const timers: ReturnType[] = [] + const probeSchedule = [0, 1000, 2000, 4000, 8000, 16000] as const + const refreshAt = new Set([0, 2000, 4000, 8000, 16000]) + const cleanup = () => { + timers.forEach((id) => clearTimeout(id)) + if (probeTimersRef.current === timers) { + probeTimersRef.current = null + probeActiveRef.current = false + } + } + probeSchedule.forEach((delay, index) => { + const id = setTimeout(async () => { + if (!isMountedRef.current) { + cleanup() + return + } + try { + await useBackendState.getState().check() + } catch (err) { + console.error(`Activity probe (${reason}) check failed:`, err) + } + if (!isMountedRef.current) { + cleanup() + return + } + if (refreshAt.has(delay)) { + refreshDocumentsThrottled() + } + // Exit conditions (in priority order): + // - pipelineActive=true AND the document list has caught up: the 5s + // active polling cadence will take over from here. + // - pipelineActive=false after the first tick: the scan/upload didn't + // actually start any work (e.g. scan found nothing new, upload was + // rejected) — no point continuing to burst /health. + // - last tick: time budget exhausted, hand off to the polling loop. + // Note: NOT stopping on bare `pipelineActive=true` is intentional. + // /health flips to active on scanning/pending_enqueues before the new + // doc rows are visible in /documents/paginated, so a premature exit + // would strand the UI in 30s idle polling while classification is + // still running. 
+ const active = useBackendState.getState().pipelineActive + const docsActive = hasActiveDocumentsStatus(statusCountsRef.current) + const isLast = index === probeSchedule.length - 1 + const stop = (active && docsActive) || (!active && index > 0) || isLast + if (stop) { + cleanup() + } + }, delay) + timers.push(id) + }) + probeTimersRef.current = timers + }, [refreshDocumentsThrottled]); + // New paginated data fetching function const fetchPaginatedDocuments = useCallback(async ( page: number, @@ -903,90 +1087,44 @@ export default function DocumentManager() { const startPollingInterval = useCallback((intervalMs: number) => { clearPollingInterval(); - pollingIntervalRef.current = setInterval(async () => { - try { - // Check circuit breaker before making request - if (isCircuitBreakerOpen()) { - return; // Skip this polling cycle - } - - // Only perform fetch if component is still mounted - if (isMountedRef.current) { - await fetchDocuments(); - recordSuccess(); // Record successful operation - } - } catch (err) { - // Only handle error if component is still mounted - if (isMountedRef.current) { - const errorClassification = classifyError(err); - - // Always reset isRefreshing state on error - setIsRefreshing(false); - - if (errorClassification.shouldShowToast) { - toast.error(t('documentPanel.documentManager.errors.scanProgressFailed', { error: errorMessage(err) })); - } - - if (errorClassification.shouldRetry) { - recordFailure(err as Error); - - // Implement exponential backoff for retries - const backoffDelay = Math.min(Math.pow(2, retryState.count) * 1000, 30000); // Max 30s - - if (retryState.count < 3) { // Max 3 retries - setTimeout(() => { - if (isMountedRef.current) { - setRetryState(prev => ({ ...prev, isBackingOff: false })); - } - }, backoffDelay); - } - } else { - // For non-retryable errors, stop polling - clearPollingInterval(); - } - } - } + pollingIntervalRef.current = setInterval(() => { + if (!isMountedRef.current) return; + if 
(isCircuitBreakerOpen()) return; + // refreshDocumentsThrottled is fire-and-forget; errors are surfaced via + // toast/recordFailure inside runRefreshRequest. + refreshDocumentsThrottled(); + recordSuccess(); }, intervalMs); - }, [fetchDocuments, t, clearPollingInterval, isCircuitBreakerOpen, recordSuccess, recordFailure, classifyError, retryState.count]); + }, [refreshDocumentsThrottled, clearPollingInterval, isCircuitBreakerOpen, recordSuccess]); const scanDocuments = useCallback(async () => { try { - // Check if component is still mounted before starting the request if (!isMountedRef.current) return; - const { status, message, track_id: _track_id } = await scanNewDocuments(); // eslint-disable-line @typescript-eslint/no-unused-vars + const { status, message } = await scanNewDocuments(); - // Check again if component is still mounted after the request completes if (!isMountedRef.current) return; - // Note: _track_id is available for future use (e.g., progress tracking) toast.message(message || status); - // Reset health check timer with 1 second delay to avoid race condition - useBackendState.getState().resetHealthCheckTimerDelayed(1000); - - // Perform immediate refresh with 90s timeout after scan (tolerates PostgreSQL switchover) - await handleIntelligentRefresh(undefined, false, 90000); - - // Start fast refresh with 2-second interval after initial refresh - startPollingInterval(2000); - - // Set recovery timer to restore normal polling interval after 15 seconds - setTimeout(() => { - if (isMountedRef.current && currentTab === 'documents' && health) { - // Restore intelligent polling interval based on document status - const hasActiveDocuments = hasActiveDocumentsStatus(statusCounts); - const normalInterval = hasActiveDocuments ? 5000 : 30000; - startPollingInterval(normalInterval); - } - }, 15000); // Restore after 15 seconds + if (status === 'scanning_started') { + // Activity probe drives /health bursts + throttled document refreshes. 
+ // It exits as soon as pipelineActive flips true, after which the + // standard 5s polling cadence (driven by hasActiveDocumentsStatus) + // takes over. + startActivityProbe('scan'); + } else { + // scanning_skipped_pipeline_busy: a single check+refresh is enough, + // no need to start the probe (pipeline is already active). + useBackendState.getState().check().catch(() => undefined); + refreshDocumentsThrottled(); + } } catch (err) { - // Only show error if component is still mounted if (isMountedRef.current) { toast.error(t('documentPanel.documentManager.errors.scanFailed', { error: errorMessage(err) })); } } - }, [t, startPollingInterval, currentTab, health, statusCounts, handleIntelligentRefresh]) + }, [t, startActivityProbe, refreshDocumentsThrottled]) // Handle manual refresh with pagination reset logic const handleManualRefresh = useCallback(async () => { @@ -1001,49 +1139,45 @@ export default function DocumentManager() { latestRefreshRequestVersionRef.current += 1 }, [pagination.page, pagination.page_size, statusFilter, sortField, sortDirection]) - // Monitor pipelineBusy changes and trigger immediate refresh with timer reset + // Monitor pipelineActive changes and trigger an immediate refresh. The + // polling interval is reconciled by the main polling useEffect below + // (which also depends on pipelineActive), so there's no need to re-call + // startPollingInterval here. 
useEffect(() => { - // Skip the first render when prevPipelineBusyRef is undefined - if (prevPipelineBusyRef.current !== undefined && prevPipelineBusyRef.current !== pipelineBusy) { - // pipelineBusy state has changed, trigger immediate refresh + if (prevPipelineActiveRef.current !== undefined && prevPipelineActiveRef.current !== pipelineActive) { if (currentTab === 'documents' && health && isMountedRef.current) { - // Use intelligent refresh to preserve current page - handleIntelligentRefresh(); - - // Reset polling timer after intelligent refresh - const hasActiveDocuments = hasActiveDocumentsStatus(statusCounts); - const pollingInterval = hasActiveDocuments ? 5000 : 30000; - startPollingInterval(pollingInterval); + refreshDocumentsThrottled(); } } - // Update the previous state - prevPipelineBusyRef.current = pipelineBusy; + prevPipelineActiveRef.current = pipelineActive; }, [ - pipelineBusy, + pipelineActive, currentTab, health, - handleIntelligentRefresh, - statusCounts, - startPollingInterval + refreshDocumentsThrottled ]); - // Set up intelligent polling with dynamic interval based on document status + // Set up intelligent polling with dynamic interval based on document status. + // Treat pipelineActive=true as enough reason to stay in 5s fast polling even + // when statusCounts hasn't surfaced pending rows yet — /health flips active + // during scan classification / upload enqueue, well before the new doc rows + // appear in /documents/paginated. Without this, the UI would stall in 30s + // idle polling for several seconds after the user clicked scan/upload. useEffect(() => { if (currentTab !== 'documents' || !health) { clearPollingInterval(); return } - // Determine polling interval based on document status const hasActiveDocuments = hasActiveDocumentsStatus(statusCounts); - const pollingInterval = hasActiveDocuments ? 5000 : 30000; // 5s if active, 30s if idle + const pollingInterval = (hasActiveDocuments || pipelineActive) ? 
5000 : 30000; startPollingInterval(pollingInterval); return () => { clearPollingInterval(); } - }, [health, t, currentTab, statusCounts, startPollingInterval, clearPollingInterval]) + }, [health, t, currentTab, statusCounts, pipelineActive, startPollingInterval, clearPollingInterval]) // Monitor docs changes to check status counts and trigger health check if needed useEffect(() => { @@ -1052,7 +1186,7 @@ export default function DocumentManager() { // Get new status counts const newStatusCounts = { processed: docs?.statuses?.processed?.length || 0, - preprocessed: docs?.statuses?.preprocessed?.length || 0, + analyzing: docs?.statuses?.analyzing?.length || 0, processing: docs?.statuses?.processing?.length || 0, pending: docs?.statuses?.pending?.length || 0, failed: docs?.statuses?.failed?.length || 0 @@ -1063,12 +1197,14 @@ export default function DocumentManager() { status => newStatusCounts[status] !== prevStatusCounts.current[status] ) - // Trigger health check if changes detected and component is still mounted - if (hasStatusCountChange && isMountedRef.current) { + // Trigger health check if changes detected and component is still mounted. + // Skip when the activity probe is running — the probe already drives /health + // on its own schedule, and double-firing would burn cache and skew rate. + if (hasStatusCountChange && isMountedRef.current && !probeActiveRef.current) { useBackendState.getState().check() } - // Update previous status counts + // Always update the snapshot so the first post-probe transition still fires. 
prevStatusCounts.current = newStatusCounts }, [docs]); @@ -1116,6 +1252,9 @@ export default function DocumentManager() { setStatusCounts({ all: 0, processed: 0, + preprocessed: 0, + parsing: 0, + analyzing: 0, processing: 0, pending: 0, failed: 0 @@ -1213,7 +1352,7 @@ export default function DocumentManager() { tooltip={t('documentPanel.documentManager.pipelineStatusTooltip')} size="sm" className={cn( - pipelineBusy && 'pipeline-busy' + pipelineActive && 'pipeline-busy' )} > {t('documentPanel.documentManager.pipelineStatusButton')} @@ -1261,7 +1400,10 @@ export default function DocumentManager() { ) : !isSelectionMode ? ( ) : null} - handleIntelligentRefresh(undefined, false, 120000)} /> + startActivityProbe('upload')} + onDocumentsUploaded={async () => { refreshDocumentsThrottled() }} + /> - {t('documentPanel.documentManager.status.all')} ({statusCounts.all || documentCounts.all}) + {t('documentPanel.documentManager.filters.all')} ({statusCounts.all || documentCounts.all})