Merged
622 commits
cee2c37
♻️ refactor(chunking): implement four-tier specificity-ordered chunk …
danielaskdd May 9, 2026
db25a84
✨ feat(chunker): add dedicated chunk_token_size override for paragrap…
danielaskdd May 9, 2026
b88e096
📦 build(deps): pin langchain package versions and clean up lock file
danielaskdd May 9, 2026
2b2bfc0
📝 docs(FileProcessingConfiguration-zh): restructure and expand chines…
danielaskdd May 9, 2026
3a0628f
♻️ refactor(lightrag): integrate chunk size overlay into runtime addo…
danielaskdd May 9, 2026
50fda8e
Merge pull request #3046 from danielaskdd/feat/add-R-V-chunker
danielaskdd May 9, 2026
7fa7144
Merge branch 'main' into dev
danielaskdd May 9, 2026
ad19e9f
🔧 chore(env.example): reorganize configuration sections
danielaskdd May 9, 2026
cbe1502
✨ feat(chunker): add per-strategy chunk size overrides for R and V st…
danielaskdd May 9, 2026
2dad470
✨ feat(pipeline): add structured chunking start logs per strategy
danielaskdd May 9, 2026
4822ce3
📝 docs(zh): update file processing configuration for new chunk size o…
danielaskdd May 9, 2026
7ff3562
♻️ refactor(pipeline): extract chunking params formatting and persist…
danielaskdd May 9, 2026
7f7c6e0
♻️ refactor(pipeline): simplify hard fallback split metadata
danielaskdd May 9, 2026
8cc600b
🐛 fix(chunker): handle CJK punctuation and enforce chunk_token_size
danielaskdd May 9, 2026
61e72c7
📝 docs(zh): document chunker CJK support and chunk_token_size enforce…
danielaskdd May 9, 2026
dc1ab7a
♻️ refactor(pipeline): add chunk log key aliases for compression
danielaskdd May 9, 2026
a993e07
Merge pull request #3050 from danielaskdd/fix/chunker-cjk-support
danielaskdd May 9, 2026
8d542d9
Merge branch 'main' into dev
danielaskdd May 9, 2026
10f7d03
🔧 chore(pipeline): clean up chunk log key aliases
danielaskdd May 9, 2026
a40e774
🐛 fix(pipeline): fix blocks_path parameter duplication in chunking
danielaskdd May 9, 2026
d652da8
🔧 chore(config): update default chunk token size for paragraph semant…
danielaskdd May 9, 2026
1b4f305
📝 docs(CLAUDE): consolidate agent guidelines into single CLAUDE.md
danielaskdd May 10, 2026
36400d6
📝 docs(agents): consolidate agent guidance into AGENTS.md
danielaskdd May 10, 2026
46271fc
✨ feat(chunker): split oversized tables on row boundaries before char…
danielaskdd May 10, 2026
07e41bf
Merge branch 'dev' into feat/chunker-table-row-split
danielaskdd May 10, 2026
b5fa878
♻️ refactor(chunker): improve table and long block token budgeting
danielaskdd May 10, 2026
ebfd4d4
📝 docs(CLAUDE): add newline at end of file
danielaskdd May 10, 2026
8c0e867
♻️ refactor(chunker): fix edge cases in paragraph-semantic chunking
danielaskdd May 10, 2026
575d404
feat(chunker): preserve paragraph semantic overlap
danielaskdd May 10, 2026
5e094f6
📝 docs(chunker): update paragraph semantic chunking documentation and…
danielaskdd May 10, 2026
6756c9d
Merge pull request #3051 from danielaskdd/feat/chunker-table-row-split
danielaskdd May 10, 2026
a8380a1
📝 docs(FileProcessingConfiguration-zh): update chunking strategy docu…
danielaskdd May 10, 2026
d0c108d
📝 docs(FileProcessingConfiguration-zh): clarify chunk overlap behavio…
danielaskdd May 10, 2026
9138586
📝 docs(ParagraphSemanticChunking-zh): expand and restructure document…
danielaskdd May 11, 2026
9021293
fix(chunker): preserve HTML table row group wrappers
danielaskdd May 11, 2026
22841d7
Merge pull request #3055 from danielaskdd/fix/html-table-wrapper-split
danielaskdd May 11, 2026
d0f8798
Merge branch 'main' into dev
danielaskdd May 11, 2026
f16beea
🐛 fix(numbering_resolver): correct paragraph numbering override behavior
danielaskdd May 11, 2026
b32c1b7
📝 docs(plan): add native multimodal surrounding context implementatio…
danielaskdd May 11, 2026
49b3421
✨ feat(multimodal): backfill surrounding context on native sidecars
claude May 11, 2026
167d3ca
♻️ refactor(multimodal_context): add character-level fallback for tab…
danielaskdd May 11, 2026
5846539
♻️ refactor(multimodal_context): unify table marker handling and move…
danielaskdd May 11, 2026
5af5ba7
✅ test(document): add tokenizer setup for docx archive tests
danielaskdd May 11, 2026
d8e787f
📝 docs(plan): update native multimodal surrounding context plan
danielaskdd May 11, 2026
edd068e
Merge pull request #3057 from HKUDS/claude/implement-multimodal-conte…
danielaskdd May 11, 2026
7c83286
Merge branch 'main' into dev
danielaskdd May 12, 2026
d9b56e9
♻️ refactor(pipeline): rename _process_single_document to public method
danielaskdd May 12, 2026
d8efbf7
🐛 fix(docx): strip asset prefix from drawing path in blocks output
danielaskdd May 13, 2026
0999990
✨ feat(llm/vlm): unify image_inputs across bindings and add VLM cache…
danielaskdd May 13, 2026
1e15733
🐛 fix(pipeline/vlm): resolve sidecar-relative paths and skip vector i…
danielaskdd May 13, 2026
875986c
Fix linting
danielaskdd May 13, 2026
745c2c4
🐛 fix(pipeline/vlm): skip persistence on disabled-VLM and don't cache…
danielaskdd May 13, 2026
88cf78c
Fix lintings
danielaskdd May 13, 2026
006b3b2
🐛 fix(config): remove unsupported anthropic from VLM binding list
danielaskdd May 13, 2026
74a3ad2
Merge pull request #3063 from danielaskdd/feat/unified-llm-vlm-image-…
danielaskdd May 13, 2026
87b4f86
Merge branch 'main' into dev
danielaskdd May 13, 2026
9960c86
Update uv.lock
danielaskdd May 13, 2026
6d38cee
♻️ refactor(multimodal): rewrite analyze pipeline + mm-chunks + entit…
danielaskdd May 13, 2026
b18abcd
♻️ refactor(chunker/P): emit nested heading + blockid-backed sidecar …
danielaskdd May 13, 2026
0aad001
Fix lintings
danielaskdd May 13, 2026
069f704
🐛 fix(multimodal): address PR review findings — FAILED upsert, cache …
danielaskdd May 13, 2026
d0db8f6
🐛 fix(multimodal): gate analysis cache by flag, strip <table> ids, su…
danielaskdd May 13, 2026
6ed12ea
Fix lintings
danielaskdd May 13, 2026
487bba8
🐛 fix(multimodal): default missing process_options to no modalities a…
danielaskdd May 13, 2026
f463040
Fix lintings
danielaskdd May 13, 2026
256f860
Merge pull request #3064 from danielaskdd/refactor/multimodal-pipeline
danielaskdd May 13, 2026
8e516e9
📝 docs(prompt_multimodal): fix typo in prompt instructions
danielaskdd May 13, 2026
0332a50
🐛 fix(prompt_multimodal): correct typo in multimodal prompt templates
danielaskdd May 13, 2026
990c8a5
Merge branch 'refactor/multimodal-pipeline' into dev
danielaskdd May 14, 2026
81ba4e8
📝 docs(planning): delete obsolete native multimodal surrounding conte…
danielaskdd May 14, 2026
fd7d73a
📝 docs(LightRAGSidecarFormat): add chinese documentation for sidecar …
danielaskdd May 14, 2026
b623d64
📝 docs(docs): remove outdated concurrent processing documentation
danielaskdd May 14, 2026
5ec48d3
Merge branch 'main' into dev
danielaskdd May 14, 2026
497c0d8
✨ feat(docx): doc_title from first heading, table_header in sidecar
danielaskdd May 14, 2026
5d17912
Merge pull request #3065 from danielaskdd/feat/docx-sidecar-title-and…
danielaskdd May 14, 2026
5e9c4e6
✨ feat(multimodal): defer sidecar surrounding to analyze entry, env-t…
danielaskdd May 14, 2026
2bf52f9
Fix linting
danielaskdd May 14, 2026
346fe2b
Merge pull request #3066 from danielaskdd/feat/surrounding-enrichment…
danielaskdd May 14, 2026
9f0fefc
✨ feat(sidecar): shorten item IDs by stripping doc- prefix from doc_id
danielaskdd May 14, 2026
a464a04
Merge pull request #3067 from danielaskdd/feat/short-sidecar-ids
danielaskdd May 14, 2026
02fe999
✨ feat(multimodal): strip parser-internal markup from sidecar surroun…
danielaskdd May 14, 2026
c877720
Merge pull request #3068 from danielaskdd/feat/strip-internal-markup-…
danielaskdd May 14, 2026
36f23ed
✨ feat(extract): enforce MAX_EXTRACT_INPUT_TOKENS for analyze & gleaning
danielaskdd May 14, 2026
eb3c752
Merge pull request #3073 from danielaskdd/feat/extract-input-token-guard
danielaskdd May 14, 2026
17ecbda
♻️ refactor(operate): simplify multimodal entity name generation
danielaskdd May 14, 2026
fe182d8
Merge branch 'main' into dev
danielaskdd May 14, 2026
93614c6
Merge branch 'main' into dev
danielaskdd May 14, 2026
9fa9751
🐛 fix(multimodal): rename drawing id prefix from dr- to im-
danielaskdd May 14, 2026
73b96b5
♻️ refactor(pipeline): consolidate image rendering logic into single …
danielaskdd May 14, 2026
e26c29b
♻️ refactor(multimodal): switch mm-chunk content to inline bracket-la…
danielaskdd May 14, 2026
5ece54e
Merge pull request #3074 from danielaskdd/refact/mm-chunk-format
danielaskdd May 14, 2026
beb9b58
fix: extract Docling async markdown result
he-yufeng May 7, 2026
6756ec5
docs: align Docling endpoint examples
he-yufeng May 9, 2026
01a3bfa
📝 docs(LightRAGSidecarFormat-zh): update tag syntax and format placeh…
danielaskdd May 15, 2026
7d68f23
Merge branch 'dev' into fix/docling-markdown-extraction
danielaskdd May 15, 2026
f202381
📝 docs(config): add docling endpoint configuration comments
danielaskdd May 15, 2026
4570f6e
Merge pull request #3031 from he-yufeng/fix/docling-markdown-extraction
danielaskdd May 15, 2026
22fd48f
feat(sidecar): add unified IR + writer + MinerU raw cache infrastructure
claude May 15, 2026
530a115
feat(pipeline): route parse_mineru through unified sidecar writer + r…
claude May 15, 2026
2f3204f
test(sidecar): cover writer, mineru raw cache, adapter, parse_mineru,…
claude May 15, 2026
bc55aa1
chore: ruff --fix unused imports in new sidecar modules
claude May 15, 2026
8741259
fix(sidecar): warn on orphan items and bad bbox JSON; tidy manifest e…
danielaskdd May 15, 2026
85b186d
fix(parser_adapters/mineru): refuse path traversal in untrusted img_path
danielaskdd May 15, 2026
59fad3a
fix(parser_adapters/mineru): honor MINERU_IMAGE_URL_TEMPLATE in path …
danielaskdd May 15, 2026
949b1ed
feat(mineru): migrate to MinerU precision API v4 + mineru-api async t…
danielaskdd May 15, 2026
3aa9942
test: update parser-routed PDF defer test to new MinerU env scheme
danielaskdd May 15, 2026
20e79e7
Merge pull request #3075 from HKUDS/claude/analyze-sidecar-format-KwRMl
danielaskdd May 15, 2026
e3ac863
feat(native_parser/docx): route through unified SidecarWriter
danielaskdd May 15, 2026
ae1b0e4
Merge branch 'dev' into claude/native-docx-sidecar-migration
danielaskdd May 15, 2026
0fd78db
fix(native_parser/docx): preserve external/linked drawing paths
danielaskdd May 15, 2026
ab1ebcf
Fix lintings
danielaskdd May 15, 2026
af76abc
fix(sidecar): write trailing newline on per-modality JSON sidecars
danielaskdd May 15, 2026
481a104
test(conftest): scrub MinerU env vars at the start of every test
danielaskdd May 15, 2026
2447168
🐛 fix(native_docx): fix path override handling and add stale blocks w…
danielaskdd May 15, 2026
185bca8
♻️ refactor(native_parser): unify native docx parsing under pipeline …
danielaskdd May 15, 2026
eb4fff8
♻️ refactor(native_parser): centralize debug rag helpers and fix env …
danielaskdd May 15, 2026
1918f6d
♻️ refactor(native_docx): extract per-block state into _BlockBuilder …
danielaskdd May 15, 2026
1999b91
Merge pull request #3077 from danielaskdd/claude/native-docx-sidecar-…
danielaskdd May 15, 2026
d382f3c
♻️ refactor(pipeline): unify content hash computation across raw and …
danielaskdd May 15, 2026
ce12663
✨ feat(pipeline): dedupe cross-filename uploads by normalizing merged…
danielaskdd May 15, 2026
545018d
✅ test(pipeline): un-xfail content_hash cross-filename dedup test
danielaskdd May 15, 2026
668c9dc
Merge pull request #3078 from danielaskdd/claude/cross-filename-conte…
danielaskdd May 15, 2026
c45a166
📝 docs(env): update env.example with correct LIGHTRAG_PARSER format
danielaskdd May 15, 2026
181b578
♻️ refactor(mineru): fix async file upload compatibility with httpx 0…
danielaskdd May 15, 2026
f86ebeb
✨ feat(mineru): split-by-heading block merging with markdown titles
danielaskdd May 16, 2026
8e17a37
Fix lintings
danielaskdd May 16, 2026
a43cff5
✅ test: harden hermetic env fixture and fix MinerU put() fake
danielaskdd May 16, 2026
c2099d5
Merge pull request #3079 from danielaskdd/claude/mineru-heading-merge
danielaskdd May 16, 2026
abb97bb
Fix lintings
danielaskdd May 16, 2026
f8fd8e1
Merge branch 'claude/test-isolation-fixes' into dev
danielaskdd May 16, 2026
dcbf6dd
Update docs
danielaskdd May 16, 2026
8b2022a
✨ feat(mineru): emit page-level positions from page_idx
danielaskdd May 16, 2026
3c50336
🎨 chore(docs): trim trailing whitespace in sidecar format spec
danielaskdd May 16, 2026
6b5a086
Merge pull request #3081 from danielaskdd/claude/mineru-page-positions
danielaskdd May 16, 2026
4878908
📝 docs(FileProcessingConfiguration-zh): expand mineru and docling set…
danielaskdd May 16, 2026
88f6dac
📝 docs(sidecar): update chinese spec and add per-position origin and …
danielaskdd May 17, 2026
393b7a2
📝 docs(sidecar-format): add self_ref and extras fields to chinese spec
danielaskdd May 17, 2026
48311ff
✨ feat(external-parser): scaffold shared helpers and docling skeleton
danielaskdd May 17, 2026
b757afc
✨ feat(external-parser/docling): add raw client + cache + manifest
danielaskdd May 17, 2026
df915a7
✨ feat(external-parser/docling): add DoclingAdapter (JSON → IRDoc)
danielaskdd May 17, 2026
188a8b8
♻️ refactor(pipeline): route parse_docling through sidecar bundle
danielaskdd May 17, 2026
156a0c9
📝 docs(env): trim deprecated DOCLING_* and expose 6 tunables
danielaskdd May 17, 2026
45b8247
📝 docs(FileProcessingConfiguration-zh): document Docling sidecar setup
danielaskdd May 17, 2026
828654e
📝 docs: add Docling sidecar refactor plan
danielaskdd May 17, 2026
0485644
Fix lintings
danielaskdd May 17, 2026
a2a8ab2
📝 docs(docling): add polling budget envs and clarify endpoint usage
danielaskdd May 18, 2026
5368069
🐛 fix(cache): correct options signature validation to use current fix…
danielaskdd May 18, 2026
b9e7d39
🐛 fix(docling): prevent path traversal in image URI resolution
danielaskdd May 18, 2026
e6ef393
♻️ refactor(docling): canonicalize upload filename and fix manifest l…
danielaskdd May 18, 2026
2890077
🐛 fix(docling): prevent furniture text from leaking into sidecar meta…
danielaskdd May 18, 2026
ebfbd77
⚡️ perf(docling): fix OOM when parsing large documents
danielaskdd May 18, 2026
ce285e0
♻️ refactor(docling): normalize endpoint handling with backwards comp…
danielaskdd May 18, 2026
7149873
♻️ refactor(docling): exclude non-body tables and pictures from consu…
danielaskdd May 18, 2026
9f591f8
🔧 chore(config): change default DOCLING_FORCE_OCR from false to true
danielaskdd May 18, 2026
da70983
🐛 fix(docling): fix async multipart upload compatibility with httpx >…
danielaskdd May 18, 2026
8a6a04b
🐛 fix(docling): fix rapid polling loop when server ignores wait param…
danielaskdd May 18, 2026
12cdb38
♻️ refactor(docling): improve formula handling and child traversal logic
danielaskdd May 18, 2026
7af24a0
✨ feat(docling): add debug CLI and skip pictures with missing images
danielaskdd May 18, 2026
43e7925
♻️ refactor(docling): replace children_refs with ocr_texts for pictur…
danielaskdd May 18, 2026
0691f41
📝 docs(docling): remove IRTable.extras from spec and implementation
danielaskdd May 18, 2026
492db70
Merge pull request #3085 from danielaskdd/feat/docling
danielaskdd May 18, 2026
e015dc7
📝 docs(docs): update file processing configuration zh documentation
danielaskdd May 18, 2026
b219115
📝 docs(FileProcessingConfiguration-zh): update docling and mineru doc…
danielaskdd May 18, 2026
70c222a
📝 docs(refactor): delete docling sidecar refactor plan document
danielaskdd May 18, 2026
312323c
♻️ refactor(parser): rename adapters to ir_builder and consolidate pa…
danielaskdd May 18, 2026
e123b61
Merge pull request #3087 from danielaskdd/refact/parser-dir-structure
danielaskdd May 18, 2026
b205cd0
🔧 refactor(parser): unify parser debug CLI into single entry point
danielaskdd May 18, 2026
c7f7cbb
Fix lintings
danielaskdd May 18, 2026
3d1ed5f
✨ feat(pipeline): add parsing progress logging for MinerU and Docling
danielaskdd May 18, 2026
030208f
✨ feat(parser_cli): add early validation for file suffix and engine m…
danielaskdd May 18, 2026
5f295fd
Fix linting
danielaskdd May 18, 2026
d451b4a
Merge pull request #3088 from danielaskdd/feat/parse-cli-debug
danielaskdd May 18, 2026
53b6c33
Merge branch 'main' into dev
danielaskdd May 18, 2026
2f2e811
✨ feat(mineru): add self_ref tracing to content_list items
danielaskdd May 18, 2026
38f6876
♻️ refactor(docx): sanitize asset path resolution in ir builder
danielaskdd May 18, 2026
759c85f
✅ test(native_parser): add security tests for docx drawing asset hand…
danielaskdd May 18, 2026
e5554c9
✨ feat(mineru): add upload_name parameter to strip parser hints from …
danielaskdd May 18, 2026
1377158
♻️ refactor(mineru): replace sync file reads with streaming upload fo…
danielaskdd May 18, 2026
29dbaf9
Merge pull request #3089 from danielaskdd/refact/mineru-enhance
danielaskdd May 18, 2026
224f85b
Merge branch 'main' into dev
danielaskdd May 18, 2026
4224800
♻️ refactor(constants): centralize parsed artifact directory suffixes
danielaskdd May 18, 2026
6a22517
♻️ refactor(pipeline): remove multimodal postprocess hook and fix par…
danielaskdd May 18, 2026
5369a90
🐛 fix(mineru): drop empty table items to prevent analyze worker hard-…
danielaskdd May 18, 2026
6fb0b51
♻️ refactor(docling): drop empty tables from IR to prevent analyze wo…
danielaskdd May 18, 2026
35a7433
Merge pull request #3090 from danielaskdd/fix/empty-table
danielaskdd May 18, 2026
acef76c
🐛 fix(docling): fix cache invalidation for default docling options
danielaskdd May 18, 2026
328b413
✨ feat(doc-status): add per-stage start time and parse-skipped flag
danielaskdd May 18, 2026
8d5509a
🐛 fix(doc-status): clear stale per-attempt fields on PARSING retry
danielaskdd May 18, 2026
bbb9ef6
💄 style(doc-status): group parse_stage_skipped with parsing_start_time
danielaskdd May 18, 2026
410c3b3
Merge pull request #3091 from danielaskdd/feat/add-parse-anlyze-times…
danielaskdd May 18, 2026
8235d7e
♻️ refactor(config): centralize default values for pipeline concurrency
danielaskdd May 18, 2026
c96615f
📝 docs(docs): rename file processing configuration to pipeline docume…
danielaskdd May 18, 2026
d92a640
♻️ refactor(config): reorganize MinerU environment variables and upda…
danielaskdd May 18, 2026
cad7e1e
✨ feat(mineru): add parser options signature to cache validation
danielaskdd May 18, 2026
0155218
♻️ refactor(mineru): centralize parser options into MinerUParserOptio…
danielaskdd May 19, 2026
7ec407a
📝 docs(FileProcessingPipeline-zh): update recommended file processing…
danielaskdd May 19, 2026
7e36fd7
🔧 chore(config): update default pipeline concurrency values
danielaskdd May 19, 2026
de399c7
Merge branch 'feat/mineru-options-signature' into dev
danielaskdd May 19, 2026
238dbac
♻️ refactor(pipeline): unify document path handling and replace sourc…
danielaskdd May 19, 2026
5279aa8
🐛 fix(pipeline): resolve workspace-scoped input file paths
danielaskdd May 19, 2026
99575d1
📝 docs(FileProcessingPipeline-zh): update chunk_options documentation…
danielaskdd May 19, 2026
d12a559
Fix lintings
danielaskdd May 19, 2026
a80574b
Merge pull request #3093 from danielaskdd/refact/full-docs-storage
danielaskdd May 19, 2026
98bdbe9
refactor(pg): align PG storage fields with JSON storage for parity
claude May 19, 2026
a63f6c9
♻️ refactor(postgres): improve schema resilience and partial upsert s…
danielaskdd May 19, 2026
12d6728
♻️ refactor(storage): centralize file path normalization to business …
danielaskdd May 19, 2026
e2cacce
✨ feat(pipeline): add source file resolver with parser engine hints
danielaskdd May 19, 2026
738c223
✨ feat(parser_routing): add filename parser hint validation with erro…
danielaskdd May 19, 2026
24632a7
🐛 fix(parser): enforce leading hyphen for options-only filename hints
danielaskdd May 19, 2026
6ef5423
♻️ refactor(factory): replace lazy_external_import with standard impo…
danielaskdd May 19, 2026
ed09c28
🐛 fix(postgres): remove duplicate partial index creation
danielaskdd May 20, 2026
8022ef4
Merge pull request #3094 from HKUDS/claude/refactor-pg-storage-zIrOi
danielaskdd May 20, 2026
246702f
✨ feat(redis): add basename and content_hash lookups for doc status
danielaskdd May 20, 2026
be9c24b
refactor(mongo): align doc-status storage with JSON storage for parity
claude May 20, 2026
c149fde
Merge pull request #3098 from danielaskdd/refact/redis
danielaskdd May 20, 2026
efa44a5
Fix lintings
danielaskdd May 20, 2026
1d4142a
Merge branch 'dev' into claude/sync-mongo-storage-updates-jSuYr
danielaskdd May 20, 2026
b94e698
✨ feat(opensearch): add basename and content_hash lookups for doc status
claude May 20, 2026
324cca3
Merge pull request #3099 from HKUDS/claude/sync-mongo-storage-updates…
danielaskdd May 20, 2026
71406ba
✅ test(mongo): add tests for pymongo error handling in doc lookup
danielaskdd May 20, 2026
49d07fe
Merge branch 'claude/sync-mongo-storage-updates-jSuYr' into dev
danielaskdd May 20, 2026
4aff9ae
Merge branch 'dev' into claude/update-opensearch-storage-TNxo8
danielaskdd May 20, 2026
66f3a9a
Merge pull request #3100 from HKUDS/claude/update-opensearch-storage-…
danielaskdd May 20, 2026
cc102ac
Fix lintings
danielaskdd May 20, 2026
1e3283d
🔧 chore(setup): remove redundant default slash in redis uri normaliza…
danielaskdd May 20, 2026
46576b2
🔧 chore(docker): add prompt directory configuration across docker and…
danielaskdd May 20, 2026
69aed36
🔧 refactor(logging): unify and simplify pipeline log messages
danielaskdd May 20, 2026
ac025b1
♻️ refactor(doc-status-storage): convert concrete methods to abstract…
danielaskdd May 20, 2026
f3dddb1
📝 docs(docs): update log message in file processing pipeline document…
danielaskdd May 20, 2026
c050cc4
🔧 chore(gitignore): update ignore rules for prompts directory
danielaskdd May 20, 2026
de2917e
🔧 chore(memgraph): comment out exposed memgraph ports in docker template
danielaskdd May 20, 2026
61df440
♻️ refactor(pipeline): update chunking log messages to use doc_id
danielaskdd May 20, 2026
2213b31
🔧 chore(gitignore): update prompts ignore pattern
danielaskdd May 20, 2026
ab892c8
📝 docs(env.example): update entity extraction json comments
danielaskdd May 20, 2026
de72efc
✨ feat(pipeline-status): probe + throttled refresh for prompt scan/up…
danielaskdd May 20, 2026
6da3f9c
Bump API version to 0294
danielaskdd May 20, 2026
295babc
Merge pull request #3101 from danielaskdd/refact/pipeline-status-refresh
danielaskdd May 20, 2026
2db077a
📝 docs: translate file processing pipeline docs to English
claude May 20, 2026
dd5b19d
Merge remote-tracking branch 'upstream/claude/translate-docs-to-engli…
danielaskdd May 20, 2026
e242596
📝 docs(pipeline): remove deprecated appendix from file processing docs
danielaskdd May 20, 2026
c0d3ebd
📝 docs(README): update news section with latest features
danielaskdd May 20, 2026
5597244
✨ feat(chunker): give P strategy a dedicated default chunk_token_size
danielaskdd May 20, 2026
33f3366
Merge pull request #3102 from danielaskdd/feat/p-chunk-size-dedicated…
danielaskdd May 20, 2026
d4be4f8
🐛 fix(pipeline): change analyze_multimodal to always recompute enable…
danielaskdd May 20, 2026
8368f74
♻️ refactor(external_parser): unify HTTP error handling with detailed…
danielaskdd May 20, 2026
9aaf7b8
Bump API version to 0295
danielaskdd May 20, 2026
a997100
📝 docs(FileProcessingPipeline): fix env file reference in documentation
danielaskdd May 21, 2026
011a594
✨ feat(pipeline): add per-document I/O failure handling for lightrag …
danielaskdd May 21, 2026
aac652a
🔧 chore(config): update env.example with new parser configuration
danielaskdd May 21, 2026
3f9f47a
📝 docs(FileProcessingPipeline): restructure quick start guides and ad…
danielaskdd May 21, 2026
8 changes: 2 additions & 6 deletions .gitignore
@@ -55,7 +55,7 @@ output/
rag_storage/
data/

# Runtime-provided entity type prompt profiles (keep sample files only)
# User-customized prompt directory
prompts/entity_type/

# Evaluation results
@@ -78,14 +78,10 @@ ignore_this.txt
# temporary test files in project root
/test_*

# Cline files
# AI Agent files
memory-bank
.claude/CLAUDE.md
.claude/

# Claude Code
CLAUDE.md

# Google Jules
.jules/

372 changes: 313 additions & 59 deletions AGENTS.md

Large diffs are not rendered by default.

347 changes: 1 addition & 346 deletions CLAUDE.md
@@ -1,346 +1 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

LightRAG is a Retrieval-Augmented Generation (RAG) framework that uses graph-based knowledge representation for enhanced information retrieval. The system extracts entities and relationships from documents, builds a knowledge graph, and answers queries with multi-mode retrieval (local, global, hybrid, mix, naive).

## Core Architecture

### Key Components

- **lightrag.py**: Main orchestrator class (`LightRAG`) that coordinates document insertion, query processing, and storage management. Critical: Always call `await rag.initialize_storages()` after instantiation.

- **operate.py**: Core extraction and query operations including entity/relation extraction, chunking, and multi-mode retrieval logic.

- **base.py**: Abstract base classes for storage backends (`BaseKVStorage`, `BaseVectorStorage`, `BaseGraphStorage`, `BaseDocStatusStorage`).

- **kg/**: Storage implementations (JSON, NetworkX, Neo4j, PostgreSQL, MongoDB, Redis, Milvus, Qdrant, Faiss, Memgraph). Each storage type provides different trade-offs for production vs. development use.

- **llm/**: LLM provider bindings (OpenAI, Ollama, Azure, Gemini, Bedrock, Anthropic, etc.). All use async patterns with caching support.

- **api/**: FastAPI server (`lightrag_server.py`) with REST endpoints and Ollama-compatible API, plus React 19 + TypeScript WebUI.

### Storage Layer

LightRAG uses 4 storage types with pluggable backends:
- **KV_STORAGE**: LLM response cache, text chunks, document info
- **VECTOR_STORAGE**: Entity/relation/chunk embeddings
- **GRAPH_STORAGE**: Entity-relation graph structure
- **DOC_STATUS_STORAGE**: Document processing status tracking

Workspace isolation is implemented differently per storage type (subdirectories for file-based, prefixes for collections, fields for relational DBs).

### Query Modes

- **local**: Context-dependent retrieval focused on specific entities
- **global**: Community/summary-based broad knowledge retrieval
- **hybrid**: Combines local and global
- **naive**: Direct vector search without graph
- **mix**: Integrates KG and vector retrieval (recommended with reranker)

## Development Commands

### Setup
```bash
# Install core package (development mode)
uv sync
source .venv/bin/activate # Or: .venv\Scripts\activate on Windows

# Install with API support
uv sync --extra api

# Install specific extras
uv sync --extra offline-storage # Storage backends
uv sync --extra offline-llm # LLM providers
uv sync --extra test # Testing dependencies
```

### API Server
```bash
# Copy and configure environment
cp env.example .env # Edit with your LLM/embedding configs

# Build WebUI
cd lightrag_webui
bun install --frozen-lockfile
bun run build
cd ..

# Run server
lightrag-server # Production
uvicorn lightrag.api.lightrag_server:app --reload # Development
lightrag-gunicorn # Multi-worker (gunicorn)
```

### Testing
```bash
# Run offline tests (default)
python -m pytest tests

# Run integration tests (requires external services)
python -m pytest tests --run-integration
# Or set: LIGHTRAG_RUN_INTEGRATION=true

# Run specific test file
python test_graph_storage.py

# Keep artifacts for debugging
python -m pytest tests --keep-artifacts

# Run with custom workers
python -m pytest tests --test-workers 4
```

### Linting
```bash
ruff check .
```

## Key Implementation Patterns

### LightRAG Initialization (Critical)

The most common error is forgetting to initialize storages:

```python
import asyncio
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete, openai_embed

async def main():
    rag = LightRAG(
        working_dir="./rag_storage",
        llm_model_func=gpt_4o_mini_complete,
        embedding_func=openai_embed
    )

    # REQUIRED: Initialize storage backends
    await rag.initialize_storages()

    # Now safe to use
    await rag.ainsert("Your text here")
    result = await rag.aquery("Your question", param=QueryParam(mode="hybrid"))

    # Cleanup
    await rag.finalize_storages()

asyncio.run(main())
```

### Custom Embedding Functions

Use the `@wrap_embedding_func_with_attrs` decorator, and call `.func` on an already-wrapped embedding function when wrapping it again:

```python
import numpy as np
from lightrag.llm.openai import openai_embed
from lightrag.utils import wrap_embedding_func_with_attrs

# text-embedding-3-large returns 3072-dimensional vectors by default
@wrap_embedding_func_with_attrs(embedding_dim=3072, max_token_size=8192)
async def custom_embed(texts: list[str]) -> np.ndarray:
    # Call underlying function, not wrapped version
    return await openai_embed.func(texts, model="text-embedding-3-large")
```

### Storage Configuration

Configure via environment variables or constructor params:

```python
# Environment-based (recommended for production)
# See env.example for full list

# Constructor-based
rag = LightRAG(
    working_dir="./storage",
    workspace="project_name",  # For data isolation
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage",
    graph_storage="Neo4JStorage",
    doc_status_storage="PGDocStatusStorage",
    vector_db_storage_cls_kwargs={
        "cosine_better_than_threshold": 0.2
    }
)
```

### Document Insertion

```python
# Single document
await rag.ainsert("Text content")

# Batch insertion
await rag.ainsert(["Text 1", "Text 2", ...])

# With custom IDs
await rag.ainsert("Text", ids=["doc-123"])

# With file paths (for citation)
await rag.ainsert(["Text 1", "Text 2"], file_paths=["doc1.pdf", "doc2.pdf"])

# Configure batch size
rag = LightRAG(..., max_parallel_insert=4) # Default: 2, max recommended: 10
```

### Query Configuration

```python
from lightrag import QueryParam

result = await rag.aquery(
    "Your question",
    param=QueryParam(
        mode="mix",  # Recommended with reranker
        top_k=60,  # KG entities/relations to retrieve
        chunk_top_k=20,  # Text chunks to retrieve
        max_entity_tokens=6000,
        max_relation_tokens=8000,
        max_total_tokens=30000,
        enable_rerank=True,
        user_prompt="Additional instructions for LLM",
        stream=False
    )
)
```

## WebUI Development

### Structure
- `lightrag_webui/src/`: React components (TypeScript)
- Uses Vite + Bun build system
- Tailwind CSS for styling
- React 19 with functional components and hooks

### Commands
```bash
cd lightrag_webui
bun install --frozen-lockfile # Install dependencies
bun run dev # Development server (Node + Vite)
bun run dev:bun # Development server (Bun native)
bun run build # Production build
bun run preview # Preview production build locally

# Linting (ESLint with TypeScript, React hooks, Stylistic rules)
bun run lint # Run ESLint on all *.ts/tsx/js/jsx files

# Testing (Bun built-in test runner)
bun test # Run all tests
bun test --watch # Watch mode
bun test --coverage # With coverage report
bun test src/api/lightrag.test.ts # Run a single test file
```

### Lint Rules
ESLint is configured with TypeScript-ESLint, React Hooks plugin, Prettier integration, and `@stylistic` rules:
- 2-space indentation, single quotes enforced
- `@typescript-eslint/no-explicit-any` is disabled (allowed)

## Common Issues

### 1. Storage Not Initialized
**Error**: `AttributeError: __aenter__` or `KeyError: 'history_messages'`
**Solution**: Always call `await rag.initialize_storages()` after creating the LightRAG instance

### 2. Embedding Model Changes
When switching embedding models, you MUST clear the data directory (except optionally `kv_store_llm_response_cache.json` for LLM cache).
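
A minimal sketch of that cleanup for file-based storage, using only the standard library. The helper name `clear_data_dir` is hypothetical and not part of LightRAG. It assumes the cache file sits at the top level of the working directory, so any workspace subdirectory is removed along with whatever cache it contains.

```python
import shutil
from pathlib import Path

# File name taken from the note above; everything else here is illustrative.
LLM_CACHE_FILE = "kv_store_llm_response_cache.json"


def clear_data_dir(working_dir: str, keep_llm_cache: bool = True) -> list[str]:
    """Delete every top-level entry in working_dir, optionally sparing the LLM cache.

    Returns the sorted names of the entries that were removed.
    """
    root = Path(working_dir)
    removed = []
    for entry in root.iterdir():
        if keep_llm_cache and entry.name == LLM_CACHE_FILE:
            continue  # the LLM response cache does not depend on the embedding model
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)
```

Run it only while no LightRAG instance is using the directory. Database-backed storages need their own cleanup.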

### 3. Nested Embedding Functions
An already-decorated embedding function cannot be wrapped again. Use `.func` to access the underlying function:
```python
# Wrong: EmbeddingFunc(func=openai_embed)
# Right: EmbeddingFunc(func=openai_embed.func)
```

### 4. Context Length for Ollama
Ollama models default to 8k context; LightRAG requires 32k+. Configure via:
```python
llm_model_kwargs={"options": {"num_ctx": 32768}}
```

## Configuration Files

### .env Configuration
Primary configuration file for API server. Key sections:
- Server settings (HOST, PORT, CORS)
- Storage backends (connection strings via environment variables)
- Query parameters (TOP_K, MAX_TOTAL_TOKENS, etc.)
- Reranking configuration (RERANK_BINDING, RERANK_MODEL)
- Authentication (AUTH_ACCOUNTS, LIGHTRAG_API_KEY)

See `env.example` for comprehensive template.

### Workspace Isolation
Each LightRAG instance can use a `workspace` parameter for data isolation. Implementation varies by storage type:
- File-based: subdirectories
- Collection-based: collection name prefixes
- Relational DB: workspace column filtering
- Qdrant: payload-based partitioning
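
The first three strategies can be pictured with a small sketch. It is purely illustrative: the function names, the `_` separator, and the `workspace` column name are assumptions made here, not LightRAG's actual implementation, and the Qdrant payload case is omitted.

```python
from pathlib import Path


def file_based_location(working_dir: str, workspace: str) -> Path:
    """File-based backends: one subdirectory per workspace (empty workspace = root)."""
    base = Path(working_dir)
    return base / workspace if workspace else base


def collection_name(workspace: str, base_name: str) -> str:
    """Collection-based backends: prefix the collection name with the workspace."""
    return f"{workspace}_{base_name}" if workspace else base_name


def relational_filter(workspace: str) -> tuple[str, tuple[str, ...]]:
    """Relational backends: a parameterized WHERE fragment on a workspace column."""
    return "workspace = %s", (workspace,)
```

In each case the same logical store is shared, and the workspace value alone decides which rows, collections, or files an instance can see.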

## Testing Guidelines

### Test Structure
- `tests/`: Main test suite (mirrors feature folders)
- `test_*.py` in root: Specific integration tests
- Markers: `offline`, `integration`, `requires_db`, `requires_api`

### Running Tests
```bash
# Default: runs only offline tests
pytest tests

# Include integration tests
pytest tests --run-integration

# Keep test artifacts for debugging
pytest tests --keep-artifacts

# Configure test workers
pytest tests --test-workers 4
```

### Environment Variables for Tests
Set `LIGHTRAG_*` variables for integration tests:
- `LIGHTRAG_RUN_INTEGRATION=true`
- `LIGHTRAG_KEEP_ARTIFACTS=true`
- `LIGHTRAG_TEST_WORKERS=4`
- Plus storage-specific connection strings
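
As a rough illustration of how such variables might be consumed, the sketch below reads the three flags into a settings dictionary. The helper names, the set of accepted truthy strings, and the default of one worker are assumptions for this example. The project's own test configuration may parse them differently.

```python
import os

# Assumed set of strings treated as "true"; not taken from the project.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def read_test_settings() -> dict:
    """Collect the LIGHTRAG_* test switches listed above into one dictionary."""
    return {
        "run_integration": env_flag("LIGHTRAG_RUN_INTEGRATION"),
        "keep_artifacts": env_flag("LIGHTRAG_KEEP_ARTIFACTS"),
        # Hypothetical default of 1 worker when the variable is unset.
        "workers": int(os.environ.get("LIGHTRAG_TEST_WORKERS", "1")),
    }
```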

## Code Style

### Language
- Comment language: use English for comments and documentation
- Backend language: use English for backend code and messages
- Frontend internationalization: use i18next for multi-language support

### Python
- Follow PEP 8 with 4-space indentation
- Use type annotations
- Prefer dataclasses for state management
- Use `lightrag.utils.logger` instead of print
- Async/await patterns throughout
- Keep storage implementations in `kg/` with consistent base class inheritance

### TypeScript/React
- Functional components with hooks
- 2-space indentation
- PascalCase for components
- Tailwind utility-first styling

## Important Architectural Notes

### LLM Requirements
- Minimum 32B parameters recommended
- 32K-token context minimum (64K recommended)
- Avoid reasoning models during indexing
- Stronger models for query stage than indexing stage

### Embedding Models
- Must be consistent across indexing and querying
- Recommended: `BAAI/bge-m3`, `text-embedding-3-large`
- Changing models requires clearing vector storage and recreating with new dimensions

### Reranker Configuration
- Significantly improves retrieval quality
- Recommended models: `BAAI/bge-reranker-v2-m3`, Jina rerankers
- Use "mix" mode when reranker is enabled
Strictly follow the rules in ./AGENTS.md
3 changes: 2 additions & 1 deletion Dockerfile
@@ -93,7 +93,7 @@ RUN --mount=type=cache,target=/root/.local/share/uv \
&& /app/.venv/bin/python -m ensurepip --upgrade

# Create persistent data directories AFTER package installation
RUN mkdir -p /app/data/rag_storage /app/data/inputs /app/data/tiktoken
RUN mkdir -p /app/data/rag_storage /app/data/inputs /app/data/prompts /app/data/tiktoken

# Copy offline cache into the newly created directory
COPY --from=builder /app/data/tiktoken /app/data/tiktoken
@@ -102,6 +102,7 @@ COPY --from=builder /app/data/tiktoken /app/data/tiktoken
ENV TIKTOKEN_CACHE_DIR=/app/data/tiktoken
ENV WORKING_DIR=/app/data/rag_storage
ENV INPUT_DIR=/app/data/inputs
ENV PROMPT_DIR=/app/data/prompts

# Expose API port
EXPOSE 9621