fix(parsers): drop empty-bodied tables to prevent analyze worker hard-failure #3090
Merged
…failure
- add `_ir_table_body_has_content` helper to detect tables with no visible content
- filter out misidentified table items in `_build_ir_table` before IR insertion
- add defensive fallback in `analyze_multimodal` for sidecars with empty-bodied tables
- add test covering absent, empty string, empty list, and blank-cell table bodies
…rker failures
- add `_table_rows_have_content` helper to detect visible cell text
- return `None` from `_build_ir_table` when table has no content
- skip placeholder allocation in `_handle_table` for dropped tables
- mirror existing MinerU-side empty table filter behavior
- add test covering four shapes of empty table input
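For reviewers who want a concrete picture of the content check, here is a minimal sketch of the idea, not the code in this PR. The names `table_body_has_content` and `build_table_or_none`, the `body` key, and the returned dict shape are all hypothetical; the real helpers (`_ir_table_body_has_content`, `_table_rows_have_content`, `_build_ir_table`) may have different signatures and inputs.

```python
from typing import Any, Optional


def table_body_has_content(body: Any) -> bool:
    """Return True if a table body carries any visible text.

    Hypothetical stand-in for the helpers named in the commits above.
    """
    if body is None:
        # Absent body.
        return False
    if isinstance(body, str):
        # Empty or whitespace-only string body.
        return bool(body.strip())
    if isinstance(body, (list, tuple)):
        # Recurse so flat cell lists and row grids are both handled;
        # an empty list yields False because any() of nothing is False.
        return any(table_body_has_content(cell) for cell in body)
    # Any other value (e.g. a number) counts if it renders as non-blank text.
    return bool(str(body).strip())


def build_table_or_none(item: dict) -> Optional[dict]:
    """Return an IR-like dict, or None when the table should be dropped.

    The "body" key and the output shape are assumptions for illustration.
    """
    body = item.get("body")
    if not table_body_has_content(body):
        return None
    return {"kind": "table", "content": body}
```

Returning `None` lets the caller skip placeholder allocation entirely, so a dropped table leaves no trace in the IR.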
Summary
- `table` items whose body is empty (e.g. when a page number or blank region is misidentified as a table) were kept by the previous IR builders, so the writer landed them in `tables.json` with `content=""`. The multimodal analyze worker then hard-failed with `ERROR: Analyze worker failed: table/tb-…: missing table content`, marking the whole document FAILED.
- `lightrag/external_parser/mineru/ir_builder.py` and `lightrag/external_parser/docling/ir_builder.py` now drop body-less table items before they enter the IR. No placeholder, no `<table>` tag, no position leakage. An `info` log records the dropped item for ops visibility.
- `_analyze_text_modality` in `lightrag/pipeline.py` no longer raises for `kind=="table"` with empty content; it returns `status="skipped"` with a warning so a stray empty entry in an old sidecar can't bring down the whole document. `image`/`equation` keep the strict raise.

Test plan
- `tests/external_parser/mineru/test_ir_builder.py::test_adapter_empty_table_dropped`: covers absent body, empty string body, empty list, and grids of only blank cells
- `tests/external_parser/docling/test_ir_builder.py::test_docling_adapter_empty_table_dropped`: covers missing `data`, empty grid, blank-cell grid, and blank `table_cells` fallback
- `tests/test_pipeline_analyze_multimodal.py`: full suite passes (15 tests), confirming the analyze worker change is backward-compatible

🤖 Generated with Claude Code
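The analyze-side fallback from the summary (skip empty tables, keep the strict raise for other kinds) could look roughly like the sketch below. It is an illustration under assumptions, not the code in `lightrag/pipeline.py`: `AnalyzeResult`, `analyze_item`, the `"ok"` status value, and the `ValueError` type are hypothetical; only the `status="skipped"` outcome and the warning come from the description above.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class AnalyzeResult:
    # Hypothetical result type; only the "skipped" status is named in the PR.
    status: str
    detail: str = ""


def analyze_item(kind: str, item_id: str, content: str) -> AnalyzeResult:
    """Analyze one sidecar entry, tolerating empty tables only."""
    if content and content.strip():
        # Real analysis would happen here; "ok" is an assumed status value.
        return AnalyzeResult(status="ok")
    if kind == "table":
        # A stray empty table (e.g. from an old sidecar) is skipped with a
        # warning instead of failing the whole document.
        logger.warning("skipping %s/%s: empty table content", kind, item_id)
        return AnalyzeResult(status="skipped", detail="empty table content")
    # Other kinds (image, equation) keep the strict failure.
    raise ValueError(f"{kind}/{item_id}: missing {kind} content")
```

Limiting the leniency to `kind == "table"` keeps the blast radius small: an empty image or equation entry still surfaces as an error rather than being silently skipped.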