fix(parsers): drop empty-bodied tables to prevent analyze worker hard-failure #3090

Merged
danielaskdd merged 2 commits into HKUDS:dev from danielaskdd:fix/empty-table
May 18, 2026

@danielaskdd
Collaborator

Summary

  • Root cause: MinerU (and, defensively, Docling) occasionally emit table items whose body is empty — e.g. when a page number / blank region is misidentified as a table. The previous IR builders kept these items, so the writer landed them in tables.json with content="" and the multimodal analyze worker hard-failed with ERROR: Analyze worker failed: table/tb-…: missing table content, marking the whole document FAILED.
  • Fix (two layers):
    1. IR builder filter — both lightrag/external_parser/mineru/ir_builder.py and lightrag/external_parser/docling/ir_builder.py now drop body-less table items before they enter the IR. No placeholder, no <table> tag, no position leakage. An info log records the dropped item for ops visibility.
    2. Analyze worker defensive fallback: _analyze_text_modality in lightrag/pipeline.py no longer raises for kind=="table" with empty content; it returns status="skipped" with a warning so a stray empty entry in an old sidecar can't bring down the whole document. image/equation keep the strict raise.
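
For readers skimming the second layer, here is a minimal sketch of the skip-versus-raise behavior described above. It is not the code in lightrag/pipeline.py: the function name `analyze_item`, its parameters, and the returned dict shape are assumptions made for illustration only.

```python
import logging
from typing import Optional

logger = logging.getLogger("analyze_sketch")


def analyze_item(kind: str, item_id: str, content: Optional[str]) -> dict:
    """Hypothetical stand-in for the analyze step (not the LightRAG function)."""
    if not (content or "").strip():
        if kind == "table":
            # Tolerate a stray empty table (e.g. from an old sidecar):
            # warn and skip instead of failing the whole document.
            logger.warning("%s/%s: empty table content, skipping", kind, item_id)
            return {"id": item_id, "status": "skipped"}
        # image/equation keep the strict behavior.
        raise ValueError(f"{kind}/{item_id}: missing {kind} content")
    # Real analysis would happen here; the sketch only reports success.
    return {"id": item_id, "status": "ok"}
```

Only the empty-table case is softened; an empty image or equation still raises, which matches the intent stated in the summary.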

Test plan

  • tests/external_parser/mineru/test_ir_builder.py::test_adapter_empty_table_dropped — covers absent body, empty string body, empty list, and grids of only blank cells
  • tests/external_parser/docling/test_ir_builder.py::test_docling_adapter_empty_table_dropped — covers missing data, empty grid, blank-cell grid, and blank table_cells fallback
  • tests/test_pipeline_analyze_multimodal.py — full suite passes (15 tests), confirming the analyze worker change is backward-compatible
  • Full external_parser + sidecar regression: 164 tests pass

🤖 Generated with Claude Code

…failure

- add `_ir_table_body_has_content` helper to detect tables with no visible content
- filter out misidentified table items in `_build_ir_table` before IR insertion
- add defensive fallback in `analyze_multimodal` for sidecars with empty-bodied tables
- add test covering absent, empty string, empty list, and blank-cell table bodies
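
A rough sketch of what a content check over the four body shapes listed above (absent, empty string, empty list, blank-cell grid) might look like. Only the helper name `_ir_table_body_has_content` appears in this commit message; the function below uses a different, hypothetical name, and its signature, the HTML tag stripping, and the recursion over nested lists are all assumptions.

```python
import re
from typing import Any

# Assumption: string bodies may be HTML, so tags are stripped before checking.
_TAG_RE = re.compile(r"<[^>]+>")


def table_body_has_content(body: Any) -> bool:
    """Hypothetical check: True if the table body holds any visible text."""
    if body is None:
        return False  # absent body
    if isinstance(body, str):
        # Empty string, whitespace only, or markup with no text all count as empty.
        return bool(_TAG_RE.sub("", body).strip())
    if isinstance(body, (list, tuple)):
        # Empty list or a grid made only of blank cells counts as empty.
        return any(table_body_has_content(cell) for cell in body)
    return bool(str(body).strip())
```

Usage: a builder would call this before creating the IR table and drop the item (with an info log) when it returns `False`.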
…rker failures

- add `_table_rows_have_content` helper to detect visible cell text
- return `None` from `_build_ir_table` when table has no content
- skip placeholder allocation in `_handle_table` for dropped tables
- mirror existing MinerU-side empty table filter behavior
- add test covering four shapes of empty table input
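
The "return `None`, then skip placeholder allocation" flow from this commit can be sketched as follows. This is a simplified illustration under stated assumptions, not the Docling adapter itself: `IRTable`, `IRDoc`, `rows_have_content`, `build_table`, `handle_table`, the `data`/`grid` input layout, and the `[table-N]` placeholder format are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class IRTable:
    rows: List[List[Any]]


@dataclass
class IRDoc:
    text_parts: List[str] = field(default_factory=list)
    tables: List[IRTable] = field(default_factory=list)


def _cell_text(cell: Any) -> str:
    # Assumption: a cell is either a dict with a "text" key or a plain value.
    value = cell.get("text") if isinstance(cell, dict) else cell
    return "" if value is None else str(value)


def rows_have_content(rows: Optional[List[List[Any]]]) -> bool:
    """True if any cell in any row carries visible text."""
    return any(_cell_text(cell).strip() for row in rows or [] for cell in row)


def build_table(raw: dict) -> Optional[IRTable]:
    rows = (raw.get("data") or {}).get("grid") or []
    if not rows_have_content(rows):
        return None  # signal "dropped" to the caller
    return IRTable(rows=rows)


def handle_table(doc: IRDoc, raw: dict) -> None:
    table = build_table(raw)
    if table is None:
        return  # dropped: no placeholder and no table id are allocated
    doc.tables.append(table)
    doc.text_parts.append(f"[table-{len(doc.tables)}]")
```

Returning `None` keeps the decision in one place: the handler never reserves a placeholder for a table that will not exist, so no position leaks into the text.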
@danielaskdd danielaskdd merged commit 35a7433 into HKUDS:dev May 18, 2026
3 checks passed
@danielaskdd danielaskdd deleted the fix/empty-table branch May 18, 2026 17:15