
241 issue expand document loader coverage #262

Open
drr00t wants to merge 9 commits into FalkorDB:main from drr00t:241-issue-expand-document-loader-coverage

Conversation

@drr00t
Contributor

@drr00t drr00t commented May 19, 2026

Summary

This PR extends the loader interface with Docling-based loaders, adding support for DOCX, XLSX, PPTX, CSV, HTML, and XHTML files.

Changes

  • DoclingBaseLoader: Base class wrapping the Docling document model.

Test Plan

  • All existing tests pass (pytest tests/ -q)
  • New tests added for new functionality (if applicable)
  • Lint passes (ruff check src/)

Notes

Currently, the GraphRAG_SDK loader interface maps one loader to one file extension. Because Docling handles multiple formats, this PR changes the loader implementation so that a single loader can serve several extensions.
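
As a rough illustration of the multi-format idea described in the notes, the sketch below maps several extensions to one loader class and falls back to a text loader. The classes are empty stand-ins that only reuse names mentioned in this PR. `LOADER_EXTENSIONS`, `EXTENSION_TO_LOADER`, and `default_loader_for` are hypothetical names for this example and are not taken from the SDK.

```python
from pathlib import Path


# Hypothetical stand-ins for the SDK's loader classes; only the names are
# borrowed from this PR, the bodies are intentionally empty.
class TextLoader: ...
class DocxLoader: ...
class HtmlLoader: ...


# One loader may claim several extensions (for example .html and .xhtml).
LOADER_EXTENSIONS = {
    DocxLoader: (".docx",),
    HtmlLoader: (".html", ".xhtml"),
}

# Invert into a flat lookup table: extension -> loader class.
EXTENSION_TO_LOADER = {
    ext: loader for loader, exts in LOADER_EXTENSIONS.items() for ext in exts
}


def default_loader_for(path: str):
    """Pick a loader class from the file extension, falling back to TextLoader."""
    suffix = Path(path).suffix.lower()
    return EXTENSION_TO_LOADER.get(suffix, TextLoader)
```

With this layout, adding a format to an existing loader is a one-line change to the tuple of extensions rather than a new loader class.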

Summary by CodeRabbit

  • New Features

    • Added ingestion support for Word, Excel, PowerPoint, HTML and CSV files and new docling-backed loaders.
    • Updated default document chunking to better preserve sentence boundaries and structure.
  • Chores

    • Introduced an optional "docling" dependency and included it in the aggregate "all" extras.
    • Test runner config updated to filter known deprecation/warning messages.
  • Tests

    • Added comprehensive tests and helper scripts for docling-based loading and import-failure scenarios.

Review Change Stack

drr00t and others added 3 commits May 17, 2026 17:10
…te()

Changes the default chunker that ``GraphRAG.ingest()`` and
``GraphRAG.update()`` fall back to when the caller doesn't pass an
explicit ``chunker=``. Was ``FixedSizeChunking()``; now
``SentenceTokenCapChunking()`` (sentence-aware, max_tokens=512,
overlap_sentences=2 — the strategy's own defaults).

Why
---
``FixedSizeChunking`` splits on a hard character window with no awareness
of sentence, word, or paragraph boundaries. When the window cuts through
an entity name, the per-chunk LLM extractor produces a stub entity for
the fragment (``"Wayne Enterprises"`` → ``"Wayne En"`` in chunk N plus
unparsable text in chunk N+1). These stubs never merge with their full
forms during resolution because their embeddings differ enough that
LLMVerifiedResolution scores them below the soft threshold.

This silently inflates cypher counts and pollutes "which X" lists. The
strategy that surfaced this — ``CypherFirstAggregationStrategy`` — was
hitting a 6/7 ceiling on the internal aggregation benchmark with one
question failing because of these stubs. Switching to
``SentenceTokenCapChunking`` cleared the benchmark to 7/7 stable across
three runs, and the post-ingest graph state went from 11-14 organization
nodes (including ``Glo`` / ``Initech System`` / ``Wayne En``) to exactly
10 clean orgs, and from 66-80 ``Person`` nodes (with ``Carla`` / ``Carla
Okafor`` duplicates) to exactly 56 distinct persons — matching the
corpus.

A side benefit: sentence-aware chunks with 2-sentence overlap almost
always keep a person's first mention in the same chunk as their later
short-form references, so per-chunk FastCoref now binds ``Carla → Carla
Okafor`` reliably. That eliminates the short-form-duplicate class too,
not just the truncation stubs.
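
The truncation effect described above can be reproduced with a small sketch. Both functions are simplified stand-ins written for this example, not the SDK's ``FixedSizeChunking`` or ``SentenceTokenCapChunking``: the sentence packer uses a character budget instead of a token cap and omits overlap for brevity.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split on a hard character window, ignoring word and sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than split.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Carla Okafor joined Wayne Enterprises in 2019. Carla now leads the audit team."
window_chunks = fixed_size_chunks(text, 28)
sentence_aware = sentence_chunks(text, 50)
```

With a 28-character window the first chunk ends in ``Wayne En`` and the second begins with ``terprises``, which is the stub pattern described above. The sentence-aware variant returns the two sentences intact.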

Compatibility
-------------
``FixedSizeChunking`` remains exported and fully supported — callers who
explicitly pass ``chunker=FixedSizeChunking()`` get unchanged behavior.
Existing tests (748 passed, 24 skipped) pass without modification: no
test in the suite asserts on chunk count or content shape from the
default chunker, so switching defaults doesn't break the suite.

Callers who relied on the previous default and want to keep it should
pass ``chunker=FixedSizeChunking()`` explicitly. The docstrings call out
the new default and reference ``FixedSizeChunking`` as the opt-in
character-window alternative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@drr00t drr00t marked this pull request as draft May 19, 2026 00:27
@coderabbitai

coderabbitai Bot commented May 19, 2026

📝 Walkthrough

Adds DoclingBaseLoader and format-specific loaders (DOCX/XLSX/PPTX/HTML/CSV), exports them from the loaders package, adds an optional docling extra, wires auto-detection and loader selection into the main API, changes default chunker to SentenceTokenCapChunking, and adds tests and debug scripts for the new loaders.

Changes

Docling-based document loaders and API updates

| Layer / File(s) | Summary |
|---|---|
| Dependencies and DoclingBaseLoader base implementation: `graphrag_sdk/pyproject.toml`, `graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_base.py` | Adds docling>=2.91.0 optional extra and implements `DoclingBaseLoader(LoaderStrategy)` with async `load()` delegating to `_load_sync()`, file validation, import-time error wrapping, DocumentConverter usage, breadcrumb hierarchy tracking, label → element mapping, and assembly of DocumentOutput with text, elements, and metadata. |
| Format-specific loader implementations: `graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docx_loader.py`, `.../xlsx_loader.py`, `.../pptx_loader.py`, `.../html_loader.py`, `.../csv_loader.py`, `.../__init__.py` | Introduces DocxLoader, XlsxLoader, PptxLoader, HtmlLoader, and CsvLoader as DoclingBaseLoader subclasses and expands `__all__` to re-export them. |
| Main API wiring and defaults: `graphrag_sdk/src/graphrag_sdk/api/main.py` | Imports new loaders and SentenceTokenCapChunking; updates `_default_loader_for()` to detect .docx, .xlsx, .pptx, .html/.xhtml, .csv with TextLoader fallback; changes `_ingest_single()`, `update()`, and docstrings to default to SentenceTokenCapChunking. |
| Tests, debug harnesses, and import-mocking scripts: `graphrag_sdk/tests/test_docling_loaders.py`, `graphrag_sdk/test_*.py`, `test_*.py` | Adds TestDoclingBaseLoader tests validating missing-dependency handling, label mapping, breadcrumb construction, conversion errors, export-to-markdown fallback, and specialized element types; adds multiple debug/monkeypatch scripts that mock docling imports for local testing. |
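
The "breadcrumb hierarchy tracking" mentioned in the summary can be pictured as a heading stack. The sketch below is one common way to do it, under the assumption that converted items arrive as a flat sequence carrying a heading level. The `(kind, level, text)` tuple shape and the `build_breadcrumbs` name are hypothetical and do not reflect the actual `DoclingBaseLoader` code.

```python
def build_breadcrumbs(items):
    """Return (breadcrumb, text) pairs for body items under a heading stack.

    `items` is a list of (kind, level, text) tuples, a hypothetical flattened
    view of a converted document; kind is "heading" or "text".
    """
    stack = []  # (level, title) pairs for the currently open headings
    out = []
    for kind, level, text in items:
        if kind == "heading":
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        else:
            out.append((" > ".join(title for _, title in stack), text))
    return out


items = [
    ("heading", 1, "Intro"),
    ("text", 0, "a"),
    ("heading", 2, "Scope"),
    ("text", 0, "b"),
    ("heading", 1, "Methods"),
    ("text", 0, "c"),
]
breadcrumbs = build_breadcrumbs(items)
```

Each body item ends up tagged with the chain of headings it sits under, so a paragraph under "Scope" carries the breadcrumb "Intro > Scope".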

Sequence Diagram

sequenceDiagram
  participant Client
  participant DoclingBaseLoader
  participant DocumentConverter
  participant GraphRAG
  Client->>DoclingBaseLoader: load(source, ctx)
  DoclingBaseLoader->>DoclingBaseLoader: asyncio.to_thread(_load_sync)
  DoclingBaseLoader->>DocumentConverter: create & convert(source)
  DocumentConverter-->>DoclingBaseLoader: iterated items (label, text, level)
  DoclingBaseLoader->>GraphRAG: map items -> DocumentElement(s)
  DoclingBaseLoader-->>Client: DocumentOutput(text, elements, metadata)
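
The first two steps of the diagram, an async `load()` that hands blocking work to `asyncio.to_thread`, might look roughly like the sketch below. `BlockingLoader` and its method bodies are placeholders written for this example. Only the `load` / `_load_sync` naming follows the walkthrough.

```python
import asyncio
import threading


class BlockingLoader:
    """Hypothetical loader; method names mirror the walkthrough, bodies do not."""

    def _load_sync(self, source: str) -> dict:
        # Stand-in for CPU- or I/O-heavy conversion work.
        return {"source": source, "thread": threading.current_thread().name}

    async def load(self, source: str) -> dict:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self._load_sync, source)


result = asyncio.run(BlockingLoader().load("report.docx"))
```

The recorded thread name differs from the main thread's, which shows that the synchronous body ran off the event loop.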

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • FalkorDB/GraphRAG-SDK#254: Updates default ingestion chunker from FixedSizeChunking to SentenceTokenCapChunking in main.py.

Suggested reviewers

  • Naseem77

Poem

🐰 Docling hops in, nose to the page,
Finding headers, lists across the stage,
Breadcrumbs stitched where sections belong,
CSV, DOCX, slides hum along,
The SDK reads on — a librarian's song.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.90% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly aligns with the main objective: expanding document loader coverage by integrating Docling to support multiple file formats (DOCX, XLSX, PPTX, CSV, HTML). It is concise, specific, and clearly summarizes the primary change. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/pyproject.toml`:
- Around line 59-67: The extras specification is inconsistent: the standalone
"docling" extra pins docling>=2.91.0 while the "all" extras list contains
docling>=2.0.0; update the "all" extras entry to require the same minimum
(docling>=2.91.0) so installing graphrag-sdk[all] cannot pull an older
incompatible docling version—edit the "docling" entry inside the all array in
pyproject.toml to match the docling extra's minimum version.

In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 329-333: Update the stale loader-default docstring that currently
reads "Loader: auto-detected from file extension (PDF or text)" so it reflects
the new extension routing; locate the docstring containing that exact phrase in
graphrag_sdk/api/main.py (the help/usage text shown in the diff) and replace it
with a concise description like "Loader: auto-detected from file extension (PDF,
DOCX, XLSX, PPTX, HTML/XHTML, CSV, or plain text)" so the user-facing docs list
the supported formats.

In `@graphrag_sdk/tests/test_docling_loaders.py`:
- Around line 24-27: The test is breaking the import system by having the
patched builtins.__import__ return None for non-target imports; change the patch
side_effect to delegate to the real importer for all names except the one you
want to simulate failing (i.e., capture the original_import =
builtins.__import__ and in the side_effect for the patch used around loader.load
call, raise ImportError("module not found") when name ==
"docling.document_converter" and otherwise return original_import(name, *args,
**kwargs)); keep the pytest.raises assertion around await loader.load(str(file),
ctx) unchanged and reference the patched builtins.__import__ side_effect that
delegates for everything except "docling.document_converter".
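
The delegating import hook described in the last comment could be sketched as follows. This is an illustration of the pattern only, not the test from the PR: `failing_import` and `try_imports` are names chosen for this example, and `json` simply stands in for any unrelated import that must keep working while the patch is active.

```python
import builtins
from unittest.mock import patch

real_import = builtins.__import__  # captured before patching


def failing_import(name, *args, **kwargs):
    # Simulate a missing optional dependency for one module only.
    if name == "docling.document_converter":
        raise ImportError("module not found")
    # Everything else goes through the genuine import machinery.
    return real_import(name, *args, **kwargs)


def try_imports():
    with patch("builtins.__import__", side_effect=failing_import):
        import json  # unaffected: delegated to real_import
        try:
            import docling.document_converter  # noqa: F401
        except ImportError as exc:
            return json.dumps({"error": str(exc)})
    return None
```

Because the hook falls through to `real_import`, code under test can still import the standard library while only the targeted module appears to be missing.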

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6566ed76-11b6-4ce9-8ef8-6fb548181321

📥 Commits

Reviewing files that changed from the base of the PR and between a174629 and 9bfbab4.

📒 Files selected for processing (10)
  • graphrag_sdk/pyproject.toml
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/csv_loader.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_base.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docx_loader.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/html_loader.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/pptx_loader.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/xlsx_loader.py
  • graphrag_sdk/tests/test_docling_loaders.py

Comment thread graphrag_sdk/pyproject.toml Outdated
Comment on lines 329 to +333
         - Loader: auto-detected from file extension (PDF or text)
-        - Chunker: FixedSizeChunking(chunk_size=1000)
+        - Chunker: SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)
+          — sentence-aware, never splits entity names at chunk boundaries.
+          Override with ``chunker=FixedSizeChunking(...)`` if you need
+          character-window chunking.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the loader-default docstring to match the new extension routing.

Line 329 still says “PDF or text”, but the default loader now supports DOCX/XLSX/PPTX/HTML/XHTML/CSV as well. This user-facing doc is now stale.

✏️ Proposed docstring fix
-        - Loader: auto-detected from file extension (PDF or text)
+        - Loader: auto-detected from file extension
+          (PDF, Markdown, DOCX, XLSX, PPTX, HTML/XHTML, CSV, or text fallback)

Comment thread graphrag_sdk/tests/test_docling_loaders.py Outdated
drr00t and others added 4 commits May 18, 2026 21:48
Path C in retrieve_chunks used `COLLECT(c)[..3]` with no ORDER BY, so
hub entities (which can be MENTIONED_IN hundreds of chunks) returned
an arbitrary 3 — almost never including the chunks most relevant to
the current query.

Add an ORDER BY on `vec.cosineDistance(c.embedding, query_vector)`
before the COLLECT so per-entity chunk selection is query-aware.
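
The selection logic can be illustrated outside Cypher. The sketch below is a plain-Python analogue, not the query from the commit: it ranks an entity's chunks by cosine distance to the query vector before keeping the first k, rather than keeping an arbitrary k. The `cosine_distance` and `top_chunks` helpers and the sample data are made up for this example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_chunks(chunks, query_vector, k=3):
    """Return the ids of the k chunks closest to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(c["embedding"], query_vector))
    return [c["id"] for c in ranked[:k]]


chunks = [
    {"id": "c1", "embedding": [0.0, 1.0]},
    {"id": "c2", "embedding": [1.0, 0.0]},
    {"id": "c3", "embedding": [0.7, 0.7]},
    {"id": "c4", "embedding": [-1.0, 0.0]},
]
closest = top_chunks(chunks, [1.0, 0.0], k=2)
```

Without the sort, a slice of the same list would return `c1` and `c2` purely because of storage order, regardless of the query.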

Refs FalkorDB#258

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@drr00t drr00t marked this pull request as ready for review May 21, 2026 20:02

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
test_mock_import.py (1)

9-15: ⚡ Quick win

Avoid global sys.modules mutation at import time.

Lines 9-15 mutate global module state and execute prints on import, which can leak across tests/tools.

Proposed fix
-sys.modules["docling"] = MagicMock()
-sys.modules["docling.datamodel"] = MagicMock()
-sys.modules["docling.datamodel.document"] = mock_datamodel
-
-from docling.datamodel.document import DocItemLabel
-print("DocItemLabel is:", DocItemLabel)
-print("DocItemLabel.LIST_ITEM is:", getattr(DocItemLabel, "LIST_ITEM", None))
+def main():
+    modules = {
+        "docling": MagicMock(),
+        "docling.datamodel": MagicMock(),
+        "docling.datamodel.document": mock_datamodel,
+    }
+    with patch.dict("sys.modules", modules):
+        from docling.datamodel.document import DocItemLabel
+        print("DocItemLabel is:", DocItemLabel)
+        print("DocItemLabel.LIST_ITEM is:", getattr(DocItemLabel, "LIST_ITEM", None))
+
+if __name__ == "__main__":
+    main()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test_mock_import.py` around lines 9 - 15, The test mutates global sys.modules
and does prints at import time (sys.modules["docling"],
sys.modules["docling.datamodel"], sys.modules["docling.datamodel.document"],
mock_datamodel) and then imports DocItemLabel, which can leak state; change this
to perform the module patching inside the test or a fixture using a temporary
monkeypatch (e.g., monkeypatch.setitem(sys.modules, "docling", MagicMock()) and
monkeypatch.setitem(... "docling.datamodel.document", mock_datamodel)) or use
importlib and context-local injection before importing, remove the top-level
print statements, and reference DocItemLabel only within the test body so
mock_datamodel and DocItemLabel are created and torn down per-test instead of at
import time.
graphrag_sdk/test_monkeypatch.py (1)

19-19: ⚡ Quick win

Guard script execution behind if __name__ == "__main__":.

Line 19 causes side effects on import. In a test_*.py file this is easy to trigger unintentionally.

Proposed fix
-asyncio.run(main())
+if __name__ == "__main__":
+    asyncio.run(main())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/test_monkeypatch.py` at line 19, Current top-level call
asyncio.run(main()) in test_monkeypatch.py causes the module to execute on
import; wrap that invocation in a standard entry-point guard by adding if
__name__ == "__main__": and moving asyncio.run(main()) inside it so async def
main() only runs when the file is executed directly, not when imported by pytest
or other modules.
graphrag_sdk/test_monkeypatch3.py (1)

16-19: ⚡ Quick win

Don’t swallow all exceptions in the failure-path check.

Catching a broad Exception and silently passing hides regressions unrelated to the mocked import failure.

Proposed fix
-        try:
-            await asyncio.to_thread(load_sync)
-        except Exception as e:
-            pass # catch the mocked exception
+        with pytest.raises(ImportError, match="module not found"):
+            await asyncio.to_thread(load_sync)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/test_monkeypatch3.py` around lines 16 - 19, The test currently
swallows all exceptions around the await asyncio.to_thread(load_sync) call;
change this to only accept the specific expected failure (e.g.,
ModuleNotFoundError or the mocked exception class) or use pytest.raises to
assert the failure. Replace the broad "except Exception as e: pass" with either
"except ModuleNotFoundError: pass" (or the exact mocked exception type used in
the test) or wrap the await in "with pytest.raises(ExpectedException): await
asyncio.to_thread(load_sync)" so unexpected exceptions are not hidden; ensure
you reference the load_sync call and asyncio.to_thread in the updated test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/test_debug.py`:
- Around line 27-28: The debug harness currently calls
loader._load_sync("dummy_path") which fails file-existence checks; replace the
dummy string with a real temporary file path (create a temp file via
tempfile.NamedTemporaryFile or tempfile.mkstemp, write minimal content that will
exercise the converter mapping logic, flush/close it) and pass that file's path
to loader._load_sync; ensure the temp file is cleaned up after the test and keep
the existing print(result.elements) to observe outputs.

In `@graphrag_sdk/test_monkeypatch2.py`:
- Around line 4-6: The function load_sync is declared async but is passed
directly to asyncio.to_thread at its call sites (asyncio.to_thread(load_sync)),
which causes to_thread to execute a callable that returns a coroutine instead of
running the import logic; change load_sync from async def load_sync() to a
synchronous def load_sync() that performs the import and returns "success" (so
the body runs inside asyncio.to_thread), or alternatively update the call sites
to await load_sync() (or wrap with asyncio.to_thread(lambda:
asyncio.run(load_sync())) if you must keep it async).

In `@test_import_mock.py`:
- Around line 6-10: The custom import hook _import currently calls __import__
directly for non-"fake" modules which will re-enter the patched
builtins.__import__ and cause infinite recursion; fix by saving the original
import function before patching (e.g., orig_import = builtins.__import__) and
have _import delegate to orig_import(name, *args, **kwargs) for non-"fake"
cases, then install _import via patch("builtins.__import__",
side_effect=_import) so normal imports use the saved original implementation.

In `@test_patch_dict.py`:
- Around line 3-5: Convert the top-level patch.dict snippet into a real pytest
test function (e.g., def test_patch_dict_imports_mocked_module():) that uses
unittest.mock.patch.dict to temporarily inject {"docling": MagicMock(),
"docling.document_converter": MagicMock()}, then imports (or
importlib.import_module) "docling.document_converter" and asserts that the
imported object is the MagicMock from sys.modules and that inside the context
sys.modules["docling.document_converter"] is the mock; after exiting the
patch.dict context assert sys.modules has been restored to its pre-test state
(original module present or key removed) to ensure cleanup.

In `@test_sys_modules.py`:
- Around line 3-6: The module-level mutations of sys.modules
(sys.modules['docling'] and sys.modules['docling.document_converter']) should be
moved into a test-scoped patch.dict to avoid cross-test pollution; wrap the
import of docling.document_converter inside a with patch.dict("sys.modules",
{"docling": MagicMock(), "docling.document_converter": MagicMock()}): block (or
use the patch.dict decorator) so the mocked entries are only present for the
duration of the test, and ensure the import statement (import
docling.document_converter) happens inside that patched context.
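
For the `test_patch_dict.py` item above, a test-scoped version might look like the following sketch. It assumes nothing about the real file beyond what the comment states. The test name comes from the comment, while the `before` bookkeeping is an addition for this example so the restore check also holds when docling happens to be installed.

```python
import importlib
import sys
from unittest.mock import MagicMock, patch


def test_patch_dict_imports_mocked_module():
    name = "docling.document_converter"
    before = sys.modules.get(name)  # None unless docling was already imported
    parent, child = MagicMock(), MagicMock()
    with patch.dict("sys.modules", {"docling": parent, name: child}):
        # import_module returns the cached sys.modules entry for this name.
        assert importlib.import_module(name) is child
        assert sys.modules[name] is child
    # patch.dict restores the mapping to its pre-test state on exit.
    assert sys.modules.get(name) is before
```

Keeping the mocks inside the `with` block means other tests in the same session never see the fake modules.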


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af7af7df-af2f-4640-a6c3-c5b56215e9ad

📥 Commits

Reviewing files that changed from the base of the PR and between 531d29c and a5ecc4e.

📒 Files selected for processing (11)
  • graphrag_sdk/dummy_path
  • graphrag_sdk/pyproject.toml
  • graphrag_sdk/test_debug.py
  • graphrag_sdk/test_monkeypatch.py
  • graphrag_sdk/test_monkeypatch2.py
  • graphrag_sdk/test_monkeypatch3.py
  • graphrag_sdk/tests/test_docling_loaders.py
  • test_import_mock.py
  • test_mock_import.py
  • test_patch_dict.py
  • test_sys_modules.py

Comment on lines +27 to +28
result = loader._load_sync("dummy_path")
print(result.elements)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use a real temp file path in the debug harness.

Line 27 calls _load_sync("dummy_path"), which will fail early on file existence checks, so this harness won’t exercise converter mapping logic.

Proposed fix
-ctx = Context()
-result = loader._load_sync("dummy_path")
+from pathlib import Path
+ctx = Context()
+tmp_file = Path("debug.docx")
+tmp_file.write_text("dummy content")
+result = loader._load_sync(str(tmp_file))
 print(result.elements)
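
The proposed fix leaves `debug.docx` behind in the working directory. One alternative is a temporary file that is removed afterwards, sketched below. `load_with_temp_file` is a hypothetical helper written for this example, and the placeholder bytes are not a valid DOCX, so this only suits a harness where the converter itself is mocked.

```python
import os
import tempfile


def load_with_temp_file(load_fn, content: bytes = b"dummy content", suffix: str = ".docx"):
    """Write `content` to a real temporary file, pass its path to `load_fn`,
    and remove the file afterwards even if loading fails."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(content)
        return load_fn(path)
    finally:
        os.remove(path)
```

In the debug harness this would be called as `load_with_temp_file(loader._load_sync)`, with `loader` being whatever object the script already constructs.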

Comment on lines +4 to +6
async def load_sync():
    from docling.datamodel.document import DocItemLabel
    return "success"

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify async function is passed to asyncio.to_thread in this file.
rg -n -C2 'async def load_sync|to_thread\(load_sync\)' graphrag_sdk/test_monkeypatch2.py

Repository: FalkorDB/GraphRAG-SDK

Length of output: 552


Fix asyncio.to_thread usage: don’t pass an async def directly

load_sync is declared as async def (line 4) but both call sites pass it directly to asyncio.to_thread(load_sync) (lines 17 and 22). Since an async function returns a coroutine when called, to_thread will run a callable that immediately returns a coroutine object, rather than executing the coroutine body/import logic as intended.

Proposed fix
-async def load_sync():
+def load_sync():
     from docling.datamodel.document import DocItemLabel
     return "success"
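
The difference is easy to observe in isolation. The sketch below is a standalone demonstration rather than the script from the PR, and `load_async` is a name chosen here to contrast with the synchronous `load_sync`. It passes both an `async def` and a plain `def` to `asyncio.to_thread` and inspects what comes back.

```python
import asyncio
import inspect


async def load_async():
    return "success"


def load_sync():
    return "success"


async def main():
    # to_thread calls the function in a worker thread; calling an async def
    # only creates a coroutine object, so its body never runs here.
    wrong = await asyncio.to_thread(load_async)
    right = await asyncio.to_thread(load_sync)
    is_coroutine = inspect.iscoroutine(wrong)
    wrong.close()  # avoid a "never awaited" RuntimeWarning
    return is_coroutine, right


result = asyncio.run(main())
```

The async variant yields an un-awaited coroutine object, while the synchronous variant yields the actual return value.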

Comment thread test_import_mock.py
Comment on lines +6 to +10
    def _import(name, *args, **kwargs):
        if name == "fake":
            raise ImportError("module not found")
        return __import__(name, *args, **kwargs)
    with patch("builtins.__import__", side_effect=_import):

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify patched importer delegates to captured real importer (not __import__ directly).
rg -n -C2 'patch\("builtins\.__import__"|return __import__\(|real_import' test_import_mock.py

Repository: FalkorDB/GraphRAG-SDK

Length of output: 303


🏁 Script executed:

#!/bin/bash
# Inspect test_import_mock.py around the import-hook block to see if any real-import is captured.
nl -ba test_import_mock.py | sed -n '1,120p'

Repository: FalkorDB/GraphRAG-SDK

Length of output: 106


🏁 Script executed:

#!/bin/bash
# Inspect with line numbers using cat -n (nl not available).
cat -n test_import_mock.py | sed -n '1,120p'

Repository: FalkorDB/GraphRAG-SDK

Length of output: 667


Import hook can recurse indefinitely for non-"fake" imports.

_import is installed as builtins.__import__’s side_effect, but for non-"fake" names line 9 calls __import__ again, re-invoking _import (infinite recursion). The current test only imports "fake", so line 9 isn’t exercised here, but the hook is unsafe as written.

Proposed fix
 modules = {"fake": MagicMock()}
 with patch.dict("sys.modules", modules):
+    real_import = __import__
     def _import(name, *args, **kwargs):
         if name == "fake":
             raise ImportError("module not found")
-        return __import__(name, *args, **kwargs)
+        return real_import(name, *args, **kwargs)
     with patch("builtins.__import__", side_effect=_import):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    def _import(name, *args, **kwargs):
-        if name == "fake":
-            raise ImportError("module not found")
-        return __import__(name, *args, **kwargs)
-    with patch("builtins.__import__", side_effect=_import):
+modules = {"fake": MagicMock()}
+with patch.dict("sys.modules", modules):
+    real_import = __import__
+    def _import(name, *args, **kwargs):
+        if name == "fake":
+            raise ImportError("module not found")
+        return real_import(name, *args, **kwargs)
+    with patch("builtins.__import__", side_effect=_import):

Comment thread test_patch_dict.py
Comment on lines +3 to +5
with patch.dict("sys.modules", {"docling": MagicMock(), "docling.document_converter": MagicMock()}):
import docling.document_converter
print("Success patch.dict!")

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify module-level executable code in test-like files
rg -n -C2 '^\s*(with patch\.dict|sys\.modules\[|print\()' --iglob 'test*.py'

Repository: FalkorDB/GraphRAG-SDK

Length of output: 17088


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== File: test_patch_dict.py (lines 1-120) ==="
if [ -f test_patch_dict.py ]; then
  nl -ba test_patch_dict.py | sed -n '1,120p'
else
  echo "missing: test_patch_dict.py"
fi

echo
echo "=== Pytest config files (if present) ==="
ls -1 pytest.ini pyproject.toml setup.cfg tox.ini 2>/dev/null || true

echo
echo "=== Contents: pytest.ini (if present) ==="
[ -f pytest.ini ] && nl -ba pytest.ini | sed -n '1,200p' || true

echo
echo "=== Contents: pyproject.toml (if present) ==="
[ -f pyproject.toml ] && nl -ba pyproject.toml | sed -n '1,240p' || true

echo
echo "=== Contents: setup.cfg (if present) ==="
[ -f setup.cfg ] && nl -ba setup.cfg | sed -n '1,240p' || true

echo
echo "=== Contents: tox.ini (if present) ==="
[ -f tox.ini ] && nl -ba tox.ini | sed -n '1,240p' || true

echo
echo "=== Searching for pytest collection config keywords ==="
rg -n --hidden --no-ignore -S 'testpaths|python_files|python_classes|python_functions' pyproject.toml pytest.ini setup.cfg tox.ini 2>/dev/null || true

Repository: FalkorDB/GraphRAG-SDK

Length of output: 173


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Repo root files (top level) ==="
ls -1 | head -n 200

echo
echo "=== test_patch_dict.py (if present) ==="
if [ -f test_patch_dict.py ]; then
  cat -n test_patch_dict.py | sed -n '1,120p'
else
  echo "missing: test_patch_dict.py"
fi

echo
echo "=== Pytest config candidates (if present) ==="
ls -1 pytest.ini pyproject.toml setup.cfg tox.ini 2>/dev/null || true

echo
echo "=== pytest.ini ==="
[ -f pytest.ini ] && cat -n pytest.ini | sed -n '1,200p' || true

echo
echo "=== pyproject.toml (pytest section) ==="
if [ -f pyproject.toml ]; then
  rg -n "tool\.pytest\.ini_options|^\[tool\.pytest|python_files|testpaths" pyproject.toml || true
  # Print a limited window around pytest options if found
  rg -n -C 3 "tool\.pytest\.ini_options|^\[tool\.pytest" pyproject.toml || true
fi

echo
echo "=== setup.cfg / tox.ini (pytest-related snippets) ==="
[ -f setup.cfg ] && rg -n "python_files|testpaths|pytest" setup.cfg || true
[ -f tox.ini ] && rg -n "python_files|testpaths|pytest" tox.ini || true

Repository: FalkorDB/GraphRAG-SDK

Length of output: 778


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find any pytest config files anywhere in the repo
echo "=== pytest config file names ==="
fd -t f -e ini -e toml -e cfg -e py -a -x echo {} 2>/dev/null || true
# More direct: locate common config filenames
fd pytest.ini -a . 2>/dev/null || true
fd .*pytest.* -a . 2>/dev/null || true
fd pyproject.toml -a . 2>/dev/null || true
fd setup.cfg -a . 2>/dev/null || true
fd tox.ini -a . 2>/dev/null || true

echo
echo "=== Search for pytest config sections ==="
rg -n --hidden --no-ignore -S "\[tool\.pytest|python_files|testpaths|addopts" . || true

echo
echo "=== Search for collection overrides referencing test_*.py ==="
rg -n --hidden --no-ignore -S "python_files\s*=\s*|testpaths\s*=\s*|--pyargs|--ignore|--ignore-glob" . || true

echo
echo "=== Show any found pyproject.toml pytest sections (limited) ==="
rg -n -S "\[tool\.pytest\.ini_options\]|\[tool\.pytest\]" --glob '*pyproject.toml' . || true

Repository: FalkorDB/GraphRAG-SDK

Length of output: 9844


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== graphrag_sdk/pyproject.toml (pytest options) ==="
cat -n graphrag_sdk/pyproject.toml | sed -n '80,140p'

Repository: FalkorDB/GraphRAG-SDK

Length of output: 1171


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for pytest invocation details (where directory changes happen)
rg -n --hidden --no-ignore -S "pytest(\s|$)|python -m pytest" .github -S 2>/dev/null || true
rg -n --hidden --no-ignore -S "pytest(\s|$)|python -m pytest" . -S --glob '!**/.git/**' || true

# Show likely CI/workflow files and scripts (small set)
ls -R .github 2>/dev/null | head -n 200 || true
find . -maxdepth 3 -type f \( -iname '*workflow*' -o -iname 'Makefile' -o -iname 'tox.ini' -o -iname 'noxfile.py' -o -iname '*ci*' -o -iname '*.sh' \) 2>/dev/null | head -n 200

Repository: FalkorDB/GraphRAG-SDK

Length of output: 3340


🏁 Script executed:

#!/bin/bash
set -euo pipefail
cat -n .github/workflows/ci.yml | sed -n '1,140p'

Repository: FalkorDB/GraphRAG-SDK

Length of output: 3508


Convert test_patch_dict.py import-time patch.dict snippet into a real pytest test with assertions

`test_patch_dict.py` executes `with patch.dict(...): import docling.document_converter; print(...)` at import time and has no assertions. Wrap it in a `test_*` function and assert that the mocked module is what gets imported (and ideally that `sys.modules` is restored).

This root-level `test_patch_dict.py` isn't exercised by CI because CI runs `python -m pytest tests/` from the `graphrag_sdk/` working directory, but it will be picked up if someone runs `pytest` from the repo root.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test_patch_dict.py` around lines 3 - 5, convert the top-level `patch.dict`
snippet into a real pytest test function (e.g., `def
test_patch_dict_imports_mocked_module():`). The test should use
`unittest.mock.patch.dict` to temporarily inject `{"docling": MagicMock(),
"docling.document_converter": MagicMock()}`, import
`"docling.document_converter"` (with `import` or `importlib.import_module`), and
assert that the imported object is the `MagicMock` from `sys.modules` and that,
inside the context, `sys.modules["docling.document_converter"]` is the mock.
After exiting the `patch.dict` context, assert that `sys.modules` has been
restored to its pre-test state (original module present or key removed) to
ensure cleanup.
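
One way the requested test could look, as a sketch rather than the PR's actual code. It uses `importlib.import_module`, which returns the `sys.modules` entry for the dotted name, so the identity assertion compares against the injected mock directly. The local names `before`, `fake_pkg`, and `fake_converter` are illustrative.

```python
import importlib
import sys
from unittest.mock import MagicMock, patch


def test_patch_dict_imports_mocked_module():
    name = "docling.document_converter"
    # Snapshot the relevant entries so cleanup can be verified afterwards.
    # A value of None means the key was absent before the test.
    before = {key: sys.modules.get(key) for key in ("docling", name)}
    fake_pkg, fake_converter = MagicMock(), MagicMock()

    with patch.dict("sys.modules", {"docling": fake_pkg, name: fake_converter}):
        imported = importlib.import_module(name)
        # The import resolves to the injected mock, not a real module.
        assert imported is fake_converter
        assert sys.modules[name] is fake_converter

    # After the context exits, sys.modules matches its pre-test state.
    for key, original in before.items():
        assert sys.modules.get(key) is original
```

Because all work happens inside the function, importing the file during collection no longer touches `sys.modules` or prints anything.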

Comment thread test_sys_modules.py
Comment on lines +3 to +6
sys.modules['docling'] = MagicMock()
sys.modules['docling.document_converter'] = MagicMock()
import docling.document_converter
print("Success!")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate and inspect the file
ls -la
fd -t f "test_sys_modules.py" . -x echo "FOUND: {}" || true
FILE="$(fd -t f "test_sys_modules.py" . | head -n 1)"
echo "Using FILE=$FILE"
echo "---- File head (numbered) ----"
nl -ba "$FILE" | sed -n '1,120p'

echo "---- Search for sys.modules usage ----"
rg -n "sys\.modules\s*\[" -S .

echo "---- Search for patch.dict / monkeypatch patterns ----"
rg -n "patch\.dict\(" -S . || true
rg -n "monkeypatch" -S . || true

Repository: FalkorDB/GraphRAG-SDK

Length of output: 1608


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="./test_sys_modules.py"
echo "---- File (line-numbered) ----"
# Prefer cat -n; fallback to python if cat -n isn't available
if cat -n "$FILE" >/dev/null 2>&1; then
  cat -n "$FILE" | sed -n '1,160p'
else
  python3 - <<'PY'
import pathlib
p = pathlib.Path("$FILE")
lines = p.read_text().splitlines()
for i,l in enumerate(lines, start=1):
    print(f"{i}\t{l}")
PY
fi

echo "---- Search for sys.modules usage ----"
rg -n "sys\.modules\s*\[" -S .

echo "---- Search for patch.dict / monkeypatch patterns ----"
rg -n "patch\.dict\(" -S . || true
rg -n "monkeypatch" -S . || true

echo "---- Inspect test_patch_dict.py (if present) ----"
if [ -f "./test_patch_dict.py" ]; then
  if cat -n "./test_patch_dict.py" >/dev/null 2>&1; then
    cat -n ./test_patch_dict.py | sed -n '1,200p'
  else
    python3 - <<'PY'
import pathlib
p=pathlib.Path("test_patch_dict.py")
lines=p.read_text().splitlines()
for i,l in enumerate(lines, start=1):
    print(f"{i}\t{l}")
PY
  fi
fi

Repository: FalkorDB/GraphRAG-SDK

Length of output: 7873


Move sys.modules mocking into a test-scoped patch.dict to avoid cross-test pollution.

`test_sys_modules.py` mutates `sys.modules` at module scope (`sys.modules['docling']` / `sys.modules['docling.document_converter']`) and then imports immediately. The mocks are never removed, so they stay in `sys.modules` for any test that runs later in the same process. This repo already uses `patch.dict("sys.modules", ...)` patterns in other tests (e.g., `test_patch_dict.py`).

Safer isolated pattern
-import sys
-from unittest.mock import MagicMock
-sys.modules['docling'] = MagicMock()
-sys.modules['docling.document_converter'] = MagicMock()
-import docling.document_converter
-print("Success!")
+from unittest.mock import MagicMock, patch
+
+def test_docling_import_with_sys_modules_patch():
+    with patch.dict("sys.modules", {
+        "docling": MagicMock(),
+        "docling.document_converter": MagicMock(),
+    }):
+        import docling.document_converter as converter
+        assert converter is not None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test_sys_modules.py` around lines 3 - 6, the module-level mutations of
`sys.modules` (`sys.modules['docling']` and
`sys.modules['docling.document_converter']`) should be moved into a test-scoped
`patch.dict` to avoid cross-test pollution. Wrap the import of
`docling.document_converter` inside a `with patch.dict("sys.modules",
{"docling": MagicMock(), "docling.document_converter": MagicMock()}):` block (or
use the `patch.dict` decorator) so the mocked entries are only present for the
duration of the test, and ensure the import statement (`import
docling.document_converter`) happens inside that patched context.

Development

Successfully merging this pull request may close these issues.

Expand document loader coverage (HTML, Markdown, DOCX, CSV, JSON, URL, S3, Image)
