241 issue expand document loader coverage #262
Open
drr00t wants to merge 9 commits into FalkorDB:main from drr00t:241-issue-expand-document-loader-coverage
9 commits
31e927e feat: initial mult-extension loader (drr00t)
6631d8e feat(ingestion): default to SentenceTokenCapChunking in ingest()/upda… (galshubeli)
9bfbab4 test(loader): more tests for doclin-base loaders (drr00t)
857f870 fix(issues): from coderabbitai review (drr00t)
4af2f8c fix(conflict): need to be updated (drr00t)
7969df9 fix(retrieval): rank MENTIONED_IN chunks by cosine in MultiPath Path C (galshubeli)
4f7d37c Merge branch 'main' into 241-issue-expand-document-loader-coverage (drr00t)
531d29c test: refactor docling loader tests to use local mock enumerations in… (drr00t)
a5ecc4e test: implement robust sys.modules mocking for docling in unit tests … (drr00t)
17 changes: 16 additions & 1 deletion
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,23 @@ | ||
| # GraphRAG SDK — Ingestion: Loaders | ||
| | ||
| from graphrag_sdk.ingestion.loaders.base import LoaderStrategy | ||
| + from graphrag_sdk.ingestion.loaders.csv_loader import CsvLoader | ||
| + from graphrag_sdk.ingestion.loaders.docx_loader import DocxLoader | ||
| + from graphrag_sdk.ingestion.loaders.html_loader import HtmlLoader | ||
| from graphrag_sdk.ingestion.loaders.markdown_loader import MarkdownLoader | ||
| from graphrag_sdk.ingestion.loaders.pdf_loader import PdfLoader | ||
| + from graphrag_sdk.ingestion.loaders.pptx_loader import PptxLoader | ||
| from graphrag_sdk.ingestion.loaders.text_loader import TextLoader | ||
| + from graphrag_sdk.ingestion.loaders.xlsx_loader import XlsxLoader | ||
| | ||
| - __all__ = ["LoaderStrategy", "MarkdownLoader", "PdfLoader", "TextLoader"] | ||
| + __all__ = [ | ||
| +     "LoaderStrategy", | ||
| +     "MarkdownLoader", | ||
| +     "PdfLoader", | ||
| +     "TextLoader", | ||
| +     "DocxLoader", | ||
| +     "XlsxLoader", | ||
| +     "PptxLoader", | ||
| +     "HtmlLoader", | ||
| +     "CsvLoader", | ||
| + ] |
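With the five new loaders exported, a caller could select one by file suffix. The sketch below is illustrative only: `LOADER_BY_SUFFIX` and `pick_loader_name` are hypothetical names that do not appear in this PR, the `.md` and `.txt` mappings are assumptions about the pre-existing loaders, and class names are kept as strings so the snippet runs without the SDK installed.

```python
from pathlib import Path

# Hypothetical suffix -> loader-class-name table. The names on the right are
# the classes exported from loaders/__init__.py in this diff; the suffixes for
# MarkdownLoader and TextLoader are assumptions.
LOADER_BY_SUFFIX = {
    ".md": "MarkdownLoader",
    ".pdf": "PdfLoader",
    ".txt": "TextLoader",
    ".docx": "DocxLoader",
    ".xlsx": "XlsxLoader",
    ".pptx": "PptxLoader",
    ".html": "HtmlLoader",
    ".csv": "CsvLoader",
}


def pick_loader_name(source: str) -> str:
    """Return the loader class name for a path, matching the suffix case-insensitively."""
    suffix = Path(source).suffix.lower()
    try:
        return LOADER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No loader registered for suffix {suffix!r}") from None
```

A real dispatcher would map to the classes themselves rather than to strings; strings are used here only to keep the example self-contained.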
12 changes: 12 additions & 0 deletions
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/csv_loader.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| # GraphRAG SDK — Ingestion: CSV Loader | ||
| # Pattern: Strategy | ||
| | ||
| from graphrag_sdk.ingestion.loaders.docling_base import DoclingBaseLoader | ||
| | ||
| | ||
| class CsvLoader(DoclingBaseLoader): | ||
|     """Load text and structural elements from a CSV file using Docling.""" | ||
| | ||
|     @property | ||
|     def extension_name(self) -> str: | ||
|         return "csv" |
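`CsvLoader` imports nothing from docling itself. The base class defers that import until a file is actually loaded and turns a missing package into an error that names the install extra. A minimal standalone sketch of that deferred-import pattern follows; `OptionalDependencyError` and `require` are hypothetical stand-ins rather than SDK names, and only the `graphrag-sdk[docling]` hint is taken from the diff.

```python
import importlib
from types import ModuleType


class OptionalDependencyError(RuntimeError):
    """Hypothetical stand-in for the SDK's LoaderError."""


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original ImportError so the traceback keeps the root cause.
        raise OptionalDependencyError(
            f"'{module_name}' is required. Install with:\n"
            f"  pip install graphrag-sdk[{extra}]"
        ) from exc
```

Chaining with `from exc` keeps the original `ImportError` visible, which helps when the package is installed but one of its own imports fails.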
154 changes: 154 additions & 0 deletions
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_base.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,154 @@ | ||
| # GraphRAG SDK — Ingestion: Docling Base Loader | ||
| # Pattern: Strategy Base Class | ||
| | ||
| from __future__ import annotations | ||
| | ||
| import asyncio | ||
| import logging | ||
| from pathlib import Path | ||
| from typing import Any | ||
| | ||
| from graphrag_sdk.core.context import Context | ||
| from graphrag_sdk.core.exceptions import LoaderError | ||
| from graphrag_sdk.core.models import DocumentElement, DocumentInfo, DocumentOutput | ||
| from graphrag_sdk.ingestion.loaders.base import LoaderStrategy | ||
| | ||
| logger = logging.getLogger(__name__) | ||
| | ||
| | ||
| class DoclingBaseLoader(LoaderStrategy): | ||
|     """Base loader using docling for advanced document parsing. | ||
| | ||
|     Subclasses should define the `extension_name` property. | ||
|     """ | ||
| | ||
|     def __init__(self, **docling_kwargs: Any) -> None: | ||
|         """Initialize the loader. | ||
| | ||
|         Args: | ||
|             **docling_kwargs: Arbitrary keyword arguments passed to | ||
|                 `docling.document_converter.DocumentConverter` (e.g., | ||
|                 pipeline_options). | ||
|         """ | ||
|         self.docling_kwargs = docling_kwargs | ||
| | ||
|     @property | ||
|     def extension_name(self) -> str: | ||
|         return "unknown" | ||
| | ||
|     async def load(self, source: str, ctx: Context) -> DocumentOutput: | ||
|         ctx.log(f"Loading {self.extension_name.upper()} file via docling: {source}") | ||
|         # Run synchronous docling extraction in a non-blocking thread | ||
|         return await asyncio.to_thread(self._load_sync, source) | ||
| | ||
|     def _load_sync(self, source: str) -> DocumentOutput: | ||
|         path = Path(source) | ||
|         if not path.exists(): | ||
|             raise LoaderError(f"File not found: {source}") | ||
| | ||
|         try: | ||
|             from docling.datamodel.document import DocItemLabel | ||
|             from docling.document_converter import DocumentConverter | ||
|         except ImportError: | ||
|             raise LoaderError( | ||
|                 f"{self.extension_name.upper()} parsing requires 'docling'. Install with:\n" | ||
|                 " pip install graphrag-sdk[docling]" | ||
|             ) | ||
| | ||
|         try: | ||
|             converter = DocumentConverter(**self.docling_kwargs) | ||
|             result = converter.convert(source) | ||
|             doc = result.document | ||
|         except Exception as exc: | ||
|             raise LoaderError(f"Docling failed to process {source}: {exc}") from exc | ||
| | ||
|         elements: list[DocumentElement] = [] | ||
|         current_breadcrumbs: list[tuple[int, str]] = [] | ||
|         full_text_blocks = [] | ||
| | ||
|         # Map docling hierarchy to GraphRAG DocumentElements | ||
|         for item, level in doc.iterate_items(): | ||
|             content = getattr(item, "text", "") | ||
|             if not content and hasattr(item, "export_to_markdown"): | ||
|                 try: | ||
|                     content = item.export_to_markdown() | ||
|                 except Exception: | ||
|                     pass | ||
| | ||
|             if not content: | ||
|                 continue | ||
| | ||
|             full_text_blocks.append(content) | ||
|             label = getattr(item, "label", None) | ||
| | ||
|             if label in (DocItemLabel.TITLE, DocItemLabel.SECTION_HEADER): | ||
|                 # Update breadcrumbs | ||
|                 while current_breadcrumbs and current_breadcrumbs[-1][0] >= level: | ||
|                     current_breadcrumbs.pop() | ||
|                 current_breadcrumbs.append((level, content)) | ||
| | ||
|                 elements.append( | ||
|                     DocumentElement( | ||
|                         type="header", | ||
|                         level=level, | ||
|                         content=content, | ||
|                         breadcrumbs=[b[1] for b in current_breadcrumbs], | ||
|                     ) | ||
|                 ) | ||
|             elif label in (DocItemLabel.PARAGRAPH, DocItemLabel.TEXT): | ||
|                 elements.append( | ||
|                     DocumentElement( | ||
|                         type="paragraph", | ||
|                         content=content, | ||
|                         breadcrumbs=[b[1] for b in current_breadcrumbs], | ||
|                     ) | ||
|                 ) | ||
|             elif label == DocItemLabel.LIST_ITEM: | ||
|                 elements.append( | ||
|                     DocumentElement( | ||
|                         type="list", | ||
|                         content=content, | ||
|                         breadcrumbs=[b[1] for b in current_breadcrumbs], | ||
|                     ) | ||
|                 ) | ||
|             elif label == DocItemLabel.TABLE: | ||
|                 elements.append( | ||
|                     DocumentElement( | ||
|                         type="table", | ||
|                         content=content, | ||
|                         breadcrumbs=[b[1] for b in current_breadcrumbs], | ||
|                     ) | ||
|                 ) | ||
|             elif label == DocItemLabel.CODE: | ||
|                 elements.append( | ||
|                     DocumentElement( | ||
|                         type="code", | ||
|                         content=content, | ||
|                         breadcrumbs=[b[1] for b in current_breadcrumbs], | ||
|                     ) | ||
|                 ) | ||
|             else: | ||
|                 # Default for CAPTION, FOOTNOTE, etc. | ||
|                 elements.append( | ||
|                     DocumentElement( | ||
|                         type="paragraph", | ||
|                         content=content, | ||
|                         breadcrumbs=[b[1] for b in current_breadcrumbs], | ||
|                         metadata={"label": label.value if hasattr(label, "value") else label}, | ||
|                     ) | ||
|                 ) | ||
| | ||
|         full_text = "\n\n".join(full_text_blocks) | ||
| | ||
|         return DocumentOutput( | ||
|             text=full_text, | ||
|             document_info=DocumentInfo( | ||
|                 path=str(path), | ||
|                 metadata={ | ||
|                     "size_bytes": path.stat().st_size, | ||
|                     "loader": self.extension_name, | ||
|                     "suffix": path.suffix, | ||
|                 }, | ||
|             ), | ||
|             elements=elements, | ||
|         ) |
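The breadcrumb handling in `_load_sync` is a small stack algorithm: each header pops every entry at the same or a deeper level and then pushes itself, so every element carries the titles of its enclosing sections. The standalone sketch below isolates that logic. `build_breadcrumbs` and the `(kind, level, text)` tuples are hypothetical simplifications of docling's items, not part of the PR.

```python
from typing import Iterable


def build_breadcrumbs(
    items: Iterable[tuple[str, int, str]],
) -> list[tuple[str, list[str]]]:
    """Pair each item's text with the header trail in effect at that point.

    Each input tuple is (kind, level, text), where kind is "header" or
    anything else (treated as body content).
    """
    stack: list[tuple[int, str]] = []
    out: list[tuple[str, list[str]]] = []
    for kind, level, text in items:
        if kind == "header":
            # Drop headers at the same or a deeper level, then push this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        out.append((text, [title for _, title in stack]))
    return out


sample = [
    ("header", 1, "Intro"),
    ("text", 2, "p1"),
    ("header", 2, "Details"),
    ("text", 3, "p2"),
    ("header", 1, "Next"),
    ("text", 2, "p3"),
]
trail = build_breadcrumbs(sample)
```

In this sample, `p2` sits under `["Intro", "Details"]`, while `p3` sits under `["Next"]` alone because the second level-1 header clears the stack before pushing itself.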
12 changes: 12 additions & 0 deletions
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docx_loader.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| # GraphRAG SDK — Ingestion: DOCX Loader | ||
| # Pattern: Strategy | ||
| | ||
| from graphrag_sdk.ingestion.loaders.docling_base import DoclingBaseLoader | ||
| | ||
| | ||
| class DocxLoader(DoclingBaseLoader): | ||
|     """Load text and structural elements from a DOCX file using Docling.""" | ||
| | ||
|     @property | ||
|     def extension_name(self) -> str: | ||
|         return "docx" |
12 changes: 12 additions & 0 deletions
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/html_loader.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| # GraphRAG SDK — Ingestion: HTML Loader | ||
| # Pattern: Strategy | ||
| | ||
| from graphrag_sdk.ingestion.loaders.docling_base import DoclingBaseLoader | ||
| | ||
| | ||
| class HtmlLoader(DoclingBaseLoader): | ||
|     """Load text and structural elements from an HTML/XHTML file using Docling.""" | ||
| | ||
|     @property | ||
|     def extension_name(self) -> str: | ||
|         return "html" |
12 changes: 12 additions & 0 deletions
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/pptx_loader.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| # GraphRAG SDK — Ingestion: PPTX Loader | ||
| # Pattern: Strategy | ||
| | ||
| from graphrag_sdk.ingestion.loaders.docling_base import DoclingBaseLoader | ||
| | ||
| | ||
| class PptxLoader(DoclingBaseLoader): | ||
|     """Load text and structural elements from a PPTX file using Docling.""" | ||
| | ||
|     @property | ||
|     def extension_name(self) -> str: | ||
|         return "pptx" |
12 changes: 12 additions & 0 deletions
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/xlsx_loader.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| # GraphRAG SDK — Ingestion: XLSX Loader | ||
| # Pattern: Strategy | ||
| | ||
| from graphrag_sdk.ingestion.loaders.docling_base import DoclingBaseLoader | ||
| | ||
| | ||
| class XlsxLoader(DoclingBaseLoader): | ||
|     """Load text and structural elements from an XLSX file using Docling.""" | ||
| | ||
|     @property | ||
|     def extension_name(self) -> str: | ||
|         return "xlsx" |
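All five format loaders follow the same shape: inherit the parsing logic and override only `extension_name`, which the base class uses for log messages, error text, and the `loader` metadata field. The sketch below reproduces that shape with stand-in classes so it runs without the SDK or docling. `BaseLoaderSketch`, `XlsxLoaderSketch`, and `describe` are hypothetical names; only the message format is copied from `DoclingBaseLoader.load`.

```python
class BaseLoaderSketch:
    """Stand-in for DoclingBaseLoader; NOT the SDK class."""

    @property
    def extension_name(self) -> str:
        # Same fallback value the base class in the diff returns.
        return "unknown"

    def describe(self, source: str) -> str:
        # Mirrors the log message built in DoclingBaseLoader.load.
        return f"Loading {self.extension_name.upper()} file via docling: {source}"


class XlsxLoaderSketch(BaseLoaderSketch):
    """A format-specific subclass overrides only the property."""

    @property
    def extension_name(self) -> str:
        return "xlsx"


message = XlsxLoaderSketch().describe("budget.xlsx")
```

Because the property is the only override point, a subclass that forgets to define it still works but reports itself as `unknown`.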
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| import asyncio | ||
| from unittest.mock import MagicMock | ||
| from graphrag_sdk.core.context import Context | ||
| from tests.test_docling_loaders import MockDocxLoader, LabelEnum | ||
| | ||
| loader = MockDocxLoader() | ||
| mock_items = [ | ||
|     (MagicMock(label=LabelEnum.LIST_ITEM, text="List item 1"), 1), | ||
|     (MagicMock(label=LabelEnum.TABLE, text="Table content"), 1), | ||
|     (MagicMock(label=LabelEnum.CODE, text="print('hello')"), 1), | ||
| ] | ||
| | ||
| mock_doc = MagicMock() | ||
| mock_doc.iterate_items.return_value = mock_items | ||
| | ||
| mock_converter = MagicMock() | ||
| mock_converter.convert.return_value.document = mock_doc | ||
| | ||
| # Force the monkeypatch manually | ||
| import sys | ||
| sys.modules["docling"] = MagicMock() | ||
| sys.modules["docling.datamodel"] = MagicMock() | ||
| sys.modules["docling.datamodel.document"] = MagicMock(DocItemLabel=LabelEnum) | ||
| sys.modules["docling.document_converter"] = MagicMock(DocumentConverter=MagicMock(return_value=mock_converter)) | ||
| | ||
| ctx = Context() | ||
| result = loader._load_sync("dummy_path") | ||
| print(result.elements) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| import sys, asyncio | ||
| from unittest.mock import patch, MagicMock | ||
| | ||
| async def main(): | ||
|     mock_docling = MagicMock() | ||
|     # mock_docling.__path__ = [] # Let's see without this | ||
|     modules = { | ||
|         'docling': mock_docling, | ||
|         'docling.datamodel': MagicMock(), | ||
|         'docling.datamodel.document': MagicMock() | ||
|     } | ||
|     with patch.dict('sys.modules', modules): | ||
|         def worker(): | ||
|             from docling.datamodel.document import DocItemLabel | ||
|             return 'success' | ||
|         res = await asyncio.to_thread(worker) | ||
|         print(res) | ||
| | ||
| asyncio.run(main()) |
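The script above checks that modules injected with `patch.dict` are visible to an import running inside `asyncio.to_thread`. The same check can be made self-verifying with plain module objects instead of `MagicMock`. In this sketch `fake_pkg`, `fake_pkg.labels`, and `Label` are hypothetical names chosen so the example cannot collide with a real package.

```python
import asyncio
import sys
import types
from unittest.mock import patch


def make_fake_package() -> dict[str, types.ModuleType]:
    """Build a tiny fake package with one submodule exposing `Label`."""
    pkg = types.ModuleType("fake_pkg")
    pkg.__path__ = []  # marks the module as a package
    sub = types.ModuleType("fake_pkg.labels")
    sub.Label = "stub-label"
    pkg.labels = sub
    return {"fake_pkg": pkg, "fake_pkg.labels": sub}


def worker() -> str:
    # Resolved from sys.modules, which is shared by all threads.
    from fake_pkg.labels import Label

    return Label


async def main() -> str:
    # patch.dict restores sys.modules when the block exits.
    with patch.dict(sys.modules, make_fake_package()):
        return await asyncio.to_thread(worker)


result = asyncio.run(main())
```

Because `patch.dict` undoes its changes on exit, the fake entries do not leak into later tests, unlike direct assignments to `sys.modules`.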
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| import sys, asyncio, pytest | ||
| from unittest.mock import patch, MagicMock | ||
| | ||
| async def load_sync(): | ||
|     from docling.datamodel.document import DocItemLabel | ||
|     return "success" | ||
| | ||
| async def test_first(): | ||
|     real_import = __import__ | ||
|     def _import(name, *args, **kwargs): | ||
|         if name == "docling.datamodel.document": | ||
|             raise ImportError("module not found") | ||
|         return real_import(name, *args, **kwargs) | ||
| | ||
|     with patch("builtins.__import__", side_effect=_import): | ||
|         try: | ||
|             await asyncio.to_thread(load_sync) | ||
|         except Exception as e: | ||
|             pass # catch the mocked exception | ||
| | ||
| async def test_second(): | ||
|     res = await asyncio.to_thread(load_sync) | ||
|     print("Second test:", res) | ||
| | ||
| async def main(): | ||
|     modules = { | ||
|         'docling': MagicMock(), | ||
|         'docling.datamodel': MagicMock(), | ||
|         'docling.datamodel.document': MagicMock() | ||
|     } | ||
|     with patch.dict('sys.modules', modules): | ||
|         await test_first() | ||
|         await test_second() | ||
| | ||
| asyncio.run(main()) | ||
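One detail to keep in mind when reading scripts of this kind: `asyncio.to_thread` calls its argument in a worker thread and returns whatever that call returns. A plain function yields its result, but an `async def` yields an unstarted coroutine object whose body never runs. The standalone sketch below demonstrates the difference; `sync_worker` and `async_worker` are illustrative names only.

```python
import asyncio
import inspect


def sync_worker() -> str:
    return "success"


async def async_worker() -> str:
    return "success"


async def main() -> tuple[str, bool]:
    # A synchronous callable runs in the thread and its value comes back.
    ok = await asyncio.to_thread(sync_worker)

    # Calling an async function in the thread only creates a coroutine;
    # its body is not executed.
    pending = await asyncio.to_thread(async_worker)
    is_coroutine = inspect.iscoroutine(pending)
    pending.close()  # avoid a "coroutine was never awaited" warning
    return ok, is_coroutine


ok, is_coroutine = asyncio.run(main())
```

For an import check run through `asyncio.to_thread`, this means the function under test needs to be a plain `def` for the import statement inside it to execute.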