This repository provides a framework for evaluating and comparing different retrieval strategies for RAG systems, using the LongDocURL benchmark.
The primary goal is to analyze how different document parsing and retrieval methods affect the quality of LLM responses on long documents.
The dataset was created by selecting all Understanding and Locating questions whose evidence elements are Text and Layout. The following setups are evaluated:
- Questioning without any retrieved context.
- Questioning using the cut-off paradigm from LongDocURL.
- Questioning using a classic PyMuPDF-based RAG pipeline with a chunk size of 500 and an overlap of 100.
- Questioning using a MinerU-based RAG pipeline.
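The fixed-size chunking used by the classic RAG pipeline can be sketched as below. This is a minimal illustration, assuming character-based chunks; in the actual pipeline the input text would first be extracted from the PDF with PyMuPDF (e.g. via `page.get_text()`).

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks where consecutive
    chunks share `overlap` characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reaches the end of the text
    return chunks
```

With these defaults, each chunk starts 400 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next, which reduces the chance of splitting an answer span across a chunk boundary.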
- MinerU: v3.0.4.
- Tesseract OCR: v5.5.0.20241111 (system dependency).
- Python: 3.13.7.
- Full Python dependency list can be found in requirements.txt.
- Embeddings Model: distiluse-base-multilingual-cased-v1.
- Question-answering model: Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf, hosted in LM Studio.
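The retrieval step shared by both RAG pipelines amounts to cosine-similarity ranking of chunk embeddings against the query embedding. In the actual setup the vectors would come from distiluse-base-multilingual-cased-v1 (via sentence-transformers); the toy vectors in the test below are stand-ins for illustration only.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query embedding."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The returned indices select which chunks are inserted into the LLM prompt as retrieved context.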