8 changes: 8 additions & 0 deletions .hydra_config/config.yaml
@@ -55,6 +55,13 @@ reranker:
top_k: ${oc.decode:${oc.env:RERANKER_TOP_K, 10}} # Number of documents to return after reranking. Increase for better results if your LLM has a wider context window.
base_url: ${oc.env:RERANKER_BASE_URL, http://reranker:${oc.env:RERANKER_PORT, 7997}}

file_reducer:
max_group_tokens: ${oc.decode:${oc.env:FILE_REDUCER_MAX_GROUP_TOKENS, 4096}}
min_group_tokens: ${oc.decode:${oc.env:FILE_REDUCER_MIN_GROUP_TOKENS, 2048}}
target_size_tokens: ${oc.decode:${oc.env:FILE_REDUCER_TARGET_SIZE_TOKENS, 1024}}
max_rounds: ${oc.decode:${oc.env:FILE_REDUCER_MAX_ROUNDS, 3}}
min_shrink_ratio: ${oc.decode:${oc.env:FILE_REDUCER_MIN_SHRINK_RATIO, 0.1}}

map_reduce:
# Number of documents to process in the initial mapping phase
initial_batch_size: ${oc.decode:${oc.env:MAP_REDUCE_INITIAL_BATCH_SIZE, 10}}
@@ -91,6 +98,7 @@ prompts:
chunk_contextualizer: chunk_contextualizer_tmpl.txt
image_describer: image_captioning_tmpl.txt
spoken_style_answer: spoken_style_answer_tmpl.txt
file_reducer: file_reducer_tmpl.txt

# query templates for different retriever types
hyde: hyde.txt
292 changes: 292 additions & 0 deletions AGENTS.md
@@ -0,0 +1,292 @@
# OpenRAG Agent Guide

## Build, Lint, and Test Commands

### Dependencies
```bash
# Install dependencies (uv package manager)
uv sync

# Install dev dependencies
uv sync --group dev

# Install lint dependencies
uv sync --group lint
```

### Development Server
```bash
# GPU deployment
docker compose up -d

# CPU deployment
docker compose --profile cpu up -d

# Rebuild and run
docker compose up --build -d
```

### Testing
```bash
# Run all unit tests
uv run pytest

# Run a single test file
uv run pytest openrag/components/indexer/chunker/test_chunking.py

# Run tests matching a pattern
uv run pytest -k "test_chunk"

# Run with verbose output
uv run pytest -v

# Run integration tests (requires running server)
uv run pytest -m integration

# Run tests with coverage
uv run pytest --cov=openrag
```

### Linting and Formatting
```bash
# Check code style
uv run ruff check openrag/ tests/

# Auto-fix linting issues
uv run ruff check --fix openrag/ tests/

# Format code
uv run ruff format openrag/ tests/

# Check formatting without modifying
uv run ruff format --check openrag/ tests/
```

### CI/CD
```bash
# Run API integration tests locally with act
act -j api-tests -W .github/workflows/api_tests.yml --bind
```

## Code Style Guidelines

### Imports
- Use **absolute imports** from the `openrag/` directory (Python path root)
- Group imports: standard library → third-party → first-party (`openrag.*`)
- Use `from X import Y` rooted at `openrag/`, not relative imports across packages
- Isort configuration: `known-first-party = ["openrag"]`

```python
# Correct
from components.ray_utils import call_ray_actor_with_timeout
from utils.logger import get_logger
from config import load_config

# Avoid
from ..ray_utils import ... # Only use within same package
```

### Formatting
- **Line length**: 120 characters (configured in `pyproject.toml`)
- **Target Python**: 3.12+
- Use **double quotes** for strings
- Use **4 spaces** for indentation (no tabs)
- Follow Black-compatible formatting (Ruff format)

### Type Hints
- Use **type hints** for function parameters and return values
- Use `|` for union types (Python 3.10+ syntax)
- Prefer `T | None` over `Optional[T]` for optional values
- Use `list[T]`, `dict[str, Any]` for collections

```python
def process_file(file_id: str, partition: str | None = None) -> dict[str, Any]:
    """Process a file and return metadata."""
    ...
```

### Naming Conventions
- **Functions/variables**: `snake_case`
- **Classes**: `PascalCase`
- **Constants**: `UPPER_CASE`
- **Private members**: `_leading_underscore`
- **Ray Actors**: `PascalCase` (e.g., `Indexer`, `TaskStateManager`)
- **Test functions**: `test_<description>`

### Error Handling
- Use **custom exceptions** from `openrag/utils/exceptions/`
- All exceptions inherit from `OpenRAGError`
- Include `code`, `message`, and optional `status_code`
- Use specific exception types: `VDBError`, `EmbeddingError`

```python
from utils.exceptions import OpenRAGError, VDBError

# Raise error with code and message
raise VDBError(message="Failed to connect", code="VDB_001", status_code=503)

# Custom exception with extra context
raise OpenRAGError(
    message="File not found",
    code="FILE_NOT_FOUND",
    status_code=404,
    file_id=file_id,
)
```

### Logging
- Use **Loguru** with structured logging via `get_logger()`
- Include contextual data using `.bind()`
- Never log secrets or sensitive data

```python
from utils.logger import get_logger

logger = get_logger()

# Log with context
logger.bind(file_id=file_id, partition=partition).info("Processing file")

# Error logging with exception
logger.bind(error=str(e)).error("Failed to process document")
```

### Async/Await
- Use `async def` for I/O operations (database, HTTP, Ray)
- Always `await` async calls
- Use `asyncio.gather()` for concurrent independent operations
- Use `call_ray_actor_with_timeout()` for Ray actor calls

```python
from components.ray_utils import call_ray_actor_with_timeout

# Concurrent operations
results = await asyncio.gather(
    task1(),
    task2(),
    task3(),
)

# Ray actor with timeout
result = await call_ray_actor_with_timeout(
    future=indexer.process.remote(data),
    timeout=30,
    task_description="Processing document",
)
```

### Ray Actors
- Ray Actors are initialized in `openrag/api.py`
- Access actors via `ray.get_actor(name, namespace="openrag")`
- All actor methods called with `.remote()`

```python
import ray

# Get actor reference
vectordb = ray.get_actor("Vectordb", namespace="openrag")
indexer = ray.get_actor("Indexer", namespace="openrag")

# Call methods
await vectordb.async_search.remote(query=query, partition=partition)
```

### Configuration
- Configuration via **Hydra** with YAML files in `.hydra_config/`
- Access config via `load_config()` from `config.py`
- Environment variables override config values

```python
from config import load_config

config = load_config()
chunk_size = config.chunker.size
```

### API Patterns
- FastAPI routers in `openrag/routers/`
- Use dependency injection for shared resources
- Return `JSONResponse` for custom error responses
- Use Pydantic models for request/response validation

```python
from fastapi import APIRouter, Depends
from pydantic import BaseModel

router = APIRouter()

class DocumentRequest(BaseModel):
    text: str
    partition: str | None = None

@router.post("/documents")
async def create_document(req: DocumentRequest, user: User = Depends(get_current_user)):
    ...
```

### Testing Guidelines
- Unit tests: `openrag/components/**/test_*.py` (pytest)
- Integration tests: `tests/api_tests/*.py`
- Use pytest fixtures from `conftest.py`
- Mark tests: `@pytest.mark.integration` or `@pytest.mark.unit`

```python
import pytest

@pytest.mark.unit
def test_chunking():
    assert result == expected

@pytest.mark.integration
async def test_api_endpoint():
    response = await client.post("/v1/chat/completions", json={...})
    assert response.status_code == 200
```

### Documentation
- Docstrings: **Google style** or **reStructuredText**
- Include type hints in docstrings if not obvious
- Document complex algorithms and business logic

```python
def process_chunk(chunk: Chunk) -> Embedding:
"""Process a document chunk and generate embedding.

Args:
chunk: The chunk to process

Returns:
Generated embedding vector

Raises:
EmbeddingError: If embedding generation fails
"""
...
```

## Key Files and Directories

```
openrag/
├── api.py # FastAPI app entry point, Ray initialization
├── routers/ # API route handlers
├── components/ # Core components (Indexer, Vectordb, Pipeline)
│ ├── indexer/ # Document ingestion, chunking, embedding
│ ├── pipeline.py # RAG pipeline orchestration
│ └── websearch/ # Web search integration
├── utils/ # Shared utilities
│ ├── exceptions/ # Custom exception classes
│ ├── logger.py # Logging configuration
│ └── config.py # Configuration loading
├── models/ # Pydantic models
└── prompts/ # LLM prompt templates
```

## Important Notes

- **Never commit secrets** - use `.env` files (not in repo)
- **Ray namespace** is always `"openrag"` for all actors
- **Milvus** is the vector database with hybrid search (dense + BM25)
- **Authentication** uses token-based auth with RBAC
- **Partition-based** multi-tenant document organization
- **OpenAI-compatible** API format for chat completions
1 change: 1 addition & 0 deletions docs/content/docs/documentation/API.mdx
@@ -409,6 +409,7 @@ OpenAI-compatible text completion endpoint.
| `websearch` | `bool` | `false` | Augments the RAG context with live web search results. When used with a partition (`openrag-{partition}`), document and web results are combined. When used without a partition (direct LLM mode), web results are the sole context. Requires `WEBSEARCH_API_TOKEN` to be configured. See [web search configuration](/openrag/documentation/env_vars/#web-search-configuration). |
| `spoken_style_answer` | `bool` | `false` | Generates a succinct spoken-style conversational answer based on the retrieved documents. |
| `use_map_reduce` | `bool` | `false` | Uses a map-reduce strategy to aggregate information from multiple documents. See [map-reduce configuration](/openrag/documentation/env_vars/#map--reduce-configuration). |
| `attachments` | `list[{id: string}]` | `null` | Pins specific files by ID for retrieval, bypassing semantic search entirely. Each file's chunks are compressed by the file reducer before being sent to the LLM. See [file reducer configuration](/openrag/documentation/env_vars/#file-reducer-configuration). |
| `llm_override` | `object` | `null` | Routes the request to a different LLM endpoint while still using OpenRAG's RAG pipeline (retrieval, reranking, prompt construction). Accepts: `base_url` (string), `api_key` (string), `model` (string). Any field not provided falls back to the default OpenRAG LLM configuration. |
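
For instance, a minimal request pinning a file through the new `attachments` field might look like the following (the partition name and file id are placeholders, not real values):

```shell
curl -X 'POST' 'http://localhost:8080/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "openrag-my-partition",
    "messages": [{"role": "user", "content": "Summarize this file."}],
    "attachments": [{"id": "my-file-id"}]
  }'
```

Because `attachments` bypasses semantic search, the response is grounded only in the pinned files' (reduced) content plus whatever other context options are enabled.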

Examples:
16 changes: 16 additions & 0 deletions docs/content/docs/documentation/env_vars.md
@@ -257,6 +257,7 @@ The RAG pipeline comes with preconfigured prompts **`./prompts/example1`**. Here
| `image_captioning_tmpl.txt` | Template for generating image descriptions using the VLM |
| `hyde.txt` | Hypothetical Document Embeddings (HyDE) query expansion template |
| `multi_query_pmpt_tmpl.txt` | Template for generating multiple query variations |
| `file_reducer_tmpl.txt` | System prompt for the file reducer's chunk compression LLM calls |

To customize a prompt:
1. **Duplicate the example folder**: Copy the `example1` folder from `./prompts/`
@@ -455,6 +456,21 @@ curl -X 'POST' 'http://localhost:8080/v1/chat/completions' \
```
:::

### File Reducer Configuration

The file reducer compresses a file's chunks down to a size that fits within the LLM context window. It works iteratively: chunks are grouped, each group is summarized by the LLM, and the process repeats until the total content fits. Two safety mechanisms prevent it from running indefinitely:

- **`max_rounds`** — hard cap on the number of compression iterations.
- **`min_shrink_ratio`** — if a round shrinks the content by less than this fraction, the LLM is not compressing meaningfully and the loop stops early.

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `FILE_REDUCER_TARGET_SIZE_TOKENS` | `int` | 1024 | Token budget for the final output. Compression rounds continue until the total content fits within this limit |
| `FILE_REDUCER_MAX_GROUP_TOKENS` | `int` | 4096 | Maximum tokens per group fed to the LLM in a single summarization call |
| `FILE_REDUCER_MIN_GROUP_TOKENS` | `int` | 2048 | Groups smaller than this threshold are passed through without calling the LLM |
| `FILE_REDUCER_MAX_ROUNDS` | `int` | 3 | Maximum number of compression rounds before stopping regardless of output size |
| `FILE_REDUCER_MIN_SHRINK_RATIO` | `float` | 0.1 | Minimum fraction of tokens that must be removed in a round to continue iterating (e.g. `0.1` = at least 10% reduction required) |
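
Putting these knobs together, the round loop can be sketched roughly as follows. This is a simplified illustration, not OpenRAG's actual implementation: `count_tokens`, `make_groups`, and `summarize_group` are crude stand-ins for the real tokenizer, grouping logic, and the LLM call driven by `file_reducer_tmpl.txt`.

```python
def count_tokens(text: str) -> int:
    # crude stand-in: one token per whitespace-separated word
    return len(text.split())

def make_groups(texts: list[str], max_group_tokens: int) -> list[list[str]]:
    # greedy grouping: pack consecutive chunks until the group budget is hit
    groups, current, size = [], [], 0
    for t in texts:
        n = count_tokens(t)
        if current and size + n > max_group_tokens:
            groups.append(current)
            current, size = [], 0
        current.append(t)
        size += n
    if current:
        groups.append(current)
    return groups

def summarize_group(group: list[str]) -> str:
    # stand-in for the LLM summarization call: keep the first half of the words
    words = " ".join(group).split()
    return " ".join(words[: max(1, len(words) // 2)])

def reduce_file(chunks: list[str], target_size: int = 1024, max_group: int = 4096,
                min_group: int = 2048, max_rounds: int = 3,
                min_shrink: float = 0.1) -> list[str]:
    """Iteratively compress chunks until they fit the target token budget."""
    texts = list(chunks)
    for _ in range(max_rounds):                    # hard cap: FILE_REDUCER_MAX_ROUNDS
        total = sum(count_tokens(t) for t in texts)
        if total <= target_size:                   # fits the budget: done
            break
        reduced = []
        for group in make_groups(texts, max_group):
            if sum(count_tokens(t) for t in group) < min_group:
                reduced.extend(group)              # small group: pass through, no LLM call
            else:
                reduced.append(summarize_group(group))
        new_total = sum(count_tokens(t) for t in reduced)
        shrunk = (total - new_total) / total
        texts = reduced
        if shrunk < min_shrink:                    # diminishing returns: stop early
            break
    return texts
```

Note that the result can still exceed `target_size` when a round hits the `min_shrink` or `max_rounds` guard; the safety mechanisms trade exactness for bounded cost.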

### FastAPI & Access Control
:::info
By default, our API (FastAPI) uses **`uvicorn`** for deployment. One can opt in to use `Ray Serve` for scalability (see the [ray serve configuration](/openrag/documentation/env_vars/#ray-serve-configuration))