Standalone RAG backend. Hybrid search (semantic + BM25 + graph), cross-encoder rerank, LightRAG entity graph, GLiNER fast-mode.
One docker compose up. REST API on :5050. Bring your own corpus.
11-doc adversarial corpus, 13 tagged queries (polysemy, paraphrase, exact-phrase rare jargon, multi-hop entity chain, header-path ablation, distractors). Full matrix in tests/integration_validation/REPORT.md.
| Channel | hit@1 | hit@5 | MRR | p50 latency | Notes |
|---|---|---|---|---|---|
| 🟢 lexical (BM25) | 0.92 | 1.00 | 0.949 | 3 ms | one paraphrase miss at rank 3 |
| 🟧 semantic (pgvector cosine) | 0.85 | 0.92 | 0.897 | 45 ms | two multi-hop misses (rank 2 and rank 6); only non-fused channel that does not saturate hit@5 |
| 🟢 hybrid_norerank (RRF) | 0.92 | 1.00 | 0.962 | 80 ms | RRF promotes paraphrase miss back to 1 |
| 🟧 hybrid_rerank (RRF + cross-encoder) | 0.85 | 1.00 | 0.904 | ~28 s | reranker reorders multi-hop down; latency dominated by CPU inference |
| 🟢 hybrid_graph_norerank (RRF + graph) | 0.92 | 1.00 | 0.962 | 880 ms | graph adds candidates without disrupting other channels |
| 🟧 hybrid_graph_rerank (full stack) | 0.92 | 1.00 | 0.949 | ~19 s | graph keeps recall at 100% even when reranker hurts MRR |
| 🟦 graph_only (/v1/search/graph) | 0.15 | 0.15 | 0.154 | 240 ms | by design: empty 200 on non-entity queries; 2/3 on the multi-hop entity slice |
At this corpus size every hybrid channel saturates hit@5 = 1.00. The discriminative signal is in hit@1 and per-query rank. Bigger adversarial corpora will separate channels further; that's tracked in tests/integration_validation/REPORT.md.
Why a second corpus: the audition numbers above were taken on an 11-doc adversarial corpus before the GPU rerank work landed. This benchmark exercises every public endpoint and every retrieval channel end-to-end on a fresh non-trainable corpus, with all four rerank_model modes plus the LightRAG entity-extraction pipeline turned on. Repeatable runner: tests/integration_validation/hai_full_bench.py + sibling hai_corpus.py.
Why "non-trainable": every proper noun (companies, projects, theorems, people, locations) was invented for this benchmark. No public corpus contains "Quibbler-Frame protocol" or "Zigast's theorem". The reranker cannot pattern-match from training data; it has to read the chunk against the query and score relevance honestly.
Setup
- Corpus: 20 fictional docs (16 in-world + 4 distractors), each ~80 words, ingested through POST /v1/documents with LightRAG entity extraction (no X-LightRAG-Ingest: false header). Ingest cost: ~56 s/doc on local Qwen3.6-27B-AWQ4 vLLM with enable_thinking=false. Total ingest wall: ~19 min.
- Memgraph state after ingest: 98 entities, 122 directed relations (HAI added 71 entities + 96 relations to the pre-existing 27/26 baseline).
- Queries: 12 hand-crafted, none copy-paste of any chunk text. Mix of paraphrase, exact rare-jargon, multi-hop, and polysemy.
- Hardware: AMD Strix Halo (gfx1151), 128 GiB UMA, vLLM at gpu-memory-utilization=0.55 running the whole time.
top_k=10, rerank_candidates=50. hybrid_graph_rerank adds the LightRAG graph channel to the RRF fusion before rerank.
| Channel / mode | hit@1 | hit@5 | MRR | p50_ms |
|---|---|---|---|---|
| keyword (BM25 only) | 0.67 | 1.00 | 0.819 | 2 |
| semantic (pgvector only) | 0.67 | 1.00 | 0.833 | 63 |
| hybrid_norerank (RRF, no rerank) | 0.67 | 1.00 | 0.833 | 68 |
| hybrid_rerank / default (CPU TEI BGE-v2-m3) | 0.75 | 1.00 | 0.861 | 3,173 |
| hybrid_rerank / bge-gpu | 0.75 | 1.00 | 0.861 | 153 |
| hybrid_rerank / qwen-4b | 0.67 | 1.00 | 0.819 | 1,377 |
| hybrid_rerank / qwen-8b | 0.67 | 1.00 | 0.833 | 2,519 |
| hybrid_graph_rerank / default | 0.75 | 1.00 | 0.861 | 4,628 |
| hybrid_graph_rerank / bge-gpu | 0.75 | 1.00 | 0.861 | 1,775 |
| hybrid_graph_rerank / qwen-4b | 0.67 | 1.00 | 0.819 | 3,024 |
| hybrid_graph_rerank / qwen-8b | 0.67 | 1.00 | 0.833 | 4,103 |
| graph_only (/v1/search/graph) | 0.00 | 0.00 | 0.000 | 263 |
What this says:
- bge-gpu is the practical winner: identical hit@1/MRR to the CPU default (same model, fp16 noise), 20x faster sequentially. The same comparison vs hybrid_norerank shows the rerank stage adds +0.08 hit@1 on this corpus.
- Qwen-4B and Qwen-8B trail BGE on hit@1 for this 12-query slice: 0.67 vs 0.75. Qwen-8B beats 4B by +0.014 MRR (one rank-2 improvement on the multi-hop coastal-lab query). On a longer adversarial corpus the LM-style scoring is more likely to widen, but on this slice it is a wash. Plug-and-play available; enable when your corpus benefits.
- graph_only returns 0/12 as designed: the queries are paraphrases, never exact entity-name lookups, so GLiNER cannot anchor in Memgraph. This channel is meant for entity-anchored questions like "give me everything about Hardin Volkenburg", not paraphrase questions. Verified by phase_graph_crud below: /v1/graph/path/152/153 returns the correct 2-hop path between two seeded test concepts, so the channel itself works.
- hybrid_graph_rerank matches hybrid_rerank on hit@1/MRR. The graph channel adds candidates without disrupting rank order; the cost is one extra LLM call (entity extraction) at ~1.4 s.
| Endpoint | Status |
|---|---|
| GET /health, /health/models | green |
| POST /v1/documents (LightRAG on) | green, 20/20 docs, ~56 s/doc |
| GET /v1/documents, GET /v1/documents/{id} | green |
| POST /v1/embed | green (1024d, BGE-M3) |
| POST /v1/rerank (passthrough) | green |
| POST /v1/search (default + bge-gpu + qwen-4b + qwen-8b) | green |
| POST /v1/search/semantic, /keyword, /graph | green |
| POST /v1/concepts, DELETE /v1/concepts/{id} | green |
| POST /v1/relations, GET /v1/relations, DELETE /v1/relations/{id} | green |
| GET /v1/graph/neighbors/{id}, /graph/path/{a}/{b} | green |
| GET /v1/graph/stats (concepts=2 test, relations=131) | green |
| GET /v1/graph/communities (Louvain, 105 communities) | green |
Concurrency 8, 2 repeats x 12 queries x (4 flat channels + 2 rerank-channels x 4 rerank modes) = 288 total live-pipeline calls fired in parallel. vLLM Responses API (enable_thinking=false, stream=true) was poked before, mid-run, and after.
| Channel / mode | n | p50 ms | p95 ms | min | max |
|---|---|---|---|---|---|
| keyword | 24 | 7 | 15 | 3 | 45 |
| semantic | 24 | 254 | 446 | 138 | 586 |
| hybrid_norerank | 24 | 392 | 691 | 108 | 865 |
| hybrid_rerank / bge-gpu | 24 | 1,283 | 1,635 | 953 | 1,850 |
| hybrid_rerank / default | 24 | 13,255 | 21,143 | 4,950 | 30,293 |
| hybrid_rerank / qwen-4b | 24 | 13,631 | 16,772 | 8,188 | 17,269 |
| hybrid_rerank / qwen-8b | 24 | 27,794 | 38,645 | 12,492 | 52,504 |
| hybrid_graph_rerank / bge-gpu | 24 | 9,309 | 16,306 | 4,344 | 19,610 |
| hybrid_graph_rerank / default | 24 | 18,821 | 25,091 | 8,134 | 25,895 |
| hybrid_graph_rerank / qwen-4b | 24 | 22,320 | 27,569 | 17,559 | 29,421 |
| hybrid_graph_rerank / qwen-8b | 24 | 38,067 | 47,646 | 20,037 | 54,208 |
| graph_only | 24 | 9,780 | 10,946 | 431 | 11,403 |
Total wall: 459 s for 288 jobs, 0 errors. vLLM stayed healthy across the run: 3-poke TTFT 7 ms / 19 ms / 16 ms, all status 200, no degradation. All four reranker containers (CPU TEI + BGE-GPU + Qwen-4B + Qwen-8B) plus Postgres + Memgraph + embedding TEI stayed healthy throughout. No tracebacks in any sidecar log.
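The job count above can be re-derived by enumerating the matrix. This is a minimal sketch: the channel and mode labels come from the table, while `build_jobs` and the integer query indices are hypothetical stand-ins and are not taken from the repo's runner.

```python
from itertools import product

# Labels copied from the stress-test table in this README.
FLAT_CHANNELS = ["keyword", "semantic", "hybrid_norerank", "graph_only"]
RERANK_CHANNELS = ["hybrid_rerank", "hybrid_graph_rerank"]
RERANK_MODES = ["default", "bge-gpu", "qwen-4b", "qwen-8b"]


def build_jobs(n_queries: int = 12, repeats: int = 2) -> list[tuple[int, int, str]]:
    """Return one (repeat, query_index, label) tuple per live-pipeline call."""
    labels = list(FLAT_CHANNELS)
    # Each rerank-capable channel is exercised once per rerank mode.
    labels += [f"{ch} / {mode}" for ch, mode in product(RERANK_CHANNELS, RERANK_MODES)]
    return list(product(range(repeats), range(n_queries), labels))


jobs = build_jobs()
print(len(jobs))  # 2 * 12 * (4 + 2 * 4) = 288
```

Each of the 12 labels receives 2 x 12 = 24 calls, which matches the n column.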
Per-batch latency at the /rerank endpoint, identical 32-doc payload:
| Sidecar | batch=4 | batch=16 | batch=32 |
|---|---|---|---|
| CPU TEI (reranker:80) | 172 ms | 732 ms | 1,446 ms |
| bge-gpu (:8083) | 16 ms | 35 ms | 56 ms |
| qwen-4b (:8084) | 146 ms | 507 ms | 993 ms |
| qwen-8b (:8085) | est. ~270 | est. ~1,000 | ~2,000 ms |
| Test | Survived? |
|---|---|
| 🟢 docker compose down + up | yes (14 docs intact, all channels return identical results) |
| 🟢 docker volume prune -f | yes (no project state in named volumes; everything bind-mounted) |
| 🟢 docker compose build api rebuild | yes (LightRAG KV is bind-mounted to ./data/lightrag/) |
The whole stack is bind-mounted to host dirs under ./data/. There are no docker named volumes. docker volume prune is safe.
┌───────────────────────────────────────────────────────────┐
│  rag-base (docker compose up)                             │
│                                                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────┐     │
│  │ Postgres │  │   TEI    │  │   TEI    │  │Memgraph│     │
│  │ pgvector │  │  Embed   │  │  Rerank  │  │  MAGE  │     │
│  │ pg_search│  │  :8081   │  │  :8082   │  │ :7687  │     │
│  │  :5433   │  │  BGE-M3  │  │ BGE-r-m3 │  │  3.9   │     │
│  └──────────┘  └──────────┘  └──────────┘  └────────┘     │
│       ▲             ▲             ▲            ▲          │
│       └─────────────┴──────┬──────┴────────────┘          │
│                            │                              │
│                 ┌──────────┴──────────┐                   │
│                 │    API (FastAPI)    │                   │
│                 │        :5050        │                   │
│                 │     + LightRAG      │                   │
│                 │     + GLiNER        │                   │
│                 └─────────────────────┘                   │
└───────────────────────────────────────────────────────────┘
                             ▲
                             │ REST API (:5050)
                             │
                  Any client application
5 always-on containers (4 prebuilt images, 1 custom API), plus 3 optional GPU rerank sidecars on AMD Strix Halo.
| Service | Image | Role | Profile |
|---|---|---|---|
| 🟦 Postgres + pgvector + pg_search | paradedb/paradedb:0.23.1-pg17 | Documents, chunks, vectors. HNSW for semantic + Tantivy BM25 for keyword. | required |
| 🟨 TEI Embed | text-embeddings-inference:cpu-1.9 | Text to 1024d vectors via HuggingFace TEI. | required |
| 🟨 TEI Rerank (CPU) | text-embeddings-inference:cpu-1.9 | BGE-reranker-v2-m3 on CPU. Default rerank backend. | --profile rerank |
| 🟩 GPU Rerank: BGE | reranker-rocm:local (custom) | Same BGE-reranker-v2-m3 on AMD gfx1151 ROCm. Picked by rerank_model: "bge-gpu". | --profile rerank-bge-gpu |
| 🟩 GPU Rerank: Qwen-4B | reranker-rocm:local (custom) | Qwen3-Reranker-4B (LM-style yes/no scoring). Picked by rerank_model: "qwen-4b". | --profile rerank-qwen |
| 🟩 GPU Rerank: Qwen-8B | reranker-rocm:local (custom) | Qwen3-Reranker-8B (~16 GiB VRAM). Picked by rerank_model: "qwen-8b". | --profile rerank-qwen-8b |
| 🟧 Memgraph MAGE | memgraph/memgraph-mage:3.9.0 | Knowledge graph: LightRAG entities (:base) + custom concepts (:Concept). Cypher, PageRank, Louvain, BFS. | --profile graph |
| 🟪 API | Custom (python:3.12-slim) | Orchestrator. FastAPI + LightRAG (ingest-time entity extraction) + GLiNER (query-time NER). | required |
Reranker, Memgraph, and the GPU sidecars are fully optional. The api detects their absence at startup and degrades gracefully. Hybrid + rerank still work without graph; semantic + BM25 still work without rerank. Requesting an unavailable GPU sidecar via rerank_model silently falls back to the default CPU TEI reranker.
First boot takes ≈ 4-5 min; subsequent boots ≈ 30 s.
- Docker + docker compose
- ~10 GB free disk for model caches
- A host LLM endpoint compatible with the OpenAI Responses API (vLLM, llama.cpp server with --responses, etc). Any OpenAI-compatible LLM works for ingest-time entity extraction. The api still runs without an LLM; you just lose the LightRAG entity graph at ingest.
git clone git@github.com:hec-ovi/rag-base.git
cd rag-base
cp .env.template .env
nano .env

Three things need attention:
POSTGRES_PASSWORD=your-secure-password
EMBEDDING_MODELS_DIR=/absolute/path/for/embedder/cache
RERANKER_MODELS_DIR=/absolute/path/for/reranker/cache
LLM_BASE_URL=http://host.docker.internal:8000 # your host LLM endpoint
LLM_MODEL=Qwen3.6-27B-AWQ4 # model id your endpoint serves

Memgraph runs as UID 101 inside the container. The host bind-mount dir must be owned by 101 or it will SIGSEGV at startup with banner-only logs:
mkdir -p ./data/postgres ./data/memgraph ./data/lightrag
docker run --rm -v "$PWD/data/memgraph":/data alpine chown -R 101:101 /data

Postgres uses the PGDATA=/var/lib/postgresql/data/pgdata subdir trick so it handles its own ownership; the api runs as root so ./data/lightrag is fine without chown.
# Core only (postgres + embedding + api)
docker compose up -d
# Add reranking (recommended for quality)
docker compose --profile rerank up -d
# Add knowledge graph (LightRAG entity extraction + graph endpoints)
docker compose --profile graph up -d
# Add GPU reranker sidecars (AMD Strix Halo / gfx1151 only, requires /dev/kfd + /dev/dri)
docker compose --profile rerank-bge-gpu up -d # same model as CPU TEI, ~25x faster
docker compose --profile rerank-qwen up -d # Qwen3-Reranker-4B, better on hard corpora
docker compose --profile rerank-qwen-8b up -d # Qwen3-Reranker-8B, ~16 GiB VRAM
# Everything
docker compose --profile rerank --profile rerank-bge-gpu --profile rerank-qwen --profile rerank-qwen-8b --profile graph up -d

The GPU sidecars only run when their profile is active. Set the matching URL in .env so the api can find them; clients then opt in per request via rerank_model. Omit the field to keep the existing CPU TEI path.
| Sidecar | env var | value |
|---|---|---|
| BGE on gfx1151 | BGE_GPU_RERANKER_URL | http://reranker-bge-gpu:80 |
| Qwen-4B on gfx1151 | QWEN_RERANKER_URL | http://reranker-qwen:80 |
| Qwen-8B on gfx1151 | QWEN_8B_RERANKER_URL | http://reranker-qwen-8b:80 |
TEI downloads embed/rerank models on first boot (~3 min on a normal connection). The api starts as soon as postgres + embedding are healthy, so the reranker may finish AFTER api boots. One-time fix:
docker compose restart api

curl http://localhost:5050/health

{"status": "ok", "postgres": "connected", "embedding": "healthy", "reranker": "healthy", "memgraph": "connected"}

curl -X POST http://localhost:5050/v1/documents \
-H "Content-Type: application/json" \
-d '{
"title": "React Documentation",
"content": "React is a JavaScript library for building user interfaces.",
"metadata": {"source": "docs"}
}'
curl -X POST http://localhost:5050/v1/search \
-H "Content-Type: application/json" \
  -d '{"query": "frontend frameworks", "top_k": 5}'

OpenAPI docs auto-generated at http://localhost:5050/docs. Full request/response specs in llm.txt.
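The same two calls can be issued from Python. This is a minimal standard-library sketch under stated assumptions: `build_post` and `send` are hypothetical helper names, the base URL is the default port from this README, and the response schema is not assumed (see llm.txt for the real specs).

```python
import json
import urllib.request

BASE_URL = "http://localhost:5050"  # default API port from this README


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the API without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send a request and decode the JSON response body as-is."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same payloads as the curl examples above.
ingest = build_post("/v1/documents", {
    "title": "React Documentation",
    "content": "React is a JavaScript library for building user interfaces.",
    "metadata": {"source": "docs"},
})
search = build_post("/v1/search", {"query": "frontend frameworks", "top_k": 5})
```

With the stack running, `send(ingest)` followed by `send(search)` performs the round trip; building and sending are kept separate so the payloads can be inspected offline.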
| Endpoint | Channels | Rerank | Graph | When to use |
|---|---|---|---|---|
| POST /v1/search | semantic + keyword + (graph) -> RRF | configurable | configurable | default for general queries; recall + precision |
| POST /v1/search/semantic | pgvector cosine only | 🔴 | 🔴 | paraphrase-heavy queries |
| POST /v1/search/keyword | BM25 only (indexed_content) | 🔴 | 🔴 | rare jargon, exact terms, statute citations |
| POST /v1/search/graph | GLiNER NER -> Memgraph -> chunks | 🔴 | 🟢 only | entity-anchored queries; latency-sensitive (no LLM at query time) |
Picks which reranker backend to use. Optional; omit to keep today's behavior.
| Value | Backend | Latency (this engine, batch 32 direct) | When to use |
|---|---|---|---|
| omitted / "default" | CPU TEI BGE-v2-m3 (reranker service) | 1446 ms | default; no GPU needed |
| "bge-gpu" | Same BGE-v2-m3 on AMD gfx1151 GPU | 56 ms | same quality as default, ~25x faster |
| "qwen-4b" | Qwen3-Reranker-4B on AMD gfx1151 GPU | 993 ms | LM-style scoring; useful on semantically subtle queries |
| "qwen-8b" | Qwen3-Reranker-8B on AMD gfx1151 GPU | ~2 s | larger LM reranker; slight edge over 4B on multi-hop chains, ~16 GiB VRAM |
If "bge-gpu", "qwen-4b", or "qwen-8b" is requested but the corresponding sidecar URL is unset or unreachable, the API silently falls back to the default CPU TEI reranker (a warning is logged). This makes the new modes safe to send from any client without coordination.
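The fallback rule can be sketched as a small routing function. This is an illustration of the described behavior, not the api's actual code: `resolve_reranker_url`, the `is_reachable` hook, and the handling of unknown values are assumptions; only the env var names and the fall-back-with-warning behavior come from this README.

```python
import logging

logger = logging.getLogger("rerank_routing")

# rerank_model value -> env var holding the sidecar URL (names from this README).
SIDECAR_ENV = {
    "bge-gpu": "BGE_GPU_RERANKER_URL",
    "qwen-4b": "QWEN_RERANKER_URL",
    "qwen-8b": "QWEN_8B_RERANKER_URL",
}


def resolve_reranker_url(rerank_model, env, default_url, is_reachable=lambda url: True):
    """Return the reranker URL for a request, falling back to the CPU default."""
    if rerank_model in (None, "default"):
        return default_url
    # Unknown values also fall through to the default here (an assumption).
    url = env.get(SIDECAR_ENV.get(rerank_model, ""), "")
    if url and is_reachable(url):
        return url
    logger.warning("rerank_model=%r unavailable; using default reranker", rerank_model)
    return default_url


env = {"BGE_GPU_RERANKER_URL": "http://reranker-bge-gpu:80"}
print(resolve_reranker_url("bge-gpu", env, "http://reranker:80"))
print(resolve_reranker_url("qwen-4b", env, "http://reranker:80"))
```

The first call resolves to the configured sidecar; the second logs a warning and returns the default because QWEN_RERANKER_URL is unset.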
| Method | Endpoint | Purpose |
|---|---|---|
| Documents | | |
| POST | /v1/documents | Ingest: chunk, embed, store, optionally LightRAG-extract entities |
| GET | /v1/documents?offset=0&limit=20 | List documents (paginated) |
| GET | /v1/documents/{id} | Get document with chunks |
| DELETE | /v1/documents/{id} | Delete document + chunks (cascade) |
| Search | | |
| POST | /v1/search | Hybrid (semantic + keyword + graph + rerank) |
| POST | /v1/search/semantic | Vector-only |
| POST | /v1/search/keyword | BM25-only |
| POST | /v1/search/graph | Graph-only fast mode (NER -> Memgraph -> chunks) |
| Graph | | |
| POST | /v1/concepts | Create/update typed concept (upsert by name) |
| GET | /v1/concepts/{id} | Get concept with relations |
| DELETE | /v1/concepts/{id} | Delete concept + edges |
| POST | /v1/relations | Create directed edge between concepts |
| GET | /v1/relations?concept_name=X | Get relations for a concept |
| DELETE | /v1/relations/{id} | Delete relation |
| GET | /v1/graph/neighbors/{id}?depth=N | Multi-hop traversal (max depth 5) |
| GET | /v1/graph/path/{from_id}/{to_id} | Shortest path (BFS) |
| GET | /v1/graph/communities | Louvain community detection |
| GET | /v1/graph/stats | Node/edge counts |
| Passthrough | | |
| POST | /v1/embed | Embed text(s) directly via TEI |
| POST | /v1/rerank | Rerank candidates directly via TEI |
| Health | | |
| GET | /health | All-services status |
| GET | /health/models | Loaded model info |
| Status | |
|---|---|
| 🟢 Atomic ingest (doc + chunks + vectors in one PG transaction) | shipped |
| 🟢 Real BM25 via ParadeDB pg_search (Tantivy in Postgres) | shipped, replaces older tsvector + ts_rank |
| 🟢 Contextual Chunk Headers (title + metadata + markdown header path) auto-prepended | shipped |
| 🟢 RRF fusion across semantic / keyword / graph channels | shipped |
| 🟢 Cross-encoder rerank (last stage) | shipped (CPU TEI default; optional GPU sidecars: BGE on gfx1151, Qwen3-Reranker-4B on gfx1151) |
| 🟢 GPU reranker sidecars on AMD Strix Halo (gfx1151) | shipped 2026-05-03; same BGE-v2-m3 (~25x faster than CPU), Qwen3-Reranker-4B (~1.5x faster), or Qwen3-Reranker-8B (~1.3x faster). All three share the same 1.97 GiB image; pick at request time via rerank_model. |
| 🟢 LightRAG entity + relation extraction at ingest -> Memgraph | shipped (LightRAG 1.4.15+, CVE-patched) |
| 🟢 GLiNER NER for graph-only fast mode (no LLM at query time) | shipped |
| 🟢 Graph channel results actually merged into RRF (closes the historical search.py:66 TODO) | shipped |
| 🟢 Graceful degradation: api never crashes on missing optional services | shipped |
| 🟢 Anthropic Contextual Retrieval (contextual_retrieval: true flag on POST /v1/documents) | shipped, opt-in; per-chunk LLM blurb prepended to indexed_content, vLLM auto-caches the document prefix |
| 🟢 Editable prompt files in api/prompts/*.md (decoupled from Python code) | shipped; covers CR + query-time entity extraction, restart api to reload |
| 🟢 Test gates: every SOTA mechanism has a "did it actually fire?" test | shipped (131 tests: 123 fast + 7 slow + 1 environment-conditional skip; the +16 from the rerank work: 17 sidecar unit tests, 12 routing unit tests, 4 live-stack 3-mode tests; sidecar tests run without torch via FakeCrossEncoder, integration tests gated by env when sidecars are not running) |
| 🟡 LightRAG ingest is ~1-3 min per chunk on local 27B reasoning LLM | by design; bulk-ingest path uses X-LightRAG-Ingest: false header |
| 🟡 CPU TEI rerank is slow (multi-second at batch 50) | by design; opt-in GPU sidecars (rerank-bge-gpu / rerank-qwen profiles) drop full-pipeline /v1/search p50 from ~530 ms to ~65 ms (BGE-GPU) or ~291 ms (Qwen-4B) on this engine |
| 🟡 Contextual Retrieval blurbing reuses LLM_BASE_URL, so a 27B reasoning LLM blurbs slowly | by design; per-purpose LLM env vars are a deferred refactor (see "Deferred decisions") |
| 🔴 No authentication; intended behind a reverse proxy | known |
| 🔴 Single node; no clustering/replication | known |
| 🔴 TEI cpu-1.9; no GPU image baked in | tag swap to cuda / rocm variants when needed |
query
  │
  ├──► [embed via TEI] ──► query_vector
  │
┌─────────────────────────────────────────┐
│            PARALLEL FAN-OUT             │
├─────────────────────────────────────────┤
│ semantic  pgvector cosine over chunks   │
│ keyword   BM25 over indexed_content     │
│ graph     LLM ent. extraction ->        │
│           Memgraph match -> doc-        │
│           restricted semantic search    │
└─────────────────────────────────────────┘
  │
  ▼
RRF (1 / (60 + rank_i))
  │
  ▼
cross-encoder rerank (top-N -> top-K)
  │
  ▼
results
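The RRF step in the diagram above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion with the constant 60 shown in the diagram; the function name `rrf_fuse`, the 1-based rank convention, the tie-break on chunk id, and the sample chunk ids are assumptions rather than the repo's fusion service.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse per-channel ranked id lists by summing 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, chunk_id in enumerate(ranked_ids, start=1):  # rank 1 = best
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = rrf_fuse({
    "semantic": ["c2", "c1", "c3"],
    "keyword": ["c1", "c4"],
    "graph": ["c1", "c2"],
})
print([chunk_id for chunk_id, _ in fused])  # ['c1', 'c2', 'c4', 'c3']
```

A chunk that appears in several channels accumulates score from each one, which is why c1 wins here despite being only second in the semantic list.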
query
  │
  ▼
GLiNER NER (no LLM)
  │
  ▼
Memgraph entity match (exact_ci, fallback CONTAINS)
  │
  ▼
0/1/2-hop subgraph (two-query nodes-then-edges)
  │
  ▼
degree/none ranking (seeds first)
  │
  ▼
source_id bridge: chunk-<hash> -> LightRAG KV ->
  (full_doc_id, chunk_order_index) -> chunks.id
  │
  ▼
{matched_entities, subgraph: {nodes, edges}, chunks, trace}
Why both? /v1/search answers general questions with full retrieval power (~80 ms without rerank, ~28 s with). /v1/search/graph answers entity-anchored questions in ~240 ms with no LLM call, and returns the actual subgraph + per-chunk source attribution so you can show your work.
rag-base auto-prepends structural prefixes to each chunk before embedding and BM25:
1. Document title and metadata: [<title> | <k1>: <v1> | <k2>: <v2>]
2. Markdown heading breadcrumb at chunk start: [Section > Subsection > Heading] (omitted for headerless chunks)
3. Optional: Anthropic Contextual Retrieval blurb (a 50-100 token LLM-generated context line) when contextual_retrieval: true is sent on POST /v1/documents
4. The raw chunk text
The augmented form lives in chunks.indexed_content and is what the embedder sees and what BM25 indexes. Raw chunk text in chunks.content stays clean for display.
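The assembly of the augmented form can be sketched as follows. This is an illustration under assumptions: the function name `build_indexed_content` and the newline separator between parts are hypothetical; the bracketed prefix formats and their order come from the list above.

```python
from typing import Optional


def build_indexed_content(
    title: str,
    metadata: dict[str, str],
    header_path: list[str],
    content: str,
    blurb: Optional[str] = None,
) -> str:
    """Prepend structural prefixes to the raw chunk text."""
    # [<title> | <k1>: <v1> | <k2>: <v2>]
    fields = [title] + [f"{key}: {value}" for key, value in metadata.items()]
    parts = ["[" + " | ".join(fields) + "]"]
    # [Section > Subsection > Heading], omitted for headerless chunks.
    if header_path:
        parts.append("[" + " > ".join(header_path) + "]")
    # Optional Contextual Retrieval blurb.
    if blurb:
        parts.append(blurb)
    parts.append(content)  # the raw chunk text always comes last
    return "\n".join(parts)  # separator is an assumption


print(build_indexed_content(
    "React Documentation", {"source": "docs"}, ["Hooks", "useState"],
    "useState returns a stateful value and a setter.",
))
```

Only the returned string would be embedded and BM25-indexed; the `content` argument is what stays in chunks.content for display.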
Steps 1-2 are always on, no configuration. Step 3 is opt-in per ingest:
curl -X POST http://localhost:5050/v1/documents \
-H 'Content-Type: application/json' \
  -d '{"title": "...", "content": "...", "metadata": {...}, "contextual_retrieval": true}'

When enabled, the api makes one LLM call per chunk (the configured LLM_BASE_URL) with the full document as a stable prefix. vLLM's automatic prefix caching shares the document KV across the per-chunk calls, dropping the cost to ~$1 per 1M document tokens (Anthropic's measured number with their cookbook + prompt cache). On per-chunk LLM failure the chunk gets no blurb but ingest still succeeds. The prompt body is api/prompts/contextual_retrieval.md, editable without code changes; restart the api container to reload after editing.
Anthropic reports 35% / 49% / 67% reduction in retrieval failures (alone, with BM25, with rerank). Lift is corpus-dependent; on saturated smoke sets the gain is invisible. Measure on your real corpus before deciding whether to enable corpus-wide.
Ingested a 1784-word doc (llm.txt head, 4 chunks) through POST /v1/documents with contextual_retrieval: true against the local Qwen3.6-27B-AWQ4 vLLM. All 4 blurbs landed clean and specific (e.g. "Technical Reference: System Purpose, Ingest Pipeline (Chunking, Contextual Chunk Headers, Embedding, Postgres Storage, LightRAG Entity Extraction), and Hybrid Retrieval Pipeline …"), no generic sludge, no per-chunk failures. Wall: 580 s for 4 chunks (serial; cache-warm pattern: first call solo to populate the prefix cache, the rest sequential).
Retrieval matrix against the same doc, 4 queries each crafted to target a specific chunk:
| Query (paraphrased) | target chunk | lexical | semantic | hybrid |
|---|---|---|---|---|
| graph fast mode endpoint with NER | 1 | 1 | 1 | 1 |
| reranker container memgraph ports | 2 | 1 | 1 | 1 |
| startup order health checks | 3 | 1 | 1 | 1 |
| system purpose ingest pipeline | 0 | 2 | 1 | 1 |
11/12 = 91.7% hit@1 across the matrix. The single rank-2 (lexical) put the CR-augmented chunk at rank 2 behind a leftover non-CR chunk with overlapping content; semantic and hybrid both pulled it back to rank 1, with the CR-augmented chunk outranking the non-CR equivalent in the same retrieval space. That's CR doing its job: the blurb adds disambiguating context the embedder uses to break ties.
chunks.content integrity verified: all 4 chunks hold unmodified slices of the source doc; indexed_content carries the CCH + blurb. The reranker (which sees chunks.content, not indexed_content) is unaffected by CR.
| What | Where | Survives down + up? | Survives docker volume prune? |
|---|---|---|---|
| Documents + chunks + vectors | ./data/postgres/ -> /var/lib/postgresql/data (PGDATA subdir) | 🟢 | 🟢 |
| LightRAG entity graph | ./data/memgraph/ -> /var/lib/memgraph | 🟢 | 🟢 |
| LightRAG KV + nano-vectordb | ./data/lightrag/ -> /app/lightrag_data | 🟢 | 🟢 |
| Embedding/reranker model cache | EMBEDDING_MODELS_DIR / RERANKER_MODELS_DIR (host) | 🟢 | 🟢 |
All persistent state lives on host bind mounts under ./data/, never in docker named volumes. ./data/ is gitignored. docker compose down and docker volume prune both leave project data untouched. To wipe, rm -rf ./data/{postgres,memgraph,lightrag}.
Why bind mounts not named volumes? I run docker volume prune regularly during development; project state living in named volumes is just a footgun. Bind mounts are also easier to back up, inspect, and migrate.
| Swap | Stage | Today | Result | Status |
|---|---|---|---|---|
| 3a. ParadeDB pg_search (real BM25) | Stage 1 keyword | Was Postgres ts_rank (no IDF, no length norm) | True BM25 via Tantivy. Pre-swap hit@5 = 0.47; post-swap hit@5 = 1.00 (+113%). | 🟢 done |
| 3c. Header-path prefix on chunks | Ingest | Was title + metadata only | + markdown heading breadcrumb. Keyword hit@1 0.87 -> 0.93 on saturated smoke set; bigger lift expected on harder corpora. | 🟢 done |
| | Stage 3 rerank | BGE-reranker-v2-m3 | est. +3-8 NDCG@10 with bge-reranker-v2-gemma / Qwen3-Reranker-8B / mxbai-large-v2 | 🟧 deferred (see below) |
Qwen3-Embedding-8B (semantic): deferred
Open-weight embedding leader on MMTEB at 70.58 vs BGE-M3 around 64. The 6-point lift is real but small relative to cost: 8B vs ~568M params is 14× larger. VRAM jumps from ~16 GB host RAM (BGE-M3 on CPU TEI) to ~30-40 GB GPU TEI. Per-query embed latency rises from ~10-30 ms to 100-300 ms. A swap also invalidates every existing vector and forces a full re-ingest.
Decision: keep BGE-M3 until an eval shows the embedding channel is the bottleneck. If it does, Qwen3-Embedding-8B is the swap.
Cross-encoder rerank upgrade: shipped 2026-05-03 as opt-in GPU sidecars
The 3-mode rerank shape lets each request pick its own backend via rerank_model on POST /v1/search. Default behavior is byte-identical to before; the new modes silently fall back to the default if the corresponding sidecar is not running.
| Mode | Backend | Hardware | Latency p50 (full /v1/search, candidates=50) | Quality vs default |
|---|---|---|---|---|
| (omitted) / "default" | CPU TEI BGE-v2-m3 | CPU | 553 ms | baseline |
| "bge-gpu" | Same BGE-v2-m3 on ROCm | AMD gfx1151 | 65 ms | identical (within fp16 noise) |
| "qwen-4b" | Qwen3-Reranker-4B on ROCm | AMD gfx1151 | 291 ms | LM-style yes/no scoring; ranks subtly-relevant docs higher |
Both GPU sidecars use the same 1.97 GiB image (reranker-rocm/) on ubuntu:rolling + ROCm gfx1151 prerelease wheels + sentence-transformers >= 5.4. Qwen pin: model commit 22e683669bc0f0bd69640a1354a6d0aebcfeede5 (the 2026-04-16 ST integration). VRAM use: BGE ~1.5 GiB, Qwen ~10 GiB; both fit comfortably alongside a vLLM at gpu-memory-utilization 0.55 in 128 GiB UMA.
Why the 8B candidate was kept out: Qwen3-Reranker-8B is broken on vLLM gfx1151 today (lemonade-sdk/vllm-rocm #3 EngineCore HIP failure; vLLM #21681 random scores). Qwen-4B sits one notch below 8B on quality but ships clean.
Why a custom FastAPI sidecar instead of TEI/vLLM: TEI can't load any SOTA reranker (encoder-only loader). vLLM's reranker sidecar mode is unstable on gfx1151 V1 (#32180) and would compete with the host LLM for the same GPU. ~150 LOC of FastAPI + CrossEncoder.predict() is enough.
Anthropic Contextual Retrieval: shipped as opt-in flag (was deferred to wrapper, now in engine)
CR is an ingest-time content transformation: a cheap LLM generates a 50-100 token chunk-specific blurb, prepended to each chunk before embedding and BM25 indexing. Anthropic reports 35% / 49% / 67% reduction in retrieval failures (alone, with BM25, with rerank).
The deferral originally placed CR in the consumer (knowledge-base wrapper). Reconsidered: the engine already has an LLM dependency for LightRAG ingest and query-time entity extraction; adding CR as a third user of the same dependency is cheap, and putting it inline avoids the chunk-boundary coordination problem the wrapper-side approach would have. Shipped 2026-04-27 as an opt-in flag: POST /v1/documents with {"contextual_retrieval": true} runs CR; default false preserves byte-identical behavior for callers who don't opt in. vLLM's automatic prefix caching shares the document KV across the per-chunk calls, so cost stays at ~$1 per 1M document tokens.
The prompt body lives in api/prompts/contextual_retrieval.md and is editable without touching Python. Bring up the api with LLM_BASE_URL reachable; per-chunk LLM failure degrades gracefully (that chunk gets no blurb, ingest still succeeds).
Deferred follow-on (still open): per-purpose LLM env vars (LLM_INGEST_* / LLM_QUERY_* / LLM_BLURBER_*) so callers can route CR blurbing to a small fast model while keeping the big model for entity extraction. Today CR shares LLM_BASE_URL with the rest, so blurbing on a 27B reasoning model is slower than necessary. ~50 LOC additive change with no breakage when revisited.
No, not linearly. The two channels lost play different roles:
- Stage 1 (semantic + keyword + graph fused via RRF) governs recall: "did we get the right doc into the candidate pool at all?"
- Stage 2 (rerank) governs precision: "given recall, is it ranked #1?"
hit@1 = P(recall) × P(rerank ranks it #1 | recall). Losing the embedding lift reduces P(recall) for the semantic channel only; BM25 + graph still fire and RRF fuses them. Losing the rerank lift reduces P(rank #1 | recall) directly with no fallback. A naive sum predicts 6+5 = 11% end-to-end loss; realistic figure is closer to 3-7% hit@1.
- hit@K: fraction of test queries where the expected doc appeared in top K. 0.47 = 47%.
- NDCG@10: 0-1 score for top-10 ordering quality vs ideal. BEIR averages NDCG@10 across 13 datasets.
- MRR: mean reciprocal rank of the expected doc. 1.0 if always rank 1; 0 if not in top K.
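The hit@K and MRR definitions can be checked against a reported row. This is a minimal sketch: `hit_at_k` and `mrr` are hypothetical helper names, and the input is the 1-based rank of the expected doc per query (None when it is not retrieved). The sample ranks reproduce the lexical row of the 13-query table: twelve rank-1 hits and one paraphrase miss at rank 3.

```python
from typing import Optional, Sequence


def hit_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose expected doc appears at rank <= k."""
    return sum(1 for rank in ranks if rank is not None and rank <= k) / len(ranks)


def mrr(ranks: Sequence[Optional[int]]) -> float:
    """Mean reciprocal rank; a query with no hit contributes 0."""
    return sum(1.0 / rank for rank in ranks if rank is not None) / len(ranks)


lexical_ranks = [1] * 12 + [3]
print(round(hit_at_k(lexical_ranks, 1), 2))  # 0.92
print(round(hit_at_k(lexical_ranks, 5), 2))  # 1.0
print(round(mrr(lexical_ranks), 3))          # 0.949
```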
Baselines per-channel in tests/golden/eval_history.jsonl. Rerun with python tests/baseline/run_baseline.py --phase N after any swap.
All via .env. Full annotated template at .env.template.
| Variable | Description |
|---|---|
| POSTGRES_PASSWORD | Database password |
| EMBEDDING_MODELS_DIR | Absolute host path for embedding model cache |
| RERANKER_MODELS_DIR | Absolute host path for reranker model cache |
| Variable | Default | Description |
|---|---|---|
| EMBEDDING_MODEL | Snowflake/snowflake-arctic-embed-l-v2.0 | HF embedding model id (1024d expected) |
| RERANK_MODEL | BAAI/bge-reranker-v2-m3 | HF reranker model id (default CPU TEI) |
| BGE_GPU_RERANKER_URL | (empty) | Set to http://reranker-bge-gpu:80 after starting --profile rerank-bge-gpu. Enables rerank_model: "bge-gpu". |
| QWEN_RERANKER_URL | (empty) | Set to http://reranker-qwen:80 after starting --profile rerank-qwen. Enables rerank_model: "qwen-4b". |
| QWEN_8B_RERANKER_URL | (empty) | Set to http://reranker-qwen-8b:80 after starting --profile rerank-qwen-8b. Enables rerank_model: "qwen-8b". |
| LLM_BASE_URL | http://host.docker.internal:8000 | OpenAI Responses API endpoint (LightRAG entity extraction) |
| LLM_MODEL | Qwen3.6-27B-AWQ4 | Model id for the LLM endpoint |
| LLM_API_KEY | (empty) | Bearer token if needed |
| LLM_ENABLE_THINKING | false | When false, sends chat_template_kwargs.enable_thinking=false. 50x latency improvement on Qwen3.6-27B-AWQ4 (684 to 3 output tokens, 33 s to 0.7 s). |
| Variable | Default | Description |
|---|---|---|
| HNSW_M | 16 | HNSW build param (higher = denser graph) |
| HNSW_EF_CONSTRUCTION | 64 | HNSW build accuracy |
| HNSW_EF_SEARCH | 100 | HNSW query-time accuracy |
| DEFAULT_SEARCH_TOP_K | 20 | Default top_k |
| DEFAULT_RERANK_CANDIDATES | 50 | RRF candidates passed to reranker |
| CHUNK_SIZE | 512 | Words per chunk |
| CHUNK_OVERLAP | 50 | Overlap words between chunks |
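To show how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a minimal word-window chunker. It is a sketch under assumptions: `chunk_words` is a hypothetical name, splitting is on whitespace only, and the real chunking service may additionally respect markdown headings or other boundaries.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words sharing overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each window starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(1200))
pieces = chunk_words(sample)
print(len(pieces))  # 3 windows starting at words 0, 462 and 924
```

With the defaults, consecutive windows start 462 words apart, so the last 50 words of one chunk are repeated at the start of the next.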
| Service | Default | Env var |
|---|---|---|
| API | 5050 | API_PORT |
| Postgres | 5433 | POSTGRES_PORT |
| TEI Embed | 8081 | EMBEDDING_PORT |
| TEI Rerank (CPU) | 8082 | RERANK_PORT |
| GPU Rerank: BGE | 8083 | BGE_GPU_RERANK_PORT |
| GPU Rerank: Qwen-4B | 8084 | QWEN_RERANK_PORT |
| GPU Rerank: Qwen-8B | 8085 | QWEN_8B_RERANK_PORT |
| Memgraph | 7687 | MEMGRAPH_PORT |
Port 8000 is deliberately avoided (common for vLLM and other local services).
# Full suite (131 tests: 123 fast + 7 slow + 1 environment-conditional skip)
pytest
# Just the GPU reranker sidecar unit tests (no torch / no GPU required)
pytest reranker-rocm/tests/
# Fast iteration (deselect slow LightRAG ingest tests)
pytest -m 'not slow'
# Integration validation runner (step-by-step retrieval matrix, ctrl-C-able)
python tests/integration_validation/run.py ingest # load corpus
python tests/integration_validation/run.py ingest-graph # also feed multi-hop docs to LightRAG
python tests/integration_validation/run.py step lexical
python tests/integration_validation/run.py step semantic
python tests/integration_validation/run.py step hybrid_norerank
python tests/integration_validation/run.py step hybrid_rerank
python tests/integration_validation/run.py step hybrid_graph_rerank
python tests/integration_validation/run.py step hybrid_graph_norerank
python tests/integration_validation/run.py step graph_only
python tests/integration_validation/run.py step header_ablation

State persists across invocations in tests/integration_validation/results/state.json. Findings: tests/integration_validation/REPORT.md.
./scripts/backup.sh # dumps to backup_YYYYMMDD_HHMMSS.sql
./scripts/restore.sh backup_20260411.sql # restores from file (destructive!)

Or just tar czf data-snapshot.tar.gz ./data/ for a complete on-disk snapshot (postgres + memgraph + lightrag in one shot).
This repo includes llm.txt, a structured technical reference for LLM agents and AI-assisted development:
- Full API spec (request/response examples for every endpoint)
- Configuration reference
- Startup behavior + healthcheck semantics
- Verified behavior catalog
- Known limitations
Point your agent at llm.txt for complete, current context.
rag-base/
├── docker-compose.yml # All 5 services
├── .env.template # Config template (committed)
├── .env # Your config (gitignored)
├── data/ # Bind mounts (gitignored): postgres, memgraph, lightrag
├── postgres/ # Schema (init.sql) + container README
├── embedding/ # TEI embed container README
├── reranker/ # TEI rerank (CPU) container README
├── reranker-rocm/ # GPU rerank sidecar (Dockerfile + server.py + tests). One image, two roles (bge-gpu, qwen-4b) selected by RERANK_MODEL.
├── memgraph/ # Memgraph container README
├── api/ # FastAPI app (only custom code)
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── prompts/ # User-editable LLM prompt templates (.md, with {{var}} substitution)
│   │   ├── contextual_retrieval.md
│   │   └── query_entity_extraction.md
│   └── src/
│       ├── routers/ # FastAPI routes (search, graph_search, documents, ...)
│       ├── services/ # chunking, embedding, fusion, lightrag_store, ner, prompts, ...
│       └── models/ # Pydantic request/response models
├── tests/ # Unit + integration tests, golden set, integration_validation runner
├── scripts/ # Backup/restore
├── llm.txt # Full technical reference (LLM-ready)
└── README.md # This file
MIT.