A multi-document RAG system that helps a fraud analyst investigate OTP/SMS fraud across a heterogeneous corpus — policies, SOPs, threat-intel briefs, regulations (Markdown + PDF), historical investigation cases (CSV), and indicators of compromise (JSON). It uses hybrid retrieval (dense + BM25, fused with Reciprocal Rank Fusion) and answers with cross-document citations.
The fraud domain is the same A2P-bypass problem seen in Project 1, viewed from the investigation side: OTT grey-routing, SIM-box farms, OTP interception (SIM-swap), and smishing.
Part 2 of 3 in the Telecom RAG series. See
otp-signature-analytics (real-time RAG) and mno-revenue-assurance (agentic RAG).
Fraud queries are full of exact tokens — sender IDs (AD-SWIGGY), IMEI
prefixes, vendor names (RouteX). Pure dense (semantic) retrieval can miss
exact-match keywords; pure BM25 misses paraphrases. We run both and fuse the
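ranked lists with RRF (score = Σ 1/(K + rank), summed over the retrievers, where
rank is the chunk's position in each list). RRF needs no shared score scale, and
in practice it usually does at least as well as either retriever alone.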
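The fusion step can be sketched in a few lines. This is a minimal illustration, not the code in hybrid_retriever.py: the function name `rrf_fuse`, the 1-based ranks, the default `k=60`, and the chunk ids are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_k=None):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains,
    with rank starting at 1 (an assumption of this sketch).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_k] if top_k else fused


# Hypothetical ranked outputs of the dense and BM25 retrievers.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]
fused = rrf_fuse([dense, sparse])
print([chunk_id for chunk_id, _ in fused])
```

Only ranks enter the formula, so cosine similarities and BM25 scores never need to be normalised against each other. Chunks that both retrievers return (c1 and c3 here) rise above chunks that only one of them found.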
          corpus (md · pdf · csv · json)
                         │
                ┌────────┴──────────┐
                │ loaders (per type)│  -> common chunk shape {id,text,source,section,doc_type}
                └────────┬──────────┘
                         ├───────────────► DuckDB catalog (provenance / metadata)
                         ▼
embed ► numpy vector store        BM25 index (pure-python, over same chunks)
          │ dense ranks                   │ sparse ranks
          └──────────────┬────────────────┘
                         ▼
           Reciprocal Rank Fusion (RRF)
                         ▼
 fused top-k ► LLM (Ollama) / extractive fallback
                         ▼
            cited multi-document answer
cd otp-fraud-management
make setup # .venv + deps (incl. pypdf, reportlab for real PDFs)
make pipeline # generate corpus -> ingest (embed + DuckDB catalog)
make catalog # chunks/documents per doc_type
make search Q="SIM box IMEI grey route vendor RouteX" # inspect hybrid fusion
make ask Q="How do we investigate suspected A2P bypass and what did the Swiggy incident cost?"
make ui # Streamlit: catalog + retrieval inspector + chat
make test # 7 tests, offline & deterministic

Runs with no API keys and no Ollama (extractive + hashing fallbacks).
Optional: make setup-embeddings (semantic embeddings) and ./setup_ollama.sh
(fluent answers).
fraud_rag/
  config.py
  data/generate.py          synthetic MD/PDF/CSV/JSON fraud corpus
  pipeline/
    loaders.py              per-format loaders -> common chunk shape
    ingest.py               load -> embed -> vector store + DuckDB catalog
  rag/
    embeddings.py           sentence-transformers OR hashing fallback
    vector_store.py         persistent numpy cosine store
    bm25.py                 pure-python BM25 (Okapi)
    hybrid_retriever.py     dense + sparse, RRF fusion
    chunking.py             heading-aware chunker
    llm.py / fallback.py    Ollama with deterministic offline fallback
    answerer.py             hybrid retrieve -> cited answer
  app/cli.py, app/streamlit_app.py
tests/
- Heterogeneous ingestion: one dispatcher (load_path) routes each file to the right loader; everything converges on a single chunk schema.
- Provenance catalog in DuckDB records every chunk's source, doc_type, and section, which is useful for audits and metadata-filtered retrieval.
- Incremental: re-ingesting embeds only chunks whose id (content hash) is new.
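The dispatcher and the incremental step can be illustrated together. What follows is a rough sketch under stated assumptions, not the contents of loaders.py or ingest.py: the `load_path` body here is a stand-in, the helpers `make_chunk` and `chunks_to_embed` are hypothetical, the id is assumed to be a SHA-256 of source plus text, the doc_type values are placeholders, and the PDF and JSON loaders are left out for brevity.

```python
import csv
import hashlib
import json
from pathlib import Path


def make_chunk(text, source, section, doc_type):
    """Build the common chunk shape; the id is a content hash (assumed scheme)."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()[:16]
    return {"id": digest, "text": text, "source": source,
            "section": section, "doc_type": doc_type}


def load_markdown(path):
    # Simplification: one chunk per file, where the real chunker is heading-aware.
    return [make_chunk(path.read_text(encoding="utf-8"), path.name, "", "md")]


def load_csv(path):
    # One chunk per row, serialised with sorted keys so the hash is stable.
    with path.open(newline="", encoding="utf-8") as fh:
        return [make_chunk(json.dumps(row, sort_keys=True), path.name, f"row {i}", "csv")
                for i, row in enumerate(csv.DictReader(fh), start=1)]


# Extension -> loader; .pdf and .json would be registered the same way.
LOADERS = {".md": load_markdown, ".csv": load_csv}


def load_path(path):
    """Dispatch on file extension; unknown types yield no chunks."""
    path = Path(path)
    loader = LOADERS.get(path.suffix.lower())
    return loader(path) if loader else []


def chunks_to_embed(chunks, known_ids):
    """Incremental step: keep only chunks whose id is not already stored."""
    return [c for c in chunks if c["id"] not in known_ids]
```

Because the id is derived from content, an unchanged file produces the same ids on every run and `chunks_to_embed` returns nothing for it. Appending a row to a CSV yields exactly one new chunk to embed.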
Same as Project 1: TELCO_RAG_FORCE_FALLBACK, TELCO_OLLAMA_MODEL,
OLLAMA_HOST, TELCO_EMBED_MODEL. RRF_K (fusion constant) lives in config.py.
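For orientation, here is one way such settings might be read. It is a sketch only and does not reproduce config.py: the `load_settings` helper is hypothetical, the set of truthy strings for TELCO_RAG_FORCE_FALLBACK is an assumption, unset variables are returned as None rather than guessing the project's defaults, and `RRF_K = 60` is merely a commonly used value for the fusion constant.

```python
import os

# Assumed value for illustration; the project's actual constant is in config.py.
RRF_K = 60


def load_settings(env=os.environ):
    """Read the four documented variables from a mapping (os.environ by default)."""
    flag = env.get("TELCO_RAG_FORCE_FALLBACK", "").strip().lower()
    return {
        # Which strings count as "on" is an assumption of this sketch.
        "force_fallback": flag in {"1", "true", "yes"},
        # None means "not set"; the real defaults are not stated here.
        "ollama_model": env.get("TELCO_OLLAMA_MODEL"),
        "ollama_host": env.get("OLLAMA_HOST"),
        "embed_model": env.get("TELCO_EMBED_MODEL"),
        "rrf_k": RRF_K,
    }


settings = load_settings({"TELCO_RAG_FORCE_FALLBACK": "1"})
print(settings["force_fallback"])
```

Passing the environment in as a mapping keeps the function easy to test offline, in line with the offline, deterministic test suite.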