OTP Fraud Management — Multi-Document RAG

A multi-document RAG system that helps a fraud analyst investigate OTP/SMS fraud across a heterogeneous corpus — policies, SOPs, threat-intel briefs, regulations (Markdown + PDF), historical investigation cases (CSV), and indicators of compromise (JSON). It uses hybrid retrieval (dense + BM25, fused with Reciprocal Rank Fusion) and answers with cross-document citations.

The fraud domain is the same A2P-bypass problem seen in Project 1, viewed from the investigation side: OTT grey-routing, SIM-box farms, OTP interception (SIM-swap), and smishing.

Part 2 of 3 in the Telecom RAG series. See otp-signature-analytics (real-time RAG) and mno-revenue-assurance (agentic RAG).

Why hybrid retrieval

Fraud queries are full of exact tokens: sender IDs (AD-SWIGGY), IMEI prefixes, vendor names (RouteX). Pure dense (semantic) retrieval can miss exact-match keywords; pure BM25 misses paraphrases. We run both and fuse the ranked lists with RRF: each chunk scores Σ 1/(K + rank), summed over the retrievers that returned it, where rank is the chunk's position in that retriever's list. RRF uses only ranks, so it needs no shared score scale, and it usually outperforms either retriever alone.
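To make the exact-token point concrete, here is a minimal Okapi BM25 sketch. It is an illustration only, not the code in bm25.py: the class name, the tokenizer, the sample sentences, the k1 and b values, and the Lucene-style IDF variant are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep hyphenated identifiers such as "ad-swiggy" as a single token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


class BM25:
    """Tiny Okapi BM25 index over a list of strings (hypothetical helper)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lens) / len(docs)
        n = len(docs)
        df = Counter(t for tf in self.tfs for t in tf)
        # IDF variant that stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], self.lens[i]
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            norm = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        # Best-first document indices; ties fall back to index order.
        return sorted(range(len(self.tfs)), key=lambda i: (-self.score(query, i), i))


docs = [
    "Sender ID AD-SWIGGY was spoofed in the smishing campaign.",
    "Food delivery brand impersonation through text messages.",
    "SIM-box farms rotate IMEI prefixes to avoid detection.",
]
index = BM25(docs)
print(index.rank("AD-SWIGGY")[0])  # 0: only the first document contains the token
```

The second sentence is a paraphrase of the same incident but shares no token with the query, so BM25 scores it zero; that gap is what the dense retriever is meant to cover.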

  corpus (md · pdf · csv · json)
            │
   ┌────────┴─────────┐
   │ loaders (per type)│  -> common chunk shape {id,text,source,section,doc_type}
   └────────┬─────────┘
            ├───────────────► DuckDB catalog (provenance / metadata)
            ▼
   embed ► numpy vector store        BM25 index (pure-python, over same chunks)
            │   dense ranks                 │  sparse ranks
            └──────────────┬────────────────┘
                           ▼
                Reciprocal Rank Fusion (RRF)
                           ▼
              fused top-k  ►  LLM (Ollama) / extractive fallback
                           ▼
                cited multi-document answer
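
The fusion step in the diagram can be sketched in a few lines. This is a minimal illustration rather than the code in hybrid_retriever.py: the function name rrf_fuse, the chunk ids, 1-based ranks, and k=60 (a commonly used value) are assumptions; the project keeps its own constant as RRF_K in config.py.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse best-first lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains,
    with rank counted from 1 (an assumption of this sketch).
    """
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


dense_ranks = ["sop-03", "policy-01", "case-17"]   # hypothetical chunk ids
sparse_ranks = ["ioc-09", "sop-03", "case-17"]
fused = rrf_fuse([dense_ranks, sparse_ranks])
print([chunk_id for chunk_id, _ in fused])
# ['sop-03', 'case-17', 'ioc-09', 'policy-01']
```

Chunks that appear in both lists rise to the top even though neither retriever's raw scores were compared, which is why no score normalisation is needed.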

Quick start

cd otp-fraud-management
make setup                 # .venv + deps (incl. pypdf, reportlab for real PDFs)
make pipeline              # generate corpus -> ingest (embed + DuckDB catalog)
make catalog               # chunks/documents per doc_type
make search Q="SIM box IMEI grey route vendor RouteX"   # inspect hybrid fusion
make ask Q="How do we investigate suspected A2P bypass and what did the Swiggy incident cost?"
make ui                    # Streamlit: catalog + retrieval inspector + chat
make test                  # 7 tests, offline & deterministic

Runs with no API keys and no Ollama: answers fall back to extractive mode and embeddings fall back to hashing. Optional: make setup-embeddings (semantic embeddings) and ./setup_ollama.sh (fluent answers).
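
For readers unfamiliar with hashing embeddings, the idea can be sketched as follows. This is an assumption-laden illustration, not the code in embeddings.py: the function names, the 256-dimension size, the MD5-based bucketing, and the use of plain lists instead of numpy are all choices made for this example.

```python
import hashlib
import math
import re


def hash_embed(text, dim=256):
    """Map text to a fixed-size, L2-normalised token-count vector (hashing trick)."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib gives the same bucket on every run, unlike the built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    # Both inputs are unit length, so the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))


query = hash_embed("SIM box farm")
related = hash_embed("SIM box farm detected")
unrelated = hash_embed("quarterly revenue report")
print(cosine(query, related) > cosine(query, unrelated))
```

Such vectors capture token overlap only, not meaning, which is why the optional semantic embeddings improve paraphrase matching.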

Layout

fraud_rag/
  config.py
  data/generate.py            synthetic MD/PDF/CSV/JSON fraud corpus
  pipeline/
    loaders.py                per-format loaders -> common chunk shape
    ingest.py                 load -> embed -> vector store + DuckDB catalog
  rag/
    embeddings.py             sentence-transformers OR hashing fallback
    vector_store.py           persistent numpy cosine store
    bm25.py                   pure-python BM25 (Okapi)
    hybrid_retriever.py       dense + sparse, RRF fusion
    chunking.py               heading-aware chunker
    llm.py / fallback.py      Ollama with deterministic offline fallback
    answerer.py               hybrid retrieve -> cited answer
  app/cli.py, app/streamlit_app.py
tests/

Data engineering notes

Heterogeneous ingestion: one dispatcher (load_path) routes each file to the right loader; everything converges on a single chunk schema.
Provenance catalog: DuckDB records every chunk's source, doc_type, and section, which is useful for audits and metadata-filtered retrieval.
Incremental ingestion: re-ingesting embeds only chunks whose id (a content hash) is new.
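
The notes above can be combined into a rough sketch: a dispatcher keyed on file extension, a content-hash chunk id, and an ingest step that skips ids it has already seen. Only the name load_path and the chunk fields come from this README; make_chunk, load_lines, DOC_TYPES, ingest, the SHA-256 choice, and the in-memory set standing in for the vector store are hypothetical.

```python
import hashlib
from pathlib import Path


def make_chunk(text, source, section, doc_type):
    """Build the common chunk shape; the id is a hash of the chunk text."""
    chunk_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return {"id": chunk_id, "text": text, "source": source,
            "section": section, "doc_type": doc_type}


def load_lines(path, doc_type):
    """Placeholder loader: one chunk per non-empty line of a text file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [make_chunk(line.strip(), Path(path).name, f"line-{n}", doc_type)
            for n, line in enumerate(lines, start=1) if line.strip()]


# Hypothetical extension -> doc_type table; real loaders would parse each format.
DOC_TYPES = {".md": "policy", ".csv": "case", ".json": "ioc"}


def load_path(path):
    """Route a file to a loader based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in DOC_TYPES:
        raise ValueError(f"no loader for {suffix!r}")
    return load_lines(path, DOC_TYPES[suffix])


def ingest(chunks, seen_ids, embed):
    """Embed only chunks whose id is not in seen_ids; return how many were embedded."""
    embedded = 0
    for chunk in chunks:
        if chunk["id"] in seen_ids:
            continue
        embed(chunk["text"])          # stand-in for embed + store
        seen_ids.add(chunk["id"])
        embedded += 1
    return embedded


seen = set()
chunks = [make_chunk("Block sender AD-SWIGGY on grey routes.",
                     "sop.md", "Containment", "sop")]
first = ingest(chunks, seen, embed=len)
second = ingest(chunks, seen, embed=len)
print(first, second)  # 1 0
```

Because the id depends only on the text, an unchanged chunk keeps its id across runs and is skipped on re-ingest, while any edit produces a new id and triggers a fresh embedding.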

Configuration (env vars)

Same as Project 1: TELCO_RAG_FORCE_FALLBACK, TELCO_OLLAMA_MODEL, OLLAMA_HOST, TELCO_EMBED_MODEL. RRF_K (fusion constant) lives in config.py.
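
As a rough guide to how these settings might be read, here is a minimal sketch. It is not the project's config.py: the function name load_settings, the truthy-string parsing of TELCO_RAG_FORCE_FALLBACK, and RRF_K = 60 are assumptions. The OLLAMA_HOST default shown is Ollama's standard local address; the model names are left unset rather than guessed.

```python
import os


def load_settings(env=None):
    """Read the env vars named in this README into a dict (illustrative only)."""
    env = os.environ if env is None else env
    flag = env.get("TELCO_RAG_FORCE_FALLBACK", "").strip().lower()
    return {
        # Assumed convention: "1", "true" or "yes" forces the offline fallbacks.
        "force_fallback": flag in {"1", "true", "yes"},
        "ollama_model": env.get("TELCO_OLLAMA_MODEL"),   # None when unset
        "ollama_host": env.get("OLLAMA_HOST", "http://localhost:11434"),
        "embed_model": env.get("TELCO_EMBED_MODEL"),     # None when unset
    }


RRF_K = 60  # assumed value; a module-level constant, not an env var

settings = load_settings({"TELCO_RAG_FORCE_FALLBACK": "1"})
print(settings["force_fallback"], settings["ollama_host"])
# True http://localhost:11434
```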
