srini696/otp-fraud-management

OTP Fraud Management — Multi-Document RAG

A multi-document RAG system that helps a fraud analyst investigate OTP/SMS fraud across a heterogeneous corpus: policies, SOPs, threat-intel briefs, and regulations (Markdown + PDF), historical investigation cases (CSV), and indicators of compromise (JSON). It uses hybrid retrieval (dense + BM25, fused with Reciprocal Rank Fusion) and answers with cross-document citations.

The fraud domain is the same A2P-bypass problem seen in Project 1, viewed from the investigation side: OTT grey-routing, SIM-box farms, OTP interception (SIM-swap), and smishing.

Part 2 of 3 in the Telecom RAG series. See otp-signature-analytics (real-time RAG) and mno-revenue-assurance (agentic RAG).


Why hybrid retrieval

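Fraud queries are full of exact tokens: sender IDs (AD-SWIGGY), IMEI prefixes, vendor names (RouteX). Pure dense (semantic) retrieval can miss exact-match keywords; pure BM25 misses paraphrases. We run both and fuse the ranked lists with RRF: each chunk scores Σ 1/(K + rank), summed over the retrievers that returned it, where rank is the chunk's position in that retriever's list. Because RRF uses only ranks, it needs no shared score scale between the dense and sparse retrievers, and the fused list is typically more robust than either retriever alone.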
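
The fusion step can be sketched in a few lines. This is a minimal illustration, not the code in hybrid_retriever.py: the function name rrf_fuse, the default k=60 (a commonly used value; the project's actual RRF_K lives in config.py), 1-based ranks, and the chunk ids are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best-first.
    k: fusion constant (assumed default; the repo reads RRF_K from config.py).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based here (an assumption), so the top hit adds 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical ranked lists from the dense and sparse retrievers.
dense = ["sop-03", "case-17", "policy-01"]
sparse = ["case-17", "ioc-42", "sop-03"]

fused = rrf_fuse([dense, sparse])
for chunk_id, score in fused:
    print(f"{chunk_id}\t{score:.5f}")
```

A chunk that both retrievers rank fairly high (case-17 here) ends up ahead of one that only a single retriever ranks first, which is the behaviour hybrid retrieval relies on.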

  corpus (md · pdf · csv · json)
            │
   ┌────────┴─────────┐
   │ loaders (per type)│  -> common chunk shape {id,text,source,section,doc_type}
   └────────┬─────────┘
            ├───────────────► DuckDB catalog (provenance / metadata)
            ▼
   embed ► numpy vector store        BM25 index (pure-python, over same chunks)
            │   dense ranks                 │  sparse ranks
            └──────────────┬────────────────┘
                           ▼
                Reciprocal Rank Fusion (RRF)
                           ▼
              fused top-k  ►  LLM (Ollama) / extractive fallback
                           ▼
                cited multi-document answer
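
The sparse branch of the diagram can be illustrated with a small pure-Python Okapi BM25 ranker. This is a sketch under assumptions, not the contents of bm25.py: the tokenizer (which keeps hyphenated tokens such as AD-SWIGGY whole), the k1 and b defaults, the IDF variant, and the sample documents are all chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and keep hyphenated tokens (e.g. sender IDs) intact.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


class BM25:
    """Minimal Okapi BM25 over a non-empty list of text chunks."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lengths) / len(docs)
        n = len(docs)
        # Document frequency: number of chunks containing each term.
        df = Counter(term for tf in self.tfs for term in tf)
        # Non-negative IDF variant: log(1 + (n - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Term-frequency saturation with document-length normalisation.
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        """Return indices of matching chunks, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.tfs))]
        return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]


# Hypothetical chunks for illustration only.
chunks = [
    "Sender ID AD-SWIGGY was spoofed in a smishing campaign.",
    "SIM-box farms terminate A2P traffic over grey routes.",
    "Vendor RouteX was linked to grey routing of OTP traffic.",
]
index = BM25(chunks)
sender_hits = index.rank("AD-SWIGGY smishing")
vendor_hits = index.rank("grey RouteX")
print(sender_hits, vendor_hits)
```

The list returned by rank is already ordered best-first, so it can be passed to the fusion step without exposing the raw BM25 scores.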

Quick start

cd otp-fraud-management
make setup                 # .venv + deps (incl. pypdf, reportlab for real PDFs)
make pipeline              # generate corpus -> ingest (embed + DuckDB catalog)
make catalog               # chunks/documents per doc_type
make search Q="SIM box IMEI grey route vendor RouteX"   # inspect hybrid fusion
make ask Q="How do we investigate suspected A2P bypass and what did the Swiggy incident cost?"
make ui                    # Streamlit: catalog + retrieval inspector + chat
make test                  # 7 tests, offline & deterministic

Runs with no API keys and no Ollama: answers fall back to extractive mode and embeddings fall back to a hashing embedder. Optional: make setup-embeddings (semantic embeddings) and ./setup_ollama.sh (fluent answers).


Layout

fraud_rag/
  config.py
  data/generate.py            synthetic MD/PDF/CSV/JSON fraud corpus
  pipeline/
    loaders.py                per-format loaders -> common chunk shape
    ingest.py                 load -> embed -> vector store + DuckDB catalog
  rag/
    embeddings.py             sentence-transformers OR hashing fallback
    vector_store.py           persistent numpy cosine store
    bm25.py                   pure-python BM25 (Okapi)
    hybrid_retriever.py       dense + sparse, RRF fusion
    chunking.py               heading-aware chunker
    llm.py / fallback.py      Ollama with deterministic offline fallback
    answerer.py               hybrid retrieve -> cited answer
  app/cli.py, app/streamlit_app.py
tests/

Data engineering notes

  • Heterogeneous ingestion: one dispatcher (load_path) routes each file to the right loader; everything converges on a single chunk schema.
  • Provenance catalog: DuckDB records every chunk's source, doc_type, and section, which is useful for audits and metadata-filtered retrieval.
  • Incremental: re-ingestion embeds only chunks whose id (a content hash) is new.
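
The three notes above can be sketched together as follows. This is a hedged illustration, not the code in loaders.py or ingest.py: it works on in-memory text rather than file paths, and the names load_document, make_chunk, and new_chunks, the per-loader doc_type values, and the fields that feed the content hash are all assumptions.

```python
import csv
import hashlib
import io
import json
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class Chunk:
    # Common chunk shape shared by every loader.
    id: str
    text: str
    source: str
    section: str
    doc_type: str


def make_chunk(text, source, section, doc_type):
    # Assumption: the id hashes source, section, and text together.
    key = "\x1f".join((source, section, text)).encode("utf-8")
    return Chunk(hashlib.sha256(key).hexdigest()[:16], text, source, section, doc_type)


def load_csv(name, raw):
    # One chunk per row, e.g. one historical investigation case.
    rows = csv.DictReader(io.StringIO(raw))
    return [
        make_chunk("; ".join(f"{k}={v}" for k, v in row.items()), name, f"row {i}", "case")
        for i, row in enumerate(rows, start=1)
    ]


def load_json(name, raw):
    # Assumption: the file is a JSON array of flat indicator objects.
    return [
        make_chunk(json.dumps(item, sort_keys=True), name, f"item {i}", "ioc")
        for i, item in enumerate(json.loads(raw), start=1)
    ]


def load_markdown(name, raw):
    # Simplified: one chunk per file (the repo uses a heading-aware chunker).
    return [make_chunk(raw.strip(), name, "body", "policy")]


LOADERS = {".csv": load_csv, ".json": load_json, ".md": load_markdown}


def load_document(name, raw):
    """Route a document to the loader registered for its file extension."""
    suffix = PurePath(name).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"no loader registered for {name!r}")
    return LOADERS[suffix](name, raw)


def new_chunks(chunks, known_ids):
    """Keep only chunks whose id has not been ingested, i.e. those still to embed."""
    return [c for c in chunks if c.id not in known_ids]


# Hypothetical data: a second ingest run sees one extra case row.
cases_v1 = "case_id,vendor\nC-001,RouteX\n"
cases_v2 = cases_v1 + "C-002,RouteX\n"

first = load_document("cases.csv", cases_v1)
known = {c.id for c in first}
to_embed = new_chunks(load_document("cases.csv", cases_v2), known)
print([c.section for c in to_embed])
```

Because the id is derived from content, an unchanged row hashes to the same id on every run and is skipped, while an edited or added row gets a new id and is embedded.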

Configuration (env vars)

Same as Project 1: TELCO_RAG_FORCE_FALLBACK, TELCO_OLLAMA_MODEL, OLLAMA_HOST, TELCO_EMBED_MODEL. RRF_K (fusion constant) lives in config.py.

