This project is a Streamlit-based PDF Question Answering app built using Retrieval-Augmented Generation (RAG).
You can upload a PDF, ask questions about its content, and get short, accurate answers generated by Gemini, along with highlighted source chunks from the PDF.
- 📄 Upload any PDF file
- 🔍 Semantic search using FAISS
- 🧠 Text embeddings via HuggingFace (MiniLM)
- 🤖 Answer generation using Google Gemini
- 🖍️ Highlighted source chunks with page numbers
- ⚡ Cached vector store for fast performance
- ❌ Responds with "I don't know" if the answer is not in the PDF
- 🧾 Stores chat history
- Streamlit – UI
- LangChain – Orchestration & document processing
- FAISS – Vector database
- HuggingFace Embeddings – Semantic text embeddings
- Google Gemini – Answer generation
- PyPDFLoader – PDF extraction
├── main3.py # Streamlit app
├── .env # API keys
├── requirements.txt # Dependencies
└── README.md
pip install -r requirements.txt
- Upload a PDF – Use the file uploader widget
- PDF Processing – The document is split into overlapping chunks (800 characters each, with a 150-character overlap)
- Embeddings – Chunks are converted into embeddings using HuggingFace MiniLM
- Vector Store – Embeddings are stored in FAISS for fast retrieval
- Query Processing – The user's question is embedded and used to retrieve the 3 most relevant chunks
- Answer Generation – Gemini generates an answer strictly from retrieved context
- Source Display – Original chunks are shown with page numbers and highlighting
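
The steps above can be sketched in plain Python. This is only an illustration of the flow, not the code in `main3.py`: the helper names (`split_into_chunks`, `toy_embed`, `top_k`, `build_prompt`) are hypothetical, the splitter is a simple character window rather than whatever LangChain splitter the app uses, and `toy_embed` is a bag-of-words stand-in for the MiniLM embeddings, so it matches words rather than meaning.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_size=800, overlap=150):
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 650 chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def toy_embed(text):
    """Assumption: stand-in for MiniLM. Word counts only, not semantic."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=3):
    """Return the k chunks most similar to the question (FAISS does this at scale)."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below. "
        'If the answer is not in the context, reply "I don\'t know".\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In this sketch, `build_prompt(question, top_k(question, split_into_chunks(pdf_text)))` would produce the text sent to the answer model. The overlap means the last 150 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.
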
Try asking these questions after uploading a PDF:
- "What is the main topic discussed in this PDF?"
- "Summarize the key points"
- "What does the document say about [specific topic]?"
- "Who are the authors?"
- ✅ Answers are strictly grounded in PDF context
- ✅ Designed to avoid hallucinations – If the retrieved context does not contain the answer, the app is instructed to respond with "I don't know"
- ❌ Does not generate information outside the PDF content
- ✅ Works best with text-based PDFs
- ❌ May struggle with scanned or image-only PDFs (OCR not included)
- Vector stores are cached for faster repeated queries
- First query may take a few seconds while embeddings are computed
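
One way to picture the caching behaviour is a lookup keyed by a hash of the uploaded file, so the expensive indexing step runs once per distinct PDF. This is a minimal sketch under that assumption; `get_or_build_index` and the `build_index` callback are hypothetical names, and the app itself may rely on a different mechanism such as Streamlit's `st.cache_resource`.

```python
import hashlib

# Assumption: an in-memory cache that lives for the duration of the process.
_index_cache = {}


def get_or_build_index(pdf_bytes, build_index):
    """Return the cached index for these bytes, building it only on a cache miss."""
    key = hashlib.sha256(pdf_bytes).hexdigest()  # same file content -> same key
    if key not in _index_cache:
        _index_cache[key] = build_index(pdf_bytes)  # slow path: chunk, embed, index
    return _index_cache[key]
```

With this shape, the first question about a PDF pays the embedding cost, and later questions about the same file reuse the stored index.
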