RAGChat

RAG stands for Retrieval-Augmented Generation. Here's a breakdown of what it is and how it works:

What is RAG?

RAG is a technique for improving the responses of large language models (LLMs) such as GPT-3. It combines two components:

Generative Language Models: These models are fantastic at producing human-quality text, understanding context, and engaging in conversations. However, they can make things up (hallucinate) or rely on outdated information from their training data.
Retrieval Systems: These systems can search large databases of text or knowledge bases to find relevant information based on a user's query.

RAG bridges the gap by having the retrieval system find relevant information, which is then provided to the language model along with the original query. Grounding the model in that information helps it generate more accurate and more current responses.

How Does RAG Work?

User Query: A user asks a question or gives a prompt.
Retrieval: The retrieval system searches a knowledge base (like Wikipedia or a company's internal documents) to locate the most relevant pieces of information.
Generation: The language model takes both the original query and the retrieved information as input. It then processes this combined input to create a more informed and accurate response.
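The three steps above can be sketched in a few lines of Python. This is a minimal illustration only, not the code used in RAGChat: retrieval here is a toy word-overlap ranking rather than an embeddings search, and the names `retrieve`, `build_prompt`, `answer`, and the `generate` callback are hypothetical.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy retrieval: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    # Drop documents with no overlap, then sort best first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]


def build_prompt(query: str, context: List[str]) -> str:
    """Combine the retrieved passages and the original query into one prompt."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer(query: str, documents: List[str], generate: Callable[[str], str]) -> str:
    """User query -> retrieval -> generation. `generate` stands in for the LLM call."""
    context = retrieve(query, documents)
    return generate(build_prompt(query, context))
```

In a real setup, `generate` would call the language model and `retrieve` would query a vector database; passing the model in as a callback keeps the flow easy to test with a stub.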

Why use RAG in the first place?

Improved Accuracy: Access to relevant knowledge helps the language model avoid making incorrect statements or relying on outdated information.
Verifiable Answers: RAG can often provide references or citations to support the generated response, allowing users to fact-check the information.
Handling "I don't know" Situations: A RAG model is better at recognizing when the knowledge base doesn't contain a good answer to the question, leading it to respond honestly rather than making something up.
Up-To-Date Information: Since it pulls from a knowledge base, a RAG model can access the latest information even if its knowledge from training is old.

How do I use this app?

Installing Ollama

Install Ollama by following the instructions given here.
Start Ollama.
Select a query model supported by Ollama from here, e.g., llama2 or mistral.
Download the query model you want to run with the following command. Once pulled, the model is available to the running Ollama server.

ollama pull <MODEL-NAME>
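To check that the pulled model responds, you can send it a prompt over Ollama's local HTTP API. The sketch below assumes Ollama is listening on its default address, http://localhost:11434, and uses the `/api/generate` endpoint with streaming disabled. The helper names `build_generate_request` and `ask_ollama` are hypothetical and are not part of RAGChat.

```python
import json
import urllib.request

# Assumption: Ollama is running on its default host and port.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the server replies with one JSON object
    # whose "response" field holds the full completion.
    return body["response"]


# Example (requires a running Ollama server and a pulled model):
# print(ask_ollama("mistral", "Reply with the single word: ready"))
```

If the call fails with a connection error, Ollama is most likely not running or is listening on a different address.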

Starting the backend:

Install a Python environment manager like Miniconda by following the steps given in the link.
The query model selected earlier from Ollama will be needed here.
Select an embeddings model as well. Any open-source embeddings model from HuggingFace can be used, such as hkunlp/instructor-large.
Create a .env file in the backend folder.
An example .env file is provided.
You can select any model that you want to run, not just the ones given here.
Go to the backend folder in your terminal.

cd backend

Create a new python environment.

conda create -n rag -y python=3.10

Activate the new environment.

conda activate rag

Install all the dependencies.

python -m pip install -r requirements.txt
python -m pip install "unstructured[all-docs]"

Start the backend server.

python app.py

Starting the frontend:

Ensure that node >= 18 and npm are installed on your system.
If they are not, we recommend installing them through a version manager rather than directly: nvm for Linux and macOS, or nvm-windows for Windows.
Follow the installation guides provided in the links.
Then install node and npm with the following commands.

nvm install 21
nvm use 21

Now go to the frontend folder in the terminal.

cd frontend

Install the dependencies.

npm install

Now run the frontend.

npm run dev

Now go to http://localhost:3000 in your browser.
You will now be able to use RAG.

Using the app

The following video demo will show how to use RAGChat.

rag-demo.mp4

How is RAG implemented here?

On the server side, we use LangChain and LlamaIndex.
Tools provided by LlamaIndex are used for loading the uploaded documents.
These documents are then passed on to the embedding model.
We use the HuggingFaceEmbeddings implementation provided by LangChain, which lets us use any embeddings model available on HuggingFace.
The generated embeddings are stored locally in a Chroma vector database.
To use an open-source query model locally, we use Ollama.
We select a model from the list of models Ollama provides.
When the model is queried, the query is converted to embeddings and the embeddings are then compared to the ones stored in our vector database.
If similar embeddings are found, the corresponding document chunks are passed on to the model along with our original query.
The model uses this data to generate a response.
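The comparison step described above can be illustrated with a small in-memory stand-in for the vector database. This is a sketch under stated assumptions, not the app's code: it uses cosine similarity with a cutoff, whereas the metric and threshold that RAGChat's Chroma database actually uses are not stated here. The class `TinyVectorStore`, its methods, and the `min_score` parameter are hypothetical, and the vectors would normally come from the embeddings model rather than being written by hand.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Illustrative in-memory store of (embedding, text) pairs."""

    def __init__(self) -> None:
        self._items: List[Tuple[Vector, str]] = []

    def add(self, embedding: Vector, text: str) -> None:
        """Store a document chunk together with its embedding."""
        self._items.append((embedding, text))

    def search(
        self, query_embedding: Vector, top_k: int = 3, min_score: float = 0.5
    ) -> List[Tuple[float, str]]:
        """Return up to top_k (score, text) pairs scoring at least min_score, best first."""
        scored = [
            (cosine_similarity(query_embedding, emb), text)
            for emb, text in self._items
        ]
        hits = [pair for pair in scored if pair[0] >= min_score]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return hits[:top_k]
```

An empty result from `search` corresponds to the case where no similar embeddings are found, so nothing extra is passed to the model with the query.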
