GitHub - OpenStruct/query_genie

Understand the Components of a RAG System

A RAG system combines:

Retrieval: Fetching relevant data from a database or knowledge base.
Generation: Using a language model to generate a natural language response based on the retrieved data.

The goal of this project is to:

Parse the user's query.
Translate it into a database query (e.g., SQL).
Execute the query to retrieve the data.
Generate a response based on the retrieved data.

System Design

a. Natural Language Understanding (NLU)

Use a language model (TBD) to understand the user's query.
Extract key information from the query, such as:
- Metric: "How many people"
- Condition: "spent more than $200"
- Timeframe: "past 10 days"
- etc.
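
The extracted pieces could be held in a small structure like the one below. This is only a sketch: the `ParsedQuery` class and its field names are hypothetical (they do not come from the repository), and the example values are filled in by hand rather than produced by a language model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedQuery:
    """Hypothetical container for the information extracted from a user query."""
    metric: str                           # what to compute, e.g. a distinct user count
    condition: Optional[str] = None       # filter expression, e.g. "amount > 200"
    timeframe_days: Optional[int] = None  # look-back window in days


# Hand-written result for:
# "How many people spent more than $200 in the past 10 days"
parsed = ParsedQuery(
    metric="count_distinct_users",
    condition="amount > 200",
    timeframe_days=10,
)
print(parsed)
```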

b. Query Translation

Convert the parsed query into a database query (e.g., SQL).
- Example: "How many people spent more than $200 in the past 10 days?" translates to the following SQL:

SELECT COUNT(DISTINCT user_id)
FROM transactions
WHERE amount > 200
  AND transaction_date >= NOW() - INTERVAL '10 days';
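
One way to produce a query like this from extracted values is to keep the SQL text fixed and pass the values as parameters. The sketch below assumes a PostgreSQL driver that uses `%s` placeholders (as psycopg2 does); the helper name `build_spend_count_query` is hypothetical, and it only builds the query without executing it.

```python
def build_spend_count_query(min_amount, days):
    """Return (sql, params) for the example question. Hypothetical helper."""
    sql = (
        "SELECT COUNT(DISTINCT user_id) "
        "FROM transactions "
        "WHERE amount > %s "
        # integer * INTERVAL '1 day' lets the number of days be a parameter
        "AND transaction_date >= NOW() - %s * INTERVAL '1 day'"
    )
    return sql, (min_amount, days)


sql, params = build_spend_count_query(200, 10)
print(sql)
print(params)
```

Keeping user-derived values out of the SQL string is also relevant to the security concern listed under the challenges.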

c. Database Interaction

Connect to your database (e.g., PostgreSQL, MySQL, etc.).
Execute the generated query and retrieve the results.
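
The connect-and-execute step can be illustrated with a small runnable sketch. It uses an in-memory SQLite database from the Python standard library purely as a stand-in for PostgreSQL or MySQL, so the date arithmetic is done in Python instead of with `INTERVAL`. The sample rows and the `run_count_query` helper are made up for illustration.

```python
import sqlite3
from datetime import date, timedelta

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (user_id INTEGER, amount REAL, transaction_date TEXT)"
)

today = date.today()
sample_rows = [
    (1, 250.0, (today - timedelta(days=2)).isoformat()),   # matches
    (1, 300.0, (today - timedelta(days=3)).isoformat()),   # same user, counted once
    (2, 500.0, (today - timedelta(days=30)).isoformat()),  # too old
    (3, 50.0, (today - timedelta(days=1)).isoformat()),    # amount too small
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", sample_rows)


def run_count_query(conn, min_amount, days):
    """Count distinct users who spent more than min_amount in the last `days` days."""
    cutoff = (date.today() - timedelta(days=days)).isoformat()
    cur = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM transactions "
        "WHERE amount > ? AND transaction_date >= ?",
        (min_amount, cutoff),
    )
    return cur.fetchone()[0]


result = run_count_query(conn, 200, 10)
print(result)
```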

d. Response Generation

The response format is still to be decided. It could be a simple count, a visualization, or a natural language sentence.

5. Challenges and Improvements

Query Translation Accuracy
Error Handling
Performance
Security

To help the model understand the data in the database, we will provide it with the schema information: table names, column names, data types, relationships, and so on. Example:

{
  "tables": [
    {
      "name": "transactions",
      "columns": [
        {
          "name": "user_id",
          "type": "integer",
          "foreign_key": "users.id"
        },
        {
          "name": "amount",
          "type": "float"
        },
        {
          "name": "transaction_date",
          "type": "date"
        }
      ]
    },
    {
      "name": "users",
      "columns": [
        {
          "name": "id",
          "type": "integer",
          "primary_key": true
        },
        {
          "name": "name",
          "type": "text"
        }
      ]
    }
  ]
}
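
Before it is passed to the model, a schema in this JSON shape could be flattened into a more compact text form. The following is a sketch under that assumption; `schema_to_text` and its one-line-per-table output format are hypothetical, not something the repository defines.

```python
def schema_to_text(schema):
    """Render a schema dict (same shape as the JSON above) as one line per table."""
    lines = []
    for table in schema["tables"]:
        cols = []
        for col in table["columns"]:
            text = f'{col["name"]} {col["type"]}'
            if col.get("primary_key"):
                text += " PK"
            if "foreign_key" in col:
                text += f' -> {col["foreign_key"]}'
            cols.append(text)
        lines.append(f'{table["name"]}({", ".join(cols)})')
    return "\n".join(lines)


schema = {
    "tables": [
        {
            "name": "transactions",
            "columns": [
                {"name": "user_id", "type": "integer", "foreign_key": "users.id"},
                {"name": "amount", "type": "float"},
                {"name": "transaction_date", "type": "date"},
            ],
        },
        {
            "name": "users",
            "columns": [
                {"name": "id", "type": "integer", "primary_key": True},
                {"name": "name", "type": "text"},
            ],
        },
    ]
}

schema_text = schema_to_text(schema)
print(schema_text)
```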

This process will be automated whenever a user tries to interact with the system. The system will automatically fetch the schema information from the database and feed it to the language model.
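
As an illustration of fetching the schema automatically, the sketch below reads table and column metadata from a SQLite database using its `PRAGMA table_info` and `PRAGMA foreign_key_list` statements. SQLite is an assumption made to keep the example self-contained; PostgreSQL or MySQL would need their own catalog queries. The `fetch_schema` helper is hypothetical.

```python
import sqlite3


def fetch_schema(conn):
    """Build a schema dict (same shape as the JSON above) from a SQLite connection."""
    schema = {"tables": []}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        fks = {
            row[3]: f"{row[2]}.{row[4]}"
            for row in conn.execute(f'PRAGMA foreign_key_list("{table_name}")')
        }
        columns = []
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _cid, name, col_type, _notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table_name}")'
        ):
            column = {"name": name, "type": col_type.lower()}
            if pk:
                column["primary_key"] = True
            if name in fks:
                column["foreign_key"] = fks[name]
            columns.append(column)
        schema["tables"].append({"name": table_name, "columns": columns})
    return schema


# Demo database with the two example tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE transactions (
    user_id INTEGER REFERENCES users(id),
    amount FLOAT,
    transaction_date DATE
);
""")
schema = fetch_schema(conn)
print(schema)
```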

Problems to resolve

Passing the full schema information to the model exceeds the model's maximum sequence length (2504 tokens against a limit of 512), so we need to find a better way of providing it. The tokenizer reports:

Token indices sequence length is longer than the specified maximum sequence length for this model (2504 > 512). Running this sequence through the model will result in indexing errors
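
One possible mitigation, offered here only as a sketch and not as the approach the project has chosen, is to send the model just the tables that appear relevant to the question and to stop once a token budget is reached. The `select_tables` helper, the keyword-overlap scoring, and `rough_token_count` are all hypothetical; the count is a crude proxy, and the real model tokenizer will give different numbers.

```python
import re


def rough_token_count(text):
    """Crude proxy for token count: words plus punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def select_tables(question, tables, max_tokens):
    """Pick table descriptions that share words with the question, within a budget.

    tables maps a table name to a one-line schema description.
    """
    question_words = set(re.findall(r"[a-z]+", question.lower()))

    def score(item):
        name, desc = item
        terms = set(re.findall(r"[a-z]+", (name + " " + desc).lower()))
        return len(question_words & terms)

    chosen, used = [], 0
    for name, desc in sorted(tables.items(), key=score, reverse=True):
        cost = rough_token_count(desc)
        # Skip tables with no overlap and tables that would exceed the budget.
        if score((name, desc)) == 0 or used + cost > max_tokens:
            continue
        chosen.append(name)
        used += cost
    return chosen


tables = {
    "transactions": "transactions(user_id integer, amount float, transaction_date date)",
    "users": "users(id integer, name text)",
    "products": "products(id integer, title text)",
}
chosen = select_tables(
    "What is the total amount of transactions in the past 10 days?",
    tables,
    max_tokens=50,
)
print(chosen)
```

Simple keyword overlap will miss synonyms (for example "people" versus `users`), so a real implementation would likely need a better relevance measure.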

