
rag-chatbot

RAG (Retrieval-augmented generation) ChatBot that provides answers based on contextual information extracted from a collection of Markdown files.


RAG (Retrieval-augmented generation) ChatBot


Check out the todo list here to see the next steps and improvements we want to implement in this project.

Important

Disclaimer: The code has been tested on:

  • Ubuntu 22.04.2 LTS running on a Lenovo Legion 5 Pro with a 12th Gen Intel® Core™ i7-12700H (20 logical cores) and an NVIDIA GeForce RTX 3060.
  • macOS Sonoma 14.3.1 running on a MacBook Pro M1 (2020).

If you are using another operating system or different hardware and you can't load the models, please take a look at the official llama-cpp-python GitHub issues.

Warning

  • llama-cpp-python doesn't use the GPU on M1 if you are running an x86 version of Python. More info here.
  • It's important to note that the large language model sometimes generates hallucinations or false information.

Note

To decide which hardware to use or buy to host your local LLMs, we recommend reading these great benchmarks:

Decision model:

  • Memory capacity is the main limit. Check if your model fits in memory (with quantization): https://www.canirun.ai/.
  • Memory bandwidth mostly determines speed (tokens/sec). Check if the bandwidth gives you acceptable speed.
  • If not, upgrade hardware or optimize the model.

For instance, it seems better to buy a second-hand or refurbished Mac Studio M2 Max with at least 64GB of RAM, since it has 400GB/s of memory bandwidth, compared to the M4 Pro, which has just 273GB/s.
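To make the bandwidth rule concrete, here is a rough, bandwidth-bound estimate: every generated token has to stream all model weights through memory once, so bandwidth divided by model size gives a ceiling on tokens per second. The figures below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope, bandwidth-bound estimate of decode speed.
# Heuristic: each generated token reads every weight from memory once.
model_params = 7e9        # assumed 7B-parameter model
bytes_per_param = 0.5     # assumed ~4-bit quantization
bandwidth_gb_s = 400      # e.g. Mac Studio M2 Max memory bandwidth

model_gb = model_params * bytes_per_param / 1e9   # ~3.5 GB of weights
tokens_per_s_ceiling = bandwidth_gb_s / model_gb  # ~114 tokens/s upper bound
print(f"{model_gb:.1f} GB of weights, ~{tokens_per_s_ceiling:.0f} tokens/s ceiling")
```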


Introduction

This project combines the power of llama.cpp and Chroma to build:

  • a Conversation-aware Chatbot (ChatGPT like experience).
  • a RAG (Retrieval-augmented generation) ChatBot.

The RAG Chatbot takes a collection of Markdown files as input and, when asked a question, provides the corresponding answer based on the context provided by those files.

rag-chatbot-architecture-1.png

Note

We decided to grab and refactor the RecursiveCharacterTextSplitter class from LangChain to effectively chunk Markdown files without adding LangChain as a dependency.

The Memory Builder component of the project loads Markdown pages from the docs folder. It then divides these pages into smaller sections, calculates the embeddings (a numerical representation) of these sections with the Semantic Search models from Sentence Transformers, and saves them in an embedding database called Chroma for later use.
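As a rough illustration, that pipeline boils down to something like the following sketch, using the Sentence Transformers and Chroma Python APIs. The chunk_markdown helper is a hypothetical stand-in for the project's refactored recursive splitter:

```python
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer


def chunk_markdown(text: str, chunk_size: int = 1000, overlap: int = 50) -> list[str]:
    """Hypothetical stand-in for the project's refactored recursive splitter."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


model = SentenceTransformer("all-MiniLM-L6-v2")          # any supported embedding model
client = chromadb.PersistentClient(path="vector_store")  # on-disk Chroma database
collection = client.get_or_create_collection("docs")

for doc in Path("docs").glob("**/*.md"):
    chunks = chunk_markdown(doc.read_text())
    collection.add(
        ids=[f"{doc.stem}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=model.encode(chunks).tolist(),  # one vector per chunk
        metadatas=[{"source": doc.name}] * len(chunks),
    )
```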

When a user asks a question, the RAG ChatBot retrieves the most relevant sections from the embedding database. Since the original question is not always an optimal retrieval query, we first prompt an LLM to rewrite it and then conduct retrieval-augmented reading. The most relevant sections are then used as context to generate the final answer with a local large language model (LLM). Additionally, the chatbot is designed to remember previous interactions: it saves the chat history and considers the relevant context from previous conversations to provide more accurate answers.
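A condensed sketch of that flow, assuming llama-cpp-python for generation and the Chroma query API; the prompt wording and the rag_answer function are illustrative, not the project's actual code:

```python
from llama_cpp import Llama


def rag_answer(llm: Llama, collection, embedder, question: str,
               history: list[str], k: int = 4) -> str:
    # 1. Prompt the LLM to rewrite the question into a standalone retrieval query.
    rewritten = llm(
        f"Rewrite the question as a standalone search query.\nQuestion: {question}\nQuery:",
        max_tokens=64,
    )["choices"][0]["text"].strip()

    # 2. Retrieve the most relevant sections from the Chroma collection.
    hits = collection.query(query_embeddings=[embedder.encode(rewritten).tolist()],
                            n_results=k)
    context = "\n\n".join(hits["documents"][0])

    # 3. Generate the final answer from the retrieved context plus the chat history.
    prompt = ("Chat history:\n" + "\n".join(history)
              + f"\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:")
    return llm(prompt, max_tokens=512)["choices"][0]["text"].strip()
```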

To deal with context overflows, we implemented two approaches, both sketched after this list:

  • Create And Refine the Context: synthesize a response sequentially across all retrieved contents.
    • create-and-refine-the-context.png
  • Hierarchical Summarization of Context: generate an answer for each relevant section independently, and then hierarchically combine the answers.
    • hierarchical-summarization.png
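For illustration, here are minimal sketches of both strategies; the prompt wording and function names are ours, not the project's:

```python
def create_and_refine(llm, question: str, sections: list[str]) -> str:
    """Sequentially refine one answer across all retrieved sections."""
    answer = ""
    for section in sections:
        prompt = (
            f"Question: {question}\nExisting answer: {answer or '(none yet)'}\n"
            f"New context:\n{section}\nRefine the existing answer using the new context:"
        )
        answer = llm(prompt, max_tokens=512)["choices"][0]["text"].strip()
    return answer


def tree_summarize(llm, question: str, sections: list[str], fan_in: int = 2) -> str:
    """Answer per section independently, then merge the answers hierarchically."""
    answers = [
        llm(f"Context:\n{s}\n\nQuestion: {question}\nAnswer:",
            max_tokens=256)["choices"][0]["text"].strip()
        for s in sections
    ]
    while len(answers) > 1:
        merged = []
        for i in range(0, len(answers), fan_in):
            group = "\n---\n".join(answers[i : i + fan_in])
            merged.append(
                llm(f"Combine these partial answers to '{question}':\n{group}\nCombined answer:",
                    max_tokens=256)["choices"][0]["text"].strip()
            )
        answers = merged
    return answers[0]
```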

The Memory Builder builds the vector database in an incremental way, which means that when a document changes, we only update the corresponding chunks in the vector store instead of rebuilding the whole index.

This is achieved through the following (a code sketch follows the list):

  • Document-level metadata tracking: every chunk is tagged with a source doc ID and a version hash. When a doc changes, we regenerate chunks for that doc only, delete the old ones by metadata filter, and insert the new ones; this is far cheaper than rebuilding the whole index.
  • Incremental ingestion pipeline: the pipeline diffs source docs against what's already indexed (using those version hashes). Only changed or new docs get processed, which keeps compute costs reasonable as the corpus grows.
  • Handling deletions: we keep a separate mapping table (doc_id → chunk_ids) in a SQLite db so we can precisely target what to remove without scanning the whole store.
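A minimal sketch of that incremental upsert, assuming Chroma's metadata-filtered delete and a simple SQLite docs table; names are illustrative and the deletion mapping is simplified away:

```python
import hashlib
import sqlite3


def sync_document(conn: sqlite3.Connection, collection, embedder,
                  doc_id: str, text: str) -> None:
    version = hashlib.sha256(text.encode()).hexdigest()
    row = conn.execute("SELECT version FROM docs WHERE doc_id = ?", (doc_id,)).fetchone()
    if row and row[0] == version:
        return  # unchanged document: skip re-chunking and re-embedding entirely

    # Delete only this document's stale chunks via the metadata filter.
    collection.delete(where={"doc_id": doc_id})

    chunks = chunk_markdown(text)  # the same hypothetical splitter as above
    collection.add(
        ids=[f"{doc_id}-{version[:8]}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
        metadatas=[{"doc_id": doc_id, "version": version}] * len(chunks),
    )
    conn.execute("INSERT OR REPLACE INTO docs (doc_id, version) VALUES (?, ?)",
                 (doc_id, version))
    conn.commit()
```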

Important

One thing to watch out for: if you ever swap embedding models, you must rebuild the index from scratch, since the vector spaces won't be compatible. Plan for that early.

Prerequisites

  • Python 3.12+
  • GPU supporting CUDA 12.4+
  • Poetry 2.3.0

For the UI:

  • Node 22.12+
  • Yarn 1.22+

Install Poetry

Install Poetry with pipx by following this link.

You must use the currently adopted version of Poetry, defined here.

If you already have Poetry installed and it is not the right version, you can downgrade (or upgrade) it through:

poetry self update <version>

or with pipx:

pipx install poetry==<version> --force

Bootstrap Environment

To easily install the dependencies and start the services, we created a Makefile.

How to use the make file

Important

Run Setup as your init command (or after Clean).

  • Check: make check
    • Use it to check that which pip3 and which python3 point to the right paths.
  • Setup:
    • Setup with NVIDIA CUDA acceleration: make setup_cuda
      • Creates an environment and installs all dependencies with NVIDIA CUDA acceleration.
    • Setup with Metal GPU acceleration: make setup_metal
      • Creates an environment and installs all dependencies with Metal GPU acceleration (macOS only).
  • Start: make start
    • Starts both the backend and the frontend, ensuring that the backend is running and ready before the frontend launches.
  • Update: make update
    • Updates the environment and installs all updated dependencies.
  • Tidy up the code: make tidy
    • Run Ruff check and format.
  • Clean: make clean
    • Removes the environment and all cached files.
  • Test: make test
    • Runs all tests.
    • Using pytest

Environment

Copy .env.example → .env and fill it in.

Copy frontend/.env.example → frontend/.env and fill it in.

Using the Open-Source LLMs/Embedding Models Locally

We utilize the open-source library llama-cpp-python, a binding for llama.cpp, which allows us to use it within a Python environment. llama.cpp serves as a C++ backend designed to work efficiently with transformer-based models. Running these models at full precision on a typical local PC is impractical because of their large number of parameters (~7 billion); this library enables us to run them on either a CPU or a GPU. Additionally, we use quantization at 4-bit precision to reduce the number of bits required to represent each weight. The quantized models are stored in GGML/GGUF format.
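For reference, loading a quantized GGUF model with llama-cpp-python looks roughly like this; the model path and parameters are placeholders, not the project's configuration:

```python
from llama_cpp import Llama

# Load a 4-bit quantized GGUF model; path and parameters are illustrative.
llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # assumed local file
    n_ctx=8192,       # context window to allocate
    n_gpu_layers=-1,  # offload all layers to the GPU (0 = CPU only)
)

out = llm("Q: What is retrieval-augmented generation?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```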

Supported LLMs Models

| 🤖 Model | Supported | Model Size | Max Context Window | Notes and link to the model card |
|---|---|---|---|---|
| qwen-3.5:0.8b - Qwen 3.5 0.8B | ✅ | 0.8B | 256k | Tiny and fast multimodal, great for edge devices - Card |
| qwen-3.5:2b - Qwen 3.5 2B | ✅ | 2B | 256k | Multimodal for lightweight agents (small tool calls) - Card |
| qwen-3.5:4b - Qwen 3.5 4B | ✅ | 4B | 256k | Doesn't drift from tasks as badly as the 2B - Card |
| qwen-3.5:9b - Qwen 3.5 9B | ✅ | 9B | 256k | Recommended model. Can handle more complex tasks and competes with larger models like gpt-oss 120B - Card |
| qwen-2.5:3b - Qwen2.5 Instruct | ✅ | 3B | 128k | Card |
| qwen-2.5:3b-math-reasoning - Qwen2.5 Instruct Math Reasoning | ✅ | 3B | 128k | Card |
| llama-3.2:1b - Meta Llama 3.2 Instruct | ✅ | 1B | 128k | Optimized to run locally on a mobile or edge device - Card |
| llama-3.2 - Meta Llama 3.2 Instruct | ✅ | 3B | 128k | Optimized to run locally on a mobile or edge device - Card |
| llama-3.1 - Meta Llama 3.1 Instruct | ✅ | 8B | 128k | Recommended model - Card |
| deep-seek-r1:7b - DeepSeek R1 Distill Qwen 7B | ✅ | 7B | 128k | Experimental - Card |
| openchat-3.6 - OpenChat 3.6 | ✅ | 8B | 8192 | Card |
| openchat-3.5 - OpenChat 3.5 | ✅ | 7B | 8192 | Card |
| starling - Starling Beta | ✅ | 7B | 8192 | Trained from Openchat-3.5-0106; recommended if you prefer more verbosity over OpenChat - Card |
| phi-3.5 - Phi-3.5 Mini Instruct | ✅ | 3.8B | 128k | Card |
| stablelm-zephyr - StableLM Zephyr OpenOrca | ✅ | 3B | 4096 | Card |

Supported Embedding Models

For semantic search, we support all the embedding models from Sentence Transformers, but we only tested the ones in the table below. To find the best embedding models for the retrieval task in your language (or in multiple languages), go to the Massive Text Embedding Benchmark (MTEB) Leaderboard. We recommend the jina-embeddings-v5-text models: they are small (239M & 677M parameters) and achieve SOTA performance on multilingual retrieval tasks on the MTEB benchmark.

| 🧠 Embedding Model | Supported | Model Size | Max Tokens | Retrieval score (MTEB) | Notes and link to the model card |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 - Sentence Transformers All MiniLM L6 v2 | ✅ | 0.023B | 512 | 33.30 | Card |
| all-MiniLM-L12-v2 - Sentence Transformers All MiniLM L12 v2 | ✅ | 0.033B | 256 | 33.37 | Card |
| all-mpnet-base-v2 - Sentence Transformers All Mpnet base v2 | ✅ | 0.109B | 384 | 33.80 | Card |
| jinaai/jina-embeddings-v5-text-small-retrieval - jina-embeddings-v5-text-small | ✅ | 0.596B | 32k | 64.88 | Recommended model - Card |
| jinaai/jina-embeddings-v5-text-nano-retrieval - jina-embeddings-v5-text-nano | ✅ | 0.212B | 8k | 63.26 | Card |
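As a quick usage sketch with one of the tested models (any model from the table can be swapped in; the documents and query are made up):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["Chroma stores embeddings on disk.", "Poetry manages Python dependencies."]
query = "Where are the vectors persisted?"

doc_emb = model.encode(docs)
query_emb = model.encode(query)
scores = util.cos_sim(query_emb, doc_emb)  # cosine similarity; higher = more relevant
print(scores)
```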

Supported Response Synthesis strategies

| ✨ Response Synthesis strategy | Supported | Notes |
|---|---|---|
| create-and-refine - Create and Refine | ✅ | |
| tree-summarization - Tree Summarization | ✅ | Recommended |

Build the memory index

You could download some Markdown pages from the Blendle Employee Handbook and put them under docs.

Build the memory index by running:

make migrate_db
python chatbot/memory_builder.py --model-name jinaai/jina-embeddings-v5-text-small-retrieval --chunk-size 1000 --chunk-overlap 50

Run the Chatbot

The Chatbot has a UI built with Vite, React and TypeScript, and a backend built with FastAPI that serves the LLMs through llama-cpp-python.
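A stripped-down sketch of what such a backend endpoint can look like; the /chat route, request shape, and model path are hypothetical, not the project's actual API:

```python
from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

app = FastAPI()
llm = Llama(model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf", n_ctx=8192)


class ChatRequest(BaseModel):
    message: str


@app.post("/chat")  # hypothetical route for illustration
def chat(req: ChatRequest) -> dict:
    out = llm(f"Q: {req.message}\nA:", max_tokens=256)
    return {"answer": out["choices"][0]["text"].strip()}
```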

To install the UI dependencies, run:

cd frontend
nvm use
yarn

# Create .env file
echo "VITE_API_URL=http://localhost:8000" > .env

To start the backend, type:

cd backend && PYTHONPATH=.:../chatbot uvicorn main:app --reload

To start the frontend (in a new terminal):

cd frontend && yarn dev

or, to start both while ensuring that the backend is running and ready before the frontend launches, just run:

make start

The application will be available at http://localhost:5173, with the backend API at http://localhost:8000.

conversation-aware-chatbot.gif

You can enable the RAG Mode feature in the UI to ask questions based on the context provided by the Markdown files you loaded and indexed in the previous step:

rag_chatbot_example.gif

You can also upload a Markdown file using the file uploader. The document management section shows the uploaded and indexed documents. Once you upload one or multiple files, they will be: uploaded → chunked → embedded → upserted to Chroma.

rag_chatbot_load_doc_example.gif
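Conceptually, the upload handler does something like this sketch; the function and the chunk_markdown helper are hypothetical, while collection.upsert is Chroma's real upsert API:

```python
def index_upload(collection, embedder, filename: str, text: str) -> None:
    """Chunk, embed, and upsert one uploaded Markdown file."""
    chunks = chunk_markdown(text)  # hypothetical splitter, as above
    collection.upsert(  # upsert replaces any ids that already exist
        ids=[f"{filename}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
        metadatas=[{"source": filename}] * len(chunks),
    )
```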
