Students don't need more explanations; they need the right explanations, aligned exactly with their syllabus, units, and exam patterns.
uniAI is a Retrieval-Augmented Generation (RAG) system built for university students with one clear priority: exam scoring over generic learning. It is not a general-purpose AI tutor. Every architectural decision, from how PDFs are ingested to how the LLM prompt is structured, reflects the constraint that answers must be grounded in the student's actual syllabus, unit by unit.
Most AI study tools try to teach. uniAI is designed to help students score.
It is intentionally less creative, more constrained, and more exam-oriented than a general assistant. Concretely, this means it answers strictly from your own uploaded notes and syllabus PDFs, it explicitly flags out-of-syllabus questions instead of silently hallucinating, retrieval is unit-scoped so asking about Unit 3 only surfaces Unit 3 content, and a cross-encoder reranker ensures the most semantically relevant chunks reach the LLM rather than just the most cosine-similar ones.
PDF Notes / Syllabus / PYQs
               │
               ▼
┌────────────────────────────────────┐
│ VLM OCR Ingestion Pipeline         │  ← Qwen3-VL (Ollama / OpenRouter / HuggingFace)
│ Semantic sectioning per page       │    PyMuPDF, running topic list, garbage filtering
│ One JSON per topic section         │    Rate-limit safe (exponential backoff)
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ Three Isolated ChromaDB            │
│ Collections                        │  ← cosine similarity space
│   multimodal_notes                 │
│   multimodal_syllabus              │
│   multimodal_pyq                   │
└──────────────┬─────────────────────┘
               │
         Query arrives
               │
               ▼
┌────────────────────────────────────┐
│ Query Expansion (3 layers)         │  ← Exam phrasing normalization
│                                    │    Abbreviation expansion
│                                    │    Syllabus keyword injection
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ Hybrid Router (4 tiers)            │    1. Regex for explicit unit mention
│                                    │    2. Weighted keyword scoring
│                                    │    3. Pre-computed unit embedding similarity
│                                    │    4. LLM fallback (Qwen3.5 / Gemini)
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ Metadata-Filtered Retrieval        │  ← Subject + Unit scoped ChromaDB query
│ Notes + Syllabus chunks            │    Cosine similarity threshold gating
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ Cross-Encoder Reranker             │  ← Qwen3-Reranker-0.6B (HuggingFace)
│                                    │    GPU inference via PyTorch CUDA
│                                    │    Sigmoid-normalized 0 to 1 relevance scores
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ Hallucination Gate                 │  ← top cross-score < 0.65 → Generic Mode
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ Generation                         │  ← Gemini API / Ollama / Groq
│ + Session Memory Injection         │    Exam-focused prompt assembly
└────────────────────────────────────┘
uniAI/
├── source_code/
│   ├── config/
│   │   ├── env.py      # Secrets and machine-specific settings from .env
│   │   ├── models.py   # AI provider profiles (Gemini, Ollama, Groq)
│   │   ├── rag.py      # RAG hyperparameters (thresholds, K values, etc.)
│   │   ├── paths.py    # Filesystem paths, ChromaDB collection names
│   │   └── main.py     # Assembles CONFIG dict; single import for everything
│   │
│   ├── models.py       # Unified provider abstraction (chat, embed, rerank, vision)
│   ├── utils.py        # Shared helpers: image encoding, ChromaDB, JSON parsing
│   ├── prompts.py      # Single source of truth for all LLM prompts
│   │
│   ├── extract/
│   │   ├── extract_multimodal_notes.py      # VLM OCR: semantic sectioning with topic loop
│   │   ├── extract_multimodal_pyq.py        # VLM OCR + LLM unit classification for PYQs
│   │   └── extract_multimodal_syllabus.py   # Structured syllabus extraction (7 chunks/PDF)
│   │
│   ├── ingest/
│   │   ├── ingest_multimodal.py             # Notes → multimodal_notes
│   │   ├── ingest_multimodal_pyq.py         # PYQs → multimodal_pyq
│   │   └── ingest_multimodal_syllabus.py    # Syllabus → multimodal_syllabus
│   │
│   ├── pipeline/
│   │   ├── embeddings/local_embedding.py    # Ollama embedding client (keep_alive)
│   │   ├── generate_keyword_map.py          # Builds subject_keywords.json for routing
│   │   ├── generate_unit_embeddings.py      # Builds unit_embeddings.pkl for Stage 3 router
│   │   └── retrieval_utils.py               # Threshold-filtered retrieval helper
│   │
│   ├── rag/
│   │   ├── rag_pipeline.py      # Main orchestrator: route → retrieve → rerank → generate
│   │   ├── hybrid_router.py     # Coordinates 4-tier routing waterfall
│   │   ├── router.py            # Tier 2: weighted keyword scoring
│   │   ├── embedding_router.py  # Tier 3: pre-computed unit embedding similarity
│   │   ├── unit_router.py       # Regex + keyword unit detection
│   │   ├── query_expander.py    # 3-layer query expansion
│   │   ├── search.py            # Collection-isolated retrieval functions
│   │   ├── cross_encoder.py     # Qwen3-Reranker-0.6B reranker (GPU)
│   │   ├── reranker.py          # Heuristic reranker (fallback / legacy)
│   │   ├── context_builder.py   # Formats chunks into LLM-ready context
│   │   └── chat_cli.py          # CLI chat loop
│   │
│   └── tests/
│       ├── test_glm.py          # Standalone GLM-OCR capability smoke test
│       ├── chat/                # Manual chat session scripts and question sets
│       ├── retrieval/           # Retrieval accuracy and routing tests
│       ├── router/              # Router evaluation with trace logs
│       ├── complete_system/     # Full pipeline integration tests
│       ├── ci/                  # CI/CD tests (syntax, Django, pytest)
│       ├── db/                  # ChromaDB audit and dump utilities
│       ├── others/              # Miscellaneous unit tests
│       └── api/                 # API provider smoke tests
│
├── rag_project/                 # Django backend
│   └── rag_api/
│       ├── views.py             # /api/query and /api/health endpoints
│       ├── urls.py
│       └── templates/chat.html  # Minimal HTML/JS frontend
│
├── PROGRESS.md                  # Pipeline status tracker
├── .github/workflows/ci.yml     # CI: syntax check, Django health, pytest
├── requirements.txt
├── requirements_linux.txt       # WSL/Ubuntu setup guide
└── .env.example
The config is a proper Python package with four files that each own one concern: env.py loads secrets from .env, models.py defines the AI provider profiles and which one is active, rag.py holds every tunable hyperparameter, and paths.py resolves filesystem locations using pathlib. main.py then assembles these into a single CONFIG dictionary that every other module imports, ensuring one consistent access pattern throughout the codebase.
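The assembly step described above might look roughly like the sketch below. It collapses the four modules into one file so it can run on its own; the key names, default values, and helper functions (`load_env`, `build_paths`, `build_config`) are illustrative assumptions, not the project's actual layout.

```python
import os
from pathlib import Path


def load_env(environ=os.environ):
    """Stand-in for env.py: secrets and machine-specific settings."""
    return {
        "ollama_base_url": environ.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "chroma_db_path": environ.get("CHROMA_DB_PATH", "./chroma"),
    }


# Stand-in for models.py: which provider profile is active (value is assumed).
MODELS = {"active_chat_model": "gemini"}

# Stand-in for rag.py: a small subset of the tunable hyperparameters.
RAG = {"similarity_threshold": 0.35, "cross_encoder": {"min_score": 0.65}}


def build_paths(env):
    """Stand-in for paths.py: filesystem locations and collection names."""
    return {
        "chroma": Path(env["chroma_db_path"]),
        "collections": {
            "notes": "multimodal_notes",
            "syllabus": "multimodal_syllabus",
            "pyq": "multimodal_pyq",
        },
    }


def build_config(environ=os.environ):
    """Stand-in for main.py: merge every concern into one dictionary."""
    env = load_env(environ)
    return {"env": env, "models": MODELS, "rag": RAG, "paths": build_paths(env)}


# Other modules would import this single object.
CONFIG = build_config()
```

Keeping the assembly in one function also makes it easy to build a config from a fake environment in tests.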
The unified provider abstraction in source_code/models.py is the architectural core. Instead of every script calling ollama.chat() or genai.generate_content() directly, they all go through models.chat(), models.embed(), models.rerank(), or models.vision(). Switching the generation backend from Gemini to Groq is a one-line change in config/models.py. Provider clients are lazily initialized: they are created only on first use, which avoids import-time failures if a provider library is not installed.
| Function | Purpose | Providers |
|---|---|---|
| `models.chat()` | Text generation | Gemini, Ollama, Groq |
| `models.embed()` | Vector embeddings | Ollama |
| `models.rerank()` | Cross-encoder scoring | HuggingFace Transformers (local) |
| `models.vision()` | VLM OCR | Ollama, OpenRouter, HuggingFace |
Three parallel pipelines handle the three data types, each depositing into its own isolated ChromaDB collection.
The notes pipeline uses a semantic sectioning approach: per page, the VLM identifies distinct topic sections and returns a sections[] array. Each section is written as its own JSON file. A running topic list is maintained across pages so the VLM can reuse consistent section names rather than creating duplicates. Already-processed pages are detected by file glob and skipped, with existing topic names rehydrated from disk to preserve continuity. Images are rendered as JPEG at 1× scale for Ollama cloud (to avoid Cloudflare 524 timeouts) or PNG at 2× for HuggingFace.
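The resume logic described above can be sketched as follows. The file naming scheme (`<pdf_stem>_p<page>_s<section>.json`) and the `title` field are assumptions made for this example; the real extractor may name and structure its output differently.

```python
import json
from pathlib import Path
from typing import List


def page_already_done(out_dir: Path, pdf_stem: str, page: int) -> bool:
    """A page counts as processed if at least one section file exists for it."""
    return any(out_dir.glob(f"{pdf_stem}_p{page:03d}_s*.json"))


def rehydrate_topics(out_dir: Path, pdf_stem: str) -> List[str]:
    """Rebuild the running topic list from section files already on disk."""
    topics: List[str] = []
    for path in sorted(out_dir.glob(f"{pdf_stem}_p*_s*.json")):
        title = json.loads(path.read_text(encoding="utf-8")).get("title")
        if title and title not in topics:
            topics.append(title)
    return topics
```

On restart, the extractor would call `rehydrate_topics` once, then skip every page for which `page_already_done` returns `True`.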
The syllabus pipeline processes each syllabus PDF into exactly seven structured JSON files: one per unit, plus a course outcomes chunk and a books/references chunk. This granularity is what makes unit-scoped retrieval precise later.
The PYQ pipeline is the most involved. It transcribes exam papers page-by-page via VLM, then for each extracted question calls the chat LLM a second time to classify which syllabus unit the question belongs to. Questions are cleaned of marks annotations, pipe separators, and trailing numbers via regex before ingestion. Both the OCR step and the classification step use 15s × attempt exponential backoff to handle cloud rate limits.
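The cleaning and retry steps can be sketched as below. The regular expressions are guesses at what "marks annotations" and "trailing numbers" look like in practice, and `call_with_backoff` is a hypothetical helper; the delay follows the "15s × attempt" schedule stated above.

```python
import re
import time


def clean_question(text: str) -> str:
    """Strip marks annotations, pipe separators, and a trailing bare number."""
    # Assumed annotation shapes: "(10 marks)", "[5 Marks]".
    text = re.sub(r"[\(\[]\s*\d+\s*marks?\s*[\)\]]", " ", text, flags=re.IGNORECASE)
    text = text.replace("|", " ")
    text = re.sub(r"\s+\d+\s*$", "", text.strip())
    return re.sub(r"\s+", " ", text).strip()


def call_with_backoff(fn, max_attempts=4, base_delay=15.0, sleep=time.sleep):
    """Retry fn, waiting base_delay * attempt seconds after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * attempt)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. A production version would likely catch only rate-limit errors rather than every exception.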
Every query goes through a four-tier waterfall before any retrieval happens.
Tier 1: Regex Unit Detection checks for an explicit unit mention (unit 3, unit-4) and extracts it immediately.
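As a rough illustration, explicit unit detection can be done with a single pattern. The exact pattern used in unit_router.py is not shown in this document, so the one below is an assumption that covers the two example forms.

```python
import re
from typing import Optional

# Matches "unit 3", "unit-4", "Unit5"; the separator is optional.
UNIT_PATTERN = re.compile(r"\bunit[\s\-_]*(\d+)\b", re.IGNORECASE)


def detect_unit(query: str) -> Optional[int]:
    """Return the unit number if the query names one explicitly, else None."""
    match = UNIT_PATTERN.search(query)
    return int(match.group(1)) if match else None
```

The leading word boundary keeps words such as "community" from triggering a false match.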
Tier 2: Keyword Scoring scores the query against subject_keywords.json using a weighted system. PYQ keywords carry the most signal (weight 5), followed by unit-specific notes keywords (4), syllabus unit keywords (3), and core subject keywords (2). If one subject wins with no tie and meets the minimum threshold, routing completes in milliseconds without any LLM call.
| Signal | Weight |
|---|---|
| PYQ keywords | 5 |
| Notes unit-level keywords | 4 |
| Syllabus unit-level keywords | 3 |
| Core subject keywords | 2 |
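The weighting scheme in the table can be sketched as follows. The weights and the minimum score of 2 come from this document; the shape of the keyword map (subject → group → keyword list) and plain substring matching are assumptions for illustration.

```python
WEIGHTS = {"pyq": 5, "notes": 4, "syllabus": 3, "core": 2}
MIN_SCORE = 2  # keywords.min_score


def score_subjects(query, keyword_map):
    """Sum the weight of every keyword that appears in the query, per subject."""
    q = query.lower()
    scores = {}
    for subject, groups in keyword_map.items():
        total = 0
        for group, keywords in groups.items():
            hits = sum(1 for kw in keywords if kw.lower() in q)
            total += WEIGHTS[group] * hits
        scores[subject] = total
    return scores


def route_by_keywords(query, keyword_map, min_score=MIN_SCORE):
    """Return the winning subject, or None on a tie or a weak score."""
    ranked = sorted(score_subjects(query, keyword_map).items(),
                    key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return None
    if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
        return None  # tie: let the next tier decide
    return ranked[0][0]
```

Returning `None` rather than a best guess is what lets ambiguous queries fall through to the embedding tier.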
Tier 3: Embedding Similarity embeds the query and computes cosine similarity against pre-computed unit embeddings stored in unit_embeddings.pkl. These reference embeddings are generated offline from the keyword map and represent each subject/unit as a dense vector. If similarity exceeds EMBEDDING_ROUTER_THRESHOLD (0.55), routing is decided.
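A dependency-free sketch of this tier is shown below. It assumes the pickled file holds a mapping from a subject/unit label to a vector; the label format and the tiny two-dimensional vectors in the usage note are made up for the example.

```python
import math

EMBEDDING_ROUTER_THRESHOLD = 0.55


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_by_embedding(query_vec, unit_embeddings, threshold=EMBEDDING_ROUTER_THRESHOLD):
    """Return (label, similarity); label is None when no unit clears the threshold."""
    best_label, best_sim = None, -1.0
    for label, vec in unit_embeddings.items():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim > threshold:
        return best_label, best_sim
    return None, best_sim
```

For instance, `route_by_embedding([0.9, 0.1], {"CYBER_SECURITY_3": [1.0, 0.0], "DBMS_2": [0.0, 1.0]})` picks `"CYBER_SECURITY_3"`, while a query vector equally close to both units stays unrouted if the threshold is set above that shared similarity.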
Tier 4: LLM Fallback invokes a fast router model with a strict prompt that must reply with exactly one SUBJECT_UNIT string. Temperature is fixed at 0.0 for deterministic output. This tier only runs for genuinely ambiguous queries that escaped all previous stages.
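Taken together, the four tiers form a short-circuiting waterfall. The sketch below shows that control flow plus a strict check on the LLM reply; `hybrid_route`, `parse_llm_route`, and the `CYBER_SECURITY_3` label format are hypothetical and only illustrate the idea.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Tier = Tuple[str, Callable[[str], Optional[str]]]


def parse_llm_route(reply: str, valid_labels: Iterable[str]) -> Optional[str]:
    """Accept the LLM reply only if it is exactly one known SUBJECT_UNIT label."""
    label = reply.strip().upper()
    return label if label in set(valid_labels) else None


def hybrid_route(query: str, tiers: List[Tier]):
    """Run tiers in order and stop at the first one that returns a decision."""
    for name, tier in tiers:
        decision = tier(query)
        if decision is not None:
            return decision, name
    return None, None
```

Returning the tier name alongside the decision makes it easy to log which stage resolved each query, which is useful when evaluating the router with trace logs.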
Three layers are applied before embedding to bridge the vocabulary gap between how students phrase questions and how lecture notes are written.
The first layer strips exam-style phrasing so "write a short note on buffer overflow" becomes "buffer overflow" and the embedding captures the concept, not the question format. The second layer expands known abbreviations using a hardcoded map and a loaded subject_aliases.json. The third layer appends syllabus keywords for the detected subject and unit, anchoring the query embedding in academic vocabulary.
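The three layers can be sketched in a few lines. The prefix list and the abbreviation map below are illustrative assumptions; the real expander loads its aliases from subject_aliases.json and takes its keywords from the syllabus.

```python
import re

# Layer 1: assumed list of exam-style prefixes to strip.
EXAM_PREFIXES = re.compile(
    r"^\s*(?:write a short note on|explain in detail|explain|define|"
    r"describe|discuss|what is|what are)\s+",
    re.IGNORECASE,
)

# Layer 2: hypothetical abbreviation map.
ABBREVIATIONS = {"dos": "denial of service", "sql": "structured query language"}


def expand_query(query, syllabus_keywords=()):
    """Strip exam phrasing, expand abbreviations, append syllabus keywords."""
    core = EXAM_PREFIXES.sub("", query).strip().rstrip("?.")
    words = []
    for token in core.split():
        words.append(token)
        full = ABBREVIATIONS.get(token.lower())
        if full:
            words.append(f"({full})")
    expanded = " ".join(words)
    # Layer 3: anchor the query in the unit's academic vocabulary.
    if syllabus_keywords:
        expanded += " " + " ".join(syllabus_keywords)
    return expanded
```

Here the abbreviation is kept alongside its expansion so that notes using either form can still match.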
After cosine-similarity retrieval, the top candidates are reranked using tomaarsen/Qwen3-Reranker-0.6B-seq-cls. Unlike the bi-encoder used for initial retrieval, a cross-encoder processes the query and each document together, which allows it to detect semantic relationships that independent embeddings miss. Scores are sigmoid-normalized to a 0 to 1 range.
The hallucination gate sits immediately after reranking: if the top cross-encoder score falls below MIN_CROSS_SCORE (0.65), the pipeline discards all retrieved chunks and switches to Generic AI Tutor Mode. This is the mechanism that prevents the LLM from producing confident-sounding answers from irrelevant context.
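The normalization and gating logic can be sketched without loading the model. The function below takes raw cross-encoder logits as input; the return shape (a mode string plus the kept chunks) is an assumption for illustration, while the 0.65 threshold and the top-4 cutoff come from the configuration table in this document.

```python
import math

MIN_CROSS_SCORE = 0.65   # cross_encoder.min_score
PIPELINE_TOP_N = 4       # cross_encoder.pipeline_top_n


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def apply_hallucination_gate(chunks, logits, min_score=MIN_CROSS_SCORE, top_n=PIPELINE_TOP_N):
    """Rank chunks by normalized score; drop everything if the best one is weak."""
    scored = sorted(
        ((sigmoid(logit), chunk) for logit, chunk in zip(logits, chunks)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored or scored[0][0] < min_score:
        return "generic", []
    return "grounded", [chunk for _, chunk in scored[:top_n]]
```

Discarding all chunks, rather than keeping the weak ones, is the point of the gate: the generator never sees context that the reranker judged irrelevant.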
| Collection | Content | Key Metadata |
|---|---|---|
| `multimodal_notes` | Lecture notes, handwritten notes, slides | subject, unit, title, chunk_idx, section_index, confidence |
| `multimodal_syllabus` | Unit topics, course outcomes, book lists | subject, unit, chunk_type, syllabus_version |
| `multimodal_pyq` | Past year exam questions | subject, unit, year, marks |
Collection isolation is foundational. The retrieve_notes() function applies an explicit document_type != "syllabus" filter to prevent syllabus chunks from appearing in notes results, even though both live under the same ChromaDB path.
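A metadata filter of this kind can be expressed with ChromaDB's `where` operators (`$and`, `$eq`, `$ne`). The helper below only builds the filter dictionary; its name and the assumption that `unit` is stored as a string are illustrative, not taken from search.py.

```python
from typing import Optional


def build_notes_filter(subject: str, unit: Optional[str] = None) -> dict:
    """Build a ChromaDB `where` filter scoped to a subject (and optionally a unit)."""
    clauses = [
        {"subject": {"$eq": subject}},
        # Keep syllabus chunks out of notes results.
        {"document_type": {"$ne": "syllabus"}},
    ]
    if unit is not None:
        clauses.append({"unit": {"$eq": unit}})
    return {"$and": clauses}


# Usage with a ChromaDB collection object (not executed here):
#   collection.query(
#       query_embeddings=[query_vec],
#       n_results=6,
#       where=build_notes_filter("CYBER_SECURITY", "3"),
#   )
```

Building the filter in one place keeps the subject and unit scoping consistent across every retrieval function.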
git clone https://github.com/git-pratap-shrey/uniAI.git
cd uniAI
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
For WSL/Ubuntu, follow the step-by-step guide in requirements_linux.txt (includes PyTorch CUDA setup).
cp .env.example .env
# Edit .env with your keys and paths
Key variables to set:
OLLAMA_BASE_URL=http://localhost:11434 # or cloud Ollama URL
OLLAMA_API_KEY=... # if using authenticated cloud Ollama
BASE_DATA_DIR=/path/to/your/data # flattened: SUBJECT/notes/unitN/*.pdf
CHROMA_DB_PATH=/path/to/your/chroma
GEMINI_API_KEY=... # if using Gemini for generation
OPENROUTER_API_KEY=... # if using OpenRouter for vision fallback
HF_TOKEN=... # if using HuggingFace for vision or reranking
USE_OLLAMA_CLOUD=true # true = use OLLAMA_BASE_URL, false = OLLAMA_LOCAL_URL
Make sure Ollama is running locally (ollama serve) and required models are pulled.
Data layout is flat, with no year nesting:
<BASE_DATA_DIR>/<SUBJECT>/
notes/unit1/*.pdf
notes/unit2/*.pdf
pyqs/*.pdf
syllabus/*.pdf
# OCR extraction (run as modules from project root)
python -m source_code.extract.extract_multimodal_notes
python -m source_code.extract.extract_multimodal_pyq
python -m source_code.extract.extract_multimodal_syllabus
# Ingest into ChromaDB
python source_code/ingest/ingest_multimodal.py
python source_code/ingest/ingest_multimodal_pyq.py
python source_code/ingest/ingest_multimodal_syllabus.py
# Build router artifacts
python source_code/pipeline/generate_keyword_map.py
python source_code/pipeline/generate_unit_embeddings.py
All extraction scripts are resumable: already-processed files are detected and skipped automatically.
cd rag_project
python manage.py runserver
API Endpoints:
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/health` | System health and active model |
| `POST` | `/api/query` | Main RAG query endpoint |
Query payload:
{
"query": "Explain buffer overflow attack",
"history": [],
"subject": "CYBER_SECURITY"
}
python source_code/rag/chat_cli.py
Commands: /switch <SUBJECT>, /subjects, /history, /clear
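For scripted use, the /api/query payload shown earlier can be sent from Python with the standard library alone. The sketch assumes the server is reachable at Django's default development address (http://127.0.0.1:8000); the response schema is not documented here, so the usage note prints it as-is.

```python
import json
import urllib.request


def build_query_request(query, subject, history=None, base_url="http://127.0.0.1:8000"):
    """Build a POST request for /api/query with the documented JSON payload."""
    payload = {"query": query, "history": history or [], "subject": subject}
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (requires the Django server to be running):
#   req = build_query_request("Explain buffer overflow attack", "CYBER_SECURITY")
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read()))
```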
All tuneable parameters live in source_code/config/rag.py.
| Parameter | Default | Description |
|---|---|---|
| `similarity_threshold` | `0.35` | Min cosine similarity to keep a retrieval result |
| `min_strong_sim` | `0.6` | Min similarity the top chunk must have |
| `cross_encoder.model` | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls` | Reranker model |
| `cross_encoder.min_score` | `0.65` | Below this score → Generic AI Tutor Mode |
| `cross_encoder.candidates` | `6` | Max chunks sent to cross-encoder |
| `cross_encoder.pipeline_top_n` | `4` | Chunks kept after reranking |
| `history_limit` | `4` | Conversation turns injected into context |
| `keywords.min_score` | `2` | Min keyword score to trust Tier 2 routing |
| `embedding_router_threshold` | `0.55` | Min similarity to trust Tier 3 routing |
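One plausible reading of how the first two parameters interact is sketched below: weak results are dropped individually, and the whole result set is rejected if even the best chunk is not strong enough. This interpretation, and the function name, are assumptions. The conversion from distance to similarity relies on ChromaDB's cosine space reporting distance as 1 minus cosine similarity.

```python
SIMILARITY_THRESHOLD = 0.35  # similarity_threshold
MIN_STRONG_SIM = 0.6         # min_strong_sim


def gate_by_similarity(hits, threshold=SIMILARITY_THRESHOLD, min_strong=MIN_STRONG_SIM):
    """Filter (chunk, cosine_distance) pairs; return (chunk, similarity) pairs."""
    scored = sorted(
        ((chunk, 1.0 - distance) for chunk, distance in hits),
        key=lambda pair: pair[1],
        reverse=True,
    )
    kept = [(chunk, sim) for chunk, sim in scored if sim >= threshold]
    # Reject everything if the best surviving chunk is still not a strong match.
    if not kept or kept[0][1] < min_strong:
        return []
    return kept
```

For example, hits with distances 0.2, 0.5, and 0.8 keep the first two chunks, while a lone hit at distance 0.55 is dropped because its similarity of 0.45 is below `min_strong_sim`.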
Chat model selection lives in source_code/config/models.py via ACTIVE_CHAT_MODEL.
| Layer | Stack |
|---|---|
| Backend | Python, Django |
| AI / ML | RAG, VLM OCR, Cross-encoder reranking, Embeddings |
| Models | Qwen3-VL, Qwen3-Reranker-0.6B, Qwen3-Embedding:4B, Qwen3.5:2b, Gemini API |
| Vector DB | ChromaDB (3 isolated collections, cosine space) |
| Inference | Ollama (local/cloud), OpenRouter, HuggingFace Transformers, PyTorch CUDA |
| Data Processing | PyMuPDF, semantic sectioning, custom cleaning |
| Testing | pytest, GitHub Actions CI |
| Dev & Infra | Git/GitHub, .env config, Cloudflare Tunnel, local-first design |
The cross-encoder loads on first call and blocks until it is warm, meaning the first request after a cold server start will be noticeably slow. CSRF is currently disabled on /api/query for development convenience and must be re-enabled before any public deployment. Only one academic year is fully ingested in the current prototype. There is no persistent long-term memory across sessions: conversation history is stateless and lives in the frontend.
Answer citations with source page references so students can trace answers back to their notes. A background warm-up thread for the cross-encoder to eliminate cold-start latency. Automated ingestion triggers for new subject data. Unit-level summaries and topic index generation. Fix zero-yield PYQ PDFs (fill-in-the-blank regex). College-wide deployment once the system is hardened.
Stage: Active development / prototype (notes extraction running)
Target users: Self + small group of classmates
Future goal: College-wide deployment
The focus of uniAI is not novelty; it is alignment with real academic needs and practical engineering trade-offs. Every component exists because a simpler version failed a real retrieval or accuracy problem.
