A self-hosted memory backend for AI agents. RASPUTIN stores conversations as overlapping windows and LLM-extracted facts in Qdrant, with an LLM quality gate that prevents junk from entering the memory store.
Production-grade long-term memory for AI agents:
- Vector search (Qdrant) with two-lane retrieval (windows + facts)
- LLM-based fact extraction at ingest time
- Cross-encoder reranking (local, CPU)
- A-MAC quality gate on commits
Main server: tools/hybrid_brain.py
Memory Commit
│
├─► A-MAC quality gate (relevance/novelty/specificity)
├─► 5-turn overlapping windows (stride 2)
├─► LLM fact extraction (optional, Haiku)
├─► Embedding (nomic-embed-text, 768d)
└─► Persist to Qdrant
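The 5-turn / stride-2 windowing above can be sketched as follows (a minimal illustration; the function name is ours, not the actual implementation in `tools/hybrid_brain.py`):

```python
def window_chunks(turns, window_size=5, stride=2):
    """Split a conversation into overlapping windows of turns.

    With window_size=5 and stride=2, consecutive windows share 3 turns,
    so a fact mentioned near a window boundary still appears with context.
    """
    if len(turns) <= window_size:
        return [list(turns)]
    chunks = [list(turns[start:start + window_size])
              for start in range(0, len(turns) - window_size + 1, stride)]
    # Cover the tail of the conversation if the stride skipped past it.
    if (len(turns) - window_size) % stride != 0:
        chunks.append(list(turns[-window_size:]))
    return chunks

turns = [f"turn-{i}" for i in range(9)]
print(window_chunks(turns))  # windows starting at turns 0, 2, 4 — each 5 turns long
```

The overlap is what makes windows robust: a detail near a chunk boundary is never stranded without its surrounding turns.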
Search (two-lane)
│
├─► Multi-Query Expansion
├─► Query Embedding (nomic-embed-text, 768d)
│
├─► Lane 1: Window search (45 slots) ──┐
├─► Lane 2: Fact search (15 slots) ──┼─► Merge ─► Cross-encoder rerank ─► Top-60 to LLM
└─► (Optional: BM25 keyword lane) ──┘
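The merge-and-rerank step of the diagram can be sketched like this (slot budgets as above; `rerank_score` stands in for the cross-encoder's query/passage score, and the hit shape is a simplifying assumption):

```python
def merge_lanes(window_hits, fact_hits, rerank_score, top_k=60):
    """Merge the two retrieval lanes, dedupe by id, rerank, truncate.

    window_hits / fact_hits are lists of {"id": ..., "text": ...} dicts,
    already limited to their slot budgets (45 and 15 in the diagram).
    """
    seen, merged = set(), []
    for hit in window_hits + fact_hits:
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append(hit)
    merged.sort(key=lambda h: rerank_score(h["text"]), reverse=True)
    return merged[:top_k]

windows = [{"id": 1, "text": "payment failed on 2026-03-01"},
           {"id": 2, "text": "weather chat"}]
facts = [{"id": 3, "text": "payment provider is Stripe"},
         {"id": 1, "text": "payment failed on 2026-03-01"}]  # duplicate id
score = lambda text: text.count("payment")  # toy stand-in for the cross-encoder
print(merge_lanes(windows, facts, score))  # ids 1, 3, 2: duplicate dropped, payment hits first
```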
- API server: `tools/hybrid_brain.py`
- Fact extraction: `tools/brain/fact_extractor.py`
- Cross-encoder reranker: `tools/brain/cross_encoder.py`
- Maintenance jobs: `tools/memory_decay.py`, `tools/memory_dedup.py`
| Feature | RASPUTIN | Mem0 | Zep | LightRAG |
|---|---|---|---|---|
| Vector search | ✅ Qdrant | ✅ | ✅ | ✅ |
| LLM fact extraction | ✅ | ❌ | ❌ | ❌ |
| Two-lane retrieval | ✅ windows + facts | ❌ | ❌ | ❌ |
| Cross-encoder reranking | ✅ local CPU | ❌ | ❌ | ❌ |
| LLM quality gate | ✅ A-MAC | ❌ | ❌ | ❌ |
| Contradiction detection | ✅ | ❌ | ❌ | ❌ |
| Self-hosted / no vendor lock | ✅ | ✅ | ❌ (SaaS) | ✅ |
Evaluated on LoCoMo (ACL 2024), conv-0 (199 QA pairs). Two benchmark modes: production (Haiku answers, neutral judge — measures retrieval quality) and compare (gpt-4o-mini answers, generous judge — field-comparable). See benchmarks/README.md for methodology details.
| Mode | Non-adversarial | Overall |
|---|---|---|
| Production (retrieval signal) | 69.7% | 53.3% |
| Compare (field-comparable) | 72.4% | — |
| Category | Production | Questions |
|---|---|---|
| Open-domain | 82.9% | 70 |
| Temporal | 73.0% | 37 |
| Multi-hop | 53.8% | 13 |
| Single-hop | 43.8% | 32 |
| Adversarial | 6.4% | 47 |
| Metric | Value |
|---|---|
| Gold-in-ANY-chunk | 88.4% |
| Gold-in-Top-5 | 63.8% |
| Gold-in-Top-10 | 71.4% |
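Gold-in-Top-k can be read as: for what share of questions does at least one of the top-k retrieved chunks contain the gold answer. A toy computation under that reading (the benchmark harness may normalize text differently):

```python
def gold_in_top_k(retrieved, golds, k):
    """Fraction of questions whose gold answer string appears verbatim
    in at least one of the top-k retrieved chunks (simple case-insensitive
    containment; a real harness may normalize punctuation, dates, etc.)."""
    hits = sum(
        1
        for chunks, gold in zip(retrieved, golds)
        if any(gold.lower() in chunk.lower() for chunk in chunks[:k])
    )
    return hits / len(golds)

retrieved = [["Alice moved to Berlin in May", "weather talk"],
             ["Bob likes tea", "Bob got promoted"]]
golds = ["berlin", "guitar"]
print(gold_in_top_k(retrieved, golds, k=2))  # 0.5 — only the first gold is found
```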
Published LoCoMo scores across memory systems are not directly comparable. Each system measures something different, uses different models, and reports under different conditions.
What varies across systems:
| Variable | Effect on Score | Example |
|---|---|---|
| Answer generation model | GPT-4o vs Haiku: ~20pp difference | A strong model rescues poor retrieval |
| Judge prompt leniency | "Be generous" vs neutral: ~5-10pp | Generous judges forgive vague answers |
| Context window size | 60 chunks vs 10: ~15pp | More context means ranking doesn't matter |
| Metric type | Retrieval recall vs answer accuracy | Fundamentally different measurements |
What each system actually measures:
| System | Metric | What It Tests |
|---|---|---|
| MemPalace | Retrieval recall | Whether the right evidence was found (no answer generated, no LLM) |
| LoCoMo original | Token F1 | Answer quality against gold standard (algorithmic, no LLM judge) |
| AMB/Hindsight | LLM judge accuracy | End-to-end: retrieval + answer + LLM evaluation |
| RASPUTIN | LLM judge accuracy | End-to-end with fixed, disclosed methodology |
| Memvid | LLM judge (claimed) | Methodology not published |
MemPalace's 96.6% LongMemEval score, for instance, is a retrieval recall metric — it measures whether the system found the right passage, not whether it generated a correct answer. This is a valid and useful metric, but it is not comparable to answer-accuracy scores reported by other systems.
Similarly, systems that use GPT-4o or Claude Opus for answer generation are primarily measuring LLM capability, not retrieval quality. A strong model can extract the correct answer from a large, poorly-ranked context window — which is exactly what our ablation program proved: at 60-chunk context, the entire ranking pipeline (BM25, keyword boosts, entity boosts, Cohere reranking, cross-encoder reranking) contributes 0pp because the answer model compensates.
RASPUTIN's methodology is fully disclosed:
- Production mode: Claude Haiku answers + neutral judge (isolates retrieval quality)
- Compare mode: gpt-4o-mini answers + generous judge (field-comparable baseline)
- Judge model pinned to `gpt-4o-mini-2024-07-18` (prevents version drift)
- All benchmark code, judge prompts, and experiment results are in this repository
We report production-mode numbers as primary because they reflect actual retrieval quality. Compare-mode numbers are provided for rough context against other systems, with the caveat that methodology differences make direct comparison approximate at best.
For a standardized comparison, we recommend the Agent Memory Benchmark (AMB), which evaluates all systems under identical conditions with a published judge prompt.
| System | Reported Score | Benchmark | Methodology |
|---|---|---|---|
| Backboard | 90.00% | LoCoMo | GPT-4.1, generous judge |
| Memvid | 85.70% | LoCoMo | Claimed LLM-as-judge, methodology not published |
| MemMachine | 84.87% | LoCoMo | Not published |
| Memobase | 75.78% | LoCoMo | Not published |
| Zep | 75.14% | LoCoMo | Not published |
| RASPUTIN (compare) | 72.4% | LoCoMo conv-0 | gpt-4o-mini answers, generous judge |
| RASPUTIN (production) | 69.7% | LoCoMo conv-0 | Haiku answers, neutral judge |
| mem0 | 66.88% | LoCoMo | Not published |
nomic-embed-text (768d) → Two-lane search (windows + facts) → Cross-encoder rerank → Haiku/gpt-4o-mini → gpt-4o-mini judge
See benchmarks/README.md for how to run benchmarks and reproduce numbers. See experiments/ for the full ablation program and scientific record.
```sh
docker compose up -d
```

This should start Qdrant and FalkorDB from the repository compose file.
```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-core.txt
```

```sh
python3 tools/hybrid_brain.py
```

Server runs on http://127.0.0.1:7777 by default.
```sh
curl http://localhost:7777/health
curl "http://localhost:7777/search?q=test&limit=3"
```

```sh
curl -X POST http://localhost:7777/commit \
  -H 'Content-Type: application/json' \
  -d '{"text":"Rasputin memory test event happened on 2026-03-01.","source":"conversation"}'
```

The runtime loader reads `config/rasputin.toml` and allows env overrides (see `tools/config.py`).
Server:
- `host` (string): bind host
- `port` (int): API port

Qdrant:
- `url` (string): Qdrant base URL
- `collection` (string): active memory collection

FalkorDB:
- `host` (string): FalkorDB host
- `port` (int): FalkorDB port
- `graph_name` (string): graph key
- `disabled` (bool): disable graph search path

Embedding:
- `url` (string): embedding endpoint
- `model` (string): embedding model name
- `prefix_query` (string): query embedding prefix
- `prefix_doc` (string): document embedding prefix

Reranker:
- `url` (string): reranker endpoint
- `timeout` (int): timeout seconds
- `enabled` (bool): enable rerank stage

Quality gate (A-MAC):
- `threshold` (float): reject below this composite score
- `timeout` (int): scoring timeout seconds
- `model` (string): model for admission scoring

Decay:
- `decay_half_life_low` (int)
- `decay_half_life_medium` (int)
- `decay_half_life_high` (int)

Constraint extraction:
- `enabled` (bool): enable implicit constraint extraction at commit time
- `model` (string): LLM model for constraint extraction
- `timeout` (int): extraction timeout seconds

Entities:
- `known_entities_path` (string): entity dictionary JSON path
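Pulling the keys above into one file, a config might look like this (section names and values here are illustrative assumptions, not the shipped defaults):

```toml
[server]
host = "127.0.0.1"
port = 7777

[qdrant]
url = "http://localhost:6333"
collection = "memory"

[embedding]
url = "http://localhost:11434"
model = "nomic-embed-text"
prefix_query = "search_query: "
prefix_doc = "search_document: "

[amac]
threshold = 0.5
timeout = 10
model = "claude-haiku"
```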
All responses are JSON.
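Beyond curl, the endpoints can be driven from Python. A minimal client sketch (helper names are ours; request fields follow the curl examples in this section):

```python
import json
import urllib.parse
import urllib.request

BASE = "http://127.0.0.1:7777"  # default bind from the quick start

def build_commit_body(text, source="conversation", importance=None):
    """Assemble the JSON body for POST /commit."""
    body = {"text": text, "source": source}
    if importance is not None:
        body["importance"] = importance
    return json.dumps(body).encode("utf-8")

def commit(text, **kwargs):
    """POST a memory to /commit and return the parsed JSON response."""
    req = urllib.request.Request(
        BASE + "/commit",
        data=build_commit_body(text, **kwargs),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def search(query, limit=5):
    """GET /search with URL-encoded query parameters."""
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    with urllib.request.urlopen(f"{BASE}/search?{params}") as resp:
        return json.load(resp)
```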
Returns service health and component status.
```sh
curl http://localhost:7777/health
```

Hybrid retrieval endpoint.

```sh
curl "http://localhost:7777/search?q=payment+issue&limit=5"
```

Body-based search variant.
```sh
curl -X POST http://localhost:7777/search \
  -H 'Content-Type: application/json' \
  -d '{"query":"project timeline","limit":5,"expand":true}'
```

Commits memory after quality and duplicate checks.
```sh
curl -X POST http://localhost:7777/commit \
  -H 'Content-Type: application/json' \
  -d '{"text":"Vendor contract moved to April 12 with revised pricing.","source":"conversation","importance":75}'
```

Direct graph lookup.
Qdrant and graph count summary.
A-MAC admission counters and rejection stats.
Lists stored contradiction records.
Returns proactive memory suggestions from recent context.
```sh
curl -X POST http://localhost:7777/proactive \
  -H 'Content-Type: application/json' \
  -d '{"messages":["We are discussing launch timelines"],"max_results":3}'
```

Commits multi-turn conversations with automatic window chunking.
```sh
curl -X POST http://localhost:7777/commit_conversation \
  -H 'Content-Type: application/json' \
  -d '{"turns":[{"speaker":"Alice","text":"I got a promotion today!"},{"speaker":"Bob","text":"Congratulations!"}],"source":"conversation","window_size":5,"stride":2}'
```

Updates retrieval usefulness signal.
```sh
curl -X POST http://localhost:7777/feedback \
  -H 'Content-Type: application/json' \
  -d '{"point_id":123,"helpful":true}'
```

```sh
# lint
ruff check .

# type check
mypy tools/hybrid_brain.py tools/bm25_search.py --ignore-missing-imports

# unit tests (default suite)
pytest tests/ -k "not integration" -v

# integration tests (Qdrant required)
pytest tests/test_integration.py -v
```

- Add/update tests in `tests/`
- Keep API behavior backward-compatible where possible
- Prefer config via `config/rasputin.toml` + env overrides
- Validate with lint + mypy + tests before commit

```sh
pytest tests/ -k "not integration" -v
pytest tests/test_integration.py -v
pytest tests/ --cov=tools --cov-report=term-missing
```

Coverage threshold is configured in `pyproject.toml` (`fail_under = 40`).
- Two-lane retrieval: windows (45 slots) + LLM-extracted facts (15 slots)
- Cross-encoder reranker (ms-marco-MiniLM-L-6-v2, CPU)
- Structured fact extraction via Claude Haiku at ingest
- Windows-only chunking (individual turns proven to add 0pp)
- Ablation-tested: BM25, keyword/entity/temporal boosts, MMR, Cohere reranker all proven 0pp
- Benchmark infrastructure: production/compare modes, batch API (50% savings), failure analysis
- LoCoMo conv-0: 69.7% production, 72.4% compare (non-adversarial)
- Timing-safe auth, UTC datetimes, schema v0.7
- LLM reranker (Claude Haiku), professional benchmark harness
- Keyword overlap boosting, entity focus scoring
- recall@5: 0.67 → 0.82 (+22%), recall@10: 0.745 → 0.885 (+19%)
See CHANGELOG.md for full details.
MIT — see LICENSE.

