
rasputin-memory

Description

The memory system your AI agent deserves. 4-stage hybrid retrieval — Vector + BM25 + Knowledge Graph + Neural Reranker — in <150ms. Self-hosted, $0/query, built for agents that need to actually remember.

README

RASPUTIN Memory v0.7

CI License: MIT Python 3.10+

A self-hosted memory backend for AI agents. RASPUTIN stores conversations as overlapping windows and LLM-extracted facts in Qdrant, with an LLM quality gate that prevents junk from entering the memory store.

Production-grade long-term memory for AI agents:

  • Vector search (Qdrant) with two-lane retrieval (windows + facts)
  • LLM-based fact extraction at ingest time
  • Cross-encoder reranking (local, CPU)
  • A-MAC quality gate on commits

Main server: tools/hybrid_brain.py


Architecture Overview (v0.7)

Memory Commit
   │
   ├─► A-MAC quality gate (relevance/novelty/specificity)
   ├─► 5-turn overlapping windows (stride 2)
   ├─► LLM fact extraction (optional, Haiku)
   ├─► Embedding (nomic-embed-text, 768d)
   └─► Persist to Qdrant
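The 5-turn, stride-2 windowing step above can be sketched in a few lines. This is an illustrative helper, not the actual code in tools/hybrid_brain.py:

```python
def overlapping_windows(turns, window_size=5, stride=2):
    """Chunk conversation turns into overlapping windows.

    With window_size=5 and stride=2, consecutive windows share three
    turns, so a fact that straddles a window boundary still appears
    intact in at least one chunk.
    """
    windows = []
    for start in range(0, len(turns), stride):
        windows.append(turns[start:start + window_size])
        if start + window_size >= len(turns):
            break  # this window already reaches the last turn
    return windows
```

For eight turns this yields windows starting at turns 0, 2, and 4, with the last window absorbing the tail of the conversation.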

Search (two-lane)
   │
   ├─► Multi-Query Expansion
   ├─► Query Embedding (nomic-embed-text, 768d)
   │
   ├─► Lane 1: Window search (45 slots) ──┐
   ├─► Lane 2: Fact search (15 slots)   ──┼─► Merge ─► Cross-encoder rerank ─► Top-60 to LLM
   └─► (Optional: BM25 keyword lane)    ──┘
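The merge-and-rerank stage at the end of the diagram can be illustrated with a minimal sketch. Here `rerank_score` stands in for the cross-encoder, which in the real pipeline scores (query, chunk) pairs rather than chunks alone:

```python
def merge_and_rerank(window_hits, fact_hits, rerank_score, top_k=60):
    """Merge the two retrieval lanes, dedupe by id, rerank, keep top_k.

    window_hits / fact_hits: lists of (id, text) tuples from each lane.
    rerank_score: callable(text) -> float, a stand-in for the
    cross-encoder scorer.
    """
    seen, merged = set(), []
    for hit_id, text in window_hits + fact_hits:
        if hit_id not in seen:          # dedupe across lanes
            seen.add(hit_id)
            merged.append((hit_id, text))
    merged.sort(key=lambda h: rerank_score(h[1]), reverse=True)
    return merged[:top_k]
```

With 45 window slots and 15 fact slots feeding in, at most 60 deduplicated chunks reach the answer LLM.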

Core components

  • API server: tools/hybrid_brain.py
  • Fact extraction: tools/brain/fact_extractor.py
  • Cross-encoder reranker: tools/brain/cross_encoder.py
  • Maintenance jobs: tools/memory_decay.py, tools/memory_dedup.py
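The memory_decay.py maintenance job, together with the per-importance half-lives exposed in the [scoring] config section, suggests a standard exponential-decay scheme. The exact formula the job uses may differ; this is the textbook form:

```python
def decayed_score(base_score, age_days, half_life_days):
    """Halve a memory's retrieval weight every half_life_days.

    [scoring] in config/rasputin.toml exposes separate half-lives for
    low/medium/high-importance memories, so important memories fade
    more slowly than trivia.
    """
    return base_score * 0.5 ** (age_days / half_life_days)
```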

How It Compares

| Feature | RASPUTIN | Mem0 | Zep | LightRAG |
|---|---|---|---|---|
| Vector search | ✅ Qdrant | | | |
| LLM fact extraction | ✅ | | | |
| Two-lane retrieval | ✅ windows + facts | | | |
| Cross-encoder reranking | ✅ local CPU | | | |
| LLM quality gate | ✅ A-MAC | | | |
| Contradiction detection | ✅ | | | |
| Self-hosted / no vendor lock | ✅ | | ❌ (SaaS) | |

Benchmarks

Evaluated on LoCoMo (ACL 2024), conv-0 (199 QA pairs). Two benchmark modes: production (Haiku answers, neutral judge — measures retrieval quality) and compare (gpt-4o-mini answers, generous judge — field-comparable). See benchmarks/README.md for methodology details.

LoCoMo conv-0 (current best: two-lane retrieval)

| Mode | Non-adversarial | Overall |
|---|---|---|
| Production (retrieval signal) | 69.7% | 53.3% |
| Compare (field-comparable) | 72.4% | |

| Category | Production | Questions |
|---|---|---|
| Open-domain | 82.9% | 70 |
| Temporal | 73.0% | 37 |
| Multi-hop | 53.8% | 13 |
| Single-hop | 43.8% | 32 |
| Adversarial | 6.4% | 47 |

Retrieval Quality (the actual signal)

| Metric | Value |
|---|---|
| Gold-in-ANY-chunk | 88.4% |
| Gold-in-Top-5 | 63.8% |
| Gold-in-Top-10 | 71.4% |
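Gold-in-Top-k is simple set-membership recall: did any gold evidence chunk land in the first k retrieved results? A minimal sketch of how such numbers are computed (not the repository's benchmark code):

```python
def gold_in_top_k(result_ids, gold_ids, k):
    """True if any gold evidence chunk appears in the first k results."""
    return any(r in gold_ids for r in result_ids[:k])

def recall_at_k(all_results, all_gold, k):
    """Fraction of questions whose gold chunk is retrieved in the top k."""
    hits = sum(gold_in_top_k(res, gold, k)
               for res, gold in zip(all_results, all_gold))
    return hits / len(all_results)
```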

On Benchmark Methodology

Published LoCoMo scores across memory systems are not directly comparable. Each system measures something different, uses different models, and reports under different conditions.

What varies across systems:

| Variable | Effect on Score | Example |
|---|---|---|
| Answer generation model | GPT-4o vs Haiku: ~20pp difference | A strong model rescues poor retrieval |
| Judge prompt leniency | "Be generous" vs neutral: ~5-10pp | Generous judges forgive vague answers |
| Context window size | 60 chunks vs 10: ~15pp | More context means ranking doesn't matter |
| Metric type | Retrieval recall vs answer accuracy | Fundamentally different measurements |

What each system actually measures:

| System | Metric | What It Tests |
|---|---|---|
| MemPalace | Retrieval recall | Whether the right evidence was found (no answer generated, no LLM) |
| LoCoMo original | Token F1 | Answer quality against gold standard (algorithmic, no LLM judge) |
| AMB/Hindsight | LLM judge accuracy | End-to-end: retrieval + answer + LLM evaluation |
| RASPUTIN | LLM judge accuracy | End-to-end with fixed, disclosed methodology |
| Memvid | LLM judge (claimed) | Methodology not published |

MemPalace's 96.6% LongMemEval score, for instance, is a retrieval recall metric — it measures whether the system found the right passage, not whether it generated a correct answer. This is a valid and useful metric, but it is not comparable to answer-accuracy scores reported by other systems.

Similarly, systems that use GPT-4o or Claude Opus for answer generation are primarily measuring LLM capability, not retrieval quality. A strong model can extract the correct answer from a large, poorly-ranked context window — which is exactly what our ablation program proved: at 60-chunk context, the entire ranking pipeline (BM25, keyword boosts, entity boosts, Cohere reranking, cross-encoder reranking) contributes 0pp because the answer model compensates.

RASPUTIN's methodology is fully disclosed:

  • Production mode: Claude Haiku answers + neutral judge (isolates retrieval quality)
  • Compare mode: gpt-4o-mini answers + generous judge (field-comparable baseline)
  • Judge model pinned to gpt-4o-mini-2024-07-18 (prevents version drift)
  • All benchmark code, judge prompts, and experiment results are in this repository

We report production-mode numbers as primary because they reflect actual retrieval quality. Compare-mode numbers are provided for rough context against other systems, with the caveat that methodology differences make direct comparison approximate at best.

For a standardized comparison, we recommend the Agent Memory Benchmark (AMB), which evaluates all systems under identical conditions with a published judge prompt.

| System | Reported Score | Benchmark | Methodology |
|---|---|---|---|
| Backboard | 90.00% | LoCoMo | GPT-4.1, generous judge |
| Memvid | 85.70% | LoCoMo | Claimed LLM-as-judge, methodology not published |
| MemMachine | 84.87% | LoCoMo | Not published |
| Memobase | 75.78% | LoCoMo | Not published |
| Zep | 75.14% | LoCoMo | Not published |
| RASPUTIN (compare) | 72.4% | LoCoMo conv-0 | gpt-4o-mini answers, generous judge |
| RASPUTIN (production) | 69.7% | LoCoMo conv-0 | Haiku answers, neutral judge |
| mem0 | 66.88% | LoCoMo | Not published |

Pipeline

nomic-embed-text (768d) → Two-lane search (windows + facts) → Cross-encoder rerank → Haiku/gpt-4o-mini → gpt-4o-mini judge

See benchmarks/README.md for how to run benchmarks and reproduce numbers. See experiments/ for the full ablation program and scientific record.


Quick Start

1) Infrastructure (Docker Compose)

docker compose up -d

This starts Qdrant and FalkorDB as defined in the repository's compose file.

2) Python setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-core.txt

3) Start API server

python3 tools/hybrid_brain.py

Server runs on http://127.0.0.1:7777 by default.

4) Smoke check

curl http://localhost:7777/health
curl "http://localhost:7777/search?q=test&limit=3"
curl -X POST http://localhost:7777/commit \
  -H 'Content-Type: application/json' \
  -d '{"text":"Rasputin memory test event happened on 2026-03-01.","source":"conversation"}'

Configuration Reference (config/rasputin.toml)

The runtime loader reads this TOML and allows env overrides (see tools/config.py).

[server]

  • host (string): bind host
  • port (int): API port

[qdrant]

  • url (string): Qdrant base URL
  • collection (string): active memory collection

[graph]

  • host (string): FalkorDB host
  • port (int): FalkorDB port
  • graph_name (string): graph key
  • disabled (bool): disable graph search path

[embeddings]

  • url (string): embedding endpoint
  • model (string): embedding model name
  • prefix_query (string): query embedding prefix
  • prefix_doc (string): document embedding prefix

[reranker]

  • url (string): reranker endpoint
  • timeout (int): timeout seconds
  • enabled (bool): enable rerank stage

[amac]

  • threshold (float): reject below this composite score
  • timeout (int): scoring timeout seconds
  • model (string): model for admission scoring

[scoring]

  • decay_half_life_low (int)
  • decay_half_life_medium (int)
  • decay_half_life_high (int)

[constraints]

  • enabled (bool): enable implicit constraint extraction at commit time
  • model (string): LLM model for constraint extraction
  • timeout (int): extraction timeout seconds

[entities]

  • known_entities_path (string): entity dictionary JSON path
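A minimal config/rasputin.toml illustrating the sections above. Every value here is an assumption: the host and port match the Quick Start default, the embedding model is the one named in this README, and everything else is a placeholder to show the shape of the file, not the project's actual defaults.

```toml
[server]
host = "127.0.0.1"              # Quick Start default
port = 7777

[qdrant]
url = "http://localhost:6333"   # placeholder: Qdrant's default port
collection = "rasputin_memory"  # placeholder collection name

[graph]
host = "localhost"
port = 6379                     # placeholder: FalkorDB's default port
graph_name = "rasputin"
disabled = false

[embeddings]
url = "http://localhost:11434"  # placeholder: a local embedding endpoint
model = "nomic-embed-text"      # named in this README (768d)
prefix_query = "search_query: "      # placeholder prefixes
prefix_doc = "search_document: "

[reranker]
url = "http://localhost:8081"   # placeholder
timeout = 10
enabled = true

[amac]
threshold = 0.5                 # placeholder: reject below this composite score
timeout = 15
model = "claude-haiku"          # a Haiku-class model is named in this README

[scoring]
decay_half_life_low = 30        # placeholder half-lives
decay_half_life_medium = 90
decay_half_life_high = 365

[entities]
known_entities_path = "config/known_entities.json"  # placeholder path
```

Remember that tools/config.py allows env overrides, so any of these can be set without editing the file.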

API Reference

All responses are JSON.

GET /health

Returns service health and component status.

curl http://localhost:7777/health

GET /search?q=<query>&limit=<n>&source=<source>&expand=<bool>

Hybrid retrieval endpoint.

curl "http://localhost:7777/search?q=payment+issue&limit=5"

POST /search

Body-based search variant.

curl -X POST http://localhost:7777/search \
  -H 'Content-Type: application/json' \
  -d '{"query":"project timeline","limit":5,"expand":true}'

POST /commit

Commits memory after quality and duplicate checks.

curl -X POST http://localhost:7777/commit \
  -H 'Content-Type: application/json' \
  -d '{"text":"Vendor contract moved to April 12 with revised pricing.","source":"conversation","importance":75}'

GET /graph?q=<query>&limit=<n>&hops=<n>

Direct graph lookup.

GET /stats

Qdrant and graph count summary.

GET /amac/metrics

A-MAC admission counters and rejection stats.

GET /contradictions?limit=<n>

Lists stored contradiction records.

POST /proactive

Returns proactive memory suggestions from recent context.

curl -X POST http://localhost:7777/proactive \
  -H 'Content-Type: application/json' \
  -d '{"messages":["We are discussing launch timelines"],"max_results":3}'

POST /commit_conversation

Commits multi-turn conversations with automatic window chunking.

curl -X POST http://localhost:7777/commit_conversation \
  -H 'Content-Type: application/json' \
  -d '{"turns":[{"speaker":"Alice","text":"I got a promotion today!"},{"speaker":"Bob","text":"Congratulations!"}],"source":"conversation","window_size":5,"stride":2}'

POST /feedback

Updates retrieval usefulness signal.

curl -X POST http://localhost:7777/feedback \
  -H 'Content-Type: application/json' \
  -d '{"point_id":123,"helpful":true}'
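The curl examples above map directly onto any HTTP client. A small Python sketch that builds the same requests (the helper names are hypothetical, and the server must be running for actual calls):

```python
import json
from urllib.parse import urlencode

BASE = "http://127.0.0.1:7777"  # default bind from the Quick Start

def search_url(query, limit=5, expand=False):
    """Build the GET /search URL for two-lane hybrid retrieval."""
    params = {"q": query, "limit": limit}
    if expand:
        params["expand"] = "true"
    return f"{BASE}/search?{urlencode(params)}"

def commit_body(text, source="conversation", importance=None):
    """Build the JSON body for POST /commit (importance is optional)."""
    body = {"text": text, "source": source}
    if importance is not None:
        body["importance"] = importance
    return json.dumps(body)

# Send with any HTTP client, e.g.
#   urllib.request.urlopen(search_url("payment issue"))
```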

Development Guide

Local workflow

# lint
ruff check .

# type check
mypy tools/hybrid_brain.py tools/bm25_search.py --ignore-missing-imports

# unit tests (default suite)
pytest tests/ -k "not integration" -v

# integration tests (Qdrant required)
pytest tests/test_integration.py -v

Adding features safely

  1. Add/update tests in tests/
  2. Keep API behavior backward-compatible where possible
  3. Prefer config via config/rasputin.toml + env overrides
  4. Validate with lint + mypy + tests before commit

Testing Instructions

Unit tests

pytest tests/ -k "not integration" -v

Integration tests

pytest tests/test_integration.py -v

Coverage

pytest tests/ --cov=tools --cov-report=term-missing

Coverage threshold is configured in pyproject.toml (fail_under = 40).


Version Notes

v0.7.0

  • Two-lane retrieval: windows (45 slots) + LLM-extracted facts (15 slots)
  • Cross-encoder reranker (ms-marco-MiniLM-L-6-v2, CPU)
  • Structured fact extraction via Claude Haiku at ingest
  • Windows-only chunking (individual turns proven to add 0pp)
  • Ablation-tested: BM25, keyword/entity/temporal boosts, MMR, Cohere reranker all proven 0pp
  • Benchmark infrastructure: production/compare modes, batch API (50% savings), failure analysis
  • LoCoMo conv-0: 69.7% production, 72.4% compare (non-adversarial)
  • Timing-safe auth, UTC datetimes, schema v0.7

v0.6.0 — LoCoMo 89.81% (#2)

  • LLM reranker (Claude Haiku), professional benchmark harness

v0.5.0 — Search Quality Breakthrough

  • Keyword overlap boosting, entity focus scoring
  • recall@5: 0.67 → 0.82 (+22%), recall@10: 0.745 → 0.885 (+19%)

See CHANGELOG.md for full details.


License

MIT — see LICENSE.

Release History

v0.9.1 (4/16/2026, urgency: High): Semantic kNN Graph Expansion (experimental). Gated behind `KNN_LINKS=1` (off by default). At ingest, each fact is linked to its top-30 most similar existing facts (cosine >= 0.6) via the Qdrant payload `similar_ids`; at search, fact-lane seeds are expanded through those links before CE reranking (capped at 10 expansions). Architectural parity with Hindsight's `link_expansion_retrieval.py`. Full 10-conv benchmark: 72.1% non-adv (−2.1pp from baseline); useful for graph-traversal workloads.

v0.9.0 (4/13/2026, urgency: High): Qwen3-Reranker + BM25 FTS5 + prompt routing. Production: 74.2% non-adv (+6.7pp from baseline); compare: 77.7% non-adv (+10.2pp from baseline). Full 10-conversation LoCoMo evaluation (1,986 questions); 30+ documented experiments. Overall non-adv: 74.2% production / 77.7% compare on 1,540 questions (+5.1pp / +8.6pp from v0.8); open-domain: 84.8% production / 83.2% compare on 841 questions.

v0.8.0 (4/10/2026, urgency: High): Full 10-conversation LoCoMo validation: 69.1% non-adversarial (1,986 questions, production mode). 21 documented experiments with scientific methodology. Open-domain 81.1% (841 questions); temporal 66.4% (321 questions; 61% of failures are generation, not retrieval); multi-hop 55.2% (96 questions; +16.7pp from prompt routing).

v0.7.0 (4/3/2026, urgency: High): #1 on LoCoMo at 91.36%. Three benchmarks, one pipeline; all results reproducible from the scripts in `benchmarks/`. LoCoMo 91.36% (#1; 1,986 questions; ACL 2024); LongMemEval 89.40% (500 questions; ICLR 2025); FRAMES 50.4% (824 questions; Google 2024).


Similar Packages

  • honcho: Memory library for building stateful agents (main@2026-04-21)
  • reasonkit-mem: 🚀 Build memory and retrieval infrastructure for ReasonKit, enhancing data management and access for your applications with ease and efficiency. (main@2026-04-21)
  • bigrag: Self-hostable RAG platform - document ingestion, embedding, and vector search behind a simple REST API (main@2026-04-20)
  • TV-Show-Recommender-AI: 🤖 Recommend TV shows by matching favorites, averaging embeddings, and finding similar titles using fuzzy search and vector similarity. (main@2026-04-21)
  • agentic-rag: 📄 Enable smart document and data search with AI-powered chat, vector search, and SQL querying across multiple file formats. (main@2026-04-21)