
ClawMem

On-device context engine and memory for AI agents (Claude Code, Hermes, and OpenClaw). Hooks + MCP server + hybrid RAG search.


ClawMem — Context engine for Claude Code, OpenClaw, and Hermes agents


On-device memory for Claude Code, OpenClaw, Hermes, and AI agents. Retrieval-augmented search, hooks, and an MCP server in a single local system. No API keys, no cloud dependencies.

ClawMem fuses recent research into a retrieval-augmented memory layer that agents actually use. The hybrid architecture combines QMD-derived multi-signal retrieval (BM25 + vector search + reciprocal rank fusion + query expansion + cross-encoder reranking), SAME-inspired composite scoring (recency decay, confidence, content-type half-lives, co-activation reinforcement), MAGMA-style intent classification with multi-graph traversal (semantic, temporal, and causal beam search), and A-MEM self-evolving memory notes that enrich documents with keywords, tags, and causal links between entries. Pattern extraction from Engram adds deduplication windows, frequency-based durability scoring, and temporal navigation.
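
The reciprocal-rank-fusion step in that retrieval stack can be sketched in a few lines of TypeScript. This is an illustrative sketch of the standard RRF technique, not ClawMem's internals:

```typescript
// Reciprocal rank fusion (RRF): merge ranked result lists from different
// retrieval channels (e.g. BM25 and vector search) into one ranking.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      // Each channel contributes 1 / (k + rank + 1); a doc ranked well
      // in several channels accumulates a higher fused score.
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

const fused = rrfFuse([
  ["a", "b", "c"], // e.g. BM25 order
  ["b", "d", "a"], // e.g. vector order
]);
// fused[0] is "b": it ranks near the top of both lists
```

The constant k = 60 is the conventional RRF damping term: it keeps a single first-place finish from dominating docs that place decently in every channel.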

Integrates via Claude Code hooks, an MCP server (works with any MCP-compatible client), a native OpenClaw ContextEngine plugin, or a Hermes Agent MemoryProvider plugin. All paths write to the same local SQLite vault. A decision captured during a Claude Code session shows up immediately when an OpenClaw or Hermes agent picks up the same project.

TypeScript on Bun. MIT License.

What It Does

ClawMem turns your markdown notes, project docs, and research dumps into persistent memory for AI coding agents. It automatically:

  • Surfaces relevant context on every prompt (context-surfacing hook)
  • Bootstraps sessions with your profile, latest handoff, recent decisions, and stale notes
  • Captures decisions, preferences, milestones, and problems from session transcripts using a local GGUF observer model
  • Imports conversation exports from Claude Code, ChatGPT, Claude.ai, Slack, and plain text via clawmem mine, with optional post-import LLM fact extraction (--synthesize) that pulls structured decisions / preferences / milestones / problems and cross-fact links out of otherwise full-text conversation dumps (v0.7.2)
  • Generates handoffs at session end so the next session can pick up where you left off
  • Learns what matters via a feedback loop that boosts referenced notes and decays unused ones
  • Guards against prompt injection in surfaced content
  • Classifies query intent (WHY / WHEN / ENTITY / WHAT) to weight search strategies
  • Traverses multi-graphs (semantic, temporal, causal) via adaptive beam search
  • Evolves memory metadata as new documents create or refine connections
  • Infers causal relationships between facts extracted from session observations
  • Detects contradictions between new and prior decisions, auto-decaying superseded ones (with an additional merge-time contradiction gate in the consolidation worker that blocks cross-observation contradictions before they land, v0.7.1)
  • Guards against cross-entity merges during consolidation — name-aware dual-threshold merge safety compares entity anchors before merging similar observations, preventing "Alice decided X" from merging into "Bob decided X" (v0.7.1)
  • Prevents context bleed in derived insights — the Phase 3 deductive synthesis pipeline validates every draft against an anti-contamination wrapper (deterministic entity contamination check + LLM validator + dedupe) before writing cross-session deductive observations (v0.7.1)
  • Frames surfaced facts as background knowledge — context-surfacing wraps injected content in <instruction> + <facts> + <relationships> blocks, telling the model to treat facts as already-known and exposing memory-graph edges between surfaced docs directly in-prompt (v0.7.1)
  • Scores document quality using structure, keywords, and metadata richness signals
  • Boosts co-accessed documents — notes frequently surfaced together get retrieval reinforcement
  • Decomposes complex queries into typed retrieval clauses (BM25/vector/graph) for multi-topic questions
  • Cleans stale embeddings automatically before embed runs, removing orphans from deleted/changed documents
  • Transaction-safe indexing — crash mid-index leaves zero partial state (atomic commit with rollback)
  • Deduplicates hook-generated observations within a 30-minute window using normalized content hashing, preventing memory bloat from repeated hook output
  • Navigates temporal neighborhoods around any document via the timeline tool — progressive disclosure from search to chronological context to full content
  • Boosts frequently-revised memories — documents with higher revision counts get a durability signal in composite scoring (capped at 10%)
  • Supports pin/snooze lifecycle for persistent boosts and temporary suppression
  • Manages document lifecycle — policy-driven archival sweeps with restore capability
  • Auto-routes queries via memory_retrieve — classifies intent and dispatches to the optimal search backend
  • Syncs project issues from Beads issue trackers into searchable memory
  • Runs a quiet-window heavy maintenance lane — a second consolidation worker, off by default behind CLAWMEM_HEAVY_LANE=true, that runs on a longer interval only inside a configurable hour window. Gated by context_usage query-rate so it never competes for CPU/GPU with interactive sessions, scoped exclusively via DB-backed worker_leases, stale-first by default with an optional surprisal selector, and journals every attempt in maintenance_runs for operator visibility (v0.8.0)
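
The 30-minute dedup window for hook observations can be sketched as follows. Helper names and the exact normalization are illustrative assumptions, not ClawMem's internals:

```typescript
import { createHash } from "node:crypto";

// Dedup-window sketch: observations whose normalized content hashes match
// within 30 minutes collapse to a single entry.
const WINDOW_MS = 30 * 60 * 1000;
const lastSeen = new Map<string, number>(); // content hash -> epoch ms

function contentKey(text: string): string {
  // Normalize so trivial whitespace/case differences hash identically.
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

function shouldStore(text: string, now: number): boolean {
  const key = contentKey(text);
  const last = lastSeen.get(key);
  lastSeen.set(key, now);
  return last === undefined || now - last > WINDOW_MS;
}
```
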

Runs fully local with no API keys and no cloud services. Integrates via Claude Code hooks and MCP tools, as an OpenClaw ContextEngine plugin, or as a Hermes Agent MemoryProvider plugin. All modes share the same vault for cross-runtime memory. Works with any MCP-compatible client.

v0.2.0 Enhancements

  • Entity resolution + co-occurrence graph — LLM entity extraction with quality filters, type-agnostic canonical resolution within compatibility buckets (extensible type vocabulary), IDF-based entity edge scoring, co-occurrence tracking, entity graph traversal for ENTITY intent queries
  • MPFP graph retrieval — Multi-Path Fact Propagation with meta-path patterns per intent, hop-synchronized edge cache, Forward Push with α=0.15 teleport probability. Replaces single-beam traversal for causal/entity/temporal queries.
  • Temporal query extraction — regex-based date range extraction from natural language queries ("last week", "March 2026"), wired as WHERE filters into BM25 and vector search
  • 4-way parallel retrieval — temporal proximity and entity graph channels added as parallel RRF legs in query tool (Tier 3 only), alongside existing BM25 + vector channels
  • 3-tier consolidation — facts to observations (auto-generated, with proof_count and trend enum) to mental models. Background worker synthesizes clusters of related observations into consolidated patterns.
  • Observation invalidation — soft invalidation (invalidated_at/invalidated_by/superseded_by columns). Observations with confidence ≤ 0.2 after contradiction are filtered from search results.
  • Memory nudge — periodic ephemeral <vault-nudge> injection prompting lifecycle tool use after N turns of inactivity. Configurable via CLAWMEM_NUDGE_INTERVAL.
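
The temporal query extraction idea can be illustrated for one pattern. This simplified sketch covers only "Month YYYY" mentions; the real extractor also handles relative phrases like "last week":

```typescript
// Turn a "March 2026"-style mention into a [start, end) UTC range
// suitable for a WHERE filter on BM25/vector search.
const MONTHS = [
  "january", "february", "march", "april", "may", "june", "july",
  "august", "september", "october", "november", "december",
];

function extractMonthRange(query: string): { start: Date; end: Date } | null {
  const m = query.toLowerCase().match(
    new RegExp(`\\b(${MONTHS.join("|")})\\s+(\\d{4})\\b`),
  );
  if (!m) return null;
  const month = MONTHS.indexOf(m[1]);
  const year = Number(m[2]);
  return {
    start: new Date(Date.UTC(year, month, 1)),   // first day of the month
    end: new Date(Date.UTC(year, month + 1, 1)), // exclusive upper bound
  };
}
```

The half-open [start, end) range composes cleanly into SQL as `modified_at >= ? AND modified_at < ?`.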

v0.7.1 Safety Release

Five independent safety gates around the consolidation pipeline and context surfacing, aimed at preventing contamination, cross-entity merges, and unchecked contradictions from landing in the vault. Every extraction ships with full unit + integration test coverage (+158 tests on top of the v0.7.0 baseline). See consolidation safety for the architectural walkthrough.

  • Taxonomy cleanup — standardized on the A-MEM contradicts (plural) convention across the entire codebase, eliminating silent query misses on the legacy singular form
  • Name-aware merge safety — the Phase 2 consolidation worker gate extracts entity anchors (via entity_mentions, with lexical proper-noun fallback) and runs dual-threshold normalized 3-gram cosine similarity before merging similar observations. Cross-entity merges are hard-rejected when anchor sets differ materially, preventing context bleed where "Alice decided X" merges into "Bob decided X". Thresholds are env-overridable (CLAWMEM_MERGE_SCORE_NORMAL=0.93, _STRICT=0.98). Dry-run mode via CLAWMEM_MERGE_GUARD_DRY_RUN for calibration.
  • Contradiction-aware merge gate — after the name-aware gate passes, a deterministic heuristic (negation asymmetry, number/date mismatch) plus an LLM check detect contradictory merges. Blocked merges route to link policy (insert new row + contradicts edge, default) or supersede policy (mark old row status='inactive'). Configurable via CLAWMEM_CONTRADICTION_POLICY and CLAWMEM_CONTRADICTION_MIN_CONFIDENCE. Phase 3 deductive synthesis applies the same gate to deductive dedupe matches.
  • Anti-contamination deductive synthesis — every Phase 3 draft runs through a three-layer validator: deterministic pre-checks (empty conclusion, invalid source_indices, pool-only entity contamination via entity_mentions) + LLM validator (fail-open with validatorFallbackAccepts counter) + dedupe. Per-reason rejection stats exposed via DeductiveSynthesisStats so Phase 3 yield can be diagnosed without enabling extra logging.
  • Context instruction + relationship snippets — context-surfacing now always prepends an <instruction> block framing the surfaced facts as background knowledge the model already holds, and appends an optional <relationships> block listing memory-graph edges where BOTH endpoints are in the surfaced doc set. The relationships block is the first thing dropped when the payload would overflow CLAWMEM_PROFILE's token budget, preserving facts-first behaviour while giving the model graph-level reasoning hooks directly in-prompt.
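
The name-aware dual-threshold check can be sketched as normalized character-3-gram cosine similarity with a stricter bar when entity anchors differ. The thresholds mirror the documented env defaults; everything else here is illustrative:

```typescript
// Character-3-gram profile of a normalized string.
function trigrams(s: string): Map<string, number> {
  const t = ` ${s.toLowerCase().replace(/\s+/g, " ").trim()} `;
  const counts = new Map<string, number>();
  for (let i = 0; i + 3 <= t.length; i++) {
    const g = t.slice(i, i + 3);
    counts.set(g, (counts.get(g) ?? 0) + 1);
  }
  return counts;
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [g, c] of a) { na += c * c; dot += c * (b.get(g) ?? 0); }
  for (const c of b.values()) nb += c * c;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Dual thresholds: the strict bar applies when entity anchors differ, so
// "Alice decided X" cannot merge into "Bob decided X".
function mayMerge(
  a: string, b: string, sameAnchors: boolean,
  normal = 0.93, strict = 0.98, // cf. CLAWMEM_MERGE_SCORE_NORMAL / _STRICT
): boolean {
  return cosine(trigrams(a), trigrams(b)) >= (sameAnchors ? normal : strict);
}
```
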

v0.7.2 Post-Import Conversation Synthesis

Opt-in LLM pass that runs after clawmem mine finishes indexing an imported collection. Operates on the freshly imported content_type='conversation' documents and extracts structured knowledge facts (decisions / preferences / milestones / problems) plus cross-fact relations, writing each fact as a first-class searchable document alongside the raw conversation exchanges. See post-import synthesis for the architectural walkthrough.

  • New CLI flag — clawmem mine <dir> --synthesize [--synthesis-max-docs N]. Off by default. When omitted, existing mine behaviour is byte-identical to v0.7.1.
  • Two-pass pipeline — Pass 1 extracts facts per conversation via the existing LLM, saves each via dedup-aware saveMemory, and populates a local alias map. Pass 2 resolves cross-fact links against the local map first, falling back to collection-scoped SQL lookup. Forward references (link to a fact extracted later in the same run) are resolved correctly.
  • Idempotent reruns — synthesized fact paths are a pure function of (sourceDocId, slug(title), short sha256(normalizedTitle)), so reruns over the same conversation batch hit the saveMemory update branch instead of creating parallel rows. Same-slug collisions are disambiguated by the stable hash suffix, not encounter order.
  • Fail-closed link resolution — when two different facts claim the same normalized title or alias, the resolver treats the link as ambiguous and counts it unresolved. Pre-existing docs with duplicate titles in the collection do not silently bind either.
  • Weight-monotonic relation upsert — memory_relations insert uses ON CONFLICT DO UPDATE SET weight = MAX(weight, excluded.weight), which is idempotent on equal-weight reruns but still accepts stronger later evidence without double-counting.
  • Non-fatal failure model — any LLM failure, JSON parse error, saveMemory collision, or relation insert error is counted and logged, never re-thrown. Synthesis failure after indexCollection commits does not roll back the mine import.
  • Split operator counters — llmFailures counts actual LLM path failures (null, thrown, non-array JSON), while docsWithNoFacts counts docs where the LLM responded validly but returned zero structured facts. Previously these were conflated as nullCalls.
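
The idempotent path scheme can be sketched as a pure function. Only the dependence on (sourceDocId, slug(title), short sha256(normalizedTitle)) comes from the description above; the exact path layout below is illustrative:

```typescript
import { createHash } from "node:crypto";

// Synthesized fact paths depend only on stable inputs, so a rerun over the
// same conversation batch resolves to the same path and hits the update
// branch instead of creating a parallel row.
function slug(title: string): string {
  return title.toLowerCase().trim()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function factPath(sourceDocId: string, title: string): string {
  const normalized = title.toLowerCase().replace(/\s+/g, " ").trim();
  // Short hash suffix disambiguates same-slug collisions deterministically.
  const hash = createHash("sha256").update(normalized).digest("hex").slice(0, 8);
  return `${sourceDocId}/${slug(title)}-${hash}.md`;
}
```
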

Adds +63 tests (46 unit + 5 integration + 12 regression) on top of the v0.7.1 baseline.

v0.8.0 Quiet-Window Heavy Maintenance Lane

A second, longer-interval consolidation worker that keeps Phase 2 + Phase 3 running on large vaults without starving interactive sessions. Off by default — set CLAWMEM_HEAVY_LANE=true to enable. The existing 5-minute light-lane worker is unchanged. See heavy maintenance lane for the architectural walkthrough.

  • Quiet-window gating — the heavy lane only runs inside the hours set by CLAWMEM_HEAVY_LANE_WINDOW_START / CLAWMEM_HEAVY_LANE_WINDOW_END (0-23). Supports midnight wraparound (e.g., 22→6). Null on either bound means "always in window".
  • Query-rate gating via context_usage — counts hook injections in the last 10 minutes and skips the tick when the rate exceeds CLAWMEM_HEAVY_LANE_MAX_USAGES (default 30). No new query_activity table; reuses v0.7.0 telemetry.
  • DB-backed worker leases — exclusivity enforced via a new worker_leases table with atomic INSERT ... ON CONFLICT DO UPDATE ... WHERE expires_at <= ? acquisition, random 16-byte fencing tokens, and TTL reclaim. Safe under multi-process contention; any SQLite error translates to a lease_unavailable skip rather than a thrown exception.
  • Stale-first selection — Phase 2 and Phase 3 reorder their candidate sets by COALESCE(recall_stats.last_recalled_at, documents.last_accessed_at, documents.modified_at) ASC so long-unseen docs bubble up first. Empty recall_stats falls through to access-time without erroring.
  • Optional surprisal selector — CLAWMEM_HEAVY_LANE_SURPRISAL=true plumbs k-NN anomaly-ranked doc ids (via the existing computeSurprisalScores) into Phase 2 as an explicit candidateIds filter. Degrades to stale-first on vaults without embeddings and logs selector: 'surprisal-fallback-stale' in the journal.
  • maintenance_runs journal — every scheduled attempt writes a row: status (started/completed/failed/skipped), reason for skips, selected/processed/created/null_call counts, and a metrics_json payload with selector type and full DeductiveSynthesisStats breakdown. Operators can reconstruct any lane decision without reading worker logs.
  • Force-enforce merge gate — the heavy lane passes guarded: true to consolidateObservations, which overrides CLAWMEM_MERGE_GUARD_DRY_RUN inside findSimilarConsolidation so experimenting operators cannot weaken heavy-lane enforcement via env flag.
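
The quiet-window check with midnight wraparound reduces to a small pure function. This is a sketch of the described semantics for CLAWMEM_HEAVY_LANE_WINDOW_START / _END, not ClawMem's code:

```typescript
// A null bound means "always in window"; start > end wraps past midnight.
function inQuietWindow(
  hour: number,
  start: number | null,
  end: number | null,
): boolean {
  if (start === null || end === null) return true;
  return start < end
    ? hour >= start && hour < end   // plain window, e.g. 1 -> 5
    : hour >= start || hour < end;  // wraps midnight, e.g. 22 -> 6
}
```
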

Adds +56 tests (13 worker-lease + 35 maintenance unit + 8 maintenance integration) on top of the v0.7.2 baseline.

v0.8.1 Multi-Turn Prior-Query Lookback

context-surfacing now builds its retrieval query from the current prompt plus up to two recent same-session prior prompts, so a short follow-up turn ("do the same for X", "explain the rationale") can still inherit the vocabulary of earlier turns. The raw prompt is persisted in a new nullable context_usage.query_text column so future hook ticks can reconstitute the multi-turn query from the DB. See multi-turn lookback for the full walkthrough.

  • Additive schema migration — new nullable query_text TEXT column on context_usage, guarded by PRAGMA table_info. Pre-v0.8.1 stores get the column added on first open; ad-hoc stores that skip the migration path degrade transparently via a feature-detect WeakMap so insertUsageFn never writes a column that doesn't exist.
  • Discovery path only — the multi-turn query feeds vector search, BM25, and query expansion. Cross-encoder reranking continues to use the RAW current prompt so relevance scoring is not diluted by older turns, and composite scoring / snippet extraction / dedupe / routing-hint detection all remain on the raw prompt as well.
  • Privacy-conscious persistence split — gated skip paths (slash commands, MIN_PROMPT_LENGTH, shouldSkipRetrieval, heartbeat dedupe) do NOT persist their raw text because those turns are not meaningful user questions and carry a higher sensitivity profile. Post-retrieval empty paths (empty result set, threshold blocked, budget blocked) DO persist so a follow-up turn can still inherit the intent even when the current turn surfaced nothing.
  • Current-first truncation — the combined query is clamped to 2000 chars with the current prompt preserved verbatim at the head. Older priors are dropped first when the budget runs out. If the current prompt alone already exceeds the cap, priors are omitted entirely and the current prompt is truncated.
  • SQL-level self-match guard — duplicate submits of the same prompt are filtered out of the lookback SELECT via AND query_text != ? so a retry burst cannot eat into the 2-prior budget and leave the lookback window underfilled.
  • 10-minute max age, session-scoped — priors older than 10 minutes or from a different session_id are invisible to the lookback. All fallback paths (missing column, DB error, no matching rows) return the current prompt unchanged — the hook never throws on lookback failures.
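
Current-first truncation can be sketched as below; the helper shape is hypothetical and assumes priors arrive ordered newest-first:

```typescript
// The current prompt is kept verbatim at the head (truncated only if it
// alone exceeds the cap); priors are appended newest-first until the
// budget runs out, so the oldest prior is dropped first.
function buildLookbackQuery(
  current: string,
  priors: string[], // assumed ordered newest-first
  maxChars = 2000,
): string {
  if (current.length >= maxChars) return current.slice(0, maxChars);
  let combined = current;
  for (const prior of priors) {
    const candidate = `${combined}\n${prior}`;
    if (candidate.length > maxChars) break; // budget exhausted: drop older priors
    combined = candidate;
  }
  return combined;
}
```
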

Adds +27 tests (22 unit + 5 integration) on top of the v0.8.0 baseline.

Architecture

ClawMem Architecture

Install

Platform Support

| Platform | Status | Notes |
| --- | --- | --- |
| Linux | Full support | Primary target. systemd services for watcher + embed timer. |
| macOS | Full support | Homebrew SQLite handled automatically. GPU via Metal (llama.cpp). |
| Windows (WSL2) | Full support | Recommended for Windows users. Install Bun + ClawMem inside WSL2. |
| Windows (native) | Not recommended | Bun and sqlite-vec work, but the bin/clawmem wrapper is bash, hooks expect bash commands, and systemd services have no equivalent. Use WSL2 instead. |

Prerequisites

Required:

  • Bun v1.0+ — runtime for ClawMem. On Linux, install via curl -fsSL https://bun.sh/install | bash (not snap — snap Bun cannot read stdin, which breaks hooks).
  • SQLite with FTS5 — included with Bun. On macOS, run brew install sqlite for extension loading support (ClawMem detects and uses Homebrew SQLite automatically).

Optional (for better performance):

  • llama.cpp (llama-server) — for dedicated GPU inference. Without it, node-llama-cpp runs models in-process (auto-downloads on first use). GPU servers give better throughput and prevent silent CPU fallback.
  • systemd (Linux) or launchd (macOS) — for persistent background services (watcher, embed timer, GPU servers). ClawMem ships systemd unit templates; macOS users can create equivalent launchd plists. See systemd services.

Optional integrations:

  • Claude Code — for hooks + MCP integration
  • OpenClaw — for ContextEngine plugin integration
  • Hermes Agent — for MemoryProvider plugin integration
  • bd CLI v0.58.0+ — for Beads issue tracker sync (only if using Beads)

Install from npm (recommended)

npm install -g clawmem

If you use Bun as your package manager:

bun add -g clawmem

Install from source

git clone https://github.com/yoloshii/clawmem.git ~/clawmem
cd ~/clawmem && bun install
ln -sf ~/clawmem/bin/clawmem ~/.bun/bin/clawmem

Setup roadmap

After installing, here's the full journey from zero to working memory:

| Step | What | How | Details |
| --- | --- | --- | --- |
| 1. Bootstrap | Create a vault, index your first collection, embed, install hooks and MCP | clawmem bootstrap ~/notes --name notes | One command does it all. Or run each step manually (see below). |
| 2. Choose models | Pick embedding + reranker models based on your hardware | 12GB+ VRAM → SOTA stack (zembed-1 + zerank-2). Less → QMD native combo. No GPU → cloud embedding or CPU fallback. | GPU Services |
| 3. Download models | Get the GGUF files for your chosen stack | wget from HuggingFace, or let node-llama-cpp auto-download the QMD native models on first use | Embedding, LLM Server, Reranker Server |
| 4. Start services | Run GPU servers (if using dedicated GPU) and background services | llama-server for each model. systemd units for watcher + embed timer. | systemd services |
| 5. Decide what to index | Add collections for your projects, notes, research, and domain docs | clawmem collection add ~/project --name project | The more relevant markdown you index, the better retrieval works. See building a rich context field. |
| 6. Connect your agent | Hook into Claude Code, OpenClaw, Hermes, or any MCP client | clawmem setup hooks && clawmem setup mcp for Claude Code. clawmem setup openclaw for OpenClaw. Copy src/hermes/ to Hermes plugins for Hermes. | Integration |
| 7. Verify | Confirm everything is working | clawmem doctor (full health check) or clawmem status (quick index stats) | Verify Installation |

Fastest path: Step 1 alone gets you a working system with in-process CPU/GPU inference and default models — no manual model downloads or service configuration needed. Steps 2-4 are optional upgrades for better performance. Steps 5-6 are where you customize what gets indexed and how your agent connects.

Customize what gets indexed: Each collection has a pattern field in ~/.config/clawmem/config.yaml (default: **/*.md). Tailor it per collection — index project docs, research notes, decision records, Obsidian vaults, or anything else your agents should know about. The more relevant content in the vault, the better retrieval works. See the quickstart for config examples.

Quick start commands

# One command: init + index + embed + hooks + MCP
clawmem bootstrap ~/notes --name notes

# Or step by step:
clawmem init
clawmem collection add ~/notes --name notes
clawmem update --embed
clawmem setup hooks
clawmem setup mcp

# Add more collections (the more you index, the richer retrieval gets)
clawmem collection add ~/projects/myapp --name myapp
clawmem collection add ~/research --name research
clawmem update --embed

# Verify
clawmem doctor

Upgrading

bun update -g clawmem   # or: npm update -g clawmem

Database schema migrates automatically on next startup (new tables and columns are added via CREATE TABLE IF NOT EXISTS / ALTER TABLE ... ADD COLUMN).

After major version updates (e.g. 0.1.x → 0.2.0) that add new enrichment pipelines, run a full enrichment pass to backfill existing documents:

clawmem reindex --enrich  # Full enrichment: entity extraction + links + evolution for all docs
clawmem embed             # Re-embed if upgrading embedding models (not needed for most updates)

--enrich forces the complete A-MEM pipeline (entity extraction, link generation, memory evolution) on all documents, not just new ones. Without it, reindex only refreshes metadata for existing docs.

Routine patch updates (e.g. 0.2.0 → 0.2.1) do not require reindexing.

For version-specific upgrade notes (opt-in features, optional cleanup steps, verification commands), see docs/guides/upgrading.md.

Integration

Claude Code

ClawMem integrates via hooks (settings.json) and an MCP stdio server. Hooks handle 90% of retrieval automatically - the agent never needs to call tools for routine context.

clawmem setup hooks    # Install lifecycle hooks (SessionStart, UserPromptSubmit, Stop, PreCompact)
clawmem setup mcp      # Register MCP server in ~/.claude.json (31 tools)

Automatic (90%): context-surfacing injects relevant memory on every prompt. postcompact-inject re-injects state after compaction. decision-extractor, handoff-generator, feedback-loop capture session state on stop.

Agent-initiated (10%): MCP tools (query, intent_search, find_causal_links, timeline, etc.) for targeted retrieval when hooks don't surface what's needed.

OpenClaw

ClawMem registers as a native ContextEngine plugin - OpenClaw's pluggable interface for context management. Same 90/10 automatic retrieval, delivered through OpenClaw's lifecycle system instead of Claude Code hooks.

clawmem setup openclaw   # Shows installation steps

What the plugin provides:

  • before_prompt_build hook - prompt-aware retrieval (context-surfacing + session-bootstrap)
  • ContextEngine.afterTurn() - decision extraction, handoff generation, feedback loop
  • ContextEngine.compact() - pre-compaction state preservation, delegates real compaction to legacy engine
  • 5 agent tools - clawmem_search, clawmem_get, clawmem_session_log, clawmem_timeline, clawmem_similar
  • Session lifecycle hooks - session_start, session_end, before_reset safety net

Disable OpenClaw's native memory and memory-lancedb auto-recall/capture to avoid duplicate injection:

openclaw config set agents.defaults.memorySearch.extraPaths "[]"

Alternative: OpenClaw agents can also use ClawMem's MCP server directly (clawmem setup mcp), with or without hooks. This gives full access to all 31 MCP tools but bypasses OpenClaw's ContextEngine lifecycle, so you lose token budget awareness, native compaction orchestration, and the afterTurn() message pipeline. The ContextEngine plugin is recommended for new OpenClaw setups; MCP is available as an additional or standalone integration.

Hermes Agent

ClawMem integrates as a native MemoryProvider plugin — Hermes's pluggable interface for agent memory. Same automatic retrieval and extraction, delivered through Hermes's memory lifecycle instead of Claude Code hooks.

Install:

# Copy or symlink the plugin into Hermes's plugin directory
cp -r /path/to/ClawMem/src/hermes /path/to/hermes-agent/plugins/memory/clawmem

# Or symlink for development
ln -s /path/to/ClawMem/src/hermes /path/to/hermes-agent/plugins/memory/clawmem

Configure in your Hermes profile's .env or environment:

CLAWMEM_BIN=/path/to/clawmem          # Path to clawmem binary (or ensure it's on PATH)
CLAWMEM_SERVE_PORT=7438                # REST API port (default: 7438)
CLAWMEM_SERVE_MODE=external            # "external" (you run clawmem serve) or "managed" (plugin manages it)
CLAWMEM_PROFILE=balanced               # speed | balanced | deep

Then set memory.provider: clawmem in your Hermes config.yaml, or run hermes memory setup to configure interactively.

What the plugin provides:

  • prefetch() — prompt-aware retrieval via context-surfacing hook (automatic every turn)
  • on_session_end() — decision extraction, handoff generation, feedback loop (parallel)
  • on_pre_compress() — pre-compaction state preservation
  • session-bootstrap — session registration + first-turn context injection
  • 5 agent tools — clawmem_retrieve, clawmem_get, clawmem_session_log, clawmem_timeline, clawmem_similar
  • Plugin-managed transcript — maintains its own JSONL transcript for ClawMem hooks

Requirements: clawmem binary on PATH and clawmem serve running (external mode) or the plugin starts it automatically (managed mode). Python 3.10+. No pip dependencies beyond Hermes itself (uses urllib for REST calls, httpx optional for better performance).

Alternative: Hermes also has built-in MCP client support. You can add ClawMem as an MCP server in Hermes's config.yaml under mcp_servers for tool-only access. But this misses the lifecycle hooks (prefetch, session_end, pre_compress), so the native plugin is recommended.

See Hermes plugin guide for architecture details, lifecycle mapping, and troubleshooting.

Multi-Framework Operation

All three integrations share the same SQLite vault by default. Claude Code, OpenClaw, and Hermes can run simultaneously — decisions captured in one runtime are immediately available in the others, giving agents persistent shared memory across sessions and platforms. WAL mode + busy_timeout handles concurrent access.

Multi-Vault (Optional)

By default, ClawMem uses a single vault at ~/.cache/clawmem/index.sqlite. For users who want separate memory domains (e.g., work vs personal, or isolated vaults per project), ClawMem supports named vaults.

Configure in ~/.config/clawmem/config.yaml:

vaults:
  work: ~/.cache/clawmem/work.sqlite
  personal: ~/.cache/clawmem/personal.sqlite

Or via environment variable:

export CLAWMEM_VAULTS='{"work":"~/.cache/clawmem/work.sqlite","personal":"~/.cache/clawmem/personal.sqlite"}'

Using vaults with MCP tools:

All retrieval tools (memory_retrieve, query, search, vsearch, intent_search) accept an optional vault parameter. Omit it to use the default vault.

# Search the default vault (no vault param needed)
query("authentication flow")

# Search a named vault
query("project timeline", vault="work")

# List configured vaults
list_vaults()

# Sync content into a vault
vault_sync(vault="work", content_root="~/work/docs")

Single-vault users: No action needed. Everything works without configuration. The vault parameter is always optional and ignored when no vaults are configured.

GPU Services

ClawMem uses three llama-server (llama.cpp) instances for neural inference. All three have in-process fallbacks via node-llama-cpp (auto-downloads on first use), so ClawMem works without a dedicated GPU. node-llama-cpp auto-detects the best available backend — Metal on Apple Silicon, Vulkan where available, CPU as last resort. With GPU acceleration (Metal/Vulkan), in-process inference is fast for these small models (0.3B–1.7B); on CPU-only systems it is significantly slower. For production use, run the servers via systemd services to prevent silent fallback.

GPU with VRAM to spare (12GB+, recommended): ZeroEntropy's distillation-paired stack delivers best retrieval quality — total ~10GB VRAM.

| Service | Port | Model | VRAM | Purpose |
| --- | --- | --- | --- | --- |
| Embedding | 8088 | zembed-1-Q4_K_M | ~4.4GB | SOTA embedding (2560d, 32K context). Distilled from zerank-2 via zELO. |
| LLM | 8089 | qmd-query-expansion-1.7B-q4_k_m | ~2.2GB | Intent classification, query expansion, A-MEM |
| Reranker | 8090 | zerank-2-Q4_K_M | ~3.3GB | SOTA reranker. Outperforms Cohere rerank-3.5. Optimal pairing with zembed-1. |

Important: zembed-1 and zerank-2 use non-causal attention — -ub must equal -b on llama-server (e.g. -b 2048 -ub 2048). See Reranker Server for details.

License: zembed-1 and zerank-2 are released under CC-BY-NC-4.0 — non-commercial only. The QMD native models below have no such restriction.

No dedicated GPU / GPU without VRAM to spare: The QMD native combo — total ~4GB VRAM, also runs via node-llama-cpp (Metal on Apple Silicon, Vulkan where available, CPU as last resort). Fast with GPU acceleration; significantly slower on CPU-only.

| Service | Port | Model | VRAM | Purpose |
| --- | --- | --- | --- | --- |
| Embedding | 8088 | EmbeddingGemma-300M-Q8_0 | ~400MB | Vector search, indexing, context-surfacing (768d, 2K context) |
| LLM | 8089 | qmd-query-expansion-1.7B-q4_k_m | ~2.2GB | Intent classification, query expansion, A-MEM |
| Reranker | 8090 | qwen3-reranker-0.6B-Q8_0 | ~1.3GB | Cross-encoder reranking (query, intent_search) |

The bin/clawmem wrapper defaults to localhost:8088/8089/8090. If a server is unreachable (transport error like ECONNREFUSED/ETIMEDOUT), ClawMem sets a 60-second cooldown and falls back to in-process inference via node-llama-cpp (auto-downloads the QMD native models on first use, uses Metal/Vulkan/CPU depending on hardware). HTTP errors (400/500) and user-cancelled requests do not trigger cooldown — the remote server is retried normally on the next call. With GPU acceleration the fallback is fast; on CPU-only it is significantly slower. ClawMem always works either way, but if you're running dedicated GPU servers, use systemd services to ensure they stay up.
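
The cooldown policy reduces to classifying the failure kind. This sketch uses illustrative type and function names, not ClawMem's actual API:

```typescript
// Only transport-level failures (server unreachable) trigger the 60-second
// cooldown and fallback to in-process inference; HTTP errors and
// user-cancelled requests retry the remote server normally.
type FetchOutcome =
  | { kind: "ok" }
  | { kind: "http"; status: number }     // reachable, but e.g. 400/500
  | { kind: "transport"; code: string }  // e.g. ECONNREFUSED, ETIMEDOUT
  | { kind: "aborted" };                 // user-cancelled request

function triggersCooldown(outcome: FetchOutcome): boolean {
  return outcome.kind === "transport";
}
```
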

To prevent fallback and fail fast instead, set CLAWMEM_NO_LOCAL_MODELS=true.

Remote GPU (optional)

If your GPU lives on a separate machine, point the env vars at it:

export CLAWMEM_EMBED_URL=http://gpu-host:8088
export CLAWMEM_LLM_URL=http://gpu-host:8089
export CLAWMEM_RERANK_URL=http://gpu-host:8090

For remote setups, set CLAWMEM_NO_LOCAL_MODELS=true to prevent node-llama-cpp from auto-downloading multi-GB model files if a server is unreachable.

No Dedicated GPU (in-process inference)

All three QMD native models run locally without a dedicated GPU. node-llama-cpp auto-downloads them on first use (~300MB embedding + ~1.1GB LLM + ~600MB reranker) and auto-detects the best backend — Metal on Apple Silicon (fast, uses integrated GPU), Vulkan where available (fast, uses discrete or integrated GPU), or CPU as last resort (significantly slower). With Metal or Vulkan, in-process inference handles these small models well; CPU-only is functional but noticeably slower.

Alternatively, use a cloud embedding provider if you prefer not to run models locally.

Embedding

ClawMem calls the OpenAI-compatible /v1/embeddings endpoint for all embedding operations. This works with local llama-server instances and cloud providers alike.

Option A: GPU with VRAM to spare (recommended)

Use zembed-1-Q4_K_M — SOTA retrieval quality, distilled from zerank-2 via ZeroEntropy's zELO methodology. CC-BY-NC-4.0 — non-commercial only.

- Size: 2.4GB, Dimensions: 2560, VRAM: ~4.4GB, Context: 32K tokens

```shell
wget https://huggingface.co/Abhiray/zembed-1-Q4_K_M-GGUF/resolve/main/zembed-1-Q4_K_M.gguf

# -ub must match -b for non-causal attention
llama-server -m zembed-1-Q4_K_M.gguf \
  --embeddings --port 8088 --host 0.0.0.0 \
  -ngl 99 -c 8192 -b 2048 -ub 2048
```

Option B: No GPU / GPU without VRAM to spare

Use EmbeddingGemma-300M-Q8_0 — the QMD native embedding model. Only 300MB, runs on CPU or any GPU.

- Size: 314MB, Dimensions: 768, VRAM: ~400MB (or CPU), Context: 2048 tokens

```shell
wget https://huggingface.co/ggml-org/embeddinggemma-300M-GGUF/resolve/main/embeddinggemma-300M-Q8_0.gguf

# On GPU (add -ngl 99):
llama-server -m embeddinggemma-300M-Q8_0.gguf \
  --embeddings --port 8088 --host 0.0.0.0 \
  -ngl 99 -c 2048 --batch-size 2048

# On CPU (omit -ngl):
llama-server -m embeddinggemma-300M-Q8_0.gguf \
  --embeddings --port 8088 --host 0.0.0.0 \
  -c 2048 --batch-size 2048
```

For multilingual corpora, zembed-1 (Option A) supports multilingual input out of the box. For a lightweight alternative, use granite-embedding-278m-multilingual-Q6_K (314MB; set CLAWMEM_EMBED_MAX_CHARS=1100 because of its 512-token context).

Option C: Cloud Embedding API

Alternatively, use a cloud embedding provider instead of running a local server. Any provider with an OpenAI-compatible /v1/embeddings endpoint works.

Configuration: Copy .env.example to .env and set your provider credentials:

```shell
cp .env.example .env
# Edit .env:
CLAWMEM_EMBED_URL=https://api.jina.ai
CLAWMEM_EMBED_API_KEY=jina_your-key-here
CLAWMEM_EMBED_MODEL=jina-embeddings-v5-text-small
```

Or export them in your shell. Precedence: shell environment > .env file > bin/clawmem wrapper defaults.
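The precedence rule above can be expressed as a one-line fallback chain. A minimal sketch, assuming a hypothetical `resolveVar` helper (not ClawMem's actual API):

```typescript
// Precedence: shell environment > .env file > bin/clawmem wrapper defaults.
type Env = Record<string, string | undefined>;

function resolveVar(
  name: string,
  shellEnv: Env,
  dotEnv: Env,
  wrapperDefaults: Env,
): string | undefined {
  // First defined value wins, walking from highest to lowest precedence.
  return shellEnv[name] ?? dotEnv[name] ?? wrapperDefaults[name];
}
```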

| Provider | CLAWMEM_EMBED_URL | CLAWMEM_EMBED_MODEL | Dimensions | Notes |
|---|---|---|---|---|
| Jina AI | https://api.jina.ai | jina-embeddings-v5-text-small | 1024 | 32K context, task-specific LoRA adapters |
| OpenAI | https://api.openai.com | text-embedding-3-small | 1536 | 8K context, Matryoshka dimensions via CLAWMEM_EMBED_DIMENSIONS |
| Voyage AI | https://api.voyageai.com | voyage-4-large | 1024 | 32K context |
| Cohere | https://api.cohere.com | embed-v4.0 | 1024 | 128K context |

Cloud mode auto-detects your provider from the URL and sends the right parameters (Jina task, Voyage/Cohere input_type, OpenAI dimensions). Batch embedding (50 fragments/request), server-side truncation, adaptive TPM-aware pacing, and retry with jitter are all handled automatically. Set CLAWMEM_EMBED_TPM_LIMIT to match your provider tier (default: 100000). See docs/guides/cloud-embedding.md for full details.
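One way to picture TPM-aware pacing: each batch spends a slice of the per-minute token budget, and the sender waits proportionally before the next request. A rough sketch under that assumption; the actual pacing logic may differ:

```typescript
// Given a tokens-per-minute budget, compute how long to wait after sending
// a batch so that sustained throughput stays at or below the limit.
function paceDelayMs(tokensInBatch: number, tpmLimit = 100_000): number {
  // A batch of N tokens "costs" N / tpmLimit of a minute.
  return Math.ceil((tokensInBatch / tpmLimit) * 60_000);
}
```

At the default CLAWMEM_EMBED_TPM_LIMIT of 100000, a 50,000-token batch would pace at roughly half a minute under this model.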

Note: Cloud providers handle their own context window limits — ClawMem skips client-side truncation when an API key is set. Local llama-server truncates at CLAWMEM_EMBED_MAX_CHARS (default: 6000 chars).
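The truncation rule above reduces to a single branch. A minimal sketch; `prepareEmbedInput` is an illustrative name, not ClawMem's actual function:

```typescript
// Cloud providers (API key set) truncate server-side, so text passes through
// untouched; local llama-server input is clipped client-side at
// CLAWMEM_EMBED_MAX_CHARS (default 6000).
function prepareEmbedInput(
  text: string,
  apiKeySet: boolean,
  maxChars = 6000,
): string {
  if (apiKeySet) return text;     // server-side truncation
  return text.slice(0, maxChars); // client-side truncation for local servers
}
```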

Verify and embed

```shell
# Verify endpoint is reachable
curl $CLAWMEM_EMBED_URL/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $CLAWMEM_EMBED_API_KEY" \
  -d "{\"input\":\"test\",\"model\":\"$CLAWMEM_EMBED_MODEL\"}"

# Embed your vault
./bin/clawmem embed
```

LLM Server

Intent classification, query expansion, and A-MEM extraction use qmd-query-expansion-1.7B — a Qwen3-1.7B finetuned by QMD specifically for generating search expansion terms (hyde, lexical, and vector variants). ~1.1GB at q4_k_m quantization, served via llama-server on port 8089.

Without a server: If CLAWMEM_LLM_URL is unset, node-llama-cpp auto-downloads the model on first use.

Performance (RTX 3090):

- Intent classification: 27ms
- Query expansion: 333 tok/s
- VRAM: ~2.2-2.8GB depending on quantization

Qwen3 /no_think flag: Qwen3 uses thinking tokens by default. ClawMem appends /no_think to all prompts automatically to get structured output in the content field.

Intent classification: Uses a dual-path approach:

  1. Heuristic regex classifier (instant) — handles strong signals (why/when/who keywords) with 0.8+ confidence
  2. LLM refinement (27ms on GPU) — only for ambiguous queries below 0.8 confidence

Server setup:

```shell
# Download the finetuned model
wget https://huggingface.co/tobil/qmd-query-expansion-1.7B-gguf/resolve/main/qmd-query-expansion-1.7B-q4_k_m.gguf

# Start llama-server for LLM inference
llama-server -m qmd-query-expansion-1.7B-q4_k_m.gguf \
  --port 8089 --host 0.0.0.0 \
  -ngl 99 -c 4096 --batch-size 512
```

Reranker Server

Cross-encoder reranking for query and intent_search pipelines on port 8090. ClawMem calls the /v1/rerank endpoint (or falls back to scoring via /v1/completions for compatible servers).

Scores each candidate against the original query (cross-encoder architecture). query pipeline: 4000 char context per doc (deep reranking); intent_search: 200 char context per doc (fast reranking).
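The per-pipeline context budgets can be sketched as a clipping step applied before cross-encoding. The constant and function names below are illustrative, not ClawMem's internals:

```typescript
// Context budget per candidate document, by pipeline: the query pipeline
// reranks deeply (4000 chars/doc), intent_search reranks fast (200 chars/doc).
const RERANK_CONTEXT_CHARS = { query: 4000, intent_search: 200 } as const;

function buildRerankDocs(
  docs: string[],
  pipeline: keyof typeof RERANK_CONTEXT_CHARS,
): string[] {
  const limit = RERANK_CONTEXT_CHARS[pipeline];
  // Each candidate is clipped to the pipeline's budget before scoring.
  return docs.map((d) => d.slice(0, limit));
}
```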

GPU with VRAM to spare (recommended): zerank-2-Q4_K_M (2.4GB, ~3.3GB VRAM). Outperforms Cohere rerank-3.5 and Gemini 2.5 Flash. Optimal pairing with zembed-1 (same distillation architecture via zELO). CC-BY-NC-4.0 — non-commercial only.

```shell
wget https://huggingface.co/keisuke-miyako/zerank-2-gguf-q4_k_m/resolve/main/zerank-2-Q4_k_m.gguf

# -ub must match -b for non-causal attention
llama-server -m zerank-2-Q4_k_m.gguf \
  --reranking --port 8090 --host 0.0.0.0 \
  -ngl 99 -c 2048 -b 2048 -ub 2048
```

CPU / GPU without VRAM to spare: qwen3-reranker-0.6B-Q8_0 (~600MB, ~1.3GB VRAM). The QMD native reranker — auto-downloaded by node-llama-cpp if no server is running.

```shell
wget https://huggingface.co/ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/resolve/main/Qwen3-Reranker-0.6B-Q8_0.gguf

llama-server -m Qwen3-Reranker-0.6B-Q8_0.gguf \
  --reranking --port 8090 --host 0.0.0.0 \
  -ngl 99 -c 2048 --batch-size 512
```

Note: zerank-2 and zembed-1 use non-causal attention — -ub (ubatch) must equal -b (batch). Omitting -ub or setting it lower causes assertion crashes. qwen3-reranker-0.6B does not have this requirement. See llama.cpp#12836.

MCP Server

ClawMem exposes 31 MCP tools via the Model Context Protocol and an optional HTTP REST API. Any MCP-compatible client or HTTP client can use it.

Claude Code (automatic):

```shell
./bin/clawmem setup mcp   # Registers in ~/.claude.json
```

Manual (any MCP client):

Add to your MCP config (e.g. ~/.claude.json, claude_desktop_config.json, or your client's equivalent):

```json
{
  "mcpServers": {
    "clawmem": {
      "command": "/absolute/path/to/clawmem/bin/clawmem",
      "args": ["mcp"]
    }
  }
}
```

The server runs via stdio — no network port needed. The bin/clawmem wrapper sets the GPU endpoint env vars automatically.

Verify: After registering, your client should see tools including memory_retrieve, search, vsearch, query, query_plan, intent_search, timeline, etc.

HTTP REST API (optional)

For web dashboards, non-MCP agents, cross-machine access, or programmatic use:

```shell
./bin/clawmem serve                          # localhost:7438, no auth
./bin/clawmem serve --port 8080              # custom port
CLAWMEM_API_TOKEN=secret ./bin/clawmem serve # with bearer token auth
```

Endpoints:

| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness probe + version + doc count |
| GET | /stats | Full index statistics |
| POST | /search | Unified search (mode: auto/keyword/semantic/hybrid) |
| POST | /retrieve | Smart retrieve with auto-routing (mode: auto/keyword/semantic/causal/timeline/hybrid) |
| GET | /documents/:docid | Single document by 6-char hash prefix |
| GET | /documents?pattern=... | Multi-get by glob pattern |
| GET | /timeline/:docid | Temporal neighborhood (before/after) |
| GET | /sessions | Recent session history |
| GET | /collections | List all collections |
| GET | /lifecycle/status | Active/archived/pinned/snoozed counts |
| POST | /documents/:docid/pin | Pin/unpin |
| POST | /documents/:docid/snooze | Snooze until date |
| POST | /documents/:docid/forget | Deactivate |
| POST | /lifecycle/sweep | Archive stale docs (dry_run default) |
| GET | /graph/causal/:docid | Causal chain traversal |
| GET | /graph/similar/:docid | k-NN neighbors |
| GET | /export | Full vault export as JSON |
| POST | /reindex | Trigger re-scan |
| POST | /graphs/build | Rebuild temporal + semantic graphs |

Auth: Set CLAWMEM_API_TOKEN env var to require Authorization: Bearer <token> on all requests. If unset, access is open (localhost-only by default). See .env.example.
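The auth rule above amounts to one comparison per request. A minimal sketch, assuming a hypothetical `isAuthorized` helper rather than ClawMem's actual middleware:

```typescript
// When CLAWMEM_API_TOKEN is set, every request must carry a matching
// "Authorization: Bearer <token>" header; when unset, access is open
// (the server binds to localhost by default).
function isAuthorized(
  configuredToken: string | undefined,
  authHeader: string | undefined,
): boolean {
  if (!configuredToken) return true; // no token configured: open access
  return authHeader === `Bearer ${configuredToken}`;
}
```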

Search example:

```shell
curl -X POST http://localhost:7438/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "authentication decisions", "mode": "hybrid", "compact": true}'
```

Verify Installation

```shell
./bin/clawmem doctor   # Full health check
./bin/clawmem status   # Quick index status
bun test               # Run test suite
```

Agent Instructions

ClawMem ships three instruction files and an optional maintenance agent:

| File | Loaded | Purpose |
|---|---|---|
| CLAUDE.md | Automatically (Claude Code, when working in this repo) | Complete operational reference — hooks, tools, query optimization, scoring, pipeline details, troubleshooting |
| AGENTS.md | Framework-dependent | Identical to CLAUDE.md — cross-framework compatibility (Cursor, Windsurf, Codex, etc.) |
| SKILL.md | On-demand (agent reads | |

Release History

| Version | Changes | Urgency | Date |
|---|---|---|---|
| main@2026-04-19 | Latest activity on main branch | High | 4/19/2026 |
| v0.8.1 | Latest release: v0.8.1 | High | 4/9/2026 |

