LycheeMem: Lightweight Long-Term Memory for LLM Agents

中文 | English

LycheeMem is a compact memory framework for LLM agents. It starts from efficient conversational memory—through structured organization, lightweight consolidation, and adaptive retrieval—and gradually extends toward action-aware, usage-aware memory for more capable agentic systems.

News • Related Projects • Quick Start • Web Demo • OpenClaw Plugin • MCP • Memory Architecture • Pipeline • API Reference

🔥 News

[04/03/2026] The project now supports installation via pip install lycheemem. You can easily start the service from anywhere using lycheemem-cli!
[03/30/2026] We evaluated LycheeMem on PinchBench with the OpenClaw plugin: compared to OpenClaw's native memory, it achieved an ~6% score improvement, while reducing token consumption by ~71% and cost by ~55%!
[03/28/2026] Semantic memory has been upgraded to Compact Semantic Memory (SQLite + LanceDB), no Neo4j required. See /quick-start for details.
[03/27/2026] OpenClaw Plugin is now available at /openclaw-plugin ! Setup guide →
[03/26/2026] MCP support is available at /mcp !
[03/23/2026] LycheeMem is now open source: GitHub Repository →

🔗 Related Projects

LycheeMem is part of the 3rd-generation Lychee (立知) large model series, which focuses on memory intelligence, continual learning, and long-context reasoning.

We welcome you to explore our related works:

LycheeMemory (ACL 2026, CCF-A): a unified framework for implicit long-term memory and explicit working memory collaboration in large language models
LycheeMem (this project): long-term memory infrastructure for LLM-based agents
LycheeDecode (ICLR 2026, CCF-A): selective recall from massive KV-cache context memory
LycheeCluster (ACL 2026, CCF-A): structured organization and hierarchical indexing for context memory

⚡ Quick Start

Prerequisites

Python 3.9+
An LLM API key (OpenAI, Gemini, or any litellm-compatible provider)

Installation

You can install LycheeMem directly via pip:

pip install lycheemem

Once installed, you can start the backend server instantly using the CLI:

lycheemem-cli

For development or if you prefer to run from source:

git clone https://github.com/LycheeMem/LycheeMem.git
cd LycheeMem
pip install -e .

Configuration

Create a .env file in your working directory and fill in your values. The full template in .env.example also includes session/user DB paths, JWT settings, and working-memory thresholds; the snippet below shows the most important ones:

# LLM — litellm format: provider/model
LLM_MODEL=openai/gpt-4o-mini
LLM_API_KEY=sk-...
LLM_API_BASE=                     # optional

# Embedder
EMBEDDING_MODEL=openai/text-embedding-3-small
EMBEDDING_DIM=1536
EMBEDDING_API_KEY=                # optional
EMBEDDING_API_BASE=               # optional

Supported LLM providers (via litellm):
openai/gpt-4o-mini · gemini/gemini-2.0-flash · ollama_chat/qwen2.5 · any OpenAI-compatible endpoint

Start the Server

If you installed via pip, you can start the LycheeMem background service from anywhere using:

lycheemem-cli

(If running from source, you can also use python main.py to start the server.)

The API is served at http://localhost:8000. Interactive docs at /docs.

main.py currently starts Uvicorn without enabling live reload. For development reload, run Uvicorn directly, for example:
uvicorn src.api.server:create_app --factory --reload

🎨 Web Demo

A frontend demo is included under web-demo/. It provides a chat interface alongside live views of the semantic memory tree, skill library, and working memory state.

cd web-demo
npm install
npm run dev      # served at http://localhost:5173

Make sure the backend is running on port 8000 (or update proxy settings in web-demo/vite.config.ts) before starting the frontend.

🦞 OpenClaw Plugin

LycheeMem ships a native OpenClaw plugin that gives any OpenClaw session persistent long-term memory with zero manual wiring.

What the plugin provides:

lychee_memory_smart_search — default long-term memory retrieval entry point
Automatic turn mirroring via hooks — the model does not need to call append_turn manually
- User messages are appended automatically
- Assistant messages are appended automatically
/new, /reset, /stop, and session_end automatically trigger boundary consolidation
Proactive consolidation on strong long-term knowledge signals

Under normal operation:

The model only calls lychee_memory_smart_search when recalling long-term context
The model may call lychee_memory_consolidate manually when an immediate persist is warranted
The model does not need to call lychee_memory_append_turn at all

Quick Install

openclaw plugins install "/path/to/LycheeMem/openclaw-plugin"
openclaw gateway restart

See the full setup guide: openclaw-plugin/INSTALL_OPENCLAW.md

🔧 MCP

LycheeMem also exposes an HTTP MCP endpoint at http://localhost:8000/mcp.

Available tools: lychee_memory_smart_search, lychee_memory_search, lychee_memory_append_turn, lychee_memory_synthesize, lychee_memory_consolidate
lychee_memory_consolidate works for sessions that already contain mirrored turns from /chat, /memory/reason, or lychee_memory_append_turn

MCP Transport

POST /mcp handles JSON-RPC requests
GET /mcp exposes the SSE stream used by some MCP clients
The server returns Mcp-Session-Id during initialize; reuse that header on later requests

Client Configuration

For any MCP client that supports remote HTTP servers, configure the MCP URL as:

http://localhost:8000/mcp

Generic config example:

{
  "mcpServers": {
    "lycheemem": {
      "url": "http://localhost:8000/mcp"
    }
  }
}

Manual JSON-RPC Flow

Call initialize
Reuse the returned Mcp-Session-Id
Send initialized
Call tools/list
Call tools/call

Initialize example:

curl -i -X POST http://localhost:8000/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
      "protocolVersion": "2025-03-26",
      "capabilities": {},
      "clientInfo": {
        "name": "debug-client",
        "version": "0.1.0"
      }
    }
  }'

Tool call example:

curl -X POST http://localhost:8000/mcp \
  -H "Content-Type: application/json" \
  -H "Mcp-Session-Id: <session-id>" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "lychee_memory_smart_search",
      "arguments": {
        "query": "what tools do I use for database backups",
        "top_k": 5,
        "mode": "compact",
        "include_graph": true,
        "include_skills": true
      }
    }
  }'

Recommended MCP Usage Pattern

Use /chat or /memory/reason with a stable session_id to write conversation turns, or mirror external host turns with lychee_memory_append_turn.
Use lychee_memory_smart_search in compact mode for the default one-shot recall path.
Use lychee_memory_search + lychee_memory_synthesize only when you explicitly want search and synthesis as separate stages.
After the conversation ends, call lychee_memory_consolidate with the same session_id.

📚 Memory Architecture

LycheeMem organizes memory into three complementary stores:

Working Memory	Semantic Memory	Procedural Memory
(Episodic) Session turns Summaries Token budget management	(Typed Action Store) 7 MemoryRecord types Conflict-aware Record Fusion Hierarchical memory tree Action-grounded retrieval planning Usage feedback loop + RL-ready statistics	(Skills) Skill entries HyDE retrieval

Working Memory

Semantic Memory

Procedural Memory

(Episodic)

Session turns
Summaries
Token budget management

(Typed Action Store)

7 MemoryRecord types
Conflict-aware Record Fusion
Hierarchical memory tree
Action-grounded retrieval planning
Usage feedback loop + RL-ready statistics

(Skills)

Skill entries
HyDE retrieval

💾 Working Memory

The working memory window holds the active conversation context for a session. It operates under a dual-threshold token budget:

Warn threshold (70%) — triggers asynchronous background pre-compression; the current request is not blocked.
Block threshold (90%) — the pipeline pauses and flushes older turns to a compressed summary before proceeding.

Compression produces summary anchors (past context, distilled) + raw recent turns (last N turns, verbatim). Both are passed downstream as the conversation history.

🗺️ Semantic Memory — Compact Semantic Memory

Semantic memory is organised around typed MemoryRecords plus action-grounded retrieval state. The storage layer is SQLite (FTS5 full-text search) + LanceDB (vector index), while retrieval is conditioned on recent context, tentative action, constraints, and missing slots.

Memory Record Types

Each memory entry is stored as a MemoryRecord. The memory_type field distinguishes seven semantic categories:

Type	Description
`fact`	Objective facts about the user, environment, or world
`preference`	User preferences (style, habits, likes/dislikes)
`event`	Specific events that have occurred
`constraint`	Conditions that must be respected
`procedure`	Reusable step-by-step procedures / methods
`failure_pattern`	Previously failed action paths and their causes
`tool_affordance`	Capabilities and applicable scenarios of tools/APIs

Beyond text, every MemoryRecord carries action-facing metadata (tool_tags, constraint_tags, failure_tags, affordance_tags) and usage statistics (retrieval_count, action_success_count, etc.) to seed future reinforcement-learning signals. Retrieval logs also persist retrieval_plan, action_state, response excerpts, and later user feedback so the system can close a lightweight action-outcome loop without training.

Related MemoryRecords can be fused online by the Record Fusion Engine into denser CompositeRecords. Composite entries persist direct child_composite_ids, so long-term semantic memory is organised as a hierarchical memory tree instead of a flat bag of summaries.

Four-Module Pipeline

Module 1: Compact Semantic Encoding

A single-pass pipeline that converts conversation turns into a list of MemoryRecords:

Typed extraction — LLM extracts self-contained facts and assigns a semantic category to each record.
Decontextualization — Pronouns and context-dependent phrases are expanded into full expressions, so each record is understandable without the original dialogue.
Action metadata annotation — LLM annotates each record with memory_type, tool_tags, constraint_tags, failure_tags, affordance_tags, and other structured labels.

record_id = SHA256(normalized_text) — naturally idempotent; duplicate content is deduplicated automatically.

Module 2: Record Fusion, Conflict Update, and Hierarchical Consolidation

Triggered online after each consolidation:

FTS / vector recall gathers related existing atomic records around the new records (candidate pool).
The existing synthesis judge prompt decides whether each candidate set should produce a new CompositeRecord or perform a conflict_update against an existing atomic record.
On conflict_update, the existing anchor record is updated in place, conflicting incoming records are soft-expired, and composites covering affected source records are invalidated.
On synthesis, the engine writes a new CompositeRecord to SQLite + LanceDB.
Additional hierarchy rounds can synthesize record -> composite and composite -> composite, persisting child_composite_ids so the memory tree can keep growing upward.

Module 3: Action-Grounded Retrieval Planning

Before retrieval, ActionAwareRetrievalPlanner analyses the user query + recent context + ActionState and emits a SearchPlan:

mode: answer (factual Q&A) / action (needs execution) / mixed
semantic_queries: content-facing search terms
pragmatic_queries: action/tool/constraint-facing search terms
tool_hints: tools likely needed for this request
required_constraints: constraints that must be respected
required_affordances: capabilities the retrieved memory should provide
missing_slots: parameters / slots that are absent
tree_retrieval_mode / tree_expansion_depth / include_leaf_records: whether retrieval should stay at high-level composites (root_only) or descend into child composites / direct leaf records (balanced / descend)

ActionState can carry fields such as current_subgoal, tentative_action, known_constraints, available_tools, failure_signal, and a recent-context excerpt. The planner merges this state with the LLM-produced plan so retrieval is conditioned on the current decision state rather than the query alone.

The plan drives multi-channel recall:

FTS channel — SQLite FTS5 keyword recall over MemoryRecord + CompositeRecord
Semantic vector channel — LanceDB ANN over semantic_text embeddings
Normalised vector channel — LanceDB ANN over normalized_text embeddings (for pragmatic queries)
Tag filter channel — exact filter by tool_hints / required_constraints / required_affordances
Temporal channel — filter by SearchPlan.temporal_filter time window
Slot-hint supplementation — when missing_slots is non-empty, extra FTS/tag recall is triggered to find records that can fill missing parameters

After base recall, retrieval can also expand along the memory tree. root_only keeps high-level composite summaries, balanced descends one level when tree hints match, and descend pulls child composites plus direct leaf records when the current action needs finer-grained detail.

Module 4: Multi-Dimensional Scorer

Candidates from all channels are de-duplicated and ranked by MemoryScorer using a weighted linear combination. Final top-k selection is composite-first: covering parent composites are preferred, covered child records are folded away unless they add unique value, and near-duplicate fragments are suppressed.

$$\text{Score} = \alpha \cdot S_\text{sem} + \beta \cdot S_\text{action} + \kappa \cdot S_\text{slot} + \gamma \cdot S_\text{temporal} + \delta \cdot S_\text{recency} + \eta \cdot S_\text{evidence} - \lambda \cdot C_\text{token}$$

Weight	Meaning	Default
α	SemanticRelevance (vector distance -> similarity)	0.25
β	ActionUtility (tag match score, mode-aware)	0.25
κ	SlotUtility (whether the memory helps fill missing action slots)	0.15
γ	TemporalFit (temporal reference match)	0.15
δ	Recency (memory freshness)	0.10
η	EvidenceDensity (evidence span density)	0.10
λ	TokenCost penalty (text length penalty)	0.10

🛠️ Procedural Memory — Skill Store

The skill store preserves reusable how-to knowledge as structured skill entries, each carrying:

Intent — a short description of what the skill does.
doc_markdown — a full Markdown document describing the procedure, commands, parameters, and caveats.
Embedding — a dense vector of the intent text, used for similarity search.
Metadata — usage counters, last-used timestamp, preconditions.

Skill retrieval uses HyDE (Hypothetical Document Embeddings): the query is first expanded into a hypothetical ideal answer by the LLM, then that draft text is embedded to produce a query vector that matches well against stored procedure descriptions, even when the user's original phrasing is vague.

⚙️ Pipeline

Every request passes through a fixed sequence of five agents. Four are synchronous stages in the LangGraph pipeline; one is a background post-processing task.

START

▼

1. WMManager — Token budget check + compress/render

↓

2. SearchCoordinator — Planner → Semantic + Skill retrieval

↓

3. SynthesizerAgent — LLM-as-Judge scoring + context fusion

↓

4. ReasoningAgent — Final response generation

▼

END

Background asyncio.create_task( ConsolidatorAgent )

Stage 1 — WMManager

Rule-based agent (no LLM prompt). Appends the user turn to the session log, counts tokens, and fires compression if either threshold is crossed. Produces compressed_history and raw_recent_turns for downstream stages.

Stage 2 — SearchCoordinator

SearchCoordinator first builds recent_context from compressed summaries + raw recent turns, then derives an ActionState from the current query, constraints, recent failures, token budget, and recent tool use. ActionAwareRetrievalPlanner uses that state to produce a SearchPlan containing mode, semantic_queries, pragmatic_queries, tool_hints, required_affordances, missing_slots, tree-traversal strategy, and more. Multi-channel recall (FTS, semantic vector, normalised vector, tag/affordance filter, temporal filter, slot-hint supplementation, plus tree expansion when needed) then queries SQLite + LanceDB. This stage returns raw semantic fragments, skill hits, retrieval provenance, and a dedicated novelty_retrieved_context built from pre-synthesis semantic fragments for later novelty checking; it does not build the final background_context yet. Skill retrieval is mode-aware (answer / action / mixed) and uses HyDE against the skill store only when it is likely to help.

When a new user turn arrives, SearchCoordinator also tries to apply lightweight feedback to the most recent unresolved action/mixed retrieval log, so the next turn can mark the prior memory usage as success / fail / correction.

Stage 3 — SynthesizerAgent

Acts as an LLM-as-Judge: scores every retrieved memory fragment on an absolute 0-1 relevance scale, discards fragments below the threshold (default 0.6), and fuses the survivors into a single dense background_context string. It also identifies skill_reuse_plan entries that can directly guide the final response. This stage is where the final answer-time context is built; it outputs provenance — a citation list containing scoring breakdown and source references for each kept memory item.

Stage 4 — ReasoningAgent

Receives compressed_history, background_context, and skill_reuse_plan and generates the final assistant reply. It appends the assistant turn back to the session store, and the pipeline finalizes the semantic usage log with a response excerpt so the next user turn can provide outcome feedback.

Background — ConsolidatorAgent

Triggered immediately after ReasoningAgent completes, runs in a thread pool and does not block the response. It:

Performs a novelty check — LLM judges whether the conversation introduced new information worth persisting. Skips consolidation for pure retrieval exchanges.
Compact consolidation — calls CompactSemanticEngine.ingest_conversation(), which runs a single-pass encoder (typed extraction → decontextualization → action metadata annotation), writes MemoryRecords to SQLite + LanceDB, then triggers conflict-aware Record Fusion. Novelty check uses the search-stage novelty_retrieved_context (raw semantic fragments), not the answer-time background_context, so query-conditioned synthesis does not suppress valid new-memory ingestion.
Skill extraction — identifies successful tool-usage patterns in the conversation and adds skill entries to the skill store. Runs in parallel with compact consolidation (ThreadPoolExecutor).

🔌 API Reference

`POST /memory/search` — Unified Memory Retrieval

Query both the semantic memory channel and the skill store in a single call. New integrations should prefer semantic_results; graph_results is kept as a backward-compatible alias. The response also includes novelty_retrieved_context, which is the correct input for later /memory/consolidate calls.

// Request
{
  "query": "what tools do I use for database backups",
  "top_k": 5,
  "include_graph": true,
  "include_skills": true
}

// Response
{
  "query": "...",
  "graph_results": [
    {
      "anchor": {
        "node_id": "compact_context",
        "name": "CompactSemanticMemory",
        "label": "SemanticContext",
        "score": 1.0
      },
      "constructed_context": "...",
      "provenance": [ { "record_id": "...", "source": "record", "semantic_source_type": "record", "score": 0.91, ... } ]
    }
  ],
  "semantic_results": [
    {
      "anchor": { "node_id": "compact_context", "name": "CompactSemanticMemory", "label": "SemanticContext", "score": 1.0 },
      "constructed_context": "...",
      "provenance": [ { "record_id": "...", "source": "record", "semantic_source_type": "record", "score": 0.91, ... } ]
    }
  ],
  "novelty_retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
  "skill_results": [ { "id": "...", "intent": "pg_dump backup to S3", "score": 0.87, ... } ],
  "total": 6
}

`POST /memory/smart-search` — One-Shot Recall

Runs search and, optionally, synthesis in one API call. mode=compact is the default integration path when you want a concise background_context without handling intermediate payloads yourself. Even in compact mode, the response still returns novelty_retrieved_context so a host can consolidate against raw retrieved memory instead of answer-time synthesis.

// Request
{
  "query": "what tools do I use for database backups",
  "top_k": 5,
  "synthesize": true,
  "mode": "compact"
}

// Response
{
  "query": "...",
  "mode": "compact",
  "synthesized": true,
  "background_context": "User regularly uses pg_dump with a cron job...",
  "skill_reuse_plan": [ { "skill_id": "...", "intent": "...", "doc_markdown": "..." } ],
  "provenance": [ { "record_id": "...", "source": "record", "score": 0.91, ... } ],
  "novelty_retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
  "kept_count": 4,
  "dropped_count": 2,
  "total": 6
}

`POST /memory/synthesize` — Memory Fusion

Takes raw retrieval results and produces a fused memory context using LLM-as-Judge.

// Request
{
  "user_query": "what tools do I use for database backups",
  "semantic_results": [...], // preferred from /memory/search
  "graph_results": [...],    // compatibility alias also accepted
  "skill_results": [...]
}

// Response
{
  "background_context": "User regularly uses pg_dump with a cron job...",
  "skill_reuse_plan": [ { "skill_id": "...", "intent": "...", "doc_markdown": "..." } ],
  "provenance": [ { "record_id": "...", "source": "semantic", "semantic_source_type": "record", "score": 0.91, ... } ],
  "kept_count": 4,
  "dropped_count": 2
}

`POST /memory/reason` — Grounded Reasoning

Runs the ReasoningAgent given pre-synthesized context. Can be chained after /memory/synthesize for full pipeline control.

// Request
{
  "session_id": "my-session",
  "user_query": "what tools do I use for database backups",
  "background_context": "User regularly uses pg_dump...",
  "skill_reuse_plan": [...],
  "append_to_session": true   // write result to session history (default: true)
}

// Response
{
  "response": "You typically use pg_dump scheduled via cron...",
  "session_id": "my-session",
  "wm_token_usage": 3412
}

`POST /memory/append-turn` — Mirror External Host Turns

Appends one user or assistant turn into LycheeMem's session store so it can be consolidated later.

// Request
{
  "session_id": "my-session",
  "role": "user",
  "content": "I usually back up PostgreSQL with pg_dump to S3."
}

// Response
{
  "status": "appended",
  "session_id": "my-session",
  "turn_count": 3
}

`POST /memory/consolidate` — Trigger Consolidation

Manually trigger memory consolidation for a session. This is the primary consolidation endpoint and supports both background and synchronous modes.

retrieved_context should preferably be the novelty_retrieved_context returned by /memory/search or /memory/smart-search, i.e. the search-stage raw semantic fragments, not /memory/synthesize's background_context.

// Request
{
  "session_id": "my-session",
  "retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
  "background": true
}

// Response (background mode)
{
  "status": "started",
  "entities_added": 0,
  "skills_added": 0,
  "facts_added": 0
}

Legacy compatibility endpoint: POST /memory/consolidate/{session_id}.

`GET /memory/graph` — Semantic Memory Tree

Returns the current semantic memory as a hierarchy. mode=cleaned (default) emits tree_roots plus direct tree edges for the frontend memory-tree view; mode=debug exposes the lower-level flattened relations for inspection.

`GET /pipeline/status` and `GET /pipeline/last-consolidation`

Use these endpoints for operational checks and background consolidation polling:

GET /pipeline/status returns aggregate counts for sessions, semantic memory, and skills.
GET /pipeline/last-consolidation?session_id=<id> returns the latest consolidation result for a session, or pending if the background task has not finished yet.

Usage Examples

Release HistoryVersionChangesUrgencyDate
master@2026-05-15Latest activity on master branchHigh5/15/2026
0.0.0No release found — using repo HEADHigh4/9/2026

Version	Changes	Urgency	Date
master@2026-05-15	Latest activity on master branch	High	5/15/2026
0.0.0	No release found — using repo HEAD	High	4/9/2026

Dependencies & License Audit

Loading dependencies...

Similar Packages

memoraGive your AI agents persistent memory.v0.2.29

tradememory-protocolDecision audit trail + persistent memory for AI trading agents. Outcome-weighted recall, SHA-256 tamper detection, 17 MCP tools.v0.5.1

sqltools_mcp🔌 Access multiple databases seamlessly with SQLTools MCP, a versatile service supporting MySQL, PostgreSQL, SQL Server, DM8, and SQLite without multiple servers.main@2026-06-07

pipulateLocal First AI SEO Software on Nix, FastHTML & HTMXmain@2026-06-06

OmnispindleA comprehensive MCP-based todo management system, that serves as a central nervous system for Madness Interactive, a multi-project task coordination workshop.main@2026-06-05

More in MCP Servers

PlanExeCreate a plan from a description in minutes

agentroveYour own Claude Code UI, sandbox, in-browser VS Code, terminal, multi-provider support (Anthropic, OpenAI, GitHub Copilot, OpenRouter), custom skills, and MCP servers.

ProxmoxMCP-PlusEnhanced Proxmox MCP server with advanced virtualization management and full OpenAPI integration.

node9-proxyThe Execution Security Layer for the Agentic Era. Providing deterministic "Sudo" governance and audit logs for autonomous AI agents.

LycheeMem

Description

README