LycheeMem

Compact, efficient, and extensible long-term memory for LLM agents.

LycheeMem: Lightweight Long-Term Memory for LLM Agents

LycheeMem is a compact memory framework for LLM agents. It starts from efficient conversational memory (structured organization, lightweight consolidation, and adaptive retrieval) and gradually extends toward action-aware, usage-aware memory for more capable agentic systems.


News • Related Projects • Quick Start • Web Demo • OpenClaw Plugin • MCP • Memory Architecture • Pipeline • API Reference

🔥 News

  • [04/03/2026] The project now supports installation via pip install lycheemem. You can easily start the service from anywhere using lycheemem-cli!
  • [03/30/2026] We evaluated LycheeMem on PinchBench with the OpenClaw plugin: compared to OpenClaw's native memory, it achieved a ~6% score improvement while reducing token consumption by ~71% and cost by ~55%!
  • [03/28/2026] Semantic memory has been upgraded to Compact Semantic Memory (SQLite + LanceDB), no Neo4j required. See /quick-start for details.
  • [03/27/2026] OpenClaw Plugin is now available at /openclaw-plugin ! Setup guide →
  • [03/26/2026] MCP support is available at /mcp !
  • [03/23/2026] LycheeMem is now open source: GitHub Repository →

🔗 Related Projects

LycheeMem is part of the 3rd-generation Lychee (立知) large model series, which focuses on memory intelligence, continual learning, and long-context reasoning.

We welcome you to explore our related works:

  • LycheeMemory (ACL 2026, CCF-A): a unified framework for implicit long-term memory and explicit working memory collaboration in large language models
    arXiv GitHub Hugging Face

  • LycheeMem (this project): long-term memory infrastructure for LLM-based agents
    Project Page GitHub

  • LycheeDecode (ICLR 2026, CCF-A): selective recall from massive KV-cache context memory
    Project Page arXiv GitHub

  • LycheeCluster (ACL 2026, CCF-A): structured organization and hierarchical indexing for context memory
    arXiv


⚡ Quick Start

Prerequisites

  • Python 3.9+
  • An LLM API key (OpenAI, Gemini, or any litellm-compatible provider)

Installation

You can install LycheeMem directly via pip:

pip install lycheemem

Once installed, you can start the backend server instantly using the CLI:

lycheemem-cli

For development or if you prefer to run from source:

git clone https://github.com/LycheeMem/LycheeMem.git
cd LycheeMem
pip install -e .

Configuration

Create a .env file in your working directory and fill in your values. The full template in .env.example also includes session/user DB paths, JWT settings, and working-memory thresholds; the snippet below shows the most important ones:

# LLM (litellm format: provider/model)
LLM_MODEL=openai/gpt-4o-mini
LLM_API_KEY=sk-...
LLM_API_BASE=                     # optional

# Embedder
EMBEDDING_MODEL=openai/text-embedding-3-small
EMBEDDING_DIM=1536
EMBEDDING_API_KEY=                # optional
EMBEDDING_API_BASE=               # optional

Supported LLM providers (via litellm):
openai/gpt-4o-mini · gemini/gemini-2.0-flash · ollama_chat/qwen2.5 · any OpenAI-compatible endpoint

Start the Server

If you installed via pip, you can start the LycheeMem background service from anywhere using:

lycheemem-cli

(If running from source, you can also use python main.py to start the server.)

The API is served at http://localhost:8000. Interactive docs at /docs.

main.py currently starts Uvicorn without enabling live reload. For development reload, run Uvicorn directly, for example:

uvicorn src.api.server:create_app --factory --reload

🎨 Web Demo

A frontend demo is included under web-demo/. It provides a chat interface alongside live views of the semantic memory tree, skill library, and working memory state.

cd web-demo
npm install
npm run dev      # served at http://localhost:5173

Make sure the backend is running on port 8000 (or update proxy settings in web-demo/vite.config.ts) before starting the frontend.


🦞 OpenClaw Plugin

LycheeMem ships a native OpenClaw plugin that gives any OpenClaw session persistent long-term memory with zero manual wiring.

What the plugin provides:

  • lychee_memory_smart_search: default long-term memory retrieval entry point
  • Automatic turn mirroring via hooks: the model does not need to call append_turn manually
    • User messages are appended automatically
    • Assistant messages are appended automatically
  • /new, /reset, /stop, and session_end automatically trigger boundary consolidation
  • Proactive consolidation on strong long-term knowledge signals

Under normal operation:

  • The model only calls lychee_memory_smart_search when recalling long-term context
  • The model may call lychee_memory_consolidate manually when an immediate persist is warranted
  • The model does not need to call lychee_memory_append_turn at all

Quick Install

openclaw plugins install "/path/to/LycheeMem/openclaw-plugin"
openclaw gateway restart

See the full setup guide: openclaw-plugin/INSTALL_OPENCLAW.md


🔧 MCP

LycheeMem also exposes an HTTP MCP endpoint at http://localhost:8000/mcp.

  • Available tools: lychee_memory_smart_search, lychee_memory_search, lychee_memory_append_turn, lychee_memory_synthesize, lychee_memory_consolidate
  • lychee_memory_consolidate works for sessions that already contain mirrored turns from /chat, /memory/reason, or lychee_memory_append_turn

MCP Transport

  • POST /mcp handles JSON-RPC requests
  • GET /mcp exposes the SSE stream used by some MCP clients
  • The server returns Mcp-Session-Id during initialize; reuse that header on later requests

Client Configuration

For any MCP client that supports remote HTTP servers, configure the MCP URL as:

http://localhost:8000/mcp

Generic config example:

{
  "mcpServers": {
    "lycheemem": {
      "url": "http://localhost:8000/mcp"
    }
  }
}

Manual JSON-RPC Flow

  1. Call initialize
  2. Reuse the returned Mcp-Session-Id
  3. Send initialized
  4. Call tools/list
  5. Call tools/call

Initialize example:

curl -i -X POST http://localhost:8000/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
      "protocolVersion": "2025-03-26",
      "capabilities": {},
      "clientInfo": {
        "name": "debug-client",
        "version": "0.1.0"
      }
    }
  }'

Tool call example:

curl -X POST http://localhost:8000/mcp \
  -H "Content-Type: application/json" \
  -H "Mcp-Session-Id: <session-id>" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "lychee_memory_smart_search",
      "arguments": {
        "query": "what tools do I use for database backups",
        "top_k": 5,
        "mode": "compact",
        "include_graph": true,
        "include_skills": true
      }
    }
  }'
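
The same manual flow can be sketched in Python using only the standard library. This is a minimal illustration, not LycheeMem's own client: build_request and send are hypothetical helper names, and the notification method name notifications/initialized comes from the MCP specification (it is assumed here to be what the "Send initialized" step refers to).

```python
import json
import urllib.request

MCP_URL = "http://localhost:8000/mcp"  # endpoint given in this README


def build_request(method, params=None, request_id=None, session_id=None):
    """Build (but do not send) a JSON-RPC 2.0 POST for the MCP endpoint."""
    payload = {"jsonrpc": "2.0", "method": method}
    if request_id is not None:
        payload["id"] = request_id  # notifications carry no id
    if params is not None:
        payload["params"] = params
    headers = {"Content-Type": "application/json"}
    if session_id:
        # Reuse the session id returned by the server during initialize.
        headers["Mcp-Session-Id"] = session_id
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request):
    """Send a request; return (session id header or None, raw body text)."""
    with urllib.request.urlopen(request) as resp:
        return resp.headers.get("Mcp-Session-Id"), resp.read().decode("utf-8")


# Requests for steps 1 and 3-5 of the flow (step 2 is reusing the session id).
# "<session-id>" is a placeholder for the value returned by initialize.
flow = [
    build_request("initialize", {
        "protocolVersion": "2025-03-26",
        "capabilities": {},
        "clientInfo": {"name": "debug-client", "version": "0.1.0"},
    }, request_id=1),
    build_request("notifications/initialized", session_id="<session-id>"),
    build_request("tools/list", request_id=2, session_id="<session-id>"),
    build_request("tools/call", {
        "name": "lychee_memory_smart_search",
        "arguments": {"query": "what tools do I use for database backups"},
    }, request_id=3, session_id="<session-id>"),
]
```

Building the requests separately from sending them keeps the payloads easy to inspect; call send(...) on each one only when a LycheeMem server is actually running.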

Recommended MCP Usage Pattern

  1. Use /chat or /memory/reason with a stable session_id to write conversation turns, or mirror external host turns with lychee_memory_append_turn.
  2. Use lychee_memory_smart_search in compact mode for the default one-shot recall path.
  3. Use lychee_memory_search + lychee_memory_synthesize only when you explicitly want search and synthesis as separate stages.
  4. After the conversation ends, call lychee_memory_consolidate with the same session_id.

📚 Memory Architecture

LycheeMem organizes memory into three complementary stores:

Working Memory (Episodic)

  • Session turns
  • Summaries
  • Token budget management

Semantic Memory (Typed Action Store)

  • 7 MemoryRecord types
  • Conflict-aware Record Fusion
  • Hierarchical memory tree
  • Action-grounded retrieval planning
  • Usage feedback loop + RL-ready statistics

Procedural Memory (Skills)

  • Skill entries
  • HyDE retrieval

💾 Working Memory

The working memory window holds the active conversation context for a session. It operates under a dual-threshold token budget:

  • Warn threshold (70%): triggers asynchronous background pre-compression; the current request is not blocked.
  • Block threshold (90%): the pipeline pauses and flushes older turns to a compressed summary before proceeding.

Compression produces summary anchors (past context, distilled) + raw recent turns (last N turns, verbatim). Both are passed downstream as the conversation history.
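
A minimal sketch of the dual-threshold check, assuming the thresholds are fractions of a per-session token budget and that a threshold counts as crossed once usage reaches it. The function names and return labels are illustrative, not LycheeMem's API.

```python
WARN_RATIO = 0.70   # warn threshold from the text above
BLOCK_RATIO = 0.90  # block threshold from the text above


def budget_action(used_tokens: int, max_tokens: int) -> str:
    """Return 'ok', 'precompress_async', or 'compress_blocking'."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    ratio = used_tokens / max_tokens
    if ratio >= BLOCK_RATIO:
        return "compress_blocking"   # pause and flush older turns first
    if ratio >= WARN_RATIO:
        return "precompress_async"   # compress in the background, do not block
    return "ok"


def split_history(turns, keep_recent):
    """Split turns into (older turns to summarise, recent turns kept verbatim)."""
    if keep_recent <= 0:
        return list(turns), []
    return list(turns[:-keep_recent]), list(turns[-keep_recent:])
```

The block check comes first so that a session already past 90% is never handled by the non-blocking path.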

🗺️ Semantic Memory: Compact Semantic Memory

Semantic memory is organised around typed MemoryRecords plus action-grounded retrieval state. The storage layer is SQLite (FTS5 full-text search) + LanceDB (vector index), while retrieval is conditioned on recent context, tentative action, constraints, and missing slots.

Memory Record Types

Each memory entry is stored as a MemoryRecord. The memory_type field distinguishes seven semantic categories:

Type             Description
fact             Objective facts about the user, environment, or world
preference       User preferences (style, habits, likes/dislikes)
event            Specific events that have occurred
constraint       Conditions that must be respected
procedure        Reusable step-by-step procedures / methods
failure_pattern  Previously failed action paths and their causes
tool_affordance  Capabilities and applicable scenarios of tools/APIs

Beyond text, every MemoryRecord carries action-facing metadata (tool_tags, constraint_tags, failure_tags, affordance_tags) and usage statistics (retrieval_count, action_success_count, etc.) to seed future reinforcement-learning signals. Retrieval logs also persist retrieval_plan, action_state, response excerpts, and later user feedback so the system can close a lightweight action-outcome loop without training.

Related MemoryRecords can be fused online by the Record Fusion Engine into denser CompositeRecords. Composite entries persist direct child_composite_ids, so long-term semantic memory is organised as a hierarchical memory tree instead of a flat bag of summaries.

Four-Module Pipeline

Module 1: Compact Semantic Encoding

A single-pass pipeline that converts conversation turns into a list of MemoryRecords:

  1. Typed extraction: the LLM extracts self-contained facts and assigns a semantic category to each record.
  2. Decontextualization: pronouns and context-dependent phrases are expanded into full expressions, so each record is understandable without the original dialogue.
  3. Action metadata annotation: the LLM annotates each record with memory_type, tool_tags, constraint_tags, failure_tags, affordance_tags, and other structured labels.

record_id = SHA256(normalized_text), so ingestion is naturally idempotent: duplicate content is deduplicated automatically.
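
The content-addressed ID can be illustrated with hashlib. The normalisation step below (trim, collapse whitespace, lowercase) is an assumption made for the example; the README does not specify how normalized_text is produced, and the in-memory store is a stand-in for SQLite + LanceDB.

```python
import hashlib
import re


def normalize_text(text: str) -> str:
    # Assumed normalisation: trim, collapse whitespace, lowercase.
    return re.sub(r"\s+", " ", text.strip()).lower()


def record_id(text: str) -> str:
    """SHA-256 hex digest of the normalised text."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


store = {}  # record_id -> normalised text (stand-in for the real storage layer)


def upsert(text: str) -> str:
    """Insert a record; re-inserting equivalent content is a no-op."""
    rid = record_id(text)
    store.setdefault(rid, normalize_text(text))
    return rid


first = upsert("User backs up  PostgreSQL with pg_dump.")
second = upsert("user backs up postgresql with pg_dump.")
# Both calls map to the same key, so the store holds a single record.
```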

Module 2: Record Fusion, Conflict Update, and Hierarchical Consolidation

Triggered online after each consolidation:

  1. FTS / vector recall gathers related existing atomic records around the new records (candidate pool).
  2. The existing synthesis judge prompt decides whether each candidate set should produce a new CompositeRecord or perform a conflict_update against an existing atomic record.
  3. On conflict_update, the existing anchor record is updated in place, conflicting incoming records are soft-expired, and composites covering affected source records are invalidated.
  4. On synthesis, the engine writes a new CompositeRecord to SQLite + LanceDB.
  5. Additional hierarchy rounds can synthesize record -> composite and composite -> composite, persisting child_composite_ids so the memory tree can keep growing upward.

Module 3: Action-Grounded Retrieval Planning

Before retrieval, ActionAwareRetrievalPlanner analyses the user query + recent context + ActionState and emits a SearchPlan:

  • mode: answer (factual Q&A) / action (needs execution) / mixed
  • semantic_queries: content-facing search terms
  • pragmatic_queries: action/tool/constraint-facing search terms
  • tool_hints: tools likely needed for this request
  • required_constraints: constraints that must be respected
  • required_affordances: capabilities the retrieved memory should provide
  • missing_slots: parameters / slots that are absent
  • tree_retrieval_mode / tree_expansion_depth / include_leaf_records: whether retrieval should stay at high-level composites (root_only) or descend into child composites / direct leaf records (balanced / descend)

ActionState can carry fields such as current_subgoal, tentative_action, known_constraints, available_tools, failure_signal, and a recent-context excerpt. The planner merges this state with the LLM-produced plan so retrieval is conditioned on the current decision state rather than the query alone.
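
One way to picture these two structures is as plain dataclasses. The field names follow the lists above, but the defaults and the merge rule are hypothetical: merge_state below simply carries known constraints into the plan and drops tool hints that are not currently available, which may differ from what the real planner does.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass
class ActionState:
    current_subgoal: str = ""
    tentative_action: str = ""
    known_constraints: List[str] = field(default_factory=list)
    available_tools: List[str] = field(default_factory=list)
    failure_signal: str = ""


@dataclass
class SearchPlan:
    mode: str = "answer"  # answer / action / mixed
    semantic_queries: List[str] = field(default_factory=list)
    pragmatic_queries: List[str] = field(default_factory=list)
    tool_hints: List[str] = field(default_factory=list)
    required_constraints: List[str] = field(default_factory=list)
    required_affordances: List[str] = field(default_factory=list)
    missing_slots: List[str] = field(default_factory=list)
    tree_retrieval_mode: str = "root_only"  # root_only / balanced / descend


def merge_state(plan: SearchPlan, state: ActionState) -> SearchPlan:
    """Hypothetical merge rule; returns a new plan and leaves the input untouched."""
    # Ordered union of plan constraints and constraints already known from state.
    constraints = list(dict.fromkeys(plan.required_constraints + state.known_constraints))
    tools = plan.tool_hints
    if state.available_tools:
        # Keep only hinted tools that the agent can actually use right now.
        tools = [t for t in tools if t in state.available_tools]
    return replace(plan, required_constraints=constraints, tool_hints=tools)
```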

The plan drives multi-channel recall:

  1. FTS channel: SQLite FTS5 keyword recall over MemoryRecord + CompositeRecord
  2. Semantic vector channel: LanceDB ANN over semantic_text embeddings
  3. Normalised vector channel: LanceDB ANN over normalized_text embeddings (for pragmatic queries)
  4. Tag filter channel: exact filter by tool_hints / required_constraints / required_affordances
  5. Temporal channel: filter by SearchPlan.temporal_filter time window
  6. Slot-hint supplementation: when missing_slots is non-empty, extra FTS/tag recall is triggered to find records that can fill missing parameters

After base recall, retrieval can also expand along the memory tree. root_only keeps high-level composite summaries, balanced descends one level when tree hints match, and descend pulls child composites plus direct leaf records when the current action needs finer-grained detail.

Module 4: Multi-Dimensional Scorer

Candidates from all channels are de-duplicated and ranked by MemoryScorer using a weighted linear combination. Final top-k selection is composite-first: covering parent composites are preferred, covered child records are folded away unless they add unique value, and near-duplicate fragments are suppressed.

$$\text{Score} = \alpha \cdot S_\text{sem} + \beta \cdot S_\text{action} + \kappa \cdot S_\text{slot} + \gamma \cdot S_\text{temporal} + \delta \cdot S_\text{recency} + \eta \cdot S_\text{evidence} - \lambda \cdot C_\text{token}$$

Weight  Meaning                                                           Default
α       SemanticRelevance (vector distance -> similarity)                 0.25
β       ActionUtility (tag match score, mode-aware)                       0.25
κ       SlotUtility (whether the memory helps fill missing action slots)  0.15
γ       TemporalFit (temporal reference match)                            0.15
δ       Recency (memory freshness)                                        0.10
η       EvidenceDensity (evidence span density)                           0.10
λ       TokenCost penalty (text length penalty)                           0.10
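
The weighted combination can be written out directly. This sketch uses the default weights from the table and assumes each component score is already normalised to [0, 1]; the dictionary keys and the function name are illustrative, not MemoryScorer's actual interface.

```python
DEFAULT_WEIGHTS = {
    "sem": 0.25,         # alpha
    "action": 0.25,      # beta
    "slot": 0.15,        # kappa
    "temporal": 0.15,    # gamma
    "recency": 0.10,     # delta
    "evidence": 0.10,    # eta
    "token_cost": 0.10,  # lambda (subtracted as a penalty)
}

POSITIVE_TERMS = ("sem", "action", "slot", "temporal", "recency", "evidence")


def score(features, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the positive terms minus the token-cost penalty."""
    gain = sum(weights[k] * features.get(k, 0.0) for k in POSITIVE_TERMS)
    return gain - weights["token_cost"] * features.get("token_cost", 0.0)


candidates = {
    "short_relevant": {"sem": 0.9, "action": 0.8, "recency": 0.5, "token_cost": 0.1},
    "long_vague": {"sem": 0.4, "action": 0.2, "recency": 0.9, "token_cost": 0.9},
}
# Rank candidate ids from highest to lowest score.
ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```

Note that the six positive weights sum to 1.0, so a candidate that is perfect on every dimension and has no token cost scores 1.0.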

🛠️ Procedural Memory: Skill Store

The skill store preserves reusable how-to knowledge as structured skill entries, each carrying:

  • Intent: a short description of what the skill does.
  • doc_markdown: a full Markdown document describing the procedure, commands, parameters, and caveats.
  • Embedding: a dense vector of the intent text, used for similarity search.
  • Metadata: usage counters, last-used timestamp, preconditions.

Skill retrieval uses HyDE (Hypothetical Document Embeddings): the query is first expanded into a hypothetical ideal answer by the LLM, then that draft text is embedded to produce a query vector that matches well against stored procedure descriptions, even when the user's original phrasing is vague.
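
The idea can be shown with a toy example. Here the "embedding" is a bag-of-words counter and the LLM is a caller-supplied function, both stand-ins chosen so the sketch runs without any model; a real deployment would use the configured embedding model and LLM instead.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(query, skills, draft_answer, top_k=1):
    """Embed a hypothetical answer instead of the raw query, then rank skills."""
    hypothetical = draft_answer(query)  # an LLM call in a real system
    qvec = embed(hypothetical)
    ranked = sorted(skills, key=lambda s: cosine(qvec, embed(s["intent"])), reverse=True)
    return ranked[:top_k]


skills = [
    {"intent": "rotate nginx logs"},
    {"intent": "pg_dump backup to s3"},
]


def fake_llm(query):
    # Stand-in for the LLM that drafts an ideal answer to a vague question.
    return "run pg_dump and upload the backup to s3"


best = hyde_search("how do I save my db?", skills, fake_llm)
```

The vague query shares no words with either stored intent, while the drafted answer overlaps heavily with the backup skill, which is the effect HyDE relies on.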


⚙️ Pipeline

Every request passes through a fixed sequence of five agents. Four are synchronous stages in the LangGraph pipeline; one is a background post-processing task.

START
▼
1. WMManager: Token budget check + compress/render
↓
2. SearchCoordinator: Planner → Semantic + Skill retrieval
↓
3. SynthesizerAgent: LLM-as-Judge scoring + context fusion
↓
4. ReasoningAgent: Final response generation
▼
END
Background: asyncio.create_task( ConsolidatorAgent )
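
The control flow in the diagram can be sketched with plain functions and asyncio. This does not use LangGraph and none of the stage bodies reflect the real agents; the stubs only show four synchronous stages followed by a background task that is scheduled but not awaited before the reply is returned.

```python
import asyncio

consolidated = []  # stands in for the long-term store


def wm_manager(state):
    state["history"] = state.get("history", []) + [state["query"]]
    return state


def search_coordinator(state):
    state["fragments"] = ["fragment about " + state["query"]]
    return state


def synthesizer(state):
    state["background_context"] = " ".join(state["fragments"])
    return state


def reasoner(state):
    state["response"] = "answer using: " + state["background_context"]
    return state


async def consolidator(state):
    await asyncio.sleep(0)  # yield control; real consolidation work would go here
    consolidated.append(state["query"])


async def handle(query):
    state = {"query": query}
    for stage in (wm_manager, search_coordinator, synthesizer, reasoner):
        state = stage(state)
    # Schedule consolidation in the background; the reply does not wait for it.
    task = asyncio.create_task(consolidator(dict(state)))
    return state["response"], task


async def demo():
    response, task = await handle("database backups")
    await task  # awaited here only so the demo can observe the background result
    return response


result = asyncio.run(demo())
```

Keeping a reference to the task matters: asyncio only holds weak references to tasks, so a fire-and-forget task with no reference can be garbage-collected before it finishes.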

Stage 1: WMManager

Rule-based agent (no LLM prompt). Appends the user turn to the session log, counts tokens, and fires compression if either threshold is crossed. Produces compressed_history and raw_recent_turns for downstream stages.

Stage 2: SearchCoordinator

SearchCoordinator first builds recent_context from compressed summaries + raw recent turns, then derives an ActionState from the current query, constraints, recent failures, token budget, and recent tool use. ActionAwareRetrievalPlanner uses that state to produce a SearchPlan containing mode, semantic_queries, pragmatic_queries, tool_hints, required_affordances, missing_slots, tree-traversal strategy, and more. Multi-channel recall (FTS, semantic vector, normalised vector, tag/affordance filter, temporal filter, slot-hint supplementation, plus tree expansion when needed) then queries SQLite + LanceDB. This stage returns raw semantic fragments, skill hits, retrieval provenance, and a dedicated novelty_retrieved_context built from pre-synthesis semantic fragments for later novelty checking; it does not build the final background_context yet. Skill retrieval is mode-aware (answer / action / mixed) and uses HyDE against the skill store only when it is likely to help.

When a new user turn arrives, SearchCoordinator also tries to apply lightweight feedback to the most recent unresolved action/mixed retrieval log, so the next turn can mark the prior memory usage as success / fail / correction.

Stage 3: SynthesizerAgent

Acts as an LLM-as-Judge: scores every retrieved memory fragment on an absolute 0 to 1 relevance scale, discards fragments below the threshold (default 0.6), and fuses the survivors into a single dense background_context string. It also identifies skill_reuse_plan entries that can directly guide the final response. This stage is where the final answer-time context is built; it also outputs provenance, a citation list containing the scoring breakdown and source references for each kept memory item.
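
The filter-then-fuse step can be sketched as follows. The judge is passed in as a function because the real scoring is done by an LLM; fusing by simple concatenation and treating a score equal to the threshold as kept are assumptions made for the example. The output keys mirror the response fields shown in the API reference.

```python
THRESHOLD = 0.6  # default relevance threshold from the text above


def synthesize(fragments, judge, threshold=THRESHOLD):
    """Score fragments, drop those below the threshold, and fuse the rest."""
    scored = [(judge(text), text) for text in fragments]
    kept = sorted(
        (pair for pair in scored if pair[0] >= threshold),
        key=lambda pair: pair[0],
        reverse=True,  # most relevant first
    )
    return {
        # A real synthesizer would ask the LLM to fuse; joining is a stand-in.
        "background_context": " ".join(text for _, text in kept),
        "provenance": [{"text": text, "score": s} for s, text in kept],
        "kept_count": len(kept),
        "dropped_count": len(scored) - len(kept),
    }


scores = {
    "uses pg_dump nightly": 0.9,
    "likes dark mode": 0.3,
    "stores dumps in S3": 0.6,
}
result = synthesize(list(scores), judge=scores.get)
```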

Stage 4: ReasoningAgent

Receives compressed_history, background_context, and skill_reuse_plan and generates the final assistant reply. It appends the assistant turn back to the session store, and the pipeline finalizes the semantic usage log with a response excerpt so the next user turn can provide outcome feedback.

Background: ConsolidatorAgent

Triggered immediately after ReasoningAgent completes. It runs in a thread pool, does not block the response, and performs three steps:

  1. Novelty check: an LLM judges whether the conversation introduced new information worth persisting, and consolidation is skipped for pure retrieval exchanges. The check uses the search-stage novelty_retrieved_context (raw semantic fragments), not the answer-time background_context, so query-conditioned synthesis does not suppress valid new-memory ingestion.
  2. Compact consolidation: calls CompactSemanticEngine.ingest_conversation(), which runs a single-pass encoder (typed extraction → decontextualization → action metadata annotation), writes MemoryRecords to SQLite + LanceDB, then triggers conflict-aware Record Fusion.
  3. Skill extraction: identifies successful tool-usage patterns in the conversation and adds skill entries to the skill store. Runs in parallel with compact consolidation (ThreadPoolExecutor).

🔌 API Reference

POST /memory/search: Unified Memory Retrieval

Query both the semantic memory channel and the skill store in a single call. New integrations should prefer semantic_results; graph_results is kept as a backward-compatible alias. The response also includes novelty_retrieved_context, which is the correct input for later /memory/consolidate calls.

// Request
{
  "query": "what tools do I use for database backups",
  "top_k": 5,
  "include_graph": true,
  "include_skills": true
}

// Response
{
  "query": "...",
  "graph_results": [
    {
      "anchor": {
        "node_id": "compact_context",
        "name": "CompactSemanticMemory",
        "label": "SemanticContext",
        "score": 1.0
      },
      "constructed_context": "...",
      "provenance": [ { "record_id": "...", "source": "record", "semantic_source_type": "record", "score": 0.91, ... } ]
    }
  ],
  "semantic_results": [
    {
      "anchor": { "node_id": "compact_context", "name": "CompactSemanticMemory", "label": "SemanticContext", "score": 1.0 },
      "constructed_context": "...",
      "provenance": [ { "record_id": "...", "source": "record", "semantic_source_type": "record", "score": 0.91, ... } ]
    }
  ],
  "novelty_retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
  "skill_results": [ { "id": "...", "intent": "pg_dump backup to S3", "score": 0.87, ... } ],
  "total": 6
}

POST /memory/smart-search: One-Shot Recall

Runs search and, optionally, synthesis in one API call. mode=compact is the default integration path when you want a concise background_context without handling intermediate payloads yourself. Even in compact mode, the response still returns novelty_retrieved_context so a host can consolidate against raw retrieved memory instead of answer-time synthesis.

// Request
{
  "query": "what tools do I use for database backups",
  "top_k": 5,
  "synthesize": true,
  "mode": "compact"
}

// Response
{
  "query": "...",
  "mode": "compact",
  "synthesized": true,
  "background_context": "User regularly uses pg_dump with a cron job...",
  "skill_reuse_plan": [ { "skill_id": "...", "intent": "...", "doc_markdown": "..." } ],
  "provenance": [ { "record_id": "...", "source": "record", "score": 0.91, ... } ],
  "novelty_retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
  "kept_count": 4,
  "dropped_count": 2,
  "total": 6
}

POST /memory/synthesize: Memory Fusion

Takes raw retrieval results and produces a fused memory context using LLM-as-Judge.

// Request
{
  "user_query": "what tools do I use for database backups",
  "semantic_results": [...], // preferred from /memory/search
  "graph_results": [...],    // compatibility alias also accepted
  "skill_results": [...]
}

// Response
{
  "background_context": "User regularly uses pg_dump with a cron job...",
  "skill_reuse_plan": [ { "skill_id": "...", "intent": "...", "doc_markdown": "..." } ],
  "provenance": [ { "record_id": "...", "source": "semantic", "semantic_source_type": "record", "score": 0.91, ... } ],
  "kept_count": 4,
  "dropped_count": 2
}

POST /memory/reason: Grounded Reasoning

Runs the ReasoningAgent given pre-synthesized context. Can be chained after /memory/synthesize for full pipeline control.

// Request
{
  "session_id": "my-session",
  "user_query": "what tools do I use for database backups",
  "background_context": "User regularly uses pg_dump...",
  "skill_reuse_plan": [...],
  "append_to_session": true   // write result to session history (default: true)
}

// Response
{
  "response": "You typically use pg_dump scheduled via cron...",
  "session_id": "my-session",
  "wm_token_usage": 3412
}

POST /memory/append-turn: Mirror External Host Turns

Appends one user or assistant turn into LycheeMem's session store so it can be consolidated later.

// Request
{
  "session_id": "my-session",
  "role": "user",
  "content": "I usually back up PostgreSQL with pg_dump to S3."
}

// Response
{
  "status": "appended",
  "session_id": "my-session",
  "turn_count": 3
}

POST /memory/consolidate: Trigger Consolidation

Manually trigger memory consolidation for a session. This is the primary consolidation endpoint and supports both background and synchronous modes.

retrieved_context should preferably be the novelty_retrieved_context returned by /memory/search or /memory/smart-search, i.e. the search-stage raw semantic fragments, not /memory/synthesize's background_context.

// Request
{
  "session_id": "my-session",
  "retrieved_context": "[1] (procedure, source=record) Use pg_dump with cron ...",
  "background": true
}

// Response (background mode)
{
  "status": "started",
  "entities_added": 0,
  "skills_added": 0,
  "facts_added": 0
}

Legacy compatibility endpoint: POST /memory/consolidate/{session_id}.
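
A host integration can chain recall and consolidation as sketched below, passing the search-stage novelty_retrieved_context (not background_context) into /memory/consolidate. The helper names are hypothetical, the payloads copy the request examples above, and http_post assumes the server is reachable at http://localhost:8000 without an auth header, which may not hold if JWT settings are enabled.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from Quick Start


def http_post(path, payload):
    """POST JSON to the LycheeMem API and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def recall_then_consolidate(post, session_id, query):
    """One-shot recall, then background consolidation against raw fragments."""
    search = post("/memory/smart-search", {
        "query": query,
        "top_k": 5,
        "synthesize": True,
        "mode": "compact",
    })
    started = post("/memory/consolidate", {
        "session_id": session_id,
        # Raw search-stage fragments, not the synthesized background_context.
        "retrieved_context": search["novelty_retrieved_context"],
        "background": True,
    })
    return search["background_context"], started["status"]
```

With a running server, call recall_then_consolidate(http_post, "my-session", "what tools do I use for database backups"). Passing the transport in as a parameter also makes the flow easy to exercise with a fake.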


GET /memory/graph: Semantic Memory Tree

Returns the current semantic memory as a hierarchy. mode=cleaned (default) emits tree_roots plus direct tree edges for the frontend memory-tree view; mode=debug exposes the lower-level flattened relations for inspection.


GET /pipeline/status and GET /pipeline/last-consolidation

Use these endpoints for operational checks and background consolidation polling:

  • GET /pipeline/status returns aggregate counts for sessions, semantic memory, and skills.
  • GET /pipeline/last-consolidation?session_id=<id> returns the latest consolidation result for a session, or pending if the background task has not finished yet.
