
mesh-llm

Distributed AI/LLM for the people. Share compute privately or publicly to power your agents and chat.


README

Mesh LLM

Mesh LLM lets you pool spare GPU capacity across machines and expose the result as one OpenAI-compatible API.

If a model fits on one machine, it runs there. If it does not, Mesh LLM automatically spreads the work across the mesh:

  • Dense models use pipeline parallelism.
  • MoE models use expert sharding with zero cross-node inference traffic.
  • Models collaborate during inference: a text-only model consults a vision peer, and an uncertain model gets a second opinion from a different architecture.
  • Every node gets the same local API at http://localhost:9337/v1.

Why people use it

  • Run models larger than a single machine can hold.
  • Turn a few uneven boxes into one shared inference pool.
  • Give agents a local OpenAI-compatible endpoint instead of wiring each tool by hand.
  • Keep the setup simple: start one node, add more later.

Quick start

Install the latest release:

curl -fsSL https://raw.githubusercontent.com/Mesh-LLM/mesh-llm/main/install.sh | bash

Then start a node:

mesh-llm serve --auto

Inspect local GPU identity:

mesh-llm gpus

The serve --auto command:

  • picks a suitable bundled backend for your machine
  • downloads a model if needed
  • joins the best public mesh
  • exposes an OpenAI-compatible API at http://localhost:9337/v1
  • starts the web console at http://localhost:3131

Use --headless to disable the embedded web console while keeping the management API (/api/*) available on the --console port. This is useful for headless server deployments where the UI is not needed.

Check what is available:

curl -s http://localhost:9337/v1/models | jq '.data[].id'

Send a request:

curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'
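The same request can be built from any language with an OpenAI-style client. A minimal Python sketch, using only the standard library; the helper name is hypothetical, while the endpoint, model name, and body shape come from the curl example above:

```python
import json

MESH_API = "http://localhost:9337/v1/chat/completions"  # default local endpoint

def build_chat_request(model, prompt):
    """Build the URL, headers, and JSON body for a mesh-llm chat completion."""
    body = {
        "model": model,  # the mesh proxy routes the request by this field
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Content-Type": "application/json"}
    return MESH_API, headers, json.dumps(body).encode()

url, headers, body = build_chat_request("GLM-4.7-Flash-Q4_K_M", "hello")
# To actually send it:
# urllib.request.urlopen(urllib.request.Request(url, data=body, headers=headers))
```

Any tool that accepts a custom OpenAI base URL can point at http://localhost:9337/v1 directly instead.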

Common workflows

1. Try the public mesh

mesh-llm serve --auto

This is the easiest way to see the system working end to end.

2. Start a private mesh

mesh-llm serve --model Qwen2.5-32B

This starts serving a model, opens the local API and console, and prints an invite token for other machines.

3. Build from source

git clone https://github.com/Mesh-LLM/mesh-llm
cd mesh-llm
just build

Requires: just, cmake, Rust toolchain, Node.js 24 + npm.

  • NVIDIA GPU builds need nvcc (CUDA toolkit).
  • AMD GPU builds need ROCm/HIP.
  • Vulkan GPU builds need the Vulkan development files plus glslc.
  • CPU-only and Jetson/Tegra also work.

For source builds, just build auto-detects CUDA vs ROCm vs Vulkan on Linux, or you can force backend=rocm or backend=vulkan. See CONTRIBUTING.md for details.

Windows source builds are also supported for cuda, rocm/hip, vulkan, and cpu via just build; Metal remains macOS-only.

Tagged stable GitHub releases publish macOS bundles plus Linux CPU, Linux ARM64 CPU, Linux CUDA, Linux ROCm, and Linux Vulkan bundles. Prereleases use the same workflow and can optionally skip the Linux CUDA, ROCm, and Vulkan bundles. The Linux ARM64 CPU artifact is mesh-llm-aarch64-unknown-linux-gnu.tar.gz. In install and release contexts, arm64 and aarch64 mean the same 64-bit ARM target, and generic 32-bit ARM is not a published release target.

Windows publish jobs are currently commented out in .github/workflows/release.yml, but you can still generate the matching local Windows artifacts with just release-build-windows, just release-build-cuda-windows, just release-build-rocm-windows, just release-build-vulkan-windows, and the matching release-bundle-*-windows recipes.

Run

Once installed, you can run:

mesh-llm serve --auto                      # join the best public mesh, start serving

That's it. It downloads a model suited to your hardware, connects to other nodes, and gives you an OpenAI-compatible API at http://localhost:9337.

Or start your own:

mesh-llm serve --model Qwen2.5-32B        # downloads model (~20GB), starts API + web console
mesh-llm serve --model Qwen2.5-3B         # or a small model first (~2GB)

Add another machine:

mesh-llm serve --join <token>              # token printed by the first machine

Or discover and join public meshes:

mesh-llm serve --auto                      # find and join the best mesh
mesh-llm client --auto                     # join as API-only client (no GPU)

How it works

Every node gets an OpenAI-compatible API at http://localhost:9337/v1. Distribution is automatic: you just say mesh-llm serve --model X and the mesh figures out the best strategy:

  • Model fits on one machine? → runs solo, full speed, no network overhead
  • Dense model too big? → pipeline parallelism: layers split across nodes
  • MoE model too big? → expert parallelism: experts split across nodes, zero cross-node traffic

If a node has enough VRAM, it always runs the full model; splitting only happens when it has to. Mesh LLM currently uses a lightly forked llama.cpp (see the Justfile for the branch it pulls from).

Pipeline parallelism: for dense models that don't fit on one machine, layers are distributed across nodes in proportion to VRAM. llama-server runs on the highest-VRAM node and coordinates via RPC; each rpc-server loads only its assigned layers from local disk. Peer selection is latency-aware: peers are chosen by lowest RTT first, with an 80ms hard cap. High-latency nodes stay in the mesh as API clients but don't participate in splits.
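The assignment described above can be sketched as a pure function. This is an illustration, not mesh-llm's actual scheduler: only the VRAM-proportional split and the 80ms RTT cap come from the text; the rounding strategy and node names are assumptions.

```python
RTT_CAP_MS = 80.0  # peers above this RTT never participate in a split

def split_layers(n_layers, nodes):
    """Assign layer counts proportional to VRAM, lowest-RTT peers first.

    nodes maps name -> (vram_gb, rtt_ms). Returns name -> layer count.
    """
    eligible = sorted(
        ((name, vram, rtt) for name, (vram, rtt) in nodes.items() if rtt <= RTT_CAP_MS),
        key=lambda t: t[2],  # lowest RTT first
    )
    total_vram = sum(vram for _, vram, _ in eligible)
    plan, assigned = {}, 0
    for i, (name, vram, _) in enumerate(eligible):
        if i == len(eligible) - 1:
            plan[name] = n_layers - assigned  # last node absorbs rounding error
        else:
            plan[name] = round(n_layers * vram / total_vram)
            assigned += plan[name]
    return plan

# "c" exceeds the RTT cap, so it stays in the mesh as an API client only
plan = split_layers(80, {"a": (24, 2.0), "b": (48, 5.0), "c": (24, 120.0)})
```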

MoE expert parallelism: Mixture-of-Experts models (Qwen3-MoE, GLM, OLMoE, Mixtral, DeepSeek, increasingly the best-performing architectures) are auto-detected from the GGUF header. The mesh reads expert routing statistics to identify which experts matter most, then assigns each node an overlapping shard: a shared core of critical experts replicated everywhere, plus unique experts distributed across nodes. Each node gets a standalone GGUF with the full trunk plus its expert subset and runs its own independent llama-server, with zero cross-node traffic during inference. Sessions are hash-routed to nodes for KV cache locality.
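The overlapping-shard idea can be shown in a few lines. A sketch only: the shared-core-plus-unique-experts layout comes from the paragraph above, while the round-robin distribution of non-core experts is an assumption for illustration.

```python
def shard_experts(ranked, core_size, nodes):
    """Overlapping shards: top-ranked 'core' experts everywhere, rest distributed.

    ranked is a list of expert IDs ordered by routing importance.
    Returns node name -> set of expert IDs that node keeps locally.
    """
    core = set(ranked[:core_size])             # critical experts, replicated on every node
    shards = {n: set(core) for n in nodes}
    for i, expert in enumerate(ranked[core_size:]):
        shards[nodes[i % len(nodes)]].add(expert)  # unique experts, spread round-robin
    return shards

# 64 experts, 8 critical ones replicated, the remaining 56 split across 3 nodes
shards = shard_experts(list(range(64)), core_size=8, nodes=["n0", "n1", "n2"])
```

Because every node holds the core plus a disjoint slice, any single node can answer a request routed to it without fetching expert weights from peers.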

Multi-model: different nodes serve different models simultaneously. The API proxy peeks at the model field in each request and routes to the right node via QUIC tunnel. /v1/models lists everything available.

Demand-aware rebalancing: a unified demand map tracks which models the mesh wants (from --model flags, API requests, and gossip). Demand signals propagate infectiously across all nodes and decay naturally via TTL. Standby nodes auto-promote to serve unserved models with active demand, or rebalance when one model is significantly hotter than others. When a model loses its last server, standby nodes detect it within ~60s.
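A demand map with TTL decay is simple to sketch. This is illustrative, not mesh-llm's data structure: the class, field names, and 60-minute default TTL are assumptions; only the record-signals-and-expire behavior comes from the description above.

```python
import time

class DemandMap:
    """Track per-model demand signals that expire after a TTL (sketch)."""

    def __init__(self, ttl_s=3600.0):
        self.ttl_s = ttl_s
        self.signals = {}  # model name -> list of signal timestamps

    def record(self, model, now=None):
        """Register a demand signal (a --model flag, API request, or gossip)."""
        ts = time.time() if now is None else now
        self.signals.setdefault(model, []).append(ts)

    def demand(self, model, now=None):
        """Count live signals; expired ones decay away on read."""
        now = time.time() if now is None else now
        live = [t for t in self.signals.get(model, []) if now - t < self.ttl_s]
        self.signals[model] = live
        return len(live)

dm = DemandMap(ttl_s=60.0)
dm.record("Qwen2.5-32B", now=0.0)
dm.record("Qwen2.5-32B", now=30.0)
```

A standby node would promote itself when demand(model) is positive but no peer currently serves that model.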

Inter-model collaboration: models on the mesh help each other during inference. When a text-only model receives an image, it silently consults a vision model on the mesh for a caption and generates from that. When a small model is uncertain, it races two peers for a second opinion and injects the winner's answer as context. When a model gets stuck in a repetition loop, another model nudges it out. The caller sees one seamless response and never knows multiple models collaborated. Inspired by Mixture of Models (NSED): the mesh is the ensemble. See VIRTUAL_LLM.md.

Latency design: the key insight is that HTTP streaming is latency-tolerant while RPC is latency-multiplied. llama-server always runs on the same box as the GPU. The mesh tunnels HTTP, so cross-network latency only affects time-to-first-token, not per-token throughput. RPC only crosses the network for pipeline splits where the model physically doesn't fit on one machine.

Network optimizations

  • Zero-transfer GGUF loading: SET_TENSOR_GGUF tells rpc-server to read weights from local disk. Dropped model load from 111s to 5s.
  • RPC round-trip reduction: cached get_alloc_size, skipped GGUF lookups for intermediates. Per-token round-trips: 558 down to 8.
  • Direct server-to-server transfers: intermediate tensors are pushed directly between rpc-servers via TCP, not relayed through the client.
  • Speculative decoding: a draft model runs locally on the host and proposes tokens verified in one batched forward pass. +38% throughput on code (75% acceptance).
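The speculative-decoding gain can be reasoned about with the standard expected-acceptance arithmetic. This is generic math, not measured mesh-llm behavior beyond the 75% acceptance figure quoted above; the draft length of 4 tokens is an assumption.

```python
def expected_tokens_per_verify(accept_rate, draft_len):
    """Expected tokens committed per batched verification pass.

    With per-token acceptance probability a and k drafted tokens, a pass
    commits 1 + a + a^2 + ... + a^k tokens on average: generation stops at
    the first rejected draft token, which the target model replaces itself.
    """
    a, k = accept_rate, draft_len
    return sum(a**i for i in range(k + 1))

# At 75% acceptance with 4 drafted tokens, each target forward pass yields
# roughly 3 tokens instead of 1, which is where the throughput gain comes from.
gain = expected_tokens_per_verify(0.75, 4)
```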

Usage

Start a mesh

mesh-llm serve --model Qwen2.5-32B

Starts serving a model and prints an invite token. This mesh is private β€” only people you share the token with can join.

To make it public (discoverable by others via --auto):

mesh-llm serve --model Qwen2.5-32B --publish

Join a mesh

mesh-llm serve --join <token>              # join with invite token (GPU node)
mesh-llm client --join <token>             # join as API-only client (no GPU)

Named mesh (buddy mode)

mesh-llm serve --auto --model GLM-4.7-Flash-Q4_K_M --mesh-name "poker-night"

Everyone runs the same command. The first person creates it; everyone else discovers "poker-night" and joins automatically. --mesh-name implies --publish: named meshes are always published to the directory.

Auto-discover

mesh-llm serve --auto                      # discover, join, and serve a model
mesh-llm client --auto                     # join as API-only client (no GPU)
mesh-llm discover                          # browse available meshes
mesh-llm gpus                              # inspect local GPUs and stable IDs

Inspect and clean the shared model cache

mesh-llm models installed
mesh-llm models cleanup --unused-since 30d
mesh-llm models cleanup --unused-since 30d --yes

models installed shows whether a cached model is mesh-managed or external, plus the last time mesh-llm used it. models cleanup only removes model files that mesh-llm explicitly marked as mesh-managed; by default it prints a dry-run preview and requires --yes to delete anything.

Multi-model

mesh-llm serve --model Qwen2.5-32B --model GLM-4.7-Flash

# Route by model name
curl localhost:9337/v1/chat/completions -d '{"model":"GLM-4.7-Flash-Q4_K_M", ...}'

Different nodes serve different models. The API proxy routes by the model field.
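Routing by the model field amounts to a lookup against each node's served models. A hypothetical sketch; the real proxy routes over a QUIC tunnel, but the selection logic it describes is just this:

```python
import json

def route(request_body, models_by_node):
    """Pick the node serving the model named in the request (illustrative)."""
    model = json.loads(request_body)["model"]
    for node, models in models_by_node.items():
        if model in models:
            return node
    raise LookupError("no node serves " + model)

table = {"node-a": ["Qwen2.5-32B"], "node-b": ["GLM-4.7-Flash-Q4_K_M"]}
req = json.dumps({"model": "GLM-4.7-Flash-Q4_K_M", "messages": []}).encode()
```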

Inspect local GPUs

mesh-llm gpus
mesh-llm gpus --json
mesh-llm gpu benchmark --json

mesh-llm gpus prints local GPU entries, backend device names, stable IDs, VRAM, unified-memory state, and cached bandwidth when a benchmark fingerprint is already available. Add --json for machine-readable inventory output, or run mesh-llm gpu benchmark --json to refresh the local fingerprint and print the benchmark result as JSON.

Use only pinnable Stable ID / stable_id values from mesh-llm gpus or mesh-llm gpus --json for pinned startup config. Stable-ID fallback values such as index:* or backend-device names like CUDA0 / HIP0 / MTL0 can still be printed for inventory purposes, but they are not valid pin targets.
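The pinnable-versus-fallback distinction can be expressed as a small predicate. The accepted prefixes here (pci:, uuid:) are taken from the config examples later in this README, and the rejected forms (index:*, CUDA0/HIP0/MTL0) from the paragraph above; treat the exact rule set as an assumption, not the shipped validator.

```python
FALLBACK_BACKEND_PREFIXES = ("CUDA", "HIP", "MTL")  # backend device names like CUDA0

def is_pinnable(gpu_id):
    """True only for stable IDs that are valid pin targets (sketch)."""
    if gpu_id.startswith("index:"):
        return False  # positional fallback, not stable across reboots
    if gpu_id.startswith(FALLBACK_BACKEND_PREFIXES):
        return False  # backend device name, inventory display only
    return gpu_id.startswith(("pci:", "uuid:"))
```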

Startup config

mesh-llm serve can now load startup models from ~/.mesh-llm/config.toml:

version = 1

[gpu]
assignment = "pinned"

[[models]]
model = "Qwen3-8B-Q4_K_M"
gpu_id = "pci:0000:65:00.0"

[[models]]
model = "bartowski/Qwen2.5-VL-7B-Instruct-GGUF/qwen2.5-vl-7b-instruct-q4_k_m.gguf"
mmproj = "bartowski/Qwen2.5-VL-7B-Instruct-GGUF/mmproj-f16.gguf"
ctx_size = 8192
gpu_id = "uuid:GPU-12345678"

[[plugin]]
name = "blackboard"
enabled = true

Start with the default config path:

mesh-llm serve

If no startup models are configured, mesh-llm serve prints a warning, shows help, and exits.

Or point at a different file:

mesh-llm serve --config /path/to/config.toml

Precedence rules:

  • Explicit --model or --gguf ignores configured [[models]].
  • Explicit --ctx-size overrides configured ctx_size for the selected startup models.
  • Plugin entries still live in the same file.

Pinned startup notes:

  • assignment = "pinned" requires every configured [[models]] entry to include a gpu_id.
  • Valid gpu_id values come from the pinnable stable IDs reported by mesh-llm gpus / mesh-llm gpus --json, not fallback inventory IDs.
  • Pinned configs fail closed when a configured ID is missing, ambiguous, unsupported on the local backend, or no longer resolves on the current machine.
  • Explicit --model / --gguf still bypass configured [[models]], so they also bypass config-owned pinned gpu_id values.
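The precedence rules above amount to a small merge function. A sketch with the config represented as a plain dict, as if parsed from config.toml; the function itself is hypothetical.

```python
def select_startup_models(config, cli_model, cli_ctx):
    """Apply the documented precedence: --model wins outright, --ctx-size overrides."""
    if cli_model is not None:
        models = [{"model": cli_model}]        # explicit --model ignores [[models]]
    else:
        models = [dict(m) for m in config.get("models", [])]
    if cli_ctx is not None:
        for m in models:
            m["ctx_size"] = cli_ctx            # explicit --ctx-size overrides config
    return models

cfg = {"models": [{"model": "Qwen3-8B-Q4_K_M", "ctx_size": 8192,
                   "gpu_id": "pci:0000:65:00.0"}]}
```

Note the consequence called out above: because an explicit --model replaces the configured list entirely, it also drops any config-owned pinned gpu_id.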

No-arg behavior

mesh-llm                                   # no args: prints --help and exits

Does not start the console or bind any ports. Use the CLI flags shown in --help to start or join a mesh.

Background service

To install it as a per-user background service:

curl -fsSL https://raw.githubusercontent.com/Mesh-LLM/mesh-llm/main/install.sh | bash -s -- --service

Service installs are user-scoped:

  • macOS installs a launchd agent at ~/Library/LaunchAgents/com.mesh-llm.mesh-llm.plist
  • Linux installs a systemd --user unit at ~/.config/systemd/user/mesh-llm.service
  • Shared environment config lives in ~/.config/mesh-llm/service.env
  • Startup models live in ~/.mesh-llm/config.toml

The two platforms handle launch startup the same way:

  • macOS: launchd runs ~/.config/mesh-llm/run-service.sh, which loads service.env and executes mesh-llm serve.
  • Linux: the installer writes mesh-llm serve directly into ExecStart= in ~/.config/systemd/user/mesh-llm.service.

The background service no longer stores custom startup args. Configure startup models in ~/.mesh-llm/config.toml instead.

service.env is optional and shared by both platforms. Use plain KEY=value lines, for example:

MESH_LLM_NO_SELF_UPDATE=1

If you edit the Linux unit manually, reload and restart it:

systemctl --user daemon-reload
systemctl --user restart mesh-llm.service

On Linux this is a user service, so if you want it to keep running after reboot before login, enable lingering once:

sudo loginctl enable-linger "$USER"

Web console

mesh-llm serve --model Qwen2.5-32B    # dashboard at http://localhost:3131

Live topology, per-node GPU capacity, model picker, and built-in chat. Live members show only the Client, Standby, Loading, and Serving badges. Wakeable provider-backed capacity is shown separately from topology and stays out of routing until it rejoins. Everything comes from /api/status (JSON) and /api/events (SSE).

Multimodal Support

mesh-llm supports multimodal requests on:

  • POST /v1/chat/completions
  • POST /v1/responses

The console supports image, audio, and file attachments. Large attachments use request-scoped blob upload rather than permanent storage.

Current support matrix

| Family / model type | Vision | Audio | Notes |
| --- | --- | --- | --- |
| Qwen3-VL, Qwen3VL | yes | no | Example: Qwen3VL-2B-Instruct-Q4_K_M |
| Qwen2-VL, Qwen2.5-VL | yes | no | Vision-capable Qwen VL families |
| LLaVA, mllama, PaliGemma, Idefics, Molmo, InternVL, GLM-4V, Ovis, Florence | yes | no | Detected as vision-capable families |
| Qwen2-Audio | no | yes | Audio-capable family |
| SeaLLM-Audio | no | yes | Audio-capable family |
| Ultravox | no | yes | Audio-capable family |
| Omni | no or metadata-dependent | yes | Example: Qwen2.5-Omni-3B-Q4_K_M |
| Whisper | no | yes | Audio-capable family |
| Any GGUF with mmproj sidecar | yes | depends | Strong local signal for vision support |
| Any model with vision_config / vision token IDs | yes | depends | Promoted by metadata |
| Any model with audio_config / audio token IDs | depends | yes | Promoted by metadata |
| Generic multimodal, -vl, image, video, voice naming only | likely | likely | Hint only, not a strong routing guarantee |

Notes:

  • yes means mesh-llm treats the model as runtime-capable for routing and UI.
  • likely means mesh-llm shows a weaker hint but does not rely on it as a hard capability.
  • Mixed image+audio requests work only when the selected model/runtime actually supports both modalities.
  • Non-goals: POST /v1/audio/transcriptions, POST /v1/audio/speech, and v1/realtime.

For the full capability and transport details, see mesh-llm/docs/MULTI_MODAL.md.
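The matrix above boils down to a capability classifier over a few signals. A sketch under stated assumptions: the signal names (mmproj sidecar, vision_config / audio_config metadata, family and naming hints) come from the matrix, but the family lists here are deliberately partial and the combination logic is illustrative, not the shipped detector.

```python
VISION_FAMILIES = ("qwen3-vl", "qwen2-vl", "qwen2.5-vl", "llava", "paligemma")  # partial
AUDIO_FAMILIES = ("qwen2-audio", "seallm-audio", "ultravox", "whisper")          # partial

def detect_modalities(name, has_mmproj, metadata):
    """Classify vision/audio capability from the signals in the matrix (sketch)."""
    caps = {"vision": "no", "audio": "no"}
    lname = name.lower()
    if has_mmproj or "vision_config" in metadata or any(f in lname for f in VISION_FAMILIES):
        caps["vision"] = "yes"   # strong signals promote to runtime-capable
    if "audio_config" in metadata or any(f in lname for f in AUDIO_FAMILIES):
        caps["audio"] = "yes"
    # weak naming hints only ever promote to "likely", never to "yes"
    if caps["vision"] == "no" and any(h in lname for h in ("-vl", "multimodal", "image")):
        caps["vision"] = "likely"
    return caps
```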

Development

Build-from-source and UI development instructions are in CONTRIBUTING.md.

Using with agents

mesh-llm exposes an OpenAI-compatible API on localhost:9337. Any tool that supports custom OpenAI endpoints works. /v1/models lists available models; the model field in requests routes to the right node.

For built-in launcher integrations (goose, claude, opencode):

  • If a mesh is already running locally on --port, it is reused.
  • If not, mesh-llm auto-starts a background client node that auto-joins the mesh.
  • If --model is omitted, the launcher picks the strongest tool-capable model available on the mesh.
  • When the harness exits (e.g. claude quits), the auto-started node is cleaned up automatically.

goose

Goose is available as both CLI (goose session) and desktop app (Goose.app).

mesh-llm goose

Use a specific model (example: MiniMax):

mesh-llm goose --model MiniMax-M2.5-Q4_K_M

This command writes/updates ~/.config/goose/custom_providers/mesh.json and launches Goose.

opencode

OpenCode uses a temporary provider config injected by Mesh, so you don't need to edit local config files by hand. For the full advanced or manual setup, see docs/AGENTS.md.

mesh-llm opencode

Use a specific model (example: MiniMax):

mesh-llm opencode --model MiniMax-M2.5-Q4_K_M

pi

  1. Start a mesh client:
     mesh-llm client --auto --port 9337
  2. Check what models are available:
     curl -s http://localhost:9337/v1/models | jq '.data[].id'

Lemonade

mesh-llm ships a built-in lemonade plugin that registers a local Lemonade Server as another OpenAI-compatible backend. For setup and verification steps, see docs/USAGE.md.


How it works

Mesh LLM keeps the user-facing surface simple: talk to localhost:9337, pick a model, and let the mesh decide how to serve it.

  • If a model fits on one machine, it runs there with no network overhead.
  • If a dense model does not fit, layers are split across low-latency peers.
  • If an MoE model does not fit, experts are split across nodes and requests are hash-routed for cache locality.
  • Different nodes can serve different models at the same time.

Each node also exposes a management API and web console on port 3131.

Install notes

The installer currently targets macOS and Linux release bundles. Windows coming soon.

To force a specific bundled flavor during install:

curl -fsSL https://raw.githubusercontent.com/Mesh-LLM/mesh-llm/main/install.sh | MESH_LLM_INSTALL_FLAVOR=vulkan bash

Installed release bundles use flavor-specific llama.cpp binaries:

  • macOS: metal
  • Linux: cpu, cuda, rocm, vulkan
  • Linux ARM64 CPU: cpu (asset triple: aarch64-unknown-linux-gnu)

For release and install naming, arm64 and aarch64 both refer to the same 64-bit ARM target. Generic 32-bit ARM is not a published release target.

To update a bundle install to the latest release:

mesh-llm update

To install a specific bundled release tag:

mesh-llm update --version v0.X.Y

If you build from source, always use just:

git clone https://github.com/Mesh-LLM/mesh-llm
cd mesh-llm
just build

Requirements and backend-specific build notes are in CONTRIBUTING.md.

Web console

When a node is running, open:

http://localhost:3131

The console shows live topology with only Client, Standby, Loading, and Serving badges for live members, plus separate wakeable capacity, VRAM usage, loaded models, and built-in chat. Wakeable inventory is not part of topology peers or routing until it rejoins. It is backed by /api/status and /api/events.

To run without the embedded UI (for example, in a headless server environment), pass --headless:

mesh-llm serve --model Qwen2.5-3B --headless

In headless mode, the web console routes (/, /dashboard, /chat) return 404. The management API (/api/*) stays fully available on the --console port.

You can also try the hosted demo:

mesh-llm-console.fly.dev


Community

Join the #mesh-llm channel on the Goose Discord for discussion and support.

Release History

VersionChangesUrgencyDate
v0.64.0## What's Changed * fix: remove unintended PR job from 307 by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/310 * Share split GGUF MoE rankings under stable distribution refs by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/314 * feat: Implement light mode for topology diagram by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/302 * [codex] Bump llama.cpp fork pin to e88186e78777 by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/315 * Use published MoE analysis for fit pHigh4/20/2026
v0.63.0-rc5**Full Changelog**: https://github.com/Mesh-LLM/mesh-llm/compare/v0.63.0-rc4...v0.63.0-rc5High4/18/2026
v0.63.0-rc4## What's Changed * Run releases from GitHub Actions instead of tag pushes by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/323 * feature: add embedded SDK support by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/233 * Let moe analyze share rankings automatically by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/325 * Add optional CUDA and ROCm skip for prerelease releases by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/326 * Keep auto-routed chats on one model and one peHigh4/18/2026
v0.63.0-rc2## What's Changed * Update LLAMA_CPP_SHA hash value by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/320 * models: inspect and clean up mesh-managed cache entries safely by @IvGolovach in https://github.com/Mesh-LLM/mesh-llm/pull/300 * Align MoE split cache naming with llama.cpp by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/321 **Full Changelog**: https://github.com/Mesh-LLM/mesh-llm/compare/v0.63.0-rc.1...v0.63.0-rc2High4/18/2026
v0.63.0-rc.1## What's Changed * fix: remove unintended PR job from 307 by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/310 * Share split GGUF MoE rankings under stable distribution refs by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/314 * feat: Implement light mode for topology diagram by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/302 * [codex] Bump llama.cpp fork pin to e88186e78777 by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/315 * Use published MoE analysis for fit pHigh4/17/2026
v0.62.1## What's Changed * Require mesh-llm/1 for mesh joins by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/299 * scripts: fix silent build exit when neither ccache nor sccache is installed by @ventz in https://github.com/Mesh-LLM/mesh-llm/pull/305 * remove cuda 103 pr259 by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/307 ## New Contributors * @ventz made their first contribution in https://github.com/Mesh-LLM/mesh-llm/pull/305 **Full Changelog**: https://github.com/Mesh-LLM/mesh-llHigh4/17/2026
v0.60.3## What's Changed * chore: parameterize cuda version by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/295 * Fix split GGUF shorthand resolution to download full model shards by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/297 **Full Changelog**: https://github.com/Mesh-LLM/mesh-llm/compare/v0.60.2...v0.60.3High4/16/2026
v0.60.2## What's Changed * fix: preserve trailing newline in release version updates by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/288 * chore(ci-runner): add support for tagged self-hosted runners by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/289 * OpenAI proxy by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/214 * Improve topology visualization layout and zoom behavior by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/290 * Replace Fly relay with iroh-managed AP SouthHigh4/15/2026
v0.61.1## What's Changed * fix: release packaging EXIT trap references out-of-scope local variable by @michaelneale in https://github.com/Mesh-LLM/mesh-llm/pull/287 **Full Changelog**: https://github.com/Mesh-LLM/mesh-llm/compare/v0.61.0...v0.61.1High4/15/2026
v0.60.0-rc.4## What's Changed * Update all references from michaelneale/mesh-llm to Mesh-LLM/mesh-llm by @michaelneale in https://github.com/Mesh-LLM/mesh-llm/pull/264 * mesh: refresh degraded relay-backed peer connections by @IvGolovach in https://github.com/Mesh-LLM/mesh-llm/pull/229 * fix: cache value / mechanism by @ndizazzo in https://github.com/Mesh-LLM/mesh-llm/pull/267 * Fix tunnel listener fd leak on peer departure by @michaelneale in https://github.com/Mesh-LLM/mesh-llm/pull/266 * feat: add fail-cMedium4/14/2026
v0.60.0-rc.3## What's Changed * pdf quick fix by @michaelneale in https://github.com/Mesh-LLM/mesh-llm/pull/257 * Fix inconsistent peer views: PeerDown verification and transitive peer stability by @michaelneale in https://github.com/Mesh-LLM/mesh-llm/pull/242 * Decompose mesh/mod.rs: extract tests, heartbeat, and gossip by @michaelneale in https://github.com/Mesh-LLM/mesh-llm/pull/258 * Fix Hugging Face job pricing decode for hardware flavors by @i386 in https://github.com/Mesh-LLM/mesh-llm/pull/261 * [codMedium4/13/2026
v0.60.0-rc.2## What's Changed * Add installer support for prerelease builds by @i386 in https://github.com/michaelneale/mesh-llm/pull/254 * fix: Update docker publish contract for multi-arch tags by @ndizazzo in https://github.com/michaelneale/mesh-llm/pull/255 **Full Changelog**: https://github.com/michaelneale/mesh-llm/compare/v0.60.0-rc.1...v0.60.0-rc.2Medium4/13/2026
v0.60.0-rc.1## What's Changed * [codex] Add a pre-commit checklist for common CI failures by @i386 in https://github.com/michaelneale/mesh-llm/pull/250 * Trim CUDA slim CI llama.cpp targets by @i386 in https://github.com/michaelneale/mesh-llm/pull/251 * client: only surface models that can actually serve requests by @IvGolovach in https://github.com/michaelneale/mesh-llm/pull/226 * split serving: recover dense models on surviving workers by @IvGolovach in https://github.com/michaelneale/mesh-llm/pull/232 * Medium4/12/2026
v0.59.0## What's Changed * Shrink UI bundle 58% by replacing elkjs with simple layout by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/218 * chore(react): add a quick recipe to speed up UI dev using public mesh by @ndizazzo in https://github.com/michaelneale/mesh-llm/pull/221 * Harden GGUF parser, identity key I/O, and symlink safety by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/216 * Remove unused Blackboard MCP code paths to eliminate Rust warnings by @i386 in httMedium4/11/2026
v0.58.0## What's Changed * ci: add warm-caches.yml to pre-warm llama.cpp CUDA cache on main by @ndizazzo in https://github.com/michaelneale/mesh-llm/pull/211 * Justfile: scope clean-ui Unix variant with [unix] to avoid duplicate on Windows by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/208 * Fix spurious 503s on slow prefill and remove tunnel double-proxy by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/213 **Full Changelog**: https://github.com/michaelneale/mesh-lMedium4/8/2026
v0.57.0## What's Changed * fix: prevent detail panel from being pushed when in fullscreen mode by @ndizazzo in https://github.com/michaelneale/mesh-llm/pull/202 * Smooth chat streaming and message queuing by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/205 * Show client nodes as tiny dots in mesh topology by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/206 * launch: scale health timeout by model size by @Bortlesboat in https://github.com/michaelneale/mesh-llm/pull/21Medium4/7/2026
v0.56.0## What's Changed * Disable windows by @i386 in https://github.com/michaelneale/mesh-llm/pull/143 * Remove client nodes from topology visualization by @i386 in https://github.com/michaelneale/mesh-llm/pull/145 * feature: mesh topology selector by @ndizazzo in https://github.com/michaelneale/mesh-llm/pull/152 * add navigable node detail sidebar by @i386 in https://github.com/michaelneale/mesh-llm/pull/148 * [codex] make README friendlier and split long reference docs by @i386 in https://github.coMedium4/7/2026
v0.55.1**Full Changelog**: https://github.com/michaelneale/mesh-llm/compare/v0.55.0...v0.55.1Medium4/3/2026
v0.55.0## What's Changed * feature: add support to skip jobs that don't have modified files in CI by @ndizazzo in https://github.com/michaelneale/mesh-llm/pull/116 * launch: asymmetric KV cache β€” Q8_0 K + Q4_0 V for 5-50GB models by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/104 * Fix compilation errors from missing `served_model_descriptors` in test struct initializers by @Copilot in https://github.com/michaelneale/mesh-llm/pull/120 * Fix compile errors in model descriptor test helMedium4/3/2026
v0.54.0## What's Changed * Add multi-SDK compat smoke tests to CI by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/100 * fix: rotate-key, Nostr discovery improvements, model pack simplification by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/98 * [codex] Cache the Vite UI build by @i386 in https://github.com/michaelneale/mesh-llm/pull/101 * feature(protocol): add support for transition to protobuf by @i386 in https://github.com/michaelneale/mesh-llm/pull/92 * fix: re-Medium4/1/2026
v0.53.1## What's Changed * Disable Windows release builds until llama.cpp CUDA fix by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/99 **Full Changelog**: https://github.com/michaelneale/mesh-llm/compare/v0.53.0...v0.53.1Medium3/31/2026
v0.52.0## What's Changed * feat: prefix-affinity routing for agentic scaffold reuse by @i386 in https://github.com/michaelneale/mesh-llm/pull/61 * fix: keep HTTP tunnel responses alive after request EOF by @i386 in https://github.com/michaelneale/mesh-llm/pull/59 * fix installer tmpdir cleanup trap by @i386 in https://github.com/michaelneale/mesh-llm/pull/81 * feat(bench): add memory bandwidth benchmark module with per-GPU fingerprinting and gossip propagation by @ndizazzo in https://github.com/michaelMedium3/30/2026
v0.51.0## What's Changed * Run rustfmt by @i386 in https://github.com/michaelneale/mesh-llm/pull/76 * [codex] Fix installer stdin entrypoint by @i386 in https://github.com/michaelneale/mesh-llm/pull/77 **Full Changelog**: https://github.com/michaelneale/mesh-llm/compare/v0.50.0...v0.51.0Medium3/30/2026
v0.50.0## What's Changed * Add Rust formatting rule before commits by @i386 in https://github.com/michaelneale/mesh-llm/pull/41 * fix detect_bin_dir: correct off-by-one in cargo dev path by @michaelneale in https://github.com/michaelneale/mesh-llm/pull/45 * Add tagged cross-platform GitHub release workflow by @i386 in https://github.com/michaelneale/mesh-llm/pull/36 * Autoupdater by @i386 in https://github.com/michaelneale/mesh-llm/pull/48 * fix: blurry react-flow on GPU hover by @ndizazzo in https://gMedium3/30/2026
v0.49.0## What's new - **Full MCP plugin runtime** β€” mesh-llm now ships a complete MCP v1 bridge for plugins, with protobuf IPC, tool/prompt/resource routers, and plugin supervision - **Rust plugin SDK** β€” build native mesh-llm plugins in Rust with the new `mesh-plugin` crate - **Plugin mesh visibility control** β€” plugins can reject startup based on mesh visibility - **KV efficiency fixes** β€” fixed context size for split mode, added `--ctx-size` flag, enabled flash attention by default - **Workspace rMedium3/27/2026
v0.48.0- Hardware detection module: modular Collector trait with platform-specific collectors (macOS, Linux NVIDIA/AMD, Jetson/Tegra) - GPU name, hostname, SoC status surfaced in gossip and management API - Privacy: hostname and GPU name require opt-in `--enumerate-host` flag; `is_soc` always shared - Per-GPU VRAM breakdown in API and UI - UI: vendor-colored GPU badges, hostname display, SoC/CPU icons - Backward compatible: old nodes ignore new gossip fields - 170 tests (50 new)Medium3/26/2026
## v0.47.0 (3/26/2026)

- Linux support: build from source with CUDA, CPU-only, or Jetson/Tegra
- Linux RAM detection: CPU-only machines detect system RAM, GPU machines report VRAM + RAM offload capacity
- Jetson/Tegra: tegrastats fallback for VRAM detection (nvidia-smi broken on Orin AGX)
- Linux build scripts: `just build` auto-detects platform (macOS → Metal, Linux → CUDA)
- CUDA arch auto-detection: nvidia-smi → deviceQuery → GPU model lookup
- CI: adds `cargo build --release` + CLI smoke test
- Docs: Linux install ins
## v0.46.0 (3/25/2026)

- Blackboard on by default for private meshes
- Extend model demand TTL from 2h to 24h
- Fix install URL in console UI
## v0.45.0 (3/24/2026)

- Blackboard on by default for private meshes (no `--blackboard` flag needed)
- Blackboard off on public meshes (`--auto`) unless `--blackboard` is explicitly passed
- Warning when using blackboard on public meshes
- Fix install URL in console UI (was `mesh-llm.com/install`, now links to `docs.anarchai.org/#install`)
- Updated docs page with blackboard usage and privacy guidance
## v0.44.0 (3/24/2026)

Rebased llama.cpp fork on latest upstream (419 commits).

**Upstream highlights picked up:**
- RPC remote code execution security patch
- Metal CONV_3D support
- bf16 native flash attention (CUDA)
- Server Host header fix, httplib dynamic threads
- Grammar parsing fix (stack overflow prevention)
- Memory fix for recurrent models

**Our patches (all rebased clean):**
- Zero-transfer tensor loading (SET_TENSOR_GGUF)
- RPC probing skip
- get_alloc_size cache
- B2B direct server-to-server transfers
## v0.43.0 (3/24/2026)

- Add `mesh-llm stop` command
- Clean up `--help`: hide advanced options, use `--help-advanced` to see all
- CI workflow (GitHub Actions): cargo test on every PR
- Fix flaky router tests
- Remove duplicate `#[test]` attributes
## v0.42.0 (3/24/2026)

- Updated mesh diagram
- Fixed repo name references (AGENTS.md, main.rs)
- Blackboard/gossip visibility in README, docs, UI
- MCP server documented in docs
- Work-in-progress badge on docs site
- goose lowercase in docs
- NVIDIA Nemotron link in research section
- Removed stale fly/api config
- Install uses ~/.local/bin (no sudo)
- Repo renamed to michaelneale/mesh-llm
## v0.41.0 (3/24/2026)

- Blackboard/gossip visibility in README, docs site, and UI
- MCP server documented in docs blackboard section
- Work-in-progress badge on docs site
- goose lowercase throughout docs
- NVIDIA Nemotron Coalition link in research section
- Removed stale fly/api app config
- Repo renamed: michaelneale/mesh-llm
- Fixed install curl asset name (mesh-bundle.tar.gz)
## v0.40.0 (3/24/2026)

### What's new
- **Thinking disabled by default**: `--reasoning-budget 0` on all llama-servers. Same answer quality, 2x faster for agentic workloads. API users can still opt in per-request with `chat_template_kwargs: {"enable_thinking": true}`.
- **Blackboard MCP server**: `mesh-llm blackboard --mcp` exposes the blackboard as MCP tools (`blackboard_post`, `blackboard_search`, `blackboard_feed`) for agent collaboration across the mesh.
- **Web UI improvements**: no hidden thinking in chat response
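Since thinking is now off by default, a request that wants reasoning tokens back has to ask for them explicitly. A minimal sketch of such a request body, using the `chat_template_kwargs` field from the notes above (the `auto` model name and the prompt are placeholders, not part of the release notes):

```shell
# Build a request body that re-enables thinking for this one call.
# chat_template_kwargs comes from the v0.40.0 notes; model/prompt are illustrative.
BODY='{"model":"auto","messages":[{"role":"user","content":"Why is the sky blue?"}],"chat_template_kwargs":{"enable_thinking":true}}'
echo "$BODY"
```

Send it to the local endpoint with `curl http://localhost:9337/v1/chat/completions -H "Content-Type: application/json" -d "$BODY"`.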
## v0.39.0 (3/23/2026)

**Blackboard**: shared agent collaboration across the mesh. Agents and people post status, findings, and questions to an ephemeral blackboard that propagates across the mesh. No cloud, no external services.

```bash
# Enable on any node
mesh-llm --client --blackboard

# Install agent skill (pi, Goose, others)
mesh-llm blackboard install-skill

# Post, search, read
mesh-llm blackboard "STATUS: [org/repo branch:main] working on billing refactor"
mesh-llm blackboard --search "billing"
mesh-llm bl
```
## v0.38.6 (3/23/2026)

- Chat UI: render mermaid diagrams (CDN lazy-load, zero bundle cost)
- Chat UI: render LaTeX math with KaTeX (CDN lazy-load)
- Mermaid/math show as raw code while streaming, then render after the response completes
- Bundle size unchanged (926KB): both are loaded on demand
## v0.38.5 (3/23/2026)

- Fix chat 400 errors on long conversations: bump small-host context from 8K to 16K
- Router: stop preferring small models for chat; the higher tier is always preferred
- Tighten the random model spread window (50 → 15 points) so weak models are fallbacks, not equal contenders
- Add Qwen3.5-9B router profile (tier 2)
- TODO: context-aware routing and retry-on-400 noted as future work
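The tightened spread window amounts to a simple filter: only models whose score is within 15 points of the top score stay in the random-pick pool. A sketch with made-up names and scores (only the 15-point window comes from the notes; the actual router scoring lives in the Rust code):

```shell
# Keep only models within 15 points of the top score.
# Model names and score values below are illustrative.
top=100
pool=""
for entry in "strong:100" "mid:92" "weak:60"; do
  score=${entry#*:}                      # score after the colon
  if [ $((top - score)) -le 15 ]; then
    pool="$pool ${entry%:*}"             # inside the window: eligible
  fi
done
echo "pool:$pool"   # "weak" (60) falls outside the window
```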
## v0.38.4 (3/22/2026)

### What's new
- **Expandable thinking stream**: click to expand and watch reasoning tokens stream in live while the model is thinking
- **Auto routing load spread**: concurrent chat requests spread across top-scoring models instead of queueing on one
- **CDN-ready cache headers**: hashed static assets served with immutable cache headers, SSE unbuffered for Cloudflare/CDN compatibility
- Updated project description and chat placeholder
## v0.38.3 (3/22/2026)

### What's new
- **Expandable thinking stream**: while a model is thinking, click to expand and watch reasoning tokens stream in live (previously just showed a static spinner)
- **Auto routing load spread**: concurrent chat requests in auto mode now spread across top-scoring models instead of always queueing on the same one
- Updated project description and chat placeholder
## v0.38.2 (3/21/2026)

### What's new
- Updated project description across chat and dashboard pages
- Chat placeholder now says "Ask me anything..."
- Dashboard shows a project banner with links to docs and GitHub
- Wording adapts for public vs private mesh

### Install (macOS Apple Silicon)

```bash
curl -fsSL https://github.com/michaelneale/decentralized-inference/releases/latest/download/mesh-llm-aarch64-apple-darwin.tar.gz | tar xz && sudo mv mesh-bundle/* /usr/local/bin/
```
## v0.38.0 (3/20/2026)

### Vision/Multimodal Support
Models with vision capability now automatically get their multimodal projector (mmproj) downloaded and launched with `--mmproj`.

### New
- **Vision support**: catalog models can declare an mmproj file. Download fetches it alongside the model. llama-server launches with `--mmproj` automatically.
- **3 new vision models**: Qwen3.5-0.8B, 4B, 9B (all vision-native)
- **Qwen3.5-27B** tagged as vision-capable with mmproj
- **UI image attach**: button appears for vision-c
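With a vision-capable model loaded, image input goes through the usual OpenAI-style content-array message shape. A sketch of such a payload (the message format is the standard OpenAI one, which the API advertises compatibility with; the base64 content is a placeholder, not a real image):

```shell
# OpenAI-style vision message: a text part plus an image_url part with a
# data URL. The base64 payload here is a placeholder, not a real image.
BODY='{"model":"auto","messages":[{"role":"user","content":[{"type":"text","text":"What is in this image?"},{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,PLACEHOLDER"}}]}]}'
echo "$BODY"
```

POST it to `http://localhost:9337/v1/chat/completions` as in the quick start.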
## v0.37.6 (3/20/2026)

### What's new
- **Download never gives up**: infinite retry with exponential backoff (3s → 60s cap). If data was flowing before the interruption, the backoff resets. No more failing after 100 retries on large model downloads.
- **Disk space check**: checks free disk space before starting a download and bails early with a clear message if there isn't enough room (+1GB headroom).
- **Less log spam**: download attempt numbers are only shown for the first 3, then every 10th.
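The retry policy above can be sketched as a small pure function: double the delay on a stalled attempt, cap it at 60s, and reset to 3s whenever bytes were flowing before the interruption (the 3s/60s values come from the notes; the real implementation lives in the Rust downloader):

```shell
# step CURRENT_DELAY progress|stalled -> prints the next delay in seconds
step() {
  if [ "$2" = "progress" ]; then
    echo 3                            # data flowed: restart the backoff
    return 0
  fi
  d=$(( $1 * 2 ))                     # exponential growth on a dead stall
  if [ "$d" -gt 60 ]; then d=60; fi   # cap at 60s
  echo "$d"
}
step 3 stalled     # prints 6
step 48 stalled    # prints 60 (96, capped)
step 60 progress   # prints 3
```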
## v0.37.5 (3/20/2026)

### What's new
- **UI polish**: public empty state with project link, private mesh install link, agent launcher commands in the API popover
- **Docs site**: new headline and tagline
- **Fly fix**: remove concurrency limits that caused MiniMax misrouting
- **Revert**: undo unnecessary client connection changes; the original code is correct
## v0.37.4 (3/19/2026)

### What's new
- **Chat layout fix**: the input area was cut off on desktop; the scroll area now properly shares flex space
- **Mobile polish**: textarea uses a 16px font (prevents iOS Safari auto-zoom), viewport locked, footer hidden on the chat page
- **Docs links everywhere**: empty states, dashboard, and the API popover all link to the docs site
- **API popover simplified**: just the endpoint URL + setup guide link (removed verbose agent commands)
- **Custom domain**: console available at www.mesh-llm.com
## v0.37.3 (3/19/2026)

### What's new
- **Demand tracking cleanup**: `auto` and split GGUF suffixes no longer pollute the demand map
- **Mobile chat fixes**: code blocks and tables no longer push the viewport wider on phones
## v0.37.2 (3/19/2026)

### What's new
- **Fix 90% slowdown in Claude Code**: `CLAUDE_CODE_ATTRIBUTION_HEADER=0` is now set by `mesh-llm claude`. Without it, Claude Code prepends a changing header that invalidates the KV cache on every request, massively degrading performance with local models. ([ref](https://unsloth.ai/docs/basics/claude-code))
- Disables attribution on commits/PRs, telemetry, and the progress bar; all optimized for local inference
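If you start Claude Code by hand instead of through `mesh-llm claude`, the same environment variable (named in the notes above) can be exported before launch:

```shell
# Disable the per-request attribution header so the KV cache stays warm.
# Variable name from the v0.37.2 notes and the linked unsloth page.
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
echo "$CLAUDE_CODE_ATTRIBUTION_HEADER"
```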
## v0.37.1 (3/18/2026)

### What's new
- **Publish watchdog fix**: the surviving node now takes over Nostr publishing even with zero peers. Previously, if the only other node died, the survivor wouldn't publish, making the mesh invisible to new nodes.

Includes all v0.37.0 changes (smart agent model selection, launcher auto-start/cleanup, Anthropic format support).
## v0.37.0 (3/18/2026)

### What's new
- **Smart agent model selection**: `mesh-llm claude` and `mesh-llm goose` now pick the strongest tool-capable model automatically (e.g. MiniMax over Qwen3-8B) instead of relying on per-request routing
- **Router fix**: requests with a tools schema (from Claude Code, Goose, etc.) now always route to the strongest model; previously the first "hello" message could land on a small fast model
- **Auto-start mesh**: launchers auto-start a client node if no mesh is running, and clean up
## v0.36.6 (3/18/2026)

Chat UI now shows a "Thinking…" spinner while the model reasons, then collapses it into an expandable accordion above the response.
## v0.36.5 (3/17/2026)

Fix: serve manifest.json and app icons from the root path. Add to Home Screen now works on iPhone/Android.
