cyllama is a comprehensive, dependency-free Python library for local AI inference, built on the state-of-the-art .cpp ecosystem:
- llama.cpp - Text generation, chat, embeddings, and text-to-speech
- whisper.cpp - Speech-to-text transcription and translation
- stable-diffusion.cpp - Image and video generation
It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.
Documentation | PyPI | Changelog
- High-level API -- `complete()`, `chat()`, and the `LLM` class for quick prototyping and text generation
- Streaming -- token-by-token output with callbacks
- Batch processing -- process multiple prompts 3-10x faster
- GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform)
- Speculative decoding -- 2-3x speedup with draft models
- Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling
- RAG -- retrieval-augmented generation with local embeddings and sqlite-vector
- Speech recognition -- whisper.cpp transcription and translation
- Image/Video generation -- stable-diffusion.cpp handles image, image-edit and video models.
- OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer with chat completions and embeddings endpoints
- Framework integrations -- OpenAI API client, LangChain LLM interface
```sh
pip install cyllama
```

This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.
GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only):
```sh
pip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)
pip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc >= 2.35)
pip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)
pip install cyllama-vulkan   # Cross-platform GPU (Vulkan)
```

All variants install the same `cyllama` Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.
You can verify which backend is active after installation:
```sh
cyllama info
```

You can also query the backend configuration at runtime:
```python
from cyllama import _backend

print(_backend.cuda)   # True if built with CUDA
print(_backend.metal)  # True if built with Metal
```

To build from source with a specific GGML backend enabled:

```sh
GGML_CUDA=1 pip install cyllama --no-binary cyllama
GGML_VULKAN=1 pip install cyllama --no-binary cyllama
```

cyllama provides a unified CLI for all major functionality:
```sh
# Text generation
cyllama gen -m models/llama.gguf -p "What is Python?" --stream
cyllama gen -m models/llama.gguf -p "Write a haiku" --temperature 0.9 --json

# Chat (single-turn or interactive)
cyllama chat -m models/llama.gguf -p "Explain gravity" -s "You are a physicist"
cyllama chat -m models/llama.gguf            # interactive mode
cyllama chat -m models/llama.gguf -n 1024    # interactive, up to 1024 tokens per response

# Embeddings
cyllama embed -m models/bge-small.gguf -t "hello world" -t "another text"
cyllama embed -m models/bge-small.gguf --dim   # print dimensions
cyllama embed -m models/bge-small.gguf --similarity "cats" -f corpus.txt --threshold 0.5

# Other commands
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ -p "How do I configure X?"
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -f file.md   # interactive mode
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ --db docs.sqlite -p "..."   # index to persistent DB
cyllama rag -m models/llama.gguf -e models/bge-small.gguf --db docs.sqlite -p "..."            # reuse existing DB, no re-indexing
cyllama server -m models/llama.gguf --port 8080
cyllama transcribe -m models/ggml-base.en.bin audio.wav
cyllama tts -m models/tts.gguf -p "Hello world"
cyllama sd txt2img --model models/sd.gguf --prompt "a sunset"
cyllama info                           # build and backend information
cyllama memory -m models/llama.gguf    # GPU memory estimation
```

Run `cyllama --help` or `cyllama <command> --help` for full usage. See the CLI Cheatsheet for the complete reference.
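The `cyllama server` command above exposes an OpenAI-compatible HTTP API. As a sketch, a chat-completions request follows the standard OpenAI wire format; the `/v1/chat/completions` path and port are assumptions based on the example above, not taken from the cyllama docs:

```python
import json
from urllib.request import Request, urlopen

# Standard OpenAI-style chat completions payload (this wire format is
# defined by the OpenAI API, not by cyllama itself)
payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
}
body = json.dumps(payload).encode()

# With `cyllama server -m models/llama.gguf --port 8080` running,
# this would POST to the assumed endpoint:
req = Request(
    "http://localhost:8080/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# response = json.load(urlopen(req))  # requires a running server
```

Uncomment the final line to actually send the request against a running server.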
```python
from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200,
)
print(response)
```

High-Level API - Get started in seconds:
```python
from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!
```

Streaming Support - Real-time token-by-token output:
```python
for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)
```

Batch Processing - Process multiple prompts 3-10x faster:
```python
from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")
```

Speculative Decoding - 2-3x speedup with draft models:
```python
from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

params = SpeculativeParams(n_max=16, p_min=0.75)
spec = Speculative(params, ctx_target)
draft_tokens = spec.draft(prompt_tokens, last_token)
```

Memory Optimization - Smart GPU layer allocation:
```python
from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000,
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
```

N-gram Cache - 2-10x speedup for repetitive text:
```python
from cyllama.llama.llama_cpp import NgramCache

cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
```

Response Caching - Cache LLM responses for repeated prompts:
```python
from cyllama import LLM

# Enable caching with 100 entries and a 1-hour TTL
llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)
response1 = llm("What is Python?")  # Cache miss - generates response
response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics
info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed
llm.cache_clear()
```

Note: Caching requires a fixed seed (`seed != -1`), since random seeds produce non-deterministic output. Streaming responses are not cached.
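The caching behaviour described above (a bounded number of entries plus a TTL) can be sketched in pure Python. This is only an illustration of the idea, not cyllama's actual implementation:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Tiny LRU cache with per-entry expiry, mirroring cache_size/cache_ttl."""

    def __init__(self, maxsize=100, ttl=3600.0):
        self.maxsize, self.ttl = maxsize, ttl
        self._data = OrderedDict()  # key -> (timestamp, value)
        self.hits = self.misses = 0

    def get(self, key):
        entry = self._data.get(key)
        if entry is not None and time.monotonic() - entry[0] < self.ttl:
            self._data.move_to_end(key)  # mark as recently used
            self.hits += 1
            return entry[1]
        self._data.pop(key, None)  # drop expired entry if present
        self.misses += 1
        return None

    def put(self, key, value):
        self._data[key] = (time.monotonic(), value)
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used

cache = TTLCache(maxsize=2, ttl=60)
cache.put("What is Python?", "a language")
assert cache.get("What is Python?") == "a language"  # hit
assert cache.get("unseen prompt") is None            # miss
```

The same structure explains why a fixed seed is required: the cache can only return a stored response verbatim, which is only valid when generation is deterministic.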
OpenAI-Compatible API - Drop-in replacement:
```python
from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```

LangChain Integration - Seamless ecosystem access:
```python
from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")
```

Cyllama includes a zero-dependency agent framework with three agent architectures:
ReActAgent - Reasoning + Acting agent with tool calling:
```python
from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)
```

ConstrainedAgent - Grammar-enforced tool calling for 100% reliability:
```python
from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls
```

ContractAgent - Contract-based agent with C++26-inspired pre/post conditions:
```python
from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")
```

See Agents Overview for detailed agent documentation.
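The ReAct pattern itself (alternate a model "thought/action" step with tool execution until a final answer appears) can be sketched independently of cyllama. The `Action:`/`Final Answer:` text format below is illustrative only, not the library's actual protocol, and the scripted `fake_llm` stands in for a real model:

```python
import re

def calculate(expression: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression (demo only)."""
    return str(eval(expression, {"__builtins__": {}}))

# Scripted stand-in for an LLM: first emits a tool call, then an answer.
script = iter([
    "Thought: I should multiply.\nAction: calculate(25 * 4)",
    "Thought: Done.\nFinal Answer: 100",
])
fake_llm = lambda prompt: next(script)

def react(llm, tools, task, max_steps=5):
    transcript = task
    for _ in range(max_steps):
        out = llm(transcript)
        if "Final Answer:" in out:
            return out.split("Final Answer:")[1].strip()
        m = re.search(r"Action: (\w+)\((.*)\)", out)
        if m:  # run the named tool and feed the observation back
            result = tools[m.group(1)](m.group(2))
            transcript += f"\n{out}\nObservation: {result}"
    return None

print(react(fake_llm, {"calculate": calculate}, "What is 25 * 4?"))  # -> 100
```

ReActAgent automates exactly this loop with a real model, and ConstrainedAgent additionally grammar-constrains the action step so the tool call is always parseable.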
Whisper Transcription - Transcribe audio files with timestamps:
```python
import numpy as np

from cyllama.whisper import WhisperContext, WhisperFullParams

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
```

See the Whisper docs for full documentation.
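The `t0`/`t1` segment values above are in centiseconds (hence the division by 100 to get seconds). Converting them to SRT-style subtitle timestamps is plain arithmetic; this helper is a sketch, not part of cyllama:

```python
def centiseconds_to_srt(cs: int) -> str:
    """Convert a whisper segment timestamp (centiseconds) to HH:MM:SS,mmm."""
    ms = cs * 10
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"

# e.g. a segment starting at t0=6150 (61.50 s):
print(centiseconds_to_srt(6150))  # -> 00:01:01,500
```

Combined with the segment loop above, this is enough to emit a valid `.srt` file from a transcription.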
Image Generation - Generate images from text using stable-diffusion.cpp:
```python
from cyllama.sd import text_to_image

# Simple text-to-image
image = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0,
)
image.save("output.png")
```

Advanced Generation - Full control with SDContext:
```python
from cyllama.sd import SDContext, SDContextParams, SampleMethod, Scheduler

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE,
)
```

CLI Tool - Command-line interface:
```sh
# Text to image
cyllama sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
cyllama sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
cyllama sd info
```

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See the Stable Diffusion docs for full documentation.
CLI - Query your documents from the command line:
```sh
# Single query against a directory of docs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure X?" --stream

# Interactive mode with source display
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md --sources

# Persistent vector store: index once, reuse across runs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ --db docs.sqlite -p "How do I configure X?"   # first run: indexes to docs.sqlite
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db docs.sqlite -p "Another question?"                # later runs: reuse index, no re-embedding
```

Simple RAG - Query your documents with LLMs:
```python
from cyllama.rag import RAG

# Create RAG instance with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf",
)

# Add documents
rag.add_texts([
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons.",
])

# Query
response = rag.query("What is Python?")
print(response.text)
```

Load Documents - Support for multiple file formats:
```python
from cyllama.rag import RAG, load_directory

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf",
)

# Load all documents from a directory
documents = load_directory("docs/", glob="**/*.md")
rag.add_documents(documents)

response = rag.query("How do I configure the system?")
```

Hybrid Search - Combine vector and keyword search:
```python
from cyllama.rag import HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")
store = HybridStore("knowledge.db", embedder)
store.add_texts(["Document content..."])

# Hybrid search with configurable weights
results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)
```

Embedding Cache - Speed up repeated queries with LRU caching:
```python
from cyllama.rag import Embedder

# Enable cache with 1000 entries
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf", cache_size=1000)
embedder.embed("hello")  # Cache miss
embedder.embed("hello")  # Cache hit - instant return

info = embedder.cache_info()
print(f"Hits: {info.hits}, Misses: {info.misses}")
```

Agent Integration - Use RAG as an agent tool:
```python
from cyllama import LLM
from cyllama.agents import ReActAgent
from cyllama.rag import RAG, create_rag_tool

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf",
)
rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance
search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[search_tool])
result = agent.run("Find information about X in the knowledge base")
```

Supports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and sqlite-vector for persistent storage. See the RAG Overview for full documentation.
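The text chunking mentioned above splits documents into overlapping windows before embedding, so that retrieval can return passages rather than whole files. A minimal sliding-window chunker illustrating the idea (the chunk size and overlap values here are arbitrary, not cyllama defaults):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Each chunk starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 500, chunk_size=200, overlap=50)
print(len(chunks))  # window starts at 0, 150, 300, 450 -> 4 chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.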
GGUF File Manipulation - Inspect and modify model files:
```python
from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")
metadata = ctx.get_all_metadata()
print(f"Model: {metadata['general.name']}")
```

Structured Output - JSON schema to grammar conversion (pure Python, no C++ dependency):
```python
from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)
```

Huggingface Model Downloads:
```python
from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from HuggingFace (saves to ~/.cache/llama.cpp/)
download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters
download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download a specific file to a custom path
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",
    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",
    model_path="./models/my_model.gguf",
)

# Get file info without downloading
info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")
print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models
models = list_cached_models()
```

- Full llama.cpp API - Complete Cython wrapper with strong typing
- High-Level API - Simple, Pythonic interface (`LLM`, `complete`, `chat`)
- Streaming Support - Token-by-token generation with callbacks
- Batch Processing - Efficient parallel inference
- Multimodal - LLAVA and vision-language models
- Speculative Decoding - 2-3x inference speedup with draft models
- Full whisper.cpp API - Complete Cython wrapper
- High-Level API - Simple `transcribe()` function
- Multiple Formats - WAV, MP3, FLAC, and more
- Language Detection - Automatic or specified language
- Timestamps - Word and segment-level timing
- Full stable-diffusion.cpp API - Complete Cython wrapper
- Text-to-Image - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2
- Image-to-Image - Transform existing images
- Inpainting - Mask-based editing
- ControlNet - Guided generation with edge/pose/depth
- Video Generation - Wan, CogVideoX models
- Upscaling - ESRGAN 4x upscaling
- GPU Acceleration - Metal, CUDA, Vulkan backends
- Memory Optimization - Smart GPU layer allocation
- Agent Framework - ReActAgent, ConstrainedAgent, ContractAgent
- Framework Integration - OpenAI API, LangChain, FastAPI
Performance: Compiled Cython wrappers with minimal overhead
- Strong type checking at compile time
- Zero-copy data passing where possible
- Efficient memory management
- Native integration with llama.cpp optimizations
Simplicity: From 50 lines to 1 line for basic generation
- Intuitive, Pythonic API design
- Automatic resource management
- Sensible defaults, full control when needed
Production-Ready: Battle-tested and comprehensive
- 1450+ passing tests with extensive coverage
- Comprehensive documentation and examples
- Proper error handling and logging
- Framework integration for real applications
Up-to-Date: Tracks bleeding-edge llama.cpp
- Regular updates with latest features
- All high-priority APIs wrapped
- Performance optimizations included
- Current Version: 0.2.5 (Apr 2026)
- llama.cpp Version: b8757
- Build System: scikit-build-core + CMake
- Test Coverage: 1450+ tests passing
- Platforms: macOS (tested), Linux (tested), Windows (tested)
- v0.2.5 (Apr 2026) - Typed loader exceptions, concurrent-use guard on `LLM`/`Embedder`/`WhisperContext`/`SDContext`, persistent RAG vector store (`cyllama rag --db`), corpus deduplication, vendored jinja2 chat templates (fixes Gemma 4 and other non-substring-detectable templates), Qwen3 `<think>`-block stripping + n-gram repetition guard, readline history for REPLs, memory-leak regression tests, llama.cpp b8757
- v0.2.4 (Apr 2026) - Unified CLI (`cyllama gen`, `chat`, `embed`, `rag`, ...), `cyllama rag` command-line RAG, Ctrl+C during inference, embeddings endpoint, Embedder logging fix, interactive chat token limit fix
- v0.2.3 (Apr 2026) - SD flow_shift black-image fix, GPU OOM validation, dynamic Linux install fixes, wheel backend discovery after auditwheel/delvewheel rename, CLI entry point, wheel smoke tests, OpenCL targets, CUDA tuning flags
- v0.2.2 (Apr 2026) - CUDA wheel size stability (PTX-only sm_75), portability flags moved from manage.py to CI workflows
- v0.2.1 (Mar 2026) - Code quality hardening: GIL release for whisper/encode, async stream fixes, memory-aware embedding cache, CI robustness, 30+ bug fixes, 1150+ tests
- v0.2.0 (Mar 2026) - Dynamic-linked GPU wheels (CUDA, ROCm, SYCL, Vulkan) on PyPI, unified ggml, sqlite-vector vendored
- v0.1.21 (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled
- v0.1.20 (Feb 2026) - Update llama.cpp + stable-diffusion.cpp
- v0.1.19 (Dec 2025) - Metal fix for stable-diffusion.cpp
- v0.1.18 (Dec 2025) - Remaining stable-diffusion.cpp wrapped
- v0.1.16 (Dec 2025) - Response class, Async API, Chat templates
- v0.1.12 (Nov 2025) - Initial wrapper of stable-diffusion.cpp
- v0.1.11 (Nov 2025) - ACP support, build improvements
- v0.1.10 (Nov 2025) - Agent Framework, bug fixes
- v0.1.9 (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation
- v0.1.8 (Nov 2025) - Speculative decoding API
- v0.1.7 (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache
- v0.1.6 (Nov 2025) - Multimodal test fixes
- v0.1.5 (Oct 2025) - Mongoose server, embedded server
- v0.1.4 (Oct 2025) - Memory estimation, performance optimizations
See CHANGELOG.md for complete release history.
To build cyllama from source:
- A recent version of `python3` (currently testing on Python 3.13).

- Git clone the latest version of cyllama:

  ```sh
  git clone https://github.com/shakfu/cyllama.git
  cd cyllama
  ```

- We use uv for package management. If you don't have it, see the link above to install it; otherwise:

  ```sh
  uv sync
  ```

- Type `make` in the terminal. This will:

  - Download and build llama.cpp, whisper.cpp and stable-diffusion.cpp
  - Install them into the `thirdparty` folder
  - Build cyllama using scikit-build-core + CMake
```sh
# Full build (default: static linking, builds llama.cpp from source)
make                    # Build dependencies + editable install

# Dynamic linking (downloads pre-built llama.cpp release)
make build-dynamic      # No source compilation needed for llama.cpp

# Build wheel for distribution
make wheel              # Creates wheel in dist/
make dist               # Creates sdist + wheel in dist/

# Backend-specific builds (static)
make build-cpu          # CPU only
make build-metal        # macOS Metal (default on macOS)
make build-cuda         # NVIDIA CUDA
make build-vulkan       # Vulkan (cross-platform)
make build-hip          # AMD ROCm
make build-sycl         # Intel SYCL
make build-opencl       # OpenCL

# Backend-specific builds (dynamic -- shared libs)
make build-cpu-dynamic
make build-cuda-dynamic
make build-vulkan-dynamic
make build-metal-dynamic
make build-hip-dynamic
make build-sycl-dynamic
make build-opencl-dynamic

# Backend-specific wheels (static and dynamic)
make wheel-cuda         # Static wheel
make wheel-cuda-dynamic # Dynamic wheel with shared libs

# Clean and rebuild
make clean              # Remove build artifacts + dynamic libs
make reset              # Full reset including thirdparty and .venv
make remake             # Clean rebuild with tests

# Code quality
make lint               # Lint with ruff (auto-fix)
make format             # Format with ruff
make typecheck          # Type check with mypy
make qa                 # Run all: lint, typecheck, format

# Memory leak detection
make leaks              # RSS-growth leak check (10 cycles, 20% threshold)

# Publishing
make check              # Validate wheels with twine
make publish            # Upload to PyPI
make publish-test       # Upload to TestPyPI
```