cyllama is a comprehensive, dependency-free Python library for local AI inference, built on the state-of-the-art .cpp ecosystem:
- llama.cpp - Text generation, chat, embeddings, and text-to-speech
- whisper.cpp - Speech-to-text transcription and translation
- stable-diffusion.cpp - Image and video generation
It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.
Documentation | PyPI | Changelog
- High-level API -- `complete()`, `chat()`, and the `LLM` class for quick prototyping and text generation
- Streaming -- token-by-token output with callbacks
- Batch processing -- process multiple prompts 3-10x faster
- GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform)
- Speculative decoding -- 2-3x speedup with draft models
- Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling
- RAG -- retrieval-augmented generation with local embeddings and sqlite-vector
- Speech recognition -- whisper.cpp transcription and translation
- Image/Video generation -- stable-diffusion.cpp handles image, image-edit and video models.
- OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer with chat completions and embeddings endpoints
- Framework integrations -- OpenAI API client, LangChain LLM interface
```sh
pip install cyllama
```

This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.
GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only):
```sh
pip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)
pip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc >= 2.35)
pip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)
pip install cyllama-vulkan   # Cross-platform GPU (Vulkan)
```

All variants install the same `cyllama` Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.
You can verify which backend is active after installation:
```sh
cyllama info
```

You can also query the backend configuration at runtime:
```python
from cyllama import _backend

print(_backend.cuda)   # True if built with CUDA
print(_backend.metal)  # True if built with Metal
```

To build from source with a specific GGML backend enabled:

```sh
GGML_CUDA=1 pip install cyllama --no-binary cyllama
GGML_VULKAN=1 pip install cyllama --no-binary cyllama
```

cyllama provides a unified CLI for all major functionality:
```sh
# Text generation
cyllama gen -m models/llama.gguf -p "What is Python?" --stream
cyllama gen -m models/llama.gguf -p "Write a haiku" --temperature 0.9 --json

# Chat (single-turn or interactive)
cyllama chat -m models/llama.gguf -p "Explain gravity" -s "You are a physicist"
cyllama chat -m models/llama.gguf            # interactive mode
cyllama chat -m models/llama.gguf -n 1024    # interactive, up to 1024 tokens per response

# Embeddings
cyllama embed -m models/bge-small.gguf -t "hello world" -t "another text"
cyllama embed -m models/bge-small.gguf --dim   # print dimensions
cyllama embed -m models/bge-small.gguf --similarity "cats" -f corpus.txt --threshold 0.5

# Other commands
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ -p "How do I configure X?"
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -f file.md   # interactive mode
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ --db docs.sqlite -p "..."   # index to persistent DB
cyllama rag -m models/llama.gguf -e models/bge-small.gguf --db docs.sqlite -p "..."            # reuse existing DB, no re-indexing
cyllama server -m models/llama.gguf --port 8080
cyllama transcribe -m models/ggml-base.en.bin audio.wav
cyllama tts -m models/tts.gguf -p "Hello world"
cyllama sd txt2img --model models/sd.gguf --prompt "a sunset"
cyllama info                           # build and backend information
cyllama memory -m models/llama.gguf    # GPU memory estimation
```

Run `cyllama --help` or `cyllama <command> --help` for full usage. See the CLI Cheatsheet for the complete reference.
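The `cyllama server` command above exposes an OpenAI-compatible HTTP API. As a sketch, a chat-completions request follows the standard OpenAI wire format; the `/v1/chat/completions` path and port are assumptions based on the example above, not taken from the cyllama docs:

```python
import json
from urllib.request import Request, urlopen

# Standard OpenAI-style chat completions payload (this wire format is
# defined by the OpenAI API, not by cyllama itself)
payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
}
body = json.dumps(payload).encode()

# With `cyllama server -m models/llama.gguf --port 8080` running,
# this would POST to the assumed endpoint:
req = Request(
    "http://localhost:8080/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# response = json.load(urlopen(req))  # requires a running server
```

Uncomment the final line to actually send the request against a running server.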
```python
from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200,
)
print(response)
```

High-Level API - Get started in seconds:
```python
from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!
```

Streaming Support - Real-time token-by-token output:
```python
for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)
```

Batch Processing - Process multiple prompts 3-10x faster:
```python
from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")
```

Speculative Decoding - 2-3x speedup with draft models:
```python
from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

params = SpeculativeParams(n_max=16, p_min=0.75)
spec = Speculative(params, ctx_target)
draft_tokens = spec.draft(prompt_tokens, last_token)
```

Memory Optimization - Smart GPU layer allocation:
```python
from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000,
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
```

N-gram Cache - 2-10x speedup for repetitive text:
```python
from cyllama.llama.llama_cpp import NgramCache

cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
```

Response Caching - Cache LLM responses for repeated prompts:
```python
from cyllama import LLM

# Enable caching with 100 entries and a 1-hour TTL
llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)
response1 = llm("What is Python?")  # Cache miss - generates response
response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics
info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed
llm.cache_clear()
```

Note: Caching requires a fixed seed (`seed != -1`), since random seeds produce non-deterministic output. Streaming responses are not cached.
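The caching behaviour described above (a bounded number of entries plus a TTL) can be sketched in pure Python. This is only an illustration of the idea, not cyllama's actual implementation:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Tiny LRU cache with per-entry expiry, mirroring cache_size/cache_ttl."""

    def __init__(self, maxsize=100, ttl=3600.0):
        self.maxsize, self.ttl = maxsize, ttl
        self._data = OrderedDict()  # key -> (timestamp, value)
        self.hits = self.misses = 0

    def get(self, key):
        entry = self._data.get(key)
        if entry is not None and time.monotonic() - entry[0] < self.ttl:
            self._data.move_to_end(key)  # mark as recently used
            self.hits += 1
            return entry[1]
        self._data.pop(key, None)  # drop expired entry if present
        self.misses += 1
        return None

    def put(self, key, value):
        self._data[key] = (time.monotonic(), value)
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used

cache = TTLCache(maxsize=2, ttl=60)
cache.put("What is Python?", "a language")
assert cache.get("What is Python?") == "a language"  # hit
assert cache.get("unseen prompt") is None            # miss
```

The same structure explains why a fixed seed is required: the cache can only return a stored response verbatim, which is only valid when generation is deterministic.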
OpenAI-Compatible API - Drop-in replacement:
```python
from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```

LangChain Integration - Seamless ecosystem access:
```python
from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")
```

Cyllama includes a zero-dependency agent framework with three agent architectures:
ReActAgent - Reasoning + Acting agent with tool calling:
```python
from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)
```

ConstrainedAgent - Grammar-enforced tool calling for 100% reliability:
```python
from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls
```

ContractAgent - Contract-based agent with C++26-inspired pre/post conditions:
```python
from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")
```

See Agents Overview for detailed agent documentation.
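The ReAct pattern itself (alternate a model "thought/action" step with tool execution until a final answer appears) can be sketched independently of cyllama. The `Action:`/`Final Answer:` text format below is illustrative only, not the library's actual protocol, and the scripted `fake_llm` stands in for a real model:

```python
import re

def calculate(expression: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression (demo only)."""
    return str(eval(expression, {"__builtins__": {}}))

# Scripted stand-in for an LLM: first emits a tool call, then an answer.
script = iter([
    "Thought: I should multiply.\nAction: calculate(25 * 4)",
    "Thought: Done.\nFinal Answer: 100",
])
fake_llm = lambda prompt: next(script)

def react(llm, tools, task, max_steps=5):
    transcript = task
    for _ in range(max_steps):
        out = llm(transcript)
        if "Final Answer:" in out:
            return out.split("Final Answer:")[1].strip()
        m = re.search(r"Action: (\w+)\((.*)\)", out)
        if m:  # run the named tool and feed the observation back
            result = tools[m.group(1)](m.group(2))
            transcript += f"\n{out}\nObservation: {result}"
    return None

print(react(fake_llm, {"calculate": calculate}, "What is 25 * 4?"))  # -> 100
```

ReActAgent automates exactly this loop with a real model, and ConstrainedAgent additionally grammar-constrains the action step so the tool call is always parseable.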
Whisper Transcription - Transcribe audio files with timestamps:
```python
import numpy as np

from cyllama.whisper import WhisperContext, WhisperFullParams

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
```

See the Whisper docs for full documentation.
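The `t0`/`t1` segment values above are in centiseconds (hence the division by 100 to get seconds). Converting them to SRT-style subtitle timestamps is plain arithmetic; this helper is a sketch, not part of cyllama:

```python
def centiseconds_to_srt(cs: int) -> str:
    """Convert a whisper segment timestamp (centiseconds) to HH:MM:SS,mmm."""
    ms = cs * 10
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"

# e.g. a segment starting at t0=6150 (61.50 s):
print(centiseconds_to_srt(6150))  # -> 00:01:01,500
```

Combined with the segment loop above, this is enough to emit a valid `.srt` file from a transcription.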
Image Generation - Generate images from text using stable-diffusion.cpp:
```python
from cyllama.sd import text_to_image

# Simple text-to-image
image = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0,
)
image.save("output.png")
```

Advanced Generation - Full control with SDContext:
```python
from cyllama.sd import SDContext, SDContextParams, SampleMethod, Scheduler

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE,
)
```

CLI Tool - Command-line interface:
```sh
# Text to image
cyllama sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
cyllama sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
cyllama sd info
```

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See the Stable Diffusion docs for full documentation.
CLI - Query your documents from the command line:
```sh
# Single query against a directory of docs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure X?" --stream

# Interactive mode with source display
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md --sources

# Persistent vector store: index once, reuse across runs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ --db docs.sqlite -p "How do I configure X?"   # first run: indexes to docs.sqlite
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db docs.sqlite -p "Another question?"                # later runs: reuse index, no re-embedding
```

Simple RAG - Query your documents with LLMs:
```python
from cyllama.rag import RAG

# Create RAG instance with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf",
)

# Add documents
rag.add_texts([
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons.",
])

# Query
response = rag.query("What is Python?")
print(response.text)
```

Load Documents - Support for multiple file formats:
```python
from cyllama.rag import RAG, load_directory

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf",
)

# Load all documents from a directory
documents = load_directory("docs/", glob="**/*.md")
rag.add_documents(documents)

response = rag.query("How do I configure the system?")
```

Hybrid Search - Combine vector and keyword search:
```python
from cyllama.rag import HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")
store = HybridStore("knowledge.db", embedder)
store.add_texts(["Document content..."])

# Hybrid search with configurable weights
results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)
```

Embedding Cache - Speed up repeated queries with LRU caching:
```python
from cyllama.rag import Embedder

# Enable cache with 1000 entries
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf", cache_size=1000)
embedder.embed("hello")  # Cache miss
embedder.embed("hello")  # Cache hit - instant return

info = embedder.cache_info()
print(f"Hits: {info.hits}, Misses: {info.misses}")
```

Agent Integration - Use RAG as an agent tool:
```python
from cyllama import LLM
from cyllama.agents import ReActAgent
from cyllama.rag import RAG, create_rag_tool

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf",
)
rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance
search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[search_tool])
result = agent.run("Find information about X in the knowledge base")
```

Supports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and sqlite-vector for persistent storage. See the RAG Overview for full documentation.
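The text chunking mentioned above splits documents into overlapping windows before embedding, so that retrieval can return passages rather than whole files. A minimal sliding-window chunker illustrating the idea (the chunk size and overlap values here are arbitrary, not cyllama defaults):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Each chunk starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 500, chunk_size=200, overlap=50)
print(len(chunks))  # window starts at 0, 150, 300, 450 -> 4 chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.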
GGUF File Manipulation - Inspect and modify model files:
```python
from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")
metadata = ctx.get_all_metadata()
print(f"Model: {metadata['general.name']}")
```

Structured Output - JSON schema to grammar conversion (pure Python, no C++ dependency):
```python
from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)
```

Huggingface Model Downloads:
```python
from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from HuggingFace (saves to ~/.cache/llama.cpp/)
download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters
download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download a specific file to a custom path
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",
    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",
    model_path="./models/my_model.gguf",
)

# Get file info without downloading
info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")
print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models
models = list_cached_models()
```

- Full llama.cpp API - Complete Cython wrapper with strong typing
- High-Level API - Simple, Pythonic interface (`LLM`, `complete`, `chat`)
- Streaming Support - Token-by-token generation with callbacks
- Batch Processing - Efficient parallel inference
- Multimodal - LLAVA and vision-language models
- Speculative Decoding - 2-3x inference speedup with draft models
- Full whisper.cpp API - Complete Cython wrapper
- High-Level API - Simple `transcribe()` function
- Multiple Formats - WAV, MP3, FLAC, and more
- Language Detection - Automatic or specified language
- Timestamps - Word and segment-level timing
- Full stable-diffusion.cpp API - Complete Cython wrapper
- Text-to-Image - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2
- Image-to-Image - Transform existing images
- Inpainting - Mask-based editing
- ControlNet - Guided generation with edge/pose/depth
- Video Generation - Wan, CogVideoX models
- Upscaling - ESRGAN 4x upscaling
- GPU Acceleration - Metal, CUDA, Vulkan backends
- Memory Optimization - Smart GPU layer allocation
- Agent Framework - ReActAgent, ConstrainedAgent, ContractAgent
- Framework Integration - OpenAI API, LangChain, FastAPI
Performance: Compiled Cython wrappers with minimal overhead
- Strong type checking at compile time
- Zero-copy data passing where possible
- Efficient memory management
- Native integration with llama.cpp optimizations
Simplicity: From 50 lines to 1 line for basic generation
- Intuitive, Pythonic API design
- Automatic resource management
- Sensible defaults, full control when needed
Production-Ready: Battle-tested and comprehensive
- 1450+ passing tests with extensive coverage
- Comprehensive documentation and examples
- Proper error handling and logging
- Framework integration for real applications
Up-to-Date: Tracks bleeding-edge llama.cpp
- Regular updates with latest features
- All high-priority APIs wrapped
- Performance optimizations included
- Current Version: 0.2.5 (Apr 2026)
- llama.cpp Version: b8757
- Build System: scikit-build-core + CMake
- Test Coverage: 1450+ tests passing
- Platforms: macOS (tested), Linux (tested), Windows (tested)
- v0.2.5 (Apr 2026) - Typed loader exceptions, concurrent-use guard on `LLM`/`Embedder`/`WhisperContext`/`SDContext`, persistent RAG vector store (`cyllama rag --db`), corpus deduplication, vendored jinja2 chat templates (fixes Gemma 4 and other non-substring-detectable templates), Qwen3 `<think>`-block stripping + n-gram repetition guard, readline history for REPLs, memory-leak regression tests, llama.cpp b8757
- v0.2.4 (Apr 2026) - Unified CLI (`cyllama gen`, `chat`, `embed`, `rag`, ...), `cyllama rag` command-line RAG, Ctrl+C during inference, embeddings endpoint, Embedder logging fix, interactive chat token limit fix
- v0.2.3 (Apr 2026) - SD flow_shift black-image fix, GPU OOM validation, dynamic Linux install fixes, wheel backend discovery after auditwheel/delvewheel rename, CLI entry point, wheel smoke tests, OpenCL targets, CUDA tuning flags
- v0.2.2 (Apr 2026) - CUDA wheel size stability (PTX-only sm_75), portability flags moved from manage.py to CI workflows
- v0.2.1 (Mar 2026) - Code quality hardening: GIL release for whisper/encode, async stream fixes, memory-aware embedding cache, CI robustness, 30+ bug fixes, 1150+ tests
- v0.2.0 (Mar 2026) - Dynamic-linked GPU wheels (CUDA, ROCm, SYCL, Vulkan) on PyPI, unified ggml, sqlite-vector vendored
- v0.1.21 (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled
- v0.1.20 (Feb 2026) - Update llama.cpp + stable-diffusion.cpp
- v0.1.19 (Dec 2025) - Metal fix for stable-diffusion.cpp
- v0.1.18 (Dec 2025) - Remaining stable-diffusion.cpp wrapped
- v0.1.16 (Dec 2025) - Response class, Async API, Chat templates
- v0.1.12 (Nov 2025) - Initial wrapper of stable-diffusion.cpp
- v0.1.11 (Nov 2025) - ACP support, build improvements
- v0.1.10 (Nov 2025) - Agent Framework, bug fixes
- v0.1.9 (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation
- v0.1.8 (Nov 2025) - Speculative decoding API
- v0.1.7 (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache
- v0.1.6 (Nov 2025) - Multimodal test fixes
- v0.1.5 (Oct 2025) - Mongoose server, embedded server
- v0.1.4 (Oct 2025) - Memory estimation, performance optimizations
See CHANGELOG.md for complete release history.
To build cyllama from source:
- A recent version of `python3` (currently testing on Python 3.13).

- Git clone the latest version of cyllama:

  ```sh
  git clone https://github.com/shakfu/cyllama.git
  cd cyllama
  ```

- We use uv for package management. If you don't have it, see the link above to install it; otherwise:

  ```sh
  uv sync
  ```

- Type `make` in the terminal. This will:

  - Download and build llama.cpp, whisper.cpp and stable-diffusion.cpp
  - Install them into the `thirdparty` folder
  - Build cyllama using scikit-build-core + CMake
```sh
# Full build (default: static linking, builds llama.cpp from source)
make                    # Build dependencies + editable install

# Dynamic linking (downloads pre-built llama.cpp release)
make build-dynamic      # No source compilation needed for llama.cpp

# Build wheel for distribution
make wheel              # Creates wheel in dist/
make dist               # Creates sdist + wheel in dist/

# Backend-specific builds (static)
make build-cpu          # CPU only
make build-metal        # macOS Metal (default on macOS)
make build-cuda         # NVIDIA CUDA
make build-vulkan       # Vulkan (cross-platform)
make build-hip          # AMD ROCm
make build-sycl         # Intel SYCL
make build-opencl       # OpenCL

# Backend-specific builds (dynamic -- shared libs)
make build-cpu-dynamic
make build-cuda-dynamic
make build-vulkan-dynamic
make build-metal-dynamic
make build-hip-dynamic
make build-sycl-dynamic
make build-opencl-dynamic

# Backend-specific wheels (static and dynamic)
make wheel-cuda         # Static wheel
make wheel-cuda-dynamic # Dynamic wheel with shared libs

# Clean and rebuild
make clean              # Remove build artifacts + dynamic libs
make reset              # Full reset including thirdparty and .venv
make remake             # Clean rebuild with tests

# Code quality
make lint               # Lint with ruff (auto-fix)
make format             # Format with ruff
make typecheck          # Type check with mypy
make qa                 # Run all: lint, typecheck, format

# Memory leak detection
make leaks              # RSS-growth leak check (10 cycles, 20% threshold)

# Publishing
make check              # Validate wheels with twine
make publish            # Upload to PyPI
make publish-test       # Upload to TestPyPI
```