Tag: #inference
11 packages • 107,710 total stars
A high-throughput and memory-efficient inference and serving engine for LLMs
Faster Whisper transcription with CTranslate2
Fast inference engine for Transformer models
Efficient, Flexible and Portable Structured Generation
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX bac
OramaCore is the complete runtime you need for your projects, answer engines, copilots, and search. It includes a fully-fledged full-text search engine, vector database, LLM interface, and many more u
LLM7.io offers a single API gateway that connects you to a wide array of leading AI models from various providers.
The memory system your AI agent deserves. 4-stage hybrid retrieval (Vector + BM25 + Knowledge Graph + Neural Reranker) in <150ms. Self-hosted, $0/query, built for agents that need to actually rememb
Python client library and utilities for communicating with Triton Inference Server
