# vllm-cli

> A command-line interface tool for serving LLM using vLLM.

- **URL**: https://www.freshcrate.ai/projects/vllm-cli
- **Author**: Chen-zexi
- **Category**: Developer Tools
- **Latest version**: `v0.2.5` (2025-08-25)
- **License**: MIT
- **Source**: https://github.com/Chen-zexi/vllm-cli
- **Language**: Python
- **GitHub**: 491 stars, 28 forks
- **Registry**: github
- **Tags**: `llm`, `llm-inference`, `llm-tools`, `python`, `vllm`

## Description

A command-line interface tool for serving LLM using vLLM.

## Recent releases

| Version | Date | Urgency | Changes |
| --- | --- | --- | --- |
| `v0.2.5` | 2025-08-25 | Low | ### Added - **Multi-Model Proxy Server (Experimental)**: Enabling multiple LLMs through a single unified API endpoint   - Single OpenAI-compatible endpoint for all models   - Request routing based on model name   - Save and reuse proxy configurations - **Dynamic Model Management**: Add or remove models at runtime without restarting the proxy   - Live model registration and unregistration   - Pre-registration with verification lifecycle   - Graceful handling of model failures without affe |
| `v0.2.5rc2` | 2025-08-24 | Low | ### Multi-Model Proxy Server (Experimental)  The Multi-Model Proxy is a new experimental feature that enables serving multiple LLMs through a single unified API endpoint. This feature is currently under active development and available for testing.  **What It Does:** - **Single Endpoint** - All your models accessible through one API - **Live Management** - Add or remove models without stopping the service - **Dynamic GPU Management** - Efficient GPU resource distribution through vLLM's sl |
| `v0.2.5rc1` | 2025-08-22 | Low | Added multi model support through proxy |
| `v0.2.4` | 2025-08-20 | Low | ### Added - **Hardware-Optimized Profiles for GPT-OSS Models**: New built-in profiles optimized for different GPU architectures   - `gpt_oss_ampere`: Optimized for NVIDIA A100 GPUs   - `gpt_oss_hopper`: Optimized for NVIDIA H100/H200 GPUs   - `gpt_oss_blackwell`: Optimized for NVIDIA Blackwell (B100/B200) GPUs   - Based on official [vLLM GPT recipes](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html) - **Shortcuts System**: Save and quickly launch model + profile combinat |
| `v0.2.4rc2` | 2025-08-19 | Low | **Full Changelog**: https://github.com/Chen-zexi/vllm-cli/compare/v0.2.4rc1...v0.2.4rc2 |
| `v0.2.4rc1` | 2025-08-19 | Low | ### Added - **Ollama Model Support**: Full integration with Ollama-downloaded models through hf-model-tool   - Automatic discovery of Ollama models in user (`~/.ollama`) and system (`/usr/share/ollama`) directories   - GGUF format detection and experimental serving support  ### Changed - Model cache refresh properly respects TTL settings (>60s) - Improved path display in model management UI for better clarity  ### Fixed - Fixed duplicate Ollama models appearing in model list - Fixed m |
| `v0.2.3` | 2025-08-18 | Low | ### Fixed - **Critical**: Fixed missing built-in profiles when installing from PyPI - JSON schema files are now properly included in the package distribution |
| `v0.2.2` | 2025-08-18 | Low | ### Added - **Model Manifest Support**: Introduced `models_manifest.json` for mapping custom models in vLLM CLI native way - **Documentation**: Added [custom-model-serving.md](docs/custom-model-serving.md) for custom model serving guide  ### Fixed - Serving models from custom directories now works as expected - Fixed some UI issues |
| `0.2.1` | 2025-08-17 | Low | - **Critical**: Fixed package installation issue - setuptools now correctly includes all sub-packages |
| `0.2.0` | 2025-08-17 | Low | ## [0.2.0] - 2025-08-17  ### Added - **LoRA Adapter Support**: Serve models with LoRA adapters - select base model and multiple LoRA adapters for serving - **Enhanced Model List Display**: Comprehensive model listing showing HuggingFace models, LoRA adapters, and datasets with size information - **Model Directory Management**: Configure and manage custom model directories for automatic model discovery - **Model Caching**: Performance optimization through intelligent caching with TTL for mo |

## Citation

- HTML: https://www.freshcrate.ai/projects/vllm-cli
- Markdown: https://www.freshcrate.ai/projects/vllm-cli.md
- Dependencies JSON: https://www.freshcrate.ai/api/projects/vllm-cli/deps

_Generated by freshcrate.ai. Indexes github releases for AI-agent ecosystem packages._
