freshcrate
Skin:/
Home > Developer Tools > vllm-cli

vllm-cli

A command-line interface tool for serving LLM using vLLM.

Why this rank:Strong adoptionHealthy release cadenceRelease freshness

Description

A command-line interface tool for serving LLM using vLLM.

README

vLLM CLI

CI Release PyPI version License: MIT Python 3.9+ PyPI Downloads

A command-line interface tool for serving Large Language Models using vLLM. Provides both interactive and command-line modes with features for configuration profiles, model management, and server monitoring.

vLLM CLI Welcome Screen Interactive terminal interface with GPU status and system overview
Tip: You can customize the GPU stats bar in settings

Features

  • ๐ŸŽฏ Interactive Mode - Rich terminal interface with menu-driven navigation
  • โšก Command-Line Mode - Direct CLI commands for automation and scripting
  • ๐Ÿค– Model Management - Automatic discovery of local models with HuggingFace and Ollama support
  • ๐Ÿ”ง Configuration Profiles - Pre-configured and custom server profiles for different use cases
  • ๐Ÿ“Š Server Monitoring - Real-time monitoring of active vLLM servers
  • ๐Ÿ–ฅ๏ธ System Information - GPU, memory, and CUDA compatibility checking
  • ๐Ÿ“ Advanced Configuration - Full control over vLLM parameters with validation

Quick Links: ๐Ÿ“– Docs | ๐Ÿš€ Quick Start | ๐Ÿ“ธ Screenshots | ๐Ÿ“˜ Usage Guide | โ“ Troubleshooting | ๐Ÿ—บ๏ธ Roadmap

What's New in v0.2.5

Multi-Model Proxy Server (Experimental)

The Multi-Model Proxy is a new experimental feature that enables serving multiple LLMs through a single unified API endpoint. This feature is currently under active development and available for testing.

What It Does:

  • Single Endpoint - All your models accessible through one API
  • Live Management - Add or remove models without stopping the service
  • Dynamic GPU Management - Efficient GPU resource distribution through vLLM's sleep/wake functionality
  • Interactive Setup - User-friendly wizard guides you through configuration

Note: This is an experimental feature under active development. Your feedback helps us improve! Please share your experience through GitHub Issues.

For complete documentation, see the ๐ŸŒ Multi-Model Proxy Guide.

What's New in v0.2.4

๐Ÿš€ Hardware-Optimized Profiles for GPT-OSS Models

New built-in profiles specifically optimized for serving GPT-OSS models on different GPU architectures:

  • gpt_oss_ampere - Optimized for NVIDIA A100 GPUs
  • gpt_oss_hopper - Optimized for NVIDIA H100/H200 GPUs
  • gpt_oss_blackwell - Optimized for NVIDIA Blackwell GPUs

Based on official vLLM GPT recipes for maximum performance.

โšก Shortcuts System

Save and quickly launch your favorite model + profile combinations:

vllm-cli serve --shortcut my-gpt-server

๐Ÿฆ™ Full Ollama Integration

  • Automatic discovery of Ollama models
  • GGUF format support (experimental)
  • System and user directory scanning

๐Ÿ”ง Enhanced Configuration

  • Environment Variables - Universal and profile-specific environment variable management
  • GPU Selection - Choose specific GPUs for model serving (--device 0,1)
  • Enhanced System Info - vLLM feature detection with attention backend availability

See CHANGELOG.md for detailed release notes.

Quick Start

Important: vLLM Installation Notes

โš ๏ธ Binary Compatibility Warning: vLLM contains pre-compiled CUDA kernels that must match your PyTorch version exactly. Installing mismatched versions will cause errors.

vLLM-CLI will not install vLLM or Pytorch by default.

Installation

Option 1: Install vLLM seperately and then install vLLM CLI (Recommended)

# Install vLLM -- Skip this step if you have vllm installed in your environment
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
# Or specify a backend: uv pip install vllm --torch-backend=cu128

# Install vLLM CLI
uv pip install --upgrade vllm-cli
uv run vllm-cli

# If you are using conda:
# Activate the environment you have vllm installed in
pip install vllm-cli
vllm-cli

Option 2: Install vLLM CLI + vLLM

# Install vLLM CLI + vLLM
pip install vllm-cli[vllm]
vllm-cli

Option 3: Build from source (You still need to install vLLM seperately)

git clone https://github.com/Chen-zexi/vllm-cli.git
cd vllm-cli
pip install -e .

Option 4: For Isolated Installation (pipx/system packages)

โš ๏ธ Compatibility Note: pipx creates isolated environments which may have compatibility issues with vLLM's CUDA dependencies. Consider using uv or conda (see above) for better PyTorch/CUDA compatibility.

# If you do not want to use virtual environment and want to install vLLM along with vLLM CLI
pipx install "vllm-cli[vllm]"

# If you want to install pre-release version
pipx install --pip-args="--pre" "vllm-cli[vllm]"

Prerequisites

  • Python 3.9+
  • CUDA-compatible GPU (recommended)
  • vLLM package installed
  • For dependency issues, see Troubleshooting Guide

Basic Usage

# Interactive mode - menu-driven interface
vllm-cli
# Serve a model
vllm-cli serve --model openai/gpt-oss-20b

# Use a shortcut
vllm-cli serve --shortcut my-model

For detailed usage instructions, see the ๐Ÿ“˜ Usage Guide and ๐ŸŒ Multi-Model Proxy Guide.

Configuration

Built-in Profiles

vLLM CLI includes 7 optimized profiles for different use cases:

General Purpose:

  • standard - Minimal configuration with smart defaults
  • high_throughput - Maximum performance configuration
  • low_memory - Memory-constrained environments
  • moe_optimized - Optimized for Mixture of Experts models

Hardware-Specific (GPT-OSS):

  • gpt_oss_ampere - NVIDIA A100 GPUs
  • gpt_oss_hopper - NVIDIA H100/H200 GPUs
  • gpt_oss_blackwell - NVIDIA Blackwell GPUs

See ๐Ÿ“‹ Profiles Guide for detailed information.

Configuration Files

  • Main Config: ~/.config/vllm-cli/config.yaml
  • User Profiles: ~/.config/vllm-cli/user_profiles.json
  • Shortcuts: ~/.config/vllm-cli/shortcuts.json

Documentation

Integration with hf-model-tool

vLLM CLI uses hf-model-tool for model discovery:

  • Comprehensive model scanning
  • Ollama model support
  • Shared configuration

Development

Project Structure

src/vllm_cli/
โ”œโ”€โ”€ cli/           # CLI command handling
โ”œโ”€โ”€ config/        # Configuration management
โ”œโ”€โ”€ models/        # Model management
โ”œโ”€โ”€ server/        # Server lifecycle
โ”œโ”€โ”€ ui/            # Terminal interface
โ””โ”€โ”€ schemas/       # JSON schemas

Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.

License

MIT License - see LICENSE file for details.

Release History

VersionChangesUrgencyDate
v0.2.5### Added - **Multi-Model Proxy Server (Experimental)**: Enabling multiple LLMs through a single unified API endpoint - Single OpenAI-compatible endpoint for all models - Request routing based on model name - Save and reuse proxy configurations - **Dynamic Model Management**: Add or remove models at runtime without restarting the proxy - Live model registration and unregistration - Pre-registration with verification lifecycle - Graceful handling of model failures without affeLow8/25/2025
v0.2.5rc2### Multi-Model Proxy Server (Experimental) The Multi-Model Proxy is a new experimental feature that enables serving multiple LLMs through a single unified API endpoint. This feature is currently under active development and available for testing. **What It Does:** - **Single Endpoint** - All your models accessible through one API - **Live Management** - Add or remove models without stopping the service - **Dynamic GPU Management** - Efficient GPU resource distribution through vLLM's slLow8/24/2025
v0.2.5rc1Added multi model support through proxyLow8/22/2025
v0.2.4### Added - **Hardware-Optimized Profiles for GPT-OSS Models**: New built-in profiles optimized for different GPU architectures - `gpt_oss_ampere`: Optimized for NVIDIA A100 GPUs - `gpt_oss_hopper`: Optimized for NVIDIA H100/H200 GPUs - `gpt_oss_blackwell`: Optimized for NVIDIA Blackwell (B100/B200) GPUs - Based on official [vLLM GPT recipes](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html) - **Shortcuts System**: Save and quickly launch model + profile combinatLow8/20/2025
v0.2.4rc2**Full Changelog**: https://github.com/Chen-zexi/vllm-cli/compare/v0.2.4rc1...v0.2.4rc2Low8/19/2025
v0.2.4rc1### Added - **Ollama Model Support**: Full integration with Ollama-downloaded models through hf-model-tool - Automatic discovery of Ollama models in user (`~/.ollama`) and system (`/usr/share/ollama`) directories - GGUF format detection and experimental serving support ### Changed - Model cache refresh properly respects TTL settings (>60s) - Improved path display in model management UI for better clarity ### Fixed - Fixed duplicate Ollama models appearing in model list - Fixed mLow8/19/2025
v0.2.3### Fixed - **Critical**: Fixed missing built-in profiles when installing from PyPI - JSON schema files are now properly included in the package distributionLow8/18/2025
v0.2.2### Added - **Model Manifest Support**: Introduced `models_manifest.json` for mapping custom models in vLLM CLI native way - **Documentation**: Added [custom-model-serving.md](docs/custom-model-serving.md) for custom model serving guide ### Fixed - Serving models from custom directories now works as expected - Fixed some UI issuesLow8/18/2025
0.2.1- **Critical**: Fixed package installation issue - setuptools now correctly includes all sub-packagesLow8/17/2025
0.2.0## [0.2.0] - 2025-08-17 ### Added - **LoRA Adapter Support**: Serve models with LoRA adapters - select base model and multiple LoRA adapters for serving - **Enhanced Model List Display**: Comprehensive model listing showing HuggingFace models, LoRA adapters, and datasets with size information - **Model Directory Management**: Configure and manage custom model directories for automatic model discovery - **Model Caching**: Performance optimization through intelligent caching with TTL for moLow8/17/2025
0.1.1-Display complete log when startup failed -Small UI fixLow8/16/2025
0.1.0# Initial Release - Complete vLLM CLI implementation with interactive and command-line modes - Model management with automatic discovery and caching - Configuration profiles with dynamic hardware optimization - Server management with process monitoring and cleanup - Rich terminal UI with menu-driven navigationLow8/16/2025

Dependencies & License Audit

Loading dependencies...

Similar Packages

ReNovel-AIโœ๏ธ Revise and enhance novels with ReNovel-AI, your smart tool for story reimagining and memory-driven writing assistance.main@2026-06-05
PromptDrifter๐Ÿงญ PromptDrifter โ€“ oneโ€‘command CI guardrail that catches prompt drift and fails the build when your LLM answers change.main@2026-06-04
ossatureAn open-source harness for spec-driven code generation.master@2026-06-01
OSATool that just makes your open source project better using LLM agentsv0.2.11
bibfixerA Python tool that automatically cleans, completes, and standardizes BibTeX entries using LLMs and web search.v0.3.0

More in Developer Tools

mypyOptional static typing for Python
pipThe PyPA recommended tool for installing Python packages.
anthropicThe official Python library for the anthropic API
openinference-instrumentationOpenInference instrumentation utilities