A comprehensive benchmarking tool for testing Large Language Models (LLMs) with different context sizes across multiple inference engines including Ollama, MLX, MLX Distributed (beta), llama.cpp, LM Studio (beta), Exo, and any OpenAI-compatible endpoint.
- Multiple Benchmark Engines: Test models using Ollama (API & CLI), MLX, MLX Distributed (beta), llama.cpp, LM Studio (beta), Exo, and any OpenAI-compatible endpoint
- Automatic Hardware Detection: Captures system specs including:
  - CPU cores (with performance/efficiency breakdown on Apple Silicon)
  - GPU cores (Apple Silicon)
  - System memory
- Visual Performance Charts: Generate detailed performance graphs with hardware info
- Context File Generation: Create test files with precise token counts
- Apple Silicon Optimized: Full support for M1/M2/M3/M4 chips with MLX
- Jupyter Notebook Support: Interactive benchmarking and analysis
- Pre-commit Hooks: Automated code formatting with Black and isort
- Complete Output Capture: Saves model responses for analysis
- Install Python dependencies using uv:

```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync
```

- Install framework-specific requirements:
- Install Ollama from https://ollama.com
- Pull the model you want to test:

```bash
ollama pull gpt-oss:20b
# or
ollama pull llama3.2
# or any other Ollama model
```
- Requires Apple Silicon and the `mlx-lm` dependency (installed via `uv sync`).
- Models will be downloaded automatically from Hugging Face when running benchmarks.
- The model is loaded once and reused across all context sizes, with an automatic warmup pass before benchmarking begins.
- Run a llama.cpp server instance:

```bash
# Example: Start llama.cpp server on port 8080
./llama-server -m model.gguf --port 8080
```
- Install LM Studio from https://lmstudio.ai
- Load your desired model in LM Studio
- Start the local server from the LM Studio UI
- Start a server that exposes the OpenAI Chat Completions API (vLLM, llama.cpp, Ollama, text-generation-webui, etc.)
- No additional dependencies required; uses the `openai` Python SDK already included
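For illustration, the exchange with any such endpoint boils down to a Chat Completions request plus a throughput calculation from the response's standard `usage` block. A minimal sketch (the helper names are hypothetical, not part of this project; the response here is canned, no server needed):

```python
def build_chat_request(model: str, prompt: str, max_tokens: int = 200) -> dict:
    """Build the JSON body POSTed to /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def generation_tps(response: dict, elapsed_s: float) -> float:
    """Tokens/sec from the response's `usage` block and measured wall-clock time."""
    return response["usage"]["completion_tokens"] / elapsed_s

# Canned response for illustration:
canned = {"usage": {"prompt_tokens": 2048, "completion_tokens": 200}}
print(generation_tps(canned, elapsed_s=4.0))  # 50.0
```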
- (Optional) Set up pre-commit hooks for code quality:

```bash
# Install pre-commit hooks (only runs Black and isort for Python formatting)
pre-commit install

# Run hooks manually on all files
pre-commit run --all-files
```

```bash
# 1. Generate test data
uv run generate-context-files -- pride_and_prejudice.txt

# 2. Run benchmark with unified interface
uv run benchmark -- ollama-api gpt-oss:20b

# 3. View available engines
uv run benchmark -- --list-engines
```

All scripts can still be run directly with `python <script>.py ...`, but the recommended approach is `uv run <command> -- ...` using the script entry points defined in `pyproject.toml`.
Common commands:

```bash
uv run benchmark -- --list-engines
uv run compare-benchmarks -- --output my_comparison
uv run generate-context-files -- pride_and_prejudice.txt --sizes 2,4,8,16
```

Use `uv run generate-context-files -- <source> [options]` to create test files from any source text.
```bash
# Generate context files from Pride and Prejudice
uv run generate-context-files -- pride_and_prejudice.txt

# Generate specific context sizes (in thousands of tokens)
uv run generate-context-files -- source.txt --sizes 2,4,8,16,32,64,128
```

Options:
- `--sizes`: Comma-separated list of sizes in thousands of tokens (default: 2,4,8,16,32,64,128)
- `--encoding`: Tiktoken encoding to use (default: cl100k_base for GPT-3.5/GPT-4)
- `--output-dir`: Directory to save context files (default: current directory)
- `--prompt-suffix`: Custom prompt to append to each file (default: "Please provide a summary of the above text.")
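The core of context-file generation is encoding the source text, truncating it to an exact token count, and appending the prompt suffix. A rough sketch of that logic, with a whitespace tokenizer standing in for tiktoken's `cl100k_base` encoding (real tokens are subword IDs, so counts differ):

```python
def encode(text: str) -> list[str]:
    # Stand-in for tiktoken's enc.encode(); splits on whitespace for illustration.
    return text.split()

def decode(tokens: list[str]) -> str:
    # Stand-in for tiktoken's enc.decode().
    return " ".join(tokens)

def make_context(source: str, target_tokens: int, prompt_suffix: str) -> str:
    """Trim source text to exactly target_tokens tokens, then add the prompt."""
    tokens = encode(source)
    if len(tokens) < target_tokens:
        raise ValueError("source text too short for requested context size")
    return decode(tokens[:target_tokens]) + "\n\n" + prompt_suffix

text = "word " * 5000
ctx = make_context(text, 2000, "Please provide a summary of the above text.")
print(len(encode(ctx.split("\n\n")[0])))  # 2000
```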
```bash
# List available engines
uv run benchmark -- --list-engines

# Run Ollama API benchmark
uv run benchmark -- ollama-api gpt-oss:20b

# Run Ollama CLI benchmark
uv run benchmark -- ollama-cli llama3.2

# Run MLX benchmark (Apple Silicon only)
uv run benchmark -- mlx mlx-community/Qwen3-4B-Instruct-2507-4bit

# Run MLX distributed benchmark via mlx.launch (Beta - for example JACCL)
uv run benchmark -- mlx-distributed /Users/ifioravanti/MiniMax-M2.5-6bit \
    --hostfile /Users/ifioravanti/github/mlx-lm/m3-ultra-jaccl.json \
    --backend jaccl \
    --env MLX_METAL_FAST_SYNCH=1 \
    --env HF_HOME=/Users/Shared/.cache/huggingface

# Run llama.cpp benchmark (defaults to localhost:8080)
uv run benchmark -- llamacpp gpt-oss:20b

# Run llama.cpp with custom host and port
uv run benchmark -- llamacpp gpt-oss:20b --host 192.168.1.100 --port 9000

# Run LM Studio benchmark (Beta - requires LM Studio server)
uv run benchmark -- lmstudio local-model

# Run Exo benchmark (OpenAI-compatible endpoint on http://0.0.0.0:52415)
uv run benchmark -- exo local-model

# Run OpenAI-compatible endpoint benchmark (default: http://localhost:8080/v1)
uv run openai-benchmark --model llama3.2

# Run against a custom server URL
uv run openai-benchmark --model mistral --base-url http://localhost:11434/v1

# Run against a remote API
uv run openai-benchmark --model gpt-4o --base-url https://api.openai.com/v1 --api-key sk-...

# Custom options
uv run benchmark -- ollama-api gpt-oss:20b --contexts 0.5,1,2,4,8,16,32 --max-tokens 500 --save-responses

# Increase timeout for large context benchmarks
uv run benchmark -- mlx mlx-community/Qwen3-4B-Instruct-2507-4bit --contexts 64,128 --timeout 7200
```

Common options:
- `--contexts`: Context sizes to test (default: 0.5,1,2,4,8,16,32)
- `--max-tokens`: Maximum tokens to generate (default: 200)
- `--timeout`: Timeout in seconds for each benchmark (default: 3600 = 60 minutes)
- `--save-responses`: Save model responses to files
- `--output-csv`: Output CSV filename
- `--output-chart`: Output chart filename
Engine-specific options:
- `--kv-bit`: KV cache bit size for MLX (e.g., 4 or 8)
- `--host`: Host for llama.cpp server (default: localhost)
- `--port`: Port for llama.cpp server (default: 8080)
- `--backend`: Distributed backend for `mlx-distributed` (default: `jaccl`)
- `--hostfile`: Required hostfile JSON for `mlx-distributed`
- `--env`: Repeatable `KEY=VALUE` for `mlx.launch` environment variables in `mlx-distributed`
- `--sharded-script`: Path to `mlx_lm/examples/sharded_generate.py` for `mlx-distributed`
- `--pipeline`: Enable pipeline parallelism for `mlx-distributed`
- `--max-kv-size`: KV cache size in tokens for `mlx`
- `--base-url`: Override OpenAI-compatible endpoint base URL (`exo` and `openai-benchmark`)
- `--api-key`: API key for the endpoint (defaults to `OPENAI_API_KEY` env var or `"no-key"` for local servers)
After running multiple benchmarks, use the comparison tool to analyze performance differences:
```bash
# Compare all benchmark results in output directory
uv run compare-benchmarks

# Compare specific benchmark folders
uv run compare-benchmarks -- output/benchmark_ollama_* output/benchmark_mlx_*

# Save comparison to custom location
uv run compare-benchmarks -- --output my_comparison
```

The comparison tool generates:
- `comparison_chart.png`: Side-by-side performance charts
- `comparison_results.csv`: Aggregated metrics in CSV format
- `comparison_table.txt`: Formatted comparison table
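Conceptually, the comparison step groups per-run metrics by engine and aggregates them before charting. A simplified sketch (rows and column names are illustrative, not the tool's actual CSV schema):

```python
from collections import defaultdict
from statistics import mean

# Rows as they might appear across runs' benchmark_results.csv files (illustrative).
rows = [
    {"engine": "ollama-api", "context_k": 2, "gen_tps": 52.0},
    {"engine": "ollama-api", "context_k": 4, "gen_tps": 50.0},
    {"engine": "mlx",        "context_k": 2, "gen_tps": 61.0},
    {"engine": "mlx",        "context_k": 4, "gen_tps": 59.0},
]

# Group generation throughput by engine, then average across context sizes.
by_engine = defaultdict(list)
for row in rows:
    by_engine[row["engine"]].append(row["gen_tps"])

summary = {engine: mean(tps) for engine, tps in by_engine.items()}
print(summary)  # {'ollama-api': 51.0, 'mlx': 60.0}
```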
All benchmark scripts create a timestamped directory containing:
- `hardware_info.json` - Detailed system specifications:

  ```json
  {
    "chip": "Apple M3 Ultra",
    "total_cores": 32,
    "performance_cores": 24,
    "efficiency_cores": 8,
    "gpu_cores": 80,
    "memory_gb": 512
  }
  ```

- `benchmark_results.csv` - Detailed metrics for each context size:
  - Prompt tokens and tokens per second
  - Generation tokens and tokens per second
  - Total processing time
  - Additional engine-specific timing columns may appear (e.g., `prompt_eval_duration`, `time_to_first_token`)
  - Peak memory usage (MLX only)

- `benchmark_chart.png` - Visual charts showing:
  - Hardware specs in the title
  - Prompt processing speed (tokens/sec)
  - Generation speed (tokens/sec)
  - Total processing time and tokens generated (Ollama)
  - Peak memory usage and tokens generated (MLX)

- `generated_*.txt` - Complete model responses (Ollama only), including:
  - Model metadata
  - Token counts and timing
  - Full generated text (including thinking process for models that show it)

- `table.txt` - Formatted table with hardware info and results:

  ```text
  gpt-oss:20b Ollama CLI Benchmark Results
  Hardware: Apple M3 Ultra, 512GB RAM, 32 CPU cores (24P+8E), 80 GPU cores

  Context | Prompt TPS | Gen TPS | Total Time
  --------|------------|---------|------------
  2k      | 521.4      | 52.0    | 14.9s
  ```

- `tweet.txt` - Summary formatted for social media sharing
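For Ollama runs, the per-second rates in `benchmark_results.csv` follow directly from the nanosecond timing fields Ollama's `/api/generate` returns. A sketch of that arithmetic on a canned response (the numbers are made up for illustration):

```python
# Timing fields as returned by Ollama's /api/generate (durations in nanoseconds).
response = {
    "prompt_eval_count": 2048,
    "prompt_eval_duration": 4_000_000_000,  # 4 s spent on prompt processing
    "eval_count": 200,
    "eval_duration": 5_000_000_000,         # 5 s spent generating
}

NS_PER_S = 1e9
prompt_tps = response["prompt_eval_count"] / (response["prompt_eval_duration"] / NS_PER_S)
gen_tps = response["eval_count"] / (response["eval_duration"] / NS_PER_S)

print(f"prompt: {prompt_tps:.1f} tok/s, generation: {gen_tps:.1f} tok/s")
# prompt: 512.0 tok/s, generation: 40.0 tok/s
```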
The project includes `ollama_benchmark_notebook.ipynb` for interactive benchmarking:
- Generate context files interactively
- Run both CLI and API benchmarks
- Compare results side-by-side
- Create comparison charts
- Display hardware information
The tool automatically detects and reports:

On macOS (Apple Silicon):
- Chip model and variant
- Total CPU cores with performance/efficiency breakdown
- GPU core count
- System memory
- Optimized for MLX framework performance

On Linux:
- Processor information
- CPU core count
- System memory
This information is:
- Displayed in chart titles
- Saved to `hardware_info.json`
- Included in output tables
- Shown during benchmark execution
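On macOS, the Apple Silicon details typically come from `sysctl` keys such as `machdep.cpu.brand_string`, `hw.perflevel0.logicalcpu`, and `hw.memsize`. A cross-platform sketch of that style of detection with a generic fallback (key names per Apple's sysctl interface; the fallback fields are simplified and the GPU core count, which needs `system_profiler`, is omitted):

```python
import os
import platform
import subprocess

def sysctl(key: str) -> str:
    """Read a single sysctl value on macOS."""
    return subprocess.run(
        ["sysctl", "-n", key], capture_output=True, text=True, check=True
    ).stdout.strip()

def detect_hardware() -> dict:
    """Collect a hardware_info-style dict; full detail on macOS, basics elsewhere."""
    if platform.system() == "Darwin":
        return {
            "chip": sysctl("machdep.cpu.brand_string"),
            "total_cores": int(sysctl("hw.logicalcpu")),
            "performance_cores": int(sysctl("hw.perflevel0.logicalcpu")),
            "efficiency_cores": int(sysctl("hw.perflevel1.logicalcpu")),
            "memory_gb": int(sysctl("hw.memsize")) // (1024 ** 3),
        }
    # Generic fallback (Linux etc.): no chip/GPU breakdown available this way.
    return {
        "chip": platform.processor() or platform.machine(),
        "total_cores": os.cpu_count(),
        "memory_gb": None,
    }

print(detect_hardware())
```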
This project uses pre-commit hooks for code formatting:
- Black: Python code formatting (120 char line length)
- isort: Import sorting
The hooks run automatically on commit if installed. To run them manually:

```bash
# Check all files
pre-commit run --all-files

# Update hook versions
pre-commit autoupdate
```

```text
llm_context_benchmarks/
├── benchmark.py                     # Unified benchmark interface (main entry point)
├── benchmark_common.py              # Shared utilities and functions
├── ollama_api_benchmark.py          # Ollama API-based benchmarking
├── ollama_cli_benchmark.py          # Ollama CLI-based benchmarking
├── mlx_benchmark.py                 # MLX single-node benchmarking (loads model once + warmup)
├── mlx_distributed_benchmark.py     # MLX distributed benchmarking via mlx.launch (Beta)
├── llamacpp_benchmark.py            # llama.cpp server benchmarking
├── lmstudio_benchmark.py            # LM Studio benchmarking (Beta)
├── openai_benchmark.py              # Generic OpenAI-compatible endpoint benchmarking
├── compare_benchmarks.py            # Multi-benchmark comparison tool
├── generate_context_files.py        # Context file generation
├── ollama_benchmark_notebook.ipynb  # Interactive notebook
├── pyproject.toml                   # Python dependencies (uv)
├── uv.lock                          # Resolved dependency lockfile (uv)
├── .pre-commit-config.yaml          # Pre-commit configuration
├── .gitignore                       # Git ignore rules
└── README.md                        # This file
```
- Fork the repository
- Create a feature branch
- Install pre-commit hooks: `pre-commit install`
- Make your changes
- Run tests and benchmarks
- Submit a pull request
We welcome benchmark contributions from different hardware configurations! To share your benchmark results:
- Run benchmarks on your hardware
- The output folders (`benchmark_ollama_*`) are normally gitignored
- To commit your results, either:
  - Comment out the relevant lines in `.gitignore`, or
  - Rename your folder to include your hardware (e.g., `benchmark_m2_max_64gb_ollama_cli_llama3.2`)
- Create a PR with your benchmark results
- Include hardware details in your PR description
This helps the community understand performance across different systems!
- Python 3.13+
- Sufficient RAM for the model and context sizes you want to test
- psutil (for hardware detection)
- matplotlib, numpy (for charts)
- tiktoken (for token counting)
- Ollama: Ollama installed and running
- MLX: Apple Silicon Mac (M1/M2/M3/M4), mlx-lm package
- MLX Distributed (Beta): `mlx.launch` available and a valid hostfile JSON (for example with `--backend jaccl`)
- llama.cpp: llama.cpp server running
- LM Studio (Beta): LM Studio installed with server running
- OpenAI-compatible endpoint: Any server exposing the OpenAI Chat Completions API (`/v1/chat/completions`)
- Larger context sizes require more memory
- Performance varies significantly between models and hardware
- The tool automatically handles models that support different maximum context lengths
- Hardware information is automatically collected on macOS (Apple Silicon) and Linux systems
- Generated text files include both the model's thinking process (if shown) and final response (Ollama only)
- All outputs are organized in timestamped directories for easy comparison
- MLX benchmark loads the model once and runs a warmup pass before benchmarking, using the `mlx_lm` Python API directly for efficient inference
- MLX supports quantized models (4-bit, 8-bit) for efficient inference on Apple Silicon
- MLX Distributed (beta) launches `mlx_lm/examples/sharded_generate.py` through `mlx.launch` on each benchmark run
- llama.cpp integration requires a running server instance with your model loaded
- LM Studio support is currently in beta - ensure your server is running before benchmarking
