Benchmark and compare LLM tool, configuration, and prompt setups using a shared case framework with automated scoring and telemetry.
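As a rough illustration of the idea, the sketch below runs several setups against the same list of cases, scores each output automatically, and records latency as a simple form of telemetry. Everything in it is hypothetical and not taken from the project: the names `Case`, `Result`, `exact_match`, `run_benchmark`, and `mean_score_by_setup`, the exact-match scorer, and the stub setups that stand in for real LLM calls.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One shared benchmark case (hypothetical structure)."""
    name: str
    prompt: str
    expected: str


@dataclass
class Result:
    """Score plus basic telemetry for one setup on one case."""
    setup: str
    case: str
    score: float
    latency_s: float


def exact_match(output: str, expected: str) -> float:
    """Assumed scorer: 1.0 when the trimmed output equals the expected text."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_benchmark(
    setups: dict[str, Callable[[str], str]],
    cases: list[Case],
    scorer: Callable[[str, str], float] = exact_match,
) -> list[Result]:
    """Run every setup against every case, recording score and wall-clock latency."""
    results = []
    for setup_name, run in setups.items():
        for case in cases:
            start = time.perf_counter()
            output = run(case.prompt)
            latency = time.perf_counter() - start
            results.append(
                Result(setup_name, case.name, scorer(output, case.expected), latency)
            )
    return results


def mean_score_by_setup(results: list[Result]) -> dict[str, float]:
    """Average the scores per setup so setups can be compared side by side."""
    scores: dict[str, list[float]] = {}
    for r in results:
        scores.setdefault(r.setup, []).append(r.score)
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


# Stub setups standing in for real tool/configuration/prompt combinations.
cases = [
    Case("add", "2+2", "4"),
    Case("capital", "Capital of France?", "Paris"),
]
setups = {
    "baseline": lambda prompt: "4",
    "lookup": lambda prompt: {"2+2": "4", "Capital of France?": "Paris"}.get(prompt, ""),
}

results = run_benchmark(setups, cases)
summary = mean_score_by_setup(results)
print(summary)
```

Because every setup sees the same cases and the same scorer, differences in the summary can be attributed to the setup rather than to the test data. A real harness would likely replace the stubs with model calls and record richer telemetry, such as token counts or cost.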