
Description

eBPF-based GPU causal observability agent

README

Ingero - GPU Causal Observability


Featured in: awesome-ebpf · awesome-observability · awesome-sre-tools · awesome-cloud-native · awesome-profiling · Awesome-GPU · awesome-devops-mcp-servers · MCP Registry · Glama · mcpservers.org

Version: 0.9.2

v0.9.2 improvements: multi-library libcudart discovery, _Py_DebugOffsets support for CPython 3.12, configurable ring buffers (--ringbuf-size), adaptive sampling (--sampling-rate), in-kernel aggregation of mm_page_alloc/sched_switch, a dedicated critical-events ring buffer (OOM/exec/exit/fork never drop), and an optional in-kernel CPython 3.10/3.11/3.12 frame walker (--py-walker=ebpf) that works at ptrace_scope=3.

The only GPU observability tool your AI assistant can talk to.

"What caused the GPU stall?" → "forward() at train.py:142 - cudaMalloc spiking 48ms during CPU contention. 9,829 calls, 847 scheduler preemptions."

Ingero is a production-grade eBPF agent that traces the full chain - from Linux kernel events through CUDA API calls to your Python source lines - with <2% overhead, zero code changes, and one binary.

ingero demo incident - CPU contention causes GPU latency spike, full causal chain diagnosis with root cause and fix recommendation

Quick Start

# Install (Linux amd64 β€” see below for arm64/Docker)
# ingero-version:install-curl product=ingero channel=stable
VERSION=0.9.2
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_amd64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/

# Trace your GPU workload
sudo ingero trace

# Diagnose what happened
ingero explain --since 5m

  • The "Why": Correlate a cudaStreamSync spike with sched_switch events - the host kernel preempted your thread.
  • The "Where": Map CUDA calls back to Python source lines in your PyTorch forward() pass.
  • The "Hidden Kernels": Trace the CUDA Driver API to see kernel launches by cuBLAS/cuDNN that bypass standard profilers.

No ClickHouse, no PostgreSQL, no MinIO - just one statically linked Go binary and embedded SQLite.

See a real AI investigation session - an AI assistant diagnosing GPU training issues on A100 and GH200 using only Ingero's MCP tools. No shell access, no manual SQL - just questions and answers.

What It Does

Ingero uses eBPF to trace GPU workloads at four layers (three CUDA probe layers plus host tracepoints), reads system metrics from /proc, and assembles causal chains that explain root causes:

  1. CUDA Runtime uprobes - traces cudaMalloc, cudaFree, cudaLaunchKernel, cudaMemcpy, cudaMemcpyAsync, cudaStreamSync / cudaDeviceSynchronize via uprobes on libcudart.so
  2. CUDA Driver uprobes - traces cuLaunchKernel, cuMemcpy, cuMemcpyAsync, cuCtxSynchronize, cuMemAlloc via uprobes on libcuda.so. Captures kernel launches from cuBLAS/cuDNN that bypass the runtime API.
  3. CUDA Graph lifecycle uprobes - traces cudaStreamBeginCapture, cudaStreamEndCapture, cudaGraphInstantiate, cudaGraphLaunch for graph capture/replay visibility in torch.compile and vLLM workloads
  4. Host tracepoints - traces sched_switch, sched_wakeup, mm_page_alloc, oom_kill, sched_process_exec/exit/fork for CPU scheduling, memory pressure, and process lifecycle
  5. System context - reads CPU utilization, memory usage, load average, and swap from /proc (no eBPF, no root needed)

The causal engine correlates events across layers by timestamp and PID to produce automated root cause analysis with severity ranking and fix recommendations.

$ sudo ingero trace

  Ingero Trace  -  Live CUDA Event Stream
  Target: PID 4821 (python3)
  Library: /usr/lib/x86_64-linux-gnu/libcudart.so.12
  CUDA probes: 14 attached
  Driver probes: 10 attached
  Host probes: 7 attached

  System: CPU [████████░░░░░░░░░░░░] 47% | Mem [██████████████░░░░░░] 72% (11.2 GB free) | Load 3.2 | Swap 0 MB

  CUDA Runtime API                                               Events: 11,028
  ┌──────────────────────┬────────┬──────────┬──────────┬──────────┬─────────┐
  │ Operation            │ Count  │ p50      │ p95      │ p99      │ Flags   │
  ├──────────────────────┼────────┼──────────┼──────────┼──────────┼─────────┤
  │ cudaLaunchKernel     │ 11,009 │ 5.2 µs   │ 12.1 µs  │ 18.4 µs  │         │
  │ cudaMalloc           │     12 │ 125 µs   │ 2.1 ms   │ 8.4 ms   │ ⚠ p99  │
  │ cudaDeviceSynchronize│      7 │ 684 µs   │ 1.2 ms   │ 3.8 ms   │         │
  └──────────────────────┴────────┴──────────┴──────────┴──────────┴─────────┘

  CUDA Driver API                                                Events: 17,525
  ┌──────────────────────┬────────┬──────────┬──────────┬──────────┬─────────┐
  │ Operation            │ Count  │ p50      │ p95      │ p99      │ Flags   │
  ├──────────────────────┼────────┼──────────┼──────────┼──────────┼─────────┤
  │ cuLaunchKernel       │ 17,509 │ 4.8 µs   │ 11.3 µs  │ 16.2 µs  │         │
  │ cuMemAlloc           │     16 │ 98 µs    │ 1.8 ms   │ 7.1 ms   │         │
  └──────────────────────┴────────┴──────────┴──────────┴──────────┴─────────┘

  Host Context                                                   Events: 258
  ┌─────────────────┬────────┬──────────────────────────────────────────┐
  │ Event           │ Count  │ Detail                                   │
  ├─────────────────┼────────┼──────────────────────────────────────────┤
  │ mm_page_alloc   │    251 │ 1.0 MB allocated (order-0: 251)          │
  │ process_exit    │      7 │ 7 processes exited                       │
  └─────────────────┴────────┴──────────────────────────────────────────┘

  ⚠ cudaStreamSync p99 = 142ms  -  correlated with 23 sched_switch events
    (GPU thread preempted during sync wait, avg 2.1ms off-CPU)

What You'll Discover

Things no other GPU tool can show you.

"cuBLAS was launching 17,509 kernels and you couldn't see any of them." Most profilers trace only the CUDA Runtime API - but cuBLAS calls cuLaunchKernel (driver API) directly, bypassing the runtime. Ingero traces both layers: 11,009 runtime + 17,509 driver = complete visibility into every kernel launch.

"Your training slowed because logrotate stole 4 CPU cores." System Context shows CPU at 94%, Load 12.1. The CUDA table shows cudaStreamSync p99 jumping from 16ms to 142ms. The Host Context shows 847 sched_switch events. ingero explain assembles the full causal chain: logrotate preempted the training process → CUDA sync stalled → training throughput dropped 30%. Fix: nice -n 19 logrotate, or pin training to dedicated cores.

"Your model spends 38% of wall-clock time on data movement, not compute." nvidia-smi says "GPU utilization 98%", but the GPU is busy doing cudaMemcpy, not compute. Ingero's time-fraction breakdown makes this obvious. The fix (pinned memory, async transfers, larger batches) saves 30-50% wall-clock time.

"Your host is swapping and your GPU doesn't know it." System Context shows Swap 2.1 GB. cudaMalloc p99 rises from 0.02ms to 8.4ms. No GPU tool shows this - nvidia-smi says GPU memory is fine, but host-side CUDA bookkeeping is hitting swap.

"Your vLLM inference spiked because a new batch size triggered CUDA Graph re-capture." Ingero traces cudaStreamBeginCapture / cudaGraphLaunch via eBPF uprobes - no CUPTI, no Nsight, no code changes. When GraphLaunch rate drops 50%, Ingero flags graph pool exhaustion. When capture overlaps with OOM, the causal chain explains why. Works with torch.compile(mode="reduce-overhead") and vLLM out of the box.

"Rank 3 stalled for 200ms while ranks 0-2 waited - one query shows all 4 nodes." With ingero query --nodes, one command fans out to every node in your cluster and merges the results. ingero merge combines offline databases for air-gapped analysis. ingero export --format perfetto produces a timeline you can open in Perfetto UI - one track per node/rank, immediately spotting the straggler. Clock skew between nodes is detected automatically.

"Ask your AI: what line of my code caused the GPU stall?" Your AI assistant calls Ingero's MCP server and answers in one shot: "The issue is in forward() at train.py:142, calling cudaMalloc through PyTorch. 9,829 calls, avg 3.1ms but spiking to 48.3ms during CPU contention." Resolved Python source lines, native symbols, timing stats - no logs, no manual SQL, no hex addresses. The engineer asks questions in plain English and gets production root causes back.

See It In Action

sudo ingero check - system readiness
ingero check verifying kernel, BTF, NVIDIA driver, GPU model, CUDA libraries, and active processes
sudo ingero trace - live event stream
ingero trace showing live CUDA Runtime and Driver API statistics with rolling p50/p95/p99 latencies and host context
ingero explain --since 5m - automated diagnosis
ingero explain producing incident report with causal chains, root cause analysis, and fix recommendations
sudo ingero trace - CUDA Graph lifecycle events
ingero trace showing CUDA Graph capture, instantiate, and launch events alongside CUDA runtime and host events
ingero explain - graph causal chain diagnosis
ingero explain showing causal chain linking CUDA Graph launch to CPU contention with fix recommendations
ingero demo --no-gpu incident - try without a GPU
ingero demo running in synthetic mode without GPU, showing full causal chain diagnosis
ingero demo                 # run all 6 scenarios (auto-detects GPU)
ingero demo incident        # full causal chain in 30 seconds
ingero demo --no-gpu        # synthetic mode (no root, no GPU needed)
sudo ingero demo --gpu      # real GPU + eBPF tracing

Scenarios

Scenario What It Reveals
incident CPU spike + sched_switch storm → cudaStreamSync 8.5x latency spike → full causal chain with root cause and fix
cold-start First CUDA calls take 50-200x longer than steady state (CUDA context init)
memcpy-bottleneck cudaMemcpy dominates wall-clock time (38%), not compute - nvidia-smi lies
periodic-spike cudaMalloc spikes 50x every ~200 batches (PyTorch caching allocator)
cpu-contention Host CPU preemption causes CUDA latency spikes
gpu-steal Multi-process GPU time-slicing quantified via CUDA API timing patterns

Every scenario prints a GPU auto-detect header showing GPU model and driver version, then displays real-time ASCII bar charts for system context.


This README covers single-node GPU tracing and investigation. For multi-node distributed training diagnostics (fan-out queries across nodes, offline database merge, Perfetto timeline export, clock skew detection), see the Multi-Node Investigation Walkthrough.


Install

Binary Release (recommended)

Download a pre-built binary from GitHub Releases.

Archive filenames include the version: ingero_<version>_linux_<arch>.tar.gz. Replace VERSION below with the latest release (e.g., 0.9.2):

# Linux amd64
# ingero-version:install-archive-amd64 product=ingero channel=stable
VERSION=0.9.2
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_amd64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/

# Linux arm64 (GH200, Grace Hopper, Graviton)
# ingero-version:install-archive-arm64 product=ingero channel=stable
VERSION=0.9.2
curl -fsSL "https://github.com/ingero-io/ingero/releases/download/v${VERSION}/ingero_${VERSION}_linux_arm64.tar.gz" | tar xz
sudo mv ingero /usr/local/bin/

Docker Image

Multi-arch images (amd64 + arm64) are published to GHCR on every release:

# Pull the latest image
docker pull ghcr.io/ingero-io/ingero:latest

# Or pin to a specific version
# ingero-version:docker-pull-example product=ingero channel=stable
docker pull ghcr.io/ingero-io/ingero:v0.9.2

# Quick test (no root, no GPU needed)
docker run --rm ghcr.io/ingero-io/ingero demo --no-gpu

# System readiness check
docker run --rm --privileged --pid=host ghcr.io/ingero-io/ingero check

# Live eBPF tracing (requires privileges + kernel mounts)
docker run --rm --privileged --pid=host \
  -v /sys/kernel/debug:/sys/kernel/debug \
  -v /sys/kernel/btf:/sys/kernel/btf:ro \
  -v /var/lib/ingero:/var/lib/ingero \
  ghcr.io/ingero-io/ingero trace --record

Minimum capabilities (alternative to --privileged): --cap-add=BPF --cap-add=PERFMON --cap-add=SYS_ADMIN.

Note: eBPF tracing (trace, demo --gpu) requires --privileged --pid=host plus the kernel volume mounts shown above. Without these, only unprivileged commands work (demo --no-gpu, check, version, explain, query). The --pid=host flag shares the host's /proc - do not also bind-mount -v /proc:/proc:ro as this causes OCI runtime errors on Docker Desktop and WSL2.

Data persistence: The container stores the SQLite database at /var/lib/ingero/ingero.db by default. Mount -v /var/lib/ingero:/var/lib/ingero to persist data after the container stops. Without this mount, all trace data is lost when the container exits.

Multiple databases: Use --db or the INGERO_DB env var to work with different databases:

# Trace to a named database
docker run --rm --privileged --pid=host \
  -v /var/lib/ingero:/var/lib/ingero \
  -v /sys/kernel/debug:/sys/kernel/debug \
  -v /sys/kernel/btf:/sys/kernel/btf:ro \
  ghcr.io/ingero-io/ingero trace --db /var/lib/ingero/training-run-42.db

# Investigate a specific database
docker run --rm \
  -v /var/lib/ingero:/var/lib/ingero \
  ghcr.io/ingero-io/ingero explain --db /var/lib/ingero/training-run-42.db

# Compare databases from different runs
docker run --rm \
  -v /var/lib/ingero:/var/lib/ingero \
  ghcr.io/ingero-io/ingero query --db /var/lib/ingero/training-run-41.db --since 1h

docker run --rm \
  -v /var/lib/ingero:/var/lib/ingero \
  ghcr.io/ingero-io/ingero query --db /var/lib/ingero/training-run-42.db --since 1h

The image is ~10 MB (Alpine 3.20 + statically linked Go binary). When building the dev Dockerfile locally, pass version info via build args:

# ingero-version:docker-build-arg product=ingero channel=stable
docker build -f deploy/docker/Dockerfile \
  --build-arg VERSION=0.9.2 \
  --build-arg COMMIT=$(git rev-parse --short HEAD) \
  --build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  -t ingero:local .

GHCR images have version info baked in automatically via GoReleaser. See deploy/docker/Dockerfile for details.

Build from Source

# Quick setup: install all build dependencies (Go, clang, llvm) on Ubuntu 22.04/24.04
curl -fsSL https://raw.githubusercontent.com/ingero-io/ingero/main/scripts/install-deps.sh | bash

# Requires clang-14, Linux kernel with BTF
git clone https://github.com/ingero-io/ingero.git
cd ingero
make              # generates eBPF bindings, builds, tests, and lints  -  single command
sudo make install # optional  -  copies binary to /usr/local/bin/ingero
                  # or just use ./bin/ingero directly, or: alias ingero=$PWD/bin/ingero

Requirements

  • Linux kernel 5.15+ with BTF (CONFIG_DEBUG_INFO_BTF=y)
  • NVIDIA driver 550+ with CUDA 11.x, 12.x, or 13.x
  • Root / CAP_BPF + CAP_PERFMON (eBPF requires elevated privileges)
  • Tested on: GH200, H100, A100, A10, RTX 4090, RTX 3090 (x86_64 and aarch64)

Commands

ingero check

Check if your system is ready for eBPF-based GPU tracing.

$ ingero check

Ingero  -  System Readiness Check

  [✓] Kernel version: 5.15.0-144-generic
      need 5.15+
  [✓] BTF support: /sys/kernel/btf/vmlinux
      available (5242880 bytes)
  [✓] NVIDIA driver: 580.126.09
      open kernel modules (550+)
  [✓] GPU model: NVIDIA GeForce RTX 3090 Ti, 24564 MiB
  [✓] CUDA runtime: /usr/lib/x86_64-linux-gnu/libcudart.so.12
      loaded by 1 process(es)
  [✓] CUDA driver (libcuda.so): /usr/lib/x86_64-linux-gnu/libcuda.so.1
      available for driver API tracing
  [✓] CUDA processes: 1 found
      PID 4821 (python3)

All checks passed  -  ready to trace!

ingero trace

Live event stream with rolling stats, system context, and anomaly detection. Events are recorded to SQLite by default (use --record=false to disable). The database is capped at 10 GB rolling storage and auto-purges old events when the limit is reached (see --max-db).

sudo ingero trace                           # auto-detect all CUDA processes for current user
sudo ingero trace --pid 4821               # trace specific process
sudo ingero trace --pid 4821,5032          # trace multiple specific processes
sudo ingero trace --user bob               # trace all CUDA processes owned by bob
sudo ingero trace --record=false           # disable SQLite recording
sudo ingero trace --duration 60s           # stop after 60 seconds
sudo ingero trace --json                   # JSON output (pipe to jq)
sudo ingero trace --verbose                # show individual events
sudo ingero trace --stack=false            # disable stack traces (saves ~0.4-0.6% overhead)
sudo ingero trace --max-db 10g             # limit DB to 10 GB (default), prunes oldest events
sudo ingero trace --max-db 500m            # limit DB to 500 MB (tight disk budget)
sudo ingero trace --max-db 0               # unlimited (no size-based pruning)
sudo ingero trace --deadband 5              # suppress idle snapshots (5% threshold)
sudo ingero trace --deadband 5 --heartbeat 30s  # deadband + force report every 30s
sudo ingero trace --prometheus :9090       # expose Prometheus /metrics endpoint
sudo ingero trace --otlp localhost:4318    # push metrics via OTLP
sudo ingero trace --node gpu-node-07      # tag events with node identity (for multi-node)
sudo ingero trace --cuda-lib /opt/venv/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12
                                           # explicit libcudart path (skips auto-discovery)
sudo ingero trace --ringbuf-size 32m       # override high-throughput ring buffer size (power of 2, min 4096)
sudo ingero trace --sampling-rate 0        # adaptive sampling (default: 1 = emit all; N>1 = 1-in-N)
sudo ingero trace --py-walker ebpf         # in-kernel CPython walker (works at ptrace_scope=3)

Flag reference (post-v0.9.1 additions):

  • --cuda-lib PATH β€” Explicit path to libcudart.so. Skips auto-discovery. Useful for venv workloads where multiple libcudart copies exist.
  • --ringbuf-size SIZE β€” Override ring buffer size for high-throughput probes (cuda, driver, host). Accepts k/m/g suffix. Must be a power of 2, minimum 4096. Default: compiled sizes (8MB cuda/driver, 1MB host).
  • --sampling-rate N β€” Event sampling rate. 0 = adaptive (auto-adjusts under sustained drops). 1 = emit all events (default behavior). N > 1 = emit 1 in every N events. Applies to cuda/driver/graph probes only; host probes are never sampled.
  • --py-walker {auto,ebpf,userspace} β€” Python frame walker selection. auto (default) uses the userspace walker. ebpf uses the in-kernel CPython walker (supports 3.10, 3.11, 3.12 β€” no /proc/pid/mem required, works at ptrace_scope=3). userspace forces the classic walker.

ingero check now reports the current kernel.yama.ptrace_scope value with actionable hints when it blocks Python source attribution (see Troubleshooting).

Only trace needs sudo - it attaches eBPF probes to the kernel. All other commands (check, explain, query, mcp, demo) run unprivileged. When you run sudo ingero trace, the database is written to your home directory (not /root/) and chown'd to your user, so non-sudo commands can read it.

Process targeting:

  • Default (no flags): traces all CUDA processes owned by the invoking user (via SUDO_USER). On single-user boxes, this means all CUDA processes.
  • --pid: target specific process(es), comma-separated (e.g., --pid 1234,5678).
  • --user: target all CUDA processes owned by a specific user (--user bob, --user root).
  • Dynamic child tracking: fork events auto-enroll child PIDs for host correlation.

The trace display shows five sections:

  1. System Context - CPU, memory, load, swap with ASCII bar charts (green/yellow/red)
  2. CUDA Runtime API - per-operation p50/p95/p99 latency with anomaly flags (cudaMalloc, cudaLaunchKernel, graphLaunch, etc.)
  3. CUDA Driver API - driver-level operations (cuLaunchKernel, cuMemAlloc, etc.) that cuBLAS/cuDNN call directly
  4. Host Context - scheduler, memory, OOM, and process lifecycle events
  5. CUDA Graph events - graph capture, instantiate, and launch events (when graph-using workloads are traced)

ingero explain

Analyze recorded events from SQLite and produce an incident report with causal chains, root causes, and fix recommendations. Reads from the database populated by ingero trace - no root needed.

ingero explain                         # analyze last 5 minutes
ingero explain --since 1h             # last hour
ingero explain --since 2d             # last 2 days
ingero explain --since 1h30m          # human-friendly durations (also: 1w, 3d12h)
ingero explain --last 100             # last 100 events
ingero explain --pid 4821             # filter by specific process
ingero explain --pid 4821,5032        # filter by multiple processes
ingero explain --chains               # show stored causal chains (no re-analysis)
ingero explain --json                 # JSON output for pipelines
ingero explain --from "15:40" --to "15:45"  # absolute time range
ingero explain --per-process              # per-process CUDA API breakdown
ingero explain --per-process --json       # JSON output for pipelines

# Multi-node fleet queries (fan-out to multiple Ingero dashboard APIs)
ingero explain --nodes host1:8080,host2:8080,host3:8080  # cross-node causal chains

Per-Process Breakdown

For multi-process GPU workloads (RAG pipelines, model serving with workers, multi-tenant GPU sharing), --per-process shows a CUDA API breakdown grouped by process:

$ ingero explain --per-process --since 5m

PER-PROCESS GPU API BREAKDOWN

  PID 4821 (vllm-worker)
    cuLaunchKernel      12,847 calls   p50=4.8µs   p95=11.2µs   p99=16.1µs
    cudaMemcpyAsync        892 calls   p50=38µs    p95=124µs    p99=891µs
    cudaMallocManaged       14 calls   p50=112µs   p95=2.1ms    p99=8.4ms

  PID 5032 (embedding-svc)
    cuLaunchKernel       3,201 calls   p50=5.1µs   p95=12.8µs   p99=19.4µs
    cudaMemcpy             448 calls   p50=42µs    p95=98µs     p99=412µs

  ⚠ Multi-process GPU contention: 2 processes sharing GPU with CUDA/Driver ops

This answers "which process is hogging the GPU?" - essential for diagnosing RAG pipeline contention where embedding, retrieval, and generation compete for GPU time.

A full ingero explain run assembles the incident report:

INCIDENT REPORT  -  2 causal chains found (1 HIGH, 1 MEDIUM)

[HIGH] cudaStreamSync p99=142ms (8.5x p50)  -  CPU contention
  Timeline:
    15:41:20  [SYSTEM]  CPU 94%, Load 12.1, Swap 2.1GB
    15:41:20  [HOST]    sched_switch: PID 8821 (logrotate) preempted PID 4821
    15:41:22  [CUDA]    cudaStreamSync 142ms (normally 16.7ms)

  Root cause: logrotate cron job preempted training process 847 times
  Fix: Add `nice -n 19` to logrotate cron, or pin training to dedicated cores

ingero query

Query stored events by time range, PID, and operation type. Supports multi-node fleet queries with --nodes.

ingero query --since 1h
ingero query --since 1h --pid 4821
ingero query --since 1h --pid 4821,5032
ingero query --since 30m --op cudaMemcpy --json

# Multi-node fleet queries (fan-out to multiple Ingero dashboard APIs)
ingero query --nodes host1:8080,host2:8080 "SELECT node, source, count(*) FROM events GROUP BY node, source"
ingero query --nodes host1:8080,host2:8080,host3:8080 "SELECT node, count(*) FROM events GROUP BY node"

Fleet queries fan out the SQL to each node's /api/v1/query endpoint, concatenate results with a node column prepended, and display a unified table. Partial failures return results from reachable nodes with warnings for unreachable ones. Clock skew between nodes is detected automatically (configurable via --clock-skew-threshold, default 10ms).

Configure default fleet nodes in ingero.yaml under fleet.nodes to avoid repeating --nodes on every command.

Storage uses SQLite with size-based pruning (default 10 GB via --max-db). Data is stored locally at ~/.ingero/ingero.db - nothing leaves your machine.

ingero mcp

Start an MCP (Model Context Protocol) server for AI agent integration.

ingero mcp                        # stdio (for Claude Code / MCP clients)
ingero mcp --http :8080           # HTTPS on port 8080 (TLS 1.3, auto-generated self-signed cert)
ingero mcp --http :8080 --tls-cert cert.pem --tls-key key.pem  # custom TLS certificate

Note: The --http flag enables the Streamable HTTP transport - all connections use TLS 1.3 only (no plain HTTP). When no --tls-cert/--tls-key is provided, ingero auto-generates an ephemeral self-signed ECDSA P-256 certificate. Use curl -k to skip certificate verification for self-signed certs.

AI-first analysis: MCP responses use telegraphic compression (TSC) by default, reducing token count by ~60%. Set {"tsc": false} per request for verbose output.

MCP tools:

Tool Description
get_check System diagnostics (kernel, BTF, NVIDIA, CUDA, GPU model)
get_trace_stats CUDA + host statistics (p50/p95/p99 or aggregate fallback for large DBs)
get_causal_chains Causal chains with severity ranking and root cause (deduplicated, top 10 by default)
get_stacks Resolved call stacks for CUDA/driver operations (symbols, source files, timing)
graph_lifecycle CUDA Graph lifecycle timeline for a PID: capture, instantiate, launch sequences
graph_frequency Graph launch frequency per executable: hot/cold classification, pool saturation
run_demo Run synthetic demo scenarios
get_test_report GPU integration test report (JSON)
run_sql Execute read-only SQL for ad-hoc analysis
query_fleet Fan-out query across multiple Ingero nodes (chains, ops, overview, sql) with clock skew detection

MCP prompts:

Prompt Description
/investigate Guided investigation workflow - walks the AI through stats, chains, and SQL to diagnose GPU issues. Works with any MCP client.

Works with any AI, not just Claude. Use local open-source models via ollmcp (Ollama MCP client):

# Install ollmcp (minimax-m2.7:cloud routes to MiniMax API via Ollama Cloud,
# or use a local model like qwen3.5:32b via ollama pull qwen3.5:32b)
pip install mcp-client-for-ollama

# Create a config pointing to Ingero's MCP server
cat > /tmp/ingero-mcp.json << 'EOF'
{"mcpServers":{"ingero":{"command":"ingero","args":["mcp","--db","trace.db"]}}}
EOF

# Start investigating - /investigate triggers the guided workflow
ollmcp -m minimax-m2.7:cloud -j /tmp/ingero-mcp.json

Tested with MiniMax M2.7 and Qwen 3.5 via Ollama on saved investigation databases. Also works with Claude Desktop, Cursor, and any MCP-compatible client.

curl examples (with --http :8080):

# System diagnostics (-k for self-signed cert)
curl -sk https://localhost:8080/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"get_check","arguments":{}}}' | jq

# Causal chains (TSC-compressed for AI)
curl -sk https://localhost:8080/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"get_causal_chains","arguments":{}}}' | jq

# Verbose output (TSC off)
curl -sk https://localhost:8080/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"get_trace_stats","arguments":{"tsc":false}}}' | jq

ingero dashboard

Start a browser-based GPU monitoring dashboard backed by the SQLite event store. Shows live system metrics, CUDA operation latencies, causal chains, and a capability manifest (grayed-out panels for metrics Ingero doesn't yet collect, with tooltips naming the required external tool). Requires ingero trace to be running (or to have run recently).

ingero dashboard                           # HTTPS on :8080 (self-signed TLS 1.3)
ingero dashboard --addr :9090              # custom port
ingero dashboard --db /path/to/ingero.db   # custom database
ingero dashboard --tls-cert cert.pem --tls-key key.pem  # custom TLS certificate
ingero dashboard --no-tls                  # plain HTTP (for fleet queries on trusted networks)

# Remote access via SSH tunnel:
ssh -L 8080:localhost:8080 user@gpu-vm
# Then open https://localhost:8080 in browser

No sudo needed - the dashboard reads from the SQLite database populated by ingero trace.

Security: TLS 1.3 only. Auto-generates an ephemeral self-signed ECDSA P-256 certificate (valid 24h) if no --tls-cert/--tls-key provided. DNS rebinding protection rejects requests from non-localhost Host headers.

API endpoints:

Endpoint Description
GET /api/v1/overview Event count, chain count, latest system snapshot, GPU info, top causal chain
GET /api/v1/ops?since=5m Per-operation latency stats (percentile or aggregate mode)
GET /api/v1/chains?since=1h Stored causal chains with severity, root cause, timeline
GET /api/v1/snapshots?since=60s System metric time series (CPU, memory, swap, load)
GET /api/v1/capabilities Metric availability manifest (available vs. grayed-out with required tool)
GET /api/v1/graph-metrics CUDA Graph metrics: capture/launch rates, instantiation durations
GET /api/v1/graph-events Recent CUDA Graph events with handles and durations
POST /api/v1/query Execute read-only SQL (used by fleet fan-out queries)
GET /api/v1/time Server wall-clock timestamp (used for clock skew detection)

ingero merge

Merge SQLite databases from multiple Ingero nodes into a single queryable database for offline cross-node analysis. Useful in air-gapped environments or when you prefer offline analysis over fan-out queries.

ingero merge node-a.db node-b.db node-c.db -o cluster.db       # merge 3 node databases
ingero merge old.db --force-node legacy-node -o merged.db       # assign node identity to legacy DBs

# Then use standard tools on the merged database
ingero query -d cluster.db --since 1h
ingero explain -d cluster.db --chains
ingero export --format perfetto -d cluster.db -o trace.json

Node-namespaced event IDs ({node}:{seq}) ensure zero collisions on merge. Stack traces are deduplicated by hash. Sessions are re-keyed. Clock skew between traces is detected and warned (configurable via --clock-skew-threshold, default 100ms).

ingero export

Export event data to visualization formats. Currently supports Perfetto/Chrome Trace Event Format for timeline visualization in ui.perfetto.dev or chrome://tracing.

```
# From a local or merged database
ingero export --format perfetto -d ~/.ingero/ingero.db -o trace.json
ingero export --format perfetto -d cluster.db -o trace.json --since 5m

# Fan-out mode (fetches from multiple nodes via fleet API)
ingero export --format perfetto --nodes node-1:8080,node-2:8080 -o trace.json
```

Opens in Perfetto UI with one process track per node/rank, CUDA events as duration spans, and causal chains as severity-colored instant markers. Multi-node traces show side-by-side timelines for spotting which rank stalled while others waited.
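For a sense of what the exporter produces, here is a minimal sketch of converting events to Chrome Trace Event Format (which Perfetto also loads), with one process track per node. The input field names (`node`, `op`, `start_ns`, `duration_ns`) are assumptions about Ingero's event shape, not its actual schema.

```python
import json

def to_trace_events(events):
    """Emit Chrome Trace 'X' (complete/duration) events, one pid per node.

    The trace format expresses ts/dur in microseconds, hence the /1000.
    """
    out = []
    pids = {}
    for ev in events:
        pid = pids.setdefault(ev["node"], len(pids) + 1)
        out.append({
            "name": ev["op"],
            "ph": "X",                      # complete event with a duration
            "ts": ev["start_ns"] / 1000,    # microseconds
            "dur": ev["duration_ns"] / 1000,
            "pid": pid,                     # one process track per node
            "tid": ev.get("tid", 0),
        })
    return {"traceEvents": out}
```

Writing `json.dumps(to_trace_events(events))` to a file yields something ui.perfetto.dev can open directly.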

ingero demo

```
ingero demo                  # all 6 scenarios (incident first)
ingero demo incident         # single scenario
ingero demo gpu-steal        # also: gpu-contention, contention
ingero demo --no-gpu         # synthetic mode
```

ingero version

```
$ ingero version
ingero v0.9.1 (commit: 01af280, built: 2026-04-06)
```

Stack Tracing

Stack tracing is on by default - every CUDA/Driver API event captures the full userspace call chain. Shows who called cudaMalloc - from the CUDA library up through PyTorch, your Python code, and all the way to main(). GPU-measured overhead is 0.4-0.6% (within noise on RTX 3090 through H100). Disable with --stack=false if needed.

```
sudo ingero trace --json               # JSON with resolved stack traces (stacks on by default)
sudo ingero trace --debug              # debug output shows resolved frames on stderr
sudo ingero demo --gpu --json          # GPU demo with stack traces (needs sudo)
ingero explain                         # post-hoc causal analysis from DB (no sudo)
sudo ingero trace --stack=false        # disable stacks if needed
```

Maximum depth: 64 native frames (eBPF bpf_get_stack). This covers deep call chains from CUDA β†’ cuBLAS/cuDNN β†’ PyTorch C++ β†’ Python interpreter and up to main() / _start.

Python Stack Attribution

For Python workloads (PyTorch, TensorFlow, etc.), Ingero extracts CPython frame information directly from process memory. When a native frame is inside libpython's eval loop, the corresponding Python source frames are injected into the stack:

```
[Python] train.py:8 in train_step()
[Python] train.py:13 in main()
[Python] train.py:1 in <module>()
[Native] cublasLtSSSMatmul+0x1d4 (libcublasLt.so.12)
[Native] cublasSgemm_v2+0xa6 (libcublas.so.12)
[Native] (libtorch_cuda.so)
```

Supported Python versions: 3.10, 3.11, 3.12 (covers Ubuntu 22.04 default, conda default, and most production deployments). Version detection is automatic via /proc/[pid]/maps.
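The automatic detection amounts to scanning the process's memory map for a mapped libpython. A minimal sketch of that idea, assuming the common `libpython3.X.so` naming (Ingero's real detection may handle more layouts):

```python
import re

# Matches mappings like:
# "7f3a... r-xp ... /usr/lib/x86_64-linux-gnu/libpython3.11.so.1.0"
_LIBPY = re.compile(r"libpython3\.(\d+)")

def detect_python_version(maps_text):
    """Return the (major, minor) CPython version mapped into a process,
    given the contents of /proc/[pid]/maps, or None if no libpython is found."""
    m = _LIBPY.search(maps_text)
    return (3, int(m.group(1))) if m else None
```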

Why you want a Python frame walker

Native stack traces alone stop at _PyEval_EvalFrameDefault β€” the C function that runs the Python bytecode interpreter. Every frame above that in "what your code is actually doing" lives in interpreter state (PyThreadState, _PyInterpreterFrame, PyCodeObject), not in the C call stack. Without a walker, you see _PyEval_EvalFrameDefault repeated N times, which tells you nothing about which .py file triggered the slow cuLaunchKernel.

A Python frame walker reads CPython's own data structures and reconstructs the source-level call chain (train.py:train_step, model.py:forward, ...). That's what lets you answer "which Python line launched this slow kernel?" instead of "something inside the interpreter launched it."

Ingero ships two walker implementations for this:

- Userspace walker (default) β€” runs in the Go process after an event arrives. Reads target process memory via /proc/[pid]/mem or process_vm_readv. Simple, flexible, handles the full CPython offset fallback chain (_Py_DebugOffsets β†’ known-offsets DB β†’ DWARF β†’ hardcoded).
- In-kernel eBPF walker (opt-in) β€” walks frames from inside the kernel probe via bpf_probe_read_user helpers. No /proc/[pid]/mem access needed. Required when kernel.yama.ptrace_scope=3 (hardened systems), and useful when you want frame capture to happen synchronously with the CUDA event rather than asynchronously on event arrival.
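Either way, the core operation is the same: follow a linked list of interpreter frames through the target's memory. The toy walker below captures that shape, with a `read_word(addr)` callback standing in for process_vm_readv (userspace) or bpf_probe_read_user (eBPF). The two-word frame layout and offsets here are purely illustrative, not real CPython struct offsets.

```python
def walk_frames(read_word, top_frame_addr, max_frames=64):
    """Toy frame walker: collect each frame's code-object pointer by
    following the previous-frame link until NULL or the depth cap."""
    frames = []
    addr = top_frame_addr
    while addr and len(frames) < max_frames:
        frames.append(read_word(addr))   # word 0: code object pointer (illustrative)
        addr = read_word(addr + 8)       # word 1: previous-frame pointer (illustrative)
    return frames

# Fake target memory: frame at 0x100 links to frame at 0x200, then NULL.
mem = {0x100: 0xAAA, 0x108: 0x200, 0x200: 0xBBB, 0x208: 0x0}
```

The real walkers must additionally locate PyThreadState, translate code-object pointers into file/function/line via PyCodeObject, and survive frames being torn down mid-walk; those details are what the offset fallback chain exists for.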

How to use it

Default (userspace walker): Just pass --stack β€” frames appear automatically for supported Python versions.

```
sudo ingero trace --stack --duration 30s
```

You'll see py_file / py_func / py_line fields in JSON output, or [Python] <file>:<line> in <func>() entries in the table/debug view.
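Those fields make per-line attribution a one-liner away. A hedged sketch of aggregating `ingero trace --json` output by originating Python line, assuming one JSON object per line with `duration_ns` and a `stack` array as shown in this README:

```python
import json
from collections import defaultdict

def latency_by_python_line(json_lines):
    """Sum CUDA-call duration per innermost resolved Python frame."""
    totals = defaultdict(int)
    for line in json_lines:
        ev = json.loads(line)
        for frame in ev.get("stack", []):
            if "py_file" in frame:
                key = f'{frame["py_file"]}:{frame["py_line"]}'
                totals[key] += ev["duration_ns"]
                break  # attribute each event to one Python line only
    return dict(totals)
```

Feeding this the JSON stream would answer "which Python line accumulated the most CUDA time?" directly.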

eBPF walker (opt-in): Pass --py-walker=ebpf alongside --stack.

```
sudo ingero trace --stack --py-walker=ebpf --duration 30s
```

Use the eBPF walker when:

- Your system has kernel.yama.ptrace_scope=3 (the userspace walker can't read process memory there)
- You want guaranteed synchronous frame capture at the exact moment of the CUDA event
- You're running on a read-only/hardened host where /proc/[pid]/mem access is blocked

Stick with the default (--py-walker=auto, which resolves to the userspace walker) when:

- You're on a normal Linux host (ptrace_scope 0, 1, or 2) β€” the userspace walker is simpler and has full offset-fallback coverage including distro-patched CPython builds
- You care about minimum per-event overhead β€” the eBPF walker adds BPF helper-call cost per emitted event

Troubleshooting missing frames: Run ingero check β€” it now reports your kernel.yama.ptrace_scope value and tells you what to do if it's blocking the userspace walker. For CPython 3.12 you'll also benefit automatically from the self-describing _Py_DebugOffsets struct (no debug symbols needed); for 3.10/3.11 on patched distro builds, installing the matching python3.X-dbgsym package gives the userspace walker DWARF offsets to fall back on.
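The ptrace_scope check is easy to reproduce yourself; this sketch mirrors the decision rule described above (the function names are ours, not Ingero's):

```python
from pathlib import Path

def read_ptrace_scope(path="/proc/sys/kernel/yama/ptrace_scope"):
    """Read the Yama ptrace restriction level; absent Yama means no restriction."""
    try:
        return int(Path(path).read_text().strip())
    except OSError:
        return 0

def recommend_walker(ptrace_scope):
    """Scope 3 forbids the process-memory reads the userspace walker needs."""
    return "ebpf" if ptrace_scope >= 3 else "userspace"
```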

JSON Output with --stack

Real output from a PyTorch ResNet-50 training run on A100 SXM4 - a cuBLAS matmul kernel launch captured via Driver API uprobes, with the full call chain from Python through cuBLAS to the GPU:

```
{
  "timestamp": "2026-02-25T12:06:24.753983243Z",
  "pid": 11435,
  "tid": 11435,
  "source": "driver",
  "op": "cuLaunchKernel",
  "duration_ns": 10900,
  "duration": "11us",
  "stack": [
    {"ip": "0x0", "py_file": "train.py", "py_func": "tr
```

Release History

| Version | Changes | Urgency | Date |
| --- | --- | --- | --- |
| v0.10.0 | Documentation: f7998b120ffdc7a7ec9a5339c41a9974ae7e70be docs: drop unnecessary writeups; security fixes already in 01b181c. Other: 0fdf9ef9257458db636807004d80d4467ca7c4a5 release: v0.10.0 | High | 4/21/2026 |
| v0.10.0-rc1 | Features: ecaaad5767b858f2ca883ac85918710e41cc936c feat(health): threshold consumption, degradation, classification; aa889dba874103e61ec926b6289477cdf7a1279f feat(health): agent-side Fleet integration; 4e0dd4b95deed4f1004a4739e2b3705a85c78fc6 feat(health): wire real signal collector, rotate DNS, tune warmup; 1235cf045a7bb4382ccfb5ca0086344578f35585 feat(helm): add fleet-push mode to ingero chart; d5f4ea7dfed6e3b614071993a2183f6c5096c842 feat: add Fleet interface contract | High | 4/20/2026 |
| v0.9.2 | The in-kernel eBPF Python frame walker (`--py-walker=ebpf`) is the headline of this release. After extensive validation on container and bare-metal Linux, through fork storms and lifecycle edge cases, it now delivers the advantages the userspace walker cannot: works in containers (Docker, K8s) without debug symbols or host mounts; works on distro-patched CPython (Ubuntu 24.04, etc.) via runtime offset harvesting; works at `kernel.yama.ptrace_scope=3` with `--pid X` (PID-spe | High | 4/16/2026 |
| v0.9.1 | Note: the multi-node features in this release (fan-out queries, offline merge, Perfetto export) are interim solutions for cross-node GPU investigation; a dedicated cluster-level observability and diagnostics tool with native multi-node support is coming soon. One command. Every node. Full causal chain. Ingero can now investigate distributed GPU workloads across multiple nodes from a single CLI command, MCP tool call, or offline database merge. Diagnose which rank stalled, why, | Medium | 4/3/2026 |
| v0.9.0 | Ingero can now trace the full CUDA Graph lifecycle β€” capture, instantiate, launch β€” via eBPF uprobes on libcudart.so. Zero application modification, zero CUPTI dependency, production-safe overhead. CUDA Graph Observability: eBPF probes for cudaStreamBeginCapture, cudaStreamEndCapture, cudaGraphInstantiate, and cudaGraphLaunch β€” covers the stream capture path used by PyTorch torch.compile, vLLM, and TensorRT-LLM. Causal correlation connects graph events to system state: OOM during gr | Medium | 4/1/2026 |
| v0.8.2 | Docker containerization, a real-time GPU dashboard preview, human-friendly CLI duration parsing, and improved platform support for WSL and AWS Deep Learning AMIs. This release also includes Ingero's first community contributions. Highlights: Docker support β€” multi-arch (amd64/arm64) Alpine-based container image (~10 MB), auto-published to GHCR on tag push via GoReleaser, with GPU passthrough detection and healthcheck; real-time GPU dashbo | Low | 3/18/2026 |
| v0.8.1 | Seven fixes from RTX 4090 GPT-2 stress test analysis (5-phase, 237K+ events/min). Highlights: DB compaction at shutdown β€” WAL checkpoint + VACUUM when >20% of pages are free (integration test DB shrank from 57 MB to 2.7 MB, a 95% reduction); throughput-drop causal chains β€” new detection for when CUDA op rate drops >40% from peak but per-call latency stays flat, catching GPU starvation that tail-ratio chains miss; aggregate flush starvation fix β€” high-throughput p | Low | 3/15/2026 |

