AI agents that triage infrastructure alerts, investigate root causes, and propose fixes — while a solo operator sleeps.
Looking for the complete technical reference? See README.extensive.md — full architecture, all 21 agentic design patterns, component inventory, evaluation system, and security operations.
One person. 310 infrastructure objects across 6 sites. 3 firewalls, 12 Kubernetes nodes, self-hosted everything. When an alert fires at 3am, there's no team to call. There never is.
Three agentic subsystems that handle the detective work:
- ChatOps — Infrastructure alerts (LibreNMS, Prometheus) triaged automatically. Remediation plans proposed via interactive polls. Human clicks to approve.
- ChatSecOps — CrowdSec intrusion alerts and vulnerability scanner findings mapped to MITRE ATT&CK. 54 scenarios auto-classified.
- ChatDevOps — CI/CD failures diagnosed, code changes reviewed with fresh-eyes sub-agents, multi-repo refactoring coordinated.
All three share the same engine: n8n orchestration, Matrix as the human interface, and a 3-tier agent architecture:
Alert → n8n → OpenClaw (GPT-5.1, fast triage) → Claude Code (Opus, deep analysis) → Human (approval)
The human stays in the loop for every infrastructure change. The system never acts without a thumbs-up or poll vote.
A typical alert session, end to end:
1. LibreNMS detects "Device down" on a host
2. n8n deduplicates, detects flapping, checks for correlated burst
3. OpenClaw (Tier 1) investigates in 7-21 seconds:
- Queries NetBox CMDB for device identity
- Searches incident knowledge base for similar past alerts (local Ollama embeddings)
- Extracts procedural knowledge from CLAUDE.md files + operational memory rules
- SSHes to the host, checks services and logs
- Posts findings + confidence score to Matrix and YouTrack
4. If confidence < 0.7 or severity is critical → escalates to Claude Code
5. Claude Code (Tier 2) receives targeted CLAUDE.md file guidance + operational
memories, delegates research to sub-agents, proposes remediation via [POLL]
6. Operator clicks a poll option in Matrix
7. Claude executes the selected plan
8. Session archived → incident knowledge updated → lessons extracted
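The confidence gates in steps 3-5 can be sketched in a few lines of Python. Only the 0.5/0.7 thresholds and the critical-severity escalation come from the workflow above; the function and field names are illustrative, not the project's actual code:

```python
# Sketch of the confidence-gating step: Tier-1 results below 0.7 confidence,
# or any critical-severity alert, escalate to the Tier-2 agent; results below
# 0.5 halt the session entirely. Names here are hypothetical.
from dataclasses import dataclass

HALT_BELOW = 0.5       # stop the session, page the human
ESCALATE_BELOW = 0.7   # hand off to Tier 2 for deep analysis

@dataclass
class TriageResult:
    alert_id: str
    severity: str        # e.g. "warning" | "critical"
    confidence: float    # 0.0-1.0, self-reported by Tier 1

def route(result: TriageResult) -> str:
    """Decide the next hop for a Tier-1 triage result."""
    if result.confidence < HALT_BELOW:
        return "halt"
    if result.confidence < ESCALATE_BELOW or result.severity == "critical":
        return "tier2"
    return "report"  # post findings to Matrix/YouTrack, no escalation

print(route(TriageResult("IFRNLLEI01PRD-82", "critical", 0.82)))  # tier2
```

Even a high-confidence result escalates when severity is critical, which matches the "confidence < 0.7 OR severity is critical" rule in step 4.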
Real example: IFRNLLEI01PRD-82 — LibreNMS alert → OpenClaw triage (30s) → Claude investigation (8min) → [POLL] with 3 options → operator clicks Plan A → fix applied → recovery confirmed.
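The [POLL] step boils down to a plain Matrix client call. A minimal sketch posts the remediation options as an `m.room.message` via Matrix's standard client-server API (a real deployment would more likely use Matrix's native poll events); the homeserver, room ID, and token below are placeholders:

```python
# Hypothetical sketch of posting remediation options to a Matrix room.
# Endpoint shape follows the Matrix client-server API; all identifiers
# are placeholders, not values from the project.
import json
import urllib.request

HOMESERVER = "https://matrix.example.org"  # placeholder
ROOM_ID = "!incident-room:example.org"     # placeholder
TOKEN = "REDACTED"                         # placeholder access token

def build_plan_message(plans: list[str]) -> dict:
    """m.room.message content listing remediation options for the operator."""
    body = "Proposed remediation plans:\n" + "\n".join(
        f"{chr(65 + i)}) {p}" for i, p in enumerate(plans)  # A), B), C) ...
    )
    return {"msgtype": "m.text", "body": body}

def post_message(content: dict, txn_id: str = "1") -> None:
    """PUT the event to the room; Matrix returns the event_id on success."""
    url = f"{HOMESERVER}/_matrix/client/v3/rooms/{ROOM_ID}/send/m.room.message/{txn_id}"
    req = urllib.request.Request(
        url, data=json.dumps(content).encode(), method="PUT",
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# post_message(build_plan_message(["Restart service", "Failover", "Investigate only"]))
```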
| Metric | Value |
|---|---|
| Agentic design patterns implemented | 21/21 — all at A+ (tri-source audit: 11/11 dimensions A+) |
| n8n workflows | 17 (~400 nodes, Runner: 47 nodes incl. Evaluator-Optimizer) |
| MCP tool integrations | 10 servers, 153 tools, ACI-audited with 7 task-type profiles |
| Specialized sub-agents | 10 (Haiku for research, Opus for deep analysis) |
| CLAUDE.md knowledge files | 51, auto-routed to triage by hostname |
| Operational memory files | 200+, synced across both agent hosts |
| Prompt surfaces evaluated | 19, graded daily on 6 dimensions |
| Eval test scenarios | 98 across 3 eval sets (regression/discovery/holdout) + 18 node-level tests |
| RAG search | Hybrid (semantic + keyword via RRF) with query rewriting |
| Knowledge sources | 3 audited: Gulli book + Anthropic cert + 6 industry references |
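The hybrid RAG search row above combines a semantic ranking and a keyword ranking with Reciprocal Rank Fusion. A minimal sketch of standard RRF (the function name, document IDs, and the conventional k=60 constant are illustrative, not the project's code):

```python
# Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) per
# document; documents ranked well by BOTH retrievers float to the top.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["incident-42", "incident-17", "incident-99"]  # embedding search
keyword  = ["incident-42", "incident-03", "incident-17"]  # keyword search
print(rrf_fuse([semantic, keyword])[:2])  # ['incident-42', 'incident-17']
```

incident-42 wins outright (rank 1 in both lists), and incident-17 beats incident-03 because it appears in both rankings rather than just one.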
| Component | Role |
|---|---|
| n8n | Workflow orchestration — 17 workflows handle alert intake, session management, knowledge population |
| OpenClaw (GPT-5.1) | Tier 1 — fast triage with 14 native skills (infra, K8s, security, correlated burst analysis) |
| Claude Code (Opus 4.6) | Tier 2 — deep analysis with 10 sub-agents, ReAct reasoning, interactive polls |
| Matrix (Synapse) | Human-in-the-loop — polls, reactions, replies. The system waits here. |
| YouTrack | Issue tracking — webhook triggers, state management, knowledge sink |
| NetBox | CMDB — 310 devices, 421 IPs, 39 VLANs |
| Prometheus + Grafana | Metrics — 5 dashboards, 63+ panels, 7 metric exporters |
| Ollama (RTX 3090 Ti) | Local embeddings — nomic-embed-text for semantic search, reachable from both agent hosts |
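The local-embedding setup in the last row can be sketched as a small client against Ollama's `/api/embeddings` endpoint. The host/port are the assumed Ollama defaults, and the helper names are hypothetical; only the `nomic-embed-text` model name comes from the table:

```python
# Hedged sketch: fetch an embedding from a local Ollama server, then rank
# stored incidents by cosine similarity. Network calls are left commented
# so the snippet stays runnable without a server.
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # assumed default port

def build_embed_request(text: str, model: str = "nomic-embed-text") -> dict:
    return {"model": model, "prompt": text}

def embed(text: str) -> list[float]:
    """POST the prompt to Ollama and return the embedding vector."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_embed_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# vec = embed("Device down: gateway unreachable on site 3")
# similar past incidents = top-N stored vectors by cosine(vec, stored)
```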
The system can investigate freely but never executes infrastructure changes without human approval:
- Claude Code hooks block destructive commands before they run (30+ patterns)
- safe-exec.sh enforces a code-level blocklist that prompt injection cannot bypass
- exec-approvals.json restricts OpenClaw to 36 specific skill patterns (no wildcards)
- Confidence gating stops sessions below 0.5 and escalates below 0.7
- Budget ceiling triggers plan-only mode at $25/day
- Credential scanning redacts tokens before posting to Matrix
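The safe-exec.sh idea above (a blocklist that lives in code, outside the prompt, so injected instructions can't argue it away) can be sketched in a few lines. The patterns below are generic destructive-command examples, not the project's actual 30+ rules:

```python
# Minimal sketch of a code-level command guard in the spirit of safe-exec.sh.
# Because the check runs outside the model, prompt injection cannot bypass it.
import re

BLOCKED_PATTERNS = [
    r"\brm\s+-rf\s+/",           # recursive delete from root
    r"\bmkfs(\.\w+)?\b",         # filesystem formatting
    r"\bdd\s+.*of=/dev/",        # raw writes to block devices
    r"\bshutdown\b|\breboot\b",  # host power actions
]

def is_blocked(command: str) -> bool:
    """Return True if the command matches any destructive pattern."""
    return any(re.search(p, command) for p in BLOCKED_PATTERNS)

print(is_blocked("rm -rf / --no-preserve-root"))  # True
print(is_blocked("systemctl status nginx"))       # False
```

The real guard would sit between the agent's tool call and the shell, rejecting (and logging) anything that matches before it ever executes.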
Every session is scored post-completion by an LLM-as-a-Judge (Haiku for routine, Opus for flagged sessions) on 5 quality dimensions.
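The judge routing can be sketched as a tiny dispatcher: routine sessions go to the small model, flagged ones to the strong model, and each gets the same rubric. The five dimension names and the model identifiers below are illustrative assumptions; only the Haiku/Opus split and the "5 quality dimensions" count come from the text:

```python
# Hedged sketch of LLM-as-a-Judge routing. Dimension names and model IDs
# are hypothetical placeholders, not the project's actual configuration.
DIMENSIONS = ["accuracy", "completeness", "safety", "efficiency", "clarity"]

def pick_judge(flagged: bool) -> str:
    """Route flagged sessions to the stronger (costlier) judge model."""
    return "claude-opus" if flagged else "claude-haiku"

def build_rubric_prompt(transcript: str) -> str:
    """Assemble a rubric prompt asking for a 0-1 score per dimension."""
    rubric = "\n".join(f"- {d}: score 0-1, one-sentence rationale" for d in DIMENSIONS)
    return (f"Score this session on each dimension:\n{rubric}\n\n"
            f"Transcript:\n{transcript}")

print(pick_judge(False))  # claude-haiku
```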
| Document | What it covers |
|---|---|
| Architecture Details | Workflows, MCP servers, sub-agents, skills, hooks, inter-agent protocol |
| Agentic Patterns Audit | 21/21 pattern scorecard with implementation evidence |
| Book Gap Analysis | Remaining improvements from Gulli's book |
| Industry Agentic References | 6 industry sources → cross-cutting advice, anti-patterns, 17 recommendations |
| Tri-Source Audit | Platform scored against all 3 knowledge sources — 11/11 dimensions A+ (100%) |
| Evaluation Process | 3-set eval model, flywheel, CI gate, judge calibration |
| ACI Tool Audit | 10 MCP tools audited against 8-point ACI checklist |
| Eval Report | E2E test results, before/after scoring |
| Installation Guide | Prerequisites, setup steps, cron configuration |
| A2A Protocol | Inter-agent communication specification |
| Known Failure Rules | 27 rules from 26 bugs |
git clone https://github.com/papadopouloskyriakos/agentic-chatops.git
cd agentic-chatops
cp .env.example .env  # Add your credentials

See the Installation Guide for full setup.
- Agentic Design Patterns by Antonio Gulli (Springer, 2025) — 21 patterns, all implemented
- Claude Certified Architect – Foundations Exam Guide (Anthropic) — sub-agent design, multi-tier architecture
- Industry Agentic References — 6 industry sources (Anthropic, OpenAI, LangChain, Microsoft) → tool design, evals, memory, RAG, guardrails
- Anthropic Official Documentation — hooks, sub-agents, skills, MCP security
- Anthropic Academy — sub-agent design patterns
- n8n — workflow automation
- Model Context Protocol — standardized LLM-tool integration
Sanitized mirror of a private GitLab repository. Internal hostnames, IPs, and credentials replaced with placeholders. Provided as-is for educational and reference purposes.
Built by a solo infrastructure operator who got tired of waking up at 3am for alerts that an AI could triage.