
arthur-engine

Make AI work for Everyone - Monitoring and governing for your AI/ML

agentic benchmarking evaluation genai guardrails llm ml monitoring python

Why this rank: Strong adoption, recent release, healthy release cadence

README

Make AI work for Everyone.

The Arthur Engine

The Arthur Engine provides a complete service for monitoring and governing your AI/ML workloads using popular open-source technologies and frameworks. It is designed for:

- Evaluating and benchmarking machine learning models
  - Support for a wide range of evaluation metrics (e.g., drift, accuracy, precision, recall, F1, and AUC)
  - Tools for comparing models, exploring feature importance, and identifying areas for optimization
  - For LLM/GenAI applications, measuring and monitoring response relevance, hallucination rates, token counts, latency, and more
- Enforcing guardrails in your LLM applications and generative AI workflows
  - Configurable metrics for real-time detection of PII or sensitive data leakage, hallucination, prompt injection attempts, toxic language, and other quality metrics
- Extensibility to fit into your application's architecture
  - Support for plug-and-play metrics and an extensible API so you can bring your own custom models or popular open-source models (including those from Hugging Face)

Quickstart - See Examples

1. Clone the repository and cd into deployment/docker-compose/genai-engine
2. Create a .env file from the .env.template file and modify it (more instructions can be found in the README in that directory)
3. Run docker compose up
4. Wait for the genai-engine container to initialize, then navigate to localhost:3030/docs to see the API docs
5. Start building!
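
The wait in the Quickstart can be automated. The following is a minimal sketch, assuming the API docs page answers with HTTP 200 once the genai-engine container has finished initializing. The function names, the retry count, and the delay are illustrative choices, not part of the Arthur Engine.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Port and path taken from the Quickstart; adjust if your .env changes them.
DOCS_URL = "http://localhost:3030/docs"


def http_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503 while starting).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_for_docs(
    url: str = DOCS_URL,
    attempts: int = 30,
    delay_s: float = 2.0,
    fetch: Callable[[str], int] = http_status,
) -> bool:
    """Poll url until it returns HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Calling `wait_for_docs()` with no arguments polls the Quickstart URL for up to about a minute. The `fetch` parameter exists only so the polling logic can be exercised without a running container.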

Arthur Platform Free Version

The genai-engine standalone deployment in the Quickstart provides powerful LLM evaluation and guardrailing features. To unlock the full capabilities of the Arthur Platform, sign up and get started for free.

Arthur Platform Enterprise Version

The enterprise version of the Arthur Platform provides better performance, additional features, and capabilities, including custom enterprise-ready guardrails + metrics, which can maximize the potential of AI for your organization.

Key features:

State-of-the-art proprietary evaluation models trained by Arthur's world-class machine learning engineering team
Airgapped deployment of the Arthur Engine (no dependency on the Hugging Face Hub)
Optional on-premises deployment of the entire Arthur Platform
Support from the world-class engineering teams at Arthur

To learn more about the enterprise version of the Arthur Platform, reach out!

Performance comparison between the free and enterprise versions of Arthur Engine:

The enterprise version of Arthur Engine uses state-of-the-art, high-performing, low-latency proprietary models for some of the LLM evaluations. See the table below for a detailed comparison of open-source and enterprise performance.

Evaluation Type	Dataset	Free Version Performance (F1)	Enterprise Performance (F1)	Free Version Average Latency per Inference (s)	Enterprise Average Latency per Inference (s)
Prompt Injection	deepset	0.52 (0.44, 0.60)	0.89 (0.85, 0.93)	0.966	0.03
Prompt Injection	Arthur’s Custom Benchmark	0.79 (0.62, 0.93)	0.85 (0.71, 0.96)	0.16	0.005
Toxicity	Arthur’s Custom Benchmark	0.633 (0.45, 0.79)	0.89 (0.85, 0.93)	3.096	0.0358
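
To make the table easier to read at a glance, the short sketch below derives the F1 gain and the latency speedup for each row. It uses only the point values from the table (the parenthesized ranges are left out), and the variable and function names are illustrative.

```python
# (evaluation, dataset) -> (free F1, enterprise F1, free latency s, enterprise latency s)
ROWS = {
    ("Prompt Injection", "deepset"): (0.52, 0.89, 0.966, 0.03),
    ("Prompt Injection", "Arthur's Custom Benchmark"): (0.79, 0.85, 0.16, 0.005),
    ("Toxicity", "Arthur's Custom Benchmark"): (0.633, 0.89, 3.096, 0.0358),
}


def compare(row: tuple[float, float, float, float]) -> dict[str, float]:
    """Return the absolute F1 gain and the latency speedup factor for one row."""
    free_f1, ent_f1, free_latency, ent_latency = row
    return {
        "f1_gain": round(ent_f1 - free_f1, 3),
        "speedup": round(free_latency / ent_latency, 1),
    }


SUMMARY = {key: compare(row) for key, row in ROWS.items()}

for (evaluation, dataset), stats in SUMMARY.items():
    print(f"{evaluation} / {dataset}: F1 +{stats['f1_gain']}, {stats['speedup']}x faster")
```

On these figures the enterprise models are roughly 32x faster for both prompt injection benchmarks and roughly 86x faster for toxicity.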

Overview

The Arthur Engine is built with a focus on transparency and explainability. It provides users with comprehensive performance metrics, error analysis, and interpretable results to improve model understanding and outcomes. With support for plug-and-play metrics and extensible APIs, the Arthur Engine simplifies the process of understanding and optimizing generative AI outputs. It can also keep data-security and compliance risks from creating negative or harmful experiences for your users in production or from damaging your organization's reputation.

Key Features:

Evaluate models on structured/tabular datasets with customizable metrics
Evaluate LLMs and generative AI workflows with customizable metrics
Support building real-time guardrails for LLM applications and agentic workflows
Trace and monitor model performance over time
Visualize feature importance and error breakdowns
Compare multiple models side-by-side
Extensible APIs for custom metric development or for using custom models
Integration with popular libraries like LangChain or LlamaIndex (coming soon!)

LLM Evaluations:

Eval	Technique	Source	Docs
Hallucination	Claim-based LLMJudge technique	Source	Docs
Prompt Injection	Open Source: Using deberta-v3-base-prompt-injection-v2	Source	Docs
Toxicity	Open Source: Using roberta_toxicity_classifier	Source	Docs
Sensitive Data	Few-shot optimized LLM Judge technique	Source	Docs
Personally Identifiable Information	Using Presidio, based on named-entity recognition	Source	Docs
CustomRules	Extend the service to support whatever monitoring or guardrails are applicable for your use-case	Build your own!	Docs

NB: The free version of Arthur uses open-source models for Prompt Injection and Toxicity evaluation by default. If you already have custom solutions for these evaluations and would like to use them, the models used for Prompt Injection and Toxicity are fully customizable and can be substituted here (PI Code Pointer, Toxicity Code Pointer). If you are interested in higher-performing and/or lower-latency evaluations out of the box, please enquire about the enterprise version of Arthur Engine.
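
The substitution described above follows a common plug-in pattern: the rule depends on a small scoring interface rather than on one specific model. The sketch below illustrates that pattern only. `TextClassifier`, `KeywordToxicityClassifier`, and `ThresholdRule` are hypothetical names and are not the Arthur Engine's actual classes; the real extension points are in the code pointers referenced above.

```python
from dataclasses import dataclass
from typing import Protocol


class TextClassifier(Protocol):
    """Hypothetical interface: anything that maps text to a score in [0, 1]."""

    def score(self, text: str) -> float: ...


@dataclass
class KeywordToxicityClassifier:
    """Toy stand-in for a real model such as a fine-tuned transformer."""

    blocklist: frozenset[str]

    def score(self, text: str) -> float:
        # Normalize words by stripping common punctuation and lowercasing.
        words = {w.strip(".,!?").lower() for w in text.split()}
        return 1.0 if words & self.blocklist else 0.0


@dataclass
class ThresholdRule:
    """A rule passes when the classifier score stays below the threshold."""

    name: str
    classifier: TextClassifier
    threshold: float = 0.5

    def check(self, text: str) -> dict:
        score = self.classifier.score(text)
        return {"rule": self.name, "score": score, "passed": score < self.threshold}


# Swapping the model means passing a different classifier; the rule is unchanged.
toxicity_rule = ThresholdRule("toxicity", KeywordToxicityClassifier(frozenset({"idiot"})))
print(toxicity_rule.check("Have a nice day"))
```

Keeping the rule logic separate from the model is what lets a custom or enterprise model replace the default without touching the surrounding guardrail code.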

Broad Integration Support Through the OpenInference Specification

Arthur Engine fully supports the OpenInference specification, which allows you to connect the Engine to a wide range of AI frameworks, libraries, and agent stacks without custom instrumentation.

OpenInference provides a shared trace and data schema for AI systems. Since Arthur Engine follows this standard, you can immediately use any integration already built for the OpenInference ecosystem, including the large collection maintained by Arize Phoenix.

This includes support for many popular frameworks such as:

LangChain
LangGraph
LlamaIndex
Vercel AI SDK
FastAPI and Flask apps instrumented with OpenInference
OpenAI, Anthropic, Google, and other model providers aligned with the spec
Agent frameworks, orchestration tools, and custom pipelines supported by Phoenix integrations
And many others

You can view the full and continuously updated list of supported integrations here: https://github.com/Arize-ai/phoenix?tab=readme-ov-file#tracing-integrations

By adopting OpenInference, Arthur Engine provides a flexible and future-proof way to bring traces, spans, metrics, inputs, outputs, and evaluation signals into the Arthur platform. This makes it easy to collect data from diverse GenAI apps, agents, and services through a single unified integration path.
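
To give a sense of what the shared schema looks like, the sketch below builds the attribute set for a single LLM span using attribute names from the OpenInference semantic conventions. In practice an OpenInference instrumentor sets these attributes on OpenTelemetry spans automatically; the helper function and the example values here are purely illustrative.

```python
def llm_span_attributes(
    prompt: str,
    completion: str,
    model_name: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> dict:
    """Build OpenInference-style attributes for one LLM call (illustrative helper)."""
    return {
        "openinference.span.kind": "LLM",
        "input.value": prompt,
        "output.value": completion,
        "llm.model_name": model_name,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }


# Example values are made up for demonstration.
attributes = llm_span_attributes(
    prompt="What is the capital of France?",
    completion="Paris.",
    model_name="example-model",
    prompt_tokens=8,
    completion_tokens=2,
)
for key, value in attributes.items():
    print(f"{key} = {value!r}")
```

Because every instrumented framework emits the same keys, a backend that understands them can evaluate inputs, outputs, and token usage without framework-specific parsing.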

Contributing

Join the Arthur community on Discord to get help and share your feedback.
To request a bug fix or a new feature, please file a GitHub issue.
For making code contributions, please review the contributing guidelines.
Thank you!

Release History

Version	Changes	Urgency	Date
2.1.725	# 🚀 Arthur Engine Release July 23, 2026 This release focuses on streamlined dataset workflows, clearer visibility into built-in evaluators, and significant reliability improvements for prompt experiments and ml-engine scaling. --- ## Datasets & Traces ### Bulk Trace Management * Added the ability to bulk add traces to a dataset directly from the trace table by selecting multiple traces and clicking "Add to dataset" in the selection toolbar, backed by a new `/api/v2/datasets/{datas	High	7/23/2026
2.1.699	# 🚀 Arthur Engine Release July 14, 2026 This release strengthens supply chain transparency with registry-embedded SBOMs, expands GenAI Engine deployment flexibility on EKS Auto Mode, and delivers several important security patches and stability fixes. --- ## GenAI Engine & Deployment ### Model Loading & PVC Support * Refactored model PVC support into a clean online/offline toggle via `modelPVC.enabled`, resolving read-only mount issues and letting users switch between online model	High	7/14/2026
2.1.688	# 🚀 Arthur Engine Release July 6, 2026 This release prioritizes security with a critical remote code execution patch, alongside PII processing updates, front-end dependency improvements, and expanded model deployment documentation. --- ## Security * Upgraded HuggingFace transformers to v5.3.0 to address CVE-2026-4372, a critical remote code execution vulnerability (CVSS 7.8) that allowed attackers to craft malicious `config.json` files to execute arbitrary code when loading models,	High	7/6/2026
2.1.683	# 🚀 Arthur Engine Release June 28, 2026 This release strengthens session observability with new filtering capabilities while delivering a comprehensive security hardening effort across container images, dependencies, and CI vulnerability scanning. --- ## Sessions & Observability ### Session Filtering * Added Trace ID and Session ID filtering to the Sessions table, bringing it to feature parity with the traces tables; backend now supports optional `trace_ids` and `session_ids` quer	High	6/28/2026
2.1.644	# 🚀 Arthur Engine Release June 16, 2026 This release introduces a comprehensive guardrails management UI, first-class ML evaluator support, GCS image integration for datasets, Kubernetes audit logging, and a major wave of security patches addressing over a dozen high-severity CVEs across the dependency stack. --- ## Guardrails ### Setup, Management & Observability * Added a comprehensive guardrails UI for creating rules, viewing rule cards, listing all rules, and testi	High	6/16/2026
2.1.610	# 🚀 Arthur Engine Release June 5, 2026 This release delivers a significantly refined onboarding experience with auto-scroll navigation, cleaner tour copy, and robust UI state management, alongside shareable demo completion certificates and improved PII classification consistency. --- ## Onboarding & Guided Tours ### Auto-Scroll & Navigation * Auto-scroll for highlighted elements now ensures that action elements automatically scroll into view during task tours, accounting for the s	High	6/5/2026
2.1.601	# 🚀 Arthur Engine Release June 3, 2026 This release delivers significant multi-tenant security hardening, a thoroughly refined onboarding tour experience, and improved agent trace visibility in the prompts playground. --- ## Multi-Tenancy & Access Control ### Security Fixes * Closed five critical multi-tenant security and correctness gaps including reCAPTCHA fail-open rejection, notebook ownership validation before experiment linking, org-scoped session trace pagination at the SQL	High	6/3/2026
2.1.579	# 🚀 Arthur Engine Release May 28, 2026 Arthur Engine 2.1.579 delivers Azure ecosystem integrations, a dedicated prompt injection validation endpoint, improved PII detection accuracy, onboarding workflows, and multi-tenant UI support — alongside quality-of-life improvements across the platform. --- ## Guardrails & Validation ### Prompt Injection * Added a new validate endpoint that enables easy, standalone prompt injection checks against incoming prompts (#1633) ### PII Detection	High	5/28/2026
0.0.11-lts	# 🚀 Arthur Engine Release May 20, 2026 This release lays the groundwork for full multi-tenancy, introduces a guided onboarding experience for new users, expands model provider and observability integrations, and strengthens compliance automation — making Arthur Engine ready for larger, organization-aware deployments. --- ## Multi-Tenancy and Access Control ### Organization-Scoped Data and API Keys * Organizations table and tenant isolation are now supported at the database level —	High	5/20/2026
2.1.563	# 🚀 Arthur Engine Release May 14, 2026 This release brings powerful new capabilities for onboarding, compliance automation, and observability — including an interactive onboarding agent, Azure OpenAI support, transform version history, and expanded SDK instrumentors for popular AI frameworks. --- ## Onboarding and Getting Started ### Interactive Onboarding Agent * New interactive onboarding CLI tool automates setup of observability, model configuration, Python instrumentation, and	High	5/14/2026
0.0.8-lts	# 🚀 Arthur Engine Release May 6, 2026 This release introduces the first Long-Term Support (LTS) channel for Arthur Engine, giving teams a stable, versioned deployment path alongside improvements to container security and airgapped deployment compatibility. --- ## Long-Term Support (LTS) Release Channel ### LTS Versioning and Distribution * Arthur Engine is now available through a dedicated Long-Term Support (LTS) release channel, providing a stable, predictable version track for p	High	5/6/2026
2.1.548	# 🚀 Arthur Engine Release May 5, 2026 This release strengthens deployment flexibility and security, enabling seamless operation in airgapped environments and improving container security across all supported platforms. --- ## Deployment & Infrastructure Enhancements ### Airgapped Deployment Support * Tiktoken encodings are now cached directly on the container image, eliminating the need for external network calls during container initialization. Users deploying in airgapped or net	High	5/5/2026
2.1.548	# 🚀 Arthur Engine Release May 5, 2026 This release strengthens deployment flexibility and security, enabling seamless operation in airgapped environments and improving container security across all supported platforms. --- ## Deployment & Infrastructure Enhancements ### Airgapped Deployment Support * Tiktoken encodings are now cached directly on the container image, eliminating the need for external network calls during container initialization. Users deploying in airgapped or net	Medium	5/5/2026
0.0.0-lts-patch-2	# 🚀 Arthur Engine Release May 1, 2026 This is a maintenance patch for the long-term support (LTS) branch focused on internal infrastructure improvements. There are no user-facing changes in this release. --- ## Deployment & Infrastructure Enhancements ### LTS Build Pipeline * Improved Docker image publishing for LTS releases by adopting a more efficient image retagging strategy, ensuring faster and more reliable delivery of patched LTS containers. This update strengthens the reli	High	5/1/2026
2.1.529	# 🚀 Arthur Engine Release April 17, 2026 This release strengthens compliance observability, improves trace exploration workflows, and resolves several UI and API issues that impacted pagination, task browsing, and HTTP spec compliance. --- ## Compliance and Alerting ### Accurate Violation Tracking Per Alert Rule * The policy_alert_rule_check_count compliance metric now reports the true number of violations per alert rule instead of always reporting 1.0, giving a more accurate pict	High	4/17/2026
2.1.516	# 🚀 Arthur Engine Release April 14, 2026 This release brings a redesigned Evaluate experience with unified evaluator management, bulk evaluation testing, automated compliance scheduling, and trace retention policies — giving teams more control over evaluation workflows, compliance monitoring, and data lifecycle management. --- ## Evaluation and Continuous Evals ### Unified Evaluators and Continuous Evals UX * The Evaluate section now features a **unified two-tab layout (Ev	High	4/14/2026
2.1.496	# 🚀 Arthur Engine Release April 2, 2026 This release introduces significant user experience improvements and enhanced tracing capabilities, while strengthening security and system reliability across the platform. --- ## User Experience Enhancements ### Interactive AI Assistant * Added Engine Chatbot with intelligent query capabilities for searching API documentation and managing resources * Integrated automatic model provider detection supporting Anthropic Claude, OpenAI GPT, and	High	4/2/2026
2.1.477	# 🚀 Arthur Engine Release March 23, 2026 This release delivers significant enhancements to experiment creation workflows, trace analysis capabilities, and user personalization while introducing the comprehensive Arthur Observability SDK v1.0 for Python developers. --- ## Arthur Observability SDK ### Python SDK Launch * Released Arthur Observability SDK v1.0, a comprehensive Python package for LLM application observability * Added automatic instrumentation for **33 AI	Medium	3/23/2026
2.1.456	# 🚀 Arthur Engine Release March 12, 2026 This release delivers a comprehensive UI modernization, enhanced evaluation workflows, and improved agent task management alongside critical security updates and performance optimizations. --- ## User Experience & Interface Enhancements ### Navigation Consolidation * Unified all major product areas into streamlined tabbed interfaces, replacing scattered navigation with intuitive single-entry points * Consolidated RAG functionality into unif	Low	3/11/2026
2.1.386	# 🚀 Arthur Engine Release February 18, 2026 This release strengthens evaluation workflows, task visibility, dataset intelligence, and enterprise deployment reliability across environments. --- ## Evaluation & Experiment Enhancements ### Improved Evaluation Configuration * Added a dedicated **Evals input field** for clearer configuration * Introduced a new filtering mechanism for Conti	Low	2/20/2026
2.1.355	# 🚀 Arthur Engine Release January 26 – February 5, 2026 This release significantly expands experimentation, trace visibility, model provider support, and deployment flexibility across the Agent Development Lifecycle. --- ## Agent Experiments & RAG Evaluation ### Agent Experiments * Introduced **Agent Experiments** with UI enhancements * Added configurable Session ID support for repro	Low	2/17/2026
2.1.286	Enhancements: - Users can now configure where GenAI models are sourced from, enabling models to be pulled from an approved, customer-managed repository instead of the public Hugging Face Hub. - Metrics can now be segmented by user ID and conversation ID for more granular analysis. - Enhanced ODBC Connector Support: Improved handling of database views, more reliable primary key detection, and configurable connection and login timeouts. - Improved GenAI model bootstrapping reliability.	Low	1/14/2026
2.1.237	New Features: - Test & Preview Custom Metrics Before Saving: Users can now validate their custom metrics directly within the creation and editing workflow. Users can run the metric against available datasets to preview results and confirm the logic behaves as expected before saving. Bug fixes: - Custom metrics: - Sketch metrics can now be created and calculated without specifying any dimension columns. - Frontend No Longer Overwrites User-Defined Metadata for Reported Me	Low	12/5/2025
2.1.209	Bug Fix/Enhancements: - Fixed an issue where some metrics were missing from the selection list for custom datasets. - Increase ML engine aggregation timeout to support segmentation of larger & more complex datasets.	Low	11/21/2025
2.1.135	Enhancements - Made enhancements to PII detection model to improve date/time identification. - Docker configuration has been updated to use Postgres version 15, ensuring compatibility & preventing initialization errors during new engine setup.	Low	11/6/2025
2.1.94	Enhancements: - Updated telemetry ORM models, update migrations to enforce non-null timestamps. - Improved pagination handling for MSSQL. - Added `status_code` and `session_id` to spans.	Low	10/15/2025
2.1.93	New Features - Custom Metrics: You can now define and manage custom metrics using SQL. Custom metrics can be reused across models and projects, and integrate seamlessly with dashboards, alerts, and queries in the Arthur platform. Versioning ensures you can update metric logic while preserving historical data accuracy. [[Learn more](https://docs.arthur.ai/docs/custom-metrics)] Enhancements - Agent Trace Viewer: Improved filters — users can now filter by metric evaluation	Low	10/7/2025
2.1.79	Enhancements - Span Query Improvements: - New GET endpoint `/v1/spans/query`: allows filtering spans by type. - Added support for span name column: improves query flexibility and performance. - Optimized span queries: added indexes to frequently queried columns. - Improved ingestion stability: fixed batch ingestion when root spans are present. - Improved developer experience by unifying our API schema and client libraries across the GenAI & Ml Engines as well as the Arthur pla	Low	9/12/2025
2.1.71	New Features: - Agentic monitoring is now supported in the GenAI Engine: Building on the recently added /traces/ API, this release introduces support for monitoring agentic behavior: - Tasks now include an is_agentic flag to enable targeted analysis and evaluation. - Metrics and traces APIs have been upgraded to support structured outputs, trace reconstruction, and intelligent defaults. - The engine selectively computes metrics for agentic tasks, improving the precision	Low	8/28/2025
2.1.46	New Features: - Added image support for metrics + visualizing inferences in the Arthur Platform. - Users can now optionally configure attributes to segment over when defining metrics. Enhancements: - Improved hallucination detection for numbered lists and other structured formats. - Introduced configurable max-token limit for hallucination checks, helping users fine-tune thresholds for context.	Low	6/26/2025
2.1.44	New Features - Added a '/traces/' API to support ingesting Open-Telemetry traces that meet the OpenInference (https://github.com/arize-ai/openinference/) specification. This feature is in preparation for adding agentic evaluations - more details coming soon Enhancements: - Added Docker Compose health checks to improve service startup reliability. - Introduced a single script to install both the GenAI and ML engines. - Initialized the Arthur Common module with CI, linting, and unit	Low	5/23/2025
2.1.40	Enhancements: * Patched a PyTorch vulnerability * Configured Renovate on the Arthur Engine GitHub repository for automated dependency updates * The `FETCH_RAW_DATA_ENABLED` configuration now exposed on the Helm Chart * Docker Compose always pulls the container images for the `latest` tag users * Postgres now uses a volume to persist data in Docker Compose Bug fix: * The ml-engine was not able to communicate with the genai-engine in the arthur-engine Docker Compose deployment. All servic	Low	5/8/2025
2.1.39	Deprecation: * Deprecated the endpoints that validate prompt and response on default rules without any task association Enhancement: * Reduced the number of configurations exposed for the first deploy experience with Docker Compose	Low	5/2/2025
2.1.37	Enhancements: - Open sourced the Arthur Engine full deployment scripts, comprised of both the `genai-engine` and the `ml-engine` components! You now have access to see how the `ml-engine` is deployed on Docker Compose, AWS ECS, and Kubernetes. All deployment scripts can now be found in the `/deployment` folder. - The GenAI Engine server can now start with no LLM service connected. This allows users without access to a LLM service to still use the non-LLM based evaluations. - Improved the conf	Low	4/29/2025
2.1.23	Enhancements: - Optimized the profanity detection function in the toxicity rule to improve latency for inferences with a large number of consecutive repeating characters. - Increased the overall concurrency of GPU deployments by using 5 Gunicorn workers by default and ensuring that the models load without encountering any race condition issues. - Improved quick deployment by adding start scripts for Docker Compose, Helm Chart, and AWS CloudFormation. Bug fix: - Disabled rules can be now a	Low	4/16/2025
2.1.18	We are thrilled to announce the very first release of the Arthur Engine, now available as an open source project! The Arthur Engine is a tool designed for evaluating and benchmarking machine learning models and enforcing guardrails in your LLM applications and generative AI workflows. This initial release debuts the GenAI Engine submodule and its capability to add guardrails to your LLM applications and generative AI workflows. We value your feedback and contributions. Whether you enco	Low	3/31/2025

Similar Packages

AutoRAG - AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation (v2.0.0)

robots - Control robots and physical hardware with natural language through Strands Agents. (v0.4.1)

ragas - Supercharge Your LLM Application Evaluations 🚀 (v0.4.3)

evals - A comprehensive evaluation framework for AI agents and LLM applications. (v1.0.3)

mem0 - Universal memory layer for AI Agents (v2.0.13)

More in Uncategorized

TradingAgents-CN - A Chinese-language financial trading framework based on multi-agent LLMs (enhanced Chinese edition of TradingAgents)

RAGEN - RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments.

gh-aw-firewall - GitHub Agentic Workflows Firewall

ollama - Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.