arthur-engine

Make AI work for Everyone - Monitoring and governing for your AI/ML

README

Make AI work for Everyone.

The Arthur Engine

The Arthur Engine provides a complete service for monitoring and governing your AI/ML workloads using popular Open-Source technologies and frameworks. It is a tool designed for:

  • Evaluating and Benchmarking Machine Learning models
    • Support for a wide range of evaluation metrics (e.g., drift, accuracy, precision, recall, F1, and AUC)
    • Tools for comparing models, exploring feature importance, and identifying areas for optimization
    • For LLMs/GenAI applications, measure and monitor response relevance, hallucination rates, token counts, latency, and more
  • Enforcing guardrails in your LLM Applications and Generative AI Workflows
    • Configurable metrics for real-time detection of PII or Sensitive Data leakage, Hallucination, Prompt Injection attempts, Toxic language, and other quality metrics
  • Extensibility to fit into your application's architecture
    • Support for plug-and-play metrics and an extensible API, so you can bring your own custom models or popular open-source models (including Hugging Face models)
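
The classification metrics named in the list above (precision, recall, and F1) are all derived from the same confusion counts. The following is a minimal sketch of those formulas in plain Python. It is not Arthur Engine code, and the `ConfusionCounts` class and function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary classifier (hypothetical helper type)."""
    true_positives: int
    false_positives: int
    false_negatives: int


def precision(c: ConfusionCounts) -> float:
    """Share of predicted positives that are correct; 0.0 if nothing was predicted positive."""
    predicted = c.true_positives + c.false_positives
    return c.true_positives / predicted if predicted else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of actual positives that were found; 0.0 if there are no actual positives."""
    actual = c.true_positives + c.false_negatives
    return c.true_positives / actual if actual else 0.0


def f1(c: ConfusionCounts) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts, used only to exercise the functions.
counts = ConfusionCounts(true_positives=8, false_positives=2, false_negatives=4)
print(f"precision={precision(counts):.3f} recall={recall(counts):.3f} f1={f1(counts):.3f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a common single-number summary for detectors such as prompt-injection or toxicity classifiers.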

Quickstart - See Examples

  1. Clone the repository and cd into deployment/docker-compose/genai-engine
  2. Create a .env file from the .env.template file and modify it (further instructions are in the README in that directory)
  3. Run docker compose up
  4. Wait for the genai-engine container to initialize, then navigate to localhost:3030/docs to see the API docs
  5. Start building!
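
Step 4 above involves waiting for the container to finish initializing. One way to script that wait is to poll the docs URL until it answers. The sketch below uses only the Python standard library. The function names, retry count, and delay are assumptions made for illustration, and only the `localhost:3030/docs` address comes from the Quickstart.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def docs_are_up(url: str) -> bool:
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = docs_are_up,
    attempts: int = 30,
    delay_seconds: float = 2.0,
) -> bool:
    """Call `probe(url)` up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

After `docker compose up`, calling `wait_until_ready("http://localhost:3030/docs")` returns `True` once the docs page responds, or `False` if the retry budget runs out first. The `probe` parameter is injectable so the loop can be tested without a running container.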

Arthur Platform Free Version

The genai-engine standalone deployment in the Quickstart provides powerful LLM evaluation and guardrailing features. To unlock the full capabilities of the Arthur Platform, sign up and get started for free.

Arthur Platform Enterprise Version

The enterprise version of the Arthur Platform provides better performance plus additional features and capabilities, including custom enterprise-ready guardrails and metrics, which can maximize the potential of AI for your organization.

Key features:

  • State-of-the-art proprietary evaluation models trained by Arthur's world-class machine learning engineering team
  • Airgapped deployment of the Arthur Engine (no dependency on the Hugging Face Hub)
  • Optional on-premises deployment of the entire Arthur Platform
  • Support from the world-class engineering teams at Arthur

To learn more about the enterprise version of the Arthur Platform, reach out!

Performance Comparison between Free and Enterprise Versions of Arthur Engine

The enterprise version of the Arthur Engine uses high-performing, low-latency proprietary models for some of the LLM evaluations. The table below compares open-source and enterprise performance in detail.

| Evaluation Type | Dataset | Free Version Performance (F1) | Enterprise Performance (F1) | Free Version Average Latency per Inference (s) | Enterprise Average Latency per Inference (s) |
|---|---|---|---|---|---|
| Prompt Injection | deepset | 0.52 (0.44, 0.60) | 0.89 (0.85, 0.93) | 0.966 | 0.03 |
| Prompt Injection | Arthur's Custom Benchmark | 0.79 (0.62, 0.93) | 0.85 (0.71, 0.96) | 0.16 | 0.005 |
| Toxicity | Arthur's Custom Benchmark | 0.633 (0.45, 0.79) | 0.89 (0.85, 0.93) | 3.096 | 0.0358 |
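
To put the latency columns in perspective, the free-to-enterprise ratio can be computed directly from the figures in the table above. This is a small arithmetic sketch. The numbers are copied from the table, while the `speedup` helper and dictionary layout are illustrative choices only.

```python
# (evaluation, dataset) -> (free latency in s, enterprise latency in s), copied from the table.
LATENCIES = {
    ("Prompt Injection", "deepset"): (0.966, 0.03),
    ("Prompt Injection", "Arthur's Custom Benchmark"): (0.16, 0.005),
    ("Toxicity", "Arthur's Custom Benchmark"): (3.096, 0.0358),
}


def speedup(free_seconds: float, enterprise_seconds: float) -> float:
    """How many times lower the enterprise latency is than the free-version latency."""
    return free_seconds / enterprise_seconds


for (evaluation, dataset), (free_s, enterprise_s) in LATENCIES.items():
    print(f"{evaluation} / {dataset}: {speedup(free_s, enterprise_s):.1f}x lower latency")
```

On these figures the enterprise models are roughly 32 times faster per inference on both prompt-injection rows and roughly 86 times faster on the toxicity row.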

Overview

The Arthur Engine is built with a focus on transparency and explainability. It provides users with comprehensive performance metrics, error analysis, and interpretable results to improve model understanding and outcomes. With support for plug-and-play metrics and extensible APIs, the Arthur Engine simplifies the process of understanding and optimizing generative AI outputs. It can also keep data-security and compliance risks from creating harmful experiences for your users in production or damaging your organization's reputation.

Key Features:

  • Evaluate models on structured/tabular datasets with customizable metrics
  • Evaluate LLMs and generative AI workflows with customizable metrics
  • Support building real-time guardrails for LLM applications and agentic workflows
  • Trace and monitor model performance over time
  • Visualize feature importance and error breakdowns
  • Compare multiple models side-by-side
  • Extensible APIs for custom metric development or for using custom models
  • Integration with popular libraries like LangChain or LlamaIndex (coming soon!)

LLM Evaluations:

| Eval | Technique | Source | Docs |
|---|---|---|---|
| Hallucination | Claim-based LLM Judge technique | Source | Docs |
| Prompt Injection | Open Source: Using deberta-v3-base-prompt-injection-v2 | Source | Docs |
| Toxicity | Open Source: Using roberta_toxicity_classifier | Source | Docs |
| Sensitive Data | Few-shot optimized LLM Judge technique | Source | Docs |
| Personally Identifiable Information | Using presidio, based on named-entity recognition | Source | Docs |
| CustomRules | Extend the service to support whatever monitoring or guardrails are applicable for your use case | Build your own! | Docs |

NB: The free version of Arthur uses open-source models for Prompt Injection and Toxicity evaluation by default. If you already have custom solutions for these evaluations and would like to use them, the models used for Prompt Injection and Toxicity are fully customizable and can be swapped out here (PI Code Pointer, Toxicity Code Pointer). If you are interested in higher-performing or lower-latency evaluations out of the box, please enquire about the enterprise version of the Arthur Engine.

Broad Integration Support Through the OpenInference Specification

Arthur Engine fully supports the OpenInference specification, which allows you to connect the Engine to a wide range of AI frameworks, libraries, and agent stacks without custom instrumentation.

OpenInference provides a shared trace and data schema for AI systems. Since Arthur Engine follows this standard, you can immediately use any integration already built for the OpenInference ecosystem, including the large collection maintained by Arize Phoenix.

This includes support for many popular frameworks such as:

  • LangChain
  • LangGraph
  • LlamaIndex
  • Vercel AI SDK
  • FastAPI and Flask apps instrumented with OpenInference
  • OpenAI, Anthropic, Google, and other model providers aligned with the spec
  • Agent frameworks, orchestration tools, and custom pipelines supported by Phoenix integrations
  • And many others

You can view the full and continuously updated list of supported integrations here: https://github.com/Arize-ai/phoenix?tab=readme-ov-file#tracing-integrations

By adopting OpenInference, Arthur Engine provides a flexible and future-proof way to bring traces, spans, metrics, inputs, outputs, and evaluation signals into the Arthur platform. This makes it easy to collect data from diverse GenAI apps, agents, and services through a single, unified integration path.
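
To give a sense of what a shared trace schema means in practice, the sketch below builds the flat attribute map that an OpenInference-style instrumentor might attach to a span for a single LLM call. The attribute keys follow the OpenInference semantic conventions as best understood here and should be checked against the specification. The `llm_span_attributes` function is hypothetical and is not part of Arthur Engine or any OpenInference package.

```python
from typing import Any, Dict


def llm_span_attributes(
    model_name: str,
    prompt: str,
    completion: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Dict[str, Any]:
    """Build a flat attribute map for one LLM call using OpenInference-style keys."""
    return {
        "openinference.span.kind": "LLM",  # marks the span as an LLM invocation
        "llm.model_name": model_name,
        "input.value": prompt,
        "output.value": completion,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }


# Placeholder values, used only to show the resulting shape.
attributes = llm_span_attributes("example-model", "Hi", "Hello!", 3, 2)
for key, value in attributes.items():
    print(f"{key} = {value!r}")
```

In a real application these attributes are normally set by an existing OpenInference instrumentor on OpenTelemetry spans rather than assembled by hand. The point of the sketch is only that every framework emitting the same keys can be ingested through one path.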

Contributing

  • Join the Arthur community on Discord to get help and share your feedback.
  • To request a bug fix or a new feature, please file a GitHub issue.
  • To contribute code, please review the contributing guidelines.
  • Thank you!

Release History

| Version | Changes | Urgency | Date |
|---|---|---|---|
| 2.1.601 | # 🚀 Arthur Engine Release **June 3, 2026** This release delivers significant multi-tenant security hardening, a thoroughly refined onboarding tour experience, and improved agent trace visibility in the prompts playground. --- ## Multi-Tenancy & Access Control ### Security Fixes * Closed five critical **multi-tenant security and correctness gaps** including reCAPTCHA fail-open rejection, notebook ownership validation before experiment linking, org-scoped session trace pagination at the SQL | High | 6/3/2026 |
| 2.1.579 | # 🚀 Arthur Engine Release **May 28, 2026** Arthur Engine 2.1.579 delivers Azure ecosystem integrations, a dedicated prompt injection validation endpoint, improved PII detection accuracy, onboarding workflows, and multi-tenant UI support, alongside quality-of-life improvements across the platform. --- ## Guardrails & Validation ### Prompt Injection * Added a new **validate endpoint** that enables easy, standalone prompt injection checks against incoming prompts (#1633) ### PII Detection | High | 5/28/2026 |
| 0.0.11-lts | # 🚀 Arthur Engine Release **May 20, 2026** This release lays the groundwork for full multi-tenancy, introduces a guided onboarding experience for new users, expands model provider and observability integrations, and strengthens compliance automation, making Arthur Engine ready for larger, organization-aware deployments. --- ## Multi-Tenancy and Access Control ### Organization-Scoped Data and API Keys * **Organizations table and tenant isolation** are now supported at the database level | High | 5/20/2026 |
| 2.1.563 | # 🚀 Arthur Engine Release **May 14, 2026** This release brings powerful new capabilities for onboarding, compliance automation, and observability, including an interactive onboarding agent, Azure OpenAI support, transform version history, and expanded SDK instrumentors for popular AI frameworks. --- ## Onboarding and Getting Started ### Interactive Onboarding Agent * New **interactive onboarding CLI tool** automates setup of observability, model configuration, Python instrumentation, and | High | 5/14/2026 |
| 0.0.8-lts | # 🚀 Arthur Engine Release **May 6, 2026** This release introduces the first Long-Term Support (LTS) channel for Arthur Engine, giving teams a stable, versioned deployment path alongside improvements to container security and airgapped deployment compatibility. --- ## Long-Term Support (LTS) Release Channel ### LTS Versioning and Distribution * Arthur Engine is now available through a dedicated **Long-Term Support (LTS) release channel**, providing a stable, predictable version track for p | High | 5/6/2026 |
| 2.1.548 | # 🚀 Arthur Engine Release **May 5, 2026** This release strengthens deployment flexibility and security, enabling seamless operation in airgapped environments and improving container security across all supported platforms. --- ## Deployment & Infrastructure Enhancements ### Airgapped Deployment Support * **Tiktoken encodings are now cached directly on the container image**, eliminating the need for external network calls during container initialization. Users deploying in airgapped or net | High | 5/5/2026 |
| 2.1.548 | # 🚀 Arthur Engine Release **May 5, 2026** This release strengthens deployment flexibility and security, enabling seamless operation in airgapped environments and improving container security across all supported platforms. --- ## Deployment & Infrastructure Enhancements ### Airgapped Deployment Support * **Tiktoken encodings are now cached directly on the container image**, eliminating the need for external network calls during container initialization. Users deploying in airgapped or net | Medium | 5/5/2026 |
| 2.1.548 | # 🚀 Arthur Engine Release **May 5, 2026** This release strengthens deployment flexibility and security, enabling seamless operation in airgapped environments and improving container security across all supported platforms. --- ## Deployment & Infrastructure Enhancements ### Airgapped Deployment Support * **Tiktoken encodings are now cached directly on the container image**, eliminating the need for external network calls during container initialization. Users deploying in airgapped or net | Medium | 5/5/2026 |
| 0.0.0-lts-patch-2 | # 🚀 Arthur Engine Release **May 1, 2026** This is a maintenance patch for the long-term support (LTS) branch focused on internal infrastructure improvements. There are no user-facing changes in this release. --- ## Deployment & Infrastructure Enhancements ### LTS Build Pipeline * Improved **Docker image publishing** for LTS releases by adopting a more efficient image retagging strategy, ensuring faster and more reliable delivery of patched LTS containers. This update strengthens the reli | High | 5/1/2026 |
| 2.1.529 | # 🚀 Arthur Engine Release **April 17, 2026** This release strengthens compliance observability, improves trace exploration workflows, and resolves several UI and API issues that impacted pagination, task browsing, and HTTP spec compliance. --- ## Compliance and Alerting ### Accurate Violation Tracking Per Alert Rule * The **policy_alert_rule_check_count** compliance metric now reports the true number of violations per alert rule instead of always reporting 1.0, giving a more accurate pict | High | 4/17/2026 |
| 2.1.516 | # 🚀 Arthur Engine Release **April 14, 2026** This release brings a redesigned Evaluate experience with unified evaluator management, bulk evaluation testing, automated compliance scheduling, and trace retention policies, giving teams more control over evaluation workflows, compliance monitoring, and data lifecycle management. --- ## Evaluation and Continuous Evals ### Unified Evaluators and Continuous Evals UX * The Evaluate section now features a **unified two-tab layout (Ev | High | 4/14/2026 |
| 2.1.496 | # 🚀 Arthur Engine Release **April 2, 2026** This release introduces significant user experience improvements and enhanced tracing capabilities, while strengthening security and system reliability across the platform. --- ## User Experience Enhancements ### Interactive AI Assistant * Added **Engine Chatbot** with intelligent query capabilities for searching API documentation and managing resources * Integrated automatic model provider detection supporting Anthropic Claude, OpenAI GPT, and | High | 4/2/2026 |
| 2.1.477 | # 🚀 Arthur Engine Release **March 23, 2026** This release delivers significant enhancements to experiment creation workflows, trace analysis capabilities, and user personalization while introducing the comprehensive Arthur Observability SDK v1.0 for Python developers. --- ## Arthur Observability SDK ### Python SDK Launch * Released **Arthur Observability SDK v1.0**, a comprehensive Python package for LLM application observability * Added automatic instrumentation for **33 AI | Medium | 3/23/2026 |
| 2.1.456 | # 🚀 Arthur Engine Release **March 12, 2026** This release delivers a comprehensive UI modernization, enhanced evaluation workflows, and improved agent task management alongside critical security updates and performance optimizations. --- ## User Experience & Interface Enhancements ### Navigation Consolidation * Unified all major product areas into streamlined tabbed interfaces, replacing scattered navigation with intuitive single-entry points * Consolidated **RAG functionality** into unif | Low | 3/11/2026 |
| 2.1.386 | # 🚀 Arthur Engine Release **February 18, 2026** This release strengthens evaluation workflows, task visibility, dataset intelligence, and enterprise deployment reliability across environments. --- ## Evaluation & Experiment Enhancements ### Improved Evaluation Configuration * Added a dedicated **Evals input field** for clearer configuration * Introduced a new filtering mechanism for Conti | Low | 2/20/2026 |
| 2.1.355 | # 🚀 Arthur Engine Release **January 26 – February 5, 2026** This release significantly expands experimentation, trace visibility, model provider support, and deployment flexibility across the Agent Development Lifecycle. --- ## Agent Experiments & RAG Evaluation ### Agent Experiments * Introduced **Agent Experiments** with UI enhancements * Added configurable Session ID support for repro | Low | 2/17/2026 |
| 2.1.286 | **Enhancements:** - Users can now configure where GenAI models are sourced from, enabling models to be pulled from an approved, customer-managed repository instead of the public Hugging Face Hub. - Metrics can now be segmented by user ID and conversation ID for more granular analysis. - Enhanced ODBC Connector Support: Improved handling of database views, more reliable primary key detection, and configurable connection and login timeouts. - Improved GenAI model bootstrapping reliability. | Low | 1/14/2026 |
| 2.1.237 | **New Features:** - **Test & Preview Custom Metrics Before Saving:** Users can now validate their custom metrics directly within the creation and editing workflow. Users can run the metric against available datasets to preview results and confirm the logic behaves as expected before saving. **Bug fixes:** - Custom metrics: - Sketch metrics can now be created and calculated without specifying any dimension columns. - Frontend No Longer Overwrites User-Defined Metadata for Reported Me | Low | 12/5/2025 |
| 2.1.209 | Bug Fix/Enhancements: - Fixed an issue where some metrics were missing from the selection list for custom datasets. - Increase ML engine aggregation timeout to support segmentation of larger & more complex datasets. | Low | 11/21/2025 |
| 2.1.135 | Enhancements - Made enhancements to PII detection model to improve date/time identification. - Docker configuration has been updated to use Postgres version 15, ensuring compatibility & preventing initialization errors during new engine setup. | Low | 11/6/2025 |
| 2.1.94 | **Enhancements**: - Updated telemetry ORM models, update migrations to enforce non-null timestamps. - Improved pagination handling for MSSQL. - Added `status_code` and `session_id` to spans. | Low | 10/15/2025 |
| 2.1.93 | **New Features** - **Custom Metrics:** You can now define and manage custom metrics using SQL. Custom metrics can be reused across models and projects, and integrate seamlessly with dashboards, alerts, and queries in the Arthur platform. Versioning ensures you can update metric logic while preserving historical data accuracy. [[Learn more](https://docs.arthur.ai/docs/custom-metrics)] **Enhancements** - **Agent Trace Viewer:** Improved filters: users can now filter by metric evaluation | Low | 10/7/2025 |
| 2.1.79 | **Enhancements** - Span Query Improvements: - New GET endpoint `/v1/spans/query`: allows filtering spans by type. - Added support for span name column: improves query flexibility and performance. - Optimized span queries: added indexes to frequently queried columns. - Improved ingestion stability: fixed batch ingestion when root spans are present. - Improved developer experience by unifying our API schema and client libraries across the GenAI & Ml Engines as well as the Arthur pla | Low | 9/12/2025 |
| 2.1.71 | **New Features:** - **Agentic monitoring is now supported in the GenAI Engine**: Building on the recently added /traces/ API, this release introduces support for monitoring agentic behavior: - Tasks now include an is_agentic flag to enable targeted analysis and evaluation. - Metrics and traces APIs have been upgraded to support structured outputs, trace reconstruction, and intelligent defaults. - The engine selectively computes metrics for agentic tasks, improving the precision | Low | 8/28/2025 |
| 2.1.46 | **New Features:** - Added image support for metrics + visualizing inferences in the Arthur Platform. - Users can now optionally configure attributes to segment over when defining metrics. **Enhancements:** - Improved hallucination detection for numbered lists and other structured formats. - Introduced configurable max-token limit for hallucination checks, helping users fine-tune thresholds for context. | Low | 6/26/2025 |
| 2.1.44 | **New Features** - Added a '/traces/' API to support ingesting Open-Telemetry traces that meet the OpenInference (https://github.com/arize-ai/openinference/) specification. This feature is in preparation for adding agentic evaluations - more details coming soon **Enhancements:** - Added Docker Compose health checks to improve service startup reliability. - Introduced a single script to install both the GenAI and ML engines. - Initialized the Arthur Common module with CI, linting, and unit | Low | 5/23/2025 |
| 2.1.40 | Enhancements: * Patched a PyTorch vulnerability * Configured Renovate on the Arthur Engine GitHub repository for automated dependency updates * The `FETCH_RAW_DATA_ENABLED` configuration now exposed on the Helm Chart * Docker Compose always pulls the container images for the `latest` tag users * Postgres now uses a volume to persist data in Docker Compose Bug fix: * The ml-engine was not able to communicate with the genai-engine in the arthur-engine Docker Compose deployment. All servic | Low | 5/8/2025 |
| 2.1.39 | Deprecation: * Deprecated the endpoints that validate prompt and response on default rules without any task association Enhancement: * Reduced the number of configurations exposed for the first deploy experience with Docker Compose | Low | 5/2/2025 |
| 2.1.37 | Enhancements: - Open sourced the Arthur Engine full deployment scripts, comprised of both the `genai-engine` and the `ml-engine` components! You now have access to see how the `ml-engine` is deployed on Docker Compose, AWS ECS, and Kubernetes. All deployment scripts can now be found in the `/deployment` folder. - The GenAI Engine server can now start with no LLM service connected. This allows users without access to a LLM service to still use the non-LLM based evaluations. - Improved the conf | Low | 4/29/2025 |
| 2.1.23 | Enhancements: - Optimized the profanity detection function in the toxicity rule to improve latency for inferences with a large number of consecutive repeating characters. - Increased the overall concurrency of GPU deployments by using 5 Gunicorn workers by default and ensuring that the models load without encountering any race condition issues. - Improved quick deployment by adding start scripts for Docker Compose, Helm Chart, and AWS CloudFormation. Bug fix: - Disabled rules can be now a | Low | 4/16/2025 |
| 2.1.18 | We are thrilled to announce the very first release of the Arthur Engine, now available as an open source project! The Arthur Engine is a tool designed for evaluating and benchmarking machine learning models and enforcing guardrails in your LLM applications and generative AI workflows. This initial release debuts the GenAI Engine submodule and its capability to add guardrails to your LLM applications and generative AI workflows. We value your feedback and contributions. Whether you enco | Low | 3/31/2025 |
