freshcrate
Skin:/
Home > Testing > GTA

GTA

[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2

Why this rank:Strong adoptionRelease freshnessHealthy release cadence

Description

[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2

README

GTA: General Tool Agent Benchmark and Evaluation Framework

⬇️ Download Dataset Here: [GTA-Atomic] [GTA-Workflow]

🌟 Introduction

GTA-2 is a benchmark and evaluation kit for General Tool Agents, designed to bridge atomic tool-use evaluation and open-ended workflow evaluation in one repository.

Benchmark hierarchy

  • GTA-Workflow: the new focus of GTA-2, for long-horizon, open-ended workflow evaluation.
  • GTA-Atomic: the original GTA benchmark for short-horizon atomic tool-use tasks. Please refer to README_GTA-1.md.

This readme is centered around GTA-Workflow, which targets realistic long-horizon tasks with open-ended deliverables. Compared with traditional benchmark-style evaluation, GTA-Workflow focuses more on what an agent can finally accomplish in a complete workflow, rather than only whether it predicts the next tool call correctly.

What this repo supports

  • Workflow-oriented agent evaluation.
    Evaluate long-horizon, open-ended agent tasks with deliverable-centric scoring.

  • Both model and harness evaluation.
    GTA-Workflow is designed to evaluate not only the underlying LLM, but also the execution harness / agent framework behind it.

  • Default OpenCompass-based evaluation.
    We provide a standard evaluation pipeline based on OpenCompass + Lagent, suitable for agents integrated as callable frameworks.

  • Custom agent / custom LLM integration.
    Beyond the default setup, users can plug in their own agent framework or LLM backend. See docs/ADDING_NEW_AGENT_OR_LLM.md.

  • End-to-end evaluation without OpenCompass.
    For agent products or closed systems that cannot be directly integrated into our framework, GTA-2 also supports evaluating final execution results directly, enabling assessment of systems such as Manus, Kortix, or OpenClaw.

📣 What's New

  • [2026.4.20] Release GTA-2 paper and GTA-Workflow dataset. 🔥🔥🔥
  • [2026.4.12] Release GTA-2, extending the original GTA benchmark into a hierarchical evaluation repo with:
    • GTA-Workflow for long-horizon, open-ended workflow evaluation in productivity scenarios,
    • support for evaluating both LLM capability (GPT, Gemini, Claude, etc.) and agent execution harnesses (OpenClaw, Manus, Kortix, etc.),
    • support for both OpenCompass-based agent evaluation and end-to-end result evaluation for external/closed agent systems.
  • [2026.2.14] Update 🏆Leaderboard, Feb. 2026, including new models such as GPT-5, Gemini-2.5, Claude-4.5, Kimi-K2, Grok-4, Llama-4, Deepseek-V3.2, Qwen3-235B-A22B series.
  • [2025.3.25] Update 🏆Leaderboard, Mar. 2025, including new models such as Deepseek-R1, Deepseek-V3, Qwen-QwQ, Qwen-2.5-max series.
  • [2024.9.26] GTA is accepted to NeurIPS 2024 Dataset and Benchmark Track! 🎉🎉🎉
  • [2024.7.11] Paper available on arXiv. ✨✨✨
  • [2024.7.3] Release the evaluation and tool deployment code of GTA. 🔥🔥🔥
  • [2024.7.1] Release the GTA dataset on Hugging Face. 🎉🎉🎉

📚 Dataset Statistics

GTA-Workflow: Real-World Productivity Tasks

GTA-Workflow focuses on long-horizon, open-ended productivity scenarios, where agents are required to complete realistic deliverables instead of predicting intermediate tool calls.

These tasks cover diverse real-world use cases, including

  • Data Analysis
  • Education & Instruction
  • Planning & Decision
  • Creative Design
  • Marketing Strategy
  • Retrieval & QA

Compared to GTA-Atomic, GTA-Workflow significantly expands modalities, tool ecosystem, and task complexity.

Data Sources

Unlike GTA-Atomic (original GTA), which is manually constructed for controlled evaluation, GTA-Workflow is built from real-world workflow tasks with a human-in-the-loop pipeline. The tasks are collected and rewritten from two major sources:

🏆 Leaderboard, Apr. 2026

Main evaluation results of both LLMs and agent harness GTA-2.

🚀 How to Evaluate on GTA-2

GTA-2 supports three evaluation modes depending on your setup.

  • Default OpenCompass-based evaluation.
    We provide a standard pipeline based on OpenCompass + Lagent, suitable for agents that can be integrated as callable frameworks. The following instructions in this section focus on this setup.

  • Custom agent / custom LLM integration.
    You can plug in your own agent framework or LLM backend via a wrapper.
    See docs/ADDING_NEW_AGENT_OR_LLM.md.

  • End-to-end evaluation without OpenCompass.
    For external or productized agent systems where only final outputs are available, GTA-2 supports evaluating results directly (e.g., Manus-, Kortix-, or OpenClaw-style systems).
    See agent_app_eval/README.md.

The following instructions focus on GTA-Workflow evaluation of default setup. For GTA-Atomic (original GTA) evaluation, please refer to
README_GTA1.md. The codebase remains compatible.

Prepare GTA-2 Dataset

  1. Clone this repo.
git clone https://github.com/open-compass/GTA.git
cd GTA
  1. Download the dataset from release file.
mkdir ./opencompass/data

Put it under the folder ./opencompass/data/. The structure of files should be:

GTA/
├── agentlego
├── opencompass
│   ├── data
│   │   ├── gta_dataset_v2
│   ├── ...
├── ...

Prepare Your Model

  1. Download the model weights.
pip install -U huggingface_hub
# huggingface-cli download --resume-download hugging/face/repo/name --local-dir your/local/path --local-dir-use-symlinks False
huggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat --local-dir ~/models/qwen1.5-7b-chat --local-dir-use-symlinks False
  1. Install LMDeploy.
conda create -n lmdeploy python=3.10
conda activate lmdeploy

For CUDA 12:

pip install lmdeploy

For CUDA 11+:

export LMDEPLOY_VERSION=0.4.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
  1. Launch a model service.
# lmdeploy serve api_server path/to/your/model --server-port [port_number] --model-name [your_model_name]
lmdeploy serve api_server ~/models/qwen1.5-7b-chat --server-port 12580 --model-name qwen1.5-7b-chat

Deploy Tools

  1. Install AgentLego.
conda create -n agentlego python=3.11.9
conda activate agentlego
cd agentlego
pip install -r requirements_all.txt
pip install -r requirements_gta_v2.txt
pip install agentlego
pip install -e .
mim install mmengine
mim install mmcv==2.1.0

Open ~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py, then set _supports_sdpa = False to _supports_sdpa = True in line 1279.

  1. Deploy tools for GTA benchmark.

To use the GoogleSearch and MathOCR tools, you should first get the Serper API key from https://serper.dev, and the Mathpix API key from https://mathpix.com/. Then export these keys as environment variables.

export SERPER_API_KEY='your_serper_key_for_google_search_tool'
export MATHPIX_APP_ID='your_mathpix_key_for_mathocr_tool'
export MATHPIX_APP_KEY='your_mathpix_key_for_mathocr_tool'

Start the tool server.

agentlego-server start --port 16181 --extra ./benchmark.py  `cat benchmark_toollist_v2.txt` --host 0.0.0.0

Start Evaluation

  1. Install OpenCompass.
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
cd agentlego
pip install -e .
cd ../opencompass
pip install -e .
pip install huggingface_hub==0.25.2 transformers==4.40.1
  1. Modify the config file at configs/eval_gta_bench_v2.py as below.

The ip and port number of openai_api_base is the ip of your model service and the port number you specified when using lmdeploy.

The ip and port number of tool_server is the ip of your tool service and the port number you specified when using agentlego.

models = [
  dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        tool_server='http://10.140.0.138:16181',
        tool_meta='data/gta_dataset_v2/toolmeta.json',
        batch_size=8,
    ),
]

Before running, set:

export OPENCOMPASS_TOOLMETA_PATH=data/gta_dataset_v2/toolmeta.json
export OPENAI_API_KEY=your_openai_key
  1. Infer and evaluate with OpenCompass.
# infer only
python run.py configs/eval_gta_bench_v2.py --max-num-workers 32 --debug --mode infer
# evaluate only
python run.py configs/eval_gta_bench_v2.py --max-num-workers 32 --debug --reuse [time_stamp_of_prediction_file] --mode eval
# infer and evaluate
python run.py configs/eval_gta_bench_v2.py -p llmit -q auto --max-num-workers 32 --debug

📝 Citation

If you use GTA in your research, please cite the following paper:

@article{wang2024gta,
  title={GTA: a benchmark for general tool agents},
  author={Wang, Jize and Ma, Zerun and Li, Yining and Zhang, Songyang and Chen, Cailian and Chen, Kai and Le, Xinyi},
  journal={Advances in Neural Information Processing Systems},
  pages={75749--75790},
  year={2024}
}
@article{wang2026gta2,
  title={GTA-2: benchmarking general tool agents from atomic tool-use to open-ended workflows},
  author={Wang, Jize and Liu, Xuanxuan and Li, Yining and Zhang, Songyang and Wang, Yijun and Shan, Zifei and Le, Xinyi and Chen, Cailian and Guan, Xinping and Tao, Dacheng},
  journal={arXiv:2604.15715},
  year={2026}
}

Release History

VersionChangesUrgencyDate
v0.2.0Release GTA-2 and its new introduced GTA-Workflow dataset.High4/20/2026
v0.1.0The GTA Dataset.Low6/25/2024

Dependencies & License Audit

Loading dependencies...

Similar Packages

mlflowThe open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controllinv3.13.0
ComfyUI-AudioSR🎶 Enhance audio quality with ComfyUI-AudioSR, a versatile tool for upscaling sounds to 48kHz for better clarity and listening experience.main@2026-05-31
giskard-oss🐢 Open-Source Evaluation & Testing library for LLM Agentsgiskard-checks/v1.0.2b3
chinese-llm-benchmarkReLE评测:中文AI大模型能力评测(持续更新):目前已囊括359个大模型,覆盖chatgpt、gpt-5.2、o4-mini、谷歌gemini-3-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3-max、qwen3.5-plus、百川、讯飞星火、商汤senseChat等商用模型, 以及step3.5-flash、kimi-k2.5、ernie4.5、Minv5.10
ai-agents-reality-checkBenchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistica0.0.0

More in Testing

vector-db-benchmarkFramework for benchmarking vector search engines
GitoAn AI-powered GitHub code review tool that uses LLMs to detect high-confidence, high-impact issues—such as security vulnerabilities, bugs, and maintainability concerns.
mxcliMendix cli tool, a headless way to work with Mendix projects. Enables Mendix projects for use with 3rd party agentic coding tools like Claude Code and Copilot. Includes a starlark linter for quality v
llm_context_benchmarks 📊 LLM Context Benchmarks - A comprehensive benchmarking tool for testing LLMs with varying context sizes using Ollama. Features dual benchmark modes (API/CLI), automatic hardware detection (optimiz