GTA

[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2

GTA: General Tool Agent Benchmark and Evaluation Framework

⬇️ Download Dataset Here: [GTA-Atomic] [GTA-Workflow]

🌟 Introduction

GTA-2 is a benchmark and evaluation kit for General Tool Agents, designed to bridge atomic tool-use evaluation and open-ended workflow evaluation in one repository.

Benchmark hierarchy

  • GTA-Workflow: the new focus of GTA-2, for long-horizon, open-ended workflow evaluation.
  • GTA-Atomic: the original GTA benchmark for short-horizon atomic tool-use tasks. Please refer to README_GTA-1.md.

This README focuses on GTA-Workflow, which targets realistic long-horizon tasks with open-ended deliverables. Compared with traditional benchmark-style evaluation, GTA-Workflow focuses on what an agent ultimately accomplishes in a complete workflow, rather than only on whether it predicts the next tool call correctly.

What this repo supports

  • Workflow-oriented agent evaluation.
    Evaluate long-horizon, open-ended agent tasks with deliverable-centric scoring.

  • Both model and harness evaluation.
    GTA-Workflow is designed to evaluate not only the underlying LLM, but also the execution harness / agent framework behind it.

  • Default OpenCompass-based evaluation.
    We provide a standard evaluation pipeline based on OpenCompass + Lagent, suitable for agents integrated as callable frameworks.

  • Custom agent / custom LLM integration.
    Beyond the default setup, users can plug in their own agent framework or LLM backend. See docs/ADDING_NEW_AGENT_OR_LLM.md.

  • End-to-end evaluation without OpenCompass.
    For agent products or closed systems that cannot be directly integrated into our framework, GTA-2 also supports evaluating final execution results directly, enabling assessment of systems such as Manus, Kortix, or OpenClaw.

📣 What's New

  • [2026.4.20] Release GTA-2 paper and GTA-Workflow dataset. 🔥🔥🔥
  • [2026.4.12] Release GTA-2, extending the original GTA benchmark into a hierarchical evaluation repo with:
    • GTA-Workflow for long-horizon, open-ended workflow evaluation in productivity scenarios,
    • support for evaluating both LLM capability (GPT, Gemini, Claude, etc.) and agent execution harnesses (OpenClaw, Manus, Kortix, etc.),
    • support for both OpenCompass-based agent evaluation and end-to-end result evaluation for external/closed agent systems.
  • [2026.2.14] Update 🏆Leaderboard, Feb. 2026, including new models such as GPT-5, Gemini-2.5, Claude-4.5, Kimi-K2, Grok-4, Llama-4, Deepseek-V3.2, Qwen3-235B-A22B series.
  • [2025.3.25] Update 🏆Leaderboard, Mar. 2025, including new models such as Deepseek-R1, Deepseek-V3, Qwen-QwQ, Qwen-2.5-max series.
  • [2024.9.26] GTA is accepted to NeurIPS 2024 Dataset and Benchmark Track! 🎉🎉🎉
  • [2024.7.11] Paper available on arXiv. ✨✨✨
  • [2024.7.3] Release the evaluation and tool deployment code of GTA. 🔥🔥🔥
  • [2024.7.1] Release the GTA dataset on Hugging Face. 🎉🎉🎉

📚 Dataset Statistics

GTA-Workflow: Real-World Productivity Tasks

GTA-Workflow focuses on long-horizon, open-ended productivity scenarios, where agents are required to complete realistic deliverables instead of predicting intermediate tool calls.

These tasks cover diverse real-world use cases, including:

  • Data Analysis
  • Education & Instruction
  • Planning & Decision
  • Creative Design
  • Marketing Strategy
  • Retrieval & QA

Compared to GTA-Atomic, GTA-Workflow significantly expands modalities, tool ecosystem, and task complexity.

Data Sources

Unlike GTA-Atomic (original GTA), which is manually constructed for controlled evaluation, GTA-Workflow is built from real-world workflow tasks with a human-in-the-loop pipeline. The tasks are collected and rewritten from two major sources:

🏆 Leaderboard, Apr. 2026

Main evaluation results for both LLMs and agent harnesses on GTA-2.

🚀 How to Evaluate on GTA-2

GTA-2 supports three evaluation modes depending on your setup.

  • Default OpenCompass-based evaluation.
    We provide a standard pipeline based on OpenCompass + Lagent, suitable for agents that can be integrated as callable frameworks. The following instructions in this section focus on this setup.

  • Custom agent / custom LLM integration.
    You can plug in your own agent framework or LLM backend via a wrapper.
    See docs/ADDING_NEW_AGENT_OR_LLM.md.

  • End-to-end evaluation without OpenCompass.
    For external or productized agent systems where only final outputs are available, GTA-2 supports evaluating results directly (e.g., Manus-, Kortix-, or OpenClaw-style systems).
    See agent_app_eval/README.md.

The following instructions focus on GTA-Workflow evaluation with the default setup. For GTA-Atomic (original GTA) evaluation, please refer to README_GTA1.md. The codebase remains compatible.

Prepare GTA-2 Dataset

  1. Clone this repo.
git clone https://github.com/open-compass/GTA.git
cd GTA
  2. Download the dataset from the release file.
mkdir ./opencompass/data

Put the downloaded dataset under ./opencompass/data/. The file structure should be:

GTA/
├── agentlego
├── opencompass
│   ├── data
│   │   ├── gta_dataset_v2
│   ├── ...
├── ...
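
Before moving on, it can help to confirm that the dataset landed in the right place. The following is a minimal sketch, not part of this repo: check_dataset_layout is a hypothetical helper, and the expected paths are taken only from this README (the gta_dataset_v2 folder shown above and the toolmeta.json file referenced in the config further down).

```python
from pathlib import Path


def check_dataset_layout(repo_root):
    """Return the expected dataset paths that are missing under repo_root.

    Hypothetical helper: the expected paths come from this README only.
    """
    root = Path(repo_root)
    dataset_dir = root / "opencompass" / "data" / "gta_dataset_v2"
    expected = [dataset_dir, dataset_dir / "toolmeta.json"]
    # Report paths relative to the repo root, using forward slashes.
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]
```

Calling check_dataset_layout(".") from the GTA/ directory returns an empty list when both paths exist, and otherwise lists what is missing.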

Prepare Your Model

  1. Download the model weights.
pip install -U huggingface_hub
# huggingface-cli download --resume-download hugging/face/repo/name --local-dir your/local/path --local-dir-use-symlinks False
huggingface-cli download --resume-download Qwen/Qwen1.5-7B-Chat --local-dir ~/models/qwen1.5-7b-chat --local-dir-use-symlinks False
  2. Install LMDeploy.
conda create -n lmdeploy python=3.10
conda activate lmdeploy

For CUDA 12:

pip install lmdeploy

For CUDA 11+:

export LMDEPLOY_VERSION=0.4.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
  3. Launch a model service.
# lmdeploy serve api_server path/to/your/model --server-port [port_number] --model-name [your_model_name]
lmdeploy serve api_server ~/models/qwen1.5-7b-chat --server-port 12580 --model-name qwen1.5-7b-chat
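
Once the service is up, a quick smoke test can catch a wrong port or model name before the full evaluation starts. The sketch below is an illustration under assumptions, not part of this repo: it assumes the service accepts OpenAI-style chat requests at /v1/chat/completions (the path used in the openai_api_base setting later in this README), and build_chat_request and send are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url, model_name, prompt):
    """Build (but do not send) a POST request for an OpenAI-style chat endpoint."""
    payload = {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the example service above running locally, send(build_chat_request("http://localhost:12580", "qwen1.5-7b-chat", "Hello")) should return a JSON object if the port and model name match the launch command.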

Deploy Tools

  1. Install AgentLego.
conda create -n agentlego python=3.11.9
conda activate agentlego
cd agentlego
pip install -r requirements_all.txt
pip install -r requirements_gta_v2.txt
pip install agentlego
pip install -e .
mim install mmengine
mim install mmcv==2.1.0

Open ~/anaconda3/envs/agentlego/lib/python3.11/site-packages/transformers/modeling_utils.py, then change _supports_sdpa = False to _supports_sdpa = True on line 1279.

  2. Deploy tools for the GTA benchmark.

To use the GoogleSearch and MathOCR tools, first obtain a Serper API key from https://serper.dev and a Mathpix API key from https://mathpix.com/. Then export these keys as environment variables.

export SERPER_API_KEY='your_serper_key_for_google_search_tool'
export MATHPIX_APP_ID='your_mathpix_app_id_for_mathocr_tool'
export MATHPIX_APP_KEY='your_mathpix_key_for_mathocr_tool'
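
A missing key usually only shows up when a tool call fails partway through a run, so it can be worth checking the environment first. This is a minimal sketch, assuming only the three variable names listed above; missing_tool_keys is a hypothetical helper, not an AgentLego function.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_KEYS = ("SERPER_API_KEY", "MATHPIX_APP_ID", "MATHPIX_APP_KEY")


def missing_tool_keys(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]
```

Running missing_tool_keys() in the shell session that will start the tool server returns an empty list when all three variables are set.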

Start the tool server.

agentlego-server start --port 16181 --extra ./benchmark.py  `cat benchmark_toollist_v2.txt` --host 0.0.0.0

Start Evaluation

  1. Install OpenCompass.
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
cd agentlego
pip install -e .
cd ../opencompass
pip install -e .
pip install huggingface_hub==0.25.2 transformers==4.40.1
  2. Modify the config file at configs/eval_gta_bench_v2.py as below.

In openai_api_base, use the IP address of your model service and the port number you specified when launching lmdeploy.

In tool_server, use the IP address of your tool service and the port number you specified when starting agentlego-server.

models = [
    dict(
        abbr='qwen1.5-7b-chat',
        type=LagentAgent,
        agent_type=ReAct,
        max_turn=10,
        llm=dict(
            type=OpenAI,
            path='qwen1.5-7b-chat',
            key='EMPTY',
            openai_api_base='http://10.140.1.17:12580/v1/chat/completions',
            query_per_second=1,
            max_seq_len=4096,
            stop='<|im_end|>',
        ),
        tool_server='http://10.140.0.138:16181',
        tool_meta='data/gta_dataset_v2/toolmeta.json',
        batch_size=8,
    ),
]
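
The two URLs in the config follow a fixed pattern, so they can be derived from the host and port of each service. The sketch below only restates that pattern from the example config; model_api_url and tool_server_url are hypothetical helper names and are not defined by OpenCompass or AgentLego.

```python
def model_api_url(host, port):
    """URL for openai_api_base: model service host and port plus the chat path."""
    return f"http://{host}:{port}/v1/chat/completions"


def tool_server_url(host, port):
    """URL for tool_server: tool service host and port, with no path."""
    return f"http://{host}:{port}"


# Values from the example config above.
print(model_api_url("10.140.1.17", 12580))
print(tool_server_url("10.140.0.138", 16181))
```

If both services run on the same machine as OpenCompass, the host would typically be localhost and only the ports differ.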

Before running, set:

export OPENCOMPASS_TOOLMETA_PATH=data/gta_dataset_v2/toolmeta.json
export OPENAI_API_KEY=your_openai_key
  3. Infer and evaluate with OpenCompass.
# infer only
python run.py configs/eval_gta_bench_v2.py --max-num-workers 32 --debug --mode infer
# evaluate only
python run.py configs/eval_gta_bench_v2.py --max-num-workers 32 --debug --reuse [time_stamp_of_prediction_file] --mode eval
# infer and evaluate
python run.py configs/eval_gta_bench_v2.py -p llmit -q auto --max-num-workers 32 --debug
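
For scripted runs, the three invocations above can be assembled from one function. This is a sketch under assumptions: build_run_command is a hypothetical helper, it uses only the flags shown above, it leaves out -p and -q on the assumption that they are specific to the authors' cluster, and the timestamp in the usage lines is a made-up placeholder.

```python
import shlex

CONFIG = "configs/eval_gta_bench_v2.py"


def build_run_command(mode="all", reuse=None, max_num_workers=32, debug=True):
    """Return the run.py argument list for infer, eval, or a full run ("all")."""
    if mode not in ("infer", "eval", "all"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "eval" and reuse is None:
        # Mirrors the eval-only command above, which reuses an earlier prediction run.
        raise ValueError("eval mode needs the timestamp of a prediction run")
    cmd = ["python", "run.py", CONFIG, "--max-num-workers", str(max_num_workers)]
    if debug:
        cmd.append("--debug")
    if reuse is not None:
        cmd += ["--reuse", reuse]
    if mode != "all":
        cmd += ["--mode", mode]
    return cmd


print(shlex.join(build_run_command("infer")))
# Hypothetical timestamp, for illustration only.
print(shlex.join(build_run_command("eval", reuse="20260420_120000")))
```

The returned list can be passed to subprocess.run from inside the opencompass directory.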

📝 Citation

If you use GTA in your research, please cite the following papers:

@article{wang2024gta,
  title={GTA: a benchmark for general tool agents},
  author={Wang, Jize and Ma, Zerun and Li, Yining and Zhang, Songyang and Chen, Cailian and Chen, Kai and Le, Xinyi},
  journal={Advances in Neural Information Processing Systems},
  pages={75749--75790},
  year={2024}
}
@article{wang2026gta2,
  title={GTA-2: benchmarking general tool agents from atomic tool-use to open-ended workflows},
  author={Wang, Jize and Liu, Xuanxuan and Li, Yining and Zhang, Songyang and Wang, Yijun and Shan, Zifei and Le, Xinyi and Chen, Cailian and Guan, Xinping and Tao, Dacheng},
  journal={arXiv:2604.15715},
  year={2026}
}

Release History

Version  Changes                                                       Urgency  Date
v0.2.0   Release GTA-2 and its newly introduced GTA-Workflow dataset.  High     4/20/2026
v0.1.0   The GTA Dataset.                                              Low      6/25/2024
