[SIGIR 2026] Learning to Retrieve from Agent Trajectories
Retrieval is no longer optimized only for human searchers. As large language model agents increasingly issue queries, inspect snippets, browse documents, and reason over retrieved evidence, the target of retrieval training has shifted from human interaction to agent interaction. LRAT studies this paradigm shift and learns retrievers directly from multi-step agent trajectories.
Training data for agent-native search should match how search agents actually search, browse, and consume evidence.
LRAT studies how to train retrievers from the intermediate behaviors of strong search agents rather than from only final answers. The repository focuses on a practical pipeline for:
collecting long-horizon search trajectories from agentic systems,
converting trajectories into retrieval supervision,
training retrievers on the resulting samples, and
evaluating both retrieval quality and end-to-end task success.
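As an illustration of the trajectory-to-supervision step, the sketch below shows one plausible way to turn a saved trajectory into (query, pos, neg) samples. The schema (`steps`, `results`, `cited_doc_ids`) is a hypothetical assumption for this example; the repository's actual conversion logic lives in src/data_builder.py.

```python
import json

def trajectory_to_pairs(trajectory):
    """Turn one agent trajectory into (query, pos, neg) samples.

    Hypothetical schema: each step records the query the agent issued and
    the documents it retrieved; `cited_doc_ids` lists the documents the
    agent actually used as evidence in its final answer.
    """
    pairs = []
    cited = set(trajectory.get("cited_doc_ids", []))
    for step in trajectory["steps"]:
        positives = [d["text"] for d in step["results"] if d["doc_id"] in cited]
        negatives = [d["text"] for d in step["results"] if d["doc_id"] not in cited]
        # Keep only steps that yield both a positive and at least one hard negative.
        if positives and negatives:
            pairs.append({"query": step["query"], "pos": positives, "neg": negatives})
    return pairs

# Toy trajectory: two retrieved documents, one of which is later cited.
traj = {
    "cited_doc_ids": ["d1"],
    "steps": [
        {"query": "who proposed LRAT",
         "results": [{"doc_id": "d1", "text": "relevant evidence"},
                     {"doc_id": "d2", "text": "unrelated passage"}]},
    ],
}
print(json.dumps(trajectory_to_pairs(traj)))
```

Documents retrieved but never cited serve as hard negatives here, which is one common way to exploit the agent's own browsing decisions as implicit relevance feedback.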
Trajectory-first retrieval learning: build retriever supervision from agent search and browse traces instead of relying only on static relevance labels.
Agent-friendly data collection: run local or API-based research agents and save each query as structured trajectory JSON.
Training data construction with an LLM judge: turn trajectories into (query, pos, neg, ...) training pairs with reasoning-aware annotations.
Benchmark-oriented evaluation: evaluate outputs on BrowseComp-Plus and InfoSeek-Eval with a local vLLM judge.
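A minimal sketch of querying a local vLLM judge over its OpenAI-compatible HTTP endpoint, using only the standard library. The URL, model name, and prompt wording are illustrative assumptions; the repository's actual judging prompts and scripts live in scripts_evaluation/.

```python
import json
import urllib.request

# Default port for a vLLM OpenAI-compatible server (assumed; adjust to your setup).
JUDGE_URL = "http://localhost:8000/v1/chat/completions"

def build_judge_messages(question, reference, prediction):
    """Compose a minimal correctness-judging prompt (wording is illustrative)."""
    user = (f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Predicted answer: {prediction}\n"
            "Reply with exactly 'correct' or 'incorrect'.")
    return [{"role": "user", "content": user}]

def judge(question, reference, prediction, model="judge-model"):
    """Return True if the local judge deems the prediction correct."""
    payload = json.dumps({
        "model": model,  # placeholder name; use whatever model vLLM is serving
        "messages": build_judge_messages(question, reference, prediction),
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(
        JUDGE_URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    verdict = body["choices"][0]["message"]["content"].strip().lower()
    return verdict == "correct"
```

Pinning temperature to 0 keeps the judge deterministic, which matters when comparing retriever variants on the same benchmark questions.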
Repository Structure
src/
Core utilities for index construction and trajectory-to-training-data conversion
search_agent/
Agent clients for Tongyi DeepResearch, WebExplorer, AgentCPM, OpenAI-compatible APIs, and related prompts/utilities
searcher/
Search backends and local retrieval interfaces
docs/
Step-by-step documentation for indexing, trajectory construction, training data construction, and evaluation
datasets/
Benchmark files used in evaluation
topics-qrels/
Query and qrel files for retrieval experiments
trajectory/
Example trajectory artifacts
FlagEmbedding/
Local copy of FlagEmbedding used for retriever training
tevatron/
Local copy of Tevatron utilities used in dense retrieval workflows
scripts_evaluation/
Evaluation scripts for end-to-end judging
Vendored Dependencies
FlagEmbedding/ is a vendored, locally modified copy of the upstream FlagEmbedding project. It layers this repository's modifications on top of the upstream code and earlier external changes.
tevatron/ is a vendored upstream dependency used to support dense retrieval utilities and encoding workflows.
If you do not want to build training data from scratch, you can use the released LRAT-Train dataset directly. If you prefer to control filtering or supervision design yourself, you can instead start from saved agent trajectories and rerun pair extraction with src/data_builder.py.
The JSONL generated by src/data_builder.py can be plugged into your existing training setup without restructuring the rest of this repository.
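Before feeding the JSONL to a trainer, it can help to validate its shape. The sketch below assumes the common FlagEmbedding-style record layout, with a string `query` and list-valued `pos`/`neg` fields; confirm the expected format against your own training code before relying on it.

```python
import json

# Assumed FlagEmbedding-style record shape: {"query": str, "pos": [...], "neg": [...]}.
REQUIRED = {"query": str, "pos": list, "neg": list}

def validate_jsonl(path):
    """Return the 1-based line numbers in `path` that are not valid records."""
    bad = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, 1):
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            # A field is invalid if missing or of the wrong type.
            if any(not isinstance(rec.get(key), typ) for key, typ in REQUIRED.items()):
                bad.append(lineno)
    return bad
```

Running the validator before training catches truncated lines or schema drift early, instead of surfacing as an opaque dataloader crash mid-run.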
This repository is released under the Apache License 2.0. See LICENSE.
Vendored components keep their own upstream licenses, especially:
FlagEmbedding/ under its upstream MIT license
tevatron/ under Apache License 2.0
Citation
If you find this repository useful, please cite our SIGIR 2026 paper below. The latest public version is available on arXiv.
@inproceedings{zhou2026lrat,
  title={Learning to Retrieve from Agent Trajectories},
  author={Zhou, Yuqi and Dai, Sunhao and Qu, Changle and Pang, Liang and Xu, Jun and Wen, Ji-Rong},
  booktitle={Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2026}
}