A curated list of awesome repository-level code generation research papers and resources. If you want to contribute to this list (please do), feel free to send a pull request. If you have any further questions, feel free to contact Yuling Shi or Xiaodong Gu (SJTU).
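New entries generally follow the format of the existing items. The sketch below is only illustrative; the title, date, and URLs are placeholders, and the link labels should match those already used in the list:

```markdown
- Paper Title: Subtitle If Any [YYYY-MM-arXiv] [paper](https://arxiv.org/abs/XXXX.XXXXX) [repo](https://github.com/user/project)
  - Optional one-line note on the key idea or headline result.
```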
- Contents
- Repo-Level Issue Resolution
- Repo-Level Code Completion
- Repo-Level Code Translation
- Repo-Level Unit Test Generation
- Repo-Level Code QA
- Repo-Level Issue Task Synthesis
- Datasets and Benchmarks
- GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair [2026-04-arXiv] [paper]
- Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals [2026-04-arXiv] [paper]
- ABTest: Behavior-Driven Testing for AI Coding Agents [2025-04-arXiv] [paper]
  - First behavior-driven fuzzing framework for AI coding agents, 40.8% detection precision.
- Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures [2025-04-arXiv] [paper]
  - Taxonomy of 13 open-source coding agents across 12 dimensions.
- Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints [2025-04-arXiv] [paper]
  - Agent-CoEvo: Coevolution framework for code and tests.
- DebugHarness: Emulating Human Dynamic Debugging for Autonomous Program Repair [2025-04-arXiv] [paper]
  - Human-like interactive debugging for program repair, 90% fix rate on SEC-bench.
- Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs [2025-04-arXiv] [paper]
  - STITCH: Higher quality with fewer training trajectories.
- From SWE-ZERO to SWE-HERO: Two-Stage SFT for SWE-Bench [2026-04-arXiv] [paper]
- RepoRepair: Leveraging Code Documentation for Repository-Level Automated Program Repair [2026-03-arXiv] [paper]
- SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution [2026-03-arXiv] [paper]
- Compressing Code Context for LLM-based Issue Resolution [2026-03-arXiv] [paper]
- Monte Carlo Tree Search for Execution-Guided Program Repair with Large Language Models [2026-02-arXiv] [paper]
- SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training [2026-02-arXiv] [paper]
- SWE-World: Building Software Engineering Agents in Docker-Free Environments [2026-02-arXiv] [paper]
- SVRepair: Structured Visual Reasoning for Automated Program Repair [2026-02-arXiv] [paper]
- The Limits of Long-Context Reasoning in Automated Bug Fixing [2026-02-arXiv] [paper]
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents [2026-02-arXiv] [paper]
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents [2026-02-arXiv] [paper]
- Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents [2026-02-arXiv] [paper]
- What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair [2026-02-arXiv] [paper]
- EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering Agents [2026-01-arXiv] [paper]
- Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey [2026-01-arXiv] [paper]
- RGFL: Reasoning Guided Fault Localization for Automated Program Repair Using Large Language Models [2026-01-arXiv] [paper]
- SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents [2026-01-arXiv] [paper]
- SWE-RM: Execution-free Feedback For Software Engineering Agents [2025-12-arXiv] [paper]
- BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization [2025-12-arXiv] [paper]
- LIVE-SWE-AGENT: Can Software Engineering Agents Self-Evolve on the Fly? [2025-11-arXiv] [paper]
- Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories [2025-10-arXiv] [paper]
- BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [2025-10-arXiv] [paper]
- Where LLM Agents Fail and How They can Learn From Failures [2025-09-arXiv] [paper] [repo]
- SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints [2025-09-arXiv] [paper]
- Diffusion is a code repair operator and generator [2025-08-arXiv] [paper]
- SWE-Exp: Experience-Driven Software Issue Resolution [2025-07-arXiv] [paper] [repo]
- SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution [2025-07-arXiv] [paper] [repo]
- The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [2025-06-arXiv] [paper]
- Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards [2025-06-arXiv] [paper]
- EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair [2025-06-arXiv] [paper]
- Coding Agents with Multimodal Browsing are Generalist Problem Solvers [2025-06-arXiv] [paper] [repo]
- CoRet: Improved Retriever for Code Editing [2025-05-arXiv] [paper]
- Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [2025-05-arXiv] [paper] [repo]
- SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development [2025-05-arXiv] [paper] [repo]
- Putting It All into Context: Simplifying Agents with LCLMs [2025-05-arXiv] [paper]
- SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning [2025-05-arXiv] [blog] [repo]
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [2025-03-arXiv] [paper] [repo]
- Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs [2025-03-arXiv] [paper]
- CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching [2025-03-arXiv] [paper]
- SEAlign: Alignment Training for Software Engineering Agent [2025-03-arXiv] [paper]
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [2025-03-arXiv] [paper] [repo]
- LocAgent: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [paper] [repo]
- SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning [2025-02-arXiv] [paper]
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [2025-02-arXiv] [paper] [repo]
- SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [2025-01-arXiv] [paper] [repo]
- CodeMonkeys: Scaling Test-Time Compute for Software Engineering [2025-01-arXiv] [paper] [repo]
- Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [paper] [repo]
- CODEV: Issue Resolving with Visual Data [2024-12-arXiv] [paper] [repo]
- LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [2024-11-arXiv] [paper]
- Globant Code Fixer Agent Whitepaper [2024-11] [paper]
- MarsCode Agent: AI-native Automated Bug Fixing [2024-11-arXiv] [paper]
- Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement [2024-11-arXiv] [paper] [repo]
- SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement [2024-10-arXiv] [paper] [repo]
- SpecRover: Code Intent Extraction via LLMs [2024-08-arXiv] [paper]
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents [2024-07-arXiv] [paper] [repo]
- AGENTLESS: Demystifying LLM-based Software Engineering Agents [2024-07-arXiv] [paper]
- RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph [2024-07-arXiv] [paper] [repo]
- CodeR: Issue Resolving with Multi-Agent and Task Graphs [2024-06-arXiv] [paper] [repo]
- Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration [2024-06-arXiv] [paper]
- AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions [2025-FSE] [paper]
- AutoCodeRover: Autonomous Program Improvement [2024-09-ISSTA] [paper] [repo]
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [2024-NeurIPS] [paper] [repo]
- Persistent Cross-Attempt State Optimization for Repository-Level Code Generation [2025-04-arXiv] [paper]
  - LiveCoder: 22.94% improvement, 53.63% cost reduction.
- Toward Executable Repository-Level Code Generation via Environment Alignment [2025-04-arXiv] [paper]
  - EnvGraph: Environment alignment for executable code generation.
- Executing as You Generate: Hiding Execution Latency in Code Completion [2026-04-arXiv] [paper]
- CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation [2025-04-arXiv] [paper]
- RTLRepoCoder: Repository-Level RTL Code Completion through the Combination of Fine-Tuning and Retrieval Augmentation [2025-04-arXiv] [paper]
- What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond [2025-03-arXiv] [paper]
- Improving FIM Code Completions via Context & Curriculum Based Learning [2024-12-arXiv] [paper]
- ContextModule: Improving Code Completion via Repository-level Contextual Information [2024-12-arXiv] [paper]
- RepoGenReflex: Enhancing Repository-Level Code Completion with Verbal Reinforcement and Retrieval-Augmented Generation [2024-09-arXiv] [paper]
- RAMBO: Enhancing RAG-based Repository-Level Method Body Completion [2024-09-arXiv] [paper] [repo]
- RLCoder: Reinforcement Learning for Repository-Level Code Completion [2024-07-arXiv] [paper] [repo]
- STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis [2024-06-arXiv] [paper]
- GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model [2024-06-arXiv] [paper]
- Enhancing Repository-Level Code Generation with Integrated Contextual Information [2024-06-arXiv] [paper]
- R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models [2024-06-arXiv] [paper]
- Natural Language to Class-level Code Generation by Iterative Tool-augmented Reasoning over Repository [2024-05-arXiv] [paper] [repo]
- Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback [2024-03-arXiv] [paper] [repo]
- Repoformer: Selective Retrieval for Repository-Level Code Completion [2024-03-arXiv] [paper] [repo]
- RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion [2024-03-arXiv] [paper] [repo]
- RepoFusion: Training Code Models to Understand Your Repository [2023-06-arXiv] [paper] [repo]
- Enhancing Project-Specific Code Completion by Inferring Internal API Information [2025-07-TSE] [paper] [repo]
- CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases [2025-04-NAACL] [paper]
- Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs [2025-04-AAAI] [paper] [repo]
- REPOFILTER: Adaptive Retrieval Context Trimming for Repository-Level Code Completion [2025-04-OpenReview] [paper]
- A^3-CodGen: A Repository-Level Code Generation Framework for Code Reuse With Local-Aware, Global-Aware, and Third-Party-Library-Aware [2024-12-TSE] [paper]
- RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening [2024-07-Internetware] [paper]
- CodePlan: Repository-Level Coding using LLMs and Planning [2024-07-FSE] [paper] [repo]
- DraCo: Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion [2024-05-ACL] [paper] [repo]
- RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [2023-10-EMNLP] [paper] [repo]
- Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context [2023-09-NeurIPS] [paper] [repo]
- Repository-Level Prompt Generation for Large Language Models of Code [2023-06-ICML] [paper] [repo]
- Fully Autonomous Programming with Large Language Models [2023-06-GECCO] [paper] [repo]
- EVOC2RUST: A Skeleton-guided Framework for Project-Level C-to-Rust Translation [2025-08-arXiv] [paper]
- A Systematic Literature Review on Neural Code Translation [2025-05-arXiv] [paper]
- Enhancing LLM-based Code Translation in Repository Context via Triple Knowledge-Augmented [2025-03-arXiv] [paper]
- C2SaferRust: Transforming C Projects into Safer Rust with NeuroSymbolic Techniques [2025-01-arXiv] [paper] [repo]
- Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code [2024-04-ICSE] [paper] [repo]
- Scalable, Validated Code Translation of Entire Projects using Large Language Models [2025-06-PLDI] [paper]
- Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis [2024-12-arXiv] [paper] [website]
- RustRepoTrans: Repository-level Code Translation Benchmark Targeting Rust [2024-11-arXiv] [paper] [repo]
- Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents [2026-02-arXiv] [paper]
- Execution-Feedback Driven Test Generation from SWE Issues [2025-08-arXiv] [paper]
- AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests [2025-07-arXiv] [paper]
- Mystique: Automated Vulnerability Patch Porting with Semantic and Syntactic-Enhanced LLM [2025-06-arXiv] [paper]
- Issue2Test: Generating Reproducing Test Cases from Issue Reports [2025-03-arXiv] [paper]
- Agentic Bug Reproduction for Effective Automated Program Repair at Google [2025-02-arXiv] [paper]
- LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [2024-11-arXiv] [paper]
- RepoChat Arena [2025-Blog] [repo]
- RepoChat: An LLM-Powered Chatbot for GitHub Repository Question-Answering [MSR-2025] [repo]
- FastCode: Fast and Cost-Efficient Code Understanding and Reasoning [2026-03-arXiv] [paper]
- SWE-QA: Can Language Models Answer Repository-level Code Questions? [2025-09-arXiv] [paper] [repo]
- Decompositional Reasoning for Graph Retrieval with Large Language Models [2025-06-arXiv] [paper]
- LongCodeBench: Evaluating Coding LLMs at 1M Context Windows [2025-05-arXiv] [paper]
- LocAgent: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [paper] [repo]
- CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering [2025-01-arXiv] [paper]
- CodeQueries: A Dataset of Semantic Queries over Code [2022-09-arXiv] [paper]
- SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources [2025-04-arXiv] [paper]
  - Automated skill library construction from heterogeneous scientific resources.
- SWE-Mirror: Scaling Issue-Resolving Datasets by Mirroring Issues Across Repositories [2025-09-arXiv] [paper]
- SWE-bench Goes Live! [2025-05-arXiv] [paper] [repo]
- R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents [2025-04-arXiv] [paper] [repo]
- Scaling Data for Software Engineering Agents [2025-04-arXiv] [paper] [repo]
- Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [2025-04-arXiv] [paper] [repo]
- Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [paper] [repo]
- SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios [2025-04-arXiv] [paper]
  - Long-horizon software evolution benchmark.
- Are Benchmark Tests Strong Enough? STING Framework for Enhanced SWE-bench Testing [2026-04-arXiv] [paper]
- SWE-CI: Evaluating Agent Capabilities in Maintaining CI Pipelines [2026-04-arXiv] [paper]
- AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators [2025-08-arXiv] [paper] [repo]
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [2024-06-arXiv] [paper] [website]
- OmniCode: A Benchmark for Evaluating Software Engineering Agents [2026-02-arXiv] [paper]
- SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories [2025-12-arXiv] [paper]
- Multi-Docker-Eval: A "Shovel of the Gold Rush" Benchmark on Automatic Environment Building for Software Engineering? [2025-12-arXiv] [paper]
- CodeClash: Benchmarking Goal-Oriented Software Engineering [2025-11-arXiv] [paper] [repo]
- SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads? [2025-11-arXiv] [paper]
- SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [2025-11-arXiv] [paper]
- SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks [2025-11-arXiv] [paper]
- ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases [2025-10-arXiv] [paper]
- SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement [2025-10-arXiv] [paper]
- Vibe Checker: Aligning Code Evaluation with Human Preference [2025-10-arXiv] [paper]
- SWE-QA: Can Language Models Answer Repository-level Code Questions? [2025-09-arXiv] [paper] [repo]
- RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback [2025-09-arXiv] [paper]
- MULocBench: A Benchmark for Localizing Code and Non-Code Issues in Software Projects [2025-09-arXiv] [paper] [website]
- SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios [2025-09-arXiv] [paper]
- SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? [2025-09] [paper] [repo]
- AgentIssue-Bench: Can Agents Fix Agent Issues? [2025-08-arXiv] [paper] [repo]
- LiveRepoReflection: Turning the Tide: Repository-based Code Reflection [2025-07-arXiv] [paper] [repo]
- SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? [2025-07-arXiv] [paper] [repo]
- CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance [2025-07-arXiv] [paper]
- ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code [2025-06-arXiv] [paper] [repo]
- SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks [2025-06-arXiv] [paper] [repo]
- OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution [2025-05-arXiv] [paper] [repo]
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [2025-05-arXiv] [paper] [website]
- SWE-bench-Live: A Live Benchmark for Repository-Level Issue Resolution [2025-05-arXiv] [paper] [repo]
- SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development [2025-05-arXiv] [paper] [repo]
- CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation [2025-04-arXiv] [paper] [repo]
- SWE-Smith: Scaling Data for Software Engineering Agents [2025-04-arXiv] [paper] [repo]
- SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [2025-04-arXiv] [paper] [repo]
- SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents [2025-04-arXiv] [paper] [repo]
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving [2025-04-arXiv] [paper] [repo]
- Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study [2025-03-arXiv] [paper]
- Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution [2025-03-arXiv] [paper]
- SWEE-Bench & SWA-Bench: Automated Benchmark Generation for Repository-Level Coding Tasks [2025-03-arXiv] [paper]
- REPOST-TRAIN: Scalable Repository-Level Coding Environment Construction with Sandbox Testing [2025-03-arXiv] [paper] [repo]
- Loc-Bench: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [paper] [repo]
- ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments [2025-02-arXiv] [paper] [repo]
- SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? [2025-02-arXiv] [paper] [repo]
- SolEval: Benchmarking Large Language Models for Repository-level Solidity Code Generation [2025-02-arXiv] [paper] [repo]
- Evaluating Agent-based Program Repair at Google [2025-01-arXiv] [paper]
- SWE-Gym: Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [paper] [repo]
- RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation [2024-12-arXiv] [paper] [repo]
- Visual SWE-bench: Issue Resolving with Visual Data [2024-12-arXiv] [paper] [repo]
- ExecRepoBench: Multi-level Executable Code Completion Evaluation [2024-12-arXiv] [paper] [site]
- REPOCOD: Can Language Models Replace Programmers? REPOCOD Says 'Not Yet' [2024-10-arXiv] [paper] [repo]
- M2RC-Eval: Massively Multilingual Repository-level Code Completion Evaluation [2024-10-arXiv] [paper] [repo]
- SWE-bench+: Enhanced Coding Benchmark for LLMs [2024-10-arXiv] [paper]
- SWE-bench Multimodal: Multimodal Software Engineering Benchmark [2024-10-arXiv] [paper] [site]
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [2024-10-arXiv] [paper] [repo]
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [2024-06-arXiv] [paper] [repo]
- R2C2-Bench: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models [2024-06-arXiv] [paper]
- RepoClassBench: Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [2024-05-arXiv] [paper] [repo]
- Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions [ICLR-2025 Oral] [paper] [repo]
- UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench [ACL-2025] [paper]
- SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner [ICML-2025] [paper] [repo]
- FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation [2025-05-ACL] [paper] [repo]
- OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution [2025-05-ISSTA] [paper]
- LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation [2025-04-NAACL] [paper] [website]
- ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation [2025-ACL-Findings] [paper] [repo]
- HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation [2025-ICSE] [paper] [repo]
- RepoExec: On the Impacts of Contexts on Repository-Level Code Generation [2025-NAACL] [paper] [repo]
- DevEval: Evaluating Code Generation in Practical Software Projects [2024-ACL-Findings] [paper] [repo]
- CodeAgentBench: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges [2024-ACL] [paper]
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems [2024-ICLR] [paper] [repo]
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [2024-ICLR] [paper] [repo]
- CrossCodeLongEval: Repoformer: Selective Retrieval for Repository-Level Code Completion [2024-ICML] [paper] [repo]
- R2E-Eval: Turning Any GitHub Repository into a Programming Agent Test Environment [2024-ICML] [paper] [repo]
- RepoEval: Repository-Level Code Completion Through Iterative Retrieval and Generation [2023-EMNLP] [paper] [repo]
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion [2023-NeurIPS] [paper] [site]
- Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation [2025-01-arXiv] [paper] [repo]
