A curated list of awesome repository-level code generation research papers and resources. If you want to contribute to this list (please do), feel free to send a pull request. If you have any further questions, feel free to contact Yuling Shi or Xiaodong Gu (SJTU).
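New entries generally follow the format of the existing items. The sketch below is only illustrative; the title, date, and URLs are placeholders, and the link labels should match those already used in the list:

```markdown
- Paper Title: Subtitle If Any [YYYY-MM-arXiv] [paper](https://arxiv.org/abs/XXXX.XXXXX) [repo](https://github.com/user/project)
  - Optional one-line note on the key idea or headline result.
```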
- Contents
- Repo-Level Issue Resolution
- Repo-Level Code Completion
- Repo-Level Code Translation
- Repo-Level Unit Test Generation
- Repo-Level Code QA
- Repo-Level Issue Task Synthesis
- Datasets and Benchmarks
- GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair [2026-04-arXiv] [paper]
- Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals [2026-04-arXiv] [paper]
- ABTest: Behavior-Driven Testing for AI Coding Agents [2025-04-arXiv] [paper]
  - First behavior-driven fuzzing framework for AI coding agents, 40.8% detection precision.
- Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures [2025-04-arXiv] [paper]
  - Taxonomy of 13 open-source coding agents across 12 dimensions.
- Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints [2025-04-arXiv] [paper]
  - Agent-CoEvo: Coevolution framework for code and tests.
- DebugHarness: Emulating Human Dynamic Debugging for Autonomous Program Repair [2025-04-arXiv] [paper]
  - Human-like interactive debugging for program repair, 90% fix rate on SEC-bench.
- Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs [2025-04-arXiv] [paper]
  - STITCH: Higher quality with fewer training trajectories.
- From SWE-ZERO to SWE-HERO: Two-Stage SFT for SWE-Bench [2026-04-arXiv] [paper]
- RepoRepair: Leveraging Code Documentation for Repository-Level Automated Program Repair [2026-03-arXiv] [paper]
- SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution [2026-03-arXiv] [paper]
- Compressing Code Context for LLM-based Issue Resolution [2026-03-arXiv] [paper]
- Monte Carlo Tree Search for Execution-Guided Program Repair with Large Language Models [2026-02-arXiv] [paper]
- SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training [2026-02-arXiv] [paper]
- SWE-World: Building Software Engineering Agents in Docker-Free Environments [2026-02-arXiv] [paper]
- SVRepair: Structured Visual Reasoning for Automated Program Repair [2026-02-arXiv] [paper]
- The Limits of Long-Context Reasoning in Automated Bug Fixing [2026-02-arXiv] [paper]
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents [2026-02-arXiv] [paper]
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents [2026-02-arXiv] [paper]
- Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents [2026-02-arXiv] [paper]
- What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair [2026-02-arXiv] [paper]
- EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering Agents [2026-01-arXiv] [paper]
- Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey [2026-01-arXiv] [paper]
- RGFL: Reasoning Guided Fault Localization for Automated Program Repair Using Large Language Models [2026-01-arXiv] [paper]
- SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents [2026-01-arXiv] [paper]
- SWE-RM: Execution-free Feedback For Software Engineering Agents [2025-12-arXiv] [paper]
- BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization [2025-12-arXiv] [paper]
- LIVE-SWE-AGENT: Can Software Engineering Agents Self-Evolve on the Fly? [2025-11-arXiv] [paper]
- Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories [2025-10-arXiv] [paper]
- BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [2025-10-arXiv] [paper]
- Where LLM Agents Fail and How They can Learn From Failures [2025-09-arXiv] [paper] [repo]
- SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints [2025-09-arXiv] [paper]
- Diffusion is a code repair operator and generator [2025-08-arXiv] [paper]
- SWE-Exp: Experience-Driven Software Issue Resolution [2025-07-arXiv] [paper] [repo]
- SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution [2025-07-arXiv] [paper] [repo]
- The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [2025-06-arXiv] [paper]
- Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards [2025-06-arXiv] [paper]
- EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair [2025-06-arXiv] [paper]
- Coding Agents with Multimodal Browsing are Generalist Problem Solvers [2025-06-arXiv] [paper] [repo]
- CoRet: Improved Retriever for Code Editing [2025-05-arXiv] [paper]
- Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [2025-05-arXiv] [paper] [repo]
- SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development [2025-05-arXiv] [paper] [repo]
- Putting It All into Context: Simplifying Agents with LCLMs [2025-05-arXiv] [paper]
- SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning [2025-05-arXiv] [blog] [repo]
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [2025-03-arXiv] [paper] [repo]
- Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs [2025-03-arXiv] [paper]
- CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching [2025-03-arXiv] [paper]
- SEAlign: Alignment Training for Software Engineering Agent [2025-03-arXiv] [paper]
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [2025-03-arXiv] [paper] [repo]
- LocAgent: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [paper] [repo]
- SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning [2025-02-arXiv] [paper]
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution [2025-02-arXiv] [paper] [repo]
- SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [2025-01-arXiv] [paper] [repo]
- CodeMonkeys: Scaling Test-Time Compute for Software Engineering [2025-01-arXiv] [paper] [repo]
- Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [paper] [repo]
- CODEV: Issue Resolving with Visual Data [2024-12-arXiv] [paper] [repo]
- LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [2024-11-arXiv] [paper]
- Globant Code Fixer Agent Whitepaper [2024-11] [paper]
- MarsCode Agent: AI-native Automated Bug Fixing [2024-11-arXiv] [paper]
- Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement [2024-11-arXiv] [paper] [repo]
- SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement [2024-10-arXiv] [paper] [repo]
- SpecRover: Code Intent Extraction via LLMs [2024-08-arXiv] [paper]
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents [2024-07-arXiv] [paper] [repo]
- AGENTLESS: Demystifying LLM-based Software Engineering Agents [2024-07-arXiv] [paper]
- RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph [2024-07-arXiv] [paper] [repo]
- CodeR: Issue Resolving with Multi-Agent and Task Graphs [2024-06-arXiv] [paper] [repo]
- Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration [2024-06-arXiv] [paper]
- AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions [2025-FSE] [paper]
- AutoCodeRover: Autonomous Program Improvement [2024-09-ISSTA] [paper] [repo]
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [2024-NeurIPS] [paper] [repo]
- Persistent Cross-Attempt State Optimization for Repository-Level Code Generation [2025-04-arXiv] [paper]
  - LiveCoder: 22.94% improvement, 53.63% cost reduction.
- Toward Executable Repository-Level Code Generation via Environment Alignment [2025-04-arXiv] [paper]
  - EnvGraph: Environment alignment for executable code generation.
- Executing as You Generate: Hiding Execution Latency in Code Completion [2026-04-arXiv] [paper]
- CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation [2025-04-arXiv] [paper]
- RTLRepoCoder: Repository-Level RTL Code Completion through the Combination of Fine-Tuning and Retrieval Augmentation [2025-04-arXiv] [paper]
- What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond [2025-03-arXiv] [paper]
- Improving FIM Code Completions via Context & Curriculum Based Learning [2024-12-arXiv] [paper]
- ContextModule: Improving Code Completion via Repository-level Contextual Information [2024-12-arXiv] [paper]
- RepoGenReflex: Enhancing Repository-Level Code Completion with Verbal Reinforcement and Retrieval-Augmented Generation [2024-09-arXiv] [paper]
- RAMBO: Enhancing RAG-based Repository-Level Method Body Completion [2024-09-arXiv] [paper] [repo]
- RLCoder: Reinforcement Learning for Repository-Level Code Completion [2024-07-arXiv] [paper] [repo]
- STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis [2024-06-arXiv] [paper]
- GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model [2024-06-arXiv] [paper]
- Enhancing Repository-Level Code Generation with Integrated Contextual Information [2024-06-arXiv] [paper]
- R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models [2024-06-arXiv] [paper]
- Natural Language to Class-level Code Generation by Iterative Tool-augmented Reasoning over Repository [2024-05-arXiv] [paper] [repo]
- Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback [2024-03-arXiv] [paper] [repo]
- Repoformer: Selective Retrieval for Repository-Level Code Completion [2024-03-arXiv] [paper] [repo]
- RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion [2024-03-arXiv] [paper] [repo]
- RepoFusion: Training Code Models to Understand Your Repository [2023-06-arXiv] [paper] [repo]
- Enhancing Project-Specific Code Completion by Inferring Internal API Information [2025-07-TSE] [paper] [repo]
- CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases [2025-04-NAACL] [paper]
- Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs [2025-04-AAAI] [paper] [repo]
- REPOFILTER: Adaptive Retrieval Context Trimming for Repository-Level Code Completion [2025-04-OpenReview] [paper]
- A^3-CodGen: A Repository-Level Code Generation Framework for Code Reuse With Local-Aware, Global-Aware, and Third-Party-Library-Aware [2024-12-TSE] [paper]
- RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening [2024-07-Internetware] [paper]
- CodePlan: Repository-Level Coding using LLMs and Planning [2024-07-FSE] [paper] [repo]
- DraCo: Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion [2024-05-ACL] [paper] [repo]
- RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [2023-10-EMNLP] [paper] [repo]
- Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context [2023-09-NeurIPS] [paper] [repo]
- Repository-Level Prompt Generation for Large Language Models of Code [2023-06-ICML] [paper] [repo]
- Fully Autonomous Programming with Large Language Models [2023-06-GECCO] [paper] [repo]
- EVOC2RUST: A Skeleton-guided Framework for Project-Level C-to-Rust Translation [2025-08-arXiv] [paper]
- A Systematic Literature Review on Neural Code Translation [2025-05-arXiv] [paper]
- Enhancing LLM-based Code Translation in Repository Context via Triple Knowledge-Augmented [2025-03-arXiv] [paper]
- C2SaferRust: Transforming C Projects into Safer Rust with NeuroSymbolic Techniques [2025-01-arXiv] [paper] [repo]
- Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code [2024-04-ICSE] [paper] [repo]
- Scalable, Validated Code Translation of Entire Projects using Large Language Models [2025-06-PLDI] [paper]
- Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis [2024-12-arXiv] [paper] [website]
- RustRepoTrans: Repository-level Code Translation Benchmark Targeting Rust [2024-11-arXiv] [paper] [repo]
- Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents [2026-02-arXiv] [paper]
- Execution-Feedback Driven Test Generation from SWE Issues [2025-08-arXiv] [paper]
- AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests [2025-07-arXiv] [paper]
- Mystique: Automated Vulnerability Patch Porting with Semantic and Syntactic-Enhanced LLM [2025-06-arXiv] [paper]
- Issue2Test: Generating Reproducing Test Cases from Issue Reports [2025-03-arXiv] [paper]
- Agentic Bug Reproduction for Effective Automated Program Repair at Google [2025-02-arXiv] [paper]
- LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues [2024-11-arXiv] [paper]
- RepoChat Arena [2025-Blog] [repo]
- RepoChat: An LLM-Powered Chatbot for GitHub Repository Question-Answering [MSR-2025] [repo]
- FastCode: Fast and Cost-Efficient Code Understanding and Reasoning [2026-03-arXiv] [paper]
- SWE-QA: Can Language Models Answer Repository-level Code Questions? [2025-09-arXiv] [paper] [repo]
- Decompositional Reasoning for Graph Retrieval with Large Language Models [2025-06-arXiv] [paper]
- LongCodeBench: Evaluating Coding LLMs at 1M Context Windows [2025-05-arXiv] [paper]
- LocAgent: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [paper] [repo]
- CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering [2025-01-arXiv] [paper]
- CodeQueries: A Dataset of Semantic Queries over Code [2022-09-arXiv] [paper]
- SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources [2025-04-arXiv] [paper]
  - Automated skill library construction from heterogeneous scientific resources.
- SWE-Mirror: Scaling Issue-Resolving Datasets by Mirroring Issues Across Repositories [2025-09-arXiv] [paper]
- SWE-bench Goes Live! [2025-05-arXiv] [paper] [repo]
- R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents [2025-04-arXiv] [paper] [repo]
- Scaling Data for Software Engineering Agents [2025-04-arXiv] [paper] [repo]
- Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [2025-04-arXiv] [paper] [repo]
- Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [paper] [repo]
- SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios [2025-04-arXiv] [paper]
  - Long-horizon software evolution benchmark.
- Are Benchmark Tests Strong Enough? STING Framework for Enhanced SWE-bench Testing [2026-04-arXiv] [paper]
- SWE-CI: Evaluating Agent Capabilities in Maintaining CI Pipelines [2026-04-arXiv] [paper]
- AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators [2025-08-arXiv] [paper] [repo]
- SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [2024-06-arXiv] [paper] [website]
- OmniCode: A Benchmark for Evaluating Software Engineering Agents [2026-02-arXiv] [paper]
- SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories [2025-12-arXiv] [paper]
- Multi-Docker-Eval: A "Shovel of the Gold Rush" Benchmark on Automatic Environment Building for Software Engineering? [2025-12-arXiv] [paper]
- CodeClash: Benchmarking Goal-Oriented Software Engineering [2025-11-arXiv] [paper] [repo]
- SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads? [2025-11-arXiv] [paper]
- SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [2025-11-arXiv] [paper]
- SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks [2025-11-arXiv] [paper]
- ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases [2025-10-arXiv] [paper]
- SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement [2025-10-arXiv] [paper]
- Vibe Checker: Aligning Code Evaluation with Human Preference [2025-10-arXiv] [paper]
- SWE-QA: Can Language Models Answer Repository-level Code Questions? [2025-09-arXiv] [paper] [repo]
- RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback [2025-09-arXiv] [paper]
- MULocBench: A Benchmark for Localizing Code and Non-Code Issues in Software Projects [2025-09-arXiv] [paper] [website]
- SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios [2025-09-arXiv] [paper]
- SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? [2025-09] [paper] [repo]
- AgentIssue-Bench: Can Agents Fix Agent Issues? [2025-08-arXiv] [paper] [repo]
- LiveRepoReflection: Turning the Tide: Repository-based Code Reflection [2025-07-arXiv] [paper] [repo]
- SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? [2025-07-arXiv] [paper] [repo]
- CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance [2025-07-arXiv] [paper]
- ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code [2025-06-arXiv] [paper] [repo]
- SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks [2025-06-arXiv] [paper] [repo]
- OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution [2025-05-arXiv] [paper] [repo]
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [2025-05-arXiv] [paper] [website]
- SWE-bench-Live: A Live Benchmark for Repository-Level Issue Resolution [2025-05-arXiv] [paper] [repo]
- SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development [2025-05-arXiv] [paper] [repo]
- CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation [2025-04-arXiv] [paper] [repo]
- SWE-Smith: Scaling Data for Software Engineering Agents [2025-04-arXiv] [paper] [repo]
- SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [2025-04-arXiv] [paper] [repo]
- SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents [2025-04-arXiv] [paper] [repo]
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving [2025-04-arXiv] [paper] [repo]
- Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study [2025-03-arXiv] [paper]
- Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution [2025-03-arXiv] [paper]
- SWEE-Bench & SWA-Bench: Automated Benchmark Generation for Repository-Level Coding Tasks [2025-03-arXiv] [paper]
- REPOST-TRAIN: Scalable Repository-Level Coding Environment Construction with Sandbox Testing [2025-03-arXiv] [paper] [repo]
- Loc-Bench: Graph-Guided LLM Agents for Code Localization [2025-03-arXiv] [paper] [repo]
- ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments [2025-02-arXiv] [paper] [repo]
- SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? [2025-02-arXiv] [paper] [repo]
- SolEval: Benchmarking Large Language Models for Repository-level Solidity Code Generation [2025-02-arXiv] [paper] [repo]
- Evaluating Agent-based Program Repair at Google [2025-01-arXiv] [paper]
- SWE-Gym: Training Software Engineering Agents and Verifiers with SWE-Gym [2024-12-arXiv] [paper] [repo]
- RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation [2024-12-arXiv] [paper] [repo]
- Visual SWE-bench: Issue Resolving with Visual Data [2024-12-arXiv] [paper] [repo]
- ExecRepoBench: Multi-level Executable Code Completion Evaluation [2024-12-arXiv] [paper] [site]
- REPOCOD: Can Language Models Replace Programmers? REPOCOD Says 'Not Yet' [2024-10-arXiv] [paper] [repo]
- M2RC-Eval: Massively Multilingual Repository-level Code Completion Evaluation [2024-10-arXiv] [paper] [repo]
- SWE-bench+: Enhanced Coding Benchmark for LLMs [2024-10-arXiv] [paper]
- SWE-bench Multimodal: Multimodal Software Engineering Benchmark [2024-10-arXiv] [paper] [site]
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [2024-10-arXiv] [paper] [repo]
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [2024-06-arXiv] [paper] [repo]
- R2C2-Bench: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models [2024-06-arXiv] [paper]
- RepoClassBench: Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [2024-05-arXiv] [paper] [repo]
- Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions [ICLR-2025 Oral] [paper] [repo]
- UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench [ACL-2025] [paper]
- SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner [ICML-2025] [paper] [repo]
- FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation [2025-05-ACL] [paper] [repo]
- OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution [2025-05-ISSTA] [paper]
- LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation [2025-04-NAACL] [paper] [website]
- ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation [2025-ACL-Findings] [paper] [repo]
- HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation [2025-ICSE] [paper] [repo]
- RepoExec: On the Impacts of Contexts on Repository-Level Code Generation [2025-NAACL] [paper] [repo]
- DevEval: Evaluating Code Generation in Practical Software Projects [2024-ACL-Findings] [paper] [repo]
- CodeAgentBench: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges [2024-ACL] [paper]
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems [2024-ICLR] [paper] [repo]
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [2024-ICLR] [paper] [repo]
- CrossCodeLongEval: Repoformer: Selective Retrieval for Repository-Level Code Completion [2024-ICML] [paper] [repo]
- R2E-Eval: Turning Any GitHub Repository into a Programming Agent Test Environment [2024-ICML] [paper] [repo]
- RepoEval: Repository-Level Code Completion Through Iterative Retrieval and Generation [2023-EMNLP] [paper] [repo]
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion [2023-NeurIPS] [paper] [site]
- Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation [2025-01-arXiv] [paper] [repo]
