Software engineering (SE) agents are autonomous or semi-autonomous systems that leverage Large Language Models (LLMs) to reason, plan, and act within their environment to achieve specific goals 1. These systems mark a significant evolution from traditional rule-based methods towards intelligent entities capable of solving complex problems in software development 3. Key characteristics of SE agents include autonomous operation, robust LLM integration for powerful language understanding and reasoning, multi-component orchestration encompassing planning and memory mechanisms, and the ability to handle diverse tasks. Their core value is demonstrated through a "Perceive-Decide-Act" cycle, allowing them to understand natural language specifications and generate contextually appropriate code with minimal human intervention 3. Specific tasks that these agents are designed to perform span a wide range of software development activities, including code generation, code translation, program repair, code reasoning, test generation, and even complex activities like pull request reviews and navigating unfamiliar repositories 3.
The crucial role and necessity of benchmarks for SE agents cannot be overstated. Given the sophisticated and dynamic nature of these LLM-empowered systems, benchmarks are fundamental for evaluating their performance, reliability, and overall capabilities. They enable systematic progress by providing a comprehensive understanding of how different solutions and evaluation methodologies interconnect, allowing researchers to effectively compare approaches and identify areas for improvement 3. Traditional LLM evaluation methods, which primarily assess text generation or question answering, are insufficient for agents that operate in dynamic, interactive environments, requiring complex reasoning, planning, and tool execution 1. Benchmarks are essential for assessing an agent's intelligence, reliability, and safety in real-world scenarios, directly influencing their applicability in critical domains 2. They facilitate the granular assessment of specific capabilities beyond external behavior, such as tool use, planning, reasoning, memory and context retention, and multi-agent collaboration 1. Ultimately, the goal of these benchmarks is to establish unified evaluation standards, accepted metric systems, and mature methodologies within what is currently a fragmented and standards-deficient field 2.
However, benchmarking AI agents for software engineering tasks is fraught with complexities and significant challenges that hinder systematic progress and reliable evaluation. A primary challenge arises from the inherently dynamic and probabilistic nature of LLM agents, rendering static, rule-based testing inadequate for capturing their full "perceive-decide-act" cycle 1. The field also suffers from a profound lack of standardization, with varied research efforts often employing self-built, task-specific environments and metrics, making cross-study comparisons difficult 2. This has led to a widening "capability-evaluation" gap, where agent capabilities, especially reasoning and tool usage, evolve faster than our ability to rigorously evaluate their reliability, robustness, and boundary conditions. Many current evaluations often focus on final outcomes, neglecting critical aspects like the agent's chain of thought or decision rationale 2.
Further complexities stem from issues related to realism and task complexity. A significant "realism gap" exists, where even seemingly real-world benchmarks, being simplified and controlled, fail to guarantee performance in complex real-world scenarios characterized by infinite edge cases, API instability, or ambiguous task descriptions 2. An empirical study, for instance, showed a substantial drop in task success rates when agents encountered a slightly updated API version 2. Tasks themselves can be problematic; benchmarks, such as the original SWE-Bench, have been found to include overly specific unit tests that might incorrectly reject valid solutions or render tasks "nearly impossible," or contain underspecified issue descriptions leading to ambiguity 5. Agents frequently struggle due to missing or mismanaged context, like hallucinating deprecated APIs or mixing library versions, rather than a lack of inherent intelligence 4.
Methodological and practical challenges also abound. The "scalability dilemma" highlights the tension between costly, human-involved evaluations (often considered the gold standard for subjective aspects) and efficient, automated metrics that may be too coarse to capture semantic correctness or creativity 2. Traditional metrics, such as word-overlap scores, struggle with the "one-to-many" characteristics of diverse, plausible action sequences generated by agents 2. Data contamination is another concern, as large foundation models may have been inadvertently trained on public data used in benchmarks, potentially leading to an overestimation of their true capabilities 5. Moreover, enterprise-specific requirements introduce challenges related to secure data access, auditability, and handling complex long-term interaction patterns 1. The "privacy paradox" underscores the conflict between agents requiring access to massive amounts of sensitive data and the need for robust privacy protection, an aspect not systematically integrated into current evaluations 2.
Despite these formidable challenges, the landscape of SE agent benchmarking is continuously evolving. Benchmarks like SWE-Bench, whose overly specific tests and environment-setup problems drew early criticism, have been improved by initiatives such as SWE-Bench Verified, which human-validates a subset of tasks to enhance reliability and better reflect model capabilities 5. A variety of other benchmarks, including Terminal-Bench, τ-Bench, Context-Bench, and DPAI Arena, are emerging to measure different dimensions of SE agent capability, from code patching and command-line operations to enterprise workflows. This ongoing development signifies a collective and sustained effort to build a comprehensive and effective evaluation ecosystem for intelligent software engineering agents 4.
Despite the complexities and significant challenges inherent in benchmarking AI agents for software engineering tasks, a rich and diverse landscape of existing and influential benchmarks has emerged. These benchmarks are crucial for systematically evaluating the performance, reliability, and various capabilities of software engineering (SE) agents, moving beyond traditional Large Language Model (LLM) evaluation methods to assess their interactive and dynamic nature. They collectively represent an ongoing effort to build a comprehensive evaluation landscape for intelligent software engineering agents 4.
| Benchmark Name | Target Software Engineering Tasks | Underlying Datasets | Evaluation Protocols | Key Performance Indicators (KPIs) |
|---|---|---|---|---|
| SWE-Bench | Resolving real-world GitHub issues; producing patches that pass project test suites 4 | 2,294 problems from GitHub issues across 12 Python repositories 6; problem statement, solution code, and unit tests 5 | Agents edit codebase files; solutions evaluated by running FAIL_TO_PASS and PASS_TO_PASS tests, both must pass 5 | Resolve Rate 7; Public leaderboards track performance 4 |
| SWE-Bench Verified | Resolving real-world software engineering issues 6 | Human-validated subset of SWE-Bench (500 samples), screened for well-specified issue descriptions and appropriate unit tests 5 | Same as SWE-Bench, with Docker environments for reliability 5 | Resolve Rate 5 |
| SWE-Bench Pro | Resolving real-world software engineering issues (bug fixes, feature implementations, optimizations, security updates, UI/UX changes) 7 | 1,865 problems from 41 diverse professional repositories 6; Public, Commercial, and Held-out Sets 7 | Reproducible Docker-based environments; human-augmented problem statements 7; same patch evaluation criteria as SWE-Bench 7 | Resolve Rate 7; Public and Commercial Leaderboards 7 |
| Terminal-Bench | Operating in sandboxed command-line environments; multi-step workflows (compiling, configuring, running tools, navigating filesystem) 4 | Curated, real-world tasks from researchers, engineers, practitioners; each with natural-language description, reference solution, and verification script 4 | Agents operate in real, sandboxed CLI environments 4 | Reliability across shell-based tasks; CLI proficiency (Setup, Debug, Build, Execution categories) 4 |
| τ-Bench (Tau-Bench) | Long-horizon, tool-enabled conversational workflows with human-in-the-loop; interacting with human users and APIs, adhering to domain-specific policies 4 | E-commerce, airline reservations, retail, telecom scenarios | Multi-turn interactions; emphasis on reliability at scale and policy adherence 4 | pass^k metric (reliability over multiple runs) 4 |
| Context-Bench | Agentic context engineering; maintaining, reusing, and reasoning over long-running context; chaining file operations, tracing project relationships, consistent multi-step decisions 4 | Built on Letta's open-source evaluation framework 4 | Measures continuity, memory management, and long-horizon reasoning with cost-to-performance ratio 4 | Continuity scores; efficiency (token consumption) 4 |
| Spring AI Bench | Enterprise Java workflows; issue triage, dependency upgrades, PR reviews, compliance checks, test expansion on Spring projects 4 | Real Spring projects 4 | Evaluation within stable, opinionated Java frameworks with strict architectural patterns and CI pipelines 4 | Raw correctness and consistency under enterprise constraints 4 |
| DPAI Arena | Cross-ecosystem developer productivity; full multi-workflow, multi-language agents across the engineering lifecycle (patching, test generation, PR review, static analysis, repo navigation) 4 | Structured, reproducible environments modeled on real-world projects 4 | Measures correctness, workflow efficiency, and behavior across languages 4 | Leaderboards track multi-dimensional proficiency 4 |
| SWT-Bench | Automated software testing; generating, repairing, and executing test suites; reasoning about program behavior 4 | Real projects 4 | Not explicitly detailed; implied by tasks (e.g., navigating repositories, analyzing existing tests) 4 | Performance across Test Generation, Test Repair, and Coverage Improvement categories 4 |
| Cline Bench | Local-first agent workflows in realistic, repository-based development environments; diagnosing issues, navigating repo structures, executing multi-step workflows 4 | Real project snapshots and failure cases 4 | Emphasizes practical agent behavior: file edits, tool invocation, iterative refinement, recovery after missteps 4 | Reliability in repository-based workflows 4 |
| SWE-PolyBench | Polyglot codebases; evaluating AI coding agents across diverse programming tasks and languages 4 | Over 2,000 curated issues from 21 real-world repositories, covering Java, JavaScript, TypeScript, and Python 6 | Not explicitly detailed; focuses on multi-language capability 6 | Leaderboards available 6 |
| LiveCodeBench | Code-related tasks; self-repair, code execution, test output prediction 6 | New problems continuously collected from competitive programming platforms 6 | Not explicitly detailed; includes self-repair and code execution 6 | Leaderboards available 6 |
| Aider's Benchmarks | Editing, refactoring, and contributing to existing codebases; coding and self-correction 6 | Challenging refactoring benchmarks; Aider Polyglot: 225 Exercism coding exercises across C++, Go, Java, JavaScript, Python, Rust 6 | Not explicitly detailed; focuses on code modifications and correctness 6 | Leaderboards available 6 |
The SWE-Bench family of benchmarks evaluates the capacity of LLMs and AI agents to resolve real-world software engineering issues.
SWE-Bench, introduced in 2023 by Princeton researchers, rapidly became a leading benchmark for assessing model-level coding competence in real-world scenarios 4. It evaluates agents on their ability to resolve genuine GitHub issues by generating patches that successfully pass a project's test suite 4. The core task provides an agent with a code repository and an issue description and requires it to edit files to fix the issue without explicit access to the verification tests 5. Success is determined by a patch passing both FAIL_TO_PASS tests, confirming the fix, and PASS_TO_PASS tests, confirming no regressions 5. Its dataset comprises 2,294 problems from GitHub issues across 12 Python repositories, with each sample including a problem statement, solution code, and unit tests. Public leaderboards track model performance across various categories 4.
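To make the pass criterion concrete, here is a minimal sketch of the two-sided check described above. The helper names and the bare pytest invocation are assumptions for illustration; the actual SWE-Bench harness runs repository-specific test commands inside per-task environments.

```python
import subprocess
from typing import Sequence

def run_tests(test_ids: Sequence[str]) -> bool:
    """Run the given test IDs with pytest; True only if all of them pass.

    Hypothetical helper -- the real SWE-Bench harness uses per-repository
    environments and test commands, not a bare pytest call.
    """
    result = subprocess.run(["pytest", "-q", *test_ids], capture_output=True)
    return result.returncode == 0

def is_resolved(fail_to_pass: Sequence[str], pass_to_pass: Sequence[str]) -> bool:
    """A task counts as resolved only if the patch fixes the issue
    (FAIL_TO_PASS tests now pass) without regressions (PASS_TO_PASS tests still pass)."""
    return run_tests(fail_to_pass) and run_tests(pass_to_pass)
```

The Resolve Rate is then simply the fraction of benchmark tasks for which this two-sided check succeeds.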
SWE-Bench Verified emerged from a collaboration with OpenAI to address limitations in the original SWE-Bench, such as overly specific or unrelated unit tests, underspecified issue descriptions, and environment setup difficulties 5. This human-validated subset comprises 500 samples from the original dataset, screened for well-specified issues and appropriate unit tests. It aims to provide more reliable evaluations by filtering out problematic tasks, utilizing Docker environments for reproducibility and assessing a Resolve Rate KPI 5.
SWE-Bench Pro represents a more rigorous evolution, designed to provide a realistic evaluation for AI agents in professional software engineering contexts 7. It addresses key challenges like data contamination, limited task diversity, oversimplified problems, and unreliable testing 7. Its dataset includes 1,865 problems from 41 diverse professional repositories, featuring a Public Set, a Commercial Set from private codebases, and a Held-out Set. Evaluation protocols involve reproducible Docker-based environments and human-augmented problem statements, using the same patch evaluation criteria as SWE-Bench to calculate a "Resolve Rate" 7. SWE-Bench Pro is significantly more challenging, with top models achieving around a 23% resolve rate compared to over 70% on SWE-Bench Verified, thereby offering a more accurate measure of true problem-solving capabilities in professional development environments 7.
Launched in May 2025 in collaboration with Stanford and the Laude Institute, Terminal-Bench assesses AI agents' competence in real, sandboxed command-line environments 4. Unlike one-shot patch-generation benchmarks, it evaluates an agent's ability to plan, execute, and recover through multi-step workflows, including compiling code, configuring environments, and navigating filesystems 4. Its datasets consist of curated, real-world tasks contributed by researchers and industry practitioners, each with a natural-language description, reference solution, and verification script 4. The benchmark uses a verification script for each task and ranks full agent systems based on their reliability across various shell-based tasks, capturing operational behavior often missed by pure LLM evaluations 4. KPIs include reliability across shell-based tasks and CLI proficiency across categories like Setup, Debug, Build, and Execution 4.
Introduced in June 2024 by Sierra, τ-Bench focuses on evaluating agent systems in long-horizon, tool-enabled conversational workflows under realistic human-in-the-loop conditions 4. Key evaluation criteria include interaction with simulated human users and programmatic APIs, adherence to domain-specific policies, and high reliability at scale 4. Tasks span e-commerce, airline, retail, and telecom scenarios, requiring agents to ask questions, consult databases, and invoke APIs. The benchmark introduces a "pass^k" metric to measure reliability over multiple runs, highlighting how consistent performance can differ from one-shot successes 4. This benchmark addresses a critical gap by assessing sustained interaction, policy compliance, and repeatability in conversational, tool-driven agents 4.
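The pass^k idea can be made concrete with a small estimator. Assuming each task is attempted n times with c successes, one standard unbiased estimate of the probability that all k independent runs succeed is C(c, k) / C(n, k), averaged over tasks; the sketch below illustrates that calculation and is not Sierra's reference implementation.

```python
from math import comb
from typing import Sequence, Tuple

def pass_hat_k(task_results: Sequence[Tuple[int, int]], k: int) -> float:
    """Estimate pass^k: the probability that an agent solves a task
    in *all* of k independent trials, averaged over tasks.

    task_results: (n, c) pairs per task, where n = trials run and
    c = trials that succeeded; requires n >= k for every task.
    """
    per_task = [comb(c, k) / comb(n, k) for n, c in task_results]
    return sum(per_task) / len(per_task)

# Example: an agent that usually succeeds looks strong at k=1 but much
# weaker at k=4 -- exactly the consistency gap pass^k is meant to expose.
results = [(5, 4), (5, 5), (5, 3)]
print(pass_hat_k(results, k=1))  # 0.8
print(pass_hat_k(results, k=4))  # 0.4
```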
Introduced in October 2025 by generative AI startup Letta, Context-Bench measures an agent's ability to manage, reuse, and reason over long-running context, a crucial capability for modern agent systems 4. Built on Letta's open-source evaluation framework, it tests agents on tasks such as chaining file operations, tracing relationships across project structures, and making consistent decisions over extended workflows 4. The evaluation protocol measures continuity, memory management, and long-horizon reasoning, and also exposes the cost-to-performance ratio, since high continuity scores can come with dramatically increased token consumption; its KPIs are continuity scores and efficiency measured in token consumption, giving a more realistic economic picture of agentic capability 4.
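As a rough illustration of that cost-to-performance framing, a continuity score can be normalized by the tokens an agent consumed; the ratio below is an assumed example metric, not Letta's published scoring formula.

```python
def continuity_per_million_tokens(continuity_score: float, tokens_used: int) -> float:
    """Hypothetical efficiency ratio: continuity points per million tokens.
    Two agents with equal continuity scores can differ sharply here if one
    burns far more context to stay consistent over a long workflow."""
    return continuity_score / (tokens_used / 1_000_000)

# Agents A and B both score 0.9 continuity, but B uses 10x the tokens.
print(continuity_per_million_tokens(0.9, 2_000_000))   # 0.45
print(continuity_per_million_tokens(0.9, 20_000_000))  # 0.045
```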
Announced in October 2025, Spring AI Bench is an open benchmarking suite specifically designed for Java-centric AI developer agents 4. It addresses the enterprise Java ecosystem, a domain often overlooked by mainstream agent benchmarks 4. The benchmark utilizes real Spring projects as its dataset to evaluate agents on tasks pertinent to day-to-day enterprise software maintenance, including issue triage, dependency upgrades, pull request reviews, compliance checks, and test expansion 4. Its value lies in its emphasis on enterprise realism, assessing agents within stable, opinionated frameworks with strict architectural patterns and high bars for backward compatibility, with raw correctness and consistency under enterprise constraints as key performance indicators 4.
Launched in October 2025 by JetBrains, DPAI Arena is designed as a broad platform for benchmarking coding agents across multiple languages and frameworks 4. Unlike benchmarks focusing on single tasks, it evaluates agents across the entire engineering lifecycle, including patching, test generation, pull request reviews, static analysis, and navigating unfamiliar repositories 4. The arena provides structured, reproducible environments mimicking real-world projects and ranks agents based on correctness, workflow efficiency, and cross-language behavior, with leaderboards tracking multi-dimensional proficiency 4. It aims to become a shared, cross-ecosystem testing surface for general-purpose coding agents 4.
Released in October 2024 by LogicStar AI, SWT-Bench shifts focus to automated software testing 4. It evaluates agents' capacity to generate, repair, and execute test suites across real projects, a vital capability for quality assurance and self-correcting coding agents 4. Tasks involve navigating repositories, analyzing existing test structures, and producing valid test cases that meaningfully cover the underlying code 4. The benchmark's leaderboard provides insights into agent performance in Test Generation, Test Repair, and Coverage Improvement categories 4.
SWE-PolyBench, a complementary benchmark designed by Amazon, evaluates how well models handle polyglot codebases, which span multiple programming languages 4. This multi-language benchmark includes over 2,000 curated issues from 21 real-world repositories, covering languages such as Java, JavaScript, TypeScript, and Python 6. It addresses the increasing relevance of models capable of operating in heterogeneous software systems, with leaderboards available to track performance.
These benchmarks collectively highlight the field's progression towards developing and evaluating AI agents that can not only reason but also consistently and safely act across the complex, multi-step workflows encountered by developers daily 4. Each benchmark uniquely contributes to assessing different performance axes, such as patch correctness, operational reliability, long-horizon context management, enterprise workflows, or test generation, indicating that no single benchmark fully captures the entire spectrum of a capable AI agent 4.
The integration of artificial intelligence, particularly Large Language Models (LLMs), has significantly advanced software engineering (SE) 8. While traditional AI methods for tasks like bug detection and code synthesis were limited by their reliance on hand-crafted feature engineering and by scalability challenges, LLMs introduced new solutions for code generation, debugging, and documentation 8. However, LLMs have their own drawbacks, including restricted context length, hallucinations, and an inability to use external tools 8. To overcome these, LLM-based agents have emerged, combining LLMs with external tools and resources for more autonomous and dynamic operations, facilitating tasks like autonomous debugging and adaptive test generation 8. An AI agent is defined as a system that autonomously performs tasks by designing workflows with available tools, encompassing decision-making, problem-solving, and interaction with external environments 9.
LLMs serve as the cognitive core or "brain" of AI agents. A typical LLM agent framework comprises a user request, an agent (brain), planning, and memory 10.
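The cited framework names only the high-level components (user request, agent brain, planning, memory); the loop below is a generic, minimal sketch of how they commonly interact, with `call_llm` and `execute_tool` left as hypothetical placeholders rather than any specific framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Short-term memory: the running trace of steps the agent has taken."""
    steps: list[str] = field(default_factory=list)

def call_llm(prompt: str) -> str:
    """Placeholder for the LLM 'brain'; a real agent calls a model API here."""
    raise NotImplementedError

def execute_tool(action: str) -> str:
    """Placeholder for tool execution (search API, code interpreter, ...)."""
    raise NotImplementedError

def run_agent(user_request: str, max_steps: int = 10) -> str:
    """Generic perceive-decide-act loop: plan against memory, act, observe."""
    memory = AgentMemory()
    for _ in range(max_steps):
        plan = call_llm(
            f"Request: {user_request}\nHistory: {memory.steps}\n"
            "Decide the next action, or answer FINAL: <result> if done."
        )
        if plan.startswith("FINAL:"):
            return plan.removeprefix("FINAL:").strip()
        observation = execute_tool(plan)                  # act in the environment
        memory.steps.append(f"{plan} -> {observation}")   # remember the outcome
    return "Step budget exhausted without a final answer."
```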
The LLMs themselves typically adhere to one of three primary architectural designs 8:
An LLM-based agent can be formally represented by the tuple \(\langle L,O,M,P,A,R\rangle\) 11:
Planning is essential for decomposing complex tasks into manageable sub-steps.
Tools enable LLM agents to interact with external environments such as search APIs, code interpreters, math engines, databases, knowledge bases, and external models 10. Specific approaches include:
LLM-based Multi-Agent (LMA) systems consist of multiple interacting intelligent agents collaborating to solve complex problems or achieve goals beyond the capacity of a single agent 11. These systems typically include an orchestration platform and individual LLM-based agents 11.
AI agents can be developed with varying levels of sophistication 9; a minimal code sketch of the first two types follows the table below:
| Agent Type | Description | Example |
|---|---|---|
| Simple Reflex Agents | Base actions purely on current perception, operating on predefined rules or reflexes without memory or interaction with other agents. Effective in fully observable environments. | A thermostat |
| Model-Based Reflex Agents | Use current perception and memory to maintain an internal model of the world, adapting actions based on this model and previous states. Can operate in partially observable environments but are still rule-limited. | A robot vacuum cleaner |
| Goal-Based Agents | Possess an internal world model and specific goals. They search for and plan action sequences to achieve these goals, improving effectiveness beyond reflex agents. | A navigation system finding the fastest route |
| Utility-Based Agents | Select action sequences that not only reach a goal but also maximize a defined utility or reward (e.g., fuel efficiency, time, cost). A utility function assigns values to scenarios. | A navigation system optimizing for multiple factors |
| Learning Agents | Incorporate all previous capabilities with the unique ability to learn autonomously from new experiences, continuously enhancing their knowledge base and adaptability. Includes learning, critic, performance, and problem generator components. | Personalized e-commerce recommendations |
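To ground the first two rows of the table, the sketch below contrasts a simple reflex thermostat (no memory, fixed rule) with a model-based robot vacuum that maintains an internal model of which cells it has cleaned; the classes, thresholds, and actions are illustrative assumptions, not drawn from the cited source.

```python
class SimpleReflexThermostat:
    """Simple reflex agent: acts only on the current percept via a fixed rule."""
    def act(self, current_temp_c: float) -> str:
        return "heat_on" if current_temp_c < 20.0 else "heat_off"

class ModelBasedVacuum:
    """Model-based reflex agent: keeps an internal model (cells already
    cleaned) so it can act sensibly in a partially observable room."""
    def __init__(self) -> None:
        self.cleaned: set[tuple[int, int]] = set()

    def act(self, position: tuple[int, int], is_dirty: bool) -> str:
        if is_dirty and position not in self.cleaned:
            self.cleaned.add(position)      # update the internal world model
            return "clean"
        return "move_on"                    # decision depends on remembered state

# The thermostat needs no state; the vacuum's behavior depends on its history.
print(SimpleReflexThermostat().act(18.5))   # heat_on
vac = ModelBasedVacuum()
print(vac.act((0, 0), is_dirty=True))       # clean
print(vac.act((0, 0), is_dirty=True))       # move_on (already handled per its model)
```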
SE agents are categorized across seven key themes in software engineering:
The described architectures enable significant capabilities and benefits while also introducing specific limitations:
In conclusion, SE agents, particularly those based on LLM multi-agent systems, signify a transformative shift in software development. They leverage advanced LLM capabilities with specialized tools, planning, and memory modules to offer substantial advantages in automation, robustness, scalability, and output quality. While challenges related to fine-tuning, computational costs, and complex orchestration persist, ongoing research aims to enhance individual agent capabilities and optimize inter-agent collaboration, paving the way for more autonomous, scalable, and trustworthy SE systems.
The landscape of Software Engineering (SE) agent benchmarks has seen rapid evolution from 2023 to the present, driven by significant advancements in Large Language Models (LLMs) and the increasing sophistication of AI agents. This period marks a transformation towards more complex, real-world task evaluation and structured human-AI collaboration.
Recent years have witnessed a surge in benchmark development, with 71 new benchmarks identified in 2024 alone and a projection of 109 for 2025, highlighting the growing impact of AI4SE benchmarking 13. This expansion reflects a move beyond foundational benchmarks that focused on single-shot code generation towards comprehensive evaluations of multi-step, multi-language, and context-aware workflows 4.
The SWE-Bench family exemplifies this progression:
Beyond code patching, new benchmarks are assessing diverse capabilities:
This proliferation signifies a collective effort to build a comprehensive evaluation landscape for intelligent software engineering agents 4.
Several key trends are shaping the future of SE agent benchmarks:
To address the complexities of evaluating dynamic and probabilistic LLM agents, new methodologies and metrics are being developed:
Despite significant progress, several challenges persist in SE agent benchmarking:
Future research in SE agent benchmarking will need to address these challenges by focusing on:
By addressing these priorities, the field can develop more robust, reliable, and trustworthy software engineering agents capable of truly transforming software development.