The process of managing and resolving software defects, commonly known as bug triage, is a critical component of software development and maintenance. With the increasing complexity and scale of software systems, manual bug triage has become a significant bottleneck, necessitating the adoption of automated solutions. This section introduces the core concepts of automatic bug triage and the pivotal role of intelligent agents in transforming this process. It provides formal definitions, traces the historical evolution of bug triage automation, and highlights early foundational contributions that paved the way for current advancements, particularly those leveraging artificial intelligence.
Automatic bug triage is a systematic process designed to review and classify reported software bugs according to their severity and impact on the software application 1. Its primary goal is to prioritize these bugs and efficiently assign them to the appropriate developers for resolution 1. This encompasses identifying, tracking, prioritizing, and addressing software bugs to organize them effectively and ensure that the most critical issues are handled first 2. More broadly, triage involves a sequence of analytical activities aimed at efficiently managing an issue's lifecycle, including detecting duplicates, prioritizing urgency, classifying the issue's type (e.g., bug, feature request, or security vulnerability), and routing it to the most suitable entity, which could be a specific developer, a component team, or an automated analysis pipeline 3. In the context of automation, machine learning algorithms, including Large Language Models (LLMs), are employed to classify bugs and assign them to developers, thereby enhancing efficiency 4. An "Auto Triage AI Agent" exemplifies this by automating bug reporting, triage, and follow-up processes through generative AI, thereby transforming traditionally manual methods into streamlined, efficient workflows 5.
Intelligent agents, especially LLM agents, are defined as autonomous or semi-autonomous systems capable of understanding their environment, planning tasks, and executing actions to achieve long-term goals within the domain of software engineering 6. These agents can be combined into multi-agent systems, where multiple LLMs collaborate using structured communication, task specialization, and coordination protocols 6. Within bug tracking, LLM agents play a crucial role across the entire bug lifecycle. Their assistance spans bug report creation and enhancement, reproduction attempts, classification, traceability link creation, validation, bug assignment, localization, fixing, verification, and deployment 6. This integration of agents facilitates automation and injects intelligence at various stages, significantly reducing manual effort, improving the quality of reports, and bridging communication gaps between non-technical end-users and technical developers 6. For instance, an autonomous agent can use generative AI to analyze emails, extract key details, cross-reference product documentation, and create bug reports in systems like Azure DevOps, while another agent handles autonomous bug updates and follow-ups based on user replies 5.
The synergy between automatic bug triage and intelligent agents is profound. By integrating LLM agents, the historically labor-intensive processes of bug triage are transformed into automated, intelligent workflows. This allows for increased efficiency in classifying bugs, assigning them to appropriate developers 4, and managing the entire bug lifecycle with reduced manual intervention 6. This integration reduces manual effort, improves report quality, and ensures better communication and coordination, ultimately accelerating the resolution of software defects 6.
The evolution of bug tracking reflects the maturation of software engineering practices, moving from informal, manual methods to sophisticated, collaborative, and increasingly automated platforms 6.
| Era | Description |
|---|---|
| Early Digital (1940s-1970s) | Bug tracking was predominantly a manual, often paper-based process, lacking systematic ways to categorize, prioritize, or assign bugs 6. |
| Pre-Internet (1970s-1980s) | Communication evolved with email systems and simple databases, but bug reproduction and fixing, along with software distribution, remained slow and inefficient, characterized by low collaboration and high time to resolution (TTR) 6. |
| Internet (1980s-1990s) | The first dedicated bug-tracking systems, such as GNATS and CMVC, emerged, using text-based files and email for structured logging. Organizations implemented formal processes with multi-level bug classification systems 6. |
| Web-Based (2000s) | Marked a major shift to web-based platforms like Bugzilla, MantisBT, Trac, and early versions of Jira, introducing structured fields, status transitions, user roles, and integration with agile methodologies 6. |
| SaaS, DevOps, Automation (2010s-2022) | Platforms like GitHub Issues and Azure DevOps became standard, integrating bug tracking fully into CI/CD pipelines. Academic research began exploring Machine Learning (ML) techniques for tasks such as duplicate detection, severity prediction, and assignee recommendation 6. |
| Generative AI (Present & Future) | The current vision leverages LLM agents to augment existing systems, aiming to automate report refinement, reproduction attempts, classification, localization, assignment, and patch review to reduce TTR and coordination overhead 6. |
Early research laid the groundwork for current intelligent systems by exploring the application of machine learning to streamline bug assignment. Academic studies initially focused on using ML techniques to enhance bug tracking efficiency, particularly applying text classification and clustering algorithms for tasks like duplicate detection, severity prediction, and assignee recommendation 6.
A significant contribution came from Anvik et al. (2006), who discussed the importance of traceability and coordination in modern bug tracking practices, and also presented early machine learning models for developer assignment based on text categorization. Much of this initial research on automated bug assignment frequently targeted Open-Source Software (OSS) communities, utilizing projects such as Eclipse and Mozilla as common subjects for study 7. Jonsson et al. further advanced the field with work on "Automated bug assignment: Ensemble-based machine learning in large scale industrial contexts" (2016a) and "Automatic localization of bugs to faulty components in large scale software systems using Bayesian classification" (2016b) 7. More recently, Sarkar et al. (2019) focused on "Improving Bug Triaging with High Confidence Predictions at Ericsson," emphasizing the promise of confidence-based approaches. Practical insights from industrial case studies, such as those by Aktas and Yilmaz (2020a) and Oliveira et al. (2021), also demonstrated the value of ML-based issue assignment, even when accuracy was not exceptionally high, stressing the need for iterative processes and feature monitoring 7. Other early methodologies included Ant Colony Optimization (ACO) for feature selection and a self-bug triaging approach using reinforcement learning 4. Empirical studies generally suggest that automating the bug assignment process has the potential to significantly reduce software evolution effort and costs 8.
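The text-categorization approach behind these early assignment systems can be illustrated with a deliberately minimal sketch: build a bag-of-words profile per developer from historical reports, then route a new report to the developer whose profile it overlaps most. The data and scoring function below are hypothetical toy stand-ins; production systems of this era typically used TF-IDF features with SVM or Naive Bayes classifiers.

```python
from collections import Counter, defaultdict

def tokenize(text):
    return [t.lower() for t in text.split()]

def train(reports):
    """reports: list of (summary, developer). Builds per-developer word counts."""
    profiles = defaultdict(Counter)
    for summary, dev in reports:
        profiles[dev].update(tokenize(summary))
    return profiles

def assign(profiles, summary):
    """Scores each developer by normalized token overlap with the new report."""
    tokens = tokenize(summary)
    def score(dev):
        counts = profiles[dev]
        return sum(counts[t] for t in tokens) / sum(counts.values())
    return max(profiles, key=score)

# Toy history: alice handles rendering bugs, bob handles authentication bugs.
history = [
    ("crash in renderer when resizing window", "alice"),
    ("renderer paints black screen on resume", "alice"),
    ("login form rejects valid password", "bob"),
    ("password reset email never sent", "bob"),
]
profiles = train(history)
print(assign(profiles, "renderer crash after window resize"))  # alice
```

Even this crude overlap score captures the core intuition of assignee recommendation: developers accumulate a textual "fingerprint" of the components they work on.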
Automatic bug triage has significantly advanced through the application of sophisticated Artificial Intelligence (AI) and Machine Learning (ML) techniques, integrated within various agent architectures. These methodologies aim to improve accuracy, decrease false positives, and accelerate bug resolution by leveraging structured and unstructured data from bug reports 9.
Automatic bug triage predominantly utilizes a diverse array of AI/ML techniques:
To manage the complexity and automate various facets of bug triage, distinct agent architectures are designed:
| Agent Type | Description | Key Responsibilities |
|---|---|---|
| Data | Prepares and partitions datasets | Cleans and partitions bug report data 10 |
| Prompt | Formulates instructions for LLMs | Constructs model input with CTQRS, CoT, one-shot exemplars 10 |
| Fine-tuning | Adapts LLMs to prompt strategies | Adapts pretrained LLMs to specific prompt strategies 10 |
| Generation | Creates bug reports/summaries | Generates structured reports or summaries 10 |
| Evaluation | Assesses quality and consistency | Evaluates structural completeness, lexical fidelity, semantic consistency 10 |
| Reporting | Organizes and presents results | Organizes and presents generated bug reports and evaluation results 10 |
| Controller | Orchestrates overall workflow | Oversees execution, sequences agent invocations 10 |
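The Controller agent's role of sequencing invocations can be sketched as a simple pipeline that threads shared state through the other agents. The agent names and their one-line behaviors below are hypothetical stand-ins for the roles in the table, not any specific framework's API.

```python
class Agent:
    """Wraps a named step that reads shared state and returns updates to it."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def run(self, state):
        state = dict(state)
        state.update(self.fn(state))
        return state

class Controller:
    """Oversees execution: invokes agents in order, recording the trace."""
    def __init__(self, agents):
        self.agents = agents

    def run(self, state):
        trace = []
        for agent in self.agents:
            state = agent.run(state)
            trace.append(agent.name)
        return state, trace

# Hypothetical stand-ins for the Data, Prompt, Generation, and Evaluation agents.
data_agent = Agent("data", lambda s: {"clean": s["raw"].strip().lower()})
prompt_agent = Agent("prompt", lambda s: {"prompt": f"Classify this bug: {s['clean']}"})
generation_agent = Agent("generation",
                         lambda s: {"report": {"summary": s["clean"], "severity": "major"}})
evaluation_agent = Agent("evaluation", lambda s: {"complete": "summary" in s["report"]})

controller = Controller([data_agent, prompt_agent, generation_agent, evaluation_agent])
final, trace = controller.run({"raw": "  App CRASHES on startup  "})
print(trace)              # ['data', 'prompt', 'generation', 'evaluation']
print(final["complete"])  # True
```

In real systems the Controller may branch or retry (e.g., re-invoking Generation when Evaluation flags an incomplete report) rather than run a fixed linear sequence.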
The integration of AI/ML techniques within agent-based systems for bug triage results in powerful, automated workflows:
Automatic bug triage systems powered by intelligent agents offer substantial improvements in software development workflows. This section provides a comprehensive overview of the advantages, significant challenges, and standard evaluation metrics crucial for understanding and implementing these advanced systems.
Intelligent agents significantly enhance efficiency, accuracy, and overall developer productivity in automatic bug triage systems.
Despite their numerous benefits, agent-based automatic bug triage systems encounter several significant challenges that impede their widespread adoption and optimal performance.
Evaluating the performance of agent-based bug triage systems requires a multi-layered approach, incorporating both general AI agent metrics and specific bug triage indicators.
| Metric Category | Metric | Description | Reference |
|---|---|---|---|
| Core Performance Metrics | Accuracy | Measures how often the agent's outputs match expected results, including overall classification accuracy (e.g., 85–90% in severity, >85% in bug classification), factual correctness, and ground truth alignment. | |
| | Hop Accuracy | Represents the accuracy when the number of assignments does not exceed a specified number of "hops" (reassignments), critical for continuous triage. | 12 |
| | Precision | Assesses the proportion of correctly identified positive results among all positive results returned. Priority prediction can be 82% or higher. | |
| | Recall | Measures the proportion of actual positive results that are correctly identified. | 14 |
| | F1-score | The harmonic mean of precision and recall, especially useful for multiclass classification tasks, with studies showing improvements of around 4%. | |
| | False Positive Rate | Indicates the percentage of incorrect positive classifications, with AI aiming to cut this by up to 60%. | 9 |
| | Task Success Rate | Whether agents complete assigned tasks, assessed as binary (completed/failed) or graded (partial completion). | 14 |
| Efficiency and Latency | Time to Engage (TTE) | The time from incident report to assignment to the correct team; a key efficiency factor, with reductions of up to 91% reported. | 12 |
| | Time-to-Resolution (TTR) | The overall time taken to resolve an issue. | 9 |
| | Average Triage Time | The average time spent on the triage process, which can see a 65% reduction. | 9 |
| | Bug Resolution Speed | Measures how quickly bugs are fixed, often showing 30–40% faster resolution. | 9 |
| | Latency | The response time, from query submission to final response, including model inference duration and tool call execution time. | 14 |
| | Transfer Hop Counts | The number of reassignments before a ticket reaches the correct team. | 12 |
| System and Quality | User Satisfaction | Measured through explicit feedback (ratings, surveys) or implicit signals (conversation continuation). | 14 |
| | Developer Satisfaction | Improved due to better task alignment. | 9 |
| | Cost | Token usage per interaction, number of LLM API calls, infrastructure, and compute expenses. | 14 |
| | Robustness | Agent resilience to challenging inputs, edge cases, and adversarial prompts. | 14 |
| | Agent Trajectory Quality | Evaluates the sequence of actions and decisions, ensuring logical reasoning paths and efficiency. | 14 |
| | Tool Selection Accuracy | Whether agents correctly identify and invoke relevant tools with appropriate parameters. | 14 |
| | Step Completion & Utility | Tracks whether necessary steps are executed correctly and whether each action contributes meaningfully. | 14 |
| | Context Relevance, Precision, and Recall | Evaluate the quality of information retrieved from knowledge bases. | 14 |
| | Faithfulness | Measures whether agent responses are grounded in retrieved context rather than hallucinated. | 14 |
| | Clarity, Conciseness, Consistency | Assess agent-generated responses for understandability, brevity, and stability over time. | 14 |
| | PII Detection | Validates that agents do not expose sensitive information. | 14 |
| | Adaptability | How well agents adjust to new scenarios and generalize beyond training distributions. | 14 |
| | Coverage Improvements | The extent to which agent-generated tests cover requirements. | 13 |
| | Defect Detection Rates | How effectively the system identifies actual defects. | 13 |
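Several of the core metrics in the table reduce to a few lines of arithmetic. The sketch below computes precision, recall, F1, and hop accuracy on toy data; the numbers are illustrative and not drawn from the cited studies.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Standard binary precision/recall/F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def hop_accuracy(hop_counts, max_hops):
    """Fraction of tickets reaching the right team within max_hops reassignments."""
    return sum(1 for h in hop_counts if h <= max_hops) / len(hop_counts)

y_true = ["bug", "bug", "feature", "bug"]
y_pred = ["bug", "feature", "feature", "bug"]
p, r, f1 = precision_recall_f1(y_true, y_pred, "bug")
print(round(p, 2), round(r, 2), round(f1, 2))      # 1.0 0.67 0.8
print(hop_accuracy([0, 1, 3, 0, 2], max_hops=1))   # 0.6
```

Hop accuracy generalizes plain accuracy (which is the `max_hops=0` case) to reflect the "bug tossing" cost of reassignments.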
Intelligent agents are increasingly vital in automatic bug triage, addressing the growing complexity and volume of software incidents in modern systems 3. Manual triage is time-consuming and prone to "bug tossing," which delays resolution and reduces operational efficiency 3. Automated solutions, powered by intelligent agents, aim to accelerate issue resolution, enhance diagnostic accuracy, facilitate cross-team collaboration, and establish a foundation for further intelligent automation 3. An AI agent is an autonomous software entity designed for goal-directed task execution, capable of perceiving inputs, reasoning over context, and initiating actions within digital environments 15. These agents possess autonomy, task-specificity, and reactivity with adaptation, operating through a "Perceive → Plan → Act → Learn" loop.
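The "Perceive → Plan → Act → Learn" loop can be made concrete with a deliberately simple keyword-routing agent. The team names and routing rules below are hypothetical; a real agent would replace each step with LLM reasoning and tool calls, but the control structure is the same.

```python
class TriageAgent:
    """Minimal Perceive → Plan → Act → Learn loop (illustrative only)."""

    def __init__(self):
        # Hypothetical keyword → team routing table, grown via feedback.
        self.keyword_teams = {"crash": "stability", "login": "auth"}

    def perceive(self, report):
        """Perceive: turn the raw report into observable tokens."""
        return [w.lower() for w in report.split()]

    def plan(self, tokens):
        """Plan: pick a target team from what was perceived."""
        for token in tokens:
            if token in self.keyword_teams:
                return self.keyword_teams[token]
        return "general"

    def act(self, team):
        """Act: perform the routing action in the environment."""
        return f"routed to {team}"

    def learn(self, tokens, correct_team):
        """Learn: fold human corrections back into the routing table."""
        for token in tokens:
            self.keyword_teams.setdefault(token, correct_team)

agent = TriageAgent()
tokens = agent.perceive("Payment page crash on submit")
print(agent.act(agent.plan(tokens)))                      # routed to stability
agent.learn(agent.perceive("payment declined silently"), "billing")
print(agent.plan(agent.perceive("payment fails again")))  # billing
```

The key property the loop illustrates is adaptation: after one corrective feedback signal, previously unroutable reports about "payment" reach the right team.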
Research in automatic bug triage primarily focuses on leveraging machine learning (ML) and artificial intelligence (AI) techniques to streamline the bug lifecycle. Early efforts frequently targeted open-source software (OSS) projects like Eclipse and Mozilla 7.
Key contributions include:
While many research efforts contribute to foundational techniques, several stand out as significant prototypes or tool-focused contributions:
Several large companies have successfully implemented intelligent agents for automatic bug triage, demonstrating significant efficiency gains:
The table below summarizes the features and effectiveness of these industrial implementations:
| Industrial Implementation | Percentage Auto-Assigned | Accuracy | Resolution Speed | Other Features/Benefits | Challenges/Notes |
|---|---|---|---|---|---|
| Ericsson TRR | 30% | 75% | 21% faster | Saved engineer hours, process improvements, increased communication, higher job satisfaction. Confidence-based triaging. | Intricate adoption in a large, complex organization. Misclassifications are a significant concern. End-user trust was vital. |
| IsBank IssueTAG | 100% (380 issues/day) | N/A | More efficient | Accuracy monitoring, explainability. Found useful even if accuracy slightly less than manual. | Necessitated changes to manual assignment processes. Did not identify objections to deployment. |
| LG Electronics | N/A | >90% | N/A | Iterative process, effective communication, trust development, accuracy monitoring. Value even with imperfect accuracy. | Focused on textual features for ML models. Misclassifications considered a bigger problem at Ericsson compared to LG Electronics and IsBank. |
The period from 2023 to the present has witnessed substantial advancements and novel approaches in automatic bug triage, primarily driven by the integration of Large Language Models (LLMs) and a growing emphasis on ethical considerations. These developments are reshaping how bug reports are handled and analyzed.
LLMs are increasingly pivotal in automating bug triaging processes, facilitating the classification of bug reports, recommending suitable developers, and predicting bug priority 17. A significant breakthrough involves the use of instruction-tuned, project-specific LLMs that incorporate candidate-constrained decoding to ensure valid developer assignments 18. This methodology often leverages parameter-efficient fine-tuning (PEFT) techniques, such as LoRA adapters, applied to models like DeepSeek-R1-Distill-Llama-8B, minimizing the need for handcrafted features or complex preprocessing, which enables swift adaptation to new projects 18. LLMs exhibit the capability to process diverse bug content, including extensive descriptions, code snippets, and discussion threads, circumventing the token span limitations or noise introduction issues encountered by earlier transformer models 18.
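Candidate-constrained decoding can be approximated at the granularity of whole candidates: score only names in the valid roster and normalize over that set, so the model can never emit an invalid assignee. The scores below are toy stand-ins for the per-candidate log-likelihoods a fine-tuned LLM would produce; real implementations constrain decoding at the token level.

```python
import math

def constrained_assign(scores, candidates):
    """Rank developers, restricted to the valid candidate set.

    scores: dict of developer -> log-likelihood (stand-in values here).
    candidates: the project's current roster; anything else is excluded,
    which is what guarantees a valid assignment.
    """
    valid = {dev: s for dev, s in scores.items() if dev in candidates}
    z = sum(math.exp(s) for s in valid.values())          # softmax normalizer
    ranked = sorted(valid, key=valid.get, reverse=True)   # shortlist order
    probs = {dev: math.exp(valid[dev]) / z for dev in ranked}
    return ranked, probs

scores = {"alice": -0.2, "bob": -1.5, "mallory-left-team": 3.0}
candidates = {"alice", "bob", "carol"}
ranked, probs = constrained_assign(scores, candidates)
print(ranked[0])  # alice — the out-of-roster name is excluded despite its high score
```

Note that the highest-scoring name overall is rejected because it is not in the roster; this is precisely the failure mode (hallucinated or stale assignees) that candidate constraints eliminate.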
Evaluations indicate that while achieving exact Top-1 accuracy remains challenging, especially within large, long-tail label spaces, LLMs can effectively generate high-quality shortlists for real-world bug triaging, with reported Hit@10 scores reaching up to 0.753 on datasets like Mozilla 18. Furthermore, some LLM-based tools, such as LATTE, which integrates LLMs with automated binary taint analysis, have successfully uncovered previously unknown vulnerabilities 19. LLMs also demonstrate potential in generating localized bug fixes and test assertions, with their performance often enhanced through sophisticated prompt engineering 20. Beyond bug triage, LLMs have shown superior performance compared to traditional machine learning models in cybersecurity log classification for vulnerability detection 21.
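The Hit@k metric behind figures like the Hit@10 of 0.753 cited above is straightforward: a triage run scores a hit if the true assignee appears anywhere in the top-k shortlist, and the reported score is the mean over runs. A minimal sketch with hypothetical rankings:

```python
def hit_at_k(ranked_candidates, true_assignee, k):
    """1 if the true assignee appears in the top-k shortlist, else 0."""
    return int(true_assignee in ranked_candidates[:k])

# Hypothetical shortlists paired with the developer who actually fixed the bug.
triage_runs = [
    (["alice", "bob", "carol"], "bob"),    # hit: bob is in the top 2
    (["dave", "erin", "frank"], "grace"),  # miss: grace is absent
]
hits = [hit_at_k(ranked, true, k=2) for ranked, true in triage_runs]
print(sum(hits) / len(hits))  # 0.5
```

Because Hit@k rewards a correct name anywhere in the shortlist, it matches the practical workflow where a human triager picks from a recommended list rather than accepting a single Top-1 prediction.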
The inherent "black box" nature of LLMs poses challenges for interpretability in critical applications 20. However, progress is being made in utilizing LLMs to produce structured, domain-relevant explanations that align with classical interpretability methods 21. This capability significantly enhances transparency and trustworthiness, making advanced threat detection more accessible and providing clearer, more justifiable alerts, particularly for non-expert users in small and medium-sized enterprises (SMEs) 21. The reliability of these LLM-generated explanations is an ongoing area of research, as these models can produce plausible yet factually incorrect information, commonly referred to as "hallucinations" 21. Addressing this issue is critical for constructing trustworthy LLM-based cybersecurity solutions 21.
Transfer Learning: Fine-tuning pre-trained models on domain-specific data has become a widespread strategy in LLM applications for both bug triage and code verification 20. Techniques like Parameter-Efficient Fine-Tuning (PEFT), specifically Low-Rank Adaptation (LoRA), facilitate efficient fine-tuning with reduced computational overhead while preserving the original model's representational capacity 18. This fine-tuning is crucial for adapting LLMs to specific bug detection tasks, enabling the identification of errors even in the absence of specific test cases by leveraging annotated datasets 19.
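LoRA's core idea is numeric rather than framework-specific: a frozen weight matrix W is augmented with a trainable low-rank product, W' = W + (α/r)·BA, where B is d×r, A is r×k, and r is much smaller than d and k, so only B and A are updated during fine-tuning. A dependency-free sketch with tiny matrices (in practice this is applied per attention projection via libraries such as Hugging Face PEFT):

```python
def matmul(A, B):
    """Plain-Python matrix multiply for the sketch."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_update(W, A, B, alpha, r):
    """W' = W + (alpha / r) * B @ A — only A (r x k) and B (d x r) are trained."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (identity, for clarity)
B = [[1.0], [0.0]]            # d x r, r = 1
A = [[0.0, 2.0]]              # r x k
W2 = lora_update(W, A, B, alpha=2.0, r=1)
print(W2)  # [[1.0, 4.0], [0.0, 1.0]]
```

The parameter savings come from the shapes: for a d×k layer, full fine-tuning trains d·k values while LoRA trains only r·(d + k), a large reduction when r is small relative to the layer dimensions.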
Multi-modal Agent Systems: While the provided research does not extensively detail explicit "multi-modal agent systems," the emerging concept of "hybrid approaches" and integration with other tools points towards similar functionalities. Future research directions propose combining LLM ranking with graph-derived priors or developer profile embeddings—which could encompass diverse data like expertise, components, and recency—to refine candidate sets and improve Top-1 assignment accuracy 18. LLMs are also being integrated with traditional static analyzers and formal verification tools to enhance their capabilities 20. Agentic approaches are recognized as a key strategy in code verification 20. Tools like PentestGPT exemplify agentic architectures through self-interacting modules (inference, generation, parsing) that share intermediate results in a recursive feedback cycle to tackle complex tasks such as penetration testing 19.
The use of real bug reports and developer identifiers sourced from publicly available issue trackers raises significant ethical considerations within automated triage systems 18. A major concern is the susceptibility of LLMs to adversarial attacks, which include prompt injection, jailbreaking attacks, data poisoning, and backdoor attacks 19. Ethical risks also extend to "dual-use" scenarios, where models designed for defensive purposes could potentially be exploited for offensive actions, alongside structural biases and privacy risks inherent in data processing 19. To mitigate these risks, it is imperative to implement robust safeguards such as access controls, rely on carefully audited datasets, employ output filtering mechanisms, and adhere strictly to responsible AI frameworks 19. The potential for LLMs to generate "hallucinations"—plausible but factually incorrect information—also poses a considerable risk, especially when providing explanations for security alerts, necessitating robust mechanisms to ensure accuracy and prevent misleading information 21. Observed biases, such as political bias found in models like ChatGPT, underscore the continuous need for vigilant monitoring and ethical balancing in the deployment of these advanced systems 19.