Continuous code quality monitoring is a modern approach to software quality assurance that integrates quality checks and fixes directly into the development workflow throughout the entire software development lifecycle (SDLC). It emphasizes early and frequent testing, typically within continuous integration (CI) and continuous delivery (CD) pipelines, so that issues are identified and addressed as soon as possible. The primary goals are to accelerate software delivery, improve product quality, safeguard against security vulnerabilities, and ensure the maintainability, scalability, and security of the software.
Traditionally, code quality monitoring, often referred to as traditional testing, employed a structured, phase-based approach that commonly followed a waterfall methodology 1. In this model, testing typically occurred as a distinct phase after development completion 1. Key characteristics included phase-based execution, extensive documentation, manual execution, and a primary focus on defect detection 1. Traditional testing typically encompassed several phases: unit testing by developers, integration testing for combined units, system testing against specified requirements, and acceptance testing by end-users or customers 1.
While traditional methods offered advantages such as a structured framework, in-depth documentation, a strong focus on defect detection, clear roles and responsibilities, and a proven track record 1, they also presented significant drawbacks. They were time-consuming and resource-intensive, with manual code reviews often proving inconsistent. Traditional approaches frequently resulted in limited test coverage due to manual execution, reduced adaptability to changing requirements, and delayed feedback, which increased resolution costs 1. Reliance on manual processes also created potential bottlenecks in the development lifecycle 1.
The integration of Artificial Intelligence (AI) into continuous code quality monitoring marks a transformative shift, embedding intelligent algorithms that learn, adapt, and optimize testing cycles. This transforms quality assurance from a series of manual tasks into a seamless, automated process, enabling rapid execution of complex tasks, defect detection, and informed decision-making 2. Several core concepts, outlined below, underpin this integration.
To illustrate the role of AI in this context, the table below summarizes the AI/ML techniques applied to enhance the detection, remediation, and prevention of code quality issues, a fundamental shift from traditional monitoring paradigms:
| AI/ML Technique | Description | Applications in Code Quality |
|---|---|---|
| Machine Learning (ML) | Algorithms analyze historical test data and large code datasets to identify patterns, improve test case selection, and predict software defects. | Bug Detection: Predicting potential defects, identifying correlations between code changes and failures, detecting anomalies 4. Maintainability: Recognizing coding best practices and common mistakes 3. |
| Natural Language Processing (NLP) | Interprets code documentation and inline comments to understand developer intent 3. | Code Documentation: Aligning code analysis with developer intent and business objectives 3. |
| Deep Learning | Performs advanced analyses based on an extensive understanding of code patterns and logic. | Security: Identifying sophisticated vulnerabilities in code repositories 4. Bug Detection: Scaling real-time anomaly detection in dynamic systems 4. |
| Predictive Analytics | Analyzes historical trends and data patterns to forecast future events or potential issues. | Bug Detection: Forecasting potential bugs and architectural weaknesses before they arise. Performance: Assessing scalability challenges 3. |
| Static Code Analysis | Examines code without executing it to identify potential issues. AI enhances this by automating error detection and suggesting immediate fixes 2. | Bug Detection: Identifying syntax errors and potential bugs 3. Security: Detecting security vulnerabilities 3. Style Adherence: Enforcing coding standards, code linting. |
| Dynamic Analysis | Executes the code and monitors its behavior at runtime 3. | Bug Detection: Detecting runtime errors 3. Performance: Identifying performance bottlenecks 3. |
| Automated Code Remediation | AI/ML models provide consistent, reliable, and highly accurate remediation recommendations tailored to specific codebases and security policies 5. | Bug Detection & Security: Delivering precise fix instructions and accelerating the remediation of security flaws 5. |
| Code Duplication Detection | AI systems identify redundant code segments 3. | Maintainability: Promoting adherence to DRY (Don't Repeat Yourself) principles 3. |
The integration of Artificial Intelligence (AI) and Machine Learning (ML) is fundamentally transforming continuous code quality monitoring, shifting it from predominantly manual processes to intelligent, predictive, and automated systems. This section provides an in-depth overview of the core AI/ML techniques employed, detailing their underlying mechanisms, applications across various code quality aspects, and a critical analysis of their strengths and weaknesses.
Machine Learning algorithms form the bedrock of AI-driven code quality monitoring, learning from extensive historical data to identify patterns and predict potential issues.
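To make this concrete, the following is a minimal sketch of ML-based defect prediction, assuming scikit-learn and a small set of hypothetical per-module metrics (lines of code, cyclomatic complexity, recent churn) labeled with historical defect outcomes; real systems train on far larger datasets.

```python
# A minimal defect-prediction sketch, assuming scikit-learn is available.
# The feature names and toy data are illustrative, not from the source.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Toy historical data: [lines_of_code, cyclomatic_complexity, recent_churn]
X = [
    [120, 4, 2], [950, 22, 15], [300, 8, 1], [1400, 35, 30],
    [80, 2, 0], [640, 18, 9], [210, 6, 3], [1100, 28, 22],
]
y = [0, 1, 0, 1, 0, 1, 0, 1]  # 1 = module had a post-release defect

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("precision:", precision_score(y_test, preds, zero_division=0))
print("recall:", recall_score(y_test, preds, zero_division=0))
```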
Deep Learning, a subset of ML, utilizes multi-layered neural networks to capture intricate patterns and dependencies, proving particularly effective for modeling and generating source code.
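As an illustration, the sketch below uses a code-pretrained transformer for masked-token prediction, one simple way deep models expose their learned understanding of code patterns. It assumes the Hugging Face `transformers` library and the public `microsoft/codebert-base-mlm` checkpoint, and is illustrative rather than a production analyzer.

```python
# A minimal deep-learning sketch: masked-token prediction over source code.
# Assumes the `transformers` package and the code-pretrained checkpoint below.
from transformers import pipeline

fill = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

# The model ranks plausible tokens for the masked position, reflecting the
# code patterns it learned during pretraining.
snippet = "if x is <mask>: return None"
for candidate in fill(snippet, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```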
NLP techniques enable AI systems to understand, interpret, and generate human language, bridging the gap between natural language requirements and executable code.
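A hedged sketch of this idea follows: a user story is turned into a pytest skeleton by an LLM. It assumes the `openai` Python client (v1+) and a valid API key; the model name and prompt wording are illustrative choices, not prescribed by the sources.

```python
# A hedged sketch of NLP-driven test generation from a user story.
# Assumes the `openai` client v1+ with OPENAI_API_KEY set; model name is
# an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

user_story = (
    "As a customer, I want to reset my password via an emailed link "
    "so that I can regain access to my account."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You write concise pytest test skeletons from user stories."},
        {"role": "user", "content": f"Generate pytest tests for: {user_story}"},
    ],
)
print(response.choices[0].message.content)
```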
Reinforcement Learning empowers AI agents to learn optimal strategies through interaction with an environment, making it suitable for dynamic optimization problems in code quality.
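The toy example below illustrates the flavor of RL-driven test prioritization with an epsilon-greedy bandit: tests with a higher historical failure rate earn more "value" and are scheduled first. The failure probabilities are simulated assumptions; a production system would learn from real CI outcomes.

```python
# A minimal epsilon-greedy bandit sketch for test prioritization.
# Pure Python; failure rates are simulated for illustration.
import random

tests = ["test_auth", "test_payment", "test_search", "test_export"]
true_fail_rate = {"test_auth": 0.05, "test_payment": 0.30,
                  "test_search": 0.10, "test_export": 0.02}
value = {t: 0.0 for t in tests}   # estimated "informativeness" of each test
count = {t: 0 for t in tests}
epsilon = 0.1

for episode in range(2000):
    # Explore occasionally; otherwise pick the test most likely to find a bug.
    if random.random() < epsilon:
        t = random.choice(tests)
    else:
        t = max(tests, key=lambda x: value[x])
    reward = 1.0 if random.random() < true_fail_rate[t] else 0.0  # bug found?
    count[t] += 1
    value[t] += (reward - value[t]) / count[t]  # incremental mean update

ranking = sorted(tests, key=lambda x: value[x], reverse=True)
print("suggested execution order:", ranking)
```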
This area leverages AI and ML models, particularly Large Language Models (LLMs), to automate the suggestion and generation of code that fixes identified issues and creates new functionality.
These techniques are critical for ensuring the trustworthiness, fairness, and reliability of AI/ML systems used in code quality monitoring, especially given the "black box" nature of complex models.
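For instance, here is a hedged sketch of SHAP-based interpretability for a defect-prediction model, assuming the `shap` and `scikit-learn` packages and the same illustrative module metrics used in the earlier example.

```python
# A hedged interpretability sketch using SHAP on a toy defect predictor.
# Assumes `shap`, `scikit-learn`, and `numpy`; data is illustrative.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

X = np.array([[120, 4, 2], [950, 22, 15], [300, 8, 1], [1400, 35, 30]])
y = np.array([0, 1, 0, 1])
model = RandomForestClassifier(random_state=0).fit(X, y)

# Per-feature contributions explain *why* a module was flagged as
# defect-prone, not just that it was flagged.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(shap_values)
```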
These techniques focus on maintaining the performance and relevance of AI/ML models deployed in continuous code quality monitoring systems as data and codebases evolve.
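The sketch below shows one common drift check, a two-sample Kolmogorov-Smirnov test comparing a training-time feature distribution against recent production data. It assumes SciPy; the feature (commit size) and the significance threshold are illustrative assumptions.

```python
# A minimal data-drift detection sketch with a two-sample KS test.
# Assumes `scipy` and `numpy`; the feature and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_commits = rng.normal(loc=200, scale=50, size=1000)   # baseline
recent_commits = rng.normal(loc=320, scale=60, size=1000)     # production

stat, p_value = ks_2samp(training_commits, recent_commits)
if p_value < 0.01:
    # Distribution shift detected: trigger the automated retraining pipeline.
    print(f"drift detected (KS={stat:.3f}, p={p_value:.2e}); retrain model")
else:
    print("no significant drift; keep serving the current model")
```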
Crucial for AI/ML systems, these techniques ensure that code quality monitoring tools themselves perform efficiently and reliably under various loads and deployment scenarios.
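As a minimal illustration of scalability testing, the sketch below fires concurrent requests at a hypothetical analysis endpoint and reports latency percentiles. It assumes the `requests` library and a placeholder URL, and stands in for dedicated load-testing tools rather than replacing them.

```python
# A minimal latency/throughput sketch for load-testing an AI-backed
# analysis service. The endpoint URL is a placeholder assumption.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/analyze"  # hypothetical analysis endpoint

def timed_call(_):
    start = time.perf_counter()
    requests.post(URL, json={"code": "def f(): pass"}, timeout=10)
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=50) as pool:   # 50 concurrent "users"
    latencies = sorted(pool.map(timed_call, range(500)))

p50 = latencies[len(latencies) // 2]
p95 = latencies[int(len(latencies) * 0.95)]
print(f"p50={p50*1000:.0f} ms  p95={p95*1000:.0f} ms")
```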
The following tables summarize the discussed AI/ML techniques, their implementation for various code quality aspects, and their inherent strengths and weaknesses.
| Code Quality Aspect | AI/ML Techniques & Implementation |
|---|---|
| Bug Detection | ML models (e.g., Decision Trees, Ensemble Methods) and DL models are trained on historical bug data to predict and detect bugs. AI-powered static code analysis tools automate error detection and suggest fixes. |
| Security Vulnerability Identification | ML models are trained on secure coding practices to flag potential vulnerabilities such as SQL injections. AI-driven remediation tools provide bespoke fixes for security flaws (e.g., Veracode Fix). GitHub CodeQL uses ML to identify vulnerabilities. |
| Maintainability Assessment | ML algorithms are used to assess maintainability. AI-driven code analysis identifies "code smells" and helps manage technical debt. Automated code reviews contribute by ensuring adherence to standards and best practices. |
| Performance Optimization | ML models can identify performance bottlenecks in code. RL algorithms are being used for compiler optimization and resource allocation strategies. AI can optimize Infrastructure as Code (IaC) such as Terraform scripts. |
| Adherence to Coding Standards | AI tools perform code linting to ensure compliance with predefined coding standards and reduce common mistakes. ML models can be trained to enforce coding standards consistently. |
| Test Automation/Optimization | AI-based analysis optimizes test coverage by linking code changes to relevant tests, reducing redundancy. NLP generates test cases from user stories. RL agents streamline test-case generation and prioritization. |
| Real-time Monitoring/Feedback | AI-driven anomaly detection integrates into CI/CD pipelines to monitor code quality and system behavior as changes are deployed. ML tools provide real-time feedback during the coding process within IDEs. |
| Automated Code Suggestions/Generation | AI/ML models, especially LLMs, provide real-time secure code suggestions, generate code snippets, and create secure code from natural language. |
| Flaw Remediation | AI/ML models deliver consistent, reliable, and highly accurate remediation recommendations tailored to specific codebases and security policies 5. |
| Bias Detection & Interpretability | Fairness metrics, explainability tools (SHAP, LIME), adversarial testing identify and mitigate biases, and enhance transparency in AI models 7. |
| Drift Detection | Automated retraining pipelines and drift detection mechanisms monitor data and concept drift to maintain model performance and accuracy 7. |
| Scalability Testing | Load testing, stress testing, and A/B testing assess latency, throughput, and resilience of AI-powered systems under high user traffic 7. |
| AI/ML Technique | Strengths | Weaknesses/Limitations |
|---|---|---|
| Machine Learning | Captures complex patterns; objective, consistent feedback; adaptability; improved accuracy over traditional methods. | Requires large, high-quality, and diverse training data; struggles with rare/unseen cases; susceptible to overfitting; potential for false positives/negatives. |
| Deep Learning | Excellent for complex pattern recognition; high accuracy; context-aware code analysis; supports code completion and suggestions. | High computational cost; large data dependency; "black box" interpretability issues; potential for subtle errors/security flaws in generated code. |
| Natural Language Processing | Transforms natural language requirements into test cases; facilitates documentation; bridges human language to code. | Can struggle with ambiguous or incomplete NL specifications; less effective for complex code generation tasks. |
| Reinforcement Learning | Learns optimal strategies for long-term goals; adaptable; useful for complex optimization tasks; can improve model adaptability. | Sample inefficiency; stability and convergence issues; generalization and transferability challenges; sensitive to reward function design; high computational cost; potential for biases. |
| AI-Driven Remediation & Code Generation | Speed and efficiency; high accuracy of tailored fixes; reduced security debt. | Inaccuracy/insecurity of generated code; hidden dependencies; lack of domain expertise; over-reliance risk 6. |
| Model Interpretability & Bias Detection | Ensures ethical AI; builds trust; aids debugging 7. | Complexity of interpreting models; continuous effort for monitoring 7. |
| Continuous Testing & Drift Detection | Adaptability to evolving models; stability assurance; early warning of model decay 7. | Robust CI/CD pipeline and infrastructure requirements; complexity of testing evolving ML models 7. |
| Performance & Scalability Testing | Ensures production readiness; identifies scalability challenges and optimizes resource utilization. | Complexity of setup for realistic simulations; high infrastructure cost for comprehensive tests 7. |
Building on these AI and machine learning techniques, continuous code quality monitoring introduces a transformative approach to software development. It optimizes the analysis of source code, identifies flaws, and suggests improvements, thereby enhancing security, accuracy, and development speed 8. While AI coding tools are gaining rapid adoption, with 76% of developers reporting as of September 2024 that they use or plan to use them, opinions on AI accuracy remain divided 11. This section analyzes the benefits, challenges, trade-offs, and real-world implications of integrating AI into continuous code quality monitoring.
AI-driven tools significantly improve traditional code analysis methods, offering numerous advantages:
Improved Efficiency and Productivity: AI automates tedious and repetitive tasks, allowing developers to concentrate on more complex problem-solving and innovation 11. It accelerates the code review process by quickly highlighting issues and recommending actionable fixes, reducing bottlenecks in large and distributed teams 9. Automated scans can process thousands of lines of code in seconds, freeing human reviewers to focus on design and edge cases 10. An example of this efficiency is IBM watsonx Code Assistant for Z, which streamlines mainframe application lifecycle management, making it more cost-effective 10.
Enhanced Accuracy and Consistency: Utilizing machine learning, AI tools minimize false positives and improve accuracy compared to traditional methods by continuously learning from new data to identify emerging threats 12. They enforce coding standards and flag inconsistencies, ensuring uniform guidelines across teams 11. AI assessment, being immune to fatigue or bias, guarantees consistent enforcement irrespective of project size or team composition 9. These tools are highly effective at detecting subtle errors and code smells often overlooked in manual reviews, with increasingly context-aware models leading to greater accuracy 10.
Early and Predictive Detection of Issues: AI systems provide continuous monitoring, actively scanning for vulnerabilities in real time as code is written and updated, and generating instant alerts 12. This detects subtle issues missed in manual reviews and catches errors and potential vulnerabilities during early development stages, reducing the likelihood of bugs reaching production 9. AI offers predictive analysis capabilities, forecasting potential issues from historical data, and automates risk assessment and prioritization so teams can focus on critical problems 12 (see the anomaly-detection sketch after this list).
Improved Code Quality, Maintainability, and Security: AI systematically checks for style violations, security faults, and outdated patterns, referencing best practices 9. It proactively identifies common code smells, suggests refactorings, and helps reduce technical debt by keeping code clean and maintainable 9. Advanced security features include context-aware vulnerability assessment, which considers user behavior and data sensitivity, and adaptive learning that continuously refines models as new security threats emerge 12.
Support for Developer Learning and Collaboration: AI tools provide targeted, explainable feedback that helps developers understand not only what is wrong but also why, often with references to documentation or examples 9. This helps less experienced developers internalize best practices, standardizes onboarding for new contributors, and fosters a learning-oriented environment with continuous skill improvement 9. It also improves communication and ensures best practices are consistently applied across projects 8.
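The sketch below illustrates the anomaly-detection idea referenced above, flagging unusual code changes with scikit-learn's IsolationForest. The commit features (files touched, lines changed, hour of day) are illustrative assumptions, not a documented tool's inputs.

```python
# A hedged sketch of anomaly detection on code-change metrics.
# Assumes `scikit-learn` and `numpy`; features and data are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Historical commits: [files_touched, lines_changed, hour_of_day]
history = np.column_stack([
    rng.poisson(3, 500), rng.poisson(40, 500), rng.integers(8, 19, 500)
])

detector = IsolationForest(contamination=0.02, random_state=1).fit(history)

new_commit = np.array([[42, 5800, 3]])  # huge change pushed at 3 a.m.
if detector.predict(new_commit)[0] == -1:
    print("anomalous change: flag for priority review")
```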
Despite the significant benefits, AI-driven code quality monitoring also presents several challenges and limitations:
Contextual Misinterpretation and Limited Creativity: AI tools often struggle to fully comprehend business logic, custom requirements, or domain-specific idioms 8. This can lead to misinterpretation of intent, flagging valid code as problematic and necessitating rework 9. Context-size limits in Large Language Models (LLMs), such as the 32,000-token limit for ChatGPT-based Copilot, pose challenges for analyzing large projects 11 (see the chunking sketch after this list). Moreover, AI lacks the creativity and intuition of experienced programmers, struggling with complex dependencies, poor architecture, or intricate internal designs, and potentially focusing on less significant details 11.
False Positives and Negatives: AI systems are prone to both false positives, where valid code is incorrectly flagged, and false negatives, where genuine issues go undetected 8. A high rate of false positives can overwhelm developers, causing them to ignore important warnings, while false negatives mean real defects can slip into production 9. Such inaccuracies complicate the code review process, leading to wasted time or unaddressed issues 10.
Overreliance and Human Oversight: Developers risk becoming overly dependent on AI-generated recommendations, potentially diminishing their own expertise and critical thinking 8. This overreliance can result in unchecked technical debt, propagation of suboptimal practices, or superficial fixes, ultimately reducing codebase quality over time, especially for nuanced architectural decisions 9. Incorrect AI suggestions can also significantly affect project functionality 11.
Data Requirements and Integration Complexities: AI models require vast datasets for training, and algorithms trained on biased data may make unfair or incorrect predictions 8. Integrating these tools into existing development environments and CI/CD pipelines demands careful planning 12. Furthermore, some functionality of tools like GitHub Copilot is limited in certain IDEs, such as IntelliJ and other JetBrains products, creating integration hurdles 11. Scalability issues can also arise, as some AI systems struggle to efficiently analyze very large codebases 8.
Explainability, Ethical, and Security Concerns: AI tools may fall short if objectives are not clearly defined, requiring users to formulate precise queries 11. The "illusion of quality" in AI responses necessitates additional verification 11. Security risks are significant, as some AI-powered code review systems require access to proprietary code, raising privacy concerns 8. Additionally, AI can have difficulty detecting specific security vulnerabilities, leading many experienced developers to prefer specialized tools such as SonarQube for this purpose 11. Bias in AI models, derived from potentially biased training datasets, can produce skewed or inaccurate results 8.
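One practical workaround for the context limits mentioned above is token-budgeted chunking, sketched below with the `tiktoken` tokenizer. The 32,000-token figure comes from the text; the chunking strategy and file name are illustrative assumptions.

```python
# A hedged sketch of working within LLM context limits: count tokens and
# split a large file into budget-sized chunks. Assumes `tiktoken`; the
# file name and reserved headroom are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 32_000 - 2_000  # reserve room for the prompt and the reply

def chunk_source(source: str, budget: int = CONTEXT_BUDGET):
    tokens = enc.encode(source)
    # Yield decoded slices small enough to analyze one at a time.
    for start in range(0, len(tokens), budget):
        yield enc.decode(tokens[start:start + budget])

with open("large_module.py") as f:   # hypothetical large file
    for i, chunk in enumerate(chunk_source(f.read())):
        print(f"chunk {i}: {len(enc.encode(chunk))} tokens")
```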
Adopting AI for code quality requires strategic considerations to maximize benefits and mitigate risks effectively:
Balancing Automation with Human Oversight: AI tools serve best as complementary assets, augmenting rather than replacing human expertise 11. Achieving optimal results requires a delicate balance of automation and human control 8. Developers should critically review AI-generated code, treating it as a draft, and double-check recommendations, incorporating human verification where necessary 8.
Strategic Integration and Customization: Organizations must assess their current coding practices, select tools that align with specific needs, and configure settings based on project requirements 12. Tools should be seamlessly integrated into existing development workflows and CI/CD pipelines to enable automatic scans at various stages for maximum effectiveness 12.
Defining Context and Objectives: Clear definition of objectives and project relevance before AI deployment is crucial 11. Insufficient or poorly considered information provided to AI can lead to inaccurate analysis 11. AI is most effective when applied to individual system layers, such as security or authentication, rather than as a comprehensive solution for an entire application 11.
Continuous Learning and Adaptation: Regularly updating AI models is essential to adapt to new security threats and evolving codebases 12. Organizations should invest in infrastructure that supports extensive monitoring and feedback loops. Regular training and knowledge sharing among teams are also vital to keep pace with changing technologies and practices 9.
Ethical and Security Standards: Companies must deploy AI efficiently while maintaining human control throughout the review process 8. Implementing ethical standards and setting boundaries are critical to prevent misuse, especially to safeguard against security risks associated with proprietary code access 8. Ensuring AI technology meets stringent security standards is paramount to prevent data exposure 8.
Cost-Benefit Analysis: A thorough cost-benefit analysis is crucial, evaluating the costs of implementing AI-driven solutions against the financial implications of potential security breaches. This includes considering savings from reduced incidents, improved compliance, and lower remediation costs 12.
The practical application of AI in code quality monitoring highlights both its strengths and current limitations across various real-world scenarios:
Integration with CI/CD Pipelines: AI tools are often embedded seamlessly into Continuous Integration/Continuous Deployment (CI/CD) pipelines, enabling automatic scans at various development stages 12. This approach ensures code is scanned for vulnerabilities early and continuously, maintaining high quality standards without impeding release cycles 9 (a minimal quality-gate sketch follows this list).
Use of Specific Tools: Platforms such as GitHub Copilot, DeepCode (by Snyk), Mend, and IBM watsonx Code Assistant for Z are applied to tasks ranging from code suggestion to security remediation and mainframe modernization; their features and underlying AI methodologies are compared in the tool table below.
Addressing Language Model Limitations: While tools like GitHub Copilot can interpret a programmer's coding style and rely on existing codebase structures, achieving consistent solutions requires providing AI with all relevant project and team information 11. Advanced LLMs such as GPT-4 or Claude 4, trained on vast datasets, can understand logic flows, detect non-obvious bugs, and offer human-like suggestions, but they are still subject to context size limitations 11.
Developer Experience: AI offers valuable assistance, particularly in measurable and structured code writing. Experienced developers recognize its value in minimizing time spent on simple tasks, allowing them to focus on business processes or alternative solutions 11. However, a significant risk lies in solutions that users, especially inexperienced developers, do not fully understand 11.
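To ground the CI/CD integration pattern above, here is a minimal quality-gate sketch that lints only changed Python files and fails the pipeline on findings. It assumes `flake8` is installed and a git diff against `origin/main`, a common but vendor-neutral convention, not any specific product's setup.

```python
# A minimal CI quality-gate sketch: lint changed files, fail on findings.
# Assumes `flake8` and a git checkout with an origin/main ref.
import subprocess
import sys

changed = subprocess.run(
    ["git", "diff", "--name-only", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout.split()

py_files = [f for f in changed if f.endswith(".py")]
if py_files:
    result = subprocess.run(["flake8", *py_files])
    sys.exit(result.returncode)  # nonzero exit fails the CI job
```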
AI-driven continuous code quality monitoring has led to the emergence of various tools and platforms designed to automate and optimize code analysis, identify flaws, and suggest improvements. These solutions aim to enhance the security, accuracy, and speed of software development by leveraging artificial intelligence and machine learning 8. The landscape of these tools, their features, and underlying AI methodologies are diverse, catering to different aspects of code quality and development environments.
| Tool/Platform | Key Features | Underlying AI Methodologies | Integration Capabilities | Market Positioning/Context |
|---|---|---|---|---|
| GitHub Copilot | Offers context-aware code suggestions and issue handling 11. Interprets a programmer's coding style and helps in measurable, structured code writing 11. Minimizes time spent on simple tasks, allowing developers to focus on complex problems 11. | Primarily utilizes Large Language Models (LLMs), with context size limitations (e.g., a 32,000 token limit for ChatGPT-based Copilot) 11. | Popular among developers 11. Relies on existing codebase structure 11. May have limitations in specific IDEs like IntelliJ and JetBrains regarding certain functionalities 11. | A popular AI coding tool that augments developer productivity and efficiency 11. |
| DeepCode (by Snyk) | Searches millions of open-source repositories for security issues and coding inefficiencies 8. Capable of detecting and automatically fixing vulnerabilities 9. | Employs machine learning 8. Combines symbolic and generative AI to enhance detection and autofix capabilities 9. | Integrates into development workflows to provide insights into security and efficiency, though specific integration details are not provided 8. | Focuses on AI-driven security vulnerability detection and code efficiency improvements by leveraging vast open-source code knowledge 8. |
| Mend | Identifies and remediates security issues in both proprietary and open-source code 9. | Utilizes AI in its core functionality for issue identification and remediation 9. | Designed for integration into CI/CD pipelines to enable real-time scanning and policy enforcement 9. | Specializes in AI-powered security issue identification and remediation, actively supporting continuous integration and delivery processes 9. |
| IBM watsonx Code Assistant for Z | Accelerates the mainframe application lifecycle and streamlines modernization efforts through generative AI 10. Enables developers to efficiently refactor, optimize, and modernize code 10. | Leverages generative AI tailored for code manipulation and optimization 10. | Specifically designed for and integrated within mainframe development environments 10. | A key solution for modernizing and improving the efficiency and cost-effectiveness of legacy mainframe applications 10. |
| IBM AIOps Insights | Enhances IT issue resolution speed by gathering data from client IT environments 10. Identifies correlations and potential issues, demonstrating AI's capability to uncover problems overlooked by manual review 10. | Employs Large Language Models (LLMs) and generative AI for data analysis and correlation 10. | Collects and analyzes data from diverse client IT environments to provide operational insights 10. | An AIOps tool that exemplifies how AI can identify complex issues in IT operations, complementing code quality monitoring by ensuring overall system health and problem detection 10. |
These tools illustrate the growing adoption of AI across various facets of software development, from direct code generation and analysis to operational intelligence, all contributing to enhanced code quality and system reliability. While they offer significant advantages, their effective deployment often requires careful integration and human oversight 8.
This section details the most recent advancements, significant trends, and ongoing academic research initiatives in AI for continuous code quality monitoring, with a specific focus on the period between 2023 and 2025. It covers novel AI algorithms, methodologies, integration with large language models (LLMs), and their impact on the software development lifecycle.
Research and industry reports from 2023-2025 highlight significant advancements in using AI, particularly LLMs, for various code quality tasks.
Recent studies have benchmarked LLMs, including OpenAI GPT-4.0 and DeepSeek-V3, for code smell detection across multiple programming languages such as Java, Python, JavaScript, and C++ 13. A preprint accepted by EASE25 (April 2025) introduced a structured methodology and evaluation matrix, finding that GPT-4.0 achieved higher precision (0.79) than DeepSeek-V3 (0.42), though both exhibited relatively low recall 13. This study also conducted a cost analysis comparing LLM-based detection with traditional static analysis tools such as SonarQube, and identified key code smell categories including Bloaters, Dispensables, Couplers, Object-Orientation Abusers, and Change Preventers 13. Furthermore, the iSMELL project (2024-ASE) focuses on assembling LLMs with expert toolsets for code smell detection and refactoring 14. A Master's Thesis (December 2025) analyzing code smells in open-source LLMs revealed a "Syntax–Logic Gap": 98.5% of generated code is syntactically valid, yet 52% to 78% of it contains at least one code smell 15. This research indicates that LLM-generated code tends to have structural errors such as undefined variables and namespace collisions, while human-written code more often has stylistic violations 15.
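For readers unfamiliar with the metrics cited above, the toy sketch below shows how precision and recall are computed when an LLM's detected smells are compared against a labeled ground truth; the detection sets are invented for illustration.

```python
# A toy sketch of how benchmark precision/recall figures are derived.
# The ground-truth and detected sets are illustrative, not study data.
ground_truth = {"LongMethod:Foo.run", "GodClass:Bar", "DataClass:Baz"}
llm_detected = {"LongMethod:Foo.run", "GodClass:Bar", "FeatureEnvy:Qux"}

tp = len(ground_truth & llm_detected)   # correctly flagged smells
fp = len(llm_detected - ground_truth)   # flagged but not real
fn = len(ground_truth - llm_detected)   # real but missed

precision = tp / (tp + fp)  # of what the LLM flagged, how much was right
recall = tp / (tp + fn)     # of the real smells, how many were found
print(f"precision={precision:.2f} recall={recall:.2f}")
```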
Several papers in 2024 explored LLM-driven vulnerability detection, including "LProtector: An LLM-driven Vulnerability Detection System," "Smart-LLaMA: Two-Stage Post-Training of Large Language Models for Smart Contract Vulnerability Detection and Explanation," and "RealVul: Can We Detect Vulnerabilities in Web Applications with LLM?" 14. An August 2025 study by Sonar quantitatively evaluated five prominent LLMs (Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder 8B) on 4,442 Java coding assignments using SonarQube 16. This study found that LLMs consistently introduce security vulnerabilities, accounting for approximately 2% of total issues discovered, with a high proportion classified as 'BLOCKER' or 'CRITICAL' (e.g., Llama 3.2 90B produced over 70% 'BLOCKER' vulnerabilities) 16. Common vulnerabilities identified include Path-Traversal & Injection, Hard-Coded Credentials, and Cryptography Misconfiguration 16.
Research in 2024 has advanced automated program repair (APR) using LLMs, with notable papers such as "A Comprehensive Survey of AI-Driven Advancements and Techniques in Automated Program Repair and Code Generation," "CORE: Resolving Code Quality Issues using LLMs," and "Prompt Fix: Vulnerability Automatic Repair Technology Based on Prompt Engineering" 14. A 2025 study using GPT-4 on the Defects4J dataset achieved a bug-detection accuracy of 89.7% and a mitigation efficacy of 86.4% 15. AI agents like Google's CodeMender are being deployed, contributing over 70 security fixes to large-scale open-source projects 15. The August 2025 Sonar report also indicated that LLMs generate bugs, which constituted 5-8% of total issues, with control-flow mistakes and API contract violations being common categories 16.
LLMs are increasingly embedded across the entire software development lifecycle, augmenting human roles by providing real-time assistance in code improvement and automating tasks like code smell detection and refactoring 13. An empirical study in 2024 explored the potential of LLMs in automated software refactoring, and collaborative LLM-based agents are being developed for code reviewer recommendations (2024-ASE) 14. Prompt engineering plays a crucial role, with simple constraint-based prompts shown to reduce code smell density by 7-15% 15.
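A hedged sketch of constraint-based prompting follows: quality constraints are prepended as a system message to steer generation away from common smells. The wording and the `openai` client call are illustrative assumptions, not the cited studies' exact setup.

```python
# A hedged sketch of constraint-based prompting to reduce code smells.
# Assumes the `openai` client v1+; constraints and model are illustrative.
from openai import OpenAI

CONSTRAINTS = (
    "Constraints: keep functions under 30 lines, no duplicated logic, "
    "no magic numbers, define all variables before use, follow PEP 8."
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": CONSTRAINTS},
        {"role": "user", "content": "Write a function that parses an "
                                    "ISO-8601 date string into a datetime."},
    ],
)
print(response.choices[0].message.content)
```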
While explicit "MLOps practices" are not detailed in the provided sources, the extensive research into benchmarking, improving, and optimizing LLM performance for code quality tasks implies an underlying need for robust MLOps to manage these evolving models and their integration into development workflows. The continuous evaluation of models, their output quality, and the impact of different prompting strategies are key aspects that MLOps would address 15.
LLMs are being rapidly adopted in software development, with AI assistants writing an average of 46% of developer code, a trend that enhances developer productivity and accelerates development. Despite achieving functional correctness, LLM-generated code often suffers from significant non-functional quality issues, such as poor structure and low maintainability 15. A critical concern is that developers using AI assistants can produce less secure code while simultaneously showing greater confidence in its security 16. Consequently, LLM-generated code is not immediately production-ready and requires rigorous verification, with static analysis identified as an essential protective mechanism for detecting latent defects 16. A "potential paradox" exists: more capable LLMs may generate more sophisticated solutions that, while functionally robust, introduce a larger surface area for defects, potentially leading to more static analysis findings 16.
Table: Distribution of issue types by LLM, shown as counts and as percentages of all issues found (August 2025 study)
| LLM Model | Bugs | Bugs (%) | Vulnerabilities | Vulnerabilities (%) | Code Smells | Code Smells (%) | Source |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 423 | 5.85 | 141 | 1.95 | 6,661 | 92.19 | 16 |
| Claude 3.7 Sonnet | 352 | 5.35 | 116 | 1.76 | 6,108 | 92.88 | 16 |
| GPT-4o | 406 | 7.41 | 112 | 2.05 | 4,958 | 90.54 | 16 |
| Llama 3.2 90B | 398 | 7.71 | 123 | 2.38 | 4,638 | 89.90 | 16 |
| OpenCoder-8B | 247 | 6.33 | 67 | 1.72 | 3,589 | 91.95 | 16 |