Fine-tuned Code Large Language Models (LLMs) for agents represent a pivotal advancement in artificial intelligence, enabling machines to exhibit sophisticated behaviors such as planning, tool use, and reflection within coding contexts. This section introduces the fundamental concepts of Code LLMs, fine-tuning, and AI agents, detailing their integration to unlock agentic capabilities. It explores the architectural foundations, specific fine-tuning methodologies, and key frameworks that facilitate these intelligent behaviors, setting the stage for a comprehensive understanding of the field.
The foundation of fine-tuned Code LLMs for agents rests upon three core concepts: Code LLMs themselves, the techniques used for fine-tuning them, and the definition of AI agents.
Code LLMs are transformer-based deep neural networks specifically engineered for sequence-to-sequence tasks in software development 1. These models are pre-trained on extensive corpora of code and code-related data, allowing them to comprehend problem descriptions and generate functional code 1. Prominent examples illustrating their evolution and capabilities include:
| Code LLM | Year | Key Features | HumanEval Pass@1 (first-attempt success) |
|---|---|---|---|
| OpenAI Codex | 2021 | GPT-3 descendant, fine-tuned on public GitHub code, translates natural language to Python, powers GitHub Copilot 1. | 28.8% 1 |
| DeepMind AlphaCode | 2022 | Focused on competitive programming, 41 billion parameters, generates numerous candidate programs filtered by test cases, achieved top 54% rank in Codeforces 1. | N/A |
| OpenAI GPT-4 | 2023 | General-purpose multi-modal LLM with high coding abilities, trained on broad web text and code, fine-tuned with human feedback 1. | 67% 1 |
| Meta Code Llama | 2023 | Open-source, built on LLaMA-2, trained on 500 billion code tokens, includes variants (7B, 13B, 34B) and specialized versions like Code Llama - Python and Code Llama - Instruct 1. | Nearly 50% (34B version) 1 |
| StarCoder | 2023 | Model from the BigCode initiative, contributes to open research in code LLMs 1. | N/A |
Fine-tuning is the process of adapting pre-trained LLMs to excel in specific tasks or domains, enhancing their accuracy, contextual relevance, and operational efficiency 2. Key techniques for developing agentic capabilities include instruction tuning, reinforcement learning from human feedback (RLHF), and parameter-efficient methods such as QLoRA, which are examined below.
AI agents are programs designed to autonomously execute tasks on behalf of a user 6. Their core characteristics include the ability to devise plans with a series of steps to accomplish complex tasks, use function calling to interact with external tools and data sources, and learn from feedback while storing information in memory to improve future performance 6. Agents are particularly effective in dynamic, underspecified environments where the exact sequence of steps is not predefined and may require exploration 7.
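For illustration, tool use is typically exposed to the model through declarative schemas that the agent can select from at runtime; the sketch below uses the OpenAI-style function-calling format, with a hypothetical run_unit_tests tool.

```python
# Hypothetical tool description in the OpenAI-style function-calling format.
# The LLM sees this schema and, when appropriate, emits a call whose arguments
# match the JSON Schema under "parameters"; the agent runtime then executes it.
run_tests_tool = {
    "type": "function",
    "function": {
        "name": "run_unit_tests",
        "description": "Run the project's unit tests and return a report.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test directory"},
                "verbose": {"type": "boolean"},
            },
            "required": ["path"],
        },
    },
}
```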
When fine-tuned, Code LLMs serve as the "brain" for AI agents, providing the necessary reasoning and generation capabilities for complex tasks. Instruction-tuned Code LLMs enhance an agent's ability to understand and execute natural language instructions for coding tasks; for instance, Code Llama - Instruct is fine-tuned on prompt-response pairs to align with developer requests 1. RLHF further refines Code LLMs to produce more helpful and secure code, steering them away from insecure practices or bugs 1.
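As a concrete, deliberately generic illustration of such prompt-response pairs, instruction-tuning data is usually serialized into a chat-style template before supervised fine-tuning; the markers below are placeholders, not Code Llama's actual template.

```python
def to_training_example(instruction: str, response: str) -> str:
    """Serialize one prompt-response pair for supervised instruction tuning."""
    return (
        "<|user|>\n" + instruction.strip() + "\n"
        "<|assistant|>\n" + response.strip()
    )

example = to_training_example(
    "Write a Python function that reverses a singly linked list.",
    "def reverse(head):\n    prev = None\n    while head:\n"
    "        head.next, prev, head = prev, head, head.next\n    return prev",
)
```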
The combination of LLMs with external "tools"—pieces of code that perform specific actions—enables agents to dynamically select, execute, and evaluate these tools based on the LLM's reasoning 8. Prompting techniques like ReAct facilitate a "Thought-Action-Observation" cycle, allowing LLMs to engage in planning and reflection by iteratively refining their actions and responses 9. Through fine-tuning, Code LLMs can develop specialized personas, process instructions while maintaining context, manage complex tasks through multi-turn conversations, and plan solutions effectively 10.
The successful fine-tuning of Code LLMs for agentic tasks relies on a combination of these methodological approaches, applied within frameworks that structure how the model plans, acts, and reflects.
Several frameworks facilitate the integration of LLMs into agentic workflows, often employing specific prompting strategies to enable advanced behaviors.
ReAct (Reason + Act) prompting is a method in which an AI systematically works through a problem step by step rather than simply producing a static response 9. The cycle proceeds as follows: the AI forms a Thought about the next logical step; performs an Action based on that thought (e.g., cross-referencing or deeper analysis); records the resulting Observation; and adjusts its subsequent steps until it reaches a Final Answer 9. ReAct outperforms basic prompting because it enables dynamic reasoning and multi-step problem-solving, leading to deeper insights and improved results 9.
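A minimal sketch of this Thought-Action-Observation loop, assuming a hypothetical call_llm helper and a toy tool registry (neither belongs to any particular framework):

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with any chat-completion client."""
    raise NotImplementedError

TOOLS = {
    "run_tests": lambda arg: "2 passed, 1 failed: test_edge_case",  # toy stubs
    "read_file": lambda arg: open(arg).read(),
}

def react_loop(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # Thought + Action: ask the model for its reasoning and next move as JSON.
        step = json.loads(call_llm(
            transcript + 'Reply as JSON: {"thought": ..., "action": ..., '
            '"argument": ...} or {"thought": ..., "final_answer": ...}'))
        if "final_answer" in step:
            return step["final_answer"]
        observation = TOOLS[step["action"]](step["argument"])   # Act
        transcript += (f"Thought: {step['thought']}\n"
                       f"Action: {step['action']}({step['argument']})\n"
                       f"Observation: {observation}\n")          # Observe, then loop
    return "Step budget exhausted"
```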
AutoGen, an open-source framework from Microsoft, enables the creation of multi-agent AI applications that leverage ReAct prompting and other techniques to solve complex problems 9. Its architecture supports autonomous agent behavior through flexible agent configuration using system_message to define roles and capabilities, a conversation-driven architecture with persistent context, and integrated planning capabilities via structured group conversations and reflection 10. AutoGen also features seamless tool integration with automatic selection and parameter inference, sophisticated multi-agent collaboration through delegation, secure code execution in Docker-based environments, and teachability, allowing agents to learn from past interactions 9.
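A minimal two-agent sketch in the classic AutoGen (pyautogen 0.2-style) API; the model name, API key, and task are placeholders, and newer AutoGen releases have since reorganized this interface.

```python
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_KEY"}]}

# system_message defines the assistant's role and capabilities.
coder = autogen.AssistantAgent(
    name="coder",
    system_message="You are a senior Python engineer. Plan first, then write "
                   "tested, well-documented code.",
    llm_config=llm_config,
)

# The user proxy executes code blocks returned by the assistant inside Docker.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": True},
)

user_proxy.initiate_chat(
    coder,
    message="Write and run a script that counts TODO comments under ./src.",
)
```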
Beyond AutoGen, a number of other AI agent frameworks also contribute to orchestrating complex workflows.
The industry often distinguishes between "Simple ReAct Agents," a generalist approach in which an LLM in a loop uses tools to solve any problem 8, and "Agentic Workflows," which emphasize explicit workflow engineering with predefined steps for well-defined business problems 8. While generalist approaches offer flexibility, agentic workflows excel in consistency and reliability, especially for tasks with clear input/output requirements that demand strong subject matter expertise 8.
In summary, fine-tuned Code LLMs for agents are built upon robust transformer architectures, specialized pre-training on code, and refined through advanced fine-tuning techniques such as instruction tuning and RLHF. Frameworks like ReAct and AutoGen provide the essential scaffolding for these LLMs to operate as effective, intelligent agents in diverse and complex coding workflows.
Building upon the foundational concepts and methodologies of fine-tuned code Large Language Model (LLM) agents, this section delves into their practical applications and diverse use cases that are transforming the software development landscape. These specialized models are increasingly deployed across various aspects of software development, offering significant automation, efficiency, and expanded capabilities, from generating code to managing complex project workflows.
Fine-tuned code LLM agents are designed to enhance the entire software development lifecycle, providing capabilities that range from basic code suggestions to managing end-to-end development workflows.
Code LLM agents are proving instrumental in identifying, diagnosing, and fixing software defects, thereby streamlining the debugging process and improving code reliability.
Fine-tuned LLM agents play a crucial role in improving code maintainability, readability, and performance by automating refactoring and optimization tasks.
These agents significantly enhance the efficiency and coverage of software testing by automating the creation of test cases and simulating human-like testing behaviors.
Fine-tuned code LLMs are also applied in data-centric roles, transforming natural language into structured queries and automating reporting.
While explicit examples of autonomous robotics code generation are not extensively documented in the literature surveyed here, the overarching capabilities demonstrated by fine-tuned code LLM agents point toward significant promise in complex system orchestration. These agents are characterized by their autonomy, managing entire workflows from task decomposition to coding and debugging across the full software development lifecycle 14. This includes managing infrastructure updates through autonomous workflows, as seen with Vitara AI 12, and infrastructure-as-code in cloud-native environments via Qudo AI 12. The ability of IBM's SWE agents to autonomously resolve GitHub issues by localizing and modifying code 11 further demonstrates their capacity for autonomous action within complex systems, paving the way for applications in domains requiring sophisticated automated control, such as robotics.
The deployment of fine-tuned code LLM agents has led to substantial improvements across various facets of software development, as summarized below:
| Benefit | Description | Examples / Impact |
|---|---|---|
| Increased Productivity & Efficiency | Automates repetitive tasks, accelerates development cycles, and allows for rapid prototyping and iteration. | Shopify uses Copilot for faster feature rollouts 12. A SaaS startup saw 30% sprint task completion and 25% development time reduction with Vitara AI 12. Cursor adopters observed 3-5 times more lines of code added in the first month 15. |
| Enhanced Code Quality & Security | Enforces best practices, optimizes algorithms, reduces technical debt, and minimizes human errors 12. | NVIDIA developed an AI app for detecting software vulnerabilities 13. Amazon CodeWhisperer includes built-in security scans 12. |
| Shifting Developer Role | Transforms developers from code producers to "code curators" or "intent-driven engineers," focusing on higher-order problem-solving and prompt engineering 11. | Developers shift focus to orchestrating AI-generated code and design thinking 11. |
| Broader Task Scope & Collaboration | Agents handle ambiguous requirements, perform iterative optimization, and integrate with version control for enhanced knowledge sharing. | They cover most tasks in the software development lifecycle, extending beyond mere code snippets 14. |
Following the discussion of various applications of fine-tuned code Large Language Model (LLM) agents, a crucial aspect of their development and deployment involves rigorous performance evaluation and benchmarking. This section provides a comprehensive overview of the metrics, benchmarks, and evaluation methodologies used to assess the performance, robustness, efficiency, and safety of fine-tuned code LLMs specifically acting as agents. Effective evaluation is vital for understanding their capabilities, limitations, and areas for improvement, extending beyond traditional code generation to encompass the full software development lifecycle 14.
The assessment of LLM agents considers not only final task outcomes but also intermediary behaviors. Evaluation methods are generally categorized into LLM-as-a-judge approaches and human-in-the-loop evaluations 16.
These methods utilize LLMs themselves to evaluate the quality of their outputs, comparing generated text against ground-truth data or statistical metrics, offering efficiency for large-scale deployments 16.
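A minimal sketch of an LLM-as-a-judge check, assuming a hypothetical call_llm helper; the rubric wording and 1-5 scale are illustrative.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical chat-completion call returning the judge model's text."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading a coding agent's answer.
Question: {question}
Reference answer: {reference}
Agent answer: {answer}
Return JSON: {{"score": <1-5>, "reason": "..."}} judging correctness and relevance."""

def judge(question: str, reference: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return json.loads(raw)   # e.g. {"score": 4, "reason": "Correct but verbose"}
```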
This approach involves human evaluators assessing the quality of LLM output based on criteria such as relevance, fluency, coherence, and overall quality, providing subjective feedback 16. It is particularly critical for high-stakes applications and for identifying subtle problems 16.
Several frameworks have emerged to evaluate LLM agents under realistic scenarios:
| Benchmark / Framework | Agent Type | Focus and Features | Example Tasks / Domains | Key Metrics |
|---|---|---|---|---|
| MultiAgentBench / MARBLE | Multi-agent | Comprehensive cooperative and competitive scenarios, supporting various coordination structures and flexible planner strategies (CoT, group discussion). | Research collaboration, coding, gaming (e.g., multi-player puzzles). | Task completion and milestone KPIs, plus communication and planning scores (averaged into a Coordination Score). |
| Self-Evolving Benchmark | Single/Multi | Dynamic benchmark that generates new test instances, perturbing inputs to stress-test models for robustness. | Extended QA, math, and reasoning tasks (original datasets plus adversarial perturbations or rewritings). | Original task accuracy plus the performance drop on evolved instances (quantifying robustness); fine-grained metrics for sub-abilities. |
| Domain Intelligence Benchmark Suite (DIBS) | Single-agent | Enterprise-focused tasks emphasizing domain knowledge and tool use in real workflows, with defined subtasks and schemas. | Text-to-JSON extraction, function calling (API generation), RAG workflows over domain data (e.g., contracts, FAQs). | Task-specific metrics: information-extraction accuracy (F1/EM for JSON fields), function-call correctness (tool selection and JSON syntax), RAG answer quality (retrieval and answer F1). |
| DeepEval | General | Developer-focused testing framework, integrates with CI/CD pipelines, pre-defined metrics for accuracy, bias, performance 16. | Various LLM applications, RAG pipelines, AI agents 17. | Answer Relevancy, Task Completion, Correctness, Hallucination, Tool Correctness, Contextual Relevancy, Responsible Metrics (bias, toxicity), Task-Specific Metrics (summarization) 17. |
| LEval | Long-Context LLMs | Evaluates LLMs on long-context understanding across various tasks, contexts from 5,000 to 200,000 tokens 16. | Academic summarization, technical document generation, multi-turn dialogue coherence 16. | Coherence 16. |
| LangSmith (LangChain) | LLM Applications | Debugging, testing, and monitoring platform with features for comparing models and tracing execution paths 16. | LLM applications 16. | (Monitoring & comparison tools) 16. |
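For instance, a DeepEval-style check (sketched from the library's documented usage; exact APIs may differ across versions) can run as an ordinary test in a CI pipeline.

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_agent_answer():
    case = LLMTestCase(
        input="How do I pin a dependency version in requirements.txt?",
        actual_output="Use an exact pin, for example requests==2.31.0.",
    )
    # Fails the CI job if the judged relevancy score drops below the threshold.
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])
```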
To thoroughly assess fine-tuned code LLM agents, a diverse set of metrics is employed, covering agentic capabilities, code-specific attributes, robustness, safety, and efficiency.
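Among code-specific attributes, functional correctness is most often reported as pass@k, as in the HumanEval figures earlier; below is a minimal implementation of the standard unbiased estimator popularized by the Codex evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn per problem, c of them correct."""
    if n - c < k:
        return 1.0                      # every size-k draw contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 37 pass the unit tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
```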
For systems where multiple agents collaborate, additional dimensions beyond individual task success emerge, such as the quality of communication, planning, and overall coordination.
Beyond general LLM benchmarks, specific datasets and benchmarks are crucial for evaluating code LLM agents 16.
The "Awesome-Code-LLM" resource categorizes various benchmarks relevant to code-related tasks 18:
Evaluating LLM agents faces several challenges, including a lack of standardization, scalability issues stemming from reliance on static datasets or human annotation, a shortage of diagnostic tools for failure attribution, limited attention to safety and bias, and a scarcity of cost and efficiency metrics.
Best practices for effective evaluation combine these approaches: automated LLM-as-a-judge scoring for scale, human review for high-stakes outputs, and explicit tracking of intermediary behaviors, robustness, and cost alongside final task success.
LLM-based code generation agents are rapidly transforming the software development landscape by offering autonomy and an expanded task scope across the full Software Development Lifecycle (SDLC) 14. These agents differentiate themselves from traditional LLMs by independently managing entire workflows, from task decomposition to coding and debugging, effectively simulating the complete workflow of human programmers 14. Research in this domain has seen significant growth, particularly since 2023 14.
LLM-based multi-agent (LMA) systems represent a pivotal development, significantly boosting performance through synergistic collaboration in which multiple heterogeneous or homogeneous agents communicate, cooperate, and negotiate to achieve goals that exceed the capacity of a single agent 14.
LMA systems for code generation frequently incorporate role specialization and iterative feedback loops to optimize collaboration 19. Common roles identified within these architectures include:
| Role | Function | Examples |
|---|---|---|
| Orchestrator | Manages high-level planning, task decomposition, delegation to specialized agents, progress monitoring, and workflow alignment | PairCoder's Navigator, Self-Organized Agents' Mother agents, CODES' RepoSketcher |
| Programmer | Responsible for writing the initial code version | - |
| Reviewer and Tester | Evaluate code, provide feedback on quality, functionality, and adherence to requirements; generate various test cases | - |
| Debugger | Resolves identified issues | - |
| Information Retriever | Gathers relevant information from external sources, like similar problem examples or graph databases built from static analysis | Agent4PLC, MapCoder, CodexGraph |
An orchestration platform is essential for managing interactions and information flow among agents 19. AgentReport, for instance, employs a multi-agent pipeline in which agents have fixed responsibilities and operate sequentially, prioritizing reproducibility and deterministic evaluation 20. More broadly, orchestration encompasses various coordination models (cooperative, competitive, hierarchical, or mixed) and communication mechanisms (centralized, decentralized, or hierarchical channels for exchanging data such as code snippets or bug reports) 19.
Collaboration often involves activities such as debate and discussion to enhance factuality and reasoning, ensuring validation of outputs 19. Agent Forest exemplifies this by utilizing a sampling-and-voting framework where multiple agents independently generate candidate outputs, with the solution achieving the highest consensus score being selected 19.
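A minimal sketch of such a sampling-and-voting scheme, with generate_candidate as a hypothetical single-agent call and exact-match frequency standing in for the consensus score:

```python
from collections import Counter

def generate_candidate(task: str) -> str:
    """Hypothetical single-agent call returning one candidate solution."""
    raise NotImplementedError

def sample_and_vote(task: str, n_agents: int = 5) -> str:
    candidates = [generate_candidate(task) for _ in range(n_agents)]
    # The answer produced most often wins; ties fall to the earliest candidate.
    winner, _votes = Counter(candidates).most_common(1)[0]
    return winner
```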
Self-correction is a critical capability for LLM agents, enabling them to evaluate and refine their outputs. The reflection component in these agents allows them to examine, evaluate, and correct their own generated content or existing data to improve past actions and continuously correct errors 14.
Frameworks like Self-Refine introduce an iterative refinement process where the model self-evaluates its natural language output to identify potential issues and revises it based on feedback, requiring no additional training or supervision 14. CodeChain guides models in constructing reusable modular code through multiple iterations and self-revision during the planning phase 14. Furthermore, CodeAct allows agents to dynamically revise prior actions or emit new actions based on new observations through multi-turn interactions, incorporating autonomous self-debugging capabilities 21.
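A minimal sketch of a Self-Refine-style loop, with call_llm as a hypothetical model call; the stop condition and prompts are illustrative rather than those of the original framework.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical model call; the same model drafts, critiques, and revises."""
    raise NotImplementedError

def self_refine(task: str, max_rounds: int = 3) -> str:
    draft = call_llm(f"Write code for: {task}")
    for _ in range(max_rounds):
        critique = call_llm(f"Task: {task}\nCode:\n{draft}\n"
                            "List concrete problems, or reply DONE if none.")
        if critique.strip() == "DONE":      # model judges its own output acceptable
            break
        draft = call_llm(f"Task: {task}\nCode:\n{draft}\nFeedback:\n{critique}\n"
                         "Revise the code to address the feedback.")
    return draft
```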
ROCODE incorporates a closed-loop mechanism that integrates code generation, real-time error detection, and adaptive backtracking. It monitors compilation output and initiates backtracking when syntax errors are detected, using static program analysis to identify the minimal scope for modification 14. CodeTool utilizes process-level supervision mechanisms for tool invocation, explicitly modeling and supervising each step, and integrates feedback through incremental debugging strategies 14. In AgentReport, the Prompt Agent uses Chain-of-Thought (CoT) instructions to guide the model in performing step-wise self-checks to detect omissions or inconsistencies and revise its output. The Evaluation Agent assesses structural completeness, lexical fidelity, and semantic consistency using metrics like CTQRS, ROUGE, and SBERT 20. Other frameworks like INTERVENOR pair a Code Learner with a Code Teacher, where the Teacher analyzes bug reports and buggy code to provide repair instructions 19.
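As a rough illustration of compile-in-the-loop checking in this spirit (not ROCODE's actual backtracking algorithm), a generator can gate each candidate on a syntax check and feed the compiler diagnostic back into the next attempt.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical code-generation call."""
    raise NotImplementedError

def generate_with_syntax_gate(task: str, max_attempts: int = 3) -> str:
    prompt = f"Write a complete Python solution for: {task}"
    for _ in range(max_attempts):
        code = call_llm(prompt)
        try:
            compile(code, "<candidate>", "exec")    # lightweight syntax check
            return code
        except SyntaxError as err:
            # Feed the compiler diagnostic back rather than restarting blindly.
            prompt = (f"{task}\nPrevious attempt:\n{code}\n"
                      f"SyntaxError at line {err.lineno}: {err.msg}. Fix it.")
    raise RuntimeError("No syntactically valid candidate produced")
```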
Advancements in guiding and training LLM agents are crucial for their effectiveness, encompassing refined prompt design strategies and efficient fine-tuning methods.
| Category | Technique | Description |
|---|---|---|
| Prompt Design | Structured Prompting | Enforces inclusion of key sections in outputs (e.g., CTQRS-based prompts for bug reports) to reduce incompleteness and ambiguity 20 |
| Prompt Design | Chain-of-Thought (CoT) | Guides models to perform step-wise self-checks and self-review, enhancing logical consistency and completeness 20 |
| Prompt Design | One-Shot Exemplars | Retrieves and inserts relevant examples from a training dataset (e.g., via FAISS) into the prompt for contextual grounding, ensuring realistic outputs and preventing data leakage 20 |
| Prompt Design | Self-Planning | Prompts the model to generate a sequence of high-level solution steps prior to actual code generation 14 |
| Fine-Tuning | QLoRA-4bit Fine-tuning | Applies to base models (e.g., Qwen2.5-7B-Instruct) to embed structural constraints and reasoning strategies (like CTQRS, CoT, exemplars) directly into model parameters, reducing memory usage and allowing training in resource-limited environments 20 |
| Fine-Tuning | Instruction-Tuning | Utilizes datasets (e.g., CodeActInstruct, consisting of 7,000 multi-turn interactions) to improve models (e.g., Llama2, Mistral) in agent-oriented tasks without compromising general capabilities 21 |
| RAG for Context | Repository-Level Retrieval | Establishes vector retrieval systems (e.g., RepoHyper, CodeNav) to locate reusable code segments from large codebases, improving control over long-distance dependencies 14 |
| RAG for Context | Knowledge Graphs | Represents code repositories as knowledge graphs to enhance retrieval quality from structural and relational perspectives, significantly improving project-level code generation 14 |
| RAG for Context | Structured Chunking | Uses Abstract Syntax Tree (AST)-based chunking (e.g., cAST) to improve syntactic completeness of code retrieval through recursive partitioning and semantic coherent block merging 14 |
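A hedged sketch of the QLoRA-4bit setup referenced in the table, using the Hugging Face transformers/peft/bitsandbytes stack; the hyperparameters and target modules are illustrative, not those used by AgentReport.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit quantized base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # adapter sites
)
model = get_peft_model(model, lora)          # only the small LoRA adapters train
model.print_trainable_parameters()
```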
Retrieval Augmented Generation (RAG) methods are also increasingly employed to retrieve relevant information from knowledge bases or code repositories, constructing richer contexts to alleviate knowledge limitations, model hallucinations, and data security issues 14.
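A minimal sketch of repository-level retrieval with FAISS; the embed function here is a toy character-histogram stand-in for a real code-embedding model, and the chunks are illustrative.

```python
import numpy as np
import faiss

def embed(texts):
    """Toy character-histogram embedding; swap in a real code-embedding model."""
    vecs = np.zeros((len(texts), 256), dtype="float32")
    for i, text in enumerate(texts):
        for byte in text.encode("utf-8"):
            vecs[i, byte] += 1.0
    return vecs

chunks = [
    "def parse_config(path): ...",
    "class RetryPolicy: max_attempts = 3 ...",
    "def connect(db_url): ...",
]
vectors = embed(chunks)
faiss.normalize_L2(vectors)                    # cosine similarity via normalized IP
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = embed(["How is retry behaviour configured?"])
faiss.normalize_L2(query)
_, hits = index.search(query, k=2)             # row indices of the top-2 chunks
context = "\n\n".join(chunks[i] for i in hits[0])
prompt = f"Relevant repository code:\n{context}\n\nAnswer the question using it."
```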
Generalist code LLM agents are advancing their ability to handle diverse challenges across software development.
This involves leveraging advanced planning and reasoning techniques such as Self-Planning, CodeChain, CodeAct, GIF-MCTS, PlanSearch, CodeTree, Tree-of-Code, DARS (adaptive tree structures), and Guided Search (one-step lookahead and trajectory selection), all of which enhance structured reasoning and exploration in various problem spaces 14. Agents also integrate external tools like search engines, calculators, and compilers to expand their problem-solving capabilities 14. For example, CodeAct integrates a Python interpreter for immediate execution and dynamic action adjustment 21, while CodeAgent integrates five programming tools to interact with software components 14. Domain-specific tools, such as those encapsulating simulator functions for analog circuit design (AnalogCoder) or integrating syntax tree-level waveform tracing for hardware code generation (VerilogCoder), demonstrate adaptability to specialized tasks 14. Context management, facilitated by RAG systems with repository-level and knowledge graph-based retrieval, helps agents understand and utilize highly contextualized information from large and private codebases, which is crucial for real development environments 14. Furthermore, dynamic process models like Think-on-Process (ToP) and MegaAgent enable the dynamic generation of agent roles and plans based on specific project requirements, moving beyond rigid, static workflows 19.
LLM agents inherently possess tool usage capabilities, allowing them to actively invoke external APIs and tools to enhance problem-solving 14. ToolCoder combines API search tools with LLMs, using annotated training data to learn accurate API invocation 14. CodeAgent integrates multiple programming tools, enabling interaction with various software components 14. CodeAct further demonstrates integration with a Python interpreter, enabling agents to execute code and perform sophisticated tasks using existing libraries 21.
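A rough sketch of executing an agent-emitted snippet in a separate interpreter process with a timeout, a far weaker form of isolation than the Docker-based execution mentioned earlier:

```python
import subprocess
import sys

def run_snippet(code: str, timeout: float = 10.0) -> str:
    """Execute agent-generated code in a fresh interpreter and capture its output."""
    result = subprocess.run(
        [sys.executable, "-I", "-c", code],   # -I: isolated mode, no user site dirs
        capture_output=True, text=True, timeout=timeout,
    )
    # Return what the agent should observe next (stdout on success, stderr on failure).
    return result.stdout if result.returncode == 0 else result.stderr

observation = run_snippet("print(sum(range(10)))")   # "45\n"
```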
While challenges remain in integrating agents with real development environments, incorporating human feedback is a key area of focus. AgileGen enhances Agile development practices by integrating close user involvement to ensure alignment between requirements and generated code, notably using the Gherkin language for testable requirements 19. The broader challenge of effective human-agent interaction, trustworthiness, and cost is identified as a critical future direction for these systems 14.
Despite significant progress, integrating code generation agents with real development environments still faces hurdles, including understanding large, private codebases, customized build processes, internal API specifications, and unwritten team conventions 14. Additionally, agent-generated code may contain logical defects, performance pitfalls, or security vulnerabilities 14. Future research aims to enhance individual agent capabilities and optimize agent collaboration and synergy, paving the way for autonomous, scalable, and trustworthy LMA systems 19.
While fine-tuned code LLMs for agents present significant advancements and potential, their widespread, responsible deployment hinges on overcoming a range of technical, ethical, and practical challenges. This concluding section outlines the current limitations, open research questions, ethical implications, and promising future directions.
Fine-tuned code LLM agents face several significant technical hurdles that impact their reliability and efficacy. A primary concern is the phenomenon of hallucinations, where LLMs generate fluent but factually incorrect or fabricated responses 22. In code generation, this translates to syntactically correct but incorrect or suboptimal code 23, potentially including fictitious citations 24. These models often exhibit reasoning failures, struggling with deep understanding of code semantics, architecture, and external functionalities 23. They may lack the ability to autonomously decompose tasks or understand cross-file context without specific agentic mechanisms 25, leading to code modifications misaligned with project goals due to a lack of explicit purpose understanding 23.
Integrating LLMs into real development environments (IDEs) also presents challenges, as they may lack contextual awareness and user-specific adaptability, potentially conflicting with project conventions or failing to address nuanced refactoring goals 23. Furthermore, high computational costs are inherent; training and fine-tuning these large models demand substantial computational resources and time, posing barriers for smaller teams 26. This resource intensity contributes to an environmental impact through significant energy consumption and carbon emissions. The lack of explainability, often referred to as the "black-box" nature of these models, complicates the justification of automated recommendations and makes it difficult to understand how specific answers are derived.
Complex, open-ended tasks pose a further difficulty: validating automatically generated code is particularly challenging, a problem exacerbated by the absence of comprehensive test cases. Current models, often trained on individual files, struggle to generate tests that consider broader code context 27. Additional limitations include maintaining robustness and updatability in dynamic software environments, risks of overfitting to training data and catastrophic forgetting of previously acquired general knowledge 26, and the inherently dynamic nature of LLMs, which constantly evolve and refine their capabilities, posing challenges for traditional security measures 28.
The probabilistic and opaque nature of LLMs makes them susceptible to novel attacks 22. Key security vulnerabilities include prompt injection and jailbreak attacks, leakage of sensitive training or contextual data, and the generation or execution of insecure code.
The integration of LLM agents in critical applications introduces significant ethical concerns, especially regarding autonomous decision-making.
Addressing these challenges requires focused future research and development across several key areas:
Robust Security Strategies: Developing LLM-specific threat modeling, implementing prompt isolation and input sanitization, and hardening models with safety constraints like Reinforcement Learning from Human Feedback (RLHF) 22. Real-time monitoring, logging, and alert systems are essential for early threat detection, alongside governance and compliance layers aligning with evolving AI regulations. Agent-specific security measures, including sandboxing and human approval for high-impact commands, are also critical 22 (a minimal sketch of such an approval gate follows this list).
Enhancing Code Understanding and Generation: Research should focus on curating high-quality, domain-specific datasets for fine-tuning, especially for refactoring tasks 23. Enriched prompting techniques, such as chain-of-thought and few-shot prompting, can guide LLMs toward more targeted and effective code generation 23. Hallucination mitigation strategies, including uncertainty quantification and requiring LLMs to generate justifications, are vital for semantic correctness 23. Automated test generation and verification, integrating tools like EvoSuite for static analysis and mutation testing, will be crucial for validating LLM-generated code. Further development of Self-Evolved Comprehension (Tutorial Fine-Tuning - TFT) approaches will enable models to learn from limited data and continuously improve by correcting their own errors 30.
Ethical AI Development and Governance: Future work must prioritize bias mitigation through regular evaluations and diverse data sampling. Privacy protection requires advanced data anonymization and secure model serving. Transparency and explainability can be improved by providing contextual insights, disclaimers, and open data protocols 31. Establishing clear accountability frameworks, legal contracts, and ethical checklists will define responsibility 29. Misinformation safeguards, content filtering, and public education are necessary to counter AI-generated disinformation. Operationalizing Meaningful Human Control (MHC) through tiered autonomy and escalation pathways is also paramount 24.
Integration of Software Engineering Insights: Incorporating domain-specific insights into LLM training and evaluation processes is vital for enhancing reliability 27. This includes explicit consideration of the mapping between code and test files (CAT-LM) and using software artifacts for differential testing (DIFFSPEC) 27.
Retrieval-Augmented Generation (RAG) for Code: Continued research into RAG methods, including repository-level vector retrieval and knowledge graph-based approaches, promises to alleviate knowledge limitations, reduce hallucinations, and address data security issues in code generation.
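Returning to the agent-specific security measures noted under robust security strategies above, a human-approval gate can intercept destructive-looking commands before execution; the pattern list below is purely illustrative.

```python
import re

# Illustrative patterns for commands that should never run without sign-off.
HIGH_IMPACT = [r"\brm\s+-rf\b", r"\bDROP\s+TABLE\b", r"\bgit\s+push\s+--force\b"]

def approve_or_block(command: str) -> bool:
    """Require explicit human confirmation before destructive-looking commands run."""
    if any(re.search(p, command, re.IGNORECASE) for p in HIGH_IMPACT):
        answer = input(f"Agent wants to run: {command!r}. Allow? [y/N] ")
        return answer.strip().lower() == "y"
    return True                                # low-impact commands pass through

if approve_or_block("pytest -q"):              # no prompt: not a high-impact command
    print("executing...")
# approve_or_block("rm -rf build/") would pause for interactive confirmation.
```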
Fine-tuned code LLM agents hold immense potential to transform software development and beyond. However, realizing this potential requires a concerted effort to systematically address their technical limitations, such as hallucinations, reasoning failures, and computational demands, alongside the critical ethical challenges of bias, privacy, and accountability. Future research must aggressively pursue robust security measures, enhance model interpretability, establish comprehensive ethical governance, and deeply integrate software engineering principles. Continuous interdisciplinary collaboration among researchers, developers, ethicists, and policymakers is indispensable to navigate this evolving landscape and ensure that LLM agents are not only powerful but also reliable, trustworthy, and ultimately beneficial for society.