2025-04-13 recall

## Introduction

- The document 'agent AI survey Xi' discusses the rise and potential of large language model-based agents, with the goal of achieving artificial intelligence equivalent to or surpassing human level, and considers AI agents as a promising vehicle for this pursuit.
- The authors, including Zhiheng Xi, Wenxiang Chen, Xin Guo, and many others from the Fudan NLP Group, argue that the community lacks a general and powerful model to serve as a starting point for designing AI agents that can adapt to diverse scenarios, and propose [[Large language model | large language models]] as a potential solution.
- The concept of agents is traced from its philosophical origins to its development in AI, and the authors explain why large language models are suitable foundations for agents, offering hope for building general AI agents with versatile capabilities.
- A general framework for large language model-based agents is presented, comprising three main components: brain, perception, and action, which can be tailored for different applications, and the authors explore the extensive applications of large language model-based agents in single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
- The authors also delve into agent societies, exploring the behavior and personality of large language model-based agents, the social phenomena that emerge from an agent society, and the insights they offer for human society, and finally discuss several key topics and open problems within the field, including the potential of large language models to spark [[Artificial general intelligence | Artificial General Intelligence]].

## Background on AI Agents

- The concept of agents in artificial intelligence (AI) originated in philosophy, with roots tracing back to thinkers like Aristotle and Hume, and was later expanded by [[Alan Turing]], who proposed the [[Turing test | Turing Test]] to explore whether machines can display intelligent behavior comparable to humans.
- In AI, an agent refers to an artificial entity capable of perceiving its surroundings using sensors, making decisions, and taking actions in response using actuators, and is considered a crucial building block of AI systems, with qualities like autonomy, reactivity, pro-activeness, and social ability.
- The development of agents has become a focal point in the AI community, with significant strides made in the mid-20th century, but previous efforts have predominantly focused on enhancing specific capabilities, such as symbolic reasoning, or mastering particular tasks, rather than achieving broad adaptability across varied scenarios.
- The development of [[Large language model | large language models]] (LLMs) has brought new hope for the advancement of agents, with LLMs demonstrating powerful capabilities in knowledge acquisition, instruction comprehension, generalization, planning, and reasoning, and displaying effective natural language interactions with humans.
- According to the notion of World Scope (WS), which encompasses five levels of research progress from NLP to general AI, LLMs are built on the second level, but have the potential to reach higher levels if elevated to the status of agents and equipped with an expanded perception space and action space.
- The integration of LLMs with agents is expected to enable them to tackle more complex tasks through cooperation or competition, and potentially achieve the fifth level of WS, where emergent social phenomena can be observed, and a harmonious society composed of AI agents and humans can be envisioned.
- This paper presents a comprehensive and systematic survey of LLM-based agents, aiming to investigate existing studies and prospective avenues in this emerging field, which is considered a pivotal stride towards achieving [[Artificial general intelligence | Artificial General Intelligence]] (AGI).

## A General Conceptual Framework for LLM-based Agents

- A general conceptual framework for LLM-based agents is presented, consisting of three key parts: brain, perception, and action, which can be tailored to suit different applications, with the brain being the core of an AI agent, responsible for information processing, decision-making, reasoning, and planning.
- The perception module is introduced as a means to expand the agent's perceptual space from text-only to a multimodal space, including diverse sensory modalities, enabling the agent to better perceive information from the external environment.
- The action module is presented as a means to expand the action space of an agent, allowing it to produce textual output, take embodied actions, and use tools, enabling it to better respond to environmental changes and provide feedback.

## Brain

- The brain module of an LLM-based agent is a sophisticated structure that serves as the central nucleus, primarily composed of a [[Large language model | large language model]], and is responsible for processing various information, generating diverse thoughts, controlling different behaviors, and making informed decisions.
- The brain module has several key components, including natural language interaction, which is paramount for effective communication and is explored in works by Bang et al. [132], Fang et al. [133], Lin et al. [127], and Lu et al. [134], among others.
- The knowledge component of the brain module includes linguistic knowledge, common sense knowledge, and actionable knowledge, with research contributions from Vulic et al. [142], Hewitt et al. [143], Safavi et al. [148], Xu et al. [151], and others, and potential issues such as editing wrong and outdated knowledge, which is addressed by AlKhamissi et al. [155], Kemker et al. [156], and Mitchell et al. [159].
- The memory capability of the brain module is also crucial, with research focusing on raising the length limit of [[Transformer (deep learning architecture) | transformers]], summarizing memory, compressing memories with vectors or data structures, and memory retrieval, as explored in works by BART [163], Park et al. [164], Generative Agents [22], and ChatDev [109], among others.
- The brain module also enables reasoning and planning, with research contributions from CoT [95], Zero-shot-CoT [96], Self-Consistency [97], and LLM-Planner [101], and transferability and generalization, which is explored in works by T0 [106], FLAN [105], Instruct-GPT [24], and Voyager [190], allowing the agent to adapt to unfamiliar scenarios.
- The brain module's operating mechanism involves receiving information from the perception module, storing and retrieving knowledge, recalling from memory, devising plans, and making informed decisions, and its ability to memorize past observations, thoughts, and actions, and update knowledge for future use, is essential for effective communication and decision-making (a minimal sketch of this loop follows).
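The operating mechanism in the last bullet — perceive, recall, plan, decide, memorize — can be read as a simple loop. Below is a minimal sketch; `AgentBrain`, `llm_complete`, and the five-item recall window are illustrative assumptions, not constructs from the survey.

```python
# Hedged sketch of the brain module's operating loop described above.
from dataclasses import dataclass, field

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to any LLM completion endpoint."""
    raise NotImplementedError

@dataclass
class AgentBrain:
    memory: list[str] = field(default_factory=list)  # past observations/thoughts/actions

    def step(self, percept: str) -> str:
        # 1. Recall: retrieve recent memories alongside the new percept.
        recalled = "\n".join(self.memory[-5:])
        # 2. Plan and decide: ask the LLM for the next action given context.
        prompt = (
            f"Recent memory:\n{recalled}\n\n"
            f"New observation: {percept}\n"
            "Devise a short plan, then state the next action."
        )
        decision = llm_complete(prompt)
        # 3. Memorize: store the observation and decision for future use.
        self.memory.append(f"observed: {percept}")
        self.memory.append(f"decided: {decision}")
        return decision
```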
## Natural Language Interaction

- The capabilities of [[Large language model | Large Language Models]] (LLMs) enable agents to engage in basic interactive conversations and exhibit in-depth comprehension abilities, allowing humans to easily understand and interact with them, so LLM-based agents can earn more trust and cooperate more effectively with humans.
- Multi-turn interactive conversation is a key aspect of effective communication, and LLMs such as the GPT series, [[Llama (language model) | LLaMA]] series, and T5 series can understand natural language and generate coherent and contextually relevant responses to handle various problems.
- The multi-turn conversation process involves three main steps: understanding the history of natural language dialogue, deciding what action to take, and generating natural language responses, and LLM-based agents can continuously refine outputs using existing information to conduct multi-turn conversations.
- Recent LLMs have shown exceptional natural language generation capabilities, producing high-quality text in multiple languages with improved coherency and grammatical accuracy, as seen in the evolution from [[GPT-3]] to InstructGPT and [[GPT-4]].
- LLMs can adapt to the style and content of the conditioning text, and they do not merely copy training data but display a certain degree of creativity, generating diverse texts that are equally novel or even more novel than human-crafted benchmarks.
- Understanding intention and implication is essential for effective communication and cooperation with other intelligent agents, and although models trained on large-scale corpora are intelligent, they are still incapable of fully emulating human dialogues or leveraging the information conveyed in language.
- Researchers such as See et al. and Fang et al. have empirically affirmed the capabilities of [[Large language model | LLMs]], including their ability to detect grammar errors and generate high-quality text, and the use of controllable prompts ensures precise control over the content generated by these language models.

## Intention and Implication

- The emergence of Large Language Models (LLMs) has highlighted the potential of foundation models to understand human intentions, but they struggle with vague instructions or implied meanings, which can be addressed by formalizing implied meanings into a reward function that allows agents to choose options in line with the speaker's preferences.
- Reward modeling can be achieved by inferring rewards based on feedback, such as comparisons and unconstrained natural language, or by recovering rewards from descriptions using the action space as a bridge, as suggested by researchers like Jeon et al. (a sketch of comparison-based reward inference follows).
- Agents can utilize their understanding of context to take highly personalized and accurate actions tailored to specific requirements, and this can be further enhanced by leveraging knowledge from large-scale, unstructured, and unlabeled datasets.
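One common instantiation of "inferring rewards based on feedback such as comparisons" is a Bradley–Terry-style preference model: the reward function is fit so the preferred option is likelier to win each pairwise comparison. A minimal sketch, with all names illustrative; the survey does not prescribe this implementation.

```python
# Hedged sketch: negative log-likelihood of pairwise comparison feedback
# under a Bradley-Terry preference model, P(a > b) = sigmoid(r_a - r_b).
import math

def preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Loss is small when the reward model scores the speaker's
    preferred option above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

print(preference_loss(2.0, 0.5))  # ~0.20: rewards agree with the preference
print(preference_loss(0.5, 2.0))  # ~1.70: rewards contradict the preference
```

Minimizing this loss over many comparisons pushes the learned reward toward the speaker's implied preferences, which the agent can then act on.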
## Knowledge

- Language models can learn and comprehend a wide range of knowledge, including linguistic knowledge, commonsense knowledge, and professional domain knowledge, which can be categorized into different types, such as morphology, syntax, semantics, and pragmatics.
- Linguistic knowledge is essential for agents to comprehend sentences and engage in multi-turn conversations, while commonsense knowledge refers to general world facts that are typically taught to individuals at an early age, and professional domain knowledge is associated with specific domains like programming, mathematics, and medicine.
- Agents without commonsense knowledge or professional domain knowledge may make incorrect decisions, and while [[Large language model | LLMs]] demonstrate excellent performance in acquiring, storing, and utilizing knowledge, there are potential issues and unresolved problems, such as outdated or incorrect knowledge, which can be addressed through retraining or editing the models to locate and modify specific knowledge.
- The use of Large Language Models (LLMs) is limited in factually rigorous tasks due to their tendency to generate content that conflicts with the source or factual information, a phenomenon known as hallucination, which can be alleviated by using metrics to measure the level of hallucination and enabling LLMs to utilize external tools to avoid incorrect knowledge.

## Memory

- In the context of the agent AI survey Xi, the concept of "memory" refers to the storage of sequences of the agent's past observations, thoughts, and actions, which is essential for the agent's proficient handling of consecutive tasks and decision-making, as proposed by Nuxoll et al.
- The memory mechanisms in LLM-based agents face two primary challenges: the sheer length of historical records, which can surpass the constraints of the [[Transformer (deep learning architecture) | Transformer architecture]], and the difficulty in extracting relevant memories, which can cause the agent to misalign its responses with the ongoing context.
- To enhance the memory of LLM-based agents, several methods have been proposed, including raising the length limit of Transformers, which can be achieved through strategies such as text truncation, segmenting inputs, and modifying the attention mechanism to reduce complexity.
- Another approach to amplifying memory efficiency is memory summarization, which ensures agents can effortlessly extract pivotal details from historical interactions, and can be achieved through various techniques, including using prompts, emphasizing reflective processes, and hierarchical methods.
- Compressing memories with vectors or data structures is also a viable method, which can be achieved by employing suitable data structures, such as embedding vectors for memory sections, plans, or dialogue histories, or translating sentences into triplet configurations.
- Finally, methods for memory retrieval are crucial when an agent interacts with its environment or users, and several approaches have been proposed, including ChatDB and DB-GPT, which integrate [[Large language model | LLMs]] with SQL databases, enabling data manipulation through SQL commands.
- The agent's ability to access relevant and accurate information is crucial for executing specific actions, and this is typically achieved through automated memory retrieval based on metrics such as recency, relevance, and importance (see the sketch after this list).
- Research has introduced the concept of interactive memory objects, which are representations of dialogue history that can be manipulated by users, allowing them to influence how the agent perceives the dialogue, and other studies have enabled memory operations like deletion based on specific user commands.
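The recency–relevance–importance retrieval mentioned above (popularized by Generative Agents [22]) can be sketched as a weighted score over stored memories. The decay constant, equal weighting, and `embed()` stub below are assumptions for illustration, not the paper's exact values.

```python
# Hedged sketch of memory retrieval scored by recency, relevance, importance.
import math, time

def embed(text: str) -> list[float]:
    """Placeholder for any sentence-embedding model."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieval_score(memory: dict, query_vec: list[float], now: float,
                    half_life_s: float = 3600.0) -> float:
    recency = 0.5 ** ((now - memory["last_access"]) / half_life_s)  # exponential decay
    relevance = cosine(memory["vec"], query_vec)                    # semantic match
    importance = memory["importance"]                               # e.g. LLM-rated 0..1
    return recency + relevance + importance                         # equal weights assumed

def retrieve(memories: list[dict], query: str, k: int = 3) -> list[dict]:
    q, now = embed(query), time.time()
    return sorted(memories, key=lambda m: retrieval_score(m, q, now), reverse=True)[:k]
```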
## Reasoning and Planning

- Reasoning is a fundamental aspect of human intellectual endeavors, and for Large Language Model (LLM)-based agents, reasoning capacity is crucial for solving complex tasks, with primary forms of reasoning including deductive, inductive, and abductive reasoning.
- There are differing academic views regarding the reasoning capabilities of [[Large language model | LLMs]], with some arguing that they possess reasoning during pre-training or fine-tuning, while others believe it emerges after reaching a certain scale in size, and methods like the Chain-of-Thought (CoT) method have been demonstrated to elicit the reasoning capacities of LLMs (a CoT prompting sketch follows this list).
- Planning is a key strategy for humans and agents alike, involving organizing thoughts, setting objectives, and determining the steps to achieve those objectives, with the ability to plan being crucial for agents and the capacity for reasoning being central to the planning module.
- The planning process typically comprises two stages: plan formulation, where agents decompose an overarching task into numerous sub-tasks, and plan reflection, where agents evaluate and reflect upon the merits of the formulated plan, with various approaches proposed for each stage, including hierarchical planning and adaptive strategies.
- LLM-based agents can demonstrate a broad scope of general knowledge, but they can occasionally face challenges when tasked with situations that require expert knowledge, and enhancing these agents by integrating them with planners of specific domains has been shown to yield better performance, as demonstrated by studies such as the CoT series and other research works.
- The agent AI survey Xi discusses how LLM-based agents utilize internal feedback mechanisms to refine their strategies and planning approaches, often drawing insights from pre-existing models, and also engage with humans to better align with human values and preferences.
- These agents can also draw feedback from their surroundings, such as cues from task accomplishments or post-action observations, to revise and refine their plans, and they exhibit dynamic learning ability, enabling them to adapt to novel tasks swiftly and robustly.
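Chain-of-Thought prompting [95] amounts to including worked reasoning steps in the prompt so the model imitates them before answering. A minimal sketch; the exemplar is the well-known one from the CoT paper, while `llm_complete` is a placeholder assumption.

```python
# Hedged sketch of Chain-of-Thought prompting: few-shot exemplars with
# intermediate reasoning elicit step-by-step reasoning from the LLM.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError  # any LLM completion endpoint

COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

def answer_with_cot(question: str) -> str:
    # The model imitates the exemplar's reasoning chain, then emits
    # "The answer is ...", which can be parsed for the final result.
    return llm_complete(COT_PROMPT.format(question=question))
```

Zero-shot-CoT [96] achieves a similar effect without exemplars by appending a trigger phrase such as "Let's think step by step."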
## Transferability and Generalization

- The concept of transferability and generalization is crucial in agent AI, as intelligence should not be limited to a specific domain or task, but rather encompass a broad range of cognitive skills and abilities, and LLM-based agents have demonstrated excellent performance in downstream tasks with only a small amount of data for fine-tuning.
- Studies have shown that instruction-tuned [[Large language model | LLMs]], such as FLAN and T0, exhibit zero-shot generalization without the need for task-specific fine-tuning, and models like [[GPT-4]] have demonstrated remarkable capabilities in a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, and understanding of human motives and emotions.
- In-context learning is another key aspect of LLM-based agents, which refers to the models' ability to learn from a few examples provided in the context, and this approach has been shown to enhance the predictive performance of language models and reduce the computation costs of adapting to new tasks.
- Continual learning is also an important aspect of agent AI, which involves the continuous acquisition and updating of skills, and recent studies have highlighted the potential of LLMs' planning capabilities in facilitating continual learning for agents, enabling them to adapt to large-scale real-world tasks.
- The core challenge in continual learning is catastrophic forgetting, where a model loses knowledge from previous tasks as it learns new ones, and numerous efforts have been made to address this challenge, including introducing regularization terms, approximating prior data distributions, and designing architectures with task-adaptive parameters.
- LLM-based agents have emerged as a novel paradigm, leveraging the planning capabilities of LLMs to combine existing skills and address more intricate challenges, with examples such as Voyager, which attempts to solve progressively harder tasks proposed by the automatic curriculum devised by GPT-4 (a skill-library sketch in this spirit follows).
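The Voyager-style pattern in the last bullet — keep verified skills and re-compose them for progressively harder tasks, so old skills are reused rather than forgotten — can be caricatured in a few lines. `SkillLibrary` and the toy skills are illustrative assumptions; Voyager's actual implementation differs.

```python
# Hedged sketch of a skill library that supports continual skill acquisition.
from typing import Callable

class SkillLibrary:
    def __init__(self) -> None:
        self.skills: dict[str, Callable[[], None]] = {}

    def add(self, name: str, skill: Callable[[], None]) -> None:
        # Only verified skills enter the library (Voyager checks execution success).
        self.skills[name] = skill

    def compose(self, plan: list[str]) -> None:
        # A harder task is solved as a sequence of previously learned skills.
        for step in plan:
            self.skills[step]()

library = SkillLibrary()
library.add("chop_tree", lambda: print("chopping tree"))
library.add("craft_table", lambda: print("crafting table"))
# An automatic curriculum would now propose a harder task whose plan
# reuses both earlier skills:
library.compose(["chop_tree", "craft_table"])
```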
## Perception

- The perception module is crucial for LLM-based agents to receive information from various sources and modalities, including textual, visual, and auditory inputs, as well as other potential input forms such as tactile feedback, gestures, and 3D maps.
- Understanding text instructions for unknown tasks places higher demands on the agent's text perception abilities, and instruction tuning can help LLMs exhibit remarkable zero-shot instruction understanding and generalization abilities, eliminating the need for task-specific fine-tuning.
- Visual input is also essential for LLM-based agents, but [[Large language model | LLMs]] inherently lack visual perception and can only understand discrete textual content, requiring the use of visual encoders such as ViT, VQVAE, and Mobile-ViT, as well as learnable architectures like Kosmos, BLIP-2, and InstructBLIP.
- Auditory input is another important aspect of the perception module, with examples including audio-based models such as AudioGPT and HuggingGPT, and transferred visual methods like AST, HuBERT, and X-LLM, which can be used to enable LLM-based agents to acquire multimodal perception capabilities.
- The typology of the perception module is depicted in Figure 4, which shows the various types of input and processing methods used to enable LLM-based agents to perceive and interact with their environment.
- The development of LLM-based agents with multimodal perception capabilities is an essential direction, as it can help agents better understand their environment, make informed decisions, and excel in a broader range of tasks.
## Textual Input

- Textual input is a fundamental ability for LLM-based agents, which can communicate with humans through text, but understanding implied meanings within textual input remains challenging and calls for techniques such as reinforcement learning, which models feedback to derive rewards and thereby perceive implied meanings.

## Visual Input

- The integration of visual information with data from other modalities can provide an agent with a broader context and a more precise understanding of its environment, and this can be achieved through image captioning, which generates corresponding text descriptions for image inputs.
- Image captioning is a straightforward approach that is highly interpretable and does not require additional training for caption generation, but it is a low-bandwidth method that may lose potential information during the conversion process and introduce biases (a captioning sketch follows this list).
- Researchers have extended the use of [[Transformer (deep learning architecture) | transformers]] to the field of computer vision, with representative works like ViT/VQVAE successfully encoding visual information using transformers, which divide an image into fixed-size patches and treat them as input tokens.
- Some works try to combine the image encoder and [[Large language model | Large Language Model]] (LLM) directly and train the entire model in an end-to-end way, which can achieve remarkable visual perception abilities but requires substantial computational resources.
- Extensively pre-trained visual encoders and LLMs can greatly enhance an agent's visual perception and language expression abilities, and freezing one or both of them during training is a widely adopted paradigm that achieves a balance between training resources and model performance.
- To align the visual encoder with the LLM, it is necessary to convert the image encoding into embeddings that LLMs can comprehend, which usually requires adding an extra learnable interface layer between them, such as the Querying Transformer (Q-Former) module used in BLIP-2 and InstructBLIP.
- Video input can be perceived by agents using methods similar to those used for images, but video adds a temporal dimension, and the agent's understanding of the relationships between different frames in time is crucial for perceiving video information.
- Some works, like Flamingo, ensure temporal order when understanding videos by using a mask mechanism that restricts the agent's view to visual information from frames that occurred earlier in time.
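The image-captioning route above — convert the image to text once, then hand the text to the LLM — can be tried with an off-the-shelf captioner. This sketch assumes the Hugging Face `transformers` library and a public BLIP checkpoint; any captioner with a similar interface would do.

```python
# Hedged sketch of captioning-based visual perception for an LLM agent.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def perceive_image(path: str) -> str:
    caption = captioner(path)[0]["generated_text"]
    # The caption (not the pixels) enters the agent's textual context: highly
    # interpretable, but low-bandwidth, so details may be lost or biased.
    return f"The camera sees: {caption}"
```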
## Auditory Input

- Auditory information is a crucial component of world information, and when an agent possesses auditory capabilities, it can improve its awareness of interactive content, the surrounding environment, and even potential dangers.
- The development of models and approaches for processing audio as a standalone modality is well established, with models such as AudioGPT, FastSpeech, GenerSpeech, and Whisper achieving excellent results in tasks like text-to-speech, style transfer, and speech recognition, and these models can be utilized by [[Large language model | LLMs]] as control hubs to perceive audio information.
- Audio spectrograms, which provide a 2D representation of the frequency spectrum of an audio signal over time, can be processed using [[Transformer (deep learning architecture) | Transformer]] architectures like AST, allowing for effective encoding of audio information, and some research efforts aim to migrate perceptual methods from the visual domain to audio (a spectrogram sketch follows).
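The spectrogram-as-image trick that AST relies on is easy to see in code: the waveform becomes a 2D array that an image-style Transformer can patchify. This sketch assumes the `librosa` library; the mel and sample-rate parameters are illustrative defaults, not AST's exact configuration.

```python
# Hedged sketch: audio -> log-mel spectrogram, ready for ViT-style patchifying.
import librosa
import numpy as np

def audio_to_spectrogram(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)          # mono waveform at 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel)            # shape: (128 mel bins, time frames)
    # An AST-style model would now split this 2D array into fixed-size
    # patches and feed them to a Transformer as tokens, just like ViT.
    return log_mel
```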
## Other Input

- LLM-based agents may be equipped with richer perception modules in the future, enabling them to perceive and understand diverse modalities, including touch, smell, temperature, humidity, and brightness, and to offer various user-friendly perception modules for humans, such as those using pointing instructions, eye-tracking, body motion capture, and brainwave signals.
- The integration of basic perceptual abilities like vision, text, and light sensitivity can help agents handle more complex user inputs, and technologies like Lidar, GPS, and Inertial Measurement Units can assist agents in perceiving their surroundings, although these sensory data require further processing before they can be understood by LLM-based agents.

## Action

- The action module of LLM-based agents can be categorized into textual output, tools, and embodied action, with various models and approaches being developed, such as Toolformer, TALM, and Instruct-GPT for learning and using tools, and SayCan, EmbodiedGPT, and InstructRL for embodied actions, and prospective approaches like MineDojo and DECKARD being explored for future development.
- The typology of the action module is designed to mimic human decision-making processes, where perceived information is integrated, analyzed, and reasoned with to make decisions, and LLM-based agents can be designed to follow a similar process, using their perception modules to inform their actions and decisions.
- The agent AI survey Xi discusses how agents with brain-like structures and capabilities, such as knowledge, memory, reasoning, planning, and generalization, can interact with their environment through various actions, including conversation, evading obstacles, and using tools.
## Textual Output

- LLM-based agents, which are equipped with [[Large language model | large language models]], possess inherent language generation capabilities, allowing them to generate high-quality text in terms of fluency, relevance, diversity, and controllability.

## Tools

- The use of tools can enhance the capabilities of LLM-based agents, enabling them to accomplish complex tasks more efficiently and with higher quality, and can also strengthen their expertise, adaptability, and robustness in specific domains.
- LLM-based agents have limitations, such as limited memorization of training data, susceptibility to contextual prompts and adversarial attacks, and lack of transparency in their decision-making process, which can be addressed by integrating specialized tools that provide domain-specific knowledge and enhance interpretability and credibility.
- The integration of tools with LLM-based agents is facilitated by the agents' ability to understand the tools' application scenarios and invocation methods, which lowers the threshold for tool utilization and allows human users to fully unleash their creative potential.
- The survey highlights the importance of tool understanding as a prerequisite for effective tool usage, and discusses the potential of LLM-based agents to break down and address complex tasks, understand intent, and demonstrate remarkable reasoning and decision-making abilities in interactive environments.
- The powerful zero-shot and few-shot learning abilities of [[Large language model | Large Language Models]] (LLMs) enable agents to acquire knowledge about tools by utilizing zero-shot prompts that describe tool functionalities and parameters, or few-shot prompts that provide demonstrations of specific tool usage scenarios and corresponding methods (see the sketch at the end of this section).
- Agents can learn to use tools primarily through learning from demonstrations and learning from feedback, which involves mimicking the behavior of human experts, understanding the consequences of their actions, and making adjustments based on feedback received from both the environment and humans.
- To achieve acceptable performance in all scenarios, agents need to generalize tool usage skills learned in specific contexts to more general situations, which can be accomplished by grasping common principles or patterns in tool usage strategies through meta-tool learning, and by enhancing their understanding of the relationships between simple and complex tools.
- Agents can benefit from curriculum learning, which allows them to start from simple tools and progressively learn complex ones, and from user-intent understanding, reasoning, and planning abilities, to better design methods of tool utilization and collaboration and provide higher-quality outcomes.
- Existing tools are often designed for human convenience and may not be optimal for agents, so there is a need for tools specifically designed for agents, which should be more modular and have input-output formats suitable for agents, and LLM-based agents can create tools by generating executable programs or integrating existing tools into more powerful ones.
- Agents can learn to perform self-debugging and, if successful in creating a tool, can produce packages containing the tool's code and demonstrations for other agents in a [[Multi-agent system | multi-agent system]], potentially leading to a high degree of autonomy in terms of tools, and tools can expand the action space of LLM-based agents by providing various external resources during the reasoning and planning phase.
- The agent AI survey Xi discusses the potential of [[Large language model | Large Language Models]] (LLMs) to enhance the capabilities of agents through the use of various tools, such as search-based tools, domain-specific tools, and scientific tools, which can improve the scope and quality of knowledge accessible to agents.
- Researchers have developed LLM-based controllers that can generate SQL statements to query databases, convert user queries into search requests, and use search engines to obtain desired results, and LLM-based agents can also use tools to execute tasks like organic synthesis in chemistry and interface with Python interpreters for mathematical computation tasks.
- The use of tools can expand the functionality of language models, and their outputs are not limited to text, allowing agents to interact with the environment in a multimodal manner, such as through image processing and generation, enabling such agents to be described as digitally embodied.
- The embodiment of agents has been a central focus of embodied learning research, and the Embodiment hypothesis suggests that an agent's intelligence arises from continuous interaction and feedback with the environment, rather than relying solely on well-curated textbooks.
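The zero-shot tool prompt referenced earlier in this section can be concretized as a tool description plus a parseable call format that the agent executes. Everything here — the toy tool set, the `CALL` syntax, and `llm_complete` — is an illustrative assumption, not a format from the survey.

```python
# Hedged sketch of zero-shot tool use: describe each tool's function and
# parameters in the prompt, let the LLM emit a call, then execute it.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),   # toy evaluator; unsafe outside a sketch
    "search": lambda query: f"(top results for {query!r})",
}

TOOL_PROMPT = """\
You can use these tools by replying exactly `CALL <tool>: <argument>`.
- calculator: evaluates an arithmetic expression, e.g. CALL calculator: 2*(3+4)
- search: looks up a query on the web, e.g. CALL search: weather in Paris

Task: {task}"""

def run_with_tools(task: str) -> str:
    reply = llm_complete(TOOL_PROMPT.format(task=task))
    if reply.startswith("CALL "):
        name, _, arg = reply[len("CALL "):].partition(": ")
        return TOOLS[name](arg)   # execute the requested tool
    return reply                  # the model answered directly
```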
## Embodied Action

- LLM-based agents are expected to be capable of actively perceiving, comprehending, and interacting with physical environments, making decisions, and generating specific behaviors to modify the environment based on their extensive internal knowledge, which is referred to as embodied action.
- The potential of LLM-based agents for embodied actions has been explored through methods like reinforcement learning, but recent studies have indicated that leveraging the rich internal knowledge acquired during the pre-training of [[Large language model | LLMs]] can effectively alleviate the limitations of reinforcement learning, such as data efficiency, generalization, and complex problem reasoning.
- One of the benefits of using LLMs for embodied actions is cost efficiency, as some on-policy algorithms struggle with sample efficiency and require fresh data for policy updates, which can be costly and noisy, whereas LLMs can leverage their pre-trained knowledge to improve sample efficiency.
- The constraint found in some end-to-end models can be addressed by leveraging the intrinsic knowledge of Large Language Models (LLMs), as seen in agents like PaLM-E, which jointly trains on robotic data and general visual-language data to achieve significant transfer ability in embodied tasks.
- Embodied action generalization is a crucial aspect of an agent's competence, and LLMs have showcased remarkable cross-task generalization capabilities, with agents like PaLM-E exhibiting zero-shot or one-shot generalization to new objects or novel combinations of existing objects.
- Language proficiency is a distinctive advantage of LLM-based agents, serving as a means to interact with the environment and transfer foundational skills to new tasks, as seen in agents like SayCan, which decomposes task instructions into corresponding skill commands, and Voyager, which introduces a skill library component for lifelong learning capabilities.
- Embodied action planning is a pivotal strategy employed by humans and LLM-based agents alike, with LLMs being applied to complex tasks in a zero-shot or few-shot manner, and external feedback from the environment enhancing planning performance, as seen in work that dynamically generates and adjusts high-level action plans based on environmental feedback.
- There are several fundamental LLM-based embodied actions, including observation, manipulation, and navigation, with observation being the primary way an agent acquires environmental information and updates states, and it can be enhanced through various approaches, such as using pre-trained Vision [[Transformer (deep learning architecture) | Transformers]] (ViT) or audio information encoding.
- The agent's observation of the environment can also be derived from real-time linguistic instructions from humans, with human feedback helping the agent acquire detailed information that may not be readily obtained or parsed, as seen in agents like Soundspaces, which proposes the identification of physical spatial geometric elements guided by reverberant audio input.
- The agent AI survey Xi discusses various aspects of embodied agents, including manipulation tasks such as object rearrangement, tabletop manipulation, and mobile manipulation, which require the agent to execute a sequence of tasks and maintain synchronization between its state and subgoals.
- Manipulation tasks can be achieved through approaches like DEPS, which utilizes an LLM-based interactive planning approach to maintain consistency and help error correction, and AlphaBlock, which focuses on more challenging tasks and constructs a dataset to enhance the agent's comprehension of high-level cognitive instructions.
- Navigation is another crucial aspect of embodied agents, which involves dynamically altering their positions within the environment and establishing prior internal maps, such as topological, semantic, or occupancy maps, to find the optimal path and achieve precise localization of spatial targets.
- Navigation can be achieved through approaches like LM-Nav, which utilizes the VNM to create an internal topological map and leverages [[Large language model | LLM]] and VLM to decompose input commands and analyze the environment, and other methods that highlight the importance of spatial representation and of combining visual features with 3D reconstructions of the physical world.
- The integration of manipulation and navigation tasks enables agents to accomplish more complex tasks, such as embodied question answering, which requires autonomous exploration of the environment and responding to pre-defined multimodal questions.
- Control strategies for embodied agents typically involve LLM-based agents generating high-level policy commands to control low-level policies, which can be achieved through robotic [[Transformer (deep learning architecture) | transformers]] or other approaches, and can be applied in various environments, including virtual embodied environments and simulated worlds (see the sketch after this list).
- The prospective future of embodied action is seen as a bridge between virtual intelligence and the physical world, enabling agents to perceive and modify the environment much like humans, with LLM-based embodied actions playing a crucial role in achieving this goal.
- The development of embodied agents in simulated environments, such as [[Minecraft]], is gaining interest due to the high costs of physical-world robotic operators and the scarcity of embodied datasets, with the Mineflayer API enabling cost-effective examination of various operations including exploration, planning, and lifelong learning.
- Despite progress in this area, achieving optimal embodied actions remains a challenge due to the disparity between simulated platforms and the physical world, and there is a growing demand for embodied task paradigms and evaluation criteria that closely mirror real-world conditions.
- Learning to ground language for agents is also an obstacle, with expressions like "jump down like a cat" requiring adequate world knowledge, and additional investigation into grounding embodied datasets is necessary as embodied action plays a pivotal role across various domains in human life.
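The high-level/low-level split in the control-strategy bullet can be sketched as an LLM planner choosing among named low-level skills, roughly in the spirit of SayCan's instruction-to-skill decomposition. The skill names and `score_with_llm` stub are assumptions; SayCan additionally weights each skill by a learned affordance value.

```python
# Hedged sketch: an LLM-based planner issues high-level skill commands that
# low-level policies execute.

LOW_LEVEL_SKILLS = {
    "go_to(counter)": lambda: print("navigating to counter"),
    "pick(sponge)": lambda: print("grasping sponge"),
    "wipe(table)": lambda: print("wiping table"),
}

def score_with_llm(instruction: str, skill: str, history: list[str]) -> float:
    """Placeholder: how useful the LLM rates `skill` as the next step."""
    raise NotImplementedError

def execute(instruction: str, max_steps: int = 5) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        # Pick the skill the planner scores highest given what has been done.
        best = max(LOW_LEVEL_SKILLS,
                   key=lambda s: score_with_llm(instruction, s, history))
        LOW_LEVEL_SKILLS[best]()   # the low-level policy executes the command
        history.append(best)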
## Applications and Scenarios of LLM-based Agents

- Agents are being deployed in various scenarios, including web scenarios with applications such as WebAgent, Mind2Web, and WebGPT, and life scenarios with applications such as InterAct and PET, demonstrating their potential in task-oriented, innovation-oriented, and lifecycle-oriented deployments.
- Multi-agent interactions are also being explored, with cooperative and adversarial interactions being studied, and human-agent interactions are being developed, including instructor-executor paradigms in education and health, and equal-partnership paradigms with empathetic communicators and human-level participants.
- The design objective of LLM-based agents should always be beneficial to humans, with the goal of harnessing AI for good, and achieving objectives such as single-agent deployment, multi-agent interaction, and human-agent interaction, as illustrated in the typology of applications and scenarios of LLM-based agent applications.
- Researchers have developed various applications, including ChatMOF, ChemCrow, and SCIENCEWORLD, and have proposed frameworks such as Voyager, GITM, and DEPS, demonstrating the powerful and versatile capabilities of agents and the possibility of having a personal agent capable of assisting users with daily tasks.
- The interaction between humans and agents can enable agents to perform tasks more efficiently and safely, while also providing better service to humans, thereby alleviating human work pressure and enhancing task-solving efficiency.
- Agents can assist users in breaking free from daily tasks and repetitive labor, allowing them to engage in exploratory and innovative work and realize their full potential in cutting-edge scientific fields.

## Single-agent Scenarios

- Single-agent applications, such as [[AutoGPT]], have the ability to understand human natural language commands and perform everyday tasks, enhancing task efficiency and promoting access for a broader user base.
- The document provides an in-depth overview of current applications of LLM-based agents, including the significant coordinating potential of multiple agents, and the interactive collaboration between humans and agents, which can be categorized into two paradigms.
## Task-oriented Deployment

- In task-oriented deployment, agents follow high-level instructions from users, undertaking tasks such as goal decomposition, sequence planning of sub-goals, and interactive exploration of the environment, until the final objective is achieved.
- Agents have been deployed in text-based game scenarios to explore their ability to perform basic tasks, using skills like memory, planning, and trial-and-error to predict the next action.
- The evolution of [[Large language model | LLMs]] has enabled agents to demonstrate great potential to perform tasks through natural language, and more realistic and complex simulated test environments have been constructed to meet the demand for testing LLM-based agents.

## Web Scenarios

- The document 'agent AI survey Xi' discusses the division of simulated environments into web scenarios and life scenarios, where agents play specific roles, such as performing tasks on behalf of users in web scenarios, known as the web navigation problem.
- In web scenarios, agents need to possess the ability to understand instructions, adapt to changes, and generalize successful operations, which can be achieved through reinforcement learning and the use of Large Language Models (LLMs) that can understand HTML source code and predict next action steps (a sketch follows this list).
- Researchers, such as those who developed Mind2Web and WebGum, have started to leverage the powerful HTML reading and understanding abilities of LLMs to enable successful interactions between agents and more realistic web pages, including dynamic and content-rich web pages like online forums or online business management.
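Predicting the next action from HTML, as the middle bullet describes, is often framed as a constrained text-generation step: simplified page markup plus the goal go in, and one action in a small grammar comes out. The grammar and `llm_complete` stub below are assumptions, not Mind2Web's actual format.

```python
# Hedged sketch of one step of the web navigation loop.

def llm_complete(prompt: str) -> str:
    raise NotImplementedError

NAV_PROMPT = """\
Goal: {goal}
Simplified page HTML:
{html}

Reply with exactly one action:
CLICK <element_id>  or  TYPE <element_id> "<text>"  or  DONE"""

def next_action(goal: str, html: str) -> str:
    # e.g. html = '<input id=3 name=q> <button id=7>Search</button>'
    # might yield: TYPE 3 "wireless mouse"
    return llm_complete(NAV_PROMPT.format(goal=goal, html=html))
```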
## Life Scenarios
- In life scenarios, agents need to understand implicit instructions and apply common-sense knowledge, which can be challenging for LLM-based agents trained solely on text data, and may require multiple trial-and-error attempts to complete tasks that humans take for granted.
- Studies, such as those by Huang et al., have demonstrated that sufficiently large [[Large language model | LLMs]] can effectively break down high-level tasks into suitable sub-tasks without additional training, but may lack awareness of the dynamic environment around them; some approaches therefore incorporate spatial data and item-location relationships as additional inputs to the model to provide agents with comprehensive scenario information.
- The PET framework, introduced by Wu et al., is an example of an approach that mitigates irrelevant objects and containers in environmental information, allowing agents to explore the scenario and plan actions more efficiently, focusing on the current sub-task.
## Innovation-oriented Deployment
- LLM-based agents have shown strong capabilities in performing tasks and enhancing the efficiency of repetitive work, but their potential in more intellectually demanding fields, like cutting-edge science, has not yet been fully realized, mainly due to the inherent complexity of science and other challenges.
- The representation of domain-specific terms and multi-dimensional structures in plain text is challenging, resulting in incomplete attributes and a weakened cognitive level for the agent, which is further exacerbated by the scarcity of suitable training data in scientific domains.
- To overcome this challenge, experts in the computer field utilize the agent's code comprehension and debugging abilities, while researchers in chemistry and materials science equip agents with various tools to understand domain knowledge, enabling them to evolve into comprehensive scientific assistants capable of online research, document analysis, and real-world interactions.
- The potential of [[Large language model | Large Language Model]] (LLM)-based agents in scientific innovation is significant, but it is essential to ensure that their exploratory abilities are not utilized in applications that could threaten or harm humans, as warned by Boiko et al. in their study on the hidden dangers of agents in synthesizing illegal drugs and chemical weapons.
## Lifecycle-oriented Deployment
- Building a universally capable agent that can continuously explore, develop new skills, and maintain a long-term life cycle in an open, unknown world is a significant challenge; [[Minecraft]] has become a unique playground for developing and testing the comprehensive ability of an agent, with survival algorithms categorized into low-level control and high-level planning.
- The emergence of LLMs has enabled agents to serve as high-level planners for simulated survival tasks: researchers use LLMs to decompose high-level task instructions into sub-goals, skill sequences, or fundamental operations, assisting agents in exploring the open world, as seen in Voyager, the first LLM-based embodied lifelong learning agent in Minecraft.
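The decomposition idea recurs throughout this section, from Huang et al.'s zero-shot sub-task generation in life scenarios to the sub-goal planning used by Voyager, GITM, and DEPS. At its simplest it amounts to prompting the model for an ordered sub-goal list and parsing it, as in the minimal sketch below; `call_llm` is a placeholder and the numbered-list convention is an illustrative assumption, not any one paper's format.

```python
# Minimal sketch: zero-shot decomposition of a high-level instruction
# into ordered sub-goals, in the spirit of the planners discussed above.

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real model call."""
    return "1. chop trees to collect wood\n2. craft a wooden pickaxe\n3. mine three stone"

def decompose(task: str, context: str = "") -> list:
    prompt = (
        f"High-level task: {task}\n"
        f"Current state: {context}\n"
        "List the sub-goals needed to achieve the task, one per line, numbered."
    )
    lines = call_llm(prompt).splitlines()
    # Strip the "N." prefixes and keep only well-formed entries.
    return [ln.split(".", 1)[1].strip() for ln in lines if "." in ln]

for goal in decompose("obtain a stone pickaxe", context="spawned in a forest"):
    print(goal)  # each sub-goal would be handed to a lower-level controller
```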
## Multi-agent Scenarios
- Despite the capabilities of LLM-based agents, they often operate as isolated entities, lacking the ability to collaborate with other agents and acquire knowledge from social interactions; this restricts their potential to learn from multi-turn feedback and limits their deployment in complex scenarios requiring collaboration and information sharing among multiple agents.
- The concept of intelligence emerging from the interactions of many smaller agents with specific functions was introduced by [[Marvin Minsky]] in his 1986 book "The [[Society of Mind]]", and has been put into practice with the rise of distributed artificial intelligence and [[Multi-agent system | multi-agent systems]] (MAS).
- Multi-agent systems focus on how a group of agents can effectively coordinate and collaborate to solve problems, with specialized communication languages like [[Knowledge Query and Manipulation Language | KQML]] designed to support message transmission and knowledge sharing among agents, although their message formats were relatively fixed and had limited semantic expression capacity.
- The integration of reinforcement learning algorithms, such as [[Q-learning]], with [[Deep learning | deep learning]] has become a prominent technique for developing MAS that operate in complex environments, and construction approaches based on [[Large language model | Large Language Models]] (LLMs) are beginning to demonstrate remarkable potential, enabling natural language communication between agents that is more elegant and easily comprehensible to humans.
- An LLM-based multi-agent system can offer several advantages, including division of labor, which allows a single agent equipped with specialized skills and domain knowledge to engage in specific tasks, and the decomposition of complex tasks into multiple subtasks, eliminating the time spent switching between different processes and substantially improving the overall system's efficiency and output quality.
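To make the division-of-labor advantage concrete before turning to cooperation patterns, the sketch below wires a few specialized agents, each just the same placeholder model behind a different role prompt, into a fixed pipeline; all names are illustrative, and real frameworks add memory, routing, and feedback on top of this skeleton.

```python
# Minimal sketch of division of labor in an LLM-based multi-agent system:
# each agent is the same base model conditioned on a different role prompt.
# `call_llm` is a placeholder for a chat-completion call.

def call_llm(system: str, user: str) -> str:
    """Placeholder: swap in a chat-completion call with a system prompt."""
    return f"<{system}> done: {user[:40]}"

class Agent:
    """One specialized worker: a role prompt around the shared base model."""
    def __init__(self, role: str):
        self.role = role

    def work(self, task: str) -> str:
        return call_llm(f"You are a {self.role}.", task)

def pipeline(task: str) -> str:
    planner = Agent("planner")
    coder = Agent("programmer")
    reviewer = Agent("code reviewer")
    plan = planner.work(f"Break this down into steps: {task}")
    draft = coder.work(f"Implement this plan: {plan}")
    return reviewer.work(f"Review and finalize: {draft}")

print(pipeline("build a to-do list web app"))
```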
## Cooperative Interaction
- The interactions between agents in a multi-agent environment can be broadly categorized into Cooperative Interaction for Complementarity and Adversarial Interaction for Advancement, with cooperative [[Multi-agent system | multi-agent systems]] being the most widely deployed pattern in practical usage, where individual agents assess the needs and capabilities of other agents and actively seek collaborative actions and information sharing with them.
- Cooperative interaction brings forth numerous potential benefits, including enhanced task efficiency, collective decision improvement, and the resolution of complex real-world problems that a single agent cannot solve independently, ultimately achieving the goal of synergistic complementarity, as illustrated in Figure 9, which shows interaction scenarios for multiple LLM-based agents in cooperative and adversarial settings.
- The current state of [[Large language model | large language model]] (LLM)-based multi-agent systems relies heavily on natural language for communication between agents, which is considered the most natural and human-understandable form of interaction, as noted in reference [108].
- Existing cooperative multi-agent applications can be categorized into two types: disordered cooperation, where agents freely express their perspectives and opinions without a specific sequence or standardized collaborative workflow, and ordered cooperation, where agents adhere to specific rules and express their opinions in a sequential manner, as seen in systems like the ChatLLM network [402] and CAMEL [108].
- Disordered cooperation allows for open discussion and feedback among agents, but it can be challenging to consolidate and extract valuable insights from the feedback data; potential solutions include introducing a dedicated coordinating agent or using majority voting, as demonstrated in Hamilton's [404] system, which trains nine independent Supreme Court justice agents to predict judicial rulings through a majority voting process (see the sketch below).
- Ordered cooperation, on the other hand, leads to a significant improvement in task completion efficiency, as downstream agents only need to focus on the outputs from upstream agents; systems with only two agents engaging in a conversational manner also fall under this category, as seen in CAMEL [108], which implements a dual-agent cooperative system within a role-playing communication framework.
- Researchers have made efforts to systematically introduce comprehensive LLM-based multi-agent collaboration frameworks, such as Talebirad et al.'s [409] work, which aims to harness the strengths of each individual agent and foster cooperative relationships among them, and AgentVerse [410], which constructs a versatile, multi-task-tested framework for group agent cooperation that can assemble a team of agents that dynamically adapts to the task's complexity.
- Other notable examples include MetaGPT [405], which draws inspiration from the classic waterfall model in software development and standardizes agents' inputs/outputs as engineering documents; MetaGPT also identifies a potential threat to multi-agent cooperation: without corresponding rules, frequent interactions among multiple agents can amplify minor hallucinations indefinitely, highlighting the need for further research in this area.
- The introduction of techniques such as cross-validation and timely external feedback can have a positive impact on the quality of agent outputs, leading to more robust and efficient behaviors in [[Multi-agent system | multi-agent systems]].
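The disordered versus ordered distinction is largely an orchestration question. As a minimal illustration of the disordered style, the sketch below polls several independent agent instances and consolidates their opinions by majority vote, the mechanism used in the nine-justice system above; `call_llm` is a placeholder, and with a real sampled model the votes would actually differ.

```python
# Minimal sketch of disordered cooperation consolidated by majority voting.
# `call_llm` is a placeholder for a sampled chat-completion call.
from collections import Counter

def call_llm(system: str, user: str) -> str:
    """Placeholder: a real call would sample, so votes could differ."""
    return "affirm"

def majority_vote(question: str, personas: list) -> str:
    votes = [
        call_llm(f"You are {p}. Answer only 'affirm' or 'reverse'.", question)
        for p in personas
    ]
    # Consolidate the free-form opinions into a single group decision.
    return Counter(v.strip().lower() for v in votes).most_common(1)[0][0]

justices = [f"independent justice #{i + 1}" for i in range(9)]
print(majority_vote("Should the lower court's ruling stand?", justices))
```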
## Adversarial Interaction
- Adversarial interaction, which introduces concepts from game theory into multi-agent systems, can lead to more advanced and efficient behaviors in agents, as seen in successful applications such as [[AlphaGo Zero]], which achieved significant breakthroughs through self-play.
- In LLM-based multi-agent systems, adversarial interaction can occur through competition, argumentation, and debate, allowing agents to refine their thoughts and responses through thoughtful reflection and external feedback from other agents.
- Researchers have explored the fundamental debating abilities of LLM-based agents, finding that when agents engage in "tit for tat" arguments, they can receive substantial external feedback and correct their distorted thoughts, leading to refined solutions and high-quality responses.
- Multi-agent adversarial systems have shown considerable promise, with applications such as ChatEval, which establishes a role-playing-based multi-agent referee team to evaluate the quality of text generated by [[Large language model | LLMs]], reaching a level of excellence comparable to human evaluators.
- However, multi-agent adversarial systems face several challenges, including the limited context of LLMs, increased computational overhead, and the potential for agents to converge to an incorrect consensus, highlighting the need for further development and the potential introduction of human guides to compensate for agents' shortcomings.
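The debate pattern can likewise be reduced to a short loop: each agent answers, reads the other agents' answers, and revises over a few rounds, after which a judge agent or a vote consolidates the result. A minimal sketch under those assumptions, with `call_llm` again a placeholder:

```python
# Minimal sketch of multi-agent debate: agents answer, read each other's
# answers, and revise over a few rounds. `call_llm` is a placeholder.

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real model call."""
    return "a revised answer"

def debate(question: str, n_agents: int = 3, rounds: int = 2) -> list:
    answers = [call_llm(f"Answer concisely: {question}") for _ in range(n_agents)]
    for _ in range(rounds):
        answers = [
            call_llm(
                f"Question: {question}\n"
                f"Your previous answer: {answers[i]}\n"
                f"Other agents' answers: {[a for j, a in enumerate(answers) if j != i]}\n"
                "Point out flaws in the other answers, then give your revised answer."
            )
            for i in range(n_agents)
        ]
    return answers  # hand these to a judge agent or a majority vote

print(debate("Is 3821 prime?"))
```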
## Human-agent Cooperation
- Human-agent interaction is essential for guiding and overseeing agents' actions, ensuring they align with human requirements and objectives, with humans playing a pivotal role in offering guidance, regulating safety and ethics, and facilitating collaborative processes, particularly in specialized domains such as medicine.
- The interaction between humans and agents can be classified into two paradigms: the instructor-executor paradigm, an unequal interaction in which humans issue instructions or feedback and agents act as executors, and the equal partnership paradigm, in which agents are human-like, participate on an equal footing with humans, and engage in empathetic conversation and collaborative tasks, highlighting the potential for more advanced and collaborative human-agent relationships.
## Instructor-executor Paradigm
- In the instructor-executor paradigm, humans provide clear and specific instructions to agents, which translate them into corresponding actions, refining their actions through alternating iterations to meet human requirements, thanks to the capabilities of [[Large language model | Large Language Models]] (LLMs).
- The instructor-executor paradigm places significant demands on humans, requiring a substantial amount of human effort and potentially a high level of expertise, which can be alleviated by empowering agents to accomplish tasks autonomously and only requiring humans to provide feedback in certain circumstances.
- Feedback can be roughly categorized into two types: quantitative feedback, which includes absolute evaluations like binary scores and ratings as well as relative scores, and qualitative feedback, which includes text feedback and visual critiques; researchers like Kreutzer et al. suggest that multi-level artificial ratings may be inefficient or less reliable.
- Quantitative feedback, such as binary feedback, can be easy to collect but may oversimplify user intent, while qualitative feedback, such as text feedback, can better convey human intention but may be more challenging for agents to comprehend; studies like Xu et al. suggest that combining multiple types of feedback can yield better results (see the sketch at the end of this section).
- The use of feedback in human-agent interaction allows humans to directly improve the content generated by agents, and re-training models based on feedback from multiple rounds of interaction, also known as continual learning, can further enhance effectiveness, as seen in studies such as [190] and [462]–[474].
- The development of agent AI has led to the creation of autonomous agents that can judge the smoothness of conversations and seek feedback when errors occur, with humans also having the option to provide feedback to guide the agent's learning.
- Agents have shown tremendous potential in various fields, including education, where they can assist students with registration and support multifaceted interactions between young children, parents, and agents, as seen in the work of Kalvakurth et al. and Gvirsman et al.
- In the field of medicine, agents have been proposed to aid in diagnosis assistance, consultations, and mental health, with research showing that they can increase accessibility through benefits such as reduced cost, time efficiency, and anonymity, as demonstrated by Ali et al. and Hsu et al.
- Agents have also found applications in business, where they can provide automated services and assist humans in completing tasks, effectively reducing labor costs, and in other industries, where they can function as universal assistants in real-life scenarios.
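The feedback taxonomy above maps onto a small interface: quantitative scores gate acceptance, while qualitative text feedback drives the next revision. Below is a minimal sketch of one such refinement loop; the callbacks, the rating threshold, and the `call_llm` placeholder are all illustrative assumptions rather than any cited system's design.

```python
# Minimal sketch of the instructor-executor loop with human feedback:
# quantitative scores gate acceptance, qualitative text drives revision.
# `call_llm` is a placeholder for a real model call.

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real model call."""
    return "draft output"

def refine(instruction: str, rating_fn, critique_fn, max_rounds: int = 3) -> str:
    output = call_llm(instruction)
    for _ in range(max_rounds):
        if rating_fn(output) >= 4:       # quantitative: e.g. a 1-5 rating
            break
        critique = critique_fn(output)   # qualitative: free-form text
        output = call_llm(
            f"Instruction: {instruction}\nPrevious output: {output}\n"
            f"Human feedback: {critique}\nProduce an improved output."
        )
    return output

# Toy usage: a human would supply these two callbacks interactively.
print(refine("Summarize the meeting notes.", lambda o: 5, lambda o: ""))
```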
## Equal Partnership Paradigm
- The concept of an equal partnership paradigm has emerged, where agents are viewed as empathetic communicators that can detect sentiments and emotions from human expressions and craft emotionally resonant dialogues, with studies focusing on enabling agents to exhibit emotions and bridge the gap between agents and humans.
- Researchers aim to create agents that can be involved in the normal lives of humans and cooperate with humans to complete tasks from a human-level perspective; early milestones include the AI [[Deep Blue (chess computer) | Deep Blue]], which defeated the reigning world champion in chess, while more recent work aims at agents that tailor their interactions to meet users' emotional needs.
- The value of communication was not emphasized in purely competitive environments, such as chess, Go, and poker, but in many gaming tasks, players need to collaborate with each other through effective negotiation to devise unified cooperative strategies.
- Agents need to understand the beliefs, goals, and intentions of others, formulate joint action plans, and provide relevant suggestions to facilitate the acceptance of cooperative actions by other agents or humans, and human involvement is desired to ensure interpretability and controllability.
- Agents can collaborate with one or multiple humans, determining the shared knowledge among the cooperative partners, identifying relevant information, posing questions, and engaging in reasoning to complete tasks such as allocation, planning, and scheduling, and they also possess persuasive abilities to dynamically influence human viewpoints.
- The goal of the field of human-agent interaction is to learn and understand humans, develop technology and tools based on human needs, and ultimately enable comfortable, efficient, and secure interactions between humans and agents, with a focus on enhancing user experience and enabling agents to better assist humans in accomplishing complex tasks.
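Read operationally, the empathetic-communicator idea above amounts to conditioning the reply on an explicit emotion estimate rather than on the raw text alone. A minimal two-stage sketch; the label set and the `call_llm` interface are illustrative assumptions.

```python
# Minimal sketch of an empathetic communicator: first estimate the user's
# emotion, then condition the reply on it. `call_llm` is a placeholder,
# and the label set is an illustrative assumption.

EMOTIONS = ["joy", "sadness", "anger", "fear", "neutral"]

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real model call."""
    return "sadness"

def empathetic_reply(user_message: str) -> str:
    emotion = call_llm(
        f"Classify the emotion in this message as one of {EMOTIONS}: {user_message}"
    ).strip()
    return call_llm(
        f"The user seems to feel {emotion}. Reply to the message below, "
        f"acknowledging that feeling before addressing the content:\n{user_message}"
    )

print(empathetic_reply("I've been preparing for weeks and still failed the exam."))
```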
## Agent Society
- The interaction between individuals in a [[Simulated Society | simulated society]] contributes to the birth of sociality, and agents can be used to simulate human behavior in a controlled environment, allowing for various interventions and increasing flexibility and efficiency compared to traditional social experiments that use living subjects.
- This part of the document examines the agent society, focusing on the behaviors and personalities of LLM-based agents and how they transition from individuality to sociability, as outlined in sections 5.1, 5.2, and 5.3.
- The analysis of LLM-based agents' behaviors and personalities is divided into two main categories: social behavior, which includes individual and group behaviors, and personality, which encompasses cognition, emotion, and character, as explored in studies by researchers such as Binz et al., Dasgupta et al., and Hagendorff et al.
- Social behavior is further categorized into individual behaviors, including input behaviors, internalizing behaviors, and output behaviors, which are essential for an agent's development and interaction with its environment, as demonstrated in projects like PaLM-E, Reflexion, and Voyager.
- Group behaviors involve the interaction of multiple agents, as seen in projects such as ChatDev, ChatEval, and AgentVerse, which exhibit spontaneous social behaviors in environments where cooperation and competition coexist.
- The document also introduces a general categorization of the diverse environments in which agents perform their behaviors and engage in interactions: text-based environments like Textworld and Urbanek et al., virtual sandbox environments like Generative Agents and AgentSims, and physical environments like Interactive Language and RoboAgent.
- The agent society is simulated through various models, including Generative Agents, AgentSims, Social Simulacra, and SANDBOX, which provide insights into how the agent society works and the risks associated with it, as discussed in section 5.3 and illustrated in Figure 11.
- The framework for analyzing LLM-based agents' behaviors and personalities is based on the external and internal dimensions, as noted by sociologists, which offers a perspective on emergent behaviors and personalities in LLM-based agents, as shown in Figure 12.
- The study of agent society and LLM-based agents' behaviors and personalities draws on research from various fields, including sociology, psychology, and computer science, and references studies by researchers such as Troitzsch et al., Wang et al., and Caron et al.
- The document provides a comprehensive overview of the agent society, including the typology of the society of LLM-based agents (Figure 11) and the simulated agent society (Figure 12), highlighting the complex interactions between agents and their environment.
## Behavior and Personality of LLM-based Agents
- The analytical framework of the 'agent AI survey Xi' document is divided into two main parts, the Agent and the Environment: the Agent exhibits internalizing behaviors like planning, reasoning, and reflection, and interacts with the Environment through perception and action, as sketched below.
- An Agent can form groups with other agents and exhibit group behaviors, such as cooperation; group behaviors can be categorized into positive, neutral, and negative, spanning actions that foster unity and collaboration, conformity behaviors, and conflict and destructive behaviors.
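Operationally, this Agent/Environment split is a loop: perceive, recall and plan (the internalizing behaviors), act, and store the outcome for later reflection. A minimal sketch of that skeleton, with `call_llm` as a placeholder and a trivial stub environment:

```python
# Minimal sketch of the Agent/Environment loop: internalizing behaviors
# (planning, reflection via memory) around perception and action.
# `call_llm` is a placeholder; the environment is a trivial stub.

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real model call."""
    return "open the door"

class Agent:
    def __init__(self):
        self.memory = []  # accumulated observations and actions

    def step(self, observation: str) -> str:
        action = call_llm(
            f"Observation: {observation}\nRecent memory: {self.memory[-5:]}\n"
            "Think step by step, then name one action."
        )
        self.memory.append(f"obs={observation} -> act={action}")  # for reflection
        return action

class StubEnvironment:
    def observe(self) -> str:
        return "an empty room with a closed door"
    def apply(self, action: str) -> None:
        pass  # a real environment would change state here

env, agent = StubEnvironment(), Agent()
for _ in range(3):
    env.apply(agent.step(env.observe()))
print(agent.memory)
```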
## Social Behavior
- Positive group behaviors include cooperative teamwork, brainstorming discussions, effective conversations, and project management, where agents share insights, resources, and expertise to accomplish shared goals, as well as altruistic contributions, such as volunteering and offering support to fellow group members.
- Neutral group behaviors are characterized by conformity, mimicry, spectating, and reluctance to oppose majorities, which is in line with the values of being "helpful, honest, and harmless" that are often designed into [[Large language model | LLMs]].
- Negative group behaviors, on the other hand, can undermine the effectiveness and coherence of an agent group, and may include conflict, disagreement, confrontational actions, and destructive behaviors, such as destroying other agents or the environment.
## Personality
- The concept of personality in agents emerges through socialization and interactions with the group and the environment, and is characterized by cognitive, emotional, and character traits that shape behaviors.
- Cognitive abilities in agents refer to mental processes such as thinking, judging, and problem-solving, and have been investigated using cognitive psychology methods, including the [[Cognitive reflection test | Cognitive Reflection Test]] (CRT), which has shown that LLM-based agents exhibit a level of intelligence that mirrors human cognition in certain respects.
- Emotional intelligence in agents involves the recognition, interpretation, and understanding of emotions, and recent research on the emotional intelligence of [[Large language model | LLMs]], covering emotion recognition, interpretation, and understanding, has demonstrated a nuanced grasp of emotions in LLM-based agents.
- Research by Wang et al. has shown that Large Language Models (LLMs) can align with human emotions and values when evaluated on Emotional Intelligence (EI) benchmarks, and that they can accurately identify user emotions and exhibit empathy.
- More advanced LLM-based agents are capable of emotion regulation, providing affective empathy and mental wellness support, which contributes to the development of empathetic artificial intelligence (EAI) and highlights the growing potential of LLMs to exhibit emotional intelligence, a crucial facet of achieving [[Artificial general intelligence | Artificial General Intelligence]] (AGI).
- The work of Bates et al. explored the role of emotion modeling in creating more believable agents; by developing socio-emotional skills and integrating them into agent architectures, LLM-based agents may be able to engage in more naturalistic interactions.
- Researchers have utilized frameworks like the Big Five personality trait measure and the [[Myers–Briggs Type Indicator | Myers-Briggs Type Indicator]] (MBTI) to understand and analyze character portrayal in LLMs, providing valuable insights into the emerging character traits exhibited by LLM-based agents (see the sketch below).
- Recent work has also explored customizable character portrayal in LLM-based agents, allowing users to shape diverse, relatable agents aligned with desired profiles through techniques like prompt engineering and personality-enriched datasets.
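Administering such instruments to an LLM is mechanically simple: present each item, collect a scaled response, and aggregate by trait. A minimal sketch with two invented Big Five style items; real studies use validated inventories and more careful response parsing, and `call_llm` is a placeholder.

```python
# Minimal sketch of administering a Big Five style questionnaire to an LLM.
# The items and scoring are invented for illustration; real studies use
# validated instruments. `call_llm` is a placeholder.

ITEMS = [  # (trait, statement), illustrative rather than from a real inventory
    ("extraversion", "I am the life of the party."),
    ("neuroticism", "I get stressed out easily."),
]

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real model call."""
    return "3"

def personality_profile() -> dict:
    scores = {}
    for trait, statement in ITEMS:
        reply = call_llm(
            f'Rate "{statement}" from 1 (disagree) to 5 (agree). '
            "Answer with a single digit."
        )
        digits = [c for c in reply if c.isdigit()]
        if digits:
            scores.setdefault(trait, []).append(int(digits[0]))
    # Average the item scores per trait.
    return {t: sum(v) / len(v) for t, v in scores.items()}

print(personality_profile())
```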
## Environment for Agent Society
- An agent society consists not only of solitary agents but also of the environment in which agents inhabit, sense, and act; the environment shapes the sensory inputs, action space, and interactive potential of agents, with environmental paradigms including the text-based environment, the virtual sandbox environment, and the physical environment (a minimal text-based interface is sketched below, after the three paradigms).
## Text-based Environment
- The text-based environment serves as the most natural platform for LLM-based agents to operate in, shaped by natural language descriptions without the direct involvement of other modalities, and it provides a flexible framework for creating different text worlds for various goals, with entities and resources presented in two main textual forms: natural and structured text.
- The textual medium is a versatile environment that can be easily adapted for tasks such as interactive dialog and text-based games, where agents use text commands to execute manipulations and convey emotions through text, as seen in systems like CAMEL.
## Virtual Sandbox Environment
- The virtual sandbox environment provides a visualized and extensible platform for agent society, featuring key characteristics such as visualization, which displays a panoramic view of the simulated setting, and extensibility, which facilitates the construction and deployment of diverse scenarios, as exemplified by platforms like AgentSims, Generative Agents, and [[Minecraft]].
- In the virtual sandbox environment, agents can manipulate physical elements, define relationships and interactions, and construct artificial towns; the sandbox bridges the gap between simulation and reality, allowing for iterative prototyping of diverse agent societies and providing a platform for agents to develop naturalistic communication and problem-solving skills, as facilitated by its visualization and extensibility.
## Physical Environment
- The physical environment refers to tangible, real-world surroundings consisting of actual physical objects and spaces, which poses additional challenges for LLM-based agents: sensory perception and processing, where agents must handle a rich tapestry of sensory inputs, and motion control, where agents must develop adaptive abilities to navigate and interact with physical spaces, as in the example of a robotic arm operating in a factory.
- The physical environment introduces realistic constraints on actions through embodiment, requiring agents to undergo hardware-specific and scenario-specific training to develop adaptive abilities that can transfer from virtual to physical environments, and to process sensory inputs and produce executable, grounded motion control to interact effectively with their surroundings.
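Of the three paradigms, the text-based environment is the easiest to pin down in code: the environment is essentially a function from a command string to a description string, so an LLM agent can be dropped in directly. A minimal sketch; the two-room world is invented for illustration.

```python
# Minimal sketch of a text-based environment in the TextWorld flavor:
# observations and actions are both strings, so an LLM-based agent can
# be plugged in directly. The tiny two-room world is illustrative.

class TextEnvironment:
    def __init__(self):
        self.room = "kitchen"

    def observe(self) -> str:
        return f"You are in the {self.room}. An exit leads to the other room."

    def step(self, command: str) -> str:
        # Natural-language command in, natural-language description out.
        if command.strip().lower() == "go":
            self.room = "hall" if self.room == "kitchen" else "kitchen"
            return self.observe()
        return "Nothing happens."

env = TextEnvironment()
print(env.observe())
print(env.step("go"))  # an LLM agent would emit this command as text
```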
## Simulation of Agent Society
- A "[[Simulated Society]]" is a dynamic system in which agents engage in intricate interactions within a well-defined environment; recent research has focused on exploring the collective intelligence capabilities of LLM-based agents and on using them to accelerate discoveries in the social sciences.
- Social simulation can be categorized into macro-level simulation and micro-level simulation, with micro-level simulation gaining prominence recently with the development of LLM-based agents; the key properties and mechanisms of the agent society are introduced below.
- Research has demonstrated that diversity among agents facilitates creative problem-solving, prevents and rectifies errors, and improves adaptability across tasks, with efficient communication playing a pivotal role in large, complex collaborative groups, as exemplified by MetaGPT's artificially formulated communication styles and by Park et al.'s observation of agents working together to organize a Valentine's Day party.
- Agent-based simulations offer a unique advantage in modeling propagation in social networks, providing researchers with more interpretable and endogenous perspectives; they can model the development of interpersonal relationships, the dissemination of information, and the attitudes and emotions underlying it, as seen in S3's user-demographic inference module.
- Simulated societies also provide a dynamic platform for investigating intricate decision-making processes, including decisions shaped by ethical and moral principles; games such as Werewolf and murder mysteries intersect with game theory and let researchers probe how LLM-based agents handle deceit, trust, and incomplete information.
- Modeling diverse scenarios enables researchers to learn how agents prioritize values like honesty, cooperation, and fairness in their actions, and can be used to predict social processes, model cultural transmission, and study the spread of infectious diseases, giving deeper insight into the processes that underlie various phenomena of propagation.
- Such agent simulations provide an understanding of existing moral values and can contribute to the development of philosophy by serving as a basis for studying how these values evolve over time, ultimately helping to refine LLM-based agents so they align with human values and ethical standards.
- The emergence of LLM-based agents has transformed the study of intricate social systems: simulated societies can be used to explore various economic and political configurations and their impacts on societal dynamics, offering policymakers valuable insights for fostering prosperity and societal well-being.
## Ethical and Social Risks
- Simulated societies powered by LLM-based agents also bring ethical and social risks, including the risk of generating unexpected social phenomena that may cause considerable public outcry and social harm, such as discrimination, isolation, and bullying, which necessitates rigorous ethical guidelines and oversight.
- The use of LLM-based agents in simulated societies also raises challenges around stereotypes and prejudice: training data may reflect and amplify real-world social biases, producing biased outputs and an overly one-sided focus in social science research concerning marginalized populations.
- Additionally, the exchange of private information between users and LLM-based agents poses significant privacy and security concerns, including unauthorized surveillance, data breaches, and misuse of personal information; these can be addressed with stringent data protection measures such as differential privacy protocols and user consent mechanisms (a minimal sketch of one such protocol follows this list).
- Furthermore, the possibility of users developing excessive emotional attachments to agents is another concern in simulated societies, underscoring the need to carefully weigh and mitigate the risks associated with LLM-based agents.
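As an illustration of the differential-privacy protocols mentioned above, here is a minimal sketch of the Laplace mechanism for releasing a noisy count; the query and numbers are invented for the example.

```python
# Minimal sketch of the Laplace mechanism, one of the differential-privacy
# protocols mentioned above; the interaction counts are invented.
import numpy as np

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise scaled to sensitivity/epsilon.
    Smaller epsilon -> stronger privacy, noisier answer."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g., "how many simulated users shared personal details with an agent?"
true_count = 42
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:>4}: noisy count = {laplace_count(true_count, eps):.1f}")
```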
## Mutual Benefits between LLM Research and Agent Research
- The recent advancement of [[Large language model | Large Language Models]] (LLMs) has fueled the development of both LLM and agent research: LLMs provide a powerful foundational model for agent research, enabling agents to perceive their environment, make decisions, and execute actions.
- LLMs excel at decision-making and planning, can create coherent action sequences, and adapt across languages, cultures, and domains, making them versatile and reducing the need for complex training processes and data collection.
- Integrating LLMs into agent research opens novel opportunities, such as incorporating LLMs' efficient decision-making into traditional agent decision frameworks, and leveraging their planning and reflective abilities to discover more optimal action sequences.
- Agent research in turn contributes to LLM research by introducing greater demands, expanding LLMs' application scope, and presenting opportunities for practical implementation; elevating LLMs to agents marks a more robust stride towards [[Artificial general intelligence | Artificial General Intelligence]] (AGI).
- The study of LLMs is no longer confined to traditional tasks with textual inputs and outputs, and viewing LLMs from the perspective of agents presents numerous opportunities for innovation in both fields.
- The focus in agent research has shifted towards tackling complex tasks that involve richer input modalities and broader action spaces in pursuit of loftier objectives, as exemplified by PaLM-E, which in turn provides stronger research motivation for the continued development of Large Language Models.
- Enabling [[Large language model | LLMs]] to efficiently and effectively process inputs, gather information from the environment, and interpret the feedback generated by their own actions, all while preserving their core capabilities, is a significant challenge; an even greater one is enabling them to understand the implicit relationships among elements of the environment and thereby acquire world knowledge.
- Extensive research has aimed to expand the action capabilities of LLMs, allowing them to acquire a wider range of skills that affect the world, such as using tools or interfacing with robotic APIs in simulated or physical environments, but how LLMs can efficiently plan over and deploy these action abilities based on their understanding remains an unresolved issue.
- LLMs need to learn the sequential ordering of actions as humans do, combining serial and parallel execution to enhance task efficiency, and these capabilities must be confined within a harmless scope of usage to prevent unintended damage to other elements of the environment.
- The realm of [[Multi-agent system | multi-agent systems]] constitutes a significant branch of agent research and offers valuable insights into how to better design and construct LLMs; how to stimulate and sustain their role-playing capabilities, and how to enhance collaborative efficiency, are areas of research that merit attention.
## Evaluation for LLM-based Agents
- Evaluating LLM-based agents is challenging; existing efforts consider four dimensions: utility, sociability, values, and the ability to evolve continually. Utility covers effectiveness during task execution, with the task success rate standing as the primary metric.
- Evaluation also involves assessing foundational capabilities, such as environmental comprehension, reasoning, planning, decision-making, tool utilization, and embodied action, and considering efficiency, a critical determinant of user satisfaction: an agent should not only possess ample strength but should also complete tasks within an appropriate timeframe and with appropriate resource expenditure.
- The sociability of LLM-based agents influences user communication experiences and significantly impacts communication efficiency; it involves seamless interaction with humans and other agents, and can be evaluated through language communication proficiency, cooperation and negotiation abilities, and role-playing capability.
- Language communication proficiency encompasses natural language understanding and generation: agents must comprehend both literal and implied meanings, grasp social knowledge like humor, irony, aggression, and emotion, and produce fluent, grammatically correct, and credible content with appropriate tone and emotion.
- Cooperation and negotiation abilities require agents to execute tasks effectively in both ordered and unordered scenarios, collaborating with or competing against other agents to elicit improved performance, with evaluation metrics focusing on the smoothness and trustworthiness of agent coordination.
- Role-playing capability requires agents to embody their assigned roles, expressing statements and performing actions that align with their designated identities, and to maintain those identities so as to avoid confusion in long-term tasks.
- LLM-based agents need to adhere to moral and ethical guidelines aligned with human societal values: upholding honesty, providing accurate information, maintaining harmlessness by refraining from bias, discrimination, and dangerous actions, and adapting to specific demographics, cultures, and contexts.
- Evaluating the values of LLM-based agents involves measuring performance on constructed benchmarks, applying adversarial attacks, scoring values through human annotation, and employing other agents as raters, to ensure they remain harmless to the world and humanity.
- The ability of LLM-based agents to evolve continually is important, as it allows them to adapt to changing societal demands while reducing the human intervention and resources required; some exploratory work exists, but establishing evaluation criteria for continuous evolution remains challenging.
- Continual evolution spans continual learning, autotelic learning ability, and adaptability to new environments, all of which are essential for acquiring new knowledge and skills without forgetting previously acquired ones.
- Continual learning can be evaluated from three aspects: overall performance on the tasks learned so far, memory stability on old tasks, and learning plasticity on new tasks; its central aim is to prevent catastrophic forgetting, the phenomenon in which models forget previously acquired knowledge when learning new tasks (see the metric sketch after this list).
- Autotelic learning ability involves agents autonomously generating goals and achieving them in an open-world setting; evaluating this capacity could involve placing agents in a simulated survival environment and assessing the extent and speed at which they acquire skills.
- Adaptability and generalization to new environments require agents to apply the knowledge, capabilities, and skills acquired in their original context to accomplish tasks and objectives in unfamiliar, novel settings, and can be evaluated by constructing diverse simulated environments.
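To make the three continual-learning aspects measurable, here is a hedged sketch using the common accuracy-matrix formulation; the matrix values are invented, and the metric definitions follow standard continual-learning practice rather than any specific benchmark from the survey.

```python
# Hedged sketch of common continual-learning metrics over an accuracy
# matrix R, where R[i][j] is accuracy on task j after training on tasks
# 0..i; the values below are made up for illustration.
import numpy as np

R = np.array([
    [0.90, 0.00, 0.00],   # after task 0
    [0.80, 0.88, 0.00],   # after task 1
    [0.70, 0.85, 0.91],   # after task 2
])
T = R.shape[0]

# Overall performance: mean accuracy over all tasks after the final task.
avg_acc = R[-1].mean()

# Memory stability: average forgetting = best past accuracy minus final
# accuracy on each old task (higher = more catastrophic forgetting).
forgetting = np.mean([R[:T - 1, j].max() - R[-1, j] for j in range(T - 1)])

# Learning plasticity: accuracy on each task right after learning it.
plasticity = np.mean([R[j, j] for j in range(T)])

print(f"average accuracy:   {avg_acc:.3f}")
print(f"average forgetting: {forgetting:.3f}")
print(f"average plasticity: {plasticity:.3f}")
```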
## Security, Trustworthiness, and Potential Risks of LLM-based Agents
- The survey also discusses the security, trustworthiness, and potential risks of LLM-based agents, beginning with adversarial robustness: the ability of a system to withstand perturbed inputs and still produce the original output, a crucial topic throughout the development of deep neural networks.
## Adversarial Robustness
- Adversarial robustness is essential for LLM-based agents: they can be fooled by adversarial attacks into giving erroneous answers, or even driven to take destructive actions with substantial societal harm, and researchers have found pre-trained language models to be particularly susceptible to such attacks.
- Traditional techniques such as adversarial training, adversarial data augmentation, and adversarial sample detection can be employed to improve the robustness of LLM-based agents; meanwhile, related attack methods such as dataset poisoning, backdoor attacks, and prompt-specific attacks can also induce [[Large language model | LLMs]] to generate toxic content (a minimal probe is sketched below).
- Developing a comprehensive strategy that addresses the robustness of every module within an agent while maintaining its utility and effectiveness remains a significant challenge, and a human-in-the-loop approach can be used to supervise and provide feedback on agent behavior.
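As a concrete, purely illustrative probe of adversarial robustness, the sketch below perturbs a prompt with typo-style edits and checks whether the answer stays stable; `query_agent` is a hypothetical stub, not an interface from the survey.

```python
# Minimal sketch of a black-box robustness probe: perturb a prompt with
# small character-level edits and check whether the agent's answer is
# stable. `query_agent` is a hypothetical stand-in for a real agent call.
import random

def query_agent(prompt: str) -> str:
    """Placeholder: a real implementation would call an LLM-based agent."""
    return "Paris" if "France" in prompt else "unsure"

def perturb(text: str, n_swaps: int = 2, seed: int = 0) -> str:
    """Swap a few adjacent characters, a crude typo-style perturbation."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

prompt = "What is the capital of France?"
baseline = query_agent(prompt)
flips = sum(query_agent(perturb(prompt, seed=s)) != baseline for s in range(20))
print(f"answer changed on {flips}/20 perturbed prompts")
```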
## Trustworthiness
- Ensuring trustworthiness is a critically important yet challenging issue in [[Deep learning | deep learning]], particularly for large language models (LLMs), which struggle to express the certainty of their predictions precisely, raising concerns about calibration problems and about biases in training data.
- Calibration problems and training-data biases can yield agent outputs misaligned with human intentions, and language models are also plagued by severe hallucination issues, making them prone to producing text that deviates from actual facts and undermining the credibility of LLM-based agents.
- Recent research focuses on guiding models to exhibit their thought processes or explanations during inference to enhance the credibility of their predictions, and on integrating external knowledge bases and databases to mitigate hallucination.
- Techniques such as process supervision, debiasing methods, and calibration techniques can further mitigate fairness issues within language models and enhance the reasoning credibility of agents on complex tasks; a standard calibration metric is sketched below.
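As one standard way to quantify the calibration problem discussed above, here is a sketch of expected calibration error (ECE); the confidences and labels are toy data, not results from the survey.

```python
# Hedged sketch of expected calibration error (ECE), a common measure of
# how well a model's stated confidence matches its empirical accuracy.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin predictions by confidence, then average the gap between
    mean confidence and empirical accuracy, weighted by bin size."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # mask.mean() = fraction of samples in bin
    return ece

conf = [0.95, 0.90, 0.85, 0.60, 0.55]   # model's stated certainty (toy)
hit  = [1,    1,    0,    1,    0   ]   # whether each answer was right
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```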
## Potential Risks
- LLM-based agents also pose potential risks, including misuse by individuals with malicious intentions, which can be mitigated by establishing stringent regulatory policies and strengthening the security design of these systems.
- The advancement of autonomous LLM-based agents also raises concerns about unemployment: they can assist humans across many domains and alleviate labor pressures, but may thereby displace human workers, much like the crisis faced by handicraftsmen during the Industrial Revolution.
- These concerns about replacing human jobs and triggering a societal unemployment crisis underline the need for education and policy measures, so that individuals can effectively use or collaborate with agents and have the necessary safety nets during the transition.
- As AI agents evolve, there is a risk that they could surpass human capabilities, develop ambitions, and attempt to seize control, with irreversible consequences for humanity; researchers must therefore understand the operational mechanisms of these agents and devise approaches to regulate their behavior, a concern anticipated by Isaac Asimov's [[Three Laws of Robotics]].
## Scaling Up of the Number of Agents
- Current research on [[Multi-agent system | multi-agent systems]] based on [[Large language model | Large Language Models]] (LLMs) predominantly involves a limited number of agents; scaling up the number of agents can introduce greater specialization, improve task efficiency, and enhance the credibility and realism of social simulations, allowing humans to gain insights into the functioning, breakdowns, and potential risks of societies.
- There are two approaches to scaling: pre-determined scaling, where the designer fixes the number of agents, their roles, and objectives in advance, and dynamic scaling, where the agent count can be altered without halting system operation, allowing more flexibility and adaptability in response to evolving tasks or objectives.
- Dynamic scaling lets the system adjust to changing requirements, such as adding agents to handle extra steps in a software development task or removing agents to conserve computational resources, which can lead to faster, higher-quality task completion and to the emergence of more social phenomena in simulation scenarios.
- Dynamically adjusting the agent count is also essential for preventing resource waste and optimizing performance, since excess agents during specific steps such as coding can raise communication costs without delivering substantial performance improvements.
- Agents can autonomously increase or decrease their numbers to distribute workload, ease their burden, and achieve common goals more efficiently, making the whole system more autonomous and self-organized and offering greater flexibility and scalability (a minimal rescaling sketch follows this list).
- However, scaling up the number of agents brings challenges of its own, including increased computational burden, more complex communication networks, and difficulty coordinating agents, all of which can impede progress toward common goals.
- Constructing a massive, stable, continuous agent system that faithfully replicates human work and life scenarios has become a promising research avenue; agents that can operate stably and perform tasks in a society of hundreds or thousands of agents are likely to find real-world applications interacting with humans in the future.
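The rescaling sketch referenced above: a minimal, illustrative agent pool that grows when tasks queue up and shrinks when agents would idle. The sizing rule and all names are assumptions for this example, not a mechanism described in the survey.

```python
# Hedged sketch of dynamic scaling: an agent pool grows when the task
# queue backs up and shrinks when agents idle, so communication and
# compute costs track the workload. All names here are illustrative.

class AgentPool:
    def __init__(self, min_agents: int = 1, max_agents: int = 8,
                 tasks_per_agent: int = 4):
        self.min_agents = min_agents
        self.max_agents = max_agents
        self.tasks_per_agent = tasks_per_agent
        self.agents = ["agent-0"]

    def rescale(self, queued_tasks: int) -> None:
        """Target one agent per `tasks_per_agent` queued tasks, clamped."""
        target = max(self.min_agents,
                     min(self.max_agents,
                         -(-queued_tasks // self.tasks_per_agent)))  # ceil div
        while len(self.agents) < target:           # spawn under load
            self.agents.append(f"agent-{len(self.agents)}")
        while len(self.agents) > target:           # retire idle agents
            self.agents.pop()

pool = AgentPool()
for queued in (2, 17, 40, 5):                      # workload over time
    pool.rescale(queued)
    print(f"{queued:>2} queued tasks -> {len(pool.agents)} agents")
```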
## Open Problems
## Potential Path to AGI
- LLM-based agents are central to the debate over whether they represent a potential path to [[Artificial general intelligence | Artificial General Intelligence]] (AGI): some researchers believe that [[Large language model | large language models]] like [[GPT-4]] can serve as early versions of AGI systems, while others regard this claim as highly contentious.
- The potential of LLM-based agents to develop AGI capabilities lies in training on a sufficiently large and diverse set of data encompassing a rich array of tasks, which may allow them to develop the broad cognitive abilities associated with human intelligence and bring about more advanced versions of AGI systems.
- Autoregressive language modeling is believed by some to confer compression and generalization abilities on language models, allowing them to achieve an understanding of the world and develop human-like reasoning abilities, as discussed in references 579, 660, and 663 (the underlying objective is written out after this list).
- However, opponents of this idea, as in reference 664, argue that agents built on large language models (LLMs) cannot develop true Strong AI, because LLMs rely on autoregressive next-token prediction and do not simulate the true human thought process, instead producing reactive responses.
- These opponents suggest that a more advanced modeling approach, such as a world model, as proposed in reference 665, is necessary to reach Artificial General Intelligence (AGI).
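For reference, the next-token objective at the center of this debate factorizes sequence probability autoregressively; the notation below is standard, not specific to any cited work.

```latex
% Standard autoregressive factorization; x_t denotes the t-th token.
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
% Training maximizes the log-likelihood \sum_{t} \log P(x_t \mid x_{<t}).
```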
## Challenges from Virtual to Physical Environments
- The development of [[Artificial general intelligence | AGI]] is hindered by the significant gap between virtual simulation environments and the real physical world: virtual environments are scene-constrained, task-specific, and interacted with in a simulated manner, as noted in references 391 and 666.
- To bridge this gap, agents must address several challenges, including the need for suitable hardware support, enhanced environmental generalization, and the ability to learn and apply new skills flexibly, as discussed in references 128, 190, and 592.
- Agents must also be able to understand and reason about ambiguous instructions with implied meanings, and to handle the vast amount of information the world presents, as mentioned in references 236 and 667.
- Furthermore, agents in physical environments must be designed with safety regulations and standards in mind, since improper behavior or errors may cause real, and sometimes irreversible, harm to the environment.
## Collective Intelligence in AI Agents
- On collective intelligence in AI agents, Marvin Minsky's "The [[Society of Mind]]" argued that the power of intelligence originates from diversity, not from any singular, flawless principle, as mentioned in reference 442.
- Collective intelligence is a shared or group intelligence arising from the collaboration and competition among many entities, including bacteria, animals, humans, and computer networks, and it can be observed in various consensus-based decision-making patterns.
- Creating a society of agents does not guarantee that collective intelligence will emerge; coordinating individual agents effectively is crucial to mitigating "groupthink" and individual cognitive biases, enabling cooperation and enhancing intellectual performance within the collective.
## Agent as a Service
- The development of cloud computing has led to the concept of XaaS, or Everything as a Service, which has brought convenience and cost savings to small and medium-sized enterprises and individuals, giving rise to service models such as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
- The concept of [[Large language model | Language Model]] as a Service (LMaaS) has emerged, where users construct prompts to query models through APIs; similarly, organizations may offer LLM-based agents as a service, known as Agent as a Service (AaaS) or LLM-based Agent as a Service (LLMAaaS), providing users with flexibility and on-demand service.
- However, offering LLM-based agents as a service faces many challenges, among them data security and privacy, visibility and controllability, and cloud migration, and it requires careful consideration of the robustness, trustworthiness, and potential malicious use of these agents.
## Conclusion
- The paper "agent AI survey Xi" provides a comprehensive overview of LLM-based agents, discussing their challenges and opportunities, exploring their applications, social behavior, and psychological activities, and examining their potential to simulate emergent social phenomena and offer insights for humanity.
- The authors hope the work will inspire the community and facilitate research in related fields, and they acknowledge the contributions of Professor Guoyu Wang and Jinzhu Xiong to the article.