The emergence of self-creating AI agents across major large language models (LLMs) marks a significant turning point in artificial intelligence development. Models such as Claude Code, Codex, and Pi are now capable of generating their own autonomous agents, a capability that researchers suggest will fundamentally alter how technology is utilized.
The Advancement of Agentic AI
The field of agentic AI has gained substantial visibility recently, with platforms like OpenClaw and NemoClaw garnering media attention. These sophisticated agents have the potential to manage complex tasks, such as organizing computer files or clearing an email inbox. Furthermore, major tech companies, including Microsoft, are planning to integrate these capabilities across their Windows operating system.
Experts note that the operational capacity of advanced LLMs, such as Claude, is growing faster than developers can devise practical use cases for it. Orchestrators built on models like Claude began taking over tasks on GitHub only recently, yet the underlying LLMs have since demonstrated exponential improvements in performance.
Challenges in Multi-Agent Systems
A critical concern in developing these complex systems is the risk of error compounding within multi-agent architectures. When the output from one AI agent serves as the input for another, inaccuracies can multiply rapidly. While AI agents are known to hallucinate or provide unreliable information, this problem becomes exponentially more severe when multiple agents interact.
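The compounding effect can be made concrete with a toy model. The assumption below is illustrative and not taken from any cited study: each agent in a sequential pipeline independently produces a correct output with some fixed probability, and a single wrong intermediate result corrupts everything downstream.

```python
# Toy model of error compounding in a sequential agent pipeline.
# Assumption (illustrative only): each agent is correct with probability
# `per_agent_accuracy`, independently, and one wrong intermediate output
# corrupts every downstream step.

def pipeline_accuracy(per_agent_accuracy: float, num_agents: int) -> float:
    """Probability the whole chain is correct: p ** n."""
    return per_agent_accuracy ** num_agents

# Even highly reliable individual agents degrade quickly when chained.
for n in (1, 2, 4, 8):
    print(f"{n} agents: {pipeline_accuracy(0.95, n):.3f}")
```

Under this sketch, a chain of eight 95%-accurate agents is right only about two-thirds of the time, which is the intuition behind the rapid error growth described above.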
In a study conducted by Google DeepMind in 2025, researchers tested 180 different configurations across five distinct agent architectures and three primary LLMs. The findings revealed that unstructured multi-agent networks increased errors by up to 17.2 times compared with single-agent baselines.
The research also indicated that performance benefits failed to scale beyond four agents, as the necessary coordination overhead began negating any potential advantages. This outcome sharply contrasts with industry practices, where some organizations currently deploy between six and twenty agents simultaneously to tackle complex assignments.
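One hedged way to see why gains can stall is a simple throughput model, which is not the study's methodology but a common back-of-the-envelope sketch: useful work grows with the number of agents, while coordination cost grows with the number of agent pairs. The `coord_cost` parameter below is a hypothetical value chosen purely for illustration.

```python
# Illustrative (hypothetical) model of diminishing returns from adding agents:
# useful work scales linearly with n, but coordination overhead scales with
# the number of agent pairs, n * (n - 1) / 2.

def effective_throughput(n_agents: int, coord_cost: float = 0.12) -> float:
    """Work delivered per unit time after paying pairwise coordination cost."""
    pairs = n_agents * (n_agents - 1) / 2
    return n_agents / (1 + coord_cost * pairs)

# With this (arbitrary) cost parameter, throughput peaks at a small team
# and then declines as coordination overhead swamps the extra labor.
best = max(range(1, 21), key=effective_throughput)
print(f"throughput peaks at {best} agents")
```

The exact peak depends entirely on the assumed cost parameter; the point is only that quadratic coordination overhead against linear work gain produces a small optimal team size, consistent with the finding described above.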
Research Outpaces Practice
A growing disparity exists between the rapid pace of AI model development and research into system security and interaction protocols. Because these powerful models now build their own agents, safety studies cannot begin until the models are released to researchers, creating a lag similar to the one already seen in AI tool development.
A recent paper titled “Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures” specifically challenged pre-designed multi-agent hierarchies. The study analyzed 25,000 tasks across eight different LLMs, using configurations ranging from four to 256 agents under eight coordination protocols.
The results demonstrated that the optimal performance was achieved using a hybrid approach: while a general structure could be established, individual agents were permitted to self-organize and determine their own roles. This suggests that rather than assigning fixed roles, developers should provide autonomous agents with a specific mission, an operational protocol, and a suitable model for execution.