The question that arrives after an organisation decides to build multi-agent systems with a pro-code approach is always the same: which framework should we use? AutoGen, LangGraph, CrewAI, the Claude Agent SDK — the landscape has four serious contenders, each with different design philosophies, different strengths, and vocal advocates who insist their choice is the right one. The framework comparison articles are abundant. Most of them are wrong — not about the frameworks, but about what matters.
Framework choice is a second-order decision. It determines the programming model and the abstractions you work with. It does not determine whether your multi-agent system creates enterprise value. Five architectural decisions determine that: how your agents communicate and coordinate (orchestration design), how they accumulate and share knowledge (memory architecture), how they are governed and constrained (governance layer), how they select and route between models (model routing), and how you monitor and debug them in production (observability). Get these five right, and any serious framework will execute. Get them wrong, and no framework will save you.
This is not a claim that frameworks are interchangeable. They are not — and this article covers the meaningful differences. But the architecture sits above the framework, and the architecture is where multi-agent systems succeed or fail at enterprise scale.
Orchestration design: how agents coordinate
Orchestration design is the decision about how agents communicate, share state, and resolve conflicts. Three patterns dominate enterprise multi-agent systems, and the right choice depends on the workflow being automated, not on framework preference.
Hub-and-spoke orchestration places a central orchestrator agent that receives all requests, interprets intent, and routes to specialised sub-agents. The orchestrator maintains the master context, and sub-agents report their results back to it. This is the pattern that Copilot Studio implements natively, and it is the right pattern for customer service routing (one orchestrator, five to ten specialised agents), document processing pipelines (one router, specialised extractors), and any workflow where the coordination logic is relatively simple and the specialisation is in the execution, not the coordination.
The limitation of hub-and-spoke is that the orchestrator becomes a bottleneck — every interaction passes through it, it needs to understand the capabilities of every sub-agent, and its context window must accommodate the full conversation history. As the number of sub-agents grows beyond fifteen to twenty, the orchestrator's ability to route accurately degrades. More importantly, sub-agents cannot communicate directly with each other, which prevents the collaborative patterns that create the most value.
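To make the pattern and its bottleneck concrete, here is a minimal sketch of a hub-and-spoke router. It is an illustration under stated assumptions, not any framework's API: the `Orchestrator` class, the keyword-based `classify` method (a stand-in for an LLM intent classifier) and the two sub-agents are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sub-agent type: request text in, reply text out.
SubAgent = Callable[[str], str]


@dataclass
class Orchestrator:
    """Central hub: classifies intent, routes to one sub-agent, keeps the master context."""
    agents: dict[str, SubAgent]
    history: list[tuple[str, str, str]] = field(default_factory=list)

    def classify(self, request: str) -> str:
        # Stand-in for an LLM intent classifier: naive keyword matching.
        text = request.lower()
        for intent in self.agents:
            if intent in text:
                return intent
        return "fallback"

    def handle(self, request: str) -> str:
        intent = self.classify(request)
        agent = self.agents.get(intent, lambda r: "Routing to a human operator.")
        reply = agent(request)
        # Every exchange lands in the hub's context, so the context grows with each call.
        self.history.append((intent, request, reply))
        return reply


hub = Orchestrator(agents={
    "billing": lambda r: "Billing agent: invoice re-issued.",
    "shipping": lambda r: "Shipping agent: parcel is in transit.",
})
```

Note that the sub-agents never see each other: every request and every reply passes through `handle`, which is exactly where the routing-accuracy and context-size pressure described above accumulates.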
Peer-to-peer orchestration allows agents to communicate directly, without a central coordinator. Each agent knows which other agents it can delegate to or consult, and the communication graph can be dynamic — agents can discover and invoke other agents based on the task requirements. This pattern enables the iterative reasoning loops — research-validate-refine cycles, multi-agent debate, consensus-building — that define advanced multi-agent systems.
Peer-to-peer is the right pattern when agents need to challenge, refine, and build on each other's outputs. A due diligence system where a research agent, a financial analysis agent, and a legal review agent each contribute findings and cross-validate is inherently peer-to-peer — no single orchestrator can manage the multi-directional flow of information and the iterative refinement that produces reliable results.
The challenge of peer-to-peer orchestration is governance. Without a central coordinator, determining who decided what — and why — requires explicit logging and audit trails at every agent-to-agent communication point. This is the primary reason most enterprise implementations start with hub-and-spoke and evolve toward peer-to-peer as their governance infrastructure matures.
Hierarchical orchestration combines elements of both. A top-level orchestrator delegates to team leads, each of which coordinates a group of specialised agents. The financial analysis team has a team lead that coordinates analyst, researcher, and risk assessment agents. The legal review team has a team lead that coordinates contract review, regulatory compliance, and IP assessment agents. The top-level orchestrator coordinates between teams, not between individual agents.
This pattern scales better than flat hub-and-spoke because each team lead only needs to understand its own agents, and the top-level orchestrator only needs to understand teams, not individual capabilities. It is the pattern that most closely mirrors how large organisations actually coordinate work — departments with managers who coordinate specialists, reporting to executives who coordinate departments.
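The scaling argument can be sketched in a few lines. The example below is a simplified illustration with hypothetical names (`TeamLead`, `TopLevelOrchestrator`, and lambda specialists standing in for real agents); the point is only that the top level addresses teams by name and never touches individual specialists.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical specialist type: task text in, finding out.
Specialist = Callable[[str], str]


@dataclass
class TeamLead:
    name: str
    specialists: dict[str, Specialist]

    def run(self, task: str) -> dict[str, str]:
        # The lead fans the task out to its own specialists and collects their findings.
        return {role: work(task) for role, work in self.specialists.items()}


@dataclass
class TopLevelOrchestrator:
    teams: dict[str, TeamLead]

    def run(self, task: str, team_names: list[str]) -> dict[str, dict[str, str]]:
        # The top level only knows team names, never individual specialists.
        return {name: self.teams[name].run(task) for name in team_names}


finance = TeamLead("finance", {
    "analyst": lambda t: f"valuation notes on {t}",
    "risk": lambda t: f"risk register for {t}",
})
legal = TeamLead("legal", {
    "contracts": lambda t: f"contract review of {t}",
})
top = TopLevelOrchestrator({"finance": finance, "legal": legal})
report = top.run("Acme acquisition", ["finance", "legal"])
```

Adding a specialist to the finance team changes only the `finance` definition; the top-level orchestrator is untouched.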
Memory architecture: how agents learn
Memory architecture is the decision that most directly determines whether a multi-agent system produces linear or compounding value. It is also the decision that most organisations under-invest in because it has no immediate visible impact — a system without shared memory works fine for the first hundred interactions. It fails to compound value across the first thousand.
Isolated memory means each agent maintains its own context and history. When the sales agent learns that a customer has a specific pain point, that knowledge stays with the sales agent. The support agent that handles the same customer's next interaction starts without that context. Isolated memory is simple to implement, has no coordination overhead, and is the default in most frameworks. It is also the architecture that produces Level 1 tool-level value — the productivity-at-the-margin that the major consulting houses consistently flag as insufficient for enterprise-level EBIT impact.
Shared memory means agents read from and write to a common knowledge base. The sales agent's discovery about the customer's pain point is written to shared memory. When the support agent handles the next interaction, it reads that context and responds accordingly. When the product team's agent analyses feature requests, it can query the shared memory for patterns across all customer interactions, not just the ones it participated in.
Shared memory is where knowledge compounds. But it introduces hard design problems: concurrency (what happens when two agents write conflicting information simultaneously), relevance (how does an agent retrieve only the memory that matters for its current task, not the entire history), decay (how does the system handle knowledge that becomes outdated), and privacy (which agents should have access to which memories, particularly when the knowledge includes customer data subject to data protection requirements).
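Three of these four problems can be illustrated with a small in-process store. The sketch below is an assumption-laden toy, not a production design: `SharedMemory`, its version-checked `write` (optimistic concurrency), its time-to-live check (decay) and its per-entry reader list (privacy) are all hypothetical, and relevance-based retrieval is deliberately left out. A real system would put the version check inside a database transaction or behind a lock.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    value: str
    version: int
    written_at: float
    readers: frozenset  # names of agents allowed to read this entry


class SharedMemory:
    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: dict[str, Entry] = {}

    def write(self, key: str, value: str, expected_version: int,
              readers: set, now: Optional[float] = None) -> bool:
        """Concurrency: accept the write only if the caller saw the latest version."""
        now = time.time() if now is None else now
        current = self._entries.get(key)
        current_version = current.version if current else 0
        if expected_version != current_version:
            return False  # another agent wrote first; caller must re-read and retry
        self._entries[key] = Entry(value, current_version + 1, now, frozenset(readers))
        return True

    def read(self, key: str, agent: str, now: Optional[float] = None) -> Optional[Entry]:
        now = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None or agent not in entry.readers:
            return None  # privacy: unknown key and forbidden key look the same
        if now - entry.written_at > self.ttl_seconds:
            return None  # decay: stale knowledge is not served
        return entry
```

Rejecting the second of two conflicting writes, rather than silently letting the last writer win, forces the losing agent to re-read the current value and decide whether its information still applies.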
Persistence strategy determines whether memory survives beyond a single session. Session-scoped memory (the default in most frameworks) means all context is lost when the conversation ends. Persistent memory means knowledge accumulated during one interaction is available in the next, across days, weeks, and months. The financial multi-agent system described in the architecture decision article — where agents maintain a kanban-style board of ongoing findings, proposals, and tracked issues — requires persistent shared memory. Without it, the system cannot compound knowledge over time, and the core value proposition collapses.
The practical implementation pattern that works at enterprise scale is a tiered approach: short-term memory (the current conversation context), medium-term memory (the current session or task), and long-term memory (the persistent knowledge base that survives across all interactions). Each tier has different retrieval strategies, different storage requirements, and different governance rules.
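As a rough sketch of the tiering, assuming plain dictionaries in place of real stores, the hypothetical `TieredMemory` class below looks keys up from the most specific tier outwards and promotes only explicitly selected facts when a conversation or session ends. The promotion policy (a caller-supplied `keep` set) is an assumption made for illustration; in practice it is where governance rules would apply.

```python
from typing import Optional


class TieredMemory:
    TIERS = ("short_term", "medium_term", "long_term")

    def __init__(self) -> None:
        self.short_term: dict[str, str] = {}   # current conversation context
        self.medium_term: dict[str, str] = {}  # current session or task
        self.long_term: dict[str, str] = {}    # stand-in for a persistent knowledge base

    def recall(self, key: str) -> Optional[tuple[str, str]]:
        """Return (tier, value) from the most specific tier that holds the key."""
        for tier in self.TIERS:
            store = getattr(self, tier)
            if key in store:
                return tier, store[key]
        return None

    def end_conversation(self, keep: set[str]) -> None:
        # Promote only the selected facts; everything else is dropped.
        self.medium_term.update(
            {k: v for k, v in self.short_term.items() if k in keep})
        self.short_term.clear()

    def end_session(self, keep: set[str]) -> None:
        self.long_term.update(
            {k: v for k, v in self.medium_term.items() if k in keep})
        self.medium_term.clear()
```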
Governance layer: who decides what
The AI decision architecture that applies to individual AI systems becomes substantially more complex in multi-agent environments, because decisions emerge from interactions between agents rather than from a single system. A well-governed multi-agent system requires three types of rules, implemented as system constraints rather than policy documents.
Delegation rules define what each agent can decide independently, what requires confirmation from another agent, and what requires human approval. A procurement agent can approve purchases below €5,000 without human involvement. Between €5,000 and €50,000, the compliance agent must confirm that the purchase meets policy requirements. Above €50,000, a human decision-maker must approve. These thresholds are not suggestions — they are hard constraints enforced in code.
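"Enforced in code" can be as simple as a function every purchase must pass through. The sketch below uses the thresholds from the procurement example; the names (`Approval`, `required_approval`) are hypothetical, and the handling of the exact boundary values (€5,000 and €50,000 both fall to the compliance agent) is an assumption, since the rule as stated leaves it open.

```python
from decimal import Decimal
from enum import Enum


class Approval(Enum):
    AUTONOMOUS = "procurement agent decides alone"
    COMPLIANCE_AGENT = "compliance agent must confirm"
    HUMAN = "human decision-maker must approve"


AUTONOMOUS_LIMIT_EUR = Decimal("5000")
COMPLIANCE_LIMIT_EUR = Decimal("50000")


def required_approval(amount_eur: Decimal) -> Approval:
    """Map a purchase amount to the approval it needs. Boundaries are an assumption."""
    if amount_eur < 0:
        raise ValueError("purchase amount cannot be negative")
    if amount_eur < AUTONOMOUS_LIMIT_EUR:
        return Approval.AUTONOMOUS
    if amount_eur <= COMPLIANCE_LIMIT_EUR:
        return Approval.COMPLIANCE_AGENT
    return Approval.HUMAN
```

Using `Decimal` rather than floats keeps the comparison exact at the boundaries, which matters when the threshold itself is the control.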
Escalation rules define what happens when agents encounter situations outside their operational boundaries. An autonomous monitoring agent that detects an anomaly it was not designed to handle must escalate — not attempt to resolve it. The escalation path must be explicit: which agent receives the escalation, what information is passed, and what happens if the escalation is not acknowledged within a defined time window. Implicit escalation — where agents attempt to handle situations beyond their capability because no explicit boundary was set — is the most common failure mode in multi-agent systems and the most expensive to recover from.
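An explicit escalation path can be represented as data: an ordered list of recipients, each with an acknowledgement window. The following is a minimal sketch with hypothetical names (`EscalationStep`, `Escalation`, the recipient labels); it assumes that an unacknowledged escalation simply moves to the next recipient and that the last recipient keeps it indefinitely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationStep:
    recipient: str
    ack_timeout_s: float  # how long this recipient has to acknowledge


@dataclass
class Escalation:
    anomaly: str
    context: dict          # the information passed along with the escalation
    path: list             # ordered list of EscalationStep
    raised_at: float
    acked_by: Optional[str] = None

    def current_recipient(self, now: float) -> str:
        """Walk the path: each missed acknowledgement window hands off to the next step."""
        if self.acked_by is not None:
            return self.acked_by
        elapsed = now - self.raised_at
        for step in self.path:
            if elapsed < step.ack_timeout_s:
                return step.recipient
            elapsed -= step.ack_timeout_s
        return self.path[-1].recipient  # assumption: the final step holds the escalation


esc = Escalation(
    anomaly="unrecognised sensor pattern",
    context={"sensor": "line-3", "reading": 981},
    path=[EscalationStep("incident_agent", 60), EscalationStep("on_call_human", 300)],
    raised_at=0.0,
)
```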
Boundary rules define what agents are explicitly prohibited from doing. A customer-facing agent cannot make pricing commitments that override the pricing engine. A supply chain agent cannot commit to delivery dates that the logistics system cannot fulfil. A compliance agent can flag risks but cannot approve its own risk assessments — a different agent or a human must confirm. Boundary rules prevent the failure mode where locally optimal agent decisions produce globally harmful outcomes.
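Boundary rules lend themselves to a deny-list checked before any action executes. The sketch below is illustrative only: the agent names, action names and the `enforce_boundaries` function are hypothetical, and a real system would run this check in the tool-execution layer rather than trust agents to call it themselves.

```python
class BoundaryViolation(Exception):
    """Raised when an agent attempts an explicitly prohibited action."""


# Hypothetical agent and action names, mirroring the examples in the text.
PROHIBITED_ACTIONS = {
    "customer_agent": {"override_pricing_engine"},
    "supply_chain_agent": {"commit_unconfirmed_delivery_date"},
}


def enforce_boundaries(agent: str, action: str, subject_author: str = "") -> None:
    """Raise BoundaryViolation if the action is off-limits for this agent."""
    if action in PROHIBITED_ACTIONS.get(agent, set()):
        raise BoundaryViolation(f"{agent} may not {action}")
    # Separation of duties: nobody approves a risk assessment they wrote.
    if action == "approve_risk_assessment" and subject_author == agent:
        raise BoundaryViolation(f"{agent} may not approve its own assessment")
```

For example, `enforce_boundaries("compliance_agent", "flag_risk")` passes silently, while the same agent approving an assessment it authored raises `BoundaryViolation`.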
The governance frameworks for mid-market companies apply directly to multi-agent systems, but with an additional layer of complexity: governance must cover not only individual agent decisions but the emergent behaviour of the system as a whole. A multi-agent system where each individual agent operates within its boundaries can still produce ungoverned outcomes if the interaction between agents is not explicitly governed. This is the governance gap that surfaces in McKinsey's State of AI research: roughly a third of organisations are using generative AI regularly, but only around a third of those report scaling it across the enterprise, and McKinsey names security, risk and the absence of governance infrastructure (policy frameworks, retrieval systems, audit trails) as the leading barriers to scaling agentic AI specifically.
Under the EU AI Act, this is not optional engineering hygiene. Where an agent system touches a high-risk use case under Annex III, such as creditworthiness, recruitment or critical infrastructure, Article 14 requires meaningful human oversight and Article 12 requires automatic logging of events over the system's lifetime. In a multi-agent architecture, both obligations land squarely on your delegation, escalation and tracing design, not on the framework vendor.
Model routing: matching capability to cost
Model routing is the decision about which AI model handles which task within the multi-agent system. It is an economic decision as much as a technical one: on published API pricing, the gap between a small model and a frontier model runs roughly an order of magnitude per token. Routing each task to the cheapest tier that can handle it is therefore one of the largest cost levers in the whole architecture, and it does not have to sacrifice output quality where quality matters.
The principle is straightforward: use the cheapest model that meets the quality requirement for each specific task. A triage agent that classifies incoming requests into five categories does not need a frontier model — a small, fast model like Claude Haiku or GPT-4o mini handles classification at a fraction of the cost with comparable accuracy. A reasoning agent that analyses complex financial scenarios and generates recommendations needs a frontier model, because reasoning quality directly determines output value. A code-generation agent benefits from a model tuned for code.
In practice, model routing introduces three sources of complexity. Latency varies between models, and switching models mid-task introduces variable response times that affect the user experience and the coordination timing between agents. Prompt formats and system prompts may need adjustment when switching between model families — a prompt optimised for Claude may underperform on GPT and vice versa. And fallback logic is necessary — when a model API is unavailable or rate-limited, the system must gracefully route to an alternative without losing context or producing inconsistent results.
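The routing principle and the fallback requirement fit in one small function. This is a sketch under assumptions, not a vendor API: the tier names, the `ROUTES` table, the `ModelUnavailable` exception and the callable "clients" are all hypothetical stand-ins for real model SDK calls. It also ignores the prompt-format adjustments mentioned above, which a real router would apply per model family.

```python
from typing import Callable


class ModelUnavailable(Exception):
    """Stand-in for an outage or rate-limit error from a model API."""


# Hypothetical routing table: preferred tier first, fallbacks after it.
ROUTES = {
    "triage": ["small", "mid"],
    "routine": ["mid", "frontier"],
    "reasoning": ["frontier", "mid"],
}


def route(task_type: str, prompt: str,
          clients: dict[str, Callable[[str], str]]) -> tuple[str, str]:
    """Try each tier in order; return (tier_used, response)."""
    for tier in ROUTES[task_type]:
        try:
            return tier, clients[tier](prompt)
        except ModelUnavailable:
            continue  # the same prompt is retried on the next tier, so context is kept
    raise ModelUnavailable(f"no model available for task type {task_type!r}")


def rate_limited(prompt: str) -> str:
    raise ModelUnavailable("rate limited")


clients = {
    "small": rate_limited,                       # simulate an outage on the cheap tier
    "mid": lambda p: f"mid-tier answer to: {p}",
    "frontier": lambda p: f"frontier answer to: {p}",
}
tier_used, answer = route("triage", "Classify: 'my invoice is wrong'", clients)
```

Returning the tier that actually served the request matters for the monitoring layer: a fallback that silently upgrades triage traffic to a more expensive model is a cost event worth recording.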
The economic case for model routing is compelling, and it compounds at volume. An enterprise multi-agent system processing thousands of interactions a day where every agent reaches for a frontier model will spend several times more on inference than an identical system where triage uses a small model, routine tasks use a mid-tier model, and only genuine reasoning reaches a frontier model. The inference cost analysis covers the economics in detail. In multi-agent systems, the savings multiply because the number of LLM calls per user interaction is higher — each agent in the chain makes its own calls, and a five-agent pipeline means five times the opportunity for cost optimisation through model routing.
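A back-of-envelope calculation shows how the comparison works. Every number below is a hypothetical placeholder chosen for easy arithmetic (interaction volume, tokens per call, and prices in arbitrary cost units per million tokens, with the frontier tier ten times the small tier); substitute your own volumes and current vendor pricing.

```python
# All figures are illustrative assumptions, not real vendor prices.
INTERACTIONS_PER_DAY = 10_000
TOKENS_PER_CALL = 2_000
PRICE_PER_MILLION_TOKENS = {"small": 1, "mid": 3, "frontier": 10}  # arbitrary cost units


def daily_cost(calls_per_interaction: dict[str, int]) -> float:
    """Daily inference cost for a pipeline making the given number of calls per tier."""
    total = 0.0
    for tier, calls in calls_per_interaction.items():
        tokens = INTERACTIONS_PER_DAY * calls * TOKENS_PER_CALL
        total += tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[tier]
    return total


# Five-agent pipeline, every agent on the frontier model.
all_frontier = daily_cost({"frontier": 5})

# Same pipeline, routed: one triage call, three routine calls, one reasoning call.
routed = daily_cost({"small": 1, "mid": 3, "frontier": 1})

print(all_frontier, routed, all_frontier / routed)  # 1000.0 400.0 2.5
```

Under these assumptions the unrouted pipeline costs 2.5 times as much per day. The ratio depends entirely on the mix of calls and the price gap between tiers, which is why it needs to be recomputed against real traffic rather than assumed.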
Observability: monitoring what you cannot see
Monitoring AI in production is challenging for single-model systems. For multi-agent systems, it is an order of magnitude more complex because failures can occur at any point in the agent chain, and the cause of a bad outcome may be several agents removed from the symptom.
Decision tracing means recording not just the final output but the reasoning chain that produced it. When the procurement agent approves a purchase order that later turns out to violate policy, you need to trace back: which agent flagged the order as compliant, what data did it base that assessment on, which other agents were consulted, and where in the chain did the error occur. Without decision tracing, debugging multi-agent systems is effectively guesswork.
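One way to make that trace-back possible is to record every agent decision as a step that points at the step it depended on. The sketch below is a simplified, in-memory illustration with hypothetical names (`TraceStep`, `DecisionTrace`); it assumes each step has a single parent, whereas a real trace would usually be a graph and would be written to durable storage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceStep:
    step_id: str
    agent: str
    decision: str
    inputs: dict                     # the data the decision was based on
    parent_id: Optional[str] = None  # the upstream step this one relied on


class DecisionTrace:
    def __init__(self) -> None:
        self._steps: dict[str, TraceStep] = {}

    def record(self, step: TraceStep) -> None:
        self._steps[step.step_id] = step

    def lineage(self, step_id: str) -> list:
        """Walk from a step back to the root: the chain that produced the outcome."""
        chain = []
        current = self._steps.get(step_id)
        while current is not None:
            chain.append(current)
            current = self._steps.get(current.parent_id) if current.parent_id else None
        return chain


trace = DecisionTrace()
trace.record(TraceStep("s1", "intake_agent", "parsed purchase order", {"po": "PO-1001"}))
trace.record(TraceStep("s2", "compliance_agent", "flagged as compliant",
                       {"policy_version": "2023-11"}, parent_id="s1"))
trace.record(TraceStep("s3", "procurement_agent", "approved", {"amount_eur": 4200},
                       parent_id="s2"))
```

Calling `trace.lineage("s3")` returns the approval, the compliance check it relied on, and the intake step, in that order, along with the inputs each one used.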
Confidence monitoring means tracking the confidence scores of each agent's outputs and alerting when confidence drops below defined thresholds. An agent that suddenly starts producing low-confidence outputs may indicate model degradation, data quality issues, or changes in the input distribution that the agent was not designed to handle. In a multi-agent system, a low-confidence output from one agent propagates through the chain — the downstream agents make decisions based on uncertain inputs, and the uncertainty compounds.
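A small sketch shows both halves of this: threshold alerting and compounding. It assumes each agent emits a numeric confidence score in the range 0 to 1, which many models do not provide out of the box, and it estimates chain-level confidence as the product of the individual scores, which is only valid if the agents' errors are independent. Both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def low_confidence_agents(scores: dict[str, float], threshold: float) -> list[str]:
    """Agents whose latest confidence score fell below the alert threshold."""
    return [agent for agent, score in scores.items() if score < threshold]


def chain_confidence(scores: dict[str, float]) -> float:
    """Crude chain-level estimate: product of scores, assuming independent errors."""
    return math.prod(scores.values())


scores = {"triage": 0.95, "compliance": 0.90, "pricing": 0.60}
alerts = low_confidence_agents(scores, threshold=0.70)  # only "pricing" is below 0.70
overall = chain_confidence(scores)                      # about 0.51
```

Even though two of the three agents are individually well above the threshold, the chain as a whole is barely better than a coin flip under this estimate, which is the compounding effect described above.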
Performance monitoring means tracking latency, throughput, and error rates per agent and per agent chain. A multi-agent pipeline that processes a customer request through five agents may take anywhere from two seconds to thirty seconds depending on which agents are invoked, which models they use, and whether any external API calls are involved. Understanding where time is spent — and where it spikes — is essential for maintaining acceptable response times as the system scales.
Cost monitoring means tracking inference costs per agent, per chain, and per use case. Without cost monitoring, inference expenses grow invisibly until they appear as a line item in the quarterly cloud bill. Model routing optimises costs at design time. Cost monitoring ensures those optimisations hold as usage patterns evolve.
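Tracking along three dimensions at once is mostly bookkeeping. The sketch below uses a hypothetical `CostLedger` class that accumulates the cost of each model call per agent, per chain and per use case, and reports which names have exceeded a budget. Where the per-call cost comes from (token counts multiplied by the current price) is assumed to be handled upstream.

```python
from collections import defaultdict


class CostLedger:
    DIMENSIONS = ("agent", "chain", "use_case")

    def __init__(self) -> None:
        self.totals = {dim: defaultdict(float) for dim in self.DIMENSIONS}

    def record(self, *, agent: str, chain: str, use_case: str, cost: float) -> None:
        """Attribute one model call's cost to all three dimensions."""
        self.totals["agent"][agent] += cost
        self.totals["chain"][chain] += cost
        self.totals["use_case"][use_case] += cost

    def over_budget(self, dimension: str, budgets: dict[str, float]) -> list[str]:
        """Names in the given dimension whose spend exceeds their budget."""
        spent = self.totals[dimension]
        return sorted(name for name, total in spent.items()
                      if total > budgets.get(name, float("inf")))


ledger = CostLedger()
ledger.record(agent="triage", chain="support", use_case="refunds", cost=0.25)
ledger.record(agent="reasoning", chain="support", use_case="refunds", cost=1.5)
ledger.record(agent="triage", chain="sales", use_case="quotes", cost=0.25)
```

With these three calls recorded, `ledger.over_budget("agent", {"triage": 1.0, "reasoning": 1.0})` returns `["reasoning"]`, which is the kind of signal that should surface daily rather than in the quarterly bill.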
The framework landscape — positioned correctly
With the five architectural decisions defined, the framework choice becomes a practical question: which framework provides the best abstractions for your specific orchestration pattern, memory architecture, governance requirements, model routing strategy, and observability needs? The four leading frameworks each have genuine strengths and genuine limitations.
AutoGen (Microsoft Research) was rebuilt from the ground up in the v0.4 redesign around an asynchronous, event-driven actor model, replacing the conversation-centric design of the v0.2 line. It now ships as three layers: Core, the low-level event-driven runtime where agents exchange messages and react to events; AgentChat, the higher-level task-driven API for group chat, code execution and pre-built agents, which is the easiest layer to migrate to from v0.2 and the right place to prototype; and Extensions, for third-party model clients and integrations. AutoGen Studio sits on top as a low-code interface for visual building and stakeholder demos. The framework excels at tool use, code execution and structured coordination, and it is the natural choice for Microsoft-adjacent enterprises that want a pro-code framework inside their existing ecosystem. Its limitation is that dynamic, message-driven coordination can make failures harder to reproduce: reconstructing the exact sequence of events that produced a bad outcome takes discipline, which is why observability matters more here than the framework choice suggests.
LangGraph (LangChain) models agents as nodes in a directed graph with conditional edges and shared state. It excels at stateful workflows, cyclical processes and long-running operations. Its built-in persistence layer saves a checkpoint of graph state at every step, organised into threads — and those checkpoints are what make human-in-the-loop interrupts, time-travel debugging and fault-tolerant retries work without losing progress. Persistent checkpointers (Postgres or SQLite) carry state across requests, hours and approval gates. LangGraph is the strongest choice for complex workflows with branching logic and hard state-persistence requirements. Its limitation is complexity: the graph model is powerful but has a steeper learning curve than higher-level frameworks, and simple use cases feel over-engineered.
CrewAI uses role-based agent teams, where each agent is defined by a role, a goal and a backstory, and tasks carry a description and an expected output. The framework distinguishes Crews — agents collaborating autonomously within one execution context — from Flows, a higher-level orchestration layer that chains crews together with conditional logic, state management and event-driven triggers. CrewAI is the fastest path from concept to working system for business-process automation: define the roles, define the tasks, and the framework orchestrates. Its limitation is architectural depth — the role-based abstraction is intuitive but offers less fine-grained control over orchestration, memory and model routing than LangGraph or AutoGen Core.
Claude Agent SDK (Anthropic) packages the same agent loop, tool-use system and context management that power Claude Code, available in Python and TypeScript. Two capabilities matter most for multi-agent work. First, subagents run in their own isolated context windows and return only their final message to the orchestrator, which is, in effect, a built-in answer to the hub-and-spoke context-bloat problem described in the orchestration section above. Second, when accumulated results approach the context limit, the SDK automatically compacts conversation history so long-running tasks continue. Its strength is reasoning-heavy, autonomous workflows where the quality of each agent's output is critical. Its limitation is ecosystem: it is built around Claude models and does not offer the native multi-model routing that AutoGen and LangGraph provide.
The honest assessment: no framework is best for everything. AutoGen fits Microsoft-centric enterprises building event-driven agent systems. LangGraph fits engineering teams building complex, stateful workflows that need production-grade persistence and human-in-the-loop. CrewAI fits teams that need to prototype and ship role-based systems fast. Claude Agent SDK fits teams prioritising reasoning quality and clean context handling over multi-model flexibility. Most enterprise architectures will eventually run more than one — different frameworks for different subsystems, connected through well-defined APIs.
What determines success
The organisations that move agentic AI from pilot to measurable enterprise value — the minority McKinsey finds actually scaling rather than piloting — do not get there by choosing the right framework. They get there by getting the architectural decisions right: orchestration patterns that match the actual workflow, memory systems that compound knowledge, governance layers that enable autonomy within hard boundaries, model routing that balances cost against quality, and observability that makes the system debuggable and auditable. Every one of those decisions outlives the framework you start with — and several of them, under the EU AI Act, are legal obligations rather than nice-to-haves.
The framework is the tool. The architecture is the leverage. The workflow redesign is the value.
A Fit Call assesses your multi-agent architecture requirements — orchestration patterns, memory needs, governance boundaries, and model routing strategy — and maps them to the framework and deployment approach that matches your organisation's workflows, data landscape, and strategic ambition.
References: Microsoft Research, "AutoGen v0.4: Reimagining the foundation of agentic AI" (microsoft.com/en-us/research/blog/autogen-v0-4-reimagining-the-foundation-of-agentic-ai-for-scale-extensibility-and-robustness); LangGraph persistence documentation, LangChain (docs.langchain.com/oss/python/langgraph/persistence); CrewAI documentation, agents and flows (docs.crewai.com/en/concepts/agents); Anthropic, "Building agents with the Claude Agent SDK" (anthropic.com/engineering/building-agents-with-the-claude-agent-sdk); McKinsey & Company, "The state of AI in 2025: Agents, innovation, and transformation" (mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai); EU AI Act Articles 12 and 14 (artificialintelligenceact.eu/article/12, /article/14).
