
Implementing OpenAI Swarm in BotDojo: A Cross-Model Comparison in Multi-Agent Tool Calling

Video: OpenAI Swarm for Multi-Agent AI: A Cross-Model Comparison on BotDojo (8:51)

This week, OpenAI released Swarm, its framework for building multi-agent workflows. It has sparked significant interest in the AI community, and at BotDojo we've taken a deep dive into Swarm to demonstrate how it works and to compare how different language models behave in multi-agent workflows.

In this post, we'll walk you through our journey of implementing OpenAI's Airline Customer Service example, and share insights we gained from our cross-model comparison.

Understanding OpenAI's Swarm Strategy

OpenAI's Swarm framework introduces a novel approach to building multi-agent systems. The key components of their strategy include:

  1. Agents as Specialized Entities: Each agent is designed with a specific role and set of capabilities.
  2. Dynamic Handoffs: Agents can transfer control to other agents as needed, allowing for complex, multi-step interactions.
  3. Shared Conversation History: All agents have access to the full conversation history, ensuring context is maintained throughout interactions.

These components are evident in the Swarm README:

> Swarm focuses on making agent **coordination** and **execution** lightweight, highly controllable, and easily testable. It accomplishes this through two primitive abstractions: `Agent`s and **handoffs**. An `Agent` encompasses `instructions` and `tools`, and can at any point choose to hand off a conversation to another `Agent`. These primitives are powerful enough to express rich dynamics between tools and networks of agents, allowing you to build scalable, real-world solutions while avoiding a steep learning curve.
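For readers who haven't opened the repo yet, a minimal sketch of those two primitives in Swarm's Python API looks roughly like this (the agent names, instructions, and transfer function below are our own illustrative stand-ins, following the pattern shown in the README):

```python
from swarm import Agent

# A handoff is just a tool: a plain Python function that returns another Agent.
def transfer_to_refunds():
    """Send the conversation to the refunds specialist."""
    return refunds_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Decide which specialist should handle the user's request.",
    functions=[transfer_to_refunds],
)

refunds_agent = Agent(
    name="Refunds Agent",
    instructions="Help the user process a refund for their ticket.",
)
```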

 

Efficient Information Transfer: Swarm's handoff mechanism leverages shared conversation history to enable seamless transitions between agents. Under the hood, each agent has its own system prompt and set of tools, tailored to its specific role.

When a handoff occurs, the receiving agent doesn't need explicit instructions about what to do next. Instead, it can analyze the shared conversation history through the lens of its own specialized prompt and toolset, quickly inferring the context and required actions. 

This approach eliminates the need for verbose information passing between agents, reducing potential delays and information loss that can occur in traditional tool-calling setups. The result is a more fluid, context-aware interaction flow that can adapt to complex, multi-step scenarios.
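In code, that property shows up when you run a conversation: the run returns whichever agent ended up in control plus the messages appended to the one shared history, with no summary handed between agents. Here is a rough sketch, reusing the triage and refunds agents from the earlier snippet (the `agent` and `messages` fields are the ones the Swarm README documents on its `Response` object):

```python
from swarm import Swarm

client = Swarm()  # wraps an OpenAI client under the hood

response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "My flight was cancelled and I want a refund."}],
)

# Control ends with whichever agent received the last handoff; the messages
# generated during the run (the transfer tool call, its result, and the
# receiving agent's reply) all sit in one shared history.
print(response.agent.name)
for msg in response.messages:
    print(msg["role"], "-", (msg.get("content") or "")[:80])
```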

The Airline Customer Service Example

We reproduced OpenAI's airline customer service example in BotDojo, maintaining the original prompts, logic, and evaluations. Our implementation includes the following agents (a simplified wiring sketch follows the list):

  1. Triage Agent: Analyzes initial requests and routes to appropriate specialized agents
  2. Flight Modification Agent: Further triages between flight cancellations and changes
  3. Flight Cancel Agent: Handles the specifics of flight cancellations
  4. Flight Change Agent: Manages flight change requests
  5. Lost Baggage Agent: Deals with lost luggage inquiries
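
The wiring sketch mentioned above looks roughly like this. Instructions are abbreviated paraphrases rather than the original prompts, and the domain-specific tools each leaf agent carries are omitted; treat it as an illustration of the handoff graph, not our exact BotDojo flow:

```python
from swarm import Agent

# Handoff functions: each one simply returns the agent that should take over.
def transfer_to_flight_modification():
    return flight_modification_agent

def transfer_to_flight_cancel():
    return flight_cancel_agent

def transfer_to_flight_change():
    return flight_change_agent

def transfer_to_lost_baggage():
    return lost_baggage_agent

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the customer to flight modification or lost baggage.",
    functions=[transfer_to_flight_modification, transfer_to_lost_baggage],
)

flight_modification_agent = Agent(
    name="Flight Modification Agent",
    instructions="Decide whether the customer wants to cancel or change a flight.",
    functions=[transfer_to_flight_cancel, transfer_to_flight_change],
)

flight_cancel_agent = Agent(
    name="Flight Cancel Agent",
    instructions="Handle the specifics of flight cancellations.",
    # plus the cancellation-specific tools (refunds, escalation, case resolution)
)

flight_change_agent = Agent(
    name="Flight Change Agent",
    instructions="Handle flight change requests.",
    # plus the change-specific tools
)

lost_baggage_agent = Agent(
    name="Lost Baggage Agent",
    instructions="Handle lost baggage inquiries.",
    # plus the baggage-specific tools
)
```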


 

  - Implementation: here
  - Eval: here
  - Evaluation Questions: here
  - Evaluation Results: here

While this complex agent network may seem excessive for this example, it demonstrates scalable system design for real-world applications. In production environments, breaking tasks into specific agents with tailored prompts enables better control, easier debugging, and more predictable outcomes, particularly for complex processes or when integrating multiple external systems.

Evaluation Framework and Cross-Model Comparison

For this experiment, we ported the evaluations from the OpenAI Swarm repository into our BotDojo platform. This framework assesses how well each model follows instructions and makes appropriate tool calls in various scenarios. We expanded the evaluation to include the number of tool calls made, as we discovered that different models exhibited varying behaviors in this area.

It's important to note that while this eval provides valuable insights, it was originally optimized for GPT-4o. Our goal in using the same eval and prompts across different models was to observe how they perform in a standardized setting, rather than to definitively measure the abilities of other models. The results should be interpreted with this context in mind.
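
As a rough illustration of the kind of check involved (a hypothetical, simplified sketch, not the actual code from the Swarm repo or our BotDojo eval flow), grading a single run comes down to comparing the tool calls the model made against the expected handoff and an allowed number of steps:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    expected_function: str  # e.g. "transfer_to_flight_cancel"
    max_tool_calls: int     # how many steps the model is allowed to take

def grade_run(case: EvalCase, tool_calls: list[str]) -> dict:
    """Grade one conversation: was the right handoff/tool called, and did
    the model stay within the expected number of tool calls?"""
    correct_tool = case.expected_function in tool_calls
    within_budget = len(tool_calls) <= case.max_tool_calls
    return {
        "question": case.question,
        "tool_calls_made": len(tool_calls),
        "correct_tool": correct_tool,
        "passed": correct_tool and within_budget,
    }

# Hypothetical usage: in practice, tool_calls is extracted from the run trace.
case = EvalCase(
    question="I want to cancel my flight.",
    expected_function="transfer_to_flight_cancel",
    max_tool_calls=2,
)
print(grade_run(case, ["transfer_to_flight_modification", "transfer_to_flight_cancel"]))
```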


 

We tested our implementation using four different models:

  1. GPT-4o
  2. GPT-4o-mini
  3. Claude 3.5
  4. Llama 3.1 405B (Bedrock)

Here's what we found:

Overall Performance Summary on OpenAI Evals

The evaluation was based on the 8 questions provided in the original OpenAI Swarm repository. It checks whether the correct agent was called and whether the model did so in the expected number of steps.

| Model | Cost | Average Response Time | Pass Rate | Strengths | Areas for Improvement |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | $0.027580 (2nd highest) | 4.78s (fastest) | 78.57% (highest) | Consistent performance across scenarios, appropriate tool calls, concise responses | Occasionally made extra tool calls in simple scenarios |
| GPT-4o-mini | $0.000774 (lowest by far) | 4.96s (2nd fastest) | 50.00% (3rd highest) | Highly cost-effective, fast responses, excellent at seeking clarification in ambiguous situations | Sometimes missed necessary tool calls in complex scenarios |
| Claude 3.5 | $0.043680 (highest) | 9.92s (2nd slowest) | 69.23% (2nd highest) | Detailed, customer-friendly responses; strong performance in complex scenarios; often provided the most contextually appropriate responses | Occasionally made unnecessary tool calls, or fewer calls than expected. Its clear writing style is notable, but it tends to spend tokens "thinking" between tool calls, which adds cost: while priced similarly to GPT-4o per token, it can become expensive for repetitive tasks that don't require extensive self-reflection. |
| Llama 3.1 405B (Bedrock) | $0.018150 (3rd highest) | 12.45s (slowest) | 9.52% (lowest) | Engaged in conversations and provided coherent responses | Frequently made more tool calls than necessary, including inappropriate escalations; struggled with efficient request handling and appropriate tool usage in this specific framework |

 

Key Insights from Multi-Agent Tool Calling Evaluation

Our evaluation revealed distinct performance patterns and behaviors across models in multi-agent tool calling scenarios:

  1. Claude 3.5 demonstrated a more conversational approach, often being chattier and more aggressive in trying to resolve customer inquiries directly. This produced more detailed responses but increased cost and sometimes resulted in unnecessary tool calls.
  2. OpenAI models (GPT-4o and GPT-4o-mini) tended to ask for clarification more frequently before making tool calls, leading to more precise but potentially slower resolutions.
  3. Llama, while showing potential, often went off the rails, struggling to maintain context and make appropriate tool calls in complex scenarios.
  4. GPT-4o-mini's exceptional cost-effectiveness makes it attractive for high-volume tasks, despite occasional struggles with complexity.

Our cross-model evaluation highlights the critical importance of understanding each model's tool-calling behaviors when building multi-agent systems. While GPT-4o offers precision, Claude 3.5 provides thoroughness, and GPT-4o-mini delivers cost-effectiveness, the real power lies in Swarm's handoff architecture.

Breaking complex workflows into maintainable components through specialized agents and shared conversation history, combined with robust evaluation frameworks, leads to more repeatable, reliable results—essential for optimizing and scaling AI systems in production environments.

Join BotDojo for Free and Build Smarter AI Workflows Today!

We're excited to hear your thoughts on our implementation and findings. We've made our BotDojo flow available for you to test and compare different models and scenarios:

Try the BotDojo Swarm Implementation.

Sample Data and Eval Results

We encourage you to experiment with various models, tweak the prompts, and even add your own scenarios. Your insights could help push the boundaries of multi-agent systems further.

Have you discovered any interesting behaviors or optimizations? Do you have ideas for expanding this experiment? We'd love to hear from you!

Share your experiences and suggestions in the comments below or reach out to us directly at feedback@botdojo.com.

Let's collaborate to advance the field of multi-agent AI systems together!


