25 changes: 19 additions & 6 deletions docs/research/agent-ecosystem-report.md
@@ -2,11 +2,23 @@

## Executive Summary

The AI agent ecosystem has matured significantly, moving past experimental phases into production-grade orchestration systems. Prominent frameworks continue to dominate the multi-agent landscape in 2026: **LangGraph**, **CrewAI**, **AutoGen**, and the emerging **OpenAI Agents SDK**. Each addresses a different operational paradigm, ranging from strict graph-based state machines to dynamic conversational workflows. A notable industry trend is the move toward "Agent as a Tool" and handoff patterns, which offer more modular, transparent, and auditable multi-agent collaboration.

## Framework Analysis & Capabilities

### 1. OpenAI Agents SDK (Swarm Evolution)

The OpenAI Agents SDK represents a streamlined, native approach to multi-agent orchestration, heavily relying on the concepts of routines and handoffs without needing complex external framework dependencies.

- **Core Paradigm:** Agent-as-a-Tool and Handoffs.
- **Key Capabilities:**
- **Native Integration:** Direct integration with OpenAI's APIs, leveraging the newest model capabilities seamlessly.
- **Tool Support:** Comprehensive support for custom Python functions, managed tools (e.g., Code Interpreter, WebSearch), and external MCP servers.
- **Handoff Mechanism:** Agents can seamlessly transfer control to specialized peer agents based on task requirements, treating other agents essentially as executable tools.
- **Strict LLM Orchestration:** Avoids heavy state-machine abstractions in favor of letting the LLM's tool-calling logic drive the orchestration flow directly.
- **Best Use Cases:** Systems requiring transparent, auditable collaboration with minimal orchestration boilerplate, leveraging specialized sub-agents.
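The handoff flow described above can be sketched without any SDK dependency. The sketch below is illustrative only: the agent names, routing keys, and handler functions are hypothetical stand-ins for LLM-backed agents, and in a real SDK the model's own tool-calling logic, not a dictionary lookup, would select the specialist.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Illustrative sketch of the "Agent as a Tool" pattern: a central
# orchestrator exposes specialist peers as callable tools and hands
# each query off to the matching one. All names are hypothetical.

@dataclass
class Agent:
    name: str
    handle: Callable[[str], str]  # the agent's task-handling function

def make_orchestrator(specialists: Dict[str, Agent]) -> Callable[[str, str], str]:
    """Return an orchestrator that treats each specialist as a tool call."""
    def run(domain: str, query: str) -> str:
        agent = specialists[domain]                      # "tool selection" step
        return f"[{agent.name}] {agent.handle(query)}"   # handoff + result
    return run

legal = Agent("legal", lambda q: f"legal analysis of: {q}")
finance = Agent("finance", lambda q: f"financial analysis of: {q}")

orchestrate = make_orchestrator({"legal": legal, "finance": finance})
print(orchestrate("legal", "review the NDA"))
# → [legal] legal analysis of: review the NDA
```

The point of the pattern is visible even in this toy form: the orchestrator's interface to a peer agent is just a function call, which is what makes the collaboration auditable.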

### 2. LangGraph (LangChain)

LangGraph has solidified its position as the premier framework for complex, stateful, and deterministic orchestration. With the stable release of LangChain 1.0 and LangGraph 1.0, it excels in environments with strict auditability and high-reliability requirements.

@@ -19,7 +31,7 @@
- **Stability and Modernization:** Python 3.10+ requirement and simplified package structure for production-grade deployments.
- **Best Use Cases:** Complex, conditional pipelines; production systems requiring compliance and strict audit trails.
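The kind of stateful, auditable graph execution described here can be sketched in plain Python. This is an illustrative sketch only, with hypothetical node names and a hand-rolled checkpoint list; it is not LangGraph's actual `StateGraph` API.

```python
import copy

# Sketch of graph-style orchestration with checkpointing: each node
# transforms a shared state dict, and the state is snapshotted after
# every node so execution leaves an audit trail and could resume after
# a failure. Node names and the routing rule are hypothetical.

def fetch(state):
    state["data"] = "raw"
    return state

def validate(state):
    state["valid"] = state.get("data") == "raw"
    return state

def route(state):
    # Conditional edge: pick the next node from the current state.
    return "process" if state["valid"] else "fallback"

def process(state):
    state["result"] = "processed"
    return state

def fallback(state):
    state["result"] = "fallback"
    return state

NODES = {"fetch": fetch, "validate": validate, "process": process, "fallback": fallback}

def run_graph(start, state):
    checkpoints = []                   # audit trail of state snapshots
    node = start
    while node is not None:
        state = NODES[node](state)
        checkpoints.append((node, copy.deepcopy(state)))
        if node == "fetch":
            node = "validate"
        elif node == "validate":
            node = route(state)
        else:
            node = None                # terminal node
    return state, checkpoints

final, trail = run_graph("fetch", {})
print(final["result"], len(trail))  # → processed 3
```

The checkpoint list is what distinguishes this style from purely conversational orchestration: every state transition is recorded and replayable, which is the property compliance-sensitive deployments care about.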

### 3. CrewAI

CrewAI focuses on simplifying the creation of multi-agent systems by leveraging intuitive human-like team metaphors. It offers the fastest path from prototype to functional multi-agent collaboration.

@@ -31,7 +43,7 @@
- **MCP Integration:** Native support for the Model Context Protocol (MCP), enabling deeper integration with external tools and resources.
- **Best Use Cases:** Business workflows, research syndication, and task delegation where roles map neatly to human organizational structures.
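Role-based delegation of the sort CrewAI popularized can be sketched as a role-to-agent mapping. The roles, tasks, and naive string-join synthesis below are hypothetical; in a real crew an LLM manager performs the matching and synthesis.

```python
# Illustrative sketch of role-based delegation: a "manager" assigns
# each sub-task to the agent whose declared role matches, then
# synthesizes the results. All roles and tasks are hypothetical.

AGENTS = {
    "researcher": lambda task: f"findings for '{task}'",
    "writer": lambda task: f"draft covering '{task}'",
}

def delegate(plan):
    """plan: list of (role, task) pairs produced by the manager."""
    results = [AGENTS[role](task) for role, task in plan]
    return " | ".join(results)  # naive synthesis step

report = delegate([("researcher", "market size"), ("writer", "summary")])
print(report)
# → findings for 'market size' | draft covering 'summary'
```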

### 4. AutoGen (Microsoft Agent Framework)

Backed by enterprise resources and now in version 0.4.0+, AutoGen excels in dynamic, conversational interactions and complex problem-solving where iterative refinement is required.

@@ -45,9 +57,10 @@

## Industry Trends & Next Steps

- **Hybrid Architectures:** We are seeing an increase in production deployments combining frameworks (e.g., LangGraph for overall state orchestration, wrapping a CrewAI team or an OpenAI Agents SDK routine for a specific research sub-task).
- **Agent as a Tool:** A massive shift towards the "Agent as a Tool" handoff pattern (popularized by OpenAI Agents SDK) where central orchestrators treat specialized sub-agents simply as functional tool calls.
- **Production Safety:** Error handling and robust fallback mechanisms ("safe nodes") are becoming standard requirements over sheer capability.
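A "safe node" of the kind mentioned above can be approximated by a wrapper that converts node failures into recorded fallbacks instead of crashing the pipeline. This is a minimal sketch; the failing node and the fallback value are hypothetical.

```python
# Minimal sketch of a "safe node": wrap any node function so that a
# failure is caught, logged into the state, and replaced by a fallback
# result. Names and error-handling policy are illustrative.

def safe_node(fn, fallback):
    def wrapped(state):
        try:
            return fn(state)
        except Exception as exc:
            state["errors"] = state.get("errors", []) + [str(exc)]
            state["result"] = fallback
            return state
    return wrapped

def flaky(state):
    raise RuntimeError("upstream tool timed out")

node = safe_node(flaky, fallback="cached answer")
state = node({})
print(state["result"], state["errors"])
# → cached answer ['upstream tool timed out']
```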

**Recommendation:** Summit's internal orchestration and benchmarking must expand to cover these advanced topologies, specifically evaluating the overhead of coordination and the resilience of durable execution under load.

_Update:_ We have explicitly expanded our benchmarks to track State Recovery Success Rate (SRSR), Coordination Token Overhead (CTO), and Orchestration Latency Penalty (OLP). We have also created adapter layers for LangGraph, CrewAI, AutoGen, and the OpenAI Agents SDK to support these metrics.
9 changes: 7 additions & 2 deletions docs/research/agent-eval-insights.md
@@ -2,7 +2,7 @@

## Overview

Based on the latest developments in the agent ecosystem (LangGraph, CrewAI, AutoGen, and OpenAI Agents SDK), Summit Bench must expand its evaluation dimensions to accurately measure production-grade multi-agent capabilities. The current benchmarks largely focus on single-agent reasoning and deterministic tool use. We must shift toward evaluating coordination, state resilience, and execution overhead in complex multi-agent topologies.

## Proposed Benchmark Expansions

@@ -12,6 +12,7 @@
- **High-Concurrency Orchestration:** Measuring the latency and throughput of the orchestration layer itself when managing hundreds or thousands of simultaneous agent interactions.
- **Role-Based Delegation Efficiency:** Evaluating how accurately an orchestrator (like a CrewAI Manager) can divide a complex task, assign the correct sub-tasks to specialized agents based on their defined roles, and synthesize the results without hallucination.
- **Dynamic Code Execution & Sandboxing:** Testing an agent's ability to iteratively write, safely execute, debug, and refine code in an isolated environment to solve a problem that cannot be addressed purely via static tool calls.
- **Agent-as-a-Tool Handoff Efficiency:** Evaluating the smoothness and token cost of an orchestrator dynamically passing complete contextual state to a specialized sub-agent and returning the synthesized result, a pattern native to the OpenAI Agents SDK.
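The CTO figure used across these dimensions can be defined as the relative extra tokens a multi-agent topology spends over a monolithic baseline on the same task. The sketch below assumes that definition; the token counts are hypothetical inputs that would come from real run traces.

```python
# Sketch of the Coordination Token Overhead (CTO) metric: the fraction
# of extra tokens the multi-agent run spends versus a single-agent
# baseline on the same task. Example values are hypothetical.

def coordination_token_overhead(multi_agent_tokens: int, baseline_tokens: int) -> float:
    """CTO = (multi - baseline) / baseline; 0.0 means no overhead."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (multi_agent_tokens - baseline_tokens) / baseline_tokens

print(coordination_token_overhead(13_000, 10_000))  # → 0.3
```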

### 2. Proposed Cases & Fixtures (Backlog)

@@ -28,6 +29,10 @@
- **Case: `iterative_script_debugging`**
- **Description:** Provide a task requiring parsing a deliberately malformed proprietary binary format.
- **Goal:** The agent must write a script, observe the execution failure (stack trace), and iterate on the code until the script successfully parses the file and extracts the expected string.
- **Case: `agent_as_tool_handoff`**
- **Description:** An overarching "Analyst" agent must delegate three distinct domain queries (Legal, Financial, Technical) to three different specialized agents via tool calls (handoffs).
- **Target Framework:** OpenAI Agents SDK and frameworks supporting "Agent as a Tool".
- **Goal:** Evaluate the context retention during the handoff boundaries and the CTO (Coordination Token Overhead) of the handoff mechanism compared to monolithic resolution.
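A golden fixture for the `agent_as_tool_handoff` case might take roughly the following shape. The schema (keys, expected fields, budget value) is hypothetical and would need to match the benchmark's actual fixture format.

```python
# Hypothetical fixture sketch for the `agent_as_tool_handoff` case;
# every field name here is illustrative, not the project's real schema.

FIXTURE = {
    "case_id": "agent_as_tool_handoff",
    "orchestrator": "analyst",
    "specialists": ["legal", "financial", "technical"],
    "queries": {
        "legal": "Summarize the licensing constraints.",
        "financial": "Estimate the cost impact.",
        "technical": "Assess the migration risk.",
    },
    "expected": {
        "handoffs": 3,                               # one per specialist
        "context_keys_retained": ["case_id", "orchestrator"],
        "cto_budget": 0.5,                           # max allowed token overhead
    },
}

# Sanity check: every specialist has exactly one query to hand off.
assert set(FIXTURE["queries"]) == set(FIXTURE["specialists"])
```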

### 3. Proposed Evaluation Metrics

@@ -37,6 +42,6 @@

## Next Steps for the Summit Team

1. [x] **Framework Integration:** Implement adapter layers for the latest stable versions of LangGraph, CrewAI, AutoGen, and OpenAI Agents SDK within the `evaluation/adapters/` directory.
2. [x] **Dataset Generation:** Construct the golden fixtures for `concurrent_stress_test` and `mid_task_failure_recovery` in `GOLDEN/datasets/agent_orchestration/`.
3. [x] **Metric Implementation:** Add the `SRSR` and `CTO` scoring logic to `evaluation/scoring/agent_metrics.py`.
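The SRSR and OLP scoring logic referenced in step 3 could take roughly the following shape. This is a hedged sketch only: the trace structure and function signatures are hypothetical, not the actual contents of `evaluation/scoring/agent_metrics.py`.

```python
# Hypothetical sketch of two of the scoring functions; run-trace fields
# ("interrupted", "completed") and signatures are illustrative.

def state_recovery_success_rate(runs):
    """SRSR: fraction of interrupted runs that still completed."""
    interrupted = [r for r in runs if r.get("interrupted")]
    if not interrupted:
        return 1.0  # vacuously successful when nothing was interrupted
    recovered = [r for r in interrupted if r.get("completed")]
    return len(recovered) / len(interrupted)

def orchestration_latency_penalty(orchestrated_ms, direct_ms):
    """OLP: relative latency added by the orchestration layer."""
    return (orchestrated_ms - direct_ms) / direct_ms

runs = [
    {"interrupted": True, "completed": True},
    {"interrupted": True, "completed": False},
    {"interrupted": False, "completed": True},
]
print(state_recovery_success_rate(runs))  # → 0.5
```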