Research Report · March 2026

State of AI Agent Waste 2026

The largest independent study of compute waste in autonomous AI agent execution. 99,167 real sessions. 2,615,295 real steps. Fully reproducible.

Dataset: HuggingFace public trajectories
Models: GPT-4o, Claude, Llama, Gemini
Published: March 24, 2026
By: CAUM Systems Research
99,167 real sessions analyzed
2.6M agent steps measured
3.4× more waste in failed sessions than in successful ones
98.7% of waste caused by behavioral loops
$95K annual waste per 10,000 daily sessions

Executive Summary

Autonomous AI agents — systems that plan, execute tools, and iterate toward a goal — have rapidly moved from research into production. Engineering teams now run thousands of agent sessions per day. Yet the vast majority of runtime failures are silent: the agent doesn't crash, it just spins, repeating the same actions in a loop until it exhausts its budget.

Key finding: Failed agent sessions waste 3.4× more compute than successful ones. 98.7% of that waste comes from a single failure mode: behavioral loops — the agent executing the same or semantically identical actions cyclically without progress.

This report presents the first large-scale, cross-model, cross-framework measurement of AI agent compute waste. All data comes from publicly available trajectory datasets on HuggingFace. All analysis is reproducible.

Dataset & Methodology

We analyzed four trajectory datasets, totaling 99,167 sessions:

nebius/SWE-agent-trajectories: 79,773 sessions. Multi-model. Real GitHub issues as tasks. The primary dataset.
Claude trajectories: 8,588 sessions. Claude 3.5/3.7 on SWE-bench tasks via the SWE-agent framework.
GPT-4o trajectories: 5,816 sessions. GPT-4o on SWE-bench. Used for cross-model validation.
Llama trajectories: 4,990 sessions. Llama 3.x on balanced benchmark tasks.

Detection Method

Waste is measured using the CAUM motor v10.31, a behavioral analysis engine that computes semantic similarity between consecutive agent steps using SBERT embeddings, then classifies each step into one of four behavioral regimes.

Steps classified as LOOP or STAGNATION are counted as wasted compute. The engine reads only tool names and structural metadata — zero semantic content, zero prompt data, zero business logic.
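The classification logic can be illustrated with a minimal sketch. The real engine uses SBERT embeddings; here a toy bag-of-tokens embedding over tool names and metadata shape stands in, and the thresholds, field names, and the collapsed non-waste label are illustrative assumptions, not CAUM's actual values.

```python
from collections import Counter
from math import sqrt

def embed(step: dict) -> Counter:
    # Toy stand-in for SBERT: bag-of-tokens over the tool name and the
    # *shape* of its metadata (key names and value types, never content).
    tokens = [step["tool"]] + [f"{k}={type(v).__name__}" for k, v in sorted(step["meta"].items())]
    return Counter(tokens)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

LOOP_T, STAG_T = 0.95, 0.70  # illustrative thresholds, not CAUM's

def classify(prev: dict, curr: dict) -> str:
    # The real engine distinguishes four regimes; this sketch keeps the
    # two waste labels and collapses everything else into "OK".
    sim = cosine(embed(prev), embed(curr))
    if sim >= LOOP_T:
        return "LOOP"        # (near-)identical action repeated
    if sim >= STAG_T:
        return "STAGNATION"  # minor variation, no new information
    return "OK"
```

A repeated `bash` call with identical metadata shape scores as LOOP; a switch to a different tool with different metadata scores as OK.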

Core Findings

Finding 1 — Failed sessions waste 3.4× more compute

Session Type            | Sessions | Avg Waste %
Resolved (successful)   | 19,591   | 4.04%
Unresolved (failed)     | 79,576   | 13.83%
Difference              |          | 3.4× higher in failures (Cohen's d = +0.548)

A Cohen's d of +0.548 (a medium effect size) means waste is a reliable, statistically significant predictor of session failure — not noise. This holds across all four model families tested.
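Cohen's d, as used here and in the tables below, is the standardized mean difference between the two groups' waste percentages. A minimal sketch of the standard pooled-variance formulation:

```python
from statistics import mean

def cohens_d(failed, resolved):
    # Standardized mean difference with pooled standard deviation:
    # positive when failed sessions waste more than resolved ones.
    n1, n2 = len(failed), len(resolved)
    m1, m2 = mean(failed), mean(resolved)
    v1 = sum((x - m1) ** 2 for x in failed) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in resolved) / (n2 - 1)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd
```

By the usual rule of thumb, d ≈ 0.2 is a small effect, 0.5 medium, and 0.8 large.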

Finding 2 — 98.7% of waste is behavioral loops

Of all wasted steps in the dataset, 98.7% were classified as LOOP — the agent repeating semantically identical actions. Only 1.3% were STAGNATION (partial progress with no new information). This is a critical finding for system designers: the primary failure mode is not "agent gives up" — it's "agent doesn't know it's stuck."

Why this matters: Loops are detectable. Because CAUM monitors behavioral structure rather than semantic content, it can identify loops in real time — while the agent is running — without reading a single character of prompt or payload data.
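A content-free loop alarm of this kind can be sketched in a few lines. The class below is a hypothetical stand-in for the CAUM motor: it watches only tool names for short repeating cycles, and its window size, period bound, and repeat count are illustrative parameters, not CAUM's.

```python
from collections import deque

class LoopWatch:
    """Flags when the recent tool-call sequence repeats with a short period.
    Reads only tool names -- no prompts or payloads -- mirroring the
    content-free approach described in the report (simplified stand-in)."""

    def __init__(self, window=12, max_period=4, min_repeats=3):
        self.window = deque(maxlen=window)
        self.max_period = max_period
        self.min_repeats = min_repeats

    def push(self, tool_name: str) -> bool:
        self.window.append(tool_name)
        seq = list(self.window)
        for p in range(1, self.max_period + 1):
            need = p * self.min_repeats  # length of a p-cycle repeated min_repeats times
            if len(seq) >= need and all(seq[-i] == seq[-i - p] for i in range(1, need - p + 1)):
                return True  # the same period-p cycle just repeated min_repeats times
        return False
```

Calling `push` after every tool execution raises the flag while the agent is still running, e.g. on the third consecutive `grep` or the third `read_file`/`edit_file` alternation.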

Finding 3 — Cross-model signal is consistent

Model Family   | Framework | Cohen's d | AUC   | Signal
GPT-4o         | SWE-agent | +1.099    | 0.757 | EXCELLENT
GPT-4o         | OpenHands | +1.131    | 0.852 | EXCELLENT
Llama 3.x      | SWE-agent | +0.968    | 0.747 | EXCELLENT
Gemini 3 Flash | mini-SWE  | +0.804    | 0.722 | EXCELLENT
Claude 3.7     | SWE-agent | +0.775    | 0.650 | GOOD
Claude 3.5     | SWE-agent | +0.175    | 0.554 | WEAK

The GPT-4o / OpenHands result (d=+1.131, AUC=0.852) is particularly notable — it shows that the behavioral signal is framework-agnostic. The same motor, with no retraining, works across different agent execution environments.
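The AUC column can be read as the probability that a randomly chosen failed session shows more waste than a randomly chosen successful one. A minimal rank-based (Mann-Whitney) sketch, assuming per-session waste percentages are used as the scores:

```python
def auc(failed_scores, successful_scores):
    # AUC = P(random failed session's waste > random successful session's waste),
    # i.e. the normalized Mann-Whitney U statistic; ties count half.
    wins = ties = 0
    for f in failed_scores:
        for s in successful_scores:
            if f > s:
                wins += 1
            elif f == s:
                ties += 1
    return (wins + 0.5 * ties) / (len(failed_scores) * len(successful_scores))
```

An AUC of 0.5 means the score carries no signal; 1.0 means perfect separation. (This O(n·m) version is for illustration; large datasets would use a rank-sum implementation.)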

Finding 4 — The $95K/year enterprise cost

For an enterprise running 10,000 agent sessions per day, the measured waste translates to roughly $95,000 per year in direct compute costs.

This is a conservative estimate. It only counts token/compute cost. It excludes engineering time spent debugging loops, customer SLA breaches from stuck sessions, and infrastructure costs from processes that don't terminate cleanly.
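The headline figure can be reproduced with back-of-envelope arithmetic from the report's own numbers. The per-step cost below is an assumed blended LLM price chosen for illustration, not a figure from the report; everything else comes from the stated dataset totals and waste rates.

```python
SESSIONS_PER_DAY = 10_000
AVG_STEPS_PER_SESSION = 2_615_295 / 99_167        # ~26.4, from the dataset totals
# Session-weighted waste rate across resolved (4.04%) and failed (13.83%) sessions:
WASTE_RATE = (19_591 * 0.0404 + 79_576 * 0.1383) / 99_167   # ~11.9%
COST_PER_STEP_USD = 0.0083                        # ASSUMPTION: blended LLM cost per step

wasted_steps_per_year = SESSIONS_PER_DAY * AVG_STEPS_PER_SESSION * WASTE_RATE * 365
annual_waste_usd = wasted_steps_per_year * COST_PER_STEP_USD  # lands near $95K
```

Under these assumptions, roughly 31,000 steps per day are wasted; the dollar figure scales linearly with whatever per-step cost applies to a given deployment.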

Enterprise Validation — Real-World Tasks

Beyond code-focused benchmarks, we validated the CAUM motor on the hkust-nlp/Toolathlon-Trajectories dataset — 6,818 sessions of real-world agentic tasks: WooCommerce order management, railway ticketing, form filling, API orchestration.

Signal was confirmed in 10 out of 22 models tested (d > 0.20). Best performer: Gemini 2.5 Flash at d=+0.527. The signal is weaker on enterprise tasks than coding tasks — consistent with the hypothesis that enterprise agents have more diverse tool vocabularies, making loop patterns harder to detect without domain-specific calibration.

Reproducibility

All analysis code and datasets used in this report are publicly available.

How CAUM Detects Waste in Real Time

CAUM integrates as a passive observer alongside any agent framework. It receives each tool call and result as the agent executes, computes a behavioral regime classification, and emits a running health score (UDS 0–1). No prompts, payloads, or business logic are read.

# One-time setup — works with any agent framework
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o")

# Per-step — call this after every tool execution
for tool, result in agent.steps():
    verdict = aud.push(tool, result)
    if verdict["regime"] == "LOOP":
        alert_team("Agent stuck in loop", verdict)

# End of session — get full audit certificate
cert = aud.finalize()  # cert["uds"] is the health score (0–1)

Analyze Your Agent's Trajectories

Upload a trajectory file and get a full 10-page forensic PDF report in under 3 minutes. Cryptographically signed. No prompts read.

Questions? contact@caum.systems