As Companies Race To Scale, AI Is Straining Under Operational Limits, Datadog Finds
SYDNEY - 22 April 2026 - As AI adoption accelerates, operational complexity – not model intelligence – is becoming the primary barrier to reliable AI at scale, according to new data from Datadog, Inc. (NASDAQ: DDOG), the AI-powered observability and security platform.
Datadog’s State of AI Engineering 2026 report, based on real-world data from thousands of organisations running AI in production, highlights a compounding complexity challenge as AI systems scale. Nearly seven in ten companies (69 per cent) now use three or more models alongside increasingly complex agent workflows. Around 5 per cent of AI model requests fail in production, with nearly 60 per cent of those failures caused by capacity limits – leading to slowdowns, errors, and broken experiences in AI-powered applications.
Additional key findings:
- Multi-model is now the norm: OpenAI remains the most widely used provider with a 63 per cent share, while adoption of Google Gemini and Anthropic Claude grew by 20 and 23 percentage points, respectively.
- Agent framework adoption doubled year-over-year, accelerating development but also introducing more moving parts into production systems.
- The amount of data sent to AI models per request is also rising: the average number of tokens more than doubled for ‘median use’ teams (50th percentile of usage volume) and quadrupled for heavy users (90th percentile).
“AI is starting to look a lot like the early days of cloud,” said Yanbing Li, Chief Product Officer at Datadog. “The cloud made systems programmable but much more complex to manage. AI is now doing the same thing to the application layer. The companies that win won’t just build better models – they’ll build operational control around them. In this new era, AI observability becomes as essential as cloud observability was a decade ago.”
Speed Requires Control
Competitive pressure is accelerating AI deployment across startups and large enterprises alike. But as systems scale, speed without control creates risk. Failures are increasingly driven by system design, including fragmented workflows, excessive retries, and inefficient routing.
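The “excessive retries” failure mode described above is usually addressed with bounded retries and backoff with jitter, so that clients back off rather than amplify a provider capacity problem. A minimal sketch of that pattern, using a hypothetical `CapacityError` as a stand-in for a provider’s overload response (e.g. an HTTP 429) and a stub model function:

```python
import random
import time

class CapacityError(Exception):
    """Stand-in for a provider's rate-limit/overload response (e.g. HTTP 429)."""

def call_with_backoff(fn, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry fn on CapacityError with capped exponential backoff and full jitter.

    Bounding attempts and spacing them out randomly avoids retry storms,
    where many clients hammering a saturated endpoint make the capacity
    problem worse instead of better.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except CapacityError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure to the caller
            # Full jitter: sleep a random fraction of the exponential cap.
            delay = random.uniform(0, base_delay * 2 ** attempt)
            sleep(delay)

# Example: a flaky model endpoint that succeeds on the third attempt.
calls = {"n": 0}
def flaky_model():
    calls["n"] += 1
    if calls["n"] < 3:
        raise CapacityError("model overloaded")
    return "ok"

result = call_with_backoff(flaky_model, sleep=lambda _: None)
```

This is illustrative only; real clients typically honour the provider’s `Retry-After` hint and distinguish retryable capacity errors from permanent ones.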
“The next wave of agent failures won’t be about what agents can’t do but what teams can’t observe,” said Guillermo Rauch, CEO at Vercel, the company behind Next.js and a leading platform for building AI-powered web applications. “We built agentic infrastructure at Vercel because agents need the same production feedback loops as great software. Unlike traditional software, agents have control flow driven by the LLM itself, making observability not just useful, but essential.”
“Innovation alone isn’t enough,” added Li. “To scale AI with confidence, organisations need real-time visibility across the entire stack – from GPU utilisation to model behaviour to agent workflows. Visibility and operational control are what allow teams to move fast without sacrificing reliability or governance. At scale, how you operate AI may matter more than the models you choose.”
Yadi Narayana, Datadog's CTO for APJ, said: “In A/NZ, the focus has firmly shifted to running AI reliably in production. Multi-model architectures and agentic workflows are becoming standard, but that maturity is exposing significant gaps. A failure rate sitting at around five per cent, largely driven by capacity constraints, is a material concern in industries where uptime and trust are non-negotiable. AI systems increasingly resemble distributed systems, yet many teams are still not managing them with the operational discipline that demands.”
“There is also a cost problem hiding in plain sight. Token consumption is climbing fast, while optimisation techniques, like prompt caching and smarter context design, remain largely untapped. The next phase is about closing the gap between how sophisticated these systems have become and how rigorously they are being operated. Organisations will prioritise foundational capabilities like observability, governance, and cost control over accelerating deployment speed, building AI systems that are reliable, scalable, and accountable,” said Narayana.
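Provider-side prompt caching reuses a shared prompt prefix across requests so repeated tokens are not re-billed at full price; the mechanics vary by provider. As a rough illustration of the cost effect, here is a client-side memoisation sketch with a hypothetical `PromptCache` and a stub model function that counts token spend:

```python
import hashlib

class PromptCache:
    """Minimal client-side memoisation of model responses, keyed by prompt hash.

    Provider-side prompt caching (reusing a cached prompt prefix) works
    differently under the hood, but the cost mechanics are similar: a
    cache hit spends no new tokens on the repeated portion.
    """
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, prompt, model_fn):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = model_fn(prompt)
        self._store[key] = response
        return response

# Stub "model" that tallies a crude token count per call.
tokens_spent = {"n": 0}
def stub_model(prompt):
    tokens_spent["n"] += len(prompt.split())
    return prompt.upper()

cache = PromptCache()
cache.get_or_call("summarise the incident report", stub_model)
cache.get_or_call("summarise the incident report", stub_model)  # cache hit: no new tokens
```

Exact-match memoisation only pays off for identical prompts; the report’s point about “smarter context design” is that stable, shared prefixes make provider-side caching effective even when the trailing user input varies.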
Read the full report – The State of AI Engineering 2026 – and learn how Datadog is investing in AI observability to help teams operate and scale AI systems in production here.
Report Methodology
Datadog analysed anonymised usage data from thousands of customers using LLMs in production environments, with global coverage across industries and geographies.
About Datadog
Datadog is the AI-powered observability and security platform. Our SaaS platform integrates and automates infrastructure monitoring, application performance monitoring, log management, user experience monitoring, cloud security and many other capabilities to provide unified, real-time observability and security for our customers' entire technology stack. Datadog is used by organizations of all sizes and across a wide range of industries to enable digital transformation and cloud migration, drive collaboration among development, operations, security and business teams, accelerate time to market for applications, reduce time to problem resolution, secure applications and infrastructure, understand user behavior and track key business metrics.