As Companies Race To Scale, AI Is Straining Under Operational Limits, Datadog Finds
SYDNEY - 22 April 2026 - As AI adoption accelerates, operational complexity – not model intelligence – is becoming the primary barrier to reliable AI at scale, according to new data from Datadog, Inc. (NASDAQ: DDOG), the AI-powered observability and security platform.
Datadog’s State of AI Engineering 2026 report, based on real-world data from thousands of organisations running AI in production, highlights a compounding complexity challenge as AI systems scale. Nearly seven in ten companies (69 per cent) now use three or more models alongside increasingly complex agent workflows. Around 5 per cent of AI model requests fail in production, with nearly 60 per cent of those failures caused by capacity limits – leading to slowdowns, errors, and broken experiences in AI-powered applications.
Additional key findings:
- Multi-model is now the norm: OpenAI remains the most widely used provider with a 63 per cent share, while adoption of Google Gemini and Anthropic Claude grew by 20 and 23 percentage points, respectively.
- Agent framework adoption doubled year-over-year, accelerating development but also introducing more moving parts into production systems.
- The amount of data sent to AI models per request is also rising: the average number of tokens more than doubled for ‘median use’ teams (50th percentile of usage volume) and quadrupled for heavy users (90th percentile).
“AI is starting to look a lot like the early days of cloud,” said Yanbing Li, Chief Product Officer at Datadog. “The cloud made systems programmable but much more complex to manage. AI is now doing the same thing to the application layer. The companies that win won’t just build better models – they’ll build operational control around them. In this new era, AI observability becomes as essential as cloud observability was a decade ago.”
Speed Requires Control
Competitive pressure is accelerating AI deployment across startups and large enterprises alike. But as systems scale, speed without control creates risk. Failures are increasingly driven by system design, including fragmented workflows, excessive retries, and inefficient routing.
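The “excessive retries” failure mode described above is usually addressed with bounded retries and backoff with jitter, so that clients back off rather than amplify a provider capacity problem. A minimal sketch of that pattern, using a hypothetical `CapacityError` as a stand-in for a provider’s overload response (e.g. an HTTP 429) and a stub model function:

```python
import random
import time

class CapacityError(Exception):
    """Stand-in for a provider's rate-limit/overload response (e.g. HTTP 429)."""

def call_with_backoff(fn, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry fn on CapacityError with capped exponential backoff and full jitter.

    Bounding attempts and spacing them out randomly avoids retry storms,
    where many clients hammering a saturated endpoint make the capacity
    problem worse instead of better.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except CapacityError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure to the caller
            # Full jitter: sleep a random fraction of the exponential cap.
            delay = random.uniform(0, base_delay * 2 ** attempt)
            sleep(delay)

# Example: a flaky model endpoint that succeeds on the third attempt.
calls = {"n": 0}
def flaky_model():
    calls["n"] += 1
    if calls["n"] < 3:
        raise CapacityError("model overloaded")
    return "ok"

result = call_with_backoff(flaky_model, sleep=lambda _: None)
```

This is illustrative only; real clients typically honour the provider’s `Retry-After` hint and distinguish retryable capacity errors from permanent ones.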
“The next wave of agent failures won’t be about what agents can’t do but what teams can’t observe,” said Guillermo Rauch, CEO at Vercel, the company behind Next.js and a leading platform for building AI-powered web applications. “We built agentic infrastructure at Vercel because agents need the same production feedback loops as great software. Unlike traditional software, agents have control flow driven by the LLM itself, making observability not just useful, but essential.”
“Innovation alone isn’t enough,” added Li. “To scale AI with confidence, organisations need real-time visibility across the entire stack – from GPU utilisation to model behaviour to agent workflows. Visibility and operational control are what allow teams to move fast without sacrificing reliability or governance. At scale, how you operate AI may matter more than the models you choose.”
Yadi Narayana, Datadog's CTO for APJ, said: “In A/NZ, the focus has firmly shifted to running AI reliably in production. Multi-model architectures and agentic workflows are becoming standard, but that maturity is exposing significant gaps. A failure rate sitting at around five per cent, largely driven by capacity constraints, is a material concern in industries where uptime and trust are non-negotiable. AI systems increasingly resemble distributed systems, yet many teams are still not managing them with the operational discipline that demands.”
“There is also a cost problem hiding in plain sight. Token consumption is climbing fast, while optimisation techniques, like prompt caching and smarter context design, remain largely untapped. The next phase is about closing the gap between how sophisticated these systems have become and how rigorously they are being operated. Organisations will prioritise foundational capabilities like observability, governance, and cost control over accelerating deployment speed, building AI systems that are reliable, scalable, and accountable,” said Narayana.
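Provider-side prompt caching reuses a shared prompt prefix across requests so repeated tokens are not re-billed at full price; the mechanics vary by provider. As a rough illustration of the cost effect, here is a client-side memoisation sketch with a hypothetical `PromptCache` and a stub model function that counts token spend:

```python
import hashlib

class PromptCache:
    """Minimal client-side memoisation of model responses, keyed by prompt hash.

    Provider-side prompt caching (reusing a cached prompt prefix) works
    differently under the hood, but the cost mechanics are similar: a
    cache hit spends no new tokens on the repeated portion.
    """
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, prompt, model_fn):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = model_fn(prompt)
        self._store[key] = response
        return response

# Stub "model" that tallies a crude token count per call.
tokens_spent = {"n": 0}
def stub_model(prompt):
    tokens_spent["n"] += len(prompt.split())
    return prompt.upper()

cache = PromptCache()
cache.get_or_call("summarise the incident report", stub_model)
cache.get_or_call("summarise the incident report", stub_model)  # cache hit: no new tokens
```

Exact-match memoisation only pays off for identical prompts; the report’s point about “smarter context design” is that stable, shared prefixes make provider-side caching effective even when the trailing user input varies.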
Read the full report – The State of AI Engineering 2026 – and learn how Datadog is investing in AI observability to help teams operate and scale AI systems in production here.
Report Methodology
Datadog analysed anonymised usage data from thousands of customers using LLMs in production environments, with global coverage across industries and geographies.
About Datadog
Datadog is the AI-powered observability and security platform. Our SaaS platform integrates and automates infrastructure monitoring, application performance monitoring, log management, user experience monitoring, cloud security and many other capabilities to provide unified, real-time observability and security for our customers' entire technology stack. Datadog is used by organizations of all sizes and across a wide range of industries to enable digital transformation and cloud migration, drive collaboration among development, operations, security and business teams, accelerate time to market for applications, reduce time to problem resolution, secure applications and infrastructure, understand user behavior and track key business metrics.