From AI Chaos to AI Control: How Observability Is Saving Enterprise AI Deployments

As AI agents proliferate across enterprise environments, a troubling reality emerges: despite promises of productivity gains, many organizations are drowning in AI chaos. Gartner predicts 40% of AI agent projects will be cancelled by 2027 due to escalating costs and unclear value. Yet while 92% of Indian organizations remain stuck in AI pilot phases, Yadi Narayana, Field CTO APJ at Datadog, sees a powerful solution emerging. Through advanced observability—going far beyond traditional monitoring—enterprises are transforming unpredictable AI systems into strategic assets. The key lies in tracing every agent decision, understanding the “why” behind AI actions, and preventing silent failures before they cascade into operational disasters.

Yadi Narayana
Field CTO, Asia Pacific & Japan
Datadog

CIO&Leader: AI agents are being hailed as productivity multipliers, yet many users report the opposite. How does observability bridge that gap and turn AI from a burden into a real productivity driver?

Yadi Narayana: Unlike traditional software—where results are predictable and tests can catch almost all logic errors—AI agents can behave in ways that are far less deterministic. Their actions and decisions depend on inputs, context, and even hidden biases in their models, not just code. What was meant to be a shortcut to efficiency can quickly become a source of confusion and frustration, and in more severe cases, compromise sensitive data and personally identifiable information (PII) or become a gateway to a security breach. In fact, Gartner estimates that over 40 percent of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, and inadequate risk controls.

Observability represents a turning point. Importantly, it isn’t just another word for monitoring. Traditional monitoring checks whether systems are up, their response times, and their overall health. For AI agents, you need something more profound. You must trace every agent decision, every tool call it makes, and follow how it interacts with data at each step. Observability gives you that whole story—from inputs and the context used, to the reasons behind every action the agent takes.

As businesses deploy networks of AI agents working together, this complexity scales exponentially, compounding the challenge – reports say that a striking 92 per cent of Indian organizations remain stuck in pilot or exploratory AI phases, with only eight per cent achieving full-scale deployment. Orchestration between agents becomes a problem that even seasoned engineers struggle to manage without the right tools and expertise. Without deep observability, agents can “drift”—making mistakes quietly or delivering subpar results without anyone spotting the root cause.

With proper observability, however, you can:

  • Understand why agents made confident choices, not just what they did.
  • Spot when performance and decision quality begin slipping, often before customers or users notice.
  • Trace how data and instructions evolve across multiple agent steps—which is crucial for catching silent failures or efficiency losses.

CIO&Leader: With shadow AI adoption accelerating, what role can observability play in ensuring accountable and safe use of AI agents across enterprise environments?

Yadi Narayana: As shadow AI adoption races ahead—meaning employees are using AI tools without official approval—the risks for businesses are growing fast. These hidden deployments create significant vulnerabilities, from sensitive data leaks to compliance breaches and total blind spots in day-to-day operations.

Observability is critical, making AI usage in enterprises both accountable and safe. Here’s how:

  • Fully-searchable logs: Capture every tool call and agent interaction, allowing teams to trace who did what, when, and why.
  • Centralised visibility: Bring all AI agent activity—whether in-house or third-party—under a single pane of glass, creating actual operational oversight.
  • Rules-based governance: Automatically redact or block sensitive data before it’s exposed, and enforce compliance with company policies.
  • Real-time alerts: Quickly surface suspicious or unauthorised activity, so actions can be taken before problems escalate.
  • Granular access controls: Restrict what data each agent can access, ensuring permissions match actual business needs.

With these capabilities, Organisations gain real-time insight into every facet of AI usage, empowering them to stop security incidents early, stay compliant, and run AI safely even as shadow deployments continue to rise.

CIO&Leader: Can you share examples of how trace-level observability helped organizations detect and resolve misbehaving AI agents before they caused downtime or operational debt?

Yadi Narayana: The Indian AI ecosystem is rapidly expanding—India now leads globally in generative AI adoption, with 92 per cent of employees using GenAI tools, far exceeding the global average. In this context, ensuring the reliability and performance of AI systems is critical. Here are examples of how trace-level observability with Datadog helped organisations detect and resolve misbehaving AI agents before they caused downtime or operational debt:

  • AppFolio: Reducing Latency to Boost User Adoption

AppFolio’s AI-powered communication assistant, Realm-X Messages, faced unpredictable performance issues that threatened its usefulness. Property managers experienced high latency in AI-assisted resident communications, directly impacting product usage rates.

By using Datadog’s LLM Observability with deep tracing, AppFolio gained:

  • End-to-end tracing of each AI agent’s decision path from input to output
  • Real-time monitoring of latency and token usage patterns in large language models
  • Anomaly detection for error spikes and performance drops
  • Quality checks, including toxicity monitoring and failure-to-answer alerts

This trace-level monitoring revealed a clear correlation: as response times increased, users abandoned the AI assistant. By pinpointing bottlenecks in the AI workflow, engineers optimised problematic steps, resulting in significant latency reduction and improved adoption.

  • Thomson Reuters: Autonomous Investigations for Faster Incident Resolution

Thomson Reuters needed to scale AI solutions across a complex tech stack while maximising operational efficiency. Without proper observability, AI agents risked creating blind spots and extending incident response times.

With Datadog’s Bits AI SRE agent, they achieved:

  • Full context from initial alert to final resolution
  • Autonomous root cause analysis leveraging telemetry and service data
  • Real-time investigation summaries to assist human responders

This accelerated incident response by delivering detailed diagnostics automatically, reducing mean time to resolution and preventing minor issues from escalating into downtime or operational debt.

CIO&Leader: As enterprises deploy multiple AI tools across cloud-native systems, how can observability frameworks prevent fragmentation and maintain infrastructure coherence?

Yadi Narayana: India’s AI market is projected to grow rapidly—from approximately US$8 billion in 2025 to US$17 billion by 2027. Without clear observability, organisations risk facing incompatible telemetry formats from different AI tools, making it difficult or even impossible to connect related events across AI agents. This fragmentation leads to blind spots and delays in identifying systemic issues.

A unified observability framework helps by:

  • Tracing end-to-end workflows across multiple AI agents and systems, providing a complete picture of how AI-driven processes flow and interconnect.
  • Correlating events from different agents to identify broader systemic problems rather than siloed, isolated issues.
  • Maintaining unified visibility across diverse AI implementations, ensuring that both AI and non-AI workloads can be monitored cohesively in one platform.

Modern approaches extend traditional monitoring to include AI-specific metrics such as token usage, model versions, parameters, tool interactions, and vector database access.

By treating AI observability as an essential, integrated part of the overall infrastructure monitoring strategy, enterprises can avoid tool sprawl and operational complexity. This cohesion ensures faster root cause analysis, improved performance tuning, better compliance monitoring, and ultimately smoother AI scaling.

CIO&Leader: AI agents expand the attack surface in new ways. How can integrating observability with security monitoring help enterprises stay ahead of these emerging risks?

Yadi Narayana: Indian banks are projected to achieve nearly 46 per cent productivity gains through generative AI, driving advancements in customer service, fraud detection, and compliance. Similarly, EY India forecasts a 35–37 percent uplift in productivity by 2030 across the retail, consumer, and e-commerce sectors, driven by generative AI–powered enhancements in product development, pricing, promotions, and customer experience.  However, these gains come with elevated stakes—AI agents expand the attack surface in new ways, making such sectors highly vulnerable. Unlike traditional apps, they use natural language, making them prone to prompt injection, data leaks, and adversarial inputs.

Key risks include:

  • Prompt injection: Malicious inputs trick agents into unwanted actions.
  • Data leaks: Agents may accidentally reveal sensitive info.
  • API risks: Agents call numerous APIs, some of which are poorly secured.
  • Over-permissioned agents: Agents with too much access increase risk.

To stay ahead, enterprises must integrate observability with security:

  • Monitor agent inputs and outputs to spot anomalies.
  • Trace API calls and tool usage in real time.
  • Correlate AI activity with security logs to detect attacks.
  • Enforce strict access controls monitored through observability.
  • Utilize real-time alerts and automated responses to address threats.

This combined approach helps detect AI-specific threats early, thereby reducing risk while enabling the safe use of AI.

CIO&Leader: Looking ahead, what innovations in observability—be it automated anomaly detection, real-time dashboards, or predictive insights—will be most critical to managing AI agents at scale?

Yadi Narayana: Looking ahead, observability for AI agents will need to move beyond monitoring what has happened, towards predicting issues and guiding action in real time. As these systems scale, their complexity and autonomy make it impossible for people to identify and resolve problems manually. In India, this evolution is critical: an EY India survey forecasts that GenAI will boost the country’s $254 billion USD IT industry’s productivity by 43 to 45 percent over the next five years. Sustaining that uplift will require platforms that not only detect anomalies but recommend and even execute fixes.

The most critical innovation will be automated anomaly detection with built-in explanations. It’s not enough to raise an alert; platforms will need to identify the likely cause and suggest a solution. This turns raw data into something actionable.

Equally important are adaptive dashboards that adjust to context. Instead of static charts, we’ll see living dashboards that bring forward what matters most at that moment—whether it’s performance, drift, cost, or bias. The goal is to give teams clarity, not clutter.

Finally, the real step-change will be predictive insights and prescriptive action. Observability platforms will not only warn of risks, such as instability or cost spikes, but also recommend, or even carry out, adjustments like retraining or throttling agents. This is where observability starts to blend into autonomous operations.

In simple terms, observability must become less of a rear-view mirror and more of a co-pilot—helping organisations run AI systems at scale safely, efficiently, and with trust.

Share on