Why Digital Outages Are Rising—and How AI-Powered Observability Can Stop Them

As cloud architectures grow more complex, system outages are becoming more costly and damaging to customer trust. Rob Newell, Senior Vice President and General Manager for Asia Pacific & Japan at New Relic, explains why traditional monitoring tools are failing to keep pace with modern digital ecosystems. With downtime now costing nearly $2 million per hour, Newell advocates for AI-strengthened observability that moves beyond reactive firefighting to autonomous incident detection and resolution. In this conversation, he outlines how organisations can consolidate fragmented tools, embed intelligence into workflows, and prepare teams for a future where observability platforms don’t just alert—they act.

Rob Newell
Senior Vice President and General Manager, Asia Pacific & Japan
New Relic

CIO&Leader: Outages are becoming more frequent and more visible. Why are failures increasing even as enterprises invest in cloud-native architectures, AI-driven systems, and more resilient infrastructure?

Rob Newell: Digital systems now span multiple layers, services, and dependencies, turning large parts of the digital estate into a black box where early warning signs often go unnoticed by traditional tools. On top of that, most organisations use around four monitoring tools, leading to fragmented visibility and scattered insights across dashboards. This fragmentation makes it harder for teams to understand how systems behave as a whole and where failures originate.

These compounding complexities are prompting businesses to invest in AI-strengthened observability that can match the scale of today’s digital ecosystems. It analyses real-time signals alongside historical performance data and incident outcomes, helping teams surface hidden risks, detect issues early, and address threats to system reliability before they escalate into outages.
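To make the idea concrete, here is a minimal sketch (not New Relic’s implementation) of how a detector might score live telemetry against a rolling historical baseline; the class name, window sizes, and threshold below are illustrative assumptions:

```python
# Illustrative only: a toy anomaly detector that compares live latency
# samples against a rolling historical baseline, flagging early warning
# signs before they escalate. Thresholds and window sizes are assumptions.
from collections import deque
from statistics import mean, stdev

class LatencyAnomalyDetector:
    def __init__(self, history_size: int = 500, z_threshold: float = 3.0):
        self.history = deque(maxlen=history_size)  # rolling baseline of past samples
        self.z_threshold = z_threshold

    def observe(self, latency_ms: float) -> bool:
        """Return True if this sample deviates sharply from the baseline."""
        anomalous = False
        if len(self.history) >= 30:  # need enough history for a stable baseline
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(latency_ms - mu) / sigma > self.z_threshold:
                anomalous = True
        self.history.append(latency_ms)
        return anomalous

detector = LatencyAnomalyDetector()
for sample in [120, 125, 118, 122] * 10 + [480]:  # simulated telemetry stream
    if detector.observe(sample):
        print(f"Early warning: latency {sample} ms deviates from baseline")
```

A production platform would run this kind of logic across thousands of signals and correlate anomalies with incident history, but the principle is the same: historical context turns raw telemetry into early warnings.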

CIO&Leader: Incidents like the recent Cloudflare outage showed how one failure can ripple across millions of users and businesses. What did this episode reveal about the fragility of today’s deeply interconnected digital ecosystems?

Rob Newell: The Cloudflare outage underlines how the cost of disruption now extends far beyond direct financial loss. It shows that reputational damage can be equally severe, as even minor disruptions can erode customer trust and confidence. In B2B environments in particular, disruptions cascade across the value chain, affecting downstream businesses and, ultimately, end users.

These incidents also place sustained pressure on IT teams. They are forced to divert time and resources toward expediting issue resolution, often at the expense of strategic initiatives and long-term business priorities. Such incidents call for advanced, AI-strengthened observability that spans the entire digital ecosystem. Organisations require monitoring capabilities that provide proactive awareness of emerging issues, accelerate resolution, and enable automated remediation where appropriate, reducing the risk of costly failures.

CIO&Leader: The cost of downtime is now estimated at nearly $2 million per hour, but the damage often goes far beyond revenue loss. How should business leaders think about the reputational, customer-trust, and ecosystem-wide impact of outages?

Rob Newell: As more organisations adopt AI-strengthened observability and deliver consistently stable services, tolerance for disruption continues to decline. Even a single incident can prompt customers to reassess their loyalty towards a business. Business leaders need to view resilience as an ecosystem-level responsibility. This requires moving beyond reactive incident management and toward deeper visibility, faster detection, and automated risk mitigation across the digital estate. Since reliability directly shapes trust and growth, proactive system monitoring and resolution are the only way forward.

CIO&Leader: Despite rising adoption of observability platforms, engineering teams still spend too much time firefighting incidents. What’s broken in today’s approach to monitoring and incident response?

Rob Newell: The problem largely lies in fragmented observability. Even in today’s interconnected environments, many organisations rely on four or more monitoring tools, each offering partial and often disjointed insights. Instead of building a cohesive understanding of system behaviour, teams are left piecing together signals that add limited value to the larger picture. Traditional monitoring tools also struggle to surface subtle signals buried deep within complex workflows, slowing down problem detection and resolution. Without AI-strengthened observability that can operate across these layers and detect early anomalies, teams remain reactive rather than preventative.

For most high-impact outages, both mean time to detect (MTTD) and mean time to resolve (MTTR) run between 30 and 90 minutes. Tool sprawl further stretches response timelines, increasing manual effort and slowing coordination during incidents. As a result, engineering teams remain trapped in cycles of firefighting, responding to symptoms rather than addressing root causes, while customer trust and business outcomes continue to take the hit.

CIO&Leader: You’ve said that in 2026, observability will move beyond dashboards to autonomous action. How will agentic AI change how incidents are detected, diagnosed, and resolved in real time?

Rob Newell: As agentic orchestrations become more common across Indian enterprises, identifying issues and maintaining system stability have become significantly more complex. An agentic, AI-strengthened observability platform can address these complexities by embedding AI-driven insights, recommendations, and task automation directly into operational workflows. This integration enables teams to detect issues earlier, diagnose root causes faster, and trigger remediation, reducing both MTTD and MTTR.

Agentic AI-driven observability can automatically surface code-level errors, correlate them with system behaviour, and suggest fixes with full contextual awareness. Teams can also query systems using natural language to understand incidents and receive recommended actions with minimal effort. Automated post-incident analysis helps organisations capture insights from failures, identify recurring patterns, and brace for similar scenarios in the future.
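As a rough illustration of the detect, diagnose, and remediate loop Newell describes, the sketch below shows an agent that correlates an incident with deployment context and only acts autonomously within pre-agreed guardrails; every name and heuristic here is hypothetical:

```python
# Illustrative sketch of an agentic remediation loop: an incident arrives,
# the agent correlates it with deploy and error context, then either
# auto-remediates within guardrails or escalates to a human. All names
# and heuristics are hypothetical.
from dataclasses import dataclass

@dataclass
class Incident:
    service: str
    error_rate: float       # fraction of failing requests
    recent_deploy: bool     # did the service deploy in the last hour?

SAFE_ACTIONS = {"rollback", "restart"}   # guardrail: actions the agent may take alone

def diagnose(incident: Incident) -> str:
    """Correlate signals to propose a remediation (toy heuristic)."""
    if incident.recent_deploy and incident.error_rate > 0.05:
        return "rollback"          # errors spiked right after a deploy
    if incident.error_rate > 0.20:
        return "restart"           # severe degradation with no deploy correlation
    return "escalate"              # ambiguous: hand to an engineer with context

def handle(incident: Incident) -> None:
    action = diagnose(incident)
    if action in SAFE_ACTIONS:
        print(f"[auto] {action} on {incident.service}; post-incident review queued")
    else:
        print(f"[human] escalating {incident.service} with diagnosis attached")

handle(Incident(service="checkout", error_rate=0.08, recent_deploy=True))
handle(Incident(service="search", error_rate=0.03, recent_deploy=False))
```

The guardrail set is the key design choice: the agent acts on its own only for actions humans have pre-approved, and escalates everything else with its diagnosis attached.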

CIO&Leader: As organisations move toward autonomous observability, what foundational changes—across people, processes, and technology—do leaders need to make now to ensure trust, control, and measurable ROI?

Rob Newell: On the technology front, organisations should eliminate tool sprawl and consolidate observability platforms. Fragmented tooling limits visibility and undermines the effectiveness of AI-driven analysis. Autonomous systems rely on unified, high-quality data across applications, infrastructure, and workflows. Without consolidation, AI lacks the context required to act accurately and safely.

From a process perspective, organisations need to embed observability directly into operational workflows. Insights and actions should flow into software delivery pipelines (such as CI/CD and the broader SDLC) and into service management systems such as ITSM, rather than being confined to isolated dashboards. Clear escalation paths, approval mechanisms, and guardrails are essential to maintain control as automation increases.

Finally, leaders must prepare people for the shift. Autonomous observability changes how teams work, moving engineers away from manual troubleshooting toward defining intent, policies, and boundaries for AI-driven action. Investing in upskilling and change management builds trust in automated decisions and ensures teams remain accountable for outcomes.
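For instance, a deployment gate embedded in a CI/CD pipeline might query the observability platform before promoting a release. In this hypothetical sketch, the API endpoint, response field, and threshold are placeholders for whatever your platform exposes and your service owners agree on:

```python
# Illustrative CI/CD gate: before promoting a release, query the observability
# platform for the canary's error rate and block promotion if it regresses.
# The endpoint, response field, and threshold are hypothetical stand-ins.
import sys
import json
import urllib.request

OBSERVABILITY_API = "https://observability.example.com/api/error-rate"  # hypothetical
MAX_ERROR_RATE = 0.01   # guardrail agreed with the service owners

def canary_error_rate(service: str) -> float:
    req = urllib.request.Request(f"{OBSERVABILITY_API}?service={service}&window=15m")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["error_rate"]

def gate(service: str) -> None:
    rate = canary_error_rate(service)
    if rate > MAX_ERROR_RATE:
        # Fail this pipeline step; the escalation path takes over from here.
        sys.exit(f"Promotion blocked: {service} error rate {rate:.2%} exceeds guardrail")
    print(f"Promotion approved: {service} error rate {rate:.2%} within guardrail")

if __name__ == "__main__":
    gate("checkout")
```

Failing the pipeline step hands control back to the escalation path, keeping humans in the loop whenever automation declines to act on its own.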
