When Systems Talk, Let Humans Listen Last

AI resolves routine DevOps issues; humans handle meaningful escalations only.

“AI handles our first line of defense. If an alert still needs a person, it’s already escalated beyond the trivial.”  ~  Gaurav Duggal, Senior Vice President – AI & IT, Jio Platforms

At Reliance Jio Platforms, scale is not just a characteristic—it’s a defining constraint. With a digital ecosystem that spans hundreds of millions of users, thousands of services, and billions of daily telemetry signals, DevOps cannot function as a manual, reactive model. For us, the only sustainable path forward is one where observability is not just intelligent, but automated—where systems handle the routine, and people focus on the exceptional.

Jio Platforms has built a highly evolved approach to incident management, driven by AI and issue categorization that prioritizes autonomy over intervention. Incidents are dynamically classified based on pattern recognition, persistence, and impact. The vast majority of what would traditionally be considered “alerts”—momentary latency spikes, packet drops, retry attempts—are absorbed and resolved by automation before they ever surface on a dashboard. These are considered part of the system’s operational noise floor, not meaningful deviations.

Where patterns persist, or when impact expands in scope or severity, the system escalates intelligently—first to enriched diagnostics and correlation layers, and only then to human teams. In this model, an engineer rarely touches an incident unless AI has already filtered, suppressed, enriched, and attempted remediation. We’ve built our observability layers to ensure that humans don’t get involved until absolutely necessary. If something still reaches you, it’s already escalated beyond the trivial.
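The classification and escalation path described above can be illustrated with a minimal sketch. It is not Jio's implementation: the `Signal` fields, the tier names, and every threshold are hypothetical, chosen only to show how pattern, persistence, and impact might combine into a routing decision.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ABSORB = "absorb"      # part of the noise floor, handled by automation
    DIAGNOSE = "diagnose"  # sent to enrichment and correlation layers
    HUMAN = "human"        # escalated to an engineer


@dataclass
class Signal:
    kind: str                         # e.g. "latency_spike"
    persistence_s: float              # how long the condition has lasted
    affected_services: int            # rough measure of impact scope
    remediation_failed: bool = False  # automation already tried and failed


# Hypothetical values, for illustration only.
TRANSIENT_KINDS = {"latency_spike", "packet_drop", "retry"}
PERSISTENCE_LIMIT_S = 60.0
SCOPE_LIMIT = 3


def triage(signal: Signal) -> Tier:
    """Route a signal to the lowest tier that can plausibly handle it."""
    persistent = signal.persistence_s >= PERSISTENCE_LIMIT_S
    wide = signal.affected_services >= SCOPE_LIMIT
    # Momentary, narrow, known-transient signals never surface.
    if signal.kind in TRANSIENT_KINDS and not persistent and not wide:
        return Tier.ABSORB
    # A person is involved only after automated remediation has failed.
    if signal.remediation_failed:
        return Tier.HUMAN
    return Tier.DIAGNOSE
```

In this sketch a two-second latency spike on one service is absorbed, the same spike lasting five minutes goes to diagnostics, and an engineer sees it only once automated remediation has been attempted and has failed.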

This automation strategy is powered in part by ATK, a platform we use to encode and orchestrate the system’s first-response reflexes. Through ATK, predefined workflows are triggered based on observed conditions, whether that means rolling back faulty releases, adjusting runtime configurations, isolating noisy services, or rebalancing infrastructure loads. Crucially, these actions are not only automated but traceable, auditable, and dynamically updated based on operational learnings.
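To make the idea of condition-triggered, auditable workflows concrete, here is a small sketch. It does not describe ATK's actual interface, which this article does not detail; `Workflow`, `Orchestrator`, the event fields, and the rollback rule are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Workflow:
    name: str
    condition: Callable[[dict], bool]  # decides whether the workflow applies
    action: Callable[[dict], str]      # performs the response, returns a summary


@dataclass
class Orchestrator:
    workflows: list[Workflow] = field(default_factory=list)
    audit_log: list[dict] = field(default_factory=list)

    def register(self, workflow: Workflow) -> None:
        self.workflows.append(workflow)

    def handle(self, event: dict) -> list[str]:
        """Run every workflow whose condition matches and record each run."""
        triggered = []
        for wf in self.workflows:
            if wf.condition(event):
                result = wf.action(event)
                # Every automated action leaves a traceable record.
                self.audit_log.append({
                    "time": datetime.now(timezone.utc).isoformat(),
                    "workflow": wf.name,
                    "event": event,
                    "result": result,
                })
                triggered.append(wf.name)
        return triggered


# Hypothetical rule: roll back when errors climb right after a deploy.
orch = Orchestrator()
orch.register(Workflow(
    name="rollback_release",
    condition=lambda e: e.get("error_rate", 0.0) > 0.05 and e.get("recent_deploy", False),
    action=lambda e: f"rolled back {e['service']}",
))
```

Keeping conditions and actions as data rather than hard-coded branches is what would let engineers curate the rules over time, and the audit log is what keeps each automated action traceable.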

We leverage ATK as more than a remediation engine—it’s an evolving DevOps brain. With each incident, the system learns, adapts, and expands its scope of autonomy. Over time, engineers shift from firefighting to curating the conditions under which the system decides to act. It’s not just that we fix faster—we fix smarter, and we know exactly when to escalate and when not to.

Another breakthrough in Jio’s model is its emphasis on client-side observability. Many system degradations don’t originate in the backend. They start at the edge: within the browser, inside a mobile cache, or during a device-network handoff. These failures rarely show up in logs or server traces, but they distort the user experience in ways that are commercially significant. To close this gap, we instrument the client environment, analyze behavioral anomalies, and build baselines of “normal” user flows. When those flows deviate (sudden drop-offs, erratic navigation, silent session failures), the system responds as if it were a backend incident. We don’t wait for logs to tell us there’s a problem. We let user behavior tell us.
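One simple way to picture a baseline of “normal” user flows is a deviation check on a client-side metric such as checkout completion rate. The sketch below uses a plain z-score, which is an assumption made for illustration and not a method the article attributes to Jio; the function name, the threshold, and the sample figures are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(baseline: list[float], observed: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag an observation that sits far outside the baseline's spread."""
    if len(baseline) < 2:
        return False  # not enough history to define "normal"
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


# Hypothetical hourly completion rates for one user flow.
history = [0.81, 0.79, 0.80, 0.82, 0.78]
sudden_drop_off = is_anomalous(history, 0.55)  # far below the usual range
ordinary_hour = is_anomalous(history, 0.79)    # within normal variation
```

A flagged deviation could then be treated like any backend signal and fed into the same triage and remediation path.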

By combining automated RCA, self-healing actions, client-side intelligence, and platformized decision logic, we have turned DevOps into an architecture of trust. Engineers are not removed from the loop—they’re elevated. Their time is spent improving system design, refining suppression logic, and working on issues that truly require human reasoning.

The future of DevOps is not about adding more dashboards or writing better scripts. It’s about building systems that can sense, act, and learn with minimal handholding. We’re not trying to eliminate people from DevOps. We’re trying to give them back their time, so they can solve the problems that matter.
