Should reliability be back on top of the CIO agenda?

A series of airline IT outages has exposed the susceptibility of legacy IT infrastructure to failure. Is it time CIOs gave a little more priority to infrastructure reliability?

“(This is) CIO’s worst nightmare coming true,” said Nilesh Sangoi, CEO of Meru Cabs and a former CIO.
 
“Shows unpreparedness in terms of DR and BCP. How can a top airline afford this?” asked Kamal Karnataka, CIO, RJ Corp.
 
“Unbelievable! No DR preparedness?! Thousands across the world affected due to this silly mishap!!!” was the reaction of Aneesh Nair, CIO, NDTV Worldwide.
 
“Just comes back to the basis of ensuring that there is no single point of failure in your ecosystem,” said Darshan Appayanna, CIO & CISO, Happiest Minds.  
 
These CIOs were responding to a question CIO&Leader posed on Twitter in the aftermath of a huge IT outage at Delta Airlines in August 2016. The outage forced the airline to cancel more than 2,100 flights and incur a loss estimated at USD 150 million.
 
Fast forward to May 2017, and there is a sense of déjà vu.
 
On 27 May, British Airways (BA) cancelled all its flights from Heathrow and Gatwick airports in London because of an “IT failure”. This severely disrupted its global operations, as Heathrow is its hub airport. BA also operates many flights from Gatwick.
 
The airline scrapped close to 500 flights, about 60% of its schedule, on 27 May, the day of the outage, and close to 200 more services over the next two days. About 75,000 passengers were affected in what is clearly one of the biggest operational disruptions ever caused by an IT system failure.
 
Except for the Icelandic volcanic ash cloud that shut down flying across northern Europe in 2010, nothing of this magnitude has happened in the recent past.
 
The outage is estimated to have cost as much as 100 million euros (USD 112 million). 
 
And these two are not exceptional cases. 
 
United Airlines has seen multiple such outages in the last year. The latest two were in October 2016, which caused massive delays, and in January this year, which severely affected its domestic schedule.
 
Just a few weeks before the Delta outage, in July 2016, Southwest Airlines went through similar problems. It had to cancel 2,300 flights because of an IT failure. The loss was estimated at between USD 54 million and USD 82 million, including lost revenue and added costs.
 
Nor is this restricted to airlines. HSBC, Amazon and a few Indian banks have all gone through similar situations. It is just that operational disruptions at airlines are far more visible, as passengers get stranded in airports and start flooding social media.
 
What caused the massive disruptions?
As has now become clear, the British Airways outage was caused by an uncontrolled return of power, which triggered a power surge that damaged servers and distribution panels.
 
In the Delta Airlines outage, the glitch was attributed to a small fire in one of its datacenters, which was quickly extinguished. Yet it wreaked havoc on passengers, throwing the airline's operations into chaos for several days.
 
The United Airlines outage in October was also attributed to a power glitch.
 
The Southwest outage was traced to the failure of a single router.
 
There seems to be a clear pattern. Here are things that are common to all these outages:
 
  • No external attack was involved, not even a DDoS attack
  • They all involved company-owned infrastructure, mostly legacy
  • They all involved what can be called minor glitches; but the impacts were major
  • In three out of these four cases, it was a power issue
  • It was not data; it was infrastructure that caused the problems in each of these cases
  • Disaster recovery was the issue in all cases
In short, these were failures of the companies' internal IT; they were not targeted crimes, as no external actors were involved.
 
Contrast this with the impact of the global ransomware attack in May. Described as “The Biggest Attack in History”, the WannaCry malware hogged media headlines for days on end, and the term entered the popular lexicon. It was clearly an attack by external actors. Yet we did not hear of any major business being disrupted on a comparable scale anywhere in the world.
 
This is evident from the fact that, in close to a month, the attack has fetched the attackers only about USD 150,000, going by the Twitter bot @actual_ransom, which tracks payments to the three bitcoin wallets tied to the attack. That is a minuscule fraction of what British Airways or Delta lost in a day.
 
Misplaced priorities?
Today, business alignment is the catchphrase among CIOs. They compete with each other to show how business-savvy they are. Many are even apologetic about discussing their technology role. When they do talk technology, AI, machine learning, Big Data and IoT rule the vocabulary. Even cloud is becoming passé, let alone datacenter operations.
 
Yes, security is still solidly part of the agenda and is becoming more and more important. CIOs and CISOs do talk about, and try to protect their organizations from, sophisticated cyberattacks. Vendors do their best to highlight the new threats posed by today's attackers, who no longer do it for fun but are criminals. From organized criminals to internal employees, from state actors to hackers, they all pose threats.
 
Unfortunately, basic issues like reliability and resilience do not get that kind of attention, presumably because they are taken for granted. The technology is mature, and since basic infrastructure does not directly add much incremental value to the business, it is not on the CIO's radar.
 
The airline outages raise questions about the soundness of this laissez-faire posture. There seems to be a false sense of confidence about the dependability and reliability of IT systems. Vendors, who often shape the discourse and priorities for CIOs, are silent on the issue too, as there is not much business opportunity in it.
 
One need not wait for big outages like these to be reminded of this. According to Ponemon Institute, the cost of datacenter downtime is rising sharply. The average total cost per minute of an unplanned outage was USD 8,851, according to its 2016 Cost of Data Center Outage study. At that rate, a single hour of downtime costs more than USD 530,000.
 
What outages like BA's and Delta's have shown is that large global corporations are not insulated from the possibility of infrastructure failure. Comparatively small glitches, caused not by natural calamities or external actors but by internal IT (people, process or technology), have brought them to their knees.
 
Time to relook
It is not just CIOs who get carried away. There seems to be a wider sense of confidence about the dependability of critical IT infrastructure. In the World Economic Forum's annual Global Risks Report 2017 (GRR), the risk ‘critical information infrastructure breakdown’ now features in quadrant III, denoting a low-impact, low-likelihood risk, even as ‘cyberattacks’ features in quadrant I, denoting a high-impact, high-likelihood risk. Even ‘data fraud/theft’ is perceived to be a much likelier risk than ‘critical information infrastructure breakdown’.
 
Much of the laissez-faire stance comes from a (false?) sense of confidence about the likely impact of ‘critical information infrastructure breakdown’. Over the last three years, the perceived impact associated with ‘critical information infrastructure breakdown’ has slid continuously, from 5.1 in 2015 to 3.38 in 2017, even though the perceived likelihood of occurrence has not changed much. In other words, while stakeholders do not rule out a critical information infrastructure breakdown, they think its impact will be smaller and smaller.
 
The Delta and BA outages have busted that myth completely.
 
If IT infrastructure maintenance and reliability received even 20% of the attention given to the possibility of organized cyberattacks, the risk of infrastructure failure would come down significantly.
 
Consider this. A data breach may cause a lot of harm and reputational loss, but an infrastructure failure will actually stop your operations and make you incur a huge direct loss. Even a moderate-sized outage can wipe out all the gains you make by using social media effectively.
 
As more and more business operations, from procurement to manufacturing to sales, become digitalized, even minor glitches can cause huge disruptions. It is time that ensuring reliability became a strategic priority once again.
 
