Should reliability be back on top of the CIO agenda? We asked this question in a cover story in CIO&Leader two-and-a-half years back, in May 2017 to be precise. This was in the aftermath of a massive disruption in the operations of British Airways on May 27 that year, which involved 700 flight cancellations, left 75,000 passengers stranded across airports and resulted in a loss of USD 112 million to the airline, with estimated goodwill losses far higher.
This huge disruption in services happened due to a very small human error in just one data center near Heathrow airport. An engineer disconnected a power supply at the data center and, when it was reconnected, it caused a power surge, which caused physical damage.
British Airways was not alone. Massive disruptions in services at SouthWest Airlines in July 2016 and at Delta Airlines in the very next month were also results of small issues: one router failure and a fire in one data center respectively. The number of such disruptions has increased manifold.
In 2019 alone, there were close to a dozen such disruptions in various operations, including in SouthWest Airlines, American Airlines, JetBlue, Delta Airlines, Air India, British Airways, National Stock Exchange, and HongKong Exchange. In most cases, the major impact is due to one specific, often small issue in infrastructure.
For example, the problem and delay in recording executed transactions in India's National Stock Exchange in August 2019 was attributed to a network component failure in a data center.
The HongKong Exchange, in September 2019, had to completely halt its derivative trading, its first such halt in history, due to a connectivity failure.
In March and April last year, in two separate incidents (one involving a glitch in SABRE's reservation system and another involving a glitch in Aerodata Systems), operations of multiple airlines were affected. The airlines that suffered included American Airlines, JetBlue, SouthWest and Delta.
Air India's service disruption in April 2019 was traced to a passenger system server failure at the global airline technology services firm SITA.
There seem to be some clear commonalities in all these business disruptions:
- No external attack was involved, not even a DDoS attack
- They all involved what can be called minor glitches; but the impacts were major
- In most cases, it was either a power or network issue
- It was not data; it was infrastructure that caused the problems in each of these cases
- Disaster recovery was the issue in all cases
In short, they were all failures of IT hardware or software, not caused by any intentional or accidental "hacking" into the system.
It was a question of reliability and resilience of the systems, something that has been taken for granted by many, in the IT community and outside.
CIOs often explicitly say they do not lose sleep over these things. While the so-called concept of two-speed IT started making CIOs vie with each other to be known as progressive game changers by focusing on business IT, the low-value "infrastructure" was perceived as basic hygiene that did not require the CIO's attention.
If they had to worry, they'd rather worry about cyberattacks that were getting more and more sophisticated and targeted!
Is Infrastructure Assured?
This sense of complete confidence around "highly mature infrastructure" (now proven to be somewhat, if not fully, misplaced, as these incidents show) made the technology managers turn their attention to "think" and "talk" business, which was seen as the path to salvation.
So high was the confidence that they managed to convince the top managers (the CEOs and the boards) about the infallibility of their IT systems: the network, data center and enterprise software, especially those that are more "mature", whatever it means.
The possible breakdown of critical information infrastructure figured in the World Economic Forum's Global Risk Report (GRR) 2015 as the most potentially impactful technological risk, at the 7th position, ahead of such global risks as fiscal crises, unemployment, food crises, terrorist attacks and asset bubbles. Over the next few years, it all but disappeared from the minds of global leaders.
In the next two years, the global community had almost stopped perceiving infrastructure breakdown as a risk worth its name. By 2017 (the GRR is released in January every year), "breakdown of critical information infrastructure" had dropped to 22nd position in the ranking of the most potentially impactful global risks.
And make no mistake. Technological risks, as such, were being seen as more and more likely. Even as the impact of a possible infrastructure breakdown was perceived to be lesser and lesser, security-related risks were being seen as more and more likely. During that period, Data Fraud or Theft rose from being the 9th most likely risk in 2015 to the 4th most likely risk in 2018. Large-scale cyberattacks rose from being perceived as the 11th most likely risk to the 3rd most likely risk in 2018.
In short, the infrastructure was in place; it was assured, infallible. But the attacker was getting more and more powerful! Everything that can be done to ensure a smooth business was done by the IT teams, but it was the external environment that was changing. That was the narrative.
One can only speculate if the vendors had anything to do with the building of this narrative. The business had to be more and more secure. Security vendors came out with most of the research about the threats from the actors, internal or external! In fact, some security companies published more research than research firms and media companies!
Newer technologies like big data, cloud, AI, IoT and security vied with each other to catch the CIO's attention, taking the time-tested approach of research, publicity and continuous narrative.
Attention to maintenance and resilience of basic infrastructure did not have the scope for high value investments. Preparedness without spending big money was not a big commercial proposition for anyone.
It took the back seat.
An acute desire to be seen as business savvy, and to make an impact on business through the use of emerging technologies (the use cases for which the CEOs proactively asked), got the better of CIOs, leading them to somewhat neglect the area (or deprioritize it, if "neglect" sounds too drastic).
The result is there for everyone to see. While only big disruptions get media coverage, we have seen smaller disruptions in services in many areas, mostly because of glitches, not big attacks. But attacks get the press easily.
While the custodians of IT were getting confident about the reliability and resilience of their infrastructure, it was natural that the business leaders also were fairly assured.
Till, of course, the big disruptions shook that sense of comfort!
That is when the global community again began to see "breakdown of critical information infrastructure" as a big risk. From a lowly 22nd position in 2017, it has climbed to become the 6th most impactful global risk in this year's WEF Global Risk Report. From 2017, its perception as a risk has steadily risen in the eyes of the global community, as reflected in its GRR ranking. In 2020, it even overtook cyberattacks as the most impactful technological risk.
If the trend is anything to go by, it may well become one of the top three most impactful global risks by next year. Already, it is being perceived as more impactful than interstate conflicts, infectious diseases (even as the fear of Coronavirus looms large at the time of writing this), food crises and natural disasters.
To be sure, it is still seen not as a very likely risk but as a highly impactful one, like, say, weapons of mass destruction or failure of climate action. While these two are top-of-mind issues for global leaders, as seen from the number of summits and negotiations, even the very custodians of information infrastructure do not attach the kind of importance that protecting it deserves.
What does it mean for the CIOs?
Simple. Protecting information infrastructure is unambiguously the responsibility of IT leaders.
Failure of that infrastructure is their failure. Period.
In the wake of the British Airways outage in May 2017, we had asked CIOs how they saw it.
"(This is) CIO's worst nightmare coming true," said Nilesh Sangoi, the then CEO of Meru Cabs, who handled the CIO job earlier and is now Chief Digital Officer at Fincare.
"Shows unpreparedness in terms of DR and BCP. How can a top airline afford this?" asked Kamal Karnatak, CIO, RJ Corp.
"Unbelievable! No DR preparedness?! Thousands across the world affected due to this silly mishap!!!" was the shocked reaction of Aneesh Nair, then CIO, NDTV Worldwide.
While the surprise was understandable, Darshan Appayanna, then CIO & CISO, Happiest Mind, was more circumspect and realistic.
"Just comes back to the basis of ensuring that there is no single point of failure in your ecosystem," he said.
The rising crisis of confidence in the reliability of critical information infrastructure stems from two factors.
One, of course, is the misplaced confidence of its custodians in the infallibility of this infrastructure.
But that still does not explain why it has been happening in one or two industries, like airlines and stock exchanges.
To answer that question, one must go to the basics, something that every CIO knows but quite a few overlook. That is the difference between Disaster Recovery (DR) and Business Continuity Planning (BCP). Disaster Recovery, in the case of an IT infrastructure breakdown, is a completely technical job.
DR is essentially restoration work. In many industries, like pharmaceuticals or manufacturing, a big outage in IT does impact the operations and, many a time, the financials. But a fast (say 4-5 hours) restoration job minimizes that impact and the world in general does not come to know about it.
That is not the case with mission-critical operations, as in the case of airlines and stock exchanges. Take airlines. Any outage beyond a few minutes completely throws off the schedule because of the huge cascading effect and has a huge impact on the customers. In today's world of social media, the reputational cost is huge, and far bigger than the direct financial loss. Regulatory compliance costs may add to that, as these outages affect individuals.
Good Business Continuity Planning is supposed to minimize or eliminate such a possibility. Incidentally, it should start with examining our earlier hypothesis about the growing confidence in the infallibility of IT infrastructure.
CIO&Leader put this question to CIOs on Twitter.
"With all the advances in technology, is a major business disruption because of failure of IT infra a possibility today?"
"Chances are reduced a lot today compared to the past due to effective real-time DR and replicated cloud set up," was the answer of Anjani Kumar, CIO of pharma company Strides.
"Captive data centers and in-house development always carry the risk of IT infrastructure/applications failure," admitted Keyur Desai, CIO, Essar Ports & Shipping, thus attributing the vulnerability to one specific model of IT delivery.
"If the infra is built on Kubernetes and if the application is built with resiliencies, such downtimes can be avoided," said Chandresh Dedhia, Head – IT, Ascent Health & Wellness.
However, Desai did admit that "glitch is always a possibility." And Dedhia agreed that "there are still instances where even the PaaS providers fail."
And that precisely is the beginning of BCP. It should start with the presumption that "DR is not enough", however ironic and self-deprecating it sounds.
Technology is the first step in a BCP. Then come processes and finally people. Whether the ownership of that planning lies with the Head of Operations, the Head of Risk or the Head of Technology, that ownership includes all that is required outside technology operations.
If CIOs have to be the owner of BCP, they cannot just rely on technology, no matter how many layers of redundancy they have built.
That is not as sexy as AI or Big Data. But that is extremely important, especially for mission-critical businesses.
To bring down "critical information infrastructure breakdown" in the list of most impactful global risks, a good practice of BCP should be the first step. Even that may fail. But that would add two levels beyond technology operations: non-technology-related processes and the right people to make decisions and choose a course of action.
Beyond that is anybody's guess.