Premise
This research explores the event triggers and root causes of IT disasters. Event triggers precipitate an initial event or anomaly; root causes are the reasons the initial event becomes a disaster, and they usually relate to an inability to recover from the event in a reasonable length of time. One of the major reasons recovery takes so long, and thus creates an IT disaster, is loss of data-in-flight.
Introduction to IT Disasters
The assertions in this research are based on a collaboration between Wikibon and Alex Winokur, Entrepreneur & CTO of Axxana. Our collective experience in analyzing IT disasters and related scenarios spans many decades and includes many examples that have been brought to our attention.
IT disasters rarely happen as one would expect. On the contrary, almost every IT disaster scenario is a living example of Murphy’s law: if something can go wrong, it will, and in the worst possible way.
An example of mission-critical IT is an Oracle database environment. It is common for a database server and an Oracle Far Sync server – or a database server and its local standby – to be placed in different areas of the same data center. This should ensure that in the case of a local disaster at least one of the two survives. In real-life situations, however, it is almost certain that if an IT disaster strikes, both will be down.
Setting the IT Disasters Scene
Triggers for IT Disasters
The first aspect of IT disasters is understanding the event triggers. The typical and more frequent threats to normal data center operations include the following:
- Water-related events. Typically floods and water leakage.
- Power-related events. Short circuits, lightning strikes, and central power outages.
- Cooling system malfunctions. These typically manifest as temperatures rising beyond acceptable levels, but in many cases also as water leakage.
- Local fires. Typically caused by overheating, short circuits, or faulty electronic components.
- Human error. Humans make mistakes, which are sometimes costly. Examples range from an accidental emergency power shutoff to accidental triggering of the fire-extinguishing system – the possibilities here are essentially endless.
- Software failures. Programmers make mistakes.
- In-data-center and external communication failures.
Prevention Processes for IT Disasters
The second aspect of IT disasters is the set of prevention measures that an organization builds into its data center disaster prevention methodology. In many cases these measures are only tested – or can only be tested – in real IT disaster scenarios, at which point it becomes apparent that they are inadequate or do not function properly.
Here are some examples:
- UPS. In many cases, only after a power failure does one discover that a UPS battery is depleted or that the load is not correctly balanced across UPS units. As a result, one of the UPS units is overloaded and fails, bringing down the systems connected to it.
- Fire isolation protection. Only in a real fire can one find out how effective the isolation is. If even one water pipe, cable tunnel, or cooling duct is not well isolated, it is highly likely that fire or unacceptably high temperatures will spread to adjacent rooms.
- Fire extinguishing system. Until it is tried in a real-world scenario, one will not know whether a gas leak or other malfunction has rendered the system ineffective.
- False redundancy. Due to configuration errors, redundant systems are often not really redundant. In our experience, this typically applies to communication switches, SAN directors, and storage systems. Only when one of these systems fails does an organization discover that the high-availability configuration is incorrect, and an application outage occurs.
Local Event Triggers for IT Disasters
We have used the term “event triggers” rather than “root cause” in the title of this section. This is an important distinction. Looking outside IT, the event trigger for a forest fire could be a lightning strike; the root cause or causes of the ensuing devastating forest fire will probably be factors such as failure to create firebreaks, clear undergrowth, and so on. Most IT disasters are a combination of seemingly unimportant triggers and a number of root causes that interact to cause the disaster.
By local IT disasters we mean IT disasters that occur on-premises in the data center only. The following are the four major event triggers for on-premises IT disasters.
- Human error or human ineptitude
- Equipment malfunctions
- Software failures
- Unexpected occurrences
Here are some categories and examples.
Fires
- Leaving some flammable substance (various cleaning liquids or paper in many cases) in an overheated or poorly ventilated area.
- Spilling liquid (often drinks) on electrically sensitive equipment, creating a short circuit that ends in a fire, a major power outage, or both, and in many cases damage to electronic equipment (a switchboard, for example). In one case we have seen, coolant was leaking from the ceiling; one employee placed a bucket on a ladder to catch it, and another accidentally knocked the half-full bucket onto a storage system, which short-circuited and caught fire.
- Faulty electrical or electronic components – in particular UPS batteries and the power supplies in battery-backed electronic equipment (storage systems, for example).
Water Leakage
- A broken seal in the water-cooling system or in the regular water supply system.
- Water penetrating the data center during extreme weather (rain or snow), often through cracks in the concrete caused by very low temperatures.
- A sewer overflowing, in many cases because of human incompetence – someone blocked it by disposing of obstructing materials down the drain.
Power outages
In many cases, a return to normal operation following a power outage takes a long time because of lengthy repair and recovery work. Sometimes the first sign of a fire caused by a short circuit is a power outage. A disorderly shutdown of various data center components can make fast recovery of many devices and applications impossible. In all these cases, failover to a remote site is necessary.
- In many cases, power outages are caused by short circuits resulting from a fire, lightning, water leakage, the attachment of faulty equipment (a faulty fan brought in for temporary heat distribution, for example), or some other electrical component failure.
- Accidental central power emergency shutoff. This is one of the more common user errors.
- Maintenance of a faulty UPS, such as changing a battery or replacing an electronic card. A technician means to switch the faulty UPS off but, as in many such cases, erroneously switches off the wrong one, causing a power outage.
Communications outages
Configuration errors are a frequent cause of outages. Communications are partially lost when a switch malfunctions, and only then is it discovered that the supposedly redundant switch is not connected because of false redundancy (i.e., human error). The time to correct such a problem is often long and, in many cases, further erroneous reconfigurations cause additional communications outages. These situations require a failover of at least some of the applications to a remote data center.
Software Failures
Software failures have led to many, many disasters. One recent and perhaps most disconcerting example was the flight-control software failure that contributed to the crashes of two Boeing 737 MAX aircraft.
Regional Event Triggers for IT Disasters
The effect of regional disasters on data centers is well known. We provide here only a brief description, for the sake of completeness. Regional IT disasters are typically a result of one or more of the following:
- Extreme weather conditions. Examples include strong winds, extremely low temperatures, massive rains, and floods.
- Earthquakes. These may occur in totally unexpected areas, such as the August 2011 earthquake felt in Washington, DC. An earthquake does not have to be high in magnitude to cause massive disk drive failures.
- Fires. These cause collateral damage. For example, when a regional fire strikes, electricity to the area is almost certainly cut off. In many cases operators then discover that some UPS units malfunction, that the alternate power source is not well balanced, and that bringing the data center back online is impossible.
- Regional infrastructure failures of communication, water, and power. These typically occur as a consequence of the events above, but in many cases they result from a local disaster at the infrastructure service provider.
The Surprise Factors in IT Disasters
From our observations, the information we have received, and the analysis we have performed, we find that the outcome of a disaster usually comes as a complete surprise to the organization. Operators have a particular disaster model in mind, while in real life IT disasters unfold differently. Some examples will make this clear:
False Redundancy
The practice of placing redundant servers in two separate locations in a data center, to protect them against water leakage, local fires, and so on, will fail if the water leakage or fire occurs in an area that hosts one of the data center’s central switchboards or power distribution cabinets. In such a case, the equipment in both locations will survive, but it will be without power, without communications, or both. It is very difficult (and should be assumed impossible) to make sure that no single area of the data center hosts resources (power lines, communication lines, or cooling system pipes) common to the two different locations. In many such cases a failover to the alternate remote site is required because the time to repair is too long.
False confidence in system redundancy. People forget about the cascading failure effect that is so common in software, communications, and power systems. For example, data center personnel are confident that they will not suffer a communications outage because they get communication services from two independent vendors. They then discover that when one vendor fails, the other fails too, overloaded by the traffic congestion created as everybody in the area fails over to it at the same time. The same phenomenon can occur locally when a communications switch or a UPS fails, bringing down the whole communication or power system.
Spreading Fire
Fire in one location may have significant effects on other locations for the following reasons:
- Cooling pipes conduct heat from the fire to other locations, to the point where systems there shut down.
- Failure of the overall cooling system.
- An emergency cutoff of all power, including the UPSs. This is common practice when any fire is detected.
Unimagined
Some scenarios of IT disasters are completely unforeseen.
- In one case rats gnawed through power cables, creating a short circuit accompanied by fire.
- A test of a gas-based fire extinguishing system generated an acoustic shock wave that crashed most of the hard disk drives.
- A careless technician knocks into critical racks when bringing in new equipment. Experience shows that it is always the rack with the gravest impact on disaster recovery.
The Importance of Loss of Data-in-Flight in IT Disasters
Defining Loss of Data-in-Flight
We have been evaluating IT disaster scenarios, disaster risk mitigation and the role of technology for more than thirty years. Digital initiatives are dramatically increasing the value of data and the need to protect it. The systems that support business processes are increasingly interconnected. For example, under the new GDPR rules the failure of a simple financial monitoring report can force closure of trading with huge potential financial losses.
A careful examination of the disaster scenarios above demonstrates that fail-over to a distant location is an essential component of a disaster recovery strategy. However, the biggest inhibitor to initiating fail-over is uncertainty about data integrity between the two sites. The reason for this uncertainty is incomplete knowledge about data-in-flight between the production site and the recovery site.
Because the speed of light is finite, a production system replicating over distance cannot wait for an acknowledgement of successful data transmission from the recovery site before committing. As a result, when a disaster occurs, the production system may have committed data changes that may or may not have arrived at the recovery site. This is loss of data-in-flight.
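To make this concrete, below is a minimal sketch of asynchronous replication (not any vendor’s implementation): the primary acknowledges each committed change immediately and ships it to the recovery site in the background, so any change still queued or in transit when disaster strikes is lost data-in-flight. All names and the 50 ms latency figure are hypothetical assumptions.

```python
import queue
import threading
import time

# Hypothetical sketch: asynchronous replication with a background shipper.
# The primary acknowledges commits immediately; the recovery site receives
# them later. Anything still queued or in transit when the primary fails
# is "data-in-flight" and is lost.

ONE_WAY_LATENCY_S = 0.05   # assumed 50 ms one-way network latency

committed = []             # writes the application believes are safe
received_at_recovery = []  # writes that actually reached the recovery site
in_flight = queue.Queue()  # replication queue on the primary
primary_alive = True

def shipper():
    """Background thread that ships committed writes to the recovery site."""
    while primary_alive or not in_flight.empty():
        try:
            change = in_flight.get(timeout=0.01)
        except queue.Empty:
            continue
        if not primary_alive:
            break                      # the primary is gone; shipping stops
        time.sleep(ONE_WAY_LATENCY_S)  # network transit time
        received_at_recovery.append(change)

threading.Thread(target=shipper, daemon=True).start()

# The application commits 1,000 changes; each is acknowledged immediately.
for i in range(1000):
    committed.append(i)
    in_flight.put(i)

time.sleep(0.2)        # some replication happens...
primary_alive = False  # ...then disaster strikes at the production site
time.sleep(0.1)

lost = len(committed) - len(received_at_recovery)
print(f"committed at primary: {len(committed)}")
print(f"arrived at recovery:  {len(received_at_recovery)}")
print(f"data-in-flight lost:  {lost}")
```

Running the sketch shows the primary believing all 1,000 changes are safe while only a handful have reached the recovery site; the gap is the data-in-flight uncertainty described above.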
As a result of this uncertainty, best practice is to attempt to recover on the production site. The lack of certainty also means a subsequent decision to recover from the recovery site is much more difficult and takes far longer to execute. Highly skilled operators with years of experience are required to make this work, and even then it is an inherently unreliable process. If senior executives exert business pressure to recover quickly, the process becomes even more unreliable.
Protecting against Loss of Data-in-Flight
Disaster recovery engineers have always understood how critical and valuable a solution with zero loss of data-in-flight would be. There have been many attempts to develop software and hardware solutions to the loss of data-in-flight problem. EMC, Hitachi, IBM, Oracle, and many others have marketed solutions that address the data-in-flight challenge. All of them certainly reduce the probability of losing data-in-flight, but none completely removes the uncertainty that some data-in-flight has been lost.
The alternative approach is not to rely completely on sending the data off-site, but also to harden the data-in-flight on-site for a period of time. This could be a device that is immune to fire and heat, does not depend on the data center’s power and communications systems, is water, vibration, and shock resistant, and can survive an earthquake and most other disasters. It holds the data-in-flight at the production site until the recovery site acknowledges receipt, and it has sufficient capacity to support rolling disasters. The data-in-flight is recoverable in several ways: by direct connection, by a separate network, and/or by cellular transmission.
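A minimal sketch of this idea follows, assuming a hypothetical hardened local device: every change is written synchronously to the hardened buffer before it is acknowledged, shipped asynchronously to the recovery site, and only purged from the buffer once the recovery site acknowledges receipt. After a disaster, the surviving buffer contents are exactly the data-in-flight needed to complete the recovery copy. All class and method names are illustrative, not a real product API.

```python
from collections import OrderedDict

class HardenedBuffer:
    """Hypothetical disaster-proof on-site device holding data-in-flight."""

    def __init__(self):
        self._pending = OrderedDict()  # sequence number -> change payload

    def persist(self, seq, change):
        # In a real device this write would survive fire, flood, and power loss.
        self._pending[seq] = change

    def release(self, seq):
        # The recovery site has acknowledged this change; it is no longer in flight.
        self._pending.pop(seq, None)

    def surviving_changes(self):
        # After a disaster these are retrieved (e.g. over a cellular link)
        # and applied at the recovery site to close the data-in-flight gap.
        return list(self._pending.items())


class Primary:
    """Production system: harden locally, then replicate asynchronously."""

    def __init__(self, buffer):
        self.buffer = buffer
        self.seq = 0

    def commit(self, change):
        self.seq += 1
        self.buffer.persist(self.seq, change)  # synchronous, local, fast
        self.ship_async(self.seq, change)      # asynchronous, remote, slow
        return self.seq                        # safe to acknowledge now

    def ship_async(self, seq, change):
        # Placeholder for the normal replication path to the recovery site.
        # When the remote acknowledgement eventually arrives, the caller
        # invokes self.buffer.release(seq).
        pass


# Usage sketch: three commits, one remote acknowledgement, then disaster.
buffer = HardenedBuffer()
primary = Primary(buffer)
for change in ["debit A 100", "credit B 100", "update balance"]:
    primary.commit(change)
buffer.release(1)  # only the first change was acknowledged by the recovery site
print("Data-in-flight recoverable from the hardened buffer:")
for seq, change in buffer.surviving_changes():
    print(f"  seq {seq}: {change}")
```

The design point is that the acknowledgement to the application depends only on the fast local hardened write, so performance is preserved while the uncertainty about data-in-flight is removed.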
Wikibon reminds practitioners that there is no such thing as a Recovery Time Objective (RTO) of zero or a Recovery Point Objective (RPO) of zero. Wikibon’s research concludes that a combination of hardened on-site data together with best-of-breed software and hardware for remote replication of key data can achieve the lowest RPO, and therefore the lowest RTO. Oracle’s sophisticated distributed database recovery system, including the latest Far Sync functionality, paired with Axxana’s Phoenix System for Oracle Database, is an example of this configuration. It provides the building blocks for ultra-low RPOs and RTOs. Wikibon’s research shows that Axxana is the only currently available solution for hardened on-site data protection.
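As a rough, purely illustrative estimate of why an RPO of zero is unattainable with remote replication alone: the data-in-flight at risk at any instant is approximately the rate at which the primary generates changes multiplied by the replication lag. The figures below are assumptions, not measurements.

```python
# Hypothetical back-of-the-envelope estimate of data-in-flight exposure.
redo_rate_mb_per_s = 50   # assumed redo/log generation rate at the primary
replication_lag_s = 0.5   # assumed average replication lag (network + buffering)

exposure_mb = redo_rate_mb_per_s * replication_lag_s
print(f"Data-in-flight at risk at any instant: ~{exposure_mb:.0f} MB")
# ~25 MB of committed changes could be lost if the primary site fails now.
```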
Video on Hardening Data-in-Flight
SiliconANGLE co-founder Dave Vellante (@dvellante) and Infinidat CTO Doc D’Errico (@docderrico) talk about InfiniSync, which is based on the Axxana Phoenix “bunker within the data center” technology that Infinidat gained when it acquired Axxana.
Deliberate IT Disasters
The previous sections have assumed the good intentions of the people working in IT, whether as consultants, in the lines of business, or elsewhere. Another category of disaster comes from deliberate attempts to create disasters by outside or inside hackers.
Cyber Attacks
Cyber attacks are a very important category of potential disaster. Reports document many dramatic cyber attacks. Below are three examples.
- Equifax was the victim of a cyber attack in 2017, affecting the highly personal data of 145 million people in the USA. IT failed to apply a patch to an open-source web application framework, which allowed the breach. Moody’s is planning to include the risk of a business-critical cyber attack as part of its credit ratings, and recently cited the cyber attack as a factor in downgrading Equifax. Moody’s estimates the cost of the Equifax cyber attack at over $1 billion over 3 years, as reported by Forbes.
- There are continuous cyber attacks on cryptocurrencies that use blockchain technology. One example is a recent attack on Ethereum, which has championed compliance and working with government agencies. Cyber attacks are no respecter of good intentions, and the Ethereum cyber attack cost many millions, as well as loss of face and brand image.
- One potentially catastrophic loss could occur if cyber attackers breach the encryption metadata and keys within a large company or government department. To our knowledge, this has not happened yet, but it may be only a matter of time.
Artificial Intelligence
Artificial intelligence is a potential solution to cyber attacks, and can reduce the attack surface. It certainly has a place protecting against some attacks. However, in the evolutionary arms race of black hats against white hats, the black hats have the major advantages of surprise, high rewards, and endless experimentation and patience. The black hats only have to win once, and they will also be using AI.
AI is part of a solution alongside many other tools. These include a strong security culture, tested operational procedures, collaboration with other organizations, an approach of continuous improvement and testing of disaster scenarios, and air gaps.
Air Gaps
OT practitioners have always stressed physical air gaps in safety design, whether in ships, nuclear power stations, or anything else. IT’s default solution is a common communication service allowing everything to talk to everything. OT’s usual response is unprintable.
A second location is one physical air gap. A second physical copy of data, applications, and log files should be another. Physical air gaps have an important place in ensuring there are “doomsday” copies. These are original raw copies of data and log files held in another location with no logical connections to the original systems. Continuous testing of recovery from these copies is also fundamental.
Conclusions on IT Disasters
Event triggers, together with a large number of low-probability conditions, precipitate IT disasters. There are usually many low-probability root causes that contribute to the initial event becoming a disaster.
A disaster at one site requires the ability to recover at a remote location, a physical air gap away from the primary site. A major challenge for remote recovery is ensuring that data-in-flight is complete and consistent. Without this confidence, operators try to recover on the original site. There is technology that can protect and recover data-in-flight under almost all circumstances, allowing operators and executives to make definitive decisions about recovery very quickly.
Ensuring zero loss of data-in-flight enables organizations to fail over and fail back frequently. Testing the fail-over process should become a regular event that evaluates and improves disaster recovery processes and readiness.
Deliberate outside cyber attacks are a major and growing category of disaster. The cost of these breaches is staggering, and can threaten the existence of large corporations. AI can be of help to both defenders and attackers. IT should deploy a broad range of software and hardware tools, culture changes, recovery testing, and physical air gaps.
After a cyber attack, there must also be “doomsday” copies: original raw copies of data and log files held in another location with a true physical air gap and no logical connections to the original systems. Continuous testing of recovery from these copies is also fundamental.
Action Item
Humans are weak at assessing the impact of low probability events. A sober assessment of the risk of IT disasters is essential.
Wikibon suggests some important questions for CXOs to ask:
- Do the recovery procedures allow fail-over and fail-back (continuous testing is very difficult without it)?
- Can data in flight be recovered in all circumstances?
- Does a function outside of IT fully test the recovery procedures at least once a month?
- Are the recovery procedures fully documented and updated after each recovery test?
- Is there a program in place to reduce the number of steps in recovery and lower the probability of failure?
- Are there doomsday copies of data, applications, and log files, together with a copy of any encryption keys, in another location with no logical connection?
- Is there frequent testing of recovery from a doomsday copy?
Action Items for Practitioners
Please send us any additional event triggers, risk areas, root causes, and other factors that have led to disasters. Please suggest additional questions for CXOs to ask. We will update this document and give attribution or anonymity as requested. The email address is David.Floyer@Wikibon.org.
Footnotes
The following Wikibon research covers other quantification of downtime, impact of disasters, and probability of disasters.
Halving Downtime Costs for Mission Critical Apps
Synopsis: The Wikibon research entitled “Halving Downtime Costs for Mission Critical Apps” shows that traditional backup and recovery systems (which are mainly storage-based), and the processes and in-house automation around them, are complex and error-prone, especially for database recovery. Wikibon recommends setting a policy that all new application projects should adopt application-led backup and recovery processes. Existing applications should migrate to an application-led backup and recovery architecture over time.
We believe database application and file-system suppliers will develop specialized and more integrated (i.e., end-to-end) data protection solutions. Cloud service providers such as AWS and Microsoft Azure are also investing in end-to-end architectures. Wikibon believes that application-led end-to-end architectures for backup and recovery are a prerequisite for digital transformation projects.
Hardening Data-in-Flight Reduces IT Disasters
Synopsis: The Wikibon research entitled “Hardening Data-in-Flight Reduces IT Disasters” shows that the cost of downtime due to data loss and unplanned outages at Global 2000 companies ranges from 5% to 8% of revenue. Digital initiatives and the increasing value and importance of data elevate the imperative to address data loss. Organizations should focus especially on vulnerabilities in mission-critical systems, with a specific emphasis on eliminating the loss of data-in-flight. Doing so can cut the cost of downtime in half over a four-year period.
This research also discusses how zero loss of data-in-flight leads to certainty of data integrity, which in turn can simplify recovery and significantly reduce the chances of disasters.
Appendix A in this research gives a detailed analysis of the probability of IT disasters.