One of my family members experienced two days of hassle and chaos last month after Southwest Airlines cancelled 1,150 flights because of a system outage. He was eventually able to travel to his destination, a day later than planned, but the outage cost the airline tens of millions of dollars.
The culprit in Southwest’s case was a faulty router that took down the airline’s reservation system. This week, a power outage caused a worldwide system failure for Delta Air Lines, grounding more than 1,800 flights and causing countless delays and complications that could have been easily avoided with a proper power backup and protection plan.
Tens of thousands of Delta passengers were stranded or inconvenienced by the outage, which was caused by a power failure at the company’s Atlanta headquarters. The outage, which lasted six hours, took down check-in systems and airport screens along with Delta’s website and smartphone apps. Delta identified the point of failure as a critical power control module that malfunctioned, causing a surge to the transformer and a loss of power.
Power was quickly restored after the malfunction, but critical systems and network equipment failed to switch over to backups, even though other, less critical systems made the switch. Delta CEO Ed Bastian said the company has invested hundreds of millions of dollars in technology infrastructure upgrades and backup systems over the past three years, investments that should have prevented such a catastrophic outage.
“I'm sorry that it happened and I don't have the final analysis of what caused the outage,” Bastian told customers in a video message. “We did have a redundant backup power source in place. Unfortunately some of our core systems and key systems did not kick over to the backup power source when we lost power and, as a consequence of that, it caused our entire system effectively to crash and we had to reboot and start the operation up from scratch.”
Preventing these types of disasters comes down to appropriate planning and routine backup and disaster recovery exercises, according to Mark Jaggers, data center recovery and continuity analyst at Gartner. He said data centers require redundant power from a grid and redundant networking from a service provider, both independent of primary power and networking resources, to operate effectively in the event of a failure.
“Planning and executing disaster recovery exercises is something that should be done on a regular basis to find out these issues before they may be impactful,” said Jaggers. “The issue, which was also the case with Southwest Airlines, is not planning for partial failure scenarios that are harder to get to the root cause of and work around.”
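To make the idea of a partial-failure drill concrete, here is a minimal Python sketch. It is an illustration only: the `System` class, the `failover_ok` flag, and the inventory entries are hypothetical and do not describe Delta's or Southwest's actual infrastructure. The point it tries to show is that having a backup is not enough; a system survives a simulated loss of primary power only if its switch-over also works, which is the kind of gap a regular exercise is meant to expose.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    critical: bool
    has_backup: bool    # a backup power feed is provisioned (assumed field)
    failover_ok: bool   # the automatic switch-over actually works (assumed field)


def run_power_drill(systems):
    """Simulate loss of primary power; return names of systems that go down."""
    down = []
    for s in systems:
        # A system survives only if a backup exists AND the switch-over works.
        if not (s.has_backup and s.failover_ok):
            down.append(s.name)
    return down


def critical_failures(systems):
    """Return names of critical systems that would not survive the drill."""
    down = set(run_power_drill(systems))
    return [s.name for s in systems if s.critical and s.name in down]


# Hypothetical inventory, for illustration only.
inventory = [
    System("check-in", critical=True, has_backup=True, failover_ok=False),
    System("website", critical=True, has_backup=True, failover_ok=True),
    System("office-printers", critical=False, has_backup=False, failover_ok=False),
]

if __name__ == "__main__":
    print("Down after drill:", run_power_drill(inventory))
    print("Critical failures:", critical_failures(inventory))
```

In this made-up inventory, the check-in system has a backup but still fails the drill because its switch-over does not work, a partial failure that a simple "do we have backups?" audit would miss.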
Another important element of planning is to have a primary data center in one location, with a backup or alternate center far enough away that it doesn’t experience the same disaster or failure. Companies also need to ensure their data center staff, whether internal or outsourced, are competent and well versed in disaster recovery.
“In today's world, the business expectation is that you're up and running quickly after a disaster,” said Roberta Witty, risk and security management analyst at Gartner. “The 'always on' driver is changing the way organizations deliver IT in general, and so they are building out their data centers to be more resilient.”
She added that crisis management practices need to be exercised on a quarterly basis to pinpoint problem areas and ensure backup and recovery measures are operating properly. Companies should also make backup and disaster recovery a part of every new project, without exception.