This blog posting represents the views of the author, David Fosberry. Those opinions may change over time. They do not constitute an expert legal or financial opinion.

Something Wrong With British Airways’ IT Systems

Posted on 18th May 2017

There have been many reports over the last few days about the chaos caused by British Airways and their IT failure:

  • here, from the BBC, where the CEO of BA says that he will not resign, and blames the failure on a power surge;
  • here, from the BBC, where the consumer group "Which?" urges automatic compensation for all affected passengers;
  • here, from the BBC's Tech Tent, which points out that a power surge is neither an excuse nor an explanation, and that "experts point out that power management is an essential element of any well-planned IT system";
  • and here, also from the BBC, which points out that BA's "Disaster Recovery Plan should have whirred into action."

In the modern world, very many types of business are fully dependent upon IT systems to operate: banking, telecoms, air travel (not just airlines, but also airports, air-traffic control, etc.), and the full gamut of Internet-based businesses. Most of these companies seem to have understood how vital it is, both for them and for their customers, to ensure that their systems are reliable and robust, but it seems that BA "didn't get the memo".

Now, high-reliability (usually called high-availability) systems are something that I know quite a lot about, and the experts quoted by the BBC's Tech Tent are right: a power surge is no excuse, and power management is a vital part of any business-critical system design. To put this into context for those readers who are not familiar with the subject matter, let me describe a typical disaster recovery plan:

  1. Two data centres, with systems clustered so that load is shared between the two sites: if one site fails, its workload is taken over automatically and instantly by the remaining site. Power is supplied to each site by separate feeds from the power grid, backed by UPSes (uninterruptible power supplies) and backup generators at each site. Such a configuration is immune to local power surges and failures in the electricity supply grid; the only impact of the failure of one data centre is some loss of performance (there is a simplified sketch of this automatic failover after the list).
  2. A disaster recovery site, containing a copy of the data from the main data centres, which can take over the load if both main data centres fail. The disaster recovery systems are usually started manually, so there can be a delay of a few minutes before service is restored.
  3. Off-site backups, so that even if all the systems fail, service can usually be restored, with minimal data loss, after a few hours.
  4. A comprehensive disaster recovery plan detailing manual fall-back processes, mobile data centres and even paper-based systems to ensure service continuity.
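To make point 1 more concrete, here is a deliberately simplified sketch, in Python, of automatic failover between two data centres. It is purely illustrative (the site names, health checks and routing are my own invention, not a description of BA's actual systems): requests are shared between the sites, and when one site fails its health check, all traffic goes to the survivor.

    # Illustrative only: a toy model of dual-site failover, not BA's real systems.
    import random

    # Hypothetical site health flags; in reality these would be fed by health
    # checks against clustered servers, power feeds, UPSes and generators.
    SITES = {"datacentre_a": True, "datacentre_b": True}

    def healthy_sites():
        """Return the list of sites currently passing their health checks."""
        return [site for site, ok in SITES.items() if ok]

    def route_request(request_id):
        """Send a request to any healthy site, failing over transparently."""
        candidates = healthy_sites()
        if not candidates:
            # Both main sites down: the disaster recovery site (point 2)
            # would now have to be brought up, usually manually.
            raise RuntimeError("All primary sites down - invoke the DR plan")
        # Load is shared across healthy sites; with one site down, the
        # survivor simply absorbs the whole workload at reduced performance.
        return random.choice(candidates)

    if __name__ == "__main__":
        for i in range(3):
            print(f"request {i} handled by {route_request(i)}")

        SITES["datacentre_a"] = False   # simulate a power failure at one site
        for i in range(3, 6):
            print(f"request {i} handled by {route_request(i)}")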

Of course, none of this is any use if the fall-back systems and processes don't work, which seems to be the case here. When you spend millions on redundant systems and data backups, you have to test that they work. You must test that, when a system or a whole site fails, the load is properly switched to other systems (and that you can put the system back into its normal operating mode once the fault is repaired). You must also test that the software and processes to restore data from backups actually work. It seems likely that BA failed to do this, since their systems stopped working.
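As an example of the second kind of test, here is another small Python sketch (again illustrative, with a placeholder backup path rather than anything BA-specific) of a routine restore check: each backed-up file is restored to a scratch area and its checksum compared with the original, so that unusable backups are discovered during a drill rather than during a disaster.

    # Illustrative only: verify that backups can actually be restored.
    import hashlib
    import shutil
    import tempfile
    from pathlib import Path

    def checksum(path):
        """SHA-256 of a file's contents."""
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def verify_restore(backup_dir):
        """Restore every backed-up file to a scratch area and compare checksums."""
        if not backup_dir.is_dir():
            raise FileNotFoundError(f"No backup found at {backup_dir}")
        with tempfile.TemporaryDirectory() as scratch:
            for source in backup_dir.rglob("*"):
                if not source.is_file():
                    continue
                target = Path(scratch) / source.relative_to(backup_dir)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(source, target)   # stand-in for the real restore tool
                if checksum(source) != checksum(target):
                    print(f"Restore mismatch: {source}")
                    return False
        return True

    if __name__ == "__main__":
        # "/backups/latest" is a placeholder, not a real path.
        result = verify_restore(Path("/backups/latest"))
        print("Restore test passed" if result else "RESTORE TEST FAILED")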

Of course, I do not know whether BA simply failed to put in place a properly reliable set of systems and processes, or whether they at least tried to do so but failed to test that they worked properly in the event of failure. Either way, the outcome is simply unacceptable, and the impact on its customers was major and intolerable. This simply cements BA's position as one of the world's worst airlines; one with a cavalier and irresponsible attitude towards its customers.

BA's CEO said that the "flight disruption had nothing to do with cutting costs". I beg to differ. Clearly, not enough money was spent in building and testing the disaster recovery plan.