The catastrophic systems failure that grounded British Airways flights for a day appears to have been caused by networking hardware failing to cope with a power surge and messaging systems failing as a result. The Register has asked BA's press office to detail what went wrong, what equipment failed, what disaster recovery …

COMMENTS

Tuesday 30th May 2017 06:05 GMT jerehada

So by messaging he means some sort of enterprise service bus was taken down? Also, power surge protection is normal, so what went wrong with the switchover to redundant power? The hints from this, a bit like the TalkTalk hack, suggest something very simple and not some unavoidable impossible to understand failure they would like to media-steer us towards.

30 0 Reply
1. Tuesday 30th May 2017 06:18 GMT wyatt
  
  I'm looking forward to seeing the RCA, if we ever do. So many businesses have single points of failure, even within HA systems. I realise that backup systems have been mentioned but if they don't work or can't be brought online (ever tested?), has someone been telling porkie pies?
  
  13 0 Reply
  1. Tuesday 30th May 2017 12:16 GMT Anonymous Coward
    
    "I realise that backup systems have been mentioned"
    
    I used to work for a company, large company, that provided DR services. The vast majority of companies treat DR as a compliance checkbox. They buy some DR services so they can say they have DR services... but in the event of a primary data center loss, there really is only the rough outline of a plan. Basically their data, or most of it, is in some alternative site and they may have the rest of their gear there too or not. There is rarely anything resembling a real time switch over from site A to site B in case of a disaster in which their entire stack(s) would come up without any manual intervention at site B. Mainly because architectures are a hodge podge of stuff which has collected over the years. Many companies never rewrite or modernize anything, meaning much of the environment is legacy with legacy HA/DR tools... and there is sparse automation.
    
    8 0 Reply
    1. Tuesday 30th May 2017 13:12 GMT wheelybird
      
      There's a difference between disaster recovery and high-availability (though they do overlap).
      
      It's perfectly reasonable that disaster recovery is a manual fail-over process. Fully resilient systems over two geographically separated locations can be hard and expensive to implement for smaller companies with not much in the way of a budget or resources, and so you have to compromise your expectations for DR.
      
      Even if failing-over can be automated, there might be a high cost in failing-back afterwards, and so you might actually prefer the site to be down for a short while instead of kicking in the DR procedures; it works out cheaper and avoids complications with restoring the primary site from the DR site.
      
      Not every company runs a service that absolutely needs to be up 24/7.
      
      A lot of people designing the DR infrastructure will be limited by the (often poor) choices of technology made by the people that wrote the in-house stuff.
      
      As an example, replicating your MySQL database between two datacentres is more complicated than most people would expect. Do you do normal replication and risk that the slave has lagged behind the master at the point of failure, losing data? Or use synchronous replication like Galera at the cost of a big latency hit to the cluster, slowing it right down?
      
      If it's normal replication, do you risk master-master so that it's easy to fail-back, with the caveat that master-master is generally frowned upon for good reasons?
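      The trade-off described above can be sketched with a toy model. This is only an illustration under stated assumptions, not how MySQL or Galera actually behave internally: the `Primary` and `Replica` classes, the `link_latency_ms` figure and the `ship` step are all hypothetical names invented for the example. Asynchronous mode commits immediately but can lose whatever has not yet been shipped when the primary dies; synchronous mode loses nothing but pays an inter-site round trip on every write.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    log: list = field(default_factory=list)


@dataclass
class Primary:
    replica: Replica
    synchronous: bool
    link_latency_ms: float  # hypothetical one-way latency between sites
    log: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # committed locally, not yet shipped

    def write(self, row) -> float:
        """Commit a row and return the extra commit latency (ms) the client sees."""
        self.log.append(row)
        if self.synchronous:
            # Wait for the replica to acknowledge: one full round trip.
            self.replica.log.append(row)
            return 2 * self.link_latency_ms
        # Asynchronous: acknowledge at once, ship later.
        self.pending.append(row)
        return 0.0

    def ship(self, n: int) -> None:
        """Ship up to n pending rows to the replica (the lagging background step)."""
        self.replica.log.extend(self.pending[:n])
        del self.pending[:n]


def rows_lost_on_failover(primary: Primary) -> int:
    """Rows the replica is missing if the primary site dies right now."""
    return len(primary.log) - len(primary.replica.log)


# Ten writes each; the asynchronous replica has only caught up on seven.
async_primary = Primary(Replica(), synchronous=False, link_latency_ms=5.0)
async_latency = sum(async_primary.write(i) for i in range(10))
async_primary.ship(7)

sync_primary = Primary(Replica(), synchronous=True, link_latency_ms=5.0)
sync_latency = sum(sync_primary.write(i) for i in range(10))

print(rows_lost_on_failover(async_primary), async_latency)
print(rows_lost_on_failover(sync_primary), sync_latency)
```

      The numbers are made up; the point is only that the two modes move the cost from "data lost at failover" to "latency on every commit", which is the choice the comment describes.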
      
      I think it's disingenuous to berate people for implementing something that can be very difficult to implement.
      
      Though of course, large companies with lots of money and lots to lose by being down (like BA) have no excuses.
      
      7 0 Reply
    2. Tuesday 30th May 2017 14:22 GMT Anonymous Coward
      
      That's scary
      
      That is our DR to a tee, so glad I'm not the boss
      
      anon for obvious reasons
      
      1 0 Reply
  2. Tuesday 30th May 2017 16:58 GMT NoneSuch
    
    You can't test everything all the time.
    
    Stuff happens and often the unimagined causes grief.
    
    Redundancies are only guaranteed when they come from HR.
    
    1 0 Reply
2. Tuesday 30th May 2017 06:26 GMT Anonymous Coward
  
  suggest something very simple and not some unavoidable impossible to understand failure
  
  The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer. But it is his fault, all of it, in that capacity. The total and absolute failure of everything is clearly a series of multiple failures, and he (and BA) are trying to control the message as though that denies the reality of this catastrophe. He should be fired for his poor communication and poor leadership if nothing else. But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.
  
  Looking around, press comment reckons that it'll be two weeks before all flight operational impacts are worked out (crews, aircraft in the wrong place at the wrong time, passenger failures made as good as they can), and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?
  
  60 0 Reply
  1. Tuesday 30th May 2017 06:51 GMT Anonymous Coward
    
    But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.
    
    Whatever you might think about his performance during this unmitigated balls-up, there's much more relevant experience in his biography than just running a "tiddly low cost airline".
    
    6 0 Reply
    1. Tuesday 30th May 2017 07:27 GMT Dan 55
      
      I don't know who you're trying to convince, but it's not me. Neither Clickair nor Vueling have or had stellar reputations, Sabre's had its outages, and the less said about US airports the better.
      
      5 0 Reply
    2. Wednesday 31st May 2017 04:54 GMT dieseltaylor
      
      His CEO experience is from a minor airline, so accept that fact. His previous experience reads well but then every exec I know of makes sure it does. : )
      
      Should he jump? Probably not but some people somewhere must be guilty of hiding, or not implementing, necessary IT improvements.
      
      0 2 Reply
  2. Tuesday 30th May 2017 07:16 GMT Bloodbeastterror
    
    "I wonder if that will affect his bonus?"
    
    Ha ha ha ha... Of course not. After the attainment of a certain pay grade "reward for failure" kicks in. Only the actual workers enjoy "reward for success". Sometimes.
    
    22 0 Reply
    1. Tuesday 30th May 2017 14:25 GMT Antron Argaiv
      
      Re: "I wonder if that will affect his bonus?"
      
      A former boss referred to it as "f*ck up and move up".
      
      Though, admittedly, a change of employer is sometimes implied.
      
      1 0 Reply
    2. Tuesday 30th May 2017 17:11 GMT Anonymous Coward
      
      Re: "I wonder if that will affect his bonus?"
      
      He didn't get one last year
      
      "Alex Cruz, the Spanish CEO of British Airways, will not receive a bonus for 2016 from the IAG airlines group. The company said in a statement to the National Stock Market Commission that he will be the only one of the 12 senior executives not to receive a bonus. "
      
      0 0 Reply
      1. Wednesday 31st May 2017 06:25 GMT John Smith 19
        
        "he will be the only one of the 12 senior executives not to receive a bonus."
        
        Which suggests he has been trying extra hard to get one.
        
        And look what his efforts have produced.....
        
        I think he's going to be on the corporate naughty step again.
        
        IT.
        
        It's trickier than it looks in the commercials.
        
        2 0 Reply
  3. Tuesday 30th May 2017 07:19 GMT Anonymous Coward
    
    The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer.
    
    IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year. But of course, Cruz has fully supported all the rounds of cuts that have been made.
    
    It smells like a store-and-forward messaging system from the dawn of the mainframe age
    
    JMS-based ESB.
    
    Ex BA AC
    
    29 0 Reply
    1. Tuesday 30th May 2017 09:45 GMT Anonymous Coward
      
      But you would think that something as critical as an ESB in BA would mean that they have built it with high availability in an active/active configuration with plenty of spare capacity built in with nodes in different locations and on different power supplies. And of course ensuring that the underlying data network has similar high availability.
      
      Otherwise you have just built in a single point of failure to your whole enterprise and as Murphy's law tells us - if it can go wrong then it will go wrong and usually at the most inopportune moment.
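      As a rough illustration of the point above, here is a minimal sketch of how one might check a node layout for shared dependencies. Everything in it is an assumption made up for the example (the `Node` tuple, the node names, the sites and the power feeds); it says nothing about how BA's systems are actually laid out. It simply flags any site or power feed whose loss would take out every node at once.

```python
from collections import namedtuple

# Hypothetical description of where each messaging node lives.
Node = namedtuple("Node", ["name", "site", "power_feed"])


def single_points_of_failure(nodes):
    """Return the set of (kind, value) pairs whose loss leaves no node running."""
    spofs = set()
    for kind in ("site", "power_feed"):
        for value in {getattr(n, kind) for n in nodes}:
            survivors = [n for n in nodes if getattr(n, kind) != value]
            if not survivors:
                spofs.add((kind, value))
    return spofs


# Two sites, but both nodes hang off the same power feed.
shared_feed = [
    Node("esb-1", "dc-a", "feed-1"),
    Node("esb-2", "dc-b", "feed-1"),
]

# Two sites and two independent feeds.
separated = [
    Node("esb-1", "dc-a", "feed-1"),
    Node("esb-2", "dc-b", "feed-2"),
]

print(single_points_of_failure(shared_feed))
print(single_points_of_failure(separated))
```

      The first layout looks redundant on a diagram but still reports the shared feed; the second reports nothing. A real audit would also have to cover the data network, storage and anything else the nodes have in common.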
      
      6 0 Reply
      1. Tuesday 30th May 2017 10:04 GMT Norman Nescio
        
        ESB?
        
        "...something as critical as an ESB in BA would mean that they have built it with high availability in an active/active configuration with plenty of spare capacity..."
        
        They probably thought they could just pop down the road to the brewery in Chiswick and get a refill there.
        
        13 0 Reply
        
        Tuesday 30th May 2017 11:24 GMT Tom Paine
        
        Re: ESB?
        
        Funny, I thought Fuller's had closed that site and moved to an industrial estate in Maidstone or Nuneaton or something -- but I was completely wrong: https://www.fullers.co.uk/brewery
        
        Doesn't it look nice? Mmmm... ale...
        
        7 1 Reply
        
        Tuesday 30th May 2017 13:33 GMT Simon Harris
        
        Re: ESB?
        
        "Funny, I thought Fuller's had closed that site ... but I was completely wrong: https://www.fullers.co.uk/brewery
        
        Doesn't it look nice? Mmmm... ale..."
        
        They do an excellent brewery tour with a tasting session in their bar/museum afterwards :)
        
        5 0 Reply
        
        Wednesday 31st May 2017 12:06 GMT Bigbird3141
        
        Re: ESB?
        
        Think you're confusing it with Young's - the Wandsworth-based brewer Fullers bought and closed and redeveloped the site of.
        
        0 0 Reply
        
        Thursday 1st June 2017 16:43 GMT CH in CT20
        
        Re: ESB?
        
        Ahem. You seem to be confusing Fuller's with Charles Wells, the company which brews Young's beers in Bedford.
        
        0 0 Reply
        
        Tuesday 30th May 2017 15:12 GMT Doctor Syntax
        
        Re: ESB?
        
        "They probably thought they could just pop down the road to the brewery in Chiswick and get a refill there."
        
        You thought they could organise....?
        
        4 1 Reply
    2. Tuesday 30th May 2017 13:04 GMT Anonymous Coward
      
      IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year.
      
      As a director of BA, he is in fact responsible in law, even if the group have chosen to provide the service differently. I work for a UK based, foreign owned energy company. Our IT is supported by Anonco Business Services, incorporated in the parent company's jurisdiction, and owned by the ultimate parent. If our IT screws up (which it does with some regularity), our customers have redress against the UK business, and our directors hold the full contractual, legal and regulatory liability, whether the service that screwed up is in-house, outsourced, or delivered via captive service companies.
      
      13 0 Reply
      1. Tuesday 30th May 2017 15:29 GMT Anonymous Coward
        
        Director?
        
        If he is a director of BA! A search of Companies House finds a director of a BA company in the name of
        
        Alejandro Cruz De Llano
        
        I'm guessing this is him?
        
        A member of staff of a company only has legal responsibility if they are a registered director with Companies House. The fact the company calls them a CEO or director does not mean they are a registered director.
        
        0 0 Reply
        
        Wednesday 31st May 2017 07:00 GMT butigy
        
        Re: Director?
        
        Actually I don't believe that's correct. If you hold yourself out to be a director then it's possible you can be treated like one. There is also the concept of shadow directors, but that's a bit different.
        
        0 0 Reply
  4. Tuesday 30th May 2017 08:06 GMT John Smith 19
    
    "and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?"
    
    You can bet that any "profit improving" (IE cost cutting) ideas certainly did.
    
    This should as well.
    
    But probably won't, given this is the "New World Order" of large corporate management that takes ownership of any success and avoids any possibility that their decisions could have anything to do with this.
    
    If you wonder who most modern CEOs' role model for corporate behavior is, it's simple.
    
    Carter Burke in Aliens.
    
    10 0 Reply
    1. Tuesday 30th May 2017 13:25 GMT 0765794e08
      
      Re: "and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?"
      
      “Carter Burke in Aliens”
      
      Sticking with movies, Johnny from Airplane! springs to mind...
      
      “Just kidding! Oh, wrong cable. Should’ve been the grey one. Rapunzel! Rapunzel!”
      
      1 0 Reply
  5. Tuesday 30th May 2017 12:42 GMT Anonymous Coward
    
    Cruz previously worked at Vueling which has a terrible record for cancellations, lost bookings and cruddy customer service - so he's clearly brought his experience over.
    
    He was appointed to cut costs at BA which he's done by emulating RyanAir and EasyJet whilst keeping BA prices. He's allowed the airline to go downmarket just as the Turkish, the Gulf and Asian carriers are hitting their stride in offering world-wide routing and don't treat customers like crap. Comparing Emirates to BA in economy is like chalk and cheese.
    
    BA's only hope is if the American carriers continue to be as dreadful as ever.
    
    10 0 Reply
    1. Tuesday 30th May 2017 13:59 GMT Anonymous Coward
      
      I had the pleasure of flying back to the UK on American in Business Class recently - service and comfort was a notch above BA Club World, and the ticket was cheaper than BA Premium Economy. BA are screwed...
      
      3 0 Reply
    2. Wednesday 31st May 2017 10:31 GMT Richard Laval
      
      "BA's only hope is if the American carriers continue to be as dreadful as ever."
      
      So they definitely have a fighting chance then!
      
      0 0 Reply
3. Tuesday 30th May 2017 06:29 GMT Voland's right hand
  
  It smells like a store-and-forward messaging system from the dawn of the mainframe age (Shows how much BA has been investing into its IT). It may even be hardware + software. Switching over to backup is non-trivial as this is integrated into transactions, so you need to rewind transactions, etc.
  
  It can go wrong and often does, especially if you have piled up a gazillion new and wonderful things connected to it via extra interfaces. An example of this type of clusterf*** is the NATS catastrophic failure a few years back.
  
  That is NOT the clusterf*ck they experienced though because their messaging and transaction was half-knackered on Friday. My boarding pass emails were delayed 8 hours, check-in email announcement by 10 hours. So while it most likely was the messaging component, it was not knackered by a surge, it was half-dead 24h before that and the "surge" was probably someone hired on too little money (or someone hired on too much money giving the idiotic order) trying to reset it via power-cycle on Sat.
  
  This is why, when you intend to run a system and build on it for decades, you have to upgrade, and you have to START each upgrade cycle by upgrading the messaging and networking. Not do it as an afterthought and an unwelcome expense (the way BA does anything related to paying with the exception of paying exec bonuses).
  
  39 3 Reply
  1. Tuesday 30th May 2017 07:03 GMT James Anderson
    
    If it was a properly architected and configured mainframe system it would have just worked.
    
    High availability, failover, geographically distributed databases, etc. etc. were implemented on the mainframe sometime in the late '80s.
    
    Some of the commentards on this site seem to think the last release of a mainframe OS was in 1979, when actually they have been subject to continuous development, incremental improvement and innovation to this day. A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers. Bit like a modern Bentley with its staid '50s styling on the outside and a monster twin turbo multi valve engine on the inside.
    
    58 0 Reply
    1. Tuesday 30th May 2017 08:07 GMT Nolveys
      
      @ James Anderson
      
      (Mainframe operating systems) have been subject to continuous development, incremental improvement and innovation to this day.
      
      That sounds expensive, has anyone told Ginni about this?
      
      6 0 Reply
    2. Tuesday 30th May 2017 09:32 GMT Mr Dogshit
      
      There is no such verb as "to architect".
      
      22 11 Reply
      1. Tuesday 30th May 2017 10:15 GMT MyffyW
        
        no such verb as "to architect".
        
        I architect - the successor to the Asimov robot flick
        
        You architect - an early form of 21st century abuse
        
        He/She architects - well I have no problem with gender fluidity
        
        We architect - sadly nothing to do with Nintendo
        
        You architect - abuse, but this time collective
        
        They architect - in which case it was neither my fault, nor yours
        
        16 1 Reply
        
        This post has been deleted by its author
      2. Tuesday 30th May 2017 11:18 GMT Nigel 13
        
        There is now.
        
        0 0 Reply
        
        Wednesday 31st May 2017 03:11 GMT Aus Tech
        
        RE: There is now.
        
        It's too late now, the disaster has already happened. Very much like the old story "shut the gate, the horse has bolted."
        
        0 0 Reply
      3. Tuesday 30th May 2017 12:41 GMT Anonymous Coward
        
        But there will be as soon as enough prescriptive-grammar fogeys who can remember that once there wasn't die off. This is how language evolves: by the death of idiots.
        
        1 7 Reply
        
        Tuesday 30th May 2017 14:24 GMT Anonymous Coward
        
        Die off is fine. So is die back. They're descriptive and worth keeping. Architect as a verb is more or less OK, although why did someone assume 'design' wasn't good enough, since it's a correct description of the process, making architect as a verb a replacement for a word that didn't need replacing.
        
        3 1 Reply
        
        Tuesday 30th May 2017 14:27 GMT whileI'mhere
        
        by the ~~death~~ birth of idiots.
        
        FTFY
        
        1 0 Reply
      4. Tuesday 30th May 2017 14:13 GMT Anonymous Coward
        
        <rant> True. But at least that's one I can, however reluctantly, at least imagine.
        
        For me, by far the worst example of this American obsession with creating non-existent 'verbs' is, obviously, 'to leverage'.
        
        Surely that sounds as crass to even the most dim-witted American as it does to everyone else in the English speaking world, doesn't it? I'm told these words are created to make the speaker sound important when they are clueless.
        
        I can accept that some lone moron invented the word. But why did the number of people using it ever rise above 1? </rant>
        
        9 0 Reply
        
        Tuesday 30th May 2017 15:12 GMT Anonymous Coward
        
        *I can accept that some lone moron invented the word. But why did the number of people using it ever rise above 1?*
        
        Because the number of morons is >>>> 0[1]
        
        [1] Yes - I made up >>>> to be "a far, far larger number than the one compared against". It's mine. You can't have it. So there.
        
        3 0 Reply
        
        Wednesday 31st May 2017 12:43 GMT Glenturret Single Malt
        
        And pronounce it levverage instead of leeverage.
        
        1 0 Reply
      5. Tuesday 30th May 2017 14:54 GMT Nifty
        
        No such verb as to architect?
        
        https://en.oxforddictionaries.com/definition/architect
        
        verb
        
        [WITH OBJECT]Computing
        
        Design and configure (a program or system)
        
        ‘an architected information interface’
        
        7 2 Reply
      6. Wednesday 31st May 2017 03:28 GMT Jtom
        
        Suggestion we were given some twenty-five years ago: Don't verb nouns.
        
        4 0 Reply
        
        Wednesday 31st May 2017 11:10 GMT Marduk
        
        Verbing weirds a language.
        
        1 0 Reply
      7. Wednesday 31st May 2017 16:41 GMT dajames
        
        There is no such verb as "to architect".
        
        That's the beauty of the English language -- a word doesn't have to exist to be usable. (Almost) anything goes.
        
        It's not always a good idea to use words that "don't exist" -- especially if you're unhappy about being lexicographered into the ground by your fellow grammar nazis -- but most of the time you'll get the idea across.
        
        [There is no such verb as "to lexicographer", either, but methinks you will have got the point!]
        
        Ponder, though, on this.
        
        0 0 Reply
    3. Tuesday 30th May 2017 15:06 GMT CrazyOldCatMan
      
      A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers.
      
      And always has done. In the early '90s, I was maintaining TPF assembler code that was originally written in the '60s (some was older than me!).
      
      And I doubt very much if those systems are not still at the heart of things - they worked. In the same way as banks still have lots of stuff using Cobol, I suspect airlines still have a lot of IBM mainframes running TPF. With lots of shiny interfaces so that modern stuff can be done with the source data.
      
      5 0 Reply
      1. Tuesday 30th May 2017 21:39 GMT Down not across
        
        With lots of shiny interfaces so that modern stuff can be done with the source data.
        
        Dunno if it's shiny, but probably something like MQ.
        
        For the most part it seems to do a fairly decent job in distributed systems if it has been properly configured.
        
        2 0 Reply