BA CEO blames messaging and networks for grounding

The catastrophic systems failure that grounded British Airways flights for a day appears to have been caused by networking hardware failing to cope with a power surge and messaging systems failing as a result. The Register has asked BA's press office to detail what went wrong, what equipment failed, what disaster recovery …


  1. jerehada

    So by messaging he means some sort of enterprise service bus was taken down? Also, power surge protection is normal, so what went wrong with the switchover to redundant power? The hints from this, a bit like the TalkTalk hack, suggest something very simple and not some unavoidable impossible to understand failure they would like to media steer us towards.

    1. wyatt

      I'm looking forward to seeing the RCA, if we ever do. So many businesses have single points of failure, even within HA systems. I realise that backup systems have been mentioned but if they don't work or can't be brought online (ever tested?), has someone been telling porkie pies?

      1. Anonymous Coward

        "I realise that backup systems have been mentioned"

        I used to work for a company, large company, that provided DR services. The vast majority of companies treat DR as a compliance checkbox. They buy some DR services so they can say they have DR services... but in the event of a primary data center loss, there really is only the rough outline of a plan. Basically their data, or most of it, is in some alternative site and they may have the rest of their gear there too or not. There is rarely anything resembling a real time switch over from site A to site B in case of a disaster in which their entire stack(s) would come up without any manual intervention at site B. Mainly because architectures are a hodge podge of stuff which has collected over the years. Many companies never rewrite or modernize anything, meaning much of the environment is legacy with legacy HA/DR tools... and there is sparse automation.

        1. wheelybird

          There's a difference between disaster recovery and high-availability (though they do overlap).

          It's perfectly reasonable that disaster recovery is a manual fail-over process. Fully resilient systems over two geographically separated locations can be hard and expensive to implement for smaller companies with not much in the way of a budget or resources, and so you have to compromise your expectations for DR.

          Even if failing-over can be automated, there might be a high cost in failing-back afterwards, and so you might actually prefer the site to be down for a short while instead of kicking in the DR procedures; it works out cheaper and avoids complications with restoring the primary site from the DR site.

          Not every company runs a service that absolutely needs to be up 24/7.

          A lot of people designing the DR infrastructure will be limited by the (often poor) choices of technology made by the people that wrote the in-house stuff.

          As an example, replicating your MySQL database between two datacentres is more complicated than most people would expect. Do you do normal replication and risk that the slave has lagged behind the master at the point of failure, losing data? Or use synchronous replication like Galera at the cost of a big latency hit to the cluster, slowing it right down?

          If it's normal replication, do you risk master-master so that it's easy to fail-back, with the caveat that master-master is generally frowned upon for good reasons?

          I think it's disingenuous to berate people for implementing something that can be very difficult to implement.

          Though of course, large companies with lots of money and lots to lose by being down (like BA) have no excuses.
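The replication trade-off described in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Primary`, `Replica`, `ship` and `lost_on_failover` are hypothetical names invented for the illustration, not MySQL or Galera APIs, and the "lag" is simulated by simply not shipping some writes before the failure.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy standby database that applies writes shipped from the primary."""
    rows: list = field(default_factory=list)


@dataclass
class Primary:
    """A toy primary that replicates either synchronously or asynchronously."""
    replica: Replica
    synchronous: bool
    rows: list = field(default_factory=list)
    backlog: list = field(default_factory=list)  # writes not yet shipped

    def write(self, row):
        self.rows.append(row)
        if self.synchronous:
            # Synchronous: the write is not acknowledged until the replica has it.
            # In a real cluster this is where the cross-site round trip is paid.
            self.replica.rows.append(row)
        else:
            # Asynchronous: acknowledge immediately, ship later.
            self.backlog.append(row)

    def ship(self, n):
        """Deliver up to n backlogged writes to the replica (simulates lag draining)."""
        for row in self.backlog[:n]:
            self.replica.rows.append(row)
        del self.backlog[:n]


def lost_on_failover(primary):
    """Writes acknowledged by the primary that the replica never received."""
    return [r for r in primary.rows if r not in primary.replica.rows]


async_primary = Primary(Replica(), synchronous=False)
sync_primary = Primary(Replica(), synchronous=True)
for booking in ["BA001", "BA002", "BA003"]:
    async_primary.write(booking)
    sync_primary.write(booking)
async_primary.ship(1)  # only one write reached the standby before the "failure"

print(lost_on_failover(async_primary))  # ['BA002', 'BA003']
print(lost_on_failover(sync_primary))   # []
```

The asynchronous primary acknowledges quickly but can lose whatever is still in the backlog when it fails; the synchronous one loses nothing but pays for the replica on every write, which is the latency hit the comment refers to.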

        2. Anonymous Coward

          That's scary

          That is our DR to a tee, so glad I'm not the boss

          anon for obvious reasons

      2. NoneSuch Silver badge

        You can't test everything all the time.

        Stuff happens and often the unimagined causes grief.

        Redundancies are only guaranteed when they come from HR.

    2. Anonymous Coward

      suggest something very simple and not some unavoidable impossible to understand failure

      The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer. But it is his fault, all of it, in that capacity. The total and absolute failure of everything is clearly a series of multiple failures, and he (and BA) are trying to control the message as though that denies the reality of this catastrophe. He should be fired for his poor communication and poor leadership if nothing else. But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.

      Looking around, press comment reckons that it'll be two weeks before all flight operational impacts are worked out (crews, aircraft in the wrong place at the wrong time, passenger failures made as good as they can), and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?

      1. Anonymous Coward

        But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.

        Whatever you might think about his performance during this unmitigated balls-up, there's much more relevant experience in his biography than just running a "tiddly low cost airline".

        1. Dan 55 Silver badge

          I don't know who you're trying to convince, but it's not me. Neither Clickair nor Vueling have or had stellar reputations, Sabre's had its outages, and the less said about US airports the better.

        2. dieseltaylor

          His CEO experience is from a minor airline, so accept that fact. His previous experience reads well but then every exec I know of makes sure it does. : )

          Should he jump? Probably not but some people somewhere must be guilty of hiding, or not implementing, necessary IT improvements.

      2. Bloodbeastterror

        "I wonder if that will affect his bonus?"

        Ha ha ha ha... Of course not. After the attainment of a certain pay grade "reward for failure" kicks in. Only the actual workers enjoy "reward for success". Sometimes.

        1. Antron Argaiv Silver badge
          Thumb Up

          Re: "I wonder if that will affect his bonus?"

          A former boss referred to it as "f*ck up and move up".

          Though, admittedly, a change of employer is sometimes implied.

        2. Anonymous Coward

          Re: "I wonder if that will affect his bonus?"

          He didn't get one last year.

          "Alex Cruz, the Spanish CEO of British Airways, will not receive a bonus for 2016 from the IAG airlines group. The company said in a statement to the National Stock Market Commission that he will be the only one of the 12 senior executives not to receive a bonus. "

          1. John Smith 19 Gold badge
            Unhappy

            "he will be the only one of the 12 senior executives not to receive a bonus. ""

            Which suggests he has been trying extra hard to get one.

            And look what his efforts have produced.....

            I think he's going to be on the corporate naughty step again.

            IT.

            It's trickier than it looks in the commercials.

      3. Anonymous Coward

        The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer.

        IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year. But of course, Cruz has fully supported all the rounds of cuts that have been made.

        It smells like a store-and-forward messaging system from the dawn of the mainframe age

        JMS-based ESB.

        Ex BA AC

        1. Anonymous Coward

          But you would think that something as critical as an ESB in BA would mean that they have built it with high availability in an active/active configuration with plenty of spare capacity built in with nodes in different locations and on different power supplies. And of course ensuring that the underlying data network has similar high availability.

          Otherwise you have just built in a single point of failure to your whole enterprise and as Murphy's law tells us - if it can go wrong then it will go wrong and usually at the most inopportune moment.
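The point above about spreading nodes across locations can be sketched as a client that tries each redundant node in turn. This is a minimal illustration, not BA's architecture or any real ESB client: `BrokerNode`, `NodeDown` and `send_with_failover` are hypothetical names, and a node being "down" is just a flag.

```python
class NodeDown(Exception):
    """Raised by a node that cannot accept messages."""


class BrokerNode:
    """Hypothetical stand-in for one messaging node at one site."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.delivered = []

    def send(self, message):
        if not self.up:
            raise NodeDown(self.name)
        self.delivered.append(message)


def send_with_failover(nodes, message):
    """Try each node in turn; return the name of the node that accepted the message."""
    for node in nodes:
        try:
            node.send(message)
            return node.name
        except NodeDown:
            continue  # this node is unavailable, try the next one
    raise RuntimeError("all nodes down: the bus is a single point of failure again")


site_a = BrokerNode("site-a", up=False)  # simulate the primary site losing power
site_b = BrokerNode("site-b")
print(send_with_failover([site_a, site_b], "boarding-pass:BA001"))  # site-b
```

With only one node in the list, any outage of that node raises the `RuntimeError`, which is the single point of failure the comment warns about.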

          1. Norman Nescio Silver badge
            Pint

            ESB?

            "...something as critical as an ESB in BA would mean that they have built it with high availability in an active/active configuration with plenty of spare capacity..."

            They probably thought they could just pop down the road to the brewery in Chiswick and get a refill there.

            1. Tom Paine
              Pint

              Re: ESB?

              Funny, I thought Fuller's had closed that site and moved to an industrial estate in Maidstone or Nuneaton or something -- but I was completely wrong: https://www.fullers.co.uk/brewery

              Doesn't it look nice? Mmmm... ale...

              1. Simon Harris
                Pint

                Re: ESB?

                "Funny, I thought Fuller's had closed that site ... but I was completely wrong: https://www.fullers.co.uk/brewery

                Doesn't it look nice? Mmmm... ale..."

                They do an excellent brewery tour with a tasting session in their bar/museum afterwards :)

              2. Bigbird3141

                Re: ESB?

                Think you're confusing it with Young's - the Wandsworth-based brewer Fullers bought and closed and redeveloped the site of.

                1. CH in CT20
                  Pint

                  Re: ESB?

                  Ahem. You seem to be confusing Fuller's with Charles Wells, the company which brews Young's beers in Bedford.

            2. Doctor Syntax Silver badge

              Re: ESB?

              "They probably thought they could just pop down the road to the brewery in Chiswick and get a refill there."

              You thought they could organise....?

        2. Anonymous Coward

          IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year.

          As a director of BA, he is in fact responsible in law, even if the group have chosen to provide the service differently. I work for a UK based, foreign owned energy company. Our IT is supported by Anonco Business Services, incorporated in the parent company's jurisdiction, and owned by the ultimate parent. If our IT screws up (which it does with some regularity), our customers have redress against the UK business, and our directors hold the full contractual, legal and regulatory liability, whether the service that screwed up is in-house, outsourced, or delivered via captive service companies.

          1. Anonymous Coward

            Director?

            If he is a director of BA! A search of Companies House finds a director of a BA company in the name of

            Alejandro Cruz De Llano

            I'm guessing this is him?

            A member of staff of a company only has legal responsibility if they are a registered director with Companies House. The fact the company calls them a CEO or director does not mean they are a registered director.

            1. butigy

              Re: Director?

              Actually I don't believe that's correct. If you hold yourself out to be a director then it's possible you can be treated like one. There's also the concept of shadow directors, but that's a bit different.

      4. John Smith 19 Gold badge
        Unhappy

        "and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?"

        You can bet that any "profit improving" (i.e. cost cutting) ideas certainly did.

        This should as well.

        But probably won't, given this is the "New World Order" of large corporate management that takes ownership of any success and avoids any possibility that their decisions could have anything to do with this.

        If you wonder who most modern CEOs' role model for corporate behavior is, it's simple.

        Carter Burke in Aliens.

        1. 0765794e08
          Joke

          Re: "and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?"

          “Carter Burke in Aliens”

          Sticking with movies, Johnny from Airplane! springs to mind...

          “Just kidding! Oh, wrong cable. Should’ve been the grey one. Rapunzel! Rapunzel!”

      5. Anonymous Coward

        Cruz previously worked at Vueling which has a terrible record for cancellations, lost bookings and cruddy customer service - so he's clearly brought his experience over.

        He was appointed to cut costs at BA which he's done by emulating RyanAir and EasyJet whilst keeping BA prices. He's allowed the airline to go downmarket just as the Turkish, the Gulf and Asian carriers are hitting their stride in offering world-wide routing and don't treat customers like crap. Comparing Emirates to BA in economy is like chalk and cheese.

        BA's only hope is if the American carriers continue to be as dreadful as ever.

        1. Anonymous Coward

          I had the pleasure of flying back to the UK on American in Business Class recently - service and comfort were a notch above BA Club World, and the ticket was cheaper than BA Premium Economy. BA are screwed...

        2. Richard Laval

          "BA's only hope is if the American carriers continue to be as dreadful as ever."

          So they definitely have a fighting chance then!

    3. Voland's right hand Silver badge

      It smells like a store-and-forward messaging system from the dawn of the mainframe age (Shows how much BA has been investing into its IT). It may even be hardware + software. Switching over to backup is non-trivial as this is integrated into transactions, so you need to rewind transactions, etc.

      It can go wrong and often does, especially if you have piled up a gazillion new and wonderful things connected to it via extra interfaces. An example of this type of clusterf*** is the NATS catastrophic failure a few years back.

      That is NOT the clusterf*ck they experienced though because their messaging and transaction was half-knackered on Friday. My boarding pass emails were delayed 8 hours, check-in email announcement by 10 hours. So while it most likely was the messaging component, it was not knackered by a surge, it was half-dead 24h before that and the "surge" was probably someone hired on too little money (or someone hired on too much money giving the idiotic order) trying to reset it via power-cycle on Sat.

      This is why, when you intend to run a system and build on it for decades, you have to upgrade, and you have to START each upgrade cycle by upgrading the messaging and networking. Not do it as an afterthought and an unwelcome expense (the way BA does anything related to paying with the exception of paying exec bonuses).
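For readers unfamiliar with the store-and-forward idea mentioned in the comment above, here is a toy sketch of a single hop. It is an assumption-laden illustration, not the system BA runs: `StoreAndForward`, `next_hop` and the `link` flag are hypothetical, and the pending queue lives in memory where a real system would persist it.

```python
from collections import deque


class StoreAndForward:
    """Toy store-and-forward hop: keep each message until the next hop acknowledges it."""

    def __init__(self, deliver):
        self.deliver = deliver   # callable returning True on acknowledged delivery
        self.pending = deque()   # a real system would persist this to disk

    def accept(self, message):
        self.pending.append(message)  # store first, forward later

    def flush(self):
        """Forward pending messages in order; stop at the first unacknowledged one."""
        sent = 0
        while self.pending:
            if not self.deliver(self.pending[0]):
                break  # next hop is down: keep the message and retry later
            self.pending.popleft()
            sent += 1
        return sent


received = []
link = {"up": False}  # simulated state of the connection to the next hop


def next_hop(message):
    if link["up"]:
        received.append(message)
    return link["up"]


hop = StoreAndForward(next_hop)
hop.accept("check-in open: BA001")
hop.accept("boarding pass: BA001")
print(hop.flush(), len(hop.pending))  # 0 2
link["up"] = True
print(hop.flush(), len(hop.pending))  # 2 0
```

While the link is down nothing is lost, but everything queued behind the first message is delayed, which is consistent with emails arriving hours late rather than disappearing.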

      1. James Anderson

        If it was a properly architected and configured mainframe system it would have just worked.

        High availability, failover, geographically distributed databases, etc. etc. were implemented on the mainframe sometime in the late '80s.

        Some of the commentards on this site seem to think the last release of a mainframe OS was in 1979, when actually they have been subject to continuous development, incremental improvement and innovation to this day. A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers. Bit like a modern Bentley with its staid '50s styling on the outside and a monster twin turbo multi valve engine on the inside.

        1. Nolveys
          Windows

          @ James Anderson

          (Mainframe operating systems) have been subject to continuous development, incremental improvement and innovation to this day.

          That sounds expensive, has anyone told Ginni about this?

        2. Mr Dogshit
          Headmaster

          There is no such verb as "to architect".

          1. MyffyW Silver badge

            no such verb as "to architect".

            I architect - the successor to the Asimov robot flick

            You architect - an early form of 21st century abuse

            He/She architects - well I have no problem with gender fluidity

            We architect - sadly nothing to do with Nintendo

            You architect - abuse, but this time collective

            They architect - in which case it was neither my fault, nor yours

            1. This post has been deleted by its author

          2. Nigel 13

            There is now.

            1. Aus Tech

              RE: There is now.

              It's too late now, the disaster has already happened. Very much like the old story "shut the gate, the horse has bolted."

          3. Anonymous Coward
            Alert

            But there will be as soon as enough prescriptive-grammar fogeys who can remember that once there wasn't die off. This is how language evolves: by the death of idiots.

            1. Anonymous Coward

              Die off is fine. So is die back. They're descriptive and worth keeping. Architect as a verb is more or less OK, although why did someone assume 'design' wasn't good enough, since it's a correct description of the process, making architect as a verb a replacement for a word that didn't need replacing.

            2. whileI'mhere

              by the death birth of idiots.

              FTFY

          4. Anonymous Coward

            <rant> True. But at least that's one I can, however reluctantly, at least imagine.

            For me, by far the worst example of this American obsession with creating non-existent 'verbs' is, obviously, 'to leverage'.

            Surely that sounds as crass to even the most dim-witted American as it does to everyone else in the English speaking world, doesn't it? I'm told these words are created to make the speaker sound important when they are clueless.

            I can accept that some lone moron invented the word. But why did the number of people using it ever rise above 1? </rant>

            1. Anonymous Coward

              *I can accept that some lone moron invented the word. But why did the number of people using it ever rise above 1?*

              Because the number of morons is >>>> 0[1]

              [1] Yes - I made up >>>>> to be "a far, far larger number than the one compared against" It's mine. You can't have it. So there.

            2. Glenturret Single Malt

              And pronounce it levverage instead of leeverage.

          5. Nifty Silver badge

            No such verb as to architect?

            https://en.oxforddictionaries.com/definition/architect

            verb

            [WITH OBJECT]Computing

            Design and configure (a program or system)

            ‘an architected information interface’

          6. Jtom

            Suggestion we were given some twenty-five years ago: Don't verb nouns.

            1. Marduk

              Verbing weirds a language.

          7. dajames
            Headmaster

            There is no such verb as "to architect".

            That's the beauty of the English language -- a word doesn't have to exist to be usable. (Almost) anything goes.

            It's not always a good idea to use words that "don't exist" -- especially if you're unhappy about being lexicographered into the ground by your fellow grammar nazis -- but most of the time you'll get the idea across.

            [There is no such verb as "to lexicographer", either, but methinks you will have got the point!]

            Ponder, though, on this.

        3. CrazyOldCatMan Silver badge

          A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers.

          And always has done. In the early '90s, I was maintaining TPF assembler code that was originally written in the '60s (some was older than me!).

          And I doubt very much if those systems are not still at the heart of things - they worked. In the same way as banks still have lots of stuff using Cobol, I suspect airlines still have a lot of IBM mainframes running TPF. With lots of shiny interfaces so that modern stuff can be done with the source data.

          1. Down not across

            With lots of shiny interfaces so that modern stuff can be done with the source data.

            Dunno if it's shiny, but probably something like MQ.

            For the most part it seems to do a fairly decent job in distributed systems if it has been properly configured.
