The biggest British Airways IT meltdown WTF: 200 systems in the critical path?

One of the key principles of designing any high availability system is to make sure only vital apps or functions use it and everything else doesn't – sometimes referred to as KISS (Keep It Simple Stupid). High availability or reliability is always technically challenging at whatever systems level it is achieved, be it hardware …


  1. AMBxx Silver badge
    Joke

    More importantly

    Is it just me, or does that picture look like David Cameron is about to be eaten by a dinosaur? Have we accidentally uncovered Angela Merkel's real identity?

    1. Anonymous Coward
      Anonymous Coward

      Re: More importantly

      I'd say Theresa May's, but velociraptors seem to be effective, have a mission and set out to do it, rather than changing their mission with the daily headlines ....

      1. Yet Another Anonymous coward Silver badge

        Re: More importantly

        Most large dinosaurs are strong and stable

    2. AMBxx Silver badge

      Re: More importantly

      2 thumbs down for my joke?

      Is that you Mr & Mrs Cameron?

      1. Anonymous Coward
        Anonymous Coward

        Re: More importantly

        Thumbs down from me only because I'm bored sick of reading political posts in almost every comment section now.

        1. Danny 14

          Re: More importantly

          Don't worry citizen. Most forum comments will be deemed thought crime in a few years. The citizens posting them will be in reeducation.

          Strong and stable reeducation. Nothing to see here please move along.

        2. Anonymous Coward
          Big Brother

          Re: More importantly

          @boltar: "Thumbs down from me only because I'm bored sick of reading political posts in almost every comment section now."

          You usually see these kind of posts when they're attempting to derail the subject and distract from the contents of the main article. That being, the crap IT system at BA and who is responsible for it.

          1. AMBxx Silver badge

            Re: More importantly

            You usually see these kind of posts when they're attempting to derail the subject and distract from the contents of the main article

            Nope, just making a joke. Surprised anybody took the trouble to comment.

  2. James 51

    Ignorance and greed. Trusting the professionals to do their job can overcome the former, not so sure about the latter. BTW it needn't necessarily be the greed of immediate managers. Businesses in general tend to live like the grasshopper before he learns his valuable lesson.

    1. Anonymous Coward
      Anonymous Coward

      Ignorance and greed

      In my experience, it has more to do with ignorance. I am always amazed at how poor most people are at judging risk. In particular, people tend to overestimate and agonize over vanishingly small risks and underestimate the mundane, everyday risks they face. Examples of each are being killed by a terrorist in the US versus being killed in a car accident. That can make for bad policy and decisions, in politics and in business.

      1. Tom 7

        Re: Ignorance and greed

        From the article "Indeed, it was far from clear that even senior NASA management were actually capable of understanding the warnings their engineers were raising – often having neither an engineering or a scientific background."

        This - in every company I've worked for. Even in the ones where they had some engineering experience it was so out of date as to be useless, or they actually only had a talent for climbing greasy poles. The best boss I ever had was an utter charlatan but he had the sense to leave the engineering to those that knew about it.

        1. ByTheSea

          Re: Ignorance and greed

          "Even in the ones where they had some engineering experience it was so out of date as to be useless"

          In IBM I referred to these managers as "technicians". They had forgotten everything they ever learned in Engineering and had learned nothing in Management.

      2. Anonymous Coward
        Anonymous Coward

        Re: Ignorance and greed

        " Examples of each are being killed by a terrorist in the US versus being killed in a car accident."

        People always underestimate risk when they feel they're in control.

      3. Trigonoceps occipitalis

        Re: Ignorance and greed

        I've used this before.

        The greatest test of an engineer is not his technical ingenuity but his ability to persuade those in power who do not want to be persuaded and convince those for whom the evidence of their own eyes is anything but convincing.

        Extract from "Plain Words" in The Engineer 2nd October 1959

        1. Anonymous Coward
          Anonymous Coward

          Re: Ignorance and greed

          >The greatest test of an engineer is not his technical ingenuity but his ability to persuade those in power who do not want to be persuaded and convince those for whom the evidence of their own eyes is anything but convincing.

          Wise words for 1959 but in this post fact world the engineer needs more in the way of seminary skills than logic or debate.

      4. Tom Paine

        Re: Ignorance and greed

        I am always amazed at how poor most people are at judging risk. In particular, people tend to overestimate and agonize over vanishingly small risks and underestimate the mundane, everyday risks they face.

        Well spotted. Here's a bit more on that.

        https://en.wikipedia.org/wiki/Risk_perception

        In the words of Mr Munroe, "Six hours of fascinated clicking later, ..."

      5. Steve Channell

        low availability cluster

        This sounds scarily like a standard blueprint for a service oriented architecture gone horribly wrong. In financial markets it was common for "tib-storms" to crash a broadcast network with re-requests to sync topics, but capacity, tiering and investment addressed it.

        My money is on a panicked recovery - like the RBS CA7 debacle.

    2. Anonymous Coward
      Anonymous Coward

      The problem for many businesses is that their competition are cutting corners and cutting out as much spend as possible too. Customers then reward that behaviour - there's no point having the most reliable IT stack if you have no customers left to fund it all. You end up with a capture effect where the luckiest cheapskate has all the customers until their luck runs out, then people build resilient systems from scratch all over again, which immediately start getting cost reduced.

    3. Ian Michael Gumby
      Boffin

      @James51

      No, not ignorance and greed.

      Try micro services in addition to having legacy systems in place, where it is cheaper to add another micro service into the chain than it is to rewrite the original service with the additional feature and test it.

      The one advantage is that if you have only a certain class of travelers who have an additional process to check some sort of security... you don't have to run everyone thru that process.

      Note, I'm not suggesting that this is the case, or that this model is the best fit for BA, but it could be viable and it is what is happening when you consider stream processing.

      The issue is that at some point you run into a problem when the chain gets too long and it breaks in places and you don't know how to move forward or handle the errors.
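To put a rough number on the "chain gets too long" point, here is a minimal sketch of how availability falls as services are added in series. It assumes every service in the chain must be up and that failures are independent; the 99.9% per-service figure is purely hypothetical, and the 200 is simply the count from the article's headline, not a measured BA figure.

```python
def chain_availability(per_service: float, n_services: int) -> float:
    """Availability of a chain where all n services must be up at once.

    Assumes independent failures and the same availability for every
    service, which real systems rarely satisfy.
    """
    if not 0.0 <= per_service <= 1.0:
        raise ValueError("per_service must be between 0 and 1")
    if n_services < 0:
        raise ValueError("n_services must be non-negative")
    return per_service ** n_services


if __name__ == "__main__":
    # Hypothetical per-service availability of 99.9%.
    for n in (1, 10, 50, 200):
        print(f"{n:>3} services in series: {chain_availability(0.999, n):.4f}")
```

Under those assumptions, 200 services at "three nines" each leave the chain as a whole at roughly 82%, which is why the length of the critical path matters more than the quality of any single link.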

  3. Anonymous Coward
    Anonymous Coward

    Sunny when it is working

    We provide cloud services and connectivity is the key factor in the actual uptime of our services to our end users. IT managers regularly make the wrong decision based on perceived priorities. If something is working, then tasks related to that will go down the list and are even forgotten.

    For example, keeping some services running over ADSL when you have a new leased line available and not prioritising the work to switch because everything is working. Statistics tell us leased lines have better availability and quality of service but the customer often only reacts when a failure happens.

    I know that they said outsourcing is not to blame but getting rid of people that know how things work or are held together is a dangerous risk. How many companies know nothing about what all the boxes do let alone how they are all dependent on each other?

    1. Anonymous Coward
      Anonymous Coward

      Re: Sunny when it is working

      getting rid of people that know how things work or are held together is a dangerous risk

      This is an inevitable consequence of the regular "efficiency" pogroms that most companies undertake against their own support services (and that applies to functions like finance and procurement too). There is a vast amount of tacit knowledge in employees' heads that is never written down, and which the business places no value on until things go wrong. By then it is too late, because these pogroms are always selective - the people seen as whingers, the challenging, the "difficult", those simply so clever or well informed that they are a threat to management, all are first on the list to go. And unfortunately those are often the people who know how much string and sellotape holds everything together.

      I work for a large company that has a home brew CRM of great complexity. It works pretty well, costs next to nothing in licence fees (cf SAP or Oracle), and we have absolute control - we're not beholden to a tech company who can force upgrade sales by "ending support". Over recent years we've outsourced many levels of IT to HPE, and each time something new goes over the fence, HPE waste no time in getting rid of the expensive talent that has been TUPE'd across. We did even have a CIO who understood the system - but he's been pushed out and replaced by a corporate SAP-head. You can guess what's going on now - the company is sleepwalking into replacing a low risk, stable CRM with a very high risk, high cost SAP implementation, and at the end of it will have a similarly complex CRM, except that it will cost us far more in annual licence fees, we'll have no control of the technology, and the costs of the changeover alone will total around £400m, judging by the serial screwups by all of our competitors.

      1. Ken Hagan Gold badge

        Re: Sunny when it is working

        "and which the business places no value on until things go wrong"

        The only way to find out whether everyone still employed knows how to rebuild the system, is to provide them with an opportunity to do it. (It needn't be the actual system. You can let them assemble a clone.) Of course, that's expensive, but that is the cost of finding out whether the proposed efficiency drive is safe. My guess is that if that cost was included in the business case for the efficiency drive, the case would disappear.

        Taking the argument a step further, it is easily seen that it isn't safe to let any of your staff go until you have reached the point where the system can be rebuilt by script. That's going to be an unpopular conclusion within management circles, but its unpopularity doesn't mean it is wrong.

        1. Danny 14

          Re: Sunny when it is working

          Problem is most IT managers see the free audit and recommendations as salesmen trying to sell shit. So they won't learn.

          Better to take the audit for the reports and ignore the phone for a month.

        2. Doctor Syntax Silver badge

          Re: Sunny when it is working

          "it is easily seen that it isn't safe to let any of your staff go until you have reached the point where the system can be rebuilt by script."

          And even then, when the staff are let go you may find nobody knows what the script actually does and you will even more likely find that nobody knows why it does it.

          Not only do you need to retain knowledgeable staff, you need to have succession planning in place.

    2. Stoneshop
      Facepalm

      Re: Sunny when it is working

      I know that they said outsourcing is not to blame but getting rid of people that know how things work or are held together is a dangerous risk. How many companies know nothing about what all the boxes do let alone how they are all dependent on each other?

      Another factor that I see happening is feature sprawl, add-ons often being introduced as 'nice to have', with a low priority to fix if broken. Problem is, even if those features keep being handled at low prio[0], each of those features adds to the knowledge the first and second line support have to have at the ready, as well as simply adding to the workload as such. Having to not just physically but also mentally switch from one environment to another if a more urgent problem comes in and you have to suspend or hand off the first problem because you're the one who best understands the second one is another matter.

      [0] and often they don't, because the additional info they provide allows for instance faster handling of processes, smoother workflow, better overview, etcetera, and after a while people balk at having to do without them. So even when they're still officially low prio, call handling often bumps them to medium or even high because "people can't work". Oh yes they can; how about remembering the workflow that doesn't rely on those add-ons? The workflow they were trained in?

      1. Anonymous Coward
        Anonymous Coward

        Re: Sunny when it is working

        Worked for a large firm and we switched from one provider to another for some fairly mission critical stuff. The previous system required essentially one program to be running and that was it. The new system required several add on programs (some TSR) to be running in addition to the main program on a user's PC. The first time we noticed this wasn't long after deployment when someone couldn't start the main software on their machine. We tried various things with the vendor on the phone before they suggested that one of the other little progs might not be running or have stopped.

        Spoke to someone else who used their software and he said that they didn't really do traditional updates to their software. If some functionality was additionally needed they'd just write another small add on program to provide this. Then after a few years release a completely new product complete with new name plus the bells and whistles added to the old one and the cycle restarts. Didn't exactly fill me with confidence.

        1. GrapeBunch

          Re: Sunny when it is working

          TSR? That brings back memories. You could also understand this as a reaction to "creeping featurism" on the part of the client company.

          In DR-DOS I used to use TSRs to achieve needed functionality on a work PC. Difference is, in that world, nobody ever produced a single program to provide the same functionality.

  4. Mine's a pint

    Typo? Looks strange

    "shuttle failure was "necessarily" one in 105"

    1 in 105? is this perhaps meant to mean "1 in 10^5"?

    1. smudge

      Re: Typo? Looks strange

      Yup. 1 in 100,000 was the figure projected by NASA.

      1. Anonymous Coward
        Anonymous Coward

        Re: Typo? Looks strange

        I think I see the problem here:- NASA predicted 1 in 10^5, and the actual probability of failure was 1 in 10^6, which, as all readers of great literature know is almost certainly going to happen.

    2. Excellentsword
      Pint

      Re: Typo? Looks strange

      Updated

      1. TDog

        Re: Typo? Looks strange

        "The chance that an HTHTP pipe will burst is 10^-7." You can't estimate things like that; a probability of 1 in 10,000,000 is almost impossible to estimate. It was clear that the numbers for each part of the engine were chosen so that when you add everything together you get 1 in 100,000.

        From "What Do You Care What Other People Think", Richard Feynman.

        1. yoganmahew

          Re: Typo? Looks strange

          @TDog

          "You can't estimate things like that; a probability of 1 in 10,000,000 is almost impossible to estimate."

          It's also wildly meaningless. 1 in 10m whats? Messages through the system? Milliseconds? Times the life of the universe? (Remember six sigma events happened daily during the biggest move days of the financial crash... either the universe is impossibly old and all those events happened in a row, or these sort of 1 in x statistics are complete bunkum).
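The six sigma aside can be checked with a few lines of arithmetic. This is a minimal sketch, assuming daily moves follow a normal distribution (the very assumption the comment is attacking) and a hypothetical 252 trading days per year; neither figure comes from the thread.

```python
import math


def two_sided_tail(k_sigma: float) -> float:
    """P(|Z| > k_sigma) for a standard normal variable Z."""
    return math.erfc(k_sigma / math.sqrt(2.0))


def years_between_events(k_sigma: float, observations_per_year: float) -> float:
    """Expected years between k-sigma observations, one draw per observation."""
    return 1.0 / (two_sided_tail(k_sigma) * observations_per_year)


if __name__ == "__main__":
    trading_days = 252  # assumption: trading days per year
    p = two_sided_tail(6.0)
    print(f"P(|Z| > 6)        = {p:.3e}")
    print(f"Years per 6-sigma = {years_between_events(6.0, trading_days):.3e}")
```

Under a normal model a six sigma day should turn up about once in a couple of million years of trading, so seeing several in one week says the model is wrong rather than that the universe is unlucky.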

          1. Trigonoceps occipitalis

            Re: Typo? Looks strange

            "You can't estimate things like that; a probability of 1 in 10,000,000 is almost impossible to estimate."

            But it will happen 99.9% of the time.

            (Apologies to TP.)

        2. Grunt #1

          Re: Typo? Looks strange

          Everyone should read Richard Feynman.

          1. Rich 11

            Re: Typo? Looks strange

            Strongly seconded. His writing is hugely entertaining as well as educational.

            1. John Smith 19 Gold badge
              Joke

              " His writing is hugely entertaining as well as educational."

              Indeed.

              But I could never get the image of him as a New York taxi driver chewing a cigar out of my head.

              "How 'bout that Quantum Chrono Dynamics, huh? Virtual particles mediating force transfer in a vacuum. Tricky stuff. You in town on business?"

              Joking aside the world is poorer, not just for his intellect and vision but also for his ability to explain complex ideas. His rubber band in a cup of ice water (modelling the root cause of the Challenger crash) was a classic. Simple enough for even the "I don't understand science" crowd to grasp.

              1. staggers

                Re: " His writing is hugely entertaining as well as educational."

                I think he was born at the southern end of NY, but with that accent he should be from Noo Joizy.

                I always fondly imagine him wearing a zoot suit and spats, carrying a violin case.

                One unarguably great thing Bill Gates did was to buy the rights to the lecture series so we can all watch them for free.

            2. Anonymous Coward
              Anonymous Coward

              Re: Feynman: see also Haddon-Cave

              " [feynman's] writing is hugely entertaining as well as educational."

              Closer to home in the UK, there's a senior judge called Charles Haddon Cave. He's a lawyer not a scientist or engineer, but if you need an inquiry done properly, he seems like a good man to have on your side. His writing is also educational, and entertaining in a way.

              See e.g. his talk(s) on "Leadership & Culture, Principles & Professionalism, Simplicity & Safety – Lessons from the Nimrod Review

              RAF Nimrod XV230 suffered a catastrophic mid-air fire whilst on a routine mission over Helmand Province in Afghanistan on 2nd September 2006. This led to the total loss of the aircraft and the death of all 14 service personnel on board. It was the biggest single loss of life of British service personnel in one incident since the Falklands War. The cause was not enemy fire, but leaking fuel being ignited by an exposed hot cross-feed pipe. It was a pure technical failure. It was an accident waiting to happen.

              The deeper causes were organizational and managerial. This presentation addresses:

              (1) A failure of Leadership, Culture and Priorities

              (2) The four States of Man (Risk Ignorant, Cavalier, Averse and Sensible)

              (3) Inconvenient Truths

              (4) The importance of simplicity

              (5) Seven Steps to the loss of Nimrod (over 30 years)

              (6) Seven Themes of Nimrod

              (7) Ten Commandments of Nimrod

              (8) The four LIPS Principles (Leadership, Independence, People and Simplicity)

              (9) The four classic cultures (Flexible, Just, Learning and Reporting Cultures)

              (10) The vital fifth culture (A Questioning Culture) "

              See especially point 10: A Questioning Culture.

              In various places, just search for it (I have to be elsewhere ASAP).

              As well as the Nimrod enquiry, from memory he also did the inquiry for Piper Alpha oil rig disaster and the Herald of Free Enterprise ferry disaster.

              1. Tom Paine

                Re: Feynman: see also Haddon-Cave

                Another tangent: accident investigation reports can be very thought provoking, as well as interesting in their own right. Chernobyl, both Shuttle accidents, the Deepwater Horizon / Macondo 252, Piper Alpha, and all sorts of air accident investigation reports -- all have lessons, and describe similar patterns of organisational and system design or operation failures or accidents waiting to happen to those in many fellow commentards' workplaces. Recognising them doesn't necessarily help you stop them happening, because the root causes are often many pay grades above one's own, but it does make saying "I told you so" more fun.

                1. Anonymous Coward
                  Anonymous Coward

                  Re: Feynman: see also Haddon-Cave

                  "Recognising them doesn't necessarily help you stop them happening, because the root causes are often many pay grades above one's own."

                  True.

                  "it does make saying "I told you so" more fun."

                  Please don't take this the wrong way, but how much fun is there when being ignored by management leads to e.g. a fatal incident which could easily have been avoided?

              2. Tom Paine
                Thumb Up

                Re: Feynman: see also Haddon-Cave

                I think this is the Charles Haddon Cave talk you refer to:

                https://www.youtube.com/watch?v=y99_lhFFCsk

                1. Anonymous Coward
                  Anonymous Coward

                  Re: Feynman: see also Haddon-Cave

                  That'll do nicely, thanks. Charles Haddon-Cave's Piper Alpha 25 presentation session is a good place to start. It's nearly an hour long, but can mostly be treated as radio.

                  There is an almost identical script (or maybe transcript) at

                  https://www.judiciary.gov.uk/wp-content/uploads/JCO/Documents/Speeches/ch-c-speech-piper25-190613.pdf

          2. Anonymous Coward
            Anonymous Coward

            Re: Typo? Looks strange - Everyone should read Richard Feynman.

            Mrs. May wants to know why you want people reading stuff that would be useful to terrorists.

            (I would prefer that Daesh supporters continued to believe in miracles rather than science, thanks.)

            1. Allan George Dyer

              Re: Typo? Looks strange - Everyone should read Richard Feynman.

              Voyna i Mor - "(I would prefer that Daesh supporters continued to believe in miracles rather than science, thanks.)"

              Really? If they believed in science, surely they'd stop supporting Daesh?

              What is the scientific likelihood of enjoying 72 virgins (or white raisins) after death?

          3. Tim Jenkins

            Re: Typo? Looks strange

            Everyone should read the Rogers Commission appendix by Richard Feynman at the very least:

            "For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."

            https://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Appendix-F.txt

            1. John Smith 19 Gold badge
              Unhappy

              Re: Typo? Looks strange

              ""For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.""

              I can't recall if it was him or AC Clarke who commented "Against the laws of Physics there are no appeals."

              The Universe does not care how rich, famous or powerful you are. If a meteorite comes through your roof all that matters is are you in its path or not (yes people really have died of this).

          4. asdf

            Re: Typo? Looks strange

            >Everyone should read Richard Feynman.

            The most underrated and largely unknown boffin by the public probably of all time (certainly of the 20th century). Though Maxwell is right up there as well.

        3. Tom 7

          Re: Typo? Looks strange

          When working on fibre optics we used to work to an error rate of less than 1 bit in 10**14 bits. It's actually not that hard to work out whether you are above or below that level in theory. Sitting in the lab for as long as it takes to check that, on average, less than 1 bit every 3 days is wrong at 400Mb is another matter altogether.
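The "1 bit every 3 days" figure follows directly from the two numbers in the comment. A minimal sketch of the arithmetic, assuming "400Mb" means a line rate of 400 megabits per second (the comment does not spell out the unit):

```python
SECONDS_PER_DAY = 86_400


def seconds_per_error(bit_rate_bps: float, bit_error_rate: float) -> float:
    """Mean time in seconds between bit errors on a continuously loaded link."""
    if bit_rate_bps <= 0 or bit_error_rate <= 0:
        raise ValueError("bit rate and error rate must be positive")
    return 1.0 / (bit_rate_bps * bit_error_rate)


if __name__ == "__main__":
    rate_bps = 400e6  # assumption: 400 Mbit/s
    ber = 1e-14       # 1 bit in 10**14, as quoted in the comment
    days = seconds_per_error(rate_bps, ber) / SECONDS_PER_DAY
    print(f"Mean time between errors: {days:.2f} days")
```

On that reading the link sees one errored bit a little under every three days on average, so confirming the rate experimentally needs many multiples of that in lab time.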

  5. Buzzword

    Workers defending their territory; managers afraid to challenge them.

    This sounds like a situation where each worker aggressively defends his or her patch. "No, you can't possibly merge my legacy paper reporting system with Bob's new email reporting system, because [insert ridiculous reason here]." Given the chance, most of us will defend the systems we maintain (and by extension our jobs): it's human nature. A manager's job is to challenge the ridiculous reasons given.

    BA's management are squarely to blame here.
