BA's 'global IT system failure' was due to 'power surge'

British Airways CEO Alex Cruz has said the root cause of Saturday's London flight-grounding IT systems ambi-cockup was "a power supply issue*" and that the airline has "no evidence of any cyberattack". The airline has cancelled all flights from London's Heathrow and Gatwick amid what BA has confirmed to The Register is a " …

  1. Tom Paine
    Pint

    "Tirelessly"?

    The airline's IT teams are working "tirelessly" to fix the problems, said Cruz.

    I bet they're not, you know. At the time of writing - 19:48 on the Saturday of a Bank Holiday weekend - I'm pretty sure they're tired, fed up, and just want to go to the pub.

    1. allthecoolshortnamesweretaken

      Re: "Tirelessly"?

      "Teams"? As in "more than one"? After all the RIFs?

      1. Yet Another Anonymous coward Silver badge

        Re: "Tirelessly"?

        They have one team on rentacoder and another from craigslist

        1. Anonymous Coward
          Anonymous Coward

          Re: "Tirelessly"?

          Nope, they are all on a flight from Mumbai to London. There is no one left here to fix the issues, they all got outsourced some time ago.

          Power Supply? Perhaps these 'support' guys think that a few PC supplies obtained from a market stall on the way to the airport will fix the issue.

          Seriously, don't BA have a remote DR site?

          Probably got axed or is in the process of being moved to India.

          1. handleoclast

            Re: "don't BA have a remote DR site?"

            Probably got axed or is in the process of being moved to India.

            Close. The people who kept insisting that BA invest in a remote DR site got axed and their jobs moved to India. Not only are Indians cheaper, they don't keep insisting on stupid ideas like remote DR that cost lots of money; they just do what they're told without question.

            1. Tbird

              Re: "don't BA have a remote DR site?"

              Couldn't agree more.

              Any decent-sized company would have a 'failover' or DR system ready to kick in from a remote location; it's standard practice even in smaller companies.

              Apart from the millions of pounds in claims, the government should fine BA for such poor practice. It's all very well saying 'rebook' or 'refund' and don't come to the airport, but people have holidays planned at the other end of the journey that have taken a year to plan and save for.

              We all know that a trip to Heathrow involves parking, taxis and motorways, and is not as simple as just 'popping over' when your flight's ready.

              Shame on you Alex Cruz, the shareholders should speak out!

          2. Version 1.0 Silver badge

            Re: "Tirelessly"?

            Seriously, don't BA have a remote DR site?

            Of course it does. It's kitted out with WMRN drives (Write Many, Read Never), but they were having a reliability problem with them causing slow writes, so they redirected the backups to /dev/null - it was much faster.

    2. anthonyhegedus Silver badge

      Re: "Tirelessly"?

      Do they have pubs in that part of India?

    3. Trixr

      Re: "Tirelessly"?

      I dunno, it's not a bank holiday in India, and they're probably flogging all the poor bastards to death over there.

    4. Robert E A Harvey

      Re: "Tirelessly"?

      It must be true. He was wearing a yellow high-viz waistcoat when he said it

      1. Duffaboy
        Trollface

        Re: "Tirelessly"?

        And a Clipboard

    5. Voland's right hand Silver badge

      Re: "Tirelessly"?

      They are lying too. Their system was half-knackered the day before, so things do not compute. It was definitely not a Saturday failure - it started 24 hours before that.

      I did not get my check-in notification until 10 hours late, and the boarding pass emails were 8 hours late on Friday.

      So they are massively lying. Someone should check if there are holes in the walls of their office at Waterside from Pinocchio noses punching through them at Mach 1.

      1. Anonymous Coward
        Anonymous Coward

        Re: "Tirelessly"?

        "So they are massively lying. Someone should check if there are holes in the walls in their office at Watersisde from Pinoccio nose punching through them at mach 1."

          À la 9/11?

    6. TheVogon

      Re: "Tirelessly"?

      Did TATA outsource it to Capita?

      1. W T Riker

        Re: "Tirelessly"?

        The correct spelling is crapita

    7. Aralivas

      Re: "Tirelessly"?

      I have worked in IT operations at different banks for almost 30 years.

      Unfortunately I have seen the same trend at every bank I have worked for: outsourcing the IT offshore.

      And each time the result was the same: poor service and a lot of disruption and miscommunication between offshore teams and local teams.

      However I cannot understand how a major airline like BA does not have a tested and validated disaster recovery plan.

      In banks it's common practice to run DR drills each year and validate all critical applications against a major incident (fire, power outage, earthquake etc).

      During those drills, which take place over two weekends, all IT staff are present and they simulate the full outage of a data center and try to bring up the most critical applications on the second data center. Normally the applications should be up in less than two hours, otherwise the DR test is considered a failure.

      Failure of a power supply is not a valid reason these days. A UPS (uninterruptible power supply) with strong batteries is able to keep up the most critical systems and servers for 24 hours or more.

      If British Airways' IT director decided not to have a disaster recovery data center and not to perform such disaster recovery drills yearly then he has to be fired! These are the basics of a Tier 0 (critical applications) IT architecture.

      The bad news is that if BA does not improve its IT architecture, the same issue could happen again.

      1. Amorous Cowherder
        Facepalm

        Re: "Tirelessly"?

        Not sure about other banks but it's part of our mandatory requirement to the auditors to prove we have a functioning DR site and appropriate, tested procedures for using it!

      2. Jellied Eel Silver badge

        Re: "Tirelessly"?

        This probably isn't a DR issue, but an HA one. BA relies heavily on IT to know where its aircraft, passengers, staff, luggage, spare crews, spare parts and everything else are, in real time. So lots of interdependent data that would need to be synchronously replicated between the DCs at LHR so that an accurate state table is maintained, even if X breaks. But then if X does break and there's data loss or corruption, getting back to a working state gets harder. Rolling back to a previous state may tell you where stuff was, but not where it is now. Which can be a fun sizing challenge if you don't have enough transaction capacity to handle an entire resync (a sketch of that below).

        Or maybe power & cooling capacity. Unusually for a UK bank holiday, the weather has been quite nice. So cooling & power demands increased in and around LHR, which includes lots of large datacentres. On the plus side, there'd be plenty of food for IT & power folks working at LHR given meals have probably been arriving, with no passengers to eat them.

      3. davidhall121

        Re: "Tirelessly"?

        24 hours on UPS...

        I think not!

        10 years in the DC business leads me to believe that the warm weather likely played a part!
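
        A rough sanity check on the 24-hour UPS claim above (all figures are assumptions for illustration, not BA's actual load): keeping even a modest critical load up for a day takes an enormous battery, which is why DCs normally bridge to generators within minutes.

            # Back-of-envelope: battery energy needed to ride out an outage on UPS alone.
            # All figures are assumptions for illustration only.

            CRITICAL_LOAD_KW = 500            # assumed critical IT + cooling load
            TYPICAL_UPS_KWH = 1000            # assumed usable energy in a large battery string

            def runtime_hours(load_kw, battery_kwh):
                return battery_kwh / load_kw

            print("Runtime on batteries alone: %.1f hours" % runtime_hours(CRITICAL_LOAD_KW, TYPICAL_UPS_KWH))
            print("Energy needed for 24 h: %d kWh" % (CRITICAL_LOAD_KW * 24))   # 12,000 kWh here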

      4. Anonymous Coward
        Anonymous Coward

        Re: "Tirelessly"?

        I worked with DR systems which have a multitude of recovery point objectives for the apps, from 1 to 24 hours. And failing back to the primary system at the end of DR has some serious omissions, so there's a tendency not to want to activate the DR unless absolutely necessary.

        As for testing the DR plan periodically? It isn't done... it would result in too much downtime of critical systems and take weeks to work out how to do it. We just pulled the wool over the customer's eyes and did a limited set of testing of apps.

        When I had to activate DR applications, the amount of reconfiguration work required and troubleshooting took 6 hours.

        The customer got what they specified and paid for.

        Welcome to the world of non-banking organisations.

        1. Anonymous Coward
          Anonymous Coward

          Re: "Tirelessly"?

          I worked for banks. They had DR sites. DR tests typically involved a long, carefully-sequenced series of events to migrate one, or at most a few, services between the sites. If they ever had a major event in one of the DCs and had to do an unplanned DR of a large number / all of the services, then I have no doubt they would have failed, both to do the DR and then shortly afterwards as a bank (and, depending on which bank it was, this would likely have triggered a cascade failure of other banks, with results which make 2007-2008 look like the narrowly-avoided catastrophe it was).

          Banks are not better at DR: it is just convenient to believe they are. We now know what a (partial?) DR looks like for BA: we should live in the pious hope that we never find out what one is like for a bank, although we will.

          Of course, when it happens it will be convenient to blame people with brown skins who live far away, whose fault it isn't: racism is the solution to all problems, of course.

        2. CrazyOldCatMan Silver badge

          Re: "Tirelessly"?

          Welcome to the world of non-banking organisations.

          Well - BA *used* to (in the early 90s) use mainframes running TPF. As did quite a lot of banks.

          Whether BA still do I don't know.

    8. Tom Paine
      Angel

      Re: "Tirelessly"?

      Finally, a bit of actual detail from Mr Cruz. I took the liberty of transcribing the relevant bits; hear it at http://www.bbc.co.uk/programmes/b08rp2xd (starts about 12m in).

      A: "On Sat morning, we had a power surge in one of our DCs which affected the networking hardware. That stopped messaging -- millions and millions of messages that come between all the different systems and applications within the BA network. It affected ALL the operations systems - baggage, operations, passenger processing, etc. We will make a full investigation..."

      Q: "I'm not an IT expert but I've spoken to a lot of people who are, some of them connected to your company, and they are staggered, frankly - and that's the word I'd use - that there isn't some kind of backup that just kicks in when you have power problems. If there IS a backup system, why didn't it work? Because these are experts - professionals - they cannot /believe/ you've had a problem going over several *days*."

      A: "Well, the actual problem only lasted a few minutes. So there WAS a power surge, there WAS a backup system, which did NOT work at that particular point in time. It was restored after a few hours in terms of some hardware changes, but eventually it took a long time for messaging, and for systems, to come up again as the operation was picking up again. We will find out exactly WHY the backup systems did not trigger at the right time, and we will make sure it doesn't happen again."

      (part 1)

      1. TkH11

        Re: "Tirelessly"?

        Doesn't explain how the network switches and equipment lost power. Were the UPSes properly maintained?

  2. Pen-y-gors

    Ho hum

    Another business where the phrase 'single point of failure' was possibly just words - or where one failure cascaded down to overload the backup.

    Resilience costs money.

    1. Grimsterise

      Re: Ho hum

      Amen

      1. Danny 14

        Re: Ho hum

        Said the same to our bean counter. Either give me the money for 2 identical systems, or the money for one and log my concern. Money for 1.5 won't work.

        1. h4rm0ny
          Joke

          Re: Ho hum

          Yeah, BA IT staff told their CEO they needed greater redundancy... So he fired them.

          They're called Tata because that's what they say once they've got your money.

          Tata have stated they'll be flying hundreds of engineers to the UK to resolve the problem. As soon as they find an airline able to transport them.

          It technically IS a power supply issue. Alex Cruz should never have had any.

          1. Anonymous Coward
            Anonymous Coward

            Re: Ho hum

            Agree, TCS is the cut rate provider among cut rate providers. They always seem to promise the moon to win contracts but the follow through has not been impressive based on the engagements I have seen.

            1. John Smith 19 Gold badge
              Unhappy

              "Agree, TCS is the cut rate provider among cut rate providers. "

              Sounds like they have a bright future joining the "Usual suspects" in HMG IT contracts.

              Bright for them. Not so bright for the British taxpayer.

            2. JimboSmith Silver badge

              Re: Ho hum

              I know a company (coz I used to slave for them) that went with a software/hardware supplier who promised the earth and then didn't deliver. The funny thing is they weren't cheap but they were cheerful when you called them to hear them say:

              "No our systems don't offer that functionality".

              "The old system did and that was a damn sight less expensive than yours"

              "We could develop it for you but that's going to cost dev time"

        2. Aitor 1

          Re: Ho hum

          But two identical systems capable of taking over from each other are not 2x the expense, but 4x the expense.

          So the intelligent thing here would be to have systems as light as possible (no Java, please, PLEASE), and have them replicated in three places.

          Now, knowing this type of company, I can imagine many fat servers with complicated setups.. the 90s on steroids.

          The solution, of course, is to have critical systems that are LIGHT. It saves a ton of money, and they could be working right now, just a small hiccup.

          Note: you would need 4x identical systems, + 4 smaller ones for being "bomb proof"

          2x identical systems on production. Different locations

          2x the above, for preproduction tests, as you can't test with your clients.

          4x for developing and integration. They can be smaller, but have to retain the architecture.

          At best, you can get rid of integration and be the same as preproduction.

          These days, almost nobody does this.. too expensive.

          1. Peter Gathercole Silver badge

            Re: Ho hum

            It does not have to be quite so expensive.

            Most organisations faced with a disaster scenario will pause pretty much all development and next phase testing.

            So it is possible to use some of your DR environment for either development or PreProduction.

            The trick is to have a set of rules that dictate the order of shedding load in PP to allow you to fire up the DR environment.

            So, you have your database server in DR running all the time in remote update mode, shadowing all of the write operations while doing none of the queries. This will use a fraction of the resource. You also have the rest of the representative DR environment running at, say, 10% of capacity. This allows you to continue patching the DR environment.

            When you call a disaster, you shut down PP and dynamically add the CPU and memory to your DR environment. You then switch the database to full operation, point all the satellite systems to your DR environment, and you should be back in business.

            This will not give you a fully fault-tolerant environment, but it will give you an environment which you can spin up in a matter of minutes rather than hours, and it will prevent you from having valuable resources sitting doing nothing. The only doubling up is in storage, because you have to have the PP and DR environments built simultaneously.

            With today's automation tools, or using locally written bespoke tools, it should be possible to pretty much automate the shutdown and reallocation of the resources.

            One of the difficult things to decide is when to call DR. Many times it is better to try to fix the main environment rather than switch, because no matter how you set it up, it is quicker to switch to DR than to switch back. Get the decision wrong, and you either have the pain of moving back, or you end up waiting for things to be fixed, which often takes longer than the estimates. The responsibility for that decision is what the managers are paid the big bucks for.
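
            For what it's worth, here's a rough Python sketch of the kind of shedding rules described above (all names and numbers invented, not any real automation product): stop PreProduction in a defined order, hand the freed CPU and memory to the DR partitions, then promote the shadow database and repoint the satellites:

                # Hypothetical "shed PP, grow DR" sequence - names and sizes are invented.
                SHED_ORDER = ["pp-batch", "pp-webtest", "pp-reporting"]   # least valuable first

                def invoke_dr(pp_partitions, dr_partitions):
                    freed_cpu = freed_mem = 0
                    for name in SHED_ORDER:                               # 1. shed PreProduction
                        cpu, mem = pp_partitions.pop(name)
                        freed_cpu, freed_mem = freed_cpu + cpu, freed_mem + mem
                        print("stopped %s, freed %d CPUs / %d GB" % (name, cpu, mem))
                    for name in dr_partitions:                            # 2. grow the DR partitions
                        print("adding %d CPUs / %d GB to %s"
                              % (freed_cpu // len(dr_partitions), freed_mem // len(dr_partitions), name))
                    print("promoting shadow database to full read/write")  # 3. promote the DB
                    print("repointing satellite systems at the DR site")   # 4. switch the traffic

                pp = {"pp-batch": (8, 64), "pp-webtest": (4, 32), "pp-reporting": (4, 32)}
                invoke_dr(pp, dr_partitions=["dr-db", "dr-app"])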

            1. Anonymous Coward
              Anonymous Coward

              Re: Ho hum

              You don't even need all that.

              Set up your DCs in an Active-Active setup and run them both, serving resources from the quickest location. Find some bullet proof filesystem and storage hardware that won't fall over if there are a couple of lost writes (easier said than done!), and make sure you use resource pools efficiently, with proper tiering of your applications.

              Then if one DC goes down, all workloads are moved to spare capacity on the other site - non-critical workloads automatically have their resources decreased or removed, and your critical workloads carry on running.

              After restoration of the dead DC, you manually or automatically move your workloads back, and your non-critical workloads find themselves with some resources to play with again.

              This is how cloud computing from the big vendors is designed to work. However, legacy systems abound in many places, including old data warehouses with 'mainframe' beasts. Sometimes it isn't easy to re-engineer it all into lightweight virtual servers. It's why the banks struggle so much and why new entrants to the banking sector can create far better systems.
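
              A minimal sketch of that tiering idea in Python (hypothetical workloads and capacities, not anything BA actually runs): critical workloads are placed first, and when a site drops out the bulk workloads are the ones that get shed:

                  # Hypothetical active-active placement with tiering; capacities and workloads invented.
                  SITES = {"dc-east": 100, "dc-west": 100}                 # units of spare capacity
                  WORKLOADS = [("booking", "critical", 40), ("ops-messaging", "critical", 30),
                               ("reporting", "bulk", 50), ("analytics", "bulk", 60)]

                  def place(workloads, sites):
                      """Critical workloads are placed first; bulk workloads only get what's left."""
                      placement = {}
                      for name, tier, need in sorted(workloads, key=lambda w: w[1] != "critical"):
                          site = max(sites, key=sites.get)                 # site with most headroom
                          if sites[site] >= need:
                              sites[site] -= need
                              placement[name] = site
                          elif tier == "critical":
                              raise RuntimeError("no spare capacity for critical workload " + name)
                          else:
                              placement[name] = None                       # bulk workload shed
                      return placement

                  print(place(WORKLOADS, dict(SITES)))                     # both DCs up
                  print(place(WORKLOADS, {"dc-west": 100}))                # dc-east has gone dark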

              1. Anonymous Coward
                Anonymous Coward

                Re: Ho hum

                Are you a sales guy for one of the large cloud providers by any chance?

                Who in their right mind would gift the entire internal workings of a huge multinational company to a single outfit?

                Why not create your own "cloud" infrastructure with your own "legacy" systems?

                That's the real way to do it from a huge company's perspective...

                1. Anonymous Coward
                  Anonymous Coward

                  Re: Ho hum

                  "Who in their right mind would gift the entire internal workings of a huge multinational company to a single outfit?"

                  Salesforce, Snapchat, AirBnB, Kellogg's, Netflix, Spotify... many companies, and more are moving workloads to some cloud service on a daily basis. You don't need to give it all to one cloud provider; it could be multiple... almost certainly will be multiple. My point is - how many occasions do you recall when amazon.com or google.com were down or performance degraded? Probably not once, despite constant code releases (they don't have a 'freeze period' - many daily releases), crazy spikes in traffic, users in all corners of the world, and constant DDoS and similar attempts. How many on-prem environments can keep email, or the well-worn application of your choosing, up with exceptional performance at those levels (planned outages counting... the idea that 'planned outages' exist and are considered acceptable or necessary is my point)?

                  "Why not create your own "cloud" infrastructure with your own "legacy" systems?"

                  You won't have the reliability or performance, especially to users in far flung corners, of a cloud provider. The financials are much worse on prem. In a cloud service you can pay for actual utilization as opposed to having to scale to whatever your one peak day is in a three year period (and probably beyond that as no one is exactly sure what next year's peak will be). Shut workloads off or reconfigure on the fly vs writing architectures in stone. No need to pay a fortune for things like Cisco, VMware, EMC, etc. Use open source and non proprietary gear in the cloud.

                  Also, all applications written post 2000 are natively in the cloud, aaS. There is no on prem option. Most of the legacy providers, e.g. Microsoft and Oracle, are pushing customers to adopt cloud as well. Unless you plan to never use anything new, you're bound to be in the cloud at some point.

                  1. patrickstar

                    Re: Ho hum

                    The full IT setup of a global airline is a lot harder to distribute than anything those companies do.

                    At least three (Netflix, Spotify and Snapchat) are just about pushing bits to end-users with little intelligence, which isn't even remotely comparable to what BA does. And Google search, while being a massive database, has no hard consistency requirements.

                    Kellogg's just uses it for marketing and development - see https://aws.amazon.com/solutions/case-studies/kellogg-company/ . Which seems like a pretty good use case, but again in no way comparable to airline central IT.

                    Salesforce has had at least one major outage, by the way.

                    Didn't Netflix move to their own infrastructure from the clown... with cost given as the reason?

                    Sigh, kids of today - thinking a streaming service is the most complex, demanding and critical IT setup there is. Next you'll suggest they rewrite everything in nodeJS with framework-of-the-day...

                    1. Anonymous Coward
                      Anonymous Coward

                      Re: Ho hum

                      "Kellogg's just uses it for marketing and development"

                      No they don't. They have SAP and all of the centralized IT running in cloud. That is just one example too. There are all manner of companies using it for core IT. Every Workday ERP customer, by definition, uses cloud.

                      "Salesforce has had atleast one major outage, by the way."

                      After they started moving to the cloud? They had several outages, which is why Benioff decided that maintaining infrastructure doesn't make them a better sales software company - "this isn't our area of expertise" - move it to the cloud. They just started their migration last year.

                      "Didn't Netflix move to their own infrastructure from the clown... with cost given as the reason?"

                      No, the opposite of that is true. They decided to move everything to the cloud purely for speed and agility... and then found, to their surprise, it also saved them a lot of money. Link below.

                      https://media.netflix.com/en/company-blog/completing-the-netflix-cloud-migration

                      "thinking a streaming service is the most complex, demanding and critical IT setup there is."

                      You think the IT that traditional enterprises run is *more* complex, demanding and critical than Google's, Amazon's, etc.'s services? The most valuable companies in the world... running billion-active-user applications worth hundreds of billions of dollars? Those are the most complex, high-value applications on the planet. You could add orgs like Netflix and Snapchat to those rosters. Those applications don't support "the business". Those applications are the business. The reason that traditional companies haven't yet moved the whole shooting match to cloud is nothing to do with the average insurance company doing something on the AS/400 which is just beyond the technical prowess of AWS or Google. It is because of internal politics... i.e. the IT Ops people dragging their feet because they think cloud makes them obsolete. I don't think this is true. It will work similarly to the on-prem environment for admins. It's not like people even go into their on-prem data centers that often... but the perception is that cloud is a threat, so IT Ops tries to kill any cloud initiative. I think those IT Ops people are harming their careers, as cloud is going to happen; you can delay it but you're not going to stop it. Soon every company will be looking for someone with 5 years of experience running production cloud from the Big 3, and those that dragged their feet the longest won't have it, whereas the early adopters will be in high demand.

                      1. patrickstar

                        Re: Ho hum

                        Regarding Kellogg's, if you read Amazon's own marketing material, it's about SAP Accelerated Trade Promotion Management (analytics) and HANA (in-memory database), not SAP ERP.

                        They are not related as to what tasks they do. HANA isn't even a SAP product originally.

                        As to Netflix - sorry, I confused them with Dropbox. Point remains.

                        That you keep bringing up Google or Netflix or whatever as a reason for why BA could migrate their IT infrastructure to AWS clearly shows you have absolutely no clue what you are talking about.

                        They are completely different applications. The issues and challenges are not remotely comparable in any way.

                        If you have a service that realistically can be distributed over a lot of unreliable hosts, then AWS or similar might be for you. Such as pushing a lot of bits (streaming services, Snapchat), or maintaining huge databases without hard consistency requirements (Google search, analytics). Neither of which is easy at those scales, of course, but they do fundamentally lend themselves to this way of operating.

                        What you need for core IT operations at eg. an airline or a bank is totally different.

                        Plus you are completely glossing over the myriad of very good reasons why someone could need their own infrastructure and/or fully control the staff involved. (Can you even get a current list of everyone with admin or physical access to the hosts from AWS...?)

                        1. Anonymous Coward
                          Anonymous Coward

                          Re: Ho hum

                          "What you need for core IT operations at eg. an airline or a bank is totally different."

                          You seem to think that because Amazon.com and Google search are different services from an airline's systems (although Amazon.com is pretty similar to an airline's systems), their cloud services cannot take on an airline reservation system or flight scheduling system. Google, for instance, has invented a totally revolutionary RDB (with ACID compliance, strong consistency) called Spanner which is perfect for an airline system or a bank... an infinitely scalable, infinite-performance traditional SQL DB.

                          "sorry, I confused them with Dropbox."

                          True, Dropbox did move their stuff off of AWS. I think for a service of Dropbox's size, with 600-700 million active users, moving off of AWS is not unreasonable. AWS is largely just IaaS with no network. Even so, it may make sense for a Dropbox, but that assumes you have many hundreds of millions of active users to achieve the sort of scale where that could make sense... and we'll see if that makes sense in a few years. AWS was charging huge mark-ups on their cloud services, massive margins, largely because until Google and Azure came on to the scene over the last few years they had no competitors. Now those three are in a massive price war and prices are falling through the floor. Cloud prices are going to continue to fall and it may not be viable in the future... This is a rare case, though. Dropbox is one of the largest computing users in the world. The average company, even a large company like BA, is not close to their scale.

                          "Plus you are completely glossing over the myriad of very good reasons why someone could need their own infrastructure and/or fully control the staff involved"

                          I don't think there are a myriad of reasons. The one reason people cite is security... just generally. I think this is unfounded though. Google, for instance, uses a private protocol within their data centers for security and performance (not IP). Even if you were able to get in, there isn't much you could do as any computer you have would not understand the Google protocol. Google builds all of their own equipment so there is no vector for attack. Unlike the average company which uses Cisco or Juniper access points with about a million people out there with knowledge of those technologies. DDOS is another good one. You are not tipping AWS or Google over with a DDOS attack, but could knock down an average company. As far as internal security, it is well locked down, caged, etc in any major cloud service. Nothing to worry about... AT&T, Orange, Verizon, etc could be intercepting and de-encrypting the packets you send over their network... but no one is worried about that because you know they have solid safeguards and every incentive not to let that happen. Everyone is using the "network cloud", but, because that is the way it has always worked, people just accept it.

                          1. patrickstar

                            Re: Ho hum

                            I'm sure that there is some possibility that in say 30 or 50 years time, having your entire business rely on Spanner could be a good idea. That's how long some of the existing systems for this have taken to get where they are today - very, very reliable if properly maintained.

                            As to security: I really can't fathom that you're trying to argue that it's somehow secure just because they use custom protocols instead of IP. Or custom networking gear (uh, they design their own forwarding ASICs, or what?).

                            At the very least, that certainly didn't stop the NSA from eavesdropping on the links between the Google DCs...

                            Pretty much everyone consider their telco links untrusted these days, by the way. Thus AT&T or whatever has no way of "de-encrypting" your data since they aren't involved in encrypting it in the first place. Have you really missed the net-wide push for end-to-end encryption?

                            I don't know offhand what hypervisor Google uses, but AWS is all Xen. Have you checked the vulnerability history for Xen lately? Do you really want Russia/China/US intelligence being able to run code on the same servers as you keep all your corporate secrets on, separated by nothing more than THAT?

                            Never mind how secure the hosting is against external parties, what if I want to know such a basic thing about my security posture as who actually has access to the servers? That's pretty fundamental if you're going to keep your crown jewels on them.

                            What if I need it for compliance and won't be allowed to operate without it?

                            How do I get a complete list of that from the likes of Google or AWS? Do I get updates before any additions are made? Can I get a list of staff and approve/disapprove of them myself? Can I veto new hires?

                2. Anonymous Coward
                  Anonymous Coward

                  Re: Ho hum

                  "Are you a sales guy for ..."

                  Re-read my post. I never said to use a large cloud computing company, although many people may choose this as a solution.

                  I said to set up *your* datacentres in an active/active mode rather than an active/passive mode. However this is really hard with some legacy systems. They just don't have the ability to transfer workloads easily or successfully share storage without data loss etc. I was acknowledging that active/active can be difficult with systems that weren't designed for it. However if you rewrite your system with active/active designed in then it is a lot easier.

                  1. Anonymous Coward
                    Anonymous Coward

                    Re: Ho hum

                    "I said to set up *your* datacentres in an active/active mode rather than an active/passive mode."

                    You can set up DBs in an active-active mode... and it isn't all that difficult to do. The problem is that it kills performance as you would need to write to the primary DB... then the primary DB would synchronously send that write to the second DB... the second DB would acknowledge to the primary that it has, indeed, written that data... then the primary could start on the next write. For every single write. Happening at ms rates, but it will have a performance impact if you are doing it in any sort of high performance workload. It is also really expensive as it involves buying something like Oracle Active Data Guard or comparable. You can also have multiple active DBs in an HA set up with RAC, assuming you are using Oracle. Problem there is 1) Really expensive. 2) The RAC manager evicts nodes and fails about as often as the DB itself so kind of a waste of effort. 3) All RAC nodes write to the same storage. If that storage name space goes down, it doesn't help you to have the DB servers, with no access to storage, still running.

                    The way to do it is to shard and cluster a DB across multiple zones/DCs... or Google just released a DB called Spanner, their internal RDB, which is on a whole new level. Really complicated to explain, but impressive.
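
                    As a toy illustration of that write path (purely a sketch - not Oracle's or Data Guard's actual API): with synchronous replication each commit blocks until the standby in the other DC acknowledges, so the inter-site round trip is paid on every write:

                        # Toy model of a synchronous commit path - not any real database's API.
                        import time

                        DC_ROUND_TRIP = 0.0005                      # assume 0.5 ms between the two DCs

                        def commit(record, synchronous=True):
                            write_local_log(record)                 # primary flushes its own log
                            if synchronous:
                                ship_to_standby(record)             # send the change to the second DC
                                time.sleep(DC_ROUND_TRIP)           # block until the standby acknowledges
                            # only now is the commit acknowledged back to the application

                        def write_local_log(record):
                            pass                                    # stand-in for the local disk flush

                        def ship_to_standby(record):
                            pass                                    # stand-in for the replication send

                        start = time.time()
                        for i in range(1000):
                            commit({"seq": i})
                        print("1000 synchronous commits: %.2f s" % (time.time() - start))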

                    1. TheVogon

                      Re: Ho hum

                      "The RAC manager evicts nodes and fails about as often as the DB itself so kind of a waste of effort"

                      Worked just fine for me in multiple companies on both Windows and Linux. What did Oracle say?

                      "All RAC nodes write to the same storage"

                      The same storage that can be synchronously replicated to other arrays across sites, with a failover time in seconds if it goes down. So clustering, basically. Like it says on the box... Or you can automate failover and recovery to completely separate storage with Data Guard, for a failover time of a few minutes.

                      "The way to do it is to shard and cluster a DB across multiple zones/DCs"

                      Like, say, Oracle RAC or SQL Server and presumably many others can already do, then. But only RAC is true active-active on the same DB instance.

                      "Google just released a DB called Spanner"

                      If I cared about uptime, support and the product actually existing next week, the last thing I would consider is anything from Google.

                    2. jamestaylor17

                      Re: Ho hum

                      ''The problem is that it kills performance...''

                      Not always - it depends on your distance. I've implemented Oracle DBs and replicated synchronously across miles of fibre without any knock-on performance hit - and that's with a high-performance workload. Of course, any major distance and you would be in trouble.

                      Your point about sharding is, of course, well made but the truth is BA's legacy architecture is unlikely to be suitable.

                      1. Anonymous Coward
                        Anonymous Coward

                        Re: Ho hum

                        "Not always, depends on your distance I've implemented Oracle DBs and replicated synchronously across miles of fibre without any knock on performance - and that's with a high performance workload. Of course and major distance and you would be in trouble."

                        True enough, fair point. If you have a dark fiber network of say 10-20 miles or some similar distance, you are going to need a tremendous amount of I/O (write I/O) before you bottleneck the system... the more you increase the distance, the fewer number of writes it takes to choke it. For most shops, they are never going to hit the write scale to performance choke the DBs.

                        "Your point about sharding is, of course, well made but the truth is BA's legacy architecture is unlikely to be suitable."

                        True, it would likely be a substantial effort to modernize whatever BA is using (many airlines are still using that monolithic mainframe architecture).
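
                        To put rough figures on that distance point, a back-of-envelope sketch (assuming light in fibre covers roughly 200 km per millisecond, i.e. about 5 microseconds per km, and ignoring switch and array overheads - illustrative only):

                            # Back-of-envelope: extra latency per synchronous write from the inter-site hop.
                            US_PER_KM = 5.0                      # ~5 microseconds per km in fibre (assumed)

                            def penalty_us(distance_km):
                                return 2 * distance_km * US_PER_KM          # out to the standby and ack back

                            for km in (16, 32, 160):                        # roughly 10, 20 and 100 miles
                                extra = penalty_us(km)
                                print("%4d km: +%.0f us per write, ~%d serial writes/s per stream"
                                      % (km, extra, 1000000 // extra))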

              2. The First Dave

                Re: Ho hum

                " Find some bullet proof filesystem"

                Find that and you can retire from IT...

                1. Real Ale is Best
                  Boffin

                  Re: Ho hum

                  " Find some bullet proof filesystem"

                  Find that and you can retire from IT...

                  Fast, cheap, reliable. Pick two.

            2. Anonymous Coward
              Anonymous Coward

              Re: Ho hum

              You're right. It's called a DRP and all DRPs ought to have a resumption plan.

          2. Wayland

            Re: Ho hum

            No, you need two systems both doing the job, like two engines, but with the capacity to carry on with one. Then have enough spares so you can go out and mend the broken one.
