Microsoft reveals train of mistakes that killed Azure in the South Central US 'incident'

Microsoft has published the preliminary findings for what it calls “the South Central US incident”, but what many will call “the day the Azure cloud fell from the sky”, and it doesn’t make for happy reading.

Thunder and lightning, very very frightening

As is well known now, high energy storms hit Southern Texas early in the …

  1. Ken Moorhouse Silver badge

    RE: asynchronous nature of geo-replication could have led to data loss

    One of the reasons I am anti-cloud is the above. The engineers in this particular case thankfully prioritised customer data integrity over other agendas, but if other decision-makers had chosen a different priority, who knows what could have happened to the data. Customers of cloud services have no say in such decision-making.

    Databases are particularly vulnerable to this kind of data loss. In the case of an on-premises outage a skillful systems person could stitch things back together again, having direct access to the actual underlying files: Not easy, but arguably doable. But if you are prodding a database engine remotely, with the potential that engineers at the cloud end are also working on data recovery (which you have no knowledge of, or control over), the difficulties are a lot greater.

    1. Mark 110

      Re: RE: asynchronous nature of geo-replication could have led to data loss

      It's interesting. I like their honesty though. Not often you get to learn from a fuck up. I'm pleased they are sharing so we can all improve, unlike all the other fuckers who keep the root cause of their outages, or breaches, a secret.

      1. steviebuk Silver badge

        Re: RE: asynchronous nature of geo-replication could have led to data loss

        That's 'cause Mark Russinovich is in charge.

        I'm not a fan of the cloud and yes, Mark, it is because I fear for my job, but also because of issues like this.

        But he's always been quite honest since moving to Microsoft after they bought Sysinternals, especially when he does his Sysinternals talks and pokes fun at the Office team. He's also done a talk before about another time when Azure went down big.

    2. Sir Loin Of Beef

      Re: RE: asynchronous nature of geo-replication could have led to data loss

      The cloud just puts your data in the hands of other engineers. It is still human driven. Better to have things local.

      1. Mark 85

        Re: RE: asynchronous nature of geo-replication could have led to data loss

        Sir Loin Of Beef says it all... Manglement forgets what happens when you put your data on other people's servers. Maybe they should forget that also when it comes to profits, bonuses, etc. I'll be glad to hold all their assets for them.

    3. Doctor Syntax Silver badge

      Re: RE: asynchronous nature of geo-replication could have led to data loss

      "Databases are particularly vulnerable to this kind of data loss. In the case of an on-premises outage a skillful systems person could stitch things back together again, having direct access to the actual underlying files: Not easy, but arguably doable."

      Some customers might have wanted to prioritise failover over integrity (management, of course, would want both but pay for neither, but that's a different argument). On-premises gives them the power to decide.

    4. Ken Moorhouse Silver badge

      Re: the potential that engineers at the cloud end are also working on data recovery

      I had this problem with one of my customers once, so I speak from first-hand experience.

      Their Netware server hardware died. Having replaced it, I then rebuilt the data from a remote backup that I was carrying out nightly on their behalf (remote as in remote from their location). High-priority documents and databases were targeted first, followed by the thousands of images they had built up over the years.

      The remote backup was restoring files at the speed expected of an ADSL link. Being JPEG images, file compression is of no benefit and can even make transmission take longer.

      Halfway through this procedure I was finding anomalies in what was being restored. It turns out that one of the people responsible for some groups of images, frustrated by the delay, had found a local backup of images and had copied those back onto the server. This process was competing against the remote restore, causing mayhem with "file in use" and version problems. With this intervention the whole restore took a lot longer, with many doubts as to whether everything was up to date or not.

      This was a problem with someone in the same location as me, in spite of my having communicated my method of restoring data to all staff. Imagine a similar situation where you have no control over the remote end.

    5. Steve_Jobs1974

      Re: RE: asynchronous nature of geo-replication could have led to data loss

      When people say Cloud, they need to differentiate Azure from AWS in terms of reliability.

      AWS have had Availability Zones from day one, allowing for synchronous replication of databases. They have built all their higher-level services on top of these, and they are available in every region.

      1. Anonymous Coward
        Anonymous Coward

        Re: RE: asynchronous nature of geo-replication could have led to data loss

        Amazon also have thousands of employees working in slave-like conditions, earning such piss-poor wages that they also qualify for and rely on welfare handouts, while Amazon avoid as much tax as they possibly can and claim any and all the government grants they can.

        As much as I feel that Microsoft still has a way to go with regard to Azure stability, they're getting there, and I'm willing to put up with things like this if it means I can happily tell Amazon to go fuck themselves.

    6. Phil Endecott

      Re: RE: asynchronous nature of geo-replication could have led to data loss

      I have geographically distributed, replicated PostgreSQL databases in AWS.

      It is my choice whether that replication is synchronous or asynchronous.
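
      For anyone wondering what that choice looks like in practice, it's roughly this (a minimal sketch, assuming stock PostgreSQL streaming replication with a standby already listed in synchronous_standby_names; the connection string and table name are made up):

        # Sketch: per-transaction choice between sync and async commit in PostgreSQL.
        import psycopg2  # assumes the standard psycopg2 driver

        conn = psycopg2.connect("dbname=app host=primary.example.com")  # hypothetical DSN

        # Synchronous: COMMIT doesn't return until the standby has the WAL,
        # so a failover shouldn't lose this transaction.
        with conn, conn.cursor() as cur:
            cur.execute("SET LOCAL synchronous_commit TO 'on'")
            cur.execute("INSERT INTO orders (ref) VALUES ('sync-write')")   # hypothetical table

        # Asynchronous: COMMIT returns immediately; the last few transactions
        # can be lost if the primary dies before the standby catches up.
        with conn, conn.cursor() as cur:
            cur.execute("SET LOCAL synchronous_commit TO 'local'")
            cur.execute("INSERT INTO orders (ref) VALUES ('async-write')")

        conn.close()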

      Is there something inherently different about Azure?

      1. Anonymous Coward
        Anonymous Coward

        Re: RE: asynchronous nature of geo-replication could have led to data loss

        "Is there something different about Azure" - Yes there is a difference in Azure.

        Azure used an architecture principal of Region Pairs. They built their regions as an equivalent of a single AWS Availability Zone (AZ). Therefore you need to run multi-region in Azure to get the same resilience as multi AZ in AWS. Now Azure announced last year that they are going to start to build AZ's but this will mean a rewrite of all their higher level services and could take years in my opinion.

        The issue is that Azure Region Pairs have a lot of latency. If you try to run a synchronous commit database over this latency you will end up with terrible database performance. Therefore (just as the VSTS guys did) you need to use asynchronous replication which results in data loss if you don't failover nicely. This is a major problem in my opinion.
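
        To put rough numbers on that (back-of-the-envelope only; the round-trip times below are assumptions for illustration, not measurements), a synchronous commit can't complete in less than one network round trip to the replica, which caps how fast a single session can commit:

          # Sketch: best-case serial commit rate when every COMMIT must wait one
          # round trip to the replica (RTT figures are assumed, not measured).
          def max_serial_commits_per_sec(rtt_ms: float) -> float:
              return 1000.0 / rtt_ms

          for label, rtt_ms in [("cross-region pair, assumed ~40 ms RTT", 40.0),
                                ("AZs in one region, assumed ~1 ms RTT", 1.0)]:
              print(f"{label}: ~{max_serial_commits_per_sec(rtt_ms):.0f} commits/sec per session")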

        AWS AZs are low latency, allowing synchronous-commit databases across completely isolated datacenters. AWS should be able to handle the kind of incident that took down a single Azure region.

  2. Anonymous Coward
    Anonymous Coward

    lmao...

    oh... my side is hurting... stop it...

    1. Hans 1
      Windows

      Re: lmao...

      Schadenfreude? I see what you feel!

      I also loved the "metadata" BS. Just what is this metadata, and why is it stored in other regions? Maybe for the NSA to decide which data, belonging to the metadata, it would like to have to help its mates over at Boeing etc... Besides, I thought the data was encrypted by the customer and that MS could not read it, so how come MS have metadata for said data? Hmmm...

      Who still trusts MS ?

  3. Notas Badoff

    The American Midwest is famous

    for wild weather. I was once shown a building in Fort Worth, Texas that was scheduled to be dismantled because a tornado had twisted the 35-floor steel-frame building just enough to make it unserviceable. Stick "building twists" in your disaster plan!

    I figure every disaster recovery plan ought to be looked over by a Dutchman (floods), an Indonesian (earthquakes/volcanoes), and a Midwesterner (everything else?). There are some reasons for the crazed looks they have.

    1. kain preacher

      Re: The American Midwest is famous

      Come to California where you have fires, floods and earthquakes. Then somewhere like NYC where you have meters of snow.

    2. a_yank_lurker

      Re: The American Midwest is famous

      Canada and Buffalo, NY for snow removal. Gulf Coast and Florida for hurricanes. Oklahoma for severe tornadoes and thunderstorms. California for earthquakes and fires. Cascade Range for volcanoes. Got it all somewhere.

      1. The Oncoming Scorn Silver badge
        Coat

        Re: The American Midwest is famous

        Come to Alberta: if you don't like the current weather, there will be another along in the next 15-30 minutes.

        Coat icon, because it could snow at any given moment, even with clear blue skies overhead.

      2. Korev Silver badge
        Joke

        Re: The American Midwest is famous

        Canada and Buffalo, NY for snow removal. Gulf Coast and Florida for hurricanes. Oklahoma for severe tornadoes and thunderstorms. California for earthquakes and fires. Cascade Range for volcanoes. Got it all somewhere.

        So, what you're saying is "move out of North America"

        1. MyffyW Silver badge

          Re: The American Midwest is famous

          I'm sorry boys, but for being wrong-footed by the slightest of perturbations from the normal, I claim the title for Britannia and Queen Bess.

          Hurricanes - pah! Lightning - nah! A slightly slippery leaf is all it takes here.

        2. tony2heads
          Mushroom

          Re: The American Midwest is famous

          Well, don't wait for the Yellowstone Supervolcano to erupt.

        3. oldcoder

          Re: The American Midwest is famous

          There is always sunny Australia...

          :-)

        4. kain preacher

          Re: The American Midwest is famous

          Japan has giant lizards so be careful. Oz has killer drop bears so cross that off the list too.

  4. MerkavaPL

    First honest coverage of the event

    Truth and nothing but the truth

    Data centers can and will go down. Storms, floods, earthquakes and human mistakes are inevitable. Cloud-based global services shouldn't go down. MS have all the resources and knowledge required to be prepared and ready for the 'unthinkable'.

    1. Pascal Monett Silver badge
      Coat

      I would suggest that the next time Microsoft decides to plonk down a data center, they do a review of weather history for the locations they are looking into?

      Maybe avoid a place that has violent thunderstorms? Or at least factor that into the requirements?

      Do any of the cloud providers have data centers in Tornado Alley? No? I wonder why . . .

      1. Flatlander29

        Google has a large data center in Council Bluffs, Iowa. Guess Iowa is no longer in Tornado Alley.

  5. FozzyBear
    Alert

    Uh!

    Here in Sydney, Australia, the only natural disasters we have to worry about are bush fires and the cosmos-level ineptitude of the politicians and PHBs.

    I'll let you decide which is worse.

  6. doug_bostrom

    The promise of cloud is lower cost, but with a heavy seasoning of better reliability. Presumably customers are doing the math on this and finding the promise to be true?

    Trading the random, asynchronous private data center outages of the old days for the modern, synchronized 100-megaton variety at least offers a ray of hope: everybody can take the day off, secure in the knowledge that everybody is stuffed. But that presumes they can still communicate.

    Perhaps "outage as a service" could be a thing? There must be a way to charge for this.

    1. Joe W Silver badge
      Pint

      Re: OAAS (outage as a service)

      Plus you can be quite sure that the pub across the road is still up and running, unless your shady (or was that cloudy?) provider is in the same area, so you can just head over there for an early lunch... (and unlike the BOFH you don't have to hook up the pub to your generator - but then you will have to pay for your drinks, unlike the BOFH).

      1. Anonymous Coward
        Anonymous Coward

        Re: OAAS (outage as a service)

        My lot pulled out of one cloud contract (meant to secure a fuckload of historical data that t' government might express an interest in for up to seven years) when, during post-deployment tests, we attempted to restore a file from backup and they were like "what backups? This is just cold storage, right?".

        Internal Audit's guy was clearly struggling for words when the IT manager (famed for being "adequately paranoid") said "we've not decommissioned the on-premises storage yet", and that was the end of the matter.

  7. Anonymous Coward
    Windows

    Sorted.

    That's why I'm happy to rely on MSFT's engineering resources; our on-premises BOFHs can't even get the firewall working properly.

    1. Pascal Monett Silver badge

      Re: Sorted.

      If you have a PEBCAK problem at that level, you don't have a sysadmin problem, you have a manglement problem.

      Fire him and get a competent BOFH. They do exist - for a price.

  8. Anonymous Coward
    Anonymous Coward

    Not all clouds are equal

    Azure rushed to expand globally, to demonstrate they are as big as AWS.

    This came at the expense of solid architecture. Their Azure region pair model is wrong, and they recognise it, having announced Availability Zones last year.

    The issue they now have is that building out Availability Zones could take years. They are going to have to re-write their PaaS services to take advantage of them.

  9. ForthIsNotDead
    Pint

    Well...

    Being an old miserable grey-beard type, I have some reservations about this new-fangled Cloud thingy. Something about putting all your eggs in one basket. At the same time, however, I do marvel at the technology and admire it. It seems that even when your data is distributed automagically across the world by your cloud provider, a major outage such as this can still cause major hassle.

    However, from what I am reading, no customer data was lost.

    I tip my hat to the engineers, and to Microsoft for being forthright and just telling the truth about what happened. You'll get a lot more slack from your customers, and respect to boot. Someone made the right call there.

    Beer icon, 'cause... well, you fixed it lads, kick back and have a beer.

  10. ClearlyNotMe

    It seems from the conversation so far that everyone is pretty adamant that the only way to be totally sure about your data is to run it locally. Does that mean that no one has ever suffered a local data loss or is it a case of the “tin hat” brigade having a slight case of survivor bias?

    For what it is worth I think having someone who has far more skills and expertise on hand than I do to run my day to day data needs makes sense. Where I add value is to not put all my eggs in one basket so that I am not “surprised” when the people with more skills and expertise have a bad day - because everyone has a bad day.

    I want their good days to be my good days and their bad days to be theirs alone.

    1. Spazturtle Silver badge

      Nobody is saying to only have it locally; you should always have an offsite backup. But you should also always have a local copy of the data.

      1. Killfalcon Silver badge

        Cloud storage is really cheap, and easy to expand.

        It's brilliant for off-site backups, so long as you trust its security and are legally allowed to put your data there (where 'there' means 'the actual datacentre(s) involved').

  11. SVV

    shiny new Azure Resource Manager (ARM)

    That acronym's already been taken, dudes. Don't suppose a large technology company would know that though.

    1. oldcoder

      Re: shiny new Azure Resource Manager (ARM)

      Microsoft never really cared if terms/trademarks were already used or not...

      They just use them anyway.

  12. Reginald Onway
    Facepalm

    Yet, there is no consequence for failure....

    People pay real money to have their stuff in the cloud, supposedly immune from earthly disaster.

    But no matter how much money they pay, there are still disasters, and no consequences for those providing the failed service. True, corporate protocol demands somebody in PR take five minutes to prepare an apology text (vetted by the lawyers, example below). That's it.

    Major disaster response:

    "We are SO sorry. We are SO sorry. We are SO sorry."

    (note to self: keep a local backup no matter what they say)

  13. Cuddles

    Odd design choice

    "as its thermal buffers were depleted, temperatures rose, and a shutdown started. Alas, this was not before temperatures had risen to the point where actual hardware, including storage units and network devices, were damaged."

    If you're going to have a system to protect hardware from damage due to overheating, it seems it would be a good idea to have it kick in before things get damaged due to overheating.
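
    In pseudo-ish terms, the point is just that the trip threshold needs headroom below the damage threshold (a sketch with invented numbers, nothing from the actual report):

      # Invented thresholds for illustration: shut down while there is still
      # thermal buffer left, well before hardware starts cooking.
      HARDWARE_DAMAGE_TEMP_C = 60.0    # assumed point where storage/network kit suffers
      GRACEFUL_SHUTDOWN_TEMP_C = 45.0  # trip here, leaving margin for the shutdown to finish

      def action_for(inlet_temp_c: float) -> str:
          if inlet_temp_c >= GRACEFUL_SHUTDOWN_TEMP_C:
              return "start graceful shutdown"   # before the damage point, not after
          return "keep running"

      print(action_for(47.0))  # -> start graceful shutdown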

    1. Spazturtle Silver badge

      Re: Odd design choice

      My guess is that there were some hot spots in the data center which could have been fixed by having some fans to circulate air, but at normal temperatures they weren't an issue, so somebody thought "We can deal with that some other time".

  14. Anonymous Coward
    Anonymous Coward

    Cloud. Perfectly named...

    Looks solid from a distance, but not so much close up; you can't build anything on a cloud; dumps its content when you least want it to; anything can pass straight through it, anywhere; essentially uncontrollable; and so on...

    LOL.

  15. thondwe

    Just Wondering

    How long does it take for "local is best" types to recover if a lightning strike fries their data centre? Everyone got a second data centre ready just in case - or is that in the cloud?? Full cupboard of spare hardware, expert engineers available on site 24x7x365? Someone constantly thinking up "unthinkable" scenarios to mitigate against and test? How much does that cost your business?

    Just saying: Azure/AWS is a service with lots of things being done behind the scenes; very few orgs have the resources to do better...

    1. Killfalcon Silver badge

      Re: Just Wondering

      A local DC is putting all your eggs in one basket.

      Co-location is having a second basket for your eggs, for the low low price of two baskets.

      In theory, the cloud is putting your eggs in many geographically diverse baskets for less than the cost of one local DC.

      In practice, the cloud is putting your eggs in someone else's basket for much less than the cost of one DC, because once you start talking about making dramatic savings to budget people, a red mist descends over their eyes and they just start *cutting* and *cutting* until there's only the bare minimum remaining.

      It's really freaky how often I've seen "we can get better service for less cost" become "we can get barely adequate service for much less cost".

    2. Ken Moorhouse Silver badge

      Re: How long does it take for "local is best" types to recover...

      Surprisingly quickly.

      I've dealt with quite a few disaster recovery situations over the years.

      One of them involved a flood where the customer's server and a few PCs were afflicted. Found a PC and printer that worked and got the programs and data on there so they could invoice. The telephones/broadband had died, the outlets being below the flood level, so having Cloud would not have worked for them.

      Nearest I've come to a lightning strike situation was a client who had an industrial unit next to the Euston main line. Power surges were such a regular occurrence that a community pressure group was formed to liaise with Network Rail. Routers, switches, cheap PCs and photocopiers had all succumbed to such events. So again, the network side was part of the weakness, which is bad for Cloud.

      MiFi-type devices would be a solution, but I would think that throughput would be dire unless every PC had its own device.

  16. Milo Tsukroff
    FAIL

    As usual, the air conditioning was the last thing they funded...

    > and also overloaded suppressors on the mechanical cooling system, shutting it down.

    So it was a mechanical issue, the air conditioning, which did the dirty deed. As usual, the air conditioning was the last thing they funded. I've seen this before: the A/C won't be properly funded until _after_ the huge outage caused by inadequate A/C. And if there's just one A/C system, not two for redundancy, you can be sure that the single system will go down. Typical for Microsoft: good with software, not so good with hardware.

    1. This post has been deleted by its author

    2. Waseem Alkurdi

      Re: As usual, the air conditioning was the last thing they funded...

      Applies to every other piece of crucial server infrastructure, including (a) UPSes and (b) the servers themselves.

  17. Waseem Alkurdi
    Joke

    Unfortunately, it appeared that ARM also struggled with customers experiencing time-outs and, of course, problems with resources that had underlying dependencies.

    Is that why we're still using x86 on servers? Never knew M$ loved ARM anyway.
