TALE OF FAIL: Microsoft offers blow-by-blow Azure outage account

Microsoft has published a full, frank, and ugly account of just what went wrong when Azure Storage entered Total Inability To Support Usual Performance (TITSUP) mode in November. The nub of the problem was that Azure's update procedures and code had “... a gap in the deployment tooling that relied on human decisions and …

  1. Anonymous Coward
    Trollface

    Does sound complicated

    Could they not call in someone who knows how to run this sort of thing rather than trying to go it alone?

    This is the problem with so many mom-and-pop software houses: they need to have technical backup if they are selling this sort of thing to companies.

    1. Blane Bramble

      Re: Does sound complicated

      The problem is the software they bought doesn't have any warranty if it doesn't work.

    2. FlatSpot
      Trollface

      Re: Does sound complicated

      Or just fire the arrogant IT engineer who thought they knew best?

    3. Disko
      Mushroom

      Re: Does sound complicated

      One wonders why they did not call in support from their supplier before rolling out such an important update. I guess Clippy was also on break?

  2. Arctic fox
    Windows

    Hmm... Whilst there is much that one can criticise Redmond for I have to say..............

    ...........that their openness about what went wrong on this occasion is to be welcomed. I am sure we can all think of one or another example (no names, no pack-drill) of BigCorp who would under no circumstances have published such a report when its admissions were as embarrassing to them as this report must be for MS.

    1. Trevor_Pott Gold badge

      Re: Hmm... Whilst there is much that one can criticise Redmond for I have to say..............

      For all Microsoft's faults, the Azure guys have been pretty good about doing postmortems. They deserve some kudos.

      1. Tim 11

        Re: Hmm... Whilst there is much that one can criticise Redmond for I have to say..............

        I'm not a big fan of MS or Azure, but I have to admit they're in a no-win situation here. Being open is definitely what we'd like to see from these mega-service-providers, but it just gives a bigger attack surface for their detractors.

      2. Arctic fox

        @Trevor_Pott Re: "For all Microsoft's faults, the Azure guys ........."

        Indeed Trevor, that has been my impression. However, not just with their "cloudy compadres" but also with certain other parts of the conglomerate that MS clearly is. There appears to be a process going on there that may or may not lead to collective improvement in the company as a whole. It will be interesting to see what actually comes out of the wash in the medium to long term.

        1. Trevor_Pott Gold badge

          Re: @Trevor_Pott "For all Microsoft's faults, the Azure guys ........."

          Aye, but a lot of the "improvements" have been pretty half-assed. See: the VDI licensing redux. Okay, they rationalized it, sort of. But it's still shit and it's still a horrid rip-off. Microsoft knows it has a problem. But it may not be institutionally capable of understanding what needs to be done to resolve it.

      3. Anonymous Coward
        Anonymous Coward

        Re: Hmm... Whilst there is much that one can criticise Redmond for I have to say..............

        From the MS article: Linux nodes... Unaffected. Windows nodes... Bucket of Fail. Sloppy seconds.

        There's a lesson here somewhere...

    2. Roo
      Windows

      Re: Hmm... Whilst there is much that one can criticise Redmond for I have to say..............

      "...........that their openness about what went wrong on this occasion is to be welcomed."

      I'll second that, well played MS Azure folks.

  3. Mystic Megabyte
    Headmaster

    eh?

    Flight or fight, that is the question.

    Sorry but cannot email corrections from here.

    1. MatthewSt
      Joke

      Re: eh?

      Was just about to post the same thing:

      “... an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during fighting (testing).”

      Missing an l there. Rather amusingly, this is quoted from the Reg's own linked article, which is also spelt wrong, but that article links to the source, where it is spelt correctly! So much for journalism being all copy and paste...!

    2. David Roberts
      Thumb Up

      Re: eh?

      Oh, I don't know. I quite liked the picture of a knock-down, drag-out, bare-knuckle fight between developers and system testers that this conjured up.

      Even more so between the project team and the customer.

      "Here you are - it fully meets the signed off requirements and test specifications."

      "Oh, no, we've changed our minds. And the colours are so last year."

      {sound of jackets being removed, sleeves being rolled up, and lengths of metal pipe being slipped out of waistbands}

      Sigh.....a man can dream.....

  4. Graham 24

    Openness Preferred

    Given the choice between this approach of telling all, and the opposite approach of a "there there, don't worry your pretty little customer head about it - we've fixed the problem, and that's all we're saying", I would prefer the first.

    At least with openness I'm in a better position to decide whether this sort of outage is likely to reoccur.

    As a somewhat stretched analogy, if my car broke down, was towed to a garage and subsequently returned to me working again, I would expect the garage to tell me what was wrong and what they did to fix it.

    1. Ole Juul

      Re: Openness Preferred

      . . . and what they did to fix it.

      I thought this was closed source.

  5. Anonymous Coward
    Anonymous Coward

    Openness is to be welcomed, but it's competence that is the real benchmark of the right provider.

    If your IT admin admits to never having patched a server then that's better than him not telling you, but it's still a good reason to find someone fit for the job.

    Similarly here, it's interesting and surprising that Microsoft are being frank that a software bug caused such a widespread outage, but it's not something I've ever seen happen with their principal rivals. Single points of failure fail, but when a software issue causes an outage across multiple regions it suggests a fundamental flaw in their rollout process, and it makes one wonder what other flaws they're not aware of.

    Why build a highly available, multi-region application on a platform that might bork all of the regions at once because someone hit the wrong button?

  6. Anonymous Coward
    Anonymous Coward

    No wonder MS are grovelling: their SLA uptime is 99.9%, but they have consistently failed to hit anywhere near that target, and they are one of the lowest of the majors in the industry. I feel sorry for the poor old engineer who flicked the switch; special high intensity training will be appearing on his next targets. The real cause is buggy code and poor SOPs that failed to stop it reaching the front end.

  7. I ain't Spartacus Gold badge

    We went with a cloudy online accounts package. Not picked by me, but I did a bit of research, and was not impressed to find they'd had an outage the previous year, where they'd not only been down for about 4 days, but also lost a couple of previous days' work for clients. I'm guessing the incremental backups were buggered too, so they had to go back to a full one. That could actually be hard to fix, if you don't know exactly what you input when - and you've filed the paperwork as you input it.

    The reason I didn't object is that they had a full post-mortem. They'd published what was happening at the time, then had blog posts every week or so for the next few months, with what had gone wrong, why, what mistakes they'd made, what they were thinking of doing to stop it in future, and then what progress they were making on setting up their systems.

    It was because they'd started small, grown, then not upped the IT to match the new business. It was impressive enough that we went with them. Any cloudy accounts system is a risk, but ours are simple enough that we can rebuild from our locally held copy of their backup (something they introduced after the fiasco) - and it wouldn't actually be that expensive to rebuild from the filed paperwork.

    Actually they're now several years from doomsday, so I wonder if complacency will be kicking in again, and we should move to another supplier who's just had a major cock-up themselves and is now in full repentance mode?

    1. P. Lee

      I'm curious about the reasoning behind going for a cloudy accounts system. You said you didn't pick it, but what was the reasoning? I would have expected demand to be reasonably static (at least nothing that an out-of-hours batch run couldn't handle) and resource requirements not too high.

      I'm a bit old fashioned; I would have thought batch input would be ideal for this kind of application, and that means you can feed data to a backup system easily if DR/continuity is an issue (see the rough batch sketch below). That means you don't need the kind of engineering the cloud typically uses, so you get a cheaper system (hopefully).
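
      A minimal sketch of the kind of batch-fed backup being described, assuming a simple CSV ledger; the paths, file layout, and helper names below are illustrative assumptions only, not anything from the comment or a real product:

        # Illustrative nightly batch job: append the day's input files to the
        # primary ledger, then mirror the whole ledger to a backup copy.
        # All paths and the CSV layout are assumptions made for this sketch.
        import csv
        import shutil
        from datetime import date
        from pathlib import Path

        PRIMARY = Path("ledger/transactions.csv")   # main accounts data
        BACKUP = Path("backup/transactions.csv")    # DR / continuity copy
        INBOX = Path("inbox")                       # one dropped-off CSV per day

        def run_nightly_batch() -> None:
            PRIMARY.parent.mkdir(exist_ok=True)
            BACKUP.parent.mkdir(exist_ok=True)
            with PRIMARY.open("a", newline="") as ledger:
                writer = csv.writer(ledger)
                for batch in sorted(INBOX.glob("*.csv")):
                    with batch.open(newline="") as f:
                        for row in csv.reader(f):
                            writer.writerow(row)
                    batch.rename(batch.with_suffix(".done"))  # mark as processed
            # Because the input is batch-oriented, feeding a second system is
            # just a file copy rather than continuous replication.
            shutil.copy2(PRIMARY, BACKUP)

        if __name__ == "__main__":
            run_nightly_batch()
            print(f"Batch for {date.today()} applied and mirrored")

      In other words, with a daily batch as the only input, keeping a standby copy in step is a file copy rather than cloud-grade replication machinery.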

      1. I ain't Spartacus Gold badge

        P. Lee,

        We're a small company. Fewer than 10 people. We have no IT people (I'm strictly an amateur), two people are on the road, and several work at least as much from home as from the office.

        So anything we do in-house would need to involve paying someone competent to set up and manage it. I can do a bunch of stuff, and the internet means I can work out a lot of the rest - however, unskilled but vaguely competent isn't really good enough.

        For companies like us the cloud is amazing. There are all sorts of things like CRM that we couldn't have dreamed of doing 15 years ago, and that were prohibitively expensive when we looked 10 years ago. There are various risks with cloud providers, but these are no worse than the risks of us (basically me) buggering up running our own server(s).

        So cloudy accounts and payroll for the 2 of us who need it. CRM that we can all access, including from mobiles / tablets. Office 365 - so we've got linked diaries and our CRM can link our emails in. And all for less than £4k a year. Chuck in 1 or 2 new laptops a year and that's an amazing IT budget for what we get.

  8. Anonymous Coward
    Anonymous Coward

    +1 for Microsoft

    "+1 for Microsoft". Now that's a title I never thought I'd write. But this is exactly the kind of sharing that makes all of us in the industry better. I liken it to the reports that the NTSB does after plane crashes or that the legendary Sheck Exley wrote up for divers. No punches pulled, but no emotion or politics. Raises everybody's game.

  9. Anonymous Coward
    Anonymous Coward

    Good root cause analysis and transparency...

    ... is the best way to fix issues and learn lessons. No spin. No Bullshit. Just clean and clear 'hands up'. Kudos Microsoft.

    Now if you could just stop fucking up products with piss-poor patches every month, we can all go back to the usual business of complaining about Windows 8/2012.

  10. ecofeco Silver badge
    Facepalm

    Cloud fail again and again

    How many years will it take until people realize that handing your valuables to complete strangers is really, really stupid?

    You would think after a few thousand years people would get this.

  11. Terafirma-NZ

    It's not just Azure. I have to say that making MS people run their products for customers (not MS internal IT) has pushed product development in a good direction for server products. Now if they would just stop trumpeting the only-on-Azure-or-O365 rubbish and provide both self-install and online options for all features and all products.

  12. Anonymous Coward
    Anonymous Coward

    Whodda ever thunk?

    How could Microsucks ever have buggy software that results in losses or crashes? Have you checked the accuracy of these claims?

    Duh.

  13. ben_myers

    All this flighting???

    "Flighting"? That's a new one. Enough for one to flight or flee from Azure Cloud, IMHO. Better to use a professionally run cloud.

    1. Anonymous Coward
      Anonymous Coward

      Re: All this flighting???

      We are on the Amazon cloud. If we were on Azure we would have a lot of explaining to do, because we house part of our customer's IT on the shit (no customer data, just a management application).

      1. Mark Dempster

        Re: All this flighting???

        So Amazon have never had an outage? Or they have, but not told you what happened?

      2. Howverydare

        Re: All this flighting???

        So what did you say to your customers when AWS went down?

        They've all had issues. Amazon screwed up a few of my customer VMs before when they changed hypervisors. Performance on the old tier was abysmal at best for a good while, too.

  14. Craig 2
    Facepalm

    Cloud services: You are the beta testers.

  15. Simon Watson

    Single Point of Failure

    I think when the outage occurred a lot of people were asking themselves how MS had managed to set up cloud infrastructure that obviously had a single point of failure.

    I think by publishing this in such detail they have shown that they don't have one, but that they managed to simulate one very well by doing an entire platform upgrade at once instead of following their flighting policy and running the two configurations in parallel for a while (a rough sketch of that sort of flighted rollout follows below).

    What happened to MS could probably happen to any cloud provider if they made a similar mistake during an upgrade.
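
    For readers unfamiliar with the term, here is a rough sketch of what a "flighted" rollout policy amounts to: push the new build to a small slice first, check health, and only then widen the ring. The ring names, health probe, and deploy hook below are illustrative assumptions, not Microsoft's actual tooling:

      # Illustrative staged ("flighted") rollout: each ring only receives the
      # new build after the previous, smaller ring has run it healthily.
      import time

      RINGS = ["test-slice", "single-region", "paired-regions", "all-regions"]

      def deploy(ring: str, build: str) -> None:
          print(f"deploying {build} to {ring}")   # stand-in for real deployment tooling

      def healthy(ring: str) -> bool:
          # Stand-in for real telemetry: front-end error rates, latency, etc.
          return True

      def flighted_rollout(build: str, soak_seconds: int = 5) -> None:
          for ring in RINGS:
              deploy(ring, build)
              time.sleep(soak_seconds)            # let the change soak before widening
              if not healthy(ring):
                  raise RuntimeError(f"rollback needed: {build} unhealthy in {ring}")
          print(f"{build} fully rolled out")

      if __name__ == "__main__":
          # Pushing everywhere at once amounts to skipping straight to the last ring.
          flighted_rollout("storage-frontend-update")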

  16. RegGuy1 Silver badge

    Is it coz Ballmer's gone...

    ... that they can be open?

  17. Stevie

    Bah!

    This article was hard for me to read the day after the office Xmas party, mainly because of two style points:

    a) clever multi-word alliterative phrases in the body of the article instead of just in the headline where they belong. Those caused an annoying ringing in the brain.

    2) When using multi-word phrases for a thing that will be mentioned every few lines, Mr Rapknuckle taught the students of St John Backsides Comprehensive that best practice was to use the phrase once, show an acronym for it in parentheses right after the first use, then use the acronym thereafter. The repeated instances of Azure Blob storage Front-Ends made the ringing worse.

    Well played, sir.
