Microsoft reveals terrible trio of bugs that knocked out Azure, Office 362.5 multi-factor auth logins for 14 hours

Microsoft has delivered its postmortem report detailing the failures that led to unlucky folks being unable to log into its cloud services for 14 hours last week. Redmond said on Monday this week that there were three separate cock-ups that combined to cause the cascading mess that left Azure and Office 363 users unable to …

  1. Anonymous Coward
    Anonymous Coward

    This hit us bad.

    The company I work for switched from locally hosted Exchange and Notes to Office 365 specifically on claims it was more reliable and faster (coincidentally, they also could layoff about 12 sysadmins). For the most part, it works, though with much greater latency (this is a rural area). But when there is an emergency, where multiple departments need to coordinate, it's proven useless on multiple occasions. When we have to enter data regarding an emergency almost in real time, we need 24/7 uptime. We're communicating at all hours. I work 0000-0800 CDT and need to know that my superior knows what happens before 0500. For the past couple of years, while everything has been sent through Office 362.5, the entire emergency services department I work in has taken to defaulting to Apple's iMessage for sharing secured emergency info (which also meant everyone needed an iPhone) to make sure it actually gets through. This is entirely unofficial, of course, and despite the Idiot Tax involved, it's saved lives and likely millions of dollars since I've been employed here. Sometimes "It just works" is just what we need.

    1. A.P. Veening Silver badge

      Cheaper, faster, more reliable

      "claims it was more reliable and faster (coincidentally, they also could layoff about 12 sysadmins)."

      Cheaper, faster, more reliable: Pick any two out of three.

      Hint: Neither faster nor more reliable is actually guaranteed.

    2. Anonymous Coward
      Anonymous Coward

      Re: This hit us bad.

      Cheaper yes, faster no, better no.

      1. Antron Argaiv Silver badge
        Alert

        Re: This hit us bad.

        Cheaper? ...for now

        Faster? (does anyone else remember diskless workstations, and before them, diskless X-terminals?)

        More reliable? From Microsoft? Surely, you jest...

        In theory, computing in the cloud should just work...multiple redundant servers, load balancing, unlimited storage and blindingly fast speed -- all those goodies. And The Internet hardly ever goes down, right?

        Thankfully, my company has not yet converted to O362.5, but the indications are that it will eventually happen -- Microsoft will force us to.

        We also have (an outstanding) in-house IT staff, but have recently been bought by a much larger corporation. As long as they don't outsource IT support, we'll probably be OK...

    3. Ben1892

      Re: This hit us bad.

      You're using email for emergency services? Is your name Moss, by any chance?

      https://www.youtube.com/watch?v=EzRFoO8wSVA

      1. DJO Silver badge

        Re: This hit us bad.

        Cheaper yes

        You wait until the majority of users are 100% committed and returning to in-house would be almost impossible due to "letting go" the staff with the necessary skills.

        Random CIO, Approx April 2020: "Ohh look, MS just jacked up the 365 subscriptions"

        1. Anonymous Coward
          Anonymous Coward

          Re: This hit us bad.

          No different to them hiking up Exchange licensing, or Windows licensing whenever they want. So what you gonna do? Stick to Pegasus Mail?

          We had one guy who did Exchange, and we have one guy who does 365, the same guy...

          Mail is one of the no-brainer things to SaaS; on-prem was a pain in the arse. TBs of storage, TBs more to back up...

  2. bombastic bob Silver badge
    FAIL

    DDoS'able logins - who'd a thunk it?

    Seems to me that having a login system that is _SO_ inefficient, and SO reliant on a single "provider", that a 30 second timeout on a login token is sufficient [under the right conditions] to create RACE conditions and other 'token expiration' related problems, that maybe... JUST maybe... the entire design needs to be COMPLETELY re-thought.

    All eggs: one basket. Yeah, THAT isn't a recipe for FAIL !!!

    It's COMPLETELY DDoS'able, as it only took "everyone flushing at once" (more or less) to cause the system to 'overflow' heh heh heh. Must've been REALLY fun in the basement bathrooms.

    MSDN has a somewhat 'paranoid' security model as well, one that expires a download link after about 4 hours. This means that very very large files over moderate connection speeds CAN NOT COMPLETE DOWNLOADING. When Micro-shaft's IIS servers did NOT follow the RFC's (a couple of years back), you couldn't even pick up where you left off - it was 'start from the beginning again' every time. Fortunately, they fixed that last part, eventually... [making it usable again with proper browser plugins or through-the-hoop jumping].

    NOW they're "at it again" with their "all eggs, one basket" approach to logins, and unrealistically short timeout periods on the tokens, not allowing for very busy networks, slow connections, or DDoS attacks.

    Wheeeee.

    This reminds me of a computer back in the late 70's that had an old-style 12" floppy drive connected to a serial terminal (access via serial and control chars on a shared serial line at 1200 baud). A grad student wrote an application in BASIC that allowed you to store things on it [inefficiently]. But, if the mini-computer had more than a handful of users on, when you tried to retrieve your stored files, you'd get buffer overruns and lost data. Often it was COMPLETELY unusable. I re-wrote a new version in assembly language that had proper buffering [and an actual file system on the disk]. I'd ask the drive for ONLY a track at a time (not 'flood me with all at once'), which fit nicely into the mini-computer's serial buffer, and no data was lost, even if the system was THRASHING because of too many users.

    Anyway...
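For readers wondering what "picking up where you left off" involves: resuming an interrupted HTTP download relies on the standard `Range` request header. The sketch below only builds such a request with Python's standard library. The URL and the helper name `build_resume_request` are hypothetical, and nothing here describes how MSDN or IIS actually behave. A server that honours the header replies with 206 Partial Content; one that ignores it replies with 200 and the whole file, which is the "start from the beginning again" case described above.

```python
import urllib.request


def build_resume_request(url: str, bytes_already_downloaded: int) -> urllib.request.Request:
    """Build a GET request that asks only for the part of the file not yet saved.

    Hypothetical helper for illustration; it does not contact any server.
    """
    if bytes_already_downloaded < 0:
        raise ValueError("offset must be non-negative")
    headers = {}
    if bytes_already_downloaded > 0:
        # "bytes=N-" means: send everything from byte offset N to the end.
        headers["Range"] = f"bytes={bytes_already_downloaded}-"
    return urllib.request.Request(url, headers=headers)


# Example: 1 MiB is already on disk, so ask for the rest.
request = build_resume_request("https://example.com/big.iso", 1_048_576)
print(request.get_header("Range"))
```

A real client would also check the response status before appending to the partial file, since a 200 response means the server restarted from byte zero.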

    1. Anonymous Coward
      Anonymous Coward

      Re: DDoS'able logins - who'd a thunk it?

      "Seems to me that having a login system that is _SO_ inefficient,"

      I don't see any evidence it's not efficient. It must cope with tens of millions of concurrent logins.

      "SO reliant on a single "provider""

      Well that's cloud for you. Or your own, for that matter. Not many people use two solutions for 2FA.

      "at a 30 second timeout on a login token is sufficient [under the right conditions] to create RACE conditions"

      It was a bug, it happens.

      "All eggs: one basket. Yeah, THAT isn't a recipe for FAIL !!!"

      Sure, but email is not a BC1 application for most companies. You can spend more keeping it on prem if you need to.

      "It's COMPLETELY DDoS'able"

      As above this was a bug. And it probably isn't externally DOSable as you need to authenticate to get access to 2FA.

      "MSDN has a somewhat 'paranoid' security model as well, one that expires a download link after about 4 hours. This means that very very large files over moderate connection speeds CAN NOT COMPLETE DOWNLOADING. "

      Not true - links only expire to initiate new downloads - they don't kill downloads in progress.

  3. Picky

    Standard Operating Procedure

    They called the hell-line ..

    Microsoft would eventually solve the problem by turning the servers off and on again after applying mitigations.

  4. Mario Becroft
    FAIL

    What's in the fine print?

    How many 9's of uptime are you promised as a paying commercial MS customer? What is the recourse if they fail? This is basic due diligence you would do on any vendor. I wonder if MS etc. are perceived as "too big to fail," and the (arguable) convenience of SaaS leads organizations to move to services that simply have no assurance of service level.

    1. katrinab Silver badge
      Windows

      Re: What's in the fine print?

      I read the SLA. They promise one 9 of uptime (95%), so it should be renamed Office 347.

    2. TheVogon

      Re: What's in the fine print?

      "How many 9's of uptime are you promised as a paying commercial MS customer?"

      In general, three 9s - for basic services and Office 365. Four 9s for certain cross availability zone clusters, etc.

      "What is the recourse if they fail?"

      Service credits.
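To put the "nines" in this sub-thread into hours, here is a small arithmetic sketch. It assumes a 365-day year and a yearly measurement window purely for illustration; the thread does not say which window the actual SLA uses, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in the period at the given availability."""
    return period_hours * (1 - availability_percent / 100)


def equivalent_days_up(availability_percent: float) -> float:
    """Days per 365-day year the service would be up at that availability."""
    return 365 * availability_percent / 100


OUTAGE_HOURS = 14  # length of the outage reported in the article

for label, percent in [("95%", 95.0), ("three 9s", 99.9), ("four 9s", 99.99)]:
    budget = allowed_downtime_hours(percent)
    verdict = "exceeded" if OUTAGE_HOURS > budget else "within budget"
    print(f"{label}: {budget:.2f} h/year allowed, "
          f"{equivalent_days_up(percent):.2f} days up, 14 h outage {verdict}")
```

Under these assumptions three 9s allows roughly 8.76 hours of downtime a year, so a single 14-hour outage would use up more than a year's allowance, while 95% works out to 346.75 days, which is where "Office 347" comes from.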

  5. Dwarf

    Yet more proof

    1. Code is riddled with bugs

    2. Insufficient in house testing

    3. Users are guinea pig testers

    4. Even with all that telemetry, they can’t see what’s happening

    I suppose somewhere, this will be used as a reason for yet more telemetry, rather than a review for better coding and testing methods.

    I wonder what the cost was to all the businesses affected by the outage ?

    1. Anonymous Coward
      Devil

      " Even with all that telemetry, they can’t see what’s happening"

      And do you believe they will address 2) or 3)?

      No, they will address only 4) adding even more telemetry - until latency in the telemetry cache will create race conditions in the telemetry processes that will bring down the whole system....

    2. Michael Habel

      Re: Yet more proof

      I'm fairly sure any sane company out there has stuck with its local copy of Oriface, and is likely using Oriface362.8 as a collaborative (out-of-the-office) effort. And those who are thinking in the short term of "Well, what do we need those twelve admins for?" will eventually find out...

    3. Anonymous Coward
      Anonymous Coward

      Re: Yet more proof

      5. Cloud services are not the always available panacea that various snake oil salesmen make them out to be.

      But then if a PHB actually understood technology he wouldn't be a PHB. Catch 22.

  6. N2

    Scaleable?

    Clearly not in its current manifestation.

    1. DJV Silver badge

      Re: Scaleable?

      Only the outages are scalable!

  7. Anonymous South African Coward Bronze badge

    Reminds me of when OS/2 on a single CPU totally outperformed NT running on a quadprocessor setup.

    1. Anonymous Coward
      Anonymous Coward

      Reminds me of when OS/2 on a single CPU totally outperformed NT running on a quadprocessor setup.

      Nowadays that baton has been taken over by Linux. I cannot believe we're wasting so much processing power on, well, sh*t.

      1. Anonymous Coward
        Anonymous Coward

        >>Nowadays that baton has been taken over by Linux.

        You realise that Windows and Windows Server have tended to beat Linux in performance benchmarks for most things for years now?

        1. HighTension

          Yes, all those Windows supercomputers in the Top500 sure are impressive!

      2. Anonymous Coward
        Anonymous Coward

        "Nowadays that baton has been taken over by Linux."

        No kidding. My desktop is the same machine I used to run WinXP on. It might run Win7, couldn't possibly run Win10, but is zipping along with Ubuntu 14.04 and latest Chromium, Thunderbird, FireFox, and LibreOffice, not to mention my collection of Windows games under Wine.

    2. Michael Habel

      They had cores in the 90's?! Single, yeah, to be sure, but I'm gonna presume you meant multi-socketed procs running in SMP instead.

  8. Martin hepworth

    3 root causes???

    No.. a single root cause of an overloading of the service which wasn't gracefully handled by separate cascading systems....

    1. Jay Lenovo
      Facepalm

      Re: 3 root causes???

      Dumb and Dumber to Dumberer

      (Harry and Lloyd found new employment).

      We're left tripping all over that mess, but like the sequel, not very funny.

    2. This post has been deleted by its author

      1. Anonymous Coward
        Anonymous Coward

        Re: 3 root causes???

        "Yes, all those Windows supercomputers in the Top500 sure are impressive!"

        Yes, not bad for just running a script on a cloud:

        https://www.top500.org/site/50454

        You don't see many supercomputers running Windows these days though - not because of any scalability limitation - it certainly scales and tends to beat say Linux with say wide band low latency performance such as Mellanox interconnects or 100GbE networking. The main reason is because it's licenced per core, and top end supercomputers can have more than 10 million cores!

  9. Anonymous Coward
    Anonymous Coward

    ha ha ha

    Thought the whole fucking point of cloud was scalability...

    First question that pops up is why didn't Azure simply add more backend servers automatically? Thought this was the whole point of that shit..

    1. Version 1.0 Silver badge

      Re: ha ha ha

      It worked, the outage scaled very well. I could go on with the old Claude Rains joke but let's face it - when you hand all your data off to another company then what do you think will happen? Are they going to be concerned with maximizing their profits or yours?

      Edit: Damn Autocorrect.

  10. Anonymous Coward
    Anonymous Coward

    Reminds me of the Gerard Hoffnung "Bricklayer's Lament" monologue. Unfortunately on this recording the audience were anticipating the story and kept interrupting the flow with laughter.

    1. steelpillow Silver badge
      Happy

      "Unfortunately on this recording the audience were anticipating the story and kept interrupting the flow with laughter."

      Oh, some of us have been doing the same with Microsoft for a very long time.

  11. Dan 55 Silver badge

    In other news...

    Windows 1809 breaks Win32 apps, Windows Media Player and the iCloud client. MS has decided machines with the iCloud client installed won't receive the 1809 update for the moment, so you know what to do...

    1. Michael Habel
      Trollface

      Re: In other news...

      I think the cure may be worse than the disease. Besides, I don't know any iTards who'd be using their JebusPhones / MaxiPads to make iMessage pay its own rent.

  12. regadpellagru
    Joke

    the gaps in telemetry ...

    In a Microsoft article, World has gone banana !

  13. steelpillow Silver badge
    Megaphone

    System engineering

    This is what happens when you don't do your system engineering properly before rollout. Every part of this multifaceted crap was foreseeable, testable and hence avoidable.

  14. Vulture@C64

    I used to be a Windows Server advocate, it's on the whole been very stable (and easy to manage) even back to NT351, NT4, 2008R2, 2012 and now 2016 etc but MS have ruined it now - telemetry, update process, the memory it requires has increased despite MS saying it's decreased, the CPU resource it takes has also increased.

    Whilst Centos 7 has matured and developed into a fantastically stable OS, rock solid, fast, needs very few resources and has also become more manageable with a range of tools - the manageability of it was what put me off years ago.

    Microsoft are ignoring the very things which made them useful and leaving the door open to Linux to walk right in . . . how many new builds are now done on Windows ? None that I know of. Same with SQL Server - was a great product but cost is massive now on SPLA so PostgreSQL it is - another tick in the enterprise box.

    Bye Microsoft . . . it's been fun :)

    1. oldcoder

      That is part of the problem when you artificially tie so many things together...

      You can no longer separate them to reduce loads...

      All that happens is the load keeps getting bigger and bigger - with more and more bugs that can't be fixed without breaking the entire thing.

      1. Steve Foster
        Trollface

        @oldcoder

        OC, it's not entirely clear from your post, are you talking about Microsoft now, or systemd?

    2. Michael Habel

      Perhaps this is the point? I was under the impression that under a post-Ballmer MicroSoft, they (i.e. MicroSoft), LOVE Linux now?

  15. Don Pederson

    The title references Office 362.5 and in the article it refers to it as Office 363 and 364. Need a proofreader?

    1. druck Silver badge

      I just think it emphasises the unreliability.

  16. Cavehomme_

    The beginning of the end...

    ...goodbye MS, you’ve shot yourselves in your feet. Muppets.

    1. Fatman
      WTF?

      Re: The beginning of the end...

      <quote>...goodbye MS, you’ve shot yourselves in your feet nuts. Muppets.</quote>

      FTFY!!!

  17. steviebuk Silver badge

    I'm a bit slow

    I assume the 362.5 then 363 then 364 was a piss take?

    And I'll end with my usual. "But the cloud never fails, it will cleanly fall over to the next data centre. We need to be infrastructure free. It will save thousands because the cloud costs nothing".

    1. Steve Foster
      Holmes

      Re: I'm a bit slow

      "I assume the 362.5 then 363 then 364 was a piss take?"

      Oh yes.

      cf. previous articles on Microsoft cloud outages (there are too many to cite individually, of course).

      1. DuchessofDukeStreet

        Re: I'm a bit slow

        I was just impressed it was up all the way to 364...

        1. Sir Runcible Spoon

          Re: I'm a bit slow

          If we adopt the 'days since last incident' approach, isn't it like O7 or something?

          1. BeerTokens

            Re: I'm a bit slow

            Can we have this as a banner on the reg homepage please?

        2. Anonymous Coward
          Anonymous Coward

          Re: I'm a bit slow

          "I was just impressed it was up all the way to 364..."

          I wonder what will happen in a Leap Year?

  18. Anonymous Coward
    Facepalm

    Sympathy

    You have to be sympathetic to those poor fellows at Microsoft. They are a young, small company that hasn't had much experience of dealing with big systems and that sort of thing. Give them another decade or so and they will learn how to get it right. Possibly.

  19. ma1010
    Megaphone

    Just goes to show that agile is crap

    What caused this? Lack of TESTING. But since MS fired their testers and decided that their users are now the testers - welcome to the world of crapware and constant outages.

    As to doing emergency communications with email, download LibreOffice and a decent email client. Then subscribe to two different paid-for services (like Fastmail) and set up all the clients to check both as that way you eliminate THAT single point of failure. Likely cheaper than MS 3.625 times ten to the second power (and falllllling) and would certainly work.

    1. A.P. Veening Silver badge

      Re: decent email client

      I recommend Thunderbird, but there are more, just avoid Outhouse.

  20. Melanie Winiger

    Next time, buy a Mainframe. -)

  21. stevebp

    Office 358

    I understand that Microsoft have had network issues in London - my Client is complaining that Office 365 (sic) has been unavailable for two days and he's resorted to texting senior management with updates. He says that the service has "lost all credibility" with the Executive.

  22. Anonymous Coward
    Anonymous Coward

    To stop this issue locking us out again I've created an AD user which syncs with O365, made it a global admin, have not enabled MFA and have disabled it in AD.

    If I ever need to use it then I just enable it in AD and force a sync.

  23. I said the red button Igor!
    Facepalm

    Oh happy day!

    It's down again today and I'm getting that deja-vu feeling again.

    On the one hand, MS strongly encourage all O365 Global Admins to enable and use MFA. On the other, they lock everyone out, including the admins. I wonder if the NSA or GCHQ will release details of their back doors?

    Must be beer o'clock 'cos nothing else is going to get done today - and it's only Tuesday.

  24. MAH

    yup..confirmed in Canada..MFA is broken again...

  25. Jay Lenovo
    Unhappy

    MFA broken again

    Single points of failure, certainly not in a cloud service...Still it all comes tumbling down.

  26. Howard Hanek
    Happy

    Just About Given UP Windows

    The latest update won't permit me to login with anything but a temp profile and wiped all my premium subscription 3rd party software. I have several Linux boxes working and want to completely dump Windows for the future.

  27. Steve @ Ex Cathedra Solutions

    It's gone again!

    They reckon it went over again at 14:25GMT - update in an hour or so.

  28. arctic_haze

    "gaps in telemetry "

    Yes. Only more telemetry can save us. Every user of Microsoft software should have telemetry installed in his/her orifices.

  29. teknopaul

    i n c r e m e n t a a a a l b a c k o f f f f f

    Err heard of it.
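For anyone who has not heard of it: the idea is that a client that fails to log in waits longer before each retry instead of hammering the service all at once. Below is a minimal sketch of the common exponential variant with random jitter. The function names, the default values, and the `operation` callable are all assumptions made for illustration; the thread does not describe what Microsoft's clients actually do.

```python
import random
import time


def backoff_delays(base_seconds: float, cap_seconds: float, attempts: int) -> list:
    """Upper bound on the wait after each attempt: base, 2x, 4x, ... up to the cap."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(attempts)]


def call_with_backoff(operation, attempts: int = 5, base_seconds: float = 1.0,
                      cap_seconds: float = 30.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Between failures, sleep for a random time up to the current ceiling
    ("full jitter"), so many clients do not all retry at the same instant.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    ceilings = backoff_delays(base_seconds, cap_seconds, attempts)
    for attempt, ceiling in enumerate(ceilings, start=1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            sleep(random.uniform(0, ceiling))


# Ceilings for six attempts with a 1 s base and a 30 s cap.
print(backoff_delays(1, 30, 6))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests; production code would leave it at the default.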

  30. This post has been deleted by its author

  31. FozzyBear
    Happy

    We seriously need an update to our "icons". For instance, a Mr Magoo icon: a bumbling old blind guy crashing into situation after situation, yet through "blind" luck managing to survive. Perfectly describes Microslop, don't you think?

  32. Anonymous Coward
    Anonymous Coward

    "Now, Microsoft says, it is looking to prevent a recurrence of the fiasco by reviewing how it handles updates and testing, as well as reviewing its internal monitoring services and how it contains failures once they begin."

    Isn't that pretty much what they said the last few dozen times their services fell flat on their butts?
