Azure fell over for 7 hours in Europe because someone accidentally set off the fire extinguishers

Microsoft has explained how a cascading series of cockups left some of its Northern European Azure customers without access to services for nearly seven hours. On September 29, the sounds of "Sacré bleu!" "Scheisse!" and "What are the bastards up to now?" were, we're guessing, heard from Redmond's Euro clients after key …

  1. Christian Berger

    The insane thing about it is...

    that most of the things people do on that cloud are things you could do at home with an extremely modest server from 20 years ago. E-Mail and storage aren't particularly hard things to do.

    1. Anonymous Coward
      Anonymous Coward

      Re: The insane thing about it is...

      But that takes work! It's just so much easier to let Azure do it.

      Oh wait...

    2. Anonymous Coward
      Anonymous Coward

      Re: The insane thing about it is...

      You’ve never worked with Exchange Server, have you?

      First off, many companies consider Email mission critical. Email is customer facing, and it’s used by everyone. I’d rather have the accounting system down for a few hours than email.

      Second, Exchange Server is complicated... When you have hundreds/thousands of users it becomes a b1tch to maintain. Managing spam, phishing, archival (legal), backups, etc. is a pain in the a$$.

      1. Anonymous Coward
        Anonymous Coward

        Re: The insane thing about it is...

        "Second, Exchange Server is complicated... "

        Sure - it's an enterprise grade solution.

        "When you have hundreds/thousands of users it becomes a b1tch to maintain. Try managing spam, phishing, archival (legal), backups, etc. is a pain in the a$$."

        It's easier to do all of those things on than any other on-site option I am aware of!

        1. Anonymous Coward
          Anonymous Coward

          Re: The insane thing about it is...

          It's a MS product, how can it be the easiest?

          Are all the Linux and open source fans wrong after all?

      2. razorfishsl

        Re: The insane thing about it is...

        Yes an exchange server = email done the WRONG way.

        1. AMBxx Silver badge

          Re: The insane thing about it is...

          Exchange is way more than just email.

          I've recently moved from Exchange in my home office to Office 365. So pleased to get rid of the admin overhead of Exchange. Every OS upgrade was a fingers-crossed affair and way too complicated.

          Yes, Exchange is too complicated for an email solution, but if that's all you think it does, you're missing the point.

        2. phuzz Silver badge

          Re: The insane thing about it is...

          "Yes an exchange server = email done the WRONG way."

          You're looking at it the wrong way. Exchange is a multi-user calendar system that just happens to also do email, and if you'd ever tried getting a working calendar system for more than a few hundred users, you'd understand why Exchange still makes money.

      3. Anonymous South African Coward Bronze badge

        Re: The insane thing about it is...

        Well said.

        I was an Exchange 2003 admin once, and spam became a real PITA. Fortunately (or unfortunately?) the system b0rked itself after a power outage, and the company decided to outsource the email to a hosted Exchange.

        Benefits

        - somebody else's problem with dealing with spamz0rz and haxx0rz

        - somebody else's problem dealing with the DataStore on Exchange

        - somebody else's problem dealing with backups

        Drawbacks

        - adding users may take a bit longer

        - some issues take longer to address

        But in general it is a great deal better as I don't have to waste my time dealing with Exchange and its quirks anymore, and can focus more on other matters.

        FWIW Exchange is a good product, and is reliable when set up properly. It went downhill with Exchange 2010 and higher, which is a pity.

        1. TheVogon

          Re: The insane thing about it is...

          "It went downhill with Exchange 2010 and higher, which is a pity."

          Nope, the newer 2010, 2013 and 2016 versions are very good and there are many design, scalability, resilience, maintenance and functionality improvements. 2007 was very flaky and limited in scalability by comparison.

          1. Terrance Brennan

            Re: The insane thing about it is...

            Exchange 2010 was the pinnacle of the product for on-premises customers. 2013 and beyond were completely designed solely to meet Microsoft's cloud needs. It's a great solution if you have multiple datacenters and thousands of servers which you buy by the truck load and can afford to have a dozen or more copies of each database.

      4. hoola Silver badge

        Re: The insane thing about it is...

        And you don't have to do any of that in O365?

      5. Terrance Brennan

        Re: The insane thing about it is...

        Through Exchange Server 2010 it was not that hard to do and keep it all running. Exchange Server 2013 and 2016 however, are different beasts. They are now designed and optimized for Microsoft's use in the cloud and are not fit for most on-premises use.

        1. Anonymous Coward
          Anonymous Coward

          Re: The insane thing about it is...

          "Exchange Server 2013 and 2016 however, are different beasts. They are now designed and optimized for Microsoft's use in the cloud and are not fit for most on-premises use."

          As someone who is Exchange certified and architects and runs large installs, I can tell you that you are wrong. There are many onsite advantages to the newer Exchange versions.

      6. StargateSg7

        Re: The insane thing about it is...

        OMG! A few hundred or thousands of users?

        Ha ha ha ha I WISH we had that TINY TINY LOAD!

        I deal with PETABYTES of data PER DAY!

        I deal with 500,000 64-kilobyte Input/Output requests PER SECOND PER SERVER!

        I deal with files that are 100 Terabytes in size!

        I can have TEN MILLION SIMULTANEOUS real connections and another few MILLION computer-simulated virtual users on an in-house platform.

        40 Gigabit connections are TOOO TINY to fit my needed bandwidth!

        I use MANY CUSTOM terabit fibre interconnects and THOUSANDS of GPUs as mini-HTML/SQL servers!

        Your data server requirements are PIPSQUEAK SMALL compared to some people!

        So YES, email/html/sql for 1000 users can ALL be done in-house with a $2000 (1500 Euro) server and some GPU cards to offload the tasks to!

    3. garetht t

      Re: The insane thing about it is...

      "things you could do at home with an extremely modest server from 20 years ago"

      Such as:

      Redundant power

      Redundant A/C

      Redundant Internet connections

      Low latency transport links to backbone

      Physical Security

      Audited and certified systems and procedures

      Sneering at the cloud is really easy until you actually think about it.

      1. Destroy All Monsters Silver badge
        Windows

        Re: The insane thing about it is...

        Sneering at the cloud is really easy until you actually think about it.

        Totally correct. Nobody does this at home except the youthful hacker clubbe and people who think they are consultancy-grade but are actually lacking lots of clues.

        If you have the money, you might want to stay off the pubic cloud and rent a few racks in a secure datacenter in the 'burbs, but then it's up to you to manage the hardware/software, which actually costs a bunch of money, especially if you want to harden it against lots of failure modes.

        1. smudge
          Coat

          Re: The insane thing about it is...

          If you have the money, you might want to stay off the pubic cloud

          Agreed. You could pick up some nasty infections that way.

        2. allthecoolshortnamesweretaken

          Re: The insane thing about it is...

          "... people who think they are consultancy-grade but are actually lacking lots of clues ..."

          Meet your new boss!

          1. Stoneshop
            Windows

            Meet your new boss!

            Same as the old boss.

      2. Mad Mike

        Re: The insane thing about it is...

        @garetht t

        "Audited and certified systems and procedures"

        Yeah. Like they worked really well here!! Reality is that cloud providers are proving themselves to be no better at running datacentres and systems than in-house staff.

        1. Anonymous Coward
          Anonymous Coward

          Re: The insane thing about it is...

          Same staff, different location.

      3. jmch Silver badge

        Re: The insane thing about it is...

        "Redundant A/C" - seems like it's useless having redundant A/C if they all shut down in case of fire!

        1. Scroticus Canis
          Flame

          Re: "useless having redundant A/C if they all shut down in case of fire!"

          Not shutting down the AC during a fire event is the best way to spread the fire while feeding it fresh oxygen. Should also have fire dampers that close off all the ducting so fire does not spread through them.

          In this case it was the wrong thing to do; they should have burnt it to the ground and started over, so I gave you an upvote, as Azure sucks big time.

      4. Anonymous Coward
        Anonymous Coward

        Re: Physical Security and the Cloud

        Er....? Unless you know exactly where your 'Cloud Service' is being served from every second of every day, how do you know that it is physically secure?

        Come on now, there must be a PhD or two in verifying Cloud Physical Security.

        How do you know that the backup to that swanky Azure (other cloud services are available) is not a few old P4s housed in the back of Achmed's Kebab Shop in Kentish Town? (Other kebab shops are available)

        Do you really know for sure and not what the cloud snake oil salesmen tell you?

      5. Anonymous Coward
        Anonymous Coward

        Re: The insane thing about it is...

        The cloud is like a plane: it's fantastic till it goes wrong. No-one would fly if planes had the outage figures cloud providers have.

        If it is truly a cloud infrastructure then it should NEVER go TITSUP.

      6. AJ MacLeod

        Re: The insane thing about it is... (@garetht t)

        A modest server hardly requires redundant A/C - but basically all the stuff in your list that actually matters in real life is quite readily achievable for even quite a small company with nothing more than a dedicated well ventilated IT room.

        Redundant power - UPS capable of handling several hours of outage is easy to get and even a small petrol generator could easily be kept on hand in the unlikely case of an outage lasting any longer than that.

        Redundant Internet connections - Easy. (And even a 3/4G last ditch option would be plenty for an Email server)

        Low latency links to backbone... hardly necessary for the majority of companies, especially if the bulk of their IT is based on one site.

        Physical security... really?

        Audited and certified systems and procedures... whatever. In practice, plenty of companies get along much better with just a bit of personal responsibility and good old common sense. If your IT staff consists of a handful (or fewer) of reliable, competent individuals that work well together they'll make sure that nothing too stupid is likely to happen.

        You can keep your cloud; it's just your data on a pile of other people's computers, managed by fallible humans you can't speak to, with the whole edifice waiting to fall over when any one of the billion or so sequences of events occurs that wasn't covered by the "certified procedures".

      7. Pedigree-Pete
        Mushroom

        Re: The insane thing about it is...

        I'm sort of with Christian Berger on this.

        It depends on your use case. If it's just a family server (or servers), even tho' I'm no IT bod, I'd do it in house.

        A few hours/days of inconvenience isn't a biggie.

        If you're a very small SME then you could get away with an off-site cloudy email and backup and a decent landline/mobile backup.

        I think you see where I'm going.

        @garetht t, you are, of course, spot on too for many use cases. PP

        ICON> If you need resilience and guaranteed uptime and it doesn't work.

    4. Ryan Kendall

      Re: The insane thing about it is...

      But who wants a san attached loud fan blowing server running 24/7 in their home.

      1. Muscleguy

        Re: The insane thing about it is...

        Doesn't have to be a server. I've just bought and installed new fans on this Mid 2010 Macbook Pro inherited from my daughter. The fan noise was becoming VERY distracting. The surgery was really quite simple. I've done far harder.

        But oh, the silence! the lack of vibration! Bliss.

      2. Alan Edwards

        Re: The insane thing about it is...

        > But who wants a san attached loud fan blowing server running 24/7 in their home.

        I have a self-built VMWare ESXi server and a NAS running 24x7. The HP MicroServer that runs the NAS is give-or-take silent, the PSU fan in the VMWare server is quiet enough that I don't notice it.

        The noise factor has stopped me getting a cheap ex-corporate server off eBay though. We once powered up a de-racked ProLiant DL-something in the office, damn that thing was loud. Lots of tiny screaming fans.

      3. Anonymous Coward
        Anonymous Coward

        Re: The insane thing about it is...

        "But who wants a san attached loud fan blowing server running 24/7 in their home."

        I have a pair of HP servers running Hyper-V in my loft. They are close to silent in general use.

  2. Dwarf

    Cost a pretty penny

    That fireproof gas (tm) ain't cheap to replace

    I bet the maintenance folk needed some new trousers too.

    1. Swarthy
      Flame

      Re: Cost a pretty penny

      Ah, but the beancounter(s) who were in the server room when the BOFH workmen set off the halon fire extinguishers will need more than clean trousers. In fact, they will need: some old carpet, a bag of quicklime, and a couple of spades.

      1. robidy

        Re: Cost a pretty penny

        ROFL

      2. Anonymous Coward
        Anonymous Coward

        Re: Cost a pretty penny

        "when the BOFH workmen set off the halon fire extinguishers"

        It would likely have been an Inergen IG541 based solution. Hardly anyone still has Halon these days.

        1. Mr Dogshit

          Re: Cost a pretty penny

          Yes yes we know.

        2. Anonymous Coward
          Anonymous Coward

          Re: Cost a pretty penny

          I found evidence in other comments that there are actually many farms that still have water in their fire extinguishing systems

          1. Anonymous Coward
            Anonymous Coward

            Re: Cost a pretty penny

            "I found evidence in other comments that there are actually many farms that still have water in their fire extinguishing systems"

            Lots of clueless US facilities have sprinkler systems I have noticed.

            1. Anonymous Coward
              Anonymous Coward

              Re: Cost a pretty penny

              Re: sprinklers

              Not a terrible idea if you have an inert gas suppression system. The gas should knock down any major fires long before the sprinklers trigger. The sprinklers act as a backup, so if the crap really hits the fan, you will have some soggy hard drives from which to extract data instead of a crispy pile of melted parts.

      3. Pedigree-Pete
        Pint

        Re: Cost a pretty penny

        @Swarthy. Can't upvote you enough for that. You owe me a new keyboard but I only have 1 icon choice so Cheers to Friday Eve. PP

    2. Anonymous Coward
      Anonymous Coward

      Re: Cost a pretty penny

      During all the commotion, did anyone check to make sure the poorly paid "fire extinguisher testers" didn't in fact install a back door into the system? Wouldn't THAT be funny....LOL!!

  3. DagD

    Rain in the cloud

    Oh, the irony. But that's ok... Your salesman has got your back (w-a-a-a-y back...).

    Still nice and sunny here in the land of "keep it in-house".

  4. Ken Moorhouse Silver badge

    This was a Microsoft training exercise...

    Altogether now:-

    Embrace (cut that out you two at the back)

    Extend (I wanna see those pierced navels)

    Extinguish (noooo not with one of those...)

  5. sjsmoto
    FAIL

    When was the last (or first?) time we heard there was a major cloud problem but it didn't affect anyone because the rollover performed flawlessly?

    1. Anonymous Coward
      Anonymous Coward

      Probably such events only make local news, if that.

    2. yowl00

      "Implementation of Virtual Machines in Availability Sets with Managed Disks would have provided resiliency against significant service impact for VM based workloads. ". So if you'd implemented your service properly then that's exactly what would have happened.

      1. Colin Tree

        in theory

        in theory,

        read the article, they're low level, physical, real world problems,

        but clouds are so ethereal, each level of abstraction increases complexity.

        If you didn't use the azure cloud, you wouldn't have a problem.

        KISS

        oh, azure is the colour of a clear sky, nothing to do with clouds,

        MS cockup again

    3. JimC

      > hear about ... problem ... didn't affect anyone

      To be fair, I don't think we'd hear about those at all. I imagine most cloud hosting sites would rather not let the customers know there had been a problem.

      My experience has been that most PHBs I've been involved with would rather pretend there are no problems than tell the customer every time there's been a problem which hasn't impacted the customer. It's the same mindset, I guess, that thinks those 9s come from writing the SLA, not good design and careful planning.

      1. sjsmoto

        Re: > hear about ... problem ... didn't affect anyone

        @JimC - I don't know... when these kinds of problems keep happening, you'd think a company (especially a smaller one) would be very happy to get noticed by saying a crash affected no one.

        1. Lusty

          Re: > hear about ... problem ... didn't affect anyone

          Yep, here you go. Full list of all Azure issues. Nothing to do with publicity or cover-ups, it's responsibility and trust.

          https://azure.microsoft.com/en-us/status/history/

          https://status.aws.amazon.com/

      2. Pedigree-Pete
        Happy

        Re: > hear about ... problem ... didn't affect anyone

        We have a cloud supplier who has a global presence. As Admin on services we sell on that platform I get 6 emails from them whenever an "issue threshold" is breached. For clarity I'm in the UK.

        1/ There have been reports of issues with X in Shanghai/Hong Kong.

        2/ We're investigating issues with x in Shanghai/Hong Kong.

        3/ We have identified a probable cause and applied a fix for x in Shanghai/Hong Kong.

        4/ We're monitoring x in Shanghai/Hong Kong.

        5/ No further incidents of x have occurred in Shanghai/Hong Kong or anywhere else.

        6/ Issue is now resolved.

        Naturally, I shrug and go ho hum, but should one of our users call and say I'm trying to do x with Shanghai/Hong Kong here in the trenches I can say I know and it's being worked on. In my experience emails 1-6 rarely take more than 45 mins.

        All cloud providers have outages somewhere in the services they provide. It's how you communicate that down the channel that counts.

        I'm sure you've all had outages and been unable to make progress on investigating and fixing because your colleagues/customers keep ringing to tell you, you have an outage. :(

        PP

    4. garetht t

      "Dog bites man!"

  6. yowl00

    Didn't even notice. A subset of customers were affected; what percentage would be interesting to know.

  7. Pascal Monett Silver badge

    From the looks of it, cogs were falling off all over the place

    So, let's count down the failures:

    - VMs were axed

    - Backup vaults were not available

    - Azure Site Recovery lost failover ability

    - Azure Scheduler and Functions dropped jobs

    - Azure Monitor and Data Factory experienced pipeline errors

    - Azure Stream Analytics went on the fritz

    - Azure Stream Analytics had a stroke

    Apart from that, the Cloud is marvelous, never fails you and you can always access your data.

    Except when it FUBARs and no backup is working any more, but the salespeople will never tell you that.

    1. garetht t

      Re: From the looks of it, cogs were falling off all over the place

      "the Cloud is marvelous, never fails you and you can always access your data."

      That's a strawman - you're making up a claim just so you can knock it down.

      The cloud doesn't guarantee anything except possible failure, and you are massively encouraged to architect your systems against failure. High-availability systems across availability zones, backup systems in different geographic regions.

      The people highest on their horse on this page against the cloud are the people who know the least. How infuriating!
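
      As a rough illustration of "architect against failure", here is a minimal client-side sketch that tries a primary regional endpoint and falls back to a secondary one. The endpoint URLs are hypothetical placeholders, and a real setup would put health probes and DNS/traffic-manager routing in front of this rather than a hard-coded list.

      ```python
      # Try the primary region first, fall back to the secondary if the
      # call fails or times out. Endpoint URLs are made-up placeholders.
      import urllib.error
      import urllib.request

      REGION_ENDPOINTS = [
          "https://eu-north.example.com/api/health",   # primary (assumed)
          "https://eu-west.example.com/api/health",    # secondary (assumed)
      ]

      def fetch_with_failover(endpoints, timeout=2.0):
          last_error = None
          for url in endpoints:
              try:
                  with urllib.request.urlopen(url, timeout=timeout) as resp:
                      return url, resp.read()
              except (urllib.error.URLError, OSError) as exc:
                  last_error = exc     # remember the failure, try next region
          raise RuntimeError(f"all regions failed: {last_error}")

      if __name__ == "__main__":
          try:
              region, _body = fetch_with_failover(REGION_ENDPOINTS)
              print("served from", region)
          except RuntimeError as err:
              print(err)
      ```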

      1. Nate Amsden

        Re: From the looks of it, cogs were falling off all over the place

        Most likely those folks know that architecting for failure in the cloud is a pretty rare thing; just look at how many customers have outages when the cloud goes down.

        Hell, I have seen developers complain about TCP connections being dropped during an LB failover (takes about 1 second) because their app couldn't even handle that without being restarted. And this is for a new application stack, not something designed 10 or 15 years ago. I could go on and on with other real scenarios easily.

        Building apps with single points of failure is very common still.

        I remember, what was it, a decade ago or so, a fire at a data center in Seattle, a facility that had at least annual power outages for 2 or 3 years prior. The Bing travel site was in that data center. It was down for a long time. Maybe MS got it online before the data center came back up on external generator trucks about 40 hrs later, not sure (this was a colo facility, not an MS datacenter).

        Point is, 10 years ago isn't that long, and a company with the size and resources of MS wasn't willing or able to do it for Bing travel at the time (hell, even I had the foresight to move the company I was with at the time out of that DC 2 years before the big outage), so it doesn't surprise me that companies a fraction of the size still can't figure it out today. It's not as if it's impossible, it is just very difficult to do and most talk the talk but won't walk the walk when it comes down to it.

        Same situation applies to security of applications.

      2. Hans 1

        Re: From the looks of it, cogs were falling off all over the place

        High-availability systems across availability zones, backup systems in different geographic regions.

        In theory, maybe. Problem is, Slurp held it wrong, else customers would not have noticed.

        What I do not understand is why people go with AWS or Azure?

        Multiple providers offer OpenStack; you can get service from two or three of them to do ultra high availability and disaster recovery, same stack, MUCH easier to implement ... if you really wanna go cloud, that is. What are the chances of two or three OpenStack vendors failing at the same time vs AWS or Azure?

        The people highest on their horse on this page against the cloud are the people who know the least. How infuriating!

        Generalization, not good.

        If you back Azure, your opinion does not count.

        1. TonyJ

          Re: From the looks of it, cogs were falling off all over the place

          "...Generalization, not good.

          If you back Azure, your opinion does not count...."

          You clearly don't get irony.

        2. Anonymous Coward
          Anonymous Coward

          Re: From the looks of it, cogs were falling off all over the place

          "same stack, MUCH easier to implement "

          You are kidding, right? OpenStack is WAY more complex and fiddly to implement and use than, say, Azure. You have to edit text files to store config for a start - how prehistoric and insecure. For instance, how do you control ACLs for, and audit changes to, just one setting in a text file?!

      3. Anonymous Coward
        Anonymous Coward

        Re: From the looks of it, cogs were falling off all over the place

        AC as details

        Not working for a huge company, so the cloud has one great advantage: the ability to automatically "spin up" additional resources if required (dealing with activity spikes).

        Yes, that could be done "on site" but would mean a lot of (expensive) kit, doing very little much of the time, just sitting there waiting for an activity spike.

        Other advantages, let's talk Azure here, are the Azure SQL "Point in Time" functionality, with all that db backup burden removed, and the geographical replication / failover stuff (that protects against some cloud failures).

        If you are a huge company then enough onsite "iron" for those rare peaks is probably viable, and multiple geographically distributed replicating data centres is viable but not for many smaller outfits: Cloud is not perfect, but it's useful for some of us.
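
        As a rough sketch of the "spin up on demand" point above, here is a toy threshold-based scaling decision. The thresholds are made up and set_instance_count() is a hypothetical stand-in for whatever scaling API the platform actually exposes; real platforms drive this from metrics for you.

        ```python
        # Toy autoscaling decision: add instances when average CPU is high,
        # shed them when it is low, within fixed bounds. The load samples
        # and set_instance_count() hook are hypothetical placeholders.
        MIN_INSTANCES, MAX_INSTANCES = 2, 20
        SCALE_OUT_CPU, SCALE_IN_CPU = 75.0, 25.0   # assumed thresholds (%)

        def desired_instances(current, avg_cpu):
            if avg_cpu > SCALE_OUT_CPU:
                return min(current + 2, MAX_INSTANCES)   # spike: add capacity
            if avg_cpu < SCALE_IN_CPU:
                return max(current - 1, MIN_INSTANCES)   # quiet: shed idle kit
            return current

        def set_instance_count(n):
            """Stub: call the provider's scaling API here."""
            print(f"scaling to {n} instances")

        if __name__ == "__main__":
            current = 4
            for cpu in (30.0, 82.0, 90.0, 20.0):         # simulated load samples
                wanted = desired_instances(current, cpu)
                if wanted != current:
                    set_instance_count(wanted)
                    current = wanted
        ```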

  8. Sureo
    Flame

    Now they know...

    that the fire suppression system works. How often do you get to try that for real?

    1. Anonymous Coward
      Anonymous Coward

      Re: Now they know...

      Yes, that bit did but the server shutdown part didn't, nor apparently did the failover to the mirror site.

  9. Erik4872

    Today's cloud lesson...

    The lesson for today is that you should never assume a cloud provider's operations are 100%. I hate having to explain to people why we need to have an instance of our service in more than one region. "But it's so expensive! My cloud salesman assured me that each region is interconnected data centers miles apart and they are nearly incapable of failing!"

    It's all just computers and data centers, even if it's very much software-defined and very resilient. If humans and computers are involved, something will eventually go wrong.

    1. Mark 85

      Re: Today's cloud lesson...

      It's all just computers and data centers, even if it's very much software-defined and very resilient. If humans and computers are involved, something will eventually go wrong.

      Just to simplify things: Murphy's Law applies to everything. Manglement seems to forget that.

  10. fedoraman
    Facepalm

    Really?

    This all happened because you lost an AHU? (I'm assuming not all of the AHUs in the data centre were stopped, just some, in the allegedly fire-affected region)

    So the rack temperature starts to rise, quite rapidly, because you no longer have moving air to carry away your excess heat. At what point, do you think, might it be a good idea to have graceful shutdowns of the affected racks that have lost their conditioned air? You know, triggered by some kind of flow sensor, or a delta-p switch across the AHU fan?

    I look after many dozens of air-handling systems. Even with two motors to each fan, and multiple belts, they do break down occasionally.
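
    For what it's worth, a minimal sketch of the kind of trigger described above: watch a differential-pressure reading across the AHU fan and request graceful shutdowns once airflow has been gone longer than a grace period. read_delta_p() and shutdown_rack() are hypothetical stubs for whatever the BMS/DCIM actually exposes, and the thresholds are made up.

    ```python
    # Poll airflow (delta-p across the AHU fan); if it stays below a
    # threshold for longer than a grace period, ask the affected racks
    # to shut down cleanly. Sensor and shutdown hooks are stubs.
    import time

    DELTA_P_MIN_PA = 50.0   # assumed minimum healthy pressure drop (Pa)
    GRACE_SECONDS = 120     # tolerate this much low airflow before acting

    def read_delta_p():
        """Stub: return the current pressure drop across the AHU fan (Pa)."""
        raise NotImplementedError("wire this to the real sensor")

    def shutdown_rack(rack_id):
        """Stub: request a clean OS shutdown of every server in the rack."""
        print(f"graceful shutdown requested for rack {rack_id}")

    def watch_ahu(rack_ids, poll_seconds=10):
        low_since = None
        while True:
            if read_delta_p() < DELTA_P_MIN_PA:
                low_since = low_since or time.monotonic()
                if time.monotonic() - low_since >= GRACE_SECONDS:
                    for rack in rack_ids:
                        shutdown_rack(rack)
                    return
            else:
                low_since = None    # airflow recovered, reset the timer
            time.sleep(poll_seconds)
    ```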

    1. Nate Amsden

      Re: Really?

      I think large-scale graceful shutdowns in this situation are probably really complicated, as they operate as a cluster; as systems shut down, other things likely kick in to try to restore availability, maybe moving resources to other nodes or something. At some point you probably have to set a flag in the entire system saying it is down and take it all offline (at which point graceful from a customer standpoint is out the window).

      I think this happened during that semi-recent big S3 outage.

      Not as if these are just racks and racks of standalone web servers with local storage.

    2. DropBear
      Flame

      Re: Really?

      First thing I thought of too. As far as failures go, thermal ones are as gentle as failures can possibly get* - they're not instant, and you get a warning they're happening. If your cloud can't even handle that gracefully, what the ever-loving fuck is it good for, exactly...?

      * ...well, unless the heatsink itself falls off your CPU. You know, because the retaining bracket snapped. And you only realize it because the fan suddenly snaps to full throttle for no good reason. At which point you remember an old Youtube video you once saw about an AMD CPU frying in milliseconds (the Intel one just throttled way down) due to the exact same cause and you bash the power switch mightily. Yes, it survived - new bracket, I'm still using it...

  11. Anonymous Coward
    Anonymous Coward

    so much for those fault domains, eh?

    1. Anonymous Coward
      Anonymous Coward

      "so much for those fault domains, eh?"

      It only impacted one fault domain, Northern Europe. There are 7 others in Europe.

      1. This post has been deleted by its author

  12. Anonymous Coward
    Anonymous Coward

    It was easier to make up the whole fire extinguisher thing than to own up about the real reason: forced Windows updates borked the cloud, then all the servers got confused uploading telemetry information to themselves.

    1. wallaby

      "It was easier to make up the whole fire extinguisher thing that to own up about the real reason: forced Windows updates borked the cloud, then all the servers got confused uploading telemetry information to themselves"

      The sad thing is that at the time of writing this, 14 loons had upvoted the above statement - I suspect only because they dislike Microsoft - or would they like to share the evidence to back the statement up?

      1. Anonymous Coward Silver badge
        Holmes

        The really sad thing is that wallabies don't understand sarcasm. I thought that was reserved for Americans and that wallabies were more antipodean??

        1. wallaby

          I fully understand sarcasm, it's just that you sounded like one of the militant penguinistas that frequently rant on in stories about Microsoft.

          And as there is no button for a sarcastic upvote, I drew conclusions about the (at the time) 14 clickers rather than what you wrote.

  13. razorfishsl

    The sad thing is that when scientists talk about modeling random systems and tracking them, people laugh.

    But someone comes up with a half-assed idea of sticking the whole of mankind's data into a cloud system and instantly it is a good idea.

  14. Anonymous Coward
    Anonymous Coward

    The Cloud...

    Other people's computers you have no control over.

    1. hplasm
      Facepalm

      Re: The Cloud...

      "Other peoples computers you have no control over."

      And neither, it seems, do they...

  15. Anonymous Coward
    Terminator

    Routine periodic fire suppression system maintenance

    'The problems started when one of Microsoft's data centers was carrying out routine maintenance on fire extinguishing systems, and the workmen accidentally set them off. This released fire suppression gas, and triggered a shutdown of the air con to avoid feeding oxygen to any flames and cut the risk of an inferno spreading via conduits. This lack of cooling, though, knackered nearby powered-up machines, bringing down a "storage scale unit."'

    I thought the 'cloud' was immune to failures at a single location. When a VM instance fails at one location, another is started up elsewhere. What happens to 99.999% up time when you have a real fire?

    1. Destroy All Monsters Silver badge

      Re: Routine periodic fire suppression system maintenance

      It's the same problem as discussed here, back in the mid-80s:

      Computer System Reliability and Nuclear War

      "We can't be sure that it works in a tough situation until after the fact"

    2. Anonymous Coward
      Anonymous Coward

      Re: Routine periodic fire suppression system maintenance

      Where is the checklist?

      Where is the failsafe switch?

      Where is the oversight?

      Every DC I worked in had a master control to switch off before work started and everyone was accompanied while working on site to prevent outages.

      Seems like it's time to find a new supplier and manager.

  16. Ryan Kendall

    Not all affected

    I have about 20 servers running in Azure North Europe.

    Strangely none of them went down.

    1. Ken Moorhouse Silver badge

      Re: Strangely none of them went down.

      That's because they were located at the back of Achmed's Kebab Shop in Kentish Town

      (How would you know other than doing some low-level packet tracing, and other detective work? It's a bit like Bit Torrent)

      1. Ken Moorhouse Silver badge

        Re: Achmed's Kebab Shop in Kentish Town

        The more I think about Achmed's Kebab Shop in Kentish Town, the more I think we're all being fooled.

        Not only does Achmed help serve up BT's local OpenZone service (if the shop uses a BT HomeHub), but if the business owner's pc has BitTorrent installed then there is a possibility he is a contributor to a film you may be watching (I'm sure I read somewhere that Microsoft are using BitTorrent techniques to serve up updates since the advent of W10). How do we know that Azure/AWS does not "sub-contract" in a similar way? AFAIK there is no agreement between BT and Achmed as to whether BT can use Achmed's Broadband connection for providing BT's Public WiFi service - BT being a big company y'know. Plus (I'm sure I've said this before), do Azure rent capacity from AWS and vice versa?

      2. Anonymous Coward
        Anonymous Coward

        Re: Strangely none of them went down.

        "That's because they were located at the back of Achmed's Kebab Shop in Kentish Town

        (How would you know other than doing some low-level packet tracing, "

        Because of where the Azure ExpressRoute connections we pay for go to?

  17. wyatt

    Interesting comments on this thread. I wouldn't say I'm against hosting, there have been some very good examples given here of the benefits. However, I think that when moving from on site to hosted the potential issues are not planned for. Multiple geographically spread instances with redundant networking (yours and theirs) should be a minimum requirement you'd think?

    We've hosted services with Azure and I'm not aware of them having had outages, which is good; they were either not in an affected location or resilient enough to keep going.

  18. This post has been deleted by its author

    1. Anonymous Coward
      Anonymous Coward

      Re: SLA and BC/DR

      "In 5 years, Microsoft should have been able to avoid any outage for any customers by implementing a correct BC/DR strategy and testing it."

      They do and it worked perfectly. You have a choice to use it.

  19. hoola Silver badge

    Double Standards

    Every time there is some cock-up with Azure, Microsoft give a load of pathetic assurances that it will not happen again and that they are always improving. Based on what we experienced, in very specific circumstances there was data loss on a VM.

    If this or even a far more minor event (a single VM host falling over) had happened in our data centre there would have been people screaming from the rooftops. But this, because it is in Azure, is just accepted, not even a whisper from upstairs.

    On site is constantly under scrutiny and has to provide a far better service and then there are complaints about the cost.

  20. Anonymous Coward
    Anonymous Coward

    It always seems to be the testing that brings these things down; testing of any kind carries a risk that it won't go to plan and fail over gracefully. Mitigating that risk by not testing isn't an option, and having on-prem doesn't make you immune from a technician/inspector accidentally pushing the big red button.

    How certain are you? Cool, go on then, go and set off your fire systems and it'll be like a ballet; watching everything seamlessly and gracefully fail over, migrate and shut down :)

  21. Anonymous Coward
    Anonymous Coward

    but..

    ...how can this be? The "cloud" is infallible, or so a certain digital director thinks.

  22. Smoking Gun

    Interesting, we have clients in NE and were not affected.

    I presume the referred-to Microsoft services are all distributed across the data centre, so this event, while disruptive to some, only disrupted a limited number of services for a limited number of clients.

    The funny thing is when reading comments or talking to people about Cloud, as soon as they implement a service in Azure they automatically have an expectation of 100% availability and nothing will ever fail.

    A lot of this comes back to lack of understanding, if you want availability, you still need to architect your service correctly, even on a public cloud, and this will ultimately increase costs.

    1. itguy

      So we were one of the sites impacted. Yes, this is a screw-up by MS, BUT it was also our screw-up. We didn't have the traffic manager enabled, and if we had, our service would probably have switched over to another geo.

      Lessons learnt on both sides.

      1. Ken Moorhouse Silver badge

        Re: our service would probably have switched over to another geo.

        Probably? Not a certainty then.

        The other point is whether your data had been replicated to that other geo. How can you guarantee that the data you are now looking at through that different conduit is current? And how would you know for certain that it is up to date?

        Replication was I believe one of the issues with Lotus Notes. Have lessons been learned?

        1. Anonymous Coward
          Anonymous Coward

          Re: our service would probably have switched over to another geo.

          "The other point is whether your data had been replicated to that other geo. How can you guarantee that the data you are now looking at through that different conduit is current? And how would you know for certain that it is up to date?"

          By, say, using synchronous replication and timestamps, as one of several options.

          "And how would you know for certain that it is up to date?"

          Because transactions will only be committed once replicated to all live sites.
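
          As a minimal sketch of that rule (in-memory stand-ins only, with none of the quorum handling, retries or fencing a real system would need), a write here is only reported as committed once every live replica has stored it with the same timestamp:

          ```python
          # Crude synchronous replication: acknowledge a commit only after
          # every live replica has stored the value, stamped with the same
          # timestamp. Replicas are just in-memory dicts for illustration.
          import time

          class Replica:
              def __init__(self, name):
                  self.name = name
                  self.store = {}            # key -> (value, timestamp)

              def write(self, key, value, ts):
                  self.store[key] = (value, ts)
                  return True                # pretend the write is durable

          def commit(replicas, key, value):
              ts = time.time()
              acks = [r.write(key, value, ts) for r in replicas]
              if all(acks):
                  return ts                  # committed: all live sites have it
              raise RuntimeError("commit aborted: a replica did not acknowledge")

          if __name__ == "__main__":
              sites = [Replica("north-europe"), Replica("west-europe")]
              stamp = commit(sites, "order:42", "paid")
              print("committed at", stamp)
              print("copies:", [s.store["order:42"] for s in sites])
          ```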

  23. Pat Harkin

    I've never really understood The Cloud...

    ...but at least now I know it works on fire and if you put out the fire it stops working.

    1. DropBear
      Joke

      Re: I've never really understood The Cloud...

      There can be only one logical explanation. The Cloud is... *GASP* steam powered...!

  24. Alan Edwards

    Use their own service??

    > Azure Site Recovery lost failover ability

    So a failure at one data centre knocked out everyone else's ability to fail over to a different site :)

    1. Anonymous Coward
      Anonymous Coward

      Re: Use their own service??

      "So a failure at one data centre knocked out everyone else's ability to fail over to a different site :)"

      No - only the ability to fail over to that one site.

  25. DagD

    is your ISP using BGP?

    ...Then there are network-based attacks:

    https://www.nist.gov/news-events/news/2017/10/new-network-security-standards-will-protect-internets-routing

  26. Daniel B.

    BOFH

    So I guess we now know where the BOFH is working at these days!

  27. Anonymous Coward
    Anonymous Coward

    "between 1327 and 2015"

    Nearly 700 years, is this a new outage record for M$??

    (Yes, yes, I know, feeble joke).

  28. Anonymous Coward
    Anonymous Coward

    Dirty shutdowns?

    The note about dirty shutdowns indicates that there was no communication between the cooling system and the servers.

  29. msroadkill

    Some failsafe clouds are more failsafe than others - G Orwell

  30. TheVogon

    "The note about dirty shutdowns indicates that there was no communication between the cooling system and the servers."

    Quite probably so. The failsafe is that the servers will shut down at a critical temperature - which is likely a better solution in most cases, as stuff that doesn't get too hot won't shut down.

    To shut a massive cloud system down cleanly in a hurry is simply not likely to be possible in under tens of minutes anyway, so that's likely another reason why they don't do that.

  31. Steve_Jobs1974

    AWS have been running Availability Zones (AZs) for years.

    Fully isolated zones (of clusters of data centers) with low-latency connections. This simply wouldn't happen in AWS. Microsoft have been in such a rush to expand their footprint that they have not done a great job here - a single isolated event takes out an Azure region. This is terrible.
