Revealed: Why Amazon, Netflix, Tinder, Airbnb and co plunged offline

Netflix, Tinder, Airbnb and other big names were crippled or thrown offline for millions of people when Amazon suffered what's now revealed to be a cascade of cock-ups. On Sunday, Amazon Web Services (AWS), which powers a good chunk of the internet, broke down and cut off websites from people eager to stream TV, or hook up with …


  1. Ken Moorhouse Silver badge

    Is Control Theory still on the syllabus for ICT qualifications?

    The nature of control systems has changed a lot over the years but I dare say that the principles are as relevant today as they were back when I were a student.

    1. Destroy All Monsters Silver badge

      Re: Is Control Theory still on the syllabus for ICT qualifications?

      Anyone can get the basics really quickly, but I daresay that in these systems here, it is very unclear how to analyze things in terms of feedback loops. This is worse than an electric power circuit. Even finding the relevant loops seems dicey as they may change from day to day. Overprovision or build in features to kill services off in order to regain control and sod the quality of service are two possible options.

    2. Anonymous Coward

      Re: Is Control Theory still on the syllabus for ICT qualifications?

      Clearly, in cloud land - no.

      There are multiple proofs of it - if you look at what they are trying to do with networking, it is in the same league. The moment I look at some of their "software driven cloud network designs", it is obvious that there is no convergence, no global minimum, and it is subject to at least one positive feedback loop with a potential tailspin death of the service. However, explaining this to the cloud guy is like casting pearls before swine. He does not get it.

      1. Destroy All Monsters Silver badge
        Holmes

        Re: Is Control Theory still on the syllabus for ICT qualifications?

        it is obvious that there is no convergence, no global minimum, and it is subject to at least one positive feedback loop with a potential tailspin death of the service

        I like overheated rhetoric as much as anyone, but most data-processing systems are not amenable to cybernetic analysis. Even in simple analog systems, control engineers have to go through major contortions to get something that behaves predictably.

    3. keithpeter Silver badge
      Coat

      Re: Is Control Theory still on the syllabus for ICT qualifications?

      The mathematics of working out exactly how many straws of what length may distress the camel's spine is notoriously difficult. Especially when the camel is changing weight.

      Seriously, in terms of feedback theory, the loop is both non-linear (with thresholds) and subject to hysteresis.

    4. DropBear

      Re: Is Control Theory still on the syllabus for ICT qualifications?

      This sort of issue is notoriously difficult to avoid, even for professionals - I wouldn't call the nice chaps who devised the TFTP protocol amateurs, yet they managed to bake exactly such a problem (see "Sorcerer's Apprentice Syndrome") right into the specification. It only got corrected years later (and, unfortunately, the correction pretty much breaks TFTP with u-boot to this day if there is any packet loss).

  2. batfastad

    If only there was a way

    But that's fine, because you've built the service that provides your business's revenue to scale across multiple providers, or at least to fail over to another? Or at least to use multiple AZs within Amazon? Right?

    The only fail here is people relying on the availability of a single site. The same fail everyone has been talking about for 30+ blinking years.

    1. tin 2

      Re: If only there was a way

      No, because the cloud means you don't need any of that. All the resilience is in the super-massive cloud provider's infrastructure and software, thus sparing your puny little business from needing to try (and fail, ofc) to do all that stuff.

      Except it doesn't, does it?

      1. Probie

        Bollocks

        I will grant you that the marketing material spaffed everywhere leads to that assumption, but if you actually read the best practice guides etc. there is mention of multiple AZs and then multiple regions if you want to "stay up no matter what". The reality is that people want to see the "cost reduction" of the cloud - egged on by the incessant moaning and bitching of cheap, tight-fisted, tossing finance dicks - yet the truth is that in the cloud multiple regions ain't cheap. It's still the same old caveat: "you get what you pay for".

        1. Ken Moorhouse Silver badge

          Replication

          "Staying up" is arguably the tip of the iceberg.

          Another difficult problem surely occurs where it is essential that data can be input or updated at any node, and where data-locking is relied on to prevent corruption.

          If any segment of the cloud goes down, then surely the nodes on both sides of the break would have to be set "read-only" until replication can again be guaranteed.
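
          A back-of-the-envelope sketch of that rule, assuming a simple majority-quorum check - purely illustrative, nothing to do with how DynamoDB or any particular cloud actually does it:

          def accept_writes(reachable_replicas, total_replicas):
              # Keep accepting writes only while this node can still see a strict
              # majority of the replica set; otherwise drop to read-only so the
              # two sides of a partition cannot diverge.
              return reachable_replicas > total_replicas // 2

          # Example: a 3-way replicated table partitioned 2/1.
          print(accept_writes(2, 3))   # True  - the majority side keeps writing
          print(accept_writes(1, 3))   # False - the minority side goes read-only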

        2. Anonymous Coward

          Re: Bollocks

          As people have said earlier, DynamoDB is not IaaS but a managed service which claims to have resilience built in.

          From the AWS FAQ:

          ____

          Q: How highly available is Amazon DynamoDB?

          The service runs across Amazon’s proven, high-availability data centers. The service replicates data across three facilities in an AWS Region to provide fault tolerance in the event of a server failure or Availability Zone outage.

          Q: How does Amazon DynamoDB achieve high uptime and durability?

          To achieve high uptime and durability, Amazon DynamoDB synchronously replicates data across three facilities within an AWS Region

          ____

          Now, I grant you that multiple-region replication is possible, but given that this problem wasn't caused by three availability zones going down, but rather by a capacity error within the database infrastructure, it's unlikely that additional resilience would have solved anything. This problem falls very much into the 'managed service failure' category, and so the fault sits with AWS.

          The lesson here could be that people should stick to IaaS from Amazon and plan their own service management (especially at the scale of Netflix and Airbnb), but it certainly isn't that Amazon aren't at fault because people ignored their best practice advice, as a number of comments have suggested.

    2. Voland's right hand Silver badge

      Re: If only there was a way

      That is what you actually pay for when buying cloud. You pay for the assumption that you have paid for the mainframe of the 21st century - an infrastructure that does not fail. It walks like a mainframe, it talks like a mainframe, it is a mainframe, and the IBM CEO of old (Watson) who said "I think there is a world market for about five computers" is having the last belly laugh.

      If you are trying to failsafe and failover a cloud, you have failed to grok what cloud is in the first place.

      Now, the fact that the present generation of Mainf^WClouds is pretty far from being failproof by design is an entirely different story.

      1. Ian Michael Gumby

        @ Voland's right hand Re: If only there was a way

        You've never worked in telephony have you?

        Talk about failure testing and redundancy built in to the switches...

        1. Anonymous Coward
          Go

          Re: @ Voland's right hand If only there was a way

          You've never worked in telephony have you?

          Yup, the land of five nines or better. One day the rest of the computing world will catch up...

          Just checking uptime on our last bit of legacy kit... yup, 2000+ days. A lightning strike meant we had to take it down for 15 mins to replace a power unit... no, no, don't get me wrong, it was still running, but the sparky didn't like working on live kit, otherwise it would be about 3000+.

          1. Doctor Syntax Silver badge

            Re: @ Voland's right hand If only there was a way

            "the sparky didn't like working on live kit"

            Clearly you need a BOFH. Choice of working on live kit or the cattle prod.

            1. TheVogon

              Re: @ Voland's right hand If only there was a way

              "Clearly you need a BOFH"

              In telecoms, they are all employed answering customer support calls...

          2. Peter Gathercole Silver badge

            @Lost all faith

            Like AT&T in 1990, when a cascade failure took out long-distance telephony in the US, maybe?

            But those types of failure in telephony occurred maybe once a decade, and generally triggered reviews and remedial work to make sure that the same problem never happened again. Cloud failures seem to be much more frequent than that, and don't appear to get the same rigorous response.

            Maybe all cloud providers should learn to walk before they attempt to run!

      2. Ken 16 Silver badge

        Cross Provider DR

        I believe that to "do cloud right" you need to be able to exit from your vendor, and you need to have a DR capability hosted by another vendor. It's not easy and I haven't seen it done (or managed to convince any customer it's worth doing), but I'll keep expressing the opinion.

        Obviously the cloud vendors make it easy to get data in and hard (or expensive) to get it out, but imagine if a replica environment had been available, ticking over at minimum size on, say, Azure, with the ability to scale up on demand?

        1. TheVogon

          Re: Cross Provider DR

          "I believe to "do cloud right" you need to be able to exit from your vender and you need to have a DR capability hosted by another vender. It's not easy and I haven't seen it done (or managed to convince any customer it's worth doing) but I'll keep expressing the opinion."

          It's relatively easy these days: http://www.theregister.co.uk/2015/07/10/microsoft_tries_to_paint_vmware_azure_with_disaster_recover_detour/

      3. TheVogon

        Re: If only there was a way

        "That is what you actually pay for when buying cloud. You pay for the assumption that you have paid for the mainframe of the 21st century - an infrastructure that does not fail"

        Not in public cloud you don't. Availability is stated at 99.9% if you are lucky.

        You should design your applications to be resilient. See https://msdn.microsoft.com/en-us/library/azure/dn251004.aspx for some guidelines.
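
        One pattern that tends to turn up in that sort of guidance, alongside retries with backoff, is a circuit breaker: stop hammering a dependency that is already failing and fail fast instead. A bare-bones sketch with made-up thresholds - not code from the linked article:

        import time

        class CircuitBreaker:
            # Trips after `failure_threshold` consecutive failures, then refuses
            # calls until `reset_timeout` seconds have passed; one successful
            # trial call closes it again.
            def __init__(self, failure_threshold=5, reset_timeout=30.0):
                self.failure_threshold = failure_threshold
                self.reset_timeout = reset_timeout
                self.failures = 0
                self.opened_at = None

            def call(self, operation):
                if self.opened_at is not None:
                    if time.monotonic() - self.opened_at < self.reset_timeout:
                        raise RuntimeError("circuit open - failing fast")
                    self.opened_at = None          # half-open: allow one trial call
                try:
                    result = operation()
                except Exception:
                    self.failures += 1
                    if self.failures >= self.failure_threshold:
                        self.opened_at = time.monotonic()
                    raise
                self.failures = 0                  # success resets the breaker
                return result

        Wrap each remote call as breaker.call(lambda: do_request()) and combine it with backoff on the retries.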

        1. Hans 1

          Re: If only there was a way

          >Not in public cloud you don't. Availability is stated at 99.9% if you are lucky.

          And MS has been unlucky since Azure's inception, because they have yet to reach that availability... With servers that have a max uptime of one month, a monthly patch day requiring multiple reboots, and the occasional bricking of servers of a given type, I must say they are not doing too badly - still, nowhere near 99.9%.

          1. Not until they manage to hire a drone who can do certificate management properly, which I doubt.

          2. Not until they move their Windows Update test teams from China to the US or another high-tech country such as the UK, France or Germany. Don't get me wrong, I am sure Chinese techies are good as well, but look at what has happened to Windows Updates since testing moved to China (around the release of Windows 8): we have had several servers with update issues, found before updating live servers (luckily), but still. Windows workstations have had update issues several times this year alone.

      4. Dan 10

        Re: If only there was a way

        From AWS Cloud Best Practices:

        "Be a pessimist when designing architectures in the cloud; assume things will fail. In other words, always design, implement and deploy for automated recovery from failure. In particular, assume that your hardware will fail. Assume that outages will occur. "

        Customers aren't paying for an infrastructure that does not fail - they are paying for things like elasticity, parallelism, and the transfer of capex to opex.

    3. Anonymous Coward

      Re: If only there was a way

      No, this failure crossed availability zones.

    4. Anonymous Coward

      Re: If only there was a way

      This thing where people come in to blame Netflix et al for not adding resilience is getting a bit old.

      DynamoDB is a managed DB service with resilience built in automatically. There is no multi-AZ option for clients because it's built in.

      Should people not use it now it's been demonstrated unreliable? Quite possibly. Should they avoid any Amazon managed service? Perhaps, though difficult.

      Is it their fault and not Amazon's when DynamoDB fails? No.

      Not knowing much about AWS is fine. Assuming that their clients are at fault when one of their managed services goes down isn't. Unless your point is that they shouldn't trust AWS at all in which case you'll likely find a solid approval base here.

      1. Adam 52 Silver badge

        Re: If only there was a way

        The thing is Netflix do actually do this right. They do run active-active across regions and famously have Chaos Gorilla. So I'm still curious as to what went wrong for them (if indeed anything did go badly wrong and this isn't just media hype).

        The rush to blame "cloud" from commentards who have no idea what they're talking about is a sad reflection of Reg readers.

  3. Henry Wertz 1 Gold badge

    No exponential backoff?

    "Unavailable servers continued to retry requests for membership data, maintaining high load on the metadata service."

    I'd call this the root problem. No exponential backoff? AWS client APIs support exponential backoff with jitter. In other words, in case of failure a retry does not just wait x seconds then retry... it may start retrying in 1 second, then 2 seconds, then 4 seconds, doubling the delay each time. The "jitter" part means there'll be a bit of random variation in the time delays, so if the failed queries were all fired off at once, the retries won't be.

    It sounds like calls from the storage system to DynamoDB were using fixed retry intervals instead of exponential backoff - or possibly just not enough backoff. With fixed backoff, once some load limit was hit where enough calls failed *even temporarily*, the retries would be mixed in with new calls (which, when they also fail, would be retried), and the load would just keep getting worse and worse as more and more calls were retried. From their description of not even being able to reach the admin interface, this sounds likely. With exponential backoff with jitter, the load would increase at first as these calls were retried at short intervals, then level off and hopefully decrease as failed calls were retried less and less frequently. And if they were lucky and it was just a load spike, then (perhaps even just a few minutes later) the load could have been low enough for new calls to succeed and the failed calls to also succeed on retry.
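
    For what it's worth, a minimal sketch of that discipline - capped exponential backoff plus full jitter. This is generic Python, not the actual AWS SDK logic; TransientError and the constants are made up for illustration:

    import random
    import time

    class TransientError(Exception):
        """Stand-in for whatever 'temporarily unavailable' error the service raises."""

    def call_with_backoff(operation, max_attempts=8, base=1.0, cap=60.0):
        # Retry `operation`, doubling the backoff ceiling on each failure.
        for attempt in range(max_attempts):
            try:
                return operation()
            except TransientError:
                # Exponential growth: 1s, 2s, 4s, ... capped at `cap` seconds.
                ceiling = min(cap, base * (2 ** attempt))
                # Jitter: sleep a random amount in [0, ceiling) so callers that
                # failed together don't all come back at the same instant.
                time.sleep(random.uniform(0, ceiling))
        raise RuntimeError("gave up after %d attempts" % max_attempts)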

    1. Crazy Operations Guy

      Re: No exponential backoff?

      I was just about to post something like that... That technique has been around pretty much forever - hell, DECnet had a setting to do just that.

      1. Mike Pellatt

        Re: No exponential backoff?

        Not just (or even) DECnet - it's fundamental to CSMA-CD working properly, else everyone would keep trying to transmit at the same time.

        With today's star network topology, with switches and FDX links rather than a shared bus, it's not relevant.

        1. TeeCee Gold badge

          Re: No exponential backoff?

          Ah yes, but that didn't stop old-skool Ethernet going into a traffic storm and catastrophic failure once saturation was reached, did it?

          That technique stops working once, whichever moment you pick to retry, you can pretty much guarantee that someone else will pick it too. This causes a blizzard of retries, followed by the metaphorical sound of a grand piano being dropped 20m onto concrete.

          In heavily loaded traffic situations, a protocol with some form of arbitration[1] is required.

          [1] Like maybe taking turns with an access token??

          1. Ken Moorhouse Silver badge

            Re: Access Token

            ...until the Access Token gets lost, then there's got to be a mechanism for creating a replacement. Then if that mechanism erroneously creates one...

            (Perfectly reasonable suggestion. Just sayin' that life ain't easy).

    2. GoNoGo
      Happy

      Re: No exponential backoff?

      Ah! The fond memories of programming multiuser Clipper 5.01 record locking with random back-off intervals :)

    3. AOD
      FAIL

      Re: No exponential backoff?

      I could be wrong here but I have a hazy memory from my CompSci degree that for Ethernet CSMA/CD the whole point of the exponential backoff and retry was that your retry period was randomly chosen to be between 1 and 2 to the power of your retry attempt (based on what your slot/interval time was).

      From Wikipedia:

      After c collisions, a random number of slot times between 0 and 2^c - 1 is chosen

      Otherwise you'd have clients that clashed at the same time continually clashing as they all try again at the same time.

      Any time I've seen a "dumb" retry approach in a production system (hey, let's wait 3 seconds between retries and give up after 3 attempts), this always springs to mind, and I'm frightened by how many folk haven't a clue when I mention it.
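
      For reference, a tiny sketch of the slot-time choice that Wikipedia line describes - classic truncated binary exponential backoff, with the textbook 10 Mbit/s Ethernet constants, offered as an illustration rather than a reference implementation:

      import random

      SLOT_TIME_US = 51.2   # classic 10 Mbit/s Ethernet slot time, in microseconds

      def backoff_slots(collisions):
          # After c collisions, wait a random number of slot times in [0, 2^c - 1],
          # with the exponent capped at 10; give up after 16 attempts.
          if collisions > 16:
              raise RuntimeError("too many collisions, giving up")
          k = min(collisions, 10)
          return random.randint(0, (1 << k) - 1)

      delay_us = backoff_slots(3) * SLOT_TIME_US   # e.g. somewhere between 0 and 7 slots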

  4. Big Ed

    The Fallacy of the Public Cloud is Exposed

    This outage really exposes the fallacy of the public cloud: everyone expects cloud capacity to scale infinitely and handle any demand. The reality is that it can't.

    Network folks have known this for ages; e.g. shared Internet bandwidth is unreliable; if you want reliability, buy private lines from different carriers with route diversification.

    Time for the compute and storage folks to learn that lesson, or stand up your own cloud with your own peaking capacity.

  5. Grikath

    So in short...

    The system DDOS-ed itself...

    Pretty...

  6. Your alien overlord - fear me

    I wonder what's going to happen to the beancounter who said "2 metadata servers? It'll cost too much, you can cope with just one", not realising that Sod's Law in a supposedly resilient scenario is pretty much guaranteed to strike.

    1. 100113.1537

      What's going to happen?

      Nothing. He has already retired with his stock options intact based on the 3-5 years of growth the company experienced after his decision.

    2. TeeCee Gold badge
      Facepalm

      Well, as the article does say "metadata servers", plural, presumably nothing, as he doesn't exist.

  7. Edwin
    Facepalm

    Different kinds of politics

    The problem is that the two kinds of democracy are incompatible.

    The EU politicians are a group of inept consensus-builders, which means that anything that comes out of the EU is a bit like the big American beers: everything with substance has been removed to avoid offending anyone, and the result is flat and pointless.

    The Americans are owned by big business, which means that anything that comes out serves the interest of the dollar and not necessarily the interests of warm bodies anywhere. And when push comes to shove, the guvmint will do what it likes anyway.

    Don't even get me started on polarisation as a political tactic...

    I'm more and more convinced that benevolent dictatorship is my form of government of choice.

    1. Destroy All Monsters Silver badge
      Paris Hilton

      Re: Different kinds of politics

      That's another thread though.

      benevolent dictatorship is my form of government of choice

      For all 15 minutes of it. Then the Heydriches move in.

  8. Henry Wertz 1 Gold badge

    Still relevant

    "No just (or even) DECNet - it's fundamental to CSMA-CD working properly, else everyone would keep trying to transmit at the same time.

    With today's star network topology with switches and FDX links, rather than a shared bus, it's not relevant."

    Uhh, yeah it is. Not usually at the network level (except wifi) but at the application level this can also be important. In this case, if the timed out requests were retried at some regular interval, then they could just keep causing load spikes and timing out at regular intervals (and if the load spike lasts longer than the retry interval you're really done.)
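
    A rough way to see the difference (numbers entirely made up, nothing to do with AWS's real traffic): simulate a burst of clients that all failed at the same moment and then retry either on a fixed timer or with jittered exponential backoff, and compare the worst one-second spike.

    import random
    from collections import Counter

    CLIENTS = 1000      # callers whose first request failed at t = 0
    HORIZON = 60        # seconds of simulated time
    RETRIES = 5         # retry attempts per client
    BASE = 10.0         # fixed retry interval, and base delay for the jittered case

    def arrivals(jitter):
        # Count how many retries land in each whole second, across all clients.
        counts = Counter()
        for _ in range(CLIENTS):
            t = 0.0
            for attempt in range(RETRIES):
                if jitter:
                    t += random.uniform(0, BASE * 2 ** attempt)   # backoff with jitter
                else:
                    t += BASE                                     # everyone in lockstep
                if t < HORIZON:
                    counts[int(t)] += 1
        return counts

    print("worst second, fixed retries:   ", max(arrivals(False).values()))  # all 1000 at once
    print("worst second, jittered backoff:", max(arrivals(True).values()))   # spread out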

    1. Mike Pellatt

      Re: Still relevant

      Yeah, well, I thought it was pretty clear that I was referring to the hardware layer on wired networks of a certain topology.

      This whole subthread is pointing out that fixed-period retries are well recognised as A Bad Idea. They have been for a Very Long Time.

  9. Anonymous Coward

    The Cloud

    Other people's computers you have no control over

    1. phuzz Silver badge

      Re: The Cloud

      Other people's computers you have a little control over and which are cheaper than the alternatives.

      Fixed that for you

  10. Ken Moorhouse Silver badge

    Infrastructure Ownership

    At the moment, cloud infrastructure seems to be owned and managed by a small number of entities - AWS being one of them.

    Do you not think there will come a time when the owners/investors in these cloud companies decide that certain aspects of their hardware, software, service provision or property portfolio have become undesirable in some way, and try to divest them? (A government wanting to split a company to combat anti-competitive practices would be one example.)

    All very well saying that your cloud service is provided by Amazon, but will you be told if it becomes sub-contracted? This has repercussions not only for the ability of a company in the ownership chain to enforce SLAs, etc., but in other areas outside the scope of this topic, e.g. privacy. No doubt Amazon would do due diligence on their sub-contractors, but when the sub-contractors start to sub-contract, that's where the messy business would begin.

    Look through history for examples - particularly those created as a result of government intervention.

  11. pklausner

    Those pesky metadata...

    ... obviously are hard to get right. Joyent has a completely different cloud, yet failed with a similar problem, cf:

    https://www.joyent.com/blog/manta-postmortem-7-27-2015

  12. Anonymous Coward

    Funniest darned thing

    Popcorn Time (AKA Netflix for Pirates) kept working.

    Perhaps it's time to switch streaming services.

  13. I. Aproveofitspendingonspecificprojects

    IT staff > resource management

    Where > = pay grades and job security/satisfaction.

  14. wolfetone Silver badge
    Boffin

    Solution

    (1 Website + 1 Database) + 1 Server = Nein Problem

    B*llocks to the cloud.

    1. dogged

      Re: Solution

      > (1 Website + 1 Database) + 1 Server = Nein Problem

      Good luck with that next time you get a power outage, segfault, updates demanding a reboot, infrastructure - eg, internet pipe - outage, etc ad nauseam.

      One datacentre is safer but still not safe.

      I see a lot of pissing and moaning here but the problem with Cloud is data protection, not uptime. Pretty much every alternative has worse uptime.

      1. wolfetone Silver badge

        Re: Solution

        Well, what's more important? Security of data, or not being able to buy a pair of tights at 3am?

        Don't answer this if you're into nylon.
