Amazon S3-izure cause: Half the web vanished because an AWS bod fat-fingered a command

Amazon has provided the postmortem for Tuesday's AWS S3 meltdown, shedding light on what caused one of its largest cloud facilities to bring a chunk of the web down. In a note today to customers, the tech giant said the storage system was knocked offline by a staffer trying to address a problem with its billing system. …


  1. Anonymous Coward

    So much for fault injection testing !

    " Hey Ravi - run this CLI ... that is what fixed it last time ... "

    1. Mpeler
      Mushroom

      Re: So much for fault injection testing !

      Here's a song for them then:

      I've looked at clouds from both sides now

      From up and down, and still somehow

      It's cloud illusions I recall

      I really don't know clouds at all.....

    2. The IT Ghost

      Re: So much for fault injection testing !

      Plenty of fault was injected, no doubt. Probably 4 or 5 people shown the door, none of them the one who actually flubbed the command.

    3. TheVogon

      Re: So much for fault injection testing !

      "an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," the team wrote in its message.

      "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."

      Wow they make manual command line changes that can impact lots of production systems?! Glad I don't use Amazon then. Such changes should be planned, change controlled, scripted in a file, and 4 eyed before pressing go....

  2. Herby

    To err is human...

    ...to really foul things up requires a computer.

    To guarantee a mess put a human in charge of said computer. Enough said.

    Fat fingers win every time as in "I only changed one card line"...

    I'm showing my age...

    1. TitterYeNot

      Re: To err is human...

      "To guarantee a mess put a human in charge of said computer. Enough said."

      And to guarantee a shitstorm of Diluvian proportions, put said barely-technical human in front of some automation they don't really understand - but hey it looked great in the management meeting.

      It'll save us loads of money, they said.

      It'll guarantee five nines availability, they said.

      It's foolproof, they said...

    2. Anonymous Coward

      Re: To err is human...

      ""I only changed one card line"..."

      What did you change?

      Nothing.

      What did you change?

      Nothing .....that is relevant.

    3. SotarrTheWizard

      Re: To err is human...

      Funny that you mention punchcards: I recently pulled one of my old boxes of code stacks out of the cellar, to let my grand-daughters make the quintessential early-70s craft project: the Punchcard Christmas Wreath.

      I had forgotten the joys of card stacks, and the multiple marker and highlighter lines across the top of the deck to help quickly restore the deck if you dropped it.

      Good times, good times. . .

      1. Anonymous Coward

        Re: To err is human...Punch cards divine

        The 1970s with their punch cards were good times, a peak in many ways for Canadians, and I'm not talking Fortran WATFOR or WATFIV.

        Back then the average family income was about $10,000. That's about $65,000 today, which if you look up family income is still roughly the middle of family incomes today. No real growth but apparently not much of a setback, until we look at where that income comes from and goes.

        In the 1970s family income usually came from a single earner. Today almost all $65K families are at least dual income and, thanks to dramatic changes in Canadian taxes (who pays and how much is collected), they do not get to keep much of that. Even the US numbers show us what good times the past was when it came to growth and optimism.

        "Expressed in 1950 dollars, U.S. median household income in 1950 was $4,237. Expenditures came to $3,808. Savings came to $429, or 10 per cent of income. The average new-house price was roughly $7,500 – or less than 200 per cent of income. By 1975, however, it took 300 per cent of median household incomes to buy a house; by 2005, 470 per cent."

        Many more years in school and training are required to get a job, all adults in a family have to work, most at jobs with much longer hours and often no benefits and today it is almost impossible to get a detached house in a major Canadian city for even 10X the annual income of the average high school graduate.

        When I look fondly at punch cards I am reminded that the good times were largely the result of citizens being "allowed" to share in the wealth they were creating.

      2. Anonymous Coward

        Re: To err is human...

        as late as 2001... we used blank punch cards at IBM as note pads / post it notes .. the file cabinets were stacked with them instead of note pads.

    4. fidodogbreath

      Re: To err is human...

      In the original Reg article about the S-pocalypse, I commented that the last voice command ever was "Alexa, turn off all the servers." Turns out, that's more or less what happened.

      Since the outage took down IFTTT, "Alexa, turn all the servers back on" didn't work.

    5. ZootCadillac

      Re: To err is human...

      Herby, don't misplace the punch tape!

  3. Anonymous Coward

    Homo Sapien Ergonomics

    I wish my finger tips were smaller than the average keyboard key.

    Otherwise, I'm quite proud of my Neanderthal heritage.

    1. MyffyW Silver badge

      Re: Homo Sapien Ergonomics

      I'm quite proud of my amply covered form, but plump fingers are a bloody nuisance.

      1. Anonymous Coward

        Re: Homo Sapien Ergonomics

        plump fingers are a bloody nuisance.

        Yes, but for most of the bigger boned, that's down to choices they've made (eg, whilst passing Greggs). It is also one that they can unmake, if the downsides of podgy digits get too much?

    2. gotes

      Re: Homo Sapien Ergonomics

      I wish the enter key wasn't so close to the backspace key.

  4. John Smith 19 Gold badge
    FAIL

    Makes me wonder how many others in the "playbook" have this capacity.

    Well it should be making Amazon wonder that.

    Under what circumstances would you want to be able to (virtually) shut down a whole data centre with one (mis) executed command?

    1. Anonymous Coward

      Re: Makes me wonder how many others in the "playbook" have this capacity.

      They dig into this to an extent in the full statement. The command alone wasn't enough to do it. It was running a command designed for a much smaller scale of S3 over too many machines causing a bunch of systems subsequently layered over those machines to mutually screw each other up.

      Critically it was the requirement to restart that really screwed them. The system hadn't been restarted in so long no one noticed the restart procedure took a really, really long time. Cheeky little humblebrag, methinks.

      They also mention a full audit of existing operations to ensure sanity checks are in place. I for one look forward to the outage caused by being unable to effect a change to as many machines as actually needed, because sod's law's just like that.

      1. Bronek Kozicki

        Re: Makes me wonder how many others in the "playbook" have this capacity.

        I think they need "chaos monkey" to occasionally reset some machine or shutdown some process. At random. That would force them to learn building inherently resilient systems, quickly.
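A rough sketch of the "chaos monkey" idea above: on each round, with some small probability, pick one target at random and terminate it. The names `chaos_round`, `terminate` and `probability` are made up for illustration; `terminate` stands in for whatever actually resets a machine or stops a process, and this is not how any particular chaos tool is implemented.

```python
import random


def chaos_round(targets, terminate, probability=0.1, rng=random):
    """Maybe terminate one randomly chosen target; return it, or None."""
    # Do nothing on most rounds so failures stay occasional.
    if not targets or rng.random() >= probability:
        return None
    victim = rng.choice(list(targets))
    terminate(victim)  # hypothetical callback: reset the host, kill the process...
    return victim


# Example: a fake fleet and a terminate callback that only records the victim.
killed = []
chaos_round(["web-1", "web-2", "db-1"], killed.append, probability=1.0)
```

Run on a schedule, something like this forces the restart path to be exercised regularly rather than discovered during an outage.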

      2. John Smith 19 Gold badge
        Unhappy

        "They also mention a full audit of existing operations to ensure sanity checks are in place. I"

        Oh dear, that sounds like an event.

        Not a process.

        Which suggests they will find (and hopefully fix) all such issues this time round, then a whole new bunch will accumulate over time till the next one surfaces and borks them again.

        Periodic review following significant (cumulative) changes should be SOP for such a large operation.

    2. Anonymous Coward

      Re: Makes me wonder how many others in the "playbook" have this capacity.

      "Under what circumstances would you want to be able to (virtually) shut down a whole data centre with one (mis) executed command?"

      Ultimately somebody has to have the power to do this because shutting down servers is a valid admin activity. However it should be made a multistep process with plenty of Are You Sure? types prompts (or even somehow require 2 people/keys nuclear missile launch style), not something that can be done with a single mistyped command. In the end it's a balancing act between treating your admins like responsible professionals and not children who need to be hand-held, but also ensuring one tired person can't make an almighty cock up.

      1. Keith Langmead

        Re: Makes me wonder how many others in the "playbook" have this capacity.

        "However it should be made a multistep process with plenty of Are You Sure? types prompts"

        Not just "are you sure Y/N", but also "Here's exactly what is about to be done... is that correct and what you actually intended? Y/N", otherwise anyone would just assume the command they'd entered would do what THEY intended, not what the command was about to do.

        1. Bronek Kozicki

          Re: Makes me wonder how many others in the "playbook" have this capacity.

          Not Y/N, but "in the prompt below, enter the missing part of the above shell command, to make it work". Force them to read and think, that is.

          1. donk1
            FAIL

            Re: Makes me wonder how many others in the "playbook" have this capacity.

            1st prompt

            This will shut down 1040 servers, please type 1040 to continue.

            2nd prompt

            This will reduce capacity enough to cause a service failure for the following 8 services

            A

            ...

            G

            Please type "8 SERVICE FAILURES" to continue.
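The two prompts above could be sketched roughly like this. It is only an illustration of the idea (make the operator retype the impact figures rather than press Y); `confirm_shutdown`, `servers`, `affected_services` and `read_input` are hypothetical names, and how a real tool would work out which services would fail is assumed away here.

```python
def confirm_shutdown(servers, affected_services, read_input=input):
    """Return True only if the operator retypes both impact figures exactly."""
    count = len(servers)
    # 1st prompt: the operator must type the number of servers affected.
    answer = read_input(
        f"This will shut down {count} servers, please type {count} to continue: "
    )
    if answer.strip() != str(count):
        return False
    # 2nd prompt: only shown when the change would take services down.
    if affected_services:
        phrase = f"{len(affected_services)} SERVICE FAILURES"
        print("This will reduce capacity enough to cause a service failure for:")
        for name in affected_services:
            print(f"  {name}")
        answer = read_input(f'Please type "{phrase}" to continue: ')
        if answer.strip() != phrase:
            return False
    return True
```

Passing `read_input` as a parameter keeps the function testable without a terminal; in interactive use the default `input` would be used.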

        2. Allan George Dyer

          Re: Makes me wonder how many others in the "playbook" have this capacity.

          "However it should be made a multistep process with plenty of Are You Sure? types prompts"

          So HAL was just working to design?

          "I think you know what the problem is, Dave"

      2. John Smith 19 Gold badge
        Unhappy

        "not something that can be done with a single mistyped command. "

        My point exactly.

        Yes servers have to be taken down. Yes sometimes clusters of servers have to be taken down. But it should be very rare that all need to be taken down at the same time.

        And it should be impossible to do so without whoever's doing it realizing exactly what is about to happen.

      3. Adam 1

        Re: Makes me wonder how many others in the "playbook" have this capacity.

        > Ultimately somebody has to have the power to do this because shutting down servers is a valid admin activity. However it should be made a multistep process with plenty of Are You Sure? types prompts

        How about "Please enter the shutdown validation GUID. This can be found on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard."

    3. Wayland

      Re: Makes me wonder how many others in the "playbook" have this capacity.

      One command is better than having to type 100. 100 commands put into a file, we call that a 'program'.

  5. gv

    PEBKAC

    That is all.

  6. Anonymous Coward

    Next problem:

    "I'm sorry Dave, I can't let you do that"

  7. Your alien overlord - fear me

    I want to know what was the command they were supposed to enter and what did they actually enter.

    1. Anonymous Coward

      It's a super awesome convenience to be able to hit tons of machines in a big data center operation, but as you can see things can go wrong in a big way. It would be interesting to see a pseudo-syntax of what happened, and whether this was a web GUI, a CLI, a script, or what have you.

      I can tell you at the Yahoo! CEO shuffle I attended a few years back we could address wide swaths of machines, but most of the folks knew what not to do, and how to break up big jobs (ha!) into easy-to-handle tasks. For instance, my first task was to run a script that fixed the storage issue with NetApp "moosehead" disks that would cause it to lose data and, the extra cool thing, not be able to recover from their RAID! Good times! This was on over 300 mail "farms", which were middle-tier mail handling clusters that did the sorting of mail vs junk/spam. The spam goes off to cheapo storage, and "good mail" goes to the main stores.

      Anyway, the IDs needed fixing to point users' mail to the new storage, by running a script on close to 6000 machines, no VMs, all pizza boxes. No WAY was I to just go nuts and try to run them all at once, even though you could very well do that with Limo, their internal custom multi-host command tool, later replaced by a tool called Pogo. Clusters of machines could also be addressed with aliases, so I could say "all hosts in a group with a simple name"; turn off the flag to show availability to the VIP.

      For the script work I was clued in via change management meetings, then I ran the script on one farm to make sure it worked and that we did not clobber any users, then we did 10 farms, then 100, and the rest (are here on Gilligan's Island!). No problem. My goal was to not cause any issue that would make it into the news. :P I also had nothing to do with the security, which is a big embarrassment to their new owners, I'm sure.

      I was also in Search (AKA the Bing Gateway) and there we typically choose UTC midnight on Wednesdays to perform updates to the front end search servers. In the US there were two big data centers, each with two clusters of 110 hosts to handle the web facing search front end. For maintenance, you just choose a single host, take it out of the global load balancer, then update it, and drop it back in with extra monitoring turned up. If it does not crap itself, we could then take out half of a data center, do the update, put them back in, then repeat the process three more times for the other clusters, and that was that. But, yes, super easy to fuck up and take out every data center if you don't pay attention to your machine lists.
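The staged approach described in this comment (one farm or host first, then 10, then 100, then the rest, checking health between stages) might look roughly like the sketch below. This is not Limo or Pogo; `plan_stages`, `staged_rollout`, `update` and `healthy` are hypothetical names, and the two callbacks stand in for whatever actually runs the script and monitors the hosts.

```python
def plan_stages(hosts, sizes):
    """Split hosts into stages of the given sizes; leftovers form a final stage."""
    hosts = list(hosts)
    stages, start = [], 0
    for size in sizes:
        if start >= len(hosts):
            break
        stages.append(hosts[start:start + size])
        start += size
    if start < len(hosts):
        stages.append(hosts[start:])
    return stages


def staged_rollout(hosts, sizes, update, healthy):
    """Update stage by stage; stop after the first stage with an unhealthy host.

    Returns (hosts_updated_so_far, finished_without_problems).
    """
    updated = []
    for stage in plan_stages(hosts, sizes):
        for host in stage:
            update(host)
            updated.append(host)
        # Check the whole stage before touching the next, larger one.
        if not all(healthy(host) for host in stage):
            return updated, False
    return updated, True


# Example: 300 mail farms rolled out as 1, then 10, then 100, then the rest.
farms = [f"farm{i}" for i in range(300)]
stage_lengths = [len(stage) for stage in plan_stages(farms, [1, 10, 100])]
```

The point of the small first stage is that a bad script or a wrong machine list shows up while only one farm is affected.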

      1. Anonymous Coward
        Thumb Up

        It's a super awesome convenience...

        You could take down Bing or Yahoo! any time you like and for as long as you like for "maintenance" and pretty much no-one would ever notice. In fact, why not just leave them down and free up some server space?

        1. fredesmite
          Meh

          Re: It's a super awesome convenience...

          Quite honestly - if Bing, FB, google, yahoo, blah blah - disappeared would they really be missed?

          They produce nothing other than hordes of advertising spam. Remember the days before that crap existed... young adults could actually have a face to face conversation, working meant doing something other than browsing the internet for links to share among co-workers...

      2. donk1

        6000 machines...so run 200 machines at a time for 30 times.

        What is this obsession with 10, 100, 2000, rest and doing a massive population in 5 steps?

        Even if 2110 machines worked fine how long would it take to fix the last 3900 machines if enough of them broke?

        For failures it is not the number of times you have done it before but the size of the failure domain and how long it takes to fix.

        It should be possible to roll out automatically in small batches and even have multiple upgrades rolling out at the same time on an automatic schedule, rippling across the farm!

        If it is automated and scheduled who cares how many batches of upgrades are run?

        You would catch errors with less impact that way as the failed batch size would be smaller and it would be minimal extra work if designed correctly.

        This is the next stage in cloud service design - being able to have slower rolling upgrades with smaller batches!
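The arithmetic above (6000 machines, 200 at a time, 30 batches) and the "smaller failure domain" argument can be sketched as follows. `batches`, `rolling_upgrade` and `upgrade_batch` are hypothetical names; `upgrade_batch` is assumed to return the hosts in the batch that failed.

```python
def batches(hosts, batch_size):
    """Yield consecutive fixed-size slices of hosts (the last may be shorter)."""
    for start in range(0, len(hosts), batch_size):
        yield hosts[start:start + batch_size]


def rolling_upgrade(hosts, batch_size, upgrade_batch):
    """Upgrade one batch at a time and halt at the first batch with failures.

    Returns (number_of_hosts_in_fully_successful_batches, failed_hosts).
    """
    completed = 0
    for batch in batches(hosts, batch_size):
        failed = upgrade_batch(batch)
        if failed:
            # At most one batch (batch_size hosts) is left in a doubtful state.
            return completed, failed
        completed += len(batch)
    return completed, []


# Example: 6000 machines in batches of 200.
machines = list(range(6000))
batch_count = len(list(batches(machines, 200)))
```

With a fixed batch size the worst case is bounded by one batch, whereas a 10/100/2000/rest plan can leave thousands of machines to fix if the last step goes wrong.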

    2. fronty

      rm -rf /

      1. Kevin McMurtrie Silver badge

        Funny, this should have finished while I was at lunch

        $ cd storage

        $ rm -rf tmp1* tmp2* tmp3 *

        1. muddysteve

          Re: Funny, this should have finished while I was at lunch

          >$ cd storage

          >

          >$ rm -rf tmp1* tmp2* tmp3 *

          That's always been the trouble with computers - they do what you tell them to, rather than what you wanted them to.

        2. Doctor_Wibble
          Boffin

          Re: Funny, this should have finished while I was at lunch

          When it comes to spotting mistakes, the first guess is probably the correct one - and having had numerous requests for file recovery over the years, the 'extra space' problem is not that rare.

          Perhaps oddly it seemed to be more common amongst people who did know what they were doing but didn't stop to re-inspect what they typed to see if they accidentally batted the space bar somewhere.

          Though at the other end of the scale, someone trying to follow unfamiliar instructions printed in a poorly-selected font where they have been told 'do this exactly' and it sure as hell looks like that's meant to be a space there...
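To see why the extra space in `tmp3 *` is so destructive, here is a small simulation of shell-style wildcard expansion that never touches the filesystem. `expand` and the file names are made up for illustration, and `fnmatch` is only an approximation of real shell globbing (for instance, it does not treat dotfiles specially).

```python
import fnmatch
import shlex


def expand(arguments, names):
    """Expand each space-separated pattern against a list of names in a directory."""
    expanded = []
    for pattern in shlex.split(arguments):
        hits = fnmatch.filter(names, pattern)
        # Like most shells by default, keep a pattern that matches nothing as-is.
        expanded.extend(hits if hits else [pattern])
    return expanded


names = ["tmp1a", "tmp2b", "tmp3c", "customers.db"]
intended = expand("tmp1* tmp2* tmp3*", names)   # only the three tmp files
typo = expand("tmp1* tmp2* tmp3 *", names)      # a bare * now matches everything
```

Printing the expanded list before deleting anything is the same idea as the "list it and ask for confirmation" suggestions elsewhere in this thread.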

        3. Colin Bull 1

          Re: Funny, this should have finished while I was at lunch

          It is very easy to set an alias for rm so that it lists all directories it is going to delete and asks you for confirmation first - simples

          1. Anonymous Coward

            Re: Funny, this should have finished while I was at lunch

            or just use " rm -i"?

        4. stu 4

          Re: Funny, this should have finished while I was at lunch

          I did a similar thing about 2 months ago on my mac while trying to tidy stuff up in the root drive.

          UserTemp

          Usertemp

          ...

          sudo rm -rf User*

          hmm.that's taking an awful long time to delete some temporary crap....

          ..argh!@!!@#!@^#^

          CntlC CntlC CnltC

          Luckily good old timemachine got me back to an hour before and I had a 'Users' directory again.

          I have to say, in 10 years of mac ownership... one of the many many many times timemachine has got me out of a deep deep hole.

          I also remember one time, about 20 years ago - working for a large UK telecom company...needed to reboot one of the live boxes that handled 30% of the load of UK non geographic phone calls (0845, 0800, etc)...

          sudo shutdown now -r

          ....

          ...

          hmm can't seem to connect to that... doesn't seem to be coming back up..

          It was in an unmanned exchange 30 miles from the nearest engineer.... had to get one of em to go out there, and press the ON button again.

    3. roselan

      rm -rf //

    4. TomChaton
      Alert

      re: ...and what did they actually enter.

      I suspect it had an asterisk in it somewhere.

  8. Dwarf

    This command will affect 13,432,454,456,234 objects. Are you sure?

    Of course I'm sure, it's pre-programmed: I hit Yes when any pop up or confirmation is shown.

    1. Anonymous Coward

      We used to ask for double confirmation on important decisions like abandoning things.

      We soon learned that the second prompt had to have an inverse question - so a second "yes" was effectively a "no". That blocked trigger happy responses and made people stop and think.

      1. Anonymous Coward

        We soon learned that the second prompt had to have an inverse question - so a second "yes" was effectively a "no". That blocked trigger happy responses and made people stop and think.

        Works even better if you present the two dialogs in a random order...
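A minimal sketch of the inverse-question trick from the two comments above, including the random ordering. `double_confirm`, `action`, `ask` and the wording of the questions are all hypothetical; the only point is that answering "yes" to everything no longer gets through.

```python
import random


def double_confirm(action, ask, rng=random):
    """Ask a direct and an inverse question in random order.

    Proceeds only if the direct question gets "yes" and the inverse gets "no".
    """
    questions = [
        (f"Do you want to {action}? [yes/no] ", "yes"),
        (f"Do you want to keep things as they are and NOT {action}? [yes/no] ", "no"),
    ]
    rng.shuffle(questions)  # the operator cannot learn a fixed yes/no sequence
    return all(ask(text).strip().lower() == expected for text, expected in questions)


# Example: an operator who hits "yes" on every prompt is refused.
refused = double_confirm("abandon the job", lambda _question: "yes")
```

In interactive use `ask` would simply be `input`.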

  9. Daedalus

    Wur doomed

    The real Y2K problem was that in the year 2000 technology got big enough that there would never be enough wise people to look after it.

  10. Anonymous Coward

    SELECT * FROM EC3_Instance THEN DROP ALL$

    Beware the wildcard!

  11. Anonymous Coward

    Availability Zones

    What Amazon left out, and what El Reg didn't mention in their article 12 hours ago, is Availability Zones. You're not supposed to have to go multi-region in order to be able to sustain a major AWS outage. Being in multiple AZs is supposed to allow you to survive a fat finger by an AWS employee.

    The fact that Amazon's statement talks so casually about US-EAST-1 S3 makes it clear that there is no segmentation of S3 between AZs. If S3 isn't segmented that probably means other AWS services aren't either. Paid extra for multi-AZ RDS? Added extra EC2 instances for multi-AZ load balancing? It won't help at all if RDS and ELB are administered at the regional level anyway.

    I think Amazon has some splaining to do. If their own services aren't redundant across AZs then what is the point of customers paying extra to be in multiple AZs? Is the only independent component of AZs the power source? That is a far cry from Amazon's selling points of multiple AZs.
