Fat-fingered admin downs entire Joyent data center

Cloud operator Joyent went through a major failure on Tuesday when a fat-fingered admin brought down an entire data center's compute assets. The cloud provider began reporting "transient availability issues" for its US-East-1 data center at around six-thirty in the evening, East Coast time. "Due to an operator error, all …

COMMENTS

This topic is closed for new posts.


  1. Anonymous Coward

    How fat was the finger?

    We need to know, so we can learn from his mistake: How fat is his finger?

    I will obviously be instituting a company-wide, lathe-based pre-emptive fix for this for my staff. Now that's ISO9000 Preventative Action stuff. It's not Corrective - none of us have ever done something like that before.

    Cheers

    Jon

    1. Pete 2 Silver badge

      Treating the symptoms, not the disease

      Unfortunately, this[1] is exactly what most companies do when faced with this sort of issue. They say "oooh, the <command> is far too powerful - let's remove it, or require an operator to get approval from the change board before it's used in future".

      Although Joyent have said they are instigating a full investigation, they will find that their system has so many fundamental holes designed in that fixing them all will require not only a total re-write, but a complete redesign of their software and operational practices. A prospect that is likely (considering how poor the whole discipline of system design is) to introduce as many new problems as it fixes.

      So ultimately I fully expect the expedient solutions to be applied: an extra layer of checks that will slow down operations and make life for operators even more exasperating (such as an "are you sure" dialog after every command) and will soon become ineffective due to the pressures of getting stuff done (a 10% decrease in operational effectiveness is never paid for with a 10% increase in staff numbers) and management cuts.

      [1] yes, satire: I get it

    2. I. Aproveofitspendingonspecificprojects

      Re: How fat was the finger?

      I think The Register is in sausages per finger but we are dealing with lemons here - until we learn more.

    3. TheVogon

      Re: How fat was the finger?

      "How fat is his finger?"

      Fat enough to press a big red button labelled 'EPO' - and stupid enough to confuse it with the light switch!

  2. Nate Amsden

    worse

    http://www.theregister.co.uk/2008/08/28/flexiscale_outage/

    "As Lucas explained in an email to customers - posted to the Web by CNet - the outage occurred when an XCalibre engineer accidentally deleted one of FlexiScale's main storage volumes."

    http://www.theregister.co.uk/2009/05/15/flexiscale_upgrade/

    "Nine months after an engineer accidentally deleted its Amazon-like compute cloud - and six months after a second major outage - FlexiScale has finally completed a software overhaul meant to avoid such extended blackouts."

    1. A Non e-mouse Silver badge

      Re: worse

      In the days before "cloud", and when Novell were still alive, they had a tool (in beta) which allowed you to run commands on all your NetWare servers at once. It was useful, but then Novell wiped the tool from the face of the planet. I did some asking around, and was told that Novell pulled it because it was too easy to use and too many customers were shooting themselves in the foot with it. I heard tales of people deleting NDS from every server in the tree with the tool. Ouch!

  3. Jim 59

    op/admin

    Joyent says operator. El Reg says administrator. Which is it?

    1. Anonymous Coward

      Re: op/admin

      .. maybe it's a forklift operator inside them datacenters! Happens all the time with telco cables on the street when somebody's digging.

    2. foxyshadis

      Re: op/admin

      You don't consider the BOFH an administrator?

      Well he administers the pints, that's for sure.

    3. Anonymous Coward

      Re: op/admin

      Operator of the mop, or administrator of the bucket - does it matter?

  4. Allan George Dyer
    Joke

    Shock collars?

    He makes it sound like he tried. Guess they should invest in waterproof keyboards that can be operated with flippers.

    1. Anonymous Coward

      Re: Shock collars?

      Patronisingly referring to his staff as "dolphins" is going to make him REALLY popular, I'm sure. No doubt another clueless accounting oik parachuted into the CTO job who has no idea what his tech staff actually do and probably thinks it's all pretty easy and he could do sooo much better himself if only he had enough spare time away from the golf course.

      1. Anonymous Coward

        Re: Shock collars?

        That or he is a proper network tech, and is aware of the in-joke of that remark (which you're obviously not).

      2. Anonymous Coward

        Re: Shock collars?

        > No doubt another clueless accounting oik parachuted into the CTO job

        Hmm, you've no idea who Bryan Cantrill is, have you? He jumped ship to Joyent when Oracle bought Sun, where he was one of the senior Solaris/ZFS designers. You should try and catch one of his presentations one day, he can be a PITA at times but he's far from clueless, and very entertaining.

        1. Anonymous Coward

          Re: Shock collars?

          As an engineer I worked with Bryan. He's one of the smartest and most capable people I've ever seen or known. He is far from being "another clueless accounting oik"; if he were, he would hardly have spoken so frankly about this event.

          1. Jamie Jones Silver badge
            Thumb Up

            Re: Shock collars?

            Well said, anon.

            It's the inept who are used to bluffing their way through life that naturally turn to spin and bullshit.

            Those who actually have a clue are quite open to admit when there's been a cockup, as their reputation isn't based on smoke and mirrors.

      3. bcantrill

        Re: Shock collars?

        (I'm the CTO in question. Normally, I could/would resist being trolled, but given that this constitutes special circumstances, I feel it's appropriate to clarify a few things.)

        First, I was clearly not referring to our own staff as dolphins, but rather that meting out punishment rarely changes behavior. I can assure you that no one internally took this the wrong way.

        Second, as for me being a "clueless accounting oik": to the contrary, I have dedicated my entire career to the understanding and improvement of software operating production systems. Having spent a ton of time on production systems, I have also made my fair share of mistakes -- and (like every human who has done something important that requires precision) have had plenty of near-misses that I only found when I double- or triple-checked my own work. So not only do I not think it's easy, I know it isn't.

        Finally: I hate golf.

        1. Anonymous Coward

          Re: Shock collars?

          Fair enough - I retract my remarks. I'll leave my original post up otherwise no one will know what this thread is about.

        2. Vic

          Re: Shock collars?

          > meting out punishment rarely changes behavior

          I wish you'd been one of my bosses.

          I've been to so many places[1] where the "important" part of a post-mortem is working out who carries the can. It never helps.

          CxOs who understand the need to fix the problem rather than fix the blame seem to be few and far between...

          Vic.

          [1] Usually as a contractor, called in to sort out the cock-up. Thank $deity...

        3. Jamie Jones Silver badge
          Pint

          Re: Shock collars? @Bryan

          I was going to post this as a new message, but seeing you've posted here, I'll reply instead.

          I found your candidness in your response, your openness, and your proposal for moving forward very refreshing.

          If I was a customer, I'd have found it most reassuring.

          Other companies (and politicians!) should note that bullshit and spin and skirting around the issue impresses no-one.

  5. Don Jefe

    Biomed Coverup, Political Intrigue or Rock Band

    I'm pretty sure 'transient availability issues' should be the pivotal soapbox issue of the next presidential election. You can play that so many ways: 'Enhanced border protection has resulted in transient availability issues that have caused a spike in food prices, leaving many to starve', for example.

    It also works for unsanctioned medical experiments, value priced organ transplants and urban hunting outfitters: 'Transient availability issues have slowed field trial results and publication of Stage IV results of REDACTED will not be available until Q4'. 'Transient availability issues have resulted in an unforeseen shortage of viable Human organs. Until inventory levels have stabilized hunting at inner city housing projects will be permitted between 1-5 AM'.

    Also, 'Transient Availability Issues' would be a great band name!

    1. Fatman

      Re: Biomed Coverup, Political Intrigue or Rock Band

      HUH????

      For a moment, I thought I was reading one of 'amanfrommars' posts!

      1. keithpeter Silver badge

        Re: Biomed Coverup, Political Intrigue or Rock Band

        @Fatman

        Google 'gonzo journalism' and HST if you like the style.

        I suspect Mr Jefe would not be allowed within 25 feet of a console capable of controlling more than one running kernel.

        1. Don Jefe

          Re: Biomed Coverup, Political Intrigue or Rock Band

          Furthermore, I'll have you know my console is capable of running large numbers of kernels simultaneously. Long before you lot were messing about with your virtual kernels and manipulating kernels with concentrated RF radiation I was commanding THOUSANDS of kernels simultaneously using naught but alternating current, a dead short and a bunch of fucking butter. So suck it.

        2. Mpeler
          Paris Hilton

          Re: Biomed Coverup, Political Intrigue or Rock Band

          > I suspect Mr Jefe would not be allowed within 25 feet of a console capable of controlling more than one running kernel.

          Ah, me glasses again - I read that as "controlling more than one running kennel"....wondering what kind of transients were meant.....

          It's a dog's life.....

          Paris...her glasses don't seem to be working either....

      2. Don Jefe

        Re: Biomed Coverup, Political Intrigue or Rock Band

        Pah! amanfrommars uses bendy logic and a very special form of punctuation in his posts. I, on the other hand, utilize the wholly correct method of applying various definitions of a word in a context other than that used in the original statement.

        Indeed, others, some from inside asylums for the disturbed, foresaw such intentional mangling of the words of others and even created an alphabetized index of words and their meanings specifically so that context and definitions can be interchanged to great effect. You can make great jokes, create stunning headlines, delight and horrify shareholders or run for public office based solely on understanding what a dictionary is and mastering its use. Dictionary Expansion Packs include Thesaurus, Foreign Language Cross-Indices, The Urban Dictionary and Trade Jargon Indexes.

  6. Hazmoid

    So I suspect that operator will be told subtly that maybe he would be better looking for work somewhere else. Either that or they promote him out of the operator position so he can't do it again :)

    1. Mark 85

      I suspect you may be right, or they will create a position of "corporate scapegoat" and put him in it. Then everything from then on that goes wrong can be blamed on a "fat fingered admin".

    2. Uffish

      Re: don't do it again.

      A colleague of mine used to work for a small company where people had to make quick decisions on the fly. They had a 'mistakes book' where all bad decisions and goofs were recorded. You could make any sort of mistake as long as it wasn't in the book. He said the system worked well.

      1. Anonymous Coward

        Re: don't do it again.

        A new CEO (Marcus Steyn?) of M&S looked at the company's very thick file of accumulated corrective actions. He took it away and returned it slimmed down to a few pages of generic guidelines. He said that the original was too much for most people to have read - therefore its only possible use had been to apportion blame after the event.

      2. Anonymous Coward

        Re: don't do it again.

        "You could make any sort of mistake as long as it wasn't in the book."

        In the evening the work surfaces along the computer suite walls were stacked high with card boxes for the night runs. A development programmer was standing waiting for his dedicated hands-on timeslot. He leaned back against the card trays - and they moved backwards. Unfortunately there was an emergency off mushroom button at that point on the wall.

        As a result - the work surface cabinets were re-arranged to leave a gap in front of the button. A few days later the same programmer was waiting again. He was now wary of leaning against the boxes. So he positioned himself in the convenient gap in the work surface cabinets and leaned back against the wall - and against the emergency off button.

        After that a papertape reel plastic core was taped round the button as a shield - so that only a finger could press it.

    3. Anonymous Coward
      Pint

      Wrong answer!

      Unless it's malicious, you take the person aside and in private figure out what went wrong, and why. Then you gather everyone after everything's back up, publicly thank everyone for their response and do an after-action review. What I wouldn't do is to pass the operator's name around. If anyone's ass is going to be chewed on, it's mine.

      I have to get the recipe exactly right since I have the social graces of a male warthog in mating season. I do know how to react in this one.

  7. skeptical i

    Jheez, poor bastard. :\

    There but for the grace ....

    1. Alan W. Rateliff, II
      Paris Hilton

      Re: Jheez, poor bastard. :\

      ... really go all of us. Even the snotty it-won't-ever-happen-to-me RegTards.

      1. Anonymous Coward

        Re: Jheez, poor bastard. :\

        I'm always doubtful, and I suppose it is a healthy and respectful attitude to have. The trouble is the "mgnt." see that as a weakness - not a strength. So they get in some cocksure punk, and I let him cock it up - just for the LULZ!

    2. Anonymous Coward

      Re: Jheez, poor bastard. :\

      You're not a real sysadmin until you've made a major cockup (e.g. rebooting the Production system rather than the Test one).

      Until then, you're an accident waiting to happen. Afterwards, you have the heightened situational awareness and the little voice in your head that whispers "something's not right" to stop you hitting the wrong button again.

      So Joyent now have an experienced sysadmin. Why would they sack him and bring in another overconfident time-bomb?

      Anon, because "voices in your head" doesn't sound good!

      1. Trygve Henriksen

        Re: Jheez, poor bastard. :\

        Absolutely.

        And the way to see who is a good sysadmin from a bad is if he admits that he screwed up or tries to blame someone else.

        If he blatantly blames someone else, he's a bad one.

        Taking the blame is a good one...

        Successfully laying the blame on Microsoft... Well, then you have a bona-fide Guru on your hands.

        ;-)

        1. Pete 2 Silver badge

          Re: Jheez, poor bastard. :\

          > And the way to see who is a good sysadmin from a bad is if he admits that he screwed up or tries to blame someone else.

          Maybe. But the mark of a truly excellent (ahem!) sysadmin is that he / she gets the problem fixed before anyone else notices.

        2. Fatman

          Re: Jheez, poor bastard. :\

          If he blatantly blames someone else, he's a bad one mangler in training.

          There!

          FTFY!

      2. Vic

        Re: Jheez, poor bastard. :\

        > e.g. rebooting the Production system rather than the Test one

        I've found it useful, in environments where someone might get confused like that, to blacklist shutdown and reboot in the sudoers file.

        It won't stop someone with privilege from rebooting the machine, of course, but it does mean the procedure is slightly different, which prevents accidental reboots. It's remarkably effective...

        Vic.
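The denylist idea described above can be sketched in a few lines. This is only an illustration of the principle (the everyday path is refused, so a reboot needs a deliberate, different procedure); it is not sudoers syntax, and the names `BLOCKED` and `is_blocked` are hypothetical.

```python
import os
import shlex

# Hypothetical denylist; the entries mirror the commands mentioned in the thread.
BLOCKED = frozenset({"shutdown", "reboot", "poweroff"})


def is_blocked(command_line, blocked=BLOCKED):
    """Return True if the program named in command_line is on the denylist."""
    argv = shlex.split(command_line)
    if not argv:
        # An empty command line names no program, so there is nothing to block.
        return False
    # Compare on the basename so "/sbin/reboot" and "reboot" are treated alike.
    return os.path.basename(argv[0]) in blocked
```

A wrapper built on a check like this would refuse `reboot` typed out of habit while leaving some other, more deliberate route open to a privileged user, which is the "slightly different procedure" effect the post describes.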

        1. Anonymous Coward

          Re: Jheez, poor bastard. :\

          > I've found it useful, in environments where someone might get confused like that, to blacklist shutdown and reboot in the sudoers file.

          > It won't stop someone with privilege from rebooting the machine, of course, but it does mean the procedure is slightly different, which prevents accidental reboots. It's remarkably effective...

          You're going to have to explain how someone "accidentally" types shutdown or reboot.

          1. noboard
            Pint

            Re: Jheez, poor bastard. :\

            No they don't, it covers the case where the person wants to shut-down or reboot the test server, but hasn't realised they're on the live server.

            Have a pint for the old reading comprehension :)

          2. Missing Semicolon Silver badge
            Boffin

            Re: Jheez, poor bastard. :\

            ... by being logged in to an SSH session, and not noticing which machine you are connected to.

            I now make sure that every VM I create has a descriptive hostname, so that it appears at the shell prompt (and in the title bar of PuTTY).
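A descriptive hostname helps only if the operator actually reads it. One way to force that, sketched here under the assumption that a wrapper sits in front of the reboot command, is to make the operator type the name of the machine they believe they are on. The function names are hypothetical, and nothing is actually rebooted.

```python
import socket


def short_name(name):
    """Lower-case a host name and drop any domain suffix."""
    return name.strip().lower().split(".")[0]


def confirmed_for_host(typed_name, actual_name):
    """True only if the typed name matches the host the session is really on."""
    return short_name(typed_name) == short_name(actual_name)


def guard_reboot(typed_name):
    """Report what a guarded reboot would do; it never reboots anything."""
    actual = socket.gethostname()
    if confirmed_for_host(typed_name, actual):
        return "confirmed: would reboot " + actual
    return "refused: this session is on " + actual + ", not " + typed_name.strip()
```

An operator who thinks they are on the test box will type the test box's name, and the mismatch stops the command on the live server.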

          3. Anonymous Coward

            Re: Jheez, poor bastard. :\

            >You're going to have to explain how someone "accidentally" types shutdown or reboot.

            I'm guessing you're not a Linux admin? Very simply - tab completion. Say you have powermt installed in 98% of your boxes. You'd probably gain a bad habit of typing power<tab>&<enter> to autocomplete powermt and execute the command.

            Say one day you hit one of those 2% machines that don't have powermt installed or configured in the path. power<tab>&<enter> suddenly becomes power<off>. Machine gone.

            There are lots of examples with other dangerous commands. Personally I'd like to see init, poweroff and reboot etc all require a -y flag by default.
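The "-y flag by default" wish can be approximated with a small wrapper. The sketch below assumes a hypothetical command called `safe-poweroff`; it only reports what it would do, and the real `poweroff` has no such requirement.

```python
import argparse


def build_parser():
    """Parser for a hypothetical safe-poweroff wrapper."""
    parser = argparse.ArgumentParser(prog="safe-poweroff")
    parser.add_argument(
        "-y", "--yes",
        action="store_true",
        help="confirm that this machine really should be powered off",
    )
    return parser


def decide(argv):
    """Return the action the wrapper would take for the given arguments."""
    args = build_parser().parse_args(argv)
    if not args.yes:
        # A bare invocation, e.g. from an unlucky tab completion, is refused.
        return "refused: pass -y to confirm"
    return "confirmed: would power off"
```

In the tab-completion scenario above, the completed command runs with no arguments, so `decide([])` refuses and the machine stays up.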

      3. J P

        Re: Jheez, poor bastard. :\

        Perhaps someone high up at Joyent has themselves had the fat fingered moment of dread, and recognises the value of experience...

        There's a similar thing with discharged bankrupts; they are statistically the least likely people to go bankrupt (yes, again). Despite which, they are also one of the groups that finds it hardest to get bank accounts. The test is of course administered by bankers who have not themselves gone bankrupt.

      4. Captain Scarlet
        Unhappy

        Re: Jheez, poor bastard. :\

        "You're not a real sysadmin until you've made a major cockup (e.g. rebooting the Production system rather than the Test one)"

        I have to confess I shut down the wrong customer server. After being unable to RDP to the server, I went on via a KVM and misread an "o" as an "a" in a list of a few thousand similarly named servers. I was on the phone with people in front of the machine and I misread the name four times before hitting shutdown. I then saw the wrong green dot go red on the network screen and realised what I had done!

      5. Keith Langmead

        Re: Jheez, poor bastard. :\

        Definitely true! Only after experiencing that sinking feeling where it feels like the bottom has just dropped out of your world, and then having to tell your boss what you've done, only then can you truly appreciate axioms like "don't assume, check" and "hope for the best but plan for the worst". Until then they're just words that are impossible to put into proper context.

        1. Vic
          Joke

          Re: Jheez, poor bastard. :\

          > Only after experiencing that sinking feeling where it feels like the bottom has just dropped out of your world

          That's for beginners.

          You know things have gone really wrong as you experience that feeling of the world dropping out of your bottom...

          Vic.

    3. Vic

      Re: Jheez, poor bastard. :\

      > There but for the grace ....

      I once deleted a live /var filesystem while trying to clean up a machine.

      Luckily for me, it was one of my own.

      The box is still in use - and I keep finding orphaned unpackaged files lying around from before the accident...

      Vic.

  8. A Non e-mouse Silver badge
    Pint

    Poor sod...

    I feel sorry for that poor sysadmin. It's going to be a while before they get over this. The only glimmer of hope is that Joyent don't sound like they're going to hang the sysadmin out to dry.

    Have a beer to drown your sorrows.
