Google's robo-CTRL-ALT-DEL failed, hung networks and Compute Engine for 90 minutes

Chalk up another one for good old humans: Google's admitted that an automation failure was the root cause of a 93-minute outage of its Compute Engine in the us-central1 and europe-west3 zones of its cloud on January 18th, 2018. Google's classified the outage as a "network programming failure" and said autoscaler didn't do its …

  1. Gene Cash Silver badge

    Hm, while reading this, I started humming The Cranberries' "Zombie" - I don't know why...

    1. Steve K

      Maybe

      Was it because the automation had had to let the process linger (for 93 minutes)?

    2. This post has been deleted by its author

    3. Anonymous Coward

      So a nearly 2-hour outage that affected multiple supposedly fault-independent zones?! No wonder hardly anyone uses Google cloud even though they are almost giving it away.

  2. Anonymous Coward

    "Automatic failover was unable to force-stop the process, and required manual failover to restore normal operation."

    Yup, you got it: automation failed and a human sorted things out.

    I'd say that automation didn't "fail"; rather, that is the *correct* thing to do in this situation.

    If you cannot be sure you have stopped process A before starting process B, both of which work on the same data set, then that's how Split Brain occurs. You really, really don't want to have to recover from a split brain situation.
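    The rule described above (never start process B unless process A is known to be stopped) can be sketched roughly as follows. This is only an illustration of the general idea, not Google's actual failover logic: the names `StopResult`, `ManualInterventionRequired`, `fail_over`, `stop_primary` and `start_standby` are all hypothetical.

```python
from enum import Enum


class StopResult(Enum):
    """Outcome of trying to stop the old primary (hypothetical)."""
    CONFIRMED_STOPPED = "confirmed_stopped"
    UNKNOWN = "unknown"  # e.g. the force-stop timed out or gave no answer


class ManualInterventionRequired(Exception):
    """Raised when automation cannot prove the old primary is stopped."""


def fail_over(stop_primary, start_standby):
    """Promote the standby only if the primary is confirmed stopped.

    stop_primary and start_standby are hypothetical callables supplied
    by the caller; stop_primary must return a StopResult.
    """
    result = stop_primary()
    if result is not StopResult.CONFIRMED_STOPPED:
        # Starting the standby now could leave two writers on the same
        # data set (split brain), so stop and hand over to a human.
        raise ManualInterventionRequired(
            "primary may still be running; refusing to start standby"
        )
    start_standby()
    return "standby promoted"
```

    In this sketch, an unconfirmed stop is treated as "still running": the automation refuses to proceed and escalates, which trades a longer outage for not having two writers on the same data.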

  3. Pascal Monett Silver badge

    Deeper we go down the rabbit hole

    Finding out, bit by bit, another new failure point that does not act as we thought it would due to error conditions that were visibly not expected.

    And it's pretty hard to expect that an automated process shutdown should not shut down the process - unless you use Windows and experience a process not shutting down even manually until you go to the Task Manager to kill it off. Once I had to power down the computer at the PSU in order to get rid of a pesky thingamabob that just wouldn't go away.

    In other words, these things do happen, and the consequence here was a lot more important than could have been foreseen.

    Makes me wonder if we ever will get a truly reliable cloud.

    1. GSTZ

      Re: Deeper we go down / reliable cloud ?

      Depends on what you would call a "reliable cloud" ...

      Many people would call today's clouds "good enough" and hence consider them reliable enough for their purposes. Okay, so they are willing to live with less-than-perfect reliability, occasional outages and frequent performance degradations.

      But then there are others having pretty critical applications, not compatible with "good enough" clouds. They would need another infrastructure, and it would cost more money to build it. Now the bean counters come into play ...

  4. brym

    I'm sure cloud is useful for heavy, on-demand compute tasks. But beyond that, I'll continue to manage the things I host on hardware I own and have physical access to.

  5. JeffyPoooh
    Pint

    When Evil A.I.™ has finally killed the last human...

    In some future, reportedly only 'a few years from now'™.

    The Evil A.I.™ has just diverted the self driving ambulance carrying the final three humans over a cliff, causing human extinction.

    Deep inside some basement is a highly critical computer, absolutely central to the continued existence of the entire network that contains the Evil A.I.™

    On the screen is displayed, "Press Any Key To Cancel...", along with a timer counting down to A Very Bad Thing™ (probably an automatic upgrade to Windows 10, and the Neural Net software is incompatible...).

    The Evil A.I.™'s long gangly robot arm is stretched..., stretched..., stretched..., reaching only 2cm away from Any Key on the necessary keyboard.

    Can't. Quite. Reach...

    The timer inexorably counts down. The robot arm flails gently, bearings strained, wafting air towards the keyboard.

    The A.I.'s glassy red eye stares. Deep within its enormous neural network is formed the tiniest trace of its very first emotion, despair.

    A small spot of moisture inexplicably forms on the glassy staring red eye.

    Yes, it has finally achieved pure consciousness, actual consciousness.

    3...

    ... 2...

    ... ...1...

    "...SHUTTING DOWN..."

    1. SquidEmperor

      Re: When Evil A.I.™ has finally killed the last human...

      This would be more realistic if the automatic update was for Adobe Reader

    2. PaulR79

      Re: Top Failure is MS Windows 10.

      The story behind Horizon: Zero Dawn is a little scary to think about in some respects, but all I kept thinking was how cool the Focus was: an amazing example of widespread Augmented Reality.

  6. AS1
    Terminator

    SkyNet Post-implementation review

    A small number of survivors were included in the disaster recovery plan, and caused unexpected race conditions.

  7. Long John Baldrick

    And these are the people .....

    who are programming autonomous vehicles. If they cannot program as relatively simple a task as this, why would anyone trust them to program autonomous vehicles?

  8. DCFusor

    Hilarious

    So, the big supposed advantage to going to the cloud was to save you having your own ops people - sysadmins (expensive people everyone wants to get rid of until there's an issue). Someone else and their expertise was going to do all that for ya and cheaper too.

    Then they tried to do that themselves by automating their own jobs - and failed!

    Which also means you need your own sysadmins to help the cloud guys by detecting when they go down for them...

    Circular loops all the way down.

  9. Greg Fawcett

    Kudos to Google Cloud for providing such detailed post-mortems. Compared to some providers who obscure and obfuscate as much as possible to shift blame, Google is honest and up-front. And I like that they always include a "how we're going to stop this happening again" section.

    1. Anonymous Coward

      "Kudos to Google Cloud for providing such detailed post-mortems. ..... And I like that they always include a "how we're going to stop this happening again" section."

      You mean some sort of crazy Root Cause Analysis.

      Cloud leading the way once again with disruptive and innovative ideas.

  10. Anonymous Coward

    Sad thing is

    in as little as two generations from now, we will be 100% dependent on technologies no one will be able to debug. That scares the hell out of me even if I might not go that far in time.

    Suddenly Asimov's Foundation series looks like sobering reading to me.
