Boffins warn LIMPWARE takes the pleasure out of cloud

Computer science boffins from the USA have come up with a lovely term to describe under-performing hardware: “Limpware”. The term's not just for fun, but actually has some pretty serious implications for cloud computing because the boffins have run tests suggesting just one limp node in a cloudy cluster can lower performance …

COMMENTS

This topic is closed for new posts.
  1. Trevor_Pott Gold badge

    From my understanding of the issues discussed in these papers, this is the sort of thing that the folks at Cloudphysics have set out to identify. I wonder if group A has been introduced to group B? Sounds like they are thinking along the same lines...

    1. Roo

      It's nice to see folks trying to quantify this stuff; I just hope they're doing something new rather than repeating the distributed systems research done in the 50s/60s/70s/80s.

      I suspect they are simply repeating research because they think the word 'Cloud' somehow changes all the rules of distributed computing... Either way I'm sure they'll be rewarded for a new buzzword that will give warm fuzzies to ignorant salesmen, fanbois and execs.

  2. M Gale
    Coat

    NIC card?

    Is that like a PIN number?

    Redundancy department of redundancy?

    Ok, ok, I'm going.

    1. Anonymous Coward
      Anonymous Coward

      Re: NIC card?

      Yeah, they use them to connect to their local LAN.

  3. codeusirae
    Facepalm

    The CLOUD brought down by a single NIC card?

    "A third paper, Impact of Limpware on HDFS: A Probabilistic Estimation (PDF) offers a detailed analysis of how a single limplocked component, in this case a single NIC card, can greatly degrade the performance of a Hadoop cluster. The paper also shows that Hadoop can't detect the under-performing NIC and therefore doesn't fail over to another."

    One would have thought that the people building the CLOUD would have designed in such failure detection from the beginning. What effect would failure of component X have on system-wide performance, etc.?

    1. Don Jefe

      Re: The CLOUD brought down by a single NIC card?

      Designing for problems you don't know exist or do not understand is how you end up broke with a shitty product. You have to put things into production to identify how to improve them. Without that study and understanding you're doing no more than guessing.

      1. Roo

        Re: The CLOUD brought down by a single NIC card?

        "Designing for problems you don't know exist or do not understand is how you end up broke with a shitty product. You have to put things into production to identify how to improve them. Without that study and understanding you're doing no more than guessing."

        Agreed. But equally this class of problem is very old hat. It really should not be a surprise to anyone.

    2. Roo

      Re: The CLOUD brought down by a single NIC card?

      "One would have thought that the people building the CLOUD would have designed in such failure detection from the beginning. What effect would failure of component X have on the system-wide performance etc."

      Detecting sub-optimal performance can be tricky. In the example given, the NIC appears to still be passing traffic, so it hasn't failed as such - it's just slow. Perhaps the sink for the data isn't keeping up so flow-control is throttling the data rate, or perhaps the auto-negotiation is picking the wrong value, or maybe segment congestion is killing the throughput, or it could be starved of memory bandwidth, etc.

      If you choose to apply a simple threshold, what value do you pick? How do you account for the averaging effect of legitimate idle periods or segment congestion on the measured throughput?

      Then if you decide to blacklist that component instead of tolerating its degraded performance, what will happen when you redirect that traffic via another set of components? Sometimes (actually quite often in practice) fail-over can cause components to degrade or fail because they are suddenly deluged with extra work.

      Sometimes fail-over is a very costly process in itself (state transfer, sync, etc.), so the time and space resources expended during the fail-over can actually outweigh the potential savings from blacklisting a degraded component.
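
      To make the trade-off above concrete, a minimal Python sketch follows. Everything in it (the EWMA smoothing, the 10% floor, the strike count, the cost comparison) is an illustrative assumption, not how Hadoop or the papers actually detect limpware.

      ```python
      # A sketch of the thresholding problem above, using made-up names and
      # numbers: flag a link only when an exponentially weighted moving average
      # (EWMA) of its throughput stays below a floor for many consecutive busy
      # samples, so idle periods and short congestion bursts don't count.
      from dataclasses import dataclass

      @dataclass
      class LimpDetector:
          expected_mbps: float            # nominal link throughput
          floor_fraction: float = 0.1     # "limp" means below 10% of nominal
          alpha: float = 0.2              # EWMA smoothing factor
          required_strikes: int = 30      # consecutive low samples before flagging
          ewma: float = None
          strikes: int = 0

          def observe(self, sample_mbps: float, link_idle: bool) -> bool:
              """Feed one throughput sample; return True once the link looks limp."""
              if link_idle:
                  return False            # legitimate idle time is not evidence
              self.ewma = (sample_mbps if self.ewma is None
                           else self.alpha * sample_mbps + (1 - self.alpha) * self.ewma)
              if self.ewma < self.floor_fraction * self.expected_mbps:
                  self.strikes += 1
              else:
                  self.strikes = 0
              return self.strikes >= self.required_strikes

      def should_blacklist(degradation_cost: float, failover_cost: float) -> bool:
          """Blacklist only if tolerating the slow link costs more than failing over."""
          return degradation_cost > failover_cost

      if __name__ == "__main__":
          det = LimpDetector(expected_mbps=1000.0)
          # A NIC limping along at ~1 Mbps on a gigabit link:
          for _ in range(40):
              if det.observe(sample_mbps=1.0, link_idle=False):
                  print("link flagged as limpware")
                  break
      ```

      Even this toy version exposes the two judgement calls made above: the floor and strike count are arbitrary, and the decision to fail over needs a separate cost comparison entirely.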

  4. John Smith 19 Gold badge
    Unhappy

    So we need some kind of "node profiler" tool?

    I guess up to now most HPC systems have been built by assuming that identical hardware == identical performance.

    A reasonable idea.

    But wrong.

    Except I can't help thinking that things like Tivoli and what was CA Unicenter were meant to have tools like that a decade ago.
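
    A "node profiler" along those lines doesn't need to be elaborate. As a minimal sketch (the benchmark, node names and 50% cut-off are all made-up assumptions, not features of Tivoli, Unicenter or the papers): run the same micro-benchmark on every node and flag anything scoring well below the fleet median, on the theory that identical hardware should give roughly identical results.

    ```python
    # Toy node profiler: score each node with a crude micro-benchmark, then
    # flag nodes whose score falls far below the fleet median.
    import statistics
    import time

    def micro_benchmark() -> float:
        """Crude CPU/memory score: bytes shuffled per second (higher is better)."""
        start = time.perf_counter()
        buf = bytearray(8 * 1024 * 1024)
        for _ in range(20):
            buf = bytearray(reversed(buf))
        return len(buf) * 20 / (time.perf_counter() - start)

    def find_limp_nodes(scores: dict[str, float], cutoff: float = 0.5) -> list[str]:
        """Return nodes scoring below `cutoff` times the fleet median."""
        median = statistics.median(scores.values())
        return [node for node, score in scores.items() if score < cutoff * median]

    if __name__ == "__main__":
        # In a real cluster these scores would be gathered from every node
        # (and per component: NIC, disk, memory); here they are invented
        # purely to show the shape of the output.
        scores = {"node-a": 9.8e8, "node-b": 1.0e9, "node-c": 2.1e8}
        print(find_limp_nodes(scores))   # -> ['node-c']
    ```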
