Boffins warn LIMPWARE takes the pleasure out of cloud

Computer science boffins from the USA have come up with a lovely term to describe under-performing hardware: “Limpware”. The term's not just for fun, but actually has some pretty serious implications for cloud computing because the boffins have run tests suggesting just one limp node in a cloudy cluster can lower performance …

COMMENTS

This topic is closed for new posts.
  1. Trevor_Pott Gold badge

    From my understanding of the issues discussed in these papers, this is the sort of thing that the folks at Cloudphysics have set out to identify. I wonder if group A has been introduced to group B? Sounds like they are thinking along the same lines...

    1. Roo

      It's nice to see folks trying to quantify this stuff; I just hope they're doing something new rather than repeating the distributed systems research done in the 50s/60s/70s/80s.

      I suspect they are simply repeating research because they think the word 'Cloud' somehow changes all the rules of distributed computing... Either way I'm sure they'll be rewarded for a new buzzword that will give warm fuzzies to ignorant salesmen, fanbois and execs.

  2. M Gale
    Coat

    NIC card?

    Is that like a PIN number?

    Redundancy department of redundancy?

    Ok, ok, I'm going.

    1. Anonymous Coward
      Anonymous Coward

      Re: NIC card?

      Yeah, they use them to connect to their local LAN.

  3. codeusirae
    Facepalm

    The CLOUD brought down by a single NIC card?

    "A third paper, Impact of Limpware on HDFS: A Probabilistic Estimation (PDF) offers a detailed analysis of how a single limplocked component, in this case a single NIC card, can greatly degrade the performance of a Hadoop cluster. The paper also shows that Hadoop can't detect the under-performing NIC and therefore doesn't fail over to another."

    One would have thought that the people building the CLOUD would have designed in such failure detection from the beginning. What effect would failure of component X have on system-wide performance, etc.?

    1. Don Jefe

      Re: The CLOUD brought down by a single NIC card?

      Designing for problems you don't know exist or do not understand is how you end up broke with a shitty product. You have to put things into production to identify how to improve them. Without that study and understanding you're doing no more than guessing.

      1. Roo

        Re: The CLOUD brought down by a single NIC card?

        "Designing for problems you don't know exist or do not understand is how you end up broke with a shitty product. You have to put things into production to identify how to improve them. Without that study and understanding you're doing no more than guessing."

        Agreed. But equally this class of problem is very old hat. It really should not be a surprise to anyone.

    2. Roo

      Re: The CLOUD brought down by a single NIC card?

      "One would have thought that the people building the CLOUD would have designed in such failure detection from the beginning. What effect would failure of component X have on the system-wide performance etc."

      Detecting sub-optimal performance can be tricky. In the example given, the NIC appears to still be passing traffic, so it hasn't failed as such - it's just slow. Perhaps the sink for the data isn't keeping up so flow-control is throttling the data rate, or perhaps the auto-negotiation is picking the wrong value, or maybe segment congestion is killing the throughput, or it could be starved of memory bandwidth, etc.

      If you choose to apply a simple threshold, what value do you pick? How do you account for the averaging effect of legitimate idle periods or segment congestion on the measured throughput?

      Then if you decide to blacklist that component instead of tolerating its degraded performance, what will happen when you redirect that traffic via another set of components? Sometimes (actually quite often in practice) fail-over can cause components to degrade or fail because they are suddenly deluged with extra work.

      Sometimes fail-over is a very costly process in itself (state transfer, sync, etc.), so the time and space resources expended during the fail-over can actually outweigh the potential savings from blacklisting a degraded component.
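
      To make the trade-off above concrete, a minimal Python sketch follows. Everything in it (the EWMA smoothing, the 10% floor, the strike count, the cost comparison) is an illustrative assumption, not how Hadoop or the papers actually detect limpware.

      ```python
      # A sketch of the thresholding problem above, using made-up names and
      # numbers: flag a link only when an exponentially weighted moving average
      # (EWMA) of its throughput stays below a floor for many consecutive busy
      # samples, so idle periods and short congestion bursts don't count.
      from dataclasses import dataclass

      @dataclass
      class LimpDetector:
          expected_mbps: float            # nominal link throughput
          floor_fraction: float = 0.1     # "limp" means below 10% of nominal
          alpha: float = 0.2              # EWMA smoothing factor
          required_strikes: int = 30      # consecutive low samples before flagging
          ewma: float = None
          strikes: int = 0

          def observe(self, sample_mbps: float, link_idle: bool) -> bool:
              """Feed one throughput sample; return True once the link looks limp."""
              if link_idle:
                  return False            # legitimate idle time is not evidence
              self.ewma = (sample_mbps if self.ewma is None
                           else self.alpha * sample_mbps + (1 - self.alpha) * self.ewma)
              if self.ewma < self.floor_fraction * self.expected_mbps:
                  self.strikes += 1
              else:
                  self.strikes = 0
              return self.strikes >= self.required_strikes

      def should_blacklist(degradation_cost: float, failover_cost: float) -> bool:
          """Blacklist only if tolerating the slow link costs more than failing over."""
          return degradation_cost > failover_cost

      if __name__ == "__main__":
          det = LimpDetector(expected_mbps=1000.0)
          # A NIC limping along at ~1 Mbps on a gigabit link:
          for _ in range(40):
              if det.observe(sample_mbps=1.0, link_idle=False):
                  print("link flagged as limpware")
                  break
      ```

      Even this toy version exposes the two judgement calls made above: the floor and strike count are arbitrary, and the decision to fail over needs a separate cost comparison entirely.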

  4. John Smith 19 Gold badge
    Unhappy

    So we need some kind of "node profiler" tool?

    I guess up to now most HPC systems have been built by assuming that identical hardware == identical performance.

    A reasonable idea.

    But wrong.

    Except I can't help thinking that things like Tivoli and what was CA Unicenter were meant to have tools like that a decade ago.
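
    A "node profiler" along those lines doesn't need to be elaborate. As a minimal sketch (the benchmark, node names and 50% cut-off are all made-up assumptions, not features of Tivoli, Unicenter or the papers): run the same micro-benchmark on every node and flag anything scoring well below the fleet median, on the theory that identical hardware should give roughly identical results.

    ```python
    # Toy node profiler: score each node with a crude micro-benchmark, then
    # flag nodes whose score falls far below the fleet median.
    import statistics
    import time

    def micro_benchmark() -> float:
        """Crude CPU/memory score: bytes shuffled per second (higher is better)."""
        start = time.perf_counter()
        buf = bytearray(8 * 1024 * 1024)
        for _ in range(20):
            buf = bytearray(reversed(buf))
        return len(buf) * 20 / (time.perf_counter() - start)

    def find_limp_nodes(scores: dict[str, float], cutoff: float = 0.5) -> list[str]:
        """Return nodes scoring below `cutoff` times the fleet median."""
        median = statistics.median(scores.values())
        return [node for node, score in scores.items() if score < cutoff * median]

    if __name__ == "__main__":
        # In a real cluster these scores would be gathered from every node
        # (and per component: NIC, disk, memory); here they are invented
        # purely to show the shape of the output.
        scores = {"node-a": 9.8e8, "node-b": 1.0e9, "node-c": 2.1e8}
        print(find_limp_nodes(scores))   # -> ['node-c']
    ```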
