To heck with the laws of physics... we will squeeze more juice from these processors

Dratted laws of physics. Cranking up frequencies is difficult due to leakages from ever smaller guard rails on the electron highways inside the processor. You have to jack up the power to make sure the instructions make it through, which leads to thermal problems. Martin Hilgeman, HPC consultant with Dell EMC, gave a tour de …

  1. John Smith 19 Gold badge
    Unhappy

    "requently encounter data sets with very sparse matrices."

    For which much more efficient methods exist than storing them in the obvious n-dimensional array.
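
    A minimal sketch of one such method (CSR, compressed sparse row, via scipy, with made-up sizes — the commenter doesn't name a particular scheme): store only the non-zero entries and their coordinates instead of the full array.

    ```python
    # A minimal sketch, assuming scipy is available and using made-up sizes:
    # keep only the non-zeros and their coordinates (CSR form) rather than
    # the obvious dense 2-D array.
    import numpy as np
    from scipy import sparse

    n, nnz = 10_000, 100_000                        # 10k x 10k, ~0.1% non-zero
    rng = np.random.default_rng(0)
    rows = rng.integers(0, n, size=nnz)
    cols = rng.integers(0, n, size=nnz)
    vals = rng.random(nnz)

    m = sparse.csr_matrix((vals, (rows, cols)), shape=(n, n))

    dense_bytes = n * n * 8                         # what a float64 array would cost
    csr_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
    print(f"dense: {dense_bytes/1e6:.0f} MB, CSR: {csr_bytes/1e6:.1f} MB")

    x = rng.random(n)
    y = m @ x                                       # matrix-vector product only touches non-zeros
    ```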

    In fact, in big apps with multiple matrices being processed together, the first task is an optimization pass to find a better order in which to multiply them.

    This can make a very substantial difference to how much processing is done.

    1. Anonymous Coward
      Anonymous Coward

      Re: "requently encounter data sets with very sparse matrices."

      Unless, of course, you don't KNOW your matrix has a lot of zeroes in it. Doesn't it take processing power again to figure out your matrix has a lot of zeroes in it in the first place?

      1. John Smith 19 Gold badge
        Unhappy

        "Doesn't it take processing power again..matrix has a lot of zeroes in it in the first place?"

        Depends on the class of problem you're working on.

        Some problems are inherently sparse; for others you'd need to scan the arrays to find out.

        Some perspective from "Algorithms" by R. Sedgewick (Ch 42. Linear Programming). Multiplying 6 matrices together required 274200 multiplications in one ordering and 6024 in another ordering. None had more than 4 rows or 3 columns.

        If you're doing serious matrix work, some prep work before you start multiplying things out, even just choosing the order in which to multiply the matrices, makes very big savings.
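
        The ordering itself can be found cheaply before any multiplying is done. Below is a minimal sketch of the classic matrix-chain dynamic programme; the dimensions are made up for illustration, not Sedgewick's actual example.

        ```python
        # A minimal sketch of the classic matrix-chain-ordering DP: given the
        # chain's dimensions, find the cheapest parenthesisation before doing
        # any actual multiplication. Dimensions below are hypothetical.
        def matrix_chain_cost(dims):
            """dims[i], dims[i+1] are the rows/cols of matrix i; returns the
            minimum number of scalar multiplications over all orderings."""
            n = len(dims) - 1                      # number of matrices in the chain
            cost = [[0] * n for _ in range(n)]     # cost[i][j]: best cost for Ai..Aj
            for span in range(1, n):
                for i in range(n - span):
                    j = i + span
                    cost[i][j] = min(
                        cost[i][k] + cost[k + 1][j]
                        + dims[i] * dims[k + 1] * dims[j + 1]
                        for k in range(i, j)
                    )
            return cost[0][n - 1]

        # Six small matrices, e.g. 4x3, 3x4, 4x3, ... (made-up sizes).
        dims = [4, 3, 4, 3, 4, 3, 4]
        print(matrix_chain_cost(dims))   # minimum scalar-multiplication count
        ```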

  2. Anonymous Coward
    Anonymous Coward

    With non-managed, non-VM languages and less bloated APIs...

    ... we could probably squeeze twice the performance out of the current generation. But some coders just can't cope without their hand-holding managed VM languages and/or scripting languages.

  3. John Smith 19 Gold badge
    Unhappy

    "we could probably squeeze twice the performance out of the current generation. "

    Depends.

    Classic big matrix-mashing apps are still coded in FORTRAN, which is normally fully compiled and quite good.

    Theoretically JIT compilers allow development in "managed" languages while the code can still run much closer to the raw hardware if the option is engaged.

    The best advice is still to code the app as simply as possible, then profile it to find out what's really taking the time. History has taught me the answer to this question is normally "Not where you think it is." IOW any time spent on "tricky" coding up front is probably in the wrong place. As a bonus it may have stopped the translator from spotting optimizations it would have applied to a straightforward version, and therefore given you a slower program (as well as one that was harder to write and harder to understand if someone else has to debug or upgrade it).
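
    A minimal sketch of "write it simply, then measure": let the profiler tell you where the time actually goes before deciding what to optimise. The function name and workload here are hypothetical.

    ```python
    # Write the straightforward version first, then profile it. cProfile's
    # output shows which call actually dominates the run time.
    import cProfile

    def summarise(values):
        squares = [v * v for v in values]                  # suspect number one...
        pairs = [a + b for a in values for b in values]    # ...but this is the real hot spot
        return sum(squares), len(pairs)

    cProfile.run("summarise(list(range(2000)))", sort="cumulative")
    ```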

    1. Anonymous Coward
      Anonymous Coward

      Re: "we could probably squeeze twice the performance out of the current generation. "

      "Theoretically JIT compilers allow development in "managed" languages but can be coded much closer to the raw hardware if the option is engaged."

      True, but there's still overhead even with JIT. Eg: boundary checking, garbage collection.

      "IOW any time spent on "tricky" coding up front is probably in the wrong place"

      Sometimes, not always. Obviously if a program is I/O bound then no amount of fancy coding is going to significantly speed it up, but if it's CPU bound then you can work with the compiler and profiler to tightly optimise the relevant part of the code.

      1. John Smith 19 Gold badge
        Unhappy

        @Boltar

        "True, but there's still overhead even with JIT. Eg: boundary checking, garbage collection."

        It depends whether or not those ease-of-use/reliability-improving features outweigh the performance hit having them causes. For some the performance hit will be too great; for others it will be acceptable, outweighed by the time they would otherwise spend hunting down an intermittent bug that causes array accesses to go haywire.

        ""IOW any time spent on "tricky" coding up front is probably in the wrong place"

        Sometimes, not always. Obviously if a program is I/O bound then no amount of fancy coding is going to significantly speed it up, but if it's CPU bound then you can work with the compiler and profiler to tightly optimise the relevant part of the code."

        I was actually talking about people who alter their coding plan based on what they think will be slow, without profiling their code first.

        The experience of developers from D. E. Knuth to Steve McConnell is that the bits they thought would be slow, when they did profile them, turned out not to be. The bottleneck was never where they expected it to be. It was somewhere else.

        What you're talking about is re-coding after you've found the hot spots with a profiler, which is best practice. Implement as simply as possible (to make sure the results are correct) to begin with and then optimise those parts that will make a serious difference, the proverbial 80/20 rule (or in some of Knuth's work the 95/5 rule, IE most of the run time was swallowed by just 5% of the code).

        BTW there are many places for a program to have bottlenecks. Looking at various programs it seems the biggest speed-up comes from stepping back and deciding if the basic algorithm is right for the job. Changing that seems to have the biggest influence, but only after you've profiled the basic version.

        1. Alan Brown Silver badge

          Re: @Boltar

          "Obviously if a program is I/O bound then no amount of fancy coding is going to significatnly speed it up"

          If the I/O bottleneck is caused by trying to work on too much data in the first pass - most of which you're then going to throw away - then fancy coding (as in changing the order you work on things) makes a huge difference.

          This is one of the primary reasons for optimizing database joins and selects. Get it right and you can see speedups of 100x or more.
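
          A toy in-memory illustration of that "select first, join later" point, with made-up tables rather than a real query planner: pushing the filter below the join means the join only ever touches the rows you actually want.

          ```python
          # Hypothetical tables; the point is the ordering of the work, not
          # database internals. Filtering customers first keeps the join small.
          orders = [{"order": i, "cust": i % 1_000, "total": i * 1.5}
                    for i in range(100_000)]
          customers = [{"cust": c, "region": "EU" if c % 4 else "US"}
                       for c in range(1_000)]

          # Select first: keep only US customers, then join against that small set.
          us = {c["cust"]: c for c in customers if c["region"] == "US"}
          joined = [(o, us[o["cust"]]) for o in orders if o["cust"] in us]
          print(len(joined))          # far fewer rows flow through the join
          ```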

  4. Anonymous Coward
    Anonymous Coward

    I think the author will find the issue has been Intel playing 'Happy Monopolies' all along.

    AMD's recent news shows us a great deal of bandwidth is on the way.

    1. Anonymous Coward
      Anonymous Coward

      Physical limitations vs economic ones.

      Yes, to an extent. There are physical/mathematical limitations and tradeoffs. However, there should be a little more room before we hit the physical requirement of going optical, quantum or magnetic in the processor substance.

      Say when your entire mobile is "on a chip", so you don't even have all those PCBs in it, and it's the size of the Apple Watch or smaller. I assume that could be done today, but you'd have to pay for the slice of silicon and the fact it's custom to one device/design only (no scavenging the chips for other devices). It's expense that stops us doing 99% of the work on a single chip (and in a similar way stops us all going 100% SSD etc).

      Being able to use generic RAM/memory chips, generic power regulation etc is just more economical right now than doing a separate chip for every single compute design imaginable.

  5. Alistair
    Holmes

    Ummm boys and girls

    The real problem, as Hilgeman points out, is memory bandwidth per core.

    ... Something I've been saying since the CoreDuo......

    1. John Smith 19 Gold badge

      ""The real problem, as Hilgeman points out, is memory bandwidth per core.""

      Well sort of.

      In principle you hook each processor up to its own block of RAM and problem solved.

      But IRL most significant problems have to share some data (at some point) between processors. I don't think even SIMD systems are immune to this. IOW it's all about "contention" and how you deal with it. Essentially it comes down to two options.

      a) Single copy of data item. Every processor that wants it forms a queue.

      Fine if they are all reading it, but if they are reading and writing then the final result could depend on the order in which the processors access the data.

      So maybe you lock out all further writes until all processors (who are not writing this location) report they have now processed the new datum, at which point the next write happens.

      I have no idea how to make this happen, and the datum could be anything from a single word up to a whole data structure. (There's a toy sketch of the single-copy, queue-on-a-lock idea after point b below.)

      b) Multiple copies IE each processor has a cache.

      But how do you efficiently inform all the other processors (who may have mapped different parts of the main memory to their caches) which part of main memory you've updated and that they should update their copies as well?
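
      As promised above, a toy illustration of option (a): one copy of the datum, with every writer queueing on a lock. This uses pure-Python threads, so it only shows the shape of the contention, not real multi-core memory behaviour.

      ```python
      # Single shared copy; each thread queues on the lock before updating it.
      import threading

      shared = {"value": 0}
      lock = threading.Lock()

      def worker(iterations):
          for _ in range(iterations):
              with lock:                    # each thread waits its turn here
                  shared["value"] += 1      # the one and only copy gets updated

      threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
      for t in threads:
          t.start()
      for t in threads:
          t.join()
      print(shared["value"])                # 40000: correct, but fully serialised
      ```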

      The Transputer architecture still looks pretty good at handling these problems: separate memory spaces, a good mix of stack and local-register architecture, hardware scheduling with a two-level system, internal DMA with multiple channels (and in principle the ability to virtualise the channels).

      Too bad it never got a decent MMU.

  6. This post has been deleted by its author

  7. Pat Harkin

    I think I understand...

    ...we have to compute smarter, not harder.

  8. Alan Brown Silver badge

    The issue isn't just bandwidth

    Memory latency has barely changed in the last 20 years. When a processor sends out for data and can spend thousands/millions of cycles waiting for it to actually arrive, there's an obvious area for improvement if this can be solved.

    (DDRn means you can get more words in per request, but the ~60ns latency for random requests hasn't changed. There are lower latencies if you can get words from an adjacent row or in sequential order, etc, but the REAL chances of this happening in a multiuser system are slim to negligible.)
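
    A rough, machine-dependent illustration of the latency point (assuming numpy is available; exact figures will vary): summing a big array in order versus gathering it through a random permutation. Most of the gap is cache misses on the random walk.

    ```python
    # Sequential walk vs random walk over an array much bigger than cache.
    import time
    import numpy as np

    n = 1 << 24                                    # ~16M int64s; shrink on small machines
    data = np.arange(n, dtype=np.int64)
    orderings = {
        "sequential": np.arange(n),
        "random": np.random.permutation(n),
    }

    for name, idx in orderings.items():
        t0 = time.perf_counter()
        total = int(data[idx].sum())               # gather then reduce
        print(f"{name}: {time.perf_counter() - t0:.3f}s (sum={total})")
    ```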

    Yes, profiling and the 95/5 rule still apply, but this is one of those problems where solving it would result in across-the-board improvements in performance - and it could also result in simpler processors. A large chunk of the support logic (and power consumption) inside a modern CPU core is dedicated to trying to predict what addresses will be asked for next and having the data ready before the ALU asks for it. This kind of predictive prefetching and pipelining isn't terribly successful at predictions (usually about 30% at best) but it's still better than not trying at all, although longer pipelines looking further ahead aren't the answer - Netburst proved that.

  9. John Smith 19 Gold badge
    Unhappy

    "one of those problems that solving would result in across the board improvements"

    Agreed.

    There seems to be a deep disconnect between the algorithm writers' view and the hardware people's. DRAM's problems have not substantially changed. You take a big hit when you cross row boundaries, and data alignment is difficult to control. Likewise once you've got that row being output you find it's not that fast (that said, Samsung says their latest are toggling at 9GHz, which does seem fast).

    For actual parallel algorithms what I think is needed is something like the hardware equivalent of the "Publish & Subscribe" model, but I'm not sure if it should be in terms of the data or the program PoV.

    Somehow a program indicates "I want to know if this data item (which is not in my local address space) has changed so I can use it." Likewise the program needs a mechanism so that when it writes a new value of a data item that others have requested, they (and only they) get a copy of the new value. How do you retro-fit that to the shedload of legacy code out there?
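
    A toy software analogue of that publish-and-subscribe idea (hypothetical names, nothing to do with any real hardware design): readers register an interest in a named data item, and only they get told when it changes.

    ```python
    # Readers subscribe to a data item; a write pushes the new value only to them.
    class DataItem:
        def __init__(self, value=None):
            self.value = value
            self._subscribers = []          # callbacks from interested readers

        def subscribe(self, callback):
            self._subscribers.append(callback)

        def publish(self, new_value):
            self.value = new_value
            for notify in self._subscribers:
                notify(new_value)           # only subscribers get a copy pushed

    cell = DataItem(0.0)
    cell.subscribe(lambda v: print("worker A sees", v))
    cell.publish(3.14)                      # worker A is told; nobody else cares
    ```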

    I'm sure this has been proposed repeatedly, but I've never seen it done because it's a monumental PITA to implement fast enough in hardware, when a "data item" could literally be anything from the smallest addressable unit of memory up to a very large record (you'd want some way to say "I want to know when a matrix element changes, not the whole matrix"; that's a given).

    Maybe a "Harvard" architecture with separate data and program spaces? A large shared "smart" (how many ways is that word overloaded?) data store? So every write is a data write and it's a question of figuring out which processors need to be told about it

    The problem remains. You want a single big block of memory you can hand out to whatever processor needs it, however much it needs (so you can accommodate that huge model, even if the code to process it, running on your army of processors, is quite small), but you don't want all the delays you'll get with contention.

    I don't really believe you can have that total flexibility with multiple processors and have maximum performance. Directly linking a chunk of memory to a processor limits your maximum code size but guarantees no contention. I believe you can have better, but you can't have it all.
