'Prodigy' chip moonshot gets hand from Arm CPU guru Prof Steve Furber

Silicon design startup Tachyum has appointed the original designer of the Arm CPU to its advisory board and chucked its hat into the ring to provide the home-grown chip for the EU's planned exascale supercomputer.

  1. Nate Amsden

    reminds me of itanium

    "[..]Power efficiencies are gained by moving out-of-order execution capability to software, Danilak said. “All the register rename, checkpointing, seeking, retiring, which is consuming majority of the power, is basically gone, replaced with simple hardware. All the smartness of out-of-order execution was put to compiler."

    and then from wikipedia

    https://en.wikipedia.org/wiki/Itanium

    "[..]With EPIC, the compiler determines in advance which instructions can be executed at the same time, so the microprocessor simply executes the instructions and does not need elaborate mechanisms to determine which instructions to execute in parallel. The goal of this approach is twofold: to enable deeper inspection of the code at compile time to identify additional opportunities for parallel execution, and to simplify processor design and reduce energy consumption by eliminating the need for runtime scheduling circuitry. "
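None of this is in the article, but the EPIC idea quoted above — let the compiler find the parallelism in advance — can be sketched in a few lines of C. The two functions below (names mine, for illustration) compute the same sum; the second breaks the dependency chain into four independent accumulators, the kind of structure a compile-time scheduler can provably issue in parallel without any out-of-order hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Naive sum: every add depends on the previous one, so even a very
 * wide in-order machine can only retire one add per step of the chain. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators break the chain.  A compile-time
 * scheduler (the EPIC/VLIW bet) can prove the four adds independent
 * and bundle them into one issue group -- no runtime re-ordering
 * circuitry needed to find the parallelism. */
double sum_ilp4(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```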

    1. Anonymous Coward

      Re: reminds me of itanium

      Similar comments for the DEC Alpha too. However, for a CPU designed for HPC/numerical computing the behavior of the code should be more regular and amenable to software optimization. There has also been a lot of progress in compilers since the Itanium was first released. The Itanium itself performed very well on HPC code.

      1. DCFusor

        Re: reminds me of itanium

        Nate and AC - me too.

        If you _can_, putting the smarts in the compiler is clearly the best way as you compile once (or a few times during debugging) and run forever or some approximation of that.

        And indeed, _most_ supercomputing tasks are better suited to this kind of approach than the usual mix, else making a big machine out of a bunch of little ones wouldn't be at all practical.

        Yet it is the glory of software to "if()" and do something different "if some condition is now true". Else you could simulate it with a fixed numerical calculator entirely. You might be able to specify that yes, it takes this branch the overwhelming majority of times for at least things like error checking, and of course do the IO outside the main array (mostly) but....

        It's sad that most attempts at this have failed in some way, including some early work I did on VLIW CPU design, with an assembler that simply "pulled upwards" bus and ALU control bits that didn't cross a line of dependency, until as many of the subdivisions of the CPU as possible were doing all they could on every cycle. For situations with no pipeline stalls, this was killer-good. But then you need software engineers who actually understand the computer part of the problem, and don't find all their answers on Stack Overflow. It seemed easier to build a computer to the available talent than to scout for the ones who would be the best in the end.

        And in the end, memory latency killed you anyway...

  2. Anonymous Coward

    Normally I would discount such a thing out of hand

    But they have a pretty impressive lineup of people involved, so while I'm still skeptical of such grand claims at least it isn't simply a scam.

  3. HmmmYes

    Designing processors and instruction sets is hard.

    Sure, the first steps go quickly - oh, this instruction will do this.

    Then you throw stuff out, put stuff in.

    You corrupt the nice simple isa with special cases.

    That cheaper and faster chip turns out slower and more expensive. And investors want something now!

  4. mics39
    Trollface

    But will it run ...

    Spectre/Meltdown?

  5. John Smith 19 Gold badge
    Unhappy

    Actually I was thinking of

    The Transmeta Crusoe

    Here's the thing.

    HPC is all about numerical computing

    Now (in theory) porting most of the code means re-compiling it with the FORTRAN whatever compiler you have on your architecture.

    But

    Too cute a re-ordering and your end user's carefully crafted high-level numerical algorithm turns to rubbish. Quickly produced rubbish results are still rubbish.

    TBH I think a lot of this is bo***cks.

    Furber knew when they designed ARM it was all about the DRAM latency and it still is.

    You're talking to the on-chip cache. What happens when it misses? What happens when (if you have one) an L2 cache access misses too? Because it will happen.

    I'd say a big part of serious HPC design is an n-way memory system, lining up the data & code within those rows, and keeping the row shifts as infrequent (and overlapped) as possible. And even then, what is the current standard? 8ns flat out, when the processor could be clocking at 0.5ns?
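To illustrate the locality point above (a toy sketch, nothing from Tachyum's actual design): the two C functions below do identical arithmetic, but one walks memory in storage order — consecutive accesses land in the same cache line and the same open DRAM row — while the other jumps a full row's worth of bytes on every access.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Walks the matrix in the order it sits in memory: consecutive
 * accesses hit the same cache line / open DRAM row, so row switches
 * are rare. */
long sum_row_major(const int m[ROWS][COLS]) {
    long s = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            s += m[r][c];
    return s;
}

/* Same arithmetic, but every access jumps COLS * sizeof(int) bytes,
 * touching a different line (and often a different DRAM row) each
 * time -- identical result, far worse memory behaviour. */
long sum_col_major(const int m[ROWS][COLS]) {
    long s = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            s += m[r][c];
    return s;
}
```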

    As for power consumption, we've known the clock drivers are the biggest power sinks on any processor, and going clockless is the way to eliminate most of them.

    A modern DRAM row could hold the address space of an entire 8-bit processor (laid out as a bit stream on one chip). So yes, code density is still a thing.

    We know that theoretical compiler technology has improved over the last few decades but how much of that has actually been used?

    It's not like there aren't alternatives to x86 already available, like the open source SPARC ISAs.

    This has a huge mountain to climb

    1. Destroy All Monsters Silver badge

      Re: Actually I was thinking of

      "Too cute a re-ordering and your end user's carefully crafted high level numerical algorithm turns to rubbish. Quickly produced rubbish results are still rubbish."

      I don't understand this. There are people "carefully crafting" memory access operations implied by FORTRAN loops?

      Are they completely nuts? Do the compilers actually DO anything?

  6. John Smith 19 Gold badge
    Unhappy

    "And in the end, memory latency killed you anyway..."

    This.

    DRAM is the dominant memory storage technology everywhere.

    And for people who want maximum speed all the time, it has had the same issues since the first DRAM chips rolled off the line around 1970.

    The absolute numbers have changed but the ratios have not. It's about how smart you are in working around those limitations.

    1. DCFusor

      Re: "And in the end, memory latency killed you anyway..."

      Yup, for example just look in your bios...see the latency (sum of clocks) for an access. Now, remember those clocks are slower than the CPU ones, on top.

      Of course, it's why we have and continue to develop cache. But look again at the issues with a level 1 miss...and a level 2 miss...and a level 3 miss which adds cycles (to discover the miss) to the first one above.

      At first, needing a single byte from a memory line meant fetching that entire line before the CPU was allowed to proceed. I believe that sort of idiocy is fixed these days, but there is no "just get what you need and move on to the next needed thing" as far as I know, so it just pushes the problem a little.
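The "memory latency kills you" pattern is easiest to see with dependent loads. A hypothetical sketch (not tied to any particular chip): each pointer load below supplies the address for the next, so the misses cannot be overlapped and every one pays the full round trip to DRAM — the access pattern no amount of cache width fixes.

```c
#include <assert.h>
#include <stddef.h>

/* Pad each node to roughly one 64-byte cache line, so every hop is a
 * fresh line. */
struct node { struct node *next; long pad[7]; };

/* Each load's address comes from the previous load, so the CPU cannot
 * overlap the misses: this chain serialises on memory latency, which
 * is exactly the case the comment above is about. */
size_t chase(const struct node *p) {
    size_t hops = 0;
    while (p) {
        p = p->next;
        hops++;
    }
    return hops;
}

/* Build an n-node singly linked list.  (Laid out in order here; a real
 * latency benchmark would shuffle the nodes to defeat the prefetchers.) */
struct node *build_chain(struct node *pool, size_t n) {
    for (size_t i = 0; i + 1 < n; i++)
        pool[i].next = &pool[i + 1];
    pool[n - 1].next = NULL;
    return &pool[0];
}
```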

      Most people don't understand that size has its own set of issues. Carries in an ALU take longer to propagate, address decodes take extra time and so on, the bigger things get, and it's not always linear.

      I note it was pointed out above that yeah, the wrong kind of optimization might fuddle your carefully constructed code like the stuff in numerical recipes for avoiding some of the more egregious rounding errors with floats and so on. But that kind of thing isn't hard to avoid, it'd be a beginner mistake these days.

      1. John Smith 19 Gold badge
        Unhappy

        "But that kind of thing isn't hard to avoid, it'd be a beginner mistake these days."

        You seem to be thinking this is an issue for a developer.

        What I was talking about was code mis-shuffling by the compiler. AFAIK the best advice remains "Develop clean code and let the compiler do the optimizations, then measure where it's really running slow, then optimize that." But again, optimizing numerical code can have very nasty consequences.
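A concrete instance of the "very nasty consequences" above (my example, not the poster's): Kahan compensated summation, the classic numerical-recipes trick for recovering rounding error. Under strict IEEE semantics it works; a compiler allowed to treat float addition as associative will simplify the compensation term away and silently turn it back into naive summation.

```c
#include <assert.h>
#include <stddef.h>

/* Kahan compensated summation: the (t - s) - x dance recovers the
 * low-order bits each addition drops.  A reordering optimizer (or
 * -ffast-math) that assumes float addition is associative can prove
 * c is always "zero" and delete it -- exactly the kind of compiler
 * mis-shuffling being discussed. */
float kahan_sum(const float *a, size_t n) {
    float s = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float x = a[i] - c;
        float t = s + x;
        c = (t - s) - x;   /* the bits that s + x dropped */
        s = t;
    }
    return s;
}

/* Naive summation for comparison: small addends vanish against a
 * large running total. */
float naive_sum(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```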

        " I believe that sort of idiocy is fixed these days, "

        I'd expect anyone deeply enough into being concerned about this to know if it was still an issue. :-(

  7. abufrejoval

    I was thinking MIPS and Mill/Belt architecture

    My first reaction to moving smarts from hardware to software is "MIPS learned the hard way".

    And when it comes to a promising new architecture fulfilling the aforementioned goals, the last rather exciting thing I saw was the Mill Computing belt architecture (https://millcomputing.com) fabulously expounded by Ivan Godard in sessions easy to find on YouTube.

  8. Børge Nøst

    Wait, what?

    Once more? This was _very_ thin on tech explanation of why these guys should have cracked it.

    I know Mill Computing has basically said much of this, but if you follow what they have released you see that they have a lot of tech stuff to facilitate their goals.

    1. WorBlux

      Not quite like the mill.

      Yes, this is radically different from the Mill; it's somewhat reflective of the old Multiflow processors in some ways and close to Xeon Phi in others.

      Best I can tell, the compiler is scheduling out of order (loads before branches) and explicitly marking results with branch/poison flags so they can be revoked on a mispredict. Rename is largely unneeded because the compiler is aware of the speculation and can colour registers with it in mind; retire stations and re-order buffers go for similar reasons.

      Bundles are variable length but contain instructions that can issue together (like the Mill in this respect, though the bundles have stricter alignment requirements than the Mill's). Another similarity to the Mill may be the vector mask registers, which if used correctly may help vectorize more code than CPUs without them can manage. However, the Mill still looks a lot wider, and presents better opportunities for software pipelining.
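A rough illustration of the mask-register point (code and names mine, not Tachyum's ISA): a loop with a per-element `if` can only stay in vector code if lanes can be predicated. With mask support the compiler rewrites the branch as a per-lane select, so the whole loop vectorizes; without it, the branch forces scalar code.

```c
#include <assert.h>
#include <stddef.h>

/* Branchy per-element logic.  With vector mask registers this lowers
 * to: mask = in[i] < 0;  out[i] = select(mask, 0.0f, in[i]) -- one
 * masked vector op per chunk, no branch per element.  Without masks,
 * the conditional blocks vectorization entirely. */
void clamp_negative(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (in[i] < 0.0f) ? 0.0f : in[i];
}
```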

      I also suspect they have some sort of hyper-block thing going on that lets you dispatch to a different part of the code (that you presumably already have data for) on a cache miss + stall. This is hard to do in the compiler, but scientific code should play pretty well with it. (A sort of micro-threading.)

      Further things to note: DDR5 and PCIe 5, which on their own should bring power and performance uplifts.

      Data from below the L2 crosses a mesh to get to the core, which is a known low-power config option. There's only one big and expensive data line around the whole processor group.

      Additionally the main ALUs are massively wide and support vector and matrix ops that fill the whole width, which is a win on code that can actually use it.

  9. Herring`

    CPU complexity

    Back at school, it was 6502 assembler - about 10 instructions* and four of them were the BCD stuff nobody used. I have a manual of 486 instructions and even that's mad. I dread to think how big the instruction set is now. What is it all for? Also get off my lawn.

    *Exaggeration for comic effect - although I have seen a 6502 instruction reference that will print out legibly on a single sheet of A3.

    1. Christian Berger

      Re: CPU complexity

      Well there is the idea of "data-flow" based computing where you have lots of simple processors, each one with its own RAM and fast links to the others. The idea was to split up your problem into many small parts and make those run as part of a pipeline.

      The problem is that this is conceptually very different from a PDP-11, so C(++) code won't run efficiently on such a machine. The solution back then was to use specialized languages like Star-LISP.

      Of course today with the dominance of Unixoid systems, shell scripts could actually be a solution. Just let each of the C-based utilities run on its own core, only communicating via pipes with the others.
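The pipes-as-dataflow idea can be sketched in C directly (a toy two-stage pipeline; the function name is mine). Each process is one stage, the pipe is the only coupling between them, and the kernel is free to schedule the stages on different cores — which is the whole point of the shell-script analogy.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/wait.h>
#include <unistd.h>

/* Two-stage Unix pipeline: the child process doubles each number it
 * reads from one pipe, the parent feeds it work and sums the results
 * from the other.  Fine for small batches as written; a real pipeline
 * would interleave reads and writes to avoid filling the pipe buffers. */
long pipeline_sum(const long *in, size_t n) {
    int to_child[2], to_parent[2];
    if (pipe(to_child) || pipe(to_parent))
        return -1;
    pid_t pid = fork();
    if (pid == 0) {                        /* child: the "doubling" stage */
        close(to_child[1]);
        close(to_parent[0]);
        long v;
        while (read(to_child[0], &v, sizeof v) == sizeof v) {
            v *= 2;
            write(to_parent[1], &v, sizeof v);
        }
        close(to_child[0]);
        close(to_parent[1]);
        _exit(0);
    }
    close(to_child[0]);
    close(to_parent[1]);
    for (size_t i = 0; i < n; i++)         /* parent: feed the stage */
        write(to_child[1], &in[i], sizeof in[i]);
    close(to_child[1]);                    /* EOF lets the child finish */
    long v, sum = 0;
    while (read(to_parent[0], &v, sizeof v) == sizeof v)
        sum += v;
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    return sum;
}
```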

      1. Herring`

        Re: CPU complexity

        Way back, I did do some stuff on Transputers (I have a T800 card somewhere at home). You can see how message based stuff like Occam could work in a multi-core world. It does require a lot more thought though - although not as much as making regular code thread-safe without crippling the performance.

  10. Measurer
    Mushroom

    Is it me...

    Or does the name 'Tachyum' conjure up images of a slightly absent minded starship captain saying things like 'blast them with the Tachyum thingy ray gun'!

  11. . 3

    Transactional memory

    Possibly the biggest new thing here is proper hardware support for transactional memory. It will take all the guesswork out of inter-thread data sharing and will make putting memory barriers into code a thing of the past, together with the horrible caching penalties those can incur. The last few generations of Xeons have had some support for it baked in but it's currently disabled by the microcode for some reason.

    There's not much language support for TM yet, though there are draft C and C++ extensions for it. Only GCC supports those so far (via -fgnu-tm), falling back to a software emulation library which is dog slow, especially on ARM.
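For flavour, here is the optimistic read-compute-commit-retry pattern that TM hardware generalises, written with plain C11 atomics (a portable stand-in of mine, not GCC's actual syntax, which wraps a block in `__transaction_atomic` under -fgnu-tm).

```c
#include <assert.h>
#include <stdatomic.h>

/* Optimistic concurrency in miniature: read the current value, compute
 * the update privately, then try to commit with a single
 * compare-and-swap.  If another thread committed first, the CAS fails,
 * `seen` is refreshed, and we retry -- essentially what hardware TM
 * does for arbitrary read/write sets, with no hand-placed barriers. */
void tm_style_add(_Atomic long *counter, long delta) {
    long seen = atomic_load_explicit(counter, memory_order_relaxed);
    for (;;) {
        long want = seen + delta;          /* the "transaction body" */
        if (atomic_compare_exchange_weak_explicit(
                counter, &seen, want,
                memory_order_acq_rel, memory_order_relaxed))
            return;                        /* commit succeeded */
        /* seen now holds the value that beat us; retry the body */
    }
}
```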

  12. Tom Paine

    Ethernet?

    Each chip will have two 400 Gigabit Ethernet ports.

    Shurely shome mishtake?

  13. Anonymous Coward
    Anonymous Coward

    Reinventing the wheel

    ... and all for the sake of technical sovereignty.

    Not much they’re proposing is original. It’s foolish to believe that failed ideas of the past like EPIC/VLIW for the datacenter will work only because they were not tried properly.

    Just use American tech and get on with your day. Add value where you CAN and stop reinventing the wheel.

    This would be a misguided waste of EU tax payer money. The fact that they can propose this kind of stuff tells you why the EU is in the $hit$

