Cool
Being able to say "My CPU is a neural-net processor; a learning computer" makes me want one even more.
Samsung has revealed the blueprints to its mystery M1 processor cores at the heart of its S7 and S7 Edge smartphones. International versions of the top-end Android mobiles, which went on sale in March, sport a 14nm FinFET Exynos 8890 system-on-chip that has four standard 1.6GHz ARM Cortex-A53 cores and four M1 cores running at …
"Your Exynos only makes my penis harder!"
Seriously, I would not have considered a neural network used in branch prediction ... but in retrospect why the hell not?
OTOH, what are the features that enter into the (clearly supervised) learning phase? When is the learning over, and when is the network reset (likely not whenever another process gets scheduled...)?
Sooooo... it's a chip that's capable of creating an emotional and physical response in a user. Looking to see if there wasn't a Dr Noonian Soong involved in its development - one small patch of skin on my arm has goosebumps in anticipation.
Except it's not. Fake AI jargon* for a type of computer architecture vaguely inspired by biological neural networks, which are only partly understood.
* Almost all AI jargon has no relationship to the real biological world, it's designed to get grants. market expert systems (really human curated databases with interfaces based on simple parsing, written by humans. No real "self learning" or adaptation).
Hang on a minute, what's going on here?
Instruction decode? Branch prediction? It's as if someone has decided that ARM is a CISC instruction set all of a sudden and needs to be re-implemented. But ARM is already RISC (very RISCy in fact), and even the 64bit version needs only 48,000ish transistors to implement.
How can it be better to add all that rename, decode and microcode nonsense on top? That's surely going to be a good demonstration of the law of diminishing returns. Wouldn't it be better simply to use all those extra transistors as extra cache (which is always useful), or a whole extra core, instead?
3W at 2+ GHz and not quicker than a competing design at single core performance? Well I think that about answers it. I don't know what Apple have done, but I'd not heard that they (or anyone else) had gone down the same microcode route.
Neural nets for branch prediction? Well, why not I suppose, but from a pure CPU design point of view isn't it a kind of surrender? It's a bit like saying "we don't know how to do this properly" and deciding to build something that cannot be mathematically analysed instead and hoping it's better. That's fine if the result is good...
It does mean that this is useless for hard real-time applications. Branch execution time is now impossible to predict.
Whether CISC or RISC (a useless distinction nowadays, how about simply "ISC"), there need to be instruction pipelines. Otherwise how are you going to keep the various elements of the chip busy? Indeed, pipelines first appeared on RISCs precisely to be able to issue an instruction at every clock cycle: RISC pipeline.
OTOH, this pipeline seems very deep. Recovering from a bad branch (i.e. emptying the pipeline, then refilling it) will take a few cycles.
Well, it would probably take a few hours to study the design in detail. Not my area now..
It does mean that this is useless for hard real-time applications. Branch execution time is now impossible to predict.
Well, "hard real-time" is still in the ms range, right, a few orders of magnitude slower? I don't think this is going to matter.
@Destroy All Monsters,
"Whether CISC or RISC (a useless distinction nowadays, how about simply "ISC"),"
It kinda does matter these days. A consequence of Intel's CISC-RISC translation of x86 to microcode is that there's an awful lot of transistors needed to do that (and everything else that goes with it). Transistors need power, and this was one of the contributors to Intel's best effort at a mobile x86 processor falling short of ARMs on power consumption.
It does mean that this is useless for hard real-time applications. Branch execution time is now impossible to predict.
This is an M1, i.e. from ARM's M (for Microcontroller) range. If predictable execution times matter to you, you'd use their R (real-time) range. (For completeness, the A range are Application processors.)
BTW it's an acrostic:
A = Application
R = Realtime
M = Microcontroller
This is not ARM's only outward-facing pun: the architecture is described in the Architecture Reference Manual.
"This is an M1, i.e. from ARM's M (for Microcontroller) range."
No, this is Samsung's naming system and not related to ARM's microcontrollers.
ARM already have a Cortex-M1 core: http://www.arm.com/products/processors/cortex-m/cortex-m1.php
Pop quiz question: Why is there no ARM Cortex-M2?
@AC,
This is an M1, i.e. from ARM's M (for Microcontroller) range.
It most decidedly is not a microcontroller. Since when did a microcontroller have cache, TLBs, an MMU, and all the other gubbins needed by an application CPU?
This M of Samsung's is nothing to do with ARM's use of M to denote a core intended for use in a microcontroller.
If predictable execution times matter to you,
They matter a lot if you're doing VR. Sloppy latency on scene calculations is a good way of inducing motion sickness. Given that everyone is getting into VR these days this might end up being of concern.
They matter a lot if you're doing VR. Sloppy latency on scene calculations is a good way of inducing motion sickness. Given that everyone is getting into VR these days this might end up being of concern.
I don't understand this at all. VR and we are talking sub-microsecond realtime arrival times ... ON THE FSCKING CPU (yes, the CPU, not the graphics pipeline)
Not going slightly overboard here? And I mean hanging on a 15 meter outrigger slightly overboard?
"Since when did a microcontroller have cache, TLBs, an MMU, and all the other gubbins needed by an application CPU."
Microcontrollers have had these features, and more (e.g. graphics accelerators), for many years. It is a microcontroller if it has built-in peripherals to control things.
And a stupid naming choice by Samsung is already causing confusion.
These are Samsung M1 cores, not ARM Cortex-M1 cores. Apparently the Samsung M1 is a full ARMv8-A processor, not an ARMv7-M.
I was puzzled initially why a CPU with an M1 was a big deal, but the 3W power consumption, clock speed and instruction set made it clear it was not talking about the Cortex-M1 but rather some other M1. Very annoying naming choice.
I hate Allwinner's chip names too, with calling everything A# where a lot of them end up matching the Cortex-A# models while not being those, of course. Of course Apple is doing it too, although at the rate they are counting up, I think ARM will be long past any given number before Apple gets to it.
Spectacularly misinformed post...
The vast majority of high-performance ARM processors - including Apple's - use all the features you're bitching about. Branch prediction is basically an absolute necessity for any high-performance design - high clock requires a long pipeline; without a branch predictor, a bubble is created in the pipeline which leads to a stall during branch resolution. This is a major performance issue, and one that a branch predictor with high accuracy resolves. As for your comment about real-time applications, a worst-case time is not impossible to predict; microarchitectures have documented branch mispredict recovery times, usually on the order of 10-20 cyc. This, by the way, is basically no less deterministic than cores with caches, which you seem to have no problem advocating for - if a load hits cache, it might take 5 cyc to complete; if it misses cache and hits main memory, it might take 150 cyc.
Decode/microcode: Decode doesn't mean what you think it is; it's an essential part of any CPU design, RISC or CISC, as decode controls things like "what functional unit does this op go to?" and "what operands does this op use?" Microcode was mentioned nowhere. I suspect you're confusing use of micro-ops - ie, internal basic operations in a long fixed-length format - with microcode, ie lookup of certain complex operations in a microcode ROM at decode time. The first does not imply the second. Most fast processors have a complex decoder for operations that are more efficient to break into 2-3 uops, and this doesn't hit microcode. The M1 core may or may not have microcode - since it doesn't mention a ucode engine in the decode slides, and it wasn't mentioned in the presentation (I was there) I suspect it does not. Even in ARM there are ops that can be beneficial to crack into multiple uops - reg+reg addressing for instance (one uop for the reg+reg calculation, one uop for the load/store.) There are even more examples in other RISC ISA's - take a look at the manual for a modern PowerPC core, for instance, and check out the number of ops that are cracked or microcoded!
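The reg+reg cracking Dusk describes can be sketched in a few lines. This is purely illustrative — the instruction and uop formats here are invented for the example and have nothing to do with the M1's actual decode logic:

```python
# Illustrative sketch only: cracking a reg+reg addressed load into two
# micro-ops at decode time. The instruction/uop dictionaries are an
# invented toy format, not any real ISA or microarchitecture.

def decode(insn):
    """Return the list of micro-ops for one instruction."""
    if insn["op"] == "LDR" and insn.get("addr_mode") == "reg+reg":
        # Crack into an address-generation uop and a simple load uop.
        return [
            {"uop": "ADD",  "dst": "tmp",       "srcs": [insn["base"], insn["index"]]},
            {"uop": "LOAD", "dst": insn["dst"], "srcs": ["tmp"]},
        ]
    # Simple instructions map 1:1 onto a single uop -- no microcode ROM involved.
    return [{"uop": insn["op"], "dst": insn.get("dst"), "srcs": insn.get("srcs", [])}]

uops = decode({"op": "LDR", "addr_mode": "reg+reg",
               "dst": "x0", "base": "x1", "index": "x2"})
print([u["uop"] for u in uops])  # ['ADD', 'LOAD']
```

Note the cracking decision is made by ordinary decode logic, not by a lookup in a microcode ROM — which is exactly the uops-vs-microcode distinction being drawn above.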
As for out-of-order execution, it's an extremely helpful technique for exposing memory-level parallelism (by, for instance, continuing to run code during a cache miss) for surprisingly little additional overhead. Additionally, it takes the number of architectural registers out of the equation by renaming them onto a broad set of physical registers - as a result, in an OoO machine, architectural register count is almost never a hard limitation; false dependencies are eliminated and instructions run when their operands become available, not when a previous non-dependency operation completes so its scratch register can be used. This can improve power efficiency at a given performance target, because an in-order machine generally has to clock higher to get the same level of performance.
Again, Apple does these things too - they have an aggressively out-of-order machine with branch prediction and register renaming (in fact, more aggressively out-of-order than the M1 in the article!) http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/4 has a nice summary of Apple's current uarch.
Please do more research before making this kind of post...
Spectacularly misinformed post...
That's as may be. But there's some mere mortals out here, and I'd really like to understand the relationship between silicon designers and ARM. Are they designing software, or hardware? How do the bits all fit together?
I know it is a complex topic, extending from the purely physical realm through to the dirty world of applications, but the Reg has done some stunning journalism explaining complicated sh** in (for example) the world of storage, maybe they could address the parallel world of mobile and low power processing?
Please? Pretty please? I know the staff don't read this. But some of YOU WHO KNOW could write something in plain English and submit it as an article, and get some beer money?
@Dusk,
"Decode/microcode: Decode doesn't mean what you think it is; it's an essential part of any CPU design, RISC or CISC, as decode controls things like "what functional unit does this op go to?" and "what operands does this op use?" Microcode was mentioned nowhere. I suspect you're confusing use of micro-ops - ie, internal basic operations in a long fixed-length format - with microcode, ie lookup of certain complex operations in a microcode ROM at decode time. "
Ha! Yes, you're quite right of course. I've read the article in haste. Ta!
Though I'd like to note that I wasn't dissing the value of out-of-order execution, pipelines, etc.
@ Destroy All Monsters,
"I don't understand this at all. VR and we are talking sub-microsecond realtime arrival times ... ON THE FSCKING CPU (yes, the CPU, not the graphics pipeline)
Not going slightly overboard here? And I mean hanging on a 15 meter outrigger slightly overboard?"
Going overboard? Quite possibly.
Having spent many a year developing hard real-time systems, I yearn for dependable execution times, something that seems to be going out of fashion fast. I hate having to deal with CPUs that don't run code in predictable times. To do large (e.g. 50+ CPUs) real-time systems these days is a real pain in the arse - all that variation in latency starts to accumulate and we can't quite max out a collection of CPUs like we used to be able to. Intel chips are truly horrible in this regard, but there's not a lot out there to touch them when it comes to average performance, so it's hard not to use them.
It does mean that this is useless for hard real-time applications. Branch execution time is now impossible to predict.
I wouldn't go that far; worst-case branch time is still predictable, so it's still possible to do hard real-time.
Whether it's worth doing so is another matter...
Vic.
My MSc thesis was on building hyperdimensional mathematical maps of how a neural network learns and how they 'remember' key attributes. Even back in 2005 there were a lot of papers for me to draw upon for my own work.
If your neural network is a 'black box' you're doing it wrong.
The tech here is basically the same principle as Autocorrect (loosely), isn't it? And we all know how successful that can and cannot be sometimes (for those who don't, see autocorrect.com, v'funny). I guess here if it gets it wrong it throws it away, or will we start to see new websites where I asked it to do 'x' and it did 'y', with funny/deadly circumstances...
...or has this whole thing become stupidly over-complicated?
Back in the 1980s, the RISC chips of the time got around the branching problem with features like branch delay slots (MIPS) and predicated instructions (ARM).
I can't help feeling that perhaps there is a much simpler solution to this than dedicating ever more transistors to hugely complicated algorithms that could instead be used for the operations that the programmer actually intended.
Perhaps some radically new, but elegant type of CPU design is needed.
Actually this is the way to go. I assume theoretically it is simpler...
Just the in-between is more complex: the bit where we integrate existing designs with neural-network-like branch prediction and/or code execution.
When we get good at it, or the price comes down, or just the use of the design scales up, we will see lots of systems that adjust automatically for the task. I suppose GPUs already do this with their pixel pipelines and programmable shader cores etc (I am no expert so may have misunderstood?).
Branch prediction is actually based on the idea of branch delay slots. With delay slots, every time you have a branch, you get a bunch of wasted CPU cycles while you wait for the branch to be evaluated and calculated.
Since you're going to waste that time anyway, you might as well take a guess and run one of the two possible branches. If you pick right, then you get a nice boost in performance over just delay slots, and if you pick wrong, you're just back where you were before. So, even something stupid like always predict the branch won't be taken can get you good results. But, if 50% accuracy is good, why not try for something better? IIRC, some of these perceptron-based branch prediction schemes can get around 95% accuracy. That ends up being a ton of CPU cycles that would have otherwise just been discarded. Also means you can do better at fetching instructions from memory, so you don't get as many stalls there either.
Predicated instructions aren't really much different. They really just replace the complexity of branches with a read-after-write hazard. Same result, you end up waiting for the condition to be evaluated before you can start executing the conditional instruction.
So, branch prediction does add some complexity, especially if you want a high accuracy predictor, but the result is that it saves CPU cycles. And the more stages you have in your pipeline, the more cycles it saves. So, if you want your CPU to be as fast as possible, then you're going to want to throw away as few cycles on mispredicted branches as possible.
TL;DR: branch delay slots and predicated instructions are simple, but slow, and we want fast.
An 8-core chip at 2+ GHz seems a bit overkill for a damn phone... I would think that focusing on reducing the requirements of the software would be a far better investment (at the very least, you can significantly increase battery life).
I grew up in the *nix world where a full-featured OS with a basic office suite would ship on a couple of floppies. Now you have projects that are trying to do that with a CD-ROM and are considered 'over-ambitious'. In the past 10 years Linux has gone from a single CD with dozens of useful packages (including a useful browser and OpenOffice) to requiring a full DVD just for the OS.
It's no surprise that there are security holes in mobile OSes and phones barely last a day per charge when they strain under the 8 GB of OS code lumbering along.
Phones are expected to do FULL FAT office suites, 3D gaming with lots of graphics and math, and other high performance jobs while simultaneously keeping up with both mobile and WiFi networks all while on a battery. And the customer is always right or they'll go to LG. So what do you do?
What if you had 3 versions of each branch instruction:
Branch back most of the time;
Branch back 50/50;
Branch back almost never.
Depending on the software loop you're writing you'll know which kind of branch instruction to write; they really are all the same logic. It's just that if you write the first one, 'branch back most of the time', then the predictor already knows the prediction: branch back. Same with the third one, branch back almost never (it would predict not to branch back then). The only time active prediction is needed is with the middle, 50/50 instruction.
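That scheme can be sketched directly. This assumes a hypothetical ISA with hinted branches (no real ISA's encoding is implied), with a plain 2-bit saturating counter standing in for the dynamic predictor that only the 50/50 flavour needs:

```python
# Sketch of the three-flavour hinted-branch idea above (hypothetical
# ISA, invented hint names). Only the "50/50" flavour consumes dynamic
# predictor state -- here, a single 2-bit saturating counter.

counter = 2   # 2-bit saturating counter: 0..3, >= 2 means predict taken

def predict(hint):
    if hint == "mostly":   # "branch back most of the time"
        return True        # statically predict taken -- no table entry needed
    if hint == "rarely":   # "branch back almost never"
        return False       # statically predict not taken
    return counter >= 2    # "50/50": fall back to the dynamic predictor

def train(hint, taken):
    global counter
    if hint == "5050":     # only the 50/50 flavour updates dynamic state
        counter = min(3, counter + 1) if taken else max(0, counter - 1)

print(predict("mostly"))   # True
print(predict("rarely"))   # False
```

One caveat worth noting: several real ISAs have offered static branch hints, and hardware designers have tended to drop them over time because good dynamic predictors beat the programmer's (or compiler's) static guess — so this trades predictor transistors for accuracy, not strictly a win.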