'Neural network' spotted deep inside Samsung's Galaxy S7 silicon brain

Samsung has revealed the blueprints to its mystery M1 processor cores at the heart of its S7 and S7 Edge smartphones. International versions of the top-end Android mobiles, which went on sale in March, sport a 14nm FinFET Exynos 8890 system-on-chip that has four standard 1.6GHz ARM Cortex-A53 cores and four M1 cores running at …

  1. Anomalous Croissant
    Terminator

    Cool

    Being able to say "My CPU is a neural-net processor; a learning computer" makes me want one even more.

    1. Destroy All Monsters Silver badge
      Paris Hilton

      Re: Cool

      "Your Exynos only makes my penis harder!"

      Seriously, I would not have considered a neural network used in branch prediction ... but in retrospect why the hell not?

      OTOH, what are the features that enter into the (clearly supervised) learning phase? When is the learning over, and when is the network reset (likely not whenever another process gets scheduled...)?

      1. MrT

        Re: Cool

        Sooooo... it's a chip that's capable of creating an emotional and physical response in a user. Looking to see if there wasn't a Dr Noonian Soong involved in its development - one small patch of skin on my arm has goosebumps in anticipation.

        1. getHandle

          Re: Cool

          I was thinking more Cyberdyne Systems, but then I remembered it was Samsung we're talking about ;-)

      2. Anonymous Coward
        Paris Hilton

        Re: Cool

        I'd have thought resets would occur whenever there's a shift in load? With BEEEEELIONS of teachable moments occurring every second, learning could be replete in far less time than the meatware can perceive.

        1. Dan 55 Silver badge

          Re: Cool

          Imagine making a neural net and torturing it with Samsung bloatware. 100 years from now, if we're still around, it'll probably be illegal.

    2. Mage Silver badge
      Facepalm

      Re: Cool, neural network

      Except it's not. Fake AI jargon* for a type of computer architecture vaguely inspired by biological neural networks, which are only partly understood.

      * Almost all AI jargon has no relationship to the real biological world; it's designed to get grants and to market expert systems (really human-curated databases with interfaces based on simple parsing, written by humans; no real "self-learning" or adaptation).

  2. bazza Silver badge

    Most Surprised

    Hang on a minute, what's going on here?

    Instruction decode? Branch prediction? It's as if someone has decided that ARM is a CISC instruction set all of a sudden and needs to be re-implemented. But ARM is already RISC (very RISCy in fact), and even the 64bit version needs only 48,000ish transistors to implement.

    How can it be better to add all that rename, decode and microcode nonsense on top? That's surely going to be a good demonstration of the law of diminishing returns. Wouldn't it be better simply to use all those extra transistors as extra cache (which is always useful), or a whole extra core, instead?

    3W at 2+ GHz and not quicker than a competing design at single core performance? Well I think that about answers it. I don't know what Apple have done, but I'd not heard that they (or anyone else) had gone down the same microcode route.

    Neural nets for branch prediction? Well, why not I suppose, but from a pure CPU design point of view isn't it a kind of surrender? It's a bit like saying "we don't know how to do this properly" and deciding to build something that cannot be mathematically analysed instead and hoping it's better. That's fine if the result is good...

    It does mean that this is useless for hard real-time applications. Branch execution time is now impossible to predict.

    1. chris 17 Silver badge

      Re: Most Surprised

      @bazza

      It may not be all that useful today as you claim, but maybe successive improved versions will prove to provide that extra performance and battery boost over their rivals they are looking for.

    2. Destroy All Monsters Silver badge
      Holmes

      Re: Most Surprised

      Whether CISC or RISC (a useless distinction nowadays, how about simply "ISC"), there need to be instruction pipelines. Otherwise how are you going to keep the various elements of the chip busy? Indeed, pipelines appeared on RISCs in the first place to be able to issue an instruction at every clock cycle: the classic RISC pipeline.

      OTOH, this pipeline seems very deep. Recovering from a bad branch (i.e. emptying the pipeline, then refilling it) will take a few cycles.

      Well, it would probably take a few hours to study the design in detail. Not my area now..

      "It does mean that this is useless for hard real-time applications. Branch execution time is now impossible to predict."

      Well, "hard real-time" is still in the ms range, right, a few orders of magnitude slower? I don't think this is going to matter.

      1. bazza Silver badge

        Re: Most Surprised

        @Destroy All Monsters,

        "Whether CISC or RISC (a useless distinction nowadays, how about simply "ISC"),"

        It kinda does matter these days. A consequence of Intel's CISC-RISC translation of x86 to microcode is that there are an awful lot of transistors needed to do that (and everything else that goes with it). Transistors need power, and this was one of the contributors to Intel's best effort at a mobile x86 processor falling short of ARM's on power consumption.

    3. Anonymous Coward
      Anonymous Coward

      Re: Most Surprised

      "It does mean that this is useless for hard real-time applications. Branch execution time is now impossible to predict."

      This is an M1, i.e. from ARM's M (for Microcontroller) range. If predictable execution times matter to you, you'd use their R (real-time) range. (For completeness, the A range are Application processors.)

      BTW it's an acrostic:

      A = Application

      R = Realtime

      M = Microcontroller

      This is not ARM's only outward-facing pun: the architecture is described in the Architecture Reference Manual.

      1. Ed 13
        FAIL

        Re: Most Surprised

        "This is an M1, i.e. from ARM's M (for Microcontroller) range."

        No, this is Samsung's naming system and not related to ARM's microcontrollers.

        ARM already have a Cortex-M1 core: http://www.arm.com/products/processors/cortex-m/cortex-m1.php

        Pop quiz question: Why is there no ARM Cortex-M2?

      2. bazza Silver badge

        Re: Most Surprised

        @AC,

        "This is an M1, i.e. from ARM's M (for Microcontroller) range."

        It most decidedly is not a microcontroller. Since when did a microcontroller have cache, TLBs, an MMU, and all the other gubbins needed by an application CPU?

        This M of Samsung's is nothing to do with ARM's use of M to denote a core intended for use in a microcontroller.

        "If predictable execution times matter to you,"

        They matter a lot if you're doing VR. Sloppy latency on scene calculations is a good way of inducing motion sickness. Given that everyone is getting into VR these days this might end up being of concern.

        1. Destroy All Monsters Silver badge
          Windows

          Re: Most Surprised

          "They matter a lot if you're doing VR. Sloppy latency on scene calculations is a good way of inducing motion sickness. Given that everyone is getting into VR these days this might end up being of concern."

          I don't understand this at all. VR and we are talking sub-microsecond realtime arrival times ... ON THE FSCKING CPU (yes, the CPU, not the graphics pipeline)

          Not going slightly overboard here? And I mean hanging on a 15 meter outrigger slightly overboard?

        2. Gideon 1

          Re: Most Behind the Times

          "Since when did a microcontroller have cache, TLBs, an MMU, and all the other gubbins needed by an application CPU."

          Microcontrollers have had these features, and more (e.g. graphics accelerators), for many years. It is a microcontroller if it has built-in peripherals to control things.

      3. Anonymous Coward
        Anonymous Coward

        Re: Most Surprised

        No, it's not from the ARM "Cortex-M" range; it's a Samsung-designed core which is functionally equivalent to the ARM Cortex-A series.

        The ARM "M" microcontroller cores definitely are not this big, nor do they require 3 watts ...

      4. This post has been deleted by its author

      5. Lennart Sorensen

        Re: Most Surprised

        And a stupid naming choice by Samsung causes confusion already.

        These are Samsung M1 cores, not ARM Cortex-M1 cores. Apparently the Samsung M1 is a full ARMv8-A processor, not an ARMv7-M.

        I was puzzled initially why a CPU with an M1 was a big deal, but the 3W power consumption, the clock speed and then the instruction set made it clear this was not the Cortex-M1 but some other M1. Very annoying naming choice.

        I hate Allwinner's chip names too: calling everything A# means a lot of them end up matching the Cortex-A# model numbers while not being those cores, of course. Apple is doing it too, although at the rate they are counting up, I think ARM will be long past a given number before Apple gets to it.

      6. Dave Lawton
        Angel

        Re: Most Surprised

        Ah, no, the original is

        Acorn RISC Machine

    4. Dusk
      Thumb Down

      Re: Most Surprised

      Spectacularly misinformed post...

      The vast majority of high-performance ARM processors - including Apple's - use all the features you're bitching about. Branch prediction is basically an absolute necessity for any high-performance design - high clock requires a long pipeline; without a branch predictor, a bubble is created in the pipeline which leads to a stall during branch resolution. This is a major performance issue, and one that a branch predictor with high accuracy resolves. As for your comment about real-time applications, a worst-case time is not impossible to predict; microarchitectures have documented branch mispredict recovery times, usually on the order of 10-20 cyc. This, by the way, is basically no less deterministic than cores with caches, which you seem to have no problem advocating for - if a load hits cache, it might take 5 cyc to complete; if it misses cache and hits main memory, it might take 150 cyc.
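
      For the curious, here is a minimal sketch in C of the basic perceptron predictor, purely to show the shape of the idea. It is not Samsung's implementation; the "hashed" variant indexes several such tables with hashes of the branch address and chunks of history, but the core idea is the same. The table size, history length, threshold and the demo branch pattern are all made-up illustrative values.

        /* Minimal perceptron branch predictor sketch (textbook scheme, not the M1's).
         * All sizes and constants are illustrative only. */
        #include <stdint.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define HIST_LEN   16      /* bits of global branch history          */
        #define TABLE_SIZE 1024    /* number of perceptrons                  */
        #define THRESHOLD  37      /* train when |sum| is below this         */

        static int8_t weights[TABLE_SIZE][HIST_LEN + 1]; /* [0] is the bias weight        */
        static int    history[HIST_LEN];                 /* +1 = taken, -1 = not taken    */

        static int8_t sat_add(int8_t w, int d)           /* keep weights in a small range */
        {
            int v = w + d;
            if (v >  63) v =  63;
            if (v < -64) v = -64;
            return (int8_t)v;
        }

        static bool predict(uint64_t pc, int *sum_out)
        {
            int8_t *w = weights[(pc >> 2) % TABLE_SIZE];  /* crude hash of the branch address */
            int sum = w[0];
            for (int i = 0; i < HIST_LEN; i++)
                sum += w[i + 1] * history[i];
            *sum_out = sum;
            return sum >= 0;                              /* non-negative output => predict taken */
        }

        static void update(uint64_t pc, bool taken, int sum)
        {
            int8_t *w = weights[(pc >> 2) % TABLE_SIZE];
            int t = taken ? 1 : -1;

            /* Train only on a mispredict or a low-confidence prediction. */
            if (((sum >= 0) != taken) || (sum < THRESHOLD && sum > -THRESHOLD)) {
                w[0] = sat_add(w[0], t);
                for (int i = 0; i < HIST_LEN; i++)
                    w[i + 1] = sat_add(w[i + 1], t * history[i]);
            }

            /* Shift the real outcome into the global history register. */
            for (int i = HIST_LEN - 1; i > 0; i--)
                history[i] = history[i - 1];
            history[0] = t;
        }

        int main(void)
        {
            /* Demo: a loop-style branch that is taken 9 times out of 10. */
            int hits = 0, total = 1000;
            for (int i = 0; i < total; i++) {
                bool actual = (i % 10) != 9;
                int sum;
                bool guess = predict(0x4000, &sum);
                hits += (guess == actual);
                update(0x4000, actual, sum);
            }
            printf("accuracy: %.1f%%\n", 100.0 * hits / total);
            return 0;
        }

      The predictor just dots a small weight vector against recent branch outcomes and predicts from the sign; training nudges the weights towards the real outcome only when it got it wrong or was unsure.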

      Decode/microcode: Decode doesn't mean what you think it does; it's an essential part of any CPU design, RISC or CISC, as decode controls things like "what functional unit does this op go to?" and "what operands does this op use?" Microcode was mentioned nowhere. I suspect you're confusing use of micro-ops - ie, internal basic operations in a long fixed-length format - with microcode, ie lookup of certain complex operations in a microcode ROM at decode time. The first does not imply the second. Most fast processors have a complex decoder for operations that are more efficient to break into 2-3 uops, and this doesn't hit microcode. The M1 core may or may not have microcode - since it doesn't mention a ucode engine in the decode slides, and it wasn't mentioned in the presentation (I was there), I suspect it does not. Even in ARM there are ops that can be beneficial to crack into multiple uops - reg+reg addressing for instance (one uop for the reg+reg calculation, one uop for the load/store). There are even more examples in other RISC ISAs - take a look at the manual for a modern PowerPC core, for instance, and check out the number of ops that are cracked or microcoded!
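
      To make the uop-cracking point concrete, here is a toy decoder fragment in C. It is entirely my own illustration, not the M1's decode logic; the struct layout and opcode names are invented. It shows the reg+reg load example split into two micro-ops with no microcode ROM involved.

        /* Toy illustration of cracking one instruction into micro-ops at decode
         * time. Struct layout and opcode names are invented for this example. */
        #include <stdio.h>

        typedef enum { UOP_ADD, UOP_LOAD } uop_kind;

        typedef struct {
            uop_kind kind;
            int dst, src1, src2;   /* register numbers (temps allowed)        */
        } uop;

        /* Crack "LDR x0, [x1, x2]" (reg+reg addressing) into:
         *   uop 0: tmp = x1 + x2      (address generation)
         *   uop 1: x0  = mem[tmp]     (the actual load)                      */
        static int crack_ldr_regreg(int rd, int rn, int rm, uop out[2])
        {
            const int TMP = 64;                       /* a spare temp register */
            out[0] = (uop){ UOP_ADD,  TMP, rn, rm };
            out[1] = (uop){ UOP_LOAD, rd,  TMP, -1 };
            return 2;                                 /* number of uops emitted */
        }

        int main(void)
        {
            uop uops[2];
            int n = crack_ldr_regreg(0, 1, 2, uops);
            for (int i = 0; i < n; i++)
                printf("uop %d: kind=%d dst=%d src1=%d src2=%d\n",
                       i, uops[i].kind, uops[i].dst, uops[i].src1, uops[i].src2);
            return 0;
        }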

      As for out-of-order execution, it's an extremely helpful technique for exposing memory-level parallelism (by, for instance, continuing to run code during a cache miss) for surprisingly little additional overhead. Additionally, it takes the number of architectural registers out of the equation by renaming them onto a broad set of physical registers - as a result, in an OoO machine, architectural register count is almost never a hard limitation; false dependencies are eliminated and instructions run when their operands become available, not when a previous non-dependency operation completes so its scratch register can be used. This can improve power efficiency at a given performance target, because an in-order machine generally has to clock higher to get the same level of performance.
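
      And a crude sketch of the renaming idea, again just an illustration with invented sizes and a trivial bump allocator standing in for a real free list:

        /* Crude register-renaming sketch: map architectural registers onto a
         * larger physical register file so false (WAR/WAW) dependencies vanish. */
        #include <assert.h>
        #include <stdio.h>

        #define ARCH_REGS 32
        #define PHYS_REGS 128

        static int map[ARCH_REGS];          /* arch reg -> current phys reg        */
        static int next_free = ARCH_REGS;   /* toy free "list": a bump allocator   */

        static void rename_init(void)
        {
            for (int a = 0; a < ARCH_REGS; a++)
                map[a] = a;                 /* identity mapping at reset           */
        }

        /* Rename one instruction "dst = op(src1, src2)".
         * Sources are looked up BEFORE the destination gets a new mapping.        */
        static void rename(int dst, int src1, int src2,
                           int *pdst, int *psrc1, int *psrc2)
        {
            *psrc1 = map[src1];
            *psrc2 = map[src2];

            assert(next_free < PHYS_REGS);  /* a real core stalls or recycles here */
            map[dst] = next_free++;         /* fresh physical destination          */
            *pdst = map[dst];
        }

        int main(void)
        {
            int d, s1, s2;
            rename_init();

            rename(1, 2, 3, &d, &s1, &s2);  /* r1 = r2 op r3                       */
            printf("first  write to r1 -> p%d\n", d);

            rename(1, 1, 4, &d, &s1, &s2);  /* r1 = r1 op r4: reads old, writes new */
            printf("second write to r1 -> p%d (reads p%d)\n", d, s1);
            return 0;
        }

      The second write to r1 lands in a fresh physical register, so it no longer has to wait for anything still reading the first one - exactly the false-dependency elimination described above.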

      Again, Apple does these things too - they have an aggressively out-of-order machine with branch prediction and register renaming too (in fact, more aggressively out-of-order than the M1 in the article!) http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/4 has a nice summary of Apple's current uarch.

      Please do more research before making this kind of post...

      1. Anonymous Coward
        Anonymous Coward

        Re: Most Surprised

        "Spectacularly misinformed post..."

        That's as may be. But there are some mere mortals out here, and I'd really like to understand the relationship between silicon designers and ARM. Are they designing mere software, or mere hardware? How do the bits all fit together?

        I know it is a complex topic, extending from the purely physical realm through to the dirty world of applications, but the Reg has done some stunning journalism explaining complicated sh** in (for example) the world of storage; maybe they could address the parallel world of mobile and low-power processing?

        Please? Pretty please? I know the staff don't read this. But some of YOU WHO KNOW could write something in plain English and submit it as an article, and get some beer money?

      2. bazza Silver badge

        Re: Most Surprised

        @Dusk,

        "Decode/microcode: Decode doesn't mean what you think it is; it's an essential part of any CPU design, RISC or CISC, as decode controls things like "what functional unit does this op go to?" and "what operands does this op use?" Microcode was mentioned nowhere. I suspect you're confusing use of micro-ops - ie, internal basic operations in a long fixed-length format - with microcode, ie lookup of certain complex operations in a microcode ROM at decode time. "

        Ha! Yes, you're quite right of course. I've read the article in haste. Ta!

        Though I'd like to note that I wasn't dissing the value of out-of-order execution, pipelines, etc.

        @ Destroy All Monsters,

        "I don't understand this at all. VR and we are talking sub-microsecond realtime arrival times ... ON THE FSCKING CPU (yes, the CPU, not the graphgics pipeline)

        Not going slightly overboard here? And I mean hanging on a 15 meter outrigger slightly overboard?"

        Going overboard? Quite possibly.

        Having spent many a year developing many hard real-time systems, I yearn for dependable execution times, something that seems to be going out of fashion fast. I hate having to deal with CPUs that don't run code in predictable times. To do large (e.g. 50+ CPUs) real-time systems these days is a real pain in the arse - all that variation in latency starts to accumulate and we can't quite max out a collection of CPUs like we used to be able to. Intel chips are truly horrible in this regard, but there's not a lot out there to touch them when it comes to average performance, so it's hard not to use them.

    5. Vic

      Re: Most Surprised

      "It does mean that this is useless for hard real-time applications. Branch execution time is now impossible to predict."

      I wouldn't go that far; worst-case branch time is still predictable, so it's still possible to do hard real-time.

      Whether it's worth doing so is another matter...

      Vic.

    6. DaddyHoggy

      Re: Most Surprised

      My MSc thesis was on building hyperdimensional mathematical maps of how a neural network learns and how they 'remember' key attributes. Even back in 2005 there were a lot of papers for me to draw upon for my own work.

      If your neural network is a 'black box' you're doing it wrong.

  3. Anonymous Coward
    Anonymous Coward

    Wow

    Neural networks on single chips were until now confined to the lab due to the high costs of custom silicon and the need for precise weighting etc.

    1. Destroy All Monsters Silver badge

      Re: Wow

      That was back before the War on Durror (that's when I read the last IEEE Micro, after that I had to change my interests)

      Get one for USD 94 here now:

      http://www.cognimem.com/products/chips-and-modules/CM1K-Chip/

  4. Mystic Megabyte
    Alien

    chip wars

    "decision-making perceptrons"

    Aren't they the arch enemy of the mysterons?

  5. NanoMeter
    Devil

    A Mind of Their Own

    We must be careful, or these phones with neural networks might develop a mind of their own and connect to a skynet.

  6. Sgt_Oddball
    Holmes

    Only 24 years late.....

    On the other hand, at least it doesn't require a spaceship to move around (though it does have cameras that constantly watch you, if Samsung still have that whole 'screen on when you look at it' thing going on)

  7. MiguelC Silver badge

    A "hashed perceptron system in its branch prediction"

    Sounds like something the BOFH and the PFY would claim needed to be fixed when talking to their boss (or Wally to his PHB)

  8. This post has been deleted by its author

  9. Dominic Sweetman

    Simple RISC CPUs (one pipeline, in-order) work quite well while the primary cache returns data in a very small number of cycles -- perhaps two. At GHz speeds, that's impossible. Out-of-order execution makes your brain hurt, but it keeps a CPU reasonably busy while waiting for the data.

  10. Kirstian K
    Holmes

    But:

    The tech here is basically the same principle as Autocorrect (loosely), isn't it? And we all know how successful that can and cannot be sometimes (for those who don't, see autocorrect.com, v. funny). I guess here if it gets it wrong it throws the guess away, or will we start to see new websites where I asked to do 'x' and it did 'y', with funny/deadly circumstances...

  11. Drone Pilot

    Snake...

    just got harder.

  12. Wilseus

    Is it just me...

    ...or has this whole thing become stupidly over-complicated?

    Back in the 1980s, the RISC chips of the time got around the branching problem with features like branch delay slots (MIPS) and predicated instructions (ARM).

    I can't help feeling that perhaps there is a much simpler solution to this than dedicating ever more transistors to hugely complicated algorithms that could instead be used for the operations that the programmer actually intended.

    Perhaps some radically new, but elegant type of CPU design is needed.

    1. Anonymous Coward
      Anonymous Coward

      Re: Is it just me...

      Actually this is the way to go. I assume theoretically it is simpler...

      Just the in-between is more complex: the bit where we integrate existing designs with neural-network-like branch prediction and/or code execution.

      When we get good at it, or the price comes down, or just the use of the design scales up, we will see lots of systems that adjust automatically for the task. I suppose GPUs already do this with their pixel pipelines and programmable shader cores etc. (I am no expert so may have misunderstood?)

    2. Anonymous Coward
      Anonymous Coward

      Re: Is it just me...

      Branch prediction is actually based on the idea of branch delay slots. With delay slots, every time you have a branch, you get a bunch of wasted CPU cycles while you wait for the branch to be evaluated and calculated.

      Since you're going to waste that time anyway, you might as well take a guess and run one of the two possible branches. If you pick right, then you get a nice boost in performance over just delay slots, and if you pick wrong, you're just back where you were before. So, even something stupid like always predict the branch won't be taken can get you good results. But, if 50% accuracy is good, why not try for something better? IIRC, some of these perceptron-based branch prediction schemes can get around 95% accuracy. That ends up being a ton of CPU cycles that would have otherwise just been discarded. Also means you can do better at fetching instructions from memory, so you don't get as many stalls there either.
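
      To put rough numbers on that, here's a back-of-the-envelope calculation in C. The branch frequency, flush penalty and accuracy figures are assumptions picked for illustration, not measurements of this chip.

        /* Back-of-the-envelope mispredict cost. Branch frequency, penalty and
         * accuracies are illustrative assumptions, not M1 figures. */
        #include <stdio.h>

        int main(void)
        {
            const double branch_freq = 1.0 / 5.0;  /* one branch per 5 instructions */
            const double penalty     = 15.0;       /* cycles lost per mispredict    */
            const double acc[]       = { 0.50, 0.95 };

            for (int i = 0; i < 2; i++) {
                double per_branch = (1.0 - acc[i]) * penalty;
                printf("accuracy %.0f%%: %.2f cycles/branch, %.2f cycles/instruction\n",
                       acc[i] * 100.0, per_branch, per_branch * branch_freq);
            }
            return 0;
        }

      Under those assumptions, going from 50% to 95% accuracy cuts the average branch cost from 7.5 to 0.75 cycles, i.e. from roughly 1.5 down to 0.15 extra cycles per instruction.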

      Predicated instructions aren't really much different. They really just replace the complexity of branches with a read-after-write hazard. Same result, you end up waiting for the condition to be evaluated before you can start executing the conditional instruction.
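
      As a concrete illustration of swapping a control dependency for a data dependency (generic C, nothing specific to this chip), compare a branchy absolute value with the branch-free form that predicated or conditional-select hardware lets a compiler emit:

        /* Branchy vs branch-free: the second version only has a read-after-write
         * hazard on 'mask' rather than a branch to predict. Assumes 32-bit int
         * and an arithmetic right shift, as on typical ARM/x86 compilers. */
        #include <stdio.h>

        static int abs_branchy(int x)
        {
            if (x < 0)                 /* a real branch: predictor territory */
                return -x;
            return x;
        }

        static int abs_branchless(int x)
        {
            int mask = x >> 31;        /* all 1s if negative, else 0          */
            return (x + mask) ^ mask;  /* result waits on 'mask', not a branch */
        }

        int main(void)
        {
            for (int x = -3; x <= 3; x++)
                printf("%d -> %d / %d\n", x, abs_branchy(x), abs_branchless(x));
            return 0;
        }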

      So, branch prediction does add some complexity, especially if you want a high accuracy predictor, but the result is that it saves CPU cycles. And the more stages you have in your pipeline, the more cycles it saves. So, if you want your CPU to be as fast as possible, then you're going to want to throw away as few cycles on mispredicted branches as possible.

      TL;DR: branch delay slots and predicated instructions are simple, but slow, and we want fast.

      1. Wilseus

        Re: Is it just me...

        OK, so as I think someone else said, you deal with it in the compiler or at the ASM level by having branch "hints", e.g. flags saying how likely the branch is, and you use the transistors saved to implement an extra core or two.

  13. Doogie Howser MD

    Hot Chips?

    Makes me think of the Profanisaurus - "Like a dog eating hot chips"

    http://www.urbandictionary.com/define.php?term=like%20a%20dog%20eating%20hot%20chips

    1. Charles 9

      Re: Hot Chips?

      I think I've had the Profanisaurus widget on my Android phones about 6 years running. It was the first app I ever paid for, and I still get a kick out of it.

  14. Crazy Operations Guy

    Why not just write better code?

    An 8-core chip at 2+ GHz seems a bit overkill for a damn phone... I would think that focusing on reducing the requirements of the software would be a far better investment (at the very least, you can significantly increase battery life).

    I grew up in the *nix world, where a full-featured OS with a basic office suite would ship on a couple of floppies. Now you have projects that are trying to do that with a CD-ROM and are considered 'over-ambitious'. In the past 10 years Linux has gone from a single CD with dozens of useful packages (including a useful browser and OpenOffice) to requiring a full DVD just for the OS.

    It's no surprise that there are security holes in mobile OSes and phones barely last a day per charge when they strain under the 8 GB of OS code lumbering along.

    1. Charles 9

      Re: Why not just write better code?

      Phones are expected to do FULL FAT office suites, 3D gaming with lots of graphics and math, and other high performance jobs while simultaneously keeping up with both mobile and WiFi networks all while on a battery. And the customer is always right or they'll go to LG. So what do you do?

  15. Grunchy Silver badge

    What if you had 3 versions of each branch instruction:

    Branch back most of the time;

    Branch back 50/50;

    Branch back almost never.

    Depending on the software loop you're writing you'll know which kind of branch instruction to write; they really are all the same logic. It's just that if you write the first one, "branch back most of the time", then the predictor already knows the prediction: branch back. Same with the third one, "branch back almost never" (it would predict not to branch back then). The only time active prediction is needed is with the middle, 50/50 instruction.

    1. Anonymous Coward
      Anonymous Coward

      It's not just branch BACKs they have to consider but also branch FORWARDs. And these are trickier to predict because many of them check on live (read: less predictable) data.

    2. TJ1
      Thumb Up

      Linux kernel does branch prediction weighting

      The Linux kernel has the macros likely() and unlikely() [0], which cause the compiler to arrange conditional jump destinations so as to favour the branch predictor.

      [0] https://kernelnewbies.org/FAQ/LikelyUnlikely
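
      For reference, those kernel macros boil down to GCC's __builtin_expect(); a minimal user-space equivalent looks like this (the little parsing function is just an invented example):

        /* The kernel's likely()/unlikely() are thin wrappers around GCC's
         * __builtin_expect(); this is a user-space equivalent. */
        #include <stdio.h>
        #include <stdlib.h>

        #define likely(x)   __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

        static int parse_value(const char *s)
        {
            if (unlikely(s == NULL)) {   /* error path: keep it off the hot path */
                fprintf(stderr, "null input\n");
                return -1;
            }
            return atoi(s);              /* common case: falls straight through  */
        }

        int main(int argc, char **argv)
        {
            printf("%d\n", parse_value(argc > 1 ? argv[1] : "42"));
            return 0;
        }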

  16. Dead Parrot

    Flicking through the comments....

    ...and I see people are as spectacularly misinformed about AI as they were 20 years ago.

    This is progress, is it?

    Jesus H Christ: give me my floppy disks back, I've got assembly to code.
