2 + 2 = 4, er, 4.1, no, 4.3... Nvidia's Titan V GPUs spit out 'wrong answers' in scientific simulations

Nvidia’s flagship Titan V graphics cards may have hardware gremlins causing them to spit out different answers to repeated complex calculations under certain conditions, according to computer scientists. The Titan V is the Silicon Valley giant's most powerful GPU board available to date, and is built on Nvidia's Volta technology. …


  1. Anonymous Coward
    Anonymous Coward

    no mention of bitcoin mining

    Wonder how that would get on.

    1. 9Rune5
      Coat

      Re: no mention of bitcoin mining

      My guess is that we will have to reinstate the bite-test and carefully inspect each and every bitcoin we handle.

      Mine is the one with dentures in the pocket.

      1. CrazyOldCatMan Silver badge

        Re: no mention of bitcoin mining

        reinstate the bite-test

        Which, with gold coins, was mostly to test whether they were adulterated with lead. Some unscrupulous moneylenders or mints would use lead to make the expensive stuff go further, thus creating more profit.

        Just as well our financial institutions today are not that unethical eh?

        1. Frenchie Lad

          Re: no mention of bitcoin mining

          Ask the Germans why it's impossible for them to get their own gold back from the USofA. It was stored there during the Cold War, but its return is impossible owing to "security concerns". The Yanks are adamant that they haven't lent/sold it to anyone else.

    2. PNGuinn
      Trollface

      Re: no mention of bitcoin mining

      Fine if you spend 'em feeding your gambling habit?

      1. BebopWeBop
        Happy

        Re: no mention of bitcoin mining

        I take it that the winnings are not paid in Bitcoin?

    3. MonkeyCee

      Re: no mention of bitcoin mining

      Not masses of data on it in the wild, but about 50% increase over a 1080ti would be my guess.

      Maybe 3-4 bucks a day income, 2-3 in profit.

      Not worth it for mining.

    4. HamsterNet

      Re: no mention of bitcoin mining

      For mining (not bitcoin, as that's ASIC-only now) but any alt coins:

      The Titan gets around 77MH/s on ETH at a cost of £3k, while drawing over 230W.

      A Vega 56 or 64 gets 48MH/s on ETH at a cost of £500 (but really £600-900 now) but only draws 100W (you do need to do some serious optimisation on both to get these figures).

      If it's throwing memory errors then it will also cock up the mining. Memory errors result in incorrect shares, which are not paid for at all.

      Very surprised at how low the HBM memory bandwidth is. 8GB of HBM2 on a Vega will overclock to over 600GB/s. This Titan has 3x the bus width but only 652GB/s, which suggests they are not using the vastly superior Samsung chips but the Hynix pants.
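
      A quick back-of-the-envelope comparison of those figures (a hedged sketch in Python using the hashrates, prices and power draws quoted above exactly as stated; nothing here is independently measured):

      ```python
      # Efficiency comparison using the poster's quoted figures (illustrative only).
      cards = {
          "Titan V":    {"mh_s": 77, "watts": 230, "price_gbp": 3000},
          "Vega 56/64": {"mh_s": 48, "watts": 100, "price_gbp": 600},
      }

      for name, c in cards.items():
          per_watt = c["mh_s"] / c["watts"]        # MH/s per watt of draw
          per_pound = c["mh_s"] / c["price_gbp"]   # MH/s per GBP of purchase price
          print(f"{name}: {per_watt:.2f} MH/s/W, {per_pound:.3f} MH/s/GBP")
      ```

      On those numbers the Vega comes out well ahead on both hash-per-watt and hash-per-pound, which is why nobody buys Titans to mine with.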

    5. Anonymous Coward
      Anonymous Coward

      Re: no mention of bitcoin mining

      No mention because no self-respecting miner would use a Titan to mine any cryptocurrencies. (Let's ignore the fact bitcoin hasn't been minable on GPUs for about 5 years.)

      We don't just need speed, we need efficiency. Titans are not efficient in price nor power usage.

  2. Anonymous Coward
    Anonymous Coward

    I guess we shouldn't be surprised. Certainly, I don't know if this is an architectural issue, but speed at all costs is what is being pushed out and it's what sells, although if there were a real cost to manufacturers maybe they'd reel it in a little. If it's good enough for gaming use but not scientific use it should be labelled as such. It is aimed at industry, so needs to perform better than it does.

    from the website:

    https://www.nvidia.com/en-us/titan/titan-v/

    NVIDIA TITAN V is the most powerful graphics card ever created for the PC, driven by the world’s most advanced architecture—NVIDIA Volta. NVIDIA’s supercomputing GPU architecture is now here for your PC, and fueling breakthroughs in every industry.

    1. Anonymous Coward
      Anonymous Coward

      Re: I guess we shouldn't be surprised

      I'm not. I recall seeing a seminar or conference presentation shortly after people began touting GPUs for scientific computation, where the presenter pointed out that they were optimized for graphics output at speed ... a domain where getting a bit or two wrong every now and then would only show up as a probably small and very brief visual glitch.

      Still, perhaps in the intervening decade or whatever, this became less likely, and so now the problem is re-emerging?

      1. ratfox

        Re: I guess we shouldn't be surprised

        Indeed, it used to be that GPUs were completely unreliable for precise computations. Of course, that has changed over the past decade or two, as the industry realized that there was money in fast GPUs that did not make mistakes, and advertised them as such.

        There's nothing wrong in itself with GPUs that return slightly imprecise results in exchange for speed; but that should be clearly announced so that buyers know what to expect.

        1. Richard 12 Silver badge

          This one is *supposed* to be used for compute

          nVidia market this one as being for compute, and actively discourage using their other - much cheaper - cards for this kind of thing.

          To put it bluntly, this is a massive blunder on the part of nVidia that's going to damage their reputation for a decade or more.

        2. Anonymous Coward
          Anonymous Coward

          Re: I guess we shouldn't be surprised

          There's nothing wrong in itself with GPUs that return slightly imprecise results in exchange for speed; but that should be clearly announced so that buyers know what to expect.

          This. Casual users might not care. But for visual artists, this will act like a bug and bug the hell out of them. It'll be even worse when a day's worth of rendering has a single visual imprecision that forces the artists to constantly re-render their art assets.

          1. anothercynic Silver badge

            Re: I guess we shouldn't be surprised

            It's not just artists. It's scientists who rely on computations to be exact. Not 4.1. Not 4.3. 4.0. Nothing else.

            1. Paul Shirley

              Re: I guess we shouldn't be surprised

              Using a computer and finite-resolution math, scientists expect results to be repeatable with provable error bounds, NOT exact. Nvidia returning randomly tainted values breaks both expectations and gets results wrong in all other senses!

    2. Anonymous Coward
      Anonymous Coward

      This isn't news.

      People have been trying to shoehorn NVIDIA gaming GPUs into scientific applications ever since the Tesla GPUs were first introduced. NVIDIA's marketing has always been ambiguous regarding the differences between desktop and enterprise kit, and they leave it up to the systems integrators to tell customers whether or not the GPU would be a good fit for their applications. They love to tout the "supercomputing" architecture of the gaming GPUs as a way to move more units, but they're not designed to replace their Tesla line. They're produced from the same binning practices that distinguish desktop CPUs from their enterprise counterparts; a Tesla GPU that doesn't meet the performance standards to be enterprise-worthy gets remade into a desktop GPU with a bunch of features disabled.

      The fact is that gaming GPUs can be used in various scientific applications, but they've never promised accurate results, because they don't have double precision. If you don't know why you would need double precision, then you probably shouldn't be in the market for a supercomputer.
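
      A minimal sketch of why that matters (assuming NumPy is available; the data and sizes are purely illustrative, not anything from the article): the same straightforward summation drifts much further from the exact answer in single precision than in double.

      ```python
      import numpy as np

      n = 10_000_000
      data64 = np.full(n, 0.1, dtype=np.float64)   # ten million copies of 0.1
      data32 = data64.astype(np.float32)           # the same data in single precision

      exact = 0.1 * n
      err32 = abs(float(data32.sum(dtype=np.float32)) - exact)
      err64 = abs(float(data64.sum(dtype=np.float64)) - exact)
      print(f"float32 error: {err32:.6g}")   # noticeably large
      print(f"float64 error: {err64:.6g}")   # many orders of magnitude smaller
      ```

      If your workload needs the float64 column, a card that is only fast at float32 was never going to give you supercomputer-grade answers, whatever the marketing says.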

  3. John H Woods Silver badge

    3 <= 2 + 2 <= 5

    <Pedant>

    2+2 can, of course, be anywhere in the range 3..5 when rounding to 1 significant figure.

    </Pedant>

    1. Chris Miller

      Re: 3 <= 2 + 2 <= 5

      2 + 2 = 5

      (for sufficiently large values of '2')

      1. Anonymous Coward
        Anonymous Coward

        Re: 3 <= 2 + 2 <= 5

        2 + 2 = 22

        Not sure where you two went to skool.

        1. John G Imrie
          Happy

          Re: 3 <= 2 + 2 <= 5

          2+2 = 10

          for sufficiently small values of base.

    2. d3rrial

      Re: 3 <= 2 + 2 <= 5

      3 <= 2 + 2 <= 5 = 1

      easy.

      (((3 <= 2) + 2) <= 5)

      3 <= 2 = 0

      0 + 2 = 2

      2 <= 5 = 1

      1. Anonymous Coward
        Anonymous Coward

        Re: 3 <= 2 + 2 <= 5

        You are all clearly smarter than me with my calculation however did you consider trumpets?

        1. d3rrial

          Re: 3 <= 2 + 2 <= 5

          No, trumpets are -p=<

          1. TRT Silver badge

            Re: 3 <= 2 + 2 <= 5

            2 <= many <= many + 1

    3. Anonymous Coward
      Anonymous Coward

      Re: 3 <= 2 + 2 <= 5

      Ah, now I see why none of the climate change models actually fit reality.

    4. Zolko Silver badge

      Re: 3 <= 2 + 2 <= 5

      when rounding to 1 significant figure.

      Actually, I think this is exactly what's happening. People here use Nvidia cards to do scientific computing, and recently they have reported that the same calculations produce different results. They attribute this to the massively parallel calculations spreading the sub-calculations across the many cores differently each time, such that intermediate results are calculated along different paths, and since these are floating-point calculations the roundings in those intermediate results differ.

      They say that, unlike on a regular processor, there is no OS, so there is no way to tell the processor what to do, when and how: it does its magic on its own.
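
      A small sketch of that effect in plain Python (no GPU involved; the numbers are made up): floating-point addition isn't associative, so merely changing the order in which partial results are combined changes the rounded answer.

      ```python
      import random

      random.seed(0)
      # Values spanning many orders of magnitude, as in a typical simulation.
      values = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
                for _ in range(100_000)]

      order_a = sum(values)            # one summation order
      shuffled = values[:]
      random.shuffle(shuffled)         # a different "scheduling" of the same work
      order_b = sum(shuffled)

      print(order_a, order_b, order_a - order_b)   # the difference is typically non-zero
      ```

      On a GPU the scheduling of those partial sums can vary from run to run, which is the mechanism being suggested here; whether that accounts for the size of the discrepancies the researchers saw is another question.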

  4. Anonymous Coward
    Joke

    Looks like...

    There's some quantum computing going on in those cards!

  5. IgorS

    Does anyone know if this affects the server-class V100s, too?

  6. Claptrap314 Silver badge

    Redlining memory? Buhahahaha! Not a chance.

    I spent 10 years doing microprocessor validation, from 1996-2006.

    1) There is an approximately 0% chance that this is due to pushing memory to the edge of the envelope. All doped silicon degrades with use. If they push the envelope, then all of their cards will die sooner rather than later. The closest you get to this is what we call "infant mortality", where a certain percentage of product varies from the baseline to the point that it dies quickly.

    2) In order to root cause errors of this sort, it is really, really important to understand if this affects all parts, indicating a design bug, or some, indicating a manufacturing defect. If the article indicated which is the case, I missed it.

    3) Design bug or manufacturing defect, inconsistent results come down to timing issues in every case I saw or heard about. In the worst case, you get some weird data/clock line coupling that causes a bit to arrive late at the latch. Much more often there is some corner case that the design team missed. Again, I would need to know the nature of the computations involved, and the differences observed, to meaningfully guess at the problem.

    1. 404
      Boffin

      Re: Redlining memory? Buhahahaha! Not a chance.

      What you term 'infant mortality', we out in the field call 'The Shitty One(s)' - take 100 identical PCs: 93 of them run per spec, 5-6 run fast as fuck, and the final 1-2 are total pieces of shite.

      Good times ;)

      1. Flakk
        Pint

        we out in the field call...

        Best laugh I've had all day. For you. Thanks for the field work you do.

      2. Vinyl-Junkie
        Thumb Up

        Re: we out in the field

        Indeed; build 100 identical PCs from the same image, 99 will work fine and the last one will be a never-ending source of problems.

    2. Cuddles

      Re: Redlining memory? Buhahahaha! Not a chance.

      "important to understand if this affects all parts, indicating a design bug, or some, indicating a manufacturing defect. If the article indicated which is the case, I missed it."

      The article said that one person tested 4 GPUs and found problems with 2 of them. Given the small sample size, and only being tested on a single problem, I don't think there's really enough information to figure out which it might be.

    3. Anonymous Coward
      Anonymous Coward

      Re: Redlining memory? Buhahahaha! Not a chance.

      I'll tell you about the card behaviour...

      At the moment mine is mining Ethereum. I need to offset the cost until the newest generation of CUDA gets proper support and the drivers mature.

      I used to run the card overclocked with 120% power and a +142 memory overclock (the best stable overclock at that time). It worked for a bit more than a month with no issues. I talked to other people online who couldn't get theirs to run as stable as mine did. I was quite happy that I had won the silicon lottery, until it stopped working at these settings.

      After a month and a bit, the card is stable at 100% power and +130 Memory overclock. If I try to overclock it a bit higher it stops working properly after a few hours. The ethminer displays the calculations but it doesn't go through with sending the results back to the pool.

      It seems to me that it has something to do with the memory as well.

      The card has degraded in performance over time, and as far as Ethereum mining goes it is related to the memory. Could you please explain how this is possible?

      1. Patched Out

        Re: Redlining memory? Buhahahaha! Not a chance.

        This is possible because semiconductor structures on silicon can wear out due to metal migration driven by current density and temperature. This causes their switching characteristics to change or degrade over time. Device Mean Time To Failure (MTTF) is typically characterized using the Arrhenius equation, where higher device temperature results in shorter life.

        In a CMOS transistor structure, most power is dissipated when switching logic states. Power translates into heat. As operating frequency is increased, the transistors switch more often in less time causing more heat, which will accelerate wearout.

        It used to be that the characteristic life of a device could be 100 years or more, but with operating frequencies now at GHz levels and device feature sizes shrunk to pack more transistors into less real estate, the design margins have shrunk to the point where characteristic lifetimes are reduced to a decade or two, and greatly shortened by overclocking.

        Yes, I am a reliability engineer.
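
        For anyone who wants rough numbers, here's a minimal sketch of the Arrhenius acceleration factor (the activation energy and temperatures are illustrative assumptions, not figures for any particular Nvidia part):

        ```python
        import math

        BOLTZMANN_EV_PER_K = 8.617e-5   # Boltzmann constant in eV/K
        E_A = 0.7                       # assumed activation energy in eV (mechanism-dependent)

        def acceleration_factor(t_use_c: float, t_stress_c: float) -> float:
            """How many times faster wearout proceeds at t_stress_c than at t_use_c (deg C)."""
            t_use = t_use_c + 273.15        # convert to kelvin
            t_stress = t_stress_c + 273.15
            return math.exp((E_A / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress))

        # e.g. an overclock that pushes the junction temperature from 70C to 95C:
        print(acceleration_factor(70.0, 95.0))   # roughly 5x faster wearout
        ```

        With those assumed numbers, a 25C rise costs around a factor of five in expected lifetime, which is why sustained overclocking ages silicon so much faster than the datasheet life suggests.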

        1. Paul Shirley

          Re: Redlining memory? Buhahahaha! Not a chance.

          Or the cooling has degraded over a couple of months and the overclock headroom shrank with it.

      2. Anonymous Coward
        Anonymous Coward

        Re: Redlining memory? Buhahahaha! Not a chance.

        Note also that, as well as the other answer about hardware degradation, the Ethereum DAG size has grown tremendously, and people are seeing lower hashing rates than a few months ago.

      3. Tom 64

        Re: Redlining memory? Buhahahaha! Not a chance.

        > "Could you please explain how is this possible?"

        Nvidia are known for cutting corners to maximise profits.

        Google for Charlie Demerjian vs Nvidia

  7. Yet Another Anonymous coward Silver badge

    Probably fine for scientific computing

    In the real world there aren't exact answers, and very few questions involve integers.

    If the errors are small and/or rare enough - that's fine.

    Our experiments give bad results all the time - we understand that, it's the whole point of experimental physics.

    1. Anonymous Coward
      Anonymous Coward

      Re: Probably fine for scientific computing

      It completely depends on the use case.

      Galaxy dynamics and Knights of the Old Republic - fine.

      Rocket dynamics - not fine.

      From Floating-Point Arithmetic Besieged by “Business Decisions” - a keynote address prepared for the IEEE-sponsored ARITH 17 Symposium on Computer Arithmetic, delivered on Mon. 27 June 2005 in Hyannis, Massachusetts:

      You have succeeded too well in building Binary Floating-Point Hardware.

      Floating-point computation is now so ubiquitous, so fast, and so cheap that almost none of it is worth debugging if it is wrong, if anybody notices.

      By far the overwhelming majority of floating-point computations occur in entertainment and games.

      IBM’s Cell Architecture: “The floating-point operation is presently geared for throughput of media and 3D objects. That means ... that IEEE correctness is sacrificed for speed and simplicity. ... A small display glitch in one display frame is tolerable; ...” Kevin Krewell’s Microprocessor Report for Feb. 14 2005.

      A larger glitch might turn into a feature propagated through a Blog thus: “There is no need to find and sacrifice a virgin to the Gorgon who guards the gate to level 17. She will go catatonic if offered exactly $13.785.”

      How often does a harmful loss of accuracy to roundoff go undiagnosed?

      Nobody knows. Nobody keeps score.

      And when numerical anomalies are noticed they are routinely misdiagnosed.

      Re EXCEL, see David Einstein's column on p. E2 of the San Francisco Chronicle for 16 and 30 May 2005.

      Consider MATLAB, used daily by hundreds of thousands. How often do any of them notice roundoff-induced anomalies? Not often. Bugs can persist for Decades.

      e.g., log2(...) has lost as many as 48 of its 53 sig. bits at some arguments since 1994.

      PC MATLAB’s acos(...) and acosh(...) lost about half their 53 sig. bits at some arguments for several years.

      MATLAB’s subspace(X, Y) still loses half its sig bits at some arguments, as it has been doing since 1988.

      Nevertheless, as such systems go, MATLAB is among the best.

      1. Yet Another Anonymous coward Silver badge

        Re: Probably fine for scientific computing

        Rocket dynamics - not fine.

        Depends on the trade off.

        Would you like to use a mainframe with duplicate processors checking each result - but with only enough power to model your combustion chamber with 1m^3 cells?

        Or a bunch of cheap GPUs where you can do 1mm^3 cells, but some small percentage of those cells have incorrect values?

        Classic report I once got back from our supercomputer center: "The Cray detected an uncorrected memory fault during your run. Please check your results." If I could check the results manually, why would I need the fscking Cray?

      2. Claptrap314 Silver badge

        Re: Probably fine for scientific computing

        Among other products, I worked on the STI Cell microprocessor. It was my job to compare the documents to the actual product. The documents plainly stated the accuracy of the division-class instructions. If MATLAB or whoever failed to write code that incorporated the information in the documentation, whose fault is that? That MATLAB had been so egregiously wrong for a decade before the STI Cell microprocessor came out should help if anyone is confused.

        IEEE-754 is fine for specifying data representation and last-bit correctness. But the nasty corners are an excellent example of just how bad committee work can be. And the folks writing floating point libraries regularly produce code that is mediocre at best. Don't blame hardware for your software bugs.

  8. Stuart Dole

    Shades of the Pentium floating point bug?

    Not the first time this sort of thing has cropped up! Old-timers will remember the famous “Pentium floating point bug”.

    1. Tony Gathercole ...
      FAIL

      Re: Shades of the Pentium floating point bug?

      Or even the similar floating-point problem with the DEC VAX (8600/Venus family, I think) circa 1988? Recollections of my then employer's chemical engineering department having to re-run significant amounts of safety-critical design-related computations over a period of weeks and months on alternative (slower) VAX models. In the support teams we ended up scheduling low-priority batch jobs running tasks with known results, set up to flag recurrences of the problem, as it wasn't failing consistently.

    2. Neil Barnes Silver badge

      Re: Shades of the Pentium floating point bug?

      Yabbut - at least the Pentium got the same wrong answer every time.

    3. bazza Silver badge

      Re: Shades of the Pentium floating point bug?

      Some versions of PowerPC’s AltiVec SIMD unit are optimised to complete instructions in a single clock cycle at the expense of perfect numerical accuracy. Well documented, understood and repeatable, this was fine for games, signal processing, image processing.

      This problem with NVidia’s latest sounds different. Sounds like a big mistake in the silicon process, or too optimistic on the clock speed.

    4. Pascal

      Re: Shades of the Pentium floating point bug?

      I am Pentium of Borg. Division is futile. You will be approximated.

      1. TRT Silver badge

        Re: Shades of the Pentium floating point bug?

        The ultimate answer? To life, the universe and everything? OK, the answer... the answer is...

        41.999999999999999

        I said you weren't going to like it.

        1. DropBear
          Trollface

          Re: Shades of the Pentium floating point bug?

          Well that depends on whether you meant literally 41.999999999999999, or 41.(9), considering the latter is mathematically exactly equal to 42 (yes, really)...
