no mention of bitcoin mining
Wonder how that would get on.
Nvidia’s flagship Titan V graphics cards may have hardware gremlins causing them to spit out different answers to repeated complex calculations under certain conditions, according to computer scientists. The Titan V is the Silicon Valley giant's most powerful GPU board available to date, and is built on Nv's Volta technology. …
reinstate the bite-test
Which, with gold coins, was mostly to test whether they were adulterated with lead. Some unscrupulous moneylenders or mints would use lead to make the expensive stuff go further, thus creating more profit.
Just as well our financial institutions today are not that unethical eh?
For mining (not Bitcoin, as that's ASIC-only now) but any altcoins:
The Titan gets around 77 MH/s on ETH, at a cost of £3k and drawing over 230 W.
A Vega 56 or 64 gets 48 MH/s on ETH at a cost of £500 (but really £600-900 now) while drawing only 100 W (you do need to do some serious optimisation on both to get these figures).
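A rough back-of-the-envelope sketch (Python, taking the figures above at face value - the hash rates, prices and power draws are the ones quoted, not independently measured) of why efficiency rather than raw speed decides it:

```python
# Compare mining efficiency per watt and per pound using the quoted figures.
cards = {
    "Titan V": {"hashrate_mh": 77, "price_gbp": 3000, "power_w": 230},
    "Vega 56/64": {"hashrate_mh": 48, "price_gbp": 600, "power_w": 100},
}

for name, c in cards.items():
    mh_per_watt = c["hashrate_mh"] / c["power_w"]
    mh_per_kgbp = c["hashrate_mh"] / c["price_gbp"] * 1000
    print(f"{name}: {mh_per_watt:.2f} MH/s per watt, {mh_per_kgbp:.0f} MH/s per £1000")
```

On those numbers the Vega does roughly 0.48 MH/s per watt against the Titan's ~0.33, and about three times the hash rate per pound spent.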
If it's throwing memory errors then it will also cock up the mining. Memory errors result in incorrect shares, which are not paid for at all.
Very surprised at how low the HBM memory bandwidth is: 8 GB of HBM2 on a Vega will overclock to over 600 GB/s, while this Titan has 3x the bus width but only 652 GB/s, which suggests they are not using the vastly superior Samsung chips but the pants Hynix ones.
No mention because no self-respecting miner would use a Titan to mine any cryptocurrency (let's ignore the fact that Bitcoin hasn't been minable on GPUs for about five years).
We don't just need speed, we need efficiency. Titans are efficient in neither price nor power usage.
I guess we shouldn't be surprised. I don't know if this is an architectural issue, but speed at all costs is what is being pushed out and it's what sells, although if there were a real cost to manufacturers maybe they'd reel it in a little. If it's good enough for gaming use but not for scientific use, it should be labelled as such. It is aimed at industry, so it needs to perform better than it does.
from the website:
https://www.nvidia.com/en-us/titan/titan-v/
NVIDIA TITAN V is the most powerful graphics card ever created for the PC, driven by the world’s most advanced architecture—NVIDIA Volta. NVIDIA’s supercomputing GPU architecture is now here for your PC, and fueling breakthroughs in every industry.
I'm not. I recall seeing a seminar or conference presentation shortly after people began touting GPUs for scientific computation, where the presenter pointed out that they were optimized for graphics output at speed ... a domain where getting a bit or two wrong every now and then would only show up as a probably small and very brief visual glitch.
Still, perhaps in the intervening decade or whatever, this became less likely, and so now the problem is re-emerging?
Indeed, it used to be that GPUs were completely unreliable for precise computations. Of course, that has changed in the past decades, when the industry realized that there was money in fast GPUs that did not make mistakes, and advertised them as such.
There's nothing wrong in itself with GPUs that return slightly imprecise results in exchange for speed; but that should be clearly announced so that buyers know what to expect.
nVidia market this one as being for compute, and actively discourage using their other - much cheaper - cards for this kind of thing.
To put it bluntly, this is a massive blunder on the part of nVidia that's going to damage their reputation for a decade or more.
There's nothing wrong in itself with GPUs that return slightly imprecise results in exchange for speed; but that should be clearly announced so that buyers know what to expect.
This. Casual users might not care. But for visual artists, this will act like a bug and bug the hell out of them. It'll be even worse when a day's worth of rendering has one single visual imprecision that forces the artists to constantly redraw their art assets.
People have been trying to shoehorn NVIDIA gaming GPUs into scientific applications ever since the Tesla GPUs were first introduced. NVIDIA's marketing has always been ambiguous regarding the differences between desktop and enterprise kit, and they leave it up to the systems integrators to tell customers whether or not the GPU would be a good fit for their applications. They love to tout the "supercomputing" architecture of the gaming GPUs as a way to move more units, but they're not designed to replace their Tesla line. They're produced from the same binning practices that distinguish desktop CPUs from their enterprise counterparts; a Tesla GPU that doesn't meet the performance standards to be enterprise-worthy gets remade into a desktop GPU with a bunch of features disabled.
The fact is that gaming GPUs can be used in various scientific applications, but they've never promised accurate results, because they don't have double precision. If you don't know why you would need double precision, then you probably shouldn't be in the market for a supercomputer.
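A tiny, purely illustrative example (Python/NumPy, nothing to do with the Titan V specifically) of what single precision silently throws away and double precision keeps:

```python
import numpy as np

# The spacing between adjacent float32 values at 1e8 is 8, so adding 1.0
# vanishes entirely; in float64 the same sum is exact.
big, small = 1.0e8, 1.0

print(np.float32(big) + np.float32(small) - np.float32(big))   # 0.0
print(np.float64(big) + np.float64(small) - np.float64(big))   # 1.0
```

That kind of silent loss is harmless in a texture lookup and potentially disastrous when it accumulates over millions of timesteps in a simulation.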
when rounding to 1 significant figure.
Actually, I think this is exactly what's happening. People here use Nvidia cards to do scientific computing, and recently they have reported that the same calculations produce different results. They attribute this to the massively parallel calculations spreading the sub-calculations across the many cores differently each time, such that intermediate results are calculated along different paths, and since these are floating-point calculations the rounding of these intermediate results differs.
They say that, unlike on a regular processor, there is no OS, so there is no way to tell the processor what to do, when and how: it does its magic on its own.
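That effect is easy to demonstrate without any GPU at all. A small sketch (Python/NumPy) of how the same sum, accumulated in different orders, legitimately rounds to different totals - which is a separate issue from the hardware faults the article describes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# Floating-point addition is not associative, so summing the same values in a
# different order (think: a different distribution of work across GPU cores)
# can round to a slightly different total each time.
print(x.sum())            # one accumulation order
print(x[::-1].sum())      # same values, reversed
print(np.sort(x).sum())   # same values, sorted ascending
```

The three totals typically differ in the last few bits, even though every input value is identical.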
I spent 10 years doing microprocessor validation, from 1996-2006.
1) There is approximately a 0% chance that this is due to pushing memory to the edge of the envelope. All doped silicon degrades with use. If they push the envelope, then all of their cards will die sooner rather than later. The closest you get to this is what we call "infant mortality", where a certain percentage of product varies from the baseline to the point that it dies quickly.
2) In order to root cause errors of this sort, it is really, really important to understand if this affects all parts, indicating a design bug, or some, indicating a manufacturing defect. If the article indicated which is the case, I missed it.
3) Design bug or manufacturing defect, inconsistent results come down to timing issues in every case I saw or heard about. In the worst case, you get some weird data/clock line coupling that causes a bit to arrive late at the latch. Much more often there is some corner case that the design team missed. Again, I would need to know the nature of the computations involved, and the differences observed, to meaningfully guess at the problem.
"important to understand if this affects all parts, indicating a design bug, or some, indicating a manufacturing defect. If the article indicated which is the case, I missed it."
The article said that one person tested 4 GPUs and found problems with 2 of them. Given the small sample size, and only being tested on a single problem, I don't think there's really enough information to figure out which it might be.
I'll tell you about the card behaviour...
At the moment mine is mining Ethereum. I need to offset the cost until the newest generation of CUDA gets proper support and the drivers mature.
I used to run the card overclocked at 120% power and a +142 memory overclock (the best stable overclock at that time). It worked for a bit more than a month with no issues. I talked to other people online who couldn't get theirs to run as stably as mine did. I was quite happy that I'd won the silicon lottery, until it stopped working at those settings.
After a month and a bit, the card is stable at 100% power and a +130 memory overclock. If I try to push it a bit higher, it stops working properly after a few hours: ethminer displays the calculations but doesn't send the results back to the pool.
It seems to me that it has something to do with the memory as well.
The card has degraded in performance over time, and as far as Ethereum mining goes it seems related to the memory. Could you please explain how this is possible?
This is possible because semiconductor structures on silicon can wear out due to metal migration driven by current density and temperature. This causes their switching characteristics to change or degrade over time. Device Mean-Time-To-Failure (MTTF) is typically characterized using the Arrhenius equation, where higher device temperature results in shorter life.
In a CMOS transistor structure, most power is dissipated when switching logic states. Power translates into heat. As the operating frequency is increased, the transistors switch more often in a given time, producing more heat, which accelerates wearout.
It used to be that the characteristic life of a device could be 100 years or more, but with operating frequencies now at GHz levels and device feature sizes shrunk to pack more transistors into less real estate, design margins have shrunk to the point where characteristic lifetimes are reduced to a decade or two, and greatly shortened further by overclocking.
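For concreteness, a rough sketch of the Arrhenius relationship described above (Python; the activation energy and the temperatures here are illustrative assumptions, not measurements from any particular card):

```python
import math

K_BOLTZMANN = 8.617e-5   # eV/K
EA = 0.7                 # eV - an assumed, typical activation energy for metal migration

def acceleration_factor(t_use_c: float, t_stress_c: float, ea: float = EA) -> float:
    """How much faster wearout proceeds at t_stress_c than at t_use_c,
    using MTTF proportional to exp(Ea / (k*T))."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea / K_BOLTZMANN) * (1.0 / t_use - 1.0 / t_stress))

# e.g. an overclocked card running at 85 C instead of 65 C:
print(f"{acceleration_factor(65, 85):.1f}x faster wearout")
```

With those assumed numbers, a 20 °C rise speeds up wearout by roughly a factor of four, which is the sort of effect sustained overclocking has on characteristic life.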
Yes, I am a reliability engineer.
In the real world there aren't exact answers, and very few questions involve integers.
If the errors are small and/or rare enough - that's fine.
Our experiments give bad results all the time - we understand that; it's the whole point of experimental physics.
It completely depends on the use case.
Galaxy dynamics and Knights of the Old Republic - fine.
Rocket dynamics - not fine.
From Floating-Point Arithmetic Besieged by “Business Decisions” - A Keynote Address, prepared for the IEEE-Sponsored ARITH 17 Symposium on Computer Arithmetic, delivered on Mon. 27 June 2005 in Hyannis, Massachusetts
You have succeeded too well in building Binary Floating-Point Hardware.
Floating-point computation is now so ubiquitous, so fast, and so cheap that almost none of it is worth debugging if it is wrong, if anybody notices.
By far the overwhelming majority of floating-point computations occur in entertainment and games.
IBM’s Cell Architecture: “The floating-point operation is presently geared for throughput of media and 3D objects. That means ... that IEEE correctness is sacrificed for speed and simplicity. ... A small display glitch in one display frame is tolerable; ...” Kevin Krewell’s Microprocessor Report for Feb. 14 2005.
A larger glitch might turn into a feature propagated through a Blog thus: “There is no need to find and sacrifice a virgin to the Gorgon who guards the gate to level 17. She will go catatonic if offered exactly $13.785.”
How often does a harmful loss of accuracy to roundoff go undiagnosed?
Nobody knows. Nobody keeps score.
And when numerical anomalies are noticed they are routinely misdiagnosed.
Re EXCEL, see David Einstein’s column on p. E2 of the San Francisco Chronicle for 16 and 30 May 2005.
Consider MATLAB, used daily by hundreds of thousands. How often do any of them notice roundoff-induced anomalies? Not often. Bugs can persist for Decades.
e.g., log2(...) has lost as many as 48 of its 53 sig. bits at some arguments since 1994.
PC MATLAB’s acos(...) and acosh(...) lost about half their 53 sig. bits at some arguments for several years.
MATLAB’s subspace(X, Y) still loses half its sig bits at some arguments, as it has been doing since 1988.
Nevertheless, as such systems go, MATLAB is among the best.
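To make the "lost half its sig. bits" failure mode concrete, here is a small illustration (Python; it is not a reproduction of the MATLAB bugs Kahan lists, just the standard cancellation trap behind that kind of loss):

```python
import math

x = 1.0e-4

# Two algebraically identical expressions for 1 - cos(x):
naive  = 1.0 - math.cos(x)            # cancellation: only about half the significant digits survive
stable = 2.0 * math.sin(x / 2.0)**2   # rearranged form keeps nearly full precision

print(naive)    # noticeably off in the trailing digits
print(stable)   # ~4.9999999999958e-09, essentially correct to full precision
```

The "correct" textbook formula quietly loses accuracy; nothing crashes, nothing warns, and nobody keeps score.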
Rocket dynamics - not fine.
Depends on the trade off.
Would you like to use a mainframe with duplicate processors checking each result, but with only enough power to model your combustion chamber with 1 m^3 cells -
or a bunch of cheap GPUs where you can do 1 mm^3 cells, but some small percentage of those cells have incorrect values?
Classic report I once got back from our supercomputer center: "The Cray detected an uncorrected memory fault during your run. Please check your results." If I could check the results manually, why would I need the fscking Cray?
Among other products, I worked on the STI Cell microprocessor. It was my job to compare the documents to the actual product. The documents plainly stated the accuracy of the division-class instructions. If MATLAB or whoever failed to write code that incorporated the information in the documentation, whose fault is that? That MATLAB had been so egregiously wrong for a decade before the STI Cell microprocessor came out should help if anyone is confused.
IEEE-754 is fine for specifying data representation and last-bit correctness. But the nasty corners are an excellent example of just how bad committee work can be. And the folks writing floating point libraries regularly produce code that is mediocre at best. Don't blame hardware for your software bugs.
Or even the similar floating-point problem with the DEC VAX (8600/Venus family, I think) circa 1988? I recall my then employer's chemical engineering department having to re-run significant amounts of safety-critical, design-related computations over a period of weeks and months on alternative (slower) VAX models. In the support teams we ended up scheduling low-priority batch jobs running tasks with known results, set up to flag recurrences of the problem, as it wasn't failing consistently.
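A minimal sketch of that kind of known-answer canary job (Python; the workload, seed and warning text are made up for illustration - a real job would use the site's own code and a golden result recorded on a known-good machine):

```python
import hashlib
import numpy as np

def canary_run(seed: int = 42) -> str:
    """Run a deterministic workload and return a digest of the result."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((1000, 1000))
    b = rng.standard_normal((1000, 1000))
    c = a @ b   # same seed and same libraries should give bit-identical output
    return hashlib.sha256(c.tobytes()).hexdigest()

# Run the identical job twice; on healthy hardware the digests match.
if canary_run() != canary_run():
    print("WARNING: identical job produced different results - flag this node")
```

Scheduled as a low-priority batch job, repeated runs like this catch the intermittent faults that a single "please check your results" never will.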
Some versions of PowerPC’s AltiVec SIMD unit are optimised to complete instructions in a single clock cycle at the expense of perfect numerical accuracy. Well documented, understood and repeatable, this was fine for games, signal processing, image processing.
This problem with NVidia’s latest sounds different. Sounds like a big mistake in the silicon process, or too optimistic on the clock speed.