Memory controller feature
This is a good approach, though detecting repeated accesses to a single DRAM row should be done by the memory controller. It could then force a refresh or slow the processor. Also use ECC RAM wherever affordable.
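To make the idea concrete, here is a toy sketch of a controller that counts activations per row and refreshes the neighbours early once a row looks hammered. Everything in it is an assumption for illustration: the class name, the threshold value, and the "refresh rows N-1 and N+1" policy are hypothetical, and real controllers implement this sort of thing in hardware, not Python.

```python
from collections import Counter


class RowActivationMonitor:
    """Toy model of a memory controller counting row activations per refresh window."""

    def __init__(self, threshold):
        self.threshold = threshold    # hypothetical activation limit per window
        self.counts = Counter()       # activations seen per row in this window
        self.forced_refreshes = []    # neighbour rows refreshed early, in order

    def activate(self, row):
        """Record one activation; refresh the neighbours if the row looks hammered."""
        self.counts[row] += 1
        if self.counts[row] >= self.threshold:
            for victim in (row - 1, row + 1):
                if victim >= 0:       # row 0 has no lower neighbour
                    self.forced_refreshes.append(victim)
            self.counts[row] = 0      # start counting again after mitigation

    def end_refresh_window(self):
        """A normal refresh recharges every row, so all counters reset."""
        self.counts.clear()
```

With a threshold of 3, activating row 5 three times queues rows 4 and 6 for an early refresh. The same counter could just as well be used to throttle the offending core instead.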
A group of German researchers reckon they've cracked a pretty hard nut indeed: how to protect all x86 architectures from the “Rowhammer” memory bug. It's been 18 months since “Rowhammer” first emerged, and responses have largely come from individual vendors working out how to block the “bit-flipping” attacks in their own …
I am, in fact, quite surprised that there is a low-enough level of control to exhaust a particular capacitor, certainly in any controlled way whatsoever.
Software shouldn't have to deal with stuff like this as anything more than a stopgap. Like DEP etc. it should be using the hardware's inherent capabilities to manage this kind of thing, not doing the software "bouncer pushing certain groups back" method.
Comes down to money eventually - people want cheaper/faster DRAM and so design margins are inevitably pushed down and refresh arrangements made more 'optimistic' so they don't block I/O too much, etc.
ECC should trap this of course, but again few will pay the ~15% more for ECC DRAM and sadly most AMD motherboards don't support it even though AMD do in the CPU! For Intel you have to pay extra for the 'server' CPUs to use it (except I think for a few embedded CPUs where they grudgingly enable the feature).
Still, this approach makes sense as it has little performance hit, and the general idea of identifying and separating physical RAM regions that are at risk of coupling in a rowhammer attack could be applied to other OSes as well. Assuming they care...
"ECC should trap this of course, but again few will pay the ~15% more for ECC DRAM and sadly most AMD motherboards don't support it even though AMD do in the CPU!"
I think you may find that at least on some motherboards ECC memory works fine even though the motherboard documentation doesn't say anything about supporting it. Just as an example, I have a Gigabyte 990XA-UD3 with an Athlon II X2 and it is running ECC memory with CentOS just fine.
Nov 4 20:31:58 centos-test kernel: AMD64 EDAC driver v3.4.0
Nov 4 20:31:58 centos-test kernel: EDAC amd64: DRAM ECC enabled.
Nov 4 20:31:58 centos-test kernel: EDAC amd64: F10h detected (node 0).
Nov 4 20:31:58 centos-test kernel: EDAC amd64: MC: 0: 0MB 1: 0MB
Nov 4 20:31:58 centos-test kernel: EDAC amd64: MC: 2: 2048MB 3: 2048MB
Nov 4 20:31:58 centos-test kernel: EDAC amd64: MC: 4: 0MB 5: 0MB
Nov 4 20:31:58 centos-test kernel: EDAC amd64: MC: 6: 0MB 7: 0MB
Nov 4 20:31:58 centos-test kernel: EDAC amd64: MC: 0: 0MB 1: 0MB
Nov 4 20:31:58 centos-test kernel: EDAC amd64: MC: 2: 2048MB 3: 2048MB
Nov 4 20:31:58 centos-test kernel: EDAC amd64: MC: 4: 0MB 5: 0MB
Nov 4 20:31:58 centos-test kernel: EDAC amd64: MC: 6: 0MB 7: 0MB
Nov 4 20:31:58 centos-test kernel: EDAC amd64: using x4 syndromes.
Nov 4 20:31:58 centos-test kernel: EDAC amd64: MCT channel count: 2
Nov 4 20:31:58 centos-test kernel: EDAC amd64: CS2: Unbuffered DDR3 RAM
Nov 4 20:31:58 centos-test kernel: EDAC amd64: CS3: Unbuffered DDR3 RAM
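If the EDAC driver loads like this, the kernel also exposes running error counters under sysfs, which is a quick way to check whether ECC is actually catching anything. A minimal sketch, assuming the usual `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` files are present on your kernel; the function name and the `root` parameter are my own, added so it can be pointed at a test directory.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} read from the EDAC sysfs tree."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts                 # no EDAC driver loaded, or not Linux
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue                  # skip controllers we cannot read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that climbs while a rowhammer test is running would be a fairly strong hint that the DIMMs are affected.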
I suspect most servers used for serious database work would have ECC DRAM, and the memory would probably have been tested (often called "qualified") to confirm that it works without crashing.
My Asus Chromebook, now running Linux, hangs occasionally. When I tried the rowhammer example it hung the same way. Also it hangs on memtest86 unless you use the 'safe' mode, so guess who has crappy RAM?
Would be nicer if ECC handled things more gracefully. e.g. send a signal/interrupt to the kernel indicating corruption in that region, and allow the kernel to decide what to do. Within the mapped area of a userland process? Kill the process. Within a file backed area? No big deal, remap the file somewhere else in memory. Within the kernel itself? Panic.
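Spelled out as code, the policy I have in mind is roughly the following. This is only a sketch of the decision table: the enum names and the function are made up for illustration, and the file-backed case assumes the page is clean (unmodified), since otherwise there is nothing valid on disk to re-read.

```python
from enum import Enum


class Region(Enum):
    USER_PROCESS = "mapped area of a userland process"
    FILE_BACKED = "file-backed area (assumed clean)"
    KERNEL = "the kernel itself"


class Action(Enum):
    KILL_PROCESS = "kill the process"
    REMAP_FILE = "remap the file somewhere else in memory"
    PANIC = "panic"


def handle_uncorrectable_error(region):
    """Pick a recovery action for an uncorrectable error in the given region."""
    if region is Region.FILE_BACKED:
        return Action.REMAP_FILE      # the data can be read again from the file
    if region is Region.USER_PROCESS:
        return Action.KILL_PROCESS    # only one process has lost data
    return Action.PANIC               # kernel state can no longer be trusted
```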
"Would be nicer if ECC handled things more gracefully. e.g. send a signal/interrupt to the kernel indicating corruption in that region, and allow the kernel to decide what to do."
Been happening for years. Of course you may have to look outside the weird and wonderful world of x86 hardware and software to do that, but it's definitely not rocket science.
E.g. this snippet from 2003
http://h41379.www4.hpe.com/wizard/wiz_8771.html
No, ECC is pretty much guaranteed to detect multi-bit errors. It calculates a CRC-like checksum of every data word and stores it in separate bits. Most commonly it's 8 checksum bits for every 64 data bits. The checksums are computed on every write and verified on every read. If there is an uncorrectable error, the memory controller has to raise an NMI and reboot the machine.
As for correction - normal ECC has sufficient checksums to correct one wrong bit, but there are implementations in the wild that can correct up to 4 bits. 8-bit versions exist in research papers.
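For anyone who wants to see the single-bit-correct, double-bit-detect case in action, here is a small sketch using a textbook extended Hamming code. It is an assumption that this resembles what any particular DIMM or controller does (vendors use their own codes), and the function names are mine, but the arithmetic does work out to 8 check bits for 64 data bits.

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits into an extended Hamming (SECDED) codeword."""
    n_data = len(data_bits)
    r = 0
    while (1 << r) < n_data + r + 1:  # smallest r with 2**r >= n_data + r + 1
        r += 1
    total = n_data + r
    code = [0] * (total + 1)          # index 0 holds the overall parity bit
    it = iter(data_bits)
    for pos in range(1, total + 1):
        if pos & (pos - 1):           # not a power of two, so a data position
            code[pos] = next(it)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, total + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity              # check bit p covers positions with bit p set
    code[0] = sum(code[1:]) % 2       # make the total number of ones even
    return code


def decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos           # XOR of set positions is 0 for a valid word
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: assume exactly one and repair it.
        # Syndrome 0 here means the overall parity bit itself flipped.
        if syndrome >= len(code):
            return "uncorrectable", None
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected, not fixable.
        return "uncorrectable", None
    data = [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]
    return status, data
```

Flipping any one bit of a codeword gives "corrected" with the original data back, and flipping any two gives "uncorrectable". Note that with this particular code, three or more flips can be mistaken for a single-bit error and "corrected" to the wrong value.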
>I can visualise how buggering up memory can cause other programs to mis-behave but still struggle to visualise how you can force such a specific mis-behaviour that you can take over control of the machine.
The Google Project Zero work the article refers to is outlined here. It should answer your question better than I can!
https://googleprojectzero.blogspot.co.uk/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
I think the rough idea is that by hammering the memory bits you have permission to access, you can flip a bit in adjacent memory that otherwise would be off limits to you. Part of the exploit method is to deliberately fragment the machine's memory before the hammering, so that there is a greater chance of accessible memory being adjacent to memory reserved for the kernel.
If chips are vulnerable to rowhammer, they are defective, and should have been rejected during manufacture (although standard practice appears to be to sell them in 3rd world countries).
Please can we have the Linux memory test upgraded to detect rowhammer, so we can check our own memory? (Quickly - I have just ordered a bunch of cheap memory from eBay.)
"Cheap memory is a false economy."
...and expensive memory is money spent on unneeded heatsinks and flashing LEDs or on a brand name that sells the same OEM hardware with a flashy name on it. I'm _not_ saying there is no such thing as better, more reliable memory - I'm saying good luck figuring out when your money pays for an actual difference in quality...
I see it that way too: if rowhammer works, the memory chip does not operate correctly when running some algorithms, so it is faulty and not fit for purpose. It should be sent back, and potentially a class action taken out against the manufacturer. However, the fact that the manufacturers are not being sued suggests that this errant behaviour is declared somewhere in the depths of the datasheets. Is this true? Do they declare that the memory will corrupt if it is used intensively? And finally, is there any conceivable legitimate computational algorithm that is doomed to fail because of the rowhammer errors it will trigger?