* Posts by DrBandwidth

8 publicly visible posts • joined 25 Mar 2013

3 is the magic number (of bits): Flip 'em at once and your ECC protection can be Rowhammer'd

DrBandwidth

clever, but easy to detect

The trick of testing one bit at a time until you find three bits that are susceptible is clever, but the approach is risky (e.g., if only 2 of the 3 bits flip, you get an uncorrectable error that leaves lots of log information behind), and it is also easy to protect against. Every processor that I know of that supports ECC also provides counters for corrected single-bit errors. We monitor these correctable error rates so we can replace error-prone DIMMs, which means that we also pay attention to who and what were running on the node when the corrected error rate increased. This monitoring could be automated, but that is not necessary -- having humans review this data means that there is a decent chance that the attacker will get caught and locked out of the system. (That usually means that someone has hacked an authorized user's account, but occasionally an authorized user does something stupid....)
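
For anyone curious, on Linux the corrected-error counters show up under the EDAC driver's sysfs tree, so the monitoring loop is only a few lines. Below is a minimal sketch, assuming the EDAC driver is loaded and the standard per-memory-controller ce_count files are present; the polling interval and the print statement are just placeholders.

    import glob, time

    def read_ce_counts():
        """Return {memory controller: corrected-error count} from the EDAC sysfs files."""
        counts = {}
        for path in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"):
            mc = path.split("/")[-2]
            with open(path) as f:
                counts[mc] = int(f.read().strip())
        return counts

    baseline = read_ce_counts()
    while True:
        time.sleep(60)                      # polling interval is arbitrary
        current = read_ce_counts()
        for mc, count in current.items():
            delta = count - baseline.get(mc, 0)
            if delta > 0:
                # In production, correlate this with the jobs/users on the node.
                print(f"{mc}: {delta} corrected errors in the last minute")
        baseline = current

In practice you would feed the deltas into whatever job-accounting database you already have, so that a spike can be correlated with whoever was on the node at the time.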

DARPA seeks SSITH lords to keep hardware from the Dark Side

DrBandwidth

DARPA has been funding this sort of work for years

The DARPA CRASH-SAFE project (http://www.crash-safe.org/) delivered a lot of encouraging results during its run (2010-2015).

Many of the ideas developed as part of the CRASH-SAFE project have been picked up by the "Dover" project (http://www.draper.com/solution/inherently-secure-processor), which plans to include them in a RISC-V processor design: https://riscv.org/wp-content/uploads/2016/01/Wed1430-dover_riscv_jan2016_v3.pdf

AMD does an Italian job on Intel, unveils 32-core, 64-thread 'Naples' CPU

DrBandwidth

Article misses a factor of 2 on memory BW

The article states: "Also "Naples" supports up to 21.3GBps per channel with DDR4-2667 x 8 channels (total 170.7GBps), versus the E5-2699A v4 processor's implied 140GBps."

This is based on a faulty interpretation of the statement that the AMD processor has "122% more memory bandwidth". The author apparently interpreted this as 1.22x as much memory bandwidth, to compute 170.7GB/s/1.22 = 140 GB/s. The correct interpretation of "122% more" is "2.22x", yielding 170.7 GB/s vs 76.8 GB/s. This implies 4 channels of DDR4-2400 on each socket of the Xeon E5-2699 v4, which is consistent with Intel's published specifications. The use of "1866 MHz" for the Xeon E5-2699A v4 in an earlier slide may be correct for a configuration with multiple DIMMs per channel -- the details vary by product and are not easy to look up.
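
For reference, the arithmetic (per-channel bandwidth is the transfer rate times 8 bytes per transfer; channel counts and data rates as stated above):

    # Per-channel bandwidth = transfer rate (MT/s) * 8 bytes per transfer.
    epyc = 8 * 2667e6 * 8 / 1e9    # 8 channels of DDR4-2667 -> ~170.7 GB/s
    xeon = 4 * 2400e6 * 8 / 1e9    # 4 channels of DDR4-2400 -> ~76.8 GB/s
    print(epyc, xeon, epyc / xeon) # ratio ~2.22, i.e. "122% more"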

Bug of the month: Cache flow problem crashes Samsung phone apps

DrBandwidth

Seen this one before....

When Apple switched from Motorola's PowerPC G4 (74xx) processors to the IBM PowerPC 970 (aka "G5"), a similar problem occurred. All prior PowerPC processors had used 32-Byte cache lines, and software was written with the expectation that the "DCBZ" (Data Cache Block Zero) instruction would zero 32 Bytes. The PowerPC 970 used a 128-Byte cache line, so the DCBZ instruction zeroed the 32 Bytes that were expected, then continued and zeroed the next 96 Bytes as well. Sometimes it was data, sometimes it was text, but frequently it was a mess. IBM added a mode bit that caused the DCBZ instruction to operate on 32 Bytes instead of the full cache line, and made that the default setting on the parts sent to Apple.
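
A toy simulation of the failure mode (this is not PowerPC code, just a sketch of what happens when a zeroing loop written for 32-Byte cache blocks runs on hardware that clears 128 Bytes per block-zero):

    LINE_ASSUMED = 32    # cache line size the software was written for
    LINE_ACTUAL  = 128   # cache line size the PowerPC 970 actually clears

    def dcbz(mem, addr, line_size):
        # Model of Data Cache Block Zero: clears the whole line containing addr.
        start = (addr // line_size) * line_size
        mem[start:start + line_size] = bytes(line_size)

    mem = bytearray(range(256)) * 2          # 512 bytes of "live" data
    # Zero a 64-Byte buffer at offset 128 the old way: one DCBZ per 32-Byte block.
    for addr in range(128, 192, LINE_ASSUMED):
        dcbz(mem, addr, LINE_ACTUAL)
    # With 32-Byte lines only bytes 128..191 would be zeroed; with 128-Byte lines
    # both DCBZs hit the same line and bytes 192..255 are clobbered as well.
    print(mem[192:256])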

IBM lifts lid, unleashes Linux-based x86 killer on unsuspecting world

DrBandwidth

Don't forget Xeon Phi x200 (Knights Landing)

Mainstream Xeon processors have lower peak bandwidth than the Power8 discussed here, but I routinely sustain >400 GB/s from the 16 GiB MCDRAM memory on my Xeon Phi 7250 systems. The best numbers I have seen are just under 490 GB/s, but that takes a bit of extra tweaking. STREAM Triad gets >470 GB/s with no unusual fiddling required (Flat-Quadrant mode, transparent huge pages enabled, compiled for AVX-512, run with 68 OpenMP threads and launched with "numactl --membind=1").
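
For anyone who has not looked inside STREAM, the Triad kernel and its bandwidth accounting are trivial. The sketch below shows the arithmetic in numpy form; the real benchmark is OpenMP C or Fortran compiled for AVX-512, and a Python version will not get anywhere near the MCDRAM numbers above, but the GB/s bookkeeping is the same.

    import numpy as np, time

    N = 20_000_000                       # arrays much larger than any cache
    a = np.zeros(N); b = np.random.rand(N); c = np.random.rand(N)
    scalar = 3.0

    t0 = time.perf_counter()
    a[:] = b + scalar * c                # Triad: a[i] = b[i] + scalar*c[i]
    t1 = time.perf_counter()

    bytes_moved = 3 * 8 * N              # read b, read c, write a; 8 bytes each
    print(f"Triad: {bytes_moved / (t1 - t0) / 1e9:.1f} GB/s")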

The Power8 has immensely more memory capacity than the 16 GiB of MCDRAM on the Xeon Phi 7250, but only a subset of jobs need both huge bandwidth and huge capacity. Some of these can use x86 via scalable systems like the SGI^H^H^H HPE UV systems. There are other high-bandwidth, huge-capacity solutions as well -- the Oracle T5-8 from 2013 delivered over 640 GB/s from an 8-socket server with 4TiB capacity.

Storage with the speed of memory? XPoint, XPoint, that's our plan

DrBandwidth

Hopelessly inaccurate numbers.....

Minor: As noted above, unloaded DRAM latencies in typical two-socket server systems are in the range of ~85 ns (local) to ~120 ns (remote). These have been increasing slightly over time as the number of cores increases and the core frequencies decrease. Latency under load can be much higher, but in such cases the throughput is typically more important than the latency.

Major: It is a bit difficult to read the chart showing "Price per gigabyte" vs "bandwidth", but if I am interpreting the axes correctly, then the values are off by more than an order of magnitude. For DRAM, the chart shows "price per gigabyte" in the range of $30 to $400, centered somewhere between $100 and $200. This is ridiculous. DRAM wholesale chip costs are in the range of under $4/GiB (4 Gib DDR4 chips) to about $5/GiB (8 Gib DDR4 chips), leading to *retail* prices for registered ECC DIMMs in the range of $6/GB (for 16 GiB and 32 GiB DIMMs). It looks like the SSD pricing is off by almost as much...

Major: What is the y-axis of that chart supposed to mean? "Bandwidth" per what? Price per gigabyte is a reasonably well-defined concept, but "bandwidth" by itself can be interpreted so many ways that it is almost meaningless. The chart shows DRAM "bandwidth" in the 1000 MB/s to 10,000 MB/s range. The low end of this range roughly matches the bandwidth of a single DDR4/2133 DRAM chip with a x4 output configuration, while the high end of the range is only a bit more than 1/2 of the bandwidth available from a DDR4/2133 DIMM.
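
The arithmetic behind those two reference points (2133 MT/s, 4 bits per transfer for a x4 chip, 64 bits per transfer for a DIMM):

    chip_x4 = 2133e6 * 4 / 8 / 1e6    # one x4 DDR4-2133 chip: ~1066 MB/s
    dimm    = 2133e6 * 64 / 8 / 1e6   # one 64-bit-wide DDR4-2133 DIMM: ~17,064 MB/s
    print(chip_x4, dimm)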

Docker kicks KVM's butt in IBM tests

DrBandwidth

Easy explanation

The 2x performance difference in LINPACK is very easy to explain -- the default KVM guest CPU model does not advertise AVX, so the LINPACK code runs using 128-bit SSE instructions instead of the 256-bit AVX instructions that are used in the native and containerized versions. We saw the same thing when we tested KVM at TACC, and (if I recall correctly) it was very easy to fix.
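
A quick way to check whether a guest is in this state is to look at the CPU flags it advertises (a minimal sketch, assuming a Linux guest and the usual /proc/cpuinfo format):

    # Does the (virtual) CPU advertise AVX at all?  If not, optimized BLAS/LINPACK
    # builds will fall back to 128-bit SSE code paths.
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
    print("AVX visible:", "avx" in flags)

The usual fix is to pass the host CPU model through to the guest -- e.g., QEMU's "-cpu host" or libvirt's host-passthrough CPU mode.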

STREAM is actually more difficult, but the two tests that IBM reported were constrained in ways that prevented the trouble from being visible. The single socket STREAM data (IBM's Figure 2) is reasonable for compilation with gcc. With streaming/nontemporal stores the results would be higher -- in the 36 GB/s to 38 GB/s range for native, container, or KVM. The two socket STREAM data (IBM's Figure 3) is only consistent across the three execution environments because they forced the memory allocation to be interleaved across the two sockets. Normally a native run of STREAM would use local memory allocation and get 75 GB/s to 78 GB/s (with streaming stores) or ~52 GB/s to 57 GB/s (without streaming stores). In this case the KVM version typically has serious trouble, since the virtual machine does not inherit the needed NUMA information from the host OS, and often loses close to a factor of two in performance. I don't know whether the containerized solution does better in providing visibility to the NUMA hardware characteristics.
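
One way to see what a given environment actually exposes is to count the NUMA nodes visible in sysfs -- a two-socket host should show two nodes with roughly half the memory on each, while a guest without a virtual NUMA topology typically shows just one. A minimal sketch, assuming Linux and the standard /sys/devices/system/node layout:

    import glob

    # How many NUMA nodes does this environment actually see?
    nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))
    print(f"{len(nodes)} NUMA node(s) visible")
    for node in nodes:
        with open(node + "/meminfo") as f:
            kib = int(next(l for l in f if "MemTotal" in l).split()[-2])
        print(f"  {node.split('/')[-1]}: {kib // 1024} MiB")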

GE puts new Nvidia tech through its paces, ponders HPC future

DrBandwidth

Arithmetic, anyone?

Interesting piece, but I worry about any table of results that does not appear to be internally consistent....

The piece does not define exactly what is meant by "latency", but it is odd that the results in the second and third columns (both labelled "latency") are precisely 1/2 of the transfer time that one would compute using the complicated mathematical formula "time = quantity / rate". In this case, for example, one would expect transferring 16 KiB at a rate of 2000 MB/s to take 8.192 microseconds, rather than the 4.09 microseconds stated in the table.
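
The arithmetic in question:

    size_bytes = 16 * 1024          # 16 KiB
    rate = 2000e6                   # 2000 MB/s (decimal), in bytes/second
    print(size_bytes / rate * 1e6)  # -> 8.192 microseconds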

Latency might mean "time for the first bit of the data to arrive" or it might mean "time for the entire block of data to arrive". The latter is not possible given the stated numbers (since all the computed transfer times are precisely twice the stated "latencies"), while the former would imply a mildly perverse buffering scheme that always buffered precisely 1/2 of the data before beginning to deliver it to the accelerator.

Buffering exactly 1/2 of the data is perhaps not as crazy as it sounds -- such schemes are sometimes (often?) used in optimized rate-matching interfaces. If the input is guaranteed to be a contiguous block, then buffering exactly 1/2 the data allows the buffer to transmit the output data at 2x the input rate after pausing for 1/2 of the transfer time. Such a scheme minimizes the latency between the arrival and delivery of the final bit of data in the input block. Unfortunately, it also makes the actual hardware latency invisible (provided that the hardware latency is less than 1/2 of the transfer time of the smallest block with reported results).
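
A small timing model of that scheme (my own sketch, not anything from the article) shows why the reported numbers come out at exactly half the transfer time:

    # Toy model: input arrives at rate R; output runs at 2R but waits until half
    # the block has been buffered, so the final input and output bits coincide.
    def half_buffer_schedule(block_bytes, input_rate):
        t_in_done   = block_bytes / input_rate             # last input bit arrives
        t_out_start = (block_bytes / 2) / input_rate       # wait for half the block
        t_out_done  = t_out_start + block_bytes / (2 * input_rate)
        return t_out_start, t_out_done, t_in_done

    print([t * 1e6 for t in half_buffer_schedule(16 * 1024, 2000e6)])
    # -> [4.096, 8.192, 8.192] microseconds; the output-start time matches the
    #    4.09 us "latency" reported for the 16 KiB case.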

Whether this buffering scheme makes sense depends a lot on the data access patterns of the subsequent processing steps. If the subsequent step demands that a full block be in place before starting, then this is the way to go. On the other hand, many signal processing algorithms could pipeline operations with data transfers in smaller blocks, in which case a different buffering scheme might make more sense.