Scale-out sister? Unreliable disks are better for your storage

Recently I was asked to review a document that used as a reference a piece of work from Google (PDF) which talked about the need to relax the resiliency levels of hard drives and SSDs. The premise is interesting: hyper-scalers claim they could do a better job in managing performance and availability if the HDDs they use were …

  1. Paul Crawford Silver badge

    Not so new

    The "enterprise" HDD normally had a short re-try time because they were typically used in RAID where it matters a lot less if a sector is bad as it can be fixed from the parity. Of course, they usually also promised better integrity like ECC RAM and so on, more reliable mechanics, etc. Whether it really was delivered in all cases is another matter...

    Of course we see different options being sold for this (such as WD 'Red', etc.), so I doubt very much that the HDD makers are willing to lose those profit margins by making an HDD that lets you configure key settings, like the retry time-out, to help its RAID-using customers.

    1. toughluck

      Re: Not so new

      Google are talking out of their asses again. I built a homemade NAS array with Western Digital's Green 1 TB drives without knowing anything about TLER or failure rates (or that they're 4K drives pretending to be 512B). You have to take that into account when later using the array.

      What Google should do is detect that the drive is going into recovery mode and recreate the data from parity or mirror on the fly.

      By all means, if they feel like it, they can even divide the disk into a billion partitions and span them per disk (leaving some as spares) before adding the spans into an array. That way, if there's a failure in such a partition, they can simply mark that partition as invalid and swap its data onto a spare partition on the same disk.
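
      A toy sketch of that partition-sparing idea (a hypothetical scheme with invented numbers, not anything Google is known to run):

      ```python
      # Toy model of per-disk partition sparing: split a disk into many
      # small partitions, hold a few back as spares, and when a partition
      # throws errors, retire it and swap in a spare on the same disk.
      # The caller then rebuilds its contents from parity or a mirror.

      class Disk:
          def __init__(self, n_partitions=1000, n_spares=20):
              self.active = list(range(n_partitions - n_spares))
              self.spares = list(range(n_partitions - n_spares, n_partitions))
              self.failed = set()

          def retire(self, partition):
              """Mark a partition bad and swap in a spare from the same disk."""
              if not self.spares:
                  raise RuntimeError("out of spares: fail the whole disk")
              self.failed.add(partition)
              replacement = self.spares.pop()
              self.active[self.active.index(partition)] = replacement
              return replacement  # rebuild its data from parity/mirror

      disk = Disk()
      spare = disk.retire(disk.active[42])
      print(f"partition retired; rebuild its data onto partition {spare}")
      ```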

      But that's difficult and would require that they modify some software or administer it in a different way. I would have thought that of all companies, Google would have had the people to do it.

      1. TheSolderMonkey

        Re: Not so new

        I think you may be missing the point.

        The drive with a bad block may or may not give a warning via SMART; that's irrelevant.

        The problem is that a desktop drive will realise that it's having problems reading the sector the controller has asked for and will retry many, many times. This effectively stalls the drive for seconds at a time. If the drive is part of a stripe, stripe operations can stall too.

        In the desktop environment, the user would want the drive to do everything to get the data back. You only have a single user, so what if things slow a little? That file could be a picture of fluffy kittens or something of equally life-changing import.

        In an array, why the hell would you want a disk to attempt its own error recovery procedures (ERPs)? You can rebuild the bad block from the other disks in a fraction of the time the single disk takes to run its ERPs.

        You want the single disk to fail as quickly as possible. You reconstruct the data fast and write-verify it back to the single disk. If the single disk fails the write, the block gets added to the grown defect list (G-list) and reallocated.
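
        A rough sketch of that flow (RAID-5-style XOR parity; the timeout value and helper names are mine, purely illustrative):

        ```python
        # Illustrative array-side flow: fail the read fast, rebuild the
        # block from the surviving members' XOR, then write-verify it back
        # so the drive can reallocate the sector onto its G-list if needed.

        def xor_blocks(blocks):
            """XOR equal-sized byte blocks together."""
            out = bytearray(len(blocks[0]))
            for block in blocks:
                for i, b in enumerate(block):
                    out[i] ^= b
            return bytes(out)

        def read_block(n_members, bad_member, read_with_timeout, write_verify):
            data = read_with_timeout(bad_member, timeout_s=0.1)  # fail fast
            if data is not None:
                return data
            # Rebuild from the other members (data blocks XOR parity block).
            survivors = [read_with_timeout(m, timeout_s=0.1)
                         for m in range(n_members) if m != bad_member]
            rebuilt = xor_blocks(survivors)
            write_verify(bad_member, rebuilt)  # drive reallocates on write failure
            return rebuilt
        ```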

        It's not really about reliability; the article isn't great in that respect. It's about response time. The actual bit error rates and FIT rates aren't a hell of a lot different between enterprise and desktop drives, but the response of a drive to an error is very different.

        1. toughluck

          Re: Not so new

          There's one problem with that. When I bought disks for my array, I chose WD's Green 1 TB (WD10EARS). I didn't know or care about TLER (that's time-limited error recovery), and I had no idea those drives had physical 4K sectors but reported 512B sectors to the OS.

          What did WD offer?

          - WD10EARS with no TLER and no way to enable it, and with 4K physical sectors reported as 512B logical and 512B physical -- for 50€.

          - WD10something with configurable TLER, and with 4K physical sectors correctly reported as 512B logical / 4K physical so that the OS or the HBA could choose what to do with them -- for 150€.

          Your choice is to live with the limited features but save that 100 euro per drive. What Google did is buy the cheapest drives available, realize that some features were missing but available in more expensive models, and now ask the drive manufacturers to add them back for no charge.

          Good luck with that.
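
          For what it's worth, both things can be inspected from the OS with smartmontools; a quick sketch (the device path is an example, and drives like the Green typically just reject the SCT ERC command):

          ```python
          # Query sector sizes and SCT Error Recovery Control (TLER)
          # timeouts via smartmontools. /dev/sda is an example device.
          import subprocess

          def smartctl(*args):
              return subprocess.run(["smartctl", *args, "/dev/sda"],
                                    capture_output=True, text=True).stdout

          for line in smartctl("-i").splitlines():
              if "Sector Size" in line:  # e.g. "512 bytes logical, 4096 bytes physical"
                  print(line.strip())

          print(smartctl("-l", "scterc"))    # current read/write ERC timeouts
          # smartctl("-l", "scterc,70,70")   # set both to 7.0 s, if supported
          ```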

  2. quxinot

    I was just going to say many of those things.

    The article says "Google wants" this and that, and that's fine. Seeing as the majority of us are not working there, we don't much care. It's a shame that drives (along with a fair few other hardware items, and non-computery stuff as well....) don't allow some configuration, perhaps by firmware reflashing, to let us choose settings appropriate to the job.

    Instead of the usual "Oh, sir wants reliability AND speed? The ladder to reach the top shelf is over there..."

  3. Nate Amsden

    3par does this too

    Working with SSD vendors to reduce the space allocated for bad cells and give it to the array (20%, I think). Since 3par operates with chunklets, if some flash goes bad it just stops using it. If too many chunklets go bad, the drive is proactively failed. HP calls this "adaptive sparing".
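
    Roughly, that policy looks like this (a toy model; the threshold is invented, not HP's actual number):

    ```python
    # Toy model of chunklet-level retirement with a proactive-fail
    # threshold. The 2% figure is an assumption for illustration.

    class FlashDrive:
        def __init__(self, n_chunklets=10_000, fail_fraction=0.02):
            self.good = set(range(n_chunklets))
            self.bad = set()
            self.fail_threshold = int(n_chunklets * fail_fraction)

        def mark_bad(self, chunklet):
            """Stop using a worn-out chunklet."""
            self.good.discard(chunklet)
            self.bad.add(chunklet)

        def should_proactively_fail(self):
            return len(self.bad) > self.fail_threshold
    ```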

    My org's oldest 3par flash is about 22 months old and, as far as media life goes, the SSDs have 98% of their life left. (If you were to source these SSDs direct from SanDisk they would be read-intensive SSDs, though HP doesn't limit their workload under the 5-year warranty.)

    Google could do what XIO did and just partner with the HDD makers. Or go buy XIO (and stick to Seagate drives; last I read, XIO's tech was specific to those). XIO can't be too much to buy.

  4. Anonymous Coward
    Anonymous Coward

    QLC - slightly lower cost per GB, massively reduced lifespan

    QLC gives a 33% capacity boost over TLC, but from what I heard at a conference recently, it reduces the longevity by almost an order of magnitude in terms of PE cycles.

    If that's correct, 3D NAND is the way forward and should get us to the next mainstream medium after NAND flash.

    What have other El Reg readers heard about the likely P/E cycles of QLC NAND?

    Is it really like "White Hole" in Red Dwarf, where they increase Holly's IQ to over 12,000 but reduce her life expectancy to three minutes?
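
    For a rough sense of scale, here's a back-of-envelope endurance comparison; the P/E cycle counts and write amplification are assumptions, not vendor specs:

    ```python
    # Back-of-envelope endurance: total bytes written over a drive's
    # life, in TB. All inputs are illustrative assumptions.
    def tbw(capacity_gb, pe_cycles, write_amplification=2.0):
        return capacity_gb * pe_cycles / write_amplification / 1000

    tlc = tbw(capacity_gb=1000, pe_cycles=3000)  # ~1500 TB written
    qlc = tbw(capacity_gb=1333, pe_cycles=300)   # ~200 TB, despite +33% capacity
    print(f"TLC: {tlc:.0f} TBW vs QLC: {qlc:.0f} TBW")
    ```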

  5. Anonymous Coward
    Anonymous Coward

    Using the most common tools on Google Drive, the risk of loss is very small, because the data is stored by their software in at least three locations in one data centre and replicated to one other data centre; should one of those three locations fail, they can get it from the other locations.

    My objection to the article is that it gives the impression that HDDs are so reliable (when they are not: read the stats from Google on disk failures too) that they can be relied upon for storage. HDDs are nice for temporary storage but not for permanent storage.

    1. Anonymous Coward
      Anonymous Coward

      > HDDs are nice for temp-storage but not for permanent storage.

      What other technology are you proposing for permanent storage? Tape perhaps?

      Hard drives certainly do fail, but in my experience they often fail slowly - a few sectors here and there. The majority of the data can still be retrieved. An SSD, on the other hand, when it fails (and they often do), fails hard. Total loss.

      In both cases, if you have two or three copies of the data, you can reduce the chance of data loss to "sufficiently small".

      But until you've copied the data elsewhere you have a window of increased risk. The HDD gives you a better chance of recovery, since if two or three drives fail, they are unlikely to have all lost the same piece of data.

      Traditional RAID arrays tend to squander this advantage by kicking the *whole* drive out of the array at the first sign of trouble; but I expect Google's software is smarter than that. (ZFS is also smarter).
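
      The arithmetic behind "sufficiently small" is simple enough; a rough independent-failure model (the 3% annual failure rate is an assumption, loosely in line with published fleet stats):

      ```python
      # Rough model: data is lost only if every remaining copy fails
      # before a lost replica can be re-created. Assumes independent
      # failures, which real arrays only approximate.
      afr = 0.03           # assumed annual drive failure rate
      rebuild_days = 1     # window before a lost copy is re-replicated
      p_window = afr * rebuild_days / 365

      for copies in (1, 2, 3):
          p_loss = afr * p_window ** (copies - 1)
          print(f"{copies} copies: ~{p_loss:.1e} loss chance per drive-year")
      ```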

      1. Anonymous Coward
        Anonymous Coward

        I do not think any one piece of hardware is suitable for permanent storage, only a combination of three or more.

        I do see read/write specialists and hardware makers further optimising hardware and low-level software, but end users will be using more and more cloud-based storage, where they will neither know nor care where exactly their data is stored.

      2. Eric 23
        Paris Hilton

        Hierarchical Storage Management (HSM)

        Tape is fairly reliable, and has a nice side effect of being very power efficient.

        When I first think of HSM, SamFS comes to mind, as I used to work in a shop that used it, and it worked really well. I think Sun released it as open source years back; no clue about its current status.

        I have no experience with it, but there's also the Linear Tape File System (LTFS), which sounded promising.

  6. TWB

    Interesting...

    A couple of things I took from the article. First, HDDs are very reliable - yes, and when reliable things fail, the effect is often dramatic. I can make an analogy here: where I work, slightly less reliable systems that need 'occasional' nursing become far more familiar to the support guys, so when they go wrong they can be restored and fixed very quickly. Stuff that runs for years and years gets forgotten about, so when it fails it can often take a long time to fix.

    Secondly, the bit about HDDs retrying a mis-read: important for my bank balance and detailed documents, but for a video server playing a TV programme out to an audience, you don't have time for another try at reading - much better to show a bit of mangled image and press on regardless than to stutter, skip or stop.
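
    In code, that playout trade-off is just a deadline read that substitutes filler rather than blocking (a sketch; read_fn and the deadline are illustrative):

    ```python
    # Deadline read for playout: if the block doesn't arrive in time,
    # return filler (one mangled block) instead of stalling the stream.
    import concurrent.futures

    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def read_or_filler(read_fn, block_id, deadline_s=0.02, block_size=4096):
        future = _pool.submit(read_fn, block_id)
        try:
            return future.result(timeout=deadline_s)
        except concurrent.futures.TimeoutError:
            return b"\x00" * block_size  # press on with a mangled block
    ```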

  7. Anonymous Coward
    Anonymous Coward

    Google is full of shit.

    HDD manufacturers obscure a lot of diagnostics and firmware settings in order to prevent end users from ruining the drives. There is plenty of information already exposed in S.M.A.R.T. to help any system administrator make reasonable predictions about when a drive is going to fail.
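
    For example, the attributes most admins watch for failure prediction can be pulled with smartmontools (the device path is an example):

    ```python
    # Print the SMART attributes most often used to predict failure:
    # reallocated, pending, and offline-uncorrectable sector counts.
    import subprocess

    WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
             "Offline_Uncorrectable")

    out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if any(attr in line for attr in WATCH):
            print(line.strip())  # rising raw values are a bad sign
    ```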

    Google's main gripe is "the current generation of modern disks, often called "nearline enterprise" disks, are not optimized for [Google's] use case".

    In other words, Google is too cheap to purchase true enterprise-grade disks, and they're whining over the fact that manufacturers obscure a lot of firmware data on the cheaper models. Perhaps someone should clue them in to the fact that enterprise grade also includes hardware that doesn't exist on nearline and consumer-grade drives, and that's why there's less diagnostic data on the cheaper models.

    1. Ammaross Danan

      Re: Google is full of shit.

      You missed the mark. They're not complaining that they can't access error rates in SMART or the like; they're complaining that when a read error occurs, they don't get an API event they can respond to which tells the drive to fast-fail the read rather than running its pre-programmed head-park, re-seek, re-read routine three times before finally giving up and logging an uncorrectable in SMART. This fast-error path would maintain the performance of the drive, while also allowing Google et al to reconstruct the lost sector from other sources and relocate that data on the fly.
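
      Something like this, if it existed (an entirely hypothetical interface; no current drive exposes it):

      ```python
      # Entirely hypothetical sketch of the "fast-error" event being
      # described: the drive reports an unreadable LBA immediately
      # instead of running its full retry routine, and the host
      # rebuilds the data from redundancy and rewrites it.

      def on_read_error(drive, lba, rebuild_from_redundancy):
          drive.abort_recovery(lba)            # hypothetical: skip head-park/re-seek/re-read
          data = rebuild_from_redundancy(lba)  # e.g. from parity or a replica
          drive.write(lba, data)               # drive reallocates the sector if needed
      ```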

  8. Anonymous Coward
    Anonymous Coward

    "reliability"

    Kinda depends on what your definition of "reliability" is. Depends on what your application is. If you don't have means to retrieve data in the face of drive failures (say a desktop), you want that drive to do everything possible to retrieve your precious data. Doesn't matter if it takes a second or two (!) since the machine is gonna crash or do some other bad thing if the drive can't get that data. On the other hand, if you've got some nice multi-drive array with fancy data redundancy, you want the drive to be predictable, transparent about status and provide means to control the drive in the face of errors.

    In my view, a big problem is that nobody wants to spend the time to create, and get buy-in for, a different standard for array-targeted vs standalone-targeted (aka standard) drives. Lots of users went and demanded custom specials from suppliers, which increases cost, complexity, bugs, etc. Getting adequate profits for the manufacturers along with lower prices for customers requires a crisper set of requirements that allows manufacturers to cut costs. It's not just about squeezing the manufacturers harder or convincing them to "do better". Thus, I believe the Google paper is a nice start, but it doesn't address some of the business aspects, nor does it start with a clean (enough) sheet of paper.

    1. Anonymous Coward
      Anonymous Coward

      Re: "reliability"

      > In my view, a big problem is that nobody wants to spend the time to create, and get buy-in for, a different standard for array-targeted vs standalone-targeted (aka standard) drives.

      Isn't that exactly what the manufacturers are doing - for example WD making and selling Red, Blue, Green, Purple and Black drives to different market segments?

      Agreed, they're mostly the same drive with a different sticker and maybe different parameters in the firmware. But this is precisely the level of customisation Google are asking for: send more diagnostic messages, don't try to be too clever in error recovery.

      1. Anonymous Coward
        Anonymous Coward

        Re: "reliability"

        The Google paper talks about wanting different z-heights, head failure management, actuator architecture changes, etc., which aren't in the same league as firmware tweaks.

  9. Tom 64
    Windows

    Multiple actuators

    I've often wondered why HDDs have only one set of read/write heads on a single actuator.

    There is easily enough room in a standard form-factor for two arms, or maybe even four.

    Yes, it's more stuff to go wrong, but the performance boost, particularly for random read/write ops, would be well worth any reliability hit, and this would be a boon to any customer, enterprise or consumer.

    1. richardcox13

      Re: Multiple actuators

      I doubt it will happen. SSDs are already so much quicker for random ops that the necessary investment in spinning rust, its microcontrollers and firmware would never pay off.

      Cheaper to invest in (near) real-time replication from an SSD array (satisfying the application's IO needs) to a high-reliability array (batched writes, and an otherwise lifetime-focused design).

  10. Stuart Halliday

    While they're at it, can we get better reporting of WiFi errors? At present I'd have to buy a specialist device to read errors in WiFi streams. Android, iOS and Windows need to be able to report errors in frames to the user before the stream collapses.

  11. TheSolderMonkey

    That's the main difference between an enterprise drive and a desktop drive.

    The desktop drive does everything it can to recover your data - response time be damned, whereas the enterprise drive does everything it can to meet the response time, data recovery be damned.

    Sounds like Google (and some journos) are learning storage 101.

  12. toughluck

    And yet, array controllers or control software are dumb.

    Suppose an array is resyncing and is at 20%, and you want to read data at 40-50%. What does the controller do? It slows down the resync and thrashes the heads to read the data at 40% with the active spindles, while the spare continues rebuilding at 20%.

    What it could do is jump to the data you are requesting, resync there, and return to the 20% mark later, once read/write activity has died down.
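
    The smarter behaviour is easy enough to sketch (all names illustrative; real controllers would track this per stripe or extent):

    ```python
    # Sketch of a resync scheduler that prioritises the region a host
    # read just touched, then carries on where it left off, instead of
    # serving reads from degraded stripes while rebuilding sequentially.

    class ResyncScheduler:
        def __init__(self, n_regions):
            self.pending = list(range(n_regions))  # regions not yet resynced

        def on_read(self, region):
            """Bump a just-read region to the front of the rebuild queue."""
            if region in self.pending:
                self.pending.remove(region)
                self.pending.insert(0, region)

        def next_region(self):
            return self.pending.pop(0) if self.pending else None
    ```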

    --

    There's a similar case with RAID 1+span, where a single large drive mirrors multiple smaller drives (e.g. one 3 TB drive mirrors a span of three 1 TB drives). If a read request comes into an area already resynced - bam - the array will use the rebuilding spare to serve the read.

    --

    There are a lot of small niggles like this in various software that are very annoying, often without you even realizing it.
