HPE storage meltdown at Australian Tax Office lost no taxpayer data

The Australian Taxation Office (ATO) has revealed that no taxpayer data was lost as a result of the weird crash of Hewlett Packard Enterprise 3Par storage kit. The ATO described the Monday outage as the result of a never-before-experienced problem that HPE was working to fix. The Register understands a 3Par storage rig was to …

  1. Spotswood

    We lost...

    A *Petabyte* of data; it literally screwed every system we have, except 'Taxpayer data'... well, thank God you had that on your 'other' storage...

    Am I the only one who smells bullshit here?

    1. Anonymous Coward

      Re: We lost... no jobs either !

      After all, the number 1 priority is arse covering. Number 2 is finger pointing. Number 3, somebody fixes the issue.

      1. Anonymous Coward

        Re: We lost... no jobs either !

        ATO jobs are safe. After all they purchased the top tier of support.

        You still get the same support engineer and response time as basic support, but the vendor will take the blame. That's the true value of premium support....

      2. Trigonoceps occipitalis

        Re: We lost... no jobs either !

        NO!

        Number one priority is "Learning Lessons."

  2. GortonSM

    so many eggs and only one basket?

    The whole scenario depicted sounds peculiarly odd (read fanciful should you choose):

    1. Single storage silo disrupting so many services? Surely we have mirrored services across different storage silos to minimise the "blast radius"?

    2. Restoring from backup failed? What's the betting on where the backup indexing or metadata was held?

    3. "Unique set of circumstances never experienced before in the whole world" aka no-one else stupid enough to architect anything similar.

    Don't blame the vendor (who will take your money); look internally...

    1. Anonymous Coward

      Re: so many eggs and only one basket?

      I agree. I worked for many years at storage vendors. Only a few outages are the result of firmware bugs or multiple hardware component failures; I'd say 95% of outages are caused by the architecture and/or by driving the system into the ground.

      I can tell you that the people responsible for selling, architecting and buying these solutions get paid much, much more than the people who fix the mess. Despite what vendors claim, there are no tools that can reliably predict sizing for complex workloads. Most SEs just follow the "next, next, next" wizard, which asks a few basic questions. Finally the system gets sized based on what the customer is willing to pay.

      If the customer is large enough and has a prestigious name, the vendor may sell under COGS if it'll look good in the paper. The ATO is one of those customers.

    2. eldakka

      Re: so many eggs and only one basket?

      Couple notes:

      1) Modern SANs virtualise storage across several physical arrays. A filesystem on a server, sitting on a volume that has been grown several times, could therefore have chunks of storage supplied from different arrays; but because it's a single filesystem on the server, if one of those chunks disappears (even if it's only 20% of the total volume size) the entire filesystem is corrupt and useless.

      As for mirrors: you do realise a mirror replicates data, no matter what the data is, across all the 'mirrors'. If you accidentally delete file A from the filesystem, that file will be removed from the mirror as well. A mirror only protects you from certain hardware failure modes. If the failure mode is such that the array controller starts corrupting all data on the array, that corruption will be mirrored onto the mirrored array (sometimes people call that the 'backup array', but that is not correct: it is a live copy of the data, which includes deletions and file corruptions). Or if, say, an admin does an rm -rf /, all that deleted data will be deleted from the mirror as well (there's a toy sketch at the end of this comment illustrating the point). They DO have mirrors, but the failure was such that it was replicated to the mirror(s) as well.

      2) Where do they say restoring the backups failed? It's more than a PETABYTE of data. It could take days, even weeks, to restore that much data from tape (some rough numbers below). Not to mention that, depending on the failure that actually occurred, they may not have anywhere available to restore the data to.

      If multiple arrays were impacted (which seems to be the case), they are not going to restore the data to those arrays until they know exactly what happened and whether it will happen again; they probably don't TRUST those existing arrays to restore the data to. Until they've verified that the error won't recur (e.g. they've fired the person who typed the "re-initialise array as new system" command, or the controller that was corrupting data has been replaced and confirmed to be a fault in that particular device rather than a general issue with that model of controller that could crop up in another instance, and so on) they won't want to restore to those arrays and put them back into production.

      So they might have to obtain new arrays, whether through purchase, HPE 'lending' them an array while they figure out what went on, or re-tasking other arrays used for lower-priority purposes (i.e. freeing up space on those existing arrays). In any of those cases, it takes time to get the arrays. In the Australian market, if you order a large piece of hardware you usually have to wait for it to be shipped internationally. With the just-in-time inventory systems most large vendors use now, unlike even 10 years ago, vendors in Australia no longer keep large systems (or even significant quantities of small systems or replacement parts) sitting in local warehouses for immediate delivery. If you are lucky they may have one in transit to another customer that they can divert to a higher-priority customer, and then send that customer the one ordered for you when it arrives in a couple of weeks.
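
      To put some very rough numbers on the restore time, here's a back-of-envelope sketch. None of these figures are the ATO's; the 1 PB size and the ~300 MB/s per tape drive are illustrative assumptions, and it ignores media handling and verification entirely.

```python
# Back-of-envelope restore-time estimate. All figures are illustrative
# assumptions, not ATO numbers: 1 PB of data, tape drives sustaining roughly
# 300 MB/s each, everything streaming perfectly in parallel.

DATA_BYTES = 1 * 10**15              # one petabyte (decimal)
DRIVE_BYTES_PER_SEC = 300 * 10**6    # assumed sustained throughput per drive

for drives in (1, 4, 16):
    seconds = DATA_BYTES / (DRIVE_BYTES_PER_SEC * drives)
    print(f"{drives:>2} drive(s): ~{seconds / 86400:.1f} days of continuous streaming")

# Roughly 38.6 days on one drive and still ~2.4 days across 16 drives, before
# you add tape handling, verification, and replaying databases and transaction
# logs on top of the restored files.
```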

      Let's wait for more information before throwing stones.
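
      To make the mirror point above concrete, here's a toy sketch. The classes are invented purely for illustration (this is not any vendor's replication API); it just shows that a synchronous mirror replays every operation, including the ones you wish it hadn't.

```python
# Toy illustration: a synchronous mirror applies every operation to both legs.
# That protects you against losing a leg, not against bad operations.

class Volume:
    def __init__(self):
        self.files = {}

    def write(self, name, data):
        self.files[name] = data

    def delete(self, name):
        self.files.pop(name, None)

class MirroredVolume:
    def __init__(self):
        self.primary = Volume()
        self.mirror = Volume()

    def write(self, name, data):
        for leg in (self.primary, self.mirror):
            leg.write(name, data)

    def delete(self, name):
        for leg in (self.primary, self.mirror):
            leg.delete(name)

vol = MirroredVolume()
vol.write("tax_return_2016.db", b"important taxpayer data")
vol.delete("tax_return_2016.db")   # fat finger, rm -rf, or a controller gone rogue
print(vol.primary.files)           # {} -- gone here...
print(vol.mirror.files)            # {} -- ...and gone on the 'backup' too
```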

      1. Anonymous Coward

        Re: so many eggs and only one basket?

        As I understand it, the 'backup' system that should have kicked in instantly is a full SAN storage replica, and it failed because the problem is data corruption, which, as eldakka pointed out, was unhelpfully replicated.

        There are presumably then other backup systems which take more time to become useful.

        I don't believe HPE would let the customer foist the blame onto them in this case if they could possibly avoid it.

        -

        If we are having a sweepstake on the root cause I'll go for de-dupe.

        1. Solmyr ibn Wali Barad

          Re: so many eggs and only one basket?

          "If we are having a sweepstake on the root cause I'll go for de-dupe."

          Plausible. Wouldn't be the first case of someone getting duped by de-dupe.

        2. Anonymous Coward

          Re: so many eggs and only one basket?

          "If we are having a sweepstake on the root cause I'll go for de-dupe."

          All it takes is a multi-disk failure so that your LUNs and volumes go into R/O mode. The sysadmin panics and reboots the server. Or the support engineer panics and clears the write cache. Wouldn't be the first time for HP.

          So you lose a bit of data in flight and you cannot know exactly where, especially when you do additional host-based data management, as you would for a database.

          That data loss gets replicated to the mirror site.

          Restoring a PB will take too long, so your next best option is to let the hosts run fsck and hope that your DB and transaction logs weren't affected.

          All of that could have been done yesterday, but customers usually spend a day hoping that the storage vendor pulls a magic bullet out of their arse and performs some sort of array-based recovery, which will never happen.

          As soon as data corruption becomes a topic, both customer and vendor management become paralysed, clutching at straws, and decision-making slows down drastically.

      2. highdiver_2000

        Re: so many eggs and only one basket?

        Paragraph 1 makes some sense if the SAN rig had been running for years with lots of shuffling of the SAN. Not in this case: this one is at most a year old.

        http://www.crn.com.au/news/ato-ripping-out-emc-storage-and-moving-to-hpe-441262

    3. DaLo

      Re: so many eggs and only one basket?

      "3. "Unique set of circumstances never experienced before in the whole world" aka no-one else stupid enough to architect anything similar."

      And yet King's College London had a suspiciously similar issue very recently with HP kit (presumed 3PAR)

      http://www.theregister.co.uk/2016/10/31/a_fortnight_of_woes_gone_and_a_fortnight_to_come_as_kcl_outage_continues/

      1. Anonymous Coward

        Re: so many eggs and only one basket?

        No need to architect anything; it is explained by simple statistical thinking. Most bugs are unique, encountered only by one operator. You're just used to the hyperscale computing of PCs and phones, where there are hundreds of millions of devices running the same software, and so even the smallest bug is widely experienced.

        But this is a SAN. They might have sold 100,000. So most bugs are still new bugs.
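
        Rough illustrative numbers, if you want them: assume a latent bug that any given installation has a one-in-a-million chance of ever tripping (that probability is an assumption, and the fleet sizes are round figures).

```python
# Expected number of installations that ever hit a given rare latent bug,
# under an assumed one-in-a-million trigger probability per installation.

p_trigger = 1e-6   # illustrative assumption

fleets = {
    "smartphone model":     100_000_000,   # hundreds of millions of devices
    "enterprise SAN model":     100_000,   # the '100,000 sold' figure above
}

for name, installs in fleets.items():
    print(f"{name:>22}: ~{installs * p_trigger:g} installations expected to hit it")

# ~100 phone users hit it, so it gets reported and patched; ~0.1 SAN customers
# hit it, so whoever does is very likely the first in the world to see it.
```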

    4. Anonymous Coward

      Re: so many eggs and only one basket?

      What makes you think tape is still used? Good practice used to be that two vendors were used for critical back-end hardware such as storage, so the same bug did not crash all systems. But it was a cheaper tender, and all the knowledgeable staff were outsourced decades ago, so the vendor's shiny-shiny is believed.

    5. Vic

      Re: so many eggs and only one basket?

      "Unique set of circumstances never experienced before in the whole world" aka no-one else stupid enough to architect anything similar.

      I'm not so sure about that. It sounds rather like the SAN meltdown at KCL not so long ago - wasn't that 3PAR as well?

      Vic.

  3. Anonymous Coward

    Believing the Vendor

    HPE sell their 3PAR as an "enterprise" array. The array is not active-active on its controllers: the way a four-controller array is designed, half the disks are owned by two of the controllers and the other half by the other two. When LUNs are created to allocate to servers, they are built in 1GB increments, one from one set of controllers, then 1GB from the other set. If you lose the right two controllers, you have lost access to all the disks, due to that alternation of allocation across all the disks. I think they lost the right two controllers (or rather, the wrong two).
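
    A toy model of what I mean (I'm not claiming this matches real 3PAR internals; it just illustrates why losing one whole controller pair would take out every LUN under the layout described above):

```python
# Toy model: LUNs built from 1 GB regions that alternate between two
# controller node pairs. Purely illustrative, not actual 3PAR internals.

NODE_PAIRS = ("pair_A", "pair_B")   # e.g. controllers 0+1 and 2+3

def build_lun(size_gb):
    """Return the node pair serving each 1 GB region of the LUN."""
    return [NODE_PAIRS[i % 2] for i in range(size_gb)]

luns = {f"lun{i}": build_lun(8) for i in range(4)}   # four small example LUNs

failed_pair = "pair_B"              # both controllers of one pair go down

for name, regions in luns.items():
    unreachable = sum(1 for pair in regions if pair == failed_pair)
    print(f"{name}: {unreachable}/{len(regions)} regions unreachable -> LUN unusable")

# Every LUN loses half its regions; a filesystem or database with holes in the
# middle of it is effectively offline, so the whole estate goes down at once.
```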

  4. Anonymous Coward

    Hope they were not using peer persistence

    Several corner cases were found when testing peer persistence for multi-site failover that could cause data loss or logical corruption. HPE decided not to fix them fully.

  5. JJKing

    It's a mess.

    Like I suspect many people, I know someone in ATO tech, and it's a clusterfuck. My friend is not that experienced, but a while back their lead engineer was transferred to another project and my friend was left as the senior engineer in charge. My friend had no idea what needed to be done, because they are at the "learning how to do it" stage of their career, and it scared them shitless. My friend and most of the support people were extremely stressed during the whole saga.

    The ATO keeps shedding very experienced employees and then wonders why this sort of thing happens.
