HPE says 3PAR problem that broke Australia was a one-off. Probably

Hewlett Packard Enterprise says the 3PAR Storage SNAFU that took the Australian Taxation Office offline for a week does not appear to be a systemic problem. The ATO went down hard last week, with many online services that citizens, businesses and the bean-counting industry rely on disappearing from the internet as a result. …

  1. Nate Amsden

    got to be careful with high automation

    When you try to build the most efficient, automated system possible, there will always be gaps, and sometimes big ones that people don't expect.

    If you want a "real" backup then the backup cannot be connected to the primary (e.g. real time replication or clustering). A tightly integrated backup protects against many failure scenarios but obviously cannot protect from all.

    I endured a similar event on a 3PAR system close to 7 years ago now, and I learned a lot during the process. The support (at the time) was outstanding (since HP took over it has been closer to adequate than outstanding), and made me a more loyal customer as a result. At the time, 3PAR determined that case was a one-off as well. The backups I had at the company were limited to small-scope tape backups due to limited budget. Fortunately I was able to pull some miracles out of my ass and bring everything back online in a few days (the storage array itself was back online in a few hours). After all of that, the company axed the disaster recovery budget I had worked on for a month in order to give the funds to another project that they had massively under-budgeted for. I left a couple of weeks after that.

    I was part of another full array failure and data loss event more than a decade ago on an EMC system; that was an interesting experience as well. I wasn't responsible for that system at the time (I supported the front end apps). Maybe 35 hours of downtime, and we were recovering from the occasional corrupted data thing in Oracle for the next year or two that I was at the company.

    The key is of course to realize no system is invincible. There are bugs, there are edge cases, and in highly complex environments those can be nasty. It's certainly very unfortunate that this customer got hit by one of those, but it wasn't the first, and it won't be the last.

    The biggest outages I have been a part of have been application-stack related.

    Some of the more recent management I work with freak out when shit is down for an hour or two. Oh my, they have no idea how bad things can get.

    This kind of thing has also kept me more in HP/3PAR's court (customer now for almost 11 years), because if this kind of thing can happen to a storage system that is roughly 10 years old then I can only imagine the issues that can happen with the startups. These big 3PAR boxes get a lot more testing and more deployments etc.

    But it's also probably an indication that HP won't ditch Hitachi for the ultra high end just yet (where they have 100% guarantees).

    In general perhaps I am lucky, or maybe just lazy, that I don't encounter more issues, because I tend not to leverage much of the functionality of the systems I use. Take 3PAR for example: some people are surprised that I haven't used the majority of the software available for the system (e.g. never used replication). Part of that is budget, part of that is I know there are more bugs in the more complex things (on any platform).

    Same with VMware: I have filed on average 1 ticket with HP/VMware support per year over the past 4 years, while currently running almost 1,000 VMs. Runs smooth as hell, very few issues, and again much of the more advanced stuff (even though we use enterprise+) goes unused (but we do use distributed virtual switches and host profiles that are in ent+). I have seen lots of complaints over the years about VMware bugs that I honestly have never seen, I guess because I just don't have a need for those features. The only crashes I have gotten have been because of hardware failures (maybe 6 in the past 5 years, and none in the 6 years before that, at least while I was at the companies). And no - no plans for vSphere 6 anytime soon.

    Same goes for my Ethernet switches; the feature set I need on those hasn't changed in a decade. List goes on...

    At the end of the day you have to realize what you are protecting against. Right now I am trying to get a tape system approved (with LTFS over NFS) for offline backups. What I am protecting against there is someone breaking into our systems and deleting our data AND our backups. Having offline tape (stored off site) is a good tried and true method of protecting data. I don't expect to use it ever; we use HP StoreOnce for backups & off site backups, but still someone could delete data from those just as they could delete data from an API-based cloud system.

    Co-ordinating someone to return all of our tapes and delete them is a far bigger task.

    Dealing with tape directly isn't fun. I am hoping that LTFS over NFS will make it pretty easy, since all of our backups write to NFS as is (on StoreOnce), so adapting them to LTFS should not be difficult. I am certainly aiming to avoid working directly with fancy tape backup software at least.

    It would be really cool if StoreOnce could automatically integrate with tape, so I could write over NFS to StoreOnce and then have it write to tape on the backend. It would remove some steps I will otherwise have to do myself. I know there is 3PAR->Tape automation but that is too low level and relies on use cases that don't cover what I do for the most part.

  2. Anonymous Coward

    I wonder ...

    ... if the problems are similar to the KCL outage. Certainly KCL was at least partially self-inflicted, but then again, HPE was guilty of giving a lunatic a loaded gun (to quote the Bard of Salford). Perhaps they did the same down under. Just saying ..... just wondering .....

  3. TRT Silver badge

    Gosh...

    This is sounding awfully familiar...

  4. Anonymous Coward

    99.9999?

    Wonder if they were on the 99.9999 guarantee? https://www.hpe.com/h20195/v2/GetPDF.aspx/4AA5-2846ENN.pdf
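
For scale, here is a rough sketch of what a 99.9999% figure would allow. It assumes the number means uptime as a fraction of a 365-day year, which is an assumption made here and not something taken from the linked HPE document; the function name is illustrative only.

```python
# Yearly downtime budget implied by an availability percentage.
# Assumption: a 365-day year, availability = uptime / total time.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the downtime allowed per year, in seconds, for a given target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few common targets, ending with "six nines".
    for target in (99.9, 99.99, 99.999, 99.9999):
        seconds = allowed_downtime_seconds(target)
        print(f"{target}% -> {seconds:.1f} seconds of downtime per year")
```

Under that assumption, six nines works out to about 31.5 seconds of downtime per year, so a week-long outage is many orders of magnitude beyond the budget.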

  5. dpk

    it hardly 'broke' Australia. JFC.

  6. Calleb III

    One off?

    I guess the 3PAR array that failed so spectacularly in a similar way at KCL 2 months ago was also a one off...

  7. Anonymous Coward

    Just like the major outage at Blue Cross in Canada was a one-off.

  8. Anonymous Coward

    Bug or mis-sold

    Having used 3PAR for 7-odd years now and witnessed my own fair share of bugs, I am not surprised that, as the units become more mainstream and some of HP's less than stellar sales ethics rub off on the brand, we will see more issues in the press.

    One of the more disturbing practices I have seen of late from HPE is to not spec out cage-level redundancy, leaving your whole array vulnerable to failure on a single shelf. The argument is that at a certain scale it pushes down capacity or pushes up the cost point, but unless the risks are explained to and signed off by a client that understands them, I cannot see how this is good business practice.

    This failure could be quite a few things, and I would echo the above: the more features you use that intertwine with one another, the greater the chance of a killer bug.
