Hundreds of websites go titsup in Prime Hosting disk meltdown

Hundreds of UK-hosted websites and email accounts fell offline when a disk array failed at web biz Prime Hosting. As many as 860 customers are still waiting for a fix more than 48 hours after the storage unit went titsup. The downtime at the Manchester-based hosting reseller began at 5am on 31 July, and two days later some …

COMMENTS

This topic is closed for new posts.


  1. Mako

    "[P]romised that it has had a team working solidly for 36 hours without sleep in order to minimise the impact."

    Knowing how goofy and error-prone I become after about 24 hours without sleep, that doesn't exactly fill me with feelings of confidence.

    And it also gives me the impression that this is yet another company that thinks working people like pit ponies is not only acceptable but laudable.

    1. Rameses Niblick the Third (KKWWMT)
      Thumb Up

      Definitely, definitely this. One upvote is just not enough

    2. LarsG

      This pretty much tells you to keep a local backup and not rely on a third party to keep your data.

  2. Wize

    They stick on old backups to start with...

    ...then are slowly replacing them with newer backups.

    What if a customer places an order while the old one is up and the database gets splatted with the newer backup?

  3. Lord Voldemortgage

    Have some sympathy for them

    Some.

    Drives do sometimes go in batches.

    But if I were a customer I would want to be asked first before an old version of a site was brought online - I mean there might be ordering systems with old pricing / stock figures or anything on there; in those circumstances, better a holding page with an apology than a working site feeding through garbage.

  4. Steve Evans

    Tut tut...

    RAID is about availability; it is *NOT* a backup solution.

    /end lecture.

    1. Alex Rose
      WTF?

      Re: Tut tut...

      I've read the article again and I still can't see the bit where anybody claims that RAID is a backup solution.

      1. jockmcthingiemibobb
        FAIL

        Re: Tut tut...

        Restoring the data from a 3-month-old backup would kinda imply they were relying on RAID as their backup solution.

        1. Anonymous Coward
          Anonymous Coward

          Re: Tut tut...

          No it wouldn't; it would imply that they had migrated to a new array and the old one was still there. No-one said they restored the data from 3 months ago.

    2. Anonymous Coward
      Anonymous Coward

      Re: Tut tut...

      Reading the article is about comprehension of a story, not just reading the first line and jumping to a conclusion about what you expect to have happened.

      /end lecture.

  5. Trygve Henriksen
    FAIL

    This is CRAP!

    Any server system with a smidgen of professionalism built into it will warn you when a drive becomes borderline.

    Having 3 fail in one RAID6 array is... mindboggling...

    Exactly how many drives do they have in each array, anyway?

    Restoring old site backups to get the VMs up faster?

    This is CRAP!

    I'm guessing that what they brought back is the LAST FULL BACKUP of the failed array, and that they're now busy restoring Differentials or Incrementals from after that.*

    They should at least have the brains to keep the systems offline until they've restored everything, as it may otherwise result in lost orders and whatnot.

    (What if someone browses to a webshop on one of those sites, orders something using a CC and they then restore over the transaction details? )

    * Cheap bastards probably used incremental backups, too, instead of Differential, to save money...
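
    For anyone hazy on the distinction being drawn above, here is a minimal, hypothetical Python sketch of how the three backup types select files. The paths and timestamps below are invented for illustration only; a real setup would use proper backup tooling rather than anything like this.

    # Sketch: how full, differential and incremental backups select files.
    # Paths and timestamps are hypothetical; real setups should use dedicated
    # backup tooling (e.g. tar --listed-incremental, Bacula, rsnapshot).
    import os
    import time

    def files_under(root):
        """Yield (path, mtime) for every file below root."""
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                yield path, os.path.getmtime(path)

    def select_full(root):
        """A full backup copies everything, regardless of age."""
        return [p for p, _ in files_under(root)]

    def select_differential(root, last_full_time):
        """A differential copies everything changed since the LAST FULL backup.
        Restore = last full + the most recent differential (two sets)."""
        return [p for p, m in files_under(root) if m > last_full_time]

    def select_incremental(root, last_backup_time):
        """An incremental copies everything changed since the LAST backup of any
        kind.  Restore = last full + EVERY incremental since, applied in order,
        which is why a broken link in that chain hurts so much."""
        return [p for p, m in files_under(root) if m > last_backup_time]

    if __name__ == "__main__":
        root = "/var/www/example-site"          # hypothetical web root
        last_full = time.time() - 7 * 86400     # pretend the full ran a week ago
        last_any = time.time() - 1 * 86400      # and the last incremental yesterday
        print(len(select_full(root)), "files in a full backup")
        print(len(select_differential(root, last_full)), "files in a differential")
        print(len(select_incremental(root, last_any)), "files in an incremental")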

    1. Anonymous Coward
      Anonymous Coward

      Re: This is CRAP!

      Rather than jumping to the "This is all crap" conclusion, consider:

      The array probably had all its drives purchased at the same time, which vastly increases the likelihood of drives failing in fairly quick succession.

      If the array is new, it's entirely possible that the array it was replacing is still kicking around, awaiting decommissioning. If it became apparent that the new array was completely dead, it may have been a case of just zoning the old LUNs to the servers and away you go, with old data. This would also back up the first point. A recovery from tape could then take place to update the old data, and the new array could be recommissioned when everything has settled down.

      1. DJ Smiley
        Facepalm

        Re: This is CRAP!

        Or they don't understand data scrubbing and checking for data failures on the devices themselves, rather than trusting the RAID controller, which is going "Yes, yes, it's all fine, don't worry about those blocks I've just moved because they failed, it's really OK, I promise you!"

        1. Trygve Henriksen

          Re: This is CRAP!

          Error logs from array systems are there for a reason, which many unfortunately never bother to read.

          With a 'new' system, that should be checked DAILY.

          Automated emails from the system?

          Sure, but I wouldn't trust them. Too many systems between the originator and me.

          (sucks if the email warning of a problem with a RAID gets lost because the email storage is on the glitching array... Or, someone changes the IP of the SMTP server and the array box doesn't understand DNS. )

          1. Nigel 11

            Re: This is CRAP!

            You really need to data-scrub, watch the SMART statistics for the drives themselves, and act proactively. If the number of reallocated blocks starts increasing, replace that drive BEFORE the array is in peril. Sometimes drives do turn into bricks just like that, but in my experience and that of Google, an increasing rate of bad block reallocations after the array is first built is a warning not to be ignored.

            If I were ever running a big data centre, I'd insist on buying a few disk drives monthly, and from different manufacturers, so I could assemble new RAID arrays from disks no two of which were likely to be from the same manufacturing batch. A RAID-6 made out of drives with consecutive serial numbers is horribly vulnerable to all the drives containing the same faulty component that will fail within a month. I'd also want to burn in a new array for a month or longer before putting it into service. If a new drive is going to turn into a brick, it most commonly does so in its first few weeks (aka the bathtub curve).
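
            A minimal sketch of the sort of proactive SMART check described above, assuming smartmontools is installed and the script runs with root privileges; the device names and threshold are placeholders, not anything Prime Hosting actually runs.

            # Sketch: flag drives whose SMART reallocated-sector count is non-zero.
            # Assumes smartmontools is installed and root privileges; device names
            # and the threshold are purely illustrative.
            import re
            import subprocess

            DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]   # hypothetical array members
            THRESHOLD = 0   # any growth in reallocations is worth a look

            def reallocated_sectors(device):
                """Parse the raw 'Reallocated_Sector_Ct' value from `smartctl -A`."""
                out = subprocess.run(["smartctl", "-A", device],
                                     capture_output=True, text=True, check=False).stdout
                for line in out.splitlines():
                    if "Reallocated_Sector_Ct" in line:
                        # The raw value is the last whitespace-separated field.
                        return int(re.split(r"\s+", line.strip())[-1])
                return None

            if __name__ == "__main__":
                for dev in DEVICES:
                    count = reallocated_sectors(dev)
                    if count is None:
                        print(f"{dev}: no SMART data (USB bridge or RAID controller in the way?)")
                    elif count > THRESHOLD:
                        print(f"{dev}: {count} reallocated sectors - replace it BEFORE the array is in peril")
                    else:
                        print(f"{dev}: OK ({count} reallocated sectors)")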

            1. Anonymous Coward
              Anonymous Coward

              Re: This is CRAP!

              "...If I were ever running a big-data centre, I'd insist on buying a few disk drives monthly, and from different manufacturers, so I could assemble new RAID arrays from disks no two of which were likely to be from the same manufacturing batch...."

              It doesn't work like that: you get the disks you get when you buy an array. The array manufacturers spend ages testing that the firmware is compatible with the existing disks, and that the disks are reliable and perform to spec with the array and the array controller. There is far more chance of a failure caused by bad firmware or incompatibilities between disks and array/controller than a mechanical one. You will also be very hard pushed to find an array supplier who will support random disks being inserted into their array.

              The best bet is to have a healthy number of online spares for automatic rebuild, and an array that phones home for more.

      2. chops

        Re: This is CRAP!

        Maybe it was hardware common to the disks, like a backplane or cable, which failed.

        Prime seem to have had a problem with their SAN for a while - it's been blamed for some slow (to non-existent) server responses over the past couple of weeks. I'm not even sure it's a reliable design of SAN (I believe it was 'home-baked' from what support staff told me <last time this happened!>).

        Not much surprises me with Prime any longer, they don't usually appear to show a great deal of care or understanding about the importance of DNS, data or, come to it, their customers.

    2. Anonymous Coward
      Anonymous Coward

      > Having 3 fail in one RAID6 array is... mindboggling...

      Not very mindboggling here: we lost a very large number of drives simultaneously when some muppet contractor managed to set off the fire suppression system whilst attempting routine "maintenance". Service-induced failure is a lot more common than you might think in all sorts of areas.

      Most likely they had an old copy of the data sitting on disk storage somewhere and brought that back online as a quick fix whilst the tape system recovers the backups. I can't remember the last time I saw a full backup done or imagine quite how long it would take. These days all our stuff is incremental into a library, and we just restore from the library without having to worry about when individual files were backed up.

      1. Trygve Henriksen

        Re: Muppet doing maintenance...

        Nothing can protect you against 'Acts of Dead Meat', unfortunately...

        (Well, mirroring the array to another similar box in another location might... )

        Full backups are important. They really are.

        If for nothing else, they're really handy for off-site storage...

        (To protect against flooding, fire, sabotage, theft... )

        1. Anonymous Coward
          Anonymous Coward

          Re: Muppet doing maintenance...

          Replication is great, but like RAID, not a panacea.

          Also, it's highly likely that the customers don't want to pay for that level of data security; it's very expensive - more than double the cost, because you have to pay for the datalink as well as the extra servers and disk.

    3. Fatman
      FAIL

      Re: ....warn you when a drive becomes borderline.

      Perhaps it did!!!

      (Now to get my damagement bashing in; and this is just speculation, mind you.)

      Perhaps the warning signs were there, but damagement, in its quest for ever increasing profits, decided to hold off replacing the drives. Could it be that they did not want their quarterly bonuses to take a `hit`??? The spreadsheet jockeys could not find a line item for replacement drives.

      Icon that says it all.

      1. Captain Scarlet
        Meh

        Re: ....warn you when a drive becomes borderline.

        I'm sure most "Enterprise" drives (if they used them) have 3-5 year warranties, and the majority of manufacturers will replace them if certain parameters have been reached.

  6. Dave 62
    Happy

    At least they have recent backups.

  7. theloon
    FAIL

    no sleep? Umm, go home

    The last thing anyone needs is exhausted people working on problems... Not reassuring.

  8. Anonymous Coward
    Anonymous Coward

    Batch

    Dear me, I learned in the late 1990s through practical experience that you NEVER put a RAID together with disks from the same batch. A guarantee of disaster if they start popping off in quick succession..

    1. Colin Bull 1

      Re: Batch

      It is not trivial to avoid using the same batch in a RAID unless using RAID10.

      I bet they wish they had joined this group ..

      http://www.miracleas.com/BAARF/BAARF2.html?40,51

      It might be old but it is still applicable

      1. Destroy All Monsters Silver badge

        Re: Batch

        OH SO TRUE.

        Of course, buy new hardware and the disks will have contiguous serial numbers. Order a replacement for the failed one and the next one will fail, while the new one you got will ALSO fail.

  9. BryanM
    FAIL

    Lady Bracknell

    To paraphrase Oscar Wilde...

    "To lose one disk, Mr Smith, may be regarded as a misfortune; to lose three looks like carelessness."?

    After 3 disk failures I'd be checking the RAID controllers and stuff to ensure it's not something other than a disk issue. Unless you tell me it's software RAID, that is - then I'll just laugh at you.

    1. LinkOfHyrule
      Coat

      Re: Lady Bracknell

      Where's that quote from? The Ballad of RAIDing Gaol?

    2. TeeCee Gold badge

      Re: Lady Bracknell

      Nope. Disk #1 fails. A new disk is inserted and the array starts to rebuild. The act of rebuilding stresses the living shit out of the other disks, including accessing areas of them that haven't been looked at since Jesus was a lad[1] (e.g. parity stripes for O/S files on the failed disk that were written during a server installation early in the array's life and never touched since). Disks #2 and #3 turn their toes....

      Of the three disks I have had fail in my own gear, two failed during full backup cycles and one in a RAID rebuild. It's heavy use of the entire disk that shines a glaring light on problems. This is also why anyone relying on incremental backups and thus not ensuring that the entire disk structure is kosher on a regular basis is asking for it.

      Mixing batches of disks is unlikely to help, except in the unlikely case where a particular batch has a manufacturing defect. In such cases, they'll usually start dropping like flies at commissioning time anyway. What will help is ensuring that your RAID array is populated with disks with significantly different numbers of service hours on them, but since arrays tend to be commissioned in one go with new disks, this very rarely happens.

      [1] This is why monitoring the SMART stats makes no odds. SMART only records errors when they are seen in normal operation; it does not proactively scan the entire surface looking for 'em.
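
      This is where an explicit scrub comes in: Linux md, for instance, will read every sector and verify parity if asked, which surfaces exactly the latent errors a rebuild would otherwise trip over. A minimal sketch, assuming a software RAID array at md0 and root privileges; hardware arrays have vendor-specific equivalents (patrol read, media scan and so on), and many distros already schedule a monthly md check.

      # Sketch: kick off a scrub of a Linux md array and report the result.
      # Assumes a software RAID array at /dev/md0 and root privileges; the
      # array name is a placeholder.
      import pathlib
      import time

      MD = pathlib.Path("/sys/block/md0/md")   # hypothetical array

      def start_scrub():
          # 'check' reads every member disk and verifies parity/mirrors without
          # rewriting anything; 'repair' would also correct what it finds.
          (MD / "sync_action").write_text("check\n")

      def wait_for_scrub(poll_seconds=60):
          while (MD / "sync_action").read_text().strip() != "idle":
              time.sleep(poll_seconds)

      if __name__ == "__main__":
          start_scrub()
          wait_for_scrub()
          mismatches = (MD / "mismatch_cnt").read_text().strip()
          print(f"scrub finished, mismatch_cnt = {mismatches}")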

  10. Anonymous Coward
    Anonymous Coward

    More fun if the card goes

    RAID error reporting is one thing, but if the RAID card itself goes there is not even a hint of the impending doom.

    Can't help thinking RAID is another one of those "Many beasts with one name" technologies that would benefit from some rigorous standards.

    Drives from one controller will often not talk to later versions of the same controller (or have I just been unlucky?)

    AC just because many people in IT seem to think they have it all covered and "unknown unknowns" could never happen to them; pointing fingers and being smart may distract us from the discipline required.

    1. DJ Smiley

      Re: More fun if the card goes

      We had a RAID controller from Dhell do this - it went to write-through mode, as it should if it encounters errors; except instead of actually writing the data through (albeit slowly) it decided any writes could be silently ignored and dropped.

      People saying H/W RAID is better than software RAID have either never dealt with dodgy RAID controllers, or are thinking of that joke of a RAID that comes built into motherboards, not mdadm.

      1. Anonymous Coward
        Anonymous Coward

        Re: More fun if the card goes

        Really? In my experience, people who say that software RAID is better than hardware RAID are OS engineers who think that they somehow automatically know about either local or SAN-attached storage infrastructure.

        Software RAID, after all, still goes through disk controller chips, often the same one for multiple drives.

    2. Anonymous Coward
      Anonymous Coward

      > people in IT seem to think they have it all covered

      Mmm, some of the loud shouters come across to me as being rather inexperienced and naive. If you manage to stick around long enough in this flakey industry you see all sorts of weird stuff.

      1. Kev K
        Devil

        Re: > people in IT seem to think they have it all covered

        " If you manage to stick around long enough in this flakey industry you see all sorts of weird stuff."

        This with huge great bells on it. $hit WILL happen.

    3. Nigel 11

      Re: More fun if the card goes

      "Drives from one controller will often not talk to later versions of the same controller (or have I just been unlucky?)"

      No, that's one of the several reasons that these days I refuse to countenance hardware RAID controllers.

      Another is the case where the manufacturer of your RAID controller goes out of business and the only place you can get a (maybe!) compatible replacement is eBay. And then there's the time you find out the hard way that if you swap two drives by mistake, it immediately scrambles all your data beyond retrieval. And if there's a hardware RAID card that uses ECC RAM, I've yet to see it.

      Use Linux software RAID. Modern CPUs can crunch XORs on one out of four or more cores much faster than SATA drives can deliver data. And auto-assembly from shuffled drives does work! You do of course have a UPS, and you have of course tested that UPS-initiated low-battery shutdown does actually work before putting it in production.

      (Enterprise RAID systems with sixteen-up drives may be less bad, and in any case it's a bit hard to interface more than 12 drives to a regular server PC. It's little 4-8 drive hardware RAID controllers that I won't touch with a bargepole.)
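
      The XOR arithmetic referred to above is simple enough to show in a few lines. A toy illustration only, nothing to do with md's actual (heavily optimised) implementation: lose any one data chunk and the survivors plus the parity chunk rebuild it. RAID 6 layers a second, Galois-field syndrome on top so it can survive two failures.

      # Toy illustration of single-parity (RAID 5 style) XOR recovery.
      from functools import reduce

      def xor_blocks(blocks):
          """Byte-wise XOR of equal-length byte strings."""
          return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

      # Three data chunks striped across three hypothetical drives...
      data = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
      parity = xor_blocks(data)                  # ...plus one parity chunk

      # The drive holding the second chunk turns into a brick:
      surviving = [data[0], data[2], parity]
      rebuilt = xor_blocks(surviving)
      assert rebuilt == data[1]
      print("rebuilt chunk:", rebuilt)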

      1. Anonymous Coward
        Anonymous Coward

        Re: More fun if the card goes

        @Nigel 11 - I think we're talking about significantly different systems here. To me a RAID array is something that is free-standing and has hundreds of disks. The only locally attached arrays that I've used recently are made by HP (nee Compaq, nee DEC) and are 2U racks full of 2.5" disks, with controllers that have battery-backed write cache - with error-correcting RAM.

        If your primary concern when buying an array is "will this company go bust", don't buy it. However, rest assured a proper, enterprise (or SME) class RAID controller/array is way faster and more reliable than software; it also won't knacker your disks if you put them in in the wrong order. It certainly won't lose cached writes when there's a power failure, which software RAID will.

  11. Johnny Quest
    FAIL

    Prime Hosting has apologised to punters...

    "Prime Hosting has apologised to punters"

    Um... no, no they have not. Not a single apology.

    Their site has no information about the downtime anywhere, their ticket support system is being completely ignored, and their phone lines (which might be back up now) were down for all of yesterday with a recorded message basically saying "We know there are issues, go away".

    Their Twitter feed is the only thing with any information on it, and that is remarkably lacking.

    1. Anonymous Coward
      Anonymous Coward

      Re: Prime Hosting has apologised to punters...

      >their phone lines (which might be back up now) were down for all of yesterday with a recorded message basically saying "We know there are issues, go away"

      And you think calling them is going to make them suddenly get things back to normal?

      If there is information on Twitter then I would assume that is the current state of the problem; if you don't think so, then complain later.

      What do you want? A bit-by-bit commentary on Twitter and a fully manned telephone ops room, or maybe a dedicated line for you to constantly ask "what's going on? when will my really, really important web pages be available, don't you know there are people out there who haven't seen a picture of my pussy for more than ten minutes?", while one guy tries to get the thing back to its previous state?

      1. Johnny Quest
        Holmes

        Re: Prime Hosting has apologised to punters...

        Those are some nice logical leaps you've made there. True genius in the works.

        Actually, despite you trying to make it sound completely ridiculous, a bit-by-bit commentary isn't exactly out of the question. It's not that unheard of for there to be people employed by a company who aren't experts in data recovery and server migration. Maybe those people not involved in that side of the issue could take a few minutes to at least keep some worried customers updated?

        Regardless, what I'd expect is the bare minimum of customer support:

        1) At least one mention that there is a known issue on their website;

        2) For their single point of support to be working (their ticket system was offline all day). These shared servers that are down aren't their only hosting business.

        3) Less than 9 hours between Twitter posts on the day the majority of sites went down.

        4) To maybe ask customers whether they would like an unusably old backup in place before doing so;

        5) To maybe let customers who do have an already unusably old backup in place know that they can provide a more recent one, so there's not a need to panic (12+ hours between putting some backups in place and then sending out a Tweet).

        That is not much to ask, seeing as they have just destroyed a good number of businesses (not mine, chillax before you start worrying about my transexual cat photo enterprise).

      2. Johnny Quest
        Facepalm

        Re: Prime Hosting has apologised to punters...

        Oh, and I forgot the most important one:

        A FUCKING APOLOGY!

        Telling The Register that they're sorry isn't quite the same as telling the hundreds of customers. I think some of Prime's customers might not be regular Reg readers.

  12. Anonymous Coward
    Anonymous Coward

    Disks do fail

    I added some more memory to a Dell R610 last month (one of our Hyper-V hosts) and restarted the server to find that 2 of the 4 disks that make up a RAID 10 volume had failed. S*it happens - lucky it was RAID 10, so it wasn't much of an issue.

  13. Anonymous Coward
    Anonymous Coward

    Oh lovely lovely RAID...

    One of those techs people think is the holy grail that saves you having to spend lots of dosh on a duplicate or clustered system. If anyone wants to save themselves from getting into this hell-hole of a situation: always have backups of backups and duplicates of duplicate systems.

    Where I'm working at the moment, we have 6 duplicate servers hosting all our websites, with all the elements and the database hosted on a clustered group of servers. Even the local hard disks on the servers are RAIDed for performance reasons on top of availability. It would take a lot going wrong for us to go fully tits up.

    1. Anonymous Coward
      Anonymous Coward

      Re: Oh lovely lovely RAID...

      I take it all these servers are spread across different physical locations, with redundant power and networks within each location, and diverse power and data feeds into the data centres?

      1. Anonymous Coward
        Anonymous Coward

        Re: Oh lovely lovely RAID...

        Oh deffo! If you're going to halve the risk, you might as well go all the way down the chain. Even down to split redundant switches using teamed NIC cards. =D

        1. Anonymous Coward
          Anonymous Coward

          Re: Oh lovely lovely RAID...

          It's something that should be on your checklist when choosing a hosting ISP:

          I chose one that had redundant data centres and took the subject seriously.

          Not the cheapest but you get what you pay for.

  14. Anonymous Coward
    Anonymous Coward

    It's ok....

    ...the customers have their own backups as well, don't they? You know, just in case the site goes utterly tits up / goes bankrupt / gets closed down by the police / you want to move hosts.

    oh....

  15. Chris Long
    Unhappy

    Irony

    Whilst looking in vain on their website for any scrap of information as to what the fudge had happened to my sites, I enjoyed* the irony of finding this press release:

    http://www.primehosting.co.uk/news/Recruiting_Again

    How nice of them, I thought, to be blowing their own trumpets whilst quite literally in the middle of the biggest clusterfudge a hosting company could hope to experience. Surely, I wondered, the PR person pimping this press release could instead be informing customers as to when their sites might re-appear? But apparently not.

    * did not enjoy

  16. This post has been deleted by its author

  17. Wensleydale Cheese

    And for those of you using hosting ISPs

    Do you take regular backups of your sites?

    I certainly do and can restore the lot reasonably quickly. I have tested that too.

    This article does present a scenario I hadn't thought of, though, namely that of the ISP restoring older backups over whatever I might have already restored.
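
    For anyone who hasn't set this up yet, a minimal sketch of the sort of pull-your-own-copy backup described above, assuming SSH access to the hosting account and a MySQL database; the host, paths and database details are placeholders, and a real script needs error handling, retention and an off-site copy.

    # Sketch: pull your own copy of a hosted site - files plus a database dump.
    # Host, paths and credentials are placeholders (the DB password is assumed
    # to live in ~/.my.cnf on the host, not here).
    import datetime
    import pathlib
    import subprocess

    HOST = "user@shared-host.example"             # hypothetical hosting account
    REMOTE_WEBROOT = "~/public_html/"
    DB_NAME, DB_USER = "shopdb", "shopuser"       # hypothetical database

    stamp = datetime.date.today().isoformat()
    dest = pathlib.Path.home() / "site-backups" / stamp
    dest.mkdir(parents=True, exist_ok=True)

    # 1. Mirror the web root locally (rsync only transfers what has changed).
    subprocess.run(["rsync", "-az", "--delete",
                    f"{HOST}:{REMOTE_WEBROOT}", str(dest / "webroot")], check=True)

    # 2. Dump the database on the host and store it alongside the files.
    dump = subprocess.run(["ssh", HOST, f"mysqldump -u {DB_USER} {DB_NAME}"],
                          capture_output=True, check=True)
    (dest / f"{DB_NAME}.sql").write_bytes(dump.stdout)

    print(f"backup written to {dest}")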


