Sysadmin wiped two servers, left the country to escape the shame

Grab a very small cake and a bunch of candles, dear readers, for today we mark the 10th edition of “Who, me?”, The Register’s confessional for IT pros who broke things badly. This week, meet “Graham”, who “ended up as an authority on a fledgling new product called SFT III from Novell. SFT stood for ‘system fault tolerance’.” …

  1. Anonymous Coward

    “I have subsequently moved to Australia to try and put this behind me,” Graham concluded.

    A bit too far for a mere cockup and definitely not far enough if those servers belonged to people you really need to run from.

    I had to maintain a few of those during the hungry '90s on the other side of the wall and I prefer not to remember the details of that. Healthier that way. Less likely to find my remains 10m up in the air starting the car.

    1. Mayday
      Pirate

      Sentenced to transportation then?

      “I have subsequently moved to Australia to try and put this behind me,” Graham concluded*

      *from an Aussie too.

  2. TonyJ

    Ohhh SFT III...I set that up for a company that used to make bricks for kilns.

    It was a real pig to get running - the hardware had to be absolutely identical, even down to the disk firmware and, if I recall correctly, geometry. Which was a problem when Compaq sent the same part number disks but from a different manufacturer.

    Anyway I still remember being asked to nip a spare fibre cable down, as they weren't far from where I lived.

    It was quite a thick, glass fibre. When I went to the stores to get it, the store man merrily bent it over itself to fit in a jiffy bag. I could literally hear it snapping each bend.

    "Oh well...we need another one now" :)

    1. Anonymous Coward

      > Ohhh SFT III...I set that up for a company that used to make bricks for kilns.

      Making bricks - how appropriate for SFTIII. Having inherited an SFTIII installation, I found that as long as you didn't want to change anything ever, it was fine, but in a dynamic environment, it was a ball and chain on change.

      Who remembers those "demonstrations" at the Networks show in Birmingham in the 90s, where Novell dropped anvils on one of a pair of running servers to show, er, gravity or something. That kind of 90s loadsamoney waste was one of the reasons I was turned off Netware and started heading towards this new NT thing, which turned out to be rather good, and cheaper too. Imagine that - MS cheaper, no doubt as Netware had to pay for demo anvils.

      1. TonyJ

        "...Who remembers those "demonstrations" at the Networks show in Birmingham in the 90s, where Novell dropped anvils on one of a pair of running servers to show, er, gravity or something. ..."

        Yoikes! I remember! Seemed pretty wasteful. I'd have loved a server to own back then. My lab was a couple of ageing desktops.

        1. DJSpuddyLizard

          ..Who remembers those "demonstrations" at the Networks show in Birmingham in the 90s, where Novell dropped anvils on one of a pair of running servers to show, er, gravity or something. ...

          I remember seeing the Compaq Destructive Testing Lab when I was a contractor back in Houston in the 90s. It seemed they would attempt to crush servers, then act surprised when the poor machine actually crushed. I really don't know how that was supposed to help the average buyer ... "Yes, but our servers can stand up to FOURTEEN thousand pounds of force!" sorry, mate, if there's 14,000 pounds force extra in your server room, there's a problem that most likely needs immediate attention.

      2. Water Cooler

        Mirror, mirror, off the wall...

        Ah yes, I remember! We ran a semi-critical development environment on one of these setups, way back. We never had the old 'anvil falling from the sky' incident on our site, but we did have some sort of nasty OS software corruption on one of the boxes that caused it to go TITSUP. Fortunately, we had the second box, which was instantly mirrored to duplicate the... no, wait, stop!!!

        The whole system was offline for about a week.

      3. Anonymous Coward

        Yes, back then mirrored servers were the sort of thing done by DEC at very high cost, so the SFT idea was a good try at a lower cost but did have that unfortunate requirement to be a very homogeneous environment. Then again, in my experience so did the higher end stuff, but the variability was lower in that high cost environment.

        I tried the SFT way once but decided it wasn't necessary in a file server environment as long as one kept sufficiently frequent backups. The file focused nature meant that large amounts of data weren't at risk. Novell servers weren't that often used for database hosting.

        Funnily enough we looked at Windows NT as an alternative F&P solution and concluded that it was absolute garbage compared to a Novell installation, well through into the 2000s. We even had an MS "evangelist" tout its security advantages until one of my peers pointed out that the "high" security rating only applied when the NT server was running in an air-gapped environment and wasn't connected to any other system. If there were connections the rating changed to very poor.

        The biggest problem with sticking with Novell through the late 1990s was the sabotage campaign that MS conducted, using undocumented procedure calls and hidden features to disable Novell client connections on PCs. This wasn't revealed as an active campaign until much later, and at the time people tended to blame Novell, which was rather MS's intent and did help (in my environment) promote the desire to change over to Windows server despite the inadequacies and heavy footprint of AD at the time.

        1. GrumpyKiwi

          You didn't always need Microsoft's assistance to break the Netware client. Back in the early noughties I was working for the NZ branch of a very large Japanese consumer electronics manufacturer, who would regularly send out updates to their reporting software that would break the Netware client on Windows, requiring a reinstall of the Netware client. On every single computer. I can only guess that it was pushing Japanese language DLLs or something similar.

        2. Stoneshop

          Then again, in my experience so did the higher end stuff as well, but the variability was lower in that high cost environment.

          VMS clusters can be quite unhomogeneous, VAX8800 with Alpha DS10 and MicroVAXes, as long as they're running the same VMS version. Different CPU architectures require that the system disks be separate (you can have a common system for all the machines of the same arch), but data disks can be mirrored right across them. For disk mirroring you do want roughly identical sized disks, although they don't need to be exactly identical. Just that you want to start building the mirror set off the smallest.

    2. J.G.Harston Silver badge

      On a site I needed a 25m length of earth bonding cable. Went to the suppliers, who measured it out, then promptly folded it into 25 1-m lengths of earth bonding cable.

      1. ssharwood

        Did you buy another or splice?

  3. John Robson Silver badge

    Don’t need human error for that

    Place I had a summer job had a raid controller do this all on its own - failed drive was pulled, replaced... controller then mirrored the drive - the new drive that is, over the remaining good drive... oops

    1. Anonymous Coward

      Re: Don’t need human error for that

      Had that happen on my Netgear ReadyNAS. It's supposed to be hot-swappable. I never bothered using that feature again.

      1. Antron Argaiv Silver badge
        Pint

        Re: Don’t need human error for that

        I have one. No failures yet.

        THANKS! (and have a virtual cold one on me) -------------------------->

        ...for the warning.

        I'll be shutting it down before replacing any failed drives in the future.

    2. Doctor Syntax Silver badge

      Re: Don’t need human error for that

      "mirrored the drive - the new drive that is, over the remaining good drive"

      It seems to be a standard feature of mirroring judging by the number of times I've heard of that.

      1. John Robson Silver badge

        Re: Don’t need human error for that

        I've only come across it once - but of course the number of people who have heard me tell of it is probably quite large, and some of those may have repeated it to others....

        I wonder how many instances there actually have been, and how many times we hear the same instance repeated...

  4. Korev Silver badge

    Incremental backups?

    I don't know the full story; but wouldn't taking a full backup before starting a major operation be prudent?

    1. Anonymous Coward

      Re: Incremental backups?

      But the company probably had an ISO 9001 certificated backup procedure of monthly full backups followed by daily incrementals with specified policies of how the tapes get shuffled around the firesafe/sent off site/etc etc so taking a full backup would have been an exception to the policy and required far too much paperwork/management sign-off to contemplate.

      1. Anonymous Custard
        Headmaster

        Re: Incremental backups?

        And of course you just wait for some minion to make an incremental back-up over the top of the full one, or to use the same tape for the full one each month and end up overwriting the historical archive once it gets too full.

      2. CrazyOldCatMan Silver badge

        Re: Incremental backups?

        had an ISO 9001 certificated

        ISO 9001 only specifies that you have a process and follow that process in a documented fashion. It doesn't specify that the process has to be any good or have any value.

        So a process that says "we do nothing about backups" is a valid ISO 9001 process as long as you can prove in documentation that you did nothing.

        Of course, that would then fail any ISO 27001 certification but that's a whole different can of Annelids.

        I've worked at places that do both ISO schemes. Which is why I'm so cynical about it..

        1. Doctor Syntax Silver badge

          Re: Incremental backups?

          "ISO 9001 only specifies that you have a process and follow that process in a documented fashion. It doesn't specify that the process has to be any good or have any value."

          That was one of the objections I had about ISO 9001. It was supposed to be a quality management standard but, provided the "quality" was repeatable, it didn't matter how bad it was. I kept calling it the mediocrity management system.

          We introduced it after TQM. It was supposed to be a step up from that. As TQM had a mantra of Get it right First Time Every Time, I wanted to know how, if we were already doing that, it could be a step up.

          1. Anonymous Coward

            Re: Incremental backups?

            TQM brings back memories. The government agency I worked for pretended to adopt it when a political appointee who was mad about it forced all the agencies under his control to do it. The agency had been zig-zagging back and forth between management fads for a while. A lower level manager I knew wrote a proposal that the acronym be changed to CQM (C=first letter of agency's name), and remain CQM regardless of future zigs and zags so it looked like we were maintaining steady progress forward. He did it as a joke - he got an award for it.

        2. Anonymous Coward

          Re: Incremental backups?

          "ISO 9001 only specifies that you have a process and follow that process in a documented fashion. It doesn't specify that the process has to be any good or have any value."

          In my days at an ISO manufacturer, my standard line was that:

          ISO 9000 = "we have processes"

          ISO 9001 = "we have processes, and we wrote them down"

          ISO 9002 = "we have processes, we wrote them down, and we follow them"

          We ended up doing ISO 13485 at one point. I couldn't even come up with a cute joke about that.

          I've heard that the latest 9001 is focused more on risk management than specific processes. Sounds good in theory, but I have a feeling that most auditors just won't get it.

          1. Bitsminer Silver badge
            Happy

            Re: Incremental backups?

            "ISO 9001 only specifies that you have a process and follow that process in a documented fashion. It doesn't specify that the process has to be any good or have any value."

            We are ISO 9001 registered. We can repeat our mistakes exactly.

      3. Anonymous Coward

        Re: Incremental backups?

        But the company probably had an ISO 9001 certificated backup procedure of monthly full backups followed by daily incrementals with specified policies of how the tapes get shuffled around the firesafe/sent off site/etc etc

        But the million dollar question is... did the procedure ever include verifying or testing the backups?
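One way to answer the "did anyone test the backups?" question is to restore to a scratch location and compare checksums against the live data. The sketch below is only an illustration under assumptions: the function names `build_manifest` and `verify_restore` are made up here, both trees are assumed to be ordinary directories, and files are assumed small enough to read into memory.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def verify_restore(original: Path, restored: Path) -> list[str]:
    """Return the relative paths that are missing or differ in the restored tree."""
    want = build_manifest(original)
    got = build_manifest(restored)
    return sorted(name for name in want if got.get(name) != want[name])
```

An empty list means every file in the original came back byte-for-byte; anything else names the files worth worrying about before the 3am restore.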

    2. Dick Emery

      Re: Incremental backups?

      I used to sys admin for a small company and found out when their RAID5 failed that the incrementals had been corrupting slowly bit by bit for some time. Bad data in. Bad data out. That was a very sweaty evening rebuilding the array I can tell you. Did I get any thanks for it? Don't make me laugh.

  5. Korev Silver badge
    Joke

    Down under

    As servers are upside down in Australia does that mean that the discs are now the right way round?

    1. ssharwood

      Re: Down under

      I'll tell you as soon as I can ride my Kangaroo to the office, wrestle the crocs out of the way to get into the data centre and smoke the snakes out of the aircon. Once I'm done I'll drink a slab of Fosters (assuming it still exists) and tamper with a cricket ball.

  6. jake Silver badge

    No shame in cocking up!

    Nothing wrong with an honest mistake. You're only human.

    As Grampa told me, after I replanted the 10x100ft rows of carrot seeds that he had already planted the day before: "Own up to your mistake, learn from it, resolve not to do it again, and move on". Sound advice for a 9 year old half a century or so ago, still sound advice for an adult today.

    1. Korev Silver badge
      Coat

      Re: No shame in cocking up!

      I feel some vegetable puns coming on... Let's hope none leek out

      1. Anonymous Coward

        Re: No shame in cocking up!

        You can't top a carrot story.

      2. jake Silver badge

        Re: No shame in cocking up!

        Veg puns? I didn't think of that, but I'll beet you're right. We'll probably be peppered with them. Lettuce wait and see ...

      3. Anonymous Coward

        Re: No shame in cocking up!

        Not interested --- I don't carrot all about vegetable puns.

        1. Anonymous Custard
          Headmaster

          Re: No shame in cocking up!

          The true wisdom is not just to learn from your own mistakes, but to also learn from those of others to save you making your own in the first place.

          At least that's my excuse for reading this section on a Monday morning (and On-Call on a Friday)... :)

          1. Solmyr ibn Wali Barad

            Re: No shame in cocking up!

            Wise man can learn from mistakes of others, but fool cannot learn from his own.

          2. Anonymous Coward

            Re: No shame in cocking up!

            “Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.”

            Bismarck

      4. Commswonk

        Re: No shame in cocking up!

        I feel some vegetable puns coming on... Let's hope none leek out

        1206A: thank goodness; they seem to have stopped sprouting up now.

        1. Paul Westerman
          Coat

          Re: No shame in cocking up!

          That's a turnip for the books.

      5. keith_w

        Re: No shame in cocking up!

        As a poster in a previously read story said, "There's not mushroom for a pun in a Reg story"

      6. Doctor Syntax Silver badge

        Re: No shame in cocking up!

        "Let's hope none leek out"

        Then we'd be in the soup. Cock-a-leekie.

      7. Anonymous Coward

        Re: No shame in cocking up!

        only for those who can't cut the mustard.

      8. dmacleo

        Re: No shame in cocking up!

        we need to squash this now before it sprouts (especially those of you in brussels) and buds while dropping more seeds.

        no more vegetable punions.

    2. Anonymous Coward

      Re: No shame in cocking up!

      It is not the making of mistakes - but how quickly you can recover from them.

      That's called "experience". As the saying goes "An expert is someone who has made all the mistakes" ( and learned from them).

      It is always good to have at least a Plan 'B' in case of unknown unknowns. As Sod's Law states, "Anything that can go wrong, will go wrong - at the worst possible moment."

      People who never expect to make a mistake will eventually get a comeuppance.

      1. Anonymous Coward

        Re: No shame in cocking up!

        A friend asked a junior colleague recently about the quality of the tests he was writing with his code. "Oh that's OK, I don't write bugs" was the response.

        1. uccsoundman

          Re: No shame in cocking up!

          Or what I hear all the time... "We're Agile, we don't need to test"

          1. Aladdin Sane

            Re: No shame in cocking up!

            Worked fine in dev, ops problem now.

      2. Fruit and Nutcase Silver badge

        Re: No shame in cocking up!

        It is not the making of mistakes - but how quickly you can recover from them.

        We had some databases which were prone to getting corrupted due to what was eventually traced to a hardware issue. The recovery process was to rebuild by restoring the previous nightly backup and rolling forward using the transaction logs. On one such occasion, I had cleared the wrong directory as part of the reset, losing some configuration files which would prevent restarting the database. No idea what made me think of the solution, which was to connect to one of the other remote sites (each had identical server builds), copy over the zapped files and carry on...

    3. Gordon Pryra

      Re: No shame in cocking up!

      "Nothing wrong with an honest mistake. You're only human."

      Yeah that sounds all good, unless you happen to have a mortgage ....

    4. Anonymous Coward

      Re: No shame in cocking up!

      Own up to a mistake? That's a good one and not something a true BOFH would admit to.

      On the other hand, if you get away without owning up at all... then all the better. At least, isn't that what the powers that be keep showing us?

  7. Anonymous Coward

    A young boy was trying to grow cauliflowers. He had heard about a technique which his gardener grandfather said would never work. The system meant that the growth of the plants could not be observed.

    On the day of the local gardening competition his grandfather watched as the boy revealed each plant. After the third stunted one his grandfather shook his head and went home. The next one was a giant. The boy took it to the competition tent - which was deserted - and managed to fit it into the display by removing its leaves.

    After the judging - the boy and his grandfather looked at the results. The grandfather admired the cauliflower which was now sporting a "2nd" rosette. A judge remarked that if it had had its leaves it would have been the winner.

    The grandfather put his spectacles on and picked up the plant's identification card saying admiringly "Now let's see who this expert is".

    ***a story from the autobiography of someone raised in a northern English town in the middle of the 20th century.

  8. TrumpSlurp the Troll
    Trollface

    Biggest point - glossed over.

    The backups actually restored.

    1. Gordon Pryra

      Re: Biggest point - glossed over.

      "The backups actually restored."

      All fiction, when in the history of IT has the backup EVER restored when you really really really needed it to (especially at 3 in the morning when it's your last chance)?

      1. Grant Fromage

        Re: Biggest point - glossed over.

        "The backups actually restored."

        All fiction, when in the history of IT has the backup EVER restored when you really really really needed it to (especially at 3 in the morning when it's your last chance)?

        My non-fiction contains success most times, but it always takes longer than the last recovery exercise on the last backup as it has got bigger, and especially when you are the sole resource/implementer in the wee smalls your sphincter does goldfish impressions until it is done and running. I have been lucky: the worst I've had to do is load incremental backups on an older full one, but colleagues have told me of nightmare scenarios.
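Loading incrementals on top of an older full backup comes down to picking the right chain. The sketch below is a hedged illustration, not anyone's actual procedure: the `Backup` record and `restore_chain` function are hypothetical, and it assumes true incrementals (each depends on the one before it) rather than differentials.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Backup:
    taken: date
    kind: str              # "full" or "incremental"
    readable: bool = True  # False if the tape or file failed verification


def restore_chain(backups: list[Backup]) -> list[Backup]:
    """Newest readable full backup, then the incrementals that follow it, oldest first.

    The chain stops at the first unreadable incremental or at any later full
    backup, because incrementals after that point no longer apply to this base.
    """
    ordered = sorted(backups, key=lambda b: b.taken)
    fulls = [b for b in ordered if b.kind == "full" and b.readable]
    if not fulls:
        raise ValueError("no readable full backup")
    base = fulls[-1]
    chain = [base]
    for b in ordered:
        if b.taken <= base.taken:
            continue
        if b.kind == "full" or not b.readable:
            break
        chain.append(b)
    return chain
```

If the most recent full is unreadable, the chain falls back to the previous full and its incrementals, and everything after the bad full is lost.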

      2. Mr Humbug

        I've had a backup restore

        When I moved a number of user accounts from one Active Directory OU to another then deleted the (now empty) OU. Then I discovered that, in an act of malice, the servers decided that deleting the OU (and all that it contains) should synchronise throughout the domain before the account moves should synchronise. At least that's the only explanation I could come up with for all the user accounts vanishing.

        But, as I said, the backup worked.

        1. CrazyOldCatMan Silver badge

          Re: I've had a backup restore

          But, as I said, the backup worked.

          You managed to back up (and restore) AD in a way that actually worked? Wow!

          1. Mr Humbug

            Re: I've had a backup restore

            I cheated, which I could do because it's a small network.

          2. Anonymous Coward

            Re: I've had a backup restore

            Easy. It just needs practice.

      3. werdsmith Silver badge

        Re: Biggest point - glossed over.

        "The backups actually restored."

        and nobody mentioned the loss of work done since the last backup.

        RPO and all that.
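For anyone who hasn't met the term: the recovery point objective is the amount of recent work the business has agreed it can afford to lose. A minimal sketch of the arithmetic follows; the function names and the example times in the note below are illustrative assumptions, not details from the story.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Work done between the last restorable backup and the failure is what gets lost."""
    if failure < last_good_backup:
        raise ValueError("failure cannot precede the backup being restored")
    return failure - last_good_backup


def meets_rpo(last_good_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True when the lost window is no larger than the agreed recovery point objective."""
    return data_loss_window(last_good_backup, failure) <= rpo
```

With a backup at 23:00 and a failure at 15:30 the next day, the window is 16.5 hours: fine against a 24-hour RPO, well outside a 4-hour one.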

      4. CrazyOldCatMan Silver badge

        Re: Biggest point - glossed over.

        when in the history of IT has the backup EVER restored

        I'm sure it must have happened once. After all, the laws of infinite improbability demand it.

      5. Hans 1
        Boffin

        Re: Biggest point - glossed over.

        @Gordon

        when in the history of IT has the backup EVER restored when you really really really needed it to

        You do test the backups regularly, right ?

        You do have multiple backups, right ?

        Kept at different locations ?

        And, if your management has common sense, hard to come by, agreed, you even have DR, right ?

        Go tell the bean-counters that you are PRODUCTION, NOT A COST CENTER, if they don't believe you, simulate an IT incident, with good tested backups of course ... and let them sweat it for, say, an hour ... ;-)

        1. Doctor Syntax Silver badge

          Re: Biggest point - glossed over.

          "and let them sweat it for, say, an hour"

          And make sure the beancounters' stuff is the last to get restored.

        2. Anonymous Coward

          Re: Biggest point - glossed over.

          > simulate an IT incident

          Yeah, back in the '90s with S-100 systems, the accountants told us we couldn't have backup hardware (tape drives) 'cuz they're just too expensive.

          So the next payroll run, the "systems went down" (since the admin disabled logins) and "we have no backups so it'll be a while" so the accountants had to hand write paychecks for 200+ folks. We got our tape drives.

          Anon b/c he's probably reading...

      6. TVU Silver badge

        Re: Biggest point - glossed over.

        "All fiction, when in the history of IT has the backup EVER restored when you really really really needed it to (especially at 3 in the morning when it's your last chance)?"

        With me for one although it took quite a bit of time and I did lose the most recent data but the bulk of it was saved by the backups. As Aunt Mabel used to say, "One can never have too many back up options".

        OK, so I did make that very last bit up but the point stands - backups can save your bacon/butt/job/etc.

      7. Anonymous Coward

        Re: Biggest point - glossed over.

        Always, because I always randomly perform test restores, which saved my neck on several occasions.

        1. Anonymous Coward

          Re: Biggest point - glossed over.

          "[...] I always randomly perform test restores [...]"

          It's always the one you haven't tested that lets you down.

      8. Sigfried

        Re: Biggest point - glossed over.

        Once again long time ago, using an odd OS called Magix (I think, could have been Magic), was a Pick-like PC oriented OS for database development. Had a number of sites with their own server running a production and sales database and each system had a tape backup unit with a well established backup process. You know the sort of thing, 2 x Monday, Tuesday etc, 5 x Fridays, all rotated and stored in a separate part of the site - and even tested so that we were reasonably certain that the backup and restore process worked.

        So one site had a serious hardware fault that brought the server down, so we replaced it. Loaded a fresh OS copy, configured everything and then fetched the latest tape and set about restoring the data. Up popped a message - "Please put in Tape 1". Puzzled, we tried again, and then tried the rest of the tape series, all with the same message. So we summoned the operator and asked about this. His response was that a few weeks back the backup process had started to pop the tape out and ask for a second tape to be inserted - so he just pushed the same tape back in! Never thought to tell anyone about this. Disaster loomed, as we'd tried every day and every Friday tape we knew about. Then by sheer luck I saw another tape sitting over the other side of the office on a shelf and asked what it was - turned out that two days before the user had wondered about the two tape thing and had actually put in a new tape on that day, and that was the magical "Tape 1" from two days back. And hallelujah, it was exactly that: put that in and then the tape 2 from two days ago, and voila, a working system with only 2 days of update data required.

        Quite why the operator said nothing at first, and then did try a second tape at that crucial point - and not mention it earlier - we never did find out. We just thanked our lucky stars and did a runner to the pub for something to calm the nerves.

        Next day a memo goes out saying that if the system asked for a second tape: first use a new tape (and label it) and second, tell IT Support !

    2. Sgt_Oddball

      Re: Biggest point - glossed over.

      Never let the truth get in the way of a good story...

    3. TrumpSlurp the Troll
      Windows

      Re: Biggest point - glossed over.

      Just to add that back in the day with mainframes and exchangeable discs we always (on full backup day once a week) took the full backup to tape, removed the discs, put in fresh disc packs and restored to them.

      We always had a proven set of backup tapes plus the discs from a running system.

      Three sets of tapes as well.

      Tape store, on site disaster store, off site disaster store.

      Tell that to the kids of today....

  9. Anonymous South African Coward Bronze badge

    Regarding RAIDs and hotswaps - I prefer to do a full backup, then put said backup offline, then insert the new hard drive for a rebuild session.

    Never had the need to restore from backups.

  10. RealBigAl

    SFTIII was great

    It mirrored servers perfectly, down to the smallest of faults. One server developed a fault? You could be sure the other would follow as soon as the mirror caught up.

    Apart from that the majority of SFTIII installations I encountered suffered from the hardware being Netware compliant but not SFTIII compliant, so it didn't work correctly.

  11. phuzz Silver badge
    Facepalm

    A long time ago I was making totally legal backups of Amiga games. Ahem.

    Yeah, ok, so I was pirating some games, and to do so, a friend had lent me his copy of White Lightning, which was reputed to be able to copy pretty much anything, and fast. Not having much in the way of pocket money, even new blank discs were pretty pricey, so I'd not even bothered to make a copy of White Lightning, and so I was using my friend's diskette to copy various games.

    You can probably guess what happened next, somehow I mixed up which disc was in which drive, and wrote over my friend's copy program. At the time I remember trying to blame it on him leaving the copy protect tab open on the disc. Sorry Tim.

    Since then I've learnt my lesson and only got 'Source' and 'Destination' the wrong way round about four or five times since. Maybe six. Or seven.

    1. Andrew Newstead

      Started learning my more serious computing with a Sinclair QL (when they were being sold cheap at Dixons). The first lesson with a QL was backup everything because the microdrives were finicky as hell!

      I ended up having 3 or more copies of everything I was using. Still paranoid even now.

    2. jeffdyer

      I made a slightly more serious mistake when my 20MB Amiga HDD filled and I bought a new one. My RocTec HDD didn't have a second IDE power cable or splitter so to copy from one to the other, I plopped the two drives on top of each other so the connectors were aligned and jury rigged a power cable by putting the loose ends of a spare 4 pin connector into the same positions of the old drive.

      Sadly, somehow the pins were in the other order on the new drive, and I toasted my old drive.

      Fortunately I'd backed up most of my work onto floppies, but it was an expensive lesson, particularly back in 1992 odd.

  12. Anonymous Coward

    Modern disks

    I miss the days when disk drives had physical write-protect buttons, with red lights to show the drive was protected.

  13. Woza
    Facepalm

    dd if=/mnt/some/long/involved/path of=/dev/sda

    Followed 0.03 seconds later by swearing and control-C. It didn't have time to do much... other than totally corrupt the file system on /dev/sda. Target should have been /dev/sdb, /dev/sda was the backup drive I was trying to recover from.
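A small pre-flight check before any raw copy can catch this sort of source/destination mix-up. The sketch below is hypothetical throughout: `check_copy_target` is an invented helper, the paths are examples, and it only compares strings against a list you maintain yourself; it does not inspect real devices or mounts.

```python
def check_copy_target(source: str, target: str, protected: frozenset[str]) -> None:
    """Raise ValueError if writing to target would clobber something we must keep.

    protected is a caller-supplied set of device paths that hold the only good
    copy of the data, such as the backup drive being recovered from.
    """
    if target == source:
        raise ValueError("source and target are the same path")
    if target in protected:
        raise ValueError(f"refusing to write to protected device {target}")


# Example: /dev/sda holds the backup, so only /dev/sdb is an acceptable target.
backup_devices = frozenset({"/dev/sda"})
check_copy_target("/mnt/backup/image.bin", "/dev/sdb", backup_devices)  # passes silently
```

Calling it with `/dev/sda` as the target raises before anything is written, which beats reaching for control-C.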

  14. adam payne

    The only good cock-up is one you are able to recover from.

  15. Valerion

    300km?

    People don't drive 300 km in Britain.

    They drive 186.411 miles.

    Most of it over potholes.

    1. Anonymous IV

      Re: 300km?

      > They drive 186.411 miles. Most of it over potholes.

      That would have been in 'crater mode' - the word you wanted was "through".

      1. Anonymous Custard
        Headmaster

        Re: 300km?

        Or for some of them around here, "into"...

    2. TVU Silver badge

      Re: 300km?

      "Most of it over potholes"

      ...because too much money is being blown on HS2 instead.

  16. davidpoirier

    SFTIII

    I remember having this in my client's site in the mid 90's. It worked too well, when one server went down it switched seamlessly over to the secondary... so seamlessly no-one noticed the primary had failed, until two months later when the secondary also failed.

    And then we had to work out which one had failed first so that when we brought them back up again, they synched the right way. It was a 50/50 chance, so of course we got it wrong.

    1. Anonymous Coward
      Anonymous Coward

      Re: SFTIII

      Had a couple of instances at a council. First one had one physical server die (hardware failure) and it ran better with one server than it ever did with two.

      Second one was just being installed, using that new-fangled Netware 4 SFT, and it nicely decided to get involved in a reboot-crash cycle, with both servers alternately ABENDing, restarting, joining the SFT III cluster-thingy just before the second server crashed and repeated the cycle. Users didn't see much of a problem, but it was a sod to fix.

  17. Steve Kerr

    Customer sites

    Had to fly to Jersey to do a customer upgrade

    Solaris upgrade and application upgrade.

    Specified to customer that there must be a full backup before I got there on Saturday morning

    Got there Saturday morning, asked about the backup, finally got hold of their Unix admin, nope, didn't do it.

    Okay!

    On site customer said do the upgrade anyway

    Okay!

    Disk 1 of 2 - whirr whirr whirr - OK

    Disk 2 of 2 - whirr whirr, disk read error

    Oh shit!

    Retry, whirr whirr, disk read error

    <swear swear swear, no backup, swear swear>

    remove CD, look at scratches, customer doesn't have another copy

    Oh shit.

    Bit of saliva, bit of cleaning, bit more saliva, bit more buffing

    Put disk in, whirr, whirr whirr (20 minutes or so later) - done.

    Seat of the pants upgrade with customer standing there too!

    Was also their Solaris disks!

    Another customer (in Jersey oddly)

    IT Manager "you broke our system"

    me "it was OK on the weekend it was upgraded" - This was Wednesday

    IT Manager, "you have to come over fix it now, rant rant rant rant"

    Me - immediate flight to Jersey

    IT manager's manager, flies over from Guernsey

    All in computer room at console

    dig through events, log in on Wednesday, 5 minutes later, errors

    All look at IT Manager.

    Sheepishly admits he logged in and "done something"

    Irate Guernsey Manager

    2 return flights bought at no notice and one emergency consultancy day shows IT manager in bad light.

  18. Anonymous Coward
    Anonymous Coward

    Another interesting one, upgrading an IBM Power server with new firmware, and it failed, borking the machine. Turned out that the check processor that solely existed to check that things were working OK, wasn't working so the machine failed because it couldn't check that it was running OK !

    There must be a way to start without the check processor so we could restart the firmware upgrade process. IBM tech discovered that process at 4:30 am (started at 11 pm) on page 400 or something like that in Appendix XII of the manual. After that it took a mere 2 minutes to boot, reload the firmware, and restart and all was go.

    But a tension filled 5 1/2 hour wait in between.

  19. Chairman of the Bored

    You got to know when to run

    Just out of high school I worked briefly at a steel mill in the electrical shop. They had these "little" 1 to 10MW aux generators that ran on gas to handle peak loads. Part of my job was to synch these to mains bus before closing breakers. I failed, and one of the synchronous machines ripped off its mounts and left the shop. For the most part it went dark. I quit real fast and did a GTFO before the metal workers - paid by the piece - arrived to perhaps literally tear me limb from limb.

  20. BinkyTheMagicPaperclip Silver badge

    Thought for a moment it was the SFT II failure I encountered

    (which I mentioned previously) where the customer *didn't* have a backup, and an engineer had to sit on site for a week recreating the system from printouts..

    There was also the customer who really knew their stuff, and had asked their supplier for a particular HP tape drive. The tape drive arrived, and Did Not Work. Looking at it the customer noted that this wasn't an HP tape drive 'that's ok' said the supplier 'it's actually exactly the same drive, just not HP branded'. 'No', said the customer-with-clue, 'the HP drive has a special chip which enables it to work in this server, and your compatible version does not. Please ship the genuine part'

  21. Oflife
    FAIL

    Self harmed

    What I did in the 80s was very similar. Had spent a good few man months working on a very detailed, fully illustrated User Guide for what became our well reviewed, well loved, but commercially failed (because it was TBEX, Too Bloody Expensive, during a crashing late 1980s economy) Classic BiT BOPPER Audio Reactive Video Entertainment System.

    One day, I fired up the PC or whatever it was I was using to write the user guide on to do a backup. This involved inserting the 3 or 3.5" floppy disk into the machine's drive and copying the Word file on top of the backup file. Simples!

    Errr...

    I dragged the user guide file onto the backup floppy. Grrdd grrdd grrdd grrddd etc. Done. (I recall that's the noise they made back then.)

    Later, when I went to open up the working user guide to wrap it up prior to launch, the Word file was empty, just a few lines of text.

    As one does during such a crisis, I tried to understand what was going on. Flashes of my life and hard work went past as I tried to mentally deny I had lost months of effort. A corrupt file, virus etc?

    I then got the backup out, it was the same as the working file!

    On my PC, the windows I opened up were identical, as was (of course) the name of the file inside.

    I had dragged the WRONG file from the WRONG window, and overwritten my working file with a backup from months ago, that was in fact blank except for a few seconds of typing to open up the new Word file. I had for whatever idiotic reason, probably to do with time passing quickly and being busy, procrastinated on making full on periodical backups. I had lost everything, including the diagrams I had painstakingly drawn.

    I had to re do the whole user guide by copying from hard copy I had made a few weeks earlier of an earlier version, although I stopped about 75% way through when it became apparent the product itself was doomed due to the economic crash of the late 80s/1990.

    The upside is we went on to develop a way cheaper compact version based on the Atari Falcon, and its user guide was published both in print - AND online, one of the first user guides ever published online. Here you go, it's still there in lovely 1918 style html...

    http://www.tecterran.com/svbb/userguide/

    (For the record, people are still using this machine today for the retro visuals.)

    I learned a few lessons, that are less relevant today in a world of online publishing and dynamic cloud backups/syncing etc: a) Don't procrastinate in backing up stuff. b) Name your folders/windows differently!

    (I have never made such an error since.)

    In a way, what happened was probably a divine entity telling me I was wasting my time anyway, given that the product was doomed.
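    The two lessons above (back up promptly, and never let two copies share a name) can be sketched as a backup routine that stamps each copy with the time it was taken and refuses to overwrite anything. This is an illustration only: the function name backup_copy and the file naming scheme are assumptions, not anything used at the time.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_copy(source, backup_dir, now=None):
    """Copy source into backup_dir under a timestamped name.

    An existing backup is never overwritten: a name clash raises
    FileExistsError instead of silently replacing the older copy.
    """
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    dest = backup_dir / f"{source.stem}-{stamp}{source.suffix}"
    if dest.exists():
        raise FileExistsError(f"backup already exists: {dest}")
    shutil.copy2(source, dest)  # copy2 also preserves timestamps
    return dest
```

    Because every backup has its own name, dragging the wrong file the wrong way can at worst add a useless copy; it cannot replace the working file or an earlier backup.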

  22. gb2

    SFT III was a bit difficult to get right.

    I remember finding driver issues with the first round of the MSL cards, random disconnections under heavy load.

    Having fixed that I remember Arcserve continuing to be a PITA. Nobody wants a mirror disconnect during the backup.

    If memory serves, the abend message was:

    Richard Keil Memorial Abend #27

  23. tfewster
    Facepalm

    One from this weekend

    Migrating a customer from Windows XP/Outlook Express to Linux/Thunderbird. I'm still a bit hazy on the details, but it seems to be the difference between POP3 & IMAP.

    Imported the Outlook files into Thunderbird just fine and got Thunderbird talking to the ISP's mail server. Then the Good Idea Fairy suggested moving the migrated Outlook Inbox contents into the "real" Thunderbird Inbox. Great idea G.I.F., the customer will be happy everything is organised the same! But it seems that with IMAP, those emails are no longer on the server, so it duly synched the Thunderbird Inbox to remove all the "deleted" emails.

    Luckily the "customer" is my Mum, so I didn't get sacked. I just have to redo the import again.

  24. bombastic bob Silver badge
    IT Angle

    water, water, everywhere!

    decades ago I had an "in between" job as a building maintenance guy, working graveyard shift. One morning I went to turn on the A/C unit, but when I started the cooling tower pump, someone had shut the valve on the inlet of the A/C unit [only one guy ever did that]. It broke several 12" diameter PVC pipes running around the building, dumping tons of water into the parking garage. Yeah, the switch for the pumps was on the opposite side of the building from the A/C unit with the valve. I didn't get fired, though. But the A/C was down for several days [fortunately weather wasn't bad, and regular ventilation was sufficient] while the pipes were all re-done. With double-thickness pipe so it wouldn't happen again.

    ok NOT related to I.T. hence icon.

  25. Unicornpiss
    Meh

    Break a mirror? 7 y bad luck?

    I remember one of my first IT jobs where I supported POS systems. (in both senses of the acronym) Discovering early into the job that there were tape drives but only a few, filthy, incomplete backup tape sets, and no meaningful backups being done, I pushed the CIO to let me buy backup tape sets for all locations. Over $6K worth of tapes were bought. Subsequently I had some system failures and discovered that even with new tapes and perfectly working drives, I could only recover data about 1 in 5 times. (and this was when the closing store manager remembered to type "backup" at the login prompt before leaving)

    Add to that the daily backup process also mirrored the primary drive to the inert secondary (that was only used when the primary failed) in the wee hours of the morning. It would cheerfully mirror a failing, corrupted drive, and wipe out the last night's perfectly good backup on the secondary. To change the drive involved swapping drives and setting jumpers, then relicensing the software, which was tied to the drive's serial number. I drove 100 miles through a blizzard at least once to perform this asinine, 15-minute process.

  26. JeffyPoooh
    Pint

    Mirror C: to X: first

    HDDs have never been that expensive that a server couldn't have several.

    1. jake Silver badge

      Re: Mirror C: to X: first

      Oh, I dunno. In 1986 my Sun server had a bottomless pit of a drive for user space. It was a 300 megabyte CDC Wren IV SCSI drive. It cost US$14,000 ... in 1986 dollars. (The "user" was a database archiving network statistics, if anybody's wondering.)

    2. BinkyTheMagicPaperclip Silver badge

      Re: Mirror C: to X: first

      That assumes you can sensibly attach extra drives to a server, which isn't a given. Modern servers all have fast USB, historically you'd better hope there's external SCSI, spare drive bays, or similar.

      1. albegadeep

        Re: Mirror C: to X: first

        "historically you'd better hope there's external SCSI, spare drive bays, or similar."

        My first server was a 100MHz P1 bought from university surplus for $10, in 2000. A Compaq with all kinds of weird custom parts inside (memory and processor on a riser card!), and the boot disk was a SCSI hard drive too small for practical use. So we added a second drive to an available IDE port. No extra drive bays - no problem, just zip-tied it to the existing drive bays. Ran like that for 5+ years, never had any problems with it. (Eventually replaced with a P3, which I am currently in the process of replacing. It's amazing how low-powered a small web/mail/file server can be.)

        1. BinkyTheMagicPaperclip Silver badge

          Re: Mirror C: to X: first

          Sometimes you're lucky, but if the port is on a SAS/SCSI backplane IDE/SATA isn't an option.

          I recently copied VMs from one ESXi server to another, and initially thought I'd re-use the sole SATA connection intended for optical drives to directly copy between disks. It turned out that whilst there is an SATA connection, there is no SATA power feed or way to extract power from the custom redundant PSUs.

          Resorted to an external USB to SATA dock, which despite the fact previous versions of ESXi had fought tooth and nail when using USB storage, ESXi 6.5 was almost straightforward (once the USB service had been stopped, and various obscure commands run to identify the long GUID/lun identifier for the external disk and mount it)

  27. a handle

    I was on the other side of planet earth all on my own, no-one to call when things got ugly. I was doing some remote work on a server in the country I was in. For a change, I thought I would have a go at using Windows 3.1 Winfile instead of DOS Norton Commander, but over the 14.4k modem the screen updates were really-really slow. "Oops" I clicked delete and then OK for it to delete all files on the OS/2 server. It was like a slow motion train wreck, nothing I could do, it slowly deleted everything in front of my eyes. The server I had deleted all the files from was core to the data communication system; all other servers of the data communication system started up from it. Complicated stuff with many configuration files that were being regularly updated whilst debugging the new unreliable system, no regular tape backups; just the odd occasional one. The system kept running...phew! All I had to do now was book a flight to the other end of the country the following day. I had to get there before the next system crash, as crashes happened most days being a new system. Attempt to restore from backup, if complete enough, if good enough, if current enough to restore the folders, configuration files, then a very brave attempt of a complete system restart. Sphincter was like a goldfish mouth for 24 hours... got it going! If I hadn't, I don't know what I would have done.

  28. enormous c word

    SFT wasn't so bad

    It was clever stuff - Fault Tolerance all done in SW on old (generic) kit with Intel single CPU speeds measured in the range of maybe 25-70MHz. So it was essential for the 2 machines to be identical if they were to operate in lock-step with each other - there simply weren't enough spare CPU cycles to deal with Hardware Abstractions and translations.

    Fault Tolerance (more or less) went away for a loooong time on generic Intel kit until VMware came along and virtualised everything but with the benefit of CPUs running at 50x the clock speed and multiple cores.

  29. Anonymous Coward
    Anonymous Coward

    I remember SFT at a Novell marketing session. The company I worked at only had a single server. Its hard drive had failed twice in my time there, both times on a Friday. The first time our vendor met their SLA with a replacement drive, but not with a restore by a Netware expert. I, being a fresh minted CS grad with only 4 months Netware experience, had to pull a weekend miracle to have it reinstalled and restored for Monday morning. The second time I had a bit more experience and was restoring from my redesigned backup system. I had fired the vendor for their dropping of the ball multiple times. I had ordered a new drive with overnight shipping via UPS. It was coming through California which had a major earthquake earlier that week. It arrived Sunday morning instead of Saturday morning. I was still able to have a working fully restored system by Monday morning. I also had UPS regional manager in my office over the lack of cooperation on their part to get the shipment there in spite of it being at the local UPS warehouse by Saturday afternoon. In both instances, I tested restores regularly prior to the event. My redesign was due to "lessons learned" from the first event. That redesign and my greater experience allowed me to do the same task in half the time (actually less). "Lessons learned" from the second event allowed me to redesign yet again with the goal to achieve an overnight complete system restoration given working hardware to which to restore. Never had to actually do one, but I simulated it multiple times to test the design, earlier tests failed, fix and redo.

  30. TrixyB

    Just like a Robocopy mirror...

    Everyone has run /Mir the wrong way round at least once? OK maybe just me!
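    Robocopy's /MIR mirrors the source tree onto the destination, deleting anything in the destination that is not in the source, so running it backwards from an empty or stale directory empties the good copy. One cheap guard is to compare file counts first and stop if the "source" looks suspiciously smaller. A rough sketch, with the function names and the 50% threshold both being arbitrary assumptions for illustration:

```python
from pathlib import Path


def count_files(root):
    """Count regular files anywhere under root."""
    return sum(1 for p in Path(root).rglob("*") if p.is_file())


def mirror_looks_safe(source, destination, min_ratio=0.5):
    """Return False when mirroring would probably destroy data.

    The heuristic: if the source holds fewer than min_ratio times
    as many files as the destination, the direction is likely wrong.
    An empty destination is always considered safe to mirror onto.
    """
    src_n = count_files(source)
    dst_n = count_files(destination)
    if dst_n == 0:
        return True
    return src_n >= dst_n * min_ratio
```

    A wrapper script would call mirror_looks_safe before launching the real mirror and ask for confirmation when it returns False. File counts are a blunt measure, so this catches the "empty new volume replicated over everything" case rather than subtle staleness.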

  31. molletts

    It replicated. What more do you want? :)

    I once had a similar thing happen when I added a new server to an existing DFS Replication group on Windows Server (two words that I've always said don't sit comfortably together). DFS-R very quickly and efficiently replicated the nice, capacious new storage volume to the other servers. (Hey, I could save some money on disks! No need to buy several new ones when we run out of space: just buy one and replicate it! Why didn't I think of that before? :-P )

    Thankfully, the data on that particular replication set was not mission-critical (it was mostly multimedia content) so it was merely inconvenient for the day or so it took to restore from the previous night's backup and re-replicate to the group.

    Since then, I've always pre-seeded new servers using Robocopy. DFS-R still insists on replicating the whole lot in one direction or the other (exactly which way seems to be totally random) when the server joins the group, causing massive fragmentation (this is NTFS, after all, where writing a 1MB file to a freshly-formatted 1TB disk will probably result in at least 60 fragments), but at least the losses are, at worst, limited to any changes saved to an existing server since the files were copied over to the new machine.

    I should have done this anyway the first time, just as I would have with NTFRS, but DFS-R was shiny and new and supposedly much more reliable (my a**e - there's a reason I held off migrating SYSVOL from NTFRS to DFS-R until Server 2012R2 migration forced my hand) and I thought it would be kind of cool to be able to just "set it and forget it" and watch it sort itself out without me having to lift a finger. Why did we bother upgrading to Server 2003R2 if not for the greater ease of administration? (Of course I'd also forgotten that, even if replication had gone the right way, the new server would immediately start advertising itself as a good place for the clients to find their content, despite most of it not yet having been copied over.)
