Sysadmin’s plan to manage system config changes backfires spectacularly

Welcome once more to Who, Me?, the column for Reg readers to get their worst deeds off their chest. This week, "Ryan" tells us about a time many years ago when he got a little bit cocky with root-level commands. At the time, he was the senior systems and network administrator for a major research lab. "I administered a …

  1. Michael H.F. Wilkinson Silver badge

    Automation does have its place

    We had one sysadmin (who is no longer with us) who tended not to like automating stuff too much. This meant that whenever we needed new accounts for students or guests, he would fiddle around for a while and hand you a list of new user names and (temporary) passwords on a bit of paper. I learnt the hard way, after many complaints by students, that it paid to check all accounts manually, to see if

    a) the login actually worked,

    b) things like home directories had actually been made for each account, and

    c) account A didn't, by default, write into home directory B and vice versa.

    I currently administer a small compute server used for teaching and research, and I have never been able to replicate these kinds of errors when using "adduser" to create new accounts.
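
    For what it's worth, checks (b) and (c) are easy to script. A rough sketch, assuming standard coreutils and a hypothetical new-users.txt list of the names on that bit of paper; check (a) still needs a human to actually try logging in:

      #!/bin/sh
      # Sanity-check a batch of freshly created accounts.
      while read -r u; do
          home=$(getent passwd "$u" | cut -d: -f6)
          [ -n "$home" ] || { echo "$u: no passwd entry"; continue; }
          [ -d "$home" ] || echo "$u: home directory missing"
          owner=$(stat -c %U "$home")
          [ "$owner" = "$u" ] || echo "$u: home owned by $owner, not $u"
      done < new-users.txt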

    1. defiler

      Re: Automation does have its place

      Group Policy works great for user setups on Windows. You just add a user in the correct OU, and it picks up the policy. Sets up home drives, profiles, application configs and stuff like that.

      If you're spinning up lots at once, PowerShell does the trick, but if they come in dribs and drabs then ADUC and GPO save a lot of guesswork.

    2. Prst. V.Jeltz Silver badge

      Re: Automation does have its place

      We had one sysadmin (who is no longer with us) who tended not to like automating stuff too much

      Obviously in the wrong career then, if not a dangerous liability.

      1. Anonymous Coward
        Anonymous Coward

        Re: obviously in the wrong career

        Probably moved to politics and is now managing Brexit, from the looks of things.

    3. Lee D Silver badge

      Re: Automation does have its place

      The "admin who does things like it was 30 years ago" is surprisingly common.

      When I started here, there was no computer imaging process - each one was manually cloned from one of its nearby machines and then manually re-configured. There were duplicate SIDs and unlicensed software everywhere. There was no user-management - each one was set up manually each time, so half of them were missing something or other. And home folders were manually made and permissioned for each user on creation*. Everything was done with copy-paste batch scripts that he didn't understand, which everyone ran on every login, and which literally carved out exceptions (e.g. IF %username% = "fbloggs", to map drives, printers, etc.). The console windows were still visible minutes after logging on as they churned through it all every time.

      AD was literally a shock to the guy beyond "create new user". And he was being paid by the hour (not the reason for his lack of process, at least not directly, but he literally didn't have the knowledge).

      Within a week, and without spending a penny more than had already been spent, I introduced F12 PXE boot to WDS (which meant imaging took 20 minutes from bare machine to domain-ready client with the base software, in the worst case) and group policy (which meant that users' printers, drive maps and settings, and machines' specific software and settings were installed after a couple of reboots of any fresh machine, controlled centrally, and changed and cloned easily), and applied the MSKB article that shows you how to permission the root profile folders so that users logging in for the first time would create their own profile folders automatically.

      Literally the guy was stuck on using things that had "worked" for him on Windows 2000 and never bothered to update his knowledge in all that time. That you could deploy a printer from a GPO was new knowledge. That you could image machines from a clean template. That you could centrally control updates. That you could map drives. That you could have a proper tree of users and groups (rather than just leaving everything in the default Users and Groups folders) and have "Users" settings apply to everyone, while "Users\Office" people also got Office settings. That you could modify policies on the domain other than "Default Domain Policy" (literally EVERYTHING was in there). That you could target a policy at users, groups, or even things like Windows versions or machine types.

      It took me a few weeks to go from utter unmanaged chaos to "F12, new image, reboot, right-click in AD, clone an existing user (even disabled) of the same type, set password, bang... everything comes down".

      It's alright, it's not like we were a school or anything, with 500+ pupils, ~100 staff, all with different settings and permissions, ~100 leaving and ~100 joining users every year, and all needing central control for things like web filters (enforced proxies), etc.

      Literally, his "web proxy setting" was a Regedit script for Mozilla Firefox run from a login batch file. Press Ctrl-C and it never got applied. Unapply it after login and it bypassed everything. And, no, not even a "catch-all" transparent filter.... literally relying on that batch file to be all your security.

      I honestly never asked what the rest of the junk in his batch files was and just started replacing them from day one. There were things in there playing with Word/Office, activation, antivirus warning disabling, ActiveX permissions, desktop icons (copied from the central server every logon), all kinds of stuff. I just switched them off for a few test machines and then resolved the issues that occurred in a more proper manner.

      (*To this day, years later, I'm still finding folders that don't have inheritable permissions and/or have things like "Administrators" - the group, not the user - as the owner. There were also a ton of legacy folders, including user profiles, that the user could access but administrators couldn't. The only way to fix them is to take ownership of all files with recursion, then re-permission with recursion, then put the file owner back as it should have been.)

      P.S. He didn't last long.

      1. Anonymous Coward
        Anonymous Coward

        Re: Automation does have its place

        I can understand how that happens. Often, especially in the likes of schools and other very important yet poorly funded organisations, the person who shows a little computer knowledge (e.g. knows that a computer is some sort of box) is placed in charge of such things, often without training (especially back in the '80s and '90s; dunno about these days).

        Over time they learn little tricks to help maintain things, but they lack the time (and sometimes permission) to do any further training or improvements. I've been there myself, maintaining a system for a decade over and above my normal work (which, BTW, was a full-time job!). At least the firm paid time-and-a-half for the first extra hour and double time after that. I got a decent income but zero social life.

        Then the new guy comes along. He'd spent more time in computer training than I'd ever been given, and was able to automate a lot of stuff in ways I'd never considered.

        For a long while things were great. Then, as with another post below, a large portion of the users disappeared overnight. A glitch in one of his batch files, triggered by an unforeseen combination of events, took out a lot of stuff. The first I knew about it was the phone call from the second-shift supervisor, the call to come in and fix it. Long night, but we got it sorted.

        I did come to his defense at a later managerial meeting. I pointed out that while his mistake had cost a considerable amount in lost productivity, his automations had saved more than that in the preceding months, and we were actually able to pitch a strong case for improving the IT training of staff. If the company had given me the training I'd asked for earlier...

        A guy who only has the resources and training to patch together what bits he can as he goes is not to blame for the system he leaves behind, unless he is offered the opportunity to improve things and doesn't take it. What works gets you through the day. What's new may not work so well, and may not get you through several days (or it may save a few hundred overtime hours a year and pay for itself in a couple of weeks...)

        1. Lee D Silver badge

          Re: Automation does have its place

          @Anon The guy in question was a highly-paid specialist IT consultant brought in to do disaster recovery on their systems... he had a year, a clean slate, virtually unlimited funds, new kit (everything from network switches to PCs to tablets to servers from the ground up), all the time in the world, and absolute control of anything he wanted.

          He was brought in as "the expert" to set the tone for the system. I was hired later as the guy to "keep it ticking over" day to day. It took 6 months to turn that situation on its head.

      2. Anonymous Coward
        Facepalm

        Re: Automation does have its place

        @ Lee D "P.S. He didn't last long."

        Going on the number of down votes, at least two people on this forum recognize themselves :]

    4. Anonymous Coward
      Anonymous Coward

      Re: Automation does have its place

      A problem with automation scripts written by sysadmins is that they generally do not have a development background, do not consider enough corner cases and how things can go poorly, and do not write defensively. Automation scripts should be considered as nothing less than production apps and subject to the same controls: peer review and source code check-in and check-out to name a few.

      One day I woke up and reviewed alerts before work. I noted that the entire Accounting department had their accounts deleted. I thought to myself "Hmmm, I thought we needed an Accounting department."

      Upon arrival at the office I asked what happened and was told it was "just a glitch" and the accounts had been retrieved from the AD Dumpster. It had never happened before so it would never happen again. Right.

      As you've already guessed, it happened the next day. So now the sysadmins decided to investigate. They had written a script to sync the HR management software with the AD structure so AD reflected HR. Good idea. But the author failed to consider what would happen if a department manager went on leave.

      The Accounting manager went on long-term sick leave and was removed from the HR org tree after one week, per policy. When their AD import script saw no manager, it branched to the cleanup section and, because the department apparently no longer existed, deleted all of the active accounts instead of stopping for confirmation or just disabling them. It was actually a cascading fault: one of the Accounting manager's subordinates supervised a smaller department, and their accounts were wiped out as well when the subordinate's own account was deleted by the script.

      Just imagine if the CEO had gone on long-term leave...
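
      The defensive fix is to fail closed when the input data looks wrong, rather than branching into cleanup. A sketch of the guard (hr_lookup_manager is a hypothetical stand-in for the real HR query):

        manager=$(hr_lookup_manager "$dept")    # hypothetical HR query
        if [ -z "$manager" ]; then
            # Anomalous data: do nothing and tell a human.
            # Never treat "no manager found" as "delete the department".
            echo "WARN: department '$dept' has no manager in HR; skipping" >&2
            exit 1
        fi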

      1. John Riddoch

        Re: Automation does have its place

        I used to have to do user account creation annually at a university. I'd inherited some (fairly ropy) scripts and an MS Word mail merge template which took a fair bit of manual effort. I reduced it to a couple of Unix scripts which then created a LaTeX file to print out and another output file to create the Novell 4.1 accounts (that probably dates it pretty well). The printouts were handed to the lecturers to distribute to their classes on the first day and get them to log in.

      2. PickledAardvark

        Re: Automation does have its place

        "Automation scripts should be considered as nothing less than production apps and subject to the same controls: peer review and source code check-in and check-out to name a few."

        In the old days, we used to talk about things with colleagues. Even if you have a change management process, you still have to talk informally with colleagues -- including people who have a different outlook. When you take a day off and things go wrong, somebody else needs to understand more than you wrote in comments and a change report.

      3. Anonymous Coward
        Terminator

        Re: Automation does have its place

        A problem with automation scripts written by sysadmins is that they generally do not have a development background, do not consider enough corner cases and how things can go poorly, and do not write defensively. Automation scripts should be considered as nothing less than production apps and subject to the same controls: peer review and source code check-in and check-out to name a few.

        Exactly so. And now we have things like Puppet which make sure that these scripts get run across tens of thousands of machines.

        And, of course, we're all now planning on keeping everything in the cloud, which is probably millions of running OS instances. And they make it cheap by making it scale, and they make it scale by automating the shit out of it: they can, I presume, deploy changes across millions of machines in one go. Of course, these people are much cleverer than most sysadmins and they test stuff a lot more carefully, I hope, but they are going to make a mistake (or someone is going to make them make a mistake), at which point we're all, well, fucked.

        1. John Brown (no body) Silver badge

          Re: Automation does have its place

          "deploy changes across millions of machines in one go... going to make a mistake (or someone is going to make them make a mistake), at which point the we're all, well, fucked."

          It's called Windows Update (windows as a service) and the problems have already been documented :-(

          1. Anonymous Coward
            Terminator

            Re: Automation does have its place

            Well, I was really thinking of 'AWS hypervisor update' or whatever low-level thing it is they use. At some point there is going to be a platform-level compromise of AWS, and at that point, well.

      4. cream wobbly

        Re: Automation does have its place

        A problem with automation scripts written by sysadmins is that they generally do not have a development background...

        A problem with automation scripts written by developers is that they generally do not have a sysadmin background.

        It's true both ways around, which is why the top role was originally called Unix Programmer. It's rare for a company to knowingly employ such a deity these days. They'll make do with mere Systems Administrators who know a bit of scripting; or even System Operators who sometimes know how to change a config without manpages.

        1. Anonymous Coward
          Anonymous Coward

          Script deleting HR accounts

          That's just stupid having a script delete accounts in automated fashion. It should produce an alert "the following accounts are ready for deletion for $reason" and list commands that will do it. Then sysadmins can investigate, decide it is valid, and cut and paste the commands to do the deed.

          Anything that is going to have a major impact should not be done in an automated fashion unless it is time-critical (like disabling accounts that may be involved in a security breach). Deleting accounts certainly falls under "major impact"! It isn't like there's a rush to delete them; account deletion is never an emergency that can't wait for a human to approve.
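
          As a sketch - accounts_ready_for_deletion and the mail address are made up; the point is the script proposes and a human disposes:

            for u in $(accounts_ready_for_deletion); do    # hypothetical helper
                echo "userdel -r $u"
            done | mail -s "Accounts ready for deletion - review, then run by hand" sysadmins@example.com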

          1. Anonymous Coward
            Anonymous Coward

            Re: Script deleting HR accounts

            > That's just stupid having a script delete accounts in automated fashion.

            It simply was an extension of the management directive of "Virtualize everything" to "Automate everything!"

            Just because they could, they did.

          2. Trixr

            Re: Script deleting HR accounts

            That's just stupid having a script delete accounts in automated fashion.

            Er, if you've got a few dozen users perhaps. When you're doing dozens/hundreds of account expiry/deletion operations a day, you do not want someone having to go through all that by hand.

            It's silly if you've only got a relatively small user base as well, because you are wasting a LOT of time doing stuff manually. You're actually more likely to introduce errors by hand than doing it using a robust automated process.

            What you should do is ensure the process and checks prior to your automated part are robust, with some contingency in case of manual error. For example, for us, a contractor's termination date must be entered into the HR system first. If HR has it wrong, we are not accountable for their mistake... but we have baked in a 30-day interval: an account is only disabled on that termination date, then sits for 30 days prior to deletion.

            Then you carefully TEST your scripts (in a non-production environment first, and then in production, scoped to a specific user set) and make sure that you have subroutines to catch exceptions and flag errors. And good backups. And as others have said, peer review.

            1. The Oncoming Scorn Silver badge
              Alert

              Re: Script deleting HR accounts

              I'm in a weird situation here, and I will admit to being, on the whole, totally new at the whole create/disable accounts thing (having moved up & sideways late in my career).

              Hires & Fires come in at all times of the day.

              I create accounts as per my instructions; different job roles have a whole bunch of OUs to be added, globally or branch-specific, based on the training I have had & prior knowledge.

              So I'm slowly working on script automation: basically the same script for each job role, which calls in the specifics for each branch (branch address & the correct OUs etc.). According to my process training, I still have a number of Office/Exchange/Licensing/Skype server web pages to go through to complete a setup, which I don't want to get out of my depth by even attempting to automate, so I'm simply testing as I go.

              Disabling accounts is pretty much the same process (in reverse), but really doesn't require any script as yet (that said, there are a few parts I'd like to automate).

              Upshot: I'm really not sure if the way I do things is the correct Microsoft way; it's just how I have been told to do it so that it works with the rest of our group infrastructure.

              My scripts at the very least save me having to cut'n'paste branch specifics from an Excel master spreadsheet, so that's a time-saver when setting up a new user or changing a job role.

              1. Anonymous Coward
                Anonymous Coward

                Re: Script deleting HR accounts

                "Upshot I'm really not sure if the way I do things is really the correct Microsoft way, it's just how I have been told to do it so it works with the rest of our group infrastructure."

                <Joke> I'm not sure "correct" and "Microsoft" really belong in the same sentence...

            2. Danny 2

              Re: Script deleting HR accounts

              Welcome to Brazil, Where a Computer Bug Condemns a Man to Death

              https://gizmodo.com/welcome-to-brazil-where-a-computer-bug-condemns-a-man-1659912414

              The first computer bug, the story goes, was a moth squashed inside an old electromechanical computer. In Terry Gilliam's Brazil, one such bug gets stuck in a printer, resulting in a typo that leads to the killing of poor innocent Archibald Buttle, a cobbler, rather than alleged terrorist Archibald Tuttle.

            3. Anonymous Coward
              Anonymous Coward

              @Trixr - dozens of deletes a day

              If you are doing this as part of an actual process, you don't delete the accounts of users who leave. You disable them (making it very easy to undo, and with sanity checks to ensure not too many are done in one day, which may indicate a problem) and then have a process that acts later and deletes them - making sure they are already disabled before trying to delete.
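
              Roughly this shape, as a sketch (accounts_past_grace_period is a hypothetical helper; the threshold is arbitrary):

                max=20
                set -- $(accounts_past_grace_period)    # hypothetical helper
                if [ "$#" -gt "$max" ]; then
                    echo "Refusing to touch $# accounts (threshold $max); investigate first" >&2
                    exit 1
                fi
                for u in "$@"; do
                    usermod -L -e 1970-01-02 "$u"    # lock and expire: disabled, trivially reversible
                done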

        2. Alien8n

          Re: Automation does have its place

          I seem to be a bit of a rarity nowadays, worked my way into IT Management the long way... started out as an operator, moved to engineer (mechanical), then to report designer for the engineers. Then moved to product engineer (emphasis on data analysis), then systems engineer, then systems designer (still technically an engineer at this point). When they realised they needed an IT person with a working knowledge of manufacturing systems they moved me into IT where I gradually worked through several developer positions, DBA roles, and finally into IT management with some networking skills. However I'm intelligent enough to ask the question "what happens if I press this" BEFORE pressing the button, rather than as I press the button.

        3. Anonymous Coward
          Anonymous Coward

          Re: Automation does have its place

          Re: "Systems Administrators who know a bit of scripting"

          Sorry, but how can you possibly be a sysadmin of any description and not know quite a bit of scripting? It's pretty inherent to the role!

          (Whether the scripts one writes are quick and dirty hacks or reasonably elegant, with reasonable error/edge case handling, is perhaps another question, however.)

    5. Anonymous Coward
      Anonymous Coward

      Re: Automation does have its place

      "I currently administer a small compute server used for teaching and research, and I have never been able to replicate these kinds of errors when using "adduser" to create new accounts."

      Never underestimate the talent of idiots and their ability to deliver beyond the limits of competency.

      1. JimboSmith Silver badge

        Re: Automation does have its place

        We had an auto-delete option in the config of some software. It was supposed to be used for deleting older files from the data file system, leaving just two days' worth of daily logs etc., and it was also supposed to be set to local drives only. Someone at a satellite office set theirs to delete files from all the drives it could see. We were first alerted when a user found their database missing and investigated. About 15 minutes later we had retrieved the missing files (anything more than 48 hours old), worked out what had been done, and administered a quiet word. The next day the same thing happened again from a different satellite office. Same bloke had been on a road trip and had applied the same fix...

    6. Gerhard Mack

      Re: Automation does have its place

      "I currently administer a small compute server used for teaching and research, and I have never been able to replicate these kinds of errors when using "adduser" to create new accounts."

      Adduser is designed to be easy to use; useradd, on the other hand, has a ton of fun ways to let you screw things up.

  2. Waseem Alkurdi

    Why use a revision control system?

    These are typically used for code.

    Sysadmins usually back up to tape.

    If there's something broken, one would restore from tape (or at least, restore the offending file from a tape).

    1. BinkyTheMagicPaperclip Silver badge

      Re: Why use a revision control system?

      It's probably overkill for standard config files. If, however, it's a shell script, firewall configuration, or other fairly complex file, then a revision control system could be an advantage.

      1. Anonymous Coward
        Anonymous Coward

        Re: Why use a revision control system?

        It's not overkill for standard configuration files. If a user gets added to a system, say, people will want to know who added it, when, why, what authority they had to do so and so on. A revision control system gives you the key into that: '/etc/passwd & /etc/shadow were changed by commit 615032f and the commit log for that says this corresponds to approved change 23857 and I can look that up and it's been signed off by Spodge, who is an approved authoriser for ...' and so it goes on.

    2. Anonymous Coward
      Anonymous Coward

      Re: Why use a revision control system?

      If there's something broken, one would restore from tape (or at least, restore the offending file from a tape).

      If you're making several versions on the same day and the earlier one worked but later was borked?

      Access to older backups may not be immediate, and errors may manifest only after months.

      If there are multiple sysadmins it is nice to know what was changed when fixing someone else's errors. I'm not going to commend every single config change on the config file itself.

      1. Waseem Alkurdi

        Re: Why use a revision control system?

        If you're making several versions on the same day and the earlier one worked but later was borked?

        If I'm not commenting my changes in the file itself, I'd do this:

        /etc/fstab-20181203-1208

        /etc/fstab-20181201-1100

        ... up to five past revisions per file per backup period.

        I'm not going to commend every single config change on the config file itself.

        Why not?

        Wasting an extra two minutes' worth of:

        # note to self: this fixes the ________ issue by doing _______ and ________.

        would save twenty minutes at midnight trying to decipher what you'd done that day, when your eyes are seeing each line of the config file with two blurry, redundant versions above and below it.
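
        (The dated copies above are just a one-liner habit at edit time - a sketch; the pruning line assumes GNU head:)

          cp -p /etc/fstab "/etc/fstab-$(date +%Y%m%d-%H%M)"    # timestamped copy before editing
          ls /etc/fstab-2* | head -n -5 | xargs -r rm --        # keep only the five newest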

      2. phuzz Silver badge

        Re: Why use a revision control system?

        "I'm not going to commend (sic) every single config change on the config file itself."

        I work with people like you, and from bitter experience, I hate you.

        1. Stevie

          Re: Why use a revision control system? 4 phuzz

          And I work with people like Waseem who copy old versions to a new name.

          Thing is, there are dozens of us and each has his own preferred naming convention. I hate the mess the filesystems become as a result and end up running a script to dump it all into a new directory called obsolete and zip them all down. Once a year I tar all the zips automagically into a yearly archive.

          My director yelled that he wanted us to use a VC tool for this stuff, but refused me permission to deploy git in a hysterically funny and very annoying saga I already told, and the situation is still as it was pre-shoutyboss.

        2. cream wobbly

          Re: Why use a revision control system?

          I work with people like you, and from bitter experience, I hate you.

          Yes. (From bitter experience) I'm completely the opposite to our commending friend. I'd go so far as to say that documentation, communication, is vastly more important than the config change itself. If you can't reverse it or replicate it, it's a guess, it's a hack, it's broken and there's the door.

          1. Danny 14

            Re: Why use a revision control system?

            Wow, you change your fstab so many times in a day that you worry about backup revisions? Just type it out again by looking at the copied file on your WS. You checked it out and back in? Sure, that means a secondary copy.

    3. Secta_Protecta

      Re: Why use a revision control system?

      When I worked for an ISP we used revision control for the named config files; at the time we had some very ropey contractors working there and it came in handy a number of times...

    4. Anonymous Coward
      Anonymous Coward

      Re: Why use a revision control system?

      "one would restore from tape (or at least, restore the offending file from a tape)"

      Maybe life is just too short for that...

      P.S. A tape backup is not a change management system; version control with check-in comments (partially) is.

    5. iainr

      Re: Why use a revision control system?

      If you want to revert a file, it's a lot quicker and easier to use a version control system to see when it was changed and what the differences are than to pull stuff off tape. At work we use a configuration management system called LCFG (www.lcfg.org) that allows us to configure large numbers of Unix boxes via configuration files that are under version control. If I want to add software to a lab full of machines, I can do it by editing a file. In a month's time, if someone wants to know who installed the software and why, they can check the configuration log files. If they want to remove the software, they know what to remove from the config file; and if they want to revert the lab's software set back to what it was before the start of term, it's a matter of reverting the configuration file in the version control system.

    6. gordonmcoats

      Re: Why use a revision control system?

      All my extra-special configuration files are safely stored on the 12" reel hidden under my desk. Not had to reload anything off it in years though..

      1. Anonymous Coward
        Anonymous Coward

        Re: Why use a revision control system?

        LOL, but you do make a valid point: backup mechanisms have only as much lifespan as their media.

        I doubt anyone will be able to dig up the 8" drives I used when I started in IT, and I think you'll already have to go to some rather obscure places to find a 3.5" drive. Another fun one is CD-ROM media: the very early media were not made for the 40x spin speeds the later drives were capable of, which means you can't really restore from them any more, as they may shatter on spin-up (been there, and it was very impressive). Ditto with tape.

        1. ridley

          Re: Why use a revision control system?

          Your tapes shatter on spin up?

          You're holding them wrong.

          1. Loyal Commenter Silver badge

            Re: Why use a revision control system?

            Your tapes shatter on spin up?

            You're holding them wrong.

            Maybe not shatter, but tape can demagnetise or otherwise degrade over time. Plastic becomes brittle and perishes with age. Do you know what state the tape from a 1990s backup is in right now without trying to restore from it?

            1. HellDeskJockey

              Re: Why use a revision control system?

              Ahh, paper tape. If worst comes to worst you could always read it manually. Though for a backup I would use Mylar; that stuff was damned near indestructible. Way too bulky for modern systems though: 1 kilobyte requires about 2.6 metres of tape.

    7. Pirate Dave Silver badge
      Pirate

      Re: Why use a revision control system?

      I use an RCS to backup my switch configs weekly via script. That way, if I dork something up, or have to replace a switch, I've got a recent config file to fall back on for each switch.
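
      The script is nothing fancy - roughly this, with invented hostnames and paths (ci -l keeps the working file and the lock after check-in):

        #!/bin/sh
        # Weekly cron job: pull each switch config and check it into RCS.
        for sw in sw-core1 sw-edge1 sw-edge2; do
            scp "backup@$sw:running-config" "configs/$sw.cfg" &&
            ci -l -t-"switch config" -m"weekly automated backup" "configs/$sw.cfg"
        done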

    8. Adrian 4

      Re: Why use a revision control system?

      Code used to be backed up to tape too. It was obsoleted by revision control systems.

      1. Nick Kew

        Re: Why use a revision control system?

        Code used to be backed up to tape too. It was obsoleted by revision control systems.

        First code I ever wrote had to be saved to tape for every increment. 'Cos we didn't have discs back then, and a simple bug would commonly require a several-minute reboot (from tape) and restore (ditto).

        But revision control had already existed for some years: sccs goes right back to 1972.

      2. Doctor Syntax Silver badge

        Re: Why use a revision control system?

        "Code used to be backed up to tape too. It was obsoleted by revision control systems."

        And where is your revision control system backed up? Don't tell me it isn't. Revision control and backups are two different things.

        1. Anonymous Coward
          Alien

          Re: Why use a revision control system?

          This is an important point. And in fact there are at least three different things which people confuse: revision control, whose job is to track changes and let you understand them and back them out; hardware redundancy, whose job is to make sure that suitably mild failures don't take out the system (how mild depends on the money you are willing to spend: typically a single disk, but if your mirrors are in DCs 20 miles apart then you're probably robust against some minor nuclear wars); and backups, whose job is to be the backstop for everything else.

          I frequently hear people saying 'oh, we have RAID, we don't need backups': yes, yes you do need backups. And if the data matters you need them to be physically far away from the live data.

    9. Herbert Meyer

      Re: Why use a revision control system?

      Useful things like diff become available. Even (when there are multiple administrators) finding out who made the changes in question.

      Linux has a system called etckeeper that puts /etc under git version control, with some additional hooks that understand the dnf/apt upgrade process. Often dnf/apt is what broke it, not me.
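
      A minimal sketch of the workflow (Debian/Ubuntu flavour, where installing the package initialises the repo in /etc; details vary by distro):

        apt install etckeeper git
        cd /etc
        etckeeper commit "before changing exports"    # manual snapshot with a message
        git log --oneline fstab                       # who changed it, and when
        git diff HEAD~1 -- fstab                      # and exactly what changed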

    10. Loyal Commenter Silver badge

      Re: Why use a revision control system?

      If there's something broken, one would restore from tape (or at least, restore the offending file from a tape).

      Hahaha. No.

      The reason why this is wrong is nicely illustrated by the following hypothetical situation:

      You get in at 8am and find there is some urgent configuration work to do, and your client needs it all working by the end of the day. The changes aren't simple, and after making and testing several revisions, you're finally ready to go at 4:30pm. You're just about to run your scripts when you discover that you've accidentally deleted the folder they are in, because Windows Explorer had the focus when you thought you were hitting the Delete key in a Word document (because you are documenting everything, and you are working over a laggy connection to a VM in another office). Do you:

      a) Restore from a tape backup and repeat 8.5 hours' work. This will take 24 hours to retrieve the backup tape from the secure off-site storage, followed by 3 hours to verify, find and restore the file in question. Or it will, once you have got management authorisation to make a request to have the tape retrieved. Let's hope the backup completed successfully, eh?

      b) Retrieve the last good version of your script from version control and reapply the last 0.5 hours of work.

      Tape backups and version control systems are different tools, for different jobs, and both have their places. I wouldn't use a git repository for database backups, and I wouldn't use tape for version control.

  3. Terje

    I can think of several reasons to use revision control for config files.

    It makes it very easy to set up a new unit in a specific environment while keeping tabs on what changes are made. Say all the computers in lab B have the same config, but those in lab A are different: when you add a new computer to lab B, you check out the correct branch (lab B) and get all the correct configs for it.

    If down the line you find some issue that is not immediately apparent, you can easily see what has changed in the config since it last worked, no matter how long ago the change was made.

  4. Saruman the White Silver badge

    Other screw-ups

    Many, many years ago, when I was young and working in my first job, I was given responsibility for managing a small network of Sun workstations (one of which was a diskless node). One day I decided I needed to clean up /tmp since it was close to being full and causing problems, so I logged on as "root" and entered the command "rm -rf / tmp". Note the significant space!

    Control-C followed after about 2 seconds, but there was not enough of the system left to be usable (although I was able to dump the data directories to tape prior to a full re-install).

    1. Waseem Alkurdi
      Thumb Up

      Re: Other screw-ups

      Safe aliases for 'rm' are a good thing to prevent this!

      1. Chairman of the Bored

        Re: Other screw-ups

        @Waseem, aye! Excellent point. Another trick is to have some zero-length files called '-i' in all directories you care about.

        If someone is running a rampaging rm -rf, having defeated the safe alias because of {reasons}, this may force rm back into interactive mode.
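
        Planting the tripwire is a one-liner. It works because a glob-expanded '-i' gets parsed as an option, and GNU rm lets a later -i override an earlier -f (behaviour varies by implementation):

          touch ./-i     # the tripwire, in a directory you care about
          rm -rf *       # (don't!) the glob picks up '-i', so rm turns interactive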

      2. CAPS LOCK

        "Safe aliases for 'rm' are a good thing to prevent this!"

        Testify! After, shall I say, bitter experience, I learned the wisdom of creating a 'del' command with 'mv'. I haven't created an alternate version of 'rm' just in case I need it to function as standard.
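
        Something along these lines, as a sketch (the trash location is arbitrary):

          del() {
              mkdir -p ~/.trash && mv -- "$@" ~/.trash/
          }

        Files go to ~/.trash instead of oblivion, and 'rm' keeps its standard, dangerous self for when you really mean it.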

      3. Nick Kew

        Re: Other screw-ups

        Safe aliases for 'rm' are a good thing to prevent this!

        Aliases for standard system commands are pure evil. They bugger up expectations, both for those who know the standard commands and may react unpredictably to unexpected behaviour, and for those who come new to the aliases and are then surprised by the real thing.

        If you want an "rm" you consider safe, use something else for the alias. "del", for instance.

    2. Danny 2

      Re: Other screw-ups

      One of my most embarrassing mistakes came while working at a Cisco shop, doing software testing on Solaris.

      One young tester was distractingly chatting away while typing about how some idiot at his university had rm -rf'ed and ruined his project. And as he was telling the anecdote the young tester rm -rf'ed his own work, and admitted it.

      I was trying to tune out his monologue but thought, "What a bloody idiot". And then I rm -rf'ed my own system.

      Warnings are like ear-worms, they sink into your subconscious. When you are not paying full attention and you hear the last thing you want to do, it becomes the thing you do next.

  5. Rich 11
    Joke

    When was the last time your best-laid plans went very awry?

    Gallipoli. I really should have made sure the landing craft were loaded into the cargo holds last.

  6. MacroRodent

    SCCS hits you

    The version control system must have been SCCS, which was for years the standard tool for this on Unix. It has this weird default of removing the edited copy of the file when you check in the changes. There is an option to immediately check out the read-only copy, but it is not the default behaviour.
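
    From memory, the dance went roughly like this, using the sccs front end (delget being the non-default option mentioned above):

      sccs edit fstab      # check out a writable copy (get -e)
      vi fstab             # make the change
      sccs delta fstab     # check in - and the working copy vanishes
      sccs delget fstab    # or instead: check in AND get a read-only copy back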

    1. ibmalone

      Re: SCCS hits you

      Thanks, I was puzzling over why not checking out a read-only version by itself could have caused this.

      1. Peter Gathercole Silver badge

        Re: SCCS hits you

        The problem (or maybe it's a strength) with SCCS is that you have embedded tags that are expanded, normally with dates, versions etc., as the file is checked out read-only. With SCCS, they are surrounded by % or some such. (RCS uses similar but incompatible tags; I'm not sure about other systems.)

        The problem is that in some cases these tags can mean something to other tools, which may also expect to use % as a special character, in which case deploying a not-checked-in copy may cause undesirable effects.

        Of course, one solution to this is to use it with "make", which would allow you to perform additional processing around the versioning system. I'm not sure I remember how I did it, but I'm pretty certain that when I used make and SCCS in anger, I had a method whereby I could spot that a file was not checked in. Make is slightly aware of SCCS.

        But of course, you can't meaningfully compare SCCS with modern tools. I'm sure it wasn't the first versioning system around, but it must have been one of the earliest, dating back to the early 1970s. It was not meant for vast software development projects with many people working on them, but for its time, it did a pretty good job (Bell Labs used it to develop UNIX).

        Each iteration of version control since - CVS, RCS, arch, Subversion, Git et al. - has expanded on the functionality, meaning that as the granddaddy of them all, SCCS cannot come out favorably in any comparison.

        But I still use it on occasion, as it is normally installed on AIX, even when nothing else is.

        1. MacroRodent

          Re: SCCS hits you

          Tag expansion also happens in RCS, CVS and Subversion (in the latter it has to be enabled in the properties of the file). The difference is that in these the tag trigger notation ($Id: ... $ and some others) stays in the file; in SCCS the magic strings expand to version numbers without the triggering character sequence.

          Git lost this feature, because it is seriously contrary to its idea of identifying versions with a hash of the file contents: expanding a version tag would make the file be a different version in the eyes of Git. A loss, because the embedded file version numbers have often saved my sanity by allowing a compiled program to identify what file versions it was put together from.
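
          For illustration, the two notations behave roughly like this (from memory; exact expansion formats vary):

            $Id: fstab,v 1.4 2018/12/03 12:08 root Exp $    <- RCS/CVS: the trigger stays in the file
            %W% %G%                                         <- SCCS source...
            @(#)fstab 1.4 12/03/18                          <- ...which a read-only get expands, trigger gone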

    2. cdegroot

      Nothing new...

      My thought as well. I used CVS for the same thing in the '90s, and it worked quite well. I hated SCCS's guts and always tried to stick with RCS instead, which didn't have the anal locking that SCCS sported.

      I've never gotten around to unleashing git on /etc though (although my "dotfiles" are in it, and it's very nice). There's enough stuff in there to maybe make it worth a try, although these days Chef/Puppet/Ansible/Salt/... are probably more appropriate.

  7. Chairman of the Bored

    Ok, we need some beer over here!

    Two pints:

    One for the OP, for having the courage to admit the mistake, and the second for his management, for having the wisdom to chalk this up to a learning experience.

    Cheers!

  8. ibmalone

    I'm missing something...

    Why was a writeable fstab so fatal? Having it writeable by non-root isn't good, but I wouldn't expect a writeable fstab to get wiped on boot on a modern system (every Linux I've seen has had it at 644). Something different about Sun?

    1. Anonymous Coward
      Anonymous Coward

      Re: I'm missing something...

      I don't know the system, but I presume once checked in, the file was locked or removed until it was checked out RO. Probably just a quirk of the VCS.

    2. OldCrow

      Re: I'm missing something...

      One of those older version-control systems that imitated a physical pile of cards. A check-in removes the file from your disk.

      I'm sure it had SOME kind of logical reason for doing that beyond trying to imitate carbon-copy shifting, but I wouldn't know what the reason is.

    3. Doctor Syntax Silver badge

      Re: I'm missing something...

      "Why was a writeable fstab so fatal?"

      I think what you've missed was that the revision control removed the file when checking in. That's why it had to be checked out again.

      Checking out read only would be a side issue. It would mean that the revision control system wouldn't have the version locked and it would also mean that the running version couldn't get edited to a state inconsistent with the version the revision control system had marked current.

      1. ibmalone

        Re: I'm missing something...

        I think what you've missed was that the revision control removed the file when checking in. That's why it had to be checked out again.

        Thanks, yes, looks like another commenter has fingered SCCS as the culprit. Never met a VCS that does that, but I'm sure it made sense to somebody at the time o.0

        Knowing that makes the whole thing seem a lot more rickety. I suppose I might have taken to copying the file and checking in the copy instead, but there's only one way to learn that kind of paranoia...

        1. Doctor Syntax Silver badge

          Re: I'm missing something...

          " I suppose I might have taken to copying the file and checking in the copy instead"

          I might have written a script that did the check-in/check-out as a single command. That's assuming there wasn't an option - as per the comment on SCCS - in which case just get used to that as the normal way to do things.

          1. ibmalone
            Joke

            Re: I'm missing something...

            That's assuming there wasn't an option - as per the comment on SCCS - in which case just get used to that as the normal way to do things.

            Steady on there!

        2. Anonymous Coward
          Headmaster

          Re: I'm missing something...

          I suppose I might have taken to copying the file and checking in the copy instead

          And in fact if you aren't doing that, you are probably taking risks which you should not be taking, unless your VCS is very, very carefully written. For quite a significant number of files in /etc it is absolutely essential that a sane copy of the file exists all the time, and making sure that this is true is quite fiddly. As an example, you need to deal with the filesystem filling up as you save the file: if that happens, you must leave the original in place.

          The trick to doing this right is typically: copy the file to a different name in the same directory, ensuring all the permissions are right; modify this file to be correct; copy the original file again to a backup (alternative: make a hard link to it); then rename the new file to the original. This is safe because renames are atomic: they either happen or they don't, and you are not allocating space in the filesystem at the point of the rename, nor are you increasing the number of inodes in use.

          (Someone is now going to point out I have got some part of this wrong, which I may have: the point is that it's not safe just to overwrite the file because you can end up with a partial copy.)
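
          Concretely, the dance looks something like this (a sketch, using /etc/fstab as the example):

            cd /etc
            cp -p fstab fstab.new    # same directory, same permissions/ownership
            vi fstab.new             # make the change on the copy
            ln fstab fstab.orig      # hard-link backup of the original (no new data blocks)
            mv fstab.new fstab       # atomic rename: readers see old or new, never a partial file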

  9. Chairman of the Bored

    My worst config error?

    Been so many, but I think the worst one in terms of financial impact was dd'ing a hard drive image over a live, mission critical volume. An encrypted volume at that.

    This was my firm so I couldn't very well fire myself. Backups worked (*), but we were out many man-hours of work.

    But I was late on a deliverable and had to tell the customer it was because I had personally screwed up.

    Causative factors: impatience, overconfidence, lacking a questioning attitude. Performing a rather aggressive admin action on a production system. dd is a fairly blunt instrument; I could have chosen a better tool.

    Things that went well: Having a comprehensive, tested backup. Honesty with customer and staff paid off in the long run.

    (*) Wish I had made a binary image of the boot sector and anti-forensic stripes of the encrypted volume key store though, might have been able to save some information

    1. Anonymous Coward
      Anonymous Coward

      Re: My worst config error?

      IMHO, the OS should protect you from that by refusing to permit writes to a device that's mounted. There's no possible scenario where this could be useful.

    2. Anonymous Coward
      Linux

      Re: My worst config error?

      @Chairman of the Bored ".. dd'ing a hard drive image over a live, mission critical volume .."

      Yeah, if you had stuck with the industry-standard Windows, this kind of thing would never happen.

  10. Anonymous Coward
    Anonymous Coward

    Set the clock failed.

    To this day, I work on a PC where the clock is 15 minutes ahead. And there is a command prompt on login saying something along the lines of "user has no permission" to set the machine clock: a Windows command prompt running NET TIME with the user's permissions, which NEVER worked for anybody.

    1. Anonymous Coward
      Anonymous Coward

      Re: Set the clock failed.

      "And there is a command prompt on login saying something on the lines of "user has no permission" to set the machine clock."

      Sounds like your computer is part of an AD. The workstations get their time sync'd from the AD servers so they must have the wrong time.

      1. Danny 14

        Re: Set the clock failed.

        GPO for NTP too. You can set the NTP server to be outside your own DC; that gets fun when the two are out of sync.

        1. Trixr

          Re: Set the clock failed.

          Which is why your domain should be synced to a RELIABLE time source. And so too with any non-domain clients.

          If they're in the same network, the upstream timesource should be the same for the domain time source (the PDC Emulator) and non-domain clients. It's not rocket science.

        2. Trixr

          Re: Set the clock failed.

          If you're in a domain, why on earth would you be setting a different NTP time source on your domain clients via GPO? I can't describe how poor a practice that would be.

          (The only excuse would be if you're not using Windows NTP client at all and you're using another NTP client with better precision. In which case you should sync your DCs from the same time source).

          1. Anonymous Coward
            Linux

            Re: Set the clock failed.

            @Trixr "If you're in a domain, why on earth would you be setting a different NTP time source on your domain clients via GPO?"

            Sometimes in reading Microsoft documentation, I get the feeling I'm reading from the secret scriptures of some obscure cult, that's cult with an 'L' :]

  11. Anonymous Coward
    Anonymous Coward

    zfs snapshots

    If this was an Oracle box, where are the snapshots?

    1. Stevie

      Re: zfs snapshots

      ZFS? This is some new magic not available in Solaris 9.

      8o)

  12. Stevie

    Bah!

    “Overconfident Sun SA”.

    Redundant phrasing, from my personal experience.

  13. Anonymous Coward
    Anonymous Coward

    Ah, SUN's pizza boxes

    In the very early years of the Net (think pre-URL) I was tasked with building and installing SUN based firewalls.

    Now, I personally *loved* the beautiful engineering that could be seen inside the pizza-box design, but it had one b*stard of a gotcha involving a connected terminal. If you switched it off before you disconnected it, it would issue a STOP instruction to the system, so it would basically be off as far as functionality is concerned.

    That's an *excellent* thing to forget when doing an install for which you have to drive six hours to get there, in the days when mobile phones were luxury items only given to directors who wanted to get into weightlifting. Oh, the joy of checking your email on return.

    I don't know who dreamt that up, but he must have been the one to originate the BOFH DNA.

    1. Down not across

      Re: Ah, SUN's pizza boxes

      Now, I personally *loved* the beautiful engineering that could be seen inside the pizza-box design, but it had one b*stard of a gotcha involving a connected terminal. If you switched it off before you disconnected it, it would issue a STOP instruction to the system, so it would basically be off as far as functionality is concerned.

      Close, but no cigar. Some, not all, serial terminals effectively send BREAK when powered off. This is usually caused by a combination of the RS-232 driver and power supply producing a logic low that appears as BREAK. Sun OBP drops into the PROM monitor on break. You can recover by typing 'go', and the system should resume.

      I don't know who dreamt that up, but he must have been the one to originate the BOFH DNA.

      Hate to disappoint, but it is down to bad (cheap) design/engineering in terminals (and many terminal/console servers) and the way they have implemented RS-232. As an example, a Cisco 2511 would send break, whereas 26xx/36xx/28xx/38xx with NM or HWIC async cards IIRC don't. Likewise, ISTR Cyclades mostly worked. Then there are some that send break when powered ON, just to be awkward.

      1. Anonymous Coward
        Anonymous Coward

        Re: Ah, SUN's pizza boxes

        Ah, nice to know at last the detail.

        Yes, telling a client to type "go" was the cure, but it still was rather annoying. Lesson learned, though, also because customers sometimes couldn't resist switching the screen on for a peek (I think we mainly had WYSE terminals hooked up). They didn't know that the off switch was a tad too thorough, so that would result in a support call where, naturally, nobody would admit to having taken a look.

  14. Will Godfrey Silver badge
    Linux

    Don't test it

    Never try to test your own automation if it's even marginally more than trivial. Get someone else to try it - with as little information as is reasonable. If it doesn't screw up, you can be cautiously optimistic.

  15. redwine

    Wanna major problem everywhere very quickly?

    ... config management!

  16. Anonymous Coward
    Anonymous Coward

    We are safe in your hands?!

    My goodness, most of you lot are making it up as you go along! Call this a profession?! Bloody dangerous schoolboys; no wonder IT is a mess these days!
