Sysadmin’s plan to manage system config changes backfires spectacularly

Welcome once more to Who, Me?, the column for Reg readers to get their worst deeds off their chest. This week, "Ryan" tells us about a time many years ago when he got a little bit cocky with root-level commands. At the time, he was the senior systems and network administrator for a major research lab. "I administered a …

  1. Michael H.F. Wilkinson Silver badge

    Automation does have its place

    We had one sysadmin (who is no longer with us) who tended not to like automating stuff too much. This meant that whenever we needed new accounts for students or guests, he would fiddle around a while, and give you a list of new user names and (temporary) passwords on a bit of paper. I learnt the hard way, after many complaints by students, that it paid to check all accounts manually, to see if

    a) the login actually worked,

    b) things like home directories had actually been made for each account, and

    c) that account A didn't by default write in home directory B and vice-versa.

    I currently administer a small compute server used for teaching and research, and I have never been able to replicate these kinds of errors when using "adduser" to create new accounts.
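The manual checks in (b) and (c) above can be scripted. What follows is only a minimal sketch: the function names and the shape of the `accounts` mapping are assumptions made for illustration, and check (a), whether the login actually works, is not covered.

```python
import os
from pathlib import Path


def check_home(home: Path, expected_uid: int) -> list[str]:
    """Return a list of problems found with one account's home directory."""
    if not home.is_dir():
        return [f"{home}: home directory is missing"]
    owner = home.stat().st_uid
    if owner != expected_uid:
        return [f"{home}: owned by uid {owner}, expected {expected_uid}"]
    return []


def check_accounts(accounts: dict[str, tuple[int, Path]]) -> dict[str, list[str]]:
    """Check every account and flag two accounts that share one home directory.

    `accounts` maps a user name to (uid, home directory); this layout is an
    assumption for the sketch, not something taken from the thread.
    """
    report = {name: check_home(home, uid) for name, (uid, home) in accounts.items()}
    seen: dict[Path, str] = {}
    for name, (_, home) in accounts.items():
        if home in seen:
            report[name].append(f"{home}: also the home of {seen[home]}")
        else:
            seen[home] = name
    return report
```

An empty list for a user means no problem was found; anything else is worth a look before the slip of paper goes out.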

    1. defiler

      Re: Automation does have its place

      Group Policy works great for user setups on Windows. You just add a user in the correct OU, and it picks up the policy. Sets up home drives, profiles, application configs and stuff like that.

      If you're spinning up lots at once, PowerShell does the trick, but if they come in dribs and drabs then ADUC and GPO save a lot of guesswork.

    2. Prst. V.Jeltz Silver badge

      Re: Automation does have its place

      We had one sysadmin (who is no longer with us) who tended not to like automating stuff too much

      obviously in the wrong career then, if not a dangerous liability.

      1. Anonymous Coward
        Anonymous Coward

        Re: obviously in the wrong career

        Probably moved to politics and is now managing Brexit from the looks of things.

    3. Lee D Silver badge

      Re: Automation does have its place

      The "admin who does things like it was 30 years ago" is surprisingly common.

      When I started here, there was no computer imaging process - each one was manually cloned from one of its nearby machines and then manually re-configured. There were duplicate SIDs and unlicensed software everywhere. There was no user-management - each one was set up manually each time, so half of them were missing something or other. And home folders were manually made and permissioned for each user on creation*. Everything was done with copy-paste batch scripts that he didn't understand, which everyone ran on every login, and which literally carved out exceptions (e.g. IF %username% = "fbloggs", to map drives, printers, etc.). The console windows were still visible minutes after logging on as they churned through it all every time.

      AD was literally a shock to the guy beyond "create new user". And he was being paid by the hour (not the reason for his lack of process, at least not directly, but he literally didn't have the knowledge).

      Within a week, and without spending a penny more than had already been spent, I introduced F12 PXE boot to WDS (which meant imaging took 20 minutes from bare-machine to domain-ready client with the base software in the worst case), group policy (which meant that users' printers, drive maps and settings, and machines' specific software and settings, were installed after a couple of reboots of any fresh machine, controlled centrally, and changed and cloned easily), and the permissions from the MSKB article that shows how to permission the root profile folders, so that users just logging in would create their own profile folders if they didn't already have one.

      Literally the guy was stuck on using things that had "worked" for him on Windows 2000 and never bothered to update knowledge in all that time. That you could deploy a printer from a GPO was new knowledge. That you could image machines from a clean template. That you could centrally control updates. That you could map drives. That you could have a proper tree of users and groups (rather than just leaving everything in the default users and groups folders) and have "Users" settings apply to everyone, while "Users\Office" people also got office settings, that you could modify policies on the domain other than "Default Domain Policy" (literally EVERYTHING was in there). That you could target a policy at users, groups, or even things like Windows versions or machine types.

      It took me a few weeks to go from utter unmanaged chaos to "F12, new image, reboot, right-click in AD, clone an existing user (even disabled) of the same type, set password, bang... everything comes down".

      It's alright, it's not like we were a school or anything, with 500+ pupils, ~100 staff, all with different settings and permissions, ~100 leaving and ~100 joining users every year, and all needing central control for things like web filters (enforced proxies), etc.

      Literally, his "web proxy setting" was a Regedit script for Mozilla Firefox run from a login batch file. Press Ctrl-C and it never got applied. Unapply it after login and it bypassed everything. And, no, not even a "catch-all" transparent filter.... literally relying on that batch file to be all your security.

      I honestly never asked what the rest of the junk in his batch files was and just started replacing them from day one. There were things in there playing with Word/Office, activation, antivirus warning disabling, ActiveX permissions, desktop icons (copied from the central server every logon), all kinds of stuff. I just switched them off for a few test machines and then resolved the issues that occurred in a more proper manner.

      (*To this day, years later, I'm still finding folders that don't have inheritable permissions and/or have things like "Administrators" - the group not the user - as the owner. There were also a ton of legacy folders, including user profiles, that literally the user could access but administrators couldn't. The only way to fix is to take ownership of all files with recursion, then repermission with recursion, then put the file owner back as it should have been).

      P.S. He didn't last long.

      1. Anonymous Coward
        Anonymous Coward

        Re: Automation does have its place

        I can understand how that happens. Often, especially in the likes of schools and other very important yet poorly funded organisations, the person who shows a little sense of computer knowledge (eg knows that a computer is some sort of box) is placed in charge of such things, often without training (especially back in the 80's and 90's, dunno about these days).

        Over time they learn little tricks to help maintain things, but they lack the time (and sometimes permission) to do any further training or improvements. I've been there myself, maintaining a system for a decade over and above my normal work (which, BTW, was a full-time job!). Least the firm paid time+.5 for the first extra hour and time*2 after that. I got a decent income but 0 social life.

        Then the new guy comes along. He'd spent more time in computer training than I'd ever considered, and was able to automate a lot of stuff in ways I'd never considered.

        For a long while things were great. Then, as with another post below, a large portion of the users disappears overnight. A glitch in one of his batch files took out stuff due to an unforeseen combination of events. It took out a lot of stuff. First I knew about it was the phone call from the 2nd shift supervisor, the call to come in and fix. Long night but got it sorted.

        I did come to his defense at a later managerial meeting. I pointed out that while his mistake had cost a considerable amount in lost productivity, his automations had saved more than that in the preceding months, and we were actually able to pitch a strong case for improving the IT training of staff. If the company had given me the training I'd asked for earlier...

        A guy who has the resources and training to patch together what bits he can as he goes is not to blame for the system he leaves behind, unless he is offered the opportunity to improve things and doesn't take it. What works gets you through the day. What's new may not work so well, and may not get you through several days (or it may save a few hundred overtime hours a year and pay for itself in a couple of weeks...)

        1. Lee D Silver badge

          Re: Automation does have its place

          @Anon The guy in question was a highly-paid specialist IT consultant brought in to do disaster recovery on their systems... he had a year, a clean slate, virtually unlimited funds, new kit (everything from network switches to PCs to tablets to servers from the ground up), all the time in the world, and absolute control of anything he wanted.

          He was brought in as "the expert" to set the tone for the system. I was hired later as the guy to "keep it ticking over" day to day. It took 6 months to turn that situation on its head.

      2. Anonymous Coward
        Facepalm

        Re: Automation does have its place

        @ Lee D "P.S. He didn't last long."

        Going on the number of down votes, at least two people on this forum recognize themselves :]

    4. Anonymous Coward
      Anonymous Coward

      Re: Automation does have its place

      A problem with automation scripts written by sysadmins is that they generally do not have a development background, do not consider enough corner cases or how things can go poorly, and do not write defensively. Automation scripts should be considered as nothing less than production apps and subject to the same controls: peer review and source code check-in and check-out, to name a few.

      One day I woke up and reviewed alerts before work. I noted that the entire Accounting department had their accounts deleted. I thought to myself "Hmmm, I thought we needed an Accounting department."

      Upon arrival at the office I asked what happened and was told it was "just a glitch" and the accounts had been retrieved from the AD Dumpster. It had never happened before so it would never happen again. Right.

      As you've already guessed, it happened the next day. So now the sysadmins decided to investigate. They had written a script to sync the HR management software with the AD structure so AD reflected HR. Good idea. But the author failed to consider what would happen if a department manager went on leave.

      The Accounting manager went on long-term sick leave and was removed from the HR org tree after one week per policy. When their AD import script saw no manager it branched to the cleanup section and, because the department apparently no longer existed, it deleted all of the active accounts instead of stopping for confirmation or just disabling the accounts. It actually was a cascading fault because one of the Accounting manager's subordinates supervised a smaller department and their accounts were wiped out as well when the subordinate had his account deleted by the script.

      Just imagine if the CEO had gone on long-term leave...
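The failure described above suggests a sync that stops and asks when the HR data looks wrong, rather than branching into cleanup. The sketch below is illustrative only: `plan_sync`, `SyncPlan`, the shape of the inputs and the 20% threshold are all assumptions, not the script from the story. It only builds a plan, never deletes, and skips any department that has no manager in the HR data or that would lose an implausibly large share of its accounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    disable: list[str]       # accounts to disable (never delete) on this run
    needs_review: list[str]  # anomalies a human must look at first


def plan_sync(hr_users: set[str],
              ad_users_by_dept: dict[str, set[str]],
              hr_managers: dict[str, str],
              max_fraction: float = 0.2) -> SyncPlan:
    """Compare AD against HR and plan disables, refusing on suspicious input."""
    disable: list[str] = []
    review: list[str] = []
    for dept, members in sorted(ad_users_by_dept.items()):
        # A missing manager is treated as an anomaly, not as proof that the
        # department no longer exists.
        if dept not in hr_managers:
            review.append(f"{dept}: no manager in HR data, skipping department")
            continue
        missing = sorted(members - hr_users)
        # Losing a large share of a department in one run is suspicious too.
        if len(missing) > max_fraction * len(members):
            review.append(
                f"{dept}: {len(missing)} of {len(members)} accounts missing from HR, skipping"
            )
            continue
        disable.extend(missing)
    return SyncPlan(disable, review)
```

With a plan like this, the Accounting department in the story would have ended up in `needs_review` instead of in the AD Dumpster.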

      1. John Riddoch

        Re: Automation does have its place

        I used to have to do user account creation annually at a university. I'd inherited some (fairly ropy) scripts and an MS Word mail merge template which took a fair bit of manual effort. I reduced it to a couple of Unix scripts which then created a LaTeX file to print out and another output file to create the Novell 4.1 accounts (that probably dates it pretty well). The printouts were handed to the lecturers to distribute to their classes on the first day and get them to log in.

      2. PickledAardvark

        Re: Automation does have its place

        "Automation scripts should be considered as nothing less than production apps and subject to the same controls: peer review and source code check-in and check-out to name a few."

        In the old days, we used to talk about things with colleagues. Even if you have a change management process, you still have to talk informally with colleagues -- including people who have a different outlook. When you take a day off and things go wrong, somebody else needs to understand more than you wrote in comments and a change report.

      3. Anonymous Coward
        Terminator

        Re: Automation does have its place

        A problem with automation scripts written by sysadmins is that they generally do not have a development background, do not consider enough corner cases or how things can go poorly, and do not write defensively. Automation scripts should be considered as nothing less than production apps and subject to the same controls: peer review and source code check-in and check-out, to name a few.

        Exactly so. And now we have things like Puppet which make sure that these scripts get run around tens of thousands of machines.

        And, of course, we're all now planning on keeping everything in the cloud, which is probably millions of running OS instances. And they make it cheap by making it scale, and they make it scale by automating the shit out of it: they can, I presume, deploy changes across millions of machines in one go. Well, of course, these people are much cleverer than most sysadmins and they test stuff a lot more carefully, I hope, but they are going to make a mistake (or someone is going to make them make a mistake), at which point we're all, well, fucked.

        1. John Brown (no body) Silver badge

          Re: Automation does have its place

          "deploy changes across millions of machines in one go... going to make a mistake (or someone is going to make them make a mistake), at which point the we're all, well, fucked."

          It's called Windows Update (windows as a service) and the problems have already been documented :-(

          1. Anonymous Coward
            Terminator

            Re: Automation does have its place

            Well, I was really thinking of 'AWS hypervisor update' or whatever low-level thing it is they use. At some point there is going to be a platform-level compromise of AWS, and at that point, well.

      4. cream wobbly

        Re: Automation does have its place

        A problem with automation scripts written by sysadmins is that they generally do not have a development background...

        A problem with automation scripts written by developers is that they generally do not have a sysadmin background.

        It's true both ways around, which is why the top role was originally called Unix Programmer. It's rare for a company to knowingly employ such a deity these days. They'll make do with mere Systems Administrators who know a bit of scripting; or even System Operators who sometimes know how to change a config without manpages.

        1. Anonymous Coward
          Anonymous Coward

          Script deleting HR accounts

          That's just stupid having a script delete accounts in automated fashion. It should produce an alert "the following accounts are ready for deletion for $reason" and list commands that will do it. Then sysadmins can investigate, decide it is valid, and cut and paste the commands to do the deed.

          Anything that is going to have a major impact should not be done in automated fashion unless it is time critical (like disabling accounts that may be involved in a security breach). Deleting accounts certainly falls under that! It isn't like there's a rush to delete them; account deletion is never an emergency that can't wait for a human to approve.
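The "alert plus commands to paste" approach described above could look roughly like this. It is a sketch under assumptions: `deletion_report` is an invented name, and `delete-account` is a hypothetical placeholder for whatever real command a site uses. Nothing is executed; the function only builds text for a human to review.

```python
import shlex


def deletion_report(candidates: dict[str, str],
                    command_template: str = "delete-account {user}") -> str:
    """Build a review report mapping each account to its reason and command.

    `command_template` defaults to a hypothetical placeholder command; swap in
    the site's real one. User names are shell-quoted before being substituted.
    """
    lines = ["The following accounts are ready for deletion:"]
    for user, reason in sorted(candidates.items()):
        lines.append(f"  {user}: {reason}")
    lines.append("")
    lines.append("Commands to run after review:")
    for user in sorted(candidates):
        lines.append("  " + command_template.format(user=shlex.quote(user)))
    return "\n".join(lines)
```

Quoting the user name matters because the commands are meant to be cut and pasted into a shell as root.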

          1. Anonymous Coward
            Anonymous Coward

            Re: Script deleting HR accounts

            > That's just stupid having a script delete accounts in automated fashion.

            It simply was an extension of the management directive of "Virtualize everything" to "Automate everything!"

            Just because they could, they did.

          2. Trixr

            Re: Script deleting HR accounts

            That's just stupid having a script delete accounts in automated fashion.

            Er, if you've got a few dozen users perhaps. When you're doing dozens/hundreds of account expiry/deletion operations a day, you do not want someone having to go through all that by hand.

            It's silly if you've only got a relatively small user base as well, because you are wasting a LOT of time doing stuff manually. You're actually more likely to introduce errors by hand than doing it using a robust automated process.

            What you should do is ensure that the process and checks prior to your automated part are robust, with some contingency in case of manual error. For example, for us, a contractor termination date must be entered into the HR system first. If HR has it wrong, we are not accountable for their mistake... but we have baked in a 30-day interval where an account is only disabled on that termination date, prior to deletion.

            Then you carefully TEST your scripts (in a non-production environment first, and then in production, scoped to a specific user set) and make sure that you have subroutines to catch exceptions and flag errors. And good backups. And as others have said, peer review.
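The 30-day interval described above comes down to a small date rule. This is a sketch, assuming the account is disabled on the termination date and becomes eligible for deletion 30 days later; the function name and return values are made up for illustration.

```python
from datetime import date, timedelta
from typing import Optional

GRACE = timedelta(days=30)  # interval between disabling and deleting


def account_action(termination: Optional[date], today: date,
                   grace: timedelta = GRACE) -> str:
    """Return 'keep', 'disable' or 'delete' for one account on a given day."""
    if termination is None or today < termination:
        return "keep"      # no termination date yet, or not reached
    if today < termination + grace:
        return "disable"   # inside the grace interval: reversible
    return "delete"        # grace interval has passed
```

Keeping the rule as a pure function of two dates makes it easy to test in a non-production environment before it is allowed near real accounts.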

            1. The Oncoming Scorn Silver badge
              Alert

              Re: Script deleting HR accounts

              I'm in a weird situation here & I will admit to being on the whole totally new at the whole create\disable accounts (Having moved up & sideways late in my career) thing.

              Hires & Fires come in at all times of the day.

              I create accounts as per my instructions, different job roles have a whole bunch of OU's to be added "globally or branch specific", based on the training I have had & prior knowledge.

              So I'm slowly working on script automation: basically the same script (for each job role) that calls the specifics for each branch (branch address, the correct OUs etc). According to my process training I still have a number of Office\Exchange\Licensing\Skype server web pages to go through to complete, which I don't want to get out of my depth by even attempting to automate, so I'm simply testing as I go.

              Disabling accounts is pretty much the same process (In reverse), but really doesn't require any script as yet (That said, there's a few parts I'd like to automate).

              Upshot I'm really not sure if the way I do things is really the correct Microsoft way, it's just how I have been told to do it so it works with the rest of our group infrastructure.

              My scripts at the very least save me having to cut n paste in branch specifics from an Excel master spreadsheet, so that's a time saver when setting up a new user or changing job role.

              1. Anonymous Coward
                Anonymous Coward

                Re: Script deleting HR accounts

                "Upshot I'm really not sure if the way I do things is really the correct Microsoft way, it's just how I have been told to do it so it works with the rest of our group infrastructure."

                <Joke> I'm not sure "correct" and "Microsoft" really belong in the same sentence...

            2. Danny 2

              Re: Script deleting HR accounts

              Welcome to Brazil, Where a Computer Bug Condemns a Man to Death

              https://gizmodo.com/welcome-to-brazil-where-a-computer-bug-condemns-a-man-1659912414

              The first computer bug, the story goes, was a moth squashed inside an old electromechanical computer. In Terry Gilliam's Brazil, one such bug gets stuck in a printer, resulting in a typo that leads to the killing of poor innocent Archibald Buttle, a cobbler, rather than alleged terrorist Archibald Tuttle.

            3. Anonymous Coward
              Anonymous Coward

              @Trixr - dozens of deletes a day

              If you are doing this as part of an actual process, you don't delete the accounts of users who leave. You disable them (make it very easy to undo, and have sanity checks to ensure not too many are done in one day, which may indicate a problem) and then have a process that acts later and deletes them, making sure they are already disabled before trying to delete.
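The two safeguards described above, a cap on disables per day and a refusal to delete anything that is not already disabled, can be sketched as follows. The names and the default cap of 25 are assumptions for illustration only.

```python
class TooManyChanges(Exception):
    """Raised when one run wants to disable suspiciously many accounts."""


def check_disable_batch(to_disable: list[str], daily_cap: int = 25) -> list[str]:
    """Return the batch unchanged (sorted) or refuse it if it exceeds the cap."""
    if len(to_disable) > daily_cap:
        raise TooManyChanges(
            f"{len(to_disable)} disables requested, cap is {daily_cap}"
        )
    return sorted(to_disable)


def deletable(candidates: list[str], disabled: set[str]) -> tuple[list[str], list[str]]:
    """Split candidates into (safe to delete, refused because still enabled)."""
    ok = sorted(u for u in candidates if u in disabled)
    refused = sorted(u for u in candidates if u not in disabled)
    return ok, refused
```

A refused batch or a non-empty `refused` list is a signal to stop and investigate, not something to retry automatically.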

        2. Alien8n

          Re: Automation does have its place

          I seem to be a bit of a rarity nowadays, worked my way into IT Management the long way... started out as an operator, moved to engineer (mechanical), then to report designer for the engineers. Then moved to product engineer (emphasis on data analysis), then systems engineer, then systems designer (still technically an engineer at this point). When they realised they needed an IT person with a working knowledge of manufacturing systems they moved me into IT where I gradually worked through several developer positions, DBA roles, and finally into IT management with some networking skills. However I'm intelligent enough to ask the question "what happens if I press this" BEFORE pressing the button, rather than as I press the button.

        3. Anonymous Coward
          Anonymous Coward

          Re: Automation does have its place

          Re: "Systems Administrators who know a bit of scripting"

          Sorry, but how can you possibly be a sysadmin of any description and not know quite a bit of scripting? It's pretty inherent to the role!

          (Whether the scripts one writes are quick and dirty hacks or reasonably elegant, with reasonable error/edge case handling, is perhaps another question, however.)

    5. Anonymous Coward
      Anonymous Coward

      Re: Automation does have its place

      "I currently administer a small compute server used for teaching and research, and I have never been able to replicate these kinds of errors when using "adduser" to create new accounts."

      Never underestimate the talent of idiots and their ability to deliver beyond the limits of competency.

      1. JimboSmith Silver badge

        Re: Automation does have its place

        We had an auto delete option on some software in the config. It was supposed to be used for the deletion of older files from the data file system and to just leave two days worth of daily logs etc. It was also supposed to be set to local drives only. Someone at a satellite office set theirs to delete files from all the drives it could see. We were first alerted when a user found their database missing and investigated. About 15mins later we had retrieved the missing files (anything more than 48hrs old) worked out what had been done and administered a quiet word. The next day the same thing happened again from a different satellite office. Same bloke had been on a road trip and had applied the same fix...........

    6. Gerhard Mack

      Re: Automation does have its place

      "I currently administer a small compute server used for teaching and research, and I have never been able to replicate these kinds of errors when using "adduser" to create new accounts."

      Adduser is designed to be easy to use, on the other hand, useradd has a ton of fun ways to let you screw things up.

  2. Waseem Alkurdi

    Why use a revision control system?

    These are typically used for code.

    Sysadmins usually back up to tape.

    If there's something broken, one would restore from tape (or at least, restore the offending file from a tape).

    1. BinkyTheMagicPaperclip Silver badge

      Re: Why use a revision control system?

      It's probably overkill for standard config files. If, however, it's a shell script, firewall configuration, or other fairly complex file, then a revision control system could be an advantage.

      1. Anonymous Coward
        Anonymous Coward

        Re: Why use a revision control system?

        It's not overkill for standard configuration files. If a user gets added to a system, say, people will want to know who added it, when, why, what authority they had to do so and so on. A revision control system gives you the key into that: '/etc/passwd & /etc/shadow were changed by commit 615032f and the commit log for that says this corresponds to approved change 23857 and I can look that up and it's been signed off by Spodge, who is an approved authoriser for ...' and so it goes on.

    2. Anonymous Coward
      Anonymous Coward

      Re: Why use a revision control system?

      If there's something broken, one would restore from tape (or at least, restore the offending file from a tape).

      If you're making several versions on the same day and the earlier one worked but later was borked?

      Access to older backups may not be immediate, and errors may manifest after months.

      If there are multiple sysadmins it is nice to know what was changed when fixing someone else's errors. I'm not going to commend every single config change on the config file itself.

      1. Waseem Alkurdi

        Re: Why use a revision control system?

        If you're making several versions on the same day and the earlier one worked but later was borked?

        If I'm not commenting my changes in the file itself, I'd do this:

        /etc/fstab-20181203-1208

        /etc/fstab-20181201-1100

        ... up to five past revisions per file per backup period.

        I'm not going to commend every single config change on the config file itself.

        Why not?

        Wasting an extra two minutes' worth of:

        # note to self: this fixes the ________ issue by doing _______ and ________.

        would save twenty minutes trying to decipher what you've done that day, at midnight, when your eyes are seeing each line of the config file with two blurry, redundant versions over and under it.
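The timestamped-copy scheme above (names like `/etc/fstab-20181203-1208`, at most five past revisions) can be automated so the naming stays consistent. This is a minimal sketch: `snapshot` is an invented helper, and pruning simply keeps the newest `keep` copies rather than tracking any backup period.

```python
import re
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def snapshot(path: Path, keep: int = 5, now: Optional[datetime] = None) -> Path:
    """Copy `path` to `<name>-YYYYMMDD-HHMM` and prune all but the newest `keep`."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M")
    copy = path.with_name(f"{path.name}-{stamp}")
    shutil.copy2(path, copy)  # copy2 also preserves timestamps and mode bits
    # Only files matching the exact naming scheme are considered for pruning.
    pattern = re.compile(re.escape(path.name) + r"-\d{8}-\d{4}")
    copies = sorted(p for p in path.parent.iterdir() if pattern.fullmatch(p.name))
    # The stamp sorts lexicographically in date order, so the oldest come first.
    for old in copies[:-keep]:
        old.unlink()
    return copy
```

Two snapshots taken within the same minute share a name, so the later one overwrites the earlier; a revision control system avoids that limitation and records who made the change.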

      2. phuzz Silver badge

        Re: Why use a revision control system?

        "I'm not going to commend (sic) every single config change on the config file itself."

        I work with people like you, and from bitter experience, I hate you.

        1. Stevie

          Re: Why use a revision control system? 4 phuzz

          And I work with people like Waseem who copy old versions to a new name.

          Thing is, there are dozens of us and each has his own preferred naming convention. I hate the mess the filesystems become as a result and end up running a script to dump it all into a new directory called obsolete and zip them all down. Once a year I tar all the zips automagically into a yearly archive.

          My director yelled that he wanted us to use a VC tool for this stuff, but refused me permission to deploy git in a hysterically funny and very annoying saga I already told, and the situation is still as it was pre-shoutyboss.

        2. cream wobbly

          Re: Why use a revision control system?

          I work with people like you, and from bitter experience, I hate you.

          Yes. (From bitter experience) I'm completely the opposite to our commending friend. I'd go so far as to say that documentation, communication, is vastly more important than the config change itself. If you can't reverse it or replicate it, it's a guess, it's a hack, it's broken and there's the door.

          1. Danny 14

            Re: Why use a revision control system?

            Wow, you change your fstab so many times in a day that you worry about backup revisions? Just type it out again by looking at the copied file on your WS. You checked it out and back in? Surely that means a secondary copy.

    3. Secta_Protecta

      Re: Why use a revision control system?

      When I worked for an ISP we used revision control for the named config files; at the time we had some very ropey contractors working there and it came in handy a number of times...

    4. Anonymous Coward
      Anonymous Coward

      Re: Why use a revision control system?

      "one would restore from tape (or at least, restore the offending file from a tape)"

      Maybe life is just too short for that...

      P.S. A tape backup is not a change management system; version control with check-in comments (partially) is.

    5. iainr

      Re: Why use a revision control system?

      If you want to revert a file, it's a lot quicker and easier to use a version control system to see when it was changed and see what the differences are than pulling stuff off tape. At work we use a configuration management system called LCFG (www.lcfg.org) that allows us to configure large numbers of Unix boxes via configuration files that are under version control. If I want to add software to a lab full of machines I can do it by editing a file. In a month's time, if someone wants to know who installed the software and why it was installed, they can check the configuration log files. If they want to remove the software they know what to remove from the config file, and if they want to revert the lab's software set back to what it was before the start of the term it's a matter of reverting the configuration file in the version control system.

    6. gordonmcoats

      Re: Why use a revision control system?

      All my extra-special configuration files are safely stored on the 12" reel hidden under my desk. Not had to reload anything off it in years though..

      1. Anonymous Coward
        Anonymous Coward

        Re: Why use a revision control system?

        LOL, but you do make a valid point: backup mechanisms have as much a lifespan as their media.

        I doubt anyone will be able to dig up the 8" drives I used when I started in IT, and I think you will already have to go to some rather obscure places to still find a 3.5" drive. Another fun one is CDROM media - the very early media were not made for the 40x spin speeds the later drives were capable of, which meant you cannot really restore from them anymore as they will shatter on spinup (been there, and it was very impressive). Ditto with tape.

        1. ridley

          Re: Why use a revision control system?

          Your tapes shatter on spin up?

          You're holding them wrong.

          1. Loyal Commenter Silver badge

            Re: Why use a revision control system?

            Your tapes shatter on spin up?

            You're holding them wrong.

            Maybe not shatter, but tape can demagnetise or otherwise degrade over time. Plastic becomes brittle and perishes with age. Do you know what state the tape from a 1990s backup is in right now without trying to restore from it?

            1. HellDeskJockey

              Re: Why use a revision control system?

              Ahh, paper tape. Worst come to worst you could always read it manually. Though for a backup I would use Mylar. That stuff was darned near indestructible. Way too bulky for modern systems though: 1 kilobyte requires about 2.6 meters of tape.
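The 2.6 meter figure above checks out if one assumes one byte per row and a row pitch of 10 characters per inch. Both values are assumptions here, since the comment does not state them.

```python
METRES_PER_INCH = 0.0254


def tape_length_m(n_bytes: int, chars_per_inch: float = 10.0) -> float:
    """Length of punched tape needed, assuming one byte per row."""
    return n_bytes / chars_per_inch * METRES_PER_INCH


# 1024 bytes / 10 per inch = 102.4 inches, which is about 2.6 m.
one_kilobyte = tape_length_m(1024)
```

At the same density a single megabyte would need roughly 2.7 km of tape, which is the "way too bulky" part.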

    7. Pirate Dave Silver badge
      Pirate

      Re: Why use a revision control system?

      I use an RCS to backup my switch configs weekly via script. That way, if I dork something up, or have to replace a switch, I've got a recent config file to fall back on for each switch.

    8. Adrian 4

      Re: Why use a revision control system?

      Code used to be backed up to tape too. It was obsoleted by revision control systems.

      1. Nick Kew

        Re: Why use a revision control system?

        Code used to be backed up to tape too. It was obsoleted by revision control systems.

        First code I ever wrote had to be saved to tape for every increment. 'Cos we didn't have discs back then, and a simple bug would commonly require a several-minute reboot (from tape) and restore (ditto).

        But revision control had already existed for some years: sccs goes right back to 1972.

      2. Doctor Syntax Silver badge

        Re: Why use a revision control system?

        "Code used to be backed up to tape too. It was obsoleted by revision control systems."

        And where is your revision control system backed up? Don't tell me it isn't. Revision control and backups are two different things.

        1. Anonymous Coward
          Alien

          Re: Why use a revision control system?

          This is an important point. And in fact there are at least three different things which people confuse: revision control, whose job is to track changes and let you understand them and back them out; hardware redundancy whose job is to make sure that suitably mild failures don't take out the system (how mild depends on the money you are willing to spend: typically a single disk, but if your mirrors are in DCs 20 miles apart then you're probably robust against some minor nuclear wars); backups, whose job is to be a backstop for everything else.

          I frequently hear people saying 'oh, we have RAID, we don't need backups': yes, yes you do need backups. And if the data matters you need them to be physically far away from the live data.

    9. Herbert Meyer

      Re: Why use a revision control system?

      Useful things like diff are available. Even (when there are multiple administrators) who made the changes in question.

      Linux has a system called etckeeper that puts /etc under git version control, with some additional hooks that understand the dnf/apt upgrade process. Often dnf/apt is what broke it, not me.
