Could have been worse ..
Bruce could have been working at Hawaii Emergency Management Agency last January.
Welcome to Monday morning, dear readers. We’ll try to make it bearable for you by offering you a new instalment of “Who, me?”, The Register’s column in which readers share stories of having screwed things up. This week meet “Bruce” who told us that “Many years ago I was a junior sysadmin for a large battery manufacturer and I …
I've known more than a few people to do this.
I once told a soldier the portable version of a server was ready to be shut down and packed up for deployment. He dutifully walked into the server room, up to a (very) non-portable 42U rack, and shut down the servers in that. Cue calls to my phone from across Blighty asking why systems were down. Thankfully, they didn't take too long to bring back up, but I did have to explain what had happened to some much higher levels.
That was before the days of mollyguard, but I now make sure it's on everything to help avoid accidents (not sure it'd have helped in that case though)
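For anyone who hasn't met it: molly-guard's trick is to make you type the machine's hostname before a shutdown or reboot goes ahead over SSH. Below is a minimal Python sketch of that same check. It is not molly-guard's actual code, and the names `short_name` and `hostname_confirmed` are made up for illustration.

```python
import socket
from typing import Optional


def short_name(host: str) -> str:
    """Return the lower-cased first label of a hostname (drops any domain part)."""
    return host.strip().split(".")[0].lower()


def hostname_confirmed(typed: str, actual: Optional[str] = None) -> bool:
    """True only if the operator typed the name of the machine they are really on.

    `actual` defaults to this machine's hostname; it is a parameter here
    purely so the check can be exercised without touching a real server.
    """
    if actual is None:
        actual = socket.gethostname()
    typed = typed.strip()
    # An empty answer never counts as confirmation.
    return bool(typed) and short_name(typed) == short_name(actual)


if __name__ == "__main__":
    answer = input("Type this machine's hostname to continue: ")
    if hostname_confirmed(answer):
        print("Hostname matches, carrying on.")
    else:
        print("Wrong machine? Aborting.")
```

A check like this only helps when you are on the wrong box by mistake. It does nothing for someone who walks up to the wrong rack on purpose.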
One of my wonderful moments was working with a group testing a prototype communications analysis system that I had designed and was responsible for testing in the field. It was a very wet and cold day in the middle of the British 'wilderness' (such as it is).
Performance was good, all of the teams identified safe places to site their equipment and there were no obvious complaints about (a very prototyped) UI designed to be usable with gloves and in a somewhat hurried manner.
But I was there for the last week of trials and thought that, after watching people using it, I would take myself off to talk to some of the remote groups and ask them about their experience (hey, I was all that could be described as a usability assessor as well as technical designer). The last team I saw were perched on the side of a hill with a bare gap between protecting woodlands. It was wet and very cold. I was pleasurably surprised by their enthusiasm. The system was universally and enthusiastically approved.
I tried to drill down on why and was somewhat chagrined by their response: "It runs so hot that we take turns to warm our feet on it!"
Needless to say we did manage to get the power down, and the prototypes were deployed in Bosnia 4 months later...
> Soldiers take things very literally. Never EVER label anything as "BOOT"
Yeah, to be fair to him he was just having a bad day. He knew more than enough about the systems to have not made that mistake, just wasn't really with it that morning.
Not that that made it any easier to explain up the chain, of course.
Bridges are portable. We took them to bits, moved them somewhere else and put them back together again.
Just because we needed a load of lorries etc does not make them any less portable. Nowadays, they would probably sling larger chunks beneath Chinooks and spend less time stuck in muddy places.
I am not sure I can define what the RE officially did not consider portable but the limits will have only increased in the intervening decades!
Holes are also portable. Especially when a staff sergeant tells you that you dug his hole 3 inches too far to the left, pulls out a measuring stick and demonstrates that you dug it in the wrong place and 2 inches too deep.
This in driving sleet on a mountain in the Brecon Beacons in February.
In that case the soviets had portable factories. In the face of a German invasion in WW2 they completely dismantled many factories in European Russia and reassembled them in the Ural Mountains and beyond. I've always been impressed by that.
Railways were apparently the key to moving the factories.
"I once told a soldier the portable version of a server was ready to be shut-down and packed up for deployment, he dutifully walked into the server room up to a (very) non-portable 42u rack and shutdown the servers in that"
In fairness to him, soldiers tend to get used to carrying 30kg packs. He might have a different idea of portable to you and me.
I have created tiled bitmaps with the server's name on it (eg NODE1, PRIMARY DOMAIN CONTROLLER etc), so if you log in to a server via RDP you can instantly see which server it is that you're working on.
And, yes, this was preceded by me rebooting the wrong server. Now I can instantly see which server I'm working on, and this avoids mistakes.
Face it, a slew of open RDP sessions on your desktop will invariably cause you to issue the wrong command in the wrong window. Fun.
> background was red with pictures of bombs on it.
SUSE Linux had this. IIRC it was brought in after people did things as root without recognizing that they were, often enough with serious consequences.
After some major terror attack (don't recall which) it was removed.
"I have created tiled bitmaps with the server's name on it"
We tried that at a customer on their RDP servers, so users can quickly look at the desktop to tell techs which server they are on.
Turns out roaming profiles will cache the background image, even if it's set by GPO at the computer level.
DesktopInfo is a wonderful tool.
Just come up with a template INI file and stick it somewhere all RDP users can read it, create a shortcut in ProgramData...\Startup to launch desktopinfo.exe for all users, and bake that into your gold image. Easy to package and distribute as well.
Then you get the name of your system as big as you want on screen - colour code for prod/non-prod if you're fancy, and some cute at-a-glance statuses if you want those as well.
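The colour-coding idea can be sketched in a few lines of Python. This is only an illustration: `banner_for` is a hypothetical helper, and it assumes production hosts have "prod" or "prd" somewhere in their name, which will not match every site's naming scheme. It does not use DesktopInfo's own configuration format.

```python
from typing import Sequence, Tuple


def banner_for(hostname: str,
               prod_markers: Sequence[str] = ("prod", "prd")) -> Tuple[str, str]:
    """Return (label, colour) for an on-screen server banner.

    The label is the short hostname in capitals. The colour is "red" for
    anything that looks like production and "green" otherwise. The marker
    strings are an assumed naming convention, not a standard.
    """
    short = hostname.strip().split(".")[0]
    is_prod = any(marker in short.lower() for marker in prod_markers)
    return short.upper(), ("red" if is_prod else "green")


if __name__ == "__main__":
    for host in ("sql-prod-01.corp.example", "node1"):
        label, colour = banner_for(host)
        print(f"{label}: {colour}")
```

Defaulting to the alarming colour for anything that matches a production marker means a misnamed test box looks scarier than it is, which is the safer way round to be wrong.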
In late 1977 I managed to take down all the PDP10 kit at Stanford and Berkeley with a software upgrade. Effectively split the West coast ARPANet in half for a couple hours. Not fun having bigwigs from Moffett and NASA Ames screaming because they couldn't talk to JPL and Lockheed without going through MIT ...
Taking down TOPS10 was so easy a luser could do it by assigning too many disk name aliases.
Mostly done for shits and giggles on last day of term with the added entertainment of super-lusers going to the computer centre to wrongly claim "I've just crashed the system".
These days that would probably be terrorism or some serious offence.
...your leader is worth following. Screwups will happen. But will grace and a second chance happen as well? If you find these in a leader, make sure you follow that person.
Bet the admin here never, ever made the same mistake again; performance across the board probably amped up as the lesson drove home the seriousness of the job.
I encountered a great leader once, in my first year of college working in a copy and print shop. The owner - a recent immigrant from Lebanon working three jobs at once to get enough cash to bring his family over - always seemed to be a hard man. But after one all-nighter running a $10,000 job I realized all too late that I'd screwed up the whole thing, and lost a major client. Margins are razor thin so we ate something like $9,600. When Mr. Hammad came in, I just had to press my "man up" button, tell him what I'd done, and wait to be fired. Instead he stared at me for a very long time, and took me in the back for a cup of tea. His one question - that still stings across the years - was "So... tell me exactly why you are so careless with our money? Our paper and supplies and our customers? Did you respect our customer? Is that what you want to be?" Then "I should fire you but instead I want you to stay here and show me who you really are." I wasn't fired and ended up running the business.
Guys and gals like that are tough to find, but the world really needs them. So try to be one.
Everybody on the team from server admins right through to data input should know that if they screw up then there will be no negative consequence on their career if they own up and alert the rest of the team immediately it happens.
Because fear can cause cover-ups and attempts to hide the problem, and then the problem can compound out of control the further in time you get from the error.
I once killed someone's SQL server (and hence the app that was accessing it) by running a not-inconsiderable query. Ordinarily, this would have been fine, if slow.
The kicker was that the server had a dying RAID drive. This would have been picked up in the normal course of events by the engineering bods and replaced; however, that hadn't happened yet.
The extra load combined with the slowdown from the dying drive ground the system to a halt.
Cue some frantic work to get things back up, and a replacement drive sent back out ASAP.
This type of thing is so easily done the only real safeguard is a fully redundant system with fault tolerance. It still baffles me today that major transport operators, banks and so on experience outages when a correctly architected and implemented solution should keep outages at bay, even taking disasters into account.