* Posts by Trevor_Pott

6991 publicly visible posts • joined 31 May 2010

Render farming is hot!

Trevor_Pott Gold badge

@The Cube

Well, we live in Edmonton, Alberta, Canada. 10 months of the year, the outside temperature is below 20 degrees C. I have never installed a datacenter in this city without an outside air system. It would be unbelievably stupid not to take advantage of the massive source of cold air just on the other side of the wall.

The issue is that for two months of the year, the outside air temperature is often over 30 degrees C. This means that in addition to your outside air system, you need chillers capable of handling the entire datacenter, even if they are only active two months of the year. (Also: the outside air system has to be upgraded to a much higher volume/minute capacity than it currently has.)

The upgrades won’t be particularly hard…but they simply cannot be done right at the moment. Edmonton has had something like a metre of snow in the past five days. We are still trying to clear our streets and walks, let alone having warm bodies to climb up on frozen rooftops to upgrade chillers!

In truth, most of the datacenters I have anything to do with here only need chillers two or at the outside three months out of the year. The rest of it is simply forcing outside air into the building, through the front of the servers and then exhausting the lot of it back outside the building. Not particularly complex, but it does take a sheet metal guy, someone to drill holes in the concrete wall and some dudes up on the roof upgrading the chillers.

Trevor_Pott Gold badge

@AC

Actually, we never had much of an issue here. We simply spammed spindles. Big fat RAID 10s running on Multiple Adaptec 2820sa controllers. The limits were the controllers themselves, not the drives! It took some work, but we eventually realised that if you staggered the startup of the system, they wouldn't all be trying to read/write from the storage array at once. Once we mastered staggered startups, booting was no longer an issue.

The control software was also good about this: it could be configured to only hand out jobs to a preset number of nodes at a time. So the first 15 nodes would get jobs and then 30 seconds later another 15 nodes would get jobs, etc. This staggered the requests from nodes reading jobs and writing results enough that it was within the capabilities of the hardware as provided.
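For anyone wanting to reproduce the staggering idea without that particular control software, here is a minimal Python sketch. The batch size of 15 and the 30 second gap come from the description above; the function names and the `hand_out_job` callback are hypothetical stand-ins, not the API of any real render manager.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(nodes: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size nodes."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(nodes), batch_size):
        yield list(nodes[start:start + batch_size])


def staggered_dispatch(
    nodes: Sequence[str],
    hand_out_job: Callable[[str], None],  # hypothetical: whatever starts a job on a node
    batch_size: int = 15,
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Give every node a job, pausing delay_s between batches.

    Returns the number of batches dispatched. The pause happens between
    batches only, so the first batch starts immediately.
    """
    dispatched = 0
    for index, group in enumerate(batches(nodes, batch_size)):
        if index > 0:
            sleep(delay_s)  # let the storage array catch up before the next wave
        for node in group:
            hand_out_job(node)
        dispatched += 1
    return dispatched
```

The `sleep` parameter is injectable purely so the behaviour can be checked without waiting; in real use the default `time.sleep` is what spaces the waves of reads and writes apart.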

Trevor_Pott Gold badge

@Robinson

What were you hoping to read about?

Trevor_Pott Gold badge
Pint

Video cards

At the moment, the client's sysadmin is most familiar with Octane Render as their GPU rendering platform. It can only talk to CUDA cards, and so nVidia was the only choice. So far, despite Octane being in beta…I’m mightily impressed. It is dirt simple to use, and fast as could be desired. The renders done on GPU have all the fidelity of a CPU render; there are no shortcuts taken by this software. (Traditionally, GPU renders would show graininess in shadows and there were frequently issues with glass rendering.)

At the current pace of development, Octane Render should have a fully supported version 1.0 product out the door before we get the datacenter upgrades completed and the new farm installed. Version 1.0 will come with all the scripting gibbons and bobs we need to make the whole thing properly talk to a command and control server and then we’re off to the races!

Fortunately for me however, I’m not the one who has to deal with the render software. I am setting up the deployable operating system, designing the network and speccing the hardware that will be used. I get to design the datacenter’s cooling and power systems and oversee the retrofit. I’ll be ensuring that Octane Render is installed properly on the client systems and that the individual nodes grab their configs from a central location…but tying those nodes into the CnC server is the in-house sysadmin’s job.

Overall, it’s a great way to play with new toys in a fully funded environment. More to the point, it’s doing so in a fashion that takes full advantage of my unique skills: instead of simply following a manual someone else has written, I am doing the research and writing the book myself. Doing that which hasn’t quite been done before…but for once, with a proper budget backing it up.

The fact that I am getting paid for it is simply icing on the cake. These types of jobs are so fun...honestly, I'd do them for free. Hurray for the fun gigs!

Trevor_Pott Gold badge
Happy

Power systems.

You are correct in that the render nodes don't need UPS support. The command and control servers (as well as the storage systems) do have UPSes. The UPSes are APC, and are installed on the same racks as the systems using them. So far, they have handled 2 hour outages with room to spare. Though that's probably because I went to Princess Auto and bought a bunch of very large Deep Cycle batteries and added them to the UPSes when I installed them. Works a treat!

Trevor_Pott Gold badge

@swisstoni

Ask, and ye shall receive. I've put it on my todo list for new articles. :)

Ptable: It’s all about the interface

Trevor_Pott Gold badge

"Where did you find these guys."

All over. It's an interesting thing sometimes to walk away from our common cloud of friends and associates and realise that there is this whole huge world out there separate and distinct from the technorati. We don’t know how good we have it; understanding technology, its uses and applications. Many people still use manual input devices and dead tree ledgers to do their accounting. Even more use dead tree flyers as marketing…and shockingly millions of people read them!

Every now and again I actually disconnect from the Internet and all her various digital denizens and walk the streets of the world less connected. I meet interesting people from interesting places…periodically even some who own businesses. This is where I find these people. They own my favourite cheese shop, or are my barber. They run a ski hill or a church. They own a nightclub or a shoe-repair shop or that little café around the corner with the really good grilled cheese sandwiches. They are hundreds of thousands of people in my city alone, and millions across my continent.

Every now and again I forget about them…it’s good to stop and remember.

Trevor_Pott Gold badge

@Tonyhoyle

Yes, it uses JavaScript. I really should have mentioned that. As to "not using CSS layout" if you click on "about," the author actually does talk about it. Personally, I support his reasons. I never did buy into the "CSS layout fundamentalism" that the hard-core purists get on about.

In any case, take some time to troll around the "about" page...he does talk about this near the bottom.

Storage pros: Big or small, you still have to hit the sweet spot

Trevor_Pott Gold badge

@AC

MAIDs are for all intents and purposes standard RAIDs wherein the disks are "spun down" when not in use. That is 100% in the RAID card. The Intel RS2BL080 is my current favourite card. It uses an LSI 2108 chip. It seems to spin my disks down when idle just fine. (I believe you need MegaRAID 3.6.)

DELL PERC H700 and H800 cards can also be configured for Spin Down.

Most vendors don't call it MAID. They just call it "spin down" and usually tout it as a power saving feature. I should point out that a single modern RAID card can be married to SAS expanders to provide a true "Massive Array of Idle Disks." :)

Trevor_Pott Gold badge

Tiered storage.

OS/Apps: SSD

Hot Data: 10K 2.5" Drives

Cold Data: 5.4K 3.5" Drives

There's a lot more cold data than hot on my network. Get the right controller and you can spin down your 3.5" disks when not in use. (I do like MAIDs.) Newer controllers (LSI has several very nice ones doing 6Gb/s SAS now) can recognise and deal with SSD/Hot 2.5"/MAID 3.5" disks separately and in different appropriate fashions for dirt cheap.

It is understandable for you to be using 3.5" drives right now if you are carrying over legacy equipment (especially if you are towards the end of your refresh cycle,) but by the next refresh, there really will be no excuse.

Why did my server just die?

Trevor_Pott Gold badge

@Ammaross Danan

Where was "a flaw in virtualisation technology" ever remotely blamed? Also: how exactly do you feel you have the right to tell someone else they "should" have resource caps in place on testbed servers? There are dozens of reasons not to and only a few very flimsy reasons why you might want to.

I don’t see how properly testing an application – in a testbed environment – in order to do things like application resource profiling (among others) is such an issue. If you feel this article is an attack on virtualisation as a technology, then I would like to suggest that you are more than a little overly sensitive about the topic. The article was about patch testing and conveying the concept that “just because something has always behaved in X fashion does not mean that it will continue to do so forever.”

Nothing in the article remotely talked about “a flaw in virtualisation technology.” Furthermore I most certainly /was/ "asking for it." That is implied by the concept of a TESTBED server. The purpose of a testbed server is to "run the thing and see if/how it blows up." I've never met a systems administrator who didn't equate "testbed server" with "asking for it to explode."

Trevor_Pott Gold badge

@Ammaross Danan

The VM was allowed to dominate the entire server simply because...it was a test server. The point of the testing is to see if/how patches and updates change the behaviour profile of an application. If the behaviour profile of an application changes, then the entire virtual instance changes and I then go back and recalculate my load balancing.

Virtualisation is a tool in a systems administrator’s arsenal, no different from any other. While resource constraints are a fine thing in a production environment, they make absolutely no sense to me in a testbed environment. I view my testbed virtual servers similarly to how I view calibrating my various test equipment: it is there to provide you with baselines against which you can measure issues in production.

Given the radical nature of the performance changes delivered by this patch, I would say that this application server is quite simply no longer a candidate for virtualisation. Put another way: my calibration tests determined that the tool I was using is no longer valid for the environment in which it must operate. Indeed, this incident makes me grateful that I /don’t/ run my testbed systems with resource constraints. With resource constraints in place I may never have caught this performance difference. Had I not caught it, we would not be able to take advantage of the vastly superior report generation times this patch enables.

More to the point…it allows us to simply remove from service several instances of Windows Server we had been using to run “report servers” in order to compensate for the slow report generation features of this application. This frees up licences that I can use elsewhere for other projects. All in all, win/win/win largely because I take the time to reprofile my applications with every patch by running them on an “unlocked” server.

Based upon the performance delta, I have already begun the process of sourcing equipment to properly physicalise this server. This patch will not be applied to the production environment until that production environment has been fully physicalised.

Trevor_Pott Gold badge

@Trygve Henriksen

Under the old version that used 30% resources, a sample report would take 6 hours.

Under the new version that uses over 85% of the resources, the sample report takes 5 minutes.
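For a rough sense of scale, those two data points can be compared in a few lines of Python. This is only a back-of-envelope sketch: it treats "over 85%" as exactly 85%, and both the function names and the crude "share of the box times minutes held" cost measure are made up for illustration.

```python
def speedup(old_minutes: float, new_minutes: float) -> float:
    """How many times faster the new run completes."""
    return old_minutes / new_minutes


def resource_minutes(share: float, minutes: float) -> float:
    """Crude cost of one report: fraction of the server used times minutes held."""
    return share * minutes


old_minutes = 6 * 60  # 6 hours at roughly 30% of the server
new_minutes = 5       # 5 minutes at roughly 85% (the post says "over 85%")

print(f"wall-clock speedup: {speedup(old_minutes, new_minutes):.0f}x")
print(f"old cost: {resource_minutes(0.30, old_minutes):.1f} resource-minutes")
print(f"new cost: {resource_minutes(0.85, new_minutes):.2f} resource-minutes")
```

So the patched version finishes about 72 times sooner while demanding a bit under three times the share of the machine, which is why it flattens its neighbours for those five minutes and still comes out far ahead overall.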

Trevor_Pott Gold badge

Resource caps

There are no resource caps on the test server because the test server is used to see what kind of resources an app consumes. I /want/ them fighting for resource contention, because it helps me profile any changes to the application. :D

Trevor_Pott Gold badge

@twelvebore

Production and test systems were not on the same server. The test VM was on a test server with other test VMs. When the new patch finally "unlocked" the performance of the POS software, it flattened the test VM, on the test virtual server. The side effect was one of also flattening the various other test VMs on that same server.

No production systems were harmed during the testing of this patch.

Trevor_Pott Gold badge

Resources.

The system would never really get above 30% CPU. It wasn't hitting the disk much, and didn't use the network for more than about 2-3%. RAM was about 40% in use. RAM wasn't all that active, so it wasn't a "very small chunk of RAM getting hit so hard that all the RAM bandwidth was being nommed" issue.

I poked at it on and off for four years. I never did figure out what I could possibly change that would make it actually consume more resources. There didn't appear to /be/ any form of bottleneck. Just a stubborn refusal to use what was provided it.

Truly and honestly the single most bizarre application I have ever had the opportunity to work with. With the sole exception of this one application, I have never met an app I couldn’t identify a bottleneck on.

Who will rid me of these obsolete PCs?

Trevor_Pott Gold badge

"Common light bulb."

CFL or Incandescent? My house is all CFL moving to LED. I am not even sure you can /buy/ incandescent house bulbs any more. So you are claiming the waste power of a PSU is 14W or less on a 1kW PSU? In what universe? Even an 80 PLUS Platinum couldn’t claim that!

Now, I am always on the lookout for more energy efficient gear. You make a bold claim by saying “…those so-called "1 kilowatt" power supplies only consume about as much AC power as a common light bulb.” I would honestly like to know where you can get such a beast, if they indeed exist.

Also: you claim that “On those power supplies that say "350 watts" in big letters, that's the DC power capacity.” I would like to know which PSUs you use where you can reliably count on getting 350W DC on a unit stamped “350W.” Is that after 5 years of capacitor aging, or right off the shelf? Off the shelf, the best I’ve gotten is 98% of rated capacity, with degradation to 85% rated capacity with 5 years of capacitor aging.

I have always specced my PSUs such that system load = 80% of PSU capacity, and banked the PSU pull from the wall at 110% rated maximum. If you use alternate calculations, or know of better PSUs than 80 PLUS Platinum, please let me know!
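To make the arithmetic concrete, here is a small Python sketch of that sizing rule, plus the efficiency a 1kW unit would need in order to waste only 14W. The 80%, 110%, 85% and 14W figures are the ones quoted in this comment; the 400W example load and all of the function names are hypothetical.

```python
def psu_rating_for_load(load_w: float, load_fraction: float = 0.80) -> float:
    """PSU rating such that the system load is load_fraction of capacity."""
    return load_w / load_fraction


def wall_budget(rating_w: float, factor: float = 1.10) -> float:
    """AC draw to budget for at the wall: 110% of the PSU's rated maximum."""
    return rating_w * factor


def aged_capacity(rating_w: float, remaining: float = 0.85) -> float:
    """DC output still deliverable after capacitor aging (85% at 5 years here)."""
    return rating_w * remaining


def implied_efficiency(dc_out_w: float, waste_w: float) -> float:
    """Efficiency a PSU needs in order to waste only waste_w at dc_out_w output."""
    return dc_out_w / (dc_out_w + waste_w)


load_w = 400.0  # hypothetical system load, for illustration only
rating = psu_rating_for_load(load_w)
print(f"PSU rating needed: about {rating:.0f} W")
print(f"wall budget:       about {wall_budget(rating):.0f} W")
print(f"aged capacity:     about {aged_capacity(rating):.0f} W")
print(f"efficiency implied by 14 W waste at 1 kW: {implied_efficiency(1000, 14):.1%}")
```

One side effect of the 80% rule worth noting: a unit that has degraded to 85% of its rating after five years can still carry the load it was originally specced for.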

I have had the best luck with FSP PSUs.

http://www.plugloadsolutions.com/80PlusPowerSuppliesDetail.aspx?id=37&type=2

Trevor_Pott Gold badge

Always.

As much as I like to be fair, I am certainly not above a bit of donation-related nepotism, my friend. You will always get first crack at the leftover gear. ;)

Trevor_Pott Gold badge

"Big, evil hard drive shredders."

I must arrange to send some equipment here, if only for the spectacle of watching a few useless drives get mauled. Beats turning them into coasters. I have about fifty formerly-hard-drive-platter coasters sitting on a shelf and dozens more "dead" drives waiting for a slow day so I can make more...

Trevor_Pott Gold badge

@Roger Greenwood

Well, I ran that idea past the head beancounter. The official response is that the beancounters don't get to decide what is considered a consumable and what is not. Apparently, that is decided by the government...there supposedly exists an actual section of law that documents computers and other electronics as being "fixed assets" rather than consumables. Would have been really nice if that trick would have worked!

I wonder if it does in the US? The UK? Different countries, different laws...

Trevor_Pott Gold badge

C90LEW Wyse clients.

I have many.

Trevor_Pott Gold badge
Unhappy

Residual Value

Arrange for me a method of swapping out the PCs of the beancounters in Ottawa that create these silly laws and you’re on. As for the bean counters here at work…they are the “ethical” kind. If you did that, they would still maintain that they had residual value…though they would have a perfectly legitimate reason why they require a PC with more value than that evident in the older PCs.

Seriously though, who comes up with these laws? I understand that some checks are needed against people who launder money or dodge taxes in this manner…but for the everyday Joe this is just utter lunacy.

Trevor_Pott Gold badge

Scheduling conflicts

I don't actually "work for the Register." I work for a company here in Edmonton as a systems administrator full time (8-12 hours a day) as well as maintain about a dozen other networks (for example, those of my company's largest clients) "after hours." I squeeze in writing articles for El Reg mostly because part of my day job is cranking out documentation at work. Incident reports, how-tos, you name it. Half a sysadmin's job is paperwork, the other half is research.

A good example of a typical day would be today. I woke up at 7am to be on the road before 8am. I had to stop at Memory Express on the way in to pick up a spare disk and showed up at work by 9am. I managed to check the comments section and respond to a few whilst standing in line. I am at work until about 7:30pm tonight, followed by a short dinner date and then a server swap and data migration. I’ll get home around 11:00pm. That is enough time to feed the pets, check my e-mail and collapse into a heap. Rinse repeat until Sunday, which is then filled with doing all of the chores needed to keep a house maintained and various pets happy.

Even responding to this comment gets delayed; my phone let me know that it was posted at 10:39am. I have been pecking at it in between support calls and troubleshooting ever since. It is now 11:27am.

Between my day job, the various other networks I maintain and writing for El Reg I work 10-16 hours a day, 6 days a week. (Usually Sundays off.) Scheduling conflicts are thusly something fairly normal. It is something of a common complaint I hear from other people in the city. Not only sysadmins, but anyone who has to work and commute in this sprawling city knows that ecostation days – like trips to the doctor – essentially require taking a day off work. This is especially true when you consider that it can take an hour to make it through the ecostation once you arrive.

Trevor_Pott Gold badge

@Jeremy 2

Worth a try. Edmonton's freecycle community isn't exactly something I would call "vibrant." Oddly enough, there is a Usenet group (?!?) still active in these parts named "edm.forsale." I might actually have some luck there...but the last time I played around with that Usenet group, the crowd was fairly picky, wanting complete documentation on such "give-away" prizes, etc. Worth a boo, though!

Trevor_Pott Gold badge

@skelband

Thanks! I will look into this.

Trevor_Pott Gold badge

I wish!

Actually, the way it seems to work here is that you have to dig up evidence relating to the market value of the item. Good examples would be print-outs of ebay auctions or kijiji ads for similarly specced equipment. That information then has to be retained for seven years.

Trevor_Pott Gold badge

@The Unexpected Bill

I have 15 Identical Pentium IIIIs. 11 Years old and <3. More on that, later...

Storage experts: Does size matter?

Trevor_Pott Gold badge

Listen to this man.

"My personal experience with HDD failures especially in enterprise level storage arrays is that frequently the disk that has been failed by the array is actually still quite serviceable and I have redeployed many of them to other less demanding situations without any issues. I suspect that the main reason for the high failure rate in storage arrays is that in a raid stripe or volume group that one slow drive can effect the performance of the other drives and storage vendors will fail these drives for performance balancing reasons."

Everyone listen to this man: he knows of what he speaks. This quote is Truth spoken freely. I should also point out that in many cases disks which consistently fail out of a RAID will pass vendor diagnostics as they are mechanically and electrically sound...they simply have remapped critical sectors as failed such that the disks are that msec slower than all the others in the array.

This indeed is why the TLER bug on the Velociraptors is such a pain: there is nothing wrong with the drives themselves...but they stop responding due to a firmware issue that ends up dropping a perfectly valid drive from the array.

Trevor_Pott Gold badge

WD drives.

Say what you want about Western Digital drives in general - and I've no good words for the Velociraptors - but I'll be damned if the RE4 Green Power 2TB drives aren't solid gear. Slow as sin...but they store a great many bits very reliably.

Bad (terrible!) idea as primary storage. Not remotely half bad as archival storage or in a MAID.

Diary of a server failure

Trevor_Pott Gold badge

@AC

That is a very rational way to run a shop. In all honesty, we don't run that tight...but I do try to adhere to the principles you espouse as much as is possible. The reality of that situation is that some hardware issues I am unqualified to troubleshoot. (I am not an electrical engineer.) Others...well we sometimes take the easy route out. ("We only have five of these in the field, they are 2/3 of the way through their life and we've RMAed three. Pull the whole line and we'll replace them.")

I do take the time to throw things on the bench and test the dickens out of them whenever I can. It's the reason I have spare parts for everything; more often than not, if I can get the originals back in hand I can find out how they died and often repair them.

Overall, I think taking the time to properly investigate failures is important; sometimes failures are preventable simply by making small changes to the operating environment. (Reducing vibration, temperature deltas, etc.) It’s an important practice.

Trevor_Pott Gold badge

The issue here..

...is perceptions like "your boss driving a new Mercedes, etc." My Boss drives a 2003 GMC Jimmy. I drive a 2005 Scion XB. He lives in a nicer house, but he bought during a housing bust - I bought during a boom. You are - wrongly - transposing your views/experiences to others. Sure, my boss makes more than I do; but not a heck of a lot more and he earns every penny through additional responsibility and a great deal of hard work.

For every complaint I could lodge against the place I work – and the folk who run it – I cannot say that they eat cake whilst the proles beg for crusts of bread. My boss makes mistakes – we all do including myself – but I will unreservedly say he’s a good person.

Regarding the angst bit…you are wrong. There is no angst over any of this, merely frustration. Regardless of the number of words – 150, 1200 or otherwise – I don’t seem capable of providing adequate context. This is doubly frustrating for me; as a sysadmin it means having someone analyse and comment on my professional capability whilst starting from incorrect assumptions.

As a writer, I seem to lack the ability to convey information in such a manner as to be capable of correcting those false assumptions. This generates no more angst than a website install requiring a mod_security alteration that I don’t know off the top of my head. It generates more frustration however, because I can Google the mod_security alteration. I am as yet too inexperienced to know which syntax to enter into which search engine to alter misperceptions.

As to “commenting too much on my own articles,” you’re probably right. I made the mistake of assuming that certain commenters – yourself among them – were willing and able to absorb additional facts that might then alter extant misperceptions. Unlike many of the other authors here on El Reg, I started as a commenter first. Long debates moderated by her excellence Madam Bee are not foreign to me.

Make no mistake, I welcome criticism and suggestions. I have a deep respect for the staff and commenters on El Reg. In many of my articles there have been excellent suggestions…several of which I have tested and which have made their way into my production environment. I wrote an article here about how I got lucky and recovered a RAID 5. A half dozen people came out of the woodwork and proclaimed “RAIDX is dead! Long live ZFS!” I haven’t had much opportunity to work with ZFS in the past 18 months or so, but these comments have inspired me to go forth and set up a test lab to see exactly what it is I am missing.

Where it all falls down for me – personally and professionally – are the circular arguments. I have absolutely no idea how to deal with them. The religious argument is a great example:

10 The bible is infallible.

20 How do you know it’s infallible?

30 Because the Bible is the word of God.

40 How can you be sure it’s the word of God?

50 Because the Bible tells us so.

60 Why believe the Bible?

70 GOTO 10

I am unable to deal with those arguments. I do not know how to “win” them. When trapped in them, I know of no graceful way out of them. Whilst I can deal with technical, political or religious arguments about many topics, I have this personal failing when it comes to circular reasoning. When you and I have an argument along the lines of:

10 X is terrible, you should Y.

20 I had no funds for Y, I had no choice but to X.

30 There is always money available!

40 I promise you there is no way there was funding to Y, I had to X.

50 GOTO 10

I expect experience will give me a greater chance of seeing these sorts of logic loops and avoiding them. I hope experience grants me the ability to at some point learn how to gracefully exit these sorts of pointless conversations. Until that point in my individual development however, I fear circular reasoning loops will continue to be my personal kryptonite.

Trevor_Pott Gold badge
Pint

@Peter Mc Aulay

It's a relief to know that there are some folks on these boards that do indeed understand. Yes - running an SME shop with two tin cans (one of which is on loan) and a string is exceptionally different from running a shop where you can do magical things like source all your gear from a Tier 1. The interesting part is that I am held to a 99.99999% SLA by one of the two primary shareholders. Concepts like “well, just cluster everything and then get four-hour service plans from Tier 1s!” display a shocking ignorance of what my world actually looks like.
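For anyone who hasn't done the sums on what seven nines actually permits, a quick Python sketch follows. It assumes the figure is a percentage measured over a 365-day year, which the comment doesn't spell out, and the function name is invented for illustration.

```python
from decimal import Decimal


def allowed_downtime_seconds(availability_percent: str, period_days: int = 365) -> Decimal:
    """Seconds of downtime permitted per period at the given availability.

    The percentage is passed as a string so Decimal keeps it exact
    instead of inheriting binary floating-point rounding.
    """
    unavailable = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable * Decimal(period_days * 24 * 60 * 60)


for target in ("99.9", "99.99", "99.999", "99.99999"):
    print(f"{target}% -> {allowed_downtime_seconds(target)} s of downtime per year")
```

Under those assumptions the seven-nines target works out to a little over three seconds of downtime a year, which puts the "two cans and some borrowed string" budget in perspective.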

I’m lucky though…it all ends in mid 2012. For the very first time we get to refresh our servers all at once, and do it properly. This year I got to do the desktops: out with the 11-year-old systems that were falling apart, in with a brand new deployment of Wyse clients. (Hurray!) 2012 brings the server refresh…and a move from my world to one where whitepapers might actually apply! It’s things like “the money exists, trust me” that absolutely floor me. No degree – MBA or not – makes that assertion true.

Still, the commenter’s disease of most commenters thrown aside…I don’t write my articles for the folks with MBAs or working in places where buying “new gear for a specific job” is ever an option. El Reg has plenty of readers who don’t fall into that category. In my city alone El Reg is the wild favourite of all the sysadmins working for the various charities. Several low-IT-budget SME admins are also part of the local gang. Not to say the folks running the University departments aren’t also regulars…but they simply play in a different world than I do.

There are lots of articles on El Reg that talk about “EMC storage arrays” and “VMWare’s latest super-deluxe ultra-edition management software that requires you to pay in the form of pureed virgin soul.” There aren’t so many aimed at the guy working for the local charity who is putting together donations from a dozen different businesses, most of which don’t match, barely work and for which he doesn’t have spare parts.

I’ll see if I can get the brass to rename my blog. “Sysadmin blog” is obviously going to cause nothing but continual commenter’s disease issues with the types of folk who think that all sysadmins face the exact same challenges. Maybe I can get them to rename it “Two cans and some borrowed string Blog.” Has a ring to it, I think!

In any case, it should be pointed out that despite my frustrations as regards rampant commenter’s disease, there is a lot of gold in this thread if you aren’t a “two cans and some borrowed string” kind of sysadmin. Most of the commenters here are – as usual – dead bang on rights. El Reg really does have a bright crowd answering the call of the comments section.

Trevor_Pott Gold badge
Unhappy

@jake

Wow. I cannot believe I was so very deeply wrong about you. You truly do have the very worst form of Commenter’s Disease there is. This statement: “iii - The funding exists. Trust me. But it'll go to managerial bonuses, unless you can figure out a way to redirect it.” Quite simply means you have no effing clue, and aren’t interested in even trying to extract said clue from what someone else writes. You actually are incapable of comprehending that the world does periodically function in a manner that is non-cognate with your personal beliefs and experiences. I will note this and move on. I am deeply disappointed that I was this wrong about you.

As to angst, well…I know the internet is terrible for conveying subtlety. You are mistaking angst for frustration. They are very different concepts that I at least deal with differently. Given the commenter’s disease present here, however, I won’t bother trying to explain.

If at some point in the future I ever find myself in a situation remotely like yours, I will look you up. You are an intelligent individual with a great deal of experience to share. Unfortunately, the disparities between our professional and personal lives are so great that we are unable to communicate remotely effectively. I lack the skill to convey my situation in a manner capable of overcoming your commenter’s disease; a sad failing I freely admit.

For now, I will simply wish you a good day, sir. Good luck in your future endeavours.

Trevor_Pott Gold badge

"Why RAID 5"

That has a very long story. Thanks for asking rather than assuming, though! The real answer was that we didn't specify "RAID 5" when building these servers. Originally, the servers had 8 drive bays: 2x 250GB Seagate ES.2s for the OS and 2x Velociraptors for the VMs. Both were RAID 1.

We left the rest of the bays open so that we could expand capacity later…when we got more money to do so. (You really have to understand that money does not flow here like it does for many of the other commenters. People cavalierly toss about suggestions of putting 15K SAS drives in my servers…but I had to scrimp and sacrifice to put my data on separate disks from my OS in the first pair of this model VM server.)

When we bought our third server (and along with it FINALLY a physical backup unit in case the motherboard went on any of our now three production copies of this model of VM server) we were in a position of doing rather well, cash-wise. I was able to purchase enough disks to fill all the slots in all three servers. It would be enough to get us the VM capacity we so very desperately needed. With the following caveats:

1) I only had enough money if I didn’t toss the existing 4 Velociraptors. That meant all my new drives either had to be the same or I had to start divvying up the arrays.

2) The only way I would get enough space out of the existing drives whilst still having redundancy of any variety was RAID 5.

This meant extending our extant two Velociraptor RAID 1s into RAID 5s, and putting a RAID 5 in the third server. That was about 8 months ago. We now have 6 of these systems in service with two physical spares. We have 20% surplus capacity across this model of VM server…so I will be able to take the hit to reduce to RAID 6. (We do that this coming Tuesday, as a matter of fact.)
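The capacity trade-off behind that decision can be sketched in Python. One loud assumption: the example uses six data drives per array, inferred from 8 bays minus the 2 OS disks rather than stated outright, and it counts capacity in whole disks because the drive size isn't given. The function name is hypothetical.

```python
def usable_disks(level: str, n: int) -> int:
    """Number of disks' worth of usable space in an n-disk array.

    Standard textbook figures: RAID 1 keeps one copy's worth, RAID 5 loses
    one disk to parity, RAID 6 loses two, RAID 10 loses half to mirroring.
    """
    if level == "raid1":
        if n < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return 1
    if level == "raid5":
        if n < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return n - 1
    if level == "raid6":
        if n < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return n - 2
    if level == "raid10":
        if n < 4 or n % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return n // 2
    raise ValueError(f"unknown RAID level: {level}")


drives = 6  # assumed: 8 bays minus the 2 OS disks
raid5 = usable_disks("raid5", drives)
raid6 = usable_disks("raid6", drives)
print(f"RAID 1 pair: {usable_disks('raid1', 2)} disk usable")
print(f"RAID 5 x{drives}:   {raid5} disks usable")
print(f"RAID 6 x{drives}:   {raid6} disks usable")
print(f"capacity given up moving RAID 5 -> RAID 6: {(raid5 - raid6) / raid5:.0%}")
```

If the six-drive assumption holds, dropping from RAID 5 to RAID 6 gives up one disk in five, which lines up with the 20% surplus mentioned above being just enough to absorb the change.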

They aren’t our only VM servers…I have a fleet of 12 others in the field (two active, one physical spare per city for four cities.) They have half the cores, half the RAM and run only a RAID 1 of Velociraptors each. There are a smattering of other VM servers too…but they are all test bench stuff as they are one offs that I don’t have replacement parts for.

So the “why RAID 5” is a legacy item: it’s from days not too long ago when we absolutely needed those very last gigabytes and had no more dollars to spend. Not that we have many dollars now…but I have been making very careful purchases with every dollar I can get my hands on.

I am very eagerly awaiting the First True Server Refresh in 2012 (we finally have this budgeted as a company-wide Major Project!) This refresh will see SANs in each city. If I have my way, SANs running SAS drives in RAID 10. It might be sad, but when I dream the dreams that I dream, I dream dreams of SANs…

Trevor_Pott Gold badge

@AC

Well, our servers are specced by us...but built by the local distie. (Supercom.) The servers are actually usually quite good kit and Supercom are fantastic to work with. I do indeed get to specify "please make sure they aren't all the same batch." They will even go out of their way to dig up slightly older or fresh-off-the-boat-new disks to mix-and-match what goes into an array for me. Maybe all server makers won’t…but mine does. I love them for it.

Still, there is only so much variability you can get doing that. When you want 6 drives of the same model for your array…they are going to be relatively close together. No perfect happy solution, I’m afraid…

Trevor_Pott Gold badge

@Peter Kay

I had trouble installing ESXi on the system with the Intel RAID card simply because it was never designed to be an ESXi server! It was a prototype Windows file server...in all honesty it was a Big Collection Of Storage Space that served as a "focal point" for backups across the company. All the backups were collected onto this system, then written to removable media. I was testing a new chassis (24x SAS hotswap darling,) a new Motherboard, new RAID card and a new SAS expander. It was literally in early prototype state.

Remember that the RAID 5 didn't actually have any disk failures! The drives merely dropped out of the array due to that wretched TLER bug. (For the record: I loathe Velociraptors. I wish I had the cash for proper SAS drives, but at the time it was "use the Velociraptors, or we make you use 7200rpm Seagate ES drives." I had zero choice in the matter.) That means the drives were actually fine…but after 49 days and change they simply stop responding to commands. The RAID card can’t see them any more and so thinks that they have dropped from the array. Power the server physically off and then power it back on…*poof!* drives are back up and doing fine.
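A side note on that "49 days and change" figure: it is very close to how long a 32-bit counter of milliseconds takes to wrap around. Whether that is what actually happens inside this drive firmware is an assumption, not something confirmed in this thread, but the arithmetic is easy to check with a small Python sketch (the function name is made up for illustration):

```python
from datetime import timedelta


def rollover_after(bits: int = 32, tick_ms: int = 1) -> timedelta:
    """Time until an unsigned counter of the given width wraps around,
    assuming it is incremented once every tick_ms milliseconds."""
    return timedelta(milliseconds=(2 ** bits) * tick_ms)


# Assumption: a 32-bit counter ticking once per millisecond.
span = rollover_after()
print(f"wraps after {span.days} days plus {span.seconds / 3600:.1f} hours")
```

If that assumption holds, it would also explain why a full power cycle brings the drives straight back: the counter simply starts again from zero.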

So in this case, when the server came up the LSI controller read the metadata on these two drives and saw that they should be part of a 6-disk RAID 5. When it looked for other members of that array it found four other disks…all of which believed they were members of an array which had dropped two disks! By sheer fluke the Intel controller was able to pick up all six disks as a single array…apparently ignoring the metadata mismatch that the TLER error caused.

As to not choosing SAS drives…it simply wasn’t an option. Most commenters in this thread behave as though I could have simply had a tantrum and money would have appeared…but that quite honestly wasn’t the case. I was lucky (HAH!) to not be stuck with 7200rpm Seagate ESes. Things will be different in 2012. Then I finally get to do something like a “bulk replace” of my entire server fleet. For the first time I can do it properly: a SAN with some front end VM servers, some physical servers for critical tasks and proper identical parts (with spares) from a single vendor giving us a sexy warranty. The company I work for has never been in the position before to do so. Seven years ago they had four computers and one server. The growth has been in fits and starts and quite literally at the very limit of the budget each time.

The transfer time issue is this: Only that bloody Intel controller would talk to those six disks as an array. If I shoved the disks back into the LSI 1078 (any of the many 1078s I have) it would see them as two arrays. If I wanted to get the VMs off (which I did, because restoring from backups is a pain in the ass,) then I had to shove the Intel controller into a system which could boot ESXi (not the Windows prototype it was originally located in) and then pull the VMs off. Understand that nothing about the array was suspect! The drives were not DEAD. They had dropped out of the array due to the TLER error and nothing more. The data was 100% intact; the only question was how to get at it.

Once I had put one of the spare ESXi computers back together (I had it apart for a testbed project) I was able to toss the Intel card into it and it saw the array just fine. I shoved a new set of disks into the original ESXi box with the 1078 in RAID 6. Copy the VMs from the spare ESXi box with the Intel controller in it to a file server and then from there back up to the original ESXi box with its new array.

This is why the transfer time is important: getting the array back up and shoved in a box that would read it doesn’t take long. Pulling the VMs off and then uploading them again does. Fortunately, I don’t have to do anything but periodically poke the computer to make sure the transfer hasn’t failed. That’s a hell of a lot less work than restoring everything from backups would have been.

Trevor_Pott Gold badge

@All the "raid != backup" comments.

For the record...I do have perfectly valid backups. Recovering the array in question was not a matter of "oh crap...if I lose that data I am dead!" Recovering from backups is a fairly long and tedious process that I was not particularly amused by.

Recovering the array on the other hand was

a) far more intellectually interesting

b) potentially much faster.

If it makes anyone feel any better, one of the elements left out of this particular article was that in the background, whilst I fiddled with the array, backups were unpacking to a secondary server just in case I needed to make use of them. In my case, recovering the array was faster than recovering from backups, which would have involved transferring and differentiating template VMs followed by reloading the latest backups to them.

That all said, I would like to reinforce that RAID != Backup!!! Raid is a convenience, nothing more. It is a method of helping to provide uptime or raw speed. I should also add that Replication != Backup!!! Replication is again nothing more than a convenience. It is an uptime tool. (If it is offsite replication it can be a disaster recovery tool.)

Backups provide more than simply the ability to recover from a disaster such as failed RAID. Properly done, they provide the ability to recover from human error: “oops I deleted this file!” Replication in many cases will simply replicate the deletion. RAID won’t help you out of that pickle at all.

Recovering an array as described in this article should never, EVER be your only option! Please view such measures as conveniences only!

Trevor_Pott Gold badge

@jake

I don’t think you’re antagonistic, jake. I think you’re arrogant in your assumptions. I can forgive a lot of the commenters on El Reg for having Commenter Disease, but not you. For someone with your years of experience, you should be perfectly capable of understanding that the world as it applies to you does not apply to everyone else.

I read your comment as saying that a “proper” sysadmin who is in my situation would either sweet-talk my bosses into providing me more funding, or get a different job because the one I am in isn’t good enough. There are two problems with this comment. The first is that there quite literally is no more funding to be had. No matter how many degrees you have, how good a con artist, salesman, businessman or smooth talker you are…you quite simply cannot get access to what isn’t available. I certainly will not be going in depth into my Company’s finances in full view of the internet…but I am one of the few in this company who knows where all the dollars end up. Suffice it to say that they are being spent where they need to be spent and there really isn’t anything more to be had for IT.

The second assumption, that I should get an MBA and be essentially “just like you” is rubbish. I’d rather be boiled. I got into IT because I like FIXING things. I don’t like project management. I don’t really like management – though I’ll do either if required. Thanks to having shrinks for parents I am fairly good at manipulating, coercing, cajoling and coddling people…I simply choose not to. I prefer machines. I don’t want your career. In fact, as time has progressed and I have lived my life…I have discovered I want less and less to do with IT in general. I prefer writing. I compose music. Oddly enough, I get a thrill out of taking hardware and software of various types and pushing them to their absolute limits. In the 80s and early 90s I would have been called a “hacker.” Not because I spend my time penetrating other people’s computer systems, but because I like to tinker with things and figure out how they work.

The key here is that (shock and horror,) I have no real ambition “to be rich.” I am (most of the time) content with being a middle-class largely blue-collar schmoe. I make enough to keep me happy right now. 5-10% more than I make would provide enough to save luxuriously for retirement. You and I simply have different values, jake. I want nothing more than to largely be left alone to tinker. Periodically, I like to share my thoughts and experiences before returning to my man-cave. I want a different job, it’s true. I just don’t want /your/ job.

The job I really want is pretty rare. I want a job in what amounts to “practical application research and development of IT systems.” I want a job wherein I get to take off-the-shelf components and do something with them that hasn’t been quite done before. “I wonder if this can do X.” That guy – whoever he is – that decided that cookie-tray servers were a good idea? Dreamt them up, built a prototype, tested them and refined the process? That’s the job I want. Figure out how to bodge 48 disks reliably into a case meant for 8? Hey, that sounds like an absolute BLAST! Working where I am is never going to make me rich. It’s frustrating and it’s constraining and I get laughed at by people on the internet for not being an MBA working for a Fortune 500. It’s still the closest thing I’ve ever found to being the guy I described above.

How does this reflect in my writing, my articles and threads like this? It means I direct myself not at the guy who is gunning for the job at the Fortune 500 and running a fantastic network with all the right parts in all the right slots with the right budget. I write what I know: trying to do the nearly impossible with a virtually non-existent budget and usually a whole bunch of mismatched equipment that was purchased slowly, a piece at a time over the course of years.

Look at some of these comments in this thread. There’s a guy somewhere here who makes some radical assumptions like “you’re lucky ESXi installed on that third computer.” Talk about Commenter Disease! That “third computer” was a diskless spare box designed to be swapped in place of a failed ESXi box. He takes the fact that ESXi didn’t install on my PROTOTYPE WINDOWS FILESERVER and extends this logic to assume that I simply didn’t have known adaptable spares. This is a shining example of where you and I butt heads. When I am writing an article such as the one I just wrote, I am trying to convey a narrow slice of an infinitely complex puzzle and bodge the whole thing into 500 words. This comment alone is longer than I would be allowed to write my articles! You extrapolate an awful lot from what is available and make some very big (and largely incorrect) assumptions in doing so.

It would take me days to properly explain my environment to you. We quite simply don’t have the money to do things “by the book,” but that doesn’t mean we don’t have backups upon backups and dozens of layers of redundancy. Everything on this network is designed in such a way that it can pull double or triple duty if necessary. There are spares for all critical components and I even go so far as to ensure that my personal computers (and personal computers sold to family members/etc.) use standard-model parts. If the day ever comes that I have burned through all of my spares and absolutely need a replacement bit of kit on an emergency basis, I’ll know where to go to find it.

I am not saying I do everything perfectly, even taking my limited resources into account. Far from it! I have much yet to learn. In this exact case, I did something neat: I saved a failed RAID 5 by using a different-but-related RAID controller. I futzed with the servers for a few frustrating (but ultimately very fun) hours, and then I poked a transfer window two or three times over the course of a weekend. I didn’t have to go through the hassle of restoring anything from backups. Restoring from backups would have taken about the same amount of time but been far more work.

I learned something new and figured I would pass it along to whomever my musings might help. I did so knowing full well that the comments thread would be nothing but eleventeen squillion commenters with “let’s make a bunch of completely invalid assumptions and then lay into an author/commenter/random individual for making the mistakes we only assume they made” Commenter Disease. I can even forgive them that.

You though? A worldly management type with decades of technical experience should be beyond that by now; you should know the world is rarely as simple as it is presented in a 500 word bit of text.

Trevor_Pott Gold badge

@Diskcrash

Can I have your budget? It sounds large.

Trevor_Pott Gold badge

@Graeme Leggett

RAID 1 for small stuff. Disk capacities are enormous now, and RAID 1 is the quickest rebuild.

RAID 6 for anything you might have previously used RAID 5 for: it's the new "best compromise."

RAID 0+1 for Speed.

That would be my take for RAIDing on a budget…

Gawker rooted by anonymous hackers

Trevor_Pott Gold badge

From El Reg's kind and generous Interblag Guru:

> Being (un)able vote on your own posts should now be fixed.

>

> The multitude of other, more important things ... I'm still working on ;)

Hope that keeps folk happy and merry Christmas to all!

Google backs 'Chromoting' remote access for web-bound OS

Trevor_Pott Gold badge

All of a sudden...

...ChromeOS became exceptionally useful. I had applied for a ChromeOS netbook, but was really kind of 'meh' on getting one. All of a sudden, I am quite a bit more eager.

Come on, Google! Pick my name out of the hat!

Google delays Chrome OS, punts brandless beta netbook

Trevor_Pott Gold badge

I have applied.

Well I have applied. I am leery of the idea…but I am leery of all things new. As a sysadmin, I find the idea of drinking the internet through a browser straw interesting; can I provision all of my company’s services to my users as SAAS? Interesting exercise. What about my personal usage? I already do nearly everything from my Desire…but a lot of that is using the Wyse PocketCloud app to RDP into things. Can I use a browser-only device that can’t do RDP without going mad?

I don’t know. I’d love the chance to find out though! I imagine there aren’t exactly many of these to go around, especially as I am Canadian rather than American. Still, it was only a few minutes of my time to apply...why not, eh? Either way, I hope that someone from either El Reg or Ars Technica (or both) manages to get their paws on one. I’d love to see some in depth reviews from folk I trust to write about them who’ve actually put the things through their paces.

As negative as I generally am…this could be a game changer, or a flop…it’s still too early to tell how it will shake out. I guess it depends on how sexy all those NaCl add-ons to the Chrome browser really are…

Microsoft badmouths Google over fed contract win

Trevor_Pott Gold badge

@Michael C

Have you taken a chance to look at the documentation for Microsoft's formats? I have. I have even tried to write things that can parse them. As have people far smarter than I. The documentation is a mess. An ABSOLUTE MESS. Worse yet, it's incomplete!

The only thing that can properly talk to Microsoft Office file formats is Microsoft Office. This is true quite simply because nobody has enough documentation from Microsoft (bullsh** ISO standard or no) to actually reproduce the bloody things.

The “time saving business features” are needed by a very small fraction of its user base. Representatively, of the 1500 people in various organisations that I am responsible for, only three actually require Microsoft Office to get their jobs done versus the competition. (All three of them do things in Excel that competitors can’t do.) Everyone else is perfectly fine with Libre Office (formerly OOo) or even Google Apps!

That isn’t to promote those products; of the 1500 individuals there are only about 50 who have converted away from Office. The reason? FILE FORMATS. Whilst they are perfectly willing to use alternate programs, their customers aren’t willing to move towards an open standard such as ODF. This means having to stick with Microsoft Office “because that’s what everyone else uses.” Microsoft’s extant largesse ensures Microsoft’s continued largesse!

If you truly have bought into the steaming turd that is Microsoft’s “open standard office format,” and honestly believe that the documentation provided to the ISO was complete enough to create competing implementations then we have absolutely nothing more to talk about.

You and I can continue this conversation at that distant, rainbow-filled future time when Microsoft sits down at the table with the Open Source community, Google, IBM and all other competitors and champions a truly open format that encompasses all features from all parties at the table. It must have no licensing, no patent encumbrances and be documented so thoroughly that all parties at the table can read and write to this file format seamlessly. At the moment, the /ONLY/ contender is ODF…and ODF doesn’t quite cover all of the functions that the various platforms are capable of.

Until that shining day upon which Microsoft agrees to compete with people in the Office Productivity arena based upon features, experience, integration and Overall Better Designed Application, we simply will never be capable of agreeing on this topic.

That isn’t to say Microsoft Office isn’t a good product. It is a GREAT product. There are however many other adequate-to-very-good products out there. More importantly they don’t cost nearly as much. These products sadly cannot compete in the same arena not because of quality, but because Microsoft doesn’t play in the open file format playground.

Ta!

Trevor_Pott Gold badge

Not really...

...Office file-format lock-in is the lynchpin of the entire Microsoft Empire. If that were to fail, then people would not need Office. If they didn't need Office, where's the advantage in Sharepoint, Live Communications Server, Exchange or the rest of it? If they don't need any of that stuff...why does Windows need to be in play as anything other than the odd lone virtual machine supporting legacy applications?

Microsoft is perfectly aware that they absolutely cannot under any circumstances afford to lose market share in the Office productivity applications market. Far more importantly, they cannot afford to let an alternate file format become dominant...or their own stagnant for long enough that other applications become as good at writing to them as Microsoft's own.

The vitriol in that circumstance then is not amazing at all. It's perfectly expected.

As soon as Facebook started becoming a threat (I.E. relevant to internet advertising) you started to see the same thing from Google.

The human router

Trevor_Pott Gold badge

Fewer and fewer of us left though.

Every year there are fewer and fewer of us. Large corporations won't (generally) touch us with a twelve foot pole. The smaller end of the SMEs are jumping ship to cloud services. The mid-range companies are using consulting services with increasing frequency; consulting services largely comprised of groups of specialists.

That said however, those SMEs that outsource their services (cloud or otherwise) generally need a body or two to sort it all out. Organise the IT part of it while dealing with the sharp business end to make sure that the company isn’t being taken for a ride.

I think that as services like cloud computing catch on in larger enterprises, this trend will start to move up the chain. Interesting times…

Apple, Oracle air-kiss their way to OpenJDK deal for Mac OS X

Trevor_Pott Gold badge

@JonHendry

Yep. That is something glossed over by folks cheering the rah-rah-Apple, however. Good move for Apple (helps lock devs and users in by reducing cross platform compatibility.) Bad move for users (makes getting cross-platform apps via the App Store just that little bit harder and thusly reducing choice.)

Trevor_Pott Gold badge

@AnotherNetNarcissist

Just because Apple realized that its customer base actually desired and required Java and (eventually) worked out something with Oracle to provide it does not in any way mean that they had originally planned to do anything other than screw their customers from the start.

As to Oracle providing Mac support either out of the generosity of their cold black heart or even because they think there is much money to be made there; I remain unconvinced. Whilst I have no need or desire to enter into yet another tedious debate when you and I obviously have very different philosophical beliefs, I still maintain that Apple handled this entire incident exceedingly poorly.

According to my belief system the proper way to deal with these issues would have been for Apple to make a formal announcement at time of deprecation. This announcement should have included why they deprecated the technology and how they plan to ensure their extant customer base is looked after. If Apple didn’t have the Oracle deal in their pocket at the time, they should not have been ditching Java. If they did, it should have been formally announced at that time. No amount of argument, name calling or what-have-you will change my mind on that.

@ThomH: I think you are correct. Many people on the internet (including myself) have made up our minds about “companies like Apple, Google and Microsoft long ago.” Let me be absolutely, perfectly, crystal clear about how my mind is made up about these (and any other) companies: they exist to make money. They do not give a left-footed damn about their customers or userbase beyond what is necessary to keep them happy enough to continue buying product.

I am willing to give individual human beings the benefit of the doubt and believe that they are decent, compassionate and capable of both sympathy and empathy. I do not remotely believe the same thing of corporations. If a corporation (for whatever reason) wishes to come across as anything other than greedy, grasping and completely untrustworthy they have to prove it.

Unlike people, corporations do not get any benefit of the doubt from me. If that makes me a bad person in your eyes, I am sorry. I can and do trust individuals. Yet in my experience thus far with life “it’s not personal, it’s business” changes how individuals treat one another.

@edev: Apple was never banning Java from OSX. Apple was however ceasing to supply it. Apple also had not announced (until just now) any plans to ensure that it would be made available. Apple did not announce any plans to work with any other organization (be that Oracle or the open source community) to make their extant code available in order to ensure continuity.

To simply assume that “an official JVM would be forthcoming” in that environment is “stupid.” While it might be possible that it proved advantageous to Oracle to provide a JVM, until it was formally committed to, assuming it would arrive as a matter of course is daft. It is not obviously to Oracle’s advantage to do so – most especially if they didn’t have Apple’s co-operation in the matter. It is far safer to assume that you will never receive support or assistance of any kind from a corporation – it is only rarely that you are otherwise surprised.

It did not necessarily make business sense for Oracle to waste the resources on providing a JVM to Apple if they were forced to work on the project in a vacuum. While it certainly made business sense for Apple to work with Oracle on the matter…“making business sense” isn’t always a guaranteed driver for Apple.

That is not, just by the by, a slam. What might seem like “making obvious business sense” to us might still be far too short term thinking compared to Steve Jobs’s exceptionally good long term thinking. Finding a way to get Java off of Macs altogether holds at least as much promise in the long term as supporting it; it could be equally to Apple’s preference to work with Oracle to keep Java on the Macs or to murder it in the face.

Overall I am quite pleased that Oracle has made the decision to support Macs, but I remain firm in my beliefs that it was not a foregone conclusion. Despite the happy ending, I reiterate my belief that the entire incident was exceptionally poorly handled by Apple. Because of this, I will continue to recommend and work towards their replacement amongst my customer base. No matter what sort of name calling is employed by random posters on the Internet.

HP pays off investigators

Trevor_Pott Gold badge

I know, eh?

You can't make this **** up...

Project managers: fall-guys or heroes?

Trevor_Pott Gold badge

Project management is important.

I would also like it if someone applied some project management to getting the RDPCLIP bug in Windows actually fixed. Copying and pasting various things from point A to point B...I almost posted a very early article draft/idea scratchpad as a comment!

Joking aside; good article. I wish my bosses understood the importance of the role project management plays in the real world. More to the point; why it’s a bad idea to have your project manager and your technical lead be the same person. Both individuals are usually hugely under pressure to put out fires. Combining the jobs just gives you a very burnt out individual who will never have the time to put out both the technical and political fires that are all in the process of burning holes in their desk.