Blaming it on their F5s?
I call bullshit. I've seen F5s used for years without problems - the only place that seems to have consistent trouble with them is MSO.
A corrupted file in Microsoft's DNS services brought down its cloud across the world, the software giant has revealed. In a dramatic failure, Office 365 and Windows Live services including Hotmail and SkyDrive fell over for more than three hours earlier this month, causing further embarrassment for Redmond. No customer data …
...but that's what they use ; )
As often as they have trouble with them (these BPOS-S outages aren't the half of it - for a while we were averaging a load balancer issue of one sort or another every 2-3 months for our dedicated environment) the only reasonable explanation is operator error.
Most cloud professionals* would add the proviso: "...but probably best not *that* cloud..."
There are a few companies out there that seem to know their stuff; enviable uptime and reliability. That's never really been Microsoft's strong point, has it?
*Cue luddite hordes with their "cloud professional? Tautology, that!" cleverness...
...other than for backup.
Anyone's server can go down. Even your own. But you can do something about your own server. You don't want to be in the hands of someone not entirely interested in your company's profits, only their own liability clause.
Remind me again why I should trust a company with centralized control of my data ... Especially when that company spent decades trying to move control of the personal desktop from mainframe data centers to the personal computer?
No, thank you. I'll keep it in-house. For values of "in-house" that include a couple continents. Honestly, it's not all that hard to roll your own.
I can't tell you how many times 'load-balancing devices in the DNS service respond to a malformed input string' has given me problems with my desktop computer. What a relief I can depend on others to fix it now. Then again, when 'load-balancing devices in the DNS service respond to a malformed input string' on my network, it has never brought the entire MS online services product suite down across the world for everyone else. Go Cloud!
So what he is saying is that some idiot added crap to the configuration file and it got propagated across the network
by two "rare conditions" = Muppet didn't have a second person to eyeball his/her handiwork before committing the change to the configuration file, and in addition they altered the configuration file directly rather than copying the one from the test machine.
Really, this is IT 101....
Well, it's partially human... The so-called file should have a parser to catch errors before they are sent out to various other servers, no?
That should eliminate any human issue. Now, whether there is a check at each server for validation issues is another question. It's called check and recheck and then do a checksum.
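For what it's worth, here is a rough sketch of that "check, recheck, then checksum" idea. Everything in it is an assumption for illustration: the `key = value` file format, the function names (`validate_config`, `package_for_push`, `accept_on_server`), and the choice of SHA-256 are all made up here and have nothing to do with whatever Microsoft or F5 actually run.

```python
import hashlib


def validate_config(text):
    """Return a list of error strings; an empty list means the file parsed cleanly.

    Assumed (hypothetical) format: one 'key = value' pair per line,
    blank lines and lines starting with '#' are ignored.
    """
    errors = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        key, sep, value = stripped.partition("=")
        if not sep or not key.strip() or not value.strip():
            errors.append(f"line {lineno}: expected 'key = value', got {stripped!r}")
    return errors


def package_for_push(text):
    """Check: refuse to ship a file that does not parse, then attach a checksum."""
    errors = validate_config(text)
    if errors:
        raise ValueError("; ".join(errors))
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, digest


def accept_on_server(text, expected_digest):
    """Recheck: each receiving server verifies the checksum and re-validates."""
    actual = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if actual != expected_digest:
        return False  # file was altered or corrupted in transit
    return not validate_config(text)
```

The point of doing it twice is that the sender's check catches the fat-fingered edit, while the per-server check catches a file that got mangled somewhere between the two.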
... such overlooked little services in a basket.
They did use to run their four DNS servers in the same subnet, didn't they? Oh and they got their all-important everything-depends-on-this SSO domain suspended for non-payment, too. Why companies feel they need to sprawl across dozens of domains, all interdependent, is a little beyond me. But maybe reasons why or why not are just a little beyond them. They're certainly not the only tech giants to bugger this one up regularly. As self-proclaimed world improvers employing supposedly the world's finest tech heads and with plenty of resources to fix it all up neat and tidy, their antics do seem a bit pathetic, however.
...IT Service Management consultant on contract at MS a year or so ago. I quickly realized that their infrastructure management skill levels and practices were abysmal. I told them what they needed to do and got out as soon as decently possible - I didn't want to be associated with such a crowd of no-hopers. It seems nothing has changed.
"A tool that helps balance network traffic was being updated and the update did not work correctly...
Taste your own medicine, MS. So now you know just how bloody frustrating it is when your updates don't work correctly.
& did the "helpline" assistant go "ooh, I think you'll have to buy another license for that"?
Huh....epic fail, to be sure, but I have a Hotmail account (foisted upon me against my will by a higher educational institute which shall soon give me a fancy piece of paper that I'll put in a frame and reference on a resume but otherwise never think of again) and never noticed the outage. Then again, I only reluctantly use that account.
...but computers are excellent amplifiers. They wouldn't be the first outfit to fall victim to a self-inflicted DDoS. I think there must be an axiom about resilient systems in here somewhere.
While the number of single points of failure (SPF) is inversely proportional to the number of redundant features, SPF can only approach (but never reach) a lower limit of 1.