"It is ideal for telecom providers, financial trading exchanges, media and gaming companies that require speed, reliability and reach."
Hmm. Note how they're careful not to include "power" in that list.
BT customers in the UK are once again banging their heads against their keyboards this morning: a power outage has thrown them offline for the second day running. Today the issue is a power outage at Telehouse North in London. An email message from BT Wholesale, with the subject line 'Major Service Interruption' – seen by The …
Having recently had a 5-day outage at the bods who do some of my hosting (how often do both drives in a RAID setup fail at the same time? And with a dodgy backup?) and spent time placating customers for a cock-up by a third party, I feel a teensy-weensy bit of sympathy for BT when they are affected by a supplier's kit going titsup. (We're currently migrating all our customers to someone more reliable anyway - they really should have been more apologetic!)
BUT - BT is not the same as a small one-man-and-a-cat rural Welsh IT company specialising in bilingual web development. They're several times bigger, and should surely be able to afford truly redundant kit that just keeps working, although I will allow them a hiccup or two in the event of meteorite strikes or Brexit.
@pen-y-gors
I think you've got the wrong end of the stick here. It's not BT's own infrastructure in Telehouse North (despite the confusing name); it's part of the London Internet Exchange, which provides routing and connectivity to all the Tier 2 providers. So BT are more in your position at the moment: sitting waiting for their suppliers to sort their shit out.
I don't believe this problem has anything to do with LINX. THN is a massive building and LINX have space there but the room currently affected by the power problems is not the LINX suite.
Well, we're having big problems with latency and packet loss from a Plusnet fibre link to some of our servers at Firehosts/Armor, and most of the issues seem to be at the Telia nodes in London:
ldn-b3-link.telia.net
ldn-bb2-link.telia.net
ldn-b5-link.telia.net
By comparison, we aren't seeing any issues with these hosts on the same routing:
peer2-et-9-1-0.redbus.ukcore.bt.net
peer5-te0-0-0-31.telehouse.ukcore.bt.net
So maybe this is a routing issue from BT onwards?
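For anyone wanting to do the same comparison on their own traceroute, here is a rough sketch that groups hop hostnames by operator. It only uses the hostnames quoted above; the helper names `operator_of` and `group_by_operator` are made up for illustration, and taking the last two DNS labels is a crude assumption (it would get something like example.co.uk wrong).

```python
from collections import defaultdict


def operator_of(hostname: str) -> str:
    """Return the last two DNS labels, e.g. 'telia.net'.

    A crude stand-in for "who runs this hop"; assumes a two-label
    registered domain, which is not true for suffixes such as co.uk.
    """
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def group_by_operator(hostnames):
    """Group hop hostnames under their (guessed) operator domain."""
    groups = defaultdict(list)
    for name in hostnames:
        groups[operator_of(name)].append(name)
    return dict(groups)


# Hops reported as showing latency and packet loss
bad_hops = [
    "ldn-b3-link.telia.net",
    "ldn-bb2-link.telia.net",
    "ldn-b5-link.telia.net",
]
# Hops reported as fine on the same route
good_hops = [
    "peer2-et-9-1-0.redbus.ukcore.bt.net",
    "peer5-te0-0-0-31.telehouse.ukcore.bt.net",
]

print("problem hops:", sorted(group_by_operator(bad_hops)))
print("healthy hops:", sorted(group_by_operator(good_hops)))
```

If every problem hop lands under one operator and every healthy hop under another, that at least tells you whose NOC to ring first, though loss reported at an intermediate hop can also just be that router deprioritising ICMP.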
I'm sorry, you're incorrect. Today's outage directly affected BT's suites, so their infrastructure lost power. It did not affect the LINX suites which are located elsewhere in the building.
However they are waiting for their supplier - Telehouse needs to restore power to the affected areas before BT can bring their kit back up.
> (how often do both drives in a RAID setup fail at the same time? And with a dodgy backup?)
If two drives are from the same batch, and have spent their entire lives in the same RAID setup where they have experienced very similar duty cycles, thermal conditions, vibration, etc., then it's not entirely unlikely that they will have about the same lifetime.
If one drive in a RAID array fails, the other(s) will experience atypically high use while the array is rebuilt, and the chances of a second drive failing during the rebuild are not insignificant. It does happen.
The dodgy backup is another matter, though. There should always be more than one backup.
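To put a rough number on the rebuild risk, here is a back-of-envelope sketch. It assumes a constant failure rate derived from an annualised failure rate (AFR), with a hypothetical `stress_factor` multiplier standing in for the extra load during the rebuild. None of the figures are measured, and a constant-rate model ignores the same-batch wear-out correlation described above, so treat the result as optimistic.

```python
import math

HOURS_PER_YEAR = 8760


def p_second_failure(afr, rebuild_hours, surviving_drives=1, stress_factor=1.0):
    """Probability that at least one surviving drive fails during the rebuild.

    afr              -- annualised failure rate per drive, e.g. 0.05 (assumed)
    rebuild_hours    -- how long the rebuild takes
    surviving_drives -- drives whose loss would kill the array
    stress_factor    -- hypothetical multiplier on the failure rate under rebuild load
    """
    hourly_rate = -math.log(1.0 - afr) / HOURS_PER_YEAR
    exposure = surviving_drives * hourly_rate * stress_factor * rebuild_hours
    return 1.0 - math.exp(-exposure)


# Illustrative inputs only: 5% AFR, 24-hour rebuild, one surviving mirror partner
relaxed = p_second_failure(0.05, 24)
stressed = p_second_failure(0.05, 24, stress_factor=10.0)
print(f"normal load:  {relaxed:.4%}")
print(f"10x stressed: {stressed:.4%}")
```

Small per rebuild, but multiply it by every array a hosting outfit runs over several years and "it does happen" looks about right.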
I've always wondered why it isn't standard business practice to pull half the drives out of a RAID 10 array and move them over to a fresh server alongside brand new drives. Then half your drives (make sure each is one half of a mirrored pair) have a different remaining MTBF from the others, which reduces the chances of a dual failure.
Doing this could allow you to use cheap low endurance SSDs rather than high end drives (HP Drives High Endurance 800GB SSD = £2700 ex VAT each!) with less overall risk even if it does require more swapping over the years. If you have spare room in the chassis to keep a hot spare then so much the better.
Replacing a mirrored SSD doesn't affect performance very much in testing. Losing a RAID system and having to restore from backup can annoy users a lot more. If you have a redundant system then you can switch your VMs over to your other systems while you run the disk replacement.
In any case you are only looking to switch the drives out every 18~30 months (depending on MTBF/2). If it allowed you to buy SSDs whereas previously you would have had Hard Drives instead then the speed increase will be major anyway and the volume during a RAID Mirror rebuild would be faster than the normal operation of non-SSDs in a VM environment anyway.
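A rough way to see why staggering the ages helps: if drives wear out (failure risk climbs with age), then at the moment an old drive dies, a young mirror partner is much less likely to follow it than a partner of the same age. The sketch below assumes a Weibull lifetime, which is my modelling choice and not something from any vendor datasheet; the shape, scale and ages are all hypothetical.

```python
import math


def p_fail_within(age_months, window_months, shape, scale_months):
    """P(drive fails within the next `window_months`, given it survived to `age_months`).

    Assumes a Weibull lifetime with survival S(t) = exp(-(t / scale) ** shape).
    shape > 1 means wear-out: the older the drive, the higher the risk.
    """
    hazard_now = (age_months / scale_months) ** shape
    hazard_later = ((age_months + window_months) / scale_months) ** shape
    return 1.0 - math.exp(-(hazard_later - hazard_now))


# Hypothetical parameters: wear-out shape 3, characteristic life 60 months
SHAPE, SCALE = 3.0, 60.0

# One drive of a mirrored pair dies at 36 months; how likely is the partner
# to die in the following month (rebuild plus replacement lead time)?
same_age_partner = p_fail_within(36, 1, SHAPE, SCALE)   # never staggered
staggered_partner = p_fail_within(6, 1, SHAPE, SCALE)   # swapped in 6 months ago

print(f"partner also 36 months old: {same_age_partner:.3%}")
print(f"partner only 6 months old:  {staggered_partner:.3%}")
```

With shape 1.0 (no wear-out) the two figures come out the same, so the whole argument rests on drives actually ageing, which is the premise of the staggering idea anyway.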
Back when I was a data recovery engineer we saw drives from the same batch failing quite a few times. We also saw failed controllers since most RAID boxes only have one controller. But the most common problem was user error due to one of the following:
* Ignoring first drive failure and continuing to run the system until another went. A RAID is not a form of backup. It's a 'get me home' solution.
* Poorly written software that didn't make it clear how to rebuild an array and didn't have adequate safeguards to protect the array integrity when adding a new drive.
* Incompetence.
Unfortunately the majority are only happy if they're paying a fiver a month or something silly for Internet. You'll hear them at the water cooler saying how they only pay peanuts, and if one of the big providers doesn't do it cheap enough they'll move elsewhere.
Unfortunately, all these fivers a month don't add up to a resilient system. It's not enough for every link in the chain to be as reliable as it could be - yes the companies involved make some profit, and yes, probably a lot of it - but that's what businesses exist for.
They can't make profit AND provide 100% resilience on a shoestring. Something somewhere will suffer. So next time you hear someone boasting they pay a fiver for Internet/calls/Sky, and the next day whinging it was off for a few hours, remind them that fibre across oceans and satellites in space don't come cheap. It's surprising it works at all.
"Unfortunately, all these fivers a month don't add up to a resilient system. It's not enough for every link in the chain to be as reliable as it could be..."
And yet we pay £30,000 per year to BT for a single** BTNet line and are having the same issues. So how much should we be paying to ensure that we have some resilience in the network?
**Before you comment, there is no option for a second line without £100,000+ in groundworks, and it would still have been affected by this problem.
Ahh, because you're paying £30k, you expect that'll cover resilience everywhere. Nope - you're just subsidising everyone paying their fivers a month.
Unfortunately even your £30k/y doesn't buy resilience in a national network - it would barely cover the wage of one "engineer" who climbs poles all day! It wouldn't come close to running you your own separate little bit of pipe all the way back to a data centre and beyond, so you have to partially share infrastructure with the masses.
Like it or not - the market has spoken - and it decided it wanted cheap internet that sometimes goes off now and again. And if TalkTalk don't supply it, they'll go to Sky. And if they put their prices up 50p, they'll move on again. Businesses having no option on what they'll pay have to delve deep in their pockets to keep the whole affair ticking over.
You might not like that, but that's the way it is - the proof in the pudding is your Internet being off two days on the trot. If you don't like that for the money you're paying, don't pay it... and encourage all the cheapskates who're the root cause of the problem to pay a bit more for their Internet!
"Like it or not - the market has spoken - and it decided it wanted cheap internet that sometimes goes off now and again".
In very much the same way as, 30 years ago, the market spoke and decided that it wanted cheap software that sometimes has to be rebooted and that has little integrity and no security. That's what made Microsoft and Bill Gates the world leaders they are now.
Sounds like the person paying £30k/annum has a big internet pipe or is based somewhere remote and expensive. Do you host servers at the site or just need it for internet and possibly voice comms?
Have you considered a mobile (4G if available) or bonded multiple DSL service as a backup? These don't have to go via BT's core even if delivered locally by BT copper lines for the DSL.
You can't have looked hard - One of the suppliers was mentioned in the post - TalkTalk.
Their website currently advertises FREE broadband for 18 months (after 17.70 line rental). Pretty sure my line rental with KCOM is about the £14 mark, so even if TT's line rental is slightly above average, it's still about a fiver for the broadband.
Their retention offers are better again - my grandfather pays peanuts for all his TalkTalk goodies.
The FREE broadband is interesting though, as to my knowledge BT/Telehouse etc are unlikely to offer all they do for free, so something somewhere is being skimped on. (It's kind of how eBay encourage you to post your items for free, but I've not yet found a Post Office that'll do that for me). There are these offers all over the place, then we're sat here asking where the redundancy is, LOL!
Telehouse is expensive - plus it has the best infrastructure of any. I worked there for three years plus.
I assume a single trip means just one of their UPS power cabinets dropped off. We can only guess the cause at this time. A spot of load redistribution and a reset is a very quick fix. The knock on of restarting an entire network is rather more lengthy.
> BT is not the same as a small One-man-and-a-cat rural welsh IT company
Indeed.
Us small fish just have to accept what conditions are on offer. BT are big enough that they can dictate to suppliers how things are done. It may be a supplier's problem, but BT can't hide behind that because either they've audited the setup and were happy with it (oops), or they didn't audit it in which case they can't be said to have done due diligence (oops).
Either way, from a PR PoV it's BT's name in the headlines.
Which is why I went with fast.co.uk (Dark Group) back in 2007. I have found them eminently satisfactory ever since. But I was among those without any Internet connection this morning, because unfortunately fast.co.uk were dependent on the kit up in Telehouse North. Even their phones weren't working!
"Having recently had a 5-day outage at the bods who do some of my hosting (how often do both drives in a RAID setup fail at the same time?"
When they're from the same batch, same make, same model, same usage pattern (RAID-1 would suggest exactly that by design)...
...quite often actually. One of the (many) reasons we mix drive models/batches/makes in RAID arrays
> ...the recent Government report ...
No, it's because the Snoopers Charter is back in the works and the black boxes are being installed. Every major outage happens around the same time as the charter reappears, and always in a different place because obviously you don't install the things in the same place twice.
Check the records, you'll see I'm right!
Black helicopter as there's no point being AC any more because They know, They always know, as though They were right in here with me in my secret location under the stairs.
Oh no, They are back! How do They keep finding me?!?!?
Curiously hot on the heels of a recent mysteriously kept-under-wraps issue at Plusnet, where despite their insistence they don't intercept or anything of the ilk, connections to third-party SMTP servers were timing out – but only when messages had an attachment, no matter how tiny. Very, very odd, indeed pretty much inexplicable without foul play involved, especially with supposedly encrypted connections. They tried to blame it on some other ongoing DNS issues, but DNS doesn't care whether email has attachments or not...
@ Pen-y-gors
"how often do both drives in a RAID setup fail at the same time? And with a dodgy backup?"
If it's that important then you need to pay more for a more resilient solution, perhaps an active/standby cluster across geographically separated data centres provided by different vendors, on different ISPs and supplied by different power stations. What's the cost to you of the outage vs the cost of a resilient solution?
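That last question is just arithmetic once you are willing to guess the inputs. A minimal sketch, with made-up function names and purely hypothetical figures (nobody in this thread has posted their real numbers):

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss from downtime (lost business, placating customers, etc.)."""
    return outages_per_year * hours_per_outage * cost_per_hour


def resilience_pays_off(current_loss, residual_loss, extra_annual_spend):
    """True if the downtime cost you avoid exceeds what the resilient setup costs per year."""
    return (current_loss - residual_loss) > extra_annual_spend


# Hypothetical figures: one 5-day (120-hour) outage every other year at 50/hour,
# versus a standby setup that cuts each outage to 1 hour but costs 2000 a year.
current = expected_annual_outage_cost(0.5, 120, 50)
residual = expected_annual_outage_cost(0.5, 1, 50)
print("loss without standby:", current)
print("loss with standby:   ", residual)
print("worth paying for?    ", resilience_pays_off(current, residual, 2000))
```

The hard part is not the sum but being honest about `cost_per_hour`, which should include the customers who quietly leave afterwards.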
Regarding BT: yes, this latest outage should not have had the impact on its network that it has. Services should have instantly rerouted with little to no impact on end users.
Both 'outages' have been the loss of part of their diverse network infrastructure.
They have highlighted that while their network is diverse, there is not enough bandwidth when one major location goes down to handle all their peering traffic.
That is down to investment in redundancy. They took the gamble and lost - twice in two days!
Even if they were to compensate customers, it would probably cost less than having the redundant bandwidth so their FD and shareholders will still be happy!
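The "diverse but not enough bandwidth" point boils down to a simple check: take each site away in turn and see whether what's left can still carry peak traffic. A sketch with invented site names and capacities (these are not BT's real figures or topology, which I don't know):

```python
def survives_single_site_loss(site_capacity_gbps, peak_traffic_gbps):
    """For each site, report whether the remaining sites can carry peak traffic if it goes dark.

    site_capacity_gbps -- mapping of site name to usable peering capacity (hypothetical units)
    peak_traffic_gbps  -- peak load that must still be carried
    """
    total = sum(site_capacity_gbps.values())
    return {
        site: (total - capacity) >= peak_traffic_gbps
        for site, capacity in site_capacity_gbps.items()
    }


# Invented example: three peering sites, peak load of 150
sites = {"site_a": 100, "site_b": 100, "site_c": 40}
result = survives_single_site_loss(sites, peak_traffic_gbps=150)

for site, ok in result.items():
    print(f"lose {site}: {'traffic still fits' if ok else 'NOT enough capacity left'}")
```

In this made-up case the network is diverse on paper, yet losing either big site leaves it short, which is the gamble being described.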
> That is down to investment in redundancy. They took the gamble and lost - twice in two days!
Quite so. However the alternative is another gamble; spend more on additional infrastructure and increase the prices the users have to pay so that there is actually a ROI rather than a bit of a black hole in the accounts. And then the customers would be griping about higher charges and perhaps looking elsewhere for their service.
Well it's all relative. Given how many internet service customers BT has, a few million is "a small number", in fact in Internet terms, the loss of service by 60m users in one country would also be an outage that inconvenienced "a small number of customers"...