Intel's Atom C2000 chips are bricking products – and it's not just Cisco hit

I smell an oscar

Bang up detective and reporting work. All you need now is a script and a role for Mark Ruffalo

31
0

Re: I smell an oscar

Agreed, good work.

3
0
Silver badge
Thumb Up

Re: I smell an oscar

And another thumbs up for the article!

5
1
Silver badge

"A board level workaround exists" that involves us paying each other millions as bonuses for not paying attention to Quality Control. The rest of the world can just sit there wondering if their kit will boot tomorrow.

28
0
Anonymous Coward

:) I see what you did there. Boredroom level workaround, indeed!

I think when they say "stepping" they mean the next iteration of the chip. Which means new chips and boards all 'round, unless you are handy doing super-fine, surface mount soldering with your trusty micro-iron! The workaround sounds like strapping the line to another clock source, and Low Pin Count sounds like the processor is not fully enabled yet so just provide a clock to get us out of POST and hand over control to the boot loader and everything will be fine. Also, no way to fix with microcode. Very nasty problem given the 18 month wait period before it manifests!

Good on El Reg for putting the clues together!

13
0

The work around means adding some resistors to the design, which on most boards is not something you can just do, since this is a clock line. You can't just add wires and resistors since that would mess with the clock signal. So it is either change the board design to add the resistors, or wait for the next version of the chip (which will probably take months to happen). Of course since the chips are soldered to the board (not a socket), they are not easy to replace either.

2
0
Anonymous Coward

adding resistors

"The work around means adding some resistors to the design"

Apologies if I missed something obvious - but is this stated officially anywhere?

[It's entirely plausible, it's just one of the many failure modes possible when digital designers forget they live in an analogue world]

3
0

Uh oh

We own several Synology DS1815+ devices, each with about 24TB capacity and currently quite full of data. They use the Intel Atom C2538, which is listed as an SoC containing this fault. These are well over 12 months old and therefore approaching the 18 month danger zone.

This is obviously very concerning.

I hope Synology are ready to help us.

16
0
Silver badge

Re: Uh oh

pfSense also sell boxes based on C2358/C2758 processors.

2
0
Silver badge

Re: Uh oh

Looks like those have 3 year warranty so would be surprised if they didn't fix it..but maybe you have to wait for them to fail.

1
0
Gold badge
WTF?

"I hope Synology are ready to help us."

Wouldn't it be an idea to call them first?

7
0
Thumb Up

Re: Uh oh

Don't worry. Just put all your data in the cloud.

7
1

Re: Uh oh

Hello,

JP here from Synology UK. Unfortunately we didn't receive the contact here so we apologise that no one has responded. Please contact our technical support team via www.synology.com/ticket and they will be able to advise you on the best course of action.

Kind regards

JP

15
0
Anonymous Coward

Re: Uh oh

Hello,

AC here from Synology UK. We hold all our open ticket and customer details on our own award winning DS1815+ devices, so will get back to you in 18 months and 1 day.

Kind regards

AC.

7
3
FAIL

Re: Uh oh

Probably not a good idea to have faith in Synology. They are still promoting said products on their website:

https://www.synology.com/en-us/products/DS1815+#spec

And the resellers the Synology website sends you to are still reselling:

http://www.ncix.com/detail/synology-ds1815-diskstation-8-bay-diskless-71-103588.htm?promoid=1721

0
1
Silver badge
FAIL

Re: Uh oh

Double jeopardy for me... "If my 1815+ crashes, no worries, stuff is available elsewhere in verified backups. I can always build myself a U-NAS box as a replacement and I've got just the motherbo-oh crap, it's a C2758 !"

Seems 2016 was not a great year for hardware purchases for the home lab.

7
0
Silver badge
Mushroom

Re: Uh oh

Dear JP

Posting boilerplate messages to random threads doesn't inspire me with confidence in Synology's approach to this problem.

Right now, I've got a shiny DS1815+ that's humming along very nicely, but I'd really like to know if this thing has one of the affected processors so I can plan accordingly. I understand that RAID Is Not Backup, so losing access to my storage won't hurt me, but it will make my life less convenient.

This is an inherent fault in manufacture of a component, identified by the manufacturer of that component. I'd really like to know what Synology (and other manufacturers who are reading this discussion - CK) are going to do to rectify it.

If your attitude is "we'll wait for your kit to die, then we'll replace it" then your box is going back to where I bought it from - and I will make very sure that anyone who asks me for advice on buying a NAS knows that.

Toodle Pip,

CK

4
1

Re: Uh oh

I have a Synology 1815+ that we purchased about 10 months ago. Starting in December I started having random reboots but they weren't caught right away because notifications were not setup correctly and the unit boots so quickly. Towards the end of December I noticed the problem as the frequency had increased quite a bit. I did some searching around and it sounds like quite a few DS1815+ units have bad power supplies. I called up Synology and they confirmed that my issue was a bad power supply and swapped it out. The transfer to the replacement DS1815+ couldn't have been easier. My point with this long winded comment is that the forum post that is referenced in this article is most likely faulty power supplies which is concerning but not the same issue.

0
0
Bronze badge

This will show up the good vendors vs the bad ones.

Cisco are top of the good list, and so far the only entry. Any other takers? Or does everyone else think nobody will notice?

Personally sick of companies that take the sweep it under the carpet and hope nobody notices. Don't they realise that in the internet era nothing can really be covered up and mass product faults that the manufacturer hopes nobody will notice, that won't wash anymore... (Panasonic AllPlay, calling you out here...)

7
1

Re: This will show up the good vendors vs the bad ones.

> Personally sick of companies that take the sweep it under the carpet and hope nobody notices.

Recently had this with a camera. A Fuji randomly shutting down. Reading various forums suggested it was a common issue and that the lens (built-in) was the cause. After a struggle contacting Fuji UK, they said there was no problem with the camera or the lens, but to send it back. They replaced the lens, which resolved the issue. Which, of course, didn't exist.

Why do corporates not take the high ground of admitting clearly to a problem and then resolving it? There's far more to gain and much less to lose that way.

14
0
Silver badge

Re: This will show up the good vendors vs the bad ones.

Why do corporates not take the high ground of admitting clearly to a problem and then resolving it? There's far more to gain and much less to lose that way.

Why? Litigation society, that's why. It's usually believed that if a company admits a problem then they are admitting liability and opening themselves up to litigation. Which can/will be expensive. Best to err on the side of caution and to never admit to anything. Ever.

See also: "Dark ages" or "why nothing of great importance happened because much of Europe was concerned with pointless legal matters and why external input was required"

7
0
Silver badge
Facepalm

Oh no, not again

Back in the dawn of PC time, I worked on an 8086 machine designed before everyone was expected to copy IBM. The 8088 and 8086 used a special intel clock driver, the 8284. We had loads of problems with them not oscillating properly ..

8
2
Silver badge
Coat

Re: Oh no, not again

Perhaps someone should have introduced Intel to OpAmps, they ALWAYS oscillate, even when you don't want them to !

13
0
Silver badge

Crap support

The problem with this is they will simply start dying.. and my guess as many other commentards say is that most vendors will do a la la la, and ignore clients. As they are set to lose plenty of money.. unless intel compensates them.

What SHOULD be done is vendors sending new devices with the corrected processor, so the old ones are returned and either scrapped or refurbished.

This is quite bad news.. and potentially crippling for intel, not just for the money, but for the lack of confidence. Ppl might just feel more confident putting an nvidia SOC than an Intel one!

9
0
Silver badge

Re: Crap support

Actually the vendors might have a very strong civil suit against Chipzilla for delivering defective products. The vendors are caught in the middle as the ultimate miscreant, Chipzilla, is a direct supplier. So the customer harasses/sues the vendor, who in turn harasses/sues Chipzilla.

Note, do not scrimp on QA/QC because the few bucks you save up front will eventually come out of your hide with a very serious multiplication factor.

9
1
FAIL

Re: Crap support

You're right. Companies don't spend money on customer support or service any more; that cash is instead split as follows: 85% to the board, 14.5% to marketing and making the website pretty and 0.5% to an offshore team to run the customer twitter account ('can you be typing in your number of customer and bank account and identification of order with a quickness kind sir, and I or they or we will be back with you with a perfect answer in hours of plenty').

We will find out that most tech companies don't understand what they sell at all, and are just a change of logo, a website and a hefty dose of BS. It won't be an easy lesson.

11
1
Silver badge

Now would be a good time to buy ARM shares......damit.

15
0

Pity it's too late to buy ARM shares. It was bought by SoftBank last year.

1
1
Silver badge

>Pity it's too late to buy ARM shares. It was bought by SoftBank last year.

Hence the Damit. Do keep up at the back, no offence intended.

8
0
Anonymous Coward

I worked at NetApp when they encountered the PCI/NMI error, where a substandard adhesive caused controllers to throw up protection faults and panic. I have never seen so much effort go into Cover Up, Playing Down, Case Manage and Control Communications (inside as well as outside the organisation).

The Company went into full damage control mode, so concerned about reputation that the technical fault itself became a secondary issue. For NetApp only a few thousand systems were affected, yet they couldn't keep up with producing/refurbishing the number of fixed boards required. It took months to years to fix the last customers.

Now imagine intel with millions of C2000's and most of them on SoC's.

I can tell you this:

If you are a large customer with a large vendor (e.g. a large Cisco customer) you get fixed first. Cisco say they would prioritise systems by operational age, but that's BS. Customers get prioritised by the size of impact and potential for negative press. Therefore large telcos will come first. Cisco wants to avoid negative press at all costs. "ISP or Mobile Carrier went down due to faulty Cisco gear" would affect a lot of people and generate a lot of negative press.

If you are a small'ish vendor of C2000 systems -or - you are a customer of those systems - you are screwed!

That hot potato will stay in your hands until the large vendors and customers are fixed. Next comes the medium businesses and finally the guys at home with their Synology NAS' come last.

The reason you don't hear a thing from your vendor - is not because they're unaware of the issue - it's because they're developing strategies to minimise their costs. And sorry - they don't give a shit about you (the customer) and the fact that your gear (or business) may fail at any time.

29
0
Anonymous Coward

Been there done that.

As a vendor there is only so much you can do.. and doing a Samsung is going broke.

A BGA resolder properly done can go to 400$ a piece.. so it makes no sense to do it on synologys...and yet hey, there is your data.

We have a Synology as a single point of failure in our company, just for internal use and replication. While we do have a backup of it (well, 2 to be precise) it will be a nuisance to say the least.

1
0
Anonymous Coward

Completely agree about the cover-up

Just like the cases of flaming Ford Kuga's (check news in New Zealand and South Africa)

0
0
Anonymous Coward

Re: Completely agree about the cover-up

"Just like the cases of flaming Ford Kuga's (check news in New Zealand and South Africa)"

The problem only occurs there because the Kuga was never designed to run upside down.

8
0

Re: The problem only occurs there because the Kuga was never designed to run upside down.

And isn't reported in Australia because it gets blamed on bushfires instead. :-)

4
0

The only fix so far is to change your own board to add the workaround. New chips don't exist yet so no one is getting those until they exist. So everyone is at their own mercy about how long it takes to change the board design and get new boards made, or they can wait for the new chips and hope for the best in the mean time. Doesn't matter if you are Cisco or some tiny company. Of course I suspect Cisco might very well be able to get a new board revision design made a lot faster than the little guys.

2
0
Silver badge

Re: Been there done that.

"A BGA resolder properly done can go to 400$ a piece.. so it makes no sense to do it on synologys...and yet hey, there is your data."

So just swap the whole processor board.

0
0
Anonymous Coward

Re: So just swap the whole processor board.

I'm not familiar with the NAS boxes in question, but as well as swapping the processor board, wouldn't another option be to swap the hard drive(s) to a similar-enough NAS box that wasn't implicated in this affair?

The valuable-to-customers bit here is probably the data not the hardware, right?

Just askin' (apologies if it's a daft question).

2
0
Silver badge

@AC

Given the actual screwup is Chipzilla, the vendors in many cases do not have any real options until Chipzilla figures out how to fix their mess. Then Cisco can start fixing/replacing gear; they do not have any inventory of good chips. Right now there is no gear except for known defective gear to push out. Cisco has the luxury of nailing Chipzilla with a knockout punch and probably will go after them.

1
1
Silver badge

Re: So just swap the whole processor board.

"Just askin' (apologies if it's a daft question)."

Not a daft question. I'm not familiar with the product.

If the drives are nothing but data and the whole thing is driven by firmware on the processor board then it would be a tad difficult. It would depend on being able to find an alternate device with sufficiently similar firmware which would be entirely down to the software being generic. Without going off & researching that I've no idea whether it is or whether it's proprietary.

If the drives have an OS on them then it would depend on the OS including the right drivers. There's always a problem, even with general purpose OS's, of having support for newer or even older hardware.

Short answer, "similar-enough" might not exist.

0
0
Anonymous Coward

Re: Completely agree about the cover-up

"Just like the cases of flaming Ford Kuga's (check news in New Zealand and South Africa)"

That was the Voice Control System committing suicide after hearing the accent! :)

1
0
Anonymous Coward

I remember the NetApp PCI/NMI error. Internally they called it the PCI/Enema and everybody had a good laugh.

When facing the customer the sales guys pretended not to know anything about it. Actually not just sales, but the entire leadership team, all the way up.

9
0
Silver badge

So when do I short CSCO/INTC stock??

Given that this seems to happen after 18 months, one might want to calculate the time of first failure, and watch the stock go down. It could get interesting.

Of course, one wonders WHY the failure manifests itself after 18 months. Is there some flash component that gets used to determine elapsed time? We know the symptoms of the failure, but not the actual root cause (other than a bad chip design (DUH!)).

In any event, not an easy re-work. BGAs are almost impossible, Surface mounts can probably be done in the field, but I wouldn't. Time will tell how this is handled (good, bad, terrible).

Me? No, I don't own any INTC/CSCO stock.

2
0
Silver badge

Re: So when do I short CSCO/INTC stock??

Semiconductors of all types wear out over time, as the doping drifts - mostly due to thermal effects, so hotter parts fail faster.

Package pins are connected to the silicon by really tiny wires that can snap, eg under the stress of warming up or cooling down.

There's other failure modes such as insulation breakdown, overvoltages and many more.

It only takes a small miscalculation or manufacturing error to turn a chip with a theoretical 50-year MTBF into a chip with an 18-month MTBF.

It sounds like this failure may only matter at boot, if true then a device left running will keep going even after the failure - it just won't boot again.

It is a shame that Intel is saying nothing about the failure rate. Could be 1%, or even 90%. Given the lack of info, it's probably quite high.
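To put rough numbers on the MTBF comparison above, here is a minimal sketch. It assumes a constant failure rate (the exponential model), which is a simplification and probably not how a wear-out fault like this one behaves; the function name is made up for illustration, and Intel has published no failure-rate figures.

```python
import math


def fraction_failed(years, mtbf_years):
    """Expected fraction of units failed after `years`.

    Assumes a constant failure rate (exponential model), an
    illustrative simplification and not Intel's data.
    """
    return 1.0 - math.exp(-years / mtbf_years)


if __name__ == "__main__":
    # Compare the two MTBF figures mentioned above at the 18-month mark.
    for mtbf in (50.0, 1.5):
        share = fraction_failed(1.5, mtbf)
        print(f"MTBF {mtbf:>4} years: {share:.1%} failed after 18 months")
```

Under that assumption, a 50-year MTBF means roughly 3% of units have failed at 18 months, while an 18-month MTBF means roughly 63% have.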

8
1

Re: So when do I short CSCO/INTC stock??

Funny you should mention stock value - this is the actual title of an article published today: "Intel Is on a Roll After a Difficult Spell, So Buy the Stock Now"

Unfortunately, I cannot post the link, but here is a nice quote:

"...the quarter also solidified 2016 as a comeback year for the Silicon Valley company.

For years, Intel has tried to break into the mobile-phone business. Last year, it finally secured a deal with Apple to provide chips for the iPhone 7."

Quite funny in context, eh? :)

1
0
Anonymous Coward

Cheating Software ?

Perhaps intel's planned obsolescence team has made a mistake and set the thresholds too low?

This should be investigated. Could be the next VW.

4
0
Anonymous Coward

2017 is the new Millennium Bug.. !

0
0
Unhappy

My NAS build uses an ASRock c2750di and mysteriously stopped working several months back. I was blaming ASRock as there are a lot of complaints about that motherboard failing.

Would there be any way to find out if it's because of intel or if it's an unrelated fault?

I can't afford to have such an expensive board break again and each time I try to come up with a new build that can handle as many hdd's I get carried away and things get expensive... So that machine is still not replaced.

1
0
Unhappy

Aargh!

I have two potentially affected boxes:

iXsystems FreeNAS Mini

CPU: Intel(R) Atom(TM) CPU C2750 @ 2.40GHz (2400.06-MHz K8-class CPU)

Nothing yet in the FreeNAS forum.

Netgate pfSense SG-2220 firewall

CPU: Intel(R) Atom(TM) CPU C2338 @ 1.74GHz (1750.04-MHz K8-class CPU)

User comments and questions already present in pfSense forum. No response yet from Netgate.

Plus the FreeNAS Mini XL I have on order (8-(

Very annoying that this quite expensive kit should have such a problem. Thanks Intel. Some of us have not yet forgotten the Pentium FDIV saga.
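For anyone with a pile of boxes to go through, CPU strings like the two above can be checked with a few lines of Python. This is only a sketch: the function name is made up, and it assumes every Atom C2000 part is reported as "C2" plus three digits. A match only says the box has a C2000-family chip, not whether that particular unit will fail.

```python
import re

# Matches the model number in strings like
# "Intel(R) Atom(TM) CPU C2750 @ 2.40GHz".
# Assumption: C2000-family parts are named "C2" followed by three digits.
C2000_PATTERN = re.compile(r"Atom\(TM\)\s+CPU\s+(C2\d{3})\b")


def atom_c2000_model(cpu_string):
    """Return the C2xxx model number found in cpu_string, or None."""
    match = C2000_PATTERN.search(cpu_string)
    return match.group(1) if match else None


if __name__ == "__main__":
    for line in (
        "Intel(R) Atom(TM) CPU C2750 @ 2.40GHz",
        "Intel(R) Atom(TM) CPU C2338 @ 1.74GHz",
    ):
        print(line, "->", atom_c2000_model(line))
```

Feed it whatever your OS prints for the CPU model (the boot messages, in the case of the two boxes above).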

4
0
Silver badge
Facepalm

Rut Roh

Well... smeg.

3
0
Anonymous Coward

@ pfSense, SuperMicro, Synology & others

By this stage you should probably release some sort of official statement along the lines:

- we are aware of the issues with the C2000 CPU

- we are investigating whether any of our products are affected

- we are working with intel to determine whether our products are affected and how

- we will communicate the next steps with you (the customer) in a timely manner

So far I haven't heard anything from these vendors and this is making me and others very nervous.

I'm a customer of all of these vendors - the first vendor addressing the problem will keep me as a customer!

6
0


The Register - Independent news and views for the tech community. Part of Situation Publishing