Intel's Atom C2000 chips are bricking products – and it's not just Cisco hit

Intel's Atom C2000 processor family has a fault that effectively bricks devices, costing the company a significant amount of money to correct. But the semiconductor giant won't disclose precisely how many chips are affected nor which products are at risk. On its Q4 2016 earnings call earlier this month, chief financial officer …

  1. Andy Tunnah

    I smell an oscar

    Bang up detective and reporting work. All you need now is a script and a role for Mark Ruffalo

    1. AmenFromMars

      Re: I smell an oscar

      Agreed, good work.

    2. Mephistro
      Thumb Up

      Re: I smell an oscar

      And another thumbs up for the article!

  2. Your alien overlord - fear me

    "A board level workaround exists" that involves us paying each other millions as bonuses for not paying attention to Quality Control. The rest of the world can just sit there wondering if their kit will boot tomorrow.

    1. Anonymous Coward

      :) I see what you did there. Boredroom level workaround, indeed!

      I think when they say "stepping" they mean the next iteration of the chip. Which means new chips and boards all 'round, unless you are handy doing super-fine, surface mount soldering with your trusty micro-iron! The workaround sounds like strapping the line to another clock source, and Low Pin Count sounds like the processor is not fully enabled yet so just provide a clock to get us out of POST and hand over control to the boot loader and everything will be fine. Also, no way to fix with microcode. Very nasty problem given the 18 month wait period before it manifests!

      Good on El Reg for putting the clues together!

      1. Lennart Sorensen

        The work around means adding some resistors to the design, which on most boards is not something you can just do, since this is a clock line. You can't just add wires and resistors since that would mess with the clock signal. So it is either change the board design to add the resistors, or wait for the next version of the chip (which will probably take months to happen). Of course since the chips are soldered to the board (not a socket), they are not easy to replace either.

        1. Anonymous Coward

          adding resistors

          "The work around means adding some resistors to the design"

          Apologies if I missed something obvious - but is this stated officially anywhere?

          [It's entirely plausible, it's just one of the many failure modes possible when digital designers forget they live in an analogue world]

  3. Spotswood

    Uh oh

    We own several Synology DS1815+ devices, each with about 24TB capacity and currently quite full of data. They use the Intel Atom C2538, which is listed as an SoC containing this fault. These are well over 12 months old and therefore approaching the 18-month danger zone.

    This is obviously very concerning.

    I hope Synology are ready to help us.

    1. Chris King

      Re: Uh oh

      pfSense also sell boxes based on C2358/C2758 processors.

    2. Nate Amsden

      Re: Uh oh

      Looks like those have 3 year warranty so would be surprised if they didn't fix it..but maybe you have to wait for them to fail.

    3. John Smith 19 Gold badge
      WTF?

      "I hope Synology are ready to help us."

      Wouldn't it be an idea to call them first?

    4. Anonymous Coward
      Thumb Up

      Re: Uh oh

      Don't worry. Just put all your data in the cloud.

    5. Synology UK

      Re: Uh oh

      Hello,

      JP here from Synology UK. Unfortunately we didn't receive the contact here, so we apologise that no one has responded. Please contact our technical support team via www.synology.com/ticket and they will be able to advise you on the best course of action.

      Kind regards

      JP

      1. Anonymous Coward

        Re: Uh oh

        Hello,

        AC here from Synology UK. We hold all our open ticket and customer details on our own award winning DS1815+ devices, so will get back to you in 18 months and 1 day.

        Kind regards

        AC.

      2. Chris King
        Mushroom

        Re: Uh oh

        Dear JP

        Posting boilerplate messages to random threads doesn't inspire me with confidence in Synology's approach to this problem.

        Right now, I've got a shiny DS1815+ that's humming along very nicely, but I'd really like to know if this thing has one of the affected processors so I can plan accordingly. I understand that RAID Is Not Backup, so losing access to my storage won't hurt me, but it will make my life less convenient.

        This is an inherent fault in manufacture of a component, identified by the manufacturer of that component. I'd really like to know what Synology (and other manufacturers who are reading this discussion - CK) are going to do to rectify it.

        If your attitude is "we'll wait for your kit to die, then we'll replace it" then your box is going back to where I bought it from - and I will make very sure that anyone who asks me for advice on buying a NAS knows that.

        Toodle Pip,

        CK

    6. Chris 244
      FAIL

      Re: Uh oh

      Probably not a good idea to have faith in Synology. They are still promoting said products on their website:

      https://www.synology.com/en-us/products/DS1815+#spec

      And the resellers the Synology website sends you to are still reselling:

      http://www.ncix.com/detail/synology-ds1815-diskstation-8-bay-diskless-71-103588.htm?promoid=1721

    7. Chris King
      FAIL

      Re: Uh oh

      Double jeopardy for me... "If my 1815+ crashes, no worries, stuff is available elsewhere in verified backups. I can always build myself a U-NAS box as a replacement and I've got just the motherbo-oh crap, it's a C2758 !"

      Seems 2016 was not a great year for hardware purchases for the home lab.

    8. ryan_c

      Re: Uh oh

      I have a Synology 1815+ that we purchased about 10 months ago. In December I started having random reboots, but they weren't caught right away because notifications were not set up correctly and the unit boots so quickly. Towards the end of December I noticed the problem as the frequency had increased quite a bit. I did some searching around and it sounds like quite a few DS1815+ units have bad power supplies. I called up Synology and they confirmed that my issue was a bad power supply and swapped it out. The transfer to the replacement DS1815+ couldn't have been easier. My point with this long-winded comment is that the forum post referenced in this article most likely describes faulty power supplies, which is concerning but not the same issue.

  4. Planty Bronze badge

    This will show up the good vendors vs the bad ones.

    Cisco are top of the good list, and so far the only entry. Any other takers? Or does everyone else think nobody will notice?

    Personally sick of companies that take the sweep it under the carpet and hope nobody notices. Don't they realise that in the internet era nothing can really be covered up, and that hoping nobody will notice a mass product fault won't wash anymore... (Panasonic AllPlay, calling you out here...)

    1. Anonymous Coward

      Re: This will show up the good vendors vs the bad ones.

      > Personally sick of companies that take the sweep it under the carpet and hope nobody notices.

      Recently had this with a camera. A Fuji randomly shutting down. Reading various forums suggested it was a common issue and that the lens (built-in) was the cause. After a struggle contacting Fuji UK, they said there was no problem with the camera or the lens, but to send it back. They replaced the lens, which resolved the issue. Which, of course, didn't exist.

      Why do corporates not take the high ground of admitting clearly to a problem and then resolving it? There's far more to gain and much less to lose that way.

      1. Nick Ryan Silver badge

        Re: This will show up the good vendors vs the bad ones.

        > Why do corporates not take the high ground of admitting clearly to a problem and then resolving it? There's far more to gain and much less to lose that way.

        Why? Litigation society, that's why. It's usually believed that if a company admits a problem then they are admitting liability and opening themselves up to litigation. Which can/will be expensive. Best to err on the side of caution and to never admit to anything. Ever.

        See also: "Dark ages" or "why nothing of great importance happened because much of Europe was concerned with pointless legal matters and why external input was required"

        1. Auntie Dix
          Megaphone

          Re: This will show up the good vendors vs the bad ones.

          Doing the right thing does not get a company sued, typically, but the right thing means putting into a remediation chest BIG MONEY.

          CEO scumbags are playing games ("We won't name the perp!"), because the law won't prosecute this hiding of information. Their mansions are safe, while your Intel-crippled Synology box croaks and you lose your money, time, and data.

          Companies ("persons," under insane U.S. law) will lie, unless severe penalties are at the ready. Think of automobile recalls. Those CEO scumbags have (some) regulation over their slimy heads.

  5. Adrian 4
    Facepalm

    Oh no, not again

    Back in the dawn of PC time, I worked on an 8086 machine designed before everyone was expected to copy IBM. The 8088 and 8086 used a special intel clock driver, the 8284. We had loads of problems with them not oscillating properly ..

    1. Dwarf
      Coat

      Re: Oh no, not again

      Perhaps someone should have introduced Intel to OpAmps, they ALWAYS oscillate, even when you don't want them to !

  6. Aitor 1

    Crap support

    The problem with this is they will simply start dying.. and my guess as many other commentards say is that most vendors will do a la la la, and ignore clients. As they are set to lose plenty of money.. unless intel compensates them.

    What SHOULD be done is vendors sending new devices with the corrected processor, so the old ones are returned and either scrapped or refurbished.

    This is quite bad news.. and potentially crippling for intel, not just for the money, but for the lack of confidence. Ppl might just feel more confident putting an nvidia SOC than an Intel one!

    1. a_yank_lurker

      Re: Crap support

      Actually the vendors might have a very strong civil suit against Chipzilla for delivering defective products. The vendors are caught in the middle as the ultimate miscreant, Chipzilla, is a direct supplier. So the customer harasses/sues the vendor, who in turn harasses/sues Chipzilla.

      Note, do not scrimp on QA/QC because the few bucks you save up front will eventually come out of your hide with a very serious multiplication factor.

    2. Anonymous Coward
      FAIL

      Re: Crap support

      You're right. Companies don't spend money on customer support or service any more; that cash is instead split as follows: 85% to the board, 14.5% to marketing and making the website pretty and 0.5% to an offshore team to run the customer twitter account ('can you be typing in your number of customer and bank account and identification of order with a quickness kind sir, and I or they or we will be back with you with a perfect answer in hours of plenty').

      We will find out that most tech companies don't understand what they sell at all, and are just a change of logo, a website and a hefty dose of BS. It won't be an easy lesson.

  7. Anonymous Coward

    Now would be a good time to buy ARM shares......damit.

    1. Peter Danckwerts

      Pity it's too late to buy ARM shares. It was bought by SoftBank last year.

      1. Anonymous Coward

        >Pity it's too late to buy ARM shares. It was bought by SoftBank last year.

        Hence the Damit. Do keep up at the back, no offence intended.

  8. Anonymous Coward

    I worked at NetApp when they encountered the PCI/NMI error, where a substandard adhesive caused controllers to throw up protection faults and panic. I have never seen so much effort go into Cover Up, Playing Down, Case Manage and Control Communications (inside as well as outside the organisation).

    The Company went into full damage control mode, so concerned about reputation that the technical fault itself became a secondary issue. For NetApp only a few thousand systems were affected, yet they couldn't keep up with producing/refurbishing the number of fixed boards required. It took months to years to fix the last customers.

    Now imagine intel with millions of C2000's and most of them on SoC's.

    I can tell you this:

    If you are a large customer with a large vendor (e.g. a large Cisco customer) you get fixed first. Cisco say they would prioritise systems by operational age, but that's BS. Customers get prioritised by the size of impact and potential for negative press. Therefore large telcos will come first. Cisco wants to avoid negative press at all costs. "ISP or Mobile Carrier went down due to faulty Cisco gear" would affect a lot of people and generate a lot of negative press.

    If you are a small'ish vendor of C2000 systems -or - you are a customer of those systems - you are screwed!

    That hot potato will stay in your hands until the large vendors and customers are fixed. Next comes the medium businesses and finally the guys at home with their Synology NAS' come last.

    The reason you don't hear a thing from your vendor - is not because they're unaware of the issue - it's because they're developing strategies to minimise their costs. And sorry - they don't give a shit about you (the customer) and the fact that your gear (or business) may fail at any time.

    1. Anonymous Coward

      Been there done that.

      As a vendor there is only so much you can do.. and doing a samsung is going broke.

      A BGA resolder properly done can go to 400$ a piece.. so it makes no sense to do it on synologys...and yet hey, there is your data.

      We have a Synology as a single point of failure in our company, just for internal use and replication. While we do have a backup of it (well, 2 to be precise) it will be a nuisance to say the least.

      1. Doctor Syntax Silver badge

        Re: Been there done that.

        "A BGA resolder properly done can go to 400$ a piece.. so it makes no sense to do it on synologys...and yet hey, there is your data."

        So just swap the whole processor board.

        1. Anonymous Coward

          Re: So just swap the whole processor board.

          I'm not familiar with the NAS boxes in question, but as well as swapping the processor board, wouldn't another option be to swap the hard drive(s) to a similar-enough NAS box that wasn't implicated in this affair?

          The valuable-to-customers bit here is probably the data not the hardware, right?

          Just askin' (apologies if it's a daft question).

          1. Doctor Syntax Silver badge

            Re: So just swap the whole processor board.

            "Just askin' (apologies if it's a daft question)."

            Not a daft question. I'm not familiar with the product.

            If the drives are nothing but data and the whole thing is driven by firmware on the processor board then it would be a tad difficult. It would depend on being able to find an alternate device with sufficiently similar firmware which would be entirely down to the software being generic. Without going off & researching that I've no idea whether it is or whether it's proprietary.

            If the drives have an OS on them then it would depend on the OS including the right drivers. There's always a problem, even with general purpose OS's, of having support for newer or even older hardware.

            Short answer, "similar-enough" might not exist.

    2. Anonymous Coward

      Completely agree about the cover-up

      Just like the cases of flaming Ford Kuga's (check news in New Zealand and South Africa)

      1. Anonymous Coward

        Re: Completely agree about the cover-up

        "Just like the cases of flaming Ford Kuga's (check news in New Zealand and South Africa)"

        The problem only occurs there because the Kuga was never designed to run upside down.

        1. Paul Kinsler

          Re: The problem only occurs there because the Kuga was never designed to run upside down.

          And isn't reported in Australia because it gets blamed on bushfires instead. :-)

      2. Anonymous Coward

        Re: Completely agree about the cover-up

        "Just like the cases of flaming Ford Kuga's (check news in New Zealand and South Africa)"

        That was the Voice Control System committing suicide after hearing the accent! :)

    3. Lennart Sorensen

      The only fix so far is to change your own board to add the workaround. New chips don't exist yet so no one is getting those until they exist. So everyone is at their own mercy about how long it takes to change the board design and get new boards made, or they can wait for the new chips and hope for the best in the meantime. Doesn't matter if you are Cisco or some tiny company. Of course I suspect Cisco might very well be able to get a new board revision design made a lot faster than the little guys.

    4. a_yank_lurker

      @AC

      Given the actual screwup is Chipzilla, the vendors in many cases do not have any real options until Chipzilla figures out how to fix their mess. Then Cisco can start fixing/replacing gear; they do not have any inventory of good chips. Right now there is no gear except for known defective gear to push out. Cisco has the luxury of nailing Chipzilla with a knockout punch and probably will go after them.

  9. Anonymous Coward

    I remember the NetApp PCI/NMI error. Internally they called it the PCI/Enema and everybody had a good laugh.

    When facing the customer the sales guys pretended not to know anything about it. Actually not just sales, but the entire leadership team, all the way up.

  10. Herby

    So when do I short CSCO/INTC stock??

    Given that this seems to happen after 18 months, one might want to calculate the time of first failure, and watch the stock go down. It could get interesting.

    Of course, one wonders WHY the failure manifests itself after 18 months. Is there some flash component that gets used to determine elapsed time? We know the symptoms of the failure, but not the actual root cause (other than a bad chip design, DUH!).

    In any event, not an easy re-work. BGAs are almost impossible; surface mounts can probably be done in the field, but I wouldn't. Time will tell how this is handled (good, bad, terrible).

    Me? No, I don't own any INTC/CSCO stock.

    1. Richard 12 Silver badge

      Re: So when do I short CSCO/INTC stock??

      Semiconductors of all types wear out over time, as the doping drifts - mostly due to thermal effects, so hotter parts fail faster.

      Package pins are connected to the silicon by really tiny wires that can snap, eg under the stress of warming up or cooling down.

      There are other failure modes such as insulation breakdown, overvoltages and many more.

      It only takes a small miscalculation or manufacturing error to turn a chip with a theoretical 50-year MTBF into a chip with an 18-month MTBF.

      It sounds like this failure may only matter at boot; if true, then a device left running will keep going even after the failure - it just won't boot again.

      It is a shame that Intel is saying nothing about the failure rate. Could be 1%, or even 90%. Given the lack of info, it's probably quite high.

    2. bogd

      Re: So when do I short CSCO/INTC stock??

      Funny you should mention stock value - this is the actual title of an article published today: "Intel Is on a Roll After a Difficult Spell, So Buy the Stock Now"

      Unfortunately, I cannot post the link, but here is a nice quote:

      "...the quarter also solidified 2016 as a comeback year for the Silicon Valley company.

      For years, Intel has tried to break into the mobile-phone business. Last year, it finally secured a deal with Apple to provide chips for the iPhone 7."

      Quite funny in context, eh? :)
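
A rough sketch of the MTBF arithmetic in Richard 12's comment above, assuming a constant failure rate (an exponential lifetime model). That model is an assumption made here for illustration only: wear-out faults like the one discussed in this thread typically do not follow it, and Intel has published no failure rate. The function name `failure_probability` is hypothetical.

```python
import math


def failure_probability(t_years, mtbf_years):
    """Probability that a part has failed by time t.

    Assumes a constant failure rate (exponential lifetime model),
    which is an illustrative simplification, not Intel's data.
    """
    if mtbf_years <= 0:
        raise ValueError("mtbf_years must be positive")
    return 1.0 - math.exp(-t_years / mtbf_years)


if __name__ == "__main__":
    # Compare a theoretical 50-year MTBF with an 18-month (1.5-year) MTBF,
    # both evaluated at the 18-month mark mentioned in the thread.
    for mtbf in (50.0, 1.5):
        p = failure_probability(1.5, mtbf)
        print(f"MTBF {mtbf} years: {p:.1%} failed after 18 months")
```

Under that assumption, roughly 3% of parts with a 50-year MTBF would have failed by 18 months, against roughly 63% of parts with an 18-month MTBF.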

  11. Anonymous Coward

    Cheating Software ?

    Perhaps intel's planned obsolescence team has made a mistake and set the thresholds too low?

    This should be investigated. Could be the next VW.

  12. Anonymous Coward

    2017 is the new Millennium Bug.. !

  13. weekend
    Unhappy

    My NAS build uses an ASRock c2750di and mysteriously stopped working several months back. I was blaming ASRock as there are a lot of complaints about that motherboard failing.

    Would there be any way to find out if it's because of Intel or if it's an unrelated fault?

    I can't afford to have such an expensive board break again and each time I try to come up with a new build that can handle as many HDDs I get carried away and things get expensive... So that machine is still not replaced.

  14. abortnow
    Unhappy

    Aargh!

    I have two potentially affected boxes:

    iXsystems FreeNAS Mini

    CPU: Intel(R) Atom(TM) CPU C2750 @ 2.40GHz (2400.06-MHz K8-class CPU)

    Nothing yet in the FreeNAS forum.

    Netgate pfSense SG-2220 firewall

    CPU: Intel(R) Atom(TM) CPU C2338 @ 1.74GHz (1750.04-MHz K8-class CPU)

    User comments and questions already present in pfSense forum. No response yet from Netgate.

    Plus the FreeNAS Mini XL I have on order (8-(

    Very annoying that this quite expensive kit should have such a problem. Thanks Intel. Some of us have not yet forgotten the Pentium FDIV saga.
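
For readers asking whether a box contains a C2000-family part, here is a minimal sketch that checks a CPU brand string such as the ones quoted above. It assumes family members are named `C2` plus three digits, as in the models mentioned in this thread; the function name `c2000_model` is hypothetical. A match only indicates the family, not whether a given chip or stepping is actually affected.

```python
import re

# Assumption: Atom C2000-family model numbers look like "C2" + three digits
# (e.g. C2338, C2750, C2758 as mentioned in the thread).
MODEL_PATTERN = re.compile(r"\bC2\d{3}\b", re.IGNORECASE)


def c2000_model(brand_string):
    """Return the C2xxx model number found in an Atom brand string, else None."""
    if "atom" not in brand_string.lower():
        return None
    match = MODEL_PATTERN.search(brand_string)
    return match.group(0).upper() if match else None


if __name__ == "__main__":
    samples = [
        "Intel(R) Atom(TM) CPU C2750 @ 2.40GHz",
        "Intel(R) Atom(TM) CPU C2338 @ 1.74GHz",
        "Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz",
    ]
    for brand in samples:
        print(brand, "->", c2000_model(brand))
```

The brand string itself has to come from the device, for example from boot messages like the ones pasted above; how to obtain it depends on the operating system.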

  15. ecofeco Silver badge
    Facepalm

    Rut Roh

    Well... smeg.
