HPE supercomputer is still crunching numbers in space after 340 days

HPE’s mini supercomputer launched into space last year has survived the harsh conditions of zero gravity and radiation for almost a year. The Spaceborne Computer isn’t the greatest supercomputer and has a performance of one teraflop, runs on Red Hat Enterprise Linux and is built out of two HPE Apollo Intel x86 servers with a …

  1. Pascal Monett Silver badge

    "SSDs fail at an alarming rate in space"

    They fail at an alarming rate on Earth as well. In the last five years, I've known of two people whose SSD just crashed and died. I would have to track the last fifteen years to find someone whose HDD died without a warning.

    Okay, maybe that's not so alarming after all, but still.

    1. Anonymous South African Coward Bronze badge

      Re: "SSDs fail at an alarming rate in space"

      Roll on the SSD that can last for ages.

    2. Lee D Silver badge

      Re: "SSDs fail at an alarming rate in space"

      Counterpoint:

      Since replacing most desktops' WD Blues with the cheapest-shite Crucial 128GB SSDs, I've not had a single drive failure over 200+ machines in over 2 years, compared to several a year.

      If you compare versus Seagate, including server-grade SAS drives, I literally got a failure a week on those after 6 months in deployment.

      Your (or my) anecdotal evidence means nothing compared to someone like that cloud-storage firm who publish annual failure numbers across millions of drives.

      I can name 4 private individuals whose hard drives crashed unrecoverably in the last year. I can't name one SSD anyway - in fact I've never seen an SSD fail, and I have a Samsung 850 EVO in my laptop for... 4 years?

      P.S. All the SSDs I use do not experience any special treatment. I don't change a single software option (they were seen as a sacrificial in-production test where easy replacements - the original hard drives - were to hand any time I need them), no special write-caching, no disabling of swap, nothing... just a straight clone of the existing (sometimes years-old) image of Windows.

      I'm not saying they're infallible. But in real-world, heavy user use, and worst-case configurations, where I expect them to fail... not one has so far.

      1. Alan Brown Silver badge

        Re: "SSDs fail at an alarming rate in space"

        "Since replacing most desktops WD Blues with the cheapest-shite Crucial 128Gb SSDs, I've not had a single drive failure over 200+ machines in over 2 years, compared to several a year."

        My results are similar. I had an early-generation PATA SSD fail and the only other failures I've had have been wearout ones that were well-telegraphed. (Yes, it is possible to beat an SSD to death.)

        That said, a lot of early SSDs were heavily read-optimised and didn't last long in service, as are a lot of SATA-DOMs (and are explicitly sold as such - they're intended for write-once read-mostly type work in embedded systems and r/w optimised ones are more expensive)

      2. ForthIsNotDead

        Re: "SSDs fail at an alarming rate in space"

        My Crucial 128GB SSD failed after 6 years of constant, daily use. And even then, it failed in such a way that I could still transfer all my data off it. It just started running very slowly (very slowly) after about an hour of use. I think it was temperature related. So, I lost no data, and simply replaced it with another Crucial drive.

        Money well spent.

      3. TheVogon

        Re: "SSDs fail at an alarming rate in space"

        Several hundred Samsung 840 Pros deployed here in desktops. Zero failures over several years.

        1. Captain Scarlet Silver badge

          Re: "SSDs fail at an alarming rate in space"

          Had SSD "failures" here, except if you slap power on them and leave them for a few hours almost all of them start working again.

          1. Alan Brown Silver badge

            Re: "SSDs fail at an alarming rate in space"

            > Had SSD "failures" here, except if you slap power on them and leave them for a few hours almost all of them start working again.

            I've had the opposite in some 24*7 powered SSDs - if you leave them powered off for a few days they start working again. (Which of course is what happens when they're RMAed. Thankfully Crucial were good about it when I showed the failure mode. They'd never tested over long periods before)

      4. stiine Silver badge

        Re: "SSDs fail at an alarming rate in space"

        I'm sorry to say I have had an SSD fail, but it was not an Intel or Samsung or actually any name I'd seen before....but compared to a 7200rpm drive, it was lightning fast...while it lasted.

        I have to say that the three more recent SSD drives I've bought were 1) not the same brand, and 2) are so close to the same lightning speed that I can't tell which is faster or slower. I used to be able to go in and pour a drink while waiting for Windows to boot, but now, I don't even have time to light a cigarette.

        edit: spelling

        1. JLV

          Re: "SSDs fail at an alarming rate in space"

          >I don't even have time to light a cigarette.

          Yeah, but with all the anti-smoking regs nowadays, a cig is quite the production (cf the IT Crowd. No, I'm not a smoker, but the hoops nowadays...)

          I wonder if SSD wear isn't partly accelerated by constant swap writes eating up their write cycle allowances. Updated my laptop from 8 to 16 GB RAM for just that reason after my Samsung 850 died without warning after 2 years. Also made a very small (200MB) Ram disk to write transient files like test logs, faster too.

    3. Rabbit80

      Re: "SSDs fail at an alarming rate in space"

      I'm currently restoring backups of an old archive after a HDD failure.. not had an SSD go down on me yet and have been fitting them exclusively for the past 3 years now. We used to burn through around 2 HDDs / month!

      1. juice

        Re: "SSDs fail at an alarming rate in space"

        I think the point the original poster was trying to make is that it's pretty much all or nothing with SSDs - when they fail, it's generally without warning and there's little or no hope of recovery. Conversely, you usually get some warning with HDDs - SMART alerts, bad-sector alerts during scans, physical noises, etc.

        (though from personal experience, SMART monitoring has been spectacularly useless when it comes to flagging up potential issues!)

        1. Lee D Silver badge

          Re: "SSDs fail at an alarming rate in space"

          Nonsense.

          Hard drives, even with the best SMART monitoring in the world, fail unpredictably a large portion of the time. Any large hard drive survey will show you that.

          And sometimes they fail so quickly even WITH SMART monitoring that you don't stand a chance of being able to do anything about it.

          Reporting bad-sectors may be a symptom of imminent failure, but only so far as coughing up your lungs is a symptom of death. There are many other ways to die without doing that.

        2. Alan Brown Silver badge

          Re: "SSDs fail at an alarming rate in space"

          "it's pretty much all or nothing with SSDs - when they fail, it's generally without warning"

          That might have been the case with the early ones, but it's certainly not the case now. Pay attention to your SMART returns.

    4. Anonymous Coward

      Re: "SSDs fail at an alarming rate in space"

      My 1953-era mercury delay line is still holding on. Never a failure, except that one time in 1979 when our cat Felix drank a few slurps of the mercury and dropped stone cold dead into the trough. His wee little empty head floating on the mercury corrupted the bouncing bits. It was terrible.

      /not really

      1. Ken 16 Silver badge
        Joke

        I know you're lying

        The Mercury project didn't get funded until 1958

    5. PeterGriffin

      Re: "SSDs fail at an alarming rate in space"

      You don't know me but mine died after following the firmware update instructions explicitly. Thanks to Samsung it's not possible to force a firmware update, which may resolve the fault...

      One day I may get around to extracting the executable from the live Linux distro it's encapsulated in and trying again...

  2. Secta_Protecta

    "The other one was down to an astronaut who accidentally turned off the power switch to a rack that contained HPE’s computer when unloading supplies from SpaceX’s Dragon capsule."

    I wonder if the astronaut will submit this to the "Who, Me?" page...

    1. Mark 110

      I was wondering why the power switch didn't have a guard over it . . .

      1. Captain Scarlet Silver badge

        Hmm maybe because it adds weight?

        1. stiine Silver badge

          Or because it takes time..and in space, that's one thing that you have even less of than air.

    2. Anonymous Coward

      He will be offered a job at the Australian Tax Office.

  3. BristolBachelor Gold badge

    Rad Hard

    Rad Hard means that there'll be no effect, not that it won't break. In my experience, most things will survive the radiation in space for a year without major failure. There are occasional latch-ups where you have to power off quickly to prevent burnout, but they don't happen too often.

    Much more likely are upsets where something gets corrupted and then you get soft failures; sort of the equivalent of running Windows and using Excel/Outlook on Earth :D Depending on the orbit and sun activity you might expect one of these a day or even more often.

    It would've been interesting to read how often these occurred.

    1. werdsmith Silver badge

      Re: Rad Hard

      In my experience, most things will survive the radiation in space for a year without major failure.

      Experience I wish I had. In three years of trying our team has not got a single piece of work flown.

      1. Jason Bloomberg Silver badge
        Alien

        Re: Rad Hard

        No pun intended, but it does seem a bit hit and miss. Some people throw cheap off-the-shelf microcontrollers up in CubeSats and the like with no particular protection and suffer no problems at all, hardware or software. Others aren't so lucky.

  4. Korev Silver badge
    Coat

    Top astro boffinery

    It's always good to see people testing cool stuff out.

    Have a pint on me, you can get it from the Mars Bar

    1. Anonymous Coward

      Re: Top astro boffinery

      I was thinking much the same. Doing stuff for the sake of doing stuff. We don’t do it enough. I hope the bean counters in HPE recognise the importance of stuff like this.

      1. Lord Elpuss Silver badge

        Re: Top astro boffinery

        "I hope the bean counters in HPE recognise the importance of stuff like this."

        I hope the beancounters in IBM, Dell, Microsoft et al recognise the importance as well. We need a good old tech space race to get things going again. IBM's still banging on about how its systems took men to the Moon back in the 60s, and how ThinkPads went up with the Shuttle; both amazing to be sure, but if I were looking for a relevant case study I wouldn't want to have to go back 10 years to find one.

  5. BobC

    Using COTS instead of rad-hard devices.

    Electronic components hardened to tolerate radiation exposure are unbelievably expensive. Even cheap "rad-hard" parts can easily cost 20x their commercial relatives. And 1000x is not at all uncommon!

    There have been immense efforts to find ways to make COTS (commercial off-the-shelf) parts and equipment better tolerate the rigors of use in space. This has been going on ever since the start of the world's space programs, especially after the Van Allen radiation belts were discovered in 1958.

    I was fortunate to have participated in one small project in 1999-2000 to design dirt-cheap avionics for a small set of tiny (1 kg) disposable short-lived satellites. I'll describe a few highlights of that process.

    First, it is important to understand how radiation affects electronics. There are two basic kinds of damage radiation can cause: Bit-flips and punch-throughs (I'm intentionally not using the technical terms for such faults). First bit-flips: If you have ECC (error-correcting) memory, which many conventional servers do, bit-flips there can be automatically detected and fixed. However, if a bit-flip occurs in bulk logic or CPU registers, a software fault can result. The "fix" here is to have at least 3 processors running identical code, then performing multi-way comparison "voting" of the results. If a bit-flip is found or suspected, a system reset will generally clear it.
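    The multi-way "voting" idea can be sketched in a few lines. This is only an illustration, not the project's actual code: the names `majority_vote` and `flip_bit` are hypothetical, and the three "processors" are simulated as three copies of one result word.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(results: Sequence[int]) -> Optional[int]:
    """Return the value a strict majority of replicas agree on, or None if there is none."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


def flip_bit(word: int, bit: int) -> int:
    """Simulate a radiation-induced bit-flip by toggling one bit of a word."""
    return word ^ (1 << bit)


good = 0b1011_0010
# Three replicas ran identical code; one result was corrupted in flight.
replicas = [good, flip_bit(good, 3), good]

voted = majority_vote(replicas)
# Any disagreement with the voted value is the cue for a system reset.
needs_reset = any(r != voted for r in replicas)
print(voted == good, needs_reset)
```

    With three replicas a single corrupted result is outvoted; if all three disagree, the sketch returns `None` and the only safe move is a reset.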

    Then there are the punch-throughs, where radiation creates an ionized path through multiple silicon layers that becomes a short-circuit between power and ground, quickly leading to overheating and the Release of the Sacred Black Smoke. The fix here is to add current monitoring to each major chip (especially the MCU) and also to collections of smaller chips. This circuitry is analog, which is inherently less sensitive to radiation than digital circuits. When an abnormal current spike is detected, the power supply will be temporarily turned off long enough to let the ionized area recombine (20ms-100ms), after which power may then be safely restored and the system restarted.
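    The real protection described here is analog circuitry, but its decision logic can be modelled in software as a rough sketch. Everything below is an assumption for illustration: the function name `protect`, the 500 mA trip threshold, and the 10 ms sampling interval are invented; only the 20ms-100ms off time comes from the text above.

```python
from typing import List, Sequence


def protect(samples_ma: Sequence[float], trip_ma: float,
            off_ms: int = 100, sample_ms: int = 10) -> List[str]:
    """Return the power state ("on" or "off") chosen at each current sample.

    When a sample exceeds trip_ma, power is held off for off_ms so the
    ionized path can recombine; readings taken while off are ignored.
    """
    states: List[str] = []
    off_remaining = 0
    for current in samples_ma:
        if off_remaining > 0:
            off_remaining -= sample_ms
            states.append("off")
        elif current > trip_ma:
            # This sample already counts as the first slice of the off period.
            off_remaining = off_ms - sample_ms
            states.append("off")
        else:
            states.append("on")
    return states


# Simulated readings in mA: normal draw, a latch-up spike, then normal again.
readings = [100, 100, 900, 900, 900, 100, 100]
print(protect(readings, trip_ma=500, off_ms=30))
```

    After the off period ends, a real system would also restart the processor, which is why the reboot time matters so much in the timing budget.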

    Second, we must know the specific kinds of radiation we need to worry about. In LEO (Low Earth Orbit), where our satellites would be, our biggest concern was Cosmic Rays, particles racing at near light-speed with immense energies, easily capable of creating punch-throughs. (The Van Allen belts shield LEO from most other radiation.) We also need to worry about less energetic radiation, but it's less than the level of a dental X-Ray.

    With that information in hand, next came system design and part selection. Since part selection influences the design (and vice-versa), these phases occur in parallel. However, CPU selection came first, since so many other parts depend on the specific CPU being used.

    Here is where a little bit of sleuthing saved us tons of time and money. We first built a list of all the rad-hard processors in production, then looked at their certification documents to learn on which semiconductor production lines they were being produced. We then looked to see what other components were also produced on those lines, and checked if any of them were commercial microprocessors.

    We lucked out, and found one processor that not only had great odds of being what we called "rad-hard-ish" (soon shortened to "radish"), but also met all our other mission needs! We did a quick system circuit design, and found sources for most of our other chips that were also "radish". We had further luck when it turned out half of them were available from our processor vendor!

    Then we got stupid-lucky when the vendor's eval board for that processor also included many of those parts. Amazingly good fortune. Never before or since have I worked on a project having so much luck.

    Still, having parts we hoped were "radish" didn't mean they actually were. We had to do some real-world radiation testing. Cosmic Rays were the only show-stopper: Unfortunately, science has yet to find a way to create Cosmic Rays on Earth! Fortunately, heavy ions accelerated to extreme speeds can serve as stand-ins for Cosmic Rays. But they can't easily pass through the plastic chip package, so we had to remove the top (a process called "de-lidding") to expose the IC beneath.

    Then we had to find a source of fast, heavy ions. Which the Brookhaven National Labs on Long Island happens to possess in their Tandem Van de Graaff facility (https://en.wikipedia.org/wiki/Tandem_Van_de_Graaff). We were AGAIN fantastically lucky to arrange to get some "piggy-back" time on another company's experiments so we could put our de-lidded eval boards into the vacuum chamber for exposure to the beam. Unfortunately, this time was between 2 and 4 AM.

    Whatever - we don't look gift horses in the mouth. Especially when we're having so much luck.

    I wrote test software that exercised all parts of the board and exchanged its results with an identical eval board that was running outside the beam. Whenever the results differed (meaning a bit-flip error was detected), both processors would be reset (to keep them in sync). We also monitored the power used by each eval board, and briefly interrupted power when the current consumption differed by a specific margin.

    The tests showed the processor wasn't very rad-hard. In fact, it was kind of marginal, despite being far better than any other COTS processor we were aware of. Statistically, in the worst case we could expect to see a Cosmic Ray hit no more than once per second for the duration of the mission. Our software application needed to complete its main loop AT LEAST once every second, and in the worst case took 600 ms to run. But a power trip took 100 ms, and a reboot took 500 ms. We were 200 ms short! Missing even a single processing loop iteration could cause the satellite to lose critical information, enough to jeopardize the mission.

    All was not lost! I was able to use a bunch of embedded programming tricks to get the cold boot time down to less than 100 ms. The first and most important "trick" was to eliminate the operating system and program to the "bare metal": I wrote a nano-RTOS that provided only the few OS services the application needed.
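    The timing budget is simple arithmetic, shown here as a small sketch. It assumes a serial worst case (power trip, then boot, then one full loop run, all inside a single one-second period); the constant and function names are made up for the example.

```python
LOOP_DEADLINE_MS = 1000   # main loop must complete at least once per second
WORST_CASE_RUN_MS = 600   # worst-case main loop run time
POWER_TRIP_MS = 100       # power held off after a current spike


def recovery_margin_ms(boot_ms: int) -> int:
    """Slack left in one period after a trip, a boot and a full loop run.

    A negative value means the deadline is missed.
    """
    return LOOP_DEADLINE_MS - (POWER_TRIP_MS + boot_ms + WORST_CASE_RUN_MS)


# Largest boot time that still meets the deadline under this model.
max_boot_ms = LOOP_DEADLINE_MS - POWER_TRIP_MS - WORST_CASE_RUN_MS

print(recovery_margin_ms(500))  # -200: the original 500 ms reboot is 200 ms short
print(recovery_margin_ms(100))  # 200: a 100 ms boot leaves 200 ms of slack
print(max_boot_ms)              # 300
```

    Under these assumptions any boot time under 300 ms would fit, so a cold boot of less than 100 ms leaves more than 200 ms to spare.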

    When the PCBs were made, the name "Radish" was prominently displayed in the top copper layer. We chose to keep the source of the name confidential, and invented an alternate history for it.

    Then we found we had badly overshot our weight/volume budget (which ALWAYS happens during satellite design), and wouldn't have room for three instances of the processor board. A small hardware and software change allowed us to go with just a single processor board with only a very small risk increase for the mission. Yes, we got lucky yet again.

    I forgot to mention the project had even more outrageous luck right from the start: We had a free ride to space! We were to be ejected from a Russian resupply vehicle after it left the space station.

    Unfortunately, the space station involved was Mir (not ISS), and Mir was deactivated early when Russia was encouraged to shift all its focus to the ISS. The US frowned on "free" rides to the ISS, and was certainly not going to allow any uncertified micro-satellites developed by a tiny team on an infinitesimal budget (compared to "real" satellites) anywhere near the ISS.

    We lost our ride just before we started building the prototype satellite, so not much hardware existed when our ride (and thus the project) was canceled. I still have a bunch of those eval boards in my closet: I can't seem to let them go!

    It's been 18 years, and I wonder if it would be worth reviving the project and asking Elon Musk or Jeff Bezos for a ride...

    Anyhow, I'm not at all surprised a massively parallel COTS computer would endure in LEO.

    1. Alan Brown Silver badge

      Re: Using COTS instead of rad-hard devices.

      > Electronic components hardened to tolerate radiation exposure are unbelievably expensive. Even cheap "rad-hard" parts can easily cost 20x their commercial relatives. And 1000x is not at all uncommon!

      The price differential isn't so much for the rad hardening as for the accompanying paperwork - for every individual component, which has to be kept around seemingly forever.

      Part of the problem is that we don't have enough heavy lift capabilities, so everything must be as light as possible, which in turn means virtually no shielding. The idea of putting things in the water tanks isn't so silly (or wrapping the water tanks around the modules - This has been suggested)

    2. Jason 24
      Pint

      Re: Using COTS instead of rad-hard devices.

      Is it masochistic that I enjoy being made to feel very very stupid reading comments like this?

      Beer for the fascinating read! >>

    3. Flakk
      Joke

      Re: Using COTS instead of rad-hard devices.

      Well, there's your problem, BobC... you spent your luck budget on the radish gear!

      Seriously, thanks for sharing. Yours is one of the most fascinating posts I've read recently. What a heart-breaker of an ending. I do encourage you to reach out to SpaceX and Blue Origin. I have to think that, what with plans for LEO microsat constellations, there can't be too much research right now into the effects of radiation on COTS gear.

    4. Anonymous Coward
      Pint

      Re: Using COTS instead of rad-hard devices.

      @Bob C

      Thank you! That post told me so much that interested me, and that I had not a clue about. To tell the truth, the knowledge you've imparted has no relevance to my line of work, and I'll never use it, but I'm so pleased you took the trouble to expound at such length.

      For me, the expertise of the commentariat is at least 60% of the appeal of The Reg.

    5. Cookie 8

      Re: Using COTS instead of rad-hard devices.

      >Unfortunately, science has yet to find a way to create Cosmic Rays on Earth

      Have a look into the ChipIr instrument at the UK Science and Technology Facilities Council (stfc.ac.uk)

  6. DropBear
    Trollface

    Well, one could always locate the server box at the centre of the crew water tank for shielding...

    1. defiler

      ...and for heating their tap water? Double-win!

  7. Douchus McBagg

    the only SSD's I've seen fail have been in win7 machines where the users (who of course, knew better than anyone else) had enabled scheduled defrag after being told not to. But hey, they know best.

    Still running a stack of Intel 320series 160gig jobs as the backend store for a load of VMs. bought in a load of cheap Lite-On 128s and 256s to upgrade all the 200-odd spinning rust equipped lappies in the fleet. All new kit is NVMe equipped. spinning rust relegated to mass storage duties in the datacentre.

    how would you rad harden an SSD? you'd have to re-cut the flash chips individually on a different substrate? wrap them in depleted Boron-11? dunk it in water?... oh wait...

    Love the "Radish" PCB prints. used to be an easter-egg hunt; opening kit. Find the dev's names printed in the circuit traces, or song lyrics, or jumper pads/pins labelled "free beer"...

    1. Alan Brown Silver badge

      "the only SSD's I've seen fail have been in win7 machines where the users (who of course, knew better than anyone else) had enabled scheduled defrag after being told not to. "

      With modern(ish) SSDs that won't matter anyway.

      " All new kit is NVMe equipped. spinning rust relegated to mass storage duties in the datacentre."

      Same. I'm moving to Optane for some of the more demanding tasks. Adding a layer of SSD fronting the spinners helps a lot (ZFS FTW!)

      "how would you rad harden an SSD?"

      Encapsulation in silicon sealant and then a bucket of water springs to mind.

    2. Anonymous Coward

      Defrag on SSD

      "the only SSD's I've seen fail have been in win7 machines where the users (who of course, knew better than anyone else) had enabled scheduled defrag after being told not to. "

      Hah... what kind of idiot runs a defrag on a SSD, why you'd have to ... wait, did I remember to disable automatic defrag when I upgraded my laptop to a SSD?

      Answer: no I did not. Mr. McBagg may have saved me a bunch of aggravation!

    3. david 12 Silver badge

      "the only SSD's I've seen fail have been in win7 machines where the users (who of course, knew better than anyone else) had enabled scheduled defrag after being told not to. But hey, they know best."

      When you enable scheduled optimization on SSDs in Win7, Win7 sends a "retrim" instruction on the schedule. This is in case "Trim" instructions have been lost during heavy disk use, due to queue overflow.

      This "scheduled optimization" is not defragmentation. Which is why Win7 doesn't use the word "defragmentation".

  8. Admiral Grace Hopper

    I'm completely operational and all my circuits are functioning perfectly

    Apollo is a good name for a space-bound system.

    I do hope that they set it up with the same boot and close-down wav files that I and many others used when PCs with sound cards started hitting our desks.

  9. Eclectic Man Silver badge

    What a waste

    Doing 'benchmarking' processing, rather than something useful like bitcoin mining, or even factoring RSA moduluses / moduli (?) Oh well, I expect the Americans know what they are doing.

    1. Loyal Commenter Silver badge

      Re: What a waste

      For performance monitoring, I would have thought you would need to be running something with a constant run time and known output. That rules out both of those suggestions, on both grounds.
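      As a rough sketch of what "constant run time and known output" means in practice, a fixed workload can be checked against a reference value and timed on every run. The workload, its size, and the names `fixed_workload` and `run_once` are invented for illustration and have nothing to do with HPE's actual benchmarks.

```python
import time
from typing import Tuple


def fixed_workload(n: int = 200_000) -> int:
    """Deterministic integer workload: the same n always yields the same result."""
    acc = 0
    for i in range(n):
        acc = (acc * 31 + i) % 1_000_003
    return acc


def run_once(expected: int) -> Tuple[bool, float]:
    """Run the workload once; report whether the output matched and how long it took."""
    start = time.perf_counter()
    result = fixed_workload()
    elapsed = time.perf_counter() - start
    return result == expected, elapsed


reference = fixed_workload()      # computed once on a known-good machine
ok, seconds = run_once(reference)
print(ok, f"{seconds:.3f}s")
```

      A wrong result points at corruption, and a drifting run time points at throttling or retries. A workload with an open-ended search has neither a fixed duration nor an answer known in advance.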

  10. Destroy All Monsters Silver badge
    Paris Hilton

    But they are on the ISS, not on the other side of the magnetosphere

    What is this clownery?

    If you want to go to Mars, you gotta be in "deeper space".

    Move that super there for REAL tests.

    1. Ken Hagan Gold badge

      Re: But they are on the ISS, not on the other side of the magnetosphere

      Indeed, the ISS is so "rad soft" that even people can survive there for a year.

  11. J.G.Harston Silver badge

    you mean...

    crunching numbers IN SPAAAAAACCEE!

  12. Tigra 07
    Pint

    Linux...In...Spaaaaaaaaaaaaaace!

    1. ArrZarr Silver badge

      If anybody here ends up writing a linux distro specifically designed for space usage and doesn't call it "Sputnix" then I will be immensely disappointed.

  13. ShowEvidenceThenObject

    Since it was intended to be a test of commercial-off-the-shelf tech, and manages one teraflop - but is that before or after the Spectre/Meltdown patches? I'll volunteer to go if they need them applying to RHEL :)

  14. Bicycle Repair Man

    See for yourself...

    The ISS has been visible over Blighty every night this week, and will continue to be visible for the next few nights. Check out https://spotthestation.nasa.gov/ for exact times.

    On the one hand, it's just a white dot moving across the sky. Boring. On the other hand IT'S A SPACESHIP WITH REAL SPACEMEN AND I CAN SEE IT!

    1. Destroy All Monsters Silver badge

      Re: See for yourself...

      Indeed. Don't miss this show!!

  15. Joe Gurman

    Maybe not the most convincing of tests....

    ....for a Mars mission.

    The ISS orbits the earth at a relatively low altitude (just above 400 km), which means it's protected most of the time from all but the highest-energy solar/cosmic ray energetic charged particles by the earth's magnetosphere. The inclination of the orbit (51.6°) means the ISS spends little if any time near the geomagnetic poles, where those charged particles can spiral down the earth's magnetic field.

    For what it's worth, a spacecraft like SOHO, which has spent almost 23 years around the L1 Lagrange point (1% of the distance between the earth and the Sun), has a solid state recorder of late 1980s/early 1990s vintage that has corrected every single-event upset over an entire solar magnetic cycle (two solar activity cycles) just fine. The technology to provide digital electronics that can survive a ~ 3 year expedition outside the earth's magnetosphere is not exactly, er, rocket science at this point.

    1. stiine Silver badge

      Re: Maybe not the most convincing of tests....

      Should have contacted SpaceX, I'm sure they'd have put it in the car's trunk for you for a small fee.

      1. Alan Brown Silver badge

        Re: Maybe not the most convincing of tests....

        "I'm sure they'd have put it the car's trunk for your for a small fee."

        I believe there's an electric monk already occupying that space.

  16. nil0
    Alien

    SETI@Home

    They really, really, really ought to be hammering it by running SETI@Home.

  17. the Jim bloke
    Mushroom

    " At least we know it would probably make it all the way to Mars, but we're not sure if it'd make it back,”

    Especially not if they carry out the traditional Lithobraking manoeuvre...
