System upgrade to blame for BlackBerry outage

A three-hour outage that left many BlackBerry users unable to send and receive email was the result of an upgrade gone wrong. The upgrade to Research in Motion's internal data-routing system was designed to expand capacity. But instead, it disrupted email service for many subscribers located in North America. The episode, …

COMMENTS

This topic is closed for new posts.
  1. Tim Spence

    Heads are going to roll

    Some (possibly blameless) minion is sure to get the chop for this.

    It does make me wonder though why they start these upgrades in the middle of the afternoon, on a weekday. In my line of business it would involve pizza, overtime, and a late night start.

  2. Anonymous Coward

    late night start

    The North America SRP was down for a couple of hours early Saturday morning EST, too. It was probably the Vista SP1 install needing a reboot on completion Monday afternoon ;-)

  3. Anonymous Coward
    Coat

    OH NOES!

    What did all those people do without their Crackberry e-mail!?

    Mine's the one I'm wearing while I chill sans e-mail interference.

  4. Phil Rigby
    Paris Hilton

    @Stuart

    You're dead right, pal - it's freaking EMAIL, people. For 3 hours. GET OVER IT.

    Paris because - I could get over her for 3 hours :-)

  5. Anonymous Coward
    Black Helicopters

    Payroll

    They probably didn't want to pay the engineers any overtime for the upgrade! Besides, it's always 3 PM somewhere in their marketplace, so someone will get their work day screwed.

  6. Anonymous Coward
    Flame

    It's not profitable to be safe

    Have you ever noticed best practices and profitability don't seem to go hand in hand?

    I mean, we all know about five 9s and redundancy and testing and QA. And there are companies that follow these rules. But it seems, more often than not, these companies are not the profitable ones. The ones that know how to make money don't spend anything extra on having test systems, QA, or test runs. They just punch it out, and fix it if it breaks.

    RIM knows how to be profitable and that is probably incompatible with a strategy of reliability and redundancy.

    I'm not defending anyone here, or claiming we should give up on quality, but I am pointing out that this is probably the result of an overall business strategy and corporate decisions, rather than a person or group making an honest mistake.

  7. Anonymous Coward
    Paris Hilton

    Upgrades during customer peak time - bull-by-product I say

    Upgrade during peak NA time - I think not.

    QA - how are they spelling that?

    To retrospectively come out with the same bull-spin as the last `upgrade`, and to say it a few days later, when telcos would have been notified of said upgrade a week or two in advance; the same telcos who didn't know shit about it. Well, I call that utter contempt for customers' intellect.

    If you were hacked, then say it; don't spin the bottle.

    But given that all the US government emails pass through their systems with about as much isolation as a pea in a pod has from the other peas, when they think it's a separate system, well, you can see why they're saying this.

    Are RIM capable of an upgrade that has been QA'd or even properly tested without downgrading everybody? You really have to wonder, especially when expansion upgrades have the complete opposite effect. Well, again, you wonder how robust their internal infrastructure actually is.

    This also again highlights that their idea of DR is completely different from the rest of the IT industry.

    I do wonder if they know what QA actually is beyond lip service, as from what I've seen they haven't got a ruddy clue at all. Seriously, is QA something you just tell your staff that you have, and it's stored next to the golden monkey in the MD's office and nobody can touch it as it's that special? I would say get a clue, but first they need to find a large enough stick to induce said clue that won't break on first whack.

    Seriously, RIM is a stock ticker, not what you do to the intelligence of your customer base. "System upgrade" after the last utter SNAFU, and to impact at peak time.

    But at least they never lost any emails for their customers, only the whole reason for having their services in the first place: instant email. But that's why RIM handsets allow you to make phone calls, so in essence the whole outage can be blamed upon customers :). But don't let me give RIM ideas for their next outage, which will probably happen before the end of their financial year, statistically.

    My recommendations to RIM are:

    1) Wiki Quality Assurance and actually read it with intent.

    2) Roll back any staff changes you have made, as clearly things are going bad.

    3) Organise a 3rd party QA team consisting of members of your high-profile customers, like T-Mobile, say.

    4) Stop cluster fudging your infrastructure and do distributed upgrade cells that don't kludge up all the other eggs in the same box/location.

    5) Have a real DR setup, as clearly you don't have dick in place currently, unless 3 hours is your idea of a fail-over.

    6) Sell lots of your shares, as it's had its day as far as industry analysts go with the current directions.

    7) Cash in and market IP to 3rd parties more aggressively, and packaged.

    8) Pay your old-school handset designers more and listen to them more instead of lip servicing them.

    9) Buy your root cause analysts a lot of drinks, as they sure as hell are gonna need 'em from the fallout from this.

    10) Don't sell out to Microsoft.

    11) Focus on the product and not the share price; it worked for you in the past and you did good stuff back then. Past years' products and service have been utterly handbagged marketed. Your product is business, not end-consumer, biased, so stop allowing one to impact the other's infrastructure; it's a security flaw that your security team will never stand a hope in hell of addressing until that happens.

    12) If you can't tell your staff the truth, then you have no chance of customers believing a word you say.

    Bottom line, RIM - get back to your roots or that's all you will have to eat. Another outage like this, or even using that excuse ever again, will be a death blow with regards to industry analysts, and your share price will fail badly then.

  8. Andrew Oliver

    Now I know why they're called 'crackberries'

    It's not because of the users' crack addiction, it's because that's what the network admins are smoking.

    I'm sorry, but NO organization in its right mind that targets business users would EVER think that 3:30pm is the right time to roll out a code upgrade. That's primo time between when all the execs are back from lunch, and just as all the lazy ones are checking to see whether they can skip off home early.

    Midnight or later is the only time this kind of thing should happen. That's when you'll have the least number of users impacted by any problems that do occur.

    And if your network admins cry about it, get new ones. As much as you might like them, they are not the ones that pay the bills.

  9. Morely Dotes
    Flame

    RIM is a bunch of gormless gits

    Blizzard has somewhere around 10 million subscribers for the World of Warcraft online game, and they can manage to do their regular maintenance starting at 3 AM Pacific Coast time.

    I'm reasonably sure RIM has more than 10 million Crackberry subscribers. Either they have severely underpriced their services, or they can afford to start (and complete) system downtime during the wee hours (and perform a test run on a hot spare server, in case there's still some nasty surprise lurking in the patches come "upgrade" day).

    Yes, it's "just" email, and as a sysadmin I think (actually, I *know*) that it's not a life-threatening issue. However, RIM appears to have knowingly and deliberately ignored even the most basic principles of good network management.
