Been there as well
I've completed a few big projects in my time, with thankfully only minor niggles to correct in the new version.
What gets me the most is that I know that, however much preparation and planning and caution I employ, in the end, on launch day, there will always be one user who will have an incredible problem nobody else does. And when I'm confronted with it, I then have to tie my brain into a knot in order to find a solution that doesn't break it for everyone else.
And, often, that one user will be a VIP or otherwise important person in the company, so the solution just has to be found ASAP.
It's almost as exhausting as the entire rest of the project.
Problem VIP user
Thorough investigation will usually show the one single user just didn't read the memo and is the only one still trying to connect to the disconnected system. VIPs have a tendency to think that memos to all don't apply to them. Just charging that VIP with all unnecessary overtime caused by him (never found a female in that particular group yet) is very satisfying and will make sure the project comes in within budget, but that is about all you can do about it.
About Ariane 5...
More thorough testing could have caught the problem.
More thorough testing could ALWAYS have caught the problem. This is an "empty" truth.
As Bertrand Meyer says (among others) in "The Lessons of Ariane":
Is it a testing error? Not really. Not surprisingly, the Inquiry Board's report recommends better testing procedures, and testing the whole system rather than parts of it (in the Ariane 5 case the SRI and the flight software were tested separately). But if one can test more one cannot test all. Testing, we all know, can show the presence of errors, not their absence. And the only fully "realistic" test is to launch; this is what happened, although the launch was not really intended as a $500-million test of the software.
More relevant was a software config error.
Particularly vexing is the realization that the error came from a piece of the software that was not needed during the crash. It has to do with the Inertial Reference System, for which we will keep the acronym SRI used in the report, if only to avoid the unpleasant connotation that the reverse acronym could evoke for US readers. Before lift-off certain computations are performed to align the SRI. Normally they should be stopped at -9 seconds, but in the unlikely event of a hold in the countdown resetting the SRI could, at least in earlier versions of Ariane, take several hours; so the computation continues for 50 seconds after the start of flight mode -- well into the flight period. After takeoff, of course, this computation is useless; but in the Ariane 5 flight it caused an exception, which was not caught and -- boom.
More interestingly, William Kahan has this take in https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf
A commission of inquiry with perfect hindsight blamed the disaster upon inadequate testing of the rocket’s software. What software failure could not be blamed upon inadequate testing? The disaster can be blamed just as well upon a programming language (Ada) that disregarded the default exception-handling specifications in IEEE Standard 754 for Binary Floating-Point Arithmetic. Here is why: Upon launch, sensors reported acceleration so strong that it caused Conversion-to-Integer Overflow in software intended for recalibration of the rocket’s inertial guidance while on the launching pad. This software could have been disabled upon rocket ignition but leaving it enabled had mistakenly been deemed harmless. Lacking a handler for its unanticipated overflow trap, this software trapped to a system diagnostic that dumped its debugging data into an area of memory in use at the time by the programs guiding the rocket’s motors. At the same time control was switched to a backup computer, but it had the same data. This was misinterpreted as necessitating strong corrective action: the rocket’s motors swivelled to the limits of their mountings. Disaster ensued. Had overflow merely obeyed the IEEE 754 default policy, the recalibration software would have raised a flag and delivered an invalid result both to be ignored by the motor guidance programs, and the Ariane 5 would have pursued its intended trajectory. The moral of this story: A trap too often catches creatures it was not set to catch.
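Kahan's contrast between Ada's trap-on-overflow and the IEEE 754 default (flag it, deliver a result, carry on) can be sketched in Python. The numbers and function names here are purely illustrative; the real Ariane code was Ada, and the actual out-of-range value isn't reproduced here:

```python
INT16_MIN, INT16_MAX = -32768, 32767

def convert_trapping(x: float) -> int:
    """Ada-style conversion: an out-of-range value raises an
    exception, which -- if unhandled -- takes the program down."""
    if not (INT16_MIN <= x <= INT16_MAX):
        raise OverflowError(f"{x} out of range for 16-bit signed int")
    return int(x)

overflow_flag = False  # sticky flag, as in IEEE 754's default policy

def convert_flagging(x: float) -> int:
    """IEEE-754-style default Kahan describes: raise a flag, deliver
    a result anyway, and let the caller decide whether to care."""
    global overflow_flag
    if x > INT16_MAX:
        overflow_flag = True
        return INT16_MAX
    if x < INT16_MIN:
        overflow_flag = True
        return INT16_MIN
    return int(x)

# An alignment value too large for int16 (the number is made up):
bias = 65535.0
print(convert_flagging(bias))  # 32767 -- flagged, execution continues
try:
    convert_trapping(bias)
except OverflowError as exc:
    print("trap:", exc)        # unhandled in flight: "boom"
```

The point of the sketch is Kahan's: the flagging version hands the guidance loop a value it can ignore, while the trapping version hands control to whatever happens to catch the exception.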
"include go/no-go meetings"
I've seen one circumstance where it should have been no-go right from inception. However, the project owner was the senior IT manager, so it was go all the way in moving an application to an OS that was completely new as far as the RDBMS & tools vendor was concerned (I was later told their porting procedure was something along the lines of "we made the changes necessary to get it to compile"; I suspect we were the only site that went live).
In practice, as soon as we got to go-live we started to get database index corruption; I suspect there was a race condition that only manifested itself under real load. Oddly enough, migration back to the sort of OS it should have been on had weeks of testing mandated, with no issues found then or on go-live. I could have done without those weeks of testing, as they were weeks of fire-fighting on the live system as far as I was concerned.
Experience is a harsh teacher
"It sounds a little like changing your car engine while in the overtaking lane on the motorway. "
... in the dark, with no tools.
The challenge comes when *someone* (it doesn't matter who, but they often count beans for a living) decides that the carefully prepared plan "takes too long" and needs to be done in less time. Inevitably some part of the process has to give, and ultimately that leads to problems, and support issues, and things taking longer than if they hadn't made any "helpful" comments in the first place.
Re: Experience is a harsh teacher
Pretty much this ^^
In-house development is different, as you can sometimes amortize what's needed for true testing over multiple projects; if you're brought in on one-off projects you are in trouble. It goes like this:
Go-Live dates and feature lists are not normally set by the worker techs.
They are told 'here are 200 features that have to be in day 1' (often not this simple, see Agile)
Then they are told when day 1 is, at some point in the project.
There can be any number of reasons this date is picked: could be a post-year-end period, an all-company presentation, a 'quiet period', Christmas, the end-of-life date on the previous product, an IPO date, the astrologer says so, etc.
Load testing is then put at number 201 on the list, some staff churn happens on the project, and it never happens.
The above list is fine in a perfect world, you will be lucky if you are ever on a project that can do it.
Re: Experience is a harsh teacher
"The challenge comes when *someone* (it doesn't matter who, but they often count beans for a living, decides that the carefully prepared plan "takes too long" and needs to be done in less time."
A secondary issue is that the 'plan' doesn't make economic sense if properly specified to the correct hardware/network/redundancy/DR/day-to-day debugging capability; the imaginary savings are much sweeter if they use imaginary numbers for capacity and support cost.
Re: Experience is a harsh teacher
"'quiet period', Christmas, end of life date on previous product"
Situation: current H/W due to be EoL (at least for support purposes) at end of 31 Dec.
The quiet period between Christmas & New Year would have been the ideal time to migrate over to new H/W. Minimal risk: just unload the data and reload it onto a version of the same engine as on the current H/W*. The client's manglement absolutely forbade it, even when warned that any H/W failure would cost an arm and a leg, and possibly the CEO's first-born. It turned out that they'd arranged for bean-counters to come in to value the company for a sale.
* When it eventually was moved it went just as smoothly as anticipated.
decides that the carefully prepared plan
"takes too long" "costs too much" and needs to be done in less time for less money
Projects ALWAYS go over budget, and in my experience upper management is far more likely to have tolerance for giving extra time than extra money. At a certain point, upper management's performance bonuses are based on financials which become impacted, thus they will clamp down on project spending. That's why they are usually willing to give extra time (unless there are other constraints involved like regulatory issues) since that spreads out the expenses to future quarters.
Typically items deemed "not critical" to the project, like developing scripts to guarantee a fast and accurate transition, are the first to go - I remember one time being told "I am paying tens of thousands of dollars a day for all these specialists, they don't need training wheels". Also often on the chopping block is preparation for a potential rollback - as an example, I again quote: "failure is not an option".
As it turned out, that project was cut further and I wasn't around to personally see it through to completion, but I was told by those who remained that it was a total shit show and the CIO's "rising star" who was responsible for those quotes was fired as a result.
> A simple floating point-to-integer conversion failure in the inertial reference software killed the guidance system.
No it wasn't.
People who think that summarising a complex chain of events into a single simplistic failure will somehow enable their own projects to avoid similar complex causal chains are kidding themselves.
Much better to give examples of such chains so that developers are better able to identify them when they occur in their own projects.
Re: Ariane 5
>> A simple floating point-to-integer conversion failure in the inertial reference software killed the guidance system.
> No it wasn't.
“The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer.”
Sometimes testing is not enough
Especially when the information you are given about the installed system is wrong.
On a factory automation upgrade our people insisted on getting timings for every part, right down to solenoid actuation. We built a sophisticated test rig with all those times built in, then added extra to allow for slow or sticking ones, and as many failsafes as we could think of.
When we installed it the system wouldn't even start. One of those failsafes kicked in. We had been assured the vacuum system took 1 minute to get to a usable level - even though we had doubts and allowed for 2 minutes. It actually took at least 10, not helped by the fact there was a leaky pipe that was known about, but not by us.
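That kind of failsafe can be sketched as a simple startup watchdog. Everything here is illustrative (names, polling interval, the 60 s spec vs 120 s allowance), not taken from the actual rig:

```python
import time

def wait_for_vacuum(read_level, usable_level, timeout_s=120.0):
    """Startup failsafe: poll the vacuum gauge and refuse to start
    if a usable level isn't reached within the allowed window
    (spec said 60 s; we doubled it to 120 s to allow for slow pumps)."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_level() >= usable_level:
            return True   # vacuum OK, startup may proceed
        if time.monotonic() >= deadline:
            return False  # failsafe trips: no start
        time.sleep(0.1)
```

A pump that actually needs ten minutes (leaky pipe included) never makes the two-minute window, so the system correctly refuses to start - which is exactly what happened.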
Changing the user interface can have major unexpected consequences.
Back in the 80s, the company internal telephone system was updated. The new system examined external calls and sent them by the cheapest route - BT, Mercury, or up the leased line to head office.
Old system: Pick up phone, get internal dial tone. Dial 9, external dial tone. Dial number.
New system: Pick up phone, get internal dial tone. Dial 9, system waits for the external number with no dial tone then works out the cheapest route.
Nobody, not even the switchboard, was warned in advance about this change.
In a large office building, large numbers of people thought there was a fault. Many of them tried pressing the 9 key again and again to try to get the external dial tone.
The telephone system watched the key presses until it recognised a valid number, then dialled that number.
Note to non-UKans: The UK emergency number, roughly equivalent to the US '911', is '999'.
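The digit-collection behaviour described above can be sketched as a toy model of the least-cost router's matching. The valid-number set here is hypothetical; a real exchange matches against full dial plans:

```python
def collect_digits(presses, valid_numbers):
    """Buffer key presses silently (no dial tone) until the buffer
    ends with a number the exchange recognises, then dial it."""
    buffer = ""
    for key in presses:
        buffer += key
        for number in valid_numbers:
            if buffer.endswith(number):
                return number  # recognised: dial immediately
    return None  # caller gave up before a valid number appeared

# A frustrated user mashing 9, hoping for an external dial tone:
print(collect_digits("99999", {"999"}))  # dials 999, the emergency number
```

Three presses of the 9 key are enough to match, at which point the system dials the emergency services on the user's behalf.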
One thing that really lowers the pucker factor is a complete test environment.
I.e. one that has a "Hyper" level supporting the concept of a whole separate "company" level.
Testing functions on a live system is way more stressful.
Cutting over from a live to another live system, even more so.*
*BTW, one company's quiet time (i.e. Christmas) is another company's massive panic time. SOP for large retailers is to freeze system changes months before Christmas/January. If it's not working right they will roll back (you do have a rollback plan, don't you?) to the earlier version and tough it out.
I really oppose Big Bang implementations unless there is absolutely no other option - i.e. for "real" reasons like the power station example cited in the article. Normally you end up with a big bang because the business won't pay for a dual running period or there is some arbitrary deadline.
Every big bang implementation I've seen has involved lots of last minute frantic fire fighting in live.
It's often for silly reasons too, e.g. an infrastructure difference between test and live that no one noticed, or the user who "forgot" about that monthly report they always do when signing off their requirements.
I have also seen a worrying trend for "MVP" testing where I work now. We recently put a major new customer facing system live, and while we thankfully didn't do a big bang we did an "MVP" performance test (under protest) which didn't really prove anything. Funnily enough said new system is performing like a dog in live, time to get the fire extinguisher out...