Lemme get this straight. We bought an air traffic control system which automatically shuts itself down if it exceeds 49 days' uptime?!?
KTLA 5 reported this as a radio failure, with the comment that "the FAA said it was some kind of computer glitch." (Warning: This link may crash Mozilla or Firefox. Bad KTLA, bad.) The LA Times played it as human error (sorry, can't find a non-reg copy of this article), but look at the crucial highlighted sentences in the summary below (from ACM TechNews). Air traffic control software that's designed to just die?!? What were the designers THINKING?!?
"Human Errors Silenced Airports"
Los Angeles Times (09/16/04) P. A1; Alonso-Zaldivar, Ricardo; Malnic, Eric; Oldham, Jennifer
A software glitch led to a three-hour shutdown of Southern California's air traffic control radio system, cutting off radio communications and leading to five incidents where planes breached the required separation distance from one another. FAA officials said the radio system, known as Voice Switching and Control System (VSCS), contained a software glitch discovered one year ago as the agency began upgrading the systems nationwide. Originally based on a Unix system built by Harris, the upgraded touch-screen system used Dell computers running a Microsoft operating system; the new system automatically shut down after 49.7 days in order to prevent data overload, in which case controllers might receive wrong information without knowing about a malfunction. FAA officials blamed an improperly trained technician for failing to manually reset the internal clock during maintenance, leading to the initial failure, while the back-up radio system's subsequent failure was also attributed to a technician's mistake. A technicians union advisor, Richard Riggs, said the software glitch should have been fixed when it was first discovered and before the new systems were deployed at 21 regional air traffic control centers. FAA officials have only corrected the error in the Seattle air traffic control center, but have deployed an early warning system in the Southern California center that will prevent another outage. The three-hour radio communications shutdown left planes above Southern California, Arizona, and New Mexico without air-traffic control instructions, until communications tasks were handed off to other regional centers. In two cases, pilots had to take evasive maneuvers to avoid danger, while Los Angeles International Airport officials said approximately 30,000 travelers were affected at their airport alone.
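That 49.7-day figure is suspiciously specific. A quick side calculation, under an assumption the article does NOT confirm: if uptime were tracked as an unsigned 32-bit count of milliseconds, the counter would wrap back to zero after almost exactly that long. The names below (`days_until_wrap`, `wrapped_counter`) are made up for illustration; this is a sketch of the arithmetic, not a claim about how VSCS actually works.

```python
# ASSUMPTION (not stated in the article): uptime is kept as an unsigned
# 32-bit count of milliseconds. This only shows that such a counter
# would roll over after roughly 49.7 days.

COUNTER_BITS = 32
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def days_until_wrap(bits: int = COUNTER_BITS) -> float:
    """Days until a millisecond counter of the given width rolls over."""
    return (2 ** bits) / MS_PER_DAY


def wrapped_counter(elapsed_ms: int, bits: int = COUNTER_BITS) -> int:
    """Value a fixed-width counter would show after elapsed_ms milliseconds."""
    return elapsed_ms % (2 ** bits)


if __name__ == "__main__":
    print(f"{days_until_wrap():.2f} days")  # prints "49.71 days"
    # One millisecond past the limit, the counter reads as if freshly booted.
    print(wrapped_counter(2 ** 32 + 1))     # prints 1
```

If something like that is what's going on, a periodic reboot just resets the counter before it can wrap, which would square with the technician being expected to reset the clock during maintenance.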
Surely any sane standard of rationality says that even if the software is so CRAPPY that it has to be restarted every 49 days (I'm guessing the problem is memory leaks that the vendor couldn't find or couldn't be bothered to fix), the software should start giving increasingly strident warnings after, say, 45 days.
"Approaching maximum runtime limit; maintenance shutdown recommended."
"Maximum runtime limit reached; maintenance shutdown is URGENT."
"MAXIMUM RUNTIME LIMIT EXCEEDED; THIS SYSTEM MUST BE REBOOTED WITHIN 12 HOURS. DATA MAY BE UNRELIABLE."
Or something like that. But to just silently cross over a runtime threshold and just shut itself down without warning?!?
Hey, folks, this is AIR TRAFFIC CONTROL we're talking about, not web poker! It shouldn't EVER just shut itself down without warning!
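Escalating warnings like these are not hard to write. Here's a minimal sketch, assuming the system can read its own uptime in days. The 45-day and 49.7-day numbers come from above; the 48-day "urgent" threshold and the function name `runtime_warning` are my own inventions for illustration, not anything from the real system.

```python
from typing import Optional

WARN_DAYS = 45.0        # start nagging here (the "say, 45 days" above)
URGENT_DAYS = 48.0      # hypothetical middle threshold, chosen for illustration
HARD_LIMIT_DAYS = 49.7  # the shutdown limit reported in the article
FINAL_DAYS = HARD_LIMIT_DAYS - 0.5  # 12 hours before the hard limit


def runtime_warning(uptime_days: float) -> Optional[str]:
    """Return the warning to display for this uptime, or None if all is well."""
    if uptime_days >= FINAL_DAYS:
        return ("MAXIMUM RUNTIME LIMIT EXCEEDED; THIS SYSTEM MUST BE "
                "REBOOTED WITHIN 12 HOURS. DATA MAY BE UNRELIABLE.")
    if uptime_days >= URGENT_DAYS:
        return "Maximum runtime limit reached; maintenance shutdown is URGENT."
    if uptime_days >= WARN_DAYS:
        return ("Approaching maximum runtime limit; "
                "maintenance shutdown recommended.")
    return None


if __name__ == "__main__":
    for day in (30, 46, 48.5, 49.5):
        print(day, runtime_warning(day))
```

The checks run from most severe to least so that the loudest applicable message always wins.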
Yeah, there was human error involved. But it wasn't principally on the part of the technician who forgot to reset the clock. Any realistic person could have predicted that sooner or later, that was going to happen.
No, the human error was on the part of the vendor who presented this as a functional, reliable air traffic control software system, and the FAA administrator who accepted it as same, both knowing that it would just .... DIE .... if it ever exceeded 49.7 days' indicated-on-the-clock runtime.
What's worse, a fix for the problem exists... and it's only been deployed in one regional air traffic control center? Hello? I think I hear the sound of tax dollars going across the street for a three-martini lunch.