Unixronin

Friday, September 17th, 2004 05:39 pm

Lemme get this straight. We bought an air traffic control system which automatically shuts itself down if it exceeds 49 days' uptime?!?

KTLA 5 reported this as a radio failure, with the comment that "the FAA said it was some kind of computer glitch."  (Warning:  This link may crash Mozilla or Firefox.  Bad KTLA, bad.)  The LA Times played it as human error (sorry, can't find a non-reg copy of this article), but look at the crucial highlighted sentences in the summary below (from ACM TechNews).  Air traffic control software that's designed to just die?!?  What were the designers THINKING?!?

"Human Errors Silenced Airports"

Los Angeles Times (09/16/04) P. A1; Alonso-Zaldivar, Ricardo; Malnic, Eric; Oldham, Jennifer

A software glitch led to a three-hour shutdown of Southern California's air traffic control radio system, cutting off radio communications and leading to five incidents where planes breached the required separation distance from one another.  FAA officials said the radio system, known as Voice Switching and Control System (VSCS), contained a software glitch discovered one year ago as the agency began upgrading the systems nationwide.  Originally based on a Unix system built by Harris, the upgraded touch-screen system used Dell computers running a Microsoft operating system; the new system automatically shut down after 49.7 days in order to prevent data overload, in which case controllers might receive wrong information without knowing about a malfunction.  FAA officials blamed an improperly trained technician for failing to manually reset the internal clock during maintenance, leading to the initial failure, while the back-up radio system's subsequent failure was also attributed to a technician's mistake.  A technicians union advisor, Richard Riggs, said the software glitch should have been fixed when it was first discovered and before the new systems were deployed at 21 regional air traffic control centers.  FAA officials have only corrected the error in the Seattle air traffic control center, but have deployed an early warning system in the Southern California center that will prevent another outage.  The three-hour radio communications shutdown left planes above Southern California, Arizona, and New Mexico without air-traffic control instructions, until communications tasks were handed off to other regional centers.  In two cases, pilots had to take evasive maneuvers to avoid danger, while Los Angeles International Airport officials said approximately 30,000 travelers were affected at their airport alone.

Surely any sane standard of rationality says that even if the software is so CRAPPY that it has to be restarted every 49 days (I'm guessing the problem is memory leaks that the vendor couldn't find or couldn't be bothered to fix), the software should start giving increasingly strident warnings after, say, 45 days.

"Approaching maximum runtime limit; maintenance shutdown recommended."

"Maximum runtime limit reached; maintenance shutdown is URGENT."

"MAXIMUM RUNTIME LIMIT EXCEEDED; THIS SYSTEM MUST BE REBOOTED WITHIN 12 HOURS.  DATA MAY BE UNRELIABLE."

Or something like that.  But to just silently cross over a runtime threshold and just shut itself down without warning?!?
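To make the escalation idea concrete, here is a minimal sketch of it. It is an illustration only, not anything from the actual VSCS software: the function name `runtime_warning` and the 48- and 49-day thresholds are made up for the example, and only the 45-day starting point and the message texts come from the post above.

```python
from typing import Optional

SECONDS_PER_DAY = 86_400

# Hypothetical thresholds; only the 45-day starting point comes from the post.
APPROACHING_DAYS = 45.0
REACHED_DAYS = 48.0
EXCEEDED_DAYS = 49.0  # deliberately earlier than the 49.7-day hard stop


def runtime_warning(uptime_seconds: float) -> Optional[str]:
    """Return the warning that applies at this uptime, or None if none is due."""
    days = uptime_seconds / SECONDS_PER_DAY
    if days >= EXCEEDED_DAYS:
        return ("MAXIMUM RUNTIME LIMIT EXCEEDED; THIS SYSTEM MUST BE "
                "REBOOTED WITHIN 12 HOURS.  DATA MAY BE UNRELIABLE.")
    if days >= REACHED_DAYS:
        return "Maximum runtime limit reached; maintenance shutdown is URGENT."
    if days >= APPROACHING_DAYS:
        return "Approaching maximum runtime limit; maintenance shutdown recommended."
    return None


if __name__ == "__main__":
    # Walk through a few sample uptimes (in days) and show what would be logged.
    for sample_days in (30, 45, 48, 49.5):
        print(sample_days, runtime_warning(sample_days * SECONDS_PER_DAY))
```

The checks run from most severe to least severe, so the loudest applicable message always wins.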

Hey, folks, this is AIR TRAFFIC CONTROL we're talking about, not web poker!  It shouldn't EVER just shut itself down without warning!

Yeah, there was human error involved.  But it wasn't principally on the part of the technician who forgot to reset the clock.  Any realistic person could have predicted that sooner or later, that was going to happen.

No, the human error was on the part of the vendor who presented this as a functional, reliable air traffic control software system, and the FAA administrator who accepted it as same, both knowing that it would just .... DIE .... if it ever exceeded 49.7 days' indicated-on-the-clock runtime.

What's worse, a fix for the problem exists..... and it's only been deployed in one regional air traffic control center?  Hello?  I think I hear the sound of tax dollars going across the street for a three-martini lunch.

Friday, September 17th, 2004 02:52 pm (UTC)
I just did some math. It turns out that 2^32 milliseconds is equal to 49.71026962962962962962962962963 days. That tells me that the problem could be caused by a 32-bit variable that keeps track of the number of milliseconds since the system was started.
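For anyone who wants to check that arithmetic, here is a short sketch. The helper name `wrap_period_days` is just for illustration, and it assumes an unsigned counter that starts at zero and ticks once per millisecond.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def wrap_period_days(bits: int, ticks_per_day: int = MS_PER_DAY) -> float:
    """Days until an unsigned counter of the given width wraps back to zero."""
    return (2 ** bits) / ticks_per_day


if __name__ == "__main__":
    print(f"{wrap_period_days(32):.4f} days")  # 49.7103 days

    # Break the same span down into days, hours, minutes and milliseconds.
    days, rest = divmod(2 ** 32, MS_PER_DAY)
    hours, rest = divmod(rest, 60 * 60 * 1000)
    minutes, rest = divmod(rest, 60 * 1000)
    print(days, hours, minutes, rest)  # 49 17 2 47296
```

That works out to 49 days, 17 hours, 2 minutes and about 47.3 seconds.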
Friday, September 17th, 2004 03:21 pm (UTC)
That is a suspiciously coincidental number, isn't it?

Now the question is, did they use a 32-bit counter of milliseconds for the timer and the "possible data overload" problem is some other issue, or is the "data overload" because they're storing ATC data with millisecond resolution and only have a 32-bit index into it?

(Come on, folks, how far does a 767 move in one millisecond? I make it about a foot, at cruising speed. One second resolution should be plenty.)
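A quick back-of-the-envelope check of the "about a foot" figure. The 530 mph cruising speed is an assumed round number for illustration, not a value from the article or an official specification.

```python
FEET_PER_MILE = 5280
MS_PER_HOUR = 60 * 60 * 1000

# Assumed cruising speed for illustration only.
ASSUMED_CRUISE_MPH = 530.0


def feet_per_millisecond(speed_mph: float) -> float:
    """Distance in feet covered in one millisecond at the given speed."""
    return speed_mph * FEET_PER_MILE / MS_PER_HOUR


if __name__ == "__main__":
    print(f"{feet_per_millisecond(ASSUMED_CRUISE_MPH):.2f} ft per ms")
```

Under that assumption the aircraft covers a bit under one foot per millisecond.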
Friday, September 17th, 2004 03:23 pm (UTC)
The GetTickCount() API in Windows has exactly this problem: it returns milliseconds since boot as a 32-bit value, so it wraps after 49.7 days. In fact, some versions of Windows 95/98 won't stay up longer than 49.7 days because of this.
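Here is a minimal sketch of why a wrapping 32-bit millisecond counter bites, and the usual way around it. This is a Python simulation, not a call into any Windows API: the names `elapsed_ms` and `MASK32` are invented for the example, and the tick values are made up.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits, like an unsigned 32-bit register


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Milliseconds between two readings of a wrapping 32-bit tick counter.

    Subtracting modulo 2**32 gives the right answer even if the counter
    wrapped once in between, as long as the real interval is under 2**32 ms.
    """
    return (now_tick - start_tick) & MASK32


if __name__ == "__main__":
    start = 0xFFFFFF00  # 256 ms before the counter wraps
    now = 0x00000100    # 256 ms after it wrapped

    # The buggy pattern: comparing raw readings makes time appear to run backwards.
    print(now > start)             # False
    # The wrap-safe pattern: modular subtraction recovers the true interval.
    print(elapsed_ms(start, now))  # 512
```

Code that compares raw tick values breaks at the wrap; code that only ever looks at modular differences keeps working, provided no single interval it measures approaches 49.7 days.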
Friday, September 17th, 2004 03:41 pm (UTC)
yeah, I seem to recall having to apply a patch for that on our old Win98SE boxen. (Installed using 98Lite, they were actually capable of staying up long enough for it to be a factor.)
Friday, September 17th, 2004 04:02 pm (UTC)
HA HA HA HA HA HA HA oh god remind me not to fly anymore.

-Ogre
Friday, September 17th, 2004 05:10 pm (UTC)
It baffles me how Microsoft gets away with this kind of shit. :P
Friday, September 17th, 2004 05:47 pm (UTC)
I lay this one primarily at the feet of the FAA and the vendor, not Microsoft. They had a system that they knew would crash if it was allowed to remain up longer than 49.7 days at a time without having its clock reset, and they approved it and put it into service anyway, for air traffic control.

Thank god it wasn't the Department of Energy. We could have nuclear plants that melt down unless rebooted every 49.7 days.
Friday, September 17th, 2004 08:38 pm (UTC)
Yeah, they must be using Windows 98. And they must have known about the bug and used Windows 98 anyway.

Microsoft eventually acknowledged that Windows 98 wouldn't run more than 49.7 days. It took about 2 years for people to find this bug, because they kept assuming the problem was just standard instability in Windows 98, but eventually someone noticed the pattern.

Windows may crash after 49.7 days (http://news.com.com/2100-1040-222391.html?legacy=cnet)

Broken Windows Theory (http://web.ukonline.co.uk/eric.price/humour2/0379.htm)