Have you turned it off and on again?
May 1, 2015 10:19 AM

FAA (pdf link): A Boeing Model 787 airplane that has been powered continuously for 248 days can lose all AC electrical power due to the generator control units (GCUs) simultaneously going into failsafe mode. This condition is caused by a software counter internal to the GCUs that will overflow after 248 days of continuous power. We are issuing this airworthiness directive to prevent loss of all AC electrical power, which could result in loss of control of the airplane. Guardian article: "In the latest of a long line of problems plaguing Boeing’s 787 Dreamliner, which saw the company’s fleet grounded over battery issues and concerns raised over possible hacking vulnerabilities, the new software bug was found in plane’s generator-control units."

“If there is a definitive record of a powercycle within the last 120 days, no operator action is immediately required. Operators will perform periodic power cycling at scheduled intervals until incorporating a software update." According to Boeing’s records, all of the 787s currently in service have been turned off and turned on again as part of maintenance.

(Also in recent news, airplanes with in-flight Wi-Fi are vulnerable to hacks by passengers and could be targeted by a "malicious attacker" on the ground: newer planes such as the Boeing 787 Dreamliner and the Airbus A350 and A380 have a single network that is used both by pilots to fly the plane and passengers for their Wi-Fi connections. And earlier, a "back door" in a computer chip used in military systems and aircraft such as the Boeing 787 that could allow the chip to be taken over via the internet.)
posted by RedOrGreen (77 comments total) 16 users marked this as a favorite
 
Oh, this is nothing new. A group of F-22s had all of their electronics crash a number of years ago after they crossed the International Date Line.
posted by backseatpilot at 10:22 AM on May 1, 2015 [5 favorites]


So the Millennium Bug was just biding its time and waiting for our guard to drop?
posted by sobarel at 10:23 AM on May 1, 2015 [3 favorites]


Would it be unusual to leave it powered up for that long?
posted by thelonius at 10:31 AM on May 1, 2015 [2 favorites]


Even older: the Patriot missile clock bug
On the 25th February 1991, Iraqi forces targeting an airfield in Dhahran, Saudi Arabia launched a Scud missile. Six Patriot batteries were assigned to protect the airfields and seaports of Dhahran; in particular, Alpha battery was the one assigned the targeted airfield.

Alpha battery had been in continuous operation for over 100 consecutive hours, and the inaccuracy resulting from the software bug was roughly 0.34 seconds. However, this meant that the range gate could not successfully track the incoming Scud (travelling at roughly 1.7km/sec, so the time difference resulted in the range gate scanning an area of air space more than half a kilometre away from the missile). See Appendix A for more details.

No Patriot missiles were launched to intercept the incoming Scud, which successfully hit a warehouse being used by the U.S. Army as a barracks, killing 28 soldiers, and another 98 people were injured.
posted by djb at 10:33 AM on May 1, 2015 [8 favorites]
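
The 0.34-second figure quoted above can be reproduced with a little arithmetic. This is only a sketch, and it rests on an assumption that is not in the quoted text: that the system counted time in tenths of a second and converted to seconds using a binary approximation of 0.1 chopped after 23 fractional bits (the commonly cited explanation). The names are illustrative.

```python
from fractions import Fraction

# Assumption (not in the quoted text): 0.1 s was stored as a binary
# fraction chopped after 23 fractional bits.
FRACTION_BITS = 23
stored_tenth = Fraction(int(Fraction(1, 10) * 2**FRACTION_BITS), 2**FRACTION_BITS)
error_per_tick = Fraction(1, 10) - stored_tenth   # seconds lost per 0.1 s tick


def clock_drift_seconds(hours_up):
    """Accumulated clock error after `hours_up` hours of continuous operation."""
    ticks = hours_up * 3600 * 10                  # ten ticks per second
    return float(ticks * error_per_tick)


drift = clock_drift_seconds(100)   # the 100 hours mentioned in the quote
miss_km = drift * 1.7              # Scud speed of roughly 1.7 km/s, from the quote
print(f"drift = {drift:.3f} s, tracking error = {miss_km:.2f} km")
```

Under that assumption the drift comes out at about 0.34 s and the tracking error at a bit over half a kilometre, in line with the numbers in the quote.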


They are kind of like refrigerators; they run all the time.
posted by Walleye at 10:33 AM on May 1, 2015 [1 favorite]


Would it be unusual to leave it powered up for that long?

Yeah, that's over 8 months. Surely there's regular maintenance being done more often than that?
posted by Greg_Ace at 10:33 AM on May 1, 2015


I wonder if they've tried rolling the planes' Windows down and back up.
posted by Dashy at 10:36 AM on May 1, 2015 [5 favorites]


A new version of Java is ready to be installed. Click here to continue.
posted by disclaimer at 10:40 AM on May 1, 2015 [12 favorites]


And this is Boeing, the company whose software development practices are legendarily slow and cautious.

Try not to think too hard about Airbus.

Yeah, that's over 8 months. Surely there's regular maintenance being done more often than that?

Doesn't matter. If you have an integer that's used as a counter, you must DTRT for overflow and underflow.
posted by ocschwar at 10:40 AM on May 1, 2015 [4 favorites]


The inflight wifi is on the same network as the computers that control the plane?

What was the rationale for that?
posted by mccarty.tim at 10:41 AM on May 1, 2015 [13 favorites]


American Airlines also had several dozen of its planes grounded by a flawed iPad app update this week, and the FAA quietly disclosed an attack on its networks earlier in the month.

It feels like there has been an uptick in discussion of plane bugs and vulnerabilities this year (the GAO recently issued a fairly detailed report on the latter) after what felt - at least to a layperson - like years of murmuring about theoretical problems and attacks. The FBI's recent detention of a researcher for disclosing a fairly trivial potential hack doesn't bode all that well for the response, though.
posted by ryanshepard at 10:42 AM on May 1, 2015 [2 favorites]


I imagine each plane will now have a conspicuously posted notice that reads, "000 Days Since Last Powercycle."
posted by ogooglebar at 10:43 AM on May 1, 2015


Why would you design an aircraft to have control systems and passenger wifi to use the SAME NETWORK? Does that really mean what I read it to mean? i.e. Everyone's on the same physical segment and the same logical subnet? Because that is capital S stupid. That just can't be right.
posted by ursus_comiter at 10:44 AM on May 1, 2015 [14 favorites]


File under today's viral scare story, I suspect.
posted by GallonOfAlan at 10:47 AM on May 1, 2015


Why would you design an aircraft to have control systems and passenger wifi to use the SAME NETWORK?

I'm gonna guess: weight savings.
posted by slater at 10:50 AM on May 1, 2015


FYI: (2^31)/24/60/60/100 = 248.55
posted by GuyZero at 10:51 AM on May 1, 2015 [30 favorites]
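
GuyZero's arithmetic can be checked, and the failure mode made visible, with a short sketch. It assumes what the comment implies but the FAA text does not state: a signed 32-bit counter incremented 100 times per second. Python integers never overflow, so `to_int32` is a helper written here to emulate a two's-complement register.

```python
INT32_MAX = 2**31 - 1
TICKS_PER_SECOND = 100            # assumption: one tick every 10 ms
SECONDS_PER_DAY = 24 * 60 * 60


def to_int32(value):
    """Wrap an integer the way a two's-complement 32-bit register would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


days_until_overflow = 2**31 / TICKS_PER_SECOND / SECONDS_PER_DAY
next_tick = to_int32(INT32_MAX + 1)   # one tick past the largest positive value

print(f"{days_until_overflow:.2f} days")   # 248.55 days
print(next_tick)                           # -2147483648
```

The interesting part is the second line of output: one tick after the maximum, the counter is suddenly a large negative number, which is the kind of value that code written for "time since power-on" rarely expects.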


So basically, this is like an undergrad-level rookie mistake.
posted by GuyZero at 10:51 AM on May 1, 2015 [1 favorite]


OH MY GOD THEY HAD TO REBOOT A PLANE TO FIX AN ERROR
posted by Theta States at 10:52 AM on May 1, 2015 [1 favorite]


File under today's viral scare story, I suspect.

Oh, hardly. I just couldn't believe that one of the world's front line airliners is subject to a fatal 32-bit counter overflow bug, and that the recommended fix - the recommended fix! - is to turn it off and then on again.
posted by RedOrGreen at 10:53 AM on May 1, 2015 [19 favorites]


Technically they only had to reboot a subsystem. The real problem is when you have to reboot all the redundant copies of the system at the same time which sort of undermines the whole redundancy thing.
posted by GuyZero at 10:55 AM on May 1, 2015


This sounds like basically the same bug that afflicted Windows 95 and Windows 98. Although there the (less-potentially-literal) crash was after 2^32 milliseconds, or 49.7 days. I suppose a 248-day failure means there's a 32-bit 200Hz counter now? (edit: GuyZero's 100Hz signed counter sounds more plausible)

At least they found the 787 problem in the lab. What was hilarious in the Microsoft case was that, for four years, every consumer Windows system would crash after a month and a half, but nobody noticed the pattern, because since when would Windows 9x make it a whole month and a half without crashing?
posted by roystgnr at 10:56 AM on May 1, 2015 [9 favorites]


OH MY GOD THEY HAD TO REBOOT A PLANE TO FIX AN ERROR

This is actually a pretty big problem, if, say, the plane is flying. And all of its flight control systems are electrically powered. They don't break when it's convenient to fix them.
posted by backseatpilot at 10:56 AM on May 1, 2015


Reminded me of this old cartoon for some reason.
posted by craven_morhead at 10:57 AM on May 1, 2015 [5 favorites]


Why would you design an aircraft to have control systems and passenger wifi to use the SAME NETWORK?

So the pilot can sit in First Class, controlling the plane via his iPad. Obviously.
posted by happyroach at 11:00 AM on May 1, 2015 [2 favorites]


Everyone's on the same physical segment and the same logical subnet? Because that is capital S stupid. That just can't be right.

According to a Wired article:

Boeing 787 Dreamliner jets, as well as Airbus A350 and A380 aircraft, have Wi-Fi passenger networks that use the same network as the avionics systems of the planes, raising the possibility that a hacker could hijack the navigation system or commandeer the plane through the in-plane network, according to the US Government Accountability Office, which released a report about the planes today.

A hacker would have to first bypass a firewall that separates the Wi-Fi system from the avionics system. But firewalls are not impenetrable, particularly if they are misconfigured. A better design, security experts have warned for years, is to air gap critical systems from non-critical ones—that is, physically separate the networks so that a hacker on the plane can’t bridge from one to the other, nor can a remote hacker pass malware through the internet connection to the plane’s avionics system. As the report notes, because the Wi-Fi systems in these planes connect to the world outside the plane, it opens the door for malicious actors to also remotely harm the plane’s system.

posted by ryanshepard at 11:00 AM on May 1, 2015


The prospect of an in-flight failure reminds me of the Gimli Glider.
posted by ogooglebar at 11:03 AM on May 1, 2015 [8 favorites]


This is actually a pretty big problem, if, say, the plane is flying. And all of its flight control systems are electrically powered. They don't break when it's convenient to fix them.

YES I UNDERSTAND AND THAT IS WHY I AM EXASPERATED AND YELLING AND MAYBE I DID NOT CONVEY THE TONE PROPERLY IN MY ORIGINAL POST AND FOR THAT I AM SORRY MY GOD MAYBE I SHOULD GET THIS KEYBOARD FIXED.
posted by Theta States at 11:04 AM on May 1, 2015 [8 favorites]


Y'all won't go subscribing to the RISKS Digest now, will you? Not if you want to sleep easy anyway.
posted by merlynkline at 11:04 AM on May 1, 2015 [6 favorites]


Roy is alive and he's working at Boeing!
posted by Talez at 11:04 AM on May 1, 2015


Yeah, out of all of this, the public on the same physical network as avionics is really troubling. And a bit ironic given the locked hardened cockpit doors. That means any sort of vulnerability or back door could potentially be abused by someone with a remote device, without even being on the airplane. If malware is involved it can certainly be on a timer or even waiting to respond to some specific avionics input (perhaps broadcast on the ADS-B channels). The potential for a devastating coordinated attack is frightening.

It's hard for me to believe that none of the engineers ever saw Battlestar Galactica. So, we must conclude that they are indeed the Cylons.
posted by meinvt at 11:06 AM on May 1, 2015 [1 favorite]


I have flown on the Dreamliner a couple of times now. It wasn't much different than flying on an older plane. The one negative feature is that the cool computerized-tinting of the windows also means that it's never completely dark in the cabin.
posted by Nevin at 11:07 AM on May 1, 2015


In the air and in the sea. "While Microsoft continues to trumpet the success of its NT operating system over Unix-based systems, the US Navy is having second thoughts about putting NT at the helm. A system failure on the USS Yorktown last September temporarily paralyzed the cruiser, leaving it stalled in port for the remainder of a weekend."
posted by Gungho at 11:07 AM on May 1, 2015 [1 favorite]


[T]he US Navy is having second thoughts about putting NT at the helm. A system failure on the USS Yorktown last September temporarily paralyzed the cruiser, leaving it stalled in port for the remainder of a weekend.

Blue screen of slightly less death.
posted by The Bellman at 11:11 AM on May 1, 2015 [9 favorites]


And then there's my mechanic uncle, who works for a legacy carrier and, with a straight face, refers to Boeing's main competitor as "Scarebus."
posted by wcfields at 11:12 AM on May 1, 2015 [1 favorite]


Just to be clear, the researcher the FBI detained was someone who talked about connecting to flight emergency systems during a flight, and heavily implied that he had done so. Here is the specific tweet that got him detained (not arrested, just detained and questioned). I hate on the FBI a lot, but this seems fairly reasonable to me.

Integer overflows are everywhere in software. Luckily, clang has just added new support for finding them and fixing them! Check out IntegerSanitizer based on John Regehr's excellent work.
posted by yeahwhatever at 11:15 AM on May 1, 2015 [5 favorites]


According to a Wired article:

Boeing 787 Dreamliner jets, as well as Airbus A350 and A380 aircraft, have Wi-Fi passenger networks that use the same network as the avionics systems of the planes, raising the possibility that a hacker could hijack the navigation system or commandeer the plane through the in-plane network, according to the US Government Accountability Office, which released a report about the planes today.


This is some really terrible reporting by Wired; you can read the report yourself and it says nothing of the sort. Most of the report is speculative; it's making the case for strong certification requirements to prevent future aircraft from having security vulnerabilities because of this kind of mixed network. It doesn't say anything about Wi-Fi sharing a network with avionics, and the only time it mentions Boeing or Airbus is on page 20, where it points out that the FAA is already issuing special network security requirements on a case-by-case basis.

This is yet another example of how a sensationalized story can be halfway around the world before the truth has got its boots on.
posted by teraflop at 11:16 AM on May 1, 2015 [4 favorites]




I still don't get what an inflight wifi system needs to share with the plane's computers, other than electricity. Do they have some of the plane's antennas or radio equipment share some hardware? It looks like most inflight wifi is cellular tower or cellular/satellite hybrid based.

I'm more curious as to what's shared than trying to pick apart the idea. When they say the same network with a firewall, does that literally mean all that stops a passenger from sending a packet to the computers controlling the plane is a list of rules in the firewall? Or is this a more esoteric attack?
posted by mccarty.tim at 11:18 AM on May 1, 2015


GuyZero:

I ran the same calculation, and hazard two guesses:

1) They're using a signed int32 to count 10ms clock intervals
2) They're using an unsigned int32 to count phases off 50Hz AC

I have no idea if the AC system is actually 50Hz, and (1) seems like a super-idiotic way to code things, but, given my understanding of the aerospace industry, it wouldn't surprise me.
posted by 7segment at 11:19 AM on May 1, 2015 [1 favorite]
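
Both guesses can be checked by working backwards from the 248.55-day figure to the tick rate each counter type would need. This is a sketch only; the function names are made up for illustration and nothing here comes from Boeing or the FAA.

```python
SECONDS_PER_DAY = 86_400


def implied_tick_rate_hz(counter_range, days):
    """Tick rate at which a counter with `counter_range` steps runs out after `days`."""
    return counter_range / (days * SECONDS_PER_DAY)


def days_to_overflow(counter_range, tick_rate_hz):
    """Days until a counter with `counter_range` steps runs out at `tick_rate_hz`."""
    return counter_range / tick_rate_hz / SECONDS_PER_DAY


OBSERVED_DAYS = 2**31 / 100 / SECONDS_PER_DAY   # GuyZero's 248.55 days

signed_rate = implied_tick_rate_hz(2**31, OBSERVED_DAYS)     # signed int32
unsigned_rate = implied_tick_rate_hz(2**32, OBSERVED_DAYS)   # unsigned int32
unsigned_50hz_days = days_to_overflow(2**32, 50)

print(f"signed:   {signed_rate:.0f} Hz")
print(f"unsigned: {unsigned_rate:.0f} Hz")
print(f"unsigned at 50 Hz: {unsigned_50hz_days:.0f} days")
```

A signed counter implies 100 Hz (guess 1), and an unsigned one implies 200 Hz. An unsigned counter stepping 50 times a second would last roughly 994 days, so guess 2 only fits if something is being counted four times per 50Hz cycle.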


2004: On a plane being backed out onto the tarmac when suddenly we stop. The flight crew came over the intercom and said they had an error with one of the engines and a tech was coming to look at it.

We sat there for 45 minutes. Tech works on the engine. Tech finishes. We start on our way.

The flight crew comes back over the intercom and says "The error cleared after the tech rebooted the engine".

I was convinced we were all gonna die.
posted by Annika Cicada at 11:19 AM on May 1, 2015 [7 favorites]


On review, teraflop seems to have answered my question! Thanks!

It's a good thing they're coming up with these best practices now. I can see how future aviation communication and navigation designs could benefit from internet connectivity, but airgapped, redundant network systems and digital signatures on any message meant to control or guide the plane would be the bare minimum I'd not break into a cold sweat about. And those improvements would have to be enhancements to existing technologies, kind of like how AGPS works just fine without cell towers or the internet, but it gets a better fix faster than plain GPS when they are available.
posted by mccarty.tim at 11:21 AM on May 1, 2015


Since this could happen in the air, all I can say is RATs.
posted by tigrrrlily at 11:23 AM on May 1, 2015 [1 favorite]


Integer overflows are everywhere in software.
Well yes but you'd hope software in these sorts of systems would have somewhat different standards applied to review and test etc than the latest jewel crushing app or whatever.

(1) seems like a super-idiotic way to code things
Hard to say unless you're deeply familiar with the rest of the system it's embedded in. Which the author hopefully was.
posted by merlynkline at 11:26 AM on May 1, 2015


Can we infer from this that the GCUs are 32 bit processors? Why not 64 bit? (Are embedded systems still, generally speaking, 32 bit?)

(I'm having the hardest time not expanding "GCU" as "general contact unit"...)
posted by golwengaud at 11:30 AM on May 1, 2015 [2 favorites]


I can see how future aviation communication and navigation designs could benefit from internet connectivity, but airgapped, redundant network systems and digital signatures on any message meant to control or guide the plane would be the bare minimum I'd not break into a cold sweat about.

Part of the problem is that most of the "nextgen" comm/nav/surveillance solutions we're implementing now were conceived well before cybersecurity was a huge concern. So it's really not just the network within the aircraft, but the whole end-to-end system. ADS-B isn't encrypted at all, civilian GPS can be spoofed, it's all free and clear over the air.

Is it a huge risk at this point? This isn't my specific area of expertise, but I'd argue no, not immediately. However, it's good that people are starting to pay attention to it, because aircraft are only going to get more interconnected and more reliant on digital products that can be exploited.
posted by backseatpilot at 11:31 AM on May 1, 2015


FYI: (2^31)/24/60/60/100 = 248.55

So, they can fix the issue by changing to an unsigned integer!
posted by ocschwar at 11:37 AM on May 1, 2015 [2 favorites]


Would it be unusual to leave it powered up for that long?

Extremely, given that they'll turn the planes off overnight when they're just sitting at a gate, and they'll also turn them off when they do the A check, which demands a full power-on test of all systems. That happens roughly every 250 flight hours or 300 cycles, depending on whether it's a short-haul aircraft (so lots of cycles) or long-haul (lots of flight hours). A checks are usually done overnight at the gate.

So, yeah, I don't see a 788 or 789 going just shy of three quarters of a year without a power cycle. Hell, it's going to get a B check every 6 months. That check involves pulling the engines and APU off. That will definitely shut the plane down. B checks on a wide body are 2-3 days in a hangar.

However....

Doesn't matter. If you have an integer that's used as a counter, you must DTRT for overflow and underflow.

Exactly. The reason this shit keeps happening -- the reason our software keeps crashing and the bad guys keep finding holes is that lazy developers don't follow basic practices.

However, this is not a big deal. Basically, everyone is going to be told "Power cycle the GCUs once every quarter until a patch comes out" and not a single airline in the world is going to need to write an actual directive, because it'll be covered by the A checks. But I'm sure there will be a check written as part of the daytime start (where you check a little more than if you are just flying a turn) that says "check last date of GCU power cycle" and they'll look in the logbook, and if it's more than 160 days ago, they'll restart the GCU, either right away (if it's easy) or after the next flight (if it takes some effort.)

I have no idea if the AC system is actually 50Hz

Actually, it's not, it's 400Hz. Why so high? Makes AC transformers very small and light, which is ideal. Bad -- conductor losses are high, making it bad for long runs, but that's not an issue in aircraft.

This also explains why seatback power didn't offer 110V/220V AC, even though the plane's system often ran off 115VAC. It was 115VAC @400Hz. Nowadays, power conversion is cheap and small enough that you can offer wall power at the seat back.
posted by eriko at 11:38 AM on May 1, 2015 [14 favorites]


Can we infer from this that the GCUs are 32 bit processors? Why not 64 bit? (Are embedded systems still, generally speaking, 32 bit?)

By installed volume they're probably 8-bit generally speaking. 32 bits is a lot for an embedded system.

Why not 64? Because a) you don't need a 64-bit bus and b) it's way more expensive. It's not just bits in the CPU, it's traces on the board, RAM configuration, etc.

32 bits ought to be enough for anybody.
posted by GuyZero at 11:43 AM on May 1, 2015 [4 favorites]


Nowadays, power conversion is cheap and small enough that you can offer wall power at the seat back.

On the planes I get to fly on the only thing I get from the seat back is that uncomfortably close feeling.
posted by tommasz at 11:44 AM on May 1, 2015 [2 favorites]


The flight crew comes back over the intercom and says "The error cleared after the tech rebooted the engine".

I was convinced we were all gonna die.


The reason the gag in The IT Crowd is funny is because it really is a solution to like 90% of all engineering problems.
posted by GuyZero at 11:45 AM on May 1, 2015 [1 favorite]


American Airlines also had several dozen of its planes grounded by a flawed iPad app update this week

That one was funny, and it didn't result in a massive cry for the return of the old paper charts and calculators, because almost all the AA pilots remember the day the October 15 cycle charts didn't make it onto the 727s and MD-80s, which grounded every single one of them from flying instrument approaches, because they didn't have the current charts.

Which meant they couldn't land at DFW or ORD, which meant they were basically screwed until they got the charts, because an AA aircraft that can't fly to the two biggest AA hubs is basically useless.

Losing a few of the iPads was an annoyance at best, and I'm hoping AA has learned an important lesson, which is Update the first officer's iPads on Monday, wait until Wednesday, and if they still work, then update the Captain's. :-) I hope it also put a better sense of QA into the software team, but I doubt it. Though, unlike most bugs, this one cost AA actual observable money. They can point and say "This is what that bug cost us."

But seriously -- getting the charts and such onto the iPads has been a huge win compared to the nightmare of getting the new ones rolled out to the fleet every quarter. Every airline is either going to tablets or has gone to tablets.

I mean, yes, these are annoying, but you guys forget the amazing stupids that happened when we tracked all this stuff on paper. That's when we well and truly lost your luggage -- because it went interline, then international, and nobody had any idea either how, or that, your bag ended up in Karachi.

Now, they know that it did, and can get it back.
posted by eriko at 11:46 AM on May 1, 2015 [12 favorites]


32 bits ought to be enough for anybody.

And 640k of RAM.
posted by Greg_Ace at 11:49 AM on May 1, 2015 [2 favorites]


Boeing 787 Dreamliner jets, as well as Airbus A350 and A380 aircraft, have Wi-Fi passenger networks that use the same network as the avionics systems of the planes, raising the possibility that a hacker could hijack the navigation system or commandeer the plane through the in-plane network, according to the US Government Accountability Office, which released a report about the planes today.

[A security researcher] was able to connect to a box under his seat on several occasions, allowing him to view data from the aircraft's engines, fuel and flight-management systems

But don't worry:
Could a hacker *really* bring down a plane from a mobile phone in seat 12C?
The answer is, "That's very, very, unlikely."

posted by T.D. Strange at 11:52 AM on May 1, 2015


Why would you design an aircraft to have control systems and passenger wifi to use the SAME NETWORK?


[Source of the following explanation]

While they use Ethernet physical layers and standard Enet frame formats, everything else is quite customized. Go search for "AFDX" on Google - stands for Avionics Full-Duplex Switched Ethernet - standardized as ARINC 664. The avionics data networks on those two planes are built on this technology. One of its main characteristics is that all message flows are pre-declared, and their bandwidths pre-allocated. The source/destination of all message flows (called Virtual Links in AFDX-parlance) are statically stored in the switches, which validate the source/destination/port # of every frame that flows through them.

With this architecture you cannot get an arbitrary frame onto the network. If you try to inject one, the switch you are connected to will reject it unless it is a predeclared message. I highly doubt that the switches have been programmed to accept write messages sourced from the IFE system into an avionics computer.

The other way is likely - given that IFE equipment (and possibly Cabin systems for the flight attendants) may need information like aircraft position, speed, etc - at least what is needed to show the moving map to the passengers. If this researcher has claimed to have seen such messages from the WiFi network, he is foolish to think that if he changed them on that network that the aircraft would be affected one iota.

The primary needed connection is to/from the Satcom links. It is expensive and heavy to duplicate satcom radios and antennas, so I understand that the link is shared between IFE (e.g. passenger WiFi) and cockpit systems. But messages are not pre-programmed into the switches to accept control commands from a Satcom link and flow them to an autopilot, or engine control system and have them accept it as a legitimate input. I am highly skeptical of claims of being able to remotely control (crash) an airplane from the ground. I don't believe the static connections are present. The worst case scenario I have seen amongst my aviation colleagues is a corrupted flight plan sent to the flight deck, accepted without review by the pilots, and then having the pilots not notice they aren't going where they thought (with lots of alarms from the EGPWS if they are heading into terrain).
posted by DreamerFi at 11:53 AM on May 1, 2015 [23 favorites]
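
The filtering behaviour described above can be pictured as a fixed allow-list that the switch consults for every frame. The following is a toy model, not an implementation of ARINC 664: the class names, field names, node names and port numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualLink:
    """One predeclared message flow (hypothetical fields for illustration)."""
    source: str
    destination: str
    port: int


class StaticSwitch:
    """Toy switch that forwards only frames matching its fixed configuration."""

    def __init__(self, allowed_links):
        # Loaded once at configuration time; nothing adds to it at runtime.
        self._allowed = frozenset(allowed_links)

    def forward(self, source, destination, port):
        """Return True if the frame is forwarded, False if it is dropped."""
        return VirtualLink(source, destination, port) in self._allowed


# Hypothetical configuration: avionics may publish position data to the IFE,
# but no link is declared in the opposite direction.
switch = StaticSwitch([VirtualLink("nav_computer", "ife_server", 5001)])

print(switch.forward("nav_computer", "ife_server", 5001))   # True
print(switch.forward("ife_server", "autopilot", 6000))      # False
```

In this model an attacker on the IFE side can send whatever they like, but a frame that does not match a declared flow never leaves the switch. Whether real configurations are this strict is exactly the point the quoted explanation is arguing about.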


Can we infer from this that the GCUs are 32 bit processors? (I'm having the hardest time not expanding "GCU" as "general contact unit"...)

That kind of GCU is also 32-bit. At most.
posted by ROU_Xenophobe at 12:34 PM on May 1, 2015 [2 favorites]


Typical ROU prejudice.
posted by ursus_comiter at 1:19 PM on May 1, 2015 [2 favorites]


Wow. I'm used to 32 bit SNMP counters rolling over at about 497 days, but limiting it to half that time for uptime seems a bit restrictive for something designed this century.
posted by rmd1023 at 1:25 PM on May 1, 2015


In the 90s a large retail chain ran a version of Unix that relied on a software licensing authorization process. The authors of said process made a similar mistake.

There were hundreds of remote servers with uptime > 247 days.

The 248th day was bad. shudder
posted by grimjeer at 2:02 PM on May 1, 2015 [1 favorite]


On some 64 bit systems, e.g. 64 bit Mac and iOS using Xcode, "int" is still 32 bit. You need to use "long" to get a 64 bit integer. NSInteger is 64 bit though. Yes, people get confused.

So you could get this same bug on a 64 bit machine just by making a false assumption about sizeof(int).
posted by w0mbat at 2:14 PM on May 1, 2015
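
The sizes in question are easy to inspect without writing any C. This sketch uses Python's standard `ctypes` module, which reports the C type sizes for the platform the script runs on; the tick value is just an illustration.

```python
import ctypes

# Sizes of the basic C integer types on the platform running this script.
sizes = {
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
    "long long": ctypes.sizeof(ctypes.c_longlong),
    "void *": ctypes.sizeof(ctypes.c_void_p),
}
for name, size in sizes.items():
    print(f"{name:10s} {size * 8} bits")

# Storing a tick count in a 32-bit field silently discards the high bits,
# no matter how wide the machine's pointers are.
ticks = 2**31                            # one more than a signed 32-bit int can hold
truncated = ctypes.c_int32(ticks).value
print(truncated)                         # -2147483648
```

On a typical 64-bit desktop the first line prints 32 bits for `int` while the pointer is 64 bits, which is the mismatch the comment warns about.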


The reason the gag in The IT Crowd is funny is because it really is a solution to like 90% of all engineering problems.

And the reason for that is that as humans we're just not very good (relatively speaking) at state management. Cycle the power and you reset the state.

At least it seems like in the past few years that concept has begun to penetrate the consciousness of more programmers, leading to the increased popularity of functional programming languages like Clojure & Haskell, and functional programming techniques used in non-functional languages, e.g. React. Combine functional programming with formal verification, and you have a strong foundation for writing code that has fewer bugs, including security vulnerabilities, e.g. the "High-Assurance Cyber Military Systems" (UAV onboard control software) developed for DARPA. Think about how many years we've been making web sites, and we still can't prevent hackers from getting in. We really don't want to go through that whole cycle with planes, cars, and drones.

tldr: https://www.flickr.com/photos/paulbarry/2613013337/
posted by jjwiseman at 2:23 PM on May 1, 2015 [3 favorites]


The Linux kernel switched in 2.5 (2003) to start jiffies at nearly the maximum value, so that overflow would occur a few minutes into boot and make debugging easier:
/*
 * Have the 32 bit jiffies value wrap 5 minutes after boot
 * so jiffies wrap bugs show up earlier.
 */
#define INITIAL_JIFFIES (-1UL & (unsigned long)(-300*HZ))
The code in <linux/jiffies.h> has lots of comments about how to safely deal with the overflow, which requires quite a bit of thought. Having it happen early in the boot means that all device driver authors must contend with it.
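
A rough Python emulation of the idea, for readers who do not want to dig through the header: the counter starts 300 seconds short of the 32-bit wrap, and comparisons are done on the signed difference rather than on the raw values. The `HZ` value is an assumption (kernels are built with various tick rates), and `time_after` here only mirrors the idea behind the kernel macro of the same name.

```python
HZ = 1000                 # assumption: ticks per second
MASK32 = 0xFFFFFFFF


def to_int32(value):
    """Interpret the low 32 bits of `value` as a signed two's-complement number."""
    value &= MASK32
    return value - 2**32 if value >= 2**31 else value


# Same idea as INITIAL_JIFFIES on a 32-bit build: start 300 s before the wrap.
INITIAL_JIFFIES = (-300 * HZ) & MASK32


def time_after(a, b):
    """True if tick value `a` is later than `b`, even across a wrap.

    Only valid while the two values are less than 2**31 ticks apart.
    """
    return to_int32(b - a) < 0


before_wrap = INITIAL_JIFFIES
after_wrap = (INITIAL_JIFFIES + 301 * HZ) & MASK32   # 301 s later, already wrapped

print(after_wrap > before_wrap)              # False: the naive comparison is fooled
print(time_after(after_wrap, before_wrap))   # True: the signed difference is not
```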

A similar roll-over bug resurfaced a few years ago in Linux 2.6. This time a 64-bit performance counter would overflow after 208 days of uptime. This was fixed in 2.6.32.50:
sched, x86: Avoid unnecessary overflow in sched_clock
    
    In hundreds of days, the __cycles_2_ns calculation in sched_clock
    has an overflow.  cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing
    the final value to become zero.  We can solve this without losing
    any precision.
    
    We can decompose TSC into quotient and remainder of division by the
    scale factor, and then use this to convert TSC into nanoseconds.
posted by autopilot at 2:25 PM on May 1, 2015 [5 favorites]
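
The quotient-and-remainder trick in that commit message can be illustrated in a few lines. This is a sketch under stated assumptions, not the kernel code: it assumes a fixed-point scale factor of 2**10 and a hypothetical 2.4 GHz cycle counter, and it uses a mask to emulate 64-bit truncation because Python integers do not overflow.

```python
MASK64 = (1 << 64) - 1
SCALE_SHIFT = 10          # assumption: fixed-point scale factor of 2**10


def cycles_to_ns_naive(cycles, mult):
    """cycles * mult / 2**SCALE_SHIFT, with the product truncated to 64 bits."""
    return ((cycles * mult) & MASK64) >> SCALE_SHIFT


def cycles_to_ns_split(cycles, mult):
    """Same conversion, split so the large product is never formed."""
    quotient = cycles >> SCALE_SHIFT
    remainder = cycles & ((1 << SCALE_SHIFT) - 1)
    return (quotient * mult + ((remainder * mult) >> SCALE_SHIFT)) & MASK64


# Hypothetical 2.4 GHz counter: one cycle is 1/2.4 ns, scaled by 2**10.
mult = round((1 << SCALE_SHIFT) / 2.4)
cycles_per_day = 2_400_000_000 * 86_400
cycles = 250 * cycles_per_day            # 250 days of uptime

exact = (cycles * mult) >> SCALE_SHIFT   # reference value, computed without truncation
print(cycles_to_ns_naive(cycles, mult) == exact)   # False
print(cycles_to_ns_split(cycles, mult) == exact)   # True
```

With these assumed numbers the naive product first exceeds 64 bits a little after 208 days, which is consistent with the uptime mentioned above, and the split version gives the same answer as the untruncated calculation.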


Windows has installed new updates. Your airliner will be restarted today.

GIVE ME A TIMEFRAME GODDAMMIT
posted by disclaimer at 2:34 PM on May 1, 2015


Ctrl-Alt-Itude
posted by chavenet at 2:43 PM on May 1, 2015 [2 favorites]


I think we could all agree that hacking a 787 and running it into the ground would be tragic, but what about taking it for a Grand Canyon run or doing a barrel roll or buzzing the Jersey shore?
posted by ReeMonster at 2:49 PM on May 1, 2015 [2 favorites]


So basically, this is like an undergrad-level rookie mistake.

Sort of, but it can be difficult to spot, and even people who should know better can make this mistake. We would have to know the particular details of this bug to know what level of scoffing is appropriate.

As roystgnr says, Windows has this problem. And while that particular component of Windows causing the 9x systems to crash was fixed, the underlying API is still there and lots of programmers are still failing to account for the overflow. Specifically, it is GetTickCount() that is the source of the problems.

And just like Linux, as autopilot points out, internal Windows builds used by the development team and beta testers start "one hour before 32-bit timer tick rollover" to help find these bugs.

There is good work being done by the static analyzer folks to help flag these for developers, but they will probably never be 100%. It's a good thing to watch for in code reviews.
posted by jeffamaphone at 3:01 PM on May 1, 2015 [3 favorites]
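
The mistake usually looks like computing a deadline and comparing against it. Here is a sketch of the buggy and the safe pattern, emulating an unsigned 32-bit millisecond tick counter with a mask; the function names are invented for illustration and the tick values are passed in rather than read from any real API.

```python
MASK32 = 0xFFFFFFFF   # emulate an unsigned 32-bit millisecond counter


def timed_out_buggy(start, now, timeout_ms):
    """Breaks when the computed deadline or `now` wraps past 2**32."""
    deadline = (start + timeout_ms) & MASK32
    return now >= deadline


def timed_out_safe(start, now, timeout_ms):
    """Compare elapsed time instead; modular subtraction survives one wrap."""
    elapsed = (now - start) & MASK32
    return elapsed >= timeout_ms


start = 2**32 - 500                 # 500 ms before the counter rolls over
soon = (start + 100) & MASK32       # 100 ms later, not yet wrapped
later = (start + 1200) & MASK32     # 1200 ms later, wrapped around to 700

print(timed_out_buggy(start, soon, 1000))   # True: fires 900 ms too early
print(timed_out_safe(start, soon, 1000))    # False
print(timed_out_safe(start, later, 1000))   # True
```

The safe version works because unsigned subtraction modulo 2**32 gives the right elapsed time as long as the real interval is shorter than one full wrap.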


Nowadays, power conversion is cheap and small enough that you can offer wall power at the seat back.

I had a high-powered laptop with me on a flight, and it didn't work plugged into the seat-wall outlet. I asked the stewardess and she checked my model number and brought a lower-powered compatible adapter to use. I wonder how many different models they support this way, how many adapters they have in that bin.
posted by StickyCarpet at 3:11 PM on May 1, 2015


Sort of, but it can be difficult to spot, and even people who should know better can make this mistake. We would have to know the particular details of this bug to know what level of scoffing is appropriate.

So yes, you are right, software is hard (no sarcasm) and these sorts of mistakes are far too easy to make.

On the other hand, if you're writing embedded avionics software the stakes are higher and so are my expectations. Clearly this category of bugs is not new and as other comments have shown, there are ways to identify these types of issues.
posted by GuyZero at 3:18 PM on May 1, 2015


And the reason for that is that as humans we're just not very good (relatively speaking) at state management. Cycle the power and you reset the state.

Correct program state has about the same half-life as a grocery store banana.
posted by GuyZero at 3:19 PM on May 1, 2015 [2 favorites]


On some 64 bit systems, e.g. 64 bit Mac and iOS using Xcode, "int" is still 32 bit.

int is whatever int is documented to be on that platform. Period. There is no other correct answer there. Assuming that int will be *ANYTHING* wider than 16 bits in C-derived languages is a classic mistake. Ints have been 16 bits wide on 32-bit machines, 16 bits on 36-bit machines, 20 bits on 36-bit machines, 32 bits on 32-bit machines, blah blah blah blah blah.

If you need a certain range, you look up what the actual types are, and you declare appropriately. In C, a "long int" is at least 32 bits, but if you need 64, you need a "long long int" (since C99) to ensure that you have 64 bits, or you need an implementation that explicitly states that an int or a long is 64 bits.

Of course, you're now dealing with a portability trap if your current int is 64 bits and that code changes platform to an implementation where the int or long *isn't* 64 bits, so document the hell out of the fact that you require int/long to be 64 bit.
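
One way to document such a requirement so the compiler enforces it is a C11 static assertion. This is only a sketch: the specific width requirements and the helper seconds_to_ms are hypothetical, and a project that really needs a 64-bit long would add the same kind of check against LONG_MAX.

```c
#include <limits.h>

/* Compile-time documentation of the widths this code relies on. The build
 * fails, with the message shown, on a platform that does not provide them. */
_Static_assert(INT_MAX >= 2147483647,
               "this code assumes int holds at least 32 bits");
_Static_assert(LLONG_MAX >= 9223372036854775807LL,
               "this code assumes long long holds at least 64 bits");

/* Hypothetical helper: the arithmetic is done in long long, so the result
 * does not depend on how wide int or long happens to be. */
long long seconds_to_ms(long long seconds)
{
    return seconds * 1000LL;
}
```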
posted by eriko at 3:23 PM on May 1, 2015 [3 favorites]


when referring to Boeing's main competitor as "Scarebus"

That's been around for a long time. If you read pilot forums you might think it was the actual name of an aircraft manufacturer.

Good thing Colonel Panic is at the helm!
posted by spitbull at 3:34 PM on May 1, 2015


GuyZero: "The reason the gag in The IT Crowd is funny is because it really is a solution to like 90% of all engineering problems."

Anyone who has supported Windows NT in any way knows that rebooting was a ten to fifteen minute process.

I used to tell people to reboot just so I could get some coffee or take a leak or talk to one of the Unix guys and ask them why the one server isn't accepting/authenticating remote logins, and that's why Stacy can't log in from Denver. He'd fix it on his end, I'd walk back to my desk, hook Stacy up and look like an absolute genius.

Man, I miss NT4.
posted by Sphinx at 4:27 PM on May 1, 2015 [1 favorite]


Why would you design an aircraft to have control systems and passenger wifi to use the SAME NETWORK? Does that really mean what I read it to mean? i.e. Everyone's on the same physical segment and the same logical subnet? Because that is capital S stupid. That just can't be right.
posted by ursus_comiter at 10:44 AM on May 1


Have we learned nothing from Admiral Adama??
posted by You Can't Tip a Buick at 5:13 PM on May 1, 2015


You may have noticed that the Space Shuttle was never in flight over New Year's eve/day.
posted by LastOfHisKind at 5:14 PM on May 1, 2015 [1 favorite]


int is whatever int is documented to be on that platform. Period.

Yup. Which is why we have good ol' stdint.h, which defines int8_t, int16_t, int32_t, etc., and if you always code to those types you increase your chances of cross-platform building without these sorts of errors.

Ask an iOS developer who inherited a project that didn't use NSInteger, CGFloat, and all the other special typedefs how their 64-bit conversion went. Though we still have a hard time agreeing on just how to properly format longs in string formatting.
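
For plain C, a small sketch of the fixed-width approach, including the formatting side. The function name format_ticks is made up; the PRId64 macro from inttypes.h is standard C99 and expands to the right printf conversion for int64_t on whatever platform you build for.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Formats a 64-bit tick count into buf. PRId64 picks the correct length
 * modifier ("ld", "lld", ...) for int64_t on the target platform, so the
 * format string does not have to guess whether long is 32 or 64 bits.
 * Returns whatever snprintf returns. */
int format_ticks(char *buf, size_t len, int64_t ticks)
{
    return snprintf(buf, len, "ticks=%" PRId64, ticks);
}
```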
posted by jeffamaphone at 8:44 PM on May 1, 2015 [1 favorite]


I highly doubt that the switches have been programmed to accept write messages sourced from the IFE system into an avionics computer.

Just because the system was programmed to ignore write messages doesn't mean it will actually do so.
posted by MikeKD at 9:22 PM on May 1, 2015


This is pretty interesting for me as I've worked in software development and testing in the aviation industry since 2007 and even worked on a (different) 787 component as a tester. I think for the same company which developed the 787 GCU for Boeing.

One intriguing thing for me is the fact that the software is actually entering a fail-safe state as opposed to getting a ton of nuisance protection trips or controlling something incorrectly or resetting itself. There generally aren't that many conditions which can even cause a fail-safe to occur. One example, if the software fails an integrity check of the program stored in (presumably) flash memory, that would cause a fail-safe. The nice thing about fail-safes is that they can be pretty easy to backtrack and find out which particular fail-safe occurred (part of a fail-safe would likely involve saving a snapshot of information to the non-volatile memory on the GCU). From there it wouldn't be that hard to figure out the cause of the fail-safe was an overflow.

My worry if I were Boeing or the FAA would be whether the fail-safe is hiding other overflow issues. That is, once the fail-safe issue is fixed and the software runs for 8 months straight, will the software make some other mistake? Hopefully they are looking into that as well while they work on this issue, just in case. If Boeing was running a unit non-stop for 8 months they may very well have been looking for this kind of issue in the first place, so maybe 8 or 16 months after the fix is applied they'll come back saying they found some more issues.
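
As a back-of-the-envelope check on where a figure like 248 days could come from, here is a small sketch. The FAA directive does not state the counter's width or tick rate, so the signed 32-bit counter at 100 ticks per second used in the usage note is purely an assumption, and days_until_overflow is a hypothetical helper.

```c
#include <stdint.h>

#define SECONDS_PER_DAY 86400.0

/* Days until a counter with the given maximum value overflows when it is
 * incremented ticks_per_second times every second. */
double days_until_overflow(uint64_t max_count, double ticks_per_second)
{
    return (double)max_count / ticks_per_second / SECONDS_PER_DAY;
}
```

Under that assumption, days_until_overflow(INT32_MAX, 100.0) comes out at roughly 248.55 days, while an unsigned 32-bit millisecond counter, days_until_overflow(UINT32_MAX, 1000.0), gives roughly 49.7 days.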

Side note on the size of integers: the software I've worked with in the aviation industry requires the use of typedefs instead of built-in types when programming C. So something like int32_t instead of int. In fact I'm pretty sure this is checked by most of the code analyzers I've worked with since it's very easy to look for, unlike overflow issues which are harder to detect. Even if a code analyzer did detect a potential overflow issue, all it takes is the person reviewing the issue to mistakenly think it's not a problem and say the finding was a false positive.
posted by Green With You at 10:05 PM on May 1, 2015 [2 favorites]


Look at it from the bright side!

There are a million mistakes like this that coders can, and constantly do, make. The fact that the 787 guys messed this up means that they're that much less likely to ever make this particular mistake again in the future!

And they didn't even have to crash an airplane to figure this out! Win-win for everyone!
posted by Djinh at 2:24 AM on May 2, 2015




This thread has been archived and is closed to new comments