Pakistan vs. YouTube, BGP loses
March 2, 2008 11:41 AM

YouTube Hijacking: A RIPE NCC RIS case study is the definitive look at how actions of Pakistan Telecom caused the global outage of YouTube Sunday the 24th of Feb. 2008. This incident has exposed weaknesses of the Border Gateway Protocol as is outlined by Danny McPherson from Arbor Networks as well as on the Renesys blog.
posted by gen (33 comments total) 4 users marked this as a favorite
 
Youtube was down?
posted by flatluigi at 11:55 AM on March 2, 2008


Pakistan causes worldwide YouTube outage

Did YouTube pull the offending video? I've read conflicting reports.
posted by mediareport at 12:15 PM on March 2, 2008


What would prevent a bunch of script kiddies from completely scrambling the internet by sending lots of these kinds of messages? I assume these messages cannot be sent by end-users, so what kind of system would they have to break into to gain authorization to send these kinds of messages?

To put the question a different way, could a hostile nation like Iran do it? Or North Korea?
posted by Class Goat at 12:16 PM on March 2, 2008


This is why you should outsource your censorship to China, where the real talent is.
posted by Abiezer at 12:20 PM on March 2, 2008 [1 favorite]


It's pretty impressive how quickly the YouTube network guys responded.
posted by smackfu at 12:21 PM on March 2, 2008


What would prevent a bunch of script kiddies from completely scrambling the internet by sending lots of these kinds of messages?

Security between the core routers.

I assume these messages cannot be sent by end-users, so what kind of system would they have to break into to gain authorization to send these kinds of messages?

Cisco routers that are part of the trusted network. Most of these do not permit remote logins at all.

To put it a different way: if it was easy, it would have been done a very long time ago. The Pakistani 2nd tier ISP that passed up the bad route probably did so by mistake -- they were probably intending to hijack the subnet for local users only but accidentally propagated it upstream.

It doesn't make much sense for them to try to hijack it internationally -- that's an immediate ticket to going out of business as your company would be permanently on the blacklist.
posted by tkolar at 12:23 PM on March 2, 2008


Hey look, BGP finally blew something up that people care about.
posted by Skorgu at 12:44 PM on March 2, 2008 [4 favorites]


Some background for those of you who don't understand a word of what's going on here.

By now, I imagine most of you have probably noticed that when you type 'www.metafilter.com' in your browser and hit enter, your computer connects to this server and displays the web page there. But have you ever wondered just exactly how that works?

When you start a conversation with Metafilter, your computer assembles a connection-start packet, and sends it to its local gateway. Your local gateway forwards it into the Internet proper.

At each step of the way, there's a router. For each packet, the router has to determine which of its interfaces is the 'best' direction to send it.

So, typically, your packet flies from router to router, each tossing the packet closer and closer to MeFi, until it actually arrives. MeFi then replies, and the answer packet flies back, very possibly via a different path. The paths can change at any time, and often do. At each step, a router decides which way to send the packet, and forwards it on.

Your machine gets the answer packet, and then sends a final connection-established packet. Once again, several routers decide which way to send it, until it reaches MeFi again. This gets both machines in agreement about the parameters of the connection. Note that we've involved a lot of routers, each forwarding three packets, before we've even sent the http request!

At that point, your machine requests the actual data, MeFi's server answers, and many hundreds of packets start flying around -- most of them from MeFi to you. Finally, all the data has been sent, and the page is rendered. It looks like you just directly connected to MeFi, but a lot of other computers were involved. Despite how fast everything seems, there's a lot going on. For instance, there are 15 hops between this machine and www.metafilter.com, so in posting this message, all 15 hops will be making a forwarding decision on each and every packet that goes through.

Routing is the process of actually deciding which way to send a packet. In simple networks, most routing is done statically. That is, each router is pre-programmed with all the routes it needs to know. For instance, if you have a main office and a branch office, the IT staff would assign different net ranges to each building, and would program the routers to know where they are. If you, in office A, connect to a machine in office B, your local router knows to send those packets down the wire that goes to the other office.
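
As a rough sketch of what "pre-programmed" means here, a static routing table is just a fixed mapping from net ranges to outgoing interfaces. The address ranges and interface names below are made up for illustration; real routers are configured in their own syntax, not Python.

```python
import ipaddress

# Hypothetical static routes for the router in office A.
STATIC_ROUTES = {
    ipaddress.ip_network("10.1.0.0/16"): "lan-office-a",
    ipaddress.ip_network("10.2.0.0/16"): "wire-to-office-b",
}
# Anything not listed above goes out to the internet.
DEFAULT_INTERFACE = "uplink-to-internet"


def pick_interface(destination: str) -> str:
    """Return the interface a packet for `destination` should leave on."""
    addr = ipaddress.ip_address(destination)
    for network, interface in STATIC_ROUTES.items():
        if addr in network:
            return interface
    return DEFAULT_INTERFACE
```

Every new office means another entry in a table like this on every router that needs to reach it, typed in by hand.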

This works really well, up to a point. Past a certain level of complexity, it starts getting hard to track what net ranges are where. As offices start up and close down, or as net ranges need to expand to cope with more workers, the routes gradually get more and more complex, and there's more and more chance of error. In a complex, static-routed environment, adding a new network might require changes in 10 or 20 routers, and a typo while configuring any of them can potentially screw the whole company up.

Further, this doesn't allow for easy failover. Many companies think it's very important that the network stay up at all times, and will run redundant circuits. But telling 10 routers that Circuit A has failed, but Circuit B is still up, is very time-consuming. This causes outages. IT staff lives in constant fear, because people only notice us when things break. So, as you can imagine, we're very happy about things that automate these processes and make failures invisible to our clients.

Thus, routing protocols were born. Basically, these are just ways that routers tell neighboring routers what networks they know about. In that first simple example, Router B announces the network range in Office B, and Router A announces Office A and the 'default route'... the internet. When Office B sends traffic, that router knows to send all of it to Router A. Router A is the core 'decider'.... on any given packet, it routes to Office B, to Office A, or to the Internet.

In that simplest of examples, you don't gain anything over just static routing. But if you add in Office C, and then Offices D through H... each time, all the routers talk to each other and figure out what networks they know about, so the traffic goes to the right places. And if you add in redundant links, when something fails, the same protocol can immediately notify the whole network, which can then immediately start routing onto the standby circuits instead. This simplifies administration enormously.
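
The idea of routers telling their neighbors what they know can be sketched as a toy simulation. This is a simplified hop-count scheme, not any real protocol (and not how BGP itself works); the three-router topology, the network names, and the helper functions are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical topology: pairs of directly connected routers.
LINKS = [("A", "B"), ("A", "C"), ("B", "C")]
# Networks each router is directly attached to (made-up names).
LOCAL_NETWORKS = {"A": ["office-a", "internet"], "B": ["office-b"], "C": ["office-c"]}


def converge(links, local_networks):
    """Repeat neighbor-to-neighbor announcements until no table changes.

    Returns {router: {network: (hop_count, next_hop)}}.
    """
    neighbors = defaultdict(set)
    for left, right in links:
        neighbors[left].add(right)
        neighbors[right].add(left)

    # Each router starts out knowing only its own networks.
    tables = {r: {net: (0, r) for net in nets} for r, nets in local_networks.items()}
    changed = True
    while changed:
        changed = False
        for router in tables:
            for peer in sorted(neighbors[router]):
                for net, (hops, _) in list(tables[peer].items()):
                    known = tables[router].get(net)
                    # Keep the peer's route only if it is new or shorter.
                    if known is None or hops + 1 < known[0]:
                        tables[router][net] = (hops + 1, peer)
                        changed = True
    return tables


def without_link(links, dead):
    """Return the topology with one circuit removed, to model a failure."""
    return [link for link in links if set(link) != set(dead)]
```

With all links up, router B reaches the internet directly through A. Drop the A-B circuit with `without_link(LINKS, ("A", "B"))` and rerun `converge`, and B's route to the internet moves onto the path through C with nobody touching a config file.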

This scales up to the whole internet... most of the big routers in the Net run BGP, Border Gateway Protocol. Companies are assigned "AS numbers" -- AS means Autonomous System. YouTube, for instance, has an AS number. It's kind of like an ownership tag. These entities also get networks to use. They advertise those networks to their neighbors, combined with their AS number. So, with all these disparate links to all these different places, with all these routers running BGP.... all the routers run a fairly complex algorithm to try to determine the shortest path to every network they're told about.

So, okay, phew. You've got this giant cloud of thousands of routers, handling millions of networks, and everything's good. So now (cue evil music) Pakistan decrees that YouTube Shall Be Silenced. The ISP of Pakistan adds in a BGP advertisement for YouTube's network space, which their ISP propagates (a critical error)... and within two minutes, the entire Internet knows that YouTube's IP space is in Pakistan. Almost all YouTube traffic goes there, where it's discarded, and YouTube is off the air.

One of the rules of BGP is that the most specific route wins. YouTube normally advertises four class C networks with a single route; that is, they say that 'all the numbers between these two ranges belong to us, and should come here'. The Pakistan advertisement is for just one of their class C networks, so that's more specific, and thus Pakistan wins. (this is done so that you can say 'except'.... as in, "send traffic for this enormous network range over here, except for this tiny network, which goes here instead.")
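
A minimal sketch of the "most specific route wins" rule, using Python's standard `ipaddress` module. The two prefixes are the ones given in the linked RIPE NCC case study; the labels and the `lookup` helper are just for illustration, and a real router does this in hardware rather than by scanning a dictionary.

```python
import ipaddress

# Prefix -> who announced it. The /22 covers four class C networks;
# the /24 is one of those four.
ROUTES = {
    ipaddress.ip_network("208.65.152.0/22"): "YouTube",
    ipaddress.ip_network("208.65.153.0/24"): "Pakistan Telecom",
}


def lookup(routes, destination):
    """Return the label of the most specific route covering `destination`."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None
    # The longest prefix (largest prefixlen) is the most specific match.
    return routes[max(matches, key=lambda net: net.prefixlen)]
```

An address inside the /24, such as 208.65.153.10, matches both routes and follows the Pakistan Telecom one; an address in the rest of the /22 still goes to YouTube.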

The first thing YouTube tries is advertising just that one class C, the same as Pakistan, but that doesn't override Pakistan's route for most of the Net... only for the routers that are very physically close to YouTube. Most routers see that the two routes are the same specificity, so then they route by distance, and Pakistan is closer for a big chunk of the world. So then YouTube subdivides their class C into TWO routes, and tries advertising those... and for those routers that accept those advertisements, it's a more specific route, and thus YouTube wins.
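
Here is a rough sketch of that tug-of-war as seen from one distant router. It models only two steps of route selection (prefix length, then AS-path length); real BGP has more tie-breakers. The AS paths are invented, with 64500 to 64502 standing in for transit providers, while 17557 and 36561 are the Pakistan Telecom and YouTube AS numbers from the RIPE study.

```python
import ipaddress
from typing import NamedTuple


class Route(NamedTuple):
    prefix: str
    as_path: tuple  # ASes the announcement passed through, origin last


def best_route(routes, destination):
    """Pick the longest matching prefix; break ties on shorter AS path."""
    addr = ipaddress.ip_address(destination)
    covering = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    if not covering:
        return None
    return min(
        covering,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, len(r.as_path)),
    )


# Hypothetical view from a router that is "closer" to Pakistan Telecom.
HIJACK = Route("208.65.153.0/24", (64500, 17557))
YOUTUBE_24 = Route("208.65.153.0/24", (64501, 64502, 36561))
YOUTUBE_25S = [
    Route("208.65.153.0/25", (64501, 64502, 36561)),
    Route("208.65.153.128/25", (64501, 64502, 36561)),
]
```

With only the two /24 routes in the table, this router prefers `HIJACK` because its path is shorter. Once it also accepts the two /25 routes, those win on specificity no matter how long the path is.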

But, most of the Net ignores those two little advertisements. Why? Because there are so many global routes that most of the big providers simply refuse advertisements that are smaller than a certain size. Every new route adds to the load on their routers, and all the central routers are just barely keeping up. Most of the Net ignores small advertisements. Thus, YouTube is still mostly down.
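
The kind of size filter described here might look like the following sketch. The /24 cutoff is an assumption for the example; the actual limit is a policy choice that varies from provider to provider.

```python
import ipaddress

# Assumed policy: ignore anything more specific than a /24.
MAX_PREFIX_LEN = 24


def accept(prefix: str, max_len: int = MAX_PREFIX_LEN) -> bool:
    """Return True if the announced prefix is not too small to carry."""
    return ipaddress.ip_network(prefix).prefixlen <= max_len


ANNOUNCEMENTS = ["208.65.152.0/22", "208.65.153.0/24", "208.65.153.0/25", "208.65.153.128/25"]
ACCEPTED = [p for p in ANNOUNCEMENTS if accept(p)]
```

A provider filtering like this keeps the /22 and the /24 but never even sees YouTube's two /25 routes, which is why the fix only worked for part of the Net.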

Finally, Pakistan's ISP steps up and stops propagating Pakistan's illicit advertisement, which resolves everything about two minutes later.

Total downtime: about two hours.

So, what does this all mean? It means that BGP doesn't have much authentication, and even small mistakes can be spread Net-wide very quickly. If your router trusts my router, and my router trusts Evil Bad Guy's, then YOU trust Evil Bad Guy too. This is called 'transitive trust'. It's a very bad idea, and causes all kinds of computer security pain.

The upshot is that now they're talking about going to some kind of encryption/signature mechanism for routing... but that's a problem, because the core routers are just barely keeping up, and they really don't have much extra juice to verify that people aren't lying with their advertisements. Verifying a signature takes CPU power, and those central routers are so burdened that checking 10,000 routes will be very painful, and providers are going to want to avoid that if they can.

So, this is a moderately sticky problem, and there will be some wailing and gnashing-of-teeth in conference rooms everywhere. You can be quietly amused as the highly-paid network jocks have to earn their keep this year. :)
posted by Malor at 1:09 PM on March 2, 2008 [107 favorites]


It looks to me like everything worked exactly as it was intended to. I'm having quite a time even seeing what people are so up in arms about. Yes, some erroneous announcements were made, but they weren't the result of a technical flaw, they were the result of human error.

Do we really want a central oversight body to be dictating who gets to route what to where? To avoid a problem that crops up once every, what, 10 years?
posted by TheNewWazoo at 1:12 PM on March 2, 2008 [1 favorite]


(and it is possible that they won't actually change anything; they may decide that it's too expensive/painful to be worth fixing, and rely on the fact that the Net will nullroute people who screw with it.)
posted by Malor at 1:18 PM on March 2, 2008


It looks to me like everything worked exactly as it was intended to.

No, it didn't. Many people couldn't access YouTube for 2.5 hours. And a system that allows human error at a third party to deny service to some other network does have a technical flaw.
posted by grouse at 1:24 PM on March 2, 2008


Youtube was down?

MeFi was at a loss for posts.
posted by matteo at 1:32 PM on March 2, 2008 [5 favorites]


It looks to me like everything worked exactly as it was intended to.

Say it's your business that's affected. Say you're not the world-famous YouTube. Say you have to convince people that you didn't fuck up your own A record, and you don't even really know what happened. Still copacetic?

This isn't a whine-and-gnash-teeth sort of thing, it's an engineering issue, and the RIPE people are addressing it in a community approach. It's something that needs ... at least a tweak or two. But it will take a lot of stakeholders to agree to that, which is the real problem.
posted by dhartung at 1:55 PM on March 2, 2008


"MeFi's server answers, and many hundreds of packets start flying around"

Or it starts barfing ColdFusion errors at me once again... :)
posted by drstein at 2:45 PM on March 2, 2008


TheNewWazoo: Do we really want a central oversight body to be dictating who gets to route what to where? To avoid a problem that crops up once every, what, 10 years?

About all the infrastructure for that is already in place.

IP address blocks are handed out by RIRs (regional internet registries) such as ARIN, RIPE NCC, LACNIC, APNIC. They register such in their database. If you want to route a certain netblock you will have to register a route object in the respective database, but RIPE only allows you to do so if you have the proper credentials on the inetnum object, which they put in the database for you when the netblock was allocated to you.

Now, if you want to acquire upstream connectivity, your new provider can check for everything you announce in RIPE's IRRdb (internet routing registry database) for matching objects. If they're smart, they build filters automatically using tools like Merit's^WRIPE's^WISC's IRRToolSet (or easier-to-use like irrpt and Marco d'Itri's rpsltool).

The loophole that allowed this outage to exist was a failure in the last stage: an upstream provider not properly filtering a downstream customer.
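
A minimal sketch of that last stage, the upstream checking a customer's announcements against what the registry says the customer may originate. The registry contents, AS numbers, and the `permitted` helper are hypothetical (documentation-range values), and this is not the output or API of IRRToolSet, irrpt, or rpsltool. Allowing more-specific prefixes inside a registered block is a policy choice made for the example; a stricter filter would demand an exact match.

```python
import ipaddress

# Hypothetical registry data: customer AS -> prefixes it has route objects for.
REGISTERED = {
    64500: ["198.51.100.0/24"],
    64501: ["203.0.113.0/24", "192.0.2.0/24"],
}


def permitted(customer_as: int, prefix: str, registry) -> bool:
    """Accept an announcement only if it falls inside a registered prefix."""
    announced = ipaddress.ip_network(prefix)
    return any(
        announced.subnet_of(ipaddress.ip_network(allowed))
        for allowed in registry.get(customer_as, [])
    )
```

With a filter like this on the customer-facing session, an announcement for somebody else's address space is dropped at the first hop instead of reaching the rest of the Net. Note that an AS missing from the registry data gets nothing accepted at all, which is the failure mode behind the emptied access lists mentioned below.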

Why don't providers do this? Not nearly all IP space is properly registered, plus writing tools to automate updating router configuration scares off people, and has been known to break (e.g. at least one provider had a big oops moment when ripe.net's domain registration was accidentally killed by NetSol, thus whois.ripe.net was offline, thus all access lists were emptied out the next run of the config-generating script).
posted by LanTao at 3:02 PM on March 2, 2008


BGP is an incredibly simple protocol by the way. The initial design was sketched out on a napkin and it hasn't grown too much since then.
posted by Skorgu at 3:08 PM on March 2, 2008


Can we still send the illegible youtube comments to Pakistan?
posted by jlowen at 4:44 PM on March 2, 2008


It's not much of a surprise that BGP is flawed in this way. It's one of the few big internet protocols that assumes everyone's cooperative. When someone trusted fails to cooperate, it's a mess.
posted by Nelson at 6:36 PM on March 2, 2008


looks like they took it down
posted by north carolina mortgage at 7:14 PM on March 2, 2008


The really interesting thing is that this isn't the first time BGP-gone-bad has had some major consequences. Back in 1997, the "AS7007 Incident" took down a large chunk of the entire Internet.

It was essentially the same situation that Malor describes above, except that instead of advertising very specific (and bad) routes for a single site, like the Pakistani ISP did, an ISP in Florida (with AS number 7007) started advertising very specific routes for practically the entire Internet. As the routes propagated, "the entire internet existed in one location - some crappy Bay Networks router in AS7007". As packets got routed there instead of to their correct destinations, the Internet essentially ceased functioning for several hours. Needless to say, much crow had to be eaten by AS7007's operators.

In the wake of it, there was a whole lot of discussion about how to keep such an obviously stupid route from propagating, and various ways were invented (or dragged out of relative obscurity) to help ISPs filter and sanity-check incoming routes. Never again would some random malfunctioning Cisco box somewhere decide that it had the entire Internet in /24s on one of its interfaces and tell the world about it. BGP and the rest of the world lurched on.

But as the YouTube incident demonstrates, not everyone took the post-AS7007 safeguards seriously, and the core routing network is still pretty vulnerable to either human error or malice if it occurs at a high enough level.

Frankly I think incidents like this will continue to happen every few years or so, whenever sysops get a little lax about routing policies, unless there's a major change in protocols. But as Malor points out, there's going to be a lot of resistance to anything that increases the hardware requirements for the core routers -- to the companies and organizations that operate them, they're cost centers, not profit centers. (And big ones at that.) They really don't have a lot of interest in spending truckloads of money to fix a problem that's relatively rare.
posted by Kadin2048 at 9:01 PM on March 2, 2008 [3 favorites]


It's not much of a surprise that BGP is flawed in this way. It's one of the few big internet protocols that assumes everyone's cooperative.

Virtually all of the routing protocols assume cooperation.

In any case, I can tell you in advance that this fooforaw about "fixing" BGP will go nowhere. BGP has *been* the core of the internet for almost 14 years, and you might have noticed that the internet has been doing pretty well over that time. The 1st tier ISPs have no interest in destabilizing the core to cover the case where a trusted 2nd tier goes bad.

As someone else mentioned, the system worked exactly the way it was supposed to. Yes, YouTube was off the air for over two hours -- and I'm sure their lawyers are poring over their ISP contract right now trying to figure out if they can get some money out of it. My guess is that they'll find that the answer is no -- contracts on that level usually provide for a certain amount of unscheduled downtime per year.

I do have to say that it's amazing how much people's expectations have changed over the years, though. A website was down for two hours and this is a big deal? In 1994 there was a bug that kept the country of Slovenia off the web for two days, and nobody was particularly put out.
posted by tkolar at 9:17 PM on March 2, 2008 [1 favorite]


I'm with TheNewWazoo — I don't really see how a technical fix could deal with this. Pakistan could, for example, lay a whole bunch of fiber directly to wherever YouTube's servers are (San Francisco?). And then it would be perfectly appropriate for them to advertise such a route, and perfectly appropriate for everyone to send them the traffic. That's how routing works. We know that it's absurd that Pakistan has a high-capacity link to SF, but how can we encode that information into the BGP routers? Anything that requires frequent manual tweaks as global connectivity changes is not workable.

Routing is a somewhat difficult problem because it's inherently global. A change in connectivity to any AS can cause optimal routes to change arbitrarily far away in the network.
posted by hattifattener at 9:32 PM on March 2, 2008


There's a simple solution to this problem: drop all routes to Pakistan Telecom, permanently. It would be one thing if this was an innocent mistake. Perhaps that would allow for a simple suspension of access, say for a month or so. But this was done in the interest of censorship, i.e., evil. Therefore judgement must be swift and merciless.

Another ISP in Pakistan would be welcome to apply for access on their own terms.
posted by vsync at 9:33 PM on March 2, 2008


It would be one thing if this was an innocent mistake.

Is this one of those "There are no mistakes in the commission of crime" things? 'Cuz I can practically guarantee you that the propagation of the route outside of Pakistan was an innocent mistake.
posted by tkolar at 9:42 PM on March 2, 2008


hattifattener wrote...
I don't really see how a technical fix could deal with this.

One could imagine a fairly grotesque system where every class C (and B and A for that matter) had a public/private keypair and where routing updates fed into the core were authenticated on a per-subnetwork basis. So basically your neighbors would refuse to take a new route from you unless you could prove that you were authorized to advertise it (i.e. possessed the private key).

That's just a rough cut, but hopefully you can see where I'm going with it.
posted by tkolar at 9:53 PM on March 2, 2008


Thanks for the "explanation" Malor, but you can't fool me. Everyone knows that the Internet is just a series of tubes. Probably pneumatic ones and that is how the packets get pushed around from point A to point B.
posted by madamjujujive at 10:05 PM on March 2, 2008 [1 favorite]


Malor wrote...
So, typically, your packet flies from router to router, each tossing the packet closer and closer to MeFi, until it actually arrives. MeFi then replies, and the answer packet flies back, very possibly via a different path.

The link your packet "flies" over from one router to another is often referred to as a "data pipe", although in reality it is some variation of electrical signals sent over a wire.

The fact that all internet traffic goes through these various pipes to get to its destination is the source for the infamous "series of tubes" description.
posted by tkolar at 10:16 PM on March 2, 2008


Didn't this happen before with malicious intent? I recall someone remapping certain sites to prove a point. Or did that attack (if it existed outside of my head) have to do with domain names instead of BGP?
posted by phyrewerx at 12:55 AM on March 3, 2008


This was a most excellent post about "how teh internets really work." Thanks, malor et al.
posted by Lynsey at 10:04 PM on March 3, 2008


Dang, that's pretty scary. It makes you wonder what terrorists could do intentionally.
posted by ciaciagi at 10:16 AM on March 7, 2008


It makes you wonder what terrorists could do intentionally.

Not much. Hackers have been trying to attack it for 14 years now and haven't gotten anywhere.

Although I honestly would love to see at least one armed takeover of an ISP for the specific purpose of injecting bogus routes into the internet backbone. That would be badass.
posted by tkolar at 4:26 PM on March 7, 2008


That wouldn't work for very long, ciaciagi.

I touched on AS numbers up there, but in rereading, I should probably have left them out entirely, since they didn't quite directly relate to the rest of what I was talking about. But the AS numbers are key to the relatively limited security BGP offers.

If a minor ISP somewhere got hijacked by terrorists, and injected a bunch of bogus routes, their upstream ISP should already have it set up so that their routers wouldn't propagate the bogus routes. This is the mistake that took down YouTube; Pakistan's ISP had no such filter. They should have accepted and propagated advertisements from Pakistan's AS number only for net ranges that ISP knew belonged to Pakistan. If they weren't already doing it, it's likely that virtually every provider will be putting in AS filters for their direct customers, to make sure this doesn't happen again.

So, in the event that a small ISP gets taken over by terrorists, chances are pretty good that they won't be able to disrupt global routing. If there's enough bandwidth, they can probably launch DoS attacks, but a good botnet is a hell of a lot more dangerous than any single ISP.

If they took out a Tier 1 provider, and this would require coordinated country-wide attacks against multiple facilities at the same time, then yes, they could disrupt things. But the ISPs that hadn't been hijacked would be able to nullroute the taken-over ISP and recover network operations within an hour or two. Things would be badly degraded until the attacked ISP was back in friendly hands, but I don't think anyone can take the whole Internet offline for very long.

Basically, I wouldn't worry about it too much. Absolute worst-case scenario, I think, is a disruption of a couple of hours. Things might be very slow for a few days, and real-time apps might be impaired, but the fundamental Internet shouldn't be vulnerable to any single ISP being taken out.
posted by Malor at 6:18 AM on March 8, 2008


I should qualify that a little... direct customers of the attacked ISP might well be offline for the duration. Most small players aren't multihomed -- that is, they have a presence only via one provider. Most of the Net would remain intact, but some small players could be gone for a while.
posted by Malor at 6:20 AM on March 8, 2008

