Why Facebook went down

A look at how the system can fail

Facebook is a big company, with a lot of very smart people who keep it running smoothly and make sure its users can stay connected to its network and their social media lives. So how does such a big company, with so many smart people, effectively disconnect itself from the internet? It’s important to know that Facebook didn’t crash, it wasn’t hacked, no data was leaked, and it didn’t (technically) “break.” Facebook’s engineers and programmers know what they are doing with their networks and are particularly good at building their own switches, routers, operating systems and more. They are not, however, infallible, and for a while you couldn’t access any Facebook apps. This was not because Facebook’s servers weren’t responding, but because for about six hours the name in Facebook’s URL could not be found on the internet. This is not something that can happen easily, and the fact that it did is a fluke. It goes to show, however, that the systems we rely on aren’t perfect, and we shouldn’t take them for granted. If things go wrong, we need knowledgeable people who can fix them.

Some lingo first

There is a bit of terminology you have to understand if you want to talk about what happened to Facebook. Once you understand it, though, it becomes remarkably simple to explain why Facebook was unreachable.

The two most important concepts are DNS and BGP. Both are vital for the internet to work the way it does. DNS is the Domain Name System, and it helps your computer get the IPv4 (Internet Protocol version 4) address of the website you are looking for; in other words, its address on the internet. If that address cannot be looked up, the website is inaccessible through the URL you use to find it. For example, if you type in facebook.com and your device does not already have the answer stored, it sends a request to your local DNS server (usually your router or a company server) to find the IP address for that name. If that server does not know the IP address, it passes the request on to other DNS servers, such as those of the ISPs (internet service providers) connected to your local network, until the question reaches a server that does know. On Facebook’s side, its own DNS servers are constantly answering these requests from millions of people across the world. Facebook has many IP addresses to balance the sheer number of requests it receives. The contacted DNS servers reply with the most recent IP addresses they know for that name, and your traffic can then be routed to Facebook’s servers. If this entire process is successful, you are now connected to Facebook.
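To make the lookup chain concrete, here is a minimal sketch of a resolver that answers from its own records or cache and otherwise asks the next server up. The class name, the 60-second lifetime of a cached answer and the address 192.0.2.10 (a documentation-only address, not a real Facebook IP) are all assumptions made for illustration; real DNS software is considerably more involved.

```python
import time


class CachingResolver:
    """Toy resolver: answers from its records or cache, else asks upstream."""

    def __init__(self, upstream=None, records=None, ttl=60.0):
        self.upstream = upstream            # next resolver to ask, or None
        self.records = dict(records or {})  # names this server knows first-hand
        self.ttl = ttl                      # seconds a cached answer stays valid
        self.cache = {}                     # name -> (ip, expiry time)

    def resolve(self, name, now=None):
        now = time.monotonic() if now is None else now
        if name in self.records:            # first-hand answer
            return self.records[name]
        cached = self.cache.get(name)
        if cached and cached[1] > now:      # cached answer that is still fresh
            return cached[0]
        if self.upstream is None:
            return None                     # nobody left to ask
        ip = self.upstream.resolve(name, now)
        if ip is not None:
            self.cache[name] = (ip, now + self.ttl)
        return ip


# Hypothetical chain: home router -> ISP -> Facebook's own DNS server.
facebook_dns = CachingResolver(records={"facebook.com": "192.0.2.10"})
isp = CachingResolver(upstream=facebook_dns)
home_router = CachingResolver(upstream=isp)

print(home_router.resolve("facebook.com", now=0.0))   # has to ask upstream
print(home_router.resolve("facebook.com", now=30.0))  # answered from the cache
```

The `now` parameter exists only so the example can be replayed with fixed times; left out, the resolver uses the system clock.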

BGP (Border Gateway Protocol) version 4 is used by routers to connect networks to each other. BGP4 routers stand at the border of every network on the internet, and roughly every 60 seconds (a common default) each one tells the adjacent networks that it is alive and which addresses can be reached through it. The BGP configuration of a specific network also decides whether the records inside that network are for public or private consumption. If the information is private, it will not leave that network.
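The “I am alive” announcements can be sketched as a simple timer check. The code below is a rough illustration rather than real BGP: the `Peer` class, the 60-second interval, the 180-second hold time and the address ranges are assumptions chosen to match the description above, and real routers let operators configure these values.

```python
from dataclasses import dataclass, field

KEEPALIVE_INTERVAL = 60   # assumed: seconds between "I am alive" messages
HOLD_TIME = 180           # assumed: seconds of silence before a peer counts as down


@dataclass
class Peer:
    """What one border router remembers about an adjacent network."""
    name: str
    last_keepalive: float = 0.0
    public_prefixes: set = field(default_factory=set)   # announced to outsiders
    private_prefixes: set = field(default_factory=set)  # never leave the network

    def receive_keepalive(self, now: float) -> None:
        self.last_keepalive = now

    def is_alive(self, now: float) -> bool:
        return now - self.last_keepalive <= HOLD_TIME

    def reachable_prefixes(self, now: float) -> set:
        # Only public prefixes are visible, and only while the peer is alive.
        return set(self.public_prefixes) if self.is_alive(now) else set()


facebook = Peer("D", public_prefixes={"192.0.2.0/24"}, private_prefixes={"10.0.0.0/8"})
for t in range(0, 181, KEEPALIVE_INTERVAL):
    facebook.receive_keepalive(t)             # keepalives at 0, 60, 120 and 180 s

print(facebook.reachable_prefixes(now=200))   # last keepalive was 20 s ago
print(facebook.reachable_prefixes(now=400))   # silent for 220 s: nothing reachable
```

Once the keepalives stop, the adjacent router forgets every address it could reach through that peer, even though nothing on the peer’s side has been switched off.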

So how do DNS and BGP work together? It’s best to get to the bottom of that question with a simplified schema and some steps in a scenario:

The four parties: A (your phone), B, C, and D (Facebook’s servers).

1. A: Using BGP, it asks B and C if they are alive.
2. B: Using BGP, it says it is alive.
   C: Using BGP, it says it is alive.
3. A: Using DNS, it asks if they know Facebook’s IP address.
4. B: Using DNS, it says that it knows because it asked D less than 60 seconds ago and sends the DNS record with the IP address and time stamp to A.
   C: Using DNS, it says it doesn’t know because it had not asked D recently.
5. C: Using BGP, it asks B and D if they are alive.
6. B: Using BGP, it tells C it is alive.
   D: Using BGP, it tells C it is alive.
7. C: Using DNS, it asks if they know Facebook’s IP address.
8. B: Using DNS, it says that it knows because it asked D less than 60 seconds ago and sends the DNS record with the IP address and time stamp to C.
   D: Using DNS, it says it knows the IP address because it hosts Facebook. It sends the public DNS record with the IP address and time stamp to C.
9. C: It updates its records and, using DNS, replies to A’s request that it now knows where Facebook is and sends the DNS record with the IP address and time stamp to A.
10. A: With the information from B and C, A can now use routing protocols to determine the best route to Facebook (D).
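The scenario above can be replayed in a small simulation. This is only a sketch of the simplified model used in this article, not of how real routers and DNS servers talk to each other; the `Network` class, its method names and the address 192.0.2.10 are made up for the example, and time stamps are left out.

```python
class Network:
    """Toy network that can host a name, cache answers, and ask adjacent networks."""

    def __init__(self, name, hosted=None, alive=True):
        self.name = name
        self.hosted = dict(hosted or {})   # names this network hosts itself
        self.cache = {}                    # answers learned from other networks
        self.adjacent = []                 # networks it can talk to directly
        self.alive = alive                 # does it answer the liveness check?

    def lookup(self, name, log, visited=frozenset()):
        visited = visited | {self.name}
        if name in self.hosted:
            log.append(f"{self.name}: I host {name}")
            return self.hosted[name]
        if name in self.cache:
            log.append(f"{self.name}: I asked recently and have {name} cached")
            return self.cache[name]
        answers = []
        for other in self.adjacent:
            if other.name in visited:
                continue                   # do not ask whoever is waiting on us
            if not other.alive:
                log.append(f"{self.name}: {other.name} is not alive, skipping")
                continue
            log.append(f"{self.name}: asking {other.name} about {name}")
            ip = other.lookup(name, log, visited)
            if ip is not None:
                answers.append(ip)
        if answers:
            self.cache[name] = answers[0]  # remember the answer for next time
            return answers[0]
        return None


a, b, c = Network("A"), Network("B"), Network("C")
d = Network("D", hosted={"facebook.com": "192.0.2.10"})
b.cache["facebook.com"] = "192.0.2.10"     # B asked D recently (step 4)
a.adjacent = [b, c]
b.adjacent = [a, c, d]
c.adjacent = [a, b, d]

log = []
ip = a.lookup("facebook.com", log)
print("\n".join(log))
print("A connects to", ip)
```

The printed log follows the same order as the steps: B answers from what it already knows, while C has to ask B and D before it can reply to A.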

How did this fail with Facebook?

Now that we understand the protocols involved in connecting you to Facebook, understanding how it got disconnected is simple: from Facebook’s side, step 6 stopped taking place, and BGP no longer announced that Facebook’s network was alive and well for other networks to connect to. This probably happened because Facebook pushed an incorrect configuration to its routers. Facebook was still announcing its IP addresses within its own network, but there was no way for that public information to get out across BGP. So, though Facebook was not physically disconnected from the internet, its address could no longer be found through DNS. In principle, if you had one of Facebook’s IP addresses and entered it directly into your web browser, you could still have connected to the server, provided the route to that address was still being announced.
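The failure described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reconstruction of Facebook’s systems: the class names, the 60-second cache lifetime and the address 192.0.2.10 are invented for the example.

```python
TTL = 60  # assumed: seconds an outside server keeps a cached answer


class FacebookNetwork:
    def __init__(self):
        self.announcing = True        # is the network announced over BGP?
        self.servers_running = True   # the machines themselves
        self.ip = "192.0.2.10"        # documentation-only address

    def answer_dns(self):
        # Questions from outside only arrive while the network is announced.
        return self.ip if self.announcing else None


class OutsideResolver:
    def __init__(self, facebook):
        self.facebook = facebook
        self.cached = None            # (ip, expiry time) or None

    def resolve(self, now):
        if self.cached and now < self.cached[1]:
            return self.cached[0]     # still fresh, no need to ask again
        ip = self.facebook.answer_dns()
        self.cached = (ip, now + TTL) if ip else None
        return ip


fb = FacebookNetwork()
resolver = OutsideResolver(fb)

print(resolver.resolve(now=0))    # normal day: the address is found
fb.announcing = False             # the bad configuration goes live
print(resolver.resolve(now=30))   # the cached answer still works for a while
print(resolver.resolve(now=61))   # cache expired, nobody can be asked: None
print(fb.servers_running)         # the servers never stopped running
```

The last two lines carry the point of this section: the servers were up the whole time, but once the cached answers ran out, nobody outside could find them.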

Overall, Facebook disconnecting was a simple issue with massive repercussions. Because Instagram and WhatsApp run on the same infrastructure as Facebook, they also went down. The fix (rolling back the configuration) was supposed to be easy, but Facebook’s internal systems, including email, key cards and the remote tools used to configure the routers, all work over that same network. As a result, engineers could not undo the changes remotely, and they could not get into the room controlling the routers because the security systems and protocols didn’t work either. An easy fix became a logistical conundrum that took about six hours to resolve. It just goes to show that we should never take the systems and networks we have for granted.
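The reason the fix was so hard comes down to dependencies: the tools needed for the repair relied on the thing that was broken. The sketch below works out what stops functioning when one piece fails. The dependency map is a hypothetical simplification based only on the description above, not Facebook’s actual architecture.

```python
# Hypothetical map: each system and the things it needs in order to work.
DEPENDS_ON = {
    "internal network": {"BGP configuration"},
    "email": {"internal network"},
    "remote router configuration": {"internal network"},
    "key card doors": {"internal network"},
    "access to the router room": {"key card doors"},
}


def unavailable(broken, depends_on):
    """Return everything that stops working once `broken` fails."""
    down = {broken}
    changed = True
    while changed:                    # repeat until no new failures appear
        changed = False
        for system, needs in depends_on.items():
            if system not in down and needs & down:
                down.add(system)
                changed = True
    return down


for system in sorted(unavailable("BGP configuration", DEPENDS_ON)):
    print("down:", system)
```

In this toy map, both ways of reaching the routers (remotely, or by walking into the room) end up in the list of things that are down.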
