Facebook has said that a routine maintenance error caused the disruption to its services earlier this week, “effectively disconnecting Facebook data centres globally.”
The outage, which lasted more than five hours and affected Facebook, Instagram and WhatsApp, occurred after day-to-day infrastructure maintenance brought down connections in Facebook’s global backbone network.
In a blog post, Facebook’s VP of Engineering and Infrastructure Santosh Janardhan said,
“During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centres globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.
“This change caused a complete disconnection of our server connections between our data centres and the internet.”
This disconnection then had a knock-on effect. Facebook’s DNS servers withdraw their border gateway protocol (BGP) route advertisements if they can’t speak to the data centres – an indication of an unhealthy network connection. “In the recent outage, the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements,” said Janardhan.
“The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.”
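The failure mode Janardhan describes – servers that are still running but have made themselves unfindable – can be illustrated with a minimal sketch. The function and data-structure names below are hypothetical, not Facebook's actual code: each DNS server keeps its BGP advertisement only while it can reach at least one data centre, so when the backbone disappears every server withdraws at once.

```python
# Hypothetical sketch of the health-check-then-withdraw behaviour described
# in the article. Names and data structures are illustrative, not Facebook's.

def dns_server_advertises(reachable_datacentres: set[str]) -> bool:
    """A DNS server keeps its BGP advertisement only while the backbone
    looks healthy, i.e. it can reach at least one data centre."""
    return len(reachable_datacentres) > 0

def resolvable_servers(servers: dict[str, set[str]]) -> list[str]:
    """DNS servers the rest of the internet can still route to."""
    return [name for name, dcs in servers.items() if dns_server_advertises(dcs)]

# Normal operation: each DNS server can reach some data centres.
servers = {"dns-a": {"dc1", "dc2"}, "dns-b": {"dc3"}}
print(resolvable_servers(servers))   # ['dns-a', 'dns-b']

# Backbone disconnected: no data centre reachable from anywhere, so every
# server withdraws its advertisement - still running, but unfindable.
servers = {name: set() for name in servers}
print(resolvable_servers(servers))   # []
```

The safeguard is correct for a single unhealthy location, since traffic simply routes to a healthy one; it only becomes a global blackout when every location declares itself unhealthy simultaneously.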
Facebook engineers were faced with a fast-moving series of events and had to overcome two main challenges: an inability to access Facebook’s data centres because their networks were down, and the total loss of DNS, which also disabled most of Facebook’s internal outage troubleshooting tools.
“Our primary and out-of-band network access was down, so we sent engineers onsite to the data centres to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online,” explained Janardhan.
Once connectivity to the backbone network was restored across all data centre regions, services came back online – but not without a new set of difficulties.
“We knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centres were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk,” continued Janardhan.
However, Facebook has been running “storm” drills for exactly this sort of situation: “In a storm exercise, we simulate a major system failure by taking a service, data centre, or entire region offline, stress testing all the infrastructure and software involved. Experience from these drills gave us the confidence and experience to bring things back online and carefully manage the increasing loads.
“In the end, our services came back up relatively quickly without any further systemwide failures. And while we’ve never previously run a storm that simulated our global backbone being taken offline, we’ll certainly be looking for ways to simulate events like this moving forward,” said Janardhan.
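The careful ramp-up Janardhan describes can be sketched in a few lines. This is an illustrative staged-restoration schedule, not Facebook's actual tooling: rather than restoring all traffic at once, load is admitted in steps so power draw and cache warm-up climb gradually.

```python
# Illustrative sketch of a staged traffic ramp-up after an outage.
# The step count and linear schedule are assumptions for illustration.

def ramp_schedule(total_load: float, steps: int) -> list[float]:
    """Cumulative load targets (e.g. % of traffic) for each ramp step,
    climbing linearly instead of jumping straight to full load."""
    return [total_load * (i + 1) / steps for i in range(steps)]

# Restore traffic in five stages rather than one surge.
print(ramp_schedule(100.0, 5))   # [20.0, 40.0, 60.0, 80.0, 100.0]
```

In practice each step would be held until monitoring confirms that power consumption, cache hit rates and error rates are stable before admitting the next tranche of traffic.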