In October this year, Facebook caused global disruptions when its services went down alongside the platforms it owns, including WhatsApp, Instagram, Facebook Messenger and Workplace.
The outage was one of the worst of its kind in recent history. The platforms of the world’s largest social media company were down for almost six hours, during which several of the most dominant forms of online communication, both social and professional, were inaccessible.
Throughout the pandemic, these forms of communication have been essential to our everyday lives, from staying connected to our loved ones to helping small businesses continue their operations. The Facebook outage has left both companies and data centres worried about its causes, whether something similar could affect them, and how outages of this kind can be prevented. The reputational damage from an outage like this can be long lasting, and the impact can quickly spread to every part of a company, including its partners and customers.
What are BGP and DNS?
The answer to these challenges partly lies in understanding DNS and BGP. BGP, or Border Gateway Protocol, is the protocol used to route traffic between the networks that make up the public internet. This means that BGP is responsible for selecting the best available routes to carry data from a source to a specific destination.
DNS is the internet’s equivalent of the list of contacts on your phone. The Domain Name System (DNS) translates the hostname in the URL you request into a numeric IP address that your browser can connect to, a process known as name resolution.
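To make the idea of route selection concrete, here is a minimal Python sketch. It is a simplification under stated assumptions, not the full BGP decision process: it prefers the most specific matching prefix and then the shortest AS path, and ignores other attributes such as local preference. The `Route` type, the neighbour labels, the documentation prefixes and the AS numbers are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # advertised destination block, e.g. "203.0.113.0/24"
    as_path: tuple   # networks (AS numbers) the advertisement passed through
    next_hop: str    # hypothetical label for the neighbour that sent it


def best_route(routes, destination):
    """Pick a route for destination, or None if nothing covers it.

    Simplified rule (an assumption for illustration): the most specific
    prefix wins, and ties are broken by the shortest AS path.
    """
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, len(r.as_path)),
    )


routes = [
    Route("203.0.113.0/24", (64500, 64501, 64502), "neighbour-a"),
    Route("203.0.113.0/24", (64510, 64502), "neighbour-b"),
    Route("203.0.0.0/16", (64505,), "neighbour-c"),
]

chosen = best_route(routes, "203.0.113.10")
fallback = best_route(routes, "203.0.200.1")
unroutable = best_route(routes, "198.51.100.1")
```

When no advertised prefix covers a destination, `best_route` returns `None`: the traffic has nowhere to go, which is the situation external clients faced during the outage.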
What caused the Facebook outage?
What we now know is that the Facebook outage can be attributed to a misconfiguration within Facebook’s BGP routing design. This was allowed to propagate across its internal routing fabric (iBGP) and then externally (eBGP).
While global DNS servers were able to resolve requests for Facebook domains, the public IPs returned in the DNS responses could not be used to route the ensuing external client traffic into Facebook systems. The problem was exacerbated by Facebook’s internal DNS architecture, which was itself impacted by the BGP misconfiguration.
Facebook’s authoritative name servers are advertised to the rest of the internet via border gateway protocol (BGP). To ensure reliable operation, Facebook’s DNS servers disable BGP advertisements if they themselves cannot speak to their data centres. In the recent outage, the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that their DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find their servers.
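The withdrawal behaviour described above can be illustrated with a small simulation. This is a sketch under assumptions, not Facebook’s actual implementation: the `DnsSite` class, its fields and the site names are hypothetical, and the health check is reduced to a single flag.

```python
from dataclasses import dataclass


@dataclass
class DnsSite:
    name: str                        # hypothetical site label
    backbone_reachable: bool = True  # can this site reach the data centres?
    advertising: bool = True         # is its prefix announced via BGP?

    def run_health_check(self):
        # A site that cannot reach the data centres declares itself
        # unhealthy and stops announcing its prefix.
        self.advertising = self.backbone_reachable
        return self.advertising


def advertised_sites(sites):
    """Names of the sites the rest of the internet can still find."""
    return [s.name for s in sites if s.run_health_check()]


sites = [DnsSite("site-a"), DnsSite("site-b"), DnsSite("site-c")]
before = advertised_sites(sites)

# The backbone is removed from operation everywhere at once.
for site in sites:
    site.backbone_reachable = False
after = advertised_sites(sites)

print(before)  # ['site-a', 'site-b', 'site-c']
print(after)   # []
```

Each site’s decision is sensible when taken alone. The failure comes from every site making the same decision at the same moment, leaving no advertised name servers even though the servers themselves are still running.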
The outage took so long to resolve largely because Facebook was unable to access its out-of-band (OOB) internal management network while it was under way. Without a way into its own network, Facebook could not fix the configuration, which significantly delayed the resolution; it is similar to forgetting your admin password and losing access to your workstation, though at global internet scale.
How can websites prevent these outages going forward?
Understanding what happened during the Facebook outage is just the first step to preventing outages for websites going forward.
As large-scale cloud networks continue to grow and use automation both to scale and to remove human error, there must still be a structure in place to protect against the human element that remains. ‘Guardrails’ should be introduced to ensure critical infrastructure decisions are controlled and validated before they are deployed; these are key to the continued stability of services at internet scale. These guardrails apply not only to the cloud service providers’ management of infrastructure but also to the businesses that build upon these platforms.
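One way to picture such a guardrail is a validation step that runs before a routing change is deployed. The Python sketch below is illustrative only: the function name `validate_change`, the 20% threshold and the prefixes are assumptions, and a real deployment pipeline would check far more than the number of withdrawn prefixes.

```python
def validate_change(current_prefixes, proposed_prefixes, max_withdrawn_fraction=0.2):
    """Reject a change that would withdraw too many advertised prefixes.

    Returns (approved, reason). The threshold is a hypothetical policy
    value chosen for illustration.
    """
    current = set(current_prefixes)
    proposed = set(proposed_prefixes)
    if not current:
        return True, "nothing currently advertised"
    withdrawn = current - proposed
    fraction = len(withdrawn) / len(current)
    if fraction > max_withdrawn_fraction:
        return False, f"change withdraws {len(withdrawn)} of {len(current)} prefixes"
    return True, "ok"


current = [
    "203.0.113.0/26",
    "203.0.113.64/26",
    "203.0.113.128/26",
    "203.0.113.192/26",
    "198.51.100.0/24",
]

# Withdrawing one prefix out of five stays within the limit.
small_change = validate_change(current, current[:4])

# Withdrawing everything at once is blocked before it is deployed.
full_withdrawal = validate_change(current, [])
```

The point of the design is that the check sits outside the change itself, so a mistaken command has to pass an independent test before it can reach production.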
Those who own websites also need to carefully consider the problem of cloud vendor lock-in, making sure they prioritise the ability to migrate their business assets and processes to competing cloud platforms. This in turn puts pressure on these cloud service providers to deliver the best possible service or lose their clients.
It’s clear that the Big Four (Amazon, Facebook, Apple, and Google) oversee some of the largest marketplaces today. This means that companies, especially smaller businesses, may have little choice but to integrate with these platforms and, as we saw during the Facebook outage, can be entirely dependent on them for their own success. This is why multi-provider strategies will be key to avoiding the risks associated with Big Tech’s dominance, and why technology alone will not be a catch-all solution for mitigating outage risks.
While there are still technology-based solutions available to resolve these problems, these are mostly available to larger companies. For example, businesses can look to migrate away from a platform such as Amazon and invest in experts who can run their own cloud infrastructure internally. By putting digital strategies at the foundation of their business, companies can gain a competitive edge in mitigating networking risks.
Whether by employing guardrails to guard against human error, using the best cloud providers or upskilling their employees, businesses should be proactive in meeting underlying networking challenges. Facebook’s outage has shown us that if these challenges aren’t resolved, businesses risk long-lasting reputational damage that can have a serious impact on their bottom line.