With the amount of data used across the globe only increasing year by year, Aliaksandr Valialkin, Co-Founder & CTO at VictoriaMetrics, takes a deep dive into how to avoid ‘breaking the internet’.
Website and server crashes are an inevitability. The more excitement surrounding an event, the more likely the website is to crash on day one. For concerts and events with a fixed number of tickets, server crashes are more of a nuisance than a genuine problem; there are only so many spaces, so while it may take a while, eventually all sales can be processed. However, when live services or e-commerce platforms fail, it can cost thousands of dollars in lost revenue per minute.
The root cause of a website crash is always an overload in the data processing capability of a system. How to fix this depends on what service you need to provide. Some websites are maliciously taken down by specific attacks such as a DDoS (distributed denial of service), which mimics a huge surge in traffic designed to slow down or crash a website entirely. On the internet, this kind of overload is colloquially called a ‘hug of death’. Managing these surges in traffic and making the most of a viral moment means thinking about each component in your system.
Ticketmaster crashed earlier this year with the ticket release for Taylor Swift’s tour. Essentially, the system could not handle the number of requests it was receiving. This is a common issue with limited-quantity releases, which users themselves exacerbate: social media posts show users with two or more devices refreshing constantly, trying to get into the queue for tickets. Even if Ticketmaster had run stress tests on the theoretical maximum number of users trying to purchase tickets, there are huge costs in doubling or tripling that capacity. Depending on the efficiency of their infrastructure, the cost of handling more users can grow non-linearly: doubling capacity can be 10 times more expensive rather than twice as expensive (assuming every part of their architecture supports that kind of scaling in the first place).
Companies that do everything right still struggle to prevent downtime entirely. In the gaming industry, for example, where downtime can kill a game’s momentum, studios use every trick in the book to help their servers cope. Distributed systems are endlessly complex and reality is unbounded; it is impossible to account for every variable. However, how a system is designed can help to minimise downtime and keep you in control.
Sometimes the failure point isn’t even visible as a network issue. If your machines share a data centre, increasing load on one component can increase temperatures across the server stack. This increase in temperature then causes network cards to fail; in this case, two technically unconnected systems are interfering with each other and creating a cascade of failures. Human interference can then compound the problem.
Take Twitter (X) recently implementing rate limiting. The goal was to protect the platform from bots scraping data. Unfortunately, Twitter uses bots internally to handle certain tasks. Rate limiting these internal bots then snowballed into disrupting the services that handle feed generation.
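Rate limiters of this kind are commonly implemented as a token bucket. The sketch below is a minimal, hypothetical illustration (the class and parameter names are mine, not Twitter’s) of how a limit tuned for human users can starve a high-volume internal bot:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative sketch only)."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec       # tokens refilled per second
        self.capacity = capacity       # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A limit tuned for humans (1 request/sec, burst of 5) starves a bot
# firing 100 requests back to back: only the initial burst gets through.
bucket = TokenBucket(rate_per_sec=1, capacity=5)
granted = sum(bucket.allow() for _ in range(100))
print(granted)
```

The failure mode Twitter hit follows directly: an internal bot that legitimately needs hundreds of requests per second is indistinguishable, to the bucket, from a scraper, so a platform-wide limit silently throttles it too.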
Observability is key
The first step is observability. Without the ability to perceive what is going on in a distributed system, you cannot correct the issue. Server failures often start in one system and snowball into other systems that are working optimally.
If an authentication server is working faster than the system that manages user logins, it can overfeed the login server, causing it to crash. If you aren’t able to perceive the entirety of the system, you could lose time trying to diagnose the wrong point of failure.
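One common defence against a fast upstream overfeeding a slow downstream is backpressure: the downstream accepts work through a bounded queue and the upstream must slow down or shed load when it is full. A minimal sketch, with hypothetical service names:

```python
from queue import Queue, Full

# Downstream (e.g. a login service) can only buffer a limited backlog.
login_queue = Queue(maxsize=10)

accepted, rejected = 0, 0
# Upstream (e.g. an auth server) produces requests faster than they drain.
for request_id in range(100):
    try:
        login_queue.put_nowait(request_id)  # fail fast instead of piling up
        accepted += 1
    except Full:
        rejected += 1  # shed load explicitly rather than crash the downstream

print(accepted, rejected)
```

The design choice is the bounded queue: an unbounded one hides the overload until memory runs out, while a bounded one surfaces it immediately as rejections the upstream can react to.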
Distributed systems are probably the most complicated things that people make, so just perceiving them in their entirety is crucial and something to prioritise. Observability empowers the individual engineer to see what is going wrong whilst simultaneously reducing the cognitive load on that engineer. Losing visibility means you can no longer make optimal decisions, so having a reliable way of checking on the entire system is an obvious step towards reducing downtime. In previous roles, I’ve seen hours of troubleshooting invalidated because the monitoring solution was outputting incorrect observations. This is why it’s important to have an effective source of truth.
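That source of truth usually starts with services exposing their own counters. The sketch below is a deliberately simplified, stdlib-only illustration of the idea; a real system would use an established metrics library (such as a Prometheus client) rather than this toy registry, and the metric names are invented for the example:

```python
import threading

class Metrics:
    """Toy in-process counter registry (illustrative sketch only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def inc(self, name, value=1):
        # Thread-safe increment so concurrent request handlers can share it
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + value

    def render(self):
        # Expose counters in a Prometheus-like "name value" text format
        with self._lock:
            return "\n".join(f"{k} {v}" for k, v in sorted(self._counters.items()))

metrics = Metrics()
for _ in range(3):
    metrics.inc("logins_total")
metrics.inc("login_errors_total")
print(metrics.render())
```

Even this much lets an engineer distinguish “logins stopped arriving” from “logins are arriving and failing”, which is exactly the difference between diagnosing the right and the wrong point of failure.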
There isn’t a meaningful long-term solution to servers being overloaded. While the industry regularly innovates to make data logs more efficient, the extra headroom is taken up by more information. Take webpages as an example: 15 years ago, a webpage probably tracked a few key metrics such as visit time, duration and clicks. Now pages track read times, backlinks, load times and frequency of visits. While monitoring becomes more efficient, the amount of data being tracked grows exponentially.
This isn’t something that is likely to change. With cookies falling out of favour, we are likely to see observational telemetry become more popular as a way to monitor users. Operators will therefore have to make strategic decisions about what they track. While data is useful to businesses, arbitrarily increasing your analytics load in pursuit of insights could hurt your bottom line. Ultimately, you can be more prepared than others, but the best solution is a simple, scalable system that can be observed to enable fast repair.