Rajkumar Vijayarangakannan, Lead of Network Design & DevOps, ManageEngine, explores the importance of tackling long-tail latencies to provide the instant response times needed for highly-interactive, real-time services.
The growing use of data centres globally has compelled many organisations to adopt cutting-edge, highly interactive, real-time services over edge networks. The scope of, and demand for, these services have been further spurred by growing user numbers, mobile apps, and the 5G revolution.
Demand alone does not sustain these services, however. Their revenue and dependability hinge on responsiveness, which in turn depends on near-instant response times. To deliver quickly and reliably, businesses are turning to distributed platforms and microservices architectures.
These intricate systems divide end-user requests into several parallel suboperations to improve responsiveness, and run them as virtual machines (VMs) or containers across a large number of shared, multi-tenant physical machines. As the scale of data centre operations grows, latency fluctuations become more significant.
Long-tail latencies: What causes them?
Long-tail latencies are the result of interactions between the components of the data centre, the availability of resources, and a variety of other factors, including:
- Resource contention: Long-tail latencies could arise from resource contention within a single workload, synchronised resource locking, stringent resource ordering schemes, or other concurrent workloads running in the same shared environment. Resource contention is the main cause of latency fluctuation.
- Queuing delays: Subprocesses that have stalled or are stuck in queues could amplify the variations in latency.
- Concurrent activities: Increased latency may be the result of interference from unrelated applications, like log compression or garbage collection tools. Latency problems can also result from conflicts between colocated programmes for resources in a shared environment.
- Other outliers: Latency can also be caused by software or algorithm flaws and performance problems.
Due to the complexity of the data centre environment and the propensity for the issues listed above to emerge sporadically, traditional debugging techniques frequently face substantial difficulties in addressing these issues.
Long-tail latency most commonly shows up as a wide spread of response times that drags down a data centre’s overall performance. The saying ‘the tail wags the dog’ applies here, since data centre performance is often dictated by exceptional events rather than typical ones. Contemporary data centres show this strikingly: a handful of minor or rare events can suddenly dominate network performance across the entire facility.
In complex environments, the final response is sent to the client only after every subprocess has replied, so any subprocess that exceeds its minimal latency holds up the whole request. With thousands of microservices running concurrently, a single delayed subprocess defines the overall response time of the user-facing, real-time web service.
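A rough calculation shows why the slowest subprocess matters so much. The sketch below assumes that sub-requests are independent and that each one has a 1% chance of falling beyond its 99th-percentile latency. Both assumptions, and the function name, are illustrative and not drawn from ManageEngine's systems.

```python
def prob_slow_response(fan_out: int, per_call_slow_prob: float = 0.01) -> float:
    """Probability that at least one of `fan_out` parallel sub-requests is slow.

    Assumes sub-requests are independent and that each exceeds its latency
    threshold with probability `per_call_slow_prob` (0.01 = beyond the p99).
    """
    # The request is fast only if every sub-request is fast.
    return 1.0 - (1.0 - per_call_slow_prob) ** fan_out


for n in (1, 10, 100, 1000):
    print(f"fan-out {n:>4}: {prob_slow_response(n):.1%} of requests hit the tail")
```

Under these assumptions, a request that fans out to 100 subprocesses meets at least one tail-latency response roughly 63% of the time, even though each subprocess is slow only 1% of the time.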
How to address long-tail latencies
To react promptly and reliably to workloads with latency fluctuations, ManageEngine uses the following tail-tolerant techniques:
Global anycasting
To be closer to end-users, we deliver services across the internet using ManageEngine CloudDNS, a web-based solution for DNS resolution. An anycast DNS resolution system gives users a setup of numerous anycast sites or data centres at global vantage points on each continent, which helps deliver services with low latency to end-users wherever they are.
Because of this, the workload can be distributed to a different data centre that is easier to access and has servers that are equally capable of processing and swiftly responding to incoming requests as the origin server.
Global load balancing
Through the use of intelligent traffic steering filters and global load balancing strategies, CloudDNS optimises data centre traffic. The global load balancer manages increasing loads in distributed infrastructures by orchestrating them amongst scattered resources. This entails setting routing protocols for various geographic regions to provide optimal global resource routing.
To reduce latency, DNS resolution systems apply special routing rules based on IP addresses or Autonomous System Numbers (ASNs). These rules identify particular network groups so that each request receives a swift response tailored to the end-user’s network, directing users to the best resources while keeping performance high.
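As a minimal sketch of IP-based steering, the example below maps a client address to a data centre by longest-prefix match. The prefixes come from reserved documentation ranges, and the site labels, table, and function name are hypothetical. It does not represent how CloudDNS is implemented. ASN-based rules would additionally need an IP-to-ASN lookup, which is omitted here.

```python
import ipaddress

# Hypothetical steering table: client network -> data centre label.
# The prefixes are documentation ranges; the labels are invented.
STEERING_RULES = [
    (ipaddress.ip_network("192.0.2.0/24"), "dc-europe"),
    (ipaddress.ip_network("198.51.100.0/24"), "dc-asia"),
    (ipaddress.ip_network("198.51.100.128/25"), "dc-oceania"),
]
DEFAULT_SITE = "dc-americas"


def pick_site(client_ip: str) -> str:
    """Return the site for the most specific prefix containing client_ip."""
    addr = ipaddress.ip_address(client_ip)
    matches = [(net, site) for net, site in STEERING_RULES if addr in net]
    if not matches:
        return DEFAULT_SITE
    # Longest-prefix match: the narrowest matching network wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(pick_site("198.51.100.200"))
```

The address in the usage line falls inside both the /24 and the /25 rule, so the narrower /25 rule decides the answer. Clients that match no rule are sent to the default site.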
Integrating health monitoring checks
DNS-integrated health monitoring systems perform proactive monitoring checks at regular intervals using a variety of protocols for websites, including HTTPS, HTTP, TCP, DNS, and ICMP (ping). These systems diligently observe the network for active failover events from a variety of crucial angles.
In DNS failover configurations, a health monitor immediately replaces any weak or unhealthy resource copies that it detects. This greatly reduces the latency end users might otherwise experience when accessing services from anywhere in the world, to the point where it becomes insignificant or undetectable.
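The failover idea can be sketched in a few lines. The example below is illustrative only: the function names, the primary and backup record lists, and the choice of a plain TCP connect as the probe are assumptions, not the CloudDNS implementation. A production monitor would also probe over HTTPS, HTTP, DNS, and ICMP from several vantage points.

```python
import socket
from typing import Callable, Iterable


def tcp_probe(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the endpoint as unhealthy.
        return False


def answers_after_failover(
    primaries: Iterable[str],
    backups: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> list[str]:
    """Serve healthy primaries; fall back to healthy backups if none remain."""
    healthy = [ip for ip in primaries if probe(ip)]
    if healthy:
        return healthy
    return [ip for ip in backups if probe(ip)]
```

Passing the probe in as a parameter keeps the selection logic testable without network access. A scheduler would call `answers_after_failover` at each monitoring interval and publish the result as the DNS answer set.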
Dedicated VMs for latency-sensitive workloads
Resource contention occurs when colocated apps pathologically compete for resources in a shared environment that hosts numerous apps. The resulting long-tail latency may cause pending transactions to take longer than anticipated to complete.
Moreover, co-scheduling CPU-intensive workloads alongside latency-sensitive ones creates a noisy-neighbour effect, which regularly contributes to long-tail latencies. To isolate and run latency-sensitive workloads, users should invest in a DNS tool that launches dedicated VMs.
Tolerating tails over taming tails
Long-tail latencies can be reduced by controlling outliers and improving workload predictability. This can be accomplished by equipping the data centre resources with auxiliary software and embedded hardware that enable the provisioning of real-time debugging information, high-granularity performance monitoring, and system-wide insights. Although it is not possible to eliminate all latency sources from operating data centres, integrated solutions that proactively avoid workload conflicts can reduce long-tail latencies without changing the underlying infrastructure.
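High-granularity monitoring usually starts with tail percentiles instead of averages. The sketch below summarises a batch of latency samples using Python's standard `statistics` module. The function name and the synthetic samples are invented for illustration and are not measurements from any real data centre.

```python
import statistics


def tail_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Return the median, p99, and maximum of a batch of latency samples."""
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98], "max": max(latencies_ms)}


# Synthetic samples (invented for illustration): 980 fast and 20 slow requests.
samples = [10.0] * 980 + [500.0] * 20
print(tail_summary(samples))
```

With these samples the median stays at 10 ms while the 99th percentile is 500 ms, which is the gap an average-based dashboard would hide.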