Microsoft has admitted that its slow response in acknowledging a major outage that affected Azure customers in Europe late last month was due to a sleeping manager.
The company noted that customers in Europe and the UK experienced a significant slowdown in Azure services at around 9am UTC on March 24. However, despite Microsoft's policy of acknowledging any outage within 10 minutes, the company remained silent for around five hours.
Azure services did not return to normal until three days after customers first reported issues, and Microsoft's delay in acknowledging the problem came down to the company's primary incident manager (PIM). The PIM is the person responsible for posting external communications acknowledging an incident; however, the PIM on duty during this incident was US-based, meaning that at the time the outage began, this person was fast asleep.
Microsoft has put together a post-mortem of the incident, with Chad Kimes, director of engineering at Azure, apologising profusely for the lack of communication customers received during that time.
“On the first day, when the impact was most severe, we didn’t acknowledge the incident for approximately five hours, which is substantially worse than our target of 10 minutes. This lack of acknowledgement leads to frustration and confusion, and we apologize for that as well,” he noted.
“The problem here is that our live-site processes have a gap for these types of incidents. When incidents involve customer request failures or performance impacts, we have automated tooling that starts an incident and loops in both a DRI (designated responsible individual) and what we call a PIM (primary incident manager).
“Pipeline delays are detected by different tooling, and the PIM is not currently paged for these types of incidents. As a result, while the DRI was hard at work understanding the technical issues and looking for potential mitigations, the PIM was still asleep. Only when the PIM joined the incident bridge at roughly the beginning of business hours in the Eastern United States was the incident finally acknowledged.”
As for what actually happened during the incident, Kimes explained:
“As Azure has documented at http://aka.ms/cloudCovidResponseFAQ, a significant surge in use has led to deployment success rates for some compute resource types falling below their normal rates in a number of geographies, including Europe and the United Kingdom. For each Azure Pipelines job that is run, the hosted agent pool allocates a fresh virtual machine based on the Azure Pipelines custom image – a process performed over 30,000 times per hour in Europe and the United Kingdom during business hours. This makes Azure Pipelines especially sensitive to increases in the compute allocation failure rate. While Azure Pipelines does retry allocation on failure, the increased failure rate causes our average time to spin up a new agent to rise substantially.
“This increase in the time to spin up new agents then led to the total number of agents in the hosted pool being too low to service the number of jobs, which then led to queueing and pipeline delays.
“We’ve been working on architectural changes to our hosted agent pools to mitigate the potential for issues of this type, and we sped up the deployment of these fixes in Europe and the United Kingdom as a result of these incidents. For Linux agents, the change was to use ephemeral OS disks. This replaces VM allocation/deallocation operations per pipeline job with a re-image operation, thus avoiding allocations altogether. For Windows agents, ephemeral disks were not an option due to the size restrictions – our agents have lots of software, including Visual Studio Enterprise, which requires a large amount of disk storage. As a result, the change for Windows involves using larger Azure VMs and nested virtualization. The primary reason these incidents took so long to mitigate is that we were rolling these changes out much faster than we otherwise would have wanted to, and needed to ensure they didn’t introduce new issues or make things even worse.”
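To see why a higher allocation failure rate translates into queued pipelines, the toy model below simulates a hosted pool that provisions a fresh VM for every job and retries failed allocations. This is a rough sketch, not Microsoft's actual tooling: the pool size, timings and job duration are invented for illustration, and only the 30,000 jobs per hour figure comes from the post-mortem.

```python
import random

# Toy model of a hosted agent pool that provisions a fresh VM for every
# pipeline job. Pool size, timings and job duration are illustrative
# assumptions; only the 30,000 jobs/hour figure comes from the post-mortem.
ALLOCATION_SECONDS = 90        # assumed time for one allocation attempt
RETRY_BACKOFF_SECONDS = 300    # assumed wait before retrying a failed attempt
JOB_SECONDS = 300              # assumed average pipeline job duration
POOL_SIZE = 5_000              # assumed number of agent slots in the pool
JOBS_PER_HOUR = 30_000         # peak rate quoted in the post-mortem


def mean_spinup_seconds(failure_rate: float, trials: int = 50_000) -> float:
    """Average time to bring an agent online when allocations can fail and are retried."""
    total = 0.0
    for _ in range(trials):
        elapsed = ALLOCATION_SECONDS
        while random.random() < failure_rate:  # this attempt failed; back off and retry
            elapsed += RETRY_BACKOFF_SECONDS + ALLOCATION_SECONDS
        total += elapsed
    return total / trials


def hourly_capacity(spinup_seconds: float) -> float:
    """Jobs the pool can complete per hour when every job pays the spin-up cost."""
    jobs_per_slot = 3600 / (spinup_seconds + JOB_SECONDS)
    return jobs_per_slot * POOL_SIZE


for failure_rate in (0.01, 0.20, 0.50):
    spinup = mean_spinup_seconds(failure_rate)
    capacity = hourly_capacity(spinup)
    status = "queueing" if capacity < JOBS_PER_HOUR else "keeping up"
    print(f"allocation failure rate {failure_rate:4.0%}: "
          f"mean spin-up {spinup:5.0f}s, "
          f"capacity {capacity:7.0f} jobs/h -> {status}")
```

In this model the effect Kimes describes falls out directly: as the allocation failure rate climbs, the average spin-up time grows and the pool's hourly capacity drops below the incoming job rate, so jobs queue. The ephemeral-OS-disk change for Linux agents removes the allocation step from this loop, replacing it with a re-image, which keeps the spin-up cost low regardless of the allocation failure rate.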
Microsoft has noted that Azure services are running smoothly again, though it is still rolling out the architectural changes that should help mitigate bottlenecks in spinning up new agents from its hosted agent pool. Additionally, the company is changing how it communicates during these types of incidents, ensuring that customers are notified as promptly as they would be for any other incident type.