Overheating data centre forces shutdown of all network, compute, and storage resources
UK South — one of Microsoft Azure’s two local cloud regions — crashed offline on Monday after an outage triggered by a cooling system failure in a data centre.
The incident, between 14:54 BST on 14 Sep 2020 and 01:41 BST on 15 Sep 2020, left engineers scrambling to place the automated cooling system into manual mode and reset affected pumps, after rising internal temperatures saw systems shut down all network, compute, and storage resources “to protect data durability”.
“Customers using multiple Availability Zones, or Zone Redundant services may have experienced minimal impact”, notes Microsoft in its incident report.
The outage dragged on because, after manually overriding and resetting the automated cooling systems, engineers had to restore power in phases and bring infrastructure back online progressively. (A similar incident hit AWS in Japan in 2019.)
The outage is the latest in a dismal summer for data centres in the UK, after an August 25th fire in a Telstra data centre in London’s Isle of Dogs and an August 18th outage at Equinix’s prominent LBX LD8 co-location data centre after a UPS failure.
⚠️Engineers are currently investigating an issue impacting Storage and Virtual Machines in UK South. More information can be found on the Azure Status page at https://t.co/AkAjNhhnWh
— Azure Support (@AzureSupport) September 14, 2020
Among those knocked offline was Public Health England, which was left unable to update its COVID-19 dashboard during the day as a result.
As Peter Groucutt, managing director of data resilience specialist Databarracks, notes: “We are increasingly dependent on a small number of players who dominate the market. Recent events show the challenge of maintaining productivity in outages highlights the importance of external backups.
“Some argue the reason you do not need to back up cloud data is because a data loss is so unlikely. It would be too embarrassing and damaging for Microsoft, Google or AWS if they were unable to recover data for their customers. Unfortunately, there are many examples of data being lost for a small subset of users. If you’re in that small subset, you don’t have a lot of power in the relationship with the cloud provider and if they say your data is unrecoverable, there isn’t much you can do.”
Azure UK South Outage: Company Apologises, to Investigate Further
Microsoft said: “We undertook various workstreams to bring back connectivity. The site engineers placed the cooling system into manual mode and began to reset the affected pumps to recover the cooling plant. This helped to bring temperatures to safe operational ranges in all the impacted areas of the datacenter by 16:40 UTC.
“Once temperatures were within safe thresholds, engineers started to restore power to the affected infrastructure and began a phased approach to bringing this infrastructure back online. Once storage and the networking infrastructure was fully restored, dependent compute scale units began to recover. As compute scale units became healthy, virtual machines and other dependent Azure services recovered.”
The company says it will “investigate to establish the full root cause and prevent future occurrences” and has apologised to customers. Microsoft has come under regular criticism for availability issues, with Gartner noting this month in its cloud Magic Quadrant that “Microsoft has the lowest ratio of availability zones to regions of any vendor in this Magic Quadrant, and a limited set of services support the availability zone model. As a result, Gartner continues to have concerns related to the overall architecture and implementation of Azure, despite resilience-focused engineering efforts and improved service availability metrics during the past year.”