Here's a new story about the best practices of our users and how they can sometimes save your infrastructure. This time, we are going to take a look at the UK infrastructure of the Web.com company and, more specifically, at their excellent backup policy.
A bit of context
On March 10, 2021, a fire broke out in one of the four Strasbourg datacenters of OVH, the leader in cloud infrastructure in France. The fire completely destroyed the SBG2 datacenter and seriously damaged a second one nearby (SBG3).
At the time of the incident, an estimated 3.2 million websites were impacted, and today (5 days later), 10,000 to 15,000 companies and public administrations are still partially or totally affected.
If you want to learn more about the event, a complete FAQ is available on the OVH website.
We had the opportunity to talk with Mark Hewitt, Systems Administrator at Web.com, who told us how the fire in the OVH datacenter in Strasbourg (where part of their infrastructure was located) had almost no impact on their production thanks to an efficient and careful backup policy.
With over 20 years of experience providing a full range of online marketing services, Web.com helps businesses compete and succeed in the online market.
This wealth of experience and an excellent eye for detail have allowed them to shape solutions for many clients across a wide range of sectors. The UK offices alone, based in the North East of England and in London, serve approximately 3.5 million customers worldwide.
The virtual machine infrastructure of Web.com runs on XCP-ng, with around fifty dedicated hosts, all managed with Xen Orchestra and distributed across several datacenters in Europe, mainly in France and England.
Two hosts were on the SBG2 and SBG3 sites when the fire broke out. Mark remembers:
I received two notifications within ten minutes of each other on Wednesday night [March 10] telling me some services were unavailable. Nothing really critical. It was the next morning when I logged into the OVH panel that I learned about the fire. One host completely gone up in smoke and a second, at SBG3, of uncertain status but likely not to come back online until Friday at best.
Between 10 and 20 VMs were spread across these two hosts. Mark immediately started the recovery procedures:
This is not the first time we have had to perform restorations. Whether it's because of faulty drives, or some other failed hardware, this kind of thing happens. What is a first is to have a host that is totally destroyed, with no repair possibility.
In this case, Mark's backup policy proved its worth:
As a matter of policy, we make sure that our backups are never on the same site as our hardware. In this case, restoring all the VMs lost at SBG2 & 3 to another of our data centers was a breeze with Xen Orchestra. We use delta backups, which allowed us to recover all the lost machines in just under an hour and a half.
Finally, Mark concluded:
This event, which may have been catastrophic for some companies, was a non-event for us thanks to our backup policy and Xen Orchestra. In fact, we didn't even need to notify our customers, as the data recovery was quick and almost transparent for end-users.
We hope that this feedback from a Xen Orchestra user will confirm your choice to use our backup solutions, or prompt you to question your own situation if you do not yet have a backup policy worthy of the name. Don't forget that when it comes to hardware, the question is not whether a failure will happen, but when it will happen and what procedure you will be able to apply at that time.
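If you want to check your own situation against the rule Mark describes (backups never on the same site as the hardware), a few lines of Python are enough. This is only a minimal sketch: the `VmBackupPlan` record, the `find_colocated_backups` helper, the VM names and the site labels are all made up for illustration, and nothing here calls Xen Orchestra. You would need to fill the list from your own inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmBackupPlan:
    """Hypothetical inventory record: where a VM runs and where its backups are stored."""
    vm_name: str
    host_site: str
    backup_site: str


def find_colocated_backups(plans):
    """Return the plans whose backups are stored on the same site as the VM's host."""
    return [
        plan for plan in plans
        # Compare site labels case-insensitively and ignore stray whitespace.
        if plan.host_site.strip().lower() == plan.backup_site.strip().lower()
    ]


# Illustrative data only: VM names and site labels are invented.
plans = [
    VmBackupPlan("web-01", host_site="site-a", backup_site="site-b"),
    VmBackupPlan("db-01", host_site="site-a", backup_site="site-a"),
    VmBackupPlan("mail-01", host_site="site-b", backup_site="site-a"),
]

at_risk = find_colocated_backups(plans)
for plan in at_risk:
    print(f"{plan.vm_name}: backups are stored on the same site as the host ({plan.host_site})")
```

In this example only `db-01` is flagged, because its backups would be lost together with its host if `site-a` went down. What counts as a "site" is up to you. After an event like this one, it probably makes sense to treat a whole campus, and not a single building, as one site.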
If you are not sure what type of backup solution you need, take a look at this article, which we wrote a while ago but is still accurate, or at the webinar we recorded, with a timecode for each backup type: