
What we learnt from our partial outage on 20180524

A review and the lessons learned

Posted on May 27, 2018

In the beginning of 2017 we started to run our own data center, which functions a bit differently from other data centers: it is built entirely upon Open Source components, reuses old factory halls in the canton of Glarus and runs on hydropower.

While we are lucky enough to be able to build everything on a green field (literally), that does not fully save us from making mistakes others have made before.

With this blog post we want to inform our customers about what happened, and also others who run their own FOSS-based data center, so they can avoid repeating our mistakes.

Affected services

Our data center is currently operating in two locations: place5 (Schwanden) and place6 (Linthal). The outage on 20180524 affected all services in place5 and the connectivity to place6. No services in place6 were affected. The total downtime for the services in place5 was 7h 32m, the downtime of the connectivity to place6 was 1 minute.

 

What happened, how we solved it and how we will prevent it from happening in the future (timeline)

On 2018-05-24, 14:53 CEST...

... we noticed a very short power outage in our data center in Schwanden that lasted for about 1 minute. At the time of writing it is not entirely clear whether it was a voltage drop or a real power outage. Our monitoring system informed us at 14:54.

  • Monitoring worked as expected


14:54 ~ 15:05

During this time we verified that it was indeed not a false alarm. As we are a distributed team, the engineers of the ops team connected online and two people were notified to physically move on site. At 15:05 the ops team announced the downtime on Twitter. So far everything went as expected.

  • The team formed as expected

Unfortunately, in the following hours we encountered some problems that should not have been there.

15:05 ~ 22:17

Before our ops team arrived on site, we tried to assess the situation. The first finding was that, to our surprise, only one of our routers was online.

  • One router did not power on after the power outage

We are running redundant routers with VRRP, so in theory one router should be enough to keep everything reachable. This was our first mistake: instead of concentrating on fixing the servers (more on that below), our ops team focused on trying to get router1 back online.

  • Fix redundant services at the end, not at the beginning.
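
As an aside, a quick check would have confirmed within seconds that the virtual gateway was still being served and that router1 could wait. A minimal sketch of such a check (the addresses below are invented, not our real ones):

    #!/usr/bin/env python3
    # Minimal sketch (not our production monitoring): distinguish "one router
    # down, VRRP virtual IP still served" from "gateway completely gone".
    # The IP addresses are made up for illustration.
    import subprocess

    HOSTS = {
        "vrrp-vip": "203.0.113.1",   # virtual gateway address (hypothetical)
        "router1":  "203.0.113.2",   # physical router 1 (hypothetical)
        "router2":  "203.0.113.3",   # physical router 2 (hypothetical)
    }

    def is_up(ip: str) -> bool:
        # One ICMP echo request with a 2 second timeout.
        return subprocess.run(
            ["ping", "-c", "1", "-W", "2", ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    if __name__ == "__main__":
        status = {name: is_up(ip) for name, ip in HOSTS.items()}
        for name, up in status.items():
            print(f"{name}: {'up' if up else 'DOWN'}")
        if status["vrrp-vip"] and not all(status.values()):
            print("Gateway is still served; fixing the dead router can wait.")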

The remote management of router1 was not reachable, and after the first person arrived on site, the router would not even turn on with a (long) press of the power button. It took us some time to realise that we had seen this behaviour before: if the voltage provided by the grid drops (not a full power outage), servers can end up in a "zombie" mode. In this mode they are half turned on (fans spinning) and half turned off (remote management card not booted, no signal on the display, no reaction to the keyboard).

  • Always have an offline copy of known problems (and their fixes) for outages

This problem could also have been prevented if all core systems had been connected to a UPS; however, the routers were not.

  • Have all core systems on UPS (to compensate voltage drops)

The UPS units for the core systems were ordered the day after the outage and should be installed by the end of this week at the latest - we will update this blog article as soon as they are in place.

After getting router1 back, we proceeded to fix the servers for the VM hosting and realised that only one of our servers had booted up. Having seen the zombie mode on the router, we expected the same to be the case for the servers. However, that expectation turned out to be wrong.

Some of the servers reported that there was no operating system installed. This was a perplexing discovery, as the servers had still been running just about two hours earlier. We feared that our servers had somehow incurred data loss and tried to boot them from an external medium. After preparing a bootable Devuan USB stick, we found that it did not contain mdadm, which is required for assembling the software RAID. So we had to create another boot medium, this time with grml on it. And the time went by, as the USB sticks were rather slow.

  • Always have a verified, bootable medium in each location
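
In practice, "verified" can be as simple as a self-test that is run once from the freshly booted medium before it goes into the drawer. A minimal sketch, with a tool list that reflects what we were missing (mdadm) and is by no means complete:

    #!/usr/bin/env python3
    # Minimal sketch of a "rescue medium self-test", assuming it is run once
    # from the freshly booted rescue system. The tool list is an example,
    # not an exhaustive inventory.
    import shutil
    import sys

    REQUIRED_TOOLS = [
        "mdadm",      # assemble the software RAID holding the OS
        "sgdisk",     # inspect/repair GPT partition tables
        "ceph",       # talk to the storage cluster if needed
        "rsync",
    ]

    missing = [tool for tool in REQUIRED_TOOLS if shutil.which(tool) is None]

    if missing:
        print(f"rescue medium NOT usable, missing: {', '.join(missing)}")
        sys.exit(1)

    print("all required tools present on this rescue system")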

With grml we found that everything was still intact and that no data loss had occurred. Almost hooray. However, a leftover from the problems we had in autumn 2017 showed up: from autumn 2017 to the beginning of 2018, we had a massive number of disks (SSDs and HDDs) dying. In place5 our original setup is that every server has its operating system on a software RAID 6 and uses ceph for data storage. So while a lot of disks died in 2017, none of the servers ever had any downtime. When we began to operate place6, we decided to configure all servers there to get their operating system from the network. Why is this important?

The scripts we used to replace disks created a GPT partition table with partitions suitable for software RAID and ceph. GPT, however, does not have the concept of a bootable flag the way the old MBR scheme does. While the servers in place6 do not care, as they netboot, the servers in place5 lost the ability to boot once we had replaced almost all disks.

  • Keep things the same in different locations
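
For illustration, a hypothetical version of such a disk-replacement step that keeps legacy BIOS booting possible could add a small BIOS boot partition for GRUB next to the RAID partition. This is a sketch with invented device names and sizes, not our actual replacement script:

    #!/usr/bin/env python3
    # Hypothetical sketch of a disk-replacement partitioning step that keeps
    # a GPT-partitioned disk bootable on legacy BIOS machines by adding a
    # small BIOS boot partition (sgdisk typecode ef02) for GRUB, next to the
    # software RAID partition. Device names and sizes are invented.
    import subprocess
    import sys

    def run(*cmd: str) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def partition(disk: str) -> None:
        run("sgdisk", "--zap-all", disk)                       # wipe the old partition table
        run("sgdisk", "-n", "1:0:+2M",  "-t", "1:ef02", disk)  # BIOS boot partition (GRUB core image)
        run("sgdisk", "-n", "2:0:+32G", "-t", "2:fd00", disk)  # OS partition, Linux software RAID member
        # The remaining space is left untouched here; ceph's own tooling
        # creates and tags its data partition when the OSD is prepared.

    if __name__ == "__main__":
        partition(sys.argv[1])   # e.g. ./repartition.py /dev/sdX

After such a step, the new OS partition would still need to be re-added to the software RAID with mdadm and GRUB reinstalled onto the disk - exactly the kind of per-location difference the lesson above is about.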

So there we were, having all the data but nothing to boot the servers with. Luckily, our engineers had prepared netboot in place5 some weeks earlier; however, we had not rolled it out in production yet. So we decided to get the systems back up using netboot. However, the firmware of our 10 Gbit/s fiber network cards does not support netboot. Luckily, we do have onboard network cards! Unfortunately, our switches did not have enough copper ports to connect all servers. Luckily again, we did have a spare copper switch in place6, which one member of the ops team quickly picked up and brought to place5.

  • Always have enough spare parts

After reconfiguring all servers for netboot, we faced the problem that they would not receive an IP address on the internal network: because the netboot service was not yet in production, the firewall had not been opened to allow the requests. While this was perfectly correct from a security point of view, it cost us additional time. If the transition to netboot had been finished before the outage, the downtime would probably have been less than 10 minutes.

  • Keep transitional states short
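
One cheap way to keep such a transition honest is an end-to-end smoke test that runs against the production firewall. The following sketch only probes the TFTP leg of a typical PXE setup (server address and file name are assumptions, and the DHCP leg would need its own check):

    #!/usr/bin/env python3
    # Minimal sketch of a netboot smoke test: send a TFTP read request for
    # the PXE loader and check that *any* reply (data or error) makes it
    # back through the firewall. Server and file name are placeholders.
    import socket
    import sys

    TFTP_SERVER = "10.0.0.10"        # hypothetical netboot server
    FILENAME = "pxelinux.0"          # hypothetical boot loader file

    def tftp_reachable(server: str, filename: str, timeout: float = 3.0) -> bool:
        # RRQ packet: opcode 1, filename, NUL, transfer mode, NUL
        rrq = b"\x00\x01" + filename.encode() + b"\x00" + b"octet" + b"\x00"
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(rrq, (server, 69))
            try:
                data, _addr = sock.recvfrom(1024)
            except socket.timeout:
                return False           # filtered or no TFTP service: netboot would hang
        # Opcode 3 = DATA (file exists), 5 = ERROR (service reachable, file missing)
        return data[:2] in (b"\x00\x03", b"\x00\x05")

    if __name__ == "__main__":
        ok = tftp_reachable(TFTP_SERVER, FILENAME)
        print("netboot path looks", "open" if ok else "BLOCKED")
        sys.exit(0 if ok else 1)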

After fixing the firewall, most servers booted into a fresh Devuan installation and were reconfigured with our configuration management system cdist. However, some servers still would not boot up: their RAID controllers had detected an unclean cache, stopped the boot process and required manual intervention - we had to clear the caches on the controllers. In this context, "cache" refers to old configuration data (i.e. physically removed disks), not to the write-back cache. Could we have detected this problem earlier? If our servers had rebooted regularly, we would have.

  • Reboot servers regularly to ensure they do actually reboot

After all servers were back up, we tried to use our scripts for bringing the ceph cluster back. Unfortunately, the automatic startup failed, because the partition layout (3 partitions instead of 2, different partition types) was incompatible with the scripts, which were mainly developed for place6. For that reason we had to inspect the disks semi-automatically, semi-manually to find out which OSD lives on which partition (a sketch of this inspection follows below). Again, this would not have happened if place5 had already been completely migrated to the netboot setup.

  • Keep transitional states short, really!
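
For the record, the inspection boils down to something like the following sketch, assuming the OSDs were prepared with ceph-disk (which labels the data partition "ceph data" and stores the OSD id in a "whoami" file); it needs to be run as root:

    #!/usr/bin/env python3
    # Minimal sketch of the semi-automatic inspection we had to do by hand:
    # find partitions labelled as ceph data, mount each read-only and read
    # the "whoami" file to learn which OSD id lives on it.
    import subprocess
    import tempfile

    def ceph_data_partitions() -> list[str]:
        out = subprocess.run(
            ["lsblk", "-ln", "-o", "NAME,PARTLABEL"],
            capture_output=True, text=True, check=True,
        ).stdout
        parts = []
        for line in out.splitlines():
            fields = line.split(None, 1)
            if len(fields) == 2 and fields[1].strip() == "ceph data":
                parts.append("/dev/" + fields[0])
        return parts

    def osd_id(partition: str) -> str:
        with tempfile.TemporaryDirectory() as mnt:
            subprocess.run(["mount", "-o", "ro", partition, mnt], check=True)
            try:
                with open(f"{mnt}/whoami") as f:
                    return f.read().strip()
            finally:
                subprocess.run(["umount", mnt], check=True)

    if __name__ == "__main__":
        for part in ceph_data_partitions():
            print(f"{part}: osd.{osd_id(part)}")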

After the ceph storage cluster was back up and running, we tried to redeploy the virtual machines. However, qemu (our hypervisor) now replied with "Operation not permitted", because reconfiguring the servers after netbooting had not correctly set up the libvirt secret that is required to access the storage of the VMs. How could this have been prevented? You guessed it:

  • Keep transitional states short and reboot servers regularly
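
The missing step was registering the cephx key as a libvirt secret on every (re)installed host. A minimal sketch of that step, with a placeholder UUID and a hypothetical client name (our cdist types handle the details differently):

    #!/usr/bin/env python3
    # Minimal sketch of the step our post-netboot reconfiguration missed:
    # registering the cephx key as a libvirt secret so qemu may open the
    # RBD volumes. UUID and client name are placeholders.
    import subprocess
    import tempfile

    SECRET_UUID = "7e6a1d2c-1111-4222-8333-000000000042"   # placeholder UUID
    CEPH_CLIENT = "client.libvirt"                          # hypothetical cephx user

    SECRET_XML = f"""<secret ephemeral='no' private='no'>
      <uuid>{SECRET_UUID}</uuid>
      <usage type='ceph'>
        <name>{CEPH_CLIENT} secret</name>
      </usage>
    </secret>
    """

    def main() -> None:
        # Fetch the base64 key for the cephx user that qemu/libvirt uses.
        key = subprocess.run(
            ["ceph", "auth", "get-key", CEPH_CLIENT],
            capture_output=True, text=True, check=True,
        ).stdout.strip()

        # Define the secret object in libvirt, then attach the key to it.
        with tempfile.NamedTemporaryFile("w", suffix=".xml") as xml:
            xml.write(SECRET_XML)
            xml.flush()
            subprocess.run(["virsh", "secret-define", xml.name], check=True)
        subprocess.run(
            ["virsh", "secret-set-value", "--secret", SECRET_UUID, "--base64", key],
            check=True,
        )

    if __name__ == "__main__":
        main()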

Redeploying all VMs: 22:17 ~ 22:25

After the keys were corrected on the servers, we started to redeploy the VMs to all servers at 22:17 and finished by 22:24. At 22:25 we announced on Twitter that everything was back online.

Actions required

We acknowledge that many of these things should not have happened. That is why we are now focusing on making our infrastructure and our communication more resistant to failures.

Customer notifications (ETA 2018-06-01)

Even though we announced the downtime on Twitter, one of our customers was not aware of where to look for the information. We will send an informational mail to all of our customers in the next few days to ensure they know where to find status information and how to reach us in case of an outage.

UPS for core services (ETA 2018-06-01)

While our infrastructure is designed to cope with partial failure of services, we will add UPS units for our core machines. This should prevent the observed zombie mode from happening again.

Provide verified, easy to locate boot media (ETA 2018-06-01)

We will test whether grml works in all our cases and provide bootable media in both locations, in addition to the existing netboot setup.

Repartition disks / re-add to ceph (ETA 2018-06-08)

To allow automatic startup of all ceph disks, we will repartition the wrongly partitioned disks and re-add them to the cluster. This will cause a data rebalance and will take about 14 days to finish. Should another outage happen in the meantime, we will write another set of scripts to deal with the second format.
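
To avoid degrading the cluster, the repartitioning will be done one disk at a time. A rough sketch of that loop, assuming a Luminous-or-newer cluster and with the actual repartitioning and re-adding elided:

    #!/usr/bin/env python3
    # Rough sketch of a "one disk at a time" drain loop: mark a single OSD
    # out, wait until the cluster has rebalanced back to HEALTH_OK, and only
    # then touch the next disk. OSD ids are hypothetical.
    import subprocess
    import time

    def ceph(*args: str) -> str:
        return subprocess.run(
            ["ceph", *args], capture_output=True, text=True, check=True
        ).stdout.strip()

    def wait_for_health_ok(poll_seconds: int = 60) -> None:
        while not ceph("health").startswith("HEALTH_OK"):
            time.sleep(poll_seconds)

    def drain_osd(osd_id: int) -> None:
        ceph("osd", "out", str(osd_id))   # data is re-replicated away from this OSD
        wait_for_health_ok()              # do not continue while the cluster is degraded
        # ... now the OSD can be stopped, its disk repartitioned and re-added.

    if __name__ == "__main__":
        for osd_id in [3, 7, 12]:         # hypothetical OSDs with the old layout
            drain_osd(osd_id)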

Automatic server reboot (ETA 2018-06-29)

To avoid a repetition of massive reboot failures, servers will be configured to reboot once per week. This, however, does not mean that customers will experience downtime, because customers' VMs will not be affected by this change.
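
A simple way to stagger such reboots is to derive each host's weekly slot from its hostname and to skip the reboot if any VM is still running on the host. A sketch of that idea (it assumes the job runs hourly from cron and that other tooling has moved customer VMs away beforehand):

    #!/usr/bin/env python3
    # Sketch of a staggered weekly reboot job, run hourly from cron: each
    # host derives a stable (weekday, hour) slot from its hostname, and the
    # reboot is skipped if any VM is still running here.
    import hashlib
    import socket
    import subprocess
    import time

    def my_slot() -> tuple[int, int]:
        # Map the hostname to a (weekday, hour) pair so hosts reboot at
        # different, but stable, times of the week.
        digest = hashlib.sha256(socket.gethostname().encode()).digest()
        return digest[0] % 7, digest[1] % 24

    def vms_running() -> bool:
        out = subprocess.run(
            ["virsh", "list", "--name"], capture_output=True, text=True, check=True
        ).stdout
        return any(line.strip() for line in out.splitlines())

    if __name__ == "__main__":
        weekday, hour = my_slot()
        now = time.localtime()
        if (now.tm_wday, now.tm_hour) == (weekday, hour) and not vms_running():
            subprocess.run(["reboot"], check=True)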

The show must go on

Being a young team or a startup is not an excuse for a downtime, nor does it undo the emotions or the bad feelings caused by the downtime. We sincerely apologise to all our customers and promise to improve the stability of our services so that you can fully enjoy them. Our whole team is fully committed to making our data center a truly enjoyable experience. To keep you close to our commitment, we will update this article with our progress and publish another review of our infrastructure 6 months from now.

Updates

2018-05-30

Bootable media have now been verified. We will also distribute bootable rescue systems to everyone on the engineering team.

2018-06-01

All core services are now protected by UPS.