September 14th Outage Postmortem

On September 14, 2020, the VoltBuilder server went down for several hours. This document explains what happened and what we are doing to make sure it doesn’t happen again.

As is usual in these cases, there wasn’t a single failure: one problem uncovered the next, and so on. We used this incident to look for other weaknesses in our processes.

The root cause was the main build server running out of disk space. While we have jobs in place to clear app data soon after builds are completed, we found that one of our build tools was caching intermediate build products. We’ve improved our existing clean-up process to include these cached files. We’re also monitoring free space on the build server more closely in case another anomaly crops up.
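To give a sense of what such a clean-up and monitoring job can look like, here is a minimal sketch in Python. The cache path, age threshold, and free-space limit are illustrative placeholders, not our actual configuration or tooling.

import shutil
import time
from pathlib import Path

# Illustrative values only -- not our actual paths or thresholds.
CACHE_DIR = Path("/var/voltbuilder/build-cache")   # hypothetical cache location
MAX_AGE_SECONDS = 6 * 60 * 60                      # purge cached artifacts older than 6 hours
MIN_FREE_BYTES = 20 * 1024**3                      # warn below 20 GB free

def purge_stale_cache(cache_dir: Path, max_age: float) -> int:
    """Delete cached intermediate build products older than max_age seconds."""
    removed = 0
    now = time.time()
    for entry in cache_dir.rglob("*"):
        if entry.is_file() and now - entry.stat().st_mtime > max_age:
            entry.unlink()
            removed += 1
    return removed

def check_free_space(path: Path, min_free: int) -> None:
    """Warn (here, just print) if free space falls below min_free."""
    usage = shutil.disk_usage(path)
    if usage.free < min_free:
        print(f"WARNING: only {usage.free / 1024**3:.1f} GB free on {path}")

if __name__ == "__main__":
    removed = purge_stale_cache(CACHE_DIR, MAX_AGE_SECONDS)
    print(f"Removed {removed} stale cache files")
    check_free_space(CACHE_DIR, MIN_FREE_BYTES)

In practice a job like this would run on a schedule and feed an alerting system rather than printing, but the shape is the same: purge stale intermediate artifacts, then verify free space stays above a threshold.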

Normally, resolving the space issue would have solved the problem, but we continued to see a number of problems with the main build server. Further investigation showed that we were experiencing a localized DNS outage. It appears that when the host ran out of space, the network adapter settings were somehow corrupted, leaving the main build server able to reach the network but unable to resolve any DNS names. Resetting the network interface resolved this issue.
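This kind of partial failure (network up, DNS broken) is easy to catch with a simple health check. The sketch below, in Python, is only illustrative; the target IP address and hostname are placeholders, not our actual monitoring configuration.

import socket

def network_reachable(host_ip: str = "1.1.1.1", port: int = 53, timeout: float = 3.0) -> bool:
    """Check raw connectivity by connecting to a public IP address, bypassing DNS."""
    try:
        with socket.create_connection((host_ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def dns_working(hostname: str = "example.com") -> bool:
    """Check whether the host can resolve DNS names at all."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    if network_reachable() and not dns_working():
        print("ALERT: network is up but DNS resolution is failing")

A check along these lines would have flagged the DNS failure directly, instead of it surfacing as a string of seemingly unrelated build errors.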

Once the DNS issue was resolved, our build infrastructure was back to building apps. No submissions were lost during this time, though we did have a backlog to work through; clearing it took about an hour.

Over the course of dealing with this outage, we also identified a few other changes we’re implementing to keep VoltBuilder running smoothly and consistently.

In particular, to make our systems more resilient against outages like this one, we will continue to expand our server infrastructure.

We are very sorry for the inconvenience this issue caused.