September 14th Outage Postmortem

On September 14, 2020, the VoltBuilder server went down for several hours. The purpose of this document is to clarify what happened and what we are doing to make sure it doesn’t happen again.

Outages like this rarely come down to a single failure: one problem uncovers the next, and so on. We used this incident to look for other weaknesses in our processes.

The root cause was the main build server running out of disk space. While we have jobs in place to clear app data soon after builds are completed, we found that one of our build tools was caching intermediate build products. We’ve extended our existing clean-up process to include these cached files. We’re also monitoring the free space on the build server more closely in case another anomaly crops up.
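As an illustration only (this is not VoltBuilder's actual tooling), a clean-up and free-space check of the kind described above could be sketched as follows. The function names, the one-hour age limit and the 15% free-space threshold are assumptions made for the example.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def free_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path: str, min_free_fraction: float = 0.15) -> bool:
    """True when free space drops below the (assumed) alerting threshold."""
    return free_fraction(path) < min_free_fraction


def remove_stale_entries(cache_dir: str, max_age_seconds: float,
                         now: Optional[float] = None) -> List[str]:
    """Delete top-level entries in `cache_dir` not modified within `max_age_seconds`.

    Returns the sorted names of the entries that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in Path(cache_dir).iterdir():
        # lstat() so that a broken symlink does not raise.
        if now - entry.lstat().st_mtime <= max_age_seconds:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)
```

A scheduled job could call remove_stale_entries on the build tool's cache directory and then raise an alert whenever is_low_on_space returns True for the build volume.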

Normally, freeing up disk space would have solved the problem, but in this case we continued to see problems with the main build server. Further investigation showed that we were experiencing a localized DNS outage. It appears that when the host ran out of space, the network adapter settings were somehow corrupted, leaving the main build server able to access the network but unable to resolve any DNS names. Resetting the network interface resolved this issue.
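The symptom described above, a host that can reach the network but cannot resolve names, can be told apart from a full network outage by testing the two things separately. The following is a minimal sketch under that assumption; the function names and result labels are hypothetical, and any hostname or IP address passed in is chosen by the caller.

```python
import socket


def can_resolve(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """True if `hostname` resolves. `resolver` is injectable for testing."""
    try:
        resolver(hostname, None)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def can_connect(ip: str, port: int, timeout: float = 3.0,
                connector=socket.create_connection) -> bool:
    """True if a TCP connection to an IP literal succeeds (no DNS involved)."""
    try:
        conn = connector((ip, port), timeout=timeout)
    except OSError:
        return False
    conn.close()
    return True


def classify(resolves: bool, reachable: bool) -> str:
    """Turn the two probe results into a coarse diagnosis."""
    if not reachable:
        return "network-unreachable"
    if not resolves:
        return "dns-failure"
    return "healthy"
```

A "dns-failure" result matches the situation in this incident: connections by IP address succeed while name lookups fail.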

Once the DNS issue was resolved, we were able to get our build infrastructure back to building apps. No submissions were lost during this time, though we did have a backlog that took about an hour to work through.

While dealing with this outage, we identified a few other changes that we are now implementing to keep VoltBuilder running smoothly and consistently. These include:

  • Making our network retry code more resilient so we can deal with a larger variety of outages, should they occur.
  • Fixing some minor bugs in our build status checking logic so that longer-than-average build times don’t cause unnecessary network traffic.
  • Continuing to improve our monitoring and notification so that our engineers are alerted to any system anomalies and can resolve them quickly.
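One common way to make network retries more resilient, and to keep status polling from generating excess traffic, is capped exponential backoff. The sketch below shows that general technique and is not VoltBuilder's actual retry code; the function names, default delays and the choice to retry on OSError are assumptions made for the example.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 1.0,
                   factor: float = 2.0, cap: float = 60.0) -> List[float]:
    """Delays that grow by `factor` each step and never exceed `cap` seconds."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def retry(operation: Callable[[], T], attempts: int = 5, base: float = 1.0,
          factor: float = 2.0, cap: float = 60.0,
          retry_on: Tuple[Type[BaseException], ...] = (OSError,),
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds or `attempts` calls have failed.

    Only exceptions listed in `retry_on` trigger a retry; the last failure
    is re-raised. `sleep` is injectable so tests do not have to wait.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts - 1, base, factor, cap)
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

The same backoff_delays schedule could also space out build status checks, so that a build running longer than average is polled less and less often instead of at a fixed short interval.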

To make our systems more resilient against outages like this one, we will continue to expand our server infrastructure.

We are very sorry for the inconvenience this issue caused.