How we discovered, mitigated, and enabled the Go team to fix Go bug 30041.

In January of 2019, we found our production servers crashing several times an hour with error messages that read
"fatal error: sweep increased allocation count"
or
"found bad pointer in Go heap (incorrect use of unsafe or cgo?)"
with a mention of "pointer to unallocated span" before that message.

This was not good. While our server processes automatically restart after crashing, a process could have a large number of client requests in flight at the time of the crash, and suddenly terminating those requests could have user-visible impact.

We started investigating.

Part 1 – Diagnosing the issue

The first question we set out to answer was when the problem had started, to see whether there were recent changes we could roll back to try to fix it. During our investigation we made an alarming discovery: these crashes had been happening sporadically for longer than we realized, and our crash reporting was broken. (Fortunately, these were the only kinds of unreported crashes.) Searching through the server logs, we found that the frequency had previously been very low, but had increased significantly one day. Based on the deployments that day, we were able to narrow it down to a set of about 30 changes that could have caused the increase in frequency, but none of them stood out as a likely culprit.

The crashes were happening across all of our Go services, suggesting a problem in shared infrastructure rather than application code. We were unable to reproduce the issue locally, so we started making production changes to narrow it down (in parallel with fixing our crash reporting).

Looking for an easy way out

An easy potential fix was to upgrade Go from 1.11.1 to 1.11.4, the latest version at the time. This didn't help. We also tried downgrading to 1.10; this wasn’t too difficult since we weren’t using any of the new Go features, but was messy because the formatting in "go fmt" had changed. This didn’t help either. We reverted the mess.

The final version of Go we wanted to try was a custom build of the Go 1.11 release branch to pick up some changes that had already been backported from 1.12, even though none of them looked like they would be relevant. This was slightly more complicated since we had to modify our infrastructure to allow for custom Go toolchains instead of the official releases, but this seemed like a good investment anyway. As expected, it didn't help; but, also as expected, the infrastructure investment would pay off later.

Removing unsafe

The error messages suggested an incorrect use of unsafe or cgo. Our servers are built with cgo disabled, so our first hypothesis was an incorrect use of unsafe. We grepped our codebase for usages of unsafe and found two third-party libraries where it was used. The first we were able to simply remove since it was just a debugging feature; the second we were able to modify to avoid the use of unsafe.

We then wrote a small tool that analyzed the import graph, based on the AST, to confirm that no other uses of unsafe were reachable apart from those in the standard library.
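Our tool parsed the AST directly; a rough sketch of the same kind of check, written against golang.org/x/tools/go/packages instead of our internal build metadata (the package patterns and the standard-library heuristic below are assumptions, not our actual code), looks something like this:

package main

import (
    "fmt"
    "os"
    "strings"

    "golang.org/x/tools/go/packages"
)

// Reports every non-standard-library package in the transitive import graph
// of the given patterns that imports "unsafe".
func main() {
    cfg := &packages.Config{Mode: packages.NeedName | packages.NeedImports | packages.NeedDeps}
    pkgs, err := packages.Load(cfg, os.Args[1:]...)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    seen := map[string]bool{}
    var visit func(*packages.Package)
    visit = func(p *packages.Package) {
        if seen[p.PkgPath] || isStdlib(p.PkgPath) {
            return
        }
        seen[p.PkgPath] = true
        if _, ok := p.Imports["unsafe"]; ok {
            fmt.Println("imports unsafe:", p.PkgPath)
        }
        for _, imp := range p.Imports {
            visit(imp)
        }
    }
    for _, p := range pkgs {
        visit(p)
    }
}

// Crude heuristic: standard-library import paths have no dot in their first element.
func isStdlib(path string) bool {
    first := path
    if i := strings.Index(path, "/"); i >= 0 {
        first = path[:i]
    }
    return !strings.Contains(first, ".")
}

Note that this over-approximates "reachable": it flags any imported package that imports unsafe, whether or not the unsafe code is actually called, which is the conservative direction for a check like this.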

Even without unsafe or cgo anywhere in our server code, the crashes continued.

What now?

With our easy options exhausted, we added some GODEBUG settings and GOTRACEBACK=crash, as well as the ability to collect core dumps from our production system. We stared at the stack traces and poked around in the core dumps using the delve debugger for a bit—not for long, since the production problems were ongoing—but no pattern was immediately obvious.

Other reports of similar potential bugs in the Go runtime that we had come across while looking for related problems on GitHub (such as #26243) suggested that the Go team wouldn't be able to do much with a crash report that didn't include a way to reproduce the problem. So we decided not to look for help there, for the moment.

At this point, a quick resolution seemed unlikely and the crashes were still happening with the increased frequency. We needed to find a way to stabilize our production system without understanding the cause of the crashes.

Part 2 – Mitigating the issue

The first error message, sweep increased allocation count, sounded like the Go runtime believed more memory was allocated after the "sweep" phase of the garbage collector than before it. We had realized early in our investigation that, while the underlying cause could be corrupted memory-management metadata with serious consequences, the problem as detected sounded rather benign: something like a slow memory leak.

The second message was found [pointer to unallocated span] in Go heap. This seemed more serious; perhaps there was a risk of a use-after-free, with the memory already overwritten. Still, if the garbage collector had previously concluded that there weren't any pointers to the memory, it was possible that there really weren't any and the problem was with the assertion itself.

Neither of these errors seemed like a reason to immediately crash the process and abort all pending requests. Programs in certain other languages have memory management bugs from time to time, too, and (mostly) get away with it.

So we decided to take the gamble and disable these assertions in the Go runtime.

Marking processes as unhealthy

Rather than crashing the process, we decided instead to mark the process unhealthy so that it would stop receiving new requests from the load balancer. The load balancer would then drain connections and restart the server once it had finished processing all current requests. This would minimize the window in which the potential memory corruption could have consequences, without aborting in-flight requests. If every server became unhealthy at the same time, that could cause problems of its own, but our crashes were infrequent enough that it was a risk we were willing to take.

While this seemed conceptually simple, it wasn't immediately obvious how to mark our servers as unhealthy in response to an assertion failure from within the garbage collector. We needed a way to communicate from the garbage collection code to the application code running on top of the runtime.

Global variables to the rescue

While we aren't too familiar with the internals of the Go runtime, we were able to hack together a patch that exposes a var CrashHook func() that controls the behavior on assertion failures. If non-nil, instead of crashing, the runtime will call this function and keep going; if nil, it will crash as before.

Runtime crashes go through a function throw that takes the error message as a string argument. We added a wrapper function maybethrow:

var CrashHook func()

func maybethrow(s string) {
    if CrashHook == nil {
        // No hook installed: crash exactly as before. throw never returns.
        throw(s)
    }
    // A hook is installed: print the error the way throw would, then let the
    // hook run and keep going instead of crashing.
    systemstack(func() {
        print("fatal error: ", s, "\n")
    })
    CrashHook()
}

and then replaced calls like

throw("sweep increased allocation count")

with

maybethrow("sweep increased allocation count")

in the places that were crashing.

We didn't try to characterize what exactly a CrashHook function is allowed to do, but it's certainly very limited—the function is getting called from the runtime while the runtime is still holding various internal locks, so the function can't safely re-enter certain parts of the runtime. We also have little confidence that the patch is completely correct (e.g., we didn't look into whether we should be using //go:nosplit). Still, it seemed to work. Exposing a hook for arbitrary user code was intended to avoid the hassle of rebuilding the runtime while we iterated on a solution.

Our health check handlers already had a feature to return unhealthy when a global variable was set (we use this for testing), so the CrashHook function we installed used sync/atomic to set that global variable, thus marking the process unhealthy. (If we had to do this again, we'd probably have the runtime set an "atomic" global variable directly, and poll that variable from code outside the runtime, to avoid concerns around nosplit and similar subtleties.)
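In practice, the wiring looked roughly like the sketch below. runtime.CrashHook exists only in our patched toolchain, and the handler and variable names here are illustrative rather than our actual code:

package health

import (
    "net/http"
    "runtime"
    "sync/atomic"
)

// Set to 1 by the runtime hook once a garbage collector assertion fails.
var gcAssertionFailed int32

func init() {
    // Installed into the patched runtime. It runs while the runtime may hold
    // internal locks, so it does nothing but a single atomic store.
    runtime.CrashHook = func() {
        atomic.StoreInt32(&gcAssertionFailed, 1)
    }
}

// Handler serves the load balancer's health check. Once the flag is set, the
// load balancer stops routing new requests to this process, drains the
// in-flight ones, and the process is then restarted.
func Handler(w http.ResponseWriter, r *http.Request) {
    if atomic.LoadInt32(&gcAssertionFailed) == 1 {
        http.Error(w, "unhealthy: garbage collector assertion failed", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}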

The change worked and restored the health of our production system. The crashes disappeared and our server processes automatically restarted to recover from the corrupt garbage collector state. However, under the hood, the bug was still there and we had more work to do.

Part 3 – Isolating the issue

Now that our system was stable, we had time to delve deeper into the root of the problem. We checked a few more things that were easy to test: we ran our servers with compiler optimizations disabled (for a few weeks, since we accidentally forgot to re-enable them—the effect was barely noticeable), and with GOMAXPROCS=1 (for a few minutes, since we almost immediately received alerts that our system had slowed down to a crawl, so we had to revert).

Next, we returned to staring at stack traces and, eventually, started noticing some patterns. Many, but not all, of the stack traces included a function introduced in one of the suspect changes that immediately preceded the increase in crash frequency. The change had added a feature that simplified certain production maintenance procedures, but we weren’t planning on using it in the near future. We reverted that change and—tada—the frequency of the crashes (or, in our current system, processes going unhealthy) went down again. Finally, we had a solid lead and a path to reproducing the problem.

We then built a local stress test for this codepath and were able to reproduce the crashes locally. We chased one more false lead: removing some debugging code seemed to make the crashes go away locally, so we removed it from the production system as well, but that turned out not to help; we simply hadn't run the stress tests for long enough.

It took us a few days to shave down the crashing code bit by bit until it was small enough to submit to the Go team as bug 30041. Within 6 hours, the Go team had a plausible theory and a proposed patch, which we immediately applied to our production system. The crashes have been gone ever since, and we've reverted to a runtime without the CrashHook hack.

Underlying cause

The change where the crashes started had introduced a new flag using our package for parameters that can be configured at runtime (similar to the recently released Go Cloud runtimevar). We were already using this package extensively throughout the code without issues, but in this particular location we also happened to be calling an older function that returned a large, monolithic configuration struct.

Our flag package uses reflect.MakeFunc internally to create strongly typed wrapper functions for the flag lookups. Calling a flag lookup function created by reflect.MakeFunc, followed shortly afterwards by the creation of the large struct, would trigger the crash.
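As an illustration of the pattern (not our actual flag package; the names dynflag, lookupRaw, and MakeLookup are made up), a strongly typed lookup wrapper built with reflect.MakeFunc looks something like this:

package dynflag

import "reflect"

// lookupRaw stands in for the store of runtime-configurable parameters;
// here it always returns 42.
func lookupRaw(name string) interface{} {
    return 42
}

// MakeLookup fills in *fnPtr, which must point to a variable of type
// func(string) T, with a wrapper that fetches the named parameter and
// converts it to T.
func MakeLookup(fnPtr interface{}) {
    fnVal := reflect.ValueOf(fnPtr).Elem()
    fnType := fnVal.Type()
    wrapper := reflect.MakeFunc(fnType, func(args []reflect.Value) []reflect.Value {
        raw := reflect.ValueOf(lookupRaw(args[0].String()))
        return []reflect.Value{raw.Convert(fnType.Out(0))}
    })
    fnVal.Set(wrapper)
}

A caller would declare something like var maxItems func(string) int and call MakeLookup(&maxItems) once at startup, then call maxItems("max-items") wherever the value is needed. In our crashing codepath, a lookup call of this kind was followed shortly afterwards by construction of the large configuration struct.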

Based on our example, the Go team realized that reflect.MakeFunc was writing the function return values back onto the stack in a way that corrupted the Go garbage collector state. Creating the large struct immediately after made it much more likely to trigger the garbage collector while the memory was in this bad state, causing it to crash.

Closing remarks

One of the nice things about working with Go is how much of the stack is written in Go. You can jump-to-definition your way all the way through the standard library, and even the runtime is in Go. It was easy for us to change our deployment system to use a custom runtime, to make the garbage collector feed information up into our application, and to get a simple modification to the Go runtime working well enough to take the pressure off a production emergency.

Thanks to the Go team for quickly triaging and fixing the bug.

If you’re interested in helping us solve and prevent future issues like these (or in implementing proper retries throughout our client-side code) message us at builders@shift.com.

The Go gopher was designed by Renee French.