A Peek Into Profiling Go

Tue, May 16, 2017

Some programmers consider profiling code to be some unknown magic only wizards from A Great Age can do and they only can because of the years they’ve put in to the practice. I know because I used to be one of them until I learned more about two handy tools under the extremely useful go command.

These tools make it easy for anyone to have a look under the hood of their Go code and are highly encouraged to by the community, but only when performance becomes a true concern. In fact, most performance questions I see about Go are almost always met with a “Have you benchmarked it yet?” before going any further. It’s not to deter the questioner, it’s just too easy not to. With a benchmark, a more beneficial and informative discussion to continue.

Hopefully, the rest of this post will show you how quick it can be to profile some of your code. If your servers are on fire and you desparately need to find the problem and get some sleep, there’s a TL;DR at the bottom. Otherwise read along so you can learn to fish instead of fill your appetite.

Let’s Make A Benchmark

This seems to usually be the barrier to entry for a lot of programmers. Coming from other languages, making benchmarks usually requires finding a library to help make it less painful but Go’s awesome std lib saves the day again. If you didn’t already know, Go has an awesome testing package with all the functionality you need to write solid tests. What is even more outstanding is it provides all you need for simple benchmarking!

Just like there’s a testing.T, there is a testing.B with useful functions and capabilities for making benchmarks. The most important is b.N because it decides how many times to run a function in order to get accurate results. See? We already don’t have to think about a pretty complicated thing in order to have correct enough benchmarks for our needs.

The other thing a testing.B gives us is the b.ReportAllocs() method. This will tell the testing package to provide allocation information in the test binary we are about to create.

In my opinion, the simplest step to making your first benchmark is to simply copy a happy path test for the function you want to benchmark and make some slight adjustments to it. This is assuming good testing practices are in play so your setup and teardown methods give you an environment where you’re comfortable with this function running many times. It should look something like this:

package suspect_test

import "testing"

TestSuspect(b *testing.B) {
	testSetup()
	Suspect()
  	// test assertions!
	testTeardown()
}

BenchmarkSuspect(b *testing.B) {
	testSetup()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		Suspect()
	}
	testTeardown()
}

With a benchmark in place, we now get to use the awesome go command to interact with it.

I Wanna Go Fast!

First, we can run our benchmark with go test and get an overview of the function. This output is useful for getting before and after stats to confirm any changes made for performance actually succeeded and by how much. Someone even built a tool to understand this output and do all the comparison math for you, called benchcmp! Just run go get golang.org/x/tools/cmd/benchcmp to install it.

We make sure to tell go test to not run any actual tests in order to keep our machine reasonably quiet by giving -run= a regexp matching nothing. I could have used some text that wouldn’t match a test name like NONE or asldghlksd instead, but I like to keep those names in my back pocket… you know… just in case.

go test -run=^$ -bench=BenchmarkSuspect | tee initialbench.txt
BenchmarkSuspect-4	100	10652038 ns/op	3977365 B/op	16243 allocs/op
PASS
ok  	github.com/username/suspect/	1.092s

And an example of benchcmp in action

benchmark             old ns/op     new ns/op     delta
BenchmarkSuspect-4     10652038      1315424       -87.65%

benchmark             old allocs     new allocs     delta
BenchmarkSuspect-4     16243          2978           -81.67%

benchmark             old bytes     new bytes     delta
BenchmarkSuspect-4     3977365       319078        -91.98%

Now we have this stored in initialbench.txt and it has a lot of useful information in here we can look at. First, the -4 after the benchmark name tells us how many cores the host machine has. The next number, 100, informs us how many times the testing package decided to run the benchmark (remember b.N?).

While this print out is useful, it only lets us kick the tires. We can use go test to create some files for us so we can dig deeper into the Suspect function. The command below creates both a cpu and memory profile, along with keeping the binary built to run the tests with all the information put inside it from the testing package:

go test -run=^$ -bench=BenchmarkSuspect -cpuprofile=prof.cpu -memprofile=prof.mem

Under The Hood

With these files, we can use go tool pprof to do a deep dive into what the function is actually doing. pprof takes the binary built by go test and the profile information it is given and provides a prompt where one can continue to analyze the data collected in those two files.

suspect.test is the test binary created (named after the package) and -alloc_space tells pprof to report how many allocations any code makes

go tool pprof suspect.test prof.cpu
go tool pprof -alloc_space suspect.test prof.mem

We’re officially in Super Awesome Magic Land for me, but I’ll give you a tour of the commands I’ve learned and have found the most useful.

First, there is top. This command lists the top ten nodes (think functions) occupying either your cpu time or memory allocations.

(pprof) top
293.99MB of 350.60MB total (83.85%)
Dropped 41 nodes (cum <= 1.75MB)
Showing top 10 nodes out of 211 (cum >= 18MB)
      flat  flat%   sum%        cum   cum%
  194.39MB 55.44% 55.44%   194.39MB 55.44%  compress/flate.NewReader
   26.23MB  7.48% 62.92%    26.23MB  7.48%  regexp.(*bitState).reset
   20.10MB  5.73% 68.66%    20.10MB  5.73%  bytes.makeSlice
   15.04MB  4.29% 72.95%    15.04MB  4.29%  runtime.rawstringtmp
    9.07MB  2.59% 75.53%     9.07MB  2.59%  reflect.unsafe_NewArray
       8MB  2.28% 77.82%    11.01MB  3.14%  runtime.mapassign
       ...

For the problem code I ran this on, it says a lot of the allocations are in the compress package. But I didn’t write any compression code! Hm. I need to see which one of my functions is causing this flate.NewReader to be called. Wouldn’t it be nice if I could just take a glance at a nice call graph of my benchmark?

We can do exactly that with the web command:

Just make sure you have graphviz installed before you run it because web uses it to create an svg and then uses your system’s browser to open it.

helloworld

From here, the boxes with bigger fonts to represent the function name hint that those functions are your “heavy hitters” — this could be cpu or memory-wise, depending on what you told pprof to load.

For instance, the above image is of a memory profile and while how many allocations my beloved hello world program makes is none of your beeswax, it seems like I’d have to stop using fmt.Println to win any extreme hello world hackathons. This is because it takes an empty interface, which forces Go to do some things behind the scenes in order for it to know how to print what I gave it.

Otto Von Benchmark

Now I believe you have all you need to start writing some initial benchmarks to monitor the performance of your app throughout its lifecycle. This is a valuable asset, as it gives proper insight into what is actually causing problems instead of spending developer time speculating and rewriting innocent code. Not only will this save the company money, but can prevent arguments between teammates and keep the peace!

Obviously, if you operate at a very large scale and performance is of the utmost concern, there is more studying to be done for highly accurate benchmarks. That being said, I’ve been able to keep within the bounds of this article to quickly find problem areas in code and smooth them over in order to improve UX when needed.

Thank You

I took “Go: Beyond the Basics” from Brian Ketelsen, who initially showed me the go tool’s profiling capabilities and I believe he pointed us to Brad Fitzpatrick’s talk where Brad performs a live demo of a lot of this in action, which you can find here. Brad’s demo includes even more of Go’s awesome tooling like the race detector, so I highly encourage you to watch it in your free time.

TL;DR

BenchmarkSuspect(b *testing.B) {
	b.ReportAllocs()
	// Keep all prep outside the loop for accuracy.
	for i := 0; i < b.N; i++ {
		Suspect()
	}
}

go test -run=^$ -bench=BenchmarkSuspect -cpuprofile=prof.cpu -memprofile=prof.mem

go tool pprof suspect.test prof.cpu
# OR
go tool pprof -alloc_space suspect.test prof.mem

# popular pprof commands
web
list Suspect
top
top --cum