Application Tuning in go: Benchmarks

January 15, 2018

One of the best parts about writing and maintaining software with go is that it comes with “batteries included” for developing scalable applications. Beyond just verifying the correctness of your code using go’s testing package, you can just as easily measure and track the performance of your code, using more or less the same paradigms and constructs. To illustrate the power and simplicity of benchmarks, this post walks through an example, real-world, tuning scenario.

Our sample program

For our exercise we will be tuning the performance of a statsd client that our hypothetical web server relies on. Our server calls out to this client multiple times when serving each request, making it a very “hot” region of code. By using pprof, we have identified that we are spending a lot of CPU time in one function - a small utility for adding tags to a stat string as per the DogStatsD spec.

// fmtStatString takes a preformatted statting string and appends
// a flattened map of tags to it.
func fmtStatStr(stat string, tags map[string]string) string {
  parts := make([]string, 0, len(tags))
  for k, v := range tags {
    if v != "" {
      parts = append(parts, fmt.Sprintf("%s:%s", clean(k), clean(v)))
     }
  }

  return fmt.Sprintf("%s|%s", stat, strings.Join(parts, ","))
}

Writing A Benchmark

The process of writing benchmarks should be very familiar to those who have written a functional test in go before. The key difference is a different parameter to your testing function - *testing.B instead of *testing.T, as well as the inclusion of a small looping construct in the body of your test.

Most benchmarks ultimately end up resembling this pattern:

func exLog() {
  log.Println("bench")
}

func BenchmarkLogging(b *testing.B) {
    for i := 0; i < b.N; i++ {
      exLog()
    }
}

Under the hood, the go test tool will the execute code inside of the benchmark loop until a certain time limit is reached, at which point it is simple to derive performance characteristics from the number of iterations that finished and the resource consumption of your application during that time.

In order to get the absolute tightest bound on your application’s true performance, you should design your benchmark so that you have only the code you wish to measure inside the benchmark “for” loop. By doing any setup outside of the benchmark loop and then calling b.ResetTimer you will ensure that no allocations or cycles spent during setup will pollute your results.

For our fmtStatStr function, we don’t need a lot of prep-work and can write a straightforward benchmark:

func BenchmarkLogging(b *testing.B) {
    for i := 0; i < b.N; i++ {
      fmtStatStr("handler.sample", map[string]string{"os": "osx", "locale": "en-US"})
    }
}

Running your benchmark

You can run your benchmarks via the go test tool simply by adding a few flags, like so:

go test -memprofile -bench=.

After running this command and waiting a few seconds you should see something like this output to your terminal:

BenchmarkLogging-4     1000000       1462 ns/op      584 B/op       14 allocs/op

The output can be read from right to left as

The name of the benchmark function
The number of iterations the loop was run for.
The number of nanoseconds of system time each iteration took.
The number of bytes allocated per iteration.
The number of distinct allocations that took place in each operation.

Analyzing and Iterating

Now that we have a good baseline of our program’s performance, we can begin refactoring our code to see how much work we can shave off. There is no single process or magical remedy for all performance issues, but a few quick Google searches and a dive into the relevant go docs is probably a good place to start.

After a little research we discovered that using fmt.Printf to concatenate and join strings is a fairly CPU and memory intensive process and that using a bytes.Buffer is much more efficient means of achieving the same result. Armed with this knowledge we refactor our function to look like this:

func fmtStatStrOpt(stat string, tags map[string]string) string {
  b := bytes.NewBufferString(stat)
  b.WriteString("|")

  first := true
  for k, v := range tags {
    if v != "" {
      if !first {
        // do not append a ',' at the
        // beginning of the first iteration.
        b.WriteString(",")
      }

      b.WriteString(k)
      b.WriteString(":")
      b.WriteString(v)

      // automatically set to falsey after first iteration
      first = false
    }
  }

  return b.String()
}

Now that we have two identical functions, we can compare their performance side to side and see which is better! To do so we add a new benchmark function and supply the same parameters to both tests.

var test := struct {
  stat string
  tags map[string]string
}{
  "handler.sample",
  map[string]string{
    "os": "ios",
    "locale": "en-US",
  }
}

func Benchmark_fmtStatStrBefore(b *testing.B) {
  for i := 0; i < b.N; i++ {
    fmtStatStr(test.stat, test.tags)
  }
}

To run both of our benchmarks we can run the same command as before:

go test -benchmem -bench=.

Which should output something like:

Benchmark_fmtStatStrBefore-4     1000000              1333 ns/op             248 B/op         12 allocs/op
Benchmark_fmtStatStrAfter-4      3000000               492 ns/op             224 B/op          4 allocs/op
PASS

At first glance our code immediately looks a lot better! Not only is our code faster (lower ns/op), but we are also doing fewer allocations per operation. Not too bad for a few minutes of work!

Aside:

Notice how much longer and difficult to read the optimized function is? Now is a good time to meditate on one of the core tradeoffs of optimization - while your code will be faster, it will almost always be more brittle, harder to read, and much more difficult to maintain by other engineers. This is why it is imperative that you only tune code that needs to be optimized and write tests to verify the original behavior of your code before refactoring it and adding additional complexity for the sake of performance.

Caveat Emptor - Benchmarks and the Go compiler

Although writing benchmarks is simple and straightforward for 95% of all use cases, there are some common traps you have to look out for.

Escape Analysis Tricks

In some situations, your code may be subject to some compile-time optimizations which render your benchmarks invalid.

Consider the following scenario:

func concat(a, b string) string {
  return a + b
}

func Benchmark_concat(b *testing.B) {
  for i := 0; i < b.N; i++ {
    concat(a, b)
  }
}

Since concat is a very short function and its return value isn’t captured at its callsite, the go compiler may happily remove this function call for you - turning your benchmark into a really lame game of for-loop musical chairs. To save yourself from this benign mistake you can trick the compiler into thinking your code is more important than it really is by capturing return values in various scopes - like so:

// package level scope
var concatRes string

func Benchmark_concat(b *testing.B) {
  // function level scope
  var res string

  for i := 0; i < b.N; i++ {
    // block level scope

    // assign variable outside of block scope so that this
    // loop is not escaped.
    res = concat(a, b)
  }

  // assign final result to a package-level variable to prevent
  // closure level escaping.
  concatRes = res
}

If you notice that your benchmarks are running suspiciously fast, you may want to to obfuscate your code and make it a little harder for the compiler to eliminate your slow code for you :)

Noisy Neighbors

If you are running benchmarks on your local machine it is important that you try and minimize the amount of other work that your computer is doing during the benchmark. You should shut down Slack, XCode, your Bitcoin rig etc.. if you want to have fair and reproducible benchmarks.