Wednesday, January 15, 2014

Benchmarks

I got Cznic's code to work with very little trouble. I spent much more time thinking about how to make benchmark results comparable. The issues I considered are:
  1. The trees must be of the same order (I used 128)
  2. The trees must use the same key type (I used uint64)
  3. The trees must be pre-filled with a large number of elements
Let me take a moment to explain #3. The benchmarking engine in Go works by calling your code with an iteration count. It times your code and then divides the time by the count to determine the number of nanoseconds per iteration. It starts with a small iteration count and increases it until the measurement runs long enough to be reliable. This has a poor property for comparing b-trees: the more elements in the tree, the slower most operations proceed. Since I can't control the iteration counts, I pre-filled the trees with 4 million elements instead. Why 4 million? Because inserting another million elements is not likely to change the depth of the tree: with order 128 each level multiplies the tree's capacity by roughly 128, so a tree holding 4 million keys is only about four levels deep and stays that deep well past 5 million keys.
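For reference, here's roughly what one of the Put benchmarks looks like. This is a sketch, not my exact code: NewTree is a stand-in for whatever constructor your tree has, and the Put signature mirrors the snippet later in this post. The important part is resetting the timer after the pre-fill so only the measured operations count.

package btree_test

import (
    "math/rand"
    "testing"

    "github.com/mike-matera/Go/btree"
)

func BenchmarkRandomPut(b *testing.B) {
    tree := btree.NewTree(128) // hypothetical constructor for an order-128 tree
    src := rand.NewSource(1)

    // Pre-fill with 4 million elements so the tree is at its working depth.
    for i := 0; i < 4000000; i++ {
        tree.Put(uint64(src.Int63()), i)
    }

    b.ResetTimer() // don't charge the pre-fill to the benchmark
    for i := 0; i < b.N; i++ {
        tree.Put(uint64(src.Int63()), i)
    }
}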

The results of the benchmark:

BenchmarkRandomPut            200000     11836 ns/op
BenchmarkCznicRandomPut      1000000      1556 ns/op
BenchmarkRandomGet           5000000       652 ns/op
BenchmarkCznicRandomGet      2000000       797 ns/op
BenchmarkRandomDelete         200000     12581 ns/op
BenchmarkCznicRandomDelete   1000000      1228 ns/op

There's clearly a severe penalty for the Load() and Store() operations, making my implementation an order of magnitude slower for operations that alter the tree. The good news is that my fetch implementation is somewhat faster. Since the expected mix of operations on a b-tree heavily favors lookups, I believe this offsets my poor update performance somewhat.
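To make the cost concrete, here's a simplified sketch of the kind of interface-based accessor that produces this penalty. This is not my actual SimpleNode code, just the shape of the problem: storing into an interface{} boxes the value, and reading it back requires a type assertion, which is the runtime.assertE2T call that shows up in the profile later in this post.

type slot struct {
    key   interface{}
    value interface{}
}

// Store boxes the uint64 key into an interface{}; the conversion
// copies the value and, depending on the Go version, can allocate.
func (s *slot) Store(key uint64, value interface{}) {
    s.key = key
    s.value = value
}

// Load must assert the key back to uint64; the checked assertion is
// what runtime.assertE2T does.
func (s *slot) Load() (uint64, interface{}) {
    return s.key.(uint64), s.value
}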

How do I know that Load() and Store() eat up a lot of time? Go has a built-in profiling tool. Here's how it works. First you add some profiling code to a test harness. This is what I did:

// Requires the "os" and "runtime/pprof" imports; src is a math/rand Source.
file, _ := os.Create("insertions.out") // error ignored for brevity
pprof.StartCPUProfile(file)
for i := 0; i < 5000000; i++ {
    tree.Put(uint64(src.Int63()), i)
}
pprof.StopCPUProfile()

You can then open your profile result (insertions.out) like this:

$ go tool pprof bin/main insertions.out
Welcome to pprof!  For help, type 'help'.
(pprof) 

In this case "bin/main" is the name of my executable. Much more information on how to use Go's pprof can be found here. The first command you're likely to use is the "top" command, which shows you your top 10 CPU users.

(pprof) top
Total: 5291 samples
    1289  24.4%  24.4%     2563  48.4% github.com/mike-matera/Go/btree.(*SimpleNode).Store
    1069  20.2%  44.6%     1069  20.2% github.com/mike-matera/Go/btree.(*SimpleNode).Load
     272   5.1%  49.7%      285   5.4% github.com/mike-matera/Go/btree.(*SimpleNode).Find
     240   4.5%  54.2%      809  15.3% runtime.assertE2T
     233   4.4%  58.6%      233   4.4% runtime.memcopy64
     209   4.0%  62.6%      584  11.0% assertE2Tret
     200   3.8%  66.4%      433   8.2% copyout
     196   3.7%  70.1%      344   6.5% sweepspan
     195   3.7%  73.8%      195   3.7% flushptrbuf
     141   2.7%  76.4%      344   6.5% scanblock

You can see that the top CPU hogs were Store() and Load(). The rightmost percentage is the percent of samples where those functions were on the call stack: a whopping 68.6% of the time. The 'web' command creates a nice SVG image of the result. Below are the results of the loop above and a similar one that fetches values.
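Generating each one was just:

(pprof) web

(pprof renders the graph with Graphviz's dot tool, so that needs to be installed.)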

Insertions Profile

Fetches Profile

