Saturday, October 26, 2013

Implemented Delete

With the implementation of the Delete() routine my B-tree is fully functional. I decided to implement a full delete that rebalances the tree instead of a sparse delete because I might use my B-tree for keeping a sorted fringe in a graph search algorithm. Sparse delete is possibly a better choice for databases where you expect that the amount of data will mostly only increase and deletes will be few.

I've also moved the testing code into Go's unit test framework, which is pretty cool.

There are a lot of implementations out there. I'm interested in comparing my to an existing one to measure performance and quality of results.

Saturday, October 19, 2013

Performance tuning

You can spend a lot of time optimizing code. It's a rabbit hole that can stop you from ever reaching a working product. I almost got completely sucked down one when I noticed that simple changes made big differences in the performance of my B-tree. The latest commit has the ability to compare parameterizations of the B-tree based on insertion count and node size (order). Here's the performance of the current head on my i7 laptop:

[test] Random insertion 1000000 elements (order 16)
insertions took 5666420 us (176 K/sec)
iteration took 1216899 us (821 K/sec)
fetch took 3534220 us (282 K/sec)
[pass]: depth: 5 nodes: 7538 leaves: 83356

The "fetch" operation above is just fetching every node in the tree. For random elements I just re-seed the RNG and fetch in the same order I inserted them in. 

One method I use a lot is this one: 

func nodeFind (node *Node, value uint64) int {
pos := len(node.Values)
for i, k := range node.Values {
if value < k.Key {
pos = i;
break
}
}
return pos
}

This simply finds a key in my array of key/value pairs. It wasn't always written this way. This is how it used to be:

func nodeFind (node *Node, value *Pair) int {
pos := len(node.Values)
for i, k := range node.Values {
if value.Key < k.Key {
pos = i;
break
}
}
return pos
}

And this was the performance of that implementation: 

[test] Random insertion 1000000 elements (order 16)
insertions took 5715310 us (174 K/sec)
iteration took 1210612 us (826 K/sec)
fetch took 4331910 us (230 K/sec)
[pass]: depth: 5 nodes: 7485 leaves: 83217

Notice that fetches are much slower 230 K/s versus 280 K/s is 18% slower. The time spent in nodeFind dominates the fetch operation (that's basically all fetch does). 

Sunday, October 13, 2013

Initial Version

So I finally stopped screwing around with my datatypes and started getting coding done. When you don't know a language well it's tedious at first just doing basic things, like inserting elements into arrays. The implementation is not complete but it supports insertion and iteration over elements. The repo is here:

https://github.com/mike-matera/GoLibrary

Validation.

I've only done basic work on validating the data structure so far. Though I found that Go enables interesting mechanisms for testing. This is my test function:

func doTest(tree *btree.Btree, count int, 
        generate func(int, *btree.Btree), 
        check func(current, last uint64) bool) bool {

start := time.Now().UnixNano();
for i := 0; i<count; i++ {
generate(i, tree)
}
end := time.Now().UnixNano()
us := (end - start) / 1000
var rate float32 = float32 (((int64(count) * 1000000) / us) / 1000) 
fmt.Print("\tinsertions took ", us, " us (", rate, " K/sec)\n")

start = time.Now().UnixNano();
ch := tree.Iterate()
last := <- ch
for key := range ch {
if ! check(key,last) {
fmt.Print("\t[fail]\n")
return false
}
last = key
}
end = time.Now().UnixNano()
us = (end - start) / 1000
rate = float32 (((int64(count) * 1000000) / us) / 1000) 
fmt.Print("\titeration took ", us, " us (", rate, " K/sec)\n")
fmt.Print("\t[pass]\n")
return true
}

The function automates a bunch of not-so-fun work, like looping over insertions and deletions. It takes advantage of Go's easy ability to use function pointers. The caller looks like this:

tree := new (btree.Btree)
n := 500000
fmt.Print("[test] Linear insertion ", n, " elements\n")
doTest(tree, n, 
func(i int, t *btree.Btree) {
t.Insert(uint64(i), nil)
},
func(current, last uint64) bool {
return (current == last + 1)
})

The calling syntax uses closures. I probably could get away with not passing a pointer to the tree in the inserter function, but I was working fast. The above test is for linear insertion (or insertion of the integers in order. I have also tested using random numbers. That test looks like this:

tree = new (btree.Btree)
n = 500000
fmt.Print("[test] Random insertion ", n, " elements\n")
r := rand.New(rand.NewSource(time.Now().UnixNano()))
doTest(tree, n, 
func(i int, t *btree.Btree) {
t.Insert(uint64(r.Int63()), nil)
},
func(current, last uint64) bool {
return (current > last)
})

That test is working quite well also. Initially I was testing using random numbers gotten with the Int31() function. This produced the same number twice and threw off my testing. I'm not sure what's supposed to happen when you stuff a B-tree with the same key twice. Because I want to use my B-tree for search I need to make sure that when that happens the result is stable (i.e. that elements with the same key are extracted in insertion order). 

Friday, October 4, 2013

Learning Go

It's fun learning a new language. My classes are keeping me very busy so I haven't been able to put much time into my B-tree project. I have, however, had time to bend my mind around getting things done the Go way. That means getting out from underneath the the Object Oriented model. Do you every find yourself saying, "What accessors will I need?" or "How do I give this object access to that thingy over there?" 

That's how you know you're objectifying your code. Treating your code like an object is just as serious. You wouldn't objectify your cat would you? Let's face it your code is much more important than kitty. 

So here's my progress. B-trees require quirky data structures. Each node contains two lists. One list of indexes along with pointers to records and another list of node pointers. B-trees also have three different kinds of nodes, root, inner and leaf nodes. This kind of polymorphism can cause fits for the OO programmers. Here's how I've done it:

type Pair struct {
Key int
Value interface{}
}

type Node struct {
Values [] Pair
Nodes [] *Node
}

I created a type called Pair so that I could hold the index and node pointers together. If you're wondering what interface{} means it seems to be the way that you can do something like void * in Go. Since every struct implements the empty interface you can put anything there. That's necessary because I want my B-tree structure to be able to hold arbitrary data. 

I was able to use a single struct for all three types of nodes. In all nodes I allocate space for Pairs but I only allocate node pointers on root and leaf nodes.