Friday, December 27, 2013

Working toward a more generic implementation

The purpose of this project is to help me understand effective patterns in Go. It's been a very fun journey. After I completed the first B-tree implementation I looked around for existing implementations that I could compare it to. I found these two:

cznic's is an in-memory implementation that appears to be a part of an SQL project in Go. A very nice piece of work. cznic's approach to polymorphism is to search/replace in the file to create a specialization --Just like C++ used to do-- which isn't very satisfying to me. He/she seems to have done extensive benchmarking on the code. 

santucco's implementation is an on-disk implementation, which really got me thinking. The B-tree is meant to be an on-disk index with its nodes sized to match disk blocks. What would it take to make a tree that was suitably generic so that client code could choose an in-memory or on-disk implementation? 

After a lot of thought and several trial implementations I came up with these design parameters:
  1. Generic code should implement as much of the algorithm as possible. Type-specific code should not be hard to write and you should not have to re-validate the algorithm for each specific type.
  2. Allocation and disposal of node memory must be done in client code. This gives client code the flexibility to use any type of backing storage. 
Over the next month (that I have off!!) I'll compare all three implementations more completely. The latest commit can be found on GitHub.


Wednesday, November 20, 2013

Delete works

The delete operation would fail if the key was located outside of a leaf node (i.e. it was being used as a median.) This was the broken code:

        if pos > 0 && node.Values[pos-1].Key == index {
                // Found the delete value
                if node.Nodes != nil {
                        // Lost median, will balance
                        tree.balance(node, pos)
                }else{

The problem is that when I lost a median I just rebalanced two subtrees, resulting in no change. What was I thinking?

if pos > 0 && node.Values[pos-1].Key == index {
// Found the delete value
tree.Stats.Size--
if node.Nodes != nil {
// Lost median, must borrow
var remaining int
node.Values[pos-1], remaining =                                                             tree.borrow(node.Nodes[pos-1])
if remaining < (tree.N/2) {
tree.balance(node, pos-1)
}

The new code implements the borrow() routine. That operates similar to the del() routine except that it always descends down the right-most tree or value, returning the largest value in the subtree.

Tuesday, November 19, 2013

Better testing reveals broken delete...

The latest commit on GitHub has an updated test implementation. It's a big improvement over what I had before because it uses Go's map type as a reference implementation. First I created an interface "TreeLike" that defines the basic operations on a tree:

type Treelike interface {
Insert(uint64, interface{})
Fetch(uint64) interface{}
Delete(uint64)
Iterate() chan uint64
}

I know Iterate() is basically useless, it's still a work in progress. TreeLike is already implemented by Btree. Now I make a struct that implements checking. 

type BtreeTest struct {
test *testing.T
tree Treelike 
reference map[uint64] interface{}
}

The class contains a test so that I can testing.T.Fail() when a problem is detected. The TreeLike is kept in sync with the map[] by implementing methods that operate on both structures. This is the implementation of Insert():

func (self *BtreeTest) Insert(key uint64, value interface{}) {
self.reference[key] = value
self.tree.Insert(key, value)
}

The Insert() function does no checking. The Delete() function does:

func (self *BtreeTest) Delete(key uint64) {
delete(self.reference, key)
self.tree.Delete(key)
verify := self.tree.Fetch(key) 
if (verify != nil) {
self.test.Fail()
}
}

The struct BtreeTest itself implements TreeLike so any code that requires TreeLike can be easily instrumented to be testing code. Thanks to that I found a serious bug in Delete(). My delete function fails to delete values that are not stored in leaf nodes. I'm fixing it. 

Tuesday, November 5, 2013

Adding Persistence to Kali Linux on a USB Key

The official Kali Linux USB Live Install guide mentions how to partition your USB stick to add a persistence partition. The persistence partition is very useful because it give Kali state. The instructions are tell you how to do the procedure on a non-Kali Linux. What's the use in that.

Here's the procedure for adding persistence from within Kali Linux:

1. Figure out what the device your flash key is using. The 'df' command gave me this output:
root@kali:~# df
Filesystem     1K-blocks    Used Available Use% Mounted on
rootfs           2050320   39228   2011092   2% /
udev               10240       0     10240   0% /dev
tmpfs             410068     716    409352   1% /run
/dev/sdb1        2553376 2553376         0 100% /lib/live/mount/medium
/dev/loop0       2354304 2354304         0 100% /lib/live/mount/rootfs/filesystem.squashfs
tmpfs            2050320       0   2050320   0% /lib/live/mount/overlay
tmpfs            2050320       0   2050320   0% /lib/live/mount/overlay
aufs             2050320   39228   2011092   2% /
tmpfs               5120       0      5120   0% /run/lock
tmpfs             820120     224    819896   1% /run/shm
2. The flash device is /dev/sdb. Start gparted on /dev/sdb:
gparted /dev/sdb
 3. Your current partitioning scheme should look similar to this:



4. Proceed to format a new partition of your desired size to be used for persistence. In our example, we used all the remaining space available. Make sure the volume label of the newly created partition is persistence, and format it using the ext4 filesystem.

 
5. Once the process is complete, mount your persistence USB partition using the following commands:
mkdir /mnt/usb mount /dev/sdb2 /mnt/usb echo "/ union" >> /mnt/usb/persistence.conf umount /mnt/usb
 6. The default instructions make you add the "persistence" parameter to the kernel every time you want your persistent storage. That's cumbersome. Unfortunately we're stuck with that problem because the file that controls the bootup is on an ISO9660 filesystem, therefore it cannot be changed without re-imaging the key. Bummer.

Plug the USB stick into the computer you want to boot up. Make sure your BIOS is set to boot from your USB device. When the Kali Linux boot screen is displayed, select “Live boot” from the menu (don’t press enter), and press the tab button. This will allow you to edit the boot parameters. Add the word “persistence” to the end of the boot parameter line each time you want to mount your persistent storage.



Maybe I'll blog about how to fix that one day...

Saturday, October 26, 2013

Implemented Delete

With the implementation of the Delete() routine my B-tree is fully functional. I decided to implement a full delete that rebalances the tree instead of a sparse delete because I might use my B-tree for keeping a sorted fringe in a graph search algorithm. Sparse delete is possibly a better choice for databases where you expect that the amount of data will mostly only increase and deletes will be few.

I've also moved the testing code into Go's unit test framework, which is pretty cool.

There are a lot of implementations out there. I'm interested in comparing my to an existing one to measure performance and quality of results.

Saturday, October 19, 2013

Performance tuning

You can spend a lot of time optimizing code. It's a rabbit hole that can stop you from ever reaching a working product. I almost got completely sucked down one when I noticed that simple changes made big differences in the performance of my B-tree. The latest commit has the ability to compare parameterizations of the B-tree based on insertion count and node size (order). Here's the performance of the current head on my i7 laptop:

[test] Random insertion 1000000 elements (order 16)
insertions took 5666420 us (176 K/sec)
iteration took 1216899 us (821 K/sec)
fetch took 3534220 us (282 K/sec)
[pass]: depth: 5 nodes: 7538 leaves: 83356

The "fetch" operation above is just fetching every node in the tree. For random elements I just re-seed the RNG and fetch in the same order I inserted them in. 

One method I use a lot is this one: 

func nodeFind (node *Node, value uint64) int {
pos := len(node.Values)
for i, k := range node.Values {
if value < k.Key {
pos = i;
break
}
}
return pos
}

This simply finds a key in my array of key/value pairs. It wasn't always written this way. This is how it used to be:

func nodeFind (node *Node, value *Pair) int {
pos := len(node.Values)
for i, k := range node.Values {
if value.Key < k.Key {
pos = i;
break
}
}
return pos
}

And this was the performance of that implementation: 

[test] Random insertion 1000000 elements (order 16)
insertions took 5715310 us (174 K/sec)
iteration took 1210612 us (826 K/sec)
fetch took 4331910 us (230 K/sec)
[pass]: depth: 5 nodes: 7485 leaves: 83217

Notice that fetches are much slower 230 K/s versus 280 K/s is 18% slower. The time spent in nodeFind dominates the fetch operation (that's basically all fetch does). 

Sunday, October 13, 2013

Initial Version

So I finally stopped screwing around with my datatypes and started getting coding done. When you don't know a language well it's tedious at first just doing basic things, like inserting elements into arrays. The implementation is not complete but it supports insertion and iteration over elements. The repo is here:

https://github.com/mike-matera/GoLibrary

Validation.

I've only done basic work on validating the data structure so far. Though I found that Go enables interesting mechanisms for testing. This is my test function:

func doTest(tree *btree.Btree, count int, 
        generate func(int, *btree.Btree), 
        check func(current, last uint64) bool) bool {

start := time.Now().UnixNano();
for i := 0; i<count; i++ {
generate(i, tree)
}
end := time.Now().UnixNano()
us := (end - start) / 1000
var rate float32 = float32 (((int64(count) * 1000000) / us) / 1000) 
fmt.Print("\tinsertions took ", us, " us (", rate, " K/sec)\n")

start = time.Now().UnixNano();
ch := tree.Iterate()
last := <- ch
for key := range ch {
if ! check(key,last) {
fmt.Print("\t[fail]\n")
return false
}
last = key
}
end = time.Now().UnixNano()
us = (end - start) / 1000
rate = float32 (((int64(count) * 1000000) / us) / 1000) 
fmt.Print("\titeration took ", us, " us (", rate, " K/sec)\n")
fmt.Print("\t[pass]\n")
return true
}

The function automates a bunch of not-so-fun work, like looping over insertions and deletions. It takes advantage of Go's easy ability to use function pointers. The caller looks like this:

tree := new (btree.Btree)
n := 500000
fmt.Print("[test] Linear insertion ", n, " elements\n")
doTest(tree, n, 
func(i int, t *btree.Btree) {
t.Insert(uint64(i), nil)
},
func(current, last uint64) bool {
return (current == last + 1)
})

The calling syntax uses closures. I probably could get away with not passing a pointer to the tree in the inserter function, but I was working fast. The above test is for linear insertion (or insertion of the integers in order. I have also tested using random numbers. That test looks like this:

tree = new (btree.Btree)
n = 500000
fmt.Print("[test] Random insertion ", n, " elements\n")
r := rand.New(rand.NewSource(time.Now().UnixNano()))
doTest(tree, n, 
func(i int, t *btree.Btree) {
t.Insert(uint64(r.Int63()), nil)
},
func(current, last uint64) bool {
return (current > last)
})

That test is working quite well also. Initially I was testing using random numbers gotten with the Int31() function. This produced the same number twice and threw off my testing. I'm not sure what's supposed to happen when you stuff a B-tree with the same key twice. Because I want to use my B-tree for search I need to make sure that when that happens the result is stable (i.e. that elements with the same key are extracted in insertion order). 

Friday, October 4, 2013

Learning Go

It's fun learning a new language. My classes are keeping me very busy so I haven't been able to put much time into my B-tree project. I have, however, had time to bend my mind around getting things done the Go way. That means getting out from underneath the the Object Oriented model. Do you every find yourself saying, "What accessors will I need?" or "How do I give this object access to that thingy over there?" 

That's how you know you're objectifying your code. Treating your code like an object is just as serious. You wouldn't objectify your cat would you? Let's face it your code is much more important than kitty. 

So here's my progress. B-trees require quirky data structures. Each node contains two lists. One list of indexes along with pointers to records and another list of node pointers. B-trees also have three different kinds of nodes, root, inner and leaf nodes. This kind of polymorphism can cause fits for the OO programmers. Here's how I've done it:

type Pair struct {
Key int
Value interface{}
}

type Node struct {
Values [] Pair
Nodes [] *Node
}

I created a type called Pair so that I could hold the index and node pointers together. If you're wondering what interface{} means it seems to be the way that you can do something like void * in Go. Since every struct implements the empty interface you can put anything there. That's necessary because I want my B-tree structure to be able to hold arbitrary data. 

I was able to use a single struct for all three types of nodes. In all nodes I allocate space for Pairs but I only allocate node pointers on root and leaf nodes. 

Tuesday, September 10, 2013

I really want to program in Go (for some reason)

I've done some work in my beloved Emacs but for larger programming projects I've really come to love Eclipse. The Java support is excellent and CDT works well. One of the great things about Eclipse is that there are plugins for many of my favorite things:

  1. AVR tools (so I can use a full-power IDE for Arduino)
  2. Google Web Toolkit and Google App Engine
  3. ANTLR integration. (I used this and GWT for WebLOGO)
  4. Android ADT 
So I wasn't that surprised to find a plugin for Go. It claims to be in a pretty early state. Here's how to get it. In Eclipse (I'm using 3.8) go to Help -> Install New Software and add the update site (http://goclipse.googlecode.com/svn/trunk/goclipse-update-site/). It installed cleanly in 3.8 for Ubuntu. Now there's a Go project type: 


Now what to do? I want to implement a B-tree in Go. The B-tree is interesting because I want to better understand how Go takes some of the work out of typing and polymorphism. This kind of thing usually drives you crazy in C++. Also, I'd like to implement iterators using Channels. My goal is to implement a few graph-search methods using B-trees to keep the fringe sorted.