Friday, November 2, 2018

Replacing AVL tree in kernel with Ordered dictionary

I introduced the ordered dictionary (odict) module into HelenOS about two years ago (September 2016). I originally developed this module within another project (where I use it quite a lot) with the intent to use it in HelenOS, too.

It is a fairly versatile data structure based on red-black trees. It never calls malloc (all storage is provided by the user via odict_t and odlink_t structures). The key can be anything (the user provides a key comparison function). It also has some additional optimizations and such.
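To give an idea of how it is used, here is a minimal sketch of an odict consumer. The function and callback names are only approximate (check the odict header for the exact interface); the point is that the odlink_t link is embedded in the caller's own structure, so the dictionary itself never needs to allocate anything.

#include <adt/odict.h>  /* odict_t, odlink_t (HelenOS; header path may differ) */
#include <stddef.h>     /* offsetof */

/* Example entry: the dictionary link lives inside the entry itself,
 * so inserting it never allocates memory. */
typedef struct {
	int id;            /* the key */
	odlink_t lentries; /* membership in the dictionary */
} entry_t;

/* User-supplied callbacks: odict never interprets keys itself.
 * (Signatures are illustrative, not copied from the real header.) */
static void *entry_getkey(odlink_t *odlink)
{
	entry_t *entry = (entry_t *)
	    ((char *) odlink - offsetof(entry_t, lentries));
	return &entry->id;
}

static int entry_cmp(void *a, void *b)
{
	int ka = *(int *) a, kb = *(int *) b;
	return (ka < kb) ? -1 : (ka > kb) ? 1 : 0;
}

static void example(void)
{
	odict_t dict;
	entry_t entry;

	odict_initialize(&dict, entry_getkey, entry_cmp);

	entry.id = 42;
	odlink_initialize(&entry.lentries);
	odict_insert(&entry.lentries, &dict, NULL);

	int key = 42;
	odlink_t *link = odict_find_eq(&dict, &key, NULL);
	(void) link;
}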

It didn't actually have any consumer in HelenOS until SIF recently. In user space I expect it will see more use as the foundation for other data structures (such as the ID allocator, which I will talk about some other time).

In the kernel, we currently have AVL trees and B+ trees. Jakub pointed out that we could replace B+ trees with odict, with the benefit that odict does not call malloc (and B+ trees do). I realized that odict can replace AVL trees just as easily -- although the only clear benefit is that we would then have just one tree implementation in the kernel -- which is a good reason in itself, right?

The only potential downside is that, while red-black trees are slightly faster at insertion and deletion than AVL trees (at least in theory), our AVL tree has the advantage over odict that it does not need to call the getkey/comparison functions (since the key is always an integer), which should give it some speed advantage. Therefore I was worried that replacing it with Odict could make the system slower -- although I thought the difference in real-time performance probably would not be measurable.
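To illustrate where the extra cost comes from: with the integer-keyed AVL tree the comparison in the search loop is an inline integer compare, whereas odict has to make two indirect calls (getkey and the comparison function) for every node it visits. Roughly (a simplified sketch, not the actual kernel code):

#include <stddef.h>
#include <stdint.h>

/* Integer-keyed AVL tree: the key is stored in the node, so the walk
 * compares integers directly (simplified, not the real avltree code). */
typedef struct avl_node {
	uint64_t key;
	struct avl_node *lft, *rgt;
} avl_node_t;

static avl_node_t *avl_search(avl_node_t *cur, uint64_t key)
{
	while (cur != NULL && cur->key != key)
		cur = (key < cur->key) ? cur->lft : cur->rgt;
	return cur;
}

/* With odict, the equivalent loop must call getkey() on the visited
 * node and then cmp() on the two keys -- two indirect calls per node
 * that the integer-keyed tree avoids. */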

I therefore decided to create a prototype implementation with Odict replacing AVL and benchmark it. For the measurement, I first tried timing some kernel tests with a stopwatch, but that obviously wasn't precise enough to show any difference. So I went for the only actual benchmark available in HelenOS, namely the IPC ping-pong benchmark.

I ran the benchmark 10 times and recorded the results, then calculated the averages and standard deviations using Gnumeric. I tested both AVL and Odict, each under QEMU with KVM and under QEMU without KVM (softmmu).

In the following table all measurements were done with 10 samples.
Test case          Average RTs / 10 s    Std. dev.
Odict / KVM        403320                4210
AVL / KVM          388700                4962
Odict / softmmu    25300                 249
AVL / softmmu      23520                 434

In the table above we can see Odict beating AVL by about 4% in the KVM case and by an even larger 7% in the softmmu case, with the standard deviation of the measurements being around 1%. This is kinda surprising.
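Concretely, from the table: (403320 - 388700) / 388700 ≈ 3.8% for KVM and (25300 - 23520) / 23520 ≈ 7.6% for softmmu, while the standard deviations range from about 1.0% to 1.8% of the respective means.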

I then went and made some changes to the ping_pong benchmark. Instead of repeatedly doing 100 iterations until we measure that 10 seconds have expired, I changed it to determine the number of iterations (the lowest power of two) needed to have the test run for at least 10 seconds, then measure the actual time taken (and divide the number of iterations by the duration). I hoped this would give me more accurate results and measure less of how getuptime() performs. I also had the ping_pong benchmark run 10 times and calculate the statistics for me.
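The idea of the modified loop, roughly (a sketch only; ping_call() and usec_now() are stand-ins for the actual IPC round trip and time source, this is not the real ping_pong code):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real primitives. */
extern void ping_call(void);      /* one IPC round trip */
extern uint64_t usec_now(void);   /* monotonic time in microseconds */

#define TARGET_USEC  (10ULL * 1000 * 1000)  /* run for at least 10 s */

static double measure_rt_per_sec(void)
{
	uint64_t niter = 1;
	uint64_t t0, t1;

	while (true) {
		/* Run a batch of niter round trips, timing only the
		 * whole batch, not every iteration. */
		t0 = usec_now();
		for (uint64_t i = 0; i < niter; i++)
			ping_call();
		t1 = usec_now();

		/* The smallest power of two that takes at least 10 s
		 * is also the batch we report. */
		if (t1 - t0 >= TARGET_USEC)
			break;

		niter *= 2;
	}

	/* Round trips per second = iterations / actual duration. */
	return (double) niter / ((double) (t1 - t0) / 1000000.0);
}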

Here are the results; again, all measurements are with 10 samples:
Test case          Average RTs / s    Std. dev.
Odict / KVM        38988              241
AVL / KVM          38601              285
Odict / softmmu    2452               44
AVL / softmmu      2290               27

There still seems to be some improvement when switching from AVL to Odict, albeit much closer to the noise level of the measurement.

AVL trees are used to hold the global lists of threads and tasks in the kernel. I am not sure if their performance ever comes into play in the ping_pong test. Any ideas how AVL/odict could affect the results of the ping-pong test? How come we see any difference at all? Any suggestions for a benchmark that would exercise this area better (e.g. creating and destroying a thread repeatedly)?