In ToyKV compaction: it finally begins!, I noted that I’d finally started writing a simple compactor for ToyKV, a key/value store based on an LSM-tree structure that I’ve been building to learn about Rust and database internals. The idea is to have a working database storage engine, albeit not a particularly sophisticated one.
A really important piece of an LSM-tree storage engine is compaction. Compaction takes the many files that the engine produces over the course of processing writes and reduces their volume to improve read performance: it drops old versions of writes, and reorganises the data to be more efficient for reads.
I’d avoided working on this because getting a first version built meant writing a large chunk of code. As I mentioned in the post above, by breaking down the task I was able to take it on step by step. And, indeed, Simple compaction v1-v7 by mikerhodes #25 is both large (2,500 new lines of code) and proceeds in a step-by-step manner.
Now let’s talk about a few of the bits of code I’m most happy with. Nothing’s perfect, but I tried to lay a good grounding for adding more sophisticated compaction algorithms later.
The pieces of code I’m happiest with are the `ConcatIterator`, the `CompactionGuard` and the `CompactionPolicy` abstraction.

First, the `ConcatIterator`. After compaction, what’s left is a set of files. Each stores a part of the key space, and together they contain all the data in the database. The `ConcatIterator` is the piece of code that presents those files as one large key range to the rest of the system, able to scan and seek across all the files as if they were one large file. It’s definitely not perfect. The initial seek for scans is linear through the files; it should be a binary search. And the way I manage the current iterator, using an index into the iterator array, could be improved by instead popping iterators from the `Vec` as we iterate. As well as simplifying state management, this would mean we drop iterators we no longer need, which would be more efficient. But it’s not too bad.
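To make the shape of that concrete, here’s a minimal sketch of the concatenation idea. The types and field names are illustrative stand-ins rather than ToyKV’s actual code, and it shows the index-into-the-`Vec` state management I mentioned:

```rust
/// Simplified sketch, not ToyKV's actual ConcatIterator. Each inner
/// iterator yields key/value pairs from one file, and the files are
/// ordered so that their key ranges don't overlap.
struct ConcatIterator<I: Iterator<Item = (Vec<u8>, Vec<u8>)>> {
    iters: Vec<I>,  // one iterator per file, lowest key range first
    current: usize, // index of the iterator we're currently draining
}

impl<I: Iterator<Item = (Vec<u8>, Vec<u8>)>> Iterator for ConcatIterator<I> {
    type Item = (Vec<u8>, Vec<u8>);

    fn next(&mut self) -> Option<Self::Item> {
        // Drain the current file's iterator; when it's exhausted, move on
        // to the next file. Popping exhausted iterators from the Vec
        // instead would simplify this state and free them earlier.
        while self.current < self.iters.len() {
            if let Some(kv) = self.iters[self.current].next() {
                return Some(kv);
            }
            self.current += 1;
        }
        None
    }
}
```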
Next, the `CompactionGuard` is a really nice pattern I’d not seen before that uses Rust’s strict object lifetimes to enforce that only a single compaction can be active at once. It’s a key part in making the main `compact` method simple to read.
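I won’t reproduce the real guard here, but the gist of the lifetime-based idea looks something like this sketch; the names and fields are mine rather than ToyKV’s:

```rust
/// Sketch of a lifetime-based guard, not ToyKV's implementation. The
/// guard holds an exclusive borrow of the compaction state, so the
/// borrow checker stops a second compaction starting while one is live.
struct CompactionState {
    runs_completed: u64,
}

struct CompactionGuard<'a> {
    state: &'a mut CompactionState,
}

impl<'a> CompactionGuard<'a> {
    fn begin(state: &'a mut CompactionState) -> CompactionGuard<'a> {
        CompactionGuard { state }
    }

    fn finish(self) {
        // Consuming `self` means a finished guard can't be reused; the
        // exclusive borrow ends when the guard is dropped here.
        self.state.runs_completed += 1;
    }
}
```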
Finally, I’m also quite happy with how the main parts of the compaction code turned out. There’s a `CompactionPolicy` trait with a `SimpleCompactionPolicy` implementation. The policy is responsible for two things: (1) constructing a `CompactionTask` that implements the compaction itself (reading older files, writing new ones), and (2) merging the old sstable index with the updates made by compaction. This abstraction will hopefully make it easier if I move forward with a more advanced compaction implementation. If I’ve got the abstraction correct, I’ll just need a new policy and `CompactionTask` implementation. Fingers crossed.
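Sketching that split with made-up signatures and helper types (the real trait and types in ToyKV differ), it looks roughly like this:

```rust
use std::io;
use std::path::PathBuf;

/// Illustrative helper types; ToyKV's real index and result types differ.
struct SstableIndex {
    files: Vec<PathBuf>, // the sstables currently serving reads
}

struct CompactionResult {
    new_files: Vec<PathBuf>,      // files written by the compaction
    replaced_files: Vec<PathBuf>, // files the compaction made obsolete
}

/// Does the actual work: read the older files, merge, write new ones.
trait CompactionTask {
    fn run(&self) -> io::Result<CompactionResult>;
}

/// Decides what to compact and how to fold the result back into the index.
trait CompactionPolicy {
    /// (1) Build a task describing the compaction to run, or None if
    /// there's nothing worth compacting right now.
    fn build_task(&self, index: &SstableIndex) -> Option<Box<dyn CompactionTask>>;

    /// (2) Merge the pre-compaction index with the compaction's output to
    /// produce the index the engine should use afterwards.
    fn apply(&self, index: &SstableIndex, result: &CompactionResult) -> SstableIndex;
}
```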
As well as these bits, there are a ton of tests. Claude helped with the tests, as usual for this year’s work in this project. But in the end, I rewrote a lot of Claude’s work to make the tests more effective at testing exactly what I wanted.
For example, the test to ensure that deleted keys were dropped during compaction required a decent chunk of extra work to make it actually test that keys were dropped, rather than merely that they still appeared to be deleted (while their delete tombstones remained in the files). But Claude is still good at getting the groundwork laid, and at ensuring coverage for a range of edge cases.
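The difference is easiest to see in test form. In this sketch every API name (`open`, `set`, `delete`, `get`, `compact`, `iter_raw_entries`, `is_tombstone`, `test_dir`) is a hypothetical stand-in rather than ToyKV’s real API:

```rust
// Sketch only: the API names used here are hypothetical stand-ins.
#[test]
fn compaction_drops_deleted_keys() {
    let mut db = toykv::open(test_dir()).unwrap();
    db.set(b"k1", b"v1").unwrap();
    db.delete(b"k1").unwrap();
    db.compact().unwrap();

    // Weak check: the key merely *looks* deleted. This passes even if
    // the delete tombstone is still sitting in the sstable files.
    assert!(db.get(b"k1").unwrap().is_none());

    // Stronger check: walk the raw on-disk entries and assert the
    // tombstone itself was dropped by compaction.
    assert!(db.iter_raw_entries().unwrap().all(|e| !e.is_tombstone()));
}
```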
Overall, I have a few more things to do before I’ll be properly happy:
I still need to checksum each on-disk block to ensure that data hasn’t been corrupted. Checksum sstable blocks · #14.
I’d also like to write that more sophisticated compaction. The main problem with the “simple” compaction is that, because the whole database is re-created in one go, the disk space required during compaction is potentially twice the total size of data in the database. This isn’t very practical, and is also rather slow for anything large!
Leveled compaction is one common option here. Compactions can also be run in parallel, but I think that might be too far for what is, in the end, a toy project.
My suspicion is that I might do checksums, then call it a day on this project for a while. I’ve been working on it for a few months now, after a year’s break, and I think it might be time for a change again. But I’m really pleased with the state of the code now. I also feel much more comfortable writing (and reading) Rust, alongside gaining much more practical knowledge of the deeper guts of databases.