Writing bits to disk: toykv

In the early part of my experiments in writing a very simple document database, I thought that writing the bits to disk would be too difficult for me. And perhaps it was, back then. But I think things have changed.

As I wrote more of the document database, I got more experience transforming JSON data into key and value byte arrays that could be saved, retrieved and indexed. And as I got more comfortable with the tiny data formats I was creating, particularly for the indexes, I began to think, “why not, I bet I could write the code that organises these on disk too”.

Of course, just as my document database is nothing like ElasticSearch, my on-disk formats would be nowhere near as advanced as those of sled, the key-value storage layer I was using in my document database.

(I still think that I’d need a bunch more practice in this space to be able to write something production ready; but I am starting to think that I need not be a Better Programmer, but instead merely More Experienced In This Subject. Perhaps, in the end, they mean the same thing).

But I thought it would have a lot of value, to me at least. Because I believed I’d learn a lot, even writing the simplest, silliest key-value storage engine. And I think I have, and to try to cement that knowledge into my head, I’m going to write a whistlestop tour of that code into this post.

LSM Trees

Starting somewhere close to the beginning, let’s talk about the shape of the implementation, so we have a skeleton to hang details from.

Toykv is (rather loosely) based on the idea of an LSM tree. The original paper describes something a lot more complicated than we will entertain making, but the ideas start there. Databases like LevelDB and RocksDB take those ideas, and maybe mix in a few from the Stepped merge paper. ToyKV is a bit more like these later databases. It has an in-memory and on-disk component, and it creates a growing set of files on disk. This is more similar to stepped-merge than the original LSM tree paper, which only has two levels.

The basic idea is that we have an in-memory map of keys to values. We have a WAL containing every operation carried out on this map — essentially setting values. The WAL allows us to recreate the in-memory map if we crash by replaying the operations in order. To allow us to grow beyond memory, periodically we write out the in-memory table to files on disk and reset the in-memory map and the WAL. When we read a value, we have to consult the in-memory map, then fall back to the files on disk. When reading the files on disk, we need to be careful to read them in reverse-order to the order they were written, so we find the latest version of a key first. We will end up reading every file if we are asked to find a key that doesn’t exist.

I was inspired to write this really simple, pseudo-LSM-tree by reading The Three Places for Data in an LSM. The article really helped me to see that one could reduce the LSM tree concept to a barebones implementation which still captured the key ideas — and that would be possible to do in a couple of weeks’ worth of my about-an-hour-free evenings. To get the most from this piece, I strongly recommend reading that article before continuing this one (5-10 minutes).

What will ours look like

toykv is about 670 lines of rust code, and uses nothing outside rust’s pretty small standard library. That means it can’t be that complicated!

What I want to capture is the key ideas in The Three Places for Data in an LSM:

  1. An in-memory set of the most recently written data. This in-memory piece is typically called a memtable, and that’s what we’ll use through this article (and the code).
  2. A WAL, to make that in-memory set durable and recoverable in the event of a program crash or hardware failure.
  3. A set of files (sstables) on-disk, to allow the database to grow beyond memory. sstable stands for sorted string table, and these tables are ordered, by key, on disk.
┌───────┐        ┌──────────┐              
│ WAL   ◄────────┤ memtable │   more recent
└───────┘        └────┬─────┘       ▲    
                      │             │    
                      │             │    
                 ┌────▼─────┐       │    
                 │ sstable1 │       │    
                 └────┬─────┘       │    
                      │             │    
                      │             │    
                 ┌────▼─────┐       ▼    
                 │ sstable2 │   less recent
                 └──────────┘

For writing to disk, I’m going to use the simplest structures I can think of. We’ll also leave out a ton of error handling, and there will be a hard limit on the number of writes allowed to a given database in this version.

To keep it even simpler, I will only allow get(key) and set(key, value) operations. No others: no delete, no range scan. Nothing fancy whatsoever.

Because there is no range scan, I can’t yet use toykv as the storage engine for my document database. This is because the document database relies on range scans for its query functionality. I’d love to add scanning — indeed, my long-term goal is to be able to use toykv as the underpinnings of rust-docdb, and so to have written a top-to-bottom database “engine” — but I’m not there yet.

The challenge of this work is writing an end-to-end solution without using up all my free time for months — not writing a robust production system. No one’s paying me for this! 😀

Writing keys and values to disk

When working in memory, we are typically working with a lot of structure. Even when working with raw byte arrays, these arrays are named things (bound to variables) and carry information such as their length along with them, depending on the programming language we use. In contrast, on disk, we just have a blob of bytes that could be anything. We have to create the order from scratch (well, nearly — we at least get to read bytes rather than bits).

So the first thing we need to do for toykv is to describe what a blob of bytes for a key and value will look like. We are defining a data format. Sounds fancy.

In our case, the data format is not fancy. In memory, we have two byte arrays, the key and the value. When reading the value from disk, we generally stream the data. The key and the value are of variable length, so we’ll have to store those lengths somewhere. What we do is we create a format for the bytes we’ll write that consists of metadata (or header) and the key and value data (the payload).

The key thing is that the metadata has a fixed length. We can read len(header) bytes from the stream, then interpret the bytes into things like integers to tell us the length of the key and value. To cut to the chase, here’s the format I used, taken from the code documentation:

A KVRecord consists of a header and then the key and value data
verbatim. The header contains a magic byte and version byte. The
version byte can be later used to support different versions, eg,
if we add a checksum. Then there is the length of the key and the
value, which are used to read the key and value respectively.

0          1            2            4              8
| u8 magic | u8 version | u16 keylen | u32 valuelen | key | value |
  -------------------------------------------------   ---   -----
            KV header                                  |      |
                                              keylen bytes    |
                                                              |
                                                   valuelen bytes

The “magic” byte is just a constant value that we use at the start of each record. If the byte we read doesn’t match the expected value, we assume something has gone wrong. And, in my code, that means we basically give up. The version value is similar. It ensures that the code that is trying to read the record can understand the record, so if we get an unexpected value for the version, again, we give up.

The lengths of the key and value come next. The key is limited to 10KB and the value to 100KB, so I chose unsigned integer sizes that can hold those values, u16 and u32 respectively.

So first we read a fixed 8 bytes of header from the stream. From that we know that we next need to read keylen bytes into our key byte array and then valuelen bytes into our value array. After we’ve done that, we can stop and return the key and value to the caller.

This is my rust code. The function takes a readable stream and attempts to read a single key and value from it. It closely follows the logic above: attempt to read 8 bytes. Interpret those eight bytes as magic, version, keylen and valuelen. Next, attempt to read first the key and then the value into byte vectors. Form those byte vectors into a KVRecord struct, and return them.

/// Attempt to read a KVRecord from a Read stream.
pub(crate) fn read_one<T: Read>(r: &mut T) -> Result<Option<KVRecord>, Error> {
    let mut header = [0u8; 8];
    let n = r.read(&mut header)?;
    if n < 8 {
        return Ok(None);
    }

    assert_eq!(header[0], KV_MAGIC, "Unexpected magic byte");
    assert_eq!(header[1], KV_VERSION, "Unexpected version byte");
    let keylen = u16::from_be_bytes(header[2..4].try_into().unwrap());
    let valuelen = u32::from_be_bytes(header[4..8].try_into().unwrap());

    let mut key = Vec::with_capacity(keylen as usize);
    r.by_ref().take(keylen as u64).read_to_end(&mut key)?;
    let mut value = Vec::with_capacity(valuelen as usize);
    r.by_ref().take(valuelen as u64).read_to_end(&mut value)?;

    let kv = KVRecord { key, value };

    Ok(Some(kv))
}

from_be_bytes is “from big endian bytes”. One interesting twist of writing data to disk is deciding which endianness numbers should be stored with, something that just never comes up in normal handling of numbers in code. In memory, the endianness of numerical values is whatever your CPU architecture happens to use. Some databases are explicit about this, and some are not. I decided to be explicit about it just because it felt better to do that.
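For example, the u16 key length 258 (0x0102) comes out differently under each convention:

let keylen: u16 = 258; // 0x0102
assert_eq!(keylen.to_be_bytes(), [0x01, 0x02]); // big endian: most significant byte first
assert_eq!(keylen.to_le_bytes(), [0x02, 0x01]); // little endian: least significant byte first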

All the by_ref...take...read_to_end is a slightly odd-looking rust way to say “read this number of bytes from a stream into a variably sized Vec”. There might be a better way.

Writing the record is simpler:

pub(crate) fn serialize(&self) -> Vec<u8> {
    let mut buf = Vec::<u8>::new();
    buf.push(KV_MAGIC);
    buf.push(KV_VERSION);
    buf.extend((self.key.len() as u16).to_be_bytes());
    buf.extend((self.value.len() as u32).to_be_bytes());
    buf.extend(self.key);
    buf.extend(self.value);
    buf
}

Here we write a key value record to a byte buffer. I chose to do this because it lets me embed this record into other, more complex records.
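As a quick illustration, a round trip through these two functions might look something like this in a test inside the crate (both functions are pub(crate)); I’m assuming here that KVWriteRecord borrows its key and value, as it does in the sstable writing code later on:

use std::io::Cursor;

// Serialise a record into a byte buffer...
let key = b"hello".to_vec();
let value = b"world".to_vec();
let buf = KVWriteRecord { key: &key, value: &value }.serialize();
assert_eq!(buf.len(), 8 + 5 + 5); // 8 byte header, then key and value verbatim

// ...then read it back from an in-memory "stream".
let kv = KVRecord::read_one(&mut Cursor::new(buf)).unwrap().unwrap();
assert_eq!(kv.key, key);
assert_eq!(kv.value, value);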

We’ll get to one of those more complex records right away: the WAL.

WAL - write ahead log

The WAL and the memtable are a pair of data structures. The WAL lives on disk and the memtable in memory, but they go together and store the same data.

The memtable is a simple map of key to value. As data is written to the key value store, it’s first put into the map. Alongside that, it’s written to disk in a log called the Write Ahead Log. The purpose of the WAL is to make the memtable durable. It can be used at startup to recreate the memtable if the process crashes, or if there simply wasn’t enough data in the memtable at shutdown to make it worth writing out an sstable.

Strictly speaking, one could have a database that was just memtable and WAL. But that comes with two drawbacks:

  1. Data has to fit in the memtable, so cannot be larger than available RAM.
  2. Over time the WAL will grow large, and the cost of replaying it at startup will become prohibitive. This is because the WAL might include many updates for the same key, so it can be a lot larger than the size of the raw data.

So every so often, we store the memtable to a different on disk format, the sstable. This allows us to clear the WAL so it doesn’t get too large while retaining data durability. But we’ll come to sstables later.

The WAL is a log of the mutating operations on the memtable. In our case, there is just one mutating operation: setting a key to a value. The WAL is just a long on-disk list of set operations. We store each list entry as a WAL record, which has this format:

We have a WAL header that contains the WAL specific fields of seq and
op, then we embed a KVRecord serialisation.

0       1         5       6
| magic | u32 seq | u8 op | KVRecord |
  -----------------------   --------
    6 byte WAL header         see kvrecord.rs

Valid `op` values:

- 1: SET

Again we have a magic byte to check that we’re reading a WAL record. Then we have a seq value — each WAL record on disk increments this by 1. It’s a simple way to make sure that the WAL is valid: the seq of every record should be one greater than the last. If it isn’t, it likely means we have a bug in how we are writing the WAL. Again, if this problem happens, we can just bail, claiming the database files are corrupt. Next we have a byte for the op, which is always the same, but is there just in case we need to store different operations in the WAL later.

Reading and writing the WAL

I’ll spare you the code for this one, as it’s very similar to the code for the KVRecord. Instead, I’ll give a brief overview so the code makes sense, linking to the code as we go. You can read it in full at toykv/src/wal.rs.

When opening a toykv database, the first thing we do is to “replay” the WAL into the initial memtable used by the database. Only once that replay is done is the database ready for use by clients. The replay stage consists of reading entries one-by-one — by reading the WAL header and then the KVRecord embedded in each WAL record. For each WAL record we read, the code sets the value in the memtable for the KVRecord we’ve read from the file. We never read the WAL file again after this initial replay at database startup.
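The replay loop itself is nothing special. Here’s a rough sketch rather than the real wal.rs code — read_wal_record is a stand-in for whatever parses the 6-byte WAL header plus embedded KVRecord, and the WALRecord field names are illustrative:

/// Sketch: a parsed WAL record (seq + op + the embedded KVRecord).
struct WALRecord {
    seq: u32,
    kv: KVRecord,
}

/// Sketch: rebuild the memtable by replaying WAL records from a Read stream.
fn replay<T: Read>(r: &mut T) -> Result<BTreeMap<Vec<u8>, Vec<u8>>, Error> {
    let mut memtable = BTreeMap::new();
    let mut last_seq: Option<u32> = None;
    // read_wal_record is assumed to return Ok(None) at end of file.
    while let Some(wal) = read_wal_record(r)? {
        // Each record's seq must be one greater than the previous record's.
        if let Some(prev) = last_seq {
            assert_eq!(wal.seq, prev + 1, "WAL seq must increase by one");
        }
        last_seq = Some(wal.seq);
        // The only op is SET, so apply the embedded KVRecord to the memtable.
        memtable.insert(wal.kv.key, wal.kv.value);
    }
    Ok(memtable)
}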

At this point we have an open file handle to the WAL, at the end of the file. Each time the database receives a write, we create a WAL record and write it to that open file. The database can be configured to either sync the file to disk after every write, or to leave it up to the OS (never fsync explicitly). Writes are an order of magnitude faster if you don’t fsync, but there’s a lot of cost in terms of data durability. Ideally, I’d write the code for the middle ground: a periodic fsync, say every 100ms, that would balance speed and potential for data loss.
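A sketch of that choice — hypothetical names, not necessarily toykv’s real API — might look like this:

use std::fs::File;
use std::io::Write;

/// Sketch: a sync setting for the WAL.
enum SyncMode {
    EveryWrite, // fsync after each write: durable, but an order of magnitude slower
    Os,         // leave flushing to the OS: fast, but recent writes may be lost on crash
}

fn append_wal(f: &mut File, record: &[u8], mode: &SyncMode) -> Result<(), std::io::Error> {
    f.write_all(record)?;
    if let SyncMode::EveryWrite = mode {
        f.sync_all()?; // force data (and metadata) to disk before returning
    }
    Ok(())
}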

Finally, there’s the reset code, which closes the WAL file, deletes it and starts a new WAL, used when we write out an sstable and are able to clear out the memtable and WAL.

sstable - sorted string table

A set of sstable files forms the long term storage of toykv. First we’ll look at the sstable data format, and then at how we manage the set of sstable files on disk.

Writing an sstable

A sorted string table is the keys and values from a memtable streamed to disk in key order as a snapshot, before we clear the memtable and WAL.

A common way to structure sstables in real world systems is to use a B+Tree, but the simplest way to store this is to literally write KVRecord after KVRecord after KVRecord to disk, in order of key of course. We’ll take the simpler option!

Rust’s BTreeMap allows iterating in order of key, so writing an sstable is just iterating the map and writing the KVRecord. This is factored out into a few methods in the implementation, but I’ll collapse them together here to show the idea:

pub(crate) fn write_new_sstable(
    &mut self,
    memtable: &BTreeMap<Vec<u8>, Vec<u8>>,
) -> Result<(), Error> {
    // new_writer works out what the new file 
    // should be called and opens it
    let mut f = new_writer(self.d.as_path())?;
    for entry in memtable {
        let key = entry.0;
        let value = entry.1;
        let buf = KVWriteRecord { key, value }.serialize();
        f.write_all(&buf)?; // f is the newly opened sstable file
    }
    f.flush()?;
    f.sync_all()?;
    Ok(())
}

Once we write an sstable, it’s immutable. We never write to it again. Instead of modifying the files, we manage a set of sstable files on disk, in the toykv database directory. So next we’ll talk about how we manage our set of immutable files to create a database where values can be modified, and where our get(key) calls can find the newest written values.

Managing the set of sstables on disk

When writing a new sstable to a file, we need to figure out the path for the file. As we will see later when reading, the key thing we need to remember about our set of sstable files is the order in which they were written to disk.

For toykv I considered a few simple approaches to “remember” the order:

  1. Use the file created or modified date.
  2. Use filenames.
  3. Use a separate index file for the sstables.

I decided on the second option, filenames. I chose this option because it makes things easy to debug: printing filenames as they are read allows you to check the order is correct. Toykv embeds a generation number in the sstable file name, incrementing the generation with each sstable written. So more recent sstables have higher numbered filenames. The file names are of the form 0000000001.data.

When writing a new sstable, we list the files in the toykv data folder, figure out the highest generation number we have so far, add one to that, and write the new sstable to that numbered file. Here’s what that looks like:

┌──────────┐              
│ memtable │                               more recent
└────┬─────┘                                   ▲    
     │ next file will be 0000000003.data       │    
     │                                         │    
┌────▼───────────────────────┐                 │    
│ sstable2 (0000000002.data) │                 │    
└────┬───────────────────────┘                 │    
     │                                         │    
     │                                         │    
┌────▼───────────────────────┐                 ▼    
│ sstable1 (0000000001.data) │             less recent
└────────────────────────────┘

To help work out the next filename, there’s a helper function sorted_sstable_files. It finds all the *.data files on disk, then returns an ordered Vec of their filenames with the newest (ie, highest generation number) name first. The sstable write code just has to pull the first file name from the Vec, parse the name into an integer and increment it by one. sorted_sstable_files is also used by the data reading code to get an ordered list of file paths.

The code is a little messy: first, find all the *.data files that have a numerical name. Then order them lexicographically — the leading zeros ensure that works — and finally reverse them before returning. I think this method would be clearer in something like Python; I find some of the rust-isms get in the way a bit.

fn sorted_sstable_files(dir: &Path) -> Result<Vec<DirEntry>, Error> {
    // Get the list of *.data files with numerical filenames
    let mut entries: Vec<DirEntry> = fs::read_dir(dir)?
        .filter_map(|e| e.ok()) // the directory entry is valid
        .filter(|p| p.path().is_file()) // and it's a file
        .filter(|p| p.path().extension().is_some_and(|e| e == "data")) // *.data
        .filter(|e| { // and it's an integer filename
            e.path().file_stem().is_some_and(|stem| {
                stem.len() == 10 && 
                    stem.to_string_lossy().parse::<u32>().is_ok()
            })
        })
        .collect();

    // Order the filenames and find the largest (the zero prefix should give
    // lexicographic ordering).
    entries.sort_unstable_by_key(|e| {
        let path = e.path();
        let stem = path.file_stem().unwrap();
        stem.to_os_string()
    });

    // Reverse, so the newest sstable comes first in the list.
    entries.reverse();

    Ok(entries)
}
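Given that ordered list, working out the next filename is just parsing the newest stem and adding one. Something like this — a sketch of the idea behind new_writer, with unwraps standing in for real error handling:

/// Sketch: derive the path of the next sstable generation in dir.
fn next_sstable_path(dir: &Path) -> Result<PathBuf, Error> {
    let entries = sorted_sstable_files(dir)?;
    // The newest file is first in the list; parse its ten digit stem as a u32.
    let latest: u32 = match entries.first() {
        None => 0, // no sstables written yet
        Some(e) => e.path().file_stem().unwrap().to_string_lossy().parse().unwrap(),
    };
    // Zero-pad to ten digits so lexicographic order matches numeric order.
    Ok(dir.join(format!("{:010}.data", latest + 1)))
}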

The gotcha that prevents us writing forever

I use a u32 (unsigned 32-bit) for the file number parsing, and chose to use ten digits in the file name so the name can hold all u32 values. The thoughtful amongst you will notice that there’s nothing to stop one writing values until all the available sstable filenames have been used. Once that happens, toykv will break: we can’t add 1 any more, so we can’t create new sstable files and can’t write any more data.

The way to fix this is to implement compaction, but I decided that was too hard for now. We’ll come back to why that would help later.

Putting writing together

So we have a WAL + memtable, and a way to write the memtable to an sstable. But when should we write the sstable? There are a lot of ways to make that decision:

  • When the WAL is getting large
  • When the memtable has “too many” items in it
  • Some combination of the two, like “after either a million WAL entries or 100,000 memtable entries”.
  • Um, probably a bunch of others if I read more papers and database blog posts.

In the end, my code is a toy, so I chose a toy way to do it: every 1,000 calls to set — ie, writes — I’d write out the memtable to disk as an sstable. This isn’t super silly, and I’m sure it wouldn’t work too badly in the real world, albeit maybe with 100,000 or a million writes. Writing the sstable clears the WAL too — the sstable is now the durable copy of the data.

const WAL_WRITE_THRESHOLD: u32 = 1000;

pub fn set(&mut self, k: Vec<u8>, v: Vec<u8>) -> Result<(), ToyKVError> {
    self.wal.write(&k, &v)?;
    self.memtable.insert(k, v);
    if self.wal.wal_writes >= WAL_WRITE_THRESHOLD {
        self.sstables.write_new_sstable(&self.memtable)?;
        self.wal.reset()?;
        self.memtable.clear();
    }
    Ok(())
}

write_new_sstable is shown a couple of sections above; it streams the memtable to disk. After that wal.reset deletes the WAL file on disk and creates a new one. We finally clear the memtable. Now we are ready for the next write, and our data is safe on disk in an sstable.

For ease, I wrote toykv to be only safe to use from one thread. This choice is particularly valuable here: it makes this code much simpler because we can create the sstable and reset the WAL and memtable without taking any locks. This obviously means that some writes — in our case, every thousandth — are a lot slower than the other writes, because of the dump to disk of the sstable.

This slow-write property is in fact an issue in real world implementations, and people have come up with a bunch of ways to lessen the impact. Many are relatively simple.

As an example, I believe LevelDB maintains two memtables. The “active” one, and then a second, “background” memtable which is being written to disk. As an active memtable fills, it’s swapped into the background slot and written to disk, while writes go to a new active memtable. This means that writes can continue to the active memtable as the background memtable is streamed out.

Even with this approach, however, if writes come too fast, then the active memtable needs to be written out before the background one has been completely written. Then writes must pause so the sstable writing can complete, as they also must pause in my simpler implementation. There are smarter schemes, and probably LevelDB is way smarter than I imply, but in the end there is only so fast you can write to a disk, whatever your on-disk structure.

Reading

If we think about how data ends up on disk, we quickly see that a given key can appear in many places — it will be in every sstable that was written while that key was in the memtable. Just how many depends on the workload.

One extreme is a workload that always writes to different keys. It will have just one copy of any given key on disk, either in the WAL or in an sstable. The other extreme is a workload that repeatedly writes the same small set of keys: it’s likely to have a value for each key in every sstable, and could have many entries for a key in the WAL, since every write to the database appends an entry to the WAL regardless of whether the key is already there.

So when we read a value from toykv — or any LSM tree database — we have to search through our memtable and sstables in the right order such that we return the most recently written value to the client. In toykv, we have one memtable and a linear series of historical sstables. To retrieve a value with get(key), we need to read those in the right order:

┌──────────┐              
│ memtable │                     more recent (start here)
└────┬─────┘                         │   
     │                               │    
     │                               │    
┌────▼───────────────────────┐       │    
│ sstable2 (0000000002.data) │       │    
└────┬───────────────────────┘       │    
     │                               │    
     │                               │    
┌────▼───────────────────────┐       ▼    
│ sstable1 (0000000001.data) │   less recent (end here)
└────────────────────────────┘

Recall we write sstables to disk with increasing numerical names. So, in this example, first we look in the memtable, then sstable2 and finally sstable1. If we find the key in any of the locations, we stop looking and return the value (Some(value) in rust). If we don’t find the key by the time we’ve read sstable1, we return not found (None in rust).

toykv’s get method is simple: it looks in the memtable, and then reads from sstables:

pub fn get(&mut self, k: &[u8]) -> Result<Option<Vec<u8>>, ToyKVError> {
    let r = self.memtable.get(k);
    let r = match r {
        Some(r) => Some(r.clone()),
        None => self.sstables.get(k)?,
    };
    Ok(r)
}

The check of the memtable is simple. It’s just a map of keys to values so we just ask it whether it has key k. If so, we make a copy of the value and return it.

Reading from the sstables is slightly more complicated.

To do this, toykv maintains an SSTableReader struct, which stores a list of open file handles to the sstables on disk. They are ordered from most recent to least recent, as returned by sorted_sstable_files shown above.

In the following code, self.tables is the Vec of open file handles. For each file, we first seek back to the start because previous get calls will have left the file partially read. Next we call KVRecord::read_one, which attempts to read a KVRecord from the file. If read_one successfully reads a record, we see if its key matches the key k we were asked to find. If it does, we clone and return the value. Otherwise we try to read another KVRecord. Once we can’t read another KVRecord, we go to the next file. Finally, once we run out of files to look through, we’ve discovered the key doesn’t exist in the database and return None.

There are many optimisations this process could have. For example, we could be smarter when we write out the sstable. We could write it as a tree. Or we could still have a straight list of KVRecords but include a skip list so we can skip over large swathes of file. We could add a bloom filter which would tell us whether the key might be in the sstable file, allowing us to skip whole files (super useful for workloads that rarely write the same key!).
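None of these are in toykv, but to give a flavour, here’s a minimal sketch of the bloom filter idea: a small bit array per sstable, a few hash-derived probes per key, and a may_contain check that can say “definitely not in this file” without reading it.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Sketch: a tiny bloom filter over keys, one per sstable file.
struct Bloom {
    bits: Vec<u64>,
    nbits: usize,
}

impl Bloom {
    fn new(nbits: usize) -> Bloom {
        Bloom { bits: vec![0; (nbits + 63) / 64], nbits }
    }

    // Derive three probe positions from one hash, via double hashing.
    fn probes(&self, key: &[u8]) -> [usize; 3] {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let h1 = h.finish();
        let h2 = h1.rotate_left(32) | 1;
        [0u64, 1, 2].map(|i| (h1.wrapping_add(h2.wrapping_mul(i)) as usize) % self.nbits)
    }

    fn insert(&mut self, key: &[u8]) {
        for p in self.probes(key) {
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    /// false means "definitely not present"; true means "maybe present".
    fn may_contain(&self, key: &[u8]) -> bool {
        self.probes(key)
            .iter()
            .all(|&p| (self.bits[p / 64] & (1u64 << (p % 64))) != 0)
    }
}

A get would consult the filter before scanning a file, and skip the file entirely whenever may_contain returns false.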

But in this version, the only tiny optimisation I do is to avoid reading the whole file if we don’t need to: as the sstables are ordered, once we read a key that is greater than the key we are looking for, we can bail and go to the next file.

fn get(&mut self, k: &[u8]) -> Result<Option<Vec<u8>>, Error> {
    // self.tables is in the right order for scanning the sstables
    // on disk. Read each until we find k. If we fall off the end,
    // return None.
    for t in self.tables.as_mut_slice() {
        t.seek(SeekFrom::Start(0))?;
        loop {
            let kv = KVRecord::read_one(t)?;
            match kv {
                None => break, // end of SSTable, go to next
                Some(kv) => {
                    // We found it!
                    if kv.key == k {
                        return Ok(Some(kv.value.clone()));
                    }
                    // Gone beyond the key, stop scanning
                    // this file, go to next.
                    if kv.key.as_slice() > k {
                        break;
                    }
                }
            }
        }
    }
    // Otherwise, we didn't find it.
    Ok(None)
}

What is toykv missing?

Many, many, MANY things! 😬 But here are a few of the most important.

Deleting things

Update 2024-05-19: Implemented in 2325ff1.

Right now you can only get and set values. You can’t delete them. I think this is a relatively straightforward improvement.

The main thing that needs to change is that we must store the fact that a key/value was deleted. During a read, if we find the most recent “value” for a key in either the memtable or an sstable is a Deleted sentinel value, we can return immediately as if there is no value stored. Without storing this sentinel, a read would find an out-of-date live version of a key in a previous sstable generation.

To remember the fact that a key has been deleted, one approach is:

  1. Alter KVRecord to have a Deleted flag; this would then be automatically serialised by KVRecord into sstable files.
  2. Alter the memtable to store an enum along the lines of enum { Value(Vec<u8>), Deleted }. Writing stores Value and deleting stores Deleted in the memtable (sketched after this list).
  3. For the WAL, we could either add a new op for delete, or we could still use the set op value and rely on the KVRecord delete flag.
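Here’s a sketch of the memtable part of that, with illustrative names only:

/// Sketch: what the memtable could store per key once deletes exist.
enum Sentinel {
    Value(Vec<u8>),
    Deleted,
}

// The memtable becomes a BTreeMap<Vec<u8>, Sentinel>. A read that finds
// Deleted stops immediately and reports "not found", rather than falling
// through to older sstables.
fn lookup(memtable: &BTreeMap<Vec<u8>, Sentinel>, k: &[u8]) -> Option<Option<Vec<u8>>> {
    match memtable.get(k) {
        None => None,                                      // not here; check the sstables
        Some(Sentinel::Deleted) => Some(None),             // deleted; stop searching
        Some(Sentinel::Value(v)) => Some(Some(v.clone())), // live value
    }
}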

While the details would probably change as I coded it, this approach reinforces my belief that this change would be straightforward to make.

Range scans

As noted at the beginning, we are missing range scans. These are essential to any real world database. They are the backbone of most query execution engines, whether doing table or index scans.

Adding these is more complicated. I think the approach is to carry out a k-way merge over all the sstables and the memtable, ensuring that for every key we take the most recent version and discard those that appear in older sstables. This feels trickier, and so I need to think rather more about how I would do it.
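To sketch the core of it — very much not working toykv code, with in-memory Vecs standing in for sstable iterators — keep a min-heap of the next record from each source, pop the smallest key each time, and let the newest source win when several sources hold the same key:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Sketch: merge already-sorted (key, value) lists into one sorted list,
/// keeping only the newest value for each key. sources[0] is the newest
/// source (the memtable), then sstables from newest to oldest.
fn k_way_merge(sources: Vec<Vec<(Vec<u8>, Vec<u8>)>>) -> Vec<(Vec<u8>, Vec<u8>)> {
    // Heap entries are (key, source index, position in source). Reverse turns
    // the max-heap into a min-heap by key; on equal keys the lower (newer)
    // source index pops first.
    let mut heap = BinaryHeap::new();
    for (src, list) in sources.iter().enumerate() {
        if !list.is_empty() {
            heap.push(Reverse((list[0].0.clone(), src, 0usize)));
        }
    }
    let mut out: Vec<(Vec<u8>, Vec<u8>)> = Vec::new();
    while let Some(Reverse((key, src, pos))) = heap.pop() {
        // Keys pop in sorted order, so duplicates are adjacent; only emit the
        // first (newest) version of each key.
        if out.last().map_or(true, |(k, _)| *k != key) {
            out.push((key, sources[src][pos].1.clone()));
        }
        // Advance this source and queue its next record, if any.
        if pos + 1 < sources[src].len() {
            heap.push(Reverse((sources[src][pos + 1].0.clone(), src, pos + 1)));
        }
    }
    out
}

A range scan would be a lazy, streaming version of this, and the same merge is the heart of compaction too.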

Without adding these to toykv, however, I can’t succeed in my dream of using toykv to support rust-docdb. So I’ll probably at least give it a go 💪

Compaction

We noted in passing that compaction was missing, and that it was needed if we were not to eventually run out of “generations” for our sstables.

Compaction merges together files, keeping only the most recent values, so we’d be able to “make space” for new sstables by merging files because we’d reduce the highest generation number on disk. In addition, compaction throws away older versions of keys, which keeps disk usage under control as we have fewer duplicates on disk.

As well as running out of generations and disk space, in real life allowing u32::MAX sstable files would mean read performance would be awful. You might have to read tens of thousands of files to return a single value!

Because of this, most implementations allow for just a few generations (or levels) — often fewer than 10. In order that individual files don’t get too large, each generation (level) is split into multiple files. Each file contains a non-overlapping part of the key space, and when reading the code can select the appropriate subset of files at a given level to read.
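To illustrate that last point — purely a sketch, since toykv has nothing like this — a read at a given level only needs to open the one file whose key range covers the key:

use std::path::PathBuf;

/// Sketch: metadata for one sstable file within a level.
struct LevelFile {
    first_key: Vec<u8>,
    last_key: Vec<u8>,
    path: PathBuf,
}

// Files in a level cover non-overlapping key ranges, so at most one file
// can contain k.
fn file_for_key<'a>(level: &'a [LevelFile], k: &[u8]) -> Option<&'a LevelFile> {
    level
        .iter()
        .find(|f| f.first_key.as_slice() <= k && k <= f.last_key.as_slice())
}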

Compaction is a key characteristic of LSM tree databases and their performance. I found the RocksDB description of compaction algorithms a good starting point. Beyond the number of allowed levels, there are many ways to compact, each with tradeoffs. For example, different algorithms may differ in:

  • Number and length of write stalls while compaction is happening.
  • Their trade-offs between write or read amplification — how much work is done when writing or reading data.

So there is no “best” compaction algorithm, only ones that work better or worse for particular workloads. I’d choose something simple for toykv, as with all the other implementation decisions, and we’ll see if I ever get around to it. I think this also requires that k-way merge algorithm that range scanning needs, but I’m not sure what it needs over and above that.

Phew

So there we have it, a whirlwind tour of the tiniest of toy data-storage layers for a database. I’ve learned a lot writing this; a lot about rust but even more about databases. I’ve read several papers and a whole ton of blog posts while writing even this implementation. Again, like docdb, I’ve come to a place where I can much better understand the tradeoffs within LSM tree databases, and especially some of the discussions around compaction effects are starting to make more sense. But there’s still a lot that reads as mumbo jumbo to me. This area is DEEP. But fun, and fascinating, and enjoyable.

For now, we’re done. Top marks for sticking with me all this way, I hope you learned something too.

👋
