This is all about doing live migration of VMs that have attached local storage. So the storage needs to move alongside the compute — and it has to physically move, block by block, from the old hypervisor’s local disk to the new hypervisor’s local disk. How do you do that without a horrible stop-the-world for your customers’ applications?
I always wondered how this was done, and this post gives the shape of one approach to the problem. Enjoyed.
The Linux feature we need to make this work already exists; it’s called dm-clone. Given an existing, readable storage device, dm-clone gives us a new device, of identical size, where reads of uninitialized blocks will pull from the original. It sounds terribly complicated, but it’s actually one of the simpler kernel lego bricks. Let’s demystify it.
In ToyKV compaction: it finally begins!, I noted that I’d finally started writing a simple compactor for ToyKV, a key/value store I’ve been writing (to learn about Rust and writing databases) based on an LSM-tree structure. The idea is to have a working database storage engine, albeit not a particularly sophisticated one.
A really important piece of an LSM-tree storage engine is compaction. Compaction takes the many files that the engine produces over the course of processing writes and reduces their volume to improve read performance: it drops old versions of writes and reorganises the data to be more efficient for reads.
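To make that concrete, here’s a minimal sketch of the idea (not ToyKV’s actual code): merge several sorted runs so that only the newest value for each key survives. It assumes each run is just a sorted Vec of key/value pairs, and that runs are ordered oldest-first.

use std::collections::BTreeMap;

/// Merge sorted runs into a single sorted run, keeping only the
/// newest value for each key. `runs` is ordered oldest-first, so
/// later runs overwrite earlier ones in the map.
fn merge_runs(runs: Vec<Vec<(String, String)>>) -> Vec<(String, String)> {
    let mut merged = BTreeMap::new();
    for run in runs {
        for (k, v) in run {
            merged.insert(k, v); // a newer value replaces an older one
        }
    }
    // BTreeMap iterates in key order, giving one sorted run.
    merged.into_iter().collect()
}

A real compactor streams from the sstable files rather than loading everything into memory, but the “newest version wins, output stays sorted” shape is the same.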
I’d avoided working on this because getting a first version built meant writing a large chunk of code. As I mentioned in the post above, by breaking down the task I was able to take it on step by step. And, indeed, Simple compaction v1-v7 by mikerhodes #25 is both large (2,500 new lines of code) and proceeds in a step-by-step manner.
Now let’s talk about a few of the bits of code I’m most happy with. Nothing’s perfect, but I tried to lay a good grounding for adding more sophisticated compaction algorithms later.
Simon Willison links to an idea that I immediately fell in love with, and will obviously never use. But that someone has done it makes me somehow happier.
I Saved a PNG Image To A Bird. Benn Jordan provides one of the all-time great YouTube video titles, and it’s justified. He drew an image in an audio spectrogram, played that sound to a talented starling (internet celebrity “The Mouth”), and recorded the starling almost perfectly imitating it back to him.
Benn himself further says:
Hypothetically, if this were an audible file transfer protocol that used a 10:1 data compression ratio, that’s nearly 2 megabytes of information per second. While there are a lot of caveats and limitations there, the fact that you could set up a speaker in your yard and conceivably store any amount of data in songbirds is crazy.
Sometimes you have to write down an intent publicly to make sure you carry it through to the end. And so here is one such statement, to encourage me to complete this vital part of toykv.
I’ve started work on Simple Compaction · Issue #24 · mikerhodes/toykv. I’d been putting it off for a while — it felt big, and tackling it in the short evening blocks of time I usually have seemed a bit overwhelming.
To make it more manageable, I spent the first hour dividing the work into several “versions”, creating an iterative path to the simplest compaction scheme: merge all sstables into a single sorted run. This broke the back of the task: three hours in, I’ve now reached version 2 of 7.
The top-level method is deceptively, but pleasingly, simple:
// Attempt to run a compaction on the stored data. The exact
// compaction carried out is opaque to the caller.
pub fn compact(&mut self) -> Result<(), ToyKVError> {
    // v2 compaction just compacts all existing sstables
    // in L0 into a single L0 table. This is implemented
    // with SimpleCompactionPolicy.
    let policy = SimpleCompactionPolicy::new();
    // Guard ensures only one compaction can happen at a time.
    let _guard = CompactionGuard::acquire(self.state.clone())?;
    // Get a table compaction task from the sstables structure.
    let c_task = {
        let state = self.state.read().unwrap();
        state.sstables.build_compaction_task_v2(policy)?
    };
    // Compaction task is safe to run outside state lock, to
    // allow reads and writes to continue while we compact.
    let c_result = c_task.compact_v2()?;
    // Commit the compacted table while holding write lock.
    {
        let mut state = self.state.write().unwrap();
        state.sstables.try_commit_compaction_v2(c_result)?;
    }
    self.metrics.compacts.fetch_add(1, Ordering::Relaxed);
    Ok(())
}
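One nice property of this structure is that the expensive work, c_task.compact_v2(), runs without holding the state lock at all: the read lock is held just long enough to decide what to compact, and the write lock just long enough to commit the result. The other small piece doing a lot of work is CompactionGuard. I won’t reproduce ToyKV’s real guard here, but as a sketch only (the real type may well differ), this kind of “at most one compaction at a time” guard is typically an RAII wrapper around an atomic flag that is cleared on Drop, so it can’t leak if compaction bails out early with an error:

use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

/// Sketch of a "one compaction at a time" guard; illustrative only.
struct CompactionGuardSketch {
    running: Arc<AtomicBool>,
}

impl CompactionGuardSketch {
    fn acquire(running: Arc<AtomicBool>) -> Result<Self, &'static str> {
        // Atomically flip false -> true; fail if a compaction is
        // already in flight.
        running
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .map_err(|_| "compaction already running")?;
        Ok(CompactionGuardSketch { running })
    }
}

impl Drop for CompactionGuardSketch {
    fn drop(&mut self) {
        // Clear the flag even if the compaction errored out.
        self.running.store(false, Ordering::Release);
    }
}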
And so here is my stake in the ground, my statement to say, “Behold, for this is something I intend to complete!” 😬
This is a brief tale, told mostly through links, about subtlety. And fsync, though perhaps the two are synonymous.
While I’m writing about this in September, the events actually happened back around March; I intended to write this up back then, but somehow it just never happened.
Earlier this year, I read NULL BITMAP Builds a Database #2: Enter the Memtable. At the end, Justin Jaffray mentions a potential sad path when the database you are coding up (as one does) crashes. Here, we are talking about whether the database can accidentally lie to a reader about whether a write is on-disk (durable):
I do a write, and it goes into the log, and then the database crashes before we fsync. We come back up, and the reader, having not gotten an acknowledgment that their write succeeded, must do a read to see if it did or not. They do a read, and then the write, having made it to the OS’s in-memory buffers, is returned. Now the reader would be justified in believing that the write is durable: they saw it, after all. But now we hard crash, and the whole server goes down, losing the contents of the file buffers. Now the write is lost, even though we served it!
The solution is easy: just fsync the log on startup so that any reads we do are based off of data that has made it to disk.
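In code, that startup step is tiny. Here’s a minimal Rust sketch, assuming the write-ahead log is a single file (the function name and setup are illustrative, not from any particular codebase): open the log and fsync it before answering any reads.

use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// On startup, fsync the write-ahead log before serving any reads,
/// so nothing we return to a reader can later vanish from the OS's
/// in-memory buffers. Sketch only; assumes a single-file WAL.
fn sync_log_on_startup(wal_path: &Path) -> io::Result<()> {
    let f = OpenOptions::new().read(true).write(true).open(wal_path)?;
    f.sync_all()?; // fsync(2): flush file data and metadata to disk
    Ok(())
}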
If you’re anything like me, Justin’s failure scenario will take you at least three reads to get the order of events straight in your head; it certainly took me that many. But once it clicked, it felt right to me. Since I work on a database, I thought I’d ask the team whether we did that startup fsync. I was pretty sure we did, but it’s part of my job to double-check these things when I come across them.
Herewith, the story and the warning about subtlety.