In my idle moments, as I sit quietly and watch clouds scud across the sky, I sometimes wonder things like:
“Inside Kubernetes or whatever, underneath containers and all that fancy stuff, it’s just the kernel’s cgroups v2 holding things apart from each other — if I was writing some Go code, what’s the best way to put a process into a cgroup to limit its resources?”
Now, I know a lot of the theory here. I’ve read the cgroups v2 documentation and know how it fits together: that it’s filesystem-based, and what the various controllers can do for you. And I know the broad-strokes story of how you manage resources with it. While I could definitely know more about namespaces, I know enough to know that namespaces are what to reach for if I want decently hard isolation boundaries between processes, albeit not ones you’d want to rely on for untrusted process execution. I can use this knowledge effectively when building stuff in Kubernetes.
But I want to start grounding this theory in practice. Write some code that does the stuff rather than just knowing about the stuff. Because that’s when you really get to know the details of the thing, and can exercise and test your reasoning about the world.
So let’s start by taking a look at one small, specific corner of Linux: how do we put a process into a cgroup? I took a slightly circuitous path when figuring this out and, as I enjoy reading similar detective stories, I’ll write out what I did. First, though, let’s get specific about what we want to do:
First we’ll see the naive way to do it, following the process in the cgroups v2 documentation. This way works, but it has an annoying imperfection. Next we’ll look at a way that avoids that problem, but which turns out not to work in Go. And finally, we’ll use some relatively recent additions to Go to make it work neatly.
Last night I went to see the Nova Twins in Bristol. We were super-excited; Nova Twins were on our “love to see” list. And we were not disappointed: it was excellent, intense and exhilarating, just as rock-metal should be.
🤘
Take the chance to see them if they play near you.
Simon Højberg expresses a sentiment I think I agree with. I’m pretty sure that I’d find agent baby-sitting much less fun than writing code.
LLMs seem like a nuke-it-from-orbit solution to the complexities of software. Rather than addressing the actual problems, we reached for something far more complex and nebulous to cure the symptoms. I don’t really mind replacing `sed` with Claude, or asking it for answers about a library or framework that I still lack clarity on after hours of hunting through docs. But I profoundly do not want to be merely an operator or code reviewer: taking a backseat to the fun and interesting work. I want to drive, immerse myself in craft, play in the orchestra, and solve complex puzzles. I want to remain a programmer, a craftsperson.
This is all about doing live migration of VMs that have attached local storage. So the storage needs to move alongside the compute — and it has to physically move, block by block, from the old hypervisor’s local disk to the new hypervisor’s local disk. How do you do that without a horrible stop-the-world for your customers’ applications?
I always wondered how this was done, and this post gives the shape of one approach to the problem. Enjoyed.
The Linux feature we need to make this work already exists; it’s called `dm-clone`. Given an existing, readable storage device, `dm-clone` gives us a new device, of identical size, where reads of uninitialized blocks will pull from the original. It sounds terribly complicated, but it’s actually one of the simpler kernel lego bricks. Let’s demystify it.
In ToyKV compaction: it finally begins!, I noted that I’d finally started writing a simple compactor for ToyKV, a key/value store I’ve been writing (to learn about Rust and writing databases) based on an LSM-tree structure. The idea is to have a working database storage engine, albeit not a particularly sophisticated one.
A really important piece of an LSM-tree storage engine is compaction. Compaction takes the many files that the engine produces over the course of processing writes and reduces their volume to improve read performance: it drops old versions of writes, and reorganises the data to be more efficient for reads.
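ToyKV itself is written in Rust, but the core move compaction makes can be sketched in a few lines of Go (illustrative only, not ToyKV’s actual code): merge several sorted runs, keep only the newest version of each key, and emit the result in key order so reads stay cheap.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is one record in an SSTable-like run: a key, a sequence number
// (higher means newer), and a value.
type entry struct {
	key string
	seq int
	val string
}

// compact merges runs and keeps only the newest version of each key,
// which is the essence of LSM-tree compaction: fewer files, no stale
// versions, and output sorted by key for efficient reads.
func compact(runs ...[]entry) []entry {
	newest := map[string]entry{}
	for _, run := range runs {
		for _, e := range run {
			if cur, ok := newest[e.key]; !ok || e.seq > cur.seq {
				newest[e.key] = e
			}
		}
	}
	out := make([]entry, 0, len(newest))
	for _, e := range newest {
		out = append(out, e)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].key < out[j].key })
	return out
}

func main() {
	older := []entry{{"a", 1, "1"}, {"b", 1, "old"}}
	newer := []entry{{"b", 2, "new"}, {"c", 2, "3"}}
	for _, e := range compact(older, newer) {
		fmt.Println(e.key, e.val)
	}
	// The stale value for "b" is dropped; only its newest version survives.
}
```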
I’d avoided working on this because getting a first version built meant writing a large chunk of code. As I mentioned in the post above, breaking down the task let me take it on step by step. And, indeed, Simple compaction v1-v7 by mikerhodes #25 is both large (2,500 new lines of code) and proceeds in a step-by-step manner.
Now let’s talk about a few of the bits of code I’m most happy with. Nothing’s perfect, but I tried to lay a good grounding for adding more sophisticated compaction algorithms later.