In my idle moments, as I sit quietly and watch clouds scud across the sky, I sometimes wonder things like:
“Inside Kubernetes or whatever, underneath containers and all that fancy stuff, it’s just the kernel’s cgroups v2 holding things apart from each other — if I was writing some Go code, what’s the best way to put a process into a cgroup to limit its resources?”
Now, I know a lot of the theory here. I’ve read the cgroups v2 documentation and know how it fits together, that it’s filesystem-based and what the various controllers can do for you. And I know the broad strokes story of what and how you manage resources. While I could definitely know more about namespaces, I do know enough to know that namespaces are what I need to reach for if I want to put some decently hard isolation boundaries between processes, albeit not ones you’d probably want to use for untrusted process execution. I can use this knowledge effectively when building stuff in Kubernetes.
But I want to start grounding this theory in practice. Write some code that does the stuff rather than just knowing about the stuff. Because that’s when you really get to know the details of the thing, and can exercise and test your reasoning about the world.
So let’s start by taking a look at this little specific corner of Linux: how do we put something into a cgroup? I took a slightly circuitous path when figuring this out, and, as I like reading similar detective stories, I’ll write out what I did. Finally, let’s get specific about what we want to do:
- Rather than figuring this out in the abstract, let’s write some Go.
- We’ll deal exclusively with cgroups v2, since most current distros enable it by default.
- And we will deal with this single question, rather than covering all the functionality cgroups has — because you have to be able to put a process into a cgroup before you can do anything else.
First we’ll see the naive way to do it, following the process in the cgroups v2 documentation. This way works, but has an annoying imperfection about it. Next we’ll look at a way that doesn’t have the problem, but which turns out not to work in Go. And finally, we’ll use some relatively recent additions to Go to make it work neatly.
What the docs say to do
The kernel exposes user control of cgroups using a special filesystem mount. To put a process into a cgroup, the cgroups v2 docs say:
- Create a cgroup by making a directory in the cgroup filesystem. To all intents and purposes, from the filesystem UX, the directory is the cgroup. One clue it’s not a normal filesystem is that the directory comes pre-populated with a ton of exotic looking files.
  - We’ll look at how to find the path to the cgroup filesystem later.
- Start the process.
- Write the process’s Pid to the cgroup.procs file in the cgroup’s directory. There’s nothing fancy about this; it’s literally fine to just use echo 123456 >> cgroup.procs.
Let’s write up these steps in Go, leaving out error handling for clarity:
// 1. Create the cgroup
p := "/sys/fs/cgroup/foo.slice"
os.Mkdir(p, 0755) // directories need the execute bit to be traversable
// 2. Start the process
cmd := exec.Command("sleep", "30")
cmd.Start()
// 3. Add the Pid to the cgroup
procsFile := filepath.Join(p, "cgroup.procs")
file, _ := os.OpenFile(procsFile, os.O_WRONLY|os.O_APPEND, 0644)
fmt.Fprintf(file, "%d", cmd.Process.Pid)
file.Close()
This will work just fine. But the thing I don’t like is that there’s a gap
between starting the process and putting it in the cgroup. I don’t like gaps;
they tend to be where exploits are found. The process could do ANYTHING in those
nanoseconds! (…if it wasn’t sleep 😬 ).
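Before fixing the gap, here is the same start-then-move recipe with the error handling put back in. This is only a sketch: the helper name startThenMove and the choice to kill the child when the move fails are my own assumptions, not something the cgroups docs prescribe. The cgroup directory is passed in, and must already exist.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
)

// startThenMove (hypothetical helper) starts the command and then writes its
// Pid to cgroup.procs inside cgroupDir. If the move fails, the child is
// killed so it does not keep running outside the intended cgroup.
func startThenMove(cgroupDir, name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("start %s: %w", name, err)
	}

	// The gap: from here until the write below succeeds, the child is
	// already running in the parent's cgroup.
	procs := filepath.Join(cgroupDir, "cgroup.procs")
	f, err := os.OpenFile(procs, os.O_WRONLY, 0)
	if err == nil {
		_, err = f.WriteString(strconv.Itoa(cmd.Process.Pid))
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}
	if err != nil {
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
		return nil, fmt.Errorf("move pid into %s: %w", cgroupDir, err)
	}
	return cmd, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: naive <cgroup-dir>")
		return
	}
	cmd, err := startThenMove(os.Args[1], "sleep", "30")
	if err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
	fmt.Printf("pid %d is now in %s\n", cmd.Process.Pid, os.Args[1])
	_ = cmd.Wait()
}
```

Even with every error checked, the comment in the middle marks the window that the rest of this post is about.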
So let’s figure out how to fix that gap.
Fork, add yourself to a cgroup, then exec
So I thought about it, and the next thing that came to mind was to go to the raw
syscalls, and call fork. The child process of the fork then moves itself into
the cgroup. Next it uses exec to replace itself with the new process. Bam, new
process is in the new cgroup from the start. Bar things like dropping
permissions in the child, we’re all done, right? Right?
But Go lacks a fork call. You can exec with syscall.Exec, but you can only fork as part of a combined ForkExec. So in Go, you can’t twiddle things around in the child process of fork before you call exec. Bummer.
There are ways around this — eg, start an intermediate program instead of forking, and have that program do the actions of the child process — but it doesn’t feel right that you have to go to such lengths. Perhaps we’re stuck with The Gap?
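To make the intermediate-program idea concrete, here is a rough sketch of such a shim. It assumes a hypothetical command line of the form shim <cgroup-dir> <command> [args...]; the names parseShimArgs and enterAndExec are mine. The shim writes its own Pid to cgroup.procs and then uses syscall.Exec to replace itself with the real command.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

// parseShimArgs splits "<cgroup-dir> <command> [args...]" into its parts.
func parseShimArgs(args []string) (cgroupDir string, argv []string, err error) {
	if len(args) < 2 {
		return "", nil, errors.New("usage: shim <cgroup-dir> <command> [args...]")
	}
	return args[0], args[1:], nil
}

// enterAndExec moves the current process into the cgroup and then replaces
// it with the target command. On success it never returns.
func enterAndExec(cgroupDir string, argv []string) error {
	procs := filepath.Join(cgroupDir, "cgroup.procs")
	f, err := os.OpenFile(procs, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	_, err = f.WriteString(strconv.Itoa(os.Getpid()))
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		return err
	}

	// syscall.Exec wants the full path of the binary, so resolve it first.
	path, err := exec.LookPath(argv[0])
	if err != nil {
		return err
	}
	return syscall.Exec(path, argv, os.Environ())
}

func main() {
	cgroupDir, argv, err := parseShimArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	if err := enterAndExec(cgroupDir, argv); err != nil {
		fmt.Fprintln(os.Stderr, "shim:", err)
		os.Exit(1)
	}
}
```

The parent would then launch something like exec.Command("./shim", "/sys/fs/cgroup/foo.slice", "sleep", "30"). The shim itself still starts in the parent’s cgroup, but the target program only ever runs inside the new one. It is a whole extra binary to build and ship, though, which is the "such lengths" part.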
You can read more about fork and the exec family in this free chapter of the Linux Programming Interface book, 24: Process Creation.
(Under the hood, of course Go has to use the standard fork/exec style to launch the process. But that is all kept deep down in the stdlib code rather than available to users. I am not sure why Go doesn’t support a raw fork call, but that is a quest for another day.)
All is not lost: clone3 to the rescue
While we can’t fork/exec, it turns out the Linux kernel has some more
flexible ways to start a new process which allow us the control we need. And
thankfully the parts we need are exposed in Go.
In 2019, Linux 5.3 added a
clone3 syscall to allow
setting more options on the cloned child process than fork allows. In
particular, it allows spawning the process with a different set of
namespaces. Linux
5.7, released in 2020, amended clone3 to also support setting the cgroup of
the cloned process:
struct clone_args {
    u64 flags;   /* Flags bit mask */
    // ... other fields ...
    u64 cgroup;  /* File descriptor for target cgroup
                    of child (since Linux 5.7) */
};
This is used as follows:
In order to place the child process in a different cgroup,
the caller specifies CLONE_INTO_CGROUP in cl_args.flags and
passes a file descriptor that refers to a version 2 cgroup
in the cl_args.cgroup field.
Obviously this is all in C, but fortunately a request to support use of clone3
to set a spawned process’s cgroup was created in Go issue #51246:
syscall: add PidFD, CgroupFD, and UseCgroupFD options for Linux clone to SysProcAttr.
syscall.SysProcAttr is a platform-specific structure that can be used to
describe options for new processes when spawning processes, including with
exec.Cmd.
The changes requested by the issue were implemented in late 2022 and shipped in Go 1.20, resulting in two new fields on SysProcAttr: CgroupFD and UseCgroupFD. Setting UseCgroupFD maps to adding the CLONE_INTO_CGROUP flag on the clone_args struct. CgroupFD maps directly to the cl_args.cgroup field.
In the clone3 man page we also find confirmation that this API is exactly
designed to solve the problem of processes not being spawned directly into the
appropriate cgroup 🌟:
• Spawning a process into a cgroup different from the
parent's cgroup makes it possible for a service manager
to directly spawn new services into dedicated cgroups.
This eliminates the accounting jitter that would be
caused if the child process was first created in the
same cgroup as the parent and then moved into the target
cgroup. Furthermore, spawning the child process
directly into a target cgroup is significantly cheaper
than moving the child process into the target cgroup
after it has been created.
Also nice that it’s more efficient to launch the process into the cgroup. The cgroup documentation also talks about it being expensive to move between cgroups.
How do we launch a process in another cgroup in Go?
Enough with the history lesson. Let’s get back to the original quest, and put all this together in a Go program:
- We need to create the cgroup directory.
  - In real life we’d also assign the controllers we want to use and set up resource limits, but for now we will leave it.
- We need to obtain an FD to that cgroup directory.
- We need to launch a process using the above UseCgroupFD and CgroupFD fields on SysProcAttr.
Prerequisites: running this on a Mac
Wait! One more interruption!
I use a Mac, and of course we can’t use cgroups on a Mac. They’re a Linux thing. We need to use a virtual machine running Linux. I installed lima and launched the default machine. After launching, we need to install Go:
$ wget https://go.dev/dl/go1.25.4.linux-arm64.tar.gz
$ sudo rm -rf /usr/local/go
$ sudo tar -C /usr/local -xzf go1.25.4.linux-arm64.tar.gz
$ export PATH=$PATH:/usr/local/go/bin
$ go version
go version go1.25.4 linux/arm64
Way back at the start I promised to show how we find the mount point of the cgroups directory hierarchy:
$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
Probably this path is unvarying across the wideness of the universe, used on every Linux system, but it’s worth a check, eh?
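If you’d rather do that check from Go than by eyeballing mount output, one option is to scan /proc/mounts for a cgroup2 entry. This is a minimal sketch; the function name cgroup2MountPoint is my own, and it simply returns the first cgroup2 mount it sees.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// cgroup2MountPoint scans text in /proc/mounts format and returns the mount
// point of the first cgroup2 filesystem it finds.
func cgroup2MountPoint(mounts string) (string, bool) {
	for _, line := range strings.Split(mounts, "\n") {
		// Fields: device, mount point, filesystem type, options, dump, pass.
		fields := strings.Fields(line)
		if len(fields) >= 3 && fields[2] == "cgroup2" {
			return fields[1], true
		}
	}
	return "", false
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		fmt.Println("could not read /proc/mounts:", err)
		return
	}
	if mp, ok := cgroup2MountPoint(string(data)); ok {
		fmt.Println("cgroup2 mounted at", mp)
	} else {
		fmt.Println("no cgroup2 mount found")
	}
}
```

The parsing lives in its own function so it can be tried against a pasted line without needing a Linux box to hand.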
Back to Go
Now we can write our Go program that will:
- Create a cgroup, foo.slice, using os.Mkdir. Creating new cgroups is as easy as creating directories in the filesystem. The one gotcha is that you need the right permissions; we will be lazy and run our Go program as root.
- Get the FD for that cgroup using os.Open and File.Fd.
- Create an exec.Cmd and then add SysProcAttr to it with the cgroup details.
- Start the program and print its Pid so we can check whether it matches what we see in the cgroup directory tree.
- Finally we’ll Wait on the process.
There isn’t much code, so let’s just read all of it:
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// p is the path to our cgroup. Kubelet uses the .slice
	// suffix, so let's use that too.
	p := "/sys/fs/cgroup/foo.slice"

	// 1. Create the cgroup (0755: directories need the execute bit)
	err := os.Mkdir(p, 0755)
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(p)

	// 2. Get the FD of the cgroup
	cgroup, err := os.Open(p)
	if err != nil {
		log.Fatal(err)
	}
	defer cgroup.Close()
	cgroupFD := cgroup.Fd()

	// 3. Create a process that will start in the cgroup
	cmd := exec.Command("sleep", "30")
	if cmd.SysProcAttr == nil { // usually nil, but check
		cmd.SysProcAttr = &syscall.SysProcAttr{}
	}
	cmd.SysProcAttr.UseCgroupFD = true
	cmd.SysProcAttr.CgroupFD = int(cgroupFD)

	// 4. Start the program and print its Pid
	// Check for Pid in /sys/fs/cgroup/foo.slice/cgroup.procs
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	log.Printf("Started process pid: %d", cmd.Process.Pid)

	// 5. Wait for process to exit. Log rather than Fatal so the
	// deferred cleanup above still runs.
	if err := cmd.Wait(); err != nil {
		log.Printf("process exited with error: %v", err)
	}
}
Running the code
In one terminal run the Go code:
$ sudo go run main.go
2025/11/06 20:42:47 Started process pid: 180105
[... sleep sleep sleep ...]
While it’s sleeping we can verify the Pid is in the right cgroup:
$ sudo cat /sys/fs/cgroup/foo.slice/cgroup.procs
180105
And there we go!
We can see this pattern in practice in projects like GitLab’s Gitaly.
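The cat check above can also be done from Go, which is handy if you want the program to verify its own work. This is a small sketch under my own naming (pidInProcs). It relies only on cgroup.procs listing one Pid per line.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// pidInProcs reports whether pid appears in the contents of a cgroup.procs
// file, which lists one Pid per line.
func pidInProcs(procs string, pid int) bool {
	want := strconv.Itoa(pid)
	for _, line := range strings.Split(procs, "\n") {
		if strings.TrimSpace(line) == want {
			return true
		}
	}
	return false
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: check <path-to-cgroup.procs> <pid>")
		return
	}
	pid, err := strconv.Atoi(os.Args[2])
	if err != nil {
		fmt.Println("bad pid:", err)
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("could not read cgroup.procs:", err)
		return
	}
	fmt.Println("in cgroup:", pidInProcs(string(data), pid))
}
```

Comparing whole lines rather than using a substring search matters here: Pid 1801 should not match a line containing 180105.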
Aside: why a file descriptor?
It was interesting to me that the clone3 API takes a file descriptor rather
than a path. I don’t read syscall interfaces often, but the
openat man page provides
the rationale for using FDs:
First, `openat()` allows an application to avoid race conditions
that could occur when using `open()` to open files in directories
other than the current working directory. These race conditions
result from the fact that some component of the directory prefix
given to `open()` could be changed in parallel with the call to
`open()`. Suppose, for example, that we wish to create the file
`dir1/dir2/xxx.dep` if the file `dir1/dir2/xxx` exists. The problem
is that between the existence check and the file-creation step,
`dir1` or `dir2` (which might be symbolic links) could be modified to
point to a different location.
openat allows using an FD to open a file, rather than a path. The man page
also notes that the reasoning holds for a whole swath of other APIs, including
clone.
This is a similar race condition to the one we had way back at the top of the post, where we had the gap between starting the process and putting it into the cgroup.
There’s obviously a threat vector with cgroups where an attacker could switcheroo the cgroup directory between its creation and when the process is added. The attacker can then either remove the resource limits on the process, or manipulate the process itself. Not what you want, if you’re kubelet or systemd. Squinting, I can also see advantages in using syscalls like openat to open the resource control files in the cgroup relative to the FD you get from open. That similarly avoids writing limits into the wrong cgroup.
But I wonder about this in practice for cgroups: an attacker would have to be pretty far into the system to be able to manipulate the cgroup filesystem in this way. I’m not enough of an expert here to pass judgement either way.
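For the curious, here is roughly what the openat idea could look like in Go. It is a sketch, not a hardened implementation: writeControlFile is a name I made up, it uses the Linux-only syscall.Openat, and main runs it against a temporary directory standing in for a cgroup so it can be tried without root. The memory.max file name and the 100 MiB value are just illustrations.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// writeControlFile opens name relative to the directory FD dirfd (openat
// semantics) and writes value to it. name should be a plain file name;
// openat on its own does not stop ".." from walking out of the directory.
func writeControlFile(dirfd int, name, value string) error {
	fd, err := syscall.Openat(dirfd, name, syscall.O_WRONLY|syscall.O_CLOEXEC, 0)
	if err != nil {
		return fmt.Errorf("openat %s: %w", name, err)
	}
	f := os.NewFile(uintptr(fd), name)
	defer f.Close()
	if _, err := f.WriteString(value); err != nil {
		return fmt.Errorf("write %s: %w", name, err)
	}
	return nil
}

func main() {
	// Stand-in for a cgroup directory so the sketch runs without root.
	dir, err := os.MkdirTemp("", "fake-cgroup")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	// A real cgroup comes with its control files pre-populated; fake one.
	name := "memory.max"
	if err := os.WriteFile(filepath.Join(dir, name), nil, 0o644); err != nil {
		fmt.Println("error:", err)
		return
	}

	// Open the directory once, then address files relative to that FD.
	d, err := os.Open(dir)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer d.Close()

	if err := writeControlFile(int(d.Fd()), name, "104857600"); err != nil {
		fmt.Println("error:", err)
		return
	}
	got, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s now contains %q\n", name, got)
}
```

The point is that the same directory FD could be handed to both CgroupFD and writeControlFile, so the limits and the process land in the same cgroup even if the path is swapped out from under you.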