In my idle moments, as I sit quietly and watch clouds scud across the sky, I sometimes wonder things like:

“Inside Kubernetes or whatever, underneath containers and all that fancy stuff, it’s just the kernel’s cgroups v2 holding things apart from each other — if I was writing some Go code, what’s the best way to put a process into a cgroup to limit its resources?”

Now, I know a lot of the theory here. I’ve read the cgroups v2 documentation and know how it fits together, that it’s filesystem-based and what the various controllers can do for you. And I know the broad strokes story of what and how you manage resources. While I could definitely know more about namespaces, I do know enough to know that namespaces are what I need to reach for if I want to put some decently hard isolation boundaries between processes, albeit not ones you’d probably want to use for untrusted process execution. I can use this knowledge effectively when building stuff in Kubernetes.

But I want to start grounding this theory in practice. Write some code that does the stuff rather than just knowing about the stuff. Because that’s when you really get to know the details of the thing, and can exercise and test your reasoning about the world.

So let’s start by taking a look at this one specific corner of Linux: how do we put something into a cgroup? I took a slightly circuitous path when figuring this out, and, as I like reading similar detective stories, I’ll write out what I did. Before we start, let’s get specific about what we want to do:

  • Rather than figuring this out in the abstract, let’s write some Go.
  • We’ll deal exclusively with cgroups v2 as most distros enable it.
  • And we will deal with this single question, rather than covering all the functionality cgroups has — because you have to be able to put a process into a cgroup before you can do anything else.

First we’ll see the naive way to do it, following the process in the cgroups v2 documentation. This way works, but has an annoying imperfection about it. Next we’ll look at a way that doesn’t have the problem, but which turns out not to work in Go. And finally, we’ll use some relatively recent additions to Go to make it work neatly.

What the docs say to do

The kernel exposes user control of cgroups using a special filesystem mount. To put a process into a cgroup, the cgroups v2 docs say:

  1. Create a cgroup by making a directory in the cgroup filesystem. To all intents and purposes, from the filesystem UX, the directory is the cgroup. One clue it’s not a normal filesystem is that the directory comes pre-populated with a ton of exotic looking files.
    • We’ll look at how to find the path to the cgroup file system later.
  2. Start the process.
  3. Write the process’s Pid to the cgroup.procs file in the cgroup’s directory. There’s nothing fancy about this, it’s literally fine to just use echo 123456 >> cgroup.procs.

Let’s write up these steps in Go, leaving out error handling for clarity:

// 1. Create the cgroup
p := "/sys/fs/cgroup/foo.slice"
os.Mkdir(p, 0755) // directories need the execute bit to be traversable

// 2. Start the process
cmd := exec.Command("sleep", "30")
cmd.Start()

// 3. Add the Pid to the cgroup
procsFile := filepath.Join(p, "cgroup.procs")
file, _ := os.OpenFile(procsFile, os.O_WRONLY|os.O_APPEND, 0644)
fmt.Fprintf(file, "%d", cmd.Process.Pid)
file.Close()

This will work just fine. But the thing I don’t like is that there’s a gap between starting the process and putting it in the cgroup. I don’t like gaps; they tend to be where exploits are found. The process could do ANYTHING in those nanoseconds! (…if it wasn’t sleep 😬 ).

So let’s figure out how to fix that gap.

Fork, add yourself to a cgroup, then exec

So I thought about it, and the next thing that came to mind was to go to the raw syscalls, and call fork. The child process of the fork then moves itself into the cgroup. Next it uses exec to replace itself with the new process. Bam, new process is in the new cgroup from the start. Bar things like dropping permissions in the child, we’re all done, right? Right?

But Go lacks a fork call. You can exec with syscall.Exec, but the only way to fork is the combined syscall.ForkExec. So in Go, you can’t twiddle things around in the child process of fork before you call exec. Bummer.

There are ways around this (e.g., start an intermediate program instead of forking, and have that program do the actions of the child process), but it doesn’t feel right that you have to go to such lengths. Perhaps we’re stuck with The Gap?

Learn more about fork and the exec family in this free chapter of the Linux Programming Interface book, 24: Process Creation.

(Under the hood, of course Go has to use the standard fork/exec style to launch the process. But that is all kept deep down in the stdlib code rather than available to users. I am not sure why Go doesn’t support a raw fork call, but that is a quest for another day.)

All is not lost: clone3 to the rescue

While we can’t fork/exec, it turns out the Linux kernel has some more flexible ways to start a new process which allow us the control we need. And thankfully the parts we need are exposed in Go.

In 2019, Linux 5.3 added a clone3 syscall to allow setting more options on the cloned child process than fork allows. In particular, it allows spawning the process with a different set of namespaces. Linux 5.7, released in 2020, amended clone3 to also support setting the cgroup of the cloned process:

struct clone_args {
    u64 flags;        /* Flags bit mask */
    // ... other fields ...
    u64 cgroup;       /* File descriptor for target cgroup
                         of child (since Linux 5.7) */
};

This is used as follows:

In order to place the child process in a different cgroup,
the caller specifies CLONE_INTO_CGROUP in cl_args.flags and
passes a file descriptor that refers to a version 2 cgroup
in the cl_args.cgroup field.

Obviously this is all in C, but fortunately a request to support using clone3 to set a spawned process’s cgroup was filed as Go issue #51246: syscall: add PidFD, CgroupFD, and UseCgroupFD options for Linux clone to SysProcAttr. syscall.SysProcAttr is a platform-specific structure that describes options for newly spawned processes, including those started with exec.Cmd.

The changes requested by the issue were implemented in late 2022, resulting in two new fields on SysProcAttr: CgroupFD and UseCgroupFD. Setting UseCgroupFD adds the CLONE_INTO_CGROUP flag to the clone_args struct. CgroupFD maps directly to the cl_args.cgroup field.

In the clone3 man page we also find confirmation that this API is exactly designed to solve the problem of processes not being spawned directly into the appropriate cgroup 🌟:

•  Spawning a process into a cgroup different from the
parent's cgroup makes it possible for a service manager
to directly spawn new services into dedicated cgroups.
This eliminates the accounting jitter that would be
caused if the child process was first created in the
same cgroup as the parent and then moved into the target
cgroup.  Furthermore, spawning the child process
directly into a target cgroup is significantly cheaper
than moving the child process into the target cgroup
after it has been created.

It’s also nice that launching the process directly into the cgroup is more efficient. The cgroup documentation also notes that moving processes between cgroups is expensive.

How do we launch a process in another cgroup in Go?

Enough with the history lesson. Let’s get back to the original quest, and put all this together in a Go program:

  1. We need to create the cgroup directory.
  2. We need to obtain an FD to that cgroup directory.
  3. We need to launch a process using the above UseCgroupFD and CgroupFD fields on SysProcAttr.

Prerequisites: running this on a Mac

Wait! One more interruption!

I use a Mac, and of course we can’t use cgroups on a Mac. They’re a Linux thing. We need to use a virtual machine running Linux. I installed lima and launched the default machine. After launching, we need to install Go:

$ wget https://go.dev/dl/go1.25.4.linux-arm64.tar.gz
$ sudo rm -rf /usr/local/go
$ sudo tar -C /usr/local -xzf go1.25.4.linux-arm64.tar.gz
$ export PATH=$PATH:/usr/local/go/bin

$ go version
go version go1.25.4 linux/arm64

Way back at the start I promised to show how we find the mount point of the cgroups directory hierarchy:

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

Probably this path is unvarying across the wideness of the universe, used on every Linux system, but it’s worth a check, eh?

Back to Go

Now we can write our Go program that will:

  1. Create a cgroup, foo.slice, using os.Mkdir. Creating new cgroups is as easy as creating directories in the filesystem. The one gotcha is that you need the right permissions; we will be lazy and run our Go program as root.
  2. Get the FD for that cgroup using os.Open and File.Fd.
  3. Create an exec.Cmd and then add SysProcAttr to it with the cgroup details.
  4. Start the program and print its Pid so we can check whether it matches what we see in the cgroup directory tree.
  5. Finally we’ll Wait on the process.

There isn’t much code, so let’s just read all of it:

package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// p is the path to our cgroup. Kubelet uses the .slice
	// suffix, so let's use that too.
	p := "/sys/fs/cgroup/foo.slice"

	// 1. Create the cgroup
	err := os.Mkdir(p, 0755)
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(p)

	// 2. Get the FD of the cgroup
	cgroup, err := os.Open(p)
	if err != nil {
		log.Fatal(err)
	}
	defer cgroup.Close()
	cgroupFD := cgroup.Fd()

	// 3. Create a process that will start in the cgroup
	cmd := exec.Command("sleep", "30")
	if cmd.SysProcAttr == nil { // usually nil, but check
		cmd.SysProcAttr = &syscall.SysProcAttr{}
	}
	cmd.SysProcAttr.UseCgroupFD = true
	cmd.SysProcAttr.CgroupFD = int(cgroupFD)

	// 4. Start the program and print its Pid
	//    Check for Pid in /sys/fs/cgroup/foo.slice/cgroup.procs
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	log.Printf("Started process pid: %d", cmd.Process.Pid)

	// 5. Wait for process to exit
	if err := cmd.Wait(); err != nil {
		log.Printf("process exited with error: %v", err)
	}
}

Running the code

In one terminal run the Go code:

$ sudo go run main.go
2025/11/06 20:42:47 Started process pid: 180105
[... sleep sleep sleep ...]

While it’s sleeping we can verify the Pid is in the right cgroup:

$ sudo cat /sys/fs/cgroup/foo.slice/cgroup.procs
180105

And there we go!
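The same check can be made from the process’s side by reading /proc/<pid>/cgroup, where the cgroup v2 membership shows up as a line of the form 0::/foo.slice. Below is a small sketch of a helper that pulls that path out; the name v2CgroupPath is my own, and the program takes the Pid as an optional first argument, defaulting to itself.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// v2CgroupPath extracts the cgroup v2 path from the contents of a
// /proc/<pid>/cgroup file. The v2 entry is the line with hierarchy ID 0
// and an empty controller list, for example "0::/foo.slice".
func v2CgroupPath(contents string) (string, bool) {
	for _, line := range strings.Split(contents, "\n") {
		if strings.HasPrefix(line, "0::") {
			return strings.TrimPrefix(line, "0::"), true
		}
	}
	return "", false
}

func main() {
	// Optional first argument: the Pid to inspect. Defaults to ourselves.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}

	data, err := os.ReadFile("/proc/" + pid + "/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	if path, ok := v2CgroupPath(string(data)); ok {
		fmt.Println(path)
	} else {
		fmt.Println("no cgroup v2 entry found")
	}
}
```

Run against the Pid printed by the main program while sleep is still alive, I would expect this to print /foo.slice.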

We can see this pattern in practice in projects like GitLab’s Gitaly.

Aside: why a file descriptor?

It was interesting to me that the clone3 API takes a file descriptor rather than a path. I don’t read syscall interfaces often, but the openat man page provides the rationale for using FDs:

First, `openat()` allows an application to avoid race conditions
that could occur when using `open()` to open files in directories
other than the current working directory.  These race conditions
result from the fact that some component of the directory prefix
given to `open()` could be changed in parallel with the call to
`open()`.  Suppose, for example, that we wish to create the file
`dir1/dir2/xxx.dep` if the file `dir1/dir2/xxx` exists.  The problem
is that between the existence check and the file-creation step,
`dir1` or `dir2` (which might be symbolic links) could be modified to
point to a different location.

openat opens a file relative to a directory FD, rather than resolving a full path from scratch. The man page also notes that the reasoning holds for a whole swath of other APIs, including clone.

This is a similar race condition to the one we had way back at the top of the post, where we had the gap between starting the process and putting it into the cgroup.

There’s obviously a threat vector with cgroups: an attacker could switcheroo the cgroup directory between its creation and the moment the process is added. The attacker can then either remove the resource limits on the process, or manipulate the process itself. Not what you want, if you’re kubelet or systemd. Squinting, I can also see advantages in using syscalls like openat to open the resource control files in the cgroup, relative to the FD you get from open. That similarly avoids writing limits into the wrong cgroup.

But I wonder about this in practice for cgroups: an attacker would have to be pretty far into the system to be able to manipulate the cgroup file system in this way. I’m not enough of an expert here to pass judgement either way.
