CPU stalls with cgroups and Kubernetes

This is something I keep trying and failing to explain, so I am writing it down to hopefully create something that makes sense. Then I can refer people to it, rather than rambling away at them for a time and leaving them slightly bewildered.

A CPU stall happens when a process uses up all of its allocated CPU time and is put to sleep by the kernel, often for tens of milliseconds.

Sometimes this isn’t noticeable; other times it is catastrophic. It depends on how latency critical the workload is. A few tens of milliseconds on something that takes a few minutes, like a background batch job, doesn’t matter much, but if it’s a database operation on a critical user-facing request path, it may well be an issue. It’s one reason monitoring tail latencies is important: this kind of thing doesn’t show up in averages, only in the tails.

Anyway, now we know what the problem is: what causes it, and how can we prevent it?

Memory limits

To help us understand what’s going on, it’s easier to look at memory first. Let’s give ourselves three processes and apply limits to them. We’ll look at one machine and imagine that three servers from our app tier run there. Each process is the same and gets 1 GB of memory. On an 8 GB system this can be satisfied, and it would look like this:

+-----+-----+-----+-------------+
| 1GB | 1GB | 1GB | kernel etc. |
+-----+-----+-----+-------------+

<-----------8GB----------------->

The critical thing here is that each process really can be given its own slice of memory. Each bit in memory is, to some approximation anyway, a physical cell in a RAM chip, and that cell can be handed to the process by the kernel. No other process needs access to it. Clearly the kernel, the CPU or whatever might be shifting things around, but each bit of memory can be conceptually owned by a given process.

For memory, then, the resource is shared in terms of space. Presuming there’s enough memory on the machine, each process gets to keep its own bit of it.
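In cgroup v2 terms, that slice is just a limit written into each service’s cgroup. Here’s a minimal sketch of the idea in Python; the cgroup names are hypothetical, and it assumes the cgroups already exist and we have permission to write to them:

  # Sketch: give each (hypothetical) app-tier cgroup a 1 GB memory limit by
  # writing to its cgroup v2 memory.max file. Assumes the cgroups already
  # exist under /sys/fs/cgroup and we are running with enough privilege.
  from pathlib import Path

  GIB = 1024 ** 3
  CGROUP_ROOT = Path("/sys/fs/cgroup")

  for name in ("server1", "server2", "server3"):  # hypothetical cgroup names
      (CGROUP_ROOT / name / "memory.max").write_text(str(GIB))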

Let’s look at how this differs when we are talking about CPU time.

CPU time

The main difference from memory is that a CPU core has to be shared in time rather than in space. We can’t divide the CPU up between processes at the same instant, like we can with memory. At any moment, a process is either running on the CPU core or it isn’t.

So, let’s take our three server processes, and give them each a limit of 30% of the CPU core. What happens?

What doesn’t happen, because it can’t, is that each process runs all the time, but at 30% of the speed. This was a misconception I held in my head for quite a long time, because it was just easier to think of the CPU as divisible in that manner.

But, of course, we just said that we can’t do that. A process is either running or it isn’t. So how it has to work is that instead each process is running at full speed, but only part of the time.

The kernel does this by pre-empting a process once it has had its full portion of CPU time. It uses a configurable time window, the scheduling period, to enforce the limit; let’s say we’ve configured this to 100ms, which is the usual default. Within each window, because each server process gets to run for 30% of the time, it is allowed 30ms of CPU time. Once that’s used up, it has to wait for the next 100ms window to start.

+---------+---------+---------+---+---------+---------+-------------+
| server1 | server2 | server3 |...| server1 | server2 | ...         |
+---------+---------+---------+---+---------+---------+-------------+
   30ms      30ms      30ms    10ms   30ms      30ms

 100ms---------------------------->100ms------------------>

So, server1 gets 30ms. Then it’s put to sleep to let the other servers run. server2 gets 30ms, followed by server3 getting 30ms. Finally, there’s a 10ms timeframe where other things can run. As we’ve hit 100ms, server1 gets to run again. The kernel might not choose to run server1 at the start of the second 100ms block, but it certainly won’t run it after it has exhausted its time in the first block, even if there are no other waiting processes.
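On a cgroup v2 system, that 30ms-per-100ms allowance is exactly what gets written into each cgroup’s cpu.max file: a quota and a period, both in microseconds. A minimal sketch, reusing the same hypothetical cgroup names as before:

  # Sketch: limit each (hypothetical) app-tier cgroup to 30ms of CPU time per
  # 100ms period by writing "<quota> <period>" (in microseconds) to the
  # cgroup v2 cpu.max file.
  from pathlib import Path

  CGROUP_ROOT = Path("/sys/fs/cgroup")
  QUOTA_US = 30_000    # 30ms of CPU time allowed...
  PERIOD_US = 100_000  # ...per 100ms scheduling window

  for name in ("server1", "server2", "server3"):
      (CGROUP_ROOT / name / "cpu.max").write_text(f"{QUOTA_US} {PERIOD_US}")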

Any requests in flight to server1 when it exhausts its 30ms on the CPU will see up to a 70ms delay added to their response while server1 sleeps. If a typical request takes 2ms to respond to, that delay will be very noticeable!
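You don’t have to guess whether this is happening: the kernel keeps count. On cgroup v2, each cgroup’s cpu.stat file records how many periods the process was throttled in and for how long. A rough sketch of reading it, again with a hypothetical cgroup name:

  # Sketch: read the cgroup v2 cpu.stat file for a (hypothetical) cgroup and
  # report how often, and for how long, the kernel has throttled it.
  from pathlib import Path

  def throttle_stats(cgroup: str) -> dict[str, int]:
      text = (Path("/sys/fs/cgroup") / cgroup / "cpu.stat").read_text()
      return {k: int(v) for k, v in (line.split() for line in text.splitlines())}

  stats = throttle_stats("server1")
  print(f"throttled in {stats['nr_throttled']} of {stats['nr_periods']} periods,"
        f" {stats['throttled_usec'] / 1000:.0f}ms in total")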

In this scenario, we’d see more consistent performance by running one server and giving it the whole of the CPU core. We’d avoid the problem of a tail of requests being delayed by processes being put to sleep.

So we change to using one server, and give it all the CPU:

+---------------------------------+-----------------------
| server1                         | server1            ...
+---------------------------------+-----------------------

100ms---------------------------->100ms------------------>

Hooray, no sleep! Problem solved?

Multiple cores and threads

Sadly we’ve probably not solved the stalls in practice, because of two inconvenient facts:

  1. Our machine will have more than 1 core.
  2. Our process will have more than 1 thread.

The problem is that the kernel can schedule each thread onto a different core, but the CPU time budget belongs to the process as a whole, not to each thread. So if two threads are ready to run at the same time, the kernel may schedule them on two different cores. The process still gets its full 100ms of CPU time, but it burns through it in only 50ms of wall clock time. For the rest of the window, the process sleeps.

Our stalls have returned.

          +-----------------+---------------+-----------------+------
  Core 1  | server1         | other proc    | server1         | other
          +-----------------+---------------+-----------------+------

          +-----------------+---------------+-----------------+------
  Core 2  | server1         | other proc    | server1         | other
          +-----------------+---------------+-----------------+------

           50ms------------>    ...zzz...    50ms------------>

The pattern will continue as we add threads. If there are enough cores, the kernel may wake them all at the start of the window. Three threads and three cores will give us about 33ms per thread, and then 67ms of sleep. Ten threads and ten cores will give us 10ms per thread and a massive 90ms of sleep!
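The arithmetic here is simple enough to write down. A tiny sketch of the worst case, assuming every thread wakes at the start of the window and each has a core to itself:

  # Sketch: worst-case run/sleep split per scheduling window, assuming all
  # threads wake at the start of the window and each gets its own core.
  def window_split(quota_ms: float, period_ms: float, threads: int) -> tuple[float, float]:
      run_ms = quota_ms / threads          # quota is burned `threads` times faster
      return run_ms, period_ms - run_ms    # (wall-clock running, wall-clock sleeping)

  for threads in (1, 2, 3, 10):
      run, sleep = window_split(quota_ms=100, period_ms=100, threads=threads)
      print(f"{threads} threads: run {run:.0f}ms, sleep {sleep:.0f}ms")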

CPU Pinning

The behaviour we see above comes about when using Linux’s Completely Fair Scheduler (CFS). This process scheduler has proven to be a generally good one, but it does have quirks, and as we’ve seen it can cause user-visible stalls for some workloads. CFS is the default scheduler on Linux, and its bandwidth control (the quota and period we saw above) is what Kubernetes uses to implement CPU limits, while CPU requests map to scheduler weights.

To prevent stalling, we need to do one of two things: either reduce the process to a single thread, or restrict it to a single core. If we then give that process 100% of a core, it will be eligible to run all the time. Similarly, for a process with two threads, restricting it to two cores and giving it 200% will ensure both threads are always eligible to run. This should fully alleviate the problem.

The kernel provides a mechanism called cpusets to manage this: processes can be allocated, or restricted, to specific CPU cores. In the kernel implementation this is flexible; many processes can be assigned the same cpuset. For example, critical system processes might be assigned one fairly generous cpuset, while the other processes on the system are left to fight between themselves for the remaining cores.
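As a rough sketch of the raw mechanism (assuming a cgroup v2 hierarchy with the cpuset controller enabled, and the same hypothetical server1 cgroup as before), pinning comes down to writing a list of cores into cpuset.cpus:

  # Sketch: restrict a (hypothetical) cgroup to cores 2 and 3 by writing to
  # its cgroup v2 cpuset.cpus file. Assumes the cpuset controller has been
  # enabled for this part of the hierarchy via cgroup.subtree_control.
  from pathlib import Path

  server_cgroup = Path("/sys/fs/cgroup/server1")
  (server_cgroup / "cpuset.cpus").write_text("2-3")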

Kubernetes uses the cpuset functionality a bit differently, allocating CPU cores for the exclusive use of a single pod. This isn’t on by default: you have to configure the kubelet’s CPU manager with the static policy. With that in place, containers meeting certain criteria around their requests and limits (the pod must be in the Guaranteed QoS class, and the container must request a whole number of CPUs) are assigned specific cores for their exclusive use. Other pods cannot use those cores.
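Just to make that rule concrete, here is a small sketch of it as I understand it; the helper and its parameters are hypothetical, not anything from the Kubernetes API:

  # Sketch of the rule (hypothetical helper): a container gets exclusive cores
  # under the static CPU manager policy when its pod is Guaranteed QoS and it
  # asks for a whole number of CPUs.
  def gets_exclusive_cores(cpu_request: float, cpu_limit: float, guaranteed_qos: bool) -> bool:
      return guaranteed_qos and cpu_request == cpu_limit and cpu_request.is_integer()

  print(gets_exclusive_cores(2.0, 2.0, True))   # True: two whole, exclusive cores
  print(gets_exclusive_cores(0.5, 0.5, True))   # False: fractional CPU, shared pool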

Summary

We’ve seen how the Linux kernel splits time between processes and how their limits, for example those set as Kubernetes resource limits, are applied. We’ve seen that memory and CPU are shared differently between processes because of how RAM works and how CPUs work. And finally we saw how threads can sneakily undo our best efforts to keep our process from being forcibly put to sleep by the kernel.

Overall, for time-critical workloads, CPU pinning is a better way to separate out processes, whether in Kubernetes or on a more traditional server. It gives a process full use of a processor, making performance more predictable and consistent, provided, of course, the application itself isn’t doing something silly!

This is a tradeoff that needs to be considered for most workloads in a Kubernetes environment. Many background applications can happily run under a standard CPU limit enforced by CFS. User-facing applications, however, need to watch out for the effect of stalls on the user experience, and might want to consider being pinned to specific cores if tail latencies show a need for it.
