Recently at work, one of our workloads on Kubernetes was erroring out a lot, and Grafana showed that the pods were being CPU-throttled 50-60% of the time.
That was weird, because the pods were nowhere close to hitting their limits. I decided to dig and ended up quite deep in a rabbit hole. Here’s what I found.
NOTE: The text below is a mix of copy-pasting from a few websites, some paraphrasing, and custom text. A lot of this information here is not originally mine. I present it here for (hopefully) better understanding.
CPU requests and CPU limits (in K8S and Linux) are implemented using two separate control systems.
In K8S land, CPU requests serve two purposes. One, they let the scheduler decide which node is suitable for a pending pod. Two, once the pod is scheduled, the requests are translated into docker-level flags known as CPU shares. This is the CPU shares system.
By default, when no requests are set, all containers get the same proportion of CPU cycles. When you set a “Request”, however, this number is translated into a proportion. This proportion is applied as a CPU share ‘weight’ to the container, which is relative to the weights of all other running containers.
Let me explain. In Linux, CPU shares work by dividing each CPU core into 1024 slices and guaranteeing that each process will receive its proportional share of those slices. With 1024 slices, if each of two containers sets cpu.shares to 512, then they will each get about half of the available CPU time.
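To make the request-to-shares translation concrete, here is a minimal sketch of the arithmetic. It assumes the conversion is a simple rescale from Kubernetes’ thousandths of a core to the 1024-per-core scale used by cpu.shares, with a small floor so a container never ends up with a zero weight. The function name and the floor of 2 are my assumptions for illustration, not something taken from the text above.

```python
SHARES_PER_CPU = 1024      # cpu.shares weight that represents one full core
MILLI_CPU_PER_CPU = 1000   # Kubernetes "m" units per core
MIN_SHARES = 2             # assumed floor, so the weight is never zero


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Rescale a CPU request in millicores to a cpu.shares weight (hypothetical helper)."""
    if milli_cpu <= 0:
        return MIN_SHARES
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    return max(shares, MIN_SHARES)


for request in (50, 250, 500, 1000):
    print(f"request {request}m -> cpu.shares = {milli_cpu_to_shares(request)}")
```

So a 50m request ends up as a weight of roughly 51, which is tiny next to a container that asked for a full core (1024).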
However, the proportion is enforced only when CPU-intensive processes are running. When tasks in one container are idle, other containers can use the left-over CPU time. Therefore, the actual amount of CPU time of a container will vary depending on the number of containers running on the system.
For example, consider three containers: container1 has 256 shares and the other two have 512 each. The system of 1024 CPU shares is busy, i.e. processes in all three containers attempt to use 100% of the CPU. The proportion for container1 is 1024/5 = 204.8, and 409.6 for each of container2 and container3. Where did the number ‘5’ come from? Shares are relative weights, so reduce them to their simplest ratio. Let’s call 256 ‘X’; 512 is then 2X. We have a total of 1X + 2X + 2X = 5X weights competing for 1024 slices, so 1X is worth 1024/5.
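The same arithmetic can be sketched in a few lines. This is only an illustration of the weighting under full contention; the function name is made up, and the container names and share values are the ones from the example above.

```python
def cpu_fractions(shares: dict) -> dict:
    """Each container's fraction of CPU time when every container is busy."""
    total = sum(shares.values())
    return {name: weight / total for name, weight in shares.items()}


weights = {"container1": 256, "container2": 512, "container3": 512}

for name, fraction in cpu_fractions(weights).items():
    # Express the fraction both as a percentage and on the 1024-slice scale.
    print(f"{name}: {fraction:.0%} of CPU time, {fraction * 1024:.1f} of 1024 slices")
```

Note that the result depends only on the ratio between the weights: 1, 2 and 2 would give exactly the same split as 256, 512 and 512.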
On a multicore CPU, the shares of CPU time are distributed over all CPU cores. On a busy 2-core system, a container with 256 shares is allowed to spend its proportion of CPU time on any of the cores. So it can either use 256 shares on one core or 128 each on both cores.
So far so good. However, the CPU shares system cannot enforce upper bounds. This is because the (CFS) scheduler releases unused CPU to the global pool so that it can allocate it to the cgroups that are demanding more CPU power. If one process doesn’t use its share, the other is free to. And this is where limits come into the picture.
K8S uses Linux’s CFS (Completely Fair Scheduler – Linux’s default cpu scheduler) quota mechanism to implement limits. This is done using a system called “cpu bandwidth control”. https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt.
The bandwidth control system defines a time period ‘cfs_period_us’ (“us” == microseconds), which is usually set to 1/10 of a second, or 100,000 microseconds, and a quota ‘cfs_quota_us’, which represents the maximum number of microseconds of CPU time that a cgroup’s processes are allowed to run within that period. The quota is reset when the ‘cfs_period_us’ period expires.
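The kernel document linked above also describes a cpu.stat file that counts how many periods have elapsed (nr_periods) and in how many of them the group was throttled (nr_throttled). A rough sketch of turning that into a “throttled X% of the time” number is below. The sample text is made up for illustration, and where the file actually lives depends on the node’s cgroup setup, so the sketch just takes the file contents as a string.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # expect "<name> <integer>" pairs
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Made-up numbers, only to show the shape of the file.
SAMPLE = """nr_periods 1000
nr_throttled 550
throttled_time 21000000000
"""

print(f"throttled in {throttle_ratio(SAMPLE):.0%} of periods")
```

A ratio like this says nothing about average CPU usage, which is why a pod can look far from its limit and still show up as heavily throttled.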
Kubernetes divides cpu cores into 1000 shares (unlike 1024 at the Linux cpu control group and docker level). In K8S, we usually define our requests/limits with the ‘m’ notation, where the unit suffix ‘m’ stands for “thousandth of a core”.
Assuming you’ve configured a pod’s limit to 200m, that is 200/1000th of a core. With cpu.cfs_period_us set to 100000, for every 100,000 microseconds of wall-clock time, your pod can get 200/1000 * 100,000 microseconds = 20,000 microseconds of unthrottled CPU time. So your limit translates to setting cpu.cfs_quota_us=20000 on the container’s CPU cgroup.
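Here is the limit-to-quota arithmetic as a small sketch, assuming the default 100,000 microsecond period described above. The helper name is hypothetical, and any rounding or minimum value that the real implementation applies is left out.

```python
CFS_PERIOD_US = 100_000    # default period from the text: 100ms
MILLI_CPU_PER_CPU = 1000   # Kubernetes "m" units per core


def milli_cpu_to_quota_us(limit_milli_cpu: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (in microseconds) a limit allows per period (hypothetical helper)."""
    return limit_milli_cpu * period_us // MILLI_CPU_PER_CPU


for limit in (200, 400, 700, 1000):
    quota = milli_cpu_to_quota_us(limit)
    print(f"limit {limit}m -> cpu.cfs_quota_us = {quota} "
          f"({quota / 1000:.0f}ms of every {CFS_PERIOD_US / 1000:.0f}ms)")
```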
To re-iterate, this is 20ms of CPU time slice available to the process out of each 100ms time period, during which the process gets unfettered access to all the CPU cores on the system.
Additionally, when you specify CPU limits on a container, you’re actually limiting its CPU time across all of the node’s CPUs, not restricting the container to a specific CPU or set of CPUs. This means that your container sees (and can use) all of the node’s CPUs; it’s just the time that’s limited. For example, specifying a CPU limit of 4 on a node with 8 CPUs means the container will use the equivalent of 4 CPUs, but spread across all CPUs. In the above example with a 20ms quota, on a two-core system, the process can use 10ms each on both cores or 20ms on one core.
In the case of our problematic pod, the requests/limits were set to a CPU request of 50m and a CPU limit of 400m.
This means that the pod is requesting a node with at least 50m of CPU shares available. If there were no limit, the pod would be free to consume as much CPU as is available when required. On a busy system, the pod gets at least its 50m weighted share of the CPU cores.
Coming to the limits, the pod gets 40ms of unthrottled CPU time in every 100ms period. The pod’s process(es) can use more CPU than the 50m request guarantees, but only within those 40ms. If the pod needs more CPU time than that, it has no option but to wait for 60ms so that the 100ms cfs_period_us period expires and it gets a fresh quota of another 40ms.
Looking at the metrics, our pods were not exceeding or even coming close to the 400m limit when you look at average usage. The CPU usage of the pods was hovering around 200-300m. However, within individual 100ms periods the pods needed more CPU time than their 50m weight guaranteed or the 40ms quota allowed. So they were being throttled for the remaining 60ms.
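To see how a pod can average 250m and still be throttled half the time, here is a toy model. It is a deliberate simplification, not how the kernel accounts for time: work arrives at the start of each 100ms period, at most one quota’s worth runs per period, and whatever is left over waits for the next period. The burst pattern (100ms of CPU work every fourth period) is invented for illustration and assumes the burst could otherwise run flat out.

```python
PERIOD_MS = 100


def simulate(demand_ms, quota_ms):
    """Return (throttled_periods, average_usage_in_millicores) for a toy workload."""
    backlog = 0.0       # CPU work (ms) still waiting to run
    throttled = 0       # periods that ended with work left over
    used_total = 0.0    # CPU time (ms) actually consumed
    for demand in demand_ms:
        backlog += demand
        used = min(backlog, quota_ms)   # the quota caps what runs this period
        backlog -= used
        used_total += used
        if backlog > 0:
            throttled += 1
    average_milli = used_total / (len(demand_ms) * PERIOD_MS) * 1000
    return throttled, average_milli


bursty = [100, 0, 0, 0] * 5   # invented pattern: one 100ms burst every 400ms

for quota in (40, 100):
    throttled, average = simulate(bursty, quota)
    print(f"quota {quota}ms: throttled in {throttled}/{len(bursty)} periods, "
          f"average usage {average:.0f}m")
```

In this model the 40ms quota throttles the workload in half of the periods even though average usage sits at 250m, while a 100ms quota lets each burst finish within its own period.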
Plus, all the throttling was flip-flopping the health of the pod, which also seemed to increase the CPU load/requirement of the pods. Once the limit was raised to 1000m, which is 100ms out of every 100ms and equal to one CPU core, the pod’s throttling stopped, and the CPU usage also dropped comparatively. The request was also bumped to 250m to give the pod a higher baseline weighted share of the CPU. The pod can use its 100ms quota to consume one CPU core’s worth of CPU time on the system. It can take 100% of one CPU core, or 25% of each core on a 4-core system, however it sees fit.
If you want a (poor) analogy, let’s talk about cars on a highway. CPU ‘m’ notation is ‘kph’, ‘cfs_period_us’ is set to 100 seconds, each container is a car, and the system is our highway.
– In a system without requests/limits on containers, cars are free to travel at whatever speed they want to as long as there’s no traffic and within the limits of the highway design (system capacity). If there’s traffic, things get messy because every car tries its best to use up the highway capacity for its own.
– In a system with requests on containers, cars are guaranteed to travel at a speed they initially choose. Let’s assume that is 50kph (or 50m in K8S). They are allowed to burst to whatever speed they require as long as there’s no traffic and within the limits of the highway design (system capacity). Even in the heaviest of traffic, this car will still be allowed to travel at 50kph, guaranteed.
– In a system with requests and limits on containers, assume the 50m request and 400m limit from above. The limit translates to 400/1000 * 100 seconds = 40 seconds. This car is guaranteed to travel at a minimum of 50kph and can travel at whatever speed it requires as long as there’s no traffic and it stays within the limits of the highway design (system capacity). BUT, the limit restricts the car to running for only 40 seconds of each 100-second time period. So the car can do 800kph (if there’s capacity) but has to stop abruptly after it has exhausted its 40s of allotted time. It restarts at the end of the 100s time period, then stops again after 40s.
Then, let’s say this particular car model also needs to travel at a baseline of 120kph to perform optimally, and we only guarantee 50kph. Then its performance is going to suffer as well.
I told you it wasn’t a great analogy.
Other interesting bits:
Things get a little more complex if you think about a multi-threaded process. In K8S land, the CPU time of every thread is unfortunately counted against the same CPU quota (https://github.com/kubernetes/kubernetes/issues/67577#issuecomment-417321541). So if our pod’s process has two busy threads running in parallel, the quota drains twice as fast: the process can only sustain 40ms/2 = 20ms of unthrottled CPU in each period, and faces 80ms of throttling.
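The effect of thread count can be sketched as below. It assumes every thread is fully busy and running on its own core for the whole time, which is the worst case; the function name is made up for illustration.

```python
def wall_time_before_throttle_ms(quota_ms, busy_threads, period_ms=100.0):
    """Return (running_ms, throttled_ms) per period when all threads burn CPU in parallel."""
    running = min(quota_ms / busy_threads, period_ms)  # quota is shared by all threads
    return running, period_ms - running


for threads in (1, 2, 4):
    running, throttled = wall_time_before_throttle_ms(40, threads)
    print(f"{threads} busy thread(s): runs {running:.0f}ms, throttled {throttled:.0f}ms")
```

With four busy threads, the same 40ms quota is gone after 10ms of wall-clock time, leaving the process stalled for the other 90ms of the period.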
Finally, Guaranteed pods in K8S are pods where limits == requests in the pod spec. However, contrary to what the term implies, Guaranteed pods are also throttled based on the limits. For example, a pod with its CPU request and limit both set to 700m is requesting a node with 700m of free CPU shares. That is a guarantee of 700m weight (not CPU) on the total of all CPU cores on the node. K8S doesn’t oversubscribe nodes, but if there are pods without requests, or pods that are keeping the system busy, then the 700m could translate to less than ideal CPU time. However, without limits, if the pod needs more CPU and the system is idle, it’ll get more CPU.
Now, coming to the limits, 700m translates to 70ms (700/1000 * 100,000 microseconds). Thus, the pod is allowed unfettered CPU access for 70ms of every 100ms period. That’s only 70% of a CPU core for the pod. Can you see how that can be a problem? The pod’s request gives it a 700m weighted share of the CPU on the system, and full freedom to burst away to glory, but the limit caps it at 70ms of runtime per period, which equates to just 70% of one CPU core. See the “multi-CPU” example: https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt. It’s a better idea to either get rid of the limit or set a limit according to the usage of the pod.