CPU Partitioning

CPUs can be partitioned to separate tasks and interrupts with different requirements from each other. In a real time system, CPU partitioning can be used to dedicate a set of CPUs to real time tasks and their corresponding interrupts.

The base technology for CPU partitioning is CPU affinity. On top of this mechanism further Linux kernel facilities for CPU partitioning are implemented. User space tooling is available as well.

This article gives a short overview of the facilities and tools. Follow the links for detailed information.

Affinity

The processing of tasks or interrupts can be restricted to a specified set of CPUs by setting the affinity. The task CPU affinity affects the scheduler and makes sure that the specific task is executed only on the CPUs which are in the task's affinity set. The IRQ affinity specifies to which CPUs an interrupt is allowed to be routed.

CPU affinity of tasks

In an SMP system, the property that binds processes or tasks to one or more processors through the OS scheduler is known as CPU affinity. The capability to override how the scheduler assigns processes or tasks to a particular set of processors is a feature available in several operating systems. The idea is to say “always run this process/task on processor one” or “run these processes/tasks on all processors but processor zero”. The scheduler then places the processes/tasks only on the CPUs which are contained in the affinity set.

Task affinity management is available via the following mechanisms (a minimal example follows the list):

* the taskset command-line utility
* the sched_setaffinity() system call and the pthread_setaffinity_np() library call
* the cpuset controller of cgroups
* the isolcpus kernel command-line parameter
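
A minimal sketch of the sched_setaffinity() mechanism, assuming for illustration that the calling process should be restricted to CPUs 2 and 3:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      cpu_set_t set;

      CPU_ZERO(&set);
      CPU_SET(2, &set);    /* illustrative choice: allow only CPUs 2 and 3 */
      CPU_SET(3, &set);

      /* pid 0 means "the calling thread" */
      if (sched_setaffinity(0, sizeof(set), &set) == -1) {
          perror("sched_setaffinity");
          exit(EXIT_FAILURE);
      }

      printf("now restricted to CPUs 2 and 3\n");
      return 0;
  }

The same restriction can be applied from the shell with taskset -c 2,3 <command>.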

The CPU affinity of per-CPU kernel threads like ksoftirqd/n and kworker/n (where n is the core number) is not settable. Other threads, like kswapd, exist per NUMA node and can only be pinned within the cores of their node.

CPU affinity and kworkers

Kworker threads and the workqueue tasks which they perform are a special case. While it is possible to rely on taskset and sched_setaffinity() to manage kworkers, doing so is of little utility since the threads are often short-lived and, at any rate, often perform a wide variety of work. The paradigm with workqueues is instead to associate an affinity setting with the workqueue itself. “Unbound” is the name for workqueues which are not per-CPU. These workqueues consume a lot of CPU time on many systems and tend to present the greatest management challenge for latency control. Those unbound workqueues which appear in /sys/devices/virtual/workqueue are configurable from userspace. The parameters affinity_scope, affinity_strict and cpumask together determine on which cores the kworker which executes the work function will run. Many unbound workqueues are not configurable via sysfs; making their properties visible there requires adding the WQ_SYSFS flag where the workqueue is created in the kernel source.
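
As a sketch of the sysfs interface described above, the following confines the kworkers serving one unbound workqueue to CPUs 0 and 1. The “writeback” workqueue and the mask value are illustrative assumptions, and root privileges are required:

  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch under assumptions: confine the kworkers serving the "writeback"
   * unbound workqueue to CPUs 0 and 1 by writing a hexadecimal mask to its
   * sysfs cpumask attribute. The workqueue name and mask are illustrative;
   * root privileges are required. */
  int main(void)
  {
      const char *path = "/sys/devices/virtual/workqueue/writeback/cpumask";
      FILE *f = fopen(path, "w");

      if (!f) {
          perror(path);
          return EXIT_FAILURE;
      }
      fprintf(f, "3\n");    /* bitmask 0x3 == CPUs 0 and 1 */
      fclose(f);
      return 0;
  }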

Since kernel 6.5, the tools/workqueue/wq_monitor.py Python script is available in-tree, and since 6.6, wq_dump.py has joined it. These Python scripts require the drgn debugger, which is packaged by major Linux distributions. Another recent addition of particular interest for the realtime project is wqlat.py, which is part of the bcc/tools suite (see https://github.com/iovisor/bcc/blob/master/tools/wqlat.py). Both sets of tools may require special kernel configuration settings.

IRQ affinity

Hardware interrupts can interrupt kernel and user space computations at any time, except when the kernel disables interrupt processing to protect resources. When a hardware interrupt is handled, the CPU switches into a separate context, executes the handler code, and then switches back to the interrupted context and resumes execution.

Depending on the interrupt hardware, interrupts can be routed to any CPU or delivery can be rotated between CPUs. Most interrupt controllers allow restricting the set of CPUs to which a particular interrupt can be delivered by setting the IRQ affinity.

When a CPU receives an interrupt, a switch into interrupt context is performed and the current task has to wait until the IRQ has been handled. Restricting which CPUs are allowed to handle a given IRQ is called IRQ affinity; it affects the hardware routing of the interrupt to the CPUs.

IRQ affinity management is available via the following mechanisms (a minimal example follows the list):

* the smp_affinity and smp_affinity_list files under /proc/irq/<n>/
* the irqbalance daemon
* the irqaffinity= kernel command-line parameter
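
A minimal sketch of the /proc/irq interface, assuming a hypothetical IRQ number 25 and CPU 0 as the target. Root privileges are required, and some interrupts (per-CPU timers, for example) reject affinity changes:

  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch under assumptions: route IRQ 25 (a hypothetical device interrupt)
   * to CPU 0 only, by writing to its smp_affinity_list file. Root privileges
   * are required, and some interrupts reject affinity changes. */
  int main(void)
  {
      const char *path = "/proc/irq/25/smp_affinity_list";
      FILE *f = fopen(path, "w");

      if (!f) {
          perror(path);
          return EXIT_FAILURE;
      }
      fprintf(f, "0\n");    /* CPU 0 as illustrative target */
      fclose(f);
      return 0;
  }

The shell equivalent is echo 0 > /proc/irq/25/smp_affinity_list.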

Housekeeping cores

A common paradigm with realtime systems is to pin latency-insensitive kernel and userspace tasks on a designated “housekeeping” core. For example, taskset can pin kernel threads like kswapd and kauditd there. Applications whose network traffic is not latency-critical may wish to pin their network IRQs there as well. Userspace processes which are sometimes CPU-intensive, like systemd and rsyslog, may also be pinned on the housekeeping core. Pinning userspace threads will not have the desired effect if much of their work is performed by unbound workqueues, whose kworkers may migrate to any core.
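
A minimal taskset-like sketch of this paradigm, assuming CPU 0 is the housekeeping core and taking the PID of the task to be pinned on the command line:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch under assumptions: pin the task given by PID on the command line
   * to CPU 0, the housekeeping core in this example. Per-CPU kthreads such
   * as ksoftirqd/n will reject the change. */
  int main(int argc, char *argv[])
  {
      cpu_set_t set;
      pid_t pid;

      if (argc != 2) {
          fprintf(stderr, "usage: %s <pid>\n", argv[0]);
          return EXIT_FAILURE;
      }
      pid = (pid_t)atoi(argv[1]);

      CPU_ZERO(&set);
      CPU_SET(0, &set);    /* CPU 0 as housekeeping core (illustrative) */

      if (sched_setaffinity(pid, sizeof(set), &set) == -1) {
          perror("sched_setaffinity");
          return EXIT_FAILURE;
      }
      return 0;
  }

From the shell, taskset -pc 0 <pid> achieves the same result.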

Softirqs and kthreads

Softirqs are a kernel mechanism for deferred interrupt processing which is often challenging to manage on realtime systems. Softirqs may run in atomic context immediately after the hard IRQ which “raises” them, or they may be executed in process context by per-CPU kernel threads called ksoftirqd/n, where n is the core number. There are 10 kinds of softirqs, which perform diverse tasks for the networking, block, scheduling, timer and RCU subsystems, as well as executing callbacks for a large number of device drivers via the tasklet mechanism. Only one softirq of any kind may be active at any given time on a core. Thus, if ksoftirqd has been preempted, a softirq raised by a subsequent hard IRQ cannot run immediately after the hard IRQ handler and must instead wait until ksoftirqd runs again. This unfortunate situation has been called “the new Big Kernel Lock” by realtime Linux maintainers.
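
The per-core activity of the ten softirq types can be observed in /proc/softirqs; the sketch below simply sums the per-CPU counters for each type to show which softirqs dominate on a given system:

  #include <stdio.h>

  /* Sketch: sum the per-CPU counters in /proc/softirqs to see which of the
   * ten softirq types (HI, TIMER, NET_TX, NET_RX, BLOCK, IRQ_POLL, TASKLET,
   * SCHED, HRTIMER, RCU) are most active on this system. */
  int main(void)
  {
      FILE *f = fopen("/proc/softirqs", "r");
      char line[4096];

      if (!f) {
          perror("/proc/softirqs");
          return 1;
      }
      if (!fgets(line, sizeof(line), f)) {    /* skip the CPUn header row */
          fclose(f);
          return 1;
      }
      while (fgets(line, sizeof(line), f)) {
          char name[32];
          unsigned long long count, total = 0;
          char *p = line;
          int n;

          if (sscanf(p, " %31[^:]:%n", name, &n) != 1)
              continue;
          p += n;
          while (sscanf(p, "%llu%n", &count, &n) == 1) {
              total += count;
              p += n;
          }
          printf("%10s: %llu\n", name, total);
      }
      fclose(f);
      return 0;
  }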

Kernel configuration allows system managers to move the NET_RX and RCU callbacks out of softirqs and into their own kthreads. Since kernel 5.12, moving NET_RX processing into its own kthread is possible by echo-ing '1' into the threaded sysfs attribute associated with a network device. The process table will afterwards include a new kthread called napi/xxx, where xxx is the interface name. [Read more about the NAPI mechanism in the networking wiki.] Userspace may employ taskset to pin this kthread on any core. Moving the softirq into its own kthread incurs a context-switch penalty, but even so may be worthwhile on systems where bursts of network traffic unacceptably delay applications. RCU callback offloading produces a new set of kthreads and can be accomplished via a combination of compile-time configuration and boot-time command-line parameters.
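
A minimal sketch of enabling threaded NAPI from a program rather than the shell; the interface name eth0 is an assumption, and root privileges plus kernel 5.12 or later are required:

  #include <stdio.h>
  #include <stdlib.h>

  /* Sketch under assumptions: enable threaded NAPI for the network device
   * "eth0" (an illustrative name) so that NET_RX work runs in its own
   * napi/... kthread. Requires root and kernel 5.12 or later. */
  int main(void)
  {
      const char *path = "/sys/class/net/eth0/threaded";
      FILE *f = fopen(path, "w");

      if (!f) {
          perror(path);
          return EXIT_FAILURE;
      }
      fprintf(f, "1\n");
      fclose(f);
      return 0;
  }

The resulting napi/... kthread can then be pinned with taskset or sched_setaffinity() like any other task.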

Realtime application best practices

Multithreaded applications which rely on glibc's libpthread are prone to unexpected latency delays since pthread condition variables do not honor priority inheritance (bugzilla). librtpi is an alternative LGPL-licensed pthread implementation which supports priority inheritance and whose API is as close to glibc's as possible. The alternative musl libc's pthread implementation appears to be similar to glibc's in this respect.
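
Priority inheritance for the mutex itself is already available through the standard pthread API, as the sketch below shows with PTHREAD_PRIO_INHERIT; the condition-variable gap described above is what librtpi addresses, and its own API is not shown here:

  #include <pthread.h>

  /* Sketch: create a mutex with the priority-inheritance protocol using the
   * standard pthread API. This only covers the mutex; glibc condition
   * variables do not propagate priority, which is the gap librtpi fills. */
  int main(void)
  {
      pthread_mutexattr_t attr;
      pthread_mutex_t lock;

      pthread_mutexattr_init(&attr);
      pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
      pthread_mutex_init(&lock, &attr);

      pthread_mutex_lock(&lock);
      /* ... critical section of the realtime thread ... */
      pthread_mutex_unlock(&lock);

      pthread_mutex_destroy(&lock);
      pthread_mutexattr_destroy(&attr);
      return 0;
  }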
