FPU Context Switching
- Author: Gerwin Klein, Rafal Kolanski, Indan Zupancic
- Proposed: 2024-06-21
Summary
We propose to let user level configure policy on when FPU context is switched instead of relying on runtime faults to switch FPU context lazily. Policy is configured via a flag in the TCB, authorised by the corresponding TCB capability.
This improves performance, moves policy from kernel to user, enables verification of FPU handling, and fixes an information leakage in current FPU context switching.
Motivation
If the next thread is not the current FPU owner, current FPU context switching in seL4 disables the FPU on kernel exit, waits for a thread to fault when it accesses the FPU, transparently handles the fault, saves/restores FPU context, and then returns to the thread.
The motivation for this scheme was that threads that never use the FPU do not pay for FPU context switching cost. However, it has the following problems:
-
this is a fixed context switching policy in the kernel;
-
most of this complexity is invisible to the current formal verification, and therefore not covered by the proofs;
-
when switching between domains, this scheme introduces information flow leakage, because the kernel performs actions on behalf of domain A while running in domain B: assuming A and B are the only domains, the first thread running in domain B can determine whether a thread in domain A used the FPU simply by using the FPU itself, since kernel FPU save/restore actions will only occur if another thread used the FPU.
-
performance of this scheme is not optimal -- if multiple threads are using the FPU in a row, the code will be taking a fault every time the FPU is used. This adds unnecessary overhead, especially in systems where FPU usage is high.
The proposed scheme fixes these problems and maintains the original intent that threads which do not use the FPU should not pay FPU context switching cost.
The assumption is that application threads are likely to be using the FPU, and that low-level OS threads are unlikely to be using the FPU. One particular case we are aiming to cover is an application thread calling the OS and the OS returning to that thread. Ideally, this case should not need to save or restore any FPU registers.
Guide-level explanation
Concepts
-
The TCB will get a new flag
fpuDisabled
that determines whether the FPU is disabled or enabled when the kernel switches to the thread. If the FPU is disabled and the thread uses the FPU, a fault is generated and delivered to the thread's fault handler like other faults. If the FPU is enabled, FPU context is saved and restored as needed. There is no more kernel-level transparent FPU fault handling. -
For backwards compatibility, the flag is initially false (the FPU is enabled) on platforms that have FPU enabled at kernel configuration time.
-
The flag can be set/cleared via the new
seL4_TCB_Set_Flags
invocation of the corresponding TCB capability. This allows user level to configure fine-grained FPU context switching policy.
High-level Performance Considerations
The following high-level properties will hold:
-
When no thread is configured to use the FPU, no FPU context switching happens after initialisation.
-
When all threads are configured to use the FPU, the scheme is equivalent to eager FPU switching, that is, FPU context is always saved and restored, no matter if the threads actually use the FPU or not.
-
On domain exit, the current FPU context is always saved when it exists.
-
Calls from FPU-threads to exclusively non-FPU threads do not trigger any FPU context save/restores.
For example, consider three threads A, B, and C. A is configured to use the FPU, B and C are not. Switching from A to B will not need an FPU context save or restore, neither will switching from B to C or C to B. Switching from B to A only requires an FPU context save/restore if another thread has run in between that was configured to use the FPU.
Reference-level explanation
Adding the flag to the TCB will not increase the TCB object size, and is consistent with other TCB parameters authorised by the TCB cap.
The default is chosen so that a value of 0 (cleared memory after retype
) means
that the FPU is enabled. That means user-level code that does not know about the
new FPU switching scheme will be able to use the FPU as before.
As it does now, the kernel will maintain a current FPU owner, which will be a pointer to a TCB. The invariant on that pointer is that it points to the TCB of the most recently running thread that was configured to use the FPU. It is the TCB the current FPU state has to be saved to.
-
When the kernel switches away from a thread, no FPU save/restore actions happen.
-
When the kernel switches away from a domain, the current FPU context is always saved.
-
When the kernel switches to a thread where the FPU is disabled, no FPU context actions happen.
-
When the kernel switches to a thread where the FPU is enabled, the FPU context of the current owner is saved, and the FPU context of the new thread is restored. The current FPU owner is updated to the new thread.
API
TCB_Set_Flags
static inline int seL4_TCB_Set_Flags
Clear and set feature flags of the specified TCB -- in that order, i.e. when a flag is both cleared and set, it will be set.
Type | Name | Description |
---|---|---|
seL4_TCB | _service | Capability to the TCB to set flags on |
seL4_Word | clear | Bit 0: set to 1 to clear the fpuDisabled flag |
seL4_Word | set | Bit 0: set to 1 to set the fpuDisabled flag |
Flags that are not available on the current platform or configuration will be
ignored. Currently only bit 0 (fpuDisabled
) is available on platforms with
CONFIG_HAVE_FPU
.
The function returns seL4_NoError
for success as usual, and
seL4_InvalidCapability
if the provided capability is not a TCB capability.
Verification Impact
Verification impact should be moderate.
The following items will need to be added to the specifications and proofs on
all verification configurations that have FPU proof support (ARM imx8-fpu-ver
, AARCH64
, X64
).
- added
seL4_TCB_Set_Flags
system call decode and perform functions - change to
TCB
object state, including default retype reasoning - the owner of the FPU state will be tracked, which needs new invariants. For
example:
- when a current owner exists it always points to valid TCB
- this TCB belongs to the current domain
The following main parts of the proof stack are affected by these:
- specifications: abstract spec, design spec, capDL spec
- functional correctness, all levels
- fast path for
ARM
andAARCH64
(additional check for FPU) - access control
- information flow
- capDL and system initialisation
While the changes are cross-cutting, we think the new invariants will remain relatively isolated and will not require much interaction with existing ones.
Potential Optimisations
-
On Arm it is possible to interleave loading and storing of the FPU registers, which might be faster than first loading and then storing all registers. This interleaving is only possible if there is always an FPU context to save when there is one to restore. We propose to leave this up to the implementation phase.
-
Newer x86 CPUs support the
XSAVES
instruction, which combinesXSAVEOPT
andXSAVEC
. That means together with the high overhead of disabling the FPU, lazy switching may not make sense on newer x86 CPUs and the user may want to configure all threads as FPU threads. -
There are multiple options on how to represent the state where there is no current FPU owner. This happens at the start of the system where no FPU thread has run yet, and at the start of a domain when no FPU thread has yet run in that domain. This state could be represented by a NULL owner, or it could be represented by pointing to a dummy thread, e.g. the idle thread. The latter implies that the pointer always points to a valid thread and that the current FPU context can always be saved, even if there is no current owner. This would remove testing for NULL in the (very) frequent case where a current owner exists and would add a superfluous context save in the infrequent case where no owner exists. Doing this might complicate the information flow proofs where the FPU context of the idle thread would have to be modelled (and proved) as unobservable to the user. However, it may also reduce timing differences at domain switch (modulo caching), because each domain switch will always have a context to save.
Drawbacks
-
Compared to the current scheme, user level now needs to explicitly disable the FPU for a thread when no FPU context switching is required. This requires more system analysis than before. However, if user level is not consciously treating the FPU, it is likely that modern compilers on most architectures will at least use SIMD instructions that use FPU registers, which means that the default of "FPU enabled" should be the correct choice for such systems.
-
For a workload where many threads do use the FPU, but use it only occasionally, the proposed scheme would perform many more FPU context switches than the current fault-based lazy scheme.
In this case, it would still be possible to emulate the fault-based lazy scheme from user-level:
- the rare-FPU thread is set to FPU disabled
- when it does use the FPU, the user-level fault handler gets the corresponding fault
- the fault handler enables the FPU on this thread and restarts it
- the thread could either stay FPU enabled or be FPU-disabled again after its time slice or budget are up, depending on system policy.
This does have more overhead than the current in-kernel transparent fault handling, but not massively more, and it allows full user-level control and trade-offs. For instance, this could also be used for starting all threads as FPU-disabled and enabling FPU for them on demand. That would yield low amortised overhead without needing system analysis.
Rationale and alternatives
There are multiple alternatives to the semi-lazy context switching scheme proposed here. The main classes are:
- fully eager -- FPU context is always saved and restored
- semi-lazy with early context saving -- FPU context is always saved when switching away from an FPU-enabled thread and always restored when switching to a FPU-enabled thread
- fault-based lazy -- the current scheme in seL4
This RFC has already argued that semi-lazy is preferable to the current scheme for multiple reasons (policy, performance, verification). Compared to eager switching, the proposal has the following advantages:
- Fully eager represents another fixed kernel policy vs configurable policy.
- Fully eager switching can be achieved in the proposed scheme by setting all threads to FPU enabled. The only overhead this is incurs is one branch test which checks the flag and always finds the FPU enabled.
- Fully eager switching does not perform well on all work loads, for example a common use case in the device driver framework is a chain of OS producers/consumers that do not use FPU, called by an application that potentially does use the FPU.
- If the one-branch-test overhead for fully eager switching in the proposed
scheme is important for a specific workload, it would be possible to provide
an optimisation: have an always-eager compile-time config flag that statically
sets the
fpuDisabled
flag tofalse
, which eliminates the branch. This would disallow disabling the FPU inseL4_TCB_Set_Flags
. We do not anticipate this to be needed currently, but it is an option for the future if this turns out to be a common case.
The proposed scheme is slightly more complex than the semi-lazy scheme with early context save, and early saving would not need any special treatment on domain switch, but the targeted use case of OS threads without FPU called by an application thread with FPU would still need one FPU context save and restore, whereas the proposed scheme does not require any.
Prior art
There has been initial discussion and a draft pull request for a proposal to introduce an FPU object to seL4, which served as inspiration for this RFC. The FPU object proposal similarly makes context switching configurable by binding or not binding an FPU object to a TCB, and implements a similar switching strategy.
Adding a new object to the API has much higher implementation+verification impact as well as higher API breaking impact for user-level systems. The issue of memory-backing for the FPU is orthogonal to the issue of when to context switch the FPU. The approach proposed here focuses on context switching only, which keeps verification manageable. The performance numbers we measured for the proposal in this RFC are similar to the numbers obtained in the FPU object proposal.
If TCB size and memory pressure do turn out to be an issue in practice, the FPU object can still be added on top of this RFC in the future.
Performance Numbers
The following numbers compare the fault-based lazy switching (current seL4 implementation) with fully eager switching (FPU always on). The tables will be updated when more results are available, including an implementation of the proposed semi-lazy switching without faulting.
These measurements should give an indication of the trade-off space: lazy fault-based switching where no thread is using the FPU should be the same as the base line, because the FPU is always off. Eager switching is the highest context switching load. The space in between has trade-off between these and cost for enabling/disabling the FPU, potentially more complex logic, and taking faults.
Expectations for final performance: Task with FPU enabled will behave close to eager switching and tasks with FPU disabled will behave like the current lazy switching, which always disables the FPU at task switch.
All tests done on non-SMP, non-MCS and release builds of seL4 and sel4bench
.
Keep in mind that all cycles are not in CPU cycles, but in platform specific
fixed clock cycles, hence the numbers should not be compared between boards.
AArch64
The AArch64 tqma
system:
Cortex-A35 | Call | ReplyRecv |
---|---|---|
Old Lazy | 311 | 328 |
Eager | 429 | 493 |
Basic save + restore overhead seems to be about 120 cycles (after aligning
fpuState
). The unexpected extra slowdown for ReplyRecv
may be because of
secondary effects, like less optimal memory accesses. Performance is mostly
limited by store and load throughput.
On ARM enabling and disabling the FPU is cheap.
x86
The skylake
system:
Skylake | Call | ReplyRecv | Consumer to producer |
---|---|---|---|
Old Lazy | 157 | 163 | 2455 |
Eager | 556 | 527 | 1203 |
Sync no FP | 945 |
The other sync tests had similar results. The compiler is much more eager to use SIMD instructions on x86 compared to AArch64, causing FPU traps during the sync tests. The Last column shows the FPU trap overhead.
When compiling the sync test with -mgeneral-regs-only
, the lazy FPU sync
results are much better, see "Sync no FP" row. However, the result is only ~250
cycles faster instead of the expected 350-400, compared to eager switching.
Comet Lake | Call | ReplyRecv | Consumer to producer |
---|---|---|---|
Old Lazy | 124 | 126 | 1839 |
Eager | 427 | 405 | 926 |
Sync no FP | 675 |
Other measurements have been made, and all-in-all the x86 performance
measurements do not seem very reliable. For instance, adding extra FPU
disable+enable calls added 300 cycles, which should be similar to the cost of
XSAVE
+XRSTOR
, yet we don't see that in the actual results.
Other examples are measuring 2k trap overhead cycles, while the difference between "Sync no FP" is around 1.5k on Skylake, and getting much worse results when compiling the IPC test with general regs only too, while from the results it is clear that no trapping happens, so there should be no FP instructions used during the measured window.
A cycle counter overhead of 200 seems to complicate things too. For x86 it may
be better to use PMU counters instead of rdtsc
.
Benchmark Conclusions
Although micro-benchmark results are much worse for eager switching, on a macro level the results are dramatically better if the FPU is used at all. The proposed semi-lazy switching with FPU flag should get the best of both approaches.
Unresolved questions
- While the internal representation needs to be
fpuDisabled
to get the right default, it may be nicer for the API to present the inverted flagfpuEnabled
instead.