Introduction

This book collects the approved RFCs for the seL4 ecosystem.

The seL4 foundation uses the request for comments (RFC) process to

  • allow the community to discuss design changes in seL4,
  • gather valuable feedback from the community on changes the foundation is considering,
  • allow members of the seL4 community to get support and approval to propose and implement their own changes to the seL4 ecosystem,
  • ensure that all changes made to core components of the seL4 ecosystem or that have wide and varying impacts on users of seL4 undergo rigorous review, and
  • ensure large changes are well advertised and can be viewed publicly before contributors commit to implementing them.

This helps the seL4 community ensure that such changes are made with the goal of the best outcome for the most users of seL4 without compromising seL4's high-assurance properties of functional correctness, isolation, and security.

The RFCs in this book are split into three parts: Implemented RFCs are completed and available for use. Active RFCs are approved and currently under implementation. Deferred RFCs are approved, but currently not being worked on for implementation.

An RFC process for the seL4 ecosystem

  • Author: Curtis Millar
  • Proposed: 2018-12-13

Summary

Note that this is the RFC process as it was first proposed in 2018. The process has changed in details over time, for instance it now runs on GitHub instead of Jira.

The current seL4 RFC process is described in the seL4 RFC repository.

In order to deal with major changes to the seL4 ecosystem, a process of Requests For Comments (RFCs) will be introduced. These will take the form of issues on a dedicated Jira project which are open to the public to view changes as they are planned and to provide feedback and input on the direction of those changes.

Motivation

In the past we have made many large changes to the ecosystem with little or no warning that they were being planned, only notifying the community as the changes were deployed. This has been a limitation of the processes we have in place for making such changes.

As the seL4 community grows and the platform is more widely used, we need to ensure that such changes occur in a manner that provides the best outcomes for all affected parties, and that gives enough advance warning that affected projects have ample time to plan and prepare.

The RFC process ensures that the community is aware of fundamental changes to the ecosystem and provides an opportunity for discussion and contribution ahead of major changes. The process also provides an avenue for the community to co-ordinate large changes to the ecosystem, with all of the same benefits, in a manner we have had difficulty supporting in the past.

Guide-level explanation

The RFC process is one that any member of the community can initiate, allowing not only the core seL4 team but also all users of seL4 to co-ordinate major changes to the ecosystem.

The RFC process begins with a discussion on one of the existing points of communication, namely the IRC channel and the mailing list. This is to ensure that, when an RFC is proposed, its purpose and design are clear, multiple alternatives have been considered, and the impact on the ecosystem is understood.

Once the discussion has led to support for the change, both from the community and the seL4 development team, someone will create an RFC issue on the dedicated Jira project. The person who does so takes responsibility for maintaining the proposal through the approval process. This includes monitoring discussion on the issue and updating the issue description to reflect changes that need to be made to the proposal.

Where an RFC cannot gather enough discussion for a consensus to be formed, but the issue is still considered worthy of further consideration, it will be postponed and may be reconsidered at a later date when a member of the community volunteers to take responsibility for it.

Once there is consensus on the change from a majority of the community and the seL4 development team, the RFC may be approved or rejected. Whoever approves the RFC is then responsible for maintaining it through its implementation.

In cases where an RFC is approved but the resources cannot be found to see it through to complete implementation, or resources are diverted from its implementation, it will be deferred. Continuing work on deferred RFCs will take priority over approving new RFCs when resources later become available.

Reference-level explanation

RFCs will be submitted via a public Jira project which will enforce the RFC workflow and chain of responsibility. Each RFC will be maintained in the description field of an 'RFC' issue within that project. All discussion of the RFC post-proposal will occur on the issue itself.

The Jira workflow will ensure that the RFC is assigned to its creator upon its proposal and then assigned to the user by whom it is later approved.

The approver is then responsible for updating the description as changes are made in the process of implementation and linking related issues from other projects to track the implementation of the proposal.

Drawbacks

Jira is limited in many ways: users cannot comment on particular sections of the description, and the comment section does not allow for any level of threading.

There is also no way to review further changes to the RFC as one could with other document forms; instead, a certain set of users can change it at will without review. This can be accounted for, to some degree, by limiting the ability to modify the issue to the assignee and the project administrators.

Rationale and alternatives

This process heavily borrows from the Rust RFC process. That process documents RFCs in a dedicated git repository, with the initial proposal made as a pull request, changes recorded in commits, and post-approval changes requiring further pull requests. The whole process moves from their forums in the pre-proposal stage, to the RFC repository in the approval stage, to a separate tracking issue in a different repository in the implementation stage.

We felt that the process used by the Rust team would split the information across too many locations, making it harder to maintain and harder for users of the platform to find the information they were looking for.

The chosen design trades features of a more robust approval process for greater centralisation of the process to a single service. We believe that it is more important to keep the information spread over fewer sources and that we are not likely to see the level of involvement that would require a more structured process in the near future.

We feel the particular features of Jira, such as the ability to enforce a particular workflow and permissions in software, allow us to account for some of the benefits we lose by not using both a Git repository and an issue tracker. We also have experience in administrating and using Jira to track issues which positions us to better utilise these features.

Prior art

This sort of process has been used by many other communities to manage and co-ordinate large scale changes to widely used systems.

A few notable sources of inspiration are:

  • Rust's RFC process,
  • Python's PEP process, and
  • The IETF's RFC process.

These demonstrate that such a process can be used to effectively co-ordinate community-driven change and ensure that large-scale changes are made with precise plans documented and key stakeholders involved.

Unresolved questions

The existing forums for discussion, the IRC channel and the mailing list, are not necessarily the most accessible media for discussion and may not facilitate the kind of discussion required by this process as well as other alternatives would. We may wish to adopt an alternative to complement or replace these.

We may find that a more rigorous process is required for tracking and approving changes to proposals, or for more detailed discussion, than this design anticipates. These requirements should be revisited once the process has been in regular use, to determine whether it should be restructured.

Disposition

This has been accepted and will be put into use. There has been little response from the community, likely due to this being a new process.

Some of the details of this system, such as its use of Jira over alternatives, may be revised at a later date. We won't really be sure until we start using it.

Note that this is the original seL4 RFC process as proposed in 2018. Find the current RFC process here.

A dedicated C runtime for seL4

  • Author: Curtis Millar
  • Proposed: 2019-01-04

Summary

Many languages provide a runtime that supplies the initial abstractions needed to run code within an operating system process. Many borrow heavily from the runtime provided by the C library.

A new runtime will be provided to abstract the interface provided to seL4 processes. It will be the most minimal set of abstractions needed to allow a process to run and to access all of the language features that must be supported by the runtime directly. Abstractions commonly provided by a standard library (such as allocators and thread creation) will be left to external libraries.

Motivation

Most seL4 projects currently rely on a fork of the musl libc library and runtime to provide process initialisation and libc support. Due to the modifications made to this port, it is difficult to benefit from any upstream maintenance.

An official, internally supported runtime would decouple these projects from musl libc and allow the use of alternative C standard libraries (such as newlib) and external pthread libraries without the need for modification.

The runtime would also have few or no external dependencies or conflicts, allowing it to be used as the basis for other language runtimes and to support library-OS projects on seL4.

Guide-level explanation

The runtime would support two types of process being created: those starting in the style of the seL4 initial process, and those started using the static System-V ELF loading method.

The runtime would require the sysroot within which code is compiled to ensure that its runtime objects are compiled and linked with all binaries built, that its library is linked into all binaries, and that its link script is used with all libraries being built.

Code built with the library would run from an initial main function, as would normally be the case in a C program, and the full range of features such as thread-local variables and dynamic constructors and destructors would be available.

Additionally, a few mechanisms that reflect particular abstractions available on seL4 would be made available to processes and threads running within the runtime. These would exist to support abstractions common in systems built on seL4 that rely on communication between user-level programs. These would include mechanisms for informing processes of their capability state and the location of their IPC buffers.

Reference-level explanation

The runtime is primarily responsible for constructing the environment in which code expects to run. This means that code should start with a functioning stack, with all library systems fully initialised, and with the initial thread's thread-local storage correctly set up.

As such, the first thing the runtime needs to do for a seL4 initial process is create the stack and construct the initial System-V stack data (command line arguments, environment variables, and auxiliary vector).

For all processes, the runtime must then configure the TLS region for the initial thread using information found in the auxiliary vector. It should also record this information to allow it to later construct and modify TLS regions for newly created threads.

The runtime must then initialise all global runtime state, such as state that records the initial capability layout of the process.

After that, all constructors for the binary, including all library constructors must be run.

Finally, the stack information for the command line arguments and the environment variables must be passed to the main function as it expects to receive them.

The runtime must also provide routines for handling process exit. These should ensure that all dynamic destructors are run before the process exits.
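Putting the above steps together, the startup path could be sketched as follows. This is a minimal illustration; all names in it are hypothetical rather than actual sel4runtime symbols.

/* Hypothetical sketch of the startup sequence described above; the
 * helper names are illustrative, not actual sel4runtime symbols. */
void __runtime_start(void *initial_stack)
{
    /* 1. For an seL4 initial process, build the System-V stack data
     *    (argv, envp, auxiliary vector); ELF-loaded processes already
     *    have this on the stack. */
    sysv_env_t env = setup_sysv_stack(initial_stack);

    /* 2. Configure the initial thread's TLS region from the auxiliary
     *    vector, recording the layout for threads created later. */
    tls_init(env.auxv);

    /* 3. Initialise global runtime state, such as the record of the
     *    process's initial capability layout. */
    runtime_state_init(env.auxv);

    /* 4. Run all constructors for the binary and its libraries. */
    run_constructors();

    /* 5. Enter main with the expected arguments, then run the
     *    destructors and exit. */
    exit(main(env.argc, env.argv, env.envp));
}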

Drawbacks

This will become a maintenance burden and, as it will be relied upon for production systems, will require a similar level of support to the kernel.

Making this runtime will probably lead to particular patterns being enforced across all processes running on seL4. Design decisions made in the development of this runtime must be thoroughly considered as they will form the basis of effectively all processes running on seL4.

Rationale and alternatives

Existing runtimes from well-supported libc implementations could be used, however they are all tightly coupled to the operating systems for which they are purpose-built, and supporting them requires emulating many of the abstractions provided natively by those operating systems. For example, to make musl thread-safe, the Linux futex API must be emulated to support musl's locking mechanisms.

Providing an entire C library and runtime to fully support the C language requirements is another alternative; however, there is no sufficient reason to maintain a C library when an external one, such as newlib, can easily be supported with minimal abstractions over existing seL4 concepts.

Prior art

Many of the seL4 rapid development libraries utilise a common set of patterns, some of which are sufficiently common that they should be standardised at this level.

The musl libc runtime is a primary reference for this implementation.

Unresolved questions

What are the natural abstractions that should be available to all processes running under seL4?

What patterns should this encourage or standardise?

Future possibilities

This runtime will allow for further, more high-level abstractions such as re-usable allocators and POSIX-style threading libraries to be provided.

Disposition

After much internal discussion this has been approved. Development will happen in the seL4/sel4runtime repository. This work depends on the outcome of the RFC Cross-platform thread-local storage support and so will not be usable until that RFC is approved and implemented.

Cross-platform thread-local storage support

  • Author: Curtis Millar
  • Proposed: 2019-01-04

Summary

Currently, the kernel does not support thread-local storage (TLS) on all of the platforms that it otherwise supports. In particular, there is no such support for TLS on RISC-V.

This proposal seeks to ensure that the kernel and supporting libraries support static TLS on all platforms. Due to hardware and ABI restrictions however, this also requires changing the user-level API for determining the location of IPC buffers.

This change would see that all platforms can use TLS, but that it would be entirely the responsibility of user-level to determine and track the location of IPC buffers for each thread and for newly created processes. In addition, a configuration-dependent seL4_SetTLSBase system call would be added, allowing a thread to change the value of its own TLS register without a capability to its own TCB.

To assist with the creation of threads using TLS, the seL4 runtime will provide simple primitives for initialising and modifying the TLS of threads from within a running process as well as initialising the TLS of System-V style processes.

Motivation

seL4 already supports TLS on some platforms in the same manner as the Linux ABI. On aarch32 (ARMv7 and newer) and aarch64, TLS is supported via an EL0-readable control coprocessor process ID register. On x86, the TLS region can be dereferenced via a general-purpose segment register.

Many projects on seL4 utilise TLS to store state associated with particular threads. This feature is used particularly heavily in library-OS projects.

The kernel supports the IPC buffer as a kind of thread-local storage area on all platforms. With the kernel supporting generic TLS on all platforms, this would no longer be necessary as threads would simply be able to record the address of their IPC buffers in their TLS region.

To allow for implementations of green-threading and to allow a process to initialise its own TLS without a capability, threads would be able to modify their own TLS address using seL4_SetTLSBase across all platforms where they would not otherwise be able to modify the designated TLS register directly. This ensures that across all platforms, the TLS Base register is viewed as being completely mutable by the executing thread, just like all of the general purpose registers.

On platforms where the TLS Base register can be set in unprivileged mode (aarch64, aarch32 with the c13 control register, riscv, and x86_64 with FSGSBASE set), threads should set their TLS address without the use of the invocation.
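As a sketch of what this looks like in practice, the register write is a single unprivileged instruction on those platforms, with the proposed invocation as the fallback elsewhere (illustrative inline assembly; seL4_SetTLSBase is the invocation proposed by this RFC):

/* Sketch: set the TLS base from user mode where the hardware allows
 * it, falling back to the proposed seL4_SetTLSBase invocation
 * elsewhere (e.g. ia32, or aarch32 without tpidrurw). */
#include <sel4/sel4.h>

static inline void set_tls_base(seL4_Word base)
{
#if defined(__aarch64__)
    __asm__ volatile("msr tpidr_el0, %0" :: "r"(base));
#elif defined(__riscv)
    __asm__ volatile("mv tp, %0" :: "r"(base));
#else
    seL4_SetTLSBase(base);
#endif
}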

Guide-level explanation

Thread-local storage

Any code running on seL4 can use thread-local variables within its code. These variables will be initialised to the same value for each new thread (unless modified by the thread creator).

For gcc-compatible C code, this is as simple as marking a global or static variable with the __thread keyword or the C11 standard _Thread_local specifier.
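For example, the following declarations each give every thread its own copy of the variable:

/* Each thread sees its own copy of these variables. */
__thread int request_count;          /* GCC/Clang extension   */
_Thread_local int last_error;        /* C11 standard spelling */

void handle_request(void)
{
    request_count++;   /* increments only the calling thread's copy */
}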

When constructing a thread, it is the responsibility of the code constructing the thread to ensure that the layout of the TLS region follows the appropriate layout for the given platform (the seL4 runtime will provide the tools necessary for this on supported platforms).

It is also the responsibility of the runtime of the process being started to correctly initialise the TLS of the initial thread of a process (the seL4 runtime will perform this task for System-V style processes on all supported platforms).

Determining the location of the IPC buffer

A thread can determine the address of its IPC buffer via the __sel4_ipc_buffer thread-local variable provided by libsel4. This must be initialised when the thread is created to refer to the location of the thread's IPC buffer.

libsel4 will also continue to provide the seL4_GetIPCBuffer function that returns the address of the current thread's IPC buffer, however this will now simply read the value from the thread-local variable.
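A sketch of what this looks like (illustrative, not the verbatim libsel4 source; seL4_IPCBuffer is the libsel4 IPC buffer type):

/* Illustrative sketch: the IPC buffer address lives in an ordinary
 * thread-local variable, and seL4_GetIPCBuffer simply reads it back. */
extern __thread seL4_IPCBuffer *__sel4_ipc_buffer;

static inline seL4_IPCBuffer *seL4_GetIPCBuffer(void)
{
    return __sel4_ipc_buffer;
}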

Reference-level explanation

Implementation of TLS

The kernel will support TLS in a manner allowing for the use of System-V style TLS using the same TLS ABI as Linux. In particular it will use the following registers on the following platforms to store the thread identifier (the address of the TLS region).

  • fs on x86_64

  • gs on ia32

  • tp (x4) on RISC-V

  • tpidrurw where it is available on AArch32, and the GLOBALS_FRAME otherwise. This requires building for the Arm ELF EABI (code generated for arm-none-eabi assumes no operating system and emulated TLS rather than ELF-style TLS) and building with the -mtp=soft option, so that the thread ID lookup function from the seL4 runtime is used rather than the Linux default of reading tpidruro.

  • The tpidr_el0 register on aarch64

The seL4 runtime will provide full support for TLS for each platform but will only support the local static TLS model. Statically linked code will be able to use TLS whilst dynamically linked code will not. It will use the TLS variant appropriate for each platform (variant I for Arm and RISC-V, variant II for x86).

For more information, see Elf Handling For Thread-Local Storage.

Managing the IPC buffer address

The seL4 runtime will expect the IPC buffer address of the initial thread to be passed via an AUX vector as part of the System-V ELF process loading.

The runtime will also provide a mechanism to allow for modifying thread-local variables, such as __sel4_ipc_buffer, of other threads.
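A minimal sketch of such a mechanism, assuming the creator knows the target thread's TLS base address and the variable's offset within the TLS region (both helper names below are hypothetical):

/* Hypothetical sketch: write into another (not yet running) thread's
 * copy of a thread-local variable, given that thread's TLS base and
 * the variable's offset within the TLS region. */
#include <stdint.h>
#include <stddef.h>

static void tls_write_word(uintptr_t tls_base, size_t offset, uintptr_t value)
{
    *(uintptr_t *)(tls_base + offset) = value;
}

/* For example, pointing a new thread's __sel4_ipc_buffer at its IPC
 * buffer before the thread starts; computing the offset is
 * platform-specific (TLS variant I vs variant II). */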

Kernel change

Only a holder of a TCB capability is able to modify the IPC buffer address and page for that TCB and only the kernel is able to read either for that TCB. There will be no way to determine the IPC buffer address either as the running thread or as the holder of the TCB capability.

The kernel will consider the register used for the TLS base, and all thread identifier registers that can be written from user mode, to be general-purpose registers, and will save and restore them with all other registers upon trap and context switch.

  • aarch32 (with GLOBALS_FRAME)

    • Before: The kernel writes the IPC buffer address and TLS_BASE to a global frame mapped into all virtual address spaces as read-only (the GLOBALS_FRAME). Both require an invocation to change.

    • After: The kernel writes the tpidr_el0/tpidrurw and tpidrro_el0/tpidruro to the GLOBALS_FRAME. The values in the globals frame are saved and restored with all other general-purpose registers. The thread can update its own TLS register (tpidr_el0/tpidrurw) using seL4_SetTLSBase.

  • aarch64 and aarch32 (with TPIDRURW & TPIDRURO)

    • Before: The kernel uses tpidr_el0/tpidruro for the TLS register and tpidrro_el0/tpidrurw for the IPC buffer. It writes both from virtual registers and requires an invocation to change either.

    • After: A thread uses tpidr_el0/tpidrurw for TLS which can be written to from user mode. The kernel saves and restores tpidr_el0/tpidrurw with all other general-purpose registers.

  • x86_64

    • Before: The kernel uses fs as the TLS register and gs for the IPC buffer. It writes to both from virtual registers in the TCB and requires an invocation on the TCB to change either.

    • After: A thread uses fs as the TLS register. The kernel always enables FSGSBASE to allow user mode to write directly to fs and gs. The kernel saves and restores fs and gs with all other general-purpose registers.

  • ia32

    • Before: The kernel uses gs as the TLS register and fs for the IPC buffer. It writes to both from virtual registers in the TCB and requires an invocation on the TCB to change either.

    • After: The kernel uses gs as the TLS register. The kernel saves and restores gs with all other general-purpose registers. The thread can update its own TLS register using seL4_SetTLSBase.

  • riscv

    • Before: The kernel uses tp for the IPC buffer. The kernel writes to the tp virtual register in the TCB and requires an invocation to change it.

    • After: A thread uses tp as the TLS register which can be written to from user mode. The kernel saves and restores tp with all other general-purpose registers.

In all cases, the holder of the TCB capability can read and modify any newly defined general-purpose registers (virtual or otherwise) using seL4_TCB_ReadRegisters, seL4_TCB_WriteRegisters, and seL4_TCB_CopyRegisters.
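For illustration, under this model a thread-creator could initialise another thread's TLS base with an ordinary register write. This sketch assumes aarch64 and that seL4_UserContext exposes the TLS register as a tpidr_el0 field:

/* Sketch: set a new thread's TLS base via seL4_TCB_WriteRegisters,
 * treating the TLS register as just another general-purpose register.
 * Assumes aarch64 and a seL4_UserContext with a tpidr_el0 field. */
#include <sel4/sel4.h>

void start_thread(seL4_CPtr tcb, seL4_Word entry, seL4_Word stack_top,
                  seL4_Word tls_base)
{
    seL4_UserContext regs = {0};
    regs.pc = entry;
    regs.sp = stack_top;
    regs.tpidr_el0 = tls_base;   /* the TLS register on aarch64 */
    seL4_TCB_WriteRegisters(tcb, 1 /* resume */, 0,
                            sizeof(regs) / sizeof(seL4_Word), &regs);
}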

Drawbacks

This change requires more work by user code for threads to be able to determine the location of the IPC buffer. The addition of a provided runtime will take care of most of this additional work.

This requires anyone compiling code for seL4 to use a Linux-compatible ABI, or compilers that support the given TLS ABI. Any code using the seL4 runtime or libsel4 will need to be compiled with such compilers.

Code used to create new processes will need to be modified to pass the IPC buffer address via the auxiliary vectors.

Rationale

In order to make the best use of existing standards and tooling that supports them, we have decided to borrow from the implementation used across many Unix-like platforms such as Linux.

We could achieve the user level support for this by modifying our fork of the musl libc runtime to co-operate with whichever standard we chose. However, due to ongoing growing pains as well as not wanting to tightly couple seL4 to a particular fork of musl, providing support as part of the seL4 runtime is more appropriate.

Prior art

All major operating systems implement TLS in a similar manner. Many UNIX operating systems (such as Linux) use this exact model.

Many Linux libc alternatives (such as glibc and musl) provide similar support for Linux's TLS implementation. The musl implementation is the primary reference for this implementation.

Unresolved questions

How will code written in other languages operate with the new requirements for managing the IPC buffer address? How will they be able to leverage TLS?

Future possibilities

Given that the current support is only intended to allow for static binaries with no dynamic linking, this could be extended in the future to also cater to dynamically linked binaries that have been loaded via some linker as well as libraries that perform internal dynamic library loading.

Disposition

After long discussion internally, as well as external interest, this has been approved; implementation is underway and will occur in the seL4 repository. Implementation will occur in conjunction with that of the seL4 runtime, which will provide the runtime support and interface necessary to utilise these features.

Remove support for ARMv6 (KZM/i.MX31)

  • Author: Matthew Brecknell
  • Proposed: 2021-09-15

Summary

This RFC proposes to remove seL4 support for KMC's KZM evaluation board, which used the Freescale (now NXP) i.MX31 processor with an ARM1136 core. Since this is the only ARMv6 platform supported by seL4, this proposal would also remove support for ARMv6.

Motivation

The KZM board has been unavailable for purchase for some years. The i.MX31 processor exited NXP's product longevity program in 2015. We previously asked the seL4 community [1,2,3] if there are current users of the KZM platform. From the responses, we believe there is already consensus that there is little value in continuing support for KZM.

With no known users, we are motivated to eliminate the overhead of maintaining support for KZM. Unlike all other platforms supported by seL4, KZM lacks sufficient registers for thread-local identifiers such as TLS_BASE and the IPC buffer address. For KZM alone, seL4 implements a workaround by mapping a special frame into each thread (the globals frame) to store these identifiers.

As an example of the cost of maintaining support for KZM, the globals frame currently makes it difficult to uniformly describe the address space available to user-space programs.

Additionally, the KZM board the seL4 Foundation has used for testing KZM support is no longer reliable, so we are currently only able to test using QEMU simulation. Although KZM was the original seL4 verification target, KZM support has not been verified since early 2016.

Since we believe there is already consensus around the removal of KZM support, this RFC can be considered a formal last call for objections.

Guide-level explanation

We propose to remove from seL4:

  • seL4 kernel build targets for the KZM board, i.MX31 processor, and ARMv6 architecture;

  • any platform-specific code that is only required for those build targets, including platform-specific definitions, drivers and device-tree specifications.

Prior to removing support from seL4, we will remove code from other repositories that depend on seL4 support for KZM and ARMv6, including tests, benchmarks, and tools such as CAmkES.

We propose to update documentation on the docs.sel4.systems website to show that the KZM board is no longer supported, and to show the last seL4 release version that included support.

The impact of this change is that any current users of seL4 on the KZM board would not be able to use new releases of seL4 on the KZM board. They would most likely need to freeze the version of seL4 they use, and would therefore not be able to expect the usual level of support from the seL4 community.

Since we are not aware of any users, we do not expect significant impact.

Reference-level explanation

We have drafted a pull request to remove seL4 support for KZM, i.MX31 and ARMv6:

A number of related pull requests remove references from related repositories. Some of these are already merged, since the changes they make do not require an RFC process:

Drawbacks

Since the KZM board is the only ARMv6-based platform supported by seL4, this proposal implies the removal of ARMv6, even though later ARMv6 boards might not require the same workaround as the KZM board. Removing ARMv6 will make it slightly harder to add a new ARMv6-based platform in the future. But it would not be prohibitively difficult, since the main difference between ARMv6 and ARMv7 support in seL4 is the workaround required for KZM.

In any case, we believe that new ARM-based applications of seL4 will most likely use ARMv7 or later, so the benefits of removing KZM support currently outweigh the drawbacks.

Rationale and alternatives

We've already addressed the rationale for removing KZM support.

One might ask why it's necessary to remove all references to ARMv6 when we remove support for KZM, since retaining ARMv6 references might make it easier to support other ARMv6 boards in the future. However, we do not think it makes sense to retain references to ARMv6 without a specific ARMv6 board to exercise them, and without a clear demand for ARMv6 support. Again, the cost of maintaining those references outweighs any clear benefit.

Prior art

The Linux kernel removed support for the KZM board and i.MX31 processor in 2015.

Unresolved questions

We are not aware of any unresolved questions, but welcome discussion on this RFC.

New Capability for seL4 SMC Forwarding

  • Author: Robbie Van Vossen
  • Proposed: 2021-11-06

Summary

This RFC proposes to add a new, badged capability which certain threads can invoke so seL4 can make SMC calls on ARM platforms in EL2 (virtualized mode) on the thread’s behalf.

Motivation

Currently, seL4 running in EL2 traps all SMC calls. Then, if a handler is set up, seL4 calls it to deal with the call. Some SMC calls are emulated, such as PSCI for the camkes-vm, but most are simply ignored. However, some SMC calls may actually need to reach the Supervisor, and this configuration does not allow that to happen. This RFC proposes a solution to this problem: a new, badged capability which can be given to specific threads to allow them to forward configured SMC calls to seL4, so that the microkernel makes the actual SMC call on the thread's behalf.

Guide-level explanation

This change will allow you to forward SMC calls to the microkernel, emulate them, or block them, based on platform-specific configuration. When a VM makes an SMC call, the microkernel intercepts it, as it does for all SMC calls made at EL1. It then calls handle_smc() for the corresponding VMM thread. That function decodes the specific SMC call and decides whether to block it, emulate it, or forward it back into seL4. There are already examples of emulation and blocking for multiple platforms. The zynqmp platform will have an example of forwarding the SMC with the new API call, seL4_ARM_SMC_Call().

New concepts

seL4_ARM_SMC_Call()

seL4_ARM_SMC_Call() is the new ObjectInvocation API call and can be called in two ways:

  1. A VM attempts to make an SMC call which traps into the microkernel. The microkernel informs the VMM’s handler and the VMM decides that the SMC call really needs to be made. The VMM then calls seL4_ARM_SMC_Call() with the SMC capability and all the arguments that the VM provided.

  2. Directly by constructing the SMC call from the thread with the SMC capability. This means that a trap isn’t needed.

Our expected use case is the first approach, in the camkes-vm, with VMM handlers acting on behalf of their VMs. We are not sure whether there is a use case for the second approach, but it is certainly possible. In either case, the function requires the caller to set all 8 registers, even if the specific SMC call uses fewer arguments.

seL4_ARM_SMC

seL4_ARM_SMC is the new capability. It needs to be given to any thread that you expect to call seL4_ARM_SMC_Call(); otherwise the call will be denied. We expect that in most use cases it will be given to each VMM thread.

The capability is badged, and the badge represents the SMC Function Identifier. Therefore, the specific SMC calls that each thread can make are configurable and checked by seL4.

For users

We plan to implement this configuration as an example configuration of camkes-vm for the zynqmp platform. Due to the WCET implications of this approach, it will not be the default for all platforms. It should only be enabled when actually needed for a specific platform or application. Without this change, Linux is unable to run as a guest on camkes-vm for the zynqmp platform.

Here is an example for how to enable SMC calls for some VMs in the devices.camkes file implemented for a platform in camkes-vm-apps.

vm0.allow_smc = true;
vm0.allowed_smc_functions = [
    0xC2000001, // PM_GET_API_VERSION
    0xC2000017  // PM_FPGA_GET_STATUS
];

vm1.allow_smc = true;
vm1.allowed_smc_functions = [
    0xC2000001, // PM_GET_API_VERSION
    0xC200000D, // PM_REQUEST_NODE
    0xC200000E  // PM_RELEASE_NODE
];

vm2.allow_smc = false;
vm2.allowed_smc_functions = [];

As this example shows, each VM can be configured to allow or disallow SMC calls. For each VM with SMC calls enabled, you then specify which SMC functions it may call. These are integers defined per platform. VMs can share the same functions, and they can also be given different ones.

Reference-level explanation

This implementation will meet the SMC Calling Convention specification from ARM (ARM DEN 0028E v1.4) for 32-bit SMC calls. 64-bit SMC calls are partially supported, but full support could be added later.

seL4_ARM_SMC_Call()

static int seL4_ARM_SMC_Call(seL4_ARM_SMC _service,
    seL4_ARM_SMCContext *smc_args,
    seL4_ARM_SMCContext *smc_response
)
Parameters
    [in]  _service      Capability to allow threads to interact with SMC calls.
    [in]  smc_args      The structure that has the provided arguments.
    [out] smc_response  The structure to capture the responses.

Returns
    0 on success, or an seL4 error code otherwise.

struct seL4_ARM_SMCContext

typedef struct seL4_ARM_SMCContext_ {
    /* register arguments */
    seL4_Word x0, x1, x2, x3, x4, x5, x6, x7;
} seL4_ARM_SMCContext;
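As a usage sketch, a VMM handler forwarding a trapped SMC on behalf of its VM (use case 1 above) might look like the following. The vm_t type and the register-access helpers are hypothetical, not part of the proposed API:

/* Sketch of use case 1: a VMM forwards a trapped SMC on behalf of its
 * VM. The vm_t type and register-access helpers are hypothetical. */
void forward_trapped_smc(seL4_ARM_SMC smc_cap, vm_t *vm)
{
    seL4_ARM_SMCContext args, response;

    /* All 8 registers must be set, even if the call uses fewer. */
    args.x0 = vm_get_reg(vm, 0);   /* SMC Function Identifier */
    args.x1 = vm_get_reg(vm, 1);
    args.x2 = vm_get_reg(vm, 2);
    args.x3 = vm_get_reg(vm, 3);
    args.x4 = vm_get_reg(vm, 4);
    args.x5 = vm_get_reg(vm, 5);
    args.x6 = vm_get_reg(vm, 6);
    args.x7 = vm_get_reg(vm, 7);

    if (seL4_ARM_SMC_Call(smc_cap, &args, &response) == 0) {
        vm_set_reg(vm, 0, response.x0);   /* return the result to the VM */
    }
}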

Drawbacks

There are a couple of drawbacks to allowing SMC calls to actually go through. Both relate to allowing a supervisor to do anything in the system past initialization time.

One drawback is that the worst case execution time (WCET) is no longer guaranteed for the system. Each SMC call can take an arbitrarily long amount of time. This means a thread can essentially lock up all of the cores for an unspecified amount of time. The core the SMC call is made on is obviously blocked because the supervisor will not return to seL4 until it completes its function. The other cores will become locked as soon as seL4 runs and tries to grab the Big Kernel Lock (BKL). One potential fix for the multicore issue is to release the BKL before making the SMC call. However, that is out of the scope of this RFC as it has a lot of implications. The current approach will be to have this config option off by default and add documentation about how enabling it affects WCET.

Another drawback is the effect this has on assurance. Since the Supervisor runs at a higher privilege level than seL4, it can make changes to the system of which seL4 is unaware. This would likely add to the assumptions made if the proofs ever covered this configuration setting: the proofs would have to assume that the SMC calls that are made do not invalidate them. It is up to the user to examine what each allowed SMC call does and to check that it won't interfere with the proofs.

Rationale and alternatives

This design is the best because it gives users flexibility by allowing them to make some SMC calls while still being able to emulate other SMC calls. It is the preferred method, because the microkernel shouldn't need to be updated with each new platform, VMM configuration, and secure application implementation. It can all be configured in user space.

Alternative approaches

SMC filtering could be done in the microkernel so that developers are less likely to configure the system incorrectly. However, there are cases where the filtering depends on specific threads and on the application level (secure-world communication), rather than just on the platform. On the first point, instead of creating one capability for making SMC calls, there could be a capability for each numbered SMC call, so that the root thread could configure which SMC call each thread can make. However, that is quite a bit of work, as it would differ for each ARM platform and add another step when porting the microkernel to new ARM platforms. It could also address the second point, but then users would likely need to modify the microkernel to support their specific secure-world application, which is something that should really be avoided.

Another part of the implementation that could be done differently is using the VCPU capability with an access right instead of creating a new type of capability. The kinds of access rights are already determined as read, write, grant, and grantreply. A write access right would probably make the most sense, but it still seems obtuse that a write access on a VCPU capability would allow the invoker to make an SMC call. Besides that, there are two other issues I see with this approach:

  1. I can foresee a use-case where you have a health monitor component that isn’t associated with a VCPU but needs to be able to forward an SMC call to actually interact with EL3 or secure applications.

  2. The VCPU capability may later need read/write access rights that make more sense for the type of object it is, for example, limiting which threads can actually read or write that VCPU's registers.

For those reasons we believe a new capability makes the most sense.

Impact of not implementing this

Not implementing this will limit the ARM platforms we can use for the camkes-vm and/or the guests that can run on those platforms. It will also limit support for applications that run on the secure side of TrustZone and need to communicate with the non-secure side (where seL4, the VMMs, and the VMs run).

Prior art

Xen has a similar implementation for the zynqmp, where arbitration is done to allow, disallow, or emulate SMC calls from guest VMs. The arbitration is quite specific and is automatically generated based on resource allocation for each VM. Xen does not have VMM pieces that run in user space; essentially all of the management runs in EL2. Therefore, all of the arbitration is done in Xen itself. This approach does not make sense for seL4, for the reasons stated above in the Rationale section.

However, this does show a need for this feature in hypervisors on ARMv8 platforms.

If necessary, we could reference the Xen arbitration code to help improve the default arbitration for the zynqmp in our implementation.

Unresolved questions

  1. How do we address the WCET issue? (Doesn’t need to be solved by this RFC because we are documenting the effects)

Disposition

The TSC has approved the API design in this RFC for implementation as presented in the current state, but notes that the following two issues should be clarified in the RFC and corresponding documentation:

  • option 2 for a thread to make SMC calls is to invoke the SMC cap directly, not to switch off trapping on SMC calls for specific threads

  • the actual SMC performed by seL4 happens while the kernel holds the kernel lock. That means kernel entry will be blocked on other cores until the SMC has terminated.

The second point is basically the WCET discussion, but the TSC felt it important to clearly spell out the consequences.

AArch64: remove seL4_ARM_PageDirectory and seL4_ARM_PageUpperDirectory

  • Author: Kent McLeod
  • Proposed: 2022-03-11

Summary

This RFC proposes to remove two AArch64 VSpace object types, seL4_ARM_PageDirectory and seL4_ARM_PageUpperDirectory. The functionality previously provided by these types will be provided by the existing seL4_ARM_PageTable object type. This allows for a simpler API and enables a smaller kernel implementation that will be easier to verify.

Motivation

This change is motivated by an ongoing project to verify the AArch64 design and implementation. Unifying the page table object types was attempted during the RISCV64 verification project and successfully reduced verification complexity. Applying the same design strategy to AArch64 should make its verification easier.

The change is also motivated by reducing the complexity of the kernel's user API: there are fewer object and capability types and fewer distinct invocation-object pairs, while the same functionality is maintained. The ability to use PageTable objects at any level of the page-table hierarchy can also make virtual address space resource management easier, as objects become interchangeable.

Guide-level explanation

This change only affects the AArch64 architecture, but affects every configuration.

This change requires removing the seL4_ARM_PageDirectory and seL4_ARM_PageUpperDirectory object types. This means that it will no longer be possible to create these objects from Untyped capabilities. To make up for this, seL4_ARM_PageTable objects will be able to be mapped into an seL4_ARM_VSpace object or other seL4_ARM_PageTable objects.

This change will require modifications to existing user code. Every usage of an seL4_ARM_PageDirectory or seL4_ARM_PageUpperDirectory will need to be updated to instead use an seL4_ARM_PageTable. This update should be reasonably direct, as all three object types have the same kinds of invocations, are the same size, and contain the same number of table entries.

Reference-level explanation

The sections covering the seL4_ARM_PageDirectory and seL4_ARM_PageUpperDirectory interfaces have been removed from the libsel4 API XML interface definitions. The kernel will no longer recognize them as objects.

It's possible to redefine all of the deleted constants and symbols in the libsel4 implementation using CPP macro definitions in a deprecated.h header file. This allows migration with minimal manual updating. Manual updates may still be required where the programming language expects each object type to have a distinct number, such as in a C switch statement.
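For instance, such a shim might contain definitions along these lines (an illustrative sketch; the names actually provided may differ):

/* deprecated.h (sketch): map the removed types and invocations onto
 * their seL4_ARM_PageTable equivalents. Illustrative only. */
#define seL4_ARM_PageDirectoryObject       seL4_ARM_PageTableObject
#define seL4_ARM_PageUpperDirectoryObject  seL4_ARM_PageTableObject
#define seL4_ARM_PageDirectory_Map         seL4_ARM_PageTable_Map
#define seL4_ARM_PageDirectory_Unmap       seL4_ARM_PageTable_Unmap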

In addition to a patch to the kernel, there is a patch to seL4_libs showing the minimal set of changes required to migrate the seL4test and seL4bench applications to the new API.

Internally, the kernel will use a single pte_t datatype to represent all page table entries in the VSpace implementation. This allows for considerably less code duplication for mapping and unmapping operations as well as for object creation and revocation. This new implementation follows the design taken with the RISC-V architecture.

Drawbacks

The main drawback of this change is that it is API-breaking: existing user-space programs will require their sources and binaries to be updated to use the newer kernel version. However, this will only affect configurations that have not previously been verified, which should already expect API breakages as part of a verification process.

Rationale and alternatives

As AArch64 will be the third 64-bit architecture to go through a verification process we are able to make decisions that are informed by the experiences of the x86_64 and RISCV64 projects. In particular, the reduction in effort that this type unification allows was directly seen when comparing the x86_64 project, which used a separate object type for each level of the page table hierarchy, to the RISCV64 project which uses a single type.

One change that has been made compared to the RISC-V design is using a different type for the seL4_ARM_VSpace object. This object serves as the root page table of the address space. The AArch64 virtual memory architecture allows the top level page table to be sized differently depending on the size of the address space. This object also supports additional invocations that can't be applied to intermediate page table objects and it is also likely to gain more invocations in the future.

Prior art

Having a single page table object type is based on the design of the AArch64 virtual memory system architecture which sets out to reuse page table descriptor formats at all levels of the virtual address space.

Unresolved questions

There are existing pull requests that showcase the proposed changes.

As always, verification progress can lead to design or implementation changes as bugs are found or verification blockers are encountered.

Disposition

The TSC has voted to approve this RFC for implementation without changes.

The seL4 Microkit

(Previous title: "seL4 Core Platform")

Summary

The seL4 Microkit is an operating system (OS) personality for the seL4 microkernel.

The purpose of seL4 Microkit is to:

  • provide a small and simple OS for a wide range of IoT, cyber physical and other embedded use cases;

  • provide a reasonable degree of application portability appropriate for the targeted use cases;

  • make seL4-based systems easy to develop and deploy within the target areas;

Motivation

The seL4 microkernel provides a set of powerful and flexible mechanisms that can be used for building almost arbitrary systems. While this flexibility minimises constraints on the nature of system designs and the scope of deployments, it makes it challenging to design the best system for a particular use case, requiring extensive seL4 experience from developers.

The seL4 software ecosystem currently provides various options for developing systems on seL4, however, such support is both overly broad (providing too many options) and simultaneously too limited for many developers.

The seL4 Microkit addresses this challenge by constraining the system architecture to one that provides enough features and power for this usage class, enabling a much simpler set of developer-visible abstractions.

The goals of the microkit are to:

  • provide well-defined hardware interfaces to ease porting of the microkit;

  • support a high degree of code reuse between deployments;

  • provide well-defined internal interfaces to support diverse implementations of the same logical service, to adapt to usage-specific trade-offs, and to ease compatibility between implementations of system services from different developers;

  • leverage seL4's strong isolation properties to support a near-minimal trusted computing base (TCB);

  • retain seL4's trademark performance for systems built with it;

  • be, in principle, amenable to formal analysis of system safety and security properties (although such analysis is beyond the initial scope).

Guide-level explanation

The seL4 Microkit consists of:

  • an seL4 Microkit specification, which defines the key abstractions and Microkit API

  • an seL4 Microkit implementation, which provides implementations of the key abstractions and API

  • definitions of key OS service interfaces and implementations of these

Scope

The initial focus and scope of this RFC is on defining the key seL4 Microkit abstractions, which build on seL4 abstractions.

The seL4 Microkit abstractions are:

  • processor core (core)
  • protection domain (PD)
  • communication channel (CC)
  • memory region
  • notification
  • protected procedure call (PPC)
  • virtual machine (VM)

More details about these abstractions, and how they build on seL4 abstractions, are available in https://github.com/BreakawayConsulting/platform-sel4-docs/blob/master/sel4-platform.md

Reference-level explanation

While this RFC initially targets the main microkit abstractions, ultimately the microkit will provide an SDK (including interfaces, libraries, services, and tools) for building seL4 Microkit based systems.

The microkit implementation (and SDK) could be based on, and incorporate, existing technology such as CAmkES, capDL, the seL4 runtime, and the seL4 driver framework; however, this has not been decided yet.

The seL4 Microkit will initially be limited to a small set of hardware platforms. This is to ensure that effort is focussed on fully supporting the microkit on this hardware. Ports to other hardware can be contributed by others.

Drawbacks

The RFC currently only presents the seL4 Microkit concepts. These have not yet been implemented nor have they yet been used to build concrete systems.

While the intention of the seL4 Microkit is to focus seL4 ecosystem and developer effort on a single (well-defined and supported) model, the danger is that it adds yet another set of tools and platforms/frameworks to the mix, making it even less clear what developers should use.

Rationale and alternatives

The intention is that experience from building seL4-based systems and from developing and using existing platforms/frameworks/libraries/tools will lead to a more suitable and usable platform definition, one that supersedes some of the earlier platforms/frameworks/libraries/tools and to which new (and existing) developers will migrate.

Prior art

Prior (and similar) art includes:

  • CAmkES
  • Genode
  • TRENTOS
  • AADL and HAMR
  • feL4
  • various other (ad hoc) software

Unresolved questions

This RFC addresses the first step towards defining the full seL4 Microkit.

It does not provide a roadmap to the further steps.

Disposition

The TSC has approved this RFC in its 12 Sep 2023 meeting and also approved a name change to the Microkit (or seL4 Microkit if context is needed).

The design concerns from previous meetings have been addressed and implementation is expected to converge to that design. The repository will be moved to the seL4 GitHub organisation.

MCS: set fault and timeout handler parameters while configuring TCBs

  • Author: Corey Lewis
  • Proposed: 2022-07-22

Summary

This RFC proposes to modify TCB configuration functions so that it is possible to directly set the badge and rights of the fault and timeout handler Endpoint Caps. This would make it consistent with how the CSpace and VSpace of threads are set.

Motivation

This change is motivated by wanting to simplify the process of configuring fault and timeout handlers when using a system with the MCS extensions enabled. Without this change, setting the badge or rights of a fault handler requires first minting a new copy of the Endpoint Cap with the desired parameters, and then installing it in the TCB. In many systems this additional copy of the cap ends up being unnecessary.

A longer term motivation is that we would like to port how fault handlers are configured on MCS to mainline seL4. This would resolve SELFOUR-29, improve consistency, and provide increased functionality to systems that do not use MCS. However, this next step would be very difficult without this change, due to technicalities in the existing verification approach of the CapDL Loader.

Guide-level explanation

On systems with the MCS extensions enabled, caps to a TCB's fault and timeout handler endpoints are installed directly in the TCB. This change adds two new parameters to the relevant API calls used to install these caps (TCBSetSchedParams, TCBSetTimeoutEndpoint, TCBSetSpace). These parameters can be used to set the badge and rights of the fault and timeout handler endpoints being installed. In cases where the badge and rights do not need to be set, the existing approach is still supported by setting both parameters to 0.

See seL4/capdl#47 for an example of this new functionality being used. The CapDL Loader previously minted a new endpoint cap for the fault handler with the required badge and rights. With this change it no longer needs to mint the extra cap and instead seL4 sets the fields while installing the endpoint caps in the TCB.

seL4/sel4test#78 is an example of updating existing code without using the new functionality. As the seL4 tests generally do not need to set the badge and rights, they are able to be updated for this change by just using seL4_NilData and seL4_NoRights as the parameters.
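For illustration, a post-change call installing a badged fault handler might look roughly like the sketch below. The parameter names and their order are assumptions; the authoritative signature is the one in the kernel pull request referenced in the next section.

/* Hypothetical sketch of the extended call; parameter names and order
 * are assumptions, not the authoritative API (see seL4/seL4#889). */
seL4_Error err = seL4_TCB_SetSpace(tcb,
    fault_ep,        /* fault handler endpoint capability         */
    fault_badge,     /* new: badge applied while installing       */
    seL4_AllRights,  /* new: rights mask applied while installing */
    cspace_root, cspace_root_data,
    vspace_root, vspace_root_data);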

Reference-level explanation

The API calls TCBSetSchedParams, TCBSetTimeoutEndpoint and TCBSetSpace have each had two new parameters added, and their corresponding decode functions expect two additional syscall arguments. These arguments are used in the same pattern as used when setting the CSpace and VSpace of a TCB. When the parameters are non-zero then the provided cap is modified with updateCapData and maskCapRights, and then it is confirmed that it is safe to copy the capability by calling deriveCap. Following this, the decode functions then confirm that the modified cap is a valid cap through validFaultHandler. seL4/seL4#889 is the current proposal for what these changes would look like.
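In simplified kernel pseudocode, the shared decode pattern looks roughly like this. updateCapData, maskCapRights, deriveCap, and validFaultHandler are existing kernel functions, but the control flow shown here is illustrative, not the verbatim proposal:

/* Illustrative kernel pseudocode for the shared decode pattern;
 * error handling is abbreviated. */
fh_cap = updateCapData(false, badge, fh_cap);   /* apply the badge  */
fh_cap = maskCapRights(rights, fh_cap);         /* apply the rights */
dc_ret = deriveCap(slot, fh_cap);               /* safe to copy?    */
if (dc_ret.status != EXCEPTION_NONE) {
    return dc_ret.status;
}
if (!validFaultHandler(dc_ret.cap)) {           /* endpoint or null */
    return EXCEPTION_SYSCALL_ERROR;
}
/* ... install dc_ret.cap as the TCB's fault/timeout handler ... */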

In addition to the patch to the kernel, there are also patches for seL4_libs, sel4test, and capdl which show the required changes to migrate these projects to the new API. The patch to sel4test is an example of a minimal migration which does not use the new functionality, while the patch for capdl is a migration that does make use of it.

Drawbacks

On some platforms, the extra parameters that this change adds might exceed the hardware limit on the number of message registers, which would cause the syscall to go into the IPC buffer. In the worst case, if this does become an issue then it could possibly be avoided by modifying the API so that each syscall performs fewer operations and hence does not require as many arguments.

This implementation also means that seL4 will silently downgrade some incorrect capabilities to NullCaps, rather than returning an error, if a user attempts to set them as a TCB's fault or timeout handler. This behaviour currently occurs when setting the CSpace and VSpace and is generally safe, although it could cause confusion if it hides a user error. If this behaviour is not desired, we could separately check for it and explicitly return an error.

Rationale and alternatives

This design provides increased consistency within seL4, as it makes setting the fault and timeout handlers of TCBs exactly the same as setting their CSpace and VSpace.

An alternative design that was considered, with a minor difference, was combining the badge and rights parameters into one word, in the same fashion as the seL4_CNode_CapData structure. While this would lead to a slightly simpler API, due to only needing one parameter, on 64-bit platforms it would limit the badge to 60 bits instead of the full 64. Since existing users often make use of all available bits in badges, this alternative was not chosen.

Unresolved questions

There are existing pull-requests that demonstrate the proposed changes.

There is also some initial work on updating the seL4 proofs for this proposal at seL4/l4v#505. While initial verification efforts suggest that no further implementation changes are required to maintain verifiability, this work is not yet completed and progress might lead to issues being discovered.

Disposition

The TSC has approved this RFC in terms of API design as is, leaving the decision on a potential compatibility wrapper in libsel4 or not to the PR discussion in the implementation stage.

MCS: Improve constraints on grant via reply

  • Author: Matthew Brecknell
  • Proposed: 2022-11-04

Summary

In MCS configurations, the ability to grant capabilities when replying to a call is determined entirely by the grant right on the reply capability. If a server can create reply objects by retyping untyped memory, then it can also grant capabilities to its clients. Consequently, many realistic systems would not have the authority confinement property that we would expect.

This RFC proposes two alternatives:

  • Both alternatives restrict the server's ability to grant via reply according to whether or not the grant right is present on the receive endpoint capability at the time the server called seL4_Recv.

  • The first alternative keeps the grant bit on reply object capabilities. Servers that want to grant via reply must have grant rights on both the receive endpoint capability and the reply object capability.

  • The second alternative removes the grant bit from reply object capabilities. Servers that want to grant via reply only need a grant right on the receive endpoint, but can't make reply object capabilities that mask those grant rights.

Motivation

The seL4 integrity theorems show that seL4, in non-MCS configurations, can enforce access control policies. For example, one can construct a system in which untrusted client and server components can communicate through an endpoint, but cannot write to each other's private memory.

The theorems are conditional: They require system setups that constrain the propagation of authority. In particular, grant rights are too powerful for the access control model, so one of the conditions is that untrusted components cannot grant or be granted capabilities.

We would like to prove similar integrity theorems for MCS configurations, and therefore need appropriate mechanisms to limit capability grant rights between untrusted components. Currently, the only way to prevent an MCS server granting to its clients is to prevent it from retyping untyped memory. That's too severe a limitation. It's also an indirect mechanism, which would be difficult to express in an access control model.

This RFC proposes a more direct mechanism for limiting an MCS server's ability to grant to its clients, and should provide a clearer path to useful integrity theorems for MCS configurations.

Guide-level explanation

We give a brief account of the history that led to the current situation.

In the non-MCS kernel, there are no reply objects. Reply capabilities are generated by the kernel when a call is received through an endpoint, and inserted into a slot in the receiver's TCB. The kernel sets the grant right on the reply capability iff the grant right was set on the endpoint cap the receiver used when it called seL4_Recv.

The intention of the design is that the receiver's endpoint capability controls the receiver's ability to grant capabilities when it replies. The implementation is the grant right on the reply capability, but the kernel ensures that the grant right on the generated reply capability is determined by the grant right on the receiver's endpoint capability. The intention is important, because endpoint rights are crucial to the ability to construct systems with useful access control properties.

MCS changed the object-capability model, introducing reply objects and scheduling contexts. Reply objects make protected procedure call stacks more explicit, allowing scheduling contexts to be passed from caller to receiver and back again, potentially recursively. Reply object capabilities are supplied by the receiver, and the kernel internally links the reply object with the calling thread when a call is received.

With respect to grant rights for replies, MCS reused parts of the implementation from the non-MCS kernel, but did not capture its intention. In both MCS and non-MCS, the ability to grant when replying is controlled by the grant bit on the reply capability. However, the way that bit is set is very different. In MCS, the grant right on a reply object capability is typically controlled by the receiver, if it can create reply objects. The grant right on the receiver's endpoint capability has no effect. Consequently, it is harder to construct MCS systems with useful access control properties.

In this RFC, we propose changes that limit an MCS receiver's ability to grant when replying, according to the rights on the receiver's endpoint capability. The kernel propagates the endpoint grant right via the reply object.

Reference-level explanation

We propose to add a boolean canGrant field to the MCS reply object. It is controlled by the kernel, and is only meaningful when the reply object is associated with either a receiver waiting for a caller, or a caller waiting for a reply. In those cases, the canGrant field indicates whether the capCanGrant bit was set on the receiver's endpoint capability when it called seL4_Recv or seL4_ReplyRecv.

Regardless of whether a caller is waiting on the endpoint, the canGrant field of the reply object is set in receiveIPC, which is part of either seL4_Recv or seL4_ReplyRecv. In receiveIPC, the kernel has access to both the receiver's endpoint capability and its reply object capability, so it can directly copy the capCanGrant flag from the receive endpoint capability to the canGrant field of the reply object.
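
As a rough sketch, the copy could look as follows inside receiveIPC, where epCap is the receiver's endpoint capability and replyCap its reply object capability. This is simplified and illustrative; the accessor names follow the kernel's existing bitfield conventions, but it is not the verbatim implementation:

    /* Sketch: excerpt from the MCS receiveIPC path. Propagate the receive
     * endpoint's grant right into the proposed canGrant field. */
    if (cap_get_capType(replyCap) == cap_reply_cap) {
        reply_t *replyPtr = REPLY_PTR(cap_reply_cap_get_capReplyPtr(replyCap));
        /* The field now reflects the endpoint's capCanGrant bit, whether
         * or not a caller is already waiting. */
        replyPtr->canGrant = cap_endpoint_cap_get_capCanGrant(epCap);
    }
    /* ... remainder of the receive path unchanged ... */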

If there is already a caller waiting, then the send phase of the call completes immediately, and the caller becomes blocked on the reply object.

If the receiver has to block waiting for a caller, then either:

  • The receiver's IPC is cancelled before a caller arrives, and the reply object becomes dissociated from the receiver. In this case, the canGrant value that was set on the reply object is never used.

  • The receiver remains blocked until a caller arrives, and the canGrant field of the reply object still reflects the capCanGrant bit that was set on the receiver's endpoint capability. The send phase completes, and the caller becomes blocked on the reply object.

Therefore, regardless of how the caller becomes blocked on the reply object, the canGrant field of the reply object reflects the capCanGrant bit on the endpoint capability used to receive that call. This is the property we'll need to prove a useful integrity theorem.

This RFC includes two alternatives:

  • The first alternative only makes the change just described. The grant bit on the reply object capability (which is separate from the canGrant field of the reply object) is retained, and acts as a mask against the grant right on the receiver's endpoint capability. Therefore, to grant via a reply, a receiver would need grant rights on both the receive endpoint capability and the reply object capability.

  • The second alternative makes the change just described, and additionally removes the grant bit on reply object capabilities. To grant via a reply, a receiver only needs a grant right on the receive endpoint, but can't make a reply object capability that masks a grant right conferred by the receive endpoint.

Drawbacks

The two alternatives are compared in the next section.

Both alternatives have the potential to break existing MCS systems that include threads that:

  • receive calls on endpoint capabilities without grant rights, yet

  • expect to be able to grant capabilities when replying.

Any such systems would need to be updated to add grant rights to the receive endpoint capabilities.

Unfortunately, the breakage will only become evident when the caller tries to use the capabilities it should have been granted.

A mitigating factor is that non-MCS systems already require the grant bit to be set appropriately on receive endpoint capabilities, so systems ported from non-MCS to MCS should not be affected.

Rationale and alternatives

We've already covered the rationale, so it only remains to compare the two alternatives.

Retaining the grant bit in reply object capabilities

The first alternative retains the grant bit in the reply object capability. The code change is minimal. Any existing MCS systems that make use of the ability to mask grant rights via the reply object capability will continue to enjoy that ability, so the change does not give any thread permissions it did not already have.

Removing the grant bit in reply object capabilities

The second alternative removes the grant bit in the reply object capability. This results in a simpler implementation and API semantics, but removes a feature that might be used in some MCS systems. There is no corresponding feature in the non-MCS kernel, so the feature removal is only potentially relevant to MCS systems.

Unfortunately, the only reasonable way to remove the feature is silently. This means that an existing MCS system that currently uses the feature to mask the grant rights of reply object capabilities will continue to work (assuming the corresponding receive endpoint capabilities have grant rights), but might include threads which gain permissions that the system designer did not intend. That is, there might be threads that did not previously have the right to grant capabilities when replying, that could gain that right when this alternative is implemented.

The rationale for this alternative is that we think it is unlikely that there are any such systems. We are not aware of any real use cases, and have not been able to construct a hypothetical use case that is not more easily implemented without this feature.

If there are any use cases, we would like to hear about them.

Prior art

The main inspiration for this change is the non-MCS version of seL4, which:

  • generates reply capabilities in the kernel when a call is received through an endpoint, and

  • sets the grant right on the generated reply capability iff the grant right was set on the endpoint capability the receiver used when it called seL4_Recv.

Unresolved questions

Is there a realistic use case for a grant bit on the reply object capability, that can be used to mask a grant right conferred by the receive endpoint capability?

This is the main factor in choosing between the two alternatives. So far, the discussion on this issue has only shown a hypothetical use case, which did not survive scrutiny, and which, for non-MCS systems, was based on a mistaken understanding of the API semantics.

If you do use seL4_CNode_Mint to mask the grant rights of MCS reply object capabilities, we would like to hear from you.

Disposition

The TSC has approved the option of the RFC that removes the grant right on the reply cap (the second alternative). Should a particular use case for it appear after all, it should be easy enough to consider another RFC to add it back.

seL4 multikernel IPI API

Summary

Proposal to add an seL4 capability to generate interprocessor interrupts (IPIs).

Motivation

There's a proposed approach for enabling multicore seL4 while still using verified single-core configurations -- a partitioned multikernel.

As part of this approach, we need a way to both send and receive asynchronous signals between different cores in a system. Receiving IPIs from other cores is currently supported by existing IRQ handling mechanisms. We need to add a new mechanism for sending asynchronous signals to another core by generating IPIs.

It's desirable for this mechanism to have minimal verification effort to add to the baseline verified kernel implementations.

It could also be desirable for this mechanism to be able to scale from a couple of cores up to very large multicore systems with tens or even hundreds of cores.

Guide-level explanation

There are already hardware IPI mechanisms present on all supported platforms that the SMP kernel uses for sending and receiving signals between multiple cores. We can use the same mechanisms but give user level more direct access via additional APIs for generating and receiving IPIs directly. This would allow each single-core kernel instance to send interrupts to other kernels and receive interrupts from other kernels while having zero shared kernel state between cores. Each kernel can treat the other kernels as devices that it can exchange interrupts with, delegating any shared-memory access to user level.

Breaking this new API up into two parts:

  • we need a way to send interrupts to a remote core from user space,

  • and we need a way to receive interrupts from a remote core and deliver them to user space.

Sending interrupts involves creating a new capability type, an IPI cap, that can be created from an IRQ control cap.

  • The IRQControl invocation to create the new IPI cap encodes any hardware-specific configuration that will be used whenever the IPI cap is invoked.

  • The IPI cap is invoked in the same way as a notification send cap: the sender cannot be blocked, and the kernel is expected to deliver at most one event to the target receiver.

  • The target receiver is going to be running in user space on a remote core.

  • For the first implementation we'll only support unicast signalling, but broadcast and multicast functionality can be added later via extensions to the IRQControl invocation that creates the IPI cap.

It is already possible to register send-notification-caps that are invoked when an interrupt is received, so receiving IPIs doesn't require adding any new caps and only requires limited IRQControlCap invocation changes on some platforms.

  • The architecture specific IRQControl invocation for creating IRQ handler caps will need to allow creating IRQ caps for receiving IPIs.

  • Then these will work with notification objects the same as all other IRQ handler caps.

  • On Arm it's already possible to request IRQ caps to the IPI interrupt IDs so no changes are required.

  • On RISC-V the IRQControl invocation will need to be extended to allow the SWI interrupt to be delivered to user level.

  • On x86 it should already be possible to receive IRQs for a specified vector and so changes to the IRQControl invocation shouldn't be required.
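
To make the intended flow concrete, here is a hypothetical user-level sketch for Arm. The names of the new SGI invocations (seL4_IRQControl_GetSGISignal, seL4_SGISignal_Send) are illustrative only; the final names are settled at implementation time:

    #include <sel4/sel4.h>

    /* Hypothetical setup on Arm/GIC; only seL4_IRQControl_Get and
     * seL4_IRQHandler_SetNotification are existing API. */
    void setup_cross_core_signal(seL4_IRQControl irq_control, seL4_CNode cnode,
                                 seL4_Word sgi_slot, seL4_Word irq_handler,
                                 seL4_CPtr ntfn)
    {
        /* Sender side: mint a cap that authorises sending SGI 1 to core 2. */
        seL4_IRQControl_GetSGISignal(irq_control, 1, 2, cnode, sgi_slot,
                                     seL4_WordBits);

        /* Receiver side (kernel instance on core 2): route SGI 1 to a
         * notification, exactly as for any other IRQ handler cap. */
        seL4_IRQControl_Get(irq_control, 1, cnode, irq_handler, seL4_WordBits);
        seL4_IRQHandler_SetNotification(irq_handler, ntfn);
    }

    /* Sending is then like a notification send, with no extra parameters. */
    void notify_remote_core(seL4_CPtr sgi_cap)
    {
        seL4_SGISignal_Send(sgi_cap);
    }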

Reference-level explanation

A quick note on terminology, since each architecture uses a different name for inter-processor interrupts:

  • x86 = IPI (Inter-processor interrupts)

  • Arm = SGI (Software generated interrupts)

  • RISC-V = SI/SWI (Software interrupts)

So while this RFC will use IPI to refer to all three, I expect that the new invocations we add for creating the caps will use the arch-specific names, as IRQs are currently all handled by arch-specific invocations.

Sending IPIs depends on the platform's interrupt controller:

  • on Arm cores that use the Generic Interrupt Controller, there are 16 different interrupt IDs that can be used to receive IPIs. A core can thus send up to 16 different signals to each target. If more signals are required, the interrupt IDs must be reused, and a user-space shared-memory protocol is needed to distinguish between messages.

  • on RISC-V cores there is only one interrupt ID for an IPI. A core can send IPIs to multiple cores at once using the SBI API.

  • on x86 cores IPIs can have an arbitrary vector ID between 0 and 255 (excluding reserved vectors). Sending an IPI can be to any target and with any vector.

At a minimum this is sufficient to build a user-level component that implements more sophisticated software signal multiplexing over a single hardware interrupt, as sketched below. This approach creates no shared kernel state and avoids concurrency.
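
As an illustration, here is a minimal sketch of such a multiplexer, assuming one word of memory shared between sender and receiver cores and an IPI cap that is invoked like a notification send cap (all names besides the standard libsel4 calls are hypothetical):

    #include <sel4/sel4.h>

    /* One shared word of pending logical signals; mapping it on both
     * cores is assumed to be set up elsewhere. */
    static seL4_Word *pending;

    /* Sender core: mark logical signal 'sig' pending, then send one IPI. */
    void send_logical_signal(seL4_CPtr ipi_cap, unsigned int sig)
    {
        __atomic_fetch_or(pending, (seL4_Word)1 << sig, __ATOMIC_RELEASE);
        seL4_Signal(ipi_cap);
    }

    /* Receiver core: wait for the IPI notification, then drain the bitmap. */
    void receive_logical_signals(seL4_CPtr ntfn, seL4_CPtr irq_handler)
    {
        seL4_Wait(ntfn, NULL);
        seL4_Word bits = __atomic_exchange_n(pending, 0, __ATOMIC_ACQUIRE);
        for (unsigned int i = 0; bits != 0; i++, bits >>= 1) {
            if (bits & 1) {
                /* ... handle logical signal i ... */
            }
        }
        seL4_IRQHandler_Ack(irq_handler);
    }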

Drawbacks

Some scalability limitations include:

  • Using different interrupt lines to distinguish cores requires at least as many IRQ lines available for IPIs as there are cores in the system, which is typically not the case. When there are more cores than IRQ lines, telling cores apart requires an additional shared-memory protocol implemented at user level.

Rationale and alternatives

Platform-specific extensions:

On Arm, the GICv2 and GICv3 have 16 interrupt lines reserved for SGIs, which allows 16 different channels to be supported on each core. This may be helpful for systems whose requirements are simple enough that they can use direct software signalling without implementing a user-level multiplexing component.

A future step may be to use direct memory access between cores, allowing shared memory to carry more information with each interrupt. However, this leads to increased complexity in designing and implementing protocols that can guarantee at-least-once delivery of all signals with reasonable scalability overhead.

Prior art

seL4_DebugSendIPI(seL4_Uint8 target, unsigned irq) has existed for several years as a debug function for generating SGIs on Arm platforms. The current proposal takes this functionality and adds capability management for it, as well as support for interrupting multiple targets, as supported by GICv2 and GICv3.

Unresolved questions

The current proposed implementation is at https://github.com/seL4/seL4/commit/7fda970d382c034c4e9b6d49b10bfbe33d4f815f, but it is awaiting a minor update to the revoke paths as part of some ongoing verification work.

Disposition

The TSC approved the RFC with the following conditions:

  • The RFC should be updated to remove the API for broadcast and multicast for now. Until we have figured out a good general model for multicast, we want to keep the API small so that conservative extensions are possible. The rationale is that for the current use case of a low number (2-16) of cores, it is cheap to simulate multicast and broadcast in user space via a loop over caps to each of the cores, whereas it will be hard to change the API later if we commit to a specific multicast model too early. This means the SGISignalCap (on Arm) should authorise sending to a single core (and store a single SGI number).

  • The API for Arm should be updated to include higher affinity bits for GICv3 to support sending to more cores. For platforms that do not support these bits (e.g. GICv2), the corresponding IRQControl invocation should fail if an attempt is made to set them.

  • The TSC confirmed that we want the API for invoking SGISignalCaps to be like notifications, that is, the invocation should not take any parameters besides the SGISignal cap itself. The current (not yet updated) pull request for this RFC is seL4/seL4#1222.

FPU Context Switching

  • Author: Gerwin Klein, Rafal Kolanski, Indan Zupancic
  • Proposed: 2024-06-21

Summary

We propose to let user level configure policy on when FPU context is switched instead of relying on runtime faults to switch FPU context lazily. Policy is configured via a flag in the TCB, authorised by the corresponding TCB capability.

This improves performance, moves policy from kernel to user, enables verification of FPU handling, and fixes an information leakage in current FPU context switching.

Motivation

If the next thread is not the current FPU owner, current FPU context switching in seL4 disables the FPU on kernel exit, waits for a thread to fault when it accesses the FPU, transparently handles the fault, saves/restores FPU context, and then returns to the thread.

The motivation for this scheme was that threads that never use the FPU do not pay for FPU context switching cost. However, it has the following problems:

  • this is a fixed context switching policy in the kernel;

  • most of this complexity is invisible to the current formal verification, and therefore not covered by the proofs;

  • when switching between domains, this scheme introduces information flow leakage, because the kernel performs actions on behalf of domain A while running in domain B: assuming A and B are the only domains, the first thread running in domain B can determine whether a thread in domain A used the FPU simply by using the FPU itself, since kernel FPU save/restore actions will only occur if another thread used the FPU.

  • performance of this scheme is not optimal -- if multiple threads use the FPU in a row, a fault is taken every time the FPU is used again after a context switch. This adds unnecessary overhead, especially in systems where FPU usage is high.

The proposed scheme fixes these problems and maintains the original intent that threads which do not use the FPU should not pay FPU context switching cost.

The assumption is that application threads are likely to be using the FPU, and that low-level OS threads are unlikely to be using the FPU. One particular case we are aiming to cover is an application thread calling the OS and the OS returning to that thread. Ideally, this case should not need to save or restore any FPU registers.

Guide-level explanation

Concepts

  • The TCB will get a new flag fpuDisabled that determines whether the FPU is disabled or enabled when the kernel switches to the thread. If the FPU is disabled and the thread uses the FPU, a fault is generated and delivered to the thread's fault handler like other faults. If the FPU is enabled, FPU context is saved and restored as needed. There is no more kernel-level transparent FPU fault handling.

  • For backwards compatibility, the flag is initially false (the FPU is enabled) on platforms that have FPU enabled at kernel configuration time.

  • The flag can be set/cleared via the new seL4_TCB_Set_Flags invocation of the corresponding TCB capability. This allows user level to configure fine-grained FPU context switching policy.

High-level Performance Considerations

The following high-level properties will hold:

  • When no thread is configured to use the FPU, no FPU context switching happens after initialisation.

  • When all threads are configured to use the FPU, the scheme is equivalent to eager FPU switching, that is, FPU context is always saved and restored, no matter if the threads actually use the FPU or not.

  • On domain exit, the current FPU context is always saved when it exists.

  • Calls from FPU-threads to exclusively non-FPU threads do not trigger any FPU context save/restores.

    For example, consider three threads A, B, and C. A is configured to use the FPU, B and C are not. Switching from A to B will not need an FPU context save or restore, neither will switching from B to C or C to B. Switching from B to A only requires an FPU context save/restore if another thread has run in between that was configured to use the FPU.

Reference-level explanation

Adding the flag to the TCB will not increase the TCB object size, and is consistent with other TCB parameters authorised by the TCB cap.

The default is chosen so that a value of 0 (cleared memory after retype) means that the FPU is enabled. That means user-level code that does not know about the new FPU switching scheme will be able to use the FPU as before.

As it does now, the kernel will maintain a current FPU owner, which will be a pointer to a TCB. The invariant on that pointer is that it points to the TCB of the most recently running thread that was configured to use the FPU. It is the TCB the current FPU state has to be saved to.

  • When the kernel switches away from a thread, no FPU save/restore actions happen.

  • When the kernel switches away from a domain, the current FPU context is always saved.

  • When the kernel switches to a thread where the FPU is disabled, no FPU context actions happen.

  • When the kernel switches to a thread where the FPU is enabled, the FPU context of the current owner is saved, and the FPU context of the new thread is restored. The current FPU owner is updated to the new thread.
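
In pseudo-C, the switch-in logic could look as follows. This is a simplified sketch; the flag accessor and the ksCurFPUOwner variable are named for illustration and are not the verified implementation:

    /* Most recently running thread that was configured to use the FPU. */
    static tcb_t *ksCurFPUOwner;

    /* Sketch of the proposed FPU handling when switching to 'next'. */
    static void fpuSwitch(tcb_t *next)
    {
        if (tcbFpuDisabled(next)) {    /* the proposed per-TCB flag */
            disableFpu();              /* 'next' faults if it uses the FPU */
            return;
        }
        if (ksCurFPUOwner != next) {
            if (ksCurFPUOwner != NULL) {
                saveFpuState(ksCurFPUOwner);  /* save the current owner */
            }
            loadFpuState(next);               /* restore the new thread */
            ksCurFPUOwner = next;
        }
        enableFpu();
    }

On domain exit, the kernel would additionally save the current owner's FPU context unconditionally, as described above.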

API

TCB_Set_Flags

static inline int seL4_TCB_Set_Flags(seL4_TCB _service, seL4_Word clear, seL4_Word set)

Clear and set feature flags of the specified TCB -- in that order, i.e. when a flag is both cleared and set, it will be set.

Type        Name      Description
seL4_TCB    _service  Capability to the TCB to set flags on
seL4_Word   clear     Bit 0: set to 1 to clear the fpuDisabled flag
seL4_Word   set       Bit 0: set to 1 to set the fpuDisabled flag

Flags that are not available on the current platform or configuration will be ignored. Currently only bit 0 (fpuDisabled) is available on platforms with CONFIG_HAVE_FPU.

The function returns seL4_NoError for success as usual, and seL4_InvalidCapability if the provided capability is not a TCB capability.
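
For example, assuming a TCB capability tcb_cap, a system designer could disable the FPU for a low-level OS thread and later re-enable it:

    /* Disable the FPU for this thread: set bit 0 (fpuDisabled). */
    int err = seL4_TCB_Set_Flags(tcb_cap, 0 /* clear */, 1 /* set */);

    /* Later, re-enable the FPU for the thread: clear bit 0. */
    err = seL4_TCB_Set_Flags(tcb_cap, 1 /* clear */, 0 /* set */);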

Verification Impact

Verification impact should be moderate.

The following items will need to be added to the specifications and proofs on all verification configurations that have FPU proof support (ARM imx8-fpu-ver, AARCH64, X64).

  • added seL4_TCB_Set_Flags system call decode and perform functions
  • change to TCB object state, including default retype reasoning
  • the owner of the FPU state will be tracked, which needs new invariants. For example:
    • when a current owner exists, it always points to a valid TCB
    • this TCB belongs to the current domain

The following main parts of the proof stack are affected by these:

  • specifications: abstract spec, design spec, capDL spec
  • functional correctness, all levels
  • fast path for ARM and AARCH64 (additional check for FPU)
  • access control
  • information flow
  • capDL and system initialisation

While the changes are cross-cutting, we think the new invariants will remain relatively isolated and will not require much interaction with existing ones.

Potential Optimisations

  • On Arm it is possible to interleave loading and storing of the FPU registers, which might be faster than first loading and then storing all registers. This interleaving is only possible if there is always an FPU context to save when there is one to restore. We propose to leave this up to the implementation phase.

  • Newer x86 CPUs support the XSAVES instruction, which combines XSAVEOPT and XSAVEC. Together with the high overhead of disabling the FPU, this means that lazy switching may not make sense on newer x86 CPUs, and the user may want to configure all threads as FPU threads.

  • There are multiple options on how to represent the state where there is no current FPU owner. This happens at the start of the system where no FPU thread has run yet, and at the start of a domain when no FPU thread has yet run in that domain. This state could be represented by a NULL owner, or it could be represented by pointing to a dummy thread, e.g. the idle thread. The latter implies that the pointer always points to a valid thread and that the current FPU context can always be saved, even if there is no current owner. This would remove testing for NULL in the (very) frequent case where a current owner exists and would add a superfluous context save in the infrequent case where no owner exists. Doing this might complicate the information flow proofs where the FPU context of the idle thread would have to be modelled (and proved) as unobservable to the user. However, it may also reduce timing differences at domain switch (modulo caching), because each domain switch will always have a context to save.

Drawbacks

  • Compared to the current scheme, user level now needs to explicitly disable the FPU for a thread when no FPU context switching is required. This requires more system analysis than before. However, if user level does not consciously manage the FPU, it is likely that modern compilers on most architectures will at least use SIMD instructions that use FPU registers, which means that the default of "FPU enabled" should be the correct choice for such systems.

  • For a workload where many threads do use the FPU, but use it only occasionally, the proposed scheme would perform many more FPU context switches than the current fault-based lazy scheme.

    In this case, it would still be possible to emulate the fault-based lazy scheme from user-level:

    • the rare-FPU thread is set to FPU disabled
    • when it does use the FPU, the user-level fault handler gets the corresponding fault
    • the fault handler enables the FPU on this thread and restarts it
    • the thread could either stay FPU-enabled or be FPU-disabled again after its time slice or budget runs out, depending on system policy.

    This does have more overhead than the current in-kernel transparent fault handling, but not massively more, and it allows full user-level control and trade-offs. For instance, this could also be used for starting all threads as FPU-disabled and enabling FPU for them on demand. That would yield low amortised overhead without needing system analysis.
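
    A minimal sketch of such a handler (non-MCS calling conventions; the fault label for the new FPU fault and the badge-to-TCB lookup are assumptions for illustration):

        /* Hypothetical user-level loop emulating lazy FPU switching. */
        void fpu_fault_handler(seL4_CPtr fault_ep)
        {
            for (;;) {
                seL4_Word badge;
                seL4_MessageInfo_t info = seL4_Recv(fault_ep, &badge);
                /* Assumed label for the proposed FPU fault. */
                if (seL4_MessageInfo_get_label(info) == seL4_Fault_FPUFault) {
                    /* Re-enable the FPU for the faulting thread (clear bit 0),
                     * using a hypothetical badge-to-TCB lookup. */
                    seL4_TCB_Set_Flags(tcb_for_badge(badge), 1, 0);
                    /* Replying to the fault restarts the thread. */
                    seL4_Reply(seL4_MessageInfo_new(0, 0, 0, 0));
                }
            }
        }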

Rationale and alternatives

There are multiple alternatives to the semi-lazy context switching scheme proposed here. The main classes are:

  • fully eager -- FPU context is always saved and restored
  • semi-lazy with early context saving -- FPU context is always saved when switching away from an FPU-enabled thread and always restored when switching to a FPU-enabled thread
  • fault-based lazy -- the current scheme in seL4

This RFC has already argued that semi-lazy is preferable to the current scheme for multiple reasons (policy, performance, verification). Compared to eager switching, the proposal has the following advantages:

  • Fully eager represents another fixed kernel policy vs configurable policy.
  • Fully eager switching can be achieved in the proposed scheme by setting all threads to FPU enabled. The only overhead this incurs is one branch test, which checks the flag and always finds the FPU enabled.
  • Fully eager switching does not perform well on all workloads; for example, a common use case in the device driver framework is a chain of OS producers/consumers that do not use the FPU, called by an application that potentially does use the FPU.
  • If the one-branch-test overhead for fully eager switching in the proposed scheme is important for a specific workload, it would be possible to provide an optimisation: have an always-eager compile-time config flag that statically sets the fpuDisabled flag to false, which eliminates the branch. This would disallow disabling the FPU in seL4_TCB_Set_Flags. We do not anticipate this to be needed currently, but it is an option for the future if this turns out to be a common case.

The proposed scheme is slightly more complex than the semi-lazy scheme with early context save, and early saving would not need any special treatment on domain switch. However, with early saving, the targeted use case of OS threads without FPU called by an application thread with FPU would still need one FPU context save and restore, whereas the proposed scheme requires none.

Prior art

There has been initial discussion and a draft pull request for a proposal to introduce an FPU object to seL4, which served as inspiration for this RFC. The FPU object proposal similarly makes context switching configurable by binding or not binding an FPU object to a TCB, and implements a similar switching strategy.

Adding a new object to the API has much higher implementation+verification impact as well as higher API breaking impact for user-level systems. The issue of memory-backing for the FPU is orthogonal to the issue of when to context switch the FPU. The approach proposed here focuses on context switching only, which keeps verification manageable. The performance numbers we measured for the proposal in this RFC are similar to the numbers obtained in the FPU object proposal.

If TCB size and memory pressure do turn out to be an issue in practice, the FPU object can still be added on top of this RFC in the future.

Performance Numbers

The following numbers compare the fault-based lazy switching (current seL4 implementation) with fully eager switching (FPU always on). The tables will be updated when more results are available, including an implementation of the proposed semi-lazy switching without faulting.

These measurements should give an indication of the trade-off space: lazy fault-based switching where no thread uses the FPU should perform the same as the baseline, because the FPU is always off. Eager switching has the highest context switching load. The space in between trades these off against the cost of enabling/disabling the FPU, potentially more complex logic, and taking faults.

Expectations for final performance: tasks with FPU enabled will behave close to eager switching, and tasks with FPU disabled will behave like the current lazy switching, which always disables the FPU at task switch.

All tests were done on non-SMP, non-MCS release builds of seL4 and sel4bench. Keep in mind that the cycle counts are not CPU cycles but platform-specific fixed clock cycles, hence the numbers should not be compared between boards.

AArch64

The AArch64 tqma system:

Cortex-A35    Call   ReplyRecv
Old Lazy      311    328
Eager         429    493

Basic save + restore overhead seems to be about 120 cycles (after aligning fpuState). The unexpected extra slowdown for ReplyRecv may be because of secondary effects, like less optimal memory accesses. Performance is mostly limited by store and load throughput.

On Arm, enabling and disabling the FPU is cheap.

x86

The skylake system:

Skylake       Call   ReplyRecv   Consumer to producer
Old Lazy      157    163         2455
Eager         556    527         1203
Sync no FP    --     --          945

The other sync tests had similar results. The compiler is much more eager to use SIMD instructions on x86 compared to AArch64, causing FPU traps during the sync tests. The last column shows the FPU trap overhead.

When compiling the sync test with -mgeneral-regs-only, the lazy FPU sync results are much better, see the "Sync no FP" row. However, the result is only ~250 cycles faster than eager switching, instead of the expected 350-400.

Comet Lake    Call   ReplyRecv   Consumer to producer
Old Lazy      124    126         1839
Eager         427    405         926
Sync no FP    --     --          675

Other measurements have been made, and all in all the x86 performance measurements do not seem very reliable. For instance, adding extra FPU disable+enable calls added 300 cycles, which should be similar to the cost of XSAVE+XRSTOR, yet we don't see that in the actual results.

Other examples include a measured trap overhead of 2k cycles, while the difference to "Sync no FP" is around 1.5k on Skylake, and much worse results when also compiling the IPC test with general registers only, even though the results make it clear that no trapping happens, so no FP instructions should be executed during the measured window.

A cycle counter overhead of 200 cycles seems to complicate things too. For x86 it may be better to use PMU counters instead of rdtsc.

Benchmark Conclusions

Although micro-benchmark results are much worse for eager switching, on a macro level the results are dramatically better if the FPU is used at all. The proposed semi-lazy switching with FPU flag should get the best of both approaches.

Unresolved questions

  1. While the internal representation needs to be fpuDisabled to get the right default, it may be nicer for the API to present the inverted flag fpuEnabled instead.