Runtime Domain Schedules

  • Author: Gerwin Klein, Rafal Kolanski, Indan Zupancic
  • Proposed: 2025-09-10

Summary

We propose to add the ability to set, maintain, and switch between domain schedules at runtime, authorised by the existing DomainCap, retaining the current information flow proof and behaviour when the DomainCap no longer exists in the system or is not used.

Motivation

Currently the domain schedule is provided as a .c file and compiled into the kernel. This was an initial stop-gap implementation and is inconvenient for SDK approaches such as the Microkit and overall inflexible and hard to use.

For instance, it is currently not possible to provide a verified kernel binary and modify it with a new kernel schedule without invalidating the verification.

It is also currently not possible to run a system in multiple modes, such as "wheels up" and "wheels down" in aviation.

The intended use for the proposed mechanism is that the initialiser sets up domain schedules at boot time, for instance the two schedules in the example above, or a single domain schedule as would now be provided as .c file. After that, if the system is to remain static, the DomainCap can be deleted as it is now, or it can be given to a high privilege mode control component that can atomically switch between schedules.

Updating a running system to tweak an existing schedule or add a new mode at runtime would also become possible.

Guide-level explanation

At a conceptual level, we propose to mostly keep the current domain scheduler implementation as it is: a compile-time sized array of entries consisting of domain and duration. Instead of representing a single schedule as before, we propose to use the array now for representing multiple schedules. The new components are:

  • entries with the value (0, 0) are end markers for a domain schedule
  • there is a new kernel global that contains the start index for the current domain schedule

Example

The following diagram shows an example.

-----------------------------------------------------------------
| (d_0, t_0) | (d_1, t_1) | ... | (0, 0) | (d_x, t_x) | .. | .. |
-----------------------------------------------------------------
       0          ^                  n         n+1

ksDomainStart = 0
ksDomScheduleIdx = 1

The example has an overall array of length config_DomScheduleLength, configured at compile time, a current domain start index of 0, and an currently active domain schedule index of 1. The current entry will run domain d_1 for a duration of t_1 > 0, and the schedule will keep going until it hits index n which has an entry with value (0, 0). Instead of running the entry at index n, it proceeds at ksDomainStart, i.e. index 0.

To populate the entires, a user thread with the DomainCap can invoke the setDomainEntry method to set duration and domain for a specific index, potentially with duration 0 to create an end marker.

To atomically switch to the second schedule in the example, a user thread with the DomainCap can set ksDomainStart to n+1, which will start running domain d_x for a duration t_x. The schedule will keep going until either another entry with duration 0 occurs, or the index hits the end of the array. In both cases, the index of the next entry will again be ksDomainStart.

Conditions and Invariants

For both, setting the domain start and setting the value of an entry, the kernel will prevent the user from creating a schedule where the entry at ksDomainStart has duration 0. This would signify an empty domain schedule. It would be possible to give this situation a useful meaning: not switching domains until a new schedule is set. However, implementing it would be more invasive than the currently proposed changes and the same effect can already be achieved with a one-element schedule of long duration.

Apart from the value (0, 0), the kernel will prevent the creation of entries with duration 0. The duration is in timer ticks.

The kernel initialises with an array where all entries are (0, 0), apart from the entry at index 0, which will run domain 0 for the maximum expressible time.

The last entry of the domain schedule array is reserved to be an end marker. This allows for a simpler domain schedule end check.

The kernel will preserve the following invariants:

  • ksDomainStart < config_DomScheduleLength - 1
  • ksDomScheduleIdx < config_DomScheduleLength - 1
  • schedule[ksDomainStart].duration ~= 0
  • for all i with 0 <= i config_DomScheduleLength, if schedule[i].duration = 0 then schedule[i].domain = 0

Reference-level explanation

Invocations

The current invocations of the DomainCap remain as they are. There are two new invocations:

Set_Domain_Start

static inline int seL4_Set_Domain_Start

Set the start index of the current domain schedule. The schedule entry at this index must not be the end marker (0, 0). The domain at the schedule entry with the specified index will start running immediately after the kernel call completes.

TypeNameDescription
seL4_DomainSet_serviceDomain capability to authorise the call
seL4_WordindexThe new start index. Must not point to (0,0) and must be smaller than config_DomScheduleLength-1

The function returns seL4_NoError for success as usual. The following error codes are returned otherwise:

ErrorPossible cause
seL4_InvalidCapabilitythe provided capability is not the domain capability
seL4_InvalidArgumentthe entry at the provided index is an end marker
seL4_RangeErrorthe index is not less than config_DomScheduleLength-1

Set_Domain_Entry

static inline int seL4_Set_Domain_Entry

Set and entry in the domain schedule at the specified index to the specified domain and duration. If the duration is 0, the domain must also be 0 to indicate and end marker. On MCS, the duration must be >= MIN_BUDGET. If the index is the current schedule start index, the entry must not be an end marker (0, 0).

The change to the schedule takes effect when the entry is next read by the kernel. In particular, when the duration or domain of the currently running schedule index are changed, the change will only take effect after the current domain time slice has expired and the schedule reaches the current index again.

Section Initialisation contains a scenario where setting the entry at the currently running index may be useful, but generally one should avoid updating the currently running schedule. Instead create new schedules beyond current end marker and then set the schedule start once the system is ready to switch.

TypeNameDescription
seL4_DomainSet_serviceDomain capability to authorise the call
seL4_WordindexThe index of the entry to set. Must be smaller than config_DomScheduleLength-1.
seL4_Uint8domainThe domain of the schedule entry. Must be smaller than CONFIG_NUM_DOMAINS. Must be 0 if the duration is 0.
seL4_WorddurationThe duration for the entry in ticks.

The function returns seL4_NoError for success as usual. The following error codes are returned otherwise:

ErrorPossible cause
seL4_InvalidCapabilitythe provided capability is not the domain capability
seL4_InvalidArgumentthe index is the current domain start index and the duration is 0, or the duration is 0, but the domain is not 0.
seL4_RangeErrorthe index is not less than config_DomScheduleLength-1, or the domain is not less than CONFIG_NUM_DOMAINS

Configuration Options

The new config option config_DomScheduleLength determines the static size of the overall domain schedule array and thereby the longest domain schedule that is possible to configure at runtime. The default value for config_DomScheduleLength is 100.

Initialisation

At system startup, the array contains 2 active entries: entry 0 with domain 0 and a long duration, and entry 1 with (0, 0). With the following scheme it is possible to use the full length of the array even though these two entries are already in use.

At kernel boot time, given a user-provided schedule [(d_0, t_0), (d_1, t_1), ...] that satisfies the requirements described above, the root task could achieve the provided schedule as follows:

  1. First, set up the of rest system as before, including starting all threads
  2. Set all (d_i, t_i) according to the schedule where i > 1.
  3. Then, overwrite the two active schedule entries 0 and 1 with (d_0, t_0) and (d_1, t_1).
  4. Set the schedule start to the desired start value, e.g. index 0
  5. Suspend/stop the initialiser

The initialiser will be able to run for up to the maximum duration expressible in ticks (hundreds of days even on extremely high timer frequencies), and the user-provided schedule will start after the initialiser is done. If the first time slice of the new schedule is running domain 0, the initialiser will consume the time necessary to perform the suspend/stop step, but no more.

The reason the scheme works is that the kernel will not act on new values in the schedule before the current domain slice has expired whereas setting the start index in step 4 comes into effect immediately.

capDL initialiser

The capDL initialiser will change to also initialise the domain schedule if one is provided.

The schedule is provided in a separate new section of the capDL input specification, specifying the schedule as a comma-separated list of pairs (domain, duration). If no end marker is provided at the end of the list, an implicit end marker is assumed.

Drawbacks

The main drawback is that this is a breaking change for users. Setting and initialising a domain schedule now works differently, and build scripts may need to be updated to no longer generate/add the former .c file that contained the schedule.

In theory a tool could be provided that converts current .c file schedules into the capDL sections, but since .c files can contain anything we are not proposing such a tool as part of this RFC.

Tests for the domain scheduler in sel4test will need to be changed to create the test schedules at initialisation time instead of expecting them to be compiled in.

Tools and applications that do not use the domain scheduler should be unaffected.

Rationale and alternatives

The proposed design was chosen for its conceptual simplicity, flexibility, implementation simplicity, and low proof impact. There are no changes in how domains are treated currently in the implementation at runtime or in the proofs for information flow. The current information flow proofs should be fully preserved in systems that no longer have the domain capability, which is already assumed in the current proofs.

The two extreme choices in the design space would be to leave the static schedule as is and to remove domain scheduler completely. This RFC has already motivated why the former is not a good option. Removing the domain scheduler from the kernel completely and delegating partition scheduling to user level would invalidate the information flow theorems for seL4 without any even conceptual formal remedy for the statement of these theorems. Isolation would no longer be a property of seL4.

Other points in the design space that have been considered are the following.

Batched setting of the domain schedule

The current proposal requires multiple kernel calls to set a new schedule -- one for each entry in the new domain schedule. An alternative would be to batch multiple such calls and collect an entire schedule or larger parts of a schedule in the IPC buffer. Since setting new schedules, as opposed to switching between existing schedules, is unlikely to be a performance critical operation. That means, added complexity of a batched API with variable length arguments brings no tangible benefit for system performance or even user convenience.

Restricting to two schedules

Instead of using one array with end markers, the domain schedule could be represented as two (or more) separate arrays: one active, running schedule, and one (or more) inactive schedule that can be edited. The representation proposed in this RFC achieves the same overall API with more flexibility and less storage.

Allowing zero-length schedules

The API proposed in this RFC forbids schedules of length zero -- these are schedules where the start directly points to an end marker. An alternative would be to allow such schedules and to treat all entries with duration 0 as end markers. A zero-length schedule with an end marker for a domain d would then mean that the kernel will stay in domain d indefinitely until a new schedule is set, practically disabling domain scheduling at runtime. The RFC did not choose this option, because a one-element schedule with very long domain time (e.g. years) already achieves the same result in practice and has a simpler, more efficient implementation. Allowing zero-length schedules would require additional state to be checked any time the kernel reduces the consumed domain time.

Point of activation of setting domain/duration

In the proposed API domain schedule switches take effect immediately, but setting domain and duration of an entry in the currently running schedule takes effect only the next time the schedule wraps around to the current index.

This behaviour has the simpler implementation and is also simpler conceptually. Making domain/duration changes take effect immediately would mean that the system call might have to switch immediately to a new domain when either the new domain does not match the current domain or the new duration is shorter than the already expired domain time. Either would mean that the kernel has already violated the new domain schedule and is attempting to catch up and rectify the violation. If the semantics instead is that the effect takes place at the next time the index is visited, the kernel never needs to violate the schedule.

The behaviour difference is largely theoretical -- the recommended use of the API is to never edit the current schedule, but instead edit a new schedule and switch to the new schedule atomically.

Prior art

Most separation kernels provide some form of configurable static scheduling.

The motivation for the original domain scheduler design in seL4 was an explicit customer/user request for a static separation kernel configuration of seL4 with static scheduling in a security and information flow context. This also motivated the current form of the information flow theorem. For the application at the time, the current API was sufficient, but for a more general use case it is too restrictive and has too much build system impact, because it needs recompilation for schedule changes.

Another application area with similar requirements is aviation, in particular the ARINC 653 standard, which would be better supported by this RFC than the current implementation.

This RFC is not attempting to provide a full implementation of any particular standard, it merely aims to make the existing API more usable and improve the overall developer experience on seL4.

Unresolved questions

The API is expressed in terms of timer ticks in expectation of an additional upcoming RFC to change the rest of the API to timer ticks as well. Depending on the outcome of that discussion, the API could also be in terms of time_t instead if required for consistency.

Disposition

The TSC approved the RFC and decided to use timer ticks. The RFC text has been updated with minor changes discovered during the implementation.