%
% Copyright 2014, General Dynamics C4 Systems
%
% SPDX-License-Identifier: GPL-2.0-only
%

\chapter{\label{ch:threads}Threads and Execution}

\section{Threads}
\label{sec:threads}


seL4 provides threads to represent an execution context. On MCS configurations of
the kernel, scheduling contexts are used to manage processor time. Without MCS, processor
time is also represented by the thread abstraction.
A thread is represented in seL4 by its thread control block
object (\obj{TCB}).

With MCS, a scheduling context is represented by a scheduling context object
(\obj{SCO}), and threads cannot run unless they are bound to, or receive, a
scheduling context.

\subsection{Thread control blocks}

Each \obj{TCB} has an associated CSpace (see
\autoref{ch:cspace}) and VSpace (see \autoref{ch:vspace}) which
may be shared with other threads. A \obj{TCB} may also have an IPC buffer
(see \autoref{ch:ipc}), which is used to pass extra arguments during IPC
or kernel object invocation that do not fit in the architecture-defined message
registers. A thread is not required to have an IPC buffer, but without one
it cannot perform most kernel invocations, as they require
capability transfer.
Each thread belongs to exactly one security domain (see
\autoref{sec:domains}).
%FIXME: there is much more information held in the TCB!

\subsection{Thread Creation}
\label{sec:thread_creation}

Like other objects, \obj{TCB}s are created with the
\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method (see
\autoref{sec:kernmemalloc}). A newly created thread is initially inactive. It
is configured by setting its CSpace and VSpace with the
\apifunc{seL4\_TCB\_SetSpace}{tcb_setspace}
or \apifunc{seL4\_TCB\_Configure}{tcb_configure} methods and then calling
\apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} with an initial stack pointer and instruction
pointer. The thread can then be activated either by setting the
\texttt{resume\_target} parameter in the \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} invocation to true
or by separately calling the \apifunc{seL4\_TCB\_Resume}{tcb_resume} method. Both of these methods
place the thread in a runnable state.

On the master (non-MCS) kernel, this will result in the thread immediately being added to
the scheduler. On the MCS kernel, the thread will only begin running if it has a
scheduling context object.

In an SMP configuration of the kernel, the thread will resume on the core
corresponding to the affinity of the thread. On the master kernel, this is set using
\apifunc{seL4\_TCB\_SetAffinity}{tcb_setaffinity}, while on the MCS kernel the affinity is
derived from the scheduling context object.
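
As an illustrative sketch (assuming a non-MCS configuration; the capability
names, the free slot and the entry-point symbols are hypothetical, and the
\texttt{seL4\_UserContext} field names shown are the AArch64 ones), the whole
sequence might look as follows:

\begin{verbatim}
#include <sel4/sel4.h>

extern void thread_entry(void);      /* new thread's entry point */
extern char thread_stack_top[];      /* top of its stack         */

void start_thread(seL4_CPtr untyped, seL4_CPtr cnode, seL4_CPtr tcb_slot,
                  seL4_CPtr cspace_root, seL4_CPtr vspace_root,
                  seL4_CPtr ipc_frame, seL4_Word ipc_buf_vaddr)
{
    /* 1. Retype part of the untyped object into one TCB in tcb_slot. */
    seL4_Error err = seL4_Untyped_Retype(untyped, seL4_TCBObject, 0,
                                         cnode, 0, 0, tcb_slot, 1);

    /* 2. Give the thread a CSpace, VSpace and IPC buffer
          (non-MCS signature; no fault endpoint is set here). */
    err = seL4_TCB_Configure(tcb_slot, seL4_CapNull,
                             cspace_root, 0, vspace_root, 0,
                             ipc_buf_vaddr, ipc_frame);

    /* 3. Set the initial program counter and stack pointer; passing
          resume_target = 1 also makes the thread runnable. */
    seL4_UserContext regs = {0};
    regs.pc = (seL4_Word)thread_entry;
    regs.sp = (seL4_Word)thread_stack_top;
    err = seL4_TCB_WriteRegisters(tcb_slot, 1, 0, 2, &regs);
    (void)err;
}
\end{verbatim}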

\subsection{Thread Deactivation}
\label{sec:thread_deactivation}

The \apifunc{seL4\_TCB\_Suspend}{tcb_suspend} method deactivates a thread.
Suspended threads can later be resumed.
Their suspended register state can be retrieved with the
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and
\apifunc{seL4\_TCB\_CopyRegisters}{tcb_copyregisters} methods.
They can also be reconfigured and
reused, or left suspended indefinitely if not needed. Threads will be
automatically suspended when the last capability to their \obj{TCB} is
deleted.
% an example of which is demonstrated in \nameref{ex:second_thread}.

\subsection{Scheduling}
\label{sec:sched}

seL4 uses a preemptive, tickless scheduler with 256 priority levels (0--255).  All threads have
a maximum controlled priority (MCP) and a priority, the latter being the effective priority of the
thread.
When a thread modifies the priority of another thread (or of itself), it must provide a
thread capability whose MCP is used as the authority for the operation. Threads can only set priorities and MCPs
to values less than or equal to the provided thread's MCP.
The initial task starts with both its MCP and its priority set to the highest priority in the system (\texttt{seL4\_MaxPrio}).
Thread priority and MCP can be
set with the \apifunc{seL4\_TCB\_SetSchedParams}{tcb_setschedparams},
\apifunc{seL4\_TCB\_SetPriority}{tcb_setpriority} and
\apifunc{seL4\_TCB\_SetMCPriority}{tcb_setmcpriority} methods.


Of the threads eligible for scheduling, the highest priority thread in a runnable state is chosen.

Thread priority (structure \texttt{seL4\_PrioProps\_t}) consists of two values as follows:

\begin{description}
\item[Priority] the priority a thread will be scheduled with.
\item[Maximum controlled priority (MCP)] the highest priority a thread can set itself or another thread to.
\end{description}
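
For instance (a sketch; \texttt{tcb} and \texttt{auth\_tcb} are assumed to be
TCB capabilities already in the caller's CSpace, with \texttt{auth\_tcb}
supplying the MCP that authorises the change):

\begin{verbatim}
#include <sel4/sel4.h>

/* Both requests are checked against auth_tcb's MCP; the values passed
   must be less than or equal to it. */
seL4_Error err;
err = seL4_TCB_SetMCPriority(tcb, auth_tcb, 100); /* the thread's new MCP       */
err = seL4_TCB_SetPriority(tcb, auth_tcb, 100);   /* its new effective priority */
\end{verbatim}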

\subsection{MCS Scheduling}

This section only applies to configurations with MCS enabled, where threads must have
a scheduling context object available in order to be admitted to the scheduler.

\subsection{Scheduling Contexts}
\label{sec:scheduling_contexts}

Access to CPU execution time is controlled through scheduling context objects.
Scheduling contexts are configured with a tuple of
\textit{budget} $(b)$ and \textit{period} $(p)$, both in microseconds, set by
\apifunc{seL4\_SchedControl\_Configure}{schedcontrol_configure} (see \autoref{sec:sc_creation}).
The tuple $(b, p)$ forms an upper bound on the thread's execution -- the kernel will not permit a
thread to run for more than $b$ out of every $p$ microseconds. However, $\frac{b}{p}$ does not
represent a lower bound on execution, as a thread must have the highest or equal highest priority
of all runnable threads to be guaranteed to be scheduled at all, and the kernel does not conduct
an admission test. As a result, the set of all parameters is not necessarily schedulable. If
multiple threads have available budget concurrently, they are scheduled first-in first-out, and
round-robin scheduling is applied once the budget is expired.

A scheduling context that is eligible to be picked by the scheduler, i.e.\ has budget available, is
referred to as \emph{active}.  Budget charging and replenishment rules are different for round-robin
and sporadic threads.  For round-robin threads, the budget is charged each time the current node's
scheduling context is changed, until it is depleted and then refilled immediately.

Threads where $b = p$ are treated as round-robin threads, where $b$ acts as a timeslice.
Otherwise the kernel uses \emph{sporadic servers} to enforce temporal isolation, which enforce the
property that the utilisation $\frac{b}{p}$ cannot be exceeded over any interval of length $p$.
In theory, sporadic servers provide temporal isolation -- preventing threads from exceeding their allocated budget -- by using the following algorithm, illustrated by the example below:

\begin{itemize}
\item When a thread starts executing at current time $T$, record $T_{s} = T$.
\item When a thread stops executing (blocks or is preempted) at time $T$, schedule a replenishment at $T_{s} + p$ for the
amount of time consumed ($T - T_{s}$) and subtract that amount from the current replenishment being used.
\end{itemize}
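
As an illustrative trace (not taken from the kernel source): with $b = 2$\,ms
and $p = 10$\,ms, a thread that starts running at $T_{s} = 5$\,ms and is
preempted at $T = 6$\,ms has consumed $1$\,ms, so a $1$\,ms replenishment is
scheduled for $T_{s} + p = 15$\,ms and the current replenishment shrinks to
$1$\,ms. Over any window of length $p$ the thread can therefore run for at
most $b = 2$\,ms.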

seL4 implements this algorithm by maintaining an ordered list of sporadic replenishments --
\texttt{refills} for brevity -- in each scheduling context. Each replenishment contains a tuple
of the time it is eligible for use (\texttt{rTime}) and the amount that replenishment is for
(\texttt{rAmount}).
While a thread is executing, it constantly drains the budget from the \texttt{rAmount} at the head of the
replenishment list. If the head \texttt{rTime} is in the future, the thread bound to that
scheduling context is placed in a queue of threads waiting for more budget.

Round-robin threads are treated the same as sporadic threads except for one case: how the
budget is charged. Round-robin threads have two refills only, both of which are always ready to be
used. When a round-robin thread stops executing, budget is moved from the head to the tail
replenishment. Once the head budget is consumed, the thread is removed from the scheduling queue
for its priority and appended at the tail.

Sporadic threads behave differently depending on the number of replenishments available, which
must be bounded. Developers have two options to configure the size of the replenishment list:

\begin{itemize}
\item The maximum number of refills in a single scheduling context is determined by the size
      of the scheduling context when created by \apifunc{seL4\_Untyped\_Retype}{untyped_retype}.
\item A per-scheduling-context parameter, \texttt{extra\_refills}, which
limits the number of refills for that specific scheduling context. This value is added to the base
value of 2 and is limited by the size of the scheduling context.
\end{itemize}

Threads that have short execution times (e.g.\ interrupt handlers) and are not frequently preempted
should have fewer refills, while longer-running threads with large values of $b$ should have a higher
value. Threads bound to a scheduling context with 0 extra refills will behave periodically -- tasks
that use up their head replenishment, or call yield, will not be scheduled again until the start of
their next period.

Given that the number of replenishments is limited, if a node's scheduling context changes and the
outgoing scheduling context does not
have enough space to store the new replenishment, space is created by removing the current
replenishment, which can result in preemption if the next replenishment is not yet available.
Scheduling contexts with a higher number of configured refills will consume closer
to their whole budget, as they can be preempted or switch threads more often without filling their
replenishment queue. However, the scheduling overhead will be higher, as the replenishment list is
subject to fragmentation.

Whenever a thread is executing, it consumes budget from its current scheduling context.  The
system call \apifunc{seL4\_Yield}{sel4_yield} can be used to sacrifice any remaining budget and
block until the next replenishment is ready to be used.

Threads can be bound to scheduling contexts using \apifunc{seL4\_TCB\_Configure}{tcb_configure} or
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind}; both invocations have the same effect,
although \apifunc{seL4\_TCB\_Configure}{tcb_configure} allows more thread fields to be set with only
one kernel entry.  When a thread is bound to a scheduling context, if it is in a runnable state and
the scheduling context is active, it will be added to the scheduler.

\subsection{Passive Threads} \label{sec:passive}

Threads can be unbound from a scheduling context with
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}.  This is distinct from
suspending a thread, in that threads that are blocked waiting in an endpoint or notification queue
will remain in the queue and can still receive messages and signals.  However, the unbound thread
will not be schedulable again until it receives a scheduling context.  Threads without scheduling
contexts are referred to as \emph{passive} threads, as they cannot execute without the action of
another thread.

\subsection{Scheduling Context Creation} \label{sec:sc_creation}

Like other objects, scheduling contexts are created from untyped memory using
\apifunc{seL4\_Untyped\_Retype}{untyped_retype}.  On creation, scheduling contexts are empty,
representing 0\% of CPU execution time.  To populate a scheduling context with parameters, one must
invoke the appropriate \obj{SchedControl} capability, which provides access to CPU time management
on a single node.  A scheduling control cap for each node is provided to the initial task at run
time.  Threads run on the node that their scheduling context is configured for.  Scheduling context
parameters can then be set and updated using
\apifunc{seL4\_SchedControl\_ConfigureFlags}{schedcontrol_configureflags}, which allows the budget and period
to be specified along with a bitwise OR'd set of the following flags.

\begin{description}

\item[seL4\_SchedContext\_Sporadic]: constrain the execution time only according to the
sporadic server algorithm rather than to a continuous constant bandwidth.

\end{description}

The kernel does not conduct any schedulability tests, as task admission is left to user-level policy
and can be conducted online or offline, statically or dynamically, or not at all.
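
As a sketch (the capability names are illustrative: \texttt{sched\_control} is
the \obj{SchedControl} capability for the target node, \texttt{sc} a freshly
retyped scheduling context and \texttt{tcb} the thread to run on it; the MCS
argument order of budget, period, extra refills, badge and flags is assumed):

\begin{verbatim}
#include <sel4/sel4.h>

/* Give the scheduling context a budget of 2 ms out of every 10 ms
   (times are in microseconds) and request sporadic-server behaviour. */
seL4_Error err;
err = seL4_SchedControl_ConfigureFlags(sched_control, sc,
                                       2 * 1000,    /* budget (us)   */
                                       10 * 1000,   /* period (us)   */
                                       0,           /* extra_refills */
                                       0,           /* badge         */
                                       seL4_SchedContext_Sporadic);

/* Bind the now-active scheduling context to the thread; if the thread
   is runnable it becomes eligible for scheduling. */
err = seL4_SchedContext_Bind(sc, tcb);
\end{verbatim}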

\subsection{Scheduling Context Donation}

In addition to explicitly binding and removing scheduling contexts through
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind} and
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}, scheduling contexts can move
between threads over IPC.  Scheduling contexts are donated implicitly when the system calls
\apifunc{seL4\_Call}{sel4_call} and \apifunc{seL4\_NBSendRecv}{sel4_nbsendrecv} are used to
communicate with a passive thread.  When an active thread invokes an endpoint with
\apifunc{seL4\_Call}{sel4_call} and rendezvouses with a passive thread, the active thread's scheduling
context is donated to the passive thread. The generated reply cap ensures that the callee is merely
borrowing the scheduling context: when the reply cap is consumed by a reply message being sent, the
scheduling context will be returned to the caller.  If the reply cap is revoked while the callee
holds the scheduling context, the scheduling context will also be returned to the caller.  However, if
a reply cap in the middle of a deep call chain is revoked, such that its callee
does not possess the scheduling context, that thread will be removed from the call chain and the
scheduling context will remain where it is.  If the receiver does not provide a reply object to
track the donation in (i.e.\ uses \apifunc{seL4\_Wait}{sel4_wait} instead of
\apifunc{seL4\_Recv}{sel4_recv}), scheduling context donation will not occur but the message will be
delivered. The passive receiver will be set to inactive, as it does not have a scheduling context.

Consider an example where thread A calls thread B, which calls thread C.  If, whilst C holds the
scheduling context, B's reply cap to A is revoked, then the scheduling context will remain with C.
However, a call chain will remain between A and C, such that if C's reply cap is revoked, or
invoked, the scheduling context will return to A.

\apifunc{seL4\_NBSendRecv}{sel4_nbsendrecv} can also result in scheduling context donation.
If the non-blocking send phase of the operation results in message delivery to a passive thread, the
scheduling context will be donated to that passive thread and the thread making the system call becomes passive on the
receiving endpoint in the receive phase.  No reply capability is generated, so there
is no guarantee that the scheduling context will return; this increases book-keeping complexity but allows
for data-flow-like architectures rather than remote procedure calls. Note that \apifunc{seL4\_Call}{sel4_call}
does not guarantee the return of a scheduling context either: it is an inherently trusted operation, as the
server could simply never reply and thus never return the scheduling context.
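
The following sketch shows the common pattern of a passive server that runs
entirely on its clients' donated scheduling contexts. It assumes the MCS
variants of \apifunc{seL4\_Recv}{sel4_recv} and seL4\_ReplyRecv, which take an
explicit reply-object capability; endpoint, badge and message handling are
simplified.

\begin{verbatim}
#include <sel4/sel4.h>

/* ep:    endpoint the passive server listens on
   reply: reply object used to track the donated scheduling context */
void server_loop(seL4_CPtr ep, seL4_CPtr reply)
{
    seL4_Word badge;

    /* Block for the first request; a client's scheduling context is
       donated to us when it performs seL4_Call on ep. */
    seL4_MessageInfo_t msg = seL4_Recv(ep, &badge, reply);

    for (;;) {
        /* ... handle the request described by msg ... */
        seL4_MessageInfo_t resp = seL4_MessageInfo_new(0, 0, 0, 0);

        /* Reply to the current client (returning its scheduling
           context) and atomically wait for the next request. */
        msg = seL4_ReplyRecv(ep, resp, &badge, reply);
    }
}
\end{verbatim}

A client on an active scheduling context simply performs
\apifunc{seL4\_Call}{sel4_call} on the endpoint; while the call is in
progress, its scheduling context runs the server.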

Scheduling contexts can also be bound to notification objects using
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind} and unbound using
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}.  If a signal is delivered to
a notification object with a passive thread blocked waiting on it, the passive thread will receive
the scheduling context that is bound to the notification object.  The scheduling context is returned
when the thread blocks on the notification object.  This feature allows passive servers to use
notification binding (see \autoref{sec:notification-binding}).  If a scheduling context is bound to
both a notification object and a thread, the behaviour will be the same as for a passive server:
the scheduling context will be unbound from the thread when it blocks on the bound notification object.
This is useful when launching passive servers or handling timeout exceptions.

Scheduling contexts can be unbound from all objects (notification objects and TCBs that are bound or
have received a scheduling context through donation) using
\apifunc{seL4\_SchedContext\_Unbind}{schedcontext_unbind}.

Passive threads will run on the CPU node that their scheduling context was configured with, and will
be migrated on IPC.

\subsection{Scheduling algorithm} \label{sec:mcs-sched}

Threads are only eligible for scheduling if they have an active scheduling context.
Of the threads eligible for scheduling, the highest priority thread in a runnable state is chosen.

Threads of sufficient maximum controlled priority and with possession of the
appropriate scheduling context capability can manipulate the scheduler and
implement user-level schedulers using IPC.

Scheduling contexts provide access to, and an upper bound on, CPU execution time;
however, when a thread executes is determined by thread priority.  Consequently,
access to the CPU is a function of thread MCPs, scheduling contexts and the
\obj{SchedControl} capability.  The kernel enforces that threads do not
exceed the budget in their scheduling context for any given period, and that
the highest priority runnable thread always runs; however, it is up to the system
designer to make sure the entire system is schedulable.

\subsection{Exceptions}
\label{sec:exceptions}

Each thread has two associated exception-handler endpoints: a \emph{standard}
exception handler and a \emph{timeout} exception handler, where the latter is MCS
only. If the thread causes
an exception, the kernel creates an IPC message with the relevant details and
sends this to the endpoint. The handler can then take the appropriate action.
Fault IPC messages are described in \autoref{sec:faults}.  Standard exception-handler
endpoints can be set with the \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or
\apifunc{seL4\_TCB\_SetSchedParams}{tcb_setschedparams} methods, while timeout exception
handlers can be set with \apifunc{seL4\_TCB\_SetTimeoutEndpoint}{tcb_settimeoutendpoint} (MCS only).
With these methods, a
capability address for the exception handler can be associated with a thread.
This address is then used to look up the handler endpoint, and the capability to
the endpoint is installed into the thread's kernel CNode.  For threads without
an exception handler, a null capability can be used; however, the consequences
differ per exception handler type.  Before raising an exception the
handler capability is validated. The kernel does not perform another lookup,
but checks that the capability is an endpoint with the correct rights.

The exception endpoint must have Write and either Grant or GrantReply rights.
Replying to the exception message restarts the thread. For certain exception
types, the contents of the reply message may be used to set the values in the
registers of the thread being restarted.  See \autoref{sec:faults} for details.
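
For illustration (a sketch only, assuming the non-MCS
\apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} signature; \texttt{fault\_ep} must
be the address of an endpoint capability with the rights described above):

\begin{verbatim}
#include <sel4/sel4.h>

/* Register fault_ep as the thread's standard exception handler while
   (re)setting its CSpace and VSpace roots. */
seL4_Error err = seL4_TCB_SetSpace(tcb, fault_ep,
                                   cspace_root, 0,
                                   vspace_root, 0);
\end{verbatim}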

\subsubsection{Standard Exceptions}

The standard exception handler is used when a thread triggers a fault that
cannot be recovered from without action by another thread.  For example, if a
thread raises a fault due to an unmapped virtual memory page, the thread cannot
make any more progress until the page is mapped.  If a thread experiences a
fault that would trigger the standard exception handler while that handler is set to a
null capability, the kernel will pause the thread and it will not run again.
This is because, without action by another thread, standard exceptions cannot be
recovered from.  Consequently, threads without standard exception handlers
should be trusted not to fault at all.

Standard exception handlers can be passive, in which case they will run on the
scheduling context of the faulting thread.

\subsubsection{Timeout Exceptions (MCS Only)} \label{sec:timeout-exceptions}

Timeout faults are raised when a thread attempts to run but has no available
budget, and only if that thread has a valid timeout exception handler capability.
The handling of timeout faults is not compulsory: if a thread does not have a
timeout fault handler, a fault will not be raised and the thread will continue
running when its budget is replenished.  This allows temporally sensitive
threads to handle budget overruns while other threads may ignore them.

Timeout faults are registered per thread, which means that while clients may
not have a timeout fault handler, servers may, allowing single-threaded,
time-sensitive, passive servers to use a timeout exception handler to recover
from malicious or untrusted clients whose budget expires while the server is
completing the request.  Timeout fault handlers can access server reply objects
and reply with an error to the client, then reset the server to handle the next
client request.

If a reply message is sent to a nested server and a scheduling context without
available budget is returned, another timeout fault will be generated if the
nested server also has a timeout fault handler.

\subsection{Message Layout of the Read-/Write-Registers Methods}
\label{sec:read_write_registers}

The registers of a thread can be read and written with the
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} methods.
For some registers, the kernel will silently mask off certain bits or ranges of bits and force them to contain certain
values, to ensure that they cannot be maliciously set to values that would compromise the running system, or to respect
values that the architecture specification mandates.
The register contents are transferred via the IPC buffer.
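
A sketch of reading and modifying a suspended thread's registers follows. The
\texttt{seL4\_UserContext} field names are architecture-specific (\texttt{pc}
as on AArch64 is used here), \texttt{new\_entry} is illustrative, and the
count argument gives how many registers, in \texttt{seL4\_UserContext} order,
are transferred.

\begin{verbatim}
#include <sel4/sel4.h>

seL4_UserContext ctx;
seL4_Word count = sizeof(ctx) / sizeof(seL4_Word);

/* Read all registers; suspend_source = 0 leaves the thread's state
   unchanged apart from the read. */
seL4_Error err = seL4_TCB_ReadRegisters(tcb, 0, 0, count, &ctx);

/* Redirect the program counter and write the registers back,
   resuming the thread (resume_target = 1). */
ctx.pc = (seL4_Word)new_entry;
err = seL4_TCB_WriteRegisters(tcb, 1, 0, count, &ctx);
\end{verbatim}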

\section{Faults}
\label{sec:faults}

A thread's actions may result in a fault. Faults are delivered to the
thread's exception handler so that it can take the appropriate action.
The fault type is specified in the message label and is one of:
\texttt{seL4\_Fault\_CapFault}, \texttt{seL4\_Fault\_VMFault},
\texttt{seL4\_Fault\_UnknownSyscall}, \texttt{seL4\_Fault\_UserException},
\texttt{seL4\_Fault\_DebugException},
\texttt{seL4\_Fault\_TimeoutFault}, or \texttt{seL4\_Fault\_NullFault}
(indicating no fault occurred and this is a normal IPC message).

Faults are delivered in such a way as to imitate a Call from the faulting
thread. This means that to send a fault message the fault endpoint
must have Write and either Grant or GrantReply permissions. If this is not the
case, a double fault happens (generally the thread is simply suspended).
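
Because fault delivery imitates a Call, a fault handler is typically a receive
loop that dispatches on the message label and replies to restart the faulting
thread. A minimal non-MCS sketch (\texttt{fault\_ep} is the endpoint
registered as the handler; only two fault types are shown):

\begin{verbatim}
#include <sel4/sel4.h>

void fault_handler(seL4_CPtr fault_ep)
{
    for (;;) {
        seL4_Word badge;
        seL4_MessageInfo_t info = seL4_Recv(fault_ep, &badge);

        switch (seL4_MessageInfo_get_label(info)) {
        case seL4_Fault_VMFault:
            /* e.g. map the page at seL4_GetMR(seL4_VMFault_Addr) ... */
            break;
        case seL4_Fault_UnknownSyscall:
            /* e.g. emulate the system call ... */
            break;
        default:
            break;
        }

        /* A reply with label 0 restarts the faulting thread. */
        seL4_Reply(seL4_MessageInfo_new(0, 0, 0, 0));
    }
}
\end{verbatim}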

\subsection{Capability Faults}

Capability faults may occur in two places. Firstly, a capability fault
can occur when lookup of a capability referenced by a
\apifunc{seL4\_Call}{sel4_call} or \apifunc{seL4\_Send}{sel4_send} system call
fails (\apifunc{seL4\_NBSend}{sel4_nbsend} calls on
invalid capabilities silently fail). In this case, the capability
on which the fault occurred may be the capability being invoked or an
extra capability passed in the \texttt{caps} field in the IPC buffer.

Secondly, a capability fault can occur when \apifunc{seL4\_Recv}{sel4_recv} or \apifunc{seL4\_NBRecv}{sel4_nbrecv}
is called on a capability that does not exist, is not an endpoint or notification capability, or does not have
receive permissions.

Replying to the fault IPC will restart the faulting thread. The contents of the
IPC message are given in \autoref{tbl:ipc_contents}.\\

\begin{table}[htb]
\noindent\begin{tabularx}{\textwidth}{XX}
\toprule
    \textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
    Address at which to restart execution & \ipcbloc{seL4\_CapFault\_IP} \\
    Capability address & \ipcbloc{seL4\_CapFault\_Addr} \\
    In receive phase (1 if the fault happened during a receive system call, 0
    otherwise) & \ipcbloc{seL4\_CapFault\_InRecvPhase} \\
    Lookup failure description, as described in \autoref{sec:lookup_fail_desc} &
    \ipcbloc{seL4\_CapFault\_LookupFailureType} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:ipc_contents}Contents of a capability fault IPC message.}
\end{table}

\subsection{Unknown Syscall}
\label{sec:unknown-syscall}

This fault occurs when a thread executes a system call with a syscall
number that is unknown to seL4.
The register set
of the faulting thread is passed to the thread's exception handler so that it
may, for example, emulate the system call if a thread is being
virtualised.

Replying to the fault IPC allows the thread to be restarted
and/or the thread's register set to be modified. If the reply has
a label of zero, the thread will be restarted. Additionally, if the
message length is non-zero, the faulting thread's register set will be
updated. In this case, the number of
registers updated is controlled with the length field of the message
tag.

\subsection{User Exception}

User exceptions are used to deliver architecture-defined exceptions. For
example, such an exception could occur if a user thread attempted to
divide a number by zero.

Replying to the fault IPC allows the thread to be restarted
and/or the thread's register set to be modified. If the reply has
a label of zero, the thread will be restarted. Additionally, if the
message length is non-zero, the faulting thread's register set will be
updated. In this case, the number of
registers updated is controlled with the length field of the message
tag.

\subsection{Debug Exception: Breakpoints and Watchpoints}
\label{sec:debug_exceptions}

Debug exceptions are used to deliver trace and debug related events to threads.
Breakpoints, watchpoints, trace-events and instruction-performance sampling
events are examples. These events are supported for userspace threads when the kernel
is configured to include them (when CONFIG\_HARDWARE\_DEBUG\_API is set). The hardware
debugging extensions API is supported on the following subset of the platforms that the
kernel has been ported to:

\begin{itemize}
\item PC99: IA-32 and x86\_64
\item Sabrelite (i.MX6)
\item Jetson TegraK1
\item HiSilicon Hikey
\item Raspberry Pi 3
\item Odroid-X (Exynos4)
\item Xilinx zynq7000
\end{itemize}

Information on the available hardware debugging resources is presented in the form of the following constants:

\begin{description}
\item[seL4\_NumHWBreakpoints]: Defines the total number of hardware break
registers available, of all types available on the hardware platform. On the Arm
Cortex~A7 for example, there are 6 exclusive instruction breakpoint registers,
and 4 exclusive data watchpoint registers, for a total of 10 monitor registers.
On this platform therefore, \texttt{seL4\_NumHWBreakpoints} is defined as 10.
The instruction breakpoint registers will always be assigned the lower API-IDs,
and the data watchpoints will always be assigned following them.

Additionally, \texttt{seL4\_NumExclusiveBreakpoints}, \texttt{seL4\_NumExclusiveWatchpoints}
and \texttt{seL4\_NumDualFunctionMonitors}
are defined for each target platform to reflect the number of available
hardware breakpoints/watchpoints of a certain type.

\item[seL4\_NumExclusiveBreakpoints]: Defines the number of hardware registers
capable of generating a fault \textbf{only} on instruction execution. Currently this will be
set only on Arm platforms. The API-ID of the first exclusive breakpoint is given
in \texttt{seL4\_FirstBreakpoint}. If there are no instruction-break exclusive
registers, \texttt{seL4\_NumExclusiveBreakpoints} will be set to \texttt{0} and
\texttt{seL4\_FirstBreakpoint} will be set to -1.

\item[seL4\_NumExclusiveWatchpoints]: Defines the number of hardware registers
capable of generating a fault \textbf{only} on data access. Currently this will be set only
on Arm platforms. The API-ID of the first exclusive watchpoint is given
in \texttt{seL4\_FirstWatchpoint}. If there are no data-break exclusive
registers, \texttt{seL4\_NumExclusiveWatchpoints} will be set to \texttt{0} and
\texttt{seL4\_FirstWatchpoint} will be set to -1.

\item[seL4\_NumDualFunctionMonitors]: Defines the number of hardware registers
capable of generating a fault on either type of access -- i.e.\ the register
supports both instruction and data breaks. Currently this will be set only on
x86 platforms. The API-ID of the first dual-function monitor is given
in \texttt{seL4\_FirstDualFunctionMonitor}. If there are no dual-function break
registers, \texttt{seL4\_NumDualFunctionMonitors} will be set to \texttt{0} and
\texttt{seL4\_FirstDualFunctionMonitor} will be set to -1.

\end{description}

\begin{table}[h]
\begin{tabularx}{\textwidth}{XX}
\toprule
\textbf{Value sent} & \textbf{IPC buffer location} \\
\midrule
\reg{Breakpoint instruction address} & \ipcbloc{IPCBuffer[0]} \\
\reg{Exception reason} & \ipcbloc{IPCBuffer[1]} \\
\reg{Watchpoint data access address} & \ipcbloc{IPCBuffer[2]} \\
\reg{Register API-ID} & \ipcbloc{IPCBuffer[3]} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:debug_exception_result}Debug fault message layout. The
register API-ID is not returned in the fault message from the kernel on
single-step faults.}
\end{table}

\subsection{Debug Exception: Single-stepping}
\label{sec:single_stepping_debug_exception}

The kernel provides support for hardware single-stepping of userspace
threads when configured to do so (when CONFIG\_HARDWARE\_DEBUG\_API is set). To
this end it exposes the invocation \texttt{seL4\_TCB\_ConfigureSingleStepping}.

The caller is expected to select an API-ID that corresponds to
an instruction breakpoint to use when setting up the single-stepping
functionality (i.e.\ an API-ID from 0 to \texttt{seL4\_NumExclusiveBreakpoints} - 1).
However, not all hardware platforms require an actual hardware breakpoint
register to provide single-stepping functionality. If the caller's hardware platform requires the
use of a hardware breakpoint register, the kernel will use the breakpoint register given to it in \texttt{bp\_num},
and return \texttt{true} in \texttt{bp\_was\_consumed}. If the underlying platform does not need a hardware
breakpoint to provide single-stepping, seL4 will return \texttt{false} in \texttt{bp\_was\_consumed} and
leave \texttt{bp\_num} unchanged.

If \texttt{bp\_was\_consumed} is \texttt{true}, the caller should not
attempt to re-configure \texttt{bp\_num} for breakpoint or watchpoint usage until
the caller has disabled single-stepping and released that register, via a subsequent
call to \texttt{seL4\_TCB\_ConfigureSingleStepping} or a fault-reply with
\texttt{n\_instr} being 0. Setting \texttt{num\_instructions} to \texttt{0}
\textbf{disables single-stepping}.

On architectures that require an actual hardware register to be configured for
single-stepping functionality, seL4 will restrict the number of registers that
can be configured as single-steppers to one at any given time. The register that
is currently configured (if any) for single-stepping will be the implicit
\texttt{bp\_num} argument in a single-step debug fault reply.

The kernel's single-stepping also supports skipping a certain number of
instructions before delivering the single-step fault message. The \texttt{num\_instructions}
argument should be set to \texttt{1} when single-stepping every instruction, or to a larger non-zero value to skip that many
instructions before the next single-step fault is delivered. This skip-count can also be set in
the fault-reply to a single-step debug fault.

\begin{table}[h]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Value sent} & \textbf{Register set by reply} & \textbf{IPC buffer location} \\
\midrule
\reg{Breakpoint instruction address} & \texttt{num\_instructions} to skip & \ipcbloc{IPCBuffer[0]} \\
\reg{Exception reason} & --- & \ipcbloc{IPCBuffer[1]} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:single_step_exception_result}Single-step fault message layout.}
\end{table}

\subsection{Timeout Fault (MCS only)}
\label{sec:timeout-fault}

Timeout faults are raised when a thread consumes all of its budget and has a
timeout fault handler that is not a null capability.  They allow a timeout
exception handler to take some action to restore the thread, and deliver a
message containing the scheduling context data word, as well as the amount of
time consumed since the last timeout fault occurred on this scheduling context,
or since \apifunc{seL4\_SchedContext\_YieldTo}{schedcontext_yieldto} or
\apifunc{seL4\_SchedContext\_Consumed}{schedcontext_consumed} was last called.
Timeout exception handlers can reply to a timeout fault with the registers set
in the same format as outlined in \autoref{sec:read_write_registers}.

\begin{table}[htb]
\noindent\begin{tabularx}{\textwidth}{XX}
\toprule
    \textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
    Data word from the scheduling context object that the thread was running on
    when the fault occurred. & \ipcbloc{seL4\_TimeoutFault\_Data} \\
    Upper 32 bits of microseconds consumed since last reset &
    \ipcbloc{seL4\_TimeoutFault\_Consumed} \\
    Lower 32 bits of microseconds consumed since last reset &
    \ipcbloc{seL4\_TimeoutFault\_Consumed\_LowBits} \\
\bottomrule
\end{tabularx}\\ \\
\caption{\label{tbl:tf_message_32}Timeout fault outcome on 32-bit architectures.}
\end{table}
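
For example, a 32-bit timeout handler might reconstruct the consumed time from
the two message registers as follows (a sketch using the buffer locations
named in the table above):

\begin{verbatim}
#include <sel4/sel4.h>
#include <stdint.h>

seL4_Word sc_data  = seL4_GetMR(seL4_TimeoutFault_Data);
uint64_t  consumed = ((uint64_t)seL4_GetMR(seL4_TimeoutFault_Consumed) << 32)
                   | seL4_GetMR(seL4_TimeoutFault_Consumed_LowBits);
\end{verbatim}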

\subsection{VM Fault}
\label{sec:vm-fault}

The thread caused a page fault. Replying to the fault IPC will restart
the thread. The contents of the IPC message are given below.\\

\begin{table}[htb]
\begin{tabularx}{\textwidth}{XX}
\toprule
\textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
    Program counter to restart execution at. & \ipcbloc{seL4\_VMFault\_IP} \\
    Address that caused the fault. & \ipcbloc{seL4\_VMFault\_Addr} \\
    Instruction fault (1 if the fault was caused by an instruction fetch). & \ipcbloc{seL4\_VMFault\_PrefetchFault}  \\
    Fault status register (FSR). Contains information about the cause of the fault. Architecture dependent. & \ipcbloc{seL4\_VMFault\_FSR} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:vm_fault_result_arm} VM Fault outcome on all architectures.}
\end{table}

\section{Domains}
\label{sec:domains}

Domains are used to isolate independent subsystems, so as to limit
information flow between them.
The kernel switches between domains according to a fixed, time-triggered
schedule.
The fixed schedule is compiled into the kernel via the constant
\texttt{CONFIG\_NUM\_DOMAINS} and the global variable \texttt{ksDomSchedule}.

A thread belongs to exactly one domain, and will only run when that domain
is active.
The \apifunc{seL4\_DomainSet\_Set}{domainset_set} method changes the domain
of a thread.
The caller must possess a \obj{Domain} cap and the thread's \obj{TCB} cap.
The initial thread starts with a \obj{Domain} cap (see
\autoref{sec:messageinfo}).
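
A sketch of moving a thread into domain 1, assuming \texttt{domain\_cap} is
the \obj{Domain} capability handed to the initial thread and \texttt{tcb} a
capability to the target thread's \obj{TCB}:

\begin{verbatim}
#include <sel4/sel4.h>

/* Assign the thread to domain 1; it will then only run while domain 1
   is active in the kernel's fixed domain schedule. */
seL4_Error err = seL4_DomainSet_Set(domain_cap, 1, tcb);
\end{verbatim}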

\section{Virtualisation}
\label{sec:virt}

Hardware execution virtualisation is supported on specific Arm and x86 platforms. The interface is exposed through a series
of kernel objects, invocations and syscalls that allow the user to take advantage of hardware
virtualisation features.

Hardware virtualisation allows a thread to perform instructions and operations as if it were
running at a higher privilege level. As higher privilege levels typically have access to
additional machine registers and other pieces of state, a \obj{VCPU} object is introduced to act
as storage for this state. For simplicity we refer to this virtualised higher privileged level as
`guest mode'. \obj{VCPU}s are bound in a one-to-one relationship with a \obj{TCB} in order
to provide a thread with the ability to run in this higher privileged mode. See the sections on
Arm and x86 below for more precise details.

\obj{VCPU} objects also have additional, architecture-specific, invocations for manipulating
the additional state or other virtualisation controls provided by the hardware. Binding of
a \obj{VCPU} to a \obj{TCB} is done by an invocation on the \obj{VCPU} only, and not the \obj{TCB}.

The provided objects and invocations are, generally speaking, the thinnest possible shim over
the underlying hardware primitives and operations. As a result, an in-depth familiarity with
the underlying architecture-specific hardware mechanisms is required to use these objects, and
such familiarity is therefore assumed in this description.

\subsection{Arm}

When a \obj{TCB} has a bound \obj{VCPU}, it is allowed to have the mode portion of the
\texttt{cpsr} register set to values other than \texttt{user}. Specifically, it may have any value other than
\texttt{hypervisor}.

TODO: this section needs more detail

\subsection{x86}

A \obj{TCB} with a bound \obj{VCPU} has two execution modes: one is the original thread, just as
if there were no bound \obj{VCPU}, and the other is guest-mode execution using the
\obj{VCPU}. Switching from regular execution mode into the guest execution mode is
done by using the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall. Executing this syscall causes the thread, whenever
it is scheduled thereafter, to execute using the higher privileged mode controlled by the \obj{VCPU}.
Should the guest execution mode generate any kind of fault, or if a message arrives
on the \obj{TCB}'s bound notification, the \obj{TCB} will be switched back to regular mode
and the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall will return with a message indicating the reason for return.

\obj{VCPU} state and execution is controlled through the \apifunc{seL4\_X86\_VCPU\_ReadVMCS}{x86_vcpu_readvmcs}
and \apifunc{seL4\_X86\_VCPU\_WriteVMCS}{x86_vcpu_writevmcs} invocations.
These are very thin wrappers around the hardware \texttt{vmread} and \texttt{vmwrite} instructions, and the kernel
merely does enough validation on the parameters to ensure the \obj{VCPU} is not configured
to run in such a way as to violate any kernel properties. For example, it is not possible to
disable the use of External Interrupt Exiting, as this would prevent the kernel from receiving
timer interrupts and allow the thread to monopolise CPU time.

Memory access of the guest execution mode is controlled by requiring the use of Extended
Page Tables (EPT). A series of EPT-related paging structure objects (\obj{EPTPML4}, \obj{EPTPDPT}, \obj{EPTPD}, \obj{EPTPT})
exist and are manipulated in exactly the same manner as the objects for the regular virtual
address space. Once constructed, a \obj{TCB} can be given an \obj{EPTPML4} as an EPT root with \apifunc{seL4\_TCB\_SetEPTRoot}{x86_set_eptroot},
which serves as the VSpace root when executing in guest mode, with the VSpace root set
with \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or \apifunc{seL4\_TCB\_Configure}{tcb_configure}
continuing to provide translation when the TCB is executing in its normal mode.

Direct access to I/O ports can be given to the privileged execution mode through the
\apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport} invocation, which links the provided I/O port capability to
the \obj{VCPU} and makes a subset of its I/O port range accessible to the \obj{VCPU}.
Linking means that an I/O port capability can only be used in a single \apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport}
invocation, and a second invocation will undo the previous one. The link also means that
if the I/O port capability is deleted for any reason, the access will be correspondingly removed
from the \obj{VCPU}.
