1%
2% Copyright 2014, General Dynamics C4 Systems
3%
4% SPDX-License-Identifier: GPL-2.0-only
5%
6
7\chapter{\label{ch:objects}Kernel Services and Objects}
8
9A limited number of service primitives are provided by the microkernel;
10more complex services may be implemented as applications on top of these
11primitives. In this way, the functionality of the system can be extended
12without increasing the code and complexity in privileged mode, while
13still supporting a potentially wide number of services for varied
14application domains.
15
16Note that some services are available only when the kernel is configured for
17MCS\footnote{``mixed-criticality system''} support.
18
19The basic services seL4 provides are as follows:
20\begin{description}
21    \item[Threads] are an abstraction of CPU execution that supports
22    running software;
23
24    \item[Scheduling contexts] (MCS only) are an abstraction of CPU execuion time.
25
26    \item[Address spaces] are virtual memory spaces that each contain an
27    application. Applications are limited to accessing memory in their
28    address space;
29
30    \item[Inter-process communication] (IPC) via \emph{endpoints} allows
31    threads to communicate using message passing;
32
33    \item[Reply objects] (MCS only) are used to store single-use reply capabilities,
34    and are provided by the receiver during message passing.
35
36    \item[Notifications] provide a non-blocking signalling mechanism
37      similar to binary semaphores;
38
39    \item[Device primitives] allow device drivers to be implemented as
40    unprivileged applications.  The kernel exports hardware device
41    interrupts via IPC messages; and
42
43    \item[Capability spaces] store capabilities (i.e., access rights) to
44    kernel services along with their book-keeping information.
45\end{description}
46
47This chapter gives an overview of these services, describes how kernel
48objects are accessed by userspace applications, and describes how new
49objects can be created.
50
51\section{Capability-based Access Control}
52\label{sec:cap-access-control}
53
54The seL4 microkernel provides a capability-based access-control model.
55Access control governs all kernel services; in order to perform an
56operation, an application must \emph{invoke} a capability in its
57possession that has sufficient access rights for the requested service.
58With this, the system can be configured to isolate software components from
59each other, and also to enable authorised, controlled communication
60between components by selectively granting specific communication
61capabilities.  This enables software-component isolation with a high
62degree of assurance, as only those operations explicitly authorised by
63capability possession are permitted.
64
65A capability is an unforgeable token that references a specific kernel
66object (such as a thread control block) and carries access rights that
67control what methods may be invoked.
68Conceptually, a capability resides in an application's \emph{capability
69space}; an address in this space refers to a \emph{slot} which may or
70may not contain a capability.  An application may refer to
71a capability---to request a kernel service, for example---using the
72address of the slot holding that capability.  This means, the seL4
73capability model is an instance of a \emph{segregated} (or \emph{partitioned})
74capability system, where capabilities are managed by the kernel.
75
76Capability spaces are implemented as a directed graph of kernel-managed
77\emph{capability nodes} (\obj{CNode}s).  A \obj{CNode} is a table of
78slots, where each slot may contain further \obj{CNode} capabilities. An
79address of a capability in a capability space is the concatenation of the indices
80of slots within \obj{CNodes} forming the path to the destination
81slot; we discuss \obj{CNode} objects in detail in \autoref{ch:cspace}.
82
83Capabilities can be copied and moved within capability spaces, and
84also sent via IPC. This allows creation of applications with specific
85access rights, the delegation of authority to another application, and
86passing to an application authority to a newly created (or selected)
87kernel service. Furthermore, capabilities can be \emph{minted} to
88create a derived capability with a subset of the rights of the
89original capability (never with more rights). A newly minted
90capability can be used for partial delegation of authority.
91
92Capabilities can also be revoked to withdraw
93authority. Revocation recursively removes any capabilities that have
94been derived from the original capability being revoked. The propagation of
95capabilities through the system is controlled by a
96\emph{take-grant}-based model~\cite{Elkaduwe_KE_08,Boyton_09}.
97
98\section{System Calls}
99\label{sec:syscalls}
100\label{sec:sys_send}
101\label{sec:sys_wait}
102\label{sec:sys_nbwait}
103\label{sec:sys_recv}
104\label{sec:sys_call}
105\label{sec:sys_reply}
106\label{sec:sys_nbsend}
107\label{sec:sys_replyrecv}
108\label{sec:sys_nbrecv}
109\label{sec:sys_nbsendrecv}
110\label{sec:sys_nbsendwait}
111\label{sec:sys_yield}
112
113The seL4 kernel provides a message-passing service for communication between
114threads. This mechanism is also used for communication with kernel-provided
115services. There is a standard message format, each message containing a
116number of data words and possibly some capabilities. The structure and encoding
117of these messages are described in detail in \autoref{ch:ipc}.
118
119Threads send messages by invoking capabilities within their capability space.
120When an endpoint, notification or reply capability is invoked in this way, the
121message will be transferred through the kernel to another thread.
122When other capabilities to kernel objects are invoked, the message will be
123interpreted as a method invocation in a manner specific to the type of kernel
124object. For example, invoking a thread control block (TCB) capability with a
125correctly formatted message will suspend the target thread.
126
127% TODO: The \apifunc{}{} refs in the paragraphs below (up to the section on
128% kernel objects) go to the API reference which is very terse and not nearly as
129% useful for cross-references at this point in the manual.  We should consider
130% making these prose descriptions of the system calls legitimate targets for
131% references, and cross-reference to those instead.
132
133Fundamentally, we can regard the kernel as providing three system calls:
134\emph{Send}, \emph{Receive} and \emph{Yield}.  However, there are also
135combinations and variants of the basic \emph{Send} and \emph{Receive} calls.  An
136important variant is the \emph{Call} operation, which consists of a standard
137\emph{Send} operation atomically followed by a variant of \emph{Receive} which
138waits for a \emph{Reply}.  A \emph{reply} message is always delivered via a
139special resource instead of using the standard IPC mechanism; see
140\apifunc{seL4\_Call}{sel4_call} below for details.
141
142Invoking methods on kernel objects other than endpoints and notifications is
143done with \emph{Send} or \emph{Call}, depending on whether the invoker
144wants a reply from the kernel (\emph{Call}) or not (\emph{Send}).  By using
145functions provided by the libsel4 API you are guaranteed to always use the more
146appropriate one.  The \emph{Yield} system call is not associated with any kernel
147object and is the only operation that does not invoke a capability.  In the MCS
148configuration, \emph{Wait} is a variant of \emph{Receive} that does not require
149a reply object to be provided---on non-MCS configurations, \emph{Wait} is
150synonymous with \emph{Receive}, because neither call takes a reply object.
151
152The fundamental system calls are:
153\begin{description}
154    \item[\apifunc{seL4\_Yield}{sel4_yield}] is the only system call that does
155    not require a capability to be used.  It forfeits the remainder of the
156    calling thread's timeslice and causes invocation of the kernel's scheduler.
157    If there are no other runnable threads with the same priority as the caller,
158    the calling thread will immediately be scheduled with a fresh timeslice.  In
159    the MCS configuration, this behaviour depends on the state of the scheduling
160    context; see \autoref{sec:scheduling_contexts}.
161
162    \item[\apifunc{seL4\_Send}{sel4_send}] delivers a message through the named
163    capability.  If the invoked capability is an endpoint, and no receiver is
164    ready to receive the message immediately, the sending thread will block
165    until the message can be delivered.  No error code or response will be
166    returned by the receiving object.
167
168    \item[\apifunc{seL4\_Recv}{sel4_recv}] (``receive'') is used by a thread to
169    receive messages through endpoints or notifications.  If no sender or
170    notification is pending, the caller will block until a message or
171    notification can be delivered.  This system call works only on
172    \obj{Endpoint} or \obj{Notification} capabilities, raising a fault (see
173    section \ref{sec:faults}) when attempted with other capability types.
174
175    In the MCS configuration, \emph{Receive} takes a reply capability---a
176    capability to a reply object--as a parameter.
177\end{description}
178
179% TODO: seL4_Recv() gets hyphenated as
180% seL4_-
181% Recv()
182% here, and that's ugly.  But just suppressing hyphenation for it (with \mbox or
183% \hyphenation) makes it overrun the right margin.  GBR knows an easy way to
184% deal with this in groff but this is not groff...
185
186The remaining system calls are variants and combinations of
187\apifunc{seL4\_Send}{sel4_send} and \apifunc{seL4\_Recv}{sel4_recv} efficiently
188accommodate common use cases in systems programming.
189
190\begin{description}
191    \item[\apifunc{seL4\_NBSend}{sel4_nbsend}] performs a polling send on an
192    endpoint.  If the message cannot be delivered immediately, i.e., there is no
193    receiver waiting on the destination \obj{Endpoint}, the message is silently
194    dropped.  The sending thread continues execution.  As with
195    \apifunc{seL4\_Send}{sel4_send}, no error code or response will be returned.
196
197    \item[\apifunc{seL4\_NBRecv}{sel4_nbrecv}] is used by a thread to check for
198    signals pending on a notification object or messages pending on an endpoint
199    without blocking.  This system call works only on endpoints and notification
200    object capabilities, raising a fault (see section \ref{sec:faults}) when
201    attempted with other capability types.
202
203    \item[\apifunc{seL4\_Call}{sel4_call}] combines
204    \apifunc{seL4\_Send}{sel4_send} and \apifunc{seL4\_Recv}{sel4_recv} with
205    some important differences.  The call blocks the sending thread until its
206    message is delivered and a reply message is received.
207
208    When invoking capabilities to kernel services other than endpoints, using
209    \apifunc{seL4\_Call}{sel4_call} allows the kernel to return an error code or
210    other response through the reply message.
211
212    When the sent message is delivered to another thread via an \obj{Endpoint},
213    the kernel does the same operation as \apifunc{seL4\_Send}{sel4_send}.  What
214    happens next depends on the kernel configuration.  For MCS configurations,
215    the kernel then updates the \emph{reply object} provided by the receiver.  A
216    \emph{reply object} is a vessel for tracking reply messages, used to send
217    a reply message and wake up the caller.  In non-MCS configurations, the
218    kernel then deposits a special \emph{reply capability} in a dedicated slot
219    in the receiver's \obj{TCB}.  This \emph{reply capability} is a single-use
220    right to send a reply message and wake up the caller, meaning that the
221    kernel invalidates it as soon as it has been invoked.  For both variants,
222    the calling thread is blocked until a capability to the reply object is
223    invoked.  For more information, see \autoref{sec:ep-cal}.
224
225    \item[\apifunc{seL4\_Reply}{sel4_reply}] is used to respond to a
226    \apifunc{seL4\_Call}{sel4_call}, by invoking the reply capability generated
227    by the \apifunc{seL4\_Call}{sel4_call} system call and stored in a dedicated
228    slot in the replying thread's TCB.  It has exactly the same behaviour as
229    invoking the reply capability with \apifunc{seL4\_Send}{sel4_send} which is
230    described in \autoref{sec:ep-cal}.
231
232    \item[\apifunc{seL4\_ReplyRecv}{sel4_replyrecv}] combines
233    \apifunc{seL4\_Reply}{sel4_reply} and \apifunc{seL4\_Recv}{sel4_recv}.  It
234    exists mostly for efficiency reasons, namely the common case of replying to
235    a request and waiting for the next can be performed in a single kernel
236    system call instead of two.  The transition from the reply to the receive
237    phase is also atomic.
238
239    \item[\apifunc{seL4\_Wait}{sel4_wait}] works like
240    \apifunc{seL4\_Recv}{sel4_recv}; on non-MCS configurations, they are in fact
241    synonymous.  In the MCS configuration, \apifunc{seL4\_Wait}{sel4_wait} is
242    used when no reply is expected.  Unlike \apifunc{seL4\_Recv}{sel4_recv},
243    \apifunc{seL4\_Wait}{sel4_wait} takes no reply capability.
244
245    \item[\apifunc{seL4\_NBWait}{sel4_nbwait}] (MCS only) is used by a thread to
246    poll for messages through endpoints or notifications.  If no sender or
247    notification is pending, the system call returns immediately.
248
249    \item[\apifunc{seL4\_NBSendWait}{sel4_nbsendwait}] (MCS only) combines an
250    \apifunc{seL4\_NBSend}{sel4_nbsend} and \apifunc{seL4\_Wait}{sel4_wait} into
251    one atomic system call.
252
253    \item[\apifunc{seL4\_NBSendRecv}{sel4_nbsendwait}] (MCS only) combines an
254    \apifunc{seL4\_NBSend}{sel4_nbsend} and \apifunc{seL4\_Recv}{sel4_recv} into
255    one atomic system call.
256\end{description}
257
258\section{Kernel Objects}
259\label{s:sel4_internals}
260
261In this section we give a brief overview of the kernel-implemented
262object types whose instances (also simply called \emph{objects}) can be invoked by applications. The interface to these
263objects forms the interface to the kernel itself. The creation and use
264of kernel services is achieved by the creation,
265manipulation, and combination of these kernel objects:
266
267\begin{description}
268
269    \item[\obj{CNodes}] (see \autoref{ch:cspace}) store capabilities, giving threads permission to
270    invoke methods on particular objects.
271    Each \obj{CNode} has a fixed number of slots,
272    always a power of two, determined when the \obj{CNode} is created. Slots
273    can be empty or contain a capability.
274
275    \item[Thread Control Blocks] (\obj{TCB}s; see \autoref{ch:threads}) represent a thread of
276    execution in seL4. Threads are the unit of execution that is
277    scheduled, blocked, unblocked, etc., depending on the application's
278    interaction with other threads.
279
280   \item[Scheduling contexts] (MCS only) (\obj{SchedulingContext}s; see \autoref{ch:threads}) represent
281       CPU time in seL4. Users can create scheduling contexts from untyped objects, however on
282       creation scheduling contexts are \textit{empty} and do not represent any time.
283       Initially, there is a capability to \obj{SchedControl} for each node, which
284       allows scheduling context to be populated with parameters, which combined with priority
285       control thread's access to CPU time.
286
287    \item[\obj{Endpoints}] (see \autoref{ch:ipc}) facilitate message-passing
288    communication between threads. IPC is synchronous: A thread
289    trying to send or receive on an endpoint blocks until the message
290    can be delivered. This means that message delivery only happens if
291    a sender and a receiver rendezvous at the endpoint, and the
292    kernel can deliver the message with a single copy (or without
293    copying for short messages using only registers).
294
295    A capability to an endpoint can be restricted to be
296    send-only or receive-only. Additionally, \obj{Endpoint}
297    capabilities can have the grant right, which allows sending
298    capabilities as part of the message.
299
300   \item[\obj{Reply objects}] (MCS only) (see \autoref{ch:ipc}) track scheduling
301    context donation and provide a container for single-use reply capabilities.
302    They are provided by \apifunc{seL4\_Recv}{sel4_recv}.
303
304    \item[\obj{Notification} Objects] (see \autoref{ch:notifications})
305      provide a simple signalling mechanism. A \obj{Notification}
306      is a word-size array of flags, each of which behaves like a binary semaphore. Operations
307      are \emph{signalling} a subset of flags in a single operation,
308      polling to check any flags,
309      and blocking until any are signalled. Notification capabilities
310      can be signal-only or wait-only.
311
312    \item[Virtual Address Space Objects] (see \autoref{ch:vspace})
313    are used to construct a virtual
314    address space (or VSpace) for one or more threads. These
315    objects largely directly correspond to those of the hardware, and
316    as such are architecture-dependent. The kernel also includes \obj{ASID
317    Pool} and \obj{ASID Control} objects for tracking the status of
318    address spaces.
319
320    \item[Interrupt Objects] (see \autoref{ch:io}) give applications the ability to receive
321    and acknowledge interrupts from hardware devices.
322    Initially, there is a capability to \obj{IRQControl},
323    which allows for the creation of \obj{IRQHandler} capabilities.
324    An \obj{IRQHandler} capability permits the management of a specific
325    interrupt source associated with a specific device.
326    It is delegated to
327    a device driver to access an interrupt source. The \obj{IRQHandler}
328    object allows threads to wait for and acknowledge individual
329    interrupts.
330
331    \item[Untyped Memory] (see \autoref{sec:kernmemalloc}) is the foundation of memory allocation
332    in the seL4 kernel.  Untyped memory capabilities have a single method
333    which allows the creation of new kernel objects. If the method
334    succeeds, the calling thread gains access to capabilities to the
335    newly-created objects. Additionally, untyped memory objects can be
336    divided into a group of smaller untyped memory objects allowing
337    delegation of part (or all) of the system's memory.  We discuss
338    memory management in general in the following sections.
339
340\end{description}
341
342\section{Kernel Memory Allocation}
343\label{sec:kernmemalloc}
344
345The seL4 microkernel does not dynamically allocate memory for kernel objects.
346Instead, objects must be explicitly created from application-controlled memory
347regions via \obj{Untyped Memory} capabilities.  Applications must have
348explicit authority to memory (through these \obj{Untyped Memory} capabilities) in
349order to create new objects, and all objects consume a fixed amount of memory once
350created. These mechanisms can be used to precisely control
351the specific amount of physical memory available to applications,
352including being able to enforce isolation of physical memory access
353between applications or a device.  There are no arbitrary resource
354limits in the kernel apart from those dictated by the
355hardware\footnote{The treatment of virtual ASIDs imposes a fixed number
356of address spaces. This limitation is to be removed in future
357versions of seL4.}, and so many denial-of-service attacks via resource
358exhaustion are avoided.
359
360At boot time, seL4 pre-allocates the memory required for the kernel
361itself, including the code, data, and stack sections (seL4 is a single
362kernel-stack operating system). It then creates an initial user
363thread (with an appropriate address and capability space).
364The kernel then hands all remaining memory to
365the initial thread in the form of capabilities to \obj{Untyped Memory}, and
366some additional capabilities to kernel objects that were required to
367bootstrap the initial thread.  These \obj{Untyped Memory} regions can then be split into
368smaller regions or other kernel objects using the
369\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method; the created objects are termed \emph{children} of
370the original untyped memory object.
371
372The user-level application that creates an object using \apifunc{seL4\_Untyped\_Retype}{untyped_retype}
373receives full authority over the resulting object. It can then delegate
374all or part of the authority it possesses over this object to one or
375more of its clients.
376
377Untyped memory objects represent two different types of memory:
378general purpose memory, or device memory.
379\emph{General purpose} memory can be untyped into any other object
380type and used for any operation on untyped memory provided by the kernel.
381\emph{Device memory} covers memory regions reserved for devices
382as determined by the hardware platform, and usage of these objects
383is restricted by the kernel in the following ways:
384
385\begin{itemize}
386\item Device untyped objects can only be retyped into frames or other
387untyped objects; developers cannot, for example, create an endpoint from device memory.
388\item Frame objects retyped from device untyped objects cannot be set as thread IPC buffers, or used
389in the creation of an ASID pool
390\end{itemize}
391
392The type attribute (whether it represents \emph{general purpose} or
393\emph{device} memory) of a child untyped object is inherited from its
394parent untyped object. That is, any child of a device untyped will
395also be a device untyped. Developers cannot change the type attribute of an untyped.
396
397\subsection{Reusing Memory}
398\label{s:memRevoke}
399
400The model described thus far is sufficient for applications to
401allocate kernel objects, distribute authority among client
402applications, and obtain various kernel services provided by these
403objects.  This alone is sufficient for a simple static system
404configuration.
405
406The seL4 kernel also allows \obj{Untyped Memory} regions to be reused.
407Reusing a region of memory is allowed only
408when there are no dangling references (i.e., capabilities) left to the
409objects inside that memory.  The kernel tracks
410\emph{capability derivations}, i.e., the children generated by the
411methods \apifunc{seL4\_Untyped\_Retype}{untyped_retype}, \apifunc{seL4\_CNode\_Mint}{cnode_mint}, \apifunc{seL4\_CNode\_Copy}{cnode_copy}, and
412\apifunc{seL4\_CNode\_Mutate}{cnode_mutate}.
413
414The tree structure so generated is termed the \emph{capability
415derivation tree} (CDT).\footnote{Although the CDT conceptually is a separate
416data structure, it is implemented as part of the CNode object and so
417requires no additional kernel meta-data.}  For example, when a user
418creates new kernel objects by retyping untyped memory, the newly created
419capabilities would be inserted into the CDT as children of the untyped
420memory capability.
421
422For each \obj{Untyped} capability pointing to an \obj{Untyped Memory} region,
423the kernel keeps a \emph{watermark} recording how much of the region has
424previously been allocated. Whenever a user requests the kernel to create new
425objects in an untyped memory region, the kernel will carry out one of two
426actions: if there are already existing objects allocated in the region, the
427kernel will allocate the new objects at the current watermark level, and
428increase the watermark. If all capabilities to objects previously allocated in
429the region have been deleted, the kernel will reset the watermark and start
430allocating new objects from the beginning of the region again.
431
432Finally, the \apifunc{seL4\_CNode\_Revoke}{cnode_revoke} method provided by
433the \obj{CNode} objects deletes all capabilities derived from the argument
434capability. Revoking the last capability to a kernel object triggers the
435\emph{destroy} operation on the now unreferenced object. This cleans up any
436in-kernel dependencies between it, other objects and the kernel. It does not
437necessarily zero all memory state associated with the object yet. Memory zeroing
438will happen for the entire region when an untyped capability is \emph{reset} as
439part of the first retype operation after all child capabilities have been
440revoked.
441
442To reuse a region of memory, user code can call
443\apifunc{seL4\_CNode\_Revoke}{cnode_revoke} on the original untyped capability
444for that region, thereby removing all children of that capability. After this
445invocation, no references remain to any object within the untyped region, and
446the region may be safely retyped again.
447
448\subsection{Summary of Object Sizes}
449\label{sec:object_sizes}
450
451When retyping untyped memory it is useful to know how much memory the
452object will require. Object sizes are defined in libsel4.
453
454Note that \obj{CNode}s, \obj{SchedContext}s (MCS only), and \obj{Untyped Object}s
455have variables sizes. When retyping untyped memory into \obj{CNode}s
456or \obj{SchedContext}s, or breaking an
457\obj{Untyped Object} into smaller \obj{Untyped Object}s, the
458\texttt{size\_bits} argument to
459\apifunc{seL4\_Untyped\_Retype}{untyped_retype} is used to specify
460the size of the resulting objects.
461For all other object types, the size is fixed, and the \texttt{size\_bits}
462argument to \apifunc{seL4\_Untyped\_Retype}{untyped_retype} is ignored.
463
464\begin{table}[htb]
465  \renewcommand{\arraystretch}{1.5}
466  \begin{tabularx}{\textwidth}{p{0.15\textwidth}XXXX}
467    \toprule
468    Type & Meaning of \texttt{size\_bits} & Size in Bytes  \\
469    \midrule
470    \obj{CNode} & $\log_2$ number of slots & $2^\texttt{size\_bits} \cdot 2^\texttt{seL4\_SlotBits}$
471	  \texttt{seL4\_SlotBits} is: \newline
472	  \emph{on 32-bit architectures:} 4 \newline
473	  \emph{on 64-bit architectures:} 5 \\
474    \obj{SchedContext} (MCS only) & $\log_2$ size in bytes & $2^\texttt{size\_bits}$ \\
475    \obj{Untyped} & $\log_2$ size in bytes & $2^\texttt{size\_bits}$ \\
476    \bottomrule
477  \end{tabularx}
478  \caption{\label{tab:objsize} Meaning of \texttt{size\_bits} for object types of
479	variable size}
480\end{table}
481
482A single call to \apifunc{seL4\_Untyped\_Retype}{untyped_retype} can retype a
483single \obj{Untyped Object} into multiple objects. The number of objects
484to create is specified by its \texttt{num\_objects} argument. All created
485objects must be of the same type, specified by the \texttt{type} argument. In
486the case of variable-sized objects, each object must also be of the same size.
487If the size of the memory area needed (calculated by the object size multiplied
488by \texttt{num\_objects}) is greater than the remaining unallocated memory of
489the \obj{Untyped Object}, an error will result.
490