1% 2% Copyright 2014, General Dynamics C4 Systems 3% 4% SPDX-License-Identifier: GPL-2.0-only 5% 6 7\chapter{\label{ch:objects}Kernel Services and Objects} 8 9A limited number of service primitives are provided by the microkernel; 10more complex services may be implemented as applications on top of these 11primitives. In this way, the functionality of the system can be extended 12without increasing the code and complexity in privileged mode, while 13still supporting a potentially wide number of services for varied 14application domains. 15 16Note that some services are available only when the kernel is configured for 17MCS\footnote{``mixed-criticality system''} support. 18 19The basic services seL4 provides are as follows: 20\begin{description} 21 \item[Threads] are an abstraction of CPU execution that supports 22 running software; 23 24 \item[Scheduling contexts] (MCS only) are an abstraction of CPU execuion time. 25 26 \item[Address spaces] are virtual memory spaces that each contain an 27 application. Applications are limited to accessing memory in their 28 address space; 29 30 \item[Inter-process communication] (IPC) via \emph{endpoints} allows 31 threads to communicate using message passing; 32 33 \item[Reply objects] (MCS only) are used to store single-use reply capabilities, 34 and are provided by the receiver during message passing. 35 36 \item[Notifications] provide a non-blocking signalling mechanism 37 similar to binary semaphores; 38 39 \item[Device primitives] allow device drivers to be implemented as 40 unprivileged applications. The kernel exports hardware device 41 interrupts via IPC messages; and 42 43 \item[Capability spaces] store capabilities (i.e., access rights) to 44 kernel services along with their book-keeping information. 45\end{description} 46 47This chapter gives an overview of these services, describes how kernel 48objects are accessed by userspace applications, and describes how new 49objects can be created. 50 51\section{Capability-based Access Control} 52\label{sec:cap-access-control} 53 54The seL4 microkernel provides a capability-based access-control model. 55Access control governs all kernel services; in order to perform an 56operation, an application must \emph{invoke} a capability in its 57possession that has sufficient access rights for the requested service. 58With this, the system can be configured to isolate software components from 59each other, and also to enable authorised, controlled communication 60between components by selectively granting specific communication 61capabilities. This enables software-component isolation with a high 62degree of assurance, as only those operations explicitly authorised by 63capability possession are permitted. 64 65A capability is an unforgeable token that references a specific kernel 66object (such as a thread control block) and carries access rights that 67control what methods may be invoked. 68Conceptually, a capability resides in an application's \emph{capability 69space}; an address in this space refers to a \emph{slot} which may or 70may not contain a capability. An application may refer to 71a capability---to request a kernel service, for example---using the 72address of the slot holding that capability. This means, the seL4 73capability model is an instance of a \emph{segregated} (or \emph{partitioned}) 74capability system, where capabilities are managed by the kernel. 75 76Capability spaces are implemented as a directed graph of kernel-managed 77\emph{capability nodes} (\obj{CNode}s). A \obj{CNode} is a table of 78slots, where each slot may contain further \obj{CNode} capabilities. An 79address of a capability in a capability space is the concatenation of the indices 80of slots within \obj{CNodes} forming the path to the destination 81slot; we discuss \obj{CNode} objects in detail in \autoref{ch:cspace}. 82 83Capabilities can be copied and moved within capability spaces, and 84also sent via IPC. This allows creation of applications with specific 85access rights, the delegation of authority to another application, and 86passing to an application authority to a newly created (or selected) 87kernel service. Furthermore, capabilities can be \emph{minted} to 88create a derived capability with a subset of the rights of the 89original capability (never with more rights). A newly minted 90capability can be used for partial delegation of authority. 91 92Capabilities can also be revoked to withdraw 93authority. Revocation recursively removes any capabilities that have 94been derived from the original capability being revoked. The propagation of 95capabilities through the system is controlled by a 96\emph{take-grant}-based model~\cite{Elkaduwe_KE_08,Boyton_09}. 97 98\section{System Calls} 99\label{sec:syscalls} 100\label{sec:sys_send} 101\label{sec:sys_wait} 102\label{sec:sys_nbwait} 103\label{sec:sys_recv} 104\label{sec:sys_call} 105\label{sec:sys_reply} 106\label{sec:sys_nbsend} 107\label{sec:sys_replyrecv} 108\label{sec:sys_nbrecv} 109\label{sec:sys_nbsendrecv} 110\label{sec:sys_nbsendwait} 111\label{sec:sys_yield} 112 113The seL4 kernel provides a message-passing service for communication between 114threads. This mechanism is also used for communication with kernel-provided 115services. There is a standard message format, each message containing a 116number of data words and possibly some capabilities. The structure and encoding 117of these messages are described in detail in \autoref{ch:ipc}. 118 119Threads send messages by invoking capabilities within their capability space. 120When an endpoint, notification or reply capability is invoked in this way, the 121message will be transferred through the kernel to another thread. 122When other capabilities to kernel objects are invoked, the message will be 123interpreted as a method invocation in a manner specific to the type of kernel 124object. For example, invoking a thread control block (TCB) capability with a 125correctly formatted message will suspend the target thread. 126 127% TODO: The \apifunc{}{} refs in the paragraphs below (up to the section on 128% kernel objects) go to the API reference which is very terse and not nearly as 129% useful for cross-references at this point in the manual. We should consider 130% making these prose descriptions of the system calls legitimate targets for 131% references, and cross-reference to those instead. 132 133Fundamentally, we can regard the kernel as providing three system calls: 134\emph{Send}, \emph{Receive} and \emph{Yield}. However, there are also 135combinations and variants of the basic \emph{Send} and \emph{Receive} calls. An 136important variant is the \emph{Call} operation, which consists of a standard 137\emph{Send} operation atomically followed by a variant of \emph{Receive} which 138waits for a \emph{Reply}. A \emph{reply} message is always delivered via a 139special resource instead of using the standard IPC mechanism; see 140\apifunc{seL4\_Call}{sel4_call} below for details. 141 142Invoking methods on kernel objects other than endpoints and notifications is 143done with \emph{Send} or \emph{Call}, depending on whether the invoker 144wants a reply from the kernel (\emph{Call}) or not (\emph{Send}). By using 145functions provided by the libsel4 API you are guaranteed to always use the more 146appropriate one. The \emph{Yield} system call is not associated with any kernel 147object and is the only operation that does not invoke a capability. In the MCS 148configuration, \emph{Wait} is a variant of \emph{Receive} that does not require 149a reply object to be provided---on non-MCS configurations, \emph{Wait} is 150synonymous with \emph{Receive}, because neither call takes a reply object. 151 152The fundamental system calls are: 153\begin{description} 154 \item[\apifunc{seL4\_Yield}{sel4_yield}] is the only system call that does 155 not require a capability to be used. It forfeits the remainder of the 156 calling thread's timeslice and causes invocation of the kernel's scheduler. 157 If there are no other runnable threads with the same priority as the caller, 158 the calling thread will immediately be scheduled with a fresh timeslice. In 159 the MCS configuration, this behaviour depends on the state of the scheduling 160 context; see \autoref{sec:scheduling_contexts}. 161 162 \item[\apifunc{seL4\_Send}{sel4_send}] delivers a message through the named 163 capability. If the invoked capability is an endpoint, and no receiver is 164 ready to receive the message immediately, the sending thread will block 165 until the message can be delivered. No error code or response will be 166 returned by the receiving object. 167 168 \item[\apifunc{seL4\_Recv}{sel4_recv}] (``receive'') is used by a thread to 169 receive messages through endpoints or notifications. If no sender or 170 notification is pending, the caller will block until a message or 171 notification can be delivered. This system call works only on 172 \obj{Endpoint} or \obj{Notification} capabilities, raising a fault (see 173 section \ref{sec:faults}) when attempted with other capability types. 174 175 In the MCS configuration, \emph{Receive} takes a reply capability---a 176 capability to a reply object--as a parameter. 177\end{description} 178 179% TODO: seL4_Recv() gets hyphenated as 180% seL4_- 181% Recv() 182% here, and that's ugly. But just suppressing hyphenation for it (with \mbox or 183% \hyphenation) makes it overrun the right margin. GBR knows an easy way to 184% deal with this in groff but this is not groff... 185 186The remaining system calls are variants and combinations of 187\apifunc{seL4\_Send}{sel4_send} and \apifunc{seL4\_Recv}{sel4_recv} efficiently 188accommodate common use cases in systems programming. 189 190\begin{description} 191 \item[\apifunc{seL4\_NBSend}{sel4_nbsend}] performs a polling send on an 192 endpoint. If the message cannot be delivered immediately, i.e., there is no 193 receiver waiting on the destination \obj{Endpoint}, the message is silently 194 dropped. The sending thread continues execution. As with 195 \apifunc{seL4\_Send}{sel4_send}, no error code or response will be returned. 196 197 \item[\apifunc{seL4\_NBRecv}{sel4_nbrecv}] is used by a thread to check for 198 signals pending on a notification object or messages pending on an endpoint 199 without blocking. This system call works only on endpoints and notification 200 object capabilities, raising a fault (see section \ref{sec:faults}) when 201 attempted with other capability types. 202 203 \item[\apifunc{seL4\_Call}{sel4_call}] combines 204 \apifunc{seL4\_Send}{sel4_send} and \apifunc{seL4\_Recv}{sel4_recv} with 205 some important differences. The call blocks the sending thread until its 206 message is delivered and a reply message is received. 207 208 When invoking capabilities to kernel services other than endpoints, using 209 \apifunc{seL4\_Call}{sel4_call} allows the kernel to return an error code or 210 other response through the reply message. 211 212 When the sent message is delivered to another thread via an \obj{Endpoint}, 213 the kernel does the same operation as \apifunc{seL4\_Send}{sel4_send}. What 214 happens next depends on the kernel configuration. For MCS configurations, 215 the kernel then updates the \emph{reply object} provided by the receiver. A 216 \emph{reply object} is a vessel for tracking reply messages, used to send 217 a reply message and wake up the caller. In non-MCS configurations, the 218 kernel then deposits a special \emph{reply capability} in a dedicated slot 219 in the receiver's \obj{TCB}. This \emph{reply capability} is a single-use 220 right to send a reply message and wake up the caller, meaning that the 221 kernel invalidates it as soon as it has been invoked. For both variants, 222 the calling thread is blocked until a capability to the reply object is 223 invoked. For more information, see \autoref{sec:ep-cal}. 224 225 \item[\apifunc{seL4\_Reply}{sel4_reply}] is used to respond to a 226 \apifunc{seL4\_Call}{sel4_call}, by invoking the reply capability generated 227 by the \apifunc{seL4\_Call}{sel4_call} system call and stored in a dedicated 228 slot in the replying thread's TCB. It has exactly the same behaviour as 229 invoking the reply capability with \apifunc{seL4\_Send}{sel4_send} which is 230 described in \autoref{sec:ep-cal}. 231 232 \item[\apifunc{seL4\_ReplyRecv}{sel4_replyrecv}] combines 233 \apifunc{seL4\_Reply}{sel4_reply} and \apifunc{seL4\_Recv}{sel4_recv}. It 234 exists mostly for efficiency reasons, namely the common case of replying to 235 a request and waiting for the next can be performed in a single kernel 236 system call instead of two. The transition from the reply to the receive 237 phase is also atomic. 238 239 \item[\apifunc{seL4\_Wait}{sel4_wait}] works like 240 \apifunc{seL4\_Recv}{sel4_recv}; on non-MCS configurations, they are in fact 241 synonymous. In the MCS configuration, \apifunc{seL4\_Wait}{sel4_wait} is 242 used when no reply is expected. Unlike \apifunc{seL4\_Recv}{sel4_recv}, 243 \apifunc{seL4\_Wait}{sel4_wait} takes no reply capability. 244 245 \item[\apifunc{seL4\_NBWait}{sel4_nbwait}] (MCS only) is used by a thread to 246 poll for messages through endpoints or notifications. If no sender or 247 notification is pending, the system call returns immediately. 248 249 \item[\apifunc{seL4\_NBSendWait}{sel4_nbsendwait}] (MCS only) combines an 250 \apifunc{seL4\_NBSend}{sel4_nbsend} and \apifunc{seL4\_Wait}{sel4_wait} into 251 one atomic system call. 252 253 \item[\apifunc{seL4\_NBSendRecv}{sel4_nbsendwait}] (MCS only) combines an 254 \apifunc{seL4\_NBSend}{sel4_nbsend} and \apifunc{seL4\_Recv}{sel4_recv} into 255 one atomic system call. 256\end{description} 257 258\section{Kernel Objects} 259\label{s:sel4_internals} 260 261In this section we give a brief overview of the kernel-implemented 262object types whose instances (also simply called \emph{objects}) can be invoked by applications. The interface to these 263objects forms the interface to the kernel itself. The creation and use 264of kernel services is achieved by the creation, 265manipulation, and combination of these kernel objects: 266 267\begin{description} 268 269 \item[\obj{CNodes}] (see \autoref{ch:cspace}) store capabilities, giving threads permission to 270 invoke methods on particular objects. 271 Each \obj{CNode} has a fixed number of slots, 272 always a power of two, determined when the \obj{CNode} is created. Slots 273 can be empty or contain a capability. 274 275 \item[Thread Control Blocks] (\obj{TCB}s; see \autoref{ch:threads}) represent a thread of 276 execution in seL4. Threads are the unit of execution that is 277 scheduled, blocked, unblocked, etc., depending on the application's 278 interaction with other threads. 279 280 \item[Scheduling contexts] (MCS only) (\obj{SchedulingContext}s; see \autoref{ch:threads}) represent 281 CPU time in seL4. Users can create scheduling contexts from untyped objects, however on 282 creation scheduling contexts are \textit{empty} and do not represent any time. 283 Initially, there is a capability to \obj{SchedControl} for each node, which 284 allows scheduling context to be populated with parameters, which combined with priority 285 control thread's access to CPU time. 286 287 \item[\obj{Endpoints}] (see \autoref{ch:ipc}) facilitate message-passing 288 communication between threads. IPC is synchronous: A thread 289 trying to send or receive on an endpoint blocks until the message 290 can be delivered. This means that message delivery only happens if 291 a sender and a receiver rendezvous at the endpoint, and the 292 kernel can deliver the message with a single copy (or without 293 copying for short messages using only registers). 294 295 A capability to an endpoint can be restricted to be 296 send-only or receive-only. Additionally, \obj{Endpoint} 297 capabilities can have the grant right, which allows sending 298 capabilities as part of the message. 299 300 \item[\obj{Reply objects}] (MCS only) (see \autoref{ch:ipc}) track scheduling 301 context donation and provide a container for single-use reply capabilities. 302 They are provided by \apifunc{seL4\_Recv}{sel4_recv}. 303 304 \item[\obj{Notification} Objects] (see \autoref{ch:notifications}) 305 provide a simple signalling mechanism. A \obj{Notification} 306 is a word-size array of flags, each of which behaves like a binary semaphore. Operations 307 are \emph{signalling} a subset of flags in a single operation, 308 polling to check any flags, 309 and blocking until any are signalled. Notification capabilities 310 can be signal-only or wait-only. 311 312 \item[Virtual Address Space Objects] (see \autoref{ch:vspace}) 313 are used to construct a virtual 314 address space (or VSpace) for one or more threads. These 315 objects largely directly correspond to those of the hardware, and 316 as such are architecture-dependent. The kernel also includes \obj{ASID 317 Pool} and \obj{ASID Control} objects for tracking the status of 318 address spaces. 319 320 \item[Interrupt Objects] (see \autoref{ch:io}) give applications the ability to receive 321 and acknowledge interrupts from hardware devices. 322 Initially, there is a capability to \obj{IRQControl}, 323 which allows for the creation of \obj{IRQHandler} capabilities. 324 An \obj{IRQHandler} capability permits the management of a specific 325 interrupt source associated with a specific device. 326 It is delegated to 327 a device driver to access an interrupt source. The \obj{IRQHandler} 328 object allows threads to wait for and acknowledge individual 329 interrupts. 330 331 \item[Untyped Memory] (see \autoref{sec:kernmemalloc}) is the foundation of memory allocation 332 in the seL4 kernel. Untyped memory capabilities have a single method 333 which allows the creation of new kernel objects. If the method 334 succeeds, the calling thread gains access to capabilities to the 335 newly-created objects. Additionally, untyped memory objects can be 336 divided into a group of smaller untyped memory objects allowing 337 delegation of part (or all) of the system's memory. We discuss 338 memory management in general in the following sections. 339 340\end{description} 341 342\section{Kernel Memory Allocation} 343\label{sec:kernmemalloc} 344 345The seL4 microkernel does not dynamically allocate memory for kernel objects. 346Instead, objects must be explicitly created from application-controlled memory 347regions via \obj{Untyped Memory} capabilities. Applications must have 348explicit authority to memory (through these \obj{Untyped Memory} capabilities) in 349order to create new objects, and all objects consume a fixed amount of memory once 350created. These mechanisms can be used to precisely control 351the specific amount of physical memory available to applications, 352including being able to enforce isolation of physical memory access 353between applications or a device. There are no arbitrary resource 354limits in the kernel apart from those dictated by the 355hardware\footnote{The treatment of virtual ASIDs imposes a fixed number 356of address spaces. This limitation is to be removed in future 357versions of seL4.}, and so many denial-of-service attacks via resource 358exhaustion are avoided. 359 360At boot time, seL4 pre-allocates the memory required for the kernel 361itself, including the code, data, and stack sections (seL4 is a single 362kernel-stack operating system). It then creates an initial user 363thread (with an appropriate address and capability space). 364The kernel then hands all remaining memory to 365the initial thread in the form of capabilities to \obj{Untyped Memory}, and 366some additional capabilities to kernel objects that were required to 367bootstrap the initial thread. These \obj{Untyped Memory} regions can then be split into 368smaller regions or other kernel objects using the 369\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method; the created objects are termed \emph{children} of 370the original untyped memory object. 371 372The user-level application that creates an object using \apifunc{seL4\_Untyped\_Retype}{untyped_retype} 373receives full authority over the resulting object. It can then delegate 374all or part of the authority it possesses over this object to one or 375more of its clients. 376 377Untyped memory objects represent two different types of memory: 378general purpose memory, or device memory. 379\emph{General purpose} memory can be untyped into any other object 380type and used for any operation on untyped memory provided by the kernel. 381\emph{Device memory} covers memory regions reserved for devices 382as determined by the hardware platform, and usage of these objects 383is restricted by the kernel in the following ways: 384 385\begin{itemize} 386\item Device untyped objects can only be retyped into frames or other 387untyped objects; developers cannot, for example, create an endpoint from device memory. 388\item Frame objects retyped from device untyped objects cannot be set as thread IPC buffers, or used 389in the creation of an ASID pool 390\end{itemize} 391 392The type attribute (whether it represents \emph{general purpose} or 393\emph{device} memory) of a child untyped object is inherited from its 394parent untyped object. That is, any child of a device untyped will 395also be a device untyped. Developers cannot change the type attribute of an untyped. 396 397\subsection{Reusing Memory} 398\label{s:memRevoke} 399 400The model described thus far is sufficient for applications to 401allocate kernel objects, distribute authority among client 402applications, and obtain various kernel services provided by these 403objects. This alone is sufficient for a simple static system 404configuration. 405 406The seL4 kernel also allows \obj{Untyped Memory} regions to be reused. 407Reusing a region of memory is allowed only 408when there are no dangling references (i.e., capabilities) left to the 409objects inside that memory. The kernel tracks 410\emph{capability derivations}, i.e., the children generated by the 411methods \apifunc{seL4\_Untyped\_Retype}{untyped_retype}, \apifunc{seL4\_CNode\_Mint}{cnode_mint}, \apifunc{seL4\_CNode\_Copy}{cnode_copy}, and 412\apifunc{seL4\_CNode\_Mutate}{cnode_mutate}. 413 414The tree structure so generated is termed the \emph{capability 415derivation tree} (CDT).\footnote{Although the CDT conceptually is a separate 416data structure, it is implemented as part of the CNode object and so 417requires no additional kernel meta-data.} For example, when a user 418creates new kernel objects by retyping untyped memory, the newly created 419capabilities would be inserted into the CDT as children of the untyped 420memory capability. 421 422For each \obj{Untyped} capability pointing to an \obj{Untyped Memory} region, 423the kernel keeps a \emph{watermark} recording how much of the region has 424previously been allocated. Whenever a user requests the kernel to create new 425objects in an untyped memory region, the kernel will carry out one of two 426actions: if there are already existing objects allocated in the region, the 427kernel will allocate the new objects at the current watermark level, and 428increase the watermark. If all capabilities to objects previously allocated in 429the region have been deleted, the kernel will reset the watermark and start 430allocating new objects from the beginning of the region again. 431 432Finally, the \apifunc{seL4\_CNode\_Revoke}{cnode_revoke} method provided by 433the \obj{CNode} objects deletes all capabilities derived from the argument 434capability. Revoking the last capability to a kernel object triggers the 435\emph{destroy} operation on the now unreferenced object. This cleans up any 436in-kernel dependencies between it, other objects and the kernel. It does not 437necessarily zero all memory state associated with the object yet. Memory zeroing 438will happen for the entire region when an untyped capability is \emph{reset} as 439part of the first retype operation after all child capabilities have been 440revoked. 441 442To reuse a region of memory, user code can call 443\apifunc{seL4\_CNode\_Revoke}{cnode_revoke} on the original untyped capability 444for that region, thereby removing all children of that capability. After this 445invocation, no references remain to any object within the untyped region, and 446the region may be safely retyped again. 447 448\subsection{Summary of Object Sizes} 449\label{sec:object_sizes} 450 451When retyping untyped memory it is useful to know how much memory the 452object will require. Object sizes are defined in libsel4. 453 454Note that \obj{CNode}s, \obj{SchedContext}s (MCS only), and \obj{Untyped Object}s 455have variables sizes. When retyping untyped memory into \obj{CNode}s 456or \obj{SchedContext}s, or breaking an 457\obj{Untyped Object} into smaller \obj{Untyped Object}s, the 458\texttt{size\_bits} argument to 459\apifunc{seL4\_Untyped\_Retype}{untyped_retype} is used to specify 460the size of the resulting objects. 461For all other object types, the size is fixed, and the \texttt{size\_bits} 462argument to \apifunc{seL4\_Untyped\_Retype}{untyped_retype} is ignored. 463 464\begin{table}[htb] 465 \renewcommand{\arraystretch}{1.5} 466 \begin{tabularx}{\textwidth}{p{0.15\textwidth}XXXX} 467 \toprule 468 Type & Meaning of \texttt{size\_bits} & Size in Bytes \\ 469 \midrule 470 \obj{CNode} & $\log_2$ number of slots & $2^\texttt{size\_bits} \cdot 2^\texttt{seL4\_SlotBits}$ 471 \texttt{seL4\_SlotBits} is: \newline 472 \emph{on 32-bit architectures:} 4 \newline 473 \emph{on 64-bit architectures:} 5 \\ 474 \obj{SchedContext} (MCS only) & $\log_2$ size in bytes & $2^\texttt{size\_bits}$ \\ 475 \obj{Untyped} & $\log_2$ size in bytes & $2^\texttt{size\_bits}$ \\ 476 \bottomrule 477 \end{tabularx} 478 \caption{\label{tab:objsize} Meaning of \texttt{size\_bits} for object types of 479 variable size} 480\end{table} 481 482A single call to \apifunc{seL4\_Untyped\_Retype}{untyped_retype} can retype a 483single \obj{Untyped Object} into multiple objects. The number of objects 484to create is specified by its \texttt{num\_objects} argument. All created 485objects must be of the same type, specified by the \texttt{type} argument. In 486the case of variable-sized objects, each object must also be of the same size. 487If the size of the memory area needed (calculated by the object size multiplied 488by \texttt{num\_objects}) is greater than the remaining unallocated memory of 489the \obj{Untyped Object}, an error will result. 490