# Zircon vDSO

The Zircon vDSO is the sole means of access to [system calls](syscalls.md)
in Zircon. vDSO stands for *virtual Dynamic Shared Object*. (*Dynamic
Shared Object* is a term used for a shared library in the ELF format.)
It's *virtual* because it's not loaded from an ELF file that sits in a
filesystem. Instead, the vDSO image is provided directly by the kernel.

[TOC]

## Using the vDSO

### System Call ABI

The vDSO is a shared library in the ELF format. It's used in the normal
way that ELF shared libraries are used, which is to look up entry points by
symbol name in the ELF *dynamic symbol table* (the `.dynsym` section,
located via `DT_SYMTAB`). ELF defines a hash table format to optimize
lookup by name in the symbol table (the `.hash` section, located via
`DT_HASH`); GNU tools have defined an improved hash table format that makes
lookups much more efficient (the `.gnu_hash` section, located via
`DT_GNU_HASH`). Fuchsia ELF shared libraries, including the vDSO, use the
`DT_GNU_HASH` format exclusively. (It's also possible to use the symbol
table directly via linear search, ignoring the hash table.)

The vDSO uses a [simplified layout](#Read_Only-Dynamic-Shared-Object-Layout)
that has no writable segment and requires no dynamic relocations. This
makes it easier to use the system call ABI without implementing a
general-purpose ELF loader and full ELF dynamic linking semantics.

ELF symbol names are the same as C identifiers with external linkage.
Each [system call](syscalls.md) corresponds to an ELF symbol in the vDSO,
and has the ABI of a C function. The vDSO functions use only the basic
machine-specific C calling conventions governing the use of machine
registers and the stack, which are common across many systems that use ELF,
such as Linux and all the BSD variants.
They do not rely on complex
features such as ELF Thread-Local Storage, nor on Fuchsia-specific ABI
elements such as the [SafeStack](safestack.md) unsafe stack pointer.

### vDSO Unwind Information

The vDSO has an ELF program header of type `PT_GNU_EH_FRAME`. This points
to unwind information in the GNU `.eh_frame` format, which is a close
relative of the standard DWARF Call Frame Information format. This
information makes it possible to recover the register values from call
frames in the vDSO code, so that a complete stack trace can be reconstructed
from any thread's register state with a PC value inside the vDSO code.
These formats and their use are just the same in the vDSO as they are in any
normal ELF shared library on Fuchsia or other systems using common GNU ELF
extensions, such as Linux and all the BSD variants.

### vDSO Build ID

The vDSO has an ELF *Build ID*, as other ELF shared libraries and
executables built with common GNU extensions do. The Build ID is a unique
bit string that identifies a specific build of that binary. This is stored
in ELF note format, pointed to by an ELF program header of type `PT_NOTE`.
The payload of the note with name `"GNU"` and type `NT_GNU_BUILD_ID` is a
sequence of bytes that constitutes the Build ID.

One main use of Build IDs is to associate binaries with their debugging
information and the source code they were built from. The vDSO binary is
innately tied to (and embedded within) the kernel binary and includes
information specific to each kernel build, so the Build ID of the vDSO
distinguishes kernels as well.

### **process_start**() argument

The [**process_start**()](syscalls/process_start.md) system call is how a
program loader tells the kernel to start a new process's first thread
executing.
The final argument (`arg2`
in [the **process_start**() documentation](syscalls/process_start.md)) is a
plain `uintptr_t` value passed to the new thread in a register.

By convention, the program loader maps the vDSO into each new process's
address space (at a random location chosen by the system) and passes the
base address of the image to the new process's first thread in the `arg2`
register. This address is where the ELF file header can be found in memory,
pointing to all the other ELF format elements necessary to look up symbol
names and thus make system calls.

### **PA_VMO_VDSO** handle

The vDSO image is embedded in the kernel at compile time. The kernel
exposes it to userspace as a read-only [VMO](objects/vm_object.md).

When a program loader sets up a new process, the only way to make it
possible for that process to make system calls is for the program loader to
map the vDSO into the new process's address space before its first thread
starts running. Hence, each process that will launch other processes
capable of making system calls must have access to the vDSO VMO.

By convention, a VMO handle for the vDSO is passed from process to process
in the `zx_proc_args_t` bootstrap message sent to each new process
(see [`<zircon/processargs.h>`](../system/public/zircon/processargs.h)).
The VMO handle's entry in the handle table is identified by the *handle
info entry* `PA_HND(PA_VMO_VDSO, 0)`.

## vDSO Implementation Details

### **abigen** tool

The [`abigen` tool](../system/host/abigen/) generates both C/C++ function
declarations that form the public [system call](syscalls.md) API, and some
C++ and assembly code used in the implementation of the vDSO. Both the
public API and the private interface between the kernel and the vDSO code
are specified by
[`<zircon/syscalls.abigen>`](../system/public/zircon/syscalls.abigen),
which is the input to `abigen`.

The `syscall` entries in `syscalls.abigen` fall into the following groups,
distinguished by the presence of attributes after the system call name:

 * Entries with neither `vdsocall` nor `internal` are the simple cases
   (which are the majority of the system calls) where the public API and
   the private API are exactly the same. These are implemented entirely
   by generated code. The public API functions have names prefixed by
   `_zx_` and `zx_` (aliases).

 * `vdsocall` entries are simply declarations for the public API.
   These functions are implemented by normal, hand-written C++ code found
   in [`system/ulib/zircon/`](../system/ulib/zircon/). Those source
   files `#include "private.h"` and then define the C++ function for the
   system call with its name prefixed by `_zx_`. Finally, they use the
   `VDSO_INTERFACE_FUNCTION` macro on the system call's name prefixed by
   `zx_` (no leading underscore). This implementation code can call the
   C++ function for any other system call entry (whether a public
   generated call, a public hand-written `vdsocall`, or an `internal`
   generated call), but must use its private entry point alias, which has
   the `VDSO_zx_` prefix. Otherwise the code is normal (minimal) C++,
   but must be stateless and reentrant (use only its stack and registers).

 * `internal` entries are declarations of a private API used only by the
   vDSO implementation code to enter the kernel (i.e., by other functions
   implementing `vdsocall` system calls). These produce functions in the
   vDSO implementation with the same C signature that would be declared in
   the public API given the signature of the system call entry. However,
   instead of being named with the `_zx_` and `zx_` prefixes, these are
   available only via `#include "private.h"` with `VDSO_zx_` prefixes.
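
The naming rules for the three groups can be summarized in a small model. The following is only an illustrative Python sketch, not part of `abigen`: the function `vdso_symbol_names` and its result layout are hypothetical, and it assumes (from the description above) that every entry has a private `VDSO_zx_` alias while only non-`internal` entries get the public `_zx_` and `zx_` names.

```python
def vdso_symbol_names(name, attributes=()):
    """Model the names described above for one syscall entry.

    `name` is the system call name without any prefix, and `attributes`
    is an iterable that may contain "vdsocall" or "internal".
    This helper is hypothetical and exists only for illustration.
    """
    attrs = set(attributes)

    # Every entry is reachable inside the vDSO via its private alias.
    private = "VDSO_zx_" + name

    # `internal` entries have no public names at all.
    if "internal" in attrs:
        public = []
    else:
        public = ["_zx_" + name, "zx_" + name]

    # `vdsocall` entries are hand-written C++; the rest are generated.
    implementation = "hand-written" if "vdsocall" in attrs else "generated"

    return {
        "public": public,
        "private": private,
        "implementation": implementation,
    }
```

For example, `vdso_symbol_names("example_call", ["internal"])` yields no public names and the private alias `VDSO_zx_example_call` (the call name here is made up).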

### Read-Only Dynamic Shared Object Layout

The vDSO is a normal ELF shared library and can be treated like any
other. But it's intentionally kept to a small subset of what an ELF
shared library in general is allowed to do. This has several benefits:

 * Mapping the ELF image into a process is straightforward and does not
   involve any complex corner cases of general support for ELF `PT_LOAD`
   program headers. The vDSO's layout can be handled by special-case
   code with no loops that reads only a few values from ELF headers.
 * Using the vDSO does not require full-featured ELF dynamic linking.
   In particular, the vDSO has no dynamic relocations. Mapping in the
   ELF `PT_LOAD` segments is the only setup that needs to be done.
 * The vDSO code is stateless and reentrant. It refers only to the
   registers and stack with which it's called. This makes it usable in
   a wide variety of contexts with minimal constraints on how user code
   organizes itself, which is appropriate for the mandatory ABI of an
   operating system. It also makes the code easier to reason about and
   audit for robustness and security.

The layout is simply two consecutive segments, each containing aligned
whole pages:

 1. The first segment is read-only, and includes the ELF headers and
    metadata for dynamic linking along with constant data private to the
    vDSO's implementation.
 2. The second segment is executable, containing the vDSO code.

The whole vDSO image consists of just these two segments' pages, present
in the ELF image just as they should appear in memory. To map in the
vDSO requires only two values gleaned from the vDSO's ELF headers: the
number of pages in each segment.
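
To make the "two values" point concrete, here is a minimal Python sketch of the arithmetic a loader would need. It is a model under stated assumptions, not Zircon loader code: the function `vdso_mapping_plan`, the 4096-byte page size, and the `"r"`/`"rx"` permission labels are all hypothetical choices made for this illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def vdso_mapping_plan(base, ro_pages, code_pages, page_size=PAGE_SIZE):
    """Compute the two mappings implied by the layout described above.

    Returns a list of (address, image_offset, size, permissions) tuples,
    one for the read-only segment and one for the executable segment.
    The permission strings are illustrative labels only.
    """
    if base % page_size != 0:
        raise ValueError("base address must be page-aligned")
    if ro_pages <= 0 or code_pages <= 0:
        raise ValueError("each segment must span at least one page")

    ro_size = ro_pages * page_size
    code_size = code_pages * page_size

    # The image is laid out exactly as it appears in memory, so each
    # segment's offset in the image equals its offset from the base.
    return [
        (base, 0, ro_size, "r"),
        (base + ro_size, ro_size, code_size, "rx"),
    ]
```

With a base of `0x10000`, two read-only pages, and three code pages, the plan places the code segment at `0x12000` with an image offset of 8192 bytes and a size of 12288 bytes. No loop over program headers is needed, which is the simplification the layout is designed to allow.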

### Boot-time Read-Only Data

Some system calls simply return values that are constant throughout the
runtime of the whole system, though the ABI of the system is that their
values must be queried at runtime and cannot be compiled into user code.
These values either are fixed in the kernel at compile time or are
determined by the kernel at boot time from hardware or boot parameters.
Examples include [**system_get_version**()](syscalls/system_get_version.md),
[**system_get_num_cpus**()](syscalls/system_get_num_cpus.md), and
[**ticks_per_second**()](syscalls/ticks_per_second.md).
The last example is influenced by
a [kernel command line option](kernel_cmdline.md#vdso_soft_ticks_bool).

Because these values are constant, there is no need to pay the overhead
of entering the kernel to read them. Instead, the vDSO implementations
of these are simple C++ functions that just return constants read from
the vDSO's read-only data segment. Values fixed at compile time (such
as the system version string) are simply compiled into the vDSO.

For the values determined at boot time, the kernel must modify the
contents of the vDSO. This is accomplished by the boot-time code that
sets up the vDSO VMO, before it starts the first userspace process and
gives it the VMO handle. At compile time, the offset into the vDSO
image of
the [`vdso_constants`](../kernel/lib/vdso/include/lib/vdso-constants.h)
data structure is extracted from the vDSO ELF file that will be embedded
in the kernel. At boot time, the kernel temporarily maps the pages of
the VMO covering `vdso_constants` into its own address space long enough
to initialize the structure with the right values for the current run of
the system.

### Enforcement

The vDSO entry points are the only means to enter the kernel for system
calls. The machine-specific instructions used to enter the kernel
(e.g.
`syscall` on x86) are not part of the system ABI and it's invalid
for user code to execute such instructions directly. The interface
between the kernel and the vDSO code is a private implementation detail.

Because the vDSO is itself normal code that executes in userspace, the
kernel must robustly handle all possible entries into kernel mode from
userspace. However, potential kernel bugs can be mitigated somewhat by
enforcing that each kernel entry be made only from the proper vDSO code.
This enforcement also avoids developers of userspace code circumventing
the ABI rules (because of ignorance, malice, or misguided intent to work
around some perceived limitation of the official ABI), which could lead
to the private kernel-vDSO interface becoming a *de facto* ABI for
application code.

The kernel enforces correct use of the vDSO in two ways:

 1. It constrains how the vDSO VMO can be mapped into a process.

    When a [**vmar_map**()](syscalls/vmar_map.md) call is made using the
    vDSO VMO and requesting `ZX_VM_PERM_EXECUTE`, the kernel
    requires that the offset and size of the mapping exactly match the
    vDSO's executable segment. It also allows only one such mapping.
    Once the valid vDSO mapping has been established in a process, it
    cannot be removed. Attempts to map the vDSO a second time into the
    same process, to unmap the vDSO code from a process, or to make an
    executable mapping of the vDSO that doesn't use the correct offset and
    size, fail with `ZX_ERR_ACCESS_DENIED`.

    At compile time, the offset and size of the vDSO's code segment are
    extracted from the vDSO ELF file and used as constants in the
    kernel's mapping enforcement code.

    When the one valid vDSO mapping is established in a process, the
    kernel records the address for that process so it can be checked
    quickly.

 2. It constrains what PC locations can be used to enter the kernel.

    When a user thread enters the kernel for a system call, a register
    indicates which low-level system call is being invoked. The
    low-level system calls are the private interface between the kernel
    and the vDSO; many correspond directly to the system calls in the
    public ABI, but others do not.

    For each low-level system call, there is a fixed set of PC locations
    in the vDSO code that invoke that call. The source code for the
    vDSO defines internal symbols identifying each such location. At
    compile time, these locations are extracted from the vDSO's symbol
    table and used to generate kernel code that defines a PC validity
    predicate for each low-level system call. Since there is only one
    definition of the vDSO code used by all user processes in the
    system, these predicates simply check for known, valid, constant
    offsets from the beginning of the vDSO code segment.

    On entry to the kernel for a system call, the kernel examines the PC
    location of the `syscall` instruction on x86 (or equivalent
    instruction on other machines). It subtracts the base address of
    the vDSO code recorded for the process at **vmar_map**() time from
    the PC, and passes the resulting offset to the validity predicate
    for the system call being invoked. If the predicate rules the PC
    invalid, the calling thread is not allowed to proceed with the
    system call and instead takes a synthetic exception similar to the
    machine exception that would result from invoking an undefined or
    privileged machine instruction.

### Variants

**TODO(mcgrathr)**: vDSO *variants* are an experimental feature that is
not yet in real use. There is a proof-of-concept implementation and
simple tests, but more work is required to implement the concept
robustly and determine what variants will be made available.
The
concept is to provide variants of the vDSO image that export only a
subset of the full vDSO system call interface. For example, system
calls intended only for use by device drivers might be elided from the
vDSO variant used for normal application code.