# Zircon vDSO

The Zircon vDSO is the sole means of access to [system calls](syscalls.md)
in Zircon.  vDSO stands for *virtual Dynamic Shared Object*.  (*Dynamic
Shared Object* is a term used for a shared library in the ELF format.)
It's *virtual* because it's not loaded from an ELF file that sits in a
filesystem.  Instead, the vDSO image is provided directly by the kernel.

[TOC]

## Using the vDSO

### System Call ABI

The vDSO is a shared library in the ELF format.  It's used in the normal
way that ELF shared libraries are used, which is to look up entry points by
symbol name in the ELF *dynamic symbol table* (the `.dynsym` section,
located via `DT_SYMTAB`).  ELF defines a hash table format to optimize
lookup by name in the symbol table (the `.hash` section, located via
`DT_HASH`); GNU tools have defined an improved hash table format that makes
lookups much more efficient (the `.gnu.hash` section, located via
`DT_GNU_HASH`).  Fuchsia ELF shared libraries, including the vDSO, use the
`DT_GNU_HASH` format exclusively.  (It's also possible to use the symbol
table directly via linear search, ignoring the hash table.)

The vDSO uses a [simplified layout](#Read_Only-Dynamic-Shared-Object-Layout)
that has no writable segment and requires no dynamic relocations.  This
makes it easier to use the system call ABI without implementing a
general-purpose ELF loader and full ELF dynamic linking semantics.

ELF symbol names are the same as C identifiers with external linkage.
Each [system call](syscalls.md) corresponds to an ELF symbol in the vDSO,
and has the ABI of a C function.  The vDSO functions use only the basic
machine-specific C calling conventions governing the use of machine
registers and the stack, which are common across many systems that use ELF,
such as Linux and all the BSD variants.  They do not rely on complex
features such as ELF Thread-Local Storage, nor on Fuchsia-specific ABI
elements such as the [SafeStack](safestack.md) unsafe stack pointer.

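As a small illustration of the lookup path described above, the following is
a minimal sketch in C of the string hash used by the `DT_GNU_HASH` format.
The function name `gnu_hash` is chosen for this example only, and the sketch
covers just the hash itself, not the Bloom filter and bucket/chain walk that
a complete lookup also needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* String hash used by the DT_GNU_HASH table format: start at 5381 and
 * fold in each byte as h = h * 33 + c, truncated to 32 bits.
 * The name gnu_hash is illustrative, not an existing API. */
static uint32_t gnu_hash(const char *name) {
    assert(name != NULL);
    uint32_t h = 5381;
    for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; p++) {
        h = h * 33 + *p;  /* unsigned arithmetic wraps modulo 2^32 */
    }
    return h;
}
```

A loader would hash the wanted symbol name once and use the result to select
a bucket in the hash table before comparing candidate names.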
### vDSO Unwind Information

The vDSO has an ELF program header of type `PT_GNU_EH_FRAME`.  This points
to unwind information in the GNU `.eh_frame` format, which is a close
relative of the standard DWARF Call Frame Information format.  This
information makes it possible to recover the register values from call
frames in the vDSO code, so that a complete stack trace can be reconstructed
from any thread's register state with a PC value inside the vDSO code.
These formats and their use are just the same in the vDSO as they are in any
normal ELF shared library on Fuchsia or other systems using common GNU ELF
extensions, such as Linux and all the BSD variants.

### vDSO Build ID

The vDSO has an ELF *Build ID*, as other ELF shared libraries and
executables built with common GNU extensions do.  The Build ID is a unique
bit string that identifies a specific build of that binary.  This is stored
in ELF note format, pointed to by an ELF program header of type `PT_NOTE`.
The payload of the note with name `"GNU"` and type `NT_GNU_BUILD_ID` is a
sequence of bytes that constitutes the Build ID.

One main use of Build IDs is to associate binaries with their debugging
information and the source code they were built from.  The vDSO binary is
innately tied to (and embedded within) the kernel binary and includes
information specific to each kernel build, so the Build ID of the vDSO
distinguishes kernels as well.

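The note format described above can be walked with a few lines of C.  This is
a minimal sketch, not code taken from Zircon: `note_header`,
`find_build_id`, and `NOTE_TYPE_GNU_BUILD_ID` are names invented for the
example, and the input is assumed to be the raw bytes of a `PT_NOTE` segment
in the host's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOTE_TYPE_GNU_BUILD_ID 3u /* numeric value of NT_GNU_BUILD_ID */

/* Fixed-size header that precedes every ELF note (illustrative name). */
typedef struct {
    uint32_t namesz; /* length of the name, including its NUL */
    uint32_t descsz; /* length of the payload */
    uint32_t type;
} note_header;

/* Name and payload are each padded to a multiple of 4 bytes. */
static size_t align4(size_t n) { return (n + 3) & ~(size_t)3; }

/* Scans the bytes of a PT_NOTE segment.  Returns a pointer to the Build ID
 * payload and stores its length in *len, or returns NULL if there is no
 * "GNU"/NT_GNU_BUILD_ID note or the data is truncated. */
static const uint8_t *find_build_id(const uint8_t *notes, size_t size,
                                    size_t *len) {
    assert(notes != NULL && len != NULL);
    size_t off = 0;
    while (size - off >= sizeof(note_header)) {
        note_header nh;
        memcpy(&nh, notes + off, sizeof nh);
        off += sizeof nh;
        size_t name_len = align4(nh.namesz);
        size_t desc_len = align4(nh.descsz);
        if (name_len > size - off || desc_len > size - off - name_len) {
            return NULL; /* note runs past the end of the segment */
        }
        const uint8_t *name = notes + off;
        const uint8_t *desc = name + name_len;
        if (nh.type == NOTE_TYPE_GNU_BUILD_ID && nh.namesz == 4 &&
            memcmp(name, "GNU", 4) == 0) {
            *len = nh.descsz;
            return desc;
        }
        off += name_len + desc_len;
    }
    return NULL;
}
```

The returned bytes are usually printed as a lowercase hexadecimal string when
matching a binary against its debugging information.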
### **process_start**() argument

The [**process_start**()](syscalls/process_start.md) system call is how a
program loader tells the kernel to start a new process's first thread
executing.  The final argument (`arg2`
in [the **process_start**() documentation](syscalls/process_start.md)) is a
plain `uintptr_t` value passed to the new thread in a register.

By convention, the program loader maps the vDSO into each new process's
address space (at a random location chosen by the system) and passes the
base address of the image to the new process's first thread in the `arg2`
register.  This address is where the ELF file header can be found in memory,
pointing to all the other ELF format elements necessary to look up symbol
names and thus make system calls.

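Starting from that base address, the first step is to read the ELF file
header and locate a program header of interest.  The sketch below assumes a
64-bit image and a Linux-style `<elf.h>` that provides `Elf64_Ehdr`,
`Elf64_Phdr`, and `ELFMAG`; `find_phdr` is a hypothetical helper written for
this example, not a Zircon API.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the first program header of the given type in an ELF image that
 * is already mapped at `base`, or NULL if the image has no such header or
 * does not start with the ELF magic.  Assumes a 64-bit image. */
static const Elf64_Phdr *find_phdr(const void *base, uint32_t type) {
    assert(base != NULL);
    const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *)base;
    if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0) {
        return NULL;
    }
    /* The program header table sits e_phoff bytes past the file header. */
    const Elf64_Phdr *phdrs =
        (const Elf64_Phdr *)((const unsigned char *)base + ehdr->e_phoff);
    for (size_t i = 0; i < ehdr->e_phnum; i++) {
        if (phdrs[i].p_type == type) {
            return &phdrs[i];
        }
    }
    return NULL;
}
```

Called with `PT_DYNAMIC`, this yields the dynamic section, whose
`DT_SYMTAB`, `DT_STRTAB`, and `DT_GNU_HASH` entries locate the tables needed
for symbol lookup.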
### **PA_VMO_VDSO** handle

The vDSO image is embedded in the kernel at compile time.  The kernel
exposes it to userspace as a read-only [VMO](objects/vm_object.md).

When a program loader sets up a new process, the only way to make it
possible for that process to make system calls is for the program loader to
map the vDSO into the new process's address space before its first thread
starts running.  Hence, each process that will launch other processes
capable of making system calls must have access to the vDSO VMO.

By convention, a VMO handle for the vDSO is passed from process to process
in the `zx_proc_args_t` bootstrap message sent to each new process
(see [`<zircon/processargs.h>`](../system/public/zircon/processargs.h)).
The VMO handle's entry in the handle table is identified by the *handle
info entry* `PA_HND(PA_VMO_VDSO, 0)`.

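A process that receives the bootstrap message can find the vDSO VMO by
scanning the handle info entries for that value.  The sketch below is
illustrative only: the `EXAMPLE_` macros are stand-ins whose encoding and
numeric value are assumptions, `find_handle_index` is a hypothetical helper,
and the sketch assumes that handle info entries pair up by index with the
handles carried in the message.  Real code should include
`<zircon/processargs.h>` and use `PA_HND` and `PA_VMO_VDSO` directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for definitions in <zircon/processargs.h>.  The encoding and
 * the type value are assumptions made for this sketch. */
#define EXAMPLE_PA_HND(type, arg) \
    (((type) & 0xFFu) | (((arg) & 0xFFFFu) << 16))
#define EXAMPLE_PA_VMO_VDSO 0x11u

/* Returns the index of the first entry equal to `wanted` in the handle
 * info array, or -1 if there is none. */
static ptrdiff_t find_handle_index(const uint32_t *handle_info, size_t count,
                                   uint32_t wanted) {
    assert(handle_info != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (handle_info[i] == wanted) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```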
## vDSO Implementation Details

### **abigen** tool

The [`abigen` tool](../system/host/abigen/) generates both C/C++ function
declarations that form the public [system call](syscalls.md) API, and some
C++ and assembly code used in the implementation of the vDSO.  Both the
public API and the private interface between the kernel and the vDSO code
are specified by
[`<zircon/syscalls.abigen>`](../system/public/zircon/syscalls.abigen),
which is the input to `abigen`.

The `syscall` entries in `syscalls.abigen` fall into the following groups,
distinguished by the presence of attributes after the system call name:

 * Entries with neither `vdsocall` nor `internal` are the simple cases
   (which are the majority of the system calls) where the public API and
   the private API are exactly the same.  These are implemented entirely
   by generated code.  The public API functions have names prefixed by
   `_zx_` and `zx_` (aliases).

 * `vdsocall` entries are simply declarations for the public API.
   These functions are implemented by normal, hand-written C++ code found
   in [`system/ulib/zircon/`](../system/ulib/zircon/).  Those source
   files `#include "private.h"` and then define the C++ function for the
   system call with its name prefixed by `_zx_`.  Finally, they use the
   `VDSO_INTERFACE_FUNCTION` macro on the system call's name prefixed by
   `zx_` (no leading underscore).  This implementation code can call the
   C++ function for any other system call entry (whether a public
   generated call, a public hand-written `vdsocall`, or an `internal`
   generated call), but must use its private entry point alias, which has
   the `VDSO_zx_` prefix.  Otherwise the code is normal (minimal) C++,
   but must be stateless and reentrant (use only its stack and registers).

 * `internal` entries are declarations of a private API used only by the
   vDSO implementation code to enter the kernel (i.e., by other functions
   implementing `vdsocall` system calls).  These produce functions in the
   vDSO implementation with the same C signature that would be declared in
   the public API given the signature of the system call entry.  However,
   instead of being named with the `_zx_` and `zx_` prefixes, these are
   available only via `#include "private.h"` with `VDSO_zx_` prefixes.

### Read-Only Dynamic Shared Object Layout

The vDSO is a normal ELF shared library and can be treated like any
other.  But it's intentionally kept to a small subset of what an ELF
shared library in general is allowed to do.  This has several benefits:

 * Mapping the ELF image into a process is straightforward and does not
   involve any complex corner cases of general support for ELF `PT_LOAD`
   program headers.  The vDSO's layout can be handled by special-case
   code with no loops that reads only a few values from ELF headers.
 * Using the vDSO does not require full-featured ELF dynamic linking.
   In particular, the vDSO has no dynamic relocations.  Mapping in the
   ELF `PT_LOAD` segments is the only setup that needs to be done.
 * The vDSO code is stateless and reentrant.  It refers only to the
   registers and stack with which it's called.  This makes it usable in
   a wide variety of contexts with minimal constraints on how user code
   organizes itself, which is appropriate for the mandatory ABI of an
   operating system.  It also makes the code easier to reason about and
   audit for robustness and security.

The layout is simply two consecutive segments, each containing aligned
whole pages:

 1. The first segment is read-only, and includes the ELF headers and
    metadata for dynamic linking along with constant data private to the
    vDSO's implementation.
 2. The second segment is executable, containing the vDSO code.

The whole vDSO image consists of just these two segments' pages, present
in the ELF image just as they should appear in memory.  To map in the
vDSO requires only two values gleaned from the vDSO's ELF headers: the
number of pages in each segment.

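To show how little a loader needs, the following sketch derives both
mappings from the two page counts.  It is an illustration under stated
assumptions, not the loader's actual code: the 4096-byte page size, the
`segment_mapping` type, and `plan_vdso_mappings` are all invented for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_PAGE_SIZE 4096u /* assumed page size for this sketch */

typedef struct {
    uint64_t vmo_offset; /* where the segment starts within the vDSO VMO */
    uint64_t size;       /* length of the segment in bytes */
    int executable;      /* 0: read-only data, 1: read-only code */
} segment_mapping;

/* Fills in the two mappings from the page counts read out of the ELF
 * headers.  The segments are consecutive, so the code segment begins
 * exactly where the read-only segment ends. */
static void plan_vdso_mappings(uint64_t rodata_pages, uint64_t code_pages,
                               segment_mapping out[2]) {
    assert(out != NULL);
    out[0].vmo_offset = 0;
    out[0].size = rodata_pages * EXAMPLE_PAGE_SIZE;
    out[0].executable = 0;
    out[1].vmo_offset = out[0].size;
    out[1].size = code_pages * EXAMPLE_PAGE_SIZE;
    out[1].executable = 1;
}
```

Because the image is laid out in the file exactly as it appears in memory,
the same offsets serve both as VMO offsets and as offsets from the base
address at which the image is mapped.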
### Boot-time Read-Only Data

Some system calls simply return values that are constant throughout the
runtime of the whole system, though the ABI of the system is that their
values must be queried at runtime and cannot be compiled into user code.
These values either are fixed in the kernel at compile time or are
determined by the kernel at boot time from hardware or boot parameters.
Examples include [**system_get_version**()](syscalls/system_get_version.md),
[**system_get_num_cpus**()](syscalls/system_get_num_cpus.md), and
[**ticks_per_second**()](syscalls/ticks_per_second.md).
The last example is influenced by
a [kernel command line option](kernel_cmdline.md#vdso_soft_ticks_bool).

Because these values are constant, there is no need to pay the overhead
of entering the kernel to read them.  Instead, the vDSO implementations
of these are simple C++ functions that just return constants read from
the vDSO's read-only data segment.  Values fixed at compile time (such
as the system version string) are simply compiled into the vDSO.

For the values determined at boot time, the kernel must modify the
contents of the vDSO.  This is accomplished by the boot-time code that
sets up the vDSO VMO, before it starts the first userspace process and
gives it the VMO handle.  At compile time, the offset into the vDSO image
of the [`vdso_constants`](../kernel/lib/vdso/include/lib/vdso-constants.h)
data structure is extracted from the vDSO ELF file that will be embedded
in the kernel.  At boot time, the kernel temporarily maps the pages of
the VMO covering `vdso_constants` into its own address space long enough
to initialize the structure with the right values for the current run of
the system.

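The two sides of this arrangement can be sketched in C as follows.  This is
a simplified model, not the kernel's code: `example_vdso_constants` and its
field names are invented stand-ins for the real `vdso_constants` structure,
and a plain byte array stands in for the temporarily mapped VMO pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for vdso_constants; field names are invented. */
struct example_vdso_constants {
    uint32_t max_num_cpus;
    uint64_t ticks_per_second;
};

/* Kernel side, at boot: write this run's values into the image at the
 * offset that was extracted from the vDSO ELF file at compile time.
 * Returns 0 on success or -1 if the structure would not fit. */
static int patch_constants(uint8_t *image, size_t image_size, size_t offset,
                           const struct example_vdso_constants *values) {
    assert(image != NULL && values != NULL);
    if (offset > image_size || image_size - offset < sizeof *values) {
        return -1;
    }
    memcpy(image + offset, values, sizeof *values);
    return 0;
}

/* vDSO side, at runtime: return the constant without entering the kernel. */
static uint64_t example_ticks_per_second(const uint8_t *image, size_t offset) {
    struct example_vdso_constants c;
    memcpy(&c, image + offset, sizeof c);
    return c.ticks_per_second;
}
```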
### Enforcement

The vDSO entry points are the only means to enter the kernel for system
calls.  The machine-specific instructions used to enter the kernel
(e.g. `syscall` on x86) are not part of the system ABI and it's invalid
for user code to execute such instructions directly.  The interface
between the kernel and the vDSO code is a private implementation detail.

Because the vDSO is itself normal code that executes in userspace, the
kernel must robustly handle all possible entries into kernel mode from
userspace.  However, potential kernel bugs can be mitigated somewhat by
enforcing that each kernel entry be made only from the proper vDSO code.
This enforcement also keeps developers of userspace code from
circumventing the ABI rules (because of ignorance, malice, or misguided
intent to work around some perceived limitation of the official ABI),
which could lead to the private kernel-vDSO interface becoming a
*de facto* ABI for application code.

The kernel enforces correct use of the vDSO in two ways:

 1. It constrains how the vDSO VMO can be mapped into a process.

    When a [**vmar_map**()](syscalls/vmar_map.md) call is made using the
    vDSO VMO and requesting `ZX_VM_PERM_EXECUTE`, the kernel
    requires that the offset and size of the mapping exactly match the
    vDSO's executable segment.  It also allows only one such mapping.
    Once the valid vDSO mapping has been established in a process, it
    cannot be removed.  Attempts to map the vDSO a second time into the
    same process, to unmap the vDSO code from a process, or to make an
    executable mapping of the vDSO that doesn't use the correct offset
    and size, fail with `ZX_ERR_ACCESS_DENIED`.

    At compile time, the offset and size of the vDSO's code segment are
    extracted from the vDSO ELF file and used as constants in the
    kernel's mapping enforcement code.

    When the one valid vDSO mapping is established in a process, the
    kernel records the address for that process so it can be checked
    quickly.

 2. It constrains what PC locations can be used to enter the kernel.

    When a user thread enters the kernel for a system call, a register
    indicates which low-level system call is being invoked.  The
    low-level system calls are the private interface between the kernel
    and the vDSO; many correspond directly to the system calls in the
    public ABI, but others do not.

    For each low-level system call, there is a fixed set of PC locations
    in the vDSO code that invoke that call.  The source code for the
    vDSO defines internal symbols identifying each such location.  At
    compile time, these locations are extracted from the vDSO's symbol
    table and used to generate kernel code that defines a PC validity
    predicate for each low-level system call.  Since there is only one
    definition of the vDSO code used by all user processes in the
    system, these predicates simply check for known, valid, constant
    offsets from the beginning of the vDSO code segment.

    On entry to the kernel for a system call, the kernel examines the PC
    location of the `syscall` instruction on x86 (or equivalent
    instruction on other machines).  It subtracts the base address of
    the vDSO code recorded for the process at **vmar_map**() time from
    the PC, and passes the resulting offset to the validity predicate
    for the system call being invoked.  If the predicate rules the PC
    invalid, the calling thread is not allowed to proceed with the
    system call and instead takes a synthetic exception similar to the
    machine exception that would result from invoking an undefined or
    privileged machine instruction.

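Both checks reduce to comparisons against build-time constants.  The sketch
below models them in C under stated assumptions: the function names are
hypothetical, the offsets in `example_valid_offsets` are made up, and the
real kernel generates one predicate per low-level system call rather than
scanning a table passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Invented offsets standing in for the locations that would be extracted
 * from the vDSO's symbol table at compile time. */
static const uintptr_t example_valid_offsets[] = {0x120, 0x2a8};

/* Check 1: an executable mapping of the vDSO VMO is allowed only once per
 * process, and only if it covers exactly the code segment. */
static bool mapping_allowed(uint64_t offset, uint64_t size,
                            uint64_t code_offset, uint64_t code_size,
                            bool already_mapped) {
    return !already_mapped && offset == code_offset && size == code_size;
}

/* Check 2: the PC of the kernel-entry instruction, taken relative to the
 * code base recorded for the process, must be one of the known offsets
 * for the system call being invoked. */
static bool pc_is_valid(uintptr_t pc, uintptr_t vdso_code_base,
                        const uintptr_t *valid_offsets, size_t count) {
    assert(valid_offsets != NULL || count == 0);
    /* Unsigned subtraction: a PC below the base wraps to a very large
     * value, which matches none of the small offsets in the table. */
    uintptr_t offset = pc - vdso_code_base;
    for (size_t i = 0; i < count; i++) {
        if (offset == valid_offsets[i]) {
            return true;
        }
    }
    return false;
}
```

A thread whose PC fails the second check would take the synthetic exception
described above instead of having its system call serviced.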
### Variants

**TODO(mcgrathr)**: vDSO *variants* are an experimental feature that is
not yet in real use.  There is a proof-of-concept implementation and
simple tests, but more work is required to implement the concept
robustly and determine what variants will be made available.  The
concept is to provide variants of the vDSO image that export only a
subset of the full vDSO system call interface.  For example, system
calls intended only for use by device drivers might be elided from the
vDSO variant used for normal application code.
