.. SPDX-License-Identifier: GPL-2.0

===============================
Software Guard eXtensions (SGX)
===============================

Overview
========

Software Guard eXtensions (SGX) hardware enables user space applications
to set aside private memory regions of code and data:

* Privileged (ring-0) ENCLS functions orchestrate the construction of the
  regions.
* Unprivileged (ring-3) ENCLU functions allow an application to enter and
  execute inside the regions.

These memory regions are called enclaves. An enclave can only be entered at a
fixed set of entry points. Each entry point can hold a single hardware thread
at a time. While the enclave is loaded from a regular binary file by using
ENCLS functions, only the threads inside the enclave can access its memory. The
CPU denies access to the region from outside, and the contents are encrypted
before they leave the LLC.

Support can be determined by

 ``grep sgx /proc/cpuinfo``

SGX must both be supported in the processor and enabled by the BIOS. If SGX
appears to be unsupported on a system which has hardware support, ensure
support is enabled in the BIOS. If a BIOS presents a choice between "Enabled"
and "Software Enabled" modes for SGX, choose "Enabled".

Enclave Page Cache
==================

SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
with an enclave. It is contained in a BIOS-reserved region of physical memory.
Unlike pages used for regular memory, EPC pages can only be accessed from
outside of the enclave during enclave construction with special, limited SGX
instructions.

Only a CPU executing inside an enclave can directly access enclave memory.
However, a CPU executing inside an enclave may access normal memory outside the
enclave.

The kernel manages enclave memory similar to how it treats device memory.
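
The BIOS-reserved EPC regions are enumerated through CPUID. As a minimal
sketch, assuming the sub-leaf layout that the SDM describes for CPUID leaf 0x12
(sub-leaves 2 and up), the following decodes the base and size of one EPC
section from raw register values. The function names are hypothetical, and the
register values have to be obtained by other means, since Python cannot execute
CPUID itself.

```python
def _section_metric(low, high):
    """Combine bits 31:12 of `low` with bits 19:0 of `high` (used as bits 51:32)."""
    return (low & 0xFFFFF000) | ((high & 0xFFFFF) << 32)


def decode_epc_section(eax, ebx, ecx, edx):
    """Decode one sub-leaf (>= 2) of CPUID leaf 0x12.

    Returns (base, size) in bytes, or None when the sub-leaf type in
    EAX[3:0] does not mark a valid EPC section.
    """
    # Assumption: type 1 marks an EPC section, type 0 an invalid sub-leaf.
    if eax & 0xF != 1:
        return None
    base = _section_metric(eax, ebx)
    size = _section_metric(ecx, edx)
    return base, size
```

For example, with the made-up register values ``eax=0x70200001``, ``ebx=0``,
``ecx=0x05D80001`` and ``edx=0``, the function reports a section that starts at
physical address ``0x70200000`` and spans ``0x05D80000`` bytes.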

Enclave Page Types
------------------

**SGX Enclave Control Structure (SECS)**
   Enclave's address range, attributes and other global data are defined
   by this structure.

**Regular (REG)**
   Regular EPC pages contain the code and data of an enclave.

**Thread Control Structure (TCS)**
   Thread Control Structure pages define the entry points to an enclave and
   track the execution state of an enclave thread.

**Version Array (VA)**
   Version Array pages contain 512 slots, each of which can contain a version
   number for a page evicted from the EPC.

Enclave Page Cache Map
----------------------

The processor tracks EPC pages in a hardware metadata structure called the
*Enclave Page Cache Map (EPCM)*. The EPCM contains an entry for each EPC page
which describes the owning enclave, access rights and page type, among other
things.

EPCM permissions are separate from the normal page tables. This prevents the
kernel from, for instance, allowing writes to data which an enclave wishes to
remain read-only. EPCM permissions may only impose additional restrictions on
top of normal x86 page permissions.

For all intents and purposes, the SGX architecture allows the processor to
invalidate all EPCM entries at will. This requires that software be prepared to
handle an EPCM fault at any time. In practice, this can happen on events like
power transitions when the ephemeral key that encrypts enclave memory is lost.

Application interface
=====================

Enclave build functions
-----------------------

In addition to the traditional compiler and linker build process, SGX has a
separate enclave “build” process. Enclaves must be built before they can be
executed (entered). The first step in building an enclave is opening the
**/dev/sgx_enclave** device.
Since enclave memory is protected from direct
access, special privileged instructions are then used to copy data into enclave
pages and establish enclave page permissions.

.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
   :functions: sgx_ioc_enclave_create
               sgx_ioc_enclave_add_pages
               sgx_ioc_enclave_init
               sgx_ioc_enclave_provision

Enclave runtime management
--------------------------

Systems supporting SGX2 additionally support changes to initialized
enclaves: modifying enclave page permissions and type, and dynamically
adding and removing enclave pages. When an enclave accesses an address
within its address range that does not have a backing page, a new
regular page is dynamically added to the enclave. The enclave is
still required to run EACCEPT on the new page before it can be used.

.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
   :functions: sgx_ioc_enclave_restrict_permissions
               sgx_ioc_enclave_modify_types
               sgx_ioc_enclave_remove_pages

Enclave vDSO
------------

Entering an enclave can only be done through SGX-specific EENTER and ERESUME
functions, and is a non-trivial process. Because of the complexity of
transitioning to and from an enclave, enclaves typically utilize a library to
handle the actual transitions. This is roughly analogous to how glibc
implementations are used by most applications to wrap system calls.

Another crucial characteristic of enclaves is that they can generate exceptions
as part of their normal operation that need to be handled in the enclave or are
unique to SGX.

Instead of the traditional signal mechanism to handle these exceptions, SGX
can leverage special exception fixup provided by the vDSO. The kernel-provided
vDSO function wraps low-level transitions to/from the enclave like EENTER and
ERESUME.
The vDSO function intercepts exceptions that would otherwise generate
a signal and returns the fault information directly to its caller. This avoids
the need to juggle signal handlers.

.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
   :functions: vdso_sgx_enter_enclave_t

ksgxd
=====

SGX support includes a kernel thread called *ksgxd*.

EPC sanitization
----------------

ksgxd is started when SGX initializes. Enclave memory is typically ready
for use when the processor powers on or resets. However, if SGX has been in
use since the reset, enclave pages may be in an inconsistent state. This might
occur after a crash and kexec() cycle, for instance. At boot, ksgxd
reinitializes all enclave pages so that they can be allocated and re-used.

The sanitization is done by going through EPC address space and applying the
EREMOVE function to each physical page. Some enclave pages like SECS pages have
hardware dependencies on other pages which prevent EREMOVE from functioning.
Executing two EREMOVE passes removes the dependencies.

Page reclaimer
--------------

Similar to the core kswapd, ksgxd is responsible for managing the
overcommitment of enclave memory. If the system runs out of enclave memory,
*ksgxd* “swaps” enclave memory to normal memory.

Launch Control
==============

SGX provides a launch control mechanism. After all enclave pages have been
copied, the kernel executes the EINIT function, which initializes the enclave.
Only after this can the CPU execute inside the enclave.

The EINIT function takes an RSA-3072 signature of the enclave measurement. The
function checks that the measurement is correct and that the signature is signed
with the key hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs
representing the SHA256 of a public key.

Those MSRs can be configured by the BIOS to be either readable or writable.
Linux supports only the writable configuration in order to give full control to
the kernel on launch control policy. Before calling the EINIT function, the
driver sets the MSRs to match the enclave's signing key.

Encryption engines
==================

In order to conceal the enclave data while it is out of the CPU package, the
memory controller has an encryption engine to transparently encrypt and decrypt
enclave memory.

In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to
encrypt pages leaving the CPU caches. MEE uses an n-ary Merkle tree with root in
SRAM to maintain integrity of the encrypted data. This provides integrity and
anti-replay protection but does not scale to large memory sizes because the time
required to update the Merkle tree grows logarithmically in relation to the
memory size.

CPUs starting from Ice Lake use Total Memory Encryption (TME) in the place of
MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
means integrity and replay attacks are not mitigated. However, TME includes
additional changes to prevent cipher text from being returned and SW memory
aliases from being created.

DMA to enclave memory is blocked by range registers on both MEE and TME systems
(SDM section 41.10).

Usage Models
============

Shared Library
--------------

Sensitive data and the code that acts on it is partitioned from the application
into a separate library. The library is then linked as a DSO which can be loaded
into an enclave. The application can then make individual function calls into
the enclave through special SGX instructions. A run-time within the enclave is
configured to marshal function parameters into and out of the enclave and to
call the correct library function.

Application Container
---------------------

An application may be loaded into a container enclave which is specially
configured with a library OS and run-time which permits the application to run.
The enclave run-time and library OS work together to execute the application
when a thread enters the enclave.

Impact of Potential Kernel SGX Bugs
===================================

EPC leaks
---------

When EPC page leaks happen, a WARNING like this is shown in dmesg:

"EREMOVE returned ... and an EPC page was leaked.  SGX may become unusable..."

This is effectively a kernel use-after-free of an EPC page, and due
to the way SGX works, the bug is detected at freeing. Rather than
adding the page back to the pool of available EPC pages, the kernel
intentionally leaks the page to avoid additional errors in the future.

When this happens, the kernel will likely soon leak more EPC pages, and
SGX will likely become unusable because the memory available to SGX is
limited. However, while this may be fatal to SGX, the rest of the kernel
is unlikely to be impacted and should continue to work.

As a result, when this happens, the user should stop running any new
SGX workloads (or just any new workloads) and migrate all valuable
workloads. Although a machine reboot can recover all EPC memory, the bug
should be reported to Linux developers.


Virtual EPC
===========

The implementation also has a virtual EPC driver to support SGX enclaves
in guests. Unlike the SGX driver, an EPC page allocated by the virtual
EPC driver doesn't have a specific enclave associated with it. This is
because KVM doesn't track how a guest uses EPC pages.

As a result, the SGX core page reclaimer doesn't support reclaiming EPC
pages allocated to KVM guests through the virtual EPC driver.
If the
user wants to deploy SGX applications both on the host and in guests
on the same machine, the user should reserve enough EPC (by taking out
total virtual EPC size of all SGX VMs from the physical EPC size) for
host SGX applications so they can run with acceptable performance.

Architectural behavior is to restore all EPC pages to an uninitialized
state also after a guest reboot.  Because this state can be reached only
through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
on all pages in the virtual EPC.

``EREMOVE`` can fail for three reasons.  Userspace must pay attention
to expected failures and handle them as follows:

1. Page removal will always fail when any thread is running in the
   enclave to which the page belongs.  In this case the ioctl will
   return ``EBUSY`` independent of whether it has successfully removed
   some pages; userspace can avoid these failures by preventing execution
   of any vcpu which maps the virtual EPC.

2. Page removal will cause a general protection fault if two calls to
   ``EREMOVE`` happen concurrently for pages that refer to the same
   "SECS" metadata pages.  This can happen if there are concurrent
   invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
   file descriptor in the guest is closed at the same time as
   ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
   This can be avoided in userspace by serializing calls to the ioctl()
   and to close(), but in general it should not be a problem.

3. Finally, page removal will fail for SECS metadata pages which still
   have child pages.  Child pages can be removed by executing
   ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
   mapped into the guest.
   This means that the ioctl() must be called
   twice: an initial set of calls to remove child pages and a subsequent
   set of calls to remove SECS pages.  The second set of calls is only
   required for those mappings that returned a nonzero value from the
   first call.  It indicates a bug in the kernel or the userspace client
   if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
   a return code other than 0.
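
The two rounds of calls described above can be sketched from user space as
follows. This is an illustration under stated assumptions, not the way any
particular VMM does it: ``reset_guest_epc`` and ``vepc_remove_all`` are
hypothetical names, the ioctl number is assumed to be ``_IO(0xA4, 0x04)`` as
``SGX_IOC_VEPC_REMOVE_ALL`` is defined in the uapi header, and the caller is
assumed to have stopped all vCPUs and to serialize the calls so that the
``EBUSY`` cases do not occur.

```python
import fcntl

# Assumption: _IO(SGX_MAGIC, 0x04) with SGX_MAGIC == 0xA4, i.e. the value of
# SGX_IOC_VEPC_REMOVE_ALL in the uapi header; check the installed headers.
SGX_IOC_VEPC_REMOVE_ALL = (0xA4 << 8) | 0x04


def vepc_remove_all(fd):
    """Issue the ioctl once and return its result.

    Failures such as EBUSY surface as OSError.
    """
    return fcntl.ioctl(fd, SGX_IOC_VEPC_REMOVE_ALL)


def reset_guest_epc(fds, remove_all=vepc_remove_all):
    """Run both rounds over every /dev/sgx_vepc descriptor of one guest.

    Returns the number of descriptors that needed the second round.
    """
    # Round one, on ALL descriptors: remove child pages. SECS pages that
    # still had children at that point are reported as a nonzero result.
    retry = [fd for fd in fds if remove_all(fd) != 0]

    # Round two: only for descriptors that reported leftovers.
    for fd in retry:
        if remove_all(fd) != 0:
            # Per the text above, this indicates a kernel or client bug.
            raise RuntimeError(f"EPC pages left on fd {fd} after second round")
    return len(retry)
```

The ``remove_all`` parameter exists only so that the control flow can be
exercised without SGX hardware. Note that the first round finishes on every
descriptor before the second round starts, because the child pages of a SECS
page may be mapped through a different descriptor.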