# Hafnium architecture

The purpose of Hafnium is to provide memory isolation between a set of security
domains, to better separate untrusted code from security-critical code. It is
implemented as a type-1 hypervisor, where each security domain is a VM.

On AArch64 (currently the only supported architecture) it runs at EL2, while the
VMs it manages run at EL1 (and user space applications within those VMs at EL0).
A Secure Monitor such as
[Trusted Firmware-A](https://www.trustedfirmware.org/about/) runs underneath it
at EL3.

Hafnium provides memory isolation between these VMs by managing their stage 2
page tables, and using IOMMUs to restrict how DMA devices can be used to access
memory. It must also prevent them from accessing system resources in a way which
would allow them to escape this containment. It also provides:

*   Means for VMs to communicate with each other through message passing and
    memory sharing, according to the
    [Arm Firmware Framework for Arm A-profile (FF-A)](https://developer.arm.com/documentation/den0077/latest/).
*   Emulation of some basic hardware features such as timers.
*   A simple paravirtualised interrupt controller for secondary VMs, as they
    don't have access to hardware interrupts.
*   A simple logging API for bringup and low-level debugging of VMs.

See the [VM interface](VmInterface.md) documentation for more details.

Hafnium makes a distinction between a **primary VM**, which would typically run
the main user-facing operating system such as Android, and a number of
**secondary VMs** which are smaller and exist to provide various services to the
primary VM. The primary VM typically owns the majority of the system resources,
and is likely to be more latency-sensitive as it is running user-facing tasks.
Some of the differences between primary and secondary VMs are explained below.

## Security model

Hafnium runs a set of VMs without trusting any of them.
Neither do the VMs trust
each other. Hafnium aims to prevent malicious software running in one VM from
compromising any of the other VMs. Specifically, we guarantee
**confidentiality** and **memory integrity** of each VM: no other VM should be
able to read or modify the memory that belongs to a VM without that VM's
consent.

We do not make any guarantees of **availability** of VMs, except for the primary
VM. In other words, a compromised primary VM may prevent secondary VMs from
running, but not gain unauthorised access to their memory. A compromised
secondary VM should not be able to prevent the primary VM or other secondary VMs
from running.

## Design principles

Hafnium is designed with the following principles in mind:

*   Open design
    *   Hafnium is developed as open source, available for all to use,
        contribute and scrutinise.
*   Economy of mechanism
    *   Hafnium strives to be as small and simple as possible, to reduce the
        attack surface.
    *   This also makes Hafnium more amenable to formal verification.
*   Least privilege
    *   Each VM is a separate security domain and is given access only to what
        it needs, to reduce the impact if it is compromised.
    *   Everything that doesn't strictly need to be part of Hafnium itself (in
        EL2) should be moved to a VM (in EL1).
*   Defence in depth
    *   Hafnium provides an extra layer of security isolation on top of those
        provided by the OS kernel, to better isolate sensitive workloads from
        untrusted code.

## VM model

A [VM](../../inc/hf/vm.h) in Hafnium consists of:

*   A set of memory pages owned by and/or available to the VM, stored in the
    stage 2 page table managed by Hafnium.
*   One or more vCPUs. (The primary VM always has the same number of vCPUs as
    the system has physical CPUs; secondary VMs have a configurable number.)
*   A one page TX buffer used for sending messages to other VMs.
*   A one page RX buffer used for receiving messages from other VMs.
*   Some configuration information (VM ID, whitelist of allowed SMCs).
*   Some internal state maintained by Hafnium (locks, mailbox wait lists,
    mailbox state, log buffer).

Each [vCPU](../../inc/hf/vcpu.h) also has:

*   A set of saved registers, for when it isn't being run on a physical CPU.
*   A current state (switched off, ready to run, running, waiting for a message
    or interrupt, aborted).
*   A set of virtual interrupts which may be enabled and/or pending.
*   Some internal locking state.

VMs and their vCPUs are configured statically from a [manifest](Manifest.md)
read at boot time. There is no way to create or destroy VMs at run time.

## System resources

### CPU

Unlike many other type-1 hypervisors, Hafnium does not include a scheduler.
Instead, we rely on the primary VM to handle scheduling, calling Hafnium when it
wants to run a secondary VM's vCPU. This is because:

*   In line with our design principles of _economy of mechanism_ and _least
    privilege_, we prefer to avoid complexity in Hafnium and instead rely on
    VMs to handle complex tasks.
*   According to our security model, we don't guarantee availability of
    secondary VMs, so it is acceptable for a compromised primary VM to deny CPU
    time to secondary VMs.
*   A lot of effort has been put into making the Linux scheduler work well to
    maintain a responsive user experience without jank, manage power
    efficiently, and handle heterogeneous CPU architectures such as big.LITTLE.
    We would rather avoid re-implementing this.

Hafnium therefore maintains a 1:1 mapping of physical CPUs to vCPUs for the
primary VM, and allows the primary VM to control the power state of physical
CPUs directly through the standard Arm Power State Coordination Interface
(PSCI).
The primary VM should then create kernel threads for each secondary VM
vCPU and schedule them to run the vCPUs according to the
[interface expectations defined by Hafnium](SchedulerExpectations.md). PSCI
calls made by secondary VMs are handled by Hafnium, to change the state of the
VM's vCPUs. In the case of (Android) Linux running in the primary VM this is
handled by the Hafnium kernel module.

#### Example

For example, consider a simple system with a single physical CPU, and a
single secondary VM with one vCPU, where the primary VM kernel has created
**thread 1** to run the secondary VM's vCPU while **thread 2** is some other
normal thread:

1.  Scheduler chooses thread 1 to run.
2.  Scheduler runs thread 1, and configures a physical timer to expire once the
    quantum runs out.
3.  Thread 1 is responsible for running a vCPU, so it asks Hafnium to run it.
4.  Hafnium switches to the secondary VM vCPU.
5.  Eventually the quantum runs out and the physical timer interrupts the CPU.
6.  Hafnium traps the interrupt. Physical interrupts are owned by the primary
    VM, so it switches back to the primary VM.
7.  The interrupt handler in the primary VM gets invoked, and calls the
    scheduler.
8.  Scheduler chooses a different thread to run (thread 2).
9.  Scheduler runs thread 2.

### Memory

At boot time each VM owns a mutually exclusive subset of memory pages, as
configured by the [manifest](Manifest.md). These pages are all identity mapped
in the stage 2 page table which Hafnium manages for the VM, so that it has full
access to use them however it wishes.

Hafnium maintains state of which VM **owns** each page, and which VMs have
**access** to it. It does this using the stage 2 page tables of the VMs, with
some extra application-defined bits in the page table entries. A VM may share,
lend or donate memory pages to another VM using the appropriate FF-A requests.
A given page of memory may never be shared with more than two VMs, either in
terms of ownership or access. Thus, the following states are possible for each
page, for some values of X and Y:

*   Owned by VM X, accessible only by VM X
    *   This is the initial state for each page, and also the state of a page
        that has been donated.
*   Owned by VM X, accessible only by VM Y
    *   This state is reached when a page is lent.
*   Owned by VM X, accessible by VMs X and Y
    *   This state is reached when a page is shared.

For now, in the interests of simplicity, Hafnium always uses identity mapping in
all page tables it manages (stage 2 page tables for VMs, and stage 1 for itself)
– i.e. the IPA (intermediate physical address) is always equal to the PA
(physical address) in the stage 2 page table, if it is mapped at all.

### Devices

From Hafnium's point of view a device consists of:

*   An MMIO address range (i.e. a set of pages).
*   A set of interrupts that the device may generate.
*   Some IOMMU configuration associated with the device.

For now, each device is associated with exactly one VM, which is statically
assigned at boot time (through the manifest) and cannot be changed at runtime.

Hafnium is responsible for mapping the device's MMIO pages into the owning VM's
stage 2 page table with the appropriate attributes, and for configuring the
IOMMU so that the device can only access the memory that is accessible by its
owning VM. This needs to be kept in sync as the VM's memory access changes with
memory sharing operations. Hafnium may also need to re-initialise the IOMMU if
the device is powered off and powered on again.

The primary VM is responsible for forwarding interrupts to the owning VM, in
case the device is owned by a secondary VM.
This does mean that a compromised
primary VM may choose not to forward interrupts, or to inject spurious
interrupts, but this is consistent with our security model: secondary VMs are
not guaranteed any level of availability.