Authors: Feng Wu <feng.wu@intel.com>

VT-d Posted-interrupt (PI) design for XEN

Important Definitions
==================
VT-d posted-interrupts: posted-interrupt support on the root-complex side
CPU-side posted-interrupts: posted-interrupt support on the CPU side
IRTE: Interrupt Remapping Table Entry
Posted-interrupt Descriptor Address: the address of the posted-interrupt descriptor
Virtual Vector: the guest vector of the interrupt
URG: indicates if the interrupt is urgent

Posted-interrupt descriptor:
The Posted Interrupt Descriptor hosts the following fields:
Posted Interrupt Request (PIR): Provides storage for posting (recording) interrupts (one bit
per vector, for up to 256 vectors).

Outstanding Notification (ON): Indicates if there is a notification event outstanding (not
processed by processor or software) for this Posted Interrupt Descriptor. When this field is 0,
hardware modifies it from 0 to 1 when generating a notification event, and the entity receiving
the notification event (processor or software) resets it as part of posted interrupt processing.

Suppress Notification (SN): Indicates if a notification event is to be suppressed (not
generated) for non-urgent interrupt requests (interrupts processed through an IRTE with
URG=0).

Notification Vector (NV): Specifies the vector for the notification event (interrupt).

Notification Destination (NDST): Specifies the physical APIC-ID of the destination logical
processor for the notification event.

Background
==========
With the development of virtualization, there are more and more device
assignment requirements. However, today when a VM is running with
assigned devices (such as a NIC), external interrupt handling for the assigned
devices always needs VMM intervention.

VT-d Posted-interrupt is an enhanced method of handling interrupts
in a virtualization environment.
Interrupt posting is the process by which an interrupt request is recorded in a
memory-resident posted-interrupt-descriptor structure by the root-complex or
software, followed by an optional notification event issued to the CPU.

With VT-d Posted-interrupt we get the following advantages:
- Direct delivery of external interrupts to running vCPUs without VMM
intervention
- Decreased interrupt migration complexity. On vCPU migration, software
can atomically co-migrate all interrupts targeting the migrating vCPU. For
virtual machines with assigned devices, migrating a vCPU across pCPUs
either incurs the overhead of forwarding interrupts in software (e.g. via VMM
generated IPIs), or the complexity of independently migrating each interrupt
targeting the vCPU to the new pCPU. However, after enabling VT-d PI, the
destination vCPU of an external interrupt from assigned devices is stored in the
IRTE (i.e. Posted-interrupt Descriptor Address). When a vCPU is migrated to
another pCPU, we set this new pCPU in the 'NDST' field of the posted-interrupt
descriptor, which makes the interrupt migration automatic.

Here is what Xen currently does for external interrupts from assigned devices:

When a VM is running and an external interrupt from an assigned device occurs
for it, a VM-Exit happens, then:

vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)

softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()

dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> vmsi_deliver() -->
vmsi_inj_irq() --> vlapic_set_irq()

vlapic_set_irq() does the following things:
1. If CPU-side posted-interrupt is supported, call vmx_deliver_posted_intr() to deliver
the virtual interrupt via the posted-interrupt infrastructure.
2. Else if CPU-side posted-interrupt is not supported, set the related vIRR in the vLAPIC
page and call vcpu_kick() to kick the related vCPU. Before VM-Entry, vmx_intr_assist()
will help to inject the interrupt into the guest.

However, after VT-d PI is supported, when a guest is running in non-root mode and an
external interrupt from an assigned device occurs for it, no VM-Exit is needed.
The guest can handle this totally in non-root mode, thus avoiding all the above
code flow.

Posted-interrupt Introduction
========================
There are two components in the Posted-interrupt architecture:
Processor Support and Root-Complex Support

- Processor Support
Posted-interrupt processing is a feature by which a processor processes
the virtual interrupts by recording them as pending on the virtual-APIC
page.

Posted-interrupt processing is enabled by setting the "process posted
interrupts" VM-execution control. The processing is performed in response
to the arrival of an interrupt with the posted-interrupt notification vector.
In response to such an interrupt, the processor processes virtual interrupts
recorded in a data structure called a posted-interrupt descriptor.

For more information about APICv and CPU-side Posted-interrupt, please refer
to Chapter "APIC VIRTUALIZATION AND VIRTUAL INTERRUPTS", and Section
"POSTED-INTERRUPT PROCESSING" in the Intel SDM:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

- Root-Complex Support
Interrupt posting is the process by which an interrupt request (from IOAPIC
or MSI/MSIx capable sources) is recorded in a memory-resident
posted-interrupt-descriptor structure by the root-complex, followed by
an optional notification event issued to the CPU complex. The interrupt
request arriving at the root-complex carries the identity of the interrupt
request source and a 'remapping-index'.
The remapping-index is used to look up an entry from the memory-resident
interrupt-remap-table. Unlike interrupt-remapping, the
interrupt-remap-table-entry for a posted-interrupt specifies a virtual-vector
and a pointer to the posted-interrupt descriptor.
The virtual-vector specifies the vector of the interrupt to be recorded in
the posted-interrupt descriptor. The posted-interrupt descriptor hosts storage
for the virtual-vectors and contains the attributes of the notification event
(interrupt) to be issued to the CPU complex to inform CPU/software about pending
interrupts recorded in the posted-interrupt descriptor.

For more information about VT-d PI, please refer to
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

Design Overview
==============
In this design, we will cover the following items:
1. Add a variable to control whether VT-d posted-interrupt is enabled or not.
2. VT-d PI feature detection.
3. Extend posted-interrupt descriptor structure to cover VT-d PI specific items.
4. Extend IRTE structure to support VT-d PI.
5. Introduce a new global vector which is used for waking up the blocked vCPU.
6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
7. Update posted-interrupt descriptor during vCPU scheduling.
8. How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).
9. New boot command line for Xen, which lets the user control the VT-d PI feature.
10. Multicast/broadcast and lowest priority interrupts consideration.


Implementation details
===================
- New variable to control VT-d PI

Like the variable 'iommu_intremap' for interrupt remapping, it is very straightforward
to add a new one, 'iommu_intpost', for posted-interrupts. 'iommu_intpost' is set
only when interrupt remapping and VT-d posted-interrupt are both enabled.
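
As a minimal sketch of the rule above, the value of 'iommu_intpost' can be
thought of as the conjunction of its preconditions. The struct and function
names below are hypothetical and used only for illustration; they are not
taken from the Xen source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs, named for illustration only. */
struct intpost_inputs {
    bool intremap_enabled;   /* state of 'iommu_intremap' */
    bool intpost_requested;  /* posted-interrupts not disabled by the user */
    bool hw_supports_pi;     /* VT-d hardware reports PI support */
};

/*
 * Posted-interrupts are only usable on top of interrupt remapping,
 * so every precondition has to hold at the same time.
 */
static bool compute_iommu_intpost(const struct intpost_inputs *in)
{
    return in->intremap_enabled &&
           in->intpost_requested &&
           in->hw_supports_pi;
}
```

With this shape, disabling interrupt remapping implicitly disables
posted-interrupts as well.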

- VT-d PI feature detection.
Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt support.

- Extend posted-interrupt descriptor structure to cover VT-d PI specific items.
Here is the new structure for posted-interrupt descriptor:

struct pi_desc {
    DECLARE_BITMAP(pir, NR_VECTORS);
    union {
        struct
        {
        u16 on     : 1,  /* bit 256 - Outstanding Notification */
            sn     : 1,  /* bit 257 - Suppress Notification */
            rsvd_1 : 14; /* bit 271:258 - Reserved */
        u8  nv;          /* bit 279:272 - Notification Vector */
        u8  rsvd_2;      /* bit 287:280 - Reserved */
        u32 ndst;        /* bit 319:288 - Notification Destination */
        };
        u64 control;
    };
    u32 rsvd[6];
} __attribute__ ((aligned (64)));

- Extend IRTE structure to support VT-d PI.

Here is the new structure for IRTE:
/* interrupt remap entry */
struct iremap_entry {
    union {
        struct { u64 lo, hi; };
        struct {
            u16 p       : 1,
                fpd     : 1,
                dm      : 1,
                rh      : 1,
                tm      : 1,
                dlm     : 3,
                avail   : 4,
                res_1   : 4;
            u8  vector;
            u8  res_2;
            u32 dst;
            u16 sid;
            u16 sq      : 2,
                svt     : 2,
                res_3   : 12;
            u32 res_4   : 32;
        } remap;
        struct {
            u16 p       : 1,
                fpd     : 1,
                res_1   : 6,
                avail   : 4,
                res_2   : 2,
                urg     : 1,
                im      : 1;
            u8  vector;
            u8  res_3;
            u32 res_4   : 6,
                pda_l   : 26;
            u16 sid;
            u16 sq      : 2,
                svt     : 2,
                res_5   : 12;
            u32 pda_h;
        } post;
    };
};

- Introduce a new global vector which is used to wake up the blocked vCPU.

Currently, there is a global vector 'posted_intr_vector', which is used as the
global notification vector for all vCPUs in the system. This vector is stored in
VMCS and CPU considers it as a _special_ vector, uses it to notify the related
pCPU when an interrupt is recorded in the posted-interrupt descriptor.
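
To make "recording an interrupt in the posted-interrupt descriptor" more
concrete, here is a simplified software model of the descriptor and of the
posting step, based on the PIR, ON, SN, NV and NDST definitions given earlier.
It is only a sketch: 'pi_desc_model', 'pi_nv' and 'pi_post' are hypothetical
names, the control word is accessed with masks instead of bit-fields, and the
update is not atomic, whereas real hardware and the hypervisor must update the
descriptor atomically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_VECTORS 256

/* Control-word fields, relative to bit 256 of the descriptor. */
#define PI_ON          (1ULL << 0)   /* bit 256 - Outstanding Notification */
#define PI_SN          (1ULL << 1)   /* bit 257 - Suppress Notification */
#define PI_NV_SHIFT    16            /* bits 279:272 - Notification Vector */
#define PI_NDST_SHIFT  32            /* bits 319:288 - Notification Destination */

/* Simplified stand-in for struct pi_desc (hypothetical, illustration only). */
struct pi_desc_model {
    uint64_t pir[NR_VECTORS / 64];   /* bits 255:0, one bit per vector */
    uint64_t control;                /* bits 319:256 */
    uint32_t rsvd[6];
} __attribute__((aligned(64)));

static_assert(sizeof(struct pi_desc_model) == 64, "descriptor must be 64 bytes");

/* Extract the notification vector from the control word. */
static inline uint8_t pi_nv(const struct pi_desc_model *pi)
{
    return (uint8_t)(pi->control >> PI_NV_SHIFT);
}

/*
 * Record 'vector' in PIR and report whether a notification event
 * (an interrupt with vector NV sent to NDST) would be generated.
 */
static bool pi_post(struct pi_desc_model *pi, uint8_t vector, bool urgent)
{
    pi->pir[vector / 64] |= 1ULL << (vector % 64);

    if (pi->control & PI_ON)
        return false;                /* a notification is already outstanding */
    if ((pi->control & PI_SN) && !urgent)
        return false;                /* suppressed for non-urgent requests */

    pi->control |= PI_ON;            /* ON goes from 0 to 1 on notification */
    return true;
}
```

In this model the interrupt is always recorded in PIR; ON and SN only decide
whether a notification event accompanies it.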

This existing global vector is a _special_ vector to the CPU, which handles it in a
_special_ way compared to normal vectors. Please refer to 29.6 in the Intel SDM
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
for more information about how the CPU handles it.

With VT-d PI, the VT-d engine can issue a notification event when the
assigned devices issue interrupts. We need to add a new global vector to
wake up the blocked vCPU. Please refer to the later section in this design for
how this new global vector is used.

- Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
After VT-d PI is introduced, the format of IRTE is changed as follows:
    Descriptor Address: the address of the posted-interrupt descriptor
    Virtual Vector: the guest vector of the interrupt
    URG: indicates if the interrupt is urgent
    Other fields continue to have the same meaning

'Descriptor Address' identifies the destination vCPU of this interrupt, since
each vCPU has a dedicated posted-interrupt descriptor.

'Virtual Vector' gives the guest vector of the interrupt.

When the guest changes the configuration of the interrupts, such as the
CPU affinity or the vector, we need to update the associated IRTE accordingly.

- Update posted-interrupt descriptor during vCPU scheduling

The basic idea here is:
1. When vCPU is running
    - Set 'NV' to 'posted_intr_vector'.
    - Clear 'SN' to accept posted-interrupts.
    - Set 'NDST' to the pCPU on which the vCPU will be running.
2. When vCPU is blocked
    - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
      related vCPU when a posted-interrupt happens for it.
      Please refer to the above section about the new global vector.
    - Clear 'SN' to accept posted-interrupts.
3. When vCPU is preempted or sleeping
    - Set 'SN' to suppress non-urgent interrupts.
      (Currently, we only support non-urgent interrupts)
      When the vCPU is preempted or sleeping, it doesn't need to accept
      posted-interrupt notification events, since we don't change the behavior
      of the scheduler when the interrupt occurs; we still need to wait for the next
      scheduling of the vCPU. When external interrupts from assigned devices occur,
      the interrupts are recorded in PIR, and will be synced to IRR before VM-Entry.
    - Set 'NV' to 'posted_intr_vector'.

- How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).

Here is the scenario for the usage of the new global vector:

1. vCPU0 is running on pCPU0
2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
3. An external interrupt from an assigned device occurs for vCPU0. If we
still use 'posted_intr_vector' as the notification vector for vCPU0, the
notification event for vCPU0 (the event will go to pCPU0) will be consumed
by vCPU1 incorrectly (remember this is a special vector to the CPU). The worst
case is that vCPU0 will never be woken up again, since the wakeup event
for it is always consumed by other vCPUs incorrectly. So we need to introduce
another global vector, named 'pi_wakeup_vector', to wake up the blocked vCPU.

After using 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the notification
event using this new vector. This new vector is not a SPECIAL one to the CPU;
it is just a normal vector. The CPU just receives a normal external interrupt,
and we get control in the handler of this new vector. In this handler, the hypervisor
can do something, such as waking up the blocked vCPU.

Here is what we do for the blocked vCPU:
1. Define a per-cpu list 'pi_blocked_vcpu', which stores the blocked
vCPUs on the pCPU.
2. When the vCPU is going to block, insert the vCPU
into the per-cpu list belonging to the pCPU it was running on.
3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.

In the handler of 'pi_wakeup_vector', we do:
1. Get the physical CPU.
2. Iterate the list 'pi_blocked_vcpu' of the current pCPU; if 'ON' is set,
we unblock the associated vCPU.

When the vCPU is blocked, we change the posted-interrupt descriptor and
put the vCPU in the pCPU's blocking list. We don't change the status of the
posted-interrupt descriptor back when the vCPU is unblocked or when the blocking
operation returns directly because there are events to be delivered. Instead,
we do it right before VM-Entry.

- New boot command line for Xen, which lets the user control the VT-d PI feature.

Like 'intremap' for interrupt remapping, we add a new boot command line
'intpost' for posted-interrupts.

- Multicast/broadcast and lowest priority interrupts consideration.

With VT-d PI, the destination vCPU information of an external interrupt
from assigned devices is stored in the IRTE. This leads to the following
design considerations:
1. Multicast/broadcast interrupts cannot be posted.
2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
(starting from Nehalem) ignore the TPR value, and instead support two other
ways (configurable by BIOS) to handle lowest priority interrupts:
 A) Round robin: In this method, the chipset simply delivers lowest priority
interrupts in a round-robin manner across all the available logical CPUs. While
this provides good load balancing, it is not always the best thing to do, as
interrupts from the same device (like a NIC) will start running on all the CPUs,
thrashing caches and taking locks. This led to the next scheme.
 B) Vector hashing: In this method, hardware applies a hash function
to the vector value in the interrupt request, and uses that hash to pick a logical
CPU to route the lowest priority interrupt to. This way, a given vector always goes
to the same logical CPU, avoiding the thrashing problem above.

So, the gist of the above is that lowest priority interrupts have never been
delivered as "lowest priority" in physical hardware.

Vector hashing is used in this design.
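
As a minimal sketch of vector hashing, the function below picks a destination
among the candidate vCPUs of a lowest-priority interrupt. This document does
not specify the hash function, so the modulo hash used here is an assumption
made for illustration, and 'pick_dest_vcpu' is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pick the destination vCPU for a lowest-priority interrupt.
 *
 * 'candidates' holds the IDs of the vCPUs matching the interrupt's
 * destination. The hash (vector modulo the number of candidates) is an
 * assumed, illustrative choice; any deterministic function of the
 * vector gives the same "one vector, one CPU" property.
 */
static unsigned int pick_dest_vcpu(uint8_t vector,
                                   const unsigned int *candidates,
                                   size_t nr_candidates)
{
    assert(nr_candidates > 0);
    return candidates[vector % nr_candidates];
}
```

Because the result depends only on the vector and the candidate set, a given
vector keeps hitting the same vCPU, so a single destination can be stored in
the IRTE and the interrupt can be posted.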