.. _hv-cpu-virt:

CPU Virtualization
##################

.. figure:: images/hld-image47.png
   :align: center
   :name: hv-cpu-virt-components

   ACRN Hypervisor CPU Virtualization Components

The following sections discuss the major modules (indicated above in blue)
in the CPU virtualization overview shown in :numref:`hv-cpu-virt-components`.

Based on Intel VT-x virtualization technology, ACRN emulates a virtual CPU
(vCPU) with the following methods:

- **core partition**: one vCPU is dedicated to and associated with one
  physical CPU (pCPU), so much of the hardware register emulation is simply
  passthrough. This method provides good isolation for physical interrupts
  and guest execution. (See `Static CPU Partitioning`_ for more
  information.)

- **core sharing** (to be added): two or more vCPUs share one
  physical CPU (pCPU). Switching between vCPUs requires a more complicated
  context switch. This method provides flexible sharing of computing
  resources for vCPU tasks with low performance demands.
  (See `Flexible CPU Sharing`_ for more information.)

- **simple schedule**: a well-designed scheduler framework that allows ACRN
  to adopt different scheduling policies, such as **noop** and
  **round-robin**:

  - **noop scheduler**: only two thread loops are maintained for a CPU: a
    vCPU thread and a default idle thread. A CPU runs most of the time in
    the vCPU thread for emulating a guest CPU, switching between VMX root
    mode and non-root mode. A CPU schedules out to default idle when an
    operation needs it to stay in VMX root mode, such as when waiting for
    an I/O request from the Device Model (DM) or when ready to destroy.

  - **round-robin scheduler** (to be added): allows more vCPU thread loops
    to run on a CPU. A CPU switches among different vCPU threads and the
    default idle thread when a timeslice runs out or when a vCPU must
    schedule out, such as when waiting for an I/O request.
    A vCPU can yield itself as well, such as when it executes a "PAUSE"
    instruction.


Static CPU Partitioning
***********************

CPU partitioning is a policy for mapping a virtual
CPU (vCPU) to a physical CPU. To enable this feature, the ACRN hypervisor can
configure a noop scheduler as the schedule policy for this physical CPU.

ACRN then forces a fixed 1:1 mapping between a vCPU and this physical CPU
when creating a vCPU for the guest operating system. This makes the vCPU
management code much simpler.

ACRN uses the ``cpu_affinity`` parameter in ``vm config`` to decide which
physical CPU to map to a vCPU in a VM, then finalizes the fixed mapping. When
launching a User VM, choose pCPUs from the VM's ``cpu_affinity`` that
are not used by any other VMs.

Flexible CPU Sharing
********************

To enable CPU sharing, the ACRN hypervisor can configure the BVT
(Borrowed Virtual Time) scheduler policy.

The ``cpu_affinity`` parameter in ``vm config`` indicates all the physical CPUs
on which this VM is allowed to run. A pCPU can be shared among a Service VM and
any User VM as long as the local APIC passthrough is not enabled in that User
VM.

See :ref:`cpu_sharing` for more information.

.. _hv-cpu-virt-cpu-mgmt-partition:

CPU Management in the Service VM Under Static CPU Partitioning
**************************************************************

With ACRN, all ACPI table entries are passed through to the Service VM,
including the Multiple APIC Description Table (MADT). The Service VM sees all
physical CPUs by parsing the MADT when the Service VM kernel boots. All
physical CPUs are initially assigned to the Service VM by creating the same
number of virtual CPUs.

After the Service VM boots, it releases the physical CPUs intended
for User VM use.

Here is an example flow of CPU allocation on a multi-core platform.

.. figure:: images/static-core-image2.png
   :width: 600px
   :align: center
   :name: static-core-cpu-allocation

   CPU Allocation on a Multi-core Platform

CPU Management in the Service VM Under Flexible CPU Sharing
***********************************************************

The Service VM sees all physical CPUs via the MADT, as described in
:ref:`hv-cpu-virt-cpu-mgmt-partition`. However, the Service VM does not release
the physical CPUs intended for User VM use.

CPU Management in the User VM
*****************************

The ``cpu_affinity`` parameter in ``vm config`` defines a set of pCPUs that a
User VM is allowed to run on. The Device Model can launch a User VM on only a
subset of the pCPUs or on all pCPUs listed in ``cpu_affinity``, but it cannot
assign any pCPU that is not included in it.

CPU Assignment Management in the Hypervisor
*******************************************

The physical CPU assignment is predefined by ``cpu_affinity`` in
``vm config``, while post-launched VMs can be launched on a subset of
those pCPUs.

The ACRN hypervisor does not support virtual CPU migration to
different physical CPUs. No changes to the mapping of the virtual CPU to
physical CPU can happen without first calling ``offline_vcpu``.


.. _vCPU_lifecycle:

vCPU Lifecycle
**************

A vCPU lifecycle is shown in :numref:`hv-vcpu-transitions` below, where
the major states are:

- **VCPU_INIT**: vCPU is in an initialized state, and its vCPU thread
  is not ready to run on its associated CPU.

- **VCPU_RUNNING**: vCPU is running, and its vCPU thread is ready (in
  the queue) or running on its associated CPU.

- **VCPU_PAUSED**: vCPU is paused, and its vCPU thread is not running
  on its associated CPU.

- **VCPU_ZOMBIE**: vCPU is transitioning to an offline state, and its vCPU
  thread is not running on its associated CPU.

- **VCPU_OFFLINE**: vCPU is offline.

.. figure:: images/hld-image17.png
   :align: center
   :name: hv-vcpu-transitions

   ACRN vCPU State Transitions

The following functions are used to drive the state machine of the vCPU
lifecycle:

.. doxygenfunction:: create_vcpu
   :project: Project ACRN

.. doxygenfunction:: zombie_vcpu
   :project: Project ACRN

.. doxygenfunction:: reset_vcpu
   :project: Project ACRN

.. doxygenfunction:: offline_vcpu
   :project: Project ACRN


vCPU Scheduling Under Static CPU Partitioning
*********************************************

.. figure:: images/hld-image35.png
   :align: center
   :name: hv-vcpu-schedule

   ACRN vCPU Scheduling Flow Under Static CPU Partitioning

For static CPU partitioning, ACRN implements a simple scheduling mechanism
based on two threads: vcpu_thread and default_idle. A vCPU in the
VCPU_RUNNING state always runs in a vcpu_thread loop.
A vCPU in the VCPU_PAUSED or VCPU_ZOMBIE state runs in a default_idle
loop. The behaviors in the vcpu_thread and default_idle threads
are illustrated in :numref:`hv-vcpu-schedule`:

- The **vcpu_thread** loop handles VM exits and pending requests around
  each VM entry/exit. It also checks for a reschedule request and
  schedules out to default_idle if necessary. See `vCPU Thread`_ for
  more details about vcpu_thread.

- The **default_idle** loop simply does do_cpu_idle while also
  checking for need-offline and reschedule requests.
  If a CPU is marked as need-offline, it goes to cpu_dead.
  If a reschedule request is made for this CPU, it
  schedules out to vcpu_thread if necessary.

- The function ``make_reschedule_request`` drives the thread
  switch between vcpu_thread and default_idle.

Some example scenario flows are shown here:

.. figure:: images/hld-image7.png
   :align: center

   ACRN vCPU Scheduling Scenarios

- **During VM startup**: after a vCPU is created, the bootstrap processor (BSP)
  calls *launch_vcpu* through *start_vm*. The application processor (AP) calls
  *launch_vcpu* through vLAPIC INIT-SIPI emulation. Finally, this vCPU runs in
  a *vcpu_thread* loop.

- **During VM shutdown**: the *pause_vm* function forces a vCPU
  running in *vcpu_thread* to schedule out to *default_idle*. The
  subsequent *reset_vcpu* and *offline_vcpu* calls de-initialize and then
  offline this vCPU instance.

- **During IOReq handling**: after an IOReq is sent to DM for emulation, a
  vCPU running in *vcpu_thread* schedules out to *default_idle*
  through *acrn_insert_request_wait->pause_vcpu*. After the DM
  completes the emulation for this IOReq, it calls
  *hcall_notify_ioreq_finish->resume_vcpu*, which schedules the vCPU
  back to *vcpu_thread* to continue its guest execution.

vCPU Scheduling Under Flexible CPU Sharing
******************************************

To be added.

vCPU Thread
***********

The vCPU thread flow is a loop as shown and described below:

.. figure:: images/hld-image68.png
   :align: center

   ACRN vCPU Thread


1. Check if *vcpu_thread* needs to schedule out to *default_idle* or
   another *vcpu_thread* because of a reschedule request. If needed,
   schedule out to *default_idle* or the other *vcpu_thread*.

2. Handle pending requests by calling *acrn_handle_pending_request*.
   (See `Pending Request Handlers`_.)

3. VM Enter by calling *start/run_vcpu*, then enter non-root mode to do
   guest execution.

4. VM Exit from *start/run_vcpu* when the guest triggers a VM exit reason in
   non-root mode.

5. Handle the VM exit based on its specific reason.

6. Loop back to step 1.
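
The six steps above can be modeled with a small, self-contained sketch. This
is not ACRN code: every ``toy_*`` name is hypothetical, VM entry is replaced
by reading the next entry of a scripted array of exit reasons, and only two
exit reasons are modeled. It is meant only to show the shape of the loop,
assuming a table-driven dispatch for step 5.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical exit reasons; real VMX exit reasons are not modeled. */
   enum toy_exit_reason { TOY_EXIT_CPUID, TOY_EXIT_IO, TOY_EXIT_COUNT };

   struct toy_vcpu {
       bool need_reschedule;               /* step 1: reschedule request */
       uint32_t pending_req;               /* step 2: pending request bitmap */
       const enum toy_exit_reason *exits;  /* scripted VM exit reasons */
       size_t nr_exits;
       size_t next_exit;
       unsigned int handled[TOY_EXIT_COUNT];
   };

   typedef void (*toy_exit_handler)(struct toy_vcpu *vcpu);

   static void toy_handle_cpuid(struct toy_vcpu *vcpu)
   {
       vcpu->handled[TOY_EXIT_CPUID]++;
   }

   static void toy_handle_io(struct toy_vcpu *vcpu)
   {
       vcpu->handled[TOY_EXIT_IO]++;
       /* Model an I/O request sent to the DM: the vCPU must schedule out. */
       vcpu->need_reschedule = true;
   }

   static const toy_exit_handler toy_dispatch[TOY_EXIT_COUNT] = {
       [TOY_EXIT_CPUID] = toy_handle_cpuid,
       [TOY_EXIT_IO] = toy_handle_io,
   };

   /* Runs the loop and returns how many times the guest was "entered". */
   unsigned int toy_vcpu_thread(struct toy_vcpu *vcpu)
   {
       unsigned int entries = 0U;

       for (;;) {
           /* 1. Schedule out if a reschedule request is pending. */
           if (vcpu->need_reschedule) {
               break;
           }
           /* 2. Handle pending requests (here they are simply cleared). */
           vcpu->pending_req = 0U;
           /* 3 and 4. "Enter" the guest and take the next scripted exit. */
           if (vcpu->next_exit >= vcpu->nr_exits) {
               break; /* script exhausted; a real guest would keep running */
           }
           enum toy_exit_reason reason = vcpu->exits[vcpu->next_exit++];
           entries++;
           /* 5. Handle the VM exit based on its reason. */
           toy_dispatch[reason](vcpu);
           /* 6. Loop back to step 1. */
       }
       return entries;
   }

With the script ``CPUID, CPUID, IO, CPUID``, the loop enters the guest three
times: the I/O exit sets the reschedule flag, so step 1 of the next iteration
leaves the loop before the last scripted exit is consumed.
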

vCPU Run Context
================

During a vCPU switch between root and non-root mode, the run context of
the vCPU is saved and restored using this structure:

.. doxygenstruct:: run_context
   :project: Project ACRN

The vCPU handles runtime context saving in three different
categories:

- Always save/restore during VM exit/entry:

  - These registers must be saved for each VM exit, and restored
    for each VM entry
  - Registers include: general purpose registers, CR2, and
    IA32_SPEC_CTRL
  - Definition in *vcpu->run_context*
  - Get/Set them through *vcpu_get/set_xxx*

- On-demand cache/update during VM exit/entry:

  - These registers are used frequently. They should be cached from
    VMCS on first access after a VM exit, and updated to VMCS on
    VM entry if marked dirty
  - Registers include: RSP, RIP, EFER, RFLAGS, CR0, and CR4
  - Definition in *vcpu->run_context*
  - Get/Set them through *vcpu_get/set_xxx*

- Always read/write from/to VMCS:

  - These registers are rarely used. Access to them is always
    from/to VMCS.
  - Registers are in VMCS but not listed in the two cases above.
  - No definition in *vcpu->run_context*
  - Get/Set them through VMCS API

For the first two categories above, ACRN provides these get/set APIs:

.. doxygenfunction:: vcpu_get_gpreg
   :project: Project ACRN

.. doxygenfunction:: vcpu_set_gpreg
   :project: Project ACRN

.. doxygenfunction:: vcpu_get_rip
   :project: Project ACRN

.. doxygenfunction:: vcpu_set_rip
   :project: Project ACRN

.. doxygenfunction:: vcpu_get_rsp
   :project: Project ACRN

.. doxygenfunction:: vcpu_set_rsp
   :project: Project ACRN

.. doxygenfunction:: vcpu_get_efer
   :project: Project ACRN

.. doxygenfunction:: vcpu_set_efer
   :project: Project ACRN

.. doxygenfunction:: vcpu_get_rflags
   :project: Project ACRN

.. doxygenfunction:: vcpu_set_rflags
   :project: Project ACRN

.. doxygenfunction:: vcpu_get_cr0
   :project: Project ACRN

.. doxygenfunction:: vcpu_set_cr0
   :project: Project ACRN

.. doxygenfunction:: vcpu_get_cr2
   :project: Project ACRN

.. doxygenfunction:: vcpu_set_cr2
   :project: Project ACRN

.. doxygenfunction:: vcpu_get_cr4
   :project: Project ACRN

.. doxygenfunction:: vcpu_set_cr4
   :project: Project ACRN


VM Exit Handlers
================

ACRN implements its VM exit handlers with a static table. Except for the
exit reasons listed below, a default *unhandled_vmexit_handler* is used
that triggers an error message and returns without handling the exit:

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - **VM Exit Reason**
     - **Handler**
     - **Description**

   * - VMX_EXIT_REASON_EXCEPTION_OR_NMI
     - exception_vmexit_handler
     - Only trap #MC, print error then inject back to guest

   * - VMX_EXIT_REASON_EXTERNAL_INTERRUPT
     - external_interrupt_vmexit_handler
     - External interrupt handler for physical interrupt happening in
       non-root mode

   * - VMX_EXIT_REASON_TRIPLE_FAULT
     - triple_fault_vmexit_handler
     - Handle triple fault from vCPU

   * - VMX_EXIT_REASON_INIT_SIGNAL
     - init_signal_vmexit_handler
     - Handle INIT signal from vCPU

   * - VMX_EXIT_REASON_INTERRUPT_WINDOW
     - interrupt_window_vmexit_handler
     - To support interrupt window if VID is disabled

   * - VMX_EXIT_REASON_CPUID
     - cpuid_vmexit_handler
     - Handle CPUID access from guest

   * - VMX_EXIT_REASON_VMCALL
     - vmcall_vmexit_handler
     - Handle hypercall from guest

   * - VMX_EXIT_REASON_CR_ACCESS
     - cr_access_vmexit_handler
     - Handle CR registers access from guest

   * - VMX_EXIT_REASON_IO_INSTRUCTION
     - pio_instr_vmexit_handler
     - Emulate I/O access with range in IO_BITMAP,
       which may have a handler in hypervisor (such as vUART or vPIC),
       or need to create an I/O request to DM

   * - VMX_EXIT_REASON_RDMSR
     - rdmsr_vmexit_handler
     - Read MSR from guest in MSR_BITMAP

   * - VMX_EXIT_REASON_WRMSR
     - wrmsr_vmexit_handler
     - Write MSR from guest in MSR_BITMAP

   * - VMX_EXIT_REASON_APIC_ACCESS
     - apic_access_vmexit_handler
     - APIC access for APICv

   * - VMX_EXIT_REASON_VIRTUALIZED_EOI
     - veoi_vmexit_handler
     - Trap vLAPIC EOI for specific vector with level trigger mode
       in vIOAPIC, required for supporting PTdev

   * - VMX_EXIT_REASON_EPT_VIOLATION
     - ept_violation_vmexit_handler
     - MMIO emulation, which may have handler in hypervisor
       (such as vLAPIC or vIOAPIC), or need to create an I/O
       request to DM

   * - VMX_EXIT_REASON_XSETBV
     - xsetbv_vmexit_handler
     - Set host owned XCR0 for supporting xsave

   * - VMX_EXIT_REASON_APIC_WRITE
     - apic_write_vmexit_handler
     - APIC write for APICv


Details of each VM exit reason handler are described in other sections.

.. _pending-request-handlers:

Pending Request Handlers
========================

ACRN uses the function *acrn_handle_pending_request* to handle
requests before VM entry in *vcpu_thread*.

A bitmap in the vCPU structure lists the different requests:

.. code-block:: c

   #define ACRN_REQUEST_EXCP 0U
   #define ACRN_REQUEST_EVENT 1U
   #define ACRN_REQUEST_EXTINT 2U
   #define ACRN_REQUEST_NMI 3U
   #define ACRN_REQUEST_EOI_EXIT_BITMAP_UPDATE 4U
   #define ACRN_REQUEST_EPT_FLUSH 5U
   #define ACRN_REQUEST_TRP_FAULT 6U
   #define ACRN_REQUEST_VPID_FLUSH 7U /* flush vpid tlb */


ACRN provides the function *vcpu_make_request* to make different
requests. It sets the bit of the corresponding request in the bitmap and
notifies the target vCPU through an IPI if necessary (when the caller is not
running on the target vCPU's physical CPU).
See :ref:`vcpu-request-interrupt-injection` for details.

.. code-block:: c

   void vcpu_make_request(struct vcpu *vcpu, uint16_t eventid)
   {
       uint16_t pcpu_id = pcpuid_from_vcpu(vcpu);

       bitmap_set_lock(eventid, &vcpu->arch_vcpu.pending_req);
       /*
        * if current hostcpu is not the target vcpu's hostcpu, we need
        * to invoke IPI to wake up target vcpu
        *
        * TODO: Here we just compare with cpuid, since cpuid is
        * global under pCPU / vCPU 1:1 mapping. If later we enabled vcpu
        * scheduling, we need change here to determine if target vcpu is
        * VMX non-root or root mode
        */
       if (get_cpu_id() != pcpu_id) {
           send_single_ipi(pcpu_id, VECTOR_NOTIFY_VCPU);
       }
   }

The function *acrn_handle_pending_request* handles each
request as shown below.


.. list-table::
   :widths: 25 25 25 25
   :header-rows: 1

   * - **Request**
     - **Description**
     - **Request Maker**
     - **Request Handler**

   * - ACRN_REQUEST_EXCP
     - Request for exception injection
     - vcpu_inject_gp, vcpu_inject_pf, vcpu_inject_ud, vcpu_inject_ac,
       or vcpu_inject_ss and then queue corresponding exception by
       vcpu_queue_exception
     - vcpu_inject_hi_exception, vcpu_inject_lo_exception based
       on exception priority

   * - ACRN_REQUEST_EVENT
     - Request for vLAPIC interrupt vector injection
     - vlapic_fire_lvt or vlapic_set_intr, which could be triggered
       by vlapic lvt, vioapic, or vmsi
     - vcpu_do_pending_event

   * - ACRN_REQUEST_EXTINT
     - Request for extint vector injection
     - vcpu_inject_extint, triggered by vPIC
     - vcpu_do_pending_extint

   * - ACRN_REQUEST_NMI
     - Request for NMI injection
     - vcpu_inject_nmi
     - Program VMX_ENTRY_INT_INFO_FIELD directly

   * - ACRN_REQUEST_EOI_EXIT_BITMAP_UPDATE
     - Request for VEOI bitmap update for level triggered vector
     - vlapic_reset_tmr or vlapic_set_tmr change trigger mode in RTC
     - vcpu_set_vmcs_eoi_exit

   * - ACRN_REQUEST_EPT_FLUSH
     - Request for EPT flush
     - ept_add_mr, ept_modify_mr, ept_del_mr, or vmx_write_cr0 disable cache
     - invept

   * - ACRN_REQUEST_TRP_FAULT
     - Request for handling triple fault
     - vcpu_queue_exception when it meets a triple fault
     - fatal error

   * - ACRN_REQUEST_VPID_FLUSH
     - Request for VPID flush
     - None
     - flush_vpid_single

.. note:: Refer to the interrupt management chapter for the request
   handling order for exceptions, NMIs, and interrupts. For other requests
   such as TMR update or EPT flush, there is no mandatory order.

VMX Initialization
******************

ACRN attempts to initialize the vCPU's VMCS before its first
launch. ACRN sets the host state, execution control, guest state,
entry control, and exit control, as shown in the table below.

The table briefly shows how each field is configured.
The guest state field is critical for running a guest CPU
based on different CPU modes.

For a guest vCPU's state initialization:

- If it's BSP, the guest state configuration is done in software load,
  which can be initialized by different objects:

  - Service VM BSP: Hypervisor does context initialization in different
    software load based on different boot mode

  - User VM BSP: DM context initialization through hypercall

- If it's AP, it always starts from real mode, and the start
  vector always comes from vLAPIC INIT-SIPI emulation.

.. doxygenstruct:: acrn_regs
   :project: Project ACRN

.. list-table::
   :widths: 20 40 10 30
   :header-rows: 1

   * - **VMX Domain**
     - **Fields**
     - **Bits**
     - **Description**

   * - **host state**
     - CS, DS, ES, FS, GS, TR, LDTR, GDTR, IDTR
     - n/a
     - According to host

   * -
     - MSR_IA32_PAT, MSR_IA32_EFER
     - n/a
     - According to host

   * -
     - CR0, CR3, CR4
     - n/a
     - According to host

   * -
     - RIP
     - n/a
     - Set to vm_exit pointer

   * -
     - IA32_SYSENTER_CS/ESP/EIP
     - n/a
     - Set to 0

   * - **execution control**
     - VMX_PIN_VM_EXEC_CONTROLS
     - 0
     - Enable external-interrupt exiting

   * -
     -
     - 7
     - Enable posted interrupts

   * -
     - VMX_PROC_VM_EXEC_CONTROLS
     - 3
     - Use TSC offsetting

   * -
     -
     - 21
     - Use TPR shadow

   * -
     -
     - 25
     - Use I/O bitmaps

   * -
     -
     - 28
     - Use MSR bitmaps

   * -
     -
     - 31
     - Activate secondary controls

   * -
     - VMX_PROC_VM_EXEC_CONTROLS2
     - 0
     - Virtualize APIC accesses

   * -
     -
     - 1
     - Enable EPT

   * -
     -
     - 3
     - Enable RDTSCP

   * -
     -
     - 5
     - Enable VPID

   * -
     -
     - 7
     - Unrestricted guest

   * -
     -
     - 8
     - APIC-register virtualization

   * -
     -
     - 9
     - Virtual-interrupt delivery

   * -
     -
     - 20
     - Enable XSAVES/XRSTORS

   * - **guest state**
     - CS, DS, ES, FS, GS, TR, LDTR, GDTR, IDTR
     - n/a
     - According to vCPU mode and init_ctx

   * -
     - RIP, RSP
     - n/a
     - According to vCPU mode and init_ctx

   * -
     - CR0, CR3, CR4
     - n/a
     - According to vCPU mode and init_ctx

   * -
     - GUEST_IA32_SYSENTER_CS/ESP/EIP
     - n/a
     - Set to 0

   * -
     - GUEST_IA32_PAT
     - n/a
     - Set to PAT_POWER_ON_VALUE

   * - **entry control**
     - VMX_ENTRY_CONTROLS
     - 2
     - Load debug controls

   * -
     -
     - 14
     - Load IA32_PAT

   * -
     -
     - 15
     - Load IA32_EFER

   * - **exit control**
     - VMX_EXIT_CONTROLS
     - 2
     - Save debug controls

   * -
     -
     - 9
     - Host address space size

   * -
     -
     - 15
     - Acknowledge Interrupt on exit

   * -
     -
     - 18
     - Save IA32_PAT

   * -
     -
     - 19
     - Load IA32_PAT

   * -
     -
     - 20
     - Save IA32_EFER

   * -
     -
     - 21
     - Load IA32_EFER


CPUID Virtualization
********************

CPUID access from a guest causes a VM exit unconditionally when executed
in VMX non-root operation. ACRN must return the emulated processor
identification and feature information in the EAX, EBX, ECX, and EDX
registers.

To simplify, ACRN returns the same values from the physical CPU for most
of the CPUID leaves, and specially handles a few CPUID features that are
APIC ID related, such as CPUID.01H.

ACRN emulates some extra CPUID features for the hypervisor as well.

The per-vm *vcpuid_entries* array is initialized during VM creation
and used to cache most of the CPUID entries for each VM. During guest
CPUID emulation, ACRN reads the cached value from this array, except
for some APIC ID-related CPUID data that is emulated at runtime.

This table describes details for CPUID emulation:

.. list-table::
   :widths: 20 80
   :header-rows: 1


   * - **CPUID**
     - **Emulation Description**

   * - 01H
     - - Get original value from physical CPUID
       - Fill APIC ID from vLAPIC
       - Disable x2APIC
       - Disable PCID
       - Disable VMX
       - Disable XSAVE if host not enabled

   * - 0BH
     - - Fill according to X2APIC feature support (default is disabled)
       - If not supported, fill all registers with 0
       - If supported, get from physical CPUID

   * - 0DH
     - - Fill according to XSAVE feature support
       - If not supported, fill all registers with 0
       - If supported, get from physical CPUID

   * - 07H
     - - Get from per-vm CPUID entries cache
       - For subleaf 0, disable INVPCID and Intel RDT

   * - 16H
     - - Get from per-vm CPUID entries cache
       - If physical CPU supports CPUID.16H, read from physical CPUID
       - If physical CPU does not support it, emulate with TSC frequency

   * - 40000000H
     - - Get from per-vm CPUID entries cache
       - EAX: the maximum input value for CPUID supported by ACRN (40000010H)
       - EBX, ECX, EDX: hypervisor vendor ID signature - "ACRNACRNACRN"

   * - 40000010H
     - - Get from per-vm CPUID entries cache
       - EAX: virtual TSC frequency in kHz
       - EBX, ECX, EDX: reserved to 0

   * - 0AH
     - - PMU disabled

   * - 0FH, 10H
     - - Intel RDT disabled

   * - 12H
     - - Fill according to SGX virtualization

   * - 14H
     - - Intel Processor Trace disabled

   * - Others
     - - Get from per-vm CPUID entries cache

.. note:: ACRN needs to take care of
   some CPUID values that can change at runtime, for example, the XD feature in
   CPUID.80000001H may be cleared by the MISC_ENABLE MSR.


MSR Virtualization
******************

ACRN always enables an MSR bitmap in the *VMX_PROC_VM_EXEC_CONTROLS* VMX
execution control field. This bitmap marks the MSRs that cause a VM
exit upon guest access for both read and write.
The VM exit reason for reading or writing these MSRs is
*VMX_EXIT_REASON_RDMSR* or *VMX_EXIT_REASON_WRMSR*, respectively, and the
VM exit handler is *rdmsr_vmexit_handler* or *wrmsr_vmexit_handler*.

This table shows the predefined MSRs that ACRN traps for all guests. For
MSRs whose bits are not set in the MSR bitmap, guest access is
passed through directly:

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - **MSR**
     - **Description**
     - **Handler**

   * - MSR_IA32_TSC_ADJUST
     - TSC adjustment of local APIC's TSC deadline mode
     - Emulates with vLAPIC

   * - MSR_IA32_TSC_DEADLINE
     - TSC target of local APIC's TSC deadline mode
     - Emulates with vLAPIC

   * - MSR_IA32_BIOS_UPDT_TRIG
     - BIOS update trigger
     - Update microcode from the Service VM, the signature ID read is from
       physical MSR, and a BIOS update trigger from the Service VM will trigger a
       physical microcode update.

   * - MSR_IA32_BIOS_SIGN_ID
     - BIOS update signature ID
     - \"

   * - MSR_IA32_TIME_STAMP_COUNTER
     - Time-stamp counter
     - Work with VMX_TSC_OFFSET_FULL to emulate virtual TSC

   * - MSR_IA32_APIC_BASE
     - APIC base address
     - Emulates with vLAPIC

   * - MSR_IA32_PAT
     - Page-attribute table
     - Save/restore in vCPU, write to VMX_GUEST_IA32_PAT_FULL if cr0.cd is 0

   * - MSR_IA32_PERF_CTL
     - Performance control
     - Trigger real P-state change if P-state is valid when writing,
       fetch physical MSR when reading

   * - MSR_IA32_FEATURE_CONTROL
     - Feature control bits that configure operation of VMX and SMX
     - Disabled, locked

   * - MSR_IA32_MCG_CAP/STATUS
     - Machine-Check global control/status
     - Emulates with vMCE

   * - MSR_IA32_MISC_ENABLE
     - Miscellaneous feature control
     - Read-only, except MONITOR/MWAIT enable bit

   * - MSR_IA32_SGXLEPUBKEYHASH0/1/2/3
     - SHA256 digest of the authorized launch enclaves
     - Emulates with vSGX

   * - MSR_IA32_SGX_SVN_STATUS
     - Status and SVN threshold of SGX support for ACM
     - Read-only, emulates with vSGX

   * - MSR_IA32_MTRR_CAP
     - Memory type range register related
     - Handled by MTRR emulation

   * - MSR_IA32_MTRR_DEF_TYPE
     - \"
     - \"

   * - MSR_IA32_MTRR_PHYSBASE_0~9
     - \"
     - \"

   * - MSR_IA32_MTRR_FIX64K_00000
     - \"
     - \"

   * - MSR_IA32_MTRR_FIX16K_80000/A0000
     - \"
     - \"

   * - MSR_IA32_MTRR_FIX4K_C0000~F8000
     - \"
     - \"

   * - MSR_IA32_X2APIC_*
     - x2APIC related MSRs (offset from 0x800 to 0x900)
     - Emulates with vLAPIC

   * - MSR_IA32_L2_MASK_BASE~n
     - L2 CAT mask for CLOSn
     - Disabled for guest access

   * - MSR_IA32_L3_MASK_BASE~n
     - L3 CAT mask for CLOSn
     - Disabled for guest access

   * - MSR_IA32_MBA_MASK_BASE~n
     - MBA delay mask for CLOSn
     - Disabled for guest access

   * - MSR_IA32_VMX_BASIC~VMX_TRUE_ENTRY_CTLS
     - VMX related MSRs
     - Not supported, access will inject #GP


CR Virtualization
*****************

ACRN emulates ``mov to cr0``, ``mov to cr4``, ``mov to cr8``, and ``mov
from cr8`` through *cr_access_vmexit_handler* based on
*VMX_EXIT_REASON_CR_ACCESS*.

.. note:: ``mov to cr8`` and ``mov from cr8`` do not actually cause VM
   exits, because the ``CR8-load/store exiting`` bits are set to 0 in
   *VMX_PROC_VM_EXEC_CONTROLS*.

A VM can ``mov from cr0`` and ``mov from
cr4`` without triggering a VM exit. The values read are the read shadows
of the corresponding register in VMCS. The shadows are updated by the
hypervisor on CR writes.

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - **Operation**
     - **Handler**

   * - mov to cr0
     - Based on vCPU set context API: vcpu_set_cr0 -> vmx_write_cr0

   * - mov to cr4
     - Based on vCPU set context API: vcpu_set_cr4 -> vmx_write_cr4

   * - mov to cr8
     - Based on vLAPIC tpr API: vlapic_set_cr8 -> vlapic_set_tpr

   * - mov from cr8
     - Based on vLAPIC tpr API: vlapic_get_cr8 -> vlapic_get_tpr


For ``mov to cr0`` and ``mov to cr4``, ACRN sets
*cr0_host_mask/cr4_host_mask* into *VMX_CR0_MASK/VMX_CR4_MASK*
as the bitmask of bits whose modification causes a VM exit.

As ACRN always enables ``unrestricted guest`` in
*VMX_PROC_VM_EXEC_CONTROLS2*, *CR0.PE* and *CR0.PG* can be
controlled by the guest.

.. list-table::
   :widths: 20 40 40
   :header-rows: 1

   * - **CR0 MASK**
     - **Value**
     - **Comments**

   * - cr0_always_on_mask
     - fixed0 & (~(CR0_PE | CR0_PG))
     - fixed0 comes from MSR_IA32_VMX_CR0_FIXED0, these bits
       are fixed to be 1 under VMX operation.

   * - cr0_always_off_mask
     - ~fixed1
     - ~fixed1 comes from MSR_IA32_VMX_CR0_FIXED1, these bits
       are fixed to be 0 under VMX operation.

   * - CR0_TRAP_MASK
     - CR0_PE | CR0_PG | CR0_WP | CR0_CD | CR0_NW
     - ACRN will also trap PE, PG, WP, CD, and NW bits.

   * - cr0_host_mask
     - ~(fixed0 ^ fixed1) | CR0_TRAP_MASK
     - ACRN will finally trap bits under VMX root mode control plus
       additionally added bits.


For ``mov to cr0`` emulation, ACRN handles a paging mode change based on
a PG bit change, and a cache mode change based on CD and NW bit changes.
ACRN also catches illegal guest writes to CR0 (for example, setting PG
while CR4.PAE = 0 and IA32_EFER.LME = 1) and injects a #GP into the guest.
Finally, *VMX_CR0_READ_SHADOW* will be updated for guest reading of host
controlled bits, and *VMX_GUEST_CR0* will be updated for real vmx cr0
setting.

.. list-table::
   :widths: 20 40 40
   :header-rows: 1

   * - **CR4 MASK**
     - **Value**
     - **Comments**

   * - cr4_always_on_mask
     - fixed0
     - fixed0 comes from MSR_IA32_VMX_CR4_FIXED0, these bits
       are fixed to be 1 under VMX operation

   * - cr4_always_off_mask
     - ~fixed1
     - ~fixed1 comes from MSR_IA32_VMX_CR4_FIXED1, these bits
       are fixed to be 0 under VMX operation

   * - CR4_TRAP_MASK
     - CR4_PSE | CR4_PAE | CR4_VMXE | CR4_PCIDE | CR4_SMEP | CR4_SMAP | CR4_PKE
     - ACRN will also trap PSE, PAE, VMXE, PCIDE, SMEP, SMAP, and PKE bits

   * - cr4_host_mask
     - ~(fixed0 ^ fixed1) | CR4_TRAP_MASK
     - ACRN will finally trap bits under VMX root mode control plus
       additionally added bits


The ``mov to cr4`` emulation is similar to cr0 emulation noted above.

.. _io-mmio-emulation:

IO/MMIO Emulation
*****************

ACRN always enables an I/O bitmap in *VMX_PROC_VM_EXEC_CONTROLS* and EPT
in *VMX_PROC_VM_EXEC_CONTROLS2*. Based on them,
*pio_instr_vmexit_handler* and *ept_violation_vmexit_handler* are
used for IO/MMIO emulation for an emulated device. The device can
be emulated by the hypervisor or DM in the Service VM.

For a device emulated by the hypervisor, ACRN provides some basic
APIs to register its IO/MMIO range:

- For the Service VM, the default I/O bitmap values are all set to 0, which
  means the Service VM will passthrough all I/O port access by default. Adding
  an I/O handler for a hypervisor emulated device needs to first set its
  corresponding I/O bitmap to 1.

- For the User VM, the default I/O bitmap values are all set to 1, which means
  the User VM will trap all I/O port access by default. Adding an I/O handler
  for a hypervisor emulated device does not need to change its I/O bitmap. If
  the trapped I/O port access does not fall into a hypervisor emulated device,
  it will create an I/O request and pass it to the Service VM DM.

- For the Service VM, EPT maps the entire range of memory to the Service VM
  except for the ACRN hypervisor area. The Service VM will passthrough all
  MMIO access by default. Adding an MMIO handler for a hypervisor emulated
  device needs to first remove its MMIO range from EPT mapping.

- For the User VM, EPT only maps its system RAM to the User VM, which means the
  User VM will trap all MMIO access by default. Adding an MMIO handler for a
  hypervisor emulated device does not need to change its EPT mapping. If the
  trapped MMIO access does not fall into a hypervisor emulated device, it will
  create an I/O request and pass it to the Service VM DM.

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - **API**
     - **Description**

   * - register_pio_emulation_handler
     - Register an I/O emulation handler for a hypervisor emulated device
       by specific I/O range.

   * - register_mmio_emulation_handler
     - Register an MMIO emulation handler for a hypervisor emulated device
       by specific MMIO range.

.. _instruction-emulation:

Instruction Emulation
*********************

ACRN implements a simple instruction emulation infrastructure for
MMIO (EPT) and APIC access emulation. When such a VM exit is triggered, the
hypervisor needs to decode the instruction from RIP then attempt the
corresponding emulation based on its instruction and read/write direction.

ACRN supports emulating instructions for ``mov``, ``movx``,
``movs``, ``stos``, ``test``, ``and``, ``or``, ``cmp``, ``sub``, and
``bittest`` without support for lock prefix. Real mode emulation is not
supported.

.. figure:: images/hld-image82.png
   :align: center

   Instruction Emulation Work Flow

In the handlers for EPT violation or APIC access VM exit, ACRN will:

1. Fetch the MMIO access request's address and size.

2. Call *decode_instruction* for the instruction at the current RIP
   with the following checks:

   a. Is the instruction supported? If not, inject #UD to the guest.
   b. Is the GVA of RIP, dest, and src valid? If not, inject #PF to the guest.
   c. Is the stack valid? If not, inject #SS to the guest.

3. If step 2 succeeds, check the access direction. If it's a write, then
   call *emulate_instruction* to fetch the MMIO request's value from the
   instruction operands.

4. Execute the MMIO request handler. For EPT violation, it is *emulate_io*.
   For APIC access, it is *vlapic_write/read* based on the access
   direction. The handler completes this MMIO request emulation
   by:

   a. putting req.val to req.addr for a write operation
   b. getting req.val from req.addr for a read operation

5. If the access direction is read, then call *emulate_instruction* to
   put the MMIO request's value into the instruction operands.

6. Return to the guest.

TSC Emulation
*************

Guest vCPU execution of *RDTSC/RDTSCP* and access to
*MSR_IA32_TSC_AUX* do not cause a VM Exit to the hypervisor.
The hypervisor uses *MSR_IA32_TSC_AUX* to record the CPU ID, so
the CPU ID provided by *MSR_IA32_TSC_AUX* might be changed by the guest.

*RDTSCP* is widely used by the hypervisor to identify the current CPU ID.
Because accesses to *MSR_IA32_TSC_AUX* do not cause a VM Exit, the ACRN
hypervisor saves the guest's *MSR_IA32_TSC_AUX* value on every VM Exit and
restores it on every VM Enter. The hypervisor must not execute *rdtscp*
before it has restored the host CPU ID, because the instruction could return
the vCPU ID instead of the host CPU ID.


The *MSR_IA32_TIME_STAMP_COUNTER* is emulated by the ACRN hypervisor, with a
simple implementation based on *TSC_OFFSET* (enabled
in *VMX_PROC_VM_EXEC_CONTROLS*):

- For read: ``val = rdtsc() + exec_vmread64(VMX_TSC_OFFSET_FULL)``
- For write: ``exec_vmwrite64(VMX_TSC_OFFSET_FULL, val - rdtsc())``

ART Virtualization
******************

The invariant TSC is based on the invariant timekeeping hardware (called
Always Running Timer or ART), which runs at the core crystal clock frequency.
The ratio defined by the CPUID leaf 15H expresses the frequency relationship
between the ART hardware and the TSC.

If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1, the
following linearity relationship holds between the TSC and the ART hardware:

   ``TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K``

Where ``K`` is an offset that can be adjusted by a privileged agent.
When the ART hardware is reset, both the invariant TSC and ``K`` are also reset.

The guiding principle of ART virtualization (vART) is that software that runs
natively can also run in a VM. The vART solution is:

- Present the ART capability to the guest through CPUID leaf 15H for ``CPUID.15H:EBX[31:0]``
  and ``CPUID.15H:EAX[31:0]``.
- Passthrough devices see the physical ART_Value (vART_Value = pART_Value).
- The relationship between the ART and the TSC in the guest is:
  ``vTSC_Value = (vART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + vK``
  where ``vK = K + VMCS.TSC_OFFSET``.
- If the guest changes ``vK`` or ``vTSC_Value``, the hypervisor changes ``VMCS.TSC_OFFSET`` accordingly.
- ``K`` should never be changed by the hypervisor.


XSAVE Emulation
***************

The XSAVE feature set is composed of eight instructions:

- *XGETBV* and *XSETBV* allow software to read and write the extended
  control register *XCR0*, which controls the operation of the
  XSAVE feature set.

- *XSAVE*, *XSAVEOPT*, *XSAVEC*, and *XSAVES* are four instructions
  that save the processor state to memory.

- *XRSTOR* and *XRSTORS* are corresponding instructions that load the
  processor state from memory.

- *XGETBV*, *XSAVE*, *XSAVEOPT*, *XSAVEC*, and *XRSTOR* can be executed
  at any privilege level.

- *XSETBV*, *XSAVES*, and *XRSTORS* can be executed only if CPL = 0.

Enabling the XSAVE feature set is controlled by XCR0 (through XSETBV)
and the IA32_XSS MSR. Refer to the `Intel SDM Volume 1`_ chapter 13 for more details.


.. _Intel SDM Volume 1:
   https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-1-manual.html

.. figure:: images/hld-image38.png
   :align: center

   ACRN Hypervisor XSAVE Emulation

By default, ACRN enables XSAVES/XRSTORS in
*VMX_PROC_VM_EXEC_CONTROLS2*, so it allows the guest to use the XSAVE
feature. Because guest execution of *XSETBV* always triggers an XSETBV VM
exit, ACRN must emulate XCR0 access.

ACRN emulates XSAVE features through the following rules:

1. Enumerate CPUID.01H for native XSAVE feature support.
2. If step 1 reports support, enable XSAVE in the hypervisor by setting
   CR4.OSXSAVE.
3. Emulate XSAVE related CPUID.01H and CPUID.0DH to the guest.
4. Emulate XCR0 access through *xsetbv_vmexit_handler*.
5. Pass through the access of the IA32_XSS MSR to the guest.
6. The ACRN hypervisor does NOT use any feature of XSAVE.
7. When ACRN emulates the vCPU with partition mode: based on rules 5
   and 6 above, a guest vCPU will fully control the XSAVE feature in
   non-root mode.