1.. _hv-cpu-virt:
2
3CPU Virtualization
4##################
5
6.. figure:: images/hld-image47.png
7   :align: center
8   :name: hv-cpu-virt-components
9
10   ACRN Hypervisor CPU Virtualization Components
11
12The following sections discuss the major modules (indicated above in blue)
13in the CPU virtualization overview shown in :numref:`hv-cpu-virt-components`.
14
15Based on Intel VT-x virtualization technology, ACRN emulates a virtual CPU
16(vCPU) with the following methods:
17
18-  **core partition**: one vCPU is dedicated and associated with one
19   physical CPU (pCPU),
20   making much of the hardware register emulation simply
21   passthrough. This method provides good isolation for physical interrupts
22   and guest execution.  (See `Static CPU Partitioning`_ for more
23   information.)
24
-  **core sharing** (to be added): two or more vCPUs share one
   physical CPU (pCPU). Switching between vCPUs requires a more
   complicated context switch. This method allows flexible sharing of
   computing resources among vCPU tasks with low performance demands.
   (See `Flexible CPU Sharing`_ for more information.)
30
31-  **simple schedule**: a well-designed scheduler framework that allows ACRN
32   to adopt different scheduling policies, such as the **noop** and **round-robin**:
33
34   - **noop scheduler**: only two thread loops are maintained for a CPU: a
35     vCPU thread and a default idle thread. A CPU runs most of the time in
36     the vCPU thread for emulating a guest CPU, switching between VMX root
37     mode and non-root mode. A CPU schedules out to default idle when an
38     operation needs it to stay in VMX root mode, such as when waiting for
39     an I/O request from the Device Model (DM) or when ready to destroy.
40
   - **round-robin scheduler** (to be added): allows more vCPU thread loops
     to run on a CPU. A CPU switches among different vCPU threads and the
     default idle thread when a thread's timeslice runs out or when a thread
     must schedule out, such as when waiting for an I/O request. A vCPU can
     also yield itself, such as when it executes a "PAUSE" instruction.
46
47
48Static CPU Partitioning
49***********************
50
51CPU partitioning is a policy for mapping a virtual
52CPU (vCPU) to a physical CPU. To enable this feature, the ACRN hypervisor can
53configure a noop scheduler as the schedule policy for this physical CPU.
54
55ACRN then forces a fixed 1:1 mapping between a vCPU and this physical CPU
56when creating a vCPU for the guest operating system. This makes the vCPU
57management code much simpler.
58
ACRN uses the ``cpu_affinity`` parameter in ``vm config`` to decide which
physical CPU to map to a vCPU in a VM, then finalizes the fixed mapping. When
launching a User VM, choose pCPUs from the VM's ``cpu_affinity`` that
are not used by any other VMs.
63
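The selection rule above can be sketched as a few bitmask checks. This is a
minimal illustration rather than ACRN code: the function name
``pcpu_selection_ok`` and the ``pcpus_in_use`` bitmap are hypothetical, and
bit *n* of each mask is assumed to represent pCPU *n*.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>

   /*
    * Hypothetical check: the requested pCPUs must be a non-empty subset of
    * the VM's cpu_affinity and must not overlap pCPUs used by other VMs.
    */
   static bool pcpu_selection_ok(uint64_t requested, uint64_t cpu_affinity,
                                 uint64_t pcpus_in_use)
   {
      /* A VM must be allowed to run on at least one pCPU. */
      assert(cpu_affinity != 0U);

      return (requested != 0U) &&
             ((requested & ~cpu_affinity) == 0U) &&
             ((requested & pcpus_in_use) == 0U);
   }

For example, if ``cpu_affinity`` covers pCPU 2 and pCPU 3 while pCPU 0 and
pCPU 1 are already taken, a request for pCPU 2 passes the check and a request
for pCPU 1 does not.
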
64Flexible CPU Sharing
65********************
66
67To enable CPU sharing, the ACRN hypervisor can configure the BVT
68(Borrowed Virtual Time) scheduler policy.
69
70The ``cpu_affinity`` parameter in ``vm config`` indicates all the physical CPUs
71on which this VM is allowed to run. A pCPU can be shared among a Service VM and
72any User VM as long as the local APIC passthrough is not enabled in that User
73VM.
74
75See :ref:`cpu_sharing` for more information.
76
77.. _hv-cpu-virt-cpu-mgmt-partition:
78
79CPU Management in the Service VM Under Static CPU Partitioning
80**************************************************************
81
With ACRN, all ACPI table entries are passed through to the Service VM, including
the Multiple APIC Description Table (MADT). The Service VM sees all
physical CPUs by parsing the MADT when the Service VM kernel boots. All
physical CPUs are initially assigned to the Service VM by creating the same
number of virtual CPUs.
87
88After the Service VM boots, it releases the physical CPUs intended
89for User VM use.
90
91Here is an example flow of CPU allocation on a multi-core platform.
92
93.. figure:: images/static-core-image2.png
94   :width: 600px
95   :align: center
96   :name: static-core-cpu-allocation
97
98   CPU Allocation on a Multi-core Platform
99
100CPU Management in the Service VM Under Flexible CPU Sharing
101***********************************************************
102
103The Service VM sees all physical CPUs via the MADT, as described in
104:ref:`hv-cpu-virt-cpu-mgmt-partition`. However, the Service VM does not release
105the physical CPUs intended for User VM use.
106
107CPU Management in the User VM
108*****************************
109
110The ``cpu_affinity`` parameter in ``vm config`` defines a set of pCPUs that a
111User VM is allowed to run on. The Device Model can launch a User VM on only a
112subset of the pCPUs or on all pCPUs listed in ``cpu_affinity``, but it cannot
113assign any pCPU that is not included in it.
114
115CPU Assignment Management in the Hypervisor
116*******************************************
117
The physical CPU assignment is predefined by ``cpu_affinity`` in
``vm config``. Post-launched VMs can be launched on a subset of
those pCPUs.
121
122The ACRN hypervisor does not support virtual CPU migration to
123different physical CPUs. No changes to the mapping of the virtual CPU to
124physical CPU can happen without first calling ``offline_vcpu``.
125
126
127.. _vCPU_lifecycle:
128
129vCPU Lifecycle
130**************
131
132A vCPU lifecycle is shown in :numref:`hv-vcpu-transitions` below, where
133the major states are:
134
135-  **VCPU_INIT**: vCPU is in an initialized state, and its vCPU thread
136   is not ready to run on its associated CPU.
137
138-  **VCPU_RUNNING**: vCPU is running, and its vCPU thread is ready (in
139   the queue) or running on its associated CPU.
140
141-  **VCPU_PAUSED**: vCPU is paused, and its vCPU thread is not running
142   on its associated CPU.
143
-  **VCPU_ZOMBIE**: vCPU is transitioning to an offline state, and its vCPU thread is
   not running on its associated CPU.

-  **VCPU_OFFLINE**: vCPU is offline.
148
149.. figure:: images/hld-image17.png
150   :align: center
151   :name: hv-vcpu-transitions
152
153   ACRN vCPU State Transitions
154
155The following functions are used to drive the state machine of the vCPU
156lifecycle:
157
158.. doxygenfunction:: create_vcpu
159   :project: Project ACRN
160
161.. doxygenfunction:: zombie_vcpu
162   :project: Project ACRN
163
164.. doxygenfunction:: reset_vcpu
165   :project: Project ACRN
166
167.. doxygenfunction:: offline_vcpu
168   :project: Project ACRN
169
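The lifecycle can be modeled as a small state machine. The sketch below is
illustrative only: the set of allowed transitions is an assumption inferred
from the state descriptions and function names above (the state transition
figure remains authoritative), and ``vcpu_transition_allowed`` is a
hypothetical helper, not an ACRN function.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   enum vcpu_state {
      VCPU_INIT,
      VCPU_RUNNING,
      VCPU_PAUSED,
      VCPU_ZOMBIE,
      VCPU_OFFLINE
   };

   /* Assumed transitions: launch, pause/resume, zombie, reset, and offline. */
   static bool vcpu_transition_allowed(enum vcpu_state from, enum vcpu_state to)
   {
      assert(to <= VCPU_OFFLINE);

      switch (from) {
      case VCPU_INIT:
         return to == VCPU_RUNNING;                       /* launch */
      case VCPU_RUNNING:
         return (to == VCPU_PAUSED) || (to == VCPU_ZOMBIE);
      case VCPU_PAUSED:
         return (to == VCPU_RUNNING) || (to == VCPU_ZOMBIE);
      case VCPU_ZOMBIE:
         return (to == VCPU_INIT) || (to == VCPU_OFFLINE); /* reset, offline */
      default:
         break;                                           /* offline is final */
      }
      return false;
   }

Under these assumptions, a running vCPU cannot go offline directly. It first
becomes a zombie and is then either reset or taken offline.
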
170
171vCPU Scheduling Under Static CPU Partitioning
172*********************************************
173
174.. figure:: images/hld-image35.png
175   :align: center
176   :name: hv-vcpu-schedule
177
178   ACRN vCPU Scheduling Flow Under Static CPU Partitioning
179
180For static CPU partitioning, ACRN implements a simple scheduling mechanism
181based on two threads: vcpu_thread and default_idle. A vCPU in the
182VCPU_RUNNING state always runs in a vcpu_thread loop.
183A vCPU in the VCPU_PAUSED or VCPU_ZOMBIE state runs in a default_idle
184loop. The behaviors in the vcpu_thread and default_idle threads
185are illustrated in :numref:`hv-vcpu-schedule`:
186
-  The **vcpu_thread** loop handles VM exits and pending requests
   around each VM entry/exit.
   It also checks for a reschedule request and schedules out to
   default_idle if necessary. See `vCPU Thread`_ for more details
   about vcpu_thread.
192
193-  The **default_idle** loop simply does do_cpu_idle while also
194   checking for need-offline and reschedule requests.
195   If a CPU is marked as need-offline, it will go to cpu_dead.
196   If a reschedule request is made for this CPU, it will
197   schedule out to vcpu_thread if necessary.
198
199-  The function ``make_reschedule_request`` drives the thread
200   switch between vcpu_thread and default_idle.
201
202Some example scenario flows are shown here:
203
204.. figure:: images/hld-image7.png
205   :align: center
206
207   ACRN vCPU Scheduling Scenarios
208
209-  **During VM startup**: after a vCPU is created, the bootstrap processor (BSP)
210   calls *launch_vcpu* through *start_vm*. The application processor (AP) calls
211   *launch_vcpu* through vLAPIC INIT-SIPI emulation. Finally, this vCPU runs in
212   a *vcpu_thread* loop.
213
-  **During VM shutdown**: the *pause_vm* function forces a vCPU
   running in *vcpu_thread* to schedule out to *default_idle*. The
   subsequent *reset_vcpu* and *offline_vcpu* calls de-initialize and then
   offline this vCPU instance.
218
219-  **During IOReq handling**: after an IOReq is sent to DM for emulation, a
220   vCPU running in *vcpu_thread* schedules out to *default_idle*
221   through *acrn_insert_request_wait->pause_vcpu*. After the DM
222   completes the emulation for this IOReq, it calls
223   *hcall_notify_ioreq_finish->resume_vcpu* and changes the vCPU
224   schedule back to *vcpu_thread* to continue its guest execution.
225
226vCPU Scheduling Under Flexible CPU Sharing
227******************************************
228
229To be added.
230
231vCPU Thread
232***********
233
234The vCPU thread flow is a loop as shown and described below:
235
236.. figure:: images/hld-image68.png
237   :align: center
238
239   ACRN vCPU Thread
240
241
1. Check if *vcpu_thread* needs to schedule out to *default_idle* or
   another *vcpu_thread* because of a reschedule request. If so, schedule
   out to *default_idle* or the other *vcpu_thread*.
245
2462. Handle pending request by calling *acrn_handle_pending_request*.
247   (See `Pending Request Handlers`_.)
248
2493. VM Enter by calling *start/run_vcpu*, then enter non-root mode to do
250   guest execution.
251
2524. VM Exit from *start/run_vcpu* when the guest triggers a VM exit reason in
253   non-root mode.
254
2555. Handle VM exit based on specific reason.
256
2576. Loop back to step 1.
258
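The loop steps above can be sketched as a small simulation. Everything here is
a stand-in: ``struct toy_vcpu``, ``toy_run_vcpu``, and ``toy_vcpu_thread`` are
hypothetical names, the guest run is faked, and the real loop runs until the
vCPU is scheduled out rather than for a fixed number of iterations.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Toy per-vCPU state for this sketch; these are not ACRN's fields. */
   struct toy_vcpu {
      bool need_resched;      /* set by a reschedule request */
      uint32_t pending_req;   /* bitmap of pending requests */
      uint32_t exits_handled; /* number of VM exits processed */
   };

   /* Stand-in for VM entry plus guest execution; returns a fake exit reason. */
   static uint32_t toy_run_vcpu(const struct toy_vcpu *vcpu)
   {
      (void)vcpu;
      return 10U; /* pretend the guest executed CPUID */
   }

   /* Runs up to max_iters loop iterations; returns how many completed. */
   static uint32_t toy_vcpu_thread(struct toy_vcpu *vcpu, uint32_t max_iters)
   {
      uint32_t i;

      assert(vcpu != NULL);
      for (i = 0U; i < max_iters; i++) {
         if (vcpu->need_resched) {   /* step 1: honor a reschedule request */
            vcpu->need_resched = false;
            break;                   /* schedule out of this thread */
         }
         vcpu->pending_req = 0U;     /* step 2: handle pending requests */
         (void)toy_run_vcpu(vcpu);   /* steps 3 and 4: VM entry, then VM exit */
         vcpu->exits_handled++;      /* step 5: handle the VM exit */
      }                              /* step 6: loop back to step 1 */
      return i;
   }

The reschedule check comes first in each iteration, so a request raised while
the guest was running takes effect before the next VM entry.
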
259vCPU Run Context
260================
261
262During a vCPU switch between root and non-root mode, the run context of
263the vCPU is saved and restored using this structure:
264
265.. doxygenstruct:: run_context
266   :project: Project ACRN
267
268The vCPU handles runtime context saving by three different
269categories:
270
271-  Always save/restore during VM exit/entry:
272
273   -  These registers must be saved for each VM exit, and restored
274      for each VM entry
275   -  Registers include: general purpose registers, CR2, and
276      IA32_SPEC_CTRL
277   -  Definition in *vcpu->run_context*
278   -  Get/Set them through *vcpu_get/set_xxx*
279
280-  On-demand cache/update during VM exit/entry:
281
282   -  These registers are used frequently. They should be cached from
283      VMCS on first time access after a VM exit, and updated to VMCS on
284      VM entry if marked dirty
285   -  Registers include: RSP, RIP, EFER, RFLAGS, CR0, and CR4
286   -  Definition in *vcpu->run_context*
287   -  Get/Set them through *vcpu_get/set_xxx*
288
289-  Always read/write from/to VMCS:
290
291   -  These registers are rarely used. Access to them is always
292      from/to VMCS.
   -  Registers are in VMCS but not listed in the two cases above.
294   -  No definition in *vcpu->run_context*
295   -  Get/Set them through VMCS API
296
297For the first two categories above, ACRN provides these get/set APIs:
298
299.. doxygenfunction:: vcpu_get_gpreg
300   :project: Project ACRN
301
302.. doxygenfunction:: vcpu_set_gpreg
303   :project: Project ACRN
304
305.. doxygenfunction:: vcpu_get_rip
306   :project: Project ACRN
307
308.. doxygenfunction:: vcpu_set_rip
309   :project: Project ACRN
310
311.. doxygenfunction:: vcpu_get_rsp
312   :project: Project ACRN
313
314.. doxygenfunction:: vcpu_set_rsp
315   :project: Project ACRN
316
317.. doxygenfunction:: vcpu_get_efer
318   :project: Project ACRN
319
320.. doxygenfunction:: vcpu_set_efer
321   :project: Project ACRN
322
323.. doxygenfunction:: vcpu_get_rflags
324   :project: Project ACRN
325
326.. doxygenfunction:: vcpu_set_rflags
327   :project: Project ACRN
328
329.. doxygenfunction:: vcpu_get_cr0
330   :project: Project ACRN
331
332.. doxygenfunction:: vcpu_set_cr0
333   :project: Project ACRN
334
335.. doxygenfunction:: vcpu_get_cr2
336   :project: Project ACRN
337
338.. doxygenfunction:: vcpu_set_cr2
339   :project: Project ACRN
340
341.. doxygenfunction:: vcpu_get_cr4
342   :project: Project ACRN
343
344.. doxygenfunction:: vcpu_set_cr4
345   :project: Project ACRN
346
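The on-demand cache/update category can be illustrated with a toy model. This
is a sketch under stated assumptions, not ACRN's implementation: the VMCS is
replaced by a plain array, and ``struct toy_ctx`` and the ``toy_*`` functions
are hypothetical names.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   #define TOY_REG_RIP 0U
   #define TOY_REG_RSP 1U
   #define TOY_REG_NUM 2U

   struct toy_ctx {
      uint64_t vmcs[TOY_REG_NUM];  /* stand-in for VMCS guest fields */
      uint64_t cache[TOY_REG_NUM]; /* software copy of the registers */
      uint32_t cached;             /* bit n set: cache[n] is valid */
      uint32_t dirty;              /* bit n set: cache[n] must be written back */
      uint32_t vmcs_reads;         /* counts simulated VMCS reads */
   };

   /* Read a register, touching the VMCS only on first access after a VM exit. */
   static uint64_t toy_get_reg(struct toy_ctx *ctx, uint32_t reg)
   {
      assert(reg < TOY_REG_NUM);
      if ((ctx->cached & (1U << reg)) == 0U) {
         ctx->cache[reg] = ctx->vmcs[reg];
         ctx->vmcs_reads++;
         ctx->cached |= (1U << reg);
      }
      return ctx->cache[reg];
   }

   /* Write a register into the cache and mark it dirty. */
   static void toy_set_reg(struct toy_ctx *ctx, uint32_t reg, uint64_t val)
   {
      assert(reg < TOY_REG_NUM);
      ctx->cache[reg] = val;
      ctx->cached |= (1U << reg);
      ctx->dirty |= (1U << reg);
   }

   /* Before VM entry: write dirty registers back to the VMCS. */
   static void toy_vm_entry(struct toy_ctx *ctx)
   {
      for (uint32_t reg = 0U; reg < TOY_REG_NUM; reg++) {
         if ((ctx->dirty & (1U << reg)) != 0U) {
            ctx->vmcs[reg] = ctx->cache[reg];
         }
      }
      ctx->dirty = 0U;
   }

   /* After VM exit: the guest may have changed registers, so drop the cache. */
   static void toy_vm_exit(struct toy_ctx *ctx)
   {
      ctx->cached = 0U;
   }

Repeated reads between a VM exit and the next VM entry cost one VMCS access
per register, and registers that were never written are not written back.
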
347
348VM Exit Handlers
349================
350
351ACRN implements its VM exit handlers with a static table. Except for the
352exit reasons listed below, a default *unhandled_vmexit_handler* is used
353that will trigger an error message and return without handling:
354
355.. list-table::
356   :widths: 33 33 33
357   :header-rows: 1
358
359   * - **VM Exit Reason**
360     - **Handler**
361     - **Description**
362
363   * - VMX_EXIT_REASON_EXCEPTION_OR_NMI
364     - exception_vmexit_handler
365     - Only trap #MC, print error then inject back to guest
366
367   * - VMX_EXIT_REASON_EXTERNAL_INTERRUPT
368     - external_interrupt_vmexit_handler
369     - External interrupt handler for physical interrupt happening in non-root mode
370
371   * - VMX_EXIT_REASON_TRIPLE_FAULT
372     - triple_fault_vmexit_handler
373     - Handle triple fault from vCPU
374
375   * - VMX_EXIT_REASON_INIT_SIGNAL
376     - init_signal_vmexit_handler
377     - Handle INIT signal from vCPU
378
379   * - VMX_EXIT_REASON_INTERRUPT_WINDOW
380     - interrupt_window_vmexit_handler
381     - To support interrupt window if VID is disabled
382
383   * - VMX_EXIT_REASON_CPUID
384     - cpuid_vmexit_handler
385     - Handle CPUID access from guest
386
387   * - VMX_EXIT_REASON_VMCALL
388     - vmcall_vmexit_handler
389     - Handle hypercall from guest
390
391   * - VMX_EXIT_REASON_CR_ACCESS
392     - cr_access_vmexit_handler
393     - Handle CR registers access from guest
394
395   * - VMX_EXIT_REASON_IO_INSTRUCTION
396     - pio_instr_vmexit_handler
397     - Emulate I/O access with range in IO_BITMAP,
398       which may have a handler in hypervisor (such as vUART or vPIC),
399       or need to create an I/O request to DM
400
401   * - VMX_EXIT_REASON_RDMSR
402     - rdmsr_vmexit_handler
403     - Read MSR from guest in MSR_BITMAP
404
405   * - VMX_EXIT_REASON_WRMSR
406     - wrmsr_vmexit_handler
407     - Write MSR from guest in MSR_BITMAP
408
409   * - VMX_EXIT_REASON_APIC_ACCESS
410     - apic_access_vmexit_handler
411     - APIC access for APICv
412
413   * - VMX_EXIT_REASON_VIRTUALIZED_EOI
414     - veoi_vmexit_handler
415     - Trap vLAPIC EOI for specific vector with level trigger mode
416       in vIOAPIC, required for supporting PTdev
417
418   * - VMX_EXIT_REASON_EPT_VIOLATION
419     - ept_violation_vmexit_handler
420     - MMIO emulation, which may have handler in hypervisor
421       (such as vLAPIC or vIOAPIC), or need to create an I/O
422       request to DM
423
424   * - VMX_EXIT_REASON_XSETBV
425     - xsetbv_vmexit_handler
426     - Set host owned XCR0 for supporting xsave
427
428   * - VMX_EXIT_REASON_APIC_WRITE
429     - apic_write_vmexit_handler
430     - APIC write for APICv
431
432
433Details of each VM exit reason handler are described in other sections.
434
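A static dispatch table with a default handler can be sketched as follows.
This is an illustration, not ACRN's table: the ``toy_*`` names and the table
size are hypothetical, and an empty slot stands for the default handler here.
The two exit reason numbers are the basic exit reasons for CPUID and VMCALL
defined in the Intel SDM.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdint.h>
   #include <stdio.h>

   #define TOY_EXIT_REASON_CPUID  10U /* basic exit reason for CPUID */
   #define TOY_EXIT_REASON_VMCALL 18U /* basic exit reason for VMCALL */
   #define TOY_NR_EXIT_REASONS    64U /* table size chosen for this sketch */

   typedef int32_t (*toy_exit_handler_t)(uint32_t reason);

   static uint32_t toy_cpuid_exits;
   static uint32_t toy_vmcall_exits;
   static uint32_t toy_unhandled_exits;

   static int32_t toy_cpuid_handler(uint32_t reason)
   {
      (void)reason;
      toy_cpuid_exits++;
      return 0;
   }

   static int32_t toy_vmcall_handler(uint32_t reason)
   {
      (void)reason;
      toy_vmcall_exits++;
      return 0;
   }

   /* Default handler: report the exit and return without handling it. */
   static int32_t toy_unhandled_handler(uint32_t reason)
   {
      fprintf(stderr, "unhandled VM exit reason %u\n", (unsigned int)reason);
      toy_unhandled_exits++;
      return 0;
   }

   /* Slots that are not listed stay NULL and fall back to the default. */
   static const toy_exit_handler_t toy_dispatch_table[TOY_NR_EXIT_REASONS] = {
      [TOY_EXIT_REASON_CPUID]  = toy_cpuid_handler,
      [TOY_EXIT_REASON_VMCALL] = toy_vmcall_handler,
   };

   static int32_t toy_dispatch(uint32_t reason)
   {
      toy_exit_handler_t handler = NULL;

      if (reason < TOY_NR_EXIT_REASONS) {
         handler = toy_dispatch_table[reason];
      }
      if (handler == NULL) {
         handler = toy_unhandled_handler;
      }
      assert(handler != NULL);
      return handler(reason);
   }

Indexing by exit reason keeps dispatch constant-time, and the bounds check
routes out-of-range reasons to the same default handler.
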
435.. _pending-request-handlers:
436
437Pending Request Handlers
438========================
439
440ACRN uses the function *acrn_handle_pending_request* to handle
441requests before VM entry in *vcpu_thread*.
442
443A bitmap in the vCPU structure lists the different requests:
444
445.. code-block:: c
446
447   #define ACRN_REQUEST_EXCP 0U
448   #define ACRN_REQUEST_EVENT 1U
449   #define ACRN_REQUEST_EXTINT 2U
450   #define ACRN_REQUEST_NMI 3U
451   #define ACRN_REQUEST_EOI_EXIT_BITMAP_UPDATE 4U
452   #define ACRN_REQUEST_EPT_FLUSH 5U
453   #define ACRN_REQUEST_TRP_FAULT 6U
454   #define ACRN_REQUEST_VPID_FLUSH 7U /* flush vpid tlb */
455
456
457ACRN provides the function *vcpu_make_request* to make different
458requests, set the bitmap of the corresponding request, and notify the target
459vCPU through the IPI if necessary (when the target vCPU is not
460running). See :ref:`vcpu-request-interrupt-injection` for details.
461
462.. code-block:: c
463
464   void vcpu_make_request(struct vcpu *vcpu, uint16_t eventid)
465   {
466      uint16_t pcpu_id = pcpuid_from_vcpu(vcpu);
467
468      bitmap_set_lock(eventid, &vcpu->arch_vcpu.pending_req);
469      /*
470       * if current hostcpu is not the target vcpu's hostcpu, we need
471       * to invoke IPI to wake up target vcpu
472       *
473       * TODO: Here we just compare with cpuid, since cpuid is
474       *  global under pCPU / vCPU 1:1 mapping. If later we enabled vcpu
475       *  scheduling, we need change here to determine it target vcpu is
476       *  VMX non-root or root mode
477       */
478      if (get_cpu_id() != pcpu_id) {
479              send_single_ipi(pcpu_id, VECTOR_NOTIFY_VCPU);
480      }
481   }
482
483The function *acrn_handle_pending_request* handles each
484request as shown below.
485
486
487.. list-table::
488   :widths: 25 25 25 25
489   :header-rows: 1
490
491   * - **Request**
492     - **Description**
493     - **Request Maker**
494     - **Request Handler**
495
496   * - ACRN_REQUEST_EXCP
497     - Request for exception injection
498     - vcpu_inject_gp, vcpu_inject_pf, vcpu_inject_ud, vcpu_inject_ac,
499       or vcpu_inject_ss and then queue corresponding exception by
500       vcpu_queue_exception
501     - vcpu_inject_hi_exception, vcpu_inject_lo_exception based
502       on exception priority
503
504   * - ACRN_REQUEST_EVENT
505     - Request for vLAPIC interrupt vector injection
506     - vlapic_fire_lvt or vlapic_set_intr, which could be triggered
507       by vlapic lvt, vioapic, or vmsi
508     - vcpu_do_pending_event
509
510   * - ACRN_REQUEST_EXTINT
511     - Request for extint vector injection
512     - vcpu_inject_extint, triggered by vPIC
513     - vcpu_do_pending_extint
514
515   * - ACRN_REQUEST_NMI
516     - Request for nmi injection
517     - vcpu_inject_nmi
518     - Program VMX_ENTRY_INT_INFO_FIELD directly
519
520   * - ACRN_REQUEST_EOI_EXIT_BITMAP_UPDATE
521     - Request for VEOI bitmap update for level triggered vector
     - vlapic_reset_tmr or vlapic_set_tmr change trigger mode in RTE
523     - vcpu_set_vmcs_eoi_exit
524
525   * - ACRN_REQUEST_EPT_FLUSH
526     - Request for EPT flush
527     - ept_add_mr, ept_modify_mr, ept_del_mr, or vmx_write_cr0 disable cache
528     - invept
529
530   * - ACRN_REQUEST_TRP_FAULT
531     - Request for handling triple fault
     - vcpu_queue_exception when it detects a triple fault
533     - fatal error
534
535   * - ACRN_REQUEST_VPID_FLUSH
536     - Request for VPID flush
537     - None
538     - flush_vpid_single
539
.. note:: Refer to the interrupt management chapter for the request
   handling order for exceptions, NMIs, and interrupts. For other requests
   such as a TMR update or an EPT flush, there is no mandatory order.
543
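The request bitmap pattern described in this section can be sketched with C11
atomics. This is a simplified stand-in, not ACRN's code: ``struct toy_vcpu``
and the ``toy_*`` functions are hypothetical, the IPI notification is omitted,
and the handling order shown is arbitrary.

.. code-block:: c

   #include <assert.h>
   #include <stdatomic.h>
   #include <stdbool.h>
   #include <stdint.h>

   #define ACRN_REQUEST_EXCP      0U
   #define ACRN_REQUEST_EVENT     1U
   #define ACRN_REQUEST_EPT_FLUSH 5U

   struct toy_vcpu {
      _Atomic uint64_t pending_req; /* one bit per request type */
   };

   /* Request maker side: atomically set the bit for this request. */
   static void toy_make_request(struct toy_vcpu *vcpu, uint16_t eventid)
   {
      assert(eventid < 64U);
      atomic_fetch_or(&vcpu->pending_req, (uint64_t)1 << eventid);
   }

   /* Handler side: atomically clear the bit and report whether it was set. */
   static bool toy_test_and_clear(struct toy_vcpu *vcpu, uint16_t eventid)
   {
      uint64_t mask;
      uint64_t old;

      assert(eventid < 64U);
      mask = (uint64_t)1 << eventid;
      old = atomic_fetch_and(&vcpu->pending_req, ~mask);
      return (old & mask) != 0U;
   }

   /* Called before VM entry; returns the number of requests handled. */
   static uint32_t toy_handle_pending(struct toy_vcpu *vcpu)
   {
      uint32_t handled = 0U;

      if (toy_test_and_clear(vcpu, ACRN_REQUEST_EPT_FLUSH)) {
         handled++; /* a real handler would flush EPT here */
      }
      if (toy_test_and_clear(vcpu, ACRN_REQUEST_EXCP)) {
         handled++; /* a real handler would inject the exception here */
      }
      if (toy_test_and_clear(vcpu, ACRN_REQUEST_EVENT)) {
         handled++; /* a real handler would inject the interrupt here */
      }
      return handled;
   }

Because each bit is cleared with an atomic read-modify-write, a request made
on another CPU while the handler runs is either handled now or stays pending
for the next pass.
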
544VMX Initialization
545******************
546
547ACRN attempts to initialize the vCPU's VMCS before its first
548launch. ACRN sets the host state, execution control, guest state,
549entry control, and exit control, as shown in the table below.
550
551The table briefly shows how each field is configured.
552The guest state field is critical for running a guest CPU
553based on different CPU modes.
554
555For a guest vCPU's state initialization:
556
557-  If it's BSP, the guest state configuration is done in software load,
558   which can be initialized by different objects:
559
560   -  Service VM BSP: Hypervisor does context initialization in different
561      software load based on different boot mode
562
563   -  User VM BSP: DM context initialization through hypercall
564
565-  If it's AP, it always starts from real mode, and the start
566   vector always comes from vLAPIC INIT-SIPI emulation.
567
568.. doxygenstruct:: acrn_regs
569   :project: Project ACRN
570
571.. list-table::
572   :widths: 20 40 10 30
573   :header-rows: 1
574
575   * - **VMX Domain**
576     - **Fields**
577     - **Bits**
578     - **Description**
579
580   * - **host state**
581     - CS, DS, ES, FS, GS, TR, LDTR, GDTR, IDTR
582     - n/a
583     - According to host
584
585   * -
586     - MSR_IA32_PAT, MSR_IA32_EFER
587     - n/a
588     - According to host
589
590   * -
591     - CR0, CR3, CR4
592     - n/a
593     - According to host
594
595   * -
596     - RIP
597     - n/a
598     - Set to vm_exit pointer
599
600   * -
601     - IA32_SYSENTER_CS/ESP/EIP
602     - n/a
603     - Set to 0
604
605   * - **execution control**
606     - VMX_PIN_VM_EXEC_CONTROLS
607     - 0
608     - Enable external-interrupt exiting
609
610   * -
611     -
612     - 7
613     - Enable posted interrupts
614
615   * -
616     - VMX_PROC_VM_EXEC_CONTROLS
617     - 3
618     - Use TSC offsetting
619
620   * -
621     -
622     - 21
623     - Use TPR shadow
624
625   * -
626     -
627     - 25
628     - Use I/O bitmaps
629
630   * -
631     -
632     - 28
633     - Use MSR bitmaps
634
635   * -
636     -
637     - 31
638     - Activate secondary controls
639
640   * -
641     - VMX_PROC_VM_EXEC_CONTROLS2
642     - 0
643     - Virtualize APIC accesses
644
645   * -
646     -
647     - 1
648     - Enable EPT
649
650   * -
651     -
652     - 3
653     - Enable RDTSCP
654
655   * -
656     -
657     - 5
658     - Enable VPID
659
660   * -
661     -
662     - 7
663     - Unrestricted guest
664
665   * -
666     -
667     - 8
668     - APIC-register virtualization
669
670   * -
671     -
672     - 9
673     - Virtual-interrupt delivery
674
675   * -
676     -
677     - 20
678     - Enable XSAVES/XRSTORS
679
680   * - **guest state**
681     - CS, DS, ES, FS, GS, TR, LDTR, GDTR, IDTR
682     - n/a
683     - According to vCPU mode and init_ctx
684
685   * -
686     - RIP, RSP
687     - n/a
688     - According to vCPU mode and init_ctx
689
690   * -
691     - CR0, CR3, CR4
692     - n/a
693     - According to vCPU mode and init_ctx
694
695   * -
696     - GUEST_IA32_SYSENTER_CS/ESP/EIP
697     - n/a
698     - Set to 0
699
700   * -
701     - GUEST_IA32_PAT
702     - n/a
703     - Set to PAT_POWER_ON_VALUE
704
705   * - **entry control**
706     - VMX_ENTRY_CONTROLS
707     - 2
708     - Load debug controls
709
710   * -
711     -
712     - 14
713     - Load IA32_PAT
714
715   * -
716     -
717     - 15
     - Load IA32_EFER
719
720   * - **exit control**
721     - VMX_EXIT_CONTROLS
722     - 2
723     - Save debug controls
724
725   * -
726     -
727     - 9
728     - Host address space size
729
730   * -
731     -
732     - 15
733     - Acknowledge Interrupt on exit
734
735   * -
736     -
737     - 18
738     - Save IA32_PAT
739
740   * -
741     -
742     - 19
743     - Load IA32_PAT
744
745   * -
746     -
747     - 20
748     - Save IA32_EFER
749
750   * -
751     -
752     - 21
753     - Load IA32_EFER
754
755
756CPUID Virtualization
757********************
758
A CPUID instruction executed by a guest in VMX non-root operation
unconditionally causes a VM exit. ACRN must return the emulated processor
identification and feature information in the EAX, EBX, ECX, and EDX
registers.
763
764To simplify, ACRN returns the same values from the physical CPU for most
765of the CPUID, and specially handles a few CPUID features that are APIC
766ID related such as CPUID.01H.
767
768ACRN emulates some extra CPUID features for the hypervisor as well.
769
770The per-vm *vcpuid_entries* array is initialized during VM creation
771and used to cache most of the CPUID entries for each VM.  During guest
772CPUID emulation, ACRN reads the cached value from this array, except
773some APIC ID-related CPUID data emulated at runtime.
774
775This table describes details for CPUID emulation:
776
777.. list-table::
778   :widths: 20 80
779   :header-rows: 1
780
781
782   * - **CPUID**
783     - **Emulation Description**
784
785   * - 01H
786     - - Get original value from physical CPUID
787       - Fill APIC ID from vLAPIC
788       - Disable x2APIC
789       - Disable PCID
790       - Disable VMX
       - Disable XSAVE if the host has not enabled it
792
793   * - 0BH
794     - - Fill according to X2APIC feature support (default is disabled)
795       - If not supported, fill all registers with 0
796       - If supported, get from physical CPUID
797
798   * - 0DH
799     - - Fill according to XSAVE feature support
800       - If not supported, fill all registers with 0
801       - If supported, get from physical CPUID
802
803   * - 07H
804     - - Get from per-vm CPUID entries cache
       - For subleaf 0, disable INVPCID and Intel RDT
806
807   * - 16H
808     - - Get from per-vm CPUID entries cache
809       - If physical CPU supports CPUID.16H, read from physical CPUID
810       - If physical CPU does not support it, emulate with TSC frequency
811
812   * - 40000000H
813     - - Get from per-vm CPUID entries cache
       - EAX: the maximum input value for CPUID supported by ACRN (40000010H)
815       - EBX, ECX, EDX: hypervisor vendor ID signature - "ACRNACRNACRN"
816
817   * - 40000010H
818     - - Get from per-vm CPUID entries cache
819       - EAX: virtual TSC frequency in kHz
820       - EBX, ECX, EDX: reserved to 0
821
822   * - 0AH
823     - - PMU disabled
824
825   * - 0FH, 10H
826     - - Intel RDT disabled
827
828   * - 12H
829     - - Fill according to SGX virtualization
830
831   * - 14H
832     - - Intel Processor Trace disabled
833
834   * - Others
835     - - Get from per-vm CPUID entries cache
836
.. note:: ACRN needs to take care of
   some CPUID values that can change at runtime. For example, the XD feature in
   CPUID.80000001H may be cleared by the MISC_ENABLE MSR.
840
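The CPUID.01H row of the table can be sketched as a transformation of the
physical CPUID output. This is a minimal illustration, not ACRN's code:
``struct toy_cpuid_regs`` and ``toy_emulate_cpuid_01h`` are hypothetical
names, and the conditional XSAVE handling is left out. The bit positions are
those defined for CPUID.01H in the Intel SDM.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   #define CPUID01_ECX_VMX          (1U << 5)
   #define CPUID01_ECX_PCID         (1U << 17)
   #define CPUID01_ECX_X2APIC       (1U << 21)
   #define CPUID01_EBX_APICID_SHIFT 24U

   struct toy_cpuid_regs {
      uint32_t eax;
      uint32_t ebx;
      uint32_t ecx;
      uint32_t edx;
   };

   /* Start from the physical values, then patch the guest-visible fields. */
   static struct toy_cpuid_regs toy_emulate_cpuid_01h(struct toy_cpuid_regs phys,
                                                      uint32_t vlapic_id)
   {
      struct toy_cpuid_regs guest = phys;

      /* With x2APIC hidden, the guest sees an 8-bit APIC ID in EBX[31:24]. */
      assert(vlapic_id <= 0xFFU);
      guest.ebx &= ~(0xFFU << CPUID01_EBX_APICID_SHIFT);
      guest.ebx |= vlapic_id << CPUID01_EBX_APICID_SHIFT;

      /* Hide x2APIC, PCID, and VMX from the guest. */
      guest.ecx &= ~(CPUID01_ECX_X2APIC | CPUID01_ECX_PCID | CPUID01_ECX_VMX);

      return guest;
   }

All other bits pass through unchanged, which matches the approach of returning
physical values for most of the leaf.
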
841
842MSR Virtualization
843******************
844
845ACRN always enables an MSR bitmap in the *VMX_PROC_VM_EXEC_CONTROLS* VMX
846execution control field. This bitmap marks the MSRs to cause a VM
847exit upon guest access for both read and write. The VM
848exit reason for reading or writing these MSRs is respectively
849*VMX_EXIT_REASON_RDMSR* or *VMX_EXIT_REASON_WRMSR* and the VM exit
850handler is *rdmsr_vmexit_handler* or *wrmsr_vmexit_handler*.
851
This table shows the predefined MSRs that ACRN will trap for all the guests. For
the MSRs whose bitmap values are not set in the MSR bitmap, guest access will be
passed through directly:
855
856.. list-table::
857   :widths: 33 33 33
858   :header-rows: 1
859
860   * - **MSR**
861     - **Description**
862     - **Handler**
863
864   * - MSR_IA32_TSC_ADJUST
865     - TSC adjustment of local APIC's TSC deadline mode
866     - Emulates with vLAPIC
867
868   * - MSR_IA32_TSC_DEADLINE
869     - TSC target of local APIC's TSC deadline mode
870     - Emulates with vLAPIC
871
872   * - MSR_IA32_BIOS_UPDT_TRIG
873     - BIOS update trigger
     - Update microcode from the Service VM. The signature ID is read from
       the physical MSR, and a BIOS update trigger from the Service VM triggers
       a physical microcode update.
877
878   * - MSR_IA32_BIOS_SIGN_ID
879     - BIOS update signature ID
880     - \"
881
882   * - MSR_IA32_TIME_STAMP_COUNTER
883     - Time-stamp counter
884     - Work with VMX_TSC_OFFSET_FULL to emulate virtual TSC
885
886   * - MSR_IA32_APIC_BASE
887     - APIC base address
888     - Emulates with vLAPIC
889
890   * - MSR_IA32_PAT
891     - Page-attribute table
892     - Save/restore in vCPU, write to VMX_GUEST_IA32_PAT_FULL if cr0.cd is 0
893
894   * - MSR_IA32_PERF_CTL
895     - Performance control
896     - Trigger real P-state change if P-state is valid when writing,
897       fetch physical MSR when reading
898
899   * - MSR_IA32_FEATURE_CONTROL
900     - Feature control bits that configure operation of VMX and SMX
901     - Disabled, locked
902
903   * - MSR_IA32_MCG_CAP/STATUS
904     - Machine-Check global control/status
905     - Emulates with vMCE
906
907   * - MSR_IA32_MISC_ENABLE
908     - Miscellaneous feature control
909     - Read-only, except MONITOR/MWAIT enable bit
910
911   * - MSR_IA32_SGXLEPUBKEYHASH0/1/2/3
912     - SHA256 digest of the authorized launch enclaves
913     - Emulates with vSGX
914
915   * - MSR_IA32_SGX_SVN_STATUS
916     - Status and SVN threshold of SGX support for ACM
917     - Read-only, emulates with vSGX
918
919   * - MSR_IA32_MTRR_CAP
920     - Memory type range register related
921     - Handled by MTRR emulation
922
923   * - MSR_IA32_MTRR_DEF_TYPE
924     - \"
925     - \"
926
927   * - MSR_IA32_MTRR_PHYSBASE_0~9
928     - \"
929     - \"
930
931   * - MSR_IA32_MTRR_FIX64K_00000
932     - \"
933     - \"
934
935   * - MSR_IA32_MTRR_FIX16K_80000/A0000
936     - \"
937     - \"
938
939   * - MSR_IA32_MTRR_FIX4K_C0000~F8000
940     - \"
941     - \"
942
943   * - MSR_IA32_X2APIC_*
944     - x2APIC related MSRs (offset from 0x800 to 0x900)
945     - Emulates with vLAPIC
946
947   * - MSR_IA32_L2_MASK_BASE~n
948     - L2 CAT mask for CLOSn
949     - Disabled for guest access
950
951   * - MSR_IA32_L3_MASK_BASE~n
952     - L3 CAT mask for CLOSn
953     - Disabled for guest access
954
955   * - MSR_IA32_MBA_MASK_BASE~n
956     - MBA delay mask for CLOSn
957     - Disabled for guest access
958
959   * - MSR_IA32_VMX_BASIC~VMX_TRUE_ENTRY_CTLS
960     - VMX related MSRs
961     - Not supported, access will inject #GP
962
963
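How a bit in the MSR bitmap selects an MSR for trapping can be sketched as
follows. The 4 KB layout (read-low, read-high, write-low, and write-high
regions of 1 KB each) follows the Intel SDM. The function name
``toy_msr_intercept`` and the mode flags are hypothetical and do not reflect
ACRN's API.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   #define TOY_MSR_BITMAP_SIZE 4096U
   #define TOY_INTERCEPT_READ  1U
   #define TOY_INTERCEPT_WRITE 2U

   /*
    * Mark an MSR as trapped for reads, writes, or both.
    * Returns false for MSRs outside the two ranges the bitmap covers;
    * accesses to those MSRs cause VM exits regardless of the bitmap.
    */
   static bool toy_msr_intercept(uint8_t *bitmap, uint32_t msr, uint32_t mode)
   {
      uint32_t base;
      uint32_t idx;
      uint8_t bit;

      assert(bitmap != NULL);
      if (msr <= 0x1FFFU) {
         base = 0U;    /* low MSRs: 0x00000000 to 0x00001FFF */
      } else if ((msr >= 0xC0000000U) && (msr <= 0xC0001FFFU)) {
         base = 1024U; /* high MSRs: 0xC0000000 to 0xC0001FFF */
      } else {
         return false;
      }

      idx = msr & 0x1FFFU;
      bit = (uint8_t)(1U << (idx & 7U));
      if ((mode & TOY_INTERCEPT_READ) != 0U) {
         bitmap[base + (idx >> 3U)] |= bit;         /* read bitmaps: bytes 0-2047 */
      }
      if ((mode & TOY_INTERCEPT_WRITE) != 0U) {
         bitmap[base + 2048U + (idx >> 3U)] |= bit; /* write bitmaps: bytes 2048-4095 */
      }
      return true;
   }

For example, trapping reads and writes of the time-stamp counter MSR (0x10)
sets bit 0 of byte 2 in both the read-low and write-low regions.
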
964CR Virtualization
965*****************
966
967ACRN emulates ``mov to cr0``, ``mov to cr4``, ``mov to cr8``, and ``mov
968from cr8`` through *cr_access_vmexit_handler* based on
969*VMX_EXIT_REASON_CR_ACCESS*.
970
.. note::  ``mov to cr8`` and ``mov from cr8`` do not actually
   cause VM exits because the ``CR8-load/store exiting`` bits are set to 0 in
   *VMX_PROC_VM_EXEC_CONTROLS*.
974
975A VM can ``mov from cr0`` and ``mov from
976cr4`` without triggering a VM exit. The values read are the read shadows
977of the corresponding register in VMCS. The shadows are updated by the
978hypervisor on CR writes.
979
980.. list-table::
981   :widths: 30 70
982   :header-rows: 1
983
984   * - **Operation**
985     - **Handler**
986
987   * - mov to cr0
988     - Based on vCPU set context API: vcpu_set_cr0 -> vmx_write_cr0
989
990   * - mov to cr4
991     - Based on vCPU set context API: vcpu_set_cr4 -> vmx_write_cr4
992
993   * - mov to cr8
994     - Based on vLAPIC tpr API: vlapic_set_cr8 -> vlapic_set_tpr
995
996   * - mov from cr8
997     - Based on vLAPIC tpr API: vlapic_get_cr8 -> vlapic_get_tpr
998
999
1000For ``mov to cr0`` and ``mov to cr4``, ACRN sets
1001*cr0_host_mask/cr4_host_mask* into *VMX_CR0_MASK/VMX_CR4_MASK*
1002for the bitmask causing VM exit.
1003
1004As ACRN always enables ``unrestricted guest`` in
1005*VMX_PROC_VM_EXEC_CONTROLS2*, *CR0.PE* and *CR0.PG* can be
1006controlled by the guest.
1007
1008.. list-table::
1009   :widths: 20 40 40
1010   :header-rows: 1
1011
1012   * - **CR0 MASK**
1013     - **Value**
1014     - **Comments**
1015
1016   * - cr0_always_on_mask
1017     - fixed0 & (~(CR0_PE | CR0_PG))
1018     - fixed0 comes from MSR_IA32_VMX_CR0_FIXED0, these bits
1019       are fixed to be 1 under VMX operation.
1020
1021   * - cr0_always_off_mask
1022     - ~fixed1
1023     - ~fixed1 comes from MSR_IA32_VMX_CR0_FIXED1, these bits
1024       are fixed to be 0 under VMX operation.
1025
1026   * - CR0_TRAP_MASK
1027     - CR0_PE | CR0_PG | CR0_WP | CR0_CD | CR0_NW
     - ACRN will also trap PE, PG, WP, CD, and NW bits.
1029
1030   * - cr0_host_mask
1031     - ~(fixed0 ^ fixed1) | CR0_TRAP_MASK
1032     - ACRN will finally trap bits under VMX root mode control plus
1033       additionally added bits.
1034
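The CR0 mask formulas in the table above can be written out directly. This is
a sketch rather than ACRN's code: ``struct toy_cr0_masks`` and
``toy_compute_cr0_masks`` are hypothetical names, and the caller is assumed to
supply the two fixed-bit MSR values. The CR0 bit positions are architectural.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   #define CR0_PE UINT64_C(0x00000001) /* bit 0 */
   #define CR0_WP UINT64_C(0x00010000) /* bit 16 */
   #define CR0_NW UINT64_C(0x20000000) /* bit 29 */
   #define CR0_CD UINT64_C(0x40000000) /* bit 30 */
   #define CR0_PG UINT64_C(0x80000000) /* bit 31 */

   #define CR0_TRAP_MASK (CR0_PE | CR0_PG | CR0_WP | CR0_CD | CR0_NW)

   struct toy_cr0_masks {
      uint64_t always_on;
      uint64_t always_off;
      uint64_t host_mask;
   };

   /* fixed0 and fixed1 are the values of the two CR0 fixed-bit VMX MSRs. */
   static struct toy_cr0_masks toy_compute_cr0_masks(uint64_t fixed0,
                                                     uint64_t fixed1)
   {
      struct toy_cr0_masks m;

      /* A bit that must be 1 cannot also be a bit that must be 0. */
      assert((fixed0 & ~fixed1) == 0U);

      /* Unrestricted guest lets the guest control PE and PG. */
      m.always_on = fixed0 & ~(CR0_PE | CR0_PG);
      m.always_off = ~fixed1;
      /* Bits equal in fixed0 and fixed1 are not flexible, so trap them too. */
      m.host_mask = ~(fixed0 ^ fixed1) | CR0_TRAP_MASK;
      return m;
   }

With the illustrative inputs ``fixed0 = 0x80000021`` and
``fixed1 = 0xFFFFFFFF`` (example values, not read from hardware), only NE
remains always on, the upper 32 bits are always off, and the host mask covers
the fixed bits plus the trapped ones.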
1035
For ``mov to cr0`` emulation, ACRN handles a paging mode change when the
PG bit changes, and a cache mode change when the CD or NW bit changes.
ACRN also rejects illegal guest writes to CR0 (for example, setting PG
while CR4.PAE = 0 and IA32_EFER.LME = 1) by injecting #GP to the guest.
Finally, ACRN updates *VMX_CR0_READ_SHADOW*, which supplies the values
the guest reads for host-controlled bits, and *VMX_GUEST_CR0*, which
holds the CR0 value actually in effect while the guest runs.
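
The validity checks described above can be sketched as follows. This is a
minimal illustration, not ACRN's implementation: the function name
``cr0_write_is_valid`` and its parameters are hypothetical, and only three of
the architectural #GP conditions are shown.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define CR0_NW   (1ULL << 29)
   #define CR0_CD   (1ULL << 30)
   #define CR0_PG   (1ULL << 31)
   #define CR4_PAE  (1ULL << 5)
   #define EFER_LME (1ULL << 8)

   /*
    * Hypothetical check: returns false when the guest's new CR0 value
    * should result in #GP instead of being applied.
    */
   static bool cr0_write_is_valid(uint64_t new_cr0, uint64_t cr4,
                                  uint64_t efer, uint64_t always_off_mask)
   {
       /* Bits that must stay 0 in VMX operation. */
       if ((new_cr0 & always_off_mask) != 0ULL) {
           return false;
       }
       /* Enabling paging with EFER.LME = 1 requires CR4.PAE = 1. */
       if (((new_cr0 & CR0_PG) != 0ULL) && ((efer & EFER_LME) != 0ULL) &&
           ((cr4 & CR4_PAE) == 0ULL)) {
           return false;
       }
       /* CD = 0 with NW = 1 is an invalid cache-mode combination. */
       if (((new_cr0 & CR0_CD) == 0ULL) && ((new_cr0 & CR0_NW) != 0ULL)) {
           return false;
       }
       return true;
   }

A handler built on such a check would inject #GP when it returns false, and
otherwise update the read shadow and the guest CR0 field.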
1044
1045.. list-table::
1046   :widths: 20 40 40
1047   :header-rows: 1
1048
1049   * - **CR4 MASK**
1050     - **Value**
1051     - **Comments**
1052
   * - cr4_always_on_mask
     - fixed0
     - fixed0 comes from MSR_IA32_VMX_CR4_FIXED0; these bits
       are fixed to 1 in VMX operation.

   * - cr4_always_off_mask
     - ~fixed1
     - fixed1 comes from MSR_IA32_VMX_CR4_FIXED1; the bits set in
       ~fixed1 are fixed to 0 in VMX operation.

   * - CR4_TRAP_MASK
     - CR4_PSE | CR4_PAE | CR4_VMXE | CR4_PCIDE | CR4_SMEP | CR4_SMAP | CR4_PKE
     - ACRN also traps the PSE, PAE, VMXE, PCIDE, SMEP, SMAP, and PKE bits.

   * - cr4_host_mask
     - ~(fixed0 ^ fixed1) | CR4_TRAP_MASK
     - The final mask traps the bits fixed by VMX operation (where
       fixed0 and fixed1 agree) plus the bits in CR4_TRAP_MASK.
1071
1072
The ``mov to cr4`` emulation is similar to the ``mov to cr0`` emulation
described above.
1074
1075.. _io-mmio-emulation:
1076
1077IO/MMIO Emulation
1078*****************
1079
ACRN always enables an I/O bitmap in *VMX_PROC_VM_EXEC_CONTROLS* and EPT
in *VMX_PROC_VM_EXEC_CONTROLS2*. With these controls enabled,
*pio_instr_vmexit_handler* and *ept_violation_vmexit_handler* handle
I/O and MMIO emulation for an emulated device. The device can
be emulated by the hypervisor or by the DM in the Service VM.
1085
For a device emulated by the hypervisor, ACRN provides basic APIs to
register the device's I/O or MMIO range. How a handler takes effect depends
on the default trapping behavior of each VM type:

-  For the Service VM, the default I/O bitmap values are all set to 0, which
   means the Service VM passes through all I/O port accesses by default.
   Adding an I/O handler for a hypervisor emulated device requires first
   setting the corresponding I/O bitmap bits to 1.

-  For the User VM, the default I/O bitmap values are all set to 1, which means
   the User VM traps all I/O port accesses by default. Adding an I/O handler
   for a hypervisor emulated device does not require changing the I/O bitmap.
   If a trapped I/O port access does not fall into the range of a hypervisor
   emulated device, the hypervisor creates an I/O request and passes it to
   the Service VM DM.

-  For the Service VM, EPT maps the entire range of memory to the Service VM
   except for the ACRN hypervisor area, so the Service VM passes through all
   MMIO accesses by default. Adding an MMIO handler for a hypervisor emulated
   device requires first removing its MMIO range from the EPT mapping.

-  For the User VM, EPT maps only its system RAM to the User VM, which means
   the User VM traps all MMIO accesses by default. Adding an MMIO handler for a
   hypervisor emulated device does not require changing the EPT mapping. If a
   trapped MMIO access does not fall into the range of a hypervisor emulated
   device, the hypervisor creates an I/O request and passes it to the
   Service VM DM.
1110
1111.. list-table::
1112   :widths: 30 70
1113   :header-rows: 1
1114
1115   * - **API**
1116     - **Description**
1117
1118   * - register_pio_emulation_handler
1119     - Register an I/O emulation handler for a hypervisor emulated device
1120       by specific I/O range.
1121
1122   * - register_mmio_emulation_handler
1123     - Register an MMIO emulation handler for a hypervisor emulated device
1124       by specific MMIO range.
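
The I/O bitmap defaults described above can be sketched with a simplified
in-memory bitmap. All names below are hypothetical and do not reproduce the
signature of *register_pio_emulation_handler*. The sketch models only the
bitmap side (one bit per port, covering the two 4-KByte VMX I/O bitmaps) and
omits the handler table and the EPT side used for MMIO.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>
   #include <string.h>

   #define IO_PORT_COUNT   65536U
   #define IO_BITMAP_BYTES (IO_PORT_COUNT / 8U)

   /* Hypothetical per-VM I/O bitmap: a set bit means "trap this port". */
   struct io_bitmap {
       uint8_t bits[IO_BITMAP_BYTES];
   };

   /* Service VM default: all 0 (pass through). User VM default: all 1. */
   static void io_bitmap_init(struct io_bitmap *bm, bool is_service_vm)
   {
       memset(bm->bits, is_service_vm ? 0x00 : 0xFF, sizeof(bm->bits));
   }

   /* Mark ports [base, base + len) as trapping so a handler can run. */
   static void io_bitmap_trap_range(struct io_bitmap *bm, uint16_t base,
                                    uint32_t len)
   {
       uint32_t end = (uint32_t)base + len;

       for (uint32_t port = base; (port < end) && (port < IO_PORT_COUNT); port++) {
           bm->bits[port / 8U] |= (uint8_t)(1U << (port % 8U));
       }
   }

   /* Query whether an access to the given port would cause a VM exit. */
   static bool io_bitmap_traps(const struct io_bitmap *bm, uint16_t port)
   {
       return (bm->bits[port / 8U] & (1U << (port % 8U))) != 0U;
   }

In this model, registering a handler for a Service VM device would call
``io_bitmap_trap_range`` first, while the User VM needs no bitmap change
because every port already traps.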
1125
1126.. _instruction-emulation:
1127
1128Instruction Emulation
1129*********************
1130
ACRN implements a simple instruction emulation infrastructure for
MMIO (EPT) and APIC access emulation. When such a VM exit is triggered, the
hypervisor decodes the instruction at the guest RIP and then performs the
corresponding emulation based on the instruction and its read/write direction.

ACRN supports emulating the ``mov``, ``movx``,
``movs``, ``stos``, ``test``, ``and``, ``or``, ``cmp``, ``sub``, and
``bittest`` instructions. The lock prefix is not supported, and real mode
emulation is not supported.
1140
1141.. figure:: images/hld-image82.png
1142   :align: center
1143
1144   Instruction Emulation Work Flow
1145
1146In the handlers for EPT violation or APIC access VM exit, ACRN will:
1147
11481. Fetch the MMIO access request's address and size.
1149
2. Run *decode_instruction* on the instruction at the current RIP
   with the following checks:
1152
1153   a. Is the instruction supported? If not, inject #UD to the guest.
1154   b. Is the GVA of RIP, dest, and src valid? If not, inject #PF to the guest.
1155   c. Is the stack valid? If not, inject #SS to the guest.
1156
11573. If step 2 succeeds, check the access direction. If it's a write, then
1158   do *emulate_instruction* to fetch the MMIO request's value from
1159   instruction operands.
1160
11614. Execute the MMIO request handler. For EPT violation, it is *emulate_io*.
1162   For APIC access, it is *vlapic_write/read* based on access
1163   direction. It will finally complete this MMIO request emulation
1164   by:
1165
1166   a. putting req.val to req.addr for write operation
1167   b. getting req.val from req.addr for read operation
1168
11695. If the access direction is read, then do *emulate_instruction* to
1170   put the MMIO request's value into instruction operands.
1171
11726. Return to the guest.
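
The ordering of steps 2 through 6 can be sketched with stand-in types. This is
an illustration under stated assumptions, not ACRN code: ``struct fake_vcpu``,
``struct mmio_req``, and ``handle_mmio_exit`` are hypothetical, the decode
result is reduced to a single flag, the device is a single register, and the
#PF and #SS checks and access-size handling are omitted.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   enum mmio_dir { MMIO_READ, MMIO_WRITE };
   enum emul_result { EMUL_OK, EMUL_INJECT_UD };

   /* Hypothetical MMIO request, filled in step 1. */
   struct mmio_req {
       enum mmio_dir dir;
       uint64_t addr;
       uint64_t size;
       uint64_t val;
   };

   /* Stand-ins for the decoded operand and the emulated device. */
   struct fake_vcpu {
       uint64_t operand_reg;  /* register named by the instruction */
       uint64_t device_reg;   /* single emulated device register */
       bool insn_supported;   /* outcome of the decode step */
   };

   static enum emul_result handle_mmio_exit(struct fake_vcpu *vcpu,
                                            struct mmio_req *req)
   {
       /* Step 2: decode; an unsupported instruction yields #UD. */
       if (!vcpu->insn_supported) {
           return EMUL_INJECT_UD;
       }
       /* Step 3: for a write, fetch the value from the operand. */
       if (req->dir == MMIO_WRITE) {
           req->val = vcpu->operand_reg;
       }
       /* Step 4: run the request handler against the device. */
       if (req->dir == MMIO_WRITE) {
           vcpu->device_reg = req->val;
       } else {
           req->val = vcpu->device_reg;
       }
       /* Step 5: for a read, put the value into the operand. */
       if (req->dir == MMIO_READ) {
           vcpu->operand_reg = req->val;
       }
       /* Step 6: the caller resumes the guest. */
       return EMUL_OK;
   }

The point of the ordering is that operand access happens before the handler
for writes and after it for reads, so the handler itself only ever deals with
``req.val`` and ``req.addr``.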
1173
1174TSC Emulation
1175*************
1176
Guest vCPU execution of *RDTSC/RDTSCP* and access to
*MSR_IA32_TSC_AUX* do not cause a VM exit to the hypervisor.
The hypervisor uses *MSR_IA32_TSC_AUX* to record the CPU ID, so
the guest can change the CPU ID value held in *MSR_IA32_TSC_AUX*.

The hypervisor uses *RDTSCP* widely to identify the current CPU ID. Because
guest access to *MSR_IA32_TSC_AUX* does not cause a VM exit, the ACRN hypervisor
saves the *MSR_IA32_TSC_AUX* value on every VM exit and restores it on every
VM entry. Until the hypervisor has restored the host CPU ID, it must not call
*rdtscp*, because the instruction could return the vCPU ID instead of the host
CPU ID.
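
One reading of this save/restore scheme is sketched below with a simulated MSR.
The variable ``sim_tsc_aux`` stands in for *MSR_IA32_TSC_AUX*, and the struct
and function names are hypothetical. Real code would use MSR accessors or the
VMCS MSR load/store areas rather than a plain variable.

.. code-block:: c

   #include <stdint.h>

   /* Simulated MSR_IA32_TSC_AUX; not a real MSR access. */
   static uint64_t sim_tsc_aux;

   /* Hypothetical per-vCPU bookkeeping. */
   struct tsc_aux_ctx {
       uint64_t guest_tsc_aux;  /* value the guest last left in the MSR */
       uint64_t host_cpu_id;    /* value the hypervisor expects to read */
   };

   /* After VM exit: save the guest value, then restore the host CPU ID. */
   static void on_vm_exit(struct tsc_aux_ctx *ctx)
   {
       ctx->guest_tsc_aux = sim_tsc_aux;
       sim_tsc_aux = ctx->host_cpu_id;
   }

   /* Before VM entry: put the guest value back. */
   static void on_vm_entry(const struct tsc_aux_ctx *ctx)
   {
       sim_tsc_aux = ctx->guest_tsc_aux;
   }

Between the VM exit and the end of ``on_vm_exit``, the simulated MSR still
holds the guest's value, which is the window in which *rdtscp* must be avoided.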
1187
1188The *MSR_IA32_TIME_STAMP_COUNTER* is emulated by the ACRN hypervisor, with a
1189simple implementation based on *TSC_OFFSET* (enabled
1190in *VMX_PROC_VM_EXEC_CONTROLS*):
1191
1192-  For read: ``val = rdtsc() + exec_vmread64(VMX_TSC_OFFSET_FULL)``
1193-  For write: ``exec_vmwrite64(VMX_TSC_OFFSET_FULL, val - rdtsc())``
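
The two formulas can be exercised with simulated state. In the sketch below,
``sim_host_tsc`` stands in for ``rdtsc()`` and ``sim_tsc_offset`` stands in for
the *VMX_TSC_OFFSET_FULL* field. Both variables and both function names are
hypothetical.

.. code-block:: c

   #include <stdint.h>

   /* Simulated hardware state; not real rdtsc or VMCS accesses. */
   static uint64_t sim_host_tsc;
   static uint64_t sim_tsc_offset;

   /* Guest read of the TSC MSR: host TSC plus the current offset. */
   static uint64_t guest_read_tsc_msr(void)
   {
       return sim_host_tsc + sim_tsc_offset;
   }

   /* Guest write of the TSC MSR: recompute the offset. */
   static void guest_write_tsc_msr(uint64_t val)
   {
       /* Unsigned wraparound covers the case val < host TSC. */
       sim_tsc_offset = val - sim_host_tsc;
   }

Because the arithmetic is modulo 2^64, a guest that writes a value smaller than
the host TSC still reads back that value, and the guest TSC then advances at
the same rate as the host TSC.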
1194
1195ART Virtualization
1196******************
1197
1198The invariant TSC is based on the invariant timekeeping hardware (called
1199Always Running Timer or ART), which runs at the core crystal clock frequency.
1200The ratio defined by the CPUID leaf 15H expresses the frequency relationship
1201between the ART hardware and the TSC.
1202
If CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1, the
following linearity relationship holds between the TSC and the ART hardware:
1205
1206   ``TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K``
1207
1208Where ``K`` is an offset that can be adjusted by a privileged agent.
1209When ART hardware is reset, both invariant TSC and K are also reset.
1210
The guiding principle of ART virtualization (vART) is that software that runs
natively should also run in a VM. The vART solution is:
1213
1214-  Present the ART capability to the guest through CPUID leaf 15H for ``CPUID.15H:EBX[31:0]``
1215   and ``CPUID.15H:EAX[31:0]``.
1216-  Passthrough devices see the physical ART_Value (vART_Value = pART_Value).
1217-  Relationship between the ART and TSC in the guest is:
1218   ``vTSC_Value = (vART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + vK``
1219   where ``vK = K + VMCS.TSC_OFFSET``.
-  If the guest changes ``vK`` or ``vTSC_Value``, the hypervisor changes
   ``VMCS.TSC_OFFSET`` accordingly.
1221-  ``K`` should never be changed by the hypervisor.
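
The native and guest relationships can be written as a small integer
computation. The function names are hypothetical, and ``num`` and ``den`` are
assumed to have been read from ``CPUID.15H:EBX[31:0]`` and
``CPUID.15H:EAX[31:0]`` by the caller, with ``den`` nonzero.

.. code-block:: c

   #include <stdint.h>

   /*
    * TSC_Value = (ART_Value * num) / den + K.
    * Splitting ART_Value into quotient and remainder gives the same
    * result as the direct formula while keeping the remainder product
    * (rem * num) within 64 bits.
    */
   static uint64_t art_to_tsc(uint64_t art, uint32_t num, uint32_t den,
                              uint64_t k)
   {
       uint64_t quot = art / den;
       uint64_t rem = art % den;

       return (quot * num) + ((rem * num) / den) + k;
   }

   /* Guest view: vK = K + VMCS.TSC_OFFSET, with vART_Value = pART_Value. */
   static uint64_t art_to_guest_tsc(uint64_t art, uint32_t num, uint32_t den,
                                    uint64_t k, uint64_t tsc_offset)
   {
       return art_to_tsc(art, num, den, k + tsc_offset);
   }

Since the guest sees the physical ART value and the same CPUID ratio, the only
term that differs between the native and guest results is the offset.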
1222
1223XSAVE Emulation
1224***************
1225
1226The XSAVE feature set is composed of eight instructions:
1227
1228-  *XGETBV* and *XSETBV* allow software to read and write the extended
1229   control register *XCR0*, which controls the operation of the
1230   XSAVE feature set.
1231
1232-  *XSAVE*, *XSAVEOPT*, *XSAVEC*, and *XSAVES* are four instructions
1233   that save the processor state to memory.
1234
1235-  *XRSTOR* and *XRSTORS* are corresponding instructions that load the
1236   processor state from memory.
1237
1238-  *XGETBV*, *XSAVE*, *XSAVEOPT*, *XSAVEC*, and *XRSTOR* can be executed
1239   at any privilege level.
1240
1241-  *XSETBV*, *XSAVES*, and *XRSTORS* can be executed only if CPL = 0.
1242
Enabling the XSAVE feature set is controlled by XCR0 (through XSETBV)
and the IA32_XSS MSR. Refer to the `Intel SDM Volume 1`_ chapter 13 for more details.
1245
1246
1247.. _Intel SDM Volume 1:
1248   https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-1-manual.html
1249
1250.. figure:: images/hld-image38.png
1251   :align: center
1252
1253   ACRN Hypervisor XSAVE Emulation
1254
By default, ACRN enables XSAVES/XRSTORS in
*VMX_PROC_VM_EXEC_CONTROLS2*, which allows the guest to use the XSAVE
feature set. Because guest execution of *XSETBV* always triggers a VM
exit, ACRN must handle guest writes to XCR0.
1259
1260ACRN emulates XSAVE features through the following rules:
1261
1. Enumerate CPUID.01H for native XSAVE feature support.
2. If step 1 reports support, enable XSAVE in the hypervisor through CR4.OSXSAVE.
3. Emulate the XSAVE-related CPUID.01H and CPUID.0DH leaves for the guest.
4. Emulate XCR0 access through *xsetbv_vmexit_handler*.
5. Pass through access to the IA32_XSS MSR to the guest.
6. The ACRN hypervisor does NOT use any XSAVE feature.
7. When ACRN emulates the vCPU in partition mode, rules 5 and 6 mean that
   the guest vCPU fully controls the XSAVE feature in
   non-root mode.
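
For rule 4, an XSETBV handler typically validates the request before applying
it. The sketch below shows architectural #GP conditions for XSETBV as a
standalone check. It is not the body of *xsetbv_vmexit_handler*: the function
name and parameters are hypothetical, and ``supported_mask`` is assumed to be
the set of XCR0 bits reported to the guest through CPUID.0DH.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define XCR0_X87 (1ULL << 0)
   #define XCR0_SSE (1ULL << 1)
   #define XCR0_AVX (1ULL << 2)

   /*
    * Hypothetical validity check for a guest XSETBV.
    * Returns false when the access should result in #GP.
    */
   static bool xsetbv_is_valid(uint32_t xcr_index, uint64_t val,
                               uint32_t cpl, uint64_t supported_mask)
   {
       /* XSETBV can be executed only if CPL = 0. */
       if (cpl != 0U) {
           return false;
       }
       /* Only XCR0 is handled here. */
       if (xcr_index != 0U) {
           return false;
       }
       /* No bits outside the supported set may be set. */
       if ((val & ~supported_mask) != 0ULL) {
           return false;
       }
       /* The x87 bit of XCR0 must always be 1. */
       if ((val & XCR0_X87) == 0ULL) {
           return false;
       }
       /* AVX state cannot be enabled without SSE state. */
       if (((val & XCR0_AVX) != 0ULL) && ((val & XCR0_SSE) == 0ULL)) {
           return false;
       }
       return true;
   }

A handler would inject #GP when the check fails and otherwise write the new
value to XCR0 on behalf of the guest.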
1271