.. _sw_design_guidelines:

Software Design Guidelines
##########################

Error Detection and Error Handling
**********************************

Workflow
========

The error detection and error handling workflow in the ACRN hypervisor is
shown in :numref:`work_flow_of_error_detection_and_error_handling`.

.. figure:: images/work_flow_of_error_detection_and_error_handling.png
   :align: center
   :name: work_flow_of_error_detection_and_error_handling

   Error Detection and Error Handling Workflow


Design Assumption
=================

There are three types of design assumptions in the ACRN hypervisor, as shown
below:

**Pre-condition**
  Pre-conditions shall be defined right before the definition/declaration of
  the corresponding function in the C source file or header file.
  All pre-conditions shall be guaranteed by the caller of the function.
  Error checking of the pre-conditions is not needed in the release version
  of the function. Developers can use ASSERT to catch design errors in a
  debug version in some cases. Verification of the hypervisor shall check
  whether each caller guarantees all pre-conditions of the callee.

  This design assumption applies to the following cases:

  - Input parameters of the function.
  - Global state, such as hypervisor operation mode.

**Post-condition**
  Post-conditions shall be defined right before the definition/declaration of
  the corresponding function in the C source file or header file.
  All post-conditions shall be guaranteed by the function. All callers of the
  function should trust that these post-conditions are met.
  Error checking of the post-conditions is not needed in the release version
  of each caller. Developers can use ASSERT to catch design errors in a debug
  version in some cases. Verification of the hypervisor shall check whether
  the function guarantees all post-conditions.

  This design assumption applies to the following case:

  - Return value of the function

    It is used to guarantee that the return value is valid; for example, the
    returned pointer is not NULL, the return value is within a valid range,
    or the members of the returned structure are valid.


**Application Constraints**
  Application constraints of the hypervisor shall be defined in the design
  document and safety manual. All application constraints shall be guaranteed
  by external safety applications, such as the Board Support Package,
  firmware, safety VM, and hardware. The verification of application
  integration shall check whether the safety application meets all
  application constraints. These constraints must be verified during
  hypervisor validation testing. Error checking of application constraints
  at hypervisor boot time is optional.

  This design assumption applies to the following cases:

  - Configuration data defined by external safety application, such as
    physical PCI device information specific for each board design.

  - Input data that is specified only by external safety application.

.. note:: If input data can be specified by both a non-safety VM and a
   safety VM, the application constraint isn't applicable to this data.
   Related error checking and handling shall be done during hypervisor design.

Refer to the :ref:`C Programming Language Coding Guidelines <c_coding_guidelines>`
to document these design assumptions with doxygen-style comments.
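
As a rough illustration of how such assumptions might be documented, the
sketch below annotates a hypothetical accessor with ``@pre`` and ``@post``
tags and checks the pre-condition with a debug-only assertion. ``get_entry``,
``MAX_ENTRY_NUM``, ``entries``, and the ``ASSERT`` definition shown here are
assumptions made for this example and are not taken from the ACRN source.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  /*
   * Hypothetical stand-in for a debug-only ASSERT macro: assert() compiles
   * to nothing when NDEBUG is defined, which models a release build.
   */
  #define ASSERT(cond) assert(cond)

  #define MAX_ENTRY_NUM 4U  /* illustrative array size, not an ACRN constant */

  struct entry {
          uint32_t value;
  };

  static struct entry entries[MAX_ENTRY_NUM];

  /**
   * @brief Return the entry at the given index.
   *
   * @pre idx < MAX_ENTRY_NUM
   * @post retval != NULL
   */
  static struct entry *get_entry(uint32_t idx)
  {
          /* Debug-only check of the pre-condition; the caller owns it. */
          ASSERT(idx < MAX_ENTRY_NUM);

          /* The address of an array element is never NULL. */
          return &entries[idx];
  }

In this sketch the callee performs no run-time error handling: the caller
guarantees the index, and the post-condition lets the caller use the returned
pointer without a NULL check.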

Architecture Level
==================

Functional Safety Consideration
-------------------------------

The hypervisor performs range checks in hypercalls and hardware capability
checks according to Table A.2 of the functional safety standard
[IEC_61508-3_2010]_.

Error Handling Methods
----------------------

The error handling methods used in the ACRN hypervisor on an architecture
level are shown below.

**Invoke default fatal error handler**
  The hypervisor shall invoke the default fatal error handler in the cases
  listed below. Customers can define platform-specific handlers, allowing them
  to implement additional error reporting (mostly to hardware) if required.
  The default fatal error handler first invokes the platform-specific
  handlers defined by users, then panics the system.

  This method applies to the following cases:

  - Related hardware resources are unavailable.
  - Boot information is invalid during platform initialization.
  - An unexpected exception occurs in root mode due to hardware failures.
  - Failures occur in the VM dedicated to error handling.
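
The ordering described above (platform-specific handler first, then panic)
could be sketched as follows. All names in this block, including
``register_platform_fatal_handler`` and ``panic_system``, are hypothetical,
and ``panic_system`` only records the call so that the flow can be observed;
a real panic path would not return.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>

  /* Hypothetical type for a platform-specific fatal error hook. */
  typedef void (*platform_fatal_handler_t)(int error_type);

  static platform_fatal_handler_t platform_handler; /* NULL until registered */
  static int handler_calls;            /* bookkeeping for this sketch only */
  static int panic_calls;
  static int handler_calls_at_panic;

  /* Stand-in for the real panic path, which would halt the system. */
  static void panic_system(void)
  {
          handler_calls_at_panic = handler_calls;
          panic_calls++;
  }

  /**
   * @pre handler != NULL
   */
  static void register_platform_fatal_handler(platform_fatal_handler_t handler)
  {
          assert(handler != NULL);
          platform_handler = handler;
  }

  /* Invoke the platform-specific handler first, then panic the system. */
  static void default_fatal_error_handler(int error_type)
  {
          if (platform_handler != NULL) {
                  platform_handler(error_type);
          }
          panic_system();
  }

  /* Example hook: a real one might report the error to hardware. */
  static void report_to_hardware(int error_type)
  {
          (void)error_type;
          handler_calls++;
  }

Registering ``report_to_hardware`` and then calling
``default_fatal_error_handler`` runs the hook exactly once before the panic
stand-in is reached.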

**Return error code**
  The hypervisor shall return an error code to the VM in the case listed
  below. The error code shall indicate the error type detected (e.g., invalid
  parameter, device not found, device busy, or resource unavailable).

  This method applies to the following case:

  - The hypercall parameter from the VM is invalid.

**Inform the safety VM through specific register or memory area**
  The hypervisor shall inform the safety VM through a specific register or
  memory area in the cases listed below. The VM then decides how to handle
  the related error. This shall be done only after the VM (Safety VM or
  Service VM) dedicated to error handling has started.

  This method applies to the following cases:

  - Machine check errors occur due to hardware failures.

  - Unexpected VM entry failures occur, where the VM is not the one dedicated
    to error handling.
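
A minimal sketch of the memory-area variant is shown below. The mailbox
layout, the error type values, and the ``inform_safety_vm`` helper are all
assumptions made for illustration; the actual register or memory layout is
platform-specific and is not defined in this document. A real implementation
would also need appropriate memory ordering between the hypervisor and the
VM, which this single-threaded sketch omits.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical layout of a memory area shared with the safety VM. */
  struct error_mailbox {
          uint32_t error_type;  /* which kind of error was detected */
          uint16_t vm_id;       /* VM on which the error was observed */
          uint16_t pending;     /* 1 when an unread record is available */
  };

  /* Illustrative error type values. */
  enum { ERR_MACHINE_CHECK = 1, ERR_VM_ENTRY_FAILURE = 2 };

  static struct error_mailbox mailbox;
  static bool error_handling_vm_started;

  /*
   * Publish an error record for the safety VM to consume.
   * Returns false if the VM dedicated to error handling has not started,
   * in which case this method is not applicable.
   */
  static bool inform_safety_vm(uint32_t error_type, uint16_t vm_id)
  {
          if (!error_handling_vm_started) {
                  return false;
          }

          mailbox.error_type = error_type;
          mailbox.vm_id = vm_id;
          mailbox.pending = 1U;  /* set after the record fields are filled */
          return true;
  }

The early return reflects the rule that this method applies only after the
error-handling VM has started; before that point the record is left
untouched.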

**Panic the system via ASSERT**
  The hypervisor can panic the system in the case listed below. This method
  shall be used only for debugging, to check pre-conditions and
  post-conditions and catch design errors.

  This method applies to the following case:

  - Software design errors occur.


Rules of Error Detection and Error Handling
-------------------------------------------

The rules of error detection and error handling on an architecture level are
shown in :numref:`rules_arch_level` below.

.. table:: Rules of Error Detection and Error Handling on Architecture Level
   :align: center
   :widths: auto
   :name: rules_arch_level

   +--------------------+-------------------------+--------------+---------------------------+-------------------------+
   | Resource Class     | Failure Mode            | Error        | Error Handling Policy     | Example                 |
   |                    |                         | Detection    |                           |                         |
   |                    |                         | via          |                           |                         |
   |                    |                         | Hypervisor   |                           |                         |
   +====================+=========================+==============+===========================+=========================+
   | External resource  | Invalid register/memory | Yes          | Follow SDM strictly, or   | Unsupported MSR         |
   | provided by VM     | state on VM exit        |              | state any deviation in the| or invalid CPU ID       |
   |                    |                         |              | document explicitly.      |                         |
   |                    +-------------------------+--------------+---------------------------+-------------------------+
   |                    | Invalid hypercall       | Yes          | The hypervisor shall      | Invalid hypercall       |
   |                    | parameter               |              | return the related error  | parameter provided by   |
   |                    |                         |              | code to the VM            | any VM                  |
   |                    +-------------------------+--------------+---------------------------+-------------------------+
   |                    | Invalid data in the     | Yes          | Case by case depending    | Invalid data in memory  |
   |                    | shared memory area      |              | on the data               | shared with all VMs,    |
   |                    |                         |              |                           | such as IO request      |
   |                    |                         |              |                           | buffers and sbuf for    |
   |                    |                         |              |                           | debug                   |
   +--------------------+-------------------------+--------------+---------------------------+-------------------------+
   | External resource  | Invalid E820 table or   | Yes          | The hypervisor shall      | Invalid E820 table or   |
   | provided by        | invalid boot information|              | panic during platform     | invalid boot information|
   | bootloader         |                         |              | initialization.           |                         |
   | (GRUB or SBL)      |                         |              |                           |                         |
   +--------------------+-------------------------+--------------+---------------------------+-------------------------+
   | Physical resource  | 1GB page is not         | Yes          | The hypervisor shall      | 1GB page is not         |
   | used by the        | available on the        |              | panic during platform     | available on the        |
   | hypervisor         | platform or invalid     |              | initialization.           | platform or invalid     |
   |                    | physical CPU ID         |              |                           | physical CPU ID         |
   +--------------------+-------------------------+--------------+---------------------------+-------------------------+


Examples
--------

Here is an example to illustrate when error handling code is required on
an architecture level.

``vcpu_from_vid`` has two pre-conditions. It is the caller's responsibility
to guarantee them.

.. code-block:: c

  /**
   * @pre vcpu_id < CONFIG_MAX_VCPUS_PER_VM
   * @pre vm->hw.vcpu_array[vcpu_id].state != VCPU_OFFLINE
   */
  static inline struct acrn_vcpu *vcpu_from_vid(struct acrn_vm *vm, uint16_t vcpu_id)
  {
          return &(vm->hw.vcpu_array[vcpu_id]);
  }

``vcpu_from_vid`` is called by ``hcall_set_vcpu_regs``, which is a hypercall.
``hcall_set_vcpu_regs`` is an external interface and ``vcpu_id`` is provided
by the VM. In this case, we shall add error checking code before calling
``vcpu_from_vid`` to make sure that the passed parameters are valid and the
pre-conditions are guaranteed.

Here is the sample code for error checking before calling ``vcpu_from_vid``:

.. code-block:: c

  status = 0;

  if (vcpu_id >= CONFIG_MAX_VCPUS_PER_VM) {
          pr_err("vcpu id is out of range \r\n");
          status = -EINVAL;
  } else if ((&(vm->hw.vcpu_array[vcpu_id]))->state == VCPU_OFFLINE) {
          pr_err("vcpu is offline \r\n");
          status = -EINVAL;
  }

  if (status == 0) {
          vcpu = vcpu_from_vid(vm, vcpu_id);
          ...
  }


Module Level
============

Functional Safety Consideration
-------------------------------

Data verification and explicit specification of pre-conditions and
post-conditions are applied to internal functions of the hypervisor
according to Table A.4 of the functional safety standard
[IEC_61508-3_2010]_.

Error Handling Methods
----------------------

The error handling methods used in the ACRN hypervisor on a module level are
shown below.

**Panic the system via ASSERT**
  The hypervisor can panic the system in the case listed below. This method
  shall be used only for debugging, to check pre-conditions and
  post-conditions and catch design errors.

  This method applies to the following case:

  - Software design errors occur.


Rules of Error Detection and Error Handling
-------------------------------------------

The rules of error detection and error handling on a module level are shown in
:numref:`rules_module_level` below.

.. table:: Rules of Error Detection and Error Handling on Module Level
   :align: center
   :widths: auto
   :name: rules_module_level

   +--------------------+-----------+----------------------------+---------------------------+-------------------------+
   | Resource Class     | Failure   | Error Detection via        | Error Handling Policy     | Example                 |
   |                    | Mode      | Hypervisor                 |                           |                         |
   +====================+===========+============================+===========================+=========================+
   | Internal data of   | N/A       | Partial.                   | The hypervisor shall use  | Virtual PCI device      |
   | the hypervisor     |           | The related pre-conditions | the internal resource/data| information, defined    |
   |                    |           | are required.              | directly.                 | with array              |
   |                    |           |                            |                           | ``pci_vdevs[]``         |
   |                    |           | The design will guarantee  |                           | through static          |
   |                    |           | the correctness and the    |                           | allocation.             |
   |                    |           | test cases will verify the |                           |                         |
   |                    |           | related pre-conditions.    |                           |                         |
   |                    |           | If the design cannot       |                           |                         |
   |                    |           | guarantee the correctness, |                           |                         |
   |                    |           | the related error handling |                           |                         |
   |                    |           | code needs to be added.    |                           |                         |
   |                    |           | Note: Some examples of     |                           |                         |
   |                    |           | pre-conditions are listed, |                           |                         |
   |                    |           | like non-empty array, valid|                           |                         |
   |                    |           | array size and non-null    |                           |                         |
   |                    |           | pointer.                   |                           |                         |
   +--------------------+-----------+----------------------------+---------------------------+-------------------------+
   | Configuration data | Corrupted | No.                        | The bootloader initializes| ``vm_config->pci_devs`` |
   | of the VM          | VM config | The related pre-conditions | the hypervisor (including | is configured           |
   |                    |           | are required.              | code, data, and bss) and  | statically.             |
   |                    |           | Note: VM configuration data| verifies the integrity of |                         |
   |                    |           | are auto-generated based on| the hypervisor image that |                         |
   |                    |           | different board configs and| contains the VM           |                         |
   |                    |           | are defined as static      | configurations. Thus, the |                         |
   |                    |           | structures.                | hypervisor does not need  |                         |
   |                    |           |                            | any additional mechanism. |                         |
   +--------------------+-----------+----------------------------+---------------------------+-------------------------+
   | Configuration data | N/A       | No.                        | The hypervisor shall use  | The maximum number of   |
   | of the hypervisor  |           | The related pre-conditions | the internal resource/data| PCI devices in the VM,  |
   |                    |           | are required.              | directly.                 | defined with            |
   |                    |           | The design will guarantee  |                           | CONFIG_MAX_PCI_DEV_NUM  |
   |                    |           | the correctness and this   |                           | through configuration.  |
   |                    |           | shall be verified manually.|                           |                         |
   +--------------------+-----------+----------------------------+---------------------------+-------------------------+


Examples
--------

Here are some examples to illustrate when error handling code is required on
a module level.

**Example_1: Analyze the function ``partition_mode_vpci_init``**

.. code-block:: c

  /**
   * @pre vm != NULL
   * @pre vm->vpci.pci_vdev_cnt <= CONFIG_MAX_PCI_DEV_NUM
   */
  static int32_t partition_mode_vpci_init(const struct acrn_vm *vm)
  {
          struct acrn_vpci *vpci = (struct acrn_vpci *)&(vm->vpci);
          struct pci_vdev *vdev;
          struct acrn_vm_config *vm_config = get_vm_config(vm->vm_id);
          struct acrn_vm_pci_dev_config *pci_dev_config;
          uint32_t i;

          vpci->pci_vdev_cnt = vm_config->pci_dev_num;

          for (i = 0U; i < vpci->pci_vdev_cnt; i++) {
                  vdev = &vpci->pci_vdevs[i];
                  vdev->vpci = vpci;
                  pci_dev_config = &vm_config->pci_devs[i];
                  vdev->vbdf.value = pci_dev_config->vbdf.value;

                  if (vdev->vbdf.value != 0U) {
                          partition_mode_pdev_init(vdev, pci_dev_config->pbdf);
                          vdev->ops = &pci_ops_vdev_pt;
                  } else {
                          vdev->ops = &pci_ops_vdev_hostbridge;
                  }

                  if (vdev->ops->init != NULL) {
                          if (vdev->ops->init(vdev) != 0) {
                                  pr_err("%s() failed at PCI device (vbdf %x)!",
                                          __func__, vdev->vbdf.value);
                          }
                  }
          }

          return 0;
  }

``get_vm_config`` is called by ``partition_mode_vpci_init``.
``get_vm_config`` has one pre-condition and two post-conditions. The caller
of ``get_vm_config`` shall guarantee the pre-condition, and ``get_vm_config``
itself shall guarantee the post-conditions.

.. code-block:: c

  /**
   * @pre vm_id < CONFIG_MAX_VM_NUM
   * @post retval != NULL
   * @post retval->pci_dev_num <= CONFIG_MAX_PCI_DEV_NUM
   */
  struct acrn_vm_config *get_vm_config(uint16_t vm_id)
  {
          return &vm_configs[vm_id];
  }

**Question_1: Is error checking required for ``vm_config``?**

No. ``vm_config`` is the return value of ``get_vm_config``, and the
post-condition of ``get_vm_config`` guarantees that the return value is not
NULL.


**Question_2: Is error checking required for ``vdev``?**

No. Here are the reasons:

a) The pre-condition of ``partition_mode_vpci_init`` guarantees that ``vm``
   is not NULL. Because ``vpci`` is the address of a member of ``vm``,
   ``vpci`` is not NULL either. Since ``vdev`` is the address of an element
   of the array ``pci_vdevs[]``, ``vdev`` is not NULL as long as the index
   is valid.

b) ``vpci->pci_vdev_cnt`` is assigned from ``vm_config->pci_dev_num``, and
   the post-condition of ``get_vm_config`` guarantees that this value is
   less than or equal to ``CONFIG_MAX_PCI_DEV_NUM``, which is the array size
   of ``pci_vdevs[]``. It follows that the index used to get ``vdev`` is
   always valid.

Given the two reasons above, ``vdev`` is never NULL, so error checking code
is not required for ``vdev``.


**Question_3: Is error checking required for ``pci_dev_config``?**

No. ``pci_dev_config`` is getting data from the array ``pci_devs[]``, which
holds the physical PCI device information coming from the Board Support
Package and firmware. For physical PCI device information, the related
application constraints shall be defined in the design document or safety
manual. For debug purposes, developers can use ASSERT here to catch Board
Support Package or firmware failures that violate these application
constraints.


**Question_4: Is error checking required for ``vdev->ops->init``?**

No. Here are the reasons:

a) Question_2 proves that ``vdev`` is never NULL.

b) ``vdev->ops`` is fully initialized before ``vdev->ops->init`` is called.

Given the two reasons above, ``vdev->ops->init`` is never NULL, so error
checking code is not required for ``vdev->ops->init``.


**Question_5: How to handle the case when ``vdev->ops->init(vdev)`` returns non-zero?**

This case indicates that the initialization of a specific virtual device has
failed. The root cause has to be investigated. The default fatal error
handler shall be invoked here if the failure is caused by a hardware failure
or invalid boot information.


**Example_2: Analyze the function ``partition_mode_vpci_deinit``**

.. code-block:: c

  /**
   * @pre vm != NULL
   * @pre vm->vpci.pci_vdev_cnt <= CONFIG_MAX_PCI_DEV_NUM
   */
  static void partition_mode_vpci_deinit(const struct acrn_vm *vm)
  {
          struct pci_vdev *vdev;
          uint32_t i;

          for (i = 0U; i < vm->vpci.pci_vdev_cnt; i++) {
                  vdev = (struct pci_vdev *) &(vm->vpci.pci_vdevs[i]);
                  if ((vdev->ops != NULL) && (vdev->ops->deinit != NULL)) {
                          if (vdev->ops->deinit(vdev) != 0) {
                                  pr_err("vdev->ops->deinit failed!");
                          }
                  }
                  /* TODO: implement the deinit of 'vdev->ops' */
          }
  }


**Question_6: Is error checking required for ``vdev->ops`` and ``vdev->ops->deinit``?**

Yes, because ``vdev->ops`` and ``vdev->ops->deinit`` cannot be guaranteed to
be non-NULL. If ``partition_mode_vpci_deinit`` is called twice for the VM,
they may be NULL on the second call.
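
The double-deinit situation can be pictured with the simplified sketch below.
It uses a reduced, hypothetical ``struct vdev`` rather than the real
``struct pci_vdev``, and it assumes that deinitialization clears ``ops``,
which the TODO in the example above leaves open.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  struct vdev;

  struct vdev_ops {
          int32_t (*deinit)(struct vdev *vdev);
  };

  /* Reduced, hypothetical device structure for this sketch. */
  struct vdev {
          const struct vdev_ops *ops;
          uint32_t deinit_count;  /* bookkeeping for this sketch only */
  };

  static int32_t sample_deinit(struct vdev *vdev)
  {
          vdev->deinit_count++;
          return 0;
  }

  static const struct vdev_ops sample_ops = { .deinit = sample_deinit };

  /**
   * @pre vdev != NULL
   */
  static void vdev_deinit(struct vdev *vdev)
  {
          /* ops may already be NULL if deinit ran before, so check it. */
          if ((vdev->ops != NULL) && (vdev->ops->deinit != NULL)) {
                  (void)vdev->ops->deinit(vdev);
          }

          /* Assumption: deinit clears ops, so a second call sees NULL. */
          vdev->ops = NULL;
  }

With the NULL checks in place, calling ``vdev_deinit`` twice on the same
device runs the ``deinit`` callback only once; without them, the second call
would dereference a NULL ``ops`` pointer.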


Module Level Configuration Design Guidelines
********************************************

Design Goals
============

There are two goals for module level configuration design, as shown below:

a) To make the hypervisor more flexible, a single source code base and a
   single binary are preferred for different platforms with different
   configurations.

b) If a module is not used by a specific project, the module source code is
   treated as dead code. The effort to configure it in or out shall be
   minimized.


Hypervisor Operation Modes
==========================

The hypervisor operation modes are shown in
:numref:`hypervisor_operation_modes` below.

.. table:: Hypervisor Operation Modes
   :align: center
   :widths: 10 10 50
   :name: hypervisor_operation_modes

   +-------------+-----------+------------------------------------------------------------------------------+
   | Operation   | Sub-modes | Description                                                                  |
   | Modes       |           |                                                                              |
   +=============+===========+==============================================================================+
   | INIT mode   | DETECT    | The hypervisor detects firmware, detects hardware resources, and reads       |
   |             | mode      | configuration data.                                                          |
   |             +-----------+------------------------------------------------------------------------------+
   |             | STARTUP   | The hypervisor initializes hardware resources, creates virtual resources like|
   |             | mode      | VCPU and VM, and executes the VMLAUNCH instruction (the very first VM entry).|
   +-------------+-----------+------------------------------------------------------------------------------+
   | OPERATIONAL | N/A       | After the first VM entry, the hypervisor runs in VMX root mode and guest OS  |
   | mode        |           | runs in VMX non-root mode.                                                   |
   +-------------+-----------+------------------------------------------------------------------------------+
   | TERMINATION | N/A       | If any fatal error is detected, the hypervisor will enter TERMINATION mode.  |
   | mode        |           | In this mode, a default fatal error handler will be invoked to handle the    |
   |             |           | fatal error.                                                                 |
   +-------------+-----------+------------------------------------------------------------------------------+


Configurable Module Properties
==============================

The properties of configurable modules are shown below:

- The functionality of the module depends on platform configurations;
- Corresponding platform configurations can be detected in DETECT mode;
- The module APIs shall be configured in DETECT mode;
- The module APIs shall be used in modes other than DETECT mode.

Platform configurations include:

- Features depending on hardware or firmware
- Configuration data provided by firmware
- Configuration data provided by BSP


Design Rules
============

The module level configuration design rules are shown below:

1. The platform configurations shall be detectable by the hypervisor in
   DETECT mode;

2. Configurable module APIs shall be abstracted as operations that are
   implemented through a set of function pointers in the operations data
   structure;

3. Every function pointer in the operations data structure shall be
   instantiated as one module API in DETECT mode, and the API is allowed to
   be implemented as an empty function for some specific configurations;

4. The operations data structure shall be read-only in STARTUP mode,
   OPERATIONAL mode, and TERMINATION mode;

5. The configurable module shall only be accessed via APIs in the operations
   data structure in STARTUP mode or OPERATIONAL mode;

6. In order to guarantee that the function pointer in the operations data
   structure is dereferenced after it has been instantiated, the pre-condition
   shall be added for the function that dereferences the function pointer,
   instead of checking the pointer for NULL.

.. note:: The third rule shall be double checked during code review.
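
The rules above might translate into code along the lines of the following
sketch. The module, the ``irq_ops`` structure, the ``HV_MODE_*`` constants,
and the placeholder return values are all hypothetical; only the name
``hv_operation_mode`` is borrowed from the use cases in this document, and
``assert`` stands in for the hypervisor's debug-only ASSERT.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  enum hv_mode { HV_MODE_DETECT, HV_MODE_STARTUP, HV_MODE_OPERATIONAL };

  /* Rule 2: module APIs abstracted as a set of function pointers. */
  struct irq_ops {
          uint32_t (*deliver)(uint32_t vector);
          void (*flush)(void);
  };

  /* Placeholder implementations; the return values are arbitrary. */
  static uint32_t deliver_hw(uint32_t vector) { return vector | 0x100U; }
  static uint32_t deliver_sw(uint32_t vector) { return vector; }
  static void flush_hw(void) { }
  static void flush_empty(void) { }  /* rule 3: empty function, not NULL */

  /* Rule 4: the operations data structures themselves are read-only. */
  static const struct irq_ops hw_ops = { .deliver = deliver_hw, .flush = flush_hw };
  static const struct irq_ops sw_ops = { .deliver = deliver_sw, .flush = flush_empty };

  static enum hv_mode hv_operation_mode = HV_MODE_DETECT;
  static const struct irq_ops *active_irq_ops;  /* selected in DETECT mode */

  /**
   * @pre hv_operation_mode == HV_MODE_DETECT
   */
  static void irq_module_detect(bool hw_supported)
  {
          assert(hv_operation_mode == HV_MODE_DETECT);
          active_irq_ops = hw_supported ? &hw_ops : &sw_ops;
  }

  /**
   * @pre hv_operation_mode != HV_MODE_DETECT
   */
  static uint32_t irq_deliver(uint32_t vector)
  {
          /* Rule 6: rely on the pre-condition instead of a NULL check. */
          assert(hv_operation_mode != HV_MODE_DETECT);
          return active_irq_ops->deliver(vector);
  }

Because every pointer is instantiated during detection, possibly with an
empty function, callers in later modes can dereference the pointers without
run-time NULL checks.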

Use Cases
=========

The following table shows some use cases of module level configuration design:

.. list-table:: Module Level Configuration Design Use Cases
   :widths: 10 25 20
   :header-rows: 1

   * - **Platform Configuration**
     - **Configurable Module**
     - **Prerequisite**

   * - Features depending on hardware or firmware
     - This module is used to virtualize part of LAPIC functionalities.
       It can be done via APICv or software emulation depending on CPU
       capabilities.
       For example, Kaby Lake NUC doesn't support virtual-interrupt delivery,
       while other platforms support it.
     - If a function pointer is used, the prerequisite is
       "hv_operation_mode == OPERATIONAL".

   * - Configuration data provided by firmware
     - This module is used to interact with firmware (UEFI or SBL), and the
       configuration data is provided by firmware.
     - If a function pointer is used, the prerequisite is
       "hv_operation_mode != DETECT".

   * - Configuration data provided by BSP
     - This module is used to virtualize LAPIC, and the configuration data is
       provided by BSP.
       For example, some VMs use LAPIC passthrough and the other VMs use
       vLAPIC.
     - If a function pointer is used, the prerequisite is
       "hv_operation_mode == OPERATIONAL".

.. note:: Prerequisite is used to guarantee that the function pointer used for
   configuration is dereferenced after it has been instantiated.


References
**********

.. [IEC_61508-3_2010] IEC 61508-3:2010, Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 3: Software requirements
