.. _sw_design_guidelines:

Software Design Guidelines
##########################

Error Detection and Error Handling
**********************************

Workflow
========

The error detection and error handling workflow in the ACRN hypervisor is
shown in :numref:`work_flow_of_error_detection_and_error_handling`.

.. figure:: images/work_flow_of_error_detection_and_error_handling.png
   :align: center
   :name: work_flow_of_error_detection_and_error_handling

   Error Detection and Error Handling Workflow


Design Assumption
=================

There are three types of design assumptions in the ACRN hypervisor, as shown
below:

**Pre-condition**
   Pre-conditions shall be defined right before the definition/declaration of
   the corresponding function in the C source file or header file.
   All pre-conditions shall be guaranteed by the caller of the function.
   Error checking of the pre-conditions is not needed in the release version
   of the function. Developers may use ASSERT to catch design errors in a
   debug version in some cases. Verification of the hypervisor shall check
   whether each caller guarantees all pre-conditions of the callee.

   This design assumption applies to the following cases:

   - Input parameters of the function.
   - Global state, such as the hypervisor operation mode.

**Post-condition**
   Post-conditions shall be defined right before the definition/declaration of
   the corresponding function in the C source file or header file.
   All post-conditions shall be guaranteed by the function. All callers of the
   function can rely on these post-conditions being met.
   Error checking of the post-conditions is not needed in the release version
   of each caller. Developers may use ASSERT to catch design errors in a debug
   version in some cases. Verification of the hypervisor shall check whether
   the function guarantees all post-conditions.

   This design assumption applies to the following case:

   - Return value of the function.

   It is used to guarantee that the return value is valid; for example, the
   returned pointer is not NULL, the return value is within a valid range, or
   the members of the returned structure are valid.


**Application Constraints**
   Application constraints of the hypervisor shall be defined in the design
   document and the safety manual. All application constraints shall be
   guaranteed by external safety applications, such as the Board Support
   Package (BSP), firmware, safety VM, and hardware. The verification of
   application integration shall check whether the safety application meets
   all application constraints. These constraints must be verified during
   hypervisor validation testing. Error checking for application constraints
   at hypervisor boot time is optional.

   This design assumption applies to the following cases:

   - Configuration data defined by the external safety application, such as
     physical PCI device information specific to each board design.

   - Input data that is specified only by the external safety application.

.. note:: If input data can be specified by both a non-safety VM and a
   safety VM, the application constraints are not applicable to that data.
   The related error checking and handling shall be addressed in the
   hypervisor design.

Refer to the :ref:`C Programming Language Coding Guidelines <c_coding_guidelines>`
for how to document these design assumptions with Doxygen-style comments.

Architecture Level
==================

Functional Safety Consideration
-------------------------------

The hypervisor performs range checks in hypercalls and hardware capability
checks according to Table A.2 of the FuSA standard [IEC_61508-3_2010]_.

Error Handling Methods
----------------------

The error handling methods used in the ACRN hypervisor at the architecture
level are shown below.

**Invoke default fatal error handler**
   The hypervisor shall invoke the default fatal error handler when any of the
   cases below occurs. Customers can define platform-specific handlers to
   implement additional error reporting (mostly to hardware) if required.
   The default fatal error handler first invokes the user-defined
   platform-specific handlers and then panics the system.

   This method applies to the following cases:

   - Related hardware resources are unavailable.
   - Boot information is invalid during platform initialization.
   - An unexpected exception occurs in root mode due to a hardware failure.
   - A failure occurs in the VM dedicated to error handling.

**Return error code**
   The hypervisor shall return an error code to the VM when the case below
   occurs. The error code shall indicate the type of error detected (e.g.,
   invalid parameter, device not found, device busy, or resource
   unavailable).

   This method applies to the following case:

   - The hypercall parameter from the VM is invalid.

**Inform the safety VM through specific register or memory area**
   The hypervisor shall inform the safety VM through a specific register or
   memory area when any of the cases below occurs. The VM decides how to
   handle the related error. This shall be done only after the VM (Safety VM
   or Service VM) dedicated to error handling has started.

   This method applies to the following cases:

   - Machine check errors occur due to hardware failures.

   - Unexpected VM entry failures occur for a VM other than the one dedicated
     to error handling.

**Panic the system via ASSERT**
   The hypervisor can panic the system when the case below occurs. This
   method shall only be used in debug builds, to check pre-conditions and
   post-conditions and catch design errors.

   This method applies to the following case:

   - Software design errors occur.
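
The first method above can be illustrated with a minimal sketch of a default
fatal error handler that runs customer-registered, platform-specific handlers
first and then panics the system. All names in it
(``register_platform_fatal_hook``, ``default_fatal_error_handler``,
``hv_panic_stub``, ``example_platform_hook``, and ``MAX_PLATFORM_HOOKS``) are
hypothetical and are not taken from the ACRN source. The panic step is a stub
that only records the error code so that the call order can be observed on a
host; a real hypervisor would stop all physical CPUs instead.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical sketch; none of these names come from the ACRN source tree. */
   typedef void (*platform_fatal_hook_t)(uint32_t err_code);

   #define MAX_PLATFORM_HOOKS 4U

   static platform_fatal_hook_t platform_hooks[MAX_PLATFORM_HOOKS];
   static uint32_t platform_hook_cnt = 0U;

   /* Panic stub: records the error instead of halting, so the sketch runs on a host. */
   static bool system_panicked = false;
   static uint32_t panic_code = 0U;

   static void hv_panic_stub(uint32_t err_code)
   {
       system_panicked = true;
       panic_code = err_code;
   }

   /* Customer-defined reporting (e.g., to hardware) is registered here at boot. */
   static int32_t register_platform_fatal_hook(platform_fatal_hook_t hook)
   {
       int32_t ret = -1;

       if ((hook != NULL) && (platform_hook_cnt < MAX_PLATFORM_HOOKS)) {
           platform_hooks[platform_hook_cnt] = hook;
           platform_hook_cnt++;
           ret = 0;
       }
       return ret;
   }

   /* Default handler: run every platform-specific hook first, then panic. */
   static void default_fatal_error_handler(uint32_t err_code)
   {
       uint32_t i;

       for (i = 0U; i < platform_hook_cnt; i++) {
           platform_hooks[i](err_code);
       }
       hv_panic_stub(err_code);
   }

   /* Example platform hook: notes whether it ran before the panic step. */
   static bool hook_ran_before_panic = false;

   static void example_platform_hook(uint32_t err_code)
   {
       (void)err_code;
       hook_ran_before_panic = !system_panicked;
   }

Keeping the panic as the last step guarantees that every platform-specific
handler gets a chance to report the error before the system stops.
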


Rules of Error Detection and Error Handling
-------------------------------------------

The rules of error detection and error handling at the architecture level are
shown in :numref:`rules_arch_level` below.

.. table:: Rules of Error Detection and Error Handling on Architecture Level
   :align: center
   :widths: auto
   :name: rules_arch_level

   +--------------------+-------------------------+--------------+---------------------------+-------------------------+
   | Resource Class     | Failure Mode            | Error        | Error Handling Policy     | Example                 |
   |                    |                         | Detection    |                           |                         |
   |                    |                         | via          |                           |                         |
   |                    |                         | Hypervisor   |                           |                         |
   +====================+=========================+==============+===========================+=========================+
   | External resource  | Invalid register/memory | Yes          | Follow the SDM strictly,  | Unsupported MSR         |
   | provided by VM     | state on VM exit        |              | or document any deviation | or invalid CPU ID       |
   |                    |                         |              | explicitly.               |                         |
   |                    +-------------------------+--------------+---------------------------+-------------------------+
   |                    | Invalid hypercall       | Yes          | The hypervisor shall      | Invalid hypercall       |
   |                    | parameter               |              | return the related error  | parameter provided by   |
   |                    |                         |              | code to the VM.           | any VM                  |
   |                    +-------------------------+--------------+---------------------------+-------------------------+
   |                    | Invalid data in the     | Yes          | Case by case, depending   | Invalid data in memory  |
   |                    | shared memory area      |              | on the data.              | shared with all VMs,    |
   |                    |                         |              |                           | such as I/O request     |
   |                    |                         |              |                           | buffers and sbuf for    |
   |                    |                         |              |                           | debug                   |
   +--------------------+-------------------------+--------------+---------------------------+-------------------------+
   | External resource  | Invalid E820 table or   | Yes          | The hypervisor shall      | Invalid E820 table or   |
   | provided by        | invalid boot            |              | panic during platform     | invalid boot            |
   | bootloader         | information             |              | initialization.           | information             |
   | (GRUB or SBL)      |                         |              |                           |                         |
   +--------------------+-------------------------+--------------+---------------------------+-------------------------+
   | Physical resource  | 1GB page is not         | Yes          | The hypervisor shall      | 1GB page is not         |
   | used by the        | available on the        |              | panic during platform     | available on the        |
   | hypervisor         | platform, or invalid    |              | initialization.           | platform, or invalid    |
   |                    | physical CPU ID         |              |                           | physical CPU ID         |
   +--------------------+-------------------------+--------------+---------------------------+-------------------------+


Examples
--------

Here is an example to illustrate when error handling code is required at the
architecture level.

``vcpu_from_vid`` has two pre-conditions, which means that it is the caller's
responsibility to guarantee them.

.. code-block:: c

   /**
    * @pre vcpu_id < CONFIG_MAX_VCPUS_PER_VM
    * @pre (&(vm->hw.vcpu_array[vcpu_id]))->state != VCPU_OFFLINE
    */
   static inline struct acrn_vcpu *vcpu_from_vid(struct acrn_vm *vm, uint16_t vcpu_id)
   {
       return &(vm->hw.vcpu_array[vcpu_id]);
   }

``vcpu_from_vid`` is called by ``hcall_set_vcpu_regs``, which is a hypercall.
``hcall_set_vcpu_regs`` is an external interface, and ``vcpu_id`` is provided
by the VM. In this case, error checking code shall be added before calling
``vcpu_from_vid`` to make sure that the passed parameters are valid and that
the pre-conditions are guaranteed.

Here is the sample code for error checking before calling ``vcpu_from_vid``:

.. code-block:: c

   status = 0;

   if (vcpu_id >= CONFIG_MAX_VCPUS_PER_VM) {
       pr_err("vcpu id is out of range \r\n");
       status = -EINVAL;
   } else if ((&(vm->hw.vcpu_array[vcpu_id]))->state == VCPU_OFFLINE) {
       pr_err("vcpu is offline \r\n");
       status = -EINVAL;
   }

   if (status == 0) {
       vcpu = vcpu_from_vid(vm, vcpu_id);
       /* ... */
   }


Module Level
============

Functional Safety Consideration
-------------------------------

Data verification and explicit specification of pre-conditions and
post-conditions are applied to internal functions of the hypervisor
according to Table A.4 of the FuSA standard [IEC_61508-3_2010]_.

Error Handling Methods
----------------------

The error handling methods used in the ACRN hypervisor at the module level
are shown below.

**Panic the system via ASSERT**
   The hypervisor can panic the system when the case below occurs. This
   method shall only be used in debug builds, to check pre-conditions and
   post-conditions and catch design errors.

   This method applies to the following case:

   - Software design errors occur.
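
As a minimal sketch, an ASSERT-style macro can check a pre-condition and
panic in a debug build while compiling to nothing in a release build.
``DBG_ASSERT``, the ``HV_RELEASE`` build switch, ``debug_panic``,
``vcpu_slot_from_id``, and ``MAX_VCPUS`` are hypothetical names used for
illustration only; this is not ACRN's actual ASSERT implementation. The panic
function is a stub that prints the failed expression and counts failures
instead of halting.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Panic stub: counts failures instead of halting all physical CPUs. */
   static uint32_t panic_count = 0U;

   static void debug_panic(const char *expr, const char *file, int line)
   {
       printf("ASSERT failed: %s at %s:%d\n", expr, file, line);
       panic_count++;
   }

   /* Hypothetical macro: active unless the build defines HV_RELEASE. */
   #ifndef HV_RELEASE
   #define DBG_ASSERT(x) \
       do { \
           if (!(x)) { \
               debug_panic(#x, __FILE__, __LINE__); \
           } \
       } while (0)
   #else
   #define DBG_ASSERT(x) do { } while (0)
   #endif

   #define MAX_VCPUS 4U /* hypothetical stand-in for CONFIG_MAX_VCPUS_PER_VM */

   struct vcpu_slot {
       int32_t state;
   };

   static struct vcpu_slot vcpu_slots[MAX_VCPUS];

   /**
    * @pre vcpu_id < MAX_VCPUS
    */
   static struct vcpu_slot *vcpu_slot_from_id(uint16_t vcpu_id)
   {
       /* Catches a design error in debug builds; no check remains in release builds. */
       DBG_ASSERT(vcpu_id < MAX_VCPUS);
       return &vcpu_slots[vcpu_id];
   }

Because the release expansion is empty, the pre-condition costs nothing at run
time; the check exists only to catch design errors while debugging.
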


Rules of Error Detection and Error Handling
-------------------------------------------

The rules of error detection and error handling at the module level are shown
in :numref:`rules_module_level` below.

.. table:: Rules of Error Detection and Error Handling on Module Level
   :align: center
   :widths: auto
   :name: rules_module_level

   +--------------------+-----------+----------------------------+---------------------------+-------------------------+
   | Resource Class     | Failure   | Error Detection via        | Error Handling Policy     | Example                 |
   |                    | Mode      | Hypervisor                 |                           |                         |
   +====================+===========+============================+===========================+=========================+
   | Internal data of   | N/A       | Partial.                   | The hypervisor shall      | Virtual PCI device      |
   | the hypervisor     |           | The related pre-conditions | use the internal          | information, defined    |
   |                    |           | are required.              | resource/data directly.   | with the array          |
   |                    |           |                            |                           | ``pci_vdevs[]``         |
   |                    |           | The design guarantees      |                           | through static          |
   |                    |           | correctness, and the test  |                           | allocation.             |
   |                    |           | cases verify the related   |                           |                         |
   |                    |           | pre-conditions. If the     |                           |                         |
   |                    |           | design cannot guarantee    |                           |                         |
   |                    |           | correctness, the related   |                           |                         |
   |                    |           | error handling code needs  |                           |                         |
   |                    |           | to be added.               |                           |                         |
   |                    |           |                            |                           |                         |
   |                    |           | Note: Examples of          |                           |                         |
   |                    |           | pre-conditions include a   |                           |                         |
   |                    |           | non-empty array, a valid   |                           |                         |
   |                    |           | array size, and a non-NULL |                           |                         |
   |                    |           | pointer.                   |                           |                         |
   +--------------------+-----------+----------------------------+---------------------------+-------------------------+
   | Configuration data | Corrupted | No.                        | The bootloader            | ``vm_config->pci_devs`` |
   | of the VM          | VM config | The related pre-conditions | initializes the           | is configured           |
   |                    |           | are required.              | hypervisor (including     | statically.             |
   |                    |           |                            | code, data, and bss) and  |                         |
   |                    |           | Note: VM configuration     | verifies the integrity of |                         |
   |                    |           | data is auto-generated     | the hypervisor image,     |                         |
   |                    |           | from each board            | which contains the VM     |                         |
   |                    |           | configuration and defined  | configurations. Thus, the |                         |
   |                    |           | as static structures.      | hypervisor does not need  |                         |
   |                    |           |                            | any additional mechanism. |                         |
   +--------------------+-----------+----------------------------+---------------------------+-------------------------+
   | Configuration data | N/A       | No.                        | The hypervisor shall      | The maximum number of   |
   | of the hypervisor  |           | The related pre-conditions | use the internal          | PCI devices in the VM,  |
   |                    |           | are required.              | resource/data directly.   | defined with            |
   |                    |           |                            |                           | CONFIG_MAX_PCI_DEV_NUM  |
   |                    |           | The design guarantees      |                           | through configuration.  |
   |                    |           | correctness, and this      |                           |                         |
   |                    |           | shall be verified          |                           |                         |
   |                    |           | manually.                  |                           |                         |
   +--------------------+-----------+----------------------------+---------------------------+-------------------------+


Examples
--------

Here are some examples to illustrate when error handling code is required at
the module level.

**Example_1:** Analyze the function ``partition_mode_vpci_init``

.. code-block:: c

   /**
    * @pre vm != NULL
    * @pre vm->vpci.pci_vdev_cnt <= CONFIG_MAX_PCI_DEV_NUM
    */
   static int32_t partition_mode_vpci_init(const struct acrn_vm *vm)
   {
       struct acrn_vpci *vpci = (struct acrn_vpci *)&(vm->vpci);
       struct pci_vdev *vdev;
       struct acrn_vm_config *vm_config = get_vm_config(vm->vm_id);
       struct acrn_vm_pci_dev_config *pci_dev_config;
       uint32_t i;

       vpci->pci_vdev_cnt = vm_config->pci_dev_num;

       for (i = 0U; i < vpci->pci_vdev_cnt; i++) {
           vdev = &vpci->pci_vdevs[i];
           vdev->vpci = vpci;
           pci_dev_config = &vm_config->pci_devs[i];
           vdev->vbdf.value = pci_dev_config->vbdf.value;

           if (vdev->vbdf.value != 0U) {
               partition_mode_pdev_init(vdev, pci_dev_config->pbdf);
               vdev->ops = &pci_ops_vdev_pt;
           } else {
               vdev->ops = &pci_ops_vdev_hostbridge;
           }

           if (vdev->ops->init != NULL) {
               if (vdev->ops->init(vdev) != 0) {
                   pr_err("%s() failed at PCI device (vbdf %x)!",
                       __func__, vdev->vbdf.value);
               }
           }
       }

       return 0;
   }

``get_vm_config`` is called by ``partition_mode_vpci_init``. ``get_vm_config``
has one pre-condition and two post-conditions, which means that its caller
shall guarantee the pre-condition and ``get_vm_config`` itself shall guarantee
the post-conditions.

.. code-block:: c

   /**
    * @pre vm_id < CONFIG_MAX_VM_NUM
    * @post retval != NULL
    * @post retval->pci_dev_num <= CONFIG_MAX_PCI_DEV_NUM
    */
   struct acrn_vm_config *get_vm_config(uint16_t vm_id)
   {
       return &vm_configs[vm_id];
   }

**Question_1:** Is error checking required for ``vm_config``?

No, because ``vm_config`` is obtained from ``get_vm_config``, and the
post-condition of ``get_vm_config`` guarantees that the return value is not
NULL.


**Question_2:** Is error checking required for ``vdev``?

No. Here are the reasons:

a) The pre-condition of ``partition_mode_vpci_init`` guarantees that ``vm``
   is not NULL, which means that ``vpci`` is not NULL. Since ``vdev`` is
   obtained from the array ``pci_vdevs[]`` via indexing, ``vdev`` is not
   NULL as long as the index is valid.

b) ``vpci->pci_vdev_cnt`` is assigned from ``vm_config->pci_dev_num``, and the
   post-condition of ``get_vm_config`` guarantees that this value is less than
   or equal to ``CONFIG_MAX_PCI_DEV_NUM``, which is the array size of
   ``pci_vdevs[]``. Therefore, the index used to get ``vdev`` is always valid.

Given the two reasons above, ``vdev`` is never NULL, so error checking code
is not required for ``vdev``.


**Question_3:** Is error checking required for ``pci_dev_config``?

No. ``pci_dev_config`` is obtained from the array ``vm_config->pci_devs[]``,
which holds the physical PCI device information coming from the Board Support
Package and firmware. For physical PCI device information, the related
application constraints shall be defined in the design document or safety
manual. For debug purposes, developers could use ASSERT here to catch Board
Support Package or firmware failures that violate these application
constraints.


**Question_4:** Is error checking required for ``vdev->ops->init``?

No. Here are the reasons:

a) Question_2 proves that ``vdev`` is never NULL.

b) ``vdev->ops`` is fully initialized before ``vdev->ops->init`` is called.

Given the two reasons above, ``vdev->ops->init`` is never NULL, so error
checking code is not required for ``vdev->ops->init``.


**Question_5:** How should the case be handled when ``vdev->ops->init(vdev)``
returns a non-zero value?

A non-zero return value indicates that the initialization of a specific
virtual device failed. The root cause has to be investigated. The default
fatal error handler shall be invoked here if the failure is caused by a
hardware failure or invalid boot information.


**Example_2:** Analyze the function ``partition_mode_vpci_deinit``

.. code-block:: c

   /**
    * @pre vm != NULL
    * @pre vm->vpci.pci_vdev_cnt <= CONFIG_MAX_PCI_DEV_NUM
    */
   static void partition_mode_vpci_deinit(const struct acrn_vm *vm)
   {
       struct pci_vdev *vdev;
       uint32_t i;

       for (i = 0U; i < vm->vpci.pci_vdev_cnt; i++) {
           vdev = (struct pci_vdev *) &(vm->vpci.pci_vdevs[i]);
           if ((vdev->ops != NULL) && (vdev->ops->deinit != NULL)) {
               if (vdev->ops->deinit(vdev) != 0) {
                   pr_err("vdev->ops->deinit failed!");
               }
           }
           /* TODO: implement the deinit of 'vdev->ops' */
       }
   }


**Question_6:** Is error checking required for ``vdev->ops`` and ``vdev->ops->deinit``?

Yes, because ``vdev->ops`` and ``vdev->ops->deinit`` cannot be guaranteed to
be non-NULL. For example, if ``partition_mode_vpci_deinit`` is called twice
for the same VM, they may be NULL.


Module Level Configuration Design Guidelines
********************************************

Design Goals
============

There are two goals for module level configuration design, as shown below:

a) To make the hypervisor more flexible, a single source code base and binary
   are preferred for different platforms with different configurations.

b) If a module is not used by a specific project, the module source code is
   treated as dead code. The effort to configure it in or out shall be
   minimized.


Hypervisor Operation Modes
==========================

The hypervisor operation modes are shown in
:numref:`hypervisor_operation_modes` below.

.. table:: Hypervisor Operation Modes
   :align: center
   :widths: 10 10 50
   :name: hypervisor_operation_modes

   +-------------+-----------+------------------------------------------------------------------------------+
   | Operation   | Sub-modes | Description                                                                  |
   | Modes       |           |                                                                              |
   +=============+===========+==============================================================================+
   | INIT mode   | DETECT    | The hypervisor detects firmware and hardware resources, and reads            |
   |             | mode      | configuration data.                                                          |
   |             +-----------+------------------------------------------------------------------------------+
   |             | STARTUP   | The hypervisor initializes hardware resources, creates virtual resources     |
   |             | mode      | such as vCPUs and VMs, and executes the VMLAUNCH instruction (the very       |
   |             |           | first VM entry).                                                             |
   +-------------+-----------+------------------------------------------------------------------------------+
   | OPERATIONAL | N/A       | After the first VM entry, the hypervisor runs in VMX root mode and the       |
   | mode        |           | guest OS runs in VMX non-root mode.                                          |
   +-------------+-----------+------------------------------------------------------------------------------+
   | TERMINATION | N/A       | If any fatal error is detected, the hypervisor enters TERMINATION mode.      |
   | mode        |           | In this mode, the default fatal error handler is invoked to handle the       |
   |             |           | fatal error.                                                                 |
   +-------------+-----------+------------------------------------------------------------------------------+


Configurable Module Properties
==============================

The properties of configurable modules are shown below:

- The functionality of the module depends on platform configurations.
- The corresponding platform configurations can be detected in DETECT mode.
- The module APIs shall be configured in DETECT mode.
- The module APIs shall be used in modes other than DETECT mode.
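
The properties above can be sketched with a hypothetical vLAPIC-like module
whose interrupt delivery is backed either by APICv or by software emulation,
depending on whether the CPU supports virtual-interrupt delivery.
``struct vlapic_ops``, ``vlapic_detect``, ``vlapic_deliver_intr``,
``enum hv_op_mode``, and the ``PATH_*`` return tags are illustrative
assumptions, not ACRN APIs. The API is configured once in DETECT mode, the
operations structures are ``const``, and the caller relies on a pre-condition
on ``hv_operation_mode`` instead of a NULL check; the standard C ``assert``
stands in for a debug-only ASSERT.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical operation modes, mirroring the table above. */
   enum hv_op_mode {
       HV_MODE_DETECT,
       HV_MODE_STARTUP,
       HV_MODE_OPERATIONAL,
       HV_MODE_TERMINATION
   };

   static enum hv_op_mode hv_operation_mode = HV_MODE_DETECT;

   #define PATH_APICV   1U /* illustrative return tags, not ACRN values */
   #define PATH_SW_EMUL 2U

   /* Operations data structure: one function pointer per module API. */
   struct vlapic_ops {
       uint32_t (*deliver_intr)(uint32_t vector);
   };

   static uint32_t apicv_deliver_intr(uint32_t vector)
   {
       (void)vector; /* would use hardware virtual-interrupt delivery */
       return PATH_APICV;
   }

   static uint32_t sw_deliver_intr(uint32_t vector)
   {
       (void)vector; /* would emulate interrupt delivery in software */
       return PATH_SW_EMUL;
   }

   /* Both instances are const, so they stay read-only after DETECT mode. */
   static const struct vlapic_ops apicv_ops = { .deliver_intr = apicv_deliver_intr };
   static const struct vlapic_ops sw_ops = { .deliver_intr = sw_deliver_intr };

   static const struct vlapic_ops *vlapic_ops_ptr;

   /* Called only in DETECT mode, after probing the CPU capability. */
   static void vlapic_detect(bool cpu_has_virt_intr_delivery)
   {
       assert(hv_operation_mode == HV_MODE_DETECT);
       vlapic_ops_ptr = cpu_has_virt_intr_delivery ? &apicv_ops : &sw_ops;
   }

   /**
    * @pre hv_operation_mode != HV_MODE_DETECT
    */
   static uint32_t vlapic_deliver_intr(uint32_t vector)
   {
       /* Debug-only check of the pre-condition instead of a NULL check. */
       assert(hv_operation_mode != HV_MODE_DETECT);
       return vlapic_ops_ptr->deliver_intr(vector);
   }

In this sketch, calling ``vlapic_detect(false)`` and then switching to
OPERATIONAL mode routes ``vlapic_deliver_intr`` through the software-emulation
path, with no runtime NULL check on the function pointer.
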

Platform configurations include:

- Features depending on hardware or firmware
- Configuration data provided by firmware
- Configuration data provided by the BSP


Design Rules
============

The module level configuration design rules are shown below:

1. The platform configurations shall be detectable by the hypervisor in
   DETECT mode.

2. Configurable module APIs shall be abstracted as operations that are
   implemented through a set of function pointers in the operations data
   structure.

3. Every function pointer in the operations data structure shall be
   instantiated as one module API in DETECT mode. The API is allowed to be
   implemented as an empty function for some specific configurations.

4. The operations data structure shall be read-only in STARTUP mode,
   OPERATIONAL mode, and TERMINATION mode.

5. The configurable module shall only be accessed via APIs in the operations
   data structure in STARTUP mode or OPERATIONAL mode.

6. To guarantee that a function pointer in the operations data structure is
   dereferenced only after it has been instantiated, a pre-condition shall be
   added to the function that dereferences the function pointer, instead of
   checking the pointer for NULL.

.. note:: The third rule shall be double-checked during code review.

Use Cases
=========

The following table shows some use cases of module level configuration design:

.. list-table:: Module Level Configuration Design Use Cases
   :widths: 10 25 20
   :header-rows: 1

   * - **Platform Configuration**
     - **Configurable Module**
     - **Prerequisite**

   * - Features depending on hardware or firmware
     - This module is used to virtualize part of the LAPIC functionality.
       It can be done via APICv or software emulation, depending on CPU
       capabilities.
       For example, the Kaby Lake NUC does not support virtual-interrupt
       delivery, while other platforms support it.
     - If a function pointer is used, the prerequisite is
       ``hv_operation_mode == OPERATIONAL``.

   * - Configuration data provided by firmware
     - This module is used to interact with firmware (UEFI or SBL), and the
       configuration data is provided by firmware.
     - If a function pointer is used, the prerequisite is
       ``hv_operation_mode != DETECT``.

   * - Configuration data provided by the BSP
     - This module is used to virtualize the LAPIC, and the configuration
       data is provided by the BSP.
       For example, some VMs use LAPIC passthrough and the other VMs use
       vLAPIC.
     - If a function pointer is used, the prerequisite is
       ``hv_operation_mode == OPERATIONAL``.

.. note:: The prerequisite is used to guarantee that the function pointer used
   for configuration is dereferenced only after it has been instantiated.


References
**********

.. [IEC_61508-3_2010] IEC 61508-3:2010, Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 3: Software requirements