# Best Practices in the Hypervisor

## Handling unexpected conditions

### Guidelines

Passing errors up the stack should be used when the caller is already
expecting to handle errors, and the state when the error was
discovered isn't broken, or isn't too hard to fix.

domain_crash() should be used when passing errors up the stack is too
difficult, and/or when fixing up state of a guest is impractical, but
where fixing up the state of Xen will allow Xen to continue running.
This is particularly appropriate when the guest is exhibiting behavior
well-behaved guests shouldn't.

BUG_ON() should be used when you can't pass errors up the stack, and
either continuing or crashing the guest would likely cause an
information leak or privilege escalation vulnerability.

ASSERT() IS NOT AN ERROR HANDLING MECHANISM.  ASSERT is a way to move
detection of a bug earlier in the programming cycle; it is a
more-noticeable printk.  It should only be added after one of the
other three error-handling mechanisms has been evaluated for
reliability and security.

### Rationale

It's frequently the case that code is written with the assumption that
certain conditions can never happen.  There are several possible
actions programmers can take in these situations:

 * Programmers can simply not handle those cases in any way, other than
   perhaps to write a comment documenting what the assumption is.
 * Programmers can try to handle the case gracefully -- fixing up
   in-progress state and returning an error to the user.
 * Programmers can crash the guest.
 * Programmers can use ASSERT(), which will cause the check to be
   executed in DEBUG builds, and cause the hypervisor to crash if it's
   violated.
 * Programmers can use BUG_ON(), which will cause the check to be
   executed in both DEBUG and non-DEBUG builds, and cause the hypervisor
   to crash if it's violated.

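The last four options in the list above can be compared side by side
in a small, self-contained sketch.  Everything in it is a hypothetical
stand-in written for illustration: `struct guest`, `NR_SLOTS`, the
`set_slot_*()` functions and the `SKETCH_*` macros are not Xen code,
`sketch_domain_crash()` only sets a flag, and the standard C `assert()`
(compiled out when `NDEBUG` is defined) is used as a rough analogue of
a check that only exists in DEBUG builds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_SLOTS 4u

/* Hypothetical guest model; this is not Xen's struct domain. */
struct guest {
    bool dying;
    int slots[NR_SLOTS];
};

/* Stand-in for BUG_ON(): checked in every build, stops the whole program. */
#define SKETCH_BUG_ON(cond)                              \
    do {                                                 \
        if (cond) {                                      \
            fprintf(stderr, "BUG: %s\n", #cond);         \
            abort();                                     \
        }                                                \
    } while (0)

/* Stand-in for ASSERT(): standard assert() vanishes when NDEBUG is set. */
#define SKETCH_ASSERT(cond) assert(cond)

/* Stand-in for domain_crash(): only this guest is marked as dying. */
static void sketch_domain_crash(struct guest *g)
{
    g->dying = true;
}

/* Graceful handling: nothing was modified yet, so just report the error. */
int set_slot_errno(struct guest *g, unsigned int idx, int val)
{
    if (idx >= NR_SLOTS)
        return -EINVAL;
    g->slots[idx] = val;
    return 0;
}

/* Crash the guest: the caller gets no error, but the guest stops. */
void set_slot_crash(struct guest *g, unsigned int idx, int val)
{
    if (idx >= NR_SLOTS) {
        sketch_domain_crash(g);
        return;
    }
    g->slots[idx] = val;
}

/*
 * Debug-only check: with NDEBUG defined the check disappears and a bad
 * idx would write past the end of the array.
 */
void set_slot_assert(struct guest *g, unsigned int idx, int val)
{
    SKETCH_ASSERT(idx < NR_SLOTS);
    g->slots[idx] = val;
}

/* Always-on check: a bad idx stops everything before the write happens. */
void set_slot_bug_on(struct guest *g, unsigned int idx, int val)
{
    SKETCH_BUG_ON(idx >= NR_SLOTS);
    g->slots[idx] = val;
}
```

With a valid index all four variants behave identically.  They differ
only when the "impossible" condition occurs, which is the case the
goals below are concerned with.
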

In selecting which response to use, we want to achieve several goals:

 * To minimize risk of introducing security vulnerabilities,
   particularly as the code evolves over time
 * To efficiently spend programmer time
 * To detect violations of assumptions as early as possible
 * To minimize the impact of bugs on production use cases

The guidelines above attempt to balance these:

 * When the caller is expecting to handle errors, and there is no
   broken state at the time the unexpected condition is discovered, or
   when fixing the state is straightforward, then fixing up the state and
   returning an error is the most robust thing to do.  However, if the
   caller isn't expecting to handle errors, or if the state is difficult
   to fix, then returning an error may require extensive refactoring,
   which is not a good use of programmer time when they're certain that
   this condition cannot occur.
 * BUG_ON() will stop all hypervisor action immediately.  In situations
   where continuing might allow an attacker to escalate privilege, a
   BUG_ON() can change a privilege escalation or information leak into a
   denial-of-service (an improvement).  But in situations where
   continuing (say, returning an error) might be safe, then BUG_ON() can
   change a benign failure into denial-of-service (a degradation).
 * domain_crash() is similar to BUG_ON(), but with a more limited
   effect: it stops that domain immediately.  In situations where
   continuing might cause guest or hypervisor corruption, but destroying
   the guest allows the hypervisor to continue, this can change a more
   serious bug into a guest denial-of-service.  But in situations where
   returning an error might be safe, then domain_crash() can change a
   benign failure into a guest denial-of-service.
 * ASSERT() will stop the hypervisor during development, but allow
   hypervisor action to continue during production.
   In situations where
   continuing will at worst result in a denial-of-service, and at best
   may have little effect other than perhaps quirky behavior, using an
   ASSERT() will allow violation of assumptions to be detected as soon as
   possible, while not causing undue degradation in production
   hypervisors.  However, in situations where continuing could cause
   privilege escalation or information leaks, using an ASSERT() can
   introduce security vulnerabilities.

Note, however, that domain_crash() has its own traps: callers far up the
call stack may not realize that the domain is now dying as a result of
an innocuous-looking operation, particularly if somewhere on the
call stack between the initial function call and the failure, no error
is returned.  Using domain_crash() requires careful inspection and
documentation of the code to make sure all callers up the stack handle
a newly-dead domain gracefully.
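
The domain_crash() trap described above can be made concrete with a
minimal sketch.  All names here (`struct guest_ctx`, `NR_ENTRIES`,
`guest_ctx_crash()`, `update_entry()`, `handle_request()`) are
hypothetical and exist only for this illustration; the `dying` flag and
the choice of `-EINVAL` are assumptions, not a description of how Xen
tracks or reports a crashed domain.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_ENTRIES 4u

/* Hypothetical guest model; this is not Xen's struct domain. */
struct guest_ctx {
    bool dying;
    int entries[NR_ENTRIES];
    unsigned int completed;    /* requests fully processed */
};

/* Stand-in for domain_crash(): only this guest is marked as dying. */
static void guest_ctx_crash(struct guest_ctx *g)
{
    g->dying = true;
}

/*
 * Innocuous-looking helper.  It returns void, so nothing in its
 * signature tells callers that it may kill the guest.
 */
void update_entry(struct guest_ctx *g, unsigned int idx, int val)
{
    if (idx >= NR_ENTRIES) {
        guest_ctx_crash(g);
        return;
    }
    g->entries[idx] = val;
}

/*
 * A caller that has been audited for the crash path: it re-checks the
 * guest after the helper instead of assuming the update succeeded.
 */
int handle_request(struct guest_ctx *g, unsigned int idx, int val)
{
    update_entry(g, idx, val);

    if (g->dying)
        return -EINVAL;        /* stop touching a newly dead guest */

    /* Debug-only check, added after the real handling above. */
    assert(idx < NR_ENTRIES);

    g->completed++;
    return 0;
}
```

Without the `dying` check, `handle_request()` would count the request as
completed and report success for a guest that has just been crashed,
which is the kind of silent failure that makes auditing every caller
necessary.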