1# Xen Live Patching Design v1
2
3## Rationale
4
A mechanism is required to binary patch the running hypervisor with new
opcodes that have come about primarily due to security updates.

This document describes the design of the API that would allow us to
upload binary patches to the hypervisor.
10
The document is split into four sections:

 * Detailed description of the problem statement.
14 * Design of the data structures.
15 * Design of the hypercalls.
16 * Implementation notes that should be taken into consideration.
17
18
19## Glossary
20
21 * splice - patch in the binary code with new opcodes
22 * trampoline - a jump to a new instruction.
 * payload - metadata describing the old code along with the binary blob of the new
   function (if needed).
 * reloc - metadata contained in the payload used to construct the proper trampoline.
26
27## History
28
The document has undergone various reviews and covers only the v1 design.
30
31The end of the document has a section titled `Not Yet Done` which
32outlines ideas and design for the future version of this work.
33
34## Multiple ways to patch
35
36The mechanism needs to be flexible to patch the hypervisor in multiple ways
37and be as simple as possible. The compiled code is contiguous in memory with
38no gaps - so we have no luxury of 'moving' existing code and must either
39insert a trampoline to the new code to be executed - or only modify in-place
the code if there is sufficient space. The placement of new code has to be done
by the hypervisor and the virtual address for the new code is allocated dynamically.
42
43This implies that the hypervisor must compute the new offsets when splicing
44in the new trampoline code. Where the trampoline is added (inside
45the function we are patching or just the callers?) is also important.
46
To lessen the amount of code in the hypervisor, the consumer of the API
is responsible for identifying which mechanism to employ and how many locations
to patch. Combinations of modifying code in-place, adding trampolines, etc.
have to be supported. The API should allow reading and writing any memory within
the hypervisor virtual address space.
52
53We must also have a mechanism to query what has been applied and a mechanism
54to revert it if needed.
55
56## Workflow
57
58The expected workflows of higher-level tools that manage multiple patches
59on production machines would be:
60
61 * The first obvious task is loading all available / suggested
62   hotpatches when they are available.
63 * Whenever new hotpatches are installed, they should be loaded too.
64 * One wants to query which modules have been loaded at runtime.
65 * If unloading is deemed safe (see unloading below), one may want to
66   support a workflow where a specific hotpatch is marked as bad and
67   unloaded.
68
69## Patching code
70
The first mechanism to patch that comes to mind is in-place replacement.
That is, replace the affected code with new code. Unfortunately x86
instructions are variable length, which places limits on how much space we
have available to replace the instructions. That is not a problem if the
change is smaller than the original opcode and we can fill it with nops.
Problems will appear if the replacement code is longer.

The second mechanism is to replace the call or jump to the
old function with a call or jump to the address of the new function.
80
81A third mechanism is to add a jump to the new function at the
82start of the old function. N.B. The Xen hypervisor implements the third
83mechanism. See `Trampoline (e9 opcode)` section for more details.
84
85### Example of trampoline and in-place splicing
86
As an example we will assume the hypervisor does not have XSA-132 (see
*domctl/sysctl: don't leak hypervisor stack to toolstacks*
4ff3449f0e9d175ceb9551d3f2aecb59273f639d) and we would like to binary patch
the hypervisor with it. The original code looks like this:
91
92<pre>
93   48 89 e0                  mov    %rsp,%rax
94   48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax
95</pre>
96
97while the new patched hypervisor would be:
98
99<pre>
100   48 c7 45 b8 00 00 00 00   movq   $0x0,-0x48(%rbp)
101   48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)
102   48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)
103   48 89 e0                  mov    %rsp,%rax
104   48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax
105</pre>
106
This is inside `arch_do_domctl`. The change adds 24 extra bytes of code
(three 8-byte `movq` instructions), which alters all the offsets inside the
function. We might not have enough space in .text to squeeze in the extra
24 bytes and adjust these offsets.
111
112As such we could simplify this problem by only patching the site
113which calls arch_do_domctl:
114
115<pre>
116do_domctl:
117 e8 4b b1 05 00          callq  ffff82d08015fbb9 <arch_do_domctl>
118</pre>
119
120with a new address for where the new `arch_do_domctl` would be (this
121area would be allocated dynamically).
122
Astute readers will wonder what we need to do if we were to patch `do_domctl`,
which is not called directly by the hypervisor but on behalf of the guests via
the `compat_hypercall_table` and `hypercall_table`.
Patching the offset in `hypercall_table` for `do_domctl`
(`ffff82d080103079 <do_domctl>`):
128
129<pre>
130
131 ffff82d08024d490:   79 30
132 ffff82d08024d492:   10 80 d0 82 ff ff
133
134</pre>
135
with the address of the new `do_domctl` is possible. The other
place where it is used is in `hvm_hypercall64_table`, which would need
to be patched in a similar way. This would require an in-place splicing
of the new virtual address of `arch_do_domctl`.
140
In summary this example patched the callers of the affected function by:
142 * allocating memory for the new code to live in,
143 * changing the virtual address in all the functions which called the old
144   code (computing the new offset, patching the callq with a new callq).
145 * changing the function pointer tables with the new virtual address of
146   the function (splicing in the new virtual address). Since this table
147   resides in the .rodata section we would need to temporarily change the
148   page table permissions during this part.
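
The offset computation for re-pointing a `callq` can be sketched as follows. This is a minimal illustration and not the hypervisor's code: `encode_call_rel32` is a hypothetical helper, and it only builds the five instruction bytes without touching any live text.

```c
#include <assert.h>
#include <stdint.h>

#define CALL_REL32_LEN 5   /* e8 opcode + 32-bit displacement */

/*
 * Hypothetical helper: build "call rel32" for a call instruction located
 * at `site` so that it reaches `target`. Returns 0 on success, or -1 if
 * the target cannot be reached with a signed 32-bit displacement.
 */
int encode_call_rel32(uint8_t insn[CALL_REL32_LEN],
                      uint64_t site, uint64_t target)
{
    /* The displacement is relative to the end of the instruction. */
    int64_t disp = (int64_t)(target - (site + CALL_REL32_LEN));
    uint32_t rel;

    assert(insn != 0);

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    rel = (uint32_t)disp;
    insn[0] = 0xe8;
    /* x86 stores the displacement in little-endian byte order. */
    insn[1] = (uint8_t)(rel & 0xff);
    insn[2] = (uint8_t)((rel >> 8) & 0xff);
    insn[3] = (uint8_t)((rel >> 16) & 0xff);
    insn[4] = (uint8_t)((rel >> 24) & 0xff);
    return 0;
}
```

Assuming the call in `do_domctl` sits at `0xffff82d080104a69` (an address inferred here from the displayed bytes, not stated in the listing), a target of `0xffff82d08015fbb9` reproduces the `e8 4b b1 05 00` encoding shown above. Patching means repeating the computation with the address of the newly allocated function.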
149
However it has drawbacks: the safety checks, which have to make sure
the function is not on the stack, must also check every caller. For some
patches this could mean, if there were a sufficiently large number of
callers, that we would never be able to apply the update.
154
155Having the patching done at predetermined instances where the stacks
156are not deep mostly solves this problem.
157
158### Example of different trampoline patching.
159
160An alternative mechanism exists where we can insert a trampoline in the
161existing function to be patched to jump directly to the new code. This
162lessens the locations to be patched to one but it puts pressure on the
163CPU branching logic (I-cache, but it is just one unconditional jump).
164
165For this example we will assume that the hypervisor has not been compiled
166with fe2e079f642effb3d24a6e1a7096ef26e691d93e (XSA-125: *pre-fill structures
for certain HYPERVISOR_xen_version sub-ops*) which memsets a structure
in the `xen_version` hypercall. This function is not called **anywhere** in
169the hypervisor (it is called by the guest) but referenced in the
170`compat_hypercall_table` and `hypercall_table` (and indirectly called
171from that). Patching the offset in `hypercall_table` for the old
172`do_xen_version` (ffff82d080112f9e <do_xen_version>)
173
<pre>
175 ffff82d08024b270 <hypercall_table>:
176 ...
177 ffff82d08024b2f8:   9e 2f 11 80 d0 82 ff ff
178
179</pre>
180
with the address of the new `do_xen_version` is possible. The other
182place where it is used is in `hvm_hypercall64_table` which would need
183to be patched in a similar way. This would require an in-place splicing
184of the new virtual address of `do_xen_version`.
185
An alternative solution would be to insert a trampoline in the
old `do_xen_version` function to directly jump to the new `do_xen_version`.
188
189<pre>
190 ffff82d080112f9e do_xen_version:
191 ffff82d080112f9e:       48 c7 c0 da ff ff ff    mov    $0xffffffffffffffda,%rax
192 ffff82d080112fa5:       83 ff 09                cmp    $0x9,%edi
193 ffff82d080112fa8:       0f 87 24 05 00 00       ja     ffff82d0801134d2 ; do_xen_version+0x534
194</pre>
195
196with:
197
198<pre>
199 ffff82d080112f9e do_xen_version:
200 ffff82d080112f9e:       e9 XX YY ZZ QQ          jmpq   [new do_xen_version]
201</pre>
202
203which would lessen the amount of patching to just one location.
204
205In summary this example patched the affected function to jump to the
206new replacement function which required:
207 * allocating memory for the new code to live in,
208 * inserting trampoline with new offset in the old function to point to the
209   new function.
 * Optionally we can insert in the old function a trampoline jump to a function
   providing a BUG_ON to catch errant code.

The disadvantage of this is that the unconditional jump incurs a small
I-cache penalty. However the simplicity of the patching and higher chance
of passing safety checks make this a worthwhile option.
216
217This patching has a similar drawback as inline patching - the safety
218checks have to make sure the function is not on the stack. However
219since we are replacing at a higher level (a full function as opposed
220to various offsets within functions) the checks are simpler.
221
222Having the patching done at predetermined instances where the stacks
223are not deep mostly solves this problem as well.
224
225### Security
226
227With this method we can re-write the hypervisor - and as such we **MUST** be
228diligent in only allowing certain guests to perform this operation.
229
230Furthermore with SecureBoot or tboot, we **MUST** also verify the signature
231of the payload to be certain it came from a trusted source and integrity
232was intact.
233
234As such the hypercall **MUST** support an XSM policy to limit what the guest
235is allowed to invoke. If the system is booted with signature checking the
236signature checking will be enforced.
237
238## Design of payload format
239
240The payload **MUST** contain enough data to allow us to apply the update
241and also safely reverse it. As such we **MUST** know:
242
243 * The locations in memory to be patched. This can be determined dynamically
244   via symbols or via virtual addresses.
245 * The new code that will be patched in.
246
The payload could be constructed using a custom binary format, but
that makes several requirements hard to meet:

 * The format might need to be changed and we need a mechanism to accommodate
   that.
 * It has to be platform agnostic.
 * It has to be easily constructed using existing tools.
254
255As such having the payload in an ELF file is the sensible way. We would be
256carrying the various sets of structures (and data) in the ELF sections under
257different names and with definitions.
258
259Note that every structure has padding. This is added so that the hypervisor
260can re-use those fields as it sees fit.
261
262Earlier design attempted to ineptly explain the relations of the ELF sections
263to each other without using proper ELF mechanism (sh_info, sh_link, data
264structures using Elf types, etc). This design will explain the structures
265and how they are used together and not dig in the ELF format - except mention
266that the section names should match the structure names.
267
268The Xen Live Patch payload is a relocatable ELF binary. A typical binary would have:
269
270 * One or more .text sections.
271 * Zero or more read-only data sections.
272 * Zero or more data sections.
273 * Relocations for each of these sections.
274
275It may also have some architecture-specific sections. For example:
276
277 * Alternatives instructions.
278 * Bug frames.
279 * Exception tables.
280 * Relocations for each of these sections.
281
The Xen Live Patch core code loads the payload as a standard ELF binary, relocates it
and handles the architecture-specific sections as needed. This process is much
like what the Linux kernel module loader does.
285
286The payload contains at least three sections:
287
288 * `.livepatch.funcs` - which is an array of livepatch_func structures.
289 * `.livepatch.depends` - which is an ELF Note that describes what the payload
290    depends on. **MUST** have one.
291 *  `.note.gnu.build-id` - the build-id of this payload. **MUST** have one.
292
293### .livepatch.funcs
294
295The `.livepatch.funcs` contains an array of livepatch_func structures
296which describe the functions to be patched:
297
298<pre>
299struct livepatch_func {
300    const char *name;
301    void *new_addr;
302    void *old_addr;
303    uint32_t new_size;
304    uint32_t old_size;
305    uint8_t version;
306    uint8_t opaque[31];
307};
308</pre>
309
310The size of the structure is 64 bytes on 64-bit hypervisors. It will be
31152 on 32-bit hypervisors.
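
The stated sizes follow from the field layout and can be checked at compile time. The sketch below assumes a C11 compiler; the `LIVEPATCH_FUNC_SIZE` macro is introduced here only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct livepatch_func {
    const char *name;
    void *new_addr;
    void *old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;
    uint8_t opaque[31];
};

/*
 * Three pointers, two uint32_t, one version byte and 31 opaque bytes:
 * 24 + 8 + 32 = 64 with 8-byte pointers, 12 + 8 + 32 = 52 with 4-byte ones.
 */
#define LIVEPATCH_FUNC_SIZE (3 * sizeof(void *) + 2 * sizeof(uint32_t) + 32)

static_assert(sizeof(struct livepatch_func) == LIVEPATCH_FUNC_SIZE,
              "livepatch_func must not contain hidden padding");
static_assert(offsetof(struct livepatch_func, version) ==
              3 * sizeof(void *) + 2 * sizeof(uint32_t),
              "version must directly follow old_size");
```

A payload build that drifts from the hypervisor's layout then fails at compile time instead of at load time.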
312
* `name` is the symbol name of the old function. It is only used if `old_addr` is
   zero, in which case the hypervisor uses it to resolve the address during
   dynamic linking (when the hypervisor loads the payload).
316
317* `old_addr` is the address of the function to be patched and is filled in at
318  payload generation time if hypervisor function address is known. If unknown,
319  the value *MUST* be zero and the hypervisor will attempt to resolve the address.
320
321* `new_addr` can either have a non-zero value or be zero.
322  * If there is a non-zero value, then it is the address of the function that is
323    replacing the old function and the address is recomputed during relocation.
324    The value **MUST** be the address of the new function in the payload file.
325
  * If the value is zero, then `new_size` bytes at the `old_addr` location
    are NOPed out.

* `old_size` contains the size of the `old_addr` function in bytes.
   The value of `old_size` **MUST** not be zero.
331
332* `new_size` depends on what `new_addr` contains:
  * If `new_addr` contains a non-zero value, then `new_size` has the size of
    the new function (which will replace the one at `old_addr`) in bytes.
335  * If the value of `new_addr` is zero then `new_size` determines how many
336    instruction bytes to NOP (up to opaque size modulo smallest platform
337    instruction - 1 byte x86 and 4 bytes on ARM).
338
339* `version` is to be one.
340
341* `opaque` **MUST** be zero.
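
The field rules above can be collected into one validation routine. The sketch below is one possible reading, not the hypervisor's actual check: `livepatch_func_check` and `MIN_INSN_SIZE` are hypothetical names, and the NOP bound is interpreted as "at most the size of `opaque` and a multiple of the smallest instruction size".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct livepatch_func {
    const char *name;
    void *new_addr;
    void *old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;
    uint8_t opaque[31];
};

#define LIVEPATCH_PAYLOAD_VERSION 1   /* `version` is to be one */
#define MIN_INSN_SIZE 1u              /* assumption: 1 on x86, 4 on ARM */

/* Returns 0 if the entry obeys the field rules, -1 otherwise. */
int livepatch_func_check(const struct livepatch_func *f)
{
    size_t i;

    assert(f != NULL);

    if (f->version != LIVEPATCH_PAYLOAD_VERSION || f->old_size == 0)
        return -1;

    /* The opaque area is reserved for the hypervisor and must be zero. */
    for (i = 0; i < sizeof(f->opaque); i++)
        if (f->opaque[i] != 0)
            return -1;

    /* A zero new_addr requests NOPing out new_size bytes at old_addr. */
    if (f->new_addr == NULL &&
        (f->new_size > sizeof(f->opaque) || f->new_size % MIN_INSN_SIZE != 0))
        return -1;

    return 0;
}
```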
342
343The size of the `livepatch_func` array is determined from the ELF section
344size.
345
346When applying the patch the hypervisor iterates over each `livepatch_func`
347structure and the core code inserts a trampoline at `old_addr` to `new_addr`.
348The `new_addr` is altered when the ELF payload is loaded.
349
350When reverting a patch, the hypervisor iterates over each `livepatch_func`
351and the core code copies the data from the undo buffer (private internal copy)
352to `old_addr`.
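
The apply and revert loops can be illustrated on an ordinary byte buffer. This is a sketch under simplifying assumptions: `struct patch_site`, `apply_all` and `revert_all` are hypothetical names, the undo buffer is stored next to each entry, and nothing here deals with page permissions or CPU synchronisation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRAMPOLINE_LEN 5   /* e9 opcode + 32-bit displacement */

/* Cut-down stand-in for livepatch_func plus the private undo buffer. */
struct patch_site {
    uint8_t *old_addr;              /* start of the function to replace */
    const uint8_t *new_addr;        /* start of the replacement function */
    uint8_t undo[TRAMPOLINE_LEN];   /* original bytes, saved when applying */
};

/* Distance from the end of the trampoline at old_addr to new_addr. */
static int64_t trampoline_disp(const struct patch_site *s)
{
    return (int64_t)((intptr_t)s->new_addr - (intptr_t)s->old_addr)
           - TRAMPOLINE_LEN;
}

/* Apply: save the original bytes, then write "jmp rel32" at old_addr. */
int apply_all(struct patch_site *sites, size_t nr)
{
    size_t i;

    assert(sites != NULL || nr == 0);

    /* Check reachability first so a failure leaves nothing half-patched. */
    for (i = 0; i < nr; i++) {
        int64_t disp = trampoline_disp(&sites[i]);

        if (disp < INT32_MIN || disp > INT32_MAX)
            return -1;
    }

    for (i = 0; i < nr; i++) {
        uint32_t rel = (uint32_t)trampoline_disp(&sites[i]);
        uint8_t *p = sites[i].old_addr;

        memcpy(sites[i].undo, p, TRAMPOLINE_LEN);
        p[0] = 0xe9;
        p[1] = (uint8_t)(rel & 0xff);
        p[2] = (uint8_t)((rel >> 8) & 0xff);
        p[3] = (uint8_t)((rel >> 16) & 0xff);
        p[4] = (uint8_t)((rel >> 24) & 0xff);
    }
    return 0;
}

/* Revert: copy the saved bytes back over each trampoline. */
void revert_all(struct patch_site *sites, size_t nr)
{
    size_t i;

    for (i = 0; i < nr; i++)
        memcpy(sites[i].old_addr, sites[i].undo, TRAMPOLINE_LEN);
}
```

Only the first five bytes of each old function change, so reverting needs no knowledge of the new code.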
353
The payload may optionally contain the addresses of functions to be called right before
the payload is applied and after it is reverted:
356
357 * `.livepatch.hooks.load` - an array of function pointers.
358 * `.livepatch.hooks.unload` - an array of function pointers.
359
360
361### Example of .livepatch.funcs
362
363A simple example of what a payload file can be:
364
365<pre>
/* MUST be in sync with hypervisor. */
struct livepatch_func {
    const char *name;
    void *new_addr;
    void *old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;
    uint8_t opaque[31];
};

/* Our replacement function for xen_extra_version. */
const char *xen_hello_world(void)
{
    return "Hello World";
}

static const char patch_this_fnc[] = "xen_extra_version";

/* The section attribute has to precede the initializer. */
struct livepatch_func __attribute__((__section__(".livepatch.funcs")))
livepatch_hello_world = {
    .version = LIVEPATCH_PAYLOAD_VERSION,
    .name = patch_this_fnc,
    .new_addr = xen_hello_world,
    .old_addr = (void *)0xffff82d08013963c, /* Extracted from xen-syms. */
    .new_size = 13, /* To be computed by scripts. */
    .old_size = 13, /* -----------""---------------  */
};
393
394</pre>
395
396Code must be compiled with -fPIC.
397
398### .livepatch.hooks.load and .livepatch.hooks.unload
399
This section contains an array of function pointers to be executed
before the payload is applied (.livepatch.funcs) or after the payload
is reverted. This is useful to prepare data structures that need to
be modified when patching.

Each entry in this array is eight bytes.

The type definitions of the functions are as follows:
408
409<pre>
410typedef void (*livepatch_loadcall_t)(void);
411typedef void (*livepatch_unloadcall_t)(void);
412</pre>
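
Since the section is a plain array of pointers, running the hooks amounts to a loop over it. The sketch below is illustrative only: `run_hooks` and the two example hooks are hypothetical, and calling in array order is a choice made here, not something this document specifies.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*livepatch_loadcall_t)(void);

/* Call every hook found in a section of `bytes` bytes, in array order. */
void run_hooks(const livepatch_loadcall_t *hooks, size_t bytes)
{
    size_t i, nr = bytes / sizeof(*hooks);

    /* The section is an array, so its size is a multiple of the entry size. */
    assert(bytes % sizeof(*hooks) == 0);

    for (i = 0; i < nr; i++)
        hooks[i]();
}

/* Example hooks that record the order in which they ran. */
int hook_order[2];
int hook_calls;

static void first_hook(void)  { hook_order[hook_calls++] = 1; }
static void second_hook(void) { hook_order[hook_calls++] = 2; }

const livepatch_loadcall_t example_hooks[] = { first_hook, second_hook };
```

The number of hooks is derived from the section size, in the same way the number of `livepatch_func` entries is.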
413
414### .livepatch.depends and .note.gnu.build-id
415
To support dependency checking and safe loading (to load the
appropriate payload against the right hypervisor) there is a need
to embed a build-id dependency.

This is done by the payload containing a section `.livepatch.depends`
which follows the format of an ELF Note. The contents of this
(name, and description) are specific to the linker utilized to
build the hypervisor and payload.
424
425If GNU linker is used then the name is `GNU` and the description
426is a NT_GNU_BUILD_ID type ID. The description can be an SHA1
427checksum, MD5 checksum or any unique value.
428
429The size of these structures varies with the --build-id linker option.
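
For the GNU linker case, extracting the build-id from such a note can be sketched as below. The note layout (three 32-bit words, then the name and description, each padded to four bytes) is the standard ELF one. `find_build_id` is a hypothetical helper that assumes the section holds a single note in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NT_GNU_BUILD_ID 3u   /* note type the GNU linker uses for build-ids */

/* ELF note header; the layout is the same in 32-bit and 64-bit ELF files. */
struct elf_note_hdr {
    uint32_t namesz;   /* size of the name, including its NUL terminator */
    uint32_t descsz;   /* size of the description (the build-id bytes) */
    uint32_t type;
};

/* Name and description are each padded to a 4-byte boundary. */
#define NOTE_ALIGN(x) (((size_t)(x) + 3) & ~(size_t)3)

/*
 * Locate the build-id in a note section of `len` bytes. On success returns
 * 0 and points *id / *id_len at the description bytes inside `sec`.
 */
int find_build_id(const uint8_t *sec, size_t len,
                  const uint8_t **id, uint32_t *id_len)
{
    struct elf_note_hdr hdr;
    size_t desc_off;

    assert(sec != NULL && id != NULL && id_len != NULL);

    if (len < sizeof(hdr))
        return -1;
    memcpy(&hdr, sec, sizeof(hdr));   /* assumes host byte order */

    if (hdr.type != NT_GNU_BUILD_ID || hdr.namesz != sizeof("GNU"))
        return -1;
    desc_off = sizeof(hdr) + NOTE_ALIGN(hdr.namesz);
    if (desc_off > len ||
        memcmp(sec + sizeof(hdr), "GNU", sizeof("GNU")) != 0)
        return -1;
    if (hdr.descsz == 0 || hdr.descsz > len - desc_off)
        return -1;

    *id = sec + desc_off;
    *id_len = hdr.descsz;
    return 0;
}
```

The returned length depends on the `--build-id` style, for example 20 bytes for SHA1, so callers should compare both length and contents.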
430
431## Hypercalls
432
433We will employ the sub operations of the system management hypercall (sysctl).
434There are to be four sub-operations:
435
436 * upload the payloads.
437 * listing of payloads summary uploaded and their state.
 * getting a particular payload summary and its state.
439 * command to apply, delete, or revert the payload.
440
Most of the actions are asynchronous, therefore the caller is responsible
for verifying that the action has been applied properly by retrieving the
summary of the payload and verifying that there are no error codes associated
with it.

We **MUST** make some of them asynchronous because patching requires every
physical CPU to be in lock-step with the others. The patching mechanism,
while an implementation detail, is not a short operation and as such the
design **MUST** assume it will be a long-running operation.
450
451The sub-operations will spell out how preemption is to be handled (if at all).
452
453Furthermore it is possible to have multiple different payloads for the same
function. As such a unique name per payload has to be visible to allow proper manipulation.
455
456The hypercall is part of the `xen_sysctl`. The top level structure contains
457one uint32_t to determine the sub-operations and one padding field which
458*MUST* always be zero.
459
460<pre>
461struct xen_sysctl_livepatch_op {
462    uint32_t cmd;                   /* IN: XEN_SYSCTL_LIVEPATCH_*. */
463    uint32_t pad;                   /* IN: Always zero. */
    union {
        ... see below ...
    } u;
467};
468
469</pre>
while the rest of the hypercall-specific structures are part of this structure.
471
472### Basic type: struct xen_livepatch_name
473
Most of the hypercalls employ a shared structure called `struct xen_livepatch_name`
475which contains:
476
477 * `name` - pointer where the string for the name is located.
478 * `size` - the size of the string
479 * `pad` - padding - to be zero.
480
The structure is as follows:
482
483<pre>
484/*
485 *  Uniquely identifies the payload.  Should be human readable.
486 * Includes the NUL terminator
487 */
488#define XEN_LIVEPATCH_NAME_SIZE 128
489struct xen_livepatch_name {
490    XEN_GUEST_HANDLE_64(char) name;         /* IN, pointer to name. */
    uint16_t size;                          /* IN, size of name. May be up to
                                               XEN_LIVEPATCH_NAME_SIZE. */
493    uint16_t pad[3];                        /* IN: MUST be zero. */
494};
495</pre>
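
The constraints on this structure can be sketched as a small check. The guest handle is replaced by a plain pointer so the example stands alone; `struct livepatch_name_sketch` and `check_name` are hypothetical names. Rejecting names with an embedded NUL is an assumption made here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XEN_LIVEPATCH_NAME_SIZE 128

/* Stand-in: a plain pointer replaces XEN_GUEST_HANDLE_64(char). */
struct livepatch_name_sketch {
    const char *name;
    uint16_t size;      /* length of name, including the NUL terminator */
    uint16_t pad[3];    /* must be zero */
};

/* Returns 0 if the name descriptor is well formed, -1 otherwise. */
int check_name(const struct livepatch_name_sketch *n)
{
    assert(n != NULL);

    if (n->pad[0] != 0 || n->pad[1] != 0 || n->pad[2] != 0)
        return -1;
    if (n->name == NULL || n->size == 0 || n->size > XEN_LIVEPATCH_NAME_SIZE)
        return -1;
    /* The first NUL must be the final byte counted by `size`. */
    if (memchr(n->name, '\0', n->size) != n->name + n->size - 1)
        return -1;
    return 0;
}
```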
496
497### XEN_SYSCTL_LIVEPATCH_UPLOAD (0)
498
499Upload a payload to the hypervisor. The payload is verified
500against basic checks and if there are any issues the proper return code
501will be returned. The payload is not applied at this time - that is
502controlled by *XEN_SYSCTL_LIVEPATCH_ACTION*.
503
504The caller provides:
505
506 * A `struct xen_livepatch_name` called `name` which has the unique name.
507 * `size` the size of the ELF payload (in bytes).
508 * `payload` the virtual address of where the ELF payload is.
509
510The `name` could be an UUID that stays fixed forever for a given
511payload. It can be embedded into the ELF payload at creation time
512and extracted by tools.
513
The return value is zero if the payload was successfully uploaded.
Otherwise an -XEN_EXX return value is provided. Duplicate `name` values are not supported.

The `payload` is the ELF payload as mentioned in the `Design of payload format` section.

The structure is as follows:
520
521<pre>
522struct xen_sysctl_livepatch_upload {
523    xen_livepatch_name_t name;          /* IN, name of the patch. */
524    uint64_t size;                      /* IN, size of the ELF file. */
525    XEN_GUEST_HANDLE_64(uint8) payload; /* IN: ELF file. */
526};
527</pre>
528
529### XEN_SYSCTL_LIVEPATCH_GET (1)
530
Retrieve the status of a specific payload. The caller provides:
532
533 * A `struct xen_livepatch_name` called `name` which has the unique name.
534 * A `struct xen_livepatch_status` structure. The member values will
535   be over-written upon completion.
536
537Upon completion the `struct xen_livepatch_status` is updated.
538
 * `state` - indicates the current status of the payload:
   * *LIVEPATCH_STATUS_CHECKED*  (1) loaded and the ELF payload safety checks passed.
   * *LIVEPATCH_STATUS_APPLIED* (2) loaded, checked, and applied.
   *  No other value is possible.
 * `rc` - -XEN_EXX type errors encountered while performing the last
   LIVEPATCH_ACTION_* operation. The normal values can be zero or -XEN_EAGAIN which
   respectively mean: success or operation in progress. Other values
   imply an error occurred. If there is an error in `rc`, `state` will **NOT**
   have changed.
548
549The return value of the hypercall is zero on success and -XEN_EXX on failure.
(Note that the `rc` value can be different from the return value, as in
rc=-XEN_EAGAIN and return value can be 0).

For example, supposing there is a payload:
554
555<pre>
556 status: LIVEPATCH_STATUS_CHECKED
557 rc: 0
558</pre>
559
We apply an action - LIVEPATCH_ACTION_REVERT - to revert it (which won't work
as we have not even applied it). Afterwards we will have:
562
563<pre>
564 status: LIVEPATCH_STATUS_CHECKED
565 rc: -XEN_EINVAL
566</pre>
567
568It has failed but it remains loaded.
569
570This operation is synchronous and does not require preemption.
571
The structure is as follows:
573
574<pre>
575struct xen_livepatch_status {
576#define LIVEPATCH_STATUS_CHECKED      1
577#define LIVEPATCH_STATUS_APPLIED      2
    uint32_t state;                 /* OUT: LIVEPATCH_STATUS_*. */
579    int32_t rc;                     /* OUT: 0 if no error, otherwise -XEN_EXX. */
580};
581
582struct xen_sysctl_livepatch_get {
583    xen_livepatch_name_t name;      /* IN, the name of the payload. */
584    xen_livepatch_status_t status;  /* IN/OUT: status of the payload. */
585};
586</pre>
587
588### XEN_SYSCTL_LIVEPATCH_LIST (2)
589
590Retrieve an array of abbreviated status and names of payloads that are loaded in the
591hypervisor.
592
593The caller provides:
594
595 * `version`. Version of the payload. Caller should re-use the field provided by
596    the hypervisor. If the value differs the data is stale.
597 * `idx` index iterator. The index into the hypervisor's payload count. It is
598    recommended that on first invocation zero be used so that `nr` (which the
599    hypervisor will update with the remaining payload count) be provided.
600    Also the hypervisor will provide `version` with the most current value.
601 * `nr` the max number of entries to populate. Can be zero which will result
602    in the hypercall being a probing one and return the number of payloads
603    (and update the `version`).
604 * `pad` - *MUST* be zero.
605 * `status` virtual address of where to write `struct xen_livepatch_status`
606   structures. Caller *MUST* allocate up to `nr` of them.
607 * `name` - virtual address of where to write the unique name of the payload.
608   Caller *MUST* allocate up to `nr` of them. Each *MUST* be of
609   **XEN_LIVEPATCH_NAME_SIZE** size. Note that **XEN_LIVEPATCH_NAME_SIZE** includes
610   the NUL terminator.
611 * `len` - virtual address of where to write the length of each unique name
612   of the payload. Caller *MUST* allocate up to `nr` of them. Each *MUST* be
613   of sizeof(uint32_t) (4 bytes).
614
If the hypercall returns a positive number, it is the number (up to `nr`
provided to the hypercall) of the payloads returned, along with `nr` updated
with the number of remaining payloads, `version` updated (it may be the same
across hypercalls - if it varies the data is stale and further calls could
fail). The `status`, `name`, and `len` are updated at their designated index
value (`idx`) with the returned value of data.
621
622If the hypercall returns -XEN_E2BIG the `nr` is too big and should be
623lowered.
624
If the hypercall returns a zero value there are no more payloads.
626
627Note that due to the asynchronous nature of hypercalls the control domain might
628have added or removed a number of payloads making this information stale. It is
629the responsibility of the toolstack to use the `version` field to check
between each invocation. If the version differs it should discard the stale
631data and start from scratch. It is OK for the toolstack to use the new
632`version` field.
633
The `struct xen_livepatch_status` structure contains the status of a payload, which includes:

 * `state` - indicates the current status of the payload:
   * *LIVEPATCH_STATUS_CHECKED*  (1) loaded and the ELF payload safety checks passed.
   * *LIVEPATCH_STATUS_APPLIED* (2) loaded, checked, and applied.
   *  No other value is possible.
 * `rc` - -XEN_EXX type errors encountered while performing the last
   LIVEPATCH_ACTION_* operation. The normal values can be zero or -XEN_EAGAIN which
   respectively mean: success or operation in progress. Other values
   imply an error occurred. If there is an error in `rc`, `state` will **NOT**
   have changed.

The structure is as follows:
647
648<pre>
649struct xen_sysctl_livepatch_list {
650    uint32_t version;                       /* OUT: Hypervisor stamps value.
651                                               If varies between calls, we are
652                                               getting stale data. */
653    uint32_t idx;                           /* IN: Index into hypervisor list. */
654    uint32_t nr;                            /* IN: How many status, names, and len
655                                               should be filled out. Can be zero to get
656                                               amount of payloads and version.
657                                               OUT: How many payloads left. */
658    uint32_t pad;                           /* IN: Must be zero. */
    XEN_GUEST_HANDLE_64(xen_livepatch_status_t) status;  /* OUT. Must have enough
                                               space allocated for nr of them. */
    XEN_GUEST_HANDLE_64(char) name;         /* OUT: Array of names. Each member
                                               MUST be XEN_LIVEPATCH_NAME_SIZE in size.
                                               Must have nr of them. */
664    XEN_GUEST_HANDLE_64(uint32) len;        /* OUT: Array of lengths of name's.
665                                               Must have nr of them. */
666};
667</pre>
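
A toolstack would typically walk the list in batches and restart when `version` changes. The sketch below shows that loop against a toy backend. `struct list_req`, `fake_list`, `fetch_all`, `BATCH` and `MAX_RESTARTS` are all hypothetical. Only the state array is modelled, and the names and lengths would be handled the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH        2u  /* entries requested per call */
#define MAX_RESTARTS 3   /* give up if the list keeps changing */

/* Cut-down stand-in for xen_sysctl_livepatch_list (states only). */
struct list_req {
    uint32_t version;   /* OUT: stamp of the hypervisor's list */
    uint32_t idx;       /* IN: index to start from */
    uint32_t nr;        /* IN: capacity of `state`; OUT: payloads remaining */
    uint32_t *state;    /* OUT: one state per returned payload */
};

typedef int (*list_fn)(struct list_req *req);

/* Toy backend standing in for the hypercall: five payloads, version 7. */
int fake_list(struct list_req *req)
{
    static const uint32_t states[] = { 1, 2, 2, 1, 2 };
    const uint32_t total = sizeof(states) / sizeof(states[0]);
    uint32_t n = 0;

    req->version = 7;
    while (req->idx + n < total && n < req->nr) {
        req->state[n] = states[req->idx + n];
        n++;
    }
    req->nr = req->idx + n < total ? total - (req->idx + n) : 0;
    return (int)n;
}

/* Fetch up to `cap` states; returns how many were stored, or a negative error. */
int fetch_all(list_fn op, uint32_t *states, uint32_t cap)
{
    int attempt;

    assert(op != NULL && states != NULL);

    for (attempt = 0; attempt <= MAX_RESTARTS; attempt++) {
        uint32_t got = 0, version = 0;
        int first = 1, stale = 0;

        while (got < cap) {
            struct list_req req = { 0 };
            int rc;

            req.version = version;
            req.idx = got;
            req.nr = cap - got < BATCH ? cap - got : BATCH;
            req.state = states + got;
            rc = op(&req);
            if (rc < 0)
                return rc;
            if (!first && req.version != version) {
                stale = 1;      /* list changed: discard and restart */
                break;
            }
            version = req.version;
            first = 0;
            if (rc == 0)
                break;          /* no more payloads */
            got += (uint32_t)rc;
            if (req.nr == 0)
                break;          /* nothing left after this batch */
        }
        if (!stale)
            return (int)got;
    }
    return -1;
}
```

With the toy backend, `fetch_all(fake_list, buf, 8)` fills five entries in three calls of at most two entries each.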
668
669### XEN_SYSCTL_LIVEPATCH_ACTION (3)
670
671Perform an operation on the payload structure referenced by the `name` field.
672The operation request is asynchronous and the status should be retrieved
673by using either **XEN_SYSCTL_LIVEPATCH_GET** or **XEN_SYSCTL_LIVEPATCH_LIST** hypercall.
674
675The caller provides:
676
 * A `struct xen_livepatch_name` `name` containing the unique name.
 * `cmd` the command requested:
   * *LIVEPATCH_ACTION_UNLOAD* (1) unload the payload.
     Any further hypercalls against the `name` will result in failure unless
     the **XEN_SYSCTL_LIVEPATCH_UPLOAD** hypercall is performed with the same `name`.
   * *LIVEPATCH_ACTION_REVERT* (2) revert the payload. If the operation takes
     more time than the upper bound of time the `rc` in `xen_livepatch_status`
     retrieved via **XEN_SYSCTL_LIVEPATCH_GET** will be -XEN_EBUSY.
   * *LIVEPATCH_ACTION_APPLY* (3) apply the payload. If the operation takes
     more time than the upper bound of time the `rc` in `xen_livepatch_status`
     retrieved via **XEN_SYSCTL_LIVEPATCH_GET** will be -XEN_EBUSY.
   * *LIVEPATCH_ACTION_REPLACE* (4) revert all applied payloads and apply this
     payload. If the operation takes more time than the upper bound of time
     the `rc` in `xen_livepatch_status` retrieved via **XEN_SYSCTL_LIVEPATCH_GET**
     will be -XEN_EBUSY.
 * `time` the upper bound of time (ns) the cmd should take. Zero means to use
   the hypervisor default. If the operation does not succeed within that time
   it goes into an error state.
 * `pad` - *MUST* be zero.
696
697The return value will be zero unless the provided fields are incorrect.
698
The structure is as follows:
700
701<pre>
702#define LIVEPATCH_ACTION_UNLOAD  1
703#define LIVEPATCH_ACTION_REVERT  2
704#define LIVEPATCH_ACTION_APPLY   3
705#define LIVEPATCH_ACTION_REPLACE 4
706struct xen_sysctl_livepatch_action {
707    xen_livepatch_name_t name;              /* IN, name of the patch. */
708    uint32_t cmd;                           /* IN: LIVEPATCH_ACTION_* */
709    uint32_t time;                          /* IN: If zero then uses */
710                                            /* hypervisor default. */
711                                            /* Or upper bound of time (ns) */
712                                            /* for operation to take. */
713};
714
715</pre>
716
717## State diagrams of LIVEPATCH_ACTION commands.
718
There is a strict ordering of the states the commands can move between.
The LIVEPATCH_ACTION prefix has been dropped to ease reading and
the diagram does not include the LIVEPATCH_STATUS values:
722
723<pre>
724              /->\
725              \  /
726 UNLOAD <--- CHECK ---> REPLACE|APPLY --> REVERT --\
727                \                                  |
728                 \-------------------<-------------/
729
730</pre>
731## State transition table of LIVEPATCH_ACTION commands and LIVEPATCH_STATUS.
732
733Note that:
734
735 - The CHECKED state is the starting one achieved with *XEN_SYSCTL_LIVEPATCH_UPLOAD* hypercall.
736 - The REVERT operation on success will automatically move to the CHECKED state.
737 - There are two STATES: CHECKED and APPLIED.
738 - There are four actions (aka commands): APPLY, REPLACE, REVERT, and UNLOAD.
739
740The state transition table of valid states and action states:
741
742<pre>
743
744+---------+---------+--------------------------------+-------+--------+
745| ACTION  | Current | Result                         | Next STATE:    |
|         | STATE   |                                |CHECKED|APPLIED |
+---------+---------+--------------------------------+-------+--------+
748| UNLOAD  | CHECKED | Unload payload. Always works.  |       |        |
749|         |         | No next states.                |       |        |
750+---------+---------+--------------------------------+-------+--------+
751| APPLY   | CHECKED | Apply payload (success).       |       |   x    |
752+---------+---------+--------------------------------+-------+--------+
753| APPLY   | CHECKED | Apply payload (error|timeout)  |   x   |        |
754+---------+---------+--------------------------------+-------+--------+
755| REPLACE | CHECKED | Revert payloads and apply new  |       |   x    |
756|         |         | payload with success.          |       |        |
757+---------+---------+--------------------------------+-------+--------+
758| REPLACE | CHECKED | Revert payloads and apply new  |   x   |        |
759|         |         | payload with error.            |       |        |
760+---------+---------+--------------------------------+-------+--------+
761| REVERT  | APPLIED | Revert payload (success).      |   x   |        |
762+---------+---------+--------------------------------+-------+--------+
763| REVERT  | APPLIED | Revert payload (error|timeout) |       |   x    |
764+---------+---------+--------------------------------+-------+--------+
765</pre>
766
767All the other state transitions are invalid.
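
As an illustration only, the table above can be expressed as a small
validation function. The enum and function names below are hypothetical
and chosen for this sketch; they are not taken from the Xen headers.

```c
#include <stdbool.h>

/* Hypothetical names for this sketch; they mirror the table, not Xen's headers. */
enum lp_state  { LP_CHECKED, LP_APPLIED };
enum lp_action { LP_UNLOAD, LP_APPLY, LP_REPLACE, LP_REVERT };

/*
 * Returns true if `action` is valid in state `cur`. For a valid action,
 * *next receives the resulting state; `ok` says whether the operation
 * itself succeeded. UNLOAD has no next state, so *next is left untouched.
 */
bool lp_transition(enum lp_state cur, enum lp_action action, bool ok,
                   enum lp_state *next)
{
    switch (action) {
    case LP_UNLOAD:
        return cur == LP_CHECKED;
    case LP_APPLY:
    case LP_REPLACE:
        if (cur != LP_CHECKED)
            return false;
        *next = ok ? LP_APPLIED : LP_CHECKED;
        return true;
    case LP_REVERT:
        if (cur != LP_APPLIED)
            return false;
        *next = ok ? LP_CHECKED : LP_APPLIED;
        return true;
    }
    return false; /* unknown action */
}
```

Any combination for which the function returns false corresponds to an
invalid transition.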
768
769## Sequence of events.
770
771The normal sequence of events is to:
772
 1. *XEN_SYSCTL_LIVEPATCH_UPLOAD* to upload the payload. If there are errors *STOP* here.
 2. *XEN_SYSCTL_LIVEPATCH_GET* to check the `->rc`. If it is *-XEN_EAGAIN*, spin. If zero, go to the next step.
 3. *XEN_SYSCTL_LIVEPATCH_ACTION* with *LIVEPATCH_ACTION_APPLY* to apply the patch.
 4. *XEN_SYSCTL_LIVEPATCH_GET* to check the `->rc`. If it is *-XEN_EAGAIN*, spin. If zero, exit with success.
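
The four steps could be driven by a loop along the following lines. This
is a minimal sketch: `struct lp_ops` and its callbacks are hypothetical
stand-ins for whatever wrappers the toolstack provides around the
hypercalls, the numeric value of `XEN_EAGAIN` is assumed, and the mock
backend exists only so the flow can be exercised without a hypervisor.

```c
#include <stddef.h>

#define XEN_EAGAIN 11  /* value assumed for this sketch */

/* Hypothetical wrappers around the three hypercalls; supplied by the caller. */
struct lp_ops {
    int (*upload)(const char *name, const void *payload, size_t len);
    int (*get_rc)(const char *name);   /* returns the payload's ->rc */
    int (*apply)(const char *name);    /* LIVEPATCH_ACTION_APPLY */
};

/* Poll ->rc until it stops being -XEN_EAGAIN, at most max_polls times. */
static int lp_wait(const struct lp_ops *ops, const char *name, int max_polls)
{
    int rc = -XEN_EAGAIN;

    for (int i = 0; i < max_polls && rc == -XEN_EAGAIN; i++)
        rc = ops->get_rc(name);
    return rc;
}

int lp_upload_and_apply(const struct lp_ops *ops, const char *name,
                        const void *payload, size_t len, int max_polls)
{
    int rc = ops->upload(name, payload, len);   /* step 1 */
    if (rc)
        return rc;
    rc = lp_wait(ops, name, max_polls);         /* step 2 */
    if (rc)
        return rc;
    rc = ops->apply(name);                      /* step 3 */
    if (rc)
        return rc;
    return lp_wait(ops, name, max_polls);       /* step 4 */
}

/* Mock backend for illustration: ->rc reads -XEN_EAGAIN twice, then 0. */
static int mock_polls;

static int mock_upload(const char *n, const void *p, size_t l)
{
    (void)n; (void)p; (void)l;
    mock_polls = 0;
    return 0;
}

static int mock_get_rc(const char *n)
{
    (void)n;
    return (++mock_polls % 3) ? -XEN_EAGAIN : 0;
}

static int mock_apply(const char *n)
{
    (void)n;
    return 0;
}

const struct lp_ops lp_mock_ops = { mock_upload, mock_get_rc, mock_apply };
```

Bounding the number of polls is a choice made for the sketch; a real
caller might instead sleep between polls or honour the `time` field of
the action.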
777
778
779## Addendum
780
781Implementation quirks should not be discussed in a design document.
782
783However these observations can provide aid when developing against this
784document.
785
786
787### Alternative assembler
788
Alternative assembler is a mechanism to use different instructions depending
on what the CPU supports. This is done by providing multiple streams of code
that can be patched in - or if the CPU does not support it - padded with
`nop` operations. The alternative assembler macros cause the compiler to
emit the most generic code in place and to tag the location in a special
ELF section. At run-time the hypervisor can leave these areas alone or
patch them with better suited opcodes.

Note that patching functions that copy to or from guest memory requires
live patching to support alternatives. For example this can be due to SMAP
(specifically the *stac* and *clac* instructions) which is enabled on Broadwell
and later architectures. Other alternative instructions may be involved as well.
801
802### When to patch
803
During the discussion on the design two candidates bubbled up where
the call stack for each CPU would be deterministic. This would
minimize the chance of the patch not being applied due to safety
checks failing - for example the check that code which is on the
stack is not patched, as patching it can lead to corruption.
809
810#### Rendezvous code instead of stop_machine for patching
811
The hypervisor's time rendezvous code runs synchronously across all CPUs
every second. Using stop_machine to patch can stall the time rendezvous
code and result in an NMI. As such, doing the patching at the tail
of the rendezvous code should avoid this problem.

However the entry point for that code is
818do_softirq->timer_softirq_action->time_calibration
819which ends up calling on_selected_cpus on remote CPUs.
820
821The remote CPUs receive CALL_FUNCTION_VECTOR IPI and execute the
822desired function.
823
824#### Before entering the guest code.
825
826Before we call VMXResume we check whether any soft IRQs need to be executed.
827This is a good spot because all Xen stacks are effectively empty at
828that point.
829
To rendezvous all the CPUs, a barrier with a maximum timeout (which
could be adjusted), combined with forcing all other CPUs through the
hypervisor with IPIs, can be utilized to execute lockstep instructions
on all CPUs.
834
835The approach is similar in concept to stop_machine and the time rendezvous
836but is time-bound. However the local CPU stack is much shorter and
837a lot more deterministic.
838
839This is implemented in the Xen Project hypervisor.
840
841### Compiling the hypervisor code
842
843Hotpatch generation often requires support for compiling the target
844with -ffunction-sections / -fdata-sections.  Changes would have to
845be done to the linker scripts to support this.
846
847### Generation of Live Patch ELF payloads
848
The design of the payload generation is not discussed in this document.

This is implemented in a separate tool which lives in a separate
GIT repo.

Currently it resides at git://xenbits.xen.org/livepatch-build-tools.git
855
856### Exception tables and symbol tables growth
857
858We may need support for adapting or augmenting exception tables if
859patching such code.  Hotpatches may need to bring their own small
860exception tables (similar to how Linux modules support this).
861
862If supporting hotpatches that introduce additional exception-locations
863is not important, one could also change the exception table in-place
864and reorder it afterwards.
865
In practice almost every patch (XSA) to a non-trivial function has been
found to require additional entries in the exception table and/or the
bug frames.
868
869This is implemented in the Xen Project hypervisor.
870
871### .rodata sections
872
The patching might require strings to be updated as well. As such we must
also be able to patch the strings as needed. This sounds simple - but the compiler
has a habit of coalescing strings that are the same - which means if we in-place
alter the strings - other users will be inadvertently affected as well.

This is also where pointers to functions and switch-style jump tables
live - and we may need to patch these as well.
880
881To guard against that we must be prepared to do patching similar to
882trampoline patching or in-line depending on the flavour. If we can
883do in-line patching we would need to:
884
885 * alter `.rodata` to be writeable.
886 * inline patch.
887 * alter `.rodata` to be read-only.
888
If we are doing trampoline patching we would need to:
890
891 * allocate a new memory location for the string.
892 * all locations which use this string will have to be updated to use the
893   offset to the string.
894 * mark the region RO when we are done.
895
896The trampoline patching is implemented in the Xen Project hypervisor.
897
898### .bss and .data sections.
899
In-place patching of writable data is not suitable as it is unclear what should
be done depending on the current state of the data. As such it should not be attempted.
902
903However, functions which are being patched can bring in changes to strings
904(.data or .rodata section changes), or even to .bss sections.
905
906As such the ELF payload can introduce new .rodata, .bss, and .data sections.
907Patching in the new function will end up also patching in the new .rodata
908section and the new function will reference the new string in the new
909.rodata section.
910
911This is implemented in the Xen Project hypervisor.
912
913### Security
914
915Only the privileged domain should be allowed to do this operation.
916
917### Live patch interdependencies
918
Interdependencies between live patches are tricky.

There are three ways this can be addressed:
 * A single large patch that subsumes and replaces all previous ones.
   Over the life-time of patching the hypervisor this large patch
   grows to accumulate all the code changes.
 * Hotpatch stack - where a mechanism exists that loads the hotpatches
   in the same order they were built in. We would need a build-id
   of the hypervisor to make sure the hot-patches are built against the
   correct build.
 * Payload containing the old code to check against that. That allows
   the hotpatches to be loaded independently (if they don't overlap) - or
   if the old code also contains previously patched code - even if they
   overlap.
933
The disadvantage of the first option (a single large patch) is that it can
grow over time and does not provide a bisection mechanism to identify
faulty patches.

The hot-patch stack puts strict requirements on the order of the patches
being loaded and requires a hypervisor build-id to match against.

Checking against the old code allows much more flexibility and provides an
additional guard, but is more complex to implement.

The second option, which requires a build-id of the hypervisor,
is implemented in the Xen Project hypervisor.
945
946Specifically each payload has two build-id ELF notes:
947 * The build-id of the payload itself (generated via --build-id).
 * The build-id of the payload it depends on (extracted from
   the previous payload or hypervisor during build time).
950
951This means that the very first payload depends on the hypervisor
952build-id.
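
The resulting dependency check can be sketched as follows. The
`struct build_id` representation and the function names are assumptions
made for this example; the sketch only shows the comparison, not how the
ELF notes are located.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical representation of the contents of a build-id ELF note. */
struct build_id {
    const unsigned char *bytes;
    size_t len;
};

static bool build_id_equal(const struct build_id *a, const struct build_id *b)
{
    return a->len == b->len && memcmp(a->bytes, b->bytes, a->len) == 0;
}

/*
 * A payload may be applied only on top of the build-id it depends on:
 * the hypervisor's when nothing is applied yet, otherwise that of the
 * most recently applied payload (the top of the stack).
 */
bool lp_dependency_ok(const struct build_id *hypervisor,
                      const struct build_id *applied_top, /* NULL if none */
                      const struct build_id *payload_depends)
{
    const struct build_id *expected = applied_top ? applied_top : hypervisor;

    return build_id_equal(expected, payload_depends);
}
```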
953
954# Not Yet Done
955
956This is for further development of live patching.
957
958## TODO Goals
959
The implementation must also be able to do the following (in no particular order):

 * Look up in the Xen hypervisor the symbol names of functions from the
   ELF payload. (Either as `symbol` or `symbol`+`offset`).
964 * Be able to patch .rodata, .bss, and .data sections.
965 * Deal with NMI/MCE checks during patching instead of ignoring them.
966 * Further safety checks (blacklist of which functions cannot be patched, check
967   the stack, make sure the payload is built with same compiler as hypervisor).
968   Specifically we want to make sure that live patching codepaths cannot be patched.
969 * NOP out the code sequence if `new_size` is zero.
970 * Deal with other relocation types:  R_X86_64_[8,16,32,32S], R_X86_64_PC[8,16,64]
971   in payload file.
972
973### Handle inlined __LINE__
974
975This problem is related to hotpatch construction
976and potentially has influence on the design of the hotpatching
977infrastructure in Xen.
978
979For example:
980
981We have file1.c with functions f1 and f2 (in that order).  f2 contains a
982BUG() (or WARN()) macro and at that point embeds the source line number
983into the generated code for f2.
984
985Now we want to hotpatch f1 and the hotpatch source-code patch adds 2
986lines to f1 and as a consequence shifts out f2 by two lines.  The newly
987constructed file1.o will now contain differences in both binary
988functions f1 (because we actually changed it with the applied patch) and
989f2 (because the contained BUG macro embeds the new line number).
990
991Without additional information, an algorithm comparing file1.o before
992and after hotpatch application will determine both functions to be
993changed and will have to include both into the binary hotpatch.
994
995Options:
996
9971. Transform source code patches for hotpatches to be line-neutral for
998   each chunk.  This can be done in almost all cases with either
999   reformatting of the source code or by introducing artificial
1000   preprocessor "#line n" directives to adjust for the introduced
1001   differences.
1002
   This approach is low-tech and simple.  Generated backtraces and
   existing debug information would refer to the original build and
   would not reflect the hotpatching state except for actually
   hotpatched functions, but should be mostly correct.
1007
10082. Ignoring the problem and living with artificially large hotpatches
1009   that unnecessarily patch many functions.
1010
   This approach might lead to some very large hotpatches depending on
   the content of a specific source file.  It may also trigger pulling
   functions into the hotpatch that cannot reasonably be hotpatched due
   to limitations of a hotpatching framework (init-sections, parts of
   the hotpatching framework itself, ...) and may thereby prevent us
   from patching a specific problem.

   The decision between 1. and 2. can be made on a patch-by-patch
   basis.
1020
10213. Introducing an indirection table for storing line numbers and
1022   treating that specially for binary diffing. Linux may follow
1023   this approach.
1024
1025   We might either use this indirection table for runtime use and patch
1026   that with each hotpatch (similarly to exception tables) or we might
1027   purely use it when building hotpatches to ignore functions that only
1028   differ at exactly the location where a line-number is embedded.
1029
1030For BUG(), WARN(), etc., the line number is embedded into the bug frame, not
1031the function itself.
1032
1033Similar considerations are true to a lesser extent for __FILE__, but it
1034could be argued that file renaming should be done outside of hotpatches.
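
To make option 1 concrete, the fragment below shows how a `#line`
directive can keep the numbering of an untouched function stable after
lines were added above it. The function names and the value 100 are
purely illustrative; `f2_line` simply returns `__LINE__` as a stand-in
for the line number a BUG() or WARN() would embed.

```c
/* f1 as patched: two lines longer than in the original build. */
int f1(int x)
{
    if (x < 0)      /* added by the hotpatch */
        return 0;   /* added by the hotpatch */
    return x + 1;
}

/* Restore the numbering of the original file so f2 is unaffected. */
#line 100
int f2_line(void)
{
    return __LINE__;   /* two lines after the one numbered 100 */
}
```

With the directive in place the value embedded in the second function no
longer depends on how many lines the patch added to the first one.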
1035
1036## Signature checking requirements.
1037
The signature checking requires that the layout of the data in memory
**MUST** be the same for the signature to be verified. This means that the payload
data layout in ELF format **MUST** match what the hypervisor would be
expecting such that it can properly do signature verification.

The signature is based on all of the payload laid out contiguously
in memory. The signature is to be appended at the end of the ELF payload
prefixed with the string `'~Module signature appended~\n'`, followed by
a signature header, then followed by the signature, key identifier, and signer's
name.
1048
1049Specifically the signature header would be:
1050
1051<pre>
1052#define PKEY_ALGO_DSA       0
1053#define PKEY_ALGO_RSA       1
1054
1055#define PKEY_ID_PGP         0 /* OpenPGP generated key ID */
1056#define PKEY_ID_X509        1 /* X.509 arbitrary subjectKeyIdentifier */
1057
1058#define HASH_ALGO_MD4          0
1059#define HASH_ALGO_MD5          1
1060#define HASH_ALGO_SHA1         2
1061#define HASH_ALGO_RIPE_MD_160  3
1062#define HASH_ALGO_SHA256       4
1063#define HASH_ALGO_SHA384       5
1064#define HASH_ALGO_SHA512       6
1065#define HASH_ALGO_SHA224       7
1066#define HASH_ALGO_RIPE_MD_128  8
1067#define HASH_ALGO_RIPE_MD_256  9
1068#define HASH_ALGO_RIPE_MD_320 10
1069#define HASH_ALGO_WP_256      11
1070#define HASH_ALGO_WP_384      12
1071#define HASH_ALGO_WP_512      13
1072#define HASH_ALGO_TGR_128     14
1073#define HASH_ALGO_TGR_160     15
1074#define HASH_ALGO_TGR_192     16
1075
1076
1077struct elf_payload_signature {
1078	u8	algo;		/* Public-key crypto algorithm PKEY_ALGO_*. */
1079	u8	hash;		/* Digest algorithm: HASH_ALGO_*. */
1080	u8	id_type;	/* Key identifier type PKEY_ID*. */
1081	u8	signer_len;	/* Length of signer's name */
1082	u8	key_id_len;	/* Length of key identifier */
1083	u8	__pad[3];
1084	__be32	sig_len;	/* Length of signature data */
1085};
1086
1087</pre>
(Note that this has been borrowed from Linux module signature code.)
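
A rough sketch of locating and decoding this header is shown below. It
assumes the layout exactly as worded above (payload, magic string,
12-byte header, then the variable-length data) and does not depend on the
order of the trailing fields. `struct sig_info` and `sig_parse` are
hypothetical names introduced for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIG_MAGIC "~Module signature appended~\n"

struct sig_info {            /* hypothetical parsed view of the header */
    uint8_t algo, hash, id_type, signer_len, key_id_len;
    uint32_t sig_len;
    size_t payload_len;      /* bytes of ELF payload before the magic */
};

/*
 * Returns 0 on success, -1 if no magic string is found or if the lengths
 * in the header do not account for all bytes that follow it.
 */
int sig_parse(const uint8_t *buf, size_t len, struct sig_info *out)
{
    const size_t mlen = sizeof(SIG_MAGIC) - 1, hlen = 12;

    if (len < mlen + hlen)
        return -1;

    /* Search backwards for the last occurrence of the magic string. */
    for (size_t off = len - mlen - hlen + 1; off-- > 0; ) {
        const uint8_t *h = buf + off + mlen;

        if (memcmp(buf + off, SIG_MAGIC, mlen) != 0)
            continue;

        out->algo = h[0];
        out->hash = h[1];
        out->id_type = h[2];
        out->signer_len = h[3];
        out->key_id_len = h[4];
        /* h[5..7] is padding; sig_len is stored big-endian. */
        out->sig_len = (uint32_t)h[8] << 24 | (uint32_t)h[9] << 16 |
                       (uint32_t)h[10] << 8 | (uint32_t)h[11];
        out->payload_len = off;

        /* The trailing data must account for every remaining byte. */
        if ((uint64_t)out->signer_len + out->key_id_len + out->sig_len !=
            len - off - mlen - hlen)
            return -1;
        return 0;
    }
    return -1;
}
```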
1089
1090
1091### .bss and .data sections.
1092
In-place patching of writable data is not suitable as it is unclear what should
be done depending on the current state of the data. As such it should not be attempted.
1095
1096That said we should provide hook functions so that the existing data
1097can be changed during payload application.
1098
To guarantee safety we disallow re-applying a payload after it has been
reverted. This is because we cannot guarantee that the state of .bss
and .data is exactly as it was during loading. Hence the administrator
MUST unload the payload and upload it again to apply it.

There is an exception to this: if the payload only has .livepatch.funcs
and the .data or .bss sections are of zero length.
1106
1107### Inline patching
1108
1109The hypervisor should verify that the in-place patching would fit within
1110the code or data.
1111
1112### Trampoline (e9 opcode), x86
1113
The e9 opcode used for jmpq uses a 32-bit signed displacement. That means
the new code must be placed within +/- 2GB of virtual address space
from the old code. That should not be a problem since the Xen hypervisor has
a very small footprint.

However if needed we can always chain two trampolines: one at the 2GB
limit that jumps to the next trampoline.

Please note there is a small limitation for trampolines in
function entries: The target function (+ trailing padding) must be able
to accommodate the trampoline. On x86 with +-2 GB relative jumps,
this means 5 bytes are required which means that `old_size` **MUST** be
at least five bytes if patching in trampoline.
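
The two constraints (a 5-byte instruction and a signed 32-bit
displacement) can be illustrated with the sketch below. The function name
is hypothetical; the displacement is computed relative to the end of the
jump instruction, as the x86 `jmp rel32` encoding requires.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_JMP_REL32_LEN 5   /* opcode byte 0xe9 + 32-bit displacement */

/*
 * Build the bytes of `jmp new_addr` as if placed at old_addr.
 * Fails if old_size is too small or the target is out of rel32 range.
 */
bool x86_make_trampoline(uint64_t old_addr, uint32_t old_size,
                         uint64_t new_addr, uint8_t insn[X86_JMP_REL32_LEN])
{
    /* The displacement is relative to the end of the jmp instruction. */
    int64_t disp = (int64_t)(new_addr - (old_addr + X86_JMP_REL32_LEN));
    uint32_t rel32;

    if (old_size < X86_JMP_REL32_LEN ||
        disp < INT32_MIN || disp > INT32_MAX)
        return false;

    rel32 = (uint32_t)disp;
    insn[0] = 0xe9;
    insn[1] = rel32 & 0xff;           /* displacement is little-endian */
    insn[2] = (rel32 >> 8) & 0xff;
    insn[3] = (rel32 >> 16) & 0xff;
    insn[4] = (rel32 >> 24) & 0xff;
    return true;
}
```

For example, a function at 0x1000 redirected to 0x2000 gets the
displacement 0x2000 - 0x1005 = 0xffb.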
1127
1128Depending on compiler settings, there are several functions in Xen that
1129are smaller (without inter-function padding).
1130
1131<pre>
1132readelf -sW xen-syms | grep " FUNC " | \
1133    awk '{ if ($3 < 5) print $3, $4, $5, $8 }'
1134
1135...
11363 FUNC LOCAL wbinvd_ipi
11373 FUNC LOCAL shadow_l1_index
1138...
1139</pre>
1140A compile-time check for, e.g., a minimum alignment of functions or a
1141runtime check that verifies symbol size (+ padding to next symbols) for
1142that in the hypervisor is advised.
1143
1144The tool for generating payloads currently does perform a compile-time
1145check to ensure that the function to be replaced is large enough.
1146
1147#### Trampoline, ARM
1148
The unconditional branch instruction with the proper offset is used to
branch to the new code (for the encoding see the DDI 0406C.c and
DDI 0487A.j Architecture Reference Manuals).
This means that `old_size` **MUST** be at least four bytes if patching
in trampoline.

The branch displacement is limited to +/- 32MB on ARM32
and to +/- 128MB on ARM64.
1157
1158The new code is placed in the 8M - 10M virtual address space while the
1159Xen code is in 2M - 4M. That gives us enough space.
1160
1161The hypervisor also checks the displacement during loading of the payload.
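
A displacement check of this kind might look as follows for ARM64. This
is only a sketch with a hypothetical function name; it encodes the
AArch64 `B` instruction, whose 26-bit immediate counts 4-byte words
relative to the address of the branch itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Encode an AArch64 `B` from old_addr to new_addr; false if out of range. */
bool arm64_make_branch(uint64_t old_addr, uint64_t new_addr, uint32_t *insn)
{
    int64_t off = (int64_t)(new_addr - old_addr);
    const int64_t range = INT64_C(128) << 20;      /* 128MB */

    /* Target must be word aligned and within [-128MB, +128MB). */
    if ((off % 4) != 0 || off < -range || off >= range)
        return false;

    *insn = 0x14000000u | ((uint32_t)(off / 4) & 0x03ffffffu);
    return true;
}
```

With the layout mentioned above (old code at 2M, new code at 8M) the
offset is 6M, well inside the limit.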
1162