# Xenstore Migration

## Background

The design for *Non-Cooperative Migration of Guests*[1] explains that extra
save records are required in the migration stream to allow a guest running PV
drivers to be migrated without its co-operation. Moreover, the save records
must include details of registered xenstore watches as well as content;
information that cannot currently be recovered from `xenstored`, and hence
some extension to the xenstored implementations will also be required.

As a similar set of data is needed for transferring xenstore data from one
instance to another when live updating xenstored, this document proposes an
image format for a 'migration stream' suitable for both purposes.

## Proposal

The image format consists of a _header_ followed by 1 or more _records_. Each
record consists of a type and length field, followed by any data mandated by
the record type. At minimum there will be a single record of type `END`
(defined below).

### Header

The header identifies the stream as a `xenstore` stream, including the version
of the specification that it complies with.

All fields in this header must be in _big-endian_ byte order, regardless of
the setting of the endianness bit.


```
    0       1       2       3       4       5       6       7    octet
+-------+-------+-------+-------+-------+-------+-------+-------+
| ident                                                         |
+-------------------------------+-------------------------------+
| version                       | flags                         |
+-------------------------------+-------------------------------+
```


| Field     | Description                                       |
|-----------|---------------------------------------------------|
| `ident`   | 0x78656e73746f7265 ('xenstore' in ASCII)          |
|           |                                                   |
| `version` | The version of the specification, defined values: |
|           | 0x00000001: all fields and records without any    |
|           |             explicitly mentioned version          |
|           |             dependency are valid.                 |
|           | 0x00000002: all fields and records valid for      |
|           |             version 1 plus fields and records     |
|           |             explicitly stated to be supported in  |
|           |             version 2 are valid.                  |
|           |                                                   |
| `flags`   | 0 (LSB): Endianness: 0 = little, 1 = big          |
|           |                                                   |
|           | 1-31: Reserved (must be zero)                     |

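As an illustration only, the header layout can be expressed with Python's `struct` module. This is a minimal sketch and not part of the specification; the names `pack_header`, `parse_header` and `FLAG_BIG_ENDIAN` are hypothetical. It assumes that the endianness flag selects the byte order of the records that follow, which is why the parser returns a `struct` byte-order prefix.

```python
import struct

IDENT = b"xenstore"               # 0x78656e73746f7265
HEADER = struct.Struct(">8sII")   # ident, version, flags: always big-endian
FLAG_BIG_ENDIAN = 0x1             # bit 0 of `flags` (hypothetical name)


def pack_header(version: int, big_endian: bool) -> bytes:
    """Build the 16-octet stream header."""
    flags = FLAG_BIG_ENDIAN if big_endian else 0
    return HEADER.pack(IDENT, version, flags)


def parse_header(buf: bytes):
    """Return (version, order); order is '<' or '>' for use with struct."""
    ident, version, flags = HEADER.unpack_from(buf, 0)
    if ident != IDENT:
        raise ValueError("not a xenstore stream")
    if version not in (1, 2):
        raise ValueError(f"unsupported version {version}")
    if flags & ~FLAG_BIG_ENDIAN:
        raise ValueError("reserved flag bits must be zero")
    return version, ">" if flags & FLAG_BIG_ENDIAN else "<"
```

A reader would call `parse_header` on the first 16 octets and pass the returned byte-order prefix to whatever decodes the records.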
### Records

Records immediately follow the header and have the following format:


```
    0       1       2       3       4       5       6       7    octet
+-------+-------+-------+-------+-------+-------+-------+-------+
| type                          | len                           |
+-------------------------------+-------------------------------+
| body
...
|       | padding (0 to 7 octets)                               |
+-------+-------------------------------------------------------+
```

NOTE: here and in all subsequent format specifications, padding octets and
      fields not valid in the version in use must be written as zero and
      should be ignored when the stream is read.


| Field  | Description                                          |
|--------|------------------------------------------------------|
| `type` | 0x00000000: END                                      |
|        | 0x00000001: GLOBAL_DATA                              |
|        | 0x00000002: CONNECTION_DATA                          |
|        | 0x00000003: WATCH_DATA                               |
|        | 0x00000004: TRANSACTION_DATA                         |
|        | 0x00000005: NODE_DATA                                |
|        | 0x00000006: GLOBAL_QUOTA_DATA                        |
|        | 0x00000007: DOMAIN_DATA                              |
|        | 0x00000008: WATCH_DATA_EXTENDED (version 2 and up)   |
|        | 0x00000009 - 0xFFFFFFFF: reserved for future use     |
|        |                                                      |
| `len`  | The length (in octets) of `body`                     |
|        |                                                      |
| `body` | The type-specific record data                        |

Some records will depend on other records in the migration stream. Records
upon which other records depend must always appear earlier in the stream.

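The record framing can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes that `len` counts only `body`, that padding brings each record up to the next 8-octet boundary, and that record fields use the byte order selected by the header flag. The names `pack_record` and `iter_records` are hypothetical.

```python
import struct

REC_END = 0x00000000


def pack_record(rec_type: int, body: bytes, order: str = "<") -> bytes:
    """Frame a body with type/len and zero padding to an 8-octet boundary."""
    pad = -len(body) % 8
    return struct.pack(order + "II", rec_type, len(body)) + body + b"\0" * pad


def iter_records(buf: bytes, offset: int = 16, order: str = "<"):
    """Yield (type, body) pairs, starting after the 16-octet header."""
    while True:
        rec_type, length = struct.unpack_from(order + "II", buf, offset)
        offset += 8
        body = buf[offset:offset + length]
        if len(body) != length:
            raise ValueError("truncated record")
        yield rec_type, body
        if rec_type == REC_END:
            return  # END is the final record in the stream
        # Skip the body plus its padding (assumed 8-octet alignment).
        offset += length + (-length % 8)
```

If a writer instead counts the padding in `len`, the padding computed here is zero, so the same loop still walks the stream; the type-specific parsers then have to rely on their own length fields rather than on the end of `body`.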
The various formats of the type-specific data are described in the following
sections:

\pagebreak

### END

The end record marks the end of the image, and is the final record
in the stream.

```
    0       1       2       3       4       5       6       7    octet
+-------+-------+-------+-------+-------+-------+-------+-------+
```


The end record contains no fields; its body length is 0.

\pagebreak

### GLOBAL_DATA

This record is only relevant for live update. It contains details of global
xenstored state that needs to be restored.

```
    0       1       2       3    octet
+-------+-------+-------+-------+
| rw-socket-fd                  |
+-------------------------------+
| evtchn-fd                     |
+-------------------------------+
```


| Field          | Description                                  |
|----------------|----------------------------------------------|
| `rw-socket-fd` | The file descriptor of the socket accepting  |
|                | read-write connections                       |
|                |                                              |
| `evtchn-fd`    | The file descriptor used to communicate with |
|                | the event channel driver                     |

xenstored will resume in the original process context. Hence `rw-socket-fd`
simply specifies the file descriptor of the socket. Sockets are not always
used, however, and so -1 will be used to denote an unused socket.

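Because -1 is a valid value for `rw-socket-fd`, both fields are most naturally read as signed 32-bit integers. A minimal sketch, assuming the record byte order is passed in as a `struct` prefix; the function name `parse_global_data` is hypothetical:

```python
import struct


def parse_global_data(body: bytes, order: str = "<") -> dict:
    """Decode a GLOBAL_DATA body; an unused socket (-1) becomes None."""
    rw_socket_fd, evtchn_fd = struct.unpack_from(order + "ii", body, 0)
    return {
        "rw-socket-fd": None if rw_socket_fd == -1 else rw_socket_fd,
        "evtchn-fd": evtchn_fd,
    }
```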
\pagebreak

### CONNECTION_DATA

For live update the image format will contain a `CONNECTION_DATA` record for
each connection to xenstore. For migration it will only contain a record for
the domain being migrated.


```
    0       1       2       3       4       5       6       7    octet
+-------+-------+-------+-------+-------+-------+-------+-------+
| conn-id                       | conn-type     |               |
+-------------------------------+---------------+---------------+
| conn-spec
...
+---------------+---------------+-------------------------------+
| in-data-len   | out-resp-len  | out-data-len                  |
+---------------+---------------+-------------------------------+
| data
...
```


| Field          | Description                                  |
|----------------|----------------------------------------------|
| `conn-id`      | A non-zero number used to identify this      |
|                | connection in subsequent connection-specific |
|                | records                                      |
|                |                                              |
| `conn-type`    | 0x0000: shared ring                          |
|                | 0x0001: socket                               |
|                | 0x0002 - 0xFFFF: reserved for future use     |
|                |                                              |
| `conn-spec`    | See below                                    |
|                |                                              |
| `in-data-len`  | The length (in octets) of any data read      |
|                | from the connection not yet processed        |
|                |                                              |
| `out-resp-len` | The length (in octets) of a partial response |
|                | not yet written to the connection            |
|                |                                              |
| `out-data-len` | The length (in octets) of any pending data   |
|                | not yet written to the connection, including |
|                | a partial response (see `out-resp-len`)      |
|                |                                              |
| `data`         | Pending data: first in-data-len octets of    |
|                | read data, then out-data-len octets of       |
|                | written data (either or both may be empty)   |

In the case of live update, the connection record for the connection via
which the live update command was issued will contain the response to that
command in its pending, not yet written data.

\pagebreak

The format of `conn-spec` is dependent upon `conn-type`.

For `shared ring` connections it is as follows:


```
    0       1       2       3       4       5       6       7    octet
+---------------+---------------+---------------+---------------+
| domid         | tdomid        | evtchn                        |
+-------------------------------+-------------------------------+
```


| Field     | Description                                       |
|-----------|---------------------------------------------------|
| `domid`   | The domain-id that owns the shared page           |
|           |                                                   |
| `tdomid`  | The domain-id that `domid` acts on behalf of if   |
|           | it has been subject to a `SET_TARGET`             |
|           | operation [2] or DOMID_INVALID [3] otherwise      |
|           |                                                   |
| `evtchn`  | The port number of the interdomain channel used   |
|           | by xenstored to communicate with `domid`          |
|           |                                                   |

The GFN of the shared page is not preserved because the ABI reserves
entry 1 in `domid`'s grant table to point to the xenstore shared page.
Note there is no guarantee the page will still be valid at the time of
the restore because a domain can revoke the permission.

For `socket` connections it is as follows:


```
    0       1       2       3       4       5       6       7    octet
+---------------+---------------+---------------+---------------+
| socket-fd                     | pad                           |
+-------------------------------+-------------------------------+
```


| Field       | Description                                     |
|-------------|-------------------------------------------------|
| `socket-fd` | The file descriptor of the connected socket     |

This type of connection is only relevant for live update, where xenstored
resumes in the original process context. Hence `socket-fd` simply specifies
the file descriptor of the socket connection.

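Putting the fixed part, `conn-spec` and the pending data together, a `CONNECTION_DATA` body could be decoded roughly as below. This is a sketch under assumptions: the function and constant names are hypothetical, `socket-fd` is treated as a signed 32-bit value, and the record byte order is supplied by the caller.

```python
import struct

CONN_SHARED_RING = 0x0000
CONN_SOCKET = 0x0001


def parse_connection_data(body: bytes, order: str = "<") -> dict:
    # Fixed part: conn-id (32 bit), conn-type (16 bit), 16 bits of padding.
    conn_id, conn_type = struct.unpack_from(order + "IH2x", body, 0)
    if conn_id == 0:
        raise ValueError("conn-id must be non-zero")
    # conn-spec occupies 8 octets for both defined connection types.
    if conn_type == CONN_SHARED_RING:
        domid, tdomid, evtchn = struct.unpack_from(order + "HHI", body, 8)
        spec = {"domid": domid, "tdomid": tdomid, "evtchn": evtchn}
    elif conn_type == CONN_SOCKET:
        (socket_fd,) = struct.unpack_from(order + "i4x", body, 8)
        spec = {"socket-fd": socket_fd}
    else:
        raise ValueError(f"reserved conn-type {conn_type:#06x}")
    in_len, resp_len, out_len = struct.unpack_from(order + "HHI", body, 16)
    if resp_len > out_len:
        # out-data-len includes the partial response.
        raise ValueError("out-resp-len exceeds out-data-len")
    if len(body) < 24 + in_len + out_len:
        raise ValueError("truncated pending data")
    return {
        "conn-id": conn_id,
        "conn-type": conn_type,
        "conn-spec": spec,
        "in-data": body[24:24 + in_len],
        "out-data": body[24 + in_len:24 + in_len + out_len],
        "out-resp-len": resp_len,
    }
```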
\pagebreak

### WATCH_DATA

The image format will contain either a `WATCH_DATA` or a `WATCH_DATA_EXTENDED`
record for each watch registered by a connection for which there is a
`CONNECTION_DATA` record previously present.

```
    0       1       2       3    octet
+-------+-------+-------+-------+
| conn-id                       |
+---------------+---------------+
| wpath-len     | token-len     |
+---------------+---------------+
| wpath
...
| token
...
```


| Field       | Description                                     |
|-------------|-------------------------------------------------|
| `conn-id`   | The connection that issued the `WATCH`          |
|             | operation [2]                                   |
|             |                                                 |
| `wpath-len` | The length (in octets) of `wpath` including the |
|             | NUL terminator                                  |
|             |                                                 |
| `token-len` | The length (in octets) of `token` including the |
|             | NUL terminator                                  |
|             |                                                 |
| `wpath`     | The watch path, as specified in the `WATCH`     |
|             | operation                                       |
|             |                                                 |
| `token`     | The watch identifier token, as specified in the |
|             | `WATCH` operation                               |

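Both length fields include the NUL terminator, which is easy to get wrong when writing the record. The following sketch shows one way to encode and decode a `WATCH_DATA` body; the function names are hypothetical and the byte order is assumed to come from the stream header.

```python
import struct


def pack_watch_data(conn_id: int, wpath: bytes, token: bytes,
                    order: str = "<") -> bytes:
    """Encode a WATCH_DATA body; lengths count the NUL terminators."""
    wpath_z, token_z = wpath + b"\0", token + b"\0"
    head = struct.pack(order + "IHH", conn_id, len(wpath_z), len(token_z))
    return head + wpath_z + token_z


def parse_watch_data(body: bytes, order: str = "<"):
    """Return (conn_id, wpath, token) with the NUL terminators removed."""
    conn_id, wpath_len, token_len = struct.unpack_from(order + "IHH", body, 0)
    wpath = body[8:8 + wpath_len]
    token = body[8 + wpath_len:8 + wpath_len + token_len]
    if len(wpath) != wpath_len or len(token) != token_len:
        raise ValueError("truncated WATCH_DATA record")
    if not (wpath.endswith(b"\0") and token.endswith(b"\0")):
        raise ValueError("wpath and token must be NUL-terminated")
    return conn_id, wpath[:-1], token[:-1]
```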
\pagebreak

### WATCH_DATA_EXTENDED

The image format will contain either a `WATCH_DATA` or a `WATCH_DATA_EXTENDED`
record for each watch registered by a connection for which there is a
`CONNECTION_DATA` record previously present. The `WATCH_DATA_EXTENDED` record
type is valid only in version 2 and later.

```
    0       1       2       3    octet
+-------+-------+-------+-------+
| conn-id                       |
+---------------+---------------+
| wpath-len     | token-len     |
+---------------+---------------+
| depth         | pad           |
+---------------+---------------+
| wpath
...
| token
...
```


| Field       | Description                                     |
|-------------|-------------------------------------------------|
| `conn-id`   | The connection that issued the `WATCH`          |
|             | operation [2]                                   |
|             |                                                 |
| `wpath-len` | The length (in octets) of `wpath` including the |
|             | NUL terminator                                  |
|             |                                                 |
| `token-len` | The length (in octets) of `token` including the |
|             | NUL terminator                                  |
|             |                                                 |
| `depth`     | The number of directory levels below the        |
|             | watched path to consider for a match.           |
|             | A value of 0xffff is used for unlimited depth.  |
|             |                                                 |
| `wpath`     | The watch path, as specified in the `WATCH`     |
|             | operation                                       |
|             |                                                 |
| `token`     | The watch identifier token, as specified in the |
|             | `WATCH` operation                               |

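Compared with `WATCH_DATA`, the fixed part grows by four octets (`depth` plus padding), so the strings start at offset 12. A minimal decoding sketch, with hypothetical names, that maps the 0xffff sentinel to `None` and rejects the record type in a version 1 stream:

```python
import struct

DEPTH_UNLIMITED = 0xFFFF


def parse_watch_data_extended(body: bytes, version: int, order: str = "<"):
    """Return (conn_id, wpath, token, depth); depth None means unlimited."""
    if version < 2:
        raise ValueError("WATCH_DATA_EXTENDED requires version 2 or later")
    conn_id, wpath_len, token_len, depth = struct.unpack_from(
        order + "IHHH2x", body, 0)
    wpath = body[12:12 + wpath_len]
    token = body[12 + wpath_len:12 + wpath_len + token_len]
    if len(wpath) != wpath_len or len(token) != token_len:
        raise ValueError("truncated WATCH_DATA_EXTENDED record")
    if not (wpath.endswith(b"\0") and token.endswith(b"\0")):
        raise ValueError("wpath and token must be NUL-terminated")
    return (conn_id, wpath[:-1], token[:-1],
            None if depth == DEPTH_UNLIMITED else depth)
```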
\pagebreak

### TRANSACTION_DATA

The image format will contain a `TRANSACTION_DATA` record for each transaction
that is pending on a connection for which there is a `CONNECTION_DATA` record
previously present.


```
    0       1       2       3    octet
+-------+-------+-------+-------+
| conn-id                       |
+-------------------------------+
| tx-id                         |
+-------------------------------+
```


| Field          | Description                                  |
|----------------|----------------------------------------------|
| `conn-id`      | The connection that issued the               |
|                | `TRANSACTION_START` operation [2]            |
|                |                                              |
| `tx-id`        | The transaction id passed back to the domain |
|                | by the `TRANSACTION_START` operation         |

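The ordering rule (a `CONNECTION_DATA` record must precede the records that refer to it) can be enforced while reading. The sketch below assumes the reader keeps a set of connection ids seen so far; `known_conn_ids` and the function name are hypothetical.

```python
import struct


def parse_transaction_data(body: bytes, known_conn_ids: set,
                           order: str = "<"):
    """Return (conn_id, tx_id), rejecting unknown connections."""
    conn_id, tx_id = struct.unpack_from(order + "II", body, 0)
    if conn_id not in known_conn_ids:
        raise ValueError(
            f"TRANSACTION_DATA refers to unknown conn-id {conn_id}")
    return conn_id, tx_id
```

The same check applies to the `conn-id` of watch records and of `NODE_DATA` records that belong to a pending transaction.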
\pagebreak

### NODE_DATA

For live update the image format will contain a `NODE_DATA` record for each
node in xenstore. For migration it will only contain a record for the nodes
relating to the domain being migrated. A `NODE_DATA` record may relate to
a _committed_ node (globally visible in xenstored) or a _pending_ node (created
or modified by a transaction for which there is also a `TRANSACTION_DATA`
record previously present).

Each _committed_ node in the stream is required to have an already known parent
node. A parent node is known either if it was in the node database before
processing of the stream started, or if a `NODE_DATA` record for that parent
node has already been processed in the stream.


```
    0       1       2       3    octet
+-------+-------+-------+-------+
| conn-id                       |
+-------------------------------+
| tx-id                         |
+---------------+---------------+
| path-len      | value-len     |
+---------------+---------------+
| access        | perm-count    |
+---------------+---------------+
| perm1                         |
+-------------------------------+
...
+-------------------------------+
| permN                         |
+---------------+---------------+
| path
...
| value
...
```


| Field        | Description                                    |
|--------------|------------------------------------------------|
| `conn-id`    | If this value is non-zero then this record     |
|              | relates to a pending transaction               |
|              |                                                |
| `tx-id`      | This value should be ignored if `conn-id` is   |
|              | zero. Otherwise it specifies the id of the     |
|              | pending transaction                            |
|              |                                                |
| `path-len`   | The length (in octets) of `path` including the |
|              | NUL terminator                                 |
|              |                                                |
| `value-len`  | The length (in octets) of `value` (which will  |
|              | be zero for a deleted node)                    |
|              |                                                |
| `access`     | This value should be ignored if this record    |
|              | does not relate to a pending transaction,      |
|              | otherwise it specifies the accesses made to    |
|              | the node and hence is a bitwise OR of:         |
|              |                                                |
|              | 0x0001: read                                   |
|              | 0x0002: written                                |
|              |                                                |
|              | The value will be zero for a deleted node      |
|              |                                                |
| `perm-count` | The number (N) of node permission specifiers   |
|              | (which will be 0 for a node deleted in a       |
|              | pending transaction)                           |
|              |                                                |
| `perm1..N`   | A list of zero or more node permission         |
|              | specifiers (see below)                         |
|              |                                                |
| `path`       | The absolute path of the node                  |
|              |                                                |
| `value`      | The node value (which may be empty or contain  |
|              | NUL octets)                                    |


A node permission specifier has the following format:


```
    0       1       2       3    octet
+-------+-------+-------+-------+
| perm  | flags | domid         |
+-------+-------+---------------+
```

| Field   | Description                                         |
|---------|-----------------------------------------------------|
| `perm`  | One of the ASCII values `w`, `r`, `b` or `n` as     |
|         | specified for the `SET_PERMS` operation [2]         |
|         |                                                     |
| `flags` | A bit-wise OR of:                                   |
|         | 0x01: stale permission, ignore when checking        |
|         |       permissions                                   |
|         |                                                     |
| `domid` | The domain-id to which the permission relates       |

Note that perm1 defines the domain owning the node. See [4] for more
explanation of node permissions.

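The variable-length layout of `NODE_DATA` (a counted list of 4-octet permission specifiers, then a NUL-terminated path, then a value that may contain NUL octets) can be decoded as sketched below. All names are hypothetical, and the code follows the field descriptions above by discarding `tx-id` and `access` when `conn-id` is zero.

```python
import struct

ACCESS_READ, ACCESS_WRITTEN = 0x0001, 0x0002
PERM_STALE = 0x01


def parse_node_data(body: bytes, order: str = "<") -> dict:
    (conn_id, tx_id, path_len, value_len,
     access, perm_count) = struct.unpack_from(order + "IIHHHH", body, 0)
    offset = 16
    perms = []
    for _ in range(perm_count):
        perm, flags, domid = struct.unpack_from(order + "cBH", body, offset)
        if perm not in (b"w", b"r", b"b", b"n"):
            raise ValueError(f"invalid perm {perm!r}")
        perms.append({"perm": perm.decode("ascii"),
                      "stale": bool(flags & PERM_STALE),
                      "domid": domid})
        offset += 4
    path = body[offset:offset + path_len]
    value = body[offset + path_len:offset + path_len + value_len]
    if len(path) != path_len or len(value) != value_len:
        raise ValueError("truncated NODE_DATA record")
    if not path.endswith(b"\0"):
        raise ValueError("path must be NUL-terminated")
    pending = conn_id != 0
    return {
        "pending": pending,
        # tx-id and access are only meaningful for pending nodes.
        "conn-id": conn_id if pending else None,
        "tx-id": tx_id if pending else None,
        "access": access if pending else None,
        "perms": perms,          # perms[0], if present, names the owner
        "path": path[:-1],
        "value": value,
    }
```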
\pagebreak

### GLOBAL_QUOTA_DATA

This record is only relevant for live update. It contains the global settings
of xenstored quota.

```
    0       1       2       3    octet
+-------+-------+-------+-------+
| n-dom-quota   | n-glob-quota  |
+---------------+---------------+
| quota-val 1                   |
+-------------------------------+
...
+-------------------------------+
| quota-val N                   |
+-------------------------------+
| quota-names
...
```


| Field          | Description                                  |
|----------------|----------------------------------------------|
| `n-dom-quota`  | Number of quota values which apply per       |
|                | domain by default.                           |
|                |                                              |
| `n-glob-quota` | Number of quota values which apply globally  |
|                | only.                                        |
|                |                                              |
| `quota-val`    | Quota values, first the ones applying per    |
|                | domain, then the ones applying globally. A   |
|                | value of 0 has the semantics of "unlimited". |
|                |                                              |
| `quota-names`  | NUL-delimited strings of the quota names in  |
|                | the same sequence as the `quota-val` values. |


Allowed quota names are those explicitly named in [2] for the `GET_QUOTA`
and `SET_QUOTA` commands, plus implementation specific ones. Quota names not
recognized by the receiving side should not have any effect on behavior for
the receiving side (they can be ignored or preserved for inclusion in
future live migration/update streams).

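Since N is not stored directly, a reader has to derive it as `n-dom-quota + n-glob-quota` and then pair the values with the names by position. A minimal sketch, assuming unsigned 32-bit quota values and one NUL-terminated name per value; the function name is hypothetical.

```python
import struct


def parse_global_quota_data(body: bytes, order: str = "<"):
    """Return (per_domain, global_only) dicts mapping name -> value."""
    n_dom, n_glob = struct.unpack_from(order + "HH", body, 0)
    total = n_dom + n_glob
    values = struct.unpack_from(f"{order}{total}I", body, 4)
    # Names follow the values; trailing padding yields empty strings,
    # which the slice below drops.
    names = body[4 + 4 * total:].split(b"\0")[:total]
    if len(names) != total or b"" in names:
        raise ValueError("quota-names does not match the quota counts")
    pairs = list(zip(names, values))
    # A value of 0 means "unlimited"; it is returned unchanged here.
    return dict(pairs[:n_dom]), dict(pairs[n_dom:])
```

Unknown names end up in the returned dictionaries like any other, which leaves the caller free to ignore them or to preserve them for a later stream.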
\pagebreak

### DOMAIN_DATA

This record is optional and can be present once for each domain.


```
    0       1       2       3    octet
+-------+-------+-------+-------+
| domain-id     | n-quota       |
+---------------+---------------+
| features                      |
+-------------------------------+
| quota-val 1                   |
+-------------------------------+
...
+-------------------------------+
| quota-val N                   |
+-------------------------------+
| quota-names
...
```


| Field          | Description                                  |
|----------------|----------------------------------------------|
| `domain-id`    | The domain-id of the domain this record      |
|                | belongs to.                                  |
|                |                                              |
| `n-quota`      | Number of quota values.                      |
|                |                                              |
| `features`     | Value of the feature field visible to the    |
|                | guest at offset 2064 of the ring page.       |
|                | Only valid for version 2 and later.          |
|                |                                              |
| `quota-val`    | Quota values, a value of 0 has the semantics |
|                | "unlimited".                                 |
|                |                                              |
| `quota-names`  | NUL-delimited strings of the quota names in  |
|                | the same sequence as the `quota-val` values. |

Allowed quota names are those explicitly named in [2] for the `GET_QUOTA`
and `SET_QUOTA` commands, plus implementation specific ones. Quota names not
recognized by the receiving side should not have any effect on behavior for
the receiving side (they can be ignored or preserved for inclusion in
future live migration/update streams).

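`DOMAIN_DATA` follows the same value/name pairing as `GLOBAL_QUOTA_DATA`, with `features` as the one version-dependent field. The sketch below treats `features` as absent in a version 1 stream, in line with the rule that fields not valid in the version in use are ignored on read; the function name is hypothetical and quota values are assumed to be unsigned 32-bit.

```python
import struct


def parse_domain_data(body: bytes, version: int, order: str = "<") -> dict:
    domid, n_quota, features = struct.unpack_from(order + "HHI", body, 0)
    values = struct.unpack_from(f"{order}{n_quota}I", body, 8)
    names = body[8 + 4 * n_quota:].split(b"\0")[:n_quota]
    if len(names) != n_quota or b"" in names:
        raise ValueError("quota-names does not match n-quota")
    return {
        "domain-id": domid,
        # The field is written as zero and ignored before version 2.
        "features": features if version >= 2 else None,
        "quota": dict(zip(names, values)),
    }
```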
\pagebreak


* * *

[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md

[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt

[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/xen.h;hb=HEAD#l612

[4] See https://wiki.xen.org/wiki/XenBus
