% LibXenLight Domain Image Format
% Andrew Cooper <<andrew.cooper3@citrix.com>>
  Wen Congyang <<wency@cn.fujitsu.com>>
  Yang Hongyang <<hongyang.yang@easystack.cn>>
% Revision 2
6
7Introduction
8============
9
10For the purposes of this document, `xl` is used as a representation of any
11implementer of the `libxl` API.  `xl` should be considered completely
12interchangeable with alternates, such as `libvirt` or `xenopsd-xl`.
13
14Purpose
15-------
16
17The _domain image format_ is the context of a running domain used for
18snapshots of a domain or for transferring domains between hosts during
19migration.
20
There are a number of problems with the domain image format used in Xen 4.5
and earlier (the _legacy format_):
23
24* There is no `libxl` context information.  `xl` is required to send certain
25  pieces of `libxl` context itself.
26
* The contents of the stream are passed directly through `libxl` to `libxc`.
  The legacy `libxc` format contained some information which belonged at the
  `libxl` level, resulting in an awkward layer violation to return the
  information back to `libxl`.
31
* The legacy `libxc` format was inextensible, which in turn made the legacy
  `libxl` handling inextensible.
34
35This design addresses the above points, allowing for a completely
36self-contained, extensible stream with each layer responsible for its own
37appropriate information.
38
39
40Not Yet Included
41----------------
42
43The following features are not yet fully specified and will be
44included in a future draft.
45
46* ARM
47
48
49Overview
50========
51
52The image format consists of a _Header_, followed by 1 or more _Records_.
53Each record consists of a type and length field, followed by any type-specific
54data.
55
56\clearpage
57
58Header
59======
60
61The header identifies the stream as a `libxl` stream, including the version of
62this specification that it complies with.
63
64All fields in this header shall be in _big-endian_ byte order, regardless of
65the setting of the endianness bit.
66
67     0     1     2     3     4     5     6     7 octet
68    +-------------------------------------------------+
69    | ident                                           |
70    +-----------------------+-------------------------+
71    | version               | options                 |
72    +-----------------------+-------------------------+
73
74--------------------------------------------------------------------
75Field       Description
76----------- --------------------------------------------------------
77ident       0x4c6962786c466d74 ("LibxlFmt" in ASCII).
78
79version     0x00000002.  The version of this specification.
80
81options     bit 0: Endianness.    0 = little-endian, 1 = big-endian.
82
83            bit 1: Legacy Format. If set, this stream was created by
84                                  the legacy conversion tool.
85
86            bits 2-31: Reserved.
87--------------------------------------------------------------------
88
89The endianness shall be 0 (little-endian) for images generated on an
90i386, x86_64, or arm host.
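
As an illustration of the layout above, the following is a minimal Python
sketch of packing and parsing the 16-octet header.  The function and constant
names are hypothetical and chosen for this example only; they are not part of
`libxl`.  The sketch ignores the reserved option bits rather than rejecting
them, which is an assumption, since this document does not say how a receiver
should treat them.

```python
import struct

# Values taken from the header table above.
IDENT = 0x4C6962786C466D74      # "LibxlFmt" in ASCII
VERSION = 0x00000002
OPT_BIG_ENDIAN = 1 << 0         # options bit 0
OPT_LEGACY = 1 << 1             # options bit 1

# Header fields are always big-endian: 64-bit ident, 32-bit version,
# 32-bit options.
HEADER_FMT = ">QII"


def pack_header(options=0):
    """Build a 16-octet stream header (hypothetical helper)."""
    return struct.pack(HEADER_FMT, IDENT, VERSION, options)


def parse_header(buf):
    """Validate a header and return the decoded option bits."""
    if len(buf) < struct.calcsize(HEADER_FMT):
        raise ValueError("short header")
    ident, version, options = struct.unpack_from(HEADER_FMT, buf)
    if ident != IDENT:
        raise ValueError("not a libxl stream")
    if version != VERSION:
        raise ValueError("unsupported version %#x" % version)
    return {
        "big_endian": bool(options & OPT_BIG_ENDIAN),
        "legacy": bool(options & OPT_LEGACY),
    }
```

A stream produced by the legacy conversion tool on an x86 host would, under
this sketch, be written with `pack_header(OPT_LEGACY)`.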
91
92\clearpage
93
94
95Record Overview
96===============
97
A record has a record header, type-specific data and trailing padding.  If
`body_length` is not a multiple of 8, the body is padded with zeroes to align
the end of the record on an 8 octet boundary.
101
102     0     1     2     3     4     5     6     7 octet
103    +-----------------------+-------------------------+
104    | type                  | body_length             |
105    +-----------+-----------+-------------------------+
106    | body...                                         |
107    ...
108    |           | padding (0 to 7 octets)             |
109    +-----------+-------------------------------------+
110
111--------------------------------------------------------------------
112Field        Description
113-----------  -------------------------------------------------------
114type         0x00000000: END
115
116             0x00000001: LIBXC_CONTEXT
117
118             0x00000002: EMULATOR_XENSTORE_DATA
119
120             0x00000003: EMULATOR_CONTEXT
121
122             0x00000004: CHECKPOINT_END
123
124             0x00000005: CHECKPOINT_STATE
125
126             0x00000006 - 0x7FFFFFFF: Reserved for future _mandatory_
127             records.
128
129             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
130             records.
131
132body_length  Length in octets of the record body.
133
134body         Content of the record.
135
136padding      0 to 7 octets of zeros to pad the whole record to a multiple
137             of 8 octets.
138--------------------------------------------------------------------
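
The padding rule can be sketched in Python as follows.  The helper names are
hypothetical.  The sketch assumes that the record header fields use the byte
order selected by the endianness bit in the stream header, which is how this
document's statement that only the stream header is always big-endian is read
here.

```python
import io
import struct


def pack_record(rec_type, body=b"", big_endian=False):
    """Serialise one record: type, body_length, body, zero padding."""
    fmt = ">II" if big_endian else "<II"
    padding = -len(body) % 8            # 0 to 7 octets
    header = struct.pack(fmt, rec_type, len(body))
    return header + body + b"\x00" * padding


def read_record(stream, big_endian=False):
    """Read one record from a binary stream; return (type, body)."""
    fmt = ">II" if big_endian else "<II"
    header = stream.read(8)
    if len(header) != 8:
        raise EOFError("truncated record header")
    rec_type, body_length = struct.unpack(fmt, header)
    padded_length = (body_length + 7) & ~7   # round up to a multiple of 8
    data = stream.read(padded_length)
    if len(data) != padded_length:
        raise EOFError("truncated record body")
    return rec_type, data[:body_length]      # padding is discarded
```

For example, a 5 octet body is followed by 3 octets of padding, so the whole
record occupies 16 octets including its header.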
139
140\clearpage
141
142Emulator Records
143----------------
144
Several records are specific to emulators and share a common sub-header.
146
147     0     1     2     3     4     5     6     7 octet
148    +------------------------+------------------------+
149    | emulator_id            | index                  |
150    +------------------------+------------------------+
151    | record specific data                            |
152    ...
153    +-------------------------------------------------+
154
155--------------------------------------------------------------------
156Field            Description
157------------     ---------------------------------------------------
158emulator_id      0x00000000: Unknown (In the case of a legacy stream)
159
160                 0x00000001: Qemu Traditional
161
162                 0x00000002: Qemu Upstream
163
164                 0x00000003 - 0xFFFFFFFF: Reserved for future emulators.
165
166index            Index of this emulator for the domain.
167--------------------------------------------------------------------
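
A minimal sketch of splitting an emulator record body into its sub-header and
record-specific data is shown below.  The function name is hypothetical, and
the byte order is again assumed to follow the endianness bit of the stream
header.

```python
import struct

# Names from the emulator_id table above.
EMULATOR_NAMES = {
    0x00000000: "Unknown",
    0x00000001: "Qemu Traditional",
    0x00000002: "Qemu Upstream",
}


def split_emulator_body(body, big_endian=False):
    """Return (emulator_id, index, record_specific_data)."""
    if len(body) < 8:
        raise ValueError("body too short for emulator sub-header")
    fmt = ">II" if big_endian else "<II"
    emulator_id, index = struct.unpack_from(fmt, body)
    return emulator_id, index, body[8:]
```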
168
169\clearpage
170
171Records
172=======
173
174END
175----
176
An end record marks the end of the image, and shall be the final record
in the stream.
179
180     0     1     2     3     4     5     6     7 octet
181    +-------------------------------------------------+
182
183The end record contains no fields; its body_length is 0.
184
185LIBXC\_CONTEXT
186--------------
187
188A libxc context record is a marker, indicating that the stream should be
189handed to `xc_domain_restore()`.  `libxc` shall be responsible for reading its
190own image format from the stream.
191
192     0     1     2     3     4     5     6     7 octet
193    +-------------------------------------------------+
194
195The libxc context record contains no fields; its body_length is 0[^1].
196
197
198[^1]: The sending side cannot calculate ahead of time how much data `libxc`
199might write into the stream, especially for live migration where the quantity
200of data is partially proportional to the elapsed time.
201
202EMULATOR\_XENSTORE\_DATA
203------------------------
204
205A set of xenstore key/value pairs for a specific emulator associated with the
206domain.
207
208     0     1     2     3     4     5     6     7 octet
209    +------------------------+------------------------+
210    | emulator_id            | index                  |
211    +------------------------+------------------------+
212    | xenstore key/value data                         |
213    ...
214    +-------------------------------------------------+
215
Xenstore key/value data are encoded as a packed sequence of (key, value)
tuples.  Each (key, value) tuple is a packed pair of NUL-terminated octet
strings, conforming to xenstore protocol character encoding (keys strictly as
alphanumeric ASCII and `-/_@`, values expected to be human-readable ASCII).
220
Keys shall be relative to the device model's xenstore tree for the new
domain.  At the time of writing, keys are relative to the path

> `/local/domain/$dm_domid/device-model/$domid/`

although this path is free to change in future, and therefore should not be
assumed.
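
The packed encoding described above can be sketched in Python as follows.
The helper names and the example keys are hypothetical.  The input to the
decoder is the record body after the 8-octet emulator sub-header, without the
record padding.

```python
def encode_xenstore_data(pairs):
    """Pack (key, value) string pairs as NUL-terminated ASCII octets."""
    out = bytearray()
    for key, value in pairs:
        out += key.encode("ascii") + b"\x00"
        out += value.encode("ascii") + b"\x00"
    return bytes(out)


def decode_xenstore_data(data):
    """Unpack the data back into a list of (key, value) string pairs."""
    if not data:
        return []
    if data[-1:] != b"\x00":
        raise ValueError("data is not NUL terminated")
    # Drop the final terminator so split() yields no trailing empty field.
    fields = data[:-1].split(b"\x00")
    if len(fields) % 2:
        raise ValueError("key without a value")
    return [
        (fields[i].decode("ascii"), fields[i + 1].decode("ascii"))
        for i in range(0, len(fields), 2)
    ]
```

Note that the keys carried in the record stay relative; the receiver decides
which xenstore path to prefix them with.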
228
229EMULATOR\_CONTEXT
230----------------
231
232A context blob for a specific emulator associated with the domain.
233
234     0     1     2     3     4     5     6     7 octet
235    +------------------------+------------------------+
236    | emulator_id            | index                  |
237    +------------------------+------------------------+
238    | emulator_ctx                                    |
239    ...
240    +-------------------------------------------------+
241
242The *emulator_ctx* is a binary blob interpreted by the emulator identified by
243*emulator_id*.  Its format is unspecified.
244
245CHECKPOINT\_END
246---------------
247
248A checkpoint end record marks the end of a checkpoint in the image.
249
250     0     1     2     3     4     5     6     7 octet
251    +-------------------------------------------------+
252
The checkpoint end record contains no fields; its body_length is 0.
254
255
256CHECKPOINT\_STATE
257--------------
258
A checkpoint state record contains the control information for a checkpoint.
It is only used by COLO; for more detail, please refer to README.colo.
261
262     0     1     2     3     4     5     6     7 octet
263    +------------------------+------------------------+
264    | control_id             | padding                |
265    +------------------------+------------------------+
266
267--------------------------------------------------------------------
268Field            Description
269------------     ---------------------------------------------------
270control_id       0x00000000: Secondary VM is out of sync, start a new checkpoint
271                 (Primary -> Secondary)
272
273                 0x00000001: Secondary VM is suspended (Secondary -> Primary)
274
275                 0x00000002: Secondary VM is ready (Secondary -> Primary)
276
277                 0x00000003: Secondary VM is resumed (Secondary -> Primary)
278
279--------------------------------------------------------------------
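
The following Python sketch encodes and decodes the 8 octets shown in the
diagram above.  The helper names are hypothetical.  The symbolic names are an
assumption: they are matched to the `control_id` values by comparing the
table descriptions with the names used in the COLO loops in this section.
Whether `body_length` counts the padding field is not stated here, so the
sketch handles only the payload octets and not the record header.

```python
import struct

# Assumed name-to-value mapping, inferred from the table descriptions.
CHECKPOINT_NEW = 0x00000000            # Primary -> Secondary
CHECKPOINT_SVM_SUSPENDED = 0x00000001  # Secondary -> Primary
CHECKPOINT_SVM_READY = 0x00000002      # Secondary -> Primary
CHECKPOINT_SVM_RESUMED = 0x00000003    # Secondary -> Primary

SENT_BY_PRIMARY = {CHECKPOINT_NEW}


def pack_checkpoint_state(control_id, big_endian=False):
    """Return control_id followed by 4 octets of zero padding."""
    fmt = ">I" if big_endian else "<I"
    return struct.pack(fmt, control_id) + b"\x00" * 4


def parse_checkpoint_state(payload, big_endian=False):
    """Return (control_id, sender) where sender is a descriptive string."""
    if len(payload) < 4:
        raise ValueError("short checkpoint state payload")
    fmt = ">I" if big_endian else "<I"
    (control_id,) = struct.unpack_from(fmt, payload)
    if control_id > CHECKPOINT_SVM_RESUMED:
        raise ValueError("unknown control_id %#x" % control_id)
    sender = "primary" if control_id in SENT_BY_PRIMARY else "secondary"
    return control_id, sender
```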
280
In COLO, the primary runs the following loop:

1. Suspend the primary VM
    a. Suspend the primary VM
    b. Read _CHECKPOINT\_SVM\_SUSPENDED_ sent by the secondary
2. Checkpoint
3. Resume the primary VM
    a. Read _CHECKPOINT\_SVM\_READY_ from the secondary
    b. Resume the primary VM
    c. Read _CHECKPOINT\_SVM\_RESUMED_ from the secondary
4. Wait for a new checkpoint
    a. Send _CHECKPOINT\_NEW_ to the secondary
293
Meanwhile, the secondary runs the following loop:

1. Resume the secondary VM
    a. Send _CHECKPOINT\_SVM\_READY_ to the primary
    b. Resume the secondary VM
    c. Send _CHECKPOINT\_SVM\_RESUMED_ to the primary
2. Wait for a new checkpoint
    a. Read _CHECKPOINT\_NEW_ from the primary
3. Suspend the secondary VM
    a. Suspend the secondary VM
    b. Send _CHECKPOINT\_SVM\_SUSPENDED_ to the primary
4. Checkpoint
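
Taken together, the two loops imply that checkpoint state messages appear in a
fixed cyclic order, whichever step each side starts from.  The sketch below
checks a sequence of message names against that order.  It is an illustration
derived from the loops above, not code from `libxl`, and the function name is
hypothetical.

```python
# Cyclic order implied by the primary and secondary loops.
CYCLE = [
    "CHECKPOINT_SVM_SUSPENDED",  # secondary step 3b, primary step 1b
    "CHECKPOINT_SVM_READY",      # secondary step 1a, primary step 3a
    "CHECKPOINT_SVM_RESUMED",    # secondary step 1c, primary step 3c
    "CHECKPOINT_NEW",            # primary step 4a, secondary step 2a
]


def follows_checkpoint_cycle(messages):
    """Return True if each message is the cyclic successor of the last."""
    if any(m not in CYCLE for m in messages):
        return False
    for prev, cur in zip(messages, messages[1:]):
        expected = CYCLE[(CYCLE.index(prev) + 1) % len(CYCLE)]
        if cur != expected:
            return False
    return True
```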
306
307Future Extensions
308=================
309
310All changes to this specification should bump the revision number in
311the title block.
312
313All changes to the header require the header version to be increased.
314
315The format may be extended by adding additional record types.
316
317Extending an existing record type must be done by adding a new record
318type.  This allows old images with the old record to still be
319restored.
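
The split of the record type space into _mandatory_ and _optional_ ranges
suggests how a receiver might treat record types it does not recognise.  The
Python sketch below assumes that unknown optional records may be skipped and
that unknown mandatory records must cause the restore to fail.  This policy
is inferred from the range names rather than stated in this document, and the
function name is hypothetical.

```python
# Record types defined by this revision of the specification.
KNOWN_RECORDS = {
    0x00000000: "END",
    0x00000001: "LIBXC_CONTEXT",
    0x00000002: "EMULATOR_XENSTORE_DATA",
    0x00000003: "EMULATOR_CONTEXT",
    0x00000004: "CHECKPOINT_END",
    0x00000005: "CHECKPOINT_STATE",
}

# Types 0x80000000 - 0xFFFFFFFF are reserved for optional records.
OPTIONAL_BIT = 0x80000000


def classify_record(rec_type):
    """Return 'handle', 'skip' or 'reject' for a record type."""
    if rec_type in KNOWN_RECORDS:
        return "handle"
    if rec_type & OPTIONAL_BIT:
        return "skip"       # assumed: unknown optional records are ignorable
    return "reject"         # assumed: unknown mandatory records are fatal
```

Because every record carries its own `body_length`, a receiver following this
policy can step over a skipped record without understanding its contents.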
320