% libxenctrl (libxc) Domain Image Format
% David Vrabel <<david.vrabel@citrix.com>>
  Andrew Cooper <<andrew.cooper3@citrix.com>>
  Wen Congyang <<wency@cn.fujitsu.com>>
  Yang Hongyang <<hongyang.yang@easystack.cn>>
% Revision 3
7
8Introduction
9============
10
11Purpose
12-------
13
14The _domain save image_ is the context of a running domain used for
15snapshots of a domain or for transferring domains between hosts during
16migration.
17
18There are a number of problems with the format of the domain save
19image used in Xen 4.4 and earlier (the _legacy format_).
20
* Dependent on toolstack word size.  A number of fields within the
  image are native types such as `unsigned long` which have different
  sizes between 32-bit and 64-bit toolstacks.  This prevents domains
  from being migrated between hosts running 32-bit and 64-bit
  toolstacks.
26
27* There is no header identifying the image.
28
29* The image has no version information.
30
31A new format that addresses the above is required.
32
ARM does not yet have a domain save image format specified and
the format described in this specification should be suitable.
35
36Not Yet Included
37----------------
38
39The following features are not yet fully specified and will be
40included in a future draft.
41
42* Page data compression.
43
44* ARM
45
46
47Overview
48========
49
50The image format consists of two main sections:
51
52* _Headers_
53* _Records_
54
55Headers
56-------
57
58There are two headers: the _image header_, and the _domain header_.
59The image header describes the format of the image (version etc.).
60The _domain header_ contains general information about the domain
61(architecture, type etc.).
62
63Records
64-------
65
The main part of the format is a sequence of different _records_.
Each record type contains information about the domain context.  At a
minimum there is an END record marking the end of the records section.
69
70
71Fields
72------
73
74All the fields within the headers and records have a fixed width.
75
76Fields are always aligned to their size.
77
78Padding and reserved fields are set to zero on save and must be
79ignored during restore.
80
81Integer (numeric) fields in the image header are always in big-endian
82byte order.
83
84Integer fields in the domain header and in the records are in the
85endianness described in the image header (which will typically be the
86native ordering).
87
88\clearpage
89
90Headers
91=======
92
93Image Header
94------------
95
96The image header identifies an image as a Xen domain save image.  It
97includes the version of this specification that the image complies
98with.
99
Tools supporting version _V_ of the specification shall always save
images using version _V_.  Tools shall support restoring from version
_V_.  If the previous Xen release produced version _V_ - 1 images,
tools shall support restoring from these.  Tools may additionally
support restoring from earlier versions.
105
The marker field can be used to distinguish between legacy images and
those corresponding to this specification.  Legacy images will have
one or more zero bits within the first 8 octets of the image.
109
110Fields within the image header are always in _big-endian_ byte order,
111regardless of the setting of the endianness bit.
112
113     0     1     2     3     4     5     6     7 octet
114    +-------------------------------------------------+
115    | marker                                          |
116    +-----------------------+-------------------------+
117    | id                    | version                 |
118    +-----------+-----------+-------------------------+
119    | options   | (reserved)                          |
120    +-----------+-------------------------------------+
121
122
123--------------------------------------------------------------------
124Field       Description
125----------- --------------------------------------------------------
126marker      0xFFFFFFFFFFFFFFFF.
127
128id          0x58454E46 ("XENF" in ASCII).
129
130version     0x00000003.  The version of this specification.
131
132options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
133
134            bit 1-15: Reserved.
135--------------------------------------------------------------------
136
137The endianness shall be 0 (little-endian) for images generated on an
138i386, x86_64, or arm host.
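
The image header checks above can be sketched in a few lines of Python.
This is a minimal illustration rather than the libxc implementation; the
names `IMAGE_HEADER` and `parse_image_header` are hypothetical, and the
struct layout is taken directly from the diagram (8 + 4 + 4 + 2 octets,
followed by 6 reserved octets).

```python
import struct

# marker (8), id (4), version (4), options (2), reserved (6).
# The image header is always big-endian, hence the ">" prefix.
IMAGE_HEADER = struct.Struct(">QIIH6x")
MARKER = 0xFFFFFFFFFFFFFFFF
IMAGE_ID = 0x58454E46  # "XENF" in ASCII


def parse_image_header(buf):
    """Return (version, big_endian) or raise ValueError (hypothetical helper)."""
    if len(buf) < IMAGE_HEADER.size:
        raise ValueError("short image header")
    marker, ident, version, options = IMAGE_HEADER.unpack_from(buf)
    if marker != MARKER:
        raise ValueError("bad marker (possibly a legacy image)")
    if ident != IMAGE_ID:
        raise ValueError("bad id")
    # Bit 0 of options selects the endianness of everything after this header.
    return version, bool(options & 1)


# A version 3, little-endian header as an x86 host would write it.
example = IMAGE_HEADER.pack(MARKER, IMAGE_ID, 3, 0)
print(parse_image_header(example))
```

The reserved octets are skipped rather than validated, since reserved
fields must be ignored during restore.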
139
140\clearpage
141
142Domain Header
143-------------
144
145The domain header includes general properties of the domain.
146
     0     1     2     3     4     5     6     7 octet
    +-----------------------+-----------+-------------+
    | type                  | page_shift| (reserved)  |
    +-----------------------+-----------+-------------+
    | xen_major             | xen_minor               |
    +-----------------------+-------------------------+
153
154--------------------------------------------------------------------
155Field       Description
156----------- --------------------------------------------------------
157type        0x0000: Reserved.
158
159            0x0001: x86 PV.
160
161            0x0002: x86 HVM.
162
163            0x0003 - 0xFFFFFFFF: Reserved.
164
165page_shift  Size of a guest page as a power of two.
166
167            i.e., page size = 2 ^page_shift^.
168
169xen_major   The Xen major version when this image was saved.
170
171xen_minor   The Xen minor version when this image was saved.
172--------------------------------------------------------------------
173
174The legacy stream conversion tool writes a `xen_major` version of 0, and sets
175`xen_minor` to the version of itself.
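
As an illustration, the domain header could be decoded as follows.  This
is a sketch under the layout shown in the diagram; `parse_domain_header`
and the `big_endian` argument (derived from bit 0 of the image header
`options` field) are hypothetical names.

```python
import struct

DOMAIN_TYPES = {0x0001: "x86 PV", 0x0002: "x86 HVM"}


def parse_domain_header(buf, big_endian=False):
    """Decode the 16-octet domain header (hypothetical helper)."""
    # type (4), page_shift (2), reserved (2), xen_major (4), xen_minor (4).
    # Byte order follows the endianness bit of the image header.
    fmt = (">" if big_endian else "<") + "IH2xII"
    dom_type, page_shift, xen_major, xen_minor = struct.unpack_from(fmt, buf)
    if dom_type not in DOMAIN_TYPES:
        raise ValueError(f"reserved domain type {dom_type:#x}")
    return {
        "type": DOMAIN_TYPES[dom_type],
        "page_size": 1 << page_shift,  # page size = 2 ^ page_shift
        "xen_version": (xen_major, xen_minor),
    }


# An x86 HVM guest with 4 KiB pages; the version numbers are arbitrary.
example = struct.pack("<IH2xII", 2, 12, 4, 17)
print(parse_domain_header(example))
```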
176
177\clearpage
178
179Records
180=======
181
A record has a record header, a type-specific body and trailing
padding.  If `body_length` is not a multiple of 8, the body is padded
with zeroes to align the end of the record on an 8 octet boundary.
185
186     0     1     2     3     4     5     6     7 octet
187    +-----------------------+-------------------------+
188    | type                  | body_length             |
189    +-----------+-----------+-------------------------+
190    | body...                                         |
191    ...
192    |           | padding (0 to 7 octets)             |
193    +-----------+-------------------------------------+
194
195--------------------------------------------------------------------
196Field        Description
197-----------  -------------------------------------------------------
198type         0x00000000: END
199
200             0x00000001: PAGE_DATA
201
202             0x00000002: X86_PV_INFO
203
204             0x00000003: X86_PV_P2M_FRAMES
205
206             0x00000004: X86_PV_VCPU_BASIC
207
208             0x00000005: X86_PV_VCPU_EXTENDED
209
210             0x00000006: X86_PV_VCPU_XSAVE
211
212             0x00000007: SHARED_INFO
213
214             0x00000008: X86_TSC_INFO
215
216             0x00000009: HVM_CONTEXT
217
218             0x0000000A: HVM_PARAMS
219
220             0x0000000B: TOOLSTACK (deprecated)
221
222             0x0000000C: X86_PV_VCPU_MSRS
223
224             0x0000000D: VERIFY
225
226             0x0000000E: CHECKPOINT
227
228             0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary)
229
230             0x00000010: STATIC_DATA_END
231
232             0x00000011: X86_CPUID_POLICY
233
234             0x00000012: X86_MSR_POLICY
235
236             0x00000013 - 0x7FFFFFFF: Reserved for future _mandatory_
237             records.
238
239             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
240             records.
241
242body_length  Length in octets of the record body.
243
244body         Content of the record.
245
246padding      0 to 7 octets of zeros to pad the whole record to a multiple
247             of 8 octets.
248--------------------------------------------------------------------
249
Records may be _mandatory_ or _optional_.  Optional records have bit
31 set in their type.  Restoring an image that has an unrecognised or
unsupported mandatory record must fail.  The contents of optional
records may be ignored during a restore.
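
The framing and the mandatory/optional rule can be sketched as a small
reader.  This is a minimal sketch, not the libxc code: `read_record`,
`KNOWN_TYPES` and `REC_OPTIONAL` are hypothetical names, and the set of
known types simply mirrors the table above (0x00 to 0x12).

```python
import io
import struct

REC_OPTIONAL = 0x80000000         # bit 31 set marks an optional record
KNOWN_TYPES = set(range(0x13))    # 0x00 (END) up to 0x12 (X86_MSR_POLICY)


def read_record(stream, big_endian=False):
    """Read one record; return (type, body), or (type, None) if skipped."""
    order = ">" if big_endian else "<"
    header = stream.read(8)
    if len(header) != 8:
        raise EOFError("truncated record header")
    rec_type, body_length = struct.unpack(order + "II", header)
    # Round the body length up to the next multiple of 8 to cover padding.
    padded = (body_length + 7) & ~7
    data = stream.read(padded)
    if len(data) != padded:
        raise EOFError("truncated record body")
    if rec_type not in KNOWN_TYPES:
        if rec_type & REC_OPTIONAL:
            return rec_type, None  # unknown optional record: ignore contents
        raise ValueError(f"unknown mandatory record {rec_type:#010x}")
    return rec_type, data[:body_length]


# A 5-octet body is followed by 3 octets of padding, then an END record.
raw = struct.pack("<II", 0x0A, 5) + b"hello" + b"\0" * 3 + struct.pack("<II", 0, 0)
stream = io.BytesIO(raw)
print(read_record(stream))
print(read_record(stream))
```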
254
255The following sub-sections specify the record body format for each of
256the record types.
257
258\clearpage
259
260END
261----
262
263An end record marks the end of the image, and shall be the final record
264in the stream.
265
266     0     1     2     3     4     5     6     7 octet
267    +-------------------------------------------------+
268
269The end record contains no fields; its body_length is 0.
270
271\clearpage
272
273PAGE_DATA
274---------
275
276The bulk of an image consists of many PAGE_DATA records containing the
277memory contents.
278
279     0     1     2     3     4     5     6     7 octet
280    +-----------------------+-------------------------+
281    | count (C)             | (reserved)              |
282    +-----------------------+-------------------------+
283    | pfn[0]                                          |
284    +-------------------------------------------------+
285    ...
286    +-------------------------------------------------+
287    | pfn[C-1]                                        |
288    +-------------------------------------------------+
289    | page_data[0]...                                 |
290    ...
291    +-------------------------------------------------+
292    | page_data[N-1]...                               |
293    ...
294    +-------------------------------------------------+
295
296--------------------------------------------------------------------
297Field       Description
298----------- --------------------------------------------------------
299count       Number of pages described in this record.
300
301pfn         An array of count PFNs and their types.
302
303            Bit 63-60: XEN_DOMCTL_PFINFO_* type (from
304            `public/domctl.h` but shifted by 32 bits)
305
306            Bit 59-52: Reserved.
307
308            Bit 51-0: PFN.
309
310page_data   page_size octets of uncompressed page contents for each
311            page set as present in the pfn array.
312--------------------------------------------------------------------
313
Note: count is strictly greater than 0.  N is less than or equal to C, and it
is possible for there to be no page_data in the record if all pfns are of
invalid types.
316
317--------------------------------------------------------------------
318PFINFO type    Value      Description
319-------------  ---------  ------------------------------------------
320NOTAB          0x0        Normal page.
321
322L1TAB          0x1        L1 page table page.
323
324L2TAB          0x2        L2 page table page.
325
326L3TAB          0x3        L3 page table page.
327
328L4TAB          0x4        L4 page table page.
329
330               0x5-0x8    Reserved.
331
332L1TAB_PIN      0x9        L1 page table page (pinned).
333
334L2TAB_PIN      0xA        L2 page table page (pinned).
335
336L3TAB_PIN      0xB        L3 page table page (pinned).
337
338L4TAB_PIN      0xC        L4 page table page (pinned).
339
340BROKEN         0xD        Broken page.
341
342XALLOC         0xE        Allocate only.
343
344XTAB           0xF        Invalid page.
345--------------------------------------------------------------------
346
347Table: XEN_DOMCTL_PFINFO_* Page Types.
348
349PFNs with type `BROKEN`, `XALLOC`, or `XTAB` do not have any
350corresponding `page_data`.
351
352The saver uses the `XTAB` type for PFNs that become invalid in the
353guest's P2M table during a live migration[^2].
354
355Restoring an image with unrecognised page types shall fail.
356
357[^2]: In the legacy format, this is the list of unmapped PFNs in the
358tail.
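
To illustrate how the pfn array and the page contents line up, the
following sketch splits a PAGE_DATA body into per-page entries.  The
helper name `split_page_data` is hypothetical; the bit positions and the
set of types without data come from the tables above, and the example
uses an artificially small page size to stay readable.

```python
import struct

PFN_MASK = (1 << 52) - 1             # bits 51-0 hold the PFN
NO_DATA_TYPES = {0xD, 0xE, 0xF}      # BROKEN, XALLOC, XTAB
RESERVED_TYPES = {0x5, 0x6, 0x7, 0x8}


def split_page_data(body, page_size=4096, big_endian=False):
    """Return a list of (pfn, type, data-or-None) tuples (hypothetical helper)."""
    order = ">" if big_endian else "<"
    (count,) = struct.unpack_from(order + "I4x", body)
    if count == 0:
        raise ValueError("count must be > 0")
    entries = struct.unpack_from(f"{order}{count}Q", body, 8)
    offset = 8 + 8 * count           # page contents start after the pfn array
    pages = []
    for entry in entries:
        ptype = entry >> 60          # bits 63-60 hold the page type
        pfn = entry & PFN_MASK
        if ptype in RESERVED_TYPES:
            raise ValueError(f"unrecognised page type {ptype:#x}")
        if ptype in NO_DATA_TYPES:
            pages.append((pfn, ptype, None))
            continue
        pages.append((pfn, ptype, body[offset:offset + page_size]))
        offset += page_size
    if offset != len(body):
        raise ValueError("record length does not match page count")
    return pages


# Two pfns: a normal page with contents and an XTAB entry without any.
body = (struct.pack("<I4x", 2)
        + struct.pack("<2Q", 0x10, (0xF << 60) | 0x11)
        + b"A" * 16)
print(split_page_data(body, page_size=16))
```

Here C is 2 and N is 1, matching the note that N may be smaller than C.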
359
360\clearpage
361
362X86_PV_INFO
363-----------
364
365     0     1     2     3     4     5     6     7 octet
366    +-----+-----+-----------+-------------------------+
367    | w   | ptl | (reserved)                          |
368    +-----+-----+-----------+-------------------------+
369
370--------------------------------------------------------------------
371Field            Description
372-----------      ---------------------------------------------------
373guest_width (w)  Guest width in octets (either 4 or 8).
374
375pt_levels (ptl)  Number of page table levels (either 3 or 4).
376--------------------------------------------------------------------
377
378\clearpage
379
380X86_PV_P2M_FRAMES
381-----------------
382
383     0     1     2     3     4     5     6     7 octet
384    +-----+-----+-----+-----+-------------------------+
385    | p2m_start_pfn (S)     | p2m_end_pfn (E)         |
386    +-----+-----+-----+-----+-------------------------+
387    | p2m_pfn[p2m frame containing pfn S]             |
388    +-------------------------------------------------+
389    ...
390    +-------------------------------------------------+
391    | p2m_pfn[p2m frame containing pfn E]             |
392    +-------------------------------------------------+
393
394--------------------------------------------------------------------
395Field            Description
396-------------    ---------------------------------------------------
397p2m_start_pfn    First pfn index in the p2m_pfn array.
398
399p2m_end_pfn      Last pfn index in the p2m_pfn array.
400
401p2m_pfn          Array of PFNs containing the guest's P2M table, for
402                 the PFN frames containing the PFN range S to E
403                 (inclusive).
404
405--------------------------------------------------------------------
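
The number of `p2m_pfn` entries follows from S, E and the number of P2M
entries that fit in one frame.  The sketch below assumes each P2M entry
is `guest_width` octets wide (as reported in X86_PV_INFO), which is an
assumption about the guest's P2M layout and is not stated in this
section; the function names are hypothetical.

```python
def p2m_frame_count(start_pfn, end_pfn, guest_width, page_size=4096):
    """Number of P2M frames covering pfns start_pfn..end_pfn inclusive."""
    # Assumption: one P2M entry occupies guest_width octets.
    entries_per_frame = page_size // guest_width
    first_frame = start_pfn // entries_per_frame
    last_frame = end_pfn // entries_per_frame
    return last_frame - first_frame + 1


def expected_body_length(start_pfn, end_pfn, guest_width, page_size=4096):
    """Two 4-octet fields followed by one 8-octet pfn per P2M frame."""
    return 8 + 8 * p2m_frame_count(start_pfn, end_pfn, guest_width, page_size)


# A 64-bit guest has 512 entries per 4 KiB frame, so pfn 512 needs a second frame.
print(p2m_frame_count(0, 511, 8), p2m_frame_count(0, 512, 8))
```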
406
407\clearpage
408
409X86_PV_VCPU_BASIC, EXTENDED, XSAVE, MSRS
410----------------------------------------
411
The format of these records is identical.  They are all binary blobs
of data which are accessed using specific pairs of domctl hypercalls.
414
415     0     1     2     3     4     5     6     7 octet
416    +-----------------------+-------------------------+
417    | vcpu_id               | (reserved)              |
418    +-----------------------+-------------------------+
419    | context...                                      |
420    ...
421    +-------------------------------------------------+
422
423---------------------------------------------------------------------
424Field            Description
425-----------      ----------------------------------------------------
426vcpu_id          The VCPU ID.
427
428context          Binary data for this VCPU.
429---------------------------------------------------------------------
430
431---------------------------------------------------------------------
432Record type                  Accessor hypercalls
433-----------------------      ----------------------------------------
434X86_PV_VCPU_BASIC            XEN_DOMCTL_{get,set}vcpucontext
435
436X86_PV_VCPU_EXTENDED         XEN_DOMCTL_{get,set}\_ext_vcpucontext
437
438X86_PV_VCPU_XSAVE            XEN_DOMCTL_{get,set}vcpuextstate
439
440X86_PV_VCPU_MSRS             XEN_DOMCTL_{get,set}\_vcpu_msrs
441---------------------------------------------------------------------
442
443\clearpage
444
445SHARED_INFO
446-----------
447
448The content of the Shared Info page.
449
450     0     1     2     3     4     5     6     7 octet
451    +-------------------------------------------------+
452    | shared_info                                     |
453    ...
454    +-------------------------------------------------+
455
456--------------------------------------------------------------------
457Field            Description
458-----------      ---------------------------------------------------
459shared_info      Contents of the shared info page.  This record
460                 should be exactly 1 page long.
461--------------------------------------------------------------------
462
463\clearpage
464
465X86_TSC_INFO
466------------
467
468Domain TSC information, as accessed by the
469XEN_DOMCTL_{get,set}tscinfo hypercall sub-ops.
470
471     0     1     2     3     4     5     6     7 octet
472    +------------------------+------------------------+
473    | mode                   | khz                    |
474    +------------------------+------------------------+
475    | nsec                                            |
476    +------------------------+------------------------+
477    | incarnation            | (reserved)             |
478    +------------------------+------------------------+
479
480--------------------------------------------------------------------
481Field            Description
482-----------      ---------------------------------------------------
483mode             TSC mode, TSC_MODE_* constant.
484
485khz              TSC frequency, in kHz.
486
487nsec             Elapsed time, in nanoseconds.
488
489incarnation      Incarnation.
490--------------------------------------------------------------------
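
A decoder for this record is short enough to show in full.  It is a
sketch based only on the diagram above; `parse_tsc_info` is a
hypothetical name and the example values are arbitrary.

```python
import struct


def parse_tsc_info(body, big_endian=False):
    """Decode an X86_TSC_INFO body (hypothetical helper)."""
    # mode (4), khz (4), nsec (8), incarnation (4), reserved (4).
    fmt = (">" if big_endian else "<") + "IIQI4x"
    if len(body) != struct.calcsize(fmt):
        raise ValueError("unexpected X86_TSC_INFO length")
    mode, khz, nsec, incarnation = struct.unpack(fmt, body)
    return {"mode": mode, "khz": khz, "nsec": nsec, "incarnation": incarnation}


# Arbitrary example values: mode 0, 2.4 GHz, one second elapsed.
example = struct.pack("<IIQI4x", 0, 2_400_000, 1_000_000_000, 1)
print(parse_tsc_info(example))
```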
491
492\clearpage
493
494HVM_CONTEXT
495-----------
496
497HVM Domain context, as accessed by the
498XEN_DOMCTL_{get,set}hvmcontext hypercall sub-ops.
499
500     0     1     2     3     4     5     6     7 octet
501    +-------------------------------------------------+
502    | hvm_ctx                                         |
503    ...
504    +-------------------------------------------------+
505
506--------------------------------------------------------------------
507Field            Description
508-----------      ---------------------------------------------------
509hvm_ctx          The HVM Context blob from Xen.
510--------------------------------------------------------------------
511
512\clearpage
513
514HVM_PARAMS
515----------
516
517HVM Domain parameters, as accessed by the
518HVMOP_{get,set}\_param hypercall sub-ops.
519
520     0     1     2     3     4     5     6     7 octet
521    +------------------------+------------------------+
522    | count (C)              | (reserved)             |
523    +------------------------+------------------------+
524    | param[0].index                                  |
525    +-------------------------------------------------+
526    | param[0].value                                  |
527    +-------------------------------------------------+
528    ...
529    +-------------------------------------------------+
530    | param[C-1].index                                |
531    +-------------------------------------------------+
532    | param[C-1].value                                |
533    +-------------------------------------------------+
534
535--------------------------------------------------------------------
536Field            Description
537-----------      ---------------------------------------------------
538count            The number of parameters contained in this record.
539                 Each parameter in the record contains an index and
540                 value.
541
542param index      Parameter index.
543
544param value      Parameter value.
545--------------------------------------------------------------------
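
The index/value pairs can be unpacked as shown in this sketch.  The
helper name `parse_hvm_params` is hypothetical, and the parameter index
and value in the example are arbitrary numbers rather than real
HVM parameters.

```python
import struct


def parse_hvm_params(body, big_endian=False):
    """Return a list of (index, value) pairs (hypothetical helper)."""
    order = ">" if big_endian else "<"
    (count,) = struct.unpack_from(order + "I4x", body)
    # Each parameter is a 64-bit index followed by a 64-bit value.
    if len(body) != 8 + 16 * count:
        raise ValueError("body_length does not match count")
    flat = struct.unpack_from(f"{order}{2 * count}Q", body, 8)
    return list(zip(flat[0::2], flat[1::2]))


# Two parameters with made-up indices and values.
example = struct.pack("<I4x", 2) + struct.pack("<4Q", 1, 0xFEFF, 6, 0x1000)
print(parse_hvm_params(example))
```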
546
547\clearpage
548
549TOOLSTACK (deprecated)
550----------------------
551
> *This record was only present for transitionary purposes during
>  development.  It should not be used.*
554
555An opaque blob provided by and supplied to the higher layers of the
556toolstack (e.g., libxl) during save and restore.
557
558     0     1     2     3     4     5     6     7 octet
559    +------------------------+------------------------+
560    | data                                            |
561    ...
562    +-------------------------------------------------+
563
564--------------------------------------------------------------------
565Field            Description
566-----------      ---------------------------------------------------
567data             Blob of toolstack-specific data.
568--------------------------------------------------------------------
569
570\clearpage
571
572VERIFY
573------
574
575A verify record indicates that, while all memory has now been sent, the sender
576shall send further memory records for debugging purposes.
577
578     0     1     2     3     4     5     6     7 octet
579    +-------------------------------------------------+
580
581The verify record contains no fields; its body_length is 0.
582
583\clearpage
584
585CHECKPOINT
586----------
587
588A checkpoint record indicates that all the preceding records in the stream
589represent a consistent view of VM state.
590
591     0     1     2     3     4     5     6     7 octet
592    +-------------------------------------------------+
593
The checkpoint record contains no fields; its body_length is 0.
595
596If the stream is embedded in a higher level toolstack stream, the
597CHECKPOINT record marks the end of the libxc portion of the stream
598and the stream is handed back to the higher level for further
599processing.
600
601The higher level stream may then hand the stream back to libxc to
602process another set of records for the next consistent VM state
603snapshot.  This next set of records may be terminated by another
604CHECKPOINT record or an END record.
605
606\clearpage
607
608CHECKPOINT_DIRTY_PFN_LIST
609-------------------------
610
A checkpoint dirty pfn list record is used to convey information about
dirty memory in the VM.  It is an unordered list of PFNs.  It is currently
only applicable in the backchannel of a checkpointed stream, and is only
used by COLO; see README.colo for more detail.
615
616     0     1     2     3     4     5     6     7 octet
617    +-------------------------------------------------+
618    | pfn[0]                                          |
619    +-------------------------------------------------+
620    ...
621    +-------------------------------------------------+
622    | pfn[C-1]                                        |
623    +-------------------------------------------------+
624
The count of pfns (C) is the record's `body_length` divided by
`sizeof(uint64_t)`.
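
Since the record carries no explicit count, a reader derives it from the
body length.  A minimal sketch, with the hypothetical helper name
`parse_dirty_pfn_list`:

```python
import struct


def parse_dirty_pfn_list(body, big_endian=False):
    """Return the list of dirty PFNs in the record body."""
    if len(body) % 8:
        raise ValueError("body_length is not a multiple of 8")
    count = len(body) // 8          # body_length / sizeof(uint64_t)
    order = ">" if big_endian else "<"
    return list(struct.unpack(f"{order}{count}Q", body))


example = struct.pack("<3Q", 0x100, 0x2A, 0x7FF)
print(parse_dirty_pfn_list(example))
```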
626
627\clearpage
628
629STATIC_DATA_END
630---------------
631
A static data end record marks the end of the static state, i.e. state which
is invariant with respect to guest execution.
634
635
636     0     1     2     3     4     5     6     7 octet
637    +-------------------------------------------------+
638
The static data end record contains no fields; its body_length is 0.
640
641\clearpage
642
643X86_CPUID_POLICY
644----------------
645
646CPUID policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy
647hypercall sub-ops.
648
649     0     1     2     3     4     5     6     7 octet
650    +-------------------------------------------------+
651    | CPUID_policy                                    |
652    ...
653    +-------------------------------------------------+
654
655--------------------------------------------------------------------
656Field            Description
657------------     ---------------------------------------------------
658CPUID_policy     Array of xen_cpuid_leaf_t[]'s
659--------------------------------------------------------------------
660
661\clearpage
662
663X86_MSR_POLICY
664--------------
665
666MSR policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy
667hypercall sub-ops.
668
669     0     1     2     3     4     5     6     7 octet
670    +-------------------------------------------------+
671    | MSR_policy                                      |
672    ...
673    +-------------------------------------------------+
674
675--------------------------------------------------------------------
676Field            Description
677----------       ---------------------------------------------------
678MSR_policy       Array of xen_msr_entry_t[]'s
679--------------------------------------------------------------------
680
681\clearpage
682
683
684Layout
685======
686
687The set of valid records depends on the guest architecture and type.  No
688assumptions should be made about the ordering or interleaving of
689independent records.  Record dependencies are noted below.
690
691Some records are used for signalling, and explicitly have zero length.  All
692other records contain data relevant to the migration.  Data records with no
693content should be elided on the source side, as their presence serves no
694purpose, but results in extra work for the restore side.
695
696x86 PV Guest
697------------
698
699A typical save record for an x86 PV guest image would look like:
700
701* Image header
702* Domain header
703* Static data records:
704    * X86_PV_INFO record
705    * X86_{CPUID,MSR}_POLICY
706    * STATIC_DATA_END
707* X86_PV_P2M_FRAMES record
708* Many PAGE_DATA records
709* X86_TSC_INFO
710* SHARED_INFO record
711* VCPU context records for each online VCPU
712    * X86_PV_VCPU_BASIC record
713    * X86_PV_VCPU_EXTENDED record
714    * X86_PV_VCPU_XSAVE record
715    * X86_PV_VCPU_MSRS record
716* END record
717
718There are some strict ordering requirements.  The following records must
719be present in the following order as each of them depends on information
720present in the preceding ones.
721
722* X86_PV_INFO record
723* X86_PV_P2M_FRAMES record
724* PAGE_DATA records
725* VCPU records
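
One way to check this ordering is to assign each dependent record a
stage and verify that stages never go backwards.  The sketch below is a
simplification that covers a single, non-checkpointed stream; the names
`PV_STAGE` and `pv_order_ok` are hypothetical.

```python
# Stage numbers follow the dependency order listed above.
PV_STAGE = {
    "X86_PV_INFO": 0,
    "X86_PV_P2M_FRAMES": 1,
    "PAGE_DATA": 2,
    "X86_PV_VCPU_BASIC": 3,
    "X86_PV_VCPU_EXTENDED": 3,
    "X86_PV_VCPU_XSAVE": 3,
    "X86_PV_VCPU_MSRS": 3,
}


def pv_order_ok(record_names):
    """True if all four stages are present and appear in dependency order."""
    highest = -1
    seen = set()
    for name in record_names:
        stage = PV_STAGE.get(name)
        if stage is None:
            continue                # independent records may appear anywhere
        if stage < highest:
            return False            # a record arrived before its dependency
        highest = stage
        seen.add(stage)
    return seen == {0, 1, 2, 3}


good = ["X86_PV_INFO", "STATIC_DATA_END", "X86_PV_P2M_FRAMES", "PAGE_DATA",
        "PAGE_DATA", "X86_TSC_INFO", "SHARED_INFO", "X86_PV_VCPU_BASIC", "END"]
print(pv_order_ok(good))
```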
726
727x86 HVM Guest
728-------------
729
730A typical save record for an x86 HVM guest image would look like:
731
732* Image header
733* Domain header
734* Static data records:
735    * X86_{CPUID,MSR}_POLICY
736    * STATIC_DATA_END
737* Many PAGE_DATA records
738* X86_TSC_INFO
739* HVM_PARAMS
740* HVM_CONTEXT
741* END record
742
743HVM_PARAMS must precede HVM_CONTEXT, as certain parameters can affect
744the validity of architectural state in the context.
745
746Compatibility with older versions
747=================================
748
749v3 compat with v2
750-----------------
751
A v3 stream is compatible with a v2 stream, but mandates the presence of a
STATIC_DATA_END record ahead of any memory/register content.  This is to ease
the introduction of new static configuration records over time.
755
A v3-compatible receiver interpreting a v2 stream should infer the position of
STATIC_DATA_END based on finding the first X86_PV_P2M_FRAMES record (for PV
guests), or PAGE_DATA record (for HVM guests) and behave as if STATIC_DATA_END
had been sent.
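
The inference can be sketched as a filter that injects a synthetic
STATIC_DATA_END into a v2 record sequence.  This is an illustrative
sketch only; `with_static_data_end` is a hypothetical name and records
are modelled as simple `(name, body)` tuples.

```python
def with_static_data_end(records, version, is_pv):
    """Yield records, inserting STATIC_DATA_END where a v2 stream implies it."""
    trigger = "X86_PV_P2M_FRAMES" if is_pv else "PAGE_DATA"
    seen_end = False
    for name, body in records:
        if name == "STATIC_DATA_END":
            seen_end = True
        elif version < 3 and not seen_end and name == trigger:
            # Behave as if STATIC_DATA_END had been sent just before this record.
            yield ("STATIC_DATA_END", b"")
            seen_end = True
        yield (name, body)


v2_hvm = [("PAGE_DATA", b"..."), ("HVM_PARAMS", b"..."), ("END", b"")]
print([name for name, _ in with_static_data_end(v2_hvm, version=2, is_pv=False)])
```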
760
761Legacy Images (x86 only)
762------------------------
763
764Restoring legacy images from older tools shall be handled by
765translating the legacy format image into this new format.
766
767It shall not be possible to save in the legacy format.
768
769There are two different legacy images depending on whether they were
770generated by a 32-bit or a 64-bit toolstack. These shall be
771distinguished by inspecting octets 4-7 in the image.  If these are
772zero then it is a 64-bit image.
773
Toolstack  Field                            Value
---------  -----                            -----
64-bit     Bit 32-63 of the p2m_size field  0 (since p2m_size < 2^32^)
32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
32-bit     Chunk type (HVM)                 < 0
32-bit     Page count (HVM)                 > 0
780
781Table: Possible values for octet 4-7 in legacy images
782
783This assumes the presence of the extended-info chunk which was
784introduced in Xen 3.0.
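
Combined with the marker check from the image header, the detection
rule can be sketched as follows.  The function name `classify_image` and
its return strings are hypothetical; the logic uses only the first 8
octets of the image.

```python
def classify_image(first8):
    """Classify an image from its first 8 octets (hypothetical helper)."""
    if len(first8) != 8:
        raise ValueError("need exactly 8 octets")
    if first8 == b"\xff" * 8:
        return "current"            # marker of this specification
    if first8[4:8] == b"\x00" * 4:
        return "legacy-64"          # upper half of a 64-bit p2m_size is zero
    return "legacy-32"


print(classify_image(b"\xff" * 8))
print(classify_image(b"\x00\x00\x10\x00" + b"\x00" * 4))
print(classify_image(b"\x00\x00\x10\x00" + b"\xff" * 4))
```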
785
786
787Future Extensions
788=================
789
790All changes to this specification should bump the revision number in
791the title block.
792
793All changes to the image or domain headers require the image version
794to be increased.
795
796The format may be extended by adding additional record types.
797
798Extending an existing record type must be done by adding a new record
799type.  This allows old images with the old record to still be
800restored.
801
802The image header may only be extended by _appending_ additional
803fields.  In particular, the `marker`, `id` and `version` fields must
804never change size or location.
805
806
807Errata
808======
809
1. For compatibility with older code, the receiving side of a stream should
   tolerate and ignore variable-sized records with zero content.  Xen releases
   between 4.6 and 4.8 could end up generating valid HVM_PARAMS or
   X86_PV_VCPU_{EXTENDED,XSAVE,MSRS} records with zero-length content.
814