Blktap2 Userspace Tools + Library
=================================

Dutch Meyer
4th June 2009

Andrew Warfield and Julian Chesterfield
16th June 2006


The blktap2 userspace toolkit provides a user-level disk I/O
interface. The blktap2 mechanism involves a kernel driver that acts
similarly to the existing Xen/Linux blkback driver, and a set of
associated user-level libraries.  Using these tools, blktap2 allows
virtual block devices presented to VMs to be implemented in userspace
and to be backed by raw partitions, files, network, etc.

The key benefit of blktap2 is that it makes it easy and fast to write
arbitrary block backends, and that these user-level backends actually
perform very well.  Specifically:

- Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
  formats and other compression features can be easily implemented.

- Accessing file-based images from userspace avoids problems related
  to flushing dirty pages which are present in the Linux loopback
  driver.  (Specifically, doing a large number of writes to an
  NFS-backed image doesn't result in the OOM killer going berserk.)

- Per-disk handler processes enable easier userspace policing of block
  resources, and process-granularity QoS techniques (disk scheduling
  and related tools) may be trivially applied to block devices.

- It's very easy to take advantage of userspace facilities such as
  networking libraries, compression utilities, peer-to-peer
  file-sharing systems and so on to build more complex block backends.

- Crashes are contained -- incremental development/debugging is very
  fast.

How it works (in one paragraph):

Working in conjunction with the kernel blktap2 driver, all disk I/O
requests from VMs are passed to the userspace daemon (using a shared
memory interface) through a character device. Each active disk is
mapped to an individual device node, allowing per-disk processes to
implement individual block devices where desired.  The userspace
drivers are implemented using asynchronous (Linux libaio),
O_DIRECT-based calls to preserve the unbuffered, batched and
asynchronous request dispatch achieved with the existing blkback
code.  We provide a simple, asynchronous virtual disk interface that
makes it quite easy to add new disk implementations.

As of June 2009 the currently supported disk formats are:

 - Raw Images (both on partitions and in image files)
 - Fast sharable RAM disk between VMs (requires some form of
   cluster-based filesystem support e.g. OCFS2 in the guest kernel)
 - VHD, including snapshots and sparse images
 - Qcow, including snapshots and sparse images


Build and Installation Instructions
===================================

Make sure to configure the blktap2 backend driver in your dom0 kernel.
It will inter-operate with the existing backend and frontend drivers.
It will also coexist with the original blktap driver.  However, some
formats (currently aio and qcow) will default to their blktap2
versions when specified in a VM configuration file.

To build the tools separately, "make && make install" in
tools/blktap2.


Using the Tools
===============

Preparing an image for boot:

The userspace disk agent is configured to start automatically via
xend.

Customize the VM config file to use the 'tap:tapdisk' handler,
followed by the driver type, e.g. for a raw image such as a file or
partition:

disk = ['tap:tapdisk:aio:<FILENAME>,sda1,w']

Alternatively, the vhd-util tool (installed with make install, or in
/blktap2/vhd) can be used to build sparse copy-on-write vhd images.

For example, to build a sparse image -
  vhd-util create -n MyVHDFile -s 1024

This creates a sparse 1GB file named "MyVHDFile" (the -s size is given
in MB) that can be mounted and populated with data.

One can also base the image on a raw file -
  vhd-util snapshot -n MyVHDFile -p SomeRawFile -m

This creates a sparse VHD file named "MyVHDFile" using "SomeRawFile"
as a parent image.  Copy-on-write semantics ensure that writes will be
stored in "MyVHDFile" while reads will be directed to the most
recently written version of the data, either in "MyVHDFile" or
"SomeRawFile" as is appropriate.  Other options exist as well; consult
the vhd-util application for the complete set of VHD tools.
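
The copy-on-write behaviour described above can be sketched with a
small in-memory model.  This is an illustration only and is not the
VHD on-disk format: struct cow_image, the per-sector written[] map and
the cow_read/cow_write helpers are hypothetical names introduced here,
and the 512-byte sector size is an assumption.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512   /* assumed sector size for this sketch */
#define NUM_SECTORS 8     /* tiny disk so the example stays readable */

/* Hypothetical model of a child image layered on a parent image. */
struct cow_image {
    uint8_t parent[NUM_SECTORS][SECTOR_SIZE]; /* base, never modified */
    uint8_t child[NUM_SECTORS][SECTOR_SIZE];  /* sparse overlay */
    uint8_t written[NUM_SECTORS];             /* 1 if child holds sector */
};

/* Writes always land in the child and mark the sector as present.
 * Returns 0 on success, -1 if the sector is out of range. */
int cow_write(struct cow_image *img, unsigned sec, const uint8_t *buf)
{
    if (sec >= NUM_SECTORS)
        return -1;
    memcpy(img->child[sec], buf, SECTOR_SIZE);
    img->written[sec] = 1;
    return 0;
}

/* Reads return the child's copy if one exists, otherwise the parent's.
 * Returns 0 on success, -1 if the sector is out of range. */
int cow_read(const struct cow_image *img, unsigned sec, uint8_t *buf)
{
    const uint8_t *src;

    if (sec >= NUM_SECTORS)
        return -1;
    src = img->written[sec] ? img->child[sec] : img->parent[sec];
    memcpy(buf, src, SECTOR_SIZE);
    return 0;
}
```

In this model the parent array plays the role of "SomeRawFile" and the
child array the role of "MyVHDFile": the parent is never touched, so
several children could share one parent.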

VHD files can be mounted automatically in a guest similarly to the
above AIO example simply by specifying the vhd driver.

disk = ['tap:tapdisk:vhd:<VHD FILENAME>,sda1,w']


Snapshots:

Pausing a guest will also plug the corresponding IO queue for blktap2
devices and stop blktap2 drivers.  This can be used to implement a
safe live snapshot of qcow and vhd disks.  An example script "xmsnap"
is shown in the tools/blktap2/drivers directory.  This script will
perform a live snapshot of a qcow disk.  VHD files can use the
"vhd-util snapshot" tool discussed above.  If this snapshot command is
applied to a raw file mounted with tap:tapdisk:aio, include the -m
flag and the driver will be reloaded as VHD.  If applied to an already
mounted VHD file, omit the -m flag.


Mounting images in Dom0 using the blktap2 driver
================================================

Tap (and blkback) disks are also mountable in Dom0 without requiring an
active VM to attach.

The syntax is -
  tapdisk2 -n <type>:<full path to file>

For example -
  tapdisk2  -n aio:/home/images/rawFile.img

When successful the location of the new device will be provided by
tapdisk2 to stdout and tapdisk2 will terminate.  From that point
forward control of the device is provided through sysfs in the
directory -

  /sys/class/blktap2/blktap#/

Where # is a blktap2 device number present in the path that tapdisk2
printed before terminating.  The sysfs interface is largely intuitive;
for example, to remove tap device 0 one would -

  echo 1 > /sys/class/blktap2/blktap0/remove

Similarly, a pause control is available, which can be used to plug
the request queue of a live running guest.
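
A management tool written in C can drive the same controls without
going through a shell.  The helper below is a minimal sketch that
does what the echo command above does; write_control is a
hypothetical name introduced here and is not part of the blktap2
tools.

```c
#include <stdio.h>

/* Write a short string to a control file, for example
 * "/sys/class/blktap2/blktap0/remove".
 * Returns 0 on success and -1 on any error. */
int write_control(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    if (fputs(value, f) == EOF) {
        fclose(f);
        return -1;
    }
    /* fclose flushes the buffer; a failure here means the value
     * did not reach the file. */
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling write_control("/sys/class/blktap2/blktap0/remove", "1") would
then be the equivalent of the shell example, assuming the caller has
permission to write to the sysfs node.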

Previous versions of blktap mounted devices in dom0 by using blkfront
in dom0 and the xm block-attach command.  This approach is still
available, though slightly more cumbersome.


Tapdisk Development
===================

People regularly ask how to develop their own tapdisk drivers, and
while it has not yet been well documented, the process is relatively
easy.  Here I will provide a brief overview.  The best reference, of
course, comes from the existing drivers.  Specifically,
blktap2/drivers/block-ram.c and blktap2/drivers/block-aio.c provide
the clearest examples of simple drivers.


Setup:

First you need to register your new driver with blktap. This is done
in disktypes.h.  There are five things that you must do.  To
demonstrate, I will create a disk called "mynewdisk"; you can name
yours freely.

1) Forward declare an instance of struct tap_disk.

e.g. -
  extern struct tap_disk tapdisk_mynewdisk;

2) Claim one of the unused disk type numbers.  Take care to observe
the MAX_DISK_TYPES macro, increasing its value if necessary.

e.g. -
  #define DISK_TYPE_MYNEWDISK         10

3) Create an instance of disk_info_t.  The bulk of this file contains
examples of these.

e.g. -
  static disk_info_t mynewdisk_disk = {
          DISK_TYPE_MYNEWDISK,
          "My New Disk (mynewdisk)",
          "mynewdisk",
          0,
  #ifdef TAPDISK
          &tapdisk_mynewdisk,
  #endif
  };

A few words about what these mean.  The first field must be the disk
type number you claimed in step (2).  The second field is a string
describing your disk, and may contain any relevant info.  The third
field is the name of your disk as will be used by the tapdisk2 utility
and xend (for example tapdisk2 -n mynewdisk:/path/to/disk.image, or in
your xm create config file).  The fourth is binary and determines
whether you will have one instance of your driver, or many.  Here, a 1
means that your driver is a singleton and will coordinate access to
any number of tap devices.  0 is more common, meaning that you will
have one driver for each device that is created.  The final field
should contain a reference to the struct tap_disk you declared in step
(1).

4) Add a reference to your disk info structure (from step (3)) to the
dtypes array.  Take care here - you need to place it in the position
corresponding to the device type number you claimed in step (2).  So
we would place &mynewdisk_disk in dtypes[10].  Look at the other
devices in this array and pad with "&null_disk," as necessary.

5) Modify the xend python scripts.  You need to add your disk name to
the list of disks that xend recognizes.

edit:
  tools/python/xen/xend/server/BlktapController.py

And add your disk to the "blktap_disk_types" array near the top of
the file.  Use the same name you specified in the third field of step
(3).  The order of this list is not important.
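
To make the relationship between steps (1) to (4) concrete, here is a
self-contained sketch of a registration table.  It does not use the
real disktypes.h: the struct layouts and field names (idx, name,
handle, single_handler, drv), the lookup_disk helper and the small
type numbers are assumptions chosen to keep the example short.  The
point it shows is that an entry's position in dtypes must equal its
type number, with &null_disk filling the unused slots.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the definitions in disktypes.h. */
struct tap_disk { const char *disk_type; };

typedef struct disk_info {
    int              idx;            /* disk type number (step 2) */
    const char      *name;           /* human-readable description */
    const char      *handle;         /* name used by tapdisk2 and xend */
    int              single_handler; /* 1 = singleton, 0 = per device */
    struct tap_disk *drv;            /* driver entry points (step 1) */
} disk_info_t;

#define DISK_TYPE_AIO        0
#define DISK_TYPE_MYNEWDISK  3   /* the text uses 10; 3 keeps this short */
#define MAX_DISK_TYPES       4

struct tap_disk tapdisk_aio       = { "tapdisk_aio" };
struct tap_disk tapdisk_mynewdisk = { "tapdisk_mynewdisk" };

static disk_info_t null_disk = { -1, "null disk", "null", 0, NULL };
static disk_info_t aio_disk = {
    DISK_TYPE_AIO, "raw image (aio)", "aio", 0, &tapdisk_aio
};
static disk_info_t mynewdisk_disk = {
    DISK_TYPE_MYNEWDISK, "My New Disk (mynewdisk)", "mynewdisk", 0,
    &tapdisk_mynewdisk
};

/* The position in this array must equal the entry's type number. */
static disk_info_t *dtypes[MAX_DISK_TYPES] = {
    &aio_disk,        /* 0 */
    &null_disk,       /* 1: unused slot */
    &null_disk,       /* 2: unused slot */
    &mynewdisk_disk,  /* 3 */
};

/* Find the entry whose handle matches, skipping padding entries and
 * any entry stored at the wrong index.  Returns NULL if not found. */
const disk_info_t *lookup_disk(const char *handle)
{
    int i;

    for (i = 0; i < MAX_DISK_TYPES; i++) {
        if (dtypes[i]->idx == i &&
            strcmp(dtypes[i]->handle, handle) == 0)
            return dtypes[i];
    }
    return NULL;
}
```

With this table, lookup_disk("mynewdisk") returns the entry stored at
dtypes[DISK_TYPE_MYNEWDISK].  An entry placed in the wrong slot would
never be found, which is the failure that step (4) warns about.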


Now your driver is ready to be written.  Create a block-mynewdisk.c in
tools/blktap2/drivers and add it to the Makefile.


Development:

Copying block-aio.c and block-ram.c would be a good place to start.
Read those files as you go through this section; I will comment on a
few useful functions and structures.

struct tap_disk:

Remember the forward declaration in step (1) of the setup phase above?
Now is the time to make that structure a reality.  This structure
contains a list of function pointers for all the routines that will be
asked of your driver.  Currently the required functions are open,
close, read, write, get_parent_id, validate_parent, and debug.

e.g. -
  struct tap_disk tapdisk_mynewdisk = {
          .disk_type          = "tapdisk_mynewdisk",
          .flags              = 0,
          .private_data_size  = sizeof(struct tdmynewdisk_state),
          .td_open            = tdmynewdisk_open,
                 ....

The private_data_size field is used to provide a structure to store
the state of your device.  It is very likely that you will want
something here, but you are free to design whatever structure you
want.  Blktap will allocate this space for you; you just need to tell
it how much space you want.


tdmynewdisk_open:

This is the open routine.  The first argument is a structure
representing your driver.  Two fields in this structure are
interesting.

driver->data will contain a block of memory of the size you requested
in the .private_data_size field of your struct tap_disk (above).

driver->info contains a structure that details information about your
disk.  You need to fill this out.  By convention this is done with a
_get_image_info() function.  Assign a size (the total number of
sectors), sector_size (the size of each sector in bytes), and set
driver->info->info to 0.

The second parameter contains the name that was specified in the
creation of your device, either through xend, or on the command line
with tapdisk2.  Usually this specifies a file that you will open in
this routine.  The final parameter, flags, contains one of a number of
flags specified in tapdisk.h that may change the way you treat the
disk.
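
The shape of an open routine can be sketched as follows.  This is not
code from the blktap2 tree: the td_driver_t, td_disk_info_t and
td_flag_t definitions are simplified stand-ins written for this
example (the real ones live in the tapdisk headers and have more
fields), the 512-byte sector size is an assumption, and the sketch
only handles regular files, ignores flags and returns -1 on any
error.

```c
#include <stdint.h>
#include <stdio.h>

#define SECTOR_SIZE 512            /* assumed sector size */

/* Simplified stand-ins for the blktap2 types. */
typedef uint32_t td_flag_t;

typedef struct {
    uint64_t size;                 /* total number of sectors */
    uint64_t sector_size;          /* bytes per sector */
    uint32_t info;                 /* set to 0 */
} td_disk_info_t;

typedef struct {
    void           *data;          /* private_data_size bytes */
    td_disk_info_t  info;
} td_driver_t;

/* Private per-device state; its size goes in .private_data_size. */
struct tdmynewdisk_state {
    FILE *fp;                      /* backing image file */
};

/* Fill in the disk geometry from the length of a regular file. */
int tdmynewdisk_get_image_info(FILE *fp, td_disk_info_t *info)
{
    long bytes;

    if (fseek(fp, 0, SEEK_END) != 0)
        return -1;
    bytes = ftell(fp);
    if (bytes < 0)
        return -1;
    rewind(fp);

    info->size        = (uint64_t)bytes / SECTOR_SIZE;
    info->sector_size = SECTOR_SIZE;
    info->info        = 0;
    return 0;
}

int tdmynewdisk_open(td_driver_t *driver, const char *name,
                     td_flag_t flags)
{
    struct tdmynewdisk_state *prv = driver->data;

    (void)flags;                   /* this sketch ignores the flags */

    prv->fp = fopen(name, "rb");
    if (prv->fp == NULL)
        return -1;

    if (tdmynewdisk_get_image_info(prv->fp, &driver->info) != 0) {
        fclose(prv->fp);
        prv->fp = NULL;
        return -1;
    }
    return 0;
}

int tdmynewdisk_close(td_driver_t *driver)
{
    struct tdmynewdisk_state *prv = driver->data;

    if (prv->fp != NULL) {
        fclose(prv->fp);
        prv->fp = NULL;
    }
    return 0;
}
```

A real driver would typically open the image with O_DIRECT and query
block devices for their size; block-aio.c shows how the existing
drivers handle both.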


_queue_read/write:

These are your read and write operations.  What you do here will
depend on your disk, but you should do exactly one of -

1) Call td_complete_request with either error or success code.

2) Call td_forward_request, which will forward the request to the next
driver in the stack.

3) Queue the request for asynchronous processing with
td_prep_read/write.  In doing so, you will also register a callback
for request completion.  When the request completes you must do one of
options (1) or (2) above.  Finally, call td_queue_tiocb to submit the
request to a wait queue.

The above functions are defined in tapdisk-interface.c.  If you don't
use them as specified you will run into problems as your driver will
fail to inform blktap of the state of requests that have been
submitted.  Blktap keeps track of all requests and does not like
losing track.
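
Option (1) is the simplest to illustrate.  The sketch below models a
RAM-backed disk in the spirit of block-ram.c, where each request is
served with memcpy and completed immediately.  Everything here is a
stand-in written for this example: the td_request_t and td_driver_t
layouts are simplified, and td_complete_request is a mock with the
same name that only records the result, so that the "complete every
request exactly once" rule can be checked.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE  512u          /* assumed sector size */
#define DISK_SECTORS 16u           /* tiny disk for the sketch */

/* Simplified stand-ins for the blktap2 types. */
typedef struct {
    char     *buf;                 /* data buffer */
    uint64_t  sec;                 /* first sector of the request */
    int       secs;                /* number of sectors */
} td_request_t;

typedef struct {
    void *data;                    /* private driver state */
} td_driver_t;

struct tdmynewdisk_state {
    char image[DISK_SECTORS * SECTOR_SIZE];
};

/* Mock of td_complete_request: the real function reports back to
 * blktap, this one just counts calls and remembers the result. */
int completed_requests;
int last_result;

void td_complete_request(td_request_t treq, int res)
{
    (void)treq;
    completed_requests++;
    last_result = res;
}

/* A request is valid if it covers at least one sector and stays
 * inside the disk. */
static int in_range(td_request_t treq)
{
    return treq.secs > 0 &&
           treq.sec < DISK_SECTORS &&
           (uint64_t)treq.secs <= DISK_SECTORS - treq.sec;
}

void tdmynewdisk_queue_read(td_driver_t *driver, td_request_t treq)
{
    struct tdmynewdisk_state *prv = driver->data;

    if (!in_range(treq)) {
        td_complete_request(treq, -EINVAL);  /* complete with error */
        return;
    }
    memcpy(treq.buf, prv->image + treq.sec * SECTOR_SIZE,
           (size_t)treq.secs * SECTOR_SIZE);
    td_complete_request(treq, 0);            /* complete with success */
}

void tdmynewdisk_queue_write(td_driver_t *driver, td_request_t treq)
{
    struct tdmynewdisk_state *prv = driver->data;

    if (!in_range(treq)) {
        td_complete_request(treq, -EINVAL);
        return;
    }
    memcpy(prv->image + treq.sec * SECTOR_SIZE, treq.buf,
           (size_t)treq.secs * SECTOR_SIZE);
    td_complete_request(treq, 0);
}
```

Note that every path through both functions ends in exactly one call
to td_complete_request, including the error path.  A driver that uses
option (3) defers that call to its completion callback instead.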


_close, _get_parent_id, _validate_parent:

These last few tend to be very routine.  _close is called when the
device is closed, and also when it is paused (in this case, open will
also be called later).  The other functions are used in stacking
drivers.  Most often drivers will return TD_NO_PARENT and -EINVAL,
respectively.
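
For a driver that does not stack on a parent, the two parent-related
routines reduce to the stubs below.  The opaque type declarations and
the value given to TD_NO_PARENT are placeholders so that the sketch
compiles on its own; a real driver takes all of them from the tapdisk
headers.

```c
#include <errno.h>

/* Placeholders so the sketch is self-contained; use the definitions
 * from the tapdisk headers in a real driver. */
#define TD_NO_PARENT 1             /* assumed value */
typedef struct td_driver  td_driver_t;
typedef struct td_disk_id td_disk_id_t;
typedef unsigned int      td_flag_t;

/* This disk has no parent image. */
int tdmynewdisk_get_parent_id(td_driver_t *driver, td_disk_id_t *id)
{
    (void)driver;
    (void)id;
    return TD_NO_PARENT;
}

/* With no parent, there is nothing that could be valid. */
int tdmynewdisk_validate_parent(td_driver_t *driver,
                                td_driver_t *parent, td_flag_t flags)
{
    (void)driver;
    (void)parent;
    (void)flags;
    return -EINVAL;
}
```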