Blktap2 Userspace Tools + Library
=================================

Dutch Meyer
4th June 2009

Andrew Warfield and Julian Chesterfield
16th June 2006


The blktap2 userspace toolkit provides a user-level disk I/O
interface. The blktap2 mechanism involves a kernel driver that acts
similarly to the existing Xen/Linux blkback driver, and a set of
associated user-level libraries.  Using these tools, blktap2 allows
virtual block devices presented to VMs to be implemented in userspace
and to be backed by raw partitions, files, network, etc.

The key benefit of blktap2 is that it makes it easy and fast to write
arbitrary block backends, and that these user-level backends actually
perform very well.  Specifically:

- Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
  formats and other compression features can be easily implemented.

- Accessing file-based images from userspace avoids problems related
  to flushing dirty pages which are present in the Linux loopback
  driver.  (Specifically, doing a large number of writes to an
  NFS-backed image doesn't result in the OOM killer going berserk.)

- Per-disk handler processes enable easier userspace policing of block
  resources, and process-granularity QoS techniques (disk scheduling
  and related tools) may be trivially applied to block devices.

- It's very easy to take advantage of userspace facilities such as
  networking libraries, compression utilities, peer-to-peer
  file-sharing systems and so on to build more complex block backends.

- Crashes are contained -- incremental development/debugging is very
  fast.

How it works (in one paragraph):

Working in conjunction with the kernel blktap2 driver, all disk I/O
requests from VMs are passed to the userspace daemon (using a shared
memory interface) through a character device.
Each active disk is mapped to an individual device node, allowing
per-disk processes to implement individual block devices where
desired.  The userspace drivers are implemented using asynchronous
(Linux libaio), O_DIRECT-based calls to preserve the unbuffered,
batched and asynchronous request dispatch achieved with the existing
blkback code.  We provide a simple, asynchronous virtual disk
interface that makes it quite easy to add new disk implementations.

As of June 2009 the currently supported disk formats are:

 - Raw Images (both on partitions and in image files)
 - Fast sharable RAM disk between VMs (requires some form of
   cluster-based filesystem support e.g. OCFS2 in the guest kernel)
 - VHD, including snapshots and sparse images
 - Qcow, including snapshots and sparse images


Build and Installation Instructions
===================================

Make sure to configure the blktap2 backend driver in your dom0 kernel.
It will inter-operate with the existing backend and frontend drivers.
It will also coexist with the original blktap driver.  However, some
formats (currently aio and qcow) will default to their blktap2
versions when specified in a VM configuration file.

To build the tools separately, run "make && make install" in
tools/blktap2.


Using the Tools
===============

Preparing an image for boot:

The userspace disk agent is configured to start automatically via
xend.

Customize the VM config file to use the 'tap:tapdisk' handler,
followed by the driver type, e.g. for a raw image such as a file or
partition:

disk = ['tap:tapdisk:aio:<FILENAME>,sda1,w']

Alternatively, the vhd-util tool (installed with make install, or in
/blktap2/vhd) can be used to build sparse copy-on-write vhd images.

For example, to build a sparse image -
  vhd-util create -n MyVHDFile -s 1024

This creates a sparse 1GB file named "MyVHDFile" (the -s argument is
the size in MB) that can be mounted and populated with data.

One can also base the image on a raw file -
  vhd-util snapshot -n MyVHDFile -p SomeRawFile -m

This creates a sparse VHD file named "MyVHDFile" using "SomeRawFile"
as a parent image.  Copy-on-write semantics ensure that writes will be
stored in "MyVHDFile" while reads will be directed to the most
recently written version of the data, either in "MyVHDFile" or
"SomeRawFile" as is appropriate.  Other options exist as well; consult
the vhd-util application for the complete set of VHD tools.

VHD files can be mounted automatically in a guest similarly to the
above AIO example simply by specifying the vhd driver.

disk = ['tap:tapdisk:vhd:<VHD FILENAME>,sda1,w']


Snapshots:

Pausing a guest will also plug the corresponding IO queue for blktap2
devices and stop blktap2 drivers.  This can be used to implement a
safe live snapshot of qcow and vhd disks.  An example script "xmsnap"
is shown in the tools/blktap2/drivers directory.  This script will
perform a live snapshot of a qcow disk.  VHD files can use the
"vhd-util snapshot" tool discussed above.  If this snapshot command is
applied to a raw file mounted with tap:tapdisk:aio, include the -m
flag and the driver will be reloaded as VHD.  If applied to an already
mounted VHD file, omit the -m flag.


Mounting images in Dom0 using the blktap2 driver
================================================

Tap (and blkback) disks are also mountable in Dom0 without requiring an
active VM to attach.

The syntax is -
  tapdisk2 -n <type>:<full path to file>

For example -
  tapdisk2 -n aio:/home/images/rawFile.img

On success, tapdisk2 prints the location of the new device to stdout
and then terminates.
From that point forward, control of the device is provided through
sysfs in the directory -

  /sys/class/blktap2/blktap#/

Where # is a blktap2 device number present in the path that tapdisk2
printed before terminating.  The sysfs interface is largely intuitive.
For example, to remove tap device 0 one would -

  echo 1 > /sys/class/blktap2/blktap0/remove

Similarly, a pause control is available, which can be used to plug
the request queue of a live running guest.

Previous versions of blktap mounted devices in dom0 by using blkfront
in dom0 and the xm block-attach command.  This approach is still
available, though slightly more cumbersome.


Tapdisk Development
===================

People regularly ask how to develop their own tapdisk drivers, and
while it has not yet been well documented, the process is relatively
easy.  Here I will provide a brief overview.  The best reference, of
course, comes from the existing drivers.  Specifically,
blktap2/drivers/block-ram.c and blktap2/drivers/block-aio.c provide
the clearest examples of simple drivers.


Setup:

First you need to register your new driver with blktap.  This is done
in disktypes.h.  There are five things that you must do.  To
demonstrate, I will create a disk called "mynewdisk"; you can name
yours freely.

1) Forward declare an instance of struct tap_disk.

e.g. -
  extern struct tap_disk tapdisk_mynewdisk;

2) Claim one of the unused disk type numbers.  Take care to observe
the MAX_DISK_TYPES macro, increasing it if necessary.

e.g. -
  #define DISK_TYPE_MYNEWDISK         10

3) Create an instance of disk_info_t.  The bulk of this file contains
examples of these.

e.g.
  static disk_info_t mynewdisk_disk = {
          DISK_TYPE_MYNEWDISK,          /* type number from step (2) */
          "My New Disk (mynewdisk)",    /* free-form description */
          "mynewdisk",                  /* name used by tapdisk2 and xend */
          0,                            /* 0 = one driver per device */
  #ifdef TAPDISK
          &tapdisk_mynewdisk,           /* struct declared in step (1) */
  #endif
  };

A few words about what these mean.  The first field must be the disk
type number you claimed in step (2).  The second field is a string
describing your disk, and may contain any relevant info.  The third
field is the name of your disk as will be used by the tapdisk2 utility
and xend (for example tapdisk2 -n mynewdisk:/path/to/disk.image, or in
your xm create config file).  The fourth is binary and determines
whether you will have one instance of your driver, or many.  Here, a 1
means that your driver is a singleton and will coordinate access to
any number of tap devices.  0 is more common, meaning that you will
have one driver for each device that is created.  The final field
should contain a reference to the struct tap_disk you declared in step
(1).

4) Add a reference to your disk info structure (from step (3)) to the
dtypes array.  Take care here - you need to place it in the position
corresponding to the device type number you claimed in step (2).  So
we would place &mynewdisk_disk in dtypes[10].  Look at the other
devices in this array and pad with "&null_disk," as necessary.

5) Modify the xend Python scripts.  You need to add your disk name to
the list of disks that xend recognizes.

Edit:
  tools/python/xen/xend/server/BlktapController.py

And add your disk to the "blktap_disk_types" array near the top of
that file.  Use the same name you specified in the third field of step
(3).  The order of this list is not important.


Now your driver is ready to be written.  Create a block-mynewdisk.c in
tools/blktap2/drivers and add it to the Makefile.


Development:

Copying block-aio.c and block-ram.c would be a good place to start.
Read those files as you go through this; I will assist by commenting
on a few useful functions and structures.

struct tap_disk:

Remember the forward declaration in step (1) of the setup phase above?
Now is the time to make that structure a reality.  This structure
contains a list of function pointers for all the routines that will be
asked of your driver.  Currently the required functions are open,
close, read, write, get_parent_id, validate_parent, and debug.

e.g. -
  struct tap_disk tapdisk_mynewdisk = {
          .disk_type          = "tapdisk_mynewdisk",
          .flags              = 0,
          .private_data_size  = sizeof(struct tdmynewdisk_state),
          .td_open            = tdmynewdisk_open,
          ....

The private_data_size field is used to provide a structure to store
the state of your device.  It is very likely that you will want
something here, but you are free to design whatever structure you
want.  Blktap will allocate this space for you; you just need to tell
it how much space you want.


tdmynewdisk_open:

This is the open routine.  The first argument is a structure
representing your driver.  Two fields in this structure are
interesting.

driver->data will contain a block of memory of the size you requested
in the .private_data_size field of your struct tap_disk (above).

driver->info contains a structure that details information about your
disk.  You need to fill this out.  By convention this is done with a
_get_image_info() function.  Assign a size (the total number of
sectors) and a sector_size (the size of each sector in bytes), and set
the info field of driver->info to 0.

The second parameter contains the name that was specified in the
creation of your device, either through xend, or on the command line
with tapdisk2.  Usually this specifies a file that you will open in
this routine.
The final parameter, flags, contains one of a number of
flags specified in tapdisk.h that may change the way you treat the
disk.


_queue_read/write:

These are your read and write operations.  What you do here will
depend on your disk, but you should do exactly one of -

1) Call td_complete_request with either an error or a success code.

2) Call td_forward_request, which will forward the request to the next
driver in the stack.

3) Queue the request for asynchronous processing with
td_prep_read/write, which also registers a callback for request
completion, and then call td_queue_tiocb to submit the request to a
wait queue.  When the request completes, the callback must do one of
options (1) or (2) above.

The above functions are defined in tapdisk-interface.c.  If you don't
use them as specified you will run into problems, as your driver will
fail to inform blktap of the state of requests that have been
submitted.  Blktap keeps track of all requests and does not like
losing track.


_close, _get_parent_id, _validate_parent:

These last few tend to be very routine.  _close is called when the
device is closed, and also when it is paused (in this case, open will
also be called later).  The other functions are used in stacking
drivers.  Most often drivers will return TD_NO_PARENT and -EINVAL,
respectively.
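

Example sketch:

The rule from the _queue_read/write section, that every request must
end in exactly one completion, can be illustrated with a small
self-contained mock.  Nothing below is the real blktap2 API.  The
mock_ types and functions, their field names and the value of
MOCK_TD_NO_PARENT are stand-ins invented for this illustration; take
the real signatures from tapdisk-interface.c and tapdisk.h.  The
sketch assumes a RAM-backed disk that always takes option (1), plus
the routine parent handlers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MOCK_SECTOR_SIZE  512   /* assumed sector size in bytes */
#define MOCK_DISK_SECTORS 8     /* tiny disk, for illustration only */
#define MOCK_TD_NO_PARENT 1     /* stand-in value, not read from tapdisk.h */

/* Hypothetical request: start sector, sector count and data buffer. */
struct mock_request {
        uint64_t sec;
        int      secs;
        char    *buf;
};

/* Hypothetical per-device state, playing the role of driver->data. */
struct mock_state {
        char image[MOCK_DISK_SECTORS * MOCK_SECTOR_SIZE];
        int  completed;    /* number of requests completed so far */
        int  last_error;   /* result code of the latest completion */
};

/* Stand-in for td_complete_request: record the outcome of a request. */
static void mock_complete_request(struct mock_state *s, int err)
{
        s->completed++;
        s->last_error = err;
}

/* Return 1 if the request lies entirely inside the disk, else 0. */
static int mock_in_range(struct mock_request req)
{
        return req.secs >= 0 && req.sec <= MOCK_DISK_SECTORS &&
               (uint64_t)req.secs <= MOCK_DISK_SECTORS - req.sec;
}

/* Read handler: every path ends in exactly one completion call. */
static void mock_queue_read(struct mock_state *s, struct mock_request req)
{
        assert(req.buf != NULL);
        if (!mock_in_range(req)) {
                mock_complete_request(s, -EINVAL);
                return;
        }
        memcpy(req.buf, s->image + req.sec * MOCK_SECTOR_SIZE,
               (size_t)req.secs * MOCK_SECTOR_SIZE);
        mock_complete_request(s, 0);
}

/* Write handler: mirror image of the read handler. */
static void mock_queue_write(struct mock_state *s, struct mock_request req)
{
        assert(req.buf != NULL);
        if (!mock_in_range(req)) {
                mock_complete_request(s, -EINVAL);
                return;
        }
        memcpy(s->image + req.sec * MOCK_SECTOR_SIZE, req.buf,
               (size_t)req.secs * MOCK_SECTOR_SIZE);
        mock_complete_request(s, 0);
}

/* A driver without a parent image reports that it has none ... */
static int mock_get_parent_id(struct mock_state *s)
{
        (void)s;
        return MOCK_TD_NO_PARENT;
}

/* ... and rejects any attempt to validate one. */
static int mock_validate_parent(struct mock_state *s)
{
        (void)s;
        return -EINVAL;
}
```

An asynchronous driver would follow option (3) instead: prepare and
queue the request, return, and make the completion call from the
registered callback.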