# PV Calls Protocol version 1

## Glossary

The following is a list of terms and definitions used in the Xen
community. If you are a Xen contributor you can skip this section.

* PV

  Short for paravirtualized.

* Dom0

  First virtual machine that boots. In most configurations Dom0 is
  privileged and has control over hardware devices, such as network
  cards, graphics cards, etc.

* DomU

  Regular unprivileged Xen virtual machine.

* Domain

  A Xen virtual machine. Dom0 and all DomUs are separate Xen
  domains.

* Guest

  Same as domain: a Xen virtual machine.

* Frontend

  Each DomU has one or more paravirtualized frontend drivers to access
  disks, network, console, graphics, etc. The presence of PV devices is
  advertised on XenStore, a cross domain key-value database. Frontends
  are similar in intent to the virtio drivers in Linux.

* Backend

  A Xen paravirtualized backend typically runs in Dom0 and it is used to
  export disks, network, console, graphics, etc., to DomUs. Backends can
  live both in kernel space and in userspace. For example xen-blkback
  lives under drivers/block in the Linux kernel and xen_disk lives under
  hw/block in QEMU. Paravirtualized backends are similar in intent to
  virtio device emulators.

* VMX and SVM

  On Intel processors, VMX is the CPU flag for VT-x, hardware
  virtualization support. It corresponds to SVM on AMD processors.



## Rationale

PV Calls is a paravirtualized protocol that allows the implementation of
a set of POSIX functions in a different domain. The PV Calls frontend
sends POSIX function calls to the backend, which implements them and
returns a value to the frontend.

This version of the document covers networking function calls, such as
connect, accept, bind, release, listen, poll, recvmsg and sendmsg; but
the protocol is meant to be easily extended to cover different sets of
calls. Unimplemented commands return ENOTSUP.

PV Calls provide the following benefits:
* full visibility of the guest behavior on the backend domain, allowing
  for inexpensive filtering and manipulation of any guest calls
* excellent performance

Specifically, PV Calls for networking offer these advantages:
* guest networking works out of the box with VPNs, wireless networks and
  any other complex configurations on the host
* guest services listen on ports bound directly to the backend domain IP
  addresses
* localhost becomes a secure host wide network for inter-VMs
  communications


## Design

### Why Xen?

PV Calls are part of an effort to create a secure runtime environment
for containers (Open Containers Initiative images to be precise). PV
Calls are based on Xen, although porting them to other hypervisors is
possible. Xen was chosen because of its security and isolation
properties and because it supports PV guests, a type of virtual machine
that does not require hardware virtualization extensions (VMX on Intel
processors and SVM on AMD processors). This is important because PV
Calls is meant for containers and containers are often run on top of
public cloud instances, which do not support nested VMX (or SVM) as of
today (early 2017). Xen PV guests are lightweight, minimalist, and do
not require machine emulation: all properties that make them a good fit
for this project.

### Xenstore

The frontend and the backend connect via [xenstore] to
exchange information. The toolstack creates front and back nodes with
state of [XenbusStateInitialising]. The protocol node name
is **pvcalls**. There can only be one PV Calls frontend per domain.

#### Frontend XenBus Nodes

version
     Values:         <string>

     Protocol version, chosen among the ones supported by the backend
     (see **versions** under [Backend XenBus Nodes]). Currently the
     value must be "1".

port
     Values:         <uint32_t>

     The identifier of the Xen event channel used to signal activity
     in the command ring.

ring-ref
     Values:         <uint32_t>

     The Xen grant reference granting permission for the backend to map
     the sole page in a single page sized command ring.

#### Backend XenBus Nodes

versions
     Values:         <string>

     List of comma separated protocol versions supported by the backend.
     For example "1,2,3". Currently the value is just "1", as there is
     only one version.

max-page-order
     Values:         <uint32_t>

     The maximum supported size of a memory allocation in units of
     log2(machine pages), e.g. 1 == 2 pages, 2 == 4 pages, etc. It must
     be 1 or more.

function-calls
     Values:         <uint32_t>

     Value "0" means that no calls are supported.
     Value "1" means that socket, connect, release, bind, listen, accept
     and poll are supported.

#### State Machine

Initialization:

    *Front*                               *Back*
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.           - Publish backend features
    - Allocate and initialize the           and transport parameters
      request ring.                                      |
    - Publish transport parameters                       |
      that will be in effect during                      V
      this connection.                            XenbusStateInitWait
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

    - Query backend device properties.
    - Finalize OS virtual device
      instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they have a shared page, which
is used to exchange messages over a ring, and an event channel,
which is used to send notifications.

Shutdown:

    *Front*                            *Back*
    XenbusStateConnected               XenbusStateConnected
                |
                |
                V
       XenbusStateClosing

                                       - Unmap grants
                                       - Unbind event channels
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosing

    - Unbind event channels
    - Free rings
    - Free data structures
               |
               |
               V
       XenbusStateClosed

                                       - Free remaining data structures
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosed


### Commands Ring

The shared ring is used by the frontend to forward POSIX function calls
to the backend. We shall refer to this ring as **commands ring** to
distinguish it from other rings which can be created later in the
lifecycle of the protocol (see [Indexes Page and Data ring]). The grant
reference for the shared page for this ring is shared on xenstore (see
[Frontend XenBus Nodes]). The ring format is defined using the familiar
`DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`). Frontend
requests are allocated on the ring using the `RING_GET_REQUEST` macro.
The list of commands below is in calling order.

The format is defined as follows:

    #define PVCALLS_SOCKET         0
    #define PVCALLS_CONNECT        1
    #define PVCALLS_RELEASE        2
    #define PVCALLS_BIND           3
    #define PVCALLS_LISTEN         4
    #define PVCALLS_ACCEPT         5
    #define PVCALLS_POLL           6

    struct xen_pvcalls_request {
        uint32_t req_id; /* private to guest, echoed in response */
        uint32_t cmd;    /* command to execute */
        union {
            struct xen_pvcalls_socket {
                uint64_t id;
                uint32_t domain;
                uint32_t type;
                uint32_t protocol;
                #ifdef CONFIG_X86_32
                uint8_t pad[4];
                #endif
            } socket;
            struct xen_pvcalls_connect {
                uint64_t id;
                uint8_t addr[28];
                uint32_t len;
                uint32_t flags;
                grant_ref_t ref;
                uint32_t evtchn;
                #ifdef CONFIG_X86_32
                uint8_t pad[4];
                #endif
            } connect;
            struct xen_pvcalls_release {
                uint64_t id;
                uint8_t reuse;
                #ifdef CONFIG_X86_32
                uint8_t pad[7];
                #endif
            } release;
            struct xen_pvcalls_bind {
                uint64_t id;
                uint8_t addr[28];
                uint32_t len;
            } bind;
            struct xen_pvcalls_listen {
                uint64_t id;
                uint32_t backlog;
                #ifdef CONFIG_X86_32
                uint8_t pad[4];
                #endif
            } listen;
            struct xen_pvcalls_accept {
                uint64_t id;
                uint64_t id_new;
                grant_ref_t ref;
                uint32_t evtchn;
            } accept;
            struct xen_pvcalls_poll {
                uint64_t id;
            } poll;
            /* dummy member to force sizeof(struct xen_pvcalls_request) to match across archs */
            struct xen_pvcalls_dummy {
                uint8_t dummy[56];
            } dummy;
        } u;
    };

The first two fields are common for every command. Their binary layout
is:

    0       4       8
    +-------+-------+
    |req_id |  cmd  |
    +-------+-------+

- **req_id** is generated by the frontend and is a cookie used to
  identify one specific request/response pair of commands. Not to be
  confused with the command-specific **id** fields, which are used to
  identify a socket across multiple commands, see [Socket].
- **cmd** is the command requested by the frontend:

    - `PVCALLS_SOCKET`:         0
    - `PVCALLS_CONNECT`:        1
    - `PVCALLS_RELEASE`:        2
    - `PVCALLS_BIND`:           3
    - `PVCALLS_LISTEN`:         4
    - `PVCALLS_ACCEPT`:         5
    - `PVCALLS_POLL`:           6

Both fields are echoed back by the backend. See [Socket families and
address format] for the format of the **addr** field of connect and
bind. The maximum size of command specific arguments is 56 bytes. Any
future command that requires more than that will need to bump the
**version** of the protocol.

Similarly to other Xen ring based protocols, after writing a request to
the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and
issues an event channel notification when a notification is required.

Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
The format is the following:

    struct xen_pvcalls_response {
        uint32_t req_id;
        uint32_t cmd;
        int32_t ret;
        uint32_t pad;
        union {
            struct _xen_pvcalls_socket {
                uint64_t id;
            } socket;
            struct _xen_pvcalls_connect {
                uint64_t id;
            } connect;
            struct _xen_pvcalls_release {
                uint64_t id;
            } release;
            struct _xen_pvcalls_bind {
                uint64_t id;
            } bind;
            struct _xen_pvcalls_listen {
                uint64_t id;
            } listen;
            struct _xen_pvcalls_accept {
                uint64_t id;
            } accept;
            struct _xen_pvcalls_poll {
                uint64_t id;
            } poll;
            struct _xen_pvcalls_dummy {
                uint8_t dummy[8];
            } dummy;
        } u;
    };

The first four fields are common for every response. Their binary layout
is:

    0       4       8       12      16
    +-------+-------+-------+-------+
    |req_id |  cmd  |  ret  |  pad  |
    +-------+-------+-------+-------+

- **req_id**: echoed back from request
- **cmd**: echoed back from request
- **ret**: return value, identifies success (0) or failure (see [Error
  numbers] in further sections).
  If the **cmd** is not supported by the
  backend, ret is ENOTSUP.
- **pad**: padding

After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether
it needs to notify the frontend and does so via event channel.

A description of each command and of its additional request and response
fields follows.


#### Socket

The **socket** operation corresponds to the POSIX [socket][socket]
function. It creates a new socket of the specified family, type and
protocol. **id** is freely chosen by the frontend and references this
specific socket from this point forward. See [Socket families and
address format] to see which ones are supported by different versions of
the protocol.

Request fields:

- **cmd** value: 0
- additional fields:
  - **id**: generated by the frontend, it identifies the new socket
  - **domain**: the communication domain
  - **type**: the socket type
  - **protocol**: the particular protocol to be used with the socket, usually 0

Request binary layout:

    8       12      16      20     24       28
    +-------+-------+-------+-------+-------+
    |       id      |domain | type  |protoco|
    +-------+-------+-------+-------+-------+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16       20       24
    +-------+--------+
    |       id       |
    +-------+--------+

Return value:

  - 0 on success
  - See the [POSIX socket function][socket] for error names; see
    [Error numbers] in further sections.

#### Connect

The **connect** operation corresponds to the POSIX [connect][connect]
function. It connects a previously created socket (identified by **id**)
to the specified address.

The connect operation creates a new shared ring, which we'll call **data
ring**. The data ring is used to send and receive data from the
socket. The connect operation passes two additional parameters:
**evtchn** and **ref**.
**evtchn** is the port number of a new event
channel which will be used for notifications of activity on the data
ring. **ref** is the grant reference of the **indexes page**: a page
which contains shared indexes that point to the write and read locations
in the **data ring**. The **indexes page** also contains the full array
of grant references for the **data ring**. When the frontend issues a
**connect** command, the backend:

- finds its own internal socket corresponding to **id**
- connects the socket to **addr**
- maps the grant reference **ref**, the indexes page, see struct
  pvcalls_data_intf
- maps all the grant references listed in `struct pvcalls_data_intf` and
  uses them as shared memory for the **data ring**
- binds the **evtchn**
- replies to the frontend

The [Indexes Page and Data ring] format will be described in the
following section. The **data ring** is unmapped and freed upon issuing
a **release** command on the active socket identified by **id**. A
frontend state change can also cause data rings to be unmapped.
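
As an illustration of the request described above, the following is a minimal sketch of how a frontend might fill in a connect request for an `AF_INET` address. It is not taken from any implementation: every `sketch_` name is hypothetical, `grant_ref_t` is assumed to be a 32-bit integer, the x86_32 padding is left out, and the port and address are stored in network byte order as POSIX `struct sockaddr_in` expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t grant_ref_t;          /* assumption: 32-bit grant reference */

#define SKETCH_PVCALLS_CONNECT 1       /* PVCALLS_CONNECT in this document */
#define SKETCH_AF_INET         2       /* AF_INET in this document */

/* Flattened view of the request header plus the connect arguments. */
struct sketch_connect_request {
    uint32_t req_id;
    uint32_t cmd;
    uint64_t id;
    uint8_t addr[28];
    uint32_t len;
    uint32_t flags;
    grant_ref_t ref;
    uint32_t evtchn;
};

/* AF_INET address, laid out as in "Socket families and address format". */
struct sketch_sockaddr_in {
    uint16_t family;
    uint16_t port;                     /* network byte order */
    uint32_t addr;                     /* network byte order */
    char zero[20];
};

_Static_assert(sizeof(struct sketch_sockaddr_in) == 28,
               "address must fill the 28-byte addr field exactly");

/* Fill a connect request; ip holds the four IPv4 octets, most significant first. */
static void sketch_fill_connect(struct sketch_connect_request *req,
                                uint32_t req_id, uint64_t id,
                                const uint8_t ip[4], uint16_t port,
                                grant_ref_t ref, uint32_t evtchn)
{
    struct sketch_sockaddr_in sin;
    uint8_t port_be[2] = { (uint8_t)(port >> 8), (uint8_t)(port & 0xff) };

    memset(&sin, 0, sizeof(sin));
    sin.family = SKETCH_AF_INET;
    memcpy(&sin.port, port_be, sizeof(port_be));
    memcpy(&sin.addr, ip, 4);

    memset(req, 0, sizeof(*req));
    req->req_id = req_id;              /* cookie echoed back in the response */
    req->cmd = SKETCH_PVCALLS_CONNECT;
    req->id = id;                      /* socket chosen earlier by the frontend */
    memcpy(req->addr, &sin, sizeof(sin));
    req->len = sizeof(sin);
    req->flags = 0;                    /* reserved for future usage */
    req->ref = ref;                    /* grant reference of the indexes page */
    req->evtchn = evtchn;              /* event channel for the data ring */
}
```

With this layout, `ref` sits 44 bytes after the start of the connect arguments (offset 52 from the start of the request), which matches the request binary layout given for this command.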

Request fields:

- **cmd** value: 1
- additional fields:
  - **id**: identifies the socket
  - **addr**: address to connect to, see [Socket families and address format]
  - **len**: address length up to 28 octets
  - **flags**: flags for the connection, reserved for future usage
  - **ref**: grant reference of the indexes page
  - **evtchn**: port number of the evtchn to signal activity on the **data ring**

Request binary layout:

    8       12      16      20      24      28      32      36      40      44
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    |       id      |                            addr                       |
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    | len   | flags |  ref  |evtchn |
    +-------+-------+-------+-------+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX connect function][connect] for error names; see
    [Error numbers] in further sections.

#### Release

The **release** operation closes an existing active or passive socket.

When a release command is issued on a passive socket, the backend
releases it and frees its internal mappings. When a release command is
issued for an active socket, the data ring and indexes page are also
unmapped and freed:

- frontend sends release command for an active socket
- backend releases the socket
- backend unmaps the data ring
- backend unmaps the indexes page
- backend unbinds the event channel
- backend replies to frontend with a **ret** value
- frontend frees data ring, indexes page and unbinds event channel

Request fields:

- **cmd** value: 2
- additional fields:
  - **id**: identifies the socket
  - **reuse**: an optimization hint for the backend. The field is
    ignored for passive sockets.
    When set to 1, the frontend lets the
    backend know that it will reuse exactly the same set of grant pages
    (indexes page and data ring) and event channel when creating one of
    the next active sockets. The backend can take advantage of it by
    delaying unmapping grants and unbinding the event channel. The
    backend is free to ignore the hint. Reused data rings are found by
    **ref**, the grant reference of the page containing the indexes.

Request binary layout:

    8       12      16    17
    +-------+-------+-----+
    |       id      |reuse|
    +-------+-------+-----+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX shutdown function][shutdown] for error names; see
    [Error numbers] in further sections.

#### Bind

The **bind** operation corresponds to the POSIX [bind][bind] function.
It assigns the address passed as parameter to a previously created
socket, identified by **id**. **Bind**, **listen** and **accept** are
the three operations required to have fully working passive sockets and
should be issued in that order.
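
The required ordering of bind, listen and accept can be pictured as a small state machine. The sketch below is purely illustrative and is not part of the protocol: the `sketch_` names and the idea of checking the order on the frontend side are assumptions made for this example. The command values mirror the `PVCALLS_*` definitions given earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Command values as defined by PVCALLS_BIND, PVCALLS_LISTEN, PVCALLS_ACCEPT. */
enum sketch_cmd { SKETCH_BIND = 3, SKETCH_LISTEN = 4, SKETCH_ACCEPT = 5 };

/* Hypothetical per-socket state tracked by a frontend. */
enum sketch_state { SKETCH_CREATED, SKETCH_BOUND, SKETCH_LISTENING };

/*
 * Return true and advance the state if cmd is allowed in the current state,
 * false if it would break the bind -> listen -> accept order.
 */
static bool sketch_passive_step(enum sketch_state *state, enum sketch_cmd cmd)
{
    switch (cmd) {
    case SKETCH_BIND:
        if (*state != SKETCH_CREATED)
            return false;
        *state = SKETCH_BOUND;
        return true;
    case SKETCH_LISTEN:
        if (*state != SKETCH_BOUND)
            return false;
        *state = SKETCH_LISTENING;
        return true;
    case SKETCH_ACCEPT:
        /* accept may be issued repeatedly on a listening socket */
        return *state == SKETCH_LISTENING;
    }
    return false;
}
```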

Request fields:

- **cmd** value: 3
- additional fields:
  - **id**: identifies the socket
  - **addr**: address to bind to, see [Socket families and address
    format]
  - **len**: address length up to 28 octets

Request binary layout:

    8       12      16      20      24      28      32      36      40      44
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    |       id      |                            addr                       |
    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    |  len  |
    +-------+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX bind function][bind] for error names; see
    [Error numbers] in further sections.


#### Listen

The **listen** operation marks the socket as a passive socket. It corresponds to
the [POSIX listen function][listen].

Request fields:

- **cmd** value: 4
- additional fields:
  - **id**: identifies the socket
  - **backlog**: the maximum length to which the queue of pending
    connections may grow in number of elements

Request binary layout:

    8       12      16      20
    +-------+-------+-------+
    |       id      |backlog|
    +-------+-------+-------+

Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX listen function][listen] for error names; see
    [Error numbers] in further sections.


#### Accept

The **accept** operation extracts the first connection request on the
queue of pending connections for the listening socket identified by
**id** and creates a new connected socket.
The id of the new socket is
also chosen by the frontend and passed as an additional field of the
accept request struct (**id_new**). See the [POSIX accept function][accept]
as reference.

Similarly to the **connect** operation, **accept** creates new [Indexes
Page and Data ring]. The **data ring** is used to send and receive data from
the socket. The **accept** operation passes two additional parameters:
**evtchn** and **ref**. **evtchn** is the port number of a new event
channel which will be used for notifications of activity on the data
ring. **ref** is the grant reference of the **indexes page**: a page
which contains shared indexes that point to the write and read locations
in the **data ring**. The **indexes page** also contains the full array of
grant references for the **data ring**.

The backend will reply to the request only when a new connection is
successfully accepted, i.e. the backend does not return EAGAIN or
EWOULDBLOCK.

Example workflow:

- frontend issues an **accept** request
- backend waits for a connection to be available on the socket
- a new connection becomes available
- backend accepts the new connection
- backend creates an internal mapping from **id_new** to the new socket
- backend maps the grant reference **ref**, the indexes page, see struct
  pvcalls_data_intf
- backend maps all the grant references listed in `struct
  pvcalls_data_intf` and uses them as shared memory for the new data
  ring **in** and **out** arrays
- backend binds to the **evtchn**
- backend replies to the frontend with a **ret** value

Request fields:

- **cmd** value: 5
- additional fields:
  - **id**: id of listening socket
  - **id_new**: id of the new socket
  - **ref**: grant reference of the indexes page
  - **evtchn**: port number of the evtchn to signal activity on the data ring

Request binary layout:

    8       12      16      20      24      28      32
    +-------+-------+-------+-------+-------+-------+
    |       id      |    id_new     |  ref  |evtchn |
    +-------+-------+-------+-------+-------+-------+

Response additional fields:

- **id**: id of the listening socket, echoed back from request

Response binary layout:

    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+

Return value:

  - 0 on success
  - See the [POSIX accept function][accept] for error names; see
    [Error numbers] in further sections.


#### Poll

In this version of the protocol, the **poll** operation is only valid
for passive sockets. For active sockets, the frontend should look at the
indexes on the **indexes page**. When a new connection is available in
the queue of the passive socket, the backend generates a response and
notifies the frontend.

Request fields:

- **cmd** value: 6
- additional fields:
  - **id**: identifies the listening socket

Request binary layout:

    8       12      16
    +-------+-------+
    |       id      |
    +-------+-------+


Response additional fields:

- **id**: echoed back from request

Response binary layout:

    16       20       24
    +--------+--------+
    |        id       |
    +--------+--------+

Return value:

  - 0 on success
  - See the [POSIX poll function][poll] for error names; see
    [Error numbers] in further sections.

#### Expanding the protocol

It is possible to introduce new commands without changing the protocol
ABI. Naturally, a feature flag among the backend xenstore nodes should
advertise the availability of a new set of commands.

If a new command requires parameters in struct xen_pvcalls_request
larger than 56 bytes, which is the current size of the command specific
arguments, then the protocol version should be increased. One way to
implement the large request structure without disrupting the current ABI
is to introduce a new command, such as PVCALLS_CONNECT_EXTENDED, and a
flag to specify that the request uses two request slots, for a total of
112 bytes.
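
One way to catch an oversized command early is a compile-time size check. The sketch below is only an illustration: `struct sketch_future_cmd` is an imaginary set of arguments for a command that does not exist in the protocol, and `sketch_fits_in_one_slot` is a hypothetical helper. The only value taken from this document is the 56-byte limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the command specific arguments in struct xen_pvcalls_request. */
#define SKETCH_PVCALLS_ARGS_MAX 56

/* Hypothetical arguments of an imaginary future command. */
struct sketch_future_cmd {
    uint64_t id;
    uint8_t addr[28];
    uint32_t len;
    uint32_t flags;
    uint32_t extra[3];
};

/* The build fails if the new arguments no longer fit in one request slot. */
_Static_assert(sizeof(struct sketch_future_cmd) <= SKETCH_PVCALLS_ARGS_MAX,
               "command arguments exceed 56 bytes: protocol version bump needed");

/* Run-time form of the same check, for arguments sized dynamically. */
static bool sketch_fits_in_one_slot(size_t args_size)
{
    return args_size <= SKETCH_PVCALLS_ARGS_MAX;
}
```

A command whose arguments pass this check can be added behind a feature flag alone; one that fails it falls under the version bump or two-slot scheme described above.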

#### Error numbers

The numbers corresponding to the error names specified by POSIX are:

    [EPERM]         -1
    [ENOENT]        -2
    [ESRCH]         -3
    [EINTR]         -4
    [EIO]           -5
    [ENXIO]         -6
    [E2BIG]         -7
    [ENOEXEC]       -8
    [EBADF]         -9
    [ECHILD]        -10
    [EAGAIN]        -11
    [EWOULDBLOCK]   -11
    [ENOMEM]        -12
    [EACCES]        -13
    [EFAULT]        -14
    [EBUSY]         -16
    [EEXIST]        -17
    [EXDEV]         -18
    [ENODEV]        -19
    [EISDIR]        -21
    [EINVAL]        -22
    [ENFILE]        -23
    [EMFILE]        -24
    [ENOSPC]        -28
    [EROFS]         -30
    [EMLINK]        -31
    [EDOM]          -33
    [ERANGE]        -34
    [EDEADLK]       -35
    [EDEADLOCK]     -35
    [ENAMETOOLONG]  -36
    [ENOLCK]        -37
    [ENOTEMPTY]     -39
    [ENOSYS]        -38
    [ENODATA]       -61
    [ETIME]         -62
    [EBADMSG]       -74
    [EOVERFLOW]     -75
    [EILSEQ]        -84
    [ERESTART]      -85
    [ENOTSOCK]      -88
    [EOPNOTSUPP]    -95
    [EAFNOSUPPORT]  -97
    [EADDRINUSE]    -98
    [EADDRNOTAVAIL] -99
    [ENOBUFS]       -105
    [EISCONN]       -106
    [ENOTCONN]      -107
    [ETIMEDOUT]     -110
    [ENOTSUP]       -524

#### Socket families and address format

The following definitions and explicit sizes, together with POSIX
[sys/socket.h][address] and [netinet/in.h][in] define socket families and
address format. Please be aware that only the **domain** `AF_INET`, **type**
`SOCK_STREAM` and **protocol** `0` are supported by this version of the
specification, others return ENOTSUP.

    #define AF_UNSPEC   0
    #define AF_UNIX     1   /* Unix domain sockets      */
    #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
    #define AF_INET     2   /* Internet IP Protocol     */
    #define AF_INET6    10  /* IP version 6             */

    #define SOCK_STREAM 1
    #define SOCK_DGRAM  2
    #define SOCK_RAW    3

    /* generic address format */
    struct sockaddr {
        uint16_t sa_family_t;
        char sa_data[26];
    };

    struct in_addr {
        uint32_t s_addr;
    };

    /* AF_INET address format */
    struct sockaddr_in {
        uint16_t         sa_family_t;
        uint16_t         sin_port;
        struct in_addr   sin_addr;
        char             sin_zero[20];
    };


### Indexes Page and Data ring

Data rings are used for sending and receiving data over a connected socket. They
are created upon a successful **accept** or **connect** command.
The **sendmsg** and **recvmsg** calls are implemented by sending data and
receiving data from a data ring, and updating the corresponding indexes
on the **indexes page**.

Firstly, the **indexes page** is shared by a **connect** or **accept**
command, see **ref** parameter in their sections. The content of the
**indexes page** is represented by `struct pvcalls_data_intf`, see
below. The structure contains the list of grant references which
constitute the **in** and **out** buffers of the data ring, see ref[]
below. The backend maps the grant references contiguously. Of the
resulting shared memory, the first half is dedicated to the **in** array
and the second half to the **out** array. They are used as circular
buffers for transferring data, and, together, they are the data ring.


      +---------------------------+                 Indexes page
      | Command ring:             |                 +----------------------+
      | @0: xen_pvcalls_connect:  |                 |@0 pvcalls_data_intf: |
      | @44: ref  +-------------------------------->+@128: ring_order = 1  |
      |                           |                 |@132: ref[0]+         |
      +---------------------------+                 |@136: ref[1]+         |
                                                    |           |          |
                                                    |           |          |
                                                    +----------------------+
                                                                |
                                                                v (data ring)
                                                        +-------+-----------+
                                                        |  @0->4095: in     |
                                                        |  ref[0]           |
                                                        |-------------------|
                                                        |  @4096->8191: out |
                                                        |  ref[1]           |
                                                        +-------------------+


#### Indexes Page Structure

    typedef uint32_t PVCALLS_RING_IDX;

    struct pvcalls_data_intf {
        PVCALLS_RING_IDX in_cons, in_prod;
        int32_t in_error;

        uint8_t pad1[52];

        PVCALLS_RING_IDX out_cons, out_prod;
        int32_t out_error;

        uint8_t pad2[52];

        uint32_t ring_order;
        grant_ref_t ref[];
    };

    /* not actually C compliant (ring_order changes from socket to socket) */
    struct pvcalls_data {
        char in[((1<<ring_order)<<PAGE_SHIFT)/2];
        char out[((1<<ring_order)<<PAGE_SHIFT)/2];
    };

- **ring_order**
  It represents the order of the data ring. The following list of grant
  references is of `(1 << ring_order)` elements. It cannot be greater than
  **max-page-order**, as specified by the backend on XenBus. It has to
  be one at minimum.
- **ref[]**
  The list of grant references which will contain the actual data. They are
  mapped contiguously in virtual memory. The first half of the pages is the
  **in** array, the second half is the **out** array. The arrays must
  have a power of two size. Together, their size is `(1 << ring_order) *
  PAGE_SIZE`.
- **in** is an array used as circular buffer
  It contains data read from the socket. The producer is the backend, the
  consumer is the frontend.
- **out** is an array used as circular buffer
  It contains data to be written to the socket.
  The producer is the frontend,
  the consumer is the backend.
- **in_cons** and **in_prod**
  Consumer and producer indexes for data read from the socket. They keep track
  of how much data has already been consumed by the frontend from the **in**
  array. **in_prod** is increased by the backend, after writing data to **in**.
  **in_cons** is increased by the frontend, after reading data from **in**.
- **out_cons**, **out_prod**
  Consumer and producer indexes for the data to be written to the socket. They
  keep track of how much data has been written by the frontend to **out** and
  how much data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons** is
  increased by the backend, after reading data from **out**.
- **in_error** and **out_error** They signal errors when reading from the socket
  (**in_error**) or when writing to the socket (**out_error**). 0 means no
  errors. When an error occurs, no further read or write operations are
  performed on the socket. In the case of an orderly socket shutdown (i.e. read
  returns 0) **in_error** is set to ENOTCONN. **in_error** and **out_error**
  are never set to EAGAIN or EWOULDBLOCK (the data is written to the
  ring as soon as it is available).

The binary layout of `struct pvcalls_data_intf` follows:

    0         4         8         12           64        68        72        76
    +---------+---------+---------+-----//-----+---------+---------+---------+
    | in_cons | in_prod |in_error |  padding   |out_cons |out_prod |out_error|
    +---------+---------+---------+-----//-----+---------+---------+---------+

    76           128       132       136       140     4092      4096
    +-----//-----+---------+---------+---------+----//---+---------+
    |  padding   |ring_orde|  ref[0] |  ref[1] |         |  ref[N] |
    +-----//-----+---------+---------+---------+----//---+---------+

**N.B.** For one page, N is maximum 991 ((4096-132)/4), but given that N needs
to be a power of two, actually max N is 512 (ring_order = 9).
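
The arithmetic in the note above can be checked with a short sketch. It assumes 4 KiB pages and a 32-bit `grant_ref_t`, and it names the two padding arrays `pad1` and `pad2` so that the structure compiles; the `sketch_` and `SKETCH_` names are hypothetical helpers introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                        /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

typedef uint32_t PVCALLS_RING_IDX;
typedef uint32_t grant_ref_t;                       /* assumption: 32-bit */

struct pvcalls_data_intf {
    PVCALLS_RING_IDX in_cons, in_prod;
    int32_t in_error;
    uint8_t pad1[52];
    PVCALLS_RING_IDX out_cons, out_prod;
    int32_t out_error;
    uint8_t pad2[52];
    uint32_t ring_order;
    grant_ref_t ref[];
};

/* Number of grant references that fit after the fixed fields in one page. */
static uint32_t sketch_max_refs_per_page(void)
{
    size_t header = offsetof(struct pvcalls_data_intf, ref);

    return (uint32_t)((SKETCH_PAGE_SIZE - header) / sizeof(grant_ref_t));
}

/* Largest ring_order for which (1 << ring_order) references still fit. */
static uint32_t sketch_max_ring_order(void)
{
    uint32_t order = 0;

    while ((2u << order) <= sketch_max_refs_per_page())
        order++;
    return order;
}

/* Size in bytes of each of the in and out arrays for a given ring_order. */
static uint32_t sketch_data_array_size(uint32_t ring_order)
{
    return ((1u << ring_order) << SKETCH_PAGE_SHIFT) / 2;
}
```

With these assumptions the fixed fields take 132 bytes, 991 references fit in the page, and the largest usable order is 9. For the minimum `ring_order` of 1, each of **in** and **out** is one 4096-byte page.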

#### Data Ring Structure

The binary layout of the data ring follows:

    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |           out             |
    +------------//-------------+------------//-------------+

#### Why ring.h is not needed

Many Xen PV protocols use the macros provided by [ring.h] to manage
their shared ring for communication. PVCalls does not, because the [Data
Ring Structure] actually comes with two rings: the **in** ring and the
**out** ring. Each of them is mono-directional, and there is no static
request size: the producer writes opaque data to the ring. By contrast,
in [ring.h] they are combined, and the request size is static and
well-known. In PVCalls:

  in -> backend to frontend only
  out-> frontend to backend only

In the case of the **in** ring, the frontend is the consumer, and the
backend is the producer. Everything is the same but mirrored for the
**out** ring.

The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of it,
leaving room for optimizations.

On the other hand, the consumer always requires notifications, unless it
is already actively reading from the ring. The producer can figure it
out, without any additional fields in the protocol, by comparing the
indexes at the beginning and the end of the function. This is similar to
what [ring.h] does.

#### Workflow

The **in** and **out** arrays are used as circular buffers:

    0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |  free    | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following function is provided to calculate how many bytes are currently
left unconsumed in an array:

    #define _MASK_PVCALLS_IDX(idx, ring_size) ((idx) & (ring_size-1))

    static inline PVCALLS_RING_IDX pvcalls_ring_unconsumed(PVCALLS_RING_IDX prod,
                                                           PVCALLS_RING_IDX cons,
                                                           PVCALLS_RING_IDX ring_size)
    {
        PVCALLS_RING_IDX size;

        if (prod == cons)
            return 0;

        prod = _MASK_PVCALLS_IDX(prod, ring_size);
        cons = _MASK_PVCALLS_IDX(cons, ring_size);

        if (prod == cons)
            return ring_size;

        if (prod > cons)
            size = prod - cons;
        else {
            size = ring_size - cons;
            size += prod;
        }
        return size;
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way:

- read *[in|out]_cons*, *[in|out]_prod*, *[in|out]_error* from shared memory
- general memory barrier
- return on *[in|out]_error*
- write to array at position *[in|out]_prod* up to *[in|out]_cons*,
  wrapping around the circular buffer when necessary
- write memory barrier
- increase *[in|out]_prod*
- notify the other end via evtchn

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way:

- read *[in|out]_prod*, *[in|out]_cons*, *[in|out]_error* from shared memory
- read memory barrier
- return on *[in|out]_error*
- read from array at position *[in|out]_cons* up
  to *[in|out]_prod*,
  wrapping around the circular buffer when necessary
- general memory barrier
- increase *[in|out]_cons*
- notify the other end via evtchn

The producer takes care of writing only as many bytes as available in
the buffer up to *[in|out]_cons*. The consumer takes care of reading
only as many bytes as available in the buffer up to *[in|out]_prod*.
*[in|out]_error* is set by the backend when an error occurs writing or
reading from the socket.


[xenstore]: http://xenbits.xen.org/docs/unstable/misc/xenstore.txt
[XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
[address]: http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html
[in]: http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
[socket]: http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html
[connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
[shutdown]: http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html
[bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html
[listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html
[accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html
[poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html
[ring.h]: http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD
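
The producer and consumer steps listed in the Workflow section can be sketched in portable C as follows. This is an illustration under stated assumptions, not the code of any existing frontend or backend: the `sketch_` names are hypothetical, C11 fences stand in for the platform memory barriers, the indexes are assumed to be free-running 32-bit counters that are masked only when addressing the array, and the event channel notification is left as a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;

#define SKETCH_MASK_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))

/* Producer: copy up to len bytes into the array without overwriting
 * unconsumed data. Returns the bytes written, or the pending error. */
static int32_t sketch_ring_write(uint8_t *array, PVCALLS_RING_IDX ring_size,
                                 volatile PVCALLS_RING_IDX *prod_p,
                                 volatile PVCALLS_RING_IDX *cons_p,
                                 volatile int32_t *error_p,
                                 const uint8_t *src, uint32_t len)
{
    PVCALLS_RING_IDX cons = *cons_p;
    PVCALLS_RING_IDX prod = *prod_p;
    int32_t error = *error_p;
    uint32_t free_bytes, i;

    atomic_thread_fence(memory_order_seq_cst);   /* general memory barrier */
    if (error)
        return error;

    free_bytes = ring_size - (prod - cons);
    if (len > free_bytes)
        len = free_bytes;
    for (i = 0; i < len; i++)                    /* wraps via the mask */
        array[SKETCH_MASK_IDX(prod + i, ring_size)] = src[i];

    atomic_thread_fence(memory_order_release);   /* write memory barrier */
    *prod_p = prod + len;
    /* a real implementation would notify the other end via evtchn here */
    return (int32_t)len;
}

/* Consumer: copy up to len unconsumed bytes out of the array.
 * Returns the bytes read, or the pending error. */
static int32_t sketch_ring_read(const uint8_t *array, PVCALLS_RING_IDX ring_size,
                                volatile PVCALLS_RING_IDX *prod_p,
                                volatile PVCALLS_RING_IDX *cons_p,
                                volatile int32_t *error_p,
                                uint8_t *dst, uint32_t len)
{
    PVCALLS_RING_IDX prod = *prod_p;
    PVCALLS_RING_IDX cons = *cons_p;
    int32_t error = *error_p;
    uint32_t avail, i;

    atomic_thread_fence(memory_order_acquire);   /* read memory barrier */
    if (error)
        return error;

    avail = prod - cons;
    if (len > avail)
        len = avail;
    for (i = 0; i < len; i++)
        dst[i] = array[SKETCH_MASK_IDX(cons + i, ring_size)];

    atomic_thread_fence(memory_order_seq_cst);   /* general memory barrier */
    *cons_p = cons + len;
    /* a real implementation would notify the other end via evtchn here */
    return (int32_t)len;
}
```

Because the indexes are only masked when touching the array, `prod - cons` gives the number of unconsumed bytes directly, which is the same quantity `pvcalls_ring_unconsumed` computes for a power of two `ring_size`.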