1# PV Calls Protocol version 1
2
3## Glossary
4
5The following is a list of terms and definitions used in the Xen
6community. If you are a Xen contributor you can skip this section.
7
8* PV
9
10  Short for paravirtualized.
11
12* Dom0
13
14  First virtual machine that boots. In most configurations Dom0 is
15  privileged and has control over hardware devices, such as network
  cards, graphics cards, etc.
17
18* DomU
19
20  Regular unprivileged Xen virtual machine.
21
22* Domain
23
  A Xen virtual machine. Dom0 and all DomUs are separate Xen
  domains.
26
27* Guest
28
29  Same as domain: a Xen virtual machine.
30
31* Frontend
32
33  Each DomU has one or more paravirtualized frontend drivers to access
  disks, network, console, graphics, etc. The presence of PV devices is
  advertised on XenStore, a cross domain key-value database. Frontends
36  are similar in intent to the virtio drivers in Linux.
37
38* Backend
39
  A Xen paravirtualized backend typically runs in Dom0 and is used to
  export disks, network, console, graphics, etc., to DomUs. Backends can
42  live both in kernel space and in userspace. For example xen-blkback
43  lives under drivers/block in the Linux kernel and xen_disk lives under
44  hw/block in QEMU. Paravirtualized backends are similar in intent to
45  virtio device emulators.
46
47* VMX and SVM
48
49  On Intel processors, VMX is the CPU flag for VT-x, hardware
50  virtualization support. It corresponds to SVM on AMD processors.
51
52
53
54## Rationale
55
PV Calls is a paravirtualized protocol that allows the implementation of
a set of POSIX functions in a different domain. The PV Calls frontend
sends POSIX function calls to the backend, which implements them, acts
on each call, and returns a value to the frontend.
60
61This version of the document covers networking function calls, such as
62connect, accept, bind, release, listen, poll, recvmsg and sendmsg; but
63the protocol is meant to be easily extended to cover different sets of
64calls. Unimplemented commands return ENOTSUP.
65
PV Calls provide the following benefits:

* full visibility of the guest behavior on the backend domain, allowing
68  for inexpensive filtering and manipulation of any guest calls
69* excellent performance
70
Specifically, PV Calls for networking offer these advantages:

* guest networking works out of the box with VPNs, wireless networks and
73  any other complex configurations on the host
74* guest services listen on ports bound directly to the backend domain IP
75  addresses
* localhost becomes a secure host wide network for inter-VM
  communication
78
79
80## Design
81
82### Why Xen?
83
84PV Calls are part of an effort to create a secure runtime environment
85for containers (Open Containers Initiative images to be precise). PV
86Calls are based on Xen, although porting them to other hypervisors is
87possible. Xen was chosen because of its security and isolation
properties and because it supports PV guests, a type of virtual machine
89that does not require hardware virtualization extensions (VMX on Intel
90processors and SVM on AMD processors). This is important because PV
91Calls is meant for containers and containers are often run on top of
92public cloud instances, which do not support nested VMX (or SVM) as of
93today (early 2017). Xen PV guests are lightweight, minimalist, and do
94not require machine emulation: all properties that make them a good fit
95for this project.
96
97### Xenstore
98
99The frontend and the backend connect via [xenstore] to
100exchange information. The toolstack creates front and back nodes with
101state of [XenbusStateInitialising]. The protocol node name
102is **pvcalls**.  There can only be one PV Calls frontend per domain.
103
104#### Frontend XenBus Nodes
105
106version
107     Values:         <string>
108
109     Protocol version, chosen among the ones supported by the backend
110     (see **versions** under [Backend XenBus Nodes]). Currently the
111     value must be "1".
112
113port
114     Values:         <uint32_t>
115
116     The identifier of the Xen event channel used to signal activity
117     in the command ring.
118
119ring-ref
120     Values:         <uint32_t>
121
122     The Xen grant reference granting permission for the backend to map
123     the sole page in a single page sized command ring.
124
125#### Backend XenBus Nodes
126
127versions
128     Values:         <string>
129
130     List of comma separated protocol versions supported by the backend.
131     For example "1,2,3". Currently the value is just "1", as there is
132     only one version.
133
134max-page-order
135     Values:         <uint32_t>
136
     The maximum supported size of a memory allocation in units of
     log2(machine pages), e.g. 1 = 2 pages, 2 = 4 pages, etc. It must
     be 1 or more.
140
141function-calls
142     Values:         <uint32_t>
143
144     Value "0" means that no calls are supported.
145     Value "1" means that socket, connect, release, bind, listen, accept
146     and poll are supported.
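
The frontend has to choose its **version** among the values listed in
the backend **versions** node. The following is a minimal sketch of that
check, assuming the comma separated string has already been read from
xenstore. `pvcalls_version_supported` is a hypothetical helper name, not
part of the protocol or of any Xen library.

```c
#include <stdlib.h>

/*
 * Return 1 if `wanted` appears in the comma separated `versions` string
 * published by the backend (for example "1" or "1,2,3"), 0 otherwise.
 * Hypothetical helper, for illustration only.
 */
int pvcalls_version_supported(const char *versions, unsigned long wanted)
{
    const char *p = versions;

    while (*p != '\0') {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)           /* not a number: malformed list */
            return 0;
        if (v == wanted)
            return 1;
        if (*end == ',')
            p = end + 1;        /* skip the separator */
        else if (*end == '\0')
            break;              /* end of the list */
        else
            return 0;           /* unexpected character */
    }
    return 0;
}
```

A frontend that only implements this version of the protocol would call
it with `wanted` set to 1 and refuse to connect when it returns 0.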
147
148#### State Machine
149
150Initialization:
151
152    *Front*                               *Back*
153    XenbusStateInitialising               XenbusStateInitialising
154    - Query virtual device                - Query backend device
155      properties.                           identification data.
156    - Setup OS device instance.           - Publish backend features
157    - Allocate and initialize the           and transport parameters
158      request ring.                                      |
159    - Publish transport parameters                       |
160      that will be in effect during                      V
161      this connection.                            XenbusStateInitWait
162                 |
163                 |
164                 V
165       XenbusStateInitialised
166
167                                          - Query frontend transport parameters.
168                                          - Connect to the request ring and
169                                            event channel.
170                                                         |
171                                                         |
172                                                         V
173                                                 XenbusStateConnected
174
175     - Query backend device properties.
176     - Finalize OS virtual device
177       instance.
178                 |
179                 |
180                 V
181        XenbusStateConnected
182
Once frontend and backend are connected, they have a shared page, which
is used to exchange messages over a ring, and an event channel, which is
used to send notifications.
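
The initialization sequence can be sketched as two small transition
functions, one per side. This is only an illustration of the diagram:
the function names are hypothetical, the enum lists just the states used
in this document (with the values they are assumed to have in
`xen/include/public/io/xenbus.h`), and the work done in each state
(allocating the ring, mapping grants, binding the event channel) is left
out.

```c
/* States used by this protocol; values assumed to match xenbus.h. */
enum xenbus_state {
    XenbusStateUnknown      = 0,
    XenbusStateInitialising = 1,
    XenbusStateInitWait     = 2,
    XenbusStateInitialised  = 3,
    XenbusStateConnected    = 4,
    XenbusStateClosing      = 5,
    XenbusStateClosed       = 6
};

/* Next frontend state during initialization, given the backend state. */
enum xenbus_state pvcalls_front_init_step(enum xenbus_state front,
                                          enum xenbus_state back)
{
    if (front == XenbusStateInitialising)
        return XenbusStateInitialised;  /* ring allocated, nodes published */
    if (front == XenbusStateInitialised && back == XenbusStateConnected)
        return XenbusStateConnected;    /* backend is attached to the ring */
    return front;                       /* otherwise keep waiting */
}

/* Next backend state during initialization, given the frontend state. */
enum xenbus_state pvcalls_back_init_step(enum xenbus_state back,
                                         enum xenbus_state front)
{
    if (back == XenbusStateInitialising)
        return XenbusStateInitWait;     /* features published */
    if (back == XenbusStateInitWait && front == XenbusStateInitialised)
        return XenbusStateConnected;    /* ring and event channel connected */
    return back;                        /* otherwise keep waiting */
}
```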
186
187Shutdown:
188
189    *Front*                            *Back*
190    XenbusStateConnected               XenbusStateConnected
191                |
192                |
193                V
194       XenbusStateClosing
195
196                                       - Unmap grants
197                                       - Unbind event channels
198                                                 |
199                                                 |
200                                                 V
201                                         XenbusStateClosing
202
203    - Unbind event channels
204    - Free rings
205    - Free data structures
206               |
207               |
208               V
209       XenbusStateClosed
210
211                                       - Free remaining data structures
212                                                 |
213                                                 |
214                                                 V
215                                         XenbusStateClosed
216
217
218### Commands Ring
219
220The shared ring is used by the frontend to forward POSIX function calls
221to the backend. We shall refer to this ring as **commands ring** to
222distinguish it from other rings which can be created later in the
lifecycle of the protocol (see [Indexes Page and Data ring]). The grant
reference for the shared page of this ring is published on xenstore (see
[Frontend XenBus Nodes]). The ring format is defined using the familiar
226`DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`).  Frontend
227requests are allocated on the ring using the `RING_GET_REQUEST` macro.
228The list of commands below is in calling order.
229
230The format is defined as follows:
231
232    #define PVCALLS_SOCKET         0
233    #define PVCALLS_CONNECT        1
234    #define PVCALLS_RELEASE        2
235    #define PVCALLS_BIND           3
236    #define PVCALLS_LISTEN         4
237    #define PVCALLS_ACCEPT         5
238    #define PVCALLS_POLL           6
239
240    struct xen_pvcalls_request {
241    	uint32_t req_id; /* private to guest, echoed in response */
242    	uint32_t cmd;    /* command to execute */
243    	union {
244    		struct xen_pvcalls_socket {
245    			uint64_t id;
246    			uint32_t domain;
247    			uint32_t type;
248    			uint32_t protocol;
249    			#ifdef CONFIG_X86_32
250    			uint8_t pad[4];
251    			#endif
252    		} socket;
253    		struct xen_pvcalls_connect {
254    			uint64_t id;
255    			uint8_t addr[28];
256    			uint32_t len;
257    			uint32_t flags;
258    			grant_ref_t ref;
259    			uint32_t evtchn;
260    			#ifdef CONFIG_X86_32
261    			uint8_t pad[4];
262    			#endif
263    		} connect;
264    		struct xen_pvcalls_release {
265    			uint64_t id;
266    			uint8_t reuse;
267    			#ifdef CONFIG_X86_32
268    			uint8_t pad[7];
269    			#endif
270    		} release;
271    		struct xen_pvcalls_bind {
272    			uint64_t id;
273    			uint8_t addr[28];
274    			uint32_t len;
275    		} bind;
276    		struct xen_pvcalls_listen {
277    			uint64_t id;
278    			uint32_t backlog;
279    			#ifdef CONFIG_X86_32
280    			uint8_t pad[4];
281    			#endif
282    		} listen;
283    		struct xen_pvcalls_accept {
284    			uint64_t id;
285    			uint64_t id_new;
286    			grant_ref_t ref;
287    			uint32_t evtchn;
288    		} accept;
289    		struct xen_pvcalls_poll {
290    			uint64_t id;
291    		} poll;
292    		/* dummy member to force sizeof(struct xen_pvcalls_request) to match across archs */
293    		struct xen_pvcalls_dummy {
294    			uint8_t dummy[56];
295    		} dummy;
296    	} u;
297    };
298
299The first two fields are common for every command. Their binary layout
300is:
301
302    0       4       8
303    +-------+-------+
304    |req_id |  cmd  |
305    +-------+-------+
306
- **req_id** is generated by the frontend and is a cookie used to
  identify one specific request/response pair of commands. Not to be
  confused with the command **id** fields, which identify a socket
  across multiple commands, see [Socket].
311- **cmd** is the command requested by the frontend:
312
313    - `PVCALLS_SOCKET`:  0
314    - `PVCALLS_CONNECT`: 1
315    - `PVCALLS_RELEASE`: 2
316    - `PVCALLS_BIND`:    3
317    - `PVCALLS_LISTEN`:  4
318    - `PVCALLS_ACCEPT`:  5
319    - `PVCALLS_POLL`:    6
320
321Both fields are echoed back by the backend. See [Socket families and
322address format] for the format of the **addr** field of connect and
bind. The maximum size of command specific arguments is 56 bytes. Any
future command that requires more than that will need to bump the
**version** of the protocol.
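
The sizes and offsets used throughout this document can be checked at
compile time. The sketch below uses a reduced copy of the request
structure (only the connect and dummy members, without the 32-bit
padding) and assumes that `grant_ref_t` is a 32-bit unsigned integer.
The two accessor functions are hypothetical and exist only to make the
sizes easy to inspect.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;   /* assumption: 32-bit grant reference */

/* Reduced copy of the request: only connect and the dummy member. */
struct xen_pvcalls_request {
    uint32_t req_id;
    uint32_t cmd;
    union {
        struct xen_pvcalls_connect {
            uint64_t id;
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref;
            uint32_t evtchn;
        } connect;
        struct xen_pvcalls_dummy {
            uint8_t dummy[56];
        } dummy;
    } u;
};

/* Offsets from the start of the request, as drawn in the binary layouts. */
_Static_assert(offsetof(struct xen_pvcalls_request, u) == 8, "header");
_Static_assert(offsetof(struct xen_pvcalls_request, u.connect.addr) == 16, "addr");
_Static_assert(offsetof(struct xen_pvcalls_request, u.connect.len) == 44, "len");
_Static_assert(offsetof(struct xen_pvcalls_request, u.connect.flags) == 48, "flags");
_Static_assert(offsetof(struct xen_pvcalls_request, u.connect.ref) == 52, "ref");
_Static_assert(offsetof(struct xen_pvcalls_request, u.connect.evtchn) == 56, "evtchn");
_Static_assert(sizeof(struct xen_pvcalls_request) == 64, "8 + 56 bytes");

/* Size of a whole request slot. */
size_t pvcalls_request_size(void)
{
    return sizeof(struct xen_pvcalls_request);
}

/* Size of the command specific arguments. */
size_t pvcalls_args_size(void)
{
    return sizeof(((struct xen_pvcalls_request *)0)->u);
}
```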
326
327Similarly to other Xen ring based protocols, after writing a request to
328the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and
329issues an event channel notification when a notification is required.
330
331Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
332The format is the following:
333
334    struct xen_pvcalls_response {
335        uint32_t req_id;
336        uint32_t cmd;
337        int32_t ret;
338        uint32_t pad;
339        union {
340    		struct _xen_pvcalls_socket {
341    			uint64_t id;
342    		} socket;
343    		struct _xen_pvcalls_connect {
344    			uint64_t id;
345    		} connect;
346    		struct _xen_pvcalls_release {
347    			uint64_t id;
348    		} release;
349    		struct _xen_pvcalls_bind {
350    			uint64_t id;
351    		} bind;
352    		struct _xen_pvcalls_listen {
353    			uint64_t id;
354    		} listen;
355    		struct _xen_pvcalls_accept {
356    			uint64_t id;
357    		} accept;
358    		struct _xen_pvcalls_poll {
359    			uint64_t id;
360    		} poll;
361    		struct _xen_pvcalls_dummy {
362    			uint8_t dummy[8];
363    		} dummy;
364    	} u;
365    };
366
367The first four fields are common for every response. Their binary layout
368is:
369
370    0       4       8       12      16
371    +-------+-------+-------+-------+
372    |req_id |  cmd  |  ret  |  pad  |
373    +-------+-------+-------+-------+
374
375- **req_id**: echoed back from request
376- **cmd**: echoed back from request
377- **ret**: return value, identifies success (0) or failure (see [Error
378  numbers] in further sections). If the **cmd** is not supported by the
379  backend, ret is ENOTSUP.
380- **pad**: padding
381
382After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether
383it needs to notify the frontend and does so via event channel.
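
Because **req_id** and **cmd** are echoed back, a frontend can match
each response to the request that caused it. One possible way to do so,
shown here as a sketch, is to use the index of a slot in a small table
of outstanding requests as **req_id**. The table size and all names are
hypothetical; the protocol only requires that **req_id** identifies the
request/response pair.

```c
#include <stdint.h>

#define PVCALLS_NR_PENDING 32   /* hypothetical cap on in-flight requests */

struct pvcalls_pending {
    int in_use;      /* slot holds a request awaiting its response */
    int done;        /* the response has arrived */
    uint32_t cmd;    /* command sent, compared with the echoed cmd */
    int32_t ret;     /* return value copied from the response */
};

static struct pvcalls_pending pending[PVCALLS_NR_PENDING];

/* Reserve a slot for a new request; the slot index is used as req_id. */
int pvcalls_pending_get(uint32_t cmd)
{
    int i;

    for (i = 0; i < PVCALLS_NR_PENDING; i++) {
        if (!pending[i].in_use) {
            pending[i].in_use = 1;
            pending[i].done = 0;
            pending[i].cmd = cmd;
            return i;
        }
    }
    return -1;   /* too many requests in flight */
}

/* Record a response; returns -1 if it matches no outstanding request. */
int pvcalls_pending_complete(uint32_t req_id, uint32_t cmd, int32_t ret)
{
    if (req_id >= PVCALLS_NR_PENDING)
        return -1;
    if (!pending[req_id].in_use || pending[req_id].done ||
        pending[req_id].cmd != cmd)
        return -1;
    pending[req_id].ret = ret;
    pending[req_id].done = 1;
    return 0;
}
```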
384
A description of each command and of its additional request and response
fields follows.
387
388
389#### Socket
390
391The **socket** operation corresponds to the POSIX [socket][socket]
392function. It creates a new socket of the specified family, type and
393protocol. **id** is freely chosen by the frontend and references this
394specific socket from this point forward. See [Socket families and
395address format] to see which ones are supported by different versions of
396the protocol.
397
398Request fields:
399
400- **cmd** value: 0
401- additional fields:
402  - **id**: generated by the frontend, it identifies the new socket
403  - **domain**: the communication domain
404  - **type**: the socket type
405  - **protocol**: the particular protocol to be used with the socket, usually 0
406
407Request binary layout:
408
    8       12      16      20      24      28
410    +-------+-------+-------+-------+-------+
411    |       id      |domain | type  |protoco|
412    +-------+-------+-------+-------+-------+
413
414Response additional fields:
415
416- **id**: echoed back from request
417
418Response binary layout:
419
    16      20      24
    +-------+-------+
    |       id      |
    +-------+-------+
424
425Return value:
426
427  - 0 on success
  - See the [POSIX socket function][socket] for error names; see
429    [Error numbers] in further sections.
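
Since this version of the protocol supports a single combination of
**domain**, **type** and **protocol** (see [Socket families and address
format]), a backend can reject anything else before creating a socket.
The following is a sketch of such a check. The function name and the
`PVCALLS_` prefixed constants are local to this example; the numeric
values are the ones given in this document.

```c
#include <stdint.h>

/* Values taken from this document; the names are local to this sketch. */
#define PVCALLS_AF_INET      2
#define PVCALLS_SOCK_STREAM  1
#define PVCALLS_ENOTSUP      524

/*
 * Return the ret value for a socket request based on its arguments only:
 * 0 if the combination is supported by version 1, -ENOTSUP otherwise.
 */
int32_t pvcalls_check_socket_args(uint32_t domain, uint32_t type,
                                  uint32_t protocol)
{
    if (domain != PVCALLS_AF_INET)
        return -PVCALLS_ENOTSUP;
    if (type != PVCALLS_SOCK_STREAM)
        return -PVCALLS_ENOTSUP;
    if (protocol != 0)
        return -PVCALLS_ENOTSUP;
    return 0;
}
```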
430
431#### Connect
432
433The **connect** operation corresponds to the POSIX [connect][connect]
434function. It connects a previously created socket (identified by **id**)
435to the specified address.
436
437The connect operation creates a new shared ring, which we'll call **data
438ring**. The data ring is used to send and receive data from the
439socket. The connect operation passes two additional parameters:
440**evtchn** and **ref**. **evtchn** is the port number of a new event
441channel which will be used for notifications of activity on the data
442ring. **ref** is the grant reference of the **indexes page**: a page
443which contains shared indexes that point to the write and read locations
444in the **data ring**. The **indexes page** also contains the full array
445of grant references for the **data ring**. When the frontend issues a
446**connect** command, the backend:
447
448- finds its own internal socket corresponding to **id**
449- connects the socket to **addr**
450- maps the grant reference **ref**, the indexes page, see struct
451  pvcalls_data_intf
452- maps all the grant references listed in `struct pvcalls_data_intf` and
453  uses them as shared memory for the **data ring**
- binds the **evtchn**
455- replies to the frontend
456
457The [Indexes Page and Data ring] format will be described in the
458following section. The **data ring** is unmapped and freed upon issuing
459a **release** command on the active socket identified by **id**. A
460frontend state change can also cause data rings to be unmapped.
461
462Request fields:
463
- **cmd** value: 1
465- additional fields:
466  - **id**: identifies the socket
467  - **addr**: address to connect to, see [Socket families and address format]
468  - **len**: address length up to 28 octets
469  - **flags**: flags for the connection, reserved for future usage
470  - **ref**: grant reference of the indexes page
471  - **evtchn**: port number of the evtchn to signal activity on the **data ring**
472
473Request binary layout:
474
475    8       12      16      20      24      28      32      36      40      44
476    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
477    |       id      |                            addr                       |
478    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    44      48      52      56      60
    +-------+-------+-------+-------+
    | len   | flags |  ref  |evtchn |
    +-------+-------+-------+-------+
481
482Response additional fields:
483
484- **id**: echoed back from request
485
486Response binary layout:
487
488    16      20      24
489    +-------+-------+
490    |       id      |
491    +-------+-------+
492
493Return value:
494
495  - 0 on success
496  - See the [POSIX connect function][connect] for error names; see
497    [Error numbers] in further sections.
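
On the frontend side, filling a connect request could look like the
sketch below. It uses a reduced copy of the request structure (connect
and dummy members only), assumes a 32-bit `grant_ref_t`, and rejects
addresses longer than 28 octets. `pvcalls_fill_connect` is a
hypothetical helper, and returning -22 (EINVAL) for an oversized address
is a choice made for this example. In a real frontend the request slot
would come from `RING_GET_REQUEST`, and **ref** and **evtchn** from the
grant table and event channel interfaces.

```c
#include <stdint.h>
#include <string.h>

#define PVCALLS_CONNECT 1

typedef uint32_t grant_ref_t;   /* assumption: 32-bit grant reference */

/* Reduced copy of the request: only connect and the dummy member. */
struct xen_pvcalls_request {
    uint32_t req_id;
    uint32_t cmd;
    union {
        struct xen_pvcalls_connect {
            uint64_t id;
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref;
            uint32_t evtchn;
        } connect;
        struct xen_pvcalls_dummy {
            uint8_t dummy[56];
        } dummy;
    } u;
};

/* Fill a connect request; returns 0, or -22 (EINVAL) if addr is too long. */
int pvcalls_fill_connect(struct xen_pvcalls_request *req, uint32_t req_id,
                         uint64_t id, const void *addr, uint32_t len,
                         grant_ref_t ref, uint32_t evtchn)
{
    if (len > sizeof(req->u.connect.addr))
        return -22;

    memset(req, 0, sizeof(*req));   /* flags is reserved, keep it zero */
    req->req_id = req_id;
    req->cmd = PVCALLS_CONNECT;
    req->u.connect.id = id;
    memcpy(req->u.connect.addr, addr, len);
    req->u.connect.len = len;
    req->u.connect.ref = ref;       /* grant reference of the indexes page */
    req->u.connect.evtchn = evtchn; /* event channel for the data ring */
    return 0;
}
```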
498
499#### Release
500
The **release** operation closes an existing active or passive socket.
502
503When a release command is issued on a passive socket, the backend
504releases it and frees its internal mappings. When a release command is
505issued for an active socket, the data ring and indexes page are also
506unmapped and freed:
507
508- frontend sends release command for an active socket
509- backend releases the socket
510- backend unmaps the data ring
511- backend unmaps the indexes page
512- backend unbinds the event channel
- backend replies to frontend with a **ret** value
514- frontend frees data ring, indexes page and unbinds event channel
515
516Request fields:
517
- **cmd** value: 2
519- additional fields:
520  - **id**: identifies the socket
521  - **reuse**: an optimization hint for the backend. The field is
522    ignored for passive sockets. When set to 1, the frontend lets the
523    backend know that it will reuse exactly the same set of grant pages
524    (indexes page and data ring) and event channel when creating one of
525    the next active sockets. The backend can take advantage of it by
526    delaying unmapping grants and unbinding the event channel. The
527    backend is free to ignore the hint. Reused data rings are found by
528    **ref**, the grant reference of the page containing the indexes.
529
530Request binary layout:
531
532    8       12      16    17
533    +-------+-------+-----+
534    |       id      |reuse|
535    +-------+-------+-----+
536
537Response additional fields:
538
539- **id**: echoed back from request
540
541Response binary layout:
542
543    16      20      24
544    +-------+-------+
545    |       id      |
546    +-------+-------+
547
548Return value:
549
550  - 0 on success
551  - See the [POSIX shutdown function][shutdown] for error names; see
552    [Error numbers] in further sections.
553
554#### Bind
555
556The **bind** operation corresponds to the POSIX [bind][bind] function.
557It assigns the address passed as parameter to a previously created
558socket, identified by **id**. **Bind**, **listen** and **accept** are
559the three operations required to have fully working passive sockets and
560should be issued in that order.
561
562Request fields:
563
- **cmd** value: 3
- additional fields:
  - **id**: identifies the socket
  - **addr**: address to bind to, see [Socket families and address
    format]
569  - **len**: address length up to 28 octets
570
571Request binary layout:
572
573    8       12      16      20      24      28      32      36      40      44
574    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
575    |       id      |                            addr                       |
576    +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    44      48
    +-------+
    |  len  |
    +-------+
579
580Response additional fields:
581
582- **id**: echoed back from request
583
584Response binary layout:
585
586    16      20      24
587    +-------+-------+
588    |       id      |
589    +-------+-------+
590
591Return value:
592
593  - 0 on success
594  - See the [POSIX bind function][bind] for error names; see
595    [Error numbers] in further sections.
596
597
598#### Listen
599
600The **listen** operation marks the socket as a passive socket. It corresponds to
601the [POSIX listen function][listen].
602
Request fields:

- **cmd** value: 4
606- additional fields:
607  - **id**: identifies the socket
608  - **backlog**: the maximum length to which the queue of pending
609    connections may grow in number of elements
610
611Request binary layout:
612
613    8       12      16      20
614    +-------+-------+-------+
615    |       id      |backlog|
616    +-------+-------+-------+
617
618Response additional fields:
619
620- **id**: echoed back from request
621
622Response binary layout:
623
624    16      20      24
625    +-------+-------+
626    |       id      |
627    +-------+-------+
628
Return value:

  - 0 on success
631  - See the [POSIX listen function][listen] for error names; see
632    [Error numbers] in further sections.
633
634
635#### Accept
636
637The **accept** operation extracts the first connection request on the
638queue of pending connections for the listening socket identified by
639**id** and creates a new connected socket. The id of the new socket is
640also chosen by the frontend and passed as an additional field of the
641accept request struct (**id_new**). See the [POSIX accept function][accept]
642as reference.
643
644Similarly to the **connect** operation, **accept** creates new [Indexes
645Page and Data ring]. The **data ring** is used to send and receive data from
646the socket. The **accept** operation passes two additional parameters:
647**evtchn** and **ref**. **evtchn** is the port number of a new event
648channel which will be used for notifications of activity on the data
649ring. **ref** is the grant reference of the **indexes page**: a page
650which contains shared indexes that point to the write and read locations
651in the **data ring**. The **indexes page** also contains the full array of
652grant references for the **data ring**.
653
654The backend will reply to the request only when a new connection is
655successfully accepted, i.e. the backend does not return EAGAIN or
656EWOULDBLOCK.
657
658Example workflow:
659
660- frontend issues an **accept** request
661- backend waits for a connection to be available on the socket
662- a new connection becomes available
663- backend accepts the new connection
664- backend creates an internal mapping from **id_new** to the new socket
665- backend maps the grant reference **ref**, the indexes page, see struct
666  pvcalls_data_intf
667- backend maps all the grant references listed in `struct
668  pvcalls_data_intf` and uses them as shared memory for the new data
669  ring **in** and **out** arrays
670- backend binds to the **evtchn**
671- backend replies to the frontend with a **ret** value
672
673Request fields:
674
- **cmd** value: 5
676- additional fields:
677  - **id**: id of listening socket
678  - **id_new**: id of the new socket
679  - **ref**: grant reference of the indexes page
680  - **evtchn**: port number of the evtchn to signal activity on the data ring
681
682Request binary layout:
683
684    8       12      16      20      24      28      32
685    +-------+-------+-------+-------+-------+-------+
686    |       id      |    id_new     |  ref  |evtchn |
687    +-------+-------+-------+-------+-------+-------+
688
689Response additional fields:
690
691- **id**: id of the listening socket, echoed back from request
692
693Response binary layout:
694
695    16      20      24
696    +-------+-------+
697    |       id      |
698    +-------+-------+
699
700Return value:
701
702  - 0 on success
703  - See the [POSIX accept function][accept] for error names; see
704    [Error numbers] in further sections.
705
706
707#### Poll
708
709In this version of the protocol, the **poll** operation is only valid
710for passive sockets. For active sockets, the frontend should look at the
711indexes on the **indexes page**. When a new connection is available in
712the queue of the passive socket, the backend generates a response and
713notifies the frontend.
714
715Request fields:
716
- **cmd** value: 6
718- additional fields:
719  - **id**: identifies the listening socket
720
721Request binary layout:
722
723    8       12      16
724    +-------+-------+
725    |       id      |
726    +-------+-------+
727
728
729Response additional fields:
730
731- **id**: echoed back from request
732
733Response binary layout:
734
735    16       20       24
736    +--------+--------+
737    |        id       |
738    +--------+--------+
739
740Return value:
741
742  - 0 on success
743  - See the [POSIX poll function][poll] for error names; see
744    [Error numbers] in further sections.
745
746#### Expanding the protocol
747
748It is possible to introduce new commands without changing the protocol
749ABI. Naturally, a feature flag among the backend xenstore nodes should
750advertise the availability of a new set of commands.
751
If a new command requires parameters in struct xen_pvcalls_request
larger than 56 bytes, which is the current maximum size of the command
specific arguments, then the protocol version should be increased. One
way to implement the larger request structure without disrupting the
current ABI is to introduce a new command, such as
PVCALLS_CONNECT_EXTENDED, and a flag to specify that the request uses
two request slots, for a total of 112 bytes.
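
As a purely illustrative sketch of the arithmetic behind that idea, the
number of request slots a hypothetical extended command would occupy can
be derived from the size of its arguments, counting 56 bytes of
arguments per slot as described above. Neither the helper nor the
constant below is part of the protocol.

```c
#include <stddef.h>

#define PVCALLS_ARGS_PER_SLOT 56   /* command specific bytes per request */

/* Number of request slots needed for args_len bytes of arguments. */
size_t pvcalls_slots_needed(size_t args_len)
{
    if (args_len == 0)
        return 1;   /* a command without arguments still uses one slot */
    return (args_len + PVCALLS_ARGS_PER_SLOT - 1) / PVCALLS_ARGS_PER_SLOT;
}
```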
758
759#### Error numbers
760
761The numbers corresponding to the error names specified by POSIX are:
762
763    [EPERM]         -1
764    [ENOENT]        -2
765    [ESRCH]         -3
766    [EINTR]         -4
767    [EIO]           -5
768    [ENXIO]         -6
769    [E2BIG]         -7
770    [ENOEXEC]       -8
771    [EBADF]         -9
772    [ECHILD]        -10
773    [EAGAIN]        -11
774    [EWOULDBLOCK]   -11
775    [ENOMEM]        -12
776    [EACCES]        -13
777    [EFAULT]        -14
778    [EBUSY]         -16
779    [EEXIST]        -17
780    [EXDEV]         -18
781    [ENODEV]        -19
782    [EISDIR]        -21
783    [EINVAL]        -22
784    [ENFILE]        -23
785    [EMFILE]        -24
786    [ENOSPC]        -28
787    [EROFS]         -30
788    [EMLINK]        -31
789    [EDOM]          -33
790    [ERANGE]        -34
791    [EDEADLK]       -35
792    [EDEADLOCK]     -35
793    [ENAMETOOLONG]  -36
794    [ENOLCK]        -37
    [ENOSYS]        -38
    [ENOTEMPTY]     -39
797    [ENODATA]       -61
798    [ETIME]         -62
799    [EBADMSG]       -74
800    [EOVERFLOW]     -75
801    [EILSEQ]        -84
802    [ERESTART]      -85
803    [ENOTSOCK]      -88
804    [EOPNOTSUPP]    -95
805    [EAFNOSUPPORT]  -97
806    [EADDRINUSE]    -98
807    [EADDRNOTAVAIL] -99
808    [ENOBUFS]       -105
809    [EISCONN]       -106
810    [ENOTCONN]      -107
811    [ETIMEDOUT]     -110
    [ENOTSUP]       -524
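
The values above are fixed by the protocol and travel in the **ret**
field as negative numbers. They are not guaranteed to match the
constants of the operating system running either end (for instance,
`ENOTSUP` in a host `<errno.h>` may not be 524), so an implementation
should translate explicitly rather than pass its local errno through.
The sketch below maps a few **ret** values back to their names for
logging; the table is a subset chosen for illustration and the names of
the helper and of the table are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct pvcalls_errname {
    int32_t ret;        /* value carried in the response ret field */
    const char *name;   /* POSIX error name from the table above */
};

/* Subset of the table above, chosen for illustration. */
static const struct pvcalls_errname pvcalls_errnames[] = {
    { -11,  "EAGAIN" },
    { -22,  "EINVAL" },
    { -88,  "ENOTSOCK" },
    { -97,  "EAFNOSUPPORT" },
    { -98,  "EADDRINUSE" },
    { -107, "ENOTCONN" },
    { -110, "ETIMEDOUT" },
    { -524, "ENOTSUP" },
};

/* Return a printable name for a ret value. */
const char *pvcalls_ret_name(int32_t ret)
{
    size_t i;
    size_t n = sizeof(pvcalls_errnames) / sizeof(pvcalls_errnames[0]);

    if (ret == 0)
        return "success";
    for (i = 0; i < n; i++) {
        if (pvcalls_errnames[i].ret == ret)
            return pvcalls_errnames[i].name;
    }
    return "unknown";
}
```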
813
814#### Socket families and address format
815
816The following definitions and explicit sizes, together with POSIX
817[sys/socket.h][address] and [netinet/in.h][in] define socket families and
818address format. Please be aware that only the **domain** `AF_INET`, **type**
819`SOCK_STREAM` and **protocol** `0` are supported by this version of the
820specification, others return ENOTSUP.
821
822    #define AF_UNSPEC   0
823    #define AF_UNIX     1   /* Unix domain sockets      */
824    #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
825    #define AF_INET     2   /* Internet IP Protocol     */
826    #define AF_INET6    10  /* IP version 6         */
827
828    #define SOCK_STREAM 1
829    #define SOCK_DGRAM  2
830    #define SOCK_RAW    3
831
832    /* generic address format */
833    struct sockaddr {
834        uint16_t sa_family_t;
835        char sa_data[26];
836    };
837
838    struct in_addr {
839        uint32_t s_addr;
840    };
841
842    /* AF_INET address format */
843    struct sockaddr_in {
844        uint16_t         sa_family_t;
845        uint16_t         sin_port;
846        struct in_addr   sin_addr;
847        char             sin_zero[20];
848    };
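
As an illustration, the 28 octets of the **addr** field for an
`AF_INET` address could be produced as in the sketch below. It assumes,
as in POSIX `netinet/in.h`, that the port and the IPv4 address are
stored in network byte order. The structure is a local copy of the
`sockaddr_in` definition above, renamed to avoid clashing with system
headers, and `pvcalls_make_inet_addr` is a hypothetical helper.

```c
#include <stdint.h>
#include <string.h>

/* Local copy of the AF_INET address format defined above. */
struct pvcalls_sockaddr_in {
    uint16_t family;        /* sa_family_t in the listing above */
    uint16_t sin_port;      /* network byte order (assumption) */
    uint32_t sin_addr;      /* struct in_addr, network byte order */
    char     sin_zero[20];
};

_Static_assert(sizeof(struct pvcalls_sockaddr_in) == 28, "must fill addr[28]");

/* Build the addr bytes for IPv4 address ip[0].ip[1].ip[2].ip[3]:port. */
void pvcalls_make_inet_addr(uint8_t out[28], const uint8_t ip[4],
                            uint16_t port)
{
    struct pvcalls_sockaddr_in sin;
    uint8_t port_be[2];

    port_be[0] = (uint8_t)(port >> 8);     /* most significant byte first */
    port_be[1] = (uint8_t)(port & 0xff);

    memset(&sin, 0, sizeof(sin));          /* clears sin_zero as well */
    sin.family = 2;                        /* AF_INET */
    memcpy(&sin.sin_port, port_be, sizeof(port_be));
    memcpy(&sin.sin_addr, ip, 4);
    memcpy(out, &sin, sizeof(sin));
}
```

The resulting buffer can be passed as **addr** to **connect** or
**bind**, with **len** set to the size of the structure.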
849
850
851### Indexes Page and Data ring
852
853Data rings are used for sending and receiving data over a connected socket. They
854are created upon a successful **accept** or **connect** command.
855The **sendmsg** and **recvmsg** calls are implemented by sending data and
856receiving data from a data ring, and updating the corresponding indexes
857on the **indexes page**.
858
859Firstly, the **indexes page** is shared by a **connect** or **accept**
860command, see **ref** parameter in their sections. The content of the
**indexes page** is represented by `struct pvcalls_data_intf`, see
862below. The structure contains the list of grant references which
863constitute the **in** and **out** buffers of the data ring, see ref[]
864below. The backend maps the grant references contiguously. Of the
865resulting shared memory, the first half is dedicated to the **in** array
866and the second half to the **out** array. They are used as circular
867buffers for transferring data, and, together, they are the data ring.
868
869
    +---------------------------+                 Indexes page
    | Command ring:             |                 +----------------------+
    | @0: xen_pvcalls_connect:  |                 |@0 pvcalls_data_intf: |
    | @44: ref  +-------------------------------->+@128: ring_order = 1  |
    |                           |                 |@132: ref[0]+         |
    +---------------------------+                 |@136: ref[1]+         |
                                                  |            |         |
                                                  |            |         |
                                                  +----------------------+
                                                               |
                                                               v (data ring)
                                                       +-------+-----------+
                                                       |  @0->4095: in     |
                                                       |  ref[0]           |
                                                       |-------------------|
                                                       |  @4096->8191: out |
                                                       |  ref[1]           |
                                                       +-------------------+
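
The split of the mapped memory into the **in** and **out** arrays
follows directly from the ring order. The sketch below computes it,
assuming 4 KiB pages. The structure and function names are hypothetical,
and returning -22 (EINVAL) for an order outside the allowed range is a
choice made for this example, not something the protocol mandates.

```c
#include <stddef.h>
#include <stdint.h>

#define PVCALLS_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

struct pvcalls_ring_geom {
    size_t nr_refs;   /* number of grant references in ref[] */
    size_t size;      /* bytes in each of the in and out arrays */
    size_t in_off;    /* offset of in[] from the start of the mapping */
    size_t out_off;   /* offset of out[] from the start of the mapping */
};

/*
 * Compute the data ring layout for a given order. The order must be at
 * least 1 and at most the backend max-page-order. A real implementation
 * would also bound the order so that the shifts cannot overflow.
 */
int pvcalls_ring_layout(uint32_t ring_order, uint32_t max_page_order,
                        struct pvcalls_ring_geom *g)
{
    size_t total;

    if (ring_order < 1 || ring_order > max_page_order)
        return -22;

    g->nr_refs = (size_t)1 << ring_order;
    total = g->nr_refs << PVCALLS_PAGE_SHIFT;
    g->size = total / 2;      /* first half is in, second half is out */
    g->in_off = 0;
    g->out_off = g->size;
    return 0;
}
```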
888
889
890#### Indexes Page Structure
891
892    typedef uint32_t PVCALLS_RING_IDX;
893
    struct pvcalls_data_intf {
    	PVCALLS_RING_IDX in_cons, in_prod;
    	int32_t in_error;

    	uint8_t pad1[52];

    	PVCALLS_RING_IDX out_cons, out_prod;
    	int32_t out_error;

    	uint8_t pad2[52];

    	uint32_t ring_order;
    	grant_ref_t ref[];
    };
908
909    /* not actually C compliant (ring_order changes from socket to socket) */
910    struct pvcalls_data {
911        char in[((1<<ring_order)<<PAGE_SHIFT)/2];
912        char out[((1<<ring_order)<<PAGE_SHIFT)/2];
913    };
914
- **ring_order**
  It represents the order of the data ring. The list of grant references
  that follows it has `(1 << ring_order)` elements. It cannot be greater than
  **max-page-order**, as specified by the backend on XenBus. It must be at
  least 1.
- **ref[]**
  The list of grant references for the pages that contain the actual data.
  The pages are mapped contiguously in virtual memory. The first half of the
  pages is the **in** array, the second half is the **out** array. Each array
  must have a power-of-two size. Together, their size is `(1 << ring_order) *
  PAGE_SIZE`.
- **in** is an array used as a circular buffer
  It contains data read from the socket. The producer is the backend, the
  consumer is the frontend.
- **out** is an array used as a circular buffer
  It contains data to be written to the socket. The producer is the frontend,
  the consumer is the backend.
- **in_cons** and **in_prod**
  Consumer and producer indexes for data read from the socket. They keep track
  of how much data has been written by the backend to **in** and how much data
  has already been consumed by the frontend. **in_prod** is increased by the
  backend, after writing data to **in**. **in_cons** is increased by the
  frontend, after reading data from **in**.
- **out_cons** and **out_prod**
  Consumer and producer indexes for the data to be written to the socket. They
  keep track of how much data has been written by the frontend to **out** and
  how much data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons** is
  increased by the backend, after reading data from **out**.
- **in_error** and **out_error**
  They signal errors when reading from the socket (**in_error**) or when
  writing to the socket (**out_error**). 0 means no errors. Once an error
  occurs, no further read or write operations are performed on the socket.
  In the case of an orderly socket shutdown (i.e. read returns 0),
  **in_error** is set to ENOTCONN. **in_error** and **out_error** are never
  set to EAGAIN or EWOULDBLOCK (the data is written to the ring as soon as
  it is available).
950
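The constraints on **ring_order** can be sketched as a small validation
helper. This is only an illustration, not part of the protocol:
`demo_nr_refs` and `demo_ring_order_valid` are hypothetical names, and
`max_page_order` stands for the value the backend advertises as
**max-page-order**.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of grant references in ref[]: (1 << ring_order).
 * Assumes ring_order has already been validated (ring_order < 32). */
static inline uint32_t demo_nr_refs(uint32_t ring_order)
{
    return UINT32_C(1) << ring_order;
}

/* ring_order must be at least 1 and no greater than max-page-order. */
static inline bool demo_ring_order_valid(uint32_t ring_order,
                                         uint32_t max_page_order)
{
    return ring_order >= 1 && ring_order <= max_page_order;
}
```

For instance, with **max-page-order** set to 9, a ring_order of 1 is accepted
and yields two grant references, while 0 and 10 are rejected.
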
951The binary layout of `struct pvcalls_data_intf` follows:
952
953    0         4         8         12           64        68        72        76
954    +---------+---------+---------+-----//-----+---------+---------+---------+
955    | in_cons | in_prod |in_error |  padding   |out_cons |out_prod |out_error|
956    +---------+---------+---------+-----//-----+---------+---------+---------+
957
    76           128       132       136       140       4092      4096
    +-----//-----+---------+---------+---------+----//---+---------+
    |  padding   |ring_orde|  ref[0] |  ref[1] |         |  ref[N] |
    +-----//-----+---------+---------+---------+----//---+---------+
962
**N.B.** With a single 4096-byte indexes page, at most 991 grant references
fit after the first 132 bytes ((4096-132)/4). Because the number of grant
references must be a power of two, the actual maximum is 512 (ring_order = 9).
965
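The figures in the note follow from the structure declaration: the two
52-byte pads place `ring_order` at offset 128 and `ref[]` at offset 132. The
sketch below checks this at compile time. It redeclares the structure under
the hypothetical name `demo_data_intf` (with pads named `pad1` and `pad2`),
and assumes 4096-byte pages and a 32-bit `grant_ref_t`; the `demo_*` helpers
are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;
typedef uint32_t grant_ref_t;       /* assumption: 32-bit grant references */

#define DEMO_PAGE_SIZE 4096u        /* assumption: 4096-byte pages */

struct demo_data_intf {
    PVCALLS_RING_IDX in_cons, in_prod;
    int32_t in_error;
    uint8_t pad1[52];
    PVCALLS_RING_IDX out_cons, out_prod;
    int32_t out_error;
    uint8_t pad2[52];
    uint32_t ring_order;
    grant_ref_t ref[];
};

static_assert(offsetof(struct demo_data_intf, out_cons) == 64, "out_cons");
static_assert(offsetof(struct demo_data_intf, ring_order) == 128, "ring_order");
static_assert(offsetof(struct demo_data_intf, ref) == 132, "ref");

/* How many grant references fit in the remainder of one indexes page. */
static inline uint32_t demo_max_refs(void)
{
    return (DEMO_PAGE_SIZE - (uint32_t)offsetof(struct demo_data_intf, ref))
           / (uint32_t)sizeof(grant_ref_t);
}

/* Largest ring_order whose (1 << ring_order) references still fit. */
static inline uint32_t demo_max_ring_order(void)
{
    uint32_t order = 0;

    while ((UINT32_C(2) << order) <= demo_max_refs())
        order++;
    return order;
}
```
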
966#### Data Ring Structure
967
The binary layout of the data ring follows:
969
970    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
971    +------------//-------------+------------//-------------+
972    |            in             |           out             |
973    +------------//-------------+------------//-------------+
974
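As an illustration of this layout, the helper below locates the two arrays
inside a data ring that has already been mapped contiguously. The names
`demo_data_ring`, `demo_data_ring_init` and `DEMO_PAGE_SHIFT` are
hypothetical, pages are assumed to be 4096 bytes, and mapping the grant
references is outside the scope of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12u     /* assumption: 4096-byte pages */

struct demo_data_ring {
    uint8_t *in;        /* first half: backend to frontend */
    uint8_t *out;       /* second half: frontend to backend */
    size_t array_size;  /* size of each array in bytes, a power of two */
};

/*
 * base points to (1 << ring_order) contiguously mapped pages.
 * ring_order is assumed to have been validated already (>= 1).
 */
static inline struct demo_data_ring demo_data_ring_init(void *base,
                                                        uint32_t ring_order)
{
    struct demo_data_ring ring;

    ring.array_size = (((size_t)1 << ring_order) << DEMO_PAGE_SHIFT) / 2;
    ring.in = (uint8_t *)base;
    ring.out = ring.in + ring.array_size;
    return ring;
}
```
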
975#### Why ring.h is not needed
976
Many Xen PV protocols use the macros provided by [ring.h] to manage
their shared ring for communication. PVCalls does not, because the [Data
Ring Structure] actually comes with two rings: the **in** ring and the
**out** ring. Each of them is unidirectional, and there is no static
request size: the producer writes opaque data to the ring. In contrast,
a [ring.h] ring carries both directions, and the request size is static
and well-known. In PVCalls:

    in  -> backend to frontend only
    out -> frontend to backend only
987
988In the case of the **in** ring, the frontend is the consumer, and the
989backend is the producer. Everything is the same but mirrored for the
990**out** ring.
991
The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of
this, leaving room for optimizations.

On the other hand, the consumer always requires notifications, unless it
is already actively reading from the ring. The producer can figure this
out, without any additional fields in the protocol, by comparing the
indexes at the beginning and the end of the function. This is similar to
what [ring.h] does.
1002
1003#### Workflow
1004
1005The **in** and **out** arrays are used as circular buffers:
1006
1007    0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
1008    +-----------------------------------+
1009    |to consume|    free    |to consume |
1010    +-----------------------------------+
1011               ^            ^
1012               prod         cons
1013
1014    0                               sizeof(array)
1015    +-----------------------------------+
1016    |  free    | to consume |   free    |
1017    +-----------------------------------+
1018               ^            ^
1019               cons         prod
1020
1021The following function is provided to calculate how many bytes are currently
1022left unconsumed in an array:
1023
    /* ring_size must be a power of two */
    #define _MASK_PVCALLS_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))

    static inline PVCALLS_RING_IDX pvcalls_ring_unconsumed(PVCALLS_RING_IDX prod,
            PVCALLS_RING_IDX cons,
            PVCALLS_RING_IDX ring_size)
    {
        PVCALLS_RING_IDX size;

        /* identical unmasked indexes: nothing left to consume */
        if (prod == cons)
            return 0;

        prod = _MASK_PVCALLS_IDX(prod, ring_size);
        cons = _MASK_PVCALLS_IDX(cons, ring_size);

        /* different indexes that mask to the same offset: the array is full */
        if (prod == cons)
            return ring_size;

        if (prod > cons)
            size = prod - cons;
        else {
            /* the unconsumed data wraps around the end of the array */
            size = ring_size - cons;
            size += prod;
        }
        return size;
    }
1049
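A producer usually needs the complementary quantity, the free space, and
both ends need the array offset that corresponds to an index. One possible
sketch follows. The `demo_*` helpers are hypothetical and assume that the
indexes are free-running 32-bit counters that never drift more than
`ring_size` apart; under that assumption plain unsigned subtraction gives the
same result as `pvcalls_ring_unconsumed`.

```c
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;

/* Bytes waiting to be consumed; unsigned subtraction handles index wrap. */
static inline PVCALLS_RING_IDX demo_ring_queued(PVCALLS_RING_IDX prod,
                                                PVCALLS_RING_IDX cons)
{
    return prod - cons;
}

/* Bytes the producer may still write before catching up with the consumer. */
static inline PVCALLS_RING_IDX demo_ring_free(PVCALLS_RING_IDX prod,
                                              PVCALLS_RING_IDX cons,
                                              PVCALLS_RING_IDX ring_size)
{
    return ring_size - demo_ring_queued(prod, cons);
}

/* Offset inside the array for an index; ring_size must be a power of two. */
static inline PVCALLS_RING_IDX demo_ring_offset(PVCALLS_RING_IDX idx,
                                                PVCALLS_RING_IDX ring_size)
{
    return idx & (ring_size - 1);
}
```

With a `ring_size` of 4096, a producer index of 4106 and a consumer index of
4000, 106 bytes are queued, 3990 bytes are free, and the next write lands at
offset 10.
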
1050The producer (the backend for **in**, the frontend for **out**) writes to the
1051array in the following way:
1052
1053- read *[in|out]_cons*, *[in|out]_prod*, *[in|out]_error* from shared memory
1054- general memory barrier
1055- return on *[in|out]_error*
1056- write to array at position *[in|out]_prod* up to *[in|out]_cons*,
1057  wrapping around the circular buffer when necessary
1058- write memory barrier
1059- increase *[in|out]_prod*
1060- notify the other end via evtchn
1061
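The producer steps above can be sketched in C as follows. This is a minimal
illustration under stated assumptions, not a reference implementation:
`struct demo_stream`, `demo_produce` and `demo_notify` are hypothetical names,
C11 atomics and `atomic_thread_fence` stand in for the shared-memory accesses
and memory barriers a real frontend or backend would use, and `demo_notify`
only counts where an event channel notification would be sent.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t PVCALLS_RING_IDX;

/* One direction of the data ring: the in_* or out_* fields plus their array. */
struct demo_stream {
    _Atomic PVCALLS_RING_IDX cons, prod;
    _Atomic int32_t error;
    uint8_t *array;                 /* the in or the out array */
    PVCALLS_RING_IDX ring_size;     /* size of the array, a power of two */
};

/* Hypothetical stand-in for an event channel notification. */
static unsigned int demo_notifications;
static void demo_notify(void) { demo_notifications++; }

/*
 * Copy up to len bytes from buf into the array. Returns the number of bytes
 * written; *error receives the value found in the shared error field.
 */
static uint32_t demo_produce(struct demo_stream *s, const uint8_t *buf,
                             uint32_t len, int32_t *error)
{
    PVCALLS_RING_IDX cons = atomic_load_explicit(&s->cons, memory_order_relaxed);
    PVCALLS_RING_IDX prod = atomic_load_explicit(&s->prod, memory_order_relaxed);
    PVCALLS_RING_IDX used, offset, first;

    *error = atomic_load_explicit(&s->error, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* general memory barrier */
    if (*error)
        return 0;

    used = prod - cons;                 /* unsigned arithmetic handles wrap */
    if (len == 0 || used >= s->ring_size)
        return 0;                       /* nothing to do, full, or bad indexes */
    if (len > s->ring_size - used)
        len = s->ring_size - used;      /* never write past cons */

    offset = prod & (s->ring_size - 1);
    first = s->ring_size - offset;      /* bytes available before the wrap */
    if (first > len)
        first = len;
    memcpy(s->array + offset, buf, first);
    memcpy(s->array, buf + first, len - first);     /* wrapped part, may be 0 */

    atomic_thread_fence(memory_order_release);      /* write memory barrier */
    atomic_store_explicit(&s->prod, prod + len, memory_order_relaxed);
    demo_notify();
    return len;
}
```
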
1062The consumer (the backend for **out**, the frontend for **in**) reads from the
1063array in the following way:
1064
1065- read *[in|out]_prod*, *[in|out]_cons*, *[in|out]_error* from shared memory
1066- read memory barrier
1067- return on *[in|out]_error*
1068- read from array at position *[in|out]_cons* up to *[in|out]_prod*,
1069  wrapping around the circular buffer when necessary
1070- general memory barrier
1071- increase *[in|out]_cons*
1072- notify the other end via evtchn
1073
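The consumer steps above can be sketched in the same spirit. Again this is
only an illustration: `struct demo_stream`, `demo_consume` and `demo_notify`
are hypothetical names, C11 atomics and `atomic_thread_fence` stand in for
the shared-memory accesses and memory barriers of a real implementation, and
`demo_notify` only counts where an event channel notification would be sent.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t PVCALLS_RING_IDX;

/* One direction of the data ring: the in_* or out_* fields plus their array. */
struct demo_stream {
    _Atomic PVCALLS_RING_IDX cons, prod;
    _Atomic int32_t error;
    uint8_t *array;                 /* the in or the out array */
    PVCALLS_RING_IDX ring_size;     /* size of the array, a power of two */
};

/* Hypothetical stand-in for an event channel notification. */
static unsigned int demo_notifications;
static void demo_notify(void) { demo_notifications++; }

/*
 * Copy up to len unconsumed bytes from the array into buf. Returns the number
 * of bytes read; *error receives the value found in the shared error field.
 */
static uint32_t demo_consume(struct demo_stream *s, uint8_t *buf,
                             uint32_t len, int32_t *error)
{
    PVCALLS_RING_IDX prod = atomic_load_explicit(&s->prod, memory_order_relaxed);
    PVCALLS_RING_IDX cons = atomic_load_explicit(&s->cons, memory_order_relaxed);
    PVCALLS_RING_IDX avail, offset, first;

    *error = atomic_load_explicit(&s->error, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);      /* read memory barrier */
    if (*error)
        return 0;

    avail = prod - cons;                /* unsigned arithmetic handles wrap */
    if (len == 0 || avail == 0 || avail > s->ring_size)
        return 0;                       /* nothing to read, or bad indexes */
    if (len > avail)
        len = avail;                    /* never read past prod */

    offset = cons & (s->ring_size - 1);
    first = s->ring_size - offset;      /* bytes available before the wrap */
    if (first > len)
        first = len;
    memcpy(buf, s->array + offset, first);
    memcpy(buf + first, s->array, len - first);     /* wrapped part, may be 0 */

    atomic_thread_fence(memory_order_seq_cst);      /* general memory barrier */
    atomic_store_explicit(&s->cons, cons + len, memory_order_relaxed);
    demo_notify();
    return len;
}
```
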
1074The producer takes care of writing only as many bytes as available in
1075the buffer up to *[in|out]_cons*. The consumer takes care of reading
1076only as many bytes as available in the buffer up to *[in|out]_prod*.
*[in|out]_error* is set by the backend when an error occurs while writing
to or reading from the socket.
1079
1080
1081[xenstore]: http://xenbits.xen.org/docs/unstable/misc/xenstore.txt
1082[XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
1083[address]: http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html
1084[in]: http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
1085[socket]: http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html
1086[connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
1087[shutdown]: http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html
1088[bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html
1089[listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html
1090[accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html
1091[poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html
1092[ring.h]: http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD
1093