# Xen transport for 9pfs version 1

## Background

9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
simple and describes a series of commands and responses. It is
completely independent of the communication channel; in fact, many
clients and servers support multiple channels, usually called
"transports". For example, the Linux client supports TCP and Unix
sockets, fds, virtio and rdma.


### 9pfs protocol

This document won't cover the full 9pfs specification. Please refer to
this [paper] and this [website] for a detailed description of it.
However, it is useful to know that each 9pfs request and response has
the following header:

    struct header {
    	uint32_t size;
    	uint8_t id;
    	uint16_t tag;
    } __attribute__((packed));

    0         4  5    7
    +---------+--+----+
    |  size   |id|tag |
    +---------+--+----+

- *size*
The size of the request or response in bytes, including the header
itself.

- *id*
The 9pfs request or response operation.

- *tag*
Unique id that identifies a specific request/response pair. It is used
to multiplex operations on a single channel.

It is possible to have multiple requests in-flight at any given time.
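
For illustration only (this is not part of the transport definition), a
client could lay out a message as in the sketch below: the body immediately
follows the header and *size* covers the whole message, header included.
The `build_message()` helper and its values are made up for this example:

    /* Illustrative sketch: lay out a 9pfs message in a local buffer before
     * copying it to the ring.  The id and tag values are chosen by the
     * client; nothing here is mandated by the transport.  Note that 9pfs
     * integers are little-endian on the wire, so this simple memcpy of a
     * packed struct assumes a little-endian host. */
    #include <stdint.h>
    #include <string.h>

    struct header {		/* the 9pfs header shown above */
    	uint32_t size;
    	uint8_t id;
    	uint16_t tag;
    } __attribute__((packed));

    static size_t build_message(uint8_t *msg, uint8_t id, uint16_t tag,
    			    const uint8_t *body, uint32_t body_len)
    {
    	struct header h = {
    		/* size covers the whole message, header included */
    		.size = sizeof(struct header) + body_len,
    		.id = id,
    		.tag = tag,
    	};

    	memcpy(msg, &h, sizeof(h));
    	memcpy(msg + sizeof(h), body, body_len);
    	return h.size;
    }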


## Rationale

This document describes a Xen-based transport for 9pfs, in the
traditional PV frontend and backend format. The PV frontend is used by
the client to send commands to the server. The PV backend is used by the
9pfs server to receive commands from clients and to send back responses.

The transport protocol supports multiple rings, up to the maximum
supported by the backend. The size of every ring is also configurable
and can span multiple pages, up to the maximum supported by the backend
(although it cannot be more than 2MB). The design aims to exploit
parallelism at the vCPU level and to support multiple outstanding
requests simultaneously.

This document does not cover the 9pfs client/server design or
implementation, only the transport for it.


## Configuration

The frontend and backend are configured via Xenstore. See [header] for
the detailed Xenstore entries and the connection protocol.


## Ring Setup

The shared page has the following layout:

    typedef uint32_t XEN_9PFS_RING_IDX;

    struct xen_9pfs_intf {
    	XEN_9PFS_RING_IDX in_cons, in_prod;
    	uint8_t pad1[56];
    	XEN_9PFS_RING_IDX out_cons, out_prod;
    	uint8_t pad2[56];

    	uint32_t ring_order;
    	/* this is an array of (1 << ring_order) elements */
    	grant_ref_t ref[1];
    };

    /* not actually C compliant (ring_order changes from ring to ring) */
    struct ring_data {
    	char in[((1 << ring_order) << PAGE_SHIFT) / 2];
    	char out[((1 << ring_order) << PAGE_SHIFT) / 2];
    };

- **ring_order**
  It represents the order of the data ring. The following list of grant
  references has `(1 << ring_order)` elements. It cannot be greater than
  **max-ring-page-order**, as specified by the backend on XenBus.
- **ref[]**
  The list of grant references which will contain the actual data. They are
  mapped contiguously in virtual memory. The first half of the pages is the
  **in** array, the second half is the **out** array. The array must
  have a power of two number of elements.
- **out** is an array used as a circular buffer
  It contains client requests. The producer is the frontend, the
  consumer is the backend.
- **in** is an array used as a circular buffer
  It contains server responses. The producer is the backend, the
  consumer is the frontend.
- **out_cons**, **out_prod**
  Consumer and producer indices for client requests. They keep track of
  how much data has been written by the frontend to **out** and how much
  data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons**
  is increased by the backend, after reading data from **out**.
- **in_cons**, **in_prod**
  Consumer and producer indices for server responses. They keep track of
  how much data has been written by the backend to **in** and how much
  data has already been consumed by the frontend. **in_prod** is
  increased by the backend, after writing data to **in**. **in_cons** is
  increased by the frontend, after reading data from **in**.

The binary layout of `struct xen_9pfs_intf` follows:

    0         4         8            64        68        72           128
    +---------+---------+-----//-----+---------+---------+-----//------+
    | in_cons | in_prod |  padding   |out_cons |out_prod |   padding   |
    +---------+---------+-----//-----+---------+---------+-----//------+

    128        132       136       140      4092      4096
    +----------+---------+---------+----//---+---------+
    |ring_order|  ref[0] |  ref[1] |         |  ref[N] |
    +----------+---------+---------+----//---+---------+

**N.B.** For a single page, N is at most 991 ((4096-132)/4), but given
that N needs to be a power of two, the maximum N is actually 512. As
512 == (1 << 9), the maximum possible **max-ring-page-order** value is 9.

The binary layout of the ring buffers follows:

    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |           out             |
    +------------//-------------+------------//-------------+
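
For illustration, assuming the pages listed in **ref[]** have already been
granted and mapped into one contiguous virtual area, the two halves can be
located as in the sketch below; `struct xen_9pfs_ring_view` and the helper
name are invented for this example:

    /* Sketch only: locate the in/out arrays inside the mapped data pages.
     * "data" is assumed to be the contiguous mapping of the
     * (1 << ring_order) granted pages. */
    #include <stdint.h>

    #define PAGE_SHIFT 12

    struct xen_9pfs_ring_view {
    	unsigned char *in;	/* responses: backend -> frontend */
    	unsigned char *out;	/* requests: frontend -> backend */
    	uint32_t ring_size;	/* size of each half, in bytes */
    };

    static void xen_9pfs_locate_arrays(struct xen_9pfs_ring_view *view,
    				   unsigned char *data, uint32_t ring_order)
    {
    	/* Each array is half of the total mapped area. */
    	view->ring_size = ((1 << ring_order) << PAGE_SHIFT) / 2;
    	view->in = data;			/* first half */
    	view->out = data + view->ring_size;	/* second half */
    }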

## Why ring.h is not needed

Many Xen PV protocols use the macros provided by [ring.h] to manage
their shared ring for communication. This protocol does not, because it
actually comes with two rings: the **in** ring and the **out** ring.
Each of them is mono-directional, and there is no static request size:
the producer writes opaque data to the ring. In [ring.h], on the other
hand, they are combined, and the request size is static and well-known.
In this protocol:

    in  -> backend to frontend only
    out -> frontend to backend only

In the case of the **in** ring, the frontend is the consumer, and the
backend is the producer. Everything is the same but mirrored for the
**out** ring.

The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of
this, leaving room for optimizations.

On the other hand, the consumer always requires notifications, unless it
is already actively reading from the ring. The producer can figure this
out, without any additional fields in the protocol, by comparing the
indexes at the beginning and the end of its write operation. This is
similar to what [ring.h] does.
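
A minimal sketch of that check follows. It is not part of the protocol and
not a complete, race-free implementation: the producer samples the consumer
index before and after its write and only skips the notification when the
consumer has visibly made progress in the meantime, i.e. it is already
actively reading:

    /* Sketch of the notification check described above: return non-zero if
     * an event channel notification should be sent after producing data.
     * cons_before/cons_after are the consumer index sampled before and
     * after the write. */
    static int xen_9pfs_needs_notify(XEN_9PFS_RING_IDX cons_before,
    				 XEN_9PFS_RING_IDX cons_after)
    {
    	/* No progress observed: the consumer may be idle, notify it. */
    	return cons_before == cons_after;
    }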

## Ring Usage

The **in** and **out** arrays are used as circular buffers:

    0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |  free    | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following functions are provided to read from and write to an array:

    /* XEN_9PFS_RING_SIZE is the size in bytes of one array (in or out);
     * it is a power of two. */
    #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))

    /* Copy len bytes from the ring at *masked_cons into h, wrapping around
     * the end of the array when necessary, then advance the masked index. */
    static inline void xen_9pfs_read(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *h, size_t len) {
    	if (*masked_cons < *masked_prod) {
    		memcpy(h, buf + *masked_cons, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
    			memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
    			memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf, len - (XEN_9PFS_RING_SIZE - *masked_cons));
    		} else {
    			memcpy(h, buf + *masked_cons, len);
    		}
    	}
    	*masked_cons = MASK_XEN_9PFS_IDX(*masked_cons + len);
    }

    /* Copy len bytes from opaque into the ring at *masked_prod, wrapping
     * around the end of the array when necessary, then advance the masked
     * index. */
    static inline void xen_9pfs_write(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *opaque, size_t len) {
    	if (*masked_prod < *masked_cons) {
    		memcpy(buf + *masked_prod, opaque, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
    			memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
    			memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod), len - (XEN_9PFS_RING_SIZE - *masked_prod));
    		} else {
    			memcpy(buf + *masked_prod, opaque, len);
    		}
    	}
    	*masked_prod = MASK_XEN_9PFS_IDX(*masked_prod + len);
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way (a sketch in C follows the list):

- read *cons*, *prod* from shared memory
- general memory barrier
- verify *prod* against local copy (consumer shouldn't change it)
- write to array at position *prod* up to *cons*, wrapping around the circular
  buffer when necessary
- write memory barrier
- increase *prod*
- notify the other end via event channel
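
A minimal producer sketch following the steps above. It assumes the
xen_mb()/xen_wmb() barrier macros that [ring.h] users already provide, a
notify_remote() placeholder for the event channel notification, and the
MASK_XEN_9PFS_IDX() and xen_9pfs_write() helpers defined earlier;
`prod_pvt` is the producer's private copy of its own index:

    /* Producer sketch: write len bytes of data into one array.  prod/cons
     * point to the shared indices for this direction (e.g. out_prod and
     * out_cons in the frontend).  Returns 0 on success, -1 on error or
     * lack of space. */
    extern void notify_remote(void);	/* placeholder: event channel kick */

    static int xen_9pfs_producer_write(char *ring,
    		volatile XEN_9PFS_RING_IDX *prod,
    		volatile XEN_9PFS_RING_IDX *cons,
    		XEN_9PFS_RING_IDX *prod_pvt,
    		uint8_t *data, size_t len)
    {
    	XEN_9PFS_RING_IDX local_prod, local_cons, masked_prod, masked_cons;

    	/* Read cons and prod from shared memory. */
    	local_prod = *prod;
    	local_cons = *cons;
    	xen_mb();	/* general barrier after reading the indices */

    	/* The consumer should never change our producer index. */
    	if (local_prod != *prod_pvt)
    		return -1;

    	/* Only write as many bytes as are free up to cons. */
    	if (XEN_9PFS_RING_SIZE - (local_prod - local_cons) < len)
    		return -1;

    	masked_prod = MASK_XEN_9PFS_IDX(local_prod);
    	masked_cons = MASK_XEN_9PFS_IDX(local_cons);
    	xen_9pfs_write(ring, &masked_prod, &masked_cons, data, len);

    	xen_wmb();	/* the data must be visible before the new index */
    	*prod_pvt = local_prod + len;
    	*prod = *prod_pvt;

    	notify_remote();	/* event channel notification */
    	return 0;
    }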

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way (a sketch in C follows the list):

- read *prod*, *cons* from shared memory
- read memory barrier
- verify *cons* against local copy (producer shouldn't change it)
- read from array at position *cons* up to *prod*, wrapping around the circular
  buffer when necessary
- general memory barrier
- increase *cons*
- notify the other end via event channel
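
The mirrored consumer sketch, under the same assumptions (xen_rmb()/xen_mb()
barriers, the notify_remote() placeholder, and the helpers defined earlier);
`cons_pvt` is the consumer's private copy of its own index:

    /* Consumer sketch: read len bytes from one array into data.  Returns 0
     * on success, -1 on error or if not enough data is available yet. */
    static int xen_9pfs_consumer_read(char *ring,
    		volatile XEN_9PFS_RING_IDX *prod,
    		volatile XEN_9PFS_RING_IDX *cons,
    		XEN_9PFS_RING_IDX *cons_pvt,
    		uint8_t *data, size_t len)
    {
    	XEN_9PFS_RING_IDX local_prod, local_cons, masked_prod, masked_cons;

    	/* Read prod and cons from shared memory. */
    	local_prod = *prod;
    	local_cons = *cons;
    	xen_rmb();	/* read the indices before reading the data */

    	/* The producer should never change our consumer index. */
    	if (local_cons != *cons_pvt)
    		return -1;

    	/* Only read as many bytes as the producer has made available. */
    	if (local_prod - local_cons < len)
    		return -1;

    	masked_prod = MASK_XEN_9PFS_IDX(local_prod);
    	masked_cons = MASK_XEN_9PFS_IDX(local_cons);
    	xen_9pfs_read(ring, &masked_prod, &masked_cons, data, len);

    	xen_mb();	/* finish reading the data before releasing the space */
    	*cons_pvt = local_cons + len;
    	*cons = *cons_pvt;

    	notify_remote();	/* event channel notification */
    	return 0;
    }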

The producer takes care of writing only as many bytes as available in the
buffer up to *cons*. The consumer takes care of reading only as many bytes
as available in the buffer up to *prod*.


## Request/Response Workflow

The client chooses one of the available rings, then sends a request
to the other end on the *out* array, following the producer workflow
described in [Ring Usage].

The server receives the notification and reads the request, following
the consumer workflow described in [Ring Usage]. The server knows how
much to read because it is specified in the *size* field of the 9pfs
header. The server processes the request and sends back a response on
the *in* array of the same ring, following the producer workflow as
usual. Thus, every request/response pair is on one ring.

The client receives a notification and reads the response from the *in*
array. The client knows how much data to read because it is specified in
the *size* field of the 9pfs header.
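
For illustration, the server side of this workflow could be structured as
below, reusing the hypothetical xen_9pfs_consumer_read() helper sketched in
[Ring Usage]: the fixed-size header is read first, and its *size* field then
tells how many more bytes belong to the same request. A real implementation
would wait for the remaining bytes rather than fail when they have not
arrived yet:

    /* Sketch: read one complete 9pfs request from the out array.  req must
     * be large enough for the largest request the server accepts. */
    static int read_request(char *out_ring,
    		volatile XEN_9PFS_RING_IDX *prod,
    		volatile XEN_9PFS_RING_IDX *cons,
    		XEN_9PFS_RING_IDX *cons_pvt,
    		uint8_t *req, size_t req_size)
    {
    	struct header h;

    	/* The fixed-size header is always at the front of the message. */
    	if (xen_9pfs_consumer_read(out_ring, prod, cons, cons_pvt,
    				   (uint8_t *)&h, sizeof(h)))
    		return -1;
    	if (h.size < sizeof(h) || h.size > req_size)
    		return -1;	/* malformed or over-sized request */

    	/* size covers the whole message, so the body is size minus the
    	 * header; the header bytes are kept for the 9pfs server. */
    	memcpy(req, &h, sizeof(h));
    	return xen_9pfs_consumer_read(out_ring, prod, cons, cons_pvt,
    				      req + sizeof(h), h.size - sizeof(h));
    }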


[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf
[website]: https://github.com/chaos/diod/blob/master/protocol.md
[header]: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/9pfs.h;hb=HEAD
[ring.h]: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD