% Staging grants for network I/O requests
% Revision 4

\clearpage

--------------------------------------------------------------------
Architecture(s): Any
--------------------------------------------------------------------

# Background and Motivation

At the Xen hackathon '16 networking session, we spoke about having a
permanently mapped region to describe the header/linear region of packet
buffers. This document outlines that proposal, covering its motivation, its
applicability to other use cases, and the necessary changes.

The motivation of this work is to eliminate grant operations for packet I/O
intensive workloads, such as those observed with smaller request sizes (i.e.
<= 256 bytes or <= MTU). Currently on Xen, only bulk transfers (e.g. 32K..64K
packets) perform really well (up to 80 Gbit/s on a few CPUs), usually backing
end-hosts and server appliances. Anything that involves higher packet rates
(<= 1500 MTU) or no scatter-gather performs badly, with throughput close to
that of a 1 Gbit/s link.

# Proposal

The proposal is to leverage the copy already done implicitly from and to the
packet linear data in netfront and netback, and perform it instead from a
permanently mapped region. On some (physical) NICs this is known as
header/data split.

Specifically, for some workloads (e.g. NFV) this would provide a big increase
in throughput when switching to (zero-)copying in the backend/frontend instead
of the grant hypercalls. Thus this extension aims at future-proofing the netif
protocol by adding the possibility for guests to set up a list of grants at
device creation and revoke them at device freeing - without taking up too many
grant entries in the general case (i.e. covering only the header region, <=
256 bytes, 16 grants per ring), while remaining configurable by the kernel
when one wants to resort to a copy-based approach as opposed to grant
copy/map.

\clearpage

# General Operation

Here we describe how netback and netfront generally operate, and where the
proposed solution will fit. The security mechanism currently involves grant
references, which in essence are round-robin recycled 'tickets' stamped with
the GPFN, permission attributes, and the authorized domain:

(This is an in-memory view of `struct grant_entry_v1`):

     0     1     2     3     4     5     6     7 octet
    +------------+-----------+------------------------+
    | flags      | domain id | frame                  |
    +------------+-----------+------------------------+
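
For reference, the C declaration matching the layout above (as found in Xen's
public `grant_table.h`, shown here only to make the field sizes explicit) is:

    /* Version 1 grant table entry: 8 bytes per entry. */
    typedef struct grant_entry_v1 {
        uint16_t flags;   /* GTF_* - permission attributes and state     */
        domid_t  domid;   /* uint16_t - domain granted access            */
        uint32_t frame;   /* frame the granted domain may map/access     */
    } grant_entry_v1_t;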

Where there are N grant entries in a grant table, for example:

    @0:
    +------------+-----------+------------------------+
    | rw         | 0         | 0xABCDEF               |
    +------------+-----------+------------------------+
    | rw         | 0         | 0xFA124                |
    +------------+-----------+------------------------+
    | ro         | 1         | 0xBEEF                 |
    +------------+-----------+------------------------+

      .....
    @N:
    +------------+-----------+------------------------+
    | rw         | 0         | 0x9923A                |
    +------------+-----------+------------------------+

Each entry consumes 8 bytes, therefore 512 entries can fit in one page.
The `gnttab_max_frames` parameter defaults to 32 pages, hence 16,384 grants.
The ParaVirtualized (PV) drivers use the grant reference (the index in the
grant table, 0 .. N) in their command ring.

\clearpage

## Guest Transmit

The view of the shared transmit ring is the following:

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | req_prod               | req_event              |
    +------------------------+------------------------+
    | rsp_prod               | rsp_event              |
    +------------------------+------------------------+
    | pvt                    | pad[44]                |
    +------------------------+                        |
    | ....                                            | [64bytes]
    +------------------------+------------------------+-\
    | gref                   | offset    | flags      | |
    +------------+-----------+------------------------+ +-'struct
    | id         | size      | id        | status     | | netif_tx_sring_entry'
    +-------------------------------------------------+-/
    |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
    +-------------------------------------------------+

Each entry consumes 16 octets, therefore 256 entries can fit in one page.
`struct netif_tx_sring_entry` includes both `struct netif_tx_request` (first
12 octets) and `struct netif_tx_response` (last 4 octets). Additionally, a
`struct netif_extra_info` may overlay the request, in which case the format
is:

    +------------------------+------------------------+-\
    | type |flags| type specific data (gso, hash, etc)| |
    +------------+-----------+------------------------+ +-'struct
    | padding for tx         | unused                 | | netif_extra_info'
    +-------------------------------------------------+-/
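
For reference, the request and response layouts above correspond to the
following C declarations from the netif protocol header (`netif.h`), shown
here only to make the field widths explicit:

    struct netif_tx_request {
        grant_ref_t gref;     /* Reference to buffer page                 */
        uint16_t offset;      /* Offset within buffer page                */
        uint16_t flags;       /* NETTXF_*                                 */
        uint16_t id;          /* Echoed in response message               */
        uint16_t size;        /* For the first request of a packet, the
                                 total packet size; otherwise this slot's
                                 data size                                */
    };

    struct netif_tx_response {
        uint16_t id;          /* Copied from the request's id             */
        int16_t  status;      /* NETIF_RSP_*                              */
    };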

In essence, the transmission of a packet from the frontend to the backend
network stack goes as follows; a brief code sketch of the frontend steps is
included right after them:

**Frontend**

1) Calculate how many slots are needed for transmitting the packet.
   Fail if there aren't enough slots.

[ The calculation needs to estimate the slots taking the 4k page boundary into
account ]

2) Make the first request for the packet.
   The first request contains the whole packet size, checksum info, a flag
   indicating whether it contains extra metadata, and whether the following
   slots contain more data.

3) Put the grant in the `gref` field of the tx slot.

4) Set extra info if the packet requires special metadata (e.g. GSO size).

5) If there's still data to be granted, set the flag `NETTXF_more_data` in the
request `flags`.

6) Grant the remaining packet pages, one per slot (the grant boundary is 4k).

7) Fill the resulting grefs in the slots, setting `NETTXF_more_data` for all
but the last slot.

8) Fill the total packet size in the first request.

9) Set the checksum info of the packet (if checksum offload is supported).

10) Update the request producer index (`req_prod`)

11) Check whether the backend needs a notification

11.1) If so, perform the `EVTCHNOP_send` hypercall, which might mean a
      __VMEXIT__ depending on the guest type.
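
Below is an illustrative-only sketch of steps 2) to 11) above for a packet
that fits in a single slot, using the standard ring macros from Xen's
`io/ring.h` and the `netif_tx_request` layout. `struct netfront_queue`,
`grant_tx_page()`, `alloc_tx_id()` and `notify_tx_evtchn()` are assumed helper
names, not part of the protocol:

    /* Sketch: queue one single-slot packet on the shared Tx ring. */
    static void xmit_one_slot(struct netfront_queue *q, void *data, uint16_t len)
    {
        RING_IDX i = q->tx.req_prod_pvt;
        struct netif_tx_request *req = RING_GET_REQUEST(&q->tx, i);
        int notify;

        req->gref   = grant_tx_page(q, data);   /* grant the page to the backend */
        req->offset = (uintptr_t)data & (PAGE_SIZE - 1);
        req->size   = len;                      /* first slot: total packet size */
        req->flags  = 0;                        /* no NETTXF_more_data, no csum  */
        req->id     = alloc_tx_id(q);

        q->tx.req_prod_pvt = i + 1;

        /* Steps 10) and 11): publish the request and notify if needed. */
        RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&q->tx, notify);
        if (notify)
            notify_tx_evtchn(q);                /* EVTCHNOP_send */
    }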

**Backend**

12) Backend gets an interrupt and runs its interrupt service routine.

13) Backend checks if there are unconsumed requests.

14) Backend consumes a request from the ring.

15) Processes extra info (e.g. if GSO info was set).

16) Counts all requests for this packet to be processed (while
`NETTXF_more_data` is set) and performs a few validation tests:

16.1) Fail the transmission if the total packet size is smaller than the
Ethernet minimum allowed;

  Failing the transmission means filling the `id` of the request and a
  `status` of `NETIF_RSP_ERROR` in `struct netif_tx_response`, updating
  `rsp_prod` and finally notifying the frontend (through `EVTCHNOP_send`).

16.2) Fail the transmission if one of the slots (size + offset) crosses the
page boundary.

16.3) Fail the transmission if the number of slots is bigger than the
spec-defined maximum (18 slots max in netif.h).

17) Allocate the packet metadata.

[ *Linux-specific*: This structure encompasses a linear data region which
generally accommodates the protocol header and such. Netback allocates up to
128 bytes for that. ]

18) *Linux-specific*: Set up a `GNTTABOP_copy` to copy up to 128 bytes to this
small region (the linear part of the skb), *only* from the first slot.

19) Set up GNTTABOP operations to copy/map the packet.

20) Perform the `GNTTABOP_copy` (grant copy) and/or `GNTTABOP_map_grant_ref`
    hypercalls.

[ *Linux-specific*: does a copy for the linear region (<= 128 bytes) and maps
  the remaining slots as frags for the rest of the data ]

21) Check if the grant operations were successful and fail the transmission if
any of the resulting operations' `status` is different from `GNTST_okay`.

21.1) If it's a grant-copying backend, produce responses for all the copied
grants as in 16.1). The only difference is that the status is
`NETIF_RSP_OKAY`.

21.2) Update the response producer index (`rsp_prod`)

22) Set up GSO info requested by the frontend [optional]

23) Set the frontend-provided checksum info.

24) *Linux-specific*: Register a destructor callback that is called when the
packet pages are freed.

25) Call into the network stack.

26) Update `req_event` to `request consumer index + 1` to receive a
    notification on the first produced request from the frontend.
    [optional, if the backend is polling the ring and never sleeps]

27) *Linux-specific*: The packet destructor callback is called.

27.1) Set up `GNTTABOP_unmap_grant_ref` ops for the designated packet pages.

27.2) Once done, perform the `GNTTABOP_unmap_grant_ref` hypercall. Underneath,
this hypercall performs a TLB flush of all backend vCPUs.

27.3) Produce a Tx response as in steps 21.1) and 21.2).

[ *Linux-specific*: There is a thread that is woken for this purpose and
batches these unmap operations. The callback just queues another unmap. ]

27.4) Check whether the frontend requested a notification.

27.4.1) If so, perform the `EVTCHNOP_send` hypercall, which might mean a
      __VMEXIT__ depending on the guest type.

**Frontend**

28) The transmit interrupt is raised, which signals the packet transmission
completion.

29) The transmit completion routine checks for unconsumed responses.

30) Processes the responses and revokes the grants provided.

31) Updates `rsp_cons` (the response consumer index).

This proposal aims at removing steps 19), 20) and 21) by using grefs
previously mapped at the guest's request. The guest decides how to distribute
or use these premapped grefs, for either the linear region or the full packet.
This also allows us to replace step 27) (the unmap), preventing the TLB flush.

Note that a grant copy does the following (in pseudocode):

    rcu_lock(src_domain);
    rcu_lock(dst_domain);

    for (i = 0; i < nr_ops; i++) {
        op = &gntcopy[i];

        /* Acquiring the grant implies taking a potentially contended
         * per-CPU lock on the remote grant table. */
        src_frame = __acquire_grant_for_copy(src_domain, <op.src.gref>);
        src_vaddr = map_domain_page(src_frame);

        dst_frame = __get_paged_frame(dst_domain, <op.dst.mfn>);
        dst_vaddr = map_domain_page(dst_frame);

        memcpy(dst_vaddr + <op.dst.offset>,
               src_vaddr + <op.src.offset>,
               <op.size>);

        unmap_domain_page(src_vaddr);
        unmap_domain_page(dst_vaddr);
    }

    rcu_unlock(src_domain);
    rcu_unlock(dst_domain);

The Linux netback implementation copies the first 128 bytes into its network
buffer's linear region. Hence, for this first region, the grant copy is what
gets replaced by a plain memcpy in the backend.
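
To make the cost of step 18) concrete, the sketch below shows how a single
`GNTTABOP_copy` operation for the linear region could be filled. The
`gnttab_copy` fields, `GNTCOPY_source_gref` and `DOMID_SELF` come from Xen's
public `grant_table.h`; `queue`, `txreq`, `linear_buf` and the
`virt_to_gfn()` / `offset_in_page()` helpers are assumptions for illustration:

    /* Copy up to 128 bytes from the frontend's first slot into a
     * backend-local buffer (the skb linear region). */
    struct gnttab_copy *copy = &queue->tx_copy_ops[0];

    copy->source.u.ref  = txreq.gref;              /* frontend grant          */
    copy->source.domid  = queue->frontend_domid;
    copy->source.offset = txreq.offset;
    copy->dest.u.gmfn   = virt_to_gfn(linear_buf); /* backend-local page      */
    copy->dest.domid    = DOMID_SELF;
    copy->dest.offset   = offset_in_page(linear_buf);
    copy->len           = txreq.size < 128 ? txreq.size : 128;
    copy->flags         = GNTCOPY_source_gref;     /* source is a grant ref   */

    /* Later batched and issued via the GNTTABOP_copy hypercall (step 20). */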

\clearpage

## Guest Receive

The view of the shared receive ring is the following:

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | req_prod               | req_event              |
    +------------------------+------------------------+
    | rsp_prod               | rsp_event              |
    +------------------------+------------------------+
    | pvt                    | pad[44]                |
    +------------------------+                        |
    | ....                                            | [64bytes]
    +------------------------+------------------------+
    | id         | pad       | gref                   | ->'struct netif_rx_request'
    +------------+-----------+------------------------+
    | id         | offset    | flags     | status     | ->'struct netif_rx_response'
    +-------------------------------------------------+
    |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
    +-------------------------------------------------+

Each entry in the ring occupies 16 octets, which means a page fits 256
entries. Additionally, a `struct netif_extra_info` may overlay the rx request,
in which case the format is:

    +------------------------+------------------------+
    | type |flags| type specific data (gso, hash, etc)| ->'struct netif_extra_info'
    +------------+-----------+------------------------+

Notice the lack of padding; that is because it is not needed on Rx, as the Rx
request boundary is 8 octets.
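
For reference, the rx request and response layouts above correspond to the
following C declarations from the netif protocol header (`netif.h`):

    struct netif_rx_request {
        uint16_t    id;       /* Echoed in response message               */
        uint16_t    pad;
        grant_ref_t gref;     /* Reference to incoming granted frame      */
    };

    struct netif_rx_response {
        uint16_t id;          /* Copied from the request's id             */
        uint16_t offset;      /* Offset in page of start of received data */
        uint16_t flags;       /* NETRXF_*                                 */
        int16_t  status;      /* -ve: NETIF_RSP_*; +ve: Rx'ed packet size */
    };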

In essence, receiving a packet in a Linux frontend, from the backend to the
frontend network stack, goes as follows:

**Backend**

1) The backend transmit function starts.

[ *Linux-specific*: This means we take a packet and add it to an internal
 queue (protected by a lock), while a separate thread takes it from that queue
 and does the actual processing as in the steps below. This thread has the
 purpose of aggregating as many copies as possible. ]

2) Checks if there are enough rx ring slots that can accommodate the packet.

3) Gets a request from the ring for the first data slot and fetches the `gref`
   from it.

4) Creates a grant copy op from the packet page to `gref`.

[ It's up to the backend to choose how it fills this data. E.g. the backend
  may choose to merge as much data as possible from different pages into this
  single gref, similar to mergeable rx buffers in vhost. ]

5) Sets up flags/checksum info on the first request.

6) Gets a response from the ring for this data slot.

7) Prefills the expected response with the request `id` and the slot size.

8) Updates the request consumer index (`req_cons`)

9) Gets a request from the ring for the first extra info [optional]

10) Sets up the extra info (e.g. GSO descriptor) [optional], then repeats
step 8).

11) Repeats steps 3) through 8) for all packet pages and sets
`NETRXF_more_data` in all but the last slot.

12) Performs the `GNTTABOP_copy` hypercall.

13) Checks if any grant operation status was incorrect and, if so, sets the
    `status` field of `struct netif_rx_response` to `NETIF_RSP_ERROR`.

14) Updates the response producer index (`rsp_prod`)

**Frontend**

15) Frontend gets an interrupt and runs its interrupt service routine.

16) Checks if there are unconsumed responses.

17) Consumes a response from the ring (the first response for a packet).

18) Revokes the `gref` in the response.

19) Consumes the extra info response [optional]

20) While `NETRXF_more_data` is set in the first N-1 responses, fetches each
    of the remaining responses and revokes the designated `gref`.

21) Updates the response consumer index (`rsp_cons`)

22) *Linux-specific*: Copies (from the first slot's gref) up to 256 bytes into
    the linear region of the packet metadata structure (skb). The rest of the
    pages processed in the responses are then added as frags.

23) Sets the checksum info based on the first response's flags.

24) Passes the packet to the network stack.

25) Allocates new pages and any necessary packet metadata structures for new
    requests. These requests will then be used in step 1) and so forth.

26) Updates the request producer index (`req_prod`)

27) Checks whether the backend needs a notification:

27.1) If so, performs the `EVTCHNOP_send` hypercall, which might mean a
      __VMEXIT__ depending on the guest type.

28) Updates `rsp_event` to `response consumer index + 1` so that the frontend
    receives a notification on the first newly produced response.
    [optional, if the frontend is polling the ring and never sleeps]

This proposal aims at replacing steps 4), 12) and 22) with a memcpy if the
grefs on the Rx ring were requested to be mapped by the guest. The frontend
may use strategies to allow fast recycling of grants for replenishing the
ring, hence letting Domain-0 replace the grant copies with a memcpy instead,
which is faster.

Depending on the implementation, this would mean that we no longer need to
aggregate as many grant ops as possible (step 1) and could transmit the packet
in the transmit function itself (e.g. Linux `ndo_start_xmit`) as previously
proposed
here\[[0](http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html)\].
This would heavily improve efficiency, specifically for smaller packets, which
in turn would decrease RTT, with data being acknowledged much quicker.

\clearpage

# Proposed Extension

The idea is to give the guest more control over how (or whether) its grants
are mapped. Currently there's no control over this for frontends or backends,
and the latter cannot make assumptions about the mapping of transmit or
receive grants; hence we need the frontend to take the initiative in managing
its own mapping of grants. Guests may then opportunistically recycle these
grants (e.g. Linux) and avoid resorting to the copies which come with using a
fixed amount of buffers. Other frameworks (e.g. XDP, netmap, DPDK) use a fixed
set of buffers, which also makes the case for this extension.

## Terminology

`Staging grants` is a term used in this document to refer to the whole concept
of having a set of grants permanently mapped with the backend, containing data
staged there until completion. The term should therefore not be confused with
a new kind of grant in the hypervisor.

## Control Ring Messages

### `XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE`

This message is sent by the frontend to fetch the number of grefs that can be
kept mapped in the backend. It only takes the queue index as argument, and
returns data representing the number of free entries in the mapping table.
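
A hedged sketch of how a frontend might issue this query over the control
ring is shown below. `struct xen_netif_ctrl_request`/`response` and
`XEN_NETIF_CTRL_STATUS_SUCCESS` come from `netif.h`; the synchronous
`ctrl_ring_request()` helper and the use of `data[0]` for the queue index are
assumptions for illustration:

    struct xen_netif_ctrl_request req = {
        .id   = ctrl_id++,                              /* echoed in response  */
        .type = XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE,
        .data = { queue_index, 0, 0 },
    };
    struct xen_netif_ctrl_response rsp;

    ctrl_ring_request(&req, &rsp);                      /* hypothetical helper */
    if (rsp.status == XEN_NETIF_CTRL_STATUS_SUCCESS)
        max_staging_grefs = rsp.data;                   /* free mapping entries */
    else
        max_staging_grefs = 0;                          /* no staging available */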

### `XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING`

This is sent by the frontend to map a list of grant references in the backend.
It takes the queue index, the grant reference of the page containing the list
(the offset is implicitly zero) and the number of entries in the list. Each
entry in this list has the following format:

        0     1     2     3     4     5     6     7  octet
     +-----+-----+-----+-----+-----+-----+-----+-----+
     | grant ref             |  flags    |  status   |
     +-----+-----+-----+-----+-----+-----+-----+-----+

     grant ref: grant reference
     flags: flags describing the control operation
     status: XEN_NETIF_CTRL_STATUS_*
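
A C view of one list entry matching the layout above (shown as an illustration
of the proposed format, not an existing header definition) could be:

    struct netif_gref {
        grant_ref_t ref;      /* 32-bit grant reference                      */
        uint16_t flags;       /* flags describing the control operation      */
        uint16_t status;      /* XEN_NETIF_CTRL_STATUS_*, filled by backend  */
    };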

The list can have a maximum of 512 entries to be mapped at once.
The 'status' field is not used for adding new mappings; instead, the message
returns an error code describing whether the operation was successful or not.
On failure, none of the specified grant mappings get added.
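
A hedged sketch of issuing this message follows. The `data[]` convention
(queue index, grant reference of the list page, number of entries) and the
`grant_list_page`, `staging_grefs`, `grant_access()` and `ctrl_ring_request()`
helpers are assumptions for illustration only:

    /* grant_list_page holds an array of list entries as described above. */
    struct netif_gref *list = grant_list_page;
    unsigned int i;

    for (i = 0; i < nr_entries; i++) {
        list[i].ref    = staging_grefs[i];   /* grants to keep mapped */
        list[i].flags  = 0;
        list[i].status = 0;
    }

    struct xen_netif_ctrl_request req = {
        .id   = ctrl_id++,
        .type = XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING,
        .data = { queue_index,
                  grant_access(backend_domid, grant_list_page), /* list gref */
                  nr_entries },
    };
    struct xen_netif_ctrl_response rsp;

    ctrl_ring_request(&req, &rsp);           /* hypothetical helper */
    /* On failure none of the mappings were added; retry or fall back. */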

### `XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING`

This is sent by the frontend for the backend to unmap a list of grant
references. The arguments are the same as for
`XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING`, including the format of the list. The
only entries used are the ones representing grant references that were
previously the subject of a `XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING` operation.
Any other entries will have their status set to
`XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER` upon completion. The entry 'status'
field indicates whether the entry was successfully removed.

## Datapath Changes

The control ring is only available after the backend state is
`XenbusStateConnected`; therefore only on this state change can the frontend
query the total number of mappings it can keep. It then grants N entries per
queue on both the TX and RX rings, which will create the underlying backend
gref -> page association (e.g. stored in a hash table). The frontend may wish
to recycle these pregranted buffers or choose a copy approach to replace
granting.

In step 19) of Guest Transmit and step 3) of Guest Receive, the data gref is
first looked up in this table, and the underlying page is used if a mapping
already exists. In the successful case, steps 20), 21) and 27) of Guest
Transmit are skipped, with 19) being replaced by a memcpy of up to 128 bytes.
On Guest Receive, steps 4), 12) and 22) are replaced by a memcpy instead of a
grant copy.
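
A hedged sketch of the backend-side lookup described above follows; the hash
table, `struct staging_entry` and `backend_queue` are assumptions of one
possible implementation (e.g. using Linux's `linux/hashtable.h` helpers), not
part of the protocol:

    struct staging_entry {
        grant_ref_t gref;
        void *vaddr;               /* backend mapping made at ADD time */
        struct hlist_node node;
    };

    /* Return the local mapping for a staged gref, or NULL to fall back to
     * the regular grant copy/map path. */
    static void *staging_lookup(struct backend_queue *q, grant_ref_t gref)
    {
        struct staging_entry *e;

        hash_for_each_possible(q->staging_hash, e, node, gref)
            if (e->gref == gref)
                return e->vaddr;

        return NULL;
    }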

Failing to obtain the total number of mappings
(`XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE`) means the guest falls back to
normal usage without pregranted buffers.

\clearpage

# Wire Performance

This section is a reference meant to keep in mind the numbers on the wire.

The minimum size of a single packet is calculated as:

  Packet = Ethernet Header (14) + Protocol Data Unit (46 - 1500) = 60 bytes

On the wire it's a bit more:

  Preamble (7) + Start Frame Delimiter (1) + Packet + CRC (4) + Interframe gap (12) = 84 bytes

For a given link speed in bits/sec and packet size, the real packet rate is
calculated as:

  Rate = Link-speed / ((Preamble + Start Frame Delimiter + Packet + CRC + Interframe gap) * 8)
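
As a sanity check of the table below, a minimal sketch of this calculation
(standalone C, not part of any driver) is:

    #include <stdio.h>

    /* Packets per second for a given link speed and "Packet + CRC" size,
     * accounting for preamble (7), SFD (1) and interframe gap (12) bytes. */
    static double packet_rate(double link_bps, unsigned packet_and_crc)
    {
        unsigned wire_bytes = 7 + 1 + packet_and_crc + 12;
        return link_bps / (wire_bytes * 8.0);
    }

    int main(void)
    {
        printf("%.2f Mpps\n", packet_rate(10e9, 64) / 1e6);    /* ~14.88 */
        printf("%.2f Mpps\n", packet_rate(100e9, 1500) / 1e6); /* ~8.22  */
        return 0;
    }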

Numbers to keep in mind (the packet size excludes the PHY-layer overhead,
though the packet rates disclosed by vendors take it into account, since it's
what goes on the wire):

| Packet + CRC (bytes)   | 10 Gbit/s  |  40 Gbit/s |  100 Gbit/s  |
|------------------------|:----------:|:----------:|:------------:|
| 64                     | 14.88  Mpps|  59.52 Mpps|  148.80 Mpps |
| 128                    |  8.44  Mpps|  33.78 Mpps|   84.46 Mpps |
| 256                    |  4.52  Mpps|  18.11 Mpps|   45.29 Mpps |
| 1500                   |   822  Kpps|   3.28 Mpps|    8.22 Mpps |
| 65535                  |   ~19  Kpps|  76.27 Kpps|  190.68 Kpps |

Caption:  Mpps (Million packets per second) ; Kpps (Kilo packets per second)

\clearpage

# Performance

The numbers below were measured between a Linux v4.11 guest and another host
connected by a 100 Gbit/s NIC, on an E5-2630 v4 2.2 GHz host, to give an idea
of the performance benefits of this extension. Please refer to this
presentation[7] for a better overview of the results.

( Numbers include protocol overhead )

**Bulk transfer (Guest TX/RX)**

 Queues  Before (Mbit/s)  After (Mbit/s)
 ------  ---------------  --------------
 1queue  17244/6000       38189/28108
 2queue  24023/9416       54783/40624
 3queue  29148/17196      85777/54118
 4queue  39782/18502      99530/46859

( Guest -> Dom0 )

**Packet I/O (Guest TX/RX) in UDP 64b**

 Queues  Before (Mpps)  After (Mpps)
 ------  -------------  ------------
 1queue  0.684/0.439    2.49/2.96
 2queue  0.953/0.755    4.74/5.07
 4queue  1.890/1.390    8.80/9.92

\clearpage

# References

[0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html

[1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362

[2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1

[3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

[4] http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data

[5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073

[6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52

[7] https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/e6/ToGrantOrNotToGrant-XDDS2017_v3.pdf

# History

A table of changes to the document, in chronological order.

------------------------------------------------------------------------
Date       Revision Version  Notes
---------- -------- -------- -------------------------------------------
2016-12-14 1        Xen 4.9  Initial version for RFC

2017-09-01 2        Xen 4.10 Rework to use control ring

                             Trim down the specification

                             Added some performance numbers from the
                             presentation

2017-09-13 3        Xen 4.10 Addressed changes from Paul Durrant

2017-09-19 4        Xen 4.10 Addressed changes from Paul Durrant

------------------------------------------------------------------------