% Staging grants for network I/O requests
% Revision 4

\clearpage

--------------------------------------------------------------------
Architecture(s): Any
--------------------------------------------------------------------

# Background and Motivation

At the Xen hackathon '16 networking session, we spoke about having a
permanently mapped region to describe the header/linear region of packet
buffers. This document outlines that proposal, covering its motivation and its
applicability to other use cases, alongside the necessary changes.

The motivation of this work is to eliminate grant ops for packet I/O intensive
workloads, such as those observed with smaller request sizes (i.e. <= 256 bytes
or <= MTU). Currently on Xen, only bulk transfers (e.g. 32K..64K packets)
perform really well (up to 80 Gbit/s on a few CPUs), usually backing end-hosts
and server appliances. Anything that involves higher packet rates (<= 1500 MTU)
or no scatter-gather (sg) performs badly, at roughly 1 Gbit/s of throughput.

# Proposal

The proposal is to leverage the already implicit copy from and to the packet
linear data on netfront and netback, and to do it instead from a permanently
mapped region. In some (physical) NICs this is known as header/data split.

Specifically, for some workloads (e.g. NFV) it would provide a big increase in
throughput when we switch to (zero)copying in the backend/frontend instead of
the grant hypercalls. Thus this extension aims at futureproofing the netif
protocol by adding the possibility of guests setting up a list of grants that
are set up at device creation and revoked at device freeing - without taking
too many grant entries into account for the general case (i.e. to cover only
the header region <= 256 bytes, 16 grants per ring) while remaining
configurable by the kernel when one wants to resort to a copy-based approach as
opposed to grant copy/map.

\clearpage

# General Operation

Here we describe how netback and netfront generally operate, and where the
proposed solution will fit. The security mechanism currently involves grant
references, which in essence are round-robin recycled 'tickets' stamped with
the GPFNs, permission attributes, and the authorized domain:

(This is an in-memory view of struct grant_entry_v1):

             0     1     2     3     4     5     6     7 octet
            +------------+-----------+------------------------+
            | flags      | domain id | frame                  |
            +------------+-----------+------------------------+

Where there are N grant entries in a grant table, for example:

    @0:
            +------------+-----------+------------------------+
            | rw         | 0         | 0xABCDEF               |
            +------------+-----------+------------------------+
            | rw         | 0         | 0xFA124                |
            +------------+-----------+------------------------+
            | ro         | 1         | 0xBEEF                 |
            +------------+-----------+------------------------+

            .....
    @N:
            +------------+-----------+------------------------+
            | rw         | 0         | 0x9923A                |
            +------------+-----------+------------------------+

Each entry consumes 8 bytes, therefore 512 entries can fit on one page.
`gnttab_max_frames` defaults to 32 pages, hence 16,384 grants. The
ParaVirtualized (PV) drivers will use the grant reference (the index in the
grant table, 0 .. N) in their command ring.
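
For reference, the 8-byte entry sketched above corresponds to the v1 grant
entry defined in Xen's public `grant_table.h`; the comments below are
illustrative annotations, not the header's own:

    /* One v1 grant table entry (8 bytes), from Xen's public grant_table.h. */
    typedef struct grant_entry_v1 {
        /* GTF_* flags: grant type and permissions (e.g. read-only), plus
         * status bits updated by Xen while the grant is in use. */
        uint16_t flags;
        /* The domain being granted access (the "authorized domain" above). */
        domid_t  domid;   /* domid_t is a uint16_t */
        /* The guest page frame number (GPFN) being shared. */
        uint32_t frame;
    } grant_entry_v1_t;
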
\clearpage

## Guest Transmit

The view of the shared transmit ring is the following:

             0     1     2     3     4     5     6     7 octet
            +------------------------+------------------------+
            | req_prod               | req_event              |
            +------------------------+------------------------+
            | rsp_prod               | rsp_event              |
            +------------------------+------------------------+
            | pvt                    | pad[44]                |
            +------------------------+                        |
            | ....                                            | [64bytes]
            +------------------------+------------------------+-\
            | gref                   | offset     | flags     |  |
            +------------+-----------+------------------------+  +-'struct
            | id         | size      | id         | status    |  | netif_tx_sring_entry'
            +-------------------------------------------------+-/
            |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
            +-------------------------------------------------+

Each entry consumes 16 octets, therefore 256 entries can fit on one page.
`struct netif_tx_sring_entry` includes both `struct netif_tx_request` (first 12
octets) and `struct netif_tx_response` (last 4 octets). Additionally a `struct
netif_extra_info` may overlay the request, in which case the format is:

            +------------------------+------------------------+-\
            | type |flags| type specific data (gso, hash, etc)|  |
            +------------+-----------+------------------------+  +-'struct
            | padding for tx         | unused                 |  | netif_extra_info'
            +-------------------------------------------------+-/

In essence, the transmission of a packet from the frontend to the backend
network stack goes as follows:

**Frontend**

1) Calculate how many slots are needed for transmitting the packet.
   Fail if there aren't enough slots.

[ The calculation needs to estimate the slots taking the 4k page boundary into
  account ]

2) Make the first request for the packet.
   The first request contains the whole packet size, checksum info, a flag on
   whether it contains extra metadata, and whether the following slots contain
   more data.

3) Put a grant in the `gref` field of the tx slot.

4) Set extra info if the packet requires special metadata (e.g. GSO size).

5) If there's still data to be granted, set the flag `NETTXF_more_data` in the
request `flags`.

6) Grant the remaining packet pages, one per slot. (The grant boundary is 4k.)

7) Fill the resulting grefs in the slots, setting `NETTXF_more_data` for all
but the last slot (the first N-1).

8) Fill the total packet size in the first request.

9) Set the checksum info of the packet (if checksum offload is supported).

10) Update the request producer index (`req_prod`).

11) Check whether the backend needs a notification.

11.1) If so, perform the hypercall `EVTCHNOP_send`, which might mean a
      __VMEXIT__ depending on the guest type.

**Backend**

12) Backend gets an interrupt and runs its interrupt service routine.

13) Backend checks if there are unconsumed requests.

14) Backend consumes a request from the ring.

15) Process extra info (e.g. if GSO info was set).

16) Count all requests for this packet to be processed (while
`NETTXF_more_data` is set) and perform a few validation tests:

16.1) Fail the transmission if the total packet size is smaller than the
Ethernet minimum allowed;

      Failing the transmission means filling the `id` of the request and
      setting `status` of `struct netif_tx_response` to `NETIF_RSP_ERROR`,
      updating rsp_prod and finally notifying the frontend (through
      `EVTCHNOP_send`).

16.2) Fail the transmission if one of the slots (size + offset) crosses the
page boundary.

16.3) Fail the transmission if the number of slots is bigger than the maximum
defined by the spec (18 slots max in netif.h).
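
To illustrate the failure path used by 16.1)-16.3), here is a hedged sketch of
how a backend can produce the error response using the standard accessors from
Xen's public `ring.h` and `netif.h`; the helper itself and `notify_frontend()`
are illustrative names, not an existing API:

    /* Produce a NETIF_RSP_ERROR response for a bad request, publish it and
     * notify the frontend if it asked for an event (illustrative sketch). */
    static void tx_fail_request(netif_tx_back_ring_t *ring,
                                const netif_tx_request_t *txreq)
    {
        RING_IDX idx = ring->rsp_prod_pvt++;
        netif_tx_response_t *rsp = RING_GET_RESPONSE(ring, idx);
        int notify;

        rsp->id     = txreq->id;        /* echo the request id back */
        rsp->status = NETIF_RSP_ERROR;  /* mark the transmission as failed */

        /* Publishes rsp_prod and checks rsp_event in one go. */
        RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(ring, notify);
        if (notify)
            notify_frontend();          /* ends up as an EVTCHNOP_send */
    }

This mirrors what the Linux netback error path ends up doing for each bad
request.
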
17) Allocate packet metadata.

[ *Linux-specific*: This structure encompasses a linear data region which
generally accommodates the protocol header and such. Netback allocates up to
128 bytes for that. ]

18) *Linux-specific*: Set up a `GNTTABOP_copy` to copy up to 128 bytes to this
small region (the linear part of the skb) *only* from the first slot.

19) Set up GNTTABOP operations to copy/map the packet.

20) Perform the `GNTTABOP_copy` (grant copy) and/or `GNTTABOP_map_grant_ref`
    hypercalls (a sketch of the copy for the linear region follows the
    transmit steps below).

[ *Linux-specific*: does a copy for the linear region (<= 128 bytes) and maps
  the remaining slots as frags for the rest of the data ]

21) Check if the grant operations were successful and fail the transmission if
any of the resulting operation `status` fields were different from
`GNTST_okay`.

21.1) If it's a grant-copying backend, produce responses for all the copied
grants as in 16.1). The only difference is that the status is
`NETIF_RSP_OKAY`.

21.2) Update the response producer index (`rsp_prod`).

22) Set up gso info requested by the frontend. [optional]

23) Set the frontend-provided checksum info.

24) *Linux-specific*: Register a destructor callback for when packet pages are
freed.

25) Call into the network stack.

26) Update `req_event` to `request consumer index + 1` to receive a
    notification on the first produced request from the frontend.
    [optional, if the backend is polling the ring and never sleeps]

27) *Linux-specific*: The packet destructor callback is called.

27.1) Set up `GNTTABOP_unmap_grant_ref` ops for the designated packet pages.

27.2) Once done, perform the `GNTTABOP_unmap_grant_ref` hypercall. Underlying
this hypercall, a TLB flush of all backend vCPUs is done.

27.3) Produce a Tx response like in steps 21.1) and 21.2).

[ *Linux-specific*: It contains a thread that is woken for this purpose, and
it batches these unmap operations. The callback just queues another unmap. ]

27.4) Check whether the frontend requested a notification.

27.4.1) If so, perform the hypercall `EVTCHNOP_send`, which might mean a
        __VMEXIT__ depending on the guest type.

**Frontend**

28) A transmit interrupt is raised, which signals the packet transmission
completion.

29) The transmit completion routine checks for unconsumed responses.

30) Processes the responses and revokes the grants provided.

31) Updates `rsp_cons` (response consumer index).

This proposal aims at removing steps 19) 20) 21) by using grefs previously
mapped at the guest's request. The guest decides how to distribute or use these
premapped grefs, for either the linear region or the full packet. This also
allows us to replace step 27) (the unmap), preventing the TLB flush.
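
To make steps 18)-21) more concrete, below is a hedged, Linux-flavoured sketch
of the single grant copy the backend issues today for the linear region. The
`gnttab_copy` fields and flags come from Xen's public `grant_table.h`; the
helper name, `LINEAR_COPY_LEN` and `dst_gfn` are illustrative:

    #define LINEAR_COPY_LEN 128  /* Linux netback copies at most 128 bytes */

    /* Copy the header/linear region from the guest's gref into a
     * backend-owned frame 'dst_gfn', then check the result as in step 21). */
    static int copy_linear_region(domid_t otherend_id,
                                  const netif_tx_request_t *txreq,
                                  xen_pfn_t dst_gfn)
    {
        struct gnttab_copy op = {
            .source.u.ref  = txreq->gref,   /* grant from the first tx slot */
            .source.domid  = otherend_id,   /* the transmitting guest       */
            .source.offset = txreq->offset,
            .dest.u.gmfn   = dst_gfn,       /* backend page (linear area)   */
            .dest.domid    = DOMID_SELF,
            .dest.offset   = 0,
            .len           = txreq->size < LINEAR_COPY_LEN ? txreq->size
                                                           : LINEAR_COPY_LEN,
            .flags         = GNTCOPY_source_gref,
        };

        if (HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1))
            return -1;                      /* hypercall itself failed      */

        return op.status == GNTST_okay ? 0 : -1;
    }

With staging grants this whole operation collapses into a memcpy from the
pre-mapped gref.
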
Note that a grant copy does the following (in pseudo code):

    rcu_lock(src_domain);
    rcu_lock(dst_domain);

    for (op = gntcopy[0]; op < nr_ops; op++) {
        src_frame = __acquire_grant_for_copy(src_domain, <op.src.gref>);
        ^ this implies holding a potentially contended per-CPU lock on the
          remote grant table.

        src_vaddr = map_domain_page(src_frame);

        dst_frame = __get_paged_frame(dst_domain, <op.dst.mfn>);
        dst_vaddr = map_domain_page(dst_frame);

        memcpy(dst_vaddr + <op.dst.offset>,
               src_vaddr + <op.src.offset>,
               <op.size>);

        unmap_domain_page(src_vaddr);
        unmap_domain_page(dst_vaddr);
    }

    rcu_unlock(src_domain);
    rcu_unlock(dst_domain);

The Linux netback implementation copies the first 128 bytes into its network
buffer linear region. Hence, for this first region, the grant copy is replaced
by a memcpy in the backend.

\clearpage

## Guest Receive

The view of the shared receive ring is the following:

             0     1     2     3     4     5     6     7 octet
            +------------------------+------------------------+
            | req_prod               | req_event              |
            +------------------------+------------------------+
            | rsp_prod               | rsp_event              |
            +------------------------+------------------------+
            | pvt                    | pad[44]                |
            +------------------------+                        |
            | ....                                            | [64bytes]
            +------------------------+------------------------+
            | id         | pad       | gref                   | ->'struct netif_rx_request'
            +------------+-----------+------------------------+
            | id         | offset    | flags      | status    | ->'struct netif_rx_response'
            +-------------------------------------------------+
            |/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
            +-------------------------------------------------+

Each entry in the ring occupies 16 octets, which means a page fits 256 entries.
Additionally a `struct netif_extra_info` may overlay the rx request, in which
case the format is:

            +------------------------+------------------------+
            | type |flags| type specific data (gso, hash, etc)| ->'struct netif_extra_info'
            +------------+-----------+------------------------+

Notice the lack of padding; that is because it's not used on Rx, as the Rx
request boundary is 8 octets.

In essence, the steps for receiving a packet, from the backend to the frontend
network stack, go as follows:

**Backend**

1) The backend transmit function starts.

[ *Linux-specific*: This means we take a packet and add it to an internal queue
  (protected by a lock), while a separate thread takes it from that queue and
  does the actual processing as in the steps below. This thread has the purpose
  of aggregating as many copies as possible. ]

2) Checks if there are enough rx ring slots to accommodate the packet.

3) Gets a request from the ring for the first data slot and fetches the `gref`
   from it.

4) Creates a grant copy op from the packet page to the `gref` (see the sketch
   after the backend steps below).

[ It's up to the backend to choose how it fills this data. E.g. the backend may
  choose to merge as much data from different pages as possible into this
  single gref, similar to mergeable rx buffers in vhost. ]

5) Sets up flags/checksum info on the first request.

6) Gets a response from the ring for this data slot.

7) Prefills the expected response in the ring with the request `id` and the
   slot size.

8) Updates the request consumer index (`req_cons`).

9) Gets a request from the ring for the first extra info. [optional]

10) Sets up the extra info (e.g. GSO descriptor), then repeats step 8).
    [optional]

11) Repeats steps 3 through 8 for all packet pages, setting `NETRXF_more_data`
    in all but the last slot (the first N-1).

12) Performs the `GNTTABOP_copy` hypercall.

13) Checks if any grant operation status was not `GNTST_okay` and, if so, sets
    the `status` field of `struct netif_rx_response` to `NETIF_RSP_ERROR`.

14) Updates the response producer index (`rsp_prod`).
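
As on the transmit side, here is a hedged sketch of what steps 3)-4) and
12)-13) amount to today; `rx_copy_one_slot` and `src_gfn` are illustrative
names, and `GNTCOPY_dest_gref` marks that the *destination* is the
guest-provided gref this time:

    /* Build and issue one grant copy from a backend frame into the guest's
     * rx gref, then fill the prepared response status as in step 13). */
    static void rx_copy_one_slot(domid_t otherend_id,
                                 const netif_rx_request_t *rxreq,
                                 netif_rx_response_t *rsp,
                                 xen_pfn_t src_gfn, uint16_t len)
    {
        struct gnttab_copy op = {
            .source.u.gmfn = src_gfn,      /* backend-owned packet page      */
            .source.domid  = DOMID_SELF,
            .source.offset = 0,
            .dest.u.ref    = rxreq->gref,  /* guest-provided rx buffer grant */
            .dest.domid    = otherend_id,
            .dest.offset   = 0,
            .len           = len,
            .flags         = GNTCOPY_dest_gref,
        };

        HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);  /* step 12) */

        if (op.status != GNTST_okay)                       /* step 13) */
            rsp->status = NETIF_RSP_ERROR;
    }

With staging grants, the same slot would instead be filled by a memcpy into
the page pre-mapped for that gref.
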
**Frontend**

15) Frontend gets an interrupt and runs its interrupt service routine.

16) Checks if there are unconsumed responses.

17) Consumes a response from the ring (the first response for a packet).

18) Revokes the `gref` in the response.

19) Consumes the extra info response. [optional]

20) While the responses have `NETRXF_more_data` set (i.e. the first N-1),
    fetches each of the remaining responses and revokes the designated `gref`.

21) Updates the response consumer index (`rsp_cons`).

22) *Linux-specific*: Copies (from the first slot gref) up to 256 bytes to the
    linear region of the packet metadata structure (skb). The rest of the pages
    processed in the responses are then added as frags.

23) Sets the checksum info based on the first response flags.

24) Passes the packet into the network stack.

25) Allocates new pages and any necessary packet metadata structures for new
    requests. These requests will then be used in step 1) and so forth.

26) Updates the request producer index (`req_prod`).

27) Checks whether the backend needs a notification:

27.1) If so, perform the hypercall `EVTCHNOP_send`, which might mean a
      __VMEXIT__ depending on the guest type.

28) Updates `rsp_event` to `response consumer index + 1` such that the frontend
    receives a notification on the first newly produced response.
    [optional, if the frontend is polling the ring and never sleeps]

This proposal aims at replacing steps 4), 12) and 22) with a memcpy if the
grefs on the Rx ring were requested to be mapped by the guest. The frontend may
use strategies that allow fast recycling of grants for replenishing the ring,
hence letting Domain-0 replace the grant copies with a memcpy instead, which is
faster.

Depending on the implementation, this would mean that we would no longer need
to aggregate as many grant ops as possible (step 1) and could transmit the
packet in the transmit function itself (e.g. Linux `ndo_start_xmit`) as
previously proposed
here\[[0](http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html)\].
This would heavily improve efficiency, specifically for smaller packets, which
in turn would decrease RTT, with data being acknowledged much more quickly.

\clearpage

# Proposed Extension

The idea is to give the guest more control over whether and how its grants are
mapped. Currently there's no control over this for frontends or backends, and
the latter cannot make assumptions about the mapping of transmit or receive
grants, hence we need the frontend to take the initiative in managing its own
mapping of grants. Guests may then opportunistically recycle these grants
(e.g. Linux) and avoid resorting to the copies that come with using a fixed
amount of buffers. Other frameworks (e.g. XDP, netmap, DPDK) use a fixed set of
buffers, which also makes the case for this extension.

## Terminology

`staging grants` is the term used in this document to refer to the whole
concept of having a set of grants permanently mapped with the backend,
containing data being staged until completion. The term should therefore not be
confused with a new kind of grant in the hypervisor.
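
To make the term concrete, here is a purely illustrative, Linux-flavoured
sketch of what a backend might track per queue for each staging grant; none of
these names exist in netback today:

    /* One pre-mapped ("staging") grant as a backend might track it, e.g. in
     * a per-queue hash table keyed by gref. Illustrative only. */
    struct staging_gref {
        grant_ref_t       gref;    /* frontend-provided grant reference */
        grant_handle_t    handle;  /* from GNTTABOP_map_grant_ref; needed for
                                    * the eventual unmap at DEL/disconnect */
        uint8_t          *vaddr;   /* backend mapping, valid until removed */
        struct hlist_node node;    /* linkage into the per-queue hash table */
    };
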
## Control Ring Messages

### `XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE`

This message is sent by the frontend to fetch the number of grefs that can
be kept mapped in the backend. It takes only the queue index as an argument,
and returns data representing the number of free entries in the mapping table.

### `XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING`

This is sent by the frontend to map a list of grant references in the backend.
It receives the queue index, the grant containing the list (the offset is
implicitly zero) and the number of entries in the list. Each entry in this list
has the following format:

             0     1     2     3     4     5     6     7 octet
            +-----+-----+-----+-----+-----+-----+-----+-----+
            | grant ref             | flags     | status    |
            +-----+-----+-----+-----+-----+-----+-----+-----+

            grant ref: grant reference
            flags: flags describing the control operation
            status: XEN_NETIF_CTRL_STATUS_*

The list can have a maximum of 512 entries to be mapped at once.
The `status` field is not used for adding new mappings; instead, the message
returns an error code describing whether the operation was successful. On
failure, none of the specified grant mappings get added.

### `XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING`

This is sent by the frontend for the backend to unmap a list of grant
references. The arguments are the same as for
`XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING`, including the format of the list. The
entries used are only the ones representing grant references that were
previously the subject of a `XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING` operation.
Any other entries will have their status set to
`XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER` upon completion. The entry `status`
field indicates whether the entry was successfully removed.

## Datapath Changes

The control ring is only available after the backend state is
`XenbusStateConnected`, therefore only on this state change can the frontend
query the total amount of maps it can keep. It then grants N entries per queue
on both the TX and RX rings, which will create the underlying backend
gref -> page association (e.g. stored in a hash table). The frontend may wish
to recycle these pregranted buffers or choose a copy approach to replace
granting.

In step 19) of Guest Transmit and step 3) of Guest Receive, the data gref is
first looked up in this table, and the underlying page is used if a mapping
already exists (see the sketch below). In the successful case, steps 20) 21)
and 27) of Guest Transmit are skipped, with 19) being replaced by a memcpy of
up to 128 bytes. On Guest Receive, steps 4) 12) and 22) are replaced by a
memcpy instead of a grant copy.

Failing to obtain the total number of mappings
(`XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE`) means the guest falls back to
normal usage, without pre-granting buffers.
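
A hedged sketch of that lookup on the backend side, reusing the illustrative
`struct staging_gref` from the Terminology section; `staging_lookup()` is
hypothetical and a `false` return means the existing grant copy/map path is
used. The receive direction is symmetric, with the memcpy going into the
mapped page instead:

    /* Per-queue lookup of a pre-mapped gref; returns NULL if not staged. */
    static struct staging_gref *staging_lookup(void *queue, grant_ref_t gref);

    /* Satisfy a data copy with a plain memcpy when the gref is staged
     * (no hypercall, no TLB flush); otherwise fall back to grant ops. */
    static bool staging_copy_from_gref(void *queue, grant_ref_t gref,
                                       unsigned int offset,
                                       void *dst, unsigned int len)
    {
        struct staging_gref *sg = staging_lookup(queue, gref);

        if (!sg)
            return false;   /* not pre-mapped: use GNTTABOP_copy/map instead */

        memcpy(dst, sg->vaddr + offset, len);
        return true;
    }
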
\clearpage

# Wire Performance

This section is a glossary meant to keep in mind numbers on the wire.

The minimum size that can fit in a single packet is calculated as:

    Packet = Ethernet Header (14) + Protocol Data Unit (46 - 1500) = 60 bytes

On the wire it's a bit more:

    Preamble (7) + Start Frame Delimiter (1) + Packet + CRC (4) + Interframe gap (12) = 84 bytes

For a given link speed in bits/sec and packet size, the real packet rate is
calculated as:

    Rate = Link-speed / ((Preamble + Packet + CRC + Interframe gap) * 8)

Numbers to keep in mind (the packet size excludes the PHY layer overhead,
though packet rates disclosed by vendors take it into account, since it's what
goes on the wire):

| Packet + CRC (bytes)    | 10 Gbit/s  | 40 Gbit/s  | 100 Gbit/s   |
|-------------------------|:----------:|:----------:|:------------:|
| 64                      | 14.88 Mpps | 59.52 Mpps | 148.80 Mpps  |
| 128                     |  8.44 Mpps | 33.78 Mpps |  84.46 Mpps  |
| 256                     |  4.52 Mpps | 18.11 Mpps |  45.29 Mpps  |
| 1500                    |   822 Kpps |  3.28 Mpps |   8.22 Mpps  |
| 65535                   |   ~19 Kpps | 76.27 Kpps | 190.68 Kpps  |

Caption: Mpps (million packets per second); Kpps (kilo packets per second)

\clearpage

# Performance

Numbers measured between a Linux v4.11 guest and another host connected by a
100 Gbit/s NIC, on an E5-2630 v4 2.2 GHz host, to give an idea of the
performance benefits of this extension. Please refer to this presentation[7]
for a better overview of the results.

( Numbers include protocol overhead )

**bulk transfer (Guest TX/RX)**

    Queues   Before (Mbit/s)   After (Mbit/s)
    ------   ---------------   --------------
    1queue     17244/6000       38189/28108
    2queue     24023/9416       54783/40624
    3queue     29148/17196      85777/54118
    4queue     39782/18502      99530/46859

( Guest -> Dom0 )

**Packet I/O (Guest TX/RX) in UDP 64b**

    Queues   Before (Mpps)   After (Mpps)
    ------   -------------   ------------
    1queue    0.684/0.439     2.49/2.96
    2queue    0.953/0.755     4.74/5.07
    4queue    1.890/1.390     8.80/9.92

\clearpage

# References

[0] http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg01504.html

[1] https://github.com/freebsd/freebsd/blob/master/sys/dev/netmap/netmap_mem2.c#L362

[2] https://www.freebsd.org/cgi/man.cgi?query=vale&sektion=4&n=1

[3] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

[4] http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/requirements.html#write-access-to-packet-data

[5] http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2073

[6] http://lxr.free-electrons.com/source/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L52

[7] https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/e6/ToGrantOrNotToGrant-XDDS2017_v3.pdf

# History

A table of changes to the document, in chronological order.

------------------------------------------------------------------------
Date       Revision Version  Notes
---------- -------- -------- -------------------------------------------
2016-12-14 1        Xen 4.9  Initial version for RFC

2017-09-01 2        Xen 4.10 Rework to use control ring

                             Trim down the specification

                             Added some performance numbers from the
                             presentation

2017-09-13 3        Xen 4.10 Addressed changes from Paul Durrant

2017-09-19 4        Xen 4.10 Addressed changes from Paul Durrant

------------------------------------------------------------------------