.. SPDX-License-Identifier: GPL-2.0

.. _kfuncs-header-label:

=============================
BPF Kernel Functions (kfuncs)
=============================

1. Introduction
===============

BPF Kernel Functions, more commonly known as kfuncs, are functions in the Linux
kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
kfuncs do not have a stable interface and can change from one kernel release to
another. Hence, BPF programs need to be updated in response to changes in the
kernel. See :ref:`BPF_kfunc_lifecycle_expectations` for more information.

2. Defining a kfunc
===================

There are two ways to expose a kernel function to BPF programs: either make an
existing function in the kernel visible, or add a new wrapper for BPF. In both
cases, care must be taken that a BPF program can only call such a function in a
valid context. To enforce this, visibility of a kfunc can be per program type.

If you are not creating a BPF wrapper for an existing kernel function, skip
ahead to :ref:`BPF_kfunc_nodef`.

2.1 Creating a wrapper kfunc
----------------------------

When defining a wrapper kfunc, the wrapper function should have extern linkage.
This prevents the compiler from optimizing away dead code, as this wrapper kfunc
is not invoked anywhere in the kernel itself. It is not necessary to provide a
prototype in a header for the wrapper kfunc.

An example is given below::

	/* Disables missing prototype warnings */
	__diag_push();
	__diag_ignore_all("-Wmissing-prototypes",
			  "Global kfuncs as their definitions will be in BTF");

	__bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
	{
		return find_get_task_by_vpid(nr);
	}

	__diag_pop();

A wrapper kfunc is often needed when we need to annotate parameters of the
kfunc. Otherwise one may directly make the kfunc visible to the BPF program by
registering it with the BPF subsystem.
See :ref:`BPF_kfunc_nodef`.

2.2 Annotating kfunc parameters
-------------------------------

Similar to BPF helpers, there is sometimes a need for additional context
required by the verifier to make the usage of kernel functions safer and more
useful. Hence, we can annotate a parameter by suffixing the name of the argument
of the kfunc with a __tag, where tag may be one of the supported annotations.

2.2.1 __sz Annotation
---------------------

This annotation is used to indicate a memory and size pair in the argument list.
An example is given below::

	__bpf_kfunc void bpf_memzero(void *mem, int mem__sz)
	{
	...
	}

Here, the verifier will treat the first argument as a PTR_TO_MEM, and the second
argument as its size. By default, without the __sz annotation, the size of the
type of the pointer is used. Without the __sz annotation, a kfunc cannot accept
a void pointer.

2.2.2 __k Annotation
--------------------

This annotation is only understood for scalar arguments, where it indicates that
the verifier must check the scalar argument to be a known constant, which does
not indicate a size parameter, and the value of the constant is relevant to the
safety of the program.

An example is given below::

	__bpf_kfunc void *bpf_obj_new(u32 local_type_id__k, ...)
	{
	...
	}

Here, bpf_obj_new uses the local_type_id argument to find out the size of that
type ID in the program's BTF and return a sized pointer to it. Each type ID will
have a distinct size, hence it is crucial to treat each such call as distinct
when values don't match during verifier state pruning checks.

Hence, whenever a constant scalar argument is accepted by a kfunc which is not a
size parameter, and the value of the constant matters for program safety, the
__k suffix should be used.

.. _BPF_kfunc_nodef:

2.3 Using an existing kernel function
-------------------------------------

When an existing function in the kernel is fit for consumption by BPF programs,
it can be directly registered with the BPF subsystem. However, care must still
be taken to review the context in which it will be invoked by the BPF program
and whether it is safe to do so.

2.4 Annotating kfuncs
---------------------

In addition to kfuncs' arguments, the verifier may need more information about
the type of kfunc(s) being registered with the BPF subsystem. To do so, we
define flags on a set of kfuncs as follows::

	BTF_SET8_START(bpf_task_set)
	BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
	BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
	BTF_SET8_END(bpf_task_set)

This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.

kfunc definitions should also always be annotated with the ``__bpf_kfunc``
macro. This prevents issues such as the compiler inlining the kfunc if it's a
static kernel function, or the function being elided in an LTO build as it's
not used in the rest of the kernel. Developers should not manually add
annotations to their kfunc to prevent these issues. If an annotation is
required to prevent such an issue with your kfunc, it is a bug and should be
added to the definition of the macro so that other kfuncs are similarly
protected. An example is given below::

	__bpf_kfunc struct task_struct *bpf_get_task_pid(s32 pid)
	{
	...
	}

2.4.1 KF_ACQUIRE flag
---------------------

The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a
refcounted object.
The verifier will then ensure that the pointer to the object
is eventually released using a release kfunc, or transferred to a map using a
referenced kptr (by invoking bpf_kptr_xchg). If not, the verifier fails the
loading of the BPF program until no lingering references remain in all possible
explored states of the program.

2.4.2 KF_RET_NULL flag
----------------------

The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc
may be NULL. Hence, it forces the user to do a NULL check on the pointer
returned from the kfunc before making use of it (dereferencing or passing to
another helper). This flag is often used together with the KF_ACQUIRE flag, but
both are orthogonal to each other.

2.4.3 KF_RELEASE flag
---------------------

The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
passed in to it. There can be only one referenced pointer that can be passed in.
All copies of the pointer being released are invalidated as a result of invoking
a kfunc with this flag.

2.4.4 KF_KPTR_GET flag
----------------------

The KF_KPTR_GET flag is used to indicate that the kfunc takes the first argument
as a pointer to kptr, safely increments the refcount of the object it points to,
and returns a reference to the user. The rest of the arguments may be normal
arguments of a kfunc. The KF_KPTR_GET flag should be used in conjunction with
the KF_ACQUIRE and KF_RET_NULL flags.

2.4.5 KF_TRUSTED_ARGS flag
--------------------------

The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
offset, and without having been obtained from walking another pointer, with one
exception described below).
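
As a rough illustration of the rule in this section, the following userspace C
sketch models the check applied to pointer arguments of a KF_TRUSTED_ARGS
kfunc. It is not verifier code: ``struct toy_ptr_arg``,
``toy_trusted_arg_ok()`` and ``toy_trusted_args_ok()`` are hypothetical names
invented for this example, the notion of a "valid" pointer is reduced to two
booleans and an offset, and the nested-pointer exception is deliberately not
modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified description of one pointer argument.
 * These fields only mirror the prose; they are not kernel types.
 */
struct toy_ptr_arg {
	bool is_btf_object;	/* points to a kernel object described by BTF */
	bool walked;		/* obtained by walking another pointer */
	long offset;		/* byte offset applied by the BPF program */
};

/* Toy rule for a single argument: a pointer to a BTF object must be
 * unmodified (zero offset, not obtained by walking another pointer).
 * Pointers to non-BTF objects may carry a non-zero offset.
 */
static bool toy_trusted_arg_ok(const struct toy_ptr_arg *arg)
{
	if (!arg->is_btf_object)
		return true;
	return !arg->walked && arg->offset == 0;
}

/* A call is accepted by the toy model only if every argument passes. */
static bool toy_trusted_args_ok(const struct toy_ptr_arg *args, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!toy_trusted_arg_ok(&args[i]))
			return false;
	}
	return true;
}
```

In this model, a BTF object pointer at offset 0 that was not obtained by walking
another pointer is accepted, while the same pointer at offset 8, or one reached
by walking a field of another object, is rejected.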

There are two types of pointers to kernel objects which are considered "valid":

1. Pointers which are passed as tracepoint or struct_ops callback arguments.
2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc.

Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.

The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.

As mentioned above, a nested pointer obtained from walking a trusted pointer is
no longer trusted, with one exception. If a struct type has a field that is
guaranteed to be valid as long as its parent pointer is trusted, the
``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as
follows:

.. code-block:: c

	BTF_TYPE_SAFE_NESTED(struct task_struct) {
		const cpumask_t *cpus_ptr;
	};

In other words, you must:

1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro.

2. Specify the type and name of the trusted nested field. This field must match
   the field in the original type definition exactly.

2.4.6 KF_SLEEPABLE flag
-----------------------

The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only
be called by sleepable BPF programs (BPF_F_SLEEPABLE).

2.4.7 KF_DESTRUCTIVE flag
-------------------------

The KF_DESTRUCTIVE flag is used to indicate functions whose invocation is
destructive to the system. For example, such a call can result in the system
rebooting or panicking. Because of this, additional restrictions apply to these
calls. At the moment they only require the CAP_SYS_BOOT capability, but more can
be added later.

2.4.8 KF_RCU flag
-----------------

The KF_RCU flag is used for kfuncs which have an RCU pointer as an argument.
When used together with KF_ACQUIRE, it indicates the kfunc should have a
single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have a reference count of 0 and the kfunc must take this
into consideration.

.. _KF_deprecated_flag:

2.4.9 KF_DEPRECATED flag
------------------------

The KF_DEPRECATED flag is used for kfuncs which are scheduled to be
changed or removed in a subsequent kernel release. A kfunc that is
marked with KF_DEPRECATED should also have any relevant information
captured in its kernel doc. Such information typically includes the
kfunc's expected remaining lifespan, a recommendation for new
functionality that can replace it if any is available, and possibly a
rationale for why it is being removed.

Note that while on some occasions, a KF_DEPRECATED kfunc may continue to be
supported and have its KF_DEPRECATED flag removed, it is likely to be far more
difficult to remove a KF_DEPRECATED flag after it's been added than it is to
prevent it from being added in the first place. As described in
:ref:`BPF_kfunc_lifecycle_expectations`, users that rely on specific kfuncs are
encouraged to make their use-cases known as early as possible, and participate
in upstream discussions regarding whether to keep, change, deprecate, or remove
those kfuncs if and when such discussions occur.

2.5 Registering the kfuncs
--------------------------

Once the kfunc is prepared for use, the final step to making it visible is
registering it with the BPF subsystem. Registration is done per BPF program
type.
An example is shown below::

	BTF_SET8_START(bpf_task_set)
	BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
	BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
	BTF_SET8_END(bpf_task_set)

	static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
		.owner = THIS_MODULE,
		.set   = &bpf_task_set,
	};

	static int init_subsystem(void)
	{
		return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
	}
	late_initcall(init_subsystem);

2.6 Specifying no-cast aliases with ___init
--------------------------------------------

The verifier will always enforce that the BTF type of a pointer passed to a
kfunc by a BPF program matches the type of pointer specified in the kfunc
definition. The verifier does, however, allow types that are equivalent
according to the C standard to be passed to the same kfunc arg, even if their
BTF_IDs differ.

For example, for the following type definition:

.. code-block:: c

	struct bpf_cpumask {
		cpumask_t cpumask;
		refcount_t usage;
	};

The verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc
taking a ``cpumask_t *`` (which is a typedef of ``struct cpumask *``). For
instance, both ``struct cpumask *`` and ``struct bpf_cpumask *`` can be passed
to bpf_cpumask_test_cpu().

In some cases, this type-aliasing behavior is not desired. ``struct
nf_conn___init`` is one such example:

.. code-block:: c

	struct nf_conn___init {
		struct nf_conn ct;
	};

The C standard would consider these types to be equivalent, but it would not
always be safe to pass either type to a trusted kfunc. ``struct
nf_conn___init`` represents an allocated ``struct nf_conn`` object that has
*not yet been initialized*, so it would therefore be unsafe to pass a ``struct
nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct
nf_conn *`` (e.g.
``bpf_ct_change_timeout()``).

In order to accommodate such requirements, the verifier will enforce strict
PTR_TO_BTF_ID type matching if two types have the exact same name, with one
being suffixed with ``___init``.

.. _BPF_kfunc_lifecycle_expectations:

3. kfunc lifecycle expectations
===============================

kfuncs provide a kernel <-> kernel API, and thus are not bound by any of the
strict stability restrictions associated with kernel <-> user UAPIs. This means
they can be thought of as similar to EXPORT_SYMBOL_GPL, and can therefore be
modified or removed by a maintainer of the subsystem they're defined in when
it's deemed necessary.

Like any other change to the kernel, maintainers will not change or remove a
kfunc without having a reasonable justification. Whether or not they'll choose
to change a kfunc will ultimately depend on a variety of factors, such as how
widely used the kfunc is, how long the kfunc has been in the kernel, whether an
alternative kfunc exists, what the norm is in terms of stability for the
subsystem in question, and of course what the technical cost is of continuing
to support the kfunc.

There are several implications of this:

a) kfuncs that are widely used or have been in the kernel for a long time will
   be more difficult to justify being changed or removed by a maintainer. In
   other words, kfuncs that are known to have a lot of users and provide
   significant value provide stronger incentives for maintainers to invest the
   time and complexity in supporting them. It is therefore important for
   developers that are using kfuncs in their BPF programs to communicate and
   explain how and why those kfuncs are being used, and to participate in
   discussions regarding those kfuncs when they occur upstream.

b) Unlike regular kernel symbols marked with EXPORT_SYMBOL_GPL, BPF programs
   that call kfuncs are generally not part of the kernel tree. This means that
   refactoring cannot typically change callers in-place when a kfunc changes,
   as is done for e.g. an upstreamed driver being updated in place when a
   kernel symbol is changed.

   Unlike with regular kernel symbols, this is expected behavior for BPF
   symbols, and out-of-tree BPF programs that use kfuncs should be considered
   relevant to discussions and decisions around modifying and removing those
   kfuncs. The BPF community will take an active role in participating in
   upstream discussions when necessary to ensure that the perspectives of such
   users are taken into account.

c) A kfunc will never have any hard stability guarantees. BPF APIs cannot and
   will not ever hard-block a change in the kernel purely for stability
   reasons. That being said, kfuncs are features that are meant to solve
   problems and provide value to users. The decision of whether to change or
   remove a kfunc is a multivariate technical decision that is made on a
   case-by-case basis, and which is informed by data points such as those
   mentioned above. It is expected that a kfunc being removed or changed with
   no warning will not be a common occurrence or take place without sound
   justification, but it is a possibility that must be accepted if one is to
   use kfuncs.

3.1 kfunc deprecation
---------------------

As described above, while sometimes a maintainer may find that a kfunc must be
changed or removed immediately to accommodate some changes in their subsystem,
usually kfuncs will be able to accommodate a longer and more measured
deprecation process.
For example, if a new kfunc comes along which provides
superior functionality to an existing kfunc, the existing kfunc may be
deprecated for some period of time to allow users to migrate their BPF programs
to use the new one. Or, if a kfunc has no known users, a decision may be made
to remove the kfunc (without providing an alternative API) after some
deprecation period so as to provide users with a window to notify the kfunc
maintainer if it turns out that the kfunc is actually being used.

It's expected that the common case will be that kfuncs will go through a
deprecation period rather than being changed or removed without warning. As
described in :ref:`KF_deprecated_flag`, the kfunc framework provides the
KF_DEPRECATED flag to kfunc developers to signal to users that a kfunc has been
deprecated. Once a kfunc has been marked with KF_DEPRECATED, the following
procedure is followed for removal:

1. Any relevant information for deprecated kfuncs is documented in the kfunc's
   kernel docs. This documentation will typically include the kfunc's expected
   remaining lifespan, a recommendation for new functionality that can replace
   the usage of the deprecated function (or an explanation as to why no such
   replacement exists), etc.

2. The deprecated kfunc is kept in the kernel for some period of time after it
   was first marked as deprecated. This time period will be chosen on a
   case-by-case basis, and will typically depend on how widespread the use of
   the kfunc is, how long it has been in the kernel, and how hard it is to move
   to alternatives. This deprecation time period is "best effort", and as
   described :ref:`above<BPF_kfunc_lifecycle_expectations>`, circumstances may
   sometimes dictate that the kfunc be removed before the full intended
   deprecation period has elapsed.

3. After the deprecation period the kfunc will be removed. At this point, BPF
   programs calling the kfunc will be rejected by the verifier.

4. Core kfuncs
==============

The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.

4.1 struct task_struct * kfuncs
-------------------------------

There are a number of kfuncs that allow ``struct task_struct *`` objects to be
used as kptrs:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_acquire bpf_task_release

These kfuncs are useful when you want to acquire or release a reference to a
``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
struct_ops callback arg. For example:

.. code-block:: c

	/**
	 * A trivial example tracepoint program that shows how to
	 * acquire and release a struct task_struct * pointer.
	 */
	SEC("tp_btf/task_newtask")
	int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
	{
		struct task_struct *acquired;

		acquired = bpf_task_acquire(task);

		/*
		 * In a typical program you'd do something like store
		 * the task in a map, and the map will automatically
		 * release it later. Here, we release it manually.
		 */
		bpf_task_release(acquired);
		return 0;
	}

----

A BPF program can also look up a task from a pid. This can be useful if the
caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
it can acquire a reference on with bpf_task_acquire().

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_from_pid

Here is an example of it being used:

.. code-block:: c

	SEC("tp_btf/task_newtask")
	int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
	{
		struct task_struct *lookup;

		lookup = bpf_task_from_pid(task->pid);
		if (!lookup)
			/* A task should always be found, as %task is a tracepoint arg. */
			return -ENOENT;

		if (lookup->pid != task->pid) {
			/* bpf_task_from_pid() looks up the task via its
			 * globally-unique pid from the init_pid_ns. Thus,
			 * the pid of the lookup task should always be the
			 * same as the input task.
			 */
			bpf_task_release(lookup);
			return -EINVAL;
		}

		/* bpf_task_from_pid() returns an acquired reference,
		 * so it must be dropped before returning from the
		 * tracepoint handler.
		 */
		bpf_task_release(lookup);
		return 0;
	}

4.2 struct cgroup * kfuncs
--------------------------

``struct cgroup *`` objects also have acquire and release functions:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_acquire bpf_cgroup_release

These kfuncs are used in exactly the same manner as bpf_task_acquire() and
bpf_task_release() respectively, so we won't provide examples for them.

----

You may also acquire a reference to a ``struct cgroup`` kptr that's already
stored in a map using bpf_cgroup_kptr_get():

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_kptr_get

Here's an example of how it can be used:

.. code-block:: c

	/* struct containing the struct cgroup kptr which is actually stored in the map. */
	struct __cgroups_kfunc_map_value {
		struct cgroup __kptr_ref * cgroup;
	};

	/* The map containing struct __cgroups_kfunc_map_value entries. */
	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__type(key, int);
		__type(value, struct __cgroups_kfunc_map_value);
		__uint(max_entries, 1);
	} __cgroups_kfunc_map SEC(".maps");

	/* ... */

	/**
	 * A simple example tracepoint program showing how a
	 * struct cgroup kptr that is stored in a map can
	 * be acquired using the bpf_cgroup_kptr_get() kfunc.
	 */
	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
	{
		struct cgroup *kptr;
		struct __cgroups_kfunc_map_value *v;
		s32 id = cgrp->self.id;

		/* Assume a cgroup kptr was previously stored in the map. */
		v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
		if (!v)
			return -ENOENT;

		/* Acquire a reference to the cgroup kptr that's already stored in the map. */
		kptr = bpf_cgroup_kptr_get(&v->cgroup);
		if (!kptr)
			/* If no cgroup was present in the map, it's because
			 * we're racing with another CPU that removed it with
			 * bpf_kptr_xchg() between the bpf_map_lookup_elem()
			 * above, and our call to bpf_cgroup_kptr_get().
			 * bpf_cgroup_kptr_get() internally safely handles this
			 * race, and will return NULL if the cgroup is no longer
			 * present in the map by the time we invoke the kfunc.
			 */
			return -EBUSY;

		/* Free the reference we just took above. Note that the
		 * original struct cgroup kptr is still in the map. It will
		 * be freed either at a later time if another context deletes
		 * it from the map, or automatically by the BPF subsystem if
		 * it's still present when the map is destroyed.
		 */
		bpf_cgroup_release(kptr);

		return 0;
	}

----

Another kfunc available for interacting with ``struct cgroup *`` objects is
bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
and return it as a cgroup kptr.

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_ancestor

Eventually, BPF should be updated to allow this to happen with a normal memory
load in the program itself. This is currently not possible without more work in
the verifier.
bpf_cgroup_ancestor() can be used as follows:

.. code-block:: c

	/**
	 * Simple tracepoint example that illustrates how a cgroup's
	 * ancestor can be accessed using bpf_cgroup_ancestor().
	 */
	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
	{
		struct cgroup *parent;

		/* The parent cgroup resides at the level before the current cgroup's level. */
		parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
		if (!parent)
			return -ENOENT;

		bpf_printk("Parent id is %d", parent->self.id);

		/* Release the reference to the parent cgroup that was acquired above. */
		bpf_cgroup_release(parent);
		return 0;
	}

4.3 struct cpumask * kfuncs
---------------------------

BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
destroy ``struct cpumask *`` objects. Please refer to
:ref:`cpumasks-header-label` for more details.