.. SPDX-License-Identifier: GPL-2.0

.. _kfuncs-header-label:

=============================
BPF Kernel Functions (kfuncs)
=============================

1. Introduction
===============

BPF Kernel Functions, more commonly known as kfuncs, are functions in the Linux
kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
kfuncs do not have a stable interface and can change from one kernel release to
another. Hence, BPF programs need to be updated in response to changes in the
kernel. See :ref:`BPF_kfunc_lifecycle_expectations` for more information.

2. Defining a kfunc
===================

There are two ways to expose a kernel function to BPF programs: either make an
existing function in the kernel visible, or add a new wrapper for BPF. In both
cases, care must be taken that BPF programs can only call such a function in a
valid context. To enforce this, the visibility of a kfunc can be restricted per
program type.

If you are not creating a BPF wrapper for an existing kernel function, skip
ahead to :ref:`BPF_kfunc_nodef`.

2.1 Creating a wrapper kfunc
----------------------------

When defining a wrapper kfunc, the wrapper function should have extern linkage.
This prevents the compiler from optimizing away dead code, as this wrapper kfunc
is not invoked anywhere in the kernel itself. It is not necessary to provide a
prototype in a header for the wrapper kfunc.

An example is given below::

        /* Disables missing prototype warnings */
        __diag_push();
        __diag_ignore_all("-Wmissing-prototypes",
                          "Global kfuncs as their definitions will be in BTF");

        __bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
        {
                return find_get_task_by_vpid(nr);
        }

        __diag_pop();

A wrapper kfunc is often needed when we need to annotate parameters of the
kfunc. Otherwise one may directly make the kfunc visible to the BPF program by
registering it with the BPF subsystem. See :ref:`BPF_kfunc_nodef`.

2.2 Annotating kfunc parameters
-------------------------------

Similar to BPF helpers, the verifier sometimes needs additional context to make
the usage of kernel functions safer and more useful. Hence, we can annotate a
parameter by suffixing the name of the argument of the kfunc with a __tag,
where tag may be one of the supported annotations.

2.2.1 __sz Annotation
---------------------

This annotation is used to indicate a memory and size pair in the argument list.
An example is given below::

        __bpf_kfunc void bpf_memzero(void *mem, int mem__sz)
        {
        ...
        }

Here, the verifier will treat the first argument as a PTR_TO_MEM, and the
second argument as its size. By default, without the __sz annotation, the size
of the type of the pointer is used. Without the __sz annotation, a kfunc cannot
accept a void pointer.

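The following sketch, which is not taken from the kernel, illustrates why the
pairing matters. ``example_memzero()`` is a hypothetical function, and
``__bpf_kfunc`` is defined as an empty stand-in so that the snippet builds as
ordinary C outside the kernel tree. The function writes exactly ``buf__sz``
bytes through ``buf``, so it is only safe if the caller has been proven to own
at least that many bytes, which is the check the ``__sz`` suffix asks the
verifier to perform:

.. code-block:: c

	#include <string.h>

	/* Stand-in so that the sketch builds outside the kernel tree. */
	#ifndef __bpf_kfunc
	#define __bpf_kfunc
	#endif

	/* Hypothetical kfunc: buf and buf__sz form a memory and size pair. */
	__bpf_kfunc void example_memzero(void *buf, int buf__sz)
	{
		/* Nothing to do for an empty or nonsensical size. */
		if (buf__sz <= 0)
			return;

		/*
		 * The function trusts buf__sz completely, so the caller's
		 * buffer must be at least buf__sz bytes long.
		 */
		memset(buf, 0, (size_t)buf__sz);
	}
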
2.2.2 __k Annotation
--------------------

This annotation is only understood for scalar arguments, where it indicates that
the verifier must check the scalar argument to be a known constant, which does
not indicate a size parameter, and the value of the constant is relevant to the
safety of the program.

An example is given below::

        __bpf_kfunc void *bpf_obj_new(u32 local_type_id__k, ...)
        {
        ...
        }

Here, bpf_obj_new uses the local_type_id argument to find out the size of that
type ID in the program's BTF and return a sized pointer to it. Each type ID will
have a distinct size, hence it is crucial to treat each such call as distinct
when values don't match during verifier state pruning checks.

Hence, whenever a constant scalar argument is accepted by a kfunc which is not a
size parameter, and the value of the constant matters for program safety, the
__k suffix should be used.

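From the BPF program's point of view, the effect can be sketched as follows.
``bpf_example_set_mode()`` is a hypothetical kfunc invented for this
illustration, and the program is a fragment that assumes the usual libbpf
headers and a generated ``vmlinux.h``. A literal argument is a known constant,
whereas a value that is only known at run time is expected to be rejected:

.. code-block:: c

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* Hypothetical kfunc whose argument carries the __k suffix. */
	extern int bpf_example_set_mode(u32 mode__k) __ksym;

	SEC("tp_btf/task_newtask")
	int BPF_PROG(k_suffix_example, struct task_struct *task, u64 clone_flags)
	{
		/* Accepted: the verifier sees a known constant. */
		bpf_example_set_mode(1);

		/*
		 * Expected to be rejected if uncommented: clone_flags is
		 * not a known constant at verification time.
		 *
		 * bpf_example_set_mode(clone_flags);
		 */
		return 0;
	}

	char _license[] SEC("license") = "GPL";
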
.. _BPF_kfunc_nodef:

2.3 Using an existing kernel function
-------------------------------------

When an existing function in the kernel is fit for consumption by BPF programs,
it can be directly registered with the BPF subsystem. However, care must still
be taken to review the context in which it will be invoked by the BPF program
and whether it is safe to do so.

2.4 Annotating kfuncs
---------------------

In addition to kfuncs' arguments, the verifier may need more information about
the type of kfunc(s) being registered with the BPF subsystem. To do so, we
define flags on a set of kfuncs as follows::

        BTF_SET8_START(bpf_task_set)
        BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
        BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
        BTF_SET8_END(bpf_task_set)

This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.

kfunc definitions should also always be annotated with the ``__bpf_kfunc``
macro. This prevents issues such as the compiler inlining the kfunc if it's a
static kernel function, or the function being elided in an LTO build as it's
not used in the rest of the kernel. Developers should not manually add
annotations to their kfunc to prevent these issues. If an annotation is
required to prevent such an issue with your kfunc, it is a bug and should be
added to the definition of the macro so that other kfuncs are similarly
protected. An example is given below::

        __bpf_kfunc struct task_struct *bpf_get_task_pid(s32 pid)
        {
        ...
        }

2.4.1 KF_ACQUIRE flag
---------------------

The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a
refcounted object. The verifier will then ensure that the pointer to the object
is eventually released using a release kfunc, or transferred to a map using a
referenced kptr (by invoking bpf_kptr_xchg). Otherwise, the verifier rejects
the BPF program: it can only be loaded once no lingering references remain in
any explored state of the program.

2.4.2 KF_RET_NULL flag
----------------------

The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc
may be NULL. Hence, it forces the user to do a NULL check on the pointer
returned from the kfunc before making use of it (dereferencing or passing to
another helper). This flag is often paired with the KF_ACQUIRE flag, but the
two are orthogonal to each other.

2.4.3 KF_RELEASE flag
---------------------

The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
passed in to it. Only one referenced pointer can be passed in. All copies of
the pointer being released are invalidated as a result of invoking a kfunc with
this flag.

2.4.4 KF_KPTR_GET flag
----------------------

The KF_KPTR_GET flag is used to indicate that the kfunc takes the first argument
as a pointer to kptr, safely increments the refcount of the object it points to,
and returns a reference to the user. The rest of the arguments may be normal
arguments of a kfunc. The KF_KPTR_GET flag should be used in conjunction with
KF_ACQUIRE and KF_RET_NULL flags.

2.4.5 KF_TRUSTED_ARGS flag
--------------------------

The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
offset, and without having been obtained from walking another pointer, with one
exception described below).

There are two types of pointers to kernel objects which are considered "valid":

1. Pointers which are passed as tracepoint or struct_ops callback arguments.
2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc.

Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.

The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.

As mentioned above, a nested pointer obtained from walking a trusted pointer is
no longer trusted, with one exception. If a struct type has a field that is
guaranteed to be valid as long as its parent pointer is trusted, the
``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as
follows:

.. code-block:: c

	BTF_TYPE_SAFE_NESTED(struct task_struct) {
		const cpumask_t *cpus_ptr;
	};

In other words, you must:

1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro.

2. Specify the type and name of the trusted nested field. This field must match
   the field in the original type definition exactly.

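The difference between a trusted pointer and a walked pointer can be sketched
from the BPF side as follows. ``bpf_example_trusted()`` is a hypothetical kfunc
that is assumed to be registered with KF_TRUSTED_ARGS, and the choice of
``last_wakee`` as the walked field is only an illustration; which fields are
treated as trusted depends on the kernel version:

.. code-block:: c

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* Hypothetical kfunc, assumed to be registered with KF_TRUSTED_ARGS. */
	extern void bpf_example_trusted(struct task_struct *p) __ksym;

	SEC("tp_btf/task_newtask")
	int BPF_PROG(trusted_args_example, struct task_struct *task, u64 clone_flags)
	{
		/* Accepted: task is a tracepoint argument passed unmodified. */
		bpf_example_trusted(task);

		/*
		 * Expected to be rejected if uncommented: the pointer was
		 * obtained by walking task, so it is no longer trusted.
		 *
		 * bpf_example_trusted(task->last_wakee);
		 */
		return 0;
	}

	char _license[] SEC("license") = "GPL";
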
2.4.6 KF_SLEEPABLE flag
-----------------------

The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only
be called by sleepable BPF programs (BPF_F_SLEEPABLE).

2.4.7 KF_DESTRUCTIVE flag
-------------------------

The KF_DESTRUCTIVE flag is used for kfuncs whose invocation is destructive to
the system. For example, such a call can result in the system rebooting or
panicking. Because of this, additional restrictions apply to these calls. At
the moment they only require the CAP_SYS_BOOT capability, but more can be added
later.

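For illustration, the fragment below shows how an existing kernel function can
be flagged as destructive. To the best of our knowledge it mirrors the way
``crash_kexec()`` has been exposed upstream, but the set name and the
``CONFIG_KEXEC_CORE`` guard should be treated as assumptions and checked
against the kernel tree in use:

.. code-block:: c

	/* Assumed set name; the flag is what matters for this example. */
	BTF_SET8_START(tracing_btf_ids)
	#ifdef CONFIG_KEXEC_CORE
	/* Callers need CAP_SYS_BOOT because the call can crash the system. */
	BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
	#endif
	BTF_SET8_END(tracing_btf_ids)
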
2.4.8 KF_RCU flag
-----------------

The KF_RCU flag is used for kfuncs which take an RCU pointer as an argument.
When used together with KF_ACQUIRE, it indicates that the kfunc should have a
single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have a reference count of 0 and the kfunc must take this
into consideration.

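The requirement about a reference count of 0 can be sketched in plain C11. The
snippet below is a userspace model with hypothetical names, not kernel code: an
acquire function that may be handed an object whose count has already dropped
to zero must increment the count only if it is still non-zero, and report
failure otherwise (which in a real kfunc would typically mean returning NULL):

.. code-block:: c

	#include <stdatomic.h>
	#include <stddef.h>

	/* Hypothetical refcounted object. */
	struct example_obj {
		atomic_int refs;
	};

	/* Take a reference unless the count has already reached zero. */
	static struct example_obj *example_acquire(struct example_obj *o)
	{
		int old = atomic_load(&o->refs);

		while (old != 0) {
			/* On failure, old is reloaded with the current count. */
			if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
				return o;
		}

		/* The object is already on its way to being freed. */
		return NULL;
	}

	/* Drop a reference taken with example_acquire(). */
	static void example_release(struct example_obj *o)
	{
		atomic_fetch_sub(&o->refs, 1);
	}
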
.. _KF_deprecated_flag:

2.4.9 KF_DEPRECATED flag
------------------------

The KF_DEPRECATED flag is used for kfuncs which are scheduled to be
changed or removed in a subsequent kernel release. A kfunc that is
marked with KF_DEPRECATED should also have any relevant information
captured in its kernel doc. Such information typically includes the
kfunc's expected remaining lifespan, a recommendation for new
functionality that can replace it if any is available, and possibly a
rationale for why it is being removed.

Note that while on some occasions, a KF_DEPRECATED kfunc may continue to be
supported and have its KF_DEPRECATED flag removed, it is likely to be far more
difficult to remove a KF_DEPRECATED flag after it's been added than it is to
prevent it from being added in the first place. As described in
:ref:`BPF_kfunc_lifecycle_expectations`, users that rely on specific kfuncs are
encouraged to make their use-cases known as early as possible, and participate
in upstream discussions regarding whether to keep, change, deprecate, or remove
those kfuncs if and when such discussions occur.

2.5 Registering the kfuncs
--------------------------

Once the kfunc is prepared for use, the final step to making it visible is
registering it with the BPF subsystem. Registration is done per BPF program
type. An example is shown below::

        BTF_SET8_START(bpf_task_set)
        BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
        BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
        BTF_SET8_END(bpf_task_set)

        static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
                .owner = THIS_MODULE,
                .set   = &bpf_task_set,
        };

        static int init_subsystem(void)
        {
                return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
        }
        late_initcall(init_subsystem);

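Because registration is done per program type, a set that should be visible to
more than one program type has to be registered once for each of them. A
possible variant of ``init_subsystem()`` is sketched below; it reuses the
``bpf_task_kfunc_set`` defined above, and the choice of
``BPF_PROG_TYPE_STRUCT_OPS`` as the second program type is only an assumption
made for the example:

.. code-block:: c

	static int init_subsystem(void)
	{
		int ret;

		/* Make the set visible to tracing programs first. */
		ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
						&bpf_task_kfunc_set);
		if (ret)
			return ret;

		/* Then register the same set for a second program type. */
		return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
						 &bpf_task_kfunc_set);
	}
	late_initcall(init_subsystem);
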
2.6 Specifying no-cast aliases with ___init
-------------------------------------------

The verifier will always enforce that the BTF type of a pointer passed to a
kfunc by a BPF program matches the type of pointer specified in the kfunc
definition. The verifier does, however, allow types that are equivalent
according to the C standard to be passed to the same kfunc arg, even if their
BTF_IDs differ.

For example, for the following type definition:

.. code-block:: c

	struct bpf_cpumask {
		cpumask_t cpumask;
		refcount_t usage;
	};

The verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc
taking a ``cpumask_t *`` (which is a typedef of ``struct cpumask *``). For
instance, both ``struct cpumask *`` and ``struct bpf_cpumask *`` can be passed
to bpf_cpumask_test_cpu().

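The C rule behind this equivalence is that a pointer to a struct, suitably
converted, points to its first member. The userspace sketch below shows the
rule with simplified stand-in types; the real ``struct cpumask`` and
``struct bpf_cpumask`` are larger, ``usage`` is an ``int`` here rather than a
``refcount_t``, and both function names are hypothetical:

.. code-block:: c

	#include <stddef.h>

	/* Simplified stand-ins for the kernel types. */
	typedef struct cpumask {
		unsigned long bits[1];
	} cpumask_t;

	struct bpf_cpumask {
		cpumask_t cpumask;	/* first member, at offset 0 */
		int usage;		/* stand-in for refcount_t */
	};

	/* Takes the embedded type, like a kfunc with a cpumask argument. */
	static int example_test_cpu(unsigned int cpu, const struct cpumask *mask)
	{
		return (int)((mask->bits[0] >> cpu) & 1UL);
	}

	/* Passes the wrapper type where the embedded type is expected. */
	static int example_test_cpu_wrapped(unsigned int cpu,
					    const struct bpf_cpumask *m)
	{
		/* Valid because cpumask is the first member of the wrapper. */
		return example_test_cpu(cpu, (const struct cpumask *)m);
	}
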
In some cases, this type-aliasing behavior is not desired. ``struct
nf_conn___init`` is one such example:

.. code-block:: c

	struct nf_conn___init {
		struct nf_conn ct;
	};

The C standard would consider these types to be equivalent, but it would not
always be safe to pass either type to a trusted kfunc. ``struct
nf_conn___init`` represents an allocated ``struct nf_conn`` object that has
*not yet been initialized*, so it would therefore be unsafe to pass a ``struct
nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct
nf_conn *`` (e.g. ``bpf_ct_change_timeout()``).

In order to accommodate such requirements, the verifier will enforce strict
PTR_TO_BTF_ID type matching if two types have the exact same name, with one
being suffixed with ``___init``.

.. _BPF_kfunc_lifecycle_expectations:

3. kfunc lifecycle expectations
===============================

kfuncs provide a kernel <-> kernel API, and thus are not bound by any of the
strict stability restrictions associated with kernel <-> user UAPIs. This means
they can be thought of as similar to EXPORT_SYMBOL_GPL, and can therefore be
modified or removed by a maintainer of the subsystem they're defined in when
it's deemed necessary.

Like any other change to the kernel, maintainers will not change or remove a
kfunc without having a reasonable justification. Whether or not they'll choose
to change a kfunc will ultimately depend on a variety of factors, such as how
widely used the kfunc is, how long the kfunc has been in the kernel, whether an
alternative kfunc exists, what the norm is in terms of stability for the
subsystem in question, and of course what the technical cost is of continuing
to support the kfunc.

There are several implications of this:

a) kfuncs that are widely used or have been in the kernel for a long time will
   be more difficult to justify being changed or removed by a maintainer. In
   other words, kfuncs that are known to have a lot of users and provide
   significant value provide stronger incentives for maintainers to invest the
   time and complexity in supporting them. It is therefore important for
   developers that are using kfuncs in their BPF programs to communicate and
   explain how and why those kfuncs are being used, and to participate in
   discussions regarding those kfuncs when they occur upstream.

b) Unlike regular kernel symbols marked with EXPORT_SYMBOL_GPL, BPF programs
   that call kfuncs are generally not part of the kernel tree. This means that
   refactoring cannot typically change callers in-place when a kfunc changes,
   as is done for e.g. an upstreamed driver being updated in place when a
   kernel symbol is changed.

   Unlike with regular kernel symbols, this is expected behavior for BPF
   symbols, and out-of-tree BPF programs that use kfuncs should be considered
   relevant to discussions and decisions around modifying and removing those
   kfuncs. The BPF community will take an active role in participating in
   upstream discussions when necessary to ensure that the perspectives of such
   users are taken into account.

c) A kfunc will never have any hard stability guarantees. BPF APIs cannot and
   will not ever hard-block a change in the kernel purely for stability
   reasons. That being said, kfuncs are features that are meant to solve
   problems and provide value to users. The decision of whether to change or
   remove a kfunc is a multivariate technical decision that is made on a
   case-by-case basis, and which is informed by data points such as those
   mentioned above. It is expected that a kfunc being removed or changed with
   no warning will not be a common occurrence or take place without sound
   justification, but it is a possibility that must be accepted if one is to
   use kfuncs.

3.1 kfunc deprecation
---------------------

As described above, while sometimes a maintainer may find that a kfunc must be
changed or removed immediately to accommodate some changes in their subsystem,
usually kfuncs will be able to accommodate a longer and more measured
deprecation process. For example, if a new kfunc comes along which provides
superior functionality to an existing kfunc, the existing kfunc may be
deprecated for some period of time to allow users to migrate their BPF programs
to use the new one. Or, if a kfunc has no known users, a decision may be made
to remove the kfunc (without providing an alternative API) after some
deprecation period so as to provide users with a window to notify the kfunc
maintainer if it turns out that the kfunc is actually being used.

It's expected that the common case will be that kfuncs will go through a
deprecation period rather than being changed or removed without warning. As
described in :ref:`KF_deprecated_flag`, the kfunc framework provides the
KF_DEPRECATED flag to kfunc developers to signal to users that a kfunc has been
deprecated. Once a kfunc has been marked with KF_DEPRECATED, the following
procedure is followed for removal:

1. Any relevant information for deprecated kfuncs is documented in the kfunc's
   kernel docs. This documentation will typically include the kfunc's expected
   remaining lifespan, a recommendation for new functionality that can replace
   the usage of the deprecated function (or an explanation as to why no such
   replacement exists), etc.

2. The deprecated kfunc is kept in the kernel for some period of time after it
   was first marked as deprecated. This time period will be chosen on a
   case-by-case basis, and will typically depend on how widespread the use of
   the kfunc is, how long it has been in the kernel, and how hard it is to move
   to alternatives. This deprecation time period is "best effort", and as
   described :ref:`above<BPF_kfunc_lifecycle_expectations>`, circumstances may
   sometimes dictate that the kfunc be removed before the full intended
   deprecation period has elapsed.

3. After the deprecation period the kfunc will be removed. At this point, BPF
   programs calling the kfunc will be rejected by the verifier.

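A BPF program that has to keep loading on kernels where a kfunc may have been
removed can declare the kfunc as a weak symbol and test for it at run time.
The fragment below is a sketch of that pattern. It assumes a libbpf version
that provides the ``bpf_ksym_exists()`` macro in ``bpf_helpers.h``, and the
program name is hypothetical:

.. code-block:: c

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* Weak declarations: a missing kfunc does not fail the load by itself. */
	extern struct task_struct *bpf_task_from_pid(s32 pid) __ksym __weak;
	extern void bpf_task_release(struct task_struct *p) __ksym __weak;

	SEC("tp_btf/task_newtask")
	int BPF_PROG(optional_kfunc_example, struct task_struct *task, u64 clone_flags)
	{
		struct task_struct *lookup;

		/* Skip the calls on kernels that lack either kfunc. */
		if (!bpf_ksym_exists(bpf_task_from_pid) ||
		    !bpf_ksym_exists(bpf_task_release))
			return 0;

		lookup = bpf_task_from_pid(task->pid);
		if (lookup)
			bpf_task_release(lookup);
		return 0;
	}

	char _license[] SEC("license") = "GPL";
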
4. Core kfuncs
==============

The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.

4.1 struct task_struct * kfuncs
-------------------------------

There are a number of kfuncs that allow ``struct task_struct *`` objects to be
used as kptrs:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_acquire bpf_task_release

These kfuncs are useful when you want to acquire or release a reference to a
``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
struct_ops callback arg. For example:

.. code-block:: c

	/**
	 * A trivial example tracepoint program that shows how to
	 * acquire and release a struct task_struct * pointer.
	 */
	SEC("tp_btf/task_newtask")
	int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
	{
		struct task_struct *acquired;

		acquired = bpf_task_acquire(task);

		/*
		 * In a typical program you'd do something like store
		 * the task in a map, and the map will automatically
		 * release it later. Here, we release it manually.
		 */
		bpf_task_release(acquired);
		return 0;
	}

----

A BPF program can also look up a task from a pid. This can be useful if the
caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
it can acquire a reference on with bpf_task_acquire().

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_from_pid

Here is an example of it being used:

.. code-block:: c

	SEC("tp_btf/task_newtask")
	int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
	{
		struct task_struct *lookup;

		lookup = bpf_task_from_pid(task->pid);
		if (!lookup)
			/* A task should always be found, as %task is a tracepoint arg. */
			return -ENOENT;

		if (lookup->pid != task->pid) {
			/* bpf_task_from_pid() looks up the task via its
			 * globally-unique pid from the init_pid_ns. Thus,
			 * the pid of the lookup task should always be the
			 * same as the input task.
			 */
			bpf_task_release(lookup);
			return -EINVAL;
		}

		/* bpf_task_from_pid() returns an acquired reference,
		 * so it must be dropped before returning from the
		 * tracepoint handler.
		 */
		bpf_task_release(lookup);
		return 0;
	}

4.2 struct cgroup * kfuncs
--------------------------

``struct cgroup *`` objects also have acquire and release functions:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_acquire bpf_cgroup_release

These kfuncs are used in exactly the same manner as bpf_task_acquire() and
bpf_task_release() respectively, so we won't provide examples for them.

----

You may also acquire a reference to a ``struct cgroup`` kptr that's already
stored in a map using bpf_cgroup_kptr_get():

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_kptr_get

Here's an example of how it can be used:

.. code-block:: c

	/* struct containing the struct cgroup kptr which is actually stored in the map. */
	struct __cgroups_kfunc_map_value {
		struct cgroup __kptr_ref * cgroup;
	};

	/* The map containing struct __cgroups_kfunc_map_value entries. */
	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__type(key, int);
		__type(value, struct __cgroups_kfunc_map_value);
		__uint(max_entries, 1);
	} __cgroups_kfunc_map SEC(".maps");

	/* ... */

	/**
	 * A simple example tracepoint program showing how a
	 * struct cgroup kptr that is stored in a map can
	 * be acquired using the bpf_cgroup_kptr_get() kfunc.
	 */
	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
	{
		struct cgroup *kptr;
		struct __cgroups_kfunc_map_value *v;
		s32 id = cgrp->self.id;

		/* Assume a cgroup kptr was previously stored in the map. */
		v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
		if (!v)
			return -ENOENT;

		/* Acquire a reference to the cgroup kptr that's already stored in the map. */
		kptr = bpf_cgroup_kptr_get(&v->cgroup);
		if (!kptr)
			/* If no cgroup was present in the map, it's because
			 * we're racing with another CPU that removed it with
			 * bpf_kptr_xchg() between the bpf_map_lookup_elem()
			 * above, and our call to bpf_cgroup_kptr_get().
			 * bpf_cgroup_kptr_get() internally safely handles this
			 * race, and will return NULL if the cgroup is no longer
			 * present in the map by the time we invoke the kfunc.
			 */
			return -EBUSY;

		/* Free the reference we just took above. Note that the
		 * original struct cgroup kptr is still in the map. It will
		 * be freed either at a later time if another context deletes
		 * it from the map, or automatically by the BPF subsystem if
		 * it's still present when the map is destroyed.
		 */
		bpf_cgroup_release(kptr);

		return 0;
	}

----

Another kfunc available for interacting with ``struct cgroup *`` objects is
bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
and return it as a cgroup kptr.

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_ancestor

Eventually, BPF should be updated to allow this to happen with a normal memory
load in the program itself. This is currently not possible without more work in
the verifier. bpf_cgroup_ancestor() can be used as follows:

.. code-block:: c

	/**
	 * Simple tracepoint example that illustrates how a cgroup's
	 * ancestor can be accessed using bpf_cgroup_ancestor().
	 */
	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
	{
		struct cgroup *parent;

		/* The parent cgroup resides at the level before the current cgroup's level. */
		parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
		if (!parent)
			return -ENOENT;

		bpf_printk("Parent id is %d", parent->self.id);

		/* Release the reference to the parent cgroup that was acquired above. */
		bpf_cgroup_release(parent);
		return 0;
	}

4.3 struct cpumask * kfuncs
---------------------------

BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
destroy ``struct cpumask *`` objects. Please refer to
:ref:`cpumasks-header-label` for more details.

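As a brief taste of those kfuncs, the fragment below allocates a cpumask, sets
and queries one CPU, and releases the mask again. It is a sketch that assumes
the kfunc prototypes shown in the ``__ksym`` declarations; the program name is
hypothetical, and :ref:`cpumasks-header-label` remains the reference for the
exact signatures and semantics:

.. code-block:: c

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* Assumed prototypes; see the cpumask documentation for details. */
	struct bpf_cpumask *bpf_cpumask_create(void) __ksym;
	void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;
	void bpf_cpumask_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
	bool bpf_cpumask_test_cpu(u32 cpu, const struct cpumask *cpumask) __ksym;

	SEC("tp_btf/task_newtask")
	int BPF_PROG(cpumask_example, struct task_struct *task, u64 clone_flags)
	{
		struct bpf_cpumask *mask;

		/* Allocation may fail, so the result must be NULL checked. */
		mask = bpf_cpumask_create();
		if (!mask)
			return 0;

		bpf_cpumask_set_cpu(0, mask);
		if (bpf_cpumask_test_cpu(0, (const struct cpumask *)mask))
			bpf_printk("CPU 0 is set");

		/* The acquired reference must be dropped before returning. */
		bpf_cpumask_release(mask);
		return 0;
	}

	char _license[] SEC("license") = "GPL";
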