Commit ae28ed45 authored by Linus Torvalds's avatar Linus Torvalds
Browse files
Pull bpf updates from Alexei Starovoitov:

 - Support pulling non-linear xdp data with bpf_xdp_pull_data() kfunc
   (Amery Hung)

   Applied as a stable branch in bpf-next and net-next trees.

 - Support reading skb metadata via bpf_dynptr (Jakub Sitnicki)

   Also a stable branch in bpf-next and net-next trees.

 - Enforce expected_attach_type for tailcall compatibility (Daniel
   Borkmann)

 - Replace path-sensitive with path-insensitive live stack analysis in
   the verifier (Eduard Zingerman)

   This is a significant change in the verification logic. More details,
   motivation, long term plans are in the cover letter/merge commit.

 - Support signed BPF programs (KP Singh)

   This is another major feature that took years to materialize.

   Algorithm details are in the cover letter/marge commit

 - Add support for may_goto instruction to s390 JIT (Ilya Leoshkevich)

 - Add support for may_goto instruction to arm64 JIT (Puranjay Mohan)

 - Fix USDT SIB argument handling in libbpf (Jiawei Zhao)

 - Allow uprobe-bpf program to change context registers (Jiri Olsa)

 - Support signed loads from BPF arena (Kumar Kartikeya Dwivedi and
   Puranjay Mohan)

 - Allow access to union arguments in tracing programs (Leon Hwang)

 - Optimize rcu_read_lock() + migrate_disable() combination where it's
   used in BPF subsystem (Menglong Dong)

 - Introduce bpf_task_work_schedule*() kfuncs to schedule deferred
   execution of BPF callback in the context of a specific task using the
   kernel’s task_work infrastructure (Mykyta Yatsenko)

 - Enforce RCU protection for KF_RCU_PROTECTED kfuncs (Kumar Kartikeya
   Dwivedi)

 - Add stress test for rqspinlock in NMI (Kumar Kartikeya Dwivedi)

 - Improve the precision of tnum multiplier verifier operation
   (Nandakumar Edamana)

 - Use tnums to improve is_branch_taken() logic (Paul Chaignon)

 - Add support for atomic operations in arena in riscv JIT (Pu Lehui)

 - Report arena faults to BPF error stream (Puranjay Mohan)

 - Search for tracefs at /sys/kernel/tracing first in bpftool (Quentin
   Monnet)

 - Add bpf_strcasecmp() kfunc (Rong Tao)

 - Support lookup_and_delete_elem command in BPF_MAP_STACK_TRACE (Tao
   Chen)

* tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (197 commits)
  libbpf: Replace AF_ALG with open coded SHA-256
  selftests/bpf: Add stress test for rqspinlock in NMI
  selftests/bpf: Add test case for different expected_attach_type
  bpf: Enforce expected_attach_type for tailcall compatibility
  bpftool: Remove duplicate string.h header
  bpf: Remove duplicate crypto/sha2.h header
  libbpf: Fix error when st-prefix_ops and ops from differ btf
  selftests/bpf: Test changing packet data from kfunc
  selftests/bpf: Add stacktrace map lookup_and_delete_elem test case
  selftests/bpf: Refactor stacktrace_map case with skeleton
  bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
  selftests/bpf: Fix flaky bpf_cookie selftest
  selftests/bpf: Test changing packet data from global functions with a kfunc
  bpf: Emit struct bpf_xdp_sock type in vmlinux BTF
  selftests/bpf: Task_work selftest cleanup fixes
  MAINTAINERS: Delete inactive maintainers from AF_XDP
  bpf: Mark kfuncs as __noclone
  selftests/bpf: Add kprobe multi write ctx attach test
  selftests/bpf: Add kprobe write ctx attach test
  selftests/bpf: Add uprobe context ip register change test
  ...
parents 4b81e2eb 4ef77dd5
Loading
Loading
Loading
Loading
+6 −0
Original line number Diff line number Diff line
@@ -3912,6 +3912,12 @@ S: C/ Federico Garcia Lorca 1 10-A
S: Sevilla 41005
S: Spain

N: Björn Töpel
E: bjorn@kernel.org
D: AF_XDP
S: Gothenburg
S: Sweden

N: Linus Torvalds
E: torvalds@linux-foundation.org
D: Original kernel hacker
+18 −1
Original line number Diff line number Diff line
@@ -335,9 +335,26 @@ consider doing refcnt != 0 check, especially when returning a KF_ACQUIRE
pointer. Note as well that a KF_ACQUIRE kfunc that is KF_RCU should very likely
also be KF_RET_NULL.

2.4.8 KF_RCU_PROTECTED flag
---------------------------

The KF_RCU_PROTECTED flag is used to indicate that the kfunc must be invoked in
an RCU critical section. This is assumed by default in non-sleepable programs,
and must be explicitly ensured by calling ``bpf_rcu_read_lock`` for sleepable
ones.

If the kfunc returns a pointer value, this flag also enforces that the returned
pointer is RCU protected, and can only be used while the RCU critical section is
active.

The flag is distinct from the ``KF_RCU`` flag, which only ensures that its
arguments are at least RCU protected pointers. This may transitively imply that
RCU protection is ensured, but it does not work in cases of kfuncs which require
RCU protection but do not take RCU protected arguments.

.. _KF_deprecated_flag:

2.4.8 KF_DEPRECATED flag
2.4.9 KF_DEPRECATED flag
------------------------

The KF_DEPRECATED flag is used for kfuncs which are scheduled to be
+0 −264
Original line number Diff line number Diff line
@@ -347,270 +347,6 @@ However, only the value of register ``r1`` is important to successfully finish
verification. The goal of the liveness tracking algorithm is to spot this fact
and figure out that both states are actually equivalent.

Data structures
~~~~~~~~~~~~~~~

Liveness is tracked using the following data structures::

  enum bpf_reg_liveness {
	REG_LIVE_NONE = 0,
	REG_LIVE_READ32 = 0x1,
	REG_LIVE_READ64 = 0x2,
	REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
	REG_LIVE_WRITTEN = 0x4,
	REG_LIVE_DONE = 0x8,
  };

  struct bpf_reg_state {
 	...
	struct bpf_reg_state *parent;
 	...
	enum bpf_reg_liveness live;
 	...
  };

  struct bpf_stack_state {
	struct bpf_reg_state spilled_ptr;
	...
  };

  struct bpf_func_state {
	struct bpf_reg_state regs[MAX_BPF_REG];
        ...
	struct bpf_stack_state *stack;
  }

  struct bpf_verifier_state {
	struct bpf_func_state *frame[MAX_CALL_FRAMES];
	struct bpf_verifier_state *parent;
        ...
  }

* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new
  verifier state creation;

* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is
  defined by some instruction verified between this verifier state's parent and
  verifier state itself;

* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot)
  is read by a some child state of this verifier state;

* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid
  processing same verifier state multiple times and for some sanity checks;

* ``->live`` field values are formed by combining ``enum bpf_reg_liveness``
  values using bitwise or.

Register parentage chains
~~~~~~~~~~~~~~~~~~~~~~~~~

In order to propagate information between parent and child states, a *register
parentage chain* is established. Each register or stack slot is linked to a
corresponding register or stack slot in its parent state via a ``->parent``
pointer. This link is established upon state creation in ``is_state_visited()``
and might be modified by ``set_callee_state()`` called from
``__check_func_call()``.

The rules for correspondence between registers / stack slots are as follows:

* For the current stack frame, registers and stack slots of the new state are
  linked to the registers and stack slots of the parent state with the same
  indices.

* For the outer stack frames, only callee saved registers (r6-r9) and stack
  slots are linked to the registers and stack slots of the parent state with the
  same indices.

* When function call is processed a new ``struct bpf_func_state`` instance is
  allocated, it encapsulates a new set of registers and stack slots. For this
  new frame, parent links for r6-r9 and stack slots are set to nil, parent links
  for r1-r5 are set to match caller r1-r5 parent links.

This could be illustrated by the following diagram (arrows stand for
``->parent`` pointers)::

      ...                    ; Frame #0, some instructions
  --- checkpoint #0 ---
  1 : r6 = 42                ; Frame #0
  --- checkpoint #1 ---
  2 : call foo()             ; Frame #0
      ...                    ; Frame #1, instructions from foo()
  --- checkpoint #2 ---
      ...                    ; Frame #1, instructions from foo()
  --- checkpoint #3 ---
      exit                   ; Frame #1, return from foo()
  3 : r1 = r6                ; Frame #0  <- current state

             +-------------------------------+-------------------------------+
             |           Frame #0            |           Frame #1            |
  Checkpoint +-------------------------------+-------------------------------+
  #0         | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+
                ^    ^       ^       ^
                |    |       |       |
  Checkpoint +-------------------------------+
  #1         | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+
                     ^       ^       ^
                     |_______|_______|_______________
                             |       |               |
               nil  nil      |       |               |      nil     nil
                |    |       |       |               |       |       |
  Checkpoint +-------------------------------+-------------------------------+
  #2         | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+-------------------------------+
                             ^       ^               ^       ^       ^
               nil  nil      |       |               |       |       |
                |    |       |       |               |       |       |
  Checkpoint +-------------------------------+-------------------------------+
  #3         | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+-------------------------------+
                             ^       ^
               nil  nil      |       |
                |    |       |       |
  Current    +-------------------------------+
  state      | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+
                             \
                               r6 read mark is propagated via these links
                               all the way up to checkpoint #1.
                               The checkpoint #1 contains a write mark for r6
                               because of instruction (1), thus read propagation
                               does not reach checkpoint #0 (see section below).

Liveness marks tracking
~~~~~~~~~~~~~~~~~~~~~~~

For each processed instruction, the verifier tracks read and written registers
and stack slots. The main idea of the algorithm is that read marks propagate
back along the state parentage chain until they hit a write mark, which 'screens
off' earlier states from the read. The information about reads is propagated by
function ``mark_reg_read()`` which could be summarized as follows::

  mark_reg_read(struct bpf_reg_state *state, ...):
      parent = state->parent
      while parent:
          if state->live & REG_LIVE_WRITTEN:
              break
          if parent->live & REG_LIVE_READ64:
              break
          parent->live |= REG_LIVE_READ64
          state = parent
          parent = state->parent

Notes:

* The read marks are applied to the **parent** state while write marks are
  applied to the **current** state. The write mark on a register or stack slot
  means that it is updated by some instruction in the straight-line code leading
  from the parent state to the current state.

* Details about REG_LIVE_READ32 are omitted.

* Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`)
  might override the first parent link. Please refer to the comments in the
  ``propagate_liveness()`` and ``mark_reg_read()`` source code for further
  details.

Because stack writes could have different sizes ``REG_LIVE_WRITTEN`` marks are
applied conservatively: stack slots are marked as written only if write size
corresponds to the size of the register, e.g. see function ``save_register_state()``.

Consider the following example::

  0: (*u64)(r10 - 8) = 0   ; define 8 bytes of fp-8
  --- checkpoint #0 ---
  1: (*u32)(r10 - 8) = 1   ; redefine lower 4 bytes
  2: r1 = (*u32)(r10 - 8)  ; read lower 4 bytes defined at (1)
  3: r2 = (*u32)(r10 - 4)  ; read upper 4 bytes defined at (0)

As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should
it be otherwise, the algorithm above wouldn't be able to propagate the read mark
from (3) to checkpoint #0.

Once the ``BPF_EXIT`` instruction is reached ``update_branch_counts()`` is
called to update the ``->branches`` counter for each verifier state in a chain
of parent verifier states. When the ``->branches`` counter reaches zero the
verifier state becomes a valid entry in a set of cached verifier states.

Each entry of the verifier states cache is post-processed by a function
``clean_live_states()``. This function marks all registers and stack slots
without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``.
Registers/stack slots marked in this way are ignored in function ``stacksafe()``
called from ``states_equal()`` when a state cache entry is considered for
equivalence with a current state.

Now it is possible to explain how the example from the beginning of the section
works::

  0: call bpf_get_prandom_u32()
  1: r1 = 0
  2: if r0 == 0 goto +1
  3: r0 = 1
  --- checkpoint[0] ---
  4: r0 = r1
  5: exit

* At instruction #2 branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }``
  is pushed to states processing queue (pc stands for program counter).

* At instruction #4:

  * ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``;
  * ``checkpoint[0].r0`` is marked as written;
  * ``checkpoint[0].r1`` is marked as read;

* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
  by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a
  read mark and all other registers and stack slots are marked as ``NOT_INIT``
  or ``STACK_INVALID``

* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue
  and is compared against a cached state ``{ r1 == 0, pc == 4 }``, the states
  are considered equivalent.

.. _read_marks_for_cache_hits:

Read marks propagation for cache hits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another point is the handling of read marks when a previously verified state is
found in the states cache. Upon cache hit verifier must behave in the same way
as if the current state was verified to the program exit. This means that all
read marks, present on registers and stack slots of the cached state, must be
propagated over the parentage chain of the current state. Example below shows
why this is important. Function ``propagate_liveness()`` handles this case.

Consider the following state parentage chain (S is a starting state, A-E are
derived states, -> arrows show which state is derived from which)::

                   r1 read
            <-------------                A[r1] == 0
                                          C[r1] == 0
      S ---> A ---> B ---> exit           E[r1] == 1
      |
      ` ---> C ---> D
      |
      ` ---> E      ^
                    |___   suppose all these
             ^           states are at insn #Y
             |
      suppose all these
    states are at insn #X

* Chain of states ``S -> A -> B -> exit`` is verified first.

* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is
  propagated up to state ``A``.

* When chain of states ``C -> D`` is verified the state ``D`` turns out to be
  equivalent to state ``B``.

* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state
  ``C`` might get mistakenly marked as equivalent to state ``E`` even though
  values for register ``r1`` differ between ``C`` and ``E``.

Understanding eBPF verifier messages
====================================

+0 −2
Original line number Diff line number Diff line
@@ -27466,10 +27466,8 @@ F: tools/testing/selftests/bpf/*xdp*
K:	(?:\b|_)xdp(?:\b|_)
XDP SOCKETS (AF_XDP)
M:	Björn Töpel <bjorn@kernel.org>
M:	Magnus Karlsson <magnus.karlsson@intel.com>
M:	Maciej Fijalkowski <maciej.fijalkowski@intel.com>
R:	Jonathan Lemon <jonathan.lemon@gmail.com>
R:	Stanislav Fomichev <sdf@fomichev.me>
L:	netdev@vger.kernel.org
L:	bpf@vger.kernel.org
+1 −1
Original line number Diff line number Diff line
@@ -2,4 +2,4 @@
#
# ARM64 networking code
#
obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o
obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o bpf_timed_may_goto.o
Loading