kernel_optimize_test

Maciej Fijalkowski ebf7d1f508 bpf, x64: rework pro/epilogue and tailcall handling in JIT

This commit serves two things:
1) it optimizes BPF prologue/epilogue generation
2) it makes possible to have tailcalls within BPF subprogram

Both points are related to each other since without 1), 2) could not be achieved.

In [1], Alexei says:

"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
             // function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov rbp, rsp'

This way new function will set its own stack size and will init tail call counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'. Instead it would need to have 'nop2' in there."

Implement that suggestion.

Since the layout of stack is changed, tail call counter handling can not rely anymore on popping it to rbx just like it have been handled for constant prologue case and later overwrite of rbx with actual value of rbx pushed to stack. Therefore, let's use one of the register (%rcx) that is considered to be volatile/caller-saved and pop the value of tail call counter in there in the epilogue.

Drop the BUILD_BUG_ON in emit_prologue and in emit_bpf_tail_call_indirect where instruction layout is not constant anymore.

Introduce new poke target, 'tailcall_bypass' to poke descriptor that is dedicated for skipping the register pops and stack unwind that are generated right before the actual jump to target program. For case when the target program is not present, BPF program will skip the pop instructions and nop5 dedicated for jmpq $target. An example of such state when only R6 of callee saved registers is used by program:

    ffffffffc0513aa1:   e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
    ffffffffc0513aa6:   5b                      pop    %rbx
    ffffffffc0513aa7:   58                      pop    %rax
    ffffffffc0513aa8:   48 81 c4 00 00 00 00    add    $0x0,%rsp
    ffffffffc0513aaf:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    ffffffffc0513ab4:   48 89 df                mov    %rbx,%rdi

When target program is inserted, the jump that was there to skip pops/nop5 will become the nop5, so CPU will go over pops and do the actual tailcall.

One might ask why there simply can not be pushes after the nop5? In the following example snippet:

    ffffffffc037030c:   48 89 fb                mov    %rdi,%rbx
    (...)
    ffffffffc0370332:   5b                      pop    %rbx
    ffffffffc0370333:   58                      pop    %rax
    ffffffffc0370334:   48 81 c4 00 00 00 00    add    $0x0,%rsp
    ffffffffc037033b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    ffffffffc0370340:   48 81 ec 00 00 00 00    sub    $0x0,%rsp
    ffffffffc0370347:   50                      push   %rax
    ffffffffc0370348:   53                      push   %rbx
    ffffffffc0370349:   48 89 df                mov    %rbx,%rdi
    ffffffffc037034c:   e8 f7 21 00 00          callq  0xffffffffc0372548

There is the bpf2bpf call (at ffffffffc037034c) right after the tailcall and jump target is not present. ctx is in %rbx register and BPF subprogram that we will call into on ffffffffc037034c is relying on it, e.g. it will pick ctx from there. Such code layout is therefore broken as we would overwrite the content of %rbx with the value that was pushed on the prologue. That is the reason for the 'bypass' approach.

Special care needs to be taken during the install/update/remove of tailcall target. In case when target program is not present, the CPU must not execute the pop instructions that precede the tailcall.

To address that, the following states can be defined:
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail

A is forbidden (lead to incorrectness). The state transitions between tailcall install/update/remove will work as follows:

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->tailcall_target) and do NOT touch the poke->tailcall_bypass
Remove tail call: B(f')->C(f')
 * poke->tailcall_bypass is poked back to jump, then we wait the RCU grace period so that other programs will finish its execution and after that we are safe to remove the poke->tailcall_target
Install new tail call (f''): C(f')->D(f'')->B(f'').
 * same as first step

This way CPU can never be exposed to "unwind, tail" state.

Last but not least, when tailcalls get mixed with bpf2bpf calls, it would be possible to encounter the endless loop due to clearing the tailcall counter if for example we would use the tailcall3-like from BPF selftests program that would be subprogram-based, meaning the tailcall would be present within the BPF subprogram.

This test, broken down to particular steps, would do:
entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
func0 -> call subprog_tail
(we are NOT skipping the first 11 bytes of prologue and this subprogram has a tailcall, therefore we clear the counter...)
subprog -> do the same thing as entry

and then loop forever.

To address this, the idea is to go through the call chain of bpf2bpf progs and look for a tailcall presence throughout whole chain. If we saw a single tail call then each node in this call chain needs to be marked as a subprog that can reach the tailcall. We would later feed the JIT with this info and:
- set eax to 0 only when tailcall is reachable and this is the entry prog
- if tailcall is reachable but there's no tailcall in insns of currently JITed prog then push rax anyway, so that it will be possible to propagate further down the call chain
- finally if tailcall is reachable, then we need to precede the 'call' insn with mov rax, [rbp - (stack_depth + 8)]

Tail call related cases from test_verifier kselftest are also working fine. Sample BPF programs that utilize tail calls (sockex3, tracex5) work properly as well.

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

2020-09-17 19:55:30 -07:00
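
The install/remove ordering described in the commit message above can be sketched as a small userspace model. This is not the kernel's code: `struct poke_model` and the `poke_*` helpers are hypothetical names introduced only for illustration, and each patchable site is reduced to a boolean. The remove path is modelled from the message's own wording (bypass poked back to a jump first, target dropped only after the grace period), which by the A/B/C/D table passes through D before reaching C.

```c
#include <stdbool.h>

/* Hypothetical model of one poke descriptor: two patchable 5-byte sites. */
struct poke_model {
	bool bypass_skips; /* tailcall_bypass: true = jmp over the pops ("skip"), false = nop5 */
	bool target_jumps; /* tailcall_target: true = jmp to the target ("tail"), false = nop5 */
};

/* Map the two sites to the state letters used in the commit message. */
static char poke_state(const struct poke_model *p)
{
	if (p->bypass_skips)
		return p->target_jumps ? 'D' : 'C';
	return p->target_jumps ? 'B' : 'A'; /* 'A' is the forbidden state */
}

/* A program with no tail call target installed starts in state C. */
static struct poke_model poke_init(void)
{
	struct poke_model p = { .bypass_skips = true, .target_jumps = false };
	return p;
}

/*
 * Install: C -> D -> B. The state reached after each single-site poke is
 * written to trace[]; the return value is the number of entries written.
 */
static int poke_install(struct poke_model *p, char *trace)
{
	int n = 0;

	p->target_jumps = true;   /* poke the tailcall first */
	trace[n++] = poke_state(p);
	p->bypass_skips = false;  /* only then get rid of the skip */
	trace[n++] = poke_state(p);
	return n;
}

/* Remove: B -> D -> C, again recording every intermediate state. */
static int poke_remove(struct poke_model *p, char *trace)
{
	int n = 0;

	p->bypass_skips = true;   /* poke the bypass back to a jump first */
	trace[n++] = poke_state(p);
	/* the real code waits for an RCU grace period at this point */
	p->target_jumps = false;  /* only then remove the target */
	trace[n++] = poke_state(p);
	return n;
}
```

In this model, reversing the two pokes in either helper would make the trace pass through 'A' (nop, unwind, nop), which is the state the commit message forbids. An update from f to f' repokes only the target site, so it leaves both booleans unchanged and the model stays in B.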
..
preload	bpf: Disallow BPF_PRELOAD in allmodconfig builds	2020-08-25 15:23:46 -07:00
arraymap.c	bpf, x64: rework pro/epilogue and tailcall handling in JIT	2020-09-17 19:55:30 -07:00
bpf_inode_storage.c	bpf: Add map_meta_equal map ops	2020-08-28 15:41:30 +02:00
bpf_iter.c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next	2020-09-01 13:22:59 -07:00
bpf_local_storage.c	bpf: Split bpf_local_storage to bpf_sk_storage	2020-08-25 15:00:04 -07:00
bpf_lru_list.c
bpf_lru_list.h	bpf: Fix a typo "inacitve" -> "inactive"	2020-04-06 21:54:10 +02:00
bpf_lsm.c	bpf: Allow local storage to be used from LSM programs	2020-08-25 15:00:04 -07:00
bpf_struct_ops_types.h
bpf_struct_ops.c	bpf: Move btf_resolve_size into __btf_resolve_size	2020-08-25 15:37:41 -07:00
btf.c	bpf: Add BTF_SET_START/END macros	2020-08-25 15:37:41 -07:00
cgroup.c	bpf: Add support for forced LINK_DETACH command	2020-08-01 20:38:28 -07:00
core.c	bpf, x64: rework pro/epilogue and tailcall handling in JIT	2020-09-17 19:55:30 -07:00
cpumap.c	bpf: {cpu,dev}map: Change various functions return type from int to void	2020-09-01 15:45:58 +02:00
devmap.c	bpf: {cpu,dev}map: Change various functions return type from int to void	2020-09-01 15:45:58 +02:00
disasm.c
disasm.h
dispatcher.c	bpf: Remove bpf_image tree	2020-03-13 12:49:52 -07:00
hashtab.c	bpf: Introduce sleepable BPF programs	2020-08-28 21:20:33 +02:00
helpers.c	bpf: Add bpf_copy_from_user() helper.	2020-08-28 21:20:33 +02:00
inode.c	bpf: Add kernel module with user mode driver that populates bpffs.	2020-08-20 16:02:36 +02:00
local_storage.c	bpf/local_storage: Fix build without CONFIG_CGROUP	2020-07-25 20:16:36 -07:00
lpm_trie.c	bpf: Add map_meta_equal map ops	2020-08-28 15:41:30 +02:00
Makefile	bpf: Implement bpf_local_storage for inodes	2020-08-25 15:00:04 -07:00
map_in_map.c	bpf: Relax max_entries check for most of the inner map types	2020-08-28 15:41:30 +02:00
map_in_map.h	bpf: Add map_meta_equal map ops	2020-08-28 15:41:30 +02:00
map_iter.c	bpf: Implement link_query callbacks in map element iterators	2020-08-21 14:01:39 -07:00
net_namespace.c	bpf: Add support for forced LINK_DETACH command	2020-08-01 20:38:28 -07:00
offload.c	bpf, offload: Replace bitwise AND by logical AND in bpf_prog_offload_info_fill	2020-02-17 16:53:49 +01:00
percpu_freelist.c	bpf: Dont iterate over possible CPUs with interrupts disabled	2020-02-24 16:18:20 -08:00
percpu_freelist.h
prog_iter.c	bpf: Refactor bpf_iter_reg to have separate seq_info member	2020-07-25 20:16:32 -07:00
queue_stack_maps.c	bpf: Add map_meta_equal map ops	2020-08-28 15:41:30 +02:00
reuseport_array.c	bpf: Add map_meta_equal map ops	2020-08-28 15:41:30 +02:00
ringbuf.c	bpf: Add map_meta_equal map ops	2020-08-28 15:41:30 +02:00
stackmap.c	bpf: Add map_meta_equal map ops	2020-08-28 15:41:30 +02:00
syscall.c	bpf: Add BPF_PROG_BIND_MAP syscall	2020-09-15 18:28:27 -07:00
sysfs_btf.c	bpf: Support llvm-objcopy for vmlinux BTF	2020-03-19 12:32:38 +01:00
task_iter.c	bpf: Avoid iterating duplicated files for task_file iterator	2020-09-02 16:40:33 +02:00
tnum.c	bpf: Verifier, do explicit ALU32 bounds tracking	2020-03-30 14:59:53 -07:00
trampoline.c	bpf: Remove bpf_lsm_file_mprotect from sleepable list.	2020-08-31 23:03:57 +02:00
verifier.c	bpf, x64: rework pro/epilogue and tailcall handling in JIT	2020-09-17 19:55:30 -07:00