kernel_optimize_test/kernel
Yonghong Song 0c9a876f28 bpf: Fix potentially incorrect results with bpf_get_local_storage()
commit a2baf4e8bb0f306fbed7b5e6197c02896a638ab5 upstream.

Commit b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
helper") fixed a bug for bpf_get_local_storage() helper so different tasks
won't mess up with each other's percpu local storage.

The percpu data contains 8 slots so it can hold up to 8 contexts (same or
different tasks), for 8 different program runs, at the same time. This in
general is sufficient. But our internal testing showed the following warning
multiple times:

  [...]
  warning: WARNING: CPU: 13 PID: 41661 at include/linux/bpf-cgroup.h:193
     __cgroup_bpf_run_filter_sock_ops+0x13e/0x180
  RIP: 0010:__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
  <IRQ>
   tcp_call_bpf.constprop.99+0x93/0xc0
   tcp_conn_request+0x41e/0xa50
   ? tcp_rcv_state_process+0x203/0xe00
   tcp_rcv_state_process+0x203/0xe00
   ? sk_filter_trim_cap+0xbc/0x210
   ? tcp_v6_inbound_md5_hash.constprop.41+0x44/0x160
   tcp_v6_do_rcv+0x181/0x3e0
   tcp_v6_rcv+0xc65/0xcb0
   ip6_protocol_deliver_rcu+0xbd/0x450
   ip6_input_finish+0x11/0x20
   ip6_input+0xb5/0xc0
   ip6_sublist_rcv_finish+0x37/0x50
   ip6_sublist_rcv+0x1dc/0x270
   ipv6_list_rcv+0x113/0x140
   __netif_receive_skb_list_core+0x1a0/0x210
   netif_receive_skb_list_internal+0x186/0x2a0
   gro_normal_list.part.170+0x19/0x40
   napi_complete_done+0x65/0x150
   mlx5e_napi_poll+0x1ae/0x680
   __napi_poll+0x25/0x120
   net_rx_action+0x11e/0x280
   __do_softirq+0xbb/0x271
   irq_exit_rcu+0x97/0xa0
   common_interrupt+0x7f/0xa0
   </IRQ>
   asm_common_interrupt+0x1e/0x40
  RIP: 0010:bpf_prog_1835a9241238291a_tw_egress+0x5/0xbac
   ? __cgroup_bpf_run_filter_skb+0x378/0x4e0
   ? do_softirq+0x34/0x70
   ? ip6_finish_output2+0x266/0x590
   ? ip6_finish_output+0x66/0xa0
   ? ip6_output+0x6c/0x130
   ? ip6_xmit+0x279/0x550
   ? ip6_dst_check+0x61/0xd0
  [...]

Using drgn [0] to dump the percpu buffer contents showed that on this CPU
slot 0 is still available, but slots 1-7 are occupied and those tasks in
slots 1-7 mostly don't exist any more. So we might have issues in
bpf_cgroup_storage_unset().

Further debugging confirmed that there is a bug in bpf_cgroup_storage_unset().
Currently, it tries to unset "current" slot with searching from the start.
So the following sequence is possible:

  1. A task is running and claims slot 0
  2. Running BPF program is done, and it checked slot 0 has the "task"
     and ready to reset it to NULL (not yet).
  3. An interrupt happens, another BPF program runs and it claims slot 1
     with the *same* task.
  4. The unset() in interrupt context releases slot 0 since it matches "task".
  5. Interrupt is done, the task in process context reset slot 0.

At the end, slot 1 is not reset and the same process can continue to occupy
slots 2-7 and finally, when the above step 1-5 is repeated again, step 3 BPF
program won't be able to claim an empty slot and a warning will be issued.

To fix the issue, for unset() function, we should traverse from the last slot
to the first. This way, the above issue can be avoided.

The same reverse traversal should also be done in bpf_get_local_storage() helper
itself. Otherwise, incorrect local storage may be returned to BPF program.

  [0] https://github.com/osandov/drgn

Fixes: b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210810010413.1976277-1-yhs@fb.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03 10:09:31 +02:00
..
bpf bpf: Fix potentially incorrect results with bpf_get_local_storage() 2021-09-03 10:09:31 +02:00
cgroup cgroup1: fix leaked context root causing sporadic NULL deref in LTP 2021-07-31 08:16:11 +02:00
configs
debug kgdb: fix to kill breakpoints on initmem after boot 2021-03-04 11:38:46 +01:00
dma dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable} 2021-07-28 14:35:38 +02:00
entry x86/entry: Move nmi entry/exit into common code 2021-03-17 17:06:36 +01:00
events perf: Fix data race between pin_count increment/decrement 2021-06-16 12:01:45 +02:00
gcov gcov: re-fix clang-11+ support 2021-04-14 08:41:58 +02:00
irq genirq/timings: Prevent potential array overflow in __irq_timings_store() 2021-08-18 08:59:15 +02:00
kcsan kcsan: Fix debugfs initcall return type 2021-05-26 12:06:54 +02:00
livepatch kernel/: fix repeated words in comments 2020-10-16 11:11:19 -07:00
locking lockdep: Fix wait-type for empty stack 2021-07-14 16:56:10 +02:00
power PM: EM: postpone creating the debugfs dir till fs_initcall 2021-03-30 14:32:04 +02:00
printk printk: fix deadlock when kernel panic 2021-03-04 11:38:41 +01:00
rcu srcu: Provide polling interfaces for Tiny SRCU grace periods 2021-09-03 10:09:30 +02:00
sched kthread: Fix PF_KTHREAD vs to_kthread() race 2021-09-03 10:09:31 +02:00
time timers: Move clearing of base::timer_running under base:: Lock 2021-08-12 13:22:15 +02:00
trace tracing / histogram: Fix NULL pointer dereference on strcmp() on NULL event name 2021-08-26 08:35:54 -04:00
.gitignore kbuild: update config_data.gz only when the content of .config is changed 2021-05-11 14:47:37 +02:00
acct.c kernel: acct.c: fix some kernel-doc nits 2020-10-16 11:11:19 -07:00
async.c
audit_fsnotify.c fsnotify: generalize handle_inode_event() 2020-12-30 11:54:18 +01:00
audit_tree.c audit: move put_tree() to avoid trim_trees refcount underflow and UAF 2021-09-03 10:09:31 +02:00
audit_watch.c fsnotify: generalize handle_inode_event() 2020-12-30 11:54:18 +01:00
audit.c audit: Remove redundant null check 2020-08-26 09:10:39 -04:00
audit.h audit: change unnecessary globals into statics 2020-08-17 20:26:58 -04:00
auditfilter.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
auditsc.c
backtracetest.c
bounds.c
capability.c LSM: Signal to SafeSetID when setting group IDs 2020-10-13 09:17:34 -07:00
compat.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
configs.c
context_tracking.c
cpu_pm.c notifier: Fix broken error handling pattern 2020-09-01 09:58:03 +02:00
cpu.c cpu/hotplug: Cure the cpusets trainwreck 2021-07-19 09:44:59 +02:00
crash_core.c crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo 2021-06-23 14:42:52 +02:00
crash_dump.c
cred.c ucounts: Increase ucounts reference counter before the security hook 2021-09-03 10:09:24 +02:00
delayacct.c
dma.c
exec_domain.c
exit.c kernel/io_uring: cancel io_uring before task works 2021-01-30 13:55:18 +01:00
extable.c
fail_function.c fail_function: Remove a redundant mutex unlock 2020-11-19 11:58:16 -08:00
fork.c sched/core: Initialize the idle task with preemption disabled 2021-07-14 16:55:50 +02:00
freezer.c Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing" 2021-04-07 15:00:14 +02:00
futex.c mm, futex: fix shared futex pgoff on shmem huge page 2021-06-30 08:47:29 -04:00
gen_kheaders.sh
groups.c LSM: Signal to SafeSetID when setting group IDs 2020-10-13 09:17:34 -07:00
hung_task.c kernel/hung_task.c: make type annotations consistent 2020-11-02 12:14:19 -08:00
iomem.c
irq_work.c
jump_label.c jump_label: Fix jump_label_text_reserved() vs __init 2021-07-20 16:05:58 +02:00
kallsyms.c treewide: Convert macro and uses of __section(foo) to __section("foo") 2020-10-25 14:51:49 -07:00
kcmp.c exec: Transform exec_update_mutex into a rw_semaphore 2021-01-09 13:46:24 +01:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kcov: make some symbols static 2020-08-12 10:58:02 -07:00
kexec_core.c kernel: kexec: remove the lock operation of system_transition_mutex 2021-02-03 23:28:37 +01:00
kexec_elf.c
kexec_file.c kernel: kexec_file: fix error return code of kexec_calculate_store_digests() 2021-05-19 10:13:09 +02:00
kexec_internal.h
kexec.c LSM: Introduce kernel_post_load_data() hook 2020-10-05 13:37:03 +02:00
kheaders.c
kmod.c kmod: remove redundant "be an" in the comment 2020-08-12 10:58:01 -07:00
kprobes.c kprobes: Fix to delay the kprobes jump optimization 2021-03-04 11:38:35 +01:00
ksysfs.c
kthread.c kthread: Fix PF_KTHREAD vs to_kthread() race 2021-09-03 10:09:31 +02:00
latencytop.c
Makefile kbuild: update config_data.gz only when the content of .config is changed 2021-05-11 14:47:37 +02:00
module_signature.c module: harden ELF info handling 2021-03-25 09:04:11 +01:00
module_signing.c module: harden ELF info handling 2021-03-25 09:04:11 +01:00
module-internal.h
module.c module: limit enabling module.sig_enforce 2021-06-30 08:47:15 -04:00
notifier.c notifier: Fix broken error handling pattern 2020-09-01 09:58:03 +02:00
nsproxy.c
padata.c padata: fix possible padata_works_lock deadlock 2020-09-04 17:51:55 +10:00
panic.c panic: don't dump stack twice on warn 2020-11-14 11:26:04 -08:00
params.c params: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
pid_namespace.c kernel/: fix repeated words in comments 2020-10-16 11:11:19 -07:00
pid.c exec: Transform exec_update_mutex into a rw_semaphore 2021-01-09 13:46:24 +01:00
profile.c
ptrace.c ptrace: make ptrace() fail if the tracee changed its pid unexpectedly 2021-05-26 12:06:49 +02:00
range.c kernel.h: split out min()/max() et al. helpers 2020-10-16 11:11:19 -07:00
reboot.c reboot: fix overflow parsing reboot cpu number 2020-11-14 11:26:03 -08:00
regset.c
relay.c kernel/relay.c: drop unneeded initialization 2020-10-16 11:11:22 -07:00
resource.c kernel/resource: make walk_mem_res() find all busy IORESOURCE_MEM resources 2021-05-19 10:13:09 +02:00
rseq.c
scftorture.c scftorture: Add cond_resched() to test loop 2020-08-24 18:38:38 -07:00
scs.c mm: memcontrol: account kernel stack per node 2020-08-07 11:33:25 -07:00
seccomp.c seccomp: Fix setting loaded filter count during TSYNC 2021-08-18 08:59:06 +02:00
signal.c ptrace: fix task_join_group_stop() for the case when current is traced 2020-11-02 12:14:19 -08:00
smp.c smp: Fix smp_call_function_single_async prototype 2021-05-14 09:50:46 +02:00
smpboot.c sched/core: Initialize the idle task with preemption disabled 2021-07-14 16:55:50 +02:00
smpboot.h
softirq.c softirq: Add debug check to __raise_softirq_irqoff() 2020-09-16 15:18:56 +02:00
stackleak.c stackleak: let stack_erasing_sysctl take a kernel pointer buffer 2020-09-19 13:13:39 -07:00
stacktrace.c stacktrace: Remove reliable argument from arch_stack_walk() callback 2020-09-18 14:24:16 +01:00
static_call.c static_call: Fix static_call_text_reserved() vs __init 2021-07-20 16:05:58 +02:00
stop_machine.c stop_machine, rcu: Mark functions as notrace 2020-10-26 12:12:27 +01:00
sys_ni.c mm/madvise: introduce process_madvise() syscall: an external memory hinting API 2020-10-18 09:27:10 -07:00
sys.c Add a reference to ucounts for each cred 2021-07-14 16:55:48 +02:00
sysctl-test.c
sysctl.c sysctl.c: fix underflow value setting risk in vm_table 2021-03-17 17:06:25 +01:00
task_work.c task_work: cleanup notification modes 2020-10-17 15:05:30 -06:00
taskstats.c taskstats: move specifying netlink policy back to ops 2020-10-02 19:11:12 -07:00
test_kprobes.c
torture.c
tracepoint.c tracepoint: Use rcu get state and cond sync for static call updates 2021-09-03 10:09:30 +02:00
tsacct.c
ucount.c Add a reference to ucounts for each cred 2021-07-14 16:55:48 +02:00
uid16.c
uid16.h
umh.c usermodehelper: reset umask to default before executing user process 2020-10-06 10:31:52 -07:00
up.c smp: Fix smp_call_function_single_async prototype 2021-05-14 09:50:46 +02:00
user_namespace.c Add a reference to ucounts for each cred 2021-07-14 16:55:48 +02:00
user-return-notifier.c
user.c
usermode_driver.c bpf: Fix umd memory leak in copy_process() 2021-03-30 14:32:03 +02:00
utsname_sysctl.c
utsname.c
watch_queue.c watch_queue: Limit the number of watches a user can hold 2020-08-17 09:39:18 -07:00
watchdog_hld.c
watchdog.c watchdog: fix barriers when printing backtraces from all CPUs 2021-05-19 10:13:00 +02:00
workqueue_internal.h
workqueue.c workqueue: fix UAF in pwq_unbound_release_workfn() 2021-07-31 08:16:11 +02:00