kernel_optimize_test/kernel
Alexei Starovoitov d83525ca62 bpf: introduce bpf_spin_lock
Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
bpf program serialize access to other variables.

Example:
struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
};
struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
}

Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
  cause dead locks.
- only one 'struct bpf_spin_lock' is allowed per map element.
  It drastically simplifies implementation yet allows bpf program to use
  any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via
  bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
  a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
  Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
  bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
  insufficient preemption checks

Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
  Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
  Other solutions to avoid nested bpf_spin_lock are possible.
  Like making sure that all networking progs run with softirq disabled.
  spin_lock_irqsave is the simplest and doesn't add overhead to the
  programs that don't use it.
- arch_spinlock_t is used when its implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
  sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
  extra flag during map_create, but specifying it via BTF is cleaner.
  It provides introspection for map key/value and reduces user mistakes.

Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
  to request kernel to grab bpf_spin_lock before rewriting the value.
  That will serialize access to map elements.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-01 20:55:38 +01:00
..
bpf bpf: introduce bpf_spin_lock 2019-02-01 20:55:38 +01:00
cgroup bpf, cgroups: clean up kerneldoc warnings 2019-01-31 10:32:01 +01:00
configs kvm_config: add CONFIG_VIRTIO_MENU 2018-10-24 20:55:56 -04:00
debug kdb: use bool for binary state indicators 2018-12-30 08:31:52 +00:00
dma swiotlb: clear io_tlb_start and io_tlb_end in swiotlb_exit 2019-01-16 09:59:17 -05:00
events Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
gcov
irq genirq/irqdesc: Fix double increment in alloc_descs() 2019-01-18 00:43:09 +01:00
livepatch livepatch: Replace synchronize_sched() with synchronize_rcu() 2018-12-01 12:38:50 -08:00
locking locking/rwsem: Fix (possible) missed wakeup 2019-01-21 11:15:39 +01:00
power mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
printk Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
rcu rcutorture: Don't do busted forward-progress testing 2018-12-01 12:45:42 -08:00
sched sched/wake_q: Fix wakeup ordering for wake_q 2019-01-21 11:15:37 +01:00
time posix-cpu-timers: Unbreak timer rearming 2019-01-15 16:34:37 +01:00
trace tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create() 2019-01-15 11:33:45 -05:00
.gitignore
acct.c
async.c
audit_fsnotify.c audit: minimize our use of audit_log_format() 2018-11-26 18:40:00 -05:00
audit_tree.c audit: minimize our use of audit_log_format() 2018-11-26 18:40:00 -05:00
audit_watch.c audit: minimize our use of audit_log_format() 2018-11-26 18:40:00 -05:00
audit.c audit: remove duplicated include from audit.c 2018-12-14 12:09:30 -05:00
audit.h audit: use current whenever possible 2018-11-26 18:41:21 -05:00
auditfilter.c
auditsc.c audit: use current whenever possible 2018-11-26 18:41:21 -05:00
backtracetest.c
bounds.c kbuild: fix kernel/bounds.c 'W=1' warning 2018-10-31 08:54:14 -07:00
capability.c
compat.c make 'user_access_begin()' do 'access_ok()' 2019-01-04 12:56:09 -08:00
configs.c
context_tracking.c
cpu_pm.c
cpu.c x86/speculation: Rework SMT state change 2018-11-28 11:57:07 +01:00
crash_core.c kernel/crash_core.c: print timestamp using time64_t 2018-08-22 10:52:47 -07:00
crash_dump.c
cred.c cred: export get_task_cred(). 2018-12-19 13:52:44 -05:00
delayacct.c delayacct: track delays from thrashing cache pages 2018-10-26 16:26:32 -07:00
dma.c
elfcore.c
exec_domain.c
exit.c sched/wait: Fix rcuwait_wake_up() ordering 2019-01-21 11:15:36 +01:00
extable.c
fail_function.c kernel/fail_function.c: remove meaningless null pointer check before debugfs_remove_recursive 2018-10-31 08:54:12 -07:00
fork.c Merge branch 'akpm' (patches from Andrew) 2019-01-08 18:58:29 -08:00
freezer.c
futex.c futex: Fix (possible) missed wakeup 2019-01-21 11:15:38 +01:00
groups.c
hung_task.c kernel/hung_task.c: break RCU locks based on jiffies 2019-01-04 13:13:45 -08:00
iomem.c
irq_work.c
jump_label.c jump_label: move 'asm goto' support test to Kconfig 2019-01-06 09:46:51 +09:00
kallsyms.c kallsyms: reduce size a little on 64-bit 2018-09-10 22:54:33 +09:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks bpf: introduce bpf_spin_lock 2019-02-01 20:55:38 +01:00
Kconfig.preempt kconfig: warn no new line at end of file 2018-12-15 17:44:35 +09:00
kcov.c kernel/kcov.c: mark write_comp_data() as notrace 2019-01-04 13:13:47 -08:00
kexec_core.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
kexec_file.c kexec_file: kexec_walk_memblock() only walks a dedicated region at kdump 2018-12-06 14:38:50 +00:00
kexec_internal.h
kexec.c
kmod.c
kprobes.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 14:45:18 -08:00
ksysfs.c
kthread.c
latencytop.c
Makefile kbuild: change filechk to surround the given command with { } 2019-01-06 09:46:51 +09:00
memremap.c mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm 2018-12-28 12:11:52 -08:00
module_signing.c modsign: use all trusted keys to verify module signature 2018-11-07 14:41:41 +01:00
module-internal.h
module.c jump_label: move 'asm goto' support test to Kconfig 2019-01-06 09:46:51 +09:00
notifier.c
nsproxy.c
padata.c padata: clean an indentation issue, remove extraneous space 2018-11-16 14:11:04 +08:00
panic.c kernel/sysctl: add panic_print into sysctl 2019-01-04 13:13:47 -08:00
params.c
pid_namespace.c signal: Use group_send_sig_info to kill all processes in a pid namespace 2018-09-16 16:08:25 +02:00
pid.c Fix failure path in alloc_pid() 2018-12-28 12:42:30 -08:00
profile.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
ptrace.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
range.c
reboot.c kernel/reboot.c: export pm_power_off_prepare 2018-09-11 16:13:24 +01:00
relay.c
resource.c kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable 2018-12-28 12:11:49 -08:00
rseq.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
seccomp.c seccomp: fix UAF in user-trap code 2019-01-15 09:43:12 -08:00
signal.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
smp.c smp,cpumask: introduce on_each_cpu_cond_mask 2018-10-09 16:51:11 +02:00
smpboot.c
smpboot.h
softirq.c Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-10-25 11:43:47 -07:00
stackleak.c stackleak: Mark stackleak_track_stack() as notrace 2018-12-05 19:31:44 -08:00
stacktrace.c
stop_machine.c
sys_ni.c y2038: socket: Add compat_sys_recvmmsg_time64 2018-12-18 16:13:04 +01:00
sys.c kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymore 2019-01-14 10:38:03 +12:00
sysctl_binary.c kernel/sysctl: add panic_print into sysctl 2019-01-04 13:13:47 -08:00
sysctl.c kernel/sysctl: add panic_print into sysctl 2019-01-04 13:13:47 -08:00
task_work.c
taskstats.c
test_kprobes.c
torture.c torture: Remove unnecessary "ret" variables 2018-12-01 12:45:35 -08:00
tracepoint.c tracing: Replace synchronize_sched() and call_rcu_sched() 2018-11-27 09:21:41 -08:00
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c umh: add exit routine for UMH process 2019-01-11 18:05:40 -08:00
up.c smp,cpumask: introduce on_each_cpu_cond_mask 2018-10-09 16:51:11 +02:00
user_namespace.c userns: also map extents in the reverse map to kernel IDs 2018-11-07 23:51:16 -06:00
user-return-notifier.c
user.c userns: use irqsave variant of refcount_dec_and_lock() 2018-08-22 10:52:47 -07:00
utsname_sysctl.c
utsname.c
watchdog_hld.c watchdog: Mark watchdog touch functions as notrace 2018-08-30 12:56:40 +02:00
watchdog.c watchdog: Mark watchdog touch functions as notrace 2018-08-30 12:56:40 +02:00
workqueue_internal.h
workqueue.c workqueue: Replace call_rcu_sched() with call_rcu() 2018-11-27 09:21:44 -08:00