kernel_optimize_test/kernel
Daniel Jordan 771b663fa5 cpuset: fix race between hotplug work and later CPU offline
commit 406100f3da08066c00105165db8520bbc7694a36 upstream.

One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:

  BUG: unable to handle page fault for address: 0000607f820054db
  CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
  Workqueue: events cpuset_hotplug_workfn
  RIP: build_sched_domains
  Call Trace:
   partition_sched_domains_locked
   rebuild_sched_domains_locked
   cpuset_hotplug_workfn

It happens with cgroup2 and exclusive cpusets only.  This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:

  cd $UNIFIED_ROOT
  mkdir cg1
  echo 4-7 > cg1/cpuset.cpus
  echo root > cg1/cpuset.cpus.partition

  # with smt/control reading 'on',
  echo off > /sys/devices/system/cpu/smt/control

RIP maps to

  sd->shared = *per_cpu_ptr(sdd->sds, sd_id);

from sd_init().  sd_id is calculated earlier in the same function:

  cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
  sd_id = cpumask_first(sched_domain_span(sd));

tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.

The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read
N's empty cpu_sibling_map in sd_init().

rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check.  Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.

Fixes: 0ccea8feb9 ("cpuset: Make generate_sched_domains() work with partition")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:54:11 +01:00
..
bpf bpf: Fix enum names for bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers 2020-12-11 14:19:07 -08:00
cgroup cpuset: fix race between hotplug work and later CPU offline 2020-12-30 11:54:11 +01:00
configs
debug
dma
entry
events
gcov
irq genirq/irqdomain: Don't try to free an interrupt that has no mapping 2020-12-30 11:53:25 +01:00
kcsan
livepatch
locking lockdep: Put graph lock/unlock under lock_recursion protection 2020-11-17 13:15:35 +01:00
power
printk Urgent printk fix for 5.10 2020-11-27 10:38:36 -08:00
rcu rcu/tree: Defer kvfree_rcu() allocation to a clean context 2020-12-30 11:53:17 +01:00
sched sched: Reenable interrupts in do_sched_yield() 2020-12-30 11:52:59 +01:00
time
trace bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address() 2020-12-30 11:53:34 +01:00
.gitignore
acct.c
async.c
audit_fsnotify.c
audit_tree.c
audit_watch.c
audit.c
audit.h
auditfilter.c
auditsc.c
backtracetest.c
bounds.c
capability.c
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling 2020-11-27 00:10:39 +11:00
crash_core.c
crash_dump.c
cred.c
delayacct.c
dma.c
exec_domain.c
exit.c
extable.c
fail_function.c fail_function: Remove a redundant mutex unlock 2020-11-19 11:58:16 -08:00
fork.c mm/gup: prevent gup_fast from racing with COW during fork 2020-12-30 11:53:54 +01:00
freezer.c
futex.c
gen_kheaders.sh
groups.c
hung_task.c
iomem.c
irq_work.c
jump_label.c
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c
kexec_core.c
kexec_elf.c
kexec_file.c
kexec_internal.h
kexec.c
kheaders.c
kmod.c
kprobes.c
ksysfs.c
kthread.c
latencytop.c
Makefile elfcore: fix building with clang 2020-12-11 14:02:14 -08:00
module_signature.c
module_signing.c
module-internal.h
module.c
notifier.c
nsproxy.c
padata.c
panic.c panic: don't dump stack twice on warn 2020-11-14 11:26:04 -08:00
params.c
pid_namespace.c
pid.c
profile.c
ptrace.c ptrace: Set PF_SUPERPRIV when checking capability 2020-11-17 12:53:22 -08:00
range.c
reboot.c reboot: fix overflow parsing reboot cpu number 2020-11-14 11:26:03 -08:00
regset.c
relay.c
resource.c
rseq.c
scftorture.c
scs.c
seccomp.c seccomp: Set PF_SUPERPRIV when checking capability 2020-11-17 12:53:22 -08:00
signal.c
smp.c
smpboot.c
smpboot.h
softirq.c
stackleak.c
stacktrace.c
static_call.c
stop_machine.c
sys_ni.c
sys.c
sysctl-test.c
sysctl.c
task_work.c
taskstats.c
test_kprobes.c
torture.c
tracepoint.c
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c
up.c
user_namespace.c
user-return-notifier.c
user.c
usermode_driver.c
utsname_sysctl.c
utsname.c
watch_queue.c
watchdog_hld.c
watchdog.c kernel/watchdog: fix watchdog_allowed_mask not used warning 2020-11-14 11:26:03 -08:00
workqueue_internal.h
workqueue.c