kernel_optimize_test/kernel
Waiman Long fd99aeb978 clocksource: Avoid accidental unstable marking of clocksources
[ Upstream commit c86ff8c55b8ae68837b2fa59dc0c203907e9a15f ]

Since commit db3a34e17433 ("clocksource: Retry clock read if long delays
detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew
threshold"), it is found that tsc clocksource fallback to hpet can
sometimes happen on both Intel and AMD systems especially when they are
running stressful benchmarking workloads. Of the 23 systems tested with
a v5.14 kernel, 10 of them have switched to hpet clock source during
the test run.

The result of falling back to hpet is a drastic reduction of performance
when running benchmarks. For example, the fio performance tests can
drop up to 70% whereas the iperf3 performance can drop up to 80%.

4 hpet fallbacks happened during bootup. They were:

  [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
  [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
  [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
  [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable

Other fallbacks happen when the systems were running stressful
benchmarks. For example:

  [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
  [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable

Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"),
changed the skew margin from 100us to 50us. I think this is too small
and can easily be exceeded when running some stressful workloads on a
thermally stressed system.  So it is switched back to 100us.

Even a maximum skew margin of 100us may be too small in for some systems
when booting up especially if those systems are under thermal stress. To
eliminate the case that the large skew is due to the system being too
busy slowing down the reading of both the watchdog and the clocksource,
an extra consecutive read of watchdog clock is being done to check this.

The consecutive watchdog read delay is compared against
WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that
the system is just too busy. A warning will be printed to the console
and the clock skew check is skipped for this round.

Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27 10:54:06 +01:00
..
bpf bpf: Don't promote bogus looking registers after null check. 2022-01-27 10:54:00 +01:00
cgroup cgroup: Fix rootcg cpu.stat guest double counting 2021-11-18 14:04:13 +01:00
configs
debug kgdb: fix to kill breakpoints on initmem after boot 2021-03-04 11:38:46 +01:00
dma dma/pool: create dma atomic pool only if dma zone has managed pages 2022-01-27 10:53:44 +01:00
entry KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest 2021-10-06 15:55:49 +02:00
events perf: Protect perf_guest_cbs with RCU 2022-01-20 09:17:50 +01:00
gcov gcov: re-fix clang-11+ support 2021-04-14 08:41:58 +02:00
irq genirq/timings: Fix error return code in irq_timings_test_irqs() 2021-09-15 09:50:29 +02:00
kcsan kcsan: Fix debugfs initcall return type 2021-05-26 12:06:54 +02:00
livepatch
locking lockdep: Let lock_is_held_type() detect recursive read as read 2021-11-18 14:04:02 +01:00
power PM: hibernate: use correct mode for swsusp_close() 2021-12-01 09:19:06 +01:00
printk printk/console: Allow to disable console output by using console="" or console=null 2021-11-12 14:58:33 +01:00
rcu rcu/exp: Mark current CPU as exp-QS in IPI loop second pass 2022-01-27 10:53:55 +01:00
sched sched/rt: Try to restart rt period timer when rt runtime exceeded 2022-01-27 10:53:54 +01:00
time clocksource: Avoid accidental unstable marking of clocksources 2022-01-27 10:54:06 +01:00
trace bpf: Remove config check to enable bpf support for branch records 2022-01-27 10:53:54 +01:00
.gitignore kbuild: update config_data.gz only when the content of .config is changed 2021-05-11 14:47:37 +02:00
acct.c
async.c
audit_fsnotify.c fsnotify: generalize handle_inode_event() 2020-12-30 11:54:18 +01:00
audit_tree.c audit: move put_tree() to avoid trim_trees refcount underflow and UAF 2021-09-03 10:09:31 +02:00
audit_watch.c fsnotify: generalize handle_inode_event() 2020-12-30 11:54:18 +01:00
audit.c audit: improve robustness of the audit queue handling 2021-12-22 09:30:51 +01:00
audit.h
auditfilter.c
auditsc.c audit: fix possible null-pointer dereference in audit_filter_rules 2021-10-27 09:56:52 +02:00
backtracetest.c
bounds.c
capability.c
compat.c
configs.c
context_tracking.c
cpu_pm.c PM: cpu: Make notifier chain use a raw_spinlock_t 2021-09-15 09:50:40 +02:00
cpu.c sched/scs: Reset task stack state in bringup_cpu() 2021-12-01 09:19:08 +01:00
crash_core.c crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo 2021-06-23 14:42:52 +02:00
crash_dump.c
cred.c Revert "Add a reference to ucounts for each cred" 2021-09-08 08:49:00 +02:00
delayacct.c
dma.c
exec_domain.c
exit.c kernel/io_uring: cancel io_uring before task works 2021-01-30 13:55:18 +01:00
extable.c
fail_function.c fail_function: Remove a redundant mutex unlock 2020-11-19 11:58:16 -08:00
fork.c posix-cpu-timers: Clear task::posix_cputimers_work in copy_process() 2021-11-18 14:04:29 +01:00
freezer.c Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing" 2021-04-07 15:00:14 +02:00
futex.c mm, futex: fix shared futex pgoff on shmem huge page 2021-06-30 08:47:29 -04:00
gen_kheaders.sh
groups.c
hung_task.c kernel/hung_task.c: make type annotations consistent 2020-11-02 12:14:19 -08:00
iomem.c
irq_work.c
jump_label.c jump_label: Fix jump_label_text_reserved() vs __init 2021-07-20 16:05:58 +02:00
kallsyms.c treewide: Convert macro and uses of __section(foo) to __section("foo") 2020-10-25 14:51:49 -07:00
kcmp.c exec: Transform exec_update_mutex into a rw_semaphore 2021-01-09 13:46:24 +01:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c
kexec_core.c kernel: kexec: remove the lock operation of system_transition_mutex 2021-02-03 23:28:37 +01:00
kexec_elf.c
kexec_file.c kernel: kexec_file: fix error return code of kexec_calculate_store_digests() 2021-05-19 10:13:09 +02:00
kexec_internal.h
kexec.c
kheaders.c
kmod.c
kprobes.c kprobes: Limit max data_size of the kretprobe instances 2021-12-08 09:03:20 +01:00
ksysfs.c
kthread.c kthread: Fix PF_KTHREAD vs to_kthread() race 2021-09-03 10:09:31 +02:00
latencytop.c
Makefile kbuild: update config_data.gz only when the content of .config is changed 2021-05-11 14:47:37 +02:00
module_signature.c module: harden ELF info handling 2021-03-25 09:04:11 +01:00
module_signing.c module: harden ELF info handling 2021-03-25 09:04:11 +01:00
module-internal.h
module.c module: limit enabling module.sig_enforce 2021-06-30 08:47:15 -04:00
notifier.c
nsproxy.c
padata.c
panic.c panic: don't dump stack twice on warn 2020-11-14 11:26:04 -08:00
params.c params: Replace zero-length array with flexible-array member 2020-10-29 17:22:59 -05:00
pid_namespace.c memcg: enable accounting for pids in nested pid namespaces 2021-09-18 13:40:36 +02:00
pid.c exec: Transform exec_update_mutex into a rw_semaphore 2021-01-09 13:46:24 +01:00
profile.c profiling: fix shift-out-of-bounds bugs 2021-09-26 14:08:58 +02:00
ptrace.c ptrace: make ptrace() fail if the tracee changed its pid unexpectedly 2021-05-26 12:06:49 +02:00
range.c
reboot.c reboot: fix overflow parsing reboot cpu number 2020-11-14 11:26:03 -08:00
regset.c
relay.c
resource.c kernel/resource: make walk_mem_res() find all busy IORESOURCE_MEM resources 2021-05-19 10:13:09 +02:00
rseq.c KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest 2021-10-06 15:55:49 +02:00
scftorture.c
scs.c
seccomp.c seccomp: Fix setting loaded filter count during TSYNC 2021-08-18 08:59:06 +02:00
signal.c signal: Remove the bogus sigkill_pending in ptrace_stop 2021-11-18 14:03:47 +01:00
smp.c smp: Fix smp_call_function_single_async prototype 2021-05-14 09:50:46 +02:00
smpboot.c sched/core: Initialize the idle task with preemption disabled 2021-07-14 16:55:50 +02:00
smpboot.h
softirq.c
stackleak.c
stacktrace.c
static_call.c static_call: Fix unused variable warn w/o MODULE 2021-09-08 08:49:00 +02:00
stop_machine.c stop_machine, rcu: Mark functions as notrace 2020-10-26 12:12:27 +01:00
sys_ni.c
sys.c prctl: allow to setup brk for et_dyn executables 2021-09-26 14:08:57 +02:00
sysctl-test.c
sysctl.c bpf: Add kconfig knob for disabling unpriv bpf by default 2022-01-05 12:40:34 +01:00
task_work.c
taskstats.c
test_kprobes.c
torture.c
tracepoint.c tracepoint: Use rcu get state and cond sync for static call updates 2021-09-03 10:09:30 +02:00
tsacct.c
ucount.c Revert "Add a reference to ucounts for each cred" 2021-09-08 08:49:00 +02:00
uid16.c
uid16.h
umh.c
up.c smp: Fix smp_call_function_single_async prototype 2021-05-14 09:50:46 +02:00
user_namespace.c Revert "Add a reference to ucounts for each cred" 2021-09-08 08:49:00 +02:00
user-return-notifier.c
user.c
usermode_driver.c bpf: Fix umd memory leak in copy_process() 2021-03-30 14:32:03 +02:00
utsname_sysctl.c
utsname.c
watch_queue.c
watchdog_hld.c
watchdog.c watchdog: fix barriers when printing backtraces from all CPUs 2021-05-19 10:13:00 +02:00
workqueue_internal.h
workqueue.c workqueue: Fix unbind_workers() VS wq_worker_running() race 2022-01-16 09:14:22 +01:00