Commit Graph

1237 Commits

Author SHA1 Message Date
Paul E. McKenney
12f54c3a84 rcu/nocb: Provide separate no-CBs grace-period kthreads
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups.  The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks.  The non-leader rcuo kthreads
only invoke callbacks.

This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks.  However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up.  This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.

One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.

This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks.  Instead of having one of these kthreads act as
leader, each group has a separate rcog kthread that handles grace periods
for its group.  Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuc callback-invocation kthreads on other CPUs.

This change does introduce additional kthreads, however:

1.	The number of additional kthreads is about the square root of
	the number of CPUs, so that a 4096-CPU system would have only
	about 64 additional kthreads.  Note that recent changes
	decreased the number of rcuo kthreads by a factor of two
	(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
	this still represents a significant improvement on most systems.

2.	The leading "rcuo" of the rcuog kthreads should allow existing
	scripting to affinity these additional kthreads as needed, the
	same as for the rcuop and rcuos kthreads.  (There are no longer
	any rcuob kthreads.)

3.	A state-machine approach was considered and rejected.  Although
	this would allow the rcuo kthreads to continue their dual
	leader/follower roles, it complicates callback invocation
	and makes it more difficult to consolidate rcuo callback
	invocation with existing softirq callback invocation.

The introduction of rcuog kthreads should thus be acceptable.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
6484fe54b5 rcu/nocb: Update comments to prepare for forward-progress work
This commit simply rewords comments to prepare for leader nocb kthreads
doing only grace-period work and callback shuffling.  This will mean
the addition of replacement kthreads to invoke callbacks.  The "leader"
and "follower" thus become less meaningful, so the commit changes no-CB
comments with these strings to "GP" and "CB", respectively.  (Give or
take the usual grammatical transformations.)

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
58bf6f77c6 rcu/nocb: Rename rcu_data fields to prepare for forward-progress work
This commit simply renames rcu_data fields to prepare for leader
nocb kthreads doing only grace-period work and callback shuffling.
This will mean the addition of replacement kthreads to invoke callbacks.
The "leader" and "follower" thus become less meaningful, so the commit
changes no-CB fields with these strings to "gp" and "cb", respectively.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
31da067023 Merge branches 'consolidate.2019.08.01b', 'fixes.2019.08.12a', 'lists.2019.08.13a' and 'torture.2019.08.01b' into HEAD
consolidate.2019.08.01b: Further consolidation cleanups
fixes.2019.08.12a: Miscellaneous fixes
lists.2019.08.13a: Optional lockdep arguments for RCU list macros
torture.2019.08.01b: Torture-test updates
2019-08-13 14:30:30 -07:00
Mukesh Ojha
511b44f759 rcu: Fix spelling mistake "greate"->"great"
This commit fixes a spelling mistake in file tree_exp.h.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-12 11:25:06 -07:00
Paul E. McKenney
b823cafa75 rcu: Remove redundant "if" condition from rcu_gp_is_expedited()
Because rcu_expedited_nesting is initialized to 1 and not decremented
until just before init is spawned, rcu_expedited_nesting is guaranteed
to be non-zero whenever rcu_scheduler_active == RCU_SCHEDULER_INIT.
This commit therefore removes this redundant "if" equality test.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-08-12 11:25:06 -07:00
Joel Fernandes (Google)
28875945ba rcu: Add support for consolidated-RCU reader checking
This commit adds RCU-reader checks to list_for_each_entry_rcu() and
hlist_for_each_entry_rcu().  These checks are optional, and are indicated
by a lockdep expression passed to a new optional argument to these two
macros.  If this optional lockdep expression is omitted, these two macros
act as before, checking for an RCU read-side critical section.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Update to eliminate return within macro and update comment. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-09 11:00:35 -07:00
Paul E. McKenney
60013d5d2b rcutorture: Aggressive forward-progress tests shouldn't block shutdown
The more aggressive forward-progress tests can interfere with rcutorture
shutdown, resulting in false-positive diagnostics.  This commit therefore
ends any such tests 30 seconds prior to shutdown.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Joel Fernandes (Google)
77e9752ce6 rcuperf: Make rcuperf kernel test more robust for !expedited mode
It is possible that the rcuperf kernel test runs concurrently with init
starting up.  During this time, the system is running all grace periods
as expedited.  However, rcuperf can also be run for normal GP tests.
Right now, it depends on a holdoff time before starting the test to
ensure grace periods start later. This works fine with the default
holdoff time however it is not robust in situations where init takes
greater than the holdoff time to finish running. Or, as in my case:

I modified the rcuperf test locally to also run a thread that did
preempt disable/enable in a loop. This had the effect of slowing down
init. The end result was that the "batches:" counter in rcuperf was 0
causing a division by 0 error in the results. This counter was 0 because
only expedited GPs seem to happen, not normal ones which led to the
rcu_state.gp_seq counter remaining constant across grace periods which
unexpectedly happen to be expedited. The system was running expedited
RCU all the time because rcu_unexpedited_gp() would not have run yet
from init.  In other words, the test would concurrently with init
booting in expedited GP mode.

To fix this properly, this commit waits until system_state is set to
SYSTEM_RUNNING before starting the test.  This change is made just
before kernel_init() invokes rcu_end_inkernel_boot(), and this latter
is what turns off boot-time expediting of RCU grace periods.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Paul E. McKenney
bd1bfc51a3 rcutorture: Emulate userspace sojourn during call_rcu() floods
During an actual call_rcu() flood, there would be frequent trips to
userspace (in-kernel call_rcu() floods must be otherwise housebroken).
Userspace execution allows a great many things to interrupt execution,
and rcutorture needs to also allow such interruptions.  This commit
therefore causes call_rcu() floods to occasionally invoke schedule(),
thus preventing spurious rcutorture failures due to other parts of the
kernel becoming irate at the call_rcu() flood events.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Xiao Yang
b3f3886c59 rcuperf: Fix perf_type module-parameter description
The rcu_bh rcuperf type was removed by commit 620d246065cd("rcuperf:
Remove the "rcu_bh" and "sched" torture types"), but it lives on in the
MODULE_PARM_DESC() of perf_type.  This commit therefore changes that
module-parameter description to substitute srcu for rcu_bh.

Signed-off-by: Xiao Yang <ice_yangxiao@163.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Joel Fernandes (Google)
9147089bee rcu: Remove redundant debug_locks check in rcu_read_lock_sched_held()
The debug_locks flag can never be true at the end of
rcu_read_lock_sched_held() because it is already checked by the earlier
call todebug_lockdep_rcu_enabled().   This commit therefore removes this
redundant check.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:17:01 -07:00
Byungchul Park
3545832fc2 rcu: Change return type of rcu_spawn_one_boost_kthread()
The return value of rcu_spawn_one_boost_kthread() is not used any longer.
This commit therefore changes its return type from int to void, and
removes the cast to void from its callers.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
7e210a653e srcu: Avoid srcutorture security-based pointer obfuscation
Because pointer output is now obfuscated, and because what you really
want to know is whether or not the callback lists are empty, this commit
replaces the srcu_data structure's head callback pointer printout with
a single character that is "." is the callback list is empty or "C"
otherwise.

This is the only remaining user of rcu_segcblist_head(), so this
commit also removes this function's definition.  It also turns out that
rcu_segcblist_tail() no longer has any callers, so this commit removes
that function's definition while in the area.  They were both marked
"Interim", and their end has come.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
fbad01af8c rcu: Add destroy_work_on_stack() to match INIT_WORK_ONSTACK()
The synchronize_rcu_expedited() function has an INIT_WORK_ONSTACK(),
but lacks the corresponding destroy_work_on_stack().  This commit
therefore adds destroy_work_on_stack().

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
cdc694b235 rcu: Add kernel parameter to dump trace after RCU CPU stall warning
This commit adds a rcu_cpu_stall_ftrace_dump kernel boot parameter, that,
when set, causes the trace buffer to be dumped after an RCU CPU stall
warning is printed.  This kernel boot parameter is disabled by default,
maintaining compatibility with previous behavior.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
1f3ebc8253 rcu: Restore barrier() to rcu_read_lock() and rcu_read_unlock()
Commit bb73c52bad ("rcu: Don't disable preemption for Tiny and Tree
RCU readers") removed the barrier() calls from rcu_read_lock() and
rcu_write_lock() in CONFIG_PREEMPT=n&&CONFIG_PREEMPT_COUNT=n kernels.
Within RCU, this commit was OK, but it failed to account for things like
get_user() that can pagefault and that can be reordered by the compiler.
Lack of the barrier() calls in rcu_read_lock() and rcu_read_unlock()
can cause these page faults to migrate into RCU read-side critical
sections, which in CONFIG_PREEMPT=n kernels could result in too-short
grace periods and arbitrary misbehavior.  Please see commit 386afc9114
("spinlocks and preemption points need to be at least compiler barriers")
and Linus's commit 66be4e66a7 ("rcu: locking and unlocking need to
always be at least barriers"), this last of which restores the barrier()
call to both rcu_read_lock() and rcu_read_unlock().

This commit removes barrier() calls that are no longer needed given that
the addition of them in Linus's commit noted above.  The combination of
this commit and Linus's commit effectively reverts commit bb73c52bad
("rcu: Don't disable preemption for Tiny and Tree RCU readers").

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Fix embarrassing typo located by Alan Stern. ]
2019-08-01 14:05:51 -07:00
Joel Fernandes (Google)
cb4dbbfaa1 rcu: Simplify rcu_note_context_switch exit from critical section
Because __rcu_read_unlock() can be preempted just before the call to
rcu_read_unlock_special(), it is possible for a task to be preempted just
before it would have fully exited its RCU read-side critical section.
This would result in a needless extension of that critical section until
that task was resumed, which might in turn result in a needlessly
long grace period, needless RCU priority boosting, and needless
force-quiescent-state actions.  Therefore, rcu_note_context_switch()
invokes __rcu_read_unlock() followed by rcu_preempt_deferred_qs() when
it detects this situation.  This action by rcu_note_context_switch()
ends the RCU read-side critical section immediately.

Of course, once the task resumes, it will invoke rcu_read_unlock_special()
redundantly.  This is harmless because the fact that a preemption
happened means that interrupts, preemption, and softirqs cannot
have been disabled, so there would be no deferred quiescent state.
While ->rcu_read_lock_nesting remains less than zero, none of the
->rcu_read_unlock_special.b bits can be set, and they were all zeroed by
the call to rcu_note_context_switch() at task-preemption time.  Therefore,
setting ->rcu_read_unlock_special.b.exp_hint to false has no effect.

Therefore, the extra call to rcu_preempt_deferred_qs_irqrestore()
would return immediately.  With one possible exception, which is
if an expedited grace period started just as the task was being
resumed, which could leave ->exp_deferred_qs set.  This will cause
rcu_preempt_deferred_qs_irqrestore() to invoke rcu_report_exp_rdp(),
reporting the quiescent state, just as it should.  (Such an expedited
grace period won't affect the preemption code path due to interrupts
having already been disabled.)

But when rcu_note_context_switch() invokes __rcu_read_unlock(), it
is doing so with preemption disabled, hence __rcu_read_unlock() will
unconditionally defer the quiescent state, only to immediately invoke
rcu_preempt_deferred_qs(), thus immediately reporting the deferred
quiescent state.  It turns out to be safe (and faster) to instead
just invoke rcu_preempt_deferred_qs() without the __rcu_read_unlock()
middleman.

Because this is the invocation during the preemption (as opposed to
the invocation just after the resume), at least one of the bits in
->rcu_read_unlock_special.b must be set and ->rcu_read_lock_nesting
must be negative.  This means that rcu_preempt_need_deferred_qs() must
return true, avoiding the early exit from rcu_preempt_deferred_qs().
Thus, rcu_preempt_deferred_qs_irqrestore() will be invoked immediately,
as required.

This commit therefore simplifies the CONFIG_PREEMPT=y version of
rcu_note_context_switch() by removing the "else if" branch of its
"if" statement.  This change means that all callers that would have
invoked rcu_read_unlock_special() followed by rcu_preempt_deferred_qs()
will now simply invoke rcu_preempt_deferred_qs(), thus avoiding the
rcu_read_unlock_special() middleman when __rcu_read_unlock() is preempted.

Cc: rcu@vger.kernel.org
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:04:20 -07:00
Paul E. McKenney
87446b4874 rcu: Make rcu_read_unlock_special() checks match raise_softirq_irqoff()
Threaded interrupts provide additional interesting interactions between
RCU and raise_softirq() that can result in self-deadlocks in v5.0-2 of
the Linux kernel.  These self-deadlocks can be provoked in susceptible
kernels within a few minutes using the following rcutorture command on
an 8-CPU system:

tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --configs "TREE03" --bootargs "threadirqs"

Although post-v5.2 RCU commits have at least greatly reduced the
probability of these self-deadlocks, this was entirely by accident.
Although this sort of accident should be rowdily celebrated on those
rare occasions when it does occur, such celebrations should be quickly
followed by a principled patch, which is what this patch purports to be.

The key point behind this patch is that when in_interrupt() returns
true, __raise_softirq_irqoff() will never attempt a wakeup.  Therefore,
if in_interrupt(), calls to raise_softirq*() are both safe and
extremely cheap.

This commit therefore replaces the in_irq() calls in the "if" statement
in rcu_read_unlock_special() with in_interrupt() and simplifies the
"if" condition to the following:

	if (irqs_were_disabled && use_softirq &&
	    (in_interrupt() ||
	     (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
		raise_softirq_irqoff(RCU_SOFTIRQ);
	} else {
		/* Appeal to the scheduler. */
	}

The rationale behind the "if" condition is as follows:

1.	irqs_were_disabled:  If interrupts are enabled, we should
	instead appeal to the scheduler so as to let the upcoming
	irq_enable()/local_bh_enable() do the rescheduling for us.
2.	use_softirq: If this kernel isn't using softirq, then
	raise_softirq_irqoff() will be unhelpful.
3.	a.	in_interrupt(): If this returns true, the subsequent
		call to raise_softirq_irqoff() is guaranteed not to
		do a wakeup, so that call will be both very cheap and
		quite safe.
	b.	Otherwise, if !in_interrupt() the raise_softirq_irqoff()
		might do a wakeup, which is expensive and, in some
		contexts, unsafe.
		i.	The "exp" (an expedited RCU grace period is being
			blocked) says that the wakeup is worthwhile, and:
		ii.	The !.deferred_qs says that scheduler locks
			cannot be held, so the wakeup will be safe.

Backporting this requires considerable care, so no auto-backport, please!

Fixes: 05f415715c ("rcu: Speed up expedited GPs when interrupting RCU reader")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:04:20 -07:00
Paul E. McKenney
d143b3d1cd rcu: Simplify rcu_read_unlock_special() deferred wakeups
In !use_softirq runs, we clearly cannot rely on raise_softirq() and
its lightweight bit setting, so we must instead do some form of wakeup.
In the absence of a self-IPI when interrupts are disabled, these wakeups
can be delayed until the next interrupt occurs.  This means that calling
invoke_rcu_core() doesn't actually do any expediting.

In this case, it is better to take the "else" clause, which sets the
current CPU's resched bits and, if there is an expedited grace period
in flight, uses IRQ-work to force the needed self-IPI.  This commit
therefore removes the "else if" clause that calls invoke_rcu_core().

Reported-by: Scott Wood <swood@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:04:20 -07:00
Ingo Molnar
83086d654d Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull rcu/next + tools/memory-model changes from Paul E. McKenney:

 - RCU flavor consolidation cleanups and optmizations
 - Documentation updates
 - Miscellaneous fixes
 - SRCU updates
 - RCU-sync flavor consolidation
 - Torture-test updates
 - Linux-kernel memory-consistency-model updates, most notably the addition of plain C-language accesses

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-28 19:46:47 +02:00
Paul E. McKenney
11ca7a9d54 Merge branches 'consolidate.2019.05.28a', 'doc.2019.05.28a', 'fixes.2019.06.13a', 'srcu.2019.05.28a', 'sync.2019.05.28a' and 'torture.2019.05.28a' into HEAD
consolidate.2019.05.28a: RCU flavor consolidation cleanups and optmizations.
doc.2019.05.28a: Documentation updates.
fixes.2019.06.13a: Miscellaneous fixes.
srcu.2019.05.28a: SRCU updates.
sync.2019.05.28a: RCU-sync flavor consolidation.
torture.2019.05.28a: Torture-test updates.
2019-06-19 09:21:46 -07:00
Paul E. McKenney
96050c68be rcu: Upgrade sync_exp_work_done() to smp_mb()
The sync_exp_work_done() function uses smp_mb__before_atomic(), but
there is no obvious atomic in the ensuing code.  The ordering is
absolutely required for grace periods to work correctly, so this
commit upgrades the smp_mb__before_atomic() to smp_mb().

Fixes: 6fba2b3767 ("rcu: Remove deprecated RCU debugfs tracing code")
Reported-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-06-13 15:33:19 -07:00
Paul E. McKenney
354ea05d02 rcutorture: Upper case solves the case of the vanishing NULL pointer
Various security techniques can obfuscate pointer printouts on the
console.  Unfortunately, rcutorture relies on either "null" or all zeroes
to identify the last few statistics printouts at the end of the test.
These need to be identified because failing to do so will results in
false-positive complaints about grace-period hangs.

This commit therefore prints the "ver:" in capitals ("VER:") when
the RCU-protected pointer has been set to NULL, which causes rcutorture's
parse-console.sh script to correctly ignore these lines.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
34aa34b818 rcutorture: Dump trace buffer for callback pipe drain failures
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
c682db558e rcutorture: Add trivial RCU implementation
I have been showing off a trivial RCU implementation for non-preemptive
environments for some time now:

	#define rcu_read_lock()
	#define rcu_read_unlock()
	#define rcu_dereference(p) READ_ONCE(p)
	#define rcu_assign_pointer(p, v) smp_store_release(&(p), (v))
	void synchronize_rcu(void)
	{
	int cpu;
		for_each_online_cpu(cpu)
			sched_setaffinity(current->pid, cpumask_of(cpu));
	}

Trivial or not, as the old saying goes, "if it ain't tested, it don't
work!".  This commit therefore adds a "trivial" flavor to rcutorture
and a corresponding TRIVIAL test scenario.  This variant does not handle
CPU hotplug, which is unconditionally enabled on x86 for post-v5.1-rc3
kernels, which is why the TRIVIAL.boot says "rcutorture.onoff_interval=0".
This commit actually does handle CONFIG_PREEMPT=y kernels, but only
because it turns back the Linux-kernel clock in order to provide these
alternative definitions (or the moral equivalent thereof):

	#define rcu_read_lock() preempt_disable()
	#define rcu_read_unlock() preempt_enable()

In CONFIG_PREEMPT=n kernels without debugging, these are equivalent to
empty macros give or take a compiler barrier.  However, the have been
successfully tested with actual empty macros as well.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Fix symbol issue reported by kbuild test robot <lkp@intel.com>. ]
[ paulmck: Work around sched_setaffinity() issue noted by Andrea Parri. ]
[ paulmck: Add rcutorture.shuffle_interval=0 to TRIVIAL.boot to fix
  interaction with shuffler task noted by Peter Zijlstra. ]
Tested-by: Andrea Parri <andrea.parri@amarulasolutions.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
3432d765c5 rcutorture: Halt forward-progress checks at end of run
Once removed, an rcu_torture element can be deferred-freed by a chain
of call_rcu() invocations, with each callback invoking another round of
call_rcu() until either a fixed number of call_rcu() invocations have
been chained or until the test ends.  This means that if the test ends,
some of the rcu_torture elements will be "stranded" partway through the
deferred-free process, which results in false-positive warnings from
rcu_torture_writer() due to lack of forward progress should the test
end just at the end of a stutter interval.

This commit therefore suppresses rcu_torture_writer()'s forward-progress
checks when the test ends in order to avoid these false-positive reports..

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
ab21f6081f rcutorture: Give the scheduler a chance on PREEMPT && NO_HZ_FULL kernels
In !PREEMPT kernels, cond_resched() is a no-op.  In NO_HZ_FULL kernels,
in-kernel execution (such as that of rcutorture's kthreads) might extend
indefinitely without the scheduler gaining the aid of a scheduling-clock
interrupt.  This combination can make the interaction of an rcutorture
forward-progress test and a CPU-hotplug stop_machine operation make less
forward progress than one might like.  Additionally, Sebastian Siewior
notes that NO_HZ_FULL kernels have a scheduler check upon return to
userspace execution, which suggests that in-kernel emulation of tight
userspace loops containing system calls doing call_rcu() might also need
explicit checks in the PREEMPT && NO_HZ_FULL case.

This commit therefore introduces a rcu_torture_fwd_prog_cond_resched()
function that explicitly invokes schedule() in such kernels whenever
need_resched() returns true, while retaining use of cond_resched()
for kernels that are either !PREEMPT or !NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
5eabea594b rcutorture: Exempt tasks RCU from timely draining of grace periods
After the end of each stutter pause interval, the rcu_torture_writer()
kthread checks to be sure that all prior callbacks have completed so
that all the test structures have been freed.  This works fine except
for tasks RCU, in which grace periods can take one good long time.
This commit therefore exempts tasks RCU from this check.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
ff3bf92d90 torture: Allow inter-stutter interval to be specified
Currently, the inter-stutter interval is the same as the stutter duration,
that is, whatever number of jiffies is passed into torture_stutter_init().
This has worked well for quite some time, but the addition of
forward-progress testing to rcutorture can delay processes for several
seconds, which can triple the time that they are stuttered.

This commit therefore adds a second argument to torture_stutter_init()
that specifies the inter-stutter interval.  While locktorture preserves
the current behavior, rcutorture uses the RCU CPU stall warning interval
to provide a wider inter-stutter interval.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
e8516c64fe rcutorture: Fix stutter_wait() return value and freelist checks
The stutter_wait() function is supposed to return true if it actually
waits and false otherwise, but it instead unconditionally returns false.
Which hides a bug in rcu_torture_writer() that fails to account for
the fact that one of the rcu_tortures[] array elements will normally be
referenced by rcu_torture_current, and thus not be on the freelist.

This commit therefore corrects the stutter_wait() return value and adds a
check for rcu_torture_current to rcu_torture_writer()'s check that things
get freed after everything goes quiescent.  In addition, this commit
causes torture_stutter() to give a bit more than one second (instead of
only one jiffy) warning of the end of the stutter interval.  Finally,
this commit disables long-delay readers and aggressive update-side
forward-progress checks while forward-progress testing is in flight.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
140e53f20b rcutorture: Add cond_resched() to forward-progress free-up loop
The rcu_torture_fwd_prog_cbfree() function frees callbacks used during
rcutorture's call_rcu() forward-progress test, but does so in a tight
loop.  This could cause problems given a very long list of callbacks to be
freed, and actual testing produces lists with as many as 25M callbacks.
This commit therefore adds a cond_resched() to this loop.  While in
the area, this commit also rearranges the lock releases to look a bit
more sane.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Oleg Nesterov
89da3b94bb rcu/sync: Simplify the state machine
With this patch rcu_sync has a single state variable and the transition rules
become really simple:

	GP_IDLE   - owned by the first rcu_sync_enter() which moves it to

	GP_ENTER  - owned by rcu-callback which moves it to

	GP_PASSED - owned by the last rcu_sync_exit() which moves it to

	GP_EXIT   - and this is the only "nontrivial" state.

		rcu-callback moves it back to GP_IDLE unless another enter()
		comes before a GP pass.

		If rcu-callback is invoked before the next rcu_sync_exit() it
		must see gp_count incremented by that enter() and set GP_PASSED.

		Otherwise, if the next rcu_sync_exit() wins the race, it will
		move it to

	GP_REPLAY - owned by rcu-callback which moves it to GP_EXIT

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ paulmck: While here, apply READ_ONCE() and WRITE_ONCE() to ->gp_state. ]
[ paulmck: Tweaks to make htmldocs happy. (Reported by kbuild test robot.) ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:05:23 -07:00
Oleg Nesterov
95bf33b55f rcu/sync: Kill rcu_sync_type/gp_type
Now that the RCU flavors have been consolidated, rcu_sync_type makes no
sense because none of internal update functions aside from .held() depend
on gp_type.  This commit therefore removes this field and consolidates
the relevant code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ paulmck: Added RCU and RCU-bh checks to rcu_sync_is_idle(). ]
[ paulmck: And applied subsequent feedback from Oleg Nesterov. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:05:23 -07:00
Jiang Biao
11b000457f rcu: Make __call_srcu static
Because __call_srcu() is not used outside kernel/rcu/srcutree.c,
this commit makes it static.

Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:03:35 -07:00
Paul E. McKenney
fe15b50cde srcu: Allocate per-CPU data for DEFINE_SRCU() in modules
Adding DEFINE_SRCU() or DEFINE_STATIC_SRCU() to a loadable module requires
that the size of the reserved region be increased, which is not something
we want to be doing all that often.  One approach would be to require
that loadable modules define an srcu_struct and invoke init_srcu_struct()
from their module_init function and cleanup_srcu_struct() from their
module_exit function.  However, this is more than a bit user unfriendly.

This commit therefore creates an ___srcu_struct_ptrs linker section,
and pointers to srcu_struct structures created by DEFINE_SRCU() and
DEFINE_STATIC_SRCU() within a module are placed into that module's
___srcu_struct_ptrs section.  The required init_srcu_struct() and
cleanup_srcu_struct() functions are then automatically invoked as needed
when that module is loaded and unloaded, thus allowing modules to continue
to use DEFINE_SRCU() and DEFINE_STATIC_SRCU() while avoiding the need
to increase the size of the reserved region.

Many of the algorithms and some of the code was cheerfully cherry-picked
from other code making use of linker sections, perhaps most notably from
tracepoints.  All bugs are nevertheless the sole property of the author.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ paulmck: Use __section() and use "default" in srcu_module_notify()'s
  "switch" statement as suggested by Joel Fernandes. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-05-28 09:03:35 -07:00
Paul E. McKenney
d5a9a8c3bc rcu: Set a maximum limit for back-to-back callback invocation
Currently, if a CPU has more than 10,000 callbacks pending, it will
increase rdp->blimit to LONG_MAX.  If you are lucky, LONG_MAX is only
about two billion, but this is still a bit too many callbacks to invoke
back-to-back while otherwise ignoring the world.

This commit therefore sets a maximum limit of DEFAULT_MAX_RCU_BLIMIT,
which is set to 10,000, for rdp->blimit.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:02:57 -07:00
Neeraj Upadhyay
3ae976a7e3 rcu: Correctly unlock root node in rcu_check_gp_start_stall()
On systems whose rcu_node tree has only one node, the
rcu_check_gp_start_stall() function's values of rnp and rnp_root will
be identical.  In this case, it clearly does not make sense to release
both rnp->lock and rnp_root->lock, but that is exactly what this function
does in the last early exit.  This commit therefore unlocks only rnp->lock
when rnp and rnp_root are equal.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:02:57 -07:00
Neeraj Upadhyay
cd6d17b4a4 rcu: Dump specified number of blocked tasks
The dump_blkd_tasks() function dumps at most 10 blocked tasks, ignoring
the value of the ncheck parameter.  This commit therefore substitutes
the value of ncheck for the hard-coded value of 10.  Because all callers
currently pass 10 as the number, this patch does not change behavior,
but it is clearly an accident waiting to happen.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:02:57 -07:00
Jiang Biao
f0b6356273 rcu: Remove unused rdp local from synchronize_rcu_expedited()
Because rdp is initialized but never used in synchronize_rcu_expedited(),
this commit removes it.

Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 08:48:19 -07:00
Paul E. McKenney
1bb336443c rcu: Rename rcu_data's ->deferred_qs to ->exp_deferred_qs
The rcu_data structure's ->deferred_qs field is used to indicate that the
current CPU is blocking an expedited grace period (perhaps a future one).
Given that it is used only for expedited grace periods, its current name
is misleading, so this commit renames it to ->exp_deferred_qs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 08:48:19 -07:00
Joel Fernandes (Google)
eddded8012 rcu: Add checks for dynticks counters in rcu_is_cpu_rrupt_from_idle()
It would be good to combine the dynticks and dynticks_nesting counters
in order to simplify the code.  Unfortunately, there are concerns
about usermode upcalls appearing to RCU as half of an interrupt, as
Byungchul learned [1].  The "half" in "half interrupt" is due to an
unpaired rcu_irq_enter(): Normally, each rcu_irq_enter() has a later
call to rcu_irq_exit().

Out of an abundance of caution, Paul added warnings [2] in the RCU
code which if not fired by 2021 will be interpreted as meaning that
this half-interrupt scenario cannot happen any more, thus permitting
simplification of this code.

In the meantime, this commit makes the following changes:

(1) Combining these two counters requires that rcu_rrupt_from_idle()
    is invoked only from hard-interrupt contexts as discussed here [3].
    This commit therefore adds the required lockdep_assert_in_irq()
    to check this constraint.

(2) Furthermore, rcu_rrupt_from_idle() is not explicit about how it
    is using the counters which can lead to weird future bugs. This
    commit therefore adds comments indicating the meaning and use of
    each counter.

(3) Lastly, this commit checks for counter underflows as another check
    that half interrupts don't occur.  (Previously, the function would
    simply return true upon underflow.)

All these checks checks are NOOPs if PROVE_LOCKING (and thus PROVE_RCU)
are disabled.

[1] https://lore.kernel.org/patchwork/patch/952349/
[2] Commit e11ec65cc8 ("rcu: Add warning to detect half-interrupts")
[3] https://lore.kernel.org/lkml/20190312150514.GB249405@google.com/

Cc: byungchul.park@lge.com
Cc: kernel-team@android.com
Cc: rcu@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 08:48:19 -07:00
Paul E. McKenney
e015a34112 rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()
The sync_sched_exp_online_cleanup() is invoked at online time to handle
the case where the start of an expedited grace period ran concurrently
with a CPU being taken offline and then immediately being placed online.
It checks to see if RCU needs an expedited quiescent state from the
incoming CPU, sending it an IPI if so.  However, it is quite possible
that sync_sched_exp_online_cleanup() is running on that CPU, in which
case it is considerably less overhead to simply request the quiescent
state locally instead of simulating a self-IPI.

This commit therefore places the last few lines of rcu_exp_handler()
into a new rcu_exp_need_qs() function, which is invoked both by
rcu_exp_handler() and by sync_sched_exp_online_cleanup() in the self-IPI
case.

This also reduces the rcu_exp_handler() function's state space by
removing the direct call that this smp_call_function_single() uses to
emulate the requested self-IPI.  This in turn will allow tighter error
checking in rcu_is_cpu_rrupt_from_idle().

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-05-25 14:50:50 -07:00
Paul E. McKenney
b9ad4d6ed1 rcu: Avoid self-IPI in sync_rcu_exp_select_node_cpus()
Although sync_rcu_exp_select_node_cpus() treats the current CPU as being
in a quiescent state, it might well migrate to some other CPU before
reaching the smp_call_function_single(), which could then result in an
unnecessary simulated self-IPI.  This commit therefore instead simply
refuses to invoke smp_call_function_single() on the current CPU, which
causes the later rcu_report_exp_cpu_mult() to report this CPU's quiescent
state with less overhead.

This also reduces the rcu_exp_handler() function's state space by removing
the direct call that this smp_call_function_single() uses to emulate the
requested self-IPI.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Use get_cpu() instead of preempt_disable() per Joel Fernandes. ]
2019-05-25 14:50:50 -07:00
Paul E. McKenney
43e903ad3e rcu: Inline invoke_rcu_callbacks() into its sole remaining caller
This commit saves a few lines of code by inlining invoke_rcu_callbacks()
into its sole remaining caller.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:49 -07:00
Paul E. McKenney
0864f057b0 rcu: Use irq_work to get scheduler's attention in clean context
When rcu_read_unlock_special() is invoked with interrupts disabled, is
either not in an interrupt handler or is not using RCU_SOFTIRQ, is not
the first RCU read-side critical section in the chain, and either there
is an expedited grace period in flight or this is a NO_HZ_FULL kernel,
the end of the grace period can be unduly delayed.  The reason for this
is that it is not safe to do wakeups in this situation.

This commit fixes this problem by using the irq_work subsystem to
force a later interrupt handler in a clean environment.  Because
set_tsk_need_resched(current) and set_preempt_need_resched() are
invoked prior to this, the scheduler will force a context switch
upon return from this interrupt (though perhaps at the end of any
interrupted preempt-disable or BH-disable region of code), which will
invoke rcu_note_context_switch() (again in a clean environment), which
will in turn give RCU the chance to report the deferred quiescent state.

Of course, by then this task might be within another RCU read-side
critical section.  But that will be detected at that time and reporting
will be further deferred to the outermost rcu_read_unlock().  See
rcu_preempt_need_deferred_qs() and rcu_preempt_deferred_qs() for more
details on the checking.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:49 -07:00
Paul E. McKenney
385b599e8c rcu: Allow rcu_read_unlock_special() to raise_softirq() if in_irq()
When running in an interrupt handler, raise_softirq() and
raise_softirq_irqoff() have extremely low overhead: They simply set a
bit in a per-CPU mask, which is checked upon exit from that interrupt
handler.  Therefore, if rcu_read_unlock_special() is invoked within an
interrupt handler and RCU_SOFTIRQ is in use, this commit make use of
raise_softirq_irqoff() even if there is no expedited grace period in
flight and even if this is not a nohz_full CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:48 -07:00
Paul E. McKenney
25102de65f rcu: Only do rcu_read_unlock_special() wakeups if expedited
Currently, rcu_read_unlock_special() will do wakeups whenever it is safe
to do so.  However, wakeups are expensive, and they are only really
needed when the just-ended RCU read-side critical section is blocking
an expedited grace period (in which case speed is of the essence)
or on a nohz_full CPU (where it might be a good long time before an
interrupt arrives).  This commit therefore checks for these conditions,
and does the expensive wakeups only if doing so would be useful.

Note it can be rather expensive to determine whether or not the current
task (as opposed to the current CPU) is blocking the current expedited
grace period.  Doing so requires traversing the ->blkd_tasks list, which
can be quite long.  This commit therefore cheats:  If the current task
is on a given ->blkd_tasks list, and some task on that list is blocking
the current expedited grace period, the code assumes that the current
task is blocking that expedited grace period.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:48 -07:00
Paul E. McKenney
23634ebc1d rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
kthreads, a full and unconditional wakeup is required to initiate RCU
core processing.  In contrast, when RCU core processing is carried
out by RCU_SOFTIRQ, a raise_softirq() suffices.  Of course, there are
situations where raise_softirq() does a full wakeup, but these do not
occur with normal usage of rcu_read_unlock().

The reason that full wakeups can be problematic is that the scheduler
sometimes invokes rcu_read_unlock() with its pi or rq locks held,
which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
rcu_read_unlock() invokes the scheduler.  Scheduler invocations can happen
in the following situations: (1) The just-ended reader has been subjected
to RCU priority boosting, in which case rcu_read_unlock() must deboost,
(2) Interrupts were disabled across the call to rcu_read_unlock(), so
the quiescent state must be deferred, requiring a wakeup of the rcuc
kthread corresponding to the current CPU.

Now, the scheduler may hold one of its locks across rcu_read_unlock()
only if preemption has been disabled across the entire RCU read-side
critical section, which in the days prior to RCU flavor consolidation
meant that rcu_read_unlock() never needed to do wakeups.  However, this
is no longer the case for any but the first rcu_read_unlock() following a
condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
attention.  For example, an RCU read-side critical section might be
preempted, but preemption might be disabled across the rcu_read_unlock().
The rcu_read_unlock() must defer the quiescent state, and therefore
leaves the task queued on its leaf rcu_node structure.  If a scheduler
interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
one of its locks held.  However, the preempted task is still queued, so
rcu_read_unlock() will attempt to defer the quiescent state once more.
When RCU core processing is carried out by RCU_SOFTIRQ, this works just
fine: The raise_softirq() function simply sets a bit in a per-CPU mask
and the RCU core processing will be undertaken upon return from interrupt.

Not so when RCU core processing is carried out by the rcuc kthread: In this
case, the required wakeup can result in deadlock.

The initial solution to this problem was to use set_tsk_need_resched() and
set_preempt_need_resched() to force a future context switch, which allows
rcu_preempt_note_context_switch() to report the deferred quiescent state
to RCU's core processing.  Unfortunately for expedited grace periods,
there can be a significant delay between the call for a context switch
and the actual context switch.

This commit therefore introduces a ->deferred_qs flag to the task_struct
structure's rcu_special structure.  This flag is initially false, and
is set to true by the first call to rcu_read_unlock() requiring special
attention, then finally reset back to false when the quiescent state is
finally reported.  Then rcu_read_unlock() attempts full wakeups only when
->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
special attention.  Note that a chain of RCU readers linked by some other
sort of reader may find that a later rcu_read_unlock() is once again able
to do a full wakeup, courtesy of an intervening preemption:

	rcu_read_lock();
	/* preempted */
	local_irq_disable();
	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
	rcu_read_lock();
	local_irq_enable();
	preempt_disable()
	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
	rcu_read_lock();
	preempt_enable();
	/* preempted, >deferred_qs reset. */
	local_irq_disable();
	rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */

Such linked RCU readers do not yet seem to appear in the Linux kernel, and
it is probably best if they don't.  However, RCU needs to handle them, and
some variations on this theme could make even raise_softirq() unsafe due to
the possibility of its doing a full wakeup.  This commit therefore also
avoids invoking raise_softirq() when the ->deferred_qs set flag is set.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-05-25 14:50:47 -07:00
Sebastian Andrzej Siewior
48d07c04b4 rcu: Enable elimination of Tree-RCU softirq processing
Some workloads need to change kthread priority for RCU core processing
without affecting other softirq work.  This commit therefore introduces
the rcutree.use_softirq kernel boot parameter, which moves the RCU core
work from softirq to a per-CPU SCHED_OTHER kthread named rcuc.  Use of
SCHED_OTHER approach avoids the scalability problems that appeared
with the earlier attempt to move RCU core processing to from softirq
to kthreads.  That said, kernels built with RCU_BOOST=y will run the
rcuc kthreads at the RCU-boosting priority.

Note that rcutree.use_softirq=0 must be specified to move RCU core
processing to the rcuc kthreads: rcutree.use_softirq=1 is the default.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ paulmck: Adjust for invoke_rcu_callbacks() only ever being invoked
  from RCU core processing, in contrast to softirq->rcuc transition
  in old mainline RCU priority boosting. ]
[ paulmck: Avoid wakeups when scheduler might have invoked rcu_read_unlock()
  while holding rq or pi locks, also possibly fixing a pre-existing latent
  bug involving raise_softirq()-induced wakeups. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:46 -07:00