RCU pull request for v6.3
This pull request contains the following branches:

doc.2023.01.05a: Documentation updates.

fixes.2023.01.23a: Miscellaneous fixes, perhaps most notably:

 o  Throttling callback invocation based on the number of callbacks
    that are now ready to invoke instead of on the total number of
    callbacks.

 o  Several patches that suppress false-positive boot-time diagnostics,
    for example, due to lockdep not yet being initialized.

 o  Make expedited RCU CPU stall warnings dump stacks of any tasks that
    are blocking the stalled grace period.  (Normal RCU CPU stall
    warnings have done this for many years.)

 o  Lazy-callback fixes to avoid delays during boot, suspend, and
    resume.  (Note that lazy callbacks must be explicitly enabled, so
    this should not (yet) affect production use cases.)

kvfree.2023.01.03a: Cause kfree_rcu() and friends to take advantage of
polled grace periods, thus reducing memory footprint by almost two
orders of magnitude, admittedly on a microbenchmark.  This series also
begins the transition from kfree_rcu(p) to kfree_rcu_mightsleep(p).
This transition was motivated by bugs where kfree_rcu(p), which can
block, was typed instead of the intended kfree_rcu(p, rh).

srcu.2023.01.03a: SRCU updates, perhaps most notably fixing a bug that
causes SRCU to fail when booted on a system with a non-zero boot CPU.
This surprising situation actually happens for kdump kernels on the
powerpc architecture.  It also adds an srcu_down_read() and
srcu_up_read(), which act like srcu_read_lock() and srcu_read_unlock(),
but allow an SRCU read-side critical section to be handed off from one
task to another.

srcu-always.2023.02.02a: Cleans up the now-useless SRCU Kconfig option.
There are a few more commits that are not yet acked or pulled into
maintainer trees, and these will be in a pull request for a later merge
window.

tasks.2023.01.03a: RCU-tasks updates, perhaps most notably these fixes:

 o  A strange interaction between PID-namespace unshare and the
    RCU-tasks grace period that results in a low-probability but very
    real hang.

 o  A race between an RCU tasks rude grace period on a single-CPU
    system and CPU-hotplug addition of the second CPU that can result
    in a too-short grace period.

 o  A race between shrinking RCU tasks down to a single callback list
    and queuing a new callback to some other CPU, but where that
    queuing is delayed for more than an RCU grace period.  This can
    result in that callback being stranded on the non-boot CPU.

torture.2023.01.05a: Torture-test updates and fixes.

torturescript.2023.01.03a: Torture-test scripting updates and fixes.

stall.2023.01.09a: Provide additional RCU CPU stall-warning information
in kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the
full five-minute timeout limit for expedited RCU CPU stall warnings.
-----BEGIN PGP SIGNATURE-----

iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmPq29UTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jAhVEACEAKJY1VJ9IUqz7CwzAYkzgRJfiygh
oDUXmlqtm6ew9pr2GdLUVCVsUSldzBc0K7Djb/G1niv4JPs+v7YwupIV33+UbStU
Qxt6ztTdxc4lKospLm1+2vF9ZdzVEmiP4wVCc4iDarv5FM3FpWSTNc8+L7qmlC+X
myjv+GqMTxkXZBvYJOgJGFjDwN8noTd7Fr3mCCVLFm3PXMDa7tcwD6HRP5AqD2N8
qC5M6LEqepKVGmz0mYMLlSN1GPaqIsEcexIFEazRsPEivPh/iafyQCQ/cqxwhXmV
vEt7u+dXGZT/oiDq9cJ+/XRDS2RyKIS6dUE14TiiHolDCn1ONESahfA/gXWKykC2
BaGPfjWXrWv/hwbeZ+8xEdkAvTIV92tGpXir9Fby1Z5PjP3balvrnn6hs5AnQBJb
NdhRPLzy/dCnEF+CweAYYm1qvTo8cd5nyiNwBZHn7rEAIu3Axrecag1rhFl3AJ07
cpVMQXZtkQVa2X8aIRTUC+ijX6yIqNaHlu0HqNXgIUTDzL4nv5cMjOMzpNQP9/dZ
FwAMZYNiOk9IlMiKJ8ZiVcxeiA8ouIBlkYM3k6vGrmiONZ7a/EV/mSHoJqI8bvqr
AxUIJ2Ayhg3bxPboL5oKgCiLql0A7ZVvz6quX6McitWGMgaSvel1fDzT3TnZd41e
4AFBFd/+VedUGg==
=bBYK
-----END PGP SIGNATURE-----

Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes, perhaps most notably:

     - Throttling callback invocation based on the number of callbacks
       that are now ready to invoke instead of on the total number of
       callbacks

     - Several patches that suppress false-positive boot-time
       diagnostics, for example, due to lockdep not yet being
       initialized

     - Make expedited RCU CPU stall warnings dump stacks of any tasks
       that are blocking the stalled grace period. (Normal RCU CPU
       stall warnings have done this for many years)

     - Lazy-callback fixes to avoid delays during boot, suspend, and
       resume. (Note that lazy callbacks must be explicitly enabled, so
       this should not (yet) affect production use cases)

 - Make kfree_rcu() and friends take advantage of polled grace periods,
   thus reducing memory footprint by almost two orders of magnitude,
   admittedly on a microbenchmark

   This also begins the transition from kfree_rcu(p) to
   kfree_rcu_mightsleep(p). This transition was motivated by bugs where
   kfree_rcu(p), which can block, was typed instead of the intended
   kfree_rcu(p, rh)

 - SRCU updates, perhaps most notably fixing a bug that causes SRCU to
   fail when booted on a system with a non-zero boot CPU. This
   surprising situation actually happens for kdump kernels on the
   powerpc architecture

   This also adds an srcu_down_read() and srcu_up_read(), which act
   like srcu_read_lock() and srcu_read_unlock(), but allow an SRCU
   read-side critical section to be handed off from one task to another

 - Clean up the now-useless SRCU Kconfig option

   There are a few more commits that are not yet acked or pulled into
   maintainer trees, and these will be in a pull request for a later
   merge window

 - RCU-tasks updates, perhaps most notably these fixes:

     - A strange interaction between PID-namespace unshare and the
       RCU-tasks grace period that results in a low-probability but
       very real hang

     - A race between an RCU tasks rude grace period on a single-CPU
       system and CPU-hotplug addition of the second CPU that can
       result in a too-short grace period

     - A race between shrinking RCU tasks down to a single callback
       list and queuing a new callback to some other CPU, but where
       that queuing is delayed for more than an RCU grace period. This
       can result in that callback being stranded on the non-boot CPU

 - Torture-test updates and fixes

 - Torture-test scripting updates and fixes

 - Provide additional RCU CPU stall-warning information in kernels
   built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full
   five-minute timeout limit for expedited RCU CPU stall warnings

* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
  rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
  kernel/notifier: Remove CONFIG_SRCU
  init: Remove "select SRCU"
  fs/quota: Remove "select SRCU"
  fs/notify: Remove "select SRCU"
  fs/btrfs: Remove "select SRCU"
  fs: Remove CONFIG_SRCU
  drivers/pci/controller: Remove "select SRCU"
  drivers/net: Remove "select SRCU"
  drivers/md: Remove "select SRCU"
  drivers/hwtracing/stm: Remove "select SRCU"
  drivers/dax: Remove "select SRCU"
  drivers/base: Remove CONFIG_SRCU
  rcu: Disable laziness if lazy-tracking says so
  rcu: Track laziness during boot and suspend
  rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
  rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
  rcu: Align the output of RCU CPU stall warning messages
  rcu: Add RCU stall diagnosis information
  sched: Add helper nr_context_switches_cpu()
  ...
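As an aside for readers following along, here is a minimal sketch of the kfree_rcu()/kfree_rcu_mightsleep() distinction called out above; struct foo and the two helper functions are purely illustrative:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
        int data;
        struct rcu_head rh;     /* needed only by the two-argument form */
    };

    /* Two-argument form: uses the embedded rcu_head, never sleeps, and is
     * therefore usable while holding locks or otherwise in atomic context. */
    static void retire_foo(struct foo *p)
    {
        kfree_rcu(p, rh);
    }

    /* Single-argument form, now spelled kfree_rcu_mightsleep(): needs no
     * rcu_head, but can block, so it may only be called from sleepable
     * context. */
    static void retire_foo_sleepable(struct foo *p)
    {
        kfree_rcu_mightsleep(p);
    }

Making the sleeping behavior visible in the name at the call site is exactly what guards against the bug class described above, where the blocking single-argument call was typed in place of the intended kfree_rcu(p, rh).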
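Similarly, a hedged sketch of the srcu_down_read()/srcu_up_read() hand-off mentioned above, in which the task that enters the SRCU read-side critical section is not the task that exits it; my_srcu, struct my_work, and the workqueue hand-off are invented for illustration:

    #include <linux/slab.h>
    #include <linux/srcu.h>
    #include <linux/workqueue.h>

    DEFINE_STATIC_SRCU(my_srcu);

    struct my_work {
        struct work_struct work;
        int srcu_idx;           /* cookie returned by srcu_down_read() */
    };

    static void my_worker(struct work_struct *w)
    {
        struct my_work *mw = container_of(w, struct my_work, work);

        /* ... access the SRCU-protected data on behalf of the submitter ... */
        srcu_up_read(&my_srcu, mw->srcu_idx);   /* end the handed-off critical section */
        kfree(mw);
    }

    /* Enter an SRCU read-side critical section, then hand it to a workqueue. */
    static void start_handoff(struct my_work *mw)
    {
        mw->srcu_idx = srcu_down_read(&my_srcu);
        INIT_WORK(&mw->work, my_worker);
        schedule_work(&mw->work);
    }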
@ -8,7 +8,7 @@ Although RCU is usually used to protect read-mostly data structures,
|
||||
it is possible to use RCU to provide dynamic non-maskable interrupt
|
||||
handlers, as well as dynamic irq handlers. This document describes
|
||||
how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
|
||||
work in "arch/x86/kernel/traps.c".
|
||||
work in an old version of "arch/x86/kernel/traps.c".
|
||||
|
||||
The relevant pieces of code are listed below, each followed by a
|
||||
brief explanation::
|
||||
@ -116,7 +116,7 @@ Answer to Quick Quiz:
|
||||
|
||||
This same sad story can happen on other CPUs when using
|
||||
a compiler with aggressive pointer-value speculation
|
||||
optimizations.
|
||||
optimizations. (But please don't!)
|
||||
|
||||
More important, the rcu_dereference_sched() makes it
|
||||
clear to someone reading the code that the pointer is
|
||||
|
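The dynamic-handler pattern discussed in the hunk above comes down to publishing a function pointer with rcu_assign_pointer() and fetching it with rcu_dereference_sched(); a hedged sketch, with dyn_handler and its two helpers invented for illustration::

    #include <linux/rcupdate.h>

    typedef void (*handler_fn_t)(void);

    static handler_fn_t __rcu dyn_handler;      /* illustrative handler slot */

    void set_dyn_handler(handler_fn_t fn)
    {
        rcu_assign_pointer(dyn_handler, fn);    /* publish the new handler */
    }

    void call_dyn_handler(void)
    {
        handler_fn_t fn;

        /* NMI/irq context runs with preemption disabled, which is why
         * the _sched flavor of rcu_dereference() is the one used here. */
        fn = rcu_dereference_sched(dyn_handler);
        if (fn)
            fn();
    }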
@ -38,7 +38,7 @@ by having call_rcu() directly invoke its arguments only if it was called
|
||||
from process context. However, this can fail in a similar manner.
|
||||
|
||||
Suppose that an RCU-based algorithm again scans a linked list containing
|
||||
elements A, B, and C in process contexts, but that it invokes a function
|
||||
elements A, B, and C in process context, but that it invokes a function
|
||||
on each element as it is scanned. Suppose further that this function
|
||||
deletes element B from the list, then passes it to call_rcu() for deferred
|
||||
freeing. This may be a bit unconventional, but it is perfectly legal
|
||||
@ -59,7 +59,8 @@ Example 3: Death by Deadlock
|
||||
Suppose that call_rcu() is invoked while holding a lock, and that the
|
||||
callback function must acquire this same lock. In this case, if
|
||||
call_rcu() were to directly invoke the callback, the result would
|
||||
be self-deadlock.
|
||||
be self-deadlock *even if* this invocation occurred from a later
|
||||
call_rcu() invocation a full grace period later.
|
||||
|
||||
In some cases, it would be possible to restructure the code so that
|
||||
the call_rcu() is delayed until after the lock is released. However,
|
||||
@ -85,6 +86,14 @@ Quick Quiz #2:
|
||||
|
||||
:ref:`Answers to Quick Quiz <answer_quick_quiz_up>`
|
||||
|
||||
It is important to note that userspace RCU implementations *do*
|
||||
permit call_rcu() to directly invoke callbacks, but only if a full
|
||||
grace period has elapsed since those callbacks were queued. This is
|
||||
the case because some userspace environments are extremely constrained.
|
||||
Nevertheless, people writing userspace RCU implementations are strongly
|
||||
encouraged to avoid invoking callbacks from call_rcu(), thus obtaining
|
||||
the deadlock-avoidance benefits called out above.
|
||||
|
||||
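To make the Example 3 hazard concrete, here is a hedged sketch assuming a struct foo with an embedded rcu_head and an invented foo_lock; the point is that call_rcu() must queue foo_reclaim() for later rather than invoking it directly::

    #include <linux/list.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct foo {
        struct list_head list;
        struct rcu_head rh;
    };

    static DEFINE_SPINLOCK(foo_lock);

    static void foo_reclaim(struct rcu_head *rhp)
    {
        struct foo *fp = container_of(rhp, struct foo, rh);

        spin_lock(&foo_lock);           /* the callback needs the same lock... */
        /* ... final teardown requiring foo_lock ... */
        spin_unlock(&foo_lock);
        kfree(fp);
    }

    void foo_retire(struct foo *fp)
    {
        spin_lock(&foo_lock);           /* ...that is held across call_rcu() */
        list_del_rcu(&fp->list);
        call_rcu(&fp->rh, foo_reclaim); /* direct invocation here would self-deadlock */
        spin_unlock(&foo_lock);
    }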
Summary
|
||||
-------
|
||||
|
||||
|
@ -69,9 +69,8 @@ checking of rcu_dereference() primitives:
|
||||
value of the pointer itself, for example, against NULL.
|
||||
|
||||
The rcu_dereference_check() check expression can be any boolean
|
||||
expression, but would normally include a lockdep expression. However,
|
||||
any boolean expression can be used. For a moderately ornate example,
|
||||
consider the following::
|
||||
expression, but would normally include a lockdep expression. For a
|
||||
moderately ornate example, consider the following::
|
||||
|
||||
file = rcu_dereference_check(fdt->fd[fd],
|
||||
lockdep_is_held(&files->file_lock) ||
|
||||
@ -97,10 +96,10 @@ code, it could instead be written as follows::
|
||||
atomic_read(&files->count) == 1);
|
||||
|
||||
This would verify cases #2 and #3 above, and furthermore lockdep would
|
||||
complain if this was used in an RCU read-side critical section unless one
|
||||
of these two cases held. Because rcu_dereference_protected() omits all
|
||||
barriers and compiler constraints, it generates better code than do the
|
||||
other flavors of rcu_dereference(). On the other hand, it is illegal
|
||||
complain even if this was used in an RCU read-side critical section unless
|
||||
one of these two cases held. Because rcu_dereference_protected() omits
|
||||
all barriers and compiler constraints, it generates better code than do
|
||||
the other flavors of rcu_dereference(). On the other hand, it is illegal
|
||||
to use rcu_dereference_protected() if either the RCU-protected pointer
|
||||
or the RCU-protected data that it points to can change concurrently.
|
||||
|
||||
|
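As an update-side companion to the rcu_dereference_check() example above, here is a hedged sketch using rcu_dereference_protected() under an invented my_lock; struct config and cur_config are illustrative names, not kernel APIs::

    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct config {
        int value;
        struct rcu_head rh;
    };

    static struct config __rcu *cur_config;
    static DEFINE_SPINLOCK(my_lock);

    void replace_config(struct config *newc)
    {
        struct config *oldc;

        spin_lock(&my_lock);
        /* my_lock excludes concurrent updaters, and lockdep will complain
         * if this statement is reached without the lock held. */
        oldc = rcu_dereference_protected(cur_config,
                                         lockdep_is_held(&my_lock));
        rcu_assign_pointer(cur_config, newc);
        spin_unlock(&my_lock);
        if (oldc)
            kfree_rcu(oldc, rh);
    }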
@ -77,15 +77,17 @@ Frequently Asked Questions
|
||||
search for the string "Patent" in Documentation/RCU/RTFP.txt to find them.
|
||||
Of these, one was allowed to lapse by the assignee, and the
|
||||
others have been contributed to the Linux kernel under GPL.
|
||||
Many (but not all) have long since expired.
|
||||
There are now also LGPL implementations of user-level RCU
|
||||
available (https://liburcu.org/).
|
||||
|
||||
- I hear that RCU needs work in order to support realtime kernels?
|
||||
|
||||
Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
|
||||
Realtime-friendly RCU are enabled via the CONFIG_PREEMPTION
|
||||
kernel configuration parameter.
|
||||
|
||||
- Where can I find more information on RCU?
|
||||
|
||||
See the Documentation/RCU/RTFP.txt file.
|
||||
Or point your browser at (http://www.rdrop.com/users/paulmck/RCU/).
|
||||
Or point your browser at (https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit)
|
||||
or (https://docs.google.com/document/d/1GCdQC8SDbb54W1shjEXqGZ0Rq8a6kIeYutdSIajfpLA/edit?usp=sharing).
|
||||
|
@ -19,8 +19,9 @@ Follow these rules to keep your RCU code working properly:
|
||||
can reload the value, and won't your code have fun with two
|
||||
different values for a single pointer! Without rcu_dereference(),
|
||||
DEC Alpha can load a pointer, dereference that pointer, and
|
||||
return data preceding initialization that preceded the store of
|
||||
the pointer.
|
||||
return data preceding initialization that preceded the store
|
||||
of the pointer. (As noted later, in recent kernels READ_ONCE()
|
||||
also prevents DEC Alpha from playing these tricks.)
|
||||
|
||||
In addition, the volatile cast in rcu_dereference() prevents the
|
||||
compiler from deducing the resulting pointer value. Please see
|
||||
@ -34,7 +35,7 @@ Follow these rules to keep your RCU code working properly:
|
||||
takes on the role of the lockless_dereference() primitive that
|
||||
was removed in v4.15.
|
||||
|
||||
- You are only permitted to use rcu_dereference on pointer values.
|
||||
- You are only permitted to use rcu_dereference() on pointer values.
|
||||
The compiler simply knows too much about integral values to
|
||||
trust it to carry dependencies through integer operations.
|
||||
There are a very few exceptions, namely that you can temporarily
|
||||
@ -240,6 +241,7 @@ precautions. To see this, consider the following code fragment::
|
||||
struct foo *q;
|
||||
int r1, r2;
|
||||
|
||||
rcu_read_lock();
|
||||
p = rcu_dereference(gp2);
|
||||
if (p == NULL)
|
||||
return;
|
||||
@ -248,7 +250,10 @@ precautions. To see this, consider the following code fragment::
|
||||
if (p == q) {
|
||||
/* The compiler decides that q->c is same as p->c. */
|
||||
r2 = p->c; /* Could get 44 on weakly order system. */
|
||||
} else {
|
||||
r2 = p->c - r1; /* Unconditional access to p->c. */
|
||||
}
|
||||
rcu_read_unlock();
|
||||
do_something_with(r1, r2);
|
||||
}
|
||||
|
||||
@ -297,6 +302,7 @@ Then one approach is to use locking, for example, as follows::
|
||||
struct foo *q;
|
||||
int r1, r2;
|
||||
|
||||
rcu_read_lock();
|
||||
p = rcu_dereference(gp2);
|
||||
if (p == NULL)
|
||||
return;
|
||||
@ -306,7 +312,12 @@ Then one approach is to use locking, for example, as follows::
|
||||
if (p == q) {
|
||||
/* The compiler decides that q->c is same as p->c. */
|
||||
r2 = p->c; /* Locking guarantees r2 == 144. */
|
||||
} else {
|
||||
spin_lock(&q->lock);
|
||||
r2 = q->c - r1;
|
||||
spin_unlock(&q->lock);
|
||||
}
|
||||
rcu_read_unlock();
|
||||
spin_unlock(&p->lock);
|
||||
do_something_with(r1, r2);
|
||||
}
|
||||
@ -364,7 +375,7 @@ the exact value of "p" even in the not-equals case. This allows the
|
||||
compiler to make the return values independent of the load from "gp",
|
||||
in turn destroying the ordering between this load and the loads of the
|
||||
return values. This can result in "p->b" returning pre-initialization
|
||||
garbage values.
|
||||
garbage values on weakly ordered systems.
|
||||
|
||||
In short, rcu_dereference() is *not* optional when you are going to
|
||||
dereference the resulting pointer.
|
||||
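Restating that rule in code, here is a hedged sketch of the minimal correct reader; my_gp and this struct foo are illustrative and distinct from the gp/gp2 used in the fragments above::

    #include <linux/rcupdate.h>

    struct foo {
        int a;
        int b;
    };

    static struct foo __rcu *my_gp;     /* illustrative RCU-protected pointer */

    int read_foo_b(void)
    {
        struct foo *p;
        int ret = -1;

        rcu_read_lock();
        p = rcu_dereference(my_gp);     /* not a plain load, not a bare READ_ONCE() */
        if (p)
            ret = p->b;                 /* dependency-ordered after the load of my_gp */
        rcu_read_unlock();
        return ret;
    }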
@ -430,7 +441,7 @@ member of the rcu_dereference() to use in various situations:
|
||||
SPARSE CHECKING OF RCU-PROTECTED POINTERS
|
||||
-----------------------------------------
|
||||
|
||||
The sparse static-analysis tool checks for direct access to RCU-protected
|
||||
The sparse static-analysis tool checks for non-RCU access to RCU-protected
|
||||
pointers, which can result in "interesting" bugs due to compiler
|
||||
optimizations involving invented loads and perhaps also load tearing.
|
||||
For example, suppose someone mistakenly does something like this::
|
||||
|
@ -5,37 +5,12 @@ RCU and Unloadable Modules
|
||||
|
||||
[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/]
|
||||
|
||||
RCU (read-copy update) is a synchronization mechanism that can be thought
|
||||
of as a replacement for reader-writer locking (among other things), but with
|
||||
very low-overhead readers that are immune to deadlock, priority inversion,
|
||||
and unbounded latency. RCU read-side critical sections are delimited
|
||||
by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPTION
|
||||
kernels, generate no code whatsoever.
|
||||
|
||||
This means that RCU writers are unaware of the presence of concurrent
|
||||
readers, so that RCU updates to shared data must be undertaken quite
|
||||
carefully, leaving an old version of the data structure in place until all
|
||||
pre-existing readers have finished. These old versions are needed because
|
||||
such readers might hold a reference to them. RCU updates can therefore be
|
||||
rather expensive, and RCU is thus best suited for read-mostly situations.
|
||||
|
||||
How can an RCU writer possibly determine when all readers are finished,
|
||||
given that readers might well leave absolutely no trace of their
|
||||
presence? There is a synchronize_rcu() primitive that blocks until all
|
||||
pre-existing readers have completed. An updater wishing to delete an
|
||||
element p from a linked list might do the following, while holding an
|
||||
appropriate lock, of course::
|
||||
|
||||
list_del_rcu(p);
|
||||
synchronize_rcu();
|
||||
kfree(p);
|
||||
|
||||
But the above code cannot be used in IRQ context -- the call_rcu()
|
||||
primitive must be used instead. This primitive takes a pointer to an
|
||||
rcu_head struct placed within the RCU-protected data structure and
|
||||
another pointer to a function that may be invoked later to free that
|
||||
structure. Code to delete an element p from the linked list from IRQ
|
||||
context might then be as follows::
|
||||
RCU updaters sometimes use call_rcu() to initiate an asynchronous wait for
|
||||
a grace period to elapse. This primitive takes a pointer to an rcu_head
|
||||
struct placed within the RCU-protected data structure and another pointer
|
||||
to a function that may be invoked later to free that structure. Code to
|
||||
delete an element p from the linked list from IRQ context might then be
|
||||
as follows::
|
||||
|
||||
list_del_rcu(p);
|
||||
call_rcu(&p->rcu, p_callback);
|
||||
@ -54,7 +29,7 @@ IRQ context. The function p_callback() might be defined as follows::
|
||||
Unloading Modules That Use call_rcu()
|
||||
-------------------------------------
|
||||
|
||||
But what if p_callback is defined in an unloadable module?
|
||||
But what if the p_callback() function is defined in an unloadable module?
|
||||
|
||||
If we unload the module while some RCU callbacks are pending,
|
||||
the CPUs executing these callbacks are going to be severely
|
||||
@ -67,20 +42,21 @@ grace period to elapse, it does not wait for the callbacks to complete.
|
||||
|
||||
One might be tempted to try several back-to-back synchronize_rcu()
|
||||
calls, but this is still not guaranteed to work. If there is a very
|
||||
heavy RCU-callback load, then some of the callbacks might be deferred
|
||||
in order to allow other processing to proceed. Such deferral is required
|
||||
in realtime kernels in order to avoid excessive scheduling latencies.
|
||||
heavy RCU-callback load, then some of the callbacks might be deferred in
|
||||
order to allow other processing to proceed. For but one example, such
|
||||
deferral is required in realtime kernels in order to avoid excessive
|
||||
scheduling latencies.
|
||||
|
||||
|
||||
rcu_barrier()
|
||||
-------------
|
||||
|
||||
We instead need the rcu_barrier() primitive. Rather than waiting for
|
||||
a grace period to elapse, rcu_barrier() waits for all outstanding RCU
|
||||
callbacks to complete. Please note that rcu_barrier() does **not** imply
|
||||
synchronize_rcu(), in particular, if there are no RCU callbacks queued
|
||||
anywhere, rcu_barrier() is within its rights to return immediately,
|
||||
without waiting for a grace period to elapse.
|
||||
This situation can be handled by the rcu_barrier() primitive. Rather
|
||||
than waiting for a grace period to elapse, rcu_barrier() waits for all
|
||||
outstanding RCU callbacks to complete. Please note that rcu_barrier()
|
||||
does **not** imply synchronize_rcu(), in particular, if there are no RCU
|
||||
callbacks queued anywhere, rcu_barrier() is within its rights to return
|
||||
immediately, without waiting for anything, let alone a grace period.
|
||||
|
||||
Pseudo-code using rcu_barrier() is as follows:
|
||||
|
||||
@ -89,83 +65,86 @@ Pseudo-code using rcu_barrier() is as follows:
|
||||
3. Allow the module to be unloaded.
|
||||
|
||||
There is also an srcu_barrier() function for SRCU, and you of course
|
||||
must match the flavor of rcu_barrier() with that of call_rcu(). If your
|
||||
module uses multiple flavors of call_rcu(), then it must also use multiple
|
||||
flavors of rcu_barrier() when unloading that module. For example, if
|
||||
it uses call_rcu(), call_srcu() on srcu_struct_1, and call_srcu() on
|
||||
srcu_struct_2, then the following three lines of code will be required
|
||||
when unloading::
|
||||
must match the flavor of srcu_barrier() with that of call_srcu().
|
||||
If your module uses multiple srcu_struct structures, then it must also
|
||||
use multiple invocations of srcu_barrier() when unloading that module.
|
||||
For example, if it uses call_rcu(), call_srcu() on srcu_struct_1, and
|
||||
call_srcu() on srcu_struct_2, then the following three lines of code
|
||||
will be required when unloading::
|
||||
|
||||
1 rcu_barrier();
|
||||
2 srcu_barrier(&srcu_struct_1);
|
||||
3 srcu_barrier(&srcu_struct_2);
|
||||
1 rcu_barrier();
|
||||
2 srcu_barrier(&srcu_struct_1);
|
||||
3 srcu_barrier(&srcu_struct_2);
|
||||
|
||||
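A hedged sketch of a module-exit path that follows the rule above for a module using both call_rcu() and call_srcu(); my_srcu and the module-exit function are illustrative, and the "stop posting callbacks" step appears only as a comment because it is inherently module-specific::

    #include <linux/module.h>
    #include <linux/rcupdate.h>
    #include <linux/srcu.h>

    DEFINE_STATIC_SRCU(my_srcu);

    static void __exit my_module_exit(void)
    {
        /* Step 1: ensure nothing can post new call_rcu() or call_srcu()
         * callbacks, for example by unregistering hooks and canceling
         * timers (module-specific details omitted). */

        rcu_barrier();                  /* step 2a: wait for call_rcu() callbacks */
        srcu_barrier(&my_srcu);         /* step 2b: wait for call_srcu() callbacks */

        /* Step 3: it is now safe for the module text to go away. */
    }
    module_exit(my_module_exit);

    MODULE_LICENSE("GPL");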
The rcutorture module makes use of rcu_barrier() in its exit function
|
||||
as follows::
|
||||
If latency is of the essence, workqueues could be used to run these
|
||||
three functions concurrently.
|
||||
|
||||
1 static void
|
||||
2 rcu_torture_cleanup(void)
|
||||
3 {
|
||||
4 int i;
|
||||
5
|
||||
6 fullstop = 1;
|
||||
7 if (shuffler_task != NULL) {
|
||||
8 VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
|
||||
9 kthread_stop(shuffler_task);
|
||||
10 }
|
||||
11 shuffler_task = NULL;
|
||||
An ancient version of the rcutorture module makes use of rcu_barrier()
|
||||
in its exit function as follows::
|
||||
|
||||
1 static void
|
||||
2 rcu_torture_cleanup(void)
|
||||
3 {
|
||||
4 int i;
|
||||
5
|
||||
6 fullstop = 1;
|
||||
7 if (shuffler_task != NULL) {
|
||||
8 VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
|
||||
9 kthread_stop(shuffler_task);
|
||||
10 }
|
||||
11 shuffler_task = NULL;
|
||||
12
|
||||
13 if (writer_task != NULL) {
|
||||
14 VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
|
||||
15 kthread_stop(writer_task);
|
||||
16 }
|
||||
17 writer_task = NULL;
|
||||
13 if (writer_task != NULL) {
|
||||
14 VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
|
||||
15 kthread_stop(writer_task);
|
||||
16 }
|
||||
17 writer_task = NULL;
|
||||
18
|
||||
19 if (reader_tasks != NULL) {
|
||||
20 for (i = 0; i < nrealreaders; i++) {
|
||||
21 if (reader_tasks[i] != NULL) {
|
||||
22 VERBOSE_PRINTK_STRING(
|
||||
23 "Stopping rcu_torture_reader task");
|
||||
24 kthread_stop(reader_tasks[i]);
|
||||
25 }
|
||||
26 reader_tasks[i] = NULL;
|
||||
27 }
|
||||
28 kfree(reader_tasks);
|
||||
29 reader_tasks = NULL;
|
||||
30 }
|
||||
31 rcu_torture_current = NULL;
|
||||
19 if (reader_tasks != NULL) {
|
||||
20 for (i = 0; i < nrealreaders; i++) {
|
||||
21 if (reader_tasks[i] != NULL) {
|
||||
22 VERBOSE_PRINTK_STRING(
|
||||
23 "Stopping rcu_torture_reader task");
|
||||
24 kthread_stop(reader_tasks[i]);
|
||||
25 }
|
||||
26 reader_tasks[i] = NULL;
|
||||
27 }
|
||||
28 kfree(reader_tasks);
|
||||
29 reader_tasks = NULL;
|
||||
30 }
|
||||
31 rcu_torture_current = NULL;
|
||||
32
|
||||
33 if (fakewriter_tasks != NULL) {
|
||||
34 for (i = 0; i < nfakewriters; i++) {
|
||||
35 if (fakewriter_tasks[i] != NULL) {
|
||||
36 VERBOSE_PRINTK_STRING(
|
||||
37 "Stopping rcu_torture_fakewriter task");
|
||||
38 kthread_stop(fakewriter_tasks[i]);
|
||||
39 }
|
||||
40 fakewriter_tasks[i] = NULL;
|
||||
41 }
|
||||
42 kfree(fakewriter_tasks);
|
||||
43 fakewriter_tasks = NULL;
|
||||
44 }
|
||||
33 if (fakewriter_tasks != NULL) {
|
||||
34 for (i = 0; i < nfakewriters; i++) {
|
||||
35 if (fakewriter_tasks[i] != NULL) {
|
||||
36 VERBOSE_PRINTK_STRING(
|
||||
37 "Stopping rcu_torture_fakewriter task");
|
||||
38 kthread_stop(fakewriter_tasks[i]);
|
||||
39 }
|
||||
40 fakewriter_tasks[i] = NULL;
|
||||
41 }
|
||||
42 kfree(fakewriter_tasks);
|
||||
43 fakewriter_tasks = NULL;
|
||||
44 }
|
||||
45
|
||||
46 if (stats_task != NULL) {
|
||||
47 VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
|
||||
48 kthread_stop(stats_task);
|
||||
49 }
|
||||
50 stats_task = NULL;
|
||||
46 if (stats_task != NULL) {
|
||||
47 VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
|
||||
48 kthread_stop(stats_task);
|
||||
49 }
|
||||
50 stats_task = NULL;
|
||||
51
|
||||
52 /* Wait for all RCU callbacks to fire. */
|
||||
53 rcu_barrier();
|
||||
52 /* Wait for all RCU callbacks to fire. */
|
||||
53 rcu_barrier();
|
||||
54
|
||||
55 rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
|
||||
55 rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
|
||||
56
|
||||
57 if (cur_ops->cleanup != NULL)
|
||||
58 cur_ops->cleanup();
|
||||
59 if (atomic_read(&n_rcu_torture_error))
|
||||
60 rcu_torture_print_module_parms("End of test: FAILURE");
|
||||
61 else
|
||||
62 rcu_torture_print_module_parms("End of test: SUCCESS");
|
||||
63 }
|
||||
57 if (cur_ops->cleanup != NULL)
|
||||
58 cur_ops->cleanup();
|
||||
59 if (atomic_read(&n_rcu_torture_error))
|
||||
60 rcu_torture_print_module_parms("End of test: FAILURE");
|
||||
61 else
|
||||
62 rcu_torture_print_module_parms("End of test: SUCCESS");
|
||||
63 }
|
||||
|
||||
Line 6 sets a global variable that prevents any RCU callbacks from
|
||||
re-posting themselves. This will not be necessary in most cases, since
|
||||
@ -190,16 +169,17 @@ Quick Quiz #1:
|
||||
:ref:`Answer to Quick Quiz #1 <answer_rcubarrier_quiz_1>`
|
||||
|
||||
Your module might have additional complications. For example, if your
|
||||
module invokes call_rcu() from timers, you will need to first cancel all
|
||||
the timers, and only then invoke rcu_barrier() to wait for any remaining
|
||||
module invokes call_rcu() from timers, you will need to first refrain
|
||||
from posting new timers, cancel (or wait for) all the already-posted
|
||||
timers, and only then invoke rcu_barrier() to wait for any remaining
|
||||
RCU callbacks to complete.
|
||||
|
||||
Of course, if you module uses call_rcu(), you will need to invoke
|
||||
Of course, if your module uses call_rcu(), you will need to invoke
|
||||
rcu_barrier() before unloading. Similarly, if your module uses
|
||||
call_srcu(), you will need to invoke srcu_barrier() before unloading,
|
||||
and on the same srcu_struct structure. If your module uses call_rcu()
|
||||
**and** call_srcu(), then you will need to invoke rcu_barrier() **and**
|
||||
srcu_barrier().
|
||||
**and** call_srcu(), then (as noted above) you will need to invoke
|
||||
rcu_barrier() **and** srcu_barrier().
|
||||
|
||||
|
||||
Implementing rcu_barrier()
|
||||
@ -211,27 +191,40 @@ queues. His implementation queues an RCU callback on each of the per-CPU
|
||||
callback queues, and then waits until they have all started executing, at
|
||||
which point, all earlier RCU callbacks are guaranteed to have completed.
|
||||
|
||||
The original code for rcu_barrier() was as follows::
|
||||
The original code for rcu_barrier() was roughly as follows::
|
||||
|
||||
1 void rcu_barrier(void)
|
||||
2 {
|
||||
3 BUG_ON(in_interrupt());
|
||||
4 /* Take cpucontrol mutex to protect against CPU hotplug */
|
||||
5 mutex_lock(&rcu_barrier_mutex);
|
||||
6 init_completion(&rcu_barrier_completion);
|
||||
7 atomic_set(&rcu_barrier_cpu_count, 0);
|
||||
8 on_each_cpu(rcu_barrier_func, NULL, 0, 1);
|
||||
9 wait_for_completion(&rcu_barrier_completion);
|
||||
10 mutex_unlock(&rcu_barrier_mutex);
|
||||
11 }
|
||||
1 void rcu_barrier(void)
|
||||
2 {
|
||||
3 BUG_ON(in_interrupt());
|
||||
4 /* Take cpucontrol mutex to protect against CPU hotplug */
|
||||
5 mutex_lock(&rcu_barrier_mutex);
|
||||
6 init_completion(&rcu_barrier_completion);
|
||||
7 atomic_set(&rcu_barrier_cpu_count, 1);
|
||||
8 on_each_cpu(rcu_barrier_func, NULL, 0, 1);
|
||||
9 if (atomic_dec_and_test(&rcu_barrier_cpu_count))
|
||||
10 complete(&rcu_barrier_completion);
|
||||
11 wait_for_completion(&rcu_barrier_completion);
|
||||
12 mutex_unlock(&rcu_barrier_mutex);
|
||||
13 }
|
||||
|
||||
Line 3 verifies that the caller is in process context, and lines 5 and 10
|
||||
Line 3 verifies that the caller is in process context, and lines 5 and 12
|
||||
use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the
|
||||
global completion and counters at a time, which are initialized on lines
|
||||
6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is
|
||||
shown below. Note that the final "1" in on_each_cpu()'s argument list
|
||||
ensures that all the calls to rcu_barrier_func() will have completed
|
||||
before on_each_cpu() returns. Line 9 then waits for the completion.
|
||||
before on_each_cpu() returns. Line 9 removes the initial count from
|
||||
rcu_barrier_cpu_count, and if this count is now zero, line 10 finalizes
|
||||
the completion, which prevents line 11 from blocking. Either way,
|
||||
line 11 then waits (if needed) for the completion.
|
||||
|
||||
.. _rcubarrier_quiz_2:
|
||||
|
||||
Quick Quiz #2:
|
||||
Why doesn't line 8 initialize rcu_barrier_cpu_count to zero,
|
||||
thereby avoiding the need for lines 9 and 10?
|
||||
|
||||
:ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`
|
||||
|
||||
This code was rewritten in 2008 and several times thereafter, but this
|
||||
still gives the general idea.
|
||||
@ -239,21 +232,21 @@ still gives the general idea.
|
||||
The rcu_barrier_func() runs on each CPU, where it invokes call_rcu()
|
||||
to post an RCU callback, as follows::
|
||||
|
||||
1 static void rcu_barrier_func(void *notused)
|
||||
2 {
|
||||
3 int cpu = smp_processor_id();
|
||||
4 struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
|
||||
5 struct rcu_head *head;
|
||||
6
|
||||
7 head = &rdp->barrier;
|
||||
8 atomic_inc(&rcu_barrier_cpu_count);
|
||||
9 call_rcu(head, rcu_barrier_callback);
|
||||
10 }
|
||||
1 static void rcu_barrier_func(void *notused)
|
||||
2 {
|
||||
3 int cpu = smp_processor_id();
|
||||
4 struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
|
||||
5 struct rcu_head *head;
|
||||
6
|
||||
7 head = &rdp->barrier;
|
||||
8 atomic_inc(&rcu_barrier_cpu_count);
|
||||
9 call_rcu(head, rcu_barrier_callback);
|
||||
10 }
|
||||
|
||||
Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure,
|
||||
which contains the struct rcu_head that needed for the later call to
|
||||
call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line
|
||||
8 increments a global counter. This counter will later be decremented
|
||||
8 increments the global counter. This counter will later be decremented
|
||||
by the callback. Line 9 then registers the rcu_barrier_callback() on
|
||||
the current CPU's queue.
|
||||
|
||||
@ -261,33 +254,34 @@ The rcu_barrier_callback() function simply atomically decrements the
|
||||
rcu_barrier_cpu_count variable and finalizes the completion when it
|
||||
reaches zero, as follows::
|
||||
|
||||
1 static void rcu_barrier_callback(struct rcu_head *notused)
|
||||
2 {
|
||||
3 if (atomic_dec_and_test(&rcu_barrier_cpu_count))
|
||||
4 complete(&rcu_barrier_completion);
|
||||
5 }
|
||||
1 static void rcu_barrier_callback(struct rcu_head *notused)
|
||||
2 {
|
||||
3 if (atomic_dec_and_test(&rcu_barrier_cpu_count))
|
||||
4 complete(&rcu_barrier_completion);
|
||||
5 }
|
||||
|
||||
.. _rcubarrier_quiz_2:
|
||||
.. _rcubarrier_quiz_3:
|
||||
|
||||
Quick Quiz #2:
|
||||
Quick Quiz #3:
|
||||
What happens if CPU 0's rcu_barrier_func() executes
|
||||
immediately (thus incrementing rcu_barrier_cpu_count to the
|
||||
value one), but the other CPU's rcu_barrier_func() invocations
|
||||
are delayed for a full grace period? Couldn't this result in
|
||||
rcu_barrier() returning prematurely?
|
||||
|
||||
:ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`
|
||||
:ref:`Answer to Quick Quiz #3 <answer_rcubarrier_quiz_3>`
|
||||
|
||||
The current rcu_barrier() implementation is more complex, due to the need
|
||||
to avoid disturbing idle CPUs (especially on battery-powered systems)
|
||||
and the need to minimally disturb non-idle CPUs in real-time systems.
|
||||
However, the code above illustrates the concepts.
|
||||
In addition, a great many optimizations have been applied. However,
|
||||
the code above illustrates the concepts.
|
||||
|
||||
|
||||
rcu_barrier() Summary
|
||||
---------------------
|
||||
|
||||
The rcu_barrier() primitive has seen relatively little use, since most
|
||||
The rcu_barrier() primitive is used relatively infrequently, since most
|
||||
code using RCU is in the core kernel rather than in modules. However, if
|
||||
you are using RCU from an unloadable module, you need to use rcu_barrier()
|
||||
so that your module may be safely unloaded.
|
||||
@ -302,7 +296,8 @@ Quick Quiz #1:
|
||||
Is there any other situation where rcu_barrier() might
|
||||
be required?
|
||||
|
||||
Answer: Interestingly enough, rcu_barrier() was not originally
|
||||
Answer:
|
||||
Interestingly enough, rcu_barrier() was not originally
|
||||
implemented for module unloading. Nikita Danilov was using
|
||||
RCU in a filesystem, which resulted in a similar situation at
|
||||
filesystem-unmount time. Dipankar Sarma coded up rcu_barrier()
|
||||
@ -318,13 +313,48 @@ Answer: Interestingly enough, rcu_barrier() was not originally
|
||||
.. _answer_rcubarrier_quiz_2:
|
||||
|
||||
Quick Quiz #2:
|
||||
Why doesn't line 8 initialize rcu_barrier_cpu_count to zero,
|
||||
thereby avoiding the need for lines 9 and 10?
|
||||
|
||||
Answer:
|
||||
Suppose that the on_each_cpu() function shown on line 8 was
|
||||
delayed, so that CPU 0's rcu_barrier_func() executed and
|
||||
the corresponding grace period elapsed, all before CPU 1's
|
||||
rcu_barrier_func() started executing. This would result in
|
||||
rcu_barrier_cpu_count being decremented to zero, so that line
|
||||
11's wait_for_completion() would return immediately, failing to
|
||||
wait for CPU 1's callbacks to be invoked.
|
||||
|
||||
Note that this was not a problem when the rcu_barrier() code
|
||||
was first added back in 2005. This is because on_each_cpu()
|
||||
disables preemption, which acted as an RCU read-side critical
|
||||
section, thus preventing CPU 0's grace period from completing
|
||||
until on_each_cpu() had dealt with all of the CPUs. However,
|
||||
with the advent of preemptible RCU, rcu_barrier() no longer
|
||||
waited on nonpreemptible regions of code in preemptible kernels,
|
||||
that being the job of the new rcu_barrier_sched() function.
|
||||
|
||||
However, with the RCU flavor consolidation around v4.20, this
|
||||
possibility was once again ruled out, because the consolidated
|
||||
RCU once again waits on nonpreemptible regions of code.
|
||||
|
||||
Nevertheless, that extra count might still be a good idea.
|
||||
Relying on these sort of accidents of implementation can result
|
||||
in later surprise bugs when the implementation changes.
|
||||
|
||||
:ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`
|
||||
|
||||
.. _answer_rcubarrier_quiz_3:
|
||||
|
||||
Quick Quiz #3:
|
||||
What happens if CPU 0's rcu_barrier_func() executes
|
||||
immediately (thus incrementing rcu_barrier_cpu_count to the
|
||||
value one), but the other CPU's rcu_barrier_func() invocations
|
||||
are delayed for a full grace period? Couldn't this result in
|
||||
rcu_barrier() returning prematurely?
|
||||
|
||||
Answer: This cannot happen. The reason is that on_each_cpu() has its last
|
||||
Answer:
|
||||
This cannot happen. The reason is that on_each_cpu() has its last
|
||||
argument, the wait flag, set to "1". This flag is passed through
|
||||
to smp_call_function() and further to smp_call_function_on_cpu(),
|
||||
causing this latter to spin until the cross-CPU invocation of
|
||||
@ -336,18 +366,15 @@ Answer: This cannot happen. The reason is that on_each_cpu() has its last
|
||||
|
||||
Therefore, on_each_cpu() disables preemption across its call
|
||||
to smp_call_function() and also across the local call to
|
||||
rcu_barrier_func(). This prevents the local CPU from context
|
||||
switching, again preventing grace periods from completing. This
|
||||
rcu_barrier_func(). Because recent RCU implementations treat
|
||||
preemption-disabled regions of code as RCU read-side critical
|
||||
sections, this prevents grace periods from completing. This
|
||||
means that all CPUs have executed rcu_barrier_func() before
|
||||
the first rcu_barrier_callback() can possibly execute, in turn
|
||||
preventing rcu_barrier_cpu_count from prematurely reaching zero.
|
||||
|
||||
Currently, -rt implementations of RCU keep but a single global
|
||||
queue for RCU callbacks, and thus do not suffer from this
|
||||
problem. However, when the -rt RCU eventually does have per-CPU
|
||||
callback queues, things will have to change. One simple change
|
||||
is to add an rcu_read_lock() before line 8 of rcu_barrier()
|
||||
and an rcu_read_unlock() after line 8 of this same function. If
|
||||
you can think of a better change, please let me know!
|
||||
But if on_each_cpu() ever decides to forgo disabling preemption,
|
||||
as might well happen due to real-time latency considerations,
|
||||
initializing rcu_barrier_cpu_count to one will save the day.
|
||||
|
||||
:ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`
|
||||
:ref:`Back to Quick Quiz #3 <rcubarrier_quiz_3>`
|
||||
|
@ -14,19 +14,19 @@ Using 'nulls'
|
||||
=============
|
||||
|
||||
Using special markers (called 'nulls') is a convenient way
|
||||
to solve following problem :
|
||||
to solve the following problem.
|
||||
|
||||
A typical RCU linked list managing objects which are
|
||||
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
|
||||
use following algos :
|
||||
Without 'nulls', a typical RCU linked list managing objects which are
|
||||
allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can use the following
|
||||
algorithms:
|
||||
|
||||
1) Lookup algo
|
||||
--------------
|
||||
1) Lookup algorithm
|
||||
-------------------
|
||||
|
||||
::
|
||||
|
||||
rcu_read_lock()
|
||||
begin:
|
||||
rcu_read_lock()
|
||||
obj = lockless_lookup(key);
|
||||
if (obj) {
|
||||
if (!try_get_ref(obj)) // might fail for free objects
|
||||
@ -38,6 +38,7 @@ use following algos :
|
||||
*/
|
||||
if (obj->key != key) { // not the object we expected
|
||||
put_ref(obj);
|
||||
rcu_read_unlock();
|
||||
goto begin;
|
||||
}
|
||||
}
|
||||
@ -52,9 +53,9 @@ but a version with an additional memory barrier (smp_rmb())
|
||||
{
|
||||
struct hlist_node *node, *next;
|
||||
for (pos = rcu_dereference((head)->first);
|
||||
pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
|
||||
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
|
||||
pos = rcu_dereference(next))
|
||||
pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
|
||||
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
|
||||
pos = rcu_dereference(next))
|
||||
if (obj->key == key)
|
||||
return obj;
|
||||
return NULL;
|
||||
@ -64,9 +65,9 @@ And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb()::
|
||||
|
||||
struct hlist_node *node;
|
||||
for (pos = rcu_dereference((head)->first);
|
||||
pos && ({ prefetch(pos->next); 1; }) &&
|
||||
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
|
||||
pos = rcu_dereference(pos->next))
|
||||
pos && ({ prefetch(pos->next); 1; }) &&
|
||||
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
|
||||
pos = rcu_dereference(pos->next))
|
||||
if (obj->key == key)
|
||||
return obj;
|
||||
return NULL;
|
||||
@ -82,36 +83,32 @@ Quoting Corey Minyard::
|
||||
solved by pre-fetching the "next" field (with proper barriers) before
|
||||
checking the key."
|
||||
|
||||
2) Insert algo
|
||||
--------------
|
||||
2) Insertion algorithm
|
||||
----------------------
|
||||
|
||||
We need to make sure a reader cannot read the new 'obj->obj_next' value
|
||||
and previous value of 'obj->key'. Or else, an item could be deleted
|
||||
and previous value of 'obj->key'. Otherwise, an item could be deleted
|
||||
from a chain, and inserted into another chain. If new chain was empty
|
||||
before the move, 'next' pointer is NULL, and lockless reader can
|
||||
not detect it missed following items in original chain.
|
||||
before the move, 'next' pointer is NULL, and lockless reader can not
|
||||
detect the fact that it missed following items in original chain.
|
||||
|
||||
::
|
||||
|
||||
/*
|
||||
* Please note that new inserts are done at the head of list,
|
||||
* not in the middle or end.
|
||||
*/
|
||||
* Please note that new inserts are done at the head of list,
|
||||
* not in the middle or end.
|
||||
*/
|
||||
obj = kmem_cache_alloc(...);
|
||||
lock_chain(); // typically a spin_lock()
|
||||
obj->key = key;
|
||||
/*
|
||||
* we need to make sure obj->key is updated before obj->next
|
||||
* or obj->refcnt
|
||||
*/
|
||||
smp_wmb();
|
||||
atomic_set(&obj->refcnt, 1);
|
||||
atomic_set_release(&obj->refcnt, 1); // key before refcnt
|
||||
hlist_add_head_rcu(&obj->obj_node, list);
|
||||
unlock_chain(); // typically a spin_unlock()
|
||||
|
||||
|
||||
3) Remove algo
|
||||
--------------
|
||||
3) Removal algorithm
|
||||
--------------------
|
||||
|
||||
Nothing special here, we can use a standard RCU hlist deletion.
|
||||
But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
|
||||
very very fast (before the end of RCU grace period)
|
||||
@ -133,7 +130,7 @@ Avoiding extra smp_rmb()
|
||||
========================
|
||||
|
||||
With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
|
||||
and extra smp_wmb() in insert function.
|
||||
and extra _release() in insert function.
|
||||
|
||||
For example, if we choose to store the slot number as the 'nulls'
|
||||
end-of-list marker for each slot of the hash table, we can detect
|
||||
@ -142,59 +139,61 @@ to another chain) checking the final 'nulls' value if
|
||||
the lookup met the end of chain. If final 'nulls' value
|
||||
is not the slot number, then we must restart the lookup at
|
||||
the beginning. If the object was moved to the same chain,
|
||||
then the reader doesn't care : It might eventually
|
||||
then the reader doesn't care: It might occasionally
|
||||
scan the list again without harm.
|
||||
|
||||
|
||||
1) lookup algo
|
||||
--------------
|
||||
1) lookup algorithm
|
||||
-------------------
|
||||
|
||||
::
|
||||
|
||||
head = &table[slot];
|
||||
rcu_read_lock();
|
||||
begin:
|
||||
rcu_read_lock();
|
||||
hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
|
||||
if (obj->key == key) {
|
||||
if (!try_get_ref(obj)) // might fail for free objects
|
||||
goto begin;
|
||||
if (obj->key != key) { // not the object we expected
|
||||
put_ref(obj);
|
||||
if (!try_get_ref(obj)) { // might fail for free objects
|
||||
rcu_read_unlock();
|
||||
goto begin;
|
||||
}
|
||||
goto out;
|
||||
if (obj->key != key) { // not the object we expected
|
||||
put_ref(obj);
|
||||
rcu_read_unlock();
|
||||
goto begin;
|
||||
}
|
||||
goto out;
|
||||
}
|
||||
}
|
||||
|
||||
// If the nulls value we got at the end of this lookup is
|
||||
// not the expected one, we must restart lookup.
|
||||
// We probably met an item that was moved to another chain.
|
||||
if (get_nulls_value(node) != slot) {
|
||||
put_ref(obj);
|
||||
rcu_read_unlock();
|
||||
goto begin;
|
||||
}
|
||||
/*
|
||||
* if the nulls value we got at the end of this lookup is
|
||||
* not the expected one, we must restart lookup.
|
||||
* We probably met an item that was moved to another chain.
|
||||
*/
|
||||
if (get_nulls_value(node) != slot)
|
||||
goto begin;
|
||||
obj = NULL;
|
||||
|
||||
out:
|
||||
rcu_read_unlock();
|
||||
|
||||
2) Insert function
|
||||
------------------
|
||||
2) Insert algorithm
|
||||
-------------------
|
||||
|
||||
::
|
||||
|
||||
/*
|
||||
* Please note that new inserts are done at the head of list,
|
||||
* not in the middle or end.
|
||||
*/
|
||||
* Please note that new inserts are done at the head of list,
|
||||
* not in the middle or end.
|
||||
*/
|
||||
obj = kmem_cache_alloc(cachep);
|
||||
lock_chain(); // typically a spin_lock()
|
||||
obj->key = key;
|
||||
atomic_set_release(&obj->refcnt, 1); // key before refcnt
|
||||
/*
|
||||
* changes to obj->key must be visible before refcnt one
|
||||
*/
|
||||
smp_wmb();
|
||||
atomic_set(&obj->refcnt, 1);
|
||||
/*
|
||||
* insert obj in RCU way (readers might be traversing chain)
|
||||
*/
|
||||
* insert obj in RCU way (readers might be traversing chain)
|
||||
*/
|
||||
hlist_nulls_add_head_rcu(&obj->obj_node, list);
|
||||
unlock_chain(); // typically a spin_unlock()
|
||||
|
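One detail the algorithms above take for granted is how each slot's 'nulls' value gets established in the first place; a hedged sketch, with MY_TABLE_SLOTS and my_table invented for illustration, encodes the slot number as that slot's end-of-list marker::

    #include <linux/rculist_nulls.h>

    #define MY_TABLE_SLOTS 16

    static struct hlist_nulls_head my_table[MY_TABLE_SLOTS];

    static void my_table_init(void)
    {
        int slot;

        /* The second argument becomes the 'nulls' value that the lookup
         * compares against get_nulls_value() when it reaches end of chain. */
        for (slot = 0; slot < MY_TABLE_SLOTS; slot++)
            INIT_HLIST_NULLS_HEAD(&my_table[slot], slot);
    }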
@ -25,10 +25,10 @@ warnings:
|
||||
|
||||
- A CPU looping with bottom halves disabled.
|
||||
|
||||
- For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the kernel
|
||||
without invoking schedule(). If the looping in the kernel is
|
||||
really expected and desirable behavior, you might need to add
|
||||
some calls to cond_resched().
|
||||
- For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the
|
||||
kernel without potentially invoking schedule(). If the looping
|
||||
in the kernel is really expected and desirable behavior, you
|
||||
might need to add some calls to cond_resched().
|
||||
|
||||
- Booting Linux using a console connection that is too slow to
|
||||
keep up with the boot-time console-message rate. For example,
|
||||
@ -108,16 +108,17 @@ warnings:
|
||||
|
||||
- A bug in the RCU implementation.
|
||||
|
||||
- A hardware failure. This is quite unlikely, but has occurred
|
||||
at least once in real life. A CPU failed in a running system,
|
||||
becoming unresponsive, but not causing an immediate crash.
|
||||
This resulted in a series of RCU CPU stall warnings, eventually
|
||||
leading the realization that the CPU had failed.
|
||||
- A hardware failure. This is quite unlikely, but is not at all
|
||||
uncommon in large datacenters. In one memorable case some decades
|
||||
back, a CPU failed in a running system, becoming unresponsive,
|
||||
but not causing an immediate crash. This resulted in a series
|
||||
of RCU CPU stall warnings, eventually leading the realization
|
||||
that the CPU had failed.
|
||||
|
||||
The RCU, RCU-sched, and RCU-tasks implementations have CPU stall warning.
|
||||
Note that SRCU does *not* have CPU stall warnings. Please note that
|
||||
RCU only detects CPU stalls when there is a grace period in progress.
|
||||
No grace period, no CPU stall warnings.
|
||||
The RCU, RCU-sched, RCU-tasks, and RCU-tasks-trace implementations have
|
||||
CPU stall warnings. Note that SRCU does *not* have CPU stall warnings.
|
||||
Please note that RCU only detects CPU stalls when there is a grace period
|
||||
in progress. No grace period, no CPU stall warnings.
|
||||
|
||||
To diagnose the cause of the stall, inspect the stack traces.
|
||||
The offending function will usually be near the top of the stack.
|
||||
@ -205,16 +206,21 @@ RCU_STALL_RAT_DELAY
|
||||
rcupdate.rcu_task_stall_timeout
|
||||
-------------------------------
|
||||
|
||||
This boot/sysfs parameter controls the RCU-tasks stall warning
|
||||
interval. A value of zero or less suppresses RCU-tasks stall
|
||||
warnings. A positive value sets the stall-warning interval
|
||||
in seconds. An RCU-tasks stall warning starts with the line:
|
||||
This boot/sysfs parameter controls the RCU-tasks and
|
||||
RCU-tasks-trace stall warning intervals. A value of zero or less
|
||||
suppresses RCU-tasks stall warnings. A positive value sets the
|
||||
stall-warning interval in seconds. An RCU-tasks stall warning
|
||||
starts with the line:
|
||||
|
||||
INFO: rcu_tasks detected stalls on tasks:
|
||||
|
||||
And continues with the output of sched_show_task() for each
|
||||
task stalling the current RCU-tasks grace period.
|
||||
|
||||
An RCU-tasks-trace stall warning starts (and continues) similarly:
|
||||
|
||||
INFO: rcu_tasks_trace detected stalls on tasks
|
||||
|
||||
|
||||
Interpreting RCU's CPU Stall-Detector "Splats"
|
||||
==============================================
|
||||
@ -248,7 +254,8 @@ dynticks counter, which will have an even-numbered value if the CPU
|
||||
is in dyntick-idle mode and an odd-numbered value otherwise. The hex
|
||||
number between the two "/"s is the value of the nesting, which will be
|
||||
a small non-negative number if in the idle loop (as shown above) and a
|
||||
very large positive number otherwise.
|
||||
very large positive number otherwise. The number following the final
|
||||
"/" is the NMI nesting, which will be a small non-negative number.
|
||||
|
||||
The "softirq=" portion of the message tracks the number of RCU softirq
|
||||
handlers that the stalled CPU has executed. The number before the "/"
|
||||
@ -383,3 +390,95 @@ for example, "P3421".
|
||||
|
||||
It is entirely possible to see stall warnings from normal and from
|
||||
expedited grace periods at about the same time during the same run.
|
||||
|
||||
RCU_CPU_STALL_CPUTIME
|
||||
=====================
|
||||
|
||||
In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
|
||||
rcupdate.rcu_cpu_stall_cputime=1, the following additional information
|
||||
is supplied with each RCU CPU stall warning::
|
||||
|
||||
rcu: hardirqs softirqs csw/system
|
||||
rcu: number: 624 45 0
|
||||
rcu: cputime: 69 1 2425 ==> 2500(ms)
|
||||
|
||||
These statistics are collected during the sampling period. The values
|
||||
in row "number:" are the number of hard interrupts, number of soft
|
||||
interrupts, and number of context switches on the stalled CPU. The
|
||||
first three values in row "cputime:" indicate the CPU time in
|
||||
milliseconds consumed by hard interrupts, soft interrupts, and tasks
|
||||
on the stalled CPU. The last number is the measurement interval, again
|
||||
in milliseconds. Because user-mode tasks normally do not cause RCU CPU
|
||||
stalls, these tasks are typically kernel tasks, which is why only the
|
||||
system CPU time is considered.
|
||||
|
||||
The sampling period is shown as follows::
|
||||
|
||||
|<------------first timeout---------->|<-----second timeout----->|
|
||||
|<--half timeout-->|<--half timeout-->| |
|
||||
| |<--first period-->| |
|
||||
| |<-----------second sampling period---------->|
|
||||
| | | |
|
||||
snapshot time point 1st-stall 2nd-stall
|
||||
|
||||
The following describes four typical scenarios:
|
||||
|
||||
1. A CPU looping with interrupts disabled.
|
||||
|
||||
::
|
||||
|
||||
rcu: hardirqs softirqs csw/system
|
||||
rcu: number: 0 0 0
|
||||
rcu: cputime: 0 0 0 ==> 2500(ms)
|
||||
|
||||
Because interrupts have been disabled throughout the measurement
|
||||
interval, there are no interrupts and no context switches.
|
||||
Furthermore, because CPU time consumption was measured using interrupt
|
||||
handlers, the system CPU consumption is misleadingly measured as zero.
|
||||
This scenario will normally also have "(0 ticks this GP)" printed on
|
||||
this CPU's summary line.
|
||||
|
||||
2. A CPU looping with bottom halves disabled.
|
||||
|
||||
This is similar to the previous example, but with non-zero number of
|
||||
and CPU time consumed by hard interrupts, along with non-zero CPU
|
||||
time consumed by in-kernel execution::
|
||||
|
||||
rcu: hardirqs softirqs csw/system
|
||||
rcu: number: 624 0 0
|
||||
rcu: cputime: 49 0 2446 ==> 2500(ms)
|
||||
|
||||
The fact that there are zero softirqs gives a hint that these were
|
||||
disabled, perhaps via local_bh_disable(). It is of course possible
|
||||
that there were no softirqs, perhaps because all events that would
|
||||
result in softirq execution are confined to other CPUs. In this case,
|
||||
the diagnosis should continue as shown in the next example.
|
||||
|
||||
3. A CPU looping with preemption disabled.
|
||||
|
||||
Here, only the number of context switches is zero::
|
||||
|
||||
rcu: hardirqs softirqs csw/system
|
||||
rcu: number: 624 45 0
|
||||
rcu: cputime: 69 1 2425 ==> 2500(ms)
|
||||
|
||||
This situation hints that the stalled CPU was looping with preemption
|
||||
disabled.
|
||||
|
||||
4. No looping, but massive hard and soft interrupts.
|
||||
|
||||
::
|
||||
|
||||
rcu: hardirqs softirqs csw/system
|
||||
rcu: number: xx xx 0
|
||||
rcu: cputime: xx xx 0 ==> 2500(ms)
|
||||
|
||||
Here, the number and CPU time of hard interrupts are all non-zero,
|
||||
but the number of context switches and the in-kernel CPU time consumed
|
||||
are zero. The number and cputime of soft interrupts will usually be
|
||||
non-zero, but could be zero, for example, if the CPU was spinning
|
||||
within a single hard interrupt handler.
|
||||
|
||||
If this type of RCU CPU stall warning can be reproduced, you can
|
||||
narrow it down by looking at /proc/interrupts or by writing code to
|
||||
trace each interrupt, for example, by referring to show_interrupts().
|
||||
|
@ -206,7 +206,11 @@ values for memory may require disabling the callback-flooding tests
|
||||
using the --bootargs parameter discussed below.
|
||||
|
||||
Sometimes additional debugging is useful, and in such cases the --kconfig
|
||||
parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_KASAN=y'``.
|
||||
parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_RCU_EQS_DEBUG=y'``.
|
||||
In addition, there are the --gdb, --kasan, and --kcsan parameters.
|
||||
Note that --gdb limits you to one scenario per kvm.sh run and requires
|
||||
that you have another window open from which to run ``gdb`` as instructed
|
||||
by the script.
|
||||
|
||||
Kernel boot arguments can also be supplied, for example, to control
|
||||
rcutorture's module parameters. For example, to test a change to RCU's
|
||||
@ -219,10 +223,17 @@ require disabling rcutorture's callback-flooding tests::
|
||||
--bootargs 'rcutorture.fwd_progress=0'
|
||||
|
||||
Sometimes all that is needed is a full set of kernel builds. This is
|
||||
what the --buildonly argument does.
|
||||
what the --buildonly parameter does.
|
||||
|
||||
Finally, the --trust-make argument allows each kernel build to reuse what
|
||||
it can from the previous kernel build.
|
||||
The --duration parameter can override the default run time of 30 minutes.
|
||||
For example, ``--duration 2d`` would run for two days, ``--duration 3h``
|
||||
would run for three hours, ``--duration 5m`` would run for five minutes,
|
||||
and ``--duration 45s`` would run for 45 seconds. This last can be useful
|
||||
for tracking down rare boot-time failures.
|
||||
|
||||
Finally, the --trust-make parameter allows each kernel build to reuse what
|
||||
it can from the previous kernel build. Please note that without the
|
||||
--trust-make parameter, your tags files may be demolished.
|
||||
|
||||
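Putting several of the parameters discussed above together, a purely illustrative invocation (the scenario name and values are examples only, not recommendations) might look like this::

    tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 12 --duration 45s \
        --configs TREE07 --kconfig 'CONFIG_RCU_EQS_DEBUG=y' --trust-make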
There are additional more arcane arguments that are documented in the
|
||||
source code of the kvm.sh script.
|
||||
@ -291,3 +302,73 @@ the following summary at the end of the run on a 12-CPU system::
|
||||
TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732
|
||||
CPU count limited from 16 to 12
|
||||
TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011
|
||||
|
||||
|
||||
Repeated Runs
=============

Suppose that you are chasing down a rare boot-time failure. Although you
could use kvm.sh, doing so will rebuild the kernel on each run. If you
need (say) 1,000 runs to have confidence that you have fixed the bug,
these pointless rebuilds can become extremely annoying.

This is why kvm-again.sh exists.

Suppose that a previous kvm.sh run left its output in this directory::

	tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28

Then this run can be re-run without rebuilding as follows::

	kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28

A few of the original run's kvm.sh parameters may be overridden, perhaps
most notably --duration and --bootargs. For example::

	kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28 \
		--duration 45s

would re-run the previous test, but for only 45 seconds, thus facilitating
tracking down the aforementioned rare boot-time failure.


Distributed Runs
================

Although kvm.sh is quite useful, its testing is confined to a single
system. It is not all that hard to use your favorite framework to cause
(say) 5 instances of kvm.sh to run on your 5 systems, but this will very
likely unnecessarily rebuild kernels. In addition, manually distributing
the desired rcutorture scenarios across the available systems can be
painstaking and error-prone.

And this is why the kvm-remote.sh script exists.

If the following command works::

	ssh system0 date

and if it also works for system1, system2, system3, system4, and system5,
and all of these systems have 64 CPUs, you can type::

	kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
		--cpus 64 --duration 8h --configs "5*CFLIST"

This will build each default scenario's kernel on the local system, then
spread each of five instances of each scenario over the systems listed,
running each scenario for eight hours. At the end of the runs, the
results will be gathered, recorded, and printed. Most of the parameters
that kvm.sh will accept can be passed to kvm-remote.sh, but the list of
systems must come first.

The kvm.sh ``--dryrun scenarios`` argument is useful for working out
how many scenarios may be run in one batch across a group of systems.

You can also re-run a previous remote run in a manner similar to kvm.sh::

	kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
		tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28-remote \
		--duration 24h

In this case, most of the kvm-again.sh parameters may be supplied following
the pathname of the old run-results directory.

@ -16,18 +16,23 @@ to start learning about RCU:
|
||||
| 6. The RCU API, 2019 Edition https://lwn.net/Articles/777036/
|
||||
| 2019 Big API Table https://lwn.net/Articles/777165/
|
||||
|
||||
For those preferring video:
|
||||
|
||||
| 1. Unraveling RCU Mysteries: Fundamentals https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries
|
||||
| 2. Unraveling RCU Mysteries: Additional Use Cases https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases
|
||||
|
||||
|
||||
What is RCU?
|
||||
|
||||
RCU is a synchronization mechanism that was added to the Linux kernel
|
||||
during the 2.5 development effort that is optimized for read-mostly
|
||||
situations. Although RCU is actually quite simple once you understand it,
|
||||
getting there can sometimes be a challenge. Part of the problem is that
|
||||
most of the past descriptions of RCU have been written with the mistaken
|
||||
assumption that there is "one true way" to describe RCU. Instead,
|
||||
the experience has been that different people must take different paths
|
||||
to arrive at an understanding of RCU. This document provides several
|
||||
different paths, as follows:
|
||||
situations. Although RCU is actually quite simple, making effective use
|
||||
of it requires you to think differently about your code. Another part
|
||||
of the problem is the mistaken assumption that there is "one true way" to
|
||||
describe and to use RCU. Instead, the experience has been that different
|
||||
people must take different paths to arrive at an understanding of RCU,
|
||||
depending on their experiences and use cases. This document provides
|
||||
several different paths, as follows:
|
||||
|
||||
:ref:`1. RCU OVERVIEW <1_whatisRCU>`
|
||||
|
||||
@ -157,34 +162,36 @@ rcu_read_lock()
|
||||
^^^^^^^^^^^^^^^
|
||||
void rcu_read_lock(void);
|
||||
|
||||
Used by a reader to inform the reclaimer that the reader is
|
||||
entering an RCU read-side critical section. It is illegal
|
||||
to block while in an RCU read-side critical section, though
|
||||
kernels built with CONFIG_PREEMPT_RCU can preempt RCU
|
||||
read-side critical sections. Any RCU-protected data structure
|
||||
accessed during an RCU read-side critical section is guaranteed to
|
||||
remain unreclaimed for the full duration of that critical section.
|
||||
Reference counts may be used in conjunction with RCU to maintain
|
||||
longer-term references to data structures.
|
||||
This temporal primitive is used by a reader to inform the
|
||||
reclaimer that the reader is entering an RCU read-side critical
|
||||
section. It is illegal to block while in an RCU read-side
|
||||
critical section, though kernels built with CONFIG_PREEMPT_RCU
|
||||
can preempt RCU read-side critical sections. Any RCU-protected
|
||||
data structure accessed during an RCU read-side critical section
|
||||
is guaranteed to remain unreclaimed for the full duration of that
|
||||
critical section. Reference counts may be used in conjunction
|
||||
with RCU to maintain longer-term references to data structures.
|
||||
|
||||
rcu_read_unlock()
|
||||
^^^^^^^^^^^^^^^^^
|
||||
void rcu_read_unlock(void);
|
||||
|
||||
Used by a reader to inform the reclaimer that the reader is
|
||||
exiting an RCU read-side critical section. Note that RCU
|
||||
read-side critical sections may be nested and/or overlapping.
|
||||
This temporal primitive is used by a reader to inform the
|
||||
reclaimer that the reader is exiting an RCU read-side critical
|
||||
section. Note that RCU read-side critical sections may be nested
|
||||
and/or overlapping.
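
For example, a reader-side critical section protecting a hypothetical
RCU-protected global pointer ``gp`` might look like the following sketch,
where ``do_something_with()`` stands in for the reader's real work::

	rcu_read_lock();
	p = rcu_dereference(gp);
	if (p)
		do_something_with(p->a);
	rcu_read_unlock();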
|
||||
|
||||
synchronize_rcu()
|
||||
^^^^^^^^^^^^^^^^^
|
||||
void synchronize_rcu(void);
|
||||
|
||||
Marks the end of updater code and the beginning of reclaimer
|
||||
code. It does this by blocking until all pre-existing RCU
|
||||
read-side critical sections on all CPUs have completed.
|
||||
Note that synchronize_rcu() will **not** necessarily wait for
|
||||
any subsequent RCU read-side critical sections to complete.
|
||||
For example, consider the following sequence of events::
|
||||
This temporal primitive marks the end of updater code and the
|
||||
beginning of reclaimer code. It does this by blocking until
|
||||
all pre-existing RCU read-side critical sections on all CPUs
|
||||
have completed. Note that synchronize_rcu() will **not**
|
||||
necessarily wait for any subsequent RCU read-side critical
|
||||
sections to complete. For example, consider the following
|
||||
sequence of events::
|
||||
|
||||
CPU 0 CPU 1 CPU 2
|
||||
----------------- ------------------------- ---------------
|
||||
@ -211,13 +218,13 @@ synchronize_rcu()
|
||||
to be useful in all but the most read-intensive situations,
|
||||
synchronize_rcu()'s overhead must also be quite small.
|
||||
|
||||
The call_rcu() API is a callback form of synchronize_rcu(),
|
||||
and is described in more detail in a later section. Instead of
|
||||
blocking, it registers a function and argument which are invoked
|
||||
after all ongoing RCU read-side critical sections have completed.
|
||||
This callback variant is particularly useful in situations where
|
||||
it is illegal to block or where update-side performance is
|
||||
critically important.
|
||||
The call_rcu() API is an asynchronous callback form of
|
||||
synchronize_rcu(), and is described in more detail in a later
|
||||
section. Instead of blocking, it registers a function and
|
||||
argument which are invoked after all ongoing RCU read-side
|
||||
critical sections have completed. This callback variant is
|
||||
particularly useful in situations where it is illegal to block
|
||||
or where update-side performance is critically important.
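
As a sketch, assuming a ``struct foo`` containing a struct rcu_head named
``rcu`` (as in the kfree_rcu() example later in this document), a non-blocking
updater might do the following::

	static void free_foo_rcu(struct rcu_head *rhp)
	{
		struct foo *fp = container_of(rhp, struct foo, rcu);

		kfree(fp);
	}

	/* ...after removing old_fp from all reader-visible structures... */
	call_rcu(&old_fp->rcu, free_foo_rcu);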
|
||||
|
||||
However, the call_rcu() API should not be used lightly, as use
|
||||
of the synchronize_rcu() API generally results in simpler code.
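
By way of comparison, here is a minimal sketch of the blocking approach,
where ``gp_lock``, ``gp``, ``new_fp``, and ``old_fp`` are stand-ins for the
updater's own state::

	spin_lock(&gp_lock);
	old_fp = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
	rcu_assign_pointer(gp, new_fp);
	spin_unlock(&gp_lock);

	synchronize_rcu();	/* Wait for pre-existing readers to complete. */
	kfree(old_fp);

The blocking form keeps the allocation's lifetime visible at a single point
in the code, which is why it is usually the simpler choice.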
|
||||
@ -236,11 +243,13 @@ rcu_assign_pointer()
|
||||
would be cool to be able to declare a function in this manner.
|
||||
(Compiler experts will no doubt disagree.)
|
||||
|
||||
The updater uses this function to assign a new value to an
|
||||
The updater uses this spatial macro to assign a new value to an
|
||||
RCU-protected pointer, in order to safely communicate the change
|
||||
in value from the updater to the reader. This macro does not
|
||||
evaluate to an rvalue, but it does execute any memory-barrier
|
||||
instructions required for a given CPU architecture.
|
||||
in value from the updater to the reader. This is a spatial (as
|
||||
opposed to temporal) macro. It does not evaluate to an rvalue,
|
||||
but it does execute any memory-barrier instructions required
|
||||
for a given CPU architecture. Its ordering properties are that
|
||||
of a store-release operation.
|
||||
|
||||
Perhaps just as important, it serves to document (1) which
|
||||
pointers are protected by RCU and (2) the point at which a
|
||||
@ -255,14 +264,15 @@ rcu_dereference()
|
||||
Like rcu_assign_pointer(), rcu_dereference() must be implemented
|
||||
as a macro.
|
||||
|
||||
The reader uses rcu_dereference() to fetch an RCU-protected
|
||||
pointer, which returns a value that may then be safely
|
||||
dereferenced. Note that rcu_dereference() does not actually
|
||||
dereference the pointer, instead, it protects the pointer for
|
||||
later dereferencing. It also executes any needed memory-barrier
|
||||
instructions for a given CPU architecture. Currently, only Alpha
|
||||
needs memory barriers within rcu_dereference() -- on other CPUs,
|
||||
it compiles to nothing, not even a compiler directive.
|
||||
The reader uses the spatial rcu_dereference() macro to fetch
|
||||
an RCU-protected pointer, which returns a value that may
|
||||
then be safely dereferenced. Note that rcu_dereference()
|
||||
does not actually dereference the pointer, instead, it
|
||||
protects the pointer for later dereferencing. It also
|
||||
executes any needed memory-barrier instructions for a given
|
||||
CPU architecture. Currently, only Alpha needs memory barriers
|
||||
within rcu_dereference() -- on other CPUs, it compiles to a
|
||||
volatile load.
|
||||
|
||||
Common coding practice uses rcu_dereference() to copy an
|
||||
RCU-protected pointer to a local variable, then dereferences
|
||||
@ -355,12 +365,15 @@ reader, updater, and reclaimer.
|
||||
synchronize_rcu() & call_rcu()
|
||||
|
||||
|
||||
The RCU infrastructure observes the time sequence of rcu_read_lock(),
|
||||
The RCU infrastructure observes the temporal sequence of rcu_read_lock(),
|
||||
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
|
||||
order to determine when (1) synchronize_rcu() invocations may return
|
||||
to their callers and (2) call_rcu() callbacks may be invoked. Efficient
|
||||
implementations of the RCU infrastructure make heavy use of batching in
|
||||
order to amortize their overhead over many uses of the corresponding APIs.
|
||||
The rcu_assign_pointer() and rcu_dereference() invocations communicate
|
||||
spatial changes via stores to and loads from the RCU-protected pointer in
|
||||
question.
|
||||
|
||||
There are at least three flavors of RCU usage in the Linux kernel. The diagram
|
||||
above shows the most common one. On the updater side, the rcu_assign_pointer(),
|
||||
@ -392,7 +405,9 @@ b. RCU applied to networking data structures that may be subjected
|
||||
c. RCU applied to scheduler and interrupt/NMI-handler tasks.
|
||||
|
||||
Again, most uses will be of (a). The (b) and (c) cases are important
|
||||
for specialized uses, but are relatively uncommon.
|
||||
for specialized uses, but are relatively uncommon. The SRCU, RCU-Tasks,
|
||||
RCU-Tasks-Rude, and RCU-Tasks-Trace have similar relationships among
|
||||
their assorted primitives.
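
For case (b), readers typically use the _bh variants. A minimal sketch,
again with ``gp`` standing in for an RCU-protected pointer and
``process_incoming_packet()`` for the reader's work::

	rcu_read_lock_bh();
	p = rcu_dereference_bh(gp);
	if (p)
		process_incoming_packet(p);
	rcu_read_unlock_bh();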
|
||||
|
||||
.. _3_whatisRCU:
|
||||
|
||||
@ -468,7 +483,7 @@ So, to sum up:
|
||||
- Within an RCU read-side critical section, use rcu_dereference()
|
||||
to dereference RCU-protected pointers.
|
||||
|
||||
- Use some solid scheme (such as locks or semaphores) to
|
||||
- Use some solid design (such as locks or semaphores) to
|
||||
keep concurrent updates from interfering with each other.
|
||||
|
||||
- Use rcu_assign_pointer() to update an RCU-protected pointer.
|
||||
@ -579,6 +594,14 @@ to avoid having to write your own callback::
|
||||
|
||||
kfree_rcu(old_fp, rcu);
|
||||
|
||||
If the occasional sleep is permitted, the single-argument form may
|
||||
be used, omitting the rcu_head structure from struct foo::
|
||||
|
||||
kfree_rcu(old_fp);
|
||||
|
||||
This variant of kfree_rcu() almost never blocks, but might do so by
|
||||
invoking synchronize_rcu() in response to memory-allocation failure.
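
For reference, a minimal sketch of the ``struct foo`` layout implied by the
two-argument form, with the field that the single-argument form lets you
omit marked (the ``a`` field is purely illustrative)::

	struct foo {
		int a;
		struct rcu_head rcu;	/* Needed only for kfree_rcu(old_fp, rcu). */
	};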
|
||||
|
||||
Again, see checklist.rst for additional rules governing the use of RCU.
|
||||
|
||||
.. _5_whatisRCU:
|
||||
@ -596,7 +619,7 @@ lacking both functionality and performance. However, they are useful
|
||||
in getting a feel for how RCU works. See kernel/rcu/update.c for a
|
||||
production-quality implementation, and see:
|
||||
|
||||
http://www.rdrop.com/users/paulmck/RCU
|
||||
https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit
|
||||
|
||||
for papers describing the Linux kernel RCU implementation. The OLS'01
|
||||
and OLS'02 papers are a good introduction, and the dissertation provides
|
||||
@ -929,6 +952,8 @@ unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
|
||||
initialized after each and every call to kmem_cache_alloc(), which renders
|
||||
reference-free spinlock acquisition completely unsafe. Therefore, when
|
||||
using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
|
||||
(Those willing to use a kmem_cache constructor may also use locking,
|
||||
including cache-friendly sequence locking.)
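
A minimal sketch of such a reference-counted lookup, assuming a hypothetical
``SLAB_TYPESAFE_BY_RCU`` cache of ``struct mything`` objects reachable from
an RCU-protected ``mything_table[]``::

	rcu_read_lock();
	p = rcu_dereference(mything_table[i]);
	if (p && !atomic_inc_not_zero(&p->refcnt))
		p = NULL;	/* Object was freed, and possibly reused. */
	rcu_read_unlock();
	if (p) {
		do_something_with(p);
		if (atomic_dec_and_test(&p->refcnt))
			kmem_cache_free(mything_cache, p);
	}

The atomic_inc_not_zero() failure path is what protects readers from
objects that were freed and then recycled for some other purpose.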
|
||||
|
||||
With traditional reference counting -- such as that implemented by the
|
||||
kref library in Linux -- there is typically code that runs when the last
|
||||
@ -1047,6 +1072,30 @@ sched::
|
||||
rcu_read_lock_sched_held
|
||||
|
||||
|
||||
RCU-Tasks::

	Critical sections	Grace period		Barrier

	N/A			call_rcu_tasks		rcu_barrier_tasks
				synchronize_rcu_tasks


RCU-Tasks-Rude::

	Critical sections	Grace period		Barrier

	N/A			call_rcu_tasks_rude	rcu_barrier_tasks_rude
				synchronize_rcu_tasks_rude


RCU-Tasks-Trace::

	Critical sections	Grace period		Barrier

	rcu_read_lock_trace	call_rcu_tasks_trace	rcu_barrier_tasks_trace
	rcu_read_unlock_trace	synchronize_rcu_tasks_trace


SRCU::

	Critical sections	Grace period		Barrier
@ -1087,35 +1136,43 @@ list can be helpful:
|
||||
|
||||
a. Will readers need to block? If so, you need SRCU.
|
||||
|
||||
b. What about the -rt patchset? If readers would need to block
|
||||
in a non-rt kernel, you need SRCU. If readers would block
|
||||
in a -rt kernel, but not in a non-rt kernel, SRCU is not
|
||||
necessary. (The -rt patchset turns spinlocks into sleeplocks,
|
||||
hence this distinction.)
|
||||
b. Will readers need to block and are you doing tracing, for
|
||||
example, ftrace or BPF? If so, you need RCU-tasks,
|
||||
RCU-tasks-rude, and/or RCU-tasks-trace.
|
||||
|
||||
c. Do you need to treat NMI handlers, hardirq handlers,
|
||||
c. What about the -rt patchset? If readers would need to block in
|
||||
a non-rt kernel, you need SRCU. If readers would block when
|
||||
acquiring spinlocks in a -rt kernel, but not in a non-rt kernel,
|
||||
SRCU is not necessary. (The -rt patchset turns spinlocks into
|
||||
sleeplocks, hence this distinction.)
|
||||
|
||||
d. Do you need to treat NMI handlers, hardirq handlers,
|
||||
and code segments with preemption disabled (whether
|
||||
via preempt_disable(), local_irq_save(), local_bh_disable(),
|
||||
or some other mechanism) as if they were explicit RCU readers?
|
||||
If so, RCU-sched is the only choice that will work for you.
|
||||
If so, RCU-sched readers are the only choice that will work
|
||||
for you, but since about v4.20 you can use the vanilla RCU
|
||||
update primitives.
|
||||
|
||||
d. Do you need RCU grace periods to complete even in the face
|
||||
of softirq monopolization of one or more of the CPUs? For
|
||||
example, is your code subject to network-based denial-of-service
|
||||
attacks? If so, you should disable softirq across your readers,
|
||||
for example, by using rcu_read_lock_bh().
|
||||
e. Do you need RCU grace periods to complete even in the face of
|
||||
softirq monopolization of one or more of the CPUs? For example,
|
||||
is your code subject to network-based denial-of-service attacks?
|
||||
If so, you should disable softirq across your readers, for
|
||||
example, by using rcu_read_lock_bh(). Since about v4.20 you
|
||||
can use the vanilla RCU update primitives.
|
||||
|
||||
e. Is your workload too update-intensive for normal use of
|
||||
f. Is your workload too update-intensive for normal use of
|
||||
RCU, but inappropriate for other synchronization mechanisms?
|
||||
If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
|
||||
named SLAB_DESTROY_BY_RCU). But please be careful!
|
||||
|
||||
f. Do you need read-side critical sections that are respected
|
||||
even though they are in the middle of the idle loop, during
|
||||
user-mode execution, or on an offlined CPU? If so, SRCU is the
|
||||
only choice that will work for you.
|
||||
g. Do you need read-side critical sections that are respected even
|
||||
on CPUs that are deep in the idle loop, during entry to or exit
|
||||
from user-mode execution, or on an offlined CPU? If so, SRCU
|
||||
and RCU Tasks Trace are the only choices that will work for you,
|
||||
with SRCU being strongly preferred in almost all cases.
|
||||
|
||||
g. Otherwise, use RCU.
|
||||
h. Otherwise, use RCU.
|
||||
|
||||
Of course, this all assumes that you have determined that RCU is in fact
|
||||
the right tool for your job.
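
As a minimal illustration of choice (a) above, an SRCU reader that may
block might look as follows, where ``my_srcu``, ``gp``, and
``do_something_that_might_sleep()`` are hypothetical::

	int idx;

	idx = srcu_read_lock(&my_srcu);
	p = srcu_dereference(gp, &my_srcu);
	if (p)
		do_something_that_might_sleep(p);
	srcu_read_unlock(&my_srcu, idx);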
|
||||
|
@ -5113,6 +5113,17 @@
|
||||
rcupdate.rcu_cpu_stall_timeout to be used (after
|
||||
conversion from seconds to milliseconds).
|
||||
|
||||
rcupdate.rcu_cpu_stall_cputime= [KNL]
|
||||
Provide statistics on the cputime and count of
|
||||
interrupts and tasks during the sampling period. For
|
||||
multiple continuous RCU stalls, all sampling periods
|
||||
begin at half of the first RCU stall timeout.
|
||||
|
||||
rcupdate.rcu_exp_stall_task_details= [KNL]
|
||||
Print stack dumps of any tasks blocking the
|
||||
current expedited RCU grace period during an
|
||||
expedited RCU CPU stall warning.
|
||||
|
||||
rcupdate.rcu_expedited= [KNL]
|
||||
Use expedited grace-period primitives, for
|
||||
example, synchronize_rcu_expedited() instead
|
||||
|
@ -181,7 +181,6 @@ void fw_devlink_purge_absent_suppliers(struct fwnode_handle *fwnode)
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(fw_devlink_purge_absent_suppliers);
|
||||
|
||||
#ifdef CONFIG_SRCU
|
||||
static DEFINE_MUTEX(device_links_lock);
|
||||
DEFINE_STATIC_SRCU(device_links_srcu);
|
||||
|
||||
@ -220,47 +219,6 @@ static void device_link_remove_from_lists(struct device_link *link)
|
||||
list_del_rcu(&link->s_node);
|
||||
list_del_rcu(&link->c_node);
|
||||
}
|
||||
#else /* !CONFIG_SRCU */
|
||||
static DECLARE_RWSEM(device_links_lock);
|
||||
|
||||
static inline void device_links_write_lock(void)
|
||||
{
|
||||
down_write(&device_links_lock);
|
||||
}
|
||||
|
||||
static inline void device_links_write_unlock(void)
|
||||
{
|
||||
up_write(&device_links_lock);
|
||||
}
|
||||
|
||||
int device_links_read_lock(void)
|
||||
{
|
||||
down_read(&device_links_lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
void device_links_read_unlock(int not_used)
|
||||
{
|
||||
up_read(&device_links_lock);
|
||||
}
|
||||
|
||||
#ifdef CONFIG_DEBUG_LOCK_ALLOC
|
||||
int device_links_read_lock_held(void)
|
||||
{
|
||||
return lockdep_is_held(&device_links_lock);
|
||||
}
|
||||
#endif
|
||||
|
||||
static inline void device_link_synchronize_removal(void)
|
||||
{
|
||||
}
|
||||
|
||||
static void device_link_remove_from_lists(struct device_link *link)
|
||||
{
|
||||
list_del(&link->s_node);
|
||||
list_del(&link->c_node);
|
||||
}
|
||||
#endif /* !CONFIG_SRCU */
|
||||
|
||||
static bool device_is_ancestor(struct device *dev, struct device *target)
|
||||
{
|
||||
|
@ -1,7 +1,6 @@
|
||||
# SPDX-License-Identifier: GPL-2.0-only
|
||||
menuconfig DAX
|
||||
tristate "DAX: direct access to differentiated memory"
|
||||
select SRCU
|
||||
default m if NVDIMM_DAX
|
||||
|
||||
if DAX
|
||||
|
@ -2,7 +2,6 @@
|
||||
config STM
|
||||
tristate "System Trace Module devices"
|
||||
select CONFIGFS_FS
|
||||
select SRCU
|
||||
help
|
||||
A System Trace Module (STM) is a device exporting data in System
|
||||
Trace Protocol (STP) format as defined by MIPI STP standards.
|
||||
|
@ -6,7 +6,6 @@
|
||||
menuconfig MD
|
||||
bool "Multiple devices driver support (RAID and LVM)"
|
||||
depends on BLOCK
|
||||
select SRCU
|
||||
help
|
||||
Support multiple physical spindles through a single logical device.
|
||||
Required for RAID and logical volume management.
|
||||
|
@ -334,7 +334,6 @@ config NETCONSOLE_DYNAMIC
|
||||
|
||||
config NETPOLL
|
||||
def_bool NETCONSOLE
|
||||
select SRCU
|
||||
|
||||
config NET_POLL_CONTROLLER
|
||||
def_bool NETPOLL
|
||||
|
@ -258,7 +258,7 @@ config PCIE_MEDIATEK_GEN3
|
||||
MediaTek SoCs.
|
||||
|
||||
config VMD
|
||||
depends on PCI_MSI && X86_64 && SRCU && !UML
|
||||
depends on PCI_MSI && X86_64 && !UML
|
||||
tristate "Intel Volume Management Device Driver"
|
||||
help
|
||||
Adds support for the Intel Volume Management Device (VMD). VMD is a
|
||||
|
@ -17,7 +17,6 @@ config BTRFS_FS
|
||||
select FS_IOMAP
|
||||
select RAID6_PQ
|
||||
select XOR_BLOCKS
|
||||
select SRCU
|
||||
depends on PAGE_SIZE_LESS_THAN_256KB
|
||||
|
||||
help
|
||||
|
@ -1890,7 +1890,6 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp,
|
||||
}
|
||||
EXPORT_SYMBOL(generic_setlease);
|
||||
|
||||
#if IS_ENABLED(CONFIG_SRCU)
|
||||
/*
|
||||
* Kernel subsystems can register to be notified on any attempt to set
|
||||
* a new lease with the lease_notifier_chain. This is used by (e.g.) nfsd
|
||||
@ -1924,30 +1923,6 @@ void lease_unregister_notifier(struct notifier_block *nb)
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(lease_unregister_notifier);
|
||||
|
||||
#else /* !IS_ENABLED(CONFIG_SRCU) */
|
||||
static inline void
|
||||
lease_notifier_chain_init(void)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void
|
||||
setlease_notifier(long arg, struct file_lock *lease)
|
||||
{
|
||||
}
|
||||
|
||||
int lease_register_notifier(struct notifier_block *nb)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(lease_register_notifier);
|
||||
|
||||
void lease_unregister_notifier(struct notifier_block *nb)
|
||||
{
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(lease_unregister_notifier);
|
||||
|
||||
#endif /* IS_ENABLED(CONFIG_SRCU) */
|
||||
|
||||
/**
|
||||
* vfs_setlease - sets a lease on an open file
|
||||
* @filp: file pointer
|
||||
|
@ -1,7 +1,6 @@
|
||||
# SPDX-License-Identifier: GPL-2.0-only
|
||||
config FSNOTIFY
|
||||
def_bool n
|
||||
select SRCU
|
||||
|
||||
source "fs/notify/dnotify/Kconfig"
|
||||
source "fs/notify/inotify/Kconfig"
|
||||
|
@ -6,7 +6,6 @@
|
||||
config QUOTA
|
||||
bool "Quota support"
|
||||
select QUOTACTL
|
||||
select SRCU
|
||||
help
|
||||
If you say Y here, you will be able to set per user limits for disk
|
||||
usage (also called disk quotas). Currently, it works for the
|
||||
|
@ -52,6 +52,7 @@ DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
|
||||
#define kstat_cpu(cpu) per_cpu(kstat, cpu)
|
||||
#define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
|
||||
|
||||
extern unsigned long long nr_context_switches_cpu(int cpu);
|
||||
extern unsigned long long nr_context_switches(void);
|
||||
|
||||
extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu);
|
||||
@ -67,6 +68,17 @@ static inline unsigned int kstat_softirqs_cpu(unsigned int irq, int cpu)
|
||||
return kstat_cpu(cpu).softirqs[irq];
|
||||
}
|
||||
|
||||
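/* Sum of this CPU's softirq counts over all softirq vectors, since bootup. */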
static inline unsigned int kstat_cpu_softirqs_sum(int cpu)
|
||||
{
|
||||
int i;
|
||||
unsigned int sum = 0;
|
||||
|
||||
for (i = 0; i < NR_SOFTIRQS; i++)
|
||||
sum += kstat_softirqs_cpu(i, cpu);
|
||||
|
||||
return sum;
|
||||
}
|
||||
|
||||
/*
|
||||
* Number of interrupts per specific IRQ source, since bootup
|
||||
*/
|
||||
@ -75,7 +87,7 @@ extern unsigned int kstat_irqs_usr(unsigned int irq);
|
||||
/*
|
||||
* Number of interrupts per cpu, since bootup
|
||||
*/
|
||||
static inline unsigned int kstat_cpu_irqs_sum(unsigned int cpu)
|
||||
static inline unsigned long kstat_cpu_irqs_sum(unsigned int cpu)
|
||||
{
|
||||
return kstat_cpu(cpu).irqs_sum;
|
||||
}
|
||||
|
@ -139,7 +139,7 @@ static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
|
||||
if (last) {
|
||||
n->next = last->next;
|
||||
n->pprev = &last->next;
|
||||
rcu_assign_pointer(hlist_next_rcu(last), n);
|
||||
rcu_assign_pointer(hlist_nulls_next_rcu(last), n);
|
||||
} else {
|
||||
hlist_nulls_add_head_rcu(n, h);
|
||||
}
|
||||
|
@ -238,6 +238,7 @@ void synchronize_rcu_tasks_rude(void);
|
||||
|
||||
#define rcu_note_voluntary_context_switch(t) rcu_tasks_qs(t, false)
|
||||
void exit_tasks_rcu_start(void);
|
||||
void exit_tasks_rcu_stop(void);
|
||||
void exit_tasks_rcu_finish(void);
|
||||
#else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
|
||||
#define rcu_tasks_classic_qs(t, preempt) do { } while (0)
|
||||
@ -246,6 +247,7 @@ void exit_tasks_rcu_finish(void);
|
||||
#define call_rcu_tasks call_rcu
|
||||
#define synchronize_rcu_tasks synchronize_rcu
|
||||
static inline void exit_tasks_rcu_start(void) { }
|
||||
static inline void exit_tasks_rcu_stop(void) { }
|
||||
static inline void exit_tasks_rcu_finish(void) { }
|
||||
#endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
|
||||
|
||||
@ -374,11 +376,18 @@ static inline int debug_lockdep_rcu_enabled(void)
|
||||
* RCU_LOCKDEP_WARN - emit lockdep splat if specified condition is met
|
||||
* @c: condition to check
|
||||
* @s: informative message
|
||||
*
|
||||
* This checks debug_lockdep_rcu_enabled() before checking (c) to
|
||||
* prevent early boot splats due to lockdep not yet being initialized,
|
||||
* and rechecks it after checking (c) to prevent false-positive splats
|
||||
* due to races with lockdep being disabled. See commit 3066820034b5dd
|
||||
* ("rcu: Reject RCU_LOCKDEP_WARN() false positives") for more detail.
|
||||
*/
|
||||
#define RCU_LOCKDEP_WARN(c, s) \
|
||||
do { \
|
||||
static bool __section(".data.unlikely") __warned; \
|
||||
if ((c) && debug_lockdep_rcu_enabled() && !__warned) { \
|
||||
if (debug_lockdep_rcu_enabled() && (c) && \
|
||||
debug_lockdep_rcu_enabled() && !__warned) { \
|
||||
__warned = true; \
|
||||
lockdep_rcu_suspicious(__FILE__, __LINE__, s); \
|
||||
} \
|
||||
@ -1004,6 +1013,9 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
|
||||
#define kvfree_rcu(...) KVFREE_GET_MACRO(__VA_ARGS__, \
|
||||
kvfree_rcu_arg_2, kvfree_rcu_arg_1)(__VA_ARGS__)
|
||||
|
||||
#define kvfree_rcu_mightsleep(ptr) kvfree_rcu_arg_1(ptr)
|
||||
#define kfree_rcu_mightsleep(ptr) kvfree_rcu_mightsleep(ptr)
|
||||
|
||||
#define KVFREE_GET_MACRO(_1, _2, NAME, ...) NAME
|
||||
#define kvfree_rcu_arg_2(ptr, rhf) \
|
||||
do { \
|
||||
@ -1011,8 +1023,7 @@ do { \
|
||||
\
|
||||
if (___p) { \
|
||||
BUILD_BUG_ON(!__is_kvfree_rcu_offset(offsetof(typeof(*(ptr)), rhf))); \
|
||||
kvfree_call_rcu(&((___p)->rhf), (rcu_callback_t)(unsigned long) \
|
||||
(offsetof(typeof(*(ptr)), rhf))); \
|
||||
kvfree_call_rcu(&((___p)->rhf), (void *) (___p)); \
|
||||
} \
|
||||
} while (0)
|
||||
|
||||
@ -1021,7 +1032,7 @@ do { \
|
||||
typeof(ptr) ___p = (ptr); \
|
||||
\
|
||||
if (___p) \
|
||||
kvfree_call_rcu(NULL, (rcu_callback_t) (___p)); \
|
||||
kvfree_call_rcu(NULL, (void *) (___p)); \
|
||||
} while (0)
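/*
 * Illustrative usage sketch (not part of this patch), assuming a
 * hypothetical structure with an embedded struct rcu_head named "rh":
 *
 *	kvfree_rcu(p, rh);		// Two-argument form: never sleeps.
 *	kvfree_rcu_mightsleep(p);	// Single-argument form: may sleep,
 *					// for example on allocation failure.
 */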
|
||||
|
||||
/*
|
||||
|
@ -98,25 +98,25 @@ static inline void synchronize_rcu_expedited(void)
|
||||
*/
|
||||
extern void kvfree(const void *addr);
|
||||
|
||||
static inline void __kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
|
||||
static inline void __kvfree_call_rcu(struct rcu_head *head, void *ptr)
|
||||
{
|
||||
if (head) {
|
||||
call_rcu(head, func);
|
||||
call_rcu(head, (rcu_callback_t) ((void *) head - ptr));
|
||||
return;
|
||||
}
|
||||
|
||||
// kvfree_rcu(one_arg) call.
|
||||
might_sleep();
|
||||
synchronize_rcu();
|
||||
kvfree((void *) func);
|
||||
kvfree(ptr);
|
||||
}
|
||||
|
||||
#ifdef CONFIG_KASAN_GENERIC
|
||||
void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
|
||||
void kvfree_call_rcu(struct rcu_head *head, void *ptr);
|
||||
#else
|
||||
static inline void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
|
||||
static inline void kvfree_call_rcu(struct rcu_head *head, void *ptr)
|
||||
{
|
||||
__kvfree_call_rcu(head, func);
|
||||
__kvfree_call_rcu(head, ptr);
|
||||
}
|
||||
#endif
|
||||
|
||||
|
@ -33,7 +33,7 @@ static inline void rcu_virt_note_context_switch(void)
|
||||
}
|
||||
|
||||
void synchronize_rcu_expedited(void);
|
||||
void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
|
||||
void kvfree_call_rcu(struct rcu_head *head, void *ptr);
|
||||
|
||||
void rcu_barrier(void);
|
||||
bool rcu_eqs_special_set(int cpu);
|
||||
|
@ -214,6 +214,34 @@ srcu_read_lock_notrace(struct srcu_struct *ssp) __acquires(ssp)
|
||||
return retval;
|
||||
}
|
||||
|
||||
/**
|
||||
* srcu_down_read - register a new reader for an SRCU-protected structure.
|
||||
* @ssp: srcu_struct in which to register the new reader.
|
||||
*
|
||||
* Enter a semaphore-like SRCU read-side critical section. Note that
|
||||
* SRCU read-side critical sections may be nested. However, it is
|
||||
* illegal to call anything that waits on an SRCU grace period for the
|
||||
* same srcu_struct, whether directly or indirectly. Please note that
|
||||
* one way to indirectly wait on an SRCU grace period is to acquire
|
||||
* a mutex that is held elsewhere while calling synchronize_srcu() or
|
||||
* synchronize_srcu_expedited(). But if you want lockdep to help you
|
||||
* keep this stuff straight, you should instead use srcu_read_lock().
|
||||
*
|
||||
* The semaphore-like nature of srcu_down_read() means that the matching
|
||||
* srcu_up_read() can be invoked from some other context, for example,
|
||||
* from some other task or from an irq handler. However, neither
|
||||
* srcu_down_read() nor srcu_up_read() may be invoked from an NMI handler.
|
||||
*
|
||||
* Calls to srcu_down_read() may be nested, similar to the manner in
|
||||
* which calls to down_read() may be nested.
|
||||
*/
|
||||
static inline int srcu_down_read(struct srcu_struct *ssp) __acquires(ssp)
|
||||
{
|
||||
WARN_ON_ONCE(in_nmi());
|
||||
srcu_check_nmi_safety(ssp, false);
|
||||
return __srcu_read_lock(ssp);
|
||||
}
|
||||
|
||||
/**
|
||||
* srcu_read_unlock - unregister an old reader from an SRCU-protected structure.
|
||||
* @ssp: srcu_struct in which to unregister the old reader.
|
||||
@ -254,6 +282,23 @@ srcu_read_unlock_notrace(struct srcu_struct *ssp, int idx) __releases(ssp)
|
||||
__srcu_read_unlock(ssp, idx);
|
||||
}
|
||||
|
||||
/**
|
||||
* srcu_up_read - unregister an old reader from an SRCU-protected structure.
|
||||
* @ssp: srcu_struct in which to unregister the old reader.
|
||||
* @idx: return value from corresponding srcu_read_lock().
|
||||
*
|
||||
* Exit an SRCU read-side critical section, but not necessarily from
|
||||
* the same context as the matching srcu_down_read().
|
||||
*/
|
||||
static inline void srcu_up_read(struct srcu_struct *ssp, int idx)
|
||||
__releases(ssp)
|
||||
{
|
||||
WARN_ON_ONCE(idx & ~0x1);
|
||||
WARN_ON_ONCE(in_nmi());
|
||||
srcu_check_nmi_safety(ssp, false);
|
||||
__srcu_read_unlock(ssp, idx);
|
||||
}
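
/*
 * Illustrative sketch (not part of this patch): hand off an SRCU
 * read-side critical section from the task that submits a request to
 * the context that completes it.  "struct my_request", its srcu_idx
 * field, and my_srcu are all hypothetical.
 *
 *	// Submission (task A):
 *	req->srcu_idx = srcu_down_read(&my_srcu);
 *	enqueue_request(req);
 *
 *	// Completion (task B or an irq handler):
 *	finish_request(req);
 *	srcu_up_read(&my_srcu, req->srcu_idx);
 */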
|
||||
|
||||
/**
|
||||
* smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock
|
||||
*
|
||||
|
@ -49,7 +49,7 @@ struct srcu_data {
|
||||
struct srcu_node {
|
||||
spinlock_t __private lock;
|
||||
unsigned long srcu_have_cbs[4]; /* GP seq for children having CBs, but only */
|
||||
/* if greater than ->srcu_gq_seq. */
|
||||
/* if greater than ->srcu_gp_seq. */
|
||||
unsigned long srcu_data_have_cbs[4]; /* Which srcu_data structs have CBs for given GP? */
|
||||
unsigned long srcu_gp_seq_needed_exp; /* Furthest future exp GP. */
|
||||
struct srcu_node *srcu_parent; /* Next up in tree. */
|
||||
|
@ -1873,7 +1873,6 @@ config PERF_EVENTS
|
||||
default y if PROFILING
|
||||
depends on HAVE_PERF_EVENTS
|
||||
select IRQ_WORK
|
||||
select SRCU
|
||||
help
|
||||
Enable kernel support for various performance events provided
|
||||
by software and hardware.
|
||||
|
@ -46,6 +46,9 @@ torture_param(int, shutdown_secs, 0, "Shutdown time (j), <= zero to disable.");
|
||||
torture_param(int, stat_interval, 60,
|
||||
"Number of seconds between stats printk()s");
|
||||
torture_param(int, stutter, 5, "Number of jiffies to run/halt test, 0=disable");
|
||||
torture_param(int, rt_boost, 2,
|
||||
"Do periodic rt-boost. 0=Disable, 1=Only for rt_mutex, 2=For all lock types.");
|
||||
torture_param(int, rt_boost_factor, 50, "A factor determining how often rt-boost happens.");
|
||||
torture_param(int, verbose, 1,
|
||||
"Enable verbose debugging printk()s");
|
||||
|
||||
@ -127,15 +130,50 @@ static void torture_lock_busted_write_unlock(int tid __maybe_unused)
|
||||
/* BUGGY, do not use in real life!!! */
|
||||
}
|
||||
|
||||
static void torture_boost_dummy(struct torture_random_state *trsp)
|
||||
static void __torture_rt_boost(struct torture_random_state *trsp)
|
||||
{
|
||||
/* Only rtmutexes care about priority */
|
||||
const unsigned int factor = rt_boost_factor;
|
||||
|
||||
if (!rt_task(current)) {
|
||||
/*
|
||||
* Boost priority once every rt_boost_factor operations. When
|
||||
* the task tries to take the lock, the rtmutex it will account
|
||||
* for the new priority, and do any corresponding pi-dance.
|
||||
*/
|
||||
if (trsp && !(torture_random(trsp) %
|
||||
(cxt.nrealwriters_stress * factor))) {
|
||||
sched_set_fifo(current);
|
||||
} else /* common case, do nothing */
|
||||
return;
|
||||
} else {
|
||||
/*
|
||||
* The task will remain boosted for another 10 * rt_boost_factor
|
||||
* operations, then restored back to its original prio, and so
|
||||
* forth.
|
||||
*
|
||||
* When @trsp is nil, we want to force-reset the task for
|
||||
* stopping the kthread.
|
||||
*/
|
||||
if (!trsp || !(torture_random(trsp) %
|
||||
(cxt.nrealwriters_stress * factor * 2))) {
|
||||
sched_set_normal(current, 0);
|
||||
} else /* common case, do nothing */
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
static void torture_rt_boost(struct torture_random_state *trsp)
|
||||
{
|
||||
if (rt_boost != 2)
|
||||
return;
|
||||
|
||||
__torture_rt_boost(trsp);
|
||||
}
|
||||
|
||||
static struct lock_torture_ops lock_busted_ops = {
|
||||
.writelock = torture_lock_busted_write_lock,
|
||||
.write_delay = torture_lock_busted_write_delay,
|
||||
.task_boost = torture_boost_dummy,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_lock_busted_write_unlock,
|
||||
.readlock = NULL,
|
||||
.read_delay = NULL,
|
||||
@ -179,7 +217,7 @@ __releases(torture_spinlock)
|
||||
static struct lock_torture_ops spin_lock_ops = {
|
||||
.writelock = torture_spin_lock_write_lock,
|
||||
.write_delay = torture_spin_lock_write_delay,
|
||||
.task_boost = torture_boost_dummy,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_spin_lock_write_unlock,
|
||||
.readlock = NULL,
|
||||
.read_delay = NULL,
|
||||
@ -206,7 +244,7 @@ __releases(torture_spinlock)
|
||||
static struct lock_torture_ops spin_lock_irq_ops = {
|
||||
.writelock = torture_spin_lock_write_lock_irq,
|
||||
.write_delay = torture_spin_lock_write_delay,
|
||||
.task_boost = torture_boost_dummy,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_lock_spin_write_unlock_irq,
|
||||
.readlock = NULL,
|
||||
.read_delay = NULL,
|
||||
@ -275,7 +313,7 @@ __releases(torture_rwlock)
|
||||
static struct lock_torture_ops rw_lock_ops = {
|
||||
.writelock = torture_rwlock_write_lock,
|
||||
.write_delay = torture_rwlock_write_delay,
|
||||
.task_boost = torture_boost_dummy,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_rwlock_write_unlock,
|
||||
.readlock = torture_rwlock_read_lock,
|
||||
.read_delay = torture_rwlock_read_delay,
|
||||
@ -318,7 +356,7 @@ __releases(torture_rwlock)
|
||||
static struct lock_torture_ops rw_lock_irq_ops = {
|
||||
.writelock = torture_rwlock_write_lock_irq,
|
||||
.write_delay = torture_rwlock_write_delay,
|
||||
.task_boost = torture_boost_dummy,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_rwlock_write_unlock_irq,
|
||||
.readlock = torture_rwlock_read_lock_irq,
|
||||
.read_delay = torture_rwlock_read_delay,
|
||||
@ -358,7 +396,7 @@ __releases(torture_mutex)
|
||||
static struct lock_torture_ops mutex_lock_ops = {
|
||||
.writelock = torture_mutex_lock,
|
||||
.write_delay = torture_mutex_delay,
|
||||
.task_boost = torture_boost_dummy,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_mutex_unlock,
|
||||
.readlock = NULL,
|
||||
.read_delay = NULL,
|
||||
@ -456,7 +494,7 @@ static struct lock_torture_ops ww_mutex_lock_ops = {
|
||||
.exit = torture_ww_mutex_exit,
|
||||
.writelock = torture_ww_mutex_lock,
|
||||
.write_delay = torture_mutex_delay,
|
||||
.task_boost = torture_boost_dummy,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_ww_mutex_unlock,
|
||||
.readlock = NULL,
|
||||
.read_delay = NULL,
|
||||
@ -474,37 +512,6 @@ __acquires(torture_rtmutex)
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void torture_rtmutex_boost(struct torture_random_state *trsp)
|
||||
{
|
||||
const unsigned int factor = 50000; /* yes, quite arbitrary */
|
||||
|
||||
if (!rt_task(current)) {
|
||||
/*
|
||||
* Boost priority once every ~50k operations. When the
|
||||
* task tries to take the lock, the rtmutex it will account
|
||||
* for the new priority, and do any corresponding pi-dance.
|
||||
*/
|
||||
if (trsp && !(torture_random(trsp) %
|
||||
(cxt.nrealwriters_stress * factor))) {
|
||||
sched_set_fifo(current);
|
||||
} else /* common case, do nothing */
|
||||
return;
|
||||
} else {
|
||||
/*
|
||||
* The task will remain boosted for another ~500k operations,
|
||||
* then restored back to its original prio, and so forth.
|
||||
*
|
||||
* When @trsp is nil, we want to force-reset the task for
|
||||
* stopping the kthread.
|
||||
*/
|
||||
if (!trsp || !(torture_random(trsp) %
|
||||
(cxt.nrealwriters_stress * factor * 2))) {
|
||||
sched_set_normal(current, 0);
|
||||
} else /* common case, do nothing */
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
static void torture_rtmutex_delay(struct torture_random_state *trsp)
|
||||
{
|
||||
const unsigned long shortdelay_us = 2;
|
||||
@ -530,10 +537,18 @@ __releases(torture_rtmutex)
|
||||
rt_mutex_unlock(&torture_rtmutex);
|
||||
}
|
||||
|
||||
static void torture_rt_boost_rtmutex(struct torture_random_state *trsp)
|
||||
{
|
||||
if (!rt_boost)
|
||||
return;
|
||||
|
||||
__torture_rt_boost(trsp);
|
||||
}
|
||||
|
||||
static struct lock_torture_ops rtmutex_lock_ops = {
|
||||
.writelock = torture_rtmutex_lock,
|
||||
.write_delay = torture_rtmutex_delay,
|
||||
.task_boost = torture_rtmutex_boost,
|
||||
.task_boost = torture_rt_boost_rtmutex,
|
||||
.writeunlock = torture_rtmutex_unlock,
|
||||
.readlock = NULL,
|
||||
.read_delay = NULL,
|
||||
@ -600,7 +615,7 @@ __releases(torture_rwsem)
|
||||
static struct lock_torture_ops rwsem_lock_ops = {
|
||||
.writelock = torture_rwsem_down_write,
|
||||
.write_delay = torture_rwsem_write_delay,
|
||||
.task_boost = torture_boost_dummy,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_rwsem_up_write,
|
||||
.readlock = torture_rwsem_down_read,
|
||||
.read_delay = torture_rwsem_read_delay,
|
||||
@ -652,7 +667,7 @@ static struct lock_torture_ops percpu_rwsem_lock_ops = {
|
||||
.exit = torture_percpu_rwsem_exit,
|
||||
.writelock = torture_percpu_rwsem_down_write,
|
||||
.write_delay = torture_rwsem_write_delay,
|
||||
.task_boost = torture_boost_dummy,
|
||||
.task_boost = torture_rt_boost,
|
||||
.writeunlock = torture_percpu_rwsem_up_write,
|
||||
.readlock = torture_percpu_rwsem_down_read,
|
||||
.read_delay = torture_rwsem_read_delay,
|
||||
|
@ -456,7 +456,6 @@ int raw_notifier_call_chain(struct raw_notifier_head *nh,
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(raw_notifier_call_chain);
|
||||
|
||||
#ifdef CONFIG_SRCU
|
||||
/*
|
||||
* SRCU notifier chain routines. Registration and unregistration
|
||||
* use a mutex, and call_chain is synchronized by SRCU (no locks).
|
||||
@ -573,8 +572,6 @@ void srcu_init_notifier_head(struct srcu_notifier_head *nh)
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(srcu_init_notifier_head);
|
||||
|
||||
#endif /* CONFIG_SRCU */
|
||||
|
||||
static ATOMIC_NOTIFIER_HEAD(die_chain);
|
||||
|
||||
int notrace notify_die(enum die_val val, const char *str,
|
||||
|
@ -244,7 +244,24 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
|
||||
set_current_state(TASK_INTERRUPTIBLE);
|
||||
if (pid_ns->pid_allocated == init_pids)
|
||||
break;
|
||||
/*
|
||||
* Release tasks_rcu_exit_srcu to avoid following deadlock:
|
||||
*
|
||||
* 1) TASK A unshare(CLONE_NEWPID)
|
||||
* 2) TASK A fork() twice -> TASK B (child reaper for new ns)
|
||||
* and TASK C
|
||||
* 3) TASK B exits, kills TASK C, waits for TASK A to reap it
|
||||
* 4) TASK A calls synchronize_rcu_tasks()
|
||||
* -> synchronize_srcu(tasks_rcu_exit_srcu)
|
||||
* 5) *DEADLOCK*
|
||||
*
|
||||
* It is considered safe to release tasks_rcu_exit_srcu here
|
||||
* because we assume the current task can not be concurrently
|
||||
* reaped at this point.
|
||||
*/
|
||||
exit_tasks_rcu_stop();
|
||||
schedule();
|
||||
exit_tasks_rcu_start();
|
||||
}
|
||||
__set_current_state(TASK_RUNNING);
|
||||
|
||||
|
@ -82,7 +82,7 @@ config RCU_CPU_STALL_TIMEOUT
|
||||
config RCU_EXP_CPU_STALL_TIMEOUT
|
||||
int "Expedited RCU CPU stall timeout in milliseconds"
|
||||
depends on RCU_STALL_COMMON
|
||||
range 0 21000
|
||||
range 0 300000
|
||||
default 0
|
||||
help
|
||||
If a given expedited RCU grace period extends more than the
|
||||
@ -92,6 +92,19 @@ config RCU_EXP_CPU_STALL_TIMEOUT
|
||||
says to use the RCU_CPU_STALL_TIMEOUT value converted from
|
||||
seconds to milliseconds.
|
||||
|
||||
config RCU_CPU_STALL_CPUTIME
|
||||
bool "Provide additional RCU stall debug information"
|
||||
depends on RCU_STALL_COMMON
|
||||
default n
|
||||
help
|
||||
Collect statistics during the sampling period, such as the number of
hard interrupts, soft interrupts, and task switches, as well as the
cputime consumed by hard interrupts, soft interrupts, and kernel tasks.
These statistics are added to the RCU stall report. For multiple
continuous RCU stalls, all sampling periods begin at half of the first
RCU stall timeout.
|
||||
The boot option rcupdate.rcu_cpu_stall_cputime has the same function
|
||||
as this one, but will override this if it exists.
|
||||
|
||||
config RCU_TRACE
|
||||
bool "Enable tracing for RCU"
|
||||
depends on DEBUG_KERNEL
|
||||
|
@ -224,6 +224,8 @@ extern int rcu_cpu_stall_ftrace_dump;
|
||||
extern int rcu_cpu_stall_suppress;
|
||||
extern int rcu_cpu_stall_timeout;
|
||||
extern int rcu_exp_cpu_stall_timeout;
|
||||
extern int rcu_cpu_stall_cputime;
|
||||
extern bool rcu_exp_stall_task_details __read_mostly;
|
||||
int rcu_jiffies_till_stall_check(void);
|
||||
int rcu_exp_jiffies_till_stall_check(void);
|
||||
|
||||
@ -447,14 +449,20 @@ do { \
|
||||
/* Tiny RCU doesn't expedite, as its purpose in life is instead to be tiny. */
|
||||
static inline bool rcu_gp_is_normal(void) { return true; }
|
||||
static inline bool rcu_gp_is_expedited(void) { return false; }
|
||||
static inline bool rcu_async_should_hurry(void) { return false; }
|
||||
static inline void rcu_expedite_gp(void) { }
|
||||
static inline void rcu_unexpedite_gp(void) { }
|
||||
static inline void rcu_async_hurry(void) { }
|
||||
static inline void rcu_async_relax(void) { }
|
||||
static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
|
||||
#else /* #ifdef CONFIG_TINY_RCU */
|
||||
bool rcu_gp_is_normal(void); /* Internal RCU use. */
|
||||
bool rcu_gp_is_expedited(void); /* Internal RCU use. */
|
||||
bool rcu_async_should_hurry(void); /* Internal RCU use. */
|
||||
void rcu_expedite_gp(void);
|
||||
void rcu_unexpedite_gp(void);
|
||||
void rcu_async_hurry(void);
|
||||
void rcu_async_relax(void);
|
||||
void rcupdate_announce_bootup_oddness(void);
|
||||
#ifdef CONFIG_TASKS_RCU_GENERIC
|
||||
void show_rcu_tasks_gp_kthreads(void);
|
||||
|
@ -89,7 +89,7 @@ static void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v)
|
||||
}
|
||||
|
||||
/* Get the length of a segment of the rcu_segcblist structure. */
|
||||
static long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg)
|
||||
long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg)
|
||||
{
|
||||
return READ_ONCE(rsclp->seglen[seg]);
|
||||
}
|
||||
|
@ -15,6 +15,8 @@ static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
|
||||
return READ_ONCE(rclp->len);
|
||||
}
|
||||
|
||||
long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg);
|
||||
|
||||
/* Return number of callbacks in segmented callback list by summing seglen. */
|
||||
long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp);
|
||||
|
||||
|
@ -399,7 +399,7 @@ static int torture_readlock_not_held(void)
|
||||
return rcu_read_lock_bh_held() || rcu_read_lock_sched_held();
|
||||
}
|
||||
|
||||
static int rcu_torture_read_lock(void) __acquires(RCU)
|
||||
static int rcu_torture_read_lock(void)
|
||||
{
|
||||
rcu_read_lock();
|
||||
return 0;
|
||||
@ -441,7 +441,7 @@ rcu_read_delay(struct torture_random_state *rrsp, struct rt_read_seg *rtrsp)
|
||||
}
|
||||
}
|
||||
|
||||
static void rcu_torture_read_unlock(int idx) __releases(RCU)
|
||||
static void rcu_torture_read_unlock(int idx)
|
||||
{
|
||||
rcu_read_unlock();
|
||||
}
|
||||
@ -625,7 +625,7 @@ static struct srcu_struct srcu_ctld;
|
||||
static struct srcu_struct *srcu_ctlp = &srcu_ctl;
|
||||
static struct rcu_torture_ops srcud_ops;
|
||||
|
||||
static int srcu_torture_read_lock(void) __acquires(srcu_ctlp)
|
||||
static int srcu_torture_read_lock(void)
|
||||
{
|
||||
if (cur_ops == &srcud_ops)
|
||||
return srcu_read_lock_nmisafe(srcu_ctlp);
|
||||
@ -652,7 +652,7 @@ srcu_read_delay(struct torture_random_state *rrsp, struct rt_read_seg *rtrsp)
|
||||
}
|
||||
}
|
||||
|
||||
static void srcu_torture_read_unlock(int idx) __releases(srcu_ctlp)
|
||||
static void srcu_torture_read_unlock(int idx)
|
||||
{
|
||||
if (cur_ops == &srcud_ops)
|
||||
srcu_read_unlock_nmisafe(srcu_ctlp, idx);
|
||||
@ -814,13 +814,13 @@ static void synchronize_rcu_trivial(void)
|
||||
}
|
||||
}
|
||||
|
||||
static int rcu_torture_read_lock_trivial(void) __acquires(RCU)
|
||||
static int rcu_torture_read_lock_trivial(void)
|
||||
{
|
||||
preempt_disable();
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void rcu_torture_read_unlock_trivial(int idx) __releases(RCU)
|
||||
static void rcu_torture_read_unlock_trivial(int idx)
|
||||
{
|
||||
preempt_enable();
|
||||
}
|
||||
|
@ -76,6 +76,8 @@ torture_param(int, verbose_batched, 0, "Batch verbose debugging printk()s");
|
||||
// Wait until there are multiple CPUs before starting test.
|
||||
torture_param(int, holdoff, IS_BUILTIN(CONFIG_RCU_REF_SCALE_TEST) ? 10 : 0,
|
||||
"Holdoff time before test start (s)");
|
||||
// Number of typesafe_lookup structures, that is, the degree of concurrency.
|
||||
torture_param(long, lookup_instances, 0, "Number of typesafe_lookup structures.");
|
||||
// Number of loops per experiment, all readers execute operations concurrently.
|
||||
torture_param(long, loops, 10000, "Number of loops per experiment.");
|
||||
// Number of readers, with -1 defaulting to about 75% of the CPUs.
|
||||
@ -124,7 +126,7 @@ static int exp_idx;
|
||||
|
||||
// Operations vector for selecting different types of tests.
|
||||
struct ref_scale_ops {
|
||||
void (*init)(void);
|
||||
bool (*init)(void);
|
||||
void (*cleanup)(void);
|
||||
void (*readsection)(const int nloops);
|
||||
void (*delaysection)(const int nloops, const int udl, const int ndl);
|
||||
@ -162,8 +164,9 @@ static void ref_rcu_delay_section(const int nloops, const int udl, const int ndl
|
||||
}
|
||||
}
|
||||
|
||||
static void rcu_sync_scale_init(void)
|
||||
static bool rcu_sync_scale_init(void)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
static struct ref_scale_ops rcu_ops = {
|
||||
@ -315,9 +318,10 @@ static struct ref_scale_ops refcnt_ops = {
|
||||
// Definitions for rwlock
|
||||
static rwlock_t test_rwlock;
|
||||
|
||||
static void ref_rwlock_init(void)
|
||||
static bool ref_rwlock_init(void)
|
||||
{
|
||||
rwlock_init(&test_rwlock);
|
||||
return true;
|
||||
}
|
||||
|
||||
static void ref_rwlock_section(const int nloops)
|
||||
@ -351,9 +355,10 @@ static struct ref_scale_ops rwlock_ops = {
|
||||
// Definitions for rwsem
|
||||
static struct rw_semaphore test_rwsem;
|
||||
|
||||
static void ref_rwsem_init(void)
|
||||
static bool ref_rwsem_init(void)
|
||||
{
|
||||
init_rwsem(&test_rwsem);
|
||||
return true;
|
||||
}
|
||||
|
||||
static void ref_rwsem_section(const int nloops)
|
||||
@ -523,6 +528,237 @@ static struct ref_scale_ops clock_ops = {
|
||||
.name = "clock"
|
||||
};
|
||||
|
||||
////////////////////////////////////////////////////////////////////////
|
||||
//
|
||||
// Methods leveraging SLAB_TYPESAFE_BY_RCU.
|
||||
//
|
||||
|
||||
// Item to look up in a typesafe manner. Array of pointers to these.
|
||||
struct refscale_typesafe {
|
||||
atomic_t rts_refctr; // Used by all flavors
|
||||
spinlock_t rts_lock;
|
||||
seqlock_t rts_seqlock;
|
||||
unsigned int a;
|
||||
unsigned int b;
|
||||
};
|
||||
|
||||
static struct kmem_cache *typesafe_kmem_cachep;
|
||||
static struct refscale_typesafe **rtsarray;
|
||||
static long rtsarray_size;
|
||||
static DEFINE_TORTURE_RANDOM_PERCPU(refscale_rand);
|
||||
static bool (*rts_acquire)(struct refscale_typesafe *rtsp, unsigned int *start);
|
||||
static bool (*rts_release)(struct refscale_typesafe *rtsp, unsigned int start);
|
||||
|
||||
// Conditionally acquire an explicit in-structure reference count.
|
||||
static bool typesafe_ref_acquire(struct refscale_typesafe *rtsp, unsigned int *start)
|
||||
{
|
||||
return atomic_inc_not_zero(&rtsp->rts_refctr);
|
||||
}
|
||||
|
||||
// Unconditionally release an explicit in-structure reference count.
|
||||
static bool typesafe_ref_release(struct refscale_typesafe *rtsp, unsigned int start)
|
||||
{
|
||||
if (!atomic_dec_return(&rtsp->rts_refctr)) {
|
||||
WRITE_ONCE(rtsp->a, rtsp->a + 1);
|
||||
kmem_cache_free(typesafe_kmem_cachep, rtsp);
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
// Unconditionally acquire an explicit in-structure spinlock.
|
||||
static bool typesafe_lock_acquire(struct refscale_typesafe *rtsp, unsigned int *start)
|
||||
{
|
||||
spin_lock(&rtsp->rts_lock);
|
||||
return true;
|
||||
}
|
||||
|
||||
// Unconditionally release an explicit in-structure spinlock.
|
||||
static bool typesafe_lock_release(struct refscale_typesafe *rtsp, unsigned int start)
|
||||
{
|
||||
spin_unlock(&rtsp->rts_lock);
|
||||
return true;
|
||||
}
|
||||
|
||||
// Unconditionally acquire an explicit in-structure sequence lock.
|
||||
static bool typesafe_seqlock_acquire(struct refscale_typesafe *rtsp, unsigned int *start)
|
||||
{
|
||||
*start = read_seqbegin(&rtsp->rts_seqlock);
|
||||
return true;
|
||||
}
|
||||
|
||||
// Conditionally release an explicit in-structure sequence lock. Return
|
||||
// true if this release was successful, that is, if no retry is required.
|
||||
static bool typesafe_seqlock_release(struct refscale_typesafe *rtsp, unsigned int start)
|
||||
{
|
||||
return !read_seqretry(&rtsp->rts_seqlock, start);
|
||||
}
|
||||
|
||||
// Do a read-side critical section with the specified delay in
|
||||
// microseconds and nanoseconds inserted so as to increase probability
|
||||
// of failure.
|
||||
static void typesafe_delay_section(const int nloops, const int udl, const int ndl)
|
||||
{
|
||||
unsigned int a;
|
||||
unsigned int b;
|
||||
int i;
|
||||
long idx;
|
||||
struct refscale_typesafe *rtsp;
|
||||
unsigned int start;
|
||||
|
||||
for (i = nloops; i >= 0; i--) {
|
||||
preempt_disable();
|
||||
idx = torture_random(this_cpu_ptr(&refscale_rand)) % rtsarray_size;
|
||||
preempt_enable();
|
||||
retry:
|
||||
rcu_read_lock();
|
||||
rtsp = rcu_dereference(rtsarray[idx]);
|
||||
a = READ_ONCE(rtsp->a);
|
||||
if (!rts_acquire(rtsp, &start)) {
|
||||
rcu_read_unlock();
|
||||
goto retry;
|
||||
}
|
||||
if (a != READ_ONCE(rtsp->a)) {
|
||||
(void)rts_release(rtsp, start);
|
||||
rcu_read_unlock();
|
||||
goto retry;
|
||||
}
|
||||
un_delay(udl, ndl);
|
||||
// Remember, seqlock read-side release can fail.
|
||||
if (!rts_release(rtsp, start)) {
|
||||
rcu_read_unlock();
|
||||
goto retry;
|
||||
}
|
||||
b = READ_ONCE(rtsp->a);
|
||||
WARN_ONCE(a != b, "Re-read of ->a changed from %u to %u.\n", a, b);
|
||||
b = rtsp->b;
|
||||
rcu_read_unlock();
|
||||
WARN_ON_ONCE(a * a != b);
|
||||
}
|
||||
}
|
||||
|
||||
// Because the acquisition and release methods are expensive, there
|
||||
// is no point in optimizing away the un_delay() function's two checks.
|
||||
// Thus simply define typesafe_read_section() as a simple wrapper around
|
||||
// typesafe_delay_section().
|
||||
static void typesafe_read_section(const int nloops)
|
||||
{
|
||||
typesafe_delay_section(nloops, 0, 0);
|
||||
}
|
||||
|
// Allocate and initialize one refscale_typesafe structure.
static struct refscale_typesafe *typesafe_alloc_one(void)
{
	struct refscale_typesafe *rtsp;

	rtsp = kmem_cache_alloc(typesafe_kmem_cachep, GFP_KERNEL);
	if (!rtsp)
		return NULL;
	atomic_set(&rtsp->rts_refctr, 1);
	WRITE_ONCE(rtsp->a, rtsp->a + 1);
	WRITE_ONCE(rtsp->b, rtsp->a * rtsp->a);
	return rtsp;
}

// Slab-allocator constructor for refscale_typesafe structures created
// out of a new slab of system memory.
static void refscale_typesafe_ctor(void *rtsp_in)
{
	struct refscale_typesafe *rtsp = rtsp_in;

	spin_lock_init(&rtsp->rts_lock);
	seqlock_init(&rtsp->rts_seqlock);
	preempt_disable();
	rtsp->a = torture_random(this_cpu_ptr(&refscale_rand));
	preempt_enable();
}

static struct ref_scale_ops typesafe_ref_ops;
static struct ref_scale_ops typesafe_lock_ops;
static struct ref_scale_ops typesafe_seqlock_ops;

// Initialize for a typesafe test.
static bool typesafe_init(void)
{
	long idx;
	long si = lookup_instances;

	typesafe_kmem_cachep = kmem_cache_create("refscale_typesafe",
			sizeof(struct refscale_typesafe), sizeof(void *),
			SLAB_TYPESAFE_BY_RCU, refscale_typesafe_ctor);
	if (!typesafe_kmem_cachep)
		return false;
	if (si < 0)
		si = -si * nr_cpu_ids;
	else if (si == 0)
		si = nr_cpu_ids;
	rtsarray_size = si;
	rtsarray = kcalloc(si, sizeof(*rtsarray), GFP_KERNEL);
	if (!rtsarray)
		return false;
	for (idx = 0; idx < rtsarray_size; idx++) {
		rtsarray[idx] = typesafe_alloc_one();
		if (!rtsarray[idx])
			return false;
	}
	if (cur_ops == &typesafe_ref_ops) {
		rts_acquire = typesafe_ref_acquire;
		rts_release = typesafe_ref_release;
	} else if (cur_ops == &typesafe_lock_ops) {
		rts_acquire = typesafe_lock_acquire;
		rts_release = typesafe_lock_release;
	} else if (cur_ops == &typesafe_seqlock_ops) {
		rts_acquire = typesafe_seqlock_acquire;
		rts_release = typesafe_seqlock_release;
	} else {
		WARN_ON_ONCE(1);
		return false;
	}
	return true;
}

// Clean up after a typesafe test.
static void typesafe_cleanup(void)
{
	long idx;

	if (rtsarray) {
		for (idx = 0; idx < rtsarray_size; idx++)
			kmem_cache_free(typesafe_kmem_cachep, rtsarray[idx]);
		kfree(rtsarray);
		rtsarray = NULL;
		rtsarray_size = 0;
	}
	kmem_cache_destroy(typesafe_kmem_cachep);
	typesafe_kmem_cachep = NULL;
	rts_acquire = NULL;
	rts_release = NULL;
}

// The typesafe_init() function distinguishes these structures by address.
static struct ref_scale_ops typesafe_ref_ops = {
	.init = typesafe_init,
	.cleanup = typesafe_cleanup,
	.readsection = typesafe_read_section,
	.delaysection = typesafe_delay_section,
	.name = "typesafe_ref"
};

static struct ref_scale_ops typesafe_lock_ops = {
	.init = typesafe_init,
	.cleanup = typesafe_cleanup,
	.readsection = typesafe_read_section,
	.delaysection = typesafe_delay_section,
	.name = "typesafe_lock"
};

static struct ref_scale_ops typesafe_seqlock_ops = {
	.init = typesafe_init,
	.cleanup = typesafe_cleanup,
	.readsection = typesafe_read_section,
	.delaysection = typesafe_delay_section,
	.name = "typesafe_seqlock"
};

static void rcu_scale_one_reader(void)
{
	if (readdelay <= 0)
@@ -812,6 +1048,7 @@ ref_scale_init(void)
	static struct ref_scale_ops *scale_ops[] = {
		&rcu_ops, &srcu_ops, RCU_TRACE_OPS RCU_TASKS_OPS &refcnt_ops, &rwlock_ops,
		&rwsem_ops, &lock_ops, &lock_irq_ops, &acqrel_ops, &clock_ops,
		&typesafe_ref_ops, &typesafe_lock_ops, &typesafe_seqlock_ops,
	};

	if (!torture_init_begin(scale_type, verbose))
@@ -833,7 +1070,10 @@ ref_scale_init(void)
		goto unwind;
	}
	if (cur_ops->init)
		cur_ops->init();
	if (!cur_ops->init()) {
		firsterr = -EUCLEAN;
		goto unwind;
	}

	ref_scale_print_module_parms(cur_ops, "Start of test");

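
typesafe_init() creates the cache with SLAB_TYPESAFE_BY_RCU, which is what the readers above rely on: under rcu_read_lock(), a freed object's memory may be immediately reallocated, but it remains a struct refscale_typesafe, so a reader must take a reference first and only then re-validate identity. A hedged, generic sketch of that lookup pattern follows; my_obj, my_lookup() and the other names are hypothetical, not code from this pull.

// Hedged, generic sketch of the SLAB_TYPESAFE_BY_RCU lookup pattern that the
// typesafe_*() readers above are exercising. All names here are hypothetical.
struct my_obj {
	atomic_t refcnt;
	int key;
};

static struct my_obj *my_lookup(struct my_obj __rcu **table, int idx, int key)
{
	struct my_obj *p;

	rcu_read_lock();
	p = rcu_dereference(table[idx]);
	// The slab guarantees p still points at a struct my_obj, but it may
	// have been freed and reused for a different key, so take a
	// reference first and only then recheck identity.
	if (!p || !atomic_add_unless(&p->refcnt, 1, 0)) {
		p = NULL;
	} else if (p->key != key) {
		atomic_dec(&p->refcnt);	// Wrong object: the slot was reused.
		p = NULL;
	}
	rcu_read_unlock();
	return p;	// Caller drops the reference when done.
}
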
@@ -154,7 +154,7 @@ static int init_srcu_struct_data(struct srcu_struct *ssp)
 */
static inline bool srcu_invl_snp_seq(unsigned long s)
{
	return rcu_seq_state(s) == SRCU_SNP_INIT_SEQ;
	return s == SRCU_SNP_INIT_SEQ;
}

/*
@@ -469,24 +469,59 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx)

	/*
	 * If the locks are the same as the unlocks, then there must have
	 * been no readers on this index at some time in between. This does
	 * not mean that there are no more readers, as one could have read
	 * the current index but not have incremented the lock counter yet.
	 * been no readers on this index at some point in this function.
	 * But there might be more readers, as a task might have read
	 * the current ->srcu_idx but not yet have incremented its CPU's
	 * ->srcu_lock_count[idx] counter. In fact, it is possible
	 * that most of the tasks have been preempted between fetching
	 * ->srcu_idx and incrementing ->srcu_lock_count[idx]. And there
	 * could be almost (ULONG_MAX / sizeof(struct task_struct)) tasks
	 * in a system whose address space was fully populated with memory.
	 * Call this quantity Nt.
	 *
	 * So suppose that the updater is preempted here for so long
	 * that more than ULONG_MAX non-nested readers come and go in
	 * the meantime. It turns out that this cannot result in overflow
	 * because if a reader modifies its unlock count after we read it
	 * above, then that reader's next load of ->srcu_idx is guaranteed
	 * to get the new value, which will cause it to operate on the
	 * other bank of counters, where it cannot contribute to the
	 * overflow of these counters. This means that there is a maximum
	 * of 2*NR_CPUS increments, which cannot overflow given current
	 * systems, especially not on 64-bit systems.
	 * So suppose that the updater is preempted at this point in the
	 * code for a long time. That now-preempted updater has already
	 * flipped ->srcu_idx (possibly during the preceding grace period),
	 * done an smp_mb() (again, possibly during the preceding grace
	 * period), and summed up the ->srcu_unlock_count[idx] counters.
	 * How many times can a given one of the aforementioned Nt tasks
	 * increment the old ->srcu_idx value's ->srcu_lock_count[idx]
	 * counter, in the absence of nesting?
	 *
	 * OK, how about nesting? This does impose a limit on nesting
	 * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient,
	 * especially on 64-bit systems.
	 * It can clearly do so once, given that it has already fetched
	 * the old value of ->srcu_idx and is just about to use that value
	 * to index its increment of ->srcu_lock_count[idx]. But as soon as
	 * it leaves that SRCU read-side critical section, it will increment
	 * ->srcu_unlock_count[idx], which must follow the updater's above
	 * read from that same value. Thus, as soon the reading task does
	 * an smp_mb() and a later fetch from ->srcu_idx, that task will be
	 * guaranteed to get the new index. Except that the increment of
	 * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the
	 * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock()
	 * is before the smp_mb(). Thus, that task might not see the new
	 * value of ->srcu_idx until the -second- __srcu_read_lock(),
	 * which in turn means that this task might well increment
	 * ->srcu_lock_count[idx] for the old value of ->srcu_idx twice,
	 * not just once.
	 *
	 * However, it is important to note that a given smp_mb() takes
	 * effect not just for the task executing it, but also for any
	 * later task running on that same CPU.
	 *
	 * That is, there can be almost Nt + Nc further increments of
	 * ->srcu_lock_count[idx] for the old index, where Nc is the number
	 * of CPUs. But this is OK because the size of the task_struct
	 * structure limits the value of Nt and current systems limit Nc
	 * to a few thousand.
	 *
	 * OK, but what about nesting? This does impose a limit on
	 * nesting of half of the size of the task_struct structure
	 * (measured in bytes), which should be sufficient. A late 2022
	 * TREE01 rcutorture run reported this size to be no less than
	 * 9408 bytes, allowing up to 4704 levels of nesting, which is
	 * comfortably beyond excessive. Especially on 64-bit systems,
	 * which are unlikely to be configured with an address space fully
	 * populated with memory, at least not anytime soon.
	 */
	return srcu_readers_lock_idx(ssp, idx) == unlocks;
}
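
The check this comment justifies boils down to: sum the ->srcu_unlock_count[idx] counters across CPUs, execute a full memory barrier, then sum the ->srcu_lock_count[idx] counters and compare, as in the return statement above. A simplified sketch of that per-CPU summation follows; it assumes plain atomic_long_t counters and omits the NMI-safe accessor variants, so treat it as illustrative rather than the upstream implementation.

// Hedged sketch of the lock/unlock counter comparison discussed above.
// Field names follow the srcu_data usage in this series; exact counter
// types and wrappers are simplified.
static unsigned long sketch_sum_unlocks(struct srcu_struct *ssp, int idx)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);

		sum += atomic_long_read(&sdp->srcu_unlock_count[idx]);
	}
	return sum;
}

static bool sketch_readers_idle(struct srcu_struct *ssp, int idx)
{
	unsigned long unlocks = sketch_sum_unlocks(ssp, idx);
	unsigned long locks = 0;
	int cpu;

	smp_mb();	// Order the unlock sums before the lock sums.
	for_each_possible_cpu(cpu) {
		struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);

		locks += atomic_long_read(&sdp->srcu_lock_count[idx]);
	}
	return locks == unlocks;	// Equal: no readers at some point.
}
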
@@ -726,7 +761,7 @@ static void srcu_gp_start(struct srcu_struct *ssp)
	int state;

	if (smp_load_acquire(&ssp->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER)
		sdp = per_cpu_ptr(ssp->sda, 0);
		sdp = per_cpu_ptr(ssp->sda, get_boot_cpu_id());
	else
		sdp = this_cpu_ptr(ssp->sda);
	lockdep_assert_held(&ACCESS_PRIVATE(ssp, lock));
@@ -837,7 +872,8 @@ static void srcu_gp_end(struct srcu_struct *ssp)
	/* Initiate callback invocation as needed. */
	ss_state = smp_load_acquire(&ssp->srcu_size_state);
	if (ss_state < SRCU_SIZE_WAIT_BARRIER) {
		srcu_schedule_cbs_sdp(per_cpu_ptr(ssp->sda, 0), cbdelay);
		srcu_schedule_cbs_sdp(per_cpu_ptr(ssp->sda, get_boot_cpu_id()),
				      cbdelay);
	} else {
		idx = rcu_seq_ctr(gpseq) % ARRAY_SIZE(snp->srcu_have_cbs);
		srcu_for_each_node_breadth_first(ssp, snp) {
@@ -914,7 +950,7 @@ static void srcu_funnel_exp_start(struct srcu_struct *ssp, struct srcu_node *snp
	if (snp)
		for (; snp != NULL; snp = snp->srcu_parent) {
			sgsne = READ_ONCE(snp->srcu_gp_seq_needed_exp);
			if (rcu_seq_done(&ssp->srcu_gp_seq, s) ||
			if (WARN_ON_ONCE(rcu_seq_done(&ssp->srcu_gp_seq, s)) ||
			    (!srcu_invl_snp_seq(sgsne) && ULONG_CMP_GE(sgsne, s)))
				return;
			spin_lock_irqsave_rcu_node(snp, flags);
@@ -941,6 +977,9 @@ static void srcu_funnel_exp_start(struct srcu_struct *ssp, struct srcu_node *snp
 *
 * Note that this function also does the work of srcu_funnel_exp_start(),
 * in some cases by directly invoking it.
 *
 * The srcu read lock should be held around this function. And s is a seq snap
 * taken after holding that lock.
 */
||||
static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
|
||||
unsigned long s, bool do_norm)
|
||||
@ -961,7 +1000,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
|
||||
if (snp_leaf)
|
||||
/* Each pass through the loop does one level of the srcu_node tree. */
|
||||
for (snp = snp_leaf; snp != NULL; snp = snp->srcu_parent) {
|
||||
if (rcu_seq_done(&ssp->srcu_gp_seq, s) && snp != snp_leaf)
|
||||
if (WARN_ON_ONCE(rcu_seq_done(&ssp->srcu_gp_seq, s)) && snp != snp_leaf)
|
||||
return; /* GP already done and CBs recorded. */
|
||||
spin_lock_irqsave_rcu_node(snp, flags);
|
||||
snp_seq = snp->srcu_have_cbs[idx];
|
||||
@ -998,8 +1037,8 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
|
||||
if (!do_norm && ULONG_CMP_LT(ssp->srcu_gp_seq_needed_exp, s))
|
||||
WRITE_ONCE(ssp->srcu_gp_seq_needed_exp, s);
|
||||
|
||||
/* If grace period not already done and none in progress, start it. */
|
||||
if (!rcu_seq_done(&ssp->srcu_gp_seq, s) &&
|
||||
/* If grace period not already in progress, start it. */
|
||||
if (!WARN_ON_ONCE(rcu_seq_done(&ssp->srcu_gp_seq, s)) &&
|
||||
rcu_seq_state(ssp->srcu_gp_seq) == SRCU_STATE_IDLE) {
|
||||
WARN_ON_ONCE(ULONG_CMP_GE(ssp->srcu_gp_seq, ssp->srcu_gp_seq_needed));
|
||||
srcu_gp_start(ssp);
|
||||
@ -1059,10 +1098,11 @@ static void srcu_flip(struct srcu_struct *ssp)
|
||||
|
||||
/*
|
||||
* Ensure that if the updater misses an __srcu_read_unlock()
|
||||
* increment, that task's next __srcu_read_lock() will see the
|
||||
* above counter update. Note that both this memory barrier
|
||||
* and the one in srcu_readers_active_idx_check() provide the
|
||||
* guarantee for __srcu_read_lock().
|
||||
* increment, that task's __srcu_read_lock() following its next
|
||||
* __srcu_read_lock() or __srcu_read_unlock() will see the above
|
||||
* counter update. Note that both this memory barrier and the
|
||||
* one in srcu_readers_active_idx_check() provide the guarantee
|
||||
* for __srcu_read_lock().
|
||||
*/
|
||||
smp_mb(); /* D */ /* Pairs with C. */
|
||||
}
|
||||
@ -1161,7 +1201,7 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
|
||||
idx = __srcu_read_lock_nmisafe(ssp);
|
||||
ss_state = smp_load_acquire(&ssp->srcu_size_state);
|
||||
if (ss_state < SRCU_SIZE_WAIT_CALL)
|
||||
sdp = per_cpu_ptr(ssp->sda, 0);
|
||||
sdp = per_cpu_ptr(ssp->sda, get_boot_cpu_id());
|
||||
else
|
||||
sdp = raw_cpu_ptr(ssp->sda);
|
||||
spin_lock_irqsave_sdp_contention(sdp, &flags);
|
||||
@ -1497,7 +1537,7 @@ void srcu_barrier(struct srcu_struct *ssp)
|
||||
|
||||
idx = __srcu_read_lock_nmisafe(ssp);
|
||||
if (smp_load_acquire(&ssp->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER)
|
||||
srcu_barrier_one_cpu(ssp, per_cpu_ptr(ssp->sda, 0));
|
||||
srcu_barrier_one_cpu(ssp, per_cpu_ptr(ssp->sda, get_boot_cpu_id()));
|
||||
else
|
||||
for_each_possible_cpu(cpu)
|
||||
srcu_barrier_one_cpu(ssp, per_cpu_ptr(ssp->sda, cpu));
|
||||
|
@ -384,6 +384,7 @@ static int rcu_tasks_need_gpcb(struct rcu_tasks *rtp)
|
||||
{
|
||||
int cpu;
|
||||
unsigned long flags;
|
||||
bool gpdone = poll_state_synchronize_rcu(rtp->percpu_dequeue_gpseq);
|
||||
long n;
|
||||
long ncbs = 0;
|
||||
long ncbsnz = 0;
|
||||
@ -425,21 +426,23 @@ static int rcu_tasks_need_gpcb(struct rcu_tasks *rtp)
|
||||
WRITE_ONCE(rtp->percpu_enqueue_shift, order_base_2(nr_cpu_ids));
|
||||
smp_store_release(&rtp->percpu_enqueue_lim, 1);
|
||||
rtp->percpu_dequeue_gpseq = get_state_synchronize_rcu();
|
||||
gpdone = false;
|
||||
pr_info("Starting switch %s to CPU-0 callback queuing.\n", rtp->name);
|
||||
}
|
||||
raw_spin_unlock_irqrestore(&rtp->cbs_gbl_lock, flags);
|
||||
}
|
||||
if (rcu_task_cb_adjust && !ncbsnz &&
|
||||
poll_state_synchronize_rcu(rtp->percpu_dequeue_gpseq)) {
|
||||
if (rcu_task_cb_adjust && !ncbsnz && gpdone) {
|
||||
raw_spin_lock_irqsave(&rtp->cbs_gbl_lock, flags);
|
||||
if (rtp->percpu_enqueue_lim < rtp->percpu_dequeue_lim) {
|
||||
WRITE_ONCE(rtp->percpu_dequeue_lim, 1);
|
||||
pr_info("Completing switch %s to CPU-0 callback queuing.\n", rtp->name);
|
||||
}
|
||||
for (cpu = rtp->percpu_dequeue_lim; cpu < nr_cpu_ids; cpu++) {
|
||||
struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rtp->rtpcpu, cpu);
|
||||
if (rtp->percpu_dequeue_lim == 1) {
|
||||
for (cpu = rtp->percpu_dequeue_lim; cpu < nr_cpu_ids; cpu++) {
|
||||
struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rtp->rtpcpu, cpu);
|
||||
|
||||
WARN_ON_ONCE(rcu_segcblist_n_cbs(&rtpcp->cblist));
|
||||
WARN_ON_ONCE(rcu_segcblist_n_cbs(&rtpcp->cblist));
|
||||
}
|
||||
}
|
||||
raw_spin_unlock_irqrestore(&rtp->cbs_gbl_lock, flags);
|
||||
}
|
||||
@ -560,8 +563,9 @@ static int __noreturn rcu_tasks_kthread(void *arg)
|
||||
static void synchronize_rcu_tasks_generic(struct rcu_tasks *rtp)
|
||||
{
|
||||
/* Complain if the scheduler has not started. */
|
||||
WARN_ONCE(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
|
||||
"synchronize_rcu_tasks called too soon");
|
||||
if (WARN_ONCE(rcu_scheduler_active == RCU_SCHEDULER_INACTIVE,
|
||||
"synchronize_%s() called too soon", rtp->name))
|
||||
return;
|
||||
|
||||
// If the grace-period kthread is running, use it.
|
||||
if (READ_ONCE(rtp->kthread_ptr)) {
|
||||
@@ -827,11 +831,21 @@ static void rcu_tasks_pertask(struct task_struct *t, struct list_head *hop)
static void rcu_tasks_postscan(struct list_head *hop)
{
	/*
	 * Wait for tasks that are in the process of exiting. This
	 * does only part of the job, ensuring that all tasks that were
	 * previously exiting reach the point where they have disabled
	 * preemption, allowing the later synchronize_rcu() to finish
	 * the job.
	 * Exiting tasks may escape the tasklist scan. Those are vulnerable
	 * until their final schedule() with TASK_DEAD state. To cope with
	 * this, divide the fragile exit path part in two intersecting
	 * read side critical sections:
	 *
	 * 1) An _SRCU_ read side starting before calling exit_notify(),
	 *    which may remove the task from the tasklist, and ending after
	 *    the final preempt_disable() call in do_exit().
	 *
	 * 2) An _RCU_ read side starting with the final preempt_disable()
	 *    call in do_exit() and ending with the final call to schedule()
	 *    with TASK_DEAD state.
	 *
	 * This handles the part 1). And postgp will handle part 2) with a
	 * call to synchronize_rcu().
	 */
	synchronize_srcu(&tasks_rcu_exit_srcu);
}
@@ -898,7 +912,10 @@ static void rcu_tasks_postgp(struct rcu_tasks *rtp)
	 *
	 * In addition, this synchronize_rcu() waits for exiting tasks
	 * to complete their final preempt_disable() region of execution,
	 * cleaning up after the synchronize_srcu() above.
	 * cleaning up after synchronize_srcu(&tasks_rcu_exit_srcu),
	 * enforcing the whole region before tasklist removal until
	 * the final schedule() with TASK_DEAD state to be an RCU TASKS
	 * read side critical section.
	 */
	synchronize_rcu();
}
@@ -988,27 +1005,42 @@ void show_rcu_tasks_classic_gp_kthread(void)
EXPORT_SYMBOL_GPL(show_rcu_tasks_classic_gp_kthread);
#endif // !defined(CONFIG_TINY_RCU)

/* Do the srcu_read_lock() for the above synchronize_srcu(). */
/*
 * Contribute to protect against tasklist scan blind spot while the
 * task is exiting and may be removed from the tasklist. See
 * corresponding synchronize_srcu() for further details.
 */
void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
{
	preempt_disable();
	current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
	preempt_enable();
}

/* Do the srcu_read_unlock() for the above synchronize_srcu(). */
void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
/*
 * Contribute to protect against tasklist scan blind spot while the
 * task is exiting and may be removed from the tasklist. See
 * corresponding synchronize_srcu() for further details.
 */
void exit_tasks_rcu_stop(void) __releases(&tasks_rcu_exit_srcu)
{
	struct task_struct *t = current;

	preempt_disable();
	__srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
	preempt_enable();
	exit_tasks_rcu_finish_trace(t);
}

/*
 * Contribute to protect against tasklist scan blind spot while the
 * task is exiting and may be removed from the tasklist. See
 * corresponding synchronize_srcu() for further details.
 */
void exit_tasks_rcu_finish(void)
{
	exit_tasks_rcu_stop();
	exit_tasks_rcu_finish_trace(current);
}

#else /* #ifdef CONFIG_TASKS_RCU */
void exit_tasks_rcu_start(void) { }
void exit_tasks_rcu_stop(void) { }
void exit_tasks_rcu_finish(void) { exit_tasks_rcu_finish_trace(current); }
#endif /* #else #ifdef CONFIG_TASKS_RCU */

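
The helpers above are meant to be called from the exit path itself; those call sites are not part of this hunk. A heavily hedged sketch of the intended ordering, matching the two intersecting read-side critical sections described in rcu_tasks_postscan() above (the real placement in do_exit() may differ):

// Hedged sketch of how the helpers above bracket an exiting task; the
// surrounding exit-path code is illustrative only.
void sketch_exit_path(void)
{
	exit_tasks_rcu_start();	/* Part 1 (SRCU) opens before exit_notify(). */

	/* ... exit_notify() may remove the task from the tasklist ... */

	preempt_disable();	/* Part 2 (RCU) opens here; part 1 still open. */
	exit_tasks_rcu_stop();	/* Part 1 closes inside the overlap. */

	/* ... the final schedule() with TASK_DEAD state closes part 2 ... */
}
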
@ -1036,9 +1068,6 @@ static void rcu_tasks_be_rude(struct work_struct *work)
|
||||
// Wait for one rude RCU-tasks grace period.
|
||||
static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
|
||||
{
|
||||
if (num_online_cpus() <= 1)
|
||||
return; // Fastpath for only one CPU.
|
||||
|
||||
rtp->n_ipis += cpumask_weight(cpu_online_mask);
|
||||
schedule_on_each_cpu(rcu_tasks_be_rude);
|
||||
}
|
||||
@ -1815,23 +1844,21 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
|
||||
|
||||
static void rcu_tasks_initiate_self_tests(void)
|
||||
{
|
||||
unsigned long j = jiffies;
|
||||
|
||||
pr_info("Running RCU-tasks wait API self tests\n");
|
||||
#ifdef CONFIG_TASKS_RCU
|
||||
tests[0].runstart = j;
|
||||
tests[0].runstart = jiffies;
|
||||
synchronize_rcu_tasks();
|
||||
call_rcu_tasks(&tests[0].rh, test_rcu_tasks_callback);
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_TASKS_RUDE_RCU
|
||||
tests[1].runstart = j;
|
||||
tests[1].runstart = jiffies;
|
||||
synchronize_rcu_tasks_rude();
|
||||
call_rcu_tasks_rude(&tests[1].rh, test_rcu_tasks_callback);
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_TASKS_TRACE_RCU
|
||||
tests[2].runstart = j;
|
||||
tests[2].runstart = jiffies;
|
||||
synchronize_rcu_tasks_trace();
|
||||
call_rcu_tasks_trace(&tests[2].rh, test_rcu_tasks_callback);
|
||||
#endif
|
||||
|
@ -246,15 +246,12 @@ bool poll_state_synchronize_rcu(unsigned long oldstate)
|
||||
EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu);
|
||||
|
||||
#ifdef CONFIG_KASAN_GENERIC
|
||||
void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
|
||||
void kvfree_call_rcu(struct rcu_head *head, void *ptr)
|
||||
{
|
||||
if (head) {
|
||||
void *ptr = (void *) head - (unsigned long) func;
|
||||
|
||||
if (head)
|
||||
kasan_record_aux_stack_noalloc(ptr);
|
||||
}
|
||||
|
||||
__kvfree_call_rcu(head, func);
|
||||
__kvfree_call_rcu(head, ptr);
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(kvfree_call_rcu);
|
||||
#endif
|
||||
|
@ -144,14 +144,16 @@ static int rcu_scheduler_fully_active __read_mostly;
|
||||
|
||||
static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp,
|
||||
unsigned long gps, unsigned long flags);
|
||||
static void rcu_init_new_rnp(struct rcu_node *rnp_leaf);
|
||||
static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf);
|
||||
static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu);
|
||||
static void invoke_rcu_core(void);
|
||||
static void rcu_report_exp_rdp(struct rcu_data *rdp);
|
||||
static void sync_sched_exp_online_cleanup(int cpu);
|
||||
static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
|
||||
static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
|
||||
static bool rcu_rdp_cpu_online(struct rcu_data *rdp);
|
||||
static bool rcu_init_invoked(void);
|
||||
static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf);
|
||||
static void rcu_init_new_rnp(struct rcu_node *rnp_leaf);
|
||||
|
||||
/*
|
||||
* rcuc/rcub/rcuop kthread realtime priority. The "rcuop"
|
||||
@ -214,27 +216,6 @@ EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio);
|
||||
*/
|
||||
#define PER_RCU_NODE_PERIOD 3 /* Number of grace periods between delays for debugging. */
|
||||
|
||||
/*
|
||||
* Compute the mask of online CPUs for the specified rcu_node structure.
|
||||
* This will not be stable unless the rcu_node structure's ->lock is
|
||||
* held, but the bit corresponding to the current CPU will be stable
|
||||
* in most contexts.
|
||||
*/
|
||||
static unsigned long rcu_rnp_online_cpus(struct rcu_node *rnp)
|
||||
{
|
||||
return READ_ONCE(rnp->qsmaskinitnext);
|
||||
}
|
||||
|
||||
/*
|
||||
* Is the CPU corresponding to the specified rcu_data structure online
|
||||
* from RCU's perspective? This perspective is given by that structure's
|
||||
* ->qsmaskinitnext field rather than by the global cpu_online_mask.
|
||||
*/
|
||||
static bool rcu_rdp_cpu_online(struct rcu_data *rdp)
|
||||
{
|
||||
return !!(rdp->grpmask & rcu_rnp_online_cpus(rdp->mynode));
|
||||
}
|
||||
|
||||
/*
|
||||
* Return true if an RCU grace period is in progress. The READ_ONCE()s
|
||||
* permit this function to be invoked without holding the root rcu_node
|
||||
@ -734,46 +715,6 @@ void rcu_request_urgent_qs_task(struct task_struct *t)
|
||||
smp_store_release(per_cpu_ptr(&rcu_data.rcu_urgent_qs, cpu), true);
|
||||
}
|
||||
|
||||
#if defined(CONFIG_PROVE_RCU) && defined(CONFIG_HOTPLUG_CPU)
|
||||
|
||||
/*
|
||||
* Is the current CPU online as far as RCU is concerned?
|
||||
*
|
||||
* Disable preemption to avoid false positives that could otherwise
|
||||
* happen due to the current CPU number being sampled, this task being
|
||||
* preempted, its old CPU being taken offline, resuming on some other CPU,
|
||||
* then determining that its old CPU is now offline.
|
||||
*
|
||||
* Disable checking if in an NMI handler because we cannot safely
|
||||
* report errors from NMI handlers anyway. In addition, it is OK to use
|
||||
* RCU on an offline processor during initial boot, hence the check for
|
||||
* rcu_scheduler_fully_active.
|
||||
*/
|
||||
bool rcu_lockdep_current_cpu_online(void)
|
||||
{
|
||||
struct rcu_data *rdp;
|
||||
bool ret = false;
|
||||
|
||||
if (in_nmi() || !rcu_scheduler_fully_active)
|
||||
return true;
|
||||
preempt_disable_notrace();
|
||||
rdp = this_cpu_ptr(&rcu_data);
|
||||
/*
|
||||
* Strictly, we care here about the case where the current CPU is
|
||||
* in rcu_cpu_starting() and thus has an excuse for rdp->grpmask
|
||||
* not being up to date. So arch_spin_is_locked() might have a
|
||||
* false positive if it's held by some *other* CPU, but that's
|
||||
* OK because that just means a false *negative* on the warning.
|
||||
*/
|
||||
if (rcu_rdp_cpu_online(rdp) || arch_spin_is_locked(&rcu_state.ofl_lock))
|
||||
ret = true;
|
||||
preempt_enable_notrace();
|
||||
return ret;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(rcu_lockdep_current_cpu_online);
|
||||
|
||||
#endif /* #if defined(CONFIG_PROVE_RCU) && defined(CONFIG_HOTPLUG_CPU) */
|
||||
|
||||
/*
|
||||
* When trying to report a quiescent state on behalf of some other CPU,
|
||||
* it is our responsibility to check for and handle potential overflow
|
||||
@ -925,6 +866,24 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
|
||||
rdp->rcu_iw_gp_seq = rnp->gp_seq;
|
||||
irq_work_queue_on(&rdp->rcu_iw, rdp->cpu);
|
||||
}
|
||||
|
||||
if (rcu_cpu_stall_cputime && rdp->snap_record.gp_seq != rdp->gp_seq) {
|
||||
int cpu = rdp->cpu;
|
||||
struct rcu_snap_record *rsrp;
|
||||
struct kernel_cpustat *kcsp;
|
||||
|
||||
kcsp = &kcpustat_cpu(cpu);
|
||||
|
||||
rsrp = &rdp->snap_record;
|
||||
rsrp->cputime_irq = kcpustat_field(kcsp, CPUTIME_IRQ, cpu);
|
||||
rsrp->cputime_softirq = kcpustat_field(kcsp, CPUTIME_SOFTIRQ, cpu);
|
||||
rsrp->cputime_system = kcpustat_field(kcsp, CPUTIME_SYSTEM, cpu);
|
||||
rsrp->nr_hardirqs = kstat_cpu_irqs_sum(rdp->cpu);
|
||||
rsrp->nr_softirqs = kstat_cpu_softirqs_sum(rdp->cpu);
|
||||
rsrp->nr_csw = nr_context_switches_cpu(rdp->cpu);
|
||||
rsrp->jiffies = jiffies;
|
||||
rsrp->gp_seq = rdp->gp_seq;
|
||||
}
|
||||
}
|
||||
|
||||
return 0;
|
||||
@ -1350,13 +1309,6 @@ static void rcu_strict_gp_boundary(void *unused)
|
||||
invoke_rcu_core();
|
||||
}
|
||||
|
||||
// Has rcu_init() been invoked? This is used (for example) to determine
|
||||
// whether spinlocks may be acquired safely.
|
||||
static bool rcu_init_invoked(void)
|
||||
{
|
||||
return !!rcu_state.n_online_cpus;
|
||||
}
|
||||
|
||||
// Make the polled API aware of the beginning of a grace period.
|
||||
static void rcu_poll_gp_seq_start(unsigned long *snap)
|
||||
{
|
||||
@ -2091,92 +2043,6 @@ rcu_check_quiescent_state(struct rcu_data *rdp)
|
||||
rcu_report_qs_rdp(rdp);
|
||||
}
|
||||
|
||||
/*
|
||||
* Near the end of the offline process. Trace the fact that this CPU
|
||||
* is going offline.
|
||||
*/
|
||||
int rcutree_dying_cpu(unsigned int cpu)
|
||||
{
|
||||
bool blkd;
|
||||
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
|
||||
struct rcu_node *rnp = rdp->mynode;
|
||||
|
||||
if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
|
||||
return 0;
|
||||
|
||||
blkd = !!(READ_ONCE(rnp->qsmask) & rdp->grpmask);
|
||||
trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq),
|
||||
blkd ? TPS("cpuofl-bgp") : TPS("cpuofl"));
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* All CPUs for the specified rcu_node structure have gone offline,
|
||||
* and all tasks that were preempted within an RCU read-side critical
|
||||
* section while running on one of those CPUs have since exited their RCU
|
||||
* read-side critical section. Some other CPU is reporting this fact with
|
||||
* the specified rcu_node structure's ->lock held and interrupts disabled.
|
||||
* This function therefore goes up the tree of rcu_node structures,
|
||||
* clearing the corresponding bits in the ->qsmaskinit fields. Note that
|
||||
* the leaf rcu_node structure's ->qsmaskinit field has already been
|
||||
* updated.
|
||||
*
|
||||
* This function does check that the specified rcu_node structure has
|
||||
* all CPUs offline and no blocked tasks, so it is OK to invoke it
|
||||
* prematurely. That said, invoking it after the fact will cost you
|
||||
* a needless lock acquisition. So once it has done its work, don't
|
||||
* invoke it again.
|
||||
*/
|
||||
static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
|
||||
{
|
||||
long mask;
|
||||
struct rcu_node *rnp = rnp_leaf;
|
||||
|
||||
raw_lockdep_assert_held_rcu_node(rnp_leaf);
|
||||
if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
|
||||
WARN_ON_ONCE(rnp_leaf->qsmaskinit) ||
|
||||
WARN_ON_ONCE(rcu_preempt_has_tasks(rnp_leaf)))
|
||||
return;
|
||||
for (;;) {
|
||||
mask = rnp->grpmask;
|
||||
rnp = rnp->parent;
|
||||
if (!rnp)
|
||||
break;
|
||||
raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
|
||||
rnp->qsmaskinit &= ~mask;
|
||||
/* Between grace periods, so better already be zero! */
|
||||
WARN_ON_ONCE(rnp->qsmask);
|
||||
if (rnp->qsmaskinit) {
|
||||
raw_spin_unlock_rcu_node(rnp);
|
||||
/* irqs remain disabled. */
|
||||
return;
|
||||
}
|
||||
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* The CPU has been completely removed, and some other CPU is reporting
|
||||
* this fact from process context. Do the remainder of the cleanup.
|
||||
* There can only be one CPU hotplug operation at a time, so no need for
|
||||
* explicit locking.
|
||||
*/
|
||||
int rcutree_dead_cpu(unsigned int cpu)
|
||||
{
|
||||
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
|
||||
struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
|
||||
|
||||
if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
|
||||
return 0;
|
||||
|
||||
WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
|
||||
/* Adjust any no-longer-needed kthreads. */
|
||||
rcu_boost_kthread_setaffinity(rnp, -1);
|
||||
// Stop-machine done, so allow nohz_full to disable tick.
|
||||
tick_dep_clear(TICK_DEP_BIT_RCU);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* Invoke any RCU callbacks that have made it to the end of their grace
|
||||
* period. Throttle as specified by rdp->blimit.
|
||||
@ -2209,7 +2075,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
|
||||
*/
|
||||
rcu_nocb_lock_irqsave(rdp, flags);
|
||||
WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
|
||||
pending = rcu_segcblist_n_cbs(&rdp->cblist);
|
||||
pending = rcu_segcblist_get_seglen(&rdp->cblist, RCU_DONE_TAIL);
|
||||
div = READ_ONCE(rcu_divisor);
|
||||
div = div < 0 ? 7 : div > sizeof(long) * 8 - 2 ? sizeof(long) * 8 - 2 : div;
|
||||
bl = max(rdp->blimit, pending >> div);
|
||||
@ -2727,10 +2593,11 @@ static void check_cb_ovld(struct rcu_data *rdp)
|
||||
}
|
||||
|
||||
static void
|
||||
__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
|
||||
__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy_in)
|
||||
{
|
||||
static atomic_t doublefrees;
|
||||
unsigned long flags;
|
||||
bool lazy;
|
||||
struct rcu_data *rdp;
|
||||
bool was_alldone;
|
||||
|
||||
@ -2755,6 +2622,7 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
|
||||
kasan_record_aux_stack_noalloc(head);
|
||||
local_irq_save(flags);
|
||||
rdp = this_cpu_ptr(&rcu_data);
|
||||
lazy = lazy_in && !rcu_async_should_hurry();
|
||||
|
||||
/* Add the callback to our list. */
|
||||
if (unlikely(!rcu_segcblist_is_enabled(&rdp->cblist))) {
|
||||
@ -2876,13 +2744,15 @@ EXPORT_SYMBOL_GPL(call_rcu);
|
||||
|
||||
/**
|
||||
* struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
|
||||
* @list: List node. All blocks are linked between each other
|
||||
* @gp_snap: Snapshot of RCU state for objects placed to this bulk
|
||||
* @nr_records: Number of active pointers in the array
|
||||
* @next: Next bulk object in the block chain
|
||||
* @records: Array of the kvfree_rcu() pointers
|
||||
*/
|
||||
struct kvfree_rcu_bulk_data {
|
||||
struct list_head list;
|
||||
unsigned long gp_snap;
|
||||
unsigned long nr_records;
|
||||
struct kvfree_rcu_bulk_data *next;
|
||||
void *records[];
|
||||
};
|
||||
|
||||
@ -2898,26 +2768,28 @@ struct kvfree_rcu_bulk_data {
|
||||
* struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
|
||||
* @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
|
||||
* @head_free: List of kfree_rcu() objects waiting for a grace period
|
||||
* @bkvhead_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
|
||||
* @bulk_head_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
|
||||
* @krcp: Pointer to @kfree_rcu_cpu structure
|
||||
*/
|
||||
|
||||
struct kfree_rcu_cpu_work {
|
||||
struct rcu_work rcu_work;
|
||||
struct rcu_head *head_free;
|
||||
struct kvfree_rcu_bulk_data *bkvhead_free[FREE_N_CHANNELS];
|
||||
struct list_head bulk_head_free[FREE_N_CHANNELS];
|
||||
struct kfree_rcu_cpu *krcp;
|
||||
};
|
||||
|
||||
/**
|
||||
* struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
|
||||
* @head: List of kfree_rcu() objects not yet waiting for a grace period
|
||||
* @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
|
||||
* @head_gp_snap: Snapshot of RCU state for objects placed to "@head"
|
||||
* @bulk_head: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
|
||||
* @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
|
||||
* @lock: Synchronize access to this structure
|
||||
* @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
|
||||
* @initialized: The @rcu_work fields have been initialized
|
||||
* @count: Number of objects for which GP not started
|
||||
* @head_count: Number of objects in rcu_head singular list
|
||||
* @bulk_count: Number of objects in bulk-list
|
||||
* @bkvcache:
|
||||
* A simple cache list that contains objects for reuse purpose.
|
||||
* In order to save some per-cpu space the list is singular.
|
||||
@ -2935,13 +2807,20 @@ struct kfree_rcu_cpu_work {
|
||||
* the interactions with the slab allocators.
|
||||
*/
|
||||
struct kfree_rcu_cpu {
|
||||
// Objects queued on a linked list
|
||||
// through their rcu_head structures.
|
||||
struct rcu_head *head;
|
||||
struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS];
|
||||
unsigned long head_gp_snap;
|
||||
atomic_t head_count;
|
||||
|
||||
// Objects queued on a bulk-list.
|
||||
struct list_head bulk_head[FREE_N_CHANNELS];
|
||||
atomic_t bulk_count[FREE_N_CHANNELS];
|
||||
|
||||
struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
|
||||
raw_spinlock_t lock;
|
||||
struct delayed_work monitor_work;
|
||||
bool initialized;
|
||||
int count;
|
||||
|
||||
struct delayed_work page_cache_work;
|
||||
atomic_t backoff_page_cache_fill;
|
||||
@ -3029,82 +2908,51 @@ drain_page_cache(struct kfree_rcu_cpu *krcp)
|
||||
return freed;
|
||||
}
|
||||
|
||||
/*
|
||||
* This function is invoked in workqueue context after a grace period.
|
||||
* It frees all the objects queued on ->bkvhead_free or ->head_free.
|
||||
*/
|
||||
static void kfree_rcu_work(struct work_struct *work)
|
||||
static void
|
||||
kvfree_rcu_bulk(struct kfree_rcu_cpu *krcp,
|
||||
struct kvfree_rcu_bulk_data *bnode, int idx)
|
||||
{
|
||||
unsigned long flags;
|
||||
struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS], *bnext;
|
||||
struct rcu_head *head, *next;
|
||||
struct kfree_rcu_cpu *krcp;
|
||||
struct kfree_rcu_cpu_work *krwp;
|
||||
int i, j;
|
||||
int i;
|
||||
|
||||
krwp = container_of(to_rcu_work(work),
|
||||
struct kfree_rcu_cpu_work, rcu_work);
|
||||
krcp = krwp->krcp;
|
||||
debug_rcu_bhead_unqueue(bnode);
|
||||
|
||||
raw_spin_lock_irqsave(&krcp->lock, flags);
|
||||
// Channels 1 and 2.
|
||||
for (i = 0; i < FREE_N_CHANNELS; i++) {
|
||||
bkvhead[i] = krwp->bkvhead_free[i];
|
||||
krwp->bkvhead_free[i] = NULL;
|
||||
}
|
||||
rcu_lock_acquire(&rcu_callback_map);
|
||||
if (idx == 0) { // kmalloc() / kfree().
|
||||
trace_rcu_invoke_kfree_bulk_callback(
|
||||
rcu_state.name, bnode->nr_records,
|
||||
bnode->records);
|
||||
|
||||
// Channel 3.
|
||||
head = krwp->head_free;
|
||||
krwp->head_free = NULL;
|
||||
raw_spin_unlock_irqrestore(&krcp->lock, flags);
|
||||
kfree_bulk(bnode->nr_records, bnode->records);
|
||||
} else { // vmalloc() / vfree().
|
||||
for (i = 0; i < bnode->nr_records; i++) {
|
||||
trace_rcu_invoke_kvfree_callback(
|
||||
rcu_state.name, bnode->records[i], 0);
|
||||
|
||||
// Handle the first two channels.
|
||||
for (i = 0; i < FREE_N_CHANNELS; i++) {
|
||||
for (; bkvhead[i]; bkvhead[i] = bnext) {
|
||||
bnext = bkvhead[i]->next;
|
||||
debug_rcu_bhead_unqueue(bkvhead[i]);
|
||||
|
||||
rcu_lock_acquire(&rcu_callback_map);
|
||||
if (i == 0) { // kmalloc() / kfree().
|
||||
trace_rcu_invoke_kfree_bulk_callback(
|
||||
rcu_state.name, bkvhead[i]->nr_records,
|
||||
bkvhead[i]->records);
|
||||
|
||||
kfree_bulk(bkvhead[i]->nr_records,
|
||||
bkvhead[i]->records);
|
||||
} else { // vmalloc() / vfree().
|
||||
for (j = 0; j < bkvhead[i]->nr_records; j++) {
|
||||
trace_rcu_invoke_kvfree_callback(
|
||||
rcu_state.name,
|
||||
bkvhead[i]->records[j], 0);
|
||||
|
||||
vfree(bkvhead[i]->records[j]);
|
||||
}
|
||||
}
|
||||
rcu_lock_release(&rcu_callback_map);
|
||||
|
||||
raw_spin_lock_irqsave(&krcp->lock, flags);
|
||||
if (put_cached_bnode(krcp, bkvhead[i]))
|
||||
bkvhead[i] = NULL;
|
||||
raw_spin_unlock_irqrestore(&krcp->lock, flags);
|
||||
|
||||
if (bkvhead[i])
|
||||
free_page((unsigned long) bkvhead[i]);
|
||||
|
||||
cond_resched_tasks_rcu_qs();
|
||||
vfree(bnode->records[i]);
|
||||
}
|
||||
}
|
||||
rcu_lock_release(&rcu_callback_map);
|
||||
|
||||
raw_spin_lock_irqsave(&krcp->lock, flags);
|
||||
if (put_cached_bnode(krcp, bnode))
|
||||
bnode = NULL;
|
||||
raw_spin_unlock_irqrestore(&krcp->lock, flags);
|
||||
|
||||
if (bnode)
|
||||
free_page((unsigned long) bnode);
|
||||
|
||||
cond_resched_tasks_rcu_qs();
|
||||
}
|
||||
|
||||
static void
|
||||
kvfree_rcu_list(struct rcu_head *head)
|
||||
{
|
||||
struct rcu_head *next;
|
||||
|
||||
/*
|
||||
* This is used when the "bulk" path can not be used for the
|
||||
* double-argument of kvfree_rcu(). This happens when the
|
||||
* page-cache is empty, which means that objects are instead
|
||||
* queued on a linked list through their rcu_head structures.
|
||||
* This list is named "Channel 3".
|
||||
*/
|
||||
for (; head; head = next) {
|
||||
unsigned long offset = (unsigned long)head->func;
|
||||
void *ptr = (void *)head - offset;
|
||||
void *ptr = (void *) head->func;
|
||||
unsigned long offset = (void *) head - ptr;
|
||||
|
||||
next = head->next;
|
||||
debug_rcu_head_unqueue((struct rcu_head *)ptr);
|
||||
@ -3119,16 +2967,72 @@ static void kfree_rcu_work(struct work_struct *work)
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* This function is invoked in workqueue context after a grace period.
|
||||
* It frees all the objects queued on ->bulk_head_free or ->head_free.
|
||||
*/
|
||||
static void kfree_rcu_work(struct work_struct *work)
|
||||
{
|
||||
unsigned long flags;
|
||||
struct kvfree_rcu_bulk_data *bnode, *n;
|
||||
struct list_head bulk_head[FREE_N_CHANNELS];
|
||||
struct rcu_head *head;
|
||||
struct kfree_rcu_cpu *krcp;
|
||||
struct kfree_rcu_cpu_work *krwp;
|
||||
int i;
|
||||
|
||||
krwp = container_of(to_rcu_work(work),
|
||||
struct kfree_rcu_cpu_work, rcu_work);
|
||||
krcp = krwp->krcp;
|
||||
|
||||
raw_spin_lock_irqsave(&krcp->lock, flags);
|
||||
// Channels 1 and 2.
|
||||
for (i = 0; i < FREE_N_CHANNELS; i++)
|
||||
list_replace_init(&krwp->bulk_head_free[i], &bulk_head[i]);
|
||||
|
||||
// Channel 3.
|
||||
head = krwp->head_free;
|
||||
krwp->head_free = NULL;
|
||||
raw_spin_unlock_irqrestore(&krcp->lock, flags);
|
||||
|
||||
// Handle the first two channels.
|
||||
for (i = 0; i < FREE_N_CHANNELS; i++) {
|
||||
// Start from the tail page, so a GP is likely passed for it.
|
||||
list_for_each_entry_safe(bnode, n, &bulk_head[i], list)
|
||||
kvfree_rcu_bulk(krcp, bnode, i);
|
||||
}
|
||||
|
||||
/*
|
||||
* This is used when the "bulk" path can not be used for the
|
||||
* double-argument of kvfree_rcu(). This happens when the
|
||||
* page-cache is empty, which means that objects are instead
|
||||
* queued on a linked list through their rcu_head structures.
|
||||
* This list is named "Channel 3".
|
||||
*/
|
||||
kvfree_rcu_list(head);
|
||||
}
|
||||
|
||||
static bool
|
||||
need_offload_krc(struct kfree_rcu_cpu *krcp)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; i < FREE_N_CHANNELS; i++)
|
||||
if (krcp->bkvhead[i])
|
||||
if (!list_empty(&krcp->bulk_head[i]))
|
||||
return true;
|
||||
|
||||
return !!krcp->head;
|
||||
return !!READ_ONCE(krcp->head);
|
||||
}
|
||||
|
||||
static int krc_count(struct kfree_rcu_cpu *krcp)
|
||||
{
|
||||
int sum = atomic_read(&krcp->head_count);
|
||||
int i;
|
||||
|
||||
for (i = 0; i < FREE_N_CHANNELS; i++)
|
||||
sum += atomic_read(&krcp->bulk_count[i]);
|
||||
|
||||
return sum;
|
||||
}
|
||||
|
||||
static void
|
||||
@ -3136,7 +3040,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
|
||||
{
|
||||
long delay, delay_left;
|
||||
|
||||
delay = READ_ONCE(krcp->count) >= KVFREE_BULK_MAX_ENTR ? 1:KFREE_DRAIN_JIFFIES;
|
||||
delay = krc_count(krcp) >= KVFREE_BULK_MAX_ENTR ? 1:KFREE_DRAIN_JIFFIES;
|
||||
if (delayed_work_pending(&krcp->monitor_work)) {
|
||||
delay_left = krcp->monitor_work.timer.expires - jiffies;
|
||||
if (delay < delay_left)
|
||||
@ -3146,6 +3050,44 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
|
||||
queue_delayed_work(system_wq, &krcp->monitor_work, delay);
|
||||
}
|
||||
|
||||
static void
|
||||
kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
|
||||
{
|
||||
struct list_head bulk_ready[FREE_N_CHANNELS];
|
||||
struct kvfree_rcu_bulk_data *bnode, *n;
|
||||
struct rcu_head *head_ready = NULL;
|
||||
unsigned long flags;
|
||||
int i;
|
||||
|
||||
raw_spin_lock_irqsave(&krcp->lock, flags);
|
||||
for (i = 0; i < FREE_N_CHANNELS; i++) {
|
||||
INIT_LIST_HEAD(&bulk_ready[i]);
|
||||
|
||||
list_for_each_entry_safe_reverse(bnode, n, &krcp->bulk_head[i], list) {
|
||||
if (!poll_state_synchronize_rcu(bnode->gp_snap))
|
||||
break;
|
||||
|
||||
atomic_sub(bnode->nr_records, &krcp->bulk_count[i]);
|
||||
list_move(&bnode->list, &bulk_ready[i]);
|
||||
}
|
||||
}
|
||||
|
||||
if (krcp->head && poll_state_synchronize_rcu(krcp->head_gp_snap)) {
|
||||
head_ready = krcp->head;
|
||||
atomic_set(&krcp->head_count, 0);
|
||||
WRITE_ONCE(krcp->head, NULL);
|
||||
}
|
||||
raw_spin_unlock_irqrestore(&krcp->lock, flags);
|
||||
|
||||
for (i = 0; i < FREE_N_CHANNELS; i++) {
|
||||
list_for_each_entry_safe(bnode, n, &bulk_ready[i], list)
|
||||
kvfree_rcu_bulk(krcp, bnode, i);
|
||||
}
|
||||
|
||||
if (head_ready)
|
||||
kvfree_rcu_list(head_ready);
|
||||
}
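
kvfree_rcu_drain_ready() above is a direct application of the polled grace-period API: each queued page or head carries a cookie from get_state_synchronize_rcu(), and the drain pass frees an entry as soon as poll_state_synchronize_rcu() reports that a full grace period has already elapsed, without queueing a callback at all. A self-contained sketch of that polling pattern, with hypothetical my_* names around the real RCU calls:

// Hedged sketch of the polled grace-period pattern used above; my_obj and
// the my_*() helpers are hypothetical, the RCU calls are the existing API.
struct my_obj {
	struct list_head list;
	unsigned long gp_snap;	// Cookie from get_state_synchronize_rcu().
};

static void my_queue(struct list_head *pending, struct my_obj *p)
{
	p->gp_snap = get_state_synchronize_rcu();	// Snapshot at queue time.
	list_add_tail(&p->list, pending);		// Oldest entries at the head.
}

static void my_drain_ready(struct list_head *pending)
{
	struct my_obj *p, *n;

	list_for_each_entry_safe(p, n, pending, list) {
		if (!poll_state_synchronize_rcu(p->gp_snap))
			break;		// Entries past this one are newer still.
		list_del(&p->list);
		kfree(p);		// A grace period already elapsed; free now.
	}
}
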
|
||||
|
||||
/*
|
||||
* This function is invoked after the KFREE_DRAIN_JIFFIES timeout.
|
||||
*/
|
||||
@ -3156,26 +3098,31 @@ static void kfree_rcu_monitor(struct work_struct *work)
|
||||
unsigned long flags;
|
||||
int i, j;
|
||||
|
||||
// Drain ready for reclaim.
|
||||
kvfree_rcu_drain_ready(krcp);
|
||||
|
||||
raw_spin_lock_irqsave(&krcp->lock, flags);
|
||||
|
||||
// Attempt to start a new batch.
|
||||
for (i = 0; i < KFREE_N_BATCHES; i++) {
|
||||
struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
|
||||
|
||||
// Try to detach bkvhead or head and attach it over any
|
||||
// Try to detach bulk_head or head and attach it over any
|
||||
// available corresponding free channel. It can be that
|
||||
// a previous RCU batch is in progress, it means that
|
||||
// immediately to queue another one is not possible so
|
||||
// in that case the monitor work is rearmed.
|
||||
if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
|
||||
(krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
|
||||
(krcp->head && !krwp->head_free)) {
|
||||
if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
|
||||
(!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
|
||||
(READ_ONCE(krcp->head) && !krwp->head_free)) {
|
||||
|
||||
// Channel 1 corresponds to the SLAB-pointer bulk path.
|
||||
// Channel 2 corresponds to vmalloc-pointer bulk path.
|
||||
for (j = 0; j < FREE_N_CHANNELS; j++) {
|
||||
if (!krwp->bkvhead_free[j]) {
|
||||
krwp->bkvhead_free[j] = krcp->bkvhead[j];
|
||||
krcp->bkvhead[j] = NULL;
|
||||
if (list_empty(&krwp->bulk_head_free[j])) {
|
||||
atomic_set(&krcp->bulk_count[j], 0);
|
||||
list_replace_init(&krcp->bulk_head[j],
|
||||
&krwp->bulk_head_free[j]);
|
||||
}
|
||||
}
|
||||
|
||||
@ -3183,11 +3130,10 @@ static void kfree_rcu_monitor(struct work_struct *work)
|
||||
// objects queued on the linked list.
|
||||
if (!krwp->head_free) {
|
||||
krwp->head_free = krcp->head;
|
||||
krcp->head = NULL;
|
||||
atomic_set(&krcp->head_count, 0);
|
||||
WRITE_ONCE(krcp->head, NULL);
|
||||
}
|
||||
|
||||
WRITE_ONCE(krcp->count, 0);
|
||||
|
||||
// One work is per one batch, so there are three
|
||||
// "free channels", the batch can handle. It can
|
||||
// be that the work is in the pending state when
|
||||
@ -3197,6 +3143,8 @@ static void kfree_rcu_monitor(struct work_struct *work)
|
||||
}
|
||||
}
|
||||
|
||||
raw_spin_unlock_irqrestore(&krcp->lock, flags);
|
||||
|
||||
// If there is nothing to detach, it means that our job is
|
||||
// successfully done here. In case of having at least one
|
||||
// of the channels that is still busy we should rearm the
|
||||
@ -3204,8 +3152,6 @@ static void kfree_rcu_monitor(struct work_struct *work)
|
||||
// still in progress.
|
||||
if (need_offload_krc(krcp))
|
||||
schedule_delayed_monitor_work(krcp);
|
||||
|
||||
raw_spin_unlock_irqrestore(&krcp->lock, flags);
|
||||
}
|
||||
|
||||
static enum hrtimer_restart
|
||||
@ -3288,10 +3234,11 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
|
||||
return false;
|
||||
|
||||
idx = !!is_vmalloc_addr(ptr);
|
||||
bnode = list_first_entry_or_null(&(*krcp)->bulk_head[idx],
|
||||
struct kvfree_rcu_bulk_data, list);
|
||||
|
||||
/* Check if a new block is required. */
|
||||
if (!(*krcp)->bkvhead[idx] ||
|
||||
(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
|
||||
if (!bnode || bnode->nr_records == KVFREE_BULK_MAX_ENTR) {
|
||||
bnode = get_cached_bnode(*krcp);
|
||||
if (!bnode && can_alloc) {
|
||||
krc_this_cpu_unlock(*krcp, *flags);
|
||||
@ -3315,17 +3262,15 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
|
||||
if (!bnode)
|
||||
return false;
|
||||
|
||||
/* Initialize the new block. */
|
||||
// Initialize the new block and attach it.
|
||||
bnode->nr_records = 0;
|
||||
bnode->next = (*krcp)->bkvhead[idx];
|
||||
|
||||
/* Attach it to the head. */
|
||||
(*krcp)->bkvhead[idx] = bnode;
|
||||
list_add(&bnode->list, &(*krcp)->bulk_head[idx]);
|
||||
}
|
||||
|
||||
/* Finally insert. */
|
||||
(*krcp)->bkvhead[idx]->records
|
||||
[(*krcp)->bkvhead[idx]->nr_records++] = ptr;
|
||||
// Finally insert and update the GP for this page.
|
||||
bnode->records[bnode->nr_records++] = ptr;
|
||||
bnode->gp_snap = get_state_synchronize_rcu();
|
||||
atomic_inc(&(*krcp)->bulk_count[idx]);
|
||||
|
||||
return true;
|
||||
}
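
add_ptr_to_bulk_krc_lock() above is the fast path behind the double-argument kvfree_rcu(): the pointer is stashed in a page-sized block together with a grace-period cookie for that block. From a caller's point of view the machinery stays hidden; a hedged usage sketch follows (struct my_node is hypothetical, while kvfree_rcu() and the _mightsleep() form introduced by this series are the kernel API):

// Hedged usage sketch for the machinery above.
struct my_node {
	int key;
	struct rcu_head rh;	// Needed by the double-argument form.
};

static void my_remove(struct my_node *p)
{
	// Double-argument form: never sleeps and uses p->rh, so it may be
	// called with locks held. The object is freed after a grace period.
	kvfree_rcu(p, rh);
}

static void my_remove_headless(void *p)
{
	// Single-argument (head-less) form: may block to allocate the
	// tracking page, so sleepable context only.
	kvfree_rcu_mightsleep(p);
}
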
|
||||
@ -3342,26 +3287,21 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
|
||||
* be free'd in workqueue context. This allows us to: batch requests together to
|
||||
* reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
|
||||
*/
|
||||
void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
|
||||
void kvfree_call_rcu(struct rcu_head *head, void *ptr)
|
||||
{
|
||||
unsigned long flags;
|
||||
struct kfree_rcu_cpu *krcp;
|
||||
bool success;
|
||||
void *ptr;
|
||||
|
||||
if (head) {
|
||||
ptr = (void *) head - (unsigned long) func;
|
||||
} else {
|
||||
/*
|
||||
* Please note there is a limitation for the head-less
|
||||
* variant, that is why there is a clear rule for such
|
||||
* objects: it can be used from might_sleep() context
|
||||
* only. For other places please embed an rcu_head to
|
||||
* your data.
|
||||
*/
|
||||
/*
|
||||
* Please note there is a limitation for the head-less
|
||||
* variant, that is why there is a clear rule for such
|
||||
* objects: it can be used from might_sleep() context
|
||||
* only. For other places please embed an rcu_head to
|
||||
* your data.
|
||||
*/
|
||||
if (!head)
|
||||
might_sleep();
|
||||
ptr = (unsigned long *) func;
|
||||
}
|
||||
|
||||
// Queue the object but don't yet schedule the batch.
|
||||
if (debug_rcu_head_queue(ptr)) {
|
||||
@ -3382,14 +3322,16 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
|
||||
// Inline if kvfree_rcu(one_arg) call.
|
||||
goto unlock_return;
|
||||
|
||||
head->func = func;
|
||||
head->func = ptr;
|
||||
head->next = krcp->head;
|
||||
krcp->head = head;
|
||||
WRITE_ONCE(krcp->head, head);
|
||||
atomic_inc(&krcp->head_count);
|
||||
|
||||
// Take a snapshot for this krcp.
|
||||
krcp->head_gp_snap = get_state_synchronize_rcu();
|
||||
success = true;
|
||||
}
|
||||
|
||||
WRITE_ONCE(krcp->count, krcp->count + 1);
|
||||
|
||||
// Set timer to drain after KFREE_DRAIN_JIFFIES.
|
||||
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
|
||||
schedule_delayed_monitor_work(krcp);
|
||||
@ -3420,7 +3362,7 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
|
||||
for_each_possible_cpu(cpu) {
|
||||
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
|
||||
|
||||
count += READ_ONCE(krcp->count);
|
||||
count += krc_count(krcp);
|
||||
count += READ_ONCE(krcp->nr_bkv_objs);
|
||||
atomic_set(&krcp->backoff_page_cache_fill, 1);
|
||||
}
|
||||
@ -3437,7 +3379,7 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
|
||||
int count;
|
||||
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
|
||||
|
||||
count = krcp->count;
|
||||
count = krc_count(krcp);
|
||||
count += drain_page_cache(krcp);
|
||||
kfree_rcu_monitor(&krcp->monitor_work.work);
|
||||
|
||||
@ -3461,15 +3403,12 @@ static struct shrinker kfree_rcu_shrinker = {
|
||||
void __init kfree_rcu_scheduler_running(void)
|
||||
{
|
||||
int cpu;
|
||||
unsigned long flags;
|
||||
|
||||
for_each_possible_cpu(cpu) {
|
||||
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
|
||||
|
||||
raw_spin_lock_irqsave(&krcp->lock, flags);
|
||||
if (need_offload_krc(krcp))
|
||||
schedule_delayed_monitor_work(krcp);
|
||||
raw_spin_unlock_irqrestore(&krcp->lock, flags);
|
||||
}
|
||||
}
|
||||
|
||||
@ -3485,9 +3424,10 @@ void __init kfree_rcu_scheduler_running(void)
|
||||
*/
|
||||
static int rcu_blocking_is_gp(void)
|
||||
{
|
||||
if (rcu_scheduler_active != RCU_SCHEDULER_INACTIVE)
|
||||
if (rcu_scheduler_active != RCU_SCHEDULER_INACTIVE) {
|
||||
might_sleep();
|
||||
return false;
|
||||
might_sleep(); /* Check for RCU read-side critical section. */
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
@ -3711,7 +3651,9 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_full);
|
||||
* If @false is returned, it is the caller's responsibility to invoke this
|
||||
* function later on until it does return @true. Alternatively, the caller
|
||||
* can explicitly wait for a grace period, for example, by passing @oldstate
|
||||
* to cond_synchronize_rcu() or by directly invoking synchronize_rcu().
|
||||
* to either cond_synchronize_rcu() or cond_synchronize_rcu_expedited()
|
||||
* on the one hand or by directly invoking either synchronize_rcu() or
|
||||
* synchronize_rcu_expedited() on the other.
|
||||
*
|
||||
* Yes, this function does not take counter wrap into account.
|
||||
* But counter wrap is harmless. If the counter wraps, we have waited for
|
||||
@ -3722,6 +3664,12 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_full);
|
||||
* completed. Alternatively, they can use get_completed_synchronize_rcu()
|
||||
* to get a guaranteed-completed grace-period state.
|
||||
*
|
||||
* In addition, because oldstate compresses the grace-period state for
|
||||
* both normal and expedited grace periods into a single unsigned long,
|
||||
* it can miss a grace period when synchronize_rcu() runs concurrently
|
||||
* with synchronize_rcu_expedited(). If this is unacceptable, please
|
||||
* instead use the _full() variant of these polling APIs.
|
||||
*
|
||||
* This function provides the same memory-ordering guarantees that
|
||||
* would be provided by a synchronize_rcu() that was invoked at the call
|
||||
* to the function that provided @oldstate, and that returned at the end
|
||||
@ -4079,6 +4027,155 @@ retry:
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(rcu_barrier);
|
||||
|
||||
/*
|
||||
* Compute the mask of online CPUs for the specified rcu_node structure.
|
||||
* This will not be stable unless the rcu_node structure's ->lock is
|
||||
* held, but the bit corresponding to the current CPU will be stable
|
||||
* in most contexts.
|
||||
*/
|
||||
static unsigned long rcu_rnp_online_cpus(struct rcu_node *rnp)
|
||||
{
|
||||
return READ_ONCE(rnp->qsmaskinitnext);
|
||||
}
|
||||
|
||||
/*
|
||||
* Is the CPU corresponding to the specified rcu_data structure online
|
||||
* from RCU's perspective? This perspective is given by that structure's
|
||||
* ->qsmaskinitnext field rather than by the global cpu_online_mask.
|
||||
*/
|
||||
static bool rcu_rdp_cpu_online(struct rcu_data *rdp)
|
||||
{
|
||||
return !!(rdp->grpmask & rcu_rnp_online_cpus(rdp->mynode));
|
||||
}
|
||||
|
||||
#if defined(CONFIG_PROVE_RCU) && defined(CONFIG_HOTPLUG_CPU)
|
||||
|
||||
/*
|
||||
* Is the current CPU online as far as RCU is concerned?
|
||||
*
|
||||
* Disable preemption to avoid false positives that could otherwise
|
||||
* happen due to the current CPU number being sampled, this task being
|
||||
* preempted, its old CPU being taken offline, resuming on some other CPU,
|
||||
* then determining that its old CPU is now offline.
|
||||
*
|
||||
* Disable checking if in an NMI handler because we cannot safely
|
||||
* report errors from NMI handlers anyway. In addition, it is OK to use
|
||||
* RCU on an offline processor during initial boot, hence the check for
|
||||
* rcu_scheduler_fully_active.
|
||||
*/
|
||||
bool rcu_lockdep_current_cpu_online(void)
|
||||
{
|
||||
struct rcu_data *rdp;
|
||||
bool ret = false;
|
||||
|
||||
if (in_nmi() || !rcu_scheduler_fully_active)
|
||||
return true;
|
||||
preempt_disable_notrace();
|
||||
rdp = this_cpu_ptr(&rcu_data);
|
||||
/*
|
||||
* Strictly, we care here about the case where the current CPU is
|
||||
* in rcu_cpu_starting() and thus has an excuse for rdp->grpmask
|
||||
* not being up to date. So arch_spin_is_locked() might have a
|
||||
* false positive if it's held by some *other* CPU, but that's
|
||||
* OK because that just means a false *negative* on the warning.
|
||||
*/
|
||||
if (rcu_rdp_cpu_online(rdp) || arch_spin_is_locked(&rcu_state.ofl_lock))
|
||||
ret = true;
|
||||
preempt_enable_notrace();
|
||||
return ret;
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(rcu_lockdep_current_cpu_online);
|
||||
|
||||
#endif /* #if defined(CONFIG_PROVE_RCU) && defined(CONFIG_HOTPLUG_CPU) */
|
||||
|
||||
// Has rcu_init() been invoked?  This is used (for example) to determine
// whether spinlocks may be acquired safely.
static bool rcu_init_invoked(void)
{
	return !!rcu_state.n_online_cpus;
}

/*
 * Near the end of the offline process.  Trace the fact that this CPU
 * is going offline.
 */
int rcutree_dying_cpu(unsigned int cpu)
{
	bool blkd;
	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
	struct rcu_node *rnp = rdp->mynode;

	if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
		return 0;

	blkd = !!(READ_ONCE(rnp->qsmask) & rdp->grpmask);
	trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq),
			       blkd ? TPS("cpuofl-bgp") : TPS("cpuofl"));
	return 0;
}

/*
 * All CPUs for the specified rcu_node structure have gone offline,
 * and all tasks that were preempted within an RCU read-side critical
 * section while running on one of those CPUs have since exited their RCU
 * read-side critical section.  Some other CPU is reporting this fact with
 * the specified rcu_node structure's ->lock held and interrupts disabled.
 * This function therefore goes up the tree of rcu_node structures,
 * clearing the corresponding bits in the ->qsmaskinit fields.  Note that
 * the leaf rcu_node structure's ->qsmaskinit field has already been
 * updated.
 *
 * This function does check that the specified rcu_node structure has
 * all CPUs offline and no blocked tasks, so it is OK to invoke it
 * prematurely.  That said, invoking it after the fact will cost you
 * a needless lock acquisition.  So once it has done its work, don't
 * invoke it again.
 */
static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
{
	long mask;
	struct rcu_node *rnp = rnp_leaf;

	raw_lockdep_assert_held_rcu_node(rnp_leaf);
	if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
	    WARN_ON_ONCE(rnp_leaf->qsmaskinit) ||
	    WARN_ON_ONCE(rcu_preempt_has_tasks(rnp_leaf)))
		return;
	for (;;) {
		mask = rnp->grpmask;
		rnp = rnp->parent;
		if (!rnp)
			break;
		raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
		rnp->qsmaskinit &= ~mask;
		/* Between grace periods, so better already be zero! */
		WARN_ON_ONCE(rnp->qsmask);
		if (rnp->qsmaskinit) {
			raw_spin_unlock_rcu_node(rnp);
			/* irqs remain disabled. */
			return;
		}
		raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
	}
}

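To make the upward sweep in rcu_cleanup_dead_rnp() concrete, here is a small stand-alone model (ordinary user-space C, not kernel code; the node layout and names are invented for illustration): each node clears its bit in its parent's mask and stops as soon as an ancestor still has other initialized children.

#include <stdio.h>

struct model_node {
	struct model_node *parent;
	unsigned long qsmaskinit;	/* Bits for children that still matter. */
	unsigned long grpmask;		/* This node's bit in its parent's mask. */
};

/* Model of the upward sweep performed by rcu_cleanup_dead_rnp(). */
static void model_cleanup_dead_node(struct model_node *leaf)
{
	struct model_node *np = leaf;
	unsigned long mask;

	for (;;) {
		mask = np->grpmask;
		np = np->parent;
		if (!np)
			break;
		np->qsmaskinit &= ~mask;
		if (np->qsmaskinit)
			return;	/* Other children remain, so stop here. */
	}
}

int main(void)
{
	struct model_node root = { .parent = NULL, .qsmaskinit = 0x3, .grpmask = 0 };
	struct model_node leaf0 = { .parent = &root, .qsmaskinit = 0, .grpmask = 0x1 };

	model_cleanup_dead_node(&leaf0);
	printf("root qsmaskinit after leaf0 emptied: %#lx\n", root.qsmaskinit);
	return 0;
}
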
/*
 * The CPU has been completely removed, and some other CPU is reporting
 * this fact from process context.  Do the remainder of the cleanup.
 * There can only be one CPU hotplug operation at a time, so no need for
 * explicit locking.
 */
int rcutree_dead_cpu(unsigned int cpu)
{
	if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
		return 0;

	WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1);
	// Stop-machine done, so allow nohz_full to disable tick.
	tick_dep_clear(TICK_DEP_BIT_RCU);
	return 0;
}

/*
 * Propagate ->qsinitmask bits up the rcu_node tree to account for the
 * first CPU in a given leaf rcu_node structure coming online.  The caller

@@ -4408,11 +4505,13 @@ static int rcu_pm_notify(struct notifier_block *self,
 	switch (action) {
 	case PM_HIBERNATION_PREPARE:
 	case PM_SUSPEND_PREPARE:
+		rcu_async_hurry();
 		rcu_expedite_gp();
 		break;
 	case PM_POST_HIBERNATION:
 	case PM_POST_SUSPEND:
 		rcu_unexpedite_gp();
+		rcu_async_relax();
 		break;
 	default:
 		break;
@@ -4766,7 +4865,7 @@ struct workqueue_struct *rcu_gp_wq;
 static void __init kfree_rcu_batch_init(void)
 {
 	int cpu;
-	int i;
+	int i, j;
 
 	/* Clamp it to [0:100] seconds interval. */
 	if (rcu_delay_page_cache_fill_msec < 0 ||
@@ -4786,8 +4885,14 @@ static void __init kfree_rcu_batch_init(void)
 		for (i = 0; i < KFREE_N_BATCHES; i++) {
 			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
 			krcp->krw_arr[i].krcp = krcp;
+
+			for (j = 0; j < FREE_N_CHANNELS; j++)
+				INIT_LIST_HEAD(&krcp->krw_arr[i].bulk_head_free[j]);
 		}
 
+		for (i = 0; i < FREE_N_CHANNELS; i++)
+			INIT_LIST_HEAD(&krcp->bulk_head[i]);
+
 		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
 		INIT_DELAYED_WORK(&krcp->page_cache_work, fill_page_cache_func);
 		krcp->initialized = true;
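For context on the kfree_rcu() transition mentioned in this series, a short sketch of the two forms (struct example_obj and these helper functions are hypothetical):

/*
 * Sketch of the two kfree_rcu() forms; only the two-argument form needs
 * an rcu_head embedded in the object.
 */
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct example_obj {
	int data;
	struct rcu_head rh;	/* Needed only for the two-argument form. */
};

static void example_free_atomic_safe(struct example_obj *p)
{
	/* Never blocks, so usable from atomic context; queues via p->rh. */
	kfree_rcu(p, rh);
}

static void example_free_sleepable(struct example_obj *p)
{
	/* May block (for example, waiting for a grace period), so only
	 * from sleepable context; no rcu_head field is required. */
	kfree_rcu_mightsleep(p);
}
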
@@ -4838,6 +4943,8 @@ void __init rcu_init(void)
 	// Kick-start any polled grace periods that started early.
 	if (!(per_cpu_ptr(&rcu_data, cpu)->mynode->exp_seq_poll_rq & 0x1))
 		(void)start_poll_synchronize_rcu_expedited();
+
+	rcu_test_sync_prims();
 }
 
 #include "tree_stall.h"
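The polled grace-period API used above, and relied on by the kvfree_rcu() changes in this series, follows a simple cookie pattern; a minimal sketch (example_* names are hypothetical):

/*
 * Minimal sketch of the polled grace-period usage pattern.
 */
#include <linux/rcupdate.h>

static unsigned long example_gp_cookie;

static void example_start_gp(void)
{
	/* Snapshot the grace-period state and kick off a new grace period. */
	example_gp_cookie = start_poll_synchronize_rcu();
}

static bool example_gp_elapsed(void)
{
	/* True once a full grace period has elapsed since example_start_gp(). */
	return poll_state_synchronize_rcu(example_gp_cookie);
}
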
@@ -158,6 +158,23 @@ union rcu_noqs {
 	u16 s; /* Set of bits, aggregate OR here. */
 };
 
+/*
+ * Record the snapshot of the core stats at half of the first RCU stall timeout.
+ * The member gp_seq is used to ensure that all members are updated only once
+ * during the sampling period. The snapshot is taken only if this gp_seq is not
+ * equal to rdp->gp_seq.
+ */
+struct rcu_snap_record {
+	unsigned long	gp_seq;		/* Track rdp->gp_seq counter */
+	u64		cputime_irq;	/* Accumulated cputime of hard irqs */
+	u64		cputime_softirq;/* Accumulated cputime of soft irqs */
+	u64		cputime_system; /* Accumulated cputime of kernel tasks */
+	unsigned long	nr_hardirqs;	/* Accumulated number of hard irqs */
+	unsigned int	nr_softirqs;	/* Accumulated number of soft irqs */
+	unsigned long long nr_csw;	/* Accumulated number of task switches */
+	unsigned long	jiffies;	/* Track jiffies value */
+};
+
 /* Per-CPU data for read-copy update. */
 struct rcu_data {
 	/* 1) quiescent-state and grace-period handling : */
@@ -262,6 +279,8 @@ struct rcu_data {
 	short rcu_onl_gp_flags;		/* ->gp_flags at last online. */
 	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
 	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
+	struct rcu_snap_record snap_record; /* Snapshot of core stats at half of */
+					    /* the first RCU stall timeout */
 
 	long lazy_len;			/* Length of buffered lazy callbacks. */
 	int cpu;
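For orientation, a hedged sketch of the producer side of this structure (an illustration of the scheme, not the kernel's actual sampling site): the snapshot is refreshed at most once per grace period, using the saved gp_seq as the "already sampled" marker.

/*
 * Illustrative sketch only: fill in a CPU's rcu_snap_record once per
 * grace period, using the helpers that appear elsewhere in this diff.
 */
static void example_snapshot_cpu_stats(int cpu)
{
	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
	struct rcu_snap_record *rsrp = &rdp->snap_record;
	struct kernel_cpustat *kcsp = &kcpustat_cpu(cpu);

	if (rsrp->gp_seq == rdp->gp_seq)
		return;	/* Already sampled during this grace period. */

	rsrp->cputime_irq     = kcpustat_field(kcsp, CPUTIME_IRQ, cpu);
	rsrp->cputime_softirq = kcpustat_field(kcsp, CPUTIME_SOFTIRQ, cpu);
	rsrp->cputime_system  = kcpustat_field(kcsp, CPUTIME_SYSTEM, cpu);
	rsrp->nr_hardirqs = kstat_cpu_irqs_sum(cpu);
	rsrp->nr_softirqs = kstat_cpu_softirqs_sum(cpu);
	rsrp->nr_csw = nr_context_switches_cpu(cpu);
	rsrp->jiffies = jiffies;
	rsrp->gp_seq = rdp->gp_seq;
}

The print_cpu_stat_info() function later in this diff is the consumer: it reports the deltas between such a snapshot and the values observed at stall-warning time.
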
@@ -11,6 +11,7 @@
 
 static void rcu_exp_handler(void *unused);
 static int rcu_print_task_exp_stall(struct rcu_node *rnp);
+static void rcu_exp_print_detail_task_stall_rnp(struct rcu_node *rnp);
 
 /*
  * Record the start of an expedited grace period.
@@ -667,8 +668,11 @@ static void synchronize_rcu_expedited_wait(void)
 			mask = leaf_node_cpu_bit(rnp, cpu);
 			if (!(READ_ONCE(rnp->expmask) & mask))
 				continue;
+			preempt_disable(); // For smp_processor_id() in dump_cpu_task().
 			dump_cpu_task(cpu);
+			preempt_enable();
 		}
+		rcu_exp_print_detail_task_stall_rnp(rnp);
 	}
 	jiffies_stall = 3 * rcu_exp_jiffies_till_stall_check() + 3;
 	panic_on_rcu_stall();
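The preempt_disable()/preempt_enable() bracket exists because dump_cpu_task() may compare its argument against smp_processor_id(), which is only stable while preemption is disabled. A generic sketch of the same pattern (example_poke_cpu() is hypothetical):

/*
 * Sketch: comparing a target CPU against the current CPU requires a
 * stable CPU number, hence the preemption-disabled region.
 */
static void example_poke_cpu(int cpu)
{
	preempt_disable();
	if (cpu == smp_processor_id())
		pr_info("example: CPU %d is the current CPU\n", cpu);
	else
		pr_info("example: CPU %d is remote\n", cpu);
	preempt_enable();
}
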
@@ -811,6 +815,36 @@ static int rcu_print_task_exp_stall(struct rcu_node *rnp)
 	return ndetected;
 }
 
+/*
+ * Scan the current list of tasks blocked within RCU read-side critical
+ * sections, dumping the stack of each that is blocking the current
+ * expedited grace period.
+ */
+static void rcu_exp_print_detail_task_stall_rnp(struct rcu_node *rnp)
+{
+	unsigned long flags;
+	struct task_struct *t;
+
+	if (!rcu_exp_stall_task_details)
+		return;
+	raw_spin_lock_irqsave_rcu_node(rnp, flags);
+	if (!READ_ONCE(rnp->exp_tasks)) {
+		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+		return;
+	}
+	t = list_entry(rnp->exp_tasks->prev,
+		       struct task_struct, rcu_node_entry);
+	list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
+		/*
+		 * We could be printing a lot while holding a spinlock.
+		 * Avoid triggering hard lockup.
+		 */
+		touch_nmi_watchdog();
+		sched_show_task(t);
+	}
+	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+}
+
 #else /* #ifdef CONFIG_PREEMPT_RCU */
 
 /* Request an expedited quiescent state. */
@@ -883,6 +917,15 @@ static int rcu_print_task_exp_stall(struct rcu_node *rnp)
 	return 0;
 }
 
+/*
+ * Because preemptible RCU does not exist, we never have to print out
+ * tasks blocked within RCU read-side critical sections that are blocking
+ * the current expedited grace period.
+ */
+static void rcu_exp_print_detail_task_stall_rnp(struct rcu_node *rnp)
+{
+}
+
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
 /**
@@ -39,7 +39,7 @@ int rcu_exp_jiffies_till_stall_check(void)
 	// CONFIG_RCU_EXP_CPU_STALL_TIMEOUT, so check the allowed range.
 	// The minimum clamped value is "2UL", because at least one full
 	// tick has to be guaranteed.
-	till_stall_check = clamp(msecs_to_jiffies(cpu_stall_timeout), 2UL, 21UL * HZ);
+	till_stall_check = clamp(msecs_to_jiffies(cpu_stall_timeout), 2UL, 300UL * HZ);
 
 	if (cpu_stall_timeout && jiffies_to_msecs(till_stall_check) != cpu_stall_timeout)
 		WRITE_ONCE(rcu_exp_cpu_stall_timeout, jiffies_to_msecs(till_stall_check));
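The one-line change above restores the full five-minute upper bound (300 seconds) for expedited RCU CPU stall warnings. A quick user-space arithmetic check of the new clamp, assuming HZ=1000 so that one jiffy is one millisecond (not kernel code):

#include <stdio.h>

#define HZ 1000UL

static unsigned long clamp_ul(unsigned long v, unsigned long lo, unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
	unsigned long requested_ms = 400 * 1000;	/* Ask for 400 seconds. */
	unsigned long old_cap = clamp_ul(requested_ms, 2UL, 21UL * HZ);
	unsigned long new_cap = clamp_ul(requested_ms, 2UL, 300UL * HZ);

	/* Prints 21000 ms before the change and 300000 ms after it. */
	printf("old cap: %lu ms, new cap: %lu ms\n", old_cap, new_cap);
	return 0;
}
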
@@ -428,6 +428,35 @@ static bool rcu_is_rcuc_kthread_starving(struct rcu_data *rdp, unsigned long *jp
 	return j > 2 * HZ;
 }
 
+static void print_cpu_stat_info(int cpu)
+{
+	struct rcu_snap_record rsr, *rsrp;
+	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+	struct kernel_cpustat *kcsp = &kcpustat_cpu(cpu);
+
+	if (!rcu_cpu_stall_cputime)
+		return;
+
+	rsrp = &rdp->snap_record;
+	if (rsrp->gp_seq != rdp->gp_seq)
+		return;
+
+	rsr.cputime_irq     = kcpustat_field(kcsp, CPUTIME_IRQ, cpu);
+	rsr.cputime_softirq = kcpustat_field(kcsp, CPUTIME_SOFTIRQ, cpu);
+	rsr.cputime_system  = kcpustat_field(kcsp, CPUTIME_SYSTEM, cpu);
+
+	pr_err("\t         hardirqs   softirqs   csw/system\n");
+	pr_err("\t number: %8ld %10d %12lld\n",
+		kstat_cpu_irqs_sum(cpu) - rsrp->nr_hardirqs,
+		kstat_cpu_softirqs_sum(cpu) - rsrp->nr_softirqs,
+		nr_context_switches_cpu(cpu) - rsrp->nr_csw);
+	pr_err("\tcputime: %8lld %10lld %12lld   ==> %d(ms)\n",
+		div_u64(rsr.cputime_irq - rsrp->cputime_irq, NSEC_PER_MSEC),
+		div_u64(rsr.cputime_softirq - rsrp->cputime_softirq, NSEC_PER_MSEC),
+		div_u64(rsr.cputime_system - rsrp->cputime_system, NSEC_PER_MSEC),
+		jiffies_to_msecs(jiffies - rsrp->jiffies));
+}
+
 /*
  * Print out diagnostic information for the specified stalled CPU.
  *
@@ -484,6 +513,8 @@ static void print_cpu_stall_info(int cpu)
 	       data_race(rcu_state.n_force_qs) - rcu_state.n_force_qs_gpstart,
 	       rcuc_starved ? buf : "",
 	       falsepositive ? " (false positive?)" : "");
+
+	print_cpu_stat_info(cpu);
 }
 
 /* Complain about starvation of grace-period kthread. */
@@ -588,7 +619,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
 
 	for_each_possible_cpu(cpu)
 		totqlen += rcu_get_n_cbs_cpu(cpu);
-	pr_cont("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu ncpus=%d)\n",
+	pr_err("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu ncpus=%d)\n",
 	       smp_processor_id(), (long)(jiffies - gps),
 	       (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus);
 	if (ndetected) {
@@ -649,7 +680,7 @@ static void print_cpu_stall(unsigned long gps)
 	raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags);
 	for_each_possible_cpu(cpu)
 		totqlen += rcu_get_n_cbs_cpu(cpu);
-	pr_cont("\t(t=%lu jiffies g=%ld q=%lu ncpus=%d)\n",
+	pr_err("\t(t=%lu jiffies g=%ld q=%lu ncpus=%d)\n",
 		jiffies - gps,
 		(long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus);
 
@@ -144,8 +144,45 @@ bool rcu_gp_is_normal(void)
 }
 EXPORT_SYMBOL_GPL(rcu_gp_is_normal);
 
-static atomic_t rcu_expedited_nesting = ATOMIC_INIT(1);
-
+static atomic_t rcu_async_hurry_nesting = ATOMIC_INIT(1);
+/*
+ * Should call_rcu() callbacks be processed with urgency or are
+ * they OK being executed with arbitrary delays?
+ */
+bool rcu_async_should_hurry(void)
+{
+	return !IS_ENABLED(CONFIG_RCU_LAZY) ||
+	       atomic_read(&rcu_async_hurry_nesting);
+}
+EXPORT_SYMBOL_GPL(rcu_async_should_hurry);
+
+/**
+ * rcu_async_hurry - Make future async RCU callbacks not lazy.
+ *
+ * After a call to this function, future calls to call_rcu()
+ * will be processed in a timely fashion.
+ */
+void rcu_async_hurry(void)
+{
+	if (IS_ENABLED(CONFIG_RCU_LAZY))
+		atomic_inc(&rcu_async_hurry_nesting);
+}
+EXPORT_SYMBOL_GPL(rcu_async_hurry);
+
+/**
+ * rcu_async_relax - Make future async RCU callbacks lazy.
+ *
+ * After a call to this function, future calls to call_rcu()
+ * will be processed in a lazy fashion.
+ */
+void rcu_async_relax(void)
+{
+	if (IS_ENABLED(CONFIG_RCU_LAZY))
+		atomic_dec(&rcu_async_hurry_nesting);
+}
+EXPORT_SYMBOL_GPL(rcu_async_relax);
+
+static atomic_t rcu_expedited_nesting = ATOMIC_INIT(1);
 /*
  * Should normal grace-period primitives be expedited?  Intended for
  * use within RCU.  Note that this function takes the rcu_expedited
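A sketch of how callers are expected to pair these laziness hooks, mirroring the rcu_pm_notify() change earlier in this diff (the example_enter_suspend()/example_exit_suspend() wrappers are hypothetical):

/*
 * Sketch: bump the hurry nesting count on the way into a latency-sensitive
 * phase such as suspend, and drop it symmetrically on the way out.
 */
static void example_enter_suspend(void)
{
	rcu_async_hurry();	/* Process callbacks promptly from here on. */
	rcu_expedite_gp();
}

static void example_exit_suspend(void)
{
	rcu_unexpedite_gp();
	rcu_async_relax();	/* Lazy callbacks are acceptable again. */
}

Because the counters nest, independent subsystems can each demand urgency without coordinating; laziness resumes only when every hurry request has been relaxed.
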
@@ -195,6 +232,7 @@ static bool rcu_boot_ended __read_mostly;
 void rcu_end_inkernel_boot(void)
 {
 	rcu_unexpedite_gp();
+	rcu_async_relax();
 	if (rcu_normal_after_boot)
 		WRITE_ONCE(rcu_normal, 1);
 	rcu_boot_ended = true;
@@ -220,6 +258,7 @@ void rcu_test_sync_prims(void)
 {
 	if (!IS_ENABLED(CONFIG_PROVE_RCU))
 		return;
+	pr_info("Running RCU synchronous self tests\n");
 	synchronize_rcu();
 	synchronize_rcu_expedited();
 }
@@ -508,6 +547,10 @@ int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
 module_param(rcu_cpu_stall_timeout, int, 0644);
 int rcu_exp_cpu_stall_timeout __read_mostly = CONFIG_RCU_EXP_CPU_STALL_TIMEOUT;
 module_param(rcu_exp_cpu_stall_timeout, int, 0644);
+int rcu_cpu_stall_cputime __read_mostly = IS_ENABLED(CONFIG_RCU_CPU_STALL_CPUTIME);
+module_param(rcu_cpu_stall_cputime, int, 0644);
+bool rcu_exp_stall_task_details __read_mostly;
+module_param(rcu_exp_stall_task_details, bool, 0644);
 #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
 
 // Suppress boot-time RCU CPU stall warnings and rcutorture writer stall
@@ -555,9 +598,12 @@ struct early_boot_kfree_rcu {
 static void early_boot_test_call_rcu(void)
 {
 	static struct rcu_head head;
+	int idx;
 	static struct rcu_head shead;
 	struct early_boot_kfree_rcu *rhp;
 
+	idx = srcu_down_read(&early_srcu);
+	srcu_up_read(&early_srcu, idx);
 	call_rcu(&head, test_callback);
 	early_srcu_cookie = start_poll_synchronize_srcu(&early_srcu);
 	call_srcu(&early_srcu, &shead, test_callback);
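A sketch of the hand-off pattern that srcu_down_read()/srcu_up_read() permit, which srcu_read_lock()/srcu_read_unlock() do not (struct example_work and these functions are hypothetical):

/*
 * Sketch: one task enters the SRCU read-side critical section and a
 * different task exits it, passing the returned index along with the work.
 */
struct example_work {
	struct srcu_struct *ssp;
	int srcu_idx;
};

static void example_producer(struct example_work *w)
{
	/* Enter a read-side critical section that another task will exit. */
	w->srcu_idx = srcu_down_read(w->ssp);
	/* ... hand w off to a workqueue or another task ... */
}

static void example_consumer(struct example_work *w)
{
	/* ... use the SRCU-protected data ... */
	srcu_up_read(w->ssp, w->srcu_idx);
}
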
@@ -586,6 +632,7 @@ static int rcu_verify_early_boot_tests(void)
 		early_boot_test_counter++;
 		srcu_barrier(&early_srcu);
 		WARN_ON_ONCE(!poll_state_synchronize_srcu(&early_srcu, early_srcu_cookie));
+		cleanup_srcu_struct(&early_srcu);
 	}
 	if (rcu_self_test_counter != early_boot_test_counter) {
 		WARN_ON(1);
@@ -5342,6 +5342,11 @@ bool single_task_running(void)
 }
 EXPORT_SYMBOL(single_task_running);
 
+unsigned long long nr_context_switches_cpu(int cpu)
+{
+	return cpu_rq(cpu)->nr_switches;
+}
+
 unsigned long long nr_context_switches(void)
 {
 	int i;
@@ -450,7 +450,7 @@ unsigned long
 torture_random(struct torture_random_state *trsp)
 {
 	if (--trsp->trs_count < 0) {
-		trsp->trs_state += (unsigned long)local_clock();
+		trsp->trs_state += (unsigned long)local_clock() + raw_smp_processor_id();
 		trsp->trs_count = TORTURE_RANDOM_REFRESH;
 	}
 	trsp->trs_state = trsp->trs_state * TORTURE_RANDOM_MULT +
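Folding raw_smp_processor_id() into the periodic re-seed decorrelates the per-task pseudo-random streams when many kthreads refresh from nearly identical clock values. A small user-space model of the same idea (the constants are placeholders, not the kernel's TORTURE_RANDOM_* values):

#include <stdio.h>
#include <time.h>

struct model_rng {
	unsigned long state;
	long count;
};

/* Model of torture_random(): linear-congruential step plus periodic
 * re-seeding from a clock and the CPU number. */
static unsigned long model_random(struct model_rng *rng, int cpu)
{
	if (--rng->count < 0) {
		rng->state += (unsigned long)clock() + cpu;
		rng->count = 10000;	/* Placeholder refresh interval. */
	}
	rng->state = rng->state * 6364136223846793005UL + 1442695040888963407UL;
	return rng->state;
}

int main(void)
{
	struct model_rng rng = { .state = 0, .count = 0 };

	for (int i = 0; i < 3; i++)
		printf("%lu\n", model_random(&rng, 1));
	return 0;
}
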
@@ -915,7 +915,7 @@ void torture_kthread_stopping(char *title)
 	VERBOSE_TOROUT_STRING(buf);
 	while (!kthread_should_stop()) {
 		torture_shutdown_absorb(title);
-		schedule_timeout_uninterruptible(1);
+		schedule_timeout_uninterruptible(HZ / 20);
 	}
 }
 EXPORT_SYMBOL_GPL(torture_kthread_stopping);
@@ -10,10 +10,9 @@
 T="`mktemp -d ${TMPDIR-/tmp}/configcheck.sh.XXXXXX`"
 trap 'rm -rf $T' 0
 
-cat $1 > $T/.config
+sed -e 's/"//g' < $1 > $T/.config
 
-cat $2 | sed -e 's/\(.*\)=n/# \1 is not set/' -e 's/^#CHECK#//' |
-grep -v '^CONFIG_INITRAMFS_SOURCE' |
+sed -e 's/"//g' -e 's/\(.*\)=n/# \1 is not set/' -e 's/^#CHECK#//' < $2 |
 awk	'
 {
 	print "if grep -q \"" $0 "\" < '"$T/.config"'";
@@ -10,7 +10,7 @@
 #
 # Authors: Paul E. McKenney <paulmck@kernel.org>
 
-egrep 'Badness|WARNING:|Warn|BUG|===========|BUG: KCSAN:|Call Trace:|Oops:|detected stalls on CPUs/tasks:|self-detected stall on CPU|Stall ended before state dump start|\?\?\? Writer stall state|rcu_.*kthread starved for|!!!' |
+grep -E 'Badness|WARNING:|Warn|BUG|===========|BUG: KCSAN:|Call Trace:|Oops:|detected stalls on CPUs/tasks:|self-detected stall on CPU|Stall ended before state dump start|\?\?\? Writer stall state|rcu_.*kthread starved for|!!!' |
 grep -v 'ODEBUG: ' |
 grep -v 'This means that this is a DEBUG kernel and it is' |
 grep -v 'Warning: unable to open an initial console' |
@@ -44,10 +44,10 @@ fi
 ncpus="`getconf _NPROCESSORS_ONLN`"
 make -j$((2 * ncpus)) $TORTURE_KMAKE_ARG > $resdir/Make.out 2>&1
 retval=$?
-if test $retval -ne 0 || grep "rcu[^/]*": < $resdir/Make.out | egrep -q "Stop|Error|error:|warning:" || egrep -q "Stop|Error|error:" < $resdir/Make.out
+if test $retval -ne 0 || grep "rcu[^/]*": < $resdir/Make.out | grep -E -q "Stop|Error|error:|warning:" || grep -E -q "Stop|Error|error:" < $resdir/Make.out
 then
 	echo Kernel build error
-	egrep "Stop|Error|error:|warning:" < $resdir/Make.out
+	grep -E "Stop|Error|error:|warning:" < $resdir/Make.out
 	echo Run aborted.
 	exit 3
 fi
@@ -32,11 +32,11 @@ for i in ${rundir}/*/Make.out
 do
 	scenariodir="`dirname $i`"
 	scenariobasedir="`echo ${scenariodir} | sed -e 's/\.[0-9]*$//'`"
-	if egrep -q "error:|warning:|^ld: .*undefined reference to" < $i
+	if grep -E -q "error:|warning:|^ld: .*undefined reference to" < $i
 	then
-		egrep "error:|warning:|^ld: .*undefined reference to" < $i > $i.diags
+		grep -E "error:|warning:|^ld: .*undefined reference to" < $i > $i.diags
 		files="$files $i.diags $i"
-	elif ! test -f ${scenariobasedir}/vmlinux && ! test -f "${rundir}/re-run"
+	elif ! test -f ${scenariobasedir}/vmlinux && ! test -f ${scenariobasedir}/vmlinux.xz && ! test -f "${rundir}/re-run"
 	then
 		echo No ${scenariobasedir}/vmlinux file > $i.diags
 		files="$files $i.diags $i"
@@ -186,7 +186,7 @@ do
 		fi
 		;;
 	--kconfig|--kconfigs)
-		checkarg --kconfig "(Kconfig options)" $# "$2" '^CONFIG_[A-Z0-9_]\+=\([ynm]\|[0-9]\+\)\( CONFIG_[A-Z0-9_]\+=\([ynm]\|[0-9]\+\)\)*$' '^error$'
+		checkarg --kconfig "(Kconfig options)" $# "$2" '^CONFIG_[A-Z0-9_]\+=\([ynm]\|[0-9]\+\|"[^"]*"\)\( CONFIG_[A-Z0-9_]\+=\([ynm]\|[0-9]\+\|"[^"]*"\)\)*$' '^error$'
 		TORTURE_KCONFIG_ARG="`echo "$TORTURE_KCONFIG_ARG $2" | sed -e 's/^ *//' -e 's/ *$//'`"
 		shift
 		;;
@@ -585,7 +585,7 @@ awk < $T/cfgcpu.pack \
 echo kvm-end-run-stats.sh "$resdir/$ds" "$starttime" >> $T/script
 
 # Extract the tests and their batches from the script.
-egrep 'Start batch|Starting build\.' $T/script | grep -v ">>" |
+grep -E 'Start batch|Starting build\.' $T/script | grep -v ">>" |
 	sed -e 's/:.*$//' -e 's/^echo //' -e 's/-ovf//' |
 	awk '
 	/^----Start/ {
@@ -622,7 +622,7 @@ then
 elif test "$dryrun" = sched
 then
 	# Extract the test run schedule from the script.
-	egrep 'Start batch|Starting build\.' $T/script | grep -v ">>" |
+	grep -E 'Start batch|Starting build\.' $T/script | grep -v ">>" |
 		sed -e 's/:.*$//' -e 's/^echo //'
 	nbuilds="`grep 'Starting build\.' $T/script |
 		grep -v ">>" | sed -e 's/:.*$//' -e 's/^echo //' |
@@ -65,7 +65,7 @@ then
 fi
 
 grep --binary-files=text 'torture:.*ver:' $file |
-	egrep --binary-files=text -v '\(null\)|rtc: 000000000* ' |
+	grep -E --binary-files=text -v '\(null\)|rtc: 000000000* ' |
 	sed -e 's/^(initramfs)[^]]*] //' -e 's/^\[[^]]*] //' |
 	sed -e 's/^.*ver: //' |
 	awk '
@@ -128,17 +128,17 @@ then
 	then
 		summary="$summary Badness: $n_badness"
 	fi
-	n_warn=`grep -v 'Warning: unable to open an initial console' $file | grep -v 'Warning: Failed to add ttynull console. No stdin, stdout, and stderr for the init process' | egrep -c 'WARNING:|Warn'`
+	n_warn=`grep -v 'Warning: unable to open an initial console' $file | grep -v 'Warning: Failed to add ttynull console. No stdin, stdout, and stderr for the init process' | grep -E -c 'WARNING:|Warn'`
 	if test "$n_warn" -ne 0
 	then
 		summary="$summary Warnings: $n_warn"
 	fi
-	n_bugs=`egrep -c '\bBUG|Oops:' $file`
+	n_bugs=`grep -E -c '\bBUG|Oops:' $file`
 	if test "$n_bugs" -ne 0
 	then
 		summary="$summary Bugs: $n_bugs"
 	fi
-	n_kcsan=`egrep -c 'BUG: KCSAN: ' $file`
+	n_kcsan=`grep -E -c 'BUG: KCSAN: ' $file`
 	if test "$n_kcsan" -ne 0
 	then
 		if test "$n_bugs" = "$n_kcsan"
@@ -158,7 +158,7 @@ then
 	then
 		summary="$summary lockdep: $n_badness"
 	fi
-	n_stalls=`egrep -c 'detected stalls on CPUs/tasks:|self-detected stall on CPU|Stall ended before state dump start|\?\?\? Writer stall state' $file`
+	n_stalls=`grep -E -c 'detected stalls on CPUs/tasks:|self-detected stall on CPU|Stall ended before state dump start|\?\?\? Writer stall state' $file`
 	if test "$n_stalls" -ne 0
 	then
 		summary="$summary Stalls: $n_stalls"