1
linux/kernel
Lai Jiangshan 8161239a8b rtmutex: Simplify PI algorithm and make highest prio task get lock
In current rtmutex, the pending owner may be boosted by the tasks
in the rtmutex's waitlist when the pending owner is deboosted
or a task in the waitlist is boosted. This boosting is unrelated,
because the pending owner does not really take the rtmutex.
It is not reasonable.

Example.

time1:
A(high prio) onwers the rtmutex.
B(mid prio) and C (low prio) in the waitlist.

time2
A release the lock, B becomes the pending owner
A(or other high prio task) continues to run. B's prio is lower
than A, so B is just queued at the runqueue.

time3
A or other high prio task sleeps, but we have passed some time
The B and C's prio are changed in the period (time2 ~ time3)
due to boosting or deboosting. Now C has the priority higher
than B. ***Is it reasonable that C has to boost B and help B to
get the rtmutex?

NO!! I think, it is unrelated/unneed boosting before B really
owns the rtmutex. We should give C a chance to beat B and
win the rtmutex.

This is the motivation of this patch. This patch *ensures*
only the top waiter or higher priority task can take the lock.

How?
1) we don't dequeue the top waiter when unlock, if the top waiter
   is changed, the old top waiter will fail and go to sleep again.
2) when requiring lock, it will get the lock when the lock is not taken and:
   there is no waiter OR higher priority than waiters OR it is top waiter.
3) In any time, the top waiter is changed, the top waiter will be woken up.

The algorithm is much simpler than before, no pending owner, no
boosting for pending owner.

Other advantage of this patch:
1) The states of a rtmutex are reduced a half, easier to read the code.
2) the codes become shorter.
3) top waiter is not dequeued until it really take the lock:
   they will retain FIFO when it is stolen.

Not advantage nor disadvantage
1) Even we may wakeup multiple waiters(any time when top waiter changed),
   we hardly cause "thundering herd",
   the number of wokenup task is likely 1 or very little.
2) two APIs are changed.
   rt_mutex_owner() will not return pending owner, it will return NULL when
                    the top waiter is going to take the lock.
   rt_mutex_next_owner() always return the top waiter.
	                 will not return NULL if we have waiters
                         because the top waiter is not dequeued.

   I have fixed the code that use these APIs.

need updated after this patch is accepted
1) Document/*
2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst

Signed-off-by:  Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4D3012D5.4060709@cn.fujitsu.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-01-27 21:13:51 -05:00
..
debug Merge branch 'master' into for-next 2010-12-22 18:57:02 +01:00
gcov llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
irq genirq: Remove __do_IRQ 2011-01-21 11:55:31 +01:00
power Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 2011-01-13 20:15:35 -08:00
time hrtimers: Notify hrtimer users of switches to NOHZ mode 2011-01-19 20:08:15 +01:00
trace lockdep: Move early boot local IRQ enable/disable status to init/main.c 2011-01-20 13:32:33 +01:00
.gitignore
acct.c pass a struct path to vfs_statfs 2010-08-09 16:48:42 -04:00
async.c async: use workqueue for worker pool 2010-07-14 11:29:46 +02:00
audit_tree.c in untag_chunk() we need to do alloc_chunk() a bit earlier 2010-10-30 02:18:32 -04:00
audit_watch.c audit: make functions static 2010-10-30 01:42:19 -04:00
audit.c audit: error message typo correction 2010-11-03 13:49:58 -04:00
audit.h audit: make functions static 2010-10-30 01:42:19 -04:00
auditfilter.c Audit: add support to match lsm labels on user audit messages 2010-10-30 01:41:57 -04:00
auditsc.c audit mmap 2010-10-30 08:45:43 -04:00
backtracetest.c
bounds.c
capability.c
cgroup_freezer.c cgroup_freezer: update_freezer_state() does incorrect state transitions 2010-10-27 18:03:08 -07:00
cgroup.c Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin 2011-01-14 09:08:29 -08:00
compat.c compat: Make compat_alloc_user_space() incorporate the access_ok() 2010-09-14 16:08:45 -07:00
configs.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
cpu.c Merge branches 'x86-alternatives-for-linus', 'x86-fpu-for-linus', 'x86-hwmon-for-linus', 'x86-paravirt-for-linus', 'core-locking-for-linus' and 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-01-06 11:11:50 -08:00
cpuset.c convert cgroup and cpuset 2010-10-29 04:17:06 -04:00
cred.c signals: move cred_guard_mutex from task_struct to signal_struct 2010-10-27 18:03:12 -07:00
delayacct.c
dma.c
elfcore.c
exec_domain.c sys_personality: remove the bogus checks in sys_personality()->__set_personality() path 2010-08-09 20:45:05 -07:00
exit.c Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-01-11 11:02:13 -08:00
extable.c
fork.c thp: khugepaged 2011-01-13 17:32:43 -08:00
freezer.c Freezer: Fix a race during freezing of TASK_STOPPED tasks 2010-12-24 15:02:40 +01:00
futex_compat.c futex: Address compiler warnings in exit_robust_list 2010-11-10 13:27:50 +01:00
futex.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
groups.c kernel/groups.c: fix integer overflow in groups_search 2010-09-09 18:57:24 -07:00
hrtimer.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-01-13 10:05:56 -08:00
hung_task.c lockup detector: Fix grammar by adding a missing "to" in the comments 2010-08-17 09:11:52 +02:00
hw_breakpoint.c perf: Dynamic pmu types 2010-12-16 11:36:43 +01:00
irq_work.c irq_work: Use per cpu atomics instead of regular atomics 2010-12-18 15:54:48 +01:00
itimer.c
jump_label.c jump label: Make arch_jump_label_text_poke_early() optional 2010-10-29 12:56:13 -04:00
kallsyms.c Revert "kernel: make /proc/kallsyms mode 400 to reduce ease of attacking" 2010-11-19 11:54:40 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kexec.c tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
kfifo.c kfifo: fix scatterlist usage 2010-10-01 10:50:58 -07:00
kmod.c Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
kprobes.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
ksysfs.c sysfs: add struct file* to bin_attr callbacks 2010-05-21 09:37:31 -07:00
kthread.c sched: Constify function scope static struct sched_param usage 2011-01-07 15:55:45 +01:00
latencytop.c fs/proc/base.c, kernel/latencytop.c: convert sprintf_symbol() to %ps 2011-01-13 08:03:16 -08:00
lockdep_internals.h
lockdep_proc.c locking, lockdep: Convert sprintf_symbol to %pS 2010-11-10 10:23:58 +01:00
lockdep_states.h
lockdep.c lockdep: Move early boot local IRQ enable/disable status to init/main.c 2011-01-20 13:32:33 +01:00
Makefile kernel: clean up USE_GENERIC_SMP_HELPERS 2011-01-13 08:03:08 -08:00
module.c module: Move RO/NX module protection to after ftrace module update 2010-12-23 09:56:00 -05:00
mutex-debug.c
mutex-debug.h
mutex.c mutexes, sched: Introduce arch_mutex_cpu_relax() 2010-11-26 15:05:34 +01:00
mutex.h
notifier.c
ns_cgroup.c cgroup: notify ns_cgroup deprecated 2010-10-27 18:03:09 -07:00
nsproxy.c
padata.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2010-08-04 15:23:14 -07:00
panic.c ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type support 2011-01-12 03:06:19 -05:00
params.c module: show version information for built-in modules in sysfs 2011-01-24 14:32:51 +10:30
perf_event.c perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/ 2011-01-21 22:08:16 +01:00
pid_namespace.c
pid.c Add RCU check for find_task_by_vpid(). 2010-08-19 17:18:02 -07:00
pm_qos_params.c PM / PM QoS: Fix reversed min and max 2010-11-15 22:45:22 +01:00
posix-cpu-timers.c posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call 2010-11-10 13:07:06 +01:00
posix-timers.c posix-timers: Annotate lock_timer() 2010-10-21 17:30:06 +02:00
printk.c console: rename acquire/release_console_sem() to console_lock/unlock() 2011-01-26 10:50:06 +10:00
profile.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
ptrace.c signals: move cred_guard_mutex from task_struct to signal_struct 2010-10-27 18:03:12 -07:00
range.c kernel/range.c: fix clean_sort_range() for the case of full array 2010-11-12 07:55:31 -08:00
rcupdate.c Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu 2010-10-07 09:43:11 +02:00
rcutiny_plugin.h rcu: Distinguish between boosting and boosted 2010-11-29 22:01:56 -08:00
rcutiny.c rcu: avoid pointless blocked-task warnings 2011-01-14 04:58:08 -08:00
rcutorture.c rcu: add priority-inversion testing to rcutorture 2010-10-07 10:41:08 -07:00
rcutree_plugin.h rcu: increase synchronize_sched_expedited() batching 2010-12-17 12:34:08 -08:00
rcutree_trace.c rcu,cleanup: simplify the code when cpu is dying 2010-11-29 22:01:58 -08:00
rcutree.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
rcutree.h rcu: limit rcu_node leaf-level fanout 2010-12-17 12:34:20 -08:00
relay.c Clean up relay_alloc_page_array() slightly by using vzalloc rather than vmalloc and memset 2010-11-05 08:21:34 -07:00
res_counter.c
resource.c resources: add arch hook for preventing allocation in reserved areas 2010-12-17 10:01:09 -08:00
rtmutex_common.h rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex-debug.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex-debug.h
rtmutex-tester.c rtmutex-tester: make it build without BKL 2010-10-19 11:29:56 +02:00
rtmutex.c rtmutex: Simplify PI algorithm and make highest prio task get lock 2011-01-27 21:13:51 -05:00
rtmutex.h
rwsem.c
sched_autogroup.c sched, autogroup: Fix CONFIG_RT_GROUP_SCHED sched_setscheduler() failure 2011-01-18 15:09:42 +01:00
sched_autogroup.h sched, autogroup: Fix CONFIG_RT_GROUP_SCHED sched_setscheduler() failure 2011-01-18 15:09:42 +01:00
sched_clock.c sched: Add some clock info to sched_debug 2010-11-23 10:29:08 +01:00
sched_cpupri.c sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_cpupri.h sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_debug.c sched: Replace rq->bkl_count with rq->rq_sched_info.bkl_count 2011-01-18 15:09:43 +01:00
sched_fair.c sched: Fix poor interactivity on UP systems due to group scheduler nice tune bug 2011-01-24 11:47:50 +01:00
sched_features.h sched: Rewrite tg_shares_up) 2010-11-18 13:27:46 +01:00
sched_idletask.c
sched_rt.c sched: Implement on-demand (active) cfs_rq list 2010-11-18 13:27:47 +01:00
sched_stats.h sched_stat: Update sched_info_queue/dequeue() code comments 2010-10-24 13:29:01 +02:00
sched_stoptask.c sched: Fix cross-sched-class wakeup preemption 2010-11-11 14:37:23 +01:00
sched.c Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-01-20 16:37:55 -08:00
seccomp.c
semaphore.c
signal.c signals: annotate lock context change on ptrace_stop() 2010-10-27 18:03:12 -07:00
smp.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-01-20 18:30:37 -08:00
softirq.c kernel: clean up USE_GENERIC_SMP_HELPERS 2011-01-13 08:03:08 -08:00
spinlock.c
srcu.c rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status 2011-01-14 04:56:49 -08:00
stacktrace.c
stop_machine.c stop_machine: convert cpu notifier to return encapsulate errno value 2010-10-26 16:52:15 -07:00
sys_ni.c powerpc: define a compat_sys_recv cond_syscall 2010-09-23 17:03:55 +10:00
sys.c kmsg_dump: add kmsg_dump() calls to the reboot, halt, poweroff and emergency_restart paths 2011-01-13 08:03:07 -08:00
sysctl_binary.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-01-13 10:05:56 -08:00
sysctl_check.c sysctl: min/max bounds are optional 2010-10-15 14:42:24 -07:00
sysctl.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2011-01-26 16:31:44 +10:00
taskstats.c taskstats: use better ifdef for alignment 2011-01-13 08:03:19 -08:00
test_kprobes.c kprobes: Fix selftest to clear flags field for reusing probes 2010-10-14 08:55:27 +02:00
time.c Kill off a bunch of warning: ‘inline’ is not at beginning of declaration 2010-11-28 23:08:04 +01:00
timeconst.pl
timer.c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2011-01-06 10:42:43 -08:00
tracepoint.c jump_label: Use more consistent naming 2010-10-18 19:58:56 +02:00
tsacct.c taskstats: use real microsecond granularity for CPU times 2010-10-27 18:03:17 -07:00
uid16.c
up.c
user_namespace.c user_ns: improve the user_ns on-the-slab packaging 2011-01-13 08:03:18 -08:00
user-return-notifier.c
user.c fix freeing user_struct in user cache 2010-12-29 11:31:38 -08:00
utsname_sysctl.c
utsname.c
wait.c docbook: add more wait/wake/completion to device-drivers docbook 2010-10-26 17:32:41 -07:00
watchdog.c Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2011-01-07 17:02:58 -08:00
workqueue_sched.h workqueue: implement concurrency managed dynamic worker pool 2010-06-29 10:07:14 +02:00
workqueue.c workqueue: note the nested NOT_RUNNING test in worker_clr_flags() isn't a noop 2011-01-11 16:03:14 +01:00