linux/kernel
Linus Torvalds fa490cfd15 Fix possible runqueue lock starvation in wait_task_inactive()
Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop.

He observed that an interrupt on one core tries to acquire the
runqueue-lock but does not succeed in doing so for a very long time -
while wait_task_inactive() on the other core loops waiting for the first
core to deschedule a task (which it won't do while spinning in an
interrupt handler).

This rewrites wait_task_inactive() to do all its waiting optimistically
without any locks taken at all, and then just double-check the end
result with the proper runqueue lock held over just a very short
section.  If there were races in the optimistic wait, or a preemption
event scheduled the process away, we simply re-synchronize, and start
over.

So the code now looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
		rq = task_rq(p);
		while (task_running(rq, p))
			cpu_relax();

		/* Get the *real* values */
		rq = task_rq_lock(p, &flags);
		running = task_running(rq, p);
		array = p->array;
		task_rq_unlock(rq, &flags);

		/* Check them.. */
		if (unlikely(running)) {
			cpu_relax();
			goto repeat;
		}

		/* Preempted away? Yield if so.. */
		if (unlikely(array)) {
			yield();
			goto repeat;
		}

Basically, that first "while()" loop is done entirely without any
locking at all (and doesn't check for the case where the target process
might have been preempted away), and so it's possibly "incorrect", but
we don't really care.  Both the runqueue used, and the "task_running()"
check might be the wrong tests, but they won't oops - they just mean
that we could possibly get the wrong results due to lack of locking and
exit the loop early in the case of a race condition.

So once we've exited the loop, we then get the proper (and careful) rq
lock, and check the running/runnable state _safely_.  And if it turns
out that our quick-and-dirty and unsafe loop was wrong after all, we
just go back and try it all again.
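
The same "wait optimistically without the lock, then verify under the lock"
pattern can be sketched in userspace. This is only an illustration, not the
kernel code: struct fake_task, wait_inactive(), fake_cpu() and run_demo() are
hypothetical names, a pthread mutex stands in for the runqueue lock, an atomic
flag stands in for task_running(), and sched_yield() stands in for
cpu_relax(). The "preempted away" (p->array) check is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task scheduler state. */
struct fake_task {
	pthread_mutex_t lock;	/* plays the role of the runqueue lock */
	atomic_bool running;	/* plays the role of task_running() */
};

/* Wait until the task is no longer marked running. */
static void wait_inactive(struct fake_task *t)
{
	bool running;

	for (;;) {
		/* Unlocked, optimistic wait: the value may be stale. */
		while (atomic_load_explicit(&t->running, memory_order_relaxed))
			sched_yield();

		/* Short locked section: read the authoritative value. */
		pthread_mutex_lock(&t->lock);
		running = atomic_load(&t->running);
		pthread_mutex_unlock(&t->lock);

		if (!running)
			return;
		/* The optimistic loop was wrong; start over. */
	}
}

/* Simulates the other core: runs for a while, then deschedules the task. */
static void *fake_cpu(void *arg)
{
	struct fake_task *t = arg;

	for (int i = 0; i < 1000; i++)
		sched_yield();

	pthread_mutex_lock(&t->lock);
	atomic_store(&t->running, false);
	pthread_mutex_unlock(&t->lock);
	return NULL;
}

/* Returns 0 once the waiter has seen the task go inactive, -1 on error. */
static int run_demo(void)
{
	struct fake_task t = { PTHREAD_MUTEX_INITIALIZER, true };
	pthread_t cpu;

	if (pthread_create(&cpu, NULL, fake_cpu, &t) != 0)
		return -1;

	wait_inactive(&t);
	assert(!atomic_load(&t.running));

	pthread_join(cpu, NULL);
	return 0;
}
```

The point of the sketch is that the waiter holds the lock only for the brief
verification step, so the other thread is never kept off the lock for long
while the waiter spins.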

(The patch also adds a lot of comments, which is the actual bulk of it
all, to make it more obvious why we can do these things without holding
the locks).

Thanks to Miklos for all the testing and tracking it down.

Tested-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 11:52:55 -07:00
..
irq Fix crash with irqpoll due to the IRQF_IRQPOLL flag testing 2007-05-24 08:37:14 -07:00
power swsusp: Fix userland interface 2007-06-16 13:16:15 -07:00
time timer stats: speedups 2007-06-01 08:18:30 -07:00
.gitignore
acct.c
audit.c
audit.h
auditfilter.c audit_match_signal() and friends are used only if CONFIG_AUDITSYSCALL is set 2007-05-15 18:56:37 -07:00
auditsc.c
capability.c
compat.c signal/timer/event: timerfd compat code 2007-05-11 08:29:36 -07:00
configs.c
cpu.c
cpuset.c cpuset: zero malloc - fix for old cpusets 2007-06-16 13:16:15 -07:00
delayacct.c
die_notifier.c
dma.c
exec_domain.c
exit.c pi-futex: fix exit races and locking problems 2007-06-08 17:23:34 -07:00
extable.c
fork.c freezer: fix vfork problem 2007-05-23 20:14:11 -07:00
futex_compat.c Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
futex.c Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
hrtimer.c
itimer.c
kallsyms.c fix possible null ptr deref in kallsyms_lookup 2007-05-30 10:51:38 -07:00
Kconfig.hz
Kconfig.preempt
kexec.c
kfifo.c
kmod.c
kprobes.c
ksysfs.c
kthread.c freezer: fix kthread_create vs freezer theoretical race 2007-05-23 20:14:11 -07:00
latency.c
lockdep_internals.h
lockdep_proc.c
lockdep.c
Makefile
module.c
mutex-debug.c
mutex-debug.h
mutex.c
mutex.h
nsproxy.c
panic.c
params.c
pid.c statically initialize struct pid for swapper 2007-05-11 08:29:35 -07:00
posix-cpu-timers.c
posix-timers.c
printk.c
profile.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
ptrace.c
rcupdate.c
rcutorture.c
relay.c
resource.c
rtmutex_common.h Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
rtmutex.h
rwsem.c
sched.c Fix possible runqueue lock starvation in wait_task_inactive() 2007-06-18 11:52:55 -07:00
seccomp.c
signal.c Fix signalfd interaction with thread-private signals 2007-06-18 10:18:32 -07:00
softirq.c
softlockup.c
spinlock.c
srcu.c
stacktrace.c
stop_machine.c stop_machine() now uses hard_irq_disable 2007-05-11 08:29:34 -07:00
sys_ni.c compat signalfd and timerfd are cond syscalls 2007-05-12 10:55:40 -07:00
sys.c attach_pid() with struct pid parameter 2007-05-11 08:29:35 -07:00
sysctl.c make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core filename size 2007-05-17 05:23:05 -07:00
taskstats.c
time.c
timer.c NOHZ: prevent multiplication overflow - stop timer for huge timeouts 2007-05-29 18:11:10 -07:00
tsacct.c
uid16.c
user.c
utsname_sysctl.c
utsname.c
wait.c
workqueue.c simplify cleanup_workqueue_thread() 2007-05-23 20:14:13 -07:00