1
Commit Graph

507 Commits

Author SHA1 Message Date
Dmitry Adamushko
ce96b5ac74 sched: fix __set_task_cpu() SMP race
Grant Wilson has reported rare SCHED_FAIR_USER crashes on his quad-core
system, which crashes can only be explained via runqueue corruption.

there is a narrow SMP race in __set_task_cpu(): after ->cpu is set up to
a new value, task_rq_lock(p, ...) can be successfuly executed on another
CPU. We must ensure that updates of per-task data have been completed by
this moment.

this bug has been hiding in the Linux scheduler for an eternity (we never
had any explicit barrier for task->cpu in set_task_cpu() - so the bug was
introduced in 2.5.1), but only became visible via set_task_cfs_rq() being
accidentally put after the task->cpu update. It also probably needs a
sufficiently out-of-order CPU to trigger.

Reported-by: Grant Wilson <grant.wilson@zen.co.uk>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15 20:57:40 +01:00
Oleg Nesterov
dae51f5620 sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED
Suppose that the SCHED_FIFO task does

	switch_uid(new_user);

Now, p->se.cfs_rq and p->se.parent both point into the old
user_struct->tg because sched_move_task() doesn't call set_task_cfs_rq()
for !fair_sched_class case.

Suppose that old user_struct/task_group is freed/reused, and the task
does

	sched_setscheduler(SCHED_NORMAL);

__setscheduler() sets fair_sched_class, but doesn't update
->se.cfs_rq/parent which point to the freed memory.

This means that check_preempt_wakeup() doing

		while (!is_same_group(se, pse)) {
			se = parent_entity(se);
			pse = parent_entity(pse);
		}

may OOPS in a similar way if rq->curr or p did something like above.

Perhaps we need something like the patch below, note that
__setscheduler() can't do set_task_cfs_rq().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15 20:57:40 +01:00
Christian Borntraeger
9778385db3 sched: fix accounting of interrupts during guest execution on s390
Currently the scheduler checks for PF_VCPU to decide if this timeslice
has to be accounted as guest time. On s390 host interrupts are not
disabled during guest execution. This causes theses interrupts to be
accounted as guest time if CONFIG_VIRT_CPU_ACCOUNTING is set. Solution
is to check if an interrupt triggered account_system_time. As the tick
is timer interrupt based, we have to subtract hardirq_offset.

I tested the patch on s390 with CONFIG_VIRT_CPU_ACCOUNTING and on
x86_64. Seems to work.

CC: Avi Kivity <avi@qumranet.com>
CC: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-15 20:57:39 +01:00
Andrew Morton
cfb5285660 revert "Task Control Groups: example CPU accounting subsystem"
Revert 62d0df6406.

This was originally intended as a simple initial example of how to create a
control groups subsystem; it wasn't intended for mainline, but I didn't make
this clear enough to Andrew.

The CFS cgroup subsystem now has better functionality for the per-cgroup usage
accounting (based directly on CFS stats) than the "usage" status file in this
patch, and the "load" status file is rather simplistic - although having a
per-cgroup load average report would be a useful feature, I don't believe this
patch actually provides it.  If it gets into the final 2.6.24 we'd probably
have to support this interface for ever.

Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-14 18:45:40 -08:00
Adrian Bunk
e6fe6649b4 sched: proper prototype for kernel/sched.c:migration_init()
This patch adds a proper prototype for migration_init() in
include/linux/sched.h

Since there's no point in always returning 0 to a caller that doesn't check
the return value it also changes the function to return void.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Peter Zijlstra
b82d9fdd84 sched: avoid large irq-latencies in smp-balancing
SMP balancing is done with IRQs disabled and can iterate the full rq.
When rqs are large this can cause large irq-latencies. Limit the nr of
iterations on each run.

This fixes a scheduling latency regression reported by the -rt folks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Ingo Molnar
3e3e13f399 sched: remove PREEMPT_RESTRICT
remove PREEMPT_RESTRICT. (this is a separate commit so that any
regression related to the removal itself is bisectable)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Ingo Molnar
52d3da1ad4 sched: turn off PREEMPT_RESTRICT
PREEMPT_RESTRICT was a method aimed at reducing the amount of wakeup
related preemption. It has a disadvantage though, it can prevent
legitimate wakeups if a task is 'unlucky' to be hit too early by a tick
that clears peer_preempt.

Now that the wakeup preemption has been cleaned up we dont seem to have
excessive preemptions anymore, so this feature can be turned off. (and
removed in the next patch)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:39 +01:00
Eric Dumazet
d6322faf29 sched: cleanup, use NSEC_PER_MSEC and NSEC_PER_SEC
1) hardcoded 1000000000 value is used five times in places where
   NSEC_PER_SEC might be more readable.

2) A conversion from nsec to msec uses the hardcoded 1000000 value,
   which is a candidate for NSEC_PER_MSEC.

no code changed:

    text    data     bss     dec     hex filename
   44359    3326      36   47721    ba69 sched.o.before
   44359    3326      36   47721    ba69 sched.o.after

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:38 +01:00
Ingo Molnar
19978ca610 sched: reintroduce SMP tunings again
Yanmin Zhang reported an aim7 regression and bisected it down to:

 |  commit 38ad464d41
 |  Author: Ingo Molnar <mingo@elte.hu>
 |  Date:   Mon Oct 15 17:00:02 2007 +0200
 |
 |     sched: uniform tunings
 |
 |     use the same defaults on both UP and SMP.

fix this by reintroducing similar SMP tunings again. This resolves
the regression.

(also update the comments to match the ilog2(nr_cpus) tuning effect)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-11-09 22:39:38 +01:00
Ingo Molnar
38605cae99 sched: fix style in kernel/sched.c
fallout of recent commits: small coding style fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
Paul Menage
fe5c7cc228 sched: report CPU usage in CFS cgroup directories
Adds a cpu.usage file to the CFS cgroup that reports CPU usage in
milliseconds for that cgroup's tasks

[ mingo@elte.hu: style cleanups. ]

Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
Srivatsa Vaddagiri
ae8393e508 sched: move rcu_head to task_group struct
Peter Zijlstra noticed that the rcu_head object need not be present
in every cfs_rq of a group. Move it to the task_group structure
instead.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
James Bottomley
7bae49d498 sched: fix incorrect assumption that cpu 0 exists
This patch:

commit 9b5b77512d
Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Date:   Mon Oct 15 17:00:09 2007 +0200

    sched: clean up code under CONFIG_FAIR_GROUP_SCHED

Introduced an assumption of the existence of CPU0 via this line

cfs_rq = tg->cfs_rq[0];

If you have no CPU0, that will be NULL.  The fix seems to be just to
take whatever cfs_rq queue comes out of the for_each_possible_cpu()
loop, since they're all equally good for the destruction operation.

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:11 +01:00
Adrian Bunk
f7402e0361 sched: make kernel/sched.c:account_guest_time() static
account_guest_time() can become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-29 21:18:10 +01:00
Peter Williams
681f3e6854 sched: isolate SMP balancing code a bit more
At the moment, a lot of load balancing code that is irrelevant to non
SMP systems gets included during non SMP builds.

This patch addresses this issue and reduces the binary size on non
SMP systems:

   text    data     bss     dec     hex filename
  10983      28    1192   12203    2fab sched.o.before
  10739      28    1192   11959    2eb7 sched.o.after

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:51 +02:00
Peter Williams
e1d1484f72 sched: reduce balance-tasks overhead
At the moment, balance_tasks() provides low level functionality for both
  move_tasks() and move_one_task() (indirectly) via the load_balance()
function (in the sched_class interface) which also provides dual
functionality.  This dual functionality complicates the interfaces and
internal mechanisms and makes the run time overhead of operations that
are called with two run queue locks held.

This patch addresses this issue and reduces the overhead of these
operations.

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:51 +02:00
Paul Menage
2b01dfe372 sched: clean up some control group code
- replace "cont" with "cgrp" in a few places in the CFS cgroup code, 
- use write_uint rather than write for cpu.shares write function

Signed-off-by: Paul Menage <menage@google.com>
Acked-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:50 +02:00
Satyam Sharma
838225b48e sched: use show_regs() to improve __schedule_bug() output
A full register dump along with stack backtrace would make the
"scheduling while atomic" message more helpful. Use show_regs() instead
of dump_stack() for this. We already know we're atomic in here (that is
why this function was called) so show_regs()'s atomicity expectations
are guaranteed.

Also, modify the output of the "BUG: scheduling while atomic:" header a
bit to keep task->comm and task->pid together and preempt_count() after
them.

Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:50 +02:00
Ingo Molnar
4dcf6aff02 sched: clean up sched_domain_debug()
clean up sched_domain_debug().

this also shrinks the code a bit:

   text    data     bss     dec     hex filename
  50474    4306     480   55260    d7dc sched.o.before
  50404    4306     480   55190    d796 sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:48 +02:00
Ingo Molnar
b15136e949 sched: fix fastcall mismatch in completion APIs
Jeff Dike noticed that wait_for_completion_interruptible()'s prototype
had a mismatched fastcall.

Fix this by removing the fastcall attributes from all the completion APIs.

Found-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:48 +02:00
Milton Miller
7378547f2c sched: fix sched_domain sysctl registration again
commit  029190c515 (cpuset
sched_load_balance flag) was not tested SCHED_DEBUG enabled as
committed as it dereferences NULL when used and it reordered
the sysctl registration to cause it to never show any domains
or their tunables.

Fixes:

1) restore arch_init_sched_domains ordering
	we can't walk the domains before we build them

	presently we register cpus with empty directories (no domain
	directories or files).

2) make unregister_sched_domain_sysctl do nothing when already unregistered
	detach_destroy_domains is now called one set of cpus at a time
	unregister_syctl dereferences NULL if called with a null.

	While the the function would always dereference null if called
	twice, in the previous code it was always called once and then
	was followed a register.  So only the hidden bug of the
	sysctl_root_table not being allocated followed by an attempt to
	free it would have shown the error.

3) always call unregister and register in partition_sched_domains
	The code is "smart" about unregistering only needed domains.
	Since we aren't guaranteed any calls to unregister, always 
	unregister.   Without calling register on the way out we
	will not have a table or any sysctl tree.

4) warn if register is called without unregistering
	The previous table memory is lost, leaving pointers to the
	later freed memory in sysctl and leaking the memory of the
	tables.

Before this patch on a 2-core 4-thread box compiled for SMT and NUMA,
the domains appear empty (there are actually 3 levels per cpu).  And as
soon as two domains a null pointer is dereferenced (unreliable in this
case is stack garbage):

bu19a:~# ls -R /proc/sys/kernel/sched_domain/
/proc/sys/kernel/sched_domain/:
cpu0  cpu1  cpu2  cpu3

/proc/sys/kernel/sched_domain/cpu0:

/proc/sys/kernel/sched_domain/cpu1:

/proc/sys/kernel/sched_domain/cpu2:

/proc/sys/kernel/sched_domain/cpu3:

bu19a:~# mkdir /dev/cpuset
bu19a:~# mount -tcpuset cpuset /dev/cpuset/
bu19a:~# cd /dev/cpuset/
bu19a:/dev/cpuset# echo 0 > sched_load_balance 
bu19a:/dev/cpuset# mkdir one
bu19a:/dev/cpuset# echo 1 > one/cpus               
bu19a:/dev/cpuset# echo 0 > one/sched_load_balance 
Unable to handle kernel paging request for data at address 0x00000018
Faulting instruction address: 0xc00000000006b608
NIP: c00000000006b608 LR: c00000000006b604 CTR: 0000000000000000
REGS: c000000018d973f0 TRAP: 0300   Not tainted  (2.6.23-bml)
MSR: 9000000000009032 <EE,ME,IR,DR>  CR: 28242442  XER: 00000000
DAR: 0000000000000018, DSISR: 0000000040000000
TASK = c00000001912e340[1987] 'bash' THREAD: c000000018d94000 CPU: 2
..
NIP [c00000000006b608] .unregister_sysctl_table+0x38/0x110
LR [c00000000006b604] .unregister_sysctl_table+0x34/0x110
Call Trace:
[c000000018d97670] [c000000007017270] 0xc000000007017270 (unreliable)
[c000000018d97720] [c000000000058710] .detach_destroy_domains+0x30/0xb0
[c000000018d977b0] [c00000000005cf1c] .partition_sched_domains+0x1bc/0x230
[c000000018d97870] [c00000000009fdc4] .rebuild_sched_domains+0xb4/0x4c0
[c000000018d97970] [c0000000000a02e8] .update_flag+0x118/0x170
[c000000018d97a80] [c0000000000a1768] .cpuset_common_file_write+0x568/0x820
[c000000018d97c00] [c00000000009d95c] .cgroup_file_write+0x7c/0x180
[c000000018d97cf0] [c0000000000e76b8] .vfs_write+0xe8/0x1b0
[c000000018d97d90] [c0000000000e810c] .sys_write+0x4c/0x90
[c000000018d97e30] [c00000000000852c] syscall_exit+0x0/0x40

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-24 18:23:48 +02:00
Laurent Vivier
83d87d1673 sched: don't clear PF_VCPU in scheduler
KVM clears it by itself now, and for s390 this is plain wrong.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-22 12:03:29 +02:00
Michael Neuling
6888c1ecd6 kernel/sched.c: remove bogus comment from account_user_time
hardirq_offset is no longer needed.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-20 01:41:05 +02:00
Robert P. J. Day
3a4fa0a25d Fix misspellings of "system", "controller", "interrupt" and "necessary".
Fix the various misspellings of "system", controller", "interrupt" and
"[un]necessary".

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-19 23:10:43 +02:00
Srivatsa Vaddagiri
68318b8e0b Hook up group scheduler with control groups
Enable "cgroup" (formerly containers) based fair group scheduling.  This
will let administrator create arbitrary groups of tasks (using "cgroup"
pseudo filesystem) and control their cpu bandwidth usage.

[akpm@linux-foundation.org: fix cpp condition]
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:51 -07:00
Cliff Wickman
470fd64644 hotplug cpu: migrate a task within its cpuset
When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
been running on that cpu.

Currently, such a task is migrated:
 1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
 2) to any cpu which is both online and among that task's cpus_allowed

It is typical of a multithreaded application running on a large NUMA system to
have its tasks confined to a cpuset so as to cluster them near the memory that
they share.  Furthermore, it is typical to explicitly place such a task on a
specific cpu in that cpuset.  And in that case the task's cpus_allowed
includes only a single cpu.

This patch would insert a preference to migrate such a task to some cpu within
its cpuset (and set its cpus_allowed to its entire cpuset).

With this patch, migrate the task to:
 1) to any cpu on the same node as the disabled cpu, which is both online
    and among that task's cpus_allowed
 2) to any online cpu within the task's cpuset
 3) to any cpu which is both online and among that task's cpus_allowed

In order to do this, move_task_off_dead_cpu() must make a call to
cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
not block.  (name change - per Oleg's suggestion)

Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
the cpuset mutex during the whole migrate_live_tasks() and
migrate_dead_tasks() procedure.

[akpm@linux-foundation.org: build fix]
[pj@sgi.com: Fix indentation and spacing]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:44 -07:00
Pavel Emelyanov
ba25f9dcc4 Use helpers to obtain task pid in printks
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:43 -07:00
Eugene Teo
270f722d4d Fix tsk->exit_state usage
tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD.  A non-zero test
is the same as tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD), so just testing
tsk->exit_state is sufficient.

Signed-off-by: Eugene Teo <eugeneteo@kernel.sg>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:42 -07:00
Paul Menage
8707d8b8c0 Fix cpusets update_cpumask
Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks:

- collect batches of tasks under tasklist_lock and then call
  set_cpus_allowed() on them outside the lock (since this can sleep).

- add a simple generic priority heap type to allow efficient collection
  of batches of tasks to be processed without duplicating or missing any
  tasks in subsequent batches.

- make "cpus" file update a no-op if the mask hasn't changed

- fix race between update_cpumask() and sched_setaffinity() by making
  sched_setaffinity() post-check that it's not running on any cpus outside
  cpuset_cpus_allowed().

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:41 -07:00
Paul Jackson
029190c515 cpuset sched_load_balance flag
Add a new per-cpuset flag called 'sched_load_balance'.

When enabled in a cpuset (the default value) it tells the kernel scheduler
that the scheduler should provide the normal load balancing on the CPUs in
that cpuset, sometimes moving tasks from one CPU to a second CPU if the
second CPU is less loaded and if that task is allowed to run there.

When disabled (write "0" to the file) then it tells the kernel scheduler
that load balancing is not required for the CPUs in that cpuset.

Now even if this flag is disabled for some cpuset, the kernel may still
have to load balance some or all the CPUs in that cpuset, if some
overlapping cpuset has its sched_load_balance flag enabled.

If there are some CPUs that are not in any cpuset whose sched_load_balance
flag is enabled, the kernel scheduler will not load balance tasks to those
CPUs.

Moreover the kernel will partition the 'sched domains' (non-overlapping
sets of CPUs over which load balancing is attempted) into the finest
granularity partition that it can find, while still keeping any two CPUs
that are in the same shed_load_balance enabled cpuset in the same element
of the partition.

This serves two purposes:
 1) It provides a mechanism for real time isolation of some CPUs, and
 2) it can be used to improve performance on systems with many CPUs
    by supporting configurations in which load balancing is not done
    across all CPUs at once, but rather only done in several smaller
    disjoint sets of CPUs.

This mechanism replaces the earlier overloading of the per-cpuset
flag 'cpu_exclusive', which overloading was removed in an earlier
patch: cpuset-remove-sched-domain-hooks-from-cpusets

See further the Documentation and comments in the code itself.

[akpm@linux-foundation.org: don't be weird]
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:41 -07:00
Pavel Emelyanov
228ebcbe63 Uninline find_task_by_xxx set of functions
The find_task_by_something is a set of macros are used to find task by pid
depending on what kind of pid is proposed - global or virtual one.  All of
them are wrappers above the most generic one - find_task_by_pid_type_ns() -
and just substitute some args for it.

It turned out, that dereferencing the current->nsproxy->pid_ns construction
and pushing one more argument on the stack inline cause kernel text size to
grow.

This patch moves all this stuff out-of-line into kernel/pid.c.  Together
with the next patch it saves a bit less than 400 bytes from the .text
section.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:40 -07:00
Pavel Emelyanov
b488893a39 pid namespaces: changes to show virtual ids to user
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
 - all in-kernel data structures must store either struct pid itself
   or the pid's global nr, obtained with pid_nr() call;
 - when seeking the task from kernel code with the stored id one
   should use find_task_by_pid() call that works with global pids;
 - when showing pid's numerical value to the user the virtual one
   should be used, but however when one shows task's pid outside this
   task's namespace the global one is to be used;
 - when getting the pid from userspace one need to consider this as
   the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:40 -07:00
Paul Menage
62d0df6406 Task Control Groups: example CPU accounting subsystem
This example demonstrates how to use the generic cgroup subsystem for a
simple resource tracker that counts, for the processes in a cgroup, the
total CPU time used and the %CPU used in the last complete 10 second interval.

Portions contributed by Balbir Singh <balbir@in.ibm.com>

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:36 -07:00
Linus Torvalds
54e840dd50 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  sched: reduce schedstat variable overhead a bit
  sched: add KERN_CONT annotation
  sched: cleanup, make struct rq comments more consistent
  sched: cleanup, fix spacing
  sched: fix return value of wait_for_completion_interruptible()
2007-10-18 14:54:03 -07:00
Michael Neuling
c66f08be7e Add scaled time to taskstats based process accounting
This adds items to the taststats struct to account for user and system
time based on scaling the CPU frequency and instruction issue rates.

Adds account_(user|system)_time_scaled callbacks which architectures
can use to account for time using this mechanism.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:28 -07:00
Ken Chen
480b9434c5 sched: reduce schedstat variable overhead a bit
schedstat is useful in investigating CPU scheduler behavior.  Ideally,
I think it is beneficial to have it on all the time.  However, the
cost of turning it on in production system is quite high, largely due
to number of events it collects and also due to its large memory
footprint.

Most of the fields probably don't need to be full 64-bit on 64-bit
arch.  Rolling over 4 billion events will most like take a long time
and user space tool can be made to accommodate that.  I'm proposing
kernel to cut back most of variable width on 64-bit system.  (note,
the following patch doesn't affect 32-bit system).

Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-18 21:32:56 +02:00
Ingo Molnar
cc4ea79588 sched: add KERN_CONT annotation
printk: add the KERN_CONT annotation (which is empty string but via
which checkpatch.pl can notice that the lacking KERN_ level is fine).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-18 21:32:56 +02:00
Ingo Molnar
d801649162 sched: cleanup, make struct rq comments more consistent
cleanup, make struct rq comments more consistent.

found via scripts/checkpatch.pl.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-18 21:32:55 +02:00
Ingo Molnar
8401f77505 sched: cleanup, fix spacing
cleanup: fix sysctl_sched_features initialization spacing, and
fix sd_alloc_ctl_cpu_table() prototype spacing.

found via scripts/checkpatch.pl.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-18 21:32:55 +02:00
Andi Kleen
51e9799023 sched: fix return value of wait_for_completion_interruptible()
The recent wait_for_completion() cleanups:

    commit 8cbbe86dfc
    Author: Andi Kleen <ak@suse.de>
    Date:   Mon Oct 15 17:00:14 2007 +0200

    sched: cleanup: refactor common code of sleep_on / wait_for_completion

    Refactor common code of sleep_on / wait_for_completion

broke the return value of wait_for_completion_interruptible().
Previously it returned 0 on success, now -1.  Fix that.

Problem found by Geert Uytterhoeven.

[ mingo: fixed whitespace damage ]

Reported-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-18 21:32:55 +02:00
Linus Torvalds
e6d5a11dad Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  sched: fix new task startup crash
  sched: fix !SYSFS build breakage
  sched: fix improper load balance across sched domain
  sched: more robust sd-sysctl entry freeing
2007-10-17 09:11:18 -07:00
Oleg Nesterov
d2da272a4e migration_call(CPU_DEAD): use spin_lock_irq() instead of task_rq_lock()
Change migration_call(CPU_DEAD) to use direct spin_lock_irq() instead of
task_rq_lock(rq->idle), rq->idle can't change its task_rq().

This makes the code a bit more symmetrical with migrate_dead_tasks()'s path
which uses spin_lock_irq/spin_unlock_irq.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
Oleg Nesterov
f7b4cddcc5 do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(tasklist)
Currently move_task_off_dead_cpu() is called under
write_lock_irq(tasklist).  This means it can't use task_lock() which is
needed to improve migrating to take task's ->cpuset into account.

Change the code to call move_task_off_dead_cpu() with irqs enabled, and
change migrate_live_tasks() to use read_lock(tasklist).

This all is a preparation for the futher changes proposed by Cliff Wickman, see
	http://marc.info/?t=117327786100003

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
Srivatsa Vaddagiri
b9dca1e0fc sched: fix new task startup crash
Child task may be added on a different cpu that the one on which parent
is running. In which case, task_new_fair() should check whether the new
born task's parent entity should be added as well on the cfs_rq.

Patch below fixes the problem in task_new_fair.

This could fix the put_prev_task_fair() crashes reported.

Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Reported-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-17 16:55:11 +02:00
Ken Chen
908a7c1b9b sched: fix improper load balance across sched domain
We recently discovered a nasty performance bug in the kernel CPU load
balancer where we were hit by 50% performance regression.

When tasks are assigned to a subset of CPUs that span across
sched_domains (either ccNUMA node or the new multi-core domain) via
cpu affinity, kernel fails to perform proper load balance at
these domains, due to several logic in find_busiest_group() miss
identified busiest sched group within a given domain. This leads to
inadequate load balance and causes 50% performance hit.

To give you a concrete example, on a dual-core, 2 socket numa system,
there are 4 logical cpu, organized as:

CPU0 attaching sched-domain:
 domain 0: span 0003  groups: 0001 0002
 domain 1: span 000f  groups: 0003 000c
CPU1 attaching sched-domain:
 domain 0: span 0003  groups: 0002 0001
 domain 1: span 000f  groups: 0003 000c
CPU2 attaching sched-domain:
 domain 0: span 000c  groups: 0004 0008
 domain 1: span 000f  groups: 000c 0003
CPU3 attaching sched-domain:
 domain 0: span 000c  groups: 0008 0004
 domain 1: span 000f  groups: 000c 0003

If I run 2 tasks with CPU affinity set to 0x5.  There are situation
where cpu0 has run queue length of 2, and cpu2 will be idle.  The
kernel load balancer is unable to balance out these two tasks over
cpu0 and cpu2 due to at least three logics in find_busiest_group()
that heavily bias load balance towards power saving mode. e.g. while
determining "busiest" variable, kernel only set it when
"sum_nr_running > group_capacity".  This test is flawed that
"sum_nr_running" is not necessary same as
sum-tasks-allowed-to-run-within-the sched-group.  The end result is
that kernel "think" everything is balanced, but in reality we have an
imbalance and thus causing one CPU to be over-subscribed and leaving
other idle.  There are two other logic in the same function will also
causing similar effect.  The nastiness of this bug is that kernel not
be able to get unstuck in this unfortunate broken state.  From what
we've seen in our environment, kernel will stuck in imbalanced state
for extended period of time and it is also very easy for the kernel to
stuck into that state (it's pretty much 100% reproducible for us).

So proposing the following fix: add addition logic in
find_busiest_group to detect intrinsic imbalance within the busiest
group.  When such condition is detected, load balance goes into spread
mode instead of default grouping mode.

Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-17 16:55:11 +02:00
Milton Miller
cd79007634 sched: more robust sd-sysctl entry freeing
It occurred to me this morning that the procname field was dynamically
allocated and needed to be freed.  I started to put in break statements
when allocation failed but it was approaching 50% error handling code.

I came up with this alternative of looping while entry->mode is set and
checking proc_handler instead of ->table.  Alternatively, the string
version of the domain name and cpu number could be stored the structs.

I verified by compiling CONFIG_DEBUG_SLAB and checking the allocation
counts after taking a cpuset exclusive and back.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-17 16:55:11 +02:00
Paul Jackson
607717a65d cpuset: remove sched domain hooks from cpusets
Remove the cpuset hooks that defined sched domains depending on the setting
of the 'cpu_exclusive' flag.

The cpu_exclusive flag can only be set on a child if it is set on the
parent.

This made that flag painfully unsuitable for use as a flag defining a
partitioning of a system.

It was entirely unobvious to a cpuset user what partitioning of sched
domains they would be causing when they set that one cpu_exclusive bit on
one cpuset, because it depended on what CPUs were in the remainder of that
cpusets siblings and child cpusets, after subtracting out other
cpu_exclusive cpusets.

Furthermore, there was no way on production systems to query the
result.

Using the cpu_exclusive flag for this was simply wrong from the get go.

Fortunately, it was sufficiently borked that so far as I know, almost no
successful use has been made of this.  One real time group did use it to
affectively isolate CPUs from any load balancing efforts.  They are willing
to adapt to alternative mechanisms for this, such as someway to manipulate
the list of isolated CPUs on a running system.  They can do without this
present cpu_exclusive based mechanism while we develop an alternative.

There is a real risk, to the best of my understanding, of users
accidentally setting up a partitioned scheduler domains, inhibiting desired
load balancing across all their CPUs, due to the nonobvious (from the
cpuset perspective) side affects of the cpu_exclusive flag.

Furthermore, since there was no way on a running system to see what one was
doing with sched domains, this change will be invisible to any using code.
Unless they have real insight to the scheduler load balancing choices, they
will be unable to detect that this change has been made in the kernel's
behaviour.

Initial discussion on lkml of this patch has generated much comment.  My
(probably controversial) take on that discussion is that it has reached a
rough concensus that the current cpuset cpu_exclusive mechanism for
defining sched domains is borked.  There is no concensus on the
replacement.  But since we can remove this mechanism, and since its
continued presence risks causing unwanted partitioning of the schedulers
load balancing, we should remove it while we can, as we proceed to work the
replacement scheduler domain mechanisms.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:09 -07:00
Mike Travis
d5a7430ddc Convert cpu_sibling_map to be a per cpu variable
Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
variable.  This saves sizeof(cpumask_t) * NR unused cpus.  Access is mostly
from startup and CPU HOTPLUG functions.

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Ingo Molnar
9c63d9c021 sched: sync wakeups preempt too
make sure sync wakeups preempt too - the scheduler will not
overschedule as we've got various throttles against that.
As a result, sync wakeups can be used more widely in the kernel
(to signal wakeup affinity between tasks), and no arbitrary
latencies will be introduced either.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:20 +02:00
Ingo Molnar
71e20f1873 sched: affine sync wakeups
make sync wakeups affine for cache-cold tasks: if a cache-cold task
is woken up by a sync wakeup then use the opportunity to migrate it
straight away. (the two tasks are 'related' because they communicate)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Laurent Vivier
94886b84b1 sched: guest CPU accounting: maintain stats in account_system_time()
modify account_system_time() to add cputime to cpustat->guest if we are
running a VCPU. We add this cputime to cpustat->user instead of
cpustat->system because this part of KVM code is in fact user code
although it is executed in the kernel. We duplicate VCPU time between
guest and user to allow an unmodified "top(1)" to display correct value.
A modified "top(1)" is able to display good cpu user time and cpu guest
time by subtracting cpu guest time from cpu user time. Update "gtime" in
task_struct accordingly.

Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Milton Miller
6323469f9b sched: domain sysctl fixes: add terminator comment
we had an incorrect-terminator bug in sd_alloc_ctl_domain_table()
before, so add a comment that documents it.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Milton Miller
ad1cdc1d78 sched: domain sysctl fixes: do not crash on allocation failure
Now that we are calling this at runtime, a more relaxed error path is
suggested.  If an allocation fails, we just register the partial table,
which will show empty directories.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Milton Miller
6382bc90f5 sched: domain sysctl fixes: unregister the sysctl table before domains
Unregister and free the sysctl table before destroying domains, then
rebuild and register after creating the new domains.  This prevents the
sysctl table from pointing to freed memory for root to write.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Milton Miller
97b6ea7b63 sched: domain sysctl fixes: use for_each_online_cpu()
init_sched_domain_sysctl was walking cpus 0-n and referencing per_cpu
variables.  If the cpus_possible mask is not contigious this will result
in a crash referencing unallocated data.  If the online mask is not
contigious then we would show offline cpus and miss online ones.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Milton Miller
5cf9f062c8 sched: domain sysctl fixes: use kcalloc()
kcalloc checks for n * sizeof(element) overflows and it zeros.

Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:19 +02:00
Ingo Molnar
6bc1665ba7 sched: allow the immediate migration of cache-cold tasks
allow the immediate migration of cache-cold tasks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:18 +02:00
Ingo Molnar
cc367732ff sched: debug, improve migration statistics
add new migration statistics when SCHED_DEBUG and SCHEDSTATS
is enabled. Available in /proc/<PID>/sched.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:18 +02:00
Peter Zijlstra
ff56b2f015 sched: activate task_hot() only on fair-scheduled tasks
activate task_hot() only for fair-scheduled tasks (i.e. disable it
for RT tasks).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:18 +02:00
Ingo Molnar
da84d96176 sched: reintroduce cache-hot affinity
reintroduce a simplified version of cache-hot/cold scheduling
affinity. This improves performance with certain SMP workloads,
such as sysbench.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:18 +02:00
Ingo Molnar
178be79348 sched: do not normalize kernel threads via SysRq-N
do not normalize kernel threads via SysRq-N: the migration threads,
softlockup threads, etc. might be essential for the system to
function properly. So only zap user tasks.

pointed out by Andi Kleen.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:18 +02:00
Andi Kleen
1666703af9 sched: remove stale comment from sched_group_set_shares()
remove stale comment from sched_group_set_shares().

Function never returns -EINVAL.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:18 +02:00
Ingo Molnar
d5036e89dc sched: clean up is_migration_thread()
clean up is_migration_thread() and turn it into an inline function.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:15 +02:00
Andi Kleen
3a5e4dc12f sched: cleanup: refactor normalize_rt_tasks
Replace a particularly ugly ifdef with an inline and a new macro.
Also split up the function to be easier to read.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:15 +02:00
Andi Kleen
8cbbe86dfc sched: cleanup: refactor common code of sleep_on / wait_for_completion
Refactor common code of sleep_on / wait_for_completion

These functions were largely cut'n'pasted. This moves
the common code into single helpers instead.  Advantage
is about 1k less code on x86-64 and 91 lines of code removed.
It adds one function call to the non timeout version of
the functions; i don't expect this to be measurable.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Andi Kleen
3a5c359a58 sched: cleanup: remove unnecessary gotos
Replace loops implemented with gotos with real loops.
Replace err = ...; goto x; x: return err; with return ...;

No functional changes.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Mike Galbraith
95938a35c5 sched: prevent wakeup over-scheduling
Prevent wakeup over-scheduling.  Once a task has been preempted by a
task of the same or lower priority, it becomes ineligible for repeated
preemption by same until it has been ticked, or slept.  Instead, the
task is marked for preemption at the next tick.  Tasks of higher
priority still preempt immediately.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Peter Zijlstra
ce6c131131 sched: disable forced preemption by default
Implement feature bit to disable forced preemption. This way
it can be checked whether a workload is overscheduling or not.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Zou Nan hai
ace8b3d633 sched: some proc entries are missed in sched_domain sys_ctl debug code
cache_nice_tries and flags entry do not appear in proc fs sched_domain
directory, because ctl_table entry is skipped.

This patch fixes the issue.

Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Gautham R Shenoy
638e13ac37 sched: fix rt ptracer monopolizing CPU
yield() in wait_task_inactive(), can cause a high priority thread to be
scheduled back in, and there by loop forever while it is waiting for some
lower priority thread which is unfortunately still on the runqueue.

Use schedule_timeout_uninterruptible(1) instead.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Credit: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Dhaval Giani
5cb350baf5 sched: group scheduling, sysfs tunables
Add tunables in sysfs to modify a user's cpu share.

A directory is created in sysfs for each new user in the system.

	/sys/kernel/uids/<uid>/cpu_share

Reading this file returns the cpu shares granted for the user.
Writing into this file modifies the cpu share for the user. Only an
administrator is allowed to modify a user's cpu share.

Ex:
	# cd /sys/kernel/uids/
	# cat 512/cpu_share
	1024
	# echo 2048 > 512/cpu_share
	# cat 512/cpu_share
	2048
	#

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Paul E. McKenney
a58f6f253d sched: export cpu_clock()
export cpu_clock() - the preferred API instead of sched_clock().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Ingo Molnar
00bf7bfc2e sched: fix: move the CPU check into ->task_new_fair()
noticed by Peter Zijlstra:

fix: move the CPU check into ->task_new_fair(), this way we
can call place_entity() and get child ->vruntime right at
initial wakeup time.

(without this there can be large latencies)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-15 17:00:14 +02:00
Ingo Molnar
4cf86d77f5 sched: cleanup: rename task_grp to task_group
cleanup: rename task_grp to task_group. No need to save two characters
and 'grp' is annoying to read.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:14 +02:00
Ingo Molnar
06877c33fe sched: cleanup: rename SCHED_FEAT_USE_TREE_AVG to SCHED_FEAT_TREE_AVG
cleanup: rename SCHED_FEAT_USE_TREE_AVG to SCHED_FEAT_TREE_AVG, to
make SCHED_FEAT_ names more consistent.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:13 +02:00
Ingo Molnar
a65914b365 sched: kfree(NULL) is valid
kfree(NULL) is valid.

pointed out by checkpatch.pl.

the fix shrinks the code a bit:

   text    data     bss     dec     hex filename
  40024    3842     100   43966    abbe sched.o.before
  40002    3842     100   43944    aba8 sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:13 +02:00
Ingo Molnar
8927f49479 sched: style cleanup
fix up __setup() style bug - noticed via checkpatch.pl.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:13 +02:00
Ingo Molnar
26797a34a2 sched: break out if printing a warning in sched_domain_debug()
checkpatch.pl and Andy Whitcroft noticed the following bug: we did
not break out after printing an error.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:13 +02:00
Ingo Molnar
3e9830dcab sched: run sched_domain_debug() if CONFIG_SCHED_DEBUG=y
run sched_domain_debug() if CONFIG_SCHED_DEBUG=y, instead
of relying on the hand-crafted SCHED_DOMAIN_DEBUG switch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:13 +02:00
Dmitry Adamushko
a4ec24b48d sched: tidy up SCHED_RR
- make timeslices of SCHED_RR tasks constant and not
dependent on task's static_prio [1] ;
- remove obsolete code (timeslice related bits);
- make sched_rr_get_interval() return something more
meaningful [2] for SCHED_OTHER tasks.

[1] according to the following link, it's not compliant with SUSv3
(not sure though, what is the reference for us :-)
http://lkml.org/lkml/2007/3/7/656

[2] the interval is dynamic and can be depicted as follows "should a
task be one of the runnable tasks at this particular moment, it would
expect to run for this interval of time before being re-scheduled by the
scheduler tick".
(i.e. it's more precise if a task is runnable at the moment)

yeah, this seems to require task_rq_lock/unlock() but this is not a hot
path.

results:

(SCHED_FIFO)

dimm@earth:~/storage/prog$ sudo chrt -f 10 ./rr_interval 
time_slice: 0 : 0

(SCHED_RR)

dimm@earth:~/storage/prog$ sudo chrt 10 ./rr_interval 
time_slice: 0 : 99984800

(SCHED_NORMAL)

dimm@earth:~/storage/prog$ ./rr_interval 
time_slice: 0 : 19996960

(SCHED_NORMAL + a cpu_hog of similar 'weight' on the same CPU --- so should be a half of the previous result)

dimm@earth:~/storage/prog$ taskset 1 ./rr_interval 
time_slice: 0 : 9998480

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:13 +02:00
Alexey Dobriyan
a9957449b0 sched: uninline scheduler
* save ~300 bytes
* activate_idle_task() was moved to avoid a warning

bloat-o-meter output:

add/remove: 6/0 grow/shrink: 0/16 up/down: 438/-733 (-295)		<===
function                                     old     new   delta
__enqueue_entity                               -     165    +165
finish_task_switch                             -     110    +110
update_curr_rt                                 -      79     +79
__load_balance_iterator                        -      32     +32
__task_rq_unlock                               -      28     +28
find_process_by_pid                            -      24     +24
do_sched_setscheduler                        133     123     -10
sys_sched_rr_get_interval                    176     165     -11
sys_sched_getparam                           156     145     -11
normalize_rt_tasks                           482     470     -12
sched_getaffinity                            112      99     -13
sys_sched_getscheduler                        86      72     -14
sched_setaffinity                            226     212     -14
sched_setscheduler                           666     642     -24
load_balance_start_fair                       33       9     -24
load_balance_next_fair                        33       9     -24
dequeue_task_rt                              133      67     -66
put_prev_task_rt                              97      28     -69
schedule_tail                                133      50     -83
schedule                                     682     594     -88
enqueue_entity                               499     366    -133
task_new_fair                                317     180    -137

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:13 +02:00
Ingo Molnar
1e81995066 sched: optimize schedule() a bit on SMP
optimize schedule() a bit on SMP, by moving the rq-clock update
outside the rq lock.

code size is the same:

      text    data     bss     dec     hex filename
     25725    2666      96   28487    6f47 sched.o.before
     25725    2666      96   28487    6f47 sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:13 +02:00
Ingo Molnar
3a25201572 sched: whitespace cleanups
more whitespace cleanups. No code changed:

      text    data     bss     dec     hex filename
     26553    2790     288   29631    73bf sched.o.before
     26553    2790     288   29631    73bf sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:12 +02:00
Ingo Molnar
5522d5d5f7 sched: mark scheduling classes as const
mark scheduling classes as const. The speeds up the code
a bit and shrinks it:

   text    data     bss     dec     hex filename
  40027    4018     292   44337    ad31 sched.o.before
  40190    3842     292   44324    ad24 sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:12 +02:00
Srivatsa Vaddagiri
2830cf8c90 sched: group scheduler SMP migration fix
group scheduler SMP migration fix: use task_cfs_rq(p) to get
to the relevant fair-scheduling runqueue of a task, rq->cfs
is not the right one.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15 17:00:12 +02:00
Ingo Molnar
2d72376b3a sched: clean up schedstats, cnt -> count
rename all 'cnt' fields and variables to the less yucky 'count' name.

yuckage noticed by Andrew Morton.

no change in code, other than the /proc/sched_debug bkl_count string got
a bit larger:

   text    data     bss     dec     hex filename
  38236    3506      24   41766    a326 sched.o.before
  38240    3506      24   41770    a32a sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:12 +02:00
Hiroshi Shimamoto
2ddbf95250 sched: clean up sched_fork()
The adjusting sched_class is a missing part of the already existing "do
not leak PI boosting priority to the child" at the sched_fork(). This
patch moves the adjusting sched_class from wake_up_new_task() to
sched_fork().

this also shrinks the code a bit:

   text    data     bss     dec     hex filename
  40111    4018     292   44421    ad85 sched.o.before
  40102    4018     292   44412    ad7c sched.o.after

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:11 +02:00
Ingo Molnar
02e4bac2a5 sched: fix sched_fork()
fix sched_fork(): large latencies at new task creation time because
the ->vruntime was not fixed up cross-CPU, if the parent got migrated
after the child's CPU got set up.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:11 +02:00
Ingo Molnar
94359f05cb sched: undo some of the recent changes
undo some of the recent changes that are not needed after all,
such as last_min_vruntime.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-15 17:00:11 +02:00
Ingo Molnar
785c29ef95 sched: remove condition from set_task_cpu()
remove condition from set_task_cpu(). Now that ->vruntime
is not global anymore, it should (and does) work fine without
it too.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-15 17:00:11 +02:00
Peter Zijlstra
ddc9729750 sched debug: check spread
debug feature: check how well we schedule within a reasonable
vruntime 'spread' range. (note that CPU overload can increase
the spread, so this is not a hard condition, but normal loads
should be within the spread.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-15 17:00:10 +02:00
Peter Zijlstra
67e9fb2a39 sched: add vslice
add vslice: the load-dependent "virtual slice" a task should
run ideally, so that the observed latency stays within the
sched_latency window.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:10 +02:00
Ingo Molnar
b8efb56172 sched debug: BKL usage statistics
add per task and per rq BKL usage statistics.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:10 +02:00
Srivatsa Vaddagiri
24e377a832 sched: add fair-user scheduler
Enable user-id based fair group scheduling. This is useful for anyone
who wants to test the group scheduler w/o having to enable
CONFIG_CGROUPS.

A separate scheduling group (i.e struct task_grp) is automatically created for 
every new user added to the system. Upon uid change for a task, it is made to 
move to the corresponding scheduling group.

A /proc tunable (/proc/root_user_share) is also provided to tune root
user's quota of cpu bandwidth.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri
9b5b77512d sched: clean up code under CONFIG_FAIR_GROUP_SCHED
With the view of supporting user-id based fair scheduling (and not just
container-based fair scheduling), this patch renames several functions
and makes them independent of whether they are being used for container
or user-id based fair scheduling.

Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating
less-sized array for tg->cfs_rq[] and tf->se[]).

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri
83b699ed20 sched: revert recent removal of set_curr_task()
Revert removal of set_curr_task.
Use put_prev_task/set_curr_task when changing groups/policies

Signed-off-by: Srivatsa Vaddagiri < vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-10-15 17:00:08 +02:00
Dmitry Adamushko
f6b53205e1 sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()
rework enqueue/dequeue_entity() to get rid of 
sched_class::set_curr_task(). This simplifies sched_setscheduler(), 
rt_mutex_setprio() and sched_move_tasks().

   text    data     bss     dec     hex filename
  24330    2734      20   27084    69cc sched.o.before
  24233    2730      20   26983    6967 sched.o.after

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:08 +02:00
Dmitry Adamushko
4530d7ab0f sched: simplify sched_class::yield_task()
the 'p' (task_struct) parameter in the sched_class :: yield_task() is
redundant as the caller is always the 'current'. Get rid of it.

   text    data     bss     dec     hex filename
  24341    2734      20   27095    69d7 sched.o.before
  24330    2734      20   27084    69cc sched.o.after

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:08 +02:00
Dmitry Adamushko
30cfdcfc5f sched: do not keep current in the tree and get rid of sched_entity::fair_key
Get rid of 'sched_entity::fair_key'.

As a side effect, 'current' is not kept withing the tree for 
SCHED_NORMAL/BATCH tasks anymore. This simplifies some parts of code 
(e.g. entity_tick() and yield_task_fair()) and also somewhat optimizes 
them (e.g. a single update_curr() now vs. dequeue/enqueue() before in 
entity_tick()).

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2007-10-15 17:00:07 +02:00