linux/mm
Paul Jackson a49335ccea [PATCH] cpusets: oom_kill tweaks
This patch series extends the use of the cpuset attribute 'mem_exclusive'
to support cpuset configurations that:
 1) allow GFP_KERNEL allocations to come from a potentially larger
    set of memory nodes than GFP_USER allocations, and
 2) can constrain the oom killer to tasks running in cpusets in
    a specified subtree of the cpuset hierarchy.

Here's an example usage scenario.  For a few hours or more, a large NUMA
system at a University is to be divided in two halves, with a bunch of student
jobs running in half the system under some form of batch manager, and with a
big research project running in the other half.  Each of the student jobs is
placed in a small cpuset, but should share the classic Unix time share
facilities, such as buffered pages of files in /bin and /usr/lib.  The big
research project wants no interference whatsoever from the student jobs, and
has highly tuned, unusual memory and i/o patterns that intend to make full use
of all the main memory on the nodes available to it.

In this example, we have two big sibling cpusets, one of which is further
divided into a more dynamic set of child cpusets.

We want kernel memory allocations constrained by the two big cpusets, and user
allocations constrained by the smaller child cpusets where present.  And we
require that the oom killer not operate across the two halves of this system,
or else the first time a student job runs amuck, the big research project will
likely be first in line to get shot.

Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project
really does run amuck allocating memory, it should be shot, not some other
task outside the research project's mem_exclusive cpuset.

I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage
such scenarios.  Let memory allocations for user space (GFP_USER) be
constrained by a task's current cpuset, but memory allocations for kernel space
(GFP_KERNEL) be constrained by the nearest mem_exclusive ancestor of the
current cpuset, even though kernel space allocations will still _prefer_ to
remain within the current task's cpuset, if memory is easily available.
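
The rule above can be sketched in plain user-space C.  This is only an
illustrative model, not the kernel code: the struct toy_cpuset type, its
fields, and the toy_* helpers are hypothetical names, and the "hardwall"
argument stands in for the distinction between GFP_USER and GFP_KERNEL
requests.  The sketch also assumes that the walk for the nearest mem_exclusive
ancestor starts at the cpuset itself and falls back to the root.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a cpuset; not the kernel's struct. */
struct toy_cpuset {
    const struct toy_cpuset *parent;  /* NULL for the root cpuset */
    bool mem_exclusive;
    unsigned long mems_allowed;       /* bit n set => memory node n allowed */
};

/*
 * Walk up from cs until a mem_exclusive cpuset is found.
 * Assumption: the search starts at cs itself and stops at the root.
 */
static const struct toy_cpuset *
toy_nearest_exclusive_ancestor(const struct toy_cpuset *cs)
{
    while (!cs->mem_exclusive && cs->parent != NULL)
        cs = cs->parent;
    return cs;
}

/*
 * May a task in cs allocate on the given node?
 * hardwall == true  models a user space (GFP_USER) request.
 * hardwall == false models a kernel space (GFP_KERNEL) request.
 */
static bool toy_node_allowed(const struct toy_cpuset *cs, int node,
                             bool hardwall)
{
    unsigned long bit;

    if (node < 0 || (size_t)node >= sizeof(unsigned long) * CHAR_BIT)
        return false;
    bit = 1UL << node;

    /* Both kinds of request may always use the task's own cpuset. */
    if (cs->mems_allowed & bit)
        return true;

    /* User space requests stop at the current cpuset. */
    if (hardwall)
        return false;

    /* Kernel requests may fall back to the enclosing exclusive cpuset. */
    return (toy_nearest_exclusive_ancestor(cs)->mems_allowed & bit) != 0;
}
```

With a student job confined to node 0 inside a mem_exclusive half that owns
nodes 0-1, a user request for node 1 is refused, a kernel request for node 1
is allowed, and neither may reach into the other half.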

Let the oom killer be constrained to consider only tasks that are in
overlapping mem_exclusive cpusets (it won't help much to kill a task that
normally cannot allocate memory on any of the same nodes as the ones on which
the current task can allocate.)
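
As a rough illustration of that constraint, the selection loop can be modelled
in user-space C as below.  The struct toy_task type, its fields, and the
toy_* functions are hypothetical, and the single "badness" number is only a
stand-in for whatever score the oom killer uses to rank candidates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a task for this sketch. */
struct toy_task {
    const char *name;
    unsigned long excl_mems;  /* nodes of its mem_exclusive cpuset */
    long badness;             /* stand-in score: higher is a better victim */
};

/* Two node masks overlap if they share at least one memory node. */
static bool toy_mems_overlap(unsigned long a, unsigned long b)
{
    return (a & b) != 0;
}

/*
 * Pick the highest scoring task whose mem_exclusive cpuset overlaps the
 * one of the task that ran out of memory.  Returns NULL if none overlaps.
 */
static const struct toy_task *
toy_select_victim(const struct toy_task *tasks, size_t ntasks,
                  unsigned long current_excl_mems)
{
    const struct toy_task *chosen = NULL;
    size_t i;

    for (i = 0; i < ntasks; i++) {
        /* Spare tasks that cannot be holding memory we could use. */
        if (!toy_mems_overlap(tasks[i].excl_mems, current_excl_mems))
            continue;
        if (chosen == NULL || tasks[i].badness > chosen->badness)
            chosen = &tasks[i];
    }
    return chosen;
}
```

In the University example, a student job that runs out of memory on its half
of the system can only cause another task on that half to be selected, however
large the research project's score is.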

The current constraints imposed on setting mem_exclusive are unchanged.  A
cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a
mem_exclusive cpuset may not overlap any of its siblings' memory nodes.
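
Those two rules can be written down as a small check.  This is a sketch in
user-space C under stated assumptions, not the kernel's validation code: the
struct toy_cpuset type and toy_may_be_mem_exclusive() are hypothetical names,
the root cpuset is assumed to have no parent to check, and only the two
conditions named above are encoded.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a cpuset; not the kernel's struct. */
struct toy_cpuset {
    const struct toy_cpuset *parent;  /* NULL for the root cpuset */
    bool mem_exclusive;
    unsigned long mems_allowed;       /* bit n set => memory node n allowed */
};

/*
 * May cs be marked mem_exclusive?
 * siblings points to an array of the other children of cs->parent.
 */
static bool toy_may_be_mem_exclusive(const struct toy_cpuset *cs,
                                     const struct toy_cpuset *siblings,
                                     size_t nsiblings)
{
    size_t i;

    /* Rule 1: the parent, if there is one, must itself be mem_exclusive. */
    if (cs->parent != NULL && !cs->parent->mem_exclusive)
        return false;

    /* Rule 2: no memory node may be shared with any sibling. */
    for (i = 0; i < nsiblings; i++) {
        if (cs->mems_allowed & siblings[i].mems_allowed)
            return false;
    }
    return true;
}
```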

This patch was presented on linux-mm in early July 2005, though it did not
generate much feedback at that time.  It has been built for a variety of
architectures using cross tools, and built, booted and tested for function on
SN2 (ia64).

There are 4 patches in this set:
  1) Some minor cleanup, and some improvements to the code layout
     of one routine to make subsequent patches cleaner.
  2) Add another GFP flag - __GFP_HARDWALL.  It marks memory
     requests for USER space, which are tightly confined by the
     current task's cpuset.
  3) Now memory requests (such as KERNEL) that are not marked HARDWALL
     can, if short on memory, look in the potentially larger pool of memory
     defined by the nearest mem_exclusive ancestor cpuset of the current
     task's cpuset.
  4) Finally, modify the oom killer to skip any task whose mem_exclusive
     cpuset doesn't overlap ours.

Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32
bytes of kernel text space.  Patch (2) has no effect on the size of kernel
text space (it just adds a preprocessor flag).  Patches (3) and (4) added
about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which
matters only if CONFIG_CPUSET is enabled.

This patch:

This patch applies a few comment and code cleanups to mm/oom_kill.c prior to
applying a few small patches to improve cpuset management of memory placement.

The comment changed in oom_kill.c was seriously misleading.  The code layout
change in select_bad_process() makes room for adding another condition on
which a process can be spared the oom killer (see the subsequent
cpuset_nodes_overlap patch for this addition).

Also fixes a couple of typos and spellos that bugged me while I was here.

This patch should have no material effect.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:39 -07:00
..
bootmem.c [PATCH] Use ALIGN to remove duplicate code 2005-06-25 16:25:02 -07:00
fadvise.c [PATCH] xip: madvice/fadvice: execute in place 2005-06-24 00:06:42 -07:00
filemap_xip.c [PATCH] execute-in-place fixes 2005-07-15 09:54:50 -07:00
filemap.c [PATCH] shmem_populate: avoid an useless check, and some comments 2005-09-05 00:05:45 -07:00
filemap.h [PATCH] xip: reduce code duplication 2005-06-24 00:06:41 -07:00
fremap.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
highmem.c [PATCH] count bounce buffer pages in vmstat 2005-05-01 08:58:37 -07:00
hugetlb.c [PATCH] hugetlb: move stale pte check into huge_pte_alloc() 2005-09-05 00:05:46 -07:00
internal.h Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
Kconfig [PATCH] sparsemem extreme implementation 2005-09-05 00:05:38 -07:00
madvise.c [PATCH] mm: fix madvise vma merging 2005-09-05 00:05:44 -07:00
Makefile [PATCH] xip: fs/mm: execute in place 2005-06-24 00:06:41 -07:00
memory.c [PATCH] x86: ptep_clear optimization 2005-09-05 00:05:48 -07:00
mempolicy.c [PATCH] /proc/<pid>/numa_maps to show on which nodes pages reside 2005-09-05 00:05:43 -07:00
mempool.c [PATCH] propagate __nocast annotations 2005-07-07 18:23:46 -07:00
mincore.c [PATCH] freepgt: sys_mincore ignore FIRST_USER_PGD_NR 2005-04-19 13:29:20 -07:00
mlock.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
mmap.c [PATCH] remove misleading comment above sys_brk 2005-09-07 16:57:23 -07:00
mprotect.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
mremap.c [PATCH] mm: remap ZERO_PAGE mappings 2005-09-05 00:05:44 -07:00
msync.c [PATCH] msync: check pte dirty earlier 2005-06-21 18:46:21 -07:00
nommu.c [PATCH] __vm_enough_memory() signedness fix 2005-08-04 21:43:14 -07:00
oom_kill.c [PATCH] cpusets: oom_kill tweaks 2005-09-07 16:57:39 -07:00
page_alloc.c [PATCH] Additions to .data.read_mostly section 2005-09-07 16:57:33 -07:00
page_io.c [PATCH] swsusp: kill config_pm_disk 2005-06-25 16:24:32 -07:00
page-writeback.c [PATCH] rename wakeup_bdflush to wakeup_pdflush 2005-06-28 21:20:31 -07:00
pdflush.c [PATCH] Cleanup patch for process freezing 2005-06-25 17:10:13 -07:00
prio_tree.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
readahead.c [PATCH] readahead: reset cache_hit earlier 2005-09-07 16:57:25 -07:00
rmap.c [PATCH] mm: cleanup rmap 2005-09-05 00:05:43 -07:00
shmem.c [PATCH] Additions to .data.read_mostly section 2005-09-07 16:57:33 -07:00
slab.c [PATCH] slab: removes local_irq_save()/local_irq_restore() pair 2005-09-05 00:05:49 -07:00
sparse.c [PATCH] sparsemem extreme: hotplug preparation 2005-09-05 00:05:38 -07:00
swap_state.c [PATCH] delete from_swap_cache BUG_ONs 2005-09-05 00:05:42 -07:00
swap.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
swapfile.c [PATCH] swap: swap_lock replace list+device 2005-09-05 00:05:42 -07:00
thrash.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
tiny-shmem.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
truncate.c [PATCH] DocBook: fix some descriptions 2005-05-01 08:59:26 -07:00
vmalloc.c [PATCH] arm: allow for arch-specific IOREMAP_MAX_ORDER 2005-09-05 00:05:46 -07:00
vmscan.c [PATCH] VM: zone reclaim atomic ops cleanup 2005-09-05 00:05:44 -07:00