1
Commit Graph

4402 Commits

Author SHA1 Message Date
Andi Kleen
e9d08567ef Merge branch 'hwpoison-cleanups' into hwpoison 2010-10-22 17:40:11 +02:00
Linus Torvalds
3044100e58 Merge branch 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (74 commits)
  x86-64: Only set max_pfn_mapped to 512 MiB if we enter via head_64.S
  xen: Cope with unmapped pages when initializing kernel pagetable
  memblock, bootmem: Round pfn properly for memory and reserved regions
  memblock: Annotate memblock functions with __init_memblock
  memblock: Allow memblock_init to be called early
  memblock/arm: Fix memblock_region_is_memory() typo
  x86, memblock: Remove __memblock_x86_find_in_range_size()
  memblock: Fix wraparound in find_region()
  x86-32, memblock: Make add_highpages honor early reserved ranges
  x86, memblock: Fix crashkernel allocation
  arm, memblock: Fix the sparsemem build
  memblock: Fix section mismatch warnings
  powerpc, memblock: Fix memblock API change fallout
  memblock, microblaze: Fix memblock API change fallout
  x86: Remove old bootmem code
  x86, memblock: Use memblock_memory_size()/memblock_free_memory_size() to get correct dma_reserve
  x86: Remove not used early_res code
  x86, memblock: Replace e820_/_early string with memblock_
  x86: Use memblock to replace early_res
  x86, memblock: Use memblock_debug to control debug message print out
  ...

Fix up trivial conflicts in arch/x86/kernel/setup.c and kernel/Makefile
2010-10-21 18:52:11 -07:00
Linus Torvalds
c3b86a2942 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86-32, percpu: Correct the ordering of the percpu readmostly section
  x86, mm: Enable ARCH_DMA_ADDR_T_64BIT with X86_64 || HIGHMEM64G
  x86: Spread tlb flush vector between nodes
  percpu: Introduce a read-mostly percpu API
  x86, mm: Fix incorrect data type in vmalloc_sync_all()
  x86, mm: Hold mm->page_table_lock while doing vmalloc_sync
  x86, mm: Fix bogus whitespace in sync_global_pgds()
  x86-32: Fix sparse warning for the __PHYSICAL_MASK calculation
  x86, mm: Add RESERVE_BRK_ARRAY() helper
  mm, x86: Saving vmcore with non-lazy freeing of vmas
  x86, kdump: Change copy_oldmem_page() to use cached addressing
  x86, mm: fix uninitialized addr in kernel_physical_mapping_init()
  x86, kmemcheck: Remove double test
  x86, mm: Make spurious_fault check explicitly check the PRESENT bit
  x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes
  x86, mm: Separate x86_64 vmalloc_sync_all() into separate functions
  x86, mm: Avoid unnecessary TLB flush
2010-10-21 13:47:29 -07:00
H. Peter Anvin
8e4029ee35 Merge branch 'x86/urgent' into core/memblock
Reason for merge:

Forward-port urgent change to arch/x86/mm/srat_64.c to the memblock tree.

Resolved Conflicts:
	arch/x86/mm/srat_64.c

Originally-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-10-11 17:05:11 -07:00
Yinghai Lu
cd79481d27 memblock: Annotate memblock functions with __init_memblock
Stephen found

WARNING: mm/built-in.o(.text+0x25ab8): Section mismatch in reference from the function memblock_find_base() to the function .init.text:memblock_find_region()
The function memblock_find_base() references
the function __init memblock_find_region().
This is often because memblock_find_base lacks a __init
annotation or the annotation of memblock_find_region is wrong.

So let memblock_find_region() to use __init_memblock instead of __init
directly.

Also fix one function that did not have __init* to be __init_memblock.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CB366B1.40405@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-10-11 16:00:52 -07:00
Jeremy Fitzhardinge
236260b90d memblock: Allow memblock_init to be called early
The Xen setup code needs to call memblock_x86_reserve_range() very early,
so allow it to initialize the memblock subsystem before doing so.  The
second memblock_init() is ignored.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <4CACFDAD.3090900@goop.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-10-11 15:59:01 -07:00
Andi Kleen
a08c80ebb6 HWPOISON: Remove retry loop for try_to_unmap
We don't reply in other temporary failure cases and there were no
reports of replies happening. I think the original reason it was
added was also just an early bug, not an observation of the race.

So remove the loop for now, but keep a warning message.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:33:01 +02:00
Andi Kleen
9033ae1640 HWPOISON: Turn addr_valid from bitfield into char
The addr_valid flag is the only flag in "to_kill" and it's slightly more
efficient to have it as char instead of a bitfield.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:33:01 +02:00
Andi Kleen
898e70d1e5 HWPOISON: Disable DEBUG by default
Now that only a few obscure messages are left as pr_debug disable
outputting of pr_debug in memory-failure.c by default.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:33:00 +02:00
Andi Kleen
fb46e73520 HWPOISON: Convert pr_debugs to pr_info
Convert a lot of pr_debugs in memory-failure.c that are generally useful
to pr_info. It's reasonable to print at least one message why
offlining succeeded or failed by default.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:33:00 +02:00
Andi Kleen
1c80b990a3 HWPOISON: Improve comments in memory-failure.c
Clean up and improve the overview comment in memory-failure.c

Tidy some grammar issues in other comments.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:33:00 +02:00
Ingo Molnar
153db80f8c Merge commit 'v2.6.36-rc7' into core/memblock
Merge reason: Update from -rc3 to -rc7.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-08 09:15:00 +02:00
Linus Torvalds
6b0cd00bc3 Merge branch 'hwpoison-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
  HWPOISON: Stop shrinking at right page count
  HWPOISON: Report correct address granuality for AO huge page errors
  HWPOISON: Copy si_addr_lsb to user
  page-types.c: fix name of unpoison interface
2010-10-07 13:59:32 -07:00
Kirill A. Shutemov
ad4ca5f4b7 memcg: fix thresholds with use_hierarchy == 1
We need to check parent's thresholds if parent has use_hierarchy == 1 to
be sure that parent's threshold events will be triggered even if parent
itself is not active (no MEM_CGROUP_EVENTS).

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-07 13:31:21 -07:00
Robin Holt
f241e6607b mm: alloc_large_system_hash() printk overflow on 16TB boot
During boot of a 16TB system, the following is printed:
Dentry cache hash table entries: -2147483648 (order: 22, 17179869184 bytes)

Signed-off-by: Robin Holt <holt@sgi.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-07 13:31:21 -07:00
Andi Kleen
47f43e7efa HWPOISON: Stop shrinking at right page count
When we call the slab shrinker to free a page we need to stop at
page count one because the caller always holds a single reference, not zero.

This avoids useless looping over slab shrinkers and freeing too much
memory.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-07 09:47:10 +02:00
Andi Kleen
0d9ee6a2d4 HWPOISON: Report correct address granuality for AO huge page errors
The SIGBUS user space signalling is supposed to report the
address granuality of a corruption. Pass this information correctly
for huge pages by querying the hpage order.

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-07 09:45:44 +02:00
Yinghai Lu
f1af98c762 memblock: Fix wraparound in find_region()
When trying to find huge range for crashkernel, get

[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: at arch/x86/mm/memblock.c:248 memblock_x86_reserve_range+0x40/0x7a()
[    0.000000] Hardware name: Sun Fire x4800
[    0.000000] memblock_x86_reserve_range: wrong range [0xffffffff37000000, 0x137000000)
[    0.000000] Modules linked in:
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.36-rc5-tip-yh-01876-g1cac214-dirty #59
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff82816f7e>] ? memblock_x86_reserve_range+0x40/0x7a
[    0.000000]  [<ffffffff81078c2d>] warn_slowpath_common+0x85/0x9e
[    0.000000]  [<ffffffff81078d38>] warn_slowpath_fmt+0x6e/0x70
[    0.000000]  [<ffffffff8281e77c>] ? memblock_find_region+0x40/0x78
[    0.000000]  [<ffffffff8281eb1f>] ? memblock_find_base+0x9a/0xb9
[    0.000000]  [<ffffffff82816f7e>] memblock_x86_reserve_range+0x40/0x7a
[    0.000000]  [<ffffffff8280452c>] setup_arch+0x99d/0xb2a
[    0.000000]  [<ffffffff810a3e02>] ? trace_hardirqs_off+0xd/0xf
[    0.000000]  [<ffffffff81cec7d8>] ? _raw_spin_unlock_irqrestore+0x3d/0x4c
[    0.000000]  [<ffffffff827ffcec>] start_kernel+0xde/0x3f1
[    0.000000]  [<ffffffff827ff2d4>] x86_64_start_reservations+0xa0/0xa4
[    0.000000]  [<ffffffff827ff3de>] x86_64_start_kernel+0x106/0x10d
[    0.000000] ---[ end trace a7919e7f17c0a725 ]---
[    0.000000] Reserving 8192MB of memory at 17592186041200MB for crashkernel (System RAM: 526336MB)

This is caused by a wraparound in the test due to size > end;
explicitly check for this condition and fail.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CAA4DD3.1080401@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-10-05 21:45:35 -07:00
Hugh Dickins
4e31635c36 ksm: fix bad user data when swapping
Building under memory pressure, with KSM on 2.6.36-rc5, collapsed with
an internal compiler error: typically indicating an error in swapping.

Perhaps there's a timing issue which makes it now more likely, perhaps
it's just a long time since I tried for so long: this bug goes back to
KSM swapping in 2.6.33.

Notice how reuse_swap_page() allows an exclusive page to be reused, but
only does SetPageDirty if it can delete it from swap cache right then -
if it's currently under Writeback, it has to be left in cache and we
don't SetPageDirty, but the page can be reused.  Fine, the dirty bit
will get set in the pte; but notice how zap_pte_range() does not bother
to transfer pte_dirty to page_dirty when unmapping a PageAnon.

If KSM chooses to share such a page, it will look like a clean copy of
swapcache, and not be written out to swap when its memory is needed;
then stale data read back from swap when it's needed again.

We could fix this in reuse_swap_page() (or even refuse to reuse a
page under writeback), but it's more honest to fix my oversight in
KSM's write_protect_page().  Several days of testing on three machines
confirms that this fixes the issue they showed.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-04 11:09:53 -07:00
Hugh Dickins
4829b906cc ksm: fix page_address_in_vma anon_vma oops
2.6.36-rc1 commit 21d0d443cd "rmap:
resurrect page_address_in_vma anon_vma check" was right to resurrect
that check; but now that it's comparing anon_vma->roots instead of
just anon_vmas, there's a danger of oopsing on a NULL anon_vma.

In most cases no NULL anon_vma ever gets here; but it turns out that
occasionally KSM, when enabled on a forked or forking process, will
itself call page_address_in_vma() on a "half-KSM" page left over from
an earlier failed attempt to merge - whose page_anon_vma() is NULL.

It's my bug that those should be getting here at all: I thought they
were already dealt with, this oops proves me wrong, I'll fix it in
the next release - such pages are effectively pinned until their
process exits, since rmap cannot find their ptes (though swapoff can).

For now just work around it by making page_address_in_vma() safe (and
add a comment on why that check is wanted anyway).  A similar check
in __page_check_anon_rmap() is safe because do_page_add_anon_rmap()
already excluded KSM pages.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-04 11:09:53 -07:00
Larry Woodman
5ec1055aa5 Avoid pgoff overflow in remap_file_pages
Thomas Pollet noticed that the remap_file_pages() system call in
fremap.c has a potential overflow in the first part of the if statement
below, which could cause it to process bogus input parameters.
Specifically the pgoff + size parameters could be wrap thereby
preventing the system call from failing when it should.

Reported-by: Thomas Pollet <thomas.pollet@gmail.com>
Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-25 09:34:58 -07:00
Linus Torvalds
e92b05dec8 fremap: get rid of broken 'end' variable
Thomas Pollet points out that the 'end' variable is broken.  It was
computed based on start/size before they were page-aligned, and as such
doesn't actually match any of the other actions we take.  The overflow
test on end was also redundant, since we had already tested it with the
properly aligned version.

So just get rid of it entirely.  The one remaining use for that broken
variable can just use 'start+size' like all the other cases already did.

Reported-by: Thomas Pollet <thomas.pollet@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-24 14:13:57 -07:00
Naoya Horiguchi
a850ea3037 hugetlb, rmap: add BUG_ON(!PageLocked) in hugetlb_add_anon_rmap()
Confirming page lock is held in hugetlb_add_anon_rmap() may be useful
to detect possible future problems.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23 17:29:19 -07:00
Naoya Horiguchi
56c9cfb13c hugetlb, rmap: fix confusing page locking in hugetlb_cow()
The "if (!trylock_page)" block in the avoidcopy path of hugetlb_cow()
looks confusing and is buggy.  Originally this trylock_page() was
intended to make sure that old_page is locked even when old_page !=
pagecache_page, because then only pagecache_page is locked.

This patch fixes it by moving page locking into hugetlb_fault().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23 17:29:18 -07:00
Naoya Horiguchi
cd67f0d2a9 hugetlb, rmap: use hugepage_add_new_anon_rmap() in hugetlb_cow()
Obviously, setting anon_vma for COWed hugepage should be done
by hugepage_add_new_anon_rmap() to scan vmas faster.
This patch fixes it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23 17:29:18 -07:00
Naoya Horiguchi
433abed6c6 hugetlb, rmap: always use anon_vma root pointer
This patch applies Andrea's fix given by the following patch into hugepage
rmapping code:

  commit 288468c334
  Author: Andrea Arcangeli <aarcange@redhat.com>
  Date:   Mon Aug 9 17:19:09 2010 -0700

This patch uses anon_vma->root and avoids unnecessary overwriting when
anon_vma is already set up.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23 17:29:18 -07:00
Linus Torvalds
57aebd7739 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: fix pcpu_last_unit_cpu
2010-09-23 08:06:55 -07:00
Andrea Arcangeli
2aeadc30de mmap: call unlink_anon_vmas() in __split_vma() in case of error
If __split_vma fails because of an out of memory condition the
anon_vma_chain isn't teardown and freed potentially leading to rmap walks
accessing freed vma information plus there's a memleak.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:40 -07:00
David Rientjes
e85bfd3aa7 oom: filter unkillable tasks from tasklist dump
/proc/sys/vm/oom_dump_tasks is enabled by default, so it's necessary to
limit as much information as possible that it should emit.

The tasklist dump should be filtered to only those tasks that are eligible
for oom kill.  This is already done for memcg ooms, but this patch extends
it to both cpuset and mempolicy ooms as well as init.

In addition to suppressing irrelevant information, this also reduces
confusion since users currently don't know which tasks in the tasklist
aren't eligible for kill (such as those attached to cpusets or bound to
mempolicies with a disjoint set of mems or nodes, respectively) since that
information is not shown.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:39 -07:00
Minchan Kim
d1908362ae vmscan: check all_unreclaimable in direct reclaim path
M.  Vefa Bicakci reported 2.6.35 kernel hang up when hibernation on his
32bit 3GB mem machine.
(https://bugzilla.kernel.org/show_bug.cgi?id=16771). Also he bisected
the regression to

  commit bb21c7ce18
  Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
  Date:   Fri Jun 4 14:15:05 2010 -0700

     vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim failure

At first impression, this seemed very strange because the above commit
only chenged function return value and hibernate_preallocate_memory()
ignore return value of shrink_all_memory().  But it's related.

Now, page allocation from hibernation code may enter infinite loop if the
system has highmem.  The reasons are that vmscan don't care enough OOM
case when oom_killer_disabled.

The problem sequence is following as.

1. hibernation
2. oom_disable
3. alloc_pages
4. do_try_to_free_pages
       if (scanning_global_lru(sc) && !all_unreclaimable)
               return 1;

If kswapd is not freozen, it would set zone->all_unreclaimable to 1 and
then shrink_zones maybe return true(ie, all_unreclaimable is true).  So at
last, alloc_pages could go to _nopage_.  If it is, it should have no
problem.

This patch adds all_unreclaimable check to protect in direct reclaim path,
too.  It can care of hibernation OOM case and help bailout
all_unreclaimable case slightly.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: M. Vefa Bicakci <bicave@superonline.com>
Reported-by: <caiqian@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: <caiqian@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:39 -07:00
David Rientjes
f19e8aa11a oom: always return a badness score of non-zero for eligible tasks
A task's badness score is roughly a proportion of its rss and swap
compared to the system's capacity.  The scale ranges from 0 to 1000 with
the highest score chosen for kill.  Thus, this scale operates on a
resolution of 0.1% of RAM + swap.  Admin tasks are also given a 3% bonus,
so the badness score of an admin task using 3% of memory, for example,
would still be 0.

It's possible that an exceptionally large number of tasks will combine to
exhaust all resources but never have a single task that uses more than
0.1% of RAM and swap (or 3.0% for admin tasks).

This patch ensures that the badness score of any eligible task is never 0
so the machine doesn't unnecessarily panic because it cannot find a task
to kill.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-22 17:22:38 -07:00
Linus Torvalds
b68e9d4581 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friends
  char: Mark /dev/zero and /dev/kmem as not capable of writeback
  bdi: Initialize noop_backing_dev_info properly
  cfq-iosched: fix a kernel OOPs when usb key is inserted
  block: fix blk_rq_map_kern bio direction flag
  cciss: freeing uninitialized data on error path
2010-09-22 09:12:37 -07:00
Jan Kara
976e48f8a5 bdi: Initialize noop_backing_dev_info properly
Properly initialize this backing dev info so that writeback code does not
barf when getting to it e.g. via sb->s_bdi.

Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-22 09:48:47 +02:00
Tejun Heo
46b30ea9bc percpu: fix pcpu_last_unit_cpu
pcpu_first/last_unit_cpu are used to track which cpu has the first and
last units assigned.  This in turn is used to determine the span of a
chunk for man/unmap cache flushes and whether an address belongs to
the first chunk or not in per_cpu_ptr_to_phys().

When the number of possible CPUs isn't power of two, a chunk may
contain unassigned units towards the end of a chunk.  The logic to
determine pcpu_last_unit_cpu was incorrect when there was an unused
unit at the end of a chunk.  It failed to ignore the unused unit and
assigned the unused marker NR_CPUS to pcpu_last_unit_cpu.

This was discovered through kdump failure which was caused by
malfunctioning per_cpu_ptr_to_phys() on a kvm setup with 50 possible
CPUs by CAI Qian.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: stable@kernel.org
2010-09-21 08:12:25 +02:00
Hugh Dickins
31c4a3d3a0 mm: further fix swapin race condition
Commit 4969c1192d ("mm: fix swapin race condition") is now agreed to
be incomplete.  There's a race, not very much less likely than the
original race envisaged, in which it is further necessary to check that
the swapcache page's swap has not changed.

Here's the reasoning: cast in terms of reuse_swap_page(), but probably
could be reformulated to rely on try_to_free_swap() instead, or on
swapoff+swapon.

A, faults into do_swap_page(): does page1 = lookup_swap_cache(swap1) and
comes through the lock_page(page1).

B, a racing thread of the same process, faults on the same address: does
page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but
for whatever reason is unlucky not to get the lock any time soon.

A carries on through do_swap_page(), a write fault, but cannot reuse the
swap page1 (another reference to swap1).  Unlocks the page1 (but B
doesn't get it yet), does COW in do_wp_page(), page2 now in that pte.

C, perhaps the parent of A+B, comes in and write faults the same swap
page1 into its mm, reuse_swap_page() succeeds this time, swap1 is freed.

kswapd comes in after some time (B still unlucky) and swaps out some
pages from A+B and C: it allocates the original swap1 to page2 in A+B,
and some other swap2 to the original page1 now in C.  But does not
immediately free page1 (actually it couldn't: B holds a reference),
leaving it in swap cache for now.

B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes.
Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been
given the swap1 which page1 used to have.  So B proceeds to insert page1
into A+B's page_table, though its content now belongs to C, quite
different from what A wrote there.

B ought to have checked that page1's swap was still swap1.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-20 10:44:37 -07:00
Cliff Wickman
3ee48b6af4 mm, x86: Saving vmcore with non-lazy freeing of vmas
During the reading of /proc/vmcore the kernel is doing
ioremap()/iounmap() repeatedly. And the buildup of un-flushed
vm_area_struct's is causing a great deal of overhead. (rb_next()
is chewing up most of that time).

This solution is to provide function set_iounmap_nonlazy(). It
causes a subsequent call to iounmap() to immediately purge the
vma area (with try_purge_vmap_area_lazy()).

With this patch we have seen the time for writing a 250MB
compressed dump drop from 71 seconds to 44 seconds.

Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kexec@lists.infradead.org
Cc: <stable@kernel.org>
LKML-Reference: <E1OwHZ4-0005WK-Tw@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-17 09:11:56 +02:00
Yinghai Lu
3661ca66a4 memblock: Fix section mismatch warnings
Stephen found a bunch of section mismatch warnings with the
new memblock changes.

Use __init_memblock to replace __init in memblock.c and remove
__init in memblock.h. We should not use __init in header files.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yinghai Lu <Yinghai@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <4C912709.2090201@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-15 22:17:13 +02:00
Linus Torvalds
ff3cb3fec3 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: Range check cpu in blk_cpu_to_group
  scatterlist: prevent invalid free when alloc fails
  writeback: Fix lost wake-up shutting down writeback thread
  writeback: do not lose wakeup events when forking bdi threads
  cciss: fix reporting of max queue depth since init
  block: switch s390 tape_block and mg_disk to elevator_change()
  block: add function call to switch the IO scheduler from a driver
  fs/bio-integrity.c: return -ENOMEM on kmalloc failure
  bio-integrity.c: remove dependency on __GFP_NOFAIL
  BLOCK: fix bio.bi_rw handling
  block: put dev->kobj in blk_register_queue fail path
  cciss: handle allocation failure
  cfq-iosched: Documentation help for new tunables
  cfq-iosched: blktrace print per slice sector stats
  cfq-iosched: Implement tunable group_idle
  cfq-iosched: Do group share accounting in IOPS when slice_idle=0
  cfq-iosched: Do not idle if slice_idle=0
  cciss: disable doorbell reset on reset_devices
  blkio: Fix return code for mkdir calls
2010-09-10 07:26:27 -07:00
Mel Gorman
9ee493ce0a mm: page allocator: drain per-cpu lists after direct reclaim allocation fails
When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page.  If it fails and no
further progress is made, it's possible the system will go OOM.  However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process.  This leads to a process entering direct reclaim more often than
it should increasing the pressure on the system and compounding the
problem.

This patch notes that if direct reclaim is making progress but allocations
are still failing that the system is already under heavy pressure.  In
this case, it drains the per-cpu lists and tries the allocation a second
time before continuing.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Christoph Lameter
aa45484031 mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists.  To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold.  On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is much higher than
number of real free page in buddy, the VM can allocate pages below min
watermark, at worst reducing the real number of pages to zero.  Even if
the OOM killer kills some victim for freeing memory, it may not free
memory if the exit path requires a new page resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken.  The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Mel Gorman
72853e2991 mm: page allocator: update free page counters after pages are placed on the free list
When allocating a page, the system uses NR_FREE_PAGES counters to
determine if watermarks would remain intact after the allocation was made.
This check is made without interrupts disabled or the zone lock held and
so is race-prone by nature.  Unfortunately, when pages are being freed in
batch, the counters are updated before the pages are added on the list.
During this window, the counters are misleading as the pages do not exist
yet.  When under significant pressure on systems with large numbers of
CPUs, it's possible for processes to make progress even though they should
have been stalled.  This is particularly problematic if a number of the
processes are using GFP_ATOMIC as the min watermark can be accidentally
breached and in extreme cases, the system can livelock.

This patch updates the counters after the pages have been added to the
list.  This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.

[akpm@linux-foundation.org: avoid modifying incoming args]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
KAMEZAWA Hiroyuki
5ee28a4476 vmstat: update zone stat threshold when onlining a cpu
refresh_zone_stat_thresholds() calculates parameter based on the number of
online cpus.  It's called at cpu offlining but needs to be called at
onlining, too.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Hugh Dickins
3399446632 swap: discard while swapping only if SWAP_FLAG_DISCARD
Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
show a shift since I last tested in December: in part because of firmware
updates, in part because of the necessary move from barriers to awaiting
completion at the block layer.  While discard at swapon still shows as
slightly beneficial on both, discarding 1MB swap cluster when allocating
is now disadvanteous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV).

Surrender: discard as presently implemented is more hindrance than help
for swap; but might prove useful on other devices, or with improvements.
So continue to do the discard at swapon, but make discard while swapping
conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using
only the lower 16 bits of int flags).

We can add a --discard or -d to swapon(8), and a "discard" to swap in
/etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Christoph Hellwig
8f2ae0faa3 swap: do not send discards as barriers
The swap code already uses synchronous discards, no need to add I/O
barriers.

This fixes the worst of the terrible slowdown in swap allocation for
hibernation, reported on 2.6.35 by Nigel Cunningham; but does not entirely
eliminate that regression.

[tj@kernel.org: superflous newlines removed]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Hugh Dickins
b73d7fcecd swap: prevent reuse during hibernation
Move the hibernation check from scan_swap_map() into try_to_free_swap():
to catch not only the common case when hibernation's allocation itself
triggers swap reuse, but also the less likely case when concurrent page
reclaim (shrink_page_list) might happen to try_to_free_swap from a page.

Hibernation already clears __GFP_IO from the gfp_allowed_mask, to stop
reclaim from going to swap: check that to prevent swap reuse too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Hugh Dickins
910321ea81 swap: revert special hibernation allocation
Please revert 2.6.36-rc commit d2997b1042
"hibernation: freeze swap at hibernation".  It complicated matters by
adding a second swap allocation path, just for hibernation; without in any
way fixing the issue that it was intended to address - page reclaim after
fixing the hibernation image might free swap from a page already imaged as
swapcache, letting its swap be reallocated to store a different page of
the image: resulting in data corruption if the imaged page were freed as
clean then swapped back in.  Pages freed to si->swap_map were still in
danger of being reallocated by the alternative allocation path.

I guess it inadvertently fixed slow SSD swap allocation for hibernation,
as reported by Nigel Cunningham: by missing out the discards that occur on
the usual swap allocation path; but that was unintentional, and needs a
separate fix.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
Gary King
ac8456d6f9 bounce: call flush_dcache_page() after bounce_copy_vec()
I have been seeing problems on Tegra 2 (ARMv7 SMP) systems with HIGHMEM
enabled on 2.6.35 (plus some patches targetted at 2.6.36 to perform cache
maintenance lazily), and the root cause appears to be that the mm bouncing
code is calling flush_dcache_page before it copies the bounce buffer into
the bio.

The bounced page needs to be flushed after data is copied into it, to
ensure that architecture implementations can synchronize instruction and
data caches if necessary.

Signed-off-by: Gary King <gking@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:25 -07:00
KAMEZAWA Hiroyuki
0dcc48c15f memory hotplug: fix next block calculation in is_removable
next_active_pageblock() is for finding next _used_ freeblock.  It skips
several blocks when it finds there are a chunk of free pages lager than
pageblock.  But it has 2 bugs.

  1. We have no lock. page_order(page) - pageblock_order can be minus.
  2. pageblocks_stride += is wrong. it should skip page_order(p) of pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Minchan Kim
bc69304574 mm: compaction: handle active and inactive fairly in too_many_isolated
Iram reported that compaction's too_many_isolated() loops forever.
(http://www.spinics.net/lists/linux-mm/msg08123.html)

The meminfo when the situation happened was inactive anon is zero.  That's
because the system has no memory pressure until then.  While all anon
pages were in the active lru, compaction could select active lru as well
as inactive lru.  That's a different thing from vmscan's isolated.  So we
has been two too_many_isolated.

While compaction can isolate pages in both active and inactive, current
implementation of too_many_isolated only considers inactive.  It made
Iram's problem.

This patch handles active and inactive fairly.  That's because we can't
expect where from and how many compaction would isolated pages.

This patch changes (nr_isolated > nr_inactive) with
nr_isolated > (nr_active + nr_inactive) / 2.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: Iram Shahzad <iram.shahzad@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00
Andrea Arcangeli
152e0659fc mm: avoid warning when COMPACTION is selected
COMPACTION enables MIGRATION, but MIGRATION spawns a warning if numa or
memhotplug aren't selected.  However MIGRATION doesn't depend on them.  I
guess it's just trying to be strict doing a double check on who's enabling
it, but it doesn't know that compaction also enables MIGRATION.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09 18:57:24 -07:00