1
Commit Graph

4589 Commits

Author SHA1 Message Date
Linus Torvalds
a1e8fad590 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: Fix a crash during slabinfo -v
  tracing/slab: Move kmalloc tracepoint out of inline code
  slub: Fix slub_lock down/up imbalance
  slub: Fix build breakage in Documentation/vm
  slub tracing: move trace calls out of always inlined functions to reduce kernel code size
  slub: move slabinfo.c to tools/slub/slabinfo.c
2011-01-10 08:38:01 -08:00
Pekka Enberg
a45b0616e7 Merge branch 'slab/next' into for-linus 2011-01-09 11:05:53 +02:00
Linus Torvalds
72eb6a7914 Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
  gameport: use this_cpu_read instead of lookup
  x86: udelay: Use this_cpu_read to avoid address calculation
  x86: Use this_cpu_inc_return for nmi counter
  x86: Replace uses of current_cpu_data with this_cpu ops
  x86: Use this_cpu_ops to optimize code
  vmstat: User per cpu atomics to avoid interrupt disable / enable
  irq_work: Use per cpu atomics instead of regular atomics
  cpuops: Use cmpxchg for xchg to avoid lock semantics
  x86: this_cpu_cmpxchg and this_cpu_xchg operations
  percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
  percpu,x86: relocate this_cpu_add_return() and friends
  connector: Use this_cpu operations
  xen: Use this_cpu_inc_return
  taskstats: Use this_cpu_ops
  random: Use this_cpu_inc_return
  fs: Use this_cpu_inc_return in buffer.c
  highmem: Use this_cpu_xx_return() operations
  vmstat: Use this_cpu_inc_return for vm statistics
  x86: Support for this_cpu_add, sub, dec, inc_return
  percpu: Generic support for this_cpu_add, sub, dec, inc_return
  ...

Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
as per Tejun.
2011-01-07 17:02:58 -08:00
Linus Torvalds
23d69b09b7 Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
  usb: don't use flush_scheduled_work()
  speedtch: don't abuse struct delayed_work
  media/video: don't use flush_scheduled_work()
  media/video: explicitly flush request_module work
  ioc4: use static work_struct for ioc4_load_modules()
  init: don't call flush_scheduled_work() from do_initcalls()
  s390: don't use flush_scheduled_work()
  rtc: don't use flush_scheduled_work()
  mmc: update workqueue usages
  mfd: update workqueue usages
  dvb: don't use flush_scheduled_work()
  leds-wm8350: don't use flush_scheduled_work()
  mISDN: don't use flush_scheduled_work()
  macintosh/ams: don't use flush_scheduled_work()
  vmwgfx: don't use flush_scheduled_work()
  tpm: don't use flush_scheduled_work()
  sonypi: don't use flush_scheduled_work()
  hvsi: don't use flush_scheduled_work()
  xen: don't use flush_scheduled_work()
  gdrom: don't use flush_scheduled_work()
  ...

Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
as per Tejun.
2011-01-07 16:58:04 -08:00
Nick Piggin
fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin
b5c84bf6f6 fs: dcache remove dcache_lock
dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
Nick Piggin
ccd35fb9f4 kernel: kmem_ptr_validate considered harmful
This is a nasty and error prone API. It is no longer used, remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:16 +11:00
KAMEZAWA Hiroyuki
ebb76ce16d memcg: fix wrong VM_BUG_ON() in try_charge()'s mm->owner check
At __mem_cgroup_try_charge(), VM_BUG_ON(!mm->owner) is checked.
But as commented in mem_cgroup_from_task(), mm->owner can be NULL
in some racy case. This check of VM_BUG_ON() is bad.

A possible story to hit this is at swapoff()->try_to_unuse(). It passes
mm_struct to mem_cgroup_try_charge_swapin() while mm->owner is NULL. If we
can't get proper mem_cgroup from swap_cgroup information, mm->owner is used
as charge target and we see NULL.

Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-30 10:07:06 -08:00
Linus Torvalds
ffc96d628b Merge branch 'nommu-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/nommu-2.6
* 'nommu-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/nommu-2.6:
  nommu: Provide stubbed alloc/free_vm_area() implementation.
  nommu: Fix up vmalloc_node() symbol export regression.
2010-12-27 10:36:27 -08:00
Linus Torvalds
a4790c9457 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: print out alloc information with KERN_DEBUG instead of KERN_INFO
  kthread_work: make lockdep happy
2010-12-24 12:59:09 -08:00
Paul Mundt
29c185e5c6 nommu: Provide stubbed alloc/free_vm_area() implementation.
Now that these have been introduced in to the vmalloc API, sync up the
nommu side of things. At present we don't deal with VMAs as such, so for
the time being these will simply BUG() out. In the future it should be
possible to support this interface by layering on top of the vm_regions.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2010-12-24 12:08:30 +09:00
Paul Mundt
9a14f653df nommu: Fix up vmalloc_node() symbol export regression.
Commit e1ca778 ("mm: add vzalloc() and vzalloc_node() helpers") ended up
accidentally deleting the vmalloc_node() symbol export, resulting in:

"vmalloc_node" [net/core/pktgen.ko] undefined!
"vmalloc_node" [net/netfilter/x_tables.ko] undefined!

regressions.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2010-12-24 11:50:34 +09:00
Michal Nazarewicz
0d1836c366 mm/migrate.c: fix compilation error
GCC complained about update_mmu_cache() not being defined in migrate.c.
Including <asm/tlbflush.h> seems to solve the problem.

Signed-off-by: Michal Nazarewicz <m.nazarewicz@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22 19:43:34 -08:00
Wu Fengguang
d153ba6445 writeback: do uninterruptible sleep in balance_dirty_pages()
Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop.  Plus we
*do* want these processes to appear in D state and to contribute to load
average.

So it should be TASK_UNINTERRUPTIBLE.                 -- Andrew Morton

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22 19:43:33 -08:00
Minchan Kim
dd9e5efe3a mm/compaction.c: avoid double mem_cgroup_del_lru()
del_page_from_lru_list() already called mem_cgroup_del_lru().  So we must
not call it again.  It adds unnecessary overhead.

It was not a runtime bug because the TestClearPageCgroupAcctLRU() early in
mem_cgroup_del_lru_list() will prevent any double-deletion, etc.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22 19:43:33 -08:00
Tejun Heo
bcbea798f8 percpu: print out alloc information with KERN_DEBUG instead of KERN_INFO
Now that percpu allocator is mostly stable, there is no reason to
print alloc information with KERN_INFO and clutter the boot messages.
Switch it to KERN_DEBUG.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mike Travis <travis@sgi.com>
2010-12-22 14:19:14 +01:00
Christoph Lameter
7c83912062 vmstat: User per cpu atomics to avoid interrupt disable / enable
Currently the operations to increment vm counters must disable interrupts
in order to not mess up their housekeeping of counters.

So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
count on preremption being disabled we still have some minor issues.
The fetching of the counter thresholds is racy.
A threshold from another cpu may be applied if we happen to be
rescheduled on another cpu.  However, the following vmstat operation
will then bring the counter again under the threshold limit.

The operations for __xxx_zone_state are not changed since the caller
has taken care of the synchronization needs (and therefore the cycle
count is even less than the optimized version for the irq disable case
provided here).

The optimization using this_cpu_cmpxchg will only be used if the arch
supports efficient this_cpu_ops (must have CONFIG_CMPXCHG_LOCAL set!)

The use of this_cpu_cmpxchg reduces the cycle count for the counter
operations by %80 (inc_zone_page_state goes from 170 cycles to 32).

Signed-off-by: Christoph Lameter <cl@linux.com>
2010-12-18 15:54:49 +01:00
Christoph Lameter
908ee0f122 vmstat: Use this_cpu_inc_return for vm statistics
this_cpu_inc_return() saves us a memory access there. Code
size does not change.

V1->V2:
	- Fixed the location of the __per_cpu pointer attributes
	- Sparse checked
V2->V3:
	- Move fixes to __percpu attribute usage to earlier patch

Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-17 15:18:04 +01:00
Tejun Heo
275c8b9328 Merge branch 'this_cpu_ops' into for-2.6.38 2010-12-17 15:16:46 +01:00
Christoph Lameter
909ea96468 core: Replace __get_cpu_var with __this_cpu_read if not used for an address.
__get_cpu_var() can be replaced with this_cpu_read and will then use a
single read instruction with implied address calculation to access the
correct per cpu instance.

However, the address of a per cpu variable passed to __this_cpu_read()
cannot be determined (since it's an implied address conversion through
segment prefixes).  Therefore apply this only to uses of __get_cpu_var
where the address of the variable is not used.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-17 15:07:19 +01:00
Christoph Lameter
12938a9220 vmstat: Optimize zone counter modifications through the use of this cpu operations
this cpu operations can be used to slightly optimize the function. The
changes will avoid some address calculations and replace them with the
use of the percpu segment register.

If one would have this_cpu_inc_return and this_cpu_dec_return then it
would be possible to optimize inc_zone_page_state and
dec_zone_page_state even more.

V1->V2:
	- Fix __dec_zone_state overflow handling
	- Use s8 variables for temporary storage.

V2->V3:
	- Put __percpu annotations in correct places.

Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-17 15:07:18 +01:00
Tavis Ormandy
462e635e5b install_special_mapping skips security_file_mmap check.
The install_special_mapping routine (used, for example, to setup the
vdso) skips the security check before insert_vm_struct, allowing a local
attacker to bypass the mmap_min_addr security restriction by limiting
the available pages for special mappings.

bprm_mm_init() also skips the check, and although I don't think this can
be used to bypass any restrictions, I don't see any reason not to have
the security check.

  $ uname -m
  x86_64
  $ cat /proc/sys/vm/mmap_min_addr
  65536
  $ cat install_special_mapping.s
  section .bss
      resb BSS_SIZE
  section .text
      global _start
      _start:
          mov     eax, __NR_pause
          int     0x80
  $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
  $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
  $ ./install_special_mapping &
  [1] 14303
  $ cat /proc/14303/maps
  0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
  00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
  00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]

It's worth noting that Red Hat are shipping with mmap_min_addr set to
4096.

Signed-off-by: Tavis Ormandy <taviso@google.com>
Acked-by: Kees Cook <kees@ubuntu.com>
Acked-by: Robert Swiecki <swiecki@google.com>
[ Changed to not drop the error code - akpm ]
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-15 12:30:36 -08:00
Tejun Heo
afe2c511fb workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync()
cancel_rearming_delayed_work[queue]() has been superceded by
cancel_delayed_work_sync() quite some time ago.  Convert all the
in-kernel users.  The conversions are completely equivalent and
trivial.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: netdev@vger.kernel.org
Cc: Anton Vorontsov <cbou@mail.ru>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netfilter-devel@vger.kernel.org
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org
2010-12-15 10:56:11 +01:00
Linus Torvalds
38971ce2fa Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix panic after nfs_umount()
  nfs: remove extraneous and problematic calls to nfs_clear_request
  nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
  NFS: Fix fcntl F_GETLK not reporting some conflicts
  nfs: Discard ACL cache on mode update
  NFS: Readdir cleanups
  NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
  NFS: Fix a memory leak in nfs_readdir
  Call the filesystem back whenever a page is removed from the page cache
  NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
2010-12-14 08:51:12 -08:00
Jesper Juhl
7af4c09324 percpu: zero memory more efficiently in mm/percpu.c::pcpu_mem_alloc()
Don't do vmalloc() + memset() when vzalloc() will do.

tj: dropped unnecessary temp variable ptr.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-07 14:47:23 +01:00
Linus Torvalds
7787d2c2f4 Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM / Hibernate: Fix memory corruption related to swap
  PM / Hibernate: Use async I/O when reading compressed hibernation image
2010-12-06 15:51:14 -08:00
Rafael J. Wysocki
c9e664f1fd PM / Hibernate: Fix memory corruption related to swap
There is a problem that swap pages allocated before the creation of
a hibernation image can be released and used for storing the contents
of different memory pages while the image is being saved.  Since the
kernel stored in the image doesn't know of that, it causes memory
corruption to occur after resume from hibernation, especially on
systems with relatively small RAM that need to swap often.

This issue can be addressed by keeping the GFP_IOFS bits clear
in gfp_allowed_mask during the entire hibernation, including the
saving of the image, until the system is finally turned off or
the hibernation is aborted.  Unfortunately, for this purpose
it's necessary to rework the way in which the hibernate and
suspend code manipulates gfp_allowed_mask.

This change is based on an earlier patch from Hugh Dickins.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Ondrej Zary <linux@rainbow-software.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: stable@kernel.org
2010-12-06 23:52:08 +01:00
Linus Torvalds
771f8bc71c Merge branch 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: Fix a crash during slabinfo -v
2010-12-05 16:45:02 -08:00
Tero Roponen
37d57443d5 slub: Fix a crash during slabinfo -v
Commit f7cb193362 ("SLUB: Pass active
and inactive redzone flags instead of boolean to debug functions")
missed two instances of check_object(). This caused a lot of warnings
during 'slabinfo -v' finally leading to a crash:

  BUG ext4_xattr: Freepointer corrupt
  ...
  BUG buffer_head: Freepointer corrupt
  ...
  BUG ext4_alloc_context: Freepointer corrupt
  ...
  ...
  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35
  PGD 79d78067 PUD 79e67067 PMD 0
  Oops: 0002 [#1] SMP
  last sysfs file: /sys/kernel/slab/:t-0000192/validate

This patch fixes the problem by converting the two missed instances.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-12-04 09:53:49 +02:00
Tero Roponen
8165984acf slub: Fix a crash during slabinfo -v
Commit f7cb193362 ("SLUB: Pass active
and inactive redzone flags instead of boolean to debug functions")
missed two instances of check_object(). This caused a lot of warnings
during 'slabinfo -v' finally leading to a crash:

  BUG ext4_xattr: Freepointer corrupt
  ...
  BUG buffer_head: Freepointer corrupt
  ...
  BUG ext4_alloc_context: Freepointer corrupt
  ...
  ...
  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: [<ffffffff810a291f>] file_sb_list_del+0x1c/0x35
  PGD 79d78067 PUD 79e67067 PMD 0
  Oops: 0002 [#1] SMP
  last sysfs file: /sys/kernel/slab/:t-0000192/validate

This patch fixes the problem by converting the two missed instances.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-12-04 09:40:16 +02:00
KOSAKI Motohiro
a0b0f58cdd ksm: annotate ksm_thread_mutex is no deadlock source
commit 62b61f611e ("ksm: memory hotremove migration only") caused the
following new lockdep warning.

  =======================================================
  [ INFO: possible circular locking dependency detected ]
  -------------------------------------------------------
  bash/1621 is trying to acquire lock:
   ((memory_chain).rwsem){.+.+.+}, at: [<ffffffff81079339>]
  __blocking_notifier_call_chain+0x69/0xc0

  but task is already holding lock:
   (ksm_thread_mutex){+.+.+.}, at: [<ffffffff8113a3aa>]
  ksm_memory_callback+0x3a/0xc0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (ksm_thread_mutex){+.+.+.}:
       [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
       [<ffffffff81505d74>] __mutex_lock_common+0x44/0x3f0
       [<ffffffff81506228>] mutex_lock_nested+0x48/0x60
       [<ffffffff8113a3aa>] ksm_memory_callback+0x3a/0xc0
       [<ffffffff8150c21c>] notifier_call_chain+0x8c/0xe0
       [<ffffffff8107934e>] __blocking_notifier_call_chain+0x7e/0xc0
       [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
       [<ffffffff813afbfb>] memory_notify+0x1b/0x20
       [<ffffffff81141b7c>] remove_memory+0x1cc/0x5f0
       [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
       [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
       [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
       [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
       [<ffffffff8114f398>] vfs_write+0xc8/0x190
       [<ffffffff8114fc14>] sys_write+0x54/0x90
       [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b

  -> #0 ((memory_chain).rwsem){.+.+.+}:
       [<ffffffff8108b5ba>] __lock_acquire+0x155a/0x1600
       [<ffffffff8108b70a>] lock_acquire+0xaa/0x140
       [<ffffffff81506601>] down_read+0x51/0xa0
       [<ffffffff81079339>] __blocking_notifier_call_chain+0x69/0xc0
       [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20
       [<ffffffff813afbfb>] memory_notify+0x1b/0x20
       [<ffffffff81141f1e>] remove_memory+0x56e/0x5f0
       [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0
       [<ffffffff813afd62>] store_mem_state+0xe2/0xf0
       [<ffffffff813a0bb0>] sysdev_store+0x20/0x30
       [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170
       [<ffffffff8114f398>] vfs_write+0xc8/0x190
       [<ffffffff8114fc14>] sys_write+0x54/0x90
       [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b

But it's a false positive.  Both memory_chain.rwsem and ksm_thread_mutex
have an outer lock (mem_hotplug_mutex).  So they cannot deadlock.

Thus, This patch annotate ksm_thread_mutex is not deadlock source.

[akpm@linux-foundation.org: update comment, from Hugh]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02 14:51:15 -08:00
KOSAKI Motohiro
20d6c96b5f mem-hotplug: introduce {un}lock_memory_hotplug()
Presently hwpoison is using lock_system_sleep() to prevent a race with
memory hotplug.  However lock_system_sleep() is a no-op if
CONFIG_HIBERNATION=n.  Therefore we need a new lock.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02 14:51:15 -08:00
Jeremy Fitzhardinge
64141da587 vmalloc: eagerly clear ptes on vunmap
On stock 2.6.37-rc4, running:

  # mount lilith:/export /mnt/lilith
  # find  /mnt/lilith/ -type f -print0 | xargs -0 file

crashes the machine fairly quickly under Xen.  Often it results in oops
messages, but the couple of times I tried just now, it just hung quietly
and made Xen print some rude messages:

    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
    3000000000000000) for mfn 1d7058 (pfn 18fa7)
    (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
    1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
    (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04

Which means the domain tried to map a pagetable page RW, which would
allow it to map arbitrary memory, so Xen stopped it.  This is because
vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
finished with them, and those pages got recycled as pagetable pages
while still having these RW aliases.

Removing those mappings immediately removes the Xen-visible aliases, and
so it has no problem with those pages being reused as pagetable pages.
Deferring the TLB flush doesn't upset Xen because it can flush the TLB
itself as needed to maintain its invariants.

When unmapping a region in the vmalloc space, clear the ptes
immediately.  There's no point in deferring this because there's no
amortization benefit.

The TLBs are left dirty, and they are flushed lazily to amortize the
cost of the IPIs.

This specific motivation for this patch is an oops-causing regression
since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
of vm_map_ram() introduced in 56e4ebf877 ("NFS: readdir with vmapped
pages") .  XFS also uses vm_map_ram() and could cause similar problems.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02 14:51:15 -08:00
Wu Fengguang
e172662d11 vmstat: fix dirty threshold ordering
The nr_dirty_[background_]threshold fields are misplaced before the
numa_* fields, and users will read strange values.

This is the right order.  Before patch, nr_dirty_background_threshold
will read as 0 (the value from numa_miss).

	numa_hit 128501
	numa_miss 0
	numa_foreign 0
	numa_interleave 7388
	numa_local 128501
	numa_other 0
	nr_dirty_threshold 144291
	nr_dirty_background_threshold 72145

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02 14:51:14 -08:00
Zeng Zhaoming
55cfaa3cbd mm/mempolicy.c: add rcu read lock to protect pid structure
find_task_by_vpid() should be protected by rcu_read_lock(), to prevent
free_pid() reclaiming pid.

Signed-off-by: Zeng Zhaoming <zengzm.kernel@gmail.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02 14:51:14 -08:00
Dean Nelson
1f64d69c7a mm/hugetlb.c: avoid double unlock_page() in hugetlb_fault()
Have hugetlb_fault() call unlock_page(page) only if it had previously
called lock_page(page).

Setting CONFIG_DEBUG_VM=y and then running the libhugetlbfs test suite,
resulted in the tripping of VM_BUG_ON(!PageLocked(page)) in
unlock_page() having been called by hugetlb_fault() when page ==
pagecache_page.  This patch remedied the problem.

Signed-off-by: Dean Nelson <dnelson@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02 14:51:14 -08:00
Linus Torvalds
6072d13c42 Call the filesystem back whenever a page is removed from the page cache
NFS needs to be able to release objects that are stored in the page
cache once the page itself is no longer visible from the page cache.

This patch adds a callback to the address space operations that allows
filesystems to perform page cleanups once the page has been removed
from the page cache.

Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
[trondmy: cover the cases of invalidate_inode_pages2() and
          truncate_inode_pages()]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-02 09:55:21 -05:00
Steven Rostedt
85beb5869a tracing/slab: Move kmalloc tracepoint out of inline code
The tracepoint for kmalloc is in the slab inlined code which causes
every instance of kmalloc to have the tracepoint.

This patch moves the tracepoint out of the inline code to the
slab C file, which removes a large number of inlined trace
points.

  objdump -dr vmlinux.slab| grep 'jmpq.*<trace_kmalloc' |wc -l
213
  objdump -dr vmlinux.slab.patched| grep 'jmpq.*<trace_kmalloc' |wc -l
1

This also has a nice impact on size.

   text	   data	    bss	    dec	    hex	filename
7023060	2121564	2482432	11627056	 b16a30	vmlinux.slab
6970579	2109772	2482432	11562783	 b06f1f	vmlinux.slab.patched

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-11-28 21:16:28 +02:00
David Sterba
5f0af70a25 mm: remove call to find_vma in pagewalk for non-hugetlbfs
Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
walk_page_range()") introduces a check if a vma is a hugetlbfs one and
later in 5dc37642 ("mm hugetlb: add hugepage support to pagemap") it is
moved under #ifdef CONFIG_HUGETLB_PAGE but a needless find_vma call is
left behind and its result is not used anywhere else in the function.

The side-effect of caching vma for @addr inside walk->mm is neither
utilized in walk_page_range() nor in called functions.

Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:46 +09:00
KAMEZAWA Hiroyuki
e9959f0f37 mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly called under stop_machine_run()
During memory hotplug, build_allzonelists() may be called under
stop_machine_run().  In this function, setup_zone_pageset() is called.
But it's bug because it will do page allocation under stop_machine_run().

Here is a report from Alok Kataria.

  BUG: sleeping function called from invalid context at kernel/mutex.c:94
  in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
  Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
  Call Trace:
   [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
   [<ffffffff81468245>] mutex_lock+0x24/0x50
   [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
   [<ffffffff81048888>] ? load_balance+0xbe/0x60e
   [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
   [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
   [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
   [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
   [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
   [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
   [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
   [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
   [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
   [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
   [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
   [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
   [<ffffffff81065f29>] kthread+0x7f/0x87
   [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
   [<ffffffff81065eaa>] ? kthread+0x0/0x87
   [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
  Built 5 zonelists in Node order, mobility grouping on.  Total pages: 289456
  Policy zone: Normal

This patch tries to fix the issue by moving setup_zone_pageset() out from
stop_machine_run(). It's obviously not necessary to be called under
stop_machine_run().

[akpm@linux-foundation.org: remove unneeded local]
Reported-by: Alok Kataria <akataria@vmware.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Petr Vandrovec <petr@vmware.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:45 +09:00
Michal Hocko
a42c390cfa cgroups: make swap accounting default behavior configurable
Swap accounting can be configured by CONFIG_CGROUP_MEM_RES_CTLR_SWAP
configuration option and then it is turned on by default.  There is a boot
option (noswapaccount) which can disable this feature.

This makes it hard for distributors to enable the configuration option as
this feature leads to a bigger memory consumption and this is a no-go for
general purpose distribution kernel.  On the other hand swap accounting
may be very usuful for some workloads.

This patch adds a new configuration option which controls the default
behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED).  If the option is selected
then the feature is turned on by default.

It also adds a new boot parameter swapaccount[=1|0] which enhances the
original noswapaccount parameter semantic by means of enable/disable logic
(defaults to 1 if no value is provided to be still consistent with
noswapaccount).

The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is
enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as well)

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:45 +09:00
Daisuke Nishimura
b1dd693e5b memcg: avoid deadlock between move charge and try_charge()
__mem_cgroup_try_charge() can be called under down_write(&mmap_sem)(e.g.
mlock does it). This means it can cause deadlock if it races with move charge:

Ex.1)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |  down_write(&mmap_sem)
      mc.moving_task = current          |    ..
      mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
        mem_cgroup_count_precharge()    |    prepare_to_wait()
          down_read(&mmap_sem)          |    if (mc.moving_task)
          -> cannot aquire the lock     |    -> true
                                        |      schedule()

Ex.2)
                move charge             |        try charge
  --------------------------------------+------------------------------
    mem_cgroup_can_attach()             |
      mc.moving_task = current          |
      mem_cgroup_precharge_mc()         |
        mem_cgroup_count_precharge()    |
          down_read(&mmap_sem)          |
          ..                            |
          up_read(&mmap_sem)            |
                                        |  down_write(&mmap_sem)
    mem_cgroup_move_task()              |    ..
      mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
        down_read(&mmap_sem)            |    prepare_to_wait()
        -> cannot aquire the lock       |    if (mc.moving_task)
                                        |    -> true
                                        |      schedule()

To avoid this deadlock, we do all the move charge works (both can_attach() and
attach()) under one mmap_sem section.
And after this patch, we set/clear mc.moving_task outside mc.lock, because we
use the lock only to check mc.from/to.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:44 +09:00
Kirill A. Shutemov
112bc2e120 memcg: fix false positive VM_BUG on non-SMP
Fix this:

  kernel BUG at mm/memcontrol.c:2155!
  invalid opcode: 0000 [#1]
  last sysfs file:

  Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs
  EIP: 0060:[<c10731b2>] EFLAGS: 00000246 CPU: 0
  EIP is at mem_cgroup_move_account+0xe2/0xf0
  EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000
  ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30
   DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
  Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000)
  Stack:
   00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000
   08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50
   c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340
  Call Trace:
   [<c1074d37>] ? mem_cgroup_move_charge_pte_range+0xe7/0x130
   [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
   [<c106c75e>] ? walk_page_range+0xee/0x1d0
   [<c10725d6>] ? mem_cgroup_move_task+0x66/0x90
   [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130
   [<c1072570>] ? mem_cgroup_move_task+0x0/0x90
   [<c1042616>] ? cgroup_attach_task+0x136/0x200
   [<c1042878>] ? cgroup_tasks_write+0x48/0xc0
   [<c1041e9e>] ? cgroup_file_write+0xde/0x220
   [<c101398d>] ? do_page_fault+0x17d/0x3f0
   [<c108a79d>] ? alloc_fd+0x2d/0xd0
   [<c1041dc0>] ? cgroup_file_write+0x0/0x220
   [<c1077ba2>] ? vfs_write+0x92/0xc0
   [<c1077c81>] ? sys_write+0x41/0x70
   [<c1140e3d>] ? syscall_call+0x7/0xb
  Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3
  EIP: [<c10731b2>] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30
  ---[ end trace 7daa1582159b6532 ]---

lock_page_cgroup and unlock_page_cgroup are implemented using
bit_spinlock.  bit_spinlock doesn't touch the bit if we are on non-SMP
machine, so we can't use the bit to check whether the lock was taken.

Let's introduce is_page_cgroup_locked based on bit_spin_is_locked instead
of PageCgroupLocked to fix it.

[akpm@linux-foundation.org: s/is_page_cgroup_locked/page_is_cgroup_locked/]
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:40 +09:00
Steven J. Magnani
04c3496152 nommu: yield CPU while disposing VM
Depending on processor speed, page size, and the amount of memory a
process is allowed to amass, cleanup of a large VM may freeze the system
for many seconds.  This can result in a watchdog timeout.

Make sure other tasks receive some service when cleaning up large VMs.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-25 06:50:38 +09:00
Linus Torvalds
2744b8889c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: Fix slub_lock down/up imbalance
2010-11-14 13:06:37 -08:00
Pavel Emelyanov
68cee4f118 slub: Fix slub_lock down/up imbalance
There are two places, that do not release the slub_lock.

Respective bugs were introduced by sysfs changes ab4d5ed5 (slub: Enable
sysfs support for !CONFIG_SLUB_DEBUG) and 2bce6485 ( slub: Allow removal
of slab caches during boot).

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-11-14 16:53:11 +02:00
Nick Piggin
27d20fddc8 radix-tree: fix RCU bug
Salman Qazi describes the following radix-tree bug:

In the following case, we get can get a deadlock:

0.  The radix tree contains two items, one has the index 0.
1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
2.  The reader acquires slot(s) for item(s) including the index 0 item.
3.  The non-zero index item is deleted, and as a consequence the other item is
    moved to the root of the tree. The place where it used to be is queued for
    deletion after the readers finish.
3b. The zero item is deleted, removing it from the direct slot, it remains in
    the rcu-delayed indirect node.
4.  The reader looks at the index 0 slot, and finds that the page has 0 ref
    count
5.  The reader looks at it again, hoping that the item will either be freed or
    the ref count will increase. This never happens, as the slot it is looking
    at will never be updated. Also, this slot can never be reclaimed because
    the reader is holding rcu_read_lock and is in an infinite loop.

The fix is to re-use the same "indirect" pointer case that requires a slot
lookup retry into a general "retry the lookup" bit.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reported-by: Salman Qazi <sqazi@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-12 07:55:32 -08:00
Shaohua Li
1dce071e18 vmscan: avoid setting zone congested if no page dirty
nr_dirty and nr_congested are increased only when the page is dirty.  So
if all pages are clean, both them will be zero.  In this case, we should
not mark the zone congested.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-12 07:55:31 -08:00
Dave Hansen
8d056cb965 mm/vfs: revalidate page->mapping in do_generic_file_read()
70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
ran into a NULL dereference in here:

	int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
	                                        unsigned long from)
	{
---->		struct inode *inode = page->mapping->host;

It looks like page->mapping was the culprit.  (xmon trace is below).
After closer examination, I realized that do_generic_file_read() does a
find_get_page(), and eventually locks the page before calling
block_is_partially_uptodate().  However, it doesn't revalidate the
page->mapping after the page is locked.  So, there's a small window
between the find_get_page() and ->is_partially_uptodate() where the page
could get truncated and page->mapping cleared.

We _have_ a reference, so it can't get reclaimed, but it certainly
can be truncated.

I think the correct thing is to check page->mapping after the
trylock_page(), and jump out if it got truncated.  This patch has been
running in the test environment for a month or so now, and we have not
seen this bug pop up again.

xmon info:

  1f:mon> e
  cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
      pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
      lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
      sp: c0000002ae36f9f0
     msr: 8000000000009032
     dar: 0
   dsisr: 40000000
    current = 0xc000000378f99e30
    paca    = 0xc000000000f66300
      pid   = 21946, comm = bash
  1f:mon> r
  R00 = 0025c0500000006d   R16 = 0000000000000000
  R01 = c0000002ae36f9f0   R17 = c000000362cd3af0
  R02 = c000000000e8cd80   R18 = ffffffffffffffff
  R03 = c0000000031d0f88   R19 = 0000000000000001
  R04 = c0000002ae36fa68   R20 = c0000003bb97b8a0
  R05 = 0000000000000000   R21 = c0000002ae36fa68
  R06 = 0000000000000000   R22 = 0000000000000000
  R07 = 0000000000000001   R23 = c0000002ae36fbb0
  R08 = 0000000000000002   R24 = 0000000000000000
  R09 = 0000000000000000   R25 = c000000362cd3a80
  R10 = 0000000000000000   R26 = 0000000000000002
  R11 = c0000000001e7b60   R27 = 0000000000000000
  R12 = 0000000042000484   R28 = 0000000000000001
  R13 = c000000000f66300   R29 = c0000003bb97b9b8
  R14 = 0000000000000001   R30 = c000000000e28a08
  R15 = 000000000000ffff   R31 = c0000000031d0f88
  pc  = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
  lr  = c000000000142944 .generic_file_aio_read+0x1e4/0x770
  msr = 8000000000009032   cr  = 22000488
  ctr = c0000000001e7a60   xer = 0000000020000000   trap =  300
  dar = 0000000000000000   dsisr = 40000000
  1f:mon> t
  [link register   ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
  [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
  [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
  [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
  [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
  [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
  --- Exception: c00 (System Call) at 00000080a840bc54
  SP (fffca15df30) is in userspace
  1f:mon> di c0000000001e7a6c
  c0000000001e7a6c  e9290000      ld      r9,0(r9)
  c0000000001e7a70  418200c0      beq     c0000000001e7b30        # .block_is_partially_uptodate+0xd0/0x100
  c0000000001e7a74  e9440008      ld      r10,8(r4)
  c0000000001e7a78  78a80020      clrldi  r8,r5,32
  c0000000001e7a7c  3c000001      lis     r0,1
  c0000000001e7a80  812900a8      lwz     r9,168(r9)
  c0000000001e7a84  39600001      li      r11,1
  c0000000001e7a88  7c080050      subf    r0,r8,r0
  c0000000001e7a8c  7f805040      cmplw   cr7,r0,r10
  c0000000001e7a90  7d6b4830      slw     r11,r11,r9
  c0000000001e7a94  796b0020      clrldi  r11,r11,32
  c0000000001e7a98  419d00a8      bgt     cr7,c0000000001e7b40    # .block_is_partially_uptodate+0xe0/0x100
  c0000000001e7a9c  7fa55840      cmpld   cr7,r5,r11
  c0000000001e7aa0  7d004214      add     r8,r0,r8
  c0000000001e7aa4  79080020      clrldi  r8,r8,32
  c0000000001e7aa8  419c0078      blt     cr7,c0000000001e7b20    # .block_is_partially_uptodate+0xc0/0x100

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <arunabal@in.ibm.com>
Cc: <sbest@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-12 07:55:31 -08:00
Dan Carpenter
d2e61b8dc9 memcg: null dereference on allocation failure
The original code had a null dereference if alloc_percpu() failed.  This
was introduced in commit 711d3d2c9b ("memcg: cpu hotplug aware percpu
count updates")

Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-12 07:55:31 -08:00