1
Commit Graph

284560 Commits

Author SHA1 Message Date
Johannes Weiner
9f3a0d0933 mm: memcg: consolidate hierarchy iteration primitives
The memcg naturalization series:

Memory control groups are currently bolted onto the side of
traditional memory management in places where better integration would
be preferrable.  To reclaim memory, for example, memory control groups
maintain their own LRU list and reclaim strategy aside from the global
per-zone LRU list reclaim.  But an extra list head for each existing
page frame is expensive and maintaining it requires additional code.

This patchset disables the global per-zone LRU lists on memory cgroup
configurations and converts all its users to operate on the per-memory
cgroup lists instead.  As LRU pages are then exclusively on one list,
this saves two list pointers for each page frame in the system:

page_cgroup array size with 4G physical memory

  vanilla: allocated 31457280 bytes of page_cgroup
  patched: allocated 15728640 bytes of page_cgroup

At the same time, system performance for various workloads is
unaffected:

100G sparse file cat, 4G physical memory, 10 runs, to test for code
bloat in the traditional LRU handling and kswapd & direct reclaim
paths, without/with the memory controller configured in

  vanilla: 71.603(0.207) seconds
  patched: 71.640(0.156) seconds

  vanilla: 79.558(0.288) seconds
  patched: 77.233(0.147) seconds

100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
bloat in the traditional memory cgroup LRU handling and reclaim path

  vanilla: 96.844(0.281) seconds
  patched: 94.454(0.311) seconds

4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
swap on SSD, 10 runs, to test for regressions in kswapd & direct
reclaim using per-memcg LRU lists with multiple memcgs and multiple
allocators within each memcg

  vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
  patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]

16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
swap on SSD, 10 runs, to test for regressions in hierarchical memcg
setups

  vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
  patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]

This patch:

There are currently two different implementations of iterating over a
memory cgroup hierarchy tree.

Consolidate them into one worker function and base the convenience
looping-macros on top of it.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:04 -08:00
KAMEZAWA Hiroyuki
ab936cbcd0 memcg: add mem_cgroup_replace_page_cache() to fix LRU issue
Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
function replace_page_cache_page().  This function replaces a page in the
radix-tree with a new page.  WHen doing this, memory cgroup needs to fix
up the accounting information.  memcg need to check PCG_USED bit etc.

In some(many?) cases, 'newpage' is on LRU before calling
replace_page_cache().  So, memcg's LRU accounting information should be
fixed, too.

This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
 In that function, old pages will be unaccounted without touching
res_counter and new page will be accounted to the memcg (of old page).
WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
races with LRU handling.

Background:
  replace_page_cache_page() is called by FUSE code in its splice() handling.
  Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
  page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
  because rmdir() checks the whole LRU is empty and there is no account leak.
  If a page is on the other LRU than it should be, rmdir() will fail.

This bug was added in March 2011, but no bug report yet.  I guess there
are not many people who use memcg and FUSE at the same time with upstream
kernels.

The result of this bug is that admin cannot destroy a memcg because of
account leak.  So, no panic, no deadlock.  And, even if an active cgroup
exist, umount can succseed.  So no problem at shutdown.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:04 -08:00
Jason Baron
28d82dc1c4 epoll: limit paths
The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely.  A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.

To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited.  Thus, the loop detection
becomes proportional to the number of epoll file descriptor and links.
This dramatically decreases the run-time of the loop check algorithm.  In
one diabolical case I tried it reduced the run-time from 15 mintues (all
in kernel time) to .3 seconds.

Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but the
complexity is harder, since there can be multiple wakeups on different
cpus...Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.

This is accomplished, by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link), are
actually the sources for wakeup events.  I keep a list of these file
descriptors and limit the number and length of these paths that emanate
from these 'source file descriptors'.  In the current implemetation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5.  Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links.  This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.

In terms of the path limit selection, I think its first worth noting that
the most common case for epoll, is probably the model where you have 1
epoll file descriptor that is monitoring n number of 'source file
descriptors'.  In this case, each 'source file descriptor' has a 1 path of
length 1.  Thus, I believe that the limits I'm proposing are quite
reasonable and in fact may be too generous.  Thus, I'm hoping that the
proposed limits will not prevent any workloads that currently work to
fail.

In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations.  Currently its only used in a subset
of the add paths.  I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths.  I believe that
this additional locking is probably ok, since its in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead.  Also, worth noting is that the epmuex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.

Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4

This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1).  However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order.  Thus, this limit is currently easily bypassed.  The
newly added check for wakeup paths, stricly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.

Thus far, I've tested this, using the sample programs previously
mentioned, which now either return quickly or return -EINVAL.  I've also
testing using the piptest.c epoll tester, which showed no difference in
performance.  I've also created a number of different epoll networks and
tested that they behave as expectded.

I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nelson Elhage <nelhage@ksplice.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:04 -08:00
Sasha Levin
2ccd4f4d47 pipe: fail cleanly when root tries F_SETPIPE_SZ with big size
When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
with size bigger than kmalloc() can alloc it spits out an ugly warning:

  ------------[ cut here ]------------
  WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
  Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
  Call Trace:
     warn_slowpath_common+0x75/0xb0
     warn_slowpath_null+0x15/0x20
     __alloc_pages_nodemask+0x5d3/0x7a0
     __get_free_pages+0x12/0x50
     __kmalloc+0x12b/0x150
     pipe_set_size+0x75/0x120
     pipe_fcntl+0xf8/0x140
     do_fcntl+0x2d4/0x410
     sys_fcntl+0x66/0xa0
     system_call_fastpath+0x16/0x1b
  ---[ end trace 432f702e6db7b5ee ]---

Instead, make kcalloc() handle the overflow case and fail quietly.

[akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:04 -08:00
Stanislaw Gruszka
888a214dc4 slub: document setting min order with debug_guardpage_minorder > 0
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:04 -08:00
Mathias Krause
15ee2d000d parisc, exec: remove redundant set_fs(USER_DS)
The address limit is already set in flush_old_exec() so those calls to
set_fs(USER_DS) are redundant.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:04 -08:00
Mathias Krause
01fa310cd9 ia64, exec: remove redundant set_fs(USER_DS)
The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:03 -08:00
Andrew Morton
08346bf805 drivers/video/nvidia/nvidia.c: fix warning
Fix the int/bool confusion in there.

  drivers/video/nvidia/nvidia.c:1602: warning: return from incompatible pointer type

Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:03 -08:00
Heiko Carstens
2565409fc0 mm,x86,um: move CMPXCHG_DOUBLE config option
Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
can simply select the option if it is supported.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:03 -08:00
Heiko Carstens
4156153c4d mm,x86,um: move CMPXCHG_LOCAL config option
Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
can simply select the option if it is supported.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:03 -08:00
Heiko Carstens
43570fd2f4 mm,slub,x86: decouple size of struct page from CONFIG_CMPXCHG_LOCAL
While implementing cmpxchg_double() on s390 I realized that we don't set
CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.

However setting that option will increase the size of struct page by
eight bytes on 64 bit, which we certainly do not want.  Also, it doesn't
make sense that a present cpu feature should increase the size of struct
page.

Besides that it looks like the dependency to CMPXCHG_LOCAL is wrong and
that it should depend on CMPXCHG_DOUBLE instead.

This patch:

If an architecture supports CMPXCHG_LOCAL this shouldn't result
automatically in larger struct pages if the SLUB allocator is used.
Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which
can be selected if a double word aligned struct page is required.  Also
update x86 Kconfig so that it should work as before.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:03 -08:00
Joe Perches
0d259cf819 include/linux/linkage.h: remove unused ATTRIB_NORET macro
The uses have been renamed so delete the unused macro.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:03 -08:00
Joe Perches
ff2d8b19a3 treewide: convert uses of ATTRIB_NORETURN to __noreturn
Use the more commonly used __noreturn instead of ATTRIB_NORETURN.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:03 -08:00
Joe Perches
9402c95f34 treewide: remove useless NORET_TYPE macro and uses
It's a very old and now unused prototype marking so just delete it.

Neaten panic pointer argument style to keep checkpatch quiet.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:03 -08:00
Joe Perches
80bf007f20 include/linux/linkage.h: remove unused NORET_AND macro
The only use in kernel.h is gone so remove the macro.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:02 -08:00
Joe Perches
4da4785995 kernel.h: neaten panic prototype
Use __printf macro.
Convert NORET_AND to ATTRIB_NORET.
Use the normal kernel style for pointer arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:02 -08:00
Stephen Boyd
efeb156e72 kprobes: silence DEBUG_STRICT_USER_COPY_CHECKS=y warning
Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:

  In file included from arch/x86/include/asm/uaccess.h:573,
                   from kernel/kprobes.c:55:
  In function 'copy_from_user',
      inlined from 'write_enabled_file_bool' at
      kernel/kprobes.c:2191:
  arch/x86/include/asm/uaccess_64.h:65:
  warning: call to 'copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct

presumably due to buf_size being signed causing GCC to fail to see that
buf_size can't become negative.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:02 -08:00
Xiaotian Feng
a2ef990ab5 proc: fix null pointer deref in proc_pid_permission()
get_proc_task() can fail to search the task and return NULL,
put_task_struct() will then bomb the kernel with following oops:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
  IP: [<ffffffff81217d34>] proc_pid_permission+0x64/0xe0
  PGD 112075067 PUD 112814067 PMD 0
  Oops: 0002 [#1] PREEMPT SMP

This is a regression introduced by commit 0499680a ("procfs: add hidepid=
and gid= mount options").  The kernel should return -ESRCH if
get_proc_task() failed.

Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Stephen Wilson <wilsons@start.ca>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:02 -08:00
Bradley Peterson
91dce7ddab pptp: Accept packet with seq zero
Initialize the PPTP "seq received" value to 0xffffffff, so we don't
ignore packets with seq zero.

Signed-off-by: Bradley Peterson <despite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 20:05:28 -08:00
Roland Dreier
5b7bf42e3d RDS: Remove some unused iWARP code
rds_iw_flush_goal() just returns a count, but it is only called in one
place and its return value is ignored there.  So delete all the dead code.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 20:05:28 -08:00
Eric Benard
8d82f219c2 net: fsl: fec: handle 10Mbps speed in RMII mode
when the link is 10 Mbps and the mode is RMII, it's necessary
to set FRCONT to 1 in MIIGSK_CFGR to divide the RMII source
clock by 10 in order to support 10 Mbps operations.

Signed-off-by: Eric Bénard <eric@eukrea.com>
Acked-by: Shawn Guo <shawn.guo@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 20:05:28 -08:00
Julia Lawall
25cecd7e35 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c: add missing iounmap
Add missing iounmap in error handling code, in a case where the function
already preforms iounmap on some other execution path.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e;
statement S,S1;
int ret;
@@
e = \(ioremap\|ioremap_nocache\)(...)
... when != iounmap(e)
if (<+...e...+>) S
... when any
    when != iounmap(e)
*if (...)
   { ... when != iounmap(e)
     return ...; }
... when any
iounmap(e);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 20:05:28 -08:00
Julia Lawall
20d4369b68 drivers/net/ethernet/tundra/tsi108_eth.c: add missing iounmap
Add missing iounmap in error handling code, in a case where the function
already preforms iounmap on some other execution path.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e;
statement S,S1;
int ret;
@@
e = \(ioremap\|ioremap_nocache\)(...)
... when != iounmap(e)
if (<+...e...+>) S
... when any
    when != iounmap(e)
*if (...)
   { ... when != iounmap(e)
     return ...; }
... when any
iounmap(e);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 20:05:28 -08:00
Doug Kehn
83636580ad ksz884x: fix mtu for VLAN
The Ethernet header does not account for the addition of a VLAN header.
Full size Ethernet frames containing VLAN header are not processed
because the frame is larger than the resulting hw mtu.

Signed-off-by: Doug Kehn <rdkehn@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 20:05:28 -08:00
Eric Dumazet
ddecf0f4db net_sched: sfq: add optional RED on top of SFQ
Adds an optional Random Early Detection on each SFQ flow queue.

Traditional SFQ limits count of packets, while RED permits to also
control number of bytes per flow, and adds ECN capability as well.

1) We dont handle the idle time management in this RED implementation,
since each 'new flow' begins with a null qavg. We really want to address
backlogged flows.

2) if headdrop is selected, we try to ecn mark first packet instead of
currently enqueued packet. This gives faster feedback for tcp flows
compared to traditional RED [ marking the last packet in queue ]

Example of use :

tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
	limit 3000 headdrop flows 512 divisor 16384 \
	redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn

qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
 ewma 6 min 8000b max 60000b probability 0.2 ecn
 prob_mark 0 prob_mark_head 4876 prob_drop 6131
 forced_mark 0 forced_mark_head 0 forced_drop 0
 Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007
requeues 0)
 rate 99483Kbit 8219pps backlog 689392b 456p requeues 0

In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled
flows, we can see number of packets CE marked is smaller than number of
drops (for non ECN flows)

If same test is run, without RED, we can check backlog is much bigger.

qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
 Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
 rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Dave Taht <dave.taht@gmail.com>
Tested-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 20:05:28 -08:00
Manfred Rudigier
72092cc453 dp83640: Fix NOHZ local_softirq_pending 08 warning
Similar problem as in 481a819914 ("can:
fix NOHZ local_softirq_pending 08 warning"). This fix replaces
netif_rx() with netif_rx_ni() which has to be used from
process/softirq context.

Signed-off-by: Manfred Rudigier <manfred.rudigier@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 15:26:01 -08:00
Manfred Rudigier
9c4886e5e6 gianfar: Fix invalid TX frames returned on error queue when time stamping
When TX time stamping for PTP messages is enabled on a socket, a time
stamp is returned on the socket error queue to the user space application
after the frame was transmitted. The transmitted frame is also returned on
the error queue so that an application knows to which frame the time stamp
belongs.

In the current implementation the TxFCB is immediately followed by the
frame. Since the eTSEC inserts the TX time stamp 8 bytes after the TxFCB,
parts of the frame have been overwritten and an invalid frame was returned
on the socket error queue.

This patch fixes the described problem by adding additional 16 padding
bytes between the TxFCB and the frame for all messages sent from a time
stamping enabled socket (other sockets are not affected).

Signed-off-by: Manfred Rudigier <manfred.rudigier@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 15:26:01 -08:00
Manfred Rudigier
db83d136d7 gianfar: Fix missing sock reference when processing TX time stamps
When there is not enough headroom in the skb a private copy will be made.
However, the private copy had no reference to the socket and consequently
no time stamp could be queued on the socket error queue during the
skb_tstamp_tx function. This patch fixes this issue by also stealing the
sock reference from the original skb after making the private copy.

Signed-off-by: Manfred Rudigier <manfred.rudigier@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 15:26:01 -08:00
Timur Tabi
eb8a54a78e phylib: introduce mdiobus_alloc_size()
Introduce function mdiobus_alloc_size() as an alternative to mdiobus_alloc().
Most callers of mdiobus_alloc() also allocate a private data structure, and
then manually point bus->priv to this object.  mdiobus_alloc_size()
combines the two operations into one, which simplifies memory management.

The original mdiobus_alloc() now just calls mdiobus_alloc_size(0).

Signed-off-by: Timur Tabi <timur@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 15:23:04 -08:00
Linus Torvalds
2485a4b610 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: bcm5974 - set BUTTONPAD property
  Input: serio_raw - return proper result when serio_raw_write fails
  Input: serio_raw - really signal HUP upon disconnect
  Input: serio_raw - remove stray semicolon
  Input: revert some over-zealous conversions to module_platform_driver()
2012-01-12 12:40:41 -08:00
Linus Torvalds
6733e54b66 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  FUSE: Notifying the kernel of deletion.
  fuse: support ioctl on directories
  fuse: Use kcalloc instead of kzalloc to allocate array
  fuse: llseek optimize SEEK_CUR and SEEK_SET
2012-01-12 12:39:21 -08:00
Linus Torvalds
bcf8a3dfcb Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPDmxIAAoJENkgDmzRrbjxjZAP/00I83SuAfx8aDDgPB8o0Sv2
 9xI8ZhYIu0S8KfT8tu6Sl7pww7VoTP6gBm/cySnrOw1a3diWjhT5Xa3EVHIicWSV
 CWyCZK5cbTmPl2/NPs5cWbzrMGdSWotZN/7cCaKVQ5z43+uvRBLJj8s8LT5epzdA
 hGJZ5FDX01kQamrTdITIDpRJgMcO0A5e2rPNYQavJFf1SFZR5sdRmeyB8BqOmlyL
 r2cC13cNsHvyKc+xPZd0X96geslAah5XMJJacTMh9qaw1K/hyv5715iYciKANPEH
 5+XKzb2gGIX9H2qOMklB/sngA+pEczfO9SD0oM21PyVzZPfWcl8xHkCJq1R1VMTS
 vSCwvg/g9hBW6RgW3c3VE1b6usrLUbcaO1dzR1e5A7tPE48QZBzdi6CiJsWYlPPe
 gIOU7MDjmRGSiryOetzj9Rp0jAJvLiD7DXNn9brQzbNaTKi6BvIEWxgKVYzXr8bk
 pE9j8dNPU8TCBEQsrVNHKrgHL2Um8Mpbi8fzdzwi8odG7Ucu7oWKzsBiGkBiNB0B
 5swRrUGDgiMkElLn6rwdg+lSrtZyzWEGZe3bd+4ATD0vhTt3p8oGCFtTwTtz6yfW
 WhkzsToQA/EFR0JzaFkH9t1mnxTKTGli83cKVnljIP69Q3W5Bby8aTgkI2giAJkW
 q+rPPmt3v5BY3rYwTgFe
 =dgrs
 -----END PGP SIGNATURE-----

Merge tag 'to-linus' of git://github.com/rustyrussell/linux

* tag 'to-linus' of git://github.com/rustyrussell/linux: (24 commits)
  lguest: Make sure interrupt is allocated ok by lguest_setup_irq
  lguest: move the lguest tool to the tools directory
  lguest: switch segment-voodoo-numbers to readable symbols
  virtio: balloon: Add freeze, restore handlers to support S4
  virtio: balloon: Move vq initialization into separate function
  virtio: net: Add freeze, restore handlers to support S4
  virtio: net: Move vq and vq buf removal into separate function
  virtio: net: Move vq initialization into separate function
  virtio: blk: Add freeze, restore handlers to support S4
  virtio: blk: Move vq initialization to separate function
  virtio: console: Disable callbacks for virtqueues at start of S4 freeze
  virtio: console: Add freeze and restore handlers to support S4
  virtio: console: Move vq and vq buf removal into separate functions
  virtio: pci: add PM notification handlers for restore, freeze, thaw, poweroff
  virtio: pci: switch to new PM API
  virtio_blk: fix config handler race
  virtio: add debugging if driver doesn't kick.
  virtio: expose added descriptors immediately.
  virtio: avoid modulus operation.
  virtio: support unlocked queue kick
  ...
2012-01-12 12:37:27 -08:00
Glauber Costa
1398eee082 net: decrement memcg jump label when limit, not usage, is changed
The logic of the current code is that whenever we destroy
a cgroup that had its limit set (set meaning different than
maximum), we should decrement the jump_label counter.
Otherwise we assume it was never incremented.

But what the code actually does is test for RES_USAGE
instead of RES_LIMIT. Usage being different than maximum
is likely to be true most of the time.

The effect of this is that the key must become negative,
and since the jump_label test says:

        !!atomic_read(&key->enabled);

we'll have jump_labels still on when no one else is
using this functionality.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 12:27:59 -08:00
Eric Dumazet
cf778b00e9 net: reintroduce missing rcu_assign_pointer() calls
commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).

We miss needed barriers, even on x86, when y is not NULL.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12 12:26:56 -08:00
Linus Torvalds
61bd5e5683 brcmsmac: fix reading of PCI sprom contents
It appears that you can only read the sprom contents with aligned 16-bit
reads: anything else causes at least some versions of the broadcom
chipset to abort the PCI transaction, returning 0xff.

This apparently doesn't trigger very often, because most setups don't
use an external srom chip, and the OTP sprom loading doesn't have this
issue.  But at least the current 11" Macbook Air does trigger it, and
wireless communications were broken as a result.

Acked-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 12:19:34 -08:00
David S. Miller
9ee6045f09 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless 2012-01-12 12:10:00 -08:00
Anton Vorontsov
bccd17294a x86: Get rid of 'dubious one-bit signed bitfield' sprase warning
This very noisy sparse warning appears on almost every file in the
kernel:

  CHECK   init/main.c
  arch/x86/include/asm/thread_info.h:43:55: error: dubious one-bit signed bitfield
  arch/x86/include/asm/thread_info.h:44:46: error: dubious one-bit signed bitfield

This patch changes sig_on_uaccess_error and uaccess_err flags to unsigned
type and thus fixes the warning.

Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 09:32:21 -08:00
Linus Torvalds
a429638cac Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (526 commits)
  ASoC: twl6040 - Add method to query optimum PDM_DL1 gain
  ALSA: hda - Fix the lost power-setup of seconary pins after PM resume
  ALSA: usb-audio: add Yamaha MOX6/MOX8 support
  ALSA: virtuoso: add S/PDIF input support for all Xonars
  ALSA: ice1724 - Support for ooAoo SQ210a
  ALSA: ice1724 - Allow card info based on model only
  ALSA: ice1724 - Create capture pcm only for ADC-enabled configurations
  ALSA: hdspm - Provide unique driver id based on card serial
  ASoC: Dynamically allocate the rtd device for a non-empty release()
  ASoC: Fix recursive dependency due to select ATMEL_SSC in SND_ATMEL_SOC_SSC
  ALSA: hda - Fix the detection of "Loopback Mixing" control for VIA codecs
  ALSA: hda - Return the error from get_wcaps_type() for invalid NIDs
  ALSA: hda - Use auto-parser for HP laptops with cx20459 codec
  ALSA: asihpi - Fix potential Oops in snd_asihpi_cmode_info()
  ALSA: hdsp - Fix potential Oops in snd_hdsp_info_pref_sync_ref()
  ALSA: hda/cirrus - support for iMac12,2 model
  ASoC: cx20442: add bias control over a platform provided regulator
  ALSA: usb-audio - Avoid flood of frame-active debug messages
  ALSA: snd-usb-us122l: Delete calls to preempt_disable
  mfd: Put WM8994 into cache only mode when suspending
  ...

Fix up trivial conflicts in:
 - arch/arm/mach-s3c64xx/mach-crag6410.c:
	renamed speyside_wm8962 to tobermory, added littlemill right
	next to it
 - drivers/base/regmap/{regcache.c,regmap.c}:
	duplicate diff that had already come in with other changes in
	the regmap tree
2012-01-12 08:00:30 -08:00
Bjorn Helgaas
5cf9a4e69c x86/PCI: build amd_bus.o only when CONFIG_AMD_NB=y
We only need amd_bus.o for AMD systems with PCI.  arch/x86/pci/Makefile
already depends on CONFIG_PCI=y, so this patch just adds the dependency
on CONFIG_AMD_NB.

Cc: Yinghai Lu <yinghai@kernel.org>
Cc: stable@kernel.org	# 2.6.34+ (needs adjustment for k8 -> amd rename)
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 07:54:18 -08:00
Takashi Iwai
9e4ce164ee Merge branch 'topic/hda' into for-linus 2012-01-12 09:59:18 +01:00
Takashi Iwai
627b79628f Merge branch 'topic/misc' into for-linus 2012-01-12 09:59:14 +01:00
Takashi Iwai
29abceb67f Merge branch 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into topic/asoc 2012-01-12 09:48:20 +01:00
Linus Torvalds
4c4d285ad5 SH/R-Mobile updates for 3.3 merge window.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.15 (GNU/Linux)
 
 iEYEABECAAYFAk8Obj8ACgkQGkmNcg7/o7hzngCfS5az4ZP3D+e/cvatHZm/nAzn
 0mIAoKbYyXpLXGkEN+yDkd5YZAYwQjVR
 =kryV
 -----END PGP SIGNATURE-----

Merge tag 'rmobile-for-linus' of git://github.com/pmundt/linux-sh

SH/R-Mobile updates for 3.3 merge window.

* tag 'rmobile-for-linus' of git://github.com/pmundt/linux-sh: (32 commits)
  arm: mach-shmobile: add a resource name for shdma
  ARM: mach-shmobile: r8a7779 SMP support V3
  ARM: mach-shmobile: Add kota2 defconfig.
  ARM: mach-shmobile: Add marzen defconfig.
  ARM: mach-shmobile: r8a7779 power domain support V2
  ARM: mach-shmobile: Fix up marzen build for recent GIC changes.
  ARM: mach-shmobile: r8a7779 PFC function support
  ARM: mach-shmobile: Flush caches in platform_cpu_die()
  ARM: mach-shmobile: Allow SoC specific CPU kill code
  ARM: mach-shmobile: Fix headsmp.S code to use CPUINIT
  ARM: mach-shmobile: clock-r8a7779: clkz/clkzs support
  ARM: mach-shmobile: clock-r8a7779: add DIV4 clock support
  ARM: mach-shmobile: Marzen LAN89218 support
  ARM: mach-shmobile: Marzen SCIF2/SCIF4 support
  ARM: mach-shmobile: r8a7779 PFC GPIO-only support V2
  ARM: mach-shmobile: r8a7779 and Marzen base support V2
  sh: pfc: Unlock register support
  sh: pfc: Variable bitfield width config register support
  sh: pfc: Add config_reg_helper() function
  sh: pfc: Convert index to field and value pair
  ...
2012-01-11 23:29:20 -08:00
Linus Torvalds
56c8bc3b7e SuperH updates for 3.3 merge window.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.15 (GNU/Linux)
 
 iEYEABECAAYFAk8ObfwACgkQGkmNcg7/o7gOHwCfZo8197ppks1LIfACo27cL8Q8
 rU8AnR+igHfCEkg9RlrX9sJ8jqxtpYLZ
 =WE1M
 -----END PGP SIGNATURE-----

Merge tag 'sh-for-linus' of git://github.com/pmundt/linux-sh

SuperH updates for 3.3 merge window.

* tag 'sh-for-linus' of git://github.com/pmundt/linux-sh: (38 commits)
  sh: magicpanelr2: Update for parse_mtd_partitions() fallout.
  sh: mach-rsk: Update for parse_mtd_partitions() fallout.
  sh: sh2a: Improve cache flush/invalidate functions
  sh: also without PM_RUNTIME pm_runtime.o must be built
  sh: add a resource name for shdma
  sh: Remove redundant try_to_freeze() invocations.
  sh: Ensure IRQs are enabled across do_notify_resume().
  sh: Fix up store queue code for subsys_interface changes.
  sh: clkfwk: sh_clk_init_parent() should be called after clk_register()
  sh: add platform_device for renesas_usbhs in board-sh7757lcr
  sh: modify clock-sh7757 for renesas_usbhs
  sh: pfc: ioremap() support
  sh: use ioread32/iowrite32 and mapped_reg for div6
  sh: use ioread32/iowrite32 and mapped_reg for div4
  sh: use ioread32/iowrite32 and mapped_reg for mstp32
  sh: extend clock struct with mapped_reg member
  sh: clkfwk: clock-sh73a0: all div6_clks use SH_CLK_DIV6_EXT()
  sh: clkfwk: clock-sh7724: all div6_clks use SH_CLK_DIV6_EXT()
  sh: clock-sh7723: add CLKDEV_ICK_ID for cleanup
  serial: sh-sci: Handle GPIO function requests.
  ...
2012-01-11 23:22:52 -08:00
Linus Torvalds
b8bf17d311 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix lockup by limiting load-balance retries on lock-break
  sched: Fix CONFIG_CGROUP_SCHED dependency
  sched: Remove empty #ifdefs
2012-01-11 22:52:48 -08:00
Stratos Psomadakis
b6c96c0214 lguest: Make sure interrupt is allocated ok by lguest_setup_irq
Make sure the interrupt is allocated correctly by lguest_setup_irq (check the
return value of irq_alloc_desc_at for -ENOMEM)

Signed-off-by: Stratos Psomadakis <psomas@cslab.ece.ntua.gr>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleanups and commentry)
2012-01-12 15:44:47 +10:30
Davidlohr Bueso
07fe9977b6 lguest: move the lguest tool to the tools directory
This is a better location instead of having it in Documentation.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (fixed compile)
2012-01-12 15:44:47 +10:30
Jacek Galowicz
39082f7e59 lguest: switch segment-voodoo-numbers to readable symbols
When studying lguest's x86 segment descriptor code, it is not longer
necessary to have the Intel x86 architecture manual open on the page
with the segment descriptor illustration to understand the crazy
numbers assigned to both descriptor structure halves a/b.
Now the struct desc_struct's fields, like suggested by
Glauber de Oliveira Costa in 2008, are used.

Signed-off-by: Jacek Galowicz <jacek@galowicz.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-12 15:44:47 +10:30
Amit Shah
e562966dba virtio: balloon: Add freeze, restore handlers to support S4
Handling balloon hibernate / restore is tricky.  If the balloon was
inflated before going into the hibernation state, upon resume, the host
will not have any memory of that.  Any pages that were passed on to the
host earlier would most likely be invalid, and the host will have to
re-balloon to the previous value to get in the pre-hibernate state.

So the only sane thing for the guest to do here is to discard all the
pages that were put in the balloon.  When to discard the pages is the
next question.

One solution is to deflate the balloon just before writing the image to
the disk (in the freeze() PM callback).  However, asking for pages from
the host just to discard them immediately after seems wasteful of
resources.  Hence, it makes sense to do this by just fudging our
counters soon after wakeup.  This means we don't deflate the balloon
before sleep, and also don't put unnecessary pressure on the host.

This also helps in the thaw case: if the freeze fails for whatever
reason, the balloon should continue to remain in the inflated state.
This was tested by issuing 'swapoff -a' and trying to go into the S4
state.  That fails, and the balloon stays inflated, as expected.  Both
the host and the guest are happy.

Finally, in the restore() callback, we empty the list of pages that were
previously given off to the host, add the appropriate number of pages to
the totalram_pages counter, reset the num_pages counter to 0, and
all is fine.

As a last step, delete the vqs on the freeze callback to prepare for
hibernation, and re-create them in the restore and thaw callbacks to
resume normal operation.

The kthread doesn't race with any operations here, since it's frozen
before the freeze() call and is thawed after the thaw() and restore()
callbacks, so we're safe with that.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-12 15:44:47 +10:30
Amit Shah
be91c33dd1 virtio: balloon: Move vq initialization into separate function
The probe and PM restore functions will share this code.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-12 15:44:46 +10:30