The memcg naturalization series:
Memory control groups are currently bolted onto the side of
traditional memory management in places where better integration would
be preferrable. To reclaim memory, for example, memory control groups
maintain their own LRU list and reclaim strategy aside from the global
per-zone LRU list reclaim. But an extra list head for each existing
page frame is expensive and maintaining it requires additional code.
This patchset disables the global per-zone LRU lists on memory cgroup
configurations and converts all its users to operate on the per-memory
cgroup lists instead. As LRU pages are then exclusively on one list,
this saves two list pointers for each page frame in the system:
page_cgroup array size with 4G physical memory
vanilla: allocated 31457280 bytes of page_cgroup
patched: allocated 15728640 bytes of page_cgroup
At the same time, system performance for various workloads is
unaffected:
100G sparse file cat, 4G physical memory, 10 runs, to test for code
bloat in the traditional LRU handling and kswapd & direct reclaim
paths, without/with the memory controller configured in
vanilla: 71.603(0.207) seconds
patched: 71.640(0.156) seconds
vanilla: 79.558(0.288) seconds
patched: 77.233(0.147) seconds
100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
bloat in the traditional memory cgroup LRU handling and reclaim path
vanilla: 96.844(0.281) seconds
patched: 94.454(0.311) seconds
4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
swap on SSD, 10 runs, to test for regressions in kswapd & direct
reclaim using per-memcg LRU lists with multiple memcgs and multiple
allocators within each memcg
vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
swap on SSD, 10 runs, to test for regressions in hierarchical memcg
setups
vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]
This patch:
There are currently two different implementations of iterating over a
memory cgroup hierarchy tree.
Consolidate them into one worker function and base the convenience
looping-macros on top of it.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
function replace_page_cache_page(). This function replaces a page in the
radix-tree with a new page. WHen doing this, memory cgroup needs to fix
up the accounting information. memcg need to check PCG_USED bit etc.
In some(many?) cases, 'newpage' is on LRU before calling
replace_page_cache(). So, memcg's LRU accounting information should be
fixed, too.
This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
In that function, old pages will be unaccounted without touching
res_counter and new page will be accounted to the memcg (of old page).
WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
races with LRU handling.
Background:
replace_page_cache_page() is called by FUSE code in its splice() handling.
Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
because rmdir() checks the whole LRU is empty and there is no account leak.
If a page is on the other LRU than it should be, rmdir() will fail.
This bug was added in March 2011, but no bug report yet. I guess there
are not many people who use memcg and FUSE at the same time with upstream
kernels.
The result of this bug is that admin cannot destroy a memcg because of
account leak. So, no panic, no deadlock. And, even if an active cgroup
exist, umount can succseed. So no problem at shutdown.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.
To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptor and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried it reduced the run-time from 15 mintues (all
in kernel time) to .3 seconds.
Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but the
complexity is harder, since there can be multiple wakeups on different
cpus...Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.
This is accomplished, by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link), are
actually the sources for wakeup events. I keep a list of these file
descriptors and limit the number and length of these paths that emanate
from these 'source file descriptors'. In the current implemetation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
In terms of the path limit selection, I think its first worth noting that
the most common case for epoll, is probably the model where you have 1
epoll file descriptor that is monitoring n number of 'source file
descriptors'. In this case, each 'source file descriptor' has a 1 path of
length 1. Thus, I believe that the limits I'm proposing are quite
reasonable and in fact may be too generous. Thus, I'm hoping that the
proposed limits will not prevent any workloads that currently work to
fail.
In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently its only used in a subset
of the add paths. I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths. I believe that
this additional locking is probably ok, since its in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also, worth noting is that the epmuex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.
Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:
/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths, stricly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.
Thus far, I've tested this, using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
testing using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expectded.
I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nelson Elhage <nelhage@ksplice.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
with size bigger than kmalloc() can alloc it spits out an ugly warning:
------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
Call Trace:
warn_slowpath_common+0x75/0xb0
warn_slowpath_null+0x15/0x20
__alloc_pages_nodemask+0x5d3/0x7a0
__get_free_pages+0x12/0x50
__kmalloc+0x12b/0x150
pipe_set_size+0x75/0x120
pipe_fcntl+0xf8/0x140
do_fcntl+0x2d4/0x410
sys_fcntl+0x66/0xa0
system_call_fastpath+0x16/0x1b
---[ end trace 432f702e6db7b5ee ]---
Instead, make kcalloc() handle the overflow case and fail quietly.
[akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The address limit is already set in flush_old_exec() so those calls to
set_fs(USER_DS) are redundant.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
can simply select the option if it is supported.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
can simply select the option if it is supported.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While implementing cmpxchg_double() on s390 I realized that we don't set
CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.
However setting that option will increase the size of struct page by
eight bytes on 64 bit, which we certainly do not want. Also, it doesn't
make sense that a present cpu feature should increase the size of struct
page.
Besides that it looks like the dependency to CMPXCHG_LOCAL is wrong and
that it should depend on CMPXCHG_DOUBLE instead.
This patch:
If an architecture supports CMPXCHG_LOCAL this shouldn't result
automatically in larger struct pages if the SLUB allocator is used.
Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which
can be selected if a double word aligned struct page is required. Also
update x86 Kconfig so that it should work as before.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The uses have been renamed so delete the unused macro.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only use in kernel.h is gone so remove the macro.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use __printf macro.
Convert NORET_AND to ATTRIB_NORET.
Use the normal kernel style for pointer arguments.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:
In file included from arch/x86/include/asm/uaccess.h:573,
from kernel/kprobes.c:55:
In function 'copy_from_user',
inlined from 'write_enabled_file_bool' at
kernel/kprobes.c:2191:
arch/x86/include/asm/uaccess_64.h:65:
warning: call to 'copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
presumably due to buf_size being signed causing GCC to fail to see that
buf_size can't become negative.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_proc_task() can fail to search the task and return NULL,
put_task_struct() will then bomb the kernel with following oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff81217d34>] proc_pid_permission+0x64/0xe0
PGD 112075067 PUD 112814067 PMD 0
Oops: 0002 [#1] PREEMPT SMP
This is a regression introduced by commit 0499680a ("procfs: add hidepid=
and gid= mount options"). The kernel should return -ESRCH if
get_proc_task() failed.
Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Stephen Wilson <wilsons@start.ca>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Initialize the PPTP "seq received" value to 0xffffffff, so we don't
ignore packets with seq zero.
Signed-off-by: Bradley Peterson <despite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_iw_flush_goal() just returns a count, but it is only called in one
place and its return value is ignored there. So delete all the dead code.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when the link is 10 Mbps and the mode is RMII, it's necessary
to set FRCONT to 1 in MIIGSK_CFGR to divide the RMII source
clock by 10 in order to support 10 Mbps operations.
Signed-off-by: Eric Bénard <eric@eukrea.com>
Acked-by: Shawn Guo <shawn.guo@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing iounmap in error handling code, in a case where the function
already preforms iounmap on some other execution path.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression e;
statement S,S1;
int ret;
@@
e = \(ioremap\|ioremap_nocache\)(...)
... when != iounmap(e)
if (<+...e...+>) S
... when any
when != iounmap(e)
*if (...)
{ ... when != iounmap(e)
return ...; }
... when any
iounmap(e);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing iounmap in error handling code, in a case where the function
already preforms iounmap on some other execution path.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression e;
statement S,S1;
int ret;
@@
e = \(ioremap\|ioremap_nocache\)(...)
... when != iounmap(e)
if (<+...e...+>) S
... when any
when != iounmap(e)
*if (...)
{ ... when != iounmap(e)
return ...; }
... when any
iounmap(e);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Ethernet header does not account for the addition of a VLAN header.
Full size Ethernet frames containing VLAN header are not processed
because the frame is larger than the resulting hw mtu.
Signed-off-by: Doug Kehn <rdkehn@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds an optional Random Early Detection on each SFQ flow queue.
Traditional SFQ limits count of packets, while RED permits to also
control number of bytes per flow, and adds ECN capability as well.
1) We dont handle the idle time management in this RED implementation,
since each 'new flow' begins with a null qavg. We really want to address
backlogged flows.
2) if headdrop is selected, we try to ecn mark first packet instead of
currently enqueued packet. This gives faster feedback for tcp flows
compared to traditional RED [ marking the last packet in queue ]
Example of use :
tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
limit 3000 headdrop flows 512 divisor 16384 \
redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn
qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
ewma 6 min 8000b max 60000b probability 0.2 ecn
prob_mark 0 prob_mark_head 4876 prob_drop 6131
forced_mark 0 forced_mark_head 0 forced_drop 0
Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007
requeues 0)
rate 99483Kbit 8219pps backlog 689392b 456p requeues 0
In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled
flows, we can see number of packets CE marked is smaller than number of
drops (for non ECN flows)
If same test is run, without RED, we can check backlog is much bigger.
qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Dave Taht <dave.taht@gmail.com>
Tested-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar problem as in 481a819914 ("can:
fix NOHZ local_softirq_pending 08 warning"). This fix replaces
netif_rx() with netif_rx_ni() which has to be used from
process/softirq context.
Signed-off-by: Manfred Rudigier <manfred.rudigier@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TX time stamping for PTP messages is enabled on a socket, a time
stamp is returned on the socket error queue to the user space application
after the frame was transmitted. The transmitted frame is also returned on
the error queue so that an application knows to which frame the time stamp
belongs.
In the current implementation the TxFCB is immediately followed by the
frame. Since the eTSEC inserts the TX time stamp 8 bytes after the TxFCB,
parts of the frame have been overwritten and an invalid frame was returned
on the socket error queue.
This patch fixes the described problem by adding additional 16 padding
bytes between the TxFCB and the frame for all messages sent from a time
stamping enabled socket (other sockets are not affected).
Signed-off-by: Manfred Rudigier <manfred.rudigier@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
When there is not enough headroom in the skb a private copy will be made.
However, the private copy had no reference to the socket and consequently
no time stamp could be queued on the socket error queue during the
skb_tstamp_tx function. This patch fixes this issue by also stealing the
sock reference from the original skb after making the private copy.
Signed-off-by: Manfred Rudigier <manfred.rudigier@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce function mdiobus_alloc_size() as an alternative to mdiobus_alloc().
Most callers of mdiobus_alloc() also allocate a private data structure, and
then manually point bus->priv to this object. mdiobus_alloc_size()
combines the two operations into one, which simplifies memory management.
The original mdiobus_alloc() now just calls mdiobus_alloc_size(0).
Signed-off-by: Timur Tabi <timur@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: bcm5974 - set BUTTONPAD property
Input: serio_raw - return proper result when serio_raw_write fails
Input: serio_raw - really signal HUP upon disconnect
Input: serio_raw - remove stray semicolon
Input: revert some over-zealous conversions to module_platform_driver()
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
FUSE: Notifying the kernel of deletion.
fuse: support ioctl on directories
fuse: Use kcalloc instead of kzalloc to allocate array
fuse: llseek optimize SEEK_CUR and SEEK_SET
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPDmxIAAoJENkgDmzRrbjxjZAP/00I83SuAfx8aDDgPB8o0Sv2
9xI8ZhYIu0S8KfT8tu6Sl7pww7VoTP6gBm/cySnrOw1a3diWjhT5Xa3EVHIicWSV
CWyCZK5cbTmPl2/NPs5cWbzrMGdSWotZN/7cCaKVQ5z43+uvRBLJj8s8LT5epzdA
hGJZ5FDX01kQamrTdITIDpRJgMcO0A5e2rPNYQavJFf1SFZR5sdRmeyB8BqOmlyL
r2cC13cNsHvyKc+xPZd0X96geslAah5XMJJacTMh9qaw1K/hyv5715iYciKANPEH
5+XKzb2gGIX9H2qOMklB/sngA+pEczfO9SD0oM21PyVzZPfWcl8xHkCJq1R1VMTS
vSCwvg/g9hBW6RgW3c3VE1b6usrLUbcaO1dzR1e5A7tPE48QZBzdi6CiJsWYlPPe
gIOU7MDjmRGSiryOetzj9Rp0jAJvLiD7DXNn9brQzbNaTKi6BvIEWxgKVYzXr8bk
pE9j8dNPU8TCBEQsrVNHKrgHL2Um8Mpbi8fzdzwi8odG7Ucu7oWKzsBiGkBiNB0B
5swRrUGDgiMkElLn6rwdg+lSrtZyzWEGZe3bd+4ATD0vhTt3p8oGCFtTwTtz6yfW
WhkzsToQA/EFR0JzaFkH9t1mnxTKTGli83cKVnljIP69Q3W5Bby8aTgkI2giAJkW
q+rPPmt3v5BY3rYwTgFe
=dgrs
-----END PGP SIGNATURE-----
Merge tag 'to-linus' of git://github.com/rustyrussell/linux
* tag 'to-linus' of git://github.com/rustyrussell/linux: (24 commits)
lguest: Make sure interrupt is allocated ok by lguest_setup_irq
lguest: move the lguest tool to the tools directory
lguest: switch segment-voodoo-numbers to readable symbols
virtio: balloon: Add freeze, restore handlers to support S4
virtio: balloon: Move vq initialization into separate function
virtio: net: Add freeze, restore handlers to support S4
virtio: net: Move vq and vq buf removal into separate function
virtio: net: Move vq initialization into separate function
virtio: blk: Add freeze, restore handlers to support S4
virtio: blk: Move vq initialization to separate function
virtio: console: Disable callbacks for virtqueues at start of S4 freeze
virtio: console: Add freeze and restore handlers to support S4
virtio: console: Move vq and vq buf removal into separate functions
virtio: pci: add PM notification handlers for restore, freeze, thaw, poweroff
virtio: pci: switch to new PM API
virtio_blk: fix config handler race
virtio: add debugging if driver doesn't kick.
virtio: expose added descriptors immediately.
virtio: avoid modulus operation.
virtio: support unlocked queue kick
...
The logic of the current code is that whenever we destroy
a cgroup that had its limit set (set meaning different than
maximum), we should decrement the jump_label counter.
Otherwise we assume it was never incremented.
But what the code actually does is test for RES_USAGE
instead of RES_LIMIT. Usage being different than maximum
is likely to be true most of the time.
The effect of this is that the key must become negative,
and since the jump_label test says:
!!atomic_read(&key->enabled);
we'll have jump_labels still on when no one else is
using this functionality.
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).
We miss needed barriers, even on x86, when y is not NULL.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It appears that you can only read the sprom contents with aligned 16-bit
reads: anything else causes at least some versions of the broadcom
chipset to abort the PCI transaction, returning 0xff.
This apparently doesn't trigger very often, because most setups don't
use an external srom chip, and the OTP sprom loading doesn't have this
issue. But at least the current 11" Macbook Air does trigger it, and
wireless communications were broken as a result.
Acked-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This very noisy sparse warning appears on almost every file in the
kernel:
CHECK init/main.c
arch/x86/include/asm/thread_info.h:43:55: error: dubious one-bit signed bitfield
arch/x86/include/asm/thread_info.h:44:46: error: dubious one-bit signed bitfield
This patch changes sig_on_uaccess_error and uaccess_err flags to unsigned
type and thus fixes the warning.
Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (526 commits)
ASoC: twl6040 - Add method to query optimum PDM_DL1 gain
ALSA: hda - Fix the lost power-setup of seconary pins after PM resume
ALSA: usb-audio: add Yamaha MOX6/MOX8 support
ALSA: virtuoso: add S/PDIF input support for all Xonars
ALSA: ice1724 - Support for ooAoo SQ210a
ALSA: ice1724 - Allow card info based on model only
ALSA: ice1724 - Create capture pcm only for ADC-enabled configurations
ALSA: hdspm - Provide unique driver id based on card serial
ASoC: Dynamically allocate the rtd device for a non-empty release()
ASoC: Fix recursive dependency due to select ATMEL_SSC in SND_ATMEL_SOC_SSC
ALSA: hda - Fix the detection of "Loopback Mixing" control for VIA codecs
ALSA: hda - Return the error from get_wcaps_type() for invalid NIDs
ALSA: hda - Use auto-parser for HP laptops with cx20459 codec
ALSA: asihpi - Fix potential Oops in snd_asihpi_cmode_info()
ALSA: hdsp - Fix potential Oops in snd_hdsp_info_pref_sync_ref()
ALSA: hda/cirrus - support for iMac12,2 model
ASoC: cx20442: add bias control over a platform provided regulator
ALSA: usb-audio - Avoid flood of frame-active debug messages
ALSA: snd-usb-us122l: Delete calls to preempt_disable
mfd: Put WM8994 into cache only mode when suspending
...
Fix up trivial conflicts in:
- arch/arm/mach-s3c64xx/mach-crag6410.c:
renamed speyside_wm8962 to tobermory, added littlemill right
next to it
- drivers/base/regmap/{regcache.c,regmap.c}:
duplicate diff that had already come in with other changes in
the regmap tree
We only need amd_bus.o for AMD systems with PCI. arch/x86/pci/Makefile
already depends on CONFIG_PCI=y, so this patch just adds the dependency
on CONFIG_AMD_NB.
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: stable@kernel.org # 2.6.34+ (needs adjustment for k8 -> amd rename)
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)
iEYEABECAAYFAk8Obj8ACgkQGkmNcg7/o7hzngCfS5az4ZP3D+e/cvatHZm/nAzn
0mIAoKbYyXpLXGkEN+yDkd5YZAYwQjVR
=kryV
-----END PGP SIGNATURE-----
Merge tag 'rmobile-for-linus' of git://github.com/pmundt/linux-sh
SH/R-Mobile updates for 3.3 merge window.
* tag 'rmobile-for-linus' of git://github.com/pmundt/linux-sh: (32 commits)
arm: mach-shmobile: add a resource name for shdma
ARM: mach-shmobile: r8a7779 SMP support V3
ARM: mach-shmobile: Add kota2 defconfig.
ARM: mach-shmobile: Add marzen defconfig.
ARM: mach-shmobile: r8a7779 power domain support V2
ARM: mach-shmobile: Fix up marzen build for recent GIC changes.
ARM: mach-shmobile: r8a7779 PFC function support
ARM: mach-shmobile: Flush caches in platform_cpu_die()
ARM: mach-shmobile: Allow SoC specific CPU kill code
ARM: mach-shmobile: Fix headsmp.S code to use CPUINIT
ARM: mach-shmobile: clock-r8a7779: clkz/clkzs support
ARM: mach-shmobile: clock-r8a7779: add DIV4 clock support
ARM: mach-shmobile: Marzen LAN89218 support
ARM: mach-shmobile: Marzen SCIF2/SCIF4 support
ARM: mach-shmobile: r8a7779 PFC GPIO-only support V2
ARM: mach-shmobile: r8a7779 and Marzen base support V2
sh: pfc: Unlock register support
sh: pfc: Variable bitfield width config register support
sh: pfc: Add config_reg_helper() function
sh: pfc: Convert index to field and value pair
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)
iEYEABECAAYFAk8ObfwACgkQGkmNcg7/o7gOHwCfZo8197ppks1LIfACo27cL8Q8
rU8AnR+igHfCEkg9RlrX9sJ8jqxtpYLZ
=WE1M
-----END PGP SIGNATURE-----
Merge tag 'sh-for-linus' of git://github.com/pmundt/linux-sh
SuperH updates for 3.3 merge window.
* tag 'sh-for-linus' of git://github.com/pmundt/linux-sh: (38 commits)
sh: magicpanelr2: Update for parse_mtd_partitions() fallout.
sh: mach-rsk: Update for parse_mtd_partitions() fallout.
sh: sh2a: Improve cache flush/invalidate functions
sh: also without PM_RUNTIME pm_runtime.o must be built
sh: add a resource name for shdma
sh: Remove redundant try_to_freeze() invocations.
sh: Ensure IRQs are enabled across do_notify_resume().
sh: Fix up store queue code for subsys_interface changes.
sh: clkfwk: sh_clk_init_parent() should be called after clk_register()
sh: add platform_device for renesas_usbhs in board-sh7757lcr
sh: modify clock-sh7757 for renesas_usbhs
sh: pfc: ioremap() support
sh: use ioread32/iowrite32 and mapped_reg for div6
sh: use ioread32/iowrite32 and mapped_reg for div4
sh: use ioread32/iowrite32 and mapped_reg for mstp32
sh: extend clock struct with mapped_reg member
sh: clkfwk: clock-sh73a0: all div6_clks use SH_CLK_DIV6_EXT()
sh: clkfwk: clock-sh7724: all div6_clks use SH_CLK_DIV6_EXT()
sh: clock-sh7723: add CLKDEV_ICK_ID for cleanup
serial: sh-sci: Handle GPIO function requests.
...
Make sure the interrupt is allocated correctly by lguest_setup_irq (check the
return value of irq_alloc_desc_at for -ENOMEM)
Signed-off-by: Stratos Psomadakis <psomas@cslab.ece.ntua.gr>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleanups and commentry)
This is a better location instead of having it in Documentation.
Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (fixed compile)
When studying lguest's x86 segment descriptor code, it is not longer
necessary to have the Intel x86 architecture manual open on the page
with the segment descriptor illustration to understand the crazy
numbers assigned to both descriptor structure halves a/b.
Now the struct desc_struct's fields, like suggested by
Glauber de Oliveira Costa in 2008, are used.
Signed-off-by: Jacek Galowicz <jacek@galowicz.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Handling balloon hibernate / restore is tricky. If the balloon was
inflated before going into the hibernation state, upon resume, the host
will not have any memory of that. Any pages that were passed on to the
host earlier would most likely be invalid, and the host will have to
re-balloon to the previous value to get in the pre-hibernate state.
So the only sane thing for the guest to do here is to discard all the
pages that were put in the balloon. When to discard the pages is the
next question.
One solution is to deflate the balloon just before writing the image to
the disk (in the freeze() PM callback). However, asking for pages from
the host just to discard them immediately after seems wasteful of
resources. Hence, it makes sense to do this by just fudging our
counters soon after wakeup. This means we don't deflate the balloon
before sleep, and also don't put unnecessary pressure on the host.
This also helps in the thaw case: if the freeze fails for whatever
reason, the balloon should continue to remain in the inflated state.
This was tested by issuing 'swapoff -a' and trying to go into the S4
state. That fails, and the balloon stays inflated, as expected. Both
the host and the guest are happy.
Finally, in the restore() callback, we empty the list of pages that were
previously given off to the host, add the appropriate number of pages to
the totalram_pages counter, reset the num_pages counter to 0, and
all is fine.
As a last step, delete the vqs on the freeze callback to prepare for
hibernation, and re-create them in the restore and thaw callbacks to
resume normal operation.
The kthread doesn't race with any operations here, since it's frozen
before the freeze() call and is thawed after the thaw() and restore()
callbacks, so we're safe with that.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The probe and PM restore functions will share this code.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>