In the process of writing up the mechanical proof of correctness for the
dynticks/preemptable-RCU interface, I noticed misplaced memory barriers in
rcu_enter_nohz() and rcu_exit_nohz().
This patch puts them in the right place and adds a comment. The key thing to
keep in mind is that rcu_enter_nohz() is -exiting- the mode that can legally
execute RCU read-side critical sections.
The memory barrier must be between any potential RCU read-side critical
sections and the increment of the per-CPU dynticks_progress_counter, and thus
must come -before- this increment. And vice versa for rcu_exit_nohz().
The locking in the scheduler is probably saving us for the moment.
Also, switch to smp_mb() - we don't need a barrier for uniprocessor kernels.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 4c7ffe0b9f ("fbdev: prevent drivers that
have hardware cursors from calling software cursor code") every call of
i810fb_cursor fails with -ENXIO because of a incorrect "!".
This hasn't struck until eaa0ff15c3 ("fix !
versus & precedence in various places") surrounded the expression with braces,
so that the intended behavior was inverted. That caused 'pixel waste' - the
same line of multi-colored pixels repeated over the whole screen - during
console switch.
This switches back to the original pre-4c7ffe0 behavior.
Signed-off-by: Stefan Bauer <stefan.bauer@cs.tu-chemnitz.de>
Tested-by: Stefan Bauer <stefan.bauer@cs.tu-chemnitz.de>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Antonino Daplas <adaplas@pol.net>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a long-standing typo (predating git) that will cause data corruption if a
journal data block needs unescaping. At the moment the wrong buffer head's
data is being unescaped.
To test this case mount a filesystem with data=journal, start creating and
deleting a bunch of files containing only JBD2_MAGIC_NUMBER (0xc03b3998), then
pull the plug on the device. Without this patch the files will contain zeros
instead of the correct data after recovery.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a long-standing typo (predating git) that will cause data corruption if a
journal data block needs unescaping. At the moment the wrong buffer head's
data is being unescaped.
To test this case mount a filesystem with data=journal, start creating and
deleting a bunch of files containing only JFS_MAGIC_NUMBER (0xc03b3998), then
pull the plug on the device. Without this patch the files will contain zeros
instead of the correct data after recovery.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the SYSV ipc msgctl(),semctl(),shmctl() family, if the user passed *_INFO
as the desired operation, no specific object is meant to be controlled and
only system-wide information is returned. This leads to a NULL IPC object in
the LSM hooks if the _INFO flag is given.
Avoid dereferencing this NULL pointer in Smack ipc *ctl() methods.
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up an error in iget removal in which romfs_lookup() making a successful
call to romfs_iget() continues through the negative/error handling (previously
the successful case jumped around the negative/error handling case):
(1) inode is initialised to NULL at the top of the function, eliminating the
need for specific negative-inode handling. This means the positive
success handling now flows straight through.
(2) Rename the labels to be clearer about what they mean.
Also make romfs_lookup()'s result variable of type long so as to avoid
32-bit/64-bit conversions with PTR_ERR() and friends.
Based upon a report and patch from Adam Richter.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: "Adam J. Richter" <adam@yggdrasil.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several places where we make allocations with GFP_KERNEL while under
a transaction, which could lead to an assertion panic or lockup if under
memory pressure. This patch switches these problem areas to use GFP_NOFS to
keep these problems from happening.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ibmpex's temperature sensors report incorrect units. Apply a conversion
factor so that tempertures report correctly. Until now, no systems seemed to
report temperatures this way, but evidently QS2x blades do.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Cc: "Mark M. Hoffman" <mhoffman@lightlink.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enhanced the list of supported machines.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Cc: Mark M. Hoffman <mhoffman@lightlink.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The QS2x blades ships with v2.54 of the firmware, which use the same
multiplier for all power meters.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Cc: Mark M. Hoffman <mhoffman@lightlink.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should always put inode we have reference to, even if quota was reenabled
in the mean time.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check t->pid == t->pid is not the blessed way to check whether a task is a
group leader.
This is not about the code beautifulness only, but about pid namespaces fixes
- both the tgid and the pid fields on the task_struct are (slowly :( )
becoming deprecated.
Besides, the thread_group_leader() macro makes only one dereference :)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Exposing the binary blob which is the md 'super-block' via sysfs doesn't
really fit with the whole sysfs model, and ever since commit
8118a859dc ("sysfs: fix off-by-one error
in fill_read_buffer()") it doesn't actually work at all (as the size of
the blob is often one page).
(akpm: as in, fs/sysfs/file.c:fill_read_buffer() goes BUG)
So just remove it altogether. It isn't really useful.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Correct kernel-doc function names and parameters in rmap.c.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert tiny-shmem.c function comments to kernel-doc. Add parameters and
convert/fix other kernel-doc in shmem.c.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix various kernel-doc notation in mm/:
filemap.c: add function short description; convert 2 to kernel-doc
fremap.c: change parameter 'prot' to @prot
pagewalk.c: change "-" in function parameters to ":"
slab.c: fix short description of kmem_ptr_validate()
swap.c: fix description & parameters of put_pages_list()
swap_state.c: fix function parameters
vmalloc.c: change "@returns" to "Returns:" since that is not a parameter
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My group ran into a AIO process hang on a 2.6.24 kernel with the process
sleeping indefinitely in io_getevents(2) waiting for the last wakeup to come
and it never would.
We ran the tests on x86_64 SMP. The hang only occurred on a Xeon box
("Clovertown") but not a Core2Duo ("Conroe"). On the Xeon, the L2 cache isn't
shared between all eight processors, but is L2 is shared between between all
two processors on the Core2Duo we use.
My analysis of the hang is if you go down to the second while-loop
in read_events(), what happens on processor #1:
1) add_wait_queue_exclusive() adds thread to ctx->wait
2) aio_read_evt() to check tail
3) if aio_read_evt() returned 0, call [io_]schedule() and sleep
In aio_complete() with processor #2:
A) info->tail = tail;
B) waitqueue_active(&ctx->wait)
C) if waitqueue_active() returned non-0, call wake_up()
The way the code is written, step 1 must be seen by all other processors
before processor 1 checks for pending events in step 2 (that were recorded by
step A) and step A by processor 2 must be seen by all other processors
(checked in step 2) before step B is done.
The race I believed I was seeing is that steps 1 and 2 were
effectively swapped due to the __list_add() being delayed by the L2
cache not shared by some of the other processors. Imagine:
proc 2: just before step A
proc 1, step 1: adds to ctx->wait, but is not visible by other processors yet
proc 1, step 2: checks tail and sees no pending events
proc 2, step A: updates tail
proc 1, step 3: calls [io_]schedule() and sleeps
proc 2, step B: checks ctx->wait, but sees no one waiting, skips wakeup
so proc 1 sleeps indefinitely
My patch adds a memory barrier between steps A and B. It ensures that the
update in step 1 gets seen on processor 2 before continuing. If processor 1
was just before step 1, the memory barrier makes sure that step A (update
tail) gets seen by the time processor 1 makes it to step 2 (check tail).
Before the patch our AIO process would hang virtually 100% of the time. After
the patch, we have yet to see the process ever hang.
Signed-off-by: Quentin Barnes <qbarnes+linux@yahoo-inc.com>
Reviewed-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: <stable@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ We should probably disallow that "if (waitqueue_active()) wake_up()"
coding pattern, because it's so often buggy wrt memory ordering ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PCI bridge representing the PCIE root complex on Axon, contains
device BARs for a memory range and ROM that define inbound accesses.
This confuses the kernel resource management code -- the resources
need to be hidden when Axon is a host bridge.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The cell IOMMU code to parse the dma-ranges properties, used for the fixed
mapping, was broken in two ways for some devices.
Firstly it didn't cope with empty dma-ranges properties. An empty property
implies no translation so can be safely skipped.
The code also wrongly assumed it would be looking at PCI devices, and hard
coded the number of address and size cells.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The wrapper script didn't have entries for the TQM8540 board and the
SBC8548 or SBC8560 boards. I've assumed that the TQM8540 console is
8250 based and not CPM based by looking at its defconfig. There was
also a trailing * on the TQM8555 entry that I removed too.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Since the PMU is an NMI now, it can come at any time we are only soft
disabled. We must hard disable around the two places we allow the kernel
stack SLB and r1 to go out of sync. Otherwise the PMU exception can
force a kernel stack SLB into another slot, which can lead to it
getting evicted, which can lead to a nasty unrecoverable SLB miss
in the exception entry code.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The PTRACE_SETREGS request was only recently added on powerpc,
and gdb does not use it. So it slipped through without getting
all the testing it should have had.
The user_regset changes had a simple bug in storing to all of
the 32-bit general registers block on 64-bit kernels. This bug
only comes up with PTRACE_SETREGS, not PPC_PTRACE_SETREGS.
It causes a BUG_ON to hit, so this fix needs to go in ASAP.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Ignoring the return value from nfs_pageio_add_request can cause deadlocks.
In read path:
call nfs_pageio_add_request from readpage_async_filler
assume at this point that there are requests already in desc, that
can't be merged with the current request.
so nfs_pageio_doio is fired up to clear out desc.
assume something goes wrong in setting up the io, so desc->pg_error is set.
This causes nfs_pageio_add_request to return 0, *WITHOUT* adding the original
request.
BUT, since return code is ignored, readpage_async_filler assumes it has
been added, and does nothing further, leaving page locked.
do_generic_mapping_read will eventually call lock_page, resulting in deadlock
In write path:
page is marked dirty by generic_perform_write
nfs_writepages is called
call nfs_pageio_add_request from nfs_page_async_flush
assume at this point that there are requests already in desc, that
can't be merged with the current request.
so nfs_pageio_doio is fired up to clear out desc.
assume something goes wrong in setting up the io, so desc->pg_error is set.
This causes nfs_page_async_flush to return 0, *WITHOUT* adding the original
request, yet marking the request as locked (PG_BUSY) and in writeback,
clearing dirty marks.
The next time a write is done to the page, deadlock will result as
nfs_write_end calls nfs_update_request
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit:
a341cd0f (SCSI: add asynchronous event notification API)
breaks:
285e9670 (sr,sd: send media state change modification events)
by introducing an event filter, which is removed here, to make
events, we are depending on, happen again.
Fix this by removing the event filter. It's pretty much broken at the
moment, since a user can't set it (the attribute being read only). A
proper fix will be to make the event discriminator distinguish between
AN and Polled media change events.
Cc: David Zeuthen <david@fubar.dk>
Cc: kristen accardi <kaccardi@gmail.com>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Initialize the "state changed" flag, so we do not send a change event
immediately after registering a new device.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Somebody had put struct nameidata in stack frame of link_path_walk().
Unfortunately, there are certain realities to deal with:
* It's in the middle of recursion. Depth is equal to the nesting
depth of symlinks, i.e. up to 8.
* struct namiedata is, even if one discards the intent junk,
at least 12 pointers + 5 ints.
* moreover, adding a stack frame is not free in that situation.
* there are fs methods called on top of that, and they also have
stack footprint.
* kernel stack is not infinite.
The thing is, even if one chooses to deal with -ESTALE that way (and it's
one hell of an overkill), the only thing that needs to be preserved is
vfsmount + dentry, not the entire struct nameidata.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some new uses of get_empty_filp() have crept in; switched
to alloc_file() to make sure that pieces of initialization
won't be missing.
We really need to kill get_empty_filp().
[AV] fixed dentry leak on failure exit in anon_inode_getfd()
Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure no-one calls dentry_open with a NULL vfsmount argument and crap
out with a stacktrace otherwise. A NULL file->f_vfsmnt has always been
problematic, but with the per-mount r/o tracking we can't accept anymore
at all.
[AV] the last place that passed NULL had been eliminated by the previous
patch (reiserfs xattr stuff)
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
After several posts and bug reports regarding interaction with the NULL
nameidata, here's a patch to clean up the mess with struct file in the
reiserfs xattr code.
As observed in several of the posts, there's really no need for struct file
to exist in the xattr code. It was really only passed around due to the
f_op->readdir() and a_ops->{prepare,commit}_write prototypes requiring it.
reiserfs_prepare_write() and reiserfs_commit_write() don't actually use the
struct file passed to it, and the xattr code uses a private version of
reiserfs_readdir() to enumerate the xattr directories.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* hppfs_iget() and its users are racy; there's no need to pollute icache
anyway, new_inode() works fine and is safe, unlike the current kludges
(these relied on overwriting ->i_ino before another iget_locked() gets
to that one - and did it after unlocking).
* merge hppfs_iget()/init_inode()/hppfs_read_inode(), while we are
at it.
* to pass proper vfsmount to dentry_open() store the reference
in hppfs superblock.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
--
Here's patch for hppfs that uses vfs_kern_mount to make sure it always has a
procfs instance and passed the vfsmount on through the inode private data.
Also fixes a procfs file_system_type leak for every attempted hppfs mount.
[ jdike - gave this file a style workover, plus deleted hppfs_dentry_ops ]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4096 will not fit into the immediate field of a compare instruction,
in fact it will end up being -4096 causing the check to fail every
time and thus disabling backoff.
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx:
async_tx: avoid the async xor_zero_sum path when src_cnt > device->max_xor
fsldma: Fix the DMA halt when using DMA_INTERRUPT async_tx transfer.
This reverts commit 2c81ce4c9c.
It caused several new troubles (eg suspend slowdown bisected down to
this patch by Pavel Machek), so just revert it for now.
Signed-off-by: Alexey Starikovskiy <astarikovskiy@suse.de>
Cc: Pavel Machek <pavel@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we handle all the special commands using REQ_TYPE_ATA_TASKFILE
rather than using the old REQ_TYPE_ATA_CMD model, we need to also
emulate the lack of full taskfile data that comes with the old command
model (ie when commands are generated with the HDIO_DRIVE_CMD ioctl
rather than using the HDIO_DRIVE_TASK[FILE] ioctls).
In particular, this means that we should handle command completion the
more relaxed way that the old drive_cmd_intr() code did. It allows
commands to finish early even if they don't use up all the data that we
thought we had for them.
This fixes a regression seen by Anders Eriksson where some SMART
commands sent by smartd would cause a boot-time system hang on his
machine because the IDE command handling code didn't realize that the
command had completed.
Tested-by: Anders Eriksson <aeriksson@fastmail.fm>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WAKE_IDLE is too agressive on multi-core CPUs with the new
wake-affine code, keep it on for SMT/HT balancing alone
(where there's no cache affinity at all between logical CPUs).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Wakeup-buddy tasks are cache-hot - this makes it a bit harder
for the load-balancer to tear them apart. (but it's still possible,
if the load is sufficiently assymetric)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
improve affine wakeups. Maintain the 'overlap' metric based on CFS's
sum_exec_runtime - which means the amount of time a task executes
after it wakes up some other task.
Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
it means there's strong workload coupling between this task and the
woken up task. If the 'overlap' is large then the workload is decoupled
and the scheduler will move them to separate CPUs more easily.
( Also slightly move the preempt_check within try_to_wake_up() - this has
no effect on functionality but allows 'early wakeups' (for still-on-rq
tasks) to be correctly accounted as well.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
'sync' wakeups are a hint towards the scheduler that (certain)
networking related wakeups likely create coupling between tasks.
Signed-off-by: Ingo Molnar <mingo@elte.hu>