It seems odd that truncate_inode_pages_range(), called not only when
truncating but also when evicting inodes, has mem_cgroup_uncharge_start
and _end() batching in its second loop to clear up a few leftovers, but
not in its first loop that does almost all the work: add them there too.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The THP code didn't pass the correct interleaving shift to the memory
policy code. Fix this here by adjusting for the order.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A race can occur when io_submit() races with io_destroy():
CPU1 CPU2
io_submit()
do_io_submit()
...
ctx = lookup_ioctx(ctx_id);
io_destroy()
Now do_io_submit() holds the last reference to ctx.
...
queue new AIO
put_ioctx(ctx) - frees ctx with active AIOs
We solve this issue by checking whether ctx is being destroyed in AIO
submission path after adding new AIO to ctx. Then we are guaranteed that
either io_destroy() waits for new AIO or we see that ctx is being
destroyed and bail out.
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
lookup_ioctx doesn't implement the rcu lookup pattern properly.
rcu_read_lock does not prevent refcount going to zero, so we might take
a refcount on a zero count ioctx.
Fix the bug by atomically testing for zero refcount before incrementing.
[jack@suse.cz: added comment into the code]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In linux rtc_time struct, tm_mon range is 0~11, tm_wday range is 0~6,
while in RTC HW REG, month range is 1~12, day of the week range is 1~7,
this patch adjusts difference of them.
The efect of this bug was that most of month will be operated on as the
next month by the hardware (When in Jan it maybe even worse). For
example, if in May, software wrote 4 to the hardware, which handled it as
April. Then the logic would be different between software and hardware,
which would cause weird things to happen.
Signed-off-by: Lei Xu <B33228@freescale.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Jack Lan <jack.lan@freescale.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel automatically evaluates partition tables of storage devices.
The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains
a bug that causes a kernel oops on certain corrupted LDM partitions. A
kernel subsystem seems to crash, because, after the oops, the kernel no
longer recognizes newly connected storage devices.
The patch changes ldm_parse_vmdb() to Validate the value of vblk_size.
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Acked-by: Richard Russon <ldm@flatcap.org>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
should_continue_reclaim() for reclaim/compaction allows scanning to
continue even if pages are not being reclaimed until the full list is
scanned. In terms of allocation success, this makes sense but potentially
it introduces unwanted latency for high-order allocations such as
transparent hugepages and network jumbo frames that would prefer to fail
the allocation attempt and fallback to order-0 pages. Worse, there is a
potential that the full LRU scan will clear all the young bits, distort
page aging information and potentially push pages into swap that would
have otherwise remained resident.
This patch will stop reclaim/compaction if no pages were reclaimed in the
last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
LRU list may still be scanned.
Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
is not set so the following avoids the gfp_mask being examined:
if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
return false;
A tool was developed based on ftrace that tracked the latency of
high-order allocations while transparent hugepage support was enabled and
three benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with
Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
applied.
STREAM Highorder Allocation Latency Statistics
fix-infinite break-early
1 :: Count 10298 10229
1 :: Min 0.4560 0.4640
1 :: Mean 1.0589 1.0183
1 :: Max 14.5990 11.7510
1 :: Stddev 0.5208 0.4719
2 :: Count 2 1
2 :: Min 1.8610 3.7240
2 :: Mean 3.4325 3.7240
2 :: Max 5.0040 3.7240
2 :: Stddev 1.5715 0.0000
9 :: Count 111696 111694
9 :: Min 0.5230 0.4110
9 :: Mean 10.5831 10.5718
9 :: Max 38.4480 43.2900
9 :: Stddev 1.1147 1.1325
Mean time for order-1 allocations is reduced. order-2 looks increased but
with so few allocations, it's not particularly significant. THP mean
allocation latency is also reduced. That said, allocation time varies so
significantly that the reductions are within noise.
Max allocation time is reduced by a significant amount for low-order
allocations but reduced for THP allocations which presumably are now
breaking before reclaim has done enough work.
SysBench Highorder Allocation Latency Statistics
fix-infinite break-early
1 :: Count 15745 15677
1 :: Min 0.4250 0.4550
1 :: Mean 1.1023 1.0810
1 :: Max 14.4590 10.8220
1 :: Stddev 0.5117 0.5100
2 :: Count 1 1
2 :: Min 3.0040 2.1530
2 :: Mean 3.0040 2.1530
2 :: Max 3.0040 2.1530
2 :: Stddev 0.0000 0.0000
9 :: Count 2017 1931
9 :: Min 0.4980 0.7480
9 :: Mean 10.4717 10.3840
9 :: Max 24.9460 26.2500
9 :: Stddev 1.1726 1.1966
Again, mean time for order-1 allocations is reduced while order-2
allocations are too few to draw conclusions from. The mean time for THP
allocations is also slightly reduced albeit the reductions are within
varianes.
Once again, our maximum allocation time is significantly reduced for
low-order allocations and slightly increased for THP allocations.
Anon stream mmap reference Highorder Allocation Latency Statistics
1 :: Count 1376 1790
1 :: Min 0.4940 0.5010
1 :: Mean 1.0289 0.9732
1 :: Max 6.2670 4.2540
1 :: Stddev 0.4142 0.2785
2 :: Count 1 -
2 :: Min 1.9060 -
2 :: Mean 1.9060 -
2 :: Max 1.9060 -
2 :: Stddev 0.0000 -
9 :: Count 11266 11257
9 :: Min 0.4990 0.4940
9 :: Mean 27250.4669 24256.1919
9 :: Max 11439211.0000 6008885.0000
9 :: Stddev 226427.4624 186298.1430
This benchmark creates one thread per CPU which references an amount of
anonymous memory 1.5 times the size of physical RAM. This pounds swap
quite heavily and is intended to exercise THP a bit.
Mean allocation time for order-1 is reduced as before. It's also reduced
for THP allocations but the variations here are pretty massive due to
swap. As before, maximum allocation times are significantly reduced.
Overall, the patch reduces the mean and maximum allocation latencies for
the smaller high-order allocations. This was with Slab configured so it
would be expected to be more significant with Slub which uses these size
allocations more aggressively.
The mean allocation times for THP allocations are also slightly reduced.
The maximum latency was slightly increased as predicted by the comments
due to reclaim/compaction breaking early. However, workloads care more
about the latency of lower-order allocations than THP so it's an
acceptable trade-off.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The regulator framework is used for power management. The regulators are
only named in the driver code, the actual control stuff is in the board
file for each architecture or use case.
The PN544 chip has three regulators that can be controlled or not -
depending on the architecture where the chip is being used. So some of
the regulators may not be controllable. In our current case the third
regulator, which was missing from the code, went unnoticed because we
didn't need to control it. To be as general as possible - in this respect
- the driver needs to list all regulators. Then the board file can be
used to actually set the usage.
Signed-off-by: Matti J. Aaltonen <matti.j.aaltonen@nokia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Spell out the NFC acronym when it's shown for the first time.
Signed-off-by: Matti J. Aaltonen <matti.j.aaltonen@nokia.com>
Acked-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swiotlb's map_page wrongly calls panic() when it can't find a buffer fit
for device's dma mask. It should return an error instead.
Devices with an odd dma mask (i.e. under 4G) like b44 network card hit
this bug (the system crashes):
http://marc.info/?l=linux-kernel&m=129648943830106&w=2
If swiotlb returns an error, b44 driver can use the own bouncing
mechanism.
Reported-by: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I have translated some kernel documentation so I wish to maintain the
Chinese documentation in our kernel directories.
Signed-off-by: Harry Wei <harryxiyou@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In several places, an epoll fd can call another file's ->f_op->poll()
method with ep->mtx held. This is in general unsafe, because that other
file could itself be an epoll fd that contains the original epoll fd.
The code defends against this possibility in its own ->poll() method using
ep_call_nested, but there are several other unsafe calls to ->poll
elsewhere that can be made to deadlock. For example, the following simple
program causes the call in ep_insert recursively call the original fd's
->poll, leading to deadlock:
#include <unistd.h>
#include <sys/epoll.h>
int main(void) {
int e1, e2, p[2];
struct epoll_event evt = {
.events = EPOLLIN
};
e1 = epoll_create(1);
e2 = epoll_create(2);
pipe(p);
epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
write(p[1], p, sizeof p);
epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
return 0;
}
On insertion, check whether the inserted file is itself a struct epoll,
and if so, do a recursive walk to detect whether inserting this file would
create a loop of epoll structures, which could lead to deadlock.
[nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Reported-by: Nelson Elhage <nelhage@ksplice.com>
Tested-by: Nelson Elhage <nelhage@ksplice.com>
Cc: <stable@kernel.org> [2.6.34+, possibly earlier]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6:
regulator, mc13xxx: Remove pointless test for unsigned less than zero
regulator: Fix warning with CONFIG_BUG disabled
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix fiemap bugs with delalloc
Btrfs: set FMODE_EXCL in btrfs_device->mode
Btrfs: make btrfs_rm_device() fail gracefully
Btrfs: Avoid accessing unmapped kernel address
Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
Btrfs: allow balance to explicitly allocate chunks as it relocates
Btrfs: put ENOSPC debugging under a mount option
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86 quirk: Fix polarity for IRQ0 pin2 override on SB800 systems
x86/mrst: Fix apb timer rating when lapic timer is used
x86: Fix reboot problem on VersaLogic Menlow boards
The member of the rtc_class_ops struct is called alarm_irq_enable and
not alarm_irq_enabled
CC: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jelle Martijn Kok <jmkok@youcom.nl>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6:
usb: musb: core: set has_tt flag
USB: xhci: mark local functions as static
USB: xhci: fix couple sparse annotations
USB: xhci: rework xhci_print_ir_set() to get ir set from xhci itself
USB: Reset USB 3.0 devices on (re)discovery
xhci: Fix an error in count_sg_trbs_needed()
xhci: Fix errors in the running total calculations in the TRB math
xhci: Clarify some expressions in the TRB math
xhci: Avoid BUG() in interrupt context
* 'for-linus' of git://neil.brown.name/md:
md: Fix - again - partition detection when array becomes active
Fix over-zealous flush_disk when changing device size.
md: avoid spinlock problem in blk_throtl_exit
md: correctly handle probe of an 'mdp' device.
md: don't set_capacity before array is active.
md: Fix raid1->raid0 takeover
With slab poisoning enabled, I see the following oops:
Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6b73
...
NIP [c0000000006bc61c] .rxrpc_destroy+0x44/0x104
LR [c0000000006bc618] .rxrpc_destroy+0x40/0x104
Call Trace:
[c0000000feb2bc00] [c0000000006bc618] .rxrpc_destroy+0x40/0x104 (unreliable)
[c0000000feb2bc90] [c000000000349b2c] .key_cleanup+0x1a8/0x20c
[c0000000feb2bd40] [c0000000000a2920] .process_one_work+0x2f4/0x4d0
[c0000000feb2be00] [c0000000000a2d50] .worker_thread+0x254/0x468
[c0000000feb2bec0] [c0000000000a868c] .kthread+0xbc/0xc8
[c0000000feb2bf90] [c000000000020e00] .kernel_thread+0x54/0x70
We aren't initialising token->next, but the code in destroy_context relies
on the list being NULL terminated. Use kzalloc to zero out all the fields.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm seeing the following oops when testing afs:
Unable to handle kernel paging request for data at address 0x00000008
...
NIP [c0000000003393b0] .afs_unlink_writeback+0x38/0xc0
LR [c00000000033987c] .afs_put_writeback+0x98/0xec
Call Trace:
[c00000000345f600] [c00000000033987c] .afs_put_writeback+0x98/0xec
[c00000000345f690] [c00000000033ae80] .afs_write_begin+0x6a4/0x75c
[c00000000345f790] [c00000000012b77c] .generic_file_buffered_write+0x148/0x320
[c00000000345f8d0] [c00000000012e1b8] .__generic_file_aio_write+0x37c/0x3e4
[c00000000345f9d0] [c00000000012e2a8] .generic_file_aio_write+0x88/0xfc
[c00000000345fa90] [c0000000003390a8] .afs_file_write+0x10c/0x178
[c00000000345fb40] [c000000000188788] .do_sync_write+0xc4/0x128
[c00000000345fcc0] [c000000000189658] .vfs_write+0xe8/0x1d8
[c00000000345fd70] [c000000000189884] .SyS_write+0x68/0xb0
[c00000000345fe30] [c000000000008564] syscall_exit+0x0/0x40
afs_write_begin hits an error and calls afs_unlink_writeback. In there
we do list_del_init on an uninitialised list.
The patch below initialises ->link when creating the afs_writeback struct.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable 'val' is a 'unsigned int', so it can never be less than zero.
This fact makes the "val < 0" part of the test done in BUG_ON() in
mc13xxx_regulator_get_voltage() rather pointles since it can never have
any effect.
This patch removes the pointless test.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Alberto Panizzo <maramaopercheseimorto@gmail.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/i915: Fix unintended recursion in ironlake_disable_rc6
drm/i915: fix corruptions on i8xx due to relaxed fencing
drm/i915: skip FDI & PCH enabling for DP_A
agp/intel: Experiment with a 855GM GWB bit
drm/i915: don't enable FDI & transcoder interrupts after all
drm/i915: Ignore a hung GPU when flushing the framebuffer prior to a switch
On some SB800 systems polarity for IOAPIC pin2 is wrongly
specified as low active by BIOS. This caused system hangs after
resume from S3 when HPET was used in one-shot mode on such
systems because a timer interrupt was missed (HPET signal is
high active).
For more details see:
http://marc.info/?l=linux-kernel&m=129623757413868
Tested-by: Manoj Iyer <manoj.iyer@canonical.com>
Tested-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org # 37.x, 32.x
LKML-Reference: <20110224145346.GD3658@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
MUSB is a non-standard host implementation which
can handle all speeds with the same core. We need
to set has_tt flag after commit
d199c96d41 (USB: prevent
buggy hubs from crashing the USB stack) in order for
MUSB HCD to continue working.
Signed-off-by: Felipe Balbi <balbi@ti.com>
Cc: stable <stable@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Michael Jones <michael.jones@matrix-vision.de>
Tested-by: Alexander Holler <holler@ahsoftware.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
After disabling, we're meant to teardown the bo used for the contexts,
not recurse into ourselves again and preventing module unload.
Reported-and-tested-by: Ben Widawsky <bwidawsk@gmail.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The new implementation of bd_link_disk_holder() added by 49731baa41
(block: restore multiple bd_link_disk_holder() support) didn't get an
extra reference for the holder_dir kobject of the slave bdev; however,
bdev kills holder_dir on removal, not release, so if the slave bdev is
removed while there are holder links, the holder_dir will be destroyed
while there still are holder links, which leads to oops later when
bd_unlink_disk_order() tries to remove those links.
Make bd_link_disk_holder() grab an extra reference for the slave's
holder_dir and put it in bd_unlink_disk_holder().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
Tested-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Grab a reference to bdev before calling blkdev_get(), which expects
the refcount to be already incremented and either returns success or
decrements the refcount and returns an error.
The bug was introduced by e525fd89 (block: make blkdev_get/put()
handle exclusive access), which didn't take into account this behavior
of blkdev_get().
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adam Kovari and others reported that disconnecting an USB drive with
an ntfs-3g filesystem would cause "kernel BUG at fs/inode.c:1421!" to
be triggered.
The BUG could be traced back to ioctl(BLKBSZSET), which would
erroneously decrement the refcount on the bdev. This is because
blkdev_get() expects the refcount to be already incremented and either
returns success or decrements the refcount and returns an error.
The bug was introduced by e525fd89 (block: make blkdev_get/put()
handle exclusive access), which didn't take into account this behavior
of blkdev_get().
This fixes
https://bugzilla.kernel.org/show_bug.cgi?id=29202
(and likely 29792 too)
Reported-by: Adam Kovari <kovariadam@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Need to adjust the clockevent device rating for the structure
that will be registered with clockevent system instead of the
temporary structure.
Without this fix, APB timer rating will be higher than LAPIC
timer such that it can not be released later to be used as the
broadcast timer.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
LKML-Reference: <1298506046-439-1-git-send-email-jacob.jun.pan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
By the commit
b3e19d9 2011-01-07 fs: scale mntget/mntput
vfsmount_lock was introduced around testing mnt_count.
Fix the mis-typed 'unlock'
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Revert
b821eaa572
and
f3b99be19d
When I wrote the first of these I had a wrong idea about the
lifetime of 'struct block_device'. It can disappear at any time that
the block device is not open if it falls out of the inode cache.
So relying on the 'size' recorded with it to detect when the
device size has changed and so we need to revalidate, is wrong.
Rather, we really do need the 'changed' attribute stored directly in
the mddev and set/tested as appropriate.
Without this patch, a sequence of:
mknod / open / close / unlink
(which can cause a block_device to be created and then destroyed)
will result in a rescan of the partition table and consequence removal
and addition of partitions.
Several of these in a row can get udev racing to create and unlink and
other code can get confused.
With the patch, the rescan is only performed when needed and so there
are no races.
This is suitable for any stable kernel from 2.6.35.
Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
There are two cases when we call flush_disk.
In one, the device has disappeared (check_disk_change) so any
data will hold becomes irrelevant.
In the oter, the device has changed size (check_disk_size_change)
so data we hold may be irrelevant.
In both cases it makes sense to discard any 'clean' buffers,
so they will be read back from the device if needed.
In the former case it makes sense to discard 'dirty' buffers
as there will never be anywhere safe to write the data. In the
second case it *does*not* make sense to discard dirty buffers
as that will lead to file system corruption when you simply enlarge
the containing devices.
flush_disk calls __invalidate_devices.
__invalidate_device calls both invalidate_inodes and invalidate_bdev.
invalidate_inodes *does* discard I_DIRTY inodes and this does lead
to fs corruption.
invalidate_bev *does*not* discard dirty pages, but I don't really care
about that at present.
So this patch adds a flag to __invalidate_device (calling it
__invalidate_device2) to indicate whether dirty buffers should be
killed, and this is passed to invalidate_inodes which can choose to
skip dirty inodes.
flusk_disk then passes true from check_disk_change and false from
check_disk_size_change.
dm avoids tripping over this problem by calling i_size_write directly
rathher than using check_disk_size_change.
md does use check_disk_size_change and so is affected.
This regression was introduced by commit 608aeef17a which causes
check_disk_size_change to call flush_disk, so it is suitable for any
kernel since 2.6.27.
Cc: stable@kernel.org
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Andrew Patterson <andrew.patterson@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
a hole with madvise(,, MADV_REMOVE). That path is under mutex, and
cannot be explained by lack of serialization in unmap_mapping_range().
Reviewing the code, I found one place where vm_truncate_count handling
should have been updated, when I switched at the last minute from one
way of managing the restart_addr to another: mremap move changes the
virtual addresses, so it ought to adjust the restart_addr.
But rather than exporting the notion of restart_addr from memory.c, or
converting to restart_pgoff throughout, simply reset vm_truncate_count
to 0 to force a rescan if mremap move races with preempted truncation.
We have no confirmation that this fixes Robert's BUG,
but it is a fix that's worth making anyway.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michael Leun reported that running parallel opens on a fuse filesystem
can trigger a "kernel BUG at mm/truncate.c:475"
Gurudas Pai reported the same bug on NFS.
The reason is, unmap_mapping_range() is not prepared for more than
one concurrent invocation per inode. For example:
thread1: going through a big range, stops in the middle of a vma and
stores the restart address in vm_truncate_count.
thread2: comes in with a small (e.g. single page) unmap request on
the same vma, somewhere before restart_address, finds that the
vma was already unmapped up to the restart address and happily
returns without doing anything.
Another scenario would be two big unmap requests, both having to
restart the unmapping and each one setting vm_truncate_count to its
own value. This could go on forever without any of them being able to
finish.
Truncate and hole punching already serialize with i_mutex. Other
callers of unmap_mapping_range() do not, and it's difficult to get
i_mutex protection for all callers. In particular ->d_revalidate(),
which calls invalidate_inode_pages2_range() in fuse, may be called
with or without i_mutex.
This patch adds a new mutex to 'struct address_space' to prevent
running multiple concurrent unmap_mapping_range() on the same mapping.
[ We'll hopefully get rid of all this with the upcoming mm
preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
lockbreak" patch in particular. But that is for 2.6.39 ]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Michael Leun <lkml20101129@newton.leun.net>
Reported-by: Gurudas Pai <gurudas.pai@oracle.com>
Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 556ea928f7.
Jeff Chua reports that it can cause some bluetooth devices (he mentions
an Bluetooth Intermec scanner) to just stop responding after a while
with messages like
[ 4533.361959] btusb 8-1:1.0: no reset_resume for driver btusb?
[ 4533.361964] btusb 8-1:1.1: no reset_resume for driver btusb?
from the kernel. See also
https://bugzilla.kernel.org/show_bug.cgi?id=26182
for other reports.
Reported-by: Jeff Chua <jeff.chua.linux@gmail.com>
Reported-by: Andrew Meakovski <meako@bigmir.net>
Reported-by: Jim Faulkner <jfaulkne@ccs.neu.edu>
Acked-by: Greg KH <gregkh@suse.de>
Acked-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Gustavo F. Padovan <padovan@profusion.mobi>
Cc: stable@kernel.org (for 2.6.37)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel:
drm/i915: fix corruptions on i8xx due to relaxed fencing
drm/i915: skip FDI & PCH enabling for DP_A
agp/intel: Experiment with a 855GM GWB bit
drm/i915: don't enable FDI & transcoder interrupts after all
drm/i915: Ignore a hung GPU when flushing the framebuffer prior to a switch
It looks like gen2 has a peculiar interleaved 2-row inter-tile
layout. Probably inherited from i81x which had 2kb tiles (which
naturally fit an even-number-of-tile-rows scheme to fit onto 4kb
pages). There is no other mention of this in any docs (also not
in the Intel internal documention according to Chris Wilson).
Problem manifests itself in corruptions in the second half of the
last tile row (if the bo has an odd number of tiles). Which can
only happen with relaxed tiling (introduced in a00b10c360).
So reject set_tiling calls that don't satisfy this constrain to
prevent broken userspace from causing havoc. While at it, also
check the size for newer chipsets.
LKML: https://lkml.org/lkml/2011/2/19/5
Reported-by: Indan Zupancic <indan@nul.nu>
Tested-by: Indan Zupancic <indan@nul.nu>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (33 commits)
Added support for usb ethernet (0x0fe6, 0x9700)
r8169: fix RTL8168DP power off issue.
r8169: correct settings of rtl8102e.
r8169: fix incorrect args to oob notify.
DM9000B: Fix PHY power for network down/up
DM9000B: Fix reg_save after spin_lock in dm9000_timeout
net_sched: long word align struct qdisc_skb_cb data
sfc: lower stack usage in efx_ethtool_self_test
bridge: Use IPv6 link-local address for multicast listener queries
bridge: Fix MLD queries' ethernet source address
bridge: Allow mcast snooping for transient link local addresses too
ipv6: Add IPv6 multicast address flag defines
bridge: Add missing ntohs()s for MLDv2 report parsing
bridge: Fix IPv6 multicast snooping by correcting offset in MLDv2 report
bridge: Fix IPv6 multicast snooping by storing correct protocol type
p54pci: update receive dma buffers before and after processing
fix cfg80211_wext_siwfreq lock ordering...
rt2x00: Fix WPA TKIP Michael MIC failures.
ath5k: Fix fast channel switching
tcp: undo_retrans counter fixes
...
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
amd64-agp: fix crash at second module load
drm/radeon: fix regression with AA resolve checking
drm: drop commented out code and preceding comment
drm/vblank: Enable precise vblank timestamps for interlaced and doublescan modes.
drm/vblank: Use memory barriers optimized for atomic_t instead of generics.
drm/vblank: Use abs64(diff_ns) for s64 diff_ns instead of abs(diff_ns)
drm/radeon/kms: align height of fb allocation.
Revert "drm/radeon/kms: switch back to min->max pll post divider iteration"
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: check if device support discard in xfs_ioc_trim()
xfs: prevent leaking uninitialized stack memory in FSGEOMETRY_V1
The device is very similar to (0x0fe6, 0x8101),
And works well with dm9601 driver.
Signed-off-by: Shahar Havivi <shaharh@redhat.com>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>