In SSD mode for data, and all the time for metadata, the allocator
will try to find a cluster of nearby blocks for allocations. This
commit adds extra checks to make sure that each free block in the
cluster is close to the last one.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs IO submission threads try to service a bunch of devices with a small
number of threads. They do a congestion check to try and avoid waiting
on requests for a busy device.
The checks make sure we've sent a few requests down to a given device just so
that we aren't bouncing between busy devices without actually sending down
any IO. The counter used to decide if we can switch to the next device
is somewhat overloaded. It is also being used to decide if we've done
a good batch of requests between the WRITE_SYNC and regular priority lists.
It may get reset to zero often, leaving us hammering on a busy device
instead of moving on to another disk.
This commit adds a new counter for the number of bios sent while
servicing a device. It doesn't get reset or fiddled with. On
multi-device filesystems, this fixes IO stalls in streaming
write workloads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs uses dedicated threads to submit bios when checksumming is on,
which allows us to make sure the threads dedicated to checksumming don't get
stuck waiting for requests. For each btrfs device, there are
two lists of bios. One list is for WRITE_SYNC bios and the other
is for regular priority bios.
The IO submission threads used to process all of the WRITE_SYNC bios first and
then switch to the regular bios. This commit makes sure we don't completely
starve the regular bios by rotating between the two lists.
WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries
to run in batches to avoid seeking. Benchmarking shows this eliminates
stalls during streaming buffered writes on both multi-device and
single device filesystems.
If the regular bios starve, the system can end up with a large amount of ram
pinned down in writeback pages. If we are a little more fair between the two
classes, we're able to keep throughput up and make progress on the bulk of
our dirty ram.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Once a metadata block has been written, it must be recowed, so the
btrfs dirty balancing call has a check to make sure a fair amount of metadata
was actually dirty before it started writing it back to disk.
A previous commit had changed the dirty tracking for metadata without
updating the btrfs dirty balancing checks. This commit switches it
to use the correct counter.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The block allocator in SSD mode will try to find groups of free blocks
that are close together. This commit makes it loop less on a given
group size before bumping it.
The end result is that we are less likely to fill small holes in the
available free space, but we don't waste as much CPU building the
large cluster used by ssd mode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the new back reference code, the cost of a balance has gone down
in terms of the number of back reference updates done. This commit
makes us more aggressively balance leaves and nodes as they become
less full.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When the delayed reference code was added, some checks were added
to avoid extra balancing while the delayed references were being flushed.
This made for less efficient btrees, but it reduced the chances of
loops where no forward progress was made because the balances made
more delayed ref updates.
With the new dead root removal code and the mixed back references,
the extent allocation tree is no longer using precise back refs, and
the delayed reference updates don't carry the risk of looping forever
anymore. So, the balance avoidance is no longer required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in a subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits the old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is some 'start = state->end + 1;' style code in set_extent_bit
and clear_extent_bit. It overflows when end == (u64)-1.
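A minimal userspace illustration of the wrap (not the btrfs code itself):
  #include <stdint.h>
  #include <stdio.h>
  int main(void)
  {
          uint64_t end = (uint64_t)-1;
          uint64_t start = end + 1;       /* wraps around to 0 */
          /* without a guard like "if (end == (uint64_t)-1) stop", a
           * range walk restarts from offset 0 instead of terminating */
          printf("start after wrap: %llu\n", (unsigned long long)start);
          return 0;
  }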
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In commit code, we scan buffers attached to a transaction. During this
scan, we sometimes have to drop j_list_lock and then we recheck whether
the journal buffer head didn't get freed by journal_try_to_free_buffers().
But checking for buffer_jbd(bh) isn't enough because a new journal head
could get attached to our buffer head. So add a check whether the journal
head remained the same and whether it's still at the same transaction and
list.
This is a nasty bug and can cause problems like memory corruption (use after
free) or trigger various assertions in JBD code (observed).
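A rough sketch of the added recheck after re-taking j_list_lock (names
simplified, not the exact fs/jbd/commit.c hunk):
  spin_lock(&journal->j_list_lock);
  if (!buffer_jbd(bh) || bh2jh(bh) != jh ||
      jh->b_transaction != commit_transaction ||
      jh->b_jlist != BJ_SyncData) {
          /* buffer was freed or reused while we slept; forget it
           * and restart the scan of the data list */
          spin_unlock(&journal->j_list_lock);
          goto restart;   /* illustrative label */
  }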
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent ->lookup() deadlock correction required the directory inode
mutex to be dropped while waiting for expire completion. We were
concerned about side effects from this change and one has been identified.
I saw several error messages.
They cause autofs to become quite confused and don't really point to the
actual problem.
Things like:
handle_packet_missing_direct:1376: can't find map entry for (43,1827932)
which is usually totally fatal (although in this case it wouldn't be
except that I treat it as such because it normally is).
do_mount_direct: direct trigger not valid or already mounted
/test/nested/g3c/s1/ss1
which is recoverable, however if this problem is at play it can cause
autofs to become quite confused as to the dependencies in the mount tree
because mount triggers end up mounted multiple times. It's hard to
accurately check for this over mounting case and automount shouldn't need
to if the kernel module is doing its job.
There was one other message, similar in consequence to this last one, but I
can't locate a log example just now.
When checking if a mount has already completed prior to adding a new mount
request to the wait queue we check if the dentry is hashed and, if so, if
it is a mount point. But, if a mount successfully completed while we
slept on the wait queue mutex the dentry must exist for the mount to have
completed so the test is not really needed.
Mounts can also be done on top of a global root dentry, so for the above
case, where a mount request completes and the wait queue entry has already
been removed, the hashed test returning false can cause an incorrect
callback to the daemon. Also, d_mountpoint() is not sufficient to check
if a mount has completed for the multi-mount case when we don't have a
real mount at the base of the tree.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects
until the system runs out of memory. Nowhere is calling ima_inode_free()
a.k.a. ima_iint_delete(). Fix that by calling it from destroy_inode().
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OK, that's probably the easiest way to do that, as much as I don't like it...
Since iget() et.al. will not accept I_FREEING (will wait to go away
and restart), and since we'd better have serialization between new/free
on fs data structures anyway, we can afford simply skipping I_FREEING
et.al. in insert_inode_locked().
We do that from new_inode, so it won't race with free_inode in any interesting
ways and it won't race with iget (of any origin; nfsd or in case of fs
corruption a lookup) since both still will wait for I_LOCK.
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Tested-by: David Watson <dbwatson@ukfsn.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The nobh_truncate_page() function is used by ext2, exofs, and jfs. Of
these three, only ext2 and jfs's get_block() function pays attention
to bh->b_size --- which is normally always the filesystem blocksize
except when the get_block() function is called by either
mpage_readpage(), mpage_readpages(), or the direct I/O routines in
fs/direct-io.c.
Unfortunately, nobh_truncate_page() does not initialize map_bh before
calling the filesystem-supplied get_block() function. So ext2 and jfs
will try to calculate the number of blocks to map by taking stack
garbage and shifting it left by inode->i_blkbits. This should be
*mostly* harmless (except the filesystem will do some unneeded work)
unless the stack garbage is less than the filesystem's blocksize, in which
case maxblocks will be zero, and the attempt to find out whether or
not the filesystem has a hole at a given logical block will fail, and
the page cache entry might not get zero'ed out.
Also if the stack garbage in map_bh->state happens to have the
BH_Mapped bit set, there could be an attempt to call readpage() on a
non-existent page, which could cause nobh_truncate_page() to return an
error when it should not.
Fix this by initializing map_bh->state and map_bh->size.
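A sketch of that initialization, assuming the on-stack buffer_head used by
nobh_truncate_page() is called map_bh:
  struct buffer_head map_bh;
  int err;
  /* hand get_block() a well-defined mapping request instead of
   * whatever happened to be on the stack */
  map_bh.b_size = blocksize;      /* map exactly one fs block */
  map_bh.b_state = 0;
  err = get_block(inode, iblock, &map_bh, 0);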
Fortunately, it's probably fairly unlikely that ext2 and jfs users
mount with nobh these days.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: Fix oops and use after free during space balancing
Btrfs: set device->total_disk_bytes when adding new device
The btrfs allocator uses list_for_each to walk the available block
groups when searching for free blocks. It starts off with a hint
to help find the best block group for a given allocation.
The hint is resolved into a block group, but we don't properly check
to make sure the block group we find isn't in the middle of being
freed due to filesystem shrinking or balancing. If it is being
freed, the list pointers in it are bogus and can't be trusted. But,
the code happily goes along and uses them in the list_for_each loop,
leading to all kinds of fun.
The fix used here is to check to make sure the block group we find really
is on the list before we use it. list_del_init is used when removing
it from the list, so we can do a proper check.
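A sketch of that check (variable names illustrative):
  /* list_del_init() leaves an empty list head behind, so an empty
   * head means the block group is being torn down */
  if (block_group && list_empty(&block_group->list))
          block_group = NULL;     /* ignore the hint, fall back to a full search */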
The allocation clustering code has a similar bug where it will trust
the block group in the current free space cluster. If our allocation
flags have changed (going from single spindle dup to raid1 for example)
because the drives in the FS have changed, we're not allowed to use
the old block group any more.
The fix used here is to check the current cluster against the
current allocation flags.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: prevent deadlock in xfs_qm_shake()
xfs: fix overflow in xfs_growfs_data_private
xfs: fix double unlock in xfs_swap_extents()
It's possible to recurse into the filesystem from a memory
allocation, which deadlocks in xfs_qm_shake(). Add a check
for __GFP_FS, and bail out if it is not set.
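A sketch of the guard, using the (nr_to_scan, gfp_mask) shrinker signature
of that era:
  static int xfs_qm_shake(int nr_to_scan, gfp_t gfp_mask)
  {
          /* a GFP_NOFS allocation must not recurse into the fs;
           * only reclaim dquots when __GFP_FS is allowed */
          if (!(gfp_mask & __GFP_FS))
                  return 0;
          return xfs_qm_shake_freelist(nr_to_scan);  /* reclaim as before (name illustrative) */
  }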
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
In the case where growing a filesystem would leave the last AG
too small, the fixup code has an overflow in the calculation
of the new size with one fewer ag, because "nagcount" is a 32
bit number. If the new filesystem has > 2^32 blocks in it
this causes a problem resulting in an EINVAL return from growfs:
# xfs_io -f -c "truncate 19998630180864" fsfile
# mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
# mount -o loop fsfile /mnt
# xfs_growfs /mnt
meta-data=/dev/loop0             isize=256    agcount=52, agsize=76288719 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3905982455, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument
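The arithmetic problem in isolation (illustrative numbers in the same
ballpark as the report above):
  #include <stdint.h>
  #include <stdio.h>
  int main(void)
  {
          uint32_t nagcount = 64;          /* 32-bit AG count after grow */
          uint32_t agblocks = 76288719;    /* blocks per AG */
          uint32_t wrong = nagcount * agblocks;           /* wraps past 2^32 */
          uint64_t right = (uint64_t)nagcount * agblocks; /* widen first */
          printf("32-bit: %u, 64-bit: %llu\n",
                 wrong, (unsigned long long)right);
          return 0;
  }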
Reported-by: richard.ems@cape-horn-eng.com
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Regression from commit ef8f7fc, which rearranged the code in
xfs_swap_extents(), leading to a double unlock of the xfs inode ilock.
That resulted in xfs_fsr deadlocking itself on platforms which
don't handle double unlock of an rw_semaphore nicely. It caused the
count to go negative, which represents a write holder, without
really having one. ia64 is one of the platforms where the deadlock
was easily reproduced and the fix was tested.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
nilfs_cpfile_delete_checkpoints() wrongly skips brelse() for the
header block of the checkpoint file in case of errors. This fixes the
leak.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
Driver Core: do not oops when driver_unregister() is called for unregistered drivers
sysfs: file.c: use create_singlethread_workqueue()
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux:
svcrdma: dma unmap the correct length for the RPCRDMA header page.
nfsd: Revert "svcrpc: take advantage of tcp autotuning"
nfsd: fix hung up of nfs client while sync write data to nfs server
The flat loader uses an architecture's flat_stack_align() to align the
stack but assumes word-alignment is enough for the data sections.
However, on the Xtensa S6000 we have registers up to 128-bit width
which can be used from userspace and therefore need userspace stack and
data-section alignment of at least this size.
This patch drops flat_stack_align() and uses the same alignment that
is required for slab caches, ARCH_SLAB_MINALIGN, or wordsize if it's
not defined by the architecture.
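The alignment choice sketched as a macro (name illustrative):
  #ifdef ARCH_SLAB_MINALIGN
  #define FLAT_DATA_ALIGN (ARCH_SLAB_MINALIGN)
  #else
  #define FLAT_DATA_ALIGN (sizeof(void *))        /* word size */
  #endif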
It also fixes m32r which was obviously kaput, aligning an
uninitialized stack entry instead of the stack pointer.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oskar Schirmer <os@emlix.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <cooloney@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Johannes Weiner <jw@emlix.com>
Acked-by: Mike Frysinger <vapier.adi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_pident_instantiate() has the following call flow:
  proc_pident_lookup()
    proc_pident_instantiate()
      proc_pid_make_inode()
And proc_pident_lookup() has the following error handling:
  const struct pid_entry *p, *last;
  error = ERR_PTR(-ENOENT);
  if (!task)
          goto out_no_task;
So proc_pident_instantiate() should also return ENOENT when a race
against exit(2) occurs.
EINVAL is a bad choice for two reasons:
- it implies the caller is wrong, but the race isn't the caller's mistake.
- man 2 open doesn't document EINVAL, so users often don't handle it.
Note: other proc_pid_make_inode() callers already use ENOENT properly.
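Roughly, the instantiate path then starts out like this (sketch, not the
whole function):
  struct dentry *error = ERR_PTR(-ENOENT);        /* was -EINVAL */
  struct inode *inode;
  inode = proc_pid_make_inode(dir->i_sb, task);
  if (!inode)
          goto out;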
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Erase errors such as:
"Newly-erased block contained word 0xa4ef223e at offset 0x0296a014"
and failure to write the clean marker
move the offending erase block to the erasing list before calling
jffs2_erase_failed(). This is bad as jffs2_erase_failed() will
also move the block to the bad_list, but is now moving the
wrong block, causing FS corruption.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
We don't need a kernel thread per CPU for this application.
Acked-by: Alex Chiang <achiang@hp.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 'Short write in nfsd becomes a full write to the client'
(31dec2538e) broke sync writes.
Reproduce with the following commands:
$ mount -t nfs -o sync 192.168.0.21:/nfsroot /mnt
$ cd /mnt
$ echo aaaa > temp.txt
Then the nfs client hangs.
In SYNC mode the server always returns a write count of 0 to the
client. This is because the value of host_err in nfsd_vfs_write()
is overwritten in SYNC mode by 'host_err=nfsd_sync(file);',
and then we return host_err (which is now 0) as the write count.
This patch fixes the problem.
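A sketch of the idea in nfsd_vfs_write() (simplified, variable names as in
the description above):
  host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &offset);
  if (host_err >= 0) {
          *cnt = host_err;                /* record bytes written first */
          if (stable)
                  host_err = nfsd_sync(file);     /* may legitimately be 0 */
  }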
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Fix up renamed filenames in comments in fs/cachefiles/internal.h.
Originally, the files were all called cf-xxx.c, but they got renamed to
just xxx.c.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up renamed filenames in comments in fs/fscache/internal.h.
Originally, the files were all called fsc-xxx.c, but they got renamed to
just xxx.c.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the asynchronous lease renewal fails (usually due to a soft timeout),
then we _must_ schedule state recovery in order to ensure that we don't
lose the lease unnecessarily or, if the lease is already lost, that we
recover the locking state promptly...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix build error with the latest kbuild adjustments to initconst.
The commit a447c09324 ("vfs: Use
const for kernel parser table") changed:
static match_table_t __initdata tokens = {
to
static match_table_t __initconst tokens = {
But the missing const causes powerpc to fail with the latest
updates to __initconst like this:
fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict
fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict
The bug is only present with kbuild-next.
The following patch has been build tested.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Small change (mostly formatting) to limit lookup based open calls to
file create only.
After discussion yesterday on samba-technical about the posix lookup
regression, and looking at a problem with cifs posix open to one
particular Samba version, Jeff and JRA realized that Samba server's
behavior changed in this area (posix open behavior on files vs.
directories). To make this behavior consistent, JRA just made a
fix to Samba server to alter how it handles open of directories (now
returning the equivalent of EISDIR instead of success). Since we don't
know at lookup time whether the inode is a directory or file (and
thus whether posix open will succeed with most current Samba server),
this change avoids the posix open code on lookup open (just issues
posix open on creates). This gets the semantic benefits we want
(atomicity, posix byte range locks, improved write semantics on newly
created files) and file create still is fast, and we avoid the problem
that Jeff noticed yesterday with "openat" (and some open directory
calls) of non-cached directories to one version of Samba server, and
will work with future Samba versions (which include the fix jra just
pushed into Samba server). I confirmed this approach with jra
yesterday and with Shirish today.
Posix open is only called (at lookup time) for file create now.
For opens (rather than creates), because we do not know if it
is a file or directory yet, and current Samba no longer allows
us to do posix open on dirs, we could end up wasting an open call
on what turns out to be a dir. For file opens, we wait to call posix
open till cifs_open. It could be added here (lookup) in the future
but the performance tradeoff of the extra network request when EISDIR
or EACCES is returned would have to be weighed against the 50%
reduction in network traffic in the other paths.
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
CC: Jeremy Allison <jra@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This fixes a new memory leak problem in garbage collection. The
problem was introduced by the bugfix patch ("nilfs2: fix lock order
reversal in nilfs_clean_segments ioctl").
Thanks to Kentaro Suzuki for finding this problem.
Reported-by: Kentaro Suzuki <k_suzuki@ms.sylc.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Posix open code was not properly adding the file to the
list of open files. Fix allocating cifsFileInfo
more than once, and adding twice to flist and tlist.
Also fix mode setting to be done in one place in these
paths.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Luca Tettamanti <kronos.it@gmail.com>
This is the third respin of the patch posted yesterday to fix the error
handling in cifs_follow_symlink. It also includes a fix for a bogus NULL
pointer check in CIFSSMBQueryUnixSymLink that Jeff Moyer spotted.
It's possible for CIFSSMBQueryUnixSymLink to return without setting
target_path to a valid pointer. If that happens then the current value
to which we're initializing this pointer could cause an oops when it's
kfree'd.
This patch is a little more comprehensive than the last patches. It
reorganizes cifs_follow_link a bit for (hopefully) better readability.
It should also eliminate the unneeded allocation of full_path on servers
without unix extensions (assuming they can get to this point anyway, of
which I'm not convinced).
On a side note, I'm not sure I agree with the logic of enabling this
query even when unix extensions are disabled on the client. It seems
like that should disable this as well. But, changing that is outside the
scope of this fix, so I've left it alone for now.
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The problem is that permission checking is skipped if atomic open is
possible, but when exec opens a file, it just opens it O_RDONLY, which
means EXEC permission will not be checked at that time.
This problem is observed by the following sequence (executed as root):
mount -t nfs4 server:/ /mnt4
echo "ls" >/mnt4/foo
chmod 744 /mnt4/foo
su guest -c "mnt4/foo"
Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Tested-by: Eugene Teo <eugeneteo@kernel.sg>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds CONFIG_REISERFS_FS_XATTR protection from reiserfs_permission.
This is needed to avoid warnings during file deletions and chowns with
xattrs disabled.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This avoids an Oops in open_xa_root that can occur when deleting a file
with xattrs disabled. It assumes that the xattr root will be there, and
that is not guaranteed.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the xattr cleanup, even with xattrs disabled, much of the initial setup
is still performed. Some #ifdefs are just not needed since the options
they protect wouldn't be available anyway.
This cleans those up.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix race in ext4_inode_info.i_cached_extent
ext4: Clear the unwritten buffer_head flag after the extent is initialized
ext4: Use a fake block number for delayed new buffer_head
ext4: Fix sub-block zeroing for writes into preallocated extents
devpts_get_sb() calls memset(0) to clear mount options and calls
parse_mount_options() if user specified any mount options.
The memset(0) is bogus since the 'mode' and 'ptmxmode' options are
non-zero by default. parse_mount_options() restores options to default
anyway and can properly deal with NULL mount options.
So in devpts_get_sb() remove memset(0) and call parse_mount_options() even
for NULL mount options.
Bug reported by Eric Paris: http://lkml.org/lkml/2009/5/7/448.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Reported-by: Eric Paris <eparis@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Reviewed-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If two CPU's simultaneously call ext4_ext_get_blocks() at the same
time, there is nothing protecting the i_cached_extent structure from
being used and updated at the same time. This could potentially cause
the wrong location on disk to be read or written to, including
potentially causing the corruption of the block group descriptors
and/or inode table.
This bug has been in the ext4 code since almost the very beginning of
ext4's development. Fortunately once the data is stored in the page
cache, ext4_get_blocks() doesn't need to be called, so trying to
replicate this problem to the point where we could identify its root
cause was *extremely* difficult. Many thanks to Kevin Shanahan for
working over several months to be able to reproduce this easily so we
could finally nail down the cause of the corruption.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
Btrfs: make show_options result match actual option names
Btrfs: remove outdated comment in btrfs_ioctl_resize()
Btrfs: remove some WARN_ONs in the IO failure path
Btrfs: Don't loop forever on metadata IO failures
Btrfs: init inode ordered_data_close flag properly
The BH_Unwritten flag indicates that the buffer is allocated on disk
but has not been written; that is, the disk was part of a persistent
preallocation area. That flag should only be set when a get_blocks()
function is looking up an inode's logical to physical block mapping.
When ext4_get_blocks_wrap() is called with create=1, the uninitialized
extent is converted into an initialized one, so the BH_Unwritten flag
is no longer appropriate. Hence, we need to make sure the
BH_Unwritten is not left set, since the combination of BH_Mapped and
BH_Unwritten is not allowed; among other things, it will result in
ext4's get_block() being called over and over again during the write_begin
phase of write(2).
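A sketch of the cleanup after the create=1 conversion (placement
illustrative):
  /* the extent is now initialized, so drop the unwritten hint */
  if (buffer_unwritten(bh))
          clear_buffer_unwritten(bh);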
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The notreelog and flushoncommit mount options were being printed slightly
differently.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In Li Zefan's commit dae7b665cf,
a combined kmalloc() and copy_from_user() call was replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These debugging WARN_ONs make too much console noise during regular
IO failures. An IO failure will still generate a number of messages
as we verify checksums etc, but these two are not needed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a btrfs metadata read fails, the first thing we try to do is find
a good copy on another mirror of the block. If this fails, read_tree_block()
ends up returning a buffer that isn't up to date.
The btrfs btree reading code was reworked to drop locks and repeat
the search when IO was done, but the changes didn't add a check for failed
reads. The end result was looping forever on buffers that were never
going to become up to date.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This flag is used to decide when we need to send a given file through
the ordered code to make sure it is fully written before a transaction
commits. It was not being properly set to zero when the inode was
being set up.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
cifs_strndup_from_ucs returns NULL on error, not an ERR_PTR
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
Squashfs: code tidying, remove commented out line in Makefile
Squashfs: check page size is not larger than the filesystem block size
Squashfs: fix breakage when page size > metadata block size
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: check size of array structured data exchanged via ioctls
nilfs2: fix lock order reversal in nilfs_clean_segments ioctl
nilfs2: fix possible circular locking for get information ioctls
nilfs2: ensure to clear dirty state when deleting metadata file block
nilfs2: fix circular locking dependency of writer mutex
nilfs2: fix possible recovery failure due to block creation without writer
The core VM assumes the page size used by the address_space in
inode->i_mapping is PAGE_SIZE but hugetlbfs breaks this assumption by
inserting pages into the page cache at offsets the core VM considers
unexpected.
This would not be a problem except that hugetlbfs also provide a
->readpage implementation. As it exists, the core VM can assume the
base page size is being used, allocate pages on behalf of the
filesystem, insert them into the page cache and call ->readpage to
populate them. These pages are the wrong size and at the wrong offset
for hugetlbfs causing confusion.
This patch deletes the ->readpage implementation for hugetlbfs on the
grounds the core VM should not be allocating and populating pages on
behalf of hugetlbfs. There should be no existing users of the
->readpage implementation so it should not cause a regression.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Normally the block size (by default 128K) will be larger than the
page size, unless a non-standard block size has been specified in
Mksquashfs, and the page size is larger than 4K.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Squashfs is broken on any system where the page size is larger than
the metadata size (8192). This is easily fixed by ensuring cache->pages
is always > 0.
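A sketch of the guard in the cache setup (cache->pages is named in the
description above; the rest is illustrative):
  cache->pages = block_size >> PAGE_CACHE_SHIFT;
  if (cache->pages == 0)          /* page size larger than block size */
          cache->pages = 1;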
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Doug Chapman <doug.chapman@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux:
nfsd: silence lockdep warning
lockd: fix list corruption on lockd restart
nfsd4: check for negative dentry before use in nfsv4 readdir
nfsd41: slots are freed with session
svcrdma: clean up error paths.
svcrdma: Fix dma map direction for rdma read targets
Use a very large unsigned number (~0xffff) as the fake block number
for the delayed new buffer. The VFS should never try to write out this
number, but if it does, this will make it obvious.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We need to mark the buffer_head mapping preallocated space as new
during write_begin. Otherwise we don't zero out the page cache content
properly for a partial write. This will cause file corruption with
preallocation.
Now that we mark the buffer_head new we also need to have a valid
buffer_head blocknr so that unmap_underlying_metadata() unmaps the
correct block.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The return value of dup2 when oldfd == newfd and the fd isn't valid is
not getting properly sign extended. We end up with 4294967287 instead
of -EBADF.
I've reproduced this on SLE11 (2.6.27.21), openSUSE Factory
(2.6.29-rc5), and Ubuntu 9.04 (2.6.28).
This patch uses a signed int for the error value so it is properly
extended.
Commit 6c5d0512a0 introduced this
regression.
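The widening problem in isolation (userspace demo for a 64-bit build, not
the fcntl.c code):
  #include <errno.h>
  #include <stdio.h>
  static long broken(void)
  {
          unsigned int err = -EBADF;
          return err;             /* zero-extended: 4294967287 */
  }
  static long fixed(void)
  {
          int err = -EBADF;       /* signed, as in the fix */
          return err;             /* sign-extended: -9 */
  }
  int main(void)
  {
          printf("%ld %ld\n", broken(), fixed());
          return 0;
  }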
Reported-by: Jiri Dluhos <jdluhos@novell.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although some nilfs2 ioctls exchange data in the form of an indirectly
referenced array, some of them lack a size check on the array elements.
This inserts the missing checks and rejects requests if the ioctl data
does not have a valid format.
We usually don't have to check the size of structures associated with
ioctl commands because the size is tested implicitly when identifying
the ioctl command; the checks this patch adds are for the cases where
that implicit check is not applied.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This is a companion patch to ("nilfs2: fix possible circular locking
for get information ioctls").
This corrects lock order reversal between mm->mmap_sem and
nilfs->ns_segctor_sem in nilfs_clean_segments() which was detected by
lockdep check:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.30-rc3-nilfs-00003-g360bdc1 #7
-------------------------------------------------------
mmap/5294 is trying to acquire lock:
(&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_sem){++++++}:
[<c01470a5>] __lock_acquire+0x1066/0x13b0
[<c01474a9>] lock_acquire+0xba/0xdd
[<c01836bc>] might_fault+0x68/0x88
[<c023c61d>] copy_from_user+0x2a/0x111
[<d0d120d0>] nilfs_ioctl_prepare_clean_segments+0x1d/0xf1 [nilfs2]
[<d0d0e2aa>] nilfs_clean_segments+0x6d/0x1b9 [nilfs2]
[<d0d11f68>] nilfs_ioctl+0x2ad/0x318 [nilfs2]
[<c01a3be7>] vfs_ioctl+0x22/0x69
[<c01a408e>] do_vfs_ioctl+0x460/0x499
[<c01a4107>] sys_ioctl+0x40/0x5a
[<c01031a4>] sysenter_do_call+0x12/0x38
[<ffffffff>] 0xffffffff
-> #0 (&nilfs->ns_segctor_sem){++++.+}:
[<c0146e0b>] __lock_acquire+0xdcc/0x13b0
[<c01474a9>] lock_acquire+0xba/0xdd
[<c0433f1d>] down_read+0x2a/0x3e
[<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
[<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
[<c0183b0b>] __do_fault+0x165/0x376
[<c01855cd>] handle_mm_fault+0x287/0x5d1
[<c043712d>] do_page_fault+0x2fb/0x30a
[<c0435462>] error_code+0x72/0x78
[<ffffffff>] 0xffffffff
where nilfs_clean_segments() holds:
nilfs->ns_segctor_sem -> copy_from_user()
--> page fault -> mm->mmap_sem
And, page fault path may hold:
page fault -> mm->mmap_sem
--> nilfs_page_mkwrite() -> nilfs->ns_segctor_sem
Even though nilfs_clean_segments() does not perform write access on
given user pages, it may cause deadlock because nilfs->ns_segctor_sem
is shared per device and mm->mmap_sem can be shared with other tasks.
To avoid this problem, this patch moves all calls of copy_from_user()
outside the nilfs->ns_segctor_sem lock in the ioctl.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This is one of two patches which are to correct possible circular
locking between mm->mmap_sem and nilfs->ns_segctor_sem.
The problem was detected by lockdep check as follows:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.30-rc3-nilfs-00002-g3552613 #6
-------------------------------------------------------
mmap/5418 is trying to acquire lock:
(&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_sem){++++++}:
[<c01470a5>] __lock_acquire+0x1066/0x13b0
[<c01474a9>] lock_acquire+0xba/0xdd
[<c01836bc>] might_fault+0x68/0x88
[<c023c730>] copy_to_user+0x2c/0xfc
[<d0d11b4f>] nilfs_ioctl_wrap_copy+0x103/0x160 [nilfs2]
[<d0d11fa9>] nilfs_ioctl+0x30a/0x3b0 [nilfs2]
[<c01a3be7>] vfs_ioctl+0x22/0x69
[<c01a408e>] do_vfs_ioctl+0x460/0x499
[<c01a4107>] sys_ioctl+0x40/0x5a
[<c01031a4>] sysenter_do_call+0x12/0x38
[<ffffffff>] 0xffffffff
-> #0 (&nilfs->ns_segctor_sem){++++.+}:
[<c0146e0b>] __lock_acquire+0xdcc/0x13b0
[<c01474a9>] lock_acquire+0xba/0xdd
[<c0433f1d>] down_read+0x2a/0x3e
[<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
[<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
[<c0183b0b>] __do_fault+0x165/0x376
[<c01855cd>] handle_mm_fault+0x287/0x5d1
[<c043712d>] do_page_fault+0x2fb/0x30a
[<c0435462>] error_code+0x72/0x78
[<ffffffff>] 0xffffffff
other info that might help us debug this:
1 lock held by mmap/5418:
#0: (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a
stack backtrace:
Pid: 5418, comm: mmap Not tainted 2.6.30-rc3-nilfs-00002-g3552613 #6
Call Trace:
[<c0432145>] ? printk+0xf/0x12
[<c0145c48>] print_circular_bug_tail+0xaa/0xb5
[<c0146e0b>] __lock_acquire+0xdcc/0x13b0
[<d0d10149>] ? nilfs_sufile_get_stat+0x1e/0x105 [nilfs2]
[<c013b59a>] ? up_read+0x16/0x2c
[<d0d10225>] ? nilfs_sufile_get_stat+0xfa/0x105 [nilfs2]
[<c01474a9>] lock_acquire+0xba/0xdd
[<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2]
[<c0433f1d>] down_read+0x2a/0x3e
[<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2]
[<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
[<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
[<c0183b0b>] __do_fault+0x165/0x376
[<c01855cd>] handle_mm_fault+0x287/0x5d1
[<c043700a>] ? do_page_fault+0x1d8/0x30a
[<c013b54f>] ? down_read_trylock+0x39/0x43
[<c043712d>] do_page_fault+0x2fb/0x30a
[<c0436e32>] ? do_page_fault+0x0/0x30a
[<c0435462>] error_code+0x72/0x78
[<c0436e32>] ? do_page_fault+0x0/0x30a
This makes the lock granularity of nilfs->ns_segctor_sem finer than
that of the mmap semaphore for ioctl commands except
nilfs_clean_segments().
The successive patch ("nilfs2: fix lock order reversal in
nilfs_clean_segments ioctl") is required to fully resolve the problem.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (22 commits)
Fix the race between capifs remount and node creation
Fix races around the access to ->s_options
switch ufs directories to ufs_sync_file()
Switch open_exec() and sys_uselib() to do_open_filp()
Make open_exec() and sys_uselib() use may_open(), instead of duplicating its parts
Reduce path_lookup() abuses
Make checkpatch.pl shut up on fs/inode.c
NULL noise in fs/super.c:kill_bdev_super()
romfs: cleanup romfs_fs.h
ROMFS: romfs_dev_read() error ignored
fs: dcache fix LRU ordering
ocfs2: Use nd_set_link().
Fix deadlock in ipathfs ->get_sb()
Fix a leak in failure exit in 9p ->get_sb()
Convert obvious places to deactivate_locked_super()
New helper: deactivate_locked_super()
reiserfs: remove privroot hiding in lookup
reiserfs: dont associate security.* with xattr files
reiserfs: fixup xattr_root caching
Always lookup priv_root on reiserfs mount and keep it
...
This would fix the following failure during GC:
nilfs_cpfile_delete_checkpoints: cannot delete block
NILFS: GC failed during preparation: cannot delete checkpoints: err=-2
The problem was caused by a break in state consistency between the page
cache and the btree; the above block was removed from the btree but the
page buffering the block remained in the page cache in a dirty
state.
This resolves the inconsistency by ensuring that the dirty state of
the page buffering the deleted block is cleared.
Reported-by: David Arendt <admin@prnet.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Put generic_show_options read access to s_options under rcu_read_lock,
split save_mount_options() into "we are setting it the first time"
(used in foo_fill_super()) and "we are replacing and freeing the old one",
and synchronize_rcu() before kfree() in the latter.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Code Quality According To Mingo(tm) has been vastly improved,
no code has been damaged^Wchanged^Wdamaged.
[commit message rewritten -- AV]
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
romfs_dev_read() may return -EIO, but ret is unsigned, so the errorpath
isn't taken.
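The comparison problem in isolation (userspace demo):
  #include <errno.h>
  #include <stdio.h>
  int main(void)
  {
          unsigned int ret = -EIO;        /* 4294967291, not negative */
          if (ret < 0)                    /* always false for an unsigned type */
                  printf("error handled\n");
          else
                  printf("error silently ignored, ret=%u\n", ret);
          return 0;
  }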
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix ordering of LRU when moving referenced dentries to the head of the list
(they should go to the head of the list in the same order as they were found
from the tail, rather than reverse order).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ocfs2 was hand-calling vfs_follow_link(), but there's no point to that.
Let's use page_follow_link_light() and nd_set_link().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Does equivalent of up_write(&s->s_umount); deactivate_super(s);
However, it does not unlock it until it's all over.
As a result, it's safe to use for disposing of a new superblock on ->get_sb()
failure exits - nobody will see the sucker until it's all over.
The equivalent using up_write/deactivate_super is safe for that purpose
if the superblock is either safe to use or has a NULL ->s_root when we unlock.
Normally filesystems take the required precautions, but
a) we do have bugs in that area in some of them.
b) up_write/deactivate_super sequence is extremely common,
so the helper makes sense anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With Al Viro's patch to move privroot lookup to fs mount, there's no need
to have special code to hide the privroot in reiserfs_lookup.
I've also cleaned up the privroot hiding in reiserfs_readdir_dentry and
removed the last user of reiserfs_xattrs().
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The security.* xattrs are ignored for xattr files, so don't create them.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The xattr_root caching was broken by my previous patch set. It wouldn't
cause corruption, but could cause decreased performance due to allocating
a larger chunk of the journal (~ 27 blocks) than it would actually use.
This patch loads the xattr root dentry at xattr initialization and creates
it on-demand. Since we're using the cached dentry, there's no point
in keeping lookup_or_create_dir around, so that's removed.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... even if it's a negative dentry. That way we can set ->d_op on
root before anyone could race with us. Simplify d_compare(), while
we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2.6.30-rc3 introduced some sanity checks in the VFS code to avoid NFS
bugs by ensuring that lookup_one_len is always called under i_mutex.
This patch expands the i_mutex locking to enclose lookup_one_len. This was
always required, but not enforced in the reiserfs code since it
does locking around the xattr interactions with the xattr_sem.
This is obvious enough, and it survived an overnight 50 thread ACL test.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Depending on the ordering of events as we go around the
glock shrinker loop, it is possible to drop the ref count
of a glock incorrectly. It doesn't happen very often. This
patch corrects the got_ref variable, fixing the problem.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This fixes the following circular locking dependency problem:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.30-rc3 #5
-------------------------------------------------------
segctord/3895 is trying to acquire lock:
(&nilfs->ns_writer_mutex){+.+...}, at: [<d0d02172>]
nilfs_mdt_get_block+0x89/0x20f [nilfs2]
but task is already holding lock:
(&bmap->b_sem){++++..}, at: [<d0d02d99>]
nilfs_bmap_propagate+0x14/0x2e [nilfs2]
which lock already depends on the new lock.
The bugfix is done by replacing call sites of nilfs_get_writer() which
are never called from a read-only context with direct dereferencing of
the pointer to a writable FS-instance.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Some function calls in nilfs_prepare_segment_for_recovery() may fail
because they can create blocks on meta data files without configuring
a writable FS-instance. Concretely, the nilfs_mdt_create_block() routine
of meta data files will fail in that case.
This fixes the problem by temporarily attaching a writable FS-instance
while the function is called.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>