1
Commit Graph

18874 Commits

Author SHA1 Message Date
Christoph Hellwig
ef35e9255d xfs: remove xfs_iput_new
We never get an i_mode of 0 or a locked VFS inode until we pass in the
XFS_IGET_CREATE flag to xfs_iget, which makes xfs_iput_new equivalent to
xfs_iput for the only caller.  In addition to that xfs_nfs_get_inode
does not even need to lock the inode given that the generation never changes
for a life inode, so just pass a 0 lock_flags to xfs_iget and release
the inode using IRELE in the error path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:44 -05:00
Christoph Hellwig
d2e078c33c xfs: some iget tracing cleanups / fixes
The xfs_iget_alloc/found tracepoints are a bit misnamed and misplaced.
Rename them to xfs_iget_hit/xfs_iget_miss and move them to the beggining
of the xfs_iget_cache_hit/miss functions.  Add a new xfs_iget_reclaim_fail
tracepoint for the case where we fail to re-initialize a VFS inode,
and add a second instance of the xfs_iget_skip tracepoint for the case
of a failed igrab() call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:43 -05:00
Christoph Hellwig
807cbbdb43 xfs: do not use emums for flags used in tracing
The tracing code can't print flags defined as enums.  Most flags that
we want to print are defines as macros already, but move the few remaining
ones over to make the trace output more useful.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:43 -05:00
Christoph Hellwig
64c8614941 xfs: remove explicit xfs_sync_data/xfs_sync_attr calls on umount
On the final put of a superblock the VFS already calls sync_filesystem
for us to write out all data and wait for it.  No need to start another
asynchronous writeback inside ->put_super.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:42 -05:00
Christoph Hellwig
f2bde9b89b xfs: small cleanups for xfs_iomap / __xfs_get_blocks
Remove the flags argument to  __xfs_get_blocks as we can easily derive
it from the direct argument, and remove the unused BMAPI_MMAP flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:42 -05:00
Christoph Hellwig
3070451eea xfs: reduce stack usage in xfs_iomap
xfs_iomap passes a xfs_bmbt_irec pointer to xfs_iomap_write_direct and
xfs_iomap_write_allocate to give them the results of our read-only
xfs_bmapi query.  Instead of allocating a new xfs_bmbt_irec on stack
for the next call to xfs_bmapi re use the one we got passed as it's not
used after this point.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:42 -05:00
Christoph Hellwig
7a36c8a98a xfs: avoid synchronous transaction in xfs_fs_write_inode
We already rely on the fact that the sync code will cause a synchronous
log force later on (currently via xfs_fs_sync_fs -> xfs_quiesce_data ->
xfs_sync_data), so no need to do this here.  This allows us to avoid
a lot of synchronous log forces during sync, which pays of especially
with delayed logging enabled.   Some compilebench numbers that show
this:

xfs (delayed logging, 256k logbufs)
===================================

intial create		  25.94 MB/s	  25.75 MB/s	  25.64 MB/s
create			   8.54 MB/s	   9.12 MB/s	   9.15 MB/s
patch			   2.47 MB/s	   2.47 MB/s	   3.17 MB/s
compile			  29.65 MB/s	  30.51 MB/s	  27.33 MB/s
clean			  90.92 MB/s	  98.83 MB/s	 128.87 MB/s
read tree		  11.90 MB/s	  11.84 MB/s	   8.56 MB/s
read compiled		  28.75 MB/s	  29.96 MB/s	  24.25 MB/s
delete tree		8.39 seconds	8.12 seconds	8.46 seconds
delete compiled		8.35 seconds	8.44 seconds	5.11 seconds
stat tree		6.03 seconds	5.59 seconds	5.19 seconds
stat compiled tree	9.00 seconds	9.52 seconds	8.49 seconds

xfs + write_inode log_force removal
===================================
intial create		  25.87 MB/s	  25.76 MB/s	  25.87 MB/s
create			  15.18 MB/s	  14.80 MB/s	  14.94 MB/s
patch			   3.13 MB/s	   3.14 MB/s	   3.11 MB/s
compile			  36.74 MB/s	  37.17 MB/s	  36.84 MB/s
clean			 226.02 MB/s	 222.58 MB/s	 217.94 MB/s
read tree		  15.14 MB/s	  15.02 MB/s	  15.14 MB/s
read compiled tree	  29.30 MB/s	  29.31 MB/s	  29.32 MB/s
delete tree		6.22 seconds	6.14 seconds	6.15 seconds
delete compiled tree	5.75 seconds	5.92 seconds	5.81 seconds
stat tree		4.60 seconds	4.51 seconds	4.56 seconds
stat compiled tree	4.07 seconds	3.87 seconds	3.96 seconds

In addition to that also remove the delwri inode flush that is unessecary
now that bulkstat is always coherent.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:41 -05:00
Christoph Hellwig
20cb52ebd1 xfs: simplify xfs_vm_writepage
The writepage implementation in XFS still tries to deal with dirty but
unmapped buffers which used to caused by writes through shared mmaps.  Since
the introduction of ->page_mkwrite these can't happen anymore, so remove the
code dealing with them.

Note that the all_bh variable which causes us to start I/O on all buffers on
the pages was controlled by the count of unmapped buffers, which also
included those not actually dirty.  It's now unconditionally initialized to
0 but set to 1 for the case of small file size extensions.  It probably can
be removed entirely, but that's left for another patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:41 -05:00
Christoph Hellwig
89f3b36396 xfs: simplify xfs_vm_releasepage
Currently the xfs releasepage implementation has code to deal with converting
delayed allocated and unwritten space.  But we never get called for those as
we always convert delayed and unwritten space when cleaning a page, or drop
the state from the buffers in block_invalidatepage.  We still keep a WARN_ON
on those cases for now, but remove all the case dealing with it, which allows
to fold xfs_page_state_convert into xfs_vm_writepage and remove the !startio
case from the whole writeback path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:40 -05:00
Eric Sandeen
3d9b02e3c7 xfs: fix corruption case for block size < page size
xfstests 194 first truncats a file back and then extends it again by
truncating it to a larger size.  This causes discard_buffer to drop
the mapped, but not the uptodate bit and thus creates something that
xfs_page_state_convert takes for unmapped space created by mmap because
it doesn't check for the dirty bit, which also gets cleared by
discard_buffer and checked by other ->writepage implementations like
block_write_full_page.  Handle this kind of buffers early, and unlike
Eric's first version of the patch simply ASSERT that the buffers is
dirty, given that the mmap write case can't happen anymore since the
introduction of ->page_mkwrite.  The now dead code dealing with that
will be deleted in a follow on patch.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:40 -05:00
Christoph Hellwig
b4e9181e77 xfs: remove unused delta tracking code in xfs_bmapi
This code was introduced four years ago in commit
3e57ecf640 without any review and has
been unused since.  Remove it just as the rest of the code introduced
in that commit to reduce that stack usage and complexity in this central
piece of code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:39 -05:00
Christoph Hellwig
cd8b0bb3c4 xfs: remove unused XFS_BMAPI_ flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:39 -05:00
Christoph Hellwig
a59f55703c xfs: remove the unused XFS_TRANS_NOSLEEP/XFS_TRANS_WAIT flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:38 -05:00
Christoph Hellwig
9134c2332e xfs: remove the unused XFS_LOG_SLEEP and XFS_LOG_NOSLEEP flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:38 -05:00
Christoph Hellwig
dbb2f6529f xfs: kill the unused xlog_debug variable
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:37 -05:00
Christoph Hellwig
4e0d5f926b xfs: fix the xfs_log_iovec i_addr type
By making this member a void pointer we can get rid of a lot of pointless
casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:36 -05:00
Christoph Hellwig
898621d5a7 xfs: simplify inode to transaction joining
Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
joining it to a transaction via xfs_trans_ijoin.

This patches instead makes xfs_trans_ijoin usable on it's own by doing
an implicity xfs_trans_ihold, which also allows us to drop the third
argument.  For the case where we want to hold a reference on the inode
a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
the inode for needing an xfs_iput.  In addition to the cleaner interface
to the caller this also simplifies the implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:36 -05:00
Christoph Hellwig
4d16e9246f xfs: simplify buffer pinning
Get rid of the xfs_buf_pin/xfs_buf_unpin/xfs_buf_ispin helpers and opencode
them in their only callers, just like we did for the inode pinning a while
ago.  Also remove duplicate trace points - the bufitem tracepoints cover
all the information that is present in a buffer tracepoint.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:36 -05:00
Christoph Hellwig
ca30b2a7b7 xfs: give li_cb callbacks the correct prototype
Stop the function pointer casting madness and give all the li_cb instances
correct prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:35 -05:00
Christoph Hellwig
7bfa31d8e0 xfs: give xfs_item_ops methods the correct prototypes
Stop the function pointer casting madness and give all the xfs_item_ops the
correct prototypes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:35 -05:00
Christoph Hellwig
9412e3181c xfs: merge iop_unpin_remove into iop_unpin
The unpin_remove item operation instances always share most of the
implementation with the respective unpin implementation.  So instead
of keeping two different entry points add a remove flag to the unpin
operation and share the code more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:34 -05:00
Christoph Hellwig
e98c414f9a xfs: simplify log item descriptor tracking
Currently we track log item descriptor belonging to a transaction using a
complex opencoded chunk allocator.  This code has been there since day one
and seems to work around the lack of an efficient slab allocator.

This patch replaces it with dynamically allocated log item descriptors
from a dedicated slab pool, linked to the transaction by a linked list.

This allows to greatly simplify the log item descriptor tracking to the
point where it's just a couple hundred lines in xfs_trans.c instead of
a separate file.  The external API has also been simplified while we're
at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
delete items from a transaction have been simplified to the bare minium,
and the xfs_trans_find_item function is replaced with a direct dereference
of the li_desc field.  All debug code walking the list of log items in
a transaction is down to a simple list_for_each_entry.

Note that we could easily use a singly linked list here instead of the
double linked list from list.h as the fastpath only does deletion from
sequential traversal.  But given that we don't have one available as
a library function yet I use the list.h functions for simplicity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:34 -05:00
Christoph Hellwig
3400777ff0 xfs: remove unneeded #include statements
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:33 -05:00
Christoph Hellwig
288699feca xfs: drop dmapi hooks
Dmapi support was never merged upstream, but we still have a lot of hooks
bloating XFS for it, all over the fast pathes of the filesystem.

This patch drops over 700 lines of dmapi overhead.  If we'll ever get HSM
support in mainline at least the namespace events can be done much saner
in the VFS instead of the individual filesystem, so it's not like this
is much help for future work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:33 -05:00
Ryusuke Konishi
89c0fd014d nilfs2: reject filesystem with unsupported block size
This inserts sanity check that refuses to mount a filesystem with
unsupported block size.

Previously, kernel code of nilfs was looking only limitation of
devices though mkfs.nilfs2 limits the range of block sizes; there was
no check that prevents rec_len overflow with larger block sizes.

With this change, block sizes larger than 64KB or smaller than 1KB
will get rejected explicitly by kernel.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-25 23:29:21 +09:00
Ryusuke Konishi
6cda9fa257 nilfs2: avoid rec_len overflow with 64KB block size
With 64KB blocksize, a directory entry can have size 64KB which does
not fit into 16 bits we have for entry length.  So this patch stores
0xffff instead and converts value when read from / written to disk.

Nilfs derives its directory implementation from ext2 filesystem, and
this draws upon the corresponding change on ext2.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-25 20:46:43 +09:00
Robert P. J. Day
25848b3ec6 ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-24 21:36:07 -07:00
Tejun Heo
40f2b6ffe5 fscache: fix build on !CONFIG_SYSCTL
Commit 8b8edefa (fscache: convert object to use workqueue instead of
slow-work) made fscache_exit() call unregister_sysctl_table()
unconditionally breaking build when sysctl is disabled.  Fix it by
putting it inside CONFIG_SYSCTL.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: David Howells <dhowells@redhat.com>
2010-07-24 11:10:09 +02:00
Ryusuke Konishi
c28e69d933 nilfs2: simplify nilfs_get_page function
Implementation of nilfs_get_page() is a bit old as below:

 - A common read_mapping_page inline function is now available instead
   of its read_cache_page use.
 - wait_on_page_locked() use in the function is eliminable since
   read_cache_page function does the same thing through wait_on_page_read().
 - PageUptodate() check is eliminable for the same reason.

This renews nilfs_get_page() based on these points.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-24 17:36:29 +09:00
Sage Weil
1dadcce358 ceph: fix dentry lease release
When we embed a dentry lease release notification in a request, invalidate
our lease so we don't think we still have it.  Otherwise we can get all
sorts of incorrect client behavior when multiple clients are interacting
with the same part of the namespace.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 13:54:21 -07:00
Sage Weil
8c696737aa ceph: fix leak of dentry in ceph_init_dentry() error path
If we fail to allocate a ceph_dentry_info, don't leak the dn reference.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:07 -07:00
Sage Weil
bc4fdca857 ceph: fix pg_mapping leak on pg_temp updates
Free the ceph_pg_mapping structs when they are removed from the pg_temp
rbtree.  Also fix a leak in the __insert_pg_mapping() error path.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:06 -07:00
Sage Weil
252af52146 ceph: fix d_release dop for snapdir, snapped dentries
We need to set the d_release dop for snapdir and snapped dentries so that
the ceph_dentry_info struct gets released.  We also use the dcache to
cache readdir results when possible, which only works if we know when
dentries are dropped from the cache.  Since we don't use the dcache for
readdir in the hidden snapdir, avoid that case in ceph_dentry_release.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:05 -07:00
J. Bruce Fields
af4718f3f9 nfsd: minor nfsd_svc() cleanup
More idiomatic to put the error case in the if clause.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:27 -04:00
J. Bruce Fields
59db4a0c10 nfsd: move more into nfsd_startup()
This is just cleanup--it's harmless to call nfsd_rachache_init,
nfsd_init_socks, and nfsd_reset_versions more than once.  But there's no
point to it.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:26 -04:00
Jeff Layton
ac77efbe2b nfsd: just keep single lockd reference for nfsd
Right now, nfsd keeps a lockd reference for each socket that it has
open. This is unnecessary and complicates the error handling on
startup and shutdown. Change it to just do a lockd_up when starting
the first nfsd thread just do a single lockd_down when taking down the
last nfsd thread. Because of the strange way the sv_count is handled
this requires an extra flag to tell whether the nfsd_serv holds a
reference for lockd or not.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:26 -04:00
Jeff Layton
628b368728 nfsd: clean up nfsd_create_serv error handling
There doesn't seem to be any need to reset the nfssvc_boot time if the
nfsd startup failed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:25 -04:00
Jeff Layton
0cd14a061e nfsd: fix error handling in __write_ports_addxprt
__write_ports_addxprt calls nfsd_create_serv. That increases the
refcount of nfsd_serv (which is tracked in sv_nrthreads). The service
only decrements the thread count on error, not on success like
__write_ports_addfd does, so using this interface leaves the nfsd
thread count high.

Fix this by having this function call svc_destroy() on error to release
the reference (and possibly to tear down the service) and simply
decrement the refcount without tearing down the service on success.

This makes the sv_threads handling work basically the same in both
__write_ports_addxprt and __write_ports_addfd.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:24 -04:00
Jeff Layton
78a8d7c8ca nfsd: fix error handling when starting nfsd with rpcbind down
The refcounting for nfsd is a little goofy. What happens is that we
create the nfsd RPC service, attach sockets to it but don't actually
start the threads until someone writes to the "threads" procfile. To do
this, __write_ports_addfd will create the nfsd service and then will
decrement the refcount when exiting but won't actually destroy the
service.

This is fine when there aren't errors, but when there are this can
cause later attempts to start nfsd to fail. nfsd_serv will be set,
and that causes __write_versions to return EBUSY.

Fix this by calling svc_destroy on nfsd_serv when this function is
going to return error.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:23 -04:00
Jeff Layton
4ad9a344be nfsd4: fix v4 state shutdown error paths
If someone tries to shut down the laundry_wq while it isn't up it'll
cause an oops.

This can happen because write_ports can create a nfsd_svc before we
really start the nfs server, and we may fail before the server is ever
started.

Also make sure state is shutdown on error paths in nfsd_svc().

Use a common global nfsd_up flag instead of nfs4_init, and create common
helper functions for nfsd start/shutdown, as there will be other work
that we want done only when we the number of nfsd threads transitions
between zero and nonzero.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:22 -04:00
J. Bruce Fields
55b13354d7 nfsd: remove unused assignment from nfsd_link
Trivial cleanup, since "dest" is never used.

Reported-by: Anshul Madan <Anshul.Madan@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:50:39 -04:00
Tejun Heo
6ecd7c2dd9 gfs2: use workqueue instead of slow-work
Workqueue can now handle high concurrency.  Convert gfs to use
workqueue instead of slow-work.

* Steven pointed out that recovery path might be run from allocation
  path and thus requires forward progress guarantee without memory
  allocation.  Create and use gfs_recovery_wq with rescuer.  Please
  note that forward progress wasn't guaranteed with slow-work.

* Updated to use non-reentrant workqueue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-23 13:14:25 +02:00
Dave Chinner
aa32a79638 ext3: default to ordered mode
data=writeback mode is dangerous as it leads to higher data loss and stale data
exposure when systems crash. It should not be the default, especially when all
major distros ensure their ext3 filesystems default to ordered mode. Change the
default mode to the safer data=ordered mode, because we should be caring far
more about avoiding stale data exposure than performance.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-23 12:50:55 +02:00
Jan Kara
43d2932d88 quota: Use mark_inode_dirty_sync instead of mark_inode_dirty
Quota code never touches file data. It just modifies i_blocks + i_bytes
of inodes and inode flags of quota files. So use mark_inode_dirty_sync
instead of mark_inode_dirty.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-23 12:50:46 +02:00
Ryusuke Konishi
c5ca48aabe nilfs2: reject incompatible filesystem
This forces nilfs to check compatibility of feature flags so as to
reject a filesystem with unknown features when it mounts or remounts
the filesystem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:16 +09:00
Ryusuke Konishi
03bdb5ac58 nilfs2: apply read-ahead for nilfs_btree_lookup_contig
This applies read-ahead to nilfs_btree_do_lookup and
nilfs_btree_lookup_contig functions and extends them to read ahead
siblings of level 1 btree nodes that hold data blocks.

At present, the read-ahead is not applied to most btree operations;
only get_block() callback function, which is used during read of
regular files or directories, receives the benefit.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:16 +09:00
Ryusuke Konishi
4e13e66bee nilfs2: introduce check flag to btree node buffer
nilfs_btree_get_block() now may return untested buffer due to
read-ahead.  This adds a new flag for buffer heads so that the btree
code can check whether the buffer is already verified or not.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
464ece8863 nilfs2: add btree get block function with readahead option
This adds __nilfs_btree_get_block() function that can issue a series
of read-ahead requests for sibling btree nodes.

This read-ahead needs parent node block, so nilfs_btree_readahead_info
structure is added to pass the information that
__nilfs_btree_get_block() needs.

This also replaces the previous nilfs_btree_get_block() implementation
with a wrapper function of __nilfs_btree_get_block().

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
26dfdd8e29 nilfs2: add read ahead mode to nilfs_btnode_submit_block
This adds mode argument to nilfs_btnode_submit_block() function and
allows it to issue a read-ahead request.

An optional submit_ptr argument is also added to store the actual
block address for which bio is sent.  submit_ptr is used for a series
of read-ahead requests, and helps to decide if each requested block is
continous to the previous one on disk.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
f8e6cc013b nilfs2: fix buffer head leak in nilfs_btnode_submit_block
nilfs_btnode_submit_block() refers to buffer head just before
returning from the function, but it releases the buffer head earlier
than that if nilfs_dat_translate() gets an error.

This has potential for oops in the erroneous case.  This fixes the
issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
7c397a81fe nilfs2: eliminate inline keywords in btree implementation
This removes all inline uses from btree.c.  Gcc now agressively apply
inline expansion even for the functions declared without the keyword;
the inline use in btree.c looks excessive.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
5ad2686e92 nilfs2: get maximum number of child nodes from bmap object
The patch "reduce repetitive calculation of max number of child nodes"
gathered up the calculation of maximum number of child nodes into
nilfs_btree_nchildren_per_block() function.  This makes the function
get resultant value from a private variable in bmap object instead of
calculating it for each call.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
9b7b265c9a nilfs2: reduce repetitive calculation of max number of child nodes
The current btree implementation repeats the same calculation on the
maximum number of child nodes.  This is because a few low level
routines use the calculation for index addressing in a btree node
block.

This reduces the calculation by explicitly passing the maximum number
of child nodes (ncmax) through their argument.

This changes parameter passing of the following functions:

 - nilfs_btree_node_dptrs
 - nilfs_btree_node_get_ptr
 - nilfs_btree_node_set_ptr
 - nilfs_btree_node_init
 - nilfs_btree_node_move_left
 - nilfs_btree_node_move_right
 - nilfs_btree_node_insert
 - nilfs_btree_node_delete, and
 - nilfs_btree_get_node

The following functions are removed:

 - nilfs_btree_node_nchildren_min
 - nilfs_btree_node_nchildren_max

Most middle level btree operations are rewritten to pass a proper
ncmax value depending on whether each occurrence of node is "root" or
not.

A constant NILFS_BTREE_ROOT_NCHILDREN_MAX is used for the root node,
whereas nilfs_btree_nchildren_per_block() function is used for
non-root nodes.  If a node could be either root or a non-root node, an
output argument of nilfs_btree_get_node() is used to set up ncmax.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
ea64ab87cd nilfs2: optimize calculation of min/max number of btree node children
nilfs_btree_node_nchildren_max() and nilfs_btree_node_nchildren_min()
functions switch return value depending on whether target node is the
root or a node block.  In most uses of these functions, however, the
node type is fixed, and moreover the same calculation is repeatedly
performed in loop.

This unfold these functions depending on context and move them outside
loops wherever possible.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
364ec2d700 nilfs2: remove redundant pointer checks in bmap lookup functions
nilfs_bmap_lookup and its variants are supposed to take a valid
pointer argument to return a block address, thus pointer checks in
nilfs_btree_lookup and nilfs_direct_lookup are needless.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
05d0e94b66 nilfs2: get rid of nilfs_bmap_union
This removes nilfs_bmap_union and finally unifies three structures and
the union in bmap/btree code into one.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
dc935be2a0 nilfs2: unify bmap set_target_v operations
This unifies two similar functions nilfs_btree_set_target_v and
nilfs_direct_set_target_v into one, nilfs_bmap_set_target_v.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
e7c274f808 nilfs2: get rid of nilfs_btree uses
This replaces all uses of nilfs_btree struct in implementation of
btree mapping with nilfs_bmap struct.

Name of local variable "btree" is kept not to bloat amount of change.
And, a part of local variables "bmap" is renamed to "btree" to uniform
naming rule.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:13 +09:00
Ryusuke Konishi
10ff885ba6 nilfs2: get rid of nilfs_direct uses
This replaces all uses of nilfs_direct struct in implementation of
direct mapping with nilfs_bmap struct.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:13 +09:00
Ryusuke Konishi
583ada4761 nilfs2: remove constant qualifier from argument of bmap propagate
The first argument of bops->bop_propagate operation takes a constant
qualifier, and causes compilation error when removed cast to pointer
of nilfs_btree structure type.  This fixes the issue to prepare for
succesive removal of nilfs_btree struct.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:13 +09:00
Ryusuke Konishi
25b8d7ded0 nilfs2: get rid of private conversion macros on bmap key and pointer
Will remove nilfs_bmap_key_to_dkey(), nilfs_bmap_dkey_to_key(),
nilfs_bmap_ptr_to_dptr(), and nilfs_bmap_dptr_to_ptr() for simplicity.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:13 +09:00
Ryusuke Konishi
1d5385b9f3 nilfs2: verify btree node after reading
This inserts sanity checks soon after read btree node from disk.  This
allows early detection of broken btree nodes, and helps to narrow down
problems due to file system corruption.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:13 +09:00
Ryusuke Konishi
cfa913a507 nilfs2: add sanity check in nilfs_btree_add_dirty_buffer
According to the report titled "problem with nilfs_cleanerd" from
Łukasz Wójcicki, nilfs_btree_lookup_dirty_buffers or
nilfs_btree_add_dirty_buffer got memory violation during garbage
collection.

This could happen if a level field of given btree node buffer is
incorrect, which is a crucial internal bug.

This inserts a sanity check to figure out the problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:12 +09:00
Ryusuke Konishi
7c01745781 nilfs2: pass remount flag to parse_options
This adds is_remount argument to the parse_options() function that
obtains mount options from strings.

Previously, parse_options did not distinguish context whether it's
called for a new mount or remount, so the caller needed additional
verifications outside the function.

This allows parse_options to verify options and print messages
depending on the context.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:12 +09:00
Ryusuke Konishi
c6b4d57ddf nilfs2: use seq_puts to print mount options without argument
This replaces seq_printf() with seq_puts() in nilfs_show_options for
mount options which have no argument.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:12 +09:00
Ryusuke Konishi
802d317754 nilfs2: add nodiscard mount option
Nilfs has "discard" mount option which issues discard/TRIM commands to
underlying block device, but it lacks a complementary option and has
no way to disable the feature through remount.

This adds "nodiscard" option to resolve this imbalance.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:12 +09:00
Ryusuke Konishi
773bc4f3b6 nilfs2: add barrier mount option
Nilfs enables write barriers by default and has "nobarrier" mount
option to disable this feature.  But it lacks the complementary option
and has no way to re-enable the feature on remount.

This adds "barrier" option to resolve this imbalance.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:12 +09:00
Ryusuke Konishi
325020477a nilfs2: do not update log cursor for small change
Super blocks of nilfs are periodically overwritten in order to record
the recent log position.  This shortens recovery time after unclean
unmount, but the current implementation performs the update even for a
few blocks of change.  If the filesystem gets small changes slowly and
continually, super blocks may be updated excessively.

This moderates the issue by skipping update of log cursor if it does
not cross a segment boundary.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:11 +09:00
Ryusuke Konishi
6c12516083 nilfs2: implement fallback for super root search
Although nilfs redundantly uses two super blocks and each may point to
different position on log, the current version of nilfs does not try
fallback to the spare super block when it doesn't find any valid log
at the position that the primary super block points to.

This has been a cause of mount failures due to write order reversals
on barrier less block devices.

This inserts fallback code in error path of nilfs_search_super_root
routine to resolve the mount failure problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:11 +09:00
Ryusuke Konishi
2d72b99ecd nilfs2: add missing error code in comment of nilfs_search_super_root
nilfs_search_super_root can return -ENOMEM, but this error code is not
described in its kernel-doc comment.  This fixes the discrepancy.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:11 +09:00
Ryusuke Konishi
843d63baa5 nilfs2: separate setup of log cursor from init_nilfs
This separates a setup routine of log cursor from init_nilfs().  The
routine, nilfs_store_log_cursor, reads the last position of the log
containing a super root, and initializes relevant state on the nilfs
object.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:11 +09:00
Jiro SEKIBA
b2ac86e1a8 nilfs2: sync super blocks in turns
This will sync super blocks in turns instead of syncing duplicate
super blocks at the time.  This will help searching valid super root
when super block is written into disk before log is written, which is
happen when barrier-less block devices are unmounted uncleanly.  In
the situation, old super block likely points to valid log.

This patch introduces ns_sbwcount member to the nilfs object and adds
nilfs_sb_will_flip() function; ns_sbwcount counts how many times super
blocks write back to the disk.  And, nilfs_sb_will_flip() decides
whether flipping required or not based on the count of ns_sbwcount to
sync super blocks asymmetrically.

The following functions are also changed:

 - nilfs_prepare_super(): flips super blocks according to the
   argument.  The argument is calculated by nilfs_sb_will_flip()
   function.

 - nilfs_cleanup_super(): sets "clean" flag to both super blocks if
   they point to the same checkpoint.

To update both of super block information, caller of
nilfs_commit_super must set the information on both super blocks.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:11 +09:00
Jiro SEKIBA
d26493b6f0 nilfs2: introduce nilfs_prepare_super
This function checks validity of super block pointers.
If first super block is invalid, it will swap the super blocks.
The function should be called before any super block information updates.
Caller must obtain nilfs->ns_sem.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:10 +09:00
Ryusuke Konishi
60f46b7efc nilfs2: separate function that updates log position
This moves out section that updates information of the recent log
position stored in super blocks from nilfs_commit_super to a new
routine named nilfs_set_log_cursor.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:10 +09:00
Ryusuke Konishi
c8a11c8a14 nilfs2: add nilfs_set_error
This function marks error state and write it on super blocks.  This is
a preparation for making super block writeback alternately.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:10 +09:00
Ryusuke Konishi
7ecaa46cfe nilfs2: add nilfs_cleanup_super
This function write out filesystem state to super blocks in order to
share the same cleanup work.  This is a preparation for making super
block writeback alternately.

Cc: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:10 +09:00
Ryusuke Konishi
bde4e696e4 nilfs2: do not update mount time on rw->ro remount
Mount time field in super block is wrongly updated when nilfs remounts
the partition from read-write to read-only.  This fixes the issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:10 +09:00
Ryusuke Konishi
57a4bfc486 nilfs2: get rid of ns_free_segments_count
This counter is unused.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:09 +09:00
Ryusuke Konishi
4762077c7b nilfs2: get rid of macros for segment summary information
This removes macros to test segment summary flags and redefines a few
relevant macros with inline functions.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:09 +09:00
Ryusuke Konishi
85655484f8 nilfs2: do not use nilfs_segsum_info structure in recovery code
This will get rid of nilfs_segsum_info use from recovery functions for
simplicity.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:09 +09:00
Ryusuke Konishi
354fa8be28 nilfs2: divide load_segment_summary function
load_segment_summary function has two distinct roles: getting summary
header of a log, and verifying consistencies of the log.

This divide it into two corresponding functions, nilfs_read_log_header
and nilfs_validate_log to clarify the meaning.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:09 +09:00
Ryusuke Konishi
aee5ce2f57 nilfs2: rename nilfs_recover_logical_segments function
The function name of nilfs_recover_logical_segments makes no sense.
This changes the name into nilfs_salvage_orphan_logs to clarify the
role of the function.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:09 +09:00
Ryusuke Konishi
8b94025c00 nilfs2: refactor recovery logic routines
Most functions in recovery code take an argument of a super block
instance or a nilfs_sb_info struct for convenience sake.

This replaces them aggressively with a nilfs object by applying
__bread and __breadahead against routines using sb_bread and
sb_breadahead.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:08 +09:00
Ryusuke Konishi
92c60ccaf3 nilfs2: add blocksize member to nilfs object
This stores blocksize in nilfs objects for the successive refactoring
of recovery logic.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:08 +09:00
Tejun Heo
9b64697246 cifs: use workqueue instead of slow-work
Workqueue can now handle high concurrency.  Use system_nrt_wq
instead of slow-work.

* Updated is_valid_oplock_break() to not call cifs_oplock_break_put()
  as advised by Steve French.  It might cause deadlock.  Instead,
  reference is increased after queueing succeeded and
  cifs_oplock_break() briefly grabs GlobalSMBSeslock before putting
  the cfile to make sure it doesn't put before the matching get is
  finished.

* Anton Blanchard reported that cifs conversion was using now gone
  system_single_wq.  Use system_nrt_wq which provides non-reentrance
  guarantee which is enough and much better.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Steve French <sfrench@samba.org>
Cc: Anton Blanchard <anton@samba.org>
2010-07-22 22:59:15 +02:00
Tejun Heo
d098adfb7d fscache: drop references to slow-work
fscache no longer uses slow-work.  Drop references to it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
2010-07-22 22:58:58 +02:00
Tejun Heo
8af7c12436 fscache: convert operation to use workqueue instead of slow-work
Make fscache operation to use only workqueue instead of combination of
workqueue and slow-work.  FSCACHE_OP_SLOW is dropped and
FSCACHE_OP_FAST is renamed to FSCACHE_OP_ASYNC and uses newly added
fscache_op_wq workqueue to execute op->processor().
fscache_operation_init_slow() is dropped and fscache_operation_init()
now takes @processor argument directly.

* Unbound workqueue is used.

* fscache_retrieval_work() is no longer necessary as OP_ASYNC now does
  the equivalent thing.

* sysctl fscache.operation_max_active added to control concurrency.
  The default value is nr_cpus clamped between 2 and
  WQ_UNBOUND_MAX_ACTIVE.

* debugfs support is dropped for now.  Tracing API based debug
  facility is planned to be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
2010-07-22 22:58:47 +02:00
Tejun Heo
8b8edefa2f fscache: convert object to use workqueue instead of slow-work
Make fscache object state transition callbacks use workqueue instead
of slow-work.  New dedicated unbound CPU workqueue fscache_object_wq
is created.  get/put callbacks are renamed and modified to take
@object and called directly from the enqueue wrapper and the work
function.  While at it, make all open coded instances of get/put to
use fscache_get/put_object().

* Unbound workqueue is used.

* work_busy() output is printed instead of slow-work flags in object
  debugging outputs.  They mean basically the same thing bit-for-bit.

* sysctl fscache.object_max_active added to control concurrency.  The
  default value is nr_cpus clamped between 4 and
  WQ_UNBOUND_MAX_ACTIVE.

* slow_work_sleep_till_thread_needed() is replaced with fscache
  private implementation fscache_object_sleep_till_congested() which
  waits on fscache_object_wq congestion.

* debugfs support is dropped for now.  Tracing API based debug
  facility is planned to be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
2010-07-22 22:58:34 +02:00
Sage Weil
a0dff78dab ceph: avoid dcache readdir for snapdir
We should always go to the MDS for readdir on the hidden snapdir.  The
set of snapshots can change at any time; the client can't trust its cache
for that.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-22 13:50:45 -07:00
David Howells
4c0c03ca54 CIFS: Fix a malicious redirect problem in the DNS lookup code
Fix the security problem in the CIFS filesystem DNS lookup code in which a
malicious redirect could be installed by a random user by simply adding a
result record into one of their keyrings with add_key() and then invoking a
CIFS CFS lookup [CVE-2010-2524].

This is done by creating an internal keyring specifically for the caching of
DNS lookups.  To enforce the use of this keyring, the module init routine
creates a set of override credentials with the keyring installed as the thread
keyring and instructs request_key() to only install lookup result keys in that
keyring.

The override is then applied around the call to request_key().

This has some additional benefits when a kernel service uses this module to
request a key:

 (1) The result keys are owned by root, not the user that caused the lookup.

 (2) The result keys don't pop up in the user's keyrings.

 (3) The result keys don't come out of the quota of the user that caused the
     lookup.

The keyring can be viewed as root by doing cat /proc/keys:

2a0ca6c3 I-----     1 perm 1f030000     0     0 keyring   .dns_resolver: 1/4

It can then be listed with 'keyctl list' by root.

	# keyctl list 0x2a0ca6c3
	1 key in keyring:
	726766307: --alswrv     0     0 dns_resolver: foo.bar.com

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steve French <smfrench@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-22 09:42:40 -07:00
Ingo Molnar
9dcdbf7a33 Merge branch 'linus' into perf/core
Merge reason: Pick up the latest perf fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-21 21:43:06 +02:00
Johan Hedberg
63c7d09cd5 Bluetooth: Add HCIUARTSETFLAGS and HCIUARTGETFLAGS ioctls
This patch introduces two new ioctls: HCIUARTSETFLAGS and
HCIUARTGETFLAGS. The only flag available for now is HCI_UART_RAW_DEVICE
which allows to initialize a UART device into RAW mode from userspace.
This is particularly useful for experimenting with Bluetooth controllers
that don't yet have proper support in BlueZ.

Signed-off-by: Johan Hedberg <johan.hedberg@nokia.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2010-07-21 10:39:11 -07:00
Johan Hedberg
2cdf096fff Bluetooth: Add missing HCIUARTGETDEVICE ioctl to compat_ioctl.c
Signed-off-by: Johan Hedberg <johan.hedberg@nokia.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2010-07-21 10:39:11 -07:00
Johan Hedberg
f03585689f Bluetooth: Add blacklist support for incoming connections
In some circumstances it could be desirable to reject incoming
connections on the baseband level. This patch adds this feature through
two new ioctl's: HCIBLOCKADDR and HCIUNBLOCKADDR. Both take a simple
Bluetooth address as a parameter. BDADDR_ANY can be used with
HCIUNBLOCKADDR to remove all devices from the blacklist.

Signed-off-by: Johan Hedberg <johan.hedberg@nokia.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2010-07-21 10:39:05 -07:00
Linus Torvalds
a4ce96ac35 Fix up trivial spelling errors ('taht' -> 'that')
Pointed out by Lucas who found the new one in a comment in
setup_percpu.c. And then I fixed the others that I grepped
for.

Reported-by: Lucas <canolucas@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-21 09:25:42 -07:00
Jiaying Zhang
fb5ffb0e16 quota: Change quota error message to print out disk and function name
The current quota error message doesn't always print the disk name, so
it is hard to identify the "bad" disk when quota error happens.

This patch changes the standardized quota error message to print out disk name
and function name. It also uses a combination of cpp macro and inline function
to provide better type checking and to lower the text size of the message.

[Jan Kara: Export __quota_error]

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-21 16:05:58 +02:00
Jan Kara
f25f624263 ext3: Avoid filesystem corruption after a crash under heavy delete load
It can happen that ext3_free_branches calls ext3_forget() for an indirect block
in an earlier transaction than a transaction in which we clear pointer to this
indirect block. Thus if we crash before a transaction clearing the block
pointer is committed, we will see indirect block pointing to already freed
blocks and complain during orphan list cleanup.

The fix is simple: Make sure ext3_forget() is called in the transaction
doing block pointer clearing.

This is a backport of an ext4 fix by Amir G. <amir73il@users.sourceforge.net>

Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-21 16:04:26 +02:00
Christoph Hellwig
4c4d390122 ext3: remove vestiges of nobh support
The nobh option was only supported for writeback mode, but given that all
write paths (except mmapped writed) actually create buffer heads, it
effectively was a no-op already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-21 16:01:47 +02:00
Andi Kleen
0411ba7902 ext3: Fix set but unused variables
[tytso@mit.edu: Fix compilation with CONFIG_JBD_DEBUG enabled]

Acked-by: tytso@mit.edu
cc: linux-ext4@vger.kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-21 16:01:47 +02:00
Christoph Hellwig
189eef59e7 quota: clean up quota active checks
The various quota operations check for any quota beeing active on
a superblock, and the inode not having the noquota flag.

Merge these two checks into a dquot_active check and move that
into dquot.c as that's the only place where it's needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-21 16:01:47 +02:00
Christoph Hellwig
ade7ce31c2 quota: Clean up the namespace in dqblk_xfs.h
Almost all identifiers use the FS_* namespace, so rename the missing few
XFS_* ones to FS_* as well.  Without this some people might get upset
about having too many XFS names in generic code.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-21 16:01:46 +02:00
Dmitry Monakhov
7af9cce8ae quota: check quota reservation on remove_dquot_ref
Reserved space must being claimed before remove_dquot_ref() for a
given inode. Filesystem is responsible for performing force blocks
allocation in case of dealloc in ->quota_off. Let's add sanity check
for that case. Do it similar to add_dquot_ref().

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-21 16:01:46 +02:00
Linus Torvalds
e0959371b4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: do not include cap/dentry releases in replayed messages
  ceph: reuse request message when replaying against recovering mds
  ceph: fix creation of ipv6 sockets
  ceph: fix parsing of ipv6 addresses
  ceph: fix printing of ipv6 addrs
  ceph: add kfree() to error path
  ceph: fix leak of mon authorizer
  ceph: fix message revocation
2010-07-20 16:27:58 -07:00
Stephen Boyd
59b4856831 fs/Kconfig: Fix typo Userpace -> Userspace
Signed-off-by: Stephen Boyd <bebarino@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-07-20 17:30:22 +02:00
Joe Perches
33fa1d909c fs/ocfs2: Remove unnecessary casts of private_data
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-07-20 17:20:08 +02:00
Linus Torvalds
620d0be881 Merge branch 'shrinker' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev
* 'shrinker' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev:
  xfs: track AGs with reclaimable inodes in per-ag radix tree
  xfs: convert inode shrinker to per-filesystem contexts
  mm: add context argument to shrinker callback
2010-07-19 20:18:24 -07:00
Linus Torvalds
ee1039307a Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE
  Btrfs: fix CLONE ioctl destination file size expansion to block boundary
  Btrfs: fix split_leaf double split corner case
2010-07-19 19:33:02 -07:00
Dave Chinner
16fd536737 xfs: track AGs with reclaimable inodes in per-ag radix tree
https://bugzilla.kernel.org/show_bug.cgi?id=16348

When the filesystem grows to a large number of allocation groups,
the summing of recalimable inodes gets expensive. In many cases,
most AGs won't have any reclaimable inodes and so we are wasting CPU
time aggregating over these AGs. This is particularly important for
the inode shrinker that gets called frequently under memory
pressure.

To avoid the overhead, track AGs with reclaimable inodes in the
per-ag radix tree so that we can find all the AGs with reclaimable
inodes via a simple gang tag lookup. This involves setting the tag
when the first reclaimable inode is tracked in the AG, and removing
the tag when the last reclaimable inode is removed from the tree.
Then the summation process becomes a loop walking the radix tree
summing AGs with the reclaim tag set.

This significantly reduces the overhead of scanning - a 6400 AG
filesystea now only uses about 25% of a cpu in kswapd while slab
reclaim progresses instead of being permanently stuck at 100% CPU
and making little progress. Clean filesystems filesystems will see
no overhead and the overhead only increases linearly with the number
of dirty AGs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-20 09:43:39 +10:00
Dave Chinner
70e60ce715 xfs: convert inode shrinker to per-filesystem contexts
Now the shrinker passes us a context, wire up a shrinker context per
filesystem. This allows us to remove the global mount list and the
locking problems that introduced. It also means that a shrinker call
does not need to traverse clean filesystems before finding a
filesystem with reclaimable inodes.  This significantly reduces
scanning overhead when lots of filesystems are present.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-20 08:07:02 +10:00
Dan Rosenberg
2ebc346478 Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE
1.  The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
whether the donor file is append-only before writing to it.

2.  The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
overflow that allows a user to specify an out-of-bounds range to copy
from the source file (if off + len wraps around).  I haven't been able
to successfully exploit this, but I'd imagine that a clever attacker
could use this to read things he shouldn't.  Even if it's not
exploitable, it couldn't hurt to be safe.

Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19 16:58:20 -04:00
Sage Weil
b5384d48f4 Btrfs: fix CLONE ioctl destination file size expansion to block boundary
The CLONE and CLONE_RANGE ioctls round up the range of extents being
cloned to the block size when the range to clone extends to the end of file
(this is always the case with CLONE).  It was then using that offset when
extending the destination file's i_size.  Fix this by not setting i_size
beyond the originally requested ending offset.

This bug was introduced by a22285a6 (2.6.35-rc1).

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19 16:15:06 -04:00
Chris Mason
99d8f83c98 Btrfs: fix split_leaf double split corner case
split_leaf was not properly balancing leaves when it was forced to
split a leaf twice.  This commit adds an extra push left and right
before forcing the double split in hopes of getting the slot where
we want to insert at either the start or end of the leaf.

If the extra pushes do work, then we are able to avoid splitting twice
and we keep the tree properly balanced.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19 16:14:50 -04:00
Pavel Machek
a2531293db update email address
pavel@suse.cz no longer works, replace it with working address.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-07-19 10:56:54 +02:00
Peter Oberparleiter
cffab6bc55 [S390] dasd: use correct label location for diag fba disks
Partition boundary calculation fails for DASD FBA disks under the
following conditions:
- disk is formatted with CMS FORMAT with a blocksize of more than
  512 bytes
- all of the disk is reserved to a single CMS file using CMS RESERVE
- the disk is accessed using the DIAG mode of the DASD driver

Under these circumstances, the partition detection code tries to
read the CMS label block containing partition-relevant information
from logical block offset 1, while it is in fact located at physical
block offset 1.

Fix this problem by using the correct CMS label block location
depending on the device type as determined by the DASD SENSE ID
information.

Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2010-07-19 09:22:50 +02:00
Dave Chinner
7f8275d0d6 mm: add context argument to shrinker callback
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-19 14:56:17 +10:00
Linus Torvalds
bea9a6d239 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Silence gcc warning in ocfs2_write_zero_page().
  jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
  ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
  ocfs2: Don't duplicate pages past i_size during CoW.
  ocfs2: tighten up strlen() checking
  ocfs2: Make xattr reflink work with new local alloc reservation.
  ocfs2: make xattr extension work with new local alloc reservation.
  ocfs2: Remove the redundant cpu_to_le64.
  ocfs2/dlm: don't access beyond bitmap size
  ocfs2: No need to zero pages past i_size.
  ocfs2: Zero the tail cluster when extending past i_size.
  ocfs2: When zero extending, do it by page.
  ocfs2: Limit default local alloc size within bitmap range.
  ocfs2: Move orphan scan work to ocfs2_wq.
  fs/ocfs2/dlm: Add missing spin_unlock
2010-07-18 10:09:25 -07:00
Joel Becker
5453258d53 ocfs2: Silence gcc warning in ocfs2_write_zero_page().
ocfs2_write_zero_page() has a loop that won't ever be skipped, but gcc
doesn't know that.  Set ret=0 just to make gcc happy.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-16 13:33:39 -07:00
Sage Weil
e979cf5039 ceph: do not include cap/dentry releases in replayed messages
Strip the cap and dentry releases from replayed messages.  They can
cause the shared state to get out of sync because they were generated
(with the request message) earlier, and no longer reflect the current
client state.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-16 10:30:18 -07:00
Sage Weil
01a92f174f ceph: reuse request message when replaying against recovering mds
Replayed rename operations (after an mds failure/recovery) were broken
because the request paths were regenerated from the dentry names, which
get mangled when d_move() is called.

Instead, resend the previous request message when replaying completed
operations.  Just make sure the REPLAY flag is set and the target ino is
filled in.

This fixes problems with workloads doing renames when the MDS restarts,
where the rename operation appears to succeed, but on mds restart then
fails (leading to client confusion, app breakage, etc.).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-16 10:30:17 -07:00
Jan Kara
13ceef099e jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
OCFS2 uses t_commit trigger to compute and store checksum of the just
committed blocks. When a buffer has b_frozen_data, checksum is computed
for it instead of b_data but this can result in an old checksum being
written to the filesystem in the following scenario:

1) transaction1 is opened
2) handle1 is opened
3) journal_access(handle1, bh)
    - This sets jh->b_transaction to transaction1
4) modify(bh)
5) journal_dirty(handle1, bh)
6) handle1 is closed
7) start committing transaction1, opening transaction2
8) handle2 is opened
9) journal_access(handle2, bh)
    - This copies off b_frozen_data to make it safe for transaction1 to commit.
      jh->b_next_transaction is set to transaction2.
10) jbd2_journal_write_metadata() checksums b_frozen_data
11) the journal correctly writes b_frozen_data to the disk journal
12) handle2 is closed
    - There was no dirty call for the bh on handle2, so it is never queued for
      any more journal operation
13) Checkpointing finally happens, and it just spools the bh via normal buffer
writeback.  This will write b_data, which was never triggered on and thus
contains a wrong (old) checksum.

This patch fixes the problem by calling the trigger at the moment data is
frozen for journal commit - i.e., either when b_frozen_data is created by
do_get_write_access or just before we write a buffer to the log if
b_frozen_data does not exist. We also rename the trigger to t_frozen as
that better describes when it is called.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-15 15:17:47 -07:00
Wengang Wang
a39953dd95 ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
dlm_migration_can_proceed() for that purpose.  However, if the node is
down, dlm_migration_can_proceed() will also return "go ahead".  In this
rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
the BUG_ON() that trips over this condition.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-15 10:56:30 -07:00
Tao Ma
f5e27b6ddf ocfs2: Don't duplicate pages past i_size during CoW.
During CoW, the pages after i_size don't contain valid data, so there's
no need to read and duplicate them.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-15 10:54:28 -07:00
Bob Peterson
728a756b8f GFS2: rename causes kernel Oops
This patch fixes a kernel Oops in the GFS2 rename code.

The problem was in the way the gfs2 directory code was trying
to re-use sentinel directory entries.

In the failing case, gfs2's rename function was renaming a
file to another name that had the same non-trivial length.
The file being renamed happened to be the first directory
entry on the leaf block.

First, the rename code (gfs2_rename in ops_inode.c) found the
original directory entry and decided it could do its job by
simply replacing the directory entry with another.  Therefore
it determined correctly that no block allocations were needed.

Next, the rename code deleted the old directory entry prior to
replacing it with the new name.  Therefore, the soon-to-be
replaced directory entry was temporarily made into a directory
entry "sentinel" or a place holder at the start of a leaf block.

Lastly, it went to re-add the replacement directory entry in
that leaf block.  However, when gfs2_dirent_find_space was
looking for space in the leaf block, it used the wrong value
for the sentinel.  That threw off its calculations so later
it decides it can't really re-use the sentinel and therefore
must allocate a new leaf block.  But because it previously decided
to re-use the directory entry, it didn't waste the time to
grab a new block allocation for the inode.  Therefore, the
inode's i_alloc pointer was still NULL and it crashes trying to
reference it.

In the case of sentinel directory entries, the entire dirent is
reused, not just the "free space" portion of it, and therefore
the function gfs2_dirent_find_space should use the value 0
rather than GFS2_DIRENT_SIZE(0) for the actual dirent size.

Fixing this calculation enables the reproducer programs to work
properly.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-15 09:07:56 +01:00
Abhijith Das
8b4216018b GFS2: BUG in gfs2_adjust_quota
HighMem pages on i686 do not get mapped to the buffer_heads and this was
causing a NULL pointer dereference when we were trying to memset page buffers
to zero.
We now use zero_user() that kmaps the page and directly manipulates page data.
This patch also fixes a boundary condition that was incorrect.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-15 09:07:16 +01:00
Bob Peterson
b1becbdee7 GFS2: Fix kernel NULL pointer dereference by dlm_astd
This patch fixes a problem in an error path when looking
up dinodes.  There are two sister-functions, gfs2_inode_lookup
and gfs2_process_unlinked_inode.  Both functions acquire and
hold the i_iopen glock for the dinode being looked up. The last
thing they try to do is hold the i_gl glock for the dinode.
If that glock fails for some reason, the error path was
incorrectly calling gfs2_glock_put for the i_iopen glock twice.
This resulted in the glock being prematurely freed.  The
"minimum hold time" usually kept the glock in memory, but the
lock interface to dlm (aka lock_dlm) freed its memory for the
glock.  In some circumstances, it would cause dlm's dlm_astd daemon
to try to call the bast function for the freed lock_dlm memory,
which resulted in a NULL pointer dereference.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-15 09:06:25 +01:00
Bob Peterson
b7dc2df572 GFS2: recovery stuck on transaction lock
This patch fixes bugzilla bug #590878: GFS2: recovery stuck on
transaction lock.  We set the frozen flag on the glock when we receive
a completion that cannot be delivered due to blocked locks. At that
point we check to see whether the first waiting holder has the noexp
flag set. If the noexp lock is queued later, then we need to unfreeze
the glock at that point in time, namely, in the glock work function.

This patch was originally written by Steve Whitehouse, but since
he's on holiday, I'm submitting it.  It's been well tested with a
complex recovery test called revolver.

Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2010-07-15 09:05:57 +01:00
Bob Peterson
a8bf2bc212 GFS2: O_TRUNC not working on stuffed files across cluster
This patch replaces a statement that got dropped out by accident.
Without the patch, truncates on stuffed (very small) files cause
those files to have an unpredictable size.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-15 09:05:17 +01:00
Artem Bityutskiy
6fb4374f6b UBIFS: fix GC LEB recovery
UBIFS tries to alway have an LEB reserved for GC, and stores it
in c->gc_lnum. Besides, there is GC head which points to the current
GC head LEB.

In case of an unclean power cut, what may happen is that the GC head
was switched to the reserved GC LEB (c->gc_lnum), but a new reserved
GC LEB was not created yet. So, after an unclean reboot we may have
no reserved GC LEB, and we need to find a new LEB for this.

To do this, we find a dirty LEB which can fit the current GC head,
move the data, unmap this dirty LEB, and it becomes our reserved GC
LEB.

However, if we cannot find a dirty enough LEB, we return failure,
which is wrong, because we still can have free LEBs to use for
the reserved GC LEB. This patch fixes the issue.

This patch also fixes few typos in comments, which were spotted by
aspell.

Note, this patch fixes a real issue

[   14.328117] UBIFS: recovery needed
[   53.941378] UBIFS error (pid 462): ubifs_rcvry_gc_commit: could not find a dirty LEB
[   89.606399] UBIFS: recovery completed
[   89.609329] UBIFS assert failed in mount_ubifs at 1358 (pid 462)
[   89.616165] [<c0026144>] (unwind_backtrace+0x0/0xe4) from [<c0125ce4>] (ubifs_fill_super+0x11d0/0x1c4c)
[   89.625930] [<c0125ce4>] (ubifs_fill_super+0x11d0/0x1c4c) from [<c0126910>] (ubifs_get_sb+0x1b0/0x354)
[   89.635696] [<c0126910>] (ubifs_get_sb+0x1b0/0x354) from [<c008a50c>] (vfs_kern_mount+0x50/0xe0)
[   89.644485] [<c008a50c>] (vfs_kern_mount+0x50/0xe0) from [<c008a5e0>] (do_kern_mount+0x34/0xdc)
[   89.653274] [<c008a5e0>] (do_kern_mount+0x34/0xdc) from [<c00a29d8>] (do_mount+0x148/0x7cc)
[   89.662063] [<c00a29d8>] (do_mount+0x148/0x7cc) from [<c00a30f4>] (sys_mount+0x98/0xc8)
[   89.670852] [<c00a30f4>] (sys_mount+0x98/0xc8) from [<c0021f40>] (ret_fast_syscall+0x0/0x28)

which was reported here:
http://article.gmane.org/gmane.linux.drivers.mtd/29923
by Alexander Pazdnikov <pazdnikov@list.ru>

Reported-by: Alexander Pazdnikov <pazdnikov@list.ru>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <adrian.hunter@nokia.com>
2010-07-13 06:51:57 +03:00
Dan Carpenter
e372357ba5 ocfs2: tighten up strlen() checking
This function is only called from one place and it's like this:
	dlm_register_domain(conn->cc_name, dlm_key, &fs_version);

The "conn->cc_name" is 64 characters long.  If strlen(conn->cc_name)
were equal to O2NM_MAX_NAME_LEN (64) that would be a bug because
strlen() doesn't count the NULL character.

In fact, if you look how O2NM_MAX_NAME_LEN is used, it mostly describes
64 character buffers.  The only exception is nd_name from struct
o2nm_node.

Anyway I looked into it and in this case the domain string comes from
osb->uuid_str in ocfs2_setup_osb_uuid().  That's 32 characters and NULL
which easily fits into O2NM_MAX_NAME_LEN.  This patch doesn't change how
the code works, but I think it makes the code a little cleaner.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-12 13:57:53 -07:00
Tao Ma
121a39bb00 ocfs2: Make xattr reflink work with new local alloc reservation.
The new reservation code in local alloc has add the limitation
that the caller should handle the case that the local alloc
doesn't give use enough contiguous clusters. It make the old
xattr reflink code broken.

So this patch udpate the xattr reflink code so that it can
handle the case that local alloc give us one cluster at a time.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-12 13:57:50 -07:00
Tao Ma
a78f9f4668 ocfs2: make xattr extension work with new local alloc reservation.
The old ocfs2_xattr_extent_allocation is too optimistic about
the clusters we can get. So actually if the file system is
too fragmented, ocfs2_add_clusters_in_btree will return us
with EGAIN and we need to allocate clusters once again.

So this patch change it to a while loop so that we can allocate
clusters until we reach clusters_to_add.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
2010-07-12 13:57:24 -07:00
Tao Ma
0a463b74e7 ocfs2: Remove the redundant cpu_to_le64.
In ocfs2_block_group_alloc, we set c_blkno by bg->bg_blkno.
But actually bg->bg_blkno is already changed to little endian
in ocfs2_block_group_fill. So remove the extra cpu_to_le64.

Reported-by: Marcos Matsunaga <Marcos.Matsunaga@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-12 13:56:18 -07:00
Wengang Wang
f471c9df92 ocfs2/dlm: don't access beyond bitmap size
dlm->recovery_map is defined as
	unsigned long recovery_map[BITS_TO_LONGS(O2NM_MAX_NODES)];

We should treat O2NM_MAX_NODES as the bit map size in bits.
This patches fixes a bit operation that takes O2NM_MAX_NODES + 1 as bitmap size.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-12 13:56:14 -07:00
Joel Becker
693c241a5f ocfs2: No need to zero pages past i_size.
When ocfs2 fills a hole, it does so by allocating clusters.  When a
cluster is larger than the write, ocfs2 must zero the portions of the
cluster outside of the write.  If the clustersize is smaller than a
pagecache page, this is handled by the normal pagecache mechanisms, but
when the clustersize is larger than a page, ocfs2's write code will zero
the pages adjacent to the write.  This makes sure the entire cluster is
zeroed correctly.

Currently ocfs2 behaves exactly the same when writing past i_size.
However, this means ocfs2 is writing zeroed pages for portions of a new
cluster that are beyond i_size.  The page writeback code isn't expecting
this.  It treats all pages past the one containing i_size as left behind
due to a previous truncate operation.

Thankfully, ocfs2 calculates the number of pages it will be working on
up front.  The rest of the write code merely honors the original
calculation.  We can simply trim the number of pages to only cover the
actual file data.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
2010-07-12 13:55:27 -07:00
Miklos Szeredi
2d45ba381a fuse: add retrieve request
Userspace filesystem can request data to be retrieved from the inode's
mapping.  This request is synchronous and the retrieved data is queued
as a new request.  If the write to the fuse device returns an error
then the retrieve request was not completed and a reply will not be
sent.

Only present pages are returned in the retrieve reply.  Retrieving
stops when it finds a non-present page and only data prior to that is
returned.

This request doesn't change the dirty state of pages.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-07-12 14:41:40 +02:00
Miklos Szeredi
a1d75f2582 fuse: add store request
Userspace filesystem can request data to be stored in the inode's
mapping.  This request is synchronous and has no reply.  If the write
to the fuse device returns an error then the store request was not
fully completed (but may have updated some pages).

If the stored data overflows the current file size, then the size is
extended, similarly to a write(2) on the filesystem.

Pages which have been completely stored are marked uptodate.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-07-12 14:41:40 +02:00
Miklos Szeredi
7909b1c640 fuse: don't use atomic kmap
Don't use atomic kmap for mapping userspace buffers in device
read/write/splice.

This is necessary because the next patch (adding store notify)
requires that caller of fuse_copy_page() may sleep between
invocations.  The simplest way to ensure this is to change the atomic
kmaps to non-atomic ones.

Thankfully architectures where kmap() is not a no-op are going out of
fashion, so we can ignore the (probably negligible) performance impact
of this change.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-07-12 14:41:40 +02:00
Arnd Bergmann
5f202bd5ca do_coredump: Do not take BKL
core_pattern is not actually protected and hasn't been
ever since we introduced procfs support for sysctl -- a
_long_ time. Don't take it here either.

Also nothing inside do_coredump appears to require bkl
protection.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[ remove smp_lock.h headers ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-07-10 01:21:49 +02:00
Sage Weil
f91d3471cc ceph: fix creation of ipv6 sockets
Use the address family from the peer address instead of assuming IPv4.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-09 15:00:20 -07:00
Sage Weil
39139f64e1 ceph: fix parsing of ipv6 addresses
Check for brackets around the ipv6 address to avoid ambiguity with the port
number.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-09 15:00:18 -07:00
Sage Weil
d06dbaf6c2 ceph: fix printing of ipv6 addrs
The buffer was too small.  Make it bigger, use snprintf(), put brackets
around the ipv6 address to avoid mixing it up with the :port, and use the
ever-so-handy %pI[46] formats.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-08 16:49:53 -07:00
Joel Becker
5693486bad ocfs2: Zero the tail cluster when extending past i_size.
ocfs2's allocation unit is the cluster.  This can be larger than a block
or even a memory page.  This means that a file may have many blocks in
its last extent that are beyond the block containing i_size.  There also
may be more unwritten extents after that.

When ocfs2 grows a file, it zeros the entire cluster in order to ensure
future i_size growth will see cleared blocks.  Unfortunately,
block_write_full_page() drops the pages past i_size.  This means that
ocfs2 is actually leaking garbage data into the tail end of that last
cluster.  This is a bug.

We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
when a write or truncate is past i_size.  They will use
ocfs2_zero_extend() to ensure the data is properly zeroed.

Older versions of ocfs2_zero_extend() simply zeroed every block between
i_size and the zeroing position.  This presumes three things:

1) There is allocation for all of these blocks.
2) The extents are not unwritten.
3) The extents are not refcounted.

(1) and (2) hold true for non-sparse filesystems, which used to be the
only users of ocfs2_zero_extend().  (3) is another bug.

Since we're now using ocfs2_zero_extend() for sparse filesystems as
well, we teach ocfs2_zero_extend() to check every extent between
i_size and the zeroing position.  If the extent is unwritten, it is
ignored.  If it is refcounted, it is CoWed.  Then it is zeroed.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
2010-07-08 13:25:35 -07:00
Joel Becker
a4bfb4cf11 ocfs2: When zero extending, do it by page.
ocfs2_zero_extend() does its zeroing block by block, but it calls a
function named ocfs2_write_zero_page().  Let's have
ocfs2_write_zero_page() handle the page level.  From
ocfs2_zero_extend()'s perspective, it is now page-at-a-time.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
2010-07-08 13:24:49 -07:00
Linus Torvalds
c77e9e6826 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  writeback: simplify the write back thread queue
  writeback: split writeback_inodes_wb
  writeback: remove writeback_inodes_wbc
  fs-writeback: fix kernel-doc warnings
  splice: check f_mode for seekable file
  splice: direct_splice_actor() should not use pos in sd
2010-07-08 08:06:40 -07:00
Dan Carpenter
b0bbb0be8f ceph: add kfree() to error path
We leak a "pi" on this error path.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-08 08:03:24 -07:00
Chuck Lever
43a9aa64a2 NFSD: Fill in WCC data for REMOVE, RMDIR, MKNOD, and MKDIR
Some well-known NFSv3 clients drop their directory entry caches when
they receive replies with no WCC data.  Without this data, they
employ extra READ, LOOKUP, and GETATTR requests to ensure their
directory entry caches are up to date, causing performance to suffer
needlessly.

In order to return WCC data, our server has to have both the pre-op
and the post-op attribute data on hand when a reply is XDR encoded.
The pre-op data is filled in when the incoming fh is locked, and the
post-op data is filled in when the fh is unlocked.

Unfortunately, for REMOVE, RMDIR, MKNOD, and MKDIR, the directory fh
is not unlocked until well after the reply has been XDR encoded.  This
means that encode_wcc_data() does not have wcc_data for the parent
directory, so none is returned to the client after these operations
complete.

By unlocking the parent directory fh immediately after the internal
operations for each NFS procedure is complete, the post-op data is
filled in before XDR encoding starts, so it can be returned to the
client properly.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-07-07 17:12:32 -04:00
Linus Torvalds
1cc9629402 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix crush device 'out' threshold to 1.0, not 0.1
  ceph: fix caps usage accounting for import (non-reserved) case
  ceph: only release clean, unused caps with mds requests
  ceph: fix crush CHOOSE_LEAF when type is already a leaf
  ceph: fix crush recursion
  ceph: fix caps debugfs entry
  ceph: delay umount until all mds requests drop inode+dentry refs
  ceph: handle splice_dentry/d_materialize_unique error in readdir_prepopulate
  ceph: fix crush map update decoding
  ceph: fix message memory leak, uninitialized variable
  ceph: fix map handler error path
  ceph: some endianity fixes
2010-07-06 17:15:15 -07:00
J. Bruce Fields
6a85d6c769 nfsd4: comment nitpick
Reported-by: "Madan, Anshul" <Anshul.Madan@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-07-06 12:40:22 -04:00
Christoph Hellwig
83ba7b071f writeback: simplify the write back thread queue
First remove items from work_list as soon as we start working on them.  This
means we don't have to track any pending or visited state and can get
rid of all the RCU magic freeing the work items - we can simply free
them once the operation has finished.  Second use a real completion for
tracking synchronous requests - if the caller sets the completion pointer
we complete it, otherwise use it as a boolean indicator that we can free
the work item directly.  Third unify struct wb_writeback_args and struct
bdi_work into a single data structure, wb_writeback_work.  Previous we
set all parameters into a struct wb_writeback_args, copied it into
struct bdi_work, copied it again on the stack to use it there.  Instead
of just allocate one structure dynamically or on the stack and use it
all the way through the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06 08:59:53 +02:00
Christoph Hellwig
edadfb10ba writeback: split writeback_inodes_wb
The case where we have a superblock doesn't require a loop here as we scan
over all inodes in writeback_sb_inodes. Split it out into a separate helper
to make the code simpler.  This also allows to get rid of the sb member in
struct writeback_control, which was rather out of place there.

Also update the comments in writeback_sb_inodes that explain the handling
of inodes from wrong superblocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06 08:54:08 +02:00
Christoph Hellwig
9c3a8ee8a1 writeback: remove writeback_inodes_wbc
This was just an odd wrapper around writeback_inodes_wb.  Removing this
also allows to get rid of the bdi member of struct writeback_control
which was rather out of place there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06 08:54:03 +02:00
Sage Weil
22b1de06c9 ceph: fix leak of mon authorizer
Fix leak of a struct ceph_buffer on umount.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-05 15:36:49 -07:00
Sage Weil
ed98adad3d ceph: fix message revocation
A message can be on a queue (pending or sent), or out_msg (sending), or
both.  We were assuming that if it's not on a queue it couldn't be out_msg,
but that was false in the case of lossy connections like the OSD.  Fix
ceph_con_revoke() to treat these cases independently.  Also, fix the
out_kvec_is_message check to only trigger if we are currently sending
_this_ message.

This fixes a GPF in tcp_sendpage, triggered by OSD restarts.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-05 12:16:23 -07:00
Sage Weil
153a10939e ceph: fix crush device 'out' threshold to 1.0, not 0.1
Fix a typo that made any OSD weighted between 0.1 and 1.0 effectively
weighted as 1.0 (fully in).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-05 09:44:17 -07:00
Ingo Molnar
08f8ba0799 Merge commit 'v2.6.35-rc4' into perf/core
Merge reason: Pick up the latest perf fixes

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-05 08:30:58 +02:00
Linus Torvalds
e3668dd83b Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: remove block number from inode lookup code
  xfs: rename XFS_IGET_BULKSTAT to XFS_IGET_UNTRUSTED
  xfs: validate untrusted inode numbers during lookup
  xfs: always use iget in bulkstat
  xfs: prevent swapext from operating on write-only files
2010-07-04 20:13:31 -07:00
Ingo Molnar
0a54cec0c2 Merge branch 'linus' into core/rcu
Conflicts:
	fs/fs-writeback.c

Merge reason: Resolve the conflict

Note, i picked the version from Linus's tree, which effectively reverts
the fs-writeback.c bits of:

  b97181f: fs: remove all rcu head initializations, except on_stack initializations

As the upstream changes to this file changed this code heavily and the
first attempt to resolve the conflict resulted in a non-booting kernel.
It's safer to re-try this portion of the commit cleanly.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-07-01 09:31:25 +02:00
Randy Dunlap
06d738fa91 fs-writeback: fix kernel-doc warnings
Fix kernel-doc to match the function's changed args.

Warning(fs/fs-writeback.c:190): No description found for parameter 'args'
Warning(fs/fs-writeback.c:190): Excess function parameter 'sb' description in 'bdi_queue_work_onstack'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-01 08:26:34 +02:00
Changli Gao
19c9a49b43 splice: check f_mode for seekable file
check f_mode for seekable file

As a seekable file is allowed without a llseek function, so the old way isn't
work any more.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
----
 fs/splice.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-30 08:12:37 +02:00
Changli Gao
2cb4b05e76 splice: direct_splice_actor() should not use pos in sd
direct_splice_actor() shouldn't use sd->pos, as sd->pos is for file reading,
file->f_pos should be used instead.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
----
 fs/splice.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-30 08:12:37 +02:00
Andrew Morton
f4985dc714 fs/fcntl.c:kill_fasync_rcu() fa_lock must be IRQ-safe
Fix a lockdep-splat-causing regression introduced by commit 989a297920
("fasync: RCU and fine grained locking").

kill_fasync() can be called from both process and hard-irq context, so
fa_lock must be taken with IRQs disabled.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16230

Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Tested-by: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-29 15:29:32 -07:00
Lubomir Rintel
46c23d7f52 sysvfs: fix NULL deref. when allocating new inode
A call to sysv_write_inode() in sysv_new_inode() to its new interface that
replaced wait flag with writeback structure.  This was broken by
a9185b41a4 ("pass writeback_control to
->write_inode").

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>		[2.6.34.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-29 15:29:32 -07:00
Mike Frysinger
2952095c6b flat: tweak default stack alignment
The recent commit 1f0ce8b3dd ("mm: Move ARCH_SLAB_MINALIGN and
ARCH_KMALLOC_MINALIGN to <linux/slab_def.h>") which moved the
ARCH_SLAB_MINALIGN default into the global header inadvertently broke FLAT
for a bunch of systems.  Blackfin systems now fail on any FLAT exec with:
Unable to read code+data+bss, errno 14 When your /init is a FLAT binary,
obviously this can be annoying ;).

This stems from the alignment usage in the FLAT loader.  The behavior
before was that FLAT would default to ARCH_SLAB_MINALIGN only if it was
defined, and this was only defined by arches when they wanted a larger
alignment value.  Otherwise it'd default to pointer alignment.  Arguably,
this is kind of hokey that the FLAT is semi-abusing defines it shouldn't.

So let's merge the two alignment requirements so the floor is never 0.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: David McCullough <davidm@snapgear.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-29 15:29:31 -07:00
Mike Frysinger
3c26c9d959 nommu: add '[stack]' label to /proc/pid/maps output
Add support to the NOMMU /proc/pid/maps file to show which mapping is the stack
of the original thread after execve.  This is largely based on the MMU code.
Subsidiary thread stacks are not indicated.

For FDPIC, we now get:

	root:/> cat /proc/self/maps
	02064000-02067ccc rw-p 0004d000 00:01 22         /bin/busybox
	0206e000-0206f35c rw-p 00006000 00:01 295        /lib/ld-uClibc.so.0
	025f0000-025f6f0c r-xs 00000000 00:01 295        /lib/ld-uClibc.so.0
	02680000-026ba6b0 r-xs 00000000 00:01 297        /lib/libc.so.0
	02700000-0274d384 r-xs 00000000 00:01 22         /bin/busybox
	02816000-02817000 rw-p 00000000 00:00 0
	02848000-0284c0d8 rw-p 00000000 00:00 0
	02860000-02880000 rw-p 00000000 00:00 0          [stack]

The semi-downside here is that for FLAT, we get:

	root:/> cat /proc/155/maps
	029f0000-029f9000 rwxp 00000000 00:00 0          [stack]

The reason being that FLAT combines a whole lot of stuff into one map
(including the stack).  But this isn't any worse than the current output
(which is nothing), so screw it.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-29 15:29:30 -07:00
Theodore Ts'o
90c7201b97 ext4: Pass line number to ext4_journal_abort_handle()
This allows the error messages to include the line number

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 14:53:24 -04:00
Linus Torvalds
984bc9601f Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: Don't count_vm_events for discard bio in submit_bio.
  cfq: fix recursive call in cfq_blkiocg_update_completion_stats()
  cfq-iosched: Fixed boot warning with BLK_CGROUP=y and CFQ_GROUP_IOSCHED=n
  cfq: Don't allow queue merges for queues that have no process references
  block: fix DISCARD_BARRIER requests
  cciss: set SCSI max cmd len to 16, as default is wrong
  cpqarray: fix two more wrong section type
  cpqarray: fix wrong __init type on pci probe function
  drbd: Fixed a race between disk-attach and unexpected state changes
  writeback: fix pin_sb_for_writeback
  writeback: add missing requeue_io in writeback_inodes_wb
  writeback: simplify and split bdi_start_writeback
  writeback: simplify wakeup_flusher_threads
  writeback: fix writeback_inodes_wb from writeback_inodes_sb
  writeback: enforce s_umount locking in writeback_inodes_sb
  writeback: queue work on stack in writeback_inodes_sb
  writeback: fix writeback completion notifications
2010-06-29 10:42:52 -07:00
npiggin@suse.de
57439f878a fs: fix superblock iteration race
list_for_each_entry_safe is not suitable to protect against concurrent
modification of the list. 6754af6 introduced a race in sb walking.

list_for_each_entry can use the trick of pinning the current entry in
the list before we drop and retake the lock because it subsequently
follows cur->next. However list_for_each_entry_safe saves n=cur->next
for following before entering the loop body, so when the lock is
dropped, n may be deleted.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-29 10:38:22 -07:00
Theodore Ts'o
e29136f80e ext4: Enhance ext4_grp_locked_error() to take block and function numbers
Also use a macro definition so that __func__ and __LINE__ is implicit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 12:54:28 -04:00
Sage Weil
443b3760a0 ceph: fix caps usage accounting for import (non-reserved) case
We need to increase the total and used counters when allocating a new cap
in the non-reserved (cap import) case.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-29 09:31:56 -07:00
Sage Weil
ec97f88ba6 ceph: only release clean, unused caps with mds requests
We can drop caps with an mds request.  Ensure we only drop unused AND
clean caps, since the MDS doesn't support cap writeback in that context,
nor do we track it.  If caps are dirty, and the MDS needs them back, we
it will revoke and we will flush in the normal fashion.

This fixes a possibly loss of metadata.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-29 09:31:55 -07:00
Theodore Ts'o
c67d859e39 ext4: clean up ext4_abort() so __func__ is now implicit
Use a macro definition for ext4_abort() to clean up the .c files a wee
bit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 11:07:07 -04:00
Theodore Ts'o
4a9cdec73f ext4: Add new superblock fields reserved for the Next3 snapshot feature
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 11:00:23 -04:00
Thomas Gleixner
f384c954c9 Merge branch 'linus' into perf/core
Reason: Further changes conflict with upstream fixes

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-06-28 22:33:24 +02:00
Matthew Whitworth
e956b4b7e3 Fix comment spelling errors in ncpfs/inode.c
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-06-28 11:56:32 +02:00
Tejun Heo
327f935a9e ocfs2: update gfp/slab.h includes
Implicit slab.h inclusion via percpu.h is about to go away.  Make sure
gfp.h or slab.h is included as necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
2010-06-28 10:19:19 +10:00
Linus Torvalds
e7865c234f Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFSv4: Fix an embarassing typo in encode_attrs()
  NFSv4: Ensure that /proc/self/mountinfo displays the minor version number
  NFSv4.1: Ensure that we initialise the session when following a referral
  SUNRPC: Fix a re-entrancy bug in xs_tcp_read_calldir()
  nfs4 use mandatory attribute file type in nfs4_get_root
2010-06-27 09:04:02 -07:00
Linus Torvalds
55982d9400 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  ext3: update ctime when changing the file's permission by setfacl
  ext2: update ctime when changing the file's permission by setfacl
2010-06-27 07:50:47 -07:00
Linus Torvalds
be1d29f59c Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  MAINTAINERS: change mailing list address for CIFS
  cifs: remove bogus first_time check in NTLMv2 session setup code
  cifs: don't call cifs_new_fileinfo unless cifs_open succeeds
  cifs: don't ignore cifs_posix_open_inode_helper return value
  cifs: clean up arguments to cifs_open_inode_helper
  cifs: pass instantiated filp back after open call
  cifs: move cifs_new_fileinfo call out of cifs_posix_open
  cifs: implement drop_inode superblock op
  cifs: don't attempt busy-file rename unless it's in same directory
2010-06-27 07:34:02 -07:00
Miao Xie
30e2bab2d6 ext3: update ctime when changing the file's permission by setfacl
ext3 didn't update the ctime of the file when its permission was changed.

Steps to reproduce:
 # touch aaa
 # stat -c %Z aaa
 1275289822
 # setfacl -m  'u::x,g::x,o::x' aaa
 # stat -c %Z aaa
 1275289822				<- unchanged

But, according to the spec of the ctime, ext3 must update it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-06-25 01:20:37 +02:00
Jan Kara
523825bc58 ext2: update ctime when changing the file's permission by setfacl
ext2 didn't update the ctime of the file when its permission was changed.

Steps to reproduce:
 # touch aaa
 # stat -c %Z aaa
 1275289822
 # setfacl -m  'u::x,g::x,o::x' aaa
 # stat -c %Z aaa
 1275289822                         <- unchanged

But, according to the spec of the ctime, ext2 must update it.

Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-06-25 01:20:37 +02:00
Sage Weil
a1a31e7342 ceph: fix crush CHOOSE_LEAF when type is already a leaf
We may not recurse for CHOOSE_LEAF if we start with a leaf node.  When
that happens, the out2 vector needs to be filled in with the result.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-24 12:58:14 -07:00
Sage Weil
55bda7aacd ceph: fix crush recursion
There was a longstanding problem with recursion through intervening
bucket types on complex hierarchies.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-24 12:55:48 -07:00
Trond Myklebust
1f0e890dba NFSv4: Clean up struct nfs4_state_owner
The 'so_delegations' list appears to be unused.

Also eliminate so_client. If we already have so_server, we can get to the
nfs_client structure.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-24 15:11:43 -04:00
Yehuda Sadeh
bfaf148eb2 ceph: fix caps debugfs entry
The ceph client structure was not set correctly.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-24 09:47:36 -07:00
J. Bruce Fields
cba9ba4b90 nfsd4: fix delegation recall race use-after-free
When the rarely-used callback-connection-changing setclientid occurs
simultaneously with a delegation recall, we rerun the recall by
requeueing it on a workqueue.  But we also need to take a reference on
the delegation in that case, since the delegation held by the rpc itself
will be released by the rpc_release callback.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-06-24 12:24:55 -04:00
J. Bruce Fields
ac94bf5825 nfsd4: fix deleg leak on callback error
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-06-24 12:24:53 -04:00
Dave Chinner
7b6259e7a8 xfs: remove block number from inode lookup code
The block number comes from bulkstat based inode lookups to shortcut
the mapping calculations. We ar enot able to trust anything from
bulkstat, so drop the block number as well so that the correct
lookups and mappings are always done.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-06-24 11:35:17 +10:00
Dave Chinner
1920779e67 xfs: rename XFS_IGET_BULKSTAT to XFS_IGET_UNTRUSTED
Inode numbers may come from somewhere external to the filesystem
(e.g. file handles, bulkstat information) and so are inherently
untrusted. Rename the flag we use for these lookups to make it
obvious we are doing a lookup of an untrusted inode number and need
to verify it completely before trying to read it from disk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-06-24 11:15:47 +10:00
Dave Chinner
7124fe0a5b xfs: validate untrusted inode numbers during lookup
When we decode a handle or do a bulkstat lookup, we are using an
inode number we cannot trust to be valid. If we are deleting inode
chunks from disk (default noikeep mode), then we cannot trust the on
disk inode buffer for any given inode number to correctly reflect
whether the inode has been unlinked as the di_mode nor the
generation number may have been updated on disk.

This is due to the fact that when we delete an inode chunk, we do
not write the clusters back to disk when they are removed - instead
we mark them stale to avoid them being written back potentially over
the top of something that has been subsequently allocated at that
location. The result is that we can have locations of disk that look
like they contain valid inodes but in reality do not. Hence we
cannot simply convert the inode number to a block number and read
the location from disk to determine if the inode is valid or not.

As a result, and XFS_IGET_BULKSTAT lookup needs to actually look the
inode up in the inode allocation btree to determine if the inode
number is valid or not.

It should be noted even on ikeep filesystems, there is the
possibility that blocks on disk may look like valid inode clusters.
e.g. if there are filesystem images hosted on the filesystem. Hence
even for ikeep filesystems we really need to validate that the inode
number is valid before issuing the inode buffer read.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-06-24 11:15:33 +10:00
Christoph Hellwig
7dce11dbac xfs: always use iget in bulkstat
The non-coherent bulkstat versionsthat look directly at the inode
buffers causes various problems with performance optimizations that
make increased use of just logging inodes.  This patch makes bulkstat
always use iget, which should be fast enough for normal use with the
radix-tree based inode cache introduced a while ago.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-06-23 18:11:11 +10:00
Dan Rosenberg
1817176a86 xfs: prevent swapext from operating on write-only files
This patch prevents user "foo" from using the SWAPEXT ioctl to swap
a write-only file owned by user "bar" into a file owned by "foo" and
subsequently reading it.  It does so by checking that the file
descriptors passed to the ioctl are also opened for reading.

Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-06-24 12:07:47 +10:00
J. Bruce Fields
ec8acac84a nfsd4: remove some debugging code
This is overkill.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-06-22 22:29:03 -04:00
Benny Halevy
9303bbd3de nfsd: nfs4callback encode_stateid helper function
To be used also for the pnfs cb_layoutrecall callback

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd4: fix cb_recall encoding]
    "nfsd: nfs4callback encode_stateid helper function" forgot to reserve
    more space after return from the new helper.
Reported-by: Michael Groshans <groshans@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-06-22 17:19:51 -04:00
J. Bruce Fields
4731030d58 nfsd4: translate memory errors to delay, not serverfault
If the server is out of memory is better for clients to back off and
retry than to just error out.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-06-22 17:19:36 -04:00
J. Bruce Fields
76407f76e0 nfsd4; fix session reference count leak
Note the session has to be put() here regardless of what happens to the
client.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-06-22 17:19:28 -04:00
Trond Myklebust
1055d76d91 NFSv4.1: There is no need to init the session more than once...
Set up a flag to ensure that is indeed the case.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:03 -04:00
Trond Myklebust
fe74ba3a8d NFSv41: Cleanup for nfs4_alloc_session.
There is no reason to change the nfs_client state every time we allocate a
new session. Move that line into nfs4_init_client_minor_version.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:03 -04:00
Trond Myklebust
d77d76ffb6 NFSv41: Clean up exclusive create
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:03 -04:00
Trond Myklebust
a443234535 NFSv41: Deprecate nfs_client->cl_minorversion
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:02 -04:00
Trond Myklebust
e047a10c12 NFSv41: Fix nfs_async_inode_return_delegation() ugliness
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:02 -04:00
Trond Myklebust
c48f4f3541 NFSv41: Convert the various reboot recovery ops etc to minor version ops
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:02 -04:00
Trond Myklebust
97dc135947 NFSv41: Clean up the NFSv4.1 minor version specific operations
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:02 -04:00
Trond Myklebust
a2118c33aa NFSv41: Don't store session state in the nfs_client->cl_state
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:02 -04:00
Trond Myklebust
df8964554a NFSv41: Further cleanup for nfs4_sequence_done
Instead of testing if the nfs_client has a session, we should be testing if
the struct nfs4_sequence_res was set up with one.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:02 -04:00
Trond Myklebust
035168ab39 NFSv4.1: Make nfs4_setup_sequence take a nfs_server argument
In anticipation of the day when we have per-filesystem sessions, and also
in order to allow the session to change in the event of a filesystem
migration event.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:02 -04:00
Trond Myklebust
71ac6da994 NFSv4.1: Merge the nfs41_proc_async_sequence() and nfs4_proc_sequence()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:01 -04:00
Trond Myklebust
aa5190d0ed NFSv4: Kill nfs4_async_handle_error() abuses by NFSv4.1
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:01 -04:00
Trond Myklebust
d185a334c7 NFSv4.1: Simplify nfs41_sequence_done()
Nobody uses the rpc_status parameter.

It is not obvious why we need the struct nfs_client argument either, when
we already have that information in the session.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:01 -04:00
Trond Myklebust
2a6e26cdb8 NFSv4.1: Clean up nfs4_setup_sequence
Firstly, there is little point in first zeroing out the entire struct
nfs4_sequence_res, and then initialising all fields save one. Just
initialise the last field to zero...

Secondly, nfs41_setup_sequence() has only 2 possible return values: 0, or
-EAGAIN, so there is no 'terminate rpc task' case.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:01 -04:00
Trond Myklebust
d5f8d3fe72 NFSv41: Fix a memory leak in nfs41_proc_async_sequence()
If the call to rpc_call_async() fails, then the arguments will not be
freed, since there will be no call to nfs41_sequence_call_done

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:24:01 -04:00
Trond Myklebust
d3f6baaa34 NFSv4: Fix an embarassing typo in encode_attrs()
Apparently, we have never been able to set the atime correctly from the
NFSv4 client.

Reported-by: 小倉一夫 <ka-ogura@bd6.so-net.ne.jp>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-06-22 13:22:54 -04:00
Trond Myklebust
0be8189f2c NFSv4: Ensure that /proc/self/mountinfo displays the minor version number
Currently, we do not display the minor version mount parameter in the
/proc mount info.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-06-22 13:22:53 -04:00
Trond Myklebust
44950b67a6 NFSv4.1: Ensure that we initialise the session when following a referral
Put the code that is common to both the referral and ordinary mount cases
into a common helper routine.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-06-22 13:22:53 -04:00
Andy Adamson
f799bdb355 nfs4 use mandatory attribute file type in nfs4_get_root
S_ISDIR(fsinfo.fattr->mode) checks the file type rather than the mode bits,
so we should be checking for the NFS_ATTR_FATTR_TYPE fattr property.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-06-22 13:17:43 -04:00
Sage Weil
17c688c3df ceph: delay umount until all mds requests drop inode+dentry refs
This fixes a race between handle_reply finishing an mds request, signalling
completion, and then dropping the request structing and its dentry+inode
refs, and pre_umount function waiting for requests to finish before
letting the vfs tear down the dcache.  If umount was delayed waiting for
mds requests, we could race and BUG in shrink_dcache_for_umount_subtree
because of a slow dput.

This delays umount until the msgr queue flushes, which means handle_reply
will exit and will have dropped the ceph_mds_request struct.  I'm assuming
the VFS has already ensured that its calls have all completed and those
request refs have thus been dropped as well (I haven't seen that race, at
least).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-21 16:11:50 -07:00
Sage Weil
d69ed05a80 ceph: handle splice_dentry/d_materialize_unique error in readdir_prepopulate
Handle a splice_dentry failure (due to a d_materialize_unique error)
without crashing.  (Also, report the error code.)

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-21 16:04:10 -07:00
Ingo Molnar
646b1db495 Merge commit 'v2.6.35-rc3' into perf/core
Merge reason: Go from -rc1 base to -rc3 base, merge in fixes.
2010-06-18 10:53:19 +02:00
Sage Weil
cebc5be6b6 ceph: fix crush map update decoding
If the incremental osdmap has a new crush map, advance the position after
decoding so that we can parse the rest of the osdmap properly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-17 10:22:48 -07:00
Jeff Layton
8a224d4894 cifs: remove bogus first_time check in NTLMv2 session setup code
This bug appears to be the result of a cut-and-paste mistake from the
NTLMv1 code. The function to generate the MAC key was commented out, but
not the conditional above it. The conditional then ended up causing the
session setup key not to be copied to the buffer unless this was the
first session on the socket, and that made all but the first NTLMv2
session setup fail.

Fix this by removing the conditional and all of the commented clutter
that made it difficult to see.

Cc: Stable <stable@kernel.org>
Reported-by: Gunther Deschner <gdeschne@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2010-06-16 13:40:18 -04:00
Jeff Layton
47c78b7f40 cifs: don't call cifs_new_fileinfo unless cifs_open succeeds
It's currently possible for cifs_open to fail after it has already
called cifs_new_fileinfo. In that situation, the new fileinfo will be
leaked as the caller doesn't call fput. That in turn leads to a busy
inodes after umount problem since the fileinfo holds an extra inode
reference now. Shuffle cifs_open around a bit so that it only calls
cifs_new_fileinfo if it's going to succeed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
2010-06-16 13:40:17 -04:00
Suresh Jayaraman
d9d5d8df95 cifs: don't ignore cifs_posix_open_inode_helper return value
...and ensure that we propagate the error back to avoid any surprises.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
2010-06-16 13:40:17 -04:00
Jeff Layton
db460242bf cifs: clean up arguments to cifs_open_inode_helper
...which takes a ton of unneeded arguments and does a lot more pointer
dereferencing than is really needed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
2010-06-16 13:40:17 -04:00
Jeff Layton
6ca9f3bae8 cifs: pass instantiated filp back after open call
The current scheme of sticking open files on a list and assuming that
cifs_open will scoop them off of it is broken and leads to "Busy
inodes after umount..." errors at unmount time.

The problem is that there is no guarantee that cifs_open will always
be called after a ->lookup or ->create operation. If there are
permissions or other problems, then it's quite likely that it *won't*
be called.

Fix this by fully instantiating the filp whenever the file is created
and pass that filp back to the VFS. If there is a problem, the VFS
can clean up the references.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
2010-06-16 13:40:16 -04:00
Jeff Layton
2422f676fb cifs: move cifs_new_fileinfo call out of cifs_posix_open
Having cifs_posix_open call cifs_new_fileinfo is problematic and
inconsistent with how "regular" opens work. It's also buggy as
cifs_reopen_file calls this function on a reconnect, which creates a new
struct cifsFileInfo that just gets leaked.

Push it out into the callers. This also allows us to get rid of the
"mnt" arg to cifs_posix_open.

Finally, in the event that a cifsFileInfo isn't or can't be created, we
always want to close the filehandle out on the server as the client
won't have a record of the filehandle and can't actually use it. Make
sure that CIFSSMBClose is called in those cases.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
2010-06-16 13:40:16 -04:00
Jiri Kosina
f1bbbb6912 Merge branch 'master' into for-next 2010-06-16 18:08:13 +02:00
Uwe Kleine-König
421f91d21a fix typos concerning "initiali[zs]e"
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-06-16 18:05:05 +02:00
Steve French
0933a95dfd Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2010-06-16 13:19:36 +00:00
Tao Ma
1739da4054 ocfs2: Limit default local alloc size within bitmap range.
In commit 6b82021b9e, we increase
our local alloc size and calculate how much megabytes we can
get according to group size and volume size.
But we also need to check the maximum bits a local alloc block
bitmap can have. With a bs=512, cs=32K, local volume with 160G,
it calculate 96MB while the maximum local alloc size is only
76M. So the bitmap will overflow and corrupt the system truncate
log file. See bug
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1262

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-06-15 16:50:43 -07:00
Tao Ma
40f165f416 ocfs2: Move orphan scan work to ocfs2_wq.
We used to let orphan scan work in the default work queue,
but there is a corner case which will make the system deadlock.
The scenario is like this:
1. set heartbeat threadshold to 200. this will allow us to have a
   great chance to have a orphan scan work before our quorum decision.
2. mount node 1.
3. after 1~2 minutes, mount node 2(in order to make the bug easier
   to reproduce, better add maxcpus=1 to kernel command line).
4. node 1 do orphan scan work.
5. node 2 do orphan scan work.
6. node 1 do orphan scan work. After this, node 1 hold the orphan scan
   lock while node 2 know node 1 is the master.
7. ifdown eth2 in node 2(eth2 is what we do ocfs2 interconnection).

Now when node 2 begins orphan scan, the system queue is blocked.

The root cause is that both orphan scan work and quorum decision work
will use the system event work queue. orphan scan has a chance of
blocking the event work queue(in dlm_wait_for_node_death) so that there
is no chance for quorum decision work to proceed.

This patch resolve it by moving orphan scan work to ocfs2_wq.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-06-15 15:43:48 -07:00
Julia Lawall
6469272c35 fs/ocfs2/dlm: Add missing spin_unlock
Add a spin_unlock missing on the error path.  Unlock as in the other code
that leads to the leave label.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression E1;
@@

* spin_lock(E1,...);
  <+... when != E1
  if (...) {
    ... when != E1
*   return ...;
  }
  ...+>
* spin_unlock(E1,...);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-06-15 15:43:46 -07:00
Jan Kara
c6ac12a615 ext4: update ctime when changing the file's permission by setfacl
ext4 didn't update the ctime of the file when its permission was
changed.

Steps to reproduce:
 # touch aaa
 # stat -c %Z aaa
 1275289822
 # setfacl -m  'u::x,g::x,o::x' aaa
 # stat -c %Z aaa
 1275289822                         <- unchanged

But, according to the spec of the ctime, ext4 must update it.

Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-15 12:19:59 -04:00
Paul E. McKenney
b97181f242 fs: remove all rcu head initializations, except on_stack initializations
Remove all rcu head inits. We don't care about the RCU head state before passing
it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can
keep track of objects on stack.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andries Brouwer <aeb@cwi.nl>
2010-06-14 16:37:26 -07:00
Christoph Hellwig
206f7ab4f4 ext4: remove vestiges of nobh support
The nobh option was only supported for writeback mode, but given that all
write paths actually create buffer heads it effectively was a no-op already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-14 14:42:49 -04:00
Andi Kleen
5a0790c2c4 ext4: remove initialized but not read variables
No real bugs found, just removed some dead code.

Found by gcc 4.6's new warnings.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-14 13:28:03 -04:00
Theodore Ts'o
07a038245b ext4: Convert more i_flags references to use accessor functions
These changes are not ones which are likely to result in races, but
they should be fixed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-14 09:54:48 -04:00
Jens Axboe
575f552012 Merge branch 'for-jens' of git://git.drbd.org/linux-2.6-drbd into for-linus 2010-06-14 12:54:57 +02:00
Michael Ellerman
9f069af5b6 of: Drop properties with "/" in their name
Some bogus firmwares include properties with "/" in their name. This
causes problems when creating the /proc/device-tree file system,
because the slash is taken to indicate a directory.

We don't care about those properties, and we don't want to encourage
them, so just throw them away when creating /proc/device-tree.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-06-13 18:12:24 -06:00
Sage Weil
ae32be3134 ceph: fix message memory leak, uninitialized variable
We need to properly initialize skip, as not all alloc_msg op instances
set it.

Also, BUG if someone says skip but also allocates a message.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-13 10:34:36 -07:00
Sage Weil
4a32f93d29 ceph: fix map handler error path
Don't leak message if we receive an unexpected message type.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-13 10:34:36 -07:00
Yehuda Sadeh
0cf5537b15 ceph: some endianity fixes
Fix some problems that came up with sparse.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-06-13 10:34:36 -07:00
Julia Lawall
6da5156fad UBIFS: use ERR_CAST
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)).  The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-06-12 14:46:15 +03:00
Artem Bityutskiy
276de5d2a1 UBIFS: check return code
The error code from 'ubifs_rcvry_gc_commit()' was ignored, so UBIFS
failed to recover and continued. Instead, we should refuse mounting
the file-system.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-06-12 14:45:25 +03:00
Theodore Ts'o
a0375156ca ext4: Clean up s_dirt handling
We don't need to set s_dirt in most of the ext4 code when journaling
is enabled.  In ext3/4 some of the summary statistics for # of free
inodes, blocks, and directories are calculated from the per-block
group statistics when the file system is mounted or unmounted.  As a
result the superblock doesn't have to be updated, either via the
journal or by setting s_dirt.  There are a few exceptions, most
notably when resizing the file system, where the superblock needs to
be modified --- and in that case it should be done as a journalled
operation if possible, and s_dirt set only in no-journal mode.

This patch will optimize out some unneeded disk writes when using ext4
with a journal.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-11 23:14:04 -04:00
Jeff Layton
12420ac341 cifs: implement drop_inode superblock op
The standard behavior for drop_inode is to delete the inode when the
last reference to it is put and the nlink count goes to 0. This helps
keep inodes that are still considered "not deleted" in cache as long as
possible even when there aren't dentries attached to them.

When server inode numbers are disabled, it's not possible for cifs_iget
to ever match an existing inode (since inode numbers are generated via
iunique). In this situation, cifs can keep a lot of inodes in cache that
will never be used again.

Implement a drop_inode routine that deletes the inode if server inode
numbers are disabled on the mount. This helps keep the cifs inode
caches down to a more manageable size when server inode numbers are
disabled.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-06-12 02:06:52 +00:00
Jeff Layton
ed0e3ace57 cifs: don't attempt busy-file rename unless it's in same directory
Busy-file renames don't actually work across directories, so we need
to limit this code to renames within the same dir.

This fixes the bug detailed here:

    https://bugzilla.redhat.com/show_bug.cgi?id=591938

Signed-off-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-06-12 01:45:36 +00:00
Linus Torvalds
b25b550bb1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: The file argument for fsync() is never null
  Btrfs: handle ERR_PTR from posix_acl_from_xattr()
  Btrfs: avoid BUG when dropping root and reference in same transaction
  Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
  Btrfs: should add a permission check for setfacl
  Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
  Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs
  Btrfs: unwind after btrfs_start_transaction() errors
  Btrfs: btrfs_iget() returns ERR_PTR
  Btrfs: handle kzalloc() failure in open_ctree()
  Btrfs: handle error returns from btrfs_lookup_dir_item()
  Btrfs: Fix BUG_ON for fs converted from extN
  Btrfs: Fix null dereference in relocation.c
  Btrfs: fix remap_file_pages error
  Btrfs: uninitialized data is check_path_shared()
  Btrfs: fix fallocate regression
  Btrfs: fix loop device on top of btrfs
2010-06-11 14:18:47 -07:00
Dan Carpenter
6f902af400 Btrfs: The file argument for fsync() is never null
The "file" argument for fsync is never null so we can remove this check.

What drew my attention here is that 7ea8085910: "drop unused dentry
argument to ->fsync" introduced an unconditional dereference at the
start of the function and that generated a smatch warning.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:40 -04:00
Dan Carpenter
834e74759a Btrfs: handle ERR_PTR from posix_acl_from_xattr()
posix_acl_from_xattr() returns both ERR_PTRs and null, but it's OK to
pass null values to set_cached_acl()

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:39 -04:00
Sage Weil
15e7000095 Btrfs: avoid BUG when dropping root and reference in same transaction
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes
with end_transaction(), the cleaner kthread may come in and
drop the root in the same transaction.  If that's the case, the
root's refs still == 1 in the tree when btrfs_del_root() deletes
the item, because commit_fs_roots() hasn't updated it yet (that
happens during the commit).

This wasn't a problem before only because
btrfs_ioctl_snap_destroy() would commit the transaction before dropping
the dentry reference, so the dead root wouldn't get queued up until
after the fs root item was updated in the btree.

Since it is not an error to drop the root reference and the root in the
same transaction, just drop the BUG_ON() in btrfs_del_root().

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:39 -04:00
Shi Weihua
731e3d1b43 Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
when used Posix File System Test Suite(pjd-fstest) to test btrfs,
some cases about setfacl failed when noacl mount option used.
I simplified used commands in pjd-fstest, and the following steps
can reproduce it.
------------------------
# cd btrfs-part/
# mkdir aaa
# setfacl -m m::rw aaa    <- successed, but not expected by pjd-fstest.
------------------------
I checked ext3, a warning message occured, like as:
  setfacl: aaa/: Operation not supported
Certainly, it's expected by pjd-fstest.

So, i compared acl.c of btrfs and ext3. Based on that, a patch created.
Fortunately, it works.

Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:38 -04:00