Changes since 2.5.0:

---
[recommended]

New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(),
	sb_set_blocksize() and sb_min_blocksize().

Use them.

(sb_find_get_block() replaces 2.4's get_hash_table())

---
[recommended]

New methods: ->alloc_inode() and ->destroy_inode().

Remove inode->u.foo_inode_i
Declare
	struct foo_inode_info {
		/* fs-private stuff */
		struct inode vfs_inode;
	};
	static inline struct foo_inode_info *FOO_I(struct inode *inode)
	{
		return list_entry(inode, struct foo_inode_info, vfs_inode);
	}

Use FOO_I(inode) instead of &inode->u.foo_inode_i;

Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate
foo_inode_info and return the address of ->vfs_inode, the latter should free
FOO_I(inode) (see in-tree filesystems for examples).

Make them ->alloc_inode and ->destroy_inode in your super_operations.

Keep in mind that now you need explicit initialization of private data
typically between calling iget_locked() and unlocking the inode.

At some point that will become mandatory.
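
The embedding scheme above can be tried out in userspace. The following is
only a sketch under stated assumptions: struct inode is a one-field stand-in
for the kernel structure, foo_private is a made-up field, the allocator is
calloc() rather than a slab cache, and foo_alloc_inode() takes no superblock
argument, unlike the real ->alloc_inode() method.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct inode; one field is enough here. */
struct inode {
	unsigned long i_ino;
};

/* Hypothetical per-filesystem inode with the generic inode embedded. */
struct foo_inode_info {
	int foo_private;		/* fs-private stuff (made up) */
	struct inode vfs_inode;
};

/* Recover the containing foo_inode_info from a pointer to its vfs_inode. */
static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
	return (struct foo_inode_info *)
		((char *)inode - offsetof(struct foo_inode_info, vfs_inode));
}

/* Allocate the container and hand back the address of ->vfs_inode. */
static struct inode *foo_alloc_inode(void)
{
	struct foo_inode_info *fi = calloc(1, sizeof(*fi));

	if (!fi)
		return NULL;
	return &fi->vfs_inode;
}

/* Free the whole container, not just the embedded inode. */
static void foo_destroy_inode(struct inode *inode)
{
	assert(inode != NULL);
	free(FOO_I(inode));
}
```

The point of the layout is that the vfs only ever sees a struct inode
pointer, while the filesystem can get back to its private data with pointer
arithmetic and no extra lookup.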

---
[mandatory]

Change of file_system_type method (->read_super to ->get_sb)

->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.

Turn your foo_read_super() into a function that would return 0 in case of
success and a negative number in case of error (-EINVAL unless you have a more
informative error value to report). Call it foo_fill_super(). Now declare

int foo_get_sb(struct file_system_type *fs_type,
	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
	return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
			   mnt);
}

(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
filesystem).

Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as
foo_get_sb.

---
[mandatory]

Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames.
Most likely there is no need to change anything, but if you relied on
global exclusion between renames for some internal purpose - you need to
change your internal locking. Otherwise exclusion guarantees remain the
same (i.e. parents and victim are locked, etc.).

---
[informational]

Now we have the exclusion between ->lookup() and directory removal (by
->rmdir() and ->rename()). If you used to need that exclusion and did
it by internal locking (most filesystems couldn't care less) - you
can relax your locking.

---
[mandatory]

->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename()
and ->readdir() are called without BKL now. Grab it on entry, drop upon
return - that will guarantee the same locking you used to have. If your
method or its parts do not need BKL - better yet, now you can shift
lock_kernel() and unlock_kernel() so that they would protect exactly what
needs to be protected.

---
[mandatory]

BKL is also moved from around sb operations. ->write_super() is now called
without BKL held. BKL should have been shifted into individual fs sb_op
functions. If you don't need it, remove it.

---
[informational]

check for ->link() target not being a directory is done by callers. Feel
free to drop it...

---
[informational]

->link() callers hold ->i_mutex on the object we are linking to. Some of your
problems might be over...

---
[mandatory]

new file_system_type method - kill_sb(superblock). If you are converting
an existing filesystem, set it according to ->fs_flags:
	FS_REQUIRES_DEV		-	kill_block_super
	FS_LITTER		-	kill_litter_super
	neither			-	kill_anon_super
FS_LITTER is gone - just remove it from fs_flags.

---
[mandatory]

FS_SINGLE is gone (actually, that had happened back when ->get_sb()
went in - and hadn't been documented ;-/). Just remove it from fs_flags
(and see ->get_sb() entry for other actions).

---
[mandatory]

->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so
watch for ->i_mutex-grabbing code that might be used by your ->setattr().
Callers of notify_change() need ->i_mutex now.

---
[recommended]

New super_block field "struct export_operations *s_export_op" for
explicit support for exporting, e.g. via NFS. The structure is fully
documented at its declaration in include/linux/fs.h, and in
Documentation/filesystems/nfs/Exporting.

Briefly it allows for the definition of decode_fh and encode_fh operations
to encode and decode filehandles, and allows the filesystem to use
a standard helper function for decode_fh, and provide file-system specific
support for this helper, particularly get_parent.

It is planned that this will be required for exporting once the code
settles down a bit.

[mandatory]

s_export_op is now required for exporting a filesystem.
isofs, ext2, ext3, reiserfs, fat
can be used as examples of very different filesystems.

---
[mandatory]

iget4() and the read_inode2 callback have been superseded by iget5_locked()
which has the following prototype:

	struct inode *iget5_locked(struct super_block *sb, unsigned long ino,
				int (*test)(struct inode *, void *),
				int (*set)(struct inode *, void *),
				void *data);

'test' is an additional function that can be used when the inode
number is not sufficient to identify the actual file object. 'set'
should be a non-blocking function that initializes those parts of a
newly created inode to allow the test function to succeed. 'data' is
passed as an opaque value to both test and set functions.

When the inode has been created by iget5_locked(), it will be returned with the
I_NEW flag set and will still be locked. The filesystem then needs to finalize
the initialization. Once the inode is initialized it must be unlocked by
calling unlock_new_inode().

The filesystem is responsible for setting (and possibly testing) i_ino
when appropriate. There is also a simpler iget_locked function that
just takes the superblock and inode number as arguments and does the
test and set for you.

e.g.
	inode = iget_locked(sb, ino);
	if (!inode)
		return -ENOMEM;	/* iget_locked() returns NULL on allocation failure */
	if (inode->i_state & I_NEW) {
		err = read_inode_from_disk(inode);
		if (err < 0) {
			iget_failed(inode);
			return err;
		}
		unlock_new_inode(inode);
	}

Note that if the process of setting up a new inode fails, then iget_failed()
should be called on the inode to render it dead, and an appropriate error
should be passed back to the caller.

---
[recommended]

->getattr() finally getting used. See instances in nfs, minix, etc.

---
[mandatory]

->revalidate() is gone. If your filesystem had it - provide ->getattr()
and let it call whatever you had as ->revalidate() + (for symlinks that
had ->revalidate()) add calls in ->follow_link()/->readlink().

---
[mandatory]

->d_parent changes are not protected by BKL anymore. Read access is safe
if at least one of the following is true:
	* filesystem has no cross-directory rename()
	* we know that parent had been locked (e.g. we are looking at
	  ->d_parent of ->lookup() argument).
	* we are called from ->rename().
	* the child's ->d_lock is held
Audit your code and add locking if needed. Notice that any place that is
not protected by the conditions above is risky even in the old tree - you
had been relying on BKL and that's prone to screwups. Old tree had quite
a few holes of that kind - unprotected access to ->d_parent leading to
anything from oops to silent memory corruption.

---
[mandatory]

FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags
(see rootfs for one kind of solution and bdev/socket/pipe for another).

---
[recommended]

Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter
is still alive, but only because of the mess in drivers/s390/block/dasd.c.
As soon as it gets fixed is_read_only() will die.

---
[mandatory]

->permission() is called without BKL now. Grab it on entry, drop upon
return - that will guarantee the same locking you used to have. If
your method or its parts do not need BKL - better yet, now you can
shift lock_kernel() and unlock_kernel() so that they would protect
exactly what needs to be protected.

---
[mandatory]

->statfs() is now called without BKL held. BKL should have been
shifted into individual fs sb_op functions where it's not clear that
it's safe to remove it. If you don't need it, remove it.

---
[mandatory]

is_read_only() is gone; use bdev_read_only() instead.

---
[mandatory]

destroy_buffers() is gone; use invalidate_bdev().

---
[mandatory]

fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is
deliberate; as soon as struct block_device * is propagated in a reasonable
way by that code fixing will become trivial; until then nothing can be
done.

[mandatory]

block truncation on error exit from ->write_begin, and ->direct_IO
moved from generic methods (block_write_begin, cont_write_begin,
nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at
ext2_write_failed and callers for an example.

[mandatory]

->truncate is going away. The whole truncate sequence needs to be
implemented in ->setattr, which is now mandatory for filesystems
implementing on-disk size changes. Start with a copy of the old inode_setattr
and vmtruncate, and then reorder the vmtruncate + foofs_vmtruncate sequence to
be in order of zeroing blocks using block_truncate_page or similar helpers,
size update and finally on-disk truncation, which should not fail.
inode_change_ok now includes the size checks for ATTR_SIZE and must be called
in the beginning of ->setattr unconditionally.

[mandatory]

	->clear_inode() and ->delete_inode() are gone; ->evict_inode() should
be used instead. It gets called whenever the inode is evicted, whether it has
remaining links or not. Caller does *not* evict the pagecache or inode-associated
metadata buffers; getting rid of those is responsibility of method, as it had
been for ->delete_inode().
	->drop_inode() returns int now; it's called on final iput() with inode_lock
held and it returns true if the filesystem wants the inode to be dropped. As before,
generic_drop_inode() is still the default and it's been updated appropriately.
generic_delete_inode() is also alive and it consists simply of return 1. Note that
all actual eviction work is done by caller after ->drop_inode() returns.
	clear_inode() is gone; use end_writeback() instead. As before, it must
be called exactly once on each call of ->evict_inode() (as it used to be for
each call of ->delete_inode()). Unlike before, if you are using inode-associated
metadata buffers (i.e. mark_buffer_dirty_inode()), it's your responsibility to
call invalidate_inode_buffers() before end_writeback().
	No async writeback (and thus no calls of ->write_inode()) will happen
after end_writeback() returns, so actions that should not overlap with ->write_inode()
(e.g. freeing on-disk inode if i_nlink is 0) ought to be done after that call.

	NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out
if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput()
may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
free the on-disk inode, you may end up doing that while ->write_inode() is writing
to it.
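
The ->drop_inode() contract described above (return true to have the inode
dropped, leave the eviction itself to the caller) can be illustrated with a
small userspace sketch. Everything here is an assumption made for
illustration: struct inode is a cut-down stand-in, foo_nocache is a made-up
fs-private hint, and foo_always_drop() merely mirrors what the text says
generic_delete_inode() does.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct inode with only the fields used here. */
struct inode {
	unsigned int i_nlink;	/* number of remaining hard links */
	int foo_nocache;	/* hypothetical "do not keep cached" hint */
};

/* Same shape as the text gives for generic_delete_inode(): always drop. */
static int foo_always_drop(struct inode *inode)
{
	(void)inode;
	return 1;
}

/*
 * Hypothetical ->drop_inode(): it only answers "should this inode go?".
 * No eviction work is done here; the caller does that afterwards.
 */
static int foo_drop_inode(struct inode *inode)
{
	assert(inode != NULL);
	return inode->i_nlink == 0 || inode->foo_nocache;
}
```

Keeping the method a pure predicate fits the rule that all actual eviction
work happens in the caller after ->drop_inode() returns.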

---
[mandatory]

.d_delete() now only advises the dcache as to whether or not to cache
unreferenced dentries, and is now only called when the dentry refcount goes to
0. Even on 0 refcount transition, it must be able to tolerate being called 0,
1, or more times (e.g. constant, idempotent).

---
[mandatory]

.d_compare() calling convention and locking rules are significantly
changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
look at examples of other filesystems) for guidance.

---
[mandatory]

.d_hash() calling convention and locking rules are significantly
changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
look at examples of other filesystems) for guidance.

---
[mandatory]

dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c
for details of what locks to replace dcache_lock with in order to protect
particular things. Most of the time, a filesystem only needs ->d_lock, which
protects *all* the dcache state of a given dentry.

---
[mandatory]

Filesystems must RCU-free their inodes, if they can have been accessed
via rcu-walk path walk (basically, if the file can have had a path name in the
vfs namespace).

i_dentry and i_rcu share storage in a union, and the vfs expects
i_dentry to be reinitialized before it is freed, so an:

	INIT_LIST_HEAD(&inode->i_dentry);

must be done in the RCU callback.

---
[recommended]

vfs now tries to do path walking in "rcu-walk mode", which avoids
atomic operations and scalability hazards on dentries and inodes (see
Documentation/filesystems/path-lookup.txt). d_hash and d_compare changes
(above) are examples of the changes required to support this. For more complex
filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so
no changes are required to the filesystem. However, this is costly and loses
the benefits of rcu-walk mode. We will begin to add filesystem callbacks that
are rcu-walk aware, shown below. Filesystems should take advantage of this
where possible.

---
[mandatory]

d_revalidate is a callback that is made on every path element (if
the filesystem provides it), which used to require dropping out of rcu-walk
mode. It may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD
should be returned if the filesystem cannot handle rcu-walk. See
Documentation/filesystems/vfs.txt for more details.
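
As an illustration of the d_revalidate rule, here is a userspace sketch of a
callback that refuses to run in rcu-walk mode. It rests on several
assumptions: the two structures are minimal stand-ins for the kernel ones,
still_valid is a made-up field, and LOOKUP_RCU is given a placeholder value
so the sketch builds on its own (the real definition lives in
include/linux/namei.h).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder value for this sketch only. */
#define LOOKUP_RCU 0x0040

/* Minimal stand-ins for the structures the callback receives. */
struct nameidata {
	unsigned int flags;
};

struct dentry {
	int still_valid;	/* made-up field: result of the "real" check */
};

/* Hypothetical ->d_revalidate() that cannot validate under rcu-walk. */
static int foo_d_revalidate(struct dentry *dentry, struct nameidata *nd)
{
	assert(dentry != NULL && nd != NULL);

	/* Cannot handle rcu-walk: tell the vfs so before touching anything. */
	if (nd->flags & LOOKUP_RCU)
		return -ECHILD;

	/* Not in rcu-walk mode: the usual (possibly blocking) check goes here. */
	return dentry->still_valid ? 1 : 0;
}
```

Doing the flag test first keeps everything after it free to block or take
references, at the cost of losing the rcu-walk fast path for this filesystem.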

permission and check_acl are inode permission checks that are called
on many or all directory inodes on the way down a path walk (to check for
exec permission). These must now be rcu-walk aware (flags & IPERM_FLAG_RCU).
See Documentation/filesystems/vfs.txt for more details.

---
[mandatory]

In ->fallocate() you must check the mode option passed in. If your
filesystem does not support hole punching (deallocating space in the middle of a
file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode.
Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set,
so the i_size should not change when hole punching, even when punching the end of
a file off.
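
A mode check of the kind required above might look like the following
userspace sketch. The helper name foo_fallocate_check_mode() and the
FOO_FALLOC_SUPPORTED mask are hypothetical; the two flags are redefined (with
the values used by include/linux/falloc.h) only so that the sketch builds
standalone.

```c
#include <assert.h>
#include <errno.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE	0x01
#endif
#ifndef FALLOC_FL_PUNCH_HOLE
#define FALLOC_FL_PUNCH_HOLE	0x02
#endif

/* Hypothetical: this filesystem supports preallocation but not punching. */
#define FOO_FALLOC_SUPPORTED	FALLOC_FL_KEEP_SIZE

/*
 * Reject every mode bit the filesystem does not implement. That covers
 * FALLOC_FL_PUNCH_HOLE and any flag that might be added later.
 */
static int foo_fallocate_check_mode(int mode)
{
	if (mode & ~FOO_FALLOC_SUPPORTED)
		return -EOPNOTSUPP;
	return 0;
}
```

Testing against a mask of supported bits, rather than for
FALLOC_FL_PUNCH_HOLE alone, is a design choice of this sketch: it fails safe
when new mode flags appear.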