1
Commit Graph

1513 Commits

Author SHA1 Message Date
Miklos Szeredi
6a567e920f fuse: ioctl: translate ENOSYS in outarg
Fuse shouldn't return ENOSYS from its ioctl implementation. If userspace
responds with ENOSYS it should be translated to ENOTTY.

There are two ways to return an error from the IOCTL request:

 - fuse_out_header.error
 - fuse_ioctl_out.result

Commit 02c0cab8e7 ("fuse: ioctl: translate ENOSYS") already fixed this
issue for the first case, but missed the second case.  This patch fixes the
second case.

Reported-by: Jonathan Katz <jkatz@eitmlabs.org>
Closes: https://lore.kernel.org/all/CALKgVmcC1VUV_gJVq70n--omMJZUb4HSh_FqvLTHgNBc+HCLFQ@mail.gmail.com/
Fixes: 02c0cab8e7 ("fuse: ioctl: translate ENOSYS")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-21 11:17:36 +02:00
Christoph Hellwig
64d1b4dd82 fuse: use direct_write_fallback
Use the generic direct_write_fallback helper instead of duplicating the
logic.

Link: https://lkml.kernel.org/r/20230601145904.1385409-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:54 -07:00
Christoph Hellwig
596df33d67 fuse: drop redundant arguments to fuse_perform_write
pos is always equal to iocb->ki_pos, and mapping is always equal to
iocb->ki_filp->f_mapping.

Link: https://lkml.kernel.org/r/20230601145904.1385409-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:54 -07:00
Christoph Hellwig
70e986c3b4 fuse: update ki_pos in fuse_perform_write
Both callers of fuse_perform_write need to updated ki_pos, move it into
common code.

Link: https://lkml.kernel.org/r/20230601145904.1385409-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:54 -07:00
Christoph Hellwig
0d625446d0 backing_dev: remove current->backing_dev_info
Patch series "cleanup the filemap / direct I/O interaction", v4.

This series cleans up some of the generic write helper calling conventions
and the page cache writeback / invalidation for direct I/O.  This is a
spinoff from the no-bufferhead kernel project, for which we'll want to an
use iomap based buffered write path in the block layer.


This patch (of 12):

The last user of current->backing_dev_info disappeared in commit
b9b1335e64 ("remove bdi_congested() and wb_congested() and related
functions").  Remove the field and all assignments to it.

Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:51 -07:00
Miklos Szeredi
a9d1c4c6df fuse: revalidate: don't invalidate if interrupted
If the LOOKUP request triggered from fuse_dentry_revalidate() is
interrupted, then the dentry will be invalidated, possibly resulting in
submounts being unmounted.

Reported-by: Xu Rongbo <xurongbo@baidu.com>
Closes: https://lore.kernel.org/all/CAJfpegswN_CJJ6C3RZiaK6rpFmNyWmXfaEpnQUJ42KCwNF5tWw@mail.gmail.com/
Fixes: 9e6268db49 ("[PATCH] FUSE - read-write operations")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 17:49:20 +02:00
Bernd Schubert
3066ff9347 fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT
This is just a safety precaution to avoid checking flags on memory that was
initialized on the user space side.  libfuse zeroes struct fuse_init_out
outarg, but this is not guranteed to be done in all implementations.
Better is to act on flags and to only apply flags2 when FUSE_INIT_EXT is
set.

There is a risk with this change, though - it might break existing user
space libraries, which are already using flags2 without setting
FUSE_INIT_EXT.

The corresponding libfuse patch is here
https://github.com/libfuse/libfuse/pull/662

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Fixes: 53db28933e ("fuse: extend init flags")
Cc: <stable@vger.kernel.org> # v5.17
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 17:49:20 +02:00
zyfjeff
1c3610d30e fuse: remove duplicate check for nodeid
before this check, the nodeid has already been checked once, so
the check here doesn't make an sense, so remove the check for
nodeid here.

            if (err || !outarg->nodeid)
                    goto out_put_forget;

            err = -EIO;
    >>>     if (!outarg->nodeid)
                    goto out_put_forget;

Signed-off-by: zyfjeff <zyfjeff@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 16:26:34 +02:00
Miklos Szeredi
5cadfbd5a1 fuse: add feature flag for expire-only
Add an init flag idicating whether the FUSE_EXPIRE_ONLY flag of
FUSE_NOTIFY_INVAL_ENTRY is effective.

This is needed for backports of this feature, otherwise the server could
just check the protocol version.

Fixes: 4f8d37020e ("fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY")
Cc: <stable@vger.kernel.org> # v6.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 16:26:33 +02:00
David Howells
2cb1e08985 splice: Use filemap_splice_read() instead of generic_file_splice_read()
Replace pointers to generic_file_splice_read() with calls to
filemap_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:17 -06:00
Linus Torvalds
556eb8b791 Driver core changes for 6.4-rc1
Here is the large set of driver core changes for 6.4-rc1.
 
 Once again, a busy development cycle, with lots of changes happening in
 the driver core in the quest to be able to move "struct bus" and "struct
 class" into read-only memory, a task now complete with these changes.
 
 This will make the future rust interactions with the driver core more
 "provably correct" as well as providing more obvious lifetime rules for
 all busses and classes in the kernel.
 
 The changes required for this did touch many individual classes and
 busses as many callbacks were changed to take const * parameters
 instead.  All of these changes have been submitted to the various
 subsystem maintainers, giving them plenty of time to review, and most of
 them actually did so.
 
 Other than those changes, included in here are a small set of other
 things:
   - kobject logging improvements
   - cacheinfo improvements and updates
   - obligatory fw_devlink updates and fixes
   - documentation updates
   - device property cleanups and const * changes
   - firwmare loader dependency fixes.
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
 LEGadNS38k5fs+73UaxV
 =7K4B
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the large set of driver core changes for 6.4-rc1.

  Once again, a busy development cycle, with lots of changes happening
  in the driver core in the quest to be able to move "struct bus" and
  "struct class" into read-only memory, a task now complete with these
  changes.

  This will make the future rust interactions with the driver core more
  "provably correct" as well as providing more obvious lifetime rules
  for all busses and classes in the kernel.

  The changes required for this did touch many individual classes and
  busses as many callbacks were changed to take const * parameters
  instead. All of these changes have been submitted to the various
  subsystem maintainers, giving them plenty of time to review, and most
  of them actually did so.

  Other than those changes, included in here are a small set of other
  things:

   - kobject logging improvements

   - cacheinfo improvements and updates

   - obligatory fw_devlink updates and fixes

   - documentation updates

   - device property cleanups and const * changes

   - firwmare loader dependency fixes.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
  device property: make device_property functions take const device *
  driver core: update comments in device_rename()
  driver core: Don't require dynamic_debug for initcall_debug probe timing
  firmware_loader: rework crypto dependencies
  firmware_loader: Strip off \n from customized path
  zram: fix up permission for the hot_add sysfs file
  cacheinfo: Add use_arch[|_cache]_info field/function
  arch_topology: Remove early cacheinfo error message if -ENOENT
  cacheinfo: Check cache properties are present in DT
  cacheinfo: Check sib_leaf in cache_leaves_are_shared()
  cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
  cacheinfo: Add arm64 early level initializer implementation
  cacheinfo: Add arch specific early level initializer
  tty: make tty_class a static const structure
  driver core: class: remove struct class_interface * from callbacks
  driver core: class: mark the struct class in struct class_interface constant
  driver core: class: make class_register() take a const *
  driver core: class: mark class_release() as taking a const *
  driver core: remove incorrect comment for device_create*
  MIPS: vpe-cmp: remove module owner pointer from struct class usage.
  ...
2023-04-27 11:53:57 -07:00
Linus Torvalds
ef36b9afc2 fget() to fdget() conversions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYCQAAKCRBZ7Krx/gZQ
 64FdAQDZ2hTDyZEWPt486dWYPYpiKyaGFXSXDGo7wgP0fiwxXQEA/mROKb6JqYw6
 27mZ9A7qluT8r3AfTTQ0D+Yse/dr4AM=
 =GA9W
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fget updates from Al Viro:
 "fget() to fdget() conversions"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fuse_dev_ioctl(): switch to fdget()
  cgroup_get_from_fd(): switch to fdget_raw()
  bpf: switch to fdget_raw()
  build_mount_idmapped(): switch to fdget()
  kill the last remaining user of proc_ns_fget()
  SVM-SEV: convert the rest of fget() uses to fdget() in there
  convert sgx_set_attribute() to fdget()/fdput()
  convert setns(2) to fdget()/fdput()
2023-04-24 19:14:20 -07:00
Al Viro
4a892c0fe4 fuse_dev_ioctl(): switch to fdget()
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-20 22:55:35 -04:00
Jens Axboe
de4f5fed3f iov_iter: add iter_iovec() helper
This returns a pointer to the current iovec entry in the iterator. Only
useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF
and ITER_IOVEC identically for the first segment.

Rename struct iov_iter->iov to iov_iter->__iov to find any potentially
troublesome spots, and also to prevent anyone from adding new code that
accesses iter->iov directly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30 08:12:29 -06:00
Greg Kroah-Hartman
1aaba11da9 driver core: class: remove module * from class_create()
The module pointer in class_create() never actually did anything, and it
shouldn't have been requred to be set as a parameter even if it did
something.  So just remove it and fix up all callers of the function in
the kernel tree at the same time.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-17 15:16:33 +01:00
Linus Torvalds
d40b2f4c94 fuse update for 6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCY/ysqAAKCRDh3BK/laaZ
 PLSfAP9c4z/KOZj/Am8nQ0mqI8Ss0Ei+hyu8Vow6NoyJkR4NvwD/SxEeI2+rpXj7
 TGOA+6k4chLc7QIFBsIwGvQgeXld1gA=
 =s4b/
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Fix regression in fileattr permission checking

 - Fix possible hang during PID namespace destruction

 - Add generic support for request extensions

 - Add supplementary group list extension

 - Add limited support for supplying supplementary groups in create
   requests

 - Documentation fixes

* tag 'fuse-update-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: add inode/permission checks to fileattr_get/fileattr_set
  fuse: fix all W=1 kernel-doc warnings
  fuse: in fuse_flush only wait if someone wants the return code
  fuse: optional supplementary group in create requests
  fuse: add request extension
2023-02-27 09:53:58 -08:00
Linus Torvalds
3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Linus Torvalds
05e6295f7b fs.idmapped.v6.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
 orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
 Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
 =+BG5
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfs idmapping updates from Christian Brauner:

 - Last cycle we introduced the dedicated struct mnt_idmap type for
   mount idmapping and the required infrastucture in 256c8aed2b ("fs:
   introduce dedicated idmap type for mounts"). As promised in last
   cycle's pull request message this converts everything to rely on
   struct mnt_idmap.

   Currently we still pass around the plain namespace that was attached
   to a mount. This is in general pretty convenient but it makes it easy
   to conflate namespaces that are relevant on the filesystem with
   namespaces that are relevant on the mount level. Especially for
   non-vfs developers without detailed knowledge in this area this was a
   potential source for bugs.

   This finishes the conversion. Instead of passing the plain namespace
   around this updates all places that currently take a pointer to a
   mnt_userns with a pointer to struct mnt_idmap.

   Now that the conversion is done all helpers down to the really
   low-level helpers only accept a struct mnt_idmap argument instead of
   two namespace arguments.

   Conflating mount and other idmappings will now cause the compiler to
   complain loudly thus eliminating the possibility of any bugs. This
   makes it impossible for filesystem developers to mix up mount and
   filesystem idmappings as they are two distinct types and require
   distinct helpers that cannot be used interchangeably.

   Everything associated with struct mnt_idmap is moved into a single
   separate file. With that change no code can poke around in struct
   mnt_idmap. It can only be interacted with through dedicated helpers.
   That means all filesystems are and all of the vfs is completely
   oblivious to the actual implementation of idmappings.

   We are now also able to extend struct mnt_idmap as we see fit. For
   example, we can decouple it completely from namespaces for users that
   don't require or don't want to use them at all. We can also extend
   the concept of idmappings so we can cover filesystem specific
   requirements.

   In combination with the vfs{g,u}id_t work we finished in v6.2 this
   makes this feature substantially more robust and thus difficult to
   implement wrong by a given filesystem and also protects the vfs.

 - Enable idmapped mounts for tmpfs and fulfill a longstanding request.

   A long-standing request from users had been to make it possible to
   create idmapped mounts for tmpfs. For example, to share the host's
   tmpfs mount between multiple sandboxes. This is a prerequisite for
   some advanced Kubernetes cases. Systemd also has a range of use-cases
   to increase service isolation. And there are more users of this.

   However, with all of the other work going on this was way down on the
   priority list but luckily someone other than ourselves picked this
   up.

   As usual the patch is tiny as all the infrastructure work had been
   done multiple kernel releases ago. In addition to all the tests that
   we already have I requested that Rodrigo add a dedicated tmpfs
   testsuite for idmapped mounts to xfstests. It is to be included into
   xfstests during the v6.3 development cycle. This should add a slew of
   additional tests.

* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
  shmem: support idmapped mounts for tmpfs
  fs: move mnt_idmap
  fs: port vfs{g,u}id helpers to mnt_idmap
  fs: port fs{g,u}id helpers to mnt_idmap
  fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
  fs: port i_{g,u}id_{needs_}update() to mnt_idmap
  quota: port to mnt_idmap
  fs: port privilege checking helpers to mnt_idmap
  fs: port inode_owner_or_capable() to mnt_idmap
  fs: port inode_init_owner() to mnt_idmap
  fs: port acl to mnt_idmap
  fs: port xattr to mnt_idmap
  fs: port ->permission() to pass mnt_idmap
  fs: port ->fileattr_set() to pass mnt_idmap
  fs: port ->set_acl() to pass mnt_idmap
  fs: port ->get_acl() to pass mnt_idmap
  fs: port ->tmpfile() to pass mnt_idmap
  fs: port ->rename() to pass mnt_idmap
  fs: port ->mknod() to pass mnt_idmap
  fs: port ->mkdir() to pass mnt_idmap
  ...
2023-02-20 11:53:11 -08:00
Linus Torvalds
575a7e0f81 File locking changes for v6.3
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmPuC5kTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFdXpEADFN/loTtcANLPQmLmgmJLDuZr2zKrf
 aziMJRjqGMx6BLdwBgX8/XGBNwpG4tkVbI+zdRoVHkpcayMDpLq0dnrvi79a/dGU
 fBrI72ZDMd/S9lzbodObHMziLqvgFthsPm9ldVAZ2Kt400KKNE+ozcveiC3yVGy0
 n1k5BSt/78abzpqut5whVgJBooHtUMCh3XvBJPKwgOneHfAXCm+jqaXlKKpKlpZj
 s2OUyn8BLfNkTgpAZ88L5Rkf0mftjziL6C8KOMy1hvOsyiP0IkwLuQ/kO+2H0Ate
 p3tbOGvUT+n1gYpFYBDLnuWB4G8+CVPxfoO6KGhwT4OlCJpPlNCM8O+w/A/dKn4I
 858spkpYPMy91lEkcrRLRkg/MARWGTgZ3k76fp3OWNnfruWd6ekMlYKx9n6CIy34
 Aoc3Svy9KeA7oRrbRDltmw3UVmz53GcDo337ZL1J6Jph3s86dMG7AwGYvoDfKuKK
 b0oNK5db5v50scBnRHX6UWejE5fSnnHvgC7pU57u08odCVEUALB+r8f04vmkxcVJ
 Qed7lolQdFzn9ddaOXzpg5KeCe/cX3p4IPZSTHad7CPr8gswmC135DfXCr64x2hC
 5jyNzKbe/x+7B2xCweHmEk4ojt8IU3UaYxLJoQkNeVr8rEGC9gqZgkSDe7BnTpOf
 wT0ijzhy2u5RKg==
 =Zhf3
 -----END PGP SIGNATURE-----

Merge tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "The main change here is that I've broken out most of the file locking
  definitions into a new header file. I also went ahead and completed
  the removal of locks_inode function"

* tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: remove locks_inode
  filelock: move file locking definitions to separate header file
2023-02-20 11:10:38 -08:00
Suren Baghdasaryan
1c71222e5f mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:39 -08:00
Matthew Wilcox (Oracle)
d585bdbeb7 fs: convert writepage_t callback to pass a folio
Patch series "Convert writepage_t to use a folio".

More folioisation.  I split out the mpage work from everything else
because it completely dominated the patch, but some implementations I just
converted outright.


This patch (of 2):

We always write back an entire folio, but that's currently passed as the
head page.  Convert all filesystems that use write_cache_pages() to expect
a folio instead of a page.

Link: https://lkml.kernel.org/r/20230126201255.1681189-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230126201255.1681189-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:34 -08:00
Alexander Mikhalitsyn
1cc4606d19 fuse: add inode/permission checks to fileattr_get/fileattr_set
It looks like these checks were accidentally lost during the conversion to
fileattr API.

Fixes: 72227eac17 ("fuse: convert to fileattr")
Cc: <stable@vger.kernel.org> # v5.13
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-26 17:22:04 +01:00
Randy Dunlap
06bbb761c1 fuse: fix all W=1 kernel-doc warnings
Use correct function name in kernel-doc notation. (1)
Don't use "/**" to begin non-kernel-doc comments. (3)

Fixes these warnings:

fs/fuse/cuse.c:272: warning: expecting prototype for cuse_parse_dev_info(). Prototype was for cuse_parse_devinfo() instead
fs/fuse/dev.c:212: warning: expecting prototype for A new request is available, wake fiq(). Prototype was for fuse_dev_wake_and_unlock() instead
fs/fuse/dir.c:149: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * Mark the attributes as stale due to an atime change.  Avoid the invalidate if
fs/fuse/file.c:656: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * In case of short read, the caller sets 'pos' to the position of

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-26 17:10:58 +01:00
Eric W. Biederman
5a8bee63b1 fuse: in fuse_flush only wait if someone wants the return code
If a fuse filesystem is mounted inside a container, there is a problem
during pid namespace destruction. The scenario is:

1. task (a thread in the fuse server, with a fuse file open) starts
   exiting, does exit_signals(), goes into fuse_flush() -> wait
2. fuse daemon gets killed, tries to wake everyone up
3. task from 1 is stuck because complete_signal() doesn't wake it up, since
   it has PF_EXITING.

The result is that the thread will never be woken up, and pid namespace
destruction will block indefinitely.

To add insult to injury, nobody is waiting for these return codes, since
the pid namespace is being destroyed.

To fix this, let's not block on flush operations when the current task has
PF_EXITING.

This does change the semantics slightly: the wait here is for posix locks
to be unlocked, so the task will exit before things are unlocked. To quote
Miklos:

  "remote" posix locks are almost never used due to problems like this, so
  I think it's safe to do this.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
Link: https://lore.kernel.org/all/YrShFXRLtRt6T%2Fj+@risky/
Tested-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-26 17:10:58 +01:00
Miklos Szeredi
8ed7cb3f27 fuse: optional supplementary group in create requests
Permission to create an object (create, mkdir, symlink, mknod) needs to
take supplementary groups into account.

Add a supplementary group request extension.  This can contain an arbitrary
number of group IDs and can be added to any request.  This extension is not
added to any request by default.

Add FUSE_CREATE_SUPP_GROUP init flag to enable supplementary group info in
creation requests.  This adds just a single supplementary group that
matches the parent group in the case described above.  In other cases the
extension is not added.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-26 17:10:38 +01:00
Miklos Szeredi
15d937d7ca fuse: add request extension
Will need to add supplementary groups to create messages, so add the
general concept of a request extension.  A request extension is appended to
the end of the main request.  It has a header indicating the size and type
of the extension.

The create security context (fuse_secctx_*) is similar to the generic
request extension, so include that as well in a backward compatible manner.

Add the total extension length to the request header.  The offset of the
extension block within the request can be calculated by:

  inh->len - inh->total_extlen * 8

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-01-26 17:10:37 +01:00
Christian Brauner
facd61053c
fuse: fixes after adapting to new posix acl api
This cycle we ported all filesystems to the new posix acl api. While
looking at further simplifications in this area to remove the last
remnants of the generic dummy posix acl handlers we realized that we
regressed fuse daemons that don't set FUSE_POSIX_ACL but still make use
of posix acls.

With the change to a dedicated posix acl api interacting with posix acls
doesn't go through the old xattr codepaths anymore and instead only
relies the get acl and set acl inode operations.

Before this change fuse daemons that don't set FUSE_POSIX_ACL were able
to get and set posix acl albeit with two caveats. First, that posix acls
aren't cached. And second, that they aren't used for permission checking
in the vfs.

We regressed that use-case as we currently refuse to retrieve any posix
acls if they aren't enabled via FUSE_POSIX_ACL. So older fuse daemons
would see a change in behavior.

We can restore the old behavior in multiple ways. We could change the
new posix acl api and look for a dedicated xattr handler and if we find
one prefer that over the dedicated posix acl api. That would break the
consistency of the new posix acl api so we would very much prefer not to
do that.

We could introduce a new ACL_*_CACHE sentinel that would instruct the
vfs permission checking codepath to not call into the filesystem and
ignore acls.

But a more straightforward fix for v6.2 is to do the same thing that
Overlayfs does and give fuse a separate get acl method for permission
checking. Overlayfs uses this to express different needs for vfs
permission lookup and acl based retrieval via the regular system call
path as well. Let fuse do the same for now. This way fuse can continue
to refuse to retrieve posix acls for daemons that don't set
FUSE_POSXI_ACL for permission checking while allowing a fuse server to
retrieve it via the usual system calls.

In the future, we could extend the get acl inode operation to not just
pass a simple boolean to indicate rcu lookup but instead make it a flag
argument. Then in addition to passing the information that this is an
rcu lookup to the filesystem we could also introduce a flag that tells
the filesystem that this is a request from the vfs to use these acls for
permission checking. Then fuse could refuse the get acl request for
permission checking when the daemon doesn't have FUSE_POSIX_ACL set in
the same get acl method. This would also help Overlayfs and allow us to
remove the second method for it as well.

But since that change is more invasive as we need to update the get acl
inode operation for multiple filesystems we should not do this as a fix
for v6.2. Instead we will do this for the v6.3 merge window.

Fwiw, since posix acls are now always correctly translated in the new
posix acl api we could also allow them to be used for daemons without
FUSE_POSIX_ACL that are not mounted on the host. But this is behavioral
change and again if dones should be done for v6.3. For now, let's just
restore the original behavior.

A nice side-effect of this change is that for fuse daemons with and
without FUSE_POSIX_ACL the same code is used for posix acls in a
backwards compatible way. This also means we can remove the legacy xattr
handlers completely. We've also added comments to explain the expected
behavior for daemons without FUSE_POSIX_ACL into the code.

Fixes: 318e66856d ("xattr: use posix acl api")
Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-24 16:33:37 +01:00
Christian Brauner
e67fe63341
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
Convert to struct mnt_idmap.
Remove legacy file_mnt_user_ns() and mnt_user_ns().

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:29 +01:00
Christian Brauner
9452e93e6d
fs: port privilege checking helpers to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:29 +01:00
Christian Brauner
39f60c1cce
fs: port xattr to mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
4609e1f18e
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:28 +01:00
Christian Brauner
8782a9aea3
fs: port ->fileattr_set() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:27 +01:00
Christian Brauner
13e83a4923
fs: port ->set_acl() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:27 +01:00
Christian Brauner
011e2b717b
fs: port ->tmpfile() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:27 +01:00
Christian Brauner
e18275ae55
fs: port ->rename() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner
5ebb29bee8
fs: port ->mknod() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner
c54bd91e9e
fs: port ->mkdir() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:26 +01:00
Christian Brauner
7a77db9551
fs: port ->symlink() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner
6c960e68aa
fs: port ->create() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner
b74d24f7a7
fs: port ->getattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:25 +01:00
Christian Brauner
c1632a0f11
fs: port ->setattr() to pass mnt_idmap
Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19 09:24:02 +01:00
Jeff Layton
5970e15dbc filelock: move file locking definitions to separate header file
The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.

Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-01-11 06:52:32 -05:00
Linus Torvalds
e2ca6ba6ba MM patches for 6.2-rc1.
- More userfaultfs work from Peter Xu.
 
 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying.
 
 - Some filemap cleanups from Vishal Moola.
 
 - David Hildenbrand added the ability to selftest anon memory COW handling.
 
 - Some cpuset simplifications from Liu Shixin.
 
 - Addition of vmalloc tracing support by Uladzislau Rezki.
 
 - Some pagecache folioifications and simplifications from Matthew Wilcox.
 
 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it.
 
 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.  This series shold have been in the
   non-MM tree, my bad.
 
 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages.
 
 - DAMON cleanups and tuneups from SeongJae Park
 
 - Tony Luck fixed the handling of COW faults against poisoned pages.
 
 - Peter Xu utilized the PTE marker code for handling swapin errors.
 
 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient.
 
 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand.
 
 - zram support for multiple compression streams from Sergey Senozhatsky.
 
 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway.
 
 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations.
 
 - Vishal Moola removed the try_to_release_page() wrapper.
 
 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache.
 
 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking.
 
 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend.
 
 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range().
 
 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen.
 
 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests.  Better, but still not perfect.
 
 - Christoph Hellwig has removed the .writepage() method from several
   filesystems.  They only need .writepages().
 
 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting.
 
 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines.
 
 - Many singleton patches, as usual.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA
 jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml
 CodAgiA51qwzId3GRytIo/tfWZSezgA=
 =d19R
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - More userfaultfs work from Peter Xu

 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying

 - Some filemap cleanups from Vishal Moola

 - David Hildenbrand added the ability to selftest anon memory COW
   handling

 - Some cpuset simplifications from Liu Shixin

 - Addition of vmalloc tracing support by Uladzislau Rezki

 - Some pagecache folioifications and simplifications from Matthew
   Wilcox

 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use
   it

 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.

   This series should have been in the non-MM tree, my bad

 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages

 - DAMON cleanups and tuneups from SeongJae Park

 - Tony Luck fixed the handling of COW faults against poisoned pages

 - Peter Xu utilized the PTE marker code for handling swapin errors

 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient

 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand

 - zram support for multiple compression streams from Sergey Senozhatsky

 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway

 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations

 - Vishal Moola removed the try_to_release_page() wrapper

 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache

 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking

 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend

 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range()

 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen

 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests. Better, but still not perfect

 - Christoph Hellwig has removed the .writepage() method from several
   filesystems. They only need .writepages()

 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting

 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines

 - Many singleton patches, as usual

* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
  mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
  mm: mmu_gather: allow more than one batch of delayed rmaps
  mm: fix typo in struct pglist_data code comment
  kmsan: fix memcpy tests
  mm: add cond_resched() in swapin_walk_pmd_entry()
  mm: do not show fs mm pc for VM_LOCKONFAULT pages
  selftests/vm: ksm_functional_tests: fixes for 32bit
  selftests/vm: cow: fix compile warning on 32bit
  selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
  mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
  mm,thp,rmap: fix races between updates of subpages_mapcount
  mm: memcg: fix swapcached stat accounting
  mm: add nodes= arg to memory.reclaim
  mm: disable top-tier fallback to reclaim on proactive reclaim
  selftests: cgroup: make sure reclaim target memcg is unprotected
  selftests: cgroup: refactor proactive reclaim code to reclaim_until()
  mm: memcg: fix stale protection of reclaim target memcg
  mm/mmap: properly unaccount memory on mas_preallocate() failure
  omfs: remove ->writepage
  jfs: remove ->writepage
  ...
2022-12-13 19:29:45 -08:00
Linus Torvalds
043930b1c8 fuse update for 6.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCY5cH0wAKCRDh3BK/laaZ
 PAIJAQC658ZaXRgWBZ/XmCqbb+c8g/InrccE+PXhtVGYTiWTiwD/ZEy7r/X/uJaO
 gb5anxJT5jVJ9Qfk1VbyZqdwqUqydAo=
 =If6t
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse update from Miklos Szeredi:

 - Allow some write requests to proceed in parallel

 - Fix a performance problem with allow_sys_admin_access

 - Add a special kind of invalidation that doesn't immediately purge
   submounts

 - On revalidation treat the target of rename(RENAME_NOREPLACE) the same
   as open(O_EXCL)

 - Use type safe helpers for some mnt_userns transformations

* tag 'fuse-update-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: Rearrange fuse_allow_current_process checks
  fuse: allow non-extending parallel direct writes on the same file
  fuse: remove the unneeded result variable
  fuse: port to vfs{g,u}id_t and associated helpers
  fuse: Remove user_ns check for FUSE_DEV_IOC_CLONE
  fuse: always revalidate rename target dentry
  fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY
  fs/fuse: Replace kmap() with kmap_local_page()
2022-12-12 20:22:09 -08:00
Linus Torvalds
e1212e9b6f fs.vfsuid.conversion.v6.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bspgAKCRCRxhvAZXjc
 opEWAQDpF5rnZn1vv4/uOTij9ztcA4yLxu/Q19CdqBaoHlWZ9AD/d3eecee3bh5h
 iPHtlUK5/VspfD9LPpdc5ZbPCdZ2pA4=
 =t6NN
 -----END PGP SIGNATURE-----

Merge tag 'fs.vfsuid.conversion.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull vfsuid updates from Christian Brauner:
 "Last cycle we introduced the vfs{g,u}id_t types and associated helpers
  to gain type safety when dealing with idmapped mounts. That initial
  work already converted a lot of places over but there were still some
  left,

  This converts all remaining places that still make use of non-type
  safe idmapping helpers to rely on the new type safe vfs{g,u}id based
  helpers.

  Afterwards it removes all the old non-type safe helpers"

* tag 'fs.vfsuid.conversion.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fs: remove unused idmapping helpers
  ovl: port to vfs{g,u}id_t and associated helpers
  fuse: port to vfs{g,u}id_t and associated helpers
  ima: use type safe idmapping helpers
  apparmor: use type safe idmapping helpers
  caps: use type safe idmapping helpers
  fs: use type safe idmapping helpers
  mnt_idmapping: add missing helpers
2022-12-12 19:20:05 -08:00
Linus Torvalds
cf619f8919 fs.ovl.setgid.v6.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bt7AAKCRCRxhvAZXjc
 ovAOAP9qcrUqs2MoyBDe6qUXThYY9w2rgX/ZI4ZZmbtsXEDGtQEA/LPddq8lD8o9
 m17zpvMGbXXRwz4/zVGuyWsHgg0HsQ0=
 =ioRq
 -----END PGP SIGNATURE-----

Merge tag 'fs.ovl.setgid.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull setgid inheritance updates from Christian Brauner:
 "This contains the work to make setgid inheritance consistent between
  modifying a file and when changing ownership or mode as this has been
  a repeated source of very subtle bugs. The gist is that we perform the
  same permission checks in the write path as we do in the ownership and
  mode changing paths after this series where we're currently doing
  different things.

  We've already made setgid inheritance a lot more consistent and
  reliable in the last releases by moving setgid stripping from the
  individual filesystems up into the vfs. This aims to make the logic
  even more consistent and easier to understand and also to fix
  long-standing overlayfs setgid inheritance bugs. Miklos was nice
  enough to just let me carry the trivial overlayfs patches from Amir
  too.

  Below is a more detailed explanation how the current difference in
  setgid handling lead to very subtle bugs exemplified via overlayfs
  which is a victim of the current rules. I hope this explains why I
  think taking the regression risk here is worth it.

  A long while ago I found a few setgid inheritance bugs in overlayfs in
  the write path in certain conditions. Amir recently picked this back
  up in [1] and I jumped on board to fix this more generally.

  On the surface all that overlayfs would need to fix setgid inheritance
  would be to call file_remove_privs() or file_modified() but actually
  that isn't enough because the setgid inheritance api is wildly
  inconsistent in that area.

  Before this pr setgid stripping in file_remove_privs()'s old
  should_remove_suid() helper was inconsistent with other parts of the
  vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is
  S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups
  and the caller isn't privileged over the inode although we require
  this already in setattr_prepare() and setattr_copy() and so all
  filesystem implement this requirement implicitly because they have to
  use setattr_{prepare,copy}() anyway.

  But the inconsistency shows up in setgid stripping bugs for overlayfs
  in xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
  generic/687). For example, we test whether suid and setgid stripping
  works correctly when performing various write-like operations as an
  unprivileged user (fallocate, reflink, write, etc.):

      echo "Test 1 - qa_user, non-exec file $verb"
      setup_testfile
      chmod a+rws $junk_file
      commit_and_check "$qa_user" "$verb" 64k 64k

  The test basically creates a file with 6666 permissions. While the
  file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP
  set.

  On a regular filesystem like xfs what will happen is:

      sys_fallocate()
      -> vfs_fallocate()
         -> xfs_file_fallocate()
            -> file_modified()
               -> __file_remove_privs()
                  -> dentry_needs_remove_privs()
                     -> should_remove_suid()
                  -> __remove_privs()
                     newattrs.ia_valid = ATTR_FORCE | kill;
                     -> notify_change()
                        -> setattr_copy()

  In should_remove_suid() we can see that ATTR_KILL_SUID is raised
  unconditionally because the file in the test has S_ISUID set.

  But we also see that ATTR_KILL_SGID won't be set because while the
  file is S_ISGID it is not S_IXGRP (see above) which is a condition for
  ATTR_KILL_SGID being raised.

  So by the time we call notify_change() we have attr->ia_valid set to
  ATTR_KILL_SUID | ATTR_FORCE.

  Now notify_change() sees that ATTR_KILL_SUID is set and does:

      ia_valid      = attr->ia_valid |= ATTR_MODE
      attr->ia_mode = (inode->i_mode & ~S_ISUID);

  which means that when we call setattr_copy() later we will definitely
  update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

  Now we call into the filesystem's ->setattr() inode operation which
  will end up calling setattr_copy(). Since ATTR_MODE is set we will
  hit:

      if (ia_valid & ATTR_MODE) {
              umode_t mode = attr->ia_mode;
              vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
              if (!vfsgid_in_group_p(vfsgid) &&
                  !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                      mode &= ~S_ISGID;
              inode->i_mode = mode;
      }

  and since the caller in the test is neither capable nor in the group
  of the inode the S_ISGID bit is stripped.

  But assume the file isn't suid then ATTR_KILL_SUID won't be raised
  which has the consequence that neither the setgid nor the suid bits
  are stripped even though it should be stripped because the inode isn't
  in the caller's groups and the caller isn't privileged over the inode.

  If overlayfs is in the mix things become a bit more complicated and
  the bug shows up more clearly.

  When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to
  file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be
  raised but because the check in notify_change() is questioning the
  ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped
  the S_ISGID bit isn't removed even though it should be stripped:

      sys_fallocate()
      -> vfs_fallocate()
         -> ovl_fallocate()
            -> file_remove_privs()
               -> dentry_needs_remove_privs()
                  -> should_remove_suid()
               -> __remove_privs()
                  newattrs.ia_valid = ATTR_FORCE | kill;
                  -> notify_change()
                     -> ovl_setattr()
                        /* TAKE ON MOUNTER'S CREDS */
                        -> ovl_do_notify_change()
                           -> notify_change()
                        /* GIVE UP MOUNTER'S CREDS */
           /* TAKE ON MOUNTER'S CREDS */
           -> vfs_fallocate()
              -> xfs_file_fallocate()
                 -> file_modified()
                    -> __file_remove_privs()
                       -> dentry_needs_remove_privs()
                          -> should_remove_suid()
                       -> __remove_privs()
                          newattrs.ia_valid = attr_force | kill;
                          -> notify_change()

  The fix for all of this is to make file_remove_privs()'s
  should_remove_suid() helper perform the same checks as we already
  require in setattr_prepare() and setattr_copy() and have
  notify_change() not pointlessly requiring S_IXGRP again. It doesn't
  make any sense in the first place because the caller must calculate
  the flags via should_remove_suid() anyway which would raise
  ATTR_KILL_SGID

  Note that some xfstests will now fail as these patches will cause the
  setgid bit to be lost in certain conditions for unprivileged users
  modifying a setgid file when they would've been kept otherwise. I
  think this risk is worth taking and I explained and mentioned this
  multiple times on the list [2].

  Enforcing the rules consistently across write operations and
  chmod/chown will lead to losing the setgid bit in cases were it
  might've been retained before.

  While I've mentioned this a few times but it's worth repeating just to
  make sure that this is understood. For the sake of maintainability,
  consistency, and security this is a risk worth taking.

  If we really see regressions for workloads the fix is to have special
  setgid handling in the write path again with different semantics from
  chmod/chown and possibly additional duct tape for overlayfs. I'll
  update the relevant xfstests with if you should decide to merge this
  second setgid cleanup.

  Before that people should be aware that there might be failures for
  fstests where unprivileged users modify a setgid file"

Link: https://lore.kernel.org/linux-fsdevel/20221003123040.900827-1-amir73il@gmail.com [1]
Link: https://lore.kernel.org/linux-fsdevel/20221122142010.zchf2jz2oymx55qi@wittgenstein [2]

* tag 'fs.ovl.setgid.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fs: use consistent setgid checks in is_sxid()
  ovl: remove privs in ovl_fallocate()
  ovl: remove privs in ovl_copyfile()
  attr: use consistent sgid stripping checks
  attr: add setattr_should_drop_sgid()
  fs: move should_remove_suid()
  attr: add in_group_or_capable()
2022-12-12 19:03:10 -08:00
Linus Torvalds
6a518afcc2 fs.acl.rework.v6.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bwTgAKCRCRxhvAZXjc
 ovd2AQCK00NAtGjQCjQPQGyTa4GAPqvWgq1ef0lnhv+TL5US5gD9FncQ8UofeMXt
 pBfjtAD6ettTPCTxUQfnTwWEU4rc7Qg=
 =27Wm
 -----END PGP SIGNATURE-----

Merge tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping

Pull VFS acl updates from Christian Brauner:
 "This contains the work that builds a dedicated vfs posix acl api.

  The origins of this work trace back to v5.19 but it took quite a while
  to understand the various filesystem specific implementations in
  sufficient detail and also come up with an acceptable solution.

  As we discussed and seen multiple times the current state of how posix
  acls are handled isn't nice and comes with a lot of problems: The
  current way of handling posix acls via the generic xattr api is error
  prone, hard to maintain, and type unsafe for the vfs until we call
  into the filesystem's dedicated get and set inode operations.

  It is already the case that posix acls are special-cased to death all
  the way through the vfs. There are an uncounted number of hacks that
  operate on the uapi posix acl struct instead of the dedicated vfs
  struct posix_acl. And the vfs must be involved in order to interpret
  and fixup posix acls before storing them to the backing store, caching
  them, reporting them to userspace, or for permission checking.

  Currently a range of hacks and duct tape exist to make this work. As
  with most things this is really no ones fault it's just something that
  happened over time. But the code is hard to understand and difficult
  to maintain and one is constantly at risk of introducing bugs and
  regressions when having to touch it.

  Instead of continuing to hack posix acls through the xattr handlers
  this series builds a dedicated posix acl api solely around the get and
  set inode operations.

  Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl()
  helpers must be used in order to interact with posix acls. They
  operate directly on the vfs internal struct posix_acl instead of
  abusing the uapi posix acl struct as we currently do. In the end this
  removes all of the hackiness, makes the codepaths easier to maintain,
  and gets us type safety.

  This series passes the LTP and xfstests suites without any
  regressions. For xfstests the following combinations were tested:
   - xfs
   - ext4
   - btrfs
   - overlayfs
   - overlayfs on top of idmapped mounts
   - orangefs
   - (limited) cifs

  There's more simplifications for posix acls that we can make in the
  future if the basic api has made it.

  A few implementation details:

   - The series makes sure to retain exactly the same security and
     integrity module permission checks. Especially for the integrity
     modules this api is a win because right now they convert the uapi
     posix acl struct passed to them via a void pointer into the vfs
     struct posix_acl format to perform permission checking on the mode.

     There's a new dedicated security hook for setting posix acls which
     passes the vfs struct posix_acl not a void pointer. Basing checking
     on the posix acl stored in the uapi format is really unreliable.
     The vfs currently hacks around directly in the uapi struct storing
     values that frankly the security and integrity modules can't
     correctly interpret as evidenced by bugs we reported and fixed in
     this area. It's not necessarily even their fault it's just that the
     format we provide to them is sub optimal.

   - Some filesystems like 9p and cifs need access to the dentry in
     order to get and set posix acls which is why they either only
     partially or not even at all implement get and set inode
     operations. For example, cifs allows setxattr() and getxattr()
     operations but doesn't allow permission checking based on posix
     acls because it can't implement a get acl inode operation.

     Thus, this patch series updates the set acl inode operation to take
     a dentry instead of an inode argument. However, for the get acl
     inode operation we can't do this as the old get acl method is
     called in e.g., generic_permission() and inode_permission(). These
     helpers in turn are called in various filesystem's permission inode
     operation. So passing a dentry argument to the old get acl inode
     operation would amount to passing a dentry to the permission inode
     operation which we shouldn't and probably can't do.

     So instead of extending the existing inode operation Christoph
     suggested to add a new one. He also requested to ensure that the
     get and set acl inode operation taking a dentry are consistently
     named. So for this version the old get acl operation is renamed to
     ->get_inode_acl() and a new ->get_acl() inode operation taking a
     dentry is added. With this we can give both 9p and cifs get and set
     acl inode operations and in turn remove their complex custom posix
     xattr handlers.

     In the future I hope to get rid of the inode method duplication but
     it isn't like we have never had this situation. Readdir is just one
     example. And frankly, the overall gain in type safety and the more
     pleasant api wise are simply too big of a benefit to not accept
     this duplication for a while.

   - We've done a full audit of every codepaths using variant of the
     current generic xattr api to get and set posix acls and
     surprisingly it isn't that many places. There's of course always a
     chance that we might have missed some and if so I'm sure we'll find
     them soon enough.

     The crucial codepaths to be converted are obviously stacking
     filesystems such as ecryptfs and overlayfs.

     For a list of all callers currently using generic xattr api helpers
     see [2] including comments whether they support posix acls or not.

   - The old vfs generic posix acl infrastructure doesn't obey the
     create and replace semantics promised on the setxattr(2) manpage.
     This patch series doesn't address this. It really is something we
     should revisit later though.

  The patches are roughly organized as follows:

   (1) Change existing set acl inode operation to take a dentry
       argument (Intended to be a non-functional change)

   (2) Rename existing get acl method (Intended to be a non-functional
       change)

   (3) Implement get and set acl inode operations for filesystems that
       couldn't implement one before because of the missing dentry.
       That's mostly 9p and cifs (Intended to be a non-functional
       change)

   (4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(),
       and vfs_set_acl() including security and integrity hooks
       (Intended to be a non-functional change)

   (5) Implement get and set acl inode operations for stacking
       filesystems (Intended to be a non-functional change)

   (6) Switch posix acl handling in stacking filesystems to new posix
       acl api now that all filesystems it can stack upon support it.

   (7) Switch vfs to new posix acl api (semantical change)

   (8) Remove all now unused helpers

   (9) Additional regression fixes reported after we merged this into
       linux-next

  Thanks to Seth for a lot of good discussion around this and
  encouragement and input from Christoph"

* tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits)
  posix_acl: Fix the type of sentinel in get_acl
  orangefs: fix mode handling
  ovl: call posix_acl_release() after error checking
  evm: remove dead code in evm_inode_set_acl()
  cifs: check whether acl is valid early
  acl: make vfs_posix_acl_to_xattr() static
  acl: remove a slew of now unused helpers
  9p: use stub posix acl handlers
  cifs: use stub posix acl handlers
  ovl: use stub posix acl handlers
  ecryptfs: use stub posix acl handlers
  evm: remove evm_xattr_acl_change()
  xattr: use posix acl api
  ovl: use posix acl api
  ovl: implement set acl method
  ovl: implement get acl method
  ecryptfs: implement set acl method
  ecryptfs: implement get acl method
  ksmbd: use vfs_remove_acl()
  acl: add vfs_remove_acl()
  ...
2022-12-12 18:46:39 -08:00
Linus Torvalds
75f4d9af8b iov_iter work; most of that is about getting rid of
direction misannotations and (hopefully) preventing
 more of the same for the future.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
 65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
 MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
 =bcvF
 -----END PGP SIGNATURE-----

Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull iov_iter updates from Al Viro:
 "iov_iter work; most of that is about getting rid of direction
  misannotations and (hopefully) preventing more of the same for the
  future"

* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  use less confusing names for iov_iter direction initializers
  iov_iter: saner checks for attempt to copy to/from iterator
  [xen] fix "direction" argument of iov_iter_kvec()
  [vhost] fix 'direction' argument of iov_iter_{init,bvec}()
  [target] fix iov_iter_bvec() "direction" argument
  [s390] memcpy_real(): WRITE is "data source", not destination...
  [s390] zcore: WRITE is "data source", not destination...
  [infiniband] READ is "data destination", not source...
  [fsi] WRITE is "data source", not destination...
  [s390] copy_oldmem_kernel() - WRITE is "data source", not destination
  csum_and_copy_to_iter(): handle ITER_DISCARD
  get rid of unlikely() on page_copy_sane() calls
2022-12-12 18:29:54 -08:00
Vishal Moola (Oracle)
063aaad792 fuse: convert fuse_try_move_page() to use folios
Converts the function to try to move folios instead of pages. Also
converts fuse_check_page() to fuse_get_folio() since this is its only
caller. This change removes 15 calls to compound_head().

Link: https://lkml.kernel.org/r/20221101175326.13265-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11 18:12:12 -08:00
Vishal Moola (Oracle)
3720dd6dca filemap: convert replace_page_cache_page() to replace_page_cache_folio()
Patch series "Removing the lru_cache_add() wrapper".

This patchset replaces all calls of lru_cache_add() with the folio
equivalent: folio_add_lru().  This is allows us to get rid of the wrapper
The series passes xfstests and the userfaultfd selftests.


This patch (of 5):

Eliminates 7 calls to compound_head().

Link: https://lkml.kernel.org/r/20221101175326.13265-1-vishal.moola@gmail.com
Link: https://lkml.kernel.org/r/20221101175326.13265-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11 18:12:12 -08:00
Al Viro
de4eda9de2 use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.

Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-25 13:01:55 -05:00
Dave Marchevsky
b138777786 fuse: Rearrange fuse_allow_current_process checks
This is a followup to a previous commit of mine [0], which added the
allow_sys_admin_access && capable(CAP_SYS_ADMIN) check.  This patch
rearranges the order of checks in fuse_allow_current_process without
changing functionality.

Commit 9ccf47b26b ("fuse: Add module param for CAP_SYS_ADMIN access
bypassing allow_other") added allow_sys_admin_access &&
capable(CAP_SYS_ADMIN) check to the beginning of the function, with the
reasoning that allow_sys_admin_access should be an 'escape hatch' for users
with CAP_SYS_ADMIN, allowing them to skip any subsequent checks.

However, placing this new check first results in many capable() calls when
allow_sys_admin_access is set, where another check would've also returned
1.  This can be problematic when a BPF program is tracing capable() calls.

At Meta we ran into such a scenario recently.  On a host where
allow_sys_admin_access is set but most of the FUSE access is from processes
which would pass other checks - i.e.  they don't need CAP_SYS_ADMIN 'escape
hatch' - this results in an unnecessary capable() call for each fs op.  We
also have a daemon tracing capable() with BPF and doing some data
collection, so tracing these extraneous capable() calls has the potential
to regress performance for an application doing many FUSE ops.

So rearrange the order of these checks such that CAP_SYS_ADMIN 'escape
hatch' is checked last.  Add a small helper, fuse_permissible_uidgid, to
make the logic easier to understand.  Previously, if allow_other is set on
the fuse_conn, uid/git checking doesn't happen as current_in_userns result
is returned.  These semantics are maintained here: fuse_permissible_uidgid
check only happens if allow_other is not set.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23 09:10:50 +01:00
Dharmendra Singh
153524053b fuse: allow non-extending parallel direct writes on the same file
In general, as of now, in FUSE, direct writes on the same file are
serialized over inode lock i.e we hold inode lock for the full duration of
the write request.  I could not find in fuse code and git history a comment
which clearly explains why this exclusive lock is taken for direct writes.
Following might be the reasons for acquiring an exclusive lock but not be
limited to

 1) Our guess is some USER space fuse implementations might be relying on
    this lock for serialization.

 2) The lock protects against file read/write size races.

 3) Ruling out any issues arising from partial write failures.

This patch relaxes the exclusive lock for direct non-extending writes only.
File size extending writes might not need the lock either, but we are not
entirely sure if there is a risk to introduce any kind of regression.
Furthermore, benchmarking with fio does not show a difference between patch
versions that take on file size extension a) an exclusive lock and b) a
shared lock.

A possible example of an issue with i_size extending writes are write error
cases.  Some writes might succeed and others might fail for file system
internal reasons - for example ENOSPACE.  With parallel file size extending
writes it _might_ be difficult to revert the action of the failing write,
especially to restore the right i_size.

With these changes, we allow non-extending parallel direct writes on the
same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES.  If
this flag is set on the file (flag is passed from libfuse to fuse kernel as
part of file open/create), we do not take exclusive lock anymore, but
instead use a shared lock that allows non-extending writes to run in
parallel.  FUSE implementations which rely on this inode lock for
serialization can continue to do so and serialized direct writes are still
the default.  Implementations that do not do write serialization need to be
updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file
open/create reply.

On patch review there were concerns that network file systems (or vfs
multiple mounts of the same file system) might have issues with parallel
writes.  We believe this is not the case, as this is just a local lock,
which network file systems could not rely on anyway.  I.e. this lock is
just for local consistency.

Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23 09:10:50 +01:00
ye xingchen
e2283a7366 fuse: remove the unneeded result variable
Return the value fuse_dev_release() directly instead of storing it in
another redundant variable.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23 09:10:49 +01:00
Christian Brauner
00d369bc2d fuse: port to vfs{g,u}id_t and associated helpers
A while ago we introduced a dedicated vfs{g,u}id_t type in commit
1e5267cd08 ("mnt_idmapping: add vfs{g,u}id_t").  We already switched over
a good part of the VFS.  Ultimately we will remove all legacy idmapped
mount helpers that operate only on k{g,u}id_t in favor of the new type safe
helpers that operate on vfs{g,u}id_t.

Cc: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23 09:10:49 +01:00
Jann Horn
0618021e34 fuse: Remove user_ns check for FUSE_DEV_IOC_CLONE
Commit 8ed1f0e22f ("fs/fuse: fix ioctl type confusion") fixed a type
confusion bug by adding an ->f_op comparison.

Based on some off-list discussion back then, another check was added to
compare the f_cred->user_ns.  This is not for security reasons, but was
based on the idea that a FUSE device FD should be using the UID/GID
mappings of its f_cred->user_ns, and those translations are done using
fc->user_ns, which matches the f_cred->user_ns of the initial FUSE device
FD thanks to the check in fuse_fill_super().  See also commit 8cb08329b0
("fuse: Support fuse filesystems outside of init_user_ns").

But FUSE_DEV_IOC_CLONE is, at a higher level, a *cloning* operation that
copies an existing context (with a weird API that involves first opening
/dev/fuse, then tying the resulting new FUSE device FD to an existing FUSE
instance).  So if an application is already passing FUSE FDs across userns
boundaries and dealing with the resulting ID mapping complications somehow,
it doesn't make much sense to block this cloning operation.

I've heard that this check is an obstacle for some folks, and I don't see a
good reason to keep it, so remove it.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23 09:10:49 +01:00
Jiachen Zhang
ccc031e26a fuse: always revalidate rename target dentry
The previous commit df8629af29 ("fuse: always revalidate if exclusive
create") ensures that the dentries are revalidated on O_EXCL creates.  This
commit complements it by also performing revalidation for rename target
dentries.  Otherwise, a rename target file that only exists in kernel
dentry cache but not in the filesystem will result in EEXIST if
RENAME_NOREPLACE flag is used.

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23 09:10:49 +01:00
Miklos Szeredi
4f8d37020e fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY
Add a flag to entry expiration that lets the filesystem expire a dentry
without kicking it out from the cache immediately.

This makes a difference for overmounted dentries, where plain invalidation
would detach all submounts before dropping the dentry from the cache.  If
only expiry is set on the dentry, then any overmounts are left alone and
until ->d_revalidate() is called.

Note: ->d_revalidate() is not called for the case of following a submount,
so invalidation will only be triggered for the non-overmounted case.  The
dentry could also be mounted in a different mount instance, in which case
any submounts will still be detached.

Suggested-by: Jakob Blomer <jblomer@cern.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23 09:10:49 +01:00
Fabio M. De Francesco
a1db2f7ede fs/fuse: Replace kmap() with kmap_local_page()
The use of kmap() is being deprecated in favor of kmap_local_page().

There are two main problems with kmap(): (1) It comes with an overhead as
the mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.

With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and still valid.

Therefore, replace kmap() with kmap_local_page() in fuse_readdir_cached(), 
it being the only call site of kmap() currently left in fs/fuse.

Cc: "Venkataramanan, Anirudh" <anirudh.venkataramanan@intel.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23 09:10:49 +01:00
Miklos Szeredi
44361e8cf9 fuse: lock inode unconditionally in fuse_fallocate()
file_modified() must be called with inode lock held.  fuse_fallocate()
didn't lock the inode in case of just FALLOC_KEEP_SIZE flags value, which
resulted in a kernel Warning in notify_change().

Lock the inode unconditionally, like all other fallocate implementations
do.

Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Reported-and-tested-by: syzbot+462da39f0667b357c4b6@syzkaller.appspotmail.com
Fixes: 4a6f278d48 ("fuse: add file_modified() to fallocate")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-11-23 09:10:42 +01:00
Miklos Szeredi
4a6f278d48 fuse: add file_modified() to fallocate
Add missing file_modified() call to fuse_file_fallocate().  Without this
fallocate on fuse failed to clear privileges.

Fixes: 05ba1f0823 ("fuse: add FALLOCATE operation")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-10-28 14:25:20 +02:00
Christian Brauner
a03a972b26
fuse: port to vfs{g,u}id_t and associated helpers
A while ago we introduced a dedicated vfs{g,u}id_t type in commit
1e5267cd08 ("mnt_idmapping: add vfs{g,u}id_t").  We already switched over
a good part of the VFS.  Ultimately we will remove all legacy idmapped
mount helpers that operate only on k{g,u}id_t in favor of the new type safe
helpers that operate on vfs{g,u}id_t.

Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-26 10:03:34 +02:00
Miklos Szeredi
9fa248c65b fuse: fix readdir cache race
There's a race in fuse's readdir cache that can result in an uninitilized
page being read.  The page lock is supposed to prevent this from happening
but in the following case it doesn't:

Two fuse_add_dirent_to_cache() start out and get the same parameters
(size=0,offset=0).  One of them wins the race to create and lock the page,
after which it fills in data, sets rdc.size and unlocks the page.

In the meantime the page gets evicted from the cache before the other
instance gets to run.  That one also creates the page, but finds the
size to be mismatched, bails out and leaves the uninitialized page in the
cache.

Fix by marking a filled page uptodate and ignoring non-uptodate pages.

Reported-by: Frank Sorenson <fsorenso@redhat.com>
Fixes: 5d7bc7e868 ("fuse: allow using readdir cache")
Cc: <stable@vger.kernel.org> # v4.20
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-10-20 17:18:58 +02:00
Christian Brauner
cac2f8b8d8
fs: rename current get acl method
The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

The current inode operation for getting posix acls takes an inode
argument but various filesystems (e.g., 9p, cifs, overlayfs) need access
to the dentry. In contrast to the ->set_acl() inode operation we cannot
simply extend ->get_acl() to take a dentry argument. The ->get_acl()
inode operation is called from:

acl_permission_check()
-> check_acl()
   -> get_acl()

which is part of generic_permission() which in turn is part of
inode_permission(). Both generic_permission() and inode_permission() are
called in the ->permission() handler of various filesystems (e.g.,
overlayfs). So simply passing a dentry argument to ->get_acl() would
amount to also having to pass a dentry argument to ->permission(). We
should avoid this unnecessary change.

So instead of extending the existing inode operation rename it from
->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that
passes a dentry argument and which filesystems that need access to the
dentry can implement instead of ->get_inode_acl(). Filesystems like cifs
which allow setting and getting posix acls but not using them for
permission checking during lookup can simply not implement
->get_inode_acl().

This is intended to be a non-functional change.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-20 10:13:27 +02:00
Christian Brauner
138060ba92
fs: pass dentry to set acl method
The current way of setting and getting posix acls through the generic
xattr interface is error prone and type unsafe. The vfs needs to
interpret and fixup posix acls before storing or reporting it to
userspace. Various hacks exist to make this work. The code is hard to
understand and difficult to maintain in it's current form. Instead of
making this work by hacking posix acls through xattr handlers we are
building a dedicated posix acl api around the get and set inode
operations. This removes a lot of hackiness and makes the codepaths
easier to maintain. A lot of background can be found in [1].

Since some filesystem rely on the dentry being available to them when
setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode
operation. But since ->set_acl() is required in order to use the generic
posix acl xattr handlers filesystems that do not implement this inode
operation cannot use the handler and need to implement their own
dedicated posix acl handlers.

Update the ->set_acl() inode method to take a dentry argument. This
allows all filesystems to rely on ->set_acl().

As far as I can tell all codepaths can be switched to rely on the dentry
instead of just the inode. Note that the original motivation for passing
the dentry separate from the inode instead of just the dentry in the
xattr handlers was because of security modules that call
security_d_instantiate(). This hook is called during
d_instantiate_new(), d_add(), __d_instantiate_anon(), and
d_splice_alias() to initialize the inode's security context and possibly
to set security.* xattrs. Since this only affects security.* xattrs this
is completely irrelevant for posix acls.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-19 12:55:42 +02:00
Christian Brauner
ed5a7047d2
attr: use consistent sgid stripping checks
Currently setgid stripping in file_remove_privs()'s should_remove_suid()
helper is inconsistent with other parts of the vfs. Specifically, it only
raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
inode isn't in the caller's groups and the caller isn't privileged over the
inode although we require this already in setattr_prepare() and
setattr_copy() and so all filesystem implement this requirement implicitly
because they have to use setattr_{prepare,copy}() anyway.

But the inconsistency shows up in setgid stripping bugs for overlayfs in
xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
generic/687). For example, we test whether suid and setgid stripping works
correctly when performing various write-like operations as an unprivileged
user (fallocate, reflink, write, etc.):

echo "Test 1 - qa_user, non-exec file $verb"
setup_testfile
chmod a+rws $junk_file
commit_and_check "$qa_user" "$verb" 64k 64k

The test basically creates a file with 6666 permissions. While the file has
the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
regular filesystem like xfs what will happen is:

sys_fallocate()
-> vfs_fallocate()
   -> xfs_file_fallocate()
      -> file_modified()
         -> __file_remove_privs()
            -> dentry_needs_remove_privs()
               -> should_remove_suid()
            -> __remove_privs()
               newattrs.ia_valid = ATTR_FORCE | kill;
               -> notify_change()
                  -> setattr_copy()

In should_remove_suid() we can see that ATTR_KILL_SUID is raised
unconditionally because the file in the test has S_ISUID set.

But we also see that ATTR_KILL_SGID won't be set because while the file
is S_ISGID it is not S_IXGRP (see above) which is a condition for
ATTR_KILL_SGID being raised.

So by the time we call notify_change() we have attr->ia_valid set to
ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
ATTR_KILL_SUID is set and does:

ia_valid = attr->ia_valid |= ATTR_MODE
attr->ia_mode = (inode->i_mode & ~S_ISUID);

which means that when we call setattr_copy() later we will definitely
update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

Now we call into the filesystem's ->setattr() inode operation which will
end up calling setattr_copy(). Since ATTR_MODE is set we will hit:

if (ia_valid & ATTR_MODE) {
        umode_t mode = attr->ia_mode;
        vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
        if (!vfsgid_in_group_p(vfsgid) &&
            !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                mode &= ~S_ISGID;
        inode->i_mode = mode;
}

and since the caller in the test is neither capable nor in the group of the
inode the S_ISGID bit is stripped.

But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
has the consequence that neither the setgid nor the suid bits are stripped
even though it should be stripped because the inode isn't in the caller's
groups and the caller isn't privileged over the inode.

If overlayfs is in the mix things become a bit more complicated and the bug
shows up more clearly. When e.g., ovl_setattr() is hit from
ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
ATTR_KILL_SGID might be raised but because the check in notify_change() is
questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
stripped the S_ISGID bit isn't removed even though it should be stripped:

sys_fallocate()
-> vfs_fallocate()
   -> ovl_fallocate()
      -> file_remove_privs()
         -> dentry_needs_remove_privs()
            -> should_remove_suid()
         -> __remove_privs()
            newattrs.ia_valid = ATTR_FORCE | kill;
            -> notify_change()
               -> ovl_setattr()
                  // TAKE ON MOUNTER'S CREDS
                  -> ovl_do_notify_change()
                     -> notify_change()
                  // GIVE UP MOUNTER'S CREDS
     // TAKE ON MOUNTER'S CREDS
     -> vfs_fallocate()
        -> xfs_file_fallocate()
           -> file_modified()
              -> __file_remove_privs()
                 -> dentry_needs_remove_privs()
                    -> should_remove_suid()
                 -> __remove_privs()
                    newattrs.ia_valid = attr_force | kill;
                    -> notify_change()

The fix for all of this is to make file_remove_privs()'s
should_remove_suid() helper to perform the same checks as we already
require in setattr_prepare() and setattr_copy() and have notify_change()
not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
first place because the caller must calculate the flags via
should_remove_suid() anyway which would raise ATTR_KILL_SGID.

While we're at it we move should_remove_suid() from inode.c to attr.c
where it belongs with the rest of the iattr helpers. Especially since it
returns ATTR_KILL_S{G,U}ID flags. We also rename it to
setattr_should_drop_suidgid() to better reflect that it indicates both
setuid and setgid bit removal and also that it returns attr flags.

Running xfstests with this doesn't report any regressions. We should really
try and use consistent checks.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-18 10:09:47 +02:00
Linus Torvalds
f721d24e5d tmpfile API change
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY0DP2AAKCRBZ7Krx/gZQ
 6/+qAQCEGQWpcC5MB17zylaX7gqzhgAsDrwtpevlno3aIv/1pQD/YWr/E8tf7WTW
 ERXRXMRx1cAzBJhUhVgIY+3ANfU2Rg4=
 =cko4
 -----END PGP SIGNATURE-----

Merge tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs tmpfile updates from Al Viro:
 "Miklos' ->tmpfile() signature change; pass an unopened struct file to
  it, let it open the damn thing. Allows to add tmpfile support to FUSE"

* tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fuse: implement ->tmpfile()
  vfs: open inside ->tmpfile()
  vfs: move open right after ->tmpfile()
  vfs: make vfs_tmpfile() static
  ovl: use vfs_tmpfile_open() helper
  cachefiles: use vfs_tmpfile_open() helper
  cachefiles: only pass inode to *mark_inode_inuse() helpers
  cachefiles: tmpfile error handling cleanup
  hugetlbfs: cleanup mknod and tmpfile
  vfs: add vfs_tmpfile_open() helper
2022-10-10 19:45:17 -07:00
Yu Zhao
ec1c86b25f mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations.  They form a closed-loop system, i.e., "the page reclaim". 
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim.  These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.

There are also two access patterns: one with temporal locality and the
other without.  For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults".  Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting.  The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction.  The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then.  This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:09 -07:00
Miklos Szeredi
7d37539037 fuse: implement ->tmpfile()
This is basically equivalent to the FUSE_CREATE operation which creates and
opens a regular file.

Add a new FUSE_TMPFILE operation, otherwise just reuse the protocol and the
code for FUSE_CREATE.

Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-09-24 07:00:00 +02:00
Linus Torvalds
f30adc0d33 iov_iter stuff, part 2, rebased
* more new_sync_{read,write}() speedups - ITER_UBUF introduction
 * ITER_PIPE cleanups
 * unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
   switching them to advancing semantics
 * making ITER_PIPE take high-order pages without splitting them
 * handling copy_page_from_iter() for high-order pages properly
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYvHI8QAKCRBZ7Krx/gZQ
 62CQAPsGlbebqBeAT2pMulaGDxfLAsgz5Yf4BEaMLhPtRqFOQgD+KrZQId7Sd8O0
 3IWucpTb2c4jvLlXhGMS+XWnusQH+AQ=
 =pBux
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull more iov_iter updates from Al Viro:

 - more new_sync_{read,write}() speedups - ITER_UBUF introduction

 - ITER_PIPE cleanups

 - unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
   switching them to advancing semantics

 - making ITER_PIPE take high-order pages without splitting them

 - handling copy_page_from_iter() for high-order pages properly

* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
  fix copy_page_from_iter() for compound destinations
  hugetlbfs: copy_page_to_iter() can deal with compound pages
  copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
  expand those iov_iter_advance()...
  pipe_get_pages(): switch to append_pipe()
  get rid of non-advancing variants
  ceph: switch the last caller of iov_iter_get_pages_alloc()
  9p: convert to advancing variant of iov_iter_get_pages_alloc()
  af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
  iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
  block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
  iov_iter: saner helper for page array allocation
  fold __pipe_get_pages() into pipe_get_pages()
  ITER_XARRAY: don't open-code DIV_ROUND_UP()
  unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
  unify xarray_get_pages() and xarray_get_pages_alloc()
  unify pipe_get_pages() and pipe_get_pages_alloc()
  iov_iter_get_pages(): sanity-check arguments
  iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
  ...
2022-08-08 20:04:35 -07:00
Al Viro
1ef255e257 iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
Most of the users immediately follow successful iov_iter_get_pages()
with advancing by the amount it had returned.

Provide inline wrappers doing that, convert trivial open-coded
uses of those.

BTW, iov_iter_get_pages() never returns more than it had been asked
to; such checks in cifs ought to be removed someday...

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:22 -04:00
Al Viro
fcb14cb1bd new iov_iter flavour - ITER_UBUF
Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.

We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.

New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.

DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:15 -04:00
Linus Torvalds
2bd5d41e0e fuse update for 6.0
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYvD4IQAKCRDh3BK/laaZ
 PDHHAP93H+2E9c6biGd5pEaL2ABChRY+wsQURzD+SZ6AL3JaUQD+KHxA4q0MJvws
 L6CWcf2XptUDCLe3P6sgSTvv5Gk1OAM=
 =FoXF
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Fix an issue with reusing the bdi in case of block based filesystems

 - Allow root (in init namespace) to access fuse filesystems in user
   namespaces if expicitly enabled with a module param

 - Misc fixes

* tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: retire block-device-based superblock on force unmount
  vfs: function to prevent re-use of block-device-based superblocks
  virtio_fs: Modify format for virtio_fs_direct_access
  virtiofs: delete unused parameter for virtio_fs_cleanup_vqs
  fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other
  fuse: Remove the control interface for virtio-fs
  fuse: ioctl: translate ENOSYS
  fuse: limit nsec
  fuse: avoid unnecessary spinlock bump
  fuse: fix deadlock between atomic O_TRUNC and page invalidation
  fuse: write inode in fuse_release()
2022-08-08 11:10:02 -07:00
Daniil Lunev
247861c325 fuse: retire block-device-based superblock on force unmount
Force unmount of FUSE severes the connection with the user space, even if
there are still open files.  Subsequent remount tries to re-use the
superblock held by the open files, which is meaningless in the FUSE case
after disconnect - reused super block doesn't have userspace counterpart
attached to it and is incapable of doing any IO.

This patch adds the functionality only for the block-device-based supers,
since the primary use case of the feature is to gracefully handle force
unmount of external devices, mounted with FUSE.  This can be further
extended to cover all superblocks, if the need arises.

Signed-off-by: Daniil Lunev <dlunev@chromium.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-27 11:30:31 +02:00
Deming Wang
73fb2c8b61 virtio_fs: Modify format for virtio_fs_direct_access
We should isolate operators with spaces.

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-26 10:38:58 +02:00
Deming Wang
1e5b9e048c virtiofs: delete unused parameter for virtio_fs_cleanup_vqs
fs parameter not used. So, it needs to be deleted.

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:19 +02:00
Dave Marchevsky
9ccf47b26b fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other
Since commit 73f03c2b4b ("fuse: Restrict allow_other to the superblock's
namespace or a descendant"), access to allow_other FUSE filesystems has
been limited to users in the mounting user namespace or descendants. This
prevents a process that is privileged in its userns - but not its parent
namespaces - from mounting a FUSE fs w/ allow_other that is accessible to
processes in parent namespaces.

While this restriction makes sense overall it breaks a legitimate usecase:
I have a tracing daemon which needs to peek into process' open files in
order to symbolicate - similar to 'perf'. The daemon is a privileged
process in the root userns, but is unable to peek into FUSE filesystems
mounted by processes in child namespaces.

This patch adds a module param, allow_sys_admin_access, to act as an escape
hatch for this descendant userns logic and for the allow_other mount option
in general. Setting allow_sys_admin_access allows processes with
CAP_SYS_ADMIN in the initial userns to access FUSE filesystems irrespective
of the mounting userns or whether allow_other was set. A sysadmin setting
this param must trust FUSEs on the host to not DoS processes as described
in 73f03c2b4b.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:19 +02:00
Xie Yongji
c64797809a fuse: Remove the control interface for virtio-fs
The commit 15c8e72e88 ("fuse: allow skipping control interface and forced
unmount") tries to remove the control interface for virtio-fs since it does
not support aborting requests which are being processed. But it doesn't
work now.

This patch fixes it by skipping creating the control interface if
fuse_conn->no_control is set.

Fixes: 15c8e72e88 ("fuse: allow skipping control interface and forced unmount")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:19 +02:00
Miklos Szeredi
02c0cab8e7 fuse: ioctl: translate ENOSYS
Overlayfs may fail to complete updates when a filesystem lacks
fileattr/xattr syscall support and responds with an ENOSYS error code,
resulting in an unexpected "Function not implemented" error.

This bug may occur with FUSE filesystems, such as davfs2.

Steps to reproduce:

  # install davfs2, e.g., apk add davfs2
  mkdir /test mkdir /test/lower /test/upper /test/work /test/mnt
  yes '' | mount -t davfs -o ro http://some-web-dav-server/path \
    /test/lower
  mount -t overlay -o upperdir=/test/upper,lowerdir=/test/lower \
    -o workdir=/test/work overlay /test/mnt

  # when "some-file" exists in the lowerdir, this fails with "Function
  # not implemented", with dmesg showing "overlayfs: failed to retrieve
  # lower fileattr (/some-file, err=-38)"
  touch /test/mnt/some-file

The underlying cause of this regresion is actually in FUSE, which fails to
translate the ENOSYS error code returned by userspace filesystem (which
means that the ioctl operation is not supported) to ENOTTY.

Reported-by: Christian Kohlschütter <christian@kohlschutter.com>
Fixes: 72db82115d ("ovl: copy up sync/noatime fileattr flags")
Fixes: 59efec7b90 ("fuse: implement ioctl support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:18 +02:00
Miklos Szeredi
47912eaa06 fuse: limit nsec
Limit nanoseconds to 0..999999999.

Fixes: d8a5ba4545 ("[PATCH] FUSE - core")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:06:18 +02:00
Jeffle Xu
47e301491c fuse: avoid unnecessary spinlock bump
Move dmap free worker kicker inside the critical region, so that extra
spinlock lock/unlock could be avoided.

Suggested-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:02:45 +02:00
Miklos Szeredi
2fdbb8dd01 fuse: fix deadlock between atomic O_TRUNC and page invalidation
fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic
O_TRUNC open(), so commit 76224355db ("fuse: truncate pagecache on
atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache()
in such a case to avoid the A-A deadlock. However, we found another A-B-B-A
deadlock related to the case above, which will cause the xfstests
generic/464 testcase hung in our virtio-fs test environment.

For example, consider two processes concurrently open one same file, one
with O_TRUNC and another without O_TRUNC. The deadlock case is described
below, if open(O_TRUNC) is already set_nowrite(acquired A), and is trying
to lock a page (acquiring B), open() could have held the page lock
(acquired B), and waiting on the page writeback (acquiring A). This would
lead to deadlocks.

open(O_TRUNC)
----------------------------------------------------------------
fuse_open_common
  inode_lock            [C acquire]
  fuse_set_nowrite      [A acquire]

  fuse_finish_open
    truncate_pagecache
      lock_page         [B acquire]
      truncate_inode_page
      unlock_page       [B release]

  fuse_release_nowrite  [A release]
  inode_unlock          [C release]
----------------------------------------------------------------

open()
----------------------------------------------------------------
fuse_open_common
  fuse_finish_open
    invalidate_inode_pages2
      lock_page         [B acquire]
        fuse_launder_page
          fuse_wait_on_page_writeback [A acquire & release]
      unlock_page       [B release]
----------------------------------------------------------------

Besides this case, all calls of invalidate_inode_pages2() and
invalidate_inode_pages2_range() in fuse code also can deadlock with
open(O_TRUNC).

Fix by moving the truncate_pagecache() call outside the nowrite protected
region.  The nowrite protection is only for delayed writeback
(writeback_cache) case, where inode lock does not protect against
truncation racing with writes on the server.  Write syscalls racing with
page cache truncation still get the inode lock protection.

This patch also changes the order of filemap_invalidate_lock()
vs. fuse_set_nowrite() in fuse_open_common().  This new order matches the
order found in fuse_file_fallocate() and fuse_do_setattr().

Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Fixes: e4648309b8 ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:02:45 +02:00
Miklos Szeredi
035ff33cf4 fuse: write inode in fuse_release()
A race between write(2) and close(2) allows pages to be dirtied after
fuse_flush -> write_inode_now().  If these pages are not flushed from
fuse_release(), then there might not be a writable open file later.  So any
remaining dirty pages must be written back before the file is released.

This is a partial revert of the blamed commit.

Reported-by: syzbot+6e1efbd8efaaa6860e91@syzkaller.appspotmail.com
Fixes: 36ea23374d ("fuse: write inode in fuse_vma_close() instead of fuse_release()")
Cc: <stable@vger.kernel.org> # v5.16
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-21 16:02:45 +02:00
Al Viro
91b94c5d6a iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb).  Checks converted, which allows to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
from iocb_flags() - it's checked in iocb_is_dsync() instead

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-06-10 16:05:15 -04:00
Linus Torvalds
35cdd8656e libnvdimm for 5.19
- Add support for clearing memory error via pwrite(2) on DAX
 
 - Fix 'security overwrite' support in the presence of media errors
 
 - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYpFPcQAKCRDfioYZHlFs
 Z9A3AQCdfoT5sY3OK+I/3oTvJ//6lw2MtXrnXFM046ICKPi9sgD8CzR9mRAHA+vj
 kxOtJEU2bA9naninXGORsDUndiNkwQo=
 =gVIn
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm and DAX updates from Dan Williams:
 "New support for clearing memory errors when a file is in DAX mode,
  alongside with some other fixes and cleanups.

  Previously it was only possible to clear these errors using a truncate
  or hole-punch operation to trigger the filesystem to reallocate the
  block, now, any page aligned write can opportunistically clear errors
  as well.

  This change spans x86/mm, nvdimm, and fs/dax, and has received the
  appropriate sign-offs. Thanks to Jane for her work on this.

  Summary:

   - Add support for clearing memory error via pwrite(2) on DAX

   - Fix 'security overwrite' support in the presence of media errors

   - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)"

* tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  pmem: implement pmem_recovery_write()
  pmem: refactor pmem_clear_poison()
  dax: add .recovery_write dax_operation
  dax: introduce DAX_RECOVERY_WRITE dax access mode
  mce: fix set_mce_nospec to always unmap the whole page
  x86/mce: relocate set{clear}_mce_nospec() functions
  acpi/nfit: rely on mce->misc to determine poison granularity
  testing: nvdimm: asm/mce.h is not needed in nfit.c
  testing: nvdimm: iomap: make __nfit_test_ioremap a macro
  nvdimm: Allow overwrite in the presence of disabled dimms
  tools/testing/nvdimm: remove unneeded flush_workqueue
2022-05-27 15:49:30 -07:00
Jane Chu
e511c4a3d2 dax: introduce DAX_RECOVERY_WRITE dax access mode
Up till now, dax_direct_access() is used implicitly for normal
access, but for the purpose of recovery write, dax range with
poison is requested.  To make the interface clear, introduce
	enum dax_access_mode {
		DAX_ACCESS,
		DAX_RECOVERY_WRITE,
	}
where DAX_ACCESS is used for normal dax access, and
DAX_RECOVERY_WRITE is used for dax recovery write.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-16 13:35:56 -07:00
Matthew Wilcox (Oracle)
5efd00e489 fuse: Convert fuse to read_folio
This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09 16:21:45 -04:00
Matthew Wilcox (Oracle)
9d6b0cd757 fs: Remove flags parameter from aops->write_begin
There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:19 -04:00
Matthew Wilcox (Oracle)
b7446e7cf1 fs: Remove aop flags parameter from grab_cache_page_write_begin()
There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:19 -04:00
Matthew Wilcox (Oracle)
704528d895 fs: Remove ->readpages address space operation
All filesystems have now been converted to use ->readahead, so
remove the ->readpages operation and fix all the comments that
used to refer to it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-01 13:45:33 -04:00
Linus Torvalds
7001052160 Add support for Intel CET-IBT, available since Tigerlake (11th gen), which is a
coarse grained, hardware based, forward edge Control-Flow-Integrity mechanism
 where any indirect CALL/JMP must target an ENDBR instruction or suffer #CP.
 
 Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation is
 limited to 2 instructions (and typically fewer) on branch targets not starting
 with ENDBR. CET-IBT also limits speculation of the next sequential instruction
 after the indirect CALL/JMP [1].
 
 CET-IBT is fundamentally incompatible with retpolines, but provides, as
 described above, speculation limits itself.
 
 [1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmI/LI8VHHBldGVyekBp
 bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6ZnkP/2QCgQLTu6oRxv9O020CHwlaSEeD
 1Hoy3loum5q5hAi1Ik3dR9p0H5u64c9qbrBVxaFoNKaLt5GKrtHaDSHNk2L/CFHX
 urpH65uvTLxbyZzcahkAahoJ71XU+m7PcrHLWMunw9sy10rExYVsUOlFyoyG6XCF
 BDCNZpdkC09ZM3vwlWGMZd5Pp+6HcZNPyoV9tpvWAS2l+WYFWAID7mflbpQ+tA8b
 y/hM6b3Ud0rT2ubuG1iUpopgNdwqQZ+HisMPGprh+wKZkYwS2l8pUTrz0MaBkFde
 go7fW16kFy2HQzGm6aIEBmfcg0palP/mFVaWP0zS62LwhJSWTn5G6xWBr3yxSsht
 9gWCiI0oDZuTg698MedWmomdG2SK6yAuZuqmdKtLLoWfWgviPEi7TDFG/cKtZdAW
 ag8GM8T4iyYZzpCEcWO9GWbjo6TTGq30JBQefCBG47GjD0csv2ubXXx0Iey+jOwT
 x3E8wnv9dl8V9FSd/tMpTFmje8ges23yGrWtNpb5BRBuWTeuGiBPZED2BNyyIf+T
 dmewi2ufNMONgyNp27bDKopY81CPAQq9cVxqNm9Cg3eWPFnpOq2KGYEvisZ/rpEL
 EjMQeUBsy/C3AUFAleu1vwNnkwP/7JfKYpN00gnSyeQNZpqwxXBCKnHNgOMTXyJz
 beB/7u2KIUbKEkSN
 =jZfK
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 CET-IBT (Control-Flow-Integrity) support from Peter Zijlstra:
 "Add support for Intel CET-IBT, available since Tigerlake (11th gen),
  which is a coarse grained, hardware based, forward edge
  Control-Flow-Integrity mechanism where any indirect CALL/JMP must
  target an ENDBR instruction or suffer #CP.

  Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation
  is limited to 2 instructions (and typically fewer) on branch targets
  not starting with ENDBR. CET-IBT also limits speculation of the next
  sequential instruction after the indirect CALL/JMP [1].

  CET-IBT is fundamentally incompatible with retpolines, but provides,
  as described above, speculation limits itself"

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html

* tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  kvm/emulate: Fix SETcc emulation for ENDBR
  x86/Kconfig: Only allow CONFIG_X86_KERNEL_IBT with ld.lld >= 14.0.0
  x86/Kconfig: Only enable CONFIG_CC_HAS_IBT for clang >= 14.0.0
  kbuild: Fixup the IBT kbuild changes
  x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
  x86: Remove toolchain check for X32 ABI capability
  x86/alternative: Use .ibt_endbr_seal to seal indirect calls
  objtool: Find unused ENDBR instructions
  objtool: Validate IBT assumptions
  objtool: Add IBT/ENDBR decoding
  objtool: Read the NOENDBR annotation
  x86: Annotate idtentry_df()
  x86,objtool: Move the ASM_REACHABLE annotation to objtool.h
  x86: Annotate call_on_stack()
  objtool: Rework ASM_REACHABLE
  x86: Mark __invalid_creds() __noreturn
  exit: Mark do_group_exit() __noreturn
  x86: Mark stop_this_cpu() __noreturn
  objtool: Ignore extra-symbol code
  objtool: Rename --duplicate to --lto
  ...
2022-03-27 10:17:23 -07:00
Linus Torvalds
6b1f86f8e9 Filesystem folio changes for 5.18
Primarily this series converts some of the address_space operations
 to take a folio instead of a page.
 
 ->is_partially_uptodate() takes a folio instead of a page and changes the
 type of the 'from' and 'count' arguments to make it obvious they're bytes.
 ->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
 ->launder_page() becomes ->launder_folio()
 ->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
 an argument.
 
 There are a couple of other misc changes up front that weren't worth
 separating into their own pull request.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
 gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
 BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
 SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
 EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
 omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
 Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
 =cOiq
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache

Pull filesystem folio updates from Matthew Wilcox:
 "Primarily this series converts some of the address_space operations to
  take a folio instead of a page.

  Notably:

   - a_ops->is_partially_uptodate() takes a folio instead of a page and
     changes the type of the 'from' and 'count' arguments to make it
     obvious they're bytes.

   - a_ops->invalidatepage() becomes ->invalidate_folio() and has a
     similar type change.

   - a_ops->launder_page() becomes ->launder_folio()

   - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
     address_space as an argument.

  There are a couple of other misc changes up front that weren't worth
  separating into their own pull request"

* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
  fs: Remove aops ->set_page_dirty
  fb_defio: Use noop_dirty_folio()
  fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
  fs: Convert __set_page_dirty_buffers to block_dirty_folio
  nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
  mm: Convert swap_set_page_dirty() to swap_dirty_folio()
  ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
  f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
  f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
  f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
  afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
  btrfs: Convert extent_range_redirty_for_io() to use folios
  fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
  btrfs: Convert from set_page_dirty to dirty_folio
  fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
  fs: Add aops->dirty_folio
  fs: Remove aops->launder_page
  orangefs: Convert launder_page to launder_folio
  nfs: Convert from launder_page to launder_folio
  fuse: Convert from launder_page to launder_folio
  ...
2022-03-22 18:26:56 -07:00
Linus Torvalds
9030fb0bb9 Folio changes for 5.18
- Rewrite how munlock works to massively reduce the contention
    on i_mmap_rwsem (Hugh Dickins):
    https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
  - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
    https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
  - Convert GUP to use folios and make pincount available for order-1
    pages. (Matthew Wilcox)
  - Convert a few more truncation functions to use folios (Matthew Wilcox)
  - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
  - Convert rmap_walk to use folios (Matthew Wilcox)
  - Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
  - Add support for creating large folios in readahead (Matthew Wilcox)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
 gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
 P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
 s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
 Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
 CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
 v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
 =n9Ad
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache

Pull folio updates from Matthew Wilcox:

 - Rewrite how munlock works to massively reduce the contention on
   i_mmap_rwsem (Hugh Dickins):

     https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/

 - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
   Hellwig):

     https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/

 - Convert GUP to use folios and make pincount available for order-1
   pages. (Matthew Wilcox)

 - Convert a few more truncation functions to use folios (Matthew
   Wilcox)

 - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
   Wilcox)

 - Convert rmap_walk to use folios (Matthew Wilcox)

 - Convert most of shrink_page_list() to use a folio (Matthew Wilcox)

 - Add support for creating large folios in readahead (Matthew Wilcox)

* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
  mm/damon: minor cleanup for damon_pa_young
  selftests/vm/transhuge-stress: Support file-backed PMD folios
  mm/filemap: Support VM_HUGEPAGE for file mappings
  mm/readahead: Switch to page_cache_ra_order
  mm/readahead: Align file mappings for non-DAX
  mm/readahead: Add large folio readahead
  mm: Support arbitrary THP sizes
  mm: Make large folios depend on THP
  mm: Fix READ_ONLY_THP warning
  mm/filemap: Allow large folios to be added to the page cache
  mm: Turn can_split_huge_page() into can_split_folio()
  mm/vmscan: Convert pageout() to take a folio
  mm/vmscan: Turn page_check_references() into folio_check_references()
  mm/vmscan: Account large folios correctly
  mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
  mm/vmscan: Free non-shmem folios without splitting them
  mm/rmap: Constify the rmap_walk_control argument
  mm/rmap: Convert rmap_walk() to take a folio
  mm: Turn page_anon_vma() into folio_anon_vma()
  mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
  ...
2022-03-22 17:03:12 -07:00
Muchun Song
fd60b28842 fs: allocate inode by using alloc_inode_sb()
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().

Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>		[ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:03 -07:00
NeilBrown
670d21c6e1 fuse: remove reliance on bdi congestion
The bdi congestion tracking in not widely used and will be removed.

Fuse is one of a small number of filesystems that uses it, setting both
the sync (read) and async (write) congestion flags at what it determines
are appropriate times.

The only remaining effect of the sync flag is to cause read-ahead to be
skipped.  The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flags, change:

 - .readahead to stop when it has submitted all non-async pages for
   read.

 - .writepages to do nothing if WB_SYNC_NONE and the flag would be set

 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE and the
   flag would be set.

The writepages change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will be
called on the page which (I think) will further delay the next attempt at
writeout.  This might be a good thing.

Link: https://lkml.kernel.org/r/164549983737.9187.2627117501000365074.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:00 -07:00
Matthew Wilcox (Oracle)
46de8b9794 fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
This is a mechanical change.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-16 13:37:05 -04:00
Matthew Wilcox (Oracle)
187c82cb03 fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
These filesystems use __set_page_dirty_nobuffers() either directly or
with a very thin wrapper; convert them en masse.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:34:38 -04:00
Matthew Wilcox (Oracle)
2bf06b8e64 fuse: Convert from launder_page to launder_folio
Straightforward conversion although the helper functions still assume
a single page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:23:30 -04:00
Matthew Wilcox (Oracle)
5660a8630d fs: Remove noop_invalidatepage()
We used to have to use noop_invalidatepage() to prevent
block_invalidatepage() from being called, but that behaviour is now gone.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:23:29 -04:00
Masahiro Yamada
83a44a4f47 x86: Remove toolchain check for X32 ABI capability
Commit 0bf6276392 ("x32: Warn and disable rather than error if
binutils too old") added a small test in arch/x86/Makefile because
binutils 2.22 or newer is needed to properly support elf32-x86-64. This
check is no longer necessary, as the minimum supported version of
binutils is 2.23, which is enforced at configuration time with
scripts/min-tool-version.sh.

Remove this check and replace all uses of CONFIG_X86_X32 with
CONFIG_X86_X32_ABI, as two symbols are no longer necessary.

[nathan: Rebase, fix up a few places where CONFIG_X86_X32 was still
         used, and simplify commit message to satisfy -tip requirements]

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220314194842.3452-2-nathan@kernel.org
2022-03-15 10:32:48 +01:00
Miklos Szeredi
0c4bcfdecb fuse: fix pipe buffer lifetime for direct_io
In FOPEN_DIRECT_IO mode, fuse_file_write_iter() calls
fuse_direct_write_iter(), which normally calls fuse_direct_io(), which then
imports the write buffer with fuse_get_user_pages(), which uses
iov_iter_get_pages() to grab references to userspace pages instead of
actually copying memory.

On the filesystem device side, these pages can then either be read to
userspace (via fuse_dev_read()), or splice()d over into a pipe using
fuse_dev_splice_read() as pipe buffers with &nosteal_pipe_buf_ops.

This is wrong because after fuse_dev_do_read() unlocks the FUSE request,
the userspace filesystem can mark the request as completed, causing write()
to return. At that point, the userspace filesystem should no longer have
access to the pipe buffer.

Fix by copying pages coming from the user address space to new pipe
buffers.

Reported-by: Jann Horn <jannh@google.com>
Fixes: c3021629a0 ("fuse: support splice() reading from fuse device")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-03-07 16:30:44 +01:00
Christoph Hellwig
dc90f0846d mm: don't include <linux/memremap.h> in <linux/mm.h>
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.

Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-03 12:47:33 -05:00
Jeff Layton
c086df4902 fuse: move FUSE_SUPER_MAGIC definition to magic.h
...to help userland apps that need to identify FUSE mounts.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-02-21 14:57:26 +01:00
Miklos Szeredi
a679a61520 fuse: fix fileattr op failure
The fileattr API conversion broke lsattr on ntfs3g.

Previously the ioctl(... FS_IOC_GETFLAGS) returned an EINVAL error, but
after the conversion the error returned by the fuse filesystem was not
propagated back to the ioctl() system call, resulting in success being
returned with bogus values.

Fix by checking for outarg.result in fuse_priv_ioctl(), just as generic
ioctl code does.

Reported-by: Jean-Pierre André <jean-pierre.andre@wanadoo.fr>
Fixes: 72227eac17 ("fuse: convert to fileattr")
Cc: <stable@vger.kernel.org> # v5.13
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-02-18 11:47:51 +01:00
Linus Torvalds
3bf6a9e36e virtio,vdpa,qemu_fw_cfg: features, cleanups, fixes
partial support for < MAX_ORDER - 1 granularity for virtio-mem
 driver_override for vdpa
 sysfs ABI documentation for vdpa
 multiqueue config support for mlx5 vdpa
 
 Misc fixes, cleanups.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmHiDHkPHG1zdEByZWRo
 YXQuY29tAAoJECgfDbjSjVRpVT4H/3Veixt3uYPOmuLU2tSx+8X+sFTtik81hyiE
 okz5fRJrxxA8SqS76FnmO10FS4hlPOGNk0Z5WVhr0yihwFvPLvpCM/xi2Lmrz9I7
 pB0sXOIocEL1xApsxukR9K1Twpb2hfYsflbJYUVlRfhS5G0izKJNZp5I7OPrzd80
 vVNNDWKW2iLDlfqsavumI4Kvm4nsFuCHG03jzMtcIa7YTXYV3DORD4ZGFFVUOIQN
 t5F74TznwHOeYgJeg7TzjFjfPWmXjLetvx10QX1A1uOvwppWW/QY6My0UafTXNXj
 VB3gOwJPf+gxXAXl/4bafq4NzM0xys6cpcPpjvhmU+erY4UuyAU=
 =Y1eO
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
 "virtio,vdpa,qemu_fw_cfg: features, cleanups, and fixes.

   - partial support for < MAX_ORDER - 1 granularity for virtio-mem

   - driver_override for vdpa

   - sysfs ABI documentation for vdpa

   - multiqueue config support for mlx5 vdpa

   - and misc fixes, cleanups"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (42 commits)
  vdpa/mlx5: Fix tracking of current number of VQs
  vdpa/mlx5: Fix is_index_valid() to refer to features
  vdpa: Protect vdpa reset with cf_mutex
  vdpa: Avoid taking cf_mutex lock on get status
  vdpa/vdpa_sim_net: Report max device capabilities
  vdpa: Use BIT_ULL for bit operations
  vdpa/vdpa_sim: Configure max supported virtqueues
  vdpa/mlx5: Report max device capabilities
  vdpa: Support reporting max device capabilities
  vdpa/mlx5: Restore cur_num_vqs in case of failure in change_num_qps()
  vdpa: Add support for returning device configuration information
  vdpa/mlx5: Support configuring max data virtqueue
  vdpa/mlx5: Fix config_attr_mask assignment
  vdpa: Allow to configure max data virtqueues
  vdpa: Read device configuration only if FEATURES_OK
  vdpa: Sync calls set/get config/status with cf_mutex
  vdpa/mlx5: Distribute RX virtqueues in RQT object
  vdpa: Provide interface to read driver features
  vdpa: clean up get_config_size ret value handling
  virtio_ring: mark ring unused on error
  ...
2022-01-18 10:05:48 +02:00
Michael S. Tsirkin
d9679d0013 virtio: wrap config->reset calls
This will enable cleanups down the road.
The idea is to disable cbs, then add "flush_queued_cbs" callback
as a parameter, this way drivers can flush any work
queued after callbacks have been disabled.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20211013105226.20225-1-mst@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-01-14 18:50:52 -05:00
Linus Torvalds
3acbdbf42e dax + libnvdimm for v5.17
- Simplify the dax_operations API
   - Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining
     and applying a partition offset to all its DAX iomap operations.
   - Remove wrappers and device-mapper stacked callbacks for
     ->copy_from_iter() and ->copy_to_iter() in favor of moving
     block_device relative offset responsibility to the
     dax_direct_access() caller.
   - Remove the need for an @bdev in filesystem-DAX infrastructure
   - Remove unused uio helpers copy_from_iter_flushcache() and
     copy_mc_to_iter() as only the non-check_copy_size() versions are
     used for DAX.
 - Prepare XFS for the pending (next merge window) DAX+reflink support
 - Remove deprecated DEV_DAX_PMEM_COMPAT support
 - Cleanup a straggling misuse of the GUID api
 
 Tags offered after the branch was cut:
 Reviewed-by: Mike Snitzer <snitzer@redhat.com>
 Link: https://lore.kernel.org/r/Ydb/3P+8nvjCjYfO@redhat.com
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYd3dTAAKCRDfioYZHlFs
 Z//UAP9zetoTE+O7zJG7CXja4jSopSadbdbh6QKSXaqfKBPvQQD+N4US3wA2bGv8
 f/qCY62j2Hj3hUTGHs9RvTyw3JsSYAA=
 =QvDs
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax and libnvdimm updates from Dan Williams:
 "The bulk of this is a rework of the dax_operations API after
  discovering the obstacles it posed to the work-in-progress DAX+reflink
  support for XFS and other copy-on-write filesystem mechanics.

  Primarily the need to plumb a block_device through the API to handle
  partition offsets was a sticking point and Christoph untangled that
  dependency in addition to other cleanups to make landing the
  DAX+reflink support easier.

  The DAX_PMEM_COMPAT option has been around for 4 years and not only
  are distributions shipping userspace that understand the current
  configuration API, but some are not even bothering to turn this option
  on anymore, so it seems a good time to remove it per the deprecation
  schedule. Recall that this was added after the device-dax subsystem
  moved from /sys/class/dax to /sys/bus/dax for its sysfs organization.
  All recent functionality depends on /sys/bus/dax.

  Some other miscellaneous cleanups and reflink prep patches are
  included as well.

  Summary:

   - Simplify the dax_operations API:

      - Eliminate bdev_dax_pgoff() in favor of the filesystem
        maintaining and applying a partition offset to all its DAX iomap
        operations.

      - Remove wrappers and device-mapper stacked callbacks for
        ->copy_from_iter() and ->copy_to_iter() in favor of moving
        block_device relative offset responsibility to the
        dax_direct_access() caller.

      - Remove the need for an @bdev in filesystem-DAX infrastructure

      - Remove unused uio helpers copy_from_iter_flushcache() and
        copy_mc_to_iter() as only the non-check_copy_size() versions are
        used for DAX.

   - Prepare XFS for the pending (next merge window) DAX+reflink support

   - Remove deprecated DEV_DAX_PMEM_COMPAT support

   - Cleanup a straggling misuse of the GUID api"

* tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits)
  iomap: Fix error handling in iomap_zero_iter()
  ACPI: NFIT: Import GUID before use
  dax: remove the copy_from_iter and copy_to_iter methods
  dax: remove the DAXDEV_F_SYNC flag
  dax: simplify dax_synchronous and set_dax_synchronous
  uio: remove copy_from_iter_flushcache() and copy_mc_to_iter()
  iomap: turn the byte variable in iomap_zero_iter into a ssize_t
  memremap: remove support for external pgmap refcounts
  fsdax: don't require CONFIG_BLOCK
  iomap: build the block based code conditionally
  dax: fix up some of the block device related ifdefs
  fsdax: shift partition offset handling into the file systems
  dax: return the partition offset from fs_dax_get_by_bdev
  iomap: add a IOMAP_DAX flag
  xfs: pass the mapping flags to xfs_bmbt_to_iomap
  xfs: use xfs_direct_write_iomap_ops for DAX zeroing
  xfs: move dax device handling into xfs_{alloc,free}_buftarg
  ext4: cleanup the dax handling in ext4_fill_super
  ext2: cleanup the dax handling in ext2_fill_super
  fsdax: decouple zeroing from the iomap buffered I/O code
  ...
2022-01-12 15:46:11 -08:00
Christoph Hellwig
7ac5360cd4 dax: remove the copy_from_iter and copy_to_iter methods
These methods indirect the actual DAX read/write path.  In the end pmem
uses magic flush and mc safe variants and fuse and dcssblk use plain ones
while device mapper picks redirects to the underlying device.

Add set_dax_nocache() and set_dax_nomc() APIs to control which copy
routines are used to remove indirect call from the read/write fast path
as well as a lot of boilerplate code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com> [virtiofs]
Link: https://lore.kernel.org/r/20211215084508.435401-5-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-18 08:04:53 -08:00
Christoph Hellwig
30c6828a17 dax: remove the DAXDEV_F_SYNC flag
Remove the DAXDEV_F_SYNC flag and thus the flags argument to alloc_dax and
just let the drivers call set_dax_synchronous directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211215084508.435401-4-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-18 08:04:52 -08:00
Jeffle Xu
c3cb6f935e fuse: mark inode DONT_CACHE when per inode DAX hint changes
When the per inode DAX hint changes while the file is still *opened*, it
is quite complicated and maybe fragile to dynamically change the DAX
state.

Hence mark the inode and corresponding dentries as DONE_CACHE once the
per inode DAX hint changes, so that the inode instance will be evicted
and freed as soon as possible once the file is closed and the last
reference to the inode is put. And then when the file gets reopened next
time, the new instantiated inode will reflect the new DAX state.

In summary, when the per inode DAX hint changes for an *opened* file, the
DAX state of the file won't be updated until this file is closed and
reopened later.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-14 11:09:37 +01:00
Jeffle Xu
2ee019fadc fuse: negotiate per inode DAX in FUSE_INIT
Among the FUSE_INIT phase, client shall advertise per inode DAX if it's
mounted with "dax=inode". Then server is aware that client is in per
inode DAX mode, and will construct per-inode DAX attribute accordingly.

Server shall also advertise support for per inode DAX. If server doesn't
support it while client is mounted with "dax=inode", client will
silently fallback to "dax=never" since "dax=inode" is advisory only.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-14 11:09:37 +01:00
Jeffle Xu
93a497b9ad fuse: enable per inode DAX
DAX may be limited in some specific situation. When the number of usable
DAX windows is under watermark, the recalim routine will be triggered to
reclaim some DAX windows. It may have a negative impact on the
performance, since some processes may need to wait for DAX windows to be
recalimed and reused then. To mitigate the performance degradation, the
overall DAX window need to be expanded larger.

However, simply expanding the DAX window may not be a good deal in some
scenario. To maintain one DAX window chunk (i.e., 2MB in size), 32KB
(512 * 64 bytes) memory footprint will be consumed for page descriptors
inside guest, which is greater than the memory footprint if it uses
guest page cache when DAX disabled. Thus it'd better disable DAX for
those files smaller than 32KB, to reduce the demand for DAX window and
thus avoid the unworthy memory overhead.

Per inode DAX feature is introduced to address this issue, by offering a
finer grained control for dax to users, trying to achieve a balance
between performance and memory overhead.

The FUSE_ATTR_DAX flag in FUSE_LOOKUP reply is used to indicate whether
DAX should be enabled or not for corresponding file. Currently the state
whether DAX is enabled or not for the file is initialized only when
inode is instantiated.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-14 11:09:37 +01:00
Jeffle Xu
780b1b959f fuse: make DAX mount option a tri-state
We add 'always', 'never', and 'inode' (default). '-o dax' continues to
operate the same which is equivalent to 'always'.

The following behavior is consistent with that on ext4/xfs:

 - The default behavior (when neither '-o dax' nor
   '-o dax=always|never|inode' option is specified) is equal to 'inode'
   mode, while 'dax=inode' won't be printed among the mount option list.

 - The 'inode' mode is only advisory. It will silently fallback to 'never'
   mode if fuse server doesn't support that.

Also noted that by the time of this commit, 'inode' mode is actually equal
to 'always' mode, before the per inode DAX flag is introduced in the
following patch.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-14 11:09:36 +01:00
Jeffle Xu
cecd491641 fuse: add fuse_should_enable_dax() helper
This is in prep for following per inode DAX checking.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-14 11:09:36 +01:00
Xie Yongji
e388164ea3 fuse: Pass correct lend value to filemap_write_and_wait_range()
The acceptable maximum value of lend parameter in
filemap_write_and_wait_range() is LLONG_MAX rather than -1. And there is
also some logic depending on LLONG_MAX check in write_cache_pages(). So
let's pass LLONG_MAX to filemap_write_and_wait_range() in
fuse_writeback_range() instead.

Fixes: 59bda8ecee ("fuse: flush extending writes")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-07 15:20:16 +01:00
Christoph Hellwig
fb08a1908c dax: simplify the dax_device <-> gendisk association
Replace the dax_host_hash with an xarray indexed by the pointer value
of the gendisk, and require explicitly calls from the block drivers that
want to associate their gendisk with a dax_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-5-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:51 -08:00
Christoph Hellwig
afd586f0d0 dax: remove CONFIG_DAX_DRIVER
CONFIG_DAX_DRIVER only selects CONFIG_DAX now, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211129102203.2243509-4-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:51 -08:00
Vivek Goyal
3e2b6fdbdc fuse: send security context of inode on file
When a new inode is created, send its security context to server along with
creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK).
This gives server an opportunity to create new file and set security
context (possibly atomically).  In all the configurations it might not be
possible to set context atomically.

Like nfs and ceph, use security_dentry_init_security() to dermine security
context of inode and send it with create, mkdir, mknod, and symlink
requests.

Following is the information sent to server.

fuse_sectx_header, fuse_secctx, xattr_name, security_context

 - struct fuse_secctx_header
   This contains total number of security contexts being sent and total
   size of all the security contexts (including size of
   fuse_secctx_header).

 - struct fuse_secctx
   This contains size of security context which follows this structure.
   There is one fuse_secctx instance per security context.

 - xattr name string
   This string represents name of xattr which should be used while setting
   security context.

 - security context
   This is the actual security context whose size is specified in
   fuse_secctx struct.

Also add the FUSE_SECURITY_CTX flag for the `flags` field of the
fuse_init_out struct.  When this flag is set the kernel will append the
security context for a newly created inode to the request (create, mkdir,
mknod, and symlink).  The server is responsible for ensuring that the inode
appears atomically (preferrably) with the requested security context.

For example, If the server is using SELinux and backed by a "real" linux
file system that supports extended attributes it can write the security
context value to /proc/thread-self/attr/fscreate before making the syscall
to create the inode.

This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org>

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-25 14:05:18 +01:00
Miklos Szeredi
473441720c fuse: release pipe buf after last use
Checking buf->flags should be done before the pipe_buf_release() is called
on the pipe buffer, since releasing the buffer might modify the flags.

This is exactly what page_cache_pipe_buf_release() does, and which results
in the same VM_BUG_ON_PAGE(PageLRU(page)) that the original patch was
trying to fix.

Reported-by: Justin Forbes <jmforbes@linuxtx.org>
Fixes: 712a951025 ("fuse: fix page stealing")
Cc: <stable@vger.kernel.org> # v2.6.35
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-25 14:05:18 +01:00
Miklos Szeredi
53db28933e fuse: extend init flags
FUSE_INIT flags are close to running out, so add another 32bits worth of
space.

Add FUSE_INIT_EXT flag to the old flags field in fuse_init_in.  If this
flag is set, then fuse_init_in is extended by 48bytes, in which a flags_hi
field is allocated to contain the high 32bits of the flags.

A flags_hi field is also added to fuse_init_out, allocated out of the
remaining unused fields.

Known userspace implementations of the fuse protocol have been checked to
accept the extended FUSE_INIT request, but this might cause problems with
other implementations.  If that happens to be the case, the protocol
negotiation will have to be extended with an extra initialization request
roundtrip.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-25 14:05:18 +01:00
Linus Torvalds
cdd39b0539 fuse update for 5.16
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYYk8VgAKCRDh3BK/laaZ
 PIUGAP4yYrlK574xnrdfZgwmTEx03/6ze0ZOA/J82mwdyxV8NgD/W8wyheHyXOJR
 Mnk2eQj4avgwctHrjmyzH3jFFcT1Sgw=
 =nrFv
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Fix a possible of deadlock in case inode writeback is in progress
   during dentry reclaim

 - Fix a crash in case of page stealing

 - Selectively invalidate cached attributes, possibly improving
   performance

 - Allow filesystems to disable data flushing from ->flush()

 - Misc fixes and cleanups

* tag 'fuse-update-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (23 commits)
  fuse: fix page stealing
  virtiofs: use strscpy for copying the queue name
  fuse: add FOPEN_NOFLUSH
  fuse: only update necessary attributes
  fuse: take cache_mask into account in getattr
  fuse: add cache_mask
  fuse: move reverting attributes to fuse_change_attributes()
  fuse: simplify local variables holding writeback cache state
  fuse: cleanup code conditional on fc->writeback_cache
  fuse: fix attr version comparison in fuse_read_update_size()
  fuse: always invalidate attributes after writes
  fuse: rename fuse_write_update_size()
  fuse: don't bump attr_version in cached write
  fuse: selective attribute invalidation
  fuse: don't increment nlink in link()
  fuse: decrement nlink on overwriting rename
  fuse: simplify __fuse_write_file_get()
  fuse: move fuse_invalidate_attr() into fuse_update_ctime()
  fuse: delete redundant code
  fuse: use kmap_local_page()
  ...
2021-11-09 10:46:32 -08:00
Linus Torvalds
c03098d4b9 gfs2: Fix mmap + page fault deadlocks
Functions gfs2_file_read_iter and gfs2_file_write_iter are both
 accessing the user buffer to write to or read from while holding the
 inode glock.  In the most basic scenario, that buffer will not be
 resident and it will be mapped to the same file.  Accessing the buffer
 will trigger a page fault, and gfs2 will deadlock trying to take the
 same inode glock again while trying to handle that fault.
 
 Fix that and similar, more complex scenarios by disabling page faults
 while accessing user buffers.  To make this work, introduce a small
 amount of new infrastructure and fix some bugs that didn't trigger so
 far, with page faults enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmGBPisUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpE6A/7BezUnGuNJxJrR8pC+vcLYA7xAgUU
 6STQ6IN7w5UHRlSkNzZxZ2XPxW4uVQ4SxSEeaLqBsHZihepjcLNFZ/8MhQ6UPSD0
 8noHOi7CoIcp6IuWQtCpxRM/xjjm2SlMt2XbVJZaiJcdzCV9gB6TU9EkBRq7Zm/X
 9WFBbv1xZF0skn9ISCJvNtiiI+VyWKgMDUKxJUiTQjmJcklyyqHcVGmQi9BjqPz4
 4s3F+WH6CoGbDKlmNk/6Y9wZ/2+sbvGswVscUxPwJVPoZWsR1xBBUdAeAmEMD1P4
 BgE/Y1J8JXyVPYtyvZKq70XUhKdQkxB7RfX87YasOk9mY4Kjd5rIIGEykh+o2vC9
 kDhCHvf2Mnw5I6Rum3B7UXyB1vemY+fECIHsXhgBnS+ztabRtcAdpCuWoqb43ymw
 yEX1KwXyU4FpRYbrRvdZT42Fmh6ty8TW+N4swg8S2TrffirvgAi5yrcHZ4mPupYv
 lyzvsCW7Wv8hPXn/twNObX+okRgJnsxcCdBXARdCnRXfA8tH23xmu88u8RA1Vdxh
 nzTvv6Dx2EowwojuDWMx29Mw3fA2IqIfbOV+4FaRU7NZ2ZKtknL8yGl27qQUsMoJ
 vYsHTmagasjQr+NDJ3vQRLCw+JQ6B1hENpdkmixFD9moo7X1ZFW3HBi/UL973Bv6
 5CmgeXto8FRUFjI=
 =WeNd
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher:
 "Functions gfs2_file_read_iter and gfs2_file_write_iter are both
  accessing the user buffer to write to or read from while holding the
  inode glock.

  In the most basic deadlock scenario, that buffer will not be resident
  and it will be mapped to the same file. Accessing the buffer will
  trigger a page fault, and gfs2 will deadlock trying to take the same
  inode glock again while trying to handle that fault.

  Fix that and similar, more complex scenarios by disabling page faults
  while accessing user buffers. To make this work, introduce a small
  amount of new infrastructure and fix some bugs that didn't trigger so
  far, with page faults enabled"

* tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix mmap + page fault deadlocks for direct I/O
  iov_iter: Introduce nofault flag to disable page faults
  gup: Introduce FOLL_NOFAULT flag to disable page faults
  iomap: Add done_before argument to iomap_dio_rw
  iomap: Support partial direct I/O on user copy failures
  iomap: Fix iomap_dio_rw return value for user copies
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  gfs2: Eliminate ip->i_gh
  gfs2: Move the inode glock locking to gfs2_file_buffered_write
  gfs2: Introduce flag for glock holder auto-demotion
  gfs2: Clean up function may_grant
  gfs2: Add wrapper for iomap_file_buffered_write
  iov_iter: Introduce fault_in_iov_iter_writeable
  iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
  gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
  powerpc/kvm: Fix kvm_use_magic_page
  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
2021-11-02 12:25:03 -07:00
Miklos Szeredi
712a951025 fuse: fix page stealing
It is possible to trigger a crash by splicing anon pipe bufs to the fuse
device.

The reason for this is that anon_pipe_buf_release() will reuse buf->page if
the refcount is 1, but that page might have already been stolen and its
flags modified (e.g. PG_lru added).

This happens in the unlikely case of fuse_dev_splice_write() getting around
to calling pipe_buf_release() after a page has been stolen, added to the
page cache and removed from the page cache.

Fix by calling pipe_buf_release() right after the page was inserted into
the page cache.  In this case the page has an elevated refcount so any
release function will know that the page isn't reusable.

Reported-by: Frank Dinoff <fdinoff@google.com>
Link: https://lore.kernel.org/r/CAAmZXrsGg2xsP1CK+cbuEMumtrqdvD-NKnWzhNcvn71RV3c1yw@mail.gmail.com/
Fixes: dd3bb14f44 ("fuse: support splice() writing to fuse device")
Cc: <stable@vger.kernel.org> # v2.6.35
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-02 11:10:37 +01:00
Miklos Szeredi
7c594bbd2d virtiofs: use strscpy for copying the queue name
Always null terminate fsvq->name.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: b43b7e81eb ("virtiofs: provide a helper function for virtqueue initialization")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-11-02 11:08:19 +01:00
Linus Torvalds
b6773cdb0e for-5.16/ki_complete-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8MOUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmeqEACrayLMDMdlb1FduTYw29QAL7XxS375r92T
 bwLippmKQIFNi8p5ScHraelV5ixgxse2j68MexlQHpl9aHIn/oL7qHACIMgDP05m
 KaSy8Hr2abqr+zz+rLMhkm21zAva6aWjQu7NoEjBE4dC5L4l9p885LaA+jmqQUno
 1wvpaEcype8cITJ+sSCb3kD6nZx7y1Lt5zEefUfk6ruMm9x9FwvU6uc4rIHi+Zve
 Hwo8yGbTvlU8rGSi9naC/U8pIZ4bqEuTAcV5VHNrWG+b4aA/aFPpSjpIiSBZSXo0
 HXa+jmcr6gkejfPeOZkBbRub6Fm9Wq2pDAZskPWFX6zyX0pIV05GjJ2J/ba8rovn
 QrcfxaBv8XitKgrjFZeR0ZBqD2iJjPA/Yq5/r1ZmZ0wSHI3W4UuTGhQYEPyDLceH
 ZWq/wcfVFek4kAoCxCqy9kWiOujY90WWKQW3yD7b8FPZ0d+/R1Mn+drlYaSKN1Pk
 /9/+z1DaLtBWbJ2G+BQ9oUkYmNSapAiYc2YXVss86hmhLX+prFtSj3zECZUvhyAz
 b42A2DVsjU+65yT2zdPBXlMrbI91qNnvIXcz5szNdTfHTn9FiLQb4BffMV0FHT3g
 vap8N3Rb8UkZ3v4NCVAtlfcGr0kvYHQH+Qgh6oAlXB4NQoKJCVadzpTFPMWjx788
 oHBUjA0UTQ==
 =4vl/
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block

Pull kiocb->ki_complete() cleanup from Jens Axboe:
 "This removes the res2 argument from kiocb->ki_complete().

  Only the USB gadget code used it, everybody else passes 0. The USB
  guys checked the user gadget code they could find, and everybody just
  uses res as expected for the async interface"

* tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block:
  fs: get rid of the res2 iocb->ki_complete argument
  usb: remove res2 argument from gadget code completions
2021-11-01 10:17:11 -07:00
Amir Goldstein
a390ccb316 fuse: add FOPEN_NOFLUSH
Add flag returned by FUSE_OPEN and FUSE_CREATE requests to avoid flushing
data cache on close.

Different filesystems implement ->flush() is different ways:
 - Most disk filesystems do not implement ->flush() at all
 - Some network filesystem (e.g. nfs) flush local write cache of
   FMODE_WRITE file and send a "flush" command to server
 - Some network filesystem (e.g. cifs) flush local write cache of
   FMODE_WRITE file without sending an additional command to server

FUSE flushes local write cache of ANY file, even non FMODE_WRITE
and sends a "flush" command to server (if server implements it).

The FUSE implementation of ->flush() seems over agressive and
arbitrary and does not make a lot of sense when writeback caching is
disabled.

Instead of deciding on another arbitrary implementation that makes
sense, leave the choice of per-file flush behavior in the hands of
the server.

Link: https://lore.kernel.org/linux-fsdevel/CAJfpegspE8e6aKd47uZtSYX8Y-1e1FWS0VL0DH2Skb9gQP5RJQ@mail.gmail.com/
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 10:20:31 +02:00
Miklos Szeredi
c6c745b810 fuse: only update necessary attributes
fuse_update_attributes() refreshes metadata for internal use.

Each use needs a particular set of attributes to be refreshed, but
currently that cannot be expressed and all but atime are refreshed.

Add a mask argument, which lets fuse_update_get_attr() to decide based on
the cache_mask and the inval_mask whether a GETATTR call is needed or not.

Reported-by: Yongji Xie <xieyongji@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:33 +02:00
Miklos Szeredi
ec85537519 fuse: take cache_mask into account in getattr
When deciding to send a GETATTR request take into account the cache mask
(which attributes are always valid).  The cache mask takes precedence over
the invalid mask.

This results in the GETATTR request not being sent unnecessarily.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:33 +02:00
Miklos Szeredi
4b52f059b5 fuse: add cache_mask
If writeback_cache is enabled, then the size, mtime and ctime attributes of
regular files are always valid in the kernel's cache.  They are retrieved
from userspace only when the inode is freshly looked up.

Add a more generic "cache_mask", that indicates which attributes are
currently valid in cache.

This patch doesn't change behavior.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:33 +02:00
Miklos Szeredi
04d82db0c5 fuse: move reverting attributes to fuse_change_attributes()
In case of writeback_cache fuse_fillattr() would revert the queried
attributes to the cached version.

Move this to fuse_change_attributes() in order to manage the writeback
logic in a central helper.  This will be necessary for patches that follow.

Only fuse_do_getattr() -> fuse_fillattr() uses the attributes after calling
fuse_change_attributes(), so this should not change behavior.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:32 +02:00
Miklos Szeredi
c15016b7ae fuse: simplify local variables holding writeback cache state
There are two instances of "bool is_wb = fc->writeback_cache" where the
actual use mostly involves checking "is_wb && S_ISREG(inode->i_mode)".

Clean up these cases by storing the second condition in the local variable.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:32 +02:00
Miklos Szeredi
20235b435a fuse: cleanup code conditional on fc->writeback_cache
It's safe to call file_update_time() if writeback cache is not enabled,
since S_NOCMTIME is set in this case.  This part is purely a cleanup.

__fuse_copy_file_range() also calls fuse_write_update_attr() only in the
writeback cache case.  This is inconsistent with other callers, where it's
called unconditionally.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:32 +02:00
Miklos Szeredi
484ce65715 fuse: fix attr version comparison in fuse_read_update_size()
A READ request returning a short count is taken as indication of EOF, and
the cached file size is modified accordingly.

Fix the attribute version checking to allow for changes to fc->attr_version
on other inodes.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:32 +02:00
Miklos Szeredi
d347739a0e fuse: always invalidate attributes after writes
Extend the fuse_write_update_attr() helper to invalidate cached attributes
after a write.

This has already been done in all cases except in fuse_notify_store(), so
this is mostly a cleanup.

fuse_direct_write_iter() calls fuse_direct_IO() which already calls
fuse_write_update_attr(), so don't repeat that again in the former.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:32 +02:00
Miklos Szeredi
27ae449ba2 fuse: rename fuse_write_update_size()
This function already updates the attr_version in fuse_inode, regardless of
whether the size was changed or not.

Rename the helper to fuse_write_update_attr() to reflect the more generic
nature.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:32 +02:00
Miklos Szeredi
8c56e03d2e fuse: don't bump attr_version in cached write
The attribute version in fuse_inode should be updated whenever the
attributes might have changed on the server.  In case of cached writes this
is not the case, so updating the attr_version is unnecessary and could
possibly affect performance.

Open code the remaining part of fuse_write_update_size().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:32 +02:00
Miklos Szeredi
fa5eee57e3 fuse: selective attribute invalidation
Only invalidate attributes that the operation might have changed.

Introduce two constants for common combinations of changed attributes:

  FUSE_STATX_MODIFY: file contents are modified but not size

  FUSE_STATX_MODSIZE: size and/or file contents modified

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:32 +02:00
Miklos Szeredi
97f044f690 fuse: don't increment nlink in link()
The fuse_iget() call in create_new_entry() already updated the inode with
all the new attributes and incremented the attribute version.

Incrementing the nlink will result in the wrong count.  This wasn't noticed
because the attributes were invalidated right after this.

Updating ctime is still needed for the writeback case when the ctime is not
refreshed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28 09:45:26 +02:00
Jens Axboe
6b19b766e8 fs: get rid of the res2 iocb->ki_complete argument
The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.

Now that everybody is passing in zero, kill off the extra argument.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 10:36:24 -06:00
Miklos Szeredi
cefd1b8327 fuse: decrement nlink on overwriting rename
Rename didn't decrement/clear nlink on overwritten target inode.

Create a common helper fuse_entry_unlinked() that handles this for unlink,
rmdir and rename.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:02 +02:00
Miklos Szeredi
84840efc3c fuse: simplify __fuse_write_file_get()
Use list_first_entry_or_null() instead of list_empty() + list_entry().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:02 +02:00
Miklos Szeredi
371e8fd029 fuse: move fuse_invalidate_attr() into fuse_update_ctime()
Logically it belongs there since attributes are invalidated due to the
updated ctime.  This is a cleanup and should not change behavior.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Peng Hao
b5d9758297 fuse: delete redundant code
'ia->io=io' has been set in fuse_io_alloc.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Peng Hao
5fe0fc9f1d fuse: use kmap_local_page()
Due to the introduction of kmap_local_*, the storage of slots used for
short-term mapping has changed from per-CPU to per-thread.  kmap_atomic()
disable preemption, while kmap_local_*() only disable migration.

There is no need to disable preemption in several kamp_atomic places used
in fuse.

Link: https://lwn.net/Articles/836144/
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Miklos Szeredi
bda9a71980 fuse: annotate lock in fuse_reverse_inval_entry()
Add missing inode lock annotatation; found by syzbot.

Reported-and-tested-by: syzbot+9f747458f5990eaa8d43@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Miklos Szeredi
36ea23374d fuse: write inode in fuse_vma_close() instead of fuse_release()
Fuse ->release() is otherwise asynchronous for the reason that it can
happen in contexts unrelated to close/munmap.

Inode is already written back from fuse_flush().  Add it to
fuse_vma_close() as well to make sure inode dirtying from mmaps also get
written out before the file is released.

Also add error handling.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Miklos Szeredi
5c791fe1e2 fuse: make sure reclaim doesn't write the inode
In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped.  This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

 - fallocate(2)/copy_file_range(2): these call file_update_time() or
   file_modified(), so flush the inode before returning from the call

 - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
   flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-22 17:03:01 +02:00
Miklos Szeredi
964d32e512 fuse: clean up error exits in fuse_fill_super()
Instead of "goto err", return error directly, since there's no error
cleanup to do now.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:39 +02:00
Miklos Szeredi
80019f1138 fuse: always initialize sb->s_fs_info
Syzkaller reports a null pointer dereference in fuse_test_super() that is
caused by sb->s_fs_info being NULL.

This is due to the fact that fuse_fill_super() is initializing s_fs_info,
which is too late, it's already on the fs_supers list.  The initialization
needs to be done in sget_fc() with the sb_lock held.

Move allocation of fuse_mount and fuse_conn from fuse_fill_super() into
fuse_get_tree().

After this ->kill_sb() will always be called with non-NULL ->s_fs_info,
hence fuse_mount_destroy() can drop the test for non-NULL "fm".

Reported-by: syzbot+74a15f02ccb51f398601@syzkaller.appspotmail.com
Fixes: 5d5b74aa9c ("fuse: allow sharing existing sb")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:39 +02:00
Miklos Szeredi
c191cd07ee fuse: clean up fuse_mount destruction
1. call fuse_mount_destroy() for open coded variants

2. before deactivate_locked_super() don't need fuse_mount destruction since
that will now be done (if ->s_fs_info is not cleared)

3. rearrange fuse_mount setup in fuse_get_tree_submount() so that the
regular pattern can be used

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:39 +02:00
Miklos Szeredi
a27c061a49 fuse: get rid of fuse_put_super()
The ->put_super callback is called from generic_shutdown_super() in case of
a fully initialized sb.  This is called from kill_***_super(), which is
called from ->kill_sb instances.

Fuse uses ->put_super to destroy the fs specific fuse_mount and drop the
reference to the fuse_conn, while it does the same on each error case
during sb setup.

This patch moves the destruction from fuse_put_super() to
fuse_mount_destroy(), called at the end of all ->kill_sb instances.  A
follup patch will clean up the error paths.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:38 +02:00
Miklos Szeredi
d534d31d6a fuse: check s_root when destroying sb
Checking "fm" works because currently sb->s_fs_info is cleared on error
paths; however, sb->s_root is what generic_shutdown_super() checks to
determine whether the sb was fully initialized or not.

This change will allow cleanup of sb setup error paths.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21 10:01:38 +02:00
Andreas Gruenbacher
a6294593e8 iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in, similar to copy_to_user, instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in.

Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure this change doesn't silently break things.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-18 16:35:06 +02:00
Linus Torvalds
75b96f0ec5 fuse update for 5.15
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTcNYgAKCRDh3BK/laaZ
 PCaHAQCTfaWYK+H6kNIffxnRG+yGsMAmoZLGbc5l9M5GeTEmtgEA0vkmWbDt4h/I
 PSA0Q2T0gzSsBdxkF4jQHqKO+HGMVg8=
 =Vh9A
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Allow mounting an active fuse device. Previously the fuse device
   would always be mounted during initialization, and sharing a fuse
   superblock was only possible through mount or namespace cloning

 - Fix data flushing in syncfs (virtiofs only)

 - Fix data flushing in copy_file_range()

 - Fix a possible deadlock in atomic O_TRUNC

 - Misc fixes and cleanups

* tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: remove unused arg in fuse_write_file_get()
  fuse: wait for writepages in syncfs
  fuse: flush extending writes
  fuse: truncate pagecache on atomic_o_trunc
  fuse: allow sharing existing sb
  fuse: move fget() to fuse_get_tree()
  fuse: move option checking into fuse_fill_super()
  fuse: name fs_context consistently
  fuse: fix use after free in fuse_read_interrupt()
2021-09-07 12:18:29 -07:00
Miklos Szeredi
a9667ac88e fuse: remove unused arg in fuse_write_file_get()
The struct fuse_conn argument is not used and can be removed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-09-06 13:37:10 +02:00
Miklos Szeredi
660585b56e fuse: wait for writepages in syncfs
In case of fuse the MM subsystem doesn't guarantee that page writeback
completes by the time ->sync_fs() is called.  This is because fuse
completes page writeback immediately to prevent DoS of memory reclaim by
the userspace file server.

This means that fuse itself must ensure that writes are synced before
sending the SYNCFS request to the server.

Introduce sync buckets, that hold a counter for the number of outstanding
write requests.  On syncfs replace the current bucket with a new one and
wait until the old bucket's counter goes down to zero.

It is possible to have multiple syncfs calls in parallel, in which case
there could be more than one waited-on buckets.  Descendant buckets must
not complete until the parent completes.  Add a count to the child (new)
bucket until the (parent) old bucket completes.

Use RCU protection to dereference the current bucket and to wake up an
emptied bucket.  Use fc->lock to protect against parallel assignments to
the current bucket.

This leaves just the counter to be a possible scalability issue.  The
fc->num_waiting counter has a similar issue, so both should be addressed at
the same time.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 2d82ab251e ("virtiofs: propagate sync() to file server")
Cc: <stable@vger.kernel.org> # v5.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-09-06 13:37:02 +02:00
Linus Torvalds
815409a12c overlayfs update for 5.15
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTDKKAAKCRDh3BK/laaZ
 PG9PAQCUF0fdBlCKudwSEt5PV5xemycL9OCAlYCd7d4XbBIe9wEA6sVJL9J+OwV2
 aF0NomiXtJccE+S9+byjVCyqSzQJGQQ=
 =6L2Y
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:

 - Copy up immutable/append/sync/noatime attributes (Amir Goldstein)

 - Improve performance by enabling RCU lookup.

 - Misc fixes and improvements

The reason this touches so many files is that the ->get_acl() method now
gets a "bool rcu" argument.  The ->get_acl() API was updated based on
comments from Al and Linus:

Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/

* tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: enable RCU'd ->get_acl()
  vfs: add rcu argument to ->get_acl() callback
  ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
  ovl: use kvalloc in xattr copy-up
  ovl: update ctime when changing fileattr
  ovl: skip checking lower file's i_writecount on truncate
  ovl: relax lookup error on mismatch origin ftype
  ovl: do not set overlay.opaque for new directories
  ovl: add ovl_allow_offline_changes() helper
  ovl: disable decoding null uuid with redirect_dir
  ovl: consistent behavior for immutable/append-only inodes
  ovl: copy up sync/noatime fileattr flags
  ovl: pass ovl_fs to ovl_check_setxattr()
  fs: add generic helper for filling statx attribute flags
2021-09-02 09:21:27 -07:00
Miklos Szeredi
59bda8ecee fuse: flush extending writes
Callers of fuse_writeback_range() assume that the file is ready for
modification by the server in the supplied byte range after the call
returns.

If there's a write that extends the file beyond the end of the supplied
range, then the file needs to be extended to at least the end of the range,
but currently that's not done.

There are at least two cases where this can cause problems:

 - copy_file_range() will return short count if the file is not extended
   up to end of the source range.

 - FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file,
   hence the region may not be fully allocated.

Fix by flushing writes from the start of the range up to the end of the
file.  This could be optimized if the writes are non-extending, etc, but
it's probably not worth the trouble.

Fixes: a2bc923629 ("fuse: fix copy_file_range() in the writeback case")
Fixes: 6b1bdb56b1 ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)")
Cc: <stable@vger.kernel.org>  # v5.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-31 14:18:08 +02:00
Linus Torvalds
aa99f3c2b9 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmTZcACgkQnJ2qBz9k
 QNkkmAgArW6XoF1CePds/ZaC9vfg/nk66/zVo0n+J8xXjMWAPxcKbWFfV0uWVixq
 yk4lcLV47a2Mu/B/1oLNd3vrSmhwU+srWqNwOFn1nv+lP/6wJqr8oztRHn/0L9Q3
 ZSRrukSejbQ6AvTL/WzTNnCjjCc2ne3Kyko6W41aU6uyJuzhSM32wbx7qlV6t54Z
 iint9OrB4gM0avLohNafTUq6I+tEGzBMNwpCG/tqCmkcvDcv3rTDVAnPSCTm0Tx2
 hdrYDcY/rLxo93pDBaW1rYA/fohR+mIVye6k2TjkPAL6T1x+rxeT5qnc+YijH5yF
 sFPDhlD+ZsfOLi8stWXLOJ+8+gLODg==
 =pDBR
 -----END PGP SIGNATURE-----

Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fs hole punching vs cache filling race fixes from Jan Kara:
 "Fix races leading to possible data corruption or stale data exposure
  in multiple filesystems when hole punching races with operations such
  as readahead.

  This is the series I was sending for the last merge window but with
  your objection fixed - now filemap_fault() has been modified to take
  invalidate_lock only when we need to create new page in the page cache
  and / or bring it uptodate"

* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  filesystems/locking: fix Malformed table warning
  cifs: Fix race between hole punch and page fault
  ceph: Fix race between hole punch and page fault
  fuse: Convert to using invalidate_lock
  f2fs: Convert to using invalidate_lock
  zonefs: Convert to using invalidate_lock
  xfs: Convert double locking of MMAPLOCK to use VFS helpers
  xfs: Convert to use invalidate_lock
  xfs: Refactor xfs_isilocked()
  ext2: Convert to using invalidate_lock
  ext4: Convert to use mapping->invalidate_lock
  mm: Add functions to lock invalidate_lock for two mappings
  mm: Protect operations adding pages to page cache with invalidate_lock
  documentation: Sync file_operations members with reality
  mm: Fix comments mentioning i_mutex
2021-08-30 10:24:50 -07:00
Miklos Szeredi
0cad624662 vfs: add rcu argument to ->get_acl() callback
Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-18 22:08:24 +02:00
Miklos Szeredi
76224355db fuse: truncate pagecache on atomic_o_trunc
fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic
O_TRUNC.  This can deadlock with fuse_wait_on_page_writeback() in
fuse_launder_page() triggered by invalidate_inode_pages2().

Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a
truncate_pagecache() call.  This makes sense regardless of FOPEN_KEEP_CACHE
or fc->writeback cache, so do it unconditionally.

Reported-by: Xie Yongji <xieyongji@bytedance.com>
Reported-and-tested-by: syzbot+bea44a5189836d956894@syzkaller.appspotmail.com
Fixes: e4648309b8 ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-17 21:05:16 +02:00
Dan Williams
96dcb97d0a Merge branch 'for-5.14/dax' into libnvdimm-fixes
Pick up some small dax cleanups that make some of Ira's follow on work
easier.
2021-08-11 12:04:43 -07:00
Miklos Szeredi
5d5b74aa9c fuse: allow sharing existing sb
Make it possible to create a new mount from a already working server.

Here's a detailed description of the problem from Jakob:

  "The background for this question is occasional problems we see with our
   fuse filesystem [1] and mount namespaces. On a usual client, we have
   system-wide, autofs managed mountpoints. When a new mount namespace is
   created (which can be done unprivileged in combination with user
   namespaces), it can happen that a mountpoint is used inside the new
   namespace but idle in the root mount namespace. So autofs unmounts the
   parent, system-wide mountpoint. But the fuse module stays active and
   still serves mountpoint in the child mount namespace. Because the fuse
   daemon also blocks other system wide resources corresponding to the
   mountpoint, this situation effectively prevents new mounts until the
   child mount namespaces closes.

   [1] https://github.com/cvmfs/cvmfs"

Reported-by: Jakob Blomer <jblomer@cern.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-05 05:57:27 +02:00
Miklos Szeredi
62dd1fc8cc fuse: move fget() to fuse_get_tree()
Affected call chains:

fuse_get_tree
   -> get_tree_(bdev|nodev)
      -> fuse_fill_super

Needed for following patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-05 05:57:27 +02:00
Miklos Szeredi
badc741459 fuse: move option checking into fuse_fill_super()
Checking whether the "fd=", "rootmode=", "user_id=" and "group_id=" mount
options are present can be moved from fuse_get_tree() into
fuse_fill_super() where the value of the options are consumed.

This relaxes semantics of reusing a fuse blockdev mount using the device
name.  Before this patch presence of these options were enforced but values
ignored, after this patch these options are completely ignored in this
case.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-04 13:22:58 +02:00
Miklos Szeredi
84c215075b fuse: name fs_context consistently
Naming convention under fs/fuse/:

	struct fuse_conn *fc;
	struct fs_context *fsc;

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-04 13:22:58 +02:00
Miklos Szeredi
e1e71c1688 fuse: fix use after free in fuse_read_interrupt()
There is a potential race between fuse_read_interrupt() and
fuse_request_end().

TASK1
  in fuse_read_interrupt(): delete req->intr_entry (while holding
  fiq->lock)

TASK2
  in fuse_request_end(): req->intr_entry is empty -> skip fiq->lock
  wake up TASK3

TASK3
  request is freed

TASK1
  in fuse_read_interrupt(): dereference req->in.h.unique ***BAM***


Fix by always grabbing fiq->lock if the request was ever interrupted
(FR_INTERRUPTED set) thereby serializing with concurrent
fuse_read_interrupt() calls.

FR_INTERRUPTED is set before the request is queued on fiq->interrupts.
Dequeing the request is done with list_del_init() but FR_INTERRUPTED is not
cleared in this case.

Reported-by: lijiazi <lijiazi@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-04 13:22:58 +02:00
Jan Kara
8bcbbe9c7c fuse: Convert to using invalidate_lock
Use invalidate_lock instead of fuse's private i_mmap_sem. The intended
purpose is exactly the same. By this conversion we fix a long standing
race between hole punching and read(2) / readahead(2) paths that can
lead to stale page cache contents.

CC: Miklos Szeredi <miklos@szeredi.hu>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-07-13 14:29:01 +02:00
Ira Weiny
2e29be2e49 fs/fuse: Remove unneeded kaddr parameter
fuse_dax_mem_range_init() does not need the address or the pfn of the
memory requested in dax_direct_access().  It is only calling direct
access to get the number of pages.

Remove the unused variables and stop requesting the kaddr and pfn from
dax_direct_access().

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/20210525172428.3634316-2-ira.weiny@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-07-07 22:10:03 -07:00
Linus Torvalds
8e4f3e1517 fuse update for 5.14
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYORvYQAKCRDh3BK/laaZ
 PCfvAQCbU+PW2RbwlqjZMet6w9qorh29XYe786P5pNRVbMYCygD+N45l66Sbd/Rz
 7M7ioVDseyTW4dnLhb8SzSNB0zr6jQs=
 =MDvD
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Fixes for virtiofs submounts

 - Misc fixes and cleanups

* tag 'fuse-update-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  virtiofs: Fix spelling mistakes
  fuse: use DIV_ROUND_UP helper macro for calculations
  fuse: fix illegal access to inode with reused nodeid
  fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)
  fuse: Make fuse_fill_super_submount() static
  fuse: Switch to fc_mount() for submounts
  fuse: Call vfs_get_tree() for submounts
  fuse: add dedicated filesystem context ops for submounts
  virtiofs: propagate sync() to file server
  fuse: reject internal errno
  fuse: check connected before queueing on fpq->io
  fuse: ignore PG_workingset after stealing
  fuse: Fix infinite loop in sget_fc()
  fuse: Fix crash if superblock of submount gets killed early
  fuse: Fix crash in fuse_dentry_automount() error path
2021-07-06 11:17:41 -07:00
Linus Torvalds
d3acb15a3a Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "iov_iter cleanups and fixes.

  There are followups, but this is what had sat in -next this cycle. IMO
  the macro forest in there became much thinner and easier to follow..."

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller
  clean up copy_mc_pipe_to_iter()
  pipe_zero(): we don't need no stinkin' kmap_atomic()...
  iov_iter: clean csum_and_copy_...() primitives up a bit
  copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases
  copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases
  iterate_xarray(): only of the first iteration we might get offset != 0
  pull handling of ->iov_offset into iterate_{iovec,bvec,xarray}
  iov_iter: make iterator callbacks use base and len instead of iovec
  iov_iter: make the amount already copied available to iterator callbacks
  iov_iter: get rid of separate bvec and xarray callbacks
  iov_iter: teach iterate_{bvec,xarray}() about possible short copies
  iterate_bvec(): expand bvec.h macro forest, massage a bit
  iov_iter: unify iterate_iovec and iterate_kvec
  iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec
  iterate_and_advance(): get rid of magic in case when n is 0
  csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter()
  iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
  [xarray] iov_iter_npages(): just use DIV_ROUND_UP()
  iov_iter_npages(): don't bother with iterate_all_kinds()
  ...
2021-07-03 11:30:04 -07:00
Matthew Wilcox (Oracle)
3a6b216200 mm: move page dirtying prototypes from mm.h
These functions implement the address_space ->set_page_dirty operation and
should live in pagemap.h, not mm.h so that the rest of the kernel doesn't
get funny ideas about calling them directly.

Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Matthew Wilcox (Oracle)
b82a96c925 fs: remove noop_set_page_dirty()
Use __set_page_dirty_no_writeback() instead.  This will set the dirty bit
on the page, which will be used to avoid calling set_page_dirty() in the
future.  It will have no effect on actually writing the page back, as the
pages are not on any LRU lists.

[akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules]

Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Zheng Yongjun
c4e0cd4e0c virtiofs: Fix spelling mistakes
Fix some spelling mistakes in comments:
refernce  ==> reference
happnes  ==> happens
threhold  ==> threshold
splitted  ==> split
mached  ==> matched

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:36 +02:00
Wu Bo
6c88632be3 fuse: use DIV_ROUND_UP helper macro for calculations
Replace open coded divisor calculations with the DIV_ROUND_UP kernel macro
for better readability.

Signed-off-by: Wu Bo <wubo40@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:36 +02:00
Amir Goldstein
15db16837a fuse: fix illegal access to inode with reused nodeid
Server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...)
with ourarg containing nodeid and generation.

If a fuse inode is found in inode cache with the same nodeid but different
generation, the existing fuse inode should be unhashed and marked "bad" and
a new inode with the new generation should be hashed instead.

This can happen, for example, with passhrough fuse filesystem that returns
the real filesystem ino/generation on lookup and where real inode numbers
can get recycled due to real files being unlinked not via the fuse
passthrough filesystem.

With current code, this situation will not be detected and an old fuse
dentry that used to point to an older generation real inode, can be used to
access a completely new inode, which should be accessed only via the new
dentry.

Note that because the FORGET message carries the nodeid w/o generation, the
server should wait to get FORGET counts for the nlookup counts of the old
and reused inodes combined, before it can free the resources associated to
that nodeid.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:36 +02:00
Richard W.M. Jones
6b1bdb56b1 fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)
The current fuse module filters out fallocate(FALLOC_FL_ZERO_RANGE)
returning -EOPNOTSUPP.  libnbd's nbdfuse would like to translate
FALLOC_FL_ZERO_RANGE requests into the NBD command
NBD_CMD_WRITE_ZEROES which allows NBD servers that support it to do
zeroing efficiently.

This commit treats this flag exactly like FALLOC_FL_PUNCH_HOLE.

A way to test this, requiring fuse >= 3, nbdkit >= 1.8 and the latest
nbdfuse from https://gitlab.com/nbdkit/libnbd/-/tree/master/fuse is to
create a file containing some data and "mirror" it to a fuse file:

  $ dd if=/dev/urandom of=disk.img bs=1M count=1
  $ nbdkit file disk.img
  $ touch mirror.img
  $ nbdfuse mirror.img nbd://localhost &

(mirror.img -> nbdfuse -> NBD over loopback -> nbdkit -> disk.img)

You can then run commands such as:

  $ fallocate -z -o 1024 -l 1024 mirror.img

and check that the content of the original file ("disk.img") stays
synchronized.  To show NBD commands, export LIBNBD_DEBUG=1 before
running nbdfuse.  To clean up:

  $ fusermount3 -u mirror.img
  $ killall nbdkit

Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:36 +02:00
Greg Kurz
1b53991737 fuse: Make fuse_fill_super_submount() static
This function used to be called from fuse_dentry_automount(). This code
was moved to fuse_get_tree_submount() in the same file since then. It
is unlikely there will ever be another user. No need to be extern in
this case.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:35 +02:00
Greg Kurz
29e0e4df9d fuse: Switch to fc_mount() for submounts
fc_mount() already handles the vfs_get_tree(), sb->s_umount
unlocking and vfs_create_mount() sequence. Using it greatly
simplifies fuse_dentry_automount().

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:35 +02:00
Greg Kurz
266eb3f2fa fuse: Call vfs_get_tree() for submounts
We recently fixed an infinite loop by setting the SB_BORN flag on
submounts along with the write barrier needed by super_cache_count().
This is the job of vfs_get_tree() and FUSE shouldn't have to care
about the barrier at all.

Split out some code from fuse_dentry_automount() to the dedicated
fuse_get_tree_submount() handler for submounts and call vfs_get_tree().

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:35 +02:00
Greg Kurz
fe0a7bd81b fuse: add dedicated filesystem context ops for submounts
The creation of a submount is open-coded in fuse_dentry_automount().
This brings a lot of complexity and we recently had to fix bugs
because we weren't setting SB_BORN or because we were unlocking
sb->s_umount before sb was fully configured. Most of these could
have been avoided by using the mount API instead of open-coding.

Basically, this means coming up with a proper ->get_tree()
implementation for submounts and call vfs_get_tree(), or better
fc_mount().

The creation of the superblock for submounts is quite different from
the root mount. Especially, it doesn't require to allocate a FUSE
filesystem context, nor to parse parameters.

Introduce a dedicated context ops for submounts to make this clear.
This is just a placeholder for now, fuse_get_tree_submount() will
be populated in a subsequent patch.

Only visible change is that we stop allocating/freeing a useless FUSE
filesystem context with submounts.

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:35 +02:00
Greg Kurz
2d82ab251e virtiofs: propagate sync() to file server
Even if POSIX doesn't mandate it, linux users legitimately expect sync() to
flush all data and metadata to physical storage when it is located on the
same system.  This isn't happening with virtiofs though: sync() inside the
guest returns right away even though data still needs to be flushed from
the host page cache.

This is easily demonstrated by doing the following in the guest:

$ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s
sync()                                  = 0 <0.024068>

and start the following in the host when the 'dd' command completes
in the guest:

$ strace -T -e fsync /usr/bin/sync virtiofs/foo
fsync(3)                                = 0 <10.371640>

There are no good reasons not to honor the expected behavior of sync()
actually: it gives an unrealistic impression that virtiofs is super fast
and that data has safely landed on HW, which isn't the case obviously.

Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS
request type for this purpose.  Provision a 64-bit placeholder for possible
future extensions.  Since the file server cannot handle the wait == 0 case,
we skip it to avoid a gratuitous roundtrip.  Note that this is
per-superblock: a FUSE_SYNCFS is send for the root mount and for each
submount.

Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in
the file server is treated as permanent success.  This ensures
compatibility with older file servers: the client will get the current
behavior of sync() not being propagated to the file server.

Note that such an operation allows the file server to DoS sync().  Since a
typical FUSE file server is an untrusted piece of software running in
userspace, this is disabled by default.  Only enable it with virtiofs for
now since virtiofsd is supposedly trusted by the guest kernel.

Reported-by: Robert Krawitz <rlk@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:35 +02:00
Miklos Szeredi
49221cf86d fuse: reject internal errno
Don't allow userspace to report errors that could be kernel-internal.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Fixes: 334f485df8 ("[PATCH] FUSE - device functions")
Cc: <stable@vger.kernel.org> # v2.6.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:35 +02:00
Miklos Szeredi
80ef08670d fuse: check connected before queueing on fpq->io
A request could end up on the fpq->io list after fuse_abort_conn() has
reset fpq->connected and aborted requests on that list:

Thread-1			  Thread-2
========			  ========
->fuse_simple_request()           ->shutdown
  ->__fuse_request_send()
    ->queue_request()		->fuse_abort_conn()
->fuse_dev_do_read()                ->acquire(fpq->lock)
  ->wait_for(fpq->lock) 	  ->set err to all req's in fpq->io
				  ->release(fpq->lock)
  ->acquire(fpq->lock)
  ->add req to fpq->io

After the userspace copy is done the request will be ended, but
req->out.h.error will remain uninitialized.  Also the copy might block
despite being already aborted.

Fix both issues by not allowing the request to be queued on the fpq->io
list after fuse_abort_conn() has processed this list.

Reported-by: Pradeep P V K <pragalla@codeaurora.org>
Fixes: fd22d62ed0 ("fuse: no fc->lock for iqueue parts")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22 09:15:35 +02:00
Miklos Szeredi
b89ecd60d3 fuse: ignore PG_workingset after stealing
Fix the "fuse: trying to steal weird page" warning.

Description from Johannes Weiner:

  "Think of it as similar to PG_active. It's just another usage/heat
   indicator of file and anon pages on the reclaim LRU that, unlike
   PG_active, persists across deactivation and even reclaim (we store it in
   the page cache / swapper cache tree until the page refaults).

   So if fuse accepts pages that can legally have PG_active set,
   PG_workingset is fine too."

Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Fixes: 1899ad18c6 ("mm: workingset: tell cache transitions from workingset thrashing")
Cc: <stable@vger.kernel.org> # v4.20
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-18 21:16:42 +02:00
Al Viro
f0b65f39ac iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
Replacement is called copy_page_from_iter_atomic(); unlike the old primitive the
callers do *not* need to do iov_iter_advance() after it.  In case when they end
up consuming less than they'd been given they need to do iov_iter_revert() on
everything they had not consumed.  That, however, needs to be done only on slow
paths.

All in-tree callers converted.  And that kills the last user of iterate_all_kinds()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-06-10 11:45:14 -04:00
Greg Kurz
e4a9ccdd1c fuse: Fix infinite loop in sget_fc()
We don't set the SB_BORN flag on submounts. This is wrong as these
superblocks are then considered as partially constructed or dying
in the rest of the code and can break some assumptions.

One such case is when you have a virtiofs filesystem with submounts
and you try to mount it again : virtio_fs_get_tree() tries to obtain
a superblock with sget_fc(). The logic in sget_fc() is to loop until
it has either found an existing matching superblock with SB_BORN set
or to create a brand new one. It is assumed that a superblock without
SB_BORN is transient and the loop is restarted. Forgetting to set
SB_BORN on submounts hence causes sget_fc() to retry forever.

Setting SB_BORN requires special care, i.e. a write barrier for
super_cache_count() which can check SB_BORN without taking any lock.
We should call vfs_get_tree() to deal with that but this requires
to have a proper ->get_tree() implementation for submounts, which
is a bigger piece of work. Go for a simple bug fix in the meatime.

Fixes: bf109c6404 ("fuse: implement crossmounts")
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-09 15:33:40 +02:00
Greg Kurz
e3a43f2a95 fuse: Fix crash if superblock of submount gets killed early
As soon as fuse_dentry_automount() does up_write(&sb->s_umount), the
superblock can theoretically be killed. If this happens before the
submount was added to the &fc->mounts list, fuse_mount_remove() later
crashes in list_del_init() because it assumes the submount to be
already there.

Add the submount before dropping sb->s_umount to fix the inconsistency.
It is okay to nest fc->killsb under sb->s_umount, we already do this
on the ->kill_sb() path.

Signed-off-by: Greg Kurz <groug@kaod.org>
Fixes: bf109c6404 ("fuse: implement crossmounts")
Cc: stable@vger.kernel.org # v5.10+
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-09 15:33:40 +02:00
Greg Kurz
d92d88f056 fuse: Fix crash in fuse_dentry_automount() error path
If fuse_fill_super_submount() returns an error, the error path
triggers a crash:

[   26.206673] BUG: kernel NULL pointer dereference, address: 0000000000000000
[...]
[   26.226362] RIP: 0010:__list_del_entry_valid+0x25/0x90
[...]
[   26.247938] Call Trace:
[   26.248300]  fuse_mount_remove+0x2c/0x70 [fuse]
[   26.248892]  virtio_kill_sb+0x22/0x160 [virtiofs]
[   26.249487]  deactivate_locked_super+0x36/0xa0
[   26.250077]  fuse_dentry_automount+0x178/0x1a0 [fuse]

The crash happens because fuse_mount_remove() assumes that the FUSE
mount was already added to list under the FUSE connection, but this
only done after fuse_fill_super_submount() has returned success.

This means that until fuse_fill_super_submount() has returned success,
the FUSE mount isn't actually owned by the superblock. We should thus
reclaim ownership by clearing sb->s_fs_info, which will skip the call
to fuse_mount_remove(), and perform rollback, like virtio_fs_get_tree()
already does for the root sb.

Fixes: bf109c6404 ("fuse: implement crossmounts")
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-09 15:33:40 +02:00
Al Viro
8959a23924 fuse_fill_write_pages(): don't bother with iov_iter_single_seg_count()
another rudiment of fault-in originally having been limited to the
first segment, same as in generic_perform_write() and friends.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-06-03 10:34:55 -04:00
Linus Torvalds
27787ba3fa Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff all over the place"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  useful constants: struct qstr for ".."
  hostfs_open(): don't open-code file_dentry()
  whack-a-mole: kill strlen_user() (again)
  autofs: should_expire() argument is guaranteed to be positive
  apparmor:match_mn() - constify devpath argument
  buffer: a small optimization in grow_buffers
  get rid of autofs_getpath()
  constify dentry argument of dentry_path()/dentry_path_raw()
2021-05-02 09:14:01 -07:00
Linus Torvalds
9ec1efbf9d fuse update for 5.13
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYIwY/wAKCRDh3BK/laaZ
 PNSmAPwLFCBGegvwxUSguiPmIXpDrrlG+USwTzGlxhVOg2ETGgEA6D+Lsz2uCBI3
 xLkPAXD6uTbWLp13YtUSMXK+LR8V5wc=
 =Fl+Q
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Fix a page locking bug in write (introduced in 2.6.26)

 - Allow sgid bit to be killed in setacl()

 - Miscellaneous fixes and cleanups

* tag 'fuse-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  cuse: simplify refcount
  cuse: prevent clone
  virtiofs: fix userns
  virtiofs: remove useless function
  virtiofs: split requests that exceed virtqueue size
  virtiofs: fix memory leak in virtio_fs_probe()
  fuse: invalidate attrs when page writeback completes
  fuse: add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID
  fuse: extend FUSE_SETXATTR request
  fuse: fix matching of FUSE_DEV_IOC_CLONE command
  fuse: fix a typo
  fuse: don't zero pages twice
  fuse: fix typo for fuse_conn.max_pages comment
  fuse: fix write deadlock
2021-04-30 15:23:16 -07:00
Linus Torvalds
a4f7fae101 Merge branch 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fileattr conversion updates from Miklos Szeredi via Al Viro:
 "This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a
  separate method.

  The interface is reasonably uniform across the filesystems that
  support it and gives nice boilerplate removal"

* 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits)
  ovl: remove unneeded ioctls
  fuse: convert to fileattr
  fuse: add internal open/release helpers
  fuse: unsigned open flags
  fuse: move ioctl to separate source file
  vfs: remove unused ioctl helpers
  ubifs: convert to fileattr
  reiserfs: convert to fileattr
  ocfs2: convert to fileattr
  nilfs2: convert to fileattr
  jfs: convert to fileattr
  hfsplus: convert to fileattr
  efivars: convert to fileattr
  xfs: convert to fileattr
  orangefs: convert to fileattr
  gfs2: convert to fileattr
  f2fs: convert to fileattr
  ext4: convert to fileattr
  ext2: convert to fileattr
  btrfs: convert to fileattr
  ...
2021-04-27 11:18:24 -07:00
Linus Torvalds
d1466bc583 Merge branch 'work.inode-type-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs inode type handling updates from Al Viro:
 "We should never change the type bits of ->i_mode or the method tables
  (->i_op and ->i_fop) of a live inode.

  Unfortunately, not all filesystems took care to prevent that"

* 'work.inode-type-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  spufs: fix bogosity in S_ISGID handling
  9p: missing chunk of "fs/9p: Don't update file type when updating file attributes"
  openpromfs: don't do unlock_new_inode() until the new inode is set up
  hostfs_mknod(): don't bother with init_special_inode()
  cifs: have cifs_fattr_to_inode() refuse to change type on live inode
  cifs: have ->mkdir() handle race with another client sanely
  do_cifs_create(): don't set ->i_mode of something we had not created
  gfs2: be careful with inode refresh
  ocfs2_inode_lock_update(): make sure we don't change the type bits of i_mode
  orangefs_inode_is_stale(): i_mode type bits do *not* form a bitmap...
  vboxsf: don't allow to change the inode type
  afs: Fix updating of i_mode due to 3rd party change
  ceph: don't allow type or device number to change on non-I_NEW inodes
  ceph: fix up error handling with snapdirs
  new helper: inode_wrong_type()
2021-04-27 10:57:42 -07:00
Al Viro
80e5d1ff5d useful constants: struct qstr for ".."
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-15 22:36:45 -04:00
Miklos Szeredi
3c9c14338c cuse: simplify refcount
Put extra reference early in cuse_channel_open().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14 10:40:58 +02:00
Miklos Szeredi
8217673d07 cuse: prevent clone
For cloned connections cuse_channel_release() will be called more than
once, resulting in use after free.

Prevent device cloning for CUSE, which does not make sense at this point,
and highly unlikely to be used in real life.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14 10:40:58 +02:00
Miklos Szeredi
0a7419c68a virtiofs: fix userns
get_user_ns() is done twice (once in virtio_fs_get_tree() and once in
fuse_conn_init()), resulting in a reference leak.

Also looks better to use fsc->user_ns (which *should* be the
current_user_ns() at this point).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14 10:40:58 +02:00
Jiapeng Chong
07595bfa24 virtiofs: remove useless function
Fix the following clang warning:

fs/fuse/virtio_fs.c:130:35: warning: unused function 'vq_to_fpq'
[-Wunused-function].

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14 10:40:57 +02:00
Connor Kuehl
a7f0d7aab0 virtiofs: split requests that exceed virtqueue size
If an incoming FUSE request can't fit on the virtqueue, the request is
placed onto a workqueue so a worker can try to resubmit it later where
there will (hopefully) be space for it next time.

This is fine for requests that aren't larger than a virtqueue's maximum
capacity.  However, if a request's size exceeds the maximum capacity of the
virtqueue (even if the virtqueue is empty), it will be doomed to a life of
being placed on the workqueue, removed, discovered it won't fit, and placed
on the workqueue yet again.

Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect
Descriptors) of the virtio spec:

  "A driver MUST NOT create a descriptor chain longer than the Queue
  Size of the device."

To fix this, limit the number of pages FUSE will use for an overall
request.  This way, each request can realistically fit on the virtqueue
when it is decomposed into a scattergather list and avoid violating section
2.6.5.3.1 of the virtio spec.

Signed-off-by: Connor Kuehl <ckuehl@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14 10:40:57 +02:00