License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                              | # files
-----------------------------------------------------|--------
GPL-2.0                                              |   11139
and resulted in the first patch in this series.
If the file was under a */uapi/* path, it was tagged "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was tagged "GPL-2.0". The results were:
SPDX license identifier                              | # files
-----------------------------------------------------|--------
GPL-2.0 WITH Linux-syscall-note                      |     930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              | # files
-----------------------------------------------------|--------
GPL-2.0 WITH Linux-syscall-note                      |     270
GPL-2.0+ WITH Linux-syscall-note                     |     169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)  |      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)  |      17
LGPL-2.1+ WITH Linux-syscall-note                    |      15
GPL-1.0+ WITH Linux-syscall-note                     |      14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) |       5
LGPL-2.0+ WITH Linux-syscall-note                    |       4
LGPL-2.1 WITH Linux-syscall-note                     |       3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)           |       3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)          |       1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were any new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 07:07:57 -07:00
// SPDX-License-Identifier: GPL-2.0
#include <linux/fanotify.h>
#include <linux/fcntl.h>
#include <linux/fdtable.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/anon_inodes.h>
#include <linux/fsnotify_backend.h>
#include <linux/init.h>
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/poll.h>
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/uaccess.h>
#include <linux/compat.h>
#include <linux/sched/signal.h>
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current process's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces a mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces a remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of a syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 15:46:39 -07:00
#include <linux/memcontrol.h>
#include <linux/statfs.h>
#include <linux/exportfs.h>

#include <asm/ioctls.h>

#include "../fsnotify.h"
#include "../fdinfo.h"
#include "fanotify.h"

#define FANOTIFY_DEFAULT_MAX_EVENTS 16384
#define FANOTIFY_OLD_DEFAULT_MAX_MARKS 8192
#define FANOTIFY_DEFAULT_MAX_GROUPS 128
#define FANOTIFY_DEFAULT_FEE_POOL_SIZE 32

/*
 * The legacy fanotify marks limit (8192) is per group and we introduced a
 * tunable limit of marks per user, similar to inotify. Effectively, the legacy
 * limit of fanotify marks per user is
 * <max marks per group> * <max groups per user>.
 * This default limit (1M) also happens to match the increased limit of inotify
 * max_user_watches since v5.10.
 */
#define FANOTIFY_DEFAULT_MAX_USER_MARKS \
	(FANOTIFY_OLD_DEFAULT_MAX_MARKS * FANOTIFY_DEFAULT_MAX_GROUPS)

/*
 * Most of the memory cost of adding an inode mark is pinning the marked inode.
 * The size of the filesystem inode struct is not uniform across filesystems,
 * so double the size of a VFS inode is used as a conservative approximation.
 */
#define INODE_MARK_COST (2 * sizeof(struct inode))

/* configurable via /proc/sys/fs/fanotify/ */
static int fanotify_max_queued_events __read_mostly;

#ifdef CONFIG_SYSCTL

#include <linux/sysctl.h>

static long ft_zero = 0;
static long ft_int_max = INT_MAX;

static struct ctl_table fanotify_table[] = {
	{
		.procname = "max_user_groups",
		.data = &init_user_ns.ucount_max[UCOUNT_FANOTIFY_GROUPS],
		.maxlen = sizeof(long),
		.mode = 0644,
		.proc_handler = proc_doulongvec_minmax,
		.extra1 = &ft_zero,
		.extra2 = &ft_int_max,
	},
	{
		.procname = "max_user_marks",
		.data = &init_user_ns.ucount_max[UCOUNT_FANOTIFY_MARKS],
		.maxlen = sizeof(long),
		.mode = 0644,
		.proc_handler = proc_doulongvec_minmax,
		.extra1 = &ft_zero,
		.extra2 = &ft_int_max,
	},
	{
		.procname = "max_queued_events",
		.data = &fanotify_max_queued_events,
		.maxlen = sizeof(int),
		.mode = 0644,
		.proc_handler = proc_dointvec_minmax,
		.extra1 = SYSCTL_ZERO
	},
};

static void __init fanotify_sysctls_init(void)
{
	register_sysctl("fs/fanotify", fanotify_table);
}
#else
#define fanotify_sysctls_init() do { } while (0)
#endif /* CONFIG_SYSCTL */

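The three sysctl entries registered under "fs/fanotify" are exposed to userspace as files in /proc/sys/fs/fanotify/. As a rough userspace illustration (not part of this file), the sketch below reads one such value. The helper name read_sysctl_long() is hypothetical, and the code assumes the file holds a single decimal number on its first line.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one decimal integer from a sysctl-style file
 * such as /proc/sys/fs/fanotify/max_queued_events.
 * Returns 0 and stores the value in *out on success, -1 on failure.
 */
static int read_sysctl_long(const char *path, long *out)
{
	char buf[64];
	char *end;
	long val;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -1;	/* overflow or no digits found */

	*out = val;
	return 0;
}
```

A caller would pass, for example, "/proc/sys/fs/fanotify/max_user_marks" and treat a -1 return as "fanotify sysctls not available on this system".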
fanotify: check file flags passed in fanotify_init
Without this patch fanotify_init does not validate the value passed in
event_f_flags.
When a fanotify event is read from the fanotify file descriptor a new
file descriptor is created where file.f_flags = event_f_flags.
Internal and external open flags are stored together in field f_flags of
struct file. Hence, an application might create file descriptors with
internal flags like FMODE_EXEC, FMODE_NOCMTIME set.
Jan Kara and Eric Paris both agreed that this is a bug and the value of
event_f_flags should be checked:
https://lkml.org/lkml/2014/4/29/522
https://lkml.org/lkml/2014/4/29/539
This updated patch version considers the comments by Michael Kerrisk in
https://lkml.org/lkml/2014/5/4/10
With the patch the value of event_f_flags is checked.
When specifying an invalid value error EINVAL is returned.
Internal flags are disallowed.
File creation flags are disallowed:
O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TRUNC, and O_TTY_INIT.
Flags which do not make sense with fanotify are disallowed:
__O_TMPFILE, O_PATH, FASYNC, and O_DIRECT.
This leaves us with the following allowed values:
O_RDONLY, O_WRONLY, O_RDWR are basic functionality. They are stored in the
bits given by O_ACCMODE.
O_APPEND is working as expected. The value might be useful in a logging
application which appends the current status each time the log is opened.
O_LARGEFILE is needed for files exceeding 4GB on 32bit systems.
O_NONBLOCK may be useful when monitoring slow devices like tapes.
O_NDELAY is equal to O_NONBLOCK except for platform parisc.
To avoid code breaking on parisc either both flags should be
allowed or none. The patch allows both.
__O_SYNC and O_DSYNC may be used to avoid data loss on power disruption.
O_NOATIME may be useful to reduce disk activity.
O_CLOEXEC may be useful, if separate processes shall be used to scan files.
Once this patch is accepted, the fanotify_init.2 manpage has to be updated.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:05:44 -07:00
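The check described in this commit message amounts to rejecting any bit outside an allow-list. The userspace sketch below mirrors that idea only; it is not the kernel's code. The function names are hypothetical, O_SYNC stands in for the kernel-internal __O_SYNC (assuming, as on Linux with glibc, that O_SYNC covers the O_DSYNC bit as well), and O_LARGEFILE and O_NOATIME are included only when the C library headers expose them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* asks glibc to expose O_LARGEFILE and O_NOATIME */
#endif
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical userspace approximation of FANOTIFY_INIT_ALL_EVENT_F_BITS. */
static unsigned int allowed_event_f_bits(void)
{
	unsigned int mask = O_ACCMODE | O_APPEND | O_NONBLOCK |
			    O_SYNC | O_DSYNC | O_CLOEXEC;

#ifdef O_LARGEFILE
	mask |= O_LARGEFILE;
#endif
#ifdef O_NOATIME
	mask |= O_NOATIME;
#endif
	return mask;
}

/* True if event_f_flags contains no bit outside the allow-list. */
static bool event_f_flags_allowed(unsigned int event_f_flags)
{
	return (event_f_flags & ~allowed_event_f_bits()) == 0;
}
```

With this predicate, O_RDONLY | O_CLOEXEC passes, while any file creation flag such as O_CREAT or O_TRUNC is rejected, which corresponds to the EINVAL case described above.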
/*
 * All flags that may be specified in parameter event_f_flags of fanotify_init.
 *
 * Internal and external open flags are stored together in field f_flags of
 * struct file. Only external open flags shall be allowed in event_f_flags.
 * Internal flags like FMODE_NONOTIFY, FMODE_EXEC, FMODE_NOCMTIME shall be
 * excluded.
 */
#define FANOTIFY_INIT_ALL_EVENT_F_BITS ( \
	O_ACCMODE | O_APPEND | O_NONBLOCK | \
	__O_SYNC | O_DSYNC | O_CLOEXEC | \
	O_LARGEFILE | O_NOATIME )

extern const struct fsnotify_ops fanotify_fsnotify_ops;

struct kmem_cache *fanotify_mark_cache __ro_after_init;
struct kmem_cache *fanotify_fid_event_cachep __ro_after_init;
struct kmem_cache *fanotify_path_event_cachep __ro_after_init;
struct kmem_cache *fanotify_perm_event_cachep __ro_after_init;

#define FANOTIFY_EVENT_ALIGN 4
#define FANOTIFY_FID_INFO_HDR_LEN \
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so the application needs to call fstatat(2) on the
name relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 08:10:22 -07:00
	(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
#define FANOTIFY_PIDFD_INFO_HDR_LEN \
	sizeof(struct fanotify_event_info_pidfd)
#define FANOTIFY_ERROR_INFO_LEN \
	(sizeof(struct fanotify_event_info_error))

static int fanotify_fid_info_len(int fh_len, int name_len)
{
	int info_len = fh_len;

	if (name_len)
		info_len += name_len + 1;

	return roundup(FANOTIFY_FID_INFO_HDR_LEN + info_len,
		       FANOTIFY_EVENT_ALIGN);
}
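The length rule in fanotify_fid_info_len() can be tried out in userspace: a fixed header, the file handle bytes, the name plus its terminating NUL when a name is present, all rounded up to a 4-byte boundary. The sketch below is a standalone re-statement of that arithmetic, not the kernel function. The names fid_info_len() and round_up_to() are hypothetical, and the header length is passed in as a parameter because the real value depends on the kernel structure sizes.

```c
#include <stddef.h>

#define EVENT_ALIGN 4	/* mirrors FANOTIFY_EVENT_ALIGN */

/* Round n up to the next multiple of align (align must be non-zero). */
static size_t round_up_to(size_t n, size_t align)
{
	return ((n + align - 1) / align) * align;
}

/*
 * Hypothetical re-statement of the info record length rule:
 * header + file handle + optional NUL-terminated name, aligned to 4 bytes.
 */
static size_t fid_info_len(size_t hdr_len, size_t fh_len, size_t name_len)
{
	size_t info_len = fh_len;

	if (name_len)
		info_len += name_len + 1;	/* name plus terminating NUL */

	return round_up_to(hdr_len + info_len, EVENT_ALIGN);
}
```

Taking 20 as an assumed header length and an 8-byte handle, a record without a name is 28 bytes, while a 1-character name gives 20 + 8 + 2 = 30, which rounds up to 32.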

/* FAN_RENAME may have one or two dir+name info records */
static int fanotify_dir_name_info_len(struct fanotify_event *event)
{
	struct fanotify_info *info = fanotify_event_info(event);
	int dir_fh_len = fanotify_event_dir_fh_len(event);
	int dir2_fh_len = fanotify_event_dir2_fh_len(event);
	int info_len = 0;

	if (dir_fh_len)
		info_len += fanotify_fid_info_len(dir_fh_len,
						  info->name_len);
	if (dir2_fh_len)
		info_len += fanotify_fid_info_len(dir2_fh_len,
						  info->name2_len);

	return info_len;
}

static size_t fanotify_event_len(unsigned int info_mode,
				 struct fanotify_event *event)
{
	size_t event_len = FAN_EVENT_METADATA_LEN;
	int fh_len;
	int dot_len = 0;

	if (!info_mode)
		return event_len;

	if (fanotify_is_error_event(event->mask))
		event_len += FANOTIFY_ERROR_INFO_LEN;

	if (fanotify_event_has_any_dir_fh(event)) {
		event_len += fanotify_dir_name_info_len(event);
	} else if ((info_mode & FAN_REPORT_NAME) &&
		   (event->mask & FAN_ONDIR)) {
		/*
		 * With group flag FAN_REPORT_NAME, if name was not recorded in
		 * event on a directory, we will report the name ".".
		 */
		dot_len = 1;
	}

	if (info_mode & FAN_REPORT_PIDFD)
		event_len += FANOTIFY_PIDFD_INFO_HDR_LEN;

	if (fanotify_event_has_object_fh(event)) {
		fh_len = fanotify_event_object_fh_len(event);
		event_len += fanotify_fid_info_len(fh_len, dot_len);
	}

	return event_len;
}

/*
 * Remove a hashed event from merge hash table.
 */
static void fanotify_unhash_event(struct fsnotify_group *group,
				  struct fanotify_event *event)
{
	assert_spin_locked(&group->notification_lock);

	pr_debug("%s: group=%p event=%p bucket=%u\n", __func__,
		 group, event, fanotify_event_hash_bucket(group, event));

	if (WARN_ON_ONCE(hlist_unhashed(&event->merge_list)))
		return;

	hlist_del_init(&event->merge_list);
}

/*
 * Get an fanotify notification event if one exists and is small
 * enough to fit in "count". Return an error pointer if the count
 * is not large enough. When permission event is dequeued, its state is
 * updated accordingly.
 */
static struct fanotify_event *get_one_event(struct fsnotify_group *group,
					    size_t count)
{
	size_t event_size;
	struct fanotify_event *event = NULL;
	struct fsnotify_event *fsn_event;
	unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);

	pr_debug("%s: group=%p count=%zu\n", __func__, group, count);

	spin_lock(&group->notification_lock);
	fsn_event = fsnotify_peek_first_event(group);
	if (!fsn_event)
		goto out;

	event = FANOTIFY_E(fsn_event);
	event_size = fanotify_event_len(info_mode, event);

	if (event_size > count) {
		event = ERR_PTR(-EINVAL);
		goto out;
	}

	/*
	 * Held the notification_lock the whole time, so this is the
	 * same event we peeked above.
	 */
	fsnotify_remove_first_event(group);
	if (fanotify_is_perm_event(event->mask))
		FANOTIFY_PERM(event)->state = FAN_EVENT_REPORTED;
	if (fanotify_is_hashed_event(event->mask))
		fanotify_unhash_event(group, event);
out:
	spin_unlock(&group->notification_lock);
	return event;
}
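Because get_one_event() yields an -EINVAL error pointer when the next event does not fit in "count", a userspace reader will normally pass a buffer with room for several events and then walk the records it received. The sketch below shows that walk using the FAN_EVENT_OK() and FAN_EVENT_NEXT() macros from <sys/fanotify.h>. The function name count_fanotify_events() is hypothetical, and the buffer is assumed to have been filled by read(2) on a fanotify file descriptor.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/fanotify.h>

/*
 * Hypothetical helper: count the complete event records in a buffer.
 * len is the number of valid bytes in buf, i.e. what read(2) returned.
 */
static size_t count_fanotify_events(const void *buf, ssize_t len)
{
	const struct fanotify_event_metadata *ev = buf;
	size_t n = 0;

	/* FAN_EVENT_NEXT() advances ev and subtracts the record size from len. */
	while (FAN_EVENT_OK(ev, len)) {
		n++;
		ev = FAN_EVENT_NEXT(ev, len);
	}
	return n;
}
```

A real consumer would inspect ev->mask and ev->fd inside the loop instead of only counting, and would close any event file descriptors it receives.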

static int create_fd(struct fsnotify_group *group, const struct path *path,
		     struct file **file)
{
	int client_fd;
	struct file *new_file;

fanotify: enable close-on-exec on events' fd when requested in fanotify_init()
According to commit 80af258867648 ("fanotify: groups can specify their
f_flags for new fd"), file descriptors created as part of file access
notification events inherit flags from the event_f_flags argument passed
to syscall fanotify_init(2)[1].
Unfortunately O_CLOEXEC is currently silently ignored.
Indeed, event_f_flags are only given to dentry_open(), which only seems to
care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT in
open_check_o_direct() and O_LARGEFILE in generic_file_open().
It's a pity, since, according to searches on various search engines and
http://codesearch.debian.net/, there is already some userspace code which
uses O_CLOEXEC:
- in systemd's readahead[2]:
fanotify_fd = fanotify_init(FAN_CLOEXEC|FAN_NONBLOCK, O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_NOATIME);
- in clsync[3]:
#define FANOTIFY_EVFLAGS (O_LARGEFILE|O_RDONLY|O_CLOEXEC)
int fanotify_d = fanotify_init(FANOTIFY_FLAGS, FANOTIFY_EVFLAGS);
- in examples [4] from "Filesystem monitoring in the Linux
kernel" article[5] by Aleksander Morgado:
if ((fanotify_fd = fanotify_init (FAN_CLOEXEC,
O_RDONLY | O_CLOEXEC | O_LARGEFILE)) < 0)
Additionally, since commit 48149e9d3a7e ("fanotify: check file flags
passed in fanotify_init"), having O_CLOEXEC as part of the second argument
of fanotify_init() is expressly allowed.
So it seems expected to set close-on-exec flag on the file descriptors if
userspace is allowed to request it with O_CLOEXEC.
But Andrew Morton raised[6] the concern that enabling close-on-exec now
might break existing applications which ask for O_CLOEXEC but expect the
file descriptor to be inherited across exec().
On the other hand, as reported by Mihai Dontu[7], close-on-exec on the file
descriptor returned as part of file access notify can break applications
due to deadlock. So close-on-exec is needed for most applications.
Moreover, applications asking for close-on-exec are likely expecting it to be
enabled, relying on O_CLOEXEC being effective. If not, it might weaken
their security, as noted by Jan Kara[8].
So this patch replaces the call to the macro get_unused_fd() with a call to the
function get_unused_fd_flags() with the event_f_flags value as argument. This way
the O_CLOEXEC flag in the second argument of the fanotify_init(2) syscall is
interpreted and close-on-exec gets enabled when requested.
[1] http://man7.org/linux/man-pages/man2/fanotify_init.2.html
[2] http://cgit.freedesktop.org/systemd/systemd/tree/src/readahead/readahead-collect.c?id=v208#n294
[3] https://github.com/xaionaro/clsync/blob/v0.2.1/sync.c#L1631
https://github.com/xaionaro/clsync/blob/v0.2.1/configuration.h#L38
[4] http://www.lanedo.com/~aleksander/fanotify/fanotify-example.c
[5] http://www.lanedo.com/2013/filesystem-monitoring-linux-kernel/
[6] http://lkml.kernel.org/r/20141001153621.65e9258e65a6167bf2e4cb50@linux-foundation.org
[7] http://lkml.kernel.org/r/20141002095046.3715eb69@mdontu-l
[8] http://lkml.kernel.org/r/20141002104410.GB19748@quack.suse.cz
Link: http://lkml.kernel.org/r/cover.1411562410.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Mihai Donțu <mihai.dontu@gmail.com>
Cc: Pádraig Brady <P@draigBrady.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Michael Kerrisk-manpages <mtk.manpages@gmail.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Richard Guy Briggs <rgb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 15:24:40 -07:00
|
|
|
client_fd = get_unused_fd_flags(group->fanotify_data.f_flags);
|
2009-12-17 19:24:26 -07:00
|
|
|
if (client_fd < 0)
|
|
|
|
return client_fd;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* we need a new file handle for the userspace program so it can read even if it was
|
|
|
|
* originally opened O_WRONLY.
|
|
|
|
*/
|
2020-03-24 07:27:52 -07:00
|
|
|
new_file = dentry_open(path,
|
2022-05-22 05:08:02 -07:00
|
|
|
group->fanotify_data.f_flags | __FMODE_NONOTIFY,
|
2020-03-24 07:27:52 -07:00
|
|
|
current_cred());
|
2009-12-17 19:24:26 -07:00
|
|
|
if (IS_ERR(new_file)) {
|
|
|
|
/*
|
|
|
|
* We still send an event even if we can't open the file. This
|
|
|
|
* can happen when, say, tasks are gone and we try to open their
|
|
|
|
* /proc files, or we try to open a WRONLY file like in sysfs.
|
|
|
|
* We just send the errno to userspace since there isn't much
|
|
|
|
* else we can do.
|
|
|
|
*/
|
|
|
|
put_unused_fd(client_fd);
|
|
|
|
client_fd = PTR_ERR(new_file);
|
|
|
|
} else {
|
2012-08-19 09:30:45 -07:00
|
|
|
*file = new_file;
|
2009-12-17 19:24:26 -07:00
|
|
|
}
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
return client_fd;
|
2009-12-17 19:24:26 -07:00
|
|
|
}
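The close-on-exec behaviour described in the commit message for create_fd() can be observed from userspace with fcntl(2). The following is a minimal sketch, not part of the kernel source: the helper names fd_is_cloexec() and demo_cloexec_check() are hypothetical, and /dev/null stands in for an event file descriptor so that the example does not need fanotify privileges.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Return 1 if fd has the close-on-exec flag set, 0 if it does not,
 * and -1 if fcntl() fails (errno is then set by fcntl()).
 */
static int fd_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}

/*
 * Hypothetical self-check: a descriptor opened with O_CLOEXEC must
 * report the flag. An application could run the same check on the fd
 * field of an event it has read from a fanotify group.
 */
static void demo_cloexec_check(void)
{
	int fd = open("/dev/null", O_RDONLY | O_CLOEXEC);

	assert(fd >= 0);
	assert(fd_is_cloexec(fd) == 1);
	close(fd);
}
```

An application that passes O_CLOEXEC in the second argument of fanotify_init(2) could call fd_is_cloexec() on an event fd to confirm that the running kernel honours the flag.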
|
|
|
|
|
2023-02-03 14:35:15 -07:00
|
|
|
static int process_access_response_info(const char __user *info,
|
|
|
|
size_t info_len,
|
|
|
|
struct fanotify_response_info_audit_rule *friar)
|
|
|
|
{
|
|
|
|
if (info_len != sizeof(*friar))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (copy_from_user(friar, info, sizeof(*friar)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
if (friar->hdr.type != FAN_RESPONSE_INFO_AUDIT_RULE)
|
|
|
|
return -EINVAL;
|
|
|
|
if (friar->hdr.pad != 0)
|
|
|
|
return -EINVAL;
|
|
|
|
if (friar->hdr.len != sizeof(*friar))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return info_len;
|
|
|
|
}
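The checks in process_access_response_info() can be mirrored on an ordinary buffer to see which records would be accepted. This is a sketch under stated assumptions: the sketch_* types are local stand-ins, only the hdr.type, hdr.pad and hdr.len fields are taken from the code above, and the payload fields and the numeric value of SKETCH_INFO_AUDIT_RULE are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed value; real code takes the constant from <linux/fanotify.h>. */
#define SKETCH_INFO_AUDIT_RULE 1

/* Local mirror of the header fields that the kernel code checks. */
struct sketch_info_hdr {
	uint8_t type;
	uint8_t pad;
	uint16_t len;
};

/* The payload fields below are illustrative assumptions. */
struct sketch_audit_rule {
	struct sketch_info_hdr hdr;
	uint32_t rule_number;
	uint32_t subj_trust;
	uint32_t obj_trust;
};

static_assert(sizeof(struct sketch_audit_rule) == 16,
	      "sketch record is expected to have no padding");

/*
 * Apply the same sequence of checks as the kernel function: exact
 * buffer size, record type, zero pad byte, and a length field that
 * matches the record size. Returns the record length or -EINVAL.
 */
static int sketch_check_info(const void *buf, size_t buf_len,
			     struct sketch_audit_rule *out)
{
	if (buf_len != sizeof(*out))
		return -EINVAL;

	memcpy(out, buf, sizeof(*out));

	if (out->hdr.type != SKETCH_INFO_AUDIT_RULE)
		return -EINVAL;
	if (out->hdr.pad != 0)
		return -EINVAL;
	if (out->hdr.len != sizeof(*out))
		return -EINVAL;

	return (int)buf_len;
}
```

The size is compared before anything else, so a record that is one byte short or long is rejected without its contents being inspected.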
|
|
|
|
|
2019-01-08 06:02:44 -07:00
|
|
|
/*
|
|
|
|
* Finish processing of permission event by setting it to ANSWERED state and
|
|
|
|
* drop group->notification_lock.
|
|
|
|
*/
|
|
|
|
static void finish_permission_event(struct fsnotify_group *group,
|
2023-02-03 14:35:15 -07:00
|
|
|
struct fanotify_perm_event *event, u32 response,
|
|
|
|
struct fanotify_response_info_audit_rule *friar)
|
2019-01-08 06:02:44 -07:00
|
|
|
__releases(&group->notification_lock)
|
|
|
|
{
|
2019-01-08 07:18:02 -07:00
|
|
|
bool destroy = false;
|
|
|
|
|
2019-01-08 06:02:44 -07:00
|
|
|
assert_spin_locked(&group->notification_lock);
|
2023-02-03 14:35:15 -07:00
|
|
|
event->response = response & ~FAN_INFO;
|
|
|
|
if (response & FAN_INFO)
|
|
|
|
memcpy(&event->audit_rule, friar, sizeof(*friar));
|
|
|
|
|
2019-01-08 07:18:02 -07:00
|
|
|
if (event->state == FAN_EVENT_CANCELED)
|
|
|
|
destroy = true;
|
|
|
|
else
|
|
|
|
event->state = FAN_EVENT_ANSWERED;
|
2019-01-08 06:02:44 -07:00
|
|
|
spin_unlock(&group->notification_lock);
|
2019-01-08 07:18:02 -07:00
|
|
|
if (destroy)
|
|
|
|
fsnotify_destroy_event(group, &event->fae.fse);
|
2019-01-08 06:02:44 -07:00
|
|
|
}
|
|
|
|
|
2009-12-17 19:24:34 -07:00
|
|
|
static int process_access_response(struct fsnotify_group *group,
|
2023-02-03 14:35:15 -07:00
|
|
|
struct fanotify_response *response_struct,
|
|
|
|
const char __user *info,
|
|
|
|
size_t info_len)
|
2009-12-17 19:24:34 -07:00
|
|
|
{
|
2019-01-10 10:04:32 -07:00
|
|
|
struct fanotify_perm_event *event;
|
2014-04-03 14:46:33 -07:00
|
|
|
int fd = response_struct->fd;
|
2023-02-03 14:35:14 -07:00
|
|
|
u32 response = response_struct->response;
|
2023-02-03 14:35:15 -07:00
|
|
|
int ret = info_len;
|
|
|
|
struct fanotify_response_info_audit_rule friar;
|
2009-12-17 19:24:34 -07:00
|
|
|
|
2023-02-03 14:35:15 -07:00
|
|
|
pr_debug("%s: group=%p fd=%d response=%u buf=%p size=%zu\n", __func__,
|
|
|
|
group, fd, response, info, info_len);
|
2009-12-17 19:24:34 -07:00
|
|
|
/*
|
|
|
|
* make sure the response is valid, if invalid we do nothing and either
|
2011-03-30 18:57:33 -07:00
|
|
|
* userspace can send a valid response or we will clean it up after the
|
2009-12-17 19:24:34 -07:00
|
|
|
* timeout
|
|
|
|
*/
|
2023-02-03 14:35:15 -07:00
|
|
|
if (response & ~FANOTIFY_RESPONSE_VALID_MASK)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
switch (response & FANOTIFY_RESPONSE_ACCESS) {
|
2009-12-17 19:24:34 -07:00
|
|
|
case FAN_ALLOW:
|
|
|
|
case FAN_DENY:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2023-02-03 14:35:15 -07:00
|
|
|
if ((response & FAN_AUDIT) && !FAN_GROUP_FLAG(group, FAN_ENABLE_AUDIT))
|
2009-12-17 19:24:34 -07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2023-02-03 14:35:15 -07:00
|
|
|
if (response & FAN_INFO) {
|
|
|
|
ret = process_access_response_info(info, info_len, &friar);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
if (fd == FAN_NOFD)
|
|
|
|
return ret;
|
|
|
|
} else {
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fd < 0)
|
2017-10-02 17:21:39 -07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2019-01-08 05:28:18 -07:00
|
|
|
spin_lock(&group->notification_lock);
|
|
|
|
list_for_each_entry(event, &group->fanotify_data.access_list,
|
|
|
|
fae.fse.list) {
|
|
|
|
if (event->fd != fd)
|
|
|
|
continue;
|
2009-12-17 19:24:34 -07:00
|
|
|
|
2019-01-08 05:28:18 -07:00
|
|
|
list_del_init(&event->fae.fse.list);
|
2023-02-03 14:35:15 -07:00
|
|
|
finish_permission_event(group, event, response, &friar);
|
2019-01-08 05:28:18 -07:00
|
|
|
wake_up(&group->fanotify_data.access_waitq);
|
2023-02-03 14:35:15 -07:00
|
|
|
return ret;
|
2019-01-08 05:28:18 -07:00
|
|
|
}
|
|
|
|
spin_unlock(&group->notification_lock);
|
2009-12-17 19:24:34 -07:00
|
|
|
|
2019-01-08 05:28:18 -07:00
|
|
|
return -ENOENT;
|
2009-12-17 19:24:34 -07:00
|
|
|
}
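The order of the response-word checks in process_access_response() can be summarised as a pure function. This is a sketch only: the SK_* names are hypothetical, their numeric values are assumptions made for illustration (real code uses the constants from <linux/fanotify.h>), and the fd lookup and FAN_INFO handling are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag values, for illustration only. */
#define SK_ALLOW 0x01u
#define SK_DENY  0x02u
#define SK_AUDIT 0x10u
#define SK_INFO  0x20u
#define SK_ACCESS_MASK (SK_ALLOW | SK_DENY)
#define SK_VALID_MASK  (SK_ACCESS_MASK | SK_AUDIT | SK_INFO)

static_assert((SK_ACCESS_MASK & (SK_AUDIT | SK_INFO)) == 0,
	      "access bits and modifier flags must not overlap");

/*
 * Return 0 if the response word would pass the checks, -EINVAL if not:
 *  1. no bits outside the valid mask,
 *  2. exactly one of allow/deny in the access bits,
 *  3. the audit flag only when the group enabled auditing.
 */
static int sketch_check_response(uint32_t response, bool group_audit_enabled)
{
	if (response & ~SK_VALID_MASK)
		return -EINVAL;

	switch (response & SK_ACCESS_MASK) {
	case SK_ALLOW:
	case SK_DENY:
		break;
	default:
		return -EINVAL;
	}

	if ((response & SK_AUDIT) && !group_audit_enabled)
		return -EINVAL;

	return 0;
}
```

Because the access bits are compared by value in the switch, a response with both bits set, or with neither, falls through to the default case and is rejected.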
|
|
|
|
|
2021-10-25 12:27:42 -07:00
|
|
|
static size_t copy_error_info_to_user(struct fanotify_event *event,
|
|
|
|
char __user *buf, int count)
|
|
|
|
{
|
2021-11-29 13:15:33 -07:00
|
|
|
struct fanotify_event_info_error info = { };
|
2021-10-25 12:27:42 -07:00
|
|
|
struct fanotify_error_event *fee = FANOTIFY_EE(event);
|
|
|
|
|
|
|
|
info.hdr.info_type = FAN_EVENT_INFO_TYPE_ERROR;
|
|
|
|
info.hdr.len = FANOTIFY_ERROR_INFO_LEN;
|
|
|
|
|
|
|
|
if (WARN_ON(count < info.hdr.len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
info.error = fee->error;
|
|
|
|
info.error_count = fee->err_count;
|
|
|
|
|
|
|
|
if (copy_to_user(buf, &info, sizeof(info)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return info.hdr.len;
|
|
|
|
}
|
|
|
|
|
2021-08-07 22:25:32 -07:00
|
|
|
static int copy_fid_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
|
|
|
|
int info_type, const char *name,
|
|
|
|
size_t name_len,
|
|
|
|
char __user *buf, size_t count)
|
2019-01-10 10:04:35 -07:00
|
|
|
{
|
|
|
|
struct fanotify_event_info_fid info = { };
|
|
|
|
struct file_handle handle = { };
|
2020-03-24 08:55:37 -07:00
|
|
|
unsigned char bounce[FANOTIFY_INLINE_FH_LEN], *fh_buf;
|
2020-03-19 08:10:21 -07:00
|
|
|
size_t fh_len = fh ? fh->len : 0;
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so the application needs to call fstatat(2) on the
name relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to look up the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where the event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for
non-privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 08:10:22 -07:00
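The first approach listed in the commit message above, calling fstatat(2) on the reported name relative to the watched directory, might look like the sketch below. The helper name stat_reported_name() is hypothetical, and the choice of AT_SYMLINK_NOFOLLOW is an assumption of this example (it describes the directory entry itself and does not follow a symlink), not something the commit message prescribes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Stat the entry 'name' inside the directory referred to by 'dirfd'.
 * 'dirfd' is the descriptor the application keeps open for the watched
 * directory; 'name' is the string taken from the event's name record.
 * Returns 0 on success, -1 on failure with errno set by fstatat().
 */
static int stat_reported_name(int dirfd, const char *name, struct stat *st)
{
	assert(name != NULL && st != NULL);
	return fstatat(dirfd, name, st, AT_SYMLINK_NOFOLLOW);
}
```

The entry may already be gone by the time the event is read, so a caller should expect stat_reported_name() to fail with ENOENT and treat that as a normal outcome.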
|
|
|
size_t info_len = fanotify_fid_info_len(fh_len, name_len);
|
|
|
|
size_t len = info_len;
|
2019-01-10 10:04:35 -07:00
|
|
|
|
2020-03-19 08:10:22 -07:00
|
|
|
pr_debug("%s: fh_len=%zu name_len=%zu, info_len=%zu, count=%zu\n",
|
|
|
|
__func__, fh_len, name_len, info_len, count);
|
|
|
|
|
|
|
|
if (WARN_ON_ONCE(len < sizeof(info) || len > count))
|
2019-01-10 10:04:35 -07:00
|
|
|
return -EFAULT;
|
|
|
|
|
2020-03-19 08:10:22 -07:00
|
|
|
/*
|
|
|
|
* Copy event info fid header followed by variable sized file handle
|
|
|
|
* and optionally followed by variable sized filename.
|
|
|
|
*/
|
2020-07-16 01:42:26 -07:00
|
|
|
switch (info_type) {
|
|
|
|
case FAN_EVENT_INFO_TYPE_FID:
|
|
|
|
case FAN_EVENT_INFO_TYPE_DFID:
|
|
|
|
if (WARN_ON_ONCE(name_len))
|
|
|
|
return -EFAULT;
|
|
|
|
break;
|
|
|
|
case FAN_EVENT_INFO_TYPE_DFID_NAME:
|
2021-11-29 13:15:36 -07:00
|
|
|
case FAN_EVENT_INFO_TYPE_OLD_DFID_NAME:
|
|
|
|
case FAN_EVENT_INFO_TYPE_NEW_DFID_NAME:
|
2020-07-16 01:42:26 -07:00
|
|
|
if (WARN_ON_ONCE(!name || !name_len))
|
|
|
|
return -EFAULT;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
|
|
|
|
info.hdr.info_type = info_type;
|
2019-01-10 10:04:35 -07:00
|
|
|
info.hdr.len = len;
|
2020-03-19 08:10:20 -07:00
|
|
|
info.fsid = *fsid;
|
2019-01-10 10:04:35 -07:00
|
|
|
if (copy_to_user(buf, &info, sizeof(info)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
buf += sizeof(info);
|
|
|
|
len -= sizeof(info);
|
2020-03-19 08:10:22 -07:00
|
|
|
if (WARN_ON_ONCE(len < sizeof(handle)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
2020-03-24 08:55:37 -07:00
|
|
|
handle.handle_type = fh->type;
|
2019-01-10 10:04:35 -07:00
|
|
|
handle.handle_bytes = fh_len;
|
2021-10-25 12:27:41 -07:00
|
|
|
|
|
|
|
/* Mangle handle_type for bad file_handle */
|
|
|
|
if (!fh_len)
|
|
|
|
handle.handle_type = FILEID_INVALID;
|
|
|
|
|
2019-01-10 10:04:35 -07:00
|
|
|
if (copy_to_user(buf, &handle, sizeof(handle)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
buf += sizeof(handle);
|
|
|
|
len -= sizeof(handle);
|
2020-03-19 08:10:22 -07:00
|
|
|
if (WARN_ON_ONCE(len < fh_len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
2019-03-12 04:42:37 -07:00
|
|
|
/*
|
2020-03-19 08:10:22 -07:00
|
|
|
* For an inline fh and inline file name, copy through stack to exclude
|
|
|
|
* the copy from usercopy hardening protections.
|
2019-03-12 04:42:37 -07:00
|
|
|
*/
|
2020-03-24 08:55:37 -07:00
|
|
|
fh_buf = fanotify_fh_buf(fh);
|
2019-03-12 04:42:37 -07:00
|
|
|
if (fh_len <= FANOTIFY_INLINE_FH_LEN) {
|
2020-03-24 08:55:37 -07:00
|
|
|
memcpy(bounce, fh_buf, fh_len);
|
|
|
|
fh_buf = bounce;
|
2019-03-12 04:42:37 -07:00
|
|
|
}
|
2020-03-24 08:55:37 -07:00
|
|
|
if (copy_to_user(buf, fh_buf, fh_len))
|
2019-01-10 10:04:35 -07:00
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
buf += fh_len;
|
|
|
|
len -= fh_len;
|
2020-03-19 08:10:22 -07:00
|
|
|
|
|
|
|
if (name_len) {
|
|
|
|
/* Copy the filename with terminating null */
|
|
|
|
name_len++;
|
|
|
|
if (WARN_ON_ONCE(len < name_len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
if (copy_to_user(buf, name, name_len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
buf += name_len;
|
|
|
|
len -= name_len;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Pad with 0's */
|
2024-05-20 12:43:58 -07:00
|
|
|
WARN_ON_ONCE(len < 0 || len >= FANOTIFY_EVENT_ALIGN);
|
2019-01-10 10:04:35 -07:00
|
|
|
if (len > 0 && clear_user(buf, len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
2020-03-19 08:10:22 -07:00
|
|
|
return info_len;
|
2019-01-10 10:04:35 -07:00
|
|
|
}
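The record that copy_fid_info_to_user() writes is laid out as an info header with fsid, a file handle header, the handle bytes, an optional NUL-terminated name, and zero padding up to hdr.len. A userspace reader could walk that layout roughly as follows. This is a sketch under stated assumptions: the sketch_* structures are local mirrors written for this example and are not the UAPI definitions, and telling a name from padding by looking at the first byte is a simplification (a real reader would decide from info_type).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed mirror of the info header plus fsid. */
struct sketch_fid_hdr {
	uint8_t info_type;
	uint8_t pad;
	uint16_t len;		/* total record length, padding included */
	int32_t fsid[2];
};

/* Assumed mirror of the fixed part of the file handle. */
struct sketch_fh_hdr {
	uint32_t handle_bytes;
	int32_t handle_type;
};

static_assert(sizeof(struct sketch_fid_hdr) == 12,
	      "sketch header is expected to have no padding");

struct sketch_fid_view {
	struct sketch_fid_hdr hdr;
	struct sketch_fh_hdr fh;
	const unsigned char *handle;	/* fh.handle_bytes bytes inside buf */
	const char *name;		/* NUL-terminated, or NULL if absent */
};

/* Returns the record length consumed, or -1 if the record is malformed. */
static int sketch_parse_fid(const unsigned char *buf, size_t count,
			    struct sketch_fid_view *v)
{
	size_t off, len;

	if (count < sizeof(v->hdr))
		return -1;
	memcpy(&v->hdr, buf, sizeof(v->hdr));
	len = v->hdr.len;
	if (len < sizeof(v->hdr) + sizeof(v->fh) || len > count)
		return -1;
	off = sizeof(v->hdr);

	memcpy(&v->fh, buf + off, sizeof(v->fh));
	off += sizeof(v->fh);
	if (v->fh.handle_bytes > len - off)
		return -1;
	v->handle = buf + off;
	off += v->fh.handle_bytes;

	/* What follows the handle is either a name or zero padding. */
	v->name = NULL;
	if (off < len && buf[off] != '\0') {
		if (!memchr(buf + off, '\0', len - off))
			return -1;
		v->name = (const char *)(buf + off);
	}
	return (int)len;
}
```

Every length is checked against hdr.len before the corresponding bytes are read, which mirrors the way the kernel function decrements len after each copy.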
|
|
|
|
|
2021-08-07 22:26:25 -07:00
|
|
|
static int copy_pidfd_info_to_user(int pidfd,
|
|
|
|
char __user *buf,
|
|
|
|
size_t count)
|
|
|
|
{
|
|
|
|
struct fanotify_event_info_pidfd info = { };
|
|
|
|
size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
|
|
|
|
|
|
|
|
if (WARN_ON_ONCE(info_len > count))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
|
|
|
|
info.hdr.len = info_len;
|
|
|
|
info.pidfd = pidfd;
|
|
|
|
|
|
|
|
if (copy_to_user(buf, &info, info_len))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
return info_len;
|
|
|
|
}
|
|
|
|
|
2021-08-07 22:25:58 -07:00
|
|
|
static int copy_info_records_to_user(struct fanotify_event *event,
|
|
|
|
struct fanotify_info *info,
|
2021-08-07 22:26:25 -07:00
|
|
|
unsigned int info_mode, int pidfd,
|
2021-08-07 22:25:58 -07:00
|
|
|
char __user *buf, size_t count)
|
|
|
|
{
|
|
|
|
int ret, total_bytes = 0, info_type = 0;
|
|
|
|
unsigned int fid_mode = info_mode & FANOTIFY_FID_BITS;
|
2021-08-07 22:26:25 -07:00
|
|
|
unsigned int pidfd_mode = info_mode & FAN_REPORT_PIDFD;
|
2021-08-07 22:25:58 -07:00
|
|
|
|
|
|
|
/*
|
2021-11-29 13:15:36 -07:00
|
|
|
* Event info records order is as follows:
|
|
|
|
* 1. dir fid + name
|
|
|
|
* 2. (optional) new dir fid + new name
|
|
|
|
* 3. (optional) child fid
|
2021-08-07 22:25:58 -07:00
|
|
|
*/
|
2021-10-25 12:27:38 -07:00
|
|
|
if (fanotify_event_has_dir_fh(event)) {
|
2021-08-07 22:25:58 -07:00
|
|
|
info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
|
|
|
|
FAN_EVENT_INFO_TYPE_DFID;
|
2021-11-29 13:15:36 -07:00
|
|
|
|
|
|
|
/* FAN_RENAME uses special info types */
|
|
|
|
if (event->mask & FAN_RENAME)
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_OLD_DFID_NAME;
|
|
|
|
|
2021-08-07 22:25:58 -07:00
|
|
|
ret = copy_fid_info_to_user(fanotify_event_fsid(event),
|
|
|
|
fanotify_info_dir_fh(info),
|
|
|
|
info_type,
|
|
|
|
fanotify_info_name(info),
|
|
|
|
info->name_len, buf, count);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
|
|
|
total_bytes += ret;
|
|
|
|
}
|
|
|
|
|
2021-11-29 13:15:36 -07:00
|
|
|
/* New dir fid+name may be reported in addition to old dir fid+name */
|
|
|
|
if (fanotify_event_has_dir2_fh(event)) {
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_NEW_DFID_NAME;
|
|
|
|
ret = copy_fid_info_to_user(fanotify_event_fsid(event),
|
|
|
|
fanotify_info_dir2_fh(info),
|
|
|
|
info_type,
|
|
|
|
fanotify_info_name2(info),
|
|
|
|
info->name2_len, buf, count);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
|
|
|
total_bytes += ret;
|
|
|
|
}
|
|
|
|
|
2021-10-25 12:27:38 -07:00
|
|
|
if (fanotify_event_has_object_fh(event)) {
|
2021-08-07 22:25:58 -07:00
|
|
|
const char *dot = NULL;
|
|
|
|
int dot_len = 0;
|
|
|
|
|
|
|
|
if (fid_mode == FAN_REPORT_FID || info_type) {
|
|
|
|
/*
|
|
|
|
* With only group flag FAN_REPORT_FID only type FID is
|
|
|
|
* reported. Second info record type is always FID.
|
|
|
|
*/
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_FID;
|
|
|
|
} else if ((fid_mode & FAN_REPORT_NAME) &&
|
|
|
|
(event->mask & FAN_ONDIR)) {
|
|
|
|
/*
|
|
|
|
* With group flag FAN_REPORT_NAME, if name was not
|
|
|
|
* recorded in an event on a directory, report the name
|
|
|
|
* "." with info type DFID_NAME.
|
|
|
|
*/
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
|
|
|
|
dot = ".";
|
|
|
|
dot_len = 1;
|
|
|
|
} else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
|
|
|
|
(event->mask & FAN_ONDIR)) {
|
|
|
|
/*
|
|
|
|
* With group flag FAN_REPORT_DIR_FID, a single info
|
|
|
|
* record has type DFID for directory entry modification
|
|
|
|
* event and for event on a directory.
|
|
|
|
*/
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_DFID;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* With group flags FAN_REPORT_DIR_FID|FAN_REPORT_FID,
|
|
|
|
* a single info record has type FID for event on a
|
|
|
|
* non-directory, when there is no directory to report.
|
|
|
|
* For example, on FAN_DELETE_SELF event.
|
|
|
|
*/
|
|
|
|
info_type = FAN_EVENT_INFO_TYPE_FID;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = copy_fid_info_to_user(fanotify_event_fsid(event),
|
|
|
|
fanotify_event_object_fh(event),
|
|
|
|
info_type, dot, dot_len,
|
|
|
|
buf, count);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
|
|
|
total_bytes += ret;
|
|
|
|
}
|
|
|
|
|
2021-08-07 22:26:25 -07:00
|
|
|
if (pidfd_mode) {
|
|
|
|
ret = copy_pidfd_info_to_user(pidfd, buf, count);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
|
|
|
total_bytes += ret;
|
|
|
|
}
|
|
|
|
|
2021-10-25 12:27:42 -07:00
|
|
|
if (fanotify_is_error_event(event->mask)) {
|
|
|
|
ret = copy_error_info_to_user(event, buf, count);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
|
|
|
total_bytes += ret;
|
|
|
|
}
|
|
|
|
|
2021-08-07 22:25:58 -07:00
|
|
|
return total_bytes;
|
|
|
|
}
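A listener sees the records written above as a packed sequence, each one starting with a small header that carries its type and total length. The following is a minimal userspace sketch of walking such a sequence. `struct info_hdr` is a local mirror of the UAPI info header whose layout is assumed here, and `walk_info_records()` is a hypothetical helper, not a kernel or libc function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the fanotify info record header (assumed layout). */
struct info_hdr {
	uint8_t info_type;
	uint8_t pad;
	uint16_t len;	/* length of the whole record, header included */
};

/*
 * Hypothetical helper: walk the records in buf[0..size) and store the
 * type of each of the first max_types records in types[].
 * Returns the number of records, or -1 if a record is malformed.
 */
static int walk_info_records(const unsigned char *buf, size_t size,
			     uint8_t *types, size_t max_types)
{
	size_t off = 0;
	int n = 0;

	while (off < size) {
		struct info_hdr hdr;

		if (size - off < sizeof(hdr))
			return -1;
		/* Copy out the header to avoid unaligned access. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > size - off)
			return -1;
		if ((size_t)n < max_types)
			types[n] = hdr.info_type;
		n++;
		off += hdr.len;
	}
	return n;
}
```

The order in which types come back follows the order documented in the comment at the top of the function: directory records first, then the optional child record, then any pidfd or error record.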
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
static ssize_t copy_event_to_user(struct fsnotify_group *group,
|
2020-03-24 09:04:20 -07:00
|
|
|
struct fanotify_event *event,
|
2018-12-04 16:44:46 -07:00
|
|
|
char __user *buf, size_t count)
|
2009-12-17 19:24:26 -07:00
|
|
|
{
|
2019-01-10 10:04:33 -07:00
|
|
|
struct fanotify_event_metadata metadata;
|
2022-08-04 09:57:38 -07:00
|
|
|
const struct path *path = fanotify_event_path(event);
|
2020-07-16 01:42:17 -07:00
|
|
|
struct fanotify_info *info = fanotify_event_info(event);
|
2021-08-07 22:25:58 -07:00
|
|
|
unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
|
2021-08-07 22:26:25 -07:00
|
|
|
unsigned int pidfd_mode = info_mode & FAN_REPORT_PIDFD;
|
2023-03-27 11:22:53 -07:00
|
|
|
struct file *f = NULL, *pidfd_file = NULL;
|
2021-08-07 22:26:25 -07:00
|
|
|
int ret, pidfd = FAN_NOPIDFD, fd = FAN_NOFD;
|
2009-12-17 19:24:26 -07:00
|
|
|
|
2020-03-24 09:04:20 -07:00
|
|
|
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
|
2009-12-17 19:24:26 -07:00
|
|
|
|
2021-10-25 12:27:20 -07:00
|
|
|
metadata.event_len = fanotify_event_len(info_mode, event);
|
2019-01-10 10:04:33 -07:00
|
|
|
metadata.metadata_len = FAN_EVENT_METADATA_LEN;
|
|
|
|
metadata.vers = FANOTIFY_METADATA_VERSION;
|
|
|
|
metadata.reserved = 0;
|
|
|
|
metadata.mask = event->mask & FANOTIFY_OUTGOING_EVENTS;
|
|
|
|
metadata.pid = pid_vnr(event->pid);
|
2021-03-04 04:29:21 -07:00
|
|
|
/*
|
|
|
|
* For an unprivileged listener, event->pid can be used to identify the
|
|
|
|
* events generated by the listener process itself, without disclosing
|
|
|
|
* the pids of other processes.
|
|
|
|
*/
|
2021-05-24 06:53:21 -07:00
|
|
|
if (FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV) &&
|
2021-03-04 04:29:21 -07:00
|
|
|
task_tgid(current) != event->pid)
|
|
|
|
metadata.pid = 0;
|
2019-01-10 10:04:33 -07:00
|
|
|
|
2021-05-24 06:53:21 -07:00
|
|
|
/*
|
|
|
|
* For now, fid mode is required for an unprivileged listener and
|
|
|
|
* fid mode does not report fd in events. Keep this check anyway
|
|
|
|
* for safety in case fid mode requirement is relaxed in the future
|
|
|
|
* to allow unprivileged listener to get events with no fd and no fid.
|
|
|
|
*/
|
|
|
|
if (!FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV) &&
|
|
|
|
path && path->mnt && path->dentry) {
|
2020-03-24 08:55:37 -07:00
|
|
|
fd = create_fd(group, path, &f);
|
|
|
|
if (fd < 0)
|
|
|
|
return fd;
|
2019-01-10 10:04:33 -07:00
|
|
|
}
|
|
|
|
metadata.fd = fd;
|
2009-12-17 19:24:34 -07:00
|
|
|
|
2021-08-07 22:26:25 -07:00
|
|
|
if (pidfd_mode) {
|
|
|
|
/*
|
|
|
|
* Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
|
|
|
|
* exclusion is ever lifted. At the time of incorporating pidfd
|
|
|
|
* support within fanotify, the pidfd API only supported the
|
|
|
|
* creation of pidfds for thread-group leaders.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The PIDTYPE_TGID check for an event->pid is performed
|
|
|
|
* preemptively in an attempt to catch out cases where the event
|
|
|
|
* listener reads events after the event generating process has
|
|
|
|
* already terminated. Report FAN_NOPIDFD to the event listener
|
|
|
|
* in those cases, with all other pidfd creation errors being
|
|
|
|
* reported as FAN_EPIDFD.
|
|
|
|
*/
|
|
|
|
if (metadata.pid == 0 ||
|
|
|
|
!pid_has_task(event->pid, PIDTYPE_TGID)) {
|
|
|
|
pidfd = FAN_NOPIDFD;
|
|
|
|
} else {
|
2023-03-27 11:22:53 -07:00
|
|
|
pidfd = pidfd_prepare(event->pid, 0, &pidfd_file);
|
2021-08-07 22:26:25 -07:00
|
|
|
if (pidfd < 0)
|
|
|
|
pidfd = FAN_EPIDFD;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-12-17 19:24:34 -07:00
|
|
|
ret = -EFAULT;
|
2018-12-04 16:44:46 -07:00
|
|
|
/*
|
|
|
|
* Sanity check copy size in case get_one_event() and
|
2020-05-12 11:18:36 -07:00
|
|
|
* event_len sizes ever get out of sync.
|
2018-12-04 16:44:46 -07:00
|
|
|
*/
|
2019-01-10 10:04:33 -07:00
|
|
|
if (WARN_ON_ONCE(metadata.event_len > count))
|
2018-12-04 16:44:46 -07:00
|
|
|
goto out_close_fd;
|
2019-01-10 10:04:33 -07:00
|
|
|
|
2019-01-10 10:04:35 -07:00
|
|
|
if (copy_to_user(buf, &metadata, FAN_EVENT_METADATA_LEN))
|
2012-08-19 09:30:45 -07:00
|
|
|
goto out_close_fd;
|
|
|
|
|
fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported. With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.
When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.
For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY). Later on, events "on child" will report both records.
There are several ways that an application can use this information:
1. When watching a single directory, the name is always relative to
the watched directory, so the application needs to fstatat(2) the name
relative to the watched directory.
2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2). When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.
3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.
The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.
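All three options start from the same step: locating the name that follows the fid inside the info record. A minimal sketch of that step is shown here. It assumes the record layout is a 4-byte header, an 8-byte fsid, then a file handle made of a 32-bit handle_bytes field, a 32-bit handle type and handle_bytes opaque bytes, with the NUL-terminated name placed right after the handle. `dfid_name()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: return a pointer to the name stored in a
 * "directory fid + name" info record, or NULL if the record is too
 * short or the name is not NUL-terminated inside the record.
 *
 * Assumed layout: 4-byte header, 8-byte fsid, 32-bit handle_bytes,
 * 32-bit handle_type, handle_bytes bytes of handle, then the name.
 */
static const char *dfid_name(const unsigned char *rec, size_t rec_len)
{
	const size_t fixed = 4 + 8 + 8;	/* header + fsid + handle prefix */
	uint32_t handle_bytes;
	size_t name_off;

	if (rec_len < fixed)
		return NULL;
	/* handle_bytes sits right after the header and the fsid. */
	memcpy(&handle_bytes, rec + 12, sizeof(handle_bytes));
	if (handle_bytes > rec_len - fixed)
		return NULL;
	name_off = fixed + handle_bytes;
	if (!memchr(rec + name_off, '\0', rec_len - name_off))
		return NULL;
	return (const char *)(rec + name_off);
}
```

The returned name would then be passed to fstatat(2) together with a dirfd obtained by whichever of the three options fits the application.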
Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-19 08:10:22 -07:00
|
|
|
buf += FAN_EVENT_METADATA_LEN;
|
|
|
|
count -= FAN_EVENT_METADATA_LEN;
|
|
|
|
|
2019-01-10 10:04:33 -07:00
|
|
|
if (fanotify_is_perm_event(event->mask))
|
2020-03-24 09:04:20 -07:00
|
|
|
FANOTIFY_PERM(event)->fd = fd;
|
2009-12-17 19:24:26 -07:00
|
|
|
|
2021-08-07 22:25:58 -07:00
|
|
|
if (info_mode) {
|
2021-08-07 22:26:25 -07:00
|
|
|
ret = copy_info_records_to_user(event, info, info_mode, pidfd,
|
2021-08-07 22:25:58 -07:00
|
|
|
buf, count);
|
2019-01-10 10:04:35 -07:00
|
|
|
if (ret < 0)
|
2021-06-10 20:32:06 -07:00
|
|
|
goto out_close_fd;
|
2019-01-10 10:04:35 -07:00
|
|
|
}
|
|
|
|
|
2022-01-28 12:57:01 -07:00
|
|
|
if (f)
|
|
|
|
fd_install(fd, f);
|
|
|
|
|
2023-03-27 11:22:53 -07:00
|
|
|
if (pidfd_file)
|
|
|
|
fd_install(pidfd, pidfd_file);
|
|
|
|
|
2019-01-10 10:04:33 -07:00
|
|
|
return metadata.event_len;
|
2009-12-17 19:24:34 -07:00
|
|
|
|
|
|
|
out_close_fd:
|
2012-08-19 09:30:45 -07:00
|
|
|
if (fd != FAN_NOFD) {
|
|
|
|
put_unused_fd(fd);
|
|
|
|
fput(f);
|
|
|
|
}
|
2021-08-07 22:26:25 -07:00
|
|
|
|
2023-03-27 11:22:53 -07:00
|
|
|
if (pidfd >= 0) {
|
|
|
|
put_unused_fd(pidfd);
|
|
|
|
fput(pidfd_file);
|
|
|
|
}
|
2021-08-07 22:26:25 -07:00
|
|
|
|
2009-12-17 19:24:34 -07:00
|
|
|
return ret;
|
2009-12-17 19:24:26 -07:00
|
|
|
}
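Since copy_event_to_user() returns metadata.event_len and writes the metadata first, a reader can step through a buffer of events by following event_len. The sketch below shows that walk in userspace. `struct event_meta` is a local mirror of the event metadata structure with an assumed layout, and `count_events()` is a hypothetical helper; real code would normally use the iteration macros from the UAPI header instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the fanotify event metadata (assumed layout). */
struct event_meta {
	uint32_t event_len;	/* metadata plus any info records */
	uint8_t vers;
	uint8_t reserved;
	uint16_t metadata_len;
	uint64_t mask;
	int32_t fd;
	int32_t pid;
};

/*
 * Hypothetical helper: count the complete events in buf[0..len).
 * Stops at the first event whose length field is inconsistent.
 */
static int count_events(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int n = 0;

	while (len - off >= sizeof(struct event_meta)) {
		struct event_meta m;

		memcpy(&m, buf + off, sizeof(m));
		if (m.event_len < sizeof(m) || m.event_len > len - off)
			break;
		n++;
		off += m.event_len;
	}
	return n;
}
```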
|
|
|
|
|
|
|
|
/* fanotify userspace file descriptor functions */
|
2017-07-02 22:02:18 -07:00
|
|
|
static __poll_t fanotify_poll(struct file *file, poll_table *wait)
|
2009-12-17 19:24:26 -07:00
|
|
|
{
|
|
|
|
struct fsnotify_group *group = file->private_data;
|
2017-07-02 22:02:18 -07:00
|
|
|
__poll_t ret = 0;
|
2009-12-17 19:24:26 -07:00
|
|
|
|
|
|
|
poll_wait(file, &group->notification_waitq, wait);
|
2016-10-07 16:56:52 -07:00
|
|
|
spin_lock(&group->notification_lock);
|
2009-12-17 19:24:26 -07:00
|
|
|
if (!fsnotify_notify_queue_is_empty(group))
|
2018-02-11 15:34:03 -07:00
|
|
|
ret = EPOLLIN | EPOLLRDNORM;
|
2016-10-07 16:56:52 -07:00
|
|
|
spin_unlock(&group->notification_lock);
|
2009-12-17 19:24:26 -07:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t fanotify_read(struct file *file, char __user *buf,
|
|
|
|
size_t count, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct fsnotify_group *group;
|
2020-03-24 09:04:20 -07:00
|
|
|
struct fanotify_event *event;
|
2009-12-17 19:24:26 -07:00
|
|
|
char __user *start;
|
|
|
|
int ret;
|
2014-12-16 08:28:38 -07:00
|
|
|
DEFINE_WAIT_FUNC(wait, woken_wake_function);
|
2009-12-17 19:24:26 -07:00
|
|
|
|
|
|
|
start = buf;
|
|
|
|
group = file->private_data;
|
|
|
|
|
|
|
|
pr_debug("%s: group=%p\n", __func__, group);
|
|
|
|
|
2014-12-16 08:28:38 -07:00
|
|
|
add_wait_queue(&group->notification_waitq, &wait);
|
2009-12-17 19:24:26 -07:00
|
|
|
while (1) {
|
2020-07-15 05:06:21 -07:00
|
|
|
/*
|
|
|
|
* User can supply arbitrarily large buffer. Avoid softlockups
|
|
|
|
* in case there are lots of available events.
|
|
|
|
*/
|
|
|
|
cond_resched();
|
2020-03-24 09:04:20 -07:00
|
|
|
event = get_one_event(group, count);
|
|
|
|
if (IS_ERR(event)) {
|
|
|
|
ret = PTR_ERR(event);
|
2014-04-03 14:46:35 -07:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2020-03-24 09:04:20 -07:00
|
|
|
if (!event) {
|
2014-04-03 14:46:35 -07:00
|
|
|
ret = -EAGAIN;
|
|
|
|
if (file->f_flags & O_NONBLOCK)
|
2009-12-17 19:24:26 -07:00
|
|
|
break;
|
2014-04-03 14:46:35 -07:00
|
|
|
|
|
|
|
ret = -ERESTARTSYS;
|
|
|
|
if (signal_pending(current))
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (start != buf)
|
2009-12-17 19:24:26 -07:00
|
|
|
break;
|
2014-12-16 08:28:38 -07:00
|
|
|
|
|
|
|
wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
|
2009-12-17 19:24:26 -07:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2020-03-24 09:04:20 -07:00
|
|
|
ret = copy_event_to_user(group, event, buf, count);
|
2017-04-25 04:29:35 -07:00
|
|
|
if (unlikely(ret == -EOPENSTALE)) {
|
|
|
|
/*
|
|
|
|
* We cannot report events with stale fd so drop it.
|
|
|
|
* Setting ret to 0 will continue the event loop and
|
|
|
|
* do the right thing if there are no more events to
|
|
|
|
* read (i.e. return bytes read, -EAGAIN or wait).
|
|
|
|
*/
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
|
2014-04-03 14:46:35 -07:00
|
|
|
/*
|
|
|
|
* Permission events get queued to wait for response. Other
|
|
|
|
* events can be destroyed now.
|
|
|
|
*/
|
2020-03-24 09:04:20 -07:00
|
|
|
if (!fanotify_is_perm_event(event->mask)) {
|
|
|
|
fsnotify_destroy_event(group, &event->fse);
|
2014-04-03 14:46:36 -07:00
|
|
|
} else {
|
2017-04-25 04:29:35 -07:00
|
|
|
if (ret <= 0) {
|
2019-01-08 06:02:44 -07:00
|
|
|
spin_lock(&group->notification_lock);
|
|
|
|
finish_permission_event(group,
|
2023-02-03 14:35:15 -07:00
|
|
|
FANOTIFY_PERM(event), FAN_DENY, NULL);
|
2014-04-03 14:46:36 -07:00
|
|
|
wake_up(&group->fanotify_data.access_waitq);
|
2017-04-25 04:29:35 -07:00
|
|
|
} else {
|
|
|
|
spin_lock(&group->notification_lock);
|
2020-03-24 09:04:20 -07:00
|
|
|
list_add_tail(&event->fse.list,
|
2017-04-25 04:29:35 -07:00
|
|
|
&group->fanotify_data.access_list);
|
|
|
|
spin_unlock(&group->notification_lock);
|
2014-04-03 14:46:36 -07:00
|
|
|
}
|
|
|
|
}
|
2017-04-25 04:29:35 -07:00
|
|
|
if (ret < 0)
|
|
|
|
break;
|
2014-04-03 14:46:35 -07:00
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
2009-12-17 19:24:26 -07:00
|
|
|
}
|
2014-12-16 08:28:38 -07:00
|
|
|
remove_wait_queue(&group->notification_waitq, &wait);
|
2009-12-17 19:24:26 -07:00
|
|
|
|
|
|
|
if (start != buf && ret != -EFAULT)
|
|
|
|
ret = buf - start;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-12-17 19:24:34 -07:00
|
|
|
static ssize_t fanotify_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
|
|
|
|
{
|
2023-02-03 14:35:15 -07:00
|
|
|
struct fanotify_response response;
|
2009-12-17 19:24:34 -07:00
|
|
|
struct fsnotify_group *group;
|
|
|
|
int ret;
|
2023-02-03 14:35:15 -07:00
|
|
|
const char __user *info_buf = buf + sizeof(struct fanotify_response);
|
|
|
|
size_t info_len;
|
2009-12-17 19:24:34 -07:00
|
|
|
|
2017-10-30 13:14:56 -07:00
|
|
|
if (!IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2009-12-17 19:24:34 -07:00
|
|
|
group = file->private_data;
|
|
|
|
|
2023-02-03 14:35:15 -07:00
|
|
|
pr_debug("%s: group=%p count=%zu\n", __func__, group, count);
|
|
|
|
|
2020-05-12 11:19:21 -07:00
|
|
|
if (count < sizeof(response))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2023-02-03 14:35:15 -07:00
|
|
|
if (copy_from_user(&response, buf, sizeof(response)))
|
2009-12-17 19:24:34 -07:00
|
|
|
return -EFAULT;
|
|
|
|
|
2023-02-03 14:35:15 -07:00
|
|
|
info_len = count - sizeof(response);
|
|
|
|
|
|
|
|
ret = process_access_response(group, &response, info_buf, info_len);
|
2009-12-17 19:24:34 -07:00
|
|
|
if (ret < 0)
|
|
|
|
count = ret;
|
2023-02-03 14:35:15 -07:00
|
|
|
else
|
|
|
|
count = sizeof(response) + ret;
|
2009-12-17 19:24:34 -07:00
|
|
|
|
|
|
|
return count;
|
|
|
|
}
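The return value of fanotify_write() follows from two checks: the write must hold at least one fixed-size response, and the result of processing it is either an error or the number of extra info bytes consumed. The sketch below restates that arithmetic. `struct perm_response` is a local mirror of the response structure with an assumed layout, and `write_result()` is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the permission response (assumed layout). */
struct perm_response {
	int32_t fd;
	uint32_t response;
};

/*
 * Hypothetical helper mirroring fanotify_write(): count is the size
 * passed to write(2), processed is what response processing returned
 * (a negative errno, or the number of info bytes it consumed).
 */
static long write_result(size_t count, long processed)
{
	if (count < sizeof(struct perm_response))
		return -EINVAL;
	if (processed < 0)
		return processed;
	return (long)sizeof(struct perm_response) + processed;
}
```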
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
static int fanotify_release(struct inode *ignored, struct file *file)
|
|
|
|
{
|
|
|
|
struct fsnotify_group *group = file->private_data;
|
2021-03-04 03:48:22 -07:00
|
|
|
struct fsnotify_event *fsn_event;
|
2010-10-28 14:21:59 -07:00
|
|
|
|
2014-08-06 16:03:28 -07:00
|
|
|
/*
|
2016-09-19 14:44:30 -07:00
|
|
|
* Stop new events from arriving in the notification queue. Since
|
|
|
|
* userspace cannot use fanotify fd anymore, no event can enter or
|
|
|
|
* leave access_list by now either.
|
2014-08-06 16:03:28 -07:00
|
|
|
*/
|
2016-09-19 14:44:30 -07:00
|
|
|
fsnotify_group_stop_queueing(group);
|
2010-08-18 09:25:50 -07:00
|
|
|
|
2016-09-19 14:44:30 -07:00
|
|
|
/*
|
|
|
|
* Process all permission events on access_list and notification queue
|
|
|
|
* and simulate reply from userspace.
|
|
|
|
*/
|
2016-10-07 16:56:55 -07:00
|
|
|
spin_lock(&group->notification_lock);
|
2019-01-09 05:21:01 -07:00
|
|
|
while (!list_empty(&group->fanotify_data.access_list)) {
|
2020-03-24 09:04:20 -07:00
|
|
|
struct fanotify_perm_event *event;
|
|
|
|
|
2019-01-09 05:21:01 -07:00
|
|
|
event = list_first_entry(&group->fanotify_data.access_list,
|
|
|
|
struct fanotify_perm_event, fae.fse.list);
|
2014-04-03 14:46:33 -07:00
|
|
|
list_del_init(&event->fae.fse.list);
|
2023-02-03 14:35:15 -07:00
|
|
|
finish_permission_event(group, event, FAN_ALLOW, NULL);
|
2019-01-08 06:02:44 -07:00
|
|
|
spin_lock(&group->notification_lock);
|
2010-08-18 09:25:50 -07:00
|
|
|
}
|
|
|
|
|
2014-08-06 16:03:28 -07:00
|
|
|
/*
|
2016-09-19 14:44:30 -07:00
|
|
|
* Destroy all non-permission events. For permission events just
|
|
|
|
* dequeue them and set the response. They will be freed once the
|
|
|
|
* response is consumed and fanotify_get_response() returns.
|
2014-08-06 16:03:28 -07:00
|
|
|
*/
|
2021-03-04 03:48:22 -07:00
|
|
|
while ((fsn_event = fsnotify_remove_first_event(group))) {
|
|
|
|
struct fanotify_event *event = FANOTIFY_E(fsn_event);
|
2020-03-24 09:04:20 -07:00
|
|
|
|
|
|
|
if (!(event->mask & FANOTIFY_PERM_EVENTS)) {
|
2016-10-07 16:56:52 -07:00
|
|
|
spin_unlock(&group->notification_lock);
|
2021-03-04 03:48:22 -07:00
|
|
|
fsnotify_destroy_event(group, fsn_event);
|
2017-10-30 13:14:56 -07:00
|
|
|
} else {
|
2020-03-24 09:04:20 -07:00
|
|
|
finish_permission_event(group, FANOTIFY_PERM(event),
|
2023-02-03 14:35:15 -07:00
|
|
|
FAN_ALLOW, NULL);
|
2017-10-30 13:14:56 -07:00
|
|
|
}
|
2019-01-08 06:02:44 -07:00
|
|
|
spin_lock(&group->notification_lock);
|
2016-09-19 14:44:30 -07:00
|
|
|
}
|
2016-10-07 16:56:52 -07:00
|
|
|
spin_unlock(&group->notification_lock);
|
2016-09-19 14:44:30 -07:00
|
|
|
|
|
|
|
/* Response for all permission events is set, wake up waiters */
|
2010-08-18 09:25:50 -07:00
|
|
|
wake_up(&group->fanotify_data.access_waitq);
|
2011-10-14 14:43:39 -07:00
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
/* matches the fanotify_init->fsnotify_alloc_group */
|
2011-06-14 08:29:45 -07:00
|
|
|
fsnotify_destroy_group(group);
|
2009-12-17 19:24:26 -07:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
|
|
|
|
{
|
|
|
|
struct fsnotify_group *group;
|
2014-01-21 16:48:14 -07:00
|
|
|
struct fsnotify_event *fsn_event;
|
2009-12-17 19:24:26 -07:00
|
|
|
void __user *p;
|
|
|
|
int ret = -ENOTTY;
|
|
|
|
size_t send_len = 0;
|
|
|
|
|
|
|
|
group = file->private_data;
|
|
|
|
|
|
|
|
p = (void __user *) arg;
|
|
|
|
|
|
|
|
switch (cmd) {
|
|
|
|
case FIONREAD:
|
2016-10-07 16:56:52 -07:00
|
|
|
spin_lock(&group->notification_lock);
|
2014-01-21 16:48:14 -07:00
|
|
|
list_for_each_entry(fsn_event, &group->notification_list, list)
|
2009-12-17 19:24:26 -07:00
|
|
|
send_len += FAN_EVENT_METADATA_LEN;
|
2016-10-07 16:56:52 -07:00
|
|
|
spin_unlock(&group->notification_lock);
|
2009-12-17 19:24:26 -07:00
|
|
|
ret = put_user(send_len, (int __user *) p);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
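The FIONREAD branch adds FAN_EVENT_METADATA_LEN once per queued event, so the value it reports covers only the fixed metadata and not any info records. A small sketch of the consequence for buffer sizing follows. `METADATA_LEN` is an assumed stand-in for FAN_EVENT_METADATA_LEN, and both helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size of the fixed event metadata, in bytes. */
#define METADATA_LEN 24u

/* What the FIONREAD branch above would report for n queued events. */
static size_t fionread_value(size_t n_events)
{
	return n_events * METADATA_LEN;
}

/*
 * Hypothetical sizing rule: add a caller-chosen allowance per event for
 * info records, which the FIONREAD value does not account for.
 */
static size_t suggested_buffer(size_t fionread, size_t info_allowance)
{
	size_t n_events = fionread / METADATA_LEN;

	return fionread + n_events * info_allowance;
}
```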
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
static const struct file_operations fanotify_fops = {
|
2012-12-17 17:05:12 -07:00
|
|
|
.show_fdinfo = fanotify_show_fdinfo,
|
2009-12-17 19:24:26 -07:00
|
|
|
.poll = fanotify_poll,
|
|
|
|
.read = fanotify_read,
|
2009-12-17 19:24:34 -07:00
|
|
|
.write = fanotify_write,
|
2009-12-17 19:24:26 -07:00
|
|
|
.fasync = NULL,
|
|
|
|
.release = fanotify_release,
|
2009-12-17 19:24:26 -07:00
|
|
|
.unlocked_ioctl = fanotify_ioctl,
|
2018-09-11 12:59:08 -07:00
|
|
|
.compat_ioctl = compat_ptr_ioctl,
|
llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// neither read nor write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
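The rules in the semantic patch amount to a priority list. A simplified sketch of that decision in plain C is given here. The struct, its field names and `pick_llseek()` are hypothetical, and the sketch picks a single outcome per file_operations, which is a simplification of how the individual rules depend on each other.

```c
#include <assert.h>
#include <stdbool.h>

enum llseek_choice {
	KEEP_EXISTING,	/* .llseek already set, nothing is added */
	NO_LLSEEK,
	SEQ_LSEEK,
	DEFAULT_LLSEEK,
	NOOP_LLSEEK,
};

/* Hypothetical summary of what the patch detects about one fops. */
struct fops_traits {
	bool has_llseek;
	bool open_is_nonseekable;	/* open calls nonseekable_open */
	bool read_is_seq_read;
	bool has_readdir;
	bool read_uses_pos;
	bool write_uses_pos;
};

static enum llseek_choice pick_llseek(const struct fops_traits *t)
{
	if (t->has_llseek)
		return KEEP_EXISTING;
	if (t->open_is_nonseekable)
		return NO_LLSEEK;
	if (t->read_is_seq_read)
		return SEQ_LSEEK;
	if (t->has_readdir || t->read_uses_pos || t->write_uses_pos)
		return DEFAULT_LLSEEK;
	/* Nothing touches the file position. */
	return NOOP_LLSEEK;
}
```

For a file_operations whose read and write handlers never touch the offset and that has no open handler, every flag is false and the result is noop_llseek.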
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-08-15 09:52:59 -07:00
|
|
|
.llseek = noop_llseek,
|
2009-12-17 19:24:26 -07:00
|
|
|
};
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
static int fanotify_find_path(int dfd, const char __user *filename,
|
fanotify, inotify, dnotify, security: add security hook for fs notifications
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.
In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.
This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notifications about that object that match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.
Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.
The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.
The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object and each is a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.
The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.
Finally, watch_with_perm only applies to fanotify masks since it is the
only way to set a mask which allows for the blocking, permission event.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-08-12 08:20:00 -07:00
			      struct path *path, unsigned int flags, __u64 mask,
			      unsigned int obj_type)
2009-12-17 19:24:26 -07:00
{
	int ret;

	pr_debug("%s: dfd=%d filename=%p flags=%x\n", __func__,
		 dfd, filename, flags);

	if (filename == NULL) {
2012-08-28 09:52:22 -07:00
		struct fd f = fdget(dfd);
2009-12-17 19:24:26 -07:00

		ret = -EBADF;
2024-05-31 11:12:01 -07:00
		if (!fd_file(f))
2009-12-17 19:24:26 -07:00
			goto out;

		ret = -ENOTDIR;
		if ((flags & FAN_MARK_ONLYDIR) &&
2024-05-31 11:12:01 -07:00
		    !(S_ISDIR(file_inode(fd_file(f))->i_mode))) {
2012-08-28 09:52:22 -07:00
			fdput(f);
2009-12-17 19:24:26 -07:00
			goto out;
		}

2024-05-31 11:12:01 -07:00
		*path = fd_file(f)->f_path;
2009-12-17 19:24:26 -07:00
		path_get(path);
2012-08-28 09:52:22 -07:00
		fdput(f);
2009-12-17 19:24:26 -07:00
	} else {
		unsigned int lookup_flags = 0;

		if (!(flags & FAN_MARK_DONT_FOLLOW))
			lookup_flags |= LOOKUP_FOLLOW;
		if (flags & FAN_MARK_ONLYDIR)
			lookup_flags |= LOOKUP_DIRECTORY;

		ret = user_path_at(dfd, filename, lookup_flags, path);
		if (ret)
			goto out;
	}

	/* you can only watch an inode if you have read permissions on it */
2021-01-21 06:19:22 -07:00
	ret = path_permission(path, MAY_READ);
2019-08-12 08:20:00 -07:00
	if (ret) {
		path_put(path);
		goto out;
	}

	ret = security_path_notify(path, mask, obj_type);
2009-12-17 19:24:26 -07:00
	if (ret)
		path_put(path);
2019-08-12 08:20:00 -07:00

2009-12-17 19:24:26 -07:00
out:
	return ret;
}
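
The commit message for the security hook describes how selinux_path_notify chooses which permissions a watcher must hold, based on the event mask and the object type. A minimal userspace sketch of that selection logic follows. Everything named here is an assumption made for illustration: the enum, the PERM_* and EV_* bits, and required_watch_perms() are hypothetical stand-ins rather than the kernel's or SELinux's definitions, and the additional filesystem watch check for superblocks is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the object types passed to the hook. */
enum obj_type { OBJ_INODE, OBJ_MOUNT, OBJ_SB };

/* Hypothetical bits mirroring the five file permissions in the text. */
#define PERM_WATCH           (1u << 0)
#define PERM_WATCH_MOUNT     (1u << 1)
#define PERM_WATCH_SB        (1u << 2)
#define PERM_WATCH_READS     (1u << 3)
#define PERM_WATCH_WITH_PERM (1u << 4)

/* Hypothetical event bits; the real FS_* values differ. */
#define EV_ACCESS        (1u << 0) /* file was read */
#define EV_MODIFY        (1u << 1) /* file was written */
#define EV_CLOSE_NOWRITE (1u << 2) /* read-only open was closed */
#define EV_OPEN_PERM     (1u << 3) /* blocking permission event */

#define EV_READ_LIKE (EV_ACCESS | EV_CLOSE_NOWRITE)
#define EV_BLOCKING  (EV_OPEN_PERM)

/* Return the set of permissions a watcher would need for this request. */
static uint32_t required_watch_perms(uint32_t mask, enum obj_type type)
{
	uint32_t perms;

	/* Baseline permission depends only on the kind of object marked. */
	switch (type) {
	case OBJ_MOUNT:
		perms = PERM_WATCH_MOUNT;
		break;
	case OBJ_SB:
		perms = PERM_WATCH_SB;
		break;
	default:
		perms = PERM_WATCH;
		break;
	}

	/* Blocking (permission) events need watch_with_perm on top. */
	if (mask & EV_BLOCKING)
		perms |= PERM_WATCH_WITH_PERM;

	/* Observing read-like events needs watch_reads on top. */
	if (mask & EV_READ_LIKE)
		perms |= PERM_WATCH_READS;

	return perms;
}
```

Note that the extra permissions are added to the baseline and never replace it, which matches the statement that watch_reads and watch_with_perm do not imply watch, watch_mount, or watch_sb.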

2009-12-17 19:24:33 -07:00
static __u32 fanotify_mark_remove_from_mask(struct fsnotify_mark *fsn_mark,
2020-07-16 01:42:14 -07:00
					    __u32 mask, unsigned int flags,
					    __u32 umask, int *destroy)
2009-12-17 19:24:28 -07:00
{
2022-02-23 08:14:37 -07:00
	__u32 oldmask, newmask;
2009-12-17 19:24:28 -07:00

2020-07-16 01:42:14 -07:00
	/* umask bits cannot be removed by user */
	mask &= ~umask;
2009-12-17 19:24:28 -07:00
	spin_lock(&fsn_mark->lock);
2022-02-23 08:14:37 -07:00
	oldmask = fsnotify_calc_mask(fsn_mark);
2022-06-29 07:42:10 -07:00
	if (!(flags & FANOTIFY_MARK_IGNORE_BITS)) {
2018-10-03 14:25:34 -07:00
		fsn_mark->mask &= ~mask;
2009-12-17 19:24:33 -07:00
	} else {
2022-06-29 07:42:08 -07:00
		fsn_mark->ignore_mask &= ~mask;
2009-12-17 19:24:33 -07:00
	}
2022-02-23 08:14:37 -07:00
	newmask = fsnotify_calc_mask(fsn_mark);
2020-07-16 01:42:14 -07:00
	/*
	 * We need to keep the mark around even if remaining mask cannot
	 * result in any events (e.g. mask == FAN_ONDIR) to support incremental
	 * changes to the mask.
	 * Destroy mark when only umask bits remain.
	 */
2022-06-29 07:42:08 -07:00
	*destroy = !((fsn_mark->mask | fsn_mark->ignore_mask) & ~umask);
2009-12-17 19:24:28 -07:00
	spin_unlock(&fsn_mark->lock);

2022-02-23 08:14:37 -07:00
	return oldmask & ~newmask;
2009-12-17 19:24:28 -07:00
}
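
The bit arithmetic in fanotify_mark_remove_from_mask() can be exercised outside the kernel with a small model. The sketch below is an illustration only: struct mark_model, calc_mask() and remove_from_mask() are hypothetical names, locking is omitted, and calc_mask() is assumed to be a plain union of the two masks, which is simpler than what fsnotify_calc_mask() really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, lock-free model of the two masks kept on a mark. */
struct mark_model {
	uint32_t mask;
	uint32_t ignore_mask;
};

/* Assumption: effective mask is just the union of both masks. */
static uint32_t calc_mask(const struct mark_model *m)
{
	return m->mask | m->ignore_mask;
}

/*
 * Clear the requested bits from either the event mask or the ignore mask,
 * report whether the mark should be destroyed, and return the bits that
 * disappeared from the effective mask.
 */
static uint32_t remove_from_mask(struct mark_model *m, uint32_t mask,
				 bool ignore, uint32_t umask, bool *destroy)
{
	uint32_t oldmask, newmask;

	/* Bits in umask can never be removed by the caller. */
	mask &= ~umask;

	oldmask = calc_mask(m);
	if (!ignore)
		m->mask &= ~mask;
	else
		m->ignore_mask &= ~mask;
	newmask = calc_mask(m);

	/* Destroy the mark once nothing but umask bits remain. */
	*destroy = !((m->mask | m->ignore_mask) & ~umask);

	return oldmask & ~newmask;
}
```

The return value only contains bits that vanished from the effective mask, so clearing an event bit that is still present in the ignore mask reports nothing removed in this model.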

2018-06-23 07:54:51 -07:00
static int fanotify_remove_mark(struct fsnotify_group *group,
2024-03-17 11:41:49 -07:00
				void *obj, unsigned int obj_type, __u32 mask,
2020-07-16 01:42:14 -07:00
				unsigned int flags, __u32 umask)
2009-12-17 19:24:28 -07:00
{
	struct fsnotify_mark *fsn_mark = NULL;
2009-12-17 19:24:28 -07:00
	__u32 removed;
2011-06-14 08:29:49 -07:00
	int destroy_mark;
2009-12-17 19:24:28 -07:00

2022-04-22 05:03:26 -07:00
	fsnotify_group_lock(group);
2024-03-17 11:41:49 -07:00
	fsn_mark = fsnotify_find_mark(obj, obj_type, group);
2013-07-08 15:59:42 -07:00
	if (!fsn_mark) {
2022-04-22 05:03:26 -07:00
		fsnotify_group_unlock(group);
2009-12-17 19:24:29 -07:00
		return -ENOENT;
2013-07-08 15:59:42 -07:00
	}
2009-12-17 19:24:28 -07:00

2011-06-14 08:29:49 -07:00
	removed = fanotify_mark_remove_from_mask(fsn_mark, mask, flags,
2020-07-16 01:42:14 -07:00
						 umask, &destroy_mark);
2018-06-23 07:54:50 -07:00
	if (removed & fsnotify_conn_mask(fsn_mark->connector))
		fsnotify_recalc_mask(fsn_mark->connector);
2011-06-14 08:29:49 -07:00
	if (destroy_mark)
2015-09-04 15:43:12 -07:00
		fsnotify_detach_mark(fsn_mark);
2022-04-22 05:03:26 -07:00
	fsnotify_group_unlock(group);
2015-09-04 15:43:12 -07:00
	if (destroy_mark)
		fsnotify_free_mark(fsn_mark);
2011-06-14 08:29:49 -07:00

2018-06-23 07:54:51 -07:00
	/* matches the fsnotify_find_mark() */
2009-12-17 19:24:29 -07:00
	fsnotify_put_mark(fsn_mark);
	return 0;
}
2009-12-17 19:24:26 -07:00

2022-04-22 05:03:24 -07:00
static bool fanotify_mark_update_flags(struct fsnotify_mark *fsn_mark,
				       unsigned int fan_flags)
2022-02-23 08:14:38 -07:00
{
2022-04-22 05:03:25 -07:00
	bool want_iref = !(fan_flags & FAN_MARK_EVICTABLE);
2022-06-29 07:42:10 -07:00
	unsigned int ignore = fan_flags & FANOTIFY_MARK_IGNORE_BITS;
2022-04-22 05:03:24 -07:00
	bool recalc = false;
2022-02-23 08:14:38 -07:00

2022-06-29 07:42:10 -07:00
	/*
	 * When using FAN_MARK_IGNORE for the first time, mark starts using
	 * independent event flags in ignore mask.  After that, trying to
	 * update the ignore mask with the old FAN_MARK_IGNORED_MASK API
	 * will result in EEXIST error.
	 */
	if (ignore == FAN_MARK_IGNORE)
		fsn_mark->flags |= FSNOTIFY_MARK_FLAG_HAS_IGNORE_FLAGS;

2022-02-23 08:14:38 -07:00
	/*
	 * Setting FAN_MARK_IGNORED_SURV_MODIFY for the first time may lead to
	 * the removal of the FS_MODIFY bit in calculated mask if it was set
2022-06-29 07:42:08 -07:00
	 * because of an ignore mask that is now going to survive FS_MODIFY.
2022-02-23 08:14:38 -07:00
	 */
2022-06-29 07:42:10 -07:00
	if (ignore && (fan_flags & FAN_MARK_IGNORED_SURV_MODIFY) &&
2022-02-23 08:14:38 -07:00
	    !(fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)) {
		fsn_mark->flags |= FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY;
		if (!(fsn_mark->mask & FS_MODIFY))
2022-04-22 05:03:24 -07:00
			recalc = true;
2022-02-23 08:14:38 -07:00
	}
2022-04-22 05:03:24 -07:00

2022-04-22 05:03:25 -07:00
	if (fsn_mark->connector->type != FSNOTIFY_OBJ_TYPE_INODE ||
	    want_iref == !(fsn_mark->flags & FSNOTIFY_MARK_FLAG_NO_IREF))
		return recalc;

	/*
	 * NO_IREF may be removed from a mark, but not added.
	 * When removed, fsnotify_recalc_mask() will take the inode ref.
	 */
	WARN_ON_ONCE(!want_iref);
	fsn_mark->flags &= ~FSNOTIFY_MARK_FLAG_NO_IREF;

	return true;
2022-02-23 08:14:38 -07:00
}

2022-04-22 05:03:24 -07:00
static bool fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
				      __u32 mask, unsigned int fan_flags)
2009-12-17 19:24:28 -07:00
{
2022-04-22 05:03:24 -07:00
	bool recalc;
2009-12-17 19:24:28 -07:00

	spin_lock(&fsn_mark->lock);
2022-06-29 07:42:10 -07:00
	if (!(fan_flags & FANOTIFY_MARK_IGNORE_BITS))
2018-10-03 14:25:34 -07:00
		fsn_mark->mask |= mask;
2022-04-22 05:03:24 -07:00
	else
2022-06-29 07:42:08 -07:00
		fsn_mark->ignore_mask |= mask;
2022-04-22 05:03:24 -07:00

	recalc = fsnotify_calc_mask(fsn_mark) &
		~fsnotify_conn_mask(fsn_mark->connector);

	recalc |= fanotify_mark_update_flags(fsn_mark, fan_flags);
2009-12-17 19:24:28 -07:00
	spin_unlock(&fsn_mark->lock);

2022-04-22 05:03:24 -07:00
	return recalc;
2009-12-17 19:24:28 -07:00
}

2023-11-30 09:56:19 -07:00
struct fan_fsid {
	struct super_block *sb;
	__kernel_fsid_t id;
	bool weak;
};

static int fanotify_set_mark_fsid(struct fsnotify_group *group,
				  struct fsnotify_mark *mark,
				  struct fan_fsid *fsid)
{
	struct fsnotify_mark_connector *conn;
	struct fsnotify_mark *old;
	struct super_block *old_sb = NULL;

	FANOTIFY_MARK(mark)->fsid = fsid->id;
	mark->flags |= FSNOTIFY_MARK_FLAG_HAS_FSID;
	if (fsid->weak)
		mark->flags |= FSNOTIFY_MARK_FLAG_WEAK_FSID;

	/* First mark added will determine if group is single or multi fsid */
	if (list_empty(&group->marks_list))
		return 0;

	/* Find sb of an existing mark */
	list_for_each_entry(old, &group->marks_list, g_list) {
		conn = READ_ONCE(old->connector);
		if (!conn)
			continue;
		old_sb = fsnotify_connector_sb(conn);
		if (old_sb)
			break;
	}

	/* Only detached marks left? */
	if (!old_sb)
		return 0;

	/* Do not allow mixing of marks with weak and strong fsid */
	if ((mark->flags ^ old->flags) & FSNOTIFY_MARK_FLAG_WEAK_FSID)
		return -EXDEV;

	/* Allow mixing of marks with strong fsid from different fs */
	if (!fsid->weak)
		return 0;

	/* Do not allow mixing marks with weak fsid from different fs */
	if (old_sb != fsid->sb)
		return -EXDEV;

	/* Do not allow mixing marks from different btrfs sub-volumes */
	if (!fanotify_fsid_equal(&FANOTIFY_MARK(old)->fsid,
				 &FANOTIFY_MARK(mark)->fsid))
		return -EXDEV;

	return 0;
}
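
The mixing rules enforced by fanotify_set_mark_fsid() reduce to a short decision sequence once the list walk is set aside. The sketch below restates those rules in userspace C as an illustration. struct fsid_model and may_mix_fsid() are hypothetical names, the fsid is assumed to be a single integer, and a NULL `existing` argument stands for a group that has no attached marks yet.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the fsid information cached on a mark. */
struct fsid_model {
	const void *sb; /* identity of the filesystem (superblock) */
	long id;        /* simplified fsid value */
	bool weak;      /* true if the fsid is a weak one */
};

/*
 * Decide whether a new mark may join a group that already holds a mark
 * with the `existing` fsid. Returns 0 if allowed, -EXDEV otherwise.
 */
static int may_mix_fsid(const struct fsid_model *existing,
			const struct fsid_model *incoming)
{
	/* The first mark decides the group's flavour. */
	if (!existing)
		return 0;

	/* Weak and strong fsids never share a group. */
	if (existing->weak != incoming->weak)
		return -EXDEV;

	/* Strong fsids from different filesystems may be mixed. */
	if (!incoming->weak)
		return 0;

	/* Weak fsids must come from the same filesystem... */
	if (existing->sb != incoming->sb)
		return -EXDEV;

	/* ...and carry the same fsid value (e.g. same sub-volume). */
	if (existing->id != incoming->id)
		return -EXDEV;

	return 0;
}
```

Ordering the checks from least to most specific keeps each rule a single comparison, which is the same shape as the kernel function above.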

2013-07-08 15:59:43 -07:00
static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
2024-03-17 11:41:49 -07:00
						   void *obj,
2021-11-29 13:15:27 -07:00
						   unsigned int obj_type,
2022-04-22 05:03:25 -07:00
						   unsigned int fan_flags,
2023-11-30 09:56:19 -07:00
						   struct fan_fsid *fsid)
2013-07-08 15:59:43 -07:00
{
2021-03-04 04:29:20 -07:00
	struct ucounts *ucounts = group->fanotify_data.ucounts;
2023-11-30 09:56:18 -07:00
	struct fanotify_mark *fan_mark;
2013-07-08 15:59:43 -07:00
	struct fsnotify_mark *mark;
	int ret;

2021-03-04 04:29:20 -07:00
	/*
	 * Enforce per-user marks limits in all containing user ns.
	 * A group with FAN_UNLIMITED_MARKS does not contribute to mark count
	 * in the limited groups account.
	 */
	if (!FAN_GROUP_FLAG(group, FAN_UNLIMITED_MARKS) &&
	    !inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_FANOTIFY_MARKS))
2013-07-08 15:59:43 -07:00
		return ERR_PTR(-ENOSPC);

2023-11-30 09:56:18 -07:00
	fan_mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
	if (!fan_mark) {
2021-03-04 04:29:20 -07:00
		ret = -ENOMEM;
		goto out_dec_ucounts;
	}
2013-07-08 15:59:43 -07:00

2023-11-30 09:56:18 -07:00
	mark = &fan_mark->fsn_mark;
2016-12-21 10:06:12 -07:00
	fsnotify_init_mark(mark, group);
2022-04-22 05:03:25 -07:00
	if (fan_flags & FAN_MARK_EVICTABLE)
		mark->flags |= FSNOTIFY_MARK_FLAG_NO_IREF;

2023-11-30 09:56:18 -07:00
	/* Cache fsid of filesystem containing the marked object */
	if (fsid) {
2023-11-30 09:56:19 -07:00
		ret = fanotify_set_mark_fsid(group, mark, fsid);
		if (ret)
			goto out_put_mark;
2023-11-30 09:56:18 -07:00
	} else {
		fan_mark->fsid.val[0] = fan_mark->fsid.val[1] = 0;
	}

2024-03-17 11:41:49 -07:00
	ret = fsnotify_add_mark_locked(mark, obj, obj_type, 0);
2023-11-30 09:56:19 -07:00
	if (ret)
		goto out_put_mark;
2013-07-08 15:59:43 -07:00

	return mark;
2021-03-04 04:29:20 -07:00

2023-11-30 09:56:19 -07:00
out_put_mark:
	fsnotify_put_mark(mark);
2021-03-04 04:29:20 -07:00
out_dec_ucounts:
	if (!FAN_GROUP_FLAG(group, FAN_UNLIMITED_MARKS))
		dec_ucount(ucounts, UCOUNT_FANOTIFY_MARKS);
	return ERR_PTR(ret);
2013-07-08 15:59:43 -07:00
}

2021-10-25 12:27:34 -07:00
static int fanotify_group_init_error_pool(struct fsnotify_group *group)
{
	if (mempool_initialized(&group->fanotify_data.error_events_pool))
		return 0;

	return mempool_init_kmalloc_pool(&group->fanotify_data.error_events_pool,
					 FANOTIFY_DEFAULT_FEE_POOL_SIZE,
					 sizeof(struct fanotify_error_event));
}
2013-07-08 15:59:43 -07:00

2022-06-29 07:42:09 -07:00
static int fanotify_may_update_existing_mark(struct fsnotify_mark *fsn_mark,
					      unsigned int fan_flags)
{
	/*
	 * Non evictable mark cannot be downgraded to evictable mark.
	 */
	if (fan_flags & FAN_MARK_EVICTABLE &&
	    !(fsn_mark->flags & FSNOTIFY_MARK_FLAG_NO_IREF))
		return -EEXIST;

2022-06-29 07:42:10 -07:00
	/*
	 * New ignore mask semantics cannot be downgraded to old semantics.
	 */
	if (fan_flags & FAN_MARK_IGNORED_MASK &&
	    fsn_mark->flags & FSNOTIFY_MARK_FLAG_HAS_IGNORE_FLAGS)
		return -EEXIST;

	/*
	 * An ignore mask that survives modify could never be downgraded to not
	 * survive modify.  With new FAN_MARK_IGNORE semantics we make that rule
	 * explicit and return an error when trying to update the ignore mask
	 * without the original FAN_MARK_IGNORED_SURV_MODIFY value.
	 */
	if (fan_flags & FAN_MARK_IGNORE &&
	    !(fan_flags & FAN_MARK_IGNORED_SURV_MODIFY) &&
	    fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
		return -EEXIST;

2022-06-29 07:42:09 -07:00
	return 0;
}
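
The three "no downgrade" rules in fanotify_may_update_existing_mark() can be checked in isolation with a small table-style model. This is a sketch under stated assumptions: the REQ_* and MARK_* bit values and the function may_update_existing() are hypothetical and chosen only for illustration; they are not the values of the FAN_MARK_* or FSNOTIFY_MARK_FLAG_* constants.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical request flags, standing in for FAN_MARK_* bits. */
#define REQ_EVICTABLE    0x1u
#define REQ_IGNORED_MASK 0x2u /* old ignore-mask API */
#define REQ_IGNORE       0x4u /* new ignore-mask API */
#define REQ_SURV_MODIFY  0x8u

/* Hypothetical state bits already recorded on the existing mark. */
#define MARK_NO_IREF          0x1u /* mark is evictable */
#define MARK_HAS_IGNORE_FLAGS 0x2u /* new ignore semantics in use */
#define MARK_SURV_MODIFY      0x4u /* ignore mask survives modify */

/* Return 0 if the request may update the mark, -EEXIST otherwise. */
static int may_update_existing(unsigned int mark_flags, unsigned int req)
{
	/* A non-evictable mark cannot become evictable. */
	if ((req & REQ_EVICTABLE) && !(mark_flags & MARK_NO_IREF))
		return -EEXIST;

	/* New ignore semantics cannot fall back to the old API. */
	if ((req & REQ_IGNORED_MASK) && (mark_flags & MARK_HAS_IGNORE_FLAGS))
		return -EEXIST;

	/* A surviving ignore mask cannot be updated without SURV_MODIFY. */
	if ((req & REQ_IGNORE) && !(req & REQ_SURV_MODIFY) &&
	    (mark_flags & MARK_SURV_MODIFY))
		return -EEXIST;

	return 0;
}
```

Each rule is one-directional: an evictable mark may be updated by a non-evictable request, but not the reverse, which is why only one side of each pair is tested.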

2018-06-23 07:54:51 -07:00
static int fanotify_add_mark(struct fsnotify_group *group,
2024-03-17 11:41:49 -07:00
			     void *obj, unsigned int obj_type,
2022-04-22 05:03:24 -07:00
			     __u32 mask, unsigned int fan_flags,
2023-11-30 09:56:19 -07:00
			     struct fan_fsid *fsid)
2009-12-17 19:24:26 -07:00
{
	struct fsnotify_mark *fsn_mark;
2022-04-22 05:03:24 -07:00
	bool recalc;
2021-10-25 12:27:34 -07:00
	int ret = 0;
2009-12-17 19:24:26 -07:00

2022-04-22 05:03:26 -07:00
	fsnotify_group_lock(group);
2024-03-17 11:41:49 -07:00
	fsn_mark = fsnotify_find_mark(obj, obj_type, group);
2009-12-17 19:24:28 -07:00
	if (!fsn_mark) {
2024-03-17 11:41:49 -07:00
		fsn_mark = fanotify_add_new_mark(group, obj, obj_type,
2022-04-22 05:03:25 -07:00
						 fan_flags, fsid);
2013-07-08 15:59:43 -07:00
		if (IS_ERR(fsn_mark)) {
2022-04-22 05:03:26 -07:00
			fsnotify_group_unlock(group);
2013-07-08 15:59:43 -07:00
			return PTR_ERR(fsn_mark);
2013-07-08 15:59:42 -07:00
		}
2009-12-17 19:24:28 -07:00
	}
2021-10-25 12:27:34 -07:00

2022-04-22 05:03:25 -07:00
	/*
2022-06-29 07:42:09 -07:00
	 * Check if requested mark flags conflict with existing mark flags.
2022-04-22 05:03:25 -07:00
	 */
2022-06-29 07:42:09 -07:00
	ret = fanotify_may_update_existing_mark(fsn_mark, fan_flags);
	if (ret)
2022-04-22 05:03:25 -07:00
		goto out;

2021-10-25 12:27:34 -07:00
	/*
	 * Error events are pre-allocated per group, only if strictly
	 * needed (i.e. FAN_FS_ERROR was requested).
	 */
2022-06-29 07:42:10 -07:00
	if (!(fan_flags & FANOTIFY_MARK_IGNORE_BITS) &&
	    (mask & FAN_FS_ERROR)) {
2021-10-25 12:27:34 -07:00
		ret = fanotify_group_init_error_pool(group);
		if (ret)
			goto out;
	}

2022-04-22 05:03:24 -07:00
	recalc = fanotify_mark_add_to_mask(fsn_mark, mask, fan_flags);
	if (recalc)
2018-06-23 07:54:50 -07:00
		fsnotify_recalc_mask(fsn_mark->connector);
2021-10-25 12:27:34 -07:00

out:
2022-04-22 05:03:26 -07:00
	fsnotify_group_unlock(group);
2013-07-08 15:59:43 -07:00

2010-11-09 10:18:16 -07:00
	fsnotify_put_mark(fsn_mark);
2021-10-25 12:27:34 -07:00
	return ret;
2009-12-17 19:24:28 -07:00
}

2020-07-08 04:11:42 -07:00
static struct fsnotify_event *fanotify_alloc_overflow_event(void)
{
	struct fanotify_event *oevent;

	oevent = kmalloc(sizeof(*oevent), GFP_KERNEL_ACCOUNT);
	if (!oevent)
		return NULL;

	fanotify_init_event(oevent, 0, FS_Q_OVERFLOW);
	oevent->type = FANOTIFY_EVENT_TYPE_OVERFLOW;

	return &oevent->fse;
}

2021-03-04 03:48:25 -07:00
static struct hlist_head *fanotify_alloc_merge_hash(void)
{
	struct hlist_head *hash;

	hash = kmalloc(sizeof(struct hlist_head) << FANOTIFY_HTABLE_BITS,
		       GFP_KERNEL_ACCOUNT);
	if (!hash)
		return NULL;

	__hash_init(hash, FANOTIFY_HTABLE_SIZE);

	return hash;
}

2009-12-17 19:24:26 -07:00
/* fanotify syscalls */
2010-05-27 06:41:40 -07:00
SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
2009-12-17 19:24:25 -07:00
{
2009-12-17 19:24:26 -07:00
	struct fsnotify_group *group;
	int f_flags, fd;
2020-07-16 01:42:26 -07:00
	unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
	unsigned int class = flags & FANOTIFY_CLASS_BITS;
2021-05-24 06:53:21 -07:00
	unsigned int internal_flags = 0;
2009-12-17 19:24:26 -07:00

2018-09-21 11:20:30 -07:00
	pr_debug("%s: flags=%x event_f_flags=%x\n",
		 __func__, flags, event_f_flags);
2009-12-17 19:24:26 -07:00

2021-03-04 04:29:21 -07:00
	if (!capable(CAP_SYS_ADMIN)) {
		/*
		 * An unprivileged user can setup an fanotify group with
		 * limited functionality - an unprivileged group is limited to
		 * notification events with file handles and it cannot use
		 * unlimited queue/marks.
		 */
		if ((flags & FANOTIFY_ADMIN_INIT_FLAGS) || !fid_mode)
			return -EPERM;
2021-05-24 06:53:21 -07:00

		/*
		 * Setting the internal flag FANOTIFY_UNPRIV on the group
		 * prevents setting mount/filesystem marks on this group and
		 * prevents reporting pid and open fd in events.
		 */
		internal_flags |= FANOTIFY_UNPRIV;
2021-03-04 04:29:21 -07:00
	}
2009-12-17 19:24:26 -07:00

2017-10-02 17:21:39 -07:00
#ifdef CONFIG_AUDITSYSCALL
2018-10-03 14:25:35 -07:00
	if (flags & ~(FANOTIFY_INIT_FLAGS | FAN_ENABLE_AUDIT))
2017-10-02 17:21:39 -07:00
#else
2018-10-03 14:25:35 -07:00
	if (flags & ~FANOTIFY_INIT_FLAGS)
2017-10-02 17:21:39 -07:00
#endif
2009-12-17 19:24:26 -07:00
		return -EINVAL;

2021-08-07 22:26:25 -07:00
	/*
	 * A pidfd can only be returned for a thread-group leader; thus
	 * FAN_REPORT_PIDFD and FAN_REPORT_TID need to remain mutually
	 * exclusive.
	 */
	if ((flags & FAN_REPORT_PIDFD) && (flags & FAN_REPORT_TID))
		return -EINVAL;

fanotify: check file flags passed in fanotify_init
Without this patch fanotify_init does not validate the value passed in
event_f_flags.
When a fanotify event is read from the fanotify file descriptor a new
file descriptor is created where file.f_flags = event_f_flags.
Internal and external open flags are stored together in field f_flags of
struct file. Hence, an application might create file descriptors with
internal flags like FMODE_EXEC, FMODE_NOCMTIME set.
Jan Kara and Eric Paris both agreed that this is a bug and the value of
event_f_flags should be checked:
https://lkml.org/lkml/2014/4/29/522
https://lkml.org/lkml/2014/4/29/539
This updated patch version considers the comments by Michael Kerrisk in
https://lkml.org/lkml/2014/5/4/10
With the patch the value of event_f_flags is checked.
When specifying an invalid value error EINVAL is returned.
Internal flags are disallowed.
File creation flags are disallowed:
O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TRUNC, and O_TTY_INIT.
Flags which do not make sense with fanotify are disallowed:
__O_TMPFILE, O_PATH, FASYNC, and O_DIRECT.
This leaves us with the following allowed values:
O_RDONLY, O_WRONLY, O_RDWR are basic functionality. They are stored in the
bits given by O_ACCMODE.
O_APPEND is working as expected. The value might be useful in a logging
application which appends the current status each time the log is opened.
O_LARGEFILE is needed for files exceeding 4GB on 32bit systems.
O_NONBLOCK may be useful when monitoring slow devices like tapes.
O_NDELAY is equal to O_NONBLOCK except for platform parisc.
To avoid code breaking on parisc either both flags should be
allowed or none. The patch allows both.
__O_SYNC and O_DSYNC may be used to avoid data loss on power disruption.
O_NOATIME may be useful to reduce disk activity.
O_CLOEXEC may be useful, if separate processes shall be used to scan files.
Once this patch is accepted, the fanotify_init.2 manpage has to be updated.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:05:44 -07:00
|
|
|
if (event_f_flags & ~FANOTIFY_INIT_ALL_EVENT_F_BITS)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
switch (event_f_flags & O_ACCMODE) {
|
|
|
|
case O_RDONLY:
|
|
|
|
case O_RDWR:
|
|
|
|
case O_WRONLY:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2020-07-16 01:42:26 -07:00
|
|
|
if (fid_mode && class != FAN_CLASS_NOTIF)
|
2019-01-10 10:04:36 -07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-16 01:42:28 -07:00
|
|
|
/*
|
|
|
|
* Child name is reported with parent fid so requires dir fid.
|
2020-07-16 01:42:30 -07:00
|
|
|
* We can report both child fid and dir fid with or without name.
|
2020-07-16 01:42:28 -07:00
|
|
|
*/
|
2020-07-16 01:42:30 -07:00
|
|
|
if ((fid_mode & FAN_REPORT_NAME) && !(fid_mode & FAN_REPORT_DIR_FID))
|
2020-07-16 01:42:26 -07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2021-11-29 13:15:29 -07:00
|
|
|
/*
|
|
|
|
* FAN_REPORT_TARGET_FID requires FAN_REPORT_NAME and FAN_REPORT_FID
|
|
|
|
* and is used as an indication to report both dir and child fid on all
|
|
|
|
* dirent events.
|
|
|
|
*/
|
|
|
|
if ((fid_mode & FAN_REPORT_TARGET_FID) &&
|
|
|
|
(!(fid_mode & FAN_REPORT_NAME) || !(fid_mode & FAN_REPORT_FID)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2022-05-22 05:08:02 -07:00
|
|
|
f_flags = O_RDWR | __FMODE_NONOTIFY;
|
2009-12-17 19:24:26 -07:00
|
|
|
if (flags & FAN_CLOEXEC)
|
|
|
|
f_flags |= O_CLOEXEC;
|
|
|
|
if (flags & FAN_NONBLOCK)
|
|
|
|
f_flags |= O_NONBLOCK;
|
|
|
|
|
|
|
|
/* fsnotify_alloc_group takes a ref. Dropped in fanotify_release */
|
2022-04-22 05:03:15 -07:00
|
|
|
group = fsnotify_alloc_group(&fanotify_fsnotify_ops,
|
2024-09-27 07:36:42 -07:00
|
|
|
FSNOTIFY_GROUP_USER);
|
2010-11-23 21:48:26 -07:00
|
|
|
if (IS_ERR(group)) {
|
2009-12-17 19:24:26 -07:00
|
|
|
return PTR_ERR(group);
|
2010-11-23 21:48:26 -07:00
|
|
|
}
|
2009-12-17 19:24:26 -07:00
|
|
|
|
2021-03-04 04:29:20 -07:00
|
|
|
/* Enforce group limits per user in all containing user ns */
|
|
|
|
group->fanotify_data.ucounts = inc_ucount(current_user_ns(),
|
|
|
|
current_euid(),
|
|
|
|
UCOUNT_FANOTIFY_GROUPS);
|
|
|
|
if (!group->fanotify_data.ucounts) {
|
|
|
|
fd = -EMFILE;
|
|
|
|
goto out_destroy_group;
|
|
|
|
}
|
|
|
|
|
2021-05-24 06:53:21 -07:00
|
|
|
group->fanotify_data.flags = flags | internal_flags;
|
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current process's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces a mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces a remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of a syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 15:46:39 -07:00
|
|
|
group->memcg = get_mem_cgroup_from_mm(current->mm);
|
2010-10-28 14:21:58 -07:00
|
|
|
|
2021-03-04 03:48:25 -07:00
|
|
|
group->fanotify_data.merge_hash = fanotify_alloc_merge_hash();
|
|
|
|
if (!group->fanotify_data.merge_hash) {
|
|
|
|
fd = -ENOMEM;
|
|
|
|
goto out_destroy_group;
|
|
|
|
}
|
|
|
|
|
2020-07-08 04:11:42 -07:00
|
|
|
group->overflow_event = fanotify_alloc_overflow_event();
|
|
|
|
if (unlikely(!group->overflow_event)) {
|
2014-02-21 11:14:11 -07:00
|
|
|
fd = -ENOMEM;
|
|
|
|
goto out_destroy_group;
|
|
|
|
}
|
|
|
|
|
2014-05-06 12:50:10 -07:00
|
|
|
if (force_o_largefile())
|
|
|
|
event_f_flags |= O_LARGEFILE;
|
2010-07-28 07:18:37 -07:00
|
|
|
group->fanotify_data.f_flags = event_f_flags;
|
2009-12-17 19:24:34 -07:00
|
|
|
init_waitqueue_head(&group->fanotify_data.access_waitq);
|
|
|
|
INIT_LIST_HEAD(&group->fanotify_data.access_list);
|
2020-07-16 01:42:26 -07:00
|
|
|
switch (class) {
|
2010-10-28 14:21:56 -07:00
|
|
|
case FAN_CLASS_NOTIF:
|
2024-03-17 11:41:53 -07:00
|
|
|
group->priority = FSNOTIFY_PRIO_NORMAL;
|
2010-10-28 14:21:56 -07:00
|
|
|
break;
|
|
|
|
case FAN_CLASS_CONTENT:
|
2024-03-17 11:41:53 -07:00
|
|
|
group->priority = FSNOTIFY_PRIO_CONTENT;
|
2010-10-28 14:21:56 -07:00
|
|
|
break;
|
|
|
|
case FAN_CLASS_PRE_CONTENT:
|
2024-03-17 11:41:53 -07:00
|
|
|
group->priority = FSNOTIFY_PRIO_PRE_CONTENT;
|
2010-10-28 14:21:56 -07:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
fd = -EINVAL;
|
2011-06-14 08:29:45 -07:00
|
|
|
goto out_destroy_group;
|
2010-10-28 14:21:56 -07:00
|
|
|
}
|
2009-12-17 19:24:34 -07:00
|
|
|
|
2010-10-28 14:21:57 -07:00
|
|
|
if (flags & FAN_UNLIMITED_QUEUE) {
|
|
|
|
fd = -EPERM;
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
2011-06-14 08:29:45 -07:00
|
|
|
goto out_destroy_group;
|
2010-10-28 14:21:57 -07:00
|
|
|
group->max_events = UINT_MAX;
|
|
|
|
} else {
|
2021-03-04 04:29:20 -07:00
|
|
|
group->max_events = fanotify_max_queued_events;
|
2010-10-28 14:21:57 -07:00
|
|
|
}
|
2010-10-28 14:21:57 -07:00
|
|
|
|
2010-10-28 14:21:58 -07:00
|
|
|
if (flags & FAN_UNLIMITED_MARKS) {
|
|
|
|
fd = -EPERM;
|
|
|
|
if (!capable(CAP_SYS_ADMIN))
|
2011-06-14 08:29:45 -07:00
|
|
|
goto out_destroy_group;
|
2010-10-28 14:21:58 -07:00
|
|
|
}
|
2010-10-28 14:21:57 -07:00
|
|
|
|
2017-10-02 17:21:39 -07:00
|
|
|
if (flags & FAN_ENABLE_AUDIT) {
|
|
|
|
fd = -EPERM;
|
|
|
|
if (!capable(CAP_AUDIT_WRITE))
|
|
|
|
goto out_destroy_group;
|
|
|
|
}
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
|
|
|
|
if (fd < 0)
|
2011-06-14 08:29:45 -07:00
|
|
|
goto out_destroy_group;
|
2009-12-17 19:24:26 -07:00
|
|
|
|
|
|
|
return fd;
|
|
|
|
|
2011-06-14 08:29:45 -07:00
|
|
|
out_destroy_group:
|
|
|
|
fsnotify_destroy_group(group);
|
2009-12-17 19:24:26 -07:00
|
|
|
return fd;
|
2009-12-17 19:24:25 -07:00
|
|
|
}
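
Several of the flag-combination rules enforced by fanotify_init() can be exercised in isolation. The following is a minimal user-space sketch covering only a subset of the checks (the pidfd/tid exclusion, the access-mode check on event_f_flags, and the name and target fid dependencies). The DEMO_ bit values and the function name demo_check_init_flags() are illustrative assumptions; real programs should use the constants from <linux/fanotify.h>.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Illustrative bit values; the real ones come from <linux/fanotify.h>. */
#define DEMO_REPORT_PIDFD      0x0080u
#define DEMO_REPORT_TID        0x0100u
#define DEMO_REPORT_FID        0x0200u
#define DEMO_REPORT_DIR_FID    0x0400u
#define DEMO_REPORT_NAME       0x0800u
#define DEMO_REPORT_TARGET_FID 0x1000u

/* Returns 0 if the combination is acceptable, -EINVAL otherwise. */
int demo_check_init_flags(unsigned int flags, unsigned int event_f_flags)
{
	/* A pidfd refers to a thread-group leader, so it excludes TID reporting. */
	if ((flags & DEMO_REPORT_PIDFD) && (flags & DEMO_REPORT_TID))
		return -EINVAL;

	/* The access mode must be one of the three valid values. */
	switch (event_f_flags & O_ACCMODE) {
	case O_RDONLY:
	case O_RDWR:
	case O_WRONLY:
		break;
	default:
		return -EINVAL;
	}

	/* A child name is reported with the parent fid, so it needs the dir fid. */
	if ((flags & DEMO_REPORT_NAME) && !(flags & DEMO_REPORT_DIR_FID))
		return -EINVAL;

	/* Target fid reporting requires both name and fid reporting. */
	if ((flags & DEMO_REPORT_TARGET_FID) &&
	    (!(flags & DEMO_REPORT_NAME) || !(flags & DEMO_REPORT_FID)))
		return -EINVAL;

	return 0;
}
```

For example, requesting DEMO_REPORT_NAME without DEMO_REPORT_DIR_FID is rejected, while adding DEMO_REPORT_DIR_FID makes the same request acceptable.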
|
2009-12-17 19:24:26 -07:00
|
|
|
|
2023-11-30 09:56:19 -07:00
|
|
|
static int fanotify_test_fsid(struct dentry *dentry, unsigned int flags,
|
|
|
|
struct fan_fsid *fsid)
|
2019-01-10 10:04:36 -07:00
|
|
|
{
|
2023-11-30 09:56:19 -07:00
|
|
|
unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
|
2019-01-10 10:04:39 -07:00
|
|
|
__kernel_fsid_t root_fsid;
|
2019-01-10 10:04:36 -07:00
|
|
|
int err;
|
|
|
|
|
|
|
|
/*
|
2021-10-25 12:27:21 -07:00
|
|
|
* Make sure dentry is not of a filesystem with zero fsid (e.g. fuse).
|
2019-01-10 10:04:36 -07:00
|
|
|
*/
|
2023-11-30 09:56:19 -07:00
|
|
|
err = vfs_get_fsid(dentry, &fsid->id);
|
2019-01-10 10:04:36 -07:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2023-11-30 09:56:19 -07:00
|
|
|
fsid->sb = dentry->d_sb;
|
|
|
|
if (!fsid->id.val[0] && !fsid->id.val[1]) {
|
|
|
|
err = -ENODEV;
|
|
|
|
goto weak;
|
|
|
|
}
|
2019-01-10 10:04:36 -07:00
|
|
|
|
|
|
|
/*
|
2021-10-25 12:27:21 -07:00
|
|
|
* Make sure dentry is not of a filesystem subvolume (e.g. btrfs)
|
2019-01-10 10:04:36 -07:00
|
|
|
* which uses a different fsid than sb root.
|
|
|
|
*/
|
2021-10-25 12:27:21 -07:00
|
|
|
err = vfs_get_fsid(dentry->d_sb->s_root, &root_fsid);
|
2019-01-10 10:04:36 -07:00
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2023-11-30 09:56:19 -07:00
|
|
|
if (!fanotify_fsid_equal(&root_fsid, &fsid->id)) {
|
|
|
|
err = -EXDEV;
|
|
|
|
goto weak;
|
|
|
|
}
|
2019-01-10 10:04:36 -07:00
|
|
|
|
2023-11-30 09:56:19 -07:00
|
|
|
fsid->weak = false;
|
2021-10-25 12:27:21 -07:00
|
|
|
return 0;
|
2023-11-30 09:56:19 -07:00
|
|
|
|
|
|
|
weak:
|
|
|
|
/* Allow weak fsid when marking inodes */
|
|
|
|
fsid->weak = true;
|
|
|
|
return (mark_type == FAN_MARK_INODE) ? 0 : err;
|
2021-10-25 12:27:21 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Check if filesystem can encode a unique fid */
|
fanotify: limit reporting of event with non-decodeable file handles
Commit a95aef69a740 ("fanotify: support reporting non-decodeable file
handles") merged in v6.5-rc1, added the ability to use an fanotify group
with FAN_REPORT_FID mode to watch filesystems that do not support nfs
export, but do know how to encode non-decodeable file handles, with the
newly introduced AT_HANDLE_FID flag.
At the time that this commit was merged, there were no filesystems
in-tree with those traits.
Commit 16aac5ad1fa9 ("ovl: support encoding non-decodable file handles"),
merged in v6.6-rc1, added this trait to overlayfs, thus allowing fanotify
watching of overlayfs with FAN_REPORT_FID mode.
In retrospect, allowing an fanotify filesystem/mount mark on such
filesystem in FAN_REPORT_FID mode will result in getting events with
file handles, without the ability to resolve the filesystem objects from
those file handles (i.e. no open_by_handle_at() support).
For v6.6, the safer option would be to allow this mode for inode marks
only, where the caller has the opportunity to use name_to_handle_at() at
the time of setting the mark. In the future we can revise this decision.
Fixes: a95aef69a740 ("fanotify: support reporting non-decodeable file handles")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20231018100000.2453965-2-amir73il@gmail.com>
2023-10-18 02:59:56 -07:00
|
|
|
static int fanotify_test_fid(struct dentry *dentry, unsigned int flags)
|
2021-10-25 12:27:21 -07:00
|
|
|
{
|
2023-10-18 02:59:56 -07:00
|
|
|
unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
|
|
|
|
const struct export_operations *nop = dentry->d_sb->s_export_op;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to make sure that the filesystem supports encoding of
|
|
|
|
* file handles so user can use name_to_handle_at() to compare fids
|
|
|
|
* reported with events to the file handle of watched objects.
|
|
|
|
*/
|
2023-10-23 11:07:58 -07:00
|
|
|
if (!exportfs_can_encode_fid(nop))
|
2023-10-18 02:59:56 -07:00
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
2019-01-10 10:04:36 -07:00
|
|
|
/*
|
2023-10-18 02:59:56 -07:00
|
|
|
* For sb/mount mark, we also need to make sure that the filesystem
|
|
|
|
* supports decoding file handles, so user has a way to map back the
|
|
|
|
* reported fids to filesystem objects.
|
2019-01-10 10:04:36 -07:00
|
|
|
*/
|
2023-10-23 11:07:58 -07:00
|
|
|
if (mark_type != FAN_MARK_INODE && !exportfs_can_decode_fh(nop))
|
2019-01-10 10:04:36 -07:00
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
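
The two conditions in fanotify_test_fid() can be summarized as a truth table over three inputs. The sketch below is a user-space approximation; demo_test_fid() and its boolean parameters are hypothetical and stand in for the results of exportfs_can_encode_fid(), exportfs_can_decode_fh() and the mark type comparison.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * can_encode: filesystem can encode file handles
 * can_decode: filesystem can also decode them back to objects
 * inode_mark: the mark targets a single inode rather than a mount or sb
 */
int demo_test_fid(bool can_encode, bool can_decode, bool inode_mark)
{
	/* Without encoding, reported fids cannot be compared to watched objects. */
	if (!can_encode)
		return -EOPNOTSUPP;

	/* Mount and sb marks also need decoding to map fids back to objects. */
	if (!inode_mark && !can_decode)
		return -EOPNOTSUPP;

	return 0;
}
```

An encode-only filesystem therefore passes for inode marks and fails for mount or filesystem marks, which is the restriction described in the commit message above.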
|
|
|
|
|
2022-06-27 10:47:19 -07:00
|
|
|
static int fanotify_events_supported(struct fsnotify_group *group,
|
2022-08-04 09:57:38 -07:00
|
|
|
const struct path *path, __u64 mask,
|
2022-06-27 10:47:19 -07:00
|
|
|
unsigned int flags)
|
2019-05-15 07:28:34 -07:00
|
|
|
{
|
2022-06-27 10:47:19 -07:00
|
|
|
unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
|
|
|
|
/* Strict validation of events in non-dir inode mask with v5.17+ APIs */
|
|
|
|
bool strict_dir_events = FAN_GROUP_FLAG(group, FAN_REPORT_TARGET_FID) ||
|
2022-06-29 07:42:10 -07:00
|
|
|
(mask & FAN_RENAME) ||
|
|
|
|
(flags & FAN_MARK_IGNORE);
|
2022-06-27 10:47:19 -07:00
|
|
|
|
2019-05-15 07:28:34 -07:00
|
|
|
/*
|
|
|
|
* Some filesystems such as 'proc' acquire unusual locks when opening
|
|
|
|
* files. For them fanotify permission events have high chances of
|
|
|
|
* deadlocking the system - open done when reporting fanotify event
|
|
|
|
* blocks on this "unusual" lock while another process holding the lock
|
|
|
|
* waits for fanotify permission event to be answered. Just disallow
|
|
|
|
* permission events for such filesystems.
|
|
|
|
*/
|
|
|
|
if (mask & FANOTIFY_PERM_EVENTS &&
|
|
|
|
path->mnt->mnt_sb->s_type->fs_flags & FS_DISALLOW_NOTIFY_PERM)
|
|
|
|
return -EINVAL;
|
2022-06-27 10:47:19 -07:00
|
|
|
|
2023-06-28 21:20:44 -07:00
|
|
|
/*
|
|
|
|
* mount and sb marks are not allowed on kernel internal pseudo fs,
|
|
|
|
* like pipe_mnt, because that would subscribe to events on all the
|
|
|
|
* anonymous pipes in the system.
|
|
|
|
*
|
|
|
|
* SB_NOUSER covers all of the internal pseudo fs whose objects are not
|
|
|
|
* exposed to user's mount namespace, but there are other SB_KERNMOUNT
|
|
|
|
* fs, like nsfs, debugfs, for which the value of allowing sb and mount
|
|
|
|
* mark is questionable. For now we leave them alone.
|
|
|
|
*/
|
|
|
|
if (mark_type != FAN_MARK_INODE &&
|
|
|
|
path->mnt->mnt_sb->s_flags & SB_NOUSER)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2022-06-27 10:47:19 -07:00
|
|
|
/*
|
|
|
|
* We shouldn't have allowed setting dirent events and the directory
|
|
|
|
* flags FAN_ONDIR and FAN_EVENT_ON_CHILD in mask of non-dir inode,
|
|
|
|
* but because we always allowed it, error only when using new APIs.
|
|
|
|
*/
|
|
|
|
if (strict_dir_events && mark_type == FAN_MARK_INODE &&
|
|
|
|
!d_is_dir(path->dentry) && (mask & FANOTIFY_DIRONLY_EVENT_BITS))
|
|
|
|
return -ENOTDIR;
|
|
|
|
|
2019-05-15 07:28:34 -07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-03-17 07:06:11 -07:00
|
|
|
static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
|
|
|
|
int dfd, const char __user *pathname)
|
2009-12-17 19:24:26 -07:00
|
|
|
{
|
2009-12-17 19:24:29 -07:00
|
|
|
struct inode *inode = NULL;
|
|
|
|
struct vfsmount *mnt = NULL;
|
2009-12-17 19:24:26 -07:00
|
|
|
struct fsnotify_group *group;
|
2012-08-28 09:52:22 -07:00
|
|
|
struct fd f;
|
2009-12-17 19:24:26 -07:00
|
|
|
struct path path;
|
2023-11-30 09:56:19 -07:00
|
|
|
struct fan_fsid __fsid, *fsid = NULL;
|
2018-10-03 14:25:37 -07:00
|
|
|
u32 valid_mask = FANOTIFY_EVENTS | FANOTIFY_EVENT_FLAGS;
|
2018-10-03 14:25:35 -07:00
|
|
|
unsigned int mark_type = flags & FANOTIFY_MARK_TYPE_BITS;
|
2022-06-29 07:42:09 -07:00
|
|
|
unsigned int mark_cmd = flags & FANOTIFY_MARK_CMD_BITS;
|
2022-06-29 07:42:10 -07:00
|
|
|
unsigned int ignore = flags & FANOTIFY_MARK_IGNORE_BITS;
|
2020-07-16 01:42:12 -07:00
|
|
|
unsigned int obj_type, fid_mode;
|
2024-03-17 11:41:49 -07:00
|
|
|
void *obj;
|
2020-07-16 01:42:15 -07:00
|
|
|
u32 umask = 0;
|
2012-08-28 09:52:22 -07:00
|
|
|
int ret;
|
2009-12-17 19:24:26 -07:00
|
|
|
|
|
|
|
pr_debug("%s: fanotify_fd=%d flags=%x dfd=%d pathname=%p mask=%llx\n",
|
|
|
|
__func__, fanotify_fd, flags, dfd, pathname, mask);
|
|
|
|
|
|
|
|
/* we only use the lower 32 bits as of right now. */
|
2021-03-25 01:37:43 -07:00
|
|
|
if (upper_32_bits(mask))
|
2009-12-17 19:24:26 -07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2018-10-03 14:25:35 -07:00
|
|
|
if (flags & ~FANOTIFY_MARK_FLAGS)
|
2009-12-17 19:24:29 -07:00
|
|
|
return -EINVAL;
|
2018-09-01 00:41:13 -07:00
|
|
|
|
|
|
|
switch (mark_type) {
|
|
|
|
case FAN_MARK_INODE:
|
fanotify, inotify, dnotify, security: add security hook for fs notifications
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.
In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.
This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notification about that object which match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.
Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.
The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.
The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object and each are a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.
The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.
Finally, watch_with_perm only applies to fanotify masks since it is the
only way to set a mask which allows for the blocking, permission event.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-08-12 08:20:00 -07:00
|
|
|
obj_type = FSNOTIFY_OBJ_TYPE_INODE;
|
|
|
|
break;
|
2018-09-01 00:41:13 -07:00
|
|
|
case FAN_MARK_MOUNT:
|
fanotify, inotify, dnotify, security: add security hook for fs notifications
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.
In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode-based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.
This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notifications about that object which match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.
Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.
The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.
The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object and each is a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.
The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.
Finally, watch_with_perm only applies to fanotify masks since it is the
only way to set a mask which allows for the blocking, permission event.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
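The permission selection described in this commit message can be sketched in user-space C. This is a minimal model under stated assumptions, not the SELinux implementation: the enum, the EV_* event bits, the PERM_* bits, and required_watch_perms() are hypothetical stand-ins, and treating the blocking read request as a read-like event is an assumption of the sketch.

```c
#include <stdint.h>

/* Hypothetical stand-ins; the kernel and SELinux define their own values. */
enum obj_type { OBJ_INODE, OBJ_VFSMOUNT, OBJ_SB };

#define EV_ACCESS        0x00000001u /* file was read */
#define EV_MODIFY        0x00000002u /* file was written */
#define EV_CLOSE_NOWRITE 0x00000010u /* read-only file was closed */
#define EV_OPEN_PERM     0x00010000u /* blocking: open awaits a decision */
#define EV_ACCESS_PERM   0x00020000u /* blocking: read awaits a decision */

#define PERM_WATCH           0x01u /* file class */
#define PERM_WATCH_MOUNT     0x02u /* file class */
#define PERM_WATCH_SB        0x04u /* file class */
#define PERM_WATCH_READS     0x08u /* file class */
#define PERM_WATCH_WITH_PERM 0x10u /* file class */
#define PERM_FS_WATCH        0x20u /* filesystem class, superblock watches only */

/* Return the permissions a watch needs, or 0 for an unknown object type. */
static unsigned int required_watch_perms(uint64_t mask, enum obj_type type)
{
    unsigned int perm;

    /* Baseline permission depends on what is being watched. */
    switch (type) {
    case OBJ_INODE:
        perm = PERM_WATCH;
        break;
    case OBJ_VFSMOUNT:
        perm = PERM_WATCH_MOUNT;
        break;
    case OBJ_SB:
        perm = PERM_WATCH_SB | PERM_FS_WATCH;
        break;
    default:
        return 0;
    }

    /* Blocking (permission) events need watch_with_perm on top. */
    if (mask & (EV_OPEN_PERM | EV_ACCESS_PERM))
        perm |= PERM_WATCH_WITH_PERM;

    /* Observing read-like events needs watch_reads on top. */
    if (mask & (EV_ACCESS | EV_ACCESS_PERM | EV_CLOSE_NOWRITE))
        perm |= PERM_WATCH_READS;

    return perm;
}
```

The baseline permission is chosen first and the two additional permissions are only ever OR-ed in, which mirrors the statement that watch_reads and watch_with_perm never imply watch, watch_mount, or watch_sb.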
2019-08-12 08:20:00 -07:00
|
|
|
obj_type = FSNOTIFY_OBJ_TYPE_VFSMOUNT;
|
|
|
|
break;
|
2018-09-01 00:41:13 -07:00
|
|
|
case FAN_MARK_FILESYSTEM:
|
2019-08-12 08:20:00 -07:00
|
|
|
obj_type = FSNOTIFY_OBJ_TYPE_SB;
|
2018-09-01 00:41:13 -07:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2022-06-29 07:42:09 -07:00
|
|
|
switch (mark_cmd) {
|
2020-08-23 15:36:59 -07:00
|
|
|
case FAN_MARK_ADD:
|
2009-12-17 19:24:29 -07:00
|
|
|
case FAN_MARK_REMOVE:
|
2010-11-22 10:46:33 -07:00
|
|
|
if (!mask)
|
|
|
|
return -EINVAL;
|
2014-06-04 16:05:43 -07:00
|
|
|
break;
|
2009-12-17 19:24:34 -07:00
|
|
|
case FAN_MARK_FLUSH:
|
2018-10-03 14:25:35 -07:00
|
|
|
if (flags & ~(FANOTIFY_MARK_TYPE_BITS | FAN_MARK_FLUSH))
|
2014-06-04 16:05:43 -07:00
|
|
|
return -EINVAL;
|
2009-12-17 19:24:29 -07:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2010-10-28 14:21:59 -07:00
|
|
|
|
2017-10-30 13:14:56 -07:00
|
|
|
if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS))
|
2018-10-03 14:25:35 -07:00
|
|
|
valid_mask |= FANOTIFY_PERM_EVENTS;
|
2017-10-30 13:14:56 -07:00
|
|
|
|
|
|
|
if (mask & ~valid_mask)
|
2009-12-17 19:24:26 -07:00
|
|
|
return -EINVAL;
|
|
|
|
|
2022-06-29 07:42:10 -07:00
|
|
|
|
|
|
|
/* We don't allow FAN_MARK_IGNORE & FAN_MARK_IGNORED_MASK together */
|
|
|
|
if (ignore == (FAN_MARK_IGNORE | FAN_MARK_IGNORED_MASK))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2022-06-29 07:42:08 -07:00
|
|
|
/*
|
|
|
|
* Event flags (FAN_ONDIR, FAN_EVENT_ON_CHILD) have no effect with
|
|
|
|
* FAN_MARK_IGNORED_MASK.
|
|
|
|
*/
|
2022-06-29 07:42:10 -07:00
|
|
|
if (ignore == FAN_MARK_IGNORED_MASK) {
|
2020-07-16 01:42:13 -07:00
|
|
|
mask &= ~FANOTIFY_EVENT_FLAGS;
|
2022-06-29 07:42:10 -07:00
|
|
|
umask = FANOTIFY_EVENT_FLAGS;
|
|
|
|
}
|
2020-07-16 01:42:13 -07:00
|
|
|
|
2012-08-28 09:52:22 -07:00
|
|
|
f = fdget(fanotify_fd);
|
2024-05-31 11:12:01 -07:00
|
|
|
if (unlikely(!fd_file(f)))
|
2009-12-17 19:24:26 -07:00
|
|
|
return -EBADF;
|
|
|
|
|
|
|
|
/* verify that this is indeed an fanotify instance */
|
|
|
|
ret = -EINVAL;
|
2024-05-31 11:12:01 -07:00
|
|
|
if (unlikely(fd_file(f)->f_op != &fanotify_fops))
|
2009-12-17 19:24:26 -07:00
|
|
|
goto fput_and_out;
|
2024-05-31 11:12:01 -07:00
|
|
|
group = fd_file(f)->private_data;
|
2010-10-28 14:21:56 -07:00
|
|
|
|
2021-03-04 04:29:21 -07:00
|
|
|
/*
|
2021-05-24 06:53:21 -07:00
|
|
|
* An unprivileged user is not allowed to setup mount nor filesystem
|
|
|
|
* marks. This also includes setting up such marks by a group that
|
|
|
|
* was initialized by an unprivileged user.
|
2021-03-04 04:29:21 -07:00
|
|
|
*/
|
|
|
|
ret = -EPERM;
|
2021-05-24 06:53:21 -07:00
|
|
|
if ((!capable(CAP_SYS_ADMIN) ||
|
|
|
|
FAN_GROUP_FLAG(group, FANOTIFY_UNPRIV)) &&
|
2021-03-04 04:29:21 -07:00
|
|
|
mark_type != FAN_MARK_INODE)
|
|
|
|
goto fput_and_out;
|
|
|
|
|
2010-10-28 14:21:56 -07:00
|
|
|
/*
|
2024-03-17 11:41:53 -07:00
|
|
|
* Permission events require minimum priority FAN_CLASS_CONTENT.
|
2010-10-28 14:21:56 -07:00
|
|
|
*/
|
|
|
|
ret = -EINVAL;
|
2018-10-03 14:25:35 -07:00
|
|
|
if (mask & FANOTIFY_PERM_EVENTS &&
|
2024-03-17 11:41:53 -07:00
|
|
|
group->priority < FSNOTIFY_PRIO_CONTENT)
|
2010-10-28 14:21:56 -07:00
|
|
|
goto fput_and_out;
|
2009-12-17 19:24:26 -07:00
|
|
|
|
2021-10-25 12:27:43 -07:00
|
|
|
if (mask & FAN_FS_ERROR &&
|
|
|
|
mark_type != FAN_MARK_FILESYSTEM)
|
|
|
|
goto fput_and_out;
|
|
|
|
|
2022-04-22 05:03:25 -07:00
|
|
|
/*
|
|
|
|
* Evictable is only relevant for inode marks, because only inode object
|
|
|
|
* can be evicted on memory pressure.
|
|
|
|
*/
|
|
|
|
if (flags & FAN_MARK_EVICTABLE &&
|
|
|
|
mark_type != FAN_MARK_INODE)
|
|
|
|
goto fput_and_out;
|
|
|
|
|
2019-01-10 10:04:43 -07:00
|
|
|
/*
|
2021-10-25 12:27:31 -07:00
|
|
|
* Events that do not carry enough information to report
|
|
|
|
* event->fd require a group that supports reporting fid. Those
|
|
|
|
* events are not supported on a mount mark, because they do not
|
|
|
|
* carry enough information (i.e. path) to be filtered by mount
|
|
|
|
* point.
|
2019-01-10 10:04:43 -07:00
|
|
|
*/
|
2020-07-16 01:42:12 -07:00
|
|
|
fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
|
2021-10-25 12:27:31 -07:00
|
|
|
if (mask & ~(FANOTIFY_FD_EVENTS|FANOTIFY_EVENT_FLAGS) &&
|
2020-07-16 01:42:12 -07:00
|
|
|
(!fid_mode || mark_type == FAN_MARK_MOUNT))
|
2019-01-10 10:04:43 -07:00
|
|
|
goto fput_and_out;
|
|
|
|
|
2021-11-29 13:15:37 -07:00
|
|
|
/*
|
|
|
|
* FAN_RENAME uses special info type records to report the old and
|
|
|
|
* new parent+name. Reporting only old and new parent id is less
|
|
|
|
* useful and was not implemented.
|
|
|
|
*/
|
|
|
|
if (mask & FAN_RENAME && !(fid_mode & FAN_REPORT_NAME))
|
|
|
|
goto fput_and_out;
|
|
|
|
|
2022-06-29 07:42:09 -07:00
|
|
|
if (mark_cmd == FAN_MARK_FLUSH) {
|
2014-06-04 16:05:40 -07:00
|
|
|
ret = 0;
|
2018-09-01 00:41:13 -07:00
|
|
|
if (mark_type == FAN_MARK_MOUNT)
|
2014-06-04 16:05:40 -07:00
|
|
|
fsnotify_clear_vfsmount_marks_by_group(group);
|
2018-09-01 00:41:13 -07:00
|
|
|
else if (mark_type == FAN_MARK_FILESYSTEM)
|
|
|
|
fsnotify_clear_sb_marks_by_group(group);
|
2014-06-04 16:05:40 -07:00
|
|
|
else
|
|
|
|
fsnotify_clear_inode_marks_by_group(group);
|
|
|
|
goto fput_and_out;
|
|
|
|
}
|
|
|
|
|
2019-08-12 08:20:00 -07:00
|
|
|
ret = fanotify_find_path(dfd, pathname, &path, flags,
|
|
|
|
(mask & ALL_FSNOTIFY_EVENTS), obj_type);
|
2009-12-17 19:24:26 -07:00
|
|
|
if (ret)
|
|
|
|
goto fput_and_out;
|
|
|
|
|
2022-06-29 07:42:09 -07:00
|
|
|
if (mark_cmd == FAN_MARK_ADD) {
|
2022-06-27 10:47:19 -07:00
|
|
|
ret = fanotify_events_supported(group, &path, mask, flags);
|
2019-05-15 07:28:34 -07:00
|
|
|
if (ret)
|
|
|
|
goto path_put_and_out;
|
|
|
|
}
|
|
|
|
|
2020-07-16 01:42:12 -07:00
|
|
|
if (fid_mode) {
|
2023-11-30 09:56:19 -07:00
|
|
|
ret = fanotify_test_fsid(path.dentry, flags, &__fsid);
|
2021-10-25 12:27:21 -07:00
|
|
|
if (ret)
|
|
|
|
goto path_put_and_out;
|
|
|
|
|
fanotify: limit reporting of event with non-decodeable file handles
Commit a95aef69a740 ("fanotify: support reporting non-decodeable file
handles") merged in v6.5-rc1, added the ability to use an fanotify group
with FAN_REPORT_FID mode to watch filesystems that do not support nfs
export, but do know how to encode non-decodeable file handles, with the
newly introduced AT_HANDLE_FID flag.
At the time that this commit was merged, there were no filesystems
in-tree with those traits.
Commit 16aac5ad1fa9 ("ovl: support encoding non-decodable file handles"),
merged in v6.6-rc1, added this trait to overlayfs, thus allowing fanotify
watching of overlayfs with FAN_REPORT_FID mode.
In retrospect, allowing an fanotify filesystem/mount mark on such
filesystem in FAN_REPORT_FID mode will result in getting events with
file handles, without the ability to resolve the filesystem objects from
those file handles (i.e. no open_by_handle_at() support).
For v6.6, the safer option would be to allow this mode for inode marks
only, where the caller has the opportunity to use name_to_handle_at() at
the time of setting the mark. In the future we can revise this decision.
Fixes: a95aef69a740 ("fanotify: support reporting non-decodeable file handles")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20231018100000.2453965-2-amir73il@gmail.com>
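The rule this commit message describes can be modelled as a small decision function. This is a sketch under stated assumptions: struct fh_caps, enum mark_type, and fid_mark_allowed() are hypothetical names, and the -EOPNOTSUPP return value is an assumption rather than something the message states.

```c
#include <errno.h>
#include <stdbool.h>

enum mark_type { MARK_INODE, MARK_MOUNT, MARK_FILESYSTEM };

/* Hypothetical summary of what a filesystem can do with file handles. */
struct fh_caps {
    bool can_encode; /* can produce a file handle for an object */
    bool can_decode; /* can resolve a handle back, as open_by_handle_at() needs */
};

/* Decide whether a FAN_REPORT_FID group may set a mark of this type. */
static int fid_mark_allowed(struct fh_caps caps, enum mark_type type)
{
    /* Without encodable handles there is nothing to report at all. */
    if (!caps.can_encode)
        return -EOPNOTSUPP;

    /*
     * Mount and filesystem marks report objects the caller never named,
     * so the caller needs a way to map handles back to objects.
     */
    if (type != MARK_INODE && !caps.can_decode)
        return -EOPNOTSUPP;

    return 0;
}
```

For a filesystem that encodes but cannot decode handles, the sketch accepts only inode marks, where the caller can run name_to_handle_at() on the watched object when setting the mark and compare later events against that handle.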
2023-10-18 02:59:56 -07:00
|
|
|
ret = fanotify_test_fid(path.dentry, flags);
|
2019-01-10 10:04:36 -07:00
|
|
|
if (ret)
|
|
|
|
goto path_put_and_out;
|
2019-01-10 10:04:37 -07:00
|
|
|
|
2019-01-10 10:04:39 -07:00
|
|
|
fsid = &__fsid;
|
2019-01-10 10:04:36 -07:00
|
|
|
}
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
/* inode held in place by reference to path; group by fget on fd */
|
2024-03-17 11:41:49 -07:00
|
|
|
if (mark_type == FAN_MARK_INODE) {
|
2009-12-17 19:24:29 -07:00
|
|
|
inode = path.dentry->d_inode;
|
2024-03-17 11:41:49 -07:00
|
|
|
obj = inode;
|
|
|
|
} else {
|
2009-12-17 19:24:29 -07:00
|
|
|
mnt = path.mnt;
|
2024-03-17 11:41:49 -07:00
|
|
|
if (mark_type == FAN_MARK_MOUNT)
|
|
|
|
obj = mnt;
|
|
|
|
else
|
|
|
|
obj = mnt->mnt_sb;
|
|
|
|
}
|
2009-12-17 19:24:26 -07:00
|
|
|
|
2024-03-17 11:41:48 -07:00
|
|
|
/*
|
|
|
|
* If some other task has this inode open for write we should not add
|
|
|
|
* an ignore mask, unless that ignore mask is supposed to survive
|
|
|
|
* modification changes anyway.
|
|
|
|
*/
|
|
|
|
if (mark_cmd == FAN_MARK_ADD && (flags & FANOTIFY_MARK_IGNORE_BITS) &&
|
|
|
|
!(flags & FAN_MARK_IGNORED_SURV_MODIFY)) {
|
|
|
|
ret = mnt ? -EINVAL : -EISDIR;
|
|
|
|
/* FAN_MARK_IGNORE requires SURV_MODIFY for sb/mount/dir marks */
|
|
|
|
if (ignore == FAN_MARK_IGNORE &&
|
|
|
|
(mnt || S_ISDIR(inode->i_mode)))
|
|
|
|
goto path_put_and_out;
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
if (inode && inode_is_open_for_write(inode))
|
|
|
|
goto path_put_and_out;
|
|
|
|
}
|
2022-06-29 07:42:10 -07:00
|
|
|
|
2020-07-16 01:42:15 -07:00
|
|
|
/* Mask out FAN_EVENT_ON_CHILD flag for sb/mount/non-dir marks */
|
|
|
|
if (mnt || !S_ISDIR(inode->i_mode)) {
|
|
|
|
mask &= ~FAN_EVENT_ON_CHILD;
|
|
|
|
umask = FAN_EVENT_ON_CHILD;
|
2020-07-16 01:42:27 -07:00
|
|
|
/*
|
|
|
|
* If group needs to report parent fid, register for getting
|
|
|
|
* events with parent/name info for non-directory.
|
|
|
|
*/
|
|
|
|
if ((fid_mode & FAN_REPORT_DIR_FID) &&
|
2022-06-29 07:42:08 -07:00
|
|
|
(flags & FAN_MARK_ADD) && !ignore)
|
2020-07-16 01:42:27 -07:00
|
|
|
mask |= FAN_EVENT_ON_CHILD;
|
2020-07-16 01:42:15 -07:00
|
|
|
}
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
/* create/update an inode mark */
|
2022-06-29 07:42:09 -07:00
|
|
|
switch (mark_cmd) {
|
2009-12-17 19:24:28 -07:00
|
|
|
case FAN_MARK_ADD:
|
2024-03-17 11:41:49 -07:00
|
|
|
ret = fanotify_add_mark(group, obj, obj_type, mask, flags,
|
|
|
|
fsid);
|
2009-12-17 19:24:28 -07:00
|
|
|
break;
|
|
|
|
case FAN_MARK_REMOVE:
|
2024-03-17 11:41:49 -07:00
|
|
|
ret = fanotify_remove_mark(group, obj, obj_type, mask, flags,
|
|
|
|
umask);
|
2009-12-17 19:24:28 -07:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
ret = -EINVAL;
|
|
|
|
}
|
2009-12-17 19:24:26 -07:00
|
|
|
|
2019-01-10 10:04:36 -07:00
|
|
|
path_put_and_out:
|
2009-12-17 19:24:26 -07:00
|
|
|
path_put(&path);
|
|
|
|
fput_and_out:
|
2012-08-28 09:52:22 -07:00
|
|
|
fdput(f);
|
2009-12-17 19:24:26 -07:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-11-30 15:30:59 -07:00
|
|
|
#ifndef CONFIG_ARCH_SPLIT_ARG64
|
2018-03-17 07:06:11 -07:00
|
|
|
SYSCALL_DEFINE5(fanotify_mark, int, fanotify_fd, unsigned int, flags,
|
|
|
|
__u64, mask, int, dfd,
|
|
|
|
const char __user *, pathname)
|
|
|
|
{
|
|
|
|
return do_fanotify_mark(fanotify_fd, flags, mask, dfd, pathname);
|
|
|
|
}
|
2020-11-30 15:30:59 -07:00
|
|
|
#endif
|
2018-03-17 07:06:11 -07:00
|
|
|
|
2020-11-30 15:30:59 -07:00
|
|
|
#if defined(CONFIG_ARCH_SPLIT_ARG64) || defined(CONFIG_COMPAT)
|
|
|
|
SYSCALL32_DEFINE6(fanotify_mark,
|
2013-03-05 18:10:59 -07:00
|
|
|
int, fanotify_fd, unsigned int, flags,
|
2020-11-30 15:30:59 -07:00
|
|
|
SC_ARG64(mask), int, dfd,
|
2013-03-05 18:10:59 -07:00
|
|
|
const char __user *, pathname)
|
|
|
|
{
|
2020-11-30 15:30:59 -07:00
|
|
|
return do_fanotify_mark(fanotify_fd, flags, SC_VAL64(__u64, mask),
|
|
|
|
dfd, pathname);
|
2013-03-05 18:10:59 -07:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
/*
|
2011-03-01 07:06:02 -07:00
|
|
|
* fanotify_user_setup - Our initialization function. Note that we cannot return
|
2009-12-17 19:24:26 -07:00
|
|
|
* error because we have compiled-in VFS hooks. So an (unlikely) failure here
|
|
|
|
* must result in panic().
|
|
|
|
*/
|
|
|
|
static int __init fanotify_user_setup(void)
|
|
|
|
{
|
2021-03-04 04:29:20 -07:00
|
|
|
struct sysinfo si;
|
|
|
|
int max_marks;
|
|
|
|
|
|
|
|
si_meminfo(&si);
|
|
|
|
/*
|
|
|
|
* Allow up to 1% of addressable memory to be accounted for per user
|
|
|
|
* marks limited to the range [8192, 1048576]. mount and sb marks are
|
|
|
|
* a lot cheaper than inode marks, but there is no reason for a user
|
|
|
|
* to have many of those, so calculate by the cost of inode marks.
|
|
|
|
*/
|
|
|
|
max_marks = (((si.totalram - si.totalhigh) / 100) << PAGE_SHIFT) /
|
|
|
|
INODE_MARK_COST;
|
|
|
|
max_marks = clamp(max_marks, FANOTIFY_OLD_DEFAULT_MAX_MARKS,
|
|
|
|
FANOTIFY_DEFAULT_MAX_USER_MARKS);
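The arithmetic behind the default per-user mark limit can be tried out in user space. This is a minimal sketch, assuming 4 KiB pages and an illustrative cost of 1200 bytes per inode mark; the real INODE_MARK_COST depends on the kernel build, and default_max_marks() and the *_ASSUMED names are hypothetical. The clamp bounds are the ones given in the comment above, [8192, 1048576].

```c
/* Assumed values for illustration only; the kernel derives its own. */
#define PAGE_SHIFT_ASSUMED      12          /* 4 KiB pages */
#define INODE_MARK_COST_ASSUMED 1200ULL     /* bytes charged per inode mark */
#define MAX_MARKS_FLOOR         8192ULL     /* lower clamp bound */
#define MAX_MARKS_CEILING       1048576ULL  /* upper clamp bound */

/*
 * lowmem_pages stands for (totalram - totalhigh), counted in pages.
 * Take 1% of those pages, convert to bytes, divide by the per-mark cost,
 * then clamp the result into the documented range.
 */
static unsigned long long default_max_marks(unsigned long long lowmem_pages)
{
    unsigned long long marks =
        ((lowmem_pages / 100) << PAGE_SHIFT_ASSUMED) / INODE_MARK_COST_ASSUMED;

    if (marks < MAX_MARKS_FLOOR)
        return MAX_MARKS_FLOOR;
    if (marks > MAX_MARKS_CEILING)
        return MAX_MARKS_CEILING;
    return marks;
}
```

Under these assumptions a machine with 16 GiB of addressable memory (4194304 pages) gets 143165 marks, a 256 MiB machine is raised to the floor of 8192, and a 4 TiB machine is capped at 1048576.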
|
|
|
|
|
2021-05-24 06:53:21 -07:00
|
|
|
BUILD_BUG_ON(FANOTIFY_INIT_FLAGS & FANOTIFY_INTERNAL_GROUP_FLAGS);
|
2021-11-29 13:15:29 -07:00
|
|
|
BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 12);
|
2022-06-29 07:42:10 -07:00
|
|
|
BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 11);
|
2018-10-03 14:25:37 -07:00
|
|
|
|
2023-11-30 09:56:18 -07:00
|
|
|
fanotify_mark_cache = KMEM_CACHE(fanotify_mark,
|
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current process's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces a mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However, the listener can be in a different memcg than the memcg of the
producer, and these allocations happen in the context of the event
producer. This patch introduces a remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches, and among them, allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happen in the context of a syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted, as
they are small compared to the notification marks or events, and it is
unclear whom to account the connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch also moves the members of fsnotify_group to fill the holes,
keeping the size the same, at least for 64-bit builds, even with the
additional member.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 15:46:39 -07:00
|
|
|
SLAB_PANIC|SLAB_ACCOUNT);
|
2020-03-24 09:04:20 -07:00
|
|
|
fanotify_fid_event_cachep = KMEM_CACHE(fanotify_fid_event,
|
|
|
|
SLAB_PANIC);
|
|
|
|
fanotify_path_event_cachep = KMEM_CACHE(fanotify_path_event,
|
|
|
|
SLAB_PANIC);
|
2017-10-30 13:14:56 -07:00
|
|
|
if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) {
|
|
|
|
fanotify_perm_event_cachep =
|
2019-01-10 10:04:32 -07:00
|
|
|
KMEM_CACHE(fanotify_perm_event, SLAB_PANIC);
|
2017-10-30 13:14:56 -07:00
|
|
|
}
|
2009-12-17 19:24:26 -07:00
|
|
|
|
2021-03-04 04:29:20 -07:00
|
|
|
fanotify_max_queued_events = FANOTIFY_DEFAULT_MAX_EVENTS;
|
|
|
|
init_user_ns.ucount_max[UCOUNT_FANOTIFY_GROUPS] =
|
|
|
|
FANOTIFY_DEFAULT_MAX_GROUPS;
|
|
|
|
init_user_ns.ucount_max[UCOUNT_FANOTIFY_MARKS] = max_marks;
|
2022-01-21 23:11:59 -07:00
|
|
|
fanotify_sysctls_init();
|
2021-03-04 04:29:20 -07:00
|
|
|
|
2009-12-17 19:24:26 -07:00
|
|
|
return 0;
|
2009-12-17 19:24:26 -07:00
|
|
|
}
|
2009-12-17 19:24:26 -07:00
|
|
|
device_initcall(fanotify_user_setup);
|