2024-02-29 07:51:00 -07:00
|
|
|
# SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
|
2011-11-11 16:55:49 -07:00
|
|
|
#
|
|
|
|
# 32-bit system call numbers and entry vectors
|
|
|
|
#
|
|
|
|
# The format is:
|
2024-06-25 23:02:00 -07:00
|
|
|
# <number> <abi> <name> <entry point> [<compat entry point> [noreturn]]
|
2011-11-11 16:55:49 -07:00
|
|
|
#
|
syscalls/core, syscalls/x86: Clean up compat syscall stub naming convention
Tidy the naming convention for compat syscall stubs. Hints which describe
the purpose of the stub go in front and receive a double underscore to
denote that they are generated on-the-fly by the COMPAT_SYSCALL_DEFINEx()
macro.
For the generic case, this means:
    t kernel_waitid            # common C function (see kernel/exit.c)
      __do_compat_sys_waitid   # inlined helper doing the actual work
                               # (takes original parameters as declared)
    T __se_compat_sys_waitid   # sign-extending C function calling inlined
                               # helper (takes parameters of type long,
                               # casts them to unsigned long and then to
                               # the declared type)
    T compat_sys_waitid        # alias to __se_compat_sys_waitid()
                               # (taking parameters as declared), to
                               # be included in syscall table
For x86, the naming is as follows:
    t kernel_waitid            # common C function (see kernel/exit.c)
      __do_compat_sys_waitid   # inlined helper doing the actual work
                               # (takes original parameters as declared)
    t __se_compat_sys_waitid   # sign-extending C function calling inlined
                               # helper (takes parameters of type long,
                               # casts them to unsigned long and then to
                               # the declared type)
    T __ia32_compat_sys_waitid # IA32_EMULATION 32-bit-ptregs -> C stub,
                               # calls __se_compat_sys_waitid(); to be
                               # included in syscall table
    T __x32_compat_sys_waitid  # x32 64-bit-ptregs -> C stub, calls
                               # __se_compat_sys_waitid(); to be included
                               # in syscall table
If only one of IA32_EMULATION and x32 is enabled, __se_compat_sys_waitid()
may be inlined into the stub __{ia32,x32}_compat_sys_waitid().
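The sign-extending layer described above can be illustrated with a minimal user-space sketch. It is not the kernel's macro output: the names `do_compat_sys_xyzzy` and `se_compat_sys_xyzzy` and the two-parameter signature are hypothetical, chosen only to show the "take long, cast to unsigned long, then to the declared type" step.

```c
#include <assert.h>

/* The casts below only narrow safely if long is at least as wide as int. */
static_assert(sizeof(long) >= sizeof(int), "long must be at least as wide as int");

/* Hypothetical inlined helper: takes the parameters as declared. */
static inline long do_compat_sys_xyzzy(int fd, unsigned int flags)
{
	return (long)fd + (long)flags;
}

/*
 * Hypothetical sign-extending wrapper: receives every argument as long,
 * casts it to unsigned long and then to the declared type, so stale
 * upper register bits never reach the helper.
 */
long se_compat_sys_xyzzy(long fd, long flags)
{
	return do_compat_sys_xyzzy((int)(unsigned long)fd,
				   (unsigned int)(unsigned long)flags);
}
```

With a 64-bit long, an argument whose upper 32 bits hold garbage is reduced to its low 32 bits before the helper sees it.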
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180409105145.5364-3-linux@dominikbrodowski.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-09 03:51:43 -07:00
|
|
|
# The __ia32_sys and __ia32_compat_sys stubs are created on-the-fly for
|
syscalls/x86: Use 'struct pt_regs' based syscall calling for IA32_EMULATION and x32
Extend ARCH_HAS_SYSCALL_WRAPPER for i386 emulation and for x32 on 64-bit
x86.
For x32, all we need to do is to create an additional stub for each
compat syscall which decodes the parameters in x86-64 ordering, e.g.:
    asmlinkage long __compat_sys_x32_xyzzy(struct pt_regs *regs)
    {
            return c_SyS_xyzzy(regs->di, regs->si, regs->dx);
    }
For i386 emulation, we need to teach compat_sys_*() to take struct
pt_regs as its only argument, e.g.:
    asmlinkage long __compat_sys_ia32_xyzzy(struct pt_regs *regs)
    {
            return c_SyS_xyzzy(regs->bx, regs->cx, regs->dx);
    }
In addition, we need to create additional stubs for common syscalls
(that is, for syscalls which have the same parameters on 32-bit and
64-bit), e.g.:
    asmlinkage long __sys_ia32_xyzzy(struct pt_regs *regs)
    {
            return c_sys_xyzzy(regs->bx, regs->cx, regs->dx);
    }
This approach avoids leaking random user-provided register content down
the call chain.
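As a rough user-space illustration of the two register orderings, the sketch below uses a hypothetical `struct mock_pt_regs` (a small stand-in, not the kernel's `struct pt_regs`) and a hypothetical three-argument `do_xyzzy()`. Each stub reads only the registers that carry its declared arguments, which is the property that keeps unrelated register content out of the call chain.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's struct pt_regs (subset only). */
struct mock_pt_regs {
	unsigned long bx, cx, dx, si, di;
};

/* Hypothetical common implementation with three declared parameters. */
static long do_xyzzy(unsigned int a, unsigned int b, unsigned int c)
{
	return (long)a * 100 + (long)b * 10 + (long)c;
}

/* x32-style stub: arguments arrive in x86-64 order (di, si, dx). */
long stub_x32_xyzzy(const struct mock_pt_regs *regs)
{
	assert(regs);
	return do_xyzzy((unsigned int)regs->di,
			(unsigned int)regs->si,
			(unsigned int)regs->dx);
}

/* ia32-style stub: arguments arrive in i386 order (bx, cx, dx). */
long stub_ia32_xyzzy(const struct mock_pt_regs *regs)
{
	assert(regs);
	return do_xyzzy((unsigned int)regs->bx,
			(unsigned int)regs->cx,
			(unsigned int)regs->dx);
}
```

The same register file therefore yields different argument lists depending on which stub the syscall table points at.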
This patch is based on an original proof-of-concept
| From: Linus Torvalds <torvalds@linux-foundation.org>
| Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
and was split up and heavily modified by me, in particular to base it on
ARCH_HAS_SYSCALL_WRAPPER.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180405095307.3730-6-linux@dominikbrodowski.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05 02:53:04 -07:00
|
|
|
# sys_*() system calls and compat_sys_*() compat system calls if
|
|
|
|
# IA32_EMULATION is defined, and expect struct pt_regs *regs as their only
|
|
|
|
# parameter.
|
|
|
|
#
|
2011-11-11 16:55:49 -07:00
|
|
|
# The abi is always "i386" for this file.
|
|
|
|
#
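The row format described in the header comment can be read with a few lines of C. This is only a sketch for illustration: `struct syscall_row`, `parse_syscall_row()`, and the buffer sizes are hypothetical and are not taken from the kernel's own table-processing scripts.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one table row. */
struct syscall_row {
	int number;
	char abi[16];
	char name[64];
	char entry[64];   /* empty if the row has no entry point */
	char compat[64];  /* empty or "-" if there is no compat entry point */
	int noreturn;     /* 1 if the optional "noreturn" flag is present */
};

/* Returns 1 if a row was parsed, 0 for blank, comment, or malformed lines. */
int parse_syscall_row(const char *line, struct syscall_row *row)
{
	char flag[16] = "";
	int fields;

	assert(line && row);
	memset(row, 0, sizeof(*row));

	/* Skip leading whitespace, then ignore comments and empty lines. */
	line += strspn(line, " \t");
	if (*line == '#' || *line == '\0' || *line == '\n')
		return 0;

	/* <number> <abi> <name> are mandatory; the rest is optional. */
	fields = sscanf(line, "%d %15s %63s %63s %63s %15s",
			&row->number, row->abi, row->name,
			row->entry, row->compat, flag);
	if (fields < 3)
		return 0;

	row->noreturn = (fields == 6 && strcmp(flag, "noreturn") == 0);
	return 1;
}
```

Rows such as `17 i386 break` parse with an empty entry point, which matches the table's convention for unimplemented numbers.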
|
2020-03-13 12:51:39 -07:00
|
|
|
0 i386 restart_syscall sys_restart_syscall
|
2024-06-25 23:02:00 -07:00
|
|
|
1 i386 exit sys_exit - noreturn
|
2020-03-13 12:51:39 -07:00
|
|
|
2 i386 fork sys_fork
|
|
|
|
3 i386 read sys_read
|
|
|
|
4 i386 write sys_write
|
2020-03-13 12:51:38 -07:00
|
|
|
5 i386 open sys_open compat_sys_open
|
2020-03-13 12:51:39 -07:00
|
|
|
6 i386 close sys_close
|
|
|
|
7 i386 waitpid sys_waitpid
|
|
|
|
8 i386 creat sys_creat
|
|
|
|
9 i386 link sys_link
|
|
|
|
10 i386 unlink sys_unlink
|
2020-03-13 12:51:38 -07:00
|
|
|
11 i386 execve sys_execve compat_sys_execve
|
2020-03-13 12:51:39 -07:00
|
|
|
12 i386 chdir sys_chdir
|
|
|
|
13 i386 time sys_time32
|
|
|
|
14 i386 mknod sys_mknod
|
|
|
|
15 i386 chmod sys_chmod
|
|
|
|
16 i386 lchown sys_lchown16
|
2011-11-11 16:55:49 -07:00
|
|
|
17 i386 break
|
2020-03-13 12:51:39 -07:00
|
|
|
18 i386 oldstat sys_stat
|
2020-03-13 12:51:38 -07:00
|
|
|
19 i386 lseek sys_lseek compat_sys_lseek
|
2020-03-13 12:51:39 -07:00
|
|
|
20 i386 getpid sys_getpid
|
2020-09-17 01:22:34 -07:00
|
|
|
21 i386 mount sys_mount
|
2020-03-13 12:51:39 -07:00
|
|
|
22 i386 umount sys_oldumount
|
|
|
|
23 i386 setuid sys_setuid16
|
|
|
|
24 i386 getuid sys_getuid16
|
|
|
|
25 i386 stime sys_stime32
|
2020-03-13 12:51:38 -07:00
|
|
|
26 i386 ptrace sys_ptrace compat_sys_ptrace
|
2020-03-13 12:51:39 -07:00
|
|
|
27 i386 alarm sys_alarm
|
|
|
|
28 i386 oldfstat sys_fstat
|
|
|
|
29 i386 pause sys_pause
|
|
|
|
30 i386 utime sys_utime32
|
2011-11-11 16:55:49 -07:00
|
|
|
31 i386 stty
|
|
|
|
32 i386 gtty
|
2020-03-13 12:51:39 -07:00
|
|
|
33 i386 access sys_access
|
|
|
|
34 i386 nice sys_nice
|
2011-11-11 16:55:49 -07:00
|
|
|
35 i386 ftime
|
2020-03-13 12:51:39 -07:00
|
|
|
36 i386 sync sys_sync
|
|
|
|
37 i386 kill sys_kill
|
|
|
|
38 i386 rename sys_rename
|
|
|
|
39 i386 mkdir sys_mkdir
|
|
|
|
40 i386 rmdir sys_rmdir
|
|
|
|
41 i386 dup sys_dup
|
|
|
|
42 i386 pipe sys_pipe
|
2020-03-13 12:51:38 -07:00
|
|
|
43 i386 times sys_times compat_sys_times
|
2011-11-11 16:55:49 -07:00
|
|
|
44 i386 prof
|
2020-03-13 12:51:39 -07:00
|
|
|
45 i386 brk sys_brk
|
|
|
|
46 i386 setgid sys_setgid16
|
|
|
|
47 i386 getgid sys_getgid16
|
|
|
|
48 i386 signal sys_signal
|
|
|
|
49 i386 geteuid sys_geteuid16
|
|
|
|
50 i386 getegid sys_getegid16
|
|
|
|
51 i386 acct sys_acct
|
|
|
|
52 i386 umount2 sys_umount
|
2011-11-11 16:55:49 -07:00
|
|
|
53 i386 lock
|
2020-03-13 12:51:38 -07:00
|
|
|
54 i386 ioctl sys_ioctl compat_sys_ioctl
|
|
|
|
55 i386 fcntl sys_fcntl compat_sys_fcntl64
|
2011-11-11 16:55:49 -07:00
|
|
|
56 i386 mpx
|
2020-03-13 12:51:39 -07:00
|
|
|
57 i386 setpgid sys_setpgid
|
2011-11-11 16:55:49 -07:00
|
|
|
58 i386 ulimit
|
2020-03-13 12:51:39 -07:00
|
|
|
59 i386 oldolduname sys_olduname
|
|
|
|
60 i386 umask sys_umask
|
|
|
|
61 i386 chroot sys_chroot
|
2020-03-13 12:51:38 -07:00
|
|
|
62 i386 ustat sys_ustat compat_sys_ustat
|
2020-03-13 12:51:39 -07:00
|
|
|
63 i386 dup2 sys_dup2
|
|
|
|
64 i386 getppid sys_getppid
|
|
|
|
65 i386 getpgrp sys_getpgrp
|
|
|
|
66 i386 setsid sys_setsid
|
2020-03-13 12:51:38 -07:00
|
|
|
67 i386 sigaction sys_sigaction compat_sys_sigaction
|
2020-03-13 12:51:39 -07:00
|
|
|
68 i386 sgetmask sys_sgetmask
|
|
|
|
69 i386 ssetmask sys_ssetmask
|
|
|
|
70 i386 setreuid sys_setreuid16
|
|
|
|
71 i386 setregid sys_setregid16
|
|
|
|
72 i386 sigsuspend sys_sigsuspend
|
2020-03-13 12:51:38 -07:00
|
|
|
73 i386 sigpending sys_sigpending compat_sys_sigpending
|
2020-03-13 12:51:39 -07:00
|
|
|
74 i386 sethostname sys_sethostname
|
2020-03-13 12:51:38 -07:00
|
|
|
75 i386 setrlimit sys_setrlimit compat_sys_setrlimit
|
|
|
|
76 i386 getrlimit sys_old_getrlimit compat_sys_old_getrlimit
|
|
|
|
77 i386 getrusage sys_getrusage compat_sys_getrusage
|
|
|
|
78 i386 gettimeofday sys_gettimeofday compat_sys_gettimeofday
|
|
|
|
79 i386 settimeofday sys_settimeofday compat_sys_settimeofday
|
2020-03-13 12:51:39 -07:00
|
|
|
80 i386 getgroups sys_getgroups16
|
|
|
|
81 i386 setgroups sys_setgroups16
|
2020-03-13 12:51:38 -07:00
|
|
|
82 i386 select sys_old_select compat_sys_old_select
|
2020-03-13 12:51:39 -07:00
|
|
|
83 i386 symlink sys_symlink
|
|
|
|
84 i386 oldlstat sys_lstat
|
|
|
|
85 i386 readlink sys_readlink
|
|
|
|
86 i386 uselib sys_uselib
|
|
|
|
87 i386 swapon sys_swapon
|
|
|
|
88 i386 reboot sys_reboot
|
2020-03-13 12:51:38 -07:00
|
|
|
89 i386 readdir sys_old_readdir compat_sys_old_readdir
|
2020-03-13 12:51:40 -07:00
|
|
|
90 i386 mmap sys_old_mmap compat_sys_ia32_mmap
|
2020-03-13 12:51:39 -07:00
|
|
|
91 i386 munmap sys_munmap
|
2020-03-13 12:51:38 -07:00
|
|
|
92 i386 truncate sys_truncate compat_sys_truncate
|
|
|
|
93 i386 ftruncate sys_ftruncate compat_sys_ftruncate
|
2020-03-13 12:51:39 -07:00
|
|
|
94 i386 fchmod sys_fchmod
|
|
|
|
95 i386 fchown sys_fchown16
|
|
|
|
96 i386 getpriority sys_getpriority
|
|
|
|
97 i386 setpriority sys_setpriority
|
2011-11-11 16:55:49 -07:00
|
|
|
98 i386 profil
|
2020-03-13 12:51:38 -07:00
|
|
|
99 i386 statfs sys_statfs compat_sys_statfs
|
|
|
|
100 i386 fstatfs sys_fstatfs compat_sys_fstatfs
|
2020-03-13 12:51:39 -07:00
|
|
|
101 i386 ioperm sys_ioperm
|
2020-03-13 12:51:38 -07:00
|
|
|
102 i386 socketcall sys_socketcall compat_sys_socketcall
|
2020-03-13 12:51:39 -07:00
|
|
|
103 i386 syslog sys_syslog
|
2020-03-13 12:51:38 -07:00
|
|
|
104 i386 setitimer sys_setitimer compat_sys_setitimer
|
|
|
|
105 i386 getitimer sys_getitimer compat_sys_getitimer
|
|
|
|
106 i386 stat sys_newstat compat_sys_newstat
|
|
|
|
107 i386 lstat sys_newlstat compat_sys_newlstat
|
|
|
|
108 i386 fstat sys_newfstat compat_sys_newfstat
|
2020-03-13 12:51:39 -07:00
|
|
|
109 i386 olduname sys_uname
|
|
|
|
110 i386 iopl sys_iopl
|
|
|
|
111 i386 vhangup sys_vhangup
|
2011-11-11 16:55:49 -07:00
|
|
|
112 i386 idle
|
2020-03-13 12:51:38 -07:00
|
|
|
113 i386 vm86old sys_vm86old sys_ni_syscall
|
|
|
|
114 i386 wait4 sys_wait4 compat_sys_wait4
|
2020-03-13 12:51:39 -07:00
|
|
|
115 i386 swapoff sys_swapoff
|
2020-03-13 12:51:38 -07:00
|
|
|
116 i386 sysinfo sys_sysinfo compat_sys_sysinfo
|
|
|
|
117 i386 ipc sys_ipc compat_sys_ipc
|
2020-03-13 12:51:39 -07:00
|
|
|
118 i386 fsync sys_fsync
|
2020-03-13 12:51:38 -07:00
|
|
|
119 i386 sigreturn sys_sigreturn compat_sys_sigreturn
|
2020-03-13 12:51:40 -07:00
|
|
|
120 i386 clone sys_clone compat_sys_ia32_clone
|
2020-03-13 12:51:39 -07:00
|
|
|
121 i386 setdomainname sys_setdomainname
|
|
|
|
122 i386 uname sys_newuname
|
|
|
|
123 i386 modify_ldt sys_modify_ldt
|
|
|
|
124 i386 adjtimex sys_adjtimex_time32
|
|
|
|
125 i386 mprotect sys_mprotect
|
2020-03-13 12:51:38 -07:00
|
|
|
126 i386 sigprocmask sys_sigprocmask compat_sys_sigprocmask
|
2011-11-11 16:55:49 -07:00
|
|
|
127 i386 create_module
|
2020-03-13 12:51:39 -07:00
|
|
|
128 i386 init_module sys_init_module
|
|
|
|
129 i386 delete_module sys_delete_module
|
2011-11-11 16:55:49 -07:00
|
|
|
130 i386 get_kernel_syms
|
2020-09-17 00:41:59 -07:00
|
|
|
131 i386 quotactl sys_quotactl
|
2020-03-13 12:51:39 -07:00
|
|
|
132 i386 getpgid sys_getpgid
|
|
|
|
133 i386 fchdir sys_fchdir
|
2021-06-29 13:11:44 -07:00
|
|
|
134 i386 bdflush sys_ni_syscall
|
2020-03-13 12:51:39 -07:00
|
|
|
135 i386 sysfs sys_sysfs
|
|
|
|
136 i386 personality sys_personality
|
2011-11-11 16:55:49 -07:00
|
|
|
137 i386 afs_syscall
|
2020-03-13 12:51:39 -07:00
|
|
|
138 i386 setfsuid sys_setfsuid16
|
|
|
|
139 i386 setfsgid sys_setfsgid16
|
|
|
|
140 i386 _llseek sys_llseek
|
2020-03-13 12:51:38 -07:00
|
|
|
141 i386 getdents sys_getdents compat_sys_getdents
|
|
|
|
142 i386 _newselect sys_select compat_sys_select
|
2020-03-13 12:51:39 -07:00
|
|
|
143 i386 flock sys_flock
|
|
|
|
144 i386 msync sys_msync
|
2020-09-24 21:51:43 -07:00
|
|
|
145 i386 readv sys_readv
|
|
|
|
146 i386 writev sys_writev
|
2020-03-13 12:51:39 -07:00
|
|
|
147 i386 getsid sys_getsid
|
|
|
|
148 i386 fdatasync sys_fdatasync
|
2020-08-14 17:31:07 -07:00
|
|
|
149 i386 _sysctl sys_ni_syscall
|
2020-03-13 12:51:39 -07:00
|
|
|
150 i386 mlock sys_mlock
|
|
|
|
151 i386 munlock sys_munlock
|
|
|
|
152 i386 mlockall sys_mlockall
|
|
|
|
153 i386 munlockall sys_munlockall
|
|
|
|
154 i386 sched_setparam sys_sched_setparam
|
|
|
|
155 i386 sched_getparam sys_sched_getparam
|
|
|
|
156 i386 sched_setscheduler sys_sched_setscheduler
|
|
|
|
157 i386 sched_getscheduler sys_sched_getscheduler
|
|
|
|
158 i386 sched_yield sys_sched_yield
|
|
|
|
159 i386 sched_get_priority_max sys_sched_get_priority_max
|
|
|
|
160 i386 sched_get_priority_min sys_sched_get_priority_min
|
|
|
|
161 i386 sched_rr_get_interval sys_sched_rr_get_interval_time32
|
|
|
|
162 i386 nanosleep sys_nanosleep_time32
|
|
|
|
163 i386 mremap sys_mremap
|
|
|
|
164 i386 setresuid sys_setresuid16
|
|
|
|
165 i386 getresuid sys_getresuid16
|
2020-03-13 12:51:38 -07:00
|
|
|
166 i386 vm86 sys_vm86 sys_ni_syscall
|
2011-11-11 16:55:49 -07:00
|
|
|
167 i386 query_module
|
2020-03-13 12:51:39 -07:00
|
|
|
168 i386 poll sys_poll
|
2011-11-11 16:55:49 -07:00
|
|
|
169 i386 nfsservctl
|
2020-03-13 12:51:39 -07:00
|
|
|
170 i386 setresgid sys_setresgid16
|
|
|
|
171 i386 getresgid sys_getresgid16
|
|
|
|
172 i386 prctl sys_prctl
|
2020-03-13 12:51:38 -07:00
|
|
|
173 i386 rt_sigreturn sys_rt_sigreturn compat_sys_rt_sigreturn
|
|
|
|
174 i386 rt_sigaction sys_rt_sigaction compat_sys_rt_sigaction
|
|
|
|
175 i386 rt_sigprocmask sys_rt_sigprocmask compat_sys_rt_sigprocmask
|
|
|
|
176 i386 rt_sigpending sys_rt_sigpending compat_sys_rt_sigpending
|
|
|
|
177 i386 rt_sigtimedwait sys_rt_sigtimedwait_time32 compat_sys_rt_sigtimedwait_time32
|
|
|
|
178 i386 rt_sigqueueinfo sys_rt_sigqueueinfo compat_sys_rt_sigqueueinfo
|
|
|
|
179 i386 rt_sigsuspend sys_rt_sigsuspend compat_sys_rt_sigsuspend
|
2020-03-13 12:51:41 -07:00
|
|
|
180 i386 pread64 sys_ia32_pread64
|
|
|
|
181 i386 pwrite64 sys_ia32_pwrite64
|
2020-03-13 12:51:39 -07:00
|
|
|
182 i386 chown sys_chown16
|
|
|
|
183 i386 getcwd sys_getcwd
|
|
|
|
184 i386 capget sys_capget
|
|
|
|
185 i386 capset sys_capset
|
2020-03-13 12:51:38 -07:00
|
|
|
186 i386 sigaltstack sys_sigaltstack compat_sys_sigaltstack
|
|
|
|
187 i386 sendfile sys_sendfile compat_sys_sendfile
|
2011-11-11 16:55:49 -07:00
|
|
|
188 i386 getpmsg
|
|
|
|
189 i386 putpmsg
|
2020-03-13 12:51:39 -07:00
|
|
|
190 i386 vfork sys_vfork
|
2020-03-13 12:51:38 -07:00
|
|
|
191 i386 ugetrlimit sys_getrlimit compat_sys_getrlimit
|
2020-03-13 12:51:39 -07:00
|
|
|
192 i386 mmap2 sys_mmap_pgoff
|
2020-03-13 12:51:41 -07:00
|
|
|
193 i386 truncate64 sys_ia32_truncate64
|
|
|
|
194 i386 ftruncate64 sys_ia32_ftruncate64
|
2020-03-13 12:51:40 -07:00
|
|
|
195 i386 stat64 sys_stat64 compat_sys_ia32_stat64
|
|
|
|
196 i386 lstat64 sys_lstat64 compat_sys_ia32_lstat64
|
|
|
|
197 i386 fstat64 sys_fstat64 compat_sys_ia32_fstat64
|
2020-03-13 12:51:39 -07:00
|
|
|
198 i386 lchown32 sys_lchown
|
|
|
|
199 i386 getuid32 sys_getuid
|
|
|
|
200 i386 getgid32 sys_getgid
|
|
|
|
201 i386 geteuid32 sys_geteuid
|
|
|
|
202 i386 getegid32 sys_getegid
|
|
|
|
203 i386 setreuid32 sys_setreuid
|
|
|
|
204 i386 setregid32 sys_setregid
|
|
|
|
205 i386 getgroups32 sys_getgroups
|
|
|
|
206 i386 setgroups32 sys_setgroups
|
|
|
|
207 i386 fchown32 sys_fchown
|
|
|
|
208 i386 setresuid32 sys_setresuid
|
|
|
|
209 i386 getresuid32 sys_getresuid
|
|
|
|
210 i386 setresgid32 sys_setresgid
|
|
|
|
211 i386 getresgid32 sys_getresgid
|
|
|
|
212 i386 chown32 sys_chown
|
|
|
|
213 i386 setuid32 sys_setuid
|
|
|
|
214 i386 setgid32 sys_setgid
|
|
|
|
215 i386 setfsuid32 sys_setfsuid
|
|
|
|
216 i386 setfsgid32 sys_setfsgid
|
|
|
|
217 i386 pivot_root sys_pivot_root
|
|
|
|
218 i386 mincore sys_mincore
|
|
|
|
219 i386 madvise sys_madvise
|
|
|
|
220 i386 getdents64 sys_getdents64
|
2020-03-13 12:51:38 -07:00
|
|
|
221 i386 fcntl64 sys_fcntl64 compat_sys_fcntl64
|
2011-11-11 16:55:49 -07:00
|
|
|
# 222 is unused
|
|
|
|
# 223 is unused
|
2020-03-13 12:51:39 -07:00
|
|
|
224 i386 gettid sys_gettid
|
2020-03-13 12:51:41 -07:00
|
|
|
225 i386 readahead sys_ia32_readahead
|
2020-03-13 12:51:39 -07:00
|
|
|
226 i386 setxattr sys_setxattr
|
|
|
|
227 i386 lsetxattr sys_lsetxattr
|
|
|
|
228 i386 fsetxattr sys_fsetxattr
|
|
|
|
229 i386 getxattr sys_getxattr
|
|
|
|
230 i386 lgetxattr sys_lgetxattr
|
|
|
|
231 i386 fgetxattr sys_fgetxattr
|
|
|
|
232 i386 listxattr sys_listxattr
|
|
|
|
233 i386 llistxattr sys_llistxattr
|
|
|
|
234 i386 flistxattr sys_flistxattr
|
|
|
|
235 i386 removexattr sys_removexattr
|
|
|
|
236 i386 lremovexattr sys_lremovexattr
|
|
|
|
237 i386 fremovexattr sys_fremovexattr
|
|
|
|
238 i386 tkill sys_tkill
|
|
|
|
239 i386 sendfile64 sys_sendfile64
|
|
|
|
240 i386 futex sys_futex_time32
|
2020-03-13 12:51:38 -07:00
|
|
|
241 i386 sched_setaffinity sys_sched_setaffinity compat_sys_sched_setaffinity
|
|
|
|
242 i386 sched_getaffinity sys_sched_getaffinity compat_sys_sched_getaffinity
|
2020-03-13 12:51:39 -07:00
|
|
|
243 i386 set_thread_area sys_set_thread_area
|
|
|
|
244 i386 get_thread_area sys_get_thread_area
|
2020-03-13 12:51:38 -07:00
|
|
|
245 i386 io_setup sys_io_setup compat_sys_io_setup
|
2020-03-13 12:51:39 -07:00
|
|
|
246 i386 io_destroy sys_io_destroy
|
|
|
|
247 i386 io_getevents sys_io_getevents_time32
|
2020-03-13 12:51:38 -07:00
|
|
|
248 i386 io_submit sys_io_submit compat_sys_io_submit
|
2020-03-13 12:51:39 -07:00
|
|
|
249 i386 io_cancel sys_io_cancel
|
2020-03-13 12:51:41 -07:00
|
|
|
250 i386 fadvise64 sys_ia32_fadvise64
|
2011-11-11 16:55:49 -07:00
|
|
|
# 251 is available for reuse (was briefly sys_set_zone_reclaim)
|
2024-06-25 23:02:00 -07:00
|
|
|
252 i386 exit_group sys_exit_group - noreturn
|
2023-07-10 11:51:24 -07:00
|
|
|
253 i386 lookup_dcookie
|
2020-03-13 12:51:39 -07:00
|
|
|
254 i386 epoll_create sys_epoll_create
|
|
|
|
255 i386 epoll_ctl sys_epoll_ctl
|
|
|
|
256 i386 epoll_wait sys_epoll_wait
|
|
|
|
257 i386 remap_file_pages sys_remap_file_pages
|
|
|
|
258 i386 set_tid_address sys_set_tid_address
|
2020-03-13 12:51:38 -07:00
|
|
|
259 i386 timer_create sys_timer_create compat_sys_timer_create
|
2020-03-13 12:51:39 -07:00
|
|
|
260 i386 timer_settime sys_timer_settime32
|
|
|
|
261 i386 timer_gettime sys_timer_gettime32
|
|
|
|
262 i386 timer_getoverrun sys_timer_getoverrun
|
|
|
|
263 i386 timer_delete sys_timer_delete
|
|
|
|
264 i386 clock_settime sys_clock_settime32
|
|
|
|
265 i386 clock_gettime sys_clock_gettime32
|
|
|
|
266 i386 clock_getres sys_clock_getres_time32
|
|
|
|
267 i386 clock_nanosleep sys_clock_nanosleep_time32
|
2020-03-13 12:51:38 -07:00
|
|
|
268 i386 statfs64 sys_statfs64 compat_sys_statfs64
|
|
|
|
269 i386 fstatfs64 sys_fstatfs64 compat_sys_fstatfs64
|
2020-03-13 12:51:39 -07:00
|
|
|
270 i386 tgkill sys_tgkill
|
|
|
|
271 i386 utimes sys_utimes_time32
|
2020-03-13 12:51:41 -07:00
|
|
|
272 i386 fadvise64_64 sys_ia32_fadvise64_64
|
2011-11-11 16:55:49 -07:00
|
|
|
273 i386 vserver
|
2020-03-13 12:51:39 -07:00
|
|
|
274 i386 mbind sys_mbind
|
2021-09-08 15:18:25 -07:00
|
|
|
275 i386 get_mempolicy sys_get_mempolicy
|
2020-03-13 12:51:39 -07:00
|
|
|
276 i386 set_mempolicy sys_set_mempolicy
|
2020-03-13 12:51:38 -07:00
|
|
|
277 i386 mq_open sys_mq_open compat_sys_mq_open
|
2020-03-13 12:51:39 -07:00
|
|
|
278 i386 mq_unlink sys_mq_unlink
|
|
|
|
279 i386 mq_timedsend sys_mq_timedsend_time32
|
|
|
|
280 i386 mq_timedreceive sys_mq_timedreceive_time32
|
2020-03-13 12:51:38 -07:00
|
|
|
281 i386 mq_notify sys_mq_notify compat_sys_mq_notify
|
|
|
|
282 i386 mq_getsetattr sys_mq_getsetattr compat_sys_mq_getsetattr
|
|
|
|
283 i386 kexec_load sys_kexec_load compat_sys_kexec_load
|
|
|
|
284 i386 waitid sys_waitid compat_sys_waitid
|
2011-11-11 16:55:49 -07:00
|
|
|
# 285 sys_setaltroot
|
2020-03-13 12:51:39 -07:00
|
|
|
286 i386 add_key sys_add_key
|
|
|
|
287 i386 request_key sys_request_key
|
2020-03-13 12:51:38 -07:00
|
|
|
288 i386 keyctl sys_keyctl compat_sys_keyctl
|
2020-03-13 12:51:39 -07:00
|
|
|
289 i386 ioprio_set sys_ioprio_set
|
|
|
|
290 i386 ioprio_get sys_ioprio_get
|
|
|
|
291 i386 inotify_init sys_inotify_init
|
|
|
|
292 i386 inotify_add_watch sys_inotify_add_watch
|
|
|
|
293 i386 inotify_rm_watch sys_inotify_rm_watch
|
|
|
|
294 i386 migrate_pages sys_migrate_pages
|
2020-03-13 12:51:38 -07:00
|
|
|
295 i386 openat sys_openat compat_sys_openat
|
2020-03-13 12:51:39 -07:00
|
|
|
296 i386 mkdirat sys_mkdirat
|
|
|
|
297 i386 mknodat sys_mknodat
|
|
|
|
298 i386 fchownat sys_fchownat
|
|
|
|
299 i386 futimesat sys_futimesat_time32
|
2020-03-13 12:51:40 -07:00
|
|
|
300 i386 fstatat64 sys_fstatat64 compat_sys_ia32_fstatat64
|
2020-03-13 12:51:39 -07:00
|
|
|
301 i386 unlinkat sys_unlinkat
|
|
|
|
302 i386 renameat sys_renameat
|
|
|
|
303 i386 linkat sys_linkat
|
|
|
|
304 i386 symlinkat sys_symlinkat
|
|
|
|
305 i386 readlinkat sys_readlinkat
|
|
|
|
306 i386 fchmodat sys_fchmodat
|
|
|
|
307 i386 faccessat sys_faccessat
|
2020-03-13 12:51:38 -07:00
|
|
|
308 i386 pselect6 sys_pselect6_time32 compat_sys_pselect6_time32
|
|
|
|
309 i386 ppoll sys_ppoll_time32 compat_sys_ppoll_time32
|
2020-03-13 12:51:39 -07:00
|
|
|
310 i386 unshare sys_unshare
|
2020-03-13 12:51:38 -07:00
|
|
|
311 i386 set_robust_list sys_set_robust_list compat_sys_set_robust_list
|
|
|
|
312 i386 get_robust_list sys_get_robust_list compat_sys_get_robust_list
|
2020-03-13 12:51:39 -07:00
|
|
|
313 i386 splice sys_splice
|
2020-03-13 12:51:41 -07:00
|
|
|
314 i386 sync_file_range sys_ia32_sync_file_range
|
2020-03-13 12:51:39 -07:00
|
|
|
315 i386 tee sys_tee
|
2020-09-24 21:51:44 -07:00
|
|
|
316 i386 vmsplice sys_vmsplice
|
2021-09-08 15:18:25 -07:00
|
|
|
317 i386 move_pages sys_move_pages
|
2020-03-13 12:51:39 -07:00
|
|
|
318 i386 getcpu sys_getcpu
|
|
|
|
319 i386 epoll_pwait sys_epoll_pwait
|
|
|
|
320 i386 utimensat sys_utimensat_time32
|
2020-03-13 12:51:38 -07:00
|
|
|
321 i386 signalfd sys_signalfd compat_sys_signalfd
|
2020-03-13 12:51:39 -07:00
|
|
|
322 i386 timerfd_create sys_timerfd_create
|
|
|
|
323 i386 eventfd sys_eventfd
|
2020-03-13 12:51:41 -07:00
|
|
|
324 i386 fallocate sys_ia32_fallocate
|
2020-03-13 12:51:39 -07:00
|
|
|
325 i386 timerfd_settime sys_timerfd_settime32
|
|
|
|
326 i386 timerfd_gettime sys_timerfd_gettime32
|
2020-03-13 12:51:38 -07:00
|
|
|
327 i386 signalfd4 sys_signalfd4 compat_sys_signalfd4
|
2020-03-13 12:51:39 -07:00
|
|
|
328 i386 eventfd2 sys_eventfd2
|
|
|
|
329 i386 epoll_create1 sys_epoll_create1
|
|
|
|
330 i386 dup3 sys_dup3
|
|
|
|
331 i386 pipe2 sys_pipe2
|
|
|
|
332 i386 inotify_init1 sys_inotify_init1
|
2020-03-13 12:51:38 -07:00
|
|
|
333 i386 preadv sys_preadv compat_sys_preadv
|
|
|
|
334 i386 pwritev sys_pwritev compat_sys_pwritev
|
|
|
|
335 i386 rt_tgsigqueueinfo sys_rt_tgsigqueueinfo compat_sys_rt_tgsigqueueinfo
|
2020-03-13 12:51:39 -07:00
|
|
|
336 i386 perf_event_open sys_perf_event_open
|
2020-03-13 12:51:38 -07:00
|
|
|
337 i386 recvmmsg sys_recvmmsg_time32 compat_sys_recvmmsg_time32
|
2020-03-13 12:51:39 -07:00
|
|
|
338 i386 fanotify_init sys_fanotify_init
|
2020-03-13 12:51:38 -07:00
|
|
|
339 i386 fanotify_mark sys_fanotify_mark compat_sys_fanotify_mark
|
2020-03-13 12:51:39 -07:00
|
|
|
340 i386 prlimit64 sys_prlimit64
|
|
|
|
341 i386 name_to_handle_at sys_name_to_handle_at
|
2020-03-13 12:51:38 -07:00
|
|
|
342 i386 open_by_handle_at sys_open_by_handle_at compat_sys_open_by_handle_at
|
2020-03-13 12:51:39 -07:00
|
|
|
343 i386 clock_adjtime sys_clock_adjtime32
|
|
|
|
344 i386 syncfs sys_syncfs
|
2020-03-13 12:51:38 -07:00
|
|
|
345 i386 sendmmsg sys_sendmmsg compat_sys_sendmmsg
|
2020-03-13 12:51:39 -07:00
|
|
|
346 i386 setns sys_setns
|
2020-09-24 21:51:45 -07:00
|
|
|
347 i386 process_vm_readv sys_process_vm_readv
|
|
|
|
348 i386 process_vm_writev sys_process_vm_writev
|
2020-03-13 12:51:39 -07:00
|
|
|
349 i386 kcmp sys_kcmp
|
|
|
|
350 i386 finit_module sys_finit_module
|
|
|
|
351 i386 sched_setattr sys_sched_setattr
|
|
|
|
352 i386 sched_getattr sys_sched_getattr
|
|
|
|
353 i386 renameat2 sys_renameat2
|
|
|
|
354 i386 seccomp sys_seccomp
|
|
|
|
355 i386 getrandom sys_getrandom
|
|
|
|
356 i386 memfd_create sys_memfd_create
|
|
|
|
357 i386 bpf sys_bpf
|
2020-03-13 12:51:38 -07:00
|
|
|
358 i386 execveat sys_execveat compat_sys_execveat
|
2020-03-13 12:51:39 -07:00
|
|
|
359 i386 socket sys_socket
|
|
|
|
360 i386 socketpair sys_socketpair
|
|
|
|
361 i386 bind sys_bind
|
|
|
|
362 i386 connect sys_connect
|
|
|
|
363 i386 listen sys_listen
|
|
|
|
364 i386 accept4 sys_accept4
|
2020-07-16 23:23:15 -07:00
|
|
|
365 i386 getsockopt sys_getsockopt sys_getsockopt
|
|
|
|
366 i386 setsockopt sys_setsockopt sys_setsockopt
|
2020-03-13 12:51:39 -07:00
|
|
|
367 i386 getsockname sys_getsockname
|
|
|
|
368 i386 getpeername sys_getpeername
|
|
|
|
369 i386 sendto sys_sendto
|
2020-03-13 12:51:38 -07:00
|
|
|
370 i386 sendmsg sys_sendmsg compat_sys_sendmsg
|
|
|
|
371 i386 recvfrom sys_recvfrom compat_sys_recvfrom
|
|
|
|
372 i386 recvmsg sys_recvmsg compat_sys_recvmsg
|
2020-03-13 12:51:39 -07:00
|
|
|
373 i386 shutdown sys_shutdown
|
|
|
|
374 i386 userfaultfd sys_userfaultfd
|
|
|
|
375 i386 membarrier sys_membarrier
|
|
|
|
376 i386 mlock2 sys_mlock2
|
|
|
|
377 i386 copy_file_range sys_copy_file_range
|
2020-03-13 12:51:38 -07:00
|
|
|
378 i386 preadv2 sys_preadv2 compat_sys_preadv2
|
|
|
|
379 i386 pwritev2 sys_pwritev2 compat_sys_pwritev2
|
2020-03-13 12:51:39 -07:00
|
|
|
380 i386 pkey_mprotect sys_pkey_mprotect
|
|
|
|
381 i386 pkey_alloc sys_pkey_alloc
|
|
|
|
382 i386 pkey_free sys_pkey_free
|
|
|
|
383 i386 statx sys_statx
|
2020-03-13 12:51:38 -07:00
|
|
|
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
|
|
|
|
385 i386 io_pgetevents sys_io_pgetevents_time32 compat_sys_io_pgetevents
|
2020-03-13 12:51:39 -07:00
|
|
|
386 i386 rseq sys_rseq
|
|
|
|
393 i386 semget sys_semget
|
2020-03-13 12:51:38 -07:00
|
|
|
394 i386 semctl sys_semctl compat_sys_semctl
|
2020-03-13 12:51:39 -07:00
|
|
|
395 i386 shmget sys_shmget
|
2020-03-13 12:51:38 -07:00
|
|
|
396 i386 shmctl sys_shmctl compat_sys_shmctl
|
|
|
|
397 i386 shmat sys_shmat compat_sys_shmat
|
2020-03-13 12:51:39 -07:00
|
|
|
398 i386 shmdt sys_shmdt
|
|
|
|
399 i386 msgget sys_msgget
|
2020-03-13 12:51:38 -07:00
|
|
|
400 i386 msgsnd sys_msgsnd compat_sys_msgsnd
|
|
|
|
401 i386 msgrcv sys_msgrcv compat_sys_msgrcv
|
|
|
|
402 i386 msgctl sys_msgctl compat_sys_msgctl
|
2020-03-13 12:51:39 -07:00
|
|
|
403 i386 clock_gettime64 sys_clock_gettime
|
|
|
|
404 i386 clock_settime64 sys_clock_settime
|
|
|
|
405 i386 clock_adjtime64 sys_clock_adjtime
|
|
|
|
406 i386 clock_getres_time64 sys_clock_getres
|
|
|
|
407 i386 clock_nanosleep_time64 sys_clock_nanosleep
|
|
|
|
408 i386 timer_gettime64 sys_timer_gettime
|
|
|
|
409 i386 timer_settime64 sys_timer_settime
|
|
|
|
410 i386 timerfd_gettime64 sys_timerfd_gettime
|
|
|
|
411 i386 timerfd_settime64 sys_timerfd_settime
|
|
|
|
412 i386 utimensat_time64 sys_utimensat
|
2020-03-13 12:51:38 -07:00
|
|
|
413 i386 pselect6_time64 sys_pselect6 compat_sys_pselect6_time64
|
|
|
|
414 i386 ppoll_time64 sys_ppoll compat_sys_ppoll_time64
|
2024-06-20 05:16:37 -07:00
|
|
|
416 i386 io_pgetevents_time64 sys_io_pgetevents compat_sys_io_pgetevents_time64
|
2020-03-13 12:51:38 -07:00
|
|
|
417 i386 recvmmsg_time64 sys_recvmmsg compat_sys_recvmmsg_time64
|
2020-03-13 12:51:39 -07:00
|
|
|
418 i386 mq_timedsend_time64 sys_mq_timedsend
|
|
|
|
419 i386 mq_timedreceive_time64 sys_mq_timedreceive
|
|
|
|
420 i386 semtimedop_time64 sys_semtimedop
|
2020-03-13 12:51:38 -07:00
|
|
|
421 i386 rt_sigtimedwait_time64 sys_rt_sigtimedwait compat_sys_rt_sigtimedwait_time64
|
2020-03-13 12:51:39 -07:00
|
|
|
422 i386 futex_time64 sys_futex
|
|
|
|
423 i386 sched_rr_get_interval_time64 sys_sched_rr_get_interval
|
|
|
|
424 i386 pidfd_send_signal sys_pidfd_send_signal
|
|
|
|
425 i386 io_uring_setup sys_io_uring_setup
|
|
|
|
426 i386 io_uring_enter sys_io_uring_enter
|
|
|
|
427 i386 io_uring_register sys_io_uring_register
|
|
|
|
428 i386 open_tree sys_open_tree
|
|
|
|
429 i386 move_mount sys_move_mount
|
|
|
|
430 i386 fsopen sys_fsopen
|
|
|
|
431 i386 fsconfig sys_fsconfig
|
|
|
|
432 i386 fsmount sys_fsmount
|
|
|
|
433 i386 fspick sys_fspick
|
|
|
|
434 i386 pidfd_open sys_pidfd_open
|
|
|
|
435 i386 clone3 sys_clone3
|
2019-05-24 02:31:44 -07:00
|
|
|
436 i386 close_range sys_close_range
|
2020-03-13 12:51:39 -07:00
|
|
|
437 i386 openat2 sys_openat2
|
|
|
|
438 i386 pidfd_getfd sys_pidfd_getfd
|
2020-05-14 07:44:25 -07:00
|
|
|
439 i386 faccessat2 sys_faccessat2
|
mm/madvise: introduce process_madvise() syscall: an external memory hinting API
There is a use case where System Management Software (SMS) wants to give a
memory hint like MADV_[COLD|PAGEOUT] to other processes; in the
case of Android, it is the ActivityManagerService.
The information required to make the reclaim decision is not known to the
app. Instead, it is known to the centralized userspace
daemon(ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.
To solve the issue, this patch introduces a new syscall,
process_madvise(2). It uses the pidfd of an external process to give the
hint. It also supports vectored address ranges, because an Android app has
thousands of vmas due to zygote, so it would be a waste of CPU and power to
call the syscall once for each vma. (In a test comparing 2000 per-vma
syscalls against one vectored syscall, the vectored call showed a 15%
performance improvement. I think the gain would be bigger in real practice
because the test ran in a very cache-friendly environment.)
Another potential use case for the vector range is to amortize the cost
of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations. In
future, we could find more use cases for other advice values, so let's make
it part of the API now, since we are introducing a new syscall at this
moment. With that, existing madvise(2) users could replace it with
process_madvise(2) with their own pid if they want batched address range
support.
Since it could affect another process's address range, only a privileged
process (PTRACE_MODE_ATTACH_FSCREDS), or one that something else (e.g., being
the same UID) gives the right to ptrace the process, can use it successfully.
The flag argument is reserved for future use if we need to extend the API.
I think supporting every hint that madvise supports, or will support, in
process_madvise is rather risky, because we are not sure all hints make
sense from an external process, and the implementation of a hint may rely on
the caller being in the current context, so it could be error-prone. Thus,
I limited the hints to MADV_[COLD|PAGEOUT] in this patch.
If someone wants to add other hints, we can hear the use case and review
it for each hint. That is safer for maintenance than introducing a
buggy syscall that is hard to fix later.
So finally, the API is as follows,
    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                            unsigned long vlen, int advice, unsigned int flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges from external process as well as
local process. It provides the advice to address ranges of process
described by iovec and vlen. The goal of such advice is to improve
system or application performance.
The pidfd selects the process referred to by the PID file descriptor
specified in pidfd. (See pidfd_open(2) for further information.)
The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:
struct iovec {
void *iov_base; /* starting address */
size_t iov_len; /* number of bytes to be advised */
};
Each iovec element describes an address range beginning at iov_base
and spanning iov_len bytes.
The vlen represents the number of elements in iovec.
The advice is indicated in the advice argument, which is one of the
following at this moment if the target process specified by pidfd is
external.
MADV_COLD
MADV_PAGEOUT
Permission to provide a hint to external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
process_madvise supports every advice value that madvise(2) has if the
target process is in the same thread group as the calling process, so
users could use process_madvise(2) to extend existing madvise(2) usage
with vector address ranges.
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This return value may be less than the total number of requested
bytes, if an error occurred. The caller should check return value
to determine whether a partial advice occurred.
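The call described above could be exercised roughly as follows. This is a minimal sketch, not the patch's own test code: the helper names (ranges_total, process_madvise_raw, advise_two_ranges_cold) are hypothetical, the fallback syscall numbers 434 and 440 and the MADV_COLD value 20 are assumptions used only when the system headers do not define them, and the call targets the caller's own process through a pidfd.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks, assumed to match the unified syscall numbering. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
/* Fallback, assumed to match MADV_COLD in the kernel UAPI headers. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/* Sum of iov_len over the vector: the value a complete success would return. */
size_t ranges_total(const struct iovec *iov, size_t vlen)
{
    size_t total = 0;
    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;
    return total;
}

/* Thin wrapper over the raw syscall; returns bytes advised, or -1 with errno set. */
ssize_t process_madvise_raw(int pidfd, const struct iovec *iov, size_t vlen,
                            int advice, unsigned int flags)
{
    return (ssize_t)syscall(SYS_process_madvise, pidfd, iov, vlen, advice, flags);
}

/*
 * Advise two ranges of the calling process with a single call.
 * Returns 1 if every byte was advised, 0 on partial advice, and -1 on
 * error (errno set, e.g. ENOSYS on kernels without the syscall).
 */
int advise_two_ranges_cold(void *a, size_t alen, void *b, size_t blen)
{
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = alen },
        { .iov_base = b, .iov_len = blen },
    };
    int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
    if (pidfd < 0)
        return -1;

    ssize_t done = process_madvise_raw(pidfd, iov, 2, MADV_COLD, 0);
    int saved = errno;          /* keep the syscall's errno across close() */
    close(pidfd);
    errno = saved;

    if (done < 0)
        return -1;
    return (size_t)done == ranges_total(iov, 2);
}
```

Comparing the return value against ranges_total() is one way to detect the partial advice case mentioned under RETURN VALUE.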
FAQ:
Q.1 - Why does any external entity have better knowledge?
Quote from Sandeep
"For Android, every application (including the special SystemServer)
are forked from Zygote. The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.
After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.
In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.
So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.
Besides, we can never rely on applications to clean things up
themselves. We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.
So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.
- ssp
Q.2 - How is the race (i.e., object validation) handled between an
external process giving a hint and the target process receiving that
hint?
process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called. If the target
process can run between the time the calling process inspects the
target process's address space and the time that process_madvise is
actually called, process_madvise may operate on memory regions that
the calling process does not expect. It's the
responsibility of the process calling process_madvise to close this
race condition. For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called. Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process. Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm. The suggested API itself does not provide synchronization. The
same applies to other APIs such as move_pages and process_vm_writev.
The race isn't really a problem though. Why is it so wrong to require
that callers do their own synchronization in some manner? Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something. Think about mmap. It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before. That's where synchronization is needed, through another API or
through userspace design. It shouldn't be part of the API itself. If
someone needs synchronization that is more fine-grained than process
level, two ideas were suggested: cookie[2] and anon-fd[3]. Both could be
implemented using the reserved last argument of the API, but I don't
think that is necessary right now, since we already have ways to prevent
the race and I don't want to add complexity with a more fine-grained
model.
To keep the API extensible, the last argument is reserved so we could
support this in future if someone really needs it.
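One of the options mentioned above, suspending the target with SIGSTOP before giving the hint, might look like the sketch below. It is only an illustration under stated assumptions: the helper names (with_child_stopped, placeholder_work, demo_stop_child) are hypothetical, the target is a child of the caller so that waitpid() can confirm the stop, and the callback stands in for the actual process_madvise() call.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stop a child, run work() while the child cannot change its address
 * space, then resume it. Returns work()'s result, or -1 if the child
 * could not be stopped. waitpid() only reports children; other targets
 * would need ptrace or the freezer cgroup instead.
 */
int with_child_stopped(pid_t child, int (*work)(pid_t))
{
    int status;

    if (kill(child, SIGSTOP) < 0)
        return -1;
    /* WUNTRACED makes waitpid() return once the child has stopped. */
    if (waitpid(child, &status, WUNTRACED) < 0 || !WIFSTOPPED(status))
        return -1;

    int ret = work(child);      /* e.g. a process_madvise() call */

    kill(child, SIGCONT);
    return ret;
}

/* Placeholder for the real hinting work. */
static int placeholder_work(pid_t pid)
{
    (void)pid;
    return 42;
}

/* Spawn an idle child, run the placeholder while it is stopped, clean up. */
int demo_stop_child(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        for (;;)
            pause();            /* idle target process */
    }
    int ret = with_child_stopped(child, placeholder_work);
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return ret;
}
```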
Q.3 - Why doesn't ptrace work?
Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA. Furthermore, we want to act on the hint in the caller's context, not
the callee's, because the callee is usually restricted by cpuset/cgroups
or even in a frozen state, so it can't act quickly enough by itself, which
causes more thrashing/kills. Injection also doesn't work if the target
process is already ptraced (e.g., by strace, a debugger, or minidump),
because a process can have at most one ptracer.
[1] https://developer.android.com/topic/performance/memory"
[2] process_getinfo for getting the cookie which is updated whenever
vma of process address layout are changed - Daniel Colascione -
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
[3] anonymous fd which is used for the object(i.e., address range)
validation - Michal Hocko -
https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
[minchan@kernel.org: fix process_madvise build break for arm64]
Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
[minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
[akpm@linux-foundation.org: fix i386 build]
[sfr@canb.auug.org.au: fix syscall numbering]
Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
[sfr@canb.auug.org.au: madvise.c needs compat.h]
Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
[minchan@kernel.org: fix mips build]
Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
[yuehaibing@huawei.com: remove duplicate header which is included twice]
Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
[minchan@kernel.org: do not use helper functions for process_madvise]
Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
[akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
[sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-17 16:14:59 -07:00
|
|
|
440 i386 process_madvise sys_process_madvise
|
2020-12-18 15:05:41 -07:00
|
|
|
441 i386 epoll_pwait2 sys_epoll_pwait2 compat_sys_epoll_pwait2
|
fs: add mount_setattr()
This implements the missing mount_setattr() syscall. While the new mount
api allows to change the properties of a superblock there is currently
no way to change the properties of a mount or a mount tree using file
descriptors which the new mount api is based on. In addition the old
mount api has the restriction that mount options cannot be applied
recursively. This hasn't changed since changing mount options on a
per-mount basis was implemented in [1] and has been a frequent request
not just for convenience but also for security reasons. The legacy
mount syscall is unable to accommodate this behavior without introducing
a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
mount. Changing MS_REC to apply to the whole mount tree would mean
introducing a significant uapi change and would likely cause significant
regressions.
The new mount_setattr() syscall allows to recursively clear and set
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:
int mount_setattr(int dfd, const char *path, unsigned flags,
struct mount_attr *uattr, size_t usize);
Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.
The mount_setattr() syscall can be expected to grow over time and is
designed with extensibility in mind. It follows the extensible syscall
pattern we have used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:
struct mount_attr {
__u64 attr_set;
__u64 attr_clr;
__u64 propagation;
__u64 userns_fd;
};
The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.
Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.
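The atime rule above can be written down as a small check. This is a sketch of the rule as stated in the text, not the kernel's implementation: the function name atime_request_ok is hypothetical, and the constant values are assumptions meant to mirror <linux/mount.h>, defined here only if the headers in use do not provide them.

```c
#include <stdint.h>

/* Assumed to mirror <linux/mount.h>; defined only if not already present. */
#ifndef MOUNT_ATTR_RDONLY
#define MOUNT_ATTR_RDONLY      0x00000001
#endif
#ifndef MOUNT_ATTR_NOEXEC
#define MOUNT_ATTR_NOEXEC      0x00000008
#endif
#ifndef MOUNT_ATTR__ATIME
#define MOUNT_ATTR__ATIME      0x00000070
#define MOUNT_ATTR_RELATIME    0x00000000
#define MOUNT_ATTR_NOATIME     0x00000010
#define MOUNT_ATTR_STRICTATIME 0x00000020
#endif

/* Returns 1 if attr_set/attr_clr obey the atime rules described above, else 0. */
int atime_request_ok(uint64_t attr_set, uint64_t attr_clr)
{
    uint64_t clr_atime = attr_clr & MOUNT_ATTR__ATIME;

    /* MOUNT_ATTR__ATIME must be cleared as a whole or not at all. */
    if (clr_atime != 0 && clr_atime != MOUNT_ATTR__ATIME)
        return 0;
    /* Setting any atime bits requires clearing the whole atime field. */
    if ((attr_set & MOUNT_ATTR__ATIME) && clr_atime != MOUNT_ATTR__ATIME)
        return 0;
    return 1;
}
```

For example, switching a mount to noatime passes with attr_set = MOUNT_ATTR_NOATIME and attr_clr = MOUNT_ATTR__ATIME, while the same attr_set with an empty attr_clr is rejected. Because MOUNT_ATTR_RELATIME is 0, a switch to relatime is expressed purely by clearing MOUNT_ATTR__ATIME.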
The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.
The @userns_fd field lets user specify a user namespace whose idmapping
becomes the idmapping of the mount. This is implemented and explained in
detail in the next patch.
[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-21 06:19:53 -07:00
|
|
|
442 i386 mount_setattr sys_mount_setattr
|
2021-05-31 09:42:58 -07:00
|
|
|
443 i386 quotactl_fd sys_quotactl_fd
|
2021-04-22 08:41:19 -07:00
|
|
|
444 i386 landlock_create_ruleset sys_landlock_create_ruleset
|
|
|
|
445 i386 landlock_add_rule sys_landlock_add_rule
|
|
|
|
446 i386 landlock_restrict_self sys_landlock_restrict_self
|
2021-07-07 18:08:11 -07:00
|
|
|
447 i386 memfd_secret sys_memfd_secret
|
2021-09-02 15:00:33 -07:00
|
|
|
448 i386 process_mrelease sys_process_mrelease
|
2021-09-23 10:11:06 -07:00
|
|
|
449 i386 futex_waitv sys_futex_waitv
|
2022-01-14 15:08:21 -07:00
|
|
|
450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
|
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for diagnosing performance
issues.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that was previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. Users should pass 0 (i.e., no flags specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_range points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
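A caller without libc support could reach the interface above through syscall(2), roughly as sketched here. The struct names cs_range and cs_result are local, hypothetical mirrors of the two structs in the SYNOPSIS (renamed to avoid clashing with UAPI headers), query_cachestat and dirty_percent are hypothetical helpers, and the fallback number 451 is taken from the table entry for cachestat.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_cachestat
#define SYS_cachestat 451   /* from the cachestat entry in the syscall table */
#endif

/* Local mirrors of struct cachestat_range and struct cachestat. */
struct cs_range {
    uint64_t off;
    uint64_t len;
};

struct cs_result {
    uint64_t nr_cache;
    uint64_t nr_dirty;
    uint64_t nr_writeback;
    uint64_t nr_evicted;
    uint64_t nr_recently_evicted;
};

/* Query the range starting at off; len == 0 means "to the end of the file".
 * Returns 0 on success, -1 with errno set on error. */
int query_cachestat(int fd, uint64_t off, uint64_t len, struct cs_result *out)
{
    struct cs_range range = { .off = off, .len = len };
    return (int)syscall(SYS_cachestat, fd, &range, out, 0);
}

/* Share of cached pages that are dirty, as a whole percentage (0 if none cached). */
unsigned dirty_percent(const struct cs_result *r)
{
    if (r->nr_cache == 0)
        return 0;
    return (unsigned)(100 * r->nr_dirty / r->nr_cache);
}
```

As the DESCRIPTION notes, the counters may already be stale when the call returns, so a value such as dirty_percent() is best treated as an estimate.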
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-02 18:36:07 -07:00
|
|
|
451 i386 cachestat sys_cachestat
|
2023-07-11 09:16:05 -07:00
|
|
|
452 i386 fchmodat2 sys_fchmodat2
|
2023-09-14 11:58:03 -07:00
|
|
|
453 i386 map_shadow_stack sys_map_shadow_stack
|
2023-09-21 03:45:10 -07:00
|
|
|
454 i386 futex_wake sys_futex_wake
|
2023-09-21 03:45:12 -07:00
|
|
|
455 i386 futex_wait sys_futex_wait
|
2023-09-21 03:45:15 -07:00
|
|
|
456 i386 futex_requeue sys_futex_requeue
|
2023-10-25 07:02:04 -07:00
|
|
|
457 i386 statmount sys_statmount
|
|
|
|
458 i386 listmount sys_listmount
|
lsm/stable-6.8 PR 20240105
Merge tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull security module updates from Paul Moore:
- Add three new syscalls: lsm_list_modules(), lsm_get_self_attr(), and
lsm_set_self_attr().
The first syscall simply lists the LSMs enabled, while the second and
third get and set the current process' LSM attributes. Yes, these
syscalls may provide similar functionality to what can be found under
/proc or /sys, but they were designed to support multiple,
simultaneous (stacked) LSMs from the start as opposed to the current
/proc based solutions which were created at a time when only one LSM
was allowed to be active at a given time.
We have spent considerable time discussing ways to extend the
existing /proc interfaces to support multiple, simultaneous LSMs and
even our best ideas have been far too ugly to support as a kernel
API; after +20 years in the kernel, I felt the LSM layer had
established itself enough to justify a handful of syscalls.
Support amongst the individual LSM developers has been nearly
unanimous, with a single objection coming from Tetsuo (TOMOYO) as he
is worried that the LSM_ID_XXX token concept will make it more
difficult for out-of-tree LSMs to survive. Several members of the LSM
community have demonstrated the ability for out-of-tree LSMs to
continue to exist by picking high/unused LSM_ID values as well as
pointing out that many kernel APIs rely on integer identifiers, e.g.
syscalls (!), but unfortunately Tetsuo's objections remain.
My personal opinion is that while I have no interest in penalizing
out-of-tree LSMs, I'm not going to penalize in-tree development to
support out-of-tree development, and I view this as a necessary step
forward to support the push for expanded LSM stacking and reduce our
reliance on /proc and /sys which has occasionally been problematic
for some container users. Finally, we have included the linux-api
folks on (all?) recent revisions of the patchset and addressed all of
their concerns.
- Add a new security_file_ioctl_compat() LSM hook to handle the 32-bit
ioctls on 64-bit systems problem.
This patch includes support for all of the existing LSMs which
provide ioctl hooks, although it turns out only SELinux actually
cares about the individual ioctls. It is worth noting that while
Casey (Smack) and Tetsuo (TOMOYO) did not give explicit ACKs to this
patch, they did both indicate they are okay with the changes.
- Fix a potential memory leak in the CALIPSO code when IPv6 is disabled
at boot.
While it's good that we are fixing this, I doubt this is something
users are seeing in the wild as you need to both disable IPv6 and
then attempt to configure IPv6 labeled networking via
NetLabel/CALIPSO; that just doesn't make much sense.
Normally this would go through netdev, but Jakub asked me to take
this patch and of all the trees I maintain, the LSM tree seemed like
the best fit.
- Update the LSM MAINTAINERS entry with additional information about
our process docs, patchwork, bug reporting, etc.
I also noticed that the Lockdown LSM is missing a dedicated
MAINTAINERS entry so I've added that to the pull request. I've been
working with one of the major Lockdown authors/contributors to see if
they are willing to step up and assume a Lockdown maintainer role;
hopefully that will happen soon, but in the meantime I'll continue to
look after it.
- Add a handful of mailmap entries for Serge Hallyn and myself.
* tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (27 commits)
lsm: new security_file_ioctl_compat() hook
lsm: Add a __counted_by() annotation to lsm_ctx.ctx
calipso: fix memory leak in netlbl_calipso_add_pass()
selftests: remove the LSM_ID_IMA check in lsm/lsm_list_modules_test
MAINTAINERS: add an entry for the lockdown LSM
MAINTAINERS: update the LSM entry
mailmap: add entries for Serge Hallyn's dead accounts
mailmap: update/replace my old email addresses
lsm: mark the lsm_id variables are marked as static
lsm: convert security_setselfattr() to use memdup_user()
lsm: align based on pointer length in lsm_fill_user_ctx()
lsm: consolidate buffer size handling into lsm_fill_user_ctx()
lsm: correct error codes in security_getselfattr()
lsm: cleanup the size counters in security_getselfattr()
lsm: don't yet account for IMA in LSM_CONFIG_COUNT calculation
lsm: drop LSM_ID_IMA
LSM: selftests for Linux Security Module syscalls
SELinux: Add selfattr hooks
AppArmor: Add selfattr hooks
Smack: implement setselfattr and getselfattr hooks
...
2024-01-09 13:57:46 -07:00
|
|
|
459 i386 lsm_get_self_attr sys_lsm_get_self_attr
|
|
|
|
460 i386 lsm_set_self_attr sys_lsm_set_self_attr
|
|
|
|
461 i386 lsm_list_modules sys_lsm_list_modules
|
mseal: wire up mseal syscall
Patch series "Introduce mseal", v10.
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory range
against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW) and
no-execute (NX) bits. Linux has supported NX since the release of kernel
version 2.6.8 in August 2004 [1]. The memory permission feature improves
the security stance on memory corruption bugs, as an attacker cannot
simply write to arbitrary memory and point the code to it. The memory
must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects the
VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime. A
similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall
[4]. Also, Chrome wants to adopt this feature for their CFI work [2] and
this patchset has been designed to be compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(), can leave an empty space, therefore can
be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
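The effect of item 5 in the list above can be probed from userspace with a sketch like the following. It is an illustration only: the function name seal_and_probe is hypothetical, and the fallback syscall number 462 is an assumption about where mseal sits in the unified table, used only when the headers do not define SYS_mseal.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_mseal
#define SYS_mseal 462   /* assumed number of mseal in the unified syscall table */
#endif

/*
 * Map one page, seal it, then try to change its protection.
 * Returns 1 if mprotect() was refused with EPERM (sealed as described),
 * 0 if mprotect() succeeded or failed for another reason,
 * and -1 if the page could not be mapped or sealed (e.g. ENOSYS on
 * kernels without mseal).
 */
int seal_and_probe(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (syscall(SYS_mseal, p, page, 0UL) != 0) {
        munmap(p, page);        /* not sealed, so unmapping still works */
        return -1;
    }

    /* From here on the page stays mapped until the process exits. */
    if (mprotect(p, page, PROT_READ | PROT_WRITE) == -1 && errno == EPERM)
        return 1;
    return 0;
}
```

Note that a successfully sealed page cannot be unmapped afterwards, so the probe deliberately leaves one page mapped for the rest of the process lifetime.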
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in the
case of libc, sealing is only applied to read-only (RO) or read-execute
(RX) memory segments (such as .text and .RELRO) to prevent them from
becoming writable; the lifetime of those mappings is tied to the lifetime
of the process.
Chrome wants to seal two large address space reservations that are managed
by different allocators. The memory is mapped RW- and RWX respectively
but write access to it is restricted using pkeys (or in the future ARM
permission overlay extensions). The lifetime of those mappings is not
tied to the lifetime of the process; therefore, while the memory is
sealed, the allocators still need to free or discard the unused memory,
for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a security
risk. For example if a jump instruction crosses a page boundary and the
second page gets discarded, it will overwrite the target bytes with zeros
and change the control flow. Checking write-permission before the discard
operation allows us to control when the operation is valid. In this case,
the madvise will only succeed if the executing thread has PKEY write
permissions and PKRU changes are protected in software by control-flow
integrity.
Although the initial version of this patch series is targeting the Chrome
browser as its first user, it became evident during upstream discussions
that we would also want to ensure that the patch set eventually is a
complete solution for memory sealing and compatible with other use cases.
The specific scenario currently in mind is glibc's use case of loading and
sealing ELF executables. To this end, Stephen is working on a change to
glibc to add sealing support to the dynamic linker, which will seal all
non-writable segments at startup. Once this work is completed, all
applications will be able to automatically benefit from these new
protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental in
shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
MM perf benchmarks
==================
This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
check the VMAs’ sealing flag, so that no partial update can be made,
when any segment within the given memory range is sealed.
To measure the performance impact of this loop, two tests are developed.
[8]
The first is measuring the time taken for a particular system call,
by using clock_gettime(CLOCK_MONOTONIC). The second is using
PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
similar results.
The tests roughly follow this sequence:
for (i = 0; i < 1000; i++)
    create 1000 mappings (1 page per VMA)
    start the sampling
    for (j = 0; j < 1000; j++)
        mprotect one mapping
    stop and save the sample
    delete 1000 mappings
aggregate all samples.
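The timing variant of the sequence above could be sketched as follows. This is not the benchmark used for the numbers in this message: the function names are hypothetical, and keeping each page in its own VMA by interleaving PROT_NONE pages is an assumption made here to stop the kernel from merging adjacent mappings.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Average cost in nanoseconds of one single-page mprotect() call,
 * measured over `rounds` rounds of `nmaps` one-page VMAs each.
 * Returns -1 on invalid arguments or if setting up a round fails.
 */
int64_t avg_mprotect_ns(int rounds, int nmaps)
{
    if (rounds <= 0 || nmaps <= 0)
        return -1;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t span = 2 * (size_t)nmaps * page;
    int64_t total = 0;

    for (int i = 0; i < rounds; i++) {
        char *base = mmap(NULL, span, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return -1;

        /* Setup (untimed): every other page becomes readable, so each
         * readable page is a separate VMA between PROT_NONE neighbours. */
        for (int j = 0; j < nmaps; j++) {
            if (mprotect(base + 2 * (size_t)j * page, page, PROT_READ) != 0) {
                munmap(base, span);
                return -1;
            }
        }

        /* Timed section: one mprotect() per one-page VMA. */
        int64_t start = now_ns();
        for (int j = 0; j < nmaps; j++)
            mprotect(base + 2 * (size_t)j * page, page, PROT_READ | PROT_WRITE);
        total += now_ns() - start;

        munmap(base, span);
    }
    return total / ((int64_t)rounds * nmaps);
}
```

Calling avg_mprotect_ns(1000, 1000) would correspond to the loop counts in the sequence; smaller counts are enough for a quick sanity check.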
The tests below were performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
4G memory, Chromebook.
Based on the latest upstream code:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 909 944 35 35 104%
munmap__ 2 1398 1502 104 52 107%
munmap__ 4 2444 2594 149 37 106%
munmap__ 8 4029 4323 293 37 107%
munmap__ 16 6647 6935 288 18 104%
munmap__ 32 11811 12398 587 18 105%
mprotect 1 439 465 26 26 106%
mprotect 2 1659 1745 86 43 105%
mprotect 4 3747 3889 142 36 104%
mprotect 8 6755 6969 215 27 103%
mprotect 16 13748 14144 396 25 103%
mprotect 32 27827 28969 1142 36 104%
madvise_ 1 240 262 22 22 109%
madvise_ 2 366 442 76 38 121%
madvise_ 4 623 751 128 32 121%
madvise_ 8 1110 1324 215 27 119%
madvise_ 16 2127 2451 324 20 115%
madvise_ 32 4109 4642 534 17 113%
The second test (measuring cpu cycle)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 1790 1890 100 100 106%
munmap__ 2 2819 3033 214 107 108%
munmap__ 4 4959 5271 312 78 106%
munmap__ 8 8262 8745 483 60 106%
munmap__ 16 13099 14116 1017 64 108%
munmap__ 32 23221 24785 1565 49 107%
mprotect 1 906 967 62 62 107%
mprotect 2 3019 3203 184 92 106%
mprotect 4 6149 6569 420 105 107%
mprotect 8 9978 10524 545 68 105%
mprotect 16 20448 21427 979 61 105%
mprotect 32 40972 42935 1963 61 105%
madvise_ 1 434 497 63 63 115%
madvise_ 2 752 899 147 74 120%
madvise_ 4 1313 1513 200 50 115%
madvise_ 8 2271 2627 356 44 116%
madvise_ 16 4312 4883 571 36 113%
madvise_ 32 8376 9319 943 29 111%
Based on the results, for the 6.8 kernel, the sealing check adds
20-40 nanoseconds, or around 50-100 CPU cycles, per VMA.
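The derived columns in the tables above appear to follow from the two measured columns; the sketch below states that reading explicitly. It is an assumption about how the columns were computed, with hypothetical function names, and because the published values seem to come from unrounded averages, individual cells can differ by one from what these helpers return.

```c
/* delta: patched measurement minus baseline measurement. */
long seal_delta(long t, long t_mseal)
{
    return t_mseal - t;
}

/* per_vma: delta divided by the number of VMAs, rounded to nearest. */
long seal_per_vma(long t, long t_mseal, long vmas)
{
    long d = t_mseal - t;
    long half = vmas / 2;
    return d >= 0 ? (d + half) / vmas : -((-d + half) / vmas);
}

/* %: patched measurement as a percentage of the baseline, rounded to nearest. */
long seal_percent(long t, long t_mseal)
{
    return (100 * t_mseal + t / 2) / t;
}
```

For the row "munmap__ 2 1398 1502 104 52 107%", these helpers give a delta of 104, a per-VMA cost of 52, and 107 percent of the baseline.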
In addition, I applied the sealing to 5.10 kernel:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 357 390 33 33 109%
munmap__ 2 442 463 21 11 105%
munmap__ 4 614 634 20 5 103%
munmap__ 8 1017 1137 120 15 112%
munmap__ 16 1889 2153 263 16 114%
munmap__ 32 4109 4088 -21 -1 99%
mprotect 1 235 227 -7 -7 97%
mprotect 2 495 464 -30 -15 94%
mprotect 4 741 764 24 6 103%
mprotect 8 1434 1437 2 0 100%
mprotect 16 2958 2991 33 2 101%
mprotect 32 6431 6608 177 6 103%
madvise_ 1 191 208 16 16 109%
madvise_ 2 300 324 24 12 108%
madvise_ 4 450 473 23 6 105%
madvise_ 8 753 806 53 7 107%
madvise_ 16 1467 1592 125 8 108%
madvise_ 32 2795 3405 610 19 122%
The second test (measuring CPU cycles)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 684 715 31 31 105%
munmap__ 2 861 898 38 19 104%
munmap__ 4 1183 1235 51 13 104%
munmap__ 8 1999 2045 46 6 102%
munmap__ 16 3839 3816 -23 -1 99%
munmap__ 32 7672 7887 216 7 103%
mprotect 1 397 443 46 46 112%
mprotect 2 738 788 50 25 107%
mprotect 4 1221 1256 35 9 103%
mprotect 8 2356 2429 72 9 103%
mprotect 16 4961 4935 -26 -2 99%
mprotect 32 9882 10172 291 9 103%
madvise_ 1 351 380 29 29 108%
madvise_ 2 565 615 49 25 109%
madvise_ 4 872 933 61 15 107%
madvise_ 8 1508 1640 132 16 109%
madvise_ 16 3078 3323 245 15 108%
madvise_ 32 5893 6704 811 25 114%
For the 5.10 kernel, the sealing check adds 0-15 ns in time, or 10-30
CPU cycles, per VMA; in some cases there is even a decrease.
It might be interesting to compare the 5.10 and 6.8 kernels:
The first test (measuring time)
syscall__ vmas t_5_10 t_6_8 delta_ns per_vma %
munmap__ 1 357 909 552 552 254%
munmap__ 2 442 1398 956 478 316%
munmap__ 4 614 2444 1830 458 398%
munmap__ 8 1017 4029 3012 377 396%
munmap__ 16 1889 6647 4758 297 352%
munmap__ 32 4109 11811 7702 241 287%
mprotect 1 235 439 204 204 187%
mprotect 2 495 1659 1164 582 335%
mprotect 4 741 3747 3006 752 506%
mprotect 8 1434 6755 5320 665 471%
mprotect 16 2958 13748 10790 674 465%
mprotect 32 6431 27827 21397 669 433%
madvise_ 1 191 240 49 49 125%
madvise_ 2 300 366 67 33 122%
madvise_ 4 450 623 173 43 138%
madvise_ 8 753 1110 357 45 147%
madvise_ 16 1467 2127 660 41 145%
madvise_ 32 2795 4109 1314 41 147%
The second test (measuring CPU cycles)
syscall__ vmas cpu_5_10 cpu_6_8 delta_cpu per_vma %
munmap__ 1 684 1790 1106 1106 262%
munmap__ 2 861 2819 1958 979 327%
munmap__ 4 1183 4959 3776 944 419%
munmap__ 8 1999 8262 6263 783 413%
munmap__ 16 3839 13099 9260 579 341%
munmap__ 32 7672 23221 15549 486 303%
mprotect 1 397 906 509 509 228%
mprotect 2 738 3019 2281 1140 409%
mprotect 4 1221 6149 4929 1232 504%
mprotect 8 2356 9978 7622 953 423%
mprotect 16 4961 20448 15487 968 412%
mprotect 32 9882 40972 31091 972 415%
madvise_ 1 351 434 82 82 123%
madvise_ 2 565 752 186 93 133%
madvise_ 4 872 1313 442 110 151%
madvise_ 8 1508 2271 763 95 151%
madvise_ 16 3078 4312 1234 77 140%
madvise_ 32 5893 8376 2483 78 142%
From 5.10 to 6.8:
munmap: added 250-550 ns in time, or 500-1100 CPU cycles, per VMA.
mprotect: added 200-750 ns in time, or 500-1200 CPU cycles, per VMA.
madvise: added 33-50 ns in time, or 70-110 CPU cycles, per VMA.
In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
increase from 5.10 to 6.8 is significantly larger, approximately ten times
greater for munmap and mprotect.
When I discussed mm performance with Brian Makin, an engineer who worked
on performance, he pointed out that benchmarks like these, which measure
millions of mm syscalls in a tight loop, may not accurately reflect
real-world scenarios such as a database service. Also, this was tested on
a single piece of hardware running ChromeOS; data from other hardware or
distributions might differ. It is best to take this data with a grain
of salt.
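For reference, a tight-loop measurement of the kind described above might
look like the following minimal sketch. It is not the benchmark that produced
the tables. The function names, the single-page mapping (the 1-VMA case), and
the choice of CLOCK_MONOTONIC are all assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Read the monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Average wall-clock cost in ns of one mprotect() call on a single
 * anonymous page, toggling the protection on every call so that each
 * call actually changes the VMA. Returns -1.0 on failure.
 */
static double bench_mprotect_ns(unsigned long iterations)
{
	long page = sysconf(_SC_PAGESIZE);
	uint64_t start, end;
	unsigned long i;
	void *p;

	if (iterations == 0 || page <= 0)
		return -1.0;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1.0;

	start = now_ns();
	for (i = 0; i < iterations; i++) {
		int prot = (i & 1) ? (PROT_READ | PROT_WRITE) : PROT_READ;

		if (mprotect(p, (size_t)page, prot) != 0) {
			munmap(p, (size_t)page);
			return -1.0;
		}
	}
	end = now_ns();

	munmap(p, (size_t)page);
	return (double)(end - start) / (double)iterations;
}
```

Calling bench_mprotect_ns() with a few million iterations on kernels with
and without the sealing check gives a figure comparable in spirit to the
"t" and "t_mseal" columns, subject to the same caveats.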
This patch (of 5):
Wire up mseal syscall for all architectures.
Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org
Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Jann Horn <jannh@google.com> [Bug #2]
Cc: Jeff Xu <jeffxu@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-15 09:35:20 -07:00
|
|
|
462 i386 mseal sys_mseal
|