The tcp_bpf_recvmsg_parser() function, running in user context,
retrieves seq_copied from tcp_sk without holding the socket lock, and
stores it in a local variable seq. However, the softirq context can
modify tcp_sk->seq_copied concurrently, for example, n tcp_read_sock().
As a result, the seq value is stale when it is assigned back to
tcp_sk->copied_seq at the end of tcp_bpf_recvmsg_parser(), leading to
incorrect behavior.
Due to concurrency, the copied_seq field in tcp_bpf_recvmsg_parser()
might be set to an incorrect value (less than the actual copied_seq) at
the end of function: 'WRITE_ONCE(tcp->copied_seq, seq)'. This causes the
'offset' to be negative in tcp_read_sock()->tcp_recv_skb() when
processing new incoming packets (sk->copied_seq - skb->seq becomes less
than 0), and all subsequent packets will be dropped.
Signed-off-by: Jiayuan Chen <mrpre@163.com>
Link: https://lore.kernel.org/r/20241028065226.35568-1-mrpre@163.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
The following race condition could trigger a NULL pointer dereference:
sock_map_link_detach(): sock_map_link_update_prog():
mutex_lock(&sockmap_mutex);
...
sockmap_link->map = NULL;
mutex_unlock(&sockmap_mutex);
mutex_lock(&sockmap_mutex);
...
sock_map_prog_link_lookup(sockmap_link->map);
mutex_unlock(&sockmap_mutex);
<continue>
Fix it by adding a NULL pointer check. In this specific case, it makes
no sense to update a link which is being released.
Reported-by: Ruan Bonan <bonan.ruan@u.nus.edu>
Fixes: 699c23f02c ("bpf: Add bpf_link support for sk_msg and sk_skb progs")
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20241026185522.338562-1-xiyou.wangcong@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
- Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF
sockmap link file descriptors (Hou Tao)
- Fix BPF arm64 JIT's address emission with tag-based KASAN
enabled reserving not enough size (Peter Collingbourne)
- Fix BPF verifier do_misc_fixups patching for inlining of the
bpf_get_branch_snapshot BPF helper (Andrii Nakryiko)
- Fix a BPF verifier bug and reject BPF program write attempts
into read-only marked BPF maps (Daniel Borkmann)
- Fix perf_event_detach_bpf_prog error handling by removing an
invalid check which would skip BPF program release (Jiri Olsa)
- Fix memory leak when parsing mount options for the BPF
filesystem (Hou Tao)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZxrAzxUcZGFuaWVsQGlv
Z2VhcmJveC5uZXQACgkQ2yufC7HISIPcHwD8DnBSPlHX9OezMWCm8mjVx2Fd26W9
/IaiW2tyOPtoSGIA/3hfgfLrxkb3Raoh0miQB2+FRrz9e+y7i8c4Q91mcUgJ
=Hvht
-----END PGP SIGNATURE-----
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Daniel Borkmann:
- Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF sockmap
link file descriptors (Hou Tao)
- Fix BPF arm64 JIT's address emission with tag-based KASAN enabled
reserving not enough size (Peter Collingbourne)
- Fix BPF verifier do_misc_fixups patching for inlining of the
bpf_get_branch_snapshot BPF helper (Andrii Nakryiko)
- Fix a BPF verifier bug and reject BPF program write attempts into
read-only marked BPF maps (Daniel Borkmann)
- Fix perf_event_detach_bpf_prog error handling by removing an invalid
check which would skip BPF program release (Jiri Olsa)
- Fix memory leak when parsing mount options for the BPF filesystem
(Hou Tao)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Check validity of link->type in bpf_link_show_fdinfo()
bpf: Add the missing BPF_LINK_TYPE invocation for sockmap
bpf: fix do_misc_fixups() for bpf_get_branch_snapshot()
bpf,perf: Fix perf_event_detach_bpf_prog error handling
selftests/bpf: Add test for passing in uninit mtu_len
selftests/bpf: Add test for writes to .rodata
bpf: Remove MEM_UNINIT from skb/xdp MTU helpers
bpf: Fix overloading of MEM_UNINIT's meaning
bpf: Add MEM_WRITE attribute
bpf: Preserve param->string when parsing mount options
bpf, arm64: Fix address emission with tag-based KASAN enabled
Current release - regressions:
- posix-clock: Fix unbalanced locking in pc_clock_settime()
- netfilter: fix typo causing some targets not to load on IPv6
Current release - new code bugs:
- xfrm: policy: remove last remnants of pernet inexact list
Previous releases - regressions:
- core: fix races in netdev_tx_sent_queue()/dev_watchdog()
- bluetooth: fix UAF on sco_sock_timeout
- eth: hv_netvsc: fix VF namespace also in synthetic NIC NETDEV_REGISTER event
- eth: usbnet: fix name regression
- eth: be2net: fix potential memory leak in be_xmit()
- eth: plip: fix transmit path breakage
Previous releases - always broken:
- sched: deny mismatched skip_sw/skip_hw flags for actions created by classifiers
- netfilter: bpf: must hold reference on net namespace
- eth: virtio_net: fix integer overflow in stats
- eth: bnxt_en: replace ptp_lock with irqsave variant
- eth: octeon_ep: add SKB allocation failures handling in __octep_oq_process_rx()
Misc:
- MAINTAINERS: add Simon as an official reviewer
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmcaTkUSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkW8kP/iYfaxQ8zR61wUU7bOcVUSnEADR9XQ1H
Nta5Z0tDJprZv254XW3hYDzU0Iy3OgclRE1oewF5fQVLn6Sfg4U5awxRTNdJw7KV
wj62ziAv/xht2W/4nBsNfYkOZaDAibItbKtxlkOhgCGXSrXBoS22IonKRqEv2HLV
Gu0vAY/VI9YNvB5Z6SEKFmQp2bWfX79AChVT72shLBLakOCUHBavk/DOU56XH1Ci
IRmU5Lt8ysXWxCTF91rPCAbMyuxBbIv6phIKPV2ALpRUd6ha5nBqcl0wcS7Y1E+/
0XOV71zjcXFoE/6hc5W3/mC7jm+ipXKVJOnIkCcWq40p6kDVJJ+E1RWEr5JxGEyF
FtnUCZ8iK/F3/jSalMras2z+AZ/CGtfHF9wAS3YfMGtOJJb/k4dCxAddp7UzD9O4
yxAJhJ0DrVuplzwovL5owoJJXeRAMQeFydzHBYun5P8Sc9TtvviICi19fMgKGn4O
eUQhjgZZY371sPnTDLDEw1Oqzs9qeaeV3S2dSeFJ98PQuPA5KVOf/R2/CptBIMi5
+UNcqeXrlUeYSBW94pPioEVStZDrzax5RVKh/Jo1tTnKzbnWDOOKZqSVsGPMWXdO
0aBlGuSsNe36VDg2C0QMxGk7+gXbKmk9U4+qVQH3KMpB8uqdAu5deMbTT6dfcwBV
O/BaGiqoR4ak
=dR3Q
-----END PGP SIGNATURE-----
Merge tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfiler, xfrm and bluetooth.
Oddly this includes a fix for a posix clock regression; in our
previous PR we included a change there as a pre-requisite for
networking one. That fix proved to be buggy and requires the follow-up
included here. Thomas suggested we should send it, given we sent the
buggy patch.
Current release - regressions:
- posix-clock: Fix unbalanced locking in pc_clock_settime()
- netfilter: fix typo causing some targets not to load on IPv6
Current release - new code bugs:
- xfrm: policy: remove last remnants of pernet inexact list
Previous releases - regressions:
- core: fix races in netdev_tx_sent_queue()/dev_watchdog()
- bluetooth: fix UAF on sco_sock_timeout
- eth: hv_netvsc: fix VF namespace also in synthetic NIC
NETDEV_REGISTER event
- eth: usbnet: fix name regression
- eth: be2net: fix potential memory leak in be_xmit()
- eth: plip: fix transmit path breakage
Previous releases - always broken:
- sched: deny mismatched skip_sw/skip_hw flags for actions created by
classifiers
- netfilter: bpf: must hold reference on net namespace
- eth: virtio_net: fix integer overflow in stats
- eth: bnxt_en: replace ptp_lock with irqsave variant
- eth: octeon_ep: add SKB allocation failures handling in
__octep_oq_process_rx()
Misc:
- MAINTAINERS: add Simon as an official reviewer"
* tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits)
net: dsa: mv88e6xxx: support 4000ps cycle counter period
net: dsa: mv88e6xxx: read cycle counter period from hardware
net: dsa: mv88e6xxx: group cycle counter coefficients
net: usb: qmi_wwan: add Fibocom FG132 0x0112 composition
hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event
net: dsa: microchip: disable EEE for KSZ879x/KSZ877x/KSZ876x
Bluetooth: ISO: Fix UAF on iso_sock_timeout
Bluetooth: SCO: Fix UAF on sco_sock_timeout
Bluetooth: hci_core: Disable works on hci_unregister_dev
posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime()
r8169: avoid unsolicited interrupts
net: sched: use RCU read-side critical section in taprio_dump()
net: sched: fix use-after-free in taprio_change()
net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers
net: usb: usbnet: fix name regression
mlxsw: spectrum_router: fix xa_store() error checking
virtio_net: fix integer overflow in stats
net: fix races in netdev_tx_sent_queue()/dev_watchdog()
net: wwan: fix global oob in wwan_rtnl_policy
netfilter: xtables: fix typo causing some targets not to load on IPv6
...
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmcZBXwWHGNoZW5odWFj
YWlAa2VybmVsLm9yZwAKCRAChivD8uImeqN/EACSkIXn6754wSH4cDRWlZQcxvRa
05gaqDzAXU8IZ+O0/+I3JQHezPxBNoMsr+J59KIwlYxCGQQ6xyjaLK3+fC3yf3jW
ITzBLGLufzv88LbL9LFW2ag5bSl9+Wi/KMxBd9oKQm0e0z31vZ0Ofj8WjHCMvNdH
SA9gGFFZ7e6B0RsvvoTCR6n9bjbuMKQ28XU6wJnboNIARvbzGVdD2BtBfPPLH66i
VeWjM3o6DUyikwyoCfhco8ovzNBrt4Z4y+W6sHZb66zKccYGKPdeFxrDFuks/oTw
uVwiaEJhwEv76E5z8Jp7c7uWWsflObXZ5hWkHBMGn2etlHO6csTfOV1GLrciHM+p
uI62eUEOSLt7r5doFqkv3f9jsSKHiXM9Yq7osqG1gg0pSFnSVrtq8B/coTrwx6K7
EWkX8vpuWVmYVe9/LM17wV+AL7rQfTW51Ypmqz37Sl+GyuD2HOjqYG8g0KDNBFa0
lnKVvMYaewnQyHgpQino6Grue3H3cb8+jgAxNYJNJ5+vgbQYFhw2sYWty9DJOREC
klOlGz/OvJWLOTCrQr/RqDVyR1SDEseITglgc2Ng/IkGuiQEx83oOQm7B3aajdPi
NhNFtKz/YyiJasRpock6UeYCpM2rOfXq46t52HLTHGBlGd3ZlF4FfL/2z1FuJZOX
B5DoUhOz3QZyXXguMg==
=83tm
-----END PGP SIGNATURE-----
Merge tag 'loongarch-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Get correct cores_per_package for SMT systems, enable IRQ if do_ale()
triggered in irq-enabled context, and fix some bugs about vDSO, memory
managenent, hrtimer in KVM, etc"
* tag 'loongarch-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Mark hrtimer to expire in hard interrupt context
LoongArch: Make KASAN usable for variable cpu_vabits
LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel space
LoongArch: Don't crash in stack_top() for tasks without vDSO
LoongArch: Set correct size for vDSO code mapping
LoongArch: Enable IRQ if do_ale() triggered in irq-enabled context
LoongArch: Get correct cores_per_package for SMT systems
LoongArch: Use "Exception return address" to comment ERA
- objpool: Fix choosing allocation for percpu slots
Fixes to allocate objpool's percpu slots correctly according to the
GFP flag. It checks whether "any bit" in GFP_ATOMIC is set to choose
the vmalloc source, but it should check "all bits" in GFP_ATOMIC flag
is set, because GFP_ATOMIC is a combined flag.
- tracing/probes: Fix MAX_TRACE_ARGS limit handling
If more than MAX_TRACE_ARGS are passed for creating a probe event, the
entries over MAX_TRACE_ARG in trace_arg array are not initialized.
Thus if the kernel accesses those entries, it crashes. This rejects
creating event if the number of arguments is over MAX_TRACE_ARGS.
- tracing: Consider the NULL character when validating the event length
A strlen() is used when parsing the event name, and the original code
does not consider the terminal null byte. Thus it can pass the name
1 byte longer than the buffer. This fixes to check it correctly.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmcZBJ0ACgkQ2/sHvwUr
Pxu4qAgAm+mIiCaBGyolsT1oB5EF+9gztbwRtcAOY1811RJZ0XiQPuOwtZfijpBr
1Pl+SjubRKhLg+lLHEuCQHxkqlTSp+zrjkF+A0hFlB38nJ5P3pIw+b5pM5FCvhY+
w0tBTwkjiRBS9h1z88c74ciKYA/XR4apcMMUrPQZUCHq8P73Wu/Fo2lhnCVGBs6q
nYESyrTcOCDR0c6HP9D2GWxQFtbbCyAfotUjX37EIooTcl7ufAr8IPm8jBx7EzCa
WM841FwbuIgGbFCGYlG1/lOR+Qf7FszKAY5SBJMV/BiyFbxJqZfA5DWfJcrZ9YpW
pl86oKWyEkidwx8OIiB3Y1enPzUUJQ==
=8oUB
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.12-rc4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- objpool: Fix choosing allocation for percpu slots
Fixes to allocate objpool's percpu slots correctly according to the
GFP flag. It checks whether "any bit" in GFP_ATOMIC is set to choose
the vmalloc source, but it should check "all bits" in GFP_ATOMIC flag
is set, because GFP_ATOMIC is a combined flag.
- tracing/probes: Fix MAX_TRACE_ARGS limit handling
If more than MAX_TRACE_ARGS are passed for creating a probe event,
the entries over MAX_TRACE_ARG in trace_arg array are not
initialized. Thus if the kernel accesses those entries, it crashes.
This rejects creating event if the number of arguments is over
MAX_TRACE_ARGS.
- tracing: Consider the NUL character when validating the event length
A strlen() is used when parsing the event name, and the original code
does not consider the terminal null byte. Thus it can pass the name
one byte longer than the buffer. This fixes to check it correctly.
* tag 'probes-fixes-v6.12-rc4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Consider the NULL character when validating the event length
tracing/probes: Fix MAX_TRACE_ARGS limit handling
objpool: fix choosing allocation for percpu slots
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcZGqsACgkQxWXV+ddt
WDsZTg//dSLAAswT3dTAvWDt7rGxPhJC1V+gnzWdj/0Q3CUimNaw7zHp1QHtJDFT
S+3TVvJJ2JDvOnbi3u24s/bL5YdkWcvIyy87oVE4trJMbQPc/E45pkYFqF3TRQAH
wEYAVEnO9f9WUY+ekxX2XBzhmKp3xol93j9BBHDXOF9kHsDC5lI0D5YVYEhRY8Qu
D4GcqAhqUj5Hj6I6ppiqO47NCJBNzw1Se9QsgruPpmItRbB0/LYJFhUfevwosTKg
xVpkRVntFqQjkIdIdBfBv/ZGWfxJyM7K4M49QwLUqUQfxugu7BiGYuEjkBkiy07a
pZDEOF9s8wUZsxvRVohqKlhL0zHBF+/pAANowYuhKNW1sqKt4GVCdN3V34AbH8ST
JIbPvC2g1tUzIc2wckE8GO/NnsNR4r3k6iPB53MdCHrIo3jnENOeb9wF0GiVDb6s
OrhCa3ph2ps80YC1aCnc4Jr/yV2ONebSivnqvHCIUQEZfpjtAc07atm0i946/7nx
eiBE+9zSVJZB0LoooIVz2I3HX3lRnm3Wwi4nj8U/sNL/IbGaHNCurIETR23NivYP
yWVql2njwE+yMc8q9YZXs4MBdKsSGP6eGJW3ZwKi6ru2PJdf5eIib+ffvR4bLqXD
UUDfMyC1esGsQB24sc8wppk97wmmMrdqQUj9WnmNlTP8FUFggvQ=
=sYxX
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- mount option fixes:
- fix handling of compression mount options on remount
- reject rw remount in case there are options that don't work
in read-write mode (like rescue options)
- fix zone accounting of unusable space
- fix in-memory corruption when merging extent maps
- fix delalloc range locking for sector < page
- use more convenient default value of drop subtree threshold, clean
more subvolumes without the fallback to marking quotas inconsistent
- fix smatch warning about incorrect value passed to ERR_PTR
* tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
btrfs: reject ro->rw reconfiguration if there are hard ro requirements
btrfs: fix read corruption due to race with extent map merging
btrfs: fix the delalloc range locking if sector size < page size
btrfs: qgroup: set a more sane default value for subtree drop threshold
btrfs: clear force-compress on remount when compress mount option is given
btrfs: zoned: fix zone unusable accounting for freed reserved extent
Lots of hotfixes:
- transaction restart injection has been shaking out a few things
- fix a data corruption in the buffered write path on -ENOSPC, found by
xfstests generic/299
- Some small show_options fixes
- Repair mismatches in inode hash type, seed: different snapshot
versions of an inode must have the same hash/type seed, used for
directory entries and xattrs. We were checking the hash seed, but not
the type, and a user contributed a filesystem where the hash type on
one inode had somehow been flipped; these fixes allow his filesystem
to repair.
Additionally, the hash type flip made some directory entries
invisible, which were then recreated by userspace; so the hash check
code now checks for duplicate non dangling dirents, and renames one of
them if necessary.
- Don't use wait_event_interruptible() in recovery: this fixes some
filesystems failing to mount with -ERESTARTSYS
- Workaround for kvmalloc not supporting > INT_MAX allocations, causing
an -ENOMEM when allocating the sorted array of journal keys: this
allows a 75 TB filesystem to mount
- Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
compat path: this alllows Marcin's filesystem (in use since before
6.7) to repair and mount.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcX4vYACgkQE6szbY3K
bnbywxAArBfIJfshWq5Wk9WztenzUmyUmV2HIgntT/iN4ty4eIpZ26VSvHcGvgkU
j3wx+OuxMTPBGc3fjUS+gALf/BGcQEgh6oPZCV+6M3kasTzNzG2jYOCkLqKbpcO1
V5n/Le/SM1X2grkgTm/H+TulGHNgG9gJ2U4kjihroJrTbTesZhzcW/qlz6RWo7U1
02NvLop4WE9M6WaW9RzsHK2llRUAl2Z3oRMuwNz3IIijCpm98STGD4gyvGoMV2b8
qNsXjy7b2lkYObKI29yWF0caRzWK1LRz79afRlnNVSJb6DK1QB83ms5Qa8rprCU4
uOq0wsGWyg6lzwQ19X+2TvUYABopVk2HXLlzTO/lJrWeMTuYJVPZ7KZi3l6ubw5T
GIsAD5qMdCm8E5nXX8hG//0rOIl6QK288+zMQyRCvAkCL+iN2k0TU8qKAEEC44de
vj6ZyNqbuLR39LLz9K09ZhzIZGk09ELpxOJ2Wwwj4ZFriwphWDtFgBtBUpNo/KWA
inBfq2lZJsmNjfns9vCqOmNOStOJxXnyMOR25sTv7wM69QPGkl41dPY3oeuG8lRk
cU/qJQKlpTKJbFeXiEKWKDnMzWxOnovqLFC0tKu2qAYM6vAz+AtwTXgthVFGh21U
QoUDbsnQCCixMkS2AksCo7nivLrxmV/EeYm5pgeiU38VdA5ofBM=
=OpYN
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Lots of hotfixes:
- transaction restart injection has been shaking out a few things
- fix a data corruption in the buffered write path on -ENOSPC, found
by xfstests generic/299
- Some small show_options fixes
- Repair mismatches in inode hash type, seed: different snapshot
versions of an inode must have the same hash/type seed, used for
directory entries and xattrs. We were checking the hash seed, but
not the type, and a user contributed a filesystem where the hash
type on one inode had somehow been flipped; these fixes allow his
filesystem to repair.
Additionally, the hash type flip made some directory entries
invisible, which were then recreated by userspace; so the hash
check code now checks for duplicate non dangling dirents, and
renames one of them if necessary.
- Don't use wait_event_interruptible() in recovery: this fixes some
filesystems failing to mount with -ERESTARTSYS
- Workaround for kvmalloc not supporting > INT_MAX allocations,
causing an -ENOMEM when allocating the sorted array of journal
keys: this allows a 75 TB filesystem to mount
- Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
compat path: this alllows Marcin's filesystem (in use since before
6.7) to repair and mount"
* tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits)
bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path
bcachefs: Mark more errors as AUTOFIX
bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations
bcachefs: Don't use wait_event_interruptible() in recovery
bcachefs: Fix __bch2_fsck_err() warning
bcachefs: fsck: Improve hash_check_key()
bcachefs: bch2_hash_set_or_get_in_snapshot()
bcachefs: Repair mismatches in inode hash seed, type
bcachefs: Add hash seed, type to inode_to_text()
bcachefs: INODE_STR_HASH() for bch_inode_unpacked
bcachefs: Run in-kernel offline fsck without ratelimit errors
bcachefs: skip mount option handle for empty string.
bcachefs: fix incorrect show_options results
bcachefs: Fix data corruption on -ENOSPC in buffered write path
bcachefs: bch2_folio_reservation_get_partial() is now better behaved
bcachefs: fix disk reservation accounting in bch2_folio_reservation_get()
bcachefS: ec: fix data type on stripe deletion
bcachefs: Don't use commit_do() unnecessarily
bcachefs: handle restarts in bch2_bucket_io_time_reset()
bcachefs: fix restart handling in __bch2_resume_logged_op_finsert()
...
This reverts commit 1325e4a91a.
using multipage folios apparently break some madvise operations like
MADV_PAGEOUT which do not reliably unload the specified page anymore,
Revert the patch until that is figured out.
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Fixes: 1325e4a91a ("9p: Enable multipage folios")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hou Tao says:
====================
Add the missing BPF_LINK_TYPE invocation for sockmap
From: Hou Tao <houtao1@huawei.com>
Hi,
The tiny patch set fixes the out-of-bound read problem when reading the
fdinfo of sock map link fd. And in order to spot such omission early for
the newly-added link type in the future, it also checks the validity of
the link->type and adds a WARN_ONCE() for missed invocation.
Please see individual patches for more details. And comments are always
welcome.
v3:
* patch #2: check and warn the validity of link->type instead of
adding a static assertion for bpf_link_type_strs array.
v2: http://lore.kernel.org/bpf/d49fa2f4-f743-c763-7579-c3cab4dd88cb@huaweicloud.com
====================
Link: https://lore.kernel.org/r/20241024013558.1135167-1-houtao@huaweicloud.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing
bpf_link_type_strs[link->type] may result in an out-of-bounds access.
To spot such missed invocations early in the future, checking the
validity of link->type in bpf_link_show_fdinfo() and emitting a warning
when such invocations are missed.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com
There is an out-of-bounds read in bpf_link_show_fdinfo() for the sockmap
link fd. Fix it by adding the missing BPF_LINK_TYPE invocation for
sockmap link
Also add comments for bpf_link_type to prevent missing updates in the
future.
Fixes: 699c23f02c ("bpf: Add bpf_link support for sk_msg and sk_skb progs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241024013558.1135167-2-houtao@huaweicloud.com
Shenghao Yang says:
====================
net: dsa: mv88e6xxx: fix MV88E6393X PHC frequency on internal clock
The MV88E6393X family of switches can additionally run their cycle
counters using a 250MHz internal clock instead of the usual 125MHz
external clock [1].
The driver currently assumes all designs utilize that external clock,
but MikroTik's RB5009 uses the internal source - causing the PHC to be
seen running at 2x real time in userspace, making synchronization
with ptp4l impossible.
This series adds support for reading off the cycle counter frequency
known to the hardware in the TAI_CLOCK_PERIOD register and picking an
appropriate set of scaling coefficients instead of using a fixed set
for each switch family.
Patch 1 groups those cycle counter coefficients into a new structure to
make it easier to pass them around.
Patch 2 modifies PTP initialization to probe TAI_CLOCK_PERIOD and
use an appropriate set of coefficients.
Patch 3 adds support for 4000ps cycle counter periods.
Changes since v2 [2]:
- Patch 1: "net: dsa: mv88e6xxx: group cycle counter coefficients"
- Moved declaration of mv88e6xxx_cc_coeffs to avoid moving that in
Patch 2.
- Patch 2: "net: dsa: mv88e6xxx: read cycle counter period from hardware"
- Removed move of mv88e6xxx_cc_coeffs declaration.
- Patch 3: "net: dsa: mv88e6xxx: support 4000ps cycle counter periods"
- No change.
[1] https://lore.kernel.org/netdev/d6622575-bf1b-445a-b08f-2739e3642aae@lunn.ch/
[2] https://lore.kernel.org/netdev/20241006145951.719162-1-me@shenghaoyang.info/
====================
Link: https://patch.msgid.link/20241020063833.5425-1-me@shenghaoyang.info
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The MV88E6393X family of devices can run its cycle counter off
an internal 250MHz clock instead of an external 125MHz one.
Add support for this cycle counter period by adding another set
of coefficients and lowering the periodic cycle counter read interval
to compensate for faster overflows at the increased frequency.
Otherwise, the PHC runs at 2x real time in userspace and cannot be
synchronized.
Fixes: de776d0d31 ("net: dsa: mv88e6xxx: add support for mv88e6393x family")
Signed-off-by: Shenghao Yang <me@shenghaoyang.info>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of relying on a fixed mapping of hardware family to cycle
counter frequency, pull this information from the
MV88E6XXX_TAI_CLOCK_PERIOD register.
This lets us support switches whose cycle counter frequencies depend on
board design.
Fixes: de776d0d31 ("net: dsa: mv88e6xxx: add support for mv88e6393x family")
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Shenghao Yang <me@shenghaoyang.info>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of having them as individual fields in ptp_ops, wrap the
coefficients in a separate struct so they can be referenced together.
Fixes: de776d0d31 ("net: dsa: mv88e6xxx: add support for mv88e6393x family")
Signed-off-by: Shenghao Yang <me@shenghaoyang.info>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The existing code moves VF to the same namespace as the synthetic NIC
during netvsc_register_vf(). But, if the synthetic device is moved to a
new namespace after the VF registration, the VF won't be moved together.
To make the behavior more consistent, add a namespace check for synthetic
NIC's NETDEV_REGISTER event (generated during its move), and move the VF
if it is not in the same namespace.
Cc: stable@vger.kernel.org
Fixes: c0a41b887c ("hv_netvsc: move VF to same namespace as netvsc device")
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1729275922-17595-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The well-known errata regarding EEE not being functional on various KSZ
switches has been refactored a few times. Recently the refactoring has
excluded several switches that the errata should also apply to.
Disable EEE for additional switches with this errata and provide
additional comments referring to the public errata document.
The original workaround for the errata was applied with a register
write to manually disable the EEE feature in MMD 7:60 which was being
applied for KSZ9477/KSZ9897/KSZ9567 switch ID's.
Then came commit 26dd2974c5 ("net: phy: micrel: Move KSZ9477 errata
fixes to PHY driver") and commit 6068e6d7ba ("net: dsa: microchip:
remove KSZ9477 PHY errata handling") which moved the errata from the
switch driver to the PHY driver but only for PHY_ID_KSZ9477 (PHY ID)
however that PHY code was dead code because an entry was never added
for PHY_ID_KSZ9477 via MODULE_DEVICE_TABLE.
This was apparently realized much later and commit 54a4e5c163 ("net:
phy: micrel: add Microchip KSZ 9477 to the device table") added the
PHY_ID_KSZ9477 to the PHY driver but as the errata was only being
applied to PHY_ID_KSZ9477 it's not completely clear what switches
that relates to.
Later commit 6149db4997 ("net: phy: micrel: fix KSZ9477 PHY issues
after suspend/resume") breaks this again for all but KSZ9897 by only
applying the errata for that PHY ID.
Following that this was affected with commit 08c6d8bae48c("net: phy:
Provide Module 4 KSZ9477 errata (DS80000754C)") which removes
the blatant register write to MMD 7:60 and replaces it by
setting phydev->eee_broken_modes = -1 so that the generic phy-c45 code
disables EEE but this is only done for the KSZ9477_CHIP_ID (Switch ID).
Lastly commit 0411f73c13 ("net: dsa: microchip: disable EEE for
KSZ8567/KSZ9567/KSZ9896/KSZ9897.") adds some additional switches
that were missing to the errata due to the previous changes.
This commit adds an additional set of switches.
Fixes: 0411f73c13 ("net: dsa: microchip: disable EEE for KSZ8567/KSZ9567/KSZ9896/KSZ9897.")
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20241018160658.781564-1-tharvey@gateworks.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- hci_core: Disable works on hci_unregister_dev
- SCO: Fix UAF on sco_sock_timeout
- ISO: Fix UAF on iso_sock_timeout
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmcZB+0ZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKWcfD/4j4Q3W01y1UotsTKa1et7V
r6ydZfHBwU/ND8G8t7sfg4zJ6daYcYY6ERllj92JYCHdmrwILzt3WEgCAAC1Y8dj
r6RUXFapGSwE3YMIdDstcKgsfYV+zzKL+IBt0VRjhEMeoEEczzRGldk8Y8Z9ePhm
9Jyw1WDmid9DkIfIdrTKjnLH5HsjYZSvIJfkVuKKcfRshdolxfokHv9DpbPIbueO
ya33a/mnHVKnLyZPT72uP2DANkP218+zHI+dCjE6279LzrXJkQKLKThf/3bTGO+o
NohqGEj9NsIp8NqS/jr1yzvXOIqXkA5cG0Qrix+OyZuvBIohukTFyes1f2hcRHoh
l41s+IorviUhrtEPZ2ki/AWyXTVpWJl2EQ6XPf6iUexF2PDCgTzILN4WIELBTZse
cEWPVbMI+ZEq9FHX1P9Vfc+Yje4glTXcQzBSlfaljPmbW0CouxYCJ4kEj+5m4F6V
xBUevpIz3dRUMST4tXaOrhso/Th2zCDDbWwp6ImEZQ8xBM5wm8sDrjkZOtWEWRNc
miLEnkfqZxJmt6b8DfGzeM/p3FJ9i7OsCo6tUVEH4mGATViD8QOUaqMk/kxV5Wh0
ORB3cWk4VYTvXGuPceEH7u8yBslbUzmad/eMOK52ErP4UrORhESZYoaDy3ALcr7A
4z+9xBlOlezKTe5puR5ulw==
=gOTr
-----END PGP SIGNATURE-----
Merge tag 'for-net-2024-10-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_core: Disable works on hci_unregister_dev
- SCO: Fix UAF on sco_sock_timeout
- ISO: Fix UAF on iso_sock_timeout
* tag 'for-net-2024-10-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: ISO: Fix UAF on iso_sock_timeout
Bluetooth: SCO: Fix UAF on sco_sock_timeout
Bluetooth: hci_core: Disable works on hci_unregister_dev
====================
Link: https://patch.msgid.link/20241023143005.2297694-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmcXbBEACgkQrB3Eaf9P
W7e0hQ//XiBdyhArA8kYIgsCylrOr+y/uCErnIhzUTqo20uE3dMPvzQHwY1GIgiU
HYXKg49WLVxSuFtLRu32qCr0G+muU1UI5OL58IQuQ+TxKzj0hnV4BqAx+rNYhaFb
JxJhgAcQQu7VCL7/qgqGsQnhq/hhg29Rfqa1VTEZ4RthEMahPDbwyibjyOfqwSgm
fCPIl2FTkB7E0PZnwZJGxmaOJXS7g/djb+CPmBI6zxLQHG5VXY/UGNyObUvTLD9K
gV+N0u0ieyDTxpvpgh6HMAFSkORLS/PIUCAX0SZEW48+7DLbBeKMMYwegtxxJZ3D
3zaWi8uKGh5rjOslQbU4ZlpxJr7yvIV6RhGJhOPDYz5Es4EXHU7c0tZ/pma46eb0
2PJxQyTHW4O9fbybQvl0w9fUQlhjKMbv/TygJgpOIk9YUr2y8Yxc8yhmWi+669ly
e7PEi/33lqJI44gisu0BMresxJcPA3eFWje+Dzw/7N/tlLJzbWt3psRqB9u/JwVH
LD0YvXraZYvaRNzeGUfbXTrvmouhLcl15zAE8RFJBTgGJbpILviJ9NfUMOIO7Yor
BBKEWlylCm/4x5iOdVb17gFCi7uERiahbxNg3+hltAQuMvEdrhhWXp1N7esTRvkf
D1o0qR5C2k2jyc9LQNqfiGWDEOgTCt1DCdhpo2F/EtF5kSerp6s=
=Ai21
-----END PGP SIGNATURE-----
Merge tag 'ipsec-2024-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2024-10-22
1) Fix routing behavior that relies on L4 information
for xfrm encapsulated packets.
From Eyal Birger.
2) Remove leftovers of pernet policy_inexact lists.
From Florian Westphal.
3) Validate new SA's prefixlen when the selector family is
not set from userspace.
From Sabrina Dubroca.
4) Fix a kernel-infoleak when dumping an auth algorithm.
From Petr Vaganov.
Please pull or let me know if there are problems.
ipsec-2024-10-22
* tag 'ipsec-2024-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: fix one more kernel-infoleak in algo dumping
xfrm: validate new SA's prefixlen using SA family when sel.family is unset
xfrm: policy: remove last remnants of pernet inexact list
xfrm: respect ip protocols rules criteria when performing dst lookups
xfrm: extract dst lookup parameters into a struct
====================
Link: https://patch.msgid.link/20241022092226.654370-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We need `goto next_insn;` at the end of patching instead of `continue;`.
It currently works by accident by making verifier re-process patched
instructions.
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Fixes: 314a53623c ("bpf: inline bpf_get_branch_snapshot() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20241023161916.2896274-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Peter reported that perf_event_detach_bpf_prog might skip to release
the bpf program for -ENOENT error from bpf_prog_array_copy.
This can't happen because bpf program is stored in perf event and is
detached and released only when perf event is freed.
Let's drop the -ENOENT check and make sure the bpf program is released
in any case.
Fixes: 170a7e3ea0 ("bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not found")
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241023200352.3488610-1-jolsa@kernel.org
Closes: https://lore.kernel.org/lkml/20241022111638.GC16066@noisy.programming.kicks-ass.net/
conn->sk maybe have been unlinked/freed while waiting for iso_conn_lock
so this checks if the conn->sk is still valid by checking if it part of
iso_sk_list.
Fixes: ccf74f2390 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This make use of disable_work_* on hci_unregister_dev since the hci_dev is
about to be freed new submissions are not disarable.
Fixes: 0d151a1037 ("Bluetooth: hci_core: cancel all works upon hci_unregister_dev()")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Like commit 2c0d278f32 ("KVM: LAPIC: Mark hrtimer to expire in hard
interrupt context") and commit 9090825fa9 ("KVM: arm/arm64: Let the
timer expire in hardirq context on RT"), On PREEMPT_RT enabled kernels
unmarked hrtimers are moved into soft interrupt expiry mode by default.
Then the timers are canceled from an preempt-notifier which is invoked
with disabled preemption which is not allowed on PREEMPT_RT.
The timer callback is short so in could be invoked in hard-IRQ context.
So let the timer expire on hard-IRQ context even on -RT.
This fix a "scheduling while atomic" bug for PREEMPT_RT enabled kernels:
BUG: scheduling while atomic: qemu-system-loo/1011/0x00000002
Modules linked in: amdgpu rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ns
CPU: 1 UID: 0 PID: 1011 Comm: qemu-system-loo Tainted: G W 6.12.0-rc2+ #1774
Tainted: [W]=WARN
Hardware name: Loongson Loongson-3A5000-7A1000-1w-CRB/Loongson-LS3A5000-7A1000-1w-CRB, BIOS vUDK2018-LoongArch-V2.0.0-prebeta9 10/21/2022
Stack : ffffffffffffffff 0000000000000000 9000000004e3ea38 9000000116744000
90000001167475a0 0000000000000000 90000001167475a8 9000000005644830
90000000058dc000 90000000058dbff8 9000000116747420 0000000000000001
0000000000000001 6a613fc938313980 000000000790c000 90000001001c1140
00000000000003fe 0000000000000001 000000000000000d 0000000000000003
0000000000000030 00000000000003f3 000000000790c000 9000000116747830
90000000057ef000 0000000000000000 9000000005644830 0000000000000004
0000000000000000 90000000057f4b58 0000000000000001 9000000116747868
900000000451b600 9000000005644830 9000000003a13998 0000000010000020
00000000000000b0 0000000000000004 0000000000000000 0000000000071c1d
...
Call Trace:
[<9000000003a13998>] show_stack+0x38/0x180
[<9000000004e3ea34>] dump_stack_lvl+0x84/0xc0
[<9000000003a71708>] __schedule_bug+0x48/0x60
[<9000000004e45734>] __schedule+0x1114/0x1660
[<9000000004e46040>] schedule_rtlock+0x20/0x60
[<9000000004e4e330>] rtlock_slowlock_locked+0x3f0/0x10a0
[<9000000004e4f038>] rt_spin_lock+0x58/0x80
[<9000000003b02d68>] hrtimer_cancel_wait_running+0x68/0xc0
[<9000000003b02e30>] hrtimer_cancel+0x70/0x80
[<ffff80000235eb70>] kvm_restore_timer+0x50/0x1a0 [kvm]
[<ffff8000023616c8>] kvm_arch_vcpu_load+0x68/0x2a0 [kvm]
[<ffff80000234c2d4>] kvm_sched_in+0x34/0x60 [kvm]
[<9000000003a749a0>] finish_task_switch.isra.0+0x140/0x2e0
[<9000000004e44a70>] __schedule+0x450/0x1660
[<9000000004e45cb0>] schedule+0x30/0x180
[<ffff800002354c70>] kvm_vcpu_block+0x70/0x120 [kvm]
[<ffff800002354d80>] kvm_vcpu_halt+0x60/0x3e0 [kvm]
[<ffff80000235b194>] kvm_handle_gspr+0x3f4/0x4e0 [kvm]
[<ffff80000235f548>] kvm_handle_exit+0x1c8/0x260 [kvm]
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Currently, KASAN on LoongArch assume the CPU VA bits is 48, which is
true for Loongson-3 series, but not for Loongson-2 series (only 40 or
lower), this patch fix that issue and make KASAN usable for variable
cpu_vabits.
Solution is very simple: Just define XRANGE_SHADOW_SHIFT which means
valid address length from VA_BITS to min(cpu_vabits, VA_BITS).
Cc: stable@vger.kernel.org
Signed-off-by: Kanglong Wang <wangkanglong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
If get_clock_desc() succeeds, it calls fget() for the clockid's fd,
and get the clk->rwsem read lock, so the error path should release
the lock to make the lock balance and fput the clockid's fd to make
the refcount balance and release the fd related resource.
However the below commit left the error path locked behind resulting in
unbalanced locking. Check timespec64_valid_strict() before
get_clock_desc() to fix it, because the "ts" is not changed
after that.
Fixes: d8794ac20a ("posix-clock: Fix missing timespec64 check in pc_clock_settime()")
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
[pabeni@redhat.com: fixed commit message typo]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
It was reported that after resume from suspend a PCI error is logged
and connectivity is broken. Error message is:
PCI error (cmd = 0x0407, status_errs = 0x0000)
The message seems to be a red herring as none of the error bits is set,
and the PCI command register value also is normal. Exception handling
for a PCI error includes a chip reset what apparently brakes connectivity
here. The interrupt status bit triggering the PCI error handling isn't
actually used on PCIe chip versions, so it's not clear why this bit is
set by the chip. Fix this by ignoring this bit on PCIe chip versions.
Fixes: 0e4851502f ("r8169: merge with version 8.001.00 of Realtek's r8168 driver")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219388
Tested-by: Atlas Yu <atlas.yu@canonical.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/78e2f535-438f-4212-ad94-a77637ac6c9c@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In 'taprio_change()', 'admin' pointer may become dangling due to sched
switch / removal caused by 'advance_sched()', and critical section
protected by 'q->current_entry_lock' is too small to prevent from such
a scenario (which causes use-after-free detected by KASAN). Fix this
by prefer 'rcu_replace_pointer()' over 'rcu_assign_pointer()' to update
'admin' immediately before an attempt to schedule freeing.
Fixes: a3d43c0d56 ("taprio: Add support adding an admin schedule")
Reported-by: syzbot+b65e0af58423fc8a73aa@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://patch.msgid.link/20241018051339.418890-1-dmantipov@yandex.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
tcf_action_init() has logic for checking mismatches between action and
filter offload flags (skip_sw/skip_hw). AFAIU, this is intended to run
on the transition between the new tc_act_bind(flags) returning true (aka
now gets bound to classifier) and tc_act_bind(act->tcfa_flags) returning
false (aka action was not bound to classifier before). Otherwise, the
check is skipped.
For the case where an action is not standalone, but rather it was
created by a classifier and is bound to it, tcf_action_init() skips the
check entirely, and this means it allows mismatched flags to occur.
Taking the matchall classifier code path as an example (with mirred as
an action), the reason is the following:
1 | mall_change()
2 | -> mall_replace_hw_filter()
3 | -> tcf_exts_validate_ex()
4 | -> flags |= TCA_ACT_FLAGS_BIND;
5 | -> tcf_action_init()
6 | -> tcf_action_init_1()
7 | -> a_o->init()
8 | -> tcf_mirred_init()
9 | -> tcf_idr_create_from_flags()
10 | -> tcf_idr_create()
11 | -> p->tcfa_flags = flags;
12 | -> tc_act_bind(flags))
13 | -> tc_act_bind(act->tcfa_flags)
When invoked from tcf_exts_validate_ex() like matchall does (but other
classifiers validate their extensions as well), tcf_action_init() runs
in a call path where "flags" always contains TCA_ACT_FLAGS_BIND (set by
line 4). So line 12 is always true, and line 13 is always true as well.
No transition ever takes place, and the check is skipped.
The code was added in this form in commit c86e0209dc ("flow_offload:
validate flags of filter and actions"), but I'm attributing the blame
even earlier in that series, to when TCA_ACT_FLAGS_SKIP_HW and
TCA_ACT_FLAGS_SKIP_SW were added to the UAPI.
Following the development process of this change, the check did not
always exist in this form. A change took place between v3 [1] and v4 [2],
AFAIU due to review feedback that it doesn't make sense for action flags
to be different than classifier flags. I think I agree with that
feedback, but it was translated into code that omits enforcing this for
"classic" actions created at the same time with the filters themselves.
There are 3 more important cases to discuss. First there is this command:
$ tc qdisc add dev eth0 clasct
$ tc filter add dev eth0 ingress matchall skip_sw \
action mirred ingress mirror dev eth1
which should be allowed, because prior to the concept of dedicated
action flags, it used to work and it used to mean the action inherited
the skip_sw/skip_hw flags from the classifier. It's not a mismatch.
Then we have this command:
$ tc qdisc add dev eth0 clasct
$ tc filter add dev eth0 ingress matchall skip_sw \
action mirred ingress mirror dev eth1 skip_hw
where there is a mismatch and it should be rejected.
Finally, we have:
$ tc qdisc add dev eth0 clasct
$ tc filter add dev eth0 ingress matchall skip_sw \
action mirred ingress mirror dev eth1 skip_sw
where the offload flags coincide, and this should be treated the same as
the first command based on inheritance, and accepted.
[1]: https://lore.kernel.org/netdev/20211028110646.13791-9-simon.horman@corigine.com/
[2]: https://lore.kernel.org/netdev/20211118130805.23897-10-simon.horman@corigine.com/
Fixes: 7adc576512 ("flow_offload: add skip_hw and skip_sw to control if offload the action")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20241017161049.3570037-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
strlen() returns a string length excluding the null byte. If the string
length equals to the maximum buffer length, the buffer will have no
space for the NULL terminating character.
This commit checks this condition and returns failure for it.
Link: https://lore.kernel.org/all/20241007144724.920954-1-leo.yan@arm.com/
Fixes: dec65d79fd ("tracing/probe: Check event name length correctly")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When creating a trace_probe we would set nr_args prior to truncating the
arguments to MAX_TRACE_ARGS. However, we would only initialize arguments
up to the limit.
This caused invalid memory access when attempting to set up probes with
more than 128 fetchargs.
BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:__set_print_fmt+0x134/0x330
Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return
an error when there are too many arguments instead of silently
truncating.
Link: https://lore.kernel.org/all/20240930202656.292869-1-mikel@mikelr.com/
Fixes: 035ba76014 ("tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init")
Signed-off-by: Mikel Rychliski <mikel@mikelr.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
We can now undo parts of 4b3786a6c5 ("bpf: Zero former ARG_PTR_TO_{LONG,INT}
args in case of error") as discussed in [0].
Given the BPF helpers now have MEM_WRITE tag, the MEM_UNINIT can be cleared.
The mtu_len is an input as well as output argument, meaning, the BPF program
has to set it to something. It cannot be uninitialized. Therefore, allowing
uninitialized memory and zeroing it on error would be odd. It was done as
an interim step in 4b3786a6c5 as the desired behavior could not have been
expressed before the introduction of MEM_WRITE tag.
Fixes: 4b3786a6c5 ("bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/a86eb76d-f52f-dee4-e5d2-87e45de3e16f@iogearbox.net [0]
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lonial reported an issue in the BPF verifier where check_mem_size_reg()
has the following code:
if (!tnum_is_const(reg->var_off))
/* For unprivileged variable accesses, disable raw
* mode so that the program is required to
* initialize all the memory that the helper could
* just partially fill up.
*/
meta = NULL;
This means that writes are not checked when the register containing the
size of the passed buffer has not a fixed size. Through this bug, a BPF
program can write to a map which is marked as read-only, for example,
.rodata global maps.
The problem is that MEM_UNINIT's initial meaning that "the passed buffer
to the BPF helper does not need to be initialized" which was added back
in commit 435faee1aa ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type")
got overloaded over time with "the passed buffer is being written to".
The problem however is that checks such as the above which were added later
via 06c1c04972 ("bpf: allow helpers access to variable memory") set meta
to NULL in order force the user to always initialize the passed buffer to
the helper. Due to the current double meaning of MEM_UNINIT, this bypasses
verifier write checks to the memory (not boundary checks though) and only
assumes the latter memory is read instead.
Fix this by reverting MEM_UNINIT back to its original meaning, and having
MEM_WRITE as an annotation to BPF helpers in order to then trigger the
BPF verifier checks for writing to memory.
Some notes: check_arg_pair_ok() ensures that for ARG_CONST_SIZE{,_OR_ZERO}
we can access fn->arg_type[arg - 1] since it must contain a preceding
ARG_PTR_TO_MEM. For check_mem_reg() the meta argument can be removed
altogether since we do check both BPF_READ and BPF_WRITE. Same for the
equivalent check_kfunc_mem_size_reg().
Fixes: 7b3552d3f9 ("bpf: Reject writes for PTR_TO_MAP_KEY in check_helper_mem_access")
Fixes: 97e6d7dab1 ("bpf: Check PTR_TO_MEM | MEM_RDONLY in check_helper_mem_access")
Fixes: 15baa55ff5 ("bpf/verifier: allow all functions to read user provided context")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a MEM_WRITE attribute for BPF helper functions which can be used in
bpf_func_proto to annotate an argument type in order to let the verifier
know that the helper writes into the memory passed as an argument. In
the past MEM_UNINIT has been (ab)used for this function, but the latter
merely tells the verifier that the passed memory can be uninitialized.
There have been bugs with overloading the latter but aside from that
there are also cases where the passed memory is read + written which
currently cannot be expressed, see also 4b3786a6c5 ("bpf: Zero former
ARG_PTR_TO_{LONG,INT} args in case of error").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In bpf_parse_param(), keep the value of param->string intact so it can
be freed later. Otherwise, the kmalloc area pointed to by param->string
will be leaked as shown below:
unreferenced object 0xffff888118c46d20 (size 8):
comm "new_name", pid 12109, jiffies 4295580214
hex dump (first 8 bytes):
61 6e 79 00 38 c9 5c 7e any.8.\~
backtrace (crc e1b7f876):
[<00000000c6848ac7>] kmemleak_alloc+0x4b/0x80
[<00000000de9f7d00>] __kmalloc_node_track_caller_noprof+0x36e/0x4a0
[<000000003e29b886>] memdup_user+0x32/0xa0
[<0000000007248326>] strndup_user+0x46/0x60
[<0000000035b3dd29>] __x64_sys_fsconfig+0x368/0x3d0
[<0000000018657927>] x64_sys_call+0xff/0x9f0
[<00000000c0cabc95>] do_syscall_64+0x3b/0xc0
[<000000002f331597>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: 6c1752e0b6 ("bpf: Support symbolic BPF FS delegation mount options")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20241022130133.3798232-1-houtao@huaweicloud.com
MAXAG is a legitimate value for bmp->db_numag
Fixes: e63866a475 ("jfs: fix out-of-bounds in dbNextAG() and diAlloc()")
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
The ret may be zero in btrfs_search_dir_index_item() and should not
passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to
this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0.
This fixes smatch warnings:
fs/btrfs/dir-item.c:353
btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR'
Fixes: 9dcbe16fcc ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reports the following crash:
BTRFS info (device loop0 state MCS): disabling free space tree
BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline]
RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041
Call Trace:
<TASK>
btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530
btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312
btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012
btrfs_remount_rw fs/btrfs/super.c:1309 [inline]
btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534
btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline]
btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline]
btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115
vfs_get_tree+0x90/0x2b0 fs/super.c:1800
do_new_mount+0x2be/0xb40 fs/namespace.c:3472
do_mount fs/namespace.c:3812 [inline]
__do_sys_mount fs/namespace.c:4020 [inline]
__se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
[CAUSE]
To support mounting different subvolume with different RO/RW flags for
the new mount APIs, btrfs introduced two workaround to support this feature:
- Skip mount option/feature checks if we are mounting a different
subvolume
- Reconfigure the fs to RW if the initial mount is RO
Combining these two, we can have the following sequence:
- Mount the fs ro,rescue=all,clear_cache,space_cache=v1
rescue=all will mark the fs as hard read-only, so no v2 cache clearing
will happen.
- Mount a subvolume rw of the same fs.
We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY
because our new fc is RW, different from the original fs.
Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag
first so that we can grab the existing fs_info.
Then we reconfigure the fs to RW.
- During reconfiguration, option/features check is skipped
This means we will restart the v2 cache clearing, and convert back to
v1 cache.
This will trigger fs writes, and since the original fs has "rescue=all"
option, it skips the csum tree read.
And eventually causing NULL pointer dereference in super block
writeback.
[FIX]
For reconfiguration caused by different subvolume RO/RW flags, ensure we
always run btrfs_check_options() to ensure we have proper hard RO
requirements met.
In fact the function btrfs_check_options() doesn't really do many
complex checks, but hard RO requirement and some feature dependency
checks, thus there is no special reason not to do the check for mount
reconfiguration.
Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In debugging some corrupt squashfs files, we observed symptoms of
corrupt page cache pages but correct on-disk contents. Further
investigation revealed that the exact symptom was a correct page
followed by an incorrect, duplicate, page. This got us thinking about
extent maps.
commit ac05ca913e ("Btrfs: fix race between using extent maps and merging them")
enforces a reference count on the primary `em` extent_map being merged,
as that one gets modified.
However, since,
commit 3d2ac99224 ("btrfs: introduce new members for extent_map")
both 'em' and 'merge' get modified, which started modifying 'merge'
and thus introduced the same race.
We were able to reproduce this by looping the affected squashfs workload
in parallel on a bunch of separate btrfs-es while also dropping caches.
We are still working on a simple enough reproducer to make into an fstest.
The simplest fix is to stop modifying 'merge', which is not essential,
as it is dropped immediately after the merge. This behavior is simply
a consequence of the order of the two extent maps being important in
computing the new values. Modify merge_ondisk_extents to take prev and
next by const* and also take a third merged parameter that it puts the
results in. Note that this introduces the rather odd behavior of passing
'em' to merge_ondisk_extents as a const * and as a regular ptr.
Fixes: 3d2ac99224 ("btrfs: introduce new members for extent_map")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:
- Set the writer locks without checking if the folio is still valid
We call btrfs_folio_start_writer_lock() just like it's folio_lock().
But since the folio may not even be the folio of the current mapping,
we can easily screw up the folio->private.
- The range is not clamped inside the page
This means we can over write other bitmaps if the start/len is not
properly handled, and trigger the btrfs_subpage_assert().
- @processed_end is always rounded up to page end
If the delalloc range is not page aligned, and we need to retry
(returning -EAGAIN), then we will unlock to the page end.
Thankfully this is not a huge problem, as now
btrfs_folio_end_writer_lock() can handle range larger than the locked
range, and only unlock what is already locked.
Fix all these problems by:
- Lock and check the folio first, then call
btrfs_folio_set_writer_lock()
So that if we got a folio not belonging to the inode, we won't
touch folio->private.
- Properly truncate the range inside the page
- Update @processed_end to the locked range end
Fixes: 1e1de38792 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 011b46c304 ("btrfs: skip subtree scan if it's too high to
avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can
automatically skip large subtree scan at the cost of marking qgroup
inconsistent.
It's designed to address the final performance problem of snapshot drop
with qgroup enabled, but to be safe the default value is
BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value
to make it work.
I'd say it's not a good idea to rely on user space tool to set this
default value, especially when some operations (snapshot dropping) can
be triggered immediately after mount, leaving a very small window to
that that sysfs interface.
So instead of disabling this new feature by default, enable it with a
low threshold (3), so that large subvolume tree drop at mount time won't
cause huge qgroup workload.
CC: stable@vger.kernel.org # 6.1
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the migration to use fs context for processing mount options we had
a slight change in the semantics for remounting a filesystem that was
mounted with compress-force. Before we could clear compress-force by
passing only "-o compress[=algo]" during a remount, but after that change
that does not work anymore, force-compress is still present and one needs
to pass "-o compress-force=no,compress[=algo]" to the mount command.
Example, when running on a kernel 6.8+:
$ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)
$ mount -o remount,compress=zlib:5 /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)
On a 6.7 kernel (or older):
$ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)
$ mount -o remount,compress=zlib:5 /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)
So update btrfs_parse_param() to clear "compress-force" when "compress" is
given, providing the same semantics as kernel 6.7 and older.
Reported-by: Roman Mamedov <rm@romanrm.net>
Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/
CC: stable@vger.kernel.org # 6.8+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>