1
linux/mm
Hugh Dickins f8f931bba0 mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without).  The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.

Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does.  But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).

Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all
of whose callers now call it beforehand (and if any forget then bad_page()
will tell) - except for its caller put_pages_list(), which itself no
longer has any callers (and will be deleted separately).

Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list.  __remove_mapping() has frozen refcount to 0
here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.

That goes back to 5.4 commit 87eaceb3fa ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue, but no check on deferred queue before adding THP to
swapcache.  That worked fine with the usual sequence of events in reclaim
(though there were a couple of rare ways in which a THP on deferred queue
could have been swapped out), but 6.12 commit dafff3f4c8 ("mm: split
underused THPs") avoids splitting underused THPs in reclaim, which makes
swapcache THPs on deferred queue commonplace.

Keep the check on swapcache before adding to deferred queue?  Yes: it is
no longer essential, but preserves the existing behaviour, and is likely
to be a worthwhile optimization (vmstat showed much more traffic on the
queue under swapping load if the check was removed); update its comment.

Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout. 
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
of course unqueues), or skip it if that fails.  Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.

The 87eaceb3fa commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b55 ("mm: thp: don't need care
deferred split queue in memcg charge move path"), which argued that the
existence of a PMD mapping guarantees that the THP cannot be on a deferred
list.  As above, false in rare cases, and now commonly false.

Backport to 6.11 should be straightforward.  Earlier backports must take
care that other _deferred_list fixes and dependencies are included.  There
is not a strong case for backports, but they can fix cornercases.

Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com
Fixes: 87eaceb3fa ("mm: thp: make deferred split shrinker memcg aware")
Fixes: dafff3f4c8 ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:49:54 -08:00
..
damon mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets() 2024-10-17 00:28:08 -07:00
kasan kasan: remove vmalloc_percpu test 2024-10-30 20:14:11 -07:00
kfence mm: kfence: fix elapsed time for allocated/freed track 2024-09-26 14:01:44 -07:00
kmsan kmsan: do not pass NULL pointers as 0 2024-07-03 19:30:26 -07:00
backing-dev.c
balloon_compaction.c mm: remove MIGRATE_SYNC_NO_COPY mode 2024-07-03 19:30:00 -07:00
bootmem_info.c
cma_debug.c
cma_sysfs.c
cma.c mm/cma: add cma_{alloc,free}_folio() 2024-09-03 21:15:36 -07:00
cma.h
compaction.c mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath() 2024-09-03 21:15:47 -07:00
debug_page_alloc.c
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries 2024-09-17 01:07:01 -07:00
debug.c mm: support only one page_type per page 2024-09-03 21:15:43 -07:00
dmapool_test.c mm/dmapool: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
dmapool.c
early_ioremap.c
execmem.c
fadvise.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
fail_page_alloc.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
failslab.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
filemap.c mm/filemap: fix filemap_get_folios_contig THP panic 2024-09-26 14:01:43 -07:00
folio-compat.c mm: remove putback_lru_page() 2024-09-09 16:38:59 -07:00
gup_test.c
gup_test.h
gup.c mm/gup: stop leaking pinned pages in low memory conditions 2024-10-30 20:14:10 -07:00
highmem.c mm/highmem: make nr_free_highpages() return "unsigned long" 2024-07-03 19:30:06 -07:00
hmm.c mm: provide mm_struct and address to huge_ptep_get() 2024-07-12 15:52:15 -07:00
huge_memory.c mm/thp: fix deferred split unqueue naming and locking 2024-11-05 16:49:54 -08:00
hugetlb_cgroup.c mm: memcg: don't call propagate_protected_usage() needlessly 2024-09-01 20:25:50 -07:00
hugetlb_vmemmap.c mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO 2024-09-01 20:25:45 -07:00
hugetlb_vmemmap.h
hugetlb.c mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak 2024-09-26 14:01:43 -07:00
hwpoison-inject.c mm/hwpoison: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
init-mm.c
internal.h mm/thp: fix deferred split unqueue naming and locking 2024-11-05 16:49:54 -08:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig resource: remove dependency on SPARSEMEM from GET_FREE_REGION 2024-10-28 21:40:39 -07:00
Kconfig.debug slub: Introduce CONFIG_SLUB_RCU_DEBUG 2024-08-27 14:12:51 +02:00
khugepaged.c mm: khugepaged: fix the incorrect statistics when collapsing large file folios 2024-10-17 00:28:10 -07:00
kmemleak.c mm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space 2024-09-03 21:15:38 -07:00
ksm.c mm: remove PageSwapCache 2024-09-03 21:15:44 -07:00
list_lru.c mm: list_lru: fix UAF for memory cgroup 2024-08-07 18:33:56 -07:00
maccess.c
madvise.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
Makefile mm: introduce numa_emulation 2024-09-03 21:15:31 -07:00
mapping_dirty_helpers.c
memblock.c memblock: updates for 6.12-rc1 2024-09-25 11:35:19 -07:00
memcontrol-v1.c mm/thp: fix deferred split unqueue naming and locking 2024-11-05 16:49:54 -08:00
memcontrol-v1.h memcg: cleanup with !CONFIG_MEMCG_V1 2024-09-17 01:07:00 -07:00
memcontrol.c mm/thp: fix deferred split unqueue naming and locking 2024-11-05 16:49:54 -08:00
memfd.c mm/hugetlb: simplify refs in memfd_alloc_folio 2024-09-26 14:01:44 -07:00
memory_hotplug.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
memory-failure.c mm: migrate: add isolate_folio_to_list() 2024-09-03 21:15:59 -07:00
memory-tiers.c memory tiers: use default_dram_perf_ref_source in log message 2024-09-26 14:01:44 -07:00
memory.c mm: avoid unconditional one-tick sleep when swapcache_prepare fails 2024-10-28 21:40:41 -07:00
mempolicy.c mm,memcg: provide per-cgroup counters for NUMA balancing operations 2024-09-03 21:15:36 -07:00
mempool.c
memremap.c
memtest.c
migrate_device.c mm: remap unused subpages to shared zeropage when splitting isolated thp 2024-09-09 16:39:03 -07:00
migrate.c mm/thp: fix deferred split unqueue naming and locking 2024-11-05 16:49:54 -08:00
mincore.c mm: provide mm_struct and address to huge_ptep_get() 2024-07-12 15:52:15 -07:00
mlock.c Random number generator updates for Linux 6.11-rc1. 2024-07-24 10:29:50 -07:00
mm_init.c mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION 2024-09-03 21:15:28 -07:00
mm_slot.h
mmap_lock.c mm: mmap_lock: replace get_memcg_path_buf() with on-stack buffer 2024-07-03 19:30:26 -07:00
mmap.c mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes 2024-10-31 20:27:04 -07:00
mmu_gather.c
mmu_notifier.c mm: move internal core VMA manipulation functions to own file 2024-09-01 20:25:54 -07:00
mmzone.c mm: improve code consistency with zonelist_* helper functions 2024-09-01 20:25:55 -07:00
mprotect.c mm/mprotect: replace can_modify_mm with can_modify_vma 2024-09-03 21:15:41 -07:00
mremap.c mm/mremap: fix move_normal_pmd/retract_page_tables race 2024-10-17 00:28:07 -07:00
mseal.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
msync.c
nommu.c mm: remove follow_page() 2024-09-01 20:26:01 -07:00
numa_emulation.c mm: introduce numa_emulation 2024-09-03 21:15:31 -07:00
numa_memblks.c mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id 2024-10-28 21:40:40 -07:00
numa.c mm: make range-to-target_node lookup facility a part of numa_memblks 2024-09-03 21:15:32 -07:00
oom_kill.c
page_alloc.c mm/thp: fix deferred split unqueue naming and locking 2024-11-05 16:49:54 -08:00
page_counter.c mm, memcg: cg2 memory{.swap,}.peak write handlers 2024-09-01 20:25:53 -07:00
page_ext.c mm: don't account memmap per-node 2024-08-15 22:16:14 -07:00
page_idle.c
page_io.c mm: fix PSWPIN counter for large folios swap-in 2024-10-30 20:14:11 -07:00
page_isolation.c mm: remove migration for HugePage in isolate_single_pageblock() 2024-09-03 21:15:40 -07:00
page_owner.c
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c mm/page_table_check: fix crash on ZONE_DEVICE 2024-06-15 10:43:04 -07:00
page_vma_mapped.c
page-writeback.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
pagewalk.c mm/pagewalk: fix usage of pmd_leaf()/pud_leaf() without present check 2024-10-28 21:40:38 -07:00
percpu-internal.h mm: remove CONFIG_MEMCG_KMEM 2024-07-10 12:14:54 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c percpu: remove pcpu_alloc_size() 2024-09-01 20:26:04 -07:00
pgalloc-track.h
pgtable-generic.c
process_vm_access.c
ptdump.c
readahead.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
rmap.c mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify() 2024-11-03 10:47:03 -08:00
rodata_test.c
secretmem.c secretmem: disable memfd_secret() if arch cannot set direct map 2024-10-09 12:47:19 -07:00
shmem_quota.c shmem_quota: build the object file conditionally to the config option 2024-09-01 20:25:45 -07:00
shmem.c mm: shmem: fix data-race in shmem_getattr() 2024-10-28 21:40:39 -07:00
show_mem.c mm/show_mem.c: report alloc tags in human readable units 2024-09-17 01:07:00 -07:00
shrinker_debug.c mm: shrinker: use min() to improve shrinker_debugfs_scan_write() 2024-09-03 21:15:40 -07:00
shrinker.c mm: shrinker: avoid memleak in alloc_shrinker_info 2024-10-31 20:27:04 -07:00
shuffle.c
shuffle.h mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
slab_common.c mm: krealloc: Fix MTE false alarm in __do_krealloc 2024-10-29 10:40:53 +01:00
slab.h mm, slab: suppress warnings in test_leak_destroy kunit test 2024-10-02 16:28:46 +02:00
slub.c mm, slab: suppress warnings in test_leak_destroy kunit test 2024-10-02 16:28:46 +02:00
sparse-vmemmap.c LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel space 2024-10-21 22:11:19 +08:00
sparse.c A set of X86 fixes: 2024-09-01 14:43:08 -07:00
swap_cgroup.c mm: attempt to batch free swap entries for zap_pte_range() 2024-09-03 21:15:33 -07:00
swap_slots.c
swap_state.c mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios 2024-09-17 01:07:01 -07:00
swap.c mm/thp: fix deferred split unqueue naming and locking 2024-11-05 16:49:54 -08:00
swap.h mm: fix swap_read_folio_zeromap() for large folios with partial zeromap 2024-09-17 01:07:01 -07:00
swapfile.c mm, swap: avoid over reclaim of full clusters 2024-10-30 20:14:11 -07:00
truncate.c mm: Fix missing folio invalidation calls during truncation 2024-08-24 16:09:16 +02:00
usercopy.c
userfaultfd.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
util.c mm: only enforce minimum stack gap size if it's sensible 2024-09-01 20:26:02 -07:00
vma_internal.h mm: remove duplicated include in vma_internal.h 2024-09-01 20:26:02 -07:00
vma.c mm/vma: add expand-only VMA merge mode and optimise do_brk_flags() 2024-10-28 21:40:40 -07:00
vma.h mm/vma: add expand-only VMA merge mode and optimise do_brk_flags() 2024-10-28 21:40:40 -07:00
vmalloc.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
vmpressure.c
vmscan.c mm/thp: fix deferred split unqueue naming and locking 2024-11-05 16:49:54 -08:00
vmstat.c mm: split underused THPs 2024-09-09 16:39:04 -07:00
workingset.c cachestat: do not flush stats in recency check 2024-07-03 22:40:37 -07:00
z3fold.c mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_pool 2024-09-01 20:25:56 -07:00
zbud.c
zpool.c
zsmalloc.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
zswap.c mm: zswap: delete comments for "value" member of 'struct zswap_entry'. 2024-10-09 12:47:19 -07:00